
Multi-CAST, the Multilingual Corpus of Annotated Spoken Texts, is a collection of annotated texts from a typologically diverse selection of languages.

Citation

Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/) (date accessed)

Northern Kurdish [nkurd_muserz03_0065]
îcar ro-k-î paşa wezîr-ê xwe di-şîn-e çarşu-yê dibê
now day-indf-obl pasha vizier-ezafe.pl refl ind-send.prs-3sg market-obl Ø ind.say.prs.3sg
GRAID: ## other other np.h:a np.h:p rn_refl.h:poss v:pred np:g ## 0.h:s v:pred
RefIND: 0036 0033 0036 0034 0036
RefLex: new bridging bridging
‘One day the king sends his advisors to the market. (He) says, ...’

Annotations

Alongside standard spoken corpus annotations, the GRAID (Grammatical Relations and Animacy in Discourse, Haig & Schnell 2014) and RefIND (Referent Indexing in Natural Language Discourse, Schiborr et al. 2018) annotation schemes enable cross-linguistic research in the area of discourse and grammar. GRAID provides a uniform set of tags with a simple combinatory syntax, and RefIND allows individual discourse referents to be identified and tracked throughout a text.

The GRAID manual and RefIND guidelines provide extensive discussion of the analytical considerations involved in the annotation.
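
The combinatory syntax of GRAID glosses lends itself to simple mechanical decomposition. The following Python sketch is an illustration only: it assumes that a gloss has the shape form.animacy:function, which fits glosses such as np.h:a or 0.h:s but does not cover the full scheme described in the GRAID manual. The names GraidGloss and parse_graid are hypothetical and are not part of any Multi-CAST tool.

```python
from typing import NamedTuple, Optional


class GraidGloss(NamedTuple):
    form: str                # e.g. "np", "0", "v", "rn_refl"
    animacy: Optional[str]   # e.g. "h" for human; None if unmarked
    function: Optional[str]  # e.g. "a", "p", "s", "g", "poss", "pred"


def parse_graid(gloss: str) -> GraidGloss:
    """Split one gloss into form, animacy, and function (simplified assumption)."""
    # Everything after the first ":" is treated as the function symbol.
    head, _, function = gloss.partition(":")
    # Everything after the first "." in the remainder is treated as animacy.
    form, _, animacy = head.partition(".")
    return GraidGloss(form, animacy or None, function or None)


# Glosses taken from the Northern Kurdish example on this page.
glosses = ["np.h:a", "np.h:p", "rn_refl.h:poss", "v:pred", "np:g", "0.h:s"]
for gloss in glosses:
    print(gloss, "->", parse_graid(gloss))
```

Glosses without a function or animacy symbol, such as other, come back with those fields set to None, so the result can be filtered on any of the three components.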

The corpora


Versioning

The Multi-CAST collection continues to develop as new material is added and the annotations of older texts are revised. Successive releases of the corpus data are assigned version numbers composed of the year and month they were published.

The files listed below represent the latest version of Multi-CAST; a directory of older versions can be accessed via the links on the right. A list of changes introduced with each release can be found in the Multi-CAST collection overview.

The current version of Multi-CAST is 1907, published in July 2019.
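
As a small illustration of the numbering scheme, the Python sketch below converts a version number into its publication month. It assumes a four-digit YYMM layout with years counted from 2000, which is inferred from version 1907 corresponding to July 2019; the function name describe_version is hypothetical.

```python
import calendar


def describe_version(version: str) -> str:
    """Turn a four-digit YYMM version number into 'Month Year'."""
    if len(version) != 4 or not version.isdigit():
        raise ValueError(f"expected four digits, got {version!r}")
    year = 2000 + int(version[:2])  # assumption: releases date from 2000 onwards
    month = int(version[2:])
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in version {version!r}")
    return f"{calendar.month_name[month]} {year}"


print(describe_version("1907"))
```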

Arta [arta]

Yukinori Kimoto

Arta (ISO 639-3: atz) is an endangered Austronesian language spoken by a group of hunter-gatherers living in Luzon, the Philippines. The number of fluent speakers is between nine and eleven, most of whom are over the age of forty. Since all speakers have settled down in the communities of neighboring Negrito groups (Casiguran/Nagtipunan Agta people), the language is not in active use and is no longer taught to children. All of the speakers are multilingual in Casiguran/Nagtipunan Agta and Ilokano.

The texts were collected by Yukinori Kimoto during fieldwork in the Quirino and Aurora provinces in Luzon between 2012 and 2018. See Kimoto (2017) for a description of the language.

Speakers of Arta in Luzon, the Philippines. Photo by Yukinori Kimoto.

Citation for this corpus

Kimoto, Yukinori. 2019. Multi-CAST Arta. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#arta) (date accessed)

Corpus documentation

Corpus files


Cypriot Greek [cypgreek]

Harris Hadjidas, Maria Vollmer

Cypriot Greek (ISO 639-3: ell) is the variety of Greek spoken in Cyprus. The three texts in this corpus, all of which are traditional narratives, were originally recorded in the 1960s, and later compiled and published by Konstantinos Giangoullis as part of a book of traditional Cypriot tales (Giangoullis 2009). Giangoullis has kindly given his permission for the three texts in this corpus to be made freely available as part of Multi-CAST.

While unfortunately no audio recordings are available for this corpus, the texts appear to have been only minimally edited and reflect reasonably faithfully the spoken language used in traditional narratives. The texts were initially transliterated into the Roman alphabet and translated into English by a native speaker, Harris Hadjidas, who also conducted the first round of syntactic annotation. A second round of annotation was completed by Maria Vollmer under the supervision of Geoffrey Haig.

Aphrodite's Rock, Paphos, Cyprus. Photo by Anna Anichkova, 2013, CC-BY-SA 3.0.

Citation for this corpus

Hadjidas, Harris & Vollmer, Maria. 2015. Multi-CAST Cypriot Greek. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#cypgreek) (date accessed)

Corpus documentation

Corpus files

    • full corpus: ZIP/EAF 0.2 MB, XML 1.6 MB, TSV 0.3 MB

English [english]

Nils Norman Schiborr

The Multi-CAST English (ISO 639-3: eng) corpus contains autobiographical narratives taken from the Freiburg English Dialect Corpus (FRED, English Dialects Research Group 2005), which was compiled under the supervision of Bernd Kortmann and Lieselotte Anderwald at the University of Freiburg from texts recorded during the 1970s and 1980s as part of various oral history projects.

The texts annotated for Multi-CAST were recorded with older working-class speakers from southern and southeastern England. They depict everyday scenes and personal experiences from the speakers' lives: recurring topics include agriculture, animal husbandry, and the two World Wars.

St James's Park, London. Photo by David Iliff, 2006, CC-BY-SA 3.0.

Citation for this corpus

Schiborr, Nils N. 2015. Multi-CAST English. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#english) (date accessed)

Corpus documentation

Corpus files

    • english_kent01: WAV 139 MB, MP3 25 MB, EAF 3.3 MB, XML 1.2 MB, TSV 0.2 MB (preview, IMDI)
    • english_kent02_a: WAV 151 MB, MP3 27 MB, EAF 4.0 MB, XML 1.4 MB, TSV 0.3 MB (preview, IMDI)
    • english_kent02_b: WAV 165 MB, MP3 30 MB, EAF 4.6 MB, XML 1.6 MB, TSV 0.3 MB (preview, IMDI)
    • full corpus: ZIP/WAV 389 MB, ZIP/MP3 82 MB, ZIP/EAF 0.5 MB, XML 4.2 MB, TSV 0.8 MB

Northern Kurdish [nkurd]

Geoffrey Haig, Maria Vollmer, Hanna Thiele

Northern Kurdish (ISO 639-3: kmr), also known as Kurmanjî, is a Northwest Iranian language spoken in eastern Turkey, Iraq, Syria, and parts of western Iran. The three texts recorded here are traditional narratives, from a female and a male speaker who grew up near the townships of Erzurum and Muš, respectively.

The texts were recorded in Germany in the late 1990s and early 2000s, and subsequently transcribed, translated, and annotated for Multi-CAST by Geoffrey Haig, Abdullah Incekan, Hanna Thiele, and Maria Vollmer. A description of the language can be found in Haig (2018).

A speaker of Kurmanjî. Photo by Geoffrey Haig.

Citation for this corpus

Haig, Geoffrey & Vollmer, Maria & Thiele, Hanna. 2019. Multi-CAST Northern Kurdish. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#nkurd) (date accessed)

Corpus documentation

Corpus files

    • nkurd_muserz01: WAV 50 MB, MP3 18 MB, EAF 2.9 MB, XML 1.0 MB, TSV 0.2 MB (preview, IMDI)
    • nkurd_muserz02: WAV 123 MB, MP3 11 MB, EAF 1.8 MB, XML 0.7 MB, TSV 0.1 MB (preview, IMDI)
    • nkurd_muserz03: WAV 50 MB, MP3 18 MB, EAF 3.0 MB, XML 1.1 MB, TSV 0.2 MB (preview, IMDI)
    • full corpus: ZIP/WAV 182 MB, ZIP/MP3 47 MB, ZIP/EAF 0.4 MB, XML 2.7 MB, TSV 0.5 MB

Persian [persian]

Shirin Adibifar

Persian (ISO 639-3: pes) is an Iranian language with official variants spoken in Iran, Afghanistan, and parts of Tajikistan; the variety spoken in Iran is also referred to as Farsi.

The texts in this corpus are narrative retellings of the Pear film (Chafe 1980), a short film of roughly five minutes about a boy stealing the fruit a man had been picking. The recordings were made by Shirin Adibifar in Tehran and at locations in the province of Mazandaran in 2015. Of the 29 speakers in this corpus, 17 are female and 12 male. The median age is 25, with a range of 20 to 39. All speakers have received at least some measure of university-level education.

Badab-e Surt, Mazandaran, Iran. Photo by M. Samaee, 2010, CC-BY 3.0.

Citation for this corpus

Adibifar, Shirin. 2016. Multi-CAST Persian. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#persian) (date accessed)

Corpus documentation

Corpus files

    • persian_g1-f-01: WAV 16 MB, MP3 1 MB, EAF 0.2 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g1-f-02: WAV 22 MB, MP3 2 MB, EAF 0.3 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g1-f-05: WAV 23 MB, MP3 2 MB, EAF 0.3 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g1-f-07: WAV 11 MB, MP3 1 MB, EAF 0.2 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g1-f-08: WAV 17 MB, MP3 2 MB, EAF 0.1 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g1-f-09: WAV 45 MB, MP3 4 MB, EAF 0.5 MB, XML 0.2 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g1-f-10: WAV 34 MB, MP3 3 MB, EAF 0.4 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g1-f-11: WAV 17 MB, MP3 2 MB, EAF 0.3 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g1-f-12: WAV 18 MB, MP3 2 MB, EAF 0.2 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g1-f-14: WAV 31 MB, MP3 3 MB, EAF 0.4 MB, XML 0.2 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g1-m-03: WAV 8 MB, MP3 1 MB, EAF 0.1 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g1-m-04: WAV 21 MB, MP3 2 MB, EAF 0.3 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g1-m-06: WAV 9 MB, MP3 1 MB, EAF 0.1 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g1-m-13: WAV 29 MB, MP3 3 MB, EAF 0.4 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g2-f-01: WAV 24 MB, MP3 2 MB, EAF 0.3 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g2-f-02: WAV 15 MB, MP3 1 MB, EAF 0.2 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g2-f-03: WAV 16 MB, MP3 2 MB, EAF 0.2 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g2-f-04: WAV 11 MB, MP3 1 MB, EAF 0.1 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g2-f-05: WAV 19 MB, MP3 2 MB, EAF 0.1 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g2-f-06: WAV 15 MB, MP3 1 MB, EAF 0.3 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g2-f-07: WAV 17 MB, MP3 2 MB, EAF 0.3 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g2-m-08: WAV 18 MB, MP3 2 MB, EAF 0.2 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g2-m-09: WAV 14 MB, MP3 1 MB, EAF 0.2 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g2-m-10: WAV 13 MB, MP3 1 MB, EAF 0.2 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g2-m-11: WAV 10 MB, MP3 1 MB, EAF 0.1 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g2-m-12: WAV 12 MB, MP3 1 MB, EAF 0.2 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g2-m-13: WAV 14 MB, MP3 1 MB, EAF 0.2 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g2-m-14: WAV 11 MB, MP3 1 MB, EAF 0.2 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • persian_g2-m-15: WAV 26 MB, MP3 2 MB, EAF 0.2 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • full corpus: ZIP/WAV 421 MB, ZIP/MP3 48 MB, ZIP/EAF 0.3 MB, XML 2.3 MB, TSV 0.5 MB

Sanzhi Dargwa [sanzhi]

Diana Forker, Nils Norman Schiborr

Sanzhi Dargwa (ISO 639-3: dar) is a Nakh-Daghestanian (Caucasian) language from the Dargwa subbranch. From 1968 onwards, over a relatively short time span, all Sanzhi speakers left their village of Sanzhi in the mountains of central Daghestan, Russia, to move to linguistically and ethnically heterogeneous settlements in the lowlands. Today Sanzhi has approximately 250 speakers and is severely endangered.

The eight texts in this corpus represent a small subset of the material that was recorded, transcribed, translated, and glossed by Diana Forker with the assistance of Gadzhimurad Gadzhimuradov, a native speaker, as part of a DOBES language documentation project (2012–2019), which has culminated in a grammar of Sanzhi Dargwa (Forker, Under revision).

The texts presented here are a mixture of autobiographical and traditional narratives. They were annotated for Multi-CAST by Nils Schiborr.

The ruins of Sanzhi village, Daghestan, Russia. Photo by Gadzhimurad Gadzhimuradov.

Citation for this corpus

Forker, Diana & Schiborr, Nils N. 2019. Multi-CAST Sanzhi Dargwa. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#sanzhi) (date accessed)

Corpus documentation

Corpus files

    • sanzhi_asabali: WAV 70 MB, MP3 6 MB, EAF 0.6 MB, XML 0.2 MB, TSV 0.1 MB (preview, IMDI)
    • sanzhi_bazhuk: WAV 47 MB, MP3 4 MB, EAF 0.4 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • sanzhi_dragon: WAV 61 MB, MP3 5 MB, EAF 0.5 MB, XML 0.2 MB, TSV 0.1 MB (preview, IMDI)
    • sanzhi_kurban: WAV 49 MB, MP3 4 MB, EAF 0.7 MB, XML 0.2 MB, TSV 0.1 MB (preview, IMDI)
    • sanzhi_mill: WAV 57 MB, MP3 5 MB, EAF 0.5 MB, XML 0.2 MB, TSV 0.1 MB (preview, IMDI)
    • sanzhi_patima: WAV 57 MB, MP3 5 MB, EAF 0.5 MB, XML 0.2 MB, TSV 0.1 MB (preview, IMDI)
    • sanzhi_ramazan: WAV 80 MB, MP3 7 MB, EAF 1.0 MB, XML 0.3 MB, TSV 0.1 MB (preview, IMDI)
    • sanzhi_tape: WAV 20 MB, MP3 2 MB, EAF 0.3 MB, XML 0.1 MB, TSV 0.1 MB (preview, IMDI)
    • full corpus: ZIP/WAV 381 MB, ZIP/MP3 37 MB, ZIP/EAF 0.2 MB, XML 1.5 MB, TSV 0.3 MB

Teop [teop]

Ulrike Mosel, Stefan Schnell

Teop (ISO 639-3: tio) is a Western Oceanic language spoken on Bougainville Island, Papua New Guinea. The texts, all traditional narratives, were recorded by Ulrike Mosel and Enoch Horai Magum over the course of a language documentation project (principal investigator: Ulrike Mosel) funded by the Volkswagen Foundation (grant no. II 77 973).

Details on the project can be found online at the DOBES webpage. A sketch grammar of Teop (Mosel & Thiesen 2007) and additional materials are also available there. The texts were annotated for Multi-CAST by Ulrike Mosel and Stefan Schnell. Referent indexing with RefIND was added in 2019 by Ulrike Mosel, Stefan Schnell, and Maria Vollmer.

Teop Island, Bougainville, Papua New Guinea. Photo by Ulrike Mosel.

Citation for this corpus

Mosel, Ulrike & Schnell, Stefan. 2015. Multi-CAST Teop. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#teop) (date accessed)

Corpus documentation

Corpus files

    • teop_iar: WAV 148 MB, MP3 13 MB, EAF 1.9 MB, XML 0.7 MB, TSV 0.1 MB (preview, IMDI)
    • teop_mat: WAV 70 MB, MP3 6 MB, EAF 1.0 MB, XML 0.4 MB, TSV 0.1 MB (preview, IMDI)
    • teop_sii: WAV 196 MB, MP3 18 MB, EAF 3.1 MB, XML 1.1 MB, TSV 0.2 MB (preview, IMDI)
    • teop_viv: WAV 58 MB, MP3 5 MB, EAF 1.0 MB, XML 0.3 MB, TSV 0.1 MB (preview, IMDI)
    • full corpus: ZIP/WAV 352 MB, ZIP/MP3 43 MB, ZIP/EAF 0.3 MB, XML 2.4 MB, TSV 0.4 MB

Tondano [tondano]

Timothy Brickell

The Toulour dialect of Tondano (ISO 639-3: tdn) is an Austronesian (Malayo-Polynesian, Philippine, Minahasa, North, Northeast) language spoken in and to the east of the town of Tondano, which is located in the Minahasa regency of North Sulawesi, Indonesia. All Minahasan languages are endangered, and their speakers have been shifting to the most commonly used language of wider communication, Manado Malay (ISO 639-3: xmm), since the early 20th century (Wolff 2010: 299). Based on the researcher's personal experience, the number of fluent speakers of Tondano is estimated at around 30 000.

This corpus is the result of fieldwork undertaken by Timothy Brickell as part of his PhD candidature at La Trobe University, Melbourne, Australia between 2011 and 2015 (see Brickell 2015). The speakers recorded were of both genders, of various ages, and from a number of professions, with many older speakers already retired. The texts in Multi-CAST constitute a subset of the 20 recordings made by Brickell. In some instances speakers discuss a topic chosen just prior to recording, in others they talk while engaging in traditional activities, and in still others they narrate an elicitation video which depicts other community members carrying out traditional cultural activities.

Pemandangan, Minahasa Regency, Indonesia. Photo by Timothy Brickell, 2013.

Citation for this corpus

Brickell, Timothy. 2016. Multi-CAST Tondano. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#tondano) (date accessed)

Corpus documentation

Corpus files

    • tondano_gulamera: WAV 104 MB, MP3 9.4 MB, EAF 0.7 MB, XML 0.3 MB, TSV 0.1 MB (preview, IMDI)
    • tondano_holiday: WAV 53 MB, MP3 5 MB, EAF 0.5 MB, XML 0.2 MB, TSV 0.1 MB (preview, IMDI)
    • tondano_kiniar01: WAV 88 MB, MP3 8 MB, EAF 0.7 MB, XML 0.3 MB, TSV 0.1 MB (preview, IMDI)
    • tondano_kiniar02: WAV 127 MB, MP3 12 MB, EAF 1.0 MB, XML 0.4 MB, TSV 0.1 MB (preview, IMDI)
    • tondano_kiniar03: WAV 89 MB, MP3 8 MB, EAF 0.6 MB, XML 0.2 MB, TSV 0.1 MB (preview, IMDI)
    • tondano_mapalus: WAV 69 MB, MP3 6 MB, EAF 0.7 MB, XML 0.3 MB, TSV 0.1 MB (preview, IMDI)
    • tondano_water: WAV 51 MB, MP3 5 MB, EAF 0.5 MB, XML 0.2 MB, TSV 0.1 MB (preview, IMDI)
    • tondano_watulaney: WAV 185 MB, MP3 17 MB, EAF 1.2 MB, XML 0.5 MB, TSV 0.1 MB (preview, IMDI)
    • full corpus: ZIP/WAV 558 MB, ZIP/MP3 69 MB, ZIP/EAF 0.3 MB, XML 2.2 MB, TSV 0.4 MB

Tulil [tulil]

Chenxi Meng

Tulil (ISO 639-3: tuh), also known as Taulil, is a Papuan language spoken in the East New Britain Province of Papua New Guinea. As of 2000, Tulil was spoken by approximately 2 000 people spread out over four villages (Tulil 1, Tulil 2, Kadaulung, and Toma), according to data from Ethnologue (Eberhard et al. 2019).

The six texts in this corpus comprise a subset of a larger collection of material that was recorded and transcribed during two field trips undertaken by Chenxi Meng in 2012 and 2015 for her PhD project, which has resulted in a comprehensive grammar of Tulil (Meng 2018). The entirety of the data has been deposited in PARADISEC.

The texts selected for Multi-CAST include both traditional and personal narratives. Annotations with RefIND were added by Maria Vollmer.

A plume of volcanic ash over New Britain, Papua New Guinea. Photo by Chenxi Meng, 2014.

Citation for this corpus

Meng, Chenxi. 2019. Multi-CAST Tulil. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#tulil) (date accessed)

Corpus documentation

Corpus files

    • tulil_all1: WAV 54 MB, MP3 5 MB, EAF 0.6 MB, XML 0.2 MB, TSV 0.1 MB (preview, IMDI)
    • tulil_alrm: WAV 233 MB, MP3 21 MB, EAF 2.7 MB, XML 1.0 MB, TSV 0.2 MB (preview, IMDI)
    • tulil_jkpp: WAV 257 MB, MP3 23 MB, EAF 2.1 MB, XML 0.7 MB, TSV 0.1 MB (preview, IMDI)
    • tulil_lnsl: WAV 65 MB, MP3 6 MB, EAF 0.6 MB, XML 0.2 MB, TSV 0.1 MB (preview, IMDI)
    • tulil_lrdw: WAV 85 MB, MP3 8 MB, EAF 1.0 MB, XML 0.4 MB, TSV 0.1 MB (preview, IMDI)
    • tulil_sves: WAV 52 MB, MP3 5 MB, EAF 0.7 MB, XML 0.2 MB, TSV 0.1 MB (preview, IMDI)
    • full corpus: ZIP/WAV 571 MB, ZIP/MP3 67 MB, ZIP/EAF 0.3 MB, XML 2.6 MB, TSV 0.5 MB

Vera'a [veraa]

Stefan Schnell

Vera'a (ISO 639-3: vra) is an Oceanic (Austronesian) language from the village of the same name on Vanua Lava (13.80°S 167.47°E), one of the Banks Islands in North Vanuatu. The language has approximately 450 speakers and is the first language of most inhabitants of Vera'a and the coastline to the north of it. Vera'a is closely related to the neighbouring language Vurës, and speakers of Vera'a also speak Vurës.

Both languages have been extensively documented within a VolkswagenStiftung-funded DOBES documentation project (2006–2012; PI: Dr Catriona Hyslop-Malau). Vera'a was the focus of Stefan Schnell's PhD project at Kiel University (2007–2010, see Schnell 2011), and Schnell subsequently undertook additional documentary work on Vera'a in 2012–2015 as part of his ARC-funded DECRA project Typology of Language Use (ARC grant no. DE120102017), hosted by La Trobe University (Melbourne, Australia).

The Multi-CAST Vera'a corpus consists of 10 folkloristic narrative texts collected and annotated by Stefan Schnell. They constitute a subcorpus of a larger corpus of Vera'a compiled and curated by Stefan Schnell in close collaboration with speakers of the language and researchers of other disciplines from outside the community. Annotations with RefIND were added to the corpus in 2019 by Stefan Schnell and Maria Vollmer.

At work in Vera'a village, Vanua Lava, Vanuatu. Photo by Stefan Schnell.

Citation for this corpus

Schnell, Stefan. 2015. Multi-CAST Vera'a. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#veraa) (date accessed)

Corpus documentation

Corpus files

    • veraa_anv: WAV 67 MB, MP3 6 MB, EAF 1.0 MB, XML 0.4 MB, TSV 0.1 MB (preview, IMDI)
    • veraa_as1: WAV 58 MB, MP3 5 MB, EAF 1.1 MB, XML 0.4 MB, TSV 0.1 MB (preview, IMDI)
    • veraa_gabg: WAV 96 MB, MP3 8 MB, EAF 1.0 MB, XML 0.3 MB, TSV 0.1 MB (preview, IMDI)
    • veraa_gaqg: WAV 98 MB, MP3 8 MB, EAF 1.2 MB, XML 0.4 MB, TSV 0.1 MB (preview, IMDI)
    • veraa_hhak: WAV 139 MB, MP3 12 MB, EAF 2.1 MB, XML 0.7 MB, TSV 0.1 MB (preview, IMDI)
    • veraa_isam: WAV 81 MB, MP3 7 MB, EAF 1.3 MB, XML 0.4 MB, TSV 0.1 MB (preview, IMDI)
    • veraa_iswm: WAV 239 MB, MP3 20 MB, EAF 3.5 MB, XML 1.2 MB, TSV 0.2 MB (preview, IMDI)
    • veraa_jjq: WAV 333 MB, MP3 28 MB, EAF 4.6 MB, XML 1.6 MB, TSV 0.3 MB (preview, IMDI)
    • veraa_mvbw: WAV 111 MB, MP3 9 MB, EAF 1.6 MB, XML 0.6 MB, TSV 0.1 MB (preview, IMDI)
    • veraa_pala_a: WAV 44 MB, MP3 4 MB, EAF 0.7 MB, XML 0.3 MB, TSV 0.1 MB (preview, IMDI)
    • veraa_pala_b: WAV 73 MB, MP3 6 MB, EAF 1.3 MB, XML 0.5 MB, TSV 0.1 MB (preview, IMDI)
    • full corpus: ZIP/WAV 1.0 GB, ZIP/MP3 111 MB, ZIP/EAF 0.8 MB, XML 6.6 MB, TSV 1.1 MB

Research

Background

Multi-CAST has been designed to facilitate empirical research into the structure of spontaneous spoken language from a cross-linguistic perspective. The overriding questions are the following:

  • Are there cross-linguistically recurrent patterns in the way discourse is organized (i.e. text-based, as opposed to grammar-based, typology)?
  • How do these statistical patterns in usage relate to the architecture of grammars?
  • How do they relate to change in grammars over time?

Our research agenda has been heavily inspired by work in the functionalist tradition, initiated by scholars such as Wallace Chafe, Talmy Givón, and Barbara Fox.

We have drawn on Multi-CAST data to follow up on many of the issues raised by the pioneers of usage-based grammar, for example the relationship between topicality and subjecthood, the notion of an ergative bias to discourse organization, the role of animacy in morphosyntax, and the mechanisms involved in the emergence of agreement morphology.

Quantitative Analysis

The small symbol inventory of the GRAID annotation scheme aims to capture cross-linguistically comparable categories. Combined with the morpheme-by-morpheme glosses and referent indexing with RefIND, it allows for highly complex queries across corpora. See the Multi-CAST research context for illustrative examples.

One straightforward way of working with the Multi-CAST data is via the EAF files and the linguistic annotation software ELAN, which is freely available online. ELAN allows for conditional searches with regular expressions across sets of multiple EAF files. Please refer to the ELAN user guide and manual for instructions.
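
Because EAF files are XML documents, the same kind of regular-expression search can also be scripted outside ELAN. The Python sketch below is a minimal illustration using only the standard library. The element and attribute names (TIER, TIER_ID, ALIGNABLE_ANNOTATION, REF_ANNOTATION, ANNOTATION_VALUE) follow the EAF format, but the tier name graid, the function search_eaf, and the sample document are assumptions made for this example and need to be checked against the actual corpus files.

```python
import re
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple


def search_eaf(root: ET.Element, tier_id: str, pattern: str) -> Iterator[Tuple[str, str]]:
    """Yield (annotation id, value) for annotations on one tier matching a regex."""
    regex = re.compile(pattern)
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        # Annotations are either time-aligned or refer to a parent annotation.
        for ann in tier.iter():
            if ann.tag not in ("ALIGNABLE_ANNOTATION", "REF_ANNOTATION"):
                continue
            value = ann.findtext("ANNOTATION_VALUE") or ""
            if regex.search(value):
                yield ann.get("ANNOTATION_ID", ""), value


# A tiny stand-in document; a real file would be loaded with ET.parse(path).getroot().
SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="graid">
    <ANNOTATION><REF_ANNOTATION ANNOTATION_ID="a1" ANNOTATION_REF="w1">
      <ANNOTATION_VALUE>np.h:a</ANNOTATION_VALUE></REF_ANNOTATION></ANNOTATION>
    <ANNOTATION><REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="w2">
      <ANNOTATION_VALUE>v:pred</ANNOTATION_VALUE></REF_ANNOTATION></ANNOTATION>
    <ANNOTATION><REF_ANNOTATION ANNOTATION_ID="a3" ANNOTATION_REF="w3">
      <ANNOTATION_VALUE>0.h:s</ANNOTATION_VALUE></REF_ANNOTATION></ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

root = ET.fromstring(SAMPLE)
# Human referents glossed with function a or s.
hits = list(search_eaf(root, "graid", r"\.h:[as]$"))
print(hits)
```

To search several files, the same function can be called in a loop over the parsed roots of each EAF file.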

A more programmatic alternative is offered by the statistical computing language R and the custom-built multicastR package (Schiborr 2018), which offers a convenient way of accessing the annotation values and metadata directly in R. The multicastR package is freely available from the Comprehensive R Archive Network (CRAN). The source files for a manual installation can also be found here.
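
For readers working outside R, the TSV files can in principle be processed with any language that reads tab-separated text. The following Python sketch groups mentions by their RefIND index so that a referent can be followed through a text. It is an illustration under stated assumptions: the column names text, gword, graid, and refind, the function mentions_by_referent, and the sample rows are all hypothetical toy data, not taken from the corpus files or from multicastR.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple

# Toy stand-in for a TSV file; column names and rows are hypothetical.
SAMPLE_TSV = (
    "text\tgword\tgraid\trefind\n"
    "demo_text\tking\tnp.h:a\t0001\n"
    "demo_text\tsends\tv:pred\t\n"
    "demo_text\tadvisors\tnp.h:p\t0002\n"
    "demo_text\tmarket\tnp:g\t0003\n"
    "demo_text\tØ\t0.h:s\t0001\n"
    "demo_text\tsays\tv:pred\t\n"
)


def mentions_by_referent(tsv_text: str) -> Dict[str, List[Tuple[str, str]]]:
    """Group (word, GRAID gloss) pairs by referent index, in text order."""
    chains: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["refind"]:  # skip tokens that carry no referent index
            chains[row["refind"]].append((row["gword"], row["graid"]))
    return dict(chains)


chains = mentions_by_referent(SAMPLE_TSV)
for index, mentions in chains.items():
    print(index, mentions)
```

In the toy data, referent 0001 is mentioned twice, first as a lexical noun phrase and then as a zero, which is the kind of pattern that referent tracking makes countable.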

Publications

Collected below are publications and presentations that make use of data from Multi-CAST. If you have employed Multi-CAST in your research and would like to see your work included in this list, please contact Geoffrey Haig and/or Stefan Schnell.

Published papers

Haig, Geoffrey & Schnell, Stefan. 2016. The discourse basis of ergativity revisited. Language 92(3). 591–618. (DOI: 10.1353/lan.2016.0049)

Haig, Geoffrey & Schnell, Stefan. 2016. The discourse basis of ergativity revisited: Online appendices. Language 92(3). 1–14. (DOI: 10.1353/lan.2016.0044)

Haig, Geoffrey & Adibifar, Shirin. To appear. Referential Null Subjects (RNS) in colloquial spoken Persian: Does speaker familiarity have an impact? In Korangy, Alireza & Mahmoodi-Bakhtiari, Behrooz (eds.), Essays on the typology of Iranian languages. Berlin: Mouton de Gruyter.

Kimoto, Yukinori. 2018. Operationalizing Philippine-type syntax for the GRAID system: Clause structure, case marking, and verb class in Arta. Asian and African Languages and Linguistics 12. 17–35. (hdl.handle.net/10108/91147)

Schnell, Stefan & Barth, Danielle. 2018. Discourse motivations for pronominal and zero objects across genres in Vera'a. Language Variation and Change 30(1). 51–81. (DOI: 10.1017/S0954394518000054)

Schnell, Stefan & Schiborr, Nils N. 2018. Corpus-based typological research in discourse and grammar: GRAID and Multi-CAST. Asian and African Languages and Linguistics 12. 1–16. (hdl.handle.net/10108/91145)

Presentations

Schiborr, Nils N. 2018. Data-driven models of referential choice: Antecedent distance and beyond. Paper presented at the Workshop Information Structure in Spoken Language Corpora 3: Discourse and Information Structure (ISSLaC3), Münster, Germany, 7–8 December 2018.

Schnell, Stefan & Schiborr, Nils N. & Haig, Geoffrey. 2018. Is intransitive subject the preferred role for introducing new referents? Evidence from corpus-based typology. Paper presented at the 51st Annual Meeting of the Societas Linguistica Europaea (SLE2018), Tallinn, Estonia, 29 August–1 September 2018.

Haig, Geoffrey & Schnell, Stefan & Schiborr, Nils N. 2017. The limits of accessibility: A corpus-based typological approach. Paper presented at the 12th Conference of the Association for Linguistic Typology (ALT2017), Canberra, Australia, 11–15 December 2017.

Haig, Geoffrey & Schiborr, Nils N. 2016. Multi-CAST (Multilingual Corpus of Annotated Spoken Texts): Ein Projekt zur Erstellung und Auswertung mehrsprachiger Korpora für die Sprachtypologie. Paper presented at the CLARIN Forum CA3, Hamburg, Germany, 7–8 June 2016.

Guidelines for contributors

The shared utility of Multi-CAST grows with increasing typological representativity of the language sample it contains. We therefore encourage scholars to contribute additional data sets to Multi-CAST, which can be incorporated into the collection as stand-alone resources, citable with their names as the authors and annotators.

If you wish to contribute data, here are some points to consider:

  • Open access corpus data. Your data should be free of copyright and other restrictions on availability or usage. Multi-CAST is committed to open science, and hence makes all of its data freely available under a Creative Commons licence (CC BY 4.0 International). All data sets are citable online resources, with your name(s) as author(s).
  • Unscripted narratives. Ideally not stimulus-based.
  • Monologues. Texts should be (predominantly) monologic. Coping with multi-person discourse raises additional issues of annotation and analysis, which we have chosen not to tackle in this collection.
  • Media-linked time-aligned annotations. Transcribed texts are ideally morphologically glossed, translated into English, and accompanied by a sound file in uncompressed WAV format. Annotations are time-aligned with the audio recordings.
  • Minimum size of 1 000 clauses. All corpora in Multi-CAST minimally contain 1 000 clause units.

If you have a data set that complies with these conditions and you are interested in contributing it to Multi-CAST, please contact Geoffrey Haig and/or Stefan Schnell in order to coordinate the next steps.

In technical terms, this involves transferring your data into the EAF file format of the annotation software ELAN, for which purpose we will provide you with a Multi-CAST ELAN template, and annotating your texts with GRAID. The latter involves some quite tricky analytical decisions, and we strongly recommend that potential contributors liaise with us before undertaking this task. The actual labour input required will vary from language to language, but we will certainly assist you and be able to give you a realistic assessment of what may be necessary.

People

The Multi-CAST project is being coordinated by Geoffrey Haig, Stefan Schnell, Nils Schiborr, and Maria Vollmer, all at the Department of General Linguistics at the University of Bamberg.

In addition, the following researchers were involved in the collection, translation, and annotation of the various Multi-CAST corpora, or have contributed to the project in other ways:

  • Shirin Adibifar
  • Timothy Brickell
  • Diana Forker
  • Gadzhimurad Gadzhimuradov
  • Harris Hadjidas
  • Abdullah Incekan
  • Yukinori Kimoto
  • Adrian Kuqi
  • Jenny Herzky
  • Enoch Horai Magum
  • Chenxi Meng
  • Ulrike Mosel
  • Rasul Mutalov
  • Nicholas Peterson
  • Nick Thieberger
  • Hanna Thiele
  • Makson Vores

References

Ariel, Mira. 1988. Referring and accessibility. Journal of Linguistics 24(1). 67–87.

Ariel, Mira. 1990. Accessing noun-phrase antecedents. London: Routledge.

Ariel, Mira. 2004. Accessibility marking: Discourse functions, discourse profiles, and processing cues. Discourse Processes 37(2). 91–116.

Bickel, Balthasar. 2003. Referential density in discourse and syntactic typology. Language 79(4). 708–736.

Brickell, Timothy. 2015. A grammar of Tondano. Ph.D. dissertation, La Trobe University, Melbourne, Australia.

Chafe, Wallace. 1980. The deployment of consciousness in the production of a narrative. In Chafe, Wallace (ed.), The Pear Stories: Cognitive, cultural, and linguistic aspects of narrative production, 9–50. Norwood, NJ: Ablex.

Du Bois, John. 1987. The discourse basis of ergativity. Language 63(4). 805–855.

Du Bois, John. 2003. Argument structure: Grammar in use. In Du Bois, John & Kumpf, Lorraine & Ashby, William J. (eds.), Preferred argument structure: Grammar as architecture for function, 11–60. Amsterdam: John Benjamins.

Du Bois, John. 2017. Ergativity in discourse and grammar. In Coon, Jessica & Massam, Diane & Travis, Lisa D. (eds.), The Oxford handbook of ergativity, 23–57. Oxford: Oxford University Press.

English Dialects Research Group. 2005. Freiburg English Dialect Corpus (FRED). (fred.ub.uni-freiburg.de/)

Forker, Diana. Under revision. A grammar of Sanzhi Dargwa. Berlin: Language Science Press.

Giangoullis, Konstantinos G. 2009. Kypriaka paradosiaka paramytha: Ek stomatos Elenis Mich, Satsia, Apo to Geri-Pyroi (1887–1982) [A traditional Cypriot storyteller: From the mouth of Elenis Mich, Satsia, from Geri-Pyroi (1887–1982)]. Leukosia: Theopress Publications.

Haig, Geoffrey. 2018. Northern Kurdish (Kurmanjî). In Haig, Geoffrey & Khan, Geoffrey (eds.), The languages and linguistics of Western Asia: An areal perspective, 106–158. Berlin: Mouton de Gruyter.

Haig, Geoffrey & Schnell, Stefan. 2014. Annotations using GRAID (Grammatical Relations and Animacy in Discourse): Introduction and guidelines for annotators. Version 7.0. (multicast.aspra.uni-bamberg.de/)

Haig, Geoffrey & Schnell, Stefan. 2015. Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/)

Kimoto, Yukinori. 2017. A grammar of Arta: A Philippine Negrito language. Ph.D. dissertation, Kyoto University, Kyoto, Japan.

Meng, Chenxi. 2018. A grammar of Tulil. Ph.D. dissertation, La Trobe University, Melbourne, Australia.

Mosel, Ulrike & Thiesen, Yvonne. 2007. The Teop sketch grammar. Unpublished manuscript, University of Kiel. (hdl.handle.net/1839/00-0000-0000-0008-24F6-3)

Noonan, Michael. 2003. A crosslinguistic investigation of referential density. Unpublished manuscript, University of Wisconsin-Milwaukee. (crossasia-repository.ub.uni-heidelberg.de/190/)

Riester, Arndt & Baumann, Stefan. 2017. The RefLex scheme — Annotation guidelines. SinSpeC: Working papers of the SFB 732 14. (DOI: 10.18419/opus-9011)

Schiborr, Nils N. 2018. multicastR: A companion to the Multi-CAST collection. R package version 1.1.0. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (cran.r-project.org/package=multicastR)

Schiborr, Nils N. & Schnell, Stefan & Thiele, Hanna. 2018. RefIND — Referent Indexing in Natural-language Discourse: Annotation guidelines. Version 1.1. (multicast.aspra.uni-bamberg.de/)

Schnell, Stefan. 2011. A grammar of Vera'a. Ph.D. dissertation, Kiel University, Germany.

Eberhard, David M. & Simons, Gary F. & Fennig, Charles D. (eds.). 2019. Ethnologue: Languages of the World. Dallas, TX: SIL International.

Wolff, John. Proto-Austronesian phonology. Ithaca, NY: Cornell Southeast Asia Program Publications.

Acknowledgements

The collection and annotation of the data in Multi-CAST have graciously received support from the following institutions and organizations:

  • 2017–2020
    the German Research Foundation (DFG) via the project Does morphosyntactic alignment shape discourse? — principal investigators: Geoffrey Haig and Stefan Schnell (DFG project no. 323627599);
  • 2018–2020
    the Australian Research Council (ARC) and the Centre of Excellence for the Dynamics of Language (CoEDL) as part of CoEDL's corpus development project, headed by Nick Thieberger at The University of Melbourne, for annotation work in collaboration with the aforementioned DFG project;
  • 2012–2019
    the VolkswagenStiftung as part of the Documentation of endangered languages (DOBES) project for the documentation of Shiri and Sanzhi — PI: Diana Forker;
  • 2012–2015
    the Australian Research Council (ARC) as part of the DECRA project Typology of language use, hosted by La Trobe University, Melbourne — PI: Stefan Schnell (ARC grant no. DE120102017);
  • 2006–2012
    the VolkswagenStiftung as part of the DOBES project for the documentation of Vera'a and Vurës — Stefan Schnell (PI: Catriona Malau, grants no. II/81 898 and II/84 316);
  • 2000–2007
    the VolkswagenStiftung as part of the DOBES project for the documentation of Teop — PI: Ulrike Mosel (grant no. II/77 973).

The Department of General Linguistics at the University of Bamberg contributed departmental funding and research infrastructure to the Multi-CAST project, and the ARC Centre of Excellence for the Dynamics of Language provided additional support.

The following texts in the collection are made available in cooperation with these researchers and institutions:

The editors of and contributors to Multi-CAST would also like to thank our respective research communities for their support and stimulating criticism.

Licensing

In the spirit of open science, the entirety of the Multi-CAST collection, including the recordings, transcriptions, annotations, and all supplementary materials, is published under the Creative Commons Attribution 4.0 International Licence (CC BY 4.0).

The text of the licence can be found online here.

The CC BY licence allows full access to Multi-CAST for any purpose related to research, art, journalism, or any other endeavour, on the condition that proper credit is given to the editors of the collection and its contributors. The credit must include a link to this website (multicast.aspra.uni-bamberg.de) and a brief note about the licensing terms.

Contact

For inquiries regarding Multi-CAST, please contact Geoffrey Haig or Stefan Schnell. Please direct questions concerning the multicastR package and this website to Nils Schiborr.

The Multi-CAST collection as well as this page are hosted on the servers of the computing centre of the University of Bamberg. Relevant legal information can be found here.