Multi-CAST, the Multilingual Corpus of Annotated Spoken Texts, is a collection of annotated texts from a typologically diverse selection of languages.
Example from Northern Kurdish [nkurd_muserz03_0065]: ‘One day the king sends his advisors to the market. (He) says, ...’
Alongside standard spoken corpus annotations, the GRAID (Grammatical Relations and Animacy in Discourse, Haig & Schnell 2014) and RefIND (Referent Indexing in Natural Language Discourse, Schiborr et al. 2018) annotation schemes enable cross-linguistic research in the area of discourse and grammar. GRAID provides a uniform set of tags with a simple combinatory syntax, and RefIND allows individual discourse referents to be identified and tracked throughout a text.
The Multi-CAST collection continues to develop as new material is added and the annotations of older texts are revised. Successive releases of the corpus data are assigned version numbers composed of the year and month they were published.
The files listed below represent the latest version of Multi-CAST; a directory of older versions can be accessed via the links on the right. A list of changes introduced with each release can be found in the Multi-CAST collection overview.
Arta (arta1239) is an endangered Austronesian language spoken by a group of hunter-gatherers living in Luzon, the Philippines. The number of fluent speakers is between nine and eleven, most of whom are over the age of forty. Since all speakers have settled down in the communities of neighboring Negrito groups (Casiguran/Nagtipunan Agta people), the language is not in active use and is no longer taught to children. All of the speakers are multilingual in Casiguran/Nagtipunan Agta and Ilokano.
Kimoto, Yukinori. 2019. Multi-CAST Arta. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#arta) (date accessed)
Cypriot Greek (cypr1239) is the variety of Greek spoken in Cyprus. The three texts in this corpus, all of which are traditional narratives, were originally recorded in the 1960s, and later compiled and published by Konstantinos Giangoullis as part of a book of traditional Cypriot tales (Giangoullis 2009). Giangoullis has kindly given his permission for the three texts to be made freely available as part of Multi-CAST.
While unfortunately no audio recordings are available for this corpus, the texts appear to have been only minimally edited and reflect reasonably faithfully the spoken language used in traditional narratives. The texts were initially transliterated into the Roman alphabet and translated into English by a native speaker, Harris Hadjidas, who also conducted the first round of syntactic annotation. A second round of annotation was completed by Maria Vollmer under the supervision of Geoffrey Haig.
Hadjidas, Harris & Vollmer, Maria. 2015. Multi-CAST Cypriot Greek. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#cypgreek) (date accessed)
The Multi-CAST English (sout3282) corpus contains autobiographical narratives taken from the Freiburg English Dialect Corpus (FRED, English Dialects Research Group 2005), which was compiled under the supervision of Bernd Kortmann and Lieselotte Anderwald at the University of Freiburg from texts recorded during the 1970s and 1980s as part of various oral history projects.
The texts annotated for Multi-CAST were recorded with older working-class speakers from southern and southeastern England. They depict everyday scenes and personal experiences from the speakers' lives: recurring topics include agriculture, animal husbandry, shipwrighting, work in the London docks, and the two World Wars.
The audio recordings (WAV, MP3) in this corpus are in the public domain.
Schiborr, Nils N. 2015. Multi-CAST English. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#english) (date accessed)
The Nafsan (sout2856) language, also known as South Efate, is a Southern Oceanic language spoken on the island of Efate in central Vanuatu. As of 2005, there were approximately 6 000 speakers of Nafsan living in coastal villages from Pango to Eton. A description of the language can be found in Thieberger (2006).
The Multi-CAST Nafsan corpus constitutes a subset of the material collected by Nick Thieberger for his PhD research over three periods of fieldwork in the villages of Eratap and Erakor in South Efate between 1995 and 2000, and during subsequent trips. The entirety of the data has been archived in PARADISEC, and can also be accessed via ANNIS. See further Thieberger (2004).
The texts were glossed with GRAID by Nick Thieberger and Timothy Brickell, and subsequently annotated with RefIND by Adrian Kuqi under the supervision of Stefan Schnell.
Thieberger, Nick & Brickell, Timothy. 2019. Multi-CAST Nafsan. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#nafsan) (date accessed)
Northern Kurdish (nort2641), also known as Kurmanjî, is a Northwest Iranian language spoken in eastern Turkey, Iraq, Syria, and parts of western Iran. The three texts recorded here are traditional narratives, from a female and a male speaker who grew up near the townships of Erzurum and Muš, respectively.
The texts were recorded in Germany in the late 1990s and early 2000s, and subsequently transcribed, translated, and annotated for Multi-CAST by Geoffrey Haig, Abdullah Incekan, Hanna Thiele, and Maria Vollmer. A description of the language can be found in Haig (2018).
Haig, Geoffrey & Vollmer, Maria & Thiele, Hanna. 2019. Multi-CAST Northern Kurdish. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#nkurd) (date accessed)
Persian (tehr1242) is an Iranian language with official variants spoken in Iran, Afghanistan, and parts of Tajikistan; the variety spoken in Iran is also referred to as Farsi.
The texts in this corpus are narrative retellings of the Pear film (Chafe 1980), a roughly five-minute short film about a boy stealing the fruit a man had been picking. The recordings were made by Shirin Adibifar in Tehran and at locations in the province of Mazandaran in 2015. Of the 29 speakers in this corpus, 17 are female and 12 male. The median age is 25, with a range of 20 to 39. All speakers have received at least some university-level education.
Adibifar, Shirin. 2016. Multi-CAST Persian. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#persian) (date accessed)
Sanzhi Dargwa (sanz1248) is a Nakh-Daghestanian (Caucasian) language from the Dargwa subbranch. From 1968 onwards, over a relatively short time span, all Sanzhi speakers left their village of Sanzhi in the mountains of central Daghestan, Russia, to move to linguistically and ethnically heterogeneous settlements in the lowlands. Today Sanzhi is spoken by approximately 250 speakers and is severely endangered.
The eight texts in this corpus represent a small subset of the material that was recorded, transcribed, translated, and glossed by Diana Forker with the assistance of Gadzhimurad Gadzhimuradov, a native speaker, as part of a DOBES language documentation project (2012–2019), which has culminated in a grammar of Sanzhi Dargwa (Forker, Under revision).
The texts presented here are a mixture of autobiographical and traditional narratives. They were annotated for Multi-CAST by Nils Schiborr.
Forker, Diana & Schiborr, Nils N. 2019. Multi-CAST Sanzhi Dargwa. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#sanzhi) (date accessed)
Teop (teop1238) is a Western Oceanic language spoken on Bougainville Island, Papua New Guinea. The texts, all traditional narratives, were recorded by Ulrike Mosel and Enoch Horai Magum over the course of a language documentation project (principal investigator: Ulrike Mosel) funded by the Volkswagen Foundation (grant no. II 77 973). Details on the project can be found online at the DOBES webpage.
A sketch grammar of Teop (Mosel & Thiesen 2007) and additional materials are also available there, and an online dictionary (A multifunctional Teop-English dictionary, Mosel 2019) can be found here. The texts were annotated for Multi-CAST by Ulrike Mosel and Stefan Schnell; referent indexing with RefIND was added in 2019 by Ulrike Mosel, Stefan Schnell, and Maria Vollmer.
Mosel, Ulrike & Schnell, Stefan. 2015. Multi-CAST Teop. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#teop) (date accessed)
The Toulour dialect of Tondano (tond1251) is an Austronesian (Malayo-Polynesian, Philippine, Minahasa, North, Northeast) language spoken in and to the east of the town of Tondano, which is located in the Minahasa regency of North Sulawesi, Indonesia. All Minahasan languages are endangered, and their speakers have been shifting to the most commonly used language of wider communication, Manado Malay (mala1481), since the early 20th century (Wolff 2010: 299). Based on the researcher's personal experience, the number of fluent speakers of Tondano is estimated at around 30 000.
This corpus is the result of fieldwork undertaken by Timothy Brickell as part of his PhD candidature at La Trobe University, Melbourne, Australia between 2011 and 2015 (see Brickell 2015). The speakers recorded were of both genders, of various ages, and from a number of professions, with many older speakers already retired. The texts in Multi-CAST constitute a subset of the 20 recordings made by Brickell. In some recordings speakers discuss a topic chosen just prior to recording, in others they talk while engaging in traditional activities, and in others still they narrate an elicitation video depicting other community members carrying out traditional cultural activities.
Brickell, Timothy. 2016. Multi-CAST Tondano. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#tondano) (date accessed)
Tulil (taul1251), also known as Taulil, is a Papuan language spoken in the East New Britain Province of Papua New Guinea. As of 2000, Tulil was spoken by approximately 2 000 people spread out over four villages (Tulil 1, Tulil 2, Kadaulung, and Toma).
The six texts in this corpus comprise a subset of a larger collection of material that was recorded and transcribed during two field trips undertaken by Chenxi Meng in 2012 and 2015 for her PhD project, which has resulted in a comprehensive grammar of Tulil (Meng 2018). The entirety of the data has been deposited in PARADISEC.
The texts selected for Multi-CAST include both traditional and personal narratives. Annotations with RefIND were added by Maria Vollmer.
Meng, Chenxi. 2019. Multi-CAST Tulil. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#tulil) (date accessed)
Vera'a (vera1241) is an Oceanic (Austronesian) language from the village of the same name on Vanua Lava (13.80°S 167.47°E), one of the Banks Islands in North Vanuatu. The language has approximately 450 speakers and is the first language of most inhabitants of Vera'a and the coastline to the north of it. Vera'a is closely related to the neighbouring language Vurës, and speakers of Vera'a also speak Vurës.
Both languages have been extensively documented within a VolkswagenStiftung-funded DOBES documentation project (2006–2012; PI: Dr Catriona Hyslop-Malau). Vera'a was the focus of Stefan Schnell's PhD project at Kiel University (2007–2010, see Schnell 2011), and Schnell subsequently undertook additional documentary work on Vera'a in 2012–2015 as part of his ARC-funded DECRA project Typology of Language Use (ARC grant no. DE120102017), hosted by La Trobe University (Melbourne, Australia).
The Multi-CAST Vera'a corpus consists of 10 folkloristic narrative texts collected and annotated by Stefan Schnell. They constitute a subcorpus of a larger corpus of Vera'a compiled and curated by Stefan Schnell in close collaboration with speakers of the language and researchers of other disciplines from outside the community. Annotations with RefIND were added to the corpus in 2019 by Stefan Schnell and Maria Vollmer.
Schnell, Stefan. 2015. Multi-CAST Vera'a. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/#veraa) (date accessed)
Multi-CAST has been designed to facilitate empirical research into the structure of spontaneous spoken language from a cross-linguistic perspective. The overriding questions are the following:
Our research agenda has been heavily inspired by work in the functionalist tradition, initiated by scholars such as Wallace Chafe, Talmy Givón, Barbara Fox, and others.
We have drawn on Multi-CAST data to follow up on many of the issues raised by the pioneers of usage-based grammar, for example the relationship between topicality and subjecthood, the notion of an ergative bias to discourse organization, the role of animacy in morphosyntax, and the mechanisms involved in the emergence of agreement morphology.
The small symbol inventory of the GRAID annotation scheme aims to capture cross-linguistically comparable categories, which, when combined with the morpheme-by-morpheme glosses and referent indexing with RefIND, allows for highly complex queries across corpora. See the Multi-CAST research context for illustrative examples.
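To give a rough idea of how such combinable tags lend themselves to querying, the following Python sketch splits glosses into their parts and counts the forms that occur in a given function. It assumes glosses of the simplified shape form.animacy:function (for example np.h:s); this shape, the sample glosses, and all function names are illustrative assumptions, and the GRAID manual (Haig & Schnell 2014) remains the authoritative description of the syntax.

```python
import re
from collections import Counter
from typing import Iterable, NamedTuple, Optional

# ASSUMPTION: glosses look like <form>[.<animacy>][:<function>], e.g. "np.h:s".
# This is a simplification for illustration, not the full GRAID syntax.
GLOSS_PATTERN = re.compile(
    r"^(?P<form>[^.:]+)(?:\.(?P<animacy>[^:]+))?(?::(?P<function>.+))?$"
)


class Gloss(NamedTuple):
    form: str
    animacy: Optional[str]
    function: Optional[str]


def parse_gloss(gloss: str) -> Optional[Gloss]:
    """Split one gloss into form, animacy, and function (None if malformed)."""
    match = GLOSS_PATTERN.match(gloss.strip())
    if match is None:
        return None
    return Gloss(match["form"], match["animacy"], match["function"])


def count_forms_by_function(glosses: Iterable[str], function: str) -> Counter:
    """Count the forms of all glosses that carry the given function."""
    parsed = (parse_gloss(g) for g in glosses)
    return Counter(
        p.form for p in parsed if p is not None and p.function == function
    )


# Made-up glosses, not taken from any Multi-CAST text.
sample = ["np.h:s", "v:pred", "0.h:a", "np:p", "v:pred", "pro.h:s", "0.h:s"]
subject_forms = count_forms_by_function(sample, "s")
print(subject_forms)
```

The same counting step could be run per corpus to compare, say, how often subjects are realized as full noun phrases, pronouns, or zero in different languages.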
One straightforward way of working with the Multi-CAST data is via the EAF files and the linguistic annotation software ELAN, which is freely available online. ELAN allows for conditional searches with regular expressions across sets of multiple EAF files. Please refer to the ELAN user guide and manual for instructions.
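Because EAF files are XML documents, a comparable regular-expression search can also be scripted outside ELAN. The Python sketch below uses only the standard library and relies on the TIER, TIER_ID, and ANNOTATION_VALUE names of the EAF format. The tier names ("graid", "translation"), the sample annotations, and the helper functions are hypothetical and chosen for illustration; the tier names in the actual Multi-CAST files may differ.

```python
import re
import xml.etree.ElementTree as ET

# A tiny hand-written EAF fragment for illustration only. Real files also
# contain a header, time slots, and further tiers. The tier names and
# annotation values here are hypothetical.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="graid">
    <ANNOTATION><ALIGNABLE_ANNOTATION ANNOTATION_ID="a1">
      <ANNOTATION_VALUE>np.h:s</ANNOTATION_VALUE>
    </ALIGNABLE_ANNOTATION></ANNOTATION>
    <ANNOTATION><ALIGNABLE_ANNOTATION ANNOTATION_ID="a2">
      <ANNOTATION_VALUE>v:pred</ANNOTATION_VALUE>
    </ALIGNABLE_ANNOTATION></ANNOTATION>
  </TIER>
  <TIER TIER_ID="translation">
    <ANNOTATION><ALIGNABLE_ANNOTATION ANNOTATION_ID="a3">
      <ANNOTATION_VALUE>The king speaks.</ANNOTATION_VALUE>
    </ALIGNABLE_ANNOTATION></ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def search_tier(root, tier_id, pattern):
    """Return all annotation values on one tier that match a regex."""
    regex = re.compile(pattern)
    hits = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for value in tier.iter("ANNOTATION_VALUE"):
            text = (value.text or "").strip()
            if regex.search(text):
                hits.append(text)
    return hits


def search_files(paths, tier_id, pattern):
    """Run the same search over several EAF files, keyed by file path."""
    return {
        str(path): search_tier(ET.parse(path).getroot(), tier_id, pattern)
        for path in paths
    }


root = ET.fromstring(SAMPLE_EAF)
subject_hits = search_tier(root, "graid", r":s$")
print(subject_hits)
```

For a set of downloaded files, search_files could be called with a list of paths, for instance one collected with pathlib.Path.glob("*.eaf").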
A more programmatic alternative is offered by the statistical computing language R and the custom-built multicastR package (Schiborr 2018), which offers a convenient way of accessing the annotation values and metadata directly in R. The multicastR package is freely available from the Comprehensive R Archive Network (CRAN). The source files for a manual installation can also be found here.
Collected below are publications and presentations that make use of data from Multi-CAST. If you have employed Multi-CAST in your research and would like to see your work included in this list, please contact Geoffrey Haig and/or Stefan Schnell.
Haig, Geoffrey & Adibifar, Shirin. To appear. Referential Null Subjects (RNS) in colloquial spoken Persian: Does speaker familiarity have an impact? In Korangy, Alireza & Mahmoodi-Bakhtiari, Behrooz (eds.), Essays on the typology of Iranian languages. Berlin: Mouton de Gruyter.
Kimoto, Yukinori. 2018. Operationalizing Philippine-type syntax for the GRAID system: Clause structure, case marking, and verb class in Arta. Asian and African Languages and Linguistics 12. 17–35. (hdl.handle.net/10108/91147)
Kurabe, Keita. 2018. The GRAID-annotated Jinghpaw corpus: Annotations and initial findings. Asian and African Languages and Linguistics 12. 37–73. (hdl.handle.net/10108/91142)
Schnell, Stefan & Barth, Danielle. 2018. Discourse motivations for pronominal and zero objects across genres in Vera'a. Language Variation and Change 30(1). 51–81. (DOI: 10.1017/S0954394518000054)
Schnell, Stefan & Schiborr, Nils N. 2018. Corpus-based typological research in discourse and grammar: GRAID and Multi-CAST. Asian and African Languages and Linguistics 12. 1–16. (hdl.handle.net/10108/91145)
Shiohara, Asako. 2018. A progress report on the Sumbawa annotated spoken corpus: Tentative annotation notes. Asian and African Languages and Linguistics 12. 75–97. (hdl.handle.net/10108/91143)
Brickell, Timothy & Schnell, Stefan. 2017. Do grammatical relations reflect information status? Reassessing Preferred Argument Structure against discourse data from Tondano. Linguistic Typology 21(1). 177–209. (DOI: 10.1515/lingty-2017-0005)
Haig, Geoffrey & Schnell, Stefan. 2016. The discourse basis of ergativity revisited. Language 92(3). 591–618. (DOI: 10.1353/lan.2016.0049)
Haig, Geoffrey & Schnell, Stefan. 2016. The discourse basis of ergativity revisited: Online appendices. Language 92(3). 1–14. (DOI: 10.1353/lan.2016.0044)
Schiborr, Nils N. 2019. Quantitative models of referential choice: Lexical anaphora in English. Paper presented at the 8th Biennial International Conference on the Linguistics of Contemporary English (BICLCE 2019), Bamberg, Germany, 26–28 September 2019.
Schiborr, Nils N. 2019. Modelling referential choice in natural spoken discourse: Multi-CAST, GRAID, and RefIND. Paper presented at the Workshop Annotation of Non-standard Corpora (ANSC 2019), Bamberg, Germany, 16–18 September 2019.
Schiborr, Nils N. 2018. Data-driven models of referential choice: Antecedent distance and beyond. Paper presented at the Workshop Information Structure in Spoken Language Corpora 3: Discourse and Information Structure (ISSLaC3), Münster, Germany, 7–8 December 2018.
Schnell, Stefan & Schiborr, Nils N. & Haig, Geoffrey. 2018. Is intransitive subject the preferred role for introducing new referents? Evidence from corpus-based typology. Paper presented at the 51st Annual Meeting of the Societas Linguistica Europaea (SLE2018), Tallinn, Estonia, 29 August–1 September 2018.
Haig, Geoffrey & Schnell, Stefan & Schiborr, Nils N. 2017. The limits of accessibility: A corpus-based typological approach. Paper presented at the 12th Conference of the Association for Linguistic Typology (ALT2017), Canberra, Australia, 11–15 December 2017.
Haig, Geoffrey & Schiborr, Nils N. 2016. Multi-CAST (Multilingual Corpus of Annotated Spoken Texts): Ein Projekt zur Erstellung und Auswertung mehrsprachiger Korpora für die Sprachtypologie. Paper presented at the CLARIN Forum CA3, Hamburg, Germany, 7–8 June 2016.
The shared utility of Multi-CAST grows with increasing typological representativity of the language sample it contains. We therefore encourage scholars to contribute additional data sets to Multi-CAST, which can be incorporated into the collection as stand-alone resources, citable with their names as the authors and annotators.
If you wish to contribute data, here are some points to consider:
If you have a data set that complies with these conditions and you are interested in contributing it to Multi-CAST, please contact Geoffrey Haig and/or Stefan Schnell in order to coordinate the next steps.
In technical terms, this involves transferring your data into the EAF file format of the annotation software ELAN, for which purpose we will provide you with a Multi-CAST ELAN template, and annotating your texts with GRAID. The latter involves some quite tricky analytical decisions, and we strongly recommend that potential contributors liaise with us before undertaking this task. The actual labour input required will vary from language to language, but we will certainly assist you and be able to give you a realistic assessment of what may be necessary.
The Multi-CAST project is being coordinated by Geoffrey Haig, Stefan Schnell, Nils Schiborr, and Maria Vollmer, all at the Department of General Linguistics at the University of Bamberg.
In addition, the following researchers were involved in the collection, translation, and annotation of the various Multi-CAST corpora, or have contributed to the project in other ways:
Ariel, Mira. 1988. Referring and accessibility. Journal of Linguistics 24(1). 67–87.
Ariel, Mira. 1990. Accessing noun-phrase antecedents. London: Routledge.
Ariel, Mira. 2004. Accessibility marking: Discourse functions, discourse profiles, and processing cues. Discourse Processes 37(2). 91–116.
Bickel, Balthasar. 2003. Referential density in discourse and syntactic typology. Language 79(4). 708–736.
Brickell, Timothy. 2015. A grammar of Tondano. Ph.D. dissertation, La Trobe University, Melbourne, Australia.
Chafe, Wallace. 1980. The deployment of consciousness in the production of a narrative. In Chafe, Wallace (ed.), The Pear Stories: Cognitive, cultural, and linguistic aspects of narrative production, 9–50. Norwood, NJ: Ablex.
Du Bois, John. 1987. The discourse basis of ergativity. Language 63(4). 805–855.
Du Bois, John. 2003. Argument structure: Grammar in use. In Du Bois, John & Kumpf, Lorraine & Ashby, William J. (eds.), Preferred argument structure: Grammar as architecture for function, 11–60. Amsterdam: John Benjamins.
Du Bois, John. 2017. Ergativity in discourse and grammar. In Coon, Jessica & Massam, Diane & Travis, Lisa D. (eds.), The Oxford handbook of ergativity, 23–57. Oxford: Oxford University Press.
English Dialects Research Group. 2005. Freiburg English Dialect Corpus (FRED). (fred.ub.uni-freiburg.de/)
Forker, Diana. Under revision. A grammar of Sanzhi Dargwa. Berlin: Language Science Press.
Giangoullis, Konstantinos G. 2009. Kypriaka paradosiaka paramytha: Ek stomatos Elenis Mich, Satsia, Apo to Geri-Pyroi (1887–1982) [A traditional Cypriot storyteller: From the mouth of Elenis Mich, Satsia, from Geri-Pyroi (1887–1982)]. Leukosia: Theopress Publications.
Haig, Geoffrey. 2018. Northern Kurdish (Kurmanjî). In Haig, Geoffrey & Khan, Geoffrey (eds.), The languages and linguistics of Western Asia: An areal perspective, 106–158. Berlin: Mouton de Gruyter.
Haig, Geoffrey & Schnell, Stefan. 2014. Annotations using GRAID (Grammatical Relations and Animacy in Discourse): Introduction and guidelines for annotators. Version 7.0. (multicast.aspra.uni-bamberg.de/)
Haig, Geoffrey & Schnell, Stefan. 2015. Multi-CAST: Multilingual corpus of annotated spoken texts. (multicast.aspra.uni-bamberg.de/)
Hammarström, Harald & Forkel, Robert & Haspelmath, Martin. (eds.). 2019. Glottolog 4.0. Jena: Max Planck Institute for the Science of Human History. (glottolog.org)
Kimoto, Yukinori. 2017. A grammar of Arta: A Philippine Negrito language. Ph.D. dissertation, Kyoto University, Kyoto, Japan.
Meng, Chenxi. 2018. A grammar of Tulil. Ph.D. dissertation, La Trobe University, Melbourne, Australia.
Mosel, Ulrike. 2019. A multifunctional Teop-English dictionary. Dictionaria 4(1-6488). (dictionaria.clld.org/contributions/teop)
Mosel, Ulrike & Thiesen, Yvonne. 2007. The Teop sketch grammar. Unpublished manuscript, University of Kiel. (hdl.handle.net/1839/00-0000-0000-0008-24F6-3)
Noonan, Michael. 2003. A crosslinguistic investigation of referential density. Unpublished manuscript, University of Wisconsin-Milwaukee. (crossasia-repository.ub.uni-heidelberg.de/190/)
Riester, Arndt & Baumann, Stefan. 2017. The RefLex scheme — Annotation guidelines. SinSpeC: Working papers of the SFB 732 14. (DOI: 10.18419/opus-9011)
Schiborr, Nils N. 2018. multicastR: A companion to the Multi-CAST collection. R package version 1.3.0. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (cran.r-project.org/package=multicastR)
Schiborr, Nils N. & Schnell, Stefan & Thiele, Hanna. 2018. RefIND — Referent Indexing in Natural-language Discourse: Annotation guidelines. Version 1.1. (multicast.aspra.uni-bamberg.de/)
Schnell, Stefan. 2011. A grammar of Vera'a. Ph.D. dissertation, Kiel University, Germany.
Thieberger, Nick. 2004. Documentation in practice: Developing a linked media corpus of South Efate. In Austin, Peter (ed.), Language documentation and description, 169–178. London: Hans Rausing Endangered Languages Project, SOAS.
Thieberger, Nick. 2006. A grammar of South Efate: An Oceanic language of Vanuatu. Honolulu: University of Hawaii Press. (hdl.handle.net/11343/31242)
Wolff, John. 2010. Proto-Austronesian phonology. Ithaca, NY: Cornell Southeast Asia Program Publications.
The collection and annotation of the data in Multi-CAST have graciously received support from the following institutions and organizations:
The Department of General Linguistics at the University of Bamberg contributed departmental funding and research infrastructure to the Multi-CAST project, and the ARC Centre of Excellence for the Dynamics of Language provided additional support.
The following texts in the collection are made available in cooperation with these researchers and institutions:
The editors of and contributors to Multi-CAST would also like to thank our respective research communities for their support and stimulating criticism.
In the spirit of open science, the Multi-CAST collection, including the recordings, transcriptions, annotations, and all supplementary materials, is published under the Creative Commons Attribution 4.0 International licence (CC BY 4.0).
The CC BY licence allows full access to Multi-CAST for any purpose related to research, art, journalism, or any other endeavour, on the condition that proper credit is given to the editors of the collection and its contributors. This credit must also include a link to this website (multicast.aspra.uni-bamberg.de) and a brief note about the licensing terms.