
Multi-CAST, the Multilingual Corpus of Annotated Spoken Texts, is a collection of annotated texts from a typologically diverse set of languages.

Northern Kurdish [nkurd_muserz03_0065]
îcar ro-k-î paşa wezîr-ê xwe di-şîn-e çarşu-yê dibê
now day-indf-obl pasha vizier-ezafe.pl refl ind-send.prs-3sg market-obl Ø ind.say.prs.3sg
GRAID: ## other other np.h:a np.h:p rn_refl.h:poss v:pred np:g ## 0.h:s v:pred
RefIND: 0036 0033 0036 0034 0036
ISNRef: new bridging bridging
‘One day the king sends his advisors to the market. (He) says, ...’

Annotations

Alongside standard spoken corpus annotations, the GRAID (Grammatical Relations and Animacy in Discourse, Haig & Schnell 2014) and RefIND (Referent Indexing in Natural Language Discourse, Schiborr et al. 2018) annotation schemes enable cross-linguistic research in the area of discourse and grammar. GRAID provides a uniform set of tags with a simple combinatory syntax, and RefIND allows individual discourse referents to be identified and tracked throughout a text.

The GRAID manual and RefIND guidelines provide extensive discussion of the analytical considerations involved in the annotation.
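The combinatory tag syntax can be illustrated with a minimal sketch. It assumes the pattern form.animacy:function seen in tags such as np.h:a or 0.h:s in the sample annotation. The names GraidTag and split_graid are hypothetical and not part of any Multi-CAST tooling, and the sketch does not cover every symbol described in the GRAID manual.

```python
from typing import NamedTuple, Optional


class GraidTag(NamedTuple):
    """Hypothetical container for the parts of a single GRAID tag."""
    form: str                 # e.g. "np", "0", "v", "rn_refl"
    animacy: Optional[str]    # e.g. "h" for human; None if absent
    function: Optional[str]   # e.g. "a", "p", "s", "pred"; None if absent


def split_graid(tag: str) -> GraidTag:
    """Split a tag of the assumed shape form[.animacy][:function]."""
    # Everything after the first colon is treated as the function.
    head, _, function = tag.partition(":")
    # Everything after the first period in the head is treated as animacy.
    form, _, animacy = head.partition(".")
    return GraidTag(form, animacy or None, function or None)


if __name__ == "__main__":
    for tag in ["np.h:a", "rn_refl.h:poss", "v:pred", "0.h:s", "other"]:
        print(tag, "->", split_graid(tag))
```

Tags without a period or colon, such as other or the boundary marker ##, come back with only the form filled in, so the sketch can be applied to a whole annotation tier without special cases.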

The corpora


Versioning

The Multi-CAST collection continues to develop as new material is added and the annotations of older texts are revised. Successive releases of the corpus data are assigned version numbers composed of the year and month they were published.

The files listed below represent the latest version of Multi-CAST; a directory of older versions can be accessed via the links on the right. A list of changes introduced with each release can be found in the Multi-CAST collection overview.

The current version of Multi-CAST is 2311, published in November 2023.
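As a rough illustration of the version scheme, the sketch below reads a version number as a two-digit year followed by a two-digit month. This reading is an assumption inferred from version 2311 being published in November 2023, as is the choice to place all years in the 2000s. The function name parse_version is hypothetical.

```python
from datetime import date


def parse_version(version: str) -> date:
    """Interpret a version string such as "2311" as year and month (YYMM).

    Returns the first day of the release month. Assumes that the two-digit
    year refers to the 2000s, which is not stated in the source.
    """
    if len(version) != 4 or not version.isdigit():
        raise ValueError(f"expected four digits, got {version!r}")
    year = 2000 + int(version[:2])
    month = int(version[2:])
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {version!r}")
    return date(year, month, 1)


if __name__ == "__main__":
    print(parse_version("2311"))  # 2023-11-01
```

Because the digits run from year to month, version strings of this shape also sort chronologically when compared as plain strings.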

Arta [arta]

Yukinori Kimoto

Arta (arta1239) is an endangered Austronesian language spoken by a group of hunter-gatherers living in Luzon, the Philippines. The number of fluent speakers is between nine and eleven, most of whom are over the age of forty. Since all speakers have settled down in the communities of neighboring Negrito groups (Casiguran/Nagtipunan Agta people), the language is not in active use and is no longer taught to children. All of the speakers are multilingual in Casiguran/Nagtipunan Agta and Ilokano.

The texts were collected by Yukinori Kimoto during fieldwork in the Quirino and Aurora provinces in Luzon between 2012 and 2018. See Kimoto (2017) for a description of the language.

Speakers of Arta in Luzon, the Philippines. Photo by Yukinori Kimoto.

Citation for this corpus

Kimoto, Yukinori. 2019. Multi-CAST Arta. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2311. Bamberg: University of Bamberg. (multicast.aspra.uni-bamberg.de/#arta) (date accessed)

Corpus documentation

Corpus files

    • arta_alisiya
    • 15 MB
    • 3 MB
    • 0.3 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • arta_arsenyo
    • 77 MB
    • 7 MB
    • 0.3 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • arta_child
    • 156 MB
    • 13 MB
    • 0.9 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • arta_delia
    • 114 MB
    • 10 MB
    • 0.6 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • arta_disubu
    • 56 MB
    • 5 MB
    • 0.3 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • arta_hapon
    • 126 MB
    • 10 MB
    • 0.7 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • arta_husband
    • 46 MB
    • 4 MB
    • 0.3 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • arta_marry
    • 38 MB
    • 3 MB
    • 0.2 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • arta_swateng
    • 117 MB
    • 10 MB
    • 0.9 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • arta_typhoon
    • 65 MB
    • 5 MB
    • 0.4 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • arta_udulan
    • 68 MB
    • 6 MB
    • 0.4 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • full corpus
    • 716 MB
    • 74 MB
    • 0.3 MB
    • 0.3 MB
    • 1.9 MB

Bora [bora]

Frank Seifart, Tai Hong

Bora (bora1263) is a Boran language spoken in various small communities in the Colombian and Peruvian Amazon region (e.g. 3.23°S 71.99°W, 1.75°S 72.50°W). The language has approximately 1 000 speakers, almost all of whom are bilingual in local Spanish. The number of children acquiring Bora is currently decreasing.

Bora has been extensively documented within a VolkswagenStiftung-funded DOBES documentation project (2005–2009). The Multi-CAST Bora corpus consists of two folkloristic narrative texts taken from the larger DOBES collection. They were recorded and annotated by Frank Seifart in collaboration with, especially, Clever Panduro (original transcription and translation) and Lena Sell (original morphological glossing). Annotations with GRAID and RefIND were added to the corpus in 2021–2022 by Tai Hong in collaboration with Frank Seifart.

The Multi-CAST Bora texts (version 2207) also constitute a part of the Bora data set in DoReCo, which has been time-aligned at the phone level.

Ritual exchange ceremony in the Bora community Colonia Ancón, Department of Loreto, Peru. Photo by Frank Seifart, 2014.

Citation for this corpus

Seifart, Frank & Hong, Tai. 2022. Multi-CAST Bora. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2211. Bamberg: University of Bamberg. (multicast.aspra.uni-bamberg.de/#bora) (date accessed)

Corpus documentation

Corpus files

    • bora_ajyuwa
    • 407 MB
    • 34 MB
    • 3.8 MB
    • 0.8 MB
    • 0.2 MB
    • preview
    • bora_meenujkatsi
    • 136 MB
    • 11 MB
    • 1.3 MB
    • 0.3 MB
    • 0.1 MB
    • preview
    • full corpus
    • 346 MB
    • 45 MB
    • 0.3 MB
    • 1.0 MB
    • 0.3 MB

Cypriot Greek [cypgreek]

Harris Hadjidas, Maria Vollmer

Cypriot Greek (cypr1249) is the variety of Greek spoken in Cyprus. The three texts in this corpus, all of which are traditional narratives, were originally recorded in the 1960s, and later compiled and published by Konstantinos Giangoullis as part of a book of traditional Cypriot tales (Giangoullis 2009). Giangoullis has kindly given his permission for the three texts to be made freely available as part of Multi-CAST.

While unfortunately no audio recordings are available for this corpus, the texts appear to have been only minimally edited and reflect reasonably faithfully the spoken language used in traditional narratives. The texts were initially transliterated into the Roman alphabet and translated into English by a native speaker, Harris Hadjidas, who also conducted the first round of syntactic annotation. A second round of annotation was completed by Maria Vollmer under the supervision of Geoffrey Haig.

Aphrodite's Rock, Paphos, Cyprus. Photo by Anna Anichkova, 2013, CC-BY-SA 3.0.

Citation for this corpus

Hadjidas, Harris & Vollmer, Maria. 2015. Multi-CAST Cypriot Greek. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2311. Bamberg: University of Bamberg. (multicast.aspra.uni-bamberg.de/#cypgreek) (date accessed)

Corpus documentation

Corpus files

    • cypgreek_jitros
    • 1.2 MB
    • 0.3 MB
    • 0.1 MB
    • preview
    • cypgreek_minaes
    • 1.7 MB
    • 0.3 MB
    • 0.1 MB
    • preview
    • cypgreek_psarin
    • 1.8 MB
    • 0.4 MB
    • 0.1 MB
    • preview
    • full corpus
    • 0.2 MB
    • 1.0 MB
    • 0.3 MB

English [english]

Nils Norman Schiborr

The Multi-CAST English (sout3282) corpus contains autobiographical narratives taken from the Freiburg English Dialect Corpus (FRED, English Dialects Research Group 2005), which was compiled under the supervision of Bernd Kortmann and Lieselotte Anderwald at the University of Freiburg from texts recorded during the 1970s and 80s as part of various oral history projects.

The texts annotated for Multi-CAST were recorded with older working-class speakers from southern and southeastern England. They depict everyday scenes and personal experiences from the speakers' lives: recurring topics include agriculture, animal husbandry, shipwrighting, work in the London docks, and the two World Wars.

A subset of the texts in this corpus (version 2207) have been time-aligned at the phone level for a data set in the DoReCo project.

The audio recordings (WAV, MP3) in this corpus are in the public domain.

St James's Park, London, United Kingdom. Photo by David Iliff, 2006, CC-BY-SA 3.0.

Citation for this corpus

Schiborr, Nils N. 2015. Multi-CAST English. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2311. Bamberg: University of Bamberg. (multicast.aspra.uni-bamberg.de/#english) (date accessed)

Corpus documentation

Corpus files

    • english_devon01
    • 320 MB
    • 29 MB
    • 3.3 MB
    • 0.7 MB
    • 0.2 MB
    • preview
    • english_kent01
    • 139 MB
    • 25 MB
    • 3.7 MB
    • 0.8 MB
    • 0.2 MB
    • preview
    • english_kent02_a
    • 151 MB
    • 27 MB
    • 4.4 MB
    • 0.9 MB
    • 0.3 MB
    • preview
    • english_kent02_b
    • 165 MB
    • 30 MB
    • 5.1 MB
    • 1.0 MB
    • 0.3 MB
    • preview
    • english_kent03_a
    • 254 MB
    • 23 MB
    • 3.7 MB
    • 0.8 MB
    • 0.2 MB
    • preview
    • english_kent03_b
    • 303 MB
    • 27 MB
    • 4.2 MB
    • 0.6 MB
    • 0.3 MB
    • preview
    • english_london01_a
    • 296 MB
    • 27 MB
    • 3.8 MB
    • 0.8 MB
    • 0.3 MB
    • preview
    • english_london01_b
    • 299 MB
    • 27 MB
    • 3.9 MB
    • 0.8 MB
    • 0.3 MB
    • preview
    • full corpus
    • 1.8 GB
    • 214 MB
    • 1.5 MB
    • 6.4 MB
    • 2.1 MB

Jinghpaw [jinghpaw]

Keita Kurabe

Jinghpaw (kach1280), also known as Kachin, is a Tibeto-Burman language spoken in northern Myanmar and neighbouring areas in India and the PR of China. The variety represented in the corpus is spoken in and around Myitkyina, Kachin State, Myanmar. The Jinghpaw speakers, as is typical for highlanders in mainland Southeast Asia, live in a socioculturally dynamic and multilingual environment. Jinghpaw serves as a lingua franca among the Kachin people, who speak diverse mutually unintelligible Tibeto-Burman languages, but have a number of shared cultural traits.

The Multi-CAST Jinghpaw corpus consists of traditional narratives glossed and annotated with GRAID by Keita Kurabe with the help of Stefan Schnell; annotations with RefIND were added by Ivan Kapitonov. They constitute a subset of more than 2 700 traditional Kachin narratives and related stories collected by Keita Kurabe and members of the Kachin community through a community-based documentation project undertaken in northern Myanmar between 2009 and 2020. Audio recordings for 2 754 stories are archived in two PARADISEC collections (1, 2). See Kurabe (2016) for a comprehensive grammar of the language.

Traditionally woven fabrics from Kachin State, Myanmar. Photo by Keita Kurabe, 2018.

Citation for this corpus

Kurabe, Keita. 2021. Multi-CAST Jinghpaw. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2311. Bamberg: University of Bamberg. (multicast.aspra.uni-bamberg.de/#jinghpaw) (date accessed)

Corpus documentation

Corpus files

    • jinghpaw_chyeju
    • 38 MB
    • 3 MB
    • 0.4 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • jinghpaw_dwi
    • 95 MB
    • 9 MB
    • 2.3 MB
    • 0.5 MB
    • 0.1 MB
    • preview
    • jinghpaw_galang
    • 35 MB
    • 3 MB
    • 0.8 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • jinghpaw_ganu
    • 74 MB
    • 3 MB
    • 0.7 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • jinghpaw_hkaili
    • 39 MB
    • 4 MB
    • 0.9 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • jinghpaw_hpaji
    • 19 MB
    • 2 MB
    • 0.4 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • jinghpaw_manau
    • 25 MB
    • 2 MB
    • 0.3 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • jinghpaw_natga
    • 38 MB
    • 3 MB
    • 0.5 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • jinghpaw_nchyang
    • 24 MB
    • 2 MB
    • 0.5 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • jinghpaw_nga
    • 30 MB
    • 3 MB
    • 0.7 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • jinghpaw_shanngayi
    • 26 MB
    • 2 MB
    • 0.5 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • full corpus
    • 335 MB
    • 35 MB
    • 0.4 MB
    • 1.7 MB
    • 0.5 MB

Kalamang [kalamang]

Eline Visser

Kalamang (kara1499) is a Papuan language spoken on the Karas Islands in West Papua, Indonesia. It is spoken by some 130 people in two villages on the biggest of the Karas Islands: Maas and Antalisa. Kalamang is under pressure from the local lingua franca, a variant of Papuan Malay, and is not currently spoken by people born after 1990. The texts in this corpus are all traditional narratives and were recorded in 2018 and 2019 as part of Eline Visser's PhD project at Lund University in Sweden, which resulted in a comprehensive grammar of Kalamang (Visser 2020). All Kalamang linguistic and cultural data have been deposited on the Humanities Lab corpus server at Lund University.

Cassowary Island, off the coast of the Karas Islands, Indonesia. Photo by Eline Visser, 2019.

Citation for this corpus

Visser, Eline. 2021. Multi-CAST Kalamang. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2311. Bamberg: University of Bamberg. (multicast.aspra.uni-bamberg.de/#kalamang) (date accessed)

Corpus documentation

Corpus files

    • kalamang_kasuari
    • 45 MB
    • 4 MB
    • 0.5 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • kalamang_keluer
    • 56 MB
    • 5 MB
    • 0.7 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • kalamang_kuawi
    • 90 MB
    • 8 MB
    • 1.1 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • kalamang_monyet
    • 171 MB
    • 16 MB
    • 1.9 MB
    • 0.4 MB
    • 0.1 MB
    • preview
    • kalamang_pitiskiet
    • 105 MB
    • 10 MB
    • 1.4 MB
    • 0.3 MB
    • 0.1 MB
    • preview
    • kalamang_yardakdak
    • 22 MB
    • 2 MB
    • 0.3 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • full corpus
    • 423 MB
    • 44 MB
    • 0.3 MB
    • 1.2 MB
    • 0.3 MB

Mandarin [mandarin]

Maria Vollmer

The Multi-CAST Mandarin (Modern Standard Mandarin; mand1415) corpus consists of traditional narratives from three native speakers of Mandarin. They were recorded in Xī'ān, PRC, by Maria Vollmer during an exchange semester in 2015 and 2016. Two of the speakers are originally from Northeast China (Dōngběi), the third hails from Xī'ān.

The stories were transcribed by Liu Ruoyu in 2016 and 2017 under the supervision of Maria Vollmer, and subsequently translated, glossed, and annotated with GRAID between 2016 and 2019 by Maria Vollmer. Annotations with RefIND and ISNRef were added by Maria Vollmer and Adrian Kuqi in 2019. Further stories have been recorded and transcribed and will be added to the corpus in the future.

A speaker of Mandarin listening to his recorded voice, Dōngběi, PRC. Photo by Maria Vollmer, 2016.

Citation for this corpus

Vollmer, Maria. 2020. Multi-CAST Mandarin. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2311. Bamberg: University of Bamberg. (multicast.aspra.uni-bamberg.de/#mandarin) (date accessed)

Corpus documentation

Corpus files

    • mandarin_hml
    • 105 MB
    • 10 MB
    • 1.8 MB
    • 0.4 MB
    • 0.1 MB
    • preview
    • mandarin_jgz
    • 214 MB
    • 19 MB
    • 3.8 MB
    • 0.8 MB
    • 0.2 MB
    • preview
    • mandarin_lzh
    • 83 MB
    • 8 MB
    • 1.3 MB
    • 0.3 MB
    • 0.1 MB
    • preview
    • full corpus
    • 367 MB
    • 36 MB
    • 0.3 MB
    • 1.4 MB
    • 0.4 MB

Matukar Panau [matukar]

Danielle Barth, Kira Davey, Maria Matheas

Matukar Panau is a highly endangered Oceanic language spoken around 45 km north of Madang, Papua New Guinea. Although most children are no longer learning Matukar Panau, current speakers (approximately 300) form a vibrant community of multilinguals in dense social networks. As an Oceanic language on the Papua New Guinea coast, Matukar Panau has many interesting Papuan features.

The Multi-CAST Matukar Panau corpus constitutes a small subset of recordings made by Danielle Barth during her fieldwork between 2010 and 2020 (Australian National University Asia-Pacific Innovation Program Grant, Resolving Ambiguity: What face-to-face communication can contribute, PI: Danielle Barth); language documentation is ongoing. Data has been transcribed and translated with help from local community members, especially Kadagoi Rawad Forepiso and Rudolf Raward. Recordings can be found in the ELAR and PARADISEC (1, 2) archives. More information and resources on the language can be found on the project website.

The texts in Multi-CAST were glossed with GRAID and RefIND by Danielle Barth, Kira Davey, and Maria Matheas. In addition to monologue narratives, some stimulus-based conversational descriptions have also been annotated with these schemata to enable research about referent expression when describing familiar and unfamiliar objects, places, and people. Recordings of these events are archived in ELAR and PARADISEC and those archives will eventually also provide open access to ELAN files with the annotations.

A tree in Matukar village, Madang Province, Papua New Guinea. Photo by Danielle Barth, 2013.

Citation for this corpus

Barth, Danielle & Davey, Kira & Matheas, Maria. 2023. Multi-CAST Matukar Panau. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2311. Bamberg: University of Bamberg. (multicast.aspra.uni-bamberg.de/#matukar) (date accessed)

Corpus documentation

Corpus files

    • matukar_bklife
    • 81 MB
    • 7 MB
    • 1.0 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • matukar_fishing
    • 57 MB
    • 3 MB
    • 0.6 MB
    • 0.1 MB
    • 0.2 MB
    • preview
    • matukar_kadagoi
    • 23 MB
    • 2 MB
    • 0.4 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • matukar_manub
    • 97 MB
    • 5 MB
    • 0.7 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • matukar_mariu
    • 103 MB
    • 9 MB
    • 1.3 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • matukar_niu
    • 114 MB
    • 6 MB
    • 0.8 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • matukar_ww2
    • 110 MB
    • 9 MB
    • 1.0 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • matukar_yali
    • 120 MB
    • 11 MB
    • 1.7 MB
    • 0.3 MB
    • 0.1 MB
    • preview
    • full corpus
    • 457 MB
    • 53 MB
    • 0.3 MB
    • 1.5 MB
    • 0.4 MB

Nafsan [nafsan]

Nick Thieberger, Timothy Brickell

The Nafsan (sout2856) language, also known as South Efate, is a Southern Oceanic language spoken on the island of Efate in central Vanuatu. As of 2005, there were approximately 6 000 speakers of Nafsan living in coastal villages from Pango to Eton. A description of the language can be found in Thieberger (2006).

The Multi-CAST Nafsan corpus constitutes a subset of the material collected by Nick Thieberger for his PhD research over three periods of fieldwork in the villages of Eratap and Erakor in South Efate between 1995 and 2000, and during subsequent trips. The entirety of the data has been archived in PARADISEC, and can also be accessed via ANNIS. See further Thieberger (2004).

The texts were glossed with GRAID by Nick Thieberger and Timothy Brickell, and subsequently annotated with RefIND by Adrian Kuqi under the supervision of Stefan Schnell.

The Multi-CAST Nafsan texts (version 2207) also constitute a part of the Nafsan data set in DoReCo, which has been time-aligned at the phone level.

A view of the coast of Efate, Vanuatu. Photo by Nick Thieberger, 2006.

Citation for this corpus

Thieberger, Nick & Brickell, Timothy. 2019. Multi-CAST Nafsan. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2311. Bamberg: University of Bamberg. (multicast.aspra.uni-bamberg.de/#nafsan) (date accessed)

Corpus documentation

Corpus files

    • nafsan_kori
    • 69 MB
    • 6 MB
    • 1.3 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • nafsan_lelep
    • 35 MB
    • 3 MB
    • 0.6 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • nafsan_lisau
    • 28 MB
    • 3 MB
    • 0.3 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • nafsan_litog
    • 36 MB
    • 3 MB
    • 0.4 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • nafsan_maal
    • 32 MB
    • 3 MB
    • 0.3 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • nafsan_nmatu
    • 43 MB
    • 4 MB
    • 0.5 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • nafsan_ntwam
    • 78 MB
    • 7 MB
    • 0.9 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • nafsan_taapes
    • 22 MB
    • 2 MB
    • 0.3 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • nafsan_tafra
    • 44 MB
    • 4 MB
    • 0.3 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • full corpus
    • 304 MB
    • 34 MB
    • 0.2 MB
    • 1.0 MB
    • 0.3 MB

Northern Kurdish [nkurd]

Geoffrey Haig, Maria Vollmer, Hanna Thiele

Northern Kurdish (nort2641), also known as Kurmanjî, is a Northwest Iranian language spoken in eastern Turkey, Iraq, Syria, and parts of western Iran. The three texts recorded here are traditional narratives, from a female and a male speaker who grew up near the townships of Erzurum and Muš, respectively.

The texts were recorded in Germany in the late 1990s and early 2000s, and subsequently transcribed, translated, and annotated for Multi-CAST by Geoffrey Haig, Abdullah Incekan, Hanna Thiele, and Maria Vollmer. A description of the language can be found in Haig (2018).

The texts in this corpus (version 2207) also make up a part of the Northern Kurdish data set in DoReCo, which has been time-aligned at the phone level.

A speaker of Kurmanjî. Photo by Geoffrey Haig.

Citation for this corpus

Haig, Geoffrey & Vollmer, Maria & Thiele, Hanna. 2019. Multi-CAST Northern Kurdish. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2311. Bamberg: University of Bamberg. (multicast.aspra.uni-bamberg.de/#nkurd) (date accessed)

Corpus documentation

Corpus files

    • nkurd_muserz01
    • 50 MB
    • 18 MB
    • 2.9 MB
    • 0.6 MB
    • 0.2 MB
    • preview
    • nkurd_muserz02
    • 123 MB
    • 11 MB
    • 1.8 MB
    • 0.4 MB
    • 0.1 MB
    • preview
    • nkurd_muserz03
    • 50 MB
    • 18 MB
    • 3.0 MB
    • 0.6 MB
    • 0.2 MB
    • preview
    • full corpus
    • 182 MB
    • 47 MB
    • 0.4 MB
    • 1.6 MB
    • 0.5 MB

Persian [persian]

Shirin Adibifar

Persian (tehr1242) is an Iranian language with official variants spoken in Iran, Afghanistan, and parts of Tajikistan; the variety spoken in Iran is also referred to as Farsi.

The texts in this corpus are narrative retellings of the Pear film (Chafe 1980), a short film, roughly five minutes long, about a boy stealing the fruit a man had been picking. The recordings were made by Shirin Adibifar in Tehran and locations in the province of Mazandaran in 2015. Of the 29 speakers in this corpus, 17 are female and 12 male. The median age is 25, with a range of 20 to 39. All speakers have received at least some measure of university-level education.

Badab-e Surt, Mazandaran, Iran. Photo by M. Samaee, 2010, CC-BY 3.0.

Citation for this corpus

Adibifar, Shirin. 2016. Multi-CAST Persian. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2311. Bamberg: University of Bamberg. (multicast.aspra.uni-bamberg.de/#persian) (date accessed)

Corpus documentation

Corpus files

    • persian_g1-f-01
    • 16 MB
    • 1 MB
    • 0.2 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g1-f-02
    • 22 MB
    • 2 MB
    • 0.3 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g1-f-05
    • 23 MB
    • 2 MB
    • 0.3 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g1-f-07
    • 11 MB
    • 1 MB
    • 0.2 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g1-f-08
    • 17 MB
    • 2 MB
    • 0.1 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g1-f-09
    • 45 MB
    • 4 MB
    • 0.5 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g1-f-10
    • 34 MB
    • 3 MB
    • 0.4 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g1-f-11
    • 17 MB
    • 2 MB
    • 0.3 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g1-f-12
    • 18 MB
    • 2 MB
    • 0.2 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g1-f-14
    • 31 MB
    • 3 MB
    • 0.4 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g1-m-03
    • 8 MB
    • 1 MB
    • 0.1 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g1-m-04
    • 21 MB
    • 2 MB
    • 0.3 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g1-m-06
    • 9 MB
    • 1 MB
    • 0.1 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g1-m-13
    • 29 MB
    • 3 MB
    • 0.4 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g2-f-01
    • 24 MB
    • 2 MB
    • 0.3 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g2-f-02
    • 15 MB
    • 1 MB
    • 0.2 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g2-f-03
    • 16 MB
    • 2 MB
    • 0.2 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g2-f-04
    • 11 MB
    • 1 MB
    • 0.1 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g2-f-05
    • 19 MB
    • 2 MB
    • 0.1 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g2-f-06
    • 15 MB
    • 1 MB
    • 0.3 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g2-f-07
    • 17 MB
    • 2 MB
    • 0.3 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g2-m-08
    • 18 MB
    • 2 MB
    • 0.2 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g2-m-09
    • 14 MB
    • 1 MB
    • 0.2 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g2-m-10
    • 13 MB
    • 1 MB
    • 0.2 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g2-m-11
    • 10 MB
    • 1 MB
    • 0.1 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g2-m-12
    • 12 MB
    • 1 MB
    • 0.2 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g2-m-13
    • 14 MB
    • 1 MB
    • 0.2 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g2-m-14
    • 11 MB
    • 1 MB
    • 0.2 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • persian_g2-m-15
    • 26 MB
    • 2 MB
    • 0.2 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • full corpus
    • 421 MB
    • 48 MB
    • 0.3 MB
    • 1.3 MB
    • 0.5 MB

Sanzhi Dargwa [sanzhi]

Diana Forker, Nils Norman Schiborr

Sanzhi Dargwa (sanz1248) is a Nakh-Daghestanian (Caucasian) language from the Dargwa subbranch. From 1968 onwards, over a relatively short time span, all Sanzhi speakers left their village of Sanzhi in the mountains of central Daghestan, Russia, to move to linguistically and ethnically heterogeneous settlements in the lowlands. Today Sanzhi is spoken by approximately 250 speakers and is heavily endangered.

The eight texts in this corpus represent a small subset of the material that was recorded, transcribed, translated, and glossed by Diana Forker with the assistance of Gadzhimurad Gadzhimuradov, a native speaker, as part of a DOBES language documentation project (2012–2019), which has culminated in a grammar of Sanzhi Dargwa (Forker 2020).

The texts presented here are a mixture of autobiographical and traditional narratives. They were annotated for Multi-CAST by Nils Schiborr.

The texts in this corpus (version 2207) are part of the Sanzhi Dargwa data set in DoReCo, which has been time-aligned at the phone level.

The ruins of Sanzhi village, Daghestan, Russia. Photo by Gadzhimurad Gadzhimuradov.

Citation for this corpus

Forker, Diana & Schiborr, Nils N. 2019. Multi-CAST Sanzhi Dargwa. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2311. Bamberg: University of Bamberg. (multicast.aspra.uni-bamberg.de/#sanzhi) (date accessed)

Corpus documentation

Corpus files

    • sanzhi_asabali
    • 70 MB
    • 6 MB
    • 0.6 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • sanzhi_bazhuk
    • 47 MB
    • 4 MB
    • 0.4 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • sanzhi_dragon
    • 61 MB
    • 5 MB
    • 0.5 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • sanzhi_kurban
    • 49 MB
    • 4 MB
    • 0.7 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • sanzhi_mill
    • 57 MB
    • 5 MB
    • 0.5 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • sanzhi_patima
    • 57 MB
    • 5 MB
    • 0.5 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • sanzhi_ramazan
    • 80 MB
    • 7 MB
    • 1.0 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • sanzhi_tape
    • 20 MB
    • 2 MB
    • 0.3 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • full corpus
    • 381 MB
    • 37 MB
    • 0.2 MB
    • 0.9 MB
    • 0.3 MB

Sumbawa [sumbawa]

Asako Shiohara

Sumbawa (sumb1241, indigenous designation: Samawa) is a Western Austronesian language spoken in the western part of Sumbawa Island, Indonesia. Administratively, the area belongs to two districts, namely Sumbawa district (Kabupaten Sumbawa) and West Sumbawa district (Kabupaten Sumbawa Barat), in the province of West Nusa Tenggara (Nusa Tenggara Barat). Sumbawa belongs to the Bali-Sasak-Sumbawa subgroup of the Malayo-Polynesian branch of the Austronesian language family (Adelaar 2005; Mbete 1990).

The texts in this corpus were collected by Asako Shiohara in 1996 and 1997. They were recorded in the small town of Empang and in Desa Bantu, a village close to Empang. Among the several dialects of the Sumbawa language, the dialect spoken in these two locations is classified as the Sumbawa Besar dialect, which is distributed across a large part of the western Sumbawa-speaking area.

The texts were annotated for Multi-CAST by Shiohara between 2018 and 2022, with RefIND annotations added in 2022 by Tai Hong.

Bala Loka, or Sultan's Palace, in Sumbawa Besar, Indonesia. Photo by Asako Shiohara.

Citation for this corpus

Shiohara, Asako. 2022. Multi-CAST Sumbawa. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2311. Bamberg: University of Bamberg. (multicast.aspra.uni-bamberg.de/#sumbawa) (date accessed)

Corpus documentation

Corpus files

    • sumbawa_flood
    • 33 MB
    • 6 MB
    • 0.8 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • sumbawa_kerekkure
    • 213 MB
    • 19 MB
    • 2.9 MB
    • 0.6 MB
    • 0.2 MB
    • preview
    • sumbawa_langlelo
    • 86 MB
    • 8 MB
    • 0.9 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • sumbawa_menangis
    • 66 MB
    • 6 MB
    • 0.7 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • sumbawa_nuntut
    • 56 MB
    • 5 MB
    • 0.6 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • full corpus
    • 377 MB
    • 44 MB
    • 0.3 MB
    • 1.2 MB
    • 0.4 MB

Tabasaran [tabasaran]

Natalia Bogomolova, Dmitry Ganenkov, Nils Norman Schiborr

Tabasaran (taba1259) is a Nakh-Daghestanian (Caucasian) language from the Lezgic subbranch. Recent census data puts the number of speakers at about 120 000; Campbell et al. (2017) classify the language as vulnerable.

The texts in the Multi-CAST Tabasaran corpus were recorded by Natalia Bogomolova with the assistance of Dmitry Ganenkov in 2010, and subsequently transcribed, glossed, and translated by the former. The annotations with GRAID and RefIND were added by Nils Schiborr between 2019 and 2020. The five texts in this corpus are a mixture of traditional and biographical narratives.

The texts in this corpus (version 2207) have been time-aligned at the phone level for a data set in the DoReCo project.

A view of Mount Shalbuzdag in Daghestan, Russia. Photo by Аль-Гимравий, 2016, CC-BY-SA 4.0.

Citation for this corpus

Bogomolova, Natalia & Ganenkov, Dmitry & Schiborr, Nils N. 2021. Multi-CAST Tabasaran. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2311. Bamberg: University of Bamberg. (multicast.aspra.uni-bamberg.de/#tabasaran) (date accessed)

Corpus documentation

Corpus files

    • tabasaran_belt
    • 29 MB
    • 5 MB
    • 0.8 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • tabasaran_horse
    • 84 MB
    • 15 MB
    • 2.0 MB
    • 0.4 MB
    • 0.1 MB
    • preview
    • tabasaran_naz
    • 84 MB
    • 15 MB
    • 0.5 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • tabasaran_nuradin
    • 84 MB
    • 15 MB
    • 0.7 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • tabasaran_work
    • 75 MB
    • 14 MB
    • 2.3 MB
    • 0.5 MB
    • 0.2 MB
    • preview
    • full corpus
    • 202 MB
    • 43 MB
    • 0.3 MB
    • 1.3 MB
    • 0.4 MB

Teop [teop]

Ulrike Mosel, Stefan Schnell

Teop (teop1238) is a Western Oceanic language spoken on Bougainville Island, Papua New Guinea. The texts, all traditional narratives, were recorded by Ulrike Mosel and Enoch Horai Magum over the course of a language documentation project (principal investigator: Ulrike Mosel) funded by the Volkswagen Foundation (grant no. II 77 973). Details on the project can be found online at the DOBES webpage.

A sketch grammar of Teop (Mosel & Thiesen 2007) and additional materials are also available there, and an online dictionary (A multifunctional Teop-English dictionary, Mosel 2019) can be found here. The texts were annotated for Multi-CAST by Ulrike Mosel and Stefan Schnell; referent indexing with RefIND was added in 2019 by Ulrike Mosel, Stefan Schnell, and Maria Vollmer.

The Multi-CAST Teop texts (version 2207) also constitute a part of the Teop data set in DoReCo, which has been time-aligned at the phone level.

Teop Island, Bougainville, Papua New Guinea. Photo by Ulrike Mosel.

Citation for this corpus

Mosel, Ulrike & Schnell, Stefan. 2015. Multi-CAST Teop. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2311. Bamberg: University of Bamberg. (multicast.aspra.uni-bamberg.de/#teop) (date accessed)

Corpus documentation

Corpus files

    • teop_iar
    • 148 MB
    • 13 MB
    • 2.0 MB
    • 0.4 MB
    • 0.1 MB
    • preview
    • teop_mat
    • 70 MB
    • 6 MB
    • 1.0 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • teop_sii
    • 196 MB
    • 18 MB
    • 3.1 MB
    • 0.6 MB
    • 0.2 MB
    • preview
    • teop_viv
    • 58 MB
    • 5 MB
    • 1.0 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • full corpus
    • 352 MB
    • 43 MB
    • 0.3 MB
    • 1.4 MB
    • 0.4 MB

Tondano [tondano]

Timothy Brickell

The Toulour dialect of Tondano (tond1251) is an Austronesian (Malayo-Polynesian, Philippine, Minahasa, North, Northeast) language spoken in and to the east of the town of Tondano, which is located in the Minahasa regency of North Sulawesi, Indonesia. All Minahasan languages are endangered: since the early 20th century, their speakers have been shifting to the most commonly used language of wider communication, Manado Malay (mala1481) (Wolff 2010: 299). Based on personal experience, the researcher estimates the number of fluent speakers of Tondano at around 30 000.

This corpus is the result of fieldwork undertaken by Timothy Brickell between 2011 and 2015 as part of his PhD candidature at La Trobe University, Melbourne, Australia (see Brickell 2015). The speakers recorded were of both genders, of various ages, and from a number of professions, with many older speakers already retired. The texts in Multi-CAST constitute a subset of the 20 recordings made by Brickell. In some recordings, speakers discuss a topic chosen just prior to recording; in others, they talk while engaging in traditional activities; and in still others, they narrate an elicitation video that depicts other community members carrying out traditional cultural activities.

Pemandangan, Minahasa Regency, Indonesia. Photo by Timothy Brickell, 2013.

Citation for this corpus

Brickell, Timothy. 2016. Multi-CAST Tondano. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2311. Bamberg: University of Bamberg. (multicast.aspra.uni-bamberg.de/#tondano) (date accessed)

Corpus documentation

Corpus files

    • tondano_gulamera
    • 104 MB
    • 9.4 MB
    • 0.7 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • tondano_holiday
    • 53 MB
    • 5 MB
    • 0.5 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • tondano_kiniar01
    • 88 MB
    • 8 MB
    • 0.7 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • tondano_kiniar02
    • 127 MB
    • 12 MB
    • 1.0 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • tondano_kiniar03
    • 89 MB
    • 8 MB
    • 0.6 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • tondano_mapalus
    • 69 MB
    • 6 MB
    • 0.7 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • tondano_water
    • 51 MB
    • 5 MB
    • 0.5 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • tondano_watulaney
    • 185 MB
    • 17 MB
    • 1.2 MB
    • 0.3 MB
    • 0.1 MB
    • preview
    • full corpus
    • 558 MB
    • 69 MB
    • 0.3 MB
    • 1.3 MB
    • 0.4 MB

Tulil [tulil]

Chenxi Meng

Tulil (taul1251), also known as Taulil, is a Papuan language spoken in the East New Britain Province of Papua New Guinea. As of 2000, Tulil was spoken by approximately 2 000 people spread out over four villages (Tulil 1, Tulil 2, Kadaulung, and Toma).

The six texts in this corpus comprise a subset of a larger collection of material that was recorded and transcribed during two field trips undertaken by Chenxi Meng in 2012 and 2015 for her PhD project, which has resulted in a comprehensive grammar of Tulil (Meng 2018). The entirety of the data has been deposited in PARADISEC.

The texts selected for Multi-CAST include both traditional and personal narratives. Annotations with RefIND were added by Maria Vollmer.

A plume of volcanic ash over New Britain, Papua New Guinea. Photo by Chenxi Meng, 2014.

Citation for this corpus

Meng, Chenxi. 2019. Multi-CAST Tulil. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2311. Bamberg: University of Bamberg. (multicast.aspra.uni-bamberg.de/#tulil) (date accessed)

Corpus documentation

Corpus files

    • tulil_all1
    • 54 MB
    • 5 MB
    • 0.6 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • tulil_alrm
    • 233 MB
    • 21 MB
    • 2.7 MB
    • 0.6 MB
    • 0.2 MB
    • preview
    • tulil_jkpp
    • 257 MB
    • 23 MB
    • 2.1 MB
    • 0.4 MB
    • 0.1 MB
    • preview
    • tulil_lnsl
    • 65 MB
    • 6 MB
    • 0.6 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • tulil_lrdw
    • 85 MB
    • 8 MB
    • 1.0 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • tulil_sves
    • 52 MB
    • 5 MB
    • 0.7 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • full corpus
    • 571 MB
    • 67 MB
    • 0.3 MB
    • 1.5 MB
    • 0.5 MB

Vera'a [veraa]

Stefan Schnell

Vera'a (vera1241) is an Oceanic (Austronesian) language from the village of the same name on Vanua Lava (13.80°S 167.47°E), one of the Banks Islands in North Vanuatu. The language has approximately 450 speakers and is the first language of most inhabitants of Vera'a and the coastline to the north of it. Vera'a is closely related to the neighbouring language Vurës, and speakers of Vera'a also speak Vurës.

Both languages have been extensively documented within a VolkswagenStiftung-funded DOBES documentation project (2006–2012; PI: Dr Catriona Hyslop-Malau). Vera'a was the focus of Stefan Schnell's PhD project at Kiel University (2007–2010, see Schnell 2011), and Schnell subsequently undertook additional documentary work on Vera'a from 2012 to 2015 as part of his ARC-funded DECRA project Typology of Language Use (ARC grant no. DE120102017), hosted by La Trobe University (Melbourne, Australia).

The Multi-CAST Vera'a corpus consists of 10 folkloristic narrative texts collected and annotated by Stefan Schnell. They constitute a subcorpus of a larger corpus of Vera'a compiled and curated by Stefan Schnell in close collaboration with speakers of the language and researchers of other disciplines from outside the community. Annotations with RefIND were added to the corpus in 2019 by Stefan Schnell and Maria Vollmer.

The texts in this corpus (version 2207) have been time-aligned at the phone level for a data set in the DoReCo project.

At work in Vera'a village, Vanua Lava, Vanuatu. Photo by Stefan Schnell.

Citation for this corpus

Schnell, Stefan. 2015. Multi-CAST Vera'a. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. Version 2311. Bamberg: University of Bamberg. (multicast.aspra.uni-bamberg.de/#veraa) (date accessed)

Corpus documentation

Corpus files

    • veraa_anv
    • 67 MB
    • 6 MB
    • 1.0 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • veraa_as1
    • 58 MB
    • 5 MB
    • 1.1 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • veraa_gabg
    • 96 MB
    • 8 MB
    • 1.0 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • veraa_gaqg
    • 98 MB
    • 8 MB
    • 1.2 MB
    • 0.2 MB
    • 0.1 MB
    • preview
    • veraa_hhak
    • 139 MB
    • 12 MB
    • 2.1 MB
    • 0.4 MB
    • 0.1 MB
    • preview
    • veraa_isam
    • 81 MB
    • 7 MB
    • 1.3 MB
    • 0.3 MB
    • 0.1 MB
    • preview
    • veraa_iswm
    • 239 MB
    • 20 MB
    • 3.5 MB
    • 0.7 MB
    • 0.2 MB
    • preview
    • veraa_jjq
    • 333 MB
    • 28 MB
    • 4.6 MB
    • 0.9 MB
    • 0.3 MB
    • preview
    • veraa_mvbw
    • 111 MB
    • 9 MB
    • 1.6 MB
    • 0.3 MB
    • 0.1 MB
    • preview
    • veraa_pala_a
    • 44 MB
    • 4 MB
    • 0.7 MB
    • 0.1 MB
    • 0.1 MB
    • preview
    • veraa_pala_b
    • 73 MB
    • 6 MB
    • 1.3 MB
    • 0.3 MB
    • 0.1 MB
    • preview
    • full corpus
    • 1.0 GB
    • 111 MB
    • 0.8 MB
    • 3.8 MB
    • 1.1 MB

Research

Background

Multi-CAST has been designed to facilitate empirical research into the structure of spontaneous spoken language from a cross-linguistic perspective. The overriding questions are the following:

  • Are there cross-linguistically recurrent patterns in the way discourse is organized (i.e. text-based, as opposed to grammar-based, typology)?
  • How do these statistical patterns in usage relate to the architecture of grammars?
  • How do they relate to change in grammars over time?

Our research agenda has been heavily inspired by work in the functionalist tradition, initiated by scholars such as Wallace Chafe, Talmy Givón, and Barbara Fox.

We have drawn on Multi-CAST data to follow up on many of the issues raised by the pioneers of usage-based grammar, for example the relationship between topicality and subjecthood, the notion of an ergative bias to discourse organization, the role of animacy in morphosyntax, and the mechanisms involved in the emergence of agreement morphology.

Quantitative Analysis

The small symbol inventory of the GRAID annotation scheme aims to capture cross-linguistically comparable categories. Combined with the morpheme-by-morpheme glosses and referent indexing with RefIND, it allows for highly complex queries across corpora. See the Multi-CAST research context for illustrative examples.
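To give a sense of how compact the GRAID glosses are, the following Python sketch splits a gloss into its form, animacy, and function components. The sample glosses are those of the Northern Kurdish example at the top of this page. The regular expression and the function name `parse_graid` are assumptions made for illustration: they cover only simple glosses of this shape and are not part of any Multi-CAST tooling, so the GRAID manual remains the authority on the full syntax.

```python
import re

# Assumed shape of a simple GRAID gloss: <form>[.<animacy>][:<function>].
# Real glosses can be more complex; see the GRAID manual for the full syntax.
GRAID_PATTERN = re.compile(
    r"^(?P<form>[^.:]+)(?:\.(?P<animacy>[^:]+))?(?::(?P<function>.+))?$"
)


def parse_graid(gloss):
    """Split a simple GRAID gloss into form, animacy, and function."""
    match = GRAID_PATTERN.match(gloss)
    return match.groupdict() if match else None


# GRAID glosses of the Northern Kurdish example at the top of this page.
glosses = ["##", "other", "other", "np.h:a", "np.h:p", "rn_refl.h:poss",
           "v:pred", "np:g", "##", "0.h:s", "v:pred"]

parsed = [parse_graid(g) for g in glosses]

# Collect the form symbols of all expressions glossed as human ("h").
human_forms = [p["form"] for p in parsed if p and p["animacy"] == "h"]
print(human_forms)  # ['np', 'np', 'rn_refl', '0']
```

Because the components are kept apart, a query such as "all zero forms in subject function" reduces to a filter on two dictionary keys.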

One straightforward way of working with the Multi-CAST data is via the EAF files and the linguistic annotation software ELAN, which is freely available online. ELAN allows for conditional searches with regular expressions across sets of multiple EAF files. Please refer to the ELAN user guide and manual for instructions.

A more programmatic alternative is the statistical computing language R together with the custom-built multicastR package (Schiborr 2018), which provides convenient access to the annotation values and metadata directly in R. The multicastR package is freely available from the Comprehensive R Archive Network (CRAN). The source files for a manual installation can also be found here.
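For readers working outside R, the TSV files can in principle be read with any general-purpose language. The following Python sketch is not multicastR and does not reproduce its interface; it only illustrates the kind of query that the RefIND indices make possible, here counting the mentions of each referent. The column names (`gword`, `gloss`, `graid`, `refind`) are assumptions and should be checked against the header row of the actual files. The sample rows are adapted from the Northern Kurdish example at the top of this page.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; check the header row of the actual TSV file.
HEADER = ["gword", "gloss", "graid", "refind"]

# Sample rows adapted from the Northern Kurdish example on this page.
ROWS = [
    ["paşa", "pasha", "np.h:a", "0036"],
    ["wezîr-ê", "vizier-ezafe.pl", "np.h:p", "0033"],
    ["xwe", "refl", "rn_refl.h:poss", "0036"],
    ["di-şîn-e", "ind-send.prs-3sg", "v:pred", ""],
    ["çarşu-yê", "market-obl", "np:g", "0034"],
    ["Ø", "Ø", "0.h:s", "0036"],
    ["dibê", "ind.say.prs.3sg", "v:pred", ""],
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in [HEADER] + ROWS)


def count_mentions(tsv_file):
    """Count how often each referent index occurs in a tab-separated file."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    # Rows without a referent index (e.g. predicates) are skipped.
    return Counter(row["refind"] for row in reader if row["refind"])


mentions = count_mentions(io.StringIO(SAMPLE_TSV))
print(dict(mentions))  # {'0036': 3, '0033': 1, '0034': 1}
```

To run the same count on a downloaded file, a file object from `open(path, encoding="utf-8", newline="")` can be passed in place of the in-memory sample; `path` is a placeholder here.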

Publications

Collected below are publications and presentations that make use of data from Multi-CAST. If you have employed Multi-CAST in your research and would like to see your work included in this list, please contact Geoffrey Haig and/or Stefan Schnell.

Published papers

Schnell, Stefan & Haig, Geoffrey & Schiborr, Nils N. & Vollmer, Maria. Forthcoming. Are referent introductions sensitive to forward planning in discourse? Evidence from Multi-CAST. To appear in Discourse phenomena in typological perspective, edited by Mattiola, Simone & Barotto, Alessandra. Amsterdam: John Benjamins.

Schnell, Stefan & Schiborr, Nils N. 2022. Cross-linguistic corpus studies in linguistic typology. Annual Review of Linguistics 8: 171–191. (DOI: 10.1146/annurev-linguistics-031120-104629).

Haig, Geoffrey & Schnell, Stefan & Schiborr, Nils N. 2021. Universals of reference in discourse and grammar: Evidence from the Multi-CAST collection of spoken corpora. In Haig, Geoffrey & Schnell, Stefan & Seifart, Frank (eds.), Doing corpus-based typology with spoken language corpora, 141–177. Language Documentation & Conservation special publication 25. Honolulu: University of Hawai’i Press. (hdl.handle.net/10125/74660)

Schiborr, Nils N. 2021. Lexical anaphora: A corpus-based typological study of referential choice. Unpublished Ph.D. dissertation, University of Bamberg.

Schnell, Stefan & Schiborr, Nils N. & Haig, Geoffrey. 2021. Efficiency in discourse processing: Does morphosyntax adapt to accommodate new referents? In Levshina, Natalia & Moran, Steven (eds.), Efficiency in human languages: Corpus evidence for universal principles. Linguistics Vanguard special issue 7(s3). (DOI: 10.1515/lingvan-2019-0064)

Haig, Geoffrey & Adibifar, Shirin. 2019. Referential Null Subjects (RNS) in colloquial spoken Persian: Does speaker familiarity have an impact? In Korangy, Alireza & Mahmoodi-Bakhtiari, Behrooz (eds.), Essays on the typology of Iranian languages, 102–121. Berlin: Mouton de Gruyter.

Vollmer, Maria. 2019. How radical is pro-drop in Mandarin? A quantitative corpus study on referential choice in Mandarin Chinese. Unpublished MA thesis, University of Bamberg.

Kimoto, Yukinori. 2018. Operationalizing Philippine-type syntax for the GRAID system: Clause structure, case marking, and verb class in Arta. Asian and African Languages and Linguistics 12. 17–35. (hdl.handle.net/10108/91147)

Kurabe, Keita. 2018. The GRAID-annotated Jinghpaw corpus: Annotations and initial findings. Asian and African Languages and Linguistics 12. 37–73. (hdl.handle.net/10108/91142)

Schnell, Stefan & Barth, Danielle. 2018. Discourse motivations for pronominal and zero objects across genres in Vera'a. Language Variation and Change 30(1). 51–81. (DOI: 10.1017/S0954394518000054)

Schnell, Stefan & Schiborr, Nils N. 2018. Corpus-based typological research in discourse and grammar: GRAID and Multi-CAST. Asian and African Languages and Linguistics 12. 1–16. (hdl.handle.net/10108/91145)

Shiohara, Asako. 2018. A progress report on the Sumbawa annotated spoken corpus: Tentative annotation notes. Asian and African Languages and Linguistics 12. 75–97. (hdl.handle.net/10108/91143)

Brickell, Timothy & Schnell, Stefan. 2017. Do grammatical relations reflect information status? Reassessing Preferred Argument Structure against discourse data from Tondano. Linguistic Typology 21(1). 177–209. (DOI: 10.1515/lingty-2017-0005)

Schiborr, Nils N. 2017. Antecedent distance and the accessibility hierarchy: A quantitative approach. Unpublished MA thesis, University of Bamberg.

Haig, Geoffrey & Schnell, Stefan. 2016. The discourse basis of ergativity revisited. Language 92(3). 591–618. (DOI: 10.1353/lan.2016.0049)

Haig, Geoffrey & Schnell, Stefan. 2016. The discourse basis of ergativity revisited: Online appendices. Language 92(3). 1–14. (DOI: 10.1353/lan.2016.0044)

Conference talks

Schnell, Stefan & Haig, Geoffrey & Schiborr, Nils N. 2022. Conditioned pronoun use as precursor of conditioned P indexing. Paper presented at the workshop on Differential Argument Indexing as part of the 14th Conference of the Association for Linguistic Typology (ALT2022), Austin, United States of America, 15–17 December 2022.

Schnell, Stefan & Schiborr, Nils N. & Haig, Geoffrey. 2022. The role of grammatical relations in co-reference production across diverse languages. Paper presented at the workshop on Spoken- and Signed-language Corpus Studies in Linguistic Typology as part of the 14th Conference of the Association for Linguistic Typology (ALT2022), Austin, United States of America, 15–17 December 2022.

Schnell, Stefan & Schiborr, Nils N. 2022. Referential choice, inter-sentential co-reference, and topicality effects. Paper presented at the workshop Disentangling Topicality Effects as part of the 55th Annual Meeting of the Societas Linguistica Europaea (SLE2022), Bucharest, Romania, 24–27 August 2022.

Haig, Geoffrey. 2021. Doing corpus-based syntactic typology with spoken language corpora. Workshop held as part of the LILEC Summer School 2021: Catching Language Data, Bologna, Italy, 21–25 June 2021.

Schiborr, Nils N. 2021. Corpus analysis: Deriving complex measures from simple annotations. Workshop held as part of the 17th Linguistics Conference for PhD Students 2021 (STaPs'17), Freiburg, Germany, 23–24 April 2021.

Haig, Geoffrey. 2020. Stability and adaptivity of word order in the Western Asian Transition Zone: Evidence from West Iranian. Paper presented at the Workshop on Tracing Contact in Closely Related Languages, Zürich, Switzerland, 19–20 November 2020.

Schnell, Stefan & Haig, Geoffrey & Schiborr, Nils N. & Vollmer, Maria C. 2020. Introducing new referents: A corpus-based cross-linguistic perspective. Paper presented at the Workshop Discourse Phenomena in Typological Perspective as part of the 53rd Annual Meeting of the Societas Linguistica Europaea (SLE2020), Bucharest, Romania, 26 August–1 September 2020.

Haig, Geoffrey & Schiborr, Nils N. & Schnell, Stefan. 2020. On potential statistical universals of grammar in discourse: Evidence from Multi-CAST. Paper presented at the Workshop Corpus-based typology: Spoken language from a cross-linguistic perspective as part of the 42nd Annual Conference of the German Linguistic Society (DGfS 2020), Hamburg, Germany, 4–6 March 2020.

Schiborr, Nils N. 2019. Quantitative models of referential choice: Lexical anaphora in English. Paper presented at the 8th Biennial International Conference on the Linguistics of Contemporary English (BICLCE 2019), Bamberg, Germany, 26–28 September 2019.

Schiborr, Nils N. 2019. Modelling referential choice in natural spoken discourse: Multi-CAST, GRAID, and RefIND. Paper presented at the Workshop Annotation of Non-standard Corpora (ANSC 2019), Bamberg, Germany, 16–18 September 2019.

Schiborr, Nils N. 2018. Data-driven models of referential choice: Antecedent distance and beyond. Paper presented at the Workshop Information Structure in Spoken Language Corpora 3: Discourse and Information Structure (ISSLaC3), Münster, Germany, 7–8 December 2018.

Schnell, Stefan & Schiborr, Nils N. & Haig, Geoffrey. 2018. Is intransitive subject the preferred role for introducing new referents? Evidence from corpus-based typology. Paper presented at the 51st Annual Meeting of the Societas Linguistica Europaea (SLE2018), Tallinn, Estonia, 29 August–1 September 2018.

Haig, Geoffrey & Schnell, Stefan & Schiborr, Nils N. 2017. The limits of accessibility: A corpus-based typological approach. Paper presented at the 12th Conference of the Association for Linguistic Typology (ALT2017), Canberra, Australia, 11–15 December 2017.

Haig, Geoffrey & Schiborr, Nils N. 2016. Multi-CAST (Multilingual Corpus of Annotated Spoken Texts): Ein Projekt zur Erstellung und Auswertung mehrsprachiger Korpora für die Sprachtypologie. Paper presented at the CLARIN Forum CA3, Hamburg, Germany, 7–8 June 2016.

Guidelines for contributors

The shared utility of Multi-CAST grows with increasing typological representativity of the language sample it contains. We therefore encourage scholars to contribute additional data sets to Multi-CAST, which can be incorporated into the collection as stand-alone resources, citable with their names as the authors and annotators.

If you wish to contribute data, here are some points to consider:

  • Open access corpus data. Your data should be free of copyright and other restrictions on availability or usage. Multi-CAST is committed to open science, and hence makes all of its data freely available under a Creative Commons licence (CC BY 4.0 International). All data sets are citable online resources, with your name(s) as author(s).
  • Unscripted narratives. Texts should be original narratives (i.e. not translations), and ideally not stimulus-based.
  • Monologues. Texts should be (predominantly) monologic. Coping with multi-person discourse raises additional issues of annotation and analysis, which we have chosen not to tackle in this collection.
  • Media-linked time-aligned annotations. Transcribed texts should ideally be morphologically glossed, translated into English, and accompanied by a sound file in uncompressed WAV format. Annotations should be time-aligned with the audio recordings.
  • Minimum size of 1 000 clauses. All corpora in Multi-CAST minimally contain 1 000 clause units.
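
Once a data set has been annotated with GRAID, the minimum-size criterion can be checked roughly with a few lines of Python. The sketch below counts clause units by counting clause boundary markers, on the assumption that every clause unit opens with a gloss beginning with `#` (for example `##`, as in the Northern Kurdish example at the top of this page). The function names and the text identifier are hypothetical, and the way the editors count clause units may differ in detail.

```python
MIN_CLAUSE_UNITS = 1000  # threshold stated in the guidelines above


def count_clause_units(graid_glosses):
    """Count glosses that open a clause unit (assumed to start with '#')."""
    return sum(1 for gloss in graid_glosses if gloss.startswith("#"))


def check_corpus_size(texts, minimum=MIN_CLAUSE_UNITS):
    """Return the total clause count and whether it reaches the minimum."""
    total = sum(count_clause_units(glosses) for glosses in texts.values())
    return total, total >= minimum


# Toy input: a single hypothetical text holding the GRAID glosses of the
# Northern Kurdish example; a real corpus would map each text to its glosses.
texts = {
    "example_text": ["##", "other", "other", "np.h:a", "np.h:p",
                     "rn_refl.h:poss", "v:pred", "np:g", "##", "0.h:s",
                     "v:pred"],
}

total, large_enough = check_corpus_size(texts)
print(total, large_enough)  # 2 False
```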

If you have a data set that complies with these conditions and you are interested in contributing it to Multi-CAST, please contact Geoffrey Haig and/or Stefan Schnell in order to coordinate the next steps.

In technical terms, this involves transferring your data into the EAF file format of the annotation software ELAN, for which purpose we will provide you with a Multi-CAST ELAN template, and annotating your texts with GRAID. The latter involves some quite tricky analytical decisions, and we strongly recommend that potential contributors liaise with us before undertaking this task. The actual labour input required will vary from language to language, but we will certainly assist you and be able to give you a realistic assessment of what may be necessary.

People

The Multi-CAST project is being coordinated by Geoffrey Haig and Nils Schiborr at the Department of General Linguistics (University of Bamberg), Stefan Schnell at the Department of Comparative Language Science (University of Zurich), and Maria Vollmer at the Department of General Linguistics (University of Freiburg).

In addition, the following people were involved in the collection, transcription, translation, and annotation of the various Multi-CAST corpora, or have contributed to the project in other ways:

  • Shirin Adibifar
  • George Atkins
  • Danielle Barth
  • Natalia Bogomolova
  • Timothy Brickell
  • Kira Davey
  • Diana Forker
  • Gadzhimurad Gadzhimuradov
  • Dmitry Ganenkov
  • Harris Hadjidas
  • Jenny Herzky
  • Tai Hong
  • Enoch Horai Magum
  • Abdullah Incekan
  • Ivan Kapitonov
  • Yukinori Kimoto
  • Keita Kurabe
  • Adrian Kuqi
  • Liu Ruoyu
  • Maria Matheas
  • Chenxi Meng
  • Ulrike Mosel
  • Rasul Mutalov
  • Alice Nora
  • Clever Panduro
  • Nicholas Peterson
  • Kadagoi Rawad Forepiso
  • Rudolf Raward
  • Lauren Reed
  • Sabrina Ryffel
  • Asako Shiohara
  • Lena Sell
  • Frank Seifart
  • Nick Thieberger
  • Hanna Thiele
  • Eva van Lier
  • Eline Visser
  • Maria Vollmer
  • Makson Vores

Acknowledgements

The collection and annotation of the data in Multi-CAST have received generous support from the following institutions and organizations:

  • 2017–2021
    the German Research Foundation (DFG) via the project Does morphosyntactic alignment shape discourse? — principal investigators: Geoffrey Haig and Stefan Schnell (DFG project no. 323627599);
  • 2018–2020
    the Australian Research Council (ARC) and the Centre of Excellence for the Dynamics of Language (CoEDL) as part of CoEDL's corpus development project, headed by Nick Thieberger at The University of Melbourne, for annotation work in collaboration with the aforementioned DFG project;
  • 2012–2019
    the VolkswagenStiftung as part of the Documentation of endangered languages (DOBES) project for the documentation of Shiri and Sanzhi — PI: Diana Forker;
  • 2012–2015
    the Australian Research Council (ARC) as part of the DECRA project Typology of language use, hosted by La Trobe University, Melbourne — PI: Stefan Schnell (ARC grant no. DE120102017);
  • 2006–2012
    the VolkswagenStiftung as part of the DOBES project for the documentation of Vera'a and Vurës — Stefan Schnell (PI: Catriona Malau, grants no. II/81 898 and II/84 316);
  • 2000–2007
    the VolkswagenStiftung as part of the DOBES project for the documentation of Teop — PI: Ulrike Mosel (grant no. II/77 973).

The Department of General Linguistics at the University of Bamberg contributed departmental funding and research infrastructure to the Multi-CAST project, and the ARC Centre of Excellence for the Dynamics of Language provided additional support.

The following texts in the collection are made available in cooperation with these researchers and institutions:

The editors of and contributors to Multi-CAST would also like to thank our respective research communities for their support and stimulating criticism.

Licensing

In the spirit of open science, the Multi-CAST collection, including the recordings, transcriptions, annotations, and all supplementary materials, is published under the Creative Commons Attribution 4.0 International licence (CC BY 4.0).

The text of the licence can be found online here.

The CC BY licence allows full access to Multi-CAST for any purpose related to research, art, journalism, or any other endeavour, on the sole condition that proper credit is given to the editors of the collection and its contributors. This credit must also include a link to this website (multicast.aspra.uni-bamberg.de) and a brief note about the licensing terms.

References

Adelaar, Alexander. 2005. Malayo-Sumbawan. Oceanic Linguistics 44(2), 357–388.

Ariel, Mira. 1988. Referring and accessibility. Journal of Linguistics 24(1), 67–87.

Ariel, Mira. 1990. Accessing noun-phrase antecedents. London: Routledge.

Ariel, Mira. 2004. Accessibility marking: Discourse functions, discourse profiles, and processing cues. Discourse Processes 37(2), 91–116.

Bickel, Balthasar. 2003. Referential density in discourse and syntactic typology. Language 79(4), 708–736.

Brickell, Timothy. 2015. A grammar of Tondano. Ph.D. dissertation, La Trobe University, Melbourne, Australia.

Campbell, Lyle & Lee, Nala H. & Okura, Eve & Simpson, Sean & Ueki, Kaori. 2017. The catalogue of endangered languages (ElCat). (endangeredlanguages.com/)

Chafe, Wallace. 1980. The deployment of consciousness in the production of a narrative. In Chafe, Wallace (ed.), The Pear Stories: Cognitive, cultural, and linguistic aspects of narrative production, 9–50. Norwood, NJ: Ablex.

Du Bois, John. 1987. The discourse basis of ergativity. Language 63(4), 805–855.

Du Bois, John. 2003. Argument structure: Grammar in use. In Du Bois, John & Kumpf, Lorraine & Ashby, William J. (eds.), Preferred argument structure: Grammar as architecture for function, 11–60. Amsterdam: John Benjamins.

Du Bois, John. 2017. Ergativity in discourse and grammar. In Coon, Jessica & Massam, Diane & Travis, Lisa D. (eds.), The Oxford handbook of ergativity, 23–57. Oxford: Oxford University Press.

English Dialects Research Group. 2005. Freiburg English Dialect Corpus (FRED). (fred.ub.uni-freiburg.de/)

Forker, Diana. 2020. A grammar of Sanzhi Dargwa. Berlin: Language Science Press.

Giangoullis, Konstantinos G. 2009. Kypriaka paradosiaka paramytha: Ek stomatos Elenis Mich, Satsia, Apo to Geri-Pyroi (1887–1982) [A traditional Cypriot storyteller: From the mouth of Elenis Mich, Satsia, from Geri-Pyroi (1887–1982)]. Leukosia: Theopress Publications.

Haig, Geoffrey. 2018. Northern Kurdish (Kurmanjî). In Haig, Geoffrey & Khan, Geoffrey (eds.), The languages and linguistics of Western Asia: An areal perspective, 106–158. Berlin: Mouton de Gruyter.

Haig, Geoffrey & Schnell, Stefan. 2014. Annotations using GRAID (Grammatical Relations and Animacy in Discourse): Introduction and guidelines for annotators. Version 7.0. (multicast.aspra.uni-bamberg.de/)

Hammarström, Harald & Forkel, Robert & Haspelmath, Martin (eds.). 2019. Glottolog 4.0. Jena: Max Planck Institute for the Science of Human History. (glottolog.org)

Kimoto, Yukinori. 2017. A grammar of Arta: A Philippine Negrito language. Ph.D. dissertation, Kyoto University, Kyoto, Japan.

Kurabe, Keita. 2016. A grammar of Jinghpaw, from Northern Burma. Ph.D. dissertation, Kyoto University, Kyoto, Japan.

Mbete, Aron Meko. 1990. Rekonstruksi protobahasa Bali-Sasak-Sumbawa [A reconstruction of Proto-Bali-Sasak-Sumbawa]. Jakarta: University of Indonesia.

Meng, Chenxi. 2018. A grammar of Tulil. Ph.D. dissertation, La Trobe University, Melbourne, Australia.

Mosel, Ulrike. 2019. A multifunctional Teop-English dictionary. Dictionaria 4(1-6488). (dictionaria.clld.org/contributions/teop)

Mosel, Ulrike & Thiesen, Yvonne. 2007. The Teop sketch grammar. Unpublished manuscript, University of Kiel. (hdl.handle.net/1839/00-0000-0000-0008-24F6-3)

Noonan, Michael. 2003. A crosslinguistic investigation of referential density. Unpublished manuscript, University of Wisconsin-Milwaukee. (crossasia-repository.ub.uni-heidelberg.de/190/)

Riester, Arndt & Baumann, Stefan. 2017. The RefLex scheme — Annotation guidelines. SinSpeC: Working papers of the SFB 732 14. (DOI: 10.18419/opus-9011)

Schiborr, Nils N. 2018. multicastR: A companion to the Multi-CAST collection. R package version 2.0.0. In Haig, Geoffrey & Schnell, Stefan (eds.), Multi-CAST: Multilingual corpus of annotated spoken texts. (cran.r-project.org/package=multicastR)

Schiborr, Nils N. & Schnell, Stefan & Thiele, Hanna. 2018. RefIND — Referent Indexing in Natural-language Discourse: Annotation guidelines. Version 1.1. (multicast.aspra.uni-bamberg.de/)

Schnell, Stefan. 2011. A grammar of Vera'a. Ph.D. dissertation, Kiel University, Germany.

Thieberger, Nick. 2004. Documentation in practice: Developing a linked media corpus of South Efate. In Austin, Peter (ed.), Language documentation and description, 169–178. London: Hans Rausing Endangered Languages Project, SOAS.

Thieberger, Nick. 2006. A grammar of South Efate: An Oceanic language of Vanuatu. Honolulu: University of Hawaii Press. (hdl.handle.net/11343/31242)

Visser, Eline. 2020. A grammar of Kalamang: The Papuan language of the Karas Islands. Ph.D. dissertation, Lund University, Lund, Sweden.

Wolff, John. 2010. Proto-Austronesian phonology. Ithaca, NY: Cornell Southeast Asia Program Publications.

Contact

For inquiries regarding Multi-CAST, please contact Geoffrey Haig or Stefan Schnell. Please direct questions concerning the multicastR package and this website to Nils Schiborr.

The Multi-CAST collection as well as this page are hosted on the servers of the computing centre of the University of Bamberg. Relevant legal information can be found here.