
This website serves as a hub linking the various corpus-building projects initiated at and maintained by the Department of General Linguistics at the University of Bamberg in cooperation with a number of other institutions and researchers.
Please direct inquiries regarding any of the projects listed here to Geoffrey Haig.

— This website is a work in progress! —
Please report issues to Nils Schiborr.

Multi-CAST — The Multilingual Corpus of Annotated Spoken Texts

edited by Geoffrey Haig, Stefan Schnell

Multi-CAST is an online collection of annotated spoken language corpora from a steadily expanding range of typologically diverse languages.

It features standardized annotations across multiple levels, targeting morphosyntactic structure and reference. Multi-CAST has been designed as a tool for quantitative, corpus-based typology.

It is based on open-source software resources, and all data are fully accessible under a Creative Commons (CC-BY 4.0) licence.

  • natural narrative texts from 16 languages, encompassing roughly 26 500 clause units (c. 175 000 words), with an additional 10 corpora in preparation
  • multiple levels of parallel annotation for morphosyntax and referent tracking (including zero anaphora) using unified annotation schemes (GRAID, RefIND)
  • a companion R package for quantitative cross-corpus analysis (Schiborr 2018)

HamBam — The Hamedan-Bamberg Corpus of Contemporary Spoken Persian

Geoffrey Haig, Mohammad Rasekh-Mahand

The HamBam corpus is a collection of annotated recordings of contemporary spoken Persian, jointly compiled at Hamedan University in Iran and the University of Bamberg in Germany.

It was designed for investigations into the impact of information structure on word order variation in spoken Persian, but is freely adaptable to other research questions, including research on prosody, referential density, or usage-based approaches to grammar.

  • time-aligned transcriptions, translations, and annotations using the free and open-source software ELAN
  • architecture and design following the example of Multi-CAST, utilizing a simplified form of the GRAID annotation system
  • all data are freely available online under a Creative Commons (CC-BY 4.0) licence

WOWA — The Word Order in Western Asia Corpus  (WIP)

edited by Geoffrey Haig, Donald Stilo, Mahîr Can Doğan, Nils Norman Schiborr

The WOWA corpus aims to provide an accessible and transparent source of data for corpus-based approaches to word order typology, focussing on the languages spoken in the region designated here as Western Asia.

The focus on Western Asia is motivated by an overarching research interest in the areal diffusion of word order regularities; specifically, we investigate the respective impacts of inheritance and of neighbouring languages, related or not, in shaping word order in usage. More generally, this is connected to the issue of integrating variation into typology.

  • transcribed spoken texts from dozens of languages/doculects spanning eight language families
  • annotations targeting referential nominal expressions in non-subject positions, each coded for its position relative to the governing predicate and other salient features
  • all data are freely available online under a Creative Commons (CC-BY 4.0) licence

Corpora of spoken and written varieties of Kurdish

Listed here are a number of smaller corpora that were compiled from spoken and written material from several varieties of Kurdish, a number of which developed out of language documentation projects and doctoral dissertations.

Most of these corpora are freely available online under a Creative Commons (CC-BY 4.0) licence. For those that are not, please refer to the usage terms listed for each corpus in its respective section.


For inquiries, please contact Geoffrey Haig or Stefan Schnell. Please direct questions concerning this website to Nils Schiborr.

The resources presented here as well as this page are hosted on the servers of the computing centre of the University of Bamberg. Relevant legal information can be found here.