This website serves as a hub linking the various corpus-building projects initiated at and maintained by the Department
of General Linguistics at the University of Bamberg in cooperation with a number of other institutions and researchers.
Please direct inquiries regarding any of the projects listed here to Geoffrey Haig.

— This website is a work in progress! —
Please report issues to Nils Schiborr.

Multi-CAST — The Multilingual Corpus of Annotated Spoken Texts

edited by Geoffrey Haig, Stefan Schnell

Multi-CAST is an online collection of annotated spoken language corpora from a steadily expanding range of typologically diverse languages.

It features standardized annotations across multiple levels, targeting morphosyntactic structure and reference. Multi-CAST has been designed as a tool for quantitative, corpus-based typology.

It is based on open-source software resources, and all data are fully accessible under a Creative Commons (CC-BY 4.0) licence.

» to Multi-CAST

natural narrative texts from 18 languages, encompassing roughly 29 000 clause units (c. 140 000 words), with a number of additional corpora in preparation
multiple levels of parallel annotation for morphosyntax and referent tracking (including zero anaphora) using unified annotation schemes (GRAID, RefIND)
a companion R package for quantitative cross-corpus analysis (Schiborr 2018)

HamBam — The Hamedan-Bamberg Corpus of Contemporary Spoken Persian

Geoffrey Haig, Mohammad Rasekh-Mahand

The HamBam corpus is a collection of annotated recordings of contemporary spoken Persian, jointly compiled at Hamedan University in Iran and the University of Bamberg in Germany.

It was designed for investigations into the impact of information structure on word order variation in spoken Persian, but is freely adaptable to other research questions, including research on prosody, referential density, or usage-based approaches to grammar.

» to HamBam

time-aligned transcriptions, translations, and annotations created with the free, open-source software ELAN
architecture and design following the example of Multi-CAST, utilizing a simplified form of the GRAID annotation system
all data are freely available online under a Creative Commons (CC-BY 4.0) licence

WOWA — The Word Order in Western Asia Corpus (WIP)

edited by Geoffrey Haig, Donald Stilo, Mahîr Can Doğan, Nils Norman Schiborr

The WOWA corpus aims to provide an accessible and transparent source of data for corpus-based approaches to word order typology, focussing on the languages spoken in the region designated here as Western Asia.

The focus on Western Asia is motivated by an overarching research interest in the areal diffusion of word order regularities; specifically, we investigate the respective impacts of inheritance and of neighbouring languages, related or not, in shaping word order in usage. More generally, this is connected to the issue of integrating variation into typology.

» to the WOWA Corpus

transcribed spoken texts from dozens of languages/doculects spanning eight language families
annotations targeting referential nominal expressions in non-subject positions, each coded for its position relative to the governing predicate and for other salient features
all data are freely available online under a Creative Commons (CC-BY 4.0) licence

Corpora of spoken and written varieties of Kurdish

The Laki variety of Harsin (Sara Belelli)
The Corpus of Contemporary Written Kurdish (Abdullah Incekan, Geoffrey Haig)
The Corpus of Contemporary Kurdish Newspaper Texts (Geoffrey Haig)

Listed here are several smaller corpora compiled from spoken and written material in several varieties of Kurdish. A number of these corpora developed out of language documentation projects and doctoral dissertations.

Most of these corpora are freely available online under a Creative Commons (CC-BY 4.0) licence. For those that are not, please refer to the usage terms listed for each corpus in its respective section.

» to the corpora

Contact

For inquiries, please contact Geoffrey Haig or Stefan Schnell. Please direct questions concerning this website to Nils Schiborr.

The resources presented here as well as this page are hosted on the servers of the computing centre of the University of Bamberg. Relevant legal information can be found here.