This website serves as a hub linking the various corpus-building projects initiated at and maintained by the Department
of General Linguistics at the University of Bamberg in cooperation with a number of other institutions and researchers.
Please direct inquiries regarding any of the projects listed here to Geoffrey Haig.
— This website is a work in progress! —
Please report issues to Nils Schiborr.
Multi-CAST is an online collection of annotated spoken language corpora from a steadily expanding range of typologically diverse languages.
It features standardized annotations across multiple levels, targeting morphosyntactic structure and reference. Multi-CAST has been designed as a tool for quantitative, corpus-based typology.
It is based on open-source software resources, and all data are fully accessible under a Creative Commons (CC-BY 4.0) licence.
The HamBam corpus is a collection of annotated recordings of contemporary spoken Persian, jointly compiled at Hamedan University in Iran and the University of Bamberg in Germany.
It was designed for investigations into the impact of information structure on word order variation in spoken Persian, but is freely adaptable to other research questions, including research on prosody, referential density, or usage-based approaches to grammar.
The WOWA corpus aims to provide an accessible and transparent source of data for corpus-based approaches to word order typology, focussing on the languages spoken in the region designated here as Western Asia.
The focus on Western Asia is motivated by an overarching research interest in the areal diffusion of word order regularities; specifically, we investigate the respective impacts of inheritance and of neighbouring languages, related or not, in shaping word order in usage. More generally, this is connected to the issue of integrating variation into typology.
Listed here are a number of smaller corpora compiled from spoken and written material from several varieties of Kurdish, some of which developed out of language documentation projects and doctoral dissertations.
Most of these corpora are freely available online under a Creative Commons (CC-BY 4.0) licence. For those that are not, please refer to the usage terms listed for each corpus in its respective section.