
The WOWA corpus grew out of the project "Post-predicate elements in Iranian and neighbouring languages: Inheritance, contact, and information structure". It contains data that were collected and annotated by the researchers involved in that project, as well as data contributed by associated researchers.

The principal aim of WOWA is to provide an accessible and transparent source of data for corpus-based approaches to word order typology, focussing on the languages spoken in the region designated here as Western Asia.

The data sets are being made available in stages; 26 were online as of July 2022.

Research background

The focus on Western Asia is motivated by an overarching research interest in the areal diffusion of word order regularities. Specifically, we investigate the respective impact of inheritance (the genetic affiliation of the languages concerned, e.g. Turkic or Semitic) and of contact with neighbouring languages, related or not, in shaping word order in usage. In addition, we address the issue of which aspects of word order are stable within a particular doculect, and which display corpus-internal variability.

More generally, this work is connected to the issue of integrating variation into typology. Finally, WOWA is the only cross-linguistic database of its type that includes exclusively spoken language, and thus provides an important corrective to much ongoing work in corpus-based typology, which is still largely based on written language.

Corpus design

Each data set in WOWA is based on a corpus of transcribed spoken language, usually compiled in a fieldwork setting. The sources are extremely varied: some are taken from published dialect surveys, such as those undertaken by the Turkish Language Society (Türk Dil Kurumu), or from published work by experts on particular language groups (e.g. Khan 2008 on the Neo-Aramaic (Christian) dialect of Barwar, northern Iraq). Others were gathered in the course of PhD projects and other initiatives in language documentation.

All data in the WOWA corpus, including supplementary materials, are published under the Creative Commons Attribution 4.0 International licence (CC BY 4.0). The text of the licence can be found online here.

The texts in WOWA contain at least 500 analysable tokens; the current mean is 650 tokens. They are digitized, if not already in digital form, segmented into syntactic segments of up to three clauses (the size of segmented units varies and is immaterial for the analysis), and imported into a spreadsheet template.

The tokens to be analysed are referential nominal expressions in non-subject positions (i.e. subjects are not included). They are coded for a range of features, including animacy, weight, role, and flagging. The dependent variable is position relative to the governing predicate, for which two values are available: (A) before the governing predicate, or (B) after the governing predicate. The details are outlined in the coding guidelines. Once fully coded, the spreadsheets are exported as TSV files, which can then be imported into R for statistical analysis.
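As a rough illustration of what can be done with an exported TSV file, the following Python sketch tabulates the share of post-predicate tokens for each value of a coded feature. It is not part of the WOWA workflow, which uses R; the column names (`role`, `animacy`, `position`), the value labels (`before`, `after`) and the sample rows are hypothetical stand-ins, and the actual schema is the one defined in the coding guidelines.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample standing in for a real WOWA export. Column names and
# value labels are illustrative assumptions, not the corpus's actual schema.
SAMPLE_TSV = (
    "token\trole\tanimacy\tposition\n"
    "1\tgoal\tinanimate\tafter\n"
    "2\tdirect_object\tinanimate\tbefore\n"
    "3\tgoal\tinanimate\tafter\n"
    "4\tdirect_object\thuman\tbefore\n"
    "5\tgoal\thuman\tbefore\n"
)


def post_predicate_rates(tsv_text, feature="role"):
    """Return {feature value: share of tokens placed after the predicate}."""
    counts = defaultdict(Counter)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Tally the position label separately for each value of the feature.
        counts[row[feature]][row["position"]] += 1
    return {
        value: tally["after"] / sum(tally.values())
        for value, tally in counts.items()
    }


rates = post_predicate_rates(SAMPLE_TSV)
for role, rate in sorted(rates.items()):
    print(f"{role}: {rate:.2f} post-predicate")
```

Passing a different `feature` argument (for example `"animacy"`) gives the same breakdown for another coded feature. To use a real file, replace `io.StringIO(tsv_text)` with an opened file handle.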

For each data set, we minimally make available (i) metadata on the doculect and source texts, (ii) the complete coded data, in XLS and TSV formats, and, where available, (iii) the original sources including sound files.

The doculects

— Please note that a number of data sets are still in the process of being compiled. —

Missing components are marked with "—/—" in the lists below; they will be added in the near future.








Published papers


Conference talks

Haig, Geoffrey. 2021. Doing corpus-based syntactic typology with spoken language corpora. Workshop held as part of the LILEC Summer School 2021: Catching Language Data, Bologna, Italy, 23–24 April 2021.

Haig, Geoffrey. 2020. Stability and adaptivity of word order in the Western Asian Transition Zone: Evidence from West Iranian. Paper presented at the Workshop on Tracing Contact in Closely Related Languages, Zürich, Switzerland, 19–20 November 2020.


References

Faghiri, Pegah & Samvelian, Pollet & Hemforth, Barbara. 2018. Is there a canonical order in Persian ditransitive constructions? In Korn, Agnes & Malchukov, Andrey (eds.), Ditransitive constructions in a cross-linguistic perspective, 165–186. Wiesbaden: Reichert.

Frommer, Paul. 1981. Post-verbal phenomena in colloquial Persian syntax. PhD dissertation, University of Southern California.

Haig, Geoffrey & Adibifar, Shirin. 2019. Referential Null Subjects (RNS) in colloquial spoken Persian: Does speaker familiarity have an effect? In Korangy, Alireza & Mahmoodi-Bakhtiari, Behrooz (eds.), Essays on the typology of Iranian languages, 102–121. Berlin: Mouton de Gruyter.

Khan, Geoffrey. 2008. The Neo-Aramaic dialect of Barwar. Leiden: Brill.

Roberts, John. 2009. A study of Persian discourse structure. Uppsala: Acta Universitatis Upsaliensis.


For inquiries, please contact Geoffrey Haig. Please direct questions concerning this website to Nils Schiborr.

The resources presented here as well as this page are hosted on the servers of the computing centre of the University of Bamberg. Relevant legal information can be found here.