The WOWA corpus grew out of the project Post-predicate elements in Iranian and neighbouring languages: Inheritance, contact, and information structure. It contains data that were collected and annotated by the researchers involved in that project, as well as others contributed by associated researchers.

The principal aim of WOWA is to provide an accessible and transparent source of data for corpus-based approaches to word order typology, focussing on the languages spoken in the region designated here as Western Asia.

The data sets are being made available successively, with 24 online as of September 2021.

Research Background

The focus on Western Asia is motivated by an overarching research interest in the areal diffusion of word order regularities; specifically, we investigate the respective impact of inheritance (the genetic affiliation of the languages concerned, e.g. Turkic, Semitic, etc.) and of contact with neighbouring languages, related or not, in shaping word order in usage. In addition, we address the issue of which aspects of word order are stable within a particular doculect, and which display corpus-internal variability.

More generally, this is connected to the issue of integrating variation into typology. Finally, WOWA is the only cross-linguistic database of its type that includes exclusively spoken language, and thus provides an important corrective to much ongoing work in corpus-based typology, which is still largely based on written language.

Corpus design

Each dataset in WOWA is based on a corpus of transcribed spoken language, usually compiled in a fieldwork setting. The sources are extremely varied; some are taken from published dialect surveys such as those undertaken by the Turkish Language Society (Türk Dil Kurumu), or published work by experts on particular language groups (e.g. Khan 2008, on the Neo-Aramaic (Christian) dialect of Barwar, northern Iraq). Others were gathered in the course of PhD projects and other initiatives in language documentation.

All data in the WOWA corpus, including supplementary materials, are published under the Creative Commons Attribution 4.0 International licence (CC BY 4.0). The text of the licence can be found online here.

The texts in WOWA contain at least 500 analysable tokens; the current mean is 650 tokens. They are digitized, if not already in digital form, segmented into syntactic segments of up to three clauses (the size of segmented units varies and is immaterial for the analysis), and imported into a spreadsheet template.

The tokens to be analysed are referential nominal expressions in non-subject positions (i.e. subjects are not included). They are coded for a range of features, including animacy, weight, role, and flagging. The dependent variable is position relative to the governing predicate, for which two values are available: (A) before the governing predicate, or (B) after the governing predicate. The details are outlined in the coding guidelines. Once fully coded, the spreadsheets are exported as TSV files, which can then be imported into R for statistical analysis.
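As a rough illustration of how the coded TSV files might be explored, the following minimal sketch tabulates the share of post-predicate tokens per role. The project itself names R as the analysis environment; Python is used here only for illustration. The column names (`role`, `animacy`, `position`), the role labels, and the literal values `A` and `B` are assumptions made for this example and may not match the headers or codes in the actual WOWA files; the coding guidelines are the authoritative description.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample standing in for an exported TSV file. The column names
# and value labels are assumptions for illustration, not WOWA's real headers.
SAMPLE_TSV = (
    "role\tanimacy\tposition\n"
    "goal\tinanimate\tB\n"
    "direct_object\tinanimate\tA\n"
    "goal\tinanimate\tB\n"
    "direct_object\tanimate\tA\n"
    "goal\tanimate\tA\n"
)


def post_predicate_rates(tsv_file, group_col="role", pos_col="position"):
    """Return {group: share of tokens coded 'B' (after the predicate)}."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        counts[row[group_col]][row[pos_col]] += 1
    return {
        group: c["B"] / sum(c.values())
        for group, c in counts.items()
    }


rates = post_predicate_rates(io.StringIO(SAMPLE_TSV))
for role, rate in sorted(rates.items()):
    print(f"{role}: {rate:.2f} post-predicate")
```

For a downloaded file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, after adjusting the column names to those found in the file.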

For each data set, we minimally make available (i) metadata on the doculect and source texts, (ii) the complete coded data, in XLS and TSV formats, and, where available, (iii) the original sources including sound files.

The doculects

— Please note that a number of data sets are still in the process of being compiled. —

Missing components are marked with "—/—" in the lists below; they will be added in the near future.

Turkic

Oghuz  (Ankara)

Kateryna Iefremenko download citation

    • source texts
    • 0.3 MB
    • coded values
    • 0.3 MB
    • metadata
    • 0.1 MB
    • all files
    • 0.7 MB
    • archive
    • 275 MB
    • (updated 21/08/24)
    • 0.2 MB

Oghuz  (Erzurum)

Mahîr Can Doğan download citation

    • source texts
    • 1.1 MB
    • coded values
    • 0.3 MB
    • metadata
    • 0.1 MB
    • all files
    • 1.2 MB
    • archive
    • (updated 21/09/09)
    • 0.2 MB

Oghuz  (Gagauz)

Mahîr Can Doğan download citation

    • source texts
    • —/— 
    • coded values
    • 0.3 MB
    • metadata
    • —/— 
    • all files
    • —/— 
    • archive
    • (updated 21/09/22)
    • 0.2 MB

Oghuz  (Qashqai)

Laurentia Schreiber download citation

    • source texts
    • 2.8 MB
    • coded values
    • 0.3 MB
    • metadata
    • 0.1 MB
    • all files
    • 3.0 MB
    • archive
    • —/— 
    • (updated 21/09/16)
    • 0.1 MB

Oghuz  (Tabriz)

Donald Stilo download citation

    • source texts
    • —/— 
    • coded values
    • 0.3 MB
    • metadata
    • —/— 
    • all files
    • —/— 
    • archive
    • —/— 
    • (updated 21/09/05)
    • 0.2 MB

Iranian

Balochi  (Coastal)

Maryam Nourzaei download citation

    • source texts
    • 1.7 MB
    • coded values
    • 0.5 MB
    • metadata
    • 0.1 MB
    • all files
    • 1.9 MB
    • archive
    • 247 MB
    • (updated 21/09/15)
    • 0.2 MB

Balochi  (Koroshi)

Maryam Nourzaei download citation

    • source texts
    • 0.5 MB
    • coded values
    • 0.2 MB
    • metadata
    • 0.1 MB
    • all files
    • 0.7 MB
    • archive
    • —/— 
    • (updated 21/08/24)
    • 0.1 MB

Balochi  (Turkmen)

Geoffrey Haig download citation

    • source texts
    • 1.9 MB
    • coded values
    • 0.3 MB
    • metadata
    • 0.1 MB
    • all files
    • 2.0 MB
    • archive
    • —/— 
    • (updated 21/09/08)
    • 0.2 MB

Gorani  (Gawraǰū)

Masoud Mohammadirad download citation

    • source texts
    • 2.8 MB
    • coded values
    • 0.6 MB
    • metadata
    • 0.1 MB
    • all files
    • 3.1 MB
    • archive
    • —/— 
    • (updated 21/08/31)
    • 0.3 MB

Kumzari  (Musandam)

Geoffrey Haig download citation

    • source texts
    • 0.5 MB
    • coded values
    • 0.4 MB
    • metadata
    • 0.1 MB
    • all files
    • 0.7 MB
    • archive
    • (updated 21/08/24)
    • 0.2 MB

Kurdish  (Central, Sanandaj)

Masoud Mohammadirad download citation

    • source texts
    • —/— 
    • coded values
    • —/— 
    • metadata
    • 0.1 MB
    • all files
    • —/— 
    • archive
    • —/— 
    • (updated 21/09/05)
    • —/— 

Kurdish  (Northern, Ankara)

Kateryna Iefremenko download citation

    • source texts
    • 0.6 MB
    • coded values
    • 0.2 MB
    • metadata
    • 0.1 MB
    • all files
    • 0.7 MB
    • archive
    • 294 MB
    • (updated 21/08/24)
    • 0.1 MB

Kurdish  (Northern, Lachin)

Donald Stilo download citation

    • source texts
    • 18 MB
    • coded values
    • 0.3 MB
    • metadata
    • 0.1 MB
    • all files
    • 19 MB
    • archive
    • (updated 21/08/30)
    • 0.2 MB

Kurdish  (Northern, Muş)

Geoffrey Haig download citation

    • source texts
    • 0.4 MB
    • coded values
    • 0.3 MB
    • metadata
    • 0.1 MB
    • all files
    • 0.5 MB
    • archive
    • 142 MB
    • (updated 21/08/24)
    • 0.1 MB

Kurdish  (Southern, Bijar)

Masoud Mohammadirad download citation

    • source texts
    • 0.3 MB
    • coded values
    • 0.5 MB
    • metadata
    • 0.1 MB
    • all files
    • 0.5 MB
    • archive
    • 541 MB
    • (updated 21/10/25)
    • 0.1 MB

Mazandarani  (Kordxeyl)

Donald Stilo, Geoffrey Haig download citation

    • source texts
    • 18 MB
    • coded values
    • 0.3 MB
    • metadata
    • 0.1 MB
    • all files
    • 18 MB
    • archive
    • (updated 21/08/26)
    • 0.1 MB

Talyshi  (Lerik)

Donald Stilo download citation

    • source texts
    • 2.5 MB
    • coded values
    • 0.3 MB
    • metadata
    • 0.1 MB
    • all files
    • 2.6 MB
    • archive
    • (updated 21/08/26)
    • 0.1 MB

Vafsi  (Gurchani)

Mahîr Can Doğan download citation

    • source texts
    • —/— 
    • coded values
    • 0.4 MB
    • metadata
    • 0.1 MB
    • all files
    • —/— 
    • archive
    • —/— 
    • (updated 21/09/06)
    • 0.3 MB

Zazakî  (Çewlîg)

Netîce Demir, Mahîr Can Doğan download citation

    • source texts
    • 0.1 MB
    • coded values
    • 0.2 MB
    • metadata
    • 0.1 MB
    • all files
    • 0.2 MB
    • archive
    • 189 MB
    • (updated 21/09/02)
    • 0.1 MB

Zazakî  (Siwêreg)

Netîce Demir, Mahîr Can Doğan download citation

    • source texts
    • 0.1 MB
    • coded values
    • 0.1 MB
    • metadata
    • 0.1 MB
    • all files
    • 0.2 MB
    • archive
    • 125 MB
    • (updated 21/09/02)
    • 0.1 MB

Kartvelian

Laz  (Arhavi)

Donald Stilo, René Lacroix download citation

    • source texts
    • 0.6 MB
    • coded values
    • 0.2 MB
    • metadata
    • 0.1 MB
    • all files
    • 0.7 MB
    • archive
    • (updated 21/08/31)
    • 0.1 MB

Semitic

Arabic  (Khuzestan)

Bettina Leitner download citation

    • source texts
    • —/— 
    • coded values
    • 0.4 MB
    • metadata
    • 0.1 MB
    • all files
    • —/— 
    • archive
    • 639 MB
    • (updated 21/09/15)
    • 0.3 MB

NE Neo-Aramaic  (Christian, Barwar)

Donald Stilo download citation

    • source texts
    • 2.5 MB
    • coded values
    • 0.4 MB
    • metadata
    • —/— 
    • all files
    • —/— 
    • archive
    • (updated 21/08/24)
    • 0.2 MB

NE Neo-Aramaic  (Jewish, Dohok)

Dorota Molin download citation

    • source texts
    • —/— 
    • coded values
    • 0.2 MB
    • metadata
    • 0.1 MB
    • all files
    • —/— 
    • archive
    • —/— 
    • (updated 21/08/31)
    • 0.1 MB

Hellenic

Pontic Greek  (Romeyka)

Laurentia Schreiber download citation

    • coded values
    • 0.2 MB
    • metadata
    • 0.1 MB
    • all files
    • 0.1 MB
    • archive
    • (updated 21/08/24)
    • 0.1 MB

Indo-Aryan

Kholosi  (Kholos)

Maryam Nourzaei download citation

    • source texts
    • —/— 
    • coded values
    • —/— 
    • metadata
    • —/— 
    • all files
    • —/— 
    • archive
    • 234 MB
    • (updated 21/09/05)
    • —/— 

Publications

Published papers

(TBA)

Conference talks

(NEW!) Haig, Geoffrey. 2021. Doing corpus-based syntactic typology with spoken language corpora. Workshop held as part of the LILEC Summer School 2021: Catching Language Data, Bologna, Italy, 23–24 April 2021.

Haig, Geoffrey. 2020. Stability and adaptivity of word order in the Western Asian Transition Zone: Evidence from West Iranian. Paper presented at the Workshop on Tracing Contact in Closely Related Languages, Zürich, Switzerland, 19–20 November 2020.

References

Faghiri, Pegah & Samvelian, Pollet & Hemforth, Barbara. 2018. Is there a canonical order in Persian ditransitive constructions? In Korn, Agnes & Malchukov, Andrey (eds.), Ditransitive constructions in a cross-linguistic perspective, 165–186. Wiesbaden: Reichert.

Frommer, Paul. 1981. Post-verbal phenomena in colloquial Persian syntax. PhD dissertation, University of Southern California.

Haig, Geoffrey & Adibifar, Shirin. 2019. Referential Null Subjects (RNS) in colloquial spoken Persian: Does speaker familiarity have an effect? In Korangy, Alireza & Mahmoodi-Bakhtiari, Behrooz (eds.), Essays on the typology of Iranian languages, 102–121. Berlin: Mouton de Gruyter.

Khan, Geoffrey. 2008. The Neo-Aramaic dialect of Barwar. Leiden: Brill.

Roberts, John. 2009. A study of Persian discourse structure. Uppsala: Acta Universitatis Upsaliensis.

Contact

For inquiries, please contact Geoffrey Haig. Please direct questions concerning this website to Nils Schiborr.

The resources presented here as well as this page are hosted on the servers of the computing centre of the University of Bamberg. Relevant legal information can be found here.