The HamBam corpus contains annotated recordings of contemporary spoken Persian, compiled as part of a cooperation between Bu-Ali Sina University in Hamedan, Iran (team coordinator: Mohammad Rasekh-Mahand), and the University of Bamberg in Germany (Geoffrey Haig).
HamBam is a component of the project "Post-predicate elements in Iranian and neighbouring languages: Inheritance, contact, and information structure".
The corpus was primarily designed for investigations into the impact of information structure on word order variation in spoken Persian, but is freely adaptable to other research questions, including research on prosody, referential density, or usage-based approaches to grammar.
Development of the corpus is funded by a Linkage Grant (institute partnership) of the Alexander von Humboldt Foundation (2019–2021), awarded to Haig and Rasekh-Mahand.
Work on HamBam is ongoing, and new texts are continuously added as annotation and editing are completed. The latest update was published on 1 September 2022 (v2), adding four new texts.
Persian (Glottocode: west2369), also known as Farsi, is a Southwest Iranian language and the official language of Iran; closely related varieties are also spoken in Afghanistan and Tajikistan.
The texts gathered in this corpus are predominantly monological and represent colloquial spoken Persian, the neutral lingua franca used throughout Iran by educated Iranians. The speakers are of both genders and of various ages, educational levels, and occupations. The recordings include radio interviews on a variety of topics, as well as less formal oral history recounted in a domestic setting among family members.
Colloquial Persian differs in lexicon, morphology, and syntax from formal written Persian, but has received very little systematic attention to date. Existing corpus-based research on Persian, with the exception of Frommer (1981) and Haig and Adibifar (2019), draws on corpora containing overwhelmingly written, rather than spoken, data (e.g. Roberts 2009: 351–352; Faghiri et al. 2018).
The transcriptions, translations, and annotations of the texts in the HamBam corpus are time-aligned with sound files. All data are freely available from this website. Annotation and linking to the sound files are carried out in the EAF file format, using the free linguistic annotation software ELAN.
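Because EAF files are plain XML, the time-aligned annotations can also be read without ELAN. The following is a minimal sketch, not an official HamBam tool: it uses only the Python standard library and relies on the general EAF layout (TIME_SLOT elements holding millisecond offsets, TIER elements containing ALIGNABLE_ANNOTATION elements). The function name `read_aligned_annotations` is an illustrative assumption, and the tier names found in the HamBam files are not specified here.

```python
import xml.etree.ElementTree as ET


def read_aligned_annotations(source):
    """Return {tier_id: [(start_ms, end_ms, text), ...]} for an EAF file.

    `source` is a path to an .eaf file or an open file object.
    """
    root = ET.parse(source).getroot()

    # TIME_ORDER maps time slot IDs to millisecond offsets in the sound file.
    # Slots without a TIME_VALUE are unaligned and are stored as None.
    slots = {}
    for slot in root.iter("TIME_SLOT"):
        value = slot.get("TIME_VALUE")
        slots[slot.get("TIME_SLOT_ID")] = int(value) if value is not None else None

    tiers = {}
    for tier in root.iter("TIER"):
        rows = []
        # Only ALIGNABLE_ANNOTATION elements reference time slots directly.
        # REF_ANNOTATION elements depend on a parent annotation and are
        # skipped in this sketch, so such tiers yield an empty list.
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((start, end, text))
        tiers[tier.get("TIER_ID")] = rows
    return tiers
```

Calling the function on a downloaded file (for example `read_aligned_annotations("some_text.eaf")`, where the file name is a placeholder) returns one list of `(start, end, text)` tuples per tier, which can then be filtered or exported for other research questions.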
All data in the HamBam corpus, including supplementary materials, are published under the Creative Commons Attribution 4.0 International licence (CC BY 4.0). The text of the licence can be found online here.
Click on the buttons in the table below to download the selected file, or the buttons in the bottom row to download bundles of all corpus data. Clicking 'preview' will bring up a web-based rendering of the EAF files with audio playback.
Listed here is the latest version of the corpus files (v2, published 1 September 2022). Older versions of the files can be found in the archive.
The HamBam corpus is being jointly developed by teams at Bu-Ali Sina University in Hamedan, Iran, and at the University of Bamberg, Germany, under the supervision of Mohammad Rasekh-Mahand and Geoffrey Haig.
Faghiri, Pegah & Samvelian, Pollet & Hemforth, Barbara. 2018. Is there a canonical order in Persian ditransitive constructions? In Korn, Agnes & Malchukov, Andrey (eds.), Ditransitive constructions in a cross-linguistic perspective, 165–186. Wiesbaden: Reichert.
Frommer, Paul. 1981. Post-verbal phenomena in colloquial Persian syntax. PhD dissertation, University of Southern California.
Haig, Geoffrey & Adibifar, Shirin. 2019. Referential Null Subjects (RNS) in colloquial spoken Persian: Does speaker familiarity have an effect? In Korangy, Alireza & Mahmoodi-Bakhtiari, Behrooz (eds.), Essays on the typology of Iranian languages, 102–121. Berlin: Mouton de Gruyter.
Haig, Geoffrey & Schnell, Stefan. 2014. Annotations using GRAID (Grammatical Relations and Animacy in Discourse): Introduction and guidelines for annotators. Version 7.0. (multicast.aspra.uni-bamberg.de/#annotations)
Haig, Geoffrey & Schnell, Stefan. 2022. Multi-CAST: Multilingual Corpus of Annotated Spoken Texts. (multicast.aspra.uni-bamberg.de/)
Roberts, John. 2009. A study of Persian discourse structure. Uppsala: Acta Universitatis Upsaliensis.