The HamBam corpus contains annotated recordings of contemporary spoken Persian, compiled as part of a cooperation between Bu-Ali Sina University in Hamedan, Iran (team coordinator: Mohammad Rasekh-Mahand), and the University of Bamberg in Germany (Geoffrey Haig).

HamBam is a component of the project 'Post-predicate elements in Iranian and neighbouring languages: Inheritance, contact, and information structure'.

The corpus was primarily designed for investigations into the impact of information structure on word order variation in spoken Persian, but is freely adaptable to other research questions, including research on prosody, referential density, or usage-based approaches to grammar.

Development of the corpus is funded by a Linkage Grant (institute partnership) of the Alexander von Humboldt Foundation (2019–2021), awarded to Haig and Rasekh-Mahand.

Work on HamBam is ongoing, and new texts are continuously added as annotation and editing are completed. The latest update was published on 1 September 2022 (v2), adding four new texts.

How to cite HamBam

Haig, Geoffrey & Rasekh-Mahand, Mohammad. 2022. HamBam: The Hamedan-Bamberg Corpus of Contemporary Spoken Persian.
Bamberg: University of Bamberg. (multicast.aspra.uni-bamberg.de/resources/hambam/) (date accessed)

Background — The language and the texts

Persian (Glottocode: west2369), also known as Farsi, is a Southwest Iranian language and the official language of Iran; closely related varieties are also spoken in Afghanistan and Tajikistan.

The texts gathered in this corpus are predominantly monological and represent colloquial spoken Persian, the neutral lingua franca used throughout Iran by educated Iranians. The speakers are of both genders and of various ages, educational levels, and occupations. The recordings include radio interviews on a variety of topics, as well as less formal oral history recounted in a domestic setting among family members.

Colloquial Persian differs in lexicon, morphology, and syntax from formal written Persian, but has received very little systematic attention to date. Existing corpus-based research on Persian, with the exception of Frommer (1981) and Haig and Adibifar (2019), draws on corpora containing overwhelmingly written, rather than spoken, data (e.g. Roberts 2009: 351–352; Faghiri et al. 2018).

Corpus design — Annotations, file formats, and licensing

The transcriptions, translations, and annotations of the texts in the HamBam corpus are time-aligned with sound files. All data are freely available from this website. Annotation and linking to sound files are carried out in the EAF file format, using the free linguistic annotation software ELAN.
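
As a rough illustration of what time alignment in an EAF file looks like, the sketch below reads time-aligned annotations from a small EAF-style XML document using only the Python standard library. The element and attribute names (TIME_SLOT, TIER, ALIGNABLE_ANNOTATION, ANNOTATION_VALUE) follow the general EAF format; the tier name `utterance`, the sample content, and the function name `read_aligned_annotations` are assumptions made for this example and do not reflect the actual tier layout of the HamBam files.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature EAF document; real HamBam files contain more tiers
# and metadata, and their tier names may differ.
SAMPLE_EAF = """
<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="utterance" LINGUISTIC_TYPE_REF="default">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first segment</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
                            TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second segment</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def read_aligned_annotations(root, tier_id):
    """Return (start_ms, end_ms, text) tuples for one tier of an EAF tree."""
    # Map each time slot ID to its offset in milliseconds.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    rows = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((start, end, text))
    return rows


root = ET.fromstring(SAMPLE_EAF)
segments = read_aligned_annotations(root, "utterance")
for start, end, text in segments:
    print(f"{start:>6} ms to {end:>6} ms: {text}")
```

To apply the same function to a downloaded file, `ET.parse(path).getroot()` can take the place of `ET.fromstring(SAMPLE_EAF)`, with `tier_id` set to a tier name found in that file.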

The architecture of the corpus largely follows the design implemented in Multi-CAST (Haig & Schnell 2022), with word-to-word alignment of morphological glosses and annotations.

HamBam utilizes GRAID-L, a simplified form of the GRAID annotation scheme (Haig & Schnell 2014) that does not systematically indicate anaphoric zeroes.
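
For readers who want to process such glosses programmatically, the following sketch splits a gloss into its parts, assuming the general `form.animacy:function` shape described in the GRAID guidelines (for example `np.h:s`, with `0` as the form symbol for a zero). The sample glosses, the `Gloss` type, and the `parse_gloss` function are hypothetical and made up for illustration; the actual GRAID-L inventory may differ in detail.

```python
import re
from typing import NamedTuple, Optional


class Gloss(NamedTuple):
    form: str                 # e.g. "np", "pro", "0"
    animacy: Optional[str]    # e.g. "h", "1", "2"; None if absent
    function: Optional[str]   # e.g. "s", "a", "p", "pred"; None if absent


# Assumed gloss shape: <form>[.<animacy>][:<function>]
GLOSS_PATTERN = re.compile(
    r"^(?P<form>[^.:]+)(?:\.(?P<animacy>[^:]+))?(?::(?P<function>.+))?$"
)


def parse_gloss(gloss: str) -> Gloss:
    """Split a GRAID-style gloss into form, animacy, and function."""
    match = GLOSS_PATTERN.match(gloss)
    if match is None:
        raise ValueError(f"not a recognisable gloss: {gloss!r}")
    return Gloss(match["form"], match["animacy"], match["function"])


# Illustrative glosses only; these are not taken from the corpus.
sample = ["np.h:s", "v:pred", "pro.1:a", "np:p", "0.h:s"]
parsed = [parse_gloss(g) for g in sample]

# Count the zero forms in the sample.
zero_count = sum(1 for g in parsed if g.form == "0")
print(parsed[0])
print("zero forms:", zero_count)
```

Because GRAID-L does not systematically indicate anaphoric zeroes, a count like `zero_count` computed on HamBam data would not be a reliable measure of zero anaphora.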

All data in the HamBam corpus, including supplementary materials, are published under the Creative Commons Attribution 4.0 International licence (CC BY 4.0). The text of the licence can be found online here.

Corpus data — Documentation and downloads

Click on the buttons in the table below to download the selected file, or the buttons in the bottom row to download bundles of all corpus data. Clicking 'preview' will bring up a web-based rendering of the EAF files with audio playback.

Listed here is the latest version of the corpus files (v2, published 1 September 2022). Older versions of the files can be found in the archive.

Documentation

- corpus description (!): 192 KB, v2.0, 1 September 2022
- metadata: 4 KB, v2.0, 1 September 2022
- GRAID-L manual: (TBA)

Recordings and annotation files

- ac_f_social: 159 MB, 28 MB, 2.6 MB, 0.2 MB, 0.1 MB (preview)
- ac_m_corona1: 42 MB, 4 MB, 0.5 MB, 0.1 MB, 0.1 MB (preview)
- ac_m_corona2: 59 MB, 5 MB, 0.6 MB, 0.1 MB, 0.1 MB (preview)
- ac_m_depression: 37 MB, 3 MB, 0.5 MB, 0.1 MB, 0.1 MB (preview)
- acd_f_science: 112 MB, 9 MB, 1.4 MB, 0.3 MB, 0.1 MB (preview)
- acd_m_education: 56 MB, 5 MB, 0.6 MB, 0.1 MB, 0.1 MB (preview)
- acd_m_plane: 41 MB, 4 MB, 0.6 MB, 0.1 MB, 0.1 MB (preview)
- acd_m_ufo: 35 MB, 3 MB, 0.5 MB, 0.1 MB, 0.1 MB (preview)
- oh_f_accident: 35 MB, 6 MB, 1.0 MB, 0.2 MB, 0.1 MB (preview)
- oh_f_amirali: 11 MB, 1 MB, 0.2 MB, 0.1 MB, 0.1 MB (preview)
- oh_f_aunt: 42 MB, 4 MB, 0.6 MB, 0.1 MB, 0.1 MB (preview)
- oh_f_childhood1: 15 MB, 3 MB, 0.4 MB, 0.1 MB, 0.1 MB (preview)
- oh_f_childhood2: 24 MB, 2 MB, 0.4 MB, 0.1 MB, 0.1 MB (preview)
- oh_f_class: 28 MB, 3 MB, 0.4 MB, 0.1 MB, 0.1 MB (preview)
- oh_f_daryush: 12 MB, 2 MB, 0.4 MB, 0.1 MB, 0.1 MB (preview)
- oh_f_istanbul1: 29 MB, 2 MB, 0.3 MB, 0.1 MB, 0.1 MB (preview)
- oh_f_istanbul2: 35 MB, 3 MB, 0.5 MB, 0.1 MB, 0.1 MB (preview)
- oh_f_marry1: 46 MB, 4 MB, 0.5 MB, 0.1 MB, 0.1 MB (preview)
- oh_f_marry2: 11 MB, 2 MB, 0.2 MB, 0.1 MB, 0.1 MB (preview)
- oh_f_mask: 14 MB, 1 MB, 0.2 MB, 0.1 MB, 0.1 MB (preview)
- oh_f_nosejob: 69 MB, 6 MB, 1.0 MB, 0.2 MB, 0.1 MB (preview)
- oh_f_parham: 33 MB, 3 MB, 0.5 MB, 0.1 MB, 0.1 MB (preview)
- oh_f_pool: 15 MB, 1 MB, 0.2 MB, 0.1 MB, 0.1 MB (preview)
- oh_f_taxi1: 40 MB, 3 MB, 0.6 MB, 0.1 MB, 0.1 MB (preview)
- oh_f_uncle1: 31 MB, 3 MB, 0.3 MB, 0.1 MB, 0.1 MB (preview)
- oh_f_uncle2: 31 MB, 3 MB, 0.6 MB, 0.1 MB, 0.1 MB (preview)
- oh_f_university1: 31 MB, 3 MB, 0.4 MB, 0.1 MB, 0.1 MB (preview)
- oh_m_childhood3: 22 MB, 2 MB, 0.2 MB, 0.1 MB, 0.1 MB (preview)
- oh_m_corona3: 10 MB, 1 MB, 0.1 MB, 0.1 MB, 0.1 MB (preview)
- oh_m_football: 26 MB, 2 MB, 0.2 MB, 0.1 MB, 0.1 MB (preview)
- oh_m_military1: 64 MB, 6 MB, 0.8 MB, 0.2 MB, 0.1 MB (preview)
- oh_m_military2: 19 MB, 2 MB, 0.3 MB, 0.1 MB, 0.1 MB (preview)
- oh_m_music: 61 MB, 5 MB, 0.8 MB, 0.2 MB, 0.1 MB (preview)
- oh_m_taxi2: 32 MB, 3 MB, 0.4 MB, 0.1 MB, 0.1 MB (preview)
- oh_m_television: 25 MB, 2 MB, 0.3 MB, 0.1 MB, 0.1 MB (preview)
- oh_m_university2: 33 MB, 3 MB, 0.3 MB, 0.1 MB, 0.1 MB (preview)
- oh_m_university3: 35 MB, 3 MB, 0.4 MB, 0.1 MB, 0.1 MB (preview)
- oh_m_usa: 21 MB, 2 MB, 0.3 MB, 0.1 MB, 0.1 MB (preview)
- full corpus: 904 MB, 143 MB, 1.1 MB, 3.9 MB, 1.0 MB
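
The text identifiers listed above share a three-part shape separated by underscores, which makes them easy to group when working with the downloaded files. The sketch below splits an identifier into those parts and counts texts per prefix. The page does not spell out what the parts mean: reading the middle part (`f`/`m`) as speaker gender is an assumption, the prefixes (`ac`, `acd`, `oh`) are left uninterpreted, and the names `TextId` and `parse_text_id` are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class TextId(NamedTuple):
    group: str    # leading prefix, e.g. "ac", "acd", "oh" (meaning not stated)
    speaker: str  # "f" or "m"; assumed here to encode speaker gender
    topic: str    # remaining label, e.g. "taxi1"


def parse_text_id(name: str) -> TextId:
    """Split an identifier such as 'oh_f_taxi1' into its three parts."""
    group, speaker, topic = name.split("_", 2)
    return TextId(group, speaker, topic)


# A small subset of the identifiers listed above.
names = ["ac_f_social", "acd_m_ufo", "oh_f_taxi1", "oh_m_taxi2"]
ids = [parse_text_id(n) for n in names]

by_group = Counter(t.group for t in ids)
by_speaker = Counter(t.speaker for t in ids)
print(by_group)
print(by_speaker)
```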

Teams — Annotators and contributors

The HamBam corpus is being jointly developed by teams at Bu-Ali Sina University in Hamedan in Iran and at the University of Bamberg in Germany under the supervision of Mohammad Rasekh-Mahand and Geoffrey Haig.

Bamberg
- Geoffrey Haig
- Nils Schiborr
- Laurentia Schreiber

Hamedan
- Mohammad Rasekh-Mahand
- Elham Izadi
- Fariba Sabouri
- Maryam Pouyankhah
- Iran Abdi
- Mehdi Parizadeh
- Mehrdad Meshkinfam

References

Faghiri, Pegah & Samvelian, Pollet & Hemforth, Barbara. 2018. Is there a canonical order in Persian ditransitive constructions? In Korn, Agnes & Malchukov, Andrey (eds.), Ditransitive constructions in a cross-linguistic perspective, 165–186. Wiesbaden: Reichert.

Frommer, Paul. 1981. Post-verbal phenomena in colloquial Persian syntax. PhD dissertation, University of Southern California.

Haig, Geoffrey & Adibifar, Shirin. 2019. Referential Null Subjects (RNS) in colloquial spoken Persian: Does speaker familiarity have an effect? In Korangy, Alireza & Mahmoodi-Bakhtiari, Behrooz (eds.), Essays on the typology of Iranian languages, 102–121. Berlin: Mouton de Gruyter.

Haig, Geoffrey & Schnell, Stefan. 2014. Annotations using GRAID (Grammatical Relations and Animacy in Discourse): Introduction and guidelines for annotators. Version 7.0. (multicast.aspra.uni-bamberg.de/#annotations)

Haig, Geoffrey & Schnell, Stefan. 2022. Multi-CAST: Multilingual Corpus of Annotated Spoken Texts. (multicast.aspra.uni-bamberg.de/)

Roberts, John. 2009. A study of Persian discourse structure. Uppsala: Acta Universitatis Upsaliensis.

Contact

For inquiries, please contact Geoffrey Haig. Please direct questions concerning this website to Nils Schiborr.

The resources presented here as well as this page are hosted on the servers of the computing centre of the University of Bamberg. Relevant legal information can be found here.