
The HamBam corpus contains annotated recordings of contemporary spoken Persian, compiled as part of a cooperation between Bu-Ali Sina University in Hamedan, Iran (team coordinator: Mohammad Rasekh-Mahand), and the University of Bamberg in Germany (Geoffrey Haig).

HamBam is a component of the project "Post-predicate elements in Iranian and neighbouring languages: Inheritance, contact, and information structure".

The corpus was primarily designed for investigations into the impact of information structure on word order variation in spoken Persian, but is freely adaptable to other research questions, including research on prosody, referential density, or usage-based approaches to grammar.

Development of the corpus is funded by a Linkage Grant (institute partnership) of the Alexander von Humboldt Foundation (2019–2021), awarded to Haig and Rasekh-Mahand.

Work on HamBam is ongoing, and new texts are continuously added as annotation and editing are completed.

How to cite HamBam

Haig, Geoffrey & Rasekh-Mahand, Mohammad. 2020. HamBam: The Hamedan-Bamberg Corpus of Contemporary Spoken Persian. (multicast.aspra.uni-bamberg.de/resources/hambam/) (date accessed)

Background — The language and the texts

Persian (Glottocode: west2369), also known as Farsi, is a Southwest Iranian language and the official language of Iran; closely related varieties are also spoken in Afghanistan and Tajikistan.

The texts gathered in this corpus are predominantly monological and represent colloquial spoken Persian, the neutral lingua franca used throughout Iran by educated Iranians. The speakers are of both genders and of various ages, educational levels, and occupations. The recordings include radio interviews on a variety of topics, as well as less formal oral history recounted in a domestic setting among family members.

Colloquial Persian differs in lexicon, morphology, and syntax from formal written Persian, but has received very little systematic attention to date. Existing corpus-based research on Persian, with the exception of Frommer (1981) and Haig and Adibifar (2019), draws on corpora containing overwhelmingly written, rather than spoken, data (e.g. Roberts 2009: 351–352; Faghiri et al. 2018).

Photo: A lake in Hamedan province, Iran, with the Zagros mountains visible in the distance.

Corpus design — Annotations, file formats, and licensing

The transcriptions, translations, and annotations of the texts in the HamBam corpus are time-aligned with the sound files. All data are freely available from this website. Annotation and linking to the sound files are carried out in the EAF file format, using the free linguistic annotation software ELAN.
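Because EAF is an XML-based format, the time alignment can be read with standard tools. The following Python sketch is illustrative only: it parses a small hand-written EAF fragment and returns the start time, end time, and text of each time-aligned annotation on one tier. The tier name "utterance", the function name read_aligned_annotations, and the sample values are assumptions made for the example and are not taken from the HamBam files; the tier names actually used in the corpus should be checked in the downloaded EAF files.

```python
import xml.etree.ElementTree as ET

# Hand-written EAF fragment for illustration only. The tier name
# "utterance" and the annotation values are hypothetical, not HamBam data.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first segment</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
          TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second segment</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_aligned_annotations(eaf_text, tier_id):
    """Return (start_ms, end_ms, text) tuples for one tier of an EAF document."""
    root = ET.fromstring(eaf_text)

    # Map each time slot ID to its value in milliseconds.
    # Slots without a TIME_VALUE attribute are stored as None.
    slots = {}
    for slot in root.iter("TIME_SLOT"):
        value = slot.get("TIME_VALUE")
        slots[slot.get("TIME_SLOT_ID")] = int(value) if value is not None else None

    segments = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        # Only annotations that refer directly to time slots are collected here.
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            segments.append((start, end, text))
    return segments


for start, end, text in read_aligned_annotations(SAMPLE_EAF, "utterance"):
    print(f"{start / 1000:.1f}s to {end / 1000:.1f}s: {text}")
```

To apply the same function to a downloaded file, read the file contents as text and pass them in place of SAMPLE_EAF. Annotations that depend on another annotation rather than on time slots (REF_ANNOTATION elements in EAF) are not handled by this sketch.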

The architecture of the corpus largely follows the design implemented in Multi-CAST (Haig & Schnell 2021), with word-to-word alignment of morphological glosses and annotations.

HamBam utilizes GRAID-S, a simplified form of the GRAID annotation scheme (Haig & Schnell 2014) that does not systematically indicate anaphoric zeroes.

All data in the HamBam corpus, including supplementary materials, are published under the Creative Commons Attribution 4.0 International licence (CC BY 4.0). The text of the licence can be found online here.

Corpus data — Documentation and downloads

Click on the buttons in the table below to download the selected file, or the buttons in the bottom row to download bundles of all corpus data. Clicking 'preview' will bring up a web-based rendering of the EAF files with audio playback.

Please note that the corpus is still actively being annotated.
Missing texts are marked with "—/—" in the lists below.

Documentation

Recordings and annotation files

    • ac_f_positivism1: WAV 109 MB; MP3 9 MB; annotation files (EAF, XML, TSV): —/— (TBA)
    • ac_f_social: WAV 155 MB; MP3 28 MB; annotation files (EAF, XML, TSV): —/— (TBA)
    • ac_m_positivism2: WAV 55 MB; MP3 5 MB; annotation files (EAF, XML, TSV): —/— (TBA)
    • oh_f_class: WAV 28 MB; MP3 3 MB; annotation files (EAF, XML, TSV): —/— (TBA)
    • oh_f_daryush: WAV 12 MB; MP3 2 MB; annotation files (EAF, XML, TSV): —/— (TBA)
    • oh_f_ruya_accident: WAV 34 MB; MP3 6 MB; annotation files (EAF, XML, TSV): —/— (TBA)
    • oh_f_ruya_aunt: WAV 40 MB; MP3 4 MB; annotation files (EAF, XML, TSV): —/— (TBA)
    • oh_f_ruya_childhood: WAV 15 MB; MP3 3 MB; EAF 0.4 MB; XML 0.1 MB; TSV 0.1 MB; preview
    • oh_f_ruya_marry: WAV 45 MB; MP3 4 MB; annotation files (EAF, XML, TSV): —/— (TBA)
    • oh_f_ruya_uncle: WAV 30 MB; MP3 3 MB; annotation files (EAF, XML, TSV): —/— (TBA)
    • oh_m_siyavash_music: WAV 60 MB; MP3 5 MB; annotation files (EAF, XML, TSV): —/— (TBA)
    • full corpus (ZIP bundles): WAV 390 MB; MP3 69 MB; three further bundles: —/— (TBA)

Teams — Annotators and contributors

The HamBam corpus is being jointly developed by teams at Bu-Ali Sina University in Hamedan in Iran and at the University of Bamberg in Germany under the supervision of Mohammad Rasekh-Mahand and Geoffrey Haig.

    • Geoffrey Haig
    • Mohammad Rasekh-Mahand
    • Elham Izadi
    • Fariba Sabouri
    • Maryam Pouyankhah
    • Laurentia Schreiber
    • Iran Abdi
    • Mehdi Parizadeh
    • Mehrdad Meshkinfam

References

Faghiri, Pegah & Samvelian, Pollet & Hemforth, Barbara. 2018. Is there a canonical order in Persian ditransitive constructions? In Korn, Agnes & Malchukov, Andrey (eds.), Ditransitive constructions in a cross-linguistic perspective, 165–186. Wiesbaden: Reichert.

Frommer, Paul. 1981. Post-verbal phenomena in colloquial Persian syntax. PhD dissertation, University of Southern California.

Haig, Geoffrey & Adibifar, Shirin. 2019. Referential Null Subjects (RNS) in colloquial spoken Persian: Does speaker familiarity have an effect? In Korangy, Alireza & Mahmoodi-Bakhtiari, Behrooz (eds.), Essays on the typology of Iranian languages, 102–121. Berlin: Mouton de Gruyter.

Haig, Geoffrey & Schnell, Stefan. 2014. Annotations using GRAID (Grammatical Relations and Animacy in Discourse): Introduction and guidelines for annotators. Version 7.0. (multicast.aspra.uni-bamberg.de/#annotations)

Haig, Geoffrey & Schnell, Stefan. 2021. Multi-CAST: Multilingual Corpus of Annotated Spoken Texts. (multicast.aspra.uni-bamberg.de/)

Roberts, John. 2009. A study of Persian discourse structure. Uppsala: Acta Universitatis Upsaliensis.

Gallery

    • A view of Bu-Ali Sina University in the depth of winter.
    • The Tomb of Bu-Ali Sina (Avicenna) in Hamedan.
    • A still lake reflecting snow-capped peaks in Hamedan province.
    • Hamedan station obscured by a snow storm.
    • Bu-Ali Sina University in early spring.
    • A view of Bamberg Cathedral and the Domberg.

Contact

For inquiries, please contact Geoffrey Haig. Please direct questions concerning this website to Nils Schiborr.

The resources presented here as well as this page are hosted on the servers of the computing centre of the University of Bamberg. Relevant legal information can be found here.