CoSIH: The Corpus of Spoken Israeli Hebrew | Transcription and analysis

The Corpus of Spoken Israeli Hebrew (CoSIH)


Transcribing spoken language

CoSIH's texts are presented in sound and in transcriptions in standard Hebrew orthography. Some recordings are also supplied with phonetic transcriptions.3 Transcribing recordings made in ordinary, everyday circumstances is not an easy task: it requires great skill, extensive time, and considerable expense. Phonetic transcription takes far longer and demands colossal funding. For phonological study, as well as morpho-phonological or morphological study, speech must be phonetically transcribed in either narrow or broad transcription, according to the goals of the research. Phonetic variants (allophones), phonological and morpho-phonological regularity, clitics and affixes – these can only be identified and analyzed using phonetic transcription. Nevertheless, transcription in standard orthography has its own advantages: arbitrariness of the visual sign and detachment from the sequence of sounds. Those who use transcriptions in standard orthography and are aware of the differences between spoken and written language will not be led astray by inaccuracies of phonetic transcription, which – narrow as it may be – cannot fully record the articulated sequence (cf. Izre'el 2004). Standard orthography transcription serves well those who study linguistic units above the word level, viz. syntax, pragmatics or information structure. Transcription in standard orthography further enables the study of vocabulary and idioms, although homographs – especially those that result from unvocalized orthography – will not be recognized as such in transcription. Even so, anyone who studies spoken language can rely neither upon standard orthography nor upon phonetic transcription alone, no matter how narrow the latter may be, and must always listen attentively to the recordings.

Transcribing CoSIH

Substantial funding by Tel Aviv University enabled us to transcribe some of the recordings from the preparatory phase. Some of these transcripts were published in יזרעאל תשס"ב(א). Later on, texts from both the preparatory phase and the pilot study were transcribed at various opportunities, mainly for seminar papers and M.A. theses (כהן תשס"ד; זילבר-ורוד תשס"ה; גונן תשס"ט) as well as doctoral dissertations (Dekel 2010; Silber-Varod 2011).4 Samples from the pilot study were selected for transcription so as to represent a wide array of texts, especially as regards demographic variation. These samples were further chosen based on their quality and their relative length in a given context (cf. Izre’el & Rahav 2004). A research grant awarded to Esther Borochovsky Bar-Aba by the Israel Science Foundation for her study of concise utterances in spoken Hebrew enabled us to transcribe additional samples as well as to adapt all transcripts for analysis using ELAN.5 This software presents transcripts aligned with the original recordings, making it possible to listen to a recording while reading its transcript; it also facilitates searching and offers additional analytic tools. Instructions on using ELAN and links to ELAN's website are available on the CoSIH website.
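
To illustrate what alignment between transcript and recording means in practice, here is a minimal Python sketch that reads time-aligned annotations from an ELAN annotation file (.eaf, an XML format). It is not part of CoSIH's own tooling. The element and attribute names (TIME_SLOT, TIER, ALIGNABLE_ANNOTATION, ANNOTATION_VALUE) follow our understanding of the EAF format and should be checked against ELAN's documentation; the tier name "speaker1" and the annotation texts are hypothetical placeholders, not corpus data.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# A tiny hand-written EAF-like document. The tier name and the
# annotation texts are hypothetical placeholders, not CoSIH data.
SAMPLE_EAF = """
<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1250"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="2400"/>
  </TIME_ORDER>
  <TIER TIER_ID="speaker1" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first group |</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
          TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second group ||</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def read_aligned_annotations(eaf_xml: str) -> List[Tuple[str, int, int, str]]:
    """Return (tier_id, start_ms, end_ms, text) rows sorted by start time."""
    root = ET.fromstring(eaf_xml)
    # Map each time-slot id to its offset (in milliseconds) in the recording.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }
    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            if start is None or end is None:
                continue  # skip annotations without resolvable time slots
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((tier.get("TIER_ID"), start, end, text))
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for tier_id, start, end, text in read_aligned_annotations(SAMPLE_EAF):
        print(f"{tier_id} [{start}-{end} ms] {text}")
```

For a real file, the same function could be fed the file's contents read as text; tiers that are not time-aligned are ignored by this sketch.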

Prosodic Groups

CoSIH's transcriptions are segmented into Prosodic Groups ("Intonation Units"). This segmentation is primarily based on perception and enhanced by acoustic analysis using Praat. In addition, both standard orthography and phonetic transcriptions are provided with indications of boundary tones – major (terminal), minor (continuing) and appeal (cf. Izre'el 2005) as follows:

  • Two vertical bars (||) mark a boundary tone perceived as terminal;
  • A single vertical bar (|) marks a boundary tone perceived as continuing;
  • A slash (/) marks a boundary tone perceived as appeal.
This system follows the approach taken by John Du Bois and his colleagues in the Santa Barbara Corpus of Spoken American English (Du Bois et al. 1992; יזרעאל תשס"ב; יזרעאל תש"ע).6
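
The boundary markers listed above lend themselves to simple automatic segmentation. The following Python sketch, which is not part of CoSIH's own tooling, splits a transcript string into Prosodic Groups and labels each with its boundary tone. The function name, the label "unmarked" for a final group without a marker, and the transliterated example sentence are all assumptions made for illustration; the example is invented and not taken from the corpus. The sketch also assumes that "|" and "/" occur in the text only as boundary markers.

```python
import re
from typing import List, Tuple

# Boundary markers as described above. In the pattern, "||" must be
# tried before "|" so that a terminal boundary is not read as two
# continuing ones.
BOUNDARY_LABELS = {"||": "terminal", "|": "continuing", "/": "appeal"}
_BOUNDARY_RE = re.compile(r"\s*(\|\||\||/)\s*")


def split_prosodic_groups(text: str) -> List[Tuple[str, str]]:
    """Return (group_text, boundary_label) pairs for one transcript string."""
    # re.split with one capture group yields
    # [text, marker, text, marker, ..., tail].
    parts = _BOUNDARY_RE.split(text.strip())
    groups = []
    for i in range(0, len(parts) - 1, 2):
        chunk = parts[i].strip()
        if chunk:
            groups.append((chunk, BOUNDARY_LABELS[parts[i + 1]]))
    tail = parts[-1].strip()
    if tail:
        # Material after the last marker has no boundary tone indicated.
        groups.append((tail, "unmarked"))
    return groups


if __name__ == "__main__":
    # Invented example, for illustration only.
    for group, tone in split_prosodic_groups("ken | ani ba maxar || ata ba /"):
        print(f"{tone:>10}: {group}")
```

The additional sigla (truncation marks, unidentified material, non-verbal sounds) are left untouched inside each group by this sketch.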

Tagged transcriptions from CoSIH

A sample of preliminary transcriptions (i.e. excluding marking, prosodic or otherwise) from CoSIH recordings forms the basis – alongside transcriptions of other recordings provided by Esther Borochovsky Bar-Aba – of a tagged corpus of 92,000 tokens compiled by Dalia Bojan in the Technion's MILA Center. A group of transcriptions from an earlier version of CoSIH was tagged and published by Justin Parry as part of the National Middle East Language Resource Center (NMELRC) project.

3 Three chunks from the preparatory phase were transcribed in narrow phonetic transcription by Yael Maimon and revised by Werner Arnold. The other phonetic transcriptions were done by Il-Il Yatziv-Malibert (as part of CORPAFROAS: A Corpus for Afro-Asiatic Languages), Elissa Gutterman and Noam Faust. We thank them all, especially Elissa and Noam, who volunteered out of an earnest desire to help and promote the project.

4 Thanks are due to Irit Yatziv, Shelley Bachar, Illan Gonen and Smadar Cohen, who transcribed recordings from the preparatory phase (cf. יזרעאל תשס"ב(א), n. 1). Some transcriptions of the preparatory phase recordings were published in M.A. theses by סמדר כהן (תשס"ד) and אילן גונן (תשס"ט) and served as the basis for the transcriptions currently presented herein. Thanks are also due to all those who participated in graduate seminars and transcribed numerous other texts. Thank you all for your work done with care, for your earnest desire and great enthusiasm.

5 Thanks are due to Tali Okman, who handled this project perfectly, to all the research assistants who took part in this endeavor, and to Esther Borochovsky Bar-Aba, who recognized the importance of switching to ELAN and initiated the enhancement of this corpus through her blessed cooperation.

6 Additional sigla:
- truncated word
-- truncated prosodic group
@ unidentified syllable
@..@ unidentified sequence
<צחוק> non-verbal sounds (in this example צחוק, 'laughter')
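
As an illustration of how these sigla could be recognized automatically, the following Python sketch labels each token of a transcript line. It is not part of CoSIH's own tooling and rests on several assumptions: that a truncated word carries the hyphen at its end, that "--" stands as a token of its own, that an unidentified sequence is written as a token beginning and ending with "@", and that non-verbal sounds are enclosed in angle brackets. The function names, label names and the example line are hypothetical.

```python
import re
from typing import List, Tuple

# A token is either a whole <...> annotation (which may contain spaces)
# or a run of non-whitespace characters.
_TOKEN_RE = re.compile(r"<[^>]*>|\S+")


def classify_token(token: str) -> str:
    """Label one token according to the sigla listed above (assumed usage)."""
    if token in ("||", "|", "/"):
        return "boundary"
    if token == "--":
        return "truncated_group"
    if token == "@":
        return "unidentified_syllable"
    if len(token) > 1 and token.startswith("@") and token.endswith("@"):
        return "unidentified_sequence"
    if len(token) > 2 and token.startswith("<") and token.endswith(">"):
        return "non_verbal"
    if len(token) > 1 and token.endswith("-"):
        return "truncated_word"
    return "word"


def tag_line(line: str) -> List[Tuple[str, str]]:
    """Split a transcript line into tokens and label each one."""
    return [(tok, classify_token(tok)) for tok in _TOKEN_RE.findall(line)]


if __name__ == "__main__":
    # Invented example line, for illustration only.
    for token, label in tag_line("ani ro- @ <צחוק> @..@ ve --"):
        print(f"{label:>22}: {token}")
```

The order of the checks matters: "--" is tested before the trailing-hyphen rule so that a truncated prosodic group is not mistaken for a truncated word.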