CoSIH: The Corpus of Spoken Israeli Hebrew | Corpus files and sociolinguistic descriptions

The Corpus of Spoken Israeli Hebrew (CoSIH)

 

Text sigla

Sigla for recordings from the preparatory phase have two components: the initial letter of the volunteer's (pseudonymous) first name, followed by another identifying symbol. Sigla for recordings from the pilot study also have two components: the initial letter of the institute that recruited the volunteers (C, D, P)7, followed by the number of the tape and disc from which the recording was sampled. If more than one part was sampled from the same disc, the consecutive number of each part is shown after an underscore.
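As an illustration only, the pilot-study pattern described above can be checked mechanically. The following Python sketch is not part of CoSIH; the regular expression, the `PilotSiglum` type and the `parse_pilot_siglum` function are hypothetical names introduced here. It assumes that the tape-and-disc number consists of digits only, and it does not attempt to split that number into tape and disc, since the description above does not say how they are combined. Preparatory-phase sigla (such as Y32 or OCh) are deliberately not matched.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape of a pilot-study siglum: institute letter (C, D or P),
# a run of digits for the tape and disc, and an optional "_<part>" suffix.
PILOT_SIGLUM = re.compile(r"(?P<institute>[CDP])(?P<source>\d+)(?:_(?P<part>\d+))?")


class PilotSiglum(NamedTuple):
    institute: str        # "C", "D" or "P"
    source: str           # tape and disc number, kept as an unsplit string
    part: Optional[int]   # consecutive part number, if any


def parse_pilot_siglum(siglum: str) -> Optional[PilotSiglum]:
    """Return the components of a pilot-study siglum, or None if it does not fit."""
    match = PILOT_SIGLUM.fullmatch(siglum)
    if match is None:
        return None
    part = match.group("part")
    return PilotSiglum(
        institute=match.group("institute"),
        source=match.group("source"),
        part=int(part) if part is not None else None,
    )
```

Under these assumptions, `parse_pilot_siglum("P931_1")` yields institute `P`, source `931` and part `1`, while a preparatory-phase siglum such as `OCh` returns `None`.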

Downloadable CoSIH Texts

The selection of CoSIH texts currently presented includes sample recordings of 37 volunteers, three of whom are from the preparatory phase and 34 from the pilot study. The total number of speakers, including interlocutors, is ca. 140. The total length of the recordings presented as ELAN transcriptions aligned with their audio is ca. five hours and 15 minutes, not including an additional five-and-a-half-minute text transcribed and presented only in PDF format. These are supplemented by CoSIH samples that formed part of a research corpus compiled by Nurit Dekel for her doctoral dissertation (Dekel 2010). The transcriptions of these texts are presented in PDF format. Their total length is just over five hours.8

We also offer the research community samples from recordings that have not yet been transcribed, and invite our colleagues to send us their transcriptions, in either standard orthography or phonetic transcription, thus helping to enhance CoSIH.9 The total length of these texts is around two hours and 45 minutes.

Overall, the research community is now presented with ca. 13½ hours of recorded texts. We hope to enlarge this selection in the future – with both further recordings and further transcriptions, in standard orthography or phonetic.

Sociolinguistic data

The data regarding each volunteer, as they were given to CoSIH's representatives, are summarized in Table 2 (in Hebrew). Clicking a link in the "Questionnaire" column will display or download the sociolinguistic questionnaire, which was filled in according to the answers the volunteer gave to a CoSIH representative.

Downloads

Table 3 (also in Hebrew) displays some details regarding the recorders' interlocutors and the recordings, and provides links to downloadable audio files in WAV and MP3 formats, as well as to ELAN files and PDF documents.10

Use of CoSIH material

The recordings and transcriptions (in standard orthography and phonetic) may be used for non-commercial purposes only. Whenever CoSIH material is used, its source and copyright must be specified as follows:

References

  • In Hebrew: <http://humanities.tau.ac.il/~cosih> מאגר העברית המדוברת בישראל (מעמ"ד).
  • In other languages: The Corpus of Spoken Israeli Hebrew (CoSIH) <http://humanities.tau.ac.il/~cosih>.
  • Reference to texts should be made by using the reference line in their respective ELAN files, e.g. C714_sp1_014.
  • References to recordings that lack standard orthography or phonetic transcriptions should be made by using their filename and the respective time, in seconds and hundredths or thousandths of a second; e.g. C211_1:17.50”-45.64”.
  • References to Nurit Dekel's transcripts should be made by using the filename and the quoted line number, e.g. C211_1ND:14-37.
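For readers who build such references programmatically, the two filename-based formats above can be sketched in Python as follows. This is an illustration under stated assumptions, not an official CoSIH tool: the function names `time_reference` and `line_reference` are hypothetical, times are assumed to be given in seconds as plain numbers, and the ” sign after each time is copied from the example above.

```python
SECONDS_MARK = "\u201d"  # the ” sign that follows each time in the example above


def time_reference(filename: str, start: float, end: float, decimals: int = 2) -> str:
    """Build a time-based reference for a recording without a transcription.

    `decimals` is 2 for hundredths of a second or 3 for thousandths.
    """
    if decimals not in (2, 3):
        raise ValueError("decimals must be 2 (hundredths) or 3 (thousandths)")
    if end < start:
        raise ValueError("end time must not precede start time")
    return (
        f"{filename}:{start:.{decimals}f}{SECONDS_MARK}"
        f"-{end:.{decimals}f}{SECONDS_MARK}"
    )


def line_reference(filename: str, first_line: int, last_line: int) -> str:
    """Build a line-based reference to one of Nurit Dekel's transcripts."""
    if last_line < first_line:
        raise ValueError("last line must not precede first line")
    return f"{filename}:{first_line}-{last_line}"
```

With these definitions, `time_reference("C211_1", 17.5, 45.64)` reproduces the example C211_1:17.50”-45.64”, and `line_reference("C211_1ND", 14, 37)` reproduces C211_1ND:14-37.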

Copyright

  • The copyright for CoSIH, including all its recordings, standard orthography and phonetic transcripts, is a property of Tel Aviv University.
  • The copyright for the transcriptions marked by the initials ND is a property of Nurit Dekel. The copyright for the phonetic transcription of the recordings marked as OCh is a property of Il-il Yatziv-Malibert and the CorpAfroAs – A corpus for Afroasiatic Languages project. The copyright for the phonetic transcription marked as C714 is a property of Alisa Guterman. The copyright for the phonetic transcriptions marked as C1624, P931_1 and Y32 is a property of Noam Faust.
  • The copyright for the logo and page headers is a property of Li-Mor Izre'el-Avishar. The copyright for the photographs is a property of Oren Izre'el.


7 C = The B. I. and Lucille Cohen Institute for Public Opinion Research at Tel Aviv University, which recruited 16 volunteers; D = Dahaf Institute, which recruited 10 volunteers; P = PORI Institute, which recruited 16 volunteers.

8 We wish to thank Nurit Dekel for agreeing to present her transcripts as part of CoSIH's website for the benefit of the research community. The transcripts were prepared by Nurit between 2005 and 2007. For additional sigla, cf. יזרעאל תשס"ב(א): 291-290.

Texts from Nurit's corpus that overlap CoSIH transcripts in ELAN format were not uploaded. For partial overlaps, cf. their respective cells in Table 3.

9 The transcriber's name and copyright will of course be displayed for each contribution, as we have gladly done with the contributions by Nurit Dekel, Il-il Yatziv-Malibert, Elissa Guterman and Noam Faust.

10 PDFs that are displayed alongside ELAN files reproduce the text exactly as it appears in the ELAN files, and accordingly do not include indications for overlaps and pauses.