CoSIH: The Corpus of Spoken Israeli Hebrew

Introduction

Plans for The Corpus of Spoken Israeli Hebrew (CoSIH) started to take shape in 1998. CoSIH aimed at compiling a large database of recordings of spoken Israeli Hebrew in order to facilitate research in a range of disciplines. A corpus is a preliminary desideratum for larger projects that cannot otherwise be accomplished. The research potential of such a corpus is extremely large, including, inter alia, applications in the following areas: general and theoretical linguistics, Hebrew language and linguistics, applied linguistics, language engineering, education, and cultural and sociological studies.

CoSIH was designed to include a representative sample of both demographically and contextually defined varieties. Under the proposed model, CoSIH was to consist of a thousand sets of recordings ("cells") of 5000 words each, i.e., a corpus of five million words. We have taken a culture-dependent approach to the compilation of CoSIH. CoSIH aspires to bridge the gap between the infinite number of varieties used by the Israeli Hebrew speech community and their representation in the corpus by characterizing their diversity in both demographic and contextual terms. CoSIH appears to be the first and only attempt to establish a representative corpus along the axes of both demographic and contextual variables, based on statistical and analytic criteria.

The informants for the CoSIH recordings would be selected by random sampling of the Israeli population, in order to reflect the social structure of the Israeli Hebrew speech community. The corpus would be segmented for analytic purposes using well-defined criteria, although all sociolinguistic data on the recorded informants would be made available to CoSIH's end users. The working hypothesis of CoSIH is based on the demographic criteria that seem to be most significant for the representation of linguistic diversity in Israel: (1) place of birth, familial land of origin, ethnic group or religion; (2) age; (3) education; and (4) sex.¹

For the analysis of the contextual variables for each discourse, CoSIH's working hypothesis is based on five variables. There are three primary variables: interpersonal relationships, discourse structure and discourse topic; and two secondary variables: number of participants and medium (i.e. face-to-face conversation and telephone conversation).

A comprehensive study of the demographic and circumstantial variables in Hebrew discourse in Israel remains a desideratum. Therefore, in order to design a proper model for CoSIH, the corpus would be set up in phases, during which a research program would be undertaken in order to verify the working hypothesis suggested above.

This model was first published online, in both Hebrew and English. The English version eventually found its place in Hary & Izre'el 2003. A more sophisticated model was published in English in Izre'el, Hary & Rahav 2001.

CoSIH was initiated, designed and operated by a team of Israeli and international scholars:

Core team: Shlomo Izre'el, Tel-Aviv University (director); Benjamin Hary, Emory University (principal investigator); John Du Bois, University of California at Santa Barbara (corpus analyst); Mira Ariel, Tel-Aviv University (discourse analysis and pragmatics); Giora Rahav, Tel-Aviv University (statistics and sociology). Esther Borochovsky-Bar Aba, Tel Aviv University (syntax) joined the team at a later stage.

Advisory board: Eliezer Ben-Rafael, Tel Aviv University (sociolinguistics – sociological aspects); Yaakov Bentolila, Ben Gurion University (sociolinguistics – linguistic aspects); Otto Jastrow, Universität Erlangen-Nürnberg (transcription, phonology, dialectology); Shmuel Bolozky, University of Massachusetts at Amherst (phonology, morphology); Geoffrey Khan, Cambridge University (syntax); Elana Shohamy, Tel Aviv University (language education).

The Present State of CoSIH

As of 2012, this ambitious project still awaits its realization. The limited financial support at our disposal enabled us to compile two sets of recordings: the first was made during the initial preparatory phase, and the second as a pilot study. The initial preparatory phase produced 11 recordings spanning at least 6 hours each, with some being much longer. Although we initially designed a pilot of 20 sets of 3-hour recordings, we eventually ended up with 42 sets, each including between 8 and 16 hours of uninterrupted recording of everyday speech. Taken together, we now possess recordings of 6 to 18 hours each by 53 volunteers, which we believe to be a reasonable source of data for the study of Spoken Hebrew. The recordings, all made between August 2000 and October 2002, are real-life conversations of CoSIH's informants. As such, they naturally include the speech of both the volunteers who recorded them and their interlocutors.

The Informants

While the representative informants for CoSIH will be recruited by a probabilistic procedure, we used quota sampling for the pilot study, trying to achieve wide coverage of the main socio-demographic groups in the population. Informants for this phase were recruited by three data collection agencies (a university-associated agency and two well-recognized, reputable commercial agencies). Each of the agencies was asked to collect data from 16 informants according to the demographic categories presented in Table 1:

**Table 1: Demographic criteria for the recruitment of volunteers**
Age	Education	Ashkenazi	Mizrahi	Arab	Others
Young	≤high school
Young	>high school
Old	≤high school
Old	>high school

The first three groups (=columns) were set to fit, mutatis mutandis, the major demographic sections of the Israeli Hebrew speaking community: Jews of European or other Western ethnic origin (‘Ashkenazi’); Jews of Asian or African ethnic origin (‘Mizrahi’); non-Jews, of which the majority are Arabs, comprising ca. 20% of the Israeli population. The fourth column, ‘others’ or, rather, ‘special groups’, was set to consist of three demographic sections which we hypothesized would show significant differences in their use of language and in their linguistic structure: ultra-orthodox, soldiers and members of other security forces, and recently arrived immigrants. Each agency was assigned one of these latter groups.

From each of the three major ethnic groups, each agency was to recruit four informants: two young (<20 years old) and two old (>50 years old), two with high education and two without. Lastly, each agency was instructed to recruit men and women in equal numbers, irrespective of any of the other criteria.
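The quota design described above can be sketched in a few lines of Python, purely as an illustration. The sketch assumes that each cell of Table 1 (one age group, one education level, one demographic column) is filled by exactly one informant per agency; the text implies this (16 informants over a 4 × 4 grid) but does not state it outright. The agency labels and the pairing of agencies with special groups are arbitrary placeholders, not taken from the project.

```python
from itertools import product

# Row categories of Table 1
AGES = ["young", "old"]
EDUCATION = ["<=high school", ">high school"]

# The three major columns of Table 1; the fourth column is one special group per agency
MAJOR_GROUPS = ["Ashkenazi", "Mizrahi", "Arab"]
SPECIAL_GROUPS = ["ultra-orthodox", "security forces", "recent immigrants"]

INFORMANTS_PER_AGENCY = 16  # stated in the text


def agency_quota(special_group):
    """Return the quota grid for one agency.

    Assumption (not stated explicitly in the source): one informant per
    (age, education, group) cell.
    """
    columns = MAJOR_GROUPS + [special_group]
    return {
        (age, edu, group): 1
        for group in columns
        for age, edu in product(AGES, EDUCATION)
    }


# Hypothetical agency labels; which agency received which special group is not given
quotas = {
    f"agency {i + 1}": agency_quota(group)
    for i, group in enumerate(SPECIAL_GROUPS)
}

for name, grid in quotas.items():
    # 4 columns x 2 ages x 2 education levels = 16 cells per agency
    assert sum(grid.values()) == INFORMANTS_PER_AGENCY

planned_total = sum(sum(grid.values()) for grid in quotas.values())
print(planned_total)
```

Under these assumptions the three agencies together target 48 informants, with the sex criterion (men and women in equal numbers) applied across the grid rather than within each cell. This is the planned quota only; the pilot itself yielded the 42 sets of recordings reported above.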

¹ During the preparatory phase, the analytic structure included only the first three criteria. Because a statistical sample would have provided an equal number of male and female volunteers, the criterion of sex was excluded from the analytic structure.