Olomouc Corpus of Spoken Czech: Characterization and Main Features of the Project

This study presents the results of the author's research project, the Olomouc Corpus of Spoken Czech (OCSC). The paper focuses on the current state and the individual phases of constructing the corpus, its methodology, and its annotation. Within the OCSC we use a so-called dual system of transcription, which means (1) an orthographic transcript for the purpose of linguistic (morphological) analysis and tagging and (2) a phonetic transcript consisting of three layers of text: the transcription proper as the first layer, and various types of metatext as the second and third layers, including communication aspects of the texts. The criteria for selecting speakers are also listed, and a statistical analysis of the sociolinguistic categories (gender, age, type of education, type of recording) is presented as well. This analysis can serve as a basis for partially correcting a possible imbalance among those sociolinguistic parameters. The annotation rules and principles are given at the end of the study.


Introduction
The research project of the Olomouc Corpus of Spoken Czech (OCSC) has been systematically built by the author of this paper since 2002 at the Department of Czech Studies, Faculty of Arts, Palacký University in Olomouc (Czech Republic). OCSC, which is typologically conceived as a general corpus, is currently the biggest corpus of spoken Czech (circa 1.5 million words; see Table 1). All previous spoken corpora (the Prague (PSC) and Brno (BSC) Spoken Corpora as well as ORAL2006 and ORAL2008) have been constructed on the same methodological basis, with modifications, at the Institute of the Czech National Corpus.1 (There is one more corpus focused on the spoken form of the Czech language: the specialised corpus DIALOG, which is focused on the analysis of dialogues in the media.)2 The OCSC project started in 2002 (or 2003 respectively) and was at first based on the general methodology of the spoken corpora of the Czech National Corpus (CNC). We needed to modify some methodological aspects because the conception of the spoken part of the Czech National Corpus is based predominantly on orthography, which does not reflect some substantial aspects of spoken language in general.
The changes and modifications based on the specific features of spoken language were so radical that we created a Czech spoken corpus based on a new conception: we pay close attention to transcription, annotation, the format of transcripts, and appropriate software for processing and managing the corpus and for querying its data (data retrieval).
(2) age (older vs. younger, with a lower limit of c. 20 years of age and the dividing limit set at 35 years of age); (3) education (lower vs. higher); and (4) language data gained from a driven and a non-driven recording process. For (4), this means (4a) a formal recording as a monologue, guided by a predefined and thematically broad questionnaire, and (4b) an informal recording, which means a non-driven dialogue among speakers who know each other well.
The informal recording or dialogue is not thematically specialised; the length of a recording is set at about 20 minutes (roughly 2,300 words). The optimal number of speakers in a dialogue is two or three, in order to prevent it from becoming unintelligible due to simultaneous speech. One of the participants was usually also a respondent in the formal recordings, which enables us to observe the differences between the Czech language used in unofficial and semi-official situations.
Since the beginning of the creation of Czech spoken corpora, the principle has been that the recorded participants are either native speakers of a given area, in this case Olomouc, or have lived in this area for at least 20 years. In OCSC the rules are not so strict: it is not necessary to be a native speaker, nor to have lived in Olomouc for at least 20 years. It is essential that a speaker lives or has been living in Olomouc, or is employed here and comes to Olomouc daily (daily contact with the language variety in Olomouc). We exclude the language of the adolescent youth of a given area.
Education is divided from the original two-value subdivision (BASIS vs. ALTUS) into a trichotomy: (1) primary (BASIS, B), (2) secondary (MEDIUS, M), and (3) university (ALTUS, A) education (see below).
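The category scheme described above can be illustrated by a minimal sketch. It assumes the 35-year limit between the two age groups and the B/M/A education codes given in the text; the `Speaker` record, the gender codes, the function names, and the treatment of a speaker aged exactly 35 are hypothetical and are not taken from the OCSC documentation.

```python
from dataclasses import dataclass

AGE_LIMIT = 35  # dividing limit between the younger and the older group

# Education trichotomy: BASIS (B), MEDIUS (M), ALTUS (A)
EDUCATION_CODES = {"primary": "B", "secondary": "M", "university": "A"}


@dataclass
class Speaker:
    gender: str     # hypothetical coding: "F" or "M"
    age: int        # age in years
    education: str  # "primary", "secondary" or "university"


def age_code(age: int) -> str:
    """Return "I" (iunior) or "V" (vetus).

    Assumption: a speaker aged exactly 35 is placed in the older group,
    because the text does not say which group the limit itself belongs to.
    """
    return "I" if age < AGE_LIMIT else "V"


def speaker_categories(speaker: Speaker) -> tuple:
    """Return (gender, age group, education code) for one speaker."""
    return (
        speaker.gender,
        age_code(speaker.age),
        EDUCATION_CODES[speaker.education],
    )
```

For instance, `speaker_categories(Speaker("F", 28, "university"))` yields the triple `("F", "I", "A")`.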

Statistical Analysis of Sociolinguistic Variables in OCSC
A statistical analysis of the sociolinguistic variables (gender, age, education, type of recording) across all corpus data has been carried out, first of all to find out whether the corpus is balanced and, where it is not, to identify disproportions among the categories under study. Achieving balanced input data in spoken corpora is always a very problematic task.7 We consider it essential to be able to apply an additional "correction", or partial revision, of particular (un)balanced sociolinguistic variables on the basis of the statistical data. We would therefore carry out a subsequent collection of recordings aimed at the most noticeable disproportion in a particular sociolinguistic parameter (gender, age, education). The results reveal a marked predominance of females. The gender category can be corrected with relative ease by a subsequent collection of recordings in which men would prevail.
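The kind of balance check described here can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the sample values are invented placeholders rather than OCSC data, and the percentage-point gap is one simple measure of disproportion among several possible ones.

```python
from collections import Counter


def category_shares(values):
    """Return the percentage share of each category value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: 100.0 * n / total for value, n in counts.items()}


def largest_gap(shares):
    """Return the difference, in percentage points, between the most
    and the least represented category."""
    return max(shares.values()) - min(shares.values())


# Invented placeholder data: one gender code per recorded speaker
genders = ["F", "F", "F", "M"]
shares = category_shares(genders)
gap = largest_gap(shares)
```

The variable with the largest gap would be the first candidate for a subsequent, targeted collection of recordings.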

I (iunior) = under 35 years
V (vetus) = above 35 years
To obtain a corpus balanced with respect to the age of the participants, it is important to carry out a statistical analysis based on the exact age of the speakers (or at least by decade). We are currently preparing the data for this analysis.
The numbers show that the age category in our corpus is unbalanced. Hence we have also explored the mutual connection between the age and gender categories, i.e. we have explored the participation of the categories IUNIOR-VETUS separately for men and women to get more precise
Contrary to the corpora of the CNC, we do not use a quasi-orthographic type of notation, because the transcription rules of this notation are based markedly and predominantly on orthography and do not reflect the majority of the substantial aspects of spoken language. The authors of the quasi-orthographic notation were motivated by the requirements of subsequent morphological analysis and tagging, but to date the corpora of the CNC are still not morphologically annotated.
Because OCSC is a spoken corpus, we have tried to develop a system of notation and transcription that could lead towards an adequate visualisation of the phonetic realization of the speech continuum and could also enable (semi)automatic linguistic annotation by means of software. This is an ambivalent situation: on the one hand there is a need for the most accurate possible written record of the audio recording; on the other hand the written record should enable technical processing of the text. We solved this situation by using a dual system of transcription, which means (1) an orthographic transcript for the purpose of linguistic (morphological) analysis and tagging, and (2) a phonetic transcript that reflects all important aspects of the spoken variety of a given language, i.e. Czech, as well as communication aspects of the dialogues (see below).
Common text editors are used to create transcripts that are saved into plain text format (.txt).
For such purposes we have developed a special transcription format called SVIFT (Structural Vertical and Interlinear Format of Transcription). The SVIFT format allows automatic conversion into XML (an international standard for structured data) in the next stage of implementing the corpus data, namely by means of a script written in a scripting language (Perl, Python, etc.).
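A conversion script of this kind might look like the following minimal sketch. It assumes the line-initial symbols as they are used in the sample transcript of this paper (`§` for a topic section, `#` for a time reading or commentary line, `$` for a speech-turn). The XML element and attribute names (`transcript`, `section`, `meta`, `turn`, `speaker`) and the function name are hypothetical and do not reproduce the actual OCSC schema.

```python
import re
import xml.etree.ElementTree as ET

# A speech-turn line: "$", a speaker label, a colon, then the transcribed text
TURN = re.compile(r"^\$\s*(?P<speaker>\w+):\s*(?P<text>.*)$")


def svift_to_xml(lines):
    """Convert SVIFT-like lines into an ElementTree (hypothetical schema)."""
    root = ET.Element("transcript")
    section = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        symbol, rest = line[0], line[1:].strip()
        if symbol == "§":
            # A new topic section; later lines are attached to it
            section = ET.SubElement(root, "section", title=rest)
            continue
        parent = section if section is not None else root
        if symbol == "#":
            ET.SubElement(parent, "meta").text = rest
        elif symbol == "$":
            match = TURN.match(line)
            if match is None:
                raise ValueError(f"malformed speech-turn line: {line!r}")
            turn = ET.SubElement(parent, "turn", speaker=match.group("speaker"))
            turn.text = match.group("text")
        else:
            raise ValueError(f"unknown structural symbol in line: {line!r}")
    return root
```

Calling `ET.tostring(svift_to_xml(lines), encoding="unicode")` on the lines of a transcript file would then give the XML text. Keeping one structural symbol per line is what makes such a line-by-line conversion straightforward.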
The defined structural symbols of the SVIFT format mark the type of (meta)text: whether it is the actual text of the transcript, a new section, a commentary, a time reading, etc. They always precede each separate line of the transcript (separated by Enter), i.e. these signs stand at the beginning of each speech-turn and of each metatext line. The structural symbols are combined with transcription (meta)symbols that serve to mark the simultaneity of speech-turns, commentary sections, incomplete words, unintelligible parts of the recording, and other phenomena.
The phonetic transcription is therefore multilayer, and in comparison to the orthographic one it is much more detailed, having three layers of text: the transcription proper as the first and basic layer, and various types of metatext as the second and third layers. The second layer serves to structure the text (topic sections followed by a time reading), and the third layer serves to capture all metatext information enclosed within commentary (angle) brackets, including communication aspects of the texts, commentaries, and non-verbal and paraverbal events. The particular layers are marked by the dollar sign ($) for the first layer with the phonetic record, the number sign (#) for the second layer with the orthographic record, and the section sign (§) for the third layer with the orthographic record. (For the meaning of these signs see also the section (Meta)Symbols of Annotation - Overview.)
Square brackets are important metasymbols. They enclose the parts of speech-turns that are realized simultaneously by two (or more) speakers, and they signal the start and the end of the overlap. It is relatively common in dialogue for one speaker to break into the speech of another speaker several times within a single speech-turn. The square brackets are therefore paired by a numerical index (see the example below):
Example:
A: [1 not that we had made arrangements ]1 no but / [2 no we are co-debtors ]2 but we [3 made arrangements cause romca paid ]3 much more than me // i think it's split equally half-half because we have a bond / half-half hey / even though roman's repayments are higher or he pays for both of us
B: [1 it is / there is only one debtor ↑ / ]1
B: [2 / it's better / ]2
B: [3 you have a share in it based on amount invested ↑ / ]3
The indexed square brackets serve to signal and identify the mutually overlapping parts of the different speakers' speech-turns. They are a relevant element of the transcripts and have to be marked consistently.
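How the indexed brackets make overlaps machine-recoverable can be shown with a minimal sketch. It assumes that an overlap is written as `[n ... ]n` with the same index on both brackets, as in the example above; the regular expression and the function name are hypothetical and are not part of the SVIFT specification.

```python
import re
from collections import defaultdict

# "[n", the overlapped words, then "]n" with the same index n
OVERLAP = re.compile(r"\[(\d+)\s+(.*?)\s*\]\1(?!\d)")


def overlaps_by_index(turns):
    """Group overlapped segments by their numerical index.

    turns: a list of (speaker, text) pairs.
    Returns a dict {index: [(speaker, segment), ...]}.
    """
    grouped = defaultdict(list)
    for speaker, text in turns:
        for match in OVERLAP.finditer(text):
            grouped[int(match.group(1))].append((speaker, match.group(2)))
    return dict(grouped)


turns = [
    ("A", "[1 not that we had made arrangements ]1 no but / "
          "[2 no we are co-debtors ]2 but we"),
    ("B", "[1 it is / there is only one debtor / ]1"),
    ("B", "[2 / it's better / ]2"),
]
pairs = overlaps_by_index(turns)
```

Each index in `pairs` then lists the segments that were spoken simultaneously, one per speaker. This only works if the indices are marked consistently, as required above.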
The list of all symbols, complemented by the marks for the prosodic level of utterances (and a short sample transcript), is itemized below.

Sample Transcript

§ A man and his dog
# < time: 00:13:50 >
$B: < do something ↓ hey ↑ >
# < a dog >
$A: good ↑ < did you eat it yet ↑ >
# < asking his dog >
$B: did you eat it yet ↑ but it doesn't matter ↓ / here it comes there is some wine ↓
$A: <1 was is tasty ↑ >1 / <2 look ↑ now he will be begging ↓ / look ↓ >2
# <1 asking his dog >1
# <2 pointing at the dog >2
$B: some advert ↑
$A: yeah → it was in the newspapers ↓
… … …
§ Grandmother's party - "tasting"
# < time: 00:17:25 >
$A: < wait ↓ let's taste it ↓ ok ↑ > / it's for our guys anyway ↓
# < spiced nuts >
$B: here you are ↓ /// so i don't know ↓ /// it can't be taken out nicely ↓
$C: < some orange flavour ↓ >
# < they're eating chocolates of various flavours >
$A: grandma → this is again the most embarrassing what you have ↓ isn't it ↑
$B: oh my god → my colleague yesterday → / he indulges in eating those ninety-nine per cent chocolates →
$C: i wouldn't eat it ↓
$B: but you know what ↓ / that chocolate ↓
$A: i did taste it ↓ / you don't tell a difference ↓
$B: is it worth ↑ it if you can't tell the difference ↑

Transcription of Recordings: Multilayer Transcript and SVIFT format
Previous Czech spoken corpora use a subdivision into two values, BASIS vs. ALTUS, as mentioned above, where the term BASIS covers both primary and secondary education. Our three-value subdivision covers the following subcategories: BASIS = primary and apprentice education, MEDIUS = secondary education, and ALTUS = commenced, unfinished, and finished university education.
The question is whether formal and informal recordings should be considered two separate types of discourse (i.e. that they have no common denominator and differ from each other), or whether it is more suitable and adequate to think of these types of recordings as a single discourse. The spoken corpora of the CNC represent the first concept (see Chart 7). Because formal and informal types of recordings are tied together methodologically, we find it more suitable to consider FOR and INFOR recordings as one set connected by one speaker, who takes part in both types of recordings (see Chart 6). It is apparent from Charts 6 and 7 below that these two different approaches markedly affect the results of the analysis, especially in the case of one, two, or three speakers respectively within one recording of a given corpus.
7 Statistical analyses are presented by a graphical chart type (percentage ratio), and by a numerical chart.