Logotypografia Corpus

Category: Language resources

The Logotypografia Corpus is distributed through the European Language Resources Association (ELRA). It consists of read speech collected for the development of continuous speech recognition systems for the Greek language. All recorded sentences were selected from extracts of the Elefterotypia-journal text corpus and cover a vocabulary of about 40,000 words. The corpus contains over 32,000 utterances (approximately 72 hours of speech material from 120 different speakers, male and female).

Detailed orthographic transcription files are also included in the distribution. The transcriptions mark each utterance's orthography together with several speech and non-speech events (e.g. mispronunciations, truncations, noise).

Recordings were made in three different environments: a soundproof room, a quiet environment and an office environment. Two different microphones were used: a desk microphone and a head-mounted close-talking microphone. The waveform files are in NIST format. Waveforms are PCM-encoded at a 16,000 Hz sampling rate with 2 bytes per sample.
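As a rough illustration of the waveform format described above, the following Python sketch reads the text header of a NIST file and derives an utterance's duration from it. It assumes that "NIST" here means the common NIST SPHERE layout (an ASCII header starting with `NIST_1A`, followed by the header size and `name -type value` fields). The function name `parse_sphere_header` and the example header, including its sample count and single channel, are hypothetical and not taken from the corpus.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse an ASCII NIST SPHERE header (assumed layout) into a dict."""
    first_lines = data.split(b"\n", 2)
    if len(first_lines) < 3 or first_lines[0].strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    # The second line holds the total header size in bytes (typically 1024).
    header_size = int(first_lines[1].strip())
    fields = {"header_size": header_size}
    text = data[:header_size].decode("ascii", errors="replace")
    for raw in text.split("\n")[2:]:
        line = raw.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # name, type flag, value
        if len(parts) != 3:
            continue
        name, ftype, value = parts
        if ftype == "-i":        # integer field
            fields[name] = int(value)
        elif ftype == "-r":      # real-valued field
            fields[name] = float(value)
        else:                    # string field, e.g. "-s2"
            fields[name] = value
    return fields


# Hypothetical header using the parameters stated in the text:
# 16,000 Hz sampling rate, 2 bytes per sample. The sample count and
# channel count are made-up illustration values.
example_header = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 16000\n"
    b"sample_n_bytes -i 2\n"
    b"sample_count -i 129600\n"
    b"sample_byte_format -s2 01\n"
    b"end_head\n"
).ljust(1024, b" ")

fields = parse_sphere_header(example_header)
duration = fields["sample_count"] / fields["sample_rate"]    # seconds
data_bytes = fields["sample_count"] * fields["sample_n_bytes"] * fields["channel_count"]
print(f"{duration:.1f} s of audio, {data_bytes} bytes of PCM data")
```

With a real file, the first 1024 bytes read in binary mode would be passed in place of `example_header`; the PCM samples would then start at offset `header_size`. At the stated parameters, one channel of audio takes 32,000 bytes per second.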

 
 
