Sentence-level Prosodic Modelling of the Greek language with Applications to Text-To-Speech synthesis

Fotinea, Stavroula-Evita

PROFILE

Research Area:

Speech and Music Technology

Type:

PhD Thesis

Year:	1999

Authors:	Stavroula-Evita Fotinea
University:	National Technical University of Athens, Dept. of Electrical and Computer Engineering




Abstract:	Text-to-Speech (TTS) synthesis is among the most important areas of Digital Signal Processing. TTS systems are used in a wide variety of applications, such as speech synthesis tools for the visually impaired, voice-based information services (weather reports, financial transactions, etc.) and educational software for native or foreign language teaching. The aim of such systems is high-quality synthetic rendering of arbitrarily long texts. Two language-specific factors affecting the quality of the synthetic speech output are the constancy of the speech tempo and the correct prosodic representation of the melody of the language. Constancy of tempo is crucial because it makes the synthetic speech output pleasant to the ear. Speakers often use different speech rates when introducing new as opposed to old, or important as opposed to unimportant, pieces of information. Usually, when new or emphatic information is introduced, the speech tempo slows down, allowing listeners to register the information fully. Conversely, when already known information is introduced, the speech rate may accelerate. Speech tempo variations are thus tools the speaker uses as part of a specific strategy for conveying information. A TTS system should be able to produce synthetic speech of constant tempo, maintaining a pleasant and thus not tiring speech flow, since a varying tempo could eventually be misleading if the listener cannot interpret the meaning of the variation. In this work, we attempt an exhaustive study of the durations of vowels of Modern Greek (both stressed and unstressed) found in words in both focal and non-focal positions. We find that the relative duration of vowels is an important factor in achieving constancy of the speech tempo.
In practice, during speech, factors such as segmental context and word prominence may slightly affect these durations; modifying them beyond a certain threshold may lead to the perception of a tempo mismatch. A duration rule is therefore introduced, whose use preserves the constancy of the tempo in the synthetic speech output. Moreover, regardless of the synthesis technique adopted, one of the most important but highly language-specific sub-modules of a TTS system is the prosodic one. Prosody involves duration and intensity control; most importantly, though, it requires a correct representation of the melody of the language, and hence of the evolution of the pitch curve. A TTS system should be able to produce synthetic speech output for single words, for slightly more complex utterance structures (such as catalogues or verb inflection patterns) and for phrase- or sentence-level structures. Single-word utterances are widely used in everyday discourse and provide the simplest means of communication. A slightly more complicated utterance is the inflection of a verb or its melodic equivalent, single words in catalogues. Finally, the most complex structures are phrases and sentences, whose variety is vast given the complexity and large number of syntactic structures of the Greek language. In this work we present prosodic models that apply from single words up to sentences, spanning the structural variety of Modern Greek. Word-level models are presented for different types of expression: affirmation, interrogation and a third type that conveys the message that the utterance is not yet completed and that more information follows. Data on the relative duration of the significant parts of the models are also provided, for correct application to TTS systems for the Greek language. Prosody rules are also given for the slightly more complex structures mentioned above.
For verb inflection structures, as well as for their prosodic equivalent of single words in catalogues, two models are used: a) the third type of expression model, conveying the message that more information follows, and b) the affirmation model (at the catalogue's final word), conveying the message that the utterance is completed. An exhaustive study of pitch evolution at phrase and sentence level is also presented, based on a wide variety of sentences covering the syntactic structures of Modern Greek in neutral-stress affirmations. We also present pitch models and prosody rules that apply to sentence-level intonation words. The basic rule, which covers the largest subset of the syntactic phenomena investigated, is briefly presented below. The Introductory model is applied to every sentence's introductory intonation word, conveying the message that more information follows, while a Middle model is applied to every non-introductory and non-final intonation word of a sentence, conveying the same message as the Introductory model but differing slightly, especially with regard to the pitch levels reached. Finally, the Conclusive model is applied to every sentence-final intonation word, conveying the message that the utterance is completed. Modifications of this rule are observed in cases of enclisis of stress, pause and emphasis. In enclisis of stress, intonation words with two stresses are formed through the attachment of a clitic to a grammatical word. Under certain stress conditions, clitic attachment violates the Antepenultimate rule, and the violation is repaired by the appearance of a secondary corrective stress. In such cases, depending on the position of the double-stress intonation word, the Double Stress Introductory or the Double Stress Conclusive model is applied.
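As an illustration only, the basic rule described above (Introductory, Middle and Conclusive models chosen by position in the sentence) could be sketched as follows. The function name `assign_basic_models` and the string labels are hypothetical and are not taken from the thesis. The sketch also assumes that a one-word sentence receives the Conclusive label, a case the abstract does not address, and it leaves out the double-stress, pause and emphasis modifications.

```python
from typing import List


def assign_basic_models(intonation_words: List[str]) -> List[str]:
    """Label each intonation word of a sentence with a pitch model name.

    Hypothetical sketch of the basic positional rule only:
    first word -> Introductory, last word -> Conclusive, others -> Middle.
    """
    last = len(intonation_words) - 1
    models = []
    for i, _word in enumerate(intonation_words):
        if i == last:
            # Assumption: the final position takes priority, so a
            # one-word sentence is labelled Conclusive.
            models.append("Conclusive")
        elif i == 0:
            models.append("Introductory")
        else:
            models.append("Middle")
    return models


if __name__ == "__main__":
    # Placeholder tokens stand in for real intonation words.
    print(assign_basic_models(["w1", "w2", "w3", "w4"]))
```

The labels would then select the corresponding pitch contour in the prosodic module; how the contours themselves are parameterised is not specified in the abstract and is not modelled here.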
When the pause mechanism is activated, whether because of a comma in the written text, an intrinsic pause in certain structures or the speaker's choice to pause briefly, the Pausive model is applied to the pre-pausal intonation word, and the intonation word that follows is modelled with the Introductory instead of the Middle model. Special treatment is required when special words (key words) appear that activate the emphasis mechanism. A number of syntactic structures can only be phonologically signalled by this mechanism. This happens, for instance, in the case of negation, e.g. «Μην φεύγεις» (Do not leave) or «Η Μαρία δεν καταλαβαίνει καμία διαφορά» (Maria does not understand any difference). The emphasis mechanism is also present in relative comparison, e.g. «Η Μαρία είναι πιο μεγάλη από την Κατερίνα» (Maria is older than Catherine), and in disjunctive co-ordination («Είτε-είτε» (either-or), «ή-ή», «ούτε-ούτε» (neither-nor)). In the above cases an Emphasis model is applied, with a pitch rise in the neighbourhood of the special word and a subsequent pitch fall until the end of the phrase or sentence.
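The pause modification can be illustrated with a small sketch: the pre-pausal intonation word takes the Pausive model, and the word after the pause takes the Introductory model instead of the Middle one. The function name `assign_models_with_pauses`, the `(word, pause_follows)` input format and the string labels are all hypothetical. The sketch further assumes that the sentence-final word stays Conclusive even directly after a pause and that a pre-pausal first word is labelled Pausive; the abstract does not settle either case. The emphasis and double-stress mechanisms are not modelled.

```python
from typing import List, Tuple


def assign_models_with_pauses(words: List[Tuple[str, bool]]) -> List[str]:
    """Label intonation words, taking pauses into account.

    Each item is (intonation_word, pause_follows). Hypothetical sketch:
    - final word             -> Conclusive (assumed to take priority)
    - word before a pause    -> Pausive
    - first word, or the word
      right after a pause    -> Introductory
    - anything else          -> Middle
    """
    last = len(words) - 1
    models = []
    after_pause = False
    for i, (_word, pause_follows) in enumerate(words):
        if i == last:
            model = "Conclusive"
        elif pause_follows:
            model = "Pausive"
        elif i == 0 or after_pause:
            model = "Introductory"
        else:
            model = "Middle"
        models.append(model)
        # Remember whether the next word starts after a pause.
        after_pause = pause_follows
    return models


if __name__ == "__main__":
    # Placeholder tokens; a pause (e.g. a comma) follows "w2".
    sentence = [("w1", False), ("w2", True), ("w3", False), ("w4", False)]
    print(assign_models_with_pauses(sentence))
```

Representing the pause as a flag on the preceding word keeps the rule local: each label depends only on the word's position and on the flags of the word itself and its predecessor.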