Festival's basic object for synthesis is the utterance. An utterance represents some chunk of text that is to be rendered as speech. In general you may think of it as a sentence, but in many cases it won't actually conform to the standard linguistic syntactic form of a sentence. In general the process of text to speech is to take an utterance which contains a simple string of characters and convert it step by step, filling out the utterance structure with more information until a waveform is built that says what the text contains.
The conversion generally proceeds through a sequence of processing steps.
Each of these steps in Festival is achieved by a module, which will typically add new information to the utterance structure.
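The idea of modules successively filling out an utterance can be sketched as follows. This is a minimal illustration in Python, not Festival's own code: the dictionary standing in for the utterance and the functions `tokenize`, `tokens_to_words` and `synthesize` are all hypothetical names chosen for the example.

```python
# Illustrative only: a dict stands in for the utterance structure, and the
# module names below are hypothetical, not Festival's own.

def tokenize(utt):
    # Split the raw text on whitespace and record the tokens.
    utt["Token"] = utt["text"].split()
    return utt

def tokens_to_words(utt):
    # Lower-case each token and strip surrounding punctuation.
    utt["Word"] = [t.strip(".,!?").lower() for t in utt["Token"]]
    return utt

def synthesize(text, modules):
    utt = {"text": text}      # starts as a simple string of characters
    for module in modules:    # each module adds more information
        utt = module(utt)
    return utt

utt = synthesize("Hello, world.", [tokenize, tokens_to_words])
print(sorted(utt))            # ['Token', 'Word', 'text']
```

Each function only adds to the structure it is given, so later modules can rely on whatever earlier modules have filled in.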
An utterance structure consists of a set of items which may be part of one or more relations. Items represent things like words and phones, though they may also be used to represent less concrete objects like noun phrases and nodes in metrical trees. An item contains a set of features (name and value pairs). Relations are typically simple lists of items or trees of items. For example, the Word relation is a simple list of items, each of which represents a word in the utterance. Those words will also be in other relations, such as the SylStructure relation, where the word will be the top of a tree structure containing its syllables and segments.
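To make the item and relation model concrete, here is a small sketch in Python. It is only an assumption-laden illustration of the idea: the `Item` and `Relation` classes and their methods are hypothetical and do not correspond to Festival's actual classes or function names.

```python
class Item:
    """A set of named features; the item itself belongs to no single relation."""
    def __init__(self, **features):
        self.features = dict(features)

class Relation:
    """A named structure over items: a flat list, or trees via parent links."""
    def __init__(self, name):
        self.name = name
        self.items = []    # items in insertion order
        self.parent = {}   # id(item) -> parent item within this relation

    def append(self, item, parent=None):
        self.items.append(item)
        if parent is not None:
            self.parent[id(item)] = parent
        return item

    def daughters(self, item):
        return [i for i in self.items if self.parent.get(id(i)) is item]

word = Item(name="hello")
syl1, syl2 = Item(name="syl", stress=0), Item(name="syl", stress=1)

words = Relation("Word")                 # a simple list of words
words.append(word)

syl_structure = Relation("SylStructure")
syl_structure.append(word)               # same item, second relation
syl_structure.append(syl1, parent=word)
syl_structure.append(syl2, parent=word)

print(len(syl_structure.daughters(word)))        # 2
print(words.items[0] is syl_structure.items[0])  # True
```

The point of the design is that the word is one object seen through two relations: it has no daughters in Word, but two in SylStructure.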
Unlike previous versions of the system, items (then called stream items) do not belong to any particular relation (or stream); they are simply part of whichever relations they are within. Importantly, this allows much more general relations to be made over items than was allowed in the previous system. This new architecture continues our goal of providing a general, efficient structure for representing complex interrelated utterance objects.
The architecture is fully general: new items and relations may be defined at run time, so new modules may use any relations they wish. However, within our standard English voices (and other voices) we have used a specific set of relations, as follows.
Word
Words also appear as daughters of the Token relation. They may also appear in the Syntax relation (as leaves) if the parser is used. They will also be leaves of the Phrase relation.
Phrase
The daughters of each phrase root are the words within that phrase.
Syntax
If the parser is used, a tree over the members of the Word relation.
SylStructure
Links the Word, Syllable and Segment relations. Each Word is the root of a tree whose immediate daughters are its syllables, and their daughters in turn are its segments.
Syllable
Each syllable is also in the SylStructure relation. In that relation its parent is the word it is in and its daughters are the segments that are in it. Syllables are also in the Intonation relation, giving links to their related intonation events.
Segment
Segments are typically leaf nodes in the SylStructure relation. They may also be in the Target relation, linking them to F0 target points.
IntEvent
Intonation events are related to syllables through the Intonation relation, as leaves of that relation. Thus their parent in the Intonation relation is the syllable these events are attached to.
Intonation
The roots of the trees in Intonation are Syllables and their daughters are IntEvents.
Wave
A single item with a feature called wave whose value is the generated waveform.
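The way these relations connect can be sketched with plain Python data. Everything in this example is hypothetical: the item names, the phone labels and the helper functions `parent` and `leaves` are invented for illustration and are not Festival functions. Tree relations are modelled as a mapping from a parent to its daughters.

```python
# Hypothetical data: list relations are lists, tree relations map a parent
# item to its daughters.
relations = {
    "Word": ["hello"],
    "Syllable": ["hello/1", "hello/2"],
    "Segment": ["h", "@", "l", "ou"],
    "SylStructure": {"hello": ["hello/1", "hello/2"],
                     "hello/1": ["h", "@"],
                     "hello/2": ["l", "ou"]},
    "IntEvent": ["accent1"],
    "Intonation": {"hello/2": ["accent1"]},
}

def parent(relation, item):
    # Find the item's parent within one tree relation, if it has one.
    for p, daughters in relations[relation].items():
        if item in daughters:
            return p
    return None

def leaves(relation, root):
    # Collect the leaf items below root within one tree relation.
    daughters = relations[relation].get(root, [])
    if not daughters:
        return [root]
    result = []
    for d in daughters:
        result.extend(leaves(relation, d))
    return result

print(leaves("SylStructure", "hello"))   # ['h', '@', 'l', 'ou']
print(parent("SylStructure", "l"))       # hello/2
print(parent("Intonation", "accent1"))   # hello/2
```

Note that the same syllable is a daughter of a word in SylStructure and a root in Intonation, which is what lets a module move from an intonation event to its syllable and on to the word containing it.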