For English, a number of assumptions made about the lexicon are worth mentioning explicitly. If you are going to use the existing token rules, you should try to include at least the following in any lexicon that is to work with them.
The letters of the alphabet. These need their own part of speech, to distinguish a as a determiner (which can be schwa'd) from a as a letter (which cannot). The part of speech should be nn by default, but the value of the variable token.letter_pos is used and may be changed if this is not what is required.
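As a minimal sketch, letter entries might be added to a lexicon as follows. The phone names assume the MRPA phoneset used by the OALD lexicon and are illustrative only, and a lexicon is assumed to have been selected already.

```scheme
;; Assumes a lexicon has already been selected and uses MRPA phones;
;; the pronunciations are illustrative, not taken from a shipped lexicon.
(lex.add.entry '("a" nn (((ei) 1))))    ; the letter, never reduced to schwa
(lex.add.entry '("b" nn (((b ii) 1))))

;; Part of speech used when looking up single letters;
;; change it if your lexicon tags letters differently.
(set! token.letter_pos 'nn)
```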
See the list in mrpa_addend in festival/lib/dicts/oald/oaldlex.scm. This list should also contain the control characters and eight-bit characters.
's should be in your lexicon as schwa and voiced fricative (z). It should be in twice, once as part of speech type pos and once as n (used in plurals of numbers, acronyms, etc., e.g. 1950's). 's is treated as a word and is separated from the tokens it appears with. The post-lexical rule (the function postlex_apos_s_check) will delete the schwa and devoice the z in appropriate contexts. Note that this post-lexical rule brazenly assumes that the unvoiced fricative in the phoneset is s. If it is not in your phoneset, copy the function (it is in festival/lib/postlex.scm), change it for your phoneset, and use your version as a post-lexical rule.
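The two entries and the rule registration could look roughly like this. It is a sketch that assumes an MRPA-style phoneset in which @ is schwa; substitute your own phone names, and your own copy of the rule if your unvoiced fricative is not s.

```scheme
;; Assumes an MRPA-style phoneset where @ is schwa and z/s are the
;; voiced/unvoiced fricatives; substitute your own phone names.
(lex.add.entry '("'s" pos (((@ z) 0))))   ; possessive
(lex.add.entry '("'s" n (((@ z) 0))))     ; plurals such as 1950's

;; Run the stock rule (or your modified copy of it) as a post-lexical rule.
(set! postlex_rules_hooks (list postlex_apos_s_check))
```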
The word named by the variable token.unknown_word_name. This is used in a few obscure cases when there just isn't anything that can be said (e.g. single characters which aren't in the lexicon). Some people have suggested it should be possible to make this a sound rather than a word. I agree, but Festival doesn't support that yet.
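For illustration, the variable can be set and a matching entry added as below. The word chosen and its MRPA pronunciation are assumptions for the sketch; any word present in your lexicon would do.

```scheme
;; Word spoken when nothing else can be said for a token.
(set! token.unknown_word_name "unknown")

;; The word should itself be in the lexicon (MRPA phones, illustrative).
(lex.add.entry '("unknown" n (((uh n) 0) ((n ou n) 1))))
```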