Current voices - Festival Speech Synthesis System

24.1 Current voices

Currently there are a number of voices available in Festival and we expect that number to increase. Each is selected via a function named ‘voice_*’, which sets up the waveform synthesizer, phone set, lexicon, duration and intonation models (and anything else necessary) for that speaker. These voice setup functions are defined in lib/voices.scm.
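As a minimal sketch of how selection works in practice, the following commands could be typed at the Festival interpreter prompt. It assumes that the two voices named here are installed on your system (not every distribution includes every voice) and that audio output is configured; the spoken text is purely illustrative.

```scheme
;; Select the British English voice (Roger). Calling the function
;; switches the synthesizer, lexicon and prosody models in one step.
;; Assumes the rab diphone database is installed.
(voice_rab_diphone)
(SayText "This is the British English voice.")

;; Switch to an American English voice. All subsequent synthesis
;; uses this voice until another voice_* function is called.
;; Assumes the kal diphone database is installed.
(voice_kal_diphone)
(SayText "This is the American English voice.")
```

Because each voice function takes no arguments and resets all the speaker-specific settings, switching voices mid-session only requires calling another voice function.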

The current voice functions are

voice_rab_diphone: A British English male RP speaker, Roger. This uses the UniSyn residual excited LPC diphone synthesizer. The lexicon is the computer users' version of the Oxford Advanced Learners' Dictionary, with letter to sound rules trained from that lexicon. Intonation is provided by a ToBI-like system using a decision tree to predict accent and end tone position. The F0 itself is predicted as three points on each syllable, using linear regression trained from the Boston University FM Radio corpus (f2b) and mapped to Roger's pitch range. Duration is predicted by a decision tree that predicts z-score durations for segments, trained from the 460 TIMIT sentences spoken by another British male speaker.
voice_ked_diphone: An American English male speaker, Kurt. Again this uses the UniSyn residual excited LPC diphone synthesizer. It uses the CMU lexicon, and letter to sound rules trained from it. Intonation, as with Roger, is trained from the Boston University FM Radio corpus. Duration for this voice also comes from that database.
voice_kal_diphone: An American English male speaker. Again this uses the UniSyn residual excited LPC diphone synthesizer. Like ked, it uses the CMU lexicon, and letter to sound rules trained from it. Intonation, as with Roger, is trained from the Boston University FM Radio corpus. Duration for this voice also comes from that database. This voice was built in two days' work and is at least as good as ked, because we understood the process better. The diphone labels were autoaligned with hand correction.
voice_don_diphone: Steve Isard's LPC based diphone synthesizer, Donovan diphones. The other parts of this voice (lexicon, intonation, and duration) are the same as for voice_rab_diphone described above. The quality of the diphones is not as good as the other voices because it uses spike excited LPC. However, it is much faster, and its database is much smaller than the others.
voice_el_diphone: A male Castilian Spanish speaker, using the Eduardo Lopez diphones. Alistair Conkie and Borja Etxebarria did much to make this. It has improved recently but is not as comprehensive as our English voices.
voice_gsw_diphone: This offers a male RP speaker, Gordon, famed for many previous CSTR synthesizers, using the standard diphone module. Its higher levels are very similar to the Roger voice above. This voice is not in the standard distribution, and is unlikely to be added for commercial reasons, even though it sounds better than Roger.
voice_en1_mbrola: The Roger diphone set, using the same front end as voice_rab_diphone but with the MBROLA diphone synthesizer for waveform synthesis. The MBROLA synthesizer and the Roger diphone database (called en1) are not distributed by CSTR but are available free for non-commercial use from http://tcts.fpms.ac.be/synthesis/mbrola.html. We do however provide the Festival part of the voice in festvox_en1.tar.gz.
voice_us1_mbrola: A female American English voice using our standard US English front end and the us1 database for the MBROLA diphone synthesizer for waveform synthesis. The MBROLA synthesizer and the us1 diphone database are not distributed by CSTR but are available free for non-commercial use from http://tcts.fpms.ac.be/synthesis/mbrola.html. We provide the Festival part of the voice in festvox_us1.tar.gz.
voice_us2_mbrola: A male American English voice using our standard US English front end and the us2 database for the MBROLA diphone synthesizer for waveform synthesis. The MBROLA synthesizer and the us2 diphone database are not distributed by CSTR but are available free for non-commercial use from http://tcts.fpms.ac.be/synthesis/mbrola.html. We provide the Festival part of the voice in festvox_us2.tar.gz.
voice_us3_mbrola: Another male American English voice using our standard US English front end and the us3 database for the MBROLA diphone synthesizer for waveform synthesis. The MBROLA synthesizer and the us3 diphone database are not distributed by CSTR but are available free for non-commercial use from http://tcts.fpms.ac.be/synthesis/mbrola.html. We provide the Festival part of the voice in festvox_us3.tar.gz.

Other voices will become available over time. Groups other than CSTR are working on new voices. In particular, OGI's CSLU have released a number of American English voices, two Mexican Spanish voices and two German voices. All use OGI's own residual excited LPC synthesizer, which is distributed as a plug-in for Festival (see http://www.cse.ogi.edu/CSLU/research/TTS for details).

Other languages are being worked on. Voices for German, Basque, Welsh, Greek and Polish have already been developed and could be released soon. CSTR has a set of Klingon diphones, though the text analysis for Klingon still requires some work. (If anyone has access to a good Klingon continuous speech corpus please let us know.)

Pointers to, and examples of, voices developed at CSTR and elsewhere will be posted on the Festival home page.