UniSyn synthesizer - Festival Speech Synthesis System

20 UniSyn synthesizer

Since version 1.3 a new general synthesizer module has been included. It is designed to replace the older diphone synthesizer described in the next chapter. The redesign provides a generalized waveform synthesizer and signal processing module that can be used even when the units being concatenated are not diphones. At the same stage the full set of diphone (or other) database pre-processing functions was added to the Speech Tools library.

20.1 UniSyn database format

The UniSyn synthesis modules can use databases in two basic formats, separate and grouped. In separate format, all files (signal, pitchmark and coefficient files) are accessed individually during synthesis. This is the standard use during database development. In group format, a database is collected together into a single special file containing all the information necessary for waveform synthesis. This format is designed for distribution and general use of the database.

A database should consist of a set of waveforms (which may be translated into a set of coefficients if the desired signal processing method requires it), a set of pitchmarks and an index. The pitchmarks are necessary because most of our current signal processing methods are pitch synchronous.

20.1.1 Generating pitchmarks

Pitchmarks may be derived from laryngograph files using the program pitchmark distributed with the speech tools. Choosing the parameters to this program is still a bit of an art form. The first major issue is which way up the lar files are. We have seen both, though CSTR's files are most often upside down while others (e.g. OGI's) are the right way up. The -inv argument to pitchmark is provided specifically to cater for this. There are other issues in getting the pitchmarks aligned. The basic command for generating pitchmarks is

     pitchmark -inv lar/file001.lar -o pm/file001.pm -otype est \
          -min 0.005 -max 0.012 -fill -def 0.01 -wave_end

The -min, -max and -def values (the last being the fill value for unvoiced regions) may need to be changed depending on the speaker's pitch range. The values above are suitable for a male speaker. The -fill option states that unvoiced sections should be filled with equally spaced pitchmarks.
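
As a rough guide to choosing these values, the sketch below converts an expected F0 range into period bounds. It assumes that -min and -max are pitch periods in seconds, which is what the example values suggest (0.005 s is 200 Hz and 0.012 s is about 83 Hz). The function name period_bounds is made up for this illustration and is not part of the speech tools.

```python
def period_bounds(f0_low_hz, f0_high_hz):
    """Return (min_period, max_period) in seconds for an F0 range in Hz."""
    if not 0 < f0_low_hz < f0_high_hz:
        raise ValueError("expected 0 < f0_low_hz < f0_high_hz")
    # The shortest period comes from the highest F0, and vice versa.
    return 1.0 / f0_high_hz, 1.0 / f0_low_hz


# The range implied by the male-speaker example above (about 83 to 200 Hz).
min_p, max_p = period_bounds(83.3, 200.0)
print(f"-min {min_p:.3f} -max {max_p:.3f}")
```

For a higher-pitched speaker both bounds shrink; the right range still has to be found by inspecting the resulting pitchmarks.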

20.1.2 Generating LPC coefficients

LPC coefficients are generated using the sig2fv command. Two stages are required, generating the LPC coefficients and generating the residual. The prototypical commands for these are

     sig2fv wav/file001.wav -o lpc/file001.lpc -otype est -lpc_order 16 \
         -coefs "lpc" -pm pm/file001.pm -preemph 0.95 -factor 3 \
         -window_type hamming
     sigfilter wav/file001.wav -o lpc/file001.res -otype nist \
         -lpcfilter lpc/file001.lpc -inv_filter

For some databases you may need to normalize the power. Properly normalizing power is difficult, but we provide a simple function which may do the job acceptably. You should do this on the waveform before LPC analysis (and ensure you also do the residual extraction on the normalized waveform rather than the original).

     ch_wave -scaleN 0.5 wav/file001.wav -o file001.Nwav

This normalizes the power by maximizing the signal first then multiplying it by the given factor. If the database waveforms are clean (i.e. no clicks) this can give reasonable results.
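
The effect of a click can be seen in a small sketch of the idea. This is not the ch_wave implementation, only an illustration of "maximize, then multiply by a factor" under the assumption of 16-bit samples; the function name scale_to_peak and the sample values are invented for the example.

```python
def scale_to_peak(samples, factor, full_scale=32767):
    """Peak-normalize 16-bit samples, then multiply by `factor`."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    gain = factor * full_scale / peak
    return [int(round(s * gain)) for s in samples]


quiet = [0, 1000, -2000, 500]          # clean recording
loud_click = [0, 1000, -2000, 30000]   # same speech plus one click

print(scale_to_peak(quiet, 0.5))
print(scale_to_peak(loud_click, 0.5))
```

In the second case the click sets the peak, so the speech samples end up much quieter than in the clean recording. This is why the method is only reasonable for clean waveforms.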

20.2 Generating a diphone index

The diphone index consists of a short header followed by an ASCII list of the diphones, each entry giving the diphone name, the file it comes from, and its start, middle and end times in seconds. For most databases this file needs to be generated by some database-specific script.

An example header is

     EST_File index
     DataType ascii
     NumEntries 2005
     IndexName rab_diphone
     EST_Header_End

The most notable part is the number of entries, which can get out of sync with the actual number of entries if you hand edit the file. That is, if you add an entry and the system still can't find it, check that the number of entries is right.
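
A quick consistency check can catch this. The sketch below is an assumption-laden illustration, not a speech tools utility: it assumes every non-blank line after EST_Header_End counts as one entry, and the function name check_num_entries is made up.

```python
def check_num_entries(index_text):
    """Return (declared, actual) entry counts for a diphone index."""
    lines = [line.strip() for line in index_text.splitlines()]
    end = lines.index("EST_Header_End")  # raises ValueError if absent
    header = {}
    for line in lines[:end]:
        if line:
            key, _, value = line.partition(" ")
            header[key] = value.strip()
    declared = int(header["NumEntries"])
    # Assumption: each non-blank line in the body is one entry.
    actual = sum(1 for line in lines[end + 1:] if line)
    return declared, actual


sample = """EST_File index
DataType ascii
NumEntries 3
IndexName rab_diphone
EST_Header_End
r-uh    edx_1001        0.225   0.261   0.320
r-e     edx_1002        0.224   0.273   0.326
"""

declared, actual = check_num_entries(sample)
if declared != actual:
    print(f"NumEntries says {declared} but the file has {actual} entries")
```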

The entries themselves may take one of two forms, full entries or index entries. Full entries consist of a diphone name, where the phones are separated by "-"; a file name, which is used to index into the pitchmark, LPC and waveform files; and the start, middle (the change-over point between the phones) and end of the diphone in that file, in seconds. For example

     r-uh    edx_1001        0.225   0.261   0.320
     r-e     edx_1002        0.224   0.273   0.326
     r-i     edx_1003        0.240   0.280   0.321
     r-o     edx_1004        0.212   0.253   0.320

The second form of entry is an index entry which simply states that reference to that diphone should actually be made to another. For example

     aa-ll   &aa-l

This states that the diphone aa-ll should actually use the diphone aa-l. Note that there are a number of ways to specify alternates for missing diphones, and this method is best used for fixing single or small classes of missing or broken diphones. Index entries may appear anywhere in the file but can't be nested.

Some checks are made on reading this index to ensure that times etc. are reasonable, but multiple entries for the same diphone are not checked for; in that case the later one will be selected.
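
The two entry forms and the duplicate rule can be sketched as a small parser. This is an illustration only, not the reader used by UniSyn. The time check (start before middle before end) is an assumed reading of "reasonable", and the aa-l entry with its file name edx_2001 is invented for the example.

```python
def parse_entries(lines):
    """Parse index body lines into (full, aliases) dictionaries."""
    full, aliases = {}, {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        name = fields[0]
        if len(fields) == 2 and fields[1].startswith("&"):
            aliases[name] = fields[1][1:]  # index entry, e.g. "aa-ll &aa-l"
        elif len(fields) == 5:
            start, mid, end = map(float, fields[2:])
            if not 0 <= start < mid < end:  # assumed sanity check
                raise ValueError(f"{name}: times out of order")
            full[name] = (fields[1], start, mid, end)  # later duplicate wins
        else:
            raise ValueError(f"unrecognized entry: {line!r}")
    return full, aliases


def lookup(name, full, aliases):
    """Resolve at most one level of aliasing (index entries do not nest)."""
    if name in full:
        return full[name]
    return full.get(aliases.get(name))


body = [
    "aa-l    edx_2001        0.100   0.150   0.200",  # hypothetical entry
    "aa-ll   &aa-l",
    "r-uh    edx_1001        0.225   0.261   0.320",
    "r-uh    edx_1001        0.230   0.261   0.320",  # duplicate, overrides
]
full, aliases = parse_entries(body)
print(lookup("aa-ll", full, aliases))
print(full["r-uh"])
```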

20.3 Database declaration

There are two major types of database, grouped and ungrouped. Grouped databases come as a single file containing the diphone index, coefficients and residuals for the diphones. This is the standard way databases are distributed as voices in Festival. Ungrouped databases access diphones from individual files and are designed as a method for debugging and testing databases before distribution. Using an ungrouped database is slower but allows quicker changes to the index and the associated coefficient files and residuals, without rebuilding the group file.

A database is declared to the system through the command us_diphone_init. This function takes a parameter list of various features used for setting up a database. The features are

name: An atomic name for this database, used in selecting it from the current set of loaded databases.
index_file: A filename containing either a diphone index, as described above, or a group file. The feature grouped states whether this is a group file or a simple index file.
grouped: Takes the value "true" or "false". This states whether the index file is a grouped file ("true") or a simple index ("false").
coef_dir: The directory containing the coefficients (LPC, or just pitchmarks in the PSOLA case).
sig_dir: The directory containing the signal files (residual for LPC, full waveforms for PSOLA).
coef_ext: The extension for coefficient files, typically ".lpc" for LPC files and ".pm" for pitchmark files.
sig_ext: The extension for signal files, typically ".res" for LPC residual files and ".wav" for waveform files.
default_diphone: The diphone to be used when the requested one doesn't exist. No matter how careful you are, you should always include a default diphone in a distributed diphone database. Synthesis will throw an error if no diphone is found and there is no default. Although it is usually an error when this is required, it's better to fill in something than to stop synthesizing. Typical values for this are silence to silence or schwa to schwa.
alternates_left: A list of pairs showing the alternate phone names for the left phone in a diphone pair. This list is used to rewrite the diphone name when the directly requested one doesn't exist. This is the recommended method for dealing with systematic holes in a diphone database.
alternates_right: A list of pairs showing the alternate phone names for the right phone in a diphone pair. This list is used to rewrite the diphone name when the directly requested one doesn't exist. This is the recommended method for dealing with systematic holes in a diphone database.

An example database definition is

     (set! rab_diphone_dir "/projects/festival/lib/voices/english/rab_diphone")
     (set! rab_lpc_group
           (list
            '(name "rab_lpc_group")
            (list 'index_file
                  (path-append rab_diphone_dir "group/rablpc16k.group"))
            '(alternates_left ((i ii) (ll l) (u uu) (i@ ii) (uh @) (a aa)
                                      (u@ uu) (w @) (o oo) (e@ ei) (e ei)
                                      (r @)))
            '(alternates_right ((i ii) (ll l) (u uu) (i@ ii)
                                       (y i) (uh @) (r @) (w @)))
            '(default_diphone @-@@)
            '(grouped "true")))
     (us_diphone_init rab_lpc_group)

20.4 Making groupfiles

The function us_make_group_file will make a group file of the currently selected US diphone database. It loads in all diphones in the database and saves them in the named file. An optional second argument allows specification of how the group file will be saved. These options are given as a feature list. There are three possible options

track_file_format: The format for the coefficient files. By default this is est_binary; currently the only alternative is est_ascii.
sig_file_format: The format for the signal parts of the database. By default this is snd (Sun's Audio format). This was chosen as it has the smallest header and supports various sample formats. Any format supported by the Edinburgh Speech Tools is allowed.
sig_sample_format: The format for the samples in the signal files. By default this is mulaw. This is suitable when the signal files are LPC residuals. LPC residuals have a much smaller dynamic range than plain PCM files. Because the mulaw representation is half the size (8 bits) of standard PCM files (16 bits), this significantly reduces the size of the group file while only marginally altering the quality of synthesis (and from experiments the effect is not perceptible). However, when saving group files where the signals are not LPC residuals (e.g. in PSOLA), using this default mulaw is not recommended and short should probably be used.
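
To illustrate why 8-bit mu-law suits low-amplitude signals such as residuals, the sketch below compares it with uniform 8-bit quantization on one small sample value. It uses the continuous mu-law formula with mu = 255 rather than the exact codec used for snd files, so it should be read as an approximation of the idea only. All function names here are invented for the example.

```python
import math

MU = 255.0


def mulaw_encode(x):
    """Map x in [-1, 1] to a signed 8-bit code (-127..127)."""
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round(y * 127))


def mulaw_decode(code):
    """Invert mulaw_encode, returning a value in [-1, 1]."""
    y = code / 127.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def linear8_roundtrip(x):
    """Uniform quantization of x to the same number of levels."""
    return round(x * 127) / 127.0


x = 0.01  # a low-amplitude sample, of the kind a residual mostly contains
mu_err = abs(mulaw_decode(mulaw_encode(x)) - x)
lin_err = abs(linear8_roundtrip(x) - x)
print(f"mu-law error {mu_err:.6f}, uniform 8-bit error {lin_err:.6f}")
```

Mu-law spends most of its codes on small amplitudes, so a signal that rarely approaches full scale loses little. Full waveforms use the whole range, which is why short is the safer choice for them.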

20.5 UniSyn module selection

In a voice selection a UniSyn database may be selected as follows

       (set! UniSyn_module_hooks (list rab_diphone_const_clusters ))
       (set! us_abs_offset 0.0)
       (set! window_factor 1.0)
       (set! us_rel_offset 0.0)
       (set! us_gain 0.9)
     
       (Parameter.set 'Synth_Method 'UniSyn)
       (Parameter.set 'us_sigpr 'lpc)
       (us_db_select rab_db_name)

The UniSyn_module_hooks are run before synthesis; see the next section, on diphone selection. At present only lpc is supported by the UniSyn module, though potentially there may be others.

An optional implementation of TD-PSOLA moulines90 has been written, but fear of legal problems unfortunately prevents it from being included in the public distribution. This policy should not be taken as acknowledging or not acknowledging any alleged patent violation.

20.6 Diphone selection

Diphone names are constructed for each phone-phone pair in the Segment relation in an utterance. In forming a diphone name, UniSyn first checks each segment for the feature us_diphone_left (or us_diphone_right for the right-hand part of the diphone); if that doesn't exist it checks for the feature us_diphone, and if that doesn't exist it uses the feature name. Thus it is possible to specify diphone names which are not simply the concatenation of two segment names.
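
The lookup order can be sketched as follows. Segments are modelled as plain dictionaries of features, which is an assumption made for illustration and not how Festival stores them. The feature value "r_" is an invented cluster name, not one taken from a real voice.

```python
def phone_name(segment, side):
    """Pick the name a segment contributes to a diphone.

    `segment` is a dict of features; `side` is "left" or "right".
    """
    for feature in (f"us_diphone_{side}", "us_diphone", "name"):
        if feature in segment:
            return segment[feature]
    raise KeyError("segment has no name feature")


def diphone_names(segments):
    """Build one diphone name per adjacent pair of segments."""
    return [
        f"{phone_name(a, 'left')}-{phone_name(b, 'right')}"
        for a, b in zip(segments, segments[1:])
    ]


segments = [
    {"name": "t"},
    {"name": "r", "us_diphone_right": "r_"},  # hypothetical cluster variant
    {"name": "ii"},
]
print(diphone_names(segments))
```

Here the segment r contributes "r_" when it is the right half of a diphone, but its plain name when it is the left half.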

This feature is used to specify consonant cluster diphone names for our English voices. The hook UniSyn_module_hooks is run before selection and we specify a function to add us_diphone_* features as appropriate. See the function rab_diphone_fix_phone_name in lib/voices/english/rab_diphone/festvox/rab_diphone.scm for an example.

Once the diphone name is created it is used to select the diphone from the database. If it is not found, the name is converted using the lists alternates_left and alternates_right as specified in the database declaration. If that doesn't specify a diphone in the database either, the default_diphone is selected and a warning is printed. If no default diphone is specified, or the default diphone doesn't exist in the database, an error is thrown.
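
The fallback chain can be summarized in a short sketch. It is not the UniSyn code: the database is modelled as a set of names, the alternates as dictionaries, and both alternate lists are applied in a single rewrite, which is an assumption about a detail the text above does not specify. The function name select_diphone is made up.

```python
def select_diphone(name, database, alternates_left, alternates_right,
                   default_diphone=None):
    """Return the database name to use for `name`, following the fallbacks."""
    if name in database:
        return name
    # Assumes phone names themselves contain no "-".
    left, right = name.split("-", 1)
    rewritten = (f"{alternates_left.get(left, left)}-"
                 f"{alternates_right.get(right, right)}")
    if rewritten in database:
        return rewritten
    if default_diphone in database:
        print(f"warning: {name} not found, using {default_diphone}")
        return default_diphone
    raise LookupError(f"no diphone for {name} and no usable default")


database = {"ii-l", "@-@@"}
left = {"i": "ii"}
right = {"ll": "l"}

print(select_diphone("i-ll", database, left, right, "@-@@"))
print(select_diphone("zz-zz", database, left, right, "@-@@"))
```

Calling the function with no default, or with a default that is missing from the database, raises an error, mirroring the last case described above.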