The standard method for diphone resynthesis in the released system is residual excited LPC (hunt89). The actual method of resynthesis isn't important to the database format, but if residual LPC synthesis is to be used then it is necessary to make the LPC coefficient files and their corresponding residuals.
Previous versions of the system used a "host of hacky little scripts" to do this, but now that the Edinburgh Speech Tools support LPC analysis we can provide a walkthrough for generating these files.
We assume that the waveform files of nonsense words are in a directory called wave/. The LPC coefficients and residuals will be, in this example, stored in lpc16k/ with extensions .lpc and .res respectively.
Before starting it is worth considering power normalization. We have found this important on all of the databases we have collected so far. The ch_wave program, part of the speech tools, with the option -scaleN 0.4, may be used if a more complex method is not available.
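As a minimal sketch of the simple normalization just mentioned, the function below runs ch_wave with -scaleN 0.4 over every waveform in a directory. The function name normalize_waves and the output directory wave_norm/ are assumptions made for this example and are not part of the released system; -o is assumed to name the output file, as in ch_wave's usual usage.

```shell
# Sketch only: normalize_waves and the wave_norm/ output directory are
# hypothetical names chosen for this example.
normalize_waves () {
    indir=${1:-wave}
    outdir=${2:-wave_norm}
    mkdir -p "$outdir" || return 1
    for i in "$indir"/*.wav
    do
        [ -e "$i" ] || continue      # skip if the glob matched nothing
        fname=`basename "$i" .wav`
        echo "$i"
        # -scaleN 0.4 is the option named in the text; -o names the output file
        ch_wave -scaleN 0.4 -o "$outdir/$fname.wav" "$i" || return 1
    done
}
```

Calling normalize_waves wave wave_norm would then leave the originals untouched and write the scaled copies alongside them, so the result can be checked by ear before the LPC analysis is run on it.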
The following shell command generates the files:
    for i in wave/*.wav
    do
        fname=`basename $i .wav`
        echo $i
        lpc_analysis -reflection -shift 0.01 -order 18 -o lpc16k/$fname.lpc \
            -r lpc16k/$fname.res -otype htk -rtype nist $i
    done
A common rule of thumb is that the LPC order should be the sample rate in Hz divided by one thousand, plus 2, which gives the order of 18 used above for 16kHz data. This may or may not be appropriate, and if you are particularly worried about the database size it is worth experimenting with the order.
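The rule of thumb is simple enough to compute in the shell. The helper below is only an illustration; lpc_order is a hypothetical name, and the integer division means a rate such as 22050 is treated as 22kHz.

```shell
# Rule of thumb from the text: order = sample rate in Hz / 1000 + 2.
# lpc_order is a hypothetical helper name, not a speech tools program.
lpc_order () {
    echo $(( $1 / 1000 + 2 ))
}

lpc_order 16000    # prints 18
lpc_order 8000     # prints 10
```

The result could be passed to the -order option of lpc_analysis in place of the literal 18, for example when processing databases recorded at different sample rates.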
The program lpc_analysis, found in speech_tools/bin, can be used to generate the LPC coefficients and residual. Note these should be reflection coefficients so they may be quantised (as they are in group files).
The coefficient and residual files produced by different LPC analysis programs may start at different offsets. For example, Entropic's ESPS functions generate LPC coefficients that are offset by one frame shift (e.g. 0.01 seconds). Our own lpc_analysis routine has no offset. The Diphone_Init parameter list allows these offsets to be specified. When the LPC files are generated with lpc_analysis as above, the description parameters should include
    (lpc_frame_offset 0) (lpc_res_offset 0.0)
When they are generated using ESPS routines, the description should include
    (lpc_frame_offset 1) (lpc_res_offset 0.01)
If these parameters are not explicitly mentioned, the defaults actually follow the ESPS form: lpc_frame_offset is 1 and lpc_res_offset is equal to the frame shift.
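To keep the two cases straight when writing a voice description, the choice can be captured in a small helper. This is a sketch only: lpc_offset_params and its tool names are hypothetical, and the values it prints are simply those given above, with the ESPS residual offset set to the frame shift.

```shell
# lpc_offset_params is a hypothetical helper; the values it prints are the
# ones given in the text for each analysis tool.
lpc_offset_params () {
    tool=$1
    shift_secs=${2:-0.01}     # frame shift in seconds
    case "$tool" in
        speech_tools) echo "(lpc_frame_offset 0) (lpc_res_offset 0.0)" ;;
        esps)         echo "(lpc_frame_offset 1) (lpc_res_offset $shift_secs)" ;;
        *)            echo "unknown analysis tool: $tool" >&2; return 1 ;;
    esac
}
```

For example, lpc_offset_params speech_tools prints the pair to include for files made with lpc_analysis, while lpc_offset_params esps 0.01 prints the pair for ESPS files analysed with a 0.01 second shift.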
Note the biggest problem we had in implementing the residual excited LPC resynthesizer was getting the right part of the residual to line up with the right LPC coefficients for the pitch mark. Errors here degrade the synthesized waveform noticeably, but not severely, which makes it difficult to determine whether the cause is an offset problem or some other bug.
Although we have started investigating whether extracting pitch synchronous LPC parameters rather than fixed shift parameters gives better performance, we haven't finished this work. lpc_analysis supports pitch synchronous analysis, but the raw "ungrouped" access method does not yet. At present the LPC parameters at a particular pitch mark are obtained by interpolating over the closest LPC parameters. The "group" files hold these interpolated parameters pitch synchronously.
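To make the interpolation step concrete, the sketch below finds the two fixed-shift frames on either side of a pitch mark and a weight for the later one. It rests on assumptions not stated in the text: that the weighting is linear, and that frame k lies at k times the frame shift with no frame offset. interp_frames is a hypothetical name, not a function in the system.

```shell
# Sketch only: interp_frames is a hypothetical helper and linear weighting is
# an assumption; the text says only that the closest parameters are interpolated.
# Assumes frame k lies at k * step seconds (no frame offset).
interp_frames () {
    awk -v t="$1" -v step="$2" 'BEGIN {
        pos = t / step           # pitch mark position measured in frames
        lo = int(pos)            # index of the frame at or before the mark
        printf "%d %d %.2f\n", lo, lo + 1, pos - lo
    }'
}
```

Running interp_frames 0.025 0.01 prints the earlier frame index, the later frame index and the weight given to the later frame. Under these assumptions each interpolated coefficient would be (1 - weight) times the earlier frame's value plus weight times the later frame's value.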
The American English voice kd was created using the speech tools lpc_analysis program, and its setup should be looked at if you are going to copy it. The British English voice rb was constructed using ESPS routines.