The easiest way to extract features from a labelled database of the form described in the previous section is by loading in each of the utterance structures and dumping the desired features.
Using the same mechanism to extract the features as will eventually be
used by models built from the features has the important advantage of
avoiding spurious errors easily introduced when collecting data. For
example, a feature such as n.accent in a Festival utterance will
be defined as 0 when there is no next accent. Extracting all the
accents and using an external program to calculate the next accent may
make a different decision so that when the generated model is used a
different value for this feature will be produced. Such mismatches
between training and actual use are unfortunately common, so using
the same mechanism to extract data for training and for actual
use is worthwhile.
The recommended method for extracting features is the Festival script dumpfeats. It takes a list of feature names and a list of utterance files and dumps the desired features.
Features may be dumped into a single file or into separate files, one for each utterance. Feature names may be specified on the command line or in a separate file. Extra code to define new features may be loaded too.
For example, suppose we wanted to save the features for a set of utterances, including the duration, phone name, and previous and next phone names for all segments in each utterance.
    dumpfeats -feats "(segment_duration name p.name n.name)" \
              -output feats/%s.dur -relation Segment \
              festival/utts/*.utt
This will save these features in the directory feats/, in files named for the utterances they come from. The argument to -feats is treated as a literal list only if it starts with a left parenthesis; otherwise it is treated as the name of a file containing the feature names (unbracketed).
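As a minimal sketch of the file form of -feats, the same features can be written to a file and the filename passed instead of a bracketed list. The file name segfeats.list is hypothetical, and the one-name-per-line layout is an assumption: the text above only says the names are unbracketed. dumpfeats is run only if it is on the PATH and utterance files are present.

```shell
# Hypothetical file name; feature names are written without surrounding
# parentheses. One name per line is an assumed layout.
featfile="segfeats.list"
printf '%s\n' segment_duration name p.name n.name > "$featfile"

# Run dumpfeats only when the script and some utterance files are available.
if command -v dumpfeats >/dev/null 2>&1 && ls festival/utts/*.utt >/dev/null 2>&1; then
    mkdir -p feats
    # The -feats argument does not start with "(", so it is read as a filename.
    dumpfeats -feats "$featfile" \
              -output feats/%s.dur -relation Segment \
              festival/utts/*.utt
else
    echo "dumpfeats or utterance files not found; wrote $featfile only"
fi
```

Keeping the feature list in a file makes it easier to reuse exactly the same list when the model is later trained and applied.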
Extra code (for new feature definitions) may be loaded through the -eval option. If the argument to -eval starts with a left parenthesis it is treated as an s-expression rather than a filename and is evaluated. If the argument to -output contains "%s", it will be filled in with the utterance's filename; if it is a simple filename, the features from all utterances will be saved in that same file. The features for each item in the named relation are saved on a single line.
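Because each item occupies one line, the dumped files are easy to process with standard text tools. The sketch below computes a mean duration per phone name. It assumes the fields on each line are separated by whitespace and appear in the order the features were requested (segment_duration, name, p.name, n.name). The sample lines are made up for illustration and are not real dumpfeats output.

```shell
# Made-up sample lines standing in for a dumped feature file; the
# whitespace-separated layout is an assumption about the output format.
sample=$(mktemp)
cat > "$sample" <<'EOF'
0.10 pau 0 h
0.05 h pau @
0.15 @ h l
0.07 h @ ou
EOF

# Field 1 is taken to be segment_duration and field 2 the phone name.
# Accumulate a sum and a count per phone, then print the mean of each.
means=$(awk '{ sum[$2] += $1; n[$2]++ }
             END { for (p in sum) printf "%s %.3f\n", p, sum[p] / n[p] }' "$sample" | sort)
echo "$means"
rm -f "$sample"
```

With real data the temporary sample file would be replaced by the files written by dumpfeats, for example feats/*.dur.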