Edinburgh Speech Tools  2.1-release
wagon

Table of Contents

CART building program

Synopsis

wagon [options] [-desc ifile] [-data ifile] [-stop int] [-test ifile] [-frs float] [-dlist ] [-dtree ] [-output ofile] [-o ofile] [-distmatrix ifile] [-track ifile] [-track_start int] [-track_end int] [-track_feats string] [-unittrack ifile] [-quiet ] [-verbose ] [-predictee string] [-ignore string] [-count_field string] [-stepwise ] [-swlimit float] [-swopt string] [-balance float] [-vertex_output string] [-held_out int] [-heap int] [-noprune ]

wagon is used to build CART tress from feature data, its basic features include:

  • both decisions trees and decision lists are supported
  • predictees can be discrete or continuous
  • input features may be discrete or continuous
  • many options for controlling tree building
  • fixed stop value
  • balancing
  • held-out data and pruning
  • stepwise use of input features
  • choice of optimization criteria correct/entropy (for classification and rmse/correlation (for regression)

A detailed description of building CART models can be found in the Overview section.

Options

Building Trees

To build a decision tree (or list) Wagon requires data and a description of it. A data file consists a set of samples, one per line each consisting of the same set of features. Features may be categorial or continuous. By default the first feature is the predictee and the others are used as predictors. A typical data file will look like this

0.399 pau sh  0   0     0 1 1 0 0 0 0 0 0 
0.082 sh  iy  pau onset 0 1 0 0 1 1 0 0 1
0.074 iy  hh  sh  coda  1 0 1 0 1 1 0 0 1
0.048 hh  ae  iy  onset 0 1 0 1 1 1 0 1 1
0.062 ae  d   hh  coda  1 0 0 1 1 1 0 1 1
0.020 d   y   ae  coda  2 0 1 1 1 1 0 1 1
0.082 y   ax  d   onset 0 1 0 1 1 1 1 1 1
0.082 ax  r   y   coda  1 0 0 1 1 1 1 1 1
0.036 r   d   ax  coda  2 0 1 1 1 1 1 1 1
...

The data may come from any source, such as the festival script dumpfeats which allows the creation of such files easily from utterance files.

In addition to a data file a description file is also require that gives a name and a type to each of the features in the datafile. For the above example it would look like

((segment_duration float)
 ( name  aa ae ah ao aw ax ay b ch d dh dx eh el em en er ey f g 
    hh ih iy jh k l m n nx ng ow oy p r s sh t th uh uw v w y z zh pau )
 ( n.name 0 aa ae ah ao aw ax ay b ch d dh dx eh el em en er ey f g 
    hh ih iy jh k l m n nx ng ow oy p r s sh t th uh uw v w y z zh pau )
 ( p.name 0 aa ae ah ao aw ax ay b ch d dh dx eh el em en er ey f g 
    hh ih iy jh k l m n nx ng ow oy p r s sh t th uh uw v w y z zh pau )
 (position_type 0 onset coda)
 (pos_in_syl float)
 (syl_initial 0 1)
 (syl_final   0 1)
 (R:Sylstructure.parent.R:Syllable.p.syl_break float)
 (R:Sylstructure.parent.syl_break float)
 (R:Sylstructure.parent.R:Syllable.n.syl_break float)
 (R:Sylstructure.parent.R:Syllable.p.stress 0 1)
 (R:Sylstructure.parent.stress 0 1)
 (R:Sylstructure.parent.R:Syllable.n.stress 0 1)
)

The feature names are arbitrary, but as they appear in the generated trees is most useful if the trees are to be used in prediction of an utterance that the names are features and/or pathnames.

Wagon can be used to build a tree with such files with the command

wagon -data feats.data -desc fest.desc -stop 10 -output feats.tree

A test data set may also be given which must match the given data description. If specified the built tree will be tested on the test set and results on that will be presented on completion, without a test set the results are given with respect to the training data. However in stepwise case the test set is used in the multi-level training process thus it cannot be considered as true test data and more reasonable results should found on applying the generate tree to truly held out data (via the program wagon_test).