Train n-gram language model

Synopsis

ngram_build [input file0] [input file1] ... -o [output file] -w ifile [-p ifile] [-order int] [-smooth int] [-input_format string] [-otype string] [-sparse] [-dense] [-backoff int] [-floor double] [-freqsmooth int] [-trace] [-save_compressed] [-oov_mode string] [-oov_marker string] [-prev_tag string] [-prev_prev_tag string] [-last_tag string] [-default_tags]

ngram_build offers basic ngram language model estimation.

Input data format:

Two input formats are supported. In sentence_per_line format, the program will deal with start and end of sentence (if required) by using special vocabulary items specified by -prev_tag, -prev_prev_tag and -last_tag. For example, the input sentence:

the cat sat on the mat

would be treated as

... prev_prev_tag prev_prev_tag prev_tag the cat sat on the mat last_tag

where prev_prev_tag is the argument to -prev_prev_tag, and so on. A default set of tag names is also available. This input format is only useful for sliding-window type applications (e.g. language modelling for speech recognition).

The second input format is ngram_per_line, which is useful either for non-sliding-window applications or where the user requires a different treatment of start/end of sentence from that provided above. In this format the input file simply contains one complete ngram per line. For the same example as above (to build a trigram model) this would be:

prev_prev_tag prev_tag the
prev_tag the cat
the cat sat
cat sat on
sat on the
on the mat
the mat last_tag
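
The relationship between the two formats can be sketched in a few lines of Python. This is only an illustration of the padding and sliding window described above, not the code ngram_build runs; the function name sentence_to_ngrams is hypothetical, and the handling of orders other than 3 is an assumption extrapolated from the trigram example.

```python
def sentence_to_ngrams(sentence, order, prev_tag, prev_prev_tag, last_tag):
    """Pad one sentence and return its ngrams as lists of tokens."""
    words = sentence.split()
    # order-1 context tokens are needed before the first word: the token
    # nearest the sentence is prev_tag, any earlier ones are prev_prev_tag.
    n_context = order - 1
    padding = [prev_prev_tag] * max(n_context - 1, 0) + [prev_tag] * min(n_context, 1)
    tokens = padding + words + [last_tag]
    # Slide a window of width `order` over the padded token sequence.
    return [tokens[i:i + order] for i in range(len(tokens) - order + 1)]


trigrams = sentence_to_ngrams(
    "the cat sat on the mat", 3, "prev_tag", "prev_prev_tag", "last_tag"
)
for gram in trigrams:
    print(" ".join(gram))
```

Run on the example sentence with order 3, this prints the same seven lines as the ngram_per_line listing above.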

Representation:

The internal representation of the model becomes important for higher values of N where, if V is the vocabulary size, $V^N$ becomes very large. In such cases, we cannot explicitly hold probabilities for all possible ngrams, and a sparse representation must be used (i.e. only non-zero probabilities are stored).
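
To make the size argument concrete, the sketch below compares the number of cells a dense table would need with the number of entries a sparse table actually holds. It assumes a plain Python Counter as the sparse store, which says nothing about how ngram_build lays out its own data; dense_size and sparse_counts are hypothetical helper names.

```python
from collections import Counter


def dense_size(vocab_size, order):
    """Number of cells needed to hold every possible ngram: V^N."""
    return vocab_size ** order


def sparse_counts(tokens, order):
    """Count only the ngrams that actually occur in the token sequence."""
    return Counter(tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1))


tokens = "the cat sat on the mat".split()
vocab = set(tokens)

bigrams = sparse_counts(tokens, 2)
print(dense_size(len(vocab), 2))  # cells in a dense bigram table for this vocabulary
print(len(bigrams))               # entries actually stored in the sparse table
print(dense_size(10000, 3))       # dense trigram table for a 10000-word vocabulary
```

Even in this toy case the dense bigram table has 25 cells against 5 observed bigrams; for a 10000-word vocabulary a dense trigram table would need 10^12 cells.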

Getting more robust probability estimates:

The common techniques for obtaining better estimates of low- and zero-frequency ngrams are provided, namely smoothing and backing off.
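
As a rough illustration of the smoothing idea, the sketch below applies the textbook Good-Turing count adjustment r* = (r + 1) N(r+1) / N(r), where N(r) is the number of distinct ngrams seen exactly r times, to counts at or below a threshold. The exact procedure used by ngram_build is not described here, so this should be read as an assumption-laden sketch: the function name good_turing_adjust is hypothetical, and leaving a count unchanged when N(r+1) is zero is a simplification chosen for this example.

```python
from collections import Counter


def good_turing_adjust(counts, max_freq):
    """Return Good-Turing adjusted counts for frequencies up to max_freq."""
    # N(r): how many distinct ngrams were observed exactly r times.
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for ngram, r in counts.items():
        n_r = freq_of_freq[r]
        n_r1 = freq_of_freq.get(r + 1, 0)
        if r <= max_freq and n_r1 > 0:
            adjusted[ngram] = (r + 1) * n_r1 / n_r
        else:
            # Above the threshold, or no ngrams with count r + 1:
            # keep the raw count (an assumption of this sketch).
            adjusted[ngram] = float(r)
    return adjusted


raw = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 3}
smoothed = good_turing_adjust(raw, max_freq=1)
print(smoothed)
```

Here the three ngrams seen once have their counts reduced from 1 to 2/3, while counts above the threshold are left alone. The count mass removed this way is what becomes available for unseen ngrams.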

Testing an ngram model:

Use the ngram_test program.

Options

-w: ifile filename containing word list (required)
-p: ifile filename containing predictee word list (default is to use wordlist given by -w)
-order: int order, 1=unigram, 2=bigram etc. (default 2)
-smooth: int Good-Turing smooth the grammar up to the given frequency
-o: ofile Output file for constructed ngram
-input_format: string format of input data, one of sentence_per_line (default), sentence_per_file or ngram_per_line
-otype: string format of output file, one of cstr_ascii cstr_bin or htk_ascii
-sparse: build ngram in sparse representation
-dense: build ngram in dense representation (default)
-backoff: int build backoff ngram (requires -smooth)
-floor: double frequency floor value used with some ngrams
-freqsmooth: int build frequency backed off smoothed ngram, this requires -smooth option
-trace: give verbose output about build process
-save_compressed: save ngram in gzipped format
-oov_mode: string what to do about out-of-vocabulary words, one of skip_ngram, skip_sentence (default), skip_file, or use_oov_marker
-oov_marker: string special word for oov words (default !OOV) (use in conjunction with '-oov_mode use_oov_marker')

Pseudo-words:
-prev_tag: string tag before sentence start
-prev_prev_tag: string all words before 'prev_tag'
-last_tag: string tag after sentence end
-default_tags: use default tags of !ENTER,!EXIT and !EXIT respectively
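
The four -oov_mode values can be pictured with the sketch below. It is a guess at the behaviour based only on the option names listed above, not a description of the actual implementation: apply_oov_mode is a hypothetical function, the input is assumed to be the tokenised sentences of a single file, and sentence padding tags are left out for brevity.

```python
def apply_oov_mode(sentences, vocab, order, oov_mode="skip_sentence", oov_marker="!OOV"):
    """Return the ngram tuples kept from one file under the given OOV policy."""
    modes = ("skip_ngram", "skip_sentence", "skip_file", "use_oov_marker")
    if oov_mode not in modes:
        raise ValueError("unknown oov_mode: " + oov_mode)

    def has_oov(tokens):
        return any(t not in vocab for t in tokens)

    # skip_file: a single OOV word anywhere discards the whole file.
    if oov_mode == "skip_file" and any(has_oov(s) for s in sentences):
        return []

    ngrams = []
    for tokens in sentences:
        # skip_sentence: discard every ngram of a sentence containing an OOV word.
        if oov_mode == "skip_sentence" and has_oov(tokens):
            continue
        # use_oov_marker: map each OOV word to the marker and keep everything.
        if oov_mode == "use_oov_marker":
            tokens = [t if t in vocab else oov_marker for t in tokens]
        for i in range(len(tokens) - order + 1):
            gram = tuple(tokens[i:i + order])
            # skip_ngram: discard only the ngrams that contain an OOV word.
            if oov_mode == "skip_ngram" and has_oov(gram):
                continue
            ngrams.append(gram)
    return ngrams


vocab = {"the", "cat", "sat"}
sentences = [["the", "cat", "sat"], ["the", "cat", "dog"]]
for mode in ("skip_ngram", "skip_sentence", "skip_file", "use_oov_marker"):
    print(mode, apply_oov_mode(sentences, vocab, 2, mode))
```

Under these assumptions the modes form a scale from least to most data discarded: use_oov_marker keeps every ngram, skip_ngram drops only the affected ngrams, skip_sentence drops the affected sentence, and skip_file drops the whole file.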