Part of speech tagging is a fairly well-defined process. Festival includes a part of speech tagger following the HMM-type taggers as found in the Xerox tagger and others (e.g. DeRose88). Part of speech tags are assigned based on the probability distribution of tags given a word and on ngrams of tags. These models are externally specified, and a Viterbi decoder is used to assign part of speech tags at run time.
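The decoding step described above can be illustrated with a small, self-contained Python sketch. It is not Festival's implementation: the function name, the probability tables, the smoothing floor, and the "punc" start tag are all hypothetical, and the sketch uses a tag bigram model, whereas the ngram order of a real model may be higher. It also assumes every word appears in the tag-given-word table.

```python
import math

def viterbi_tag(words, tag_given_word, tag_bigram, start_tag):
    """Return the most probable tag sequence for `words`.

    tag_given_word: dict word -> {tag: P(tag | word)}
    tag_bigram:     dict (prev_tag, tag) -> P(tag | prev_tag)
    start_tag:      tag assumed to precede the first word
    """
    floor = 1e-6  # probability for unseen tag pairs (an assumption of this sketch)
    # best[tag] = (log score of the best path ending in tag, that path)
    best = {start_tag: (0.0, [])}
    for word in words:
        nxt = {}
        for tag, p_tag in tag_given_word[word].items():
            candidates = []
            for prev, (score, path) in best.items():
                p_trans = tag_bigram.get((prev, tag), floor)
                new_score = score + math.log(p_trans) + math.log(p_tag)
                candidates.append((new_score, path + [tag]))
            # Keep only the best-scoring path that ends in this tag.
            nxt[tag] = max(candidates, key=lambda c: c[0])
        best = nxt
    return max(best.values(), key=lambda c: c[0])[1]

# Toy models, invented for illustration only.
TAG_GIVEN_WORD = {
    "the": {"dt": 1.0},
    "can": {"md": 0.7, "nn": 0.3},
    "rusts": {"vbz": 0.8, "nns": 0.2},
}
TAG_BIGRAM = {
    ("punc", "dt"): 0.5,
    ("dt", "nn"): 0.6,
    ("dt", "md"): 0.01,
    ("nn", "vbz"): 0.3,
    ("nn", "nns"): 0.1,
    ("md", "vbz"): 0.01,
    ("md", "nns"): 0.01,
}

print(viterbi_tag(["the", "can", "rusts"], TAG_GIVEN_WORD, TAG_BIGRAM, "punc"))
# ['dt', 'nn', 'vbz']
```

In this toy example "can" is more likely a modal when considered alone, but the tag ngram makes a noun far more likely after a determiner, so the decoder selects the noun reading.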
So far this tagger has only been used for English, but there
is nothing language-specific about it. The module POS
assigns the tags. It accesses the following variables for
parameterization:
pos_lex_name
    If this has the value NIL, no part of speech tagging takes place.
pos_ngram_name
    The ngram model of part of speech tags (loaded by ngram.load).
pos_p_start_tag
pos_pp_start_tag
pos_map
    pos_map should be a list of pairs, each consisting of a list of tags
    to be mapped and the new tag they are to be mapped to.
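The structure of pos_map, a list of pairs of (tags to be mapped, new tag), can be illustrated with a short Python sketch. The function name and the example tag groups are hypothetical, and passing unmapped tags through unchanged is an assumption of this sketch rather than documented behaviour.

```python
def apply_pos_map(tags, pos_map):
    """Map each tag in `tags` to its replacement according to `pos_map`.

    pos_map: list of (list_of_old_tags, new_tag) pairs.
    Tags that appear in no pair are returned unchanged (an assumption).
    """
    # Flatten the pairs into a direct old-tag -> new-tag lookup.
    lookup = {old: new for olds, new in pos_map for old in olds}
    return [lookup.get(tag, tag) for tag in tags]

# Hypothetical map collapsing fine-grained verb and noun tags.
POS_MAP = [
    (["vb", "vbd", "vbz"], "v"),
    (["nn", "nns"], "n"),
]

print(apply_pos_map(["dt", "nn", "vbz"], POS_MAP))
# ['dt', 'n', 'v']
```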
Note that it is important for the tags produced by the part of speech tagger to match the tags used in later parts of the system, particularly the lexicon. Only two of our lexicons used so far have (mappable) part of speech labels.
An example of the part of speech tagger for English can be found in lib/pos.scm.