15.3 Homograph disambiguation

Not all tokens can be rendered as words easily. Their context may affect the way they are to be pronounced. For example in the utterance

     On May 5 1985, 1985 people moved to Livingston.


the two "1985" tokens should be pronounced differently: the first as a year, "nineteen eighty five", and the second as a quantity, "one thousand nine hundred and eighty five". Numbers may also be pronounced as ordinals: the "5" above should be "fifth" rather than "five".

Also, the pronunciation of certain words cannot be found from their orthographic form alone. Linguistic part of speech tags help to disambiguate a large class of homographs, e.g. "lives". A part of speech tagger is included in Festival and discussed in POS tagging. But even part of speech isn't sufficient in a number of cases. Words such as "bass", "wind", "bow" etc. cannot be distinguished by part of speech alone; some semantic information is also required. As full semantic analysis of text is outwith the realms of Festival's capabilities, some other method for disambiguation is required.

Following the work of yarowsky96 we have included a method by which identified tokens can be further labelled with extra tags to help identify their type. Yarowsky uses decision lists to identify different types for homographs. Decision lists are a restricted form of decision trees which have some advantages over full trees: they are easier to build, and Yarowsky has shown them to be adequate for typical homograph resolution.

15.3.1 Using disambiguators

Festival offers a method for assigning a token_pos feature to each token. It does so using Yarowsky-type disambiguation techniques. A list of disambiguators can be provided in the variable token_pos_cart_trees. Each disambiguator consists of a regular expression and a CART tree (which may be a decision list as they have the same format). If a token matches the regular expression the CART tree is applied to the token and the resulting class is assigned to the token via the feature token_pos. This is done by the Token_POS module.
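As a rough sketch of the shape of such an entry, the following writes a small decision list in the CART format: each question has a leaf as its "yes" branch and the next rule as its "no" branch, ending in a default class. The variable name my_year_disambiguator, the class names year and quantity, and the two tests are all assumptions made up for illustration; Festival's installed disambiguators may already classify number tokens in their own way. The sketch also assumes that token_pos_cart_trees is already bound and that entries earlier in the list are tried first.

```scheme
;; Toy disambiguator for four-digit tokens such as "1985".
;; Assumptions: the variable name, the class names "year" and "quantity",
;; and the two tests are made up for this sketch.
(set! my_year_disambiguator
      '("1[0-9][0-9][0-9]"
        ;; Rule 1: preceded by a one- or two-digit token ("May 5 1985")
        ((p.name matches "\\([0-9]\\|[0-9][0-9]\\)")
         ((year))
         ;; Rule 2: preceded by "in" ("in 1985")
         ((p.name matches "[Ii]n")
          ((year))
          ;; Default: read as a quantity
          ((quantity))))))

;; Put it ahead of the disambiguators that are already installed
(set! token_pos_cart_trees
      (cons my_year_disambiguator token_pos_cart_trees))
```

A full decision tree differs only in that the "yes" branches may themselves contain further questions.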

For example, the following disambiguator distinguishes "St" (street and saint) and "Dr" (doctor and drive).

        ("\\([dD][Rr]\\|[Ss][tT]\\)"
         ((n.name is 0)
          ((p.cap is 1)
           ((street))
           ((p.name matches "[0-9]*\\(1[sS][tT]\\|2[nN][dD]\\|3[rR][dD]\\|[0-9][tT][hH]\\)")
            ((street))
            ((title))))
          ((punc matches ".*,.*")
           ((street))
           ((p.punc matches ".*,.*")
            ((title))
            ((n.cap is 0)
             ((street))
             ((p.cap is 0)
              ((p.name matches "[0-9]*\\(1[sS][tT]\\|2[nN][dD]\\|3[rR][dD]\\|[0-9][tT][hH]\\)")
               ((street))
               ((title)))
              ((pp.name matches "[1-9][0-9]+")
               ((street))
               ((title)))))))))

Note that these only assign values for the feature token_pos and do nothing more. You must have a related token to word rule that interprets this feature value and does the required translation. For example the corresponding token to word rule for the above disambiguator is

       ((string-matches name "\\([dD][Rr]\\|[Ss][tT]\\)")
        (if (string-equal (item.feat token "token_pos") "street")
            (if (string-matches name "[dD][rR]")
                (list "drive")
                (list "street"))
            (if (string-matches name "[dD][rR]")
                (list "doctor")
                (list "saint"))))
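The rule above is a single cond clause, so it has to sit inside a token_to_words definition to take effect. One possible way of wiring it in is sketched below. The name previous_token_to_words is introduced here purely for illustration and is not part of Festival. The sketch assumes that token_to_words is already bound when the code is evaluated, and that the code is evaluated only once (a second evaluation would make the fallback call itself).

```scheme
;; Hypothetical helper name: keeps whatever token_to_words is currently
;; bound to, so that every other token is still handled as before.
(set! previous_token_to_words token_to_words)

(define (token_to_words token name)
  (cond
   ;; "St" and "Dr": expand according to the class chosen by Token_POS
   ((string-matches name "\\([dD][Rr]\\|[Ss][tT]\\)")
    (if (string-equal (item.feat token "token_pos") "street")
        (if (string-matches name "[dD][rR]")
            (list "drive")
            (list "street"))
        (if (string-matches name "[dD][rR]")
            (list "doctor")
            (list "saint"))))
   ;; Everything else: defer to the saved definition
   (t
    (previous_token_to_words token name))))
```

To check what the disambiguator decided, the token_pos feature can be listed for each token after synthesis. The sentence is only an example, and a voice may rebind token_to_words when it is selected, in which case the definition above would need to be evaluated again afterwards.

```scheme
;; Synthesize an example sentence, then list each token with its token_pos
(set! utt1 (utt.synth (Utterance Text "Dr Smith lives on Smith Dr.")))
(mapcar
 (lambda (tok) (list (item.name tok) (item.feat tok "token_pos")))
 (utt.relation.items utt1 'Token))
```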

15.3.2 Building disambiguators

Festival offers some support for building disambiguation trees. The basic method is to find all occurrences of a homographic token in a large text database, label each occurrence into classes, extract appropriate context features for these tokens and finally build a classification tree or decision list based on the extracted features.

The extraction and building of trees is not yet a fully automated process in Festival but the file festival/examples/toksearch.scm shows some basic Scheme code we use for extracting tokens from very large collections of text.

The function extract_tokens does the real work. It reads the given file, token by token into a token stream. Each token is tested against the desired tokens and if there is a match the named features are extracted. The token stream will be extended to provide the necessary context. Note that only some features will make any sense in this situation. There is only a token relation so referring to words, syllables etc. is not productive.

In this example, databases are identified by a file that lists all the files in the text database. Its name is expected to be bin/DBNAME.files where DBNAME is the name of the database. The file should contain a list of filenames in the database, e.g. for the Gutenberg texts the file bin/Gutenberg.files contains

     gutenberg/etext90/bill11.txt
     gutenberg/etext90/const11.txt
     gutenberg/etext90/getty11.txt
     gutenberg/etext90/jfk11.txt
     ...

Extracting the tokens is typically done in two passes. The first pass extracts the context (I've used 5 tokens either side). It extracts the file and position, so the token is identified, and the word in context.

Next, those examples should be labelled with a small set of classes which identify the type of the token. For example, each occurrence of a token like "Dr" is labelled as either a person's title or a street identifier. Note that hand-labelling can be laborious, though it is surprising how few tokens of particular types actually exist in 62 million words.

The next task is to extract the tokens with the features that will best distinguish the particular token. In our "Dr" case this will involve punctuation around the token, capitalisation of surrounding tokens etc. After extracting the distinguishing tokens you must line up the labels with these extracted features. It would be easier to extract both the context and the desired features at the same time but experience shows that in labelling, more appropriate features come to mind that will distinguish classes better and you don't want to have to label twice.

Once a set of feature vectors, each consisting of the label and its features, has been created, it is easy to use wagon to create the corresponding decision tree or decision list. wagon supports both decision trees and decision lists, so it may be worth experimenting to find out which gives the best results on some held out test data. It appears that decision trees are typically better, but are often much larger, and the size does not always justify the sometimes only slightly better results.
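As an illustration of what wagon needs besides the data itself, the following sketches a description file for the "St"/"Dr" case. The choice of fields and the value lists are assumptions made for this example, as are the file names in the comment; a real description should list the values that actually occur in the extracted data.

```scheme
;; Illustrative wagon description file for the "St"/"Dr" data.
;; Assumptions: the fields and value lists are examples only.
;; A typical invocation (file names are hypothetical) might be:
;;   wagon -desc dr.desc -data dr.data -test dr.test -output dr.tree
((token_pos street title)     ; first field: the class to predict
 (p.cap 0 1)                  ; is the previous token capitalised?
 (n.cap 0 1)                  ; is the next token capitalised?
 (punc 0 "," ".")             ; punctuation following the token
 (p.punc 0 "," "."))          ; punctuation following the previous token
```

Each line of the data file then holds one labelled example, with its values in the same order as the fields above.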