

15.2 Token to word rules

Tokens are further analysed into lists of words. A word is an atom that can be given a pronunciation by the lexicon (or letter to sound rules). A token may give rise to a number of words or none at all.

For example, the basic tokens

     This pocket-watch was made in 1983.

would give a word relation of

     this pocket watch was made in nineteen eighty three
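As a rough sketch of how this can be checked interactively, the following synthesizes an utterance and lists the names of the items in its Word relation. It assumes a working Festival installation with an English voice selected; the variable name utt1 is only illustrative.

```scheme
;; Build a text utterance from the example sentence (utt1 is an illustrative name).
(set! utt1 (Utterance Text "This pocket-watch was made in 1983."))

;; Run the full synthesis chain, which includes the token to word module.
(utt.synth utt1)

;; Collect the name of every item in the Word relation.
(mapcar
 (lambda (w) (item.name w))
 (utt.relation.items utt1 'Word))
```

The exact words returned depend on the voice and the token to word rules currently in effect.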

Because the relationship between tokens and words is in some cases complex, a user function may be specified for translating tokens into words. This is designed to deal with things like numbers, email addresses, and other tokens whose pronunciation as zero or more words is not obvious. Currently a builtin function builtin_english_token_to_words offers much of the necessary functionality for English, but a user may customize this further.

If the user defines a function token_to_words which takes two arguments: a token item and a token name, it will be called by the Token_English and Token_Any modules. A substantial example is given as english_token_to_words in festival/lib/token.scm.

The example in lib/token.scm is quite elaborate and covers most of the common multi-word tokens in English, including numbers, money symbols, Roman numerals, dates, times, plurals of symbols, number ranges, telephone numbers and various other symbols.

Let us look at the treatment of one particular phenomenon which shows the use of these rules. Consider the expression "$12 million", which should be rendered as the words "twelve million dollars". Note that the word "dollars", which is introduced by the "$" sign, ends up after the end of the expression. There are two cases to deal with, as there are two tokens. The first clause of the cond handles the first token: it checks that the current token name is a money amount (a "$" followed by digits) and that the following token is a magnitude (million, billion, trillion, zillion, etc.). If so, the "$" is removed and the remaining number is pronounced by calling the builtin token to word function. The second clause handles the second token: it confirms that the previous token is a money amount (using the same regular expression as before) and then returns the magnitude word followed by the word "dollars". If the token matches neither form, the builtin function is called.

     (define (token_to_words token name)
     "(token_to_words TOKEN NAME)
     Returns a list of words for NAME from TOKEN."
      (cond
       ;; "$12" followed by a magnitude: drop the "$" and say the number
       ((and (string-matches name "\\$[0-9,]+\\(\\.[0-9]+\\)?")
             (string-matches (item.feat token "n.name") ".*illion.?"))
        (builtin_english_token_to_words token (string-after name "$")))
       ;; magnitude preceded by a money amount: say it, then "dollars"
       ((and (string-matches (item.feat token "p.name")
                             "\\$[0-9,]+\\(\\.[0-9]+\\)?")
             (string-matches name ".*illion.?"))
        (list
         name
         "dollars"))
       ;; anything else: fall back to the builtin rules
       (t
        (builtin_english_token_to_words token name))))
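One way to see which words each token produced is to walk the Token relation and list the daughters of each token. The following is a minimal sketch, assuming the token_to_words definition above has been loaded and an English voice is selected; the helper token_word_pairs and the variable utt2 are hypothetical names introduced only for this illustration.

```scheme
;; Hypothetical helper: returns a list of (TOKEN-NAME (WORD ...)) entries,
;; one per top-level token in the Token relation of UTT.
(define (token_word_pairs utt)
  (let ((tok (utt.relation.first utt 'Token))
        (result nil))
    (while tok
      ;; The words derived from a token are its daughters in the Token relation.
      (set! result
            (cons (list (item.name tok)
                        (mapcar (lambda (w) (item.name w))
                                (item.daughters tok)))
                  result))
      (set! tok (item.next tok)))
    (reverse result)))

;; Illustrative usage with a money expression.
(set! utt2 (Utterance Text "It cost $12 million."))
(utt.synth utt2)
(token_word_pairs utt2)
```

If the rule applies as described, the entry for the magnitude token should carry the extra word "dollars" while the entry for the "$12" token carries only the number words.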

It is valid to make some conditions return no words, though some care should be taken with that, as punctuation information may no longer be available to later processing if there are no words related to a token.
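As a small sketch of a clause that returns no words, the following definition drops one specific token entirely. The token name "FIXME" is purely hypothetical, chosen for illustration; in practice such a clause would be added to an existing token_to_words definition rather than replacing it, since defining the function again overrides any earlier definition.

```scheme
(define (token_to_words token name)
  "(token_to_words TOKEN NAME)
Returns a list of words for NAME from TOKEN."
  (cond
   ;; Hypothetical case: say nothing at all for the token "FIXME".
   ;; Any punctuation attached to this token will have no word to relate to.
   ((string-equal name "FIXME")
    nil)
   ;; Everything else is handled by the builtin rules.
   (t
    (builtin_english_token_to_words token name))))
```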