15.1 Tokenizing

A crucial stage in text processing is the initial tokenization of text. In Festival, a token is a sequence of characters delimited by whitespace in a text file or string. If punctuation is defined for the current language, matching characters are stripped from the beginning (prepunctuation) and end (punctuation) of each token and stored as features of that token. The default set of characters treated as whitespace is defined as

     (defvar token.whitespace " \t\n\r")

The default sets of punctuation and prepunctuation characters are

     (defvar token.punctuation "\"'`.,:;!?(){}[]")
     (defvar token.prepunctuation "\"'`({[")

These are declared in lib/token.scm but may be changed for different languages, text modes, and so on.
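
As a minimal sketch of such a change, the fragment below overrides the defaults for a hypothetical language in which the apostrophe is part of the word and should stay inside the token. The character choices are illustrative assumptions, not settings shipped with Festival. Because defvar only assigns a variable that is not yet bound, set! is used here to replace values that lib/token.scm has already defined.

```scheme
;; Illustrative overrides only; the character sets below are assumptions
;; chosen for the example, not values taken from any Festival voice.

;; Keep the default whitespace set.
(set! token.whitespace " \t\n\r")

;; Same as the defaults, but without the apostrophe, so that a word
;; such as l'eau is kept as a single token with no punctuation split off.
(set! token.punctuation "\"`.,:;!?(){}[]")
(set! token.prepunctuation "\"`({[")
```

Settings like these would typically be placed in the setup code for a particular language or text mode, so that the defaults are left alone for other voices.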