A diphone database consists of a dictionary file, a set of waveform files, and a set of pitch mark files. These files use the same format as the previous CSTR (Osprey) synthesizer.
The dictionary file consists of one entry per line. Each entry consists of five fields: a diphone name of the form P1-P2, a filename (without extension), a floating-point start position in the file in milliseconds, a mid position in milliseconds (the point where the phone changes), and an end position in milliseconds. Lines starting with a semicolon and blank lines are ignored. The entries may be in any order.
For example, a partial dictionary file may look like this:
ch-l r021 412.035 463.009 518.23
jh-l d747 305.841 382.301 446.018
h-l d748 356.814 403.54 437.522
#-@ d404 233.628 297.345 331.327
@-# d001 836.814 938.761 1002.48
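The entry format described above is simple enough to read with a few lines of code. The following is a minimal sketch in Python, not part of the synthesizer itself: the names `DiphoneEntry` and `parse_diphone_dictionary` are hypothetical, and it assumes that fields are separated by whitespace and that leading whitespace before a semicolon is ignored.

```python
from typing import Iterable, List, NamedTuple


class DiphoneEntry(NamedTuple):
    """One dictionary entry (hypothetical container, for illustration only)."""
    name: str        # diphone name of the form P1-P2, e.g. "ch-l"
    filename: str    # file name without extension
    start_ms: float  # start position in milliseconds
    mid_ms: float    # position of the phone change in milliseconds
    end_ms: float    # end position in milliseconds


def parse_diphone_dictionary(lines: Iterable[str]) -> List[DiphoneEntry]:
    """Parse dictionary lines, skipping blank lines and ';' comment lines."""
    entries = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith(";"):
            continue
        # Assumption: the five fields are separated by whitespace.
        fields = line.split()
        if len(fields) != 5:
            raise ValueError(f"line {number}: expected 5 fields, got {len(fields)}")
        name, filename, start, mid, end = fields
        entries.append(
            DiphoneEntry(name, filename, float(start), float(mid), float(end))
        )
    return entries
```

Because the entries may appear in any order, a caller that needs lookup by diphone name could build a mapping such as `{e.name: e for e in entries}`.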
Waveform files may be in any format, headered or unheadered, as long as every file is of the same type and the format is supported by the speech tools wave reading functions. These may be standard linear PCM waveform files in the case of PSOLA, or LPC coefficients and residual when using the residual LPC synthesizer (see LPC databases).
Pitch mark files consist of a simple list of positions in milliseconds (with decimal places), in order, one per line, for each pitch mark in the file. For high quality diphone synthesis these should be derived from laryngograph data. During unvoiced sections pitch marks should be artificially created at reasonable intervals (e.g. 10 ms). In the current format there is no way to distinguish the "real" pitch marks from the "unvoiced" pitch marks.
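As an illustration of the pitch mark format and of filling unvoiced sections, here is a rough sketch in Python. It is not the procedure used to build any particular database. The function names are hypothetical, and the rule for deciding that a stretch is unvoiced (a gap between consecutive marks longer than `max_gap_ms`, 20 ms by default) is an assumption made for this example; only the 10 ms spacing comes from the text above.

```python
from typing import Iterable, List


def parse_pitch_marks(lines: Iterable[str]) -> List[float]:
    """Read one position in milliseconds per line, skipping blank lines."""
    return [float(line) for line in lines if line.strip()]


def _artificial_marks(start_ms: float, stop_ms: float, interval_ms: float) -> List[float]:
    """Marks at start + k * interval that fall strictly before stop_ms."""
    marks = []
    k = 1
    while start_ms + k * interval_ms < stop_ms:
        marks.append(start_ms + k * interval_ms)
        k += 1
    return marks


def fill_unvoiced_gaps(
    voiced_marks_ms: Iterable[float],
    end_ms: float,
    interval_ms: float = 10.0,
    max_gap_ms: float = 20.0,  # assumption: longer gaps are treated as unvoiced
) -> List[float]:
    """Return the marks in order, with artificial marks added in long gaps."""
    filled = []
    previous = 0.0  # the file is taken to start at 0 ms
    for mark in sorted(voiced_marks_ms):
        if mark - previous > max_gap_ms:
            filled.extend(_artificial_marks(previous, mark, interval_ms))
        filled.append(mark)
        previous = mark
    # Fill the stretch between the last mark and the end of the file.
    if end_ms - previous > max_gap_ms:
        filled.extend(_artificial_marks(previous, end_ms, interval_ms))
    return filled
```

Note that the returned list mixes both kinds of mark, so, as with the file format itself, the artificial marks can no longer be told apart from the measured ones once they are written out.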
It is normal to hold a diphone database in a directory with a number of sub-directories: dic/ containing the dictionary file, wave/ for the waveform files, typically of whole nonsense words (this directory is sometimes called vox/ for historical reasons), and pm/ for the pitch mark files. The filename in the dictionary entry should be the same for the waveform file and the pitch mark file (with different extensions).
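Given this layout, the two files belonging to a dictionary entry can be located from the entry's filename alone. The sketch below shows one way to do this in Python; the function name `diphone_paths` is hypothetical, and the extensions `.wav` and `.pm` are assumptions chosen for illustration, since the actual extensions depend on how the database was built.

```python
from pathlib import Path
from typing import Tuple


def diphone_paths(
    db_root: str,
    filename: str,
    wave_dir: str = "wave",  # some databases use "vox" instead
    wave_ext: str = ".wav",  # assumed extension, not fixed by the format
    pm_ext: str = ".pm",     # assumed extension, not fixed by the format
) -> Tuple[Path, Path]:
    """Return (waveform path, pitch mark path) for a dictionary filename."""
    root = Path(db_root)
    wave_path = root / wave_dir / (filename + wave_ext)
    pm_path = root / "pm" / (filename + pm_ext)
    return wave_path, pm_path
```

For a database that keeps its waveforms under vox/, the same function can be called with `wave_dir="vox"`.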