These are organized into a tree structure, shown schematically in 1.2. At the top level there is a split between training and testing sets, which gives away the corpus's intended use for developing and evaluating statistical models.

Figure 1.2: Structure of the Published TIMIT Corpus: The CD-ROM contains doc, train, and test directories at the top level; the train and test directories both have 8 subdirectories, one per dialect region; each of these contains further subdirectories, one per speaker; the contents of the directory for female speaker [...]

A fourth feature of TIMIT is the hierarchical structure of the corpus.

With 4 files per sentence, and 10 sentences for each of 500 speakers, there are 20,000 files.
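The hierarchy and the file count can be made concrete with a minimal Python sketch. It assumes a relative path layout of `<split>/<dialect region>/<speaker>/<sentence>.<extension>`, assumes the four per-sentence files carry the extensions `wav`, `txt`, `wrd`, and `phn`, and assumes the first letter of a speaker directory name encodes the speaker's sex. The function names and the example path are illustrative only and are not part of any corpus toolkit.

```python
from pathlib import PurePosixPath

# Assumed extensions for the four files that accompany each sentence:
# audio, text transcription, word-aligned and phone-aligned transcriptions.
FILE_TYPES = ("wav", "txt", "wrd", "phn")


def parse_timit_path(path):
    """Split a TIMIT-style relative path into its hierarchical components.

    Assumes the layout <split>/<dialect region>/<speaker>/<sentence>.<ext>.
    """
    p = PurePosixPath(path)
    if len(p.parts) != 4:
        raise ValueError(f"expected 4 path components, got {len(p.parts)}: {path!r}")
    split, region, speaker, _ = p.parts
    return {
        "split": split,                      # "train" or "test"
        "dialect_region": region,            # one of 8 region directories
        "speaker": speaker,                  # one directory per speaker
        # Assumption: the leading letter of the speaker id marks sex.
        "sex": {"f": "female", "m": "male"}.get(speaker[0], "unknown"),
        "sentence": p.stem,
        "file_type": p.suffix.lstrip("."),
    }


def expected_file_count(speakers=500, sentences_per_speaker=10,
                        files_per_sentence=len(FILE_TYPES)):
    """Multiply out the figures quoted in the text."""
    return speakers * sentences_per_speaker * files_per_sentence


info = parse_timit_path("train/dr1/fvmh0/sa1.phn")  # illustrative path
print(info)
print(expected_file_count())  # 4 * 10 * 500 = 20000
```

Because every level of the tree carries meaning, a single path is enough to recover the split, dialect region, speaker, and sentence without consulting any external index.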

The inclusion of speaker demographics brings in many more independent variables, which may help to account for variation in the data and which facilitate later uses of the corpus for purposes that were not envisaged when the corpus was created, such as sociolinguistics.

A third property is that there is a sharp division between the original linguistic event captured as an audio recording, and the annotations of that event.
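One way to picture this division is as standoff annotation: the recording is left untouched, and each annotation layer is a separate list of labeled spans that point into it. The sketch below is a hedged illustration, not the corpus's actual tooling. It assumes each annotation line has the form `start end label`, with offsets counted in audio samples; the `Segment` class, the function names, and all offsets and labels are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: int   # offset into the recording (assumed unit: samples)
    end: int
    label: str


def parse_segments(text):
    """Parse lines of the assumed form 'start end label' into Segments."""
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        start, end, label = line.split(maxsplit=2)
        segments.append(Segment(int(start), int(end), label))
    return segments


def phones_in(word, phones):
    """Return labels of phone segments that lie entirely within a word's span."""
    return [p.label for p in phones if word.start <= p.start and p.end <= word.end]


# Hypothetical annotation layers for one recording (offsets are invented).
PHONE_LAYER = """\
0 2000 h#
2000 3500 sh
3500 5000 iy
5000 6200 hv
6200 8000 ae
8000 9000 d
"""
WORD_LAYER = """\
2000 5000 she
5000 9000 had
"""

phones = parse_segments(PHONE_LAYER)
words = parse_segments(WORD_LAYER)
for w in words:
    print(w.label, phones_in(w, phones))
```

Since both layers refer to the same recording only through offsets, a new layer can be added, or an existing one corrected, without modifying the audio or the other layers.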

TIMIT was developed by a consortium including Texas Instruments and MIT, from which it derives its name.

It was designed to provide data for the acquisition of acoustic-phonetic knowledge and to support the development and evaluation of automatic speech recognition systems. Like the Brown Corpus, which displays a balanced selection of text genres and sources, TIMIT includes a balanced selection of dialects, speakers, and materials. For each of eight dialect regions, 50 male and female speakers having a range of ages and educational backgrounds each read ten carefully chosen sentences.

First, the corpus contains two layers of annotation, at the phonetic and orthographic levels. In general, a text or speech corpus may be annotated at many different linguistic levels, including morphological, syntactic, and discourse levels.

Despite its complexity, the TIMIT corpus only contains two fundamental data types, namely lexicons and texts.

