Data Science Research Group

Corpus annotation

To enable computers to process human language, we need databases (corpora) of language samples annotated to show their structural features, as a source of information and statistics to guide the development of language-processing algorithms. This in turn requires a set of explicitly defined categories, so that researchers exchanging language data can be confident that they are using the annotations in the same way. Computational linguistics thus needs something like the Linnaean taxonomy created for botany in the 18th century. Geoffrey Sampson's 1995 book, English for the Computer, provides such a taxonomy for English: the SUSANNE scheme (the name SUSANNE stands for 'surface and underlying structural analysis of natural English').
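As a rough illustration of how an annotated corpus serves as a source of statistics, the sketch below counts tag frequencies and tag-to-tag transitions in a tiny hand-made sample. The sample, the tag names (DET, NOUN, and so on), and the function names are hypothetical placeholders chosen for illustration; they are not SUSANNE categories and do not reflect the SUSANNE file format.

```python
from collections import Counter

# Hypothetical toy sample of (word, tag) pairs. The tags are illustrative
# placeholders, not categories from the SUSANNE scheme.
ANNOTATED_SAMPLE = [
    ("the", "DET"),
    ("dog", "NOUN"),
    ("barked", "VERB"),
    ("at", "PREP"),
    ("the", "DET"),
    ("cat", "NOUN"),
]


def tag_frequencies(sample):
    """Return the relative frequency of each tag in the sample."""
    counts = Counter(tag for _, tag in sample)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def tag_bigram_counts(sample):
    """Count how often one tag immediately follows another."""
    tags = [tag for _, tag in sample]
    return Counter(zip(tags, tags[1:]))


if __name__ == "__main__":
    print(tag_frequencies(ANNOTATED_SAMPLE))
    print(tag_bigram_counts(ANNOTATED_SAMPLE))
```

Counts of this kind are only comparable between research groups if both apply the same tags in the same way, which is the reason an explicitly defined scheme matters.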

A by-product of creating this scheme was a corpus of English annotated in accordance with it. The SUSANNE Corpus contains annotations of a 130,000-word cross-section of written American English; it is freely available without formalities for use by researchers anywhere (and has been heavily used since the first release was published in 1992).

For practical reasons of availability at the time, the SUSANNE Corpus was based on written American English of some decades past. In more recent work, Sampson's team (Alan Morris and Anna Babarczy) have been applying the annotation standards to create resources for studying structure in present-day British speech and writing.

The first stage of the CHRISTINE Corpus, comprising a socially representative annotated sample of current spontaneous speech, was circulated in August 1999. It includes various extensions of the annotation scheme to identify the many structural features particular to speech. Following on from this, the LUCY Corpus will be a comparable resource for modern written English in the UK. LUCY samples the output both of mature writers and of young people whose writing skills are imperfect. Among other things, it will offer writing-skills training insights into the route from fluency as a native speaker of English to mastery of the written medium.