Data Science Research Group

Statistical parsing

Robust, accurate parsing would be of great benefit for many language processing tasks. Members of the group have worked on the three major unsolved problems in this area: disambiguation, undergeneration, and efficiency.


Geoffrey Sampson has carried out pioneering research - starting in the 1980's - into the use of probabilistic techniques for parsing. To find the optimal analysis within the huge search space of alternative analyses, Sampson's APRIL system evolved statistically plausible syntactic analyses by biased random mutation using simulated annealing, maximising a numerical measure of conformity to the patterns found in samples of accurately-analysed language. The technique was robust in the face of 'messy' corpus material.

In more recent work, both David Weir and John Carroll have worked onparsing systems that cut down on the initial search space of analyses by using manually-written grammars. This reduces the task of the probabilistic disambiguation component to choosing between linguistically plausible structures. Weir's team have developed a large-scale lexicalised tree grammar of English and have investigated the various ways in which frequency information derived from corpora can be integrated into the grammar for use in disambiguation. Carroll uses a shallow unification-based `phrasal' grammar with high coverage of unrestricted text to parse large amounts of material, returning analyses in the form of grammatical relations (such as 'subject', 'clausal complement') between heads and dependents. This representation has proved useful for interfacing to applications, and also for evaluating parser accuracy.


Grammar development and corpus annotation are time-consuming and expensive activities. The range of language data that can be accounted for is therefore limited, so systems based on these resources are prone to undergeneration (i.e. not being able to assign an analysis to an utterance). Bill Keller and Rudi Lutz are exploring the use of evolutionary techniques for learning a grammar of a language just from example sentences. The research adopts a Bayesian approach using a genetic algorithm to `evolve' the most probable grammar given a corpus. To avoid over-fitting of the data the Minimum Description Length Principle is used to provide an informative prior for candidate grammars.


Large-scale tree grammars are structurally complex, containing hundreds (or even thousands) of elementary trees. This means that in conventional parsing algorithms much of the computation associated with different structures is duplicated. David Weir and Olga Shaumyan are working on precompilation techniques for use with such grammars which allow some of this computation to be shared. The approach transforms the elementary structures of the grammar into finite-state automata which can then be merged and minimised.

With wide-coverage unification-based grammars, a parser can spend of the order of 90% of the time performing unification operations. John Carroll has devised new algorithms for fast unification, sharing of structure between unification results (thus saving space), and fast filtering of unifications that are guaranteed to fail. He has also recently worked on novel techniques for compactly representing ambiguity during parsing. This research has been applied in a speech-to-speech translation system using large-scale Head Driven Phrase Structure grammars.