Data Science Research Group

Lexical acquisition

Many tasks in natural language processing need detailed, accurate information about the grammatical behaviour of individual words. Manual coding of lexicons is expensive and error-prone, and it has to be carried out afresh for each new domain. It also cannot reliably capture statistical tendencies. People in the group are carrying out ground-breaking research into the unsupervised acquisition of lexical information from corpora, mostly based on automatic parses of tens of millions of words of raw, unannotated English text (using analysis tools developed at Sussex).

Subcategorisation, selectional preferences and WSD

John Carroll is using the syntactically analysed data to acquire detailed information on relative subcategorisation frequencies for verbs, which are central to the meaning of sentences. He has shown that the acquired information can be incorporated into a parsing system to significantly improve its disambiguation accuracy; the information may also be of use in other language processing tasks such as the automatic generation of text.
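
As a rough illustration of what "relative subcategorisation frequencies" means, the sketch below counts how often each verb is observed with each frame and normalises the counts per verb. It is a minimal sketch only: the function name, the frame labels and the toy observations are all invented for illustration, and it assumes the (verb, frame) pairs have already been extracted from parsed text. It is not the acquisition system described above.

```python
from collections import Counter, defaultdict


def relative_frame_frequencies(observations):
    """Turn (verb, frame) observations into per-verb relative frequencies."""
    counts = defaultdict(Counter)
    for verb, frame in observations:
        counts[verb][frame] += 1

    relative = {}
    for verb, frames in counts.items():
        total = sum(frames.values())
        relative[verb] = {frame: n / total for frame, n in frames.items()}
    return relative


# Invented toy observations; the frame labels are hypothetical.
observations = [
    ("give", "NP_NP"),
    ("give", "NP_PP"),
    ("give", "NP_NP"),
    ("give", "NP"),
    ("sleep", "NONE"),
]

frequencies = relative_frame_frequencies(observations)
print(frequencies["give"])
```

A parser could use estimates of this kind to prefer analyses whose verb frames are more probable for the verb in question.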

Diana McCarthy is building on the subcategorisation extraction work, automatically acquiring selectional preferences for English verbs. Most verbs select more or less strongly for certain types of subject and direct object (for example, "to eat" has a preference for a direct object that is a type of food). One of the challenges is to find the right level of generalisation for the argument types; this is done by applying the Minimum Description Length principle with respect to classes of word senses in the hierarchically organised WordNet thesaurus. She has successfully used the acquired preferences to detect diathesis alternations and also, with John Carroll, for word sense disambiguation (WSD) as part of the SENSEVAL exercises.

Class-based generalisation and distributional similarity

David Weir and Stephen Clark have developed an alternative approach to estimating the frequency with which a word sense appears as a given argument of a verb, again starting from text that has not been annotated with word sense information. The technique uses a re-estimation process over WordNet in conjunction with a statistical test to measure the homogeneity of sets of concepts in the hierarchy. The acquired knowledge has been successfully used in a number of different NLP tasks, one being the resolution of prepositional phrase attachment ambiguities.
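
To give a feel for the kind of statistical test of homogeneity mentioned above, the sketch below computes a Pearson chi-square statistic over a small contingency table. It rests on several assumptions: that the rows are the child concepts of some concept in the hierarchy, that the columns count how often each child's words appear as the argument of a given verb versus elsewhere, and that Pearson's chi-square is the statistic used. The counts are invented, and the choice of test and table layout here should not be read as the authors' actual procedure.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a table.

    `table` is a list of rows, each a list of non-negative counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected

    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom


# Invented counts: rows are two hypothetical child concepts,
# columns are (seen as the verb's argument, seen elsewhere).
similar_children = [[10, 90], [10, 90]]
different_children = [[20, 80], [5, 95]]

print(chi_square_statistic(similar_children))
print(chi_square_statistic(different_children))
```

With one degree of freedom, the 5% critical value of the chi-square distribution is about 3.84. Under the assumptions above, a statistic below that value would be consistent with treating the children as a homogeneous set, while a larger value would suggest they behave differently with respect to the verb.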

Techniques that exploit knowledge of distributional similarity between words have been proposed in many areas of NLP. However, due to the wide range of potential applications and the lack of a strict definition of the concept of distributional similarity, many different methods of calculating it have been proposed or adopted. Julie Weeds and David Weir proposed a flexible, parameterised framework for calculating distributional similarity. Within this framework, the problem of finding distributionally similar words is cast as one of co-occurrence retrieval (CR), for which precision and recall can be measured by analogy with the way they are measured in document retrieval. A number of popular existing measures of distributional similarity can be simulated with parameter settings within the CR framework. The CR framework is used to systematically investigate three fundamental questions concerning distributional similarity. First, is the relationship of lexical similarity necessarily symmetric, or are there advantages to be gained from considering it as an asymmetric relationship? Second, are some co-occurrences inherently more salient than others in the calculation of distributional similarity? Third, is it necessary to consider the difference in the extent to which each word occurs in each co-occurrence type?
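
The retrieval analogy can be sketched in a few lines. In the example below, each word is represented by a weighted set of co-occurrence types; precision asks how much of the candidate word's weight falls on co-occurrences it shares with the target, and recall asks how much of the target's weight is covered by those shared co-occurrences. This is a simplified sketch: the function name, the weighting by raw counts and the toy data are assumptions made for illustration, and the framework itself allows other weightings and parameter settings.

```python
def cr_precision_recall(candidate, target):
    """Precision and recall of `candidate` retrieving `target`'s co-occurrences.

    Both arguments map a co-occurrence type to a non-negative weight.
    """
    shared = set(candidate) & set(target)
    candidate_total = sum(candidate.values())
    target_total = sum(target.values())

    precision = (
        sum(candidate[c] for c in shared) / candidate_total
        if candidate_total else 0.0
    )
    recall = (
        sum(target[c] for c in shared) / target_total
        if target_total else 0.0
    )
    return precision, recall


# Invented toy co-occurrence counts (verbs taking each noun as object).
apple = {"eat": 4, "peel": 2, "grow": 2}
fruit = {"eat": 6, "grow": 3, "buy": 6, "import": 5}

print(cr_precision_recall(apple, fruit))
print(cr_precision_recall(fruit, apple))
```

Swapping the two arguments swaps precision and recall, which is one simple way to see how similarity can be treated as asymmetric: in this toy data, most of what is observed with "apple" is also observed with "fruit", but not the other way round.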

Collocations and multi-words

Darren Pearce is working on a novel method for extracting collocations from corpora. The approach is based on the observation that in a collocation none of the words can be replaced by a synonym. (So, for example, emotional baggage is a collocation, but replacing the second word with luggage gives *emotional luggage, which is not.) The system collects candidate phrases from analysed text, and compares the distribution of phrases with their substituted versions. A useful by-product of this technique is a way of deriving anticollocations, that is, combinations of words that are not acceptable, e.g. *powerful coffee. This type of information might be useful in teaching English as a foreign language.
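
The substitution idea can be sketched as follows. Given phrase counts and a table of synonyms, the code generates every variant of a phrase in which one word is swapped for a synonym, then checks whether all the variants are rare relative to the original. This is only a sketch under stated assumptions: the function names, the `min_count` and `ratio` thresholds, the hand-written synonym table and the counts are all invented for illustration, and the actual system may compare distributions quite differently.

```python
def substitution_variants(phrase, synonyms):
    """All phrases obtained by swapping exactly one word for a synonym."""
    variants = []
    for i, word in enumerate(phrase):
        for alternative in synonyms.get(word, ()):
            variants.append(phrase[:i] + (alternative,) + phrase[i + 1:])
    return variants


def collocation_evidence(phrase, counts, synonyms, min_count=5, ratio=0.1):
    """Return (is_candidate, rare_variants) for a phrase.

    A phrase is a candidate collocation if it is frequent enough and every
    substituted variant is rare relative to it. The thresholds are
    hypothetical tuning parameters.
    """
    observed = counts.get(phrase, 0)
    variants = substitution_variants(phrase, synonyms)
    if observed < min_count or not variants:
        return False, []
    rare = [v for v in variants if counts.get(v, 0) <= ratio * observed]
    return len(rare) == len(variants), rare


# Invented toy counts and synonym table.
counts = {
    ("emotional", "baggage"): 40,
    ("emotional", "luggage"): 1,
    ("strong", "coffee"): 60,
    ("heavy", "luggage"): 30,
    ("heavy", "baggage"): 25,
}
synonyms = {
    "baggage": ["luggage"],
    "luggage": ["baggage"],
    "strong": ["powerful"],
}

print(collocation_evidence(("emotional", "baggage"), counts, synonyms))
print(collocation_evidence(("strong", "coffee"), counts, synonyms))
print(collocation_evidence(("heavy", "luggage"), counts, synonyms))
```

In this sketch, the rare variants returned for a candidate collocation (such as the variant with powerful in place of strong) are the natural candidates for anticollocations.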