We have conducted many collaborative projects with other academic institutions, companies and educational organisations – both within the UK and overseas. Please browse through the list of past projects. Current projects are listed below.
A Unified Model of Compositional and Distributional Semantics: Theory and Applications
There have been two main approaches to modelling the meaning of language in NLP. The first, the so-called compositional approach, is based on classical ideas from Philosophy and Mathematical Logic. Using a well-known principle from the 19th-century logician Frege - that the meaning of a phrase can be determined from the meanings of its parts and how those parts are combined - logicians have developed formal accounts of how the meaning of a sentence can be determined from the words it contains and the relations between them. This idea culminated famously in Linguistics in the work of Richard Montague in the 1970s. The compositional approach addresses a fundamental problem in Linguistics - how it is that humans are able to generate an unlimited number of sentences using a limited vocabulary. We would like computers to have a similar capacity.
The second, more recent, approach to modelling meaning in NLP focuses on the meanings of the words themselves. This is the so-called distributional approach to modelling word meanings and is based on the ideas of the "structural" linguists such as Firth from the 1950s. The approach is also sometimes related to Wittgenstein's philosophy of "meaning as use". The idea is that the meanings of words can be determined by considering the contexts in which those words appear in text. For example, if we take a large amount of text and see which words appear close to the word "dog", and do the same for the word "cat", we will see that the contexts of "dog" and "cat" tend to share many words (such as walk, run, furry, pet, and so on). If, on the other hand, we see which words appear in the context of the word "television", we will find much less overlap with the contexts for "dog". Mathematically, we represent the contexts in a vector space, so that word meanings occupy positions in a geometrical space. We would expect to find that "dog" and "cat" are much closer in this space than "dog" and "television", indicating that "dog" and "cat" are closer in meaning.
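The dog/cat/television example can be sketched in a few lines of Python. This is an illustrative toy only - the corpus, window size and similarity measure are our assumptions, not details of the project: build a co-occurrence vector for each word and compare the vectors with cosine similarity.

```python
from collections import Counter
import math

# Toy corpus for illustration only.
corpus = [
    "the furry dog is a pet and likes to run",
    "the furry cat is a pet and likes to play",
    "we watched the news on the television last night",
    "the television showed a film about the news",
]

def context_vector(word, sentences, window=3):
    """Count the words appearing within `window` positions of `word`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok == word:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for t in tokens[lo:hi] if t != word)
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

dog, cat, tv = (context_vector(w, corpus) for w in ("dog", "cat", "television"))
print(cosine(dog, cat))  # high: "dog" and "cat" share contexts (furry, pet, ...)
print(cosine(dog, tv))   # much lower: few shared contexts
```

Real distributional models are built from very large corpora and typically weight the counts (for instance with pointwise mutual information) rather than using raw frequencies.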
The two approaches to meaning can be roughly characterised as follows: the compositional approach is concerned with how meanings combine, but has little to say about the individual meanings of words; the distributional approach is concerned with word meanings, but has little to say about how those meanings combine. This project exploits the strengths of both, by developing a unified model of distributional and compositional semantics. The project has a central theoretical component, drawing on models of semantics from Theoretical Computer Science and Mathematical Logic. This central component will inform, be driven by, and be evaluated on tasks and applications in NLP and Information Retrieval, as well as on data drawn from empirical studies in Cognitive Science (the computational study of the mind). Hence we aim to make the following contributions:
- advance the theoretical study of meaning in Linguistics, Computer Science and Artificial Intelligence;
- develop new meaning-sensitive approaches to NLP applications which can be robustly applied to naturally occurring text.
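To make the notion of "combining" distributional meanings concrete, here is a hedged sketch of two simple baseline composition operators - pointwise addition and pointwise multiplication - over toy vectors. The vectors and their dimensions are invented for illustration; they are not the project's model, which investigates more principled, theoretically grounded operators.

```python
import math

# Toy 4-dimensional distributional vectors; the values, and the contexts the
# dimensions might stand for (say run, pet, screen, watch), are invented.
vec = {
    "furry":      [1.0, 3.0, 0.0, 0.0],
    "dog":        [2.0, 2.0, 0.0, 0.1],
    "television": [0.0, 0.1, 3.0, 2.0],
}

def add(u, v):
    """Additive composition: the phrase vector is the sum of the word vectors."""
    return [a + b for a, b in zip(u, v)]

def mult(u, v):
    """Multiplicative composition: pointwise product of the word vectors."""
    return [a * b for a, b in zip(u, v)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

furry_dog = add(vec["furry"], vec["dog"])
# The composed phrase vector stays close to "dog" and far from "television":
print(cosine(furry_dog, vec["dog"]) > cosine(furry_dog, vec["television"]))  # True
```

Addition and multiplication ignore word order and grammatical structure entirely, which is precisely the gap a unified compositional-distributional model aims to close.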
This multisite project is funded by the EPSRC, with related grants funding the project at the Universities of Cambridge, Edinburgh, Oxford and York.
PIs: David Weir, Bill Keller
Research Fellows: Daoud Clarke, Jeremy Reffin, Julie Weeds
Exploitation of Diverse Data via Automatic Adaptation of Knowledge Extraction Software
This project involves two industrial partners: Brandwatch (based in Brighton) and Linguamatics (based in Cambridge). The companies are developing systems for turning 'big data' into useful business information, and want to cover further diverse data sources, from patents to Tweets. The project addresses a bottleneck that often arises when applying natural language processing technology in practical settings: the need for laborious customisation with respect both to the type of data source (e.g. newswire vs. patent literature) and to a domain's terminology (e.g. medical practice vs. pharmaceutical research). The project partners will explore 'distributional' methods based on contextual similarity of word usage in order to accelerate two key components of this customisation, namely the recognition of concepts and the creation or adaptation of terminologies that link terms to concepts. This will allow software which extracts information from 'big data' to be adapted extremely rapidly to new and diverse data sources.
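One way this kind of distributional terminology adaptation might look in practice - an illustrative sketch only, with invented vectors and terms, not the partners' systems - is to propose candidate entries for a terminology by ranking terms by their similarity to a known concept term:

```python
import math

# Invented toy vectors; in practice these would be distributional vectors
# built from large domain corpora (e.g. patents, medical text, Tweets).
term_vecs = {
    "appendicectomy": [0.9, 0.1, 0.0],
    "appendectomy":   [0.85, 0.15, 0.05],
    "tonsillectomy":  [0.7, 0.2, 0.1],
    "invoice":        [0.0, 0.1, 0.95],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def nearest_terms(seed, vecs, k=2):
    """Rank the other terms by similarity to the seed term's vector."""
    ranked = sorted((t for t in vecs if t != seed),
                    key=lambda t: cosine(vecs[seed], vecs[t]), reverse=True)
    return ranked[:k]

# Suggest candidates for a terminology entry seeded with "appendicectomy":
print(nearest_terms("appendicectomy", term_vecs))  # ['appendectomy', 'tonsillectomy']
```

Because the vectors are induced automatically from text, the same procedure can be re-run on a new data source without hand-crafted rules, which is what makes rapid adaptation plausible.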
PI: David Weir
Research Fellow: Jeremy Reffin
Research Students: Hamish Morgan, Simon Wibberley
Visualisation of the Quality of Electronic Health Records for Clinical Research
This is a collaborative project with Brighton-based Dataline Software, which has wide experience in the design and implementation of highly ergonomic IT systems; and the General Practice Research Database (GPRD), which collects electronic patient records from many GP practices across the UK. The project partners will develop statistical pattern recognition techniques for scoring patient records with respect to data quality, and construct a system (QViz) that will provide quick and effective visualisation of patient record quality. Since findings arising from patient data have potential public health and safety implications, it is crucial that any data used for research is of high quality. QViz will be used to help set up feasibility studies for Randomised Controlled Trials (RCTs), which are the key to proving the efficacy and safety of new medicines. QViz will allow researchers to interactively investigate and select GP practices based on data quality and the suitability of the patient base for a particular study.
PIs: Rosemary Tate, Natasha Beloff
Electronic patient records contain a mixture of coded information and free text. We aim to develop generalisable methods for the identification and interrogation of potentially important data "concealed" in free text, use the results to enhance coded data, and evaluate the utility of this approach. Through user-centred methods, we will explore what influences clinicians in the balance between recording free text and using standard codes (e.g. 002.23 Appendicectomy), and how information needs to be stored for it to be useful to, and retrievable by, clinicians. Natural Language Processing (NLP) will be used to search large quantities of anonymised free-text patient records, and to enhance coded data with pseudo-codes. Statistical methods will be used to explore the impact of integrating the additional information on (a) prevalence estimates (rheumatoid arthritis), and (b) estimates of dates of first relevant presentation (ovarian cancer). A visualisation tool for the integrated graphical display of coded and NLP-generated data will be developed. It will be used to validate the novel data through clinician and researcher review, and thus to explore the value of these techniques in improving the quality and accessibility of information in electronic patient records. This is a joint project with the Brighton and Sussex Medical School and a number of other partners.
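The pseudo-coding step can be illustrated with a deliberately simplified sketch. Everything here is invented for illustration except the 002.23 Appendicectomy code mentioned above - the patterns, the "RA-pseudo" label and the record text are hypothetical, and real systems use far richer NLP than pattern matching: scan the free text for concept mentions and attach code-like annotations to the record.

```python
import re

# Map surface patterns to (pseudo-)codes. 002.23 is the code cited in the
# project description; the patterns and the "RA-pseudo" label are invented.
PATTERNS = [
    (re.compile(r"\bappendicectomy\b|\bappendix\s+removed\b", re.I), "002.23"),
    (re.compile(r"\brheumatoid\s+arthritis\b", re.I), "RA-pseudo"),
]

def pseudo_codes(free_text):
    """Return the pseudo-codes whose patterns match the free text."""
    return [code for pat, code in PATTERNS if pat.search(free_text)]

# A hypothetical anonymised free-text entry:
record = "Pt reports appendix removed in 2003; query rheumatoid arthritis."
print(pseudo_codes(record))  # ['002.23', 'RA-pseudo']
```

Annotations produced this way are kept distinct from clinician-entered codes ("pseudo-codes") precisely so they can be reviewed and validated, as the description above outlines.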
PI (NLP Workstrand): John Carroll
PIs (Visualisation Workstrand): Donia Scott, Rosemary Tate
PI (Statistics Workstrand): Rosemary Tate
Research Fellow: Rob Koeling