The University of Sussex

Class-based statistical models for lexical knowledge acquisition

Stephen Clark

This thesis is about the automatic acquisition of a particular kind of lexical knowledge, namely the knowledge of which noun senses can fill the argument slots of predicates. The knowledge is represented using probabilities, which agrees with the intuition that there are no absolute constraints on the arguments of predicates, but that the constraints are satisfied to a certain degree; thus the problem of knowledge acquisition becomes the problem of probability estimation from corpus data. The problem with defining a probability model in terms of senses is that this involves a huge number of parameters, which results in a sparse data problem. The proposal here is to define a probability model over senses in a semantic hierarchy, and exploit the fact that senses can be grouped into classes consiting of semantically similar senses. A novel class-based estimation technique is developed, together with a procedure that determines a suitable class for a sense (given a predicate and argument position). The problem of determining a suitable class can be thought of as finding a suitable level of generalisation in the hierarchy. The generalisation procedure uses a statistical test to locate areas consisting of semantically similar senses, and, as well as being used for probability estimation, is also employed as part of a re-estimation algorithm for estimating sense frequencies from incomplete data. The rest of the thesis considers how the lexical knowledge can be used to resolve structural ambiguities, and provides empirical evaluations. The estimation techniques are first integrated into a parse selection system, using a probabilistic dependency model to rank the alternative parses for a sentence. Then, a PP-attachment task is used to provide an evaluation which is more focussed on the class-based estimation technique, and finally, a pseudo disambiguation task is used to compare the estimation technique with alternative approaches.

Download compressed postscript file