Diana McCarthy, Rob Koeling, Julie Weeds
In word sense disambiguation (WSD), the heuristic of choosing the first listed sense in a dictionary is often hard to beat, especially by systems that do not exploit hand-tagged training data. The problem with the first-sense heuristic, aside from the fact that it does not take surrounding context into account, is that it assumes some quantity of hand-tagged data, because the sense ordering is itself typically derived from sense frequencies in hand-tagged text. Whilst hand-tagged corpora are available for some languages, one would expect the frequency distribution of the senses of a word to depend on the genre and domain of the text under consideration. For example, one would expect a different predominant sense for 'star' in scientific astronomy reports than in popular news. We present work on using the WordNet similarity package, together with a thesaurus automatically acquired from raw textual corpora, to rank WordNet noun senses automatically. The results are promising when evaluated against the gold standard provided by SemCor, giving us a theoretical WSD precision of 60% on an all-words task. Moreover, some of the ranking errors can be explained by differences between the corpus data used to produce the thesaurus and this gold standard. Our experiments also show that the automatic ranking can be used to filter out senses that are unseen or infrequent in the gold standard.
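The abstract does not state the scoring formula, so the following is only a minimal sketch of one way such a ranking could be computed. It assumes that each thesaurus neighbour of the target word contributes its distributional similarity score, shared among the target's senses in proportion to a sense-to-neighbour similarity. The function name `rank_senses`, the sense labels, and all numeric values are hypothetical toy data; in practice the neighbour scores would come from the automatically acquired thesaurus and the sense similarities from the WordNet similarity package.

```python
def rank_senses(senses, neighbours, sense_sim):
    """Rank the senses of a target word by a prevalence-style score.

    senses:     list of sense identifiers for the target word
    neighbours: dict mapping each thesaurus neighbour to its
                distributional similarity with the target word
    sense_sim:  function (sense, neighbour) -> non-negative similarity
    """
    scores = {sense: 0.0 for sense in senses}
    for neighbour, dist_sim in neighbours.items():
        sims = {sense: sense_sim(sense, neighbour) for sense in senses}
        total = sum(sims.values())
        if total == 0:
            # This neighbour is unrelated to every sense, so it is skipped.
            continue
        for sense in senses:
            # Share the neighbour's weight among senses in proportion
            # to how similar each sense is to that neighbour.
            scores[sense] += dist_sim * sims[sense] / total
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy values, not taken from WordNet or any real corpus.
SENSES = ["celestial_body", "celebrity"]

TOY_SIM = {
    ("celestial_body", "planet"): 0.8,
    ("celestial_body", "galaxy"): 0.7,
    ("celestial_body", "actor"): 0.1,
    ("celestial_body", "singer"): 0.1,
    ("celebrity", "planet"): 0.1,
    ("celebrity", "galaxy"): 0.1,
    ("celebrity", "actor"): 0.9,
    ("celebrity", "singer"): 0.85,
}


def toy_sense_sim(sense, neighbour):
    """Look up a toy sense-to-neighbour similarity (0.0 if unknown)."""
    return TOY_SIM.get((sense, neighbour), 0.0)


# Neighbours of 'star' as they might look in two different domains.
ASTRONOMY_NEIGHBOURS = {"planet": 0.30, "galaxy": 0.25, "actor": 0.05}
NEWS_NEIGHBOURS = {"actor": 0.30, "singer": 0.25, "planet": 0.05}

if __name__ == "__main__":
    print(rank_senses(SENSES, ASTRONOMY_NEIGHBOURS, toy_sense_sim))
    print(rank_senses(SENSES, NEWS_NEIGHBOURS, toy_sense_sim))
```

With these toy inputs the top-ranked sense of 'star' differs between the two neighbour sets, which mirrors the domain dependence described above. Normalising by the per-neighbour total keeps a neighbour that is similar to many senses from dominating the ranking; that normalisation is a design choice of this sketch, not something the abstract confirms.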