SALS-SIG Research Seminar
Towards Solving the Problem of Ambiguity in Statistical Natural Language Processing
Abstract: Many problems in NLP can be successfully approached by using methods that rely on feature vectors. Since entities in natural language tend to be ambiguous, the feature vectors that we derive from text corpora can be assumed to be mixtures of the vectors of some underlying unambiguous entities. The problem in understanding and simulating natural language is that we can only observe and study the complicated behavior of the ambiguous entities, whereas the presumably simpler behavior of the underlying unambiguous entities remains hidden. These considerations lead us to independent component analysis (ICA), a statistical formalism related to principal component analysis that takes higher-order dependencies into account. By assuming independence, ICA is capable of detecting a set of hidden vectors if only different linear mixtures of these vectors are observable. As a test case for ICA's applicability to NLP we look at the task of word sense induction. Our starting point is that we consider the co-occurrence vector of an ambiguous word as a linear mixture of its unknown sense vectors. If corpora from different domains are available, this should give us the different linear mixtures that are required for ICA. It turns out that the independent sense vectors derived by ICA from the distributional differences of word usage reflect a word's meanings surprisingly well.

Parking: Visitors requiring a parking pass are asked to contact us at least one working day before the seminar.

Enquiries: sals@ics.mq.edu.au
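
Illustration: the mixture idea in the abstract can be tried out on toy data. The sketch below is not the speaker's method or data. It assumes two hypothetical sparse "sense vectors" over 2000 context words, two invented domain mixtures with made-up weights, and it swaps in one simple ICA variant for the two-source case: whiten the two observed vectors, then search for the rotation that maximises the summed absolute excess kurtosis. All names and numbers are assumptions chosen for illustration.

```python
import math
import random


def center(xs):
    """Subtract the mean so the signal has zero mean."""
    m = sum(xs) / len(xs)
    return [x - m for x in xs]


def whiten(x1, x2):
    """Decorrelate two centred signals and scale each to unit variance."""
    n = len(x1)
    a = sum(u * u for u in x1) / n               # var(x1)
    c = sum(v * v for v in x2) / n               # var(x2)
    b = sum(u * v for u, v in zip(x1, x2)) / n   # cov(x1, x2)
    # Rotation angle that diagonalises the 2x2 covariance [[a, b], [b, c]].
    phi = 0.5 * math.atan2(2 * b, a - c)
    cs, sn = math.cos(phi), math.sin(phi)
    p1 = [cs * u + sn * v for u, v in zip(x1, x2)]
    p2 = [-sn * u + cs * v for u, v in zip(x1, x2)]
    s1 = math.sqrt(sum(u * u for u in p1) / n)
    s2 = math.sqrt(sum(v * v for v in p2) / n)
    return [u / s1 for u in p1], [v / s2 for v in p2]


def excess_kurtosis(ys):
    """Excess kurtosis of a zero-mean signal (0 for a Gaussian)."""
    n = len(ys)
    var = sum(y * y for y in ys) / n
    return (sum(y ** 4 for y in ys) / n) / (var * var) - 3.0


def unmix(z1, z2, steps=90):
    """Grid-search the rotation of whitened data that is most non-Gaussian."""
    best = None
    for k in range(steps):
        t = (math.pi / 2) * k / steps
        cs, sn = math.cos(t), math.sin(t)
        y1 = [cs * u + sn * v for u, v in zip(z1, z2)]
        y2 = [-sn * u + cs * v for u, v in zip(z1, z2)]
        score = abs(excess_kurtosis(y1)) + abs(excess_kurtosis(y2))
        if best is None or score > best[0]:
            best = (score, y1, y2)
    return best[1], best[2]


def correlation(xs, ys):
    """Pearson correlation between two equally long sequences."""
    xs, ys = center(xs), center(ys)
    num = sum(x * y for x, y in zip(xs, ys))
    den = math.sqrt(sum(x * x for x in xs) * sum(y * y for y in ys))
    return num / den


def toy_sense_vector(rng, size, density=0.1):
    """Hypothetical sparse co-occurrence profile of one word sense."""
    return [rng.expovariate(1.0) if rng.random() < density else 0.0
            for _ in range(size)]


rng = random.Random(0)
size = 2000  # assumed number of context words
sense_a = toy_sense_vector(rng, size)
sense_b = toy_sense_vector(rng, size)

# Two domains use the same ambiguous word with different sense proportions
# (the weights are invented for this example).
domain_1 = [0.7 * a + 0.3 * b for a, b in zip(sense_a, sense_b)]
domain_2 = [0.2 * a + 0.8 * b for a, b in zip(sense_a, sense_b)]

z1, z2 = whiten(center(domain_1), center(domain_2))
est_1, est_2 = unmix(z1, z2)

# ICA recovers sources only up to order, sign and scale, so compare by
# absolute correlation with whichever estimate fits best.
match_a = max(abs(correlation(sense_a, est_1)), abs(correlation(sense_a, est_2)))
match_b = max(abs(correlation(sense_b, est_1)), abs(correlation(sense_b, est_2)))
print("sense A best |correlation|:", round(match_a, 3))
print("sense B best |correlation|:", round(match_b, 3))
```

Only the two mixed vectors are passed to the unmixing step; the toy sense vectors are used afterwards solely to check how well they were recovered. Real co-occurrence data would involve more senses and more domains, for which a general ICA implementation would be the more natural choice.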
Last modified: 28th July 2003