
Word-sense disambiguation

Several approaches have relied upon word overlap in dictionary definitions to resolve word-sense ambiguities in context, starting with [72].
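As a concrete illustration of the word-overlap heuristic, the following minimal sketch (in Python) selects the sense whose definition shares the most words with the context; the glosses for "bank" and the example sentence are hypothetical and are not taken from the dictionary used in [72].

# Minimal sketch of definition-overlap disambiguation.
# The glosses and the example context below are hypothetical.

def overlap_sense(sense_glosses, context_words):
    """Return the sense whose gloss shares the most words with the context."""
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        score = len(context & set(gloss.lower().split()))
        if score > best_overlap:
            best_sense, best_overlap = sense, score
    return best_sense

bank_glosses = {
    "bank (river)":   "sloping land beside a body of water",
    "bank (finance)": "an institution that accepts deposits and lends money",
}
context = "he sat on the bank of the river and watched the water flow".split()
print(overlap_sense(bank_glosses, context))   # expected: bank (river)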

Cowie et al. [30] extend the idea by using simulated annealing to optimize a configuration of word senses for all ambiguous words simultaneously, scored by the degree of word overlap among the chosen definitions.
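A rough sketch of the idea follows; the pairwise gloss-overlap score, the toy sense inventory, and the geometric cooling schedule are simplifying assumptions for illustration, not the exact formulation of Cowie et al.

# Sketch of simulated annealing over a joint assignment of senses, scored by
# total pairwise word overlap among the glosses of the chosen senses.
import math
import random

def gloss_overlap(g1, g2):
    return len(set(g1.split()) & set(g2.split()))

def config_score(assignment, senses):
    # total pairwise word overlap among the glosses of the chosen senses
    glosses = [senses[w][s] for w, s in assignment.items()]
    return sum(gloss_overlap(a, b)
               for i, a in enumerate(glosses) for b in glosses[i + 1:])

def anneal(senses, steps=2000, temp=2.0, cooling=0.995):
    """senses: word -> {sense name: gloss}. Returns word -> chosen sense."""
    assignment = {w: random.choice(list(senses[w])) for w in senses}
    current = config_score(assignment, senses)
    for _ in range(steps):
        w = random.choice(list(senses))
        proposal = dict(assignment)
        proposal[w] = random.choice(list(senses[w]))
        delta = config_score(proposal, senses) - current
        # accept improvements; occasionally accept worse configurations
        if delta >= 0 or random.random() < math.exp(delta / temp):
            assignment, current = proposal, current + delta
        temp *= cooling
    return assignment

senses = {
    "bank":  {"river": "the sloping land beside a river",
              "money": "an establishment for the custody of money"},
    "water": {"liquid": "the liquid that fills a river or lake"},
}
print(anneal(senses))   # expected: bank assigned its "river" sense

Because the score is computed over the entire configuration, the search can prefer sense assignments that are mutually supportive rather than merely the best for each pair in isolation.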

Veronis and Ide [117] develop a neural network model to overcome another limitation of word-overlap approaches, which address only pairwise dependencies. Using dictionary definitions, they construct a network in which each word node is linked to nodes for each of its senses, and each sense node is linked to the words used in its definition. Spreading activation through this network captures longer-distance dependencies. However, their model introduces noise through the links from senses to definition words, and it does not distinguish important lexical relations from incidental ones.
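The flavor of spreading activation over such a word/sense graph can be conveyed with a toy example; the graph, the damped update rule, and the clamping of context words below are assumptions made for illustration rather than details of their network.

# Toy spreading-activation pass over a word/sense graph built from definitions.

def spread_activation(neighbors, seeds, iterations=10, damping=0.5):
    """neighbors: node -> list of linked nodes; seeds: initially active nodes."""
    activation = {n: 0.0 for n in neighbors}
    for s in seeds:
        activation[s] = 1.0
    for _ in range(iterations):
        incoming = {n: sum(activation[m] for m in neighbors[n]) / max(len(neighbors[n]), 1)
                    for n in neighbors}
        activation = {n: damping * activation[n] + (1 - damping) * incoming[n]
                      for n in neighbors}
        for s in seeds:               # keep the context words clamped on
            activation[s] = 1.0
    return activation

# A word links to its senses, and a sense links to the words in its definition
# (made symmetric here); all nodes and links are illustrative.
graph = {
    "pen":           ["pen/writing", "pen/enclosure"],
    "pen/writing":   ["pen", "ink", "write"],
    "pen/enclosure": ["pen", "fence", "animal"],
    "ink":           ["pen/writing"],
    "write":         ["pen/writing"],
    "fence":         ["pen/enclosure"],
    "animal":        ["pen/enclosure"],
}
act = spread_activation(graph, seeds=["pen", "ink"])
print(max(["pen/writing", "pen/enclosure"], key=act.get))   # expected: pen/writing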

Resnik [99] describes an unsupervised approach that is based on estimating synset frequencies. The estimated frequency of a synset is based on the frequencies of its words plus the frequencies of all its descendant synsets, computed over a large corpus. Therefore, the top-level synsets accumulate the highest counts and thus have the highest estimated frequency of occurrence. For each pair of nouns from the text to be disambiguated, the most informative subsumer is determined by finding the common ancestor with the highest information content, which is inversely related to frequency. Each noun is then disambiguated by selecting the synset that receives the most support (i.e., information content) from all of the most informative subsumers. This approach is attractive since sense-tagged data is not required; however, it is vulnerable to noise arising from estimating synset frequencies over raw text.
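The computation can be illustrated with a toy taxonomy; the is-a links, the counts, and the sense inventory below are hypothetical stand-ins for WordNet synsets and corpus-derived estimates, and the crediting step is simplified to reward only the sense pair that yields the most informative subsumer.

import math
from itertools import combinations

# Toy is-a links (child -> parent) and corpus-style counts; both are
# illustrative stand-ins for WordNet synsets and real frequency estimates.
parent = {"crane/bird": "bird", "heron/bird": "bird", "crane/machine": "machine",
          "bird": "animal", "machine": "artifact",
          "animal": "entity", "artifact": "entity", "entity": None}
count  = {"crane/bird": 5, "heron/bird": 3, "crane/machine": 8,
          "bird": 20, "machine": 30, "animal": 200, "artifact": 300, "entity": 1000}
total = count["entity"]

def ic(s):                         # information content: -log p(s)
    return -math.log(count[s] / total)

def ancestors(s):
    out = []
    while s is not None:
        out.append(s)
        s = parent[s]
    return out

def most_informative_subsumer(s1, s2):
    common = set(ancestors(s1)) & set(ancestors(s2))
    return max(common, key=ic)

senses = {"crane": ["crane/bird", "crane/machine"], "heron": ["heron/bird"]}

# Accumulate support: for each noun pair, find the sense pair whose common
# ancestor is most informative, and credit that ancestor's information content
# to the two senses involved (a simplification of the original crediting step).
support = {s: 0.0 for ss in senses.values() for s in ss}
for w1, w2 in combinations(senses, 2):
    subsumer, a, b = max(((most_informative_subsumer(a, b), a, b)
                          for a in senses[w1] for b in senses[w2]),
                         key=lambda t: ic(t[0]))
    support[a] += ic(subsumer)
    support[b] += ic(subsumer)

for w, ss in senses.items():
    print(w, "->", max(ss, key=support.get))   # expected: crane -> crane/bird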

Yarowsky [126] uses co-occurrence statistics for the words listed under each Roget category to determine contextual support for that category. The classification variable ranges over the different Roget thesaural topics (roughly 1000). Each training instance, a context of 100 words, contributes support to each topic for which the context includes any of the words listed under that topic. During classification, the selected sense of a word is the one whose thesaural category has the most associations with the words in the current context.
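A bare-bones sketch of such a thesaurus-category classifier is given below; the toy categories, the tiny training contexts, and the smoothed log-ratio weighting are stand-ins for the Roget categories, 100-word contexts, and weighting scheme used by Yarowsky.

import math
from collections import Counter

# Toy "training contexts" gathered for two thesaurus categories.
contexts = {
    "ANIMAL":  ["the crane stood in the shallow water hunting fish",
                "a heron and a crane nested near the marsh"],
    "MACHINE": ["the crane lifted the steel beam onto the truck",
                "the crane operator moved the concrete blocks"],
}

counts = {c: Counter(w for text in texts for w in text.split())
          for c, texts in contexts.items()}
totals = {c: sum(cnt.values()) for c, cnt in counts.items()}
global_counts = sum(counts.values(), Counter())
global_total = sum(totals.values())

def weight(word, category):
    """log of P(word | category) / P(word), with add-one smoothing."""
    p_w_cat = (counts[category][word] + 1) / (totals[category] + len(global_counts))
    p_w = (global_counts[word] + 1) / (global_total + len(global_counts))
    return math.log(p_w_cat / p_w)

def classify(context):
    """Pick the category whose associated words best match the context."""
    return max(contexts, key=lambda c: sum(weight(w, c) for w in context.split()))

print(classify("a crane hunting fish in shallow water near the marsh"))  # expected: ANIMAL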

Bruce and Wiebe [21,24] show the importance of accounting for feature interdependence when developing statistical models for word-sense disambiguation. Instead of assuming a fixed model form, as is done in Naive Bayes classifiers [37], they describe a method for selecting the form of the model. Specifically, model search is performed over the space of decomposable models, for which closed-form expressions exist; the graphs of these models have the property that they can be triangulated. Backward search starts with the saturated model (all features interdependent) and iteratively selects a dependency (an edge of the graph) to remove, until the model of independence is reached or until a goodness-of-fit criterion indicates that the edge to be removed no longer sufficiently reflects independence of the joined features (i.e., the features are most likely dependent). Forward search proceeds in the opposite direction, adding a dependency at each iteration. Using simple context features, such as the parts of speech of surrounding words and collocations for the target word senses, they show that this approach achieves high accuracy for twelve cases previously covered in the literature. This approach is a form of supervised learning, so it requires a fair amount of tagged training data to work best.
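The backward step can be caricatured as follows; this sketch tests marginal independence of feature pairs with a G-squared statistic and a fixed cutoff, whereas the actual procedure searches decomposable models that include the sense variable and exploits their closed-form estimates.

import math
from collections import Counter
from itertools import combinations

def g_squared(data, i, j):
    """G-squared statistic for independence of columns i and j of the data."""
    n = len(data)
    joint = Counter((row[i], row[j]) for row in data)
    mi = Counter(row[i] for row in data)
    mj = Counter(row[j] for row in data)
    return sum(2 * obs * math.log(obs / (mi[a] * mj[b] / n))
               for (a, b), obs in joint.items())

def backward_search(data, n_features, threshold=3.84):
    """Start with all pairwise dependencies; repeatedly drop the weakest edge
    while its G-squared value stays under a fixed cutoff (here the 0.05
    chi-square value for one degree of freedom, used only for illustration)."""
    edges = set(combinations(range(n_features), 2))
    while edges:
        weakest = min(edges, key=lambda e: g_squared(data, *e))
        if g_squared(data, *weakest) >= threshold:
            break            # remaining edges all look genuinely dependent
        edges.remove(weakest)
    return edges

# toy instances: (part of speech to the left, collocate present?, sense tag)
data = [("det", 1, "s1"), ("det", 1, "s1"), ("verb", 0, "s2"),
        ("verb", 0, "s2"), ("det", 0, "s2"), ("verb", 1, "s1")]
print(backward_search(data, 3))   # dependencies retained among features 0, 1, 2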

To alleviate the problem of insufficient training data, Bruce and Wiebe [23] proposed that data augmentation techniques be used and that analytical knowledge from existing resources (e.g., a dictionary) be integrated into the model inferred from the training data. Data augmentation uses stochastic sampling to infer the values of untagged data, iteratively selecting the most probable value for each missing item based on the values of the others (starting from default values for the missing data). Iteration proceeds until the average of the assigned values converges.
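A minimal sketch of such an augmentation loop, assuming a single feature per instance and a simple smoothed Bayes-style estimate as the stand-in model, is shown below; the toy data, the smoothing, the deterministic re-tagging in place of stochastic sampling, and the convergence test are all illustrative assumptions.

from collections import Counter

def augment(tagged, untagged, senses, iterations=20):
    """tagged: list of (feature, sense) pairs; untagged: list of features.
    Repeatedly re-tag the untagged instances with their most probable sense
    under counts pooled from the tagged data and the current assignments."""
    assigned = [senses[0]] * len(untagged)           # default starting values
    for _ in range(iterations):
        pool = tagged + list(zip(untagged, assigned))
        sense_counts = Counter(s for _, s in pool)
        feature_counts = Counter(pool)
        def prob(s, f):                              # P(s) * P(f | s), smoothed
            return (sense_counts[s] / len(pool)) * \
                   ((feature_counts[(f, s)] + 1) / (sense_counts[s] + 2))
        new = [max(senses, key=lambda s: prob(s, f)) for f in untagged]
        if new == assigned:                          # assignments have stabilized
            break
        assigned = new
    return assigned

tagged   = [("money", "finance"), ("money", "finance"), ("loan", "finance"),
            ("water", "river"), ("water", "river"), ("shore", "river")]
untagged = ["money", "water"]
print(augment(tagged, untagged, ["finance", "river"]))   # expected: ['finance', 'river']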

The use of analytical knowledge enriches the set of relations among the target words that can be inferred from the training data. For instance, scientific terminology can be organized into a taxonomy. These additional relationships also allow for the simultaneous resolution of interdependent ambiguities, in which the empirical support for individual word senses (inferred from the context) can be used to achieve mutual reinforcement. Wiebe et al. [121] later implemented this approach using Bayesian networks.

