
Corpus analysis (using statistical language learning)

There have been a few studies that have used lexical associations derived from corpus analysis for structural disambiguation. Hindle and Rooth [56] were the first to demonstrate the basic technique. They show how to induce lexical associations from simple syntactic relationships (e.g., verb/object) extracted using somewhat shallow parsing, in combination with a few heuristics for resolving ambiguous relationships. These associations can be considered as conditional probabilities that a particular preposition is attached to the noun or verb, given that the latter is present. Attachment is resolved by selecting the case with the higher association. Evaluation was performed on 1000 random cases from the AP newswire, showing that the approach improves substantially over right association alone, a common heuristic. In addition, a comparison against information derived from a dictionary (COBUILD) shows that the corpus-based approach fares better.

To train the system, they first applied a part-of-speech tagger and a shallow parser to a 13 million word newswire corpus and then extracted triples of the form ${\langle}verb, noun, prep{\rangle}$ from the parses, where either the verb or the noun might be empty. Next, heuristics were applied to associate the preposition with the verb or noun, and the results were tabulated to produce ${\langle}verb, prep{\rangle}$ and ${\langle}noun,prep{\rangle}$ bigram frequency counts. (Thus, this is an instance of unsupervised learning, since it doesn't require human annotations defining examples of the classification being performed.)
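The tabulation step amounts to a single counting pass over the extracted triples. The following Python sketch is illustrative only and is not Hindle and Rooth's actual implementation; the use of None for an empty slot, the `site` field recording the heuristic's attachment decision, and the function name are all assumptions introduced here.

```python
from collections import Counter

def tabulate_bigrams(attached_triples):
    """Tabulate <verb, prep> and <noun, prep> bigram counts.

    attached_triples: iterable of (verb, noun, prep, site), where `site` is the
    attachment decided by the heuristics ('verb' or 'noun'), and where verb or
    noun may be None when that slot was empty in the extracted triple.
    """
    verb_prep, noun_prep = Counter(), Counter()
    for verb, noun, prep, site in attached_triples:
        if site == "verb" and verb is not None:
            verb_prep[(verb, prep)] += 1
        elif site == "noun" and noun is not None:
            noun_prep[(noun, prep)] += 1
    return verb_prep, noun_prep

# Hypothetical triples after POS tagging, shallow parsing, and the heuristics.
triples = [("send", "letter", "to", "verb"), (None, "letter", "from", "noun")]
verb_prep, noun_prep = tabulate_bigrams(triples)
```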

To decide on the attachment for test data, the POS tagging and parsing are performed as above, along with the extraction of the triples. But then, instead of using the heuristics on each ambiguous triple (i.e., those with both verb and noun non-empty), the bigram frequencies are used in a log-likelihood ratio test:

\begin{displaymath}
LA(v, n, p) = \log_2 \frac{P(verb\_attach\ p\ \vert\ v, n)}{P(noun\_attach\ p\ \vert\ v, n)}
\end{displaymath}

where $P(verb\_attach\ p\ \vert\ v, n)$ is estimated by freq(verb,prep)/TotalFreq and likewise for the noun attachment probability. A positive score favors verb attachment, and a negative one favors the noun attachment; a score of zero implies insufficient information to decide on the attachment.
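A minimal sketch of this decision rule is given below, assuming the bigram counters from the training sketch above plus per-head total frequencies; the smoothing behavior (returning zero when either estimate is missing) is a simplification introduced here, not Hindle and Rooth's procedure.

```python
import math

def attachment_score(verb, noun, prep, verb_prep, noun_prep, verb_total, noun_total):
    """Log-likelihood ratio for PP attachment, in the spirit of Hindle and Rooth.

    verb_total / noun_total: total frequencies for each verb/noun head (assumed
    to play the role of TotalFreq in the probability estimates).
    Positive score -> verb attachment; negative -> noun attachment; zero -> undecided.
    """
    p_verb = verb_prep.get((verb, prep), 0) / max(verb_total.get(verb, 0), 1)
    p_noun = noun_prep.get((noun, prep), 0) / max(noun_total.get(noun, 0), 1)
    if p_verb == 0 or p_noun == 0:
        # A real implementation would smooth these estimates rather than abstain.
        return 0.0
    return math.log2(p_verb / p_noun)
```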

Basili et al. [9] show how the same type of disambiguation can be achieved using selectional restrictions that are semi-automatically acquired from corpus statistics. These selectional restrictions form the basis for their lexicons that are tuned to particular subdomains, in their case, a legal domain, a business domain, and a remote sensing (satellite images) domain. Some important observations are that these selectional restrictions are much different in each domain, and that, in some cases, this information cannot be inferred from conventional dictionary definitions. They define semantic expectation as the probability that a pair of concepts occurs in a given relationship. This is similar in spirit to Resnik's [100] selectional association measure (see below), although they consider more than just verb-object relationships.

The main drawback to this approach is that it requires manual effort to assign the high-level concepts to the entries in the lexicon. In addition, it relies upon a core set of selectional restrictions at the conceptual level that a human (ideally a linguist) must determine to be relevant. However, once this has been done, the rest of the process, namely the determination of the selectional restrictions for particular words, is automatic. An experiment in prepositional attachment disambiguation shows how this method improves over an extension of Hindle and Rooth's [56]; see Table 3. They use information gain ($I_{\Sigma}$ good $- I_{\Sigma}$ all) as a measure of task complexity: it indicates the number of bits needed to represent the reduction in uncertainty provided by the classification.

The table shows that semantic expectation (SE), incorporating the selectional restrictions semi-automatically acquired, performs better than lexical association (LA). In addition, the task for SE is slightly more complicated, as indicated by the higher information gain.


Table 3: Semantic expectation (SE) vs. lexical association (LA) [9]

                         SE       LA
  $I_{\Sigma}$ all       0.203    0.174
  $I_{\Sigma}$ good      0.748    0.673
  Accuracy               0.686    0.614
  $I_{\Sigma}$ avg       0.380    0.200

Building upon this basic framework for determining verb subcategorizations, Basili et al. [8] show how verbs can be hierarchically clustered into classes. In line with the notion of cue validity, the classification is based on maximizing the extent to which categories are associated with different attributes.


This can be seen as minimizing the mean entropy of the distribution of the likelihood for the attribute values. The attributes are based on the pairings of thematic roles and conceptual types derived from the relational triples. The main advantage of this clustering approach is that the thematic roles can serve in the semantic description of the classes.
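As a rough illustration of this entropy criterion (not Basili et al.'s actual algorithm), one can score a candidate clustering by the mean entropy of the attribute-value distributions within each cluster; lower mean entropy means the clusters are more strongly tied to distinct attributes. The data representation below is an assumption made for the sketch.

```python
import math
from collections import Counter

def mean_entropy(clusters):
    """clusters: non-empty list of non-empty lists of attribute values
    (thematic-role/conceptual-type pairings) observed for the verbs in each cluster.
    Returns the mean entropy (in bits) of the per-cluster attribute distributions."""
    entropies = []
    for attributes in clusters:
        counts = Counter(attributes)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(entropy)
    return sum(entropies) / len(entropies)
```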

Grishman and Sterling [48,49] have also extracted relational triples from parsed corpora, which they use to define selectional patterns. The initial work [48] concentrated on the extraction and generalization of the triples, using manually developed word classes from earlier work. The steps in this approach are as follows:
1. Parse the corpus with a broad-coverage grammar.
2. Extract relational triples of the form $\langle word~relation~word\rangle$ from the parses.
3. Generalize the triples using word classes.
They experimented with various ways of handling the triples arising from multiple parses and with different ways of evaluating the results. In one case, they showed that selecting parses based on the triples performs better than manually selected parses, when evaluated against the Penn Treebank standard parses. Their later work [49] experimented with automatically deriving the word classes. To do this, a confusion matrix is computed, giving the probability that each word can be used in the same contexts as another word, taken individually. This is computed as follows:
\begin{displaymath}
P_C(w_i\vert w_i') = \sum_{r,w_j} P(w_i\vert r,w_j)\,\frac{F(\langle w_i'~r~w_j \rangle)}{F_{head}(w_i')}
\end{displaymath}

This uses the relation and the relational object ($w_j$) as the context. A large value of $P_C(w_i\vert w_i')$ suggests that $w_i$ is ``selectionally (semantically) acceptable'' in contexts that $w_i'$ appears in. Then, the frequencies for the triples are generalized (smoothed) by averaging the frequencies for similar words:
\begin{displaymath}
F_S(\langle w_i~r~w_j\rangle) = \sum_{w_i'} P_C(w_i\vert w_i')\, F(\langle w_i'~r~w_j\rangle)
\end{displaymath}

where $F_S(\langle w_i~r~w_j\rangle)$ is the smoothed frequency for the triple. Evaluations showed that this smoothing significantly increases recall in identifying valid relational triples.
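A minimal sketch of the two formulas above is given here; the dictionary-based data structures and the externally supplied estimate of $P(w_i\vert r,w_j)$ are assumptions for illustration, not Grishman and Sterling's implementation.

```python
def confusion_prob(w_i, w_i_prime, triple_freq, head_freq, cond_prob):
    """P_C(w_i | w_i'): how acceptable w_i is in the contexts (r, w_j) of w_i'.

    triple_freq: dict (w, r, w_j) -> F(<w r w_j>)
    head_freq:   dict w -> F_head(w), total frequency of w as a triple head
    cond_prob:   function (w_i, r, w_j) -> estimate of P(w_i | r, w_j)
    """
    total = 0.0
    for (w, r, w_j), freq in triple_freq.items():
        if w == w_i_prime:
            total += cond_prob(w_i, r, w_j) * freq / head_freq[w_i_prime]
    return total

def smoothed_frequency(w_i, r, w_j, triple_freq, similar_words, p_c):
    """F_S(<w_i r w_j>): average the observed frequencies over similar words w_i',
    weighted by the confusion-matrix probabilities p_c[(w_i, w_i')]."""
    return sum(p_c[(w_i, w)] * triple_freq.get((w, r, w_j), 0) for w in similar_words)
```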

In addition to analyzing large corpora of a single language, several projects have used bilingual corpora of the same text in different languages, for example transcripts of the Canadian parliament (Hansards) in French and English [18]. Once the sentences have been aligned, fairly accurate lexical associations can be made between corresponding words in the two languages [44]. This has the advantage of producing a quick and dirty translation lexicon tuned to a particular corpus. It has also found use in lexical ambiguity resolution, since an ambiguous word might be consistently aligned with different unambiguous words in the other language. This type of approach might be helpful in a potential future extension: applying the techniques for extracting finer sense distinctions to the definition glosses of bilingual dictionaries (e.g., the extended definitions provided in cases of lexical gaps).

Resnik [100] has done some influential work on combining statistical approaches with more traditional knowledge-based approaches. For instance, he defines a measure based on information content for the semantic similarity of nouns that uses the WordNet hierarchy along with frequency statistics for each synset. These statistics are estimated from the corpus frequency of all the words subsumed by that synset. This is particularly relevant for the proposed work, as it will serve as one of the methods for determining semantic relatedness against which the differentia-extraction technique will be compared.

Resnik's technique relies on the use of WordNet synsets to define the classes over which frequency statistics are maintained. This is done to avoid the data sparsity problem associated with statistical inference at the word level. A benefit of doing this is that the classes provide an abstraction that facilitates comparison. For instance, he defines selectional preference profiles for verbs by tabulating the distribution of the classes for the verbal subjects and objects. The degree to which verbs select for their arguments can be summarized by a measure called the selectional preference strength, which is the relative entropy of the conditional distribution of the classes given the verb with respect to the prior distribution of the classes [100]:
\begin{displaymath}
S_R(v) = D(P(c\vert v)\;\Vert\;P(c)) = \sum_{c} P(c\vert v)\,\log\frac{P(c\vert v)}{P(c)}
\end{displaymath}
To quantify the preference for a particular class, the selectional association measure is defined as follows:
\begin{displaymath}
A_R(v, c) = \frac{1}{S_R(v)}\, P(c\vert v)\,\log\frac{P(c\vert v)}{P(c)}
\end{displaymath}
This is the relative contribution that the class makes to the selectional preference strength.
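These two measures translate directly into code. The sketch below assumes the class distributions have already been estimated from WordNet-based counts (the dictionary representation is an assumption); it simply evaluates the relative entropy and the per-class contribution.

```python
import math

def selectional_preference_strength(p_class_given_verb, p_class):
    """S_R(v): relative entropy between P(c|v) and the prior P(c).

    p_class_given_verb: dict class -> P(c|v) for a given verb
    p_class:            dict class -> prior P(c), assumed nonzero for observed classes
    """
    return sum(p_cv * math.log(p_cv / p_class[c])
               for c, p_cv in p_class_given_verb.items() if p_cv > 0)

def selectional_association(c, p_class_given_verb, p_class):
    """A_R(v, c): the relative contribution of class c to S_R(v)."""
    strength = selectional_preference_strength(p_class_given_verb, p_class)
    p_cv = p_class_given_verb.get(c, 0.0)
    if p_cv == 0 or strength == 0:
        return 0.0
    return (p_cv * math.log(p_cv / p_class[c])) / strength
```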

The 1993 SIGLEX Workshop [13] emphasized corpus techniques specifically for acquiring lexical knowledge. Some of the techniques rely solely on corpus analysis. For instance, Aone and McKee [5] use heuristics for deciding on the high-level verbal situation type based on the degrees to which the corpus instances are transitive and have animate subjects. These situation types (e.g., caused-process) determine the basic predicate argument structure of the verbs. Also, idiosyncratic properties of the verbs are acquired through the use of mutual information co-occurrence checks.

Grefenstette [46] compares two knowledge-poor techniques for deriving notions of semantic similarity. The evaluation is novel in not requiring hand-coded similarity ratings and in not basing the result on application performance. Instead, a similarity relationship is considered valid if the words occur together in some Roget category. The baseline technique is co-occurrence of words in a window around the target word. The other technique uses shallow parsing to extract syntactic relationships involving the target word. The latter works better in general for high-frequency terms, but it is not as good for infrequent ones.

Some of the other techniques incorporate existing lexical resources. For instance, Hearst and Schütze [51] show how to refine the WordNet taxonomy with domain-specific terms. This uses Schütze's WordSpace algorithm, which computes a reduced-dimension co-occurrence matrix over four-letter character sequences from the text. For a given word in WordSpace, its associated WordNet synset is determined by finding the synset containing the largest number of the words in a 20-word nearest neighborhood in WordSpace (see the sketch after this passage).

Poznanski and Sanfilippo [94] describe a way to extract semantically tagged subcategorization frames from bracketed corpora, specifically the Penn Treebank, using the thesaural and grammatical information in the Longman Lexicon of Contemporary English (LLOCE). This works by extracting the verbal frames, replacing words by their thesaural categories, merging similar frames, and then using the grammatical codes in LLOCE to filter out categories that do not license the particular frame. In other words, LLOCE is used to convert the word-based subcategorization instances extracted from the Penn Treebank into a class-based form. These would still need to be postprocessed to produce a list of the most common frames. Then, to make the work more comparable to Resnik [100] or Basili et al. [9], conditional probability distributions for the various verbal arguments would need to be derived.
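The synset-assignment step described for Hearst and Schütze can be pictured as a vote over nearest neighbors. The sketch below assumes a precomputed nearest-neighbor routine over WordSpace and a word-to-synsets lookup; neither function name comes from their system.

```python
from collections import Counter

def assign_synset(word, nearest_neighbors, synsets_of):
    """Pick the WordNet synset containing the largest number of the word's
    nearest neighbors in WordSpace.

    nearest_neighbors: function word -> list of the 20 closest words in WordSpace
    synsets_of:        function word -> list of synset identifiers for that word
    """
    votes = Counter()
    for neighbor in nearest_neighbors(word):
        for synset in synsets_of(neighbor):
            votes[synset] += 1
    return votes.most_common(1)[0][0] if votes else None
```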

Lauer [70] has defined a model for assigning probabilities to meaning representations in a compositional manner. The effect of context is handled by the change in the (prior) probabilities for the concepts. This is illustrated with a simple first-order predicate calculus presentation. The mapping of the semantic representation into a syntactic representation is described in detail. For practical reasons, this mapping must be simple, since the source of (unsupervised) training data will be in a form closer to the syntactic representation.

Two main applications of the meaning representation theory are described: first, a syntactic application of the approach (like most current statistical language learning approaches) that selects the most probable bracketing for an N-N-N compound (i.e., [[N N] N] vs. [N [N N]]); second, a novel semantic application for paraphrasing noun-noun compounds (via prepositional phrases). Unlike the adjacency approach, which computes probabilities based on grammatical relatedness, the dependency approach computes probabilities based on conceptual relatedness. For instance, to decide whether a left-bracketing or a right-bracketing is more likely for N1-N2-N3, the adjacency approach considers only the relatedness of (N1, N2) vs. (N2, N3), that is, immediate adjacency. In contrast, the dependency approach considers (N1, N2) vs. (N1, N3), which is at the level of the modification.
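The contrast between the two bracketing criteria can be shown in a few lines. The sketch below is a schematic comparison under an assumed association function (e.g., corpus-derived relatedness scores); it is not Lauer's probabilistic model itself.

```python
def bracket_compound(n1, n2, n3, assoc):
    """Decide the bracketing of the compound N1 N2 N3 under both criteria.

    assoc: function (noun, noun) -> association/relatedness score (assumed given).
    Returns a dict with the decision of each model:
    'left' for [[N1 N2] N3], 'right' for [N1 [N2 N3]].
    """
    # Adjacency model: compare the two immediately adjacent pairs.
    adjacency = "left" if assoc(n1, n2) >= assoc(n2, n3) else "right"
    # Dependency model: compare the two possible modification targets of N1.
    dependency = "left" if assoc(n1, n2) >= assoc(n1, n3) else "right"
    return {"adjacency": adjacency, "dependency": dependency}
```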

A common corpus used in computational linguistics is the Penn Treebank [76]. It consists of parse-tree annotations for a one million word subset of the 1989 Wall Street Journal corpus, along with similar annotations for the Brown corpus. Of particular interest is that the second release of the Penn Treebank [77] includes semantic role tags in its parse tree annotations. The main relations covered are direction, manner, location, purpose, and temporality. The earlier parse tree annotations from the Wall Street Journal corpus were revised and extended, yielding over 35,000 of these semantic role annotations.

