Statistical NLP and Machine Learning
More generally, since we will be processing text, we will need to understand discourse theory and current techniques for discourse processing. We will examine the state of the art in discourse processing and to what extent these techniques might be useful or necessary for our task. This will take us into the areas of discourse structure theory, text categorization, text segmentation, and text summarization. We will also look at how these techniques might be applied to determining ideological point of view.
In order to better understand and define ideological point of view, we will consider some of its linguistic aspects. We will survey work by computational linguists to date on point of view and available linguistic resources. We will also consider some related work on subjectivity in newspaper text and message filtering systems in USENET newsgroups.
As necessary we may discuss techniques from machine learning, information retrieval, language understanding, knowledge representation, and psycholinguistics. Throughout, we will place the greatest emphasis on statistical techniques.
Supposing that such a system can be created, we will consider appropriate means to test and evaluate it.
In order to build an automatic system to segment text by ideological point of view, we must first understand what we mean by ideological point of view. We would also like to understand where ideological point of view might fit into the larger picture of natural language processing.
Ideology has been studied from the perspective of a variety of academic disciplines, including anthropology, sociology, linguistics (particularly in the areas of pragmatics and sociolinguistics), psychology, cognitive science, history, communications studies, political science, rhetoric, critical theory and computer science (particularly in natural language processing, a subfield of artificial intelligence). Given this list, it seems that a multidisciplinary approach to ideology is in order.
For purposes of a working definition of ideology, we will turn to the work of Teun van Dijk, a linguist and professor of discourse studies, who has taken a multidisciplinary approach to ideology. His orientation towards discourse studies makes his work closer to work done in computational linguistics than that of many researchers in other disciplines.
With a working understanding of ideology in hand, we will survey systems developed by natural language researchers which relate to ideology or to point of view.
UNDERSTANDING AND DEFINING IDEOLOGICAL POINT OF VIEW
In common usage, ideology tends to be a somewhat vague term: something that, like pornography or art, one knows when one sees it, but would be hard pressed to define precisely. Ideology is often used pejoratively, perhaps being most easily recognized when someone expresses a strong position with which one disagrees. When conflict involving fundamental differences arises, we have "knowledge" while they have "ideology". However, even these superficial musings start to give us clues toward a definition: it must have something to do with fundamental beliefs held by groups of people ("Us" versus "Them"). To formalize this, let us take a look at the work of van Dijk:
In his internet course "Ideology and Discourse: A Multidisciplinary Introduction", based on his book Ideology: A Multidisciplinary Approach, Teun van Dijk provides a general working definition:
"Ideologies are the fundamental beliefs of a group and its members."
Some comments about this definition are in order:
1. This definition does not imply a negative evaluation of ideology, nor does it limit us to ideologies that legitimize dominance. In this sense it will serve us well in our project of segmenting by point of view, because we want to classify a broad range of ideologies and not to judge them.
2. This definition centers on "groups", a notion we will discuss more shortly, as an intermediate structure between the extremes of an individual and the entire culture or society. Just as there are no individual languages, our definition does not provide for individual ideologies. At the other extreme we have cultural common ground knowledge, or knowledge that is not disputed in a given society or culture. Since this knowledge is not disputed, there is no real interest in considering it in terms of point of view. Over the course of time, group ideologies may become common ground knowledge and vice versa. For example, in Europe 500 years ago, Christianity was common ground knowledge, whereas today it would be viewed as group ideology. Conversely, our current common ground knowledge about the positions of the planets in the solar system was once considered ideology. Again, it appears that these distinctions will serve our task, because on a given topic we want to distinguish between the differing points of view of groups of discussants: individual points of view would be too fine grained, while common ground views would be too coarse.
3. In order to understand the definition we need to clarify what is meant by "group". We would probably not want to consider a "group" of people in a supermarket checkout line as having an ideology. We are interested in social groups that have some level of permanency and common goals. We might define social groupness in terms of "membership criteria (origin, appearance, language, religion, diplomas or a membership card), typical activities (as is the case for professionals), specific goals (teach students, heal patients, bring the news), norms, group relations and resources." "In social terms we may define a number of the properties that people routinely use to identify themselves and others as ingroup and outgroup members, and to act accordingly. Sometimes these group criteria will be quite loose and superficial, e.g., when based on preferred dress or music styles, sometimes they organize virtually all aspects of the life and activities of the members of a group, as may be the case for gender, ethnicity, religion and profession." In other words, a group must have some basis for self-definition and commonality in order to develop a group ideology.
Another aspect of groups that may prove useful is to recognize that they often have structure. The structure may be formal or informal, often including leaders, ordinary members, followers, teachers, and subgroups or individuals fulfilling special functions. The group structure can play an important role in the "acquisition, spreading, defense or inculcation of ideologies. Thus, new members need to learn the ideology of a group." We might hope that some of these structures will be mirrored in the local structure of the world wide web and the structure of Usenet newsgroups and susceptible to computational discovery. We will keep this in mind later when we discuss Kleinberg's methods for locating hubs and authorities on the web.
4. Discourse plays a key role in the development and promulgation of ideologies. Ideologies influence the content of discourse, and ideologies are acquired and transformed through discourse. We will explore the link between ideology and discourse in greater detail below, with particular emphasis on how ideological point of view might be detected in a given discourse. For the purposes of this paper we are interested in written discourse.
5. Ideologies are fundamentally subjective, since beliefs are subjective. We will develop this further when we explore cognitive models of ideology. This observation is important because it links the study of ideological point of view with the more general study of subjectivity in discourse.
Van Dijk divides his study of ideology into three interconnected areas: Cognition, Society, and Discourse. For our purposes, cognition and discourse are of most interest; the issues he raises regarding ideology and society that concern us, those related to understanding groups, have been discussed above. Cognition will be of interest in terms of the mental model he presents and as a way to better understand the definition of ideology. Discourse is of great interest to us, so we will consider his work on it in some detail.
How ideologies are represented in memory is very much an open question at this point. Van Dijk hypothesizes that ideologies are represented in social memory, a part of Long Term Memory that is distinct from, but linked to, Episodic Memory. He chooses to represent the general beliefs of ideologies in a propositional format, for convenience. For example, "All citizens should have equal rights." He also assumes that ideologies form "systems" of beliefs, suggesting some level of order and organization. He theorizes that the organization of ideologies is "schema-like", consisting of "a number of conventional categories that allow social actors to rapidly understand or to build, reject or modify an ideology."
A possibility for such a theoretical schema, derived from the basic properties of a social group, is:
Membership criteria: Who does (not) belong?
Typical activities: What do we do?
Overall aims: What do we want, and why?
Norms and values: What is good or bad for us?
Position: What are our relations with other groups?
Resources: Who has access to our group resources?
Van Dijk deals with this by defining context models to represent the current ongoing communicative event dynamically. The context model keeps track of our goals and intentions, what we believe the participants know, social relations between participants, the social situation, time, or more generally, what is "relevant for discourse in the current communicative situation."
Like other mental models discussed, context models may be ideologically biased. For example, how the speaker perceives the participants in the discourse may be affected by the speaker's ideologies. Thus, ideologies may control both the content and manner of our discourse.
We have considered the cognitive aspects of ideology for several reasons:
1. To better understand and define ideology in general.
2. To understand how we might plausibly get from ideology to discourse, which is our main interest here.
3. Because understanding the human cognitive processes may be helpful in determining appropriate data structures and architectures to use when we attempt to have a machine understand discourse.
4. To provide a foundation for our discussion of ideology and discourse.
We have seen that there are many open questions in the area of cognition, which necessitate making choices and assumptions. We are not currently in a position to say whether van Dijk's choices and assumptions are the correct or even the most expedient ones possible.
We would like to discover properties of discourse that will pick up variations in ideology, based on the underlying context and event models, and social attitudes. Our first thought is that semantics and style would be better places to look than morphology and syntax. Still, some more concrete methodology would be helpful. Van Dijk proposes a heuristic, based on the fundamental notion that ideologies are represented as "some kind of basic self-schema of a group, featuring the fundamental information by which group members identify and categorize themselves, such as their membership criteria, group activities, aims, norms, relations to others, resources, etc." His heuristic claims to represent a very general overall strategy of most ideological discourse: to organize people and society in polarized terms (Us versus Them). The result is a conceptual or "ideological" square of four principles:
Emphasize positive things about Us.
Emphasize negative things about Them.
De-emphasize negative things about Us.
De-emphasize positive things about Them.
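The ideological square suggests a crude computational heuristic. The sketch below is our own illustration, not van Dijk's method; the word lists are hypothetical placeholders that a real system would replace with learned lexicons.

```python
# A minimal sketch of the ideological-square heuristic.
# All word lists here are hypothetical placeholders, not van Dijk's.

POSITIVE = {"brave", "honest", "generous"}
NEGATIVE = {"corrupt", "violent", "lazy"}
IN_GROUP = {"we", "us", "our"}
OUT_GROUP = {"they", "them", "their"}

def square_score(sentence: str) -> int:
    """Score +1 for each Us+positive or Them+negative pairing,
    -1 for each Us+negative or Them+positive pairing."""
    words = set(sentence.lower().split())
    score = 0
    if words & IN_GROUP:
        score += len(words & POSITIVE) - len(words & NEGATIVE)
    if words & OUT_GROUP:
        score += len(words & NEGATIVE) - len(words & POSITIVE)
    return score

# A high score suggests discourse polarized along the square's principles.
print(square_score("we are brave and honest"))  # 2
```

A higher score flags text that flatters the in-group and denigrates the out-group; a real detector would of course need parsing and much richer lexical resources.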
Here we examine the internal structure of the propositions, which taken together constitute the meaning of the discourse. Recall that propositions are things that may be true or false, or which (intuitively speaking) express one complete 'thought'. Sentences consist of one or more propositions. Propositions have a structure of the form:
Predicate(Argument, Argument, Argument...).
Of interest in ideological analysis is that "the predicates of propositions may be more or less positive or negative, depending on the underlying opinions (as represented in mental models)."
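As a concrete illustration, the proposition format above can be rendered as a small data type. The polarity field and the example predicates ("liberate" versus "occupy") are our own hypothetical additions, reflecting the quoted observation that the choice of predicate carries the underlying opinion.

```python
from dataclasses import dataclass

@dataclass
class Proposition:
    """A predicate applied to arguments, i.e. Predicate(Arg1, Arg2, ...),
    annotated with the polarity of the underlying opinion."""
    predicate: str
    arguments: tuple
    polarity: float = 0.0  # >0 positive, <0 negative, per the mental model

    def __str__(self):
        return f"{self.predicate}({', '.join(self.arguments)})"

# Two descriptions of the same event may differ only in predicate choice:
p1 = Proposition("liberate", ("army", "city"), polarity=+1.0)
p2 = Proposition("occupy", ("army", "city"), polarity=-1.0)
```

Detecting such predicate-level polarity differences between texts describing the same event is one plausible route to the ideological analysis described here.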
We see that the three most general levels of analysis are meaning, argumentation and rhetoric. This confirms our suspicions that these three areas would be the most fruitful to pursue in general discourse where we would like to detect ideological point of view. Note that the majority of the categories at the level of rhetoric deal with non-literal language, an area which is particularly difficult in NLP and yet crucial to understanding subjectivity.
Blommaert and Verschueren
Van Dijk has applied his analysis to issues of immigration and racism. For an additional example on the grand scale, we look at the work of Blommaert and Verschueren, two Belgian linguists (pragmaticists), who in their book Debating Diversity conducted an empirical study of publicly available discourse on "migrants" in Belgium. Their sources included mass media, government, political parties, and social scientists whose work was widely broadcast in the media. They found that the central concepts in the migrant debate as conducted by the "tolerant majority", such as culture and democracy, were rarely defined explicitly. When there were explicit definitions, they tended to serve specific rhetorical goals. They also found that there were significant discrepancies between what might be termed the dictionary meanings of these central concepts and the associative or derived meanings used in rhetorical practice. They concluded that the "migrant debate" rests on an ideology of homogeneism: "the idea that the ideal society should be as uniform or homogeneous as possible." In other words, despite the differences in rhetoric between the "tolerant majority" and racist or nationalist groups, there is an unquestioned underlying ideology of homogeneism, which controls the definition of the problem and the range of solutions proposed.
They note three crucial problems in this type of research:
1. The role of the investigator in the communicative process under investigation.
2. Dealing with the integration of micro- and macro-influences on communication. This presents the qualitative problem: "the interdependence of individual cognition and socially constructed meaning."
3. The need to fully utilize the potential of interdisciplinarity.
While it seems clear that interdisciplinarity is crucial in the understanding of ideology and that the role of the investigator must always be considered in scientific research, their second point brings up intriguing possibilities. It seems that the traceable structure of the World Wide Web and Usenet newsgroups may provide genres where the interaction between the micro- and macro- influences can be studied. We will explore this in greater detail when we look at some of the work that has been done on the structure of the World Wide Web (e.g. Terveen, Kleinberg).
Blommaert and Verschueren's definitional framework is similar to van Dijk's. They emphasize the importance of group relations and identities, which they view as "cognitively framed phenomena to be found at the inter-subjective level of the community", where group identities determine our opinions and discourses about others and other forms of our behavior towards them; they note that there are no objective criteria that can be used to identify groups. They define ideology as "any constellation of fundamental or commonsensical, and often normative, ideas and attitudes related to some aspect(s) of social 'reality'." This is at once both broader and more specific than van Dijk's definition: more specific in that it specifies the types of ideas and attitudes, and more general in that it does not rely explicitly on the idea of groups. With qualifying discussion, the underlying idea is essentially the same.
In contrast to van Dijk's approach, Blommaert and Verschueren adopt a materialist perspective on the discourse data, viewing discourse as a "(symbolic) commodity, a source of and instrument for acquiring and elaborating power and status." This invites an ethnographic approach, analyzing the data in the "context of a synchronic pattern of social relations and practices", and a historical approach, which detects "waves of discourses: discourse traditions, genres, styles and transformations of fissures in these waves." This seems like a broader approach than we will want to take for our purposes.
Other than the general mention of cognitive framing, they are not concerned with cognitive models or mental representations of discourse. Their pragmatic approach to discourse analysis recognizes that "every utterance relies on a world of implicit background assumptions".
Some of the phenomena they looked at systematically are:
1. Wording patterns and strategies. "Meaning derives from the grammatical and lexical choice, which language users make from the range of possible choices, in relation to subject matter and context."
2. 'Local' carriers of implicit information. This encompasses implication- and presupposition- carrying constructs.
3. Global meaning constructs. This encompasses the ways in which implicit and explicit meanings are combined, for instance in patterns of argumentation, coherence and recursivity.
4. Interactional patterns. This encompasses "many types of direct and indirect interaction between different points of view."
They seek to use these methods to uncover a "common frame of reference": the general world view the language user assumes to be shared with others in the same community (group), including assumptions about what is acceptable or appropriate social behavior.
In this book their methodology is painted with broad strokes; thus, while similar in general outline to van Dijk's, it is difficult to compare point by point. They note, as van Dijk does, that patterns of argumentation do not vary with ideology; rather, it is the meanings or contents that vary. They cite Albert Hirschman's 1991 book, "The Rhetoric of Reaction", in which his examination of the patterns of argumentation used by 'reactionaries' at three points in history when social and political changes were taking place reveals three types of arguments against the changes: perversity, futility and jeopardy. Blommaert and Verschueren found examples of these three argumentation patterns in the rhetoric of both the 'reactionaries' and the 'progressives' in the migrant debate.
The perversity argument is essentially that the actual effects of the changes, through a chain of unintended consequences, are the opposite of what was intended. Futility: nothing really changes, i.e. the proposed changes will be superficial or cosmetic, hence an illusion, while the deep structures of society remain unchanged. Jeopardy: while the proposed change may be a good idea, the costs or consequences are too great.
Whether these kinds of arguments have sufficiently clear structural patterns to enable automatic recognition is not immediately obvious. But it does seem that taking this kind of analysis into consideration could be useful: if an argumentation pattern can be identified, focusing on its content or semantics might be a key place to look for ideological point of view.
This example from Blommaert and Verschueren shows:
1. That further investigation of the discourse analysis techniques of linguistic pragmaticists is worthwhile to better understand ideological point of view in discourse, and may possibly yield techniques that can be applied automatically.
2. That there may be inherent social value in the analysis of ideological point of view in discourse: improved understanding of implicit ideological contents may aid in understanding the debate of important contemporary social issues, leading perhaps to better informed and more rational participation by citizens in a democracy.
3. That while defining ideology may be difficult and controversial, reasonable working definitions can be developed. It appears that van Dijk has put more emphasis on developing a definition that takes into account cognitive models and is likely to work in a broader, multidisciplinary format.
4. While van Dijk and Blommaert and Verschueren have developed systematic techniques for analyzing ideology in discourse, there is still a lot of work to be done to develop a reasonable machine implementation.
We now consider two more examples of analysis of ideological point of view in discourse. Wang analyzes journalistic coverage of the 1991 Soviet coup attempt; her techniques include some developed by van Dijk. Wortham and Locher analyze 1992 television newscasts by systematizing Bakhtin's concepts of voice and ventriloquation.
Whether or not Wang's methodology turns out to be too genre-specific for our purposes remains to be seen. Her work builds on van Dijk's earlier work on discourse analysis of news and her findings point to the possibility of finding ideological differences in both structure and style of text.
Wortham and Locher
For a different approach to analyzing media bias or point of view, we turn to the work of Wortham and Locher. Their interest is in the devices that a speaker uses to implicitly evaluate others while appearing to speak neutrally about them. They provide analytic methods for studying the implicit moral evaluations and interactional positioning described by Goffman, using Bakhtin's concepts of voice and ventriloquation.
In Bakhtin's theory of the novel, voice is defined as an identifiable social role or position that a character enacts. Voicing is the process of working out speakers' social locations. Ventriloquation occurs when an authorial voice enters and takes a position with respect to a character: the author speaks through the character by aligning himself with, or distancing himself from, the character. Thus the authorial voice implicitly comments on the social world it represents, using the speech of characters to express his or her own social and ethical positions. Wortham and Locher note that "newscasters portray their subjects as people who speak with identifiable voices. And they themselves speak through these voices and evaluate those they cover."
Wortham and Locher's technique for identifying the attribution of social positions to those described in a newscast (voicing) and the evaluation of those described (ventriloquation) is based on the "identification of tokens of certain textual devices that speakers commonly use to voice and evaluate their subjects." The five types of textual devices, suggested by Silverstein, that are used in voicing and ventriloquation are:
Reference and predication
Wortham and Locher analyze three national newscasts from the evening of October 30, 1992: CNN/Telemundo, ABC World News Tonight, and CBS Evening News. On this date, four days before the 1992 U.S. Presidential election, the lead story was the release of notes from a 1986 meeting, taken by then Secretary of Defense Caspar Weinberger, that seemed to contradict President George Bush's repeated statements that, as Vice President, he did not know about the sale of missiles to Iran ahead of time. Wortham and Locher's analysis does uncover implicit messages sent by the newscasters.
They claim that voicing and ventriloquation are unavoidable in reporting (but need not be insidious), that their five devices are useful but not sufficient in determining implicit messages, and that their tools cannot be applied mechanically because the process of orchestrating an implicit message is "poetic", i.e. speakers do not mechanically apply rules to obtain intended outcomes. Thus, while their tools may prove helpful in finding clues to speaker ideology or determining implicit meaning in discourse, it does not appear that they can stand alone.
As with Wang's work, the amount of data analyzed is small and limited to a single genre. While the events and media sources may have been chosen because they are particularly illustrative of what the authors wish to show, there is no clear indication of the extent to which their techniques will generalize, even within their chosen genre. There is also the question of how to automate these techniques and test them on larger data sets.
We now turn to investigate some systems that have been developed over the years that take into account ideology or point of view.
1965-1973 The Ideology Machine, aka The Goldwater Machine by Abelson et al.
Abelson's interest in political psychology led him to try to simulate a True Believer to better understand the phenomenon of 'ideological oversimplification'. This "tendency to caricature and trivialize the motives and character of the enemy and to glorify - but also trivialize - the motives and character of one's own side" involves "interposing oversimplified symbol systems" between oneself and the external world, and tends to exacerbate conflicts, both internationally and intranationally. He has a strong interest in developing a cognitive model of belief systems.
The Ideology Machine is designed to simulate responses to foreign policy questions by a right-wing ideologue, modeled on Barry Goldwater and foreign policy issues of the Cold War. Goldwater was chosen because his belief system was "notably 'closed'" and well understood, thus enabling it to be encoded in the system's memory structure and the analysis of the responses.
The basic memory structure of the machine is a "horizontal" encoding of sentences, each consisting of a concept followed by a predicate, where the predicate is generally a verb followed by a concept. Approximately 500 sentences were encoded. There is also a "vertical" structure to memory consisting of 'instance' and 'quality' relationships. For example, "India" is an instance of "left-leaning neutral nations" and "left-leaning neutral nation" is a quality of "India". Also stored is an 'evaluation' of each element (concept or predicate): a signed quantity summarizing the positive and negative affects attached to the element. 'Generic events' were represented by a verb category between two noun categories and then used as building blocks for 'episodes'. About 24 episodes were placed in memory, some quite intricate, with multiple branches of sequences of potential generic events.
Ideological perspective was encoded as a "masterscript" which guided the processing of political information.
We will not discuss the cognitive model in detail here, other than to note that two important components are the "Credibility Test" and the "Rationalization Attempt". These enable the system to respond to one of six questions, such as: Is a given event credible? When a given event happened, what should a given actor have done? For example, for the first question, the credibility of a given event is assessed by checking whether its generic event type is recognized by the system. If it is recognized, a similar specific past event is retrieved from memory.
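A rough sketch, in modern terms, of the memory structures and Credibility Test just described. All contents are hypothetical stand-ins; Abelson's actual encodings differed.

```python
# Sketch of the Ideology Machine's memory structures (contents hypothetical).

# "Horizontal" encoding: a concept followed by a predicate (verb + concept).
sentences = [("Communists", ("subvert", "free nations"))]

# "Vertical" structure: instance/quality relationships.
instance_of = {"India": "left-leaning neutral nation"}

# Signed 'evaluation' of each element, summarizing attached affect.
evaluation = {"Communists": -3, "free nations": +3, "India": 0}

# Generic events: a verb category between two noun categories,
# used as building blocks for episodes.
generic_events = {("leftists", "attack", "Western figures")}

def credible(generic_event):
    """Credibility Test: an event is credible if its generic
    event type is recognized in memory."""
    return generic_event in generic_events
```

The "egg-throwing" inference quoted below follows from exactly this design: two specific events mapping to the same generic type are treated as interchangeable, with no analysis of their meaning.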
Abelson produced an improved system in 1973, which incorporated planning and Schank's theory of Conceptual Dependency. From today's perspective the search might be considered inefficient, the domain too limited, the generation mechanism primitive, the definition of ideology too vague, and Conceptual Dependency might be questioned as a cognitive model. There was no attempt to analyze the meaning of actions; they were only evaluated according to ideological criteria, so that "Castro would throw eggs at West Berlin" could be inferred from the fact that "leftist students in South America threw eggs at Richard Nixon." Nevertheless, it was an important first step, in that the Ideology Machine was able to demonstrate that some types of ideological behavior could be simulated by a computer program.
1976 Tale-Spin by Meehan
Tale-spin generates stories that describe actions that actors take to create changes in the world that result in the satisfaction of the actor's goal. The actors are people and talking animals, who have goals, environments, and relationships to one another. Principal motivations have to do with physical needs, like hunger. Basically, characters are created who want to achieve simple goals. They create plans to achieve these goals, which can include moving to other physical spaces, manipulating objects, communicating with other characters (honestly and dishonestly) and negotiating with other characters to get something they want. The actions and state changes in Tale-spin are represented in a conceptual dependency framework.
Meehan's intent was to model people engaged in rational behavior. The basic components of the system are:
1. A problem solver: given a goal it produces other goals or subgoals and events. Contains the planner.
2. An assertion mechanism: takes an event and adds it to the world model.
3. An inference maker: given an event, produces a set of consequent events, where one kind of consequence is a goal.
Tale-spin allows for two methods of storytelling: a bottom-up approach where the reader controls the simulator and a top-down approach, where the program uses a predefined set of "morals" to create the story.
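The three components listed above can be caricatured as a toy control loop. The goal names, plans and inference rules below are invented for illustration and are far simpler than Meehan's.

```python
# Toy sketch of Tale-spin's three components (all rules hypothetical).

world = set()  # the world model: a set of currently true facts

def problem_solver(goal):
    """Given a goal, produce subgoals and events (contains the planner)."""
    plans = {"satisfy_hunger": ["go_to_food", "eat_food"]}
    return plans.get(goal, [])

def assertion(event):
    """Assertion mechanism: add an event to the world model."""
    world.add(event)

def inference_maker(event):
    """Given an event, produce consequent events (some are goals)."""
    consequences = {"eat_food": ["hunger_satisfied"]}
    return consequences.get(event, [])

def pursue(goal):
    """Run a goal through the three components, yielding a story trace."""
    story = []
    for event in problem_solver(goal):
        assertion(event)
        story.append(event)
        story.extend(inference_maker(event))
    return story

print(pursue("satisfy_hunger"))  # ['go_to_food', 'eat_food', 'hunger_satisfied']
```

The trace of events produced by this loop is, in essence, the story: characters' goals drive plans, plans produce events, and events produce further consequences.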
Like the Ideology Machine, Tale-spin was an important first prototype. It suffers from problems of scaling up, lack of commonsense background knowledge, and the limitations of the conceptual dependency framework (discussed below). However, Pazzani has used his learning system, Occam, to construct a simulator based on Tale-spin, addressing the issue of obtaining data that is not readily available. Further investigations of his methodology and of the extent to which he succeeded might be worthwhile, since this is a serious problem for many systems.
1979 Politics by Carbonell
Originally an attempt to improve on the Ideology Machine by incorporating conceptual dependency, frames for representing real-world knowledge, situational scripts, planning units and other memory structures, Politics turned into a "general process model of subjective understanding" (Carbonell). Carbonell claims that the model of subjective understanding transcends ideological behavior, incorporating goal trees and counterplanning strategies to understand personality traits, certain aspects of discourse, and human conflict situations.
Carbonell has developed a theory of subjective understanding, which he defines as "the process of applying the beliefs, motivations, and interests of the understander to the task of formulating a full interpretation of an event." Politics simulates the ideologies of a United States conservative and a United States liberal interpreting brief political events by answering questions posed about a given event. It analyzes the events into a conceptual dependency representation, applies situational information in the form of scripts, and applies an inferencing process guided by the ideology. Additional inferences may be performed during the question and answer phase.
Ideologies are modeled by goal trees. Once the goals for a particular ideology are determined, two different trees can be constructed: one with sub-goal links, where each sub-goal helps to achieve a higher level goal, and one with relative-importance links. Four specific criteria guide the development of the goal-tree model for political ideologies:
1. Parsimony: a political ideology should contain only the subjective knowledge required for ideological reasoning.
2. Orthogonality: a political ideology does not necessarily affect other aspects of subjective understanding and may be de-coupled from other ideological beliefs.
3. Compatibility: all aspects of subjective understanding should be represented in the same formalism.
4. Generality: the reasoning process is not domain-dependent and should apply across all ideologies and subjective beliefs.
The goals focus attention on the aspects of the situation or event that most interest the actor or understander. This directs the inferences that are made about the consequences of events. Understanders have different interpretations of events because they focus on how the events affect them personally and not on how they affect others, which gives rise to subjective understanding. Understanders know that other understanders are doing the same thing, as a function of their own goals, which leads to planning and counterplanning. Carbonell defines counterplanning as "a process in which one actor intentionally thwarts another actor's plans or attempts to achieve his own goals by circumventing the counterplanning attempts of the other actor." Counterplanning occurs when goal conflicts or plan interferences arise.
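A goal tree with sub-goal links and relative-importance links might be represented along the following lines; the goals shown are invented examples, not Carbonell's actual encodings.

```python
# Sketch of a Politics-style goal tree (goal contents hypothetical).

class Goal:
    def __init__(self, name, importance=0):
        self.name = name
        self.importance = importance  # relative-importance ordering
        self.subgoals = []            # sub-goal links

    def add_subgoal(self, goal):
        self.subgoals.append(goal)
        return goal

# A toy ideology in which national security dominates.
top = Goal("national_security", importance=10)
top.add_subgoal(Goal("strong_military", importance=8))
top.add_subgoal(Goal("contain_rivals", importance=7))

def most_important(goals):
    """Relative-importance links let the understander focus attention
    on the goals most affected by an event."""
    return max(goals, key=lambda g: g.importance)
```

An understander interpreting an event would traverse such a tree to find the affected goals, and the relative-importance ordering would determine which inferences (and counterplans) are pursued first.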
It seems possible that Carbonell's theory of goals, plans and counterplans might be useful in analyzing argument structure for clues to ideological point of view. On the other hand, this is a very conflict-oriented approach that may not account for more subtle ideological differences, or for cases where, as we saw in Debating Diversity, the surface rhetoric differs greatly but the underlying ideology is basically the same. It also requires the construction of the goal tree in advance, which in turn requires sufficient knowledge to model the goals that follow from an ideology.
As a consequence of some of the limitations of Politics, and because he believes that people use similar decision processes in resolving political, economic, judicial, domestic and social conflicts, Carbonell developed Triad. "Triad is a process model of understanding general conflict situations." Carbonell defines seven 'basic social acts' to model these conflicts.
1987 Pauline by Hovy
Pauline (Planning and Uttering Language In Natural Environments) generates natural language text for news events subject to pragmatic constraints. The constraints are formulated as rhetorical goals. Pauline knows about three events and is able to produce 100 different descriptions of each event.
When the program is activated, values are selected for a set of pragmatic features, based on a list of thirteen goals that characterize the pragmatics of an interaction. These features include the time, tone and conditions of the conversation.
The generator incorporates these goals, interleaving rhetorical planning and realization at choice points. This supports the "standard top-down planning-to-realization approach, as well as a bottom-up approach, in which partially realized syntactic options present themselves as opportunities to the rhetorical criteria, at which point further planning can occur."
We will look at the rhetorical goals of opinion in some detail because they fit in with the four principles of van Dijk's ideological square. Since we are not primarily concerned with planning and generation, we will not discuss them further here. Hovy does use conceptual dependency, but due to the use of pragmatic constraints, he does not rely on it as heavily as Meehan and Carbonell. There are still questions about how well Pauline would handle other kinds of events and more complex events. There is also the issue of the need to hard-code knowledge.
The "Yale School", Schank et al.
All four systems above come out of the "Yale School", centered around Roger Schank. Abelson is a colleague, while Meehan, Carbonell and Hovy were Schank's students. Under their "conceptual dependency" framework, Schank and Abelson (1977) make universalist claims of cognitive plausibility for a specific set of semantic primitives, scripts, and other knowledge-specific constructs. Scripts are stereotypical representations of situations, which provide slots for various events, actions, objects, and relationships. They are a sort of template or schema of a situation, allowing for inferences and thus understanding of discourse.
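The slot-and-scene character of a script can be made concrete with a small sketch. The restaurant script is Schank and Abelson's classic example, but the slot names and the inference function below are simplified inventions, not their representation.

```python
# Sketch of a Schank-and-Abelson-style script as a slot-filling template:
# the stereotyped event sequence licenses inferences about events the
# text never states.

restaurant_script = {
    "roles": ["customer", "waiter", "cook"],
    "props": ["menu", "food", "check"],
    "scenes": ["enter", "order", "eat", "pay", "leave"],
}

def infer_unstated(script, mentioned_scenes):
    """Scenes in the script but absent from the text become default inferences."""
    return [s for s in script["scenes"] if s not in mentioned_scenes]

# "John went to a restaurant and ordered lobster. He left a big tip."
story = ["enter", "order", "leave"]
print(infer_unstated(restaurant_script, story))  # -> ['eat', 'pay']
```

The inferred scenes ("John ate" and "John paid") are exactly the kind of unstated events a script-based understander fills in.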
The theories of conceptual dependency and scripts assume that all memory is episodic and organized in terms of scripts. While Schank and Abelson note that this may be controversial (Schank and Abelson 1977), they state their 'clear preference' for episodic memory and base their work on it. This memory model is at odds with van Dijk's model, discussed above.
While having a small number of universal primitives may be aesthetically and computationally pleasing, finding the correct set may be difficult or impossible. There are also questions as to the cognitive plausibility of reduction to semantic primitives (Mallery 1988). Mallery also notes that "these programs from the 'Yale school' have not been scaled up beyond hand-crafted microworlds nor produced cognitively felicitous representations. One major reason is that since top-down processing requires extensive background knowledge already coded in pre-existing data structures, the range of application is limited by the amount of background knowledge available to a system." This level of domain-specificity would certainly not work for the system we envision.
A serious limitation of conceptual dependency is that, as a "theory of meaning that finds equivalent meanings through reduction to primitives", it assumes literalism in language (Mallery 1988). Since we would most likely want our system to be able to handle non-literal language, such as metaphor, the conceptual dependency paradigm seems an unlikely choice.
1991 Viewgen by Ballim and Wilks
Ballim and Wilks's program Viewgen is a viewpoint generator, where a viewpoint is "some person's belief about a topic". Viewgen generates multiple belief environments from different points of view. It is, in some sense, an agent-modeling tool allowing for the generation of arbitrarily deep nested belief spaces. It uses a default reasoning mechanism that assumes that all agents' beliefs are the same as the system's, unless there is evidence to the contrary. All beliefs are ultimately held by the system, in the sense that what a given agent believes is what the system believes the agent believes.
While Viewgen is an interesting program, it does not seem to suit our need to understand belief and point of view. It may also be more general than we want, since we are interested in ideological points of view that represent the perspective of groups of people and not a particular individual's beliefs. However, there may be some value in further investigating their definitions and mechanisms to see if there are aspects of the generation process that can lead to understanding.
1994 Tracking point of view in narrative by Wiebe
As part of a larger program of studying subjectivity in text, Wiebe developed an algorithm to track characters' psychological point of view in third-person fictional narrative text. Wiebe and Bruce (1995) proposed using the output of the algorithm as a feature in a probabilistic classifier to track point of view.
Wiebe's goal is to segment the text of fictional narratives into "maximal blocks of objective sentences and maximal blocks of subjective sentences that have the same subjective character." Following the work of Banfield, Wiebe defines subjective sentences as those which present private states of characters, as opposed to sentences that "objectively narrate events or describe the fictional world," where private states are states that cannot be objectively observed or verified, such as intellectual, emotive and perceptual states. Wiebe represents private states as PS(p, experiencer, attitude, object), where p is the private state, the experiencer is the person in that state, the attitude is the sort of private state, and the object is the object of the private state. Note that, in a given sentence, some of the components of PS may be implicit. Wiebe further defines the 'subjective character' of a sentence as the character whose point of view is taken in a subjective sentence, and 'subjective elements' as linguistic elements that express attitudes of a subjective character.
In order to make the problem more tractable, Wiebe makes some simplifying assumptions:
1. She only considers text without overt narrators
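Wiebe's PS(p, experiencer, attitude, object) representation transcribes naturally into a small data structure; the sketch below follows her field names directly, with None marking components left implicit in a given sentence. The example sentence and attitude label are invented.

```python
# Wiebe's private-state representation PS(p, experiencer, attitude, object)
# as a dataclass; None marks components that are implicit in the sentence.

from dataclasses import dataclass
from typing import Optional

@dataclass
class PrivateState:
    p: str                      # identifier for the private state itself
    experiencer: Optional[str]  # the character in the state
    attitude: Optional[str]     # the sort of private state (emotive, etc.)
    object: Optional[str]       # what the state is about

# "Sarah dreaded the meeting."
ps = PrivateState(p="ps1", experiencer="Sarah",
                  attitude="dread", object="the meeting")
print(ps.experiencer, "->", ps.attitude)
```

A sentence like "The room felt cold" would leave the experiencer field as None, to be resolved by the tracking algorithm.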
1994 Spin Doctor by Sack
Theory of ideology and point of view
Sack draws a distinction between ideological and psychological point of view: ideological point of view characterizes the political slant of an entire story, while psychological point of view (e.g., as it is used by Wiebe 1994) characterizes the source of a sentence or statement contained within a story. He precisely defines ideology as semiotic closure, building on the work of Greimas and Jameson on semiotic squares. This moves beyond the binary opposition, or schematization of two rational possibilities (e.g. black/white or workers/bourgeoisie), to enable the mapping out of why, and in what ways, two terms can be posited as oppositions or contradictories. Given two strongly opposed positions (e.g. black/white, male/female), these are placed at the top corners of the square; the bottom corners are filled in with the logical negations of these terms, with the negation relation on the diagonals (note that nonwhite is more than black and nonmale is more than female).
Thus the top and bottom edges of the square represent the contrary relation and the vertical sides of the square represent the implication relation. Using the square, one can map out the ideology that circulates around a given issue. For example, on the issue of abortion, the corners of the square might be feminist, Christian fundamentalist, family-values conservative, and liberal humanist. He takes each of these positions (at the corners of the square) to be a point of view.
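The geometry of the square can be sketched as a small function. This captures only the formal structure (contrary top edge, negations on the diagonals, implication on the vertical sides); in Sack's abortion example the four corners are occupied by substantive ideological positions rather than literal "non-" negations, so the labels below are a simplification.

```python
# Sketch of a Greimas-style semiotic square: two opposed terms at the top,
# their negations at the bottom, with relations read off the edges.

def semiotic_square(s1, s2):
    not_s1, not_s2 = "non-" + s1, "non-" + s2
    return {
        "contrary": (s1, s2),                           # top edge
        "subcontrary": (not_s2, not_s1),                # bottom edge
        "contradiction": [(s1, not_s1), (s2, not_s2)],  # diagonals (negation)
        "implication": [(not_s2, s1), (not_s1, s2)],    # vertical sides
    }

square = semiotic_square("feminist", "fundamentalist")
print(square["contradiction"])
```

Each corner of the resulting square is a point of view in Sack's sense; "non-feminist", for instance, covers more ground than "fundamentalist", just as nonwhite is more than black.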
Sack applies and extends these ideas in the realm of news texts. His extension of the semiotic square is actor-role analysis, where he is concerned with the identification of actor and thematic role combinations. He distinguishes between actantial and thematic roles: actantial roles, like heroes and villains, exist on a narrative level and can only be identified by examining how an actor interacts with the other actors in a given narrative; in contrast, thematic roles, e.g. a fisherman, exist on the discourse level and are part of larger discourses, which necessarily connect together many stories. For example, fishermen are associated with a set of attributes, which are carried over from story to story. In general a particular actor will be assigned one or more thematic and one or more actantial roles in a particular narrative. For example, on the abortion issue actor-role analysis might yield:
feminist point of view
Ultimately this shows that the source of the text, and how people are represented in the text, are crucial to determining point of view. To create a real system, machine learning would need to be used, since one of the weaknesses of this system is the extent to which it is hard-coded for a very specific data set. Note that the determinations of points of view and the manual coding were done by one person, so there was no study of coder agreement or reliability. However, unlike the systems discussed above, a quantitative evaluation was performed.
Sack's definition of ideology is somewhat restricted by the political framework in which it is cast, but it seems as though it could be applied more broadly. His conception of an ideological square is framed in terms of groups and the ideologies they hold, in contrast to van Dijk's, which is framed in terms of principles. These ideas complement, rather than contradict, each other, and both should probably be included in the definitional and analytical framework of ideology.
The main disadvantages of Spin Doctor are the hard coding of ideology and the domain specificity that follows from it. If the actors and roles could be learned, it might well generalize to, at least, the news genre. Whether it would generalize to such genres as medicine on the internet is less clear. One might hypothesize that the actors might be treatments rather than people, and that the role a treatment plays would be either good or bad depending on the ideology of the author.
2000 Terminal Time by Mateas et al.
Terminal Time was conceived as a work of art. It is an interactive performance piece that constructs an ideologically-biased history of the world from 1000 A.D. to the present, in PBS documentary style, based on audience response. Its presentation is multimedia, combining video and narration to give a low-budget, 1980s television documentary feel. At about the points where commercials would appear in a half-hour television program, it presents multiple-choice questions to the audience and, based on an applause meter, determines the ideological slant of the next segment of the documentary. Eleven major ideologies are currently represented in the system, mostly centering on race, class, gender, technology, and religion.
The basic architecture consists of a knowledge base, ideological goal trees, a rule-based natural language generator, rhetorical devices, and a database of indexed audio/visual elements (including short digital movies and sound files containing music).
The knowledge base combines higher order predicate statements about historical events, definitions of ontological entities used in historical event descriptions and inference rules. They used the upper Cyc ontology as a basis for their ontology. The inference engine is based on "higher-order hereditary Harrop logic" which allows the knowledge base entries to consist of Horn clauses and the queries to consist of standard Prolog-like goals and embedded implications. It is implemented in Common Lisp and makes use of its extra-logical support functions.
Terminal Time organizes ideological bias with goal trees, which were adapted from Carbonell's Politics, to represent the goals of an ideological story-teller. The rhetorical goals of a story-teller are to show that something is the case by constructing an argument using the events available in the database. For example:
A word about systems
We have seen several systems that compute point of view of various types, by various means, and to various ends. Some issues begin to emerge:
1. In AI there is always a tension between work that aims for psychological or linguistic plausibility and work that is more motivated by engineering concerns. The systems we have seen here tend to be motivated primarily by psychological or linguistic plausibility. This creates some difficulties because there is still so much unknown about psychology and how the mind works. For example, to what extent do scripts, schemas, and semantic primitives reflect psychological reality, and if they don't, should we use them anyway when they address engineering concerns?
2. With the exception of Wiebe, all of the systems require a great deal of hard-coded knowledge, which can dramatically limit scalability and portability. Terminal Time does make use of Cyc, and it would be worth looking at the currently available knowledge resources to see if there are tools which could be of use. More likely, this points to the need to incorporate machine learning into systems.
3. While most of the definitions of ideology and point of view get at approximately the same ideas, there is a need for greater specificity in order to translate ideology into something a machine might recognize.
4. The early systems primarily came out of the "Yale School" and are much concerned with planning and generation, so we might ask, how much, if any of those techniques might be useful in natural language understanding?
5. Van Dijk's definition of ideology rests on the concept of a social group. We should consider the possibility of exploiting the structures of the WWW and Usenet to see if they can be reconciled with his concept of a social group.
6. With the exception of Wiebe and Sack, very little has been done to evaluate the performance of these systems objectively. In fact, the knowledge that is provided for systems about ideology may be stereotyped and reflect the ideological biases of the systems' creators. Evaluation of the system we propose will most likely require testing against human annotations. These annotations would need to be performed by annotators with reasonable intercoder reliability on this task.
We have seen several definitions of ideological point of view. It appears that van Dijk's definition is a good place to start for our system, so we will adopt his definition, concept of group, and ideological square. We also want to incorporate Sack's ideological square and actor-role analysis. Since ideological point of view is inherently subjective, we will be working under the overall framework of subjectivity as defined by Wiebe.
Wiebe (Wiebe et al. 2001) defines subjectivity as "aspects of language used to express opinions and evaluations," where evaluation includes emotions, evaluations, judgments, and opinions. Subjectivity also includes speculation: "anything that removes the presupposition of events occurring or states holding, such as speculation and uncertainty."
Of the systems we have surveyed, Wiebe's and Sack's come the closest to the task we propose. Both allow possibilities for extension or modification: Wiebe's algorithm would need to be modified to accommodate ideological, rather than psychological, point of view, and the probabilistic model proposed by Wiebe and Bruce could incorporate features for ideological point of view. Incorporating machine learning techniques into Sack's system might overcome the need for hand-coded knowledge and make it less domain- and genre-dependent. The other systems are of primary interest for better understanding the problem and for developing a set of features that could be used in our system. We also want to bear in mind the importance of incorporating pragmatics into our system, as Hovy has done.
In the next section we begin to consider the tools that will be necessary.
References for ideology section:
Robert P. Abelson and J. Douglas Carroll. Computer Simulation of Individual Belief Systems. The
Now that we have a working definition for and some understanding of ideology, let's look a little more specifically at what a system might involve to enable us to focus on tools that might be of use.
One approach would be to build the system on top of an existing search engine that would collect a set of web pages or Usenet newsgroup messages by topic. There is some risk that the collection would be too small or too large, so it might be necessary to expand or contract the query. The process of collecting the documents by topic is essentially an information retrieval problem. We will assume for simplicity that we want full documents, rather than sections of documents. We will consider Kleinberg's hubs and authorities as a possible way to narrow down a collection of documents. We would also like to explore the possibilities of exploiting the topology of the web or Usenet to find groups of documents sharing the same ideology.
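Kleinberg's hubs-and-authorities computation can be sketched compactly. The iteration below is the standard mutual-reinforcement scheme over a toy link graph; in our setting the graph would come from hyperlinks among the retrieved pages, and the page names here are invented.

```python
# Sketch of Kleinberg's HITS iteration: a page's authority score sums the
# hub scores of pages pointing to it; its hub score sums the authority
# scores of pages it points to; both are renormalized each round.

def hits(links, iterations=50):
    """links: dict mapping page -> list of pages it points to."""
    pages = set(links) | {p for tgts in links.values() for p in tgts}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in links if p in links[q]) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5
        auth = {p: v / norm for p, v in auth.items()}
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

links = {"a": ["c", "d"], "b": ["c", "d"], "c": ["d"], "d": []}
hub, auth = hits(links)
best_authority = max(auth, key=auth.get)
print(best_authority)  # -> d
```

Pages with high authority scores would be candidates for keeping in a narrowed collection; high-hub pages (here "a" and "b") are the survey-like pages that point at them.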
Given an appropriate collection of documents on a certain topic, we would now like to segment them by ideological point of view. We now have a problem that could be viewed as either text classification or clustering of documents by similarity. Some issues arise immediately:
1. The problem falls into the broad category of natural language or text understanding. This leads to the question of whether or not we need to understand the text, and if so, to what extent? If we choose to aim for psychological plausibility, then we do want to understand the text. On the other hand, we may find statistical techniques that work with little or no understanding. In between these extremes, we may want some level of understanding of discourse or argument structure. We have seen from van Dijk and from Blommaert and Verschueren that discourse structure alone is unlikely to be sufficient, but it could be used in combination with lexical clues or as feature input to a classifier. Hence, we will explore the work of computational linguists on the structure of discourse.
2. The desire for domain independence comes at the cost of not knowing the ideological points of view in advance. This means that, assuming ideological points of view are discrete classes, we must learn the classes from the set of documents if we want to treat it as a classification problem. Otherwise, we may want to find techniques to cluster the documents by similarity, where the similarity metric heavily weights features of ideological point of view.
3. Machine learning may be either supervised or unsupervised. If we elect to use supervised learning, we will need to do some level of human annotation. Human annotation is costly and requires the development of annotation instructions that are sufficiently detailed that reasonable agreement between annotators can be achieved. Recently developed techniques, such as co-training, can help limit the amount of annotated data that is required. It is sometimes possible to modify systems, such as Riloff's AutoSlog pattern extractor, to work with unannotated text. We will not discuss the process of annotation or evaluation further here, but it is an important issue to note.
4. We would need an appropriate user interface to present the segmented collection of documents. Two possibilities immediately come to mind. The first is a fisheye viewer with a hyperbolic distance metric on the graph, where nodes represent documents and edges represent the connections between them, the shorter the distance the greater the similarity between documents; the view would initially be centered on the most ideologically balanced or objective document in the collection, allowing the user to see the relationships between documents and retrieve a document by clicking on its node. The second is a similar graph with a Euclidean metric that shows how the documents cluster in the plane. We will not discuss the details of how to implement the user interface here.
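The weighted-similarity idea in point 2 can be sketched as a cosine measure in which features thought to signal ideological point of view are up-weighted. The feature names, counts, and weights below are invented for illustration; real features would come from the analysis in the earlier sections (e.g. van Dijk's us-positive/them-negative pattern).

```python
# Sketch of a weighted cosine similarity over sparse feature vectors,
# letting hypothesized ideological-cue features dominate the metric.

import math

def weighted_cosine(a, b, w):
    keys = set(a) | set(b)
    dot = sum(w.get(k, 1.0) * a.get(k, 0) * b.get(k, 0) for k in keys)
    na = math.sqrt(sum(w.get(k, 1.0) * a.get(k, 0) ** 2 for k in keys))
    nb = math.sqrt(sum(w.get(k, 1.0) * b.get(k, 0) ** 2 for k in keys))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical cue features, heavily weighted.
weights = {"us_positive_terms": 5.0, "them_negative_terms": 5.0}

doc1 = {"us_positive_terms": 3, "them_negative_terms": 2, "length": 10}
doc2 = {"us_positive_terms": 2, "them_negative_terms": 3, "length": 40}
doc3 = {"us_positive_terms": 0, "them_negative_terms": 0, "length": 12}

sim12 = weighted_cosine(doc1, doc2, weights)
sim13 = weighted_cosine(doc1, doc3, weights)
print(sim12 > sim13)  # doc1 is closer to the ideologically similar doc2
```

A clustering step would then group documents whose pairwise weighted similarity exceeds some threshold.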
Finally, any system developed will need to be evaluated. We will see that in general the work we discuss in the areas of statistical natural language processing and machine learning will include rigorous evaluation. In contrast, like the systems discussed in the ideology section, the work on discourse structure and web structure will generally provide minimal evaluation, if any. Some issues we anticipate will arise in the evaluation of our system:
a. Since we plan to use a search engine to retrieve topically segmented collections of documents, the performance of the system may depend on the search engine used. We will most likely need to evaluate our system on different search engines and possibly perform some evaluation of available search engines.
b. Annotation of data, if required, can be evaluated through Cohen's Kappa coefficient of annotator agreement.
c. User interfaces can be evaluated through user studies.
d. Since classification of ideological point of view is inherently subjective, the output of the system will be difficult to judge and may be open to dispute. One option might be to give the system a relatively small collection of documents to classify, and to give the same collection, along with instructions, to several humans, to determine whether the machine agrees with the humans as well as they agree with each other. Another option, since the system is being developed to aid users, would be to conduct user studies of the system's usefulness.
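Cohen's kappa, mentioned in point b, is straightforward to compute for two annotators: observed agreement is corrected for the agreement expected by chance given each annotator's label distribution. The labels below are invented example annotations.

```python
# Cohen's kappa for two annotators labeling the same items:
# kappa = (observed agreement - expected agreement) / (1 - expected agreement)

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    labels = set(labels_a) | set(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["lib", "lib", "con", "con", "lib", "con", "lib", "con"]
b = ["lib", "lib", "con", "lib", "lib", "con", "lib", "con"]
print(round(cohens_kappa(a, b), 3))  # -> 0.75
```

Here the annotators agree on 7 of 8 items (0.875 observed), chance agreement is 0.5, giving kappa 0.75; conventions vary, but values above roughly 0.8 are usually taken as good reliability.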
We have seen in the section on ideology that there are computationally workable definitions of ideology and that some systems have been designed to generate point of view. None of the systems we examined does quite what we want to do, so while we might borrow some of their techniques, we need to explore other possibilities.
We noted the tension between developing a system that is psychologically and linguistically plausible and one that is more motivated by engineering concerns. We will see this play out again here. If we are not concerned about psychological and linguistic plausibility, then we might ask whether we need to understand discourse at all. Perhaps purely statistical techniques would suffice. We will look at some of these possibilities in the section on Statistical NLP. In the meantime, let us assume that understanding discourse structure would be of use in our task.
Some issues we might want to consider:
Are we concerned with the ideology as perceived by the reader, or as presented by the author, or both?
Do we need to consider the psycholinguistic aspects of a given text?
Do we need to understand the structure of arguments being presented?
Is there stilted or inflammatory language present and does this help us?
Can we view text as primarily for information transmission, or have we oversimplified by omitting the social, emotional, persuasive, and entertainment aspects?
In order to answer some of these questions, we will consider some important work from computational linguistics and psycholinguistics on discourse structure. We will look at how these techniques have been applied and with what degree of success, along with their applicability to our task.
Much of the work done by computational linguists has been application-oriented, so it makes sense to classify the work in terms of the types of applications that are expected to come out of it: the two main areas of application are language generation and interpretation. Here we will focus on how discourse structure theory might be used in the interpretation of text, because that seems the most suited to our task.
We will look at four important papers on the theory of discourse structure (Hobbs 1979, Grosz & Sidner 1986, Mann & Thompson 1988, Morris & Hirst 1991), and subsequent papers which discuss, extend and reconcile these theories. There has been significant debate in the computational discourse community between proponents of these theories, in particular Rhetorical Structure Theory (Mann & Thompson 1988) and the intention-based theory of Grosz & Sidner (1986). Much of this theoretical work was done at a time when resources (large corpora, thesauri, etc.) were not available to adequately test these theoretical frameworks. As these resources became available and implementation and testing became possible, the trend has been toward a synthesis of these theories (Moser & Moore 1996, Marcu 2000).
We will look at each of these theories in turn, considering their usefulness for text interpretation, shortcomings and areas where additional research is needed.
Hobbs (Hobbs 1978) builds on the work of Grimes, Halliday & Hasan, Longacre, and Fillmore on relations that link segments of discourse. He claims that "a relatively small number of coherence relations occur in coherent English discourse", where the degree of coherence of a text "varies inversely with the degree of 'difficulty' the inferencing operations have in recognizing some coherence relation." He believes that a theory of coherence must be able to explain the function of each coherence relation; be able to derive Halliday & Hasan's cohesive relations; and must have relations that are computable.
Hobbs defines a finite set of coherence relations, in the framework of an inference component of a language processor, that hold between portions of a discourse. His relations operate recursively on 'sentential units', where clauses are the base. Clauses are represented as sets of propositions.
The inference component consists of four aspects: representation, operations, control, and data. His representation scheme is a sort of predicate calculus. Propositions are formed by applying a predicate to one or more entities, or other propositions. Clauses in the text are operated on successively and propositions are asserted. The system also has a large number of axioms that encode lexical and world knowledge. Axioms are assumed to have plausibility and general applicability. Operations, working in parallel, include word sense disambiguation, resolving anaphora, determining illocutionary force, and recognizing coherence. The operations attempt to construct 'chains of inference'. These chains are searched for based on salience and chain length.
In this paper, Hobbs considers three coherence relations: Elaboration, Parallel, and Contrast. A more extensive list of relations is defined elsewhere. Since the coherence relations are defined in terms of the inferences that a reader makes, using world knowledge, to recognize them, understanding the discourse structure is essentially equivalent to finding the best proof explaining the information in a segment of discourse.
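Hobbs's chains of inference can be caricatured as path-finding over axioms. The sketch below treats each axiom as a directed edge between propositions and searches for the shortest chain linking the propositions of two clauses, in the spirit of his inverse relation between chain difficulty and coherence; the axioms shown are invented stand-ins for real world knowledge, not his representation.

```python
# Toy rendering of Hobbs-style inference chains: axioms as directed edges,
# coherence recognition as finding a short chain between propositions.

from collections import deque

def shortest_chain(axioms, start, goal):
    """Breadth-first search for the shortest inference chain; per Hobbs,
    the longer (harder) the chain, the less coherent the connection."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in axioms.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain found: incoherent under these axioms

axioms = {
    "rain(x)": ["wet(ground)"],
    "wet(ground)": ["slippery(ground)"],
    "slippery(ground)": ["fall(person)"],
}
chain = shortest_chain(axioms, "rain(x)", "fall(person)")
print(len(chain) - 1)  # -> 3 inference steps
```

The obvious costs visible even in this toy version (hand-coding the axioms, and search blow-up as the axiom set grows) are exactly the weaknesses discussed next.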
The weaknesses of Hobbs's work are the cost of encoding the axioms (world and lexical knowledge) and the computational cost of his inferencing process. There is also the question of whether a (small) finite number of coherence relations can suffice to capture discourse structure.
MANN AND THOMPSON
Mann and Thompson's Rhetorical Structure Theory (RST) (1988) has found wide use in the area of text generation, for example, in generating text summaries. They claim that approximately twenty-three rhetorical relations are necessary to account for discourse coherence. The relations link different portions of text, called "spans", which can range in size from clauses to paragraphs. Adjacent spans are related by exactly one of the possible rhetorical relations, forming new spans that are subsequently related to their neighboring spans until all spans are connected. In this way a hierarchy, or tree structure, is formed.
The relations listed are: circumstance, solutionhood, elaboration, background; enablement, motivation; evidence, justify; volitional cause, non-volitional cause, volitional result, non-volitional result, purpose; antithesis, concession; condition, otherwise; interpretation, evaluation; restatement, summary; sequence, contrast. Note that several of these relations map to van Dijk's categories of ideological analysis.
They define five schemas: circumstance, contrast, joint, motivation/enablement, and sequence. The schemas are defined in terms of relations; they specify how spans of text can co-occur and define the structural constituency arrangements of the text.
Rhetorical relations constrain the components of the span and the intended effects of the span. Component spans are either nuclei or satellites, the nucleus being the more important span: "more essential to the writer's purpose than the other" (Mann and Thompson 1988), and the satellite the less important. All relations contain an "effect" field which describes the intended effect of the text on the reader. So, as with Hobbs, the focus is on the reader, rather than on the intentions of the writer. However, a weakness of RST is that the model of the effects that each span has on the reader's mental state is imprecise.
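An RST analysis is naturally a tree of spans, with each relation joining a nucleus and a satellite into a larger span. The sketch below covers only mononuclear relations (multinuclear ones like sequence or contrast would need a list of nuclei); the example sentences are invented, while the relation name comes from Mann and Thompson's list.

```python
# Sketch of an RST analysis as a tree: relations combine a nucleus (the
# more essential span) and a satellite into larger spans, up to the root.

class Span:
    def __init__(self, text):
        self.text = text

class Relation:
    def __init__(self, name, nucleus, satellite):
        self.name = name            # e.g. "evidence", "concession", ...
        self.nucleus = nucleus      # more essential to the writer's purpose
        self.satellite = satellite  # supporting span

    @property
    def text(self):
        # Surface order of nucleus and satellite varies in real text.
        return self.nucleus.text + " " + self.satellite.text

claim = Span("The new policy failed.")
support = Span("Unemployment rose for six straight months.")
analysis = Relation("evidence", nucleus=claim, satellite=support)
print(analysis.name, "|", analysis.nucleus.text)
```

Because `Relation` objects can serve as the nucleus or satellite of a higher relation, nesting them yields exactly the hierarchy RST posits.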
Moore and Pollack (1992) argue that RST does not take proper account of the distinction between informational and intentional relations, and that the restriction that only a single relation can hold between pairs of adjacent spans is incorrect, because discourse elements are related simultaneously on multiple levels (Grosz and Sidner 1986).
Again we note the question of whether or not a small set of relations suffices to capture discourse structure. Another issue for RST is that, while noting that there may be many RST analyses consistent with a given text, Mann and Thompson do not offer any methods to deal with this ambiguity.
GROSZ AND SIDNER
Grosz and Sidner (1986) view discourse structure as three interrelated components: a linguistic structure, an intentional structure, and an attentional state. The linguistic structure consists of discourse segments and an embedding relationship that can hold between them. The intentional structure consists of discourse segment purposes (DSPs) and discourse purposes (DPs). The DSPs are related to each other by one of two relations: dominance and satisfaction-precedence.
A discourse segment purpose, DSP1, satisfaction-precedes DSP2, whenever DSP1 must be satisfied before DSP2.
An action that satisfies one intention, DSP1, may be intended to provide the satisfaction of another, DSP2. When this occurs DSP2 is said to dominate DSP1.
The attentional state distinguishes the most salient information from other less salient information to aid in the interpretation of subsequent discourse segments. It can be viewed as an abstraction of the discourse participants' focus of attention and is modeled by a stack of focus spaces, each holding the most salient information from a given discourse segment. The transition rules that add to or delete from the stack correspond to the dominance relation from the intentional structure.
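The stack-of-focus-spaces model can be sketched directly. The push/pop discipline below follows the description above (entering a dominated segment pushes a space; satisfying its purpose pops it); the purposes and entities in the example are invented, loosely echoing Grosz and Sidner's task-dialogue examples.

```python
# Sketch of Grosz and Sidner's attentional state: a stack of focus spaces,
# each holding the salient entities of one discourse segment. Entities in
# more recently pushed spaces are more salient.

class AttentionalState:
    def __init__(self):
        self.stack = []  # each focus space: (segment purpose, entity set)

    def push(self, purpose):
        self.stack.append((purpose, set()))

    def note(self, entity):
        self.stack[-1][1].add(entity)

    def pop(self):
        # The segment's purpose has been satisfied; its space is discarded.
        self.stack.pop()

    def salient(self):
        """Entities ordered from most to least salient focus space."""
        out = []
        for _, entities in reversed(self.stack):
            out.extend(sorted(entities))
        return out

att = AttentionalState()
att.push("DP: explain the repair")   # overall discourse purpose
att.note("pump")
att.push("DSP1: identify the tool")  # segment purpose dominated by the DP
att.note("wrench")
print(att.salient())                 # -> ['wrench', 'pump']
att.pop()                            # DSP1 satisfied; wrench loses salience
print(att.salient())                 # -> ['pump']
```

The stack makes concrete why, after a segment closes, its entities are no longer preferred referents for pronouns in subsequent discourse.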
On the surface it appears that the discourse structure theories of Grosz and Sidner (1986) and Mann and Thompson (1988) are quite different, and in fact there has been a decade-long debate in the computational linguistics community between their respective proponents, but recent work has been toward a synthesis. Moser and Moore (1996) have found considerable common ground between the two theories, based primarily on understanding the correspondence between the notions of dominance in Grosz and Sidner and nuclearity in RST. Building on this work, Marcu (2000) has extended his formalization of RST (Marcu 1996) to incorporate the intentional structure of Grosz and Sidner in order to reduce the ambiguity of discourse. The attentional structure has not yet been incorporated.
MORRIS AND HIRST
Based on the work of Halliday and Hasan on textual cohesion, Morris and Hirst describe an algorithm for computing lexical chains using lexical cohesion. They define lexical cohesion as "cohesion that arises from the semantic relationships between words." It basically involves the selection of a lexical item that is related, in some way, to one occurring earlier in the text. Lexical cohesion is one type of cohesion considered by Halliday and Hasan; others include reference, substitution, ellipsis, and conjunction. Morris and Hirst distinguish between the independent concepts of cohesion, that the text sticks together, and coherence, that the text makes sense. Cohesion is much more easily determined than coherence. Lexical chains are sequences of related words spanning a topical unit in the text and are good indicators of linguistic segmentation. They can be used to identify central themes in a document, which can be helpful in identifying key phrases for document summarization.
The proposed algorithm involves finding candidate words (removing words on a stop list, such as pronouns and high-frequency words), then, for each candidate word, finding an appropriate chain, within a suitable span, based on its relatedness to members of existing chains. If an appropriate chain is found, the word is inserted and the chain updated; otherwise a new chain is created. The process then moves on to the next candidate word. Morris and Hirst used Roget's Thesaurus and distance criteria to determine relatedness. More recent versions have tended to use WordNet instead of Roget's (Anderson 2000, Barzilay and Elhadad 1997), which reduces the problem of having different senses of the same word appear in a chain.
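The chaining loop can be sketched as follows. The stop list, thesaurus groups, and distance threshold below are toy stand-ins for the Roget's-based relatedness criteria (or WordNet, in later versions); a real implementation would replace `related()` with a thesaurus lookup.

```python
# A minimal sketch of the Morris and Hirst chaining loop.
# The stop list and thesaurus groups here are invented for illustration.
STOP_WORDS = {"the", "a", "of", "it", "they"}
THESAURUS_GROUPS = [
    {"car", "automobile", "engine", "wheel"},
    {"bank", "money", "loan", "interest"},
]
MAX_DISTANCE = 3  # a candidate must occur within this many candidates of a chain's last word

def related(w1, w2):
    """Toy relatedness test: two words are related if some thesaurus group contains both."""
    return any(w1 in g and w2 in g for g in THESAURUS_GROUPS)

def build_chains(words):
    chains = []  # each chain is a list of (position, word) pairs
    pos = 0
    for w in words:
        if w in STOP_WORDS:
            continue
        pos += 1
        for chain in chains:
            last_pos, _ = chain[-1]
            if pos - last_pos <= MAX_DISTANCE and any(related(w, cw) for _, cw in chain):
                chain.append((pos, w))
                break
        else:
            chains.append([(pos, w)])  # no suitable chain found: start a new one
    return chains

text = "the car has a big engine and the bank gave a loan".split()
for c in build_chains(text):
    print([w for _, w in c])
```

Words with no related predecessor simply start singleton chains, which a strength measure (see below) would later discount.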
The strength of a lexical chain can be determined by considering the distribution of the elements in the chain within the text. Three factors to consider are reiteration, density and length of the chain. Since the lexical chain encapsulates context, chain strength corresponds to the significance of the textual context it embodies.
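One plausible way to combine the three factors into a numeric score is sketched below; the particular combination is invented for illustration, and a chain is represented as a list of (position, word) pairs.

```python
# An illustrative chain-strength score combining reiteration, density, and length.
# The way the factors are combined here is invented, not from Morris and Hirst.
def chain_strength(chain):
    positions = [pos for pos, _ in chain]
    words = [word for _, word in chain]
    reiteration = len(words) - len(set(words))  # repeated chain members
    length = len(words)                         # total chain members
    span = max(positions) - min(positions) + 1  # stretch of text the chain covers
    density = length / span                     # members per unit of text
    return reiteration + length * density

print(chain_strength([(1, "car"), (4, "engine"), (6, "car")]))
```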
Morris and Hirst compared the lexical chain structure of a text with the intentional structure computed using Grosz and Sidner's structural analysis method and found that lexical chains were a good clue for determining intentional structure. This is useful because Grosz and Sidner did not provide a method for computing the intentions or linguistic segments in their proposed structure.
The approach is relatively domain independent and computationally feasible. While Morris and Hirst did not implement their lexical chaining algorithm, because there was no machine-readable version of Roget's Thesaurus at the time, it has since been used as a basis for computational linguistics applications, including text categorization (Anderson 2000) and summarization (Barzilay and Elhadad 1997).
Summary of computational linguistics approaches
These four papers in computational linguistics provide a survey of some of the important theories of discourse structure. These theories were developed, at least to some extent, with the idea of using them for applications in mind, rather than in the interest of modeling brain function. Below we discuss an example of a psycholinguistic theory of discourse processing that comes out of the interest in modeling cognition and which has been, at least partially, implemented.
The first three computational linguistics theories all rely to some extent on intentional structure and take into account the intended effect on the reader. This still leaves open the question of to what extent the intent of the writer could or should be factored in.
These theories were developed at a time when many fewer machine-readable resources (e.g., large corpora, thesauri, WordNet) were available, and computational power and memory capacity were limited. Because of this, implementation and evaluation tended to be limited and manual. More recently, lexical chaining and RST have been implemented, problems discovered, and improvements made.
The primary question of interest to us is which of these might be useful in determining ideological point of view for our system. It seems clear that finding the rhetorical structure of a text or the important topics in the text would not on their own determine ideological point of view. Thus, we view discourse processing as a component of a larger system that would incorporate additional knowledge, such as lexical semantics and subjectivity analysis.
The first question is: do we need it at all? Perhaps there is a statistical approach, such as Latent Semantic Analysis, that will segment by ideological point of view, and there will be no need to understand the discourse structure at all. On the other hand, having noted some level of correspondence between van Dijk's ideological categories and relations in RST, it seems reasonable that discourse structure would be a helpful component of a system that determines ideological point of view. So let us consider which of these theories we might use.
Clearly, we will not be able to settle the debate as to whether a finite number of discourse relations (Hobbs, Mann & Thompson) can be used to determine discourse structure and, if so, what exactly they are. One might imagine that one could do better at finding relations that define the discourse structure by limiting the domain or genre of the discourse considered, but this is not what we want to do. On the other hand, since our purpose is to find ideological point of view using discourse structure as one component of a larger system, we may not need to fully represent the discourse structure and may, in fact, need to consider only certain relations.
Besides the issue of a small finite number of coherence relations, the amount of world knowledge and preprocessing necessary for Hobbs's system seems prohibitive, as does the computational complexity of his inference engine. His system would likely do more than we need at a cost we cannot afford.
Since Grosz and Sidner's work has several parallels in RST (Moser and Moore), its intentional structure has been incorporated into recent versions of RST (Marcu 2000), and RST has been more widely implemented, it makes sense to consider RST over Grosz and Sidner's theory. Given our perspective of a discourse structure component in a larger system, the theoretical shortcomings of RST (Moore and Pollack) are unlikely to be an issue for us. The main shortcomings of RST from our perspective are the need to consider aligning the rhetorical relations with more suitable relations for determining ideological point of view, such as van Dijk's ideological categories, and the cost of developing an annotated corpus of text, parsed for rhetorical structure. Note that a parsed corpus would need to be sufficiently large to allow the cross-domain portability we desire.
Given the cost of implementing an RST component to our system, it seems like it would be reasonable to try other methods first. One of the methods that should be considered is lexical chaining. It is possible that lexical chains, by identifying central themes and important phrases in a document, might provide significant information about ideological point of view. Given lexical chains, our system might look for potential subjective elements, ideological clue words, actors and roles, and other clues to ideological perspective in the chains.
So we conclude that for our purposes in choosing between discourse structure theories, we should first explore the possibilities of using lexical chains which can be easily implemented. If it turns out that more information about the discourse structure is necessary for our task, we would next look at existing RST implementations and if they prove insufficient, we would consider modifications to RST to improve alignment with ideological components and domain independence.
We will discuss alternatives to lexical chaining, such as TextTiling (Hearst 1994), when we discuss statistical NLP. We also note two discourse structure issues particular to the WWW and Usenet:
1. We have not explored how the structure of HTML documents on the web might be exploited for our task.
2. It is the author's personal observation, from annotation studies undertaken on a Usenet corpus, that levels of cohesion and coherence tend to be lower in Usenet postings than is common in written English text. This observation, if it turns out to hold true in general, may necessitate modification of the techniques discussed above.
Finally, we will consider Kintsch's (1994) Construction-Integration Model as an example of a general model of discourse comprehension, because his work is based in part on work by van Dijk, because it provides some insights into the area of discourse processing that aims to model cognition, and because it has been at least partially implemented. There are other competing models, but, unfortunately, to date there does not appear to have been a systematic comparison of the models. There seems to be a reasonable amount of empirical evidence supporting this model, but some aspects, in particular the bottom-up processing during the construction phase, are controversial (Whitney 1998). It incorporates the idea, first proposed by van Dijk and Kintsch (1983), that we form multiple memories for discourse.
A summary of the construction-integration model as explained by Kintsch (1994), describes the sequence of cognitive states, the mental representation of texts, the processing cycles, knowledge elaboration, macroprocesses, and inferences. We will consider each one in turn:
1. The sequence of cognitive states in Text Comprehension
The next step is the integration process, which can be viewed as spreading activation in the network until it reaches a stable state. Here it is necessary to use the matrix representation of the network. A vector, called the activity vector, is initialized with equal activation values for all elements. The activity vector is multiplied by the matrix and renormalized repeatedly, until the activity values stabilize. This has the effect of strengthening strongly interconnected parts of the network, while isolated parts become deactivated. The result is a coherent mental representation of the text.
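The integration step amounts to a power-iteration computation. A minimal sketch, with an invented four-node connectivity matrix in which nodes 0-2 are interconnected propositions and node 3 is isolated (only a weak self-link):

```python
# A sketch of the integration phase: repeatedly multiply the activity vector
# by the connectivity matrix and renormalize until the values stabilize.
# The matrix below is invented for illustration.
def integrate(matrix, tol=1e-6, max_iters=1000):
    n = len(matrix)
    activity = [1.0 / n] * n  # equal initial activation for all elements
    for _ in range(max_iters):
        new = [sum(matrix[i][j] * activity[j] for j in range(n)) for i in range(n)]
        total = sum(new)
        new = [x / total for x in new]  # renormalize
        if max(abs(a - b) for a, b in zip(new, activity)) < tol:
            return new
        activity = new
    return activity

links = [
    [1.0, 0.9, 0.8, 0.0],
    [0.9, 1.0, 0.7, 0.0],
    [0.8, 0.7, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.1],  # isolated node: loses activation during integration
]
print(integrate(links))
```

The strongly interconnected nodes end up with nearly all the activation, while the isolated node's activation decays toward zero, exactly the deactivation behavior described above.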
This procedure is in sharp contrast to schema theories which provide a control structure to ensure context sensitivity in the construction phase, thereby eliminating the need for the integration phase. This is at the cost of a much more complex construction process.
3. Processing Cycles
While these implementations have been used successfully to test theories of retrieval from long-term memory, and various theories of mental representation and organization within a construction-integration framework, they do not fully implement Kintsch's model. Notably lacking is an implementation of the situation model. In addition, the input data structures require preprocessing, which may involve the use of different discourse analyses, based on different theories of discourse processing. From the computational linguist's point of view it does not solve any real-world problems. One might also ask to what extent syntactic processing can, or should, be integrated into this approach.
While it seems unlikely that this is a model we would want to use for our system, it does raise some interesting issues:
1. Kintsch provides us with a detailed model of how discourse is processed. Van Dijk gives us a general model of how ideologies are represented in memory and how one gets from the knowledge representation to discourse and back. While determining the extent to which these can be reconciled, given the current state of knowledge about how the brain functions, may be a theoretical question for cognitive scientists, it is clear that if we need our system to understand discourse we will have to go beyond van Dijk in level of detail. This process will require making some assumptions about plausible mental models.
2. Some areas where there seems to be controversy:
Our system will be built on top of a search engine to collect documents by topic. This is a classic problem in information retrieval, where given a query, in this case the topic, the system returns a set of documents that satisfy the query, in this case being about the same topic. Thus it is in our interest to understand something about the issues involved in Internet search and the range of search engines available.
When the web is viewed as a directed graph, where the nodes are web sites and the edges are hyperlinks between the sites, we hypothesize that the topology of the web is related to groups of people or organizations in ways that shed light on ideological point of view. Similarly, we can define a structure on Usenet, which is already divided into newsgroups at a high level and individuals at a low level; these could be considered nodes, while messages posted might be considered directed edges. We would like to better understand the topologies of these graphs and see how they can be used in our study of ideology.
Long-term issues for our system might include retrieval of documents in multiple languages. This presupposes that the problem of machine translation is solved or that our system is sufficiently flexible to handle bad translations. In the context of a global society, we may experience a shrinkage of "common ground" knowledge and an expansion of what is considered ideology. Whether the language in which a message or web site was originally written would be a useful feature for our system is a subject for future research.
For now we will content ourselves with looking at some basic issues, explorations, and work on the Internet.
Kleinberg - WWW
The problem Kleinberg considers is searching the web, or the discovery of pages relevant to a given query. A user query may be specific, broad-based, or a request for pages similar to a given page. Different types of queries have different associated problems. Specific queries may give rise to the 'scarcity problem': there may be very few pages that contain the desired information and it may be difficult to find them. On the other hand, broad-based queries give rise to the 'abundance problem': the set of pages reasonably retrieved as relevant may be unmanageably large. Thus, there is a need to limit the number of pages retrieved to the most authoritative ones. The problem that is the focus of this work is how to determine if a page is authoritative.
Kleinberg notes that evaluation of a system that finds authoritative pages will be an issue due to the inherent subjectivity in notions such as relevance.
One possible way to limit the number of texts retrieved by a broad based query would be a text-based ranking scheme. For example, rank pages by the number of occurrences of the query string in the page, or the prominence of the query string in the page. This scheme is likely to fail due to the 'self reference problem': many natural authorities do not use terms that would categorize them on their web pages, e.g. there is no reason to expect that Honda or Toyota use the term "automobile manufacturer" on their web pages.
Another way to approach the problem would be to try to exploit the link structure, based on the assumption that the creation of a link from a page p to a page q confers some amount of authority on q. Thus, it would seem that we could find authoritative pages by counting links into them. Some pitfalls of this approach are: some links are purely navigational and should not be counted; what to do about paid advertisements; and the need to balance relevance against popularity. Consider the simple heuristic: of all pages that contain the query string, return the ones with the most in-links. Note that we still have the self-reference problem, and the additional difficulty that very popular sites, like yahoo.com, will be considered highly authoritative with respect to any query string they contain.
Kleinberg proposes a different link-based model. He defines a class of pages called 'hubs' which link to many related authorities. It turns out that there is a "certain natural type of equilibrium between hubs and authorities in the graph defined by link structure." Kleinberg's approach is global in that it seeks to identify the most central pages for broad topic searches in the context of the WWW as a whole. His approach is fundamentally different from clustering, which groups similar pages within a broad topic but does not find the authoritative pages or reduce the number of pages retrieved.
Kleinberg's algorithm to identify hubs and authorities simultaneously operates on a 'focused subgraph'. The focused subgraph is intended to be a small collection of pages that is most likely to contain the desired authorities. It is constructed by taking the top t (usually 200) pages resulting from querying AltaVista; then all pages pointing into this set and all pages pointed to by members of this set are added (with some restriction on the number of pages each page can add). The focused subgraph generally contains 1000 to 5000 pages. Further preprocessing involves the removal of intrinsic links, that is, links between pages in the same domain.
Intuitively, the algorithm iteratively computes numerical weights for hubs and authorities by increasing the authority weight if a lot of hubs point to it and increasing hub weight if it points to a lot of authorities. The weights are iteratively refined until an equilibrium is reached. In practice convergence occurs quite rapidly, with 20 iterations generally being sufficient. Kleinberg shows that the weight sequences always converge.
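The iteration can be sketched as follows. The toy graph and page names are invented; a real implementation would operate on the focused subgraph described above.

```python
# A sketch of the hub/authority iteration on a toy graph.
# 'graph' maps each page to the pages it links to; the page names are invented.
def hits(graph, iters=20):
    pages = sorted(set(graph) | {q for qs in graph.values() for q in qs})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # authority weight: sum of the hub weights of pages pointing to it
        auth = {p: sum(hub[q] for q in pages if p in graph.get(q, ())) for p in pages}
        # hub weight: sum of the authority weights of pages it points to
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
        # renormalize so the squared weights sum to 1, as in Kleinberg's paper
        for w in (auth, hub):
            norm = sum(v * v for v in w.values()) ** 0.5
            for p in w:
                w[p] /= norm
    return hub, auth

toy = {
    "hub1": ["siteA", "siteB"],
    "hub2": ["siteA", "siteB", "siteC"],
    "hub3": ["siteA"],
}
hub, auth = hits(toy)
print(max(auth, key=auth.get))  # the site linked to by the most hubs
```

The mutual reinforcement is visible even on this toy example: siteA, linked to by all three hubs, receives the highest authority weight, while hub2, which points at all the authorities, receives the highest hub weight.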
Kleinberg's algorithm has the ability to disambiguate queries and cluster authorities by sense. This is done by considering eigenvectors, other than the principal eigenvector, from the adjacency matrix of the focused subgraph with intrinsic links removed. For example, when the query is "jaguar", the authorities for the principal eigenvector concern the Atari Jaguar product; for the 2nd non-principal eigenvector, the NFL football team Jacksonville Jaguars; and for the third non-principal eigenvector, the car.
Kleinberg also shows how similar-page queries can be addressed by modifying the construction of the root set to consist of t pages pointing to the page P, where pages similar to P are desired.
While the examples provided in the paper are convincing, the limited principled evaluations attempted have not provided definite conclusions.
In a second paper, Kleinberg et al (1999) study the Web graph using Kleinberg's HITS algorithm (described in the paper above) and the enumeration of certain bipartite cliques. The second algorithm is designed to trawl the web for emerging cyber-communities. They determine that traditional random graph models do a poor job of modeling the web graph and propose a class of plausible random graph models that might better fit their observations of the local structure of the web. They found Zipfian distributions in the in-degree plot, which could not arise in traditional random graph models. This work raises a number of questions, including how the communities found can be organized and presented, and how the connectivity measures found can be applied and extended.
Terveen et al - WWW
Terveen et al describe some innovations designed to aid users who wish to obtain and evaluate entire collections of topically related web sites. They define a 'site' as a "structured collection of pages, a multimedia document - as the basic unit of analysis." We note that this is somewhat analogous to Kleinberg's elimination of intrinsic links.
To find topically related web sites they use a previously developed system called PHOAKS to search newsgroup messages for mentions of web sites. PHOAKS applies rules to identify which mentions are recommendations and ranks them within a topic by number of recommendations from different individuals. They state that previous work has shown that this method has high accuracy in recognizing recommendations and that there is a correlation between their highly ranked pages and other metrics of web page quality.
They define 'clan graphs' grouping sets of related sites. An N-clan graph is defined as a graph where "(1) every node is connected to every other node by a path of length N or less and (2) all of the connecting paths go through nodes in the clan." They have found that 2-clan graphs capture the notions of collection and locality needed to determine topically related subgraphs within a larger graph.
The construction algorithm tends to filter out irrelevant sites and discover additional relevant items. It places seed pages, chosen by the user, in a queue and based on a scoring metric decides which pages to construct a site around, which sites to expand, and which sites to add to the graph. The score of a page is the number of seeds that are linked to the page by a path of length two or less.
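The scoring step might be sketched as follows, under the simplifying assumption that connections are treated as undirected (the adjacency list must therefore contain each link in both directions); the link structure and site names are invented.

```python
# A sketch of the seed-based scoring metric: the score of a page is the number
# of seeds connected to it by a path of length two or less. The graph below is
# invented, and links are assumed to be stored in both directions.
from collections import deque

def within_two(links, start):
    """All nodes reachable from start by a path of length two or less (BFS)."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == 2:
            continue  # do not expand beyond distance two
        for nbr in links.get(node, ()):
            if nbr not in seen:
                seen[nbr] = seen[node] + 1
                queue.append(nbr)
    return set(seen) - {start}

def score(links, seeds, page):
    """Number of seeds linked to the page by a path of length <= 2."""
    return sum(page in within_two(links, s) for s in seeds)

links = {
    "seed1": ["hub"],
    "seed2": ["hub"],
    "hub": ["seed1", "seed2", "candidate"],
    "candidate": ["hub"],
}
print(score(links, ["seed1", "seed2"], "candidate"))
```

Here the candidate site sits two steps from both seeds via a shared hub, so it scores 2 and would be a strong addition to the clan graph.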
They augment the representation of a site with a 'site profile' to help the user evaluate the quality and function of each site. It includes: the title of the site's root page, a thumbnail image, links to and from other sites, media content information, information about internal pages, and a count of occurrences of domain-specific indexing phrases.
They have invented a new graph visualization they call 'auditorium visualization' that reveals important structural and content properties of sites within the clan graph. It includes linked views, thumbnail representations, and progressive revelation of greater detail. Development involved iterated cycles of design and usability testing.
They note that they have begun experimental comparisons of their algorithm with link analysis algorithms. Although they do not provide details, they report that, based on preliminary results, authority/hub computations do not tell us much more than simple measures of in and out degree.
Graphs and the WWW
Based on the three papers considered above, insufficient evaluation has been performed on the models and systems to make valid comparisons. It seems clear, when considering the WWW, that link structure is important. This is particularly true for us, since we are interested in finding groups or communities that share ideological perspective, which may be represented by some form of subgraph. While other ways to define and compute such subgraphs should be explored, the work considered above does have applicability.
It is conceivable that, if Kleinberg's algorithm were used on a topic query, the various eigenvectors might produce a type of clustering that could be manipulated to reflect ideology. If this could be accomplished, then these authorities could be used as seeds to construct a clan graph, possibly producing a group with shared ideology.
Some problems with processing web pages are brought to light, such as the scarcity and abundance problems, and the issue of whether a site or a page is to be considered the minimal unit.
Whittaker et al - Usenet
Whittaker et al explore the demographics, conversational strategies, and interactivity in Usenet. They investigate modeling mass interaction in Usenet with the common ground model and find that it would need to be modified to "incorporate notions of weak ties and communication overload." Some of their findings include: highly frequent cross-posting to external newsgroups; that a small minority of participants post a large proportion of the messages; that cross-posting and short messages promote interactivity; and moderate conversational threading.
What seems to be missing here is investigation of possible application of graphs to the structure of Usenet, analysis of interactions between Usenet and WWW, and further investigation into cross-posting and how it relates different newsgroups.
For segmentation by ideological point of view on Usenet, we might ask whether it would be more appropriate to consider individual newsgroups as collections of topically related documents, or whether groups of newsgroups related by frequent cross-postings should be considered instead. Another option would be to use a search engine on newsgroup archives with a topic query.
References for Internet
Jon M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, 46(5):604-632, 1999.
There seem to be two obvious ways to frame our problem of classifying a set of documents on the same topic by ideological point of view, so we choose to focus on these two plausible techniques, rather than survey all possible techniques.
1. As a clustering problem, where we want to cluster the documents by some appropriate measure of semantic similarity. This view fits our problem well because we do not necessarily have predefined categories or points of view. We will consider Latent Semantic Analysis as a possible method to implement this view.
2. As a text classification problem, which can be defined as classifying a set of documents into a fixed number of predefined classes. Since our points of view are not predefined, this definition poses a difficulty; Wiebe and Bruce (1995) propose a method to get around a similar problem in the context of classifying psychological point of view using probabilistic classifiers. The use of a probabilistic classifier is attractive because, after our discussions of ideology and discourse, we may suspect that we will need to combine a number of feature variables to solve our problem.
Since implementation of a system using LSA should be more straightforward, we propose that it be tried first. Should it not perform sufficiently well, it could still be used as a baseline against which to compare the probabilistic classifier.
Latent Semantic Analysis (LSA)
LSA is a fully automatic corpus-based statistical method for extracting and inferring relations of expected contextual usage of words in discourse (Landauer, Foltz, and Laham 1998). In LSA the text is represented as a matrix, with a row for each unique word in the text and a column for each text passage or other context. The entries in this matrix are the frequency of the word in the context. There is a preliminary information-theoretic weighting of the entries, followed by singular value decomposition (SVD) of the matrix. The result is a 100-150 dimensional "semantic space", in which the original words and passages are represented as vectors. The meaning of a passage is the average of the vectors of the words in the passage (Landauer, Laham, Rehder, and Schreiner 1997).
For a more detailed view: once the word-by-context matrix is constructed, the word frequency in each cell is converted to its log and divided by the entropy of its row (-sum(p log p)). The effect of this is to "weight each word-type occurrence directly by an estimate of its importance in the passage and inversely by the degree to which knowing that a word occurs provides information about which passage it appeared in" (Landauer et al 1998). Then SVD is applied: the matrix is decomposed into the product of three other matrices, two of derived orthogonal factor values, for the rows and columns respectively, and a diagonal scaling matrix. The dimensionality of the solution is reduced by deleting entries from the diagonal matrix, generally the smallest entries first. This dimension reduction has the effect that words that appear in similar contexts are represented by similar feature vectors. Then a measure of similarity (usually the cosine between vectors) is computed in the latent, or reduced-dimensional, space.
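The pipeline can be sketched on a toy matrix. The corpus below and the choice of k = 2 are illustrative only (real applications retain 100-150 dimensions); note also that a word occurring in only a single context has zero row entropy, which a production implementation would need to guard against before dividing.

```python
# A sketch of the LSA pipeline: log-entropy weighting, SVD, dimension
# reduction, and cosine similarity in the reduced space. The tiny
# word-by-context matrix below is invented for illustration.
import numpy as np

def lsa(counts, k):
    counts = np.asarray(counts, dtype=float)
    # log-entropy weighting: log of frequency, divided by the row's entropy
    p = counts / counts.sum(axis=1, keepdims=True)
    safe_p = np.where(p > 0, p, 1.0)  # log(1) = 0, so zero cells contribute nothing
    entropy = -(safe_p * np.log(safe_p)).sum(axis=1)
    weighted = np.log(counts + 1) / entropy[:, None]
    # SVD, then truncate to the k largest singular values
    U, s, Vt = np.linalg.svd(weighted, full_matrices=False)
    word_vecs = U[:, :k] * s[:k]    # word vectors in the latent space
    doc_vecs = Vt[:k, :].T * s[:k]  # context vectors in the latent space
    return word_vecs, doc_vecs

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# rows: car, engine, bank, loan; columns: four short contexts
counts = [[2, 1, 0, 0],
          [1, 2, 0, 0],
          [0, 0, 2, 1],
          [0, 0, 1, 2]]
words, docs = lsa(counts, k=2)
print(cosine(words[0], words[1]))  # car vs engine: shared contexts
print(cosine(words[0], words[2]))  # car vs bank: disjoint contexts
```

Even on this toy example, words that share contexts (car, engine) end up with nearly identical latent vectors, while words from disjoint contexts (car, bank) end up nearly orthogonal.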
LSA can be viewed as a tool to characterize the semantic contents of words and documents, but in addition it can be viewed as a model of semantic knowledge representation and semantic word learning (Foltz 1998). While LSA has been able to simulate human abilities and comprehension in a variety of experiments, there is still some controversy over its validity as a model. The main objection seems to center around the fact that it ignores word order and syntax. Objections raised by Perfetti (1998) have been refuted by Landauer (1999).
LSA does not claim to be a complete model of discourse processing. Landauer (1999) points out that the more general class of models to which LSA belongs, associative learning and spectral decomposition, is well understood in terms of formal modeling properties and as existing phenomena at both psychological and physiological levels. Perhaps this is the beginning of an explanation of why LSA seems to do so well at simulating human abilities, with so little.
LSA has been used for a number of natural language processing tasks, including information retrieval (for which it was originally developed), summarization (Ando 2000), text segmentation (Choi et al 2001), and measuring text coherence (Foltz 1998).
Ando (2000) proposed an iterative scaling algorithm to replace SVD and showed a significant increase in precision on a text classification task. Her algorithm iteratively scales vectors and computes eigenvectors to create basis vectors for a reduced space. She uses a log-likelihood model to choose the number of dimensions, an improvement over LSA, for which no empirical method of selecting the number of retained dimensions has been proposed.
To address some shortcomings of LSA due to its "unsatisfactory statistical foundation", Hofmann (1999) introduces Probabilistic Latent Semantic Analysis (PLSA), based on the likelihood principle. In experiments on four document collections, PLSA performed better than LSA, tf, and tfidf in retrieval tasks. The core of PLSA is a "latent variable model for general co-occurrence data which associates an unobserved class variable with each observation", called an 'aspect model'. He uses a tempered EM algorithm for maximum likelihood estimation of the latent variable model, to avoid overfitting.
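A minimal EM sketch of the aspect model may clarify the idea. The toy counts are invented; beta = 1 gives plain EM, while Hofmann's tempered variant uses beta < 1, annealed against held-out data rather than fixed as here.

```python
# A sketch of EM for the aspect model P(d,w) = sum_z P(z) P(d|z) P(w|z).
# The document-word counts are invented; beta < 1 damps the E-step as in
# tempered EM (a full implementation would anneal beta on held-out data).
import numpy as np

rng = np.random.default_rng(0)

def plsa(n_dw, n_topics, iters=100, beta=1.0):
    D, W = n_dw.shape
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_d_z = rng.random((n_topics, D)); p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: P(z|d,w) proportional to [P(z) P(d|z) P(w|z)]^beta
        joint = (p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]) ** beta
        p_z_dw = joint / np.maximum(joint.sum(axis=0, keepdims=True), 1e-12)
        # M-step: re-estimate each factor from the expected counts
        weighted = n_dw[None, :, :] * p_z_dw            # shape (topics, docs, words)
        p_w_z = weighted.sum(axis=1)
        p_w_z /= np.maximum(p_w_z.sum(axis=1, keepdims=True), 1e-12)
        p_d_z = weighted.sum(axis=2)
        p_d_z /= np.maximum(p_d_z.sum(axis=1, keepdims=True), 1e-12)
        p_z = weighted.sum(axis=(1, 2))
        p_z /= p_z.sum()
    return p_z, p_d_z, p_w_z

# toy corpus: two docs about words 0-1, two docs about words 2-3
counts = np.array([[4, 3, 0, 0],
                   [3, 4, 0, 0],
                   [0, 0, 4, 3],
                   [0, 0, 3, 4]], dtype=float)
p_z, p_d_z, p_w_z = plsa(counts, n_topics=2)
print(np.round(p_w_z, 2))  # per-topic word distributions
```

On this block-structured toy corpus, EM typically pulls each topic's word distribution toward one of the two word blocks, so the model concentrates its probability mass on the observed co-occurrences.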
A probabilistic classifier assigns the most probable class, out of a finite set of classes, to an object, based on a probability model. The probability model defines the joint distribution of the variables, which are made up of the classification variable and a set of feature variables. The feature variables represent properties of the objects, in our case documents, we wish to classify. Features might include useful semantic, syntactic, and lexical distinctions or properties with respect to ideological point of view.
As noted above, we have a difficulty because the set of ideological points of view may not be known in advance, and so cannot serve as values for the classification variable. Wiebe and Bruce (1995) get around a similar problem of classifying point of view by breaking the classification problem into three subproblems. Each takes input from the preceding one and has its own classification variable.
Rather than deciding which feature variables to use or how the variables are related, they use statistical methods. Specifically they use decomposable graphical models, which graphically represent the features as nodes and the interdependences of the features by undirected edges. A model that describes the data by representing only the most important interdependencies is chosen, based on a process of hypothesis testing using the likelihood ratio statistic to measure the goodness-of-fit of the model. Once the model is determined, maximum likelihood estimates are used for its parameters.
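The basic machinery of a probabilistic classifier can be illustrated with naive Bayes, the degenerate decomposable model with no interdependencies among the features (Wiebe and Bruce instead search over which interdependencies to keep). The features, values, and training examples below are invented.

```python
# Naive Bayes: P(class | features) is proportional to P(class) * prod_i P(f_i | class),
# i.e. the decomposable model whose graph has no edges among the feature nodes.
# Feature names and training data are invented for illustration.
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (feature_tuple, label).
    Returns a classifier using maximum likelihood estimates with add-one smoothing."""
    class_counts = Counter(label for _, label in examples)
    feat_counts = defaultdict(Counter)  # (position, label) -> value counts
    values = defaultdict(set)           # position -> set of observed values
    for feats, label in examples:
        for i, v in enumerate(feats):
            feat_counts[(i, label)][v] += 1
            values[i].add(v)
    total = sum(class_counts.values())

    def log_p(label, feats):
        lp = math.log(class_counts[label] / total)
        for i, v in enumerate(feats):
            counts = feat_counts[(i, label)]
            lp += math.log((counts[v] + 1) / (sum(counts.values()) + len(values[i])))
        return lp

    labels = list(class_counts)
    return lambda feats: max(labels, key=lambda c: log_p(c, feats))

data = [
    (("positive_word", "first_person"), "subjective"),
    (("positive_word", "third_person"), "subjective"),
    (("no_sentiment", "third_person"), "objective"),
    (("no_sentiment", "first_person"), "objective"),
]
classify = train(data)
print(classify(("positive_word", "first_person")))
```

The independence assumption is exactly what Wiebe and Bruce's model search relaxes: their procedure adds back the undirected edges (feature interdependencies) that the likelihood-ratio tests show the data to support.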
Wiebe and Bruce propose modifications to their model to limit its reliance on large amounts of tagged data, by estimating parameters from untagged data using a stochastic simulation technique.
In using a probabilistic classifier we need to consider what features we might want to use. It would be preferable if the features could be extracted automatically in a preprocessing phase. One option for the automatic extraction of lexical and syntactic features would be to use a system like AutoSlog-TS (Riloff 1996). We would also want to consider features discussed in the discourse processing section (above), possibly modified to better fit our task of determining ideological point of view. Additional features might be mined from the link structure of the Internet.
References for Statistical NLP and Machine Learning
Rie Kubota Ando. Latent Semantic Space: Iterative Scaling Improves Inter-document Similarity Measurement. In Proceedings of SIGIR 2000.