
Introduction
Ideology
Discussion
Discourse
Internet
Statistical NLP and Machine Learning
Conclusion

Suppose that one desires to create an automatic system that would, given a collection of text from the Internet about a given topic, segment the text by ideological point of view. An example might be to sort the text into ideological camps and present it in some fashion that would be useful to users of the Internet. This might enable a user who is searching for information on back pain to choose between information oriented towards traditional medicine and information oriented towards alternative treatments, or a user searching for information about a political event or issue to find representative articles from various bands on the political spectrum (e.g. communist, liberal, moderate, conservative, new right, etc.). To better understand what such a task might entail, we will break the task down into smaller steps and examine what knowledge, tools and definitions would be necessary at each step.

We begin by discussing the genre: we are interested in the Internet, but we will limit the problem to text. Areas excluded from our task would include: speech, sound, and video. While these areas are encompassed by the Internet and interesting in their own right, eliminating them will better focus our task. Thus we will discuss two primary areas where text may be found on the Internet: web pages and USENET newsgroup postings. Also for purposes of focus we will not consider chat rooms, or private email. Thus we will consider web and newsgroup related research and tools.

Having limited our focus to text processing on the Internet, we will need to have some understanding of the structure of the Internet and how it can aid and constrain searching and text processing. We will survey current work on search and the retrieval of information from the web.

More generally, since we will be processing text, we will need to understand discourse theory and current techniques for discourse processing. We will examine the state of the art in discourse processing and to what extent these techniques might be useful or necessary for our task. This will take us into the areas of discourse structure theory, text categorization, text segmentation, and text summarization. We will also look at how these techniques might be applied to determining ideological point of view.

In order to better understand and define ideological point of view, we will consider some of its linguistic aspects. We will survey work by computational linguists to date on point of view and available linguistic resources. We will also consider some related work on subjectivity in newspaper text and message filtering systems in USENET newsgroups.

As necessary we may discuss techniques from machine learning, information retrieval, language understanding, knowledge representation, and psycholinguistics. Throughout, we will place the greatest emphasis on statistical techniques.

Supposing that such a system can be created, we will consider appropriate means to test and evaluate it.

In order to build an automatic system to segment text by ideological point of view, we must first understand what we mean by ideological point of view. We would also like to understand where ideological point of view might fit into the larger picture of natural language processing.

Ideology has been studied from the perspective of a variety of academic disciplines, including anthropology, sociology, linguistics (particularly in the areas of pragmatics and sociolinguistics), psychology, cognitive science, history, communications studies, political science, rhetoric, critical theory and computer science (particularly in natural language processing, a subfield of artificial intelligence). Given this list, it seems that a multi-disciplinary approach to ideology is in order.

For purposes of a working definition of ideology, we will turn to the work of Teun van Dijk, a linguist and professor of discourse studies, who has taken a multidisciplinary approach to ideology. His orientation towards discourse studies makes his work closer to work done in computational linguistics than that of many researchers in other disciplines.

Once we gain a working understanding of ideology, we will survey systems developed by natural language researchers which relate to ideology or to point of view.

In common usage ideology tends to be a somewhat vague term. It is something that, like pornography or art, one knows when one sees it, but would be hard pressed to define precisely. Ideology is often used pejoratively, perhaps being most easily recognized when someone expresses a strong position with which one disagrees. When conflict involving fundamental differences arises, we have "knowledge", they have "ideology". However, even these superficial musings start to give us clues toward a definition. It must have something to do with fundamental beliefs held by groups of people ("Us" versus "Them"). To formalize this, let's take a look at the work of van Dijk:

In his internet course "Ideology and Discourse: A Multidisciplinary Introduction", based on his book Ideology: A Multidisciplinary Approach, Teun van Dijk provides a general working definition:

1. This definition does not imply a negative evaluation of ideology, nor does it limit us to ideologies that legitimize dominance. In this sense it will serve us well in our project of segmenting by point of view, because we want to classify a broad range of ideologies and not to judge them.

2. This definition centers on "groups", a notion we will discuss more shortly, as an intermediate structure between the extremes of an individual and the entire culture or society. Just as there are no individual languages, our definition does not provide for individual ideologies. At the other extreme we have cultural common ground knowledge, or knowledge that is not disputed in a given society or culture. Since the knowledge is not disputed, there is no real interest in considering it in terms of point of view. Over the course of time group ideologies may become common ground knowledge and vice versa. For example, in Europe, 500 years ago, Christianity was common ground knowledge, whereas today it would be viewed as group ideology. In contrast, our current common ground knowledge about the position of planets in the solar system was once considered ideology. Again, it appears that these distinctions will serve our task, because on a given topic we want to distinguish between the differing points of view of groups of discussants: individual points of view would be too fine grained, while common ground views would be too coarse.

3. In order to understand the definition we need to clarify what is meant by "group". We would probably not want to consider a "group" of people in a supermarket checkout line as having an ideology. We are interested in social groups that have some level of permanency and common goals. We might define social groupness in terms of "membership criteria (origin, appearance, language, religion, diplomas or a membership card), typical activities (as is the case for professionals), specific goals (teach students, heal patients, bring the news), norms, group relations and resources." "In social terms we may define a number of the properties that people routinely use to identify themselves and others as ingroup and outgroup members, and to act accordingly. Sometimes these group criteria will be quite loose and superficial, e.g., when based on preferred dress or music styles, sometimes they organize virtually all aspects of the life and activities of the members of a group, as may be the case for gender, ethnicity, religion and profession." In other words, a group must have some basis for self-definition and commonality in order to develop a group ideology.

Another aspect of groups that may prove useful is to recognize that they often have structure. The structure may be formal or informal, often including leaders, ordinary members, followers, teachers, and subgroups or individuals fulfilling special functions. The group structure can play an important role in the "acquisition, spreading, defense or inculcation of ideologies. Thus, new members need to learn the ideology of a group." We might hope that some of these structures will be mirrored in the local structure of the World Wide Web and the structure of Usenet newsgroups, and that they will be susceptible to computational discovery. We will keep this in mind later when we discuss Kleinberg's methods for locating hubs and authorities on the web.

4. Discourse plays a key role in the development and promulgation of ideologies. Ideologies influence the content of discourse and ideologies are acquired and transformed through discourse. We will explore the link between ideology and discourse in greater detail below, with particular emphasis on how ideological point of view might be detected in a given discourse. For the purposes of this paper we are interested in written discourse.

5. Ideologies are fundamentally subjective, since beliefs are subjective. We will develop this further when we explore cognitive models of ideology. This observation is important because it links the study of ideological point of view with the more general study of subjectivity in discourse.

Van Dijk divides his study of ideology into three interconnected areas: Cognition, Society, and Discourse. For our purposes, cognition and discourse are of most interest; the issues he raises regarding ideology and society that are of interest to us, related to understanding groups, have been discussed above. Cognition will be of interest in terms of the mental model he presents and to better understand the definition of ideology. Discourse is of great interest to us, so we will consider his work on it in some detail.

How ideologies are represented in memory is very much an open question at this point. Van Dijk hypothesizes that ideologies are represented in social memory, a part of Long Term Memory that is distinct from, but linked to, Episodic Memory. He chooses to represent the general beliefs of ideologies in a propositional format, for convenience. For example, "All citizens should have equal rights." He also assumes that ideologies form "systems" of beliefs, suggesting some level of order and organization. He theorizes that the organization of ideologies is "schema-like", consisting of "a number of conventional categories that allow social actors to rapidly understand or to build, reject or modify an ideology."

A possibility for such a theoretical schema, derived from the basic properties of a social group is:

The foregoing is very abstract and general and raises the question: How do we get from abstract, general ideologies, to people who produce and understand discourse or engage in other social practices? Van Dijk provides an explanation of this process from general to specific, with the following interface:

Group attitudes are intermediary representations between ideologies and discourse. Attitudes are defined as beliefs with an evaluative component (as distinguished from knowledge). Attitudes may embody ideological propositions as applied to a specific domain. For example, a racist ideology might be applied in the area of education.

Group knowledge may be affected by ideology, in the sense that those who hold certain group beliefs consider them to be true, and thus knowledge. For example, "if some racist psychologists hold that Blacks are less intelligent than Whites, they might see this as knowledge, while obtained by what they see as scientific evidence, but others may well see this as a form of racist prejudice, based on biased argumentation and misguided application of scientific method."

Personal mental models are the representations in episodic memory of our daily experiences. This would include events we participate in, witness, or read about, and our opinions about these events. These models are personal and thus inherently subjective. They may be strongly influenced by ideology. It is these mental models that are the basis for comprehension and production of action and discourse, in the sense that speaking (writing) involves the expression of mental models and hearing (reading) involves the updating or construction of mental models.

Van Dijk deals with this by defining context models to represent the current ongoing communicative event dynamically. The context model keeps track of our goals and intentions, what we believe the participants know, social relations between participants, the social situation, time, or more generally, what is "relevant for discourse in the current communicative situation."

Like other mental models discussed, context models may be ideologically biased. For example, how the speaker perceives the participants in the discourse may be affected by the speaker's ideologies. Thus, ideologies may control both the content and manner of our discourse.

2. To understand how we might plausibly get from ideology to discourse, which is our main interest here.

3. Because understanding the human cognitive processes may be helpful in determining appropriate data structures and architectures to use when we attempt to have a machine understand discourse.

We have seen that there are many open questions in the area of cognition, which necessitate making choices and assumptions. We are not currently in a position to say whether van Dijk's choices and assumptions are the correct or even the most expedient ones possible.

We would like to discover properties of discourse that will pick up variations in ideology, based on the underlying context and event models, and social attitudes. Our first thought is that semantics and style would be better places to look than morphology and syntax. Still, a more concrete methodology would be helpful. Van Dijk proposes a heuristic, based on the fundamental notion that ideologies are represented as "some kind of basic self-schema of a group, featuring the fundamental information by which group members identify and categorize themselves, such as their membership criteria, group activities, aims, norms, relations to others, resources, etc." His heuristic claims to represent a very general overall strategy of most ideological discourse: to organize people and society in polarized terms (Us versus Them). The result is a conceptual or "ideological" square of four principles:

Since discourse provides many ways to emphasize or de-emphasize meanings, van Dijk applies his ideological square to analyze discourse at the levels of meaning, propositional structures, formal structures, sentence syntax, discourse forms, argumentation, rhetoric, and action and interaction. We will look at each of these in turn.

Here we distinguish between topics and themes: topics can be represented propositionally, whereas themes, which are more abstract, are typically represented by a single word. Themes define classes of text which contain many different topics. For example, under the theme of "Education", topics might include: Falling test scores, Debate over creationist curriculum, Is physical education necessary, etc. "Topics typically are the information that is best recalled of a discourse."

Given a topic, we have the option in the realization of our mental model to provide abstract or specific descriptions with many or few details. These options can serve ideological purposes since "we will usually be more specific and more detailed about our good things and about the bad things of the others, and vice versa -- remain pretty vague and general when it comes to talk about our failures."

As we saw when discussing mental representations, the context model determines what part of the information in the mental model should be expressed. The decision to express information or leave it implicit can serve ideological purposes: "people tend to leave information implicit that is inconsistent with their positive self-image. On the other hand, any information that tells the recipient about the bad things of our enemies or about those we consider our outgroup will tend to be explicitly expressed in text and talk."

One can also presuppose information that is not generally shared or accepted at all, introducing it by implication. For example, if a politician expresses concern about "the high crime rate of inner-city youth", this presupposes that inner-city youth do in fact have a high crime rate. Even if it is true, the presupposition may be misleading because it is insufficiently qualified: the crime rate may be due to unemployed white males and not inner-city youth in general.

We say that a discourse is globally coherent if it has a topic and that it is locally coherent if the meanings of the sentences (their propositions) are related in some way. Local coherence may be referential or functional. A discourse sequence is referentially coherent if it has a model, or intuitively, if we can imagine a situation in which it is or could be true. Functional local coherence is defined in terms of the relations between the propositions themselves. For example, one proposition may have the function of being a Specification, a Generalization, an Example or a Contrast of another proposition.

While coherence is a very general condition of discourse and must be respected in order for discourse to be meaningful, it is in a sense ideologically controlled by the mental model it is based on. Also the hearer will tend to make inferences in order to make the discourse coherent, so a sentence such as "He is from Nigeria, but a very good worker" invites the inference that, in general, people from Nigeria are not good workers.

Synonyms and paraphrases alter meaning to some degree and these alterations may have ideological implications. For example, in Western Europe today, the use of "foreigners" generally implies a reference to ethnic minorities and immigrants. Depending on the context, these terms may be more positive or negative.

Here we examine the internal structure of the propositions, which taken together constitute the meaning of the discourse. Recall that propositions are things that may be true or false, or which (intuitively speaking) express one complete 'thought'. Sentences consist of one or more propositions. Propositions have a structure of the form:

Of interest in ideological analysis is that "the predicates of propositions may be more or less positive or negative, depending on the underlying opinions (as represented in mental models)."
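To make the predicate-argument view concrete, the following minimal Python sketch shows one way such propositions might be held in a data structure and tallied. It is not van Dijk's formalism: the class names `Proposition` and `Argument`, the numeric `polarity` score, the free-form `role` label and the `group` annotation ("us", "them", "other") are all assumptions introduced here for illustration, and the sketch presumes that some earlier step (manual or automatic) has already produced these annotations.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Argument:
    actor: str  # surface description of the actor, e.g. "our volunteers"
    role: str   # hypothetical role label, e.g. "agent" or "patient"
    group: str  # hypothetical annotation: "us", "them" or "other"


@dataclass
class Proposition:
    predicate: str
    polarity: float  # hypothetical score: > 0 positive, < 0 negative
    arguments: List[Argument] = field(default_factory=list)


def polarity_by_group_and_role(
    props: List[Proposition],
) -> Dict[Tuple[str, str], float]:
    """Sum predicate polarity for each (group, role) pair."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for prop in props:
        for arg in prop.arguments:
            totals[(arg.group, arg.role)] += prop.polarity
    return dict(totals)


# Two hand-annotated propositions, invented purely for illustration.
example = [
    Proposition("help", 1.0, [
        Argument("our volunteers", "agent", "us"),
        Argument("local families", "beneficiary", "other"),
    ]),
    Proposition("abuse", -1.0, [
        Argument("those people", "agent", "them"),
        Argument("the system", "patient", "other"),
    ]),
]
totals = polarity_by_group_and_role(example)
```

A text in which positive predicates cluster on "us" and negative predicates on "them" would show up as opposite-signed totals for the two groups. Whether such totals are a reliable signal of ideological point of view is an open question, not something this sketch establishes.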

"The arguments of a propositions may be about actors in various roles, namely as agents, patients, or beneficiaries of an action. Since ideological discourse is typically about Us and Them, the further analysis of actors is very important." Ideologically based actor descriptions semantically reflect the social distance implied by ideologies.

Here it is not the forms themselves, but the forms in context and to the extent that they emphasize or de-emphasize meanings, that we want to consider. Forms include sentence syntax (discussed below), and overall schematic forms of discourse such as argumentative or narrative structures, a news article or a scholarly article in a psychological journal. For example, emphasis can be given by placing something at the beginning of a news article. The same item can be de-emphasized by placing it towards the end of the article or leaving it out entirely.

Much of sentence syntax is not contextually variable, and so is not helpful when looking for ideological clues. Some places where we can look for ideological clues are: word order, active and passive sentences, and nominalizations. For example, "Words may be put up front through so called 'topicalization', or they may be 'downgraded' by putting them later in a clause or sentence, or leaving them out completely."

Since syntactic parsing is a fairly well understood area of computational linguistics, it might be reasonable to look for syntactic patterns that are out of the ordinary and see what words or concepts are either emphasized or de-emphasized to provide clues to ideological point of view.
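As a rough illustration of the idea above, the following Python sketch flags passive constructions and records whether the agent is named in a by-phrase. It is not a syntactic parser: the regular expression, the short list of irregular participles, and the function name `find_passives` are all assumptions made for illustration, and the pattern will both miss passives and produce false positives (for example, adjectives ending in -ed).

```python
import re
from typing import List, Tuple

# A form of "to be", an optional -ly adverb, then something that looks like
# a past participle. The irregular participles are a small hypothetical
# sample, not a complete list.
BE = r"(?:is|are|was|were|be|been|being)"
PARTICIPLE = r"(?:\w+ed|taken|given|shot|seen|done|made|sent)"
PASSIVE = re.compile(
    rf"\b{BE}\s+(?:\w+ly\s+)?({PARTICIPLE})\b(\s+by\b)?",
    re.IGNORECASE,
)


def find_passives(sentence: str) -> List[Tuple[str, bool]]:
    """Return (participle, has_by_phrase) for each passive-looking match."""
    return [
        (match.group(1).lower(), match.group(2) is not None)
        for match in PASSIVE.finditer(sentence)
    ]


agentless = find_passives("Three protesters were shot.")
with_agent = find_passives("Three protesters were shot by police.")
active = find_passives("Police shot three protesters.")
```

An agentless passive such as "were shot" leaves the actor implicit, which is the kind of de-emphasis discussed above. A real system would rely on a parser rather than on this pattern.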

Here we consider propositions at the level of the whole discourse. As with the expression of meanings in sentence syntactic form, which may be varied, at this level propositions may be expressed in sentences that appear at the beginning of the discourse, adding emphasis, or near the end, providing de-emphasis. So we see that one of the many possible functions of sentence order in discourse can be ideological.

Keeping in mind the ideological square, we note that "sentences that express positive meanings about us, and negative meanings about them, will typically appear up front -- if possible in headlines, leads, abstracts, announcements or initial summaries of stories. And conversely, meanings that embody information that is bad for our image will typically tend to appear at the end, or be left implicit altogether."
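The placement tendency quoted above could be checked, for a single document, along the following lines. This is a minimal Python sketch under strong assumptions: each sentence is presumed to have been labeled already (by hand or by some other component) with the group it is about and the valence of what is said. The labels "us", "them", "positive" and "negative" and the function name `mean_position_by_label` are ours, not van Dijk's.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def mean_position_by_label(
    sentences: List[Tuple[str, str]],
) -> Dict[Tuple[str, str], float]:
    """Average relative position (0.0 = first sentence, 1.0 = last)
    of the sentences carrying each (group, valence) label."""
    n = len(sentences)
    positions: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for index, label in enumerate(sentences):
        relative = index / (n - 1) if n > 1 else 0.0
        positions[label].append(relative)
    return {label: sum(vals) / len(vals) for label, vals in positions.items()}


# A four-sentence document with invented labels, in document order.
doc = [
    ("us", "positive"),
    ("them", "negative"),
    ("them", "positive"),
    ("us", "negative"),
]
result = mean_position_by_label(doc)
```

If the quoted tendency holds for a text, sentences labeled ("us", "positive") and ("them", "negative") should have lower mean positions than the other two combinations, as they do in this constructed example.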

In discourse genres where participants have different opinions or points of view, participants will generally use an argumentative structure, of the form either standpoint and arguments or arguments and conclusion, to make their standpoint more acceptable, credible or truthful. While the use of any given argument structure or fallacy (a breach of the rules or principles of argumentation) is unlikely to be linked to a specific ideology, the structure of argumentation may still be useful in determining ideologies. The main point of view often functions like a headline, representing the most important information in the text, adding emphasis, and controlling the production of the rest of the discourse. Since standpoint and opinion are often linked to shared group attitudes, argument structures may also signal the underlying structure of ideological attitudes.

Some difficulties may arise when the underlying ideologies are "politically incorrect", leading to arguments that are hidden or rationalized in terms of more "respectable" arguments and hence may be more difficult to detect. For example, a speaker who opposes allowing immigration of Mexicans might hide a racist ideology by using arguments about the labor market, lack of housing, or cultural problems.

Understanding argument structure, at least on a superficial level, is potentially helpful in determining the main standpoint and conclusions. It may help in determining what is being emphasized and what is being de-emphasized, providing clues about ideology. Even though no given argument structure can be linked to a specific ideology, one might still hypothesize that where social groups are relatively close knit, as in Usenet newsgroups, some argument structures and fallacies may be repeated by different group members.

Here we are interested in the "figures of style" described in rhetoric, such as: alliterations, metaphors, similes, irony, euphemisms, litotes, etc. These generally fall into the class of non-literal language. We are not likely to be interested in whether or not a figure of style is used, since almost everyone uses, say, metaphors, regardless of the ideologies being expressed, but rather in the meaning, content and cognition of the particular figure of style used.

One might hypothesize that the use of non-literal language could be a key factor in identifying ideological point of view: members of a group are likely to use the same or similar metaphors and euphemisms to evoke shared attitudes or knowledge and to express the corners of the ideological square. For example, the use of the euphemism "collateral damage" for civilians accidentally killed in a missile attack is very evocative of a specific ideology.
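One simple statistical way to look for such shared expressions is to compare how often word pairs occur in texts from two groups. The following Python sketch ranks bigrams by a smoothed log ratio of their relative frequencies. The function names, the add-one smoothing and the toy sentences are assumptions chosen for illustration, and nothing here distinguishes euphemisms or metaphors from any other phrase that one group happens to use more often.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

Bigram = Tuple[str, str]


def bigrams(text: str) -> List[Bigram]:
    """Lowercase the text and return adjacent word pairs."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return list(zip(tokens, tokens[1:]))


def distinctive_bigrams(
    group_a: List[str], group_b: List[str], top: int = 3
) -> List[Bigram]:
    """Bigrams most characteristic of group_a relative to group_b,
    ranked by log ratio of add-one smoothed relative frequencies."""
    counts_a = Counter(bg for doc in group_a for bg in bigrams(doc))
    counts_b = Counter(bg for doc in group_b for bg in bigrams(doc))
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + len(vocab)
    total_b = sum(counts_b.values()) + len(vocab)
    scores = {
        bg: math.log(
            ((counts_a[bg] + 1) / total_a) / ((counts_b[bg] + 1) / total_b)
        )
        for bg in vocab
    }
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda bg: (-scores[bg], bg))
    return ranked[:top]


# Toy texts invented for illustration only.
texts_a = ["The strike caused collateral damage.",
           "Collateral damage was reported again."]
texts_b = ["The strike killed civilians.",
           "Civilians were killed again."]
top_a = distinctive_bigrams(texts_a, texts_b, top=1)
```

On these toy texts the highest-ranked bigram for the first group is ("collateral", "damage"). On real newsgroup or web text the counts would need to be far larger, and topic differences would have to be separated from ideological ones, before such a ranking could be trusted.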

With each of the formal structures van Dijk considers, we have seen that, since all formal structures are available to all people or groups, the existence of specific structures is unlikely to help in determining ideological point of view. Rather it is the context of formal structure that is likely to provide clues. That said, one might still hypothesize that when group ideology is expressed by different members of the group, we may find similarity in argument structure and style.

Van Dijk provides examples of categories found in the analysis of a debate in the British Parliament (House of Commons), held on March 5, 1997. The topic was the issue of benefits for specific categories of asylum seekers, after an earlier discussion about whether certain inner city boroughs of London would have to pay for the extra costs for reception of those refugees who are entitled to benefits. The debate provides an example of "an anti-immigrant attitude which we associate with a form of political racism, and on the other hand various humanitarian, or anti-racist ideologies that control more tolerant attitudes about immigration." Our concern here is with the categories he found, in the hope that we may better understand the instantiation of ideology in discourse. We have grouped the categories by level of analysis, discussed above. Categories that fall into more than one level are listed under each level.

norm expression

We see that the three most general levels of analysis are meaning, argumentation and rhetoric. This confirms our suspicions that these three areas would be the most fruitful to pursue in general discourse where we would like to detect ideological point of view. Note that the majority of the categories at the level of rhetoric deal with non-literal language, an area which is particularly problematic in NLP and yet crucial to understanding subjectivity.

To summarize, we have found in van Dijk's work a working definition of ideology, plausible cognitive structures that may be associated with it, and a heuristic strategy for finding ideological point of view in discourse. We have also seen that there are perhaps more unanswered questions than answered ones in the study of ideology.

One might ask why we would want to study ideology in the first place. On a grand scale, one would hope that improved understanding of the ideological point of view of discussants of major societal issues would help us all as citizens participating in democracy. Also, it might help us better understand our own ideological biases, enabling us to better evaluate our own positions and arguments. More relevant here, it might give us a better understanding of subjectivity that would be useful in many computational linguistic applications, help us better understand and model cognition, and provide a useful classification tool to help navigate the almost unfathomable amount of information now available electronically.

Van Dijk has applied his analysis to issues of immigration and racism. For an additional example on the side of the grand scale, we look at the work of Blommaert and Verschueren, two Belgian linguists (pragmaticists), who, in their book Debating Diversity, conducted an empirical study of publicly available discourse on "migrants" in Belgium. Their sources included mass media, government, political parties, and social scientists whose work was widely broadcast in the media. They found that the central concepts in the migrant debate, as conducted by the "tolerant majority", such as culture and democracy, were rarely defined explicitly. When there were explicit definitions, they tended to serve specific rhetorical goals. They also found that there were significant discrepancies between what might be termed the dictionary definition of meaning and the associative or derived meanings used in rhetorical practice of these central concepts. They concluded that the "migrant debate" rests on an ideology of homogeneism: "the idea that the ideal society should be as uniform or homogeneous as possible." In other words, despite the differences in rhetoric between the "tolerant majority" and racist or nationalist groups, there is an unquestioned underlying ideology of homogeneism, which controls the definition of the problem and range of solutions proposed.

1. The role of the investigator in the communicative process under investigation.

2. Dealing with the integration of micro- and macro-influences on communication. This presents the qualitative problem: "the interdependence of individual cognition and socially constructed meaning."

While it seems clear that interdisciplinarity is crucial in the understanding of ideology and that the role of the investigator must always be considered in scientific research, their second point brings up intriguing possibilities. It seems that the traceable structure of the World Wide Web and Usenet newsgroups may provide genres where the interaction between the micro- and macro-influences can be studied. We will explore this in greater detail when we look at some of the work that has been done on the structure of the World Wide Web (e.g. Terveen, Kleinberg).

Blommaert and Verschueren's definitional framework is similar to van Dijk's. They emphasize the importance of group relations and identities, which they view as "cognitively framed phenomena to be found at the inter-subjective level of the community." Group identities determine our opinions and discourses about others and other forms of our behavior towards them, and the authors note that there are no objective criteria that can be used to identify groups. They define ideology as "any constellation of fundamental or commonsensical, and often normative, ideas and attitudes related to some aspect(s) of social 'reality'." This is at the same time both broader and more specific than van Dijk's definition: more specific in that it specifies the types of ideas and attitudes, and more general in that it does not rely explicitly on the idea of groups. With qualifying discussion, the underlying idea is essentially the same.

In contrast to van Dijk's approach, Blommaert and Verschueren adopt a materialist perspective on the discourse data, viewing discourse as a "(symbolic) commodity, a source of and instrument for acquiring and elaborating power and status." This invites an ethnographic approach, analyzing the data in the "context of a synchronic pattern of social relations and practices", and a historical approach, which detects "waves of discourses: discourse traditions, genres, styles and transformations of fissures in these waves." This seems like a broader approach than we will want to take for our purposes.

Other than the general mention of cognitive framing, they are not concerned with cognitive models or mental representations of discourse. Their pragmatic approach to discourse analysis, recognizing that "every utterance relies on a world of implicit background assumptions"

1. Wording patterns and strategies. "Meaning derives from the grammatical and lexical choice, which language users make from the range of possible choices, in relation to subject matter and context."

2. 'Local' carriers of implicit information. This encompasses implication- and presupposition- carrying constructs.

3. Global meaning constructs. This encompasses the ways in which implicit and explicit meanings are combined. For instance, in patterns of argumentation, coherence and recursivity.

4. Interactional patterns. This encompasses "many types of direct and indirect interaction between different points of view."

They seek to use these methods to uncover a "common frame of reference": the general world view the language user assumes to be shared with others in the same community (group), including assumptions about what is acceptable or appropriate social behavior.

In this book their methodology is painted with broad strokes; thus, while it is similar in general outline to van Dijk's, it is difficult to compare the two point by point. They note, as van Dijk does, that patterns of argumentation do not vary with ideology; rather, it is the meanings or contents that vary. They cite Albert Hirschman's 1991 book, "The Rhetoric of Reaction", where his examination of patterns of argumentation used by 'reactionaries' at three points in history where social and political changes were taking place reveals three types of arguments against the changes: perversity, futility and jeopardy. Blommaert and Verschueren found examples of these three argumentation patterns in the rhetoric of both the 'reactionaries' and the 'progressives' in the migrant debate.

The perversity argument is essentially that the actual effects of the changes, through a chain of unintended consequences, are the opposite of what was intended. Futility: nothing really changes, i.e. the proposed changes will be superficial or cosmetic, hence an illusion, while the deep structures of society remain unchanged. Jeopardy: while the proposed change may be a good idea, the costs or consequences are too great.

Whether these kinds of arguments have structural patterns clear enough to enable automatic recognition is not immediately obvious. But it does seem that taking this kind of analysis into consideration could be useful: if an argumentation pattern can be identified, its content or semantics might be a key place to look for ideological point of view.

1. That further investigation of the discourse analysis techniques of linguistic pragmaticists is worthwhile to better understand ideological point of view in discourse and may possibly yield techniques that can be applied automatically.

2. That there may be inherent social value in the analysis of ideological point of view in discourse: improved understanding of implicit ideological contents may aid in understanding of the debate of important contemporary social issues, leading perhaps to better informed and rational participation by citizens in a democracy.

3. That while defining ideology may be difficult and controversial, reasonable working definitions can be developed. It appears that van Dijk has put more emphasis on developing a definition that takes into account cognitive models and is likely to work in a broader, multidisciplinary format.

4. While van Dijk and Blommaert and Verschueren have developed systematic techniques for analyzing ideology in discourse, there is still a lot of work to be done to develop a reasonable machine implementation.

We now consider two more examples of analysis of ideological point of view in discourse: Wang analyzes journalistic coverage of the 1991 Soviet coup attempt. Her techniques include some developed by van Dijk. Wortham and Locher analyze 1992 television newscasts by systematizing Bakhtin's concepts of voice and ventriloquation.

Wang chose to analyze newspaper coverage of the 1991 coup attempt in the Soviet Union because this kind of complex, uncertain and fast-changing situation allows for more insight into how ideologies affected the press reports. The papers she chose are the New York Times and Renmin Ribao, the Communist Party paper in China. Her main research tool is discourse analysis, although she also did quantitative content analysis. Her definition of ideologies as "sets of ideas involved in the ordering of experience, making some sense of the world", is taken from Hodge, Kress and Jones (1979). Given this definition, journalism, as it involves ordering the world and making sense of it for its readers, is an ideological process.

Findings from the content analysis included: the New York Times devoted much more space to news items and to editorials and analysis. In the stories that Renmin Ribao did run, most (93%) cited official news sources, compared to 11% in the New York Times. Very striking were the differences between the papers in number and content of photographs: Renmin Ribao published only two photographs, neither of them showing Gorbachev or Yeltsin.

For the discourse analysis, based in part on the models developed by van Dijk, thematic and schematic analysis, along with stylistic descriptions of the main thematic actors, were done. She follows van Dijk's definition of thematic structure as 'hierarchical organization of themes or topics of a text.' In this area headlines, topic categories, thematic structures and main actors were analyzed.

Thematic analysis considers the hierarchical organization of themes or topics in the text (headlines, topic categories, thematic structures). Schematic analysis looks at the overall organization of news items (summary, story, comments, background, verbal reaction, conclusions, etc.). According to van Dijk, 'schematic superstructures organized thematic macrostructures, much in the same way as the syntax of a sentence organizes the meaning of a sentence.' The thematic and schematic tree structures should be compared with the tree structures in Rhetorical Structure Theory.

The analysis of stylistic descriptions of the main actors basically involves listing all of the different ways that an actor, say Yeltsin, was referred to in the text. Both the descriptive terminology and the stylistic descriptions of the main actors differed greatly between the two newspapers, with the New York Times tending to be much more evaluative. However, this analysis may be misleading to some extent because 14% of the New York Times stories were editorials or opinion columns, while none of the Renmin Ribao stories were. One would speculate that the most strongly evaluative descriptions of the actors would be found in editorials and opinion columns. On the other hand, evaluative descriptions can be valuable clues to the writer's ideology.

Wang found that "the news structures of the two papers were rather different." This seems to contradict van Dijk's finding on international news structures, that a "globally shared code of journalistic practices leads to a standardized description of events." Wang found that both the thematic and schematic structures were different. She maintains that politics and ideology are two factors that at least partly explain these differences.

Whether or not Wang's methodology turns out to be too genre-specific for our purposes remains to be seen. Her work builds on van Dijk's earlier work on discourse analysis of news and her findings point to the possibility of finding ideological differences in both structure and style of text.

For a different approach to analyzing media bias or point of view we turn to the work of Wortham and Locher. Their interest is in the devices that a speaker uses to implicitly evaluate others while appearing to speak neutrally about them. They provide analytic methods for studying implicit moral evaluations and interactional positioning described by Goffman, using the concepts of voice and ventriloquation of Bakhtin.

In Bakhtin's theory of the novel, voice is defined as an identifiable social role or position that the character enacts. Voicing is the process of working out speakers' social locations. Ventriloquation occurs when an authorial voice enters and takes a position with respect to a character: the author speaks through the character by aligning or distancing himself from the character. Thus the authorial voice implicitly comments on the social world it represents: using the speech of characters to express his or her own social and ethical positions. Wortham and Locher note that "newscasters portray their subjects as people who speak with identifiable voices. And they themselves speak through these voices and evaluate those they cover."

Wortham and Locher's technique for identifying attribution of social positions to those described in a newscast (voicing) and evaluation of those described (ventriloquation) is based on the "identification of tokens of certain textual devices that speakers commonly use to voice and evaluate their subjects." The five types of textual devices they consider, suggested by Silverstein and used in voicing and ventriloquation, are:

Wortham and Locher analyze three national newscasts from the evening of October 30, 1992: CNN/Telemundo, ABC World News Tonight, and CBS Evening News. On this date, four days before the 1992 U.S. Presidential election, the lead story was the release of notes from a 1986 meeting, taken by then Secretary of Defense Caspar Weinberger, that seemed to contradict Vice President and Presidential candidate George Bush's repeated statements that he did not know about the sale of missiles to Iran ahead of time. Wortham and Locher's analysis does uncover implicit messages sent by the newscasters.

They claim that voicing and ventriloquation are unavoidable in reporting (but need not be insidious), that their five devices are useful but not sufficient in determining implicit messages, and that their tools cannot be applied mechanically because the process of orchestrating an implicit message is "poetic", i.e. speakers do not mechanically apply rules to obtain intended outcomes. Thus, while their tools may prove helpful in finding clues to speaker ideology or determining implicit meaning in discourse, it does not appear that they can stand alone.

Here and in Wang's work, the amount of data analyzed is small and limited to a specific genre. While the events and media sources may have been chosen because they are particularly illustrative of what the authors wish to show, there is no clear indication of the extent to which their techniques will generalize, even within their chosen genre. There is also the question of how to automatize these techniques and test them on larger data sets.

We now turn to investigate some systems that have been developed over the years which take into account ideology or point of view.

Abelson's interest in political psychology led him to try to simulate a True Believer to better understand the phenomenon of 'ideological oversimplification'. This is the "tendency to caricature and trivialize the motives and character of the enemy and to glorify - but also trivialize - the motives and character of one's own side", which involves "interposing oversimplified symbol systems" between oneself and the external world and tends to exacerbate conflicts, both internationally and intranationally. He has a strong interest in developing a cognitive model of belief systems.

The Ideology Machine is designed to simulate responses to foreign policy questions by a right-wing ideologue, modeled on Barry Goldwater and foreign policy issues of the Cold War. Goldwater was chosen because his belief system was "notably 'closed'" and well understood, thus enabling it to be encoded in the system's memory structure and the analysis of the responses.

The basic memory structure of the machine is a "horizontal" encoding of sentences consisting of a concept followed by a predicate, where the predicate is generally a verb followed by a concept. Approximately 500 sentences were encoded. There is also a "vertical" structure to memory consisting of 'instance' and 'quality' relationships. For example, "India" is an instance of "left-leaning neutral nations" and "left-leaning neutral nation" is a quality of "India". Also stored is an 'evaluation' of each element (concept or predicate): a signed quantity summarizing the positive and negative affects attached to the element. 'Generic events' were represented by a verb category between two noun categories and then used as building blocks for 'episodes'. About 24 episodes were placed in memory, some quite intricate and with multiple branches of sequences of potential generic events.
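The memory organization just described can be pictured as a small data structure. The following Python sketch is only an illustration, not Abelson's implementation: the class and field names (`Memory`, `sentences`, `instance_of`, `evaluation`), the sample sentence and the numeric evaluation are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    # "Horizontal" structure: (concept, verb, concept) sentences.
    sentences: list = field(default_factory=list)
    # "Vertical" structure: instance -> set of qualities it is an instance of.
    instance_of: dict = field(default_factory=dict)
    # Signed evaluation attached to an element (concept or predicate).
    evaluation: dict = field(default_factory=dict)

    def add_sentence(self, subject, verb, obj):
        self.sentences.append((subject, verb, obj))

    def add_instance(self, instance, quality):
        self.instance_of.setdefault(instance, set()).add(quality)

    def qualities(self, instance):
        return self.instance_of.get(instance, set())

    def instances(self, quality):
        return {i for i, qs in self.instance_of.items() if quality in qs}


memory = Memory()
# The instance/quality pair is the example given in the text.
memory.add_instance("India", "left-leaning neutral nations")
# The sentence and the evaluation value below are invented for illustration.
memory.add_sentence("India", "trades with", "other nations")
memory.evaluation["left-leaning neutral nations"] = -1
```

Looking up `memory.qualities("India")` walks the vertical structure upward, and `memory.instances(...)` walks it downward; generic events and episodes would be built on top of such lookups.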

Ideological perspective was encoded as a "masterscript" which guided the processing of political information.

We will not discuss the cognitive model in detail here, other than to note that two important components are the "Credibility Test" and the "Rationalization Attempt". These enable the system to respond to one of six questions, such as: Is a given event credible? When a given event happened, what should a given actor have done? For example, for the first question, the credibility of a given event is assessed by checking whether its generic event type is recognized by the system. If it is recognized, a similar, specific past event is retrieved from memory.

Abelson proposed an improved system in 1973, which incorporated planning and Schank's theories of Conceptual Dependency. From today's perspective the search procedures might be considered inefficient, the domain too limited, the generation mechanism primitive, the definition of ideology too vague and oversimplified, and Conceptual Dependency might be questioned as a cognitive model. There was no attempt to analyze the meaning of actions; they were only classified according to ideological criteria, so that "Castro would throw eggs at West Berlin" could be inferred from the fact that "leftist students in South America threw eggs at Richard Nixon." Nevertheless, it was an important first step, in that the Ideology Machine was able to demonstrate that certain types of ideological behavior could be simulated by a computer program.
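A rough sketch of the credibility check, assuming memory is simply a table from generic event types to specific past events. The table layout, the generic event category and the function name `credibility_test` are hypothetical; only the stored past event is taken from the example Abelson's system is reported to have used.

```python
# Hypothetical memory: generic event type -> specific past events of that type.
# A generic event is a verb category between two noun categories.
PAST_EVENTS = {
    ("leftists", "attack", "Western targets"): [
        "leftist students in South America threw eggs at Richard Nixon",
    ],
}


def credibility_test(generic_event):
    """Return (credible, supporting_event) for a generic event type."""
    examples = PAST_EVENTS.get(generic_event)
    if not examples:
        return False, None       # event type is not recognized
    return True, examples[0]     # recognized: retrieve a similar past event


known = credibility_test(("leftists", "attack", "Western targets"))
unknown = credibility_test(("neutrals", "praise", "Western targets"))
```

Note that nothing here examines what the action means: an event is credible purely because its category matches, which is the limitation of the original design.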

Tale-spin generates stories that describe actions that actors take to create changes in the world that result in the satisfaction of the actor's goal. The actors are people and talking animals, who have goals, environments, and relationships to one another. Principal motivations have to do with physical needs, like hunger. Basically, characters are created who want to achieve simple goals. They create plans to achieve these goals, which can include moving to other physical spaces, manipulating objects, communicating with other characters (honestly and dishonestly) and negotiating with other characters to get something they want. The actions and state changes in Tale-spin are represented in a conceptual dependency framework.

Meehan's intent was to model people engaged in rational behavior. The basic components of the system are:

1. A problem solver: given a goal it produces other goals or subgoals and events. Contains the planner.

3. An inference maker: given an event, produces a set of consequent events, where one kind of consequence is a goal.
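The interplay between the problem solver and the inference maker can be sketched as a loop in which inferred goals are handed back to the planner. This is a toy illustration under stated assumptions, not Meehan's code: the plan library, the inference rules and the function names (`solve`, `infer`, `tell`) are invented.

```python
# Hypothetical plan library: goal -> ordered subgoals/events that achieve it.
PLANS = {
    "satisfy hunger": ["know where food is", "be near food", "eat food"],
}

# Hypothetical inference rules: event -> consequences. A consequence tagged
# "goal" is handed back to the problem solver.
INFERENCES = {
    "bear is hungry": [("goal", "satisfy hunger")],
    "eat food": [("state", "bear is not hungry")],
}


def solve(goal):
    """Problem solver: expand a goal into subgoals; unplanned goals are events."""
    steps = []
    for sub in PLANS.get(goal, []):
        steps.extend(solve(sub) or [sub])
    return steps


def infer(event):
    """Inference maker: the consequences of an event."""
    return INFERENCES.get(event, [])


def tell(initial_event):
    story, agenda = [initial_event], [initial_event]
    while agenda:
        event = agenda.pop(0)
        for kind, content in infer(event):
            if kind == "goal":
                for step in solve(content):
                    story.append(step)
                    agenda.append(step)
            else:
                story.append(content)
    return story


story = tell("bear is hungry")
```

Everything the sketch knows is hand-coded in the two tables, which illustrates on a small scale why such systems are hard to scale up.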

Tale-spin allows for two methods of storytelling: a bottom-up approach where the reader controls the simulator and a top-down approach, where the program uses a predefined set of "morals" to create the story.

Like the Goldwater Machine, Tale-spin was an important first prototype. It suffers from difficulties of scaling up, lack of commonsense background knowledge, and the limitations of the conceptual dependency framework (discussed below). However, Michael Pazzani has used his learning system, Occam, to construct a simulator based on Tale-spin, addressing the issue of obtaining data that is not hand-coded. Further investigations of his methodology and of the extent to which he succeeded might be worthwhile, since this is a serious problem for many systems.

Originally an attempt to improve on the Ideology Machine, by incorporating conceptual dependency, frames for representing real-world knowledge, situational scripts, planning units and other memory structures, Politics turned into a "general process model of subjective understanding". (Carbonell) Carbonell claims that the model of subjective understanding transcends ideological behavior, incorporating goal trees and counterplanning strategies to understand personality traits, certain aspects of discourse and human conflict situations.

Carbonell has developed a theory of subjective understanding, which he defines as "the process of applying the beliefs, motivations, and interests of the understander to the task of formulating a full interpretation of an event." Politics simulates the ideologies of a United States conservative and a United States liberal interpreting brief political events by answering questions posed about a given event. It analyzes the events into a conceptual dependency representation, applies situation information in the form of scripts, and applies an inferencing process guided by the ideology. Additional inferences may be performed during the question and answer phase.

Ideologies are modeled by goal trees. Once the goals for a particular ideology are determined, two different trees can be constructed: one with sub-goal links, where each sub-goal helps to achieve a higher level goal, and one with relative-importance links. Four specific criteria guide the development of the goal-tree model for political ideologies:

1. Parsimony: a political ideology should contain only the subjective knowledge required for ideological reasoning.

2. Orthogonality: a political ideology does not necessarily affect other aspects of subjective understanding and may be de-coupled from other ideological beliefs.

3. Compatibility: all aspects of subjective understanding should be represented in the same formalism.

4. Generality: the reasoning process is not domain-dependent and should apply across all ideologies and subjective beliefs.

The goals focus attention on the aspects of the situation or event that most interest the actor or understander. This directs the inferences that are made about the consequences of events. Understanders have different interpretations of events because they focus on how the events affect them personally and not on how they affect others, which gives rise to subjective understanding. Understanders know that other understanders are doing the same thing, as a function of their own goals, which leads to planning and counterplanning. Carbonell defines counterplanning as "a process in which one actor intentionally thwarts another actor's plans or attempts to achieve his own goals by circumventing the counterplanning attempts of the other actor." Counterplanning occurs when goal conflicts or plan interferences arise.
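To make the goal-tree idea concrete, here is a minimal sketch in which an ideology holds sub-goal links and a relative-importance ordering, and uses the ordering to decide which consequence of an event to attend to. The class `Ideology`, its methods, and all goal names are assumptions made for illustration; they are not taken from Politics.

```python
from dataclasses import dataclass


@dataclass
class Ideology:
    name: str
    subgoal_of: dict   # sub-goal -> higher-level goal it helps achieve
    importance: list   # goals ordered from most to least important

    def ancestors(self, goal):
        """Follow sub-goal links upward (assumes the links form a tree)."""
        chain = []
        while goal in self.subgoal_of:
            goal = self.subgoal_of[goal]
            chain.append(goal)
        return chain

    def focus(self, affected_goals):
        """Pick the affected goal that matters most to this understander."""
        ranked = [g for g in self.importance if g in affected_goals]
        return ranked[0] if ranked else None


# Invented goals for two understanders reading the same event.
conservative = Ideology(
    "conservative",
    {"strong military": "national security"},
    ["national security", "strong military", "social programs"],
)
liberal = Ideology(
    "liberal",
    {"social programs": "general welfare"},
    ["general welfare", "social programs", "strong military"],
)

# Goals touched by a hypothetical event, e.g. a shift in the budget.
affected = {"strong military", "social programs"}
```

With these invented trees, `conservative.focus(affected)` returns `"strong military"` while `liberal.focus(affected)` returns `"social programs"`: the same event is interpreted differently because each understander attends to the consequence ranked highest in its own tree.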

It seems possible that Carbonell's theory of goals, plans and counterplans might be useful in analyzing argument structure for clues of ideological point of view. On the other hand, this is a very conflict-oriented approach that may not account for more subtle ideological differences or cases where, as we saw in Debating Diversity, the surface rhetoric differs greatly, but the underlying ideology is basically the same. It also requires the construction of the goal tree, and this requires sufficient knowledge to model the goals that follow from an ideology in advance.

Because of some of the limitations of Politics, and because he believes that people use similar decision processes in resolving political, economic, judicial, domestic and social conflicts, Carbonell developed Triad. "Triad is a process model of understanding general conflict situations." Carbonell defines seven 'basic social acts':

Pauline (Planning and Uttering Language In Natural Environments) generates natural language text for news events subject to pragmatic constraints. The constraints are formulated as rhetorical goals. Pauline knows about three events and is able to produce 100 different descriptions of each event.

When the program is activated, values are selected, based on a list of thirteen goals that characterize the pragmatics of an interaction, for a set of features such as:

The generator incorporates these goals, interleaving rhetorical planning and realization at choice points. This supports the "standard top-down planning-to-realization approach, as well as a bottom-up approach, in which partially realized syntactic options present themselves as opportunities to the rhetorical criteria, at which point further planning can occur."

We have looked at the rhetorical goals of opinion in some detail because they fit in well with the four principles of van Dijk's ideological square. Since we are not primarily concerned with planning and generation, we will not discuss it further here. Hovy does use conceptual dependency, but due to the introduction of pragmatic constraints does not rely on it as heavily as Meehan and Carbonell. There are still questions about how well Pauline would handle other types of events and more complex events. There is also the issue of the need to hard-code knowledge.

All four systems above come out of the "Yale School", centered around Roger Schank. Abelson is a colleague, while Meehan, Carbonell and Hovy were Schank's students. Under their "conceptual dependency" framework, Schank and Abelson (1977) make universalist claims of cognitive plausibility for a specific set of semantic primitives, scripts, and other knowledge-specific constructs. Scripts are stereotypical representations of situations, which provide slots for various events, actions, objects, and relationships. They are a sort of template or schema of a situation, allowing for inferences and thus understanding of discourse.

The theories of conceptual dependency and scripts assume that all memory is episodic and organized in terms of scripts. While Schank and Abelson note that this may be controversial (Schank and Abelson 1977), they state their 'clear preference' for episodic memory and base their work on it. This memory model is at odds with van Dijk's model, discussed above.

While having a small number of universal primitives may be aesthetically and computationally pleasing, finding the correct set may be difficult or impossible. There are also questions as to the cognitive plausibility of reduction to semantic primitives (Mallery 1988). Mallery also notes that "these programs from the 'Yale school' have not been scaled up beyond hand-crafted microworlds nor produced cognitively felicitous representations. One major reason is that since top-down processing requires extensive background knowledge already coded in pre-existing data structures, the range of application is limited by the amount of background knowledge available to a system." This level of domain-specificity would certainly not work for the system we envision.

An additional serious limitation of conceptual dependency is that as a "theory of meaning that finds equivalent meanings through reduction to primitives", it assumes literalism in language. (Mallery 1988) Since we would most likely want our system to be able to handle non-literal language, such as metaphor, the conceptual dependency paradigm seems an unlikely choice.

Ballim and Wilks's program Viewgen is a viewpoint generator, where a viewpoint is "some person's belief about a topic". Viewgen generates multiple belief environments from different points of view. It is, in some sense, an agent modeling tool allowing for the generation of arbitrarily deep nested belief spaces. It uses a default reasoning mechanism that assumes that all agents' beliefs are the same as the system's, unless there is evidence to the contrary. All beliefs are ultimately held by the system, in the sense that what a given agent believes is what the system believes the agent believes.

As a generation program, it does not seem to suit our need to understand belief and points of view. It may also be more general than we want, since we are interested in ideological points of view that represent the perspective of groups of people and not a particular individual's belief. However, there may be some value in further investigating their definitions and methodology to see if there are aspects of the generation process that can lead to understanding.
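The default reasoning step can be illustrated with a small sketch: an agent's belief environment starts as a copy of the beliefs of the space generating it and is then overridden wherever contrary evidence is recorded. This is a simplified illustration, not Viewgen itself; the class `BeliefSpace`, its methods, and the way knowledge about third parties is carried into nested views are assumptions of the sketch, and the sample beliefs are invented.

```python
class BeliefSpace:
    """Beliefs about topics, plus recorded deviations for other agents."""

    def __init__(self, beliefs=None):
        self.beliefs = dict(beliefs or {})   # topic -> belief held in this space
        self.about = {}                      # agent -> {topic: differing belief}

    def record(self, agent, topic, belief):
        """Store evidence that `agent` believes something different."""
        self.about.setdefault(agent, {})[topic] = belief

    def view(self, agent):
        """Generate the agent's belief environment by default ascription."""
        env = BeliefSpace(self.beliefs)                 # default: same beliefs
        env.beliefs.update(self.about.get(agent, {}))   # unless evidence differs
        # Assumption of this sketch: what is known about third parties is
        # ascribed to the agent as well, so views can be nested further.
        env.about = {a: dict(d) for a, d in self.about.items() if a != agent}
        return env


system = BeliefSpace({"earth": "round", "coffee": "healthy"})
system.record("John", "earth", "flat")
system.record("Mary", "coffee", "unhealthy")

john = system.view("John")           # what the system believes John believes
john_on_mary = john.view("Mary")     # ... John believes Mary believes
```

Every environment is computed from the system's own store, which mirrors the point that all beliefs are ultimately held by the system.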

As part of a larger program of studying subjectivity in text, Wiebe developed an algorithm to track characters' psychological point of view in third-person fictional narrative text. Wiebe and Bruce (1995) proposed using the output of the algorithm as a feature in a probabilistic classifier to track point of view.

Wiebe's goal is to segment the text of fictional narratives into "maximal blocks of objective sentences and maximal blocks of subjective sentences that have the same subjective character." Following the work of Banfield, Wiebe defines subjective sentences as those which present private states of characters, as opposed to sentences that "objectively narrate events or describe the fictional world." Private states are states that cannot be objectively observed or verified, such as intellectual, emotive and perceptual states. Wiebe represents private states as PS(p, experiencer, attitude, object), where p is the private state, the experiencer is the person in that state, the attitude is the sort of private state, and the object is the object of the private state. Note that, in a given sentence, some of the components of PS may be implicit. Wiebe further defines the 'subjective character' of a sentence as the character whose point of view is taken in a subjective sentence and 'subjective elements' as linguistic elements that express attitudes of a subjective character.
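The representation and the segmentation target can be sketched as follows. This assumes each sentence has already been labeled with its subjective character (or with none, for objective sentences), which is the hard part the algorithm itself addresses. The class names, the `segment` function and the sample sentences are invented for illustration.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Optional


@dataclass
class PrivateState:
    """PS(p, experiencer, attitude, object); components may be implicit (None)."""
    p: str
    experiencer: Optional[str] = None
    attitude: Optional[str] = None
    obj: Optional[str] = None


@dataclass
class Sentence:
    text: str
    subjective_character: Optional[str] = None   # None marks an objective sentence


def segment(sentences):
    """Group sentences into maximal runs with the same subjective character."""
    return [
        (character, [s.text for s in run])
        for character, run in groupby(sentences, key=lambda s: s.subjective_character)
    ]


ps = PrivateState(p="fear", experiencer="Anna", attitude="emotive", obj="the storm")

narrative = [
    Sentence("Anna walked to the window."),
    Sentence("The storm would surely reach them by night.", "Anna"),
    Sentence("How could they have been so careless?", "Anna"),
    Sentence("Ben thought the worry was absurd.", "Ben"),
    Sentence("He closed the shutters."),
]
blocks = segment(narrative)
```

Because `groupby` only merges adjacent items with equal keys, each block is maximal: a new block starts exactly when the subjective character changes or the text switches between subjective and objective sentences.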

In order to make the problem more tractable, Wiebe makes some simplifying assumptions:

Wiebe's algorithm results from the extensive examination of regularities in the way authors initiate, continue and resume a character's point of view in naturally occurring narratives. She found that certain combinations of sentence features and contexts lead to the expectation of particular point of view operations. At the highest level, the algorithm takes a text as input and returns the point of view of each sentence. Note that it does not return a representation of the meaning of sentences.

The features of a sentence are then combined with the context to produce an interpretation of the sentence. Then the context is updated. The context consists of the last subjective character, previous subjective characters, the last active character, and the text situation. Once an interpretation is provided and the context is updated, the next input item is processed. If the input item is not a sentence, the context is updated before proceeding to the next input item.
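The control flow of this loop, and the four components of the context, can be sketched as below. Only the loop structure and the context fields follow the description; the `interpret` rule is a deliberately crude stand-in for Wiebe's interpretation rules, which are not reproduced here, and the feature names (`kind`, `subjective`, `experiencer`, `actor`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Context:
    last_subjective_character: Optional[str] = None
    previous_subjective_characters: list = field(default_factory=list)
    last_active_character: Optional[str] = None
    text_situation: str = "start"


def interpret(features, ctx):
    """Toy rule: an explicit experiencer wins, else continue the last view."""
    if not features.get("subjective"):
        return None   # objective sentence
    return features.get("experiencer") or ctx.last_subjective_character


def update(ctx, item, pov=None):
    if item["kind"] != "sentence":
        ctx.text_situation = "after " + item["kind"]   # e.g. a paragraph break
        return
    ctx.text_situation = "in text"
    if item.get("actor"):
        ctx.last_active_character = item["actor"]
    if pov is not None:
        if ctx.last_subjective_character not in (None, pov):
            ctx.previous_subjective_characters.append(ctx.last_subjective_character)
        ctx.last_subjective_character = pov


def track(items):
    """Return the point of view of each sentence and the final context."""
    ctx, result = Context(), []
    for item in items:
        if item["kind"] == "sentence":
            pov = interpret(item, ctx)
            result.append(pov)
            update(ctx, item, pov)
        else:
            update(ctx, item)   # non-sentences only update the context
    return result, ctx


items = [
    {"kind": "sentence", "subjective": True, "experiencer": "Anna", "actor": "Anna"},
    {"kind": "sentence", "subjective": True},
    {"kind": "paragraph break"},
    {"kind": "sentence", "subjective": False, "actor": "Ben"},
    {"kind": "sentence", "subjective": True, "experiencer": "Ben"},
]
points_of_view, final_context = track(items)
```

The second sentence has no explicit experiencer, so the toy rule continues Anna's point of view from the context; this is the kind of continuation decision the real rules make using much richer sentence features.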

Wiebe has extensively catalogued potential subjective elements, which she defines as linguistic elements, which potentially express subjectivity. These become subjective elements when they are instantiated in a context where they are actually subjective. We should consider close examination of these potential subjective elements to determine which ones might apply to ideological point of view as well as the psychological point of view studied by Wiebe. Of particular interest might be the elements that express evaluation or judgment.

Wiebe's algorithm has been evaluated by hand-simulation to determine the kinds of exceptions that occur. The exceptions were primarily due to issues not specifically addressed in this work. Additional work is needed on how spatial and temporal point of view affect psychological point of view and on anaphora resolution. Additional limited testing of the algorithm for statistical purposes yielded positive results.

Wiebe and Bruce (1995) reframe the problem as the segmenting of "a text into blocks such that all subjective sentences in a block are from the point of view of the same agent." They show how the problem can be made amenable to statistical processing and describe how it might be solved using probabilistic classifiers. They do not report on the implementation or evaluation of their model. We will discuss their approach in more detail when we discuss statistical methods.

Sack makes a distinction between ideological and psychological point of view: "ideological point of view characterized the political slant of an entire story, while psychological point of view (e.g., as it is used by Wiebe 1994) characterized the source of a sentence or statement contained within a story." He more precisely defines ideology as semiotic closure, building on the work of Greimas and Jameson on semiotic squares. This moves beyond the binary model or schematization of two rational possibilities (e.g. conservative/liberal or workers/bourgeoisie) to enable the mapping out of why, and in what circumstances, two terms can be posited as oppositions or contradictories. Given two contrary positions (strong opposition: black/white, male/female), these are placed at the top corners of the square. The bottom corners are filled in with the logical negations of these terms, with the negation relation on the diagonals (note that nonwhite is more than black and nonmale is more than female).

Thus the top and bottom edges of the square represent the contrary relation and the vertical sides of the square represent the implication relation. Using the square one can map out the ideology that circulates around a given issue. For example, on the issue of abortion the corners of the square might be feminist, Christian fundamentalist, family values conservative, liberal humanist. He terms each of these positions (at the corners of the square) a point of view.
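The geometry of the square is easy to lose in prose, so here is a minimal sketch that builds the four corners and the three kinds of relation from a pair of contraries. It follows the layout as described here (contraries on the horizontal edges, negation on the diagonals, implication on the vertical sides); the class `SemioticSquare` and its method names are assumptions of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemioticSquare:
    s1: str   # top-left term
    s2: str   # top-right term, the contrary of s1

    @property
    def not_s1(self):
        return "non" + self.s1   # bottom-right corner, diagonal from s1

    @property
    def not_s2(self):
        return "non" + self.s2   # bottom-left corner, diagonal from s2

    def relations(self):
        return {
            # Top and bottom edges.
            "contrary": [(self.s1, self.s2), (self.not_s2, self.not_s1)],
            # Diagonals: each term and its logical negation.
            "negation": [(self.s1, self.not_s1), (self.s2, self.not_s2)],
            # Vertical sides: each term implies the negation of its contrary.
            "implication": [(self.s1, self.not_s2), (self.s2, self.not_s1)],
        }


square = SemioticSquare("white", "black")
relations = square.relations()
```

For the white/black pair, the implication sides read "white implies nonblack" and "black implies nonwhite", while the converse does not hold, which is the sense in which nonwhite is more than black.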

Sack applies and extends these ideas in the realm of news texts. His extension of the semiotic squares is to actor-role analysis, where he is concerned with the identification of actor and thematic role combinations. He distinguishes between actantial and thematic roles: actantial roles, like heroes and villains exist on a narrative level and can only be identified by examining how an actor interacts with the other actors in a given narrative; in contrast thematic roles, e.g. a fisherman, exist on the discourse level and are part of larger discourses, which necessarily connect together many stories. For example, fishermen are associated with a set of attributes, which are carried over from story to story. In general a particular actor will be assigned one or more thematic and one or more actantial roles in a particular narrative. For example, on the abortion issue actor-role analysis might yield:

The high-level structure of the system is as follows. It takes a story as input. The find actor-role-bindings step calls find-noun-phrases to identify the actors and roles; find-roles is a simple pattern matcher; the output of find-noun-phrases and find-roles is combined to create the actor-role bindings. Actors are then coreferenced using member-of relations. Next, high-level weighted actor-role bindings are constructed by abstracting the actors into their most inclusive groups and counting the number of times actors from each most inclusive group play each role matched in the text. Finally, point of view is computed and output to the interface.
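The abstraction-and-counting step can be sketched in a few lines. This is an illustration in Python rather than Sack's Common Lisp, and the member-of table, the actors and the roles are invented; how point of view is then computed from the weighted bindings is not specified here, so the sketch stops at the counts.

```python
from collections import Counter

# Hypothetical member-of relations: actor -> the group it belongs to.
MEMBER_OF = {
    "the soldiers": "the army",
    "the army": "the government",
    "the rebels": "the guerrillas",
}


def most_inclusive_group(actor):
    """Follow member-of links upward until no larger group is known."""
    seen = set()
    while actor in MEMBER_OF and actor not in seen:   # guard against cycles
        seen.add(actor)
        actor = MEMBER_OF[actor]
    return actor


def weighted_bindings(bindings):
    """Count how often each most-inclusive group plays each role."""
    return Counter((most_inclusive_group(actor), role) for actor, role in bindings)


# Actor-role bindings as they might come out of the earlier steps.
bindings = [
    ("the soldiers", "victim"),
    ("the army", "victim"),
    ("the rebels", "attacker"),
]
weights = weighted_bindings(bindings)
```

Here both "the soldiers" and "the army" are abstracted to "the government", so the government is counted twice in the victim role; a story weighted this way would presumably lean towards the point of view that casts the government as the wronged party.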

Sack's implementation is called SpinDoctor. The current (improved and much simplified) version is written in Common Lisp and uses a part of speech tagger. He coded a database of actors, roles, and points of view using the first 25 stories about El Salvador in the MUC-3 Corpus. He then tested SpinDoctor on 17 unseen stories on El Salvador and 58 unseen stories about other countries (all from the MUC-3 Corpus). Out of 100 stories (including the training set), point of view was assigned correctly in 42, incorrectly in 3, and vaguely in 55. The vague category indicates that SpinDoctor selected a superset of the points of view actually represented in the story. Of the three cases where it was wrong, two suggest that the source of the article should be weighted more heavily; the other was a case where a point of view was missing (due to a change in government). SpinDoctor actually did reasonably well on the stories from other countries, which may indicate that the government-guerrillero-Catholic Church model fits most of the unrest in Latin America pretty well.
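The reported counts can be turned into rates with a few lines of arithmetic. The figures are those given above; the variable names and the derived "contains the true point of view" measure are our own reading of the vague category, not a metric reported by Sack.

```python
counts = {"correct": 42, "incorrect": 3, "vague": 55}   # reported counts
story_sets = {"training": 25, "unseen El Salvador": 17, "unseen other": 58}

total = sum(counts.values())
rates = {label: n / total for label, n in counts.items()}

# A vague answer is a superset of the points of view actually present,
# so it still contains the right one.
contains_true_pov = (counts["correct"] + counts["vague"]) / total
```

Read this way, the system names the right point of view exactly in 42% of the stories and includes it among its candidates in 97%, but because the 100 stories include the 25 used to build the database, neither figure is a clean held-out estimate.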

Ultimately this shows that the source of the text and how people are represented in the text is crucial to determining point of view. To create a real system machine learning would need to be used since one of the weaknesses of this system is the extent to which it is hard-coded for a very specific data set. Note that determinations of points of view and the manual coding of this was done be one person so there no study of coder agreement or reliability. However, unlike the systems discussed above, a quantitative evaluation was performed.

Sack's definition of ideology is somewhat restricted by the political framework in which it is cast, but seems as thought it could be applied more broadly. His conception of an ideological square is framed in terms of groups and the ideologies they hold, in contrast to van Dijk's, which is framed in terms of principles. These ideas complement, rather than contradict each other and both should probably be included in the definitional and analytical framework of ideology.

The two main disadvantages of Spin Doctor are the hard coding of ideology and the domain specificity that follows from it. If the actors and roles could be learned, it might well generalize to, at least, the news genre. Whether it would generalize to such genres as medicine on the internet is less clear. One might hypothesize that actors might be treatments rather than people and that the role a treatment plays would be either good or bad depending of the ideology of the author.

Terminal Time was conceived as a work of art. It is an interactive performance piece that constructs an ideologically-biased history of the world from 1000 A.D. to the present, in PBS documentary style, based on audience response. Its presentation is multimedia, combining video and narration to give a low-budget, 1980s television documentary feel. At about the points where commercials would appear in a half-hour television program, it presents multiple-choice questions to the audience and, based on an applause meter, determines the ideological slant of the next segment of the documentary. Eleven major ideologies are currently represented in the system, mostly centering on race, class, gender, technology, and religion.

The basic architecture consists of a knowledge base, ideological goal trees, a rule-based natural language generator, rhetorical devices, and a database of indexed audio/visual elements (including short digital movies and sound files containing music).

The knowledge base combines higher order predicate statements about historical events, definitions of ontological entities used in historical event descriptions and inference rules. They used the upper Cyc ontology as a basis for their ontology. The inference engine is based on "higher-order hereditary Harrop logic" which allows the knowledge base entries to consist of Horn clauses and the queries to consist of standard Prolog-like goals and embedded implications. It is implemented in Common Lisp and makes use of its extra-logical support functions.

Terminal Time organizes ideological bias with goal trees, which were adapted from Carbonell's Politics, to represent the goals of an ideological story-teller. The rhetorical goals of a story-teller are to show that something is the case by constructing an argument using the events available in the database. For example:

There is a test for event applicability to recognize whether or not an event can be used to satisfy a goal. Once an event is determined to be applicable, a rhetorical plan is implemented to 'spin' the event, by selecting a subset of the knowledge represented in the knowledge base, to satisfy the rhetorical goal. The atomic actions of the plan language add syntactic units to an event spin using rules, which map pieces of knowledge representation onto English.

During performances of Terminal Time the ideological bias grows stronger with each segment as the documentary proceeds, which allows audiences to see bias at work and the role of ideology in the construction of history. Judged by audience reaction during the performance and discussion afterwards, Terminal Time is successful in meeting the authors' artistic goals.

The ideologies presented in Terminal Time are the product of the authors' conception of what each ideology entails. The ideologies are hard-coded into the system. Of particular interest here is the use of the Cyc ontology and the use of Carbonell's goal trees. It might be possible to have a system with a component that parses a text into a goal tree, based on argument structure, and uses a metric to compare it to known ideological goal trees, to aid in determining the ideological point of view presented in the text.

We have seen several systems that compute point of view of various types, by various means and to various ends. Some issues begin to emerge:

1. In AI there is always a tension between work that aims for psychological or linguistic plausibility and work that is more motivated by engineering concerns. The systems we have seen here tend to be motivated primarily by psychological or linguistic plausibility. This creates some difficulties because there is still so much unknown about psychology and how the mind works. For example, to what extent do scripts, schemas, and semantic primitives reflect psychological reality, and if they do not, should we use them anyway if they handle engineering concerns?

2. With the exception of Wiebe, all of the systems require a great deal of hard-coded knowledge, which can dramatically limit scalability and portability. Terminal Time does make use of Cyc, and it would be worth looking at the currently available knowledge resources to see if there are tools that could be of use. More likely, this points to the need to incorporate machine learning into systems.

3. While most of the definitions of ideology and point of view get at approximately the same ideas, there is a need for greater specificity in order to translate ideology into something a machine might recognize.

4. The early systems primarily came out of the "Yale School" and are much concerned with planning and generation, so we might ask how many, if any, of those techniques might be useful in natural language understanding.

5. Van Dijk's definition of ideology rests on the concept of a social group. We should consider the possibility of exploiting the structures of the WWW and Usenet to see if they can be reconciled with his concept of a social group.

6. With the exception of Wiebe and Sack, very little has been done to evaluate the performance of these systems objectively. In fact, the knowledge about ideology that is provided to these systems may be stereotyped and reflect the ideological biases of the systems' creators. Evaluation of the system we propose will most likely require testing against human annotations. These annotations would need to be performed by annotators with reasonable intercoder reliability on this task.

We have seen several definitions of ideological point of view. It appears that van Dijk's definition is a good place to start for our system, so we will adopt his definition, concept of group, and ideological square. We also want to incorporate Sack's ideological square and actor-role analysis. Since ideological point of view is inherently subjective, we keep in mind that we are working under the overall framework of subjectivity as defined by Wiebe.

Wiebe (Wiebe et al. 2001) defines subjectivity as "aspects of language used to express opinions and evaluations," where evaluation includes emotions, evaluations, judgments, and opinions. Subjectivity also includes speculation: "anything that removes the presupposition of events occurring or states holding, such as speculation and uncertainty."

Of the systems we have surveyed, Wiebe and Sack come the closest to the task we propose. Both allow possibilities for extension or modification. Wiebe's algorithm would need to be modified to accommodate ideological, rather than psychological, point of view. The probabilistic model proposed by Wiebe and Bruce could incorporate features for ideological point of view. Incorporating machine learning techniques into Sack's system might overcome the need for hand-coded knowledge and make it less domain and genre dependent. The other systems are of interest primarily for better understanding the problem and for developing a set of features that could be used in our system. We also want to bear in mind the importance of incorporating pragmatics into our system, as Hovy has done.

Now that we have a working definition for and some understanding of ideology, let's look a little more specifically at what a system might involve to enable us to focus on tools that might be of use.

One approach would be to build the system on top of an existing search engine that would collect a set of web pages or Usenet newsgroup messages by topic. There is some risk that the collection would be too small or too large, so it might be necessary to expand or contract the query. The process of collecting the documents by topic is essentially an information retrieval problem. We will assume for simplicity that we want full documents, rather than sections of documents. We will consider Kleinberg's hubs and authorities as a possible way to narrow down a collection of documents. We would also like to explore the possibilities of exploiting the topology of the web or Usenet to find groups of documents sharing the same ideology.
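To make the hubs-and-authorities idea concrete, the following is a minimal sketch of the mutual-reinforcement iteration, in which a page's authority score is the sum of the hub scores of pages linking to it and its hub score is the sum of the authority scores of pages it links to. The function names, the fixed iteration count, and the toy link graph are illustrative assumptions, not part of any system discussed here.

```python
from math import sqrt


def normalize(scores):
    """Scale scores so that their squares sum to one."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Return (hub, authority) scores for a dict mapping page -> linked pages."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: add up the hub scores of all pages pointing here.
        new_auth = {p: 0.0 for p in pages}
        for page, targets in links.items():
            for target in targets:
                new_auth[target] += hub[page]
        # Hub update: add up the new authority scores of all pages pointed to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, [])) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical toy graph: pages a, b, c link to candidate documents x and y.
toy_links = {"a": ["x", "y"], "b": ["x", "y"], "c": ["y"]}
hub, auth = hits(toy_links)
best_authority = max(auth, key=auth.get)
```

In this toy graph the document linked to by the most hub-like pages ends up as the top authority; a retrieved collection could be narrowed by keeping only the highest-scoring authorities.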

Given an appropriate collection of documents on a certain topic, we would now like to segment them by ideological point of view. We now have a problem that could be viewed as either text classification or clustering of documents by similarity. Some issues arise immediately:

1. The problem falls into the broad category of natural language or text understanding. This leads to the question of whether or not we need to understand the text and, if so, to what extent. If we choose to aim for psychological plausibility, then we do want to understand the text. On the other hand, we may find statistical techniques that work with little or no understanding. In between these extremes, we may want some level of understanding of discourse or argument structure. We have seen from van Dijk and from Blommaert and Verschueren that the discourse structure alone is unlikely to be sufficient, but it could be used in combination with lexical clues or as feature input to a classifier. Hence, we will explore the work of computational linguists on the structure of discourse.

2. The desire for domain independence comes at the cost of not knowing the ideological points of view in advance. This means that, assuming ideological points of view are discrete classes, we must learn the classes from the set of documents if we want to treat it as a classification problem. Otherwise, we may want to find techniques to cluster the documents by similarity, where the similarity metric heavily weights features of ideological point of view.

3. Machine learning may be either supervised or unsupervised. If we elect to use supervised learning, we will need to do some level of human annotation. Human annotation is costly and requires the development of annotation instructions that are sufficiently detailed that reasonable agreement between annotators can be achieved. Recently developed techniques, such as co-training, can help limit the amount of annotated data that is required. It is sometimes possible to modify systems, such as Riloff's AutoSlog pattern extractor, to work with unannotated text. We will not discuss the process of annotation or evaluation further here, but it is an important issue to note.

4. We would need an appropriate user interface to represent the segmented collection of documents. Two possibilities immediately come to mind. The first is a fisheye viewer with a hyperbolic distance metric on the graph (nodes represent documents and edges represent the connections between documents; the shorter the distance, the greater the similarity between the documents) that is initially centered on the most ideologically balanced or objective document in the collection and allows the user to see the relationships between the documents and retrieve a document by clicking on a node. The second is a similar graph with a Euclidean metric that shows how the documents cluster in the plane. We will not discuss the details of how to implement the user interface here.
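The clustering alternative raised in issue 2 can be sketched as follows. This is only an illustration under strong assumptions: documents are bags of words, the "features of ideological point of view" are a hand-supplied table of clue-word weights (hypothetical here), similarity is cosine similarity, and clustering is a simple greedy single pass rather than any particular published algorithm.

```python
from collections import Counter
from math import sqrt


def weighted_vector(text, weights):
    """Bag-of-words vector; clue words are scaled up by their weight."""
    counts = Counter(text.lower().split())
    return {word: n * weights.get(word, 1.0) for word, n in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[word] * v[word] for word in u if word in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(docs, weights, threshold=0.5):
    """Greedy single pass: join the first cluster whose seed is similar enough."""
    clusters = []  # each entry is (seed vector, list of document indices)
    for index, doc in enumerate(docs):
        vec = weighted_vector(doc, weights)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((vec, [index]))
    return [members for _, members in clusters]


# Hypothetical documents on one topic and hypothetical clue-word weights.
docs = [
    "surgery and drugs relieve back pain",
    "drugs and surgery treat back pain",
    "herbs and acupuncture relieve back pain",
    "acupuncture and herbs treat back pain",
]
clue_weights = {"surgery": 5.0, "drugs": 5.0, "herbs": 5.0, "acupuncture": 5.0}
groups = cluster(docs, clue_weights)
```

The weighting matters because all four documents share the topic words; without it, the shared topic vocabulary makes documents from the two camps look much more alike than their clue words would suggest.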

Finally, any system developed will need to be evaluated. We will see that in general the work we discuss in the areas of statistical natural language processing and machine learning will include rigorous evaluation. In contrast, like the systems discussed in the ideology section, the work on discourse structure and web structure will generally provide minimal evaluation, if any. Some issues we anticipate will arise in the evaluation of our system:

a. Since we plan to use a search engine to retrieve topically segmented collections of documents, the performance of the system may depend on the search engine used. We will most likely need to evaluate our system on different search engines and possibly perform some evaluation of available search engines.

b. Annotation of data, if required, can be evaluated through Cohen's Kappa coefficient of annotator agreement.

d. Since classification of ideological point of view is inherently subjective, the output of the system will be difficult to judge and may be open to dispute. One option might be to test it by giving it a relatively small collection of documents to classify and giving the same collection, along with instructions, to several humans, to determine whether the machine agrees with the humans as well as they agree with each other. Another option, since the system is being developed to aid users, would be to conduct user studies of the system's usefulness.
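Cohen's Kappa, mentioned in item b, corrects the observed agreement between two annotators for the agreement expected by chance: kappa = (observed - expected) / (1 - expected). The following is a minimal sketch for two annotators assigning one label per document; the function name and the example labels are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of four documents by two annotators.
annotator_1 = ["liberal", "liberal", "conservative", "conservative"]
annotator_2 = ["liberal", "liberal", "conservative", "liberal"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four documents (observed agreement 0.75) while chance agreement is 0.5, giving a kappa of 0.5. The same function could be used to compare the system's output against each human annotator, as suggested in item d.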

We have seen in the section on ideology that there are computationally workable definitions of ideology and that some systems have been designed to generate point of view. None of the systems we examined did quite what we want to do, so while we might borrow some of their techniques, we need to explore other possibilities.

We noted the tension between developing a system that is psychologically and linguistically plausible and one that is more motivated by engineering concerns. We will see this play out again here. If we are not concerned about psychological and linguistic plausibility, then we might ask whether we need to understand discourse at all. Perhaps purely statistical techniques would suffice. We will look at some of these possibilities in the section on Statistical NLP. In the meantime let us assume that understanding discourse structure would be of use in our task.

Are we concerned with the ideology as perceived by the reader, or as presented by the author, or both?

Can we view text as primarily for information transmission, or have we oversimplified by omitting the social, emotional, persuasive, and entertainment aspects?

In an attempt to answer some of these questions, we will consider some important papers from computational linguistics and psycholinguistics on discourse analysis. We will look at how these techniques have been applied and to what degree of success, along with their applicability to our task.

Much of the work done by computational linguists has been application-oriented, so it makes sense to classify the work in terms of the types of applications that are expected to come out of it: the two main areas of application are language generation and interpretation. Here we will focus on how discourse structure theory might be used in the interpretation of text, because that seems the most suited to our task.

We will look at four important papers on the theory of discourse structure (Hobbs 1979, Grosz & Sidner 1986, Mann & Thompson 1988, Morris & Hirst 1991), and subsequent papers which discuss, extend and reconcile these theories. There has been significant debate in the computational discourse community between proponents of these theories, in particular Rhetorical Structure Theory (Mann & Thompson 1988) and the intention-based theory of Grosz & Sidner (Grosz & Sidner 1986). Much of this theoretical work was done at a time when resources (large corpora, thesauri, etc.) were not available to adequately test these theoretical frameworks. As these resources became available and implementation and testing became possible, the trend has been toward a synthesis of these theories (Moser & Moore 1996, Marcu 2000).

We will look at each of these theories in turn, considering their usefulness for text interpretation, shortcomings and areas where additional research is needed.

Hobbs (Hobbs 1979) builds on the work of Grimes, Halliday & Hasan, Longacre, and Fillmore on relations that link segments of discourse. He claims that "a relatively small number of coherence relations occur in coherent English discourse", where the degree of coherence of a text "varies inversely with the degree of 'difficulty' the inferencing operations have in recognizing some coherence relation." He believes that a theory of coherence must be able to explain the function of each coherence relation, must be able to derive Halliday & Hasan's cohesive relations, and must have relations that are computable.

Hobbs defines a finite set of coherence relations, in the framework of an inference component of a language processor, that hold between portions of a discourse. His relations operate recursively on 'sentential units', where clauses are the base. Clauses are represented as sets of propositions.

The inference component consists of four aspects: representation, operations, control, and data. His representation scheme is a sort of predicate calculus. Propositions are formed by applying a predicate to one or more entities, or other propositions. Clauses in the text are operated on successively and propositions are asserted. The system also has a large number of axioms that encode lexical and world knowledge. Axioms are assumed to have plausibility and general applicability. Operations, working in parallel, include word sense disambiguation, resolving anaphora, determining illocutionary force, and recognizing coherence. The operations attempt to construct 'chains of inference'. These chains are searched for based on salience and chain length.

In this paper, Hobbs considers three coherence relations: Elaboration, Parallel, and Contrast. A more extensive list of relations is defined elsewhere. Since the coherence relations are defined in terms of the inferences that a reader makes, using world knowledge, to recognize them, understanding the discourse structure is essentially equivalent to finding the best proof explaining the information in a segment of discourse.

Two drawbacks of Hobbs's work are the cost of encoding the axioms (world and lexical knowledge) and the computational cost of his inferencing process. There is also the question of whether a (small) finite number of coherence relations can suffice to capture discourse structure.

Mann and Thompson's Rhetorical Structure Theory (RST) (1988) has found wide use in the area of text generation, for example, in generating text summaries. They claim that approximately twenty-three rhetorical relations are necessary to account for discourse coherence. The relations link different portions of text, called "spans", which can range in size from clauses to paragraphs. Adjacent spans are related by exactly one of the possible rhetorical relations, forming new spans that are subsequently related to their neighboring spans until all spans are connected. In this way a hierarchy, or tree structure, is formed.

The relations listed are: circumstance, solutionhood, elaboration, background; enablement, motivation; evidence, justify; volitional cause, non-volitional cause, volitional result, non-volitional result, purpose; antithesis, concession; condition, otherwise; interpretation, evaluation; restatement, summary; sequence, contrast. Note that several of these relations map to van Dijk's categories of ideological analysis.

They define five schemas: circumstance, contrast, joint, motivation/enablement, and sequence. The schemas are defined in terms of relations; they specify how spans of text can co-occur and define the structural constituency arrangements of the text.

Rhetorical relations constrain the components of the span and the intended effects of the span. Component spans are either nuclei or satellites, the nucleus being the more important span, "more essential to the writer's purpose than the other" (Mann and Thompson 1988), and the satellite the less important. All relations contain an "effect" field which describes the intended effect of the text on the reader. So, as with Hobbs, the focus is on the reader, rather than on the intentions of the writer. However, a weakness of RST is that the model of the effects that each span has on the reader's mental state is imprecise.
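The tree that results from repeatedly relating adjacent spans can be represented with a small recursive data structure. The sketch below is an illustration only: the class and function names are assumptions, it handles only nucleus-satellite relations (multinuclear relations such as sequence and contrast are ignored), and the three-clause analysis at the end is a hypothetical example rather than one taken from Mann and Thompson.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    start: int                          # index of the first clause covered
    end: int                            # index of the last clause covered
    relation: Optional[str] = None      # None for a single-clause leaf
    nucleus: Optional["Span"] = None
    satellite: Optional["Span"] = None


def leaf(index):
    """A span covering exactly one clause."""
    return Span(index, index)


def relate(relation, nucleus, satellite):
    """Join two adjacent spans under one relation, forming a larger span."""
    first, second = sorted((nucleus, satellite), key=lambda s: s.start)
    if first.end + 1 != second.start:
        raise ValueError("only adjacent spans can be related")
    return Span(first.start, second.end, relation, nucleus, satellite)


def most_nuclear_clause(span):
    """Follow nuclei downward to the clause most essential to the writer."""
    while span.nucleus is not None:
        span = span.nucleus
    return span.start


# Hypothetical text: 0 "The treatment is safe." 1 "Three trials found no
# side effects." 2 "Ask your doctor about it."
claim = relate("evidence", nucleus=leaf(0), satellite=leaf(1))
tree = relate("motivation", nucleus=leaf(2), satellite=claim)
central = most_nuclear_clause(tree)
```

Following the nuclei from the root picks out clause 2 as the one the rest of the text supports, which hints at how nuclearity could help locate the claims most relevant to point of view.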

Moore and Pollack (1992) argue that RST does not take proper account of the distinction between informational and intentional relations, and that the restriction that only a single relation can hold between a pair of adjacent spans is incorrect, because discourse elements are related simultaneously on multiple levels (Grosz and Sidner 1986).

Here again we note the question of whether or not a small set of relations suffices to capture discourse structure. Another issue for RST is ambiguity: while Mann and Thompson note that there may be many RST analyses that are consistent with linguistic experience, they do not offer any methods to deal with this ambiguity.

Grosz and Sidner (1986) view discourse structure as three interrelated components: a linguistic structure, an intentional structure, and an attentional state. The linguistic structure consists of discourse segments and an embedding relationship that can hold between them. The intentional structure consists of discourse segment purposes (DSPs) and discourse purposes (DPs). The DSPs are related to each other by one of two relations: dominance and satisfaction-precedence.

A discourse segment purpose, DSP1, satisfaction-precedes DSP2, whenever DSP1 must be satisfied before DSP2.

An action that satisfies one intention, DSP1, may be intended to provide the satisfaction of another, DSP2. When this occurs DSP2 is said to dominate DSP1.

The attentional state distinguishes the most salient information from other less salient information to aid in the interpretation of subsequent discourse segments. It can be viewed as an abstraction of the discourse participants' focus of attention and is modeled by a stack of focus spaces, each holding the most salient information from a given discourse segment. The transition rules that add to or delete from the stack correspond to the dominance relation from the intentional structure.

On the surface, the discourse structure theories of Grosz and Sidner (1986) and Mann and Thompson (1988) appear quite different, and in fact there has been a decade-long debate in the computational linguistics community between their respective proponents. Recent work, however, has been toward a synthesis. Moser and Moore (1996) have found considerable common ground between the two theories, based primarily on understanding the correspondence between the notions of dominance in Grosz and Sidner and nuclearity in RST. Based on this work, Marcu (2000) has extended his normalization of RST (Marcu 1996) to incorporate the intentional structure of Grosz and Sidner (1986) to reduce the ambiguity of discourse. The attentional structure has not been incorporated.

Based on the work of Halliday and Hasan on textual cohesion, Morris and Hirst describe an algorithm for computing lexical chains using lexical cohesion. They define lexical cohesion as "cohesion that arises from the semantic relationships between words." It basically involves the selection of a lexical item that is related, in some way, to one occurring earlier in the text. Lexical cohesion is one type of cohesion considered by Halliday and Hasan; others include reference, substitution, ellipsis, and conjunction. Morris and Hirst distinguish between the independent concepts of cohesion, that the text sticks together, and coherence, that the text makes sense. Cohesion is much more easily determined than coherence. Lexical chains are sequences of related words spanning a topical unit in the text and are good indicators of linguistic segmentation. They can be used to identify central themes in a document, which can be helpful in identifying key phrases for document summarization.

The algorithm proposed involves finding candidate words (removing words on a stop list, such as pronouns and high-frequency words), then, for each candidate word, finding an appropriate chain within a suitable span, based on its relatedness to members of existing chains (or creating a new chain). Morris and Hirst used Roget's Thesaurus and distance criteria to determine relatedness. More recent versions have tended to use WordNet instead of Roget's (Anderson 2000, Barzilay and Elhadad 1997), which reduces the problem of having different senses of the same word appear in a chain. Finally, if an appropriate chain is found, the word is inserted, the chain is updated, and the algorithm moves on to the next candidate word.
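The steps just described can be sketched in a few lines. This is a simplified illustration, not Morris and Hirst's algorithm as published: the stop list is tiny, relatedness is approximated by a hypothetical toy table mapping words to a shared category (standing in for Roget's Thesaurus or WordNet), and the "suitable span" is assumed to be a maximum gap measured in sentences.

```python
STOP_WORDS = {"the", "a", "an", "of", "and", "it", "is", "was", "in", "to"}

# Hypothetical stand-in for a thesaurus: word -> category label.
TOY_THESAURUS = {
    "car": "vehicle", "truck": "vehicle", "driver": "vehicle", "road": "vehicle",
    "doctor": "medicine", "nurse": "medicine", "hospital": "medicine",
}


def related(word_a, word_b, thesaurus):
    """Two words are related if identical or filed under the same category."""
    if word_a == word_b:
        return True
    category = thesaurus.get(word_a)
    return category is not None and category == thesaurus.get(word_b)


def build_chains(sentences, thesaurus, max_gap=3):
    """Return chains as lists of (sentence index, word) pairs."""
    chains = []
    for index, sentence in enumerate(sentences):
        for token in sentence.lower().split():
            word = token.strip(".,;:!?\"'")
            if not word or word in STOP_WORDS:
                continue  # not a candidate word
            for chain in chains:
                recent = index - chain[-1][0] <= max_gap
                if recent and any(related(word, w, thesaurus) for _, w in chain):
                    chain.append((index, word))
                    break
            else:
                chains.append([(index, word)])  # start a new chain
    return chains


sentences = [
    "The driver parked the car.",
    "A nurse called the doctor.",
    "The truck blocked the road.",
    "The hospital was busy.",
]
chains = build_chains(sentences, TOY_THESAURUS)
# Keep only chains with more than one member as candidate themes.
themes = [[word for _, word in chain] for chain in chains if len(chain) > 1]
```

On this toy text two multi-word chains emerge, one about vehicles and one about medicine; in a real system the quality of the chains would depend heavily on the relatedness resource and on word sense disambiguation, which this sketch ignores.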

The strength of a lexical chain can be determined by considering the distribution of the elements in the chain within the text. Three factors to consider are reiteration, density and length of the chain. Since the lexical chain encapsulates context, chain strength corresponds to the significance of the textual context it embodies.

Morris and Hirst compared the lexical chain structure of a text with the intentional structure computed using Grosz and Sidner's structural analysis method and found that lexical chains were a good clue for determining intentional structure. This is useful because Grosz and Sidner did not provide a method for computing the intentions or linguistic segments in their proposed structure.

Lexical chaining is relatively domain independent and computationally feasible. While Morris and Hirst did not implement their lexical chaining algorithm because there was no machine readable version of Roget's Thesaurus at the time, it has been used as a basis for computational linguistics applications, including text categorization (Anderson 2000) and summarization (Barzilay and Elhadad 1997).

These four papers in computational linguistics provide a survey of some of the important theories of discourse structure. These theories were developed, at least to some extent, with applications in mind, rather than in the interest of modeling brain function. Below we discuss an example of a psycholinguistic theory of discourse processing that comes out of the interest in modeling cognition and which has been, at least partially, implemented.

The first three computational linguistics theories all rely to some extent on intentional structure and take into account the intended effect on the reader. This still leaves open to what extent the intent of the writer could or should be factored in.

These theories were developed at a time when many fewer machine-readable resources (e.g. large corpora, thesauri, WordNet) were available, and when computational power and memory capacity were limited. Because of this, implementation and evaluation tended to be limited and manual. More recently, lexical chaining and RST have been implemented, problems discovered, and improvements made.

The primary question of interest to us is which of these might be useful in determining ideological point of view for our system. It seems clear that finding the rhetorical structure of a text or the important topics in the text would not on their own determine ideological point of view. Thus, we view discourse processing as a component of a larger system that would incorporate additional knowledge, such as lexical semantics and subjectivity analysis.

The first question is whether we need it at all. Perhaps there is a statistical approach, such as Latent Semantic Analysis, that will segment by ideological point of view, and there will be no need to understand the discourse structure at all. On the other hand, having noted some level of correspondence between van Dijk's ideological categories and relations in RST, it seems reasonable that discourse structure would be a helpful component of a system that determines ideological point of view. So let's consider which of these theories we might use.

Clearly, we will not be able to settle the debate as to whether a finite number of discourse relations (Hobbs, Mann & Thompson) can be used to determine discourse structure and, if so, what exactly they are. One might imagine that one could do better at finding relations that define the discourse structure by limiting the domain or genre of the discourse considered, but this is not what we want to do. On the other hand, since our purpose is to find ideological point of view using discourse structure as one component of a larger system, we may not need to fully represent the discourse structure and may, in fact, need to consider only certain relations.

Besides the issue of a small finite number of coherence relations, the amount of world knowledge and preprocessing necessary for Hobbs's system seems prohibitive, as does the computational complexity of his inference engine. His system would likely do more than we need at a cost we cannot afford.

Since Grosz and Sidner's work has several parallels in RST (Moser and Moore), its intentional structure has been incorporated into recent versions of RST (Marcu 2000), and RST has been more widely implemented, it makes sense to consider RST over Grosz and Sidner's theory. Given our perspective of a discourse structure component in a larger system, the theoretical shortcomings of RST (Moore and Pollack) are unlikely to be an issue for us. The main shortcomings of RST from our perspective are the need to consider aligning the rhetorical relations with more suitable relations for determining ideological point of view, such as van Dijk's ideological categories, and the cost of developing an annotated corpus of text, parsed for rhetorical structure. Note that a parsed corpus would need to be sufficiently large to allow the cross-domain portability we desire.

Given the cost of implementing an RST component to our system, it seems like it would be reasonable to try other methods first. One of the methods that should be considered is lexical chaining. It is possible that lexical chains, by identifying central themes and important phrases in a document, might provide significant information about ideological point of view. Given lexical chains, our system might look for potential subjective elements, ideological clue words, actors and roles, and other clues to ideological perspective in the chains.

So we conclude that for our purposes in choosing between discourse structure theories, we should first explore the possibilities of using lexical chains which can be easily implemented. If it turns out that more information about the discourse structure is necessary for our task, we would next look at existing RST implementations and if they prove insufficient, we would consider modifications to RST to improve alignment with ideological components and domain independence.

We will discuss alternatives to lexical chaining, such as TextTiling (Hearst 1994), when we discuss statistical NLP. We also note two discourse structure issues particular to the WWW and Usenet:

1. We have not explored how the structure of HTML documents on the web might be exploited for our task.

2. It is the author's personal observation from annotation studies undertaken on a newsgroup corpus, that levels of cohesion and coherence tend to be lower in newsgroup postings than is common in written English text. This observation, if it turns out to hold true in general, may necessitate modification of methodology.

Finally, we will consider Kintsch's (1994) Construction-Integration Model as an example of a general model of discourse comprehension, because his work is based in part on work by van Dijk, because it provides some insights into the area of discourse processing that aims to model cognition, and because it has been at least partially implemented. There are other competing models, but, unfortunately, to date it does not appear that there has been a systematic comparison of the models. There seems to be a reasonable amount of empirical evidence supporting this model, but some aspects, in particular the bottom-up processing during the construction phase, are controversial (Whitney 1998). It incorporates the idea, first proposed by van Dijk and Kintsch (1983), that we form multiple memories for discourse.

Kintsch's (1994) explanation of the construction-integration model covers the sequence of cognitive states, the mental representation of texts, the processing cycles, knowledge elaboration, macroprocesses, and inferences. We will consider each one in turn:

Kintsch characterizes cognition as a sequence of cognitive states, which can be considered as the contents of short-term memory, the focus of attention, or the state of consciousness. A cognitive state (focus of attention) can be viewed as a word, sentence, phrase, or paragraph, depending on the type of analysis to be done. (Here we will assume sentences are states.) The state is the result of input and analyses. Input can be from the outside world or long-term memory, including lexical, perceptual, and general world knowledge and beliefs. The contents of short-term memory can be viewed as retrieval cues which activate the items they are linked to in long-term memory and retrieve them. The input is analyzed with the use of temporary buffers. For each state there is a processing cycle consisting of receiving the input and analyzing it. In order that there be some coherence between cycles, it is assumed that some elements generated in a given cycle are carried over to the next cycle in a buffer. Kintsch hypothesizes that these elements are the ones most strongly activated.

After processing, the mental representation of a text will contain at a minimum three levels: surface (linguistic or word), propositional (conceptual or semantic), and situation model. Additional levels may come into play, for example, in poetry or mathematics. The propositional level and situation model are generally most important, both in the laboratory and in real life.

The propositional level is modeled by constructing (usually by hand coding) a hierarchical structure of propositions and arguments based on lexical information from the text. A proposition is a relational term or predicate together with one or more arguments. Arguments may be concepts or other propositions, classified in terms of their semantic case roles.

Once these representational units are constructed, their interrelations may be viewed as a graph, where the elements are the nodes and the relations between the elements are the edges. The graph can be translated into matrix form, where the rows and columns are indexed by the elements and nonzero entries in the matrix correspond to relations between the elements. Generally, nodes are considered to be related if they have a common element; however, depending on the level of analysis necessary, additional relations may be desirable.
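
The translation from propositions to a matrix can be illustrated with a minimal sketch. It assumes that propositions have been hand coded as a predicate plus a tuple of concept arguments; the example propositions and the names `shares_element` and `connectivity_matrix` are hypothetical, and arguments that are themselves propositions are not handled.

```python
# Hypothetical hand-coded propositions: (predicate, arguments).
propositions = [
    ("GIVE", ("MARY", "BOOK", "JOHN")),
    ("OLD", ("BOOK",)),
    ("READ", ("JOHN", "BOOK")),
    ("RAIN", ()),
]


def shares_element(p, q):
    """Two propositions are related if they have an argument in common."""
    return bool(set(p[1]) & set(q[1]))


def connectivity_matrix(props):
    """Rows and columns are indexed by propositions; 1 marks a relation."""
    n = len(props)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and shares_element(props[i], props[j]):
                matrix[i][j] = 1
    return matrix


matrix = connectivity_matrix(propositions)
for row in matrix:
    print(row)
```

In this toy example the first three propositions are mutually linked through shared arguments, while the last one has no arguments in common with the others and so has an empty row and column.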

The first step in simulation of text comprehension is to construct a graph or network, as described above, based on the text. The construction process may be rough (imprecise), for example, all meanings of ambiguous words may be included and parsing ambiguities may be considered in parallel. The network may contain irrelevant or contradictory elements.

The next step is the integration process, which can be viewed as spreading activation in the network until it reaches a stable state. Here it is necessary to use the matrix representation of the network. A vector, called the activity vector, is initialized with equal activation values for all elements. The activity vector is multiplied by the matrix and renormalized repeatedly, until the activity values stabilize. This has the effect of strengthening strongly interconnected parts of the network, while isolated parts become deactivated. The result is a coherent mental representation of the text.
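
A minimal sketch of the integration step follows. Several details are assumptions made for illustration rather than part of the description above: each node is given a self-connection of 1 so that the iteration settles rather than oscillates, renormalization divides by the largest activation, and the names `integrate`, `tolerance`, and `max_cycles` are hypothetical.

```python
def integrate(weights, tolerance=1e-6, max_cycles=1000):
    """Spread activation through the network until the values stabilize."""
    n = len(weights)
    activity = [1.0] * n  # equal initial activation for all elements
    for _ in range(max_cycles):
        # Multiply the activity vector by the connectivity matrix.
        spread = [
            sum(weights[i][j] * activity[j] for j in range(n))
            for i in range(n)
        ]
        peak = max(spread)
        if peak <= 0:
            return spread
        # Renormalize so that the most active node has activation 1.
        updated = [value / peak for value in spread]
        change = max(abs(u - a) for u, a in zip(updated, activity))
        activity = updated
        if change < tolerance:
            break
    return activity


# Toy network: nodes 0-2 are fully interconnected, node 3 is isolated.
weights = [
    [1.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
activity = integrate(weights)
print([round(a, 4) for a in activity])
```

In this toy network the three interconnected nodes keep full activation while the activation of the isolated node decays toward zero, which is the deactivation effect described above.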

This procedure is in sharp contrast to schema theories which provide a control structure to ensure context sensitivity in the construction phase, thereby eliminating the need for the integration phase. This is at the cost of a much more complex construction process.

Text is processed sequentially: a sentence is read, a network is constructed and integrated, then when another sentence is read parts of the old network participate in the new integration process. In order to maintain coherence in the network, it is assumed that the strongest propositions from the previous sentence are maintained in the focus of attention when processing the current sentence.

The fact that the mental representation of a text contains both information derived from the text and knowledge elaborations from long-term memory raises the question: Is a reader reminded of only relevant, contextually appropriate things, or are irrelevant, contextually inappropriate things also activated, and if the latter is the case, why are we not conscious of them? One possible explanation is that schemata or other structures filter out the irrelevant or inappropriate knowledge. The explanation offered by the construction-integration model is that many irrelevant and contradictory pieces of knowledge will be retrieved in the construction process, but they will be quickly deactivated in the integration process. The model views knowledge activation as an uncontrolled, bottom-up process, determined only by the strength of associations between the items in the text and the items in long-term memory. Although the reliance of this model on bottom-up processing is somewhat controversial, empirical lexical decision studies of priming indicate that irrelevant items are activated, but quickly inhibited. This tends to support the construction-integration model view.

In addition to the global model of the situation described by the text, there is a global model of the propositional structure of the text, called the macrostructure. The macrostructure is a hierarchy of propositions which reflects the rhetorical structure of the text. The macrostructure may or may not be correlated with the situation model.

The macrostructure is constructed strategically in response to cues indicating relative importance of various portions of the text. Three types of operators that reduce information are deletion, generalization and construction. These operators can be regarded as inference processes.

Inferences may be divided into two categories: those that reduce information (macro-operators as above), and those that add information to the text. Those that add information to the text are classified along two dimensions: where the information comes from (long-term memory or newly created), and whether the inference process is automatic or controlled.

Knowledge retrieval during comprehension can be either automatic or controlled. If it is automatic, it is a locally determined, associative process where much of what is retrieved may be irrelevant or contradictory and will be removed in the integration process. The process retrieves items in long-term memory that are strongly linked to the text. The controlled process occurs when there is a comprehension problem or in response to special tasks or goals.

Controlled knowledge generation can occur during comprehension or later as the text is studied again or reviewed in memory. Like controlled knowledge retrieval it occurs when there is a comprehension problem or in response to special tasks or goals. This type is what is commonly considered inferencing.

The Construction-Integration model has been implemented by Kintsch and others (Goldman and Varma 1995; Kintsch 1988, 1992; Kintsch et al. 1990; Langston, Trabasso, and Magliano 1999; Mross and Roberts 1992; Tapiero and Denhiere 1995). In a standard implementation, text is represented by a series of interconnected nodes, each corresponding to a concept, sentence, phrase, or word in the text. The model's long-term memory contains all the nodes and their connections that have been processed so far. On input of a new node, it is processed in working memory, which contains the n most activated nodes of those that have been processed. Processing consists of spreading activation among the nodes in working memory until the process stabilizes or settles. At this point, the activation values for the nodes in long-term memory are adjusted based on the settled activation values and the model determines which nodes should be kept in working memory by choosing the n most highly activated nodes. The model is then ready to receive new input and the process is repeated until all nodes have been processed.

Starting with the standard implementation, Goldman and Varma (1995) examined the impact of relaxing the working memory buffer size, finding that the simplifying assumption of a fixed buffer size was psychologically implausible (Fletcher 1999). Langston, Trabasso, and Magliano (1999) assumed a relaxed spreading activation mechanism so that nodes in long-term memory that are connected to nodes in working memory are allowed to participate in the processing. Their model, coupled with discourse analysis of text relations, was able to account for on-line comprehension as measured by reading time or fit judgments.

While these implementations have been used successfully to test theories of retrieval from long-term memory, and various theories of mental representation and organization within a construction-integration framework, they do not fully implement Kintsch's model. Notably lacking is an implementation of the situation model. In addition, the input data structures require preprocessing, which may involve the use of different discourse analyses, based on different theories of discourse processing. From the computational linguist's point of view, these implementations do not solve any real-world problems. One might also ask to what extent syntactic processing can, or should, be integrated into this approach.

While it seems unlikely that this is a model we would want to use for our system, it does raise some interesting issues:

1. Kintsch provides us with a detailed model of how discourse is processed. Van Dijk gives us a general model of how ideologies are represented in memory and how one gets from the knowledge representation to discourse and back. While determining the extent to which these can be reconciled, given the current state of knowledge about how the brain functions, may be a theoretical question for cognitive scientists, it is clear that if we need our system to understand discourse we will have to go beyond van Dijk in level of detail. This process will result in needing to make some assumptions about plausible mental models.

Our system will be built on top of a search engine to collect documents by topic. This is a classic problem in information retrieval, where given a query, in this case the topic, the system returns a set of documents that satisfy the query, in this case being about the same topic. Thus it is in our interest to understand something about the issues involved in Internet search and the range of search engines available.

When the web is viewed as a directed graph, where the nodes are web sites and the edges are hyperlinks between the sites, we hypothesize that the topology of the web is related to groups of people or organizations that shed light on ideological point of view. Similarly, we can define a structure on Usenet, which is already divided into newsgroups at a high level and individuals at a low level; either could be considered nodes, while messages posted might be considered directed edges. We would like to better understand the topologies of these graphs and see how they can be used in our study of ideology.

Long-term issues for our system might include retrieval of documents in multiple languages. This presupposes that the problem of machine translation is solved or that our system is sufficiently flexible to handle bad translations. In the context of a global society, we may experience a shrinkage of "common ground" knowledge and an expansion of what is considered ideology. Whether the language in which a message or web site was originally written would be a reasonable feature for our system is a subject for future research.

For now we will content ourselves with looking at some basic issues, explorations, and work on the Internet.

The problem Kleinberg considers is searching the web, or the discovery of pages relevant to a given query. A user query may be specific, broad based, or a request for pages similar to a given page. Different types of queries have different associated problems. Specific queries may give rise to the 'scarcity problem': there may be very few pages that contain the desired information and it may be difficult to find them. On the other hand, broad based queries give rise to the 'abundance problem': the set of pages reasonably retrieved as relevant may be unmanageably large. Thus, there is a need to limit the pages retrieved to the most authoritative ones. The problem that is the focus of this work is how to determine if a page is authoritative.

Kleinberg notes that evaluation of a system that finds authoritative pages will be an issue due to the inherent subjectivity in notions such as relevance.

One possible way to limit the number of texts retrieved by a broad based query would be a text-based ranking scheme. For example, rank pages by the number of occurrences of the query string in the page, or the prominence of the query string in the page. This scheme is likely to fail due to the 'self reference problem': many natural authorities do not use terms that would categorize them on their web pages, e.g. there is no reason to expect that Honda or Toyota use the term "automobile manufacturer" on their web pages.

Another way to approach the problem would be to try to exploit the link structure. This is based on the assumption that the creation of a link from a page p to a page q confers some amount of authority on q. Thus, it would seem that we could find authoritative pages by counting links into them. Some pitfalls of this approach are that some links are purely navigational and should not be counted, that it is unclear what to do about paid advertisements, and that relevance must be balanced against popularity. Consider the simple heuristic: of all pages that contain the query string, return the ones with the most in-links. Note we still have the self-reference problem and the additional difficulty that very popular sites, like yahoo.com, will be considered highly authoritative with respect to any query string they contain.
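
The simple in-link heuristic can be sketched as follows. The toy pages, their text, and the function name `rank_by_in_links` are hypothetical and serve only to make the two pitfalls visible.

```python
# Hypothetical toy web: page -> (text, pages it links to).
pages = {
    "portal.example": ("news sports cars weather", []),
    "carfan.example": ("reviews of cars", ["maker.example", "portal.example"]),
    "maker.example": ("our new models", ["portal.example"]),
    "blog.example": ("cooking", ["portal.example", "carfan.example"]),
}


def rank_by_in_links(pages, query):
    """Of all pages containing the query string, rank by number of in-links."""
    matching = {
        url for url, (text, _) in pages.items() if query.lower() in text.lower()
    }
    in_links = {url: 0 for url in matching}
    for source, (_, targets) in pages.items():
        for target in set(targets):
            if target in in_links and target != source:
                in_links[target] += 1
    # Most in-links first; ties broken alphabetically.
    return sorted(in_links.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_by_in_links(pages, "cars")
print(ranking)
```

In this toy data the general-purpose portal outranks the topical page simply because it is popular, and the manufacturer's page is never considered because it does not contain the query string.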

Kleinberg proposes a different link-based model. He defines a class of pages called 'hubs' which link to many related authorities. It turns out that there is a "certain natural type of equilibrium between hubs and authorities in the graph defined by link structure." Kleinberg's approach is global in that it seeks to identify the most central pages for broad topic searches in the context of the WWW as a whole. His approach is fundamentally different from clustering, which groups similar pages within a broad topic, but does not find the authoritative pages or reduce the number of pages retrieved.

Kleinberg's algorithm to identify hubs and authorities simultaneously operates on a 'focused subgraph'. The focused subgraph is intended to be a small collection of pages that is most likely to contain the desired authorities. It is constructed by taking the top t (usually 200) pages resulting from querying AltaVista; then all pages pointing into this set and all pages pointed at by members of this set are added (with some restriction on the number of pages each page can add). The focused subgraph generally contains 1000 to 5000 pages. Further preprocessing involves the removal of intrinsic links, that is, links between pages in the same domain.
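
The construction can be sketched roughly as below. This is an illustration under stated assumptions, not Kleinberg's implementation: the root set and the link tables `out_links` and `in_links` are supplied by the caller (standing in for a search engine and a link index), the cap `max_in_per_page` is a hypothetical parameter, the host name is used as a stand-in for the domain, and only links incident on root pages are recorded.

```python
from urllib.parse import urlparse


def focused_subgraph(root_set, out_links, in_links, max_in_per_page=50):
    """Expand a root set by one link in each direction, then drop intrinsic links."""
    pages = set(root_set)
    edges = set()
    for page in root_set:
        # Add every page that a root page points at.
        for target in out_links.get(page, []):
            pages.add(target)
            edges.add((page, target))
        # Add a capped number of pages that point into the root page.
        for source in sorted(in_links.get(page, []))[:max_in_per_page]:
            pages.add(source)
            edges.add((source, page))
    # Intrinsic links join pages on the same host; keep only the others.
    transverse = {
        (s, t) for s, t in edges if urlparse(s).netloc != urlparse(t).netloc
    }
    return pages, transverse


root = ["http://a.example/1"]
out_links = {"http://a.example/1": ["http://a.example/2", "http://b.example/x"]}
in_links = {"http://a.example/1": ["http://c.example/y", "http://d.example/z"]}
pages, links = focused_subgraph(root, out_links, in_links, max_in_per_page=1)
print(sorted(pages))
print(sorted(links))
```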

Intuitively, the algorithm iteratively computes numerical hub and authority weights for each page, increasing a page's authority weight if many hubs point to it and increasing its hub weight if it points to many authorities. The weights are iteratively refined until an equilibrium is reached. In practice convergence occurs quite rapidly, with 20 iterations generally being sufficient. Kleinberg shows that the weight sequences always converge.
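
The iteration can be sketched in a few lines. The toy link list and the function name `hits` are hypothetical; normalizing so that the squared weights sum to one and stopping after a fixed 20 iterations follow the description above, but the sketch is illustrative rather than a reproduction of Kleinberg's code.

```python
import math


def hits(links, iterations=20):
    """Return (hub, authority) weights for a list of (source, target) links."""
    nodes = sorted({page for link in links for page in link})
    hub = {page: 1.0 for page in nodes}
    authority = {page: 1.0 for page in nodes}
    for _ in range(iterations):
        # A page's authority weight is the sum of hub weights pointing to it.
        authority = {page: 0.0 for page in nodes}
        for source, target in links:
            authority[target] += hub[source]
        # A page's hub weight is the sum of authority weights it points to.
        hub = {page: 0.0 for page in nodes}
        for source, target in links:
            hub[source] += authority[target]
        # Normalize so that the squared weights sum to one.
        for weights in (authority, hub):
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            for page in weights:
                weights[page] /= norm
    return hub, authority


# Toy graph: three hub pages pointing at two candidate authorities.
links = [
    ("h1", "a1"), ("h1", "a2"),
    ("h2", "a1"), ("h2", "a2"),
    ("h3", "a2"),
]
hub, authority = hits(links)
print(max(authority, key=authority.get))
print(max(hub, key=hub.get))
```

Here the page pointed to by all three hubs receives the largest authority weight, and the hubs that point to both authorities receive larger hub weights than the hub that points to only one.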

Kleinberg's algorithm has the ability to disambiguate queries and cluster authorities by sense. This is done by considering eigenvectors, other than the principal eigenvector, derived from the adjacency matrix of the focused subgraph with intrinsic links removed. For example, when the query is "jaguar", the authorities for the principal eigenvector concern the Atari Jaguar product; those for the second non-principal eigenvector concern the NFL football team, the Jacksonville Jaguars; and those for the third non-principal eigenvector concern the car.

Kleinberg also shows how similar page queries can be addressed by modifying the query to request t pages pointing to the page P, where pages similar to P are desired.

While the examples provided in the paper are convincing, the limited principled evaluations attempted have not provided definite conclusions.

In a second paper, Kleinberg et al. (1999) study the Web graph using two algorithms: Kleinberg's HITS algorithm (described in the paper discussed above) and the enumeration of certain bipartite cliques. The second algorithm is designed to trawl the web for cyber-communities. They determine that traditional random graph models do a poor job of explaining the web graph and propose a class of plausible random graph models that might better fit their observations of the local structure of the web. They found Zipfian distributions in the in-degree plot, which could not arise in traditional random graph models. This work raises a number of questions, including how the communities found can be organized and annotated and how the connectivity measures found can be applied and extended.

Terveen et al describe some innovations designed to aid users who wish to obtain and evaluate entire collections of topically related web sites. They define a 'site' as a "structured collection of pages, a multimedia document - as the basic unit of analysis." We note that this is somewhat analogous to Kleinberg's elimination of intrinsic links.

To find topically related web sites they use a previously developed system called PHOAKS to search newsgroup messages for mentions of web sites. PHOAKS applies rules to identify which mentions are recommendations and ranks them within a topic by number of recommendations from different individuals. They state that previous work has shown that this method has high accuracy in recognizing recommendations and that there is a correlation between their highly ranked pages and other metrics of web page quality.

They define 'clan graphs' grouping sets of related sites. An N-clan graph is defined as a graph where "(1) every node is connected to every other node by a path of length N or less and (2) all of the connecting paths go through nodes in the clan." They have found that 2-clan graphs capture the notions of collection and locality needed to determine topically related subgraphs within a larger graph.
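
The two conditions of the definition can be checked directly, as in the following minimal sketch. It assumes that links are treated as undirected and that the candidate clan is given as a set of node names; the function name `is_n_clan` and the toy graph are hypothetical.

```python
from collections import deque


def is_n_clan(edges, clan, n):
    """Check that every pair of clan nodes is joined by a path of length
    n or less that passes only through clan nodes."""
    clan = set(clan)
    neighbours = {node: set() for node in clan}
    for a, b in edges:
        # Connecting paths may only use nodes inside the clan.
        if a in clan and b in clan:
            neighbours[a].add(b)
            neighbours[b].add(a)
    for start in clan:
        # Breadth-first search gives shortest path lengths within the clan.
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in distance:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
        if any(distance.get(other, n + 1) > n for other in clan):
            return False
    return True


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "x"), ("x", "d")]
print(is_n_clan(edges, {"a", "b", "c"}, 2))
print(is_n_clan(edges, {"a", "b", "c", "d"}, 2))
print(is_n_clan(edges, {"a", "b", "c", "d", "x"}, 2))
```

The second call fails because the only path of length two between a and d passes through x, which lies outside the candidate clan; adding x to the clan satisfies both conditions.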

The construction algorithm tends to filter out irrelevant sites and discover additional relevant items. It places seed pages, chosen by the user, in a queue and based on a scoring metric decides which pages to construct a site around, which sites to expand, and which sites to add to the graph. The score of a page is the number of seeds that are linked to the page by a path of length two or less.
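
The scoring metric alone can be sketched as follows; the queueing and expansion decisions of the full construction algorithm are not shown. The sketch assumes undirected links, and the function name `seed_score` and the toy graph are hypothetical.

```python
def seed_score(edges, seeds, page):
    """Count the seeds linked to the page by a path of length two or less."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    # Pages one link away, then pages two links away.
    within_two = set(neighbours.get(page, ()))
    for node in list(within_two):
        within_two |= neighbours.get(node, set())
    within_two.discard(page)
    return len(within_two & set(seeds))


edges = [
    ("s1", "p"),
    ("s2", "q"), ("q", "p"),
    ("s3", "r"), ("r", "t"), ("t", "p"),
]
score = seed_score(edges, {"s1", "s2", "s3"}, "p")
print(score)
```

In this toy graph two of the three seeds lie within two links of the page p, while the third is three links away and does not contribute to the score.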

They augment the representation of a site with a 'site profile' to help the user evaluate the quality and function of each site. It includes: the title of the site's root page, a thumbnail image, links to and from other sites, media content information, information about internal pages, and a count of occurrences of domain-specific indexing phrases.

They have invented a new graph visualization they call 'auditorium visualization' that reveals important structural and content properties of sites within the clan graph. It includes linked views, thumbnail representations, and progressive revelation of greater detail. Development involved iterated cycles of design and usability testing.

They report that they have begun experimental comparisons of their algorithm with other link analysis algorithms. Although they do not provide details, they claim, based on preliminary results, that authority/hub computations do not tell us much more than simple counts of in and out degree.

Based on the three papers considered above, insufficient evaluation has been done on the models and systems to make valid comparisons. It seems clear that, when considering the WWW, link structure is important. This is particularly true for us, since we are interested in finding groups or communities that share an ideological perspective, which may be represented by some form of subgraph. While other ways to define and compute such subgraphs should be explored, the work considered above does have applicability.

It is conceivable that if Kleinberg's algorithm were used on a topic query, the various eigenvectors might produce a type of clustering that could be manipulated to reflect ideology. If this could be accomplished, then these authorities could be used as seeds to construct a clan graph, possibly producing a group with shared ideology.

Also some important problems with processing web pages are brought to light, such as the self-reference and abundance problems, and the issue of whether a site or a page is to be considered the minimal unit.

Whittaker et al explore the demographics, conversational strategies, and interactivity in Usenet. They investigate modeling mass interaction in Usenet with the common ground model and find that it would need to be modified to "incorporate notions of weak ties and communication overload." Some of their findings include: highly frequent "cross-posting" to external newsgroups, that a small minority of participants post a large proportion of the messages, that cross-posting and short messages promote interactivity, and moderate conversational threading.

What seems to be missing here is investigation of possible application of graphs to the structure of Usenet, analysis of interactions between Usenet and WWW, and further investigation into cross-posting and how it relates different newsgroups.

In considering segmentation by ideological point of view on Usenet, we might ask whether it would be more appropriate to consider individual newsgroups as collections of topically related documents or if groups of newsgroups related by frequent cross-postings should be considered instead. Another option would be to use a search engine on newsgroup archives with a topic query.

Jon M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. In Journal of the

There seem to be two obvious ways to frame our problem of classifying a set of documents on the same topic by ideological point of view, so we choose to focus on these two plausible techniques, rather than survey all possible techniques.

1. As a clustering problem, where we want to cluster the documents by some appropriate measure of semantic similarity. This view fits our problem well because we do not necessarily have predefined categories or points of view. We will consider Latent Semantic Analysis as a possible method to implement this view.

2. As a text classification problem. This problem can be defined as classifying a set of documents into a fixed number of predefined classes. Since we do not necessarily have predefined classes, this requirement is a difficulty; Wiebe and Bruce (1995) propose a method to get around it in the context of classifying psychological point of view using probabilistic classifiers. The use of a probabilistic classifier is attractive because, after our discussions of ideology and discourse, we may start to suspect that we will need to combine a number of feature variables to solve our problem.

Since implementation of a system using LSA should be more straightforward, we propose that it be tried first. Should it not perform sufficiently well, it could be used to compare with the probabilistic classifier.

LSA is a fully automatic corpus-based statistical method for extracting and inferring relations of expected contextual usage of words in discourse (Landauer, Foltz and Laham 1998). In LSA the text is represented as a matrix, where there is a row for each unique word in the text and each column represents a text passage or other context. The entries in this matrix are the frequency of the word in the context. There is a preliminary information-theoretic weighting of the entries, followed by singular value decomposition (SVD) of the matrix. The result is a 100-150 dimensional "semantic space", where the original words and passages are represented as vectors. The meaning of a passage is the average of the vectors of the words in the passage (Landauer, Laham, Rehder, and Schreiner 1997).

For a more detailed view, once the word-by-context matrix is constructed, the word frequency in each cell is converted to its log and divided by the entropy of its row (-sum(p log p)). The effect of this is to "weight each word-type occurrence directly by an estimate of its importance in the passage and inversely by the degree to which knowing that a word occurs provides information about which passage it appeared in." (Landauer et al 1998) Then SVD is applied: the matrix is decomposed into the product of three other matrices, two of derived orthogonal factor values for the rows and columns respectively, and a diagonal scaling matrix. The dimensionality of the solution is reduced by deleting entries from the diagonal matrix; generally the smallest entries are removed first. This dimension reduction has the effect that words that appear in similar contexts are represented by similar feature vectors. Then a measure of similarity (usually the cosine between vectors) is computed in the latent, or reduced dimensional, space.
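
The pipeline can be sketched with NumPy as below. Several choices are assumptions made so that the sketch runs on a toy corpus: log(1 + frequency) is used so that empty cells are defined, the divisor is 1 + entropy rather than the raw entropy so that words confined to a single passage do not cause division by zero, passage vectors are read directly from the decomposition rather than averaged from word vectors, and only two dimensions are retained. The corpus and the function names are hypothetical.

```python
import numpy as np


def lsa_passage_vectors(passages, k):
    """Return one k-dimensional vector per passage."""
    vocab = sorted({word for passage in passages for word in passage.split()})
    index = {word: i for i, word in enumerate(vocab)}
    # Word-by-context frequency matrix.
    counts = np.zeros((len(vocab), len(passages)))
    for j, passage in enumerate(passages):
        for word in passage.split():
            counts[index[word], j] += 1
    # Row entropy, -sum(p log p), treating 0 log 0 as 0.
    p = counts / counts.sum(axis=1, keepdims=True)
    log_p = np.log(p, out=np.zeros_like(p), where=p > 0)
    entropy = -(p * log_p).sum(axis=1, keepdims=True)
    # Log frequency divided by (smoothed) entropy; smoothing is an assumption.
    weighted = np.log(counts + 1.0) / (1.0 + entropy)
    # SVD, then keep only the k largest singular values.
    u, s, vt = np.linalg.svd(weighted, full_matrices=False)
    return vt[:k].T * s[:k]


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


passages = [
    "doctor medicine back pain",
    "surgery medicine back pain",
    "herbal remedy massage acupuncture yoga",
    "herbal remedy massage acupuncture tea",
]
vectors = lsa_passage_vectors(passages, k=2)
same_topic = cosine(vectors[0], vectors[1])
different_topic = cosine(vectors[0], vectors[2])
print(round(same_topic, 3), round(different_topic, 3))
```

In this deliberately separable toy corpus, the two passages that share vocabulary end up close together in the reduced space even though each contains a word the other lacks, while passages with no shared vocabulary remain unrelated.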

LSA can be viewed as a tool to characterize the semantic contents of words and documents, but in addition it can be viewed as a model of semantic knowledge representation and semantic word learning (Foltz 1998). While LSA has been able to simulate human abilities and comprehension in a variety of experiments, there is still some controversy over its validity as a model. The main objection seems to center around the fact that it ignores word order and syntax. Objections raised by Perfetti (1998) have been refuted by Landauer (1999).

LSA does not claim to be a complete model of discourse processing. Landauer (1999) points out that the more general class of models to which LSA belongs, associative learning and spectral decomposition, is well understood in terms of formal modeling properties and as existing phenomena at both psychological and physiological levels. Perhaps this is the beginning of an explanation of why LSA seems to do so well at simulating human abilities, with so little.

LSA has been used for a number of natural language processing tasks including information retrieval (for which it was originally developed), summarization (Ando 2000), text segmentation (Choi et al 2001), and measuring text coherence (Foltz 1998).

Ando (2000) proposed an iterative scaling algorithm to replace SVD and showed significant increase in precision on a text classification task. Her algorithm iteratively scales vectors and computes eigenvectors to create basis vectors for a reduced space. She uses a log-likelihood model to choose the number of dimensions, which improves over LSA where no empirical method is proposed to select the number of retained dimensions.

In order to address some shortcomings of LSA due to "unsatisfactory statistical foundation" Hofmann (1999) introduces Probabilistic Latent Semantic Analysis (PLSA), based on the likelihood principle. In experiments on four document collections, PLSA performed better than LSA, tf, and tfidf, in retrieval tasks. The core of PLSA is a "latent variable model for general co-occurrence data which associates an unobserved class variable with each observation", called an 'aspect model'. He uses a tempered EM algorithm for maximum likelihood estimation of the latent variable to avoid overfitting.

A probabilistic classifier assigns the most probable class, out of a finite set of classes, to an object, based on a probability model. The probability model defines the joint distribution of the variables, which are made up of the classification variable and a set of feature variables. The feature variables represent properties of the objects, in our case documents, we wish to classify. Features might include useful semantic, syntactic, and lexical distinctions or properties with respect to ideological point of view.
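
As a minimal sketch of the general idea, the following classifier picks the class that maximizes the joint probability of the class and the observed feature values. It makes the simplifying assumption that features are independent given the class (a naive Bayes model), which is not the model selection approach of Wiebe and Bruce, and it assumes labeled examples are available. The two boolean features, the class labels, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (features, label); features is a tuple of discrete values."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)  # (label, position) -> value counts
    values_seen = defaultdict(set)       # position -> values observed
    for features, label in examples:
        for position, value in enumerate(features):
            value_counts[(label, position)][value] += 1
            values_seen[position].add(value)
    return class_counts, value_counts, values_seen


def classify(model, features):
    """Return the most probable class under the (naive) joint distribution."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log P(class)
        for position, value in enumerate(features):
            # Add-one smoothing so unseen values do not zero the product.
            numerator = value_counts[(label, position)][value] + 1
            denominator = count + len(values_seen[position])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical features: (mentions "herbal", mentions "clinical trial").
examples = [
    ((True, False), "alternative"),
    ((True, False), "alternative"),
    ((False, True), "traditional"),
    ((False, True), "traditional"),
    ((True, True), "traditional"),
]
model = train(examples)
print(classify(model, (True, False)))
print(classify(model, (False, True)))
```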

As noted above, we have a difficulty because the set of ideological points of view may not be known in advance, so cannot serve as values for the classification variable. Wiebe and Bruce (1995) get around a similar problem of classifying point of view by breaking up the classification problem into three problems. Each problem takes input from the preceding problem and has its own classification variable.

Rather than deciding which feature variables to use or how the variables are related, they use statistical methods. Specifically they use decomposable graphical models, which graphically represent the features as nodes and the interdependencies of the features by undirected edges. A model that describes the data by representing only the most important interdependencies is chosen, based on a process of hypothesis testing using the likelihood ratio statistic to measure the goodness-of-fit of the model. Once the model is determined, maximum likelihood estimates are used for its parameters.

Wiebe and Bruce propose modifications to their model to limit its reliance on large amounts of tagged data, by estimating parameters from untagged data using a stochastic simulation technique.

In using a probabilistic classifier we need to consider what features we might want to use. It would be preferable if the features could be automatically extracted in a preprocessing phase. One option for the automatic extraction of lexical and syntactic features would be to use a system like AutoSlog-TS (Riloff 1996). We would also want to consider features discussed in the discourse section (above), possibly modified to better fit our task of determining ideological point of view. Additional features might be mined from the structure of the Internet.

Rie Kubota Ando. Latent Semantic Space: Iterative Scaling Improves Inter-document Similarity

CONCLUSION

We have endeavored to understand what would be necessary to build a system that would classify web pages or Usenet messages, first by topic and then by ideological point of view within the topic. Our proposal is to build a system on top of a search engine that would use either a form of Latent Semantic Analysis or a probabilistic classifier. The system would need to take into account discourse structure and the structure of the Internet, as well as other lexical, syntactic, and semantic clues. At least some of these clues should be derived from analysis of ideological point of view as defined and explored here. In the case that these methods are not successful, a further exploration of techniques used in Information Retrieval, Information Extraction, Text Classification, Text Segmentation, and more generally Statistical NLP and Machine Learning is warranted. Additional research needs to be done to better understand the structure of the Internet and how it might be exploited for our task. We have discussed some difficulties and strategies that may be involved with a rigorous evaluation of the system once developed. We have seen that the task fits into an overall framework of understanding subjectivity in text. Should the task prove intractable at this point, it would be worthwhile considering applying these strategies to a limited domain, such as law or newspaper editorials.