LINGUISTIC KNOWLEDGE BASED DISCOURSE MODELING FOR CONTEXT QUESTION ANSWERING

By

Mingyu Sun

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Linguistics

2006

ABSTRACT

LINGUISTIC KNOWLEDGE BASED DISCOURSE MODELING FOR CONTEXT QUESTION ANSWERING

By Mingyu Sun

This dissertation is motivated by the recent developments in scenario-based context question answering (QA). The role of discourse processing, and its implication for query expansion over a sequence of questions, is investigated. My view is that a question sequence is not random but rather unfolds in a coherent manner to serve certain information goals of users. Therefore, this sequence of questions can be considered a mini discourse with some characteristics of discourse cohesion and discourse coherence. Understanding such a discourse will help QA systems better interpret questions and retrieve answers. Thus, in the first part of my study, I propose three models driven by Centering Theory for discourse processing: an anaphora model that resolves pronoun references for each question, a forward model that makes use of the forward-looking centers from previous questions, and a transition model that takes into account the transition state between adjacent questions. The empirical results indicate that more sophisticated processing based on discourse transitions and centers can significantly improve the performance of document retrieval compared to models that only resolve references. In the second part of the study, the influence of processing based on pronoun resolution and definite description resolution is investigated. Results show that a combined model that incorporates both approaches performs best in the situation where no explicit target is given for the context questions. The processing required for the event type of context question answering is also investigated briefly.
For different discourse models proposed in the dissertation, systematic evaluation is provided and the potentials and limitations of these models in processing coherent context questions are also discussed. Copyright by Mingyu Sun 2006 To my parents, Zhengmei Yao and Xian Sun, my husband, Chunlei Wu, for their endless love. ACKNOWLEDGMENTS I am grateful for all the support I have received while researching and writing up this dissertation. I would like to thank the members of my dissertation committee: I am grateful to Dr. Barbara Abbott for her many years of encouragement and guidance. She has been showing me how to present my own ideas into scholarly writing. Her careful readings, detailed comments and insightful thoughts will never be forgotten. I would like to thank Dr. John Hale for being my committee chair, taking care of all the administrative issues, giving me quick replies to my questions, and especially making a long trip for my defense. Thanks especially to Joyce Chai, my co-chair, who has supported me every step of the way. She spent numerous hours sitting with me discussing the initial ideas of the dissertation, walking with me line by line on the drafts and guiding me to shape part of my work into two publications, giving comments on the implementation details and results. Without her support, I could not have done what I was able to do. I’d also like to thank Dr. Yen-Hwei Lin and Dr. Dennis Preston who have given me great comments on my proposal. The acknowledgments would not be complete without a heartfelt thanks to my friends. I would like to thank Mike Kramizeh for his support, especially for standing by me at my defense to make sure the technology worked. I would like to thank Julie Delgado for her kindness and help during my stay at the Linguistic Department. I would like to thank Zheng Wang and Jun A0 for their long term support. Finally, I would like to thank Vineet Bansal, a dear friend who supported me through all of the challenges. He offered to trouble—shoot my programs, proofread draft copies and listened to my moanings. I cannot thank him enough. vi TABLE OF CONTENTS LIST OF TABLES ' x LIST OF FIGURES xi 1 Introduction 1 1.1 Question answering ............................... 1 1.2 Problem statement ............................... 2 1.3 Overview of the thesis ............................. 5 2 TREC Question answering 7 2.1 TREC QA tasks ................................ 7 2.2 TREC systems ................................. 10 2.2.1 System architecture ........................... 10 2.2.2 Question processing ........................... 11 2.2.3 Further processing ........................... 13 2.3 TREC evaluation ................................ 15 2.4 Summary .................................... 17 3 QA discourse and discourse modeling 18 3.1 Introduction ................................... 18 3.2 QA context ................................... 18 3.3 QA discourse .................................. 19 3.4 Discourse modeling ............................... 21 3.4.1 Discourse structure ........................... 22 3.4.2 Discourse structure theories ...................... 22 3.4.3 Computational efforts ......................... 24 3.5 The hypotheses ................................. 26 3.6 Summary .................................... 29 4 Data collection and platforms 30 4.1 Introduction ................................... 30 4.2 System overview ................................ 30 4.3 Question collection ............................... 
31 4.3.1 User data ................................ 31 4.3.2 TREC 2004 data ............................ 33 4.3.3 TREC 2005 data ............................ 34 vii 4.4 Data format ................................... 4.4.1 Original TREC data format ...................... 4.4.2 Customized data format ........................ 4.4.3 Annotation validation ......................... 4.5 Document collections .............................. 4.5.1 User study documents ......................... 4.5.2 AQUAINT corpus ........................... 4.5.3 Document format ............................ 4.6 Document retrieval engine ........................... 4.7 Summary .................................... Centering-based discourse modeling 5.1 Motivation .................................... 5.2 Centering Theory ................................ 5.2.1 Grosz, Joshi and Weinstein (1995) ................... 5.2.2 Terminologies and definitions ..................... 5.2.3 Constraints and rules .......................... 5.2.4 Centering algorithms .......................... 5.3 Three discourse models for query expansion ................. 5.3.1 Anaphora model ............................ 5.3.2 Forward model ............................. 5.3.3 Transition model ............................ 5.4 Data analysis .................................. 5.5 Evaluation .................................... 5.5.1 Overall Evaluation Results ....................... 5.5.2 Evaluation and Analysis based on Tfansitions ............ 5.5.3 Evaluation and Analysis based on Pronouns ............. 5.6 Summary .................................... Discourse models based on definiteness hierarchy 6.1 Introduction ................................... 6.2 A new challenge in QA discourse ....................... 6.3 Related work .................................. 6.3.1 Computational work on definite descriptions ............. 6.4 Theoretical background ............................. 6.4.1 Definiteness hierarchy ......................... 6.4.2 Familiarity theory ............................ 6.4.3 Uniqueness theory ........................... 6.5 Classification of definite description uses ................... 6.5.1 Fine-grained classifications ....................... 6.5.2 Prince (1992) .............................. 6.5.3 Poesio and Vieira (1998) ........................ 6.5.4 Classifying definite descriptions in QA discourse ........... 6.6 Discourse models focusing on definite descriptions .............. viii 36 39 40 44 46 47 49 50 50 55 56 57 59 63 65 65 72 73 78 80 81 92 101 105 106 106 108 110 110 112 112 113 115 117 117 121 122 123 126 6.6.1 Data analysis .............................. 127 6.6.2 No-target situation ........................... 132 6.6.3 Target-given situation ......................... 136 6.7 Evaluation for definiteness hierarchy based models .............. 140 6.7.1 Evaluation for the no—target situation ................. 140 6.7.2 Evaluation for the target-given situation ............... 146 6.7.3 Evaluation based on the definite description types .......... 150 6.8 Summary .................................... 155 7 Conclusion 157 7.1 Summary ............................ , ........ 157 APPENDICES 161 A User data collection 161 A.1 Instruction .................................... 161 A2 Documents collection on Presidential Debates 2004: ............. 162 A3 Documents collection on Tom Cruise ..................... 165 A.4 Documents collection on Pompeii ....................... 
167
A.5 Documents collection on Hawaii ........................ 170

BIBLIOGRAPHY 173

LIST OF TABLES
4.1 TREC 2004 targets .............................. 35
4.2 TREC 2005 targets .............................. 37
4.3 TREC 2005 targets (cont'd) ........................ 38
5.1 Extended transition states (Adapted from Brennan et al. [1]) ........ 62
5.2 Transition rules for questions without pronouns but with non-pronominal referring expressions .............................. 75
5.3 Examples of transition rules on non-pronominal referring expressions ... 75
5.4 Query expansion strategies based on transition type ............. 77
5.5 Characteristics comparison between my data and TREC 2004 data (including only factoid questions) ........................... 79
5.6 Overall performance of different models on document retrieval for my data and TREC 2004 data .............................. 85
5.7 Document retrieval performance based on the transition model and passage retrieval performance from the University of Sheffield on TREC data ... 92
5.8 Transition model performance improvement for continue .......... 93
5.9 Transition model performance improvement for retain ............ 97
6.1 Definiteness hierarchy of noun phrases .................... 113
6.2 Hearer and discourse status of a discourse entity* .............. 122
6.3 Data analysis for the original TREC 2004 and TREC 2005 data ...... 130
6.4 Data analysis for the modified TREC 2004 and TREC 2005 data ...... 131
6.5 MRR performance based on definite hierarchy in no-target situation .... 143
6.6 Coverage performance on all TREC data* in no-target situation ...... 144
6.7 MRR performance based on modified definite hierarchy in no-target situation ... 146
6.8 MRR performance based on definite hierarchy in target-given situation ... 147
6.9 Coverage performance on all TREC data* in target-given situation .... 147
6.10 Question distribution based on definite description types .......... 151

LIST OF FIGURES
2.1 Question answering system architecture .................... 11
4.1 Flow chart of the centering-based automated system architecture ...... 31
5.1 Overall comparison of four models based on automated processing ..... 87
5.2 Coverage comparison between four models based on automated processing ... 91
5.3 Performance on CONTINUE pairs ...................... 94
5.4 Performance on RETAIN pairs ........................ 96
5.5 Performance on SHIFT pairs ......................... 100
5.6 Performance for questions with pronouns ................... 103
5.7 Performance for questions without pronouns ................. 104
6.1 MRR performance comparison between the definiteness hierarchy based models and the modified version ........................ 146
6.2 MRR performance on all TREC data* in target-given situation ...... 148
6.3 Overall comparison of four models based on automated processing on antecedents ..................................... 151
6.4 MRR performance on questions with direct anaphoric definite description in no-target situation .............................. 152
6.5 MRR performance on questions with bridging definite description in no-target situation ................................. 153
6.6 MRR performance on questions with discourse-new definite description in no-target situation ............................... 154
6.7 MRR performance on questions without any definite descriptions in no-target situation ................................. 155

CHAPTER 1

Introduction
1.1 Question answering

With enormous amounts of electronic textual data available, it is important for human users to be able to find information through natural language questions. Research on Question Answering (QA) systems aims to provide such a capability. Given a natural language question from a user (such as the English question What is the oldest sports trophy?), an ideal computer system can consult a database or knowledge base, or go to the Web, to locate the answer, the America's Cup. Natural language processing and Information Retrieval (IR) make it possible to provide informative, appropriate and non-misleading answers to simple questions. The Text REtrieval Conference (TREC)1 2 has been conducting evaluations of different QA systems since 1999. The submitted competing QA systems have been moving from the stage of dealing with single fact-based questions to more complex questions such as opinion questions (like What do people think about the 2004 presidential debate?). Various issues in terms of system performance, uses, and techniques have attracted international research interest and activity in QA.

1 http://trec.nist.gov/
2 TREC is co-sponsored by the National Institute of Standards and Technology (NIST), the Information Technology Laboratory's (ITL) Retrieval Group of the Information Access Division (IAD) and the Advanced Research and Development Activity (ARDA) of the U.S. Department of Defense.

Putting aside the problem of question complexity, once engaged in interaction with a system, a user may ask a series of questions. This sequence of questions is not arbitrary but rather coherent, leading towards some information goal. In the TREC 2004 and TREC 2005 Question Answering competitions, questions were grouped together by target (similar to a topic). The participating systems usually have modules for question processing, document retrieval, passage selection and answer retrieval. In their system architectures, question processing is the first module in the pipeline. How to characterize a question sequence therefore becomes a very interesting problem, in that it affects query expansion in the question processing stage, answer extraction in the answer retrieval stage, and the overall performance.

The purpose of this dissertation is to develop interpretive discourse models of question sequences for a computer system that takes textual questions as input. These models aim to provide proper interpretation of the questions and efficient computational algorithms for question processing and query expansion. From a linguistic point of view, semantic and pragmatic information in the questions will be investigated in developing these models.

1.2 Problem statement

In this section I will introduce the notions of context and context questions, current methods used in processing context questions, and some relevant linguistic research. In the linguistic literature, the term context has been defined in different ways, for example, as "the discourse that surrounds a language unit and helps to determine its interpretation" by WordNet3, or as "the complete discourse record of a discourse at any given point, including both linguistic and situational information" (Geluykens, 1992) [2]. In Geluykens' definition, the linguistic information usually refers to the surrounding words or sentences of a piece of text. The situational context refers to "the features of the non-linguistic world in relation to which linguistic units are systematically used" (Crystal, 1992) [3].
Recent work has given context a more dynamic interpretation. Roberts [4] pictures context at a given point of a discourse as an ordered set which includes five sets of linguistic elements 3See WordNet project website http://wordnet.princeton.edu/ (such as entities in a discourse, Common Ground propositions) or non-linguistic elements (such as intentional goals of speakers). The term context in the field of Question Answering can refer to different contexts such as user context [5] and discourse context [6]. In this study, I will focus on the discourse context, in particular, the discourse of a sequence of questions. I will limit ourselves to a description of context as follows: given a series of questions, when one question is under processing, its preceding question or questions or constituents of these questions are all regarded as its context. A context question is a follow—up question that needs to be processed using its context information. Context question answering is the task of answering such context questions. Consider the following example as illustration: (1) Q1: Who was Tom Cruise married to? (A1: Nicole Kidman) Q2: What was her Broadway debut? (A2: The Blue Room) Q3: Who filed for divorce? (A3: Tom Cruise) It is impossible for a system to isolate (1Q2) from (1Q1) and (1A1), to retrieve correct answer (1A2). In this case, the pronoun her in (1Q2) has to be solved before (1Q2) is sent to an Information Retrieval (IR) engine(i.e. application that seeks information from various resources 4). For an IR engine, the resolution of pronoun references is necessary because otherwise pronouns would be treated as stop words5 and totally ignored in the retrieval process. (1Q2) therefore is the context question of (1Q1) and (1A1). One method that is currently used in QA systems is called reference-resolution. What this method does is to find what such referring expressions as pronoun he, demonstrative this, or definite noun phrase the book refer to in the context questions. For instance, in example (1), the possessive pronoun her will be resolved first using the extracted answer (1A1). Therefore 4such as the backend engine of http://www.google.com that retrieves textual information from the World Wide Web 5Please see the definition at http://en.wikipedia.org/wiki/Stop.words the query terms6 that will be fed into an IR engine would be the terms from the current question (1Q2), {what, was, her, Broadway, debut} concatenated with the terms (words) from the answer to (1Q1), that is, (Nicole, Kidman}. Similar reference-based methods are computationally implemented. However, it is noticeable that this kind of approach is not able to solve (1Q3) since (1Q3) does not have any explicit reference whatsoever. Human being, however knows that the marriage (as being an event) involves two parties (i.e. two referents: Tom Cruise and Nicole Kidman) for (1Q3). The fact that there is not many systematic linguistic analyses on context question answering motivates the present study on how context questions are related not only lexically but also at a higher semantic and/ or pragmatic level. Real life experience has shown that questions raised in an information seeking session are often related, because one question forms a context for another. Presumably, the techniques used to process single question would not be an efficient means of gathering such context information for the system engaged in an interactive QA session. Thus, the context is critical in processing a question sequence. 
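To make the reference-based query expansion just described concrete, the sketch below assembles the bag of words for (1Q2) once her has been resolved to the answer of (1Q1). It is a minimal illustration only: the function names and the crude tokenizer are assumptions of this sketch, not part of any system discussed in this thesis.

```python
# A minimal sketch of the reference-based query expansion illustrated by
# example (1). expand_with_antecedent() stands in for a real anaphora
# resolver; nothing here is taken from an actual TREC system.

def query_terms(question):
    """Crude tokenizer: lower-case the question and strip punctuation."""
    return [t.strip("?.,").lower() for t in question.split()]

def expand_with_antecedent(question_terms, antecedent_terms):
    """Append the terms the pronoun refers back to (e.g. the answer to Q1).
    The pronoun itself is left in place; most IR engines drop it as a stop word."""
    return question_terms + [t.lower() for t in antecedent_terms]

if __name__ == "__main__":
    q2 = query_terms("What was her Broadway debut?")   # (1Q2)
    a1 = ["Nicole", "Kidman"]                          # answer to (1Q1)
    print(expand_with_antecedent(q2, a1))
    # ['what', 'was', 'her', 'broadway', 'debut', 'nicole', 'kidman']
```

As the expanded term list shows, the retrieval engine then sees the antecedent terms even though the surface question never mentions them.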
But how does a computer system determine what context to use and when to use it? What is the role of context in processing a series of question? The central hypothesis of this thesis is that context can be modeled, or captured by an entity-based linguistic theoretical framework. This thesis is a linguistic effort to investigate the context question answering problem. The goal is to show that a system needs to take into account the linguistic knowledge of questions, that is, to properly represent discourse entities based on linguistic knowl- edge. This approach implies that discourse entities, represented by definite descriptions, pronouns or proper names play an important role in determining the relationship be- tween / among context questions. This work also aims to show that successfully capturing the context in a question sequence requires computational algorithms that combine both lexical information and discourse information of the questions. 6Query terms here refer to a set of tokens obtained from the linguistic expressions in a natural language question. For instance, the query terms for the question Who is Tom Cruise? would be {who, is, Tom, Cruise}. They are treated as a bag of words by most retrieval engines. 1.3 Overview of the thesis The thesis is organized as follows: in Chapter 2, I present the background information of the Text REtrieval Conferences(TREC), which provides essential material for a complete understanding of the context question answering task. This chapter introduces the TREC QA tasks, and includes discussions on the state-of-the—art TREC systems, the question processing component, and evaluation metrics for the TREC systems. Chapter 3 continues the discussion of background material from a linguistic perspec- tive. The context question problem is established as a discourse modeling problem, for a sequence of questions are treated as a mini discourse. Discourse structure and discourse structure theories are reviewed briefly. Relevant computational efforts are presented to show how previous work has led to the present state of the field. Moreover, two hypotheses are presented for the work discussed in Chapter 5 and Chapter 6. Chapter 4 presents all the empirical elements involved in the current study. They include: the context question sets, the annotation scheme of the questions, the document collections, etc. Data collection as well as data formatting are described in detail. Chapter 5 is motivated by recent efforts in scenario-based context question answering. I investigate the role of discourse processing and its implication on query expansion for a sequence of questions. The view is that a question sequence is not random, but rather follows a coherent manner to serve some information goals. This sequence of questions can be considered as a mini discourse with some characteristics of discourse cohesion. Understanding such a discourse will help QA systems better interpret questions and re- trieve answers. Thus, three models driven by Centering Theory for discourse processing are examined: an anaphora model that resolves pronoun references for each question, a forward model that makes use of the forward looking centers from previous questions, and a transition model that takes into account the transition state between adjacent questions. 
The empirical results indicate that more sophisticated processing based on discourse transitions and centers can significantly improve the performance of document retrieval compared to models that only resolve pronoun references. This chapter provides a systematic evaluation of these models and discusses their potentials and limitations in processing coherent context questions. Chapter 6 presents another attempt at modeling context questions, especially the ones that contain definite descriptions as well as pronouns. Pronouns and definite descriptions are the focus for this part of the study. Definiteness hierarchy is adopted to help resolve pronouns in context questions. Definite descriptions are classified into three classes for the purpose of query expansion. Experiments on context question data are conducted under two situations: one is that a target / topic is given for each context set (target- given situation), and the other is that no target is given (no-target situation). Different models involving pronoun resolution, definite description resolution, and targets are de- veloped. The results show that in the no-target situation, extra processing on the definite descriptions and the pronouns based on the definiteness hierarchy improves the perfor- mance significantly. In the target-given situation, a target-appending strategy shows its efficiency, which implies that identifying a topic or target for a context question set is a task as important as processing the context questions themselves. Chapter 7 summarizes the work presented in the dissertation, and contains some hind- sight recommendations for improvements and possible future work. CHAPTER 2 TREC Question answering This chapter provides the background material on the TREC QA tasks and the state- of-the-art Question Answering technology. It describes how context Question Answering enhances the current Question Answering systems. 2.1 TREC QA tasks For the purpose of boosting research in information retrieval and facilitating technology transfer from the research community to commercial products, the Text REtrieval Confer- ence (TREC) has been conducting tracks for different focus areas since 1992. For instance, TREC 2006 has 7 tracks: Blog Track, Enterprise 'Dack, Genomics Track, Legal Track, Question Answering Track, SPAM Track and Terabyte Track. These tracks aim to solve different problems, for example, the Blog Track, a new track in TREC 2006 is designed to explore information seeking behavior in the blogosphere. Note that information retrieval is no longer confined to text retrieval, which provides users to access to natural language texts. It is possible to retrieve information from other type of media such as video clips. The various TREC tracks have ignited enormous research interests all over the world. The number of participating research groups has increased over the years 1. These groups are mainly academic, commercial, and government institutions. Although some tracks ‘ 1For example, there were 66 groups from 16 different countries participating in 1999 (TREC-8). But in 2004 (TREC-13), the number has increased to 103 groups from 21 different countries are added and some are dropped over the years, the Question Answering track has been on the TREC task list since 1999, with some changes made gradually. Information retrieval is concerned with how a user mines information to satisfy his / her information need. Traditionally, there are two types of information retrieval: ad hoc and known item search. 
Typical examples of ad hoc search would be a library search system or an online search engine, which takes the user’s investigation topic of any kind, searches a set of documents (the library’s holdings or the World Wide Web). The known item search is the scenario where the user is trying to find certain information that he/she knows exists. Both types of information retrieval focus on locating information on a document level, namely, document retrieval. What document retrieval does is to match a user query (such as a natural language question) against large textual collections and return a ranked document list. The text collections could be of any kind, such as newswire articles, book chapters, user manuals or a large-scale corpus. Document retrieval systems use different techniques to retrieve relevant documents from the collections. This practice reflects the characteristics inherited from the traditional library reference systems. The T REC QA tracks were initiated to take a step further from document retrieval to answer retrieval. At times, users would prefer the system to offer the answer instead of going through all the returned documents looking for the exact answer. The QA tracks aimed to address this challenge and each QA track consisted of one or more tasks trying to extract answers to different types of questions. The first five tracks (from 1999’s TREC-8 to 2003’s TREC-12) focused on answering single factoid questions like “what is the oldest sports trophy?”. The total of 2393 such questions were either created by the TREC staff or from search or query logs donated by different sources (Encarta, Microsoft, Ask Jeeves, AOL etc.). Since 2001, TREC added list type of questions in addition to the main factoid question task. List questions are questions such as Name 32 countries Pope John Paul II has visited. Answering such questions requires a system to submit an answer based on the information located in multiple documents. The list questions were created by the TREC assessors. TREC 2003 added definition type of questions such as Who is Aaron Copland? A definition question asks for relevant information on a person or an organization. Definite questions also require systems to be able to locate information from multiple documents. Comparing with the list questions, the information of interest for definition questions is much less incisively represented. The results of the TREC 2003 track demonstrated that the list and definition questions posed challenges not only for the QA systems but also for the evaluation of the systems. However, these two types of questions were kept in TREC 2004 and TREC 2005. The main task in TREC 2004 was on context questions, a form of question first inves- tigated in 2001’s TREC—10. Instead of being independent individual questions, factoid questions and list questions are grouped into series each of which has a target associated with it. A question series consists of several factoid questions, zero to two list questions. To each series, TREC2004 also added a question of type other, which is equivalent to the TREC 2003 definition questions. It could also be interpreted as “tell me something interesting about this target”. Each series has exactly one other question. The questions in a series ask for information about the target. The target and previous questions in the series provide the context for the question under processing. 
The series a question belongs to, the order of the question in the series, the question type (such as "list") and the associated target were all explicitly given in the XML format. Details of the TREC question sets and examples are given in Chapter 4. Question series were developed by NIST staff and TREC assessors. The questions were created for a scenario in which an average native speaker of English asks a series of questions about a term (i.e. a target) encountered while reading a US newspaper [7]. In practice, the TREC assessors created some questions related to the target and then searched the test document collections looking for answers to these questions. NIST staff then reviewed the information found in the related documents and reconstructed the final test series.

The development of the QA tracks and the increased difficulty of the QA tasks imply that QA technologies have advanced tremendously in response to application demand in the area. The context QA task has been a regular task in the TREC evaluation since 2004. Participating systems were required to process each series of context questions independently of the other series, and to process the questions within a series in the order given. Systems were allowed to use the questions or answers from earlier questions in a series to answer later questions, but not the other way around. Next, I will introduce TREC QA systems and the technology involved in context Question Answering.

2.2 TREC systems

2.2.1 System architecture

Processing of the questions is supposed to be strictly automatic. The participating systems accomplish the TREC tasks by directing subtasks to different system components. There are four prototypical components common to most system architectures. In Figure 2.1, the pipeline consists of question processing, document retrieval, passage selection, and answer retrieval. The question processing component focuses on preprocessing the questions and outputs query terms for the document retrieval component. The document retrieval component usually takes the query terms as input to a search engine (such as Lemur2 or Lucene3), which is programmed to find information that matches the user's search criteria. The document retrieval component outputs a list of relevant documents for the passage selection component, which then further narrows down the search to the paragraph level and outputs passages that may contain the potential answers. Within the targeted passages, the answer extraction component finally pinpoints an answer to the original question.

Based on the four components, different systems have their own system architectures tailored to fit their implementation purposes. Some can be as sophisticated as the ILQUA system in Wu's 2005 work [8]. Note that many processing techniques were implemented in the ILQUA system, organized as subcomponents under the four logical components. For instance, there were four subcomponents for the question processing component: syntactic chunking, type categorization, target classification and query generation.

2 http://www.lemurproject.org/
3 http://lucene.apache.org/java/docs/

[Figure 2.1. Question answering system architecture: TREC questions pass through question processing, document retrieval, passage retrieval and answer extraction, drawing on the document collections, to produce answers.]

It should be noted that the overall approach used in the participating systems stayed rather stable over the years.
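As a rough illustration of the four-stage pipeline in Figure 2.1, the stages could be chained as in the sketch below. This is a schematic sketch only; the component interfaces and the naive term-overlap scoring are simplifying assumptions, not the design of any particular TREC system, which would instead plug in a real search engine, passage ranker and answer extractor.

```python
# Schematic sketch of the prototypical pipeline in Figure 2.1. Every stage is a
# placeholder: real systems use a search engine (e.g. Lemur or Lucene), a
# passage ranker and an answer extractor in place of the toy logic below.

def process_question(question):
    """Question processing: reduce the question to a bag of query terms."""
    return [t.strip("?.,").lower() for t in question.split()]

def retrieve_documents(query_terms, collection):
    """Document retrieval: rank documents by naive term overlap."""
    scored = [(sum(t in text.lower() for t in query_terms), doc_id)
              for doc_id, text in collection.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

def select_passages(doc_ids, collection, query_terms):
    """Passage selection: keep paragraphs that mention at least one query term."""
    passages = []
    for doc_id in doc_ids:
        for paragraph in collection[doc_id].split("\n"):
            if any(t in paragraph.lower() for t in query_terms):
                passages.append(paragraph)
    return passages

def extract_answer(passages):
    """Answer extraction: trivially return the top passage, or NIL."""
    return passages[0] if passages else "NIL"

def answer(question, collection):
    terms = process_question(question)
    docs = retrieve_documents(terms, collection)
    return extract_answer(select_passages(docs, collection, terms))
```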
For answering the factoid questions, systems usually first identify the expected answer type of the question (such as date, name, location, size, speed, city, nationality, etc.), and then use question words and / or related terms as the query to retrieve documents or passages that may contain answers that match with the answer type, and finally extract an answer from the retrieved passage. A system often uses the same architecture for processing both the factoid and the list questions. The difference is that they return a single answer for a factoid question but multiple answers for a list question. Next, I will briefly summarize the major techniques used in the TREC 2005 systems for the question processing component, which my discourse modeling is mostly concerned with. 2.2.2 Question processing Traditional QA systems have question analysis as the first stage of processing. Research shows that the question analysis component included in the retrieval methods specialized for QA has effect on the overall system performance in a positive way [9]. Except for [10] that applies Discourse Representation Structure (DRS) in question processing, most work related to question processing has focused on three aspects. The first aspect is on question type analysis and categorization, which identifies the target types for context questions (such as whether a question is to ask “person”, “number” or “location”). This 11 is similar to the processing for the isolated factoid questions. The goal is to help pin-point the expected answer strings. Several systems in TREC 2005 have adopted this technique such as [11] and [8]. The second aspect emphasizes the processing of the words (i.e. the tokens) in the questions. Parsing tools (which help to find verbal predicates), POS tagging tools, name- entity4 recognizers (which tag name-entities into categories), statistical analysis (unigram, bigram, n—grams for question words5) and knowledge bases (such as WordNet’s Synset, which provides synonyms for a particular word) are utilized to expand the queries. Sys- tems such as [11], [13],[14] and [8] used this method. The third is to make use of the target (topic) provided by TREC datasets. TREC QA track provided a target for each context question series. A target is a linguistic expression denoting a person, a thing, an organization or an event. Although the target of each series is given, it is not necessarily mentioned in each of the factoid question and list question in a series. Therefore in TREC 2005 some systems chose to append the target to each question within the series such as [11]. Another approach was to replace all the pronouns that appear in the question set with the target such as [15]. The methods were effective because the target was the domain for the questions. However there were problems too. One problem was that sometimes pronouns do not refer to the target. Instead, some pronouns refer to entities appearing in the previous question(s). For example, the target in TREC 2005 question series 136 is Shiite, however the pronouns his and he do not refer to Shiite. (1) Q1: Who was the first Imam of the Shiite sect of Islam? Q2: Where is his tomb? Q3: What was this person’s relationship to the Prophet Mohammad? Q4: Who was the third Imam of Shiite Muslims? 4Name-entity could be person, location, and organization, as well as times, data, percentages, money amounts, etc. . 0 5Interested readers may refer to Jurafsky and Martin 2000[12] for the definition of umgram, bigram and n-grams. 12 Q5: When did he die? 
To solve this problem, some systems conducted true anaphora resolution for pronouns. However, it was difficult to judge how much benefit was gained from the extensive anaphora processing.6 After all, most pronouns do refer to the target. Another problem was that the target is not necessarily referred to only by pronouns. Sometimes definite descriptions or demonstratives are used instead. For example, the target for question series 75 in the TREC 2005 main task is Merck & Co.

(2)
Q1: Where is the company headquartered?
Q2: What is their symbol on the New York Stock Exchange?
Q3: Name products manufactured by Merck.

Some systems (such as U of Sheffield [16]) chose to replace nominals as well as the pronouns with the target. In TREC 2005, a new difficulty was introduced into the processing because of the event type of target. Events such as Hindenburg disaster and Russian submarine Kursk sinks were given as targets for question series. This new type of target added more complexity to the question processing, which I will discuss in detail in Chapter 6.

In general, the overall approaches to question processing were question type analysis, lexical expansion on question words, a simple target-appending method, and statistical analysis. Only one system in TREC 2005 used a discourse analysis approach for context question processing. In general, research on answering questions in series from a discourse analysis point of view has been limited.

6 I will discuss this in Chapter 6.

2.2.3 Further processing

The main purpose of the document retrieval component is to search for document candidates for further processing. Different software and search engines were chosen for this purpose in the TREC 2005 systems. Software such as MySQL was used in [17] to help with the full-text search. Some search engines are freely available to the research community, for instance, the Lemur7 tool kit used in [15] and [10], Lucene8 used in [16], [13] and [11], and PRISE9 used in [17]. Some were developed by the participating groups themselves, such as INQUERY, developed at the University of Massachusetts at Amherst [8]. Search engines basically index the document collection in a way that ensures the search is quick and effective.

For the passage selection and answer extraction components, processing techniques are more sophisticated, varying from system to system. Statistical methods, heuristics as well as natural language processing tools allow the systems to finally extract answers to the original questions. Filtering of irrelevant documents is normally based on answer type, target or question terms. Passages are then selected from the candidate documents. Given that state-of-the-art NLP techniques (e.g. natural language parsers, POS taggers, etc.) are available, it is each individual system's choice how to use these techniques to implement various ideas at this stage. For instance, lexical resources (WordNet, the online dictionary Wikipedia10) have been used to locate matched name entities in systems such as [16] and [15]; the technique of indexing (sentence-level indexing) was also implemented in systems such as [13] to narrow down the search domain. Surface pattern matching [8] and logical form matching [16] were also seen in passage retrieval. As in question processing, NLP software (parsers, POS taggers and name entity recognizers) was utilized at this stage too.
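The filtering and ranking ideas just described can be illustrated with a small, hypothetical sketch: filter candidate documents on the target, score passages by question-term overlap, and boost passages containing a string of the expected answer type. The regular-expression "answer type" patterns and the scoring weights below are invented for illustration only; no TREC 2005 system is claimed to work this way.

```python
# Illustrative sketch of document filtering and passage ranking. The answer
# type patterns are deliberately crude stand-ins for a real named-entity tagger.

import re

ANSWER_TYPE_PATTERNS = {
    "date": re.compile(r"\b(1[89]|20)\d{2}\b"),            # crude year match
    "person": re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"),  # crude name match
}

def filter_documents(docs, target):
    """Keep only candidate documents that mention the target at all."""
    return {d: text for d, text in docs.items() if target.lower() in text.lower()}

def rank_passages(docs, query_terms, answer_type):
    """Score each paragraph by query-term overlap; add a bonus if it contains
    a string matching the expected answer type."""
    pattern = ANSWER_TYPE_PATTERNS.get(answer_type)
    ranked = []
    for text in docs.values():
        for paragraph in text.split("\n"):
            score = sum(t.lower() in paragraph.lower() for t in query_terms)
            if pattern and pattern.search(paragraph):
                score += 2
            if score > 0:
                ranked.append((score, paragraph))
    return [p for s, p in sorted(ranked, key=lambda pair: -pair[0])]
```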
In the final stage of answer extraction, the techniques used in the TREC 2005 systems varied to the extent that no other stage shows the same diversity. Besides the techniques used in the previous stages, the Web was also used as a resource [10] [11]. A logical prover [18] and various ranking or scoring schemes (such as [10] and [19]) were also employed specifically for answer extraction. Generally, systems used different techniques to narrow down the search and reduce the returned information. The TREC participating groups submitted their retrieval results within one week of receiving the test set. The TREC evaluations of the results were then conducted by the TREC staff.

7 http://www.lemurproject.org/
8 http://lucene.apache.org/java/docs/index.html
9 http://www-nlpir.nist.gov/works/papers/zp2/zp2.html
10 http://www.wikipedia.org/

Next, I will discuss how the evaluations were done and what the current evaluation metrics are for the factoid questions of which the context question series in my data pool consist.

2.3 TREC evaluation

Since the QA track aims to obtain direct answers to the questions instead of documents containing the answers, the TREC evaluation is at the answer extraction level rather than the document retrieval level. After the answer string is returned by the answer extraction component, it is independently judged by two human assessors. Only when there is a disagreement between these two assessors is a third judge, normally a NIST staff member, introduced to make a final judgment. Each response receives one of the following four judgments: correct, not exact, not supported, or incorrect. The answer string is regarded as correct when it is exactly the right answer and it is supported by the document in which it appears. The response is not exact when the answer is contained in the answer string and the returned document supports the answer, but the string has more or less information than the answer itself. The response is not supported when the answer string contains a right answer but the document does not clearly answer the question. The response is incorrect when the answer string does not contain the right answer. NIL responses are for the questions to which there are no answers in the document collection.11

11 It was the NIST staff's ultimate goal to provide test sets where each question has an answer in the document collections; however, in both the TREC 2004 and the TREC 2005 data, there were some questions that did not have answers.

Considering the differences in question type and response format, the final score of a participating system is computed as a weighted average of the scores computed for each question type. The scores for factoid questions were computed by accuracy, which is defined as the fraction of questions judged correct. For list questions, multiple answer strings were returned, and the scores were computed using instance precision (IP) and instance recall (IR): IP = D/N and IR = D/S, where S is the number of known instances for the list question, D is the number of correct, distinct responses returned by the system, and N is the total number of responses returned by the system. The F score was then computed using the following formula: F = (2 × IP × IR) / (IP + IR). Since the other questions are similar to definition questions, the returned answer strings should contain atomic pieces of information about the target that are not part of, or an answer to, an earlier question in the series [7].
Each such atomic piece of information is defined as a nugget. The final score for an other question was computed using the F measure12: F(β = 3) = ((β² + 1) × NP × NR) / (β² × NP + NR), where NR is nugget recall and NP is nugget precision. Next, I will explain what nugget recall and nugget precision are. "Given the nugget list and the set of nuggets matched in a system's response, the nugget recall of the response is the ratio of the number of matched nuggets to the total number of vital nuggets in the list." Nugget precision is another measure, which is more complicated than nugget recall. It was computed based on a length-based measure that "starts with an initial allowance of 100 characters for each (vital or non-vital) nugget matched." "If the total system response is less than this number of characters, the value of the measure is 1.0. Otherwise, the measure's value decreases as the length increases," using the function 1 - (length - allowance)/length [7]. β = 3 indicates that nugget recall is three times as important as nugget precision.

12 β is a parameter signifying the relative importance of recall and precision. A value of 3 indicates that recall is three times as important as precision.

Based on the scores for each type of question, the final score for a QA run was computed as a weighted average of the three component scores: FinalScore = 0.5 × FactoidAccuracy + 0.25 × ListAveF + 0.25 × OtherAveF, where FactoidAccuracy is the accuracy score for the factoid questions, ListAveF is the average F score for the list questions, and OtherAveF is the average F score for the other questions. It should be noted that this score is question-based, which means that each question is treated equally no matter which series it is in. TREC also conducted a series-based evaluation, which gives equal weight to each series. Interested readers may refer to [7] for more details. The evaluation details for the current study will be discussed in Chapter 5.

2.4 Summary

This chapter has examined the TREC background for the task of question answering. It has introduced the QA tasks in recent TREC conferences and presented the prototypical TREC system architecture. Based on that pipeline, relevant techniques used in the TREC 2005 QA main task were introduced, especially the ones used in the question processing stage. I concluded with the evaluation metrics for the TREC QA track. This background is necessary to present readers with a complete picture of the setting in which context question answering resides. The next chapter continues my discussion of background material by focusing on the linguistic aspects of discourse processing and describing how the notion of discourse can be used to guide the current study. Theories of discourse coherence and discourse modeling will be the highlights of the next chapter.

CHAPTER 3

QA discourse and discourse modeling

3.1 Introduction

In this chapter, I continue the discussion of context question processing in a question answering system, but from a discourse point of view. Question series are assumed to be coherent and thus to form a discourse. Some relevant linguistic theories of discourse coherence and discourse structure will be reviewed and investigated as to what extent they are useful in solving the current problem of context question answering.

3.2 QA context

Before I delve into solving the problem of context question answering, I would like to clarify some terminology. The term context with regard to a question has been defined in Chapter 1, more specifically, as the previous questions or the answers to the previous questions.
Given that the state-of-the-art QA systems do not use the answers to the previous questions to help processing the current question, in the rest of the thesis, context will exclude the answers to the previous questions for two reasons: first, it is hard to guarantee that the correct answer of each question can be retrieved by the question answering system that I use; second, the focus of the study is on question processing rather than on answer extraction. As Chapter 2 indicated, question answering based on context was first 18 investigated in TREC 10 Question Answering ’Ii'ack [20]. The context task was designed to investigate systems’ capability to track context through a series of questions. However, as described in [6], there were two unexpected results of this task. First, the evaluations of systems have shown that the ability to identify the correct answer to a question in the later series had no correlation with the capability to identify correct answers to the preceding questions. Second, since the first question in a series already restricted answers to a small set of documents, the performance was determined by whether the system could answer a particular type of question, rather than the ability to track context. Although context processing has not been incorporated into the TREC QA track, the results from TREC 10 motivate more systematic studies of discourse processing for context question answering. 3.3 QA discourse Another term that I will be using throughout the thesis is discourse. It is desirable and necessary to clarify the definition of discourse because discourse processing of the context questions and discourse modeling would be the focus of this study. In discourse analysis literature, researchers reserve the term discourse with a small d for stretches of language. It was defined as “a stretch of language consisting of several sentences which are perceived as being related in some way” [21] or “a continuous stretch of (especially spoken) language larger than a sentence, often constituting a coherent unit, such as a sermon, argument, joke or narrative” [3]. Discourse was also defined as “the piece of communication in context” [22]. Grosz and Sidner [23] defined discourse as “a piece of language behavior that typically involves multiple utterances and multiple participants” whose definition is more conversation-oriented. Discourse analysts classify discourse types according to the communicative function that they perform. To name a few, there are conversations of social interactions, email exchanges between friends and interviews on TV shows etc. Following this classification, I may as well treat the conversation between users and a computer operated QA system as a special kind of discourse. 19 Carrying different communicative purposes, certain discourses can be distinguished from other kinds in terms of their recurring patterns. Setting aside most other types of dis- courses, I notice that discourses situated in a classroom, an oral exam, a courtroom or a question answering system all contain questions and answers. However, the ways they are combined or presented vary. The building block of a QA sequence is strictly alternating question-answer pairs. Here I would like to define a QA discourse. A QA discourse is a textual dialogue between a human user and a question answering system with at least two question answer pairs. Example (1) (repeated from Chapter 1) is such an instance of a QA discourse with three consecutive question answer pairs. 
(1) Q1: Who was Tom Cruise married to? (A1: Nicole Kidman) Q2: What was her Broadway debut? (A2: The Blue Room) Q3: Who filed for divorce? (A3: Tom Cruise) A QA discourse is different from other types of discourses not only in terms of its forming structure, but also in terms of the special participants that are involved. Unlike normal conversations, in a QA setting it is always the human users that initiate, direct and terminate the conversation. The traditional Gricean Cooperative Principle [24] is not necessarily observed in a QA discourse. In this sense the discourse is different from normal human-human conversation discourse in that both parties are not responsible for being cooperative. With some unique characteristics, does a QA sequence resemble other types of dis- course? Now consider the following example, a sequence whose questions are jumbled up from an originally coherent QA discourse as shown in example (1). (2) 20 Q1: What was her Broadway debut? (A1: ?) Q2: Who was Tom Cruise married to? (A2: Nicole Kidman) Q3: Who filed for divorce? , (A3: Tom Cruise) This sequence is not coherent at all. Although question (2Q1) and (2Q2) are grammat- ically perfect, they are not coherent in that it is impossible for a computer or even a human being to interpret what her in (2Q1) refers to. On the other hand, the sequence of (1Q1) and (1Q2) makes more sense than the reordered example (2), and thus is a piece of coherent text sequence. In the current study, I assume that the TREC test sets and the data collected from the user study (discussed in Chapter 4) are question series consisting of coherent QA discourse. I also call them mini discourses. 3.4 Discourse modeling There has been a tremendous amount of work on discourse modeling in the area of natural language processing. The discourse research mainly addresses two important questions: 1) what information is to be captured from the discourse; and 2) how such information can be represented for language interpretation and generation. Many theories have been developed for both texts (e.g., Hobbs theory [25] and Rhetorical Structure Theory [26]) and dialogues (e.g., Grosz and Sidner’s conversation theory [23] and Discourse Repre- sentation Theory Kamp1993). The NLP community has seen successful applications of discourse modeling based on the discourse structures that these theories assume. For example, in TREC 2005, Ahn et. a1. [10] applied the Discourse Representation Structure (DRS) in question processing. In particular, Chai and Jin’s work [27] presents a semantic rich discourse representation, which provides a motivation for the work reported in this dissertation. In order to provide some background information on the work that is done in this study, I will briefly go over some of these theories in this section. First, I will 21 discuss the relationship between discourse structure and discourse coherence; then I will look at some of the well-established discourse structure theories; Finally I will introduce some recent computational work in identifying discourse structures. 3.4.1 Discourse structure A common assumption made in natural language discourse modeling is that the discourses under processing are coherent. But why is a discourse coherent? Researchers, especially linguists and computer scientists have been looking at discourse structure seeking answers to this question. 
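Before turning to discourse structure theories, the contrast between examples (1) and (2) can be operationalized in a deliberately naive way: a follow-up question that uses a third-person pronoun is only interpretable if some earlier question or answer has introduced a candidate antecedent. The toy check below rests on that single assumption (plus a crude capitalized-word test for antecedents); it is an illustration only, not one of the discourse models proposed later in this thesis.

```python
# Toy illustration: flag a question sequence as locally incoherent if a
# question uses a third-person pronoun before any candidate antecedent has
# been introduced. The pronoun list and the "any capitalized word counts as
# an antecedent" test are deliberately naive assumptions.

PRONOUNS = {"he", "she", "it", "they", "his", "her", "its", "their", "him", "them"}

def has_antecedent(history):
    """History is a list of earlier (question, answer) pairs."""
    return any(word[0].isupper()
               for q, a in history
               for word in (q + " " + a).replace("?", "").split()[1:])

def locally_coherent(qa_pairs):
    history = []
    for question, answer in qa_pairs:
        tokens = {t.strip("?.,").lower() for t in question.split()}
        if tokens & PRONOUNS and not has_antecedent(history):
            return False            # pronoun with nothing to refer back to
        history.append((question, answer))
    return True

if __name__ == "__main__":
    coherent = [("Who was Tom Cruise married to?", "Nicole Kidman"),
                ("What was her Broadway debut?", "The Blue Room")]
    jumbled = [("What was her Broadway debut?", ""),
               ("Who was Tom Cruise married to?", "Nicole Kidman")]
    print(locally_coherent(coherent), locally_coherent(jumbled))  # True False
```

The check correctly separates example (1) from example (2), but only because entity continuity happens to surface as a pronoun; the discourse models developed later capture this continuity in a principled, entity-based way.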
Now that I have determined that QA series can be treated as a kind of discourse, it is necessary to explore its structure for the purpose of building discourse models for processing context questions. Traditionally, with implicit structures, a discourse puts individual sentences together to form a discourse structure. To have a better understanding of discourse structure, its representation, and their role in context question answering, I now review some discourse structure theories in linguistic literature. 3.4.2 Discourse structure theories There are various discourse structure theories trying to help accomplish such tasks as mentioned above. More importantly, based on different ways of representing discourse structure these theories offer different account for the coherence of a discourse. Kamp and Reyle [28]’s DRT (Discourse Representation Theory) has discourse repre- sentation structure (DRS) as the constructing units for a discourse. Each sentence in a discourse to be processed is dealt with using the context of a structure (i.e. DRS) re- sulting from processing the previous sentence. Not until the DRS for the whole discourse is built, can one interpret the discourse content represented by compositional semantics. The interpretation of a discourse is thus a dynamic and incremental process. One of the applications of DRS is to resolve reference for pronouns, because the DRS determines possible antecedents to anaphora. Yet one problem with this approach of anaphora res- 22 olution is that it both overgenerates and undergenerates the possible interpretations of pronouns. The problem occurs due to the notion of accessibility, which is so defined that basically anything would be accessible for the target anaphora. Grosz and Sidner [23]’s Intentional Discourse Model is based on three structures: segment structure, intentional structure and attentional structure. In this segment- hierarchically structured model, a discourse is considered as a built-up of small information chunks (i.e. discourse segments), which consist of a group of sentences. Each discourse segment conveys a communicative intention or a purpose that contributes to the overall goal of the discourse. Local coherence1 within a discourse segment is maintained through the operations of centers 2. In the meantime, the focus, which is the attentional state of this segment, depends on its intentional structure. This intention-based theory therefore has semantic elements built into the model. The core notion of intentions imposes struc- ture on discourse and thus makes discourse coherence possible. The major concern for this theory is how to infer intentions and it requires strong cognitive evidence and for the same reason, it is hard to compute and verify the attentional states. Centering Theory [29] upon which the work in Chapter 5 is based, is part of the Intentional Discourse Model. It relates the salience of entities in an utterance with the form of linguistic expressions and the local coherence of discourse. Another semantically driven discourse structure theory is Mann and Thompson’s RST (Rhetorical Structure Theory). Originally it was developed for the purpose of computer- based text generation, yet it has been established in linguistics too. Similar to Grosz and Sidner[23], RST represents discourse as hierarchically organized text based on rhetorical relations between the discourse parts often consisting of two spans of texts: the nucleus (the claim span) and the satellite (the evidence span). 
Relations are further classified into two types based on meaning: subject-matter relations expressing the content of the subject matter of the text (such as elaboration and concession) and presentational relations facilitating the presentational process of two text spans (such as list and join). However, inferring the relations is not easy, and the relations defined by RST seem ad hoc as well. Note that this theory is descriptive rather than generative, so the inferring of relations assumes the coherence of the discourse. A coherent discourse is built in such a way that any two text spans have at most one relation between them, and the recursive application of schemes (abstract patterns for constituent text spans and a specification of the relation between the spans) strings nuclei and satellites together and thus yields bigger text spans.

With their different discourse structures, the discourse structure theories have their own strengths and weaknesses in dealing with linguistic phenomena and tasks. Take anaphora resolution as an example. Techniques of anaphora resolution based on DRT have spread widely in computational linguistics, because DRT addresses itself specifically to the problem of anaphora resolution. The structure of RST, however, does not help with anaphora resolution, because it is descriptive and the basic constructing structure is on the sentence level. With the three structures modeled in Grosz and Sidner [23], it is also possible to conduct anaphora resolution because the attentional states make it possible to separate focus (entities) from the text spans of a discourse.

Essentially, different discourse structure theories explain discourse coherence using various discourse structures. Discourse representations based on these structures can thus be classified into two categories: coherence relation based and discourse entity based. Theories such as RST and DRT are discourse relation based, which focus on the relations that link discourse pieces together. Centering Theory, within the Intentional Discourse Model, is on the other hand entity-based. A discourse maintains its coherence through the entities that appear in it. Theories from both camps have made contributions to the field of computational linguistics.

3.4.3 Computational efforts

To properly capture discourse coherence, researchers in computational linguistics have made efforts to automatically identify discourse structures. One computational approach to discourse structure and discourse coherence was first described in Hobbs [30] and later in [31]. Discourse coherence relations between segments of a discourse are used to characterize discourse coherence. Such coherence relations as result and explanation are provided to establish coherence in a computational inference system. For example, in (3) the relation explanation holds between (3S1) and (3S2).

(3) S1. Max fell. S2. John pushed him.

The inference made by abduction for (3) is that the state or event asserted by (3S2) causes or could cause the state or event asserted by (3S1). Abduction3 inferences are made according to a certain interpretation algorithm. In order to interpret a sentence, first of all, the logical form (LF4) of the sentence has to be proved via abduction. Proving involves the logical form of the sentence that is derived from syntax and the constraints that predicates impose on their arguments.
According to Hobbs, the best explanation to which the inference is made would be the one with the least cost based on the weights assigned to the predicates. However, how to correctly get the logical form and to make the right assumptions is not trivial. Probabilities and heuristics have been introduced into this approach to ensure the best proof.

3 Abduction is the process by which, from (∀x)p(x) ⊃ q(x) and q(A), one concludes p(A). One can think of q(A) as the observable evidence, of (∀x)p(x) ⊃ q(x) as a general principle that could explain q(A)'s occurrence, and of p(A) as the inferred, underlying cause of q(A) (p.98) [31].
4 Usually derived from surface structure, logical form is the level of representation where linguistic expressions are assigned a representation of meaning.

Marcu's [32] work is a computational effort for automatically identifying discourse structure. Theoretically, this practice relies on the RST discussed above. The implementation of a decision-tree based machine learning approach relies on three components: corpus annotation, a discourse segmenter and a shift-reduce action identifier. Elementary discourse units (edus) are defined as small chunks of text. With annotation, edus and rhetorical relations are annotated manually according to Marcu [33]'s annotation protocol. A discourse segmenter is a decision-based algorithm that partitions the input text into edus. Features that model both local and global contexts (POS, boundary identification such as comma, discourse marker, etc.) are represented in a binary format. The shift-reduce action identifier is used to derive the discourse tree. The basic idea is to add an input sentence under consideration to the previous discourse tree and decide what the relevant rhetorical relation is. This is a rule-based process which also requires knowledge of structural (such as the number of trees in the stack), lexical (cue-phrase-like words such as although, but, etc.), syntactic (position of these phrases), operational (last parsing operations) and semantic-similarity-based (WordNet-based word similarity) features. As a result, local rhetorical relations between text spans and then a global discourse structure will be determined. This approach captures a lot of features; therefore its performance on segmentation and tree construction is better than on rhetorical relation identification, which heavily relies on manual annotation.

Stolcke et al.'s work [34] presents another corpus-based approach for identifying the discourse structure of a conversation, which is treated as a hidden Markov model. Similar to Marcu's work, lexical, collocational, and prosodic cues are used as features in probabilistic machine learning. Dialogue acts (DAs, such as statement, opinion, agreement, etc.) that represent the meaning of an utterance at the level of illocutionary force [35] are predicted. Constraints on the likely sequence of dialogue acts are modeled via a dialogue act n-gram5. Similar to Marcu's work, this study is based on the manual annotation of a conversation corpus. The probabilistic modeling is based on features that sometimes seem to be random and without linguistic motivation.

5 N-gram usually refers to an n-gram model, which is often used in natural language processing. An n-gram model predicts x_i based on x_(i-1), x_(i-2), ..., x_(i-n). For more information, please check http://en.wikipedia.org/wiki/N-gram

Having reviewed some computational work based on discourse structure and discourse structure theories, we noticed that there was no specific work addressing the context question problem using linguistic theories of discourse structure or discourse coherence.
3.5 The hypotheses

Halliday and Hasan [39] have classified five types of so-called text-forming devices: reference, substitution, ellipsis, conjunction and lexical cohesion. I believe that a QA discourse will present the same characteristic as having some of the "text-forming devices". That is, lexical ties are presented between questions. For instance, the reference he in (4Q1b) and the ellipsis in (4Q2b) are such cases.

(4) Q1a Who was Tom Cruise married to?
Q1b When was he born?
Q2a Where is the highest point on earth?
Q2b Where is the lowest?

These lexical ties form lexical cohesion relationships. Therefore I proposed Hypothesis I: similar to other types of discourses, a QA sequence presents the characteristics of having lexical cohesion relationships, which contribute to the first level of the coherence hierarchy. The question that arises is: are these devices good enough to handle context questions in QA? Let us look at example (5).

(5) Q1: What is the name of the volcano that destroyed Pompeii?
Q2: How many people died?
Q3: Any pictures?

Example (5) is a coherent QA discourse; however, there are no apparent lexical cohesion relations between (5Q2) and (5Q3). In both Chapter 5 and Chapter 6, I will present discourse models for context questions with and without lexical cohesion relations.

Cohesion relations are only concerned with the pattern of relating linguistic expressions, not with the patterns of content. Discourse coherence has been studied from different perspectives, as mentioned in the previous section. Different coherence relationships have been identified [31], [32]. It is well acknowledged that a "felicitous discourse must meet a rather strong criterion, that of being coherent" ([36], p.241). I believe that a QA discourse also observes various coherence relationships between questions or even among questions. According to Kehler (2004)'s classification, these relationships are resemblance, contiguity and cause-effect, each having different subcategories. For example, the relationship cause-effect is further broken into result, explanation, violated expectation and denial of preventer. Following the classifications of Rhetorical Structure Theory (Mann, 2003), (6Q1) and (6Q2) are coherent in that they form an evaluation relationship, that is, (6Q2) assesses the event in (6Q1).

(6) Q1: How often does the U.S. government conduct an official population census?
Q2: Is the census confidential?

Similarly, in example (7), (7Q2) being the context question of (7Q1), there is a causal relationship between the two questions (7Q1) and (7Q2), in addition to the explicit repetition of the word volcano.

(7) Q1: What is the name of the volcano that destroyed Pompeii?
Q2: Why did the volcano erupt?

However, the problem presented by example (1) is still not solved, that is, how to deal with context questions that do not have any reference clue or implicit rhetorical relationships, such as (1Q3) Who filed for the divorce? In this thesis, I will not discuss such discourse coherence relations as the evaluation and the causal relations in examples (6) and (7). Instead, I would like to look at the discourse coherence from a different angle. Empirical data shows that the TREC context questions are on the same target/topic.
This information may be essential for the follow-up questions that cannot be processed otherwise. Therefore I would like to seek relevant linguistic theoretical frameworks to help with the processing on top of the discourse cohesion relations. Centering Theory, adopted in Chapter 5, and the definiteness hierarchy, used in Chapter 6, are the results of such efforts. Centering Theory helps to explain the local coherence of a QA discourse, while the definiteness hierarchy also explains how a QA discourse organizes discourse entities in a way that coherence is maintained. In using them, I tried to capture the discourse coherence relations of the QA discourse. Therefore I proposed Hypothesis II, which states as follows: coherence relations are necessary to capture more semantic information in a QA discourse, and they operate at a level higher than lexical cohesion.

3.6 Summary

In this section, I have first clarified the notion of QA context and introduced the notion of QA discourse. In order to build discourse models for context questions, I believe that a series of questions can be treated as a kind of discourse, which shares some common features with other types of discourses. Then, I have reviewed some of the discourse structure theories that have been used in NLP research, hoping to provide insights and background information for my study on QA discourse processing. In the following chapters, I will mainly present different discourse models based on the theoretical frameworks to process context QA discourses.

CHAPTER 4
Data collection and platforms

4.1 Introduction

This chapter presents the data formats and implementation platform that are used for the current study. Section 4.2 introduces the architecture of the automated QA discourse processing system. Section 4.3 presents the context questions collected from my user study and the data from TREC 2004 and TREC 2005. Section 4.4 explains how the data was annotated for computational implementation of discourse models. Section 4.5 introduces both the user document collection and the AQUAINT corpus, the document collection upon which the document retrieval engine runs. Section 4.6 introduces the document retrieval engine used for the entire implementation. Section 4.7 concludes the chapter.

4.2 System overview

The prototypical system architecture for QA systems, as presented in Chapter 2, has four1 basic modules. The basic idea is to run the natural language question through the system and extract the exact answer to the question. Since the focus of my study is on the first component, that is, question processing, I will only evaluate the implemented system at the document retrieval level instead of at the answer extraction level or passage retrieval level.

1 Some systems incorporated document retrieval and passage retrieval into one module.
2 This module will be implemented in Chapter 5 and Chapter 6.

Figure 4.1. Flow chart of the centering-based automated system architecture (components: XML-format question, annotation, reference model, centering algorithm, conversion to query terms, document retrieval, MRR and coverage results).

As is shown in Figure 4.1, the dotted box represents the QA discourse modeling module2 of the system. This module takes input in the form of context questions formatted in XML with relevant linguistic information annotated. The output of this module is the query terms resulting from the three models. Then, they become the input to the Lemur retrieval engine.
The Lemur retrieval engine then returns a ranked list of documents. Based on the rank, evaluation is conducted by means of MRR and coverage, two evaluation metrics that will be discussed in Chapter 5.

4.3 Question collection

Two sets of data were used in the computational implementations in Chapter 5 and Chapter 6: context questions collected from a user study and factoid questions from the Text REtrieval Conference (TREC) 2004 and 2005.

4.3.1 User data

To support the investigation of different discourse models for processing context questions, a data collection effort was initiated through user studies. It was conducted in two steps: 1) design a user study and 2) administer a data collection session. In the first step, the following four topics were chosen for collecting context questions from subjects: (1) the presidential debate in 2004; (2) Hawaii; (3) the city of Pompeii; and (4) Tom Cruise. After the topics were decided, the key terms of these topics were entered into the Google search engine. A set of documents was manually selected from the search results. These documents contain relevant facts about each of these four topics.

In total, 22 subjects participated in my study. These subjects were recruited from the undergraduate students who took the Introduction to Artificial Intelligence class in the Computer Science Department at Michigan State University. The data collecting session was conducted in a classroom environment and was finished within 20 minutes. Each subject was given both an instruction sheet and a copy of the document collection3 prepared in the first step. On the instruction sheet, they were asked to put themselves in a position to acquire information about these topics. And they were asked to specify their information needs4 and provide a sequence of questions (no less than 6 questions) to address that need. As a result of this effort, I collected 87 sets (i.e., sequences of questions) with a total of 522 questions. Example (1) shows a set of context questions collected on the topic Tom Cruise. One subject put down his information goal as to "know more about Tom Cruise's personal life".

(1) Q1: What is Tom Cruise's full name?
Q2: How long was he married to Nicole Kidman?
Q3: Where was Nicole born?
Q4: Who is Cruise dating now?
Q5: When was Penelope Cruz born?
Q6: How old was Kidman when she moved to Australia?

Specifically, the following issues during the data collection were emphasized:

3 The document collection was provided because the subjects were instructed to ask questions, the answers to which were contained in the documents.
4 I.e. what information do the users want to find about these topics? Please see the exact instructions in Appendix A.

• The answer to each question should come from a different document, to enforce the use of the context for the subsequent questions. Users were provided with different paragraph segments, which originated in and were extracted from different documents on the same topic. Users were encouraged to ask questions to which these paragraph segments provide the exact answers. This way, it was guaranteed that the answers to each question came from different documents. This design is believed to be closer to a natural scenario because if some information has already been shown in the surroundings of the answer to a previous question, users may not even need to ask questions about that information.
Users tend to ask questions about facts that they have not seen during the information-seeking session.

• Each sequence of questions should be coherent in the sense that it should serve a certain information goal.

• Since the goal of this research is to investigate discourse processing for coherent question answering, concise questions that depend on the discourse are of particular interest. Therefore users were asked to provide questions that are as natural and concise as possible.5

5 I only collected concise context questions rather than complicated questions, because I wanted to keep the user data consistent with the TREC data used in my research. The TREC 2004 and TREC 2005 data are mostly very concise and short. Please see the detailed data analysis in Chapter 5.

This methodology of collecting context questions is motivated by the TREC evaluation, where the sequences of context questions were predefined by the National Institute of Standards and Technology (NIST).

4.3.2 TREC 2004 data

In addition to the data from the user study, TREC 2004 data and TREC 2005 data were also used for testing the discourse models developed in this research. As mentioned in Chapter 2, 65 sets of questions were finalized by NIST staff from the TREC assessors' original questions. The question types include factoid, list and other. For the purpose of evaluation, the final test sets aimed to ensure that there would be actual answers to the factoid and list questions and sufficient information for the other questions in the document collections6. Because of this criterion, a lot of original questions were eliminated. TREC 2004 question sets also avoided questions that would specifically mention the answers to previous questions. Therefore, strictly speaking, TREC 2004 question sets are not "true samples of naturally occurring user-system dialog" [7].

6 However, unfortunately there were still some questions that did not have answers in the document collections.

The main task in the TREC 2004 QA track was to answer questions that were grouped by targets. In other words, each set of questions comes with a predefined target. Example (2) is taken from TREC 2004, where the target is Jar Jar Binks. The targets of the TREC 2004 test sets are listed in table 4.1. Of the 65 targets, 23 are about person, 25 are about organization, and 17 are about thing. There are a total of 230 factoid questions, 56 list questions, and 65 other questions. The number of questions per series ranges from 4 to 10.

(2) Q1: What film introduced Jar Jar Binks?
Q2: What actor is used as his voice?
Q3: To what alien race does he belong?

Since the TREC data was also designed to test system capability of answering list and definition questions, which are not the focus of this work, those questions will be omitted in the evaluation that is covered in Chapters 5 and 6. For the TREC 2004 data, I only focus on the 230 factoid context questions in the analysis and evaluation. It should be noted that there were still 26 factoid questions that do not have answers in the corresponding document collection, that is, the AQUAINT corpus.

Table 4.1. Targets of the TREC 2004 test sets.

4.3.3 TREC 2005 data

Similarly, in TREC 2005's QA main track, there were 75 sets of question series, the targets of which are listed in table 4.2 and table 4.3. In addition to the target type thing (such as
F16, Louvre Museum, and Virginia wine), person (such as Bing Crosby) and organization (such as United Parcel Service (UPS)7 and American Legion), a new target type, event, was introduced into the TREC 2005 data sets. Events could be presented as a nominal event such as Preakness 1998 or a description event such as 1998 indictment and trial of Susan McDougal or France wins World Cup in soccer. Of the 75 sets, 18 were event, 19 were thing, 19 were organization and 19 were person. The numbers show that NIST staff managed to keep the target types evenly distributed. There were a total of 362 factoid questions, 93 list questions and 76 other questions. For the same reason I mentioned above, only the 362 factoid questions will be included in the implementation and evaluation. Note that there were 30 factoid questions that do not have answers in the TREC 2005 data. Therefore, if the TREC 2004 and TREC 2005 data are put together, there are 140 sets with a total of 592 factoid questions, 56 of which do not have actual answers. This no-answer factor will be considered in the evaluation for the work in Chapter 6.

7 Most of the time, both the full name and the acronym of the organization, if there is one, are given.

Table 4.2. Targets of the TREC 2005 test sets.
Table 4.3. Targets of the TREC 2005 test sets (continued).

4.4 Data format

Next, I will introduce the data format used in my research. NIST provided a single document for each year's QA test sets. They can be found at NIST's website8. Example (3) was taken from the TREC 20059 question set document10, where all the information about the questions was encoded using XML tags.

8 http://trec.nist.gov/data/qa.html
9 TREC 2005 questions can be found at http://trec.nist.gov/data/qa/2005.qadata/QA2005.testset.xml
10 TREC 2004 questions can be found at http://trec.nist.gov/data/qa/2004.qadata/QA2004.testset.xml

What is XML then? XML stands for eXtensible Markup Language, which is a W3C11 recommendation. XML is non-proprietary and extensible. Unlike natural languages, it is a computer language designed to describe data. In other words, it tells us what data it is by using tags. The tags are not predefined but defined at the user's will. This language allows users to define their own tags, anything from a self-explanatory label in a natural language such as "this is a TREC question" to a sign such as "TREC_Q".

11 World Wide Web Consortium

Note that the purpose
of XML is to describe textual data; therefore it is desirable to define meaningful names for the data of interest. Due to this feature of XML, the descriptive information stored by XML can be used across any platform. Another feature is that the data format of XML is reusable. Because XML is accepted and backed as an international standard, the format will be available for later access and processing.

With the evident advantages of XML, one would be eager to know how XML describes data. As a matter of fact, a DTD (Document Type Definition) is used to define the building blocks of an XML document. A DTD allows users to define the document structure with a list of legal elements. To avoid jumping too much into technical details, I now discuss the elements that were defined in the DTD for the TREC data, and then I will present the customized DTD for the purpose of building my discourse models.

4.4.1 Original TREC data format

The DTD must reflect what part of the data is annotated. The first type of information being annotated is for the convenience of programming. Housekeeping information about questions, such as question ID, question sequence boundary, answer to each question, etc., was annotated. The original data provided by TREC is shown in example (3).

(3) <trecqa year="2005" task="main">
<target id="66" text="Russian submarine Kursk sinks">
<qa><q id="66.1" type="FACTOID"> When did the submarine sink? </q></qa>
<qa><q id="66.2" type="FACTOID"> Who was the on-board commander of the submarine? </q></qa>
<qa><q id="66.3" type="FACTOID"> The submarine was part of which Russian fleet? </q></qa>
<qa><q id="66.4" type="FACTOID"> How many crewmen were lost in the disaster? </q></qa>
<qa><q id="66.5" type="FACTOID"> Which countries expressed regret about the loss? </q></qa>
<qa><q id="66.6" type="FACTOID"> In what sea did the submarine sink? </q></qa>
<qa><q id="66.7" type="FACTOID"> Which U.S. submarines were reportedly in the area? </q></qa>
<qa><q id="66.8" type="OTHER"> Other </q></qa>
</target>
</trecqa>

The information indicates the year of the TREC QA track (2005), the QA track task type (i.e. main), the question target (Russian submarine Kursk sinks), the target id (66), the individual question id (e.g. 66.1, 66.2, etc.), and the question type (FACTOID) in the series. The only information that is not encoded in the test sets is the target type. That information is given separately in another TREC document12. Another note on the data format is the numbering of the series. Since TREC 2004 has 65 sets, the TREC 2005 data sets continued the numbering from 66 till 140. Example (3) was actually the first set from the TREC 2005 data.

12 http://trec.nist.gov/data/qa/2005.qadata/05.target_type.txt

4.4.2 Customized data format

Besides the basic information (such as the target, the type, the ID, etc.)
of a question set, the second type of information is the linguistic information that needs to be encoded for the questions. There are two reasons to keep using XML as the annotation language to annotate such information as well. First, since the TREC 2004 and the TREC 2005 data were provided in the language of XML, it is wise to use XML to keep the consistency in that aspect. In doing so, both datasets can be processed and evaluated using the same programs. Second, it is because of the self-explanatory, portable characteristics of XML discussed in the previous section. Next, I will present the customized data format for the user data and the TREC data.

The goal of this study is to build discourse processing models for context questions based on linguistic knowledge. For the convenience of conducting evaluations on the models, I annotated the relevant linguistic information manually, so as to keep the research focused on modeling context questions rather than on automatically identifying the linguistic information used for the modeling. In the following section, I will present the annotation schemes for the experiments discussed in Chapter 5 and Chapter 6, respectively.

In Chapter 5, the semantics of noun phrases is the focus. Special efforts were made to annotate noun phrase types, for example, pronoun, definite description, indefinite description, etc. For the purpose of pronoun resolution, agreement constraints (i.e. number, gender and animacy) are annotated. The non-personal pronoun it is annotated to indicate whether it refers to an object, event, situation or other. The grammatical roles of noun phrases were also annotated. The following grammatical roles are identified: possessor, subject, predicate nominal, object, indirect object, adverbial prepositional phrase and other. In addition, the definiteness information of noun phrases was annotated. NP types include demonstrative, pronoun, definite description, possessive NP, proper name, and indefinite NP. Finally, adverbials of each question were annotated. The following adverbials are identified: location, time, reason, modifier and adverb.

Example (4) shows what an annotation of a question looks like. The annotation tag q_id="q2" shows that the question why did Tom Cruise begin acting? is the second question of a context question sequence. And the tag prev_q="q1" shows that its preceding question has the question ID q1. The answer tag indicates that the answer to the question q2 is in the document QA880002-0003. For some information, such as the topic, the annotation tells us that the topic of the original question starts at token 5 and ends at token 6, which points to the term "acting". The noun phrase Tom Cruise, which is annotated as starting at token 3 and ending at token 4, is further annotated using the tags male, subject and proper name. Moreover, in addition to the topic information, the question word why is annotated as focus. It is for the convenience of future work on question processing strategies that might use the information of question type.

(4) <q prev_q="q1" q_id="q2"> Why did Tom Cruise begin acting ? QA880002-0003
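The annotated questions in this XML form are what the discourse processing module described in Section 4.2 takes as input. Purely as an illustration, and not as the dissertation's actual code, the short Python sketch below reads one such record with the standard xml.etree library; the answer child element and the attribute spellings q_id and prev_q are assumed from the cleaned-up fragment in example (4), and the real DTD may use different names.

# Illustrative sketch only: read one annotated question in the style of
# example (4). The <answer> element and the attribute names are assumptions.
import xml.etree.ElementTree as ET

SAMPLE = """<q prev_q="q1" q_id="q2">
Why did Tom Cruise begin acting?
<answer>QA880002-0003</answer>
</q>"""

q = ET.fromstring(SAMPLE)
print(q.get("q_id"), "follows", q.get("prev_q"))   # q2 follows q1
print("question:", q.text.strip())                 # Why did Tom Cruise begin acting?
print("answer document:", q.findtext("answer"))    # QA880002-0003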
The 522 user study questions and the 230 TREC questions were annotated this way for the study in Chapter 5.

For the experiments conducted in Chapter 6, in addition to the semantic type, grammatical role and NP property information of an entity that were already annotated in the TREC 2004 and the user data, I added more detailed syntactic information to NPs, definite descriptions in particular. I annotated an entity13 in such a way that its corresponding expression was split into premodifier, head and postmodifier. If the premodifier or postmodifier corresponds to an entity itself, that entity will be treated as a separate entity and thus annotated too. The discourse old/new status of definite descriptions is also annotated for evaluation purposes. The target was also annotated in the same fashion using this scheme. The semantic type of the target was annotated according to the information TREC provided. The following is an example:

13 An event is also decomposed into entities and actions (predicate verbs).

(5) Crips When was the first Crip gang started ? <ENTITY sem_type="person-singular" gram_role="subject"> </ENTITY> </ENTITY> </ACTIVITY> </TOPIC>

Based on this annotation scheme, the 230 TREC 2004 factoid questions and the 362 TREC 2005 factoid questions were annotated accordingly.

4.4.3 Annotation validation

After the annotations were done, two validations were conducted. Firstly, the syntax of each annotation was checked. All the data were manually annotated using the Oxygen14 XML editor 5.1. With the help of the validating tool in the software, the annotation was checked against the DTD to ensure that the tagging and the hierarchical relations between tags were correct.

14 http://www.oxygenxml.com

Secondly, the Stanford Lexicalized Parser15 (lexparser) version 1.4, a probabilistic lexicalized natural language context-free grammar parser, was used. All the data was run through lexparser on a Unix system to validate the linguistic information that was annotated (i.e. the NP type and the grammatical role of each entity). Also, some of the statistics presented in the data analysis section were obtained by using the lexparser (such as the number of definite descriptions in the data). The following is an example of the resulting parse tree for the question why did he start acting?

15 http://www-nlp.stanford.edu/software/lex-parser.shtml

(6) Q: Why did he start acting?
(SBARQ
  (WHADVP (WRB Why))
  (SQ (VBD did) (NP (PRP he)) (VP (VB start) (S (VP (VBG acting)))))
  (. ?))

The parse tree not only shows the syntactic structure of the question, but also indicates the NP types of noun phrases. The parse tree tells us that he is an NP and also a pronoun (i.e. (NP (PRP he))). The main verb of the question is in the VP at a higher level, that is, the action verb start. This information is important because the determination of the question topic relies on the identification of the predicate verb of the main sentence for each question. Also, the parser labels determiners, which helps to identify definite descriptions. This information is used in Chapter 6.16

16 Apparently, the parser output is less rich than my annotation. In a fully automated system, the annotated information could be obtained by using other NLP software (such as taggers, a named-entity recognizer, etc.) and feature heuristics.

4.5 Document collections

In this section, I will introduce the document collections that were used in the document retrieval component of the study, first the user study documents, and then the AQUAINT corpus.
4.5.1 User study documents

As I have mentioned in Section 4.3.1, the 22 subjects were provided with a document for each topic. Each paragraph of the document was chosen from a Web search through Google17. The following is a part extracted from the Tom Cruise document that was provided to the subjects.

17 http://www.google.com

(7) Tom Cruise

• That million megawatt smile has helped Tom Cruise reach the pinnacle of his profession and stay there. He's a down-to-earth movie star with huge box-office hits under his belt such as Top Gun and Mission: Impossible.

• Thomas Cruise Mapother IV was born on the 3rd of July, 1962 (eerily similar to his film Born on the 4th of July), in Syracuse, New York. He was the only boy of four children. Since his father was an electrical engineer, he must have inherited his love for acting from his mother, who was a teacher.

• His acting career really began because he injured his knee in high school and was forced to quit the amateur wrestling team.

• His popularity took a beating in movies like All the Right Moves in 1983, followed by Legend in 1985. Cruise's career began to solidify during his signature hit of the 1980s, Top Gun.

For each of these paragraphs, I created a pseudo document that contains it and filled up the document with more relevant texts on the topics using some of the Google18 search results. I ended up having 52 such documents: 11 pseudo documents on Hawaii, 15 on Pompeii, 13 on Tom Cruise, and 13 for the 2004 presidential debate. All these documents were combined into one file using the document format that will be introduced in Section 4.5.3.

18 I used the same topic as the keyword input to the Google search user interface (http://www.google.com).

4.5.2 AQUAINT corpus

The AQUAINT19 corpus was prepared by the Linguistic Data Consortium hosted at the University of Pennsylvania for the AQUAINT Project, and is used in official benchmark evaluations conducted by NIST. This corpus consists of newswire text data in English, drawn from three sources. The data is distributed on two CDs. CD1 (about 1.37GB) contains all the text data from the New York Times News Service. CD2 (about 1.67GB) contains all the text data from the other two sources, that is, the Xinhua News Service (from the People's Republic of China) and the Associated Press Worldstream News Service. Within each source, the data files are organized by year, and within each year, there is one file per calendar day and the file name reflects the source and date (e.g. 19980605_NYT.html). The total amount of data is about 3.04GB, with 3344 files.

19 http://www.ldc.upenn.edu/Catalog/docs/LDC2002T31/

4.5.3 Document format

There are two sets of document collections used in the research. The main document database is the AQUAINT corpus of English News Text. The other set is the document collection prepared for the user study. In order to run the document retrieval engine on these data, both datasets have to use the same data format, which will be discussed in the following paragraphs. The format of each document file is rather simple. All data files contain a series of news stories as a concatenation of DOC elements (i.e. blocks of text bounded by <DOC> and </DOC> tags). Example (8) shows a segment of such a data file.

(8) <DOC>
<DOCNO> NYT19980601.0001 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 1998-06-01 00:02 </DATE_TIME>
<TEXT>
<P>

NEW YORK - Kenneth Joseph Lenihan, a New York research sociologist who helped refine the scientific methods used in criminology, died May 25 at his home in Manhattan. He was 69.
</P>
</TEXT>
</DOC>
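As described in the next paragraph, the user study documents had to be put into this same markup before they could be indexed together with the AQUAINT data. The following is a minimal sketch of such a conversion; it is illustrative only, it is not the conversion script actually used, the helper name and the document-number scheme are invented, and only the tags visible in example (8) are reproduced.

# Illustrative sketch: wrap a plain-text paragraph in the AQUAINT-style DOC
# markup of example (8). Function name and document numbering are invented.
def wrap_as_doc(doc_no, text):
    return ("<DOC>\n"
            "<DOCNO> " + doc_no + " </DOCNO>\n"
            "<TEXT>\n<P>\n" + text + "\n</P>\n</TEXT>\n"
            "</DOC>\n")

paragraph = ("Thomas Cruise Mapother IV was born on the 3rd of July, 1962, "
             "in Syracuse, New York.")
print(wrap_as_doc("USERDOC.TOMCRUISE.0001", paragraph))

Concatenating the output of such a wrapper for all 52 pseudo documents yields a single file in the same format as the AQUAINT data files.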

The user documents on the four topics were first collected without any special markup; therefore they were converted to the format of the AQUAINT data. They were then concatenated into one file, which was later merged with the AQUAINT corpus for the Lemur retrieval engine to form a data index.

4.6 Document retrieval engine

The reason for introducing the document retrieval engine Lemur is that Lemur's indexing function is able to index the AQUAINT documents and the document collection prepared for the user study in TREC format. Lemur 2.0.3 was used as the document retrieval engine for this research. The Lemur Project is sponsored by the Advanced Research and Development Activity in Information Technology (ARDA) and by the National Science Foundation. The Lemur toolkit supports indexing of large-scale text databases and the implementation of retrieval systems based on a variety of retrieval models. The Lemur system can run both under Unix/Linux operating systems and under Windows. In this research, the Lemur toolkit was installed on a Linux server where all the programs for the different discourse models ran.

Although Lemur supports many useful features for researchers in language modeling and information retrieval, indexing and retrieval are the two most useful for the current research in question answering. Lemur is able to form a sequence of documents for indexing after tokenization, stemming and removal of stop words. The retrieval engine then takes converted query terms as input and returns ranked documents as output. The query terms have to be converted to the TREC document format. Example (9) is an input in the converted format for the retrieval engine, where the query terms are the same as the original question. In the case of modeling context questions, the query terms will be whatever terms an original question results in after it runs through a discourse processing algorithm.

(9) <DOC>
<DOCNO> 1394 </DOCNO>
In what country did the game of croquet originate?
</DOC>

Although Lemur supports passage retrieval, cross-language retrieval and other retrieval technologies, the document level ad hoc retrieval function was chosen for the discourse model evaluation and analysis in this work. Example (10) is an output for document level retrieval. The most important information in the output is the rank. The document that contains the answer to the question is supposed to be ranked higher than other documents, which may contain relevant information about the topic but not the answer. In the later chapters, I will discuss the evaluation metrics used for the study, which involve the ranking of the returned documents.

(10) query no.  document no.      rank  score
1394  Q0  NYT19980813.0364  1   109.842
1394  Q0  NYT19980812.0066  2   102.185
1394  Q0  NYT19980704.0174  3   87.2226
1394  Q0  NYT19980626.0194  4   70.6496
1394  Q0  NYT19980813.0462  5   64.5089
1394  Q0  NYT19980813.0457  6   64.5089
1394  Q0  NYT19980813.0460  7   64.5089
1394  Q0  NYT19980813.0459  8   64.5089
1394  Q0  NYT19980707.0070  9   47.3473
1394  Q0  NYT19980813.0240  10  46.6417

4.7 Summary

In this chapter, I first presented the data that was used in the context question discourse modeling. Then I proceeded to present the data formats and explain the question annotation details. Relating to data processing, other resources such as the Stanford lexparser, the AQUAINT corpus, and the Lemur retrieval engine were also introduced to enable a better understanding of the technical implementation of the research.

CHAPTER 5
Centering-based discourse modeling
5.1 Motivation

Question answering (QA) systems take users' natural language questions and automatically locate answers from large collections of documents. In the past few years, automated QA technologies have advanced tremendously, partly motivated by a series of evaluations conducted at TREC [6]. To better address user information needs, recent trends in QA research have shifted towards complex, context-based, and interactive question answering [6], [37] and [38].

Chapter 2 recounts how the National Institute of Standards and Technology (NIST)1 initiated a special task on context question answering in TREC 10 [6], which later became a regular task in TREC 2004 [7] and 2005. The motivation is that users tend to ask a sequence of related questions rather than isolated single questions to satisfy their information needs. For example, suppose a user is interested in finding out information about the ecological system in Hawaii. To satisfy this information goal, the user may need to specify a sequence of related questions as follows:

1 http://www.nist.gov/

(1) Q1: Where is Hawaii located?
Q2: What is the state fish?
Q3: Is it endangered?
Q4: Any other endangered species?

This QA process is considered coherent because the questions are not isolated, but rather evolving and related to serve a specific information goal. I treat the question sequence as a mini discourse2 in which each question relates to its preceding question(s). For example, question (1Q2) relates to (1Q1) since it asks about the state fish of Hawaii. Question (1Q3) relates to (1Q2) about the endangerment of the fish, and (1Q4) relates to the whole discourse about other endangered species in Hawaii. This example indicates that each of these questions needs to be interpreted in a particular context as the question answering process proceeds. From a linguistic point of view, these questions are related to each other through different linguistic expressions and devices such as the definite description the state fish in (1Q2), the pronoun it in (1Q3), and the indefinite noun phrase any other endangered species in (1Q4). In other words, there tend to be relationships between successive questions that users ask QA systems. However, it is still not known how best to make use of these relationships to facilitate answer retrieval. This chapter takes up this issue by considering how the relationships between successive questions can improve the query expansion step in QA.

2 See discussions in Chapter 3.

Toward an answer, two linguistic levels of representation are relevant. These two levels would capture two kinds of relationships between questions in a QA mini-discourse. The first level is called "lexical cohesion". This level would capture the lexical cohesion relationships between questions, mainly the repetition of the same/similar words or reference to the same entity. Halliday and Hasan [39] have classified five types of so-called text-forming devices: reference, substitution, ellipsis, conjunction and lexical cohesion. I believe that a QA discourse presents the same characteristic as having some of the "text-forming devices". That is, lexical ties are presented between questions. For instance, in (2), the pronoun reference he, the repetition of Dublin, the synonyms film and movie, and the ellipsis in (2Q4b) are used in the context questions.

(2) Q1a: Who was Tom Cruise married to?
Q1b: When was he born?
Q2a: What river runs through Dublin?
Q2b: What is the most famous export for Dublin?
Q3a: Where was the movie "Somewhere in Time" filmed?
Q3b: What other films were shot there?
Q4a: Where is the highest point on earth?
Q4b: Where is the lowest?

Similar to other types of discourses, sequences of questions in QA have lexical cohesion relationships, which contribute to the first level of representation. This level is motivated by a psychological notion called lexical entrainment. Lexical entrainment refers to the linguistic adaptation presented in natural speech where people repeat the same kinds of words, given a choice. Lexical entrainment has been observed in natural conversations, and psycholinguists have conducted various experiments trying to explain it. Pickering and Garrod's [40] experimental results show that in conversations, speakers tend to use lexical entrainment for communication ease and efficiency. Historically, the passing of references during conversation is regarded as one of the explanations of lexical entrainment because referring is a collaborative process [41]. Brennan ([42], p.41) claims that in conversation, "people achieve conceptual pacts, or shared conceptualizations, which they mark by using the same terms". Lewis [43] points out that in a dialogue speakers play a game of cooperation; for instance, they use similar words referring to the same entities. Empirically, lexical entrainment results in the lexical ties that are exemplified in (2) (i.e. repetition for the "same word" use, synonyms for the "similar word" use, and reference for the "same entity" use). Moreover, Pickering and Garrod [40] claim that once dialogue is studied, it displays properties of entrainment at all levels (semantic, syntactic, phonetic, and others). The syntactic coordination/alignment examined by Branigan et al. [44] is an example at the syntactic level. However, this part of my study is only concerned with the lexical level. The discourse models that are proposed in this chapter should at least be able to capture the information reflected by the lexical cohesion relationships.

The second level of linguistic representation, "discourse coherence", is proposed as the other level of representation to capture the topic relationships between questions. I postulate that context questions are easier to process if a system can identify the key information that a discourse carries. From empirical experience, I notice that the topic of a QA discourse may stay the same or shift between questions. The topic information may be essential for the follow-up questions that cannot be processed otherwise. For example, in (3)

(3) Q1: Who was Tom Cruise married to? (A1: Nicole Kidman)
Q2: What was her Broadway debut? (A2: The Blue Room)
Q3: Who filed for divorce? (A3: Tom Cruise)

Tom Cruise and Nicole Kidman are two entities involved in the first two questions, and intuitively the information on them is the common ground knowledge, or the set of potential topics, for the third question. This can be used to reformulate (3Q3) into a more explicit question, "Who filed for divorce, Tom Cruise or Nicole Kidman?", which will greatly increase the chance of hitting the right answer for an IR system. Therefore the "discourse coherence" level attempts to identify the topicality of a discourse, which I believe is central to the deep understanding of a coherent QA discourse.

Given the two levels of representation, how do they help with the query expansion part of IR systems? In answering this question, I find that Centering Theory is a very appropriate framework for solving the context QA problem. Firstly, based on lexical cohesion relationships, I hope to improve IR by adding resolved anaphors or terms used in earlier questions. Among the five types of lexical ties, pronominal reference will be my focus3.

3 Other types of lexical relationships such as ellipsis will not be discussed in this study.

Centering Theory is, in part, a theory of pronominal reference use, therefore it can provide
Firstly, based on lexical cohesion relationships, I hope to improve IR by adding resolved anaphors or terms used in earlier questions. Among the five types of lexical ties, pronominal reference will be my focus3. Centering Theory is, in part, a theory of prononimal reference uses, therefore it can pro- 3Other types of lexical relationships such as ellipsis will not be discussed in this study. 53 vide pronoun resolutions for context questions. Secondly, based on topic relationships, I hope to improve IR performance by selectively adding terms for query expansion by detecting topic change in the context QA discourse. Centering Theory, again, can fulfill the task because it also formalizes whether the topic stays the same or not. In this chapter, I present a linguistic knowledge driven approach that aims to tie the lexical cohesion level with the discourse coherence level together based on Centering The- ory [29]. This part of the study examines how Centering Theory can be used to process discourse and link key pieces of information together from a sequence of context questions. In particular, three models based on Centering Theory have been implemented to model the question discourse and guide query expansion. The models are: (1) an anaphora model that resolves pronoun references for each question, (2) a forward model that adds query terms from the previous question based on its forward looking centers, and (3) a transition model that selectively adds query terms according to the transitions identified between adjacent questions. In the current investigation, rather than a complete end-to-end study, I focus on dis- course processing of questions for query expansion. Given a context question, the approach examines the discourse of questions that lead up to the current question and identifies key entities in the discourse to help form query terms for the current question. A good retrieval component based on the expanded queries can be integrated with other sophisti- cated answer extraction techniques to improve the end-to-end performance. In particular, I evaluated the three models concerning their performance in document retrieval on two data sets: the data collected in the user studies and the data provided by the 2004’s TREC. The empirical results indicate that Centering Theory based approaches provide better performance for entity related context questions (e.g., about Hawaii) as opposed to event-based context questions (e.g., about the presidential election in 2004). The transition model and the forward model consistently outperform the reference resolution model. This chapter focuses on modeling discourse processing for context questions and it is organized as follows. Section 5.2 presents a review on Centering Theory. Section 5.3 54 describes the three models for discourse processing based on Centering Theory. Finally section 5.5 presents the empirical evaluation and discusses the potentials and limitations of these models. Section 5.6 summarizes the chapter. 5.2 Centering Theory In this section, I will spend some time reviewing CT basics. It is necessary because any possible misunderstandings of the theory would cause mistakes in the empirical imple- mentations. Centering Theory is an intellectual work that has been evolved and developed since the 19703. It is well-established and continues to ignite various research interests. 
Centering Theory and the centering framework discussed in Grosz, Joshi and Weinstein 1995 [29] were developed from three sources:

• the early centering theory, which includes the unpublished manuscript from Grosz, Joshi and Weinstein [45] and computational applications in dialogue systems ([46], [47], [48], [49], and [50]);

• the computational theory of definite anaphora understanding and interpretation, and the focusing algorithm for capturing local discourse coherence ([51], [52], and [53]);

• the relationship between the computational inference load and the change of focusing state ([54], [55], [56], [45]).

Grosz, Joshi and Weinstein's work [29] integrates the three strands of work and presents the framework of centering. As mentioned in Chapter 3, Grosz and Sidner's work [23] provides a theory of discourse structure of which centering is a part. As a computational model for discourse interpretation, Centering Theory aims to identify the mechanism of how a discourse maintains its local coherence (within a discourse segment) using various referring expressions.

Centering Theory has been established as a linguistic theory and computational model that relates the local focusing of attention, inferencing complexity and the linguistic referring expressions in a discourse segment. It helps explain some linguistic phenomena that other semantic and inferential theories would not be able to explain. For instance, consider example (4) (from [12], p.692):

(4) S1: John saw a beautiful Acura Integra at the dealership.
S2: He showed it to Bob.
S3: He bought it.

Centering Theory predicts that he in (4S3) prefers the interpretation of being John instead of Bob, because this would result in less inferential load and make this mini discourse more coherent. According to CT, which will be discussed more in a moment, John in (4S1) and the discourse entity co-specified by he in (4S2) are the centers of utterances (4S1) and (4S2), respectively. If the pronoun he in utterance (4S3) is interpreted as the entity Bob instead of John, the center of (4S3) would shift to be Bob. This would result in a less coherent discourse. And according to the CT claims (to be discussed later), it would also require more inferencing and processing effort for readers to understand the utterances. Intuitively the discourse would be more coherent if he in (4S3) is interpreted as John than if it is Bob; therefore CT explains why a different interpretation of the referring expression would affect the local coherence of discourse.

5.2.1 Grosz, Joshi and Weinstein (1995)

After the unpublished draft of Grosz, Joshi and Weinstein ([45], henceforth GJW86), Grosz, Joshi and Weinstein provided a series of work on centering in 1995 (henceforth GJW95). It summarizes some early work, clarifies some of the early claims, and unfolds a whole framework of centering. It also opens up various research directions for further exploration. In GJW95, Centering Theory is formulated as a theory regarding focus of attention, referring expressions, and local coherence within a discourse ([29]). The major claim that CT makes is that discourse segments that keep mentioning the same discourse entities are more coherent than those in which different entities are mentioned. GJW95 formulates this claim as follows: in a discourse segment, each utterance has a unique entity4 which functions as a linking device to the previous utterance. The mechanism built upon this core entity explains the local coherence through a set of hypotheses, rules and constraints.

4 It is named the backward looking center.
5.2.2 Terminologies and definitions

In order for the reader to fully understand CT, it is necessary to clarify some of the terminologies and their definitions. These terms are local coherence, focusing/attentional state, center and realization.

The term discourse coherence in the centering framework actually refers to local coherence as opposed to global coherence, two levels in a three-level discourse structure (Grosz, Joshi and Weinstein [56], henceforth GJW83). In addition to an attentional structure and an intentional structure, GJW83 specifically describes a linguistic structure which mainly concerns the actual organization of sentences. Within this linguistic structure, a discourse consists of smaller chunks of discourse, named discourse segments. Global coherence and local coherence are thus described in terms of discourse segments. Global coherence is the "coherence with other segments in the discourse". Local coherence is the "coherence among the utterances in that segment" ([29], p.204). "Local coherence refers to the ways in which individual sentences bind together to form larger discourse segments. It depends on such things as the syntactic structure of an utterance, ellipsis, and the use of pronominal referring expressions [Sidner, 1981]" ([56], p.44).

Corresponding to these two levels of coherence, there are two levels of focusing: global focusing and local focusing. It is local focusing that becomes centering. The notion of focusing has been associated with different terminologies. Focusing was introduced into CT by Sidner [51, 52, 53] and later replaced by the term attentional state in Grosz and Sidner ([23]). GJW95 describes it as "an abstraction of the focus of attention of the discourse participants", the participants being either speakers/writers or hearers/readers for a discourse of either a dialogue or a written text. In other words, focusing tracks speakers'/hearers' attention on discourse entities. Grosz and Sidner ([23], p.175) describe focusing as "inherently dynamic, recording the objects, properties, and relations that are salient at each point in the discourse". GJW95's description places more emphasis on the cognitive status of the discourse participants than on the discourse objects focused on by the participants.

I would like to clarify the notion of local coherence because in my work, a context question sequence is treated as a mini discourse, and the discourse coherence hypothesized in Chapter 3 between/among context questions is local coherence. Centering Theory claims that local coherence is captured by the operations on centers. A center is "regarded as an ascription of a property to a single individual" ([54], p.435). This description implies that a center is not a linguistic expression but an entity that the expression refers to. GJW95 describes centers as those entities serving to link utterances together coherently. In order to distinguish the linguistic expressions and the centers they refer to, I will italicize the linguistic expressions while discussing the examples in this chapter.

An essential term for understanding centers is directly realizes. It is closely related to the semantics of centers. In GJW95, the notion of directly realizes is stated as follows. "U5 directly realizes c if U is an utterance of some phrase for which c is the semantic interpretation. Realizes is a generalization of directly realizes" ([29], p.209). It is through directly realizes that centers relate to linguistic expressions. The realization relation makes it possible to have computational representations for centers in that the expressions are associated with the corresponding discourse entities for valid semantic interpretation. Simply, it allows computational processing on discourse entities. For us, it provides a way to represent entities from a QA discourse.

5 U stands for an utterance, and c stands for a center.

Having discussed some of the important terminologies used in GJW95, I now turn to the constraints and rules that GJW95 presents for the CT framework.
It is through directly realizes that centers relate to linguistic expressions. The realization relation makes it possible to have computational representations for centers, in that the expressions are associated with the corresponding discourse entities for valid semantic interpretation. Simply put, it allows computational processing on discourse entities. For us, it provides a way to represent entities from QA discourse. Having discussed some of the important terminologies used in GJW95, I now turn to the constraints and rules that GJW95 presents for the CT framework.

5.2.3 Constraints and rules

In this section, the constraints and rules described in GJW95 will be discussed in detail. They are fundamental to centering algorithms and centering-based applications. GJW95 differentiates the backward looking center (represented as Cb) from the forward looking centers (represented as Cf) according to their discourse properties. The backward looking center is the linking device between an utterance and its preceding utterance, and it is therefore considered crucial in maintaining local coherence. From a cognitive perspective, the repeated mention of this entity makes it easier for the discourse participants to access it in memory, and therefore more "retrievable into consciousness" in Chafe's [57] terms.

In addition to the backward looking center, Centering Theory also provides insight into the preferences for interpreting subsequent discourse through the notions of forward looking center and preferred center. GJW95 defines the forward looking centers as the set of entities mentioned in an utterance (represented as Un); they were initially defined as those entities that are arguments of the main predicate in JK79 and JW81. They are what the succeeding utterance Un+1 may be linked to. The term preferred center (represented as Cp) was introduced by Brennan, Friedman, and Pollard [1] to represent the highest-ranked member of this set; it roughly corresponds to Sidner (1983)'s expected focus. Only when the preferred center of Un is realized in Un+1 is it defined as the backward looking center of Un+1. Note that the preferred center only predicts the backward looking center of Un+1; it does not have to be realized in Un+1. CT claims that local coherence is maintained if the preferred center of Un is realized in Un+1 and thus becomes the backward looking center Cb of Un+1.

To sum up, CT assigns different discourse status to the entities in an utterance. The repeatedly mentioned one, i.e. the backward looking center, is the participants' focus and the most accessible in memory, and therefore receives the most discourse prominence; all the candidate entities in Un for the backward looking center are the forward looking centers; and the highest ranked forward looking center is the preferred candidate for becoming the backward looking center of Un+1, and is therefore named the preferred center. Walker, Joshi and Prince ([58], p.3) state that the "distinction between looking back to the previous discourse with the Cb and projecting preferences for interpretation in subsequent discourse with the Cp is a key aspect of centering theory".

Having discussed the different centers, I now turn to the specific constraints and rules that CT imposes on the operations of the centers. The first constraint states that there is only one backward looking center in an utterance.
Following JW81, GJW95 assumes that the backward looking center of an utterance, Cb(Un), is a singleton. Although no argument for this statement is offered in either JW81 or GJW95, psycholinguistic evidence suggests that there is not more than one Cb ([59]; [60]).

Another constraint states that within the set of forward looking centers, every entity must be realized. This brings back the important notion of realizes. GJW95 mentions that the precise definition of U (utterance) realizes c (center) depends on the specific semantic theory that one adopts. Relying on situation semantics [61], [62], Grosz, Joshi and Weinstein (1986) [45] define the realize relation as follows:

An utterance U realizes a center c if c is an element of the situation described by U, or c is the semantic interpretation of some subpart of U.

By this definition, discourse entities can be realized as pronouns, zero pronouns, or explicitly realized discourse entities (such as those directly realized), or implicitly via inferrables ([63]; [64]). It should be noted that in addition to discourse entities, discourse relations can also be realized ([56]). The relation directly realizes is a specialization of the relation realizes. I should point out that in this study I focus only on entities explicitly realized as pronouns, excluding those realized as zero pronouns or implicitly realized. Within the directly realizes relation, I consider only semantic reference, excluding pragmatic reference.

The last constraint states that the forward looking centers can be ranked. The idea is that the entities in Un have different likelihoods of becoming the backward looking center of Un+1. This constraint is critical to centering. Since the preferred center is the highest ranked entity among the forward looking centers, its identification depends on what ranking scheme a CT implementation employs. Treated as a parameter of the CT framework ([65], [66]), the ranking of forward looking centers has received a great deal of attention as researchers apply CT cross-linguistically. The original CT proposals that led to GJW95 are based on observations of English and use grammatical role and pronominalization to rank the forward looking centers. Various factors that may influence the ranking have been discussed from language to language, for instance topic markers on NPs in Japanese ([67], [68]), features on verbs in Italian ([69], [70]), word order in subordinate clauses in German ([71]), word order in modifier clauses in English ([60]), and thematic relations in Turkish ([72], [73]).

In addition to the three constraints, GJW95 also formulates two rules. Rule 1 is a pronominalization rule, stated as follows:

Rule 1: If some element of Cf(Un) is realized by a pronoun in Un+1, then Cb(Un+1) must be realized by a pronoun also.

This rule implies that:

• the rule does not apply to entities that are realized in Un but not in Un+1;
• the rule does not apply to an utterance Un+1 that does not contain any pronouns (this is possible because the Cb could be realized by a proper name or a definite description);
• if in utterance Un+1 there are multiple pronouns realizing entities from its preceding utterance Un, one of them must realize the backward looking center Cb;
• if there is only one pronoun, then that pronoun must realize the Cb.
The last two implications indicate that, as long as the Cb of utterance Un+1 is realized as a pronoun, utterance Un+1 is not precluded from using pronouns for other entities as well. In general, this rule explains how the use of pronouns can capture local coherence and, at the same time, indicates that the Cb is often pronominalized.

Now let us look at the second rule, which can be regarded as a transition rule. Originally, three types of transition relations were defined across two adjacent utterances: continuation, retaining and shifting. Brennan et al. [1] and later work extend the transitional state shifting into smooth shifting and rough shifting. The definitions of the transition states mentioned in the rule are summarized in Table 5.1. Two factors are important in defining these states: (1) whether Cb(Un+1) is the same as Cb(Un); (2) whether Cb(Un+1) is the most highly ranked entity of Cf(Un+1), that is, the Cp(Un+1).

Table 5.1. Extended transition states (adapted from Brennan et al. [1])

                          Cb(Un+1) = Cb(Un)    Cb(Un+1) ≠ Cb(Un)
Cb(Un+1) = Cp(Un+1)       Continue             Smooth shift
Cb(Un+1) ≠ Cp(Un+1)       Retain               Rough shift

If both (1) and (2) hold, the two utterances are related by a continue transition, which indicates that the speaker has focused on an entity and intends to continue talking about it. If (1) holds but (2) does not, the two utterances are related by a retain transition; in this case the speaker intends to shift the focus onto a new entity, and this entity is realized as a center in a lower-ranked position rather than the highest position. If (1) does not hold, the two utterances are related by smooth shift or rough shift, depending on whether (2) holds or not. Both shifts occur when the speaker has shifted the focus from one entity to another. Rough shift is a rather extreme case, where the new entity is not realized as the preferred center. According to the summary in Table 5.1, two utterances are most coherent if they share the same Cb and that Cb is also the Cp (continue), and least coherent if they neither share the same Cb nor does the Cb coincide with the Cp (rough shift). Meanwhile, the last constraint in centering (i.e., that the forward looking centers are ranked) is critical, because the ranking of the forward looking centers determines Cb(Un+1) through the identification of Cp(Un), thus influencing the transition relation that holds between two utterances.

Rule 2: Sequences of continuation are preferred over sequences of retaining; and sequences of retaining are preferred over sequences of shifting.

Rule 2 is a preference constraint on the ordering of transition relations. This rule provides a coherence measurement through which one can explain why some discourse segments are more coherent than others. This preference rule has been used to measure degree of coherence in some applications ([74]). Unlike grammatical rules, preference rules such as Rule 2 can sometimes be violated while still yielding an acceptable utterance sequence. A small computational sketch of how the transition states in Table 5.1 can be identified from Cb and Cp is given below.
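The classification in Table 5.1 reduces to two equality tests once Cb and Cp have been identified, so it is straightforward to compute. The following sketch is only an illustration of that classification, not the code used in this dissertation; representing entities as plain strings and using None for an undefined Cb are simplifying assumptions of the sketch.

```python
# Minimal sketch of the Table 5.1 transition classification. Entities are
# plain strings and an undefined Cb is represented as None; both choices are
# assumptions of this sketch rather than part of the original implementation.

def classify_transition(cb_prev, cb_curr, cp_curr):
    """Return the centering transition between Un and Un+1.

    cb_prev -- Cb(Un), the backward looking center of the previous utterance
    cb_curr -- Cb(Un+1), the backward looking center of the current utterance
    cp_curr -- Cp(Un+1), the preferred (highest ranked forward looking) center
    """
    same_cb = cb_curr == cb_prev        # test (1): Cb(Un+1) = Cb(Un)?
    cb_is_cp = cb_curr == cp_curr       # test (2): Cb(Un+1) = Cp(Un+1)?
    if same_cb:
        return "continue" if cb_is_cp else "retain"
    return "smooth shift" if cb_is_cp else "rough shift"

# Example (4): keeping John as the center of S3 yields a continue transition,
# whereas interpreting "he" as Bob (with Bob as the new Cp) yields a shift.
print(classify_transition("John", "John", "John"))   # continue
print(classify_transition("John", "Bob", "Bob"))     # smooth shift
```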
Based on these rules and constraints, various centering-based algorithms have been developed over the years for different computational purposes. I will first briefly go over some of them.

5.2.4 Centering algorithms

There have been various algorithms based on the centering framework aimed at discourse processing tasks, such as Kameyama [68], Brennan et al. [1], Walker [75], Baldwin [76], Kehler [77], Walker et al. [78], and Strube and Hahn [79]. Kameyama [68] adds a structural parallelism constraint to GJW86: briefly speaking, subject pronouns prefer subject antecedents and non-subject pronouns prefer non-subject antecedents. Brennan et al. [1] argue that this constraint is a consequence of the ranking scheme defined by grammatical functions together with the preference for continuation over retaining. Walker [75] evaluates Brennan et al.'s [1] algorithm for finding the referents of pronouns in naturally occurring texts, and also proposes rules to segment discourse, since centering is intended to operate within a discourse segment. Kehler [77] suggests an anaphoric processing model that helps process ellipsis and anaphora resolution; in particular, this model provides explanation and evidence for ranking the forward looking centers according to their grammatical roles in the centering framework. Walker et al. [78] investigate the role of centering in interpreting anaphoric expressions in Japanese. Agreeing with Walker et al. [78] that the ordering of the list of forward looking centers is the most important single construct of the centering model, Strube and Hahn [79] propose another ordering scheme that depends on the functional information structure of utterances rather than on a ranking principle based on grammatical roles.

Next, I will summarize Brennan et al.'s [1] work, in which a centering algorithm to resolve third-person pronouns was proposed. This algorithm will be utilized for modeling the QA discourse. Brennan et al.'s work follows the two rules and three constraints mentioned above. Given the entity-based assumption of Centering Theory, the problem of how to rank the entities has been discussed extensively in the literature. The ranking scheme relying on grammatical relations is the most widely implemented, with different variations. For example, one ranking scheme indicates that an entity in subject position is ranked higher than an entity in object position, which is ranked higher than entities in other positions (i.e., subject > object(s) > other) [29]. Brennan et al.'s algorithm can be described as having four steps:

1. For each utterance, generate all the possible combinations of Cf and Cb in terms of reference assignment;
2. For each possible combination from step 1, apply various constraints to get the compatible combinations;
3. Rank the candidates by transitional orderings;
4. Assign the reference to the top candidate.

In general, possible candidates for the backward looking center are first constructed, then filtered, and finally classified and ranked. At the construction stage, referring expressions are identified and ordered by grammatical relations. Note that this ordering implies that the salience of discourse entities can be ranked by the grammatical function through which they are realized. After ordering the referring expressions, possible forward looking centers and backward looking centers are created, and the proposed candidates are created by the cross-product of the previous step. At the filtering stage, these candidates are filtered by a series of constraints such as syntactic coreference constraints, selectional restrictions, and Rule 1. At the final stage, the filtered candidates are ranked by the transitional orderings given in Rule 2. Examples and more implementation details of this algorithm will be discussed later; a schematic sketch of the construct-filter-rank pipeline is given below.
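To make the construct-filter-rank flow concrete, the sketch below outlines the shape of a BFP-style resolver. It is a schematic reconstruction rather than the program used in this dissertation: the feature dictionaries, the simplification of taking the highest ranked antecedent as both Cb and Cp of the current utterance, and the helper names are all assumptions of the sketch.

```python
# Schematic sketch of a BFP-style resolver (construct, filter, rank). The data
# representation and the Cb/Cp simplification are assumptions of this sketch.

from itertools import product

# Rule 2 preference: continue > retain > smooth shift > rough shift.
PREFERENCE = {"continue": 0, "retain": 1, "smooth shift": 2, "rough shift": 3}

def agreement_ok(pronoun, entity):
    # Hard filters: gender, number and animacy must be compatible.
    return all(pronoun[f] == entity[f] for f in ("gender", "number", "animacy"))

def transition(cb_prev, cb_curr, cp_curr):
    if cb_curr == cb_prev:
        return "continue" if cb_curr == cp_curr else "retain"
    return "smooth shift" if cb_curr == cp_curr else "rough shift"

def bfp_resolve(pronouns, cf_prev, cb_prev):
    """Return antecedent names for the pronouns of Un+1.

    pronouns -- feature dicts for the pronouns of the current utterance
    cf_prev  -- Cf(Un): ranked entity dicts of the previous utterance
    cb_prev  -- Cb(Un), or None if it is undefined
    """
    best = None
    # Step 1: construct every combination of antecedent assignments.
    for combo in product(cf_prev, repeat=len(pronouns)):
        # Step 2: filter by the agreement constraints (and, in the full
        # algorithm, by syntactic coreference constraints and Rule 1).
        if not all(agreement_ok(p, e) for p, e in zip(pronouns, combo)):
            continue
        # Simplification: the highest ranked antecedent doubles as Cb(Un+1)
        # and Cp(Un+1); the full algorithm ranks all realized entities.
        anchor = min(combo, key=cf_prev.index) if combo else None
        cb_curr = cp_curr = anchor["name"] if anchor else None
        cost = PREFERENCE[transition(cb_prev, cb_curr, cp_curr)]
        # Steps 3 and 4: keep the candidate with the most preferred transition.
        if best is None or cost < best[0]:
            best = (cost, combo)
    return None if best is None else [e["name"] for e in best[1]]

# Resolving "he" in (4S3) against Cf(4S2), ranked subject first:
cf_s2 = [{"name": "John", "gender": "male", "number": "sg", "animacy": "animate"},
         {"name": "Bob", "gender": "male", "number": "sg", "animacy": "animate"}]
he = {"gender": "male", "number": "sg", "animacy": "animate"}
print(bfp_resolve([he], cf_s2, cb_prev="John"))   # ['John']
```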
In the following section, I present three interpretive models for processing context questions. Given a question in a discourse, the first model forms query terms by resolving the pronouns in the question (I name it the anaphora model). The second model incorporates the forward looking centers from the adjacent preceding question with terms from the current question for query expansion (the forward model). The third model applies discourse transitions to selectively incorporate entities from the discourse for query expansion (the transition model).

5.3 Three discourse models for query expansion

5.3.1 Anaphora model

In the anaphora model, I use the centering algorithm based on Brennan et al. [1] to resolve pronoun references. There are a few implementation details and modifications worth mentioning here: (1) instead of only dealing with the adjacent utterance (the strict local coherence in [29]), my approach keeps looking back through all the previous questions until an antecedent is found; (2) the linguistic features used include gender, number and animacy; (3) the ranking scheme is based on the same grammatical role hierarchy of discourse entities as proposed in Brennan et al. [1] (mentioned above). At a higher level, this algorithm only assigns to references those highly ranked antecedents from the discourse that yield a more coherent discourse (as indicated by the transitions in Table 5.1). The details of the algorithm are reviewed in Jurafsky and Martin [12]. Once a pronoun is resolved, its antecedent is used in the query formation for the current question.

In this study, linguistic expressions are used for query expansion based on discourse modeling. In example (5), the use of an expression such as it in (5Q3) to denote the entity the state fish is called reference. The linguistic expression used to perform reference is called the referring expression, and the entity the state fish is called the referent. The expressions it and the state fish corefer to the same entity, the kind of fish that is designated the state fish of Hawaii. Referring expressions are very common in all kinds of discourse, including the mini discourse of context questions.

(5) Q1: Where is Hawaii located?
    Q2: What is the state fish?
    Q3: Is it endangered?

For now, I restrict the discussion of reference to entities, as Centering Theory is an entity-based theory. Linguistic expressions such as indefinite noun phrases, definite descriptions, pronouns and demonstratives are often used in context questions to refer. In this chapter, however, I focus only on pronouns and investigate how they affect the processing of context questions within the framework of CT. As for reference resolution, there are strict semantic constraints (such as number, person, case, and gender agreement) and syntactic constraints that any successful resolution algorithm has to take into consideration. These hard constraints help reduce the set of referent candidates before further processing. Pronominalization is a form of definite reference, as illustrated in example (5). Research on pronoun resolution has received tremendous attention from both the linguistics and NLP communities, and various algorithms have been developed to address the problem. In the following sections, I will first discuss pronoun resolution preferences, then pronoun resolution algorithms, and finally the implementation details of the centering algorithm used in the anaphora model.

Pronoun resolution preferences

Different pronoun interpretation preferences have been applied in pronoun resolution algorithms to determine the potential referents.
Jurafsky and Martin ([12], p.682) summarize these preferences as recency, grammatical role, repeated mention, parallelism and verb semantics. Briefly, the preference recency says that entities appearing in recent discourse are more salient than those from discourse further back. The preference grammatical role states that the grammatical role of a discourse entity determines its degree of salience; in other words, entities whose corresponding expressions are in subject position, for example, are more likely to be the referents of pronouns than those that appear in object position, if the grammatical role preference ranks the salience of a subject higher than that of an object. The preference repeated mention supports the idea that entities that have been focused on are more likely to keep that status and continue to be focused on; this preference is another version of the transition preference constraint in CT, which favors the transition type continue over retain and shift. The preference parallelism is a syntactic constraint on potential referent candidates: entities whose realized expressions are structurally parallel with the pronouns are preferred as referents. The preference verb semantics, by contrast, is a semantically based constraint: the semantics of certain predicate verbs determines the pronoun interpretation and therefore the referent. (The following example from [12], p.683 suggests that semantic processing in pronoun resolution deserves further investigation: (1a) John telephoned Bill. (1b) He lost the pamphlet on Acuras. (2a) John criticized Bill. (2b) He lost the pamphlet on Acuras.)

Unlike the agreement constraints, these preferences can be violated, and no single preference necessarily outranks the others. For the same reason, a good pronoun resolution algorithm has to decide which preferences to implement and, if several preferences are implemented, how they interact computationally: is the difference between them quantitative or qualitative?

Pronoun resolution algorithms

In this part, I summarize some pronoun resolution algorithms implemented in the NLP literature, in order to show how the centering algorithm differs in its mechanism. Lappin and Leass's ([80]) algorithm singles out several influential factors, each of which is assigned a numeric weight that, in practice, takes a good deal of empirical tuning to adjust. The sum of the weights is then calculated to determine the referent of the pronoun in question. Besides the agreement constraints, this algorithm also considers other preferences. It ranks recency higher than grammatical role, which in turn is ranked higher than syntactic preferences (e.g., head noun emphasis, which adds more weight to a referent if its corresponding expression is a head noun). In this way, pronoun resolution preferences are quantified and therefore calculable. At the same time, this approach has disadvantages: the weight assigned to each preference factor is determined empirically, and the method does not guarantee optimal accuracy. In addition, the accuracy of pronoun resolution largely depends on the genre of the training corpus, which means the weight assignment does not necessarily work best across all genres. An illustrative sketch of this kind of weighted scoring follows.
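The sketch below illustrates the general shape of such factor weighting: hard agreement filters first, then a weighted sum over the preference factors that hold for each candidate. The factor names and weight values are placeholders chosen for illustration, not the weights published by Lappin and Leass.

```python
# Illustrative weighted-salience scoring in the spirit of Lappin and Leass.
# The factor names and weights are placeholders, not their published values.

SALIENCE_WEIGHTS = {
    "recency": 100,     # mentioned in the current or immediately preceding sentence
    "subject": 80,      # realized in subject position
    "object": 50,       # realized in object position
    "head_noun": 80,    # realized as a head noun rather than an embedded NP
}

def agreement_ok(pronoun, candidate):
    # Hard constraints are applied before any weighting.
    return all(pronoun[f] == candidate[f] for f in ("gender", "number"))

def salience(candidate):
    # Sum the weights of the preference factors that hold for this candidate.
    return sum(SALIENCE_WEIGHTS[f] for f in candidate["factors"])

def resolve_pronoun(pronoun, candidates):
    """Return the agreeing candidate with the highest total salience."""
    viable = [c for c in candidates if agreement_ok(pronoun, c)]
    return max(viable, key=salience) if viable else None

# "John telephoned Bill. He lost the pamphlet on Acuras."
he = {"gender": "male", "number": "sg"}
candidates = [
    {"name": "John", "gender": "male", "number": "sg",
     "factors": ["recency", "subject", "head_noun"]},
    {"name": "Bill", "gender": "male", "number": "sg",
     "factors": ["recency", "object", "head_noun"]},
]
print(resolve_pronoun(he, candidates)["name"])   # John: subject outweighs object
```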
Hobbs' ([81]) tree search algorithm is basically a syntactically based algorithm. It relies entirely on correct and complete syntactic structures for the current sentence and its preceding sentences in the discourse, since the search for referents is performed entirely on the parse trees of these sentences. In other words, the more accurate the parser, the higher the accuracy this method may obtain. However, how a grammar is defined for a parse tree will affect the search results even if the same algorithm is adopted. Note that no explicit preferences are specified in the algorithm; priorities in the search order implicitly reflect certain preferences. For example, recency is implicitly implemented because the search starts from the current sentence rather than its preceding sentence.

The two algorithms just reviewed differ in several ways. First, the Lappin and Leass algorithm generates and processes a set of referent candidates, while Hobbs' algorithm proposes only one. Second, the Lappin and Leass algorithm explicitly specifies a set of preferences; more preferences can be added to this open set with ease, in the sense that new operations are simply incorporated alongside existing components. The preferences in Hobbs' algorithm are implicitly implemented, and any operation on the preferences, such as replacing or adjusting a preference, would increase the computational complexity tremendously, since the whole algorithm would have to be changed structurally. Of course, neither algorithm implements all of the preferences mentioned earlier. As for performance, it is hard to say which is better; the accuracy of both methods is below 90% to date. One has to accept that automated pronoun resolution algorithms cannot resolve all pronoun occurrences, since preferences sometimes contradict one another, and it is clear that the more preferences are introduced into an algorithm, the more complications it will encounter.

The third pronoun resolution algorithm I will describe is the centering-based algorithm presented by Brennan et al. ([1]). In section 5.2.3 I described the two rules, three constraints and four intersentential transitions specified in this algorithm. The constraints include the coreference agreement constraints and the preference constraints discussed earlier. Walker [75] reports an accuracy of 77.6% for the centering algorithm and 81.8% for Hobbs [81]; the centering-based algorithm is therefore relatively efficient and competitive. The major reason that I did not use Hobbs' algorithm is that it is purely syntactically driven; in that sense it does not provide a mechanism to track how discourse entities are related semantically and how local coherence is maintained through operations on discourse entities.

Implementation details

The algorithm implemented in the anaphora model is based on BFP87 [1]. There are a few implementation details and modifications worth mentioning here. First, I consider the basic utterance unit to be a naturally occurring question in the context of QA. How to identify an utterance in centering has been an open issue. Does it have to be a sentence? Could it be a unit within a complex sentence? Multiclausal sentences in particular introduce complications into the discussion. Poesio and Stevenson (2004) therefore treat utterance as a parameter of CT (see also [82], [83], and [84] for discussions on identifying utterances with tensed clauses). GJW95 implicitly identifies utterances with sentences.
Looking at the data collected for this study, I notice that the context questions have an average length of 7.28 words. Moreover, the data pool contains no instances of multiclausal questions. It therefore makes intuitive sense to identify an utterance with each individual question.

Second, instead of only dealing with the adjacent utterance (the notion of local coherence in Centering Theory), this approach keeps looking back through all the previous questions until an antecedent is found. Recall that Rule 2 in GJW95 is written in terms of utterance sequences. As Grosz and Sidner [85] point out, starting from BFP87, "all the uses of this rule in language processing systems have adapted the rule by restricting it to pairs of utterances." This misses "the essential intuition that what matters to coherence are centering transitions throughout a segment, not only between pairs of utterances" ([85], p.48). Taking this critique into consideration, I modified BFP87 to include more utterances/questions when searching for proper antecedents.

Third, the linguistic features used include gender, number and animacy. As mentioned in Chapter 3, these features are manually annotated according to my linguistic judgment.

Finally, the ranking scheme is based on the same grammatical role hierarchy of discourse entities as proposed in BFP87. It has been shown that grammatical role functions as the primary determinant of discourse salience [86]. Miltsakaki [87] presents evidence from corpus studies that entities in adjunct subordinate clauses have lower discourse salience than those in main clauses. In the implementation, I adopt Brennan et al.'s [1] centering algorithm and the following more detailed ranking hierarchy:

subject > existential predicate nominal > object > indirect object > demarcated adverbial PP

Here an existential predicate nominal is a noun phrase used as a predicate in an existential sentence (e.g., There is a cat), and a demarcated adverbial PP is a noun phrase used in an adverbial prepositional phrase separated from the main clause (e.g., In the parking lot, there is an Acura). Entities in adjunct subordinate clauses are ranked lower than entities in the main clause, but with the same grammatical role ranking hierarchy. (The details of the algorithm are reviewed in Jurafsky and Martin [12], pp.691-693.)

In the anaphora model, once a pronoun is resolved, its antecedent is used in the query formation for the current question. Let us first step through the algorithm for example (5), repeated below as (6).

(6) Q1: Where is Hawaii located?
    Q2: What is the state fish?
    Q3: Is it endangered?

Since (6Q1) is the first question, or what is usually called a feeding question, it does not have a backward looking center according to CT. Since Hawaii is the only entity in (6Q1), it is the forward looking center as well as the preferred center, which is defined as the highest ranked entity in Cf.

Cf(Q1): {Hawaii}
Cp(Q1): {Hawaii}
Cb(Q1): {undefined}

Since (6Q2) has no pronouns that need to be resolved, it is treated as if it were a feeding question.

Cf(Q2): {the state fish}
Cp(Q2): {the state fish}
Cb(Q2): {undefined}

(6Q3) has one pronoun, it, that needs to be resolved using BFP87. According to Rule 1 and the coreference constraints (gender, number and animacy), the Cp of (6Q2), the state fish, which is compatible with it, is assigned as the referent.
The linguistic expression the state fish corresponding to the Cp is then added to (6Q3) to form the query terms for the document retrieval engine. (Query terms are the linguistic tokens that are input to a retrieval engine. The retrieval engine sees them as a bag of words, ignoring word order. When discussing the three computational models, by centers I mean their realizations, i.e., the corresponding linguistic expressions. Query terms in braces are separated by commas to indicate that they form a bag of words.) To be specific, the final query terms for (6Q3) would be: {is, it, endangered, the, state, fish}. This example shows that the realizes relation makes it possible for a computer system to operate on discourse entities.

It should be noted that if no pronominal referent is assigned, the modified algorithm goes further back to a prior question. For instance, in example (7), in order to resolve him in (7Q3), the algorithm has to keep looking back to (7Q1) and locate Tom Cruise as the referent, because the noun phrase Nicole Kidman in (7Q2) violates the gender constraint for him. Once both she and him are resolved, the forward looking centers of (7Q3) are Nicole Kidman and Tom Cruise, with the subject (Nicole Kidman) ranked higher than the object (Tom Cruise) according to the ranking hierarchy mentioned above.

(7) Q1: When was Tom Cruise born?
    Q2: Who was Nicole Kidman?
    Q3: When did she marry him?

Of course, this algorithm cannot correctly resolve all pronouns; there are always complicated and unexpected cases it cannot handle. Some of these, such as generics and cases that require additional constraints, are beyond the scope of this discussion. Since the accuracy of pronoun resolution algorithms is not the major concern of this work, I focus more on the discourse modeling; adding more constraints to the algorithm is largely an implementation matter. As we will see, pronoun resolution is a common constituent of all the processing models, so any extra effort to improve its accuracy will benefit all three models.

5.3.2 Forward model

In the forward model, query terms for the current question are formed by also incorporating the forward looking centers Cf from its adjacent preceding question. Note that the forward looking centers have already been resolved by the reference resolution algorithm, so this model goes one step beyond the anaphora model. The motivation for the forward model is my assumption that a question discourse is coherent: the forward looking centers from the previous adjacent question form the local entity context for the current question and thus provide more context information for it. This model is therefore predicted to perform better than the anaphora model. A minimal sketch of this expansion step is given below, followed by a worked example.
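The forward model's expansion amounts to appending the realized forward looking centers of the preceding question to the current bag of query terms. The sketch below is only meant to make that step explicit; the tokenization and the assumption that pronoun resolution has already been performed elsewhere are simplifications of mine.

```python
# Minimal sketch of the forward model's query expansion step; it assumes the
# pronouns of the current question have already been resolved elsewhere.

def anaphora_query(question_tokens, antecedent_tokens):
    # Anaphora model: the question plus the tokens realizing the antecedent(s).
    return question_tokens + antecedent_tokens

def forward_query(question_tokens, antecedent_tokens, cf_prev_tokens):
    # Forward model: additionally append the tokens realizing the forward
    # looking centers Cf(Un) of the adjacent preceding question; tokens
    # already present in the query are not added a second time.
    expanded = anaphora_query(question_tokens, antecedent_tokens)
    return expanded + [t for t in cf_prev_tokens if t not in expanded]

q2 = ["what", "movies", "was", "she", "in"]
print(forward_query(q2, ["Nicole", "Kidman"],
                    ["Tom", "Cruise", "Nicole", "Kidman"]))
# ['what', 'movies', 'was', 'she', 'in', 'Nicole', 'Kidman', 'Tom', 'Cruise']
```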
Consider example (8), which has two forward looking centers, Tom Cruise and Nicole Kidman, in (8Q1). The pronoun she in (8Q2) is resolved to Nicole Kidman using the anaphora model. The forward model will add the other forward looking center, Tom Cruise, to the query terms when processing (8Q2). For the same reason, the forward looking center of (8Q2), i.e., Nicole Kidman, will be added to (8Q3) for query expansion after he is resolved to Tom Cruise.

(8) Q1: How is Tom Cruise related to Nicole Kidman?
    Q2: What movies was she in?
    Q3: What movies was he in?

Cf(Q1): {Tom, Cruise, Nicole, Kidman}
Cf(Q2): {Nicole, Kidman}

Forward model:
Query terms for Q2: {what, movies, was, she, in, Nicole, Kidman, Tom, Cruise}
Query terms for Q3: {what, movies, was, he, in, Tom, Cruise, Nicole, Kidman}

Anaphora model:
Query terms for Q2: {what, movies, was, she, in, Nicole, Kidman}
Query terms for Q3: {what, movies, was, he, in, Tom, Cruise}

From the resulting query terms, we can see that in the forward model both entities are incorporated into (8Q2) and (8Q3) to enrich the context of this mini discourse, unlike in the anaphora model. In other words, not only the entities referred to by pronouns are regarded as local context, but also the other forward looking centers from previous utterances. Notice also that even if the order of (8Q2) and (8Q3) were switched, an equally possible discourse sequence, the resulting query terms would be the same, which shows that this algorithm correctly captures the local context. Now that more entities from the context are involved in processing a question sequence, it is necessary to explore the semantic level more deeply: how are these entities related, and how can that relationship help with the processing?

5.3.3 Transition model

Instead of incorporating all the forward looking centers from the adjacent preceding question as in the forward model, the transition model goes one step further by selectively incorporating entities from the discourse based on discourse transitions. Centering Theory is used in this model to identify the transitions. As described earlier, the transitions of centers from one utterance to the next indicate the degree of discourse coherence, captured by four types: continue, retain, smooth shift and rough shift. The first two transitions mainly correspond to the situation where the user is continuing the topic and/or the focus from the preceding utterance; the last two correspond to some kind of shift of interest.

For questions that involve pronouns, the transition types are automatically identified by the reference resolution algorithm (see the algorithm in [12]). For questions that do not have pronouns, I used an entity-based algorithm, which extends the centering algorithm and assumes that the highest ranked entity is the centered entity, or the most accessible in terms of interpretation and understanding. The same ranking scheme as in the anaphora model was used to assign a rank to each entity. The highest ranked entities from an adjacent question pair were then compared and assigned a transition type according to Table 5.1.

More specifically, the different transitions are determined based on the syntactic information of the noun phrase (NP) that realizes the Cp. A real-world object or entity can serve as a center depending on the NP that realizes it. NPs, especially referring expressions including non-pronominal definite referring expressions and pronominal expressions, are the linguistic elements initially discussed within the centering framework [13]. GJW95 [29] mention that, semantically, the realization relation for definite noun phrases may hold in three ways: (1) referentially, to denote an object; (2) attributively, to contribute to the semantic interpretation related to the descriptive content of the expression; and (3) as pragmatic reference, which is essentially a "speaker's reference".
The first two aspects motivate the approach of identifying transitions based on NP expressions, in particular definite noun phrases. In the transition model, the extended algorithm is based on the following speculation. Intuitively, definite noun phrases that share the same NP head and modifier often refer to the same center, which results in a continuation in CT's terminology. Similarly, attention is retained if two similar entities referred to in two utterances have corresponding NP expressions that share the same NP head but different modifiers. NPs that have the same modifier but different NP heads often refer to different entities that share the same descriptive properties; in this case attention is shifted more than in the retention case, but less than in a rough shift, where attention to the properties of the entity as well as to the entity itself has shifted. Table 5.2 shows the four rules that are used to identify the different types of transitions, and Table 5.3 shows examples of how these transition rules apply to non-pronominal referring expressions. A fifth transition, other, is assigned to a question pair if none of the four rules can be applied, for example a question pair that has no non-pronominal referring expressions. Once the different types of transitions are identified, the next step is to apply different strategies to selectively incorporate entities from the discourse for query expansion. To this end, I have currently simplified the process by collapsing smooth shift, rough shift, and other into a general type shift. The specific strategies for each transition type used in expanding the query of the question being processed are shown in Table 5.4.

Table 5.2. Transition rules for questions without pronouns but with non-pronominal referring expressions

NP modifier*   NP head     Transition
Same           Same        Continue
Different      Same        Retain
Same           Different   Smooth shift
Different      Different   Rough shift

*Modifiers do not include the determiners a, an and the.

Table 5.3. Examples of transition rules on non-pronominal referring expressions

Transition     Qn (NP)             Qn+1 (NP)
Continue       a movie star        the movie star
Retain         the second debate   the third debate
Smooth shift   the best actor      the best actress
Rough shift    the space shuttle   the flight crew

The strategy for the continue transition (if there is no proper name, the query terms are not expanded) is motivated by two considerations. First, as pointed out in [29], there are cases where the full definite noun phrases that realize the centers do more than just refer; being part of a discourse, they contribute to discourse coherence as well. Similarly, I conjecture that the highest ranked proper name in a question sequence carries more information than its referring function alone. In other words, I believe that for questions that involve pronouns, the highest ranked proper name can provide adequate context if that proper name is not the antecedent of the pronoun and its status is not overwritten by new information from the current question. Second, as described in [88] on topic status and the status of proper names in the definiteness hierarchy of [89], proper names should be given a certain discourse prominence since they are an important type of definite noun phrase. Since I do not currently resolve definite descriptions, this strategy partially addresses the importance of the definiteness status of definite noun phrase types other than pronouns.

(9) Q1: Where is Hawaii located?
    Q2: What is the state fish?
    Q3: Is it endangered?
In my favorite example, repeated as (9), the transition between (9Q2) and (9Q3) is identified as continue because it in (9Q3) and the state fish in (9Q2) refer to the same entity (i.e., the state fish) and this entity is also the Cp of (9Q3). According to the strategy described in Table 5.4 for continue, when processing (9Q3), in addition to the query terms the state fish (corresponding to the antecedent of the pronoun it in (9Q3)), the proper name Hawaii from (9Q1) is also inherited.

For the transition type retain, my intuition is that if two questions are about similar but not identical entities (e.g., the first debate and the second debate), they should share a similar constraint environment (such as time, location, etc.). A particular constraint from a preceding question still applies to the current question unless its value is explicitly revised in the current question. The strategy for the retain transition was designed based on this intuition.

Table 5.4. Query expansion strategies based on transition type

Transition   Strategy
Continue     Add the highest ranked proper name most recently introduced in the discourse.
Retain       Inherit and then update (if necessary) the constraints from the discourse. Constraints are currently location and time.
Shift        Add the forward looking centers from the previous adjacent question to the current question.

(10) Q1: Where was the 2nd presidential debate held in 2004?
     Q2: Where was the 3rd debate held?

In example (10), the transition between (10Q1) and (10Q2) is identified as retain because, according to Table 5.2, the expressions realizing Cp(10Q1) and Cp(10Q2), that is, the 2nd presidential debate and the 3rd debate, share the same NP head but different modifiers. The strategy for retain allows (10Q2) to inherit the time constraint 2004 from (10Q1).

For the transition type shift, I currently adopt the same strategy as in the forward model, incorporating the forward looking centers from the preceding question. Although a shift transition reflects the least local coherence between utterances, the preceding forward looking centers are still important for providing local context information.

(11) Q1: When did Vesuvius destroy Pompeii the first time?
     Q2: What civilization ruled at that time?

In example (11), the transition between (11Q1) and (11Q2) is identified as rough shift according to Table 5.2, because the NPs realizing Cp(11Q1) (i.e., Vesuvius) and Cp(11Q2) (i.e., civilization) share neither the same head nor the same modifiers. Following the strategy for the shift transition, the resulting query terms inherit the forward looking centers from the preceding question; in this case the query terms Vesuvius and Pompeii are added to (11Q2) for document retrieval. Note that all the strategies described here are based on linguistic observations; other strategies can be experimented with in the future. A schematic restatement of the transition rules and expansion strategies is sketched below.
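Taken together, Tables 5.2 and 5.4 can be read as a small classification step followed by a dispatch over the identified transition type. The sketch below is a schematic restatement of those two tables rather than the actual implementation; the dictionary-based representation of NPs, constraints, and discourse history is an assumption made for illustration.

```python
# Schematic restatement of Table 5.2 (transition identification from NPs) and
# Table 5.4 (query expansion strategies). The data representation is an
# assumption of this sketch, not the dissertation's actual implementation.

def np_transition(prev_np, curr_np):
    # Table 5.2: compare the head and the (non-determiner) modifier of the NPs
    # realizing the preferred centers of two adjacent questions.
    same_head = prev_np["head"] == curr_np["head"]
    same_mod = prev_np["modifier"] == curr_np["modifier"]
    if same_head:
        return "continue" if same_mod else "retain"
    return "smooth shift" if same_mod else "rough shift"

def expand_by_transition(query_terms, transition, discourse):
    # Table 5.4: expand the current question's query terms according to the
    # identified transition; smooth shift, rough shift and "other" are all
    # treated as the general type shift.
    terms = list(query_terms)
    if transition == "continue":
        # Add the most recently introduced highest ranked proper name, if any.
        terms += discourse.get("recent_proper_name", [])
    elif transition == "retain":
        # Inherit time/location constraints unless the current question
        # explicitly supplies its own value for them.
        for name, value in discourse.get("constraints", {}).items():
            if name not in discourse.get("overridden", ()):
                terms += value
    else:
        # shift: add the forward looking centers of the preceding question.
        terms += [t for t in discourse.get("cf_prev", []) if t not in terms]
    return terms

# Example (10): "the 2nd presidential debate" vs. "the 3rd debate".
t = np_transition({"head": "debate", "modifier": "2nd presidential"},
                  {"head": "debate", "modifier": "3rd"})
print(t)   # retain
print(expand_by_transition(["where", "was", "the", "3rd", "debate", "held"],
                           t, {"constraints": {"time": ["2004"]}}))
# ['where', 'was', 'the', '3rd', 'debate', 'held', '2004']
```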
5.4 Data analysis

The user data and the TREC 2004 factoid questions were used for the evaluation of the three models. As mentioned in Chapter 4, the user data includes four topics: (1) the presidential debate in 2004; (2) Hawaii; (3) the city of Pompeii; and (4) Tom Cruise. The basic information about the user data and the TREC 2004 data has been described in Chapter 4. Table 5.5 gives a more detailed comparison of the two data sets, my data and the TREC 2004 data.

First of all, the TREC 2004 data consists of 65 topics (i.e., targets), and each topic has one set of questions. In contrast, the user data consists of only four topics, where each topic comes with more than 20 sets of questions from different users. Question sets from multiple users on the same topic allow us to test the generality of my discourse processing strategies across different users. Unlike the TREC 2004 data, where each topic is about a single entity such as the Black Panthers organization, my data covers both events and entities. For example, the topic on the presidential debate is about an event, which can potentially relate to the facts (e.g., when, what, etc.), the cause, and the consequences of the event. This variation allows us to study systematically the potential distinctions in processing different types of topics (event versus entity).

From Table 5.5 we can see that the surface characteristics of my data and the TREC 2004 factoid questions are very similar in terms of question length. However, the TREC 2004 data has a higher percentage of pronoun usage in the context questions. In my data, only the questions on the topic Tom Cruise have a high percentage of pronouns, while the other topics have significantly lower percentages. This variation allows us to study the potentially different impact of pronoun resolution in different data sets.

[Table 5.5. Characteristics comparison between the user data (Debate, Hawaii, Pompeii, Tom Cruise) and the TREC 2004 data (including only the factoid questions). The table is garbled in the scanned source; the characteristics it reports (e.g., number of topics and question sets, total number of questions, topic type, average question length, and pronoun usage) are discussed in the surrounding text.]

Furthermore, the majority of the pronouns within each set in the TREC 2004 data (96%) refer to the topic/target that has been provided for the set. Therefore, incorporating the target terms for query expansion would have the same effect as a model that resolves pronouns: each context question would then become an isolated factoid question, and additional discourse processing might not be necessary. In my data, the percentage of pronouns that refer to the topic is significantly lower, which indicates a higher demand for discourse processing.

In terms of transitions, the majority of the TREC 2004 data exhibits the continuation transition (64%), while my data exhibits more diverse behavior. By studying these different characteristics of the two data sets, I hope to learn their implications for specific strategies from my empirical evaluation. Next, I will discuss the evaluation of the three models in detail.

5.5 Evaluation

A series of experiments was conducted to compare the performance of the three models on both the user study data and the TREC 2004 data. For the user study data, I added documents containing answers to each of the collected questions to the AQUAINT CD2 collection, and the evaluation was done on the updated CD2 collection (about 1.8 GB in size). For the TREC 2004 questions, the entire AQUAINT collection (about 3 GB) was used.
In all the experiments, the Lemur retrieval engine (http://www-2.cs.cmu.edu/~lemur/) was used for document retrieval; the Lemur toolkit supports indexing of large-scale text databases and the implementation of retrieval systems based on a variety of retrieval models. Since the first occurrence of a correct answer is important, Mean Reciprocal Rank (MRR) was taken as the first measurement. MRR is defined as

MRR = (1/N) * sum_{i=1}^{N} 1/rank_i    (5.1)

where rank_i is the rank of the retrieved document (referred to as DocRank in the following discussions) that provides the first correct answer for the ith question, and N is the total number of questions evaluated. My evaluation mainly addresses the following issues:

• How do the different models based on Centering Theory compare to each other in terms of document retrieval performance? Do the different models affect different types of questions? Are there any correlations between the characteristics of the questions and the effectiveness of the strategies? To answer these questions, I compared the performance of each model on both data sets, and I further provide a detailed analysis of the performance comparison based on different characteristics of the questions, such as the type of transition and the pronoun usage.

• How sensitive is each model to the performance limitations of automated discourse processing? In other words, how well can each model compensate for the mistakes made by machine processing (e.g., incorrectly resolved pronouns)? To answer these questions, the evaluation was performed with two configurations: 1) an automated system, where pronoun references and transitions are all identified automatically by the computer system; and 2) an annotated system, where the correct references of pronouns and the transitions are annotated.

Note that my focus is not on document retrieval per se, but rather on the impact of discourse processing on document retrieval. Therefore, the evaluation reported here is based on the subsequent questions (435 in my data and 165 in the TREC 2004 data), which exclude the first question of each set, since processing the first question does not use any discourse information. A small computational restatement of the MRR measurement in (5.1) is given below.
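For concreteness, the MRR measurement in (5.1) can be restated as the following short computation over per-question DocRank values. The convention of using a DocRank of 0 for a question whose answer was not found in the retrieved documents, contributing nothing to the sum, mirrors the DocRank = 0, MRR = 0 cases reported in the examples below; everything else in the sketch is straightforward arithmetic.

```python
# MRR over a list of per-question DocRank values. A rank of 0 marks a question
# for which no retrieved document contained the answer; it contributes nothing
# to the sum but still counts toward N.

def mean_reciprocal_rank(doc_ranks):
    if not doc_ranks:
        return 0.0
    return sum(1.0 / r for r in doc_ranks if r > 0) / len(doc_ranks)

# Three questions whose first answer-bearing documents were ranked 1, 4, and
# not retrieved at all: MRR = (1/1 + 1/4 + 0) / 3 = 0.4167 (rounded).
print(round(mean_reciprocal_rank([1, 4, 0]), 4))   # 0.4167
```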
The transition model identifies the transition between (12Q3) and (12Q4) as continue so it also adds the first proper name (i.e. first Crip gangI)5 in the discourse in addition to the resolution of their. The document retrieval results show that the anaphora model (MRR = 0.125) outperforms the other two models. The result indicates that more context information captured by the forward model and the transition model sometimes hurts the performance. The next example shows that the forward model outperforms the anaphora model and the transition model. (13) Q1: Who is the lead singer / musician in Nirvana? Q2: When was the band formed? Q3: What is their biggest hit? Q4: What style of music do they play? 16Since the definite description is not the concern of this chapter, if an expression that contains a proper name, the whole expression will be counted as a proper name in the implementation conducted in this chapter. 82 Query terms for Q4: Anaphora model: {what, style, of, music, do, they, play, Nirvana} Forward model: {what, style, of, music, do, they, play, Nirvana, biggest, hit} ’Dransition model: {what, style, of, music, do, they, play, Nirvana} Anaphora model retrieval result: DocRank = 2, MRR = 0.5 Forward model retrieval result: DocRank = 1, MRR = 1.0 Transition model retrieval result: DocRank = 2, MRR = 0.5 In this example, the pronoun they is resolved to Nirvana for the anaphora model. For the transition model, the transition between (13Q3) and (13Q4) is identified as continue so the strategy for continue transition adds the first encountered proper name Nirvana to the query terms. Note that although the same term Nirvana is added for query expansion according to the anaphora model and the transition model, they are added for different reasons. The forward model, by simply adding the forward looking center (their biggest hit), provides more context for processing (13Q4). Again, the pronoun their in their biggest hit is resolved to Nirvana, but the other two terms biggest and hit actually help the document retrieval. Choosing forward looking centers, in this example works better (MRR = 1.0) than other techniques. Example (14) shows the advantage of the transition model where it outperforms both the anaphora model and the forward model. The transition model gets the best MRR value (0.25). (14) Q1: What film introduced Jar Jar Binks? Q2: What actor is used as his voice? Q3: To what alien race does he belong? Query terms for Q3: Anaphora model: {T0, what, alien, race, does, he, belong, his} Forward model: {T0, what, alien, race, does, he, belong, his, actor, voice} Transition model: {T0, what, alien, race, does, he, belong, his, Jar, Jar, Binks} Anaphora model retrieval result: DocRank = 0, MRR = 0 83 Forward model retrieval result: DocRank = 17, MRR = 0.0588 Transition model retrieval result: DocRank = 4, MRR = 0.25 In example (14), the anaphora model fails to resolve the pronoun he in (14Q3), because the entity Jar Jar Binks is annotated as object singular instead of male singular. The centering program I designed therefore could not identify the antecedent of he being Jar Jar Binks. The forward model adds the expressions his voice and actor for query expansion. For the same reason, the pronouns he in (14Q3) and his in (14Q2) could not be resolved. However, the strategy used for continue between (14Q2) and (14Q3) addes the pr0per name Jar Jar Binks to the query input. The strategy actually compensates the reference resolution failure. 
From the above examples, we see that even when the document collection and the document retrieval engine stay the same, the proposed techniques work for some individual cases and fail for others.

Table 5.6 shows the overall performance of all three models on the two data sets, compared to a baseline model, in terms of MRR. The baseline model simply appends the preceding question to the current question to form a query, without any pronoun resolution. The motivation for this baseline strategy is that, since most antecedents of pronoun references occur in the preceding question (see Table 5.5, especially for the TREC 2004 data), the preceding question can by itself provide a context for the current question. Since all three models rely on pronoun resolution, the performance of the automated pronoun resolution algorithm directly affects the final document retrieval performance. Therefore, in Table 5.6, along with the performance resulting from automated processing (marked with "auto" in the column title), I also provide retrieval results for each model based on manually annotated antecedents (marked with "key" in the column title), as well as the performance difference between the two (the % difference column).

[Table 5.6. Overall performance of the different models on document retrieval for the two data sets in terms of MRR, listing for each model the automated ("auto") and annotated ("key") results and the percentage difference between them. The table is garbled in the scanned source.]

To better present the results, Figure 5.1 shows a detailed comparison of the four models under automated processing. As shown in Figure 5.1(a), except for the Debate data, the incremental increase in the complexity of discourse processing (from the anaphora model, to the forward model, to the transition model) improves the overall performance. For the Debate data, the different models performed comparably; in other words, no type of discourse processing showed a significant effect compared to the baseline model. One reason is that the sets of questions collected for Debate are somewhat different from those of the other topics in terms of content: the Debate data relates to an event, while the rest of the data sets relate to entities such as a place or a person. Since Centering Theory is mainly based on transitions between discourse entities, it could be that my models work better for entity-related questions than for event-related questions. An event may involve more complicated transitions such as consequence, cause, and reason; models based on relation-based coherence theories such as Rhetorical Structure Theory could be a potential approach. However, more in-depth analysis is necessary to reach a better understanding of event-related questions and their implications for automated discourse processing targeted at such questions.

To illustrate the contribution of each incremental processing step, Figure 5.1(b) shows the percentage improvement over the baseline model. First of all, automated pronoun resolution can produce wrong antecedents; the anaphora model based on automated processing can therefore hurt the retrieval performance compared to the baseline model.
This is evident for the Debate and Pompeii data. The Pompeii data is a mixture of event and entity topics (e.g., it involves the event of the volcanic eruption), so the effect of the forward and transition models is also limited compared to the baseline. Furthermore, the additional contribution of the transition model is relatively smaller for the Tom Cruise data and the TREC 2004 data than for the Hawaii and Pompeii data. A possible explanation is that both the Tom Cruise and the TREC 2004 data have a higher percentage of pronouns (see Table 5.5). The specific transitions identified between two adjacent questions largely depend on the resolution of those pronouns, so the anaphora model has already performed the functions provided by the transition model. In the Hawaii and Pompeii data, however, the occurrence of pronouns is relatively lower, and the transition model can in particular accommodate entities that are not realized as pronouns, such as definite descriptions (e.g., through the continue transition, as discussed earlier).

[Figure 5.1. Overall comparison of the four models (baseline, anaphora, forward, transition) based on automated processing: (a) overall MRR of the automated systems per topic; (b) percentage improvement of the anaphora, forward and transition models over the baseline. The plotted values are not recoverable from the scanned source.]

From the experimental results, it is interesting to note that the models differ in their sensitivity to the accuracy of automated discourse processing. From Table 5.6, in the anaphora model, perfect pronoun resolution makes a big difference compared to imperfect automated pronoun resolution (the performance difference is between 12% and 27%, as shown in the "Ref% diff" column). However, the performance difference due to the ability to resolve pronouns diminishes in the forward and transition models. This result indicates that inheriting more context from the preceding questions, as the forward and transition models do, can partially compensate for inaccuracies in automated pronoun resolution.

To further examine the three models on document retrieval, I also evaluated document retrieval performance in terms of coverage. While MRR rewards a method that improves the ranking of the correct answers, coverage rewards methods that introduce the correct answer anywhere in the retrieved results. More specifically, coverage is defined as the percentage of questions for which a text returned at or below the given rank contains an answer [90]. Figure 5.2 shows the coverage measurement for each model on the different topics. Overall, we see that the transition model is consistently better than the other models. The entity-topic resemblance between the Tom Cruise data and the TREC 2004 data again results in similar performance (both have a large percentage of pronouns referring to the topic itself).

Given the experimental results described above, a natural question is how the retrieval performance of my models compares to other retrieval results. Such a comparison is hard to make because TREC 2004 did not report document retrieval performance for the context questions. The closest result I can find is the coverage based on passage retrieval for the TREC 2004 factoid questions reported by the University of Sheffield [90].
Table 5.7 shows the retrieval performance (from the transition model) and Sheffield's retrieval performance (using the Lucene retrieval engine) in terms of coverage, based on all 230 factoid questions. Note that since my system was evaluated on document retrieval and Sheffield's system was evaluated on passage retrieval, this is not a direct comparison. I list them together simply to get some sense of whether my performance is on the right track. Resources and initiatives that facilitate direct comparison are greatly needed in order to enhance our understanding of discourse processing for document retrieval.

[Figure 5.2. Coverage comparison between four models based on automated processing, plotted against document rank (1, 5, 10, 20, 30, 50): (a) Debate; (b) Hawaii; (c) Pompeii; (d) Tom Cruise; (e) overall user study data; (f) TREC 2004 data.]

Table 5.7. Document retrieval performance based on the transition model and passage retrieval performance from the University of Sheffield on TREC data

Document Rank   Transition model   Sheffield's Lucene* [90]
1               20.87              12.17
5               40.43              32.17
10              49.57              39.56
20              58.26              47.39
30              59.57              51.30
50              64.78              55.65
*http://lucene.apache.org/java/docs/

To further understand the effectiveness of these models on questions with different characteristics, I isolated two dimensions: 1) questions with different transition types and 2) questions with and without pronouns, and conducted a detailed performance analysis along these two dimensions. I report the results next.

5.5.2 Evaluation and Analysis based on Transitions

In this section, I discuss the role of the three models on question pairs with the transition types continue, retain, and shift, respectively.

Continue transition

Figure 5.3 shows the overall comparison of the three models on the question pairs with the transition type continue, with Figure 5.3(a) for the automated system and Figure 5.3(b) for the annotated system. In general, for continue pairs, the transition model works the best, then the forward model and the anaphora model, and the baseline is the worst. This implies that the transition model would work the best for the most coherent discourses, which, according to Centering Theory, have a higher percentage of continue pairs.

Figure 5.3(a) shows that the transition model performs consistently better than the forward model. This result indicates that the strategies used in the transition model for continue questions are adequate to provide appropriate context. The transition model provides more information than the forward model or the anaphora model, but at the same time lowers the risk of introducing unnecessary forward looking centers into processing, as happens in the forward model. The forward model outperforms the anaphora model across all the topics, which is also shown in Figure 5.3(a).
This result indicates that reference resolution alone is not enough to obtain adequate context information for discourses marked with continue transitions. Meanwhile, I observed that the anaphora model outperforms the baseline model for all the topics except Debate. The reason is that reference resolution errors bring down the performance for the Debate data. This can be seen from Figure 5.3(b), which shows the performance on the continue pairs with all the pronouns correctly resolved. When all the pronouns are correctly resolved, the anaphora model actually outperforms the baseline model.

Table 5.8. Transition model performance improvement for continue

Topic        Increase over ana. model (%)   Increase over for. model (%)   Question (excl. the feeding) w/ pronoun (%)
Debate       707.0                          70.1                           14.5
Hawaii       78.6                           55.6                           26.6
Pompeii      86.4                           53.4                           25.0
Tom Cruise   33.3                           9.8                            81.7
Overall      50.0                           23.9                           36.3
TREC2004     17.8                           5.9                            73.9

Table 5.8 shows the performance improvement of the transition model over the forward model and the anaphora model. The results indicate that, for continue pairs, the performance improvement of the transition model differs across topics. The improvement is less for topics that have a higher proportion of pronouns compared to other topics. For the Tom Cruise and the TREC 2004 data, which have a higher percentage of pronouns (i.e., 81.7% and 73.9% respectively), the transition model improves MRR modestly compared to other topics: 33.3% and 17.8% over the anaphora model, and 9.8% and 5.9% over the forward model, respectively.

[Figure 5.3. Overall comparison of the three models on CONTINUE pairs: (a) automated system; (b) annotated system; (c) improvement of the annotated systems compared to the automated system.]

Figure 5.3(b) shows the overall performance for the three models based on the annotated pronoun resolution. We see that the transition model based on annotated references is consistently better than the forward model except for the Debate data. It seems that pronoun resolution does not help the transition model for the cases with the least number of pronouns. When annotated references are used, the anaphora model performs better than the forward model for Tom Cruise and TREC 2004, and also outperforms the transition model for Tom Cruise. These results indicate that when questions have a higher percentage of pronouns (e.g., Tom Cruise), the anaphora model with pronouns properly resolved will achieve higher performance compared to the other models.

Figure 5.3(c) shows the performance improvement of the annotated systems for the three models compared to the automated system. For the question pairs with the transition type continue, the performance of the annotated anaphora model increases across all the topics. This makes sense because if a question pair is on the same focused entity, according to Rule 1 of Centering Theory, this entity would be pronominalized as the backward looking center of the second question. The annotated system avoids the mistakes that the automated system makes in terms of reference resolution. Figure 5.3(c) also shows that the performance improvement of the annotated forward model and the transition model is relatively small compared to the automated system.
The implication of this result is that, for continue pairs, the forward and the transition models are less sensitive to the accuracy of reference resolution than the anaphora model.

Retain transition

Next, I present the evaluation results for the retain pairs. Figure 5.4 shows the overall comparison of the three models on the question pairs with the transition type retain, with Figure 5.4(a) for the automated system and Figure 5.4(b) for the annotated system. Table 5.9 lists the performance improvement of the transition model over the other two models based on the automated system.

[Figure 5.4. Performance on RETAIN pairs: (a) automated system; (b) annotated system.]

Table 5.9. Transition model performance improvement for retain

Topic        Increase over ana. model (%)   Increase over for. model (%)   Question (excl. the feeding) w/ pronoun (%)
Debate       -57.29                         -45.64                         14.5
Hawaii       -6.01                          38.14                          26.6
Pompeii      13.76                          30.85                          25.0
Tom Cruise   15.65                          -4.83                          81.7
Overall      0.89                           13.19                          36.3
TREC2004     0                              -10.82                         73.9

I first compare the transition model and the anaphora model based on the automated processing. Figure 5.4(a) and Table 5.9 first show that the transition model performs better than the anaphora model for retain pairs in Pompeii and Tom Cruise. One advantage of the transition model over the anaphora model is its capability of adding constraints from the context, as in example (15), where the year 1631 is inherited from Qi to Qi+1 for the query expansion. The Lemur retrieval results based on the different query expansion terms are also shown below. The document that contains the correct answer is not returned at all for the anaphora model (i.e., DocRank = 0), while it is returned as the first document (i.e., DocRank = 1) for the transition model. (A minimal sketch of this kind of constraint-inheriting query expansion is given below, after the comparison with the forward model.)

(15) Qi: In 1631 Vesuvius erupted again. This was the worst eruption since when?
     Qi+1: When was Vesuvius' last cycle?
     Query terms for Qi+1:
     Transition model: {when, was, Vesuvius, last, cycle, 1631}
     Transition model retrieval result: DocRank = 1, MRR = 1
     Anaphora model: {when, was, Vesuvius, last, cycle}
     Anaphora model retrieval result: DocRank = 0, MRR = 0

Secondly, Figure 5.4(a) and Table 5.9 show that the transition model performs the same as the anaphora model for the TREC 2004 data. The TREC 2004 data does not have many constraints, so the strategy for the transition model does not add more information, given that the transition model is mostly used to resolve the references in the same way as the anaphora model. However, I would expect a performance difference between the two models for longer questions with more constraints such as time phrases. Finally, Figure 5.4(a) and Table 5.9 show that the transition model performs worse than the anaphora model for Debate and Hawaii. What happened is that some constraints that do not carry much information, such as the adverb there, actually introduce noise into the search process. Based on this result, I suggest excluding this kind of adverb in QA processing.

Next, let us compare the transition model with the forward model. From Figure 5.4(a) and Table 5.9, we see that the transition model performs better than the forward model for the question pairs in Pompeii and Hawaii, and worse for Debate, Tom Cruise and TREC 2004. This result seems rather incidental.
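As referenced above, the following Python sketch illustrates the constraint-inheriting query expansion of example (15). It is only an illustration of the idea rather than the implementation used in this study; the whitespace tokenization and the simple digit test used to pick out temporal constraints such as "1631" are assumptions made for the example.

```python
# Illustrative query expansion for a RETAIN pair, following example (15).
# Assumption: pronoun resolution has already produced any antecedent terms
# (none are needed for example (15), so the list is empty there).

def anaphora_terms(question, antecedents):
    """Anaphora model: the question's own terms plus resolved antecedents."""
    return question.split() + antecedents

def transition_terms(question, antecedents, previous_question):
    """Transition model (retain): additionally inherit constraint terms,
    approximated here as bare numbers (e.g., years) from the previous question."""
    inherited = [t for t in previous_question.split() if t.isdigit()]
    return anaphora_terms(question, antecedents) + inherited

prev_q = "In 1631 Vesuvius erupted again This was the worst eruption since when"
curr_q = "When was Vesuvius last cycle"

print(anaphora_terms(curr_q, []))
# ['When', 'was', 'Vesuvius', 'last', 'cycle']
print(transition_terms(curr_q, [], prev_q))
# ['When', 'was', 'Vesuvius', 'last', 'cycle', '1631'] -- the year is carried over
```

The expanded term set corresponds to the transition-model query terms listed in example (15), which is what allows the answering document to be retrieved at rank 1 there.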
However, on closer examination, I found that the transition model for retain pairs does not seem to work better than the forward model for questions that have a high percentage of pronouns (e.g., the Tom Cruise and TREC 2004 questions). Note that this observation is similar to what was noticed for continue pairs. The fact that the transition model does not work well for the Debate data, which does not have many pronouns, indicates that a high percentage of pronouns is a sufficient but not a necessary condition for worse transition model performance.

Another interesting observation from Figure 5.4(a) is that the baseline model outperforms the transition model for the TREC 2004 data. The TREC 2004 data is more coherent than my user study data, under the assumption that the more a discourse participant continues focusing on a discourse entity, the more coherent the discourse is, and therefore the more continue pairs will be observed in it. This is exactly the case for the TREC 2004 data, as seen from Table 5.5: the TREC 2004 data has more continue pairs than my data (64% vs. 30%). Intuitively, a more coherent discourse would favor more context information for the purpose of discourse processing. However, the strategy I adopted for the transition model does not seem to help with the retain pairs, because the TREC 2004 data does not have many constraints such as time or location. The baseline, instead, is able to get more context information by simply concatenating the previous question to the question under processing.

Finally, I observed the low sensitivity of the transition model to a system's capability of correctly resolving pronouns, for the overall user data and especially for the TREC 2004 data.

Shift transition

Finally, I discuss the performance results for the shift pairs. Figure 5.5 shows the overall comparison of the three models on the question pairs with the transition type shift, with Figure 5.5(a) for the automated system and Figure 5.5(b) for the annotated system. From Figure 5.5(a) and Figure 5.5(b), we see that the transition model performs the same as the forward model, because the strategy for shift pairs in the transition model is simply to add the forward looking centers from the previous question to the current question, which is exactly what the forward model does. The baseline model performs better for questions with the shift type than for question pairs with the other two types, which indicates that the questions with the least coherence may not need much processing or other processing techniques. It should be noted that all the context questions within a sequence are somewhat related, even if two adjacent ones are regarded as less coherent according to Centering Theory (e.g., identified as shift). This is why, for shift pairs, simply running the baseline can sometimes yield pretty good performance (such as for the Debate data). The reason that the anaphora model does not work well is that the shift pairs normally do not have referring expressions.

The performance improvement of the annotated system over the automated system for the shift pairs is not as significant as for the other two transition types. Since there are not many cases where pronoun resolution is involved in the shift pairs, it is hard to examine how pronoun resolution would impact the different models. I also observed that the performance on the Pompeii data even drops a little for the annotated system.
[Figure 5.5. Performance on SHIFT pairs: (a) automated system; (b) annotated system.]

After examining the processing, I found examples such as (16), which could provide a possible explanation.

(16) Qi: When did Pompeii vanish?
     Qi+1: When did Vesuvius erupt?
     Qi+2: How did people try to recover their possessions?
     Transition model for Qi+2: {How, did, people, try, to, recover, their, possessions, Vesuvius}
     Transition model retrieval result: rank: 12; MRR: 0.083333336
     Anaphora model for Qi+2: {How, did, people, try, to, recover, their, possessions}
     Anaphora model retrieval result: rank: 13; MRR: 0.07692308

In (16Qi+2), although the pronoun their is resolved to people, the reference resolution does not do much to the query terms. However, for the transition model, the proper name Vesuvius is added to the query terms for (16Qi+2) because the entity Vesuvius is the forward looking center of (16Qi+1). By introducing an important discourse entity, Vesuvius, this operation actually increases the chance of hitting the right document for (16Qi+2). The retrieval result shows that the correct document returned for (16Qi+2) is at a better rank for the transition model than for the anaphora model.

To sum up, besides the individual performance characteristics, there are four major conclusions. First, for a context question discourse that has more continue pairs, the transition model works better than the forward model and the anaphora model. Second, for retain pairs, the transition model works the best for the data with constraints. Third, for the shift pairs, the baseline could be an effective alternative strategy. Fourth, the forward and the transition models are less sensitive to the accuracy of reference resolution than the anaphora model. In other words, the ability to correctly resolve pronouns affects the anaphora model the most and the transition model the least.

5.5.3 Evaluation and Analysis based on Pronouns

To further examine the performance of the different models on questions with different pronoun usages, I separated the questions into two categories for evaluation: questions with and without pronouns. Figure 5.6(a) and Figure 5.6(b) show the evaluation based on this pronoun dichotomy for the automated system and the annotated system. When Figure 5.6(a) and Figure 5.6(b) are compared, it is noted that the performance of the transition model on the overall user data and the TREC 2004 data is better than that of the other two models, both for the automated and for the annotated systems. This observation is similar to what was found when I separated the questions by transition types. Within the individual user data, the performance of the anaphora model on Hawaii and Tom Cruise increases more than on the other two topics for the annotated system. A possible reason is that both the Hawaii and the Tom Cruise data have a high percentage of pronouns. The transition model stays comparatively stable between the automated and the annotated systems.

Figure 5.7 shows the evaluation results for questions without pronouns, with Figure 5.7(a) for the automated system and 5.7(b) for the annotated system.
Figure 5.7(a) and 5.7(b) show that the transition model is still competitive with the other models even for the questions that do not have pronouns, although the advantage of the transition model differs across topics. For example, the performance increase for the Tom Cruise data is not as big as for the Pompeii data. Compared with Figure 5.6, the performance on the Debate data increases noticeably both for the automated and the annotated system. One possible explanation is that the majority of the Debate questions fall into this category. However, there is not much difference among the three models for the Debate data. This indicates that centering-based models are more appropriate for processing context questions focusing on entities rather than on events.

Figure 5.7(a) shows that the automated anaphora model works better than the baseline model for the Tom Cruise and the TREC 2004 data, but not for the other topics. For the questions that contain no pronouns, the anaphora model just takes all the terms from the question itself, whereas the baseline model adds the whole previous question to the current question under processing. Comparing Figure 5.7(a) and 5.7(b), we see that for the baseline and the anaphora model there is no performance improvement for the annotated system over the automated system, since there are no pronouns to be resolved. The performance improvement of the annotated system compared to the automated system for the transition model and the forward model is rather trivial, since the difference is within 3%, better or worse depending on the topic.

[Figure 5.6. Performance for questions with pronouns: (a) automated system; (b) annotated system.]

[Figure 5.7. Performance for questions without pronouns: (a) automated system; (b) annotated system.]

In summary, there are two important messages conveyed by the analysis based on the pronoun separation. First, the transition model outperforms the forward model and the anaphora model both for the questions with and for the questions without pronouns. Second, there is no significant advantage of the transition model for the event type data.

5.6 Summary

To support coherent information seeking, this chapter explores the use of linguistic knowledge in discourse processing for a sequence of questions. A question sequence is considered as a coherent mini discourse, and Centering Theory is applied to capture the discourse coherence. Three models based on Centering Theory (the anaphora model, the forward model, and the transition model) were proposed, implemented, and evaluated on my user study data and the TREC 2004 data.

The empirical results indicate that the transition model outperforms the anaphora model as well as the forward model for the overall data, with or without pronouns. The transition model and the forward model are less sensitive to the accuracy of automated reference resolution of pronouns.
More sophisticated processing based on discourse transitions and centers can significantly improve the performance of document retrieval compared to models that only resolve references. Since these models are based on discourse entities, the state-of-the-art natural language processing techniques are sufficient for the discourse processing involved.

This chapter presents my initial investigation of the role of discourse processing for context questions. There are many dimensions along which my future work will be pursued. For example, how to use linguistic knowledge and existing linguistic theories to help process event-based context questions has become an interesting topic. I will also extend context question answering to fully interactive question answering and investigate the role of discourse processing in this new setting.

CHAPTER 6

Discourse models based on definiteness hierarchy

6.1 Introduction

In the previous chapter, I investigated how to employ Centering Theory to process a context question using the context information from the previous question(s). Three discourse models were presented based on the framework of Centering Theory. In this chapter I continue the discussion of discourse modeling for context questions. In this part of the study, I adopt the definiteness hierarchy of noun phrases to help in the processing of context questions. The pronouns and the definite descriptions that appear in the questions will be my focus. The role of the target as providing context for all the questions in a set will also be investigated. I hope that this empirical effort will lead to a solution to the problem of building an efficient discourse model for processing context questions which contain not only pronouns but also definite descriptions. The TREC 2004 and 2005 data are used in the evaluation, both of which consist of question sets where each set asks for information regarding a particular target. A definite description is defined as a denoting phrase of the form "the X", where X is a noun phrase or a singular common noun that describes a specific individual or object. The noun phrases the book and the earth are examples of definite descriptions.

Recall that in Chapter 5, the three models perform better on the Tom Cruise data than on the Debate data. Besides the topic difference (i.e., Tom Cruise is an entity but the 2004 presidential debate is an event), these two datasets also differ in their distribution of definite descriptions. A preliminary examination shows that 66.7% of the Debate questions contain definite descriptions, more than 10 times the proportion for the Tom Cruise questions (6.4%). It is interesting to investigate how these definite descriptions are used and how they affect the processing of the context questions.

Secondly, this research is motivated by the new difficulties posed by the TREC 2005 data: 1) the increased number of definite description occurrences and 2) the new event type of target. The TREC 2005 data contains more definite descriptions than the TREC 2004 data. The TREC 2004 data consists of 65 sets of questions with 84 definite descriptions in total. The newly released TREC 2005 data consists of 75 sets of context questions with 239 definite descriptions in total, almost 3 times that of the TREC 2004 data. At the same time, the percentage of pronouns decreases for the TREC 2005 data. This suggests that research on definite descriptions should receive more attention than before.
Since pronoun resolution is a well-researched topic, it is no longer my focus for context question processing. The existing QA systems do not provide more processing beyond resolving pronoun references , therefore they are not sufficient for handling the data with more complicated linguistic expressions. As mentioned in Chapter 2, each question in a TREC question set is interpreted in the context of a TREC target, which could be of type people, entity, organization etc. The fact that all the TREC 2004 targets are in the form of noun phrases allows a system to use such an approach as replacing pronouns in a question with the corresponding target. However in TREC 2005, NIST added topic type event to the datasets. Targets like “Russian submarine Kursk sinks” or “Miss Universe 2000 crowned” are obviously not noun phrases. Problems arise immediately for the substitution strategy to process the questions with event type of targets. Therefore the role of target with respect to providing context for processing questions needs further examination. Thirdly, unlike the study of pronoun resolution, the empirical studies in definite descrip- 107 tions from a computational perspective are rather limited. There is no well-established computer algorithm or approach specifically dedicated to processing them. It is particu- larly true for processing them in context questions. Motivated by these reasons, in this chapter I continue my study on context questions and initiate an investigation on the definite descriptions, and their role in developing efficient discourse models for processing mini QA discourses. In addition, I will discuss the role of pronouns in context questions from another perspective. In order to do that, there are two questions that need to be answered first: what would be a proper theoretical framework for me to work on? How do I come up with an appropriate taxonomy that is computationally possible for process- ing the definite descriptions in context questions? Once these questions are answered, I will be able to develop corresponding discourse models to process definite descriptions for query expansion. The rest of the chapter is organized as follows: section 6.2 describes a new challenge that the current QA systems are facing; section 6.3 presents some related work on the compu- tational efforts that involve targets, definite descriptions, and event in context question answering; section 6.4 presents the theoretical background information for the empirical study conducted in this chapter; section 6.5 discusses the uses of definite descriptions; section 6.6 presents the discourse models that were developed specifically for handling the pronouns and the definite descriptions in context questions; section 6.7 presents the evaluation results for the models discussed in section 6.6; and finally section 6.8 concludes the chapter. 6.2 A new challenge in QA discourse The fact that the TREC 2005 QA test sets have more definite descriptions poses a new challenge for the current QA systems, because the state-of-the-art techniques have not provided adequate solutions to process them. In Chapter 2, I have mentioned that some of the TREC 2005 participating systems used the target-appending strategy to process context questions. This strategy has two drawbacks: 1) I observe that in the user data, 108 a pronoun used in a question does not necessarily refer to the target / topic, and (2) the target is not necessarily referred to only by a pronoun. 
The work in Chapter 5 has partially addressed the first issue. The centering-based models keep track of all the entities in the context questions, not merely focusing on the target entity, so a pronoun could refer to any of these entities. Next, I will look at the second problem. As a matter of fact, there are cases in the TREC 2005 data where definite descriptions are used to refer to the target. For instance, example (1)’s target is “Merck & Co.”. However it is referred by the definite descriptions the company in (1Q1), (1Q2), (1Q4) and (1Q6). (1) Q1: Where is the company headquartered? Q2: What does the company make? Q3: What is their symbol on the New York Stock Exchange? Q4: What is the company’s web address? Q5: Name companies that are business competitors? Q6: Who was a chairman of the company in 1996? Q7: Name products manufactured by Merck. This problem is addressed by the system developed at the University of Sheffield [16]. Their approach is to replace both the pronominal and the co-referential nominals with the target. The processed questions are shown in (2). (2) Q1: Where is Merck 65 Co. headquartered? Q2: What does Merck 65 Co. make? Q3: What is Merck 85 Co.’s symbol on the New York Stock Exchange? Q4: What is the Merck 55 Co.’s web address? Q5: Name companies that are business competitors? Q6: Who was a chairman of Merck 85 Co. in 1996? Q7: Name products manufactured by Merck 85 Co. This strategy seems working for example (1), however, another problem arises if there are definite descriptions that do not refer to the target, such as example (3) whose target is 109 “Shiite”. The definite descriptions the first Imam of the Shiite sect of Islam in (3Q1) and the third Imam of Shiite Muslims in (3Q4) do not refer to the target, but some entity related to the target. (3) Q1: Who was the first Imam of the Shiite sect of Islam? Q2: Where is his tomb? Q3: What was this person’s relationship to the Prophet Mohammad? Q4: Who was the third Imam of Shiite Muslims? Q5: When did he die? Given the increasing trend of definite description occurrences in the TREC question sets, it is necessary to take a good look at the definite descriptions in the context questions. Next, I will look at some computational work concerning the processing of definite descriptions, hoping to give the readers some idea of where I am and how my research relates to others’ work. 6.3 Related work I now review some of the computational work that involves definite descriptions. 6.3.1 Computational work on definite descriptions Computational efforts on processing definite descriptions have focused on two aspects: def- inite description resolution and definite description classification. These two aspects are closely related to each other. The goal of definite description resolution is to develop ap- proaches to identify the antecedents of definite descriptions automatically. The antecedent corefers to an entity/eventl with a definite description (such as Merck 65 Co. is the an- tecedent of the company in example (1) because they corefer to the entity Merck & Co. 1I only consider these two for the context questions. 110 company). Message Understanding Conferencesz, MUC—6 [91] and MUG-73 conducted subtasks to resolve definite descriptions. Systems such as Appelt et al.[92], Gaizaukas et al.[93], and Humphreys et al.[94] implemented specific modules to resolve definite de- scriptions. 
For example, the Discourse Interpreter module in Humphreys et al.[94]’s LaSIE—II system performs co-reference resolution between new and old concepts, a hier- archy of which represents the semantic net of the discourse under processing. Pronouns, proper names, definite descriptions and other type of noun phrases are considered in the co—reference mechanism. Searching for the antecedent of a definite description could be intrasentential or intersentential based on the co-reference rules that are specified by the system. Definite description classification aims to automatically identify definite description uses. In 1990 Fraurud [95], two classes of uses are presented in the empirical study: subsequent-mention covers the cases where a definite description denotes the same entity as its antecedent denotes, and all other uses are classified into first-mention. Poesio and Vieira 1998’s work [96] presents a system for classifying definite descriptions in arbitrary domains. They conduct a corpus-based investigation on the definite description uses in written texts. The definite descriptions are classified into: direct anaphoric, discourse new and bridging (details to be discussed in section 6.5). The solution to the classification task often serves to help identify the antecedents of definite descriptions. The results show that, based on the different classification schemes, the agreement among human annotators varies in determining the classes assigned to the definite descriptions as well as the antecedents assigned to them. They also conclude that the majority of the definite description uses are not anaphoric. 2A major conference whose purpose is to facilitate information extraction and machine learning. More information can be found at http: / / www.cs. nyu.edu / cs / faculty / grishman / muc6.html 3http: / / www-nlpir.nist.gov / related_pro jects / muc / proceedings / muc_7-toc.html 111 6.4 Theoretical background In the upcoming sections, I will introduce the linguistic knowledge that is relevant in building QA discourse processing models. The theoretical issues include: 1) the definite- ness hierarchy of noun phrases, which provides a ranked list of definite noun phrases. The ranking will be used in resolving pronouns and in helping resolve definite descriptions in the models developed for the study. 2) the Familiarity theory of definite descriptions, which provides us with a theoretical basis for categorizing the definite description uses in context questions; 3) some classic categorizations of the definite description uses, which cast helpful insights for us to classify the definite descriptions that appear in QA dis- courses. I now examine these issues in order. 6.4.1 Definiteness hierarchy In addition to definite description, it is not uncommon to find that researchers include more NP types as definite NPs. For example, based on the work of Prince [97], Birner &Ward [98] and Ariel [99][100], Abbott [89] presents two lists of NP types which are ranked roughly according to the definiteness and indefiniteness of these NPs. Definiteness is considered as a property of noun phrases used to identify “particular and determinate entity or group of entities” [89]. Such noun phrases as proper names, personal pronouns, NPs with various quantifiers etc. have been treated as having the property of definiteness. Meanwhile, there are NP types that have the characteristics opposite to the definite NPs, that is, when they are used, no particular entity or entities can be referred to. 
In that sense, they have the property of indefiniteness. Adapted from the lists presented in Abbott [89], Table 6.1 shows the ranking hierarchy that I used in the study. In the list, the pronoun is treated as the most definite NP type, while the indefinite NP (represented in the list as "A/An") is ranked as the least definite NP type. In my implementations, I actually combine demonstratives, definite descriptions and possessive NPs together and label them as one category, definite description. The reason is that there are only a few occurrences of demonstratives and possessive NPs in the TREC 2004 and TREC 2005 data. The incorporation of these three types will help me focus on the study of the definite descriptions. For the same reason, I combine bare NPs (represented as [DET ∅] in the list) with indefinite NPs into one category relabeled as indefinite. Other NP types mentioned in Abbott [89] are not included in the study because those NP types rarely occur in the QA discourses that I am concerned with.

Table 6.1. Definiteness hierarchy of noun phrases

NP type                Note                                  Example
Pronoun*               personal pronoun                      he, them
Demonstrative          NP with demonstrative determiner      this book
Definite description   NP with determiner the                the book
Possessive NP          NP with genitive NPs as determiner    his book
Proper name            full name or partial name             Bing Crosby
[DET ∅]                all bare NPs                          books
A/An**                 indefinite NP                         a book
* most definite   ** least definite

After the rearrangement of the definite noun phrases, I now have the ranked definiteness hierarchy shown below, where ">" means that the NP type on the left side is more definite than the one on the right side:

(4) Pronoun > Definite description > Proper name > Indefinite

I will return to this hierarchy in section 6.6 for more discussion and present its application in the QA discourse modeling.

6.4.2 Familiarity theory

In my study, I categorize the definite description uses based on the Familiarity theory in the linguistic literature. Next, I will cover the important essentials of this theory, and the "Uniqueness" theory for comparison.

Christophersen 1939's work [101] describes what has come to be known as the Familiarity theory of definite descriptions. Essentially, he describes the usage of definite descriptions based on an intuitive observation, that is, speakers tend to use the definite article (i.e., the) to refer to a referent that is mutually familiar to the hearer as well as to the speaker.

Now the speaker must always be supposed to know which individual he is thinking of; the interesting thing is that the the-form supposes that the hearer knows it too. For the proper use of the form it is necessary that it should call up in the hearer's mind the image of the exact individual [italics added] that the speaker is thinking of. If it does not do that, the form will not be understood. (Christophersen 1939 [101], p.28)

The article the brings about that to the potential meaning (the idea) of the word is attached a certain association with previously [bold added] acquired knowledge, by which it can be inferred that only one [italics added] definite individual is meant. That is what is understood by familiarity. (Christophersen 1939 [101], p.72)

While Christophersen emphasizes that both the speaker and hearer share knowledge of an individual through previous communication, he does not deny the implied uniqueness of this exact and only one individual. A similar familiarity assumption is also stated in Heim's work [102][103].
Heim develops the famous File Change semantics in an attempt to address the difference between definite and indefinite noun phrases. In her file card metaphor, a discourse has various file cards, each representing a discourse entity. Definites and indefinites differ in terms of how the file cards are kept. If an indefinite is used in the discourse, it is seen as starting a new card. On the other hand, if a definite is used, it is seen as updating an old card. On Heim's analysis, definite descriptions denote discourse-old entities (whose existence is presupposed) while indefinites introduce discourse-new entities, in Prince [97]'s terms.

However, there are many cases where definite descriptions denote entities that have not been mentioned in the discourse context and are not assumed to be familiar to the addressee. Consider the Miami debate in example (5). The definite description denotes an event that has not been introduced in this mini QA discourse.

(5) Q1: How many debates were there in the 2004 presidential election?
    Q2: What was the focus of the Miami debate?

The familiarity theory thus faces the problem presented by such empirical data as (5). David Lewis proposes the accommodation principle, which is meant to rescue the theory, in his classic paper "Scorekeeping in a language game" [104]. In his analogy, conversations are regarded as language games keeping score in an evolving way. Therefore definite descriptions do not necessarily denote entities "in existence" or "in some contextually determined domain of discourse" ([104], p.348). The problematic cases such as (5) are explained by his rule of accommodation for presupposition.

If at time t something is said that requires presupposition P to be acceptable, and if P is not presupposed just before t, then - ceteris paribus and within certain limits - presupposition P comes into existence at t. (Lewis [104], p.340)

In other words, referents may come into being automatically at the moment the speaker utters the denoting definite description. However, Lewis does not specify the "limits" that would constrain the so-called accommodation. Gazdar ([105], p.107) and Abbott [106] raise similar questions as to what degree the rule would allow a presupposition to be accommodated. Abbott ([106], p.1426) points out that "the theory would become almost vacuous, since no counterexamples would be raised against it".

6.4.3 Uniqueness theory

Given the problems of the familiarity theory, one would want to seek an account for the problematic examples. Uniqueness theory is another theory of definite descriptions. Unlike the familiarity theory, the notion of uniqueness has been established more from a semantic than from a pragmatic perspective. Here, I cannot skip one person, Bertrand Russell, who contributed significantly to the philosophy of language, and especially to the understanding of definite descriptions. Although his classic work came out more than a century ago, it has been and still is an inspiration for the study of the semantics of denoting phrases. The famous definite description the King of France used in example (6) is such a denoting phrase.

(6) S1. The King of France is bald.
    S2. The King of France is not bald.

On Russell [107]'s analysis, neither statement (6S1) nor statement (6S2) is true. The definite description the King of France is analyzed in terms of the compound statements in (7).

(7) S1. There is an x such that x is King of France.
    S2. There is no y, y not equal to x, such that y is King of France.
    S3. x is bald.
    S4. x is not bald.

Note that (7S1) states the existence of the entity the King of France while (7S2) states its uniqueness. The fact that France is a republic and there is no present King makes (7S1) false; therefore (6S1), whose truth value is decided by the conjunction of the three propositions (7S1), (7S2) and (7S3), is false, and so is (6S2), whose truth value is decided by (7S1), (7S2) and (7S4). In plain English, the logical form of (6S1) says: "there is an individual/entity who has the property of being King of France, and there is only one such individual/entity, and this individual is bald". In logical notation, (6S1) can be represented as (8).

(8) ∃x(King of France(x) & ∀y(King of France(y) → y = x) & Bald(x))

Therefore, Russell's analysis assigns uniqueness to the entity that fits the descriptive content of the NP. However, his theory of definite descriptions has received criticism in the philosophical, logical and linguistic literature. Interested readers are recommended to find more discussion of definite descriptions in Löbner [108], Kadmon 1990 [109], and Hawkins [110]. The discussion of the uniqueness theory is presented here to provide a complementary view to the familiarity theory. In the current study of modeling the QA discourse, I adopt a view that is closer to the familiarity theory. In the next section, I will elaborate on how I classify the definite description uses in QA discourses conforming to this view.

6.5 Classification of definite description uses

As mentioned in section 6.1, the TREC 2005 data has more occurrences of definite descriptions than the TREC 2004 data. In this section, I will look closely at how these definite descriptions are used in the QA discourse and provide a computationally feasible categorization of the uses.

Parallel to the study of the definiteness and indefiniteness of noun phrases, several classification schemes have been described in the literature, either for the purpose of theoretical study or for the purpose of natural language processing. Just to name a few, they are Hawkins [111], Prince [97], Fraurud [95] and Poesio and Vieira [96]. The definite descriptions in my data are classified mainly based on these four works. Hawkins' classification is the most descriptive one and is fundamentally important to the later works. Hawkins (1978) presents a very fine-grained enumeration of definite description classes. By reviewing his work, I intend to explore the feasibility of applying the classification to the definite descriptions in my data. Along with the brief review, some examples of definite description uses will be presented. They are mostly drawn from the TREC 2004 data and my user study data.

6.5.1 Fine-grained classifications

Based on the work of Christophersen [101] and Jespersen [112], Hawkins [111] discusses 6 significant uses of the definite article the. These uses are: anaphoric use, immediate situation use, larger situation use, associative anaphoric use, unfamiliar use and unexplanatory modifier use. From the detailed classification to be discussed below, we will see that some of the definite description uses are anaphoric, as the familiarity theory predicts, and some are not. Besides, the identification of the usage type sometimes relies on the semantic and pragmatic interpretation of the definite description, and sometimes on its syntactic features. The first significant use is the anaphoric use.
It refers to the use where a definite description refers to a discourse entity that is already in the discourse. Following Sid- ner [51], the term cospecify is intended to mean that both the definite description and its antecedent denote the same object. This use is common in the TREC data. The occurrences of definite descriptions in (9Q3), (10Q2) and (11Q3) (i.e. the court, the or- ganization, the awards and the prize) indeed cospecify with discourse entities introduced in the early discourse. (i.e. the international criminal court, AARP and the Nobel prize awards respectively.) (9) Q1: When was the international criminal court established? Q2: What kind of cases does it try? Q3: Who is the sponsor of the court? (10) Q1: What does AARP stand for? Q2: When was the organization started? (11) Q1: Who established the Nobel prize awards? Q2: When were the awards first given? Q3: What is the monetary value of the prize? The examples show the typical ways in which an anaphoric use is usually characterized: o the definite description could share the same descriptive predicate as in the an- tecedent (e.g. the international criminal court with the court); 0 use hypernym (e.g. organization is the hypernym of AARP); 0 use synonym (e.g. prize is synonymous to awards) etc. to indicate the same referent. The second significant type has two subtypes: visible situation use and immediate situa- tion use. These two uses do not have any occurrences in my data pool, because these two 118 uses involve spoken language instead of written language. Basically, the QA discourses that I am concerned with are in a written text form. The third use is the larger situation use (with two subtypes: larger situation uses with specific knowledge about the referent, and with general knowledge about the referent). Speakers / writers are sometimes in anticipation of the hearer / reader’s knowledge of entities that exist in the non-immediate or larger situation of utterance. Definite descriptions are thus used to indicate that both the speaker/ writer and hearer/ reader know about the existence of the referents, or in other words, share knowledge of the referents. For example, the referent could be an entity/ object known to community members (such as the definite description the president in 2006 referring to George W. Bush is known knowledge for the people in the US.) or an object known to everyone on earth, such as the sun or the moon. Consider (12) as an example for a larger situation use. (12) Q1: When were the 2004 presidential debates? Q2: What did the first debate cover? Q3: Who won? Q4: What was covered in the 3rd debate? In the feeding question, by using the 2004 presidential debates, the user expects mutual knowledge of the presidential debates taking place in the US. in 2004. The fourth significant use of a definite description is the associative anaphoric use. It indicates that the speaker and hearer share knowledge of the relations between entities, or properties/ attributes, or the components of the entities. For example in (13), the crew is associated with the Challenger space shuttle because normally a space shuttle is supposed to have a crew that performs various services on the flight. This semantic relation is common sense knowledge shared by the speaker/ writer and the hearer/reader. In other words, the Challenger space shuttle triggers the use of the crew”. (13) 5In Sidner[51], this type of usage is also discussed and such triggering phrases as the Challenger space shuttle are named associators. 
119 Q1: Which was the first space shuttle? Q2: When was the Challenger space shuttle disaster? Q3: How many members were in the crew? The other two uses that I am about to present are not as commonly seen as the first four. These uses are “unfamiliar” in the sense that the entities denoted by the definite descriptions are not previously introduced in the discourse, neither is the knowledge on them shared by the speaker and the bearer in one way or another, thus unfamiliar to the hearer. The fifth significant type the unfamiliar use is further classified into 4 subtypes. These four subtypes are characterized as having certain syntactic features associated with the definite descriptions. These features could be one of the following: o referent-establishing relative clauses: the referent is introduced into a discourse by a relative clause whose content helps to identify the referent. In example (14), the relative clause that destroyed Pompeii helps to establish a volcano as the referent. (14) Q1: What is the name of the volcano that destroyed Pompeii? Q2: What month did this first happen? 0 associative clause: this use occurs when an associative relationship, especially gen- itive relationship holds between the expressions in question. 0f the last debate6 in (15) helps the hearer to identify and locate the referent in question. (15) Q1: What was the format of the last debate? 0 nominal modifiers: in example (16), the modifier the band refers to the class to which the head noun Clash belongs. In other words, the modifier is a hypernym of the head noun. (16) Q1: what kind of music does the band Clash play? 6the format of the last debate could be converted to the last debate ’3 format. 120 o NP-complement: the definite description appears with a complement to the head noun, as shown in (17). (17) 81. Bill is amazed by the fact that there is so much life on earth[italics added]. (Hawkins[111], p. 140) Finally, the sixth significant use is the unexplanatory modifiers use. In (18), the largest, the same and the first contribution in Hawaii are classified into such a category because they do not function to establish any definite referent to the bearer. (18) Q1: Is tourism the largest contribution in Hawaii? Q2: Was the winner of this debate the same as the first? Clearly, the classification employed by Hawkins is quite sophisticated covering many de- tailed uses of definite descriptions. The basic idea of his classification aligns itself with the familiarity theory on the semantics of definite descriptions. 6.5.2 Prince (1992) Different from Hawkins, Prince (1992)[97] discusses the uses of noun phrases from a new angle. Only two aspects are taken into consideration: hearer status and discourse status. It should be noted that, this classification is also along the lines of the familiarity analysis of definite descriptions described in Christophersen [101] and Heim[102]. A discourse entity could be familiar with respect to a discourse as well as to a hearer. Table 6.2 shows the taxonomy of classifying discourse entities. Discourse entities may be hearer—new or hearer—old depending on the speaker/writer’s beliefs about the hearer / reader’s belief. If the speaker assumes that a discourse entity is “brand-new”, not to be known to the hearer, it is then classified as hearer-new. On the other hand, if in a situation, the speaker takes an entity to be known to the hearer, he/she would most likely use a proper name or a definite description to represent this so-called hearer—old information. 
However, the definiteness of a noun phrase does not necessarily reflect the 121 bearer-status. In addition, as Prince pointed out, this classification highly relies on the actual situation, which means a hearer-new reference for one may turn out to be hearer-old for another. Table 6.2. Hearer and discourse status of a discourse entity* Discourse-new Discourse—old Hearer-new Brand new NA Hearer—old Unused Evoked *reproduced from Prince 1992[97] Discourse-new and discourse-old are the other pair of names used to describe discourse entities. As the name indicates, discourse entities could be classified to be new or old with respect to their discourse status. This idea echoes with Heim’s familiarity theory in that discourse-old entities are the ones that have been previously evoked. Discourse-new entities are the ones that have not been mentioned in the previous discourse. From the discussion, we see that Hawkins’s anaphoric uses of definite descriptions are for the discourse-old entities, while his larger situation and immediate situation uses are often for the discourse-new and hearer-old entities. In addition, in Prince’s theory, the inferrables can be described by these two information schemes as well. That is, the inferrables are technically discourse—new and hearer-new, but are not entirely new. Those entities are related to the hearer-old entities, therefore they could be inferred. This use is similar to Hawkins’ associative anaphoric uses (such as the door and its triggering noun phrase a building). Finally, Prince’s containing inferrables are inferrables yet the entities are identified with the help of the descriptive content specified in the noun phrases. Definite descriptions involving NP complements, referent-establishing relative clauses and associative clauses could be examples of the containing inferrables, which is similar to Hawkins’ unfamiliar use. 6.5.3 Poesio and Vieira (1998) Another coarse-grained classification of definite description uses is presented in Fraurud [95]. As mentioned in section 6.3, Fraurud only uses two classes: subsequent-mention and 122 first-mention. Fraurud’s subsequent-mention is similar to Hawkins’ anaphoric use and Prince’s discourse-old. Based on the taxonomy modified from Hawkins and Prince’s work, three uses are clas- sified in Poesio and Vieira [96]: direct anaphoric, discourse new and bridging. Direct anaphora covers the cases where subsequently mentioned definite descriptions corefer with an antecedent that has the same head noun as the description; It differs from the anaphoric use classified by Hawkins or Prince’s textually evoked classes in that it only allows the cases where the antecedent shares the same head noun with the definite descrip- tion. To be more specific, this use does not include more complicated anaphoric relations such as synonyms or hypernyms etc. Discourse new includes definite descriptions that are used to “denote entities not related by shared associative knowledge to entities al- ready introduced in the discourse” ([96],p.542), which is similar to Prince’s discourse-new. Bridging covers two cases: 1) definite description having an antecedent denoting the same discourse entity, but using a different head noun; 2) Definite descriptions are similar to Hawkins’ associative, and to the indirect reference by association described by Clark 1977 [113]. According to Clark, the associated information has different levels of predictability from the entity, event or situation mentioned earlier. 
The most typical associations are Necessary part, probable part and inducible part. Examples of necessary part would be the ceiling of a room and the author of a book. However, the window may be a probable part of a room. The inducible part is an associated part that one would not normally predict but has to be inferred. For example, the use of chandelier is associated to a room by hearer’s inference, given that a room does not necessarily have a chandelier. Having reviewed the classifications of the definite description uses in literature, we now look at how the definite descriptions are categorized in my study. 6.5.4 Classifying definite descriptions in QA discourse After the preliminary examination of my data pool, I find that it is hard to adopt any of the classifications mentioned above without making any changes. If I adopted Hawkins’ classification scheme, the number of occurrences for each use varies widely. For example, 123 in the TREC04 data, there are no occurrences of visible situation use and immediate situation use at all. There are only 3 cases of referent-establishing relative clauses out of 88 definite descriptions, while there are 46 cases of anaphoric use. Given that Hawkins’ classification has 6 major classes (8 if including the subtypes) and the distribution from the TREC data and my data, it is not computationally beneficial for a computer system to identify each of these classes. Therefore I do not use this fine—grained classifications for my implementation. The study in Poesio and Vieira [96] also shows that the more classes a classification scheme has, the more difficult it is for human to agree on the annotation of the definite description uses. Therefore complicated classification schemes may cause a problem in evaluation. Prince’s[97] classification takes into account the hearer status, which is not relevant in my case. As I have mentioned, I focus on the written QA discourse modeling. It would have been a good categorization if the study were on the user interactive QA. Fraurud’s [95] classification is much simpler than Hawkins’, yet I feel it is way too coarse grained for my research purpose. In order to examine the behavior and the role of definite descriptions in QA discourse processing, I would prefer a classification that is not too simple, yet not too complicated. The three-class taxonomy in Poesio and Vieira [96] is adopted in my study, with some modifications made to suit my purpose. Based on the three classes in Poesio and Vieira [96], I categorize the definite descriptions in my data pool into three classes: discourse- new, direct anaphoric and bridging. Based on the three classes, I then manually annotate all the definite descriptions in the data for further processing. The classification is based on the discourse status rather than the hearer status. Discourse-new conveys that the entity that the definite description refers to has never occurred in previous QA discourse. The classification is rather strict. For example, the definite descriptions that are part of common knowledge and normally would be categories as hearer-old (such as the US or the U. N.) are treated as discourse-new if they never appear in the questions before. That is, all the larger situation uses in Hawkins’ term will be annotated as discourse-new if they have not been mentioned in the previous QA discourse. 124 Bridging means that the entity is associated with an entity that appears in the previ- ous discourse. 
Now consider example (19) where the definite description the company’s headquarters in (19Q2) is used as bridging because the entities that the company’s head- quarters refer to are associated with the entity Conde Nast. Headquarters are common to a big company, in other words. The publishing company Conde Nast in (19Ql) is used as discourse-new in this example. Different from the bridging use that is described in Poesio and Vieira [96], I restrict the annotation of the bridging uses based on a mutual- containing relationship, which will be discussed in section 6.6.2. Briefly, I will identify the use of the company’s headquarters as bridging. Based on this usage, the expression the publishing company Conde Nast for the associated entity in (19Q1) will then be used for query expansion in a definite model, which will be described in section 6.6.2. (19) Q1: Where is the publishing company Conde N ast? Q2: Where are the company’s headquarters? Direct anaphoric means that the entity is the same entity that has been mentioned in the earlier discourse. For the ease of annotation, I follow the strict interpretation mentioned in Poesio [96], only considering the expressions with the same head and referring to the same entity. For example in (20), the center is used as anaphorically because it refers to the same entity that the Berkman Center for Internet and Society in (20Q1) refers to. Both expressions share the same head center. (20) Q1: Where is the Berkman Center for Internet and Society located? Q2: When was the center formed? Next, I present the discourse models developed to process the QA discourses. 125 6.6 Discourse models focusing on definite descrip- tions Readers may have noticed that the discourse models presented in Chapter 5 have not used the target(topic) information of the question set, although the targets are explicitly provided in the TREC 2004 data. In this section, I present two sets of experiments that are conducted under two different conditions: 1) the QA system is provided with the targets; 2) the QA system does not know the targets. The reasons I want to examine the two situations are that on the one hand, I would like to see how the presence of a known target would affect the discourse modeling; on the other hand, I would like to see how well a QA discourse would benefit from extra processing based on linguistic knowledge beyond known targets. For the convenience of presentation, I call these two conditions as the no-target situation and the target-given situation. Although TREC data provides the target information, in practice, the target information may not be provided in the first place, but rather needs to be inferred from the context. The design of the no-target situation is motivated by such circumstances. In the target-given situation, before actually asking questions, the users explicitly inform the computer system: “Hey, this is what I want to know about.” Before I elaborate the models developed for these two situations, it is necessary to discuss the questions that I aim to answer: 0 What is the role of a target? We have seen that most participating TREC systems made use of the targets (discussed in Chapter 2), but we also see the potential problems mentioned at the beginning of this chapter. What is the difference that a discourse processing model would make in the no-target and target-given situations if everything else remains the same? 
What would be the most efficient way for a system to intelligently store the relevant context for processing later questions? 0 What is the role of the definiteness hierarchy in processing QA discourse? How can it be used? I implemented the anaphora model in Chapter 5 using the centering 126 algorithm. To resolve pronouns, the discourse entities were ranked and then the antecedent was picked according to the ranking of the forward looking centers. Is it possible to pick the antecedent according to the NP type ranking in the definiteness hierarchy? o How does the classification of the definite descriptions help with the processing? What are the potential implications for my system? In order to build discourse models to explore these questions, the TREC 2004 and TREC 2005 data are annotated accordingly. Next I will describe the data separately. 6.6.1 Data analysis Data annotation In addition to the annotation entries introduced in Chapter 4, I add more linguistic information to the XML annotation of the TREC 2004 and 2005 data. For the target- given situation, I add the target to the question sets and make them the first question of each set. I label it as go as other questions are incrementally labeled as ql, q2, etc. The targets are annotated the same way as other questions. The example (5) in Chapter 4, repeated as (21) below shows the basic question information and linguistic information annotated for the questions. Because this part of the research is also entity-based, my major concern is the noun phrases, definite noun phrases in particular. The semantic information (such as the NP type) as well as the syntactic information (such as the grammatical role) of these noun phrases are also annotated. For a definite description, the head noun, its pre-modifiers7 and post-modifiers8 are annotated as well. In online processing, these types of information can be automatically acquired through parsing. Based on the three classes introduced above, the use of the definite description is also annotated. For example, for the definite description the first Crip gang, gang is annotated as the head, first Crip as the premodifier, new as to indicate discourse-new. Entities that 7Modifiers appearing before the head noun. ”Modifiers appearing after the head noun, including relative clauses, NP complements etc. 127 appear as modifiers for other entities are also annotated. The only annotation difference for the TREC data in the two situations is that the targets are removed for the no—target situation, leaving everything else (i.e. the annotation tags) the same. (21) Crips When was the first Crip gang started ? Example (22) shows a context question set from the TREC data. We see that (22Q1) uses the pronoun it to refer to the target Camp David. In order to process (22Q1), a system has to know the target first. There are 7 such cases in the TREC 2004 data and 12 in the TREC 2005 data where the pronouns in the feeding questions refer to the 128 targets. (22) Target: Camp David Q1: Where is it? Q2: How large is it? Q3: What was it originally called? Q4: When was it first used? Q5: What US. President first used it? There are also cases like (23) where the pronouns appearing in the subsequent questions refer to the target Longwood Gardens. (23) Target: Longwood Gardens Q1: When was the initial land purchased? Q2: Where is it? Q3: How large is it? Q4: Who created it? Q5: How many visitors does it get per year? Q6: When is the best month to visit the gardens? 
Considering examples like (22) and (23), some of the original TREC feeding questions are modified so as to provide necessary information. The modifications could be one of the followings: o substitute pronouns with the target: Q1: When was he born? =>When was the architect Frank Gehry born? 0 complete proper names: Q1: When was Guthrie born? =>When was Woody Guthrie born? 0 complete the event situation Q1: When did the school shooting occur? =>When did the Kip Kinkel school shooting occur? 129 For example (23), the feeding question (23Q1) was modified to: Q1: When was the initial land of Longwood Gardens purchased ? From my study, I find that users may not explicitly provide a target and then specify a sequence of questions. The first question specified in a sequence tends to provide the context and implicitly indicate the target / topic of the sequence. The main purpose of the modifications is to provide enough information for the feeding questions, because without an explicitly provided target, the processing of the subsequent questions will have to rely on the feeding question. Data analysis Table 6.3. Data analysis for the original TREC 2004 and TREC 2005 data Number TREC 2004 TREC2005 Difi%* Total Question set 65 75 15.4 140 Factoid questions 230 362 57.4 592 Question that has no answer 26 30 15.4 56 Pronoun 132 102 -22.7 234 Question that has pronoun 127 97 -23.6 224 Pronoun per question 0.574 0.282 -50.9 0.395 Definite description 84 239 185 323 Question that has DD** 74 188 154 262 DD per question 0.365 0.66 81 0.546 Discourse new 58(69.1%) 113(47.3%) 94.8 171(52.9%) Direct anaphoric 19(22.6%) 69(28.9%) 263 88(27.3%) Bridging 7(8.3%) 57(23.8%) 714 64(19.8%) * the increase of TREC 2005 data compared with the TREC 2004 data ** Definite description Now, let us look at the data characteristics of the TREC 2004 and TREC 2005. Table 6.3 shows the data distribution statistics for the TREC questions. There are two major differences between the TREC 2004 and TREC 2005 data: the decrease of the pronoun occurrences and the increase of the definite description occurrences in the TREC 2005 data compared with the TREC 2004 data. 130 The number of pronoun per question of the TREC 2005 (0.282) decreases 50.9% com- pared with the TREC 2004 data (0.574). This poses challenges for the pronoun resolution oriented approaches. The number of definite descriptions per question in the TREC 2005 data (0.66) increases 81% compared with the TREC 2004 data (0.365). The total number of questions that have definite descriptions in the TREC 2005 data increases 154%, while the total number of questions only increases 57.4%. Also, the major use of the definite descriptions is discourse-new, 69.1% for the TREC 2004 data and 47.3% for the TREC 2005 data. To be more specific, more than half of them (52.9%) are discourse-new for the total TREC data. Only 27.3% definite descriptions are anaphoric and 19.8% are bridging. The definite description distribution of the TREC data is very similar to what is re- ported in Poesio and Vieira [96]. They have 50% of discourse-new, 30% anaphoric and 18% bridging/ associative. Poesio and Vieira reveal that in certain genre of texts, “the definite descriptions are not primarily anaphoric”. Therefore I hope my investigation of the definite descriptions in context questions will cast new insights into generalizing their behavior. Table 6.4 shows the data distribution of the modified data sets for the no—target Table 6.4. 
Data analysis for the modified TREC 2004 and TREC 2005 data Number TREC 2004 TREC2005 Diff%* Total Question set 65 75 15.4 140 Factoid questions 230 362 57.4 592 Question that has no answer 26 30 15.4 56 Pronoun 125 90 -28 215 Question that has pronoun 120 85 —29.2 205 Pronoun per question 0.543 0.249 —54.1 0.346 Definite description 99 248 151 347 Question that has DD** 83 193 133 276 DD per question 0.43 0.685 59.3 0.586 Discourse new 73(73.7%) 121(48.8%) 65.8 194(55.9%) Direct anaphoric 19(19.2%) 69(27.8%) 263 88(25.4%) Bridging 7(7.1%) 58(23.4%) 729 65(18.7%) * the increase of TREC 2005 data compared with the TREC 2004 data ** Definite description 131 situation. Because I substitute some pronouns in the original feeding questions with the corresponding targets, I then have more definite descriptions (i.e. total 99 instead of 84 for the TREC 2004 data, 248 instead of 239 for the TREC 2005 data). The fact that more definite descriptions are added to the feeding question can explain the increase of the discourse-new occurrences(55.9%) compared with the original TREC data (52.9%). In general, I observe consistent pronoun and definite description distribution for the modified TREC data. In the coming sections I describe my discourse modeling in detail. First I present the experiments on the modified TREC data where the targets are not explicitly specified, and then the experiments conducted on the original TREC data with the targets explicitly specified. 6.6.2 No-target situation First I explore the situation where no targets are explicitly given for the context question sets. Under this condition, the processing on all the later questions will have to solely rely on the context from the previous question(s). I design three models to investigate the issues I have discussed at the beginning of section 6.6: 1) a pronoun-resolution model(I also rename it a pronoun model); 2)a definite description resolution model (I rename it a definite model); 3) a combined-resolution model (I rename it a combined model); The pronoun model aims to resolve all the pronouns appearing in the questions. Unlike the centering-based approaches described in Chapter 5 for the anaphora model, here the pronouns are resolved based on the definiteness hierarchy. The definite model aims to resolve the definite descriptions in the questions and locate the antecedent that each definite description refers to. The combined model resolves both the pronouns and the definite descriptions, therefore it is an incorporation of the pronoun model and the definite model. The separation of the resolution components (the pronoun model and the definite model) enables us to examine the individual effect that they have on the entire processing. Meanwhile, the combined model allows us to see the overall effect on pronouns and definite descriptions has on the QA discourse modeling. 132 Pronoun model Recall that in Chapter 5, I have presented an anaphor model that resolves pronouns using a Centering algorithm [1]. However, here I employ a different mechanism to resolve pronouns for the pronoun model. I use some semantic constraints and the definiteness hierarchy instead. The semantic constraints include the number, person, case, and gender agreement between the pronouns and their potential antecedents. They are used to filter out the unqualified discourse entities. 
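As an illustration of this filtering step, the following minimal Python sketch is my own reconstruction rather than the system's actual code; the feature names and the candidate representation are assumptions, and case agreement, where applicable, can be added in the same way. Candidates that survive the filter are then ranked by the definiteness hierarchy given below.

    # Minimal sketch of the semantic-constraint filter, assuming every pronoun and
    # candidate entity carries number, person, and gender features (None = unknown).

    def agrees(pronoun_feats, entity_feats):
        # An entity stays a candidate only if it does not clash with the pronoun on
        # any feature; an unknown value on either side never counts as a clash.
        for feature in ("number", "person", "gender"):
            p, e = pronoun_feats.get(feature), entity_feats.get(feature)
            if p is not None and e is not None and p != e:
                return False
        return True

    def filter_candidates(pronoun_feats, entities):
        return [e for e in entities if agrees(pronoun_feats, e["features"])]

    # Hypothetical entities taken from a preceding question.
    entities = [
        {"text": "the Berkman Center for Internet and Society",
         "features": {"number": "singular", "person": "third", "gender": "neuter"}},
        {"text": "students",
         "features": {"number": "plural", "person": "third", "gender": None}},
    ]
    it_features = {"number": "singular", "person": "third", "gender": "neuter"}
    print([e["text"] for e in filter_candidates(it_features, entities)])
    # ['the Berkman Center for Internet and Society']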
If a pronoun appears in question Qn, then this filtering is conducted in a backward fashion, that is, first to filter the entities in Qn_1, then Qn_2, so on and so forth until a pr0per candidate is found. If more than one candidates are left after the semantic constraint filtering, then the definiteness hierarchy is used to rank the candidates. Whichever entity that is ranked higher will be assigned as the antecedent of the pronoun. The modified definiteness hierarchy is shown as follows, where major NP types are ranked according to their definiteness. (24) Pronoun > Definite description > Proper name > Indefinite > Other It should be noted that if a potential candidate is a definite description and this definite description has to be resolved further, the pronoun model will not further resolve the candidate definite description. For example, its in (25Q3) will only be resolved to the definite description the center in (25Q2) instead of being resolved to its real antecedent the Berkman Center for Internet and Society. (25) Q1: Where is the Berkman Center for Internet and Society located? Q2: When was the center formed? Q3: What is its mission? By doing so, I try to separate the influence of resolving definite descriptions, which is my focus in the definite model. 133 Definite model The definite (resolution) model intends to resolve all the definite descriptions in the ques- tions. In other words, for each definite description, it attempts to find an antecedent for it in the QA discourse if there is any. The discourse-new entities are defined as the entities that never occur in the previous discourse, therefore they do not have antecedents. In my study I mostly focus on the processing of the definite descriptions of the other two types: direct anaphoric and bridging. The basic idea is to locate the antecedent and then inherit relevant information of the antecedent entity to help with query expansion. Now I describe in detail the steps that are taken to build the model: 1) the identification of the direct anaphoric use of the definite descriptions; 2) the identification of the bridging use of the definite descriptions; 3)the identification of the discourse—new use of the definite descriptions; 4) the strategy for query expansion. The identification of the direct anaphoric definite descriptions is based on a mechanism called NP head matching. The following situations are allowed or regarded as NP head matched between a definite description and its antecedent: o head-matching for the premodified definite description containing less information than the antecedent that has premodifiers; For example, the definite the first debate has less information than its antecedent the first 2000 presidential debate; 0 head-matching for the premodified definite description containing additional infor- mation than the antecedent that does not have premodifiers; For example, the premodified definite the first Auto Show has more information than its antecedent the show. These two rules are mentioned in Poesio and Vieira [96]’s work, and they also mention about the other two rules that do not constitute as a head match. 0 head-matching for cases where both the definite and the antecedent are postmodified and the modifications are not the same; For example, the Bollywood equivalent of Beverly Hills and Bollywood ’5 equivalent of the Oscars both have the same head noun equivalent but different postmodifiers of Beverly Hills and of the Oscars. 
134 e head-matching for the premodified definite containing more information than the antecedent that has premodifiers; For example, the premodified definite his second successful career has more information than the antecedent his primary career. I relax the criteria so the two situations above do constitute a head match, because after the preliminary examinations, I found that these two head-matching rules would mostly result in correct resolution. This can be explained by the fact that the context question sets are all about specific topics. Intuitively, it is unlikely for a series to explore two different entities whose corresponding expressions have the same head. The bridging uses are identified by a mutual-containing relationship between the defi- nite description and its antecedent. Mutual-containing is a rule that I define to identify different but related entities. Entities may relate to each other by sharing certain proper- ties. Expressions that refer to bridging entities were related through the following ways: 0 mutual-containing occurs when the antecedent appears as either the premodificr or the postmodifier of the definite; For example, the antecedent space shuttle appears as the premodifier in the definite the space shuttle disaster. 0 mutual-containing occurs when the definite appears as either the premodifier or the postmodifier of the antecedent; For example, the definite the Challenger appears as the postmodifier of the antecedent the crew of the Challenger. o mutual-containing occurs when the definite and the antecedent share the same pre- modifier or postmodifier. For example, the definite the first flight and the antecedent first shuttle share the same premodifier first. In general, the mutual-containing rules attempt to establish a relationship between the definite description and its antecedent by their lexical ties. Definite descriptions that are of the traditional bridging use will not be considered as bridging if the mutual-containing rule cannot be applied, for instance, the definite description the knob and its antecedent door. The knob will then be identified as discourse-new if it appears for the first time in the discourse. This kind of bridging, however, is possible for a system to identify with the help of an extra semantic knowledge resource, such as WordNet. 135 Based on the head—matching and the mutual-containing rules, if a definite description is neither a direct-anaphoric nor a bridging, and if it appears in the discourse for the first time, it will then be identified as discourse-new. Once the definite descriptions are identifiedg, I then use the heading-matching and the mutual-containing rules to find the antecedents of the direct anaphoric and bridging definitesl.0 To look for the antecedent for the definite in question Q", the definite model will process the entities in its previous questions Qn_1, then Qn_2 until an antecedent is found according to the rules. In the models I implement, the strategy for query expansion is to inherit all the terms from the antecedent (of the definites of direct anaphoric use or bridging use) and add them to the current question for query expansion. Combined model The combined model incorporates the pronoun resolution element with the definite reso- lution element. It is a joint effort of providing more linguistic knowledge to the processing of the context questions. 6.6.3 Target-given situation Now I turn to discuss the situation where the targets are provided. 
As I have mentioned in Chapter 2, most participating 2005 TREC QA systems directly append the corresponding target to the questions that contain pronouns for query expansion. Under the assumption that a target is given for a question set, I conduct a series of experiments aiming to address the following issues: 0 What is the role of an explicit target for a context question set? Is it all that is needed to process the QA discourse? o How does the target-given situation affect the pronoun resolution and the definite description resolution? Are the resolutions still necessary for the purpose of query expansion? 9The definite description uses are manually annotated in the XML according to the rules. 10The identification of the antecedents are implemented automatically in the system. 136 e What is the difference on the situations where targets are explicitly specified or implicitly inferred? In order to answer these questions, I design several experiments that aim to separate the influence of the target from the influence of the extra processing for pronouns and definite descriptions. The following models are examined in detail: a baseline model, a pronoun model, a definite model, a combined model, a target model, and a combined target model. Baseline The baseline takes the whole previous question and concatenates with the current question for query expansion. This model assumes that the previous question provides context for the processing of the current question. Pronoun model The pronoun model is the same as the pronoun model in the no—target situation except that if a pronoun cannot be resolved either by the semantic constraints or the definiteness hierarchy, it will be resolved to be the target. Extra efforts are made by this model to resolve all the pronouns which include both the target-referring pronouns and other pronouns such as it in (26Q4). (26) Target: Rose Crumb Q1: What was her occupation? Q2: Where was she from? Q3: What organization did she found? Q4: When did she found it ? Q4’: Query for Q4 in the pronoun model: {When, did, she, found, it, Rose, Crumb, organization} Retrieval result for Q4 in the baseline model: DocRank: 0; MRR= 0.0 Retrieval result for Q4 in the pronoun model: DocRank: 2; MRR: 0.5 Therefore the query expansion for (26Q4) would look like (26Q4’), where she is resolved 137 to the target eventually, after the pronoun resolution algorithm finds out that it cannot be resolved to any entities from any preceding questions of (26Q4). Meanwhile, it in (26Q4) will be resolved to organization through the semantic constraint rules, because only the entity organization matches with it in terms of number and gender. Again, the definitencss hierarchy is used to rank the candidate entities if there are more than one entities left after the semantic constraint filtering. The retrieval result shown in (26) indicates that the pronoun model performs better than the baseline model in terms of MRR. Definite model The definite model is to resolve the definite descriptions in the question sets and pick corresponding terms from the antecedents for the purpose of query expansion. This is the same as what I have discussed in section 6.6.2 except that targets are used to help resolving the definites. The definite descriptions are resolved to their antecedents in the preceding questions. For different uses (i.e. discourse-new, direct anaphoric and bridg- ing), I select the terms from the antecedents to expand the question under processing. 
In section 6.6.2, I do not process any expressions that correspond to the discourse entities used as discourse-new. However, now that the target is given, it is more likely that the discourse-new entity is the entity that the target refers to. Therefore, I match the head noun of the discourse-new entity to the target entities. If there is a head-matching be- tween the discourse-new entity and any of the target entities, then I identify the matched target entity to be the antecedent of the discourse-new definite. For example, Berkman Center for Internet and Society is the target for the questions in example (27). The definite description the center is annotated as discourse-new because the entity appears for the first time in the mini QA discourse. However, the definite description the center head-matched 11with the target noun phrase Berkman Center for Internet and Society, for both of which share the same head center. The identification rules for the antecedents of the direct anaphoric and bridging definites in this model are the same as what I employ 11See the discussion of head-matching in section 6.6.2 138 in the no-target situation. (27) Q1: Where is the center located? Q2: When was the center formed ? Q3: What is its mission? After the antecedents of the definites are identified, the linguistic expressions correspond- ing to the antecedents are used to form queries for the questions currently being processed. Namely, for example (27), the final query terms for (27Q1) would be {Where is, the, cen- ter, located, Berkman, Center, for, Internet, and, Society}. Recall that the query terms are input into the retrieval engine as a bag of words, so the repeated terms such as center are handled by the retrieval engine. Combined model The combined model is to resolve both the pronouns and the definite descriptions de— scribed above. This incorporated model attempts to resolve different types of references that may occur in QA discourses. Target model The target model is to expand the questions by adding the target terms to each individual question; For example, in example (28) the target is provided as Rose Crumb, so the query expansion for the target model would give us the input as shown in (29). (28) Q1: What was her occupation? Q2: Where was she from? Q3: What organization did she found? Q4: When did she found it? Q5: How old was she when she won the awards? (29) Query for Q1: {What, was, her, occupation, Rose, Crumb} 139 Query for Q2: {Where was, she, from, Rose, Crumb} Query for Q3: {What, organization, did, she, found, Rose, Crumb} Query for Q4: {When, did, she, found, it, Rose, Crumb} Query for Q5: {How, old, was, she, when, she, won, the, awards, Rose, Crumb} In this example, the pronouns in (28) are mostly resolved to Rose Crumb implicitly by concatenating the target terms. However, It in (28Q4) is not resolved in this model. Combined target model Finally, the combined target model is to resolve both the pronouns and the definite de- scriptions with the help of the target information. In addition to the combined model where the unresolved pronouns and the discourse-new definites are processed using the information from the targets, this model simply concatenates the target terms to each question as an enhancement to provide more information even if there is no pronoun or definite in the question. 6.7 Evaluation for definiteness hierarchy based mod- els 6.7 .1 Evaluation for the no—target situation The evaluations are conducted using the same metrics as I have used in Chapter 5. 
The MRR (Mean Reciprocal Ranking) and coverage are used as major evaluation methods. Results on the TREC 2004 and 2005 data are also presented separately to show the effects that different models have on them. The performance results under the assumption that no specific targets are given will be presented first. Six models are implemented for the no-target situation: 1) a baseline model that takes the whole previous question to concatenate with the current question for query expansion; 2) a pronoun model that resolves the pronouns in the TREC questions; 3) an extensive pronoun model (to be discussed below); 4) a definite model that resolves the definite 140 descriptions in the TREC questions; 5) a combined model that resolves both the pronouns and the definite descriptions; When I implement the pronoun model, I have found that 15 out of the total 215 pro- nouns could not be resolved according to the semantic constraints. There are no qualified potential antecedents for the pronouns based on the constraints such as gender, number etc. I now show one example of each type that occurs in the data. (30) Q1: What causes tsunamis? Q2: Where does it commonly occur? Q3: What is its maximum height? Q4: How fast can it travel? ‘ Q5: What language does the term “ tsunami ‘ come from? This example shows the case where a common noun is used in a plural form but referred to by a singular pronoun. (31) Q1: What is the primary symptom of a cataract? Q2: How are they treated? Example (31) is a different case where a common noun is used in a singular form but referred to by a plural pronoun. (32) Q1: When was the company Harley-Davidson founded? Q2: Where is it based? Q3: They are best known for making what product? Example (32) shows a case where plural pronoun is used to refer to an organization. (33) Q1: What is Jesse Ventura ’s political party affiliation? Q2: What is his birth name? Q3: What is his wife ’s name? Q4: How many children do they have? 141 Example (33) shows a situation where two entities in the discourse are referred to to- gether. (34) Q1: What was the Louvre Museum before it was a museum? In example (34), the antecedent of a pronoun appears in the same question rather than in a previous question. The next example shows a case where the pronoun it is not used for referring. It is used as an expletive. (35) Q1: When was Cassini launched for space probe? Q2: How much did it cost to build? For the cases mentioned above, again, the definiteness hierarchy is used in my implemen- tation. I modify the algorithm to resolve the pronouns according to the ranking of the entities in the feeding question based on the definiteness hierarchy. I name it the errtensive pronoun model. This model selects the highest ranked entity from the feeding question to be the antecedent. By doing so, the unresolved pronoun is forced to be resolved to some entity in the QA discourse, given the assumption it is related to the most prominent entity according to the definiteness hierarchy. Table 6.5 shows the MRR performance of the 4 models that are run on the TREC data: the baseline model, the pronoun model, the extensive pronoun model, and the combined model that adopts both the extensive pronoun resolution and the definite description resolution (i.e. the definite model)1.2 Table 6.6 shows the coverage information of the 4 models. Considering there are questions that have no answersl3(sometimes the feeding questions), I separate the evaluation into two groups. 
The Incl rows in Table 6.5 show the results on all the questions in my data pool and the Excl rows show the results on the questions excluding the ones that do not have answers. Note that for the definite resolution component used within the combined model in Table 6.5, the strategy applied is to inherit the whole noun phrase that corresponds to the antecedent entity into the question currently being processed. Different strategies could be implemented, but the basic idea is to inherit relevant information about the antecedent entity to help with the query expansion. The Diff% columns in Table 6.5 show the MRR increase (in percentage) of the models (the ones represented by the columns to the left of the Diff% columns) over the baseline model.

12 The performance of the definite model itself is not shown because it is not comparable with the other models.
13 Although it was the NIST staff's intention to provide test questions that had answers in the test document collections, such questions still existed in the TREC 2004 and TREC 2005 tracks.

Table 6.5. MRR performance based on definite hierarchy in no-target situation (Incl and Excl rows for the TREC 2004, TREC 2005, and total data; columns for the baseline, pronoun, extensive pronoun, and combined models, each followed by its Diff% against the baseline)

Table 6.6. Coverage performance on all TREC data* in no-target situation
Doc rank  Baseline  Pronoun  PronEx**  Combined***
1         14.0      17.0     17.0      18.0
5         30.0      34.0     35.0      36.0
10        38.0      41.0     42.0      45.0
20        43.0      48.0     49.0      52.0
30        46.0      52.0     53.0      56.0
50        53.0      58.0     60.0      62.0
* Based on the total TREC data, i.e. including all the questions
** PronEx is the extensive pronoun model.
*** The combined model implements the extensive pronoun resolution.

From Table 6.5 and Table 6.6, we see the following: 1) The combined model works the best for the overall TREC data; this suggests that the extra processing of the pronouns and the definite descriptions does benefit the performance. 2) The extensive pronoun model has more of an advantage on the TREC 2004 data than on the TREC 2005 data. This can be explained by the fact that the TREC 2004 data has a higher percentage of pronouns and the TREC 2005 data has a higher percentage of definite descriptions. 3) The pronoun model, the extensive pronoun model, and the combined model all work better than the baseline model. The result indicates that if both the pronouns and the definite descriptions are resolved, more context information is obtained than is provided by the baseline model, which takes the whole previous question for query expansion. 4) When I compare the extensive pronoun model, which makes use of the definiteness hierarchy, with the centering-based anaphora model in Chapter 5, I find that the extensive pronoun model works better. The MRR (for all the questions) for the extensive pronoun model on the TREC 2004 data is 0.275, while the MRR for the anaphora model on the TREC 2004 data is 0.221 (see Table 5.6); the performance increases by 24.5%. I also notice that the transition model (MRR 0.289) in Chapter 5 works better than the extensive pronoun model for the TREC 2004 data. 5) Finally, I notice a significant improvement if I exclude the questions that do not have answers from the evaluation.
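Both measures are straightforward to compute from the rank of the first supporting document retrieved for each question. The following minimal Python sketch is my own illustration of the two metrics as I use them (the function names and the input format are assumptions, not the actual evaluation code): MRR averages the reciprocal rank over all questions, counting a miss as zero, and coverage at a document-rank cutoff is the percentage of questions whose first supporting document appears within that cutoff.

    # Minimal sketch, assuming each question is summarized by the 1-based rank of the
    # first supporting document retrieved for it, or None when nothing relevant is found.

    def mean_reciprocal_rank(first_ranks):
        # A question answered at rank r contributes 1/r; a miss contributes 0.
        return sum(1.0 / r for r in first_ranks if r is not None) / len(first_ranks)

    def coverage(first_ranks, cutoff):
        # Percentage of questions whose first supporting document is within the cutoff.
        hits = sum(1 for r in first_ranks if r is not None and r <= cutoff)
        return 100.0 * hits / len(first_ranks)

    # Hypothetical ranks for four questions: hits at ranks 2, 1, and 12, plus one miss.
    ranks = [2, None, 1, 12]
    print(round(mean_reciprocal_rank(ranks), 3))                 # 0.396
    print([coverage(ranks, k) for k in (1, 5, 10, 20, 30, 50)])  # [25.0, 50.0, 50.0, 75.0, 75.0, 75.0]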
Proper names are another important type of definite noun phrase in context questions. I have mentioned in Chapter 5 that proper names should be given a certain discourse prominence in the processing, and the strategy that I use for the transition model emphasizes their significance. Now I would like to investigate how important proper names are when they are placed in the definiteness hierarchy. Intuitively, in information retrieval, it seems that proper names should be ranked higher than definite descriptions, since proper names are often named entities14 and named entity recognition has been widely used in QA processing. Therefore I switch the position of definite description with proper name in the definiteness hierarchy and run the same models once again, keeping all other elements the same. The modified hierarchy is shown below:

Pronoun > Proper name > Definite description > Indefinite > Other15

Table 6.7 shows the average MRR for the pronoun model, the extensive pronoun model and the combined model after the modification of the ranking. Figure 6.1 shows the performance comparison between the modified models and the original definiteness hierarchy based models. From Table 6.7 and Figure 6.1, we see that the performance of the modified models is worse than that of the original definiteness hierarchy based models on all the TREC data, especially so for the TREC 2005 data. This result indicates that the definiteness hierarchy correctly captures the discourse prominence of the noun phrases in the QA discourse, and the models based on it show consistent performance advantages over the modified version.

14 Named entities can be persons, locations, and organizations, as well as times, dates, percentages, money amounts, etc.
15 ">" in this hierarchy indicates that the NP type on the left side is more important than the one on the right side in processing QA discourse.

Table 6.7. MRR performance based on modified definite hierarchy in no-target situation
MRR            Baseline  ModPronoun  ModExtensivePronoun  ModCombined
TREC04 Incl*   0.188     0.270       0.275                0.274
TREC04 Excl**  0.212     0.304       0.310                0.309
TREC05 Incl    0.220     0.236       0.241                0.260
TREC05 Excl    0.240     0.257       0.263                0.284
Total Incl     0.217     0.249       0.254                0.266
Total Excl     0.240     0.275       0.281                0.293
* Include all the questions
** Exclude the questions that do not have answers

Figure 6.1. MRR performance comparison between the definiteness hierarchy based models and the modified version (Pronoun, ExtensivePronoun, and Combined versus ModPronoun, ModExtensivePronoun, and ModCombined on the 2004 Incl, 2005 Incl, and Total Incl data)
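The only difference between the two sets of models is the relative order of definite descriptions and proper names in the ranking used to pick antecedents. The sketch below is my own Python illustration of such a configurable ranking step (the type labels and the example candidates are assumptions, not the system's internal representation); swapping the two labels is the entire modification, everything else being held constant.

    # Minimal sketch: rank antecedent candidates by a configurable definiteness hierarchy.
    # NP types are assumed to have been assigned to the candidates beforehand.

    ORIGINAL_HIERARCHY = ["pronoun", "definite description", "proper name", "indefinite", "other"]
    MODIFIED_HIERARCHY = ["pronoun", "proper name", "definite description", "indefinite", "other"]

    def rank_candidates(candidates, hierarchy):
        # Sort (expression, np_type) pairs so that higher-ranked NP types come first;
        # ties keep their original order because Python's sort is stable.
        return sorted(candidates, key=lambda c: hierarchy.index(c[1]))

    # Hypothetical candidate entities from a preceding question.
    candidates = [("a museum", "indefinite"),
                  ("the Louvre Museum", "proper name"),
                  ("the collection", "definite description")]

    print(rank_candidates(candidates, ORIGINAL_HIERARCHY)[0][0])  # the collection
    print(rank_candidates(candidates, MODIFIED_HIERARCHY)[0][0])  # the Louvre Museum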
6.7.2 Evaluation for the target-given situation

Table 6.8 shows the average MRR for the models16 that are built for the original TREC data, whose targets are provided along with the question sets. Table 6.9 shows the coverage information of the models for the total TREC data on all questions. Figure 6.2 shows the MRR performance of all the models based on the assumption that the targets are provided for the context question sets; it shows the MRR for all the TREC questions, including the questions that do not have answers.

16 For the same reason, the performance of the definite model is not listed here.

Table 6.8. MRR performance based on definite hierarchy in target-given situation
MRR            Baseline  Pronoun  Combined  Target  CombinedTarget
TREC04 Incl*   0.188     0.244    0.249     0.314   0.315
TREC04 Excl**  0.212     0.275    0.280     0.354   0.355
TREC05 Incl    0.176     0.217    0.226     0.288   0.287
TREC05 Excl    0.192     0.236    0.246     0.314   0.313
Total Incl     0.180     0.227    0.235     0.298   0.298
Total Excl     0.199     0.251    0.259     0.329   0.329
* Include all the questions
** Exclude the questions that do not have answers

Table 6.9. Coverage performance on all TREC data* in target-given situation
Doc rank  Baseline  Pronoun  Combined  Target  CombinedTarget
1         11.0      15.0     15.0      19.0    19.0
5         25.0      30.0     31.0      41.0    41.0
10        31.0      38.0     39.0      51.0    51.0
20        36.0      45.0     47.0      59.0    59.0
30        39.0      49.0     51.0      63.0    63.0
50        44.0      55.0     57.0      70.0    69.0
* Total TREC data, i.e. including all the questions

Figure 6.2. MRR performance on all TREC data in target-given situation, including all the questions in the TREC 2004 and TREC 2005 data (Baseline, Pronoun, Combined, Target, and CombinedTarget on the 2004 Incl, 2005 Incl, and Total Incl data)

From Table 6.8, Table 6.9 and Figure 6.2, I have the following observations:

• The pronoun model performs better than the baseline, and it works better for the TREC 2004 data than for the TREC 2005 data. Meanwhile, the combined model shows a big performance improvement over the baseline for both the TREC 2004 and the TREC 2005 data.

• The target model and the combined target model perform better than the other models. This result supports the empirical practice of most TREC QA systems. As I have mentioned in Chapter 2, appending the target to each of the questions for query expansion was one of the major methods that TREC QA systems used in 2004 and 2005.

• The results also show that the performance of the combined target model is almost the same as that of the target model, as shown in Figure 6.2. It seems that the performance improvement converges even when I incorporate the target-appending approach with the extra processing of the pronouns and the definite descriptions. One possible reason could be that most pronouns and definite descriptions are actually resolved to the targets: out of the total 592 TREC questions, there are only 51 questions in which the pronouns or the definite descriptions are not resolved to the targets. In other words, these cases do not affect the performance in a significant way.

An interesting question arises immediately: does the target-appending method work best for data that is more diverse in terms of discourse entities? Obviously, a TREC question set is mostly about one target entity, sometimes about an event, and rarely about multiple entities. Because of the lack of such data, I reserve this question as one of my future research topics.

Under the two conditions (i.e. no-target and target-given), I have built different discourse models to process the mini QA discourses, mostly involving pronouns, definite descriptions and targets. From Figure 6.2, we see that the two models that explicitly involve the targets (i.e. the target model and the combined target model) show a significant performance improvement over the other models (i.e. the models that do not explicitly append the targets to each question) in the target-given situation. The results thus far have answered the questions that I set out to investigate. In summary, the conclusions are:

• The linguistic knowledge based models (the pronoun model and the combined model) improve the performance significantly in the no-target situation (see Table 6.5 and Table 6.6 in section 6.7.1).
In particular, the results also indicate that the definiteness hierarchy based pronoun resolution model works better than the centering-based pronoun resolution model, but not better than the centering-based transition model (see Table 5.6 in Chapter 5, section 5.5.1).

• Exactly as the hierarchy predicts, definite descriptions in QA discourses do have a more prominent status than proper names, so the processing devoted to them is justified (see Table 6.7 and Figure 6.1).

• The extra processing of pronouns and definite descriptions (i.e. the pronoun model and the combined model) pays off even in the target-given situation, as long as the target-appending strategy is not used (see Table 6.8 and Table 6.9). This means that the strategy of appending the whole previous question for query expansion is not as efficient as resolving the pronouns and the definites in the questions.

• The experiments show that target-appending is an efficient strategy that surpasses the other models (see Table 6.8 and Figure 6.2 in section 6.7.2).

The conclusions are meaningful because, on the one hand, the belief that linguistic knowledge is important in QA processing has been confirmed; in particular, the use of the definiteness hierarchy in processing the QA discourse has been shown to be efficient. On the other hand, if somehow a proper target could be identified from the feeding question, then it would be easy to build a simple processing model that only appends the target to each question. In practice, however, it is not easy to identify a target based only on the first question.

6.7.3 Evaluation based on the definite description types

In order to examine the effect that different definite description types have on the performance in the no-target situation, I conduct another evaluation based on my definite description classification. The questions are separated according to what type of definite description they contain. If a question contains more than one definite description, the question is grouped by the ranking illustrated below: direct anaphoric > bridging > discourse-new. That is to say, if a question has both bridging and discourse-new definite descriptions, it will be grouped with the questions that contain bridging definite descriptions, because bridging is ranked higher than discourse-new. Likewise, if a question has both direct anaphoric and bridging definite descriptions, it will be grouped with the questions that contain direct anaphoric definite descriptions. The motivation for the grouping is the intuition that a direct anaphoric definite is more familiar with respect to a discourse than a bridging definite, which in turn is more familiar than a discourse-new definite. Consider example (36).

(36) Q1: When was the Challenger space shuttle disaster?
Q2: How many members were in the crew of the Challenger?

There are two definite descriptions in (36Q2): the crew of the Challenger and the Challenger. The former is identified as bridging because its postmodifier Challenger has a mutual-containing relation with its antecedent Challenger in (36Q1). The latter, the Challenger, is identified as direct anaphoric because it head-matches with its antecedent Challenger in (36Q1). According to my grouping scheme, question (36Q2) is identified as a question containing a direct anaphoric definite description.
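The grouping decision can be stated compactly. The following minimal Python sketch is my own illustration, not the system's code (the labels and the per-question input format are assumptions): a question is assigned to the group of its highest-ranked definite description type, with direct anaphoric outranking bridging, which outranks discourse-new, and questions with no definite descriptions falling into a separate group.

    # Minimal sketch, assuming each question comes with the (possibly empty) list of
    # definite description types it contains, already annotated.

    GROUP_RANK = ["direct anaphoric", "bridging", "discourse-new"]  # highest rank first

    def question_group(dd_types):
        # Return the evaluation group for a question given its definite description types.
        for group in GROUP_RANK:
            if group in dd_types:
                return group
        return "other"  # no definite descriptions at all

    # Example (36Q2) contains a bridging definite (the crew of the Challenger) and a
    # direct anaphoric definite (the Challenger), so it falls into the direct anaphoric group.
    print(question_group(["bridging", "direct anaphoric"]))  # direct anaphoric
    print(question_group([]))                                # other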
Accordingly, I have four groups of questions: questions containing direct anaphoric definite descriptions, questions containing bridging definite descriptions, questions containing discourse-new definite descriptions, and questions containing no definite descriptions at all (represented as other). Table 6.10 shows the question distribution for the TREC 2004 and TREC 2005 data based on such a grouping. Note that the TREC 2005 data has more questions that involve direct anaphoric definite descriptions (17.1%) than the TREC 2004 data (only 8.3%). Also, there are more questions in the TREC 2005 data involving bridging (9.1%) than in the TREC 2004 data (only 2.6%). More than a quarter of the questions in both the TREC 2004 (25.2%) and the TREC 2005 (27.1%) data contain discourse-new definite descriptions. The proportion of questions that do not contain any definite descriptions decreases from the TREC 2004 data (63.9%) to the TREC 2005 data (46.7%).

Table 6.10. Question distribution based on definite description types
Question number   direct anaphoric  bridging   discourse-new  other        Total
Q in TREC 2004    19 (8.3%)         6 (2.6%)   58 (25.2%)     147 (63.9%)  230
Q in TREC 2005    62 (17.1%)        33 (9.1%)  98 (27.1%)     169 (46.7%)  362
Q in TREC Total   81 (13.7%)        39 (6.6%)  156 (26.3%)    316 (53.4%)  592

Figure 6.3. Overall comparison of the four models based on automated processing of antecedents (Baseline, Pronoun, ExtensivePronoun, and Combined on the 2004 Incl, 2005 Incl, and Total Incl data)

For the convenience of the discussion that follows, I now present the results from Table 6.5 using Figure 6.3. Figure 6.3 shows the MRR performance on all the questions in the no-target situation. Again, in the no-target situation, the combined model works the best for the total TREC data and the TREC 2005 data. The extensive pronoun model works the best for the TREC 2004 data.

Figure 6.4 shows the MRR performance for the questions that contain direct anaphoric definite descriptions. It shows that the baseline works the best for the TREC 2004 data (where only 8.3% of the questions contain direct anaphoric definite descriptions), while the combined model works best for the TREC 2005 data (17.1%) and the total TREC data (13.7%). The result indicates that the combined model shows more of an advantage on data that has more direct anaphoric definite descriptions. In other words, it is effective to implement the strategy of including the antecedent of a direct anaphoric definite description for query expansion.

Figure 6.4. MRR performance on questions with direct anaphoric definite description in no-target situation (Baseline, Pronoun, ExtensivePronoun, and Combined)

Figure 6.5 shows the MRR performance for the questions that contain bridging definite descriptions. The combined model outperforms the other models for the TREC 2005 data and the total TREC data. This result suggests that definite description resolution is a very important element for improving the performance on the questions that contain bridging definite descriptions. For the TREC 2004 data, both the pronoun model and the extensive pronoun model work better than the other models. This is similar to the result on the overall data shown in Figure 6.3. In other words, for data that has a higher percentage of pronouns, the advantage of the pronoun resolution based models is bigger than the advantage of the definite description resolution based models.

Figure 6.5. MRR performance on questions with bridging definite description in no-target situation (Baseline, Pronoun, ExtensivePronoun, and Combined)
Figure 6.6 shows the MRR performance for the questions that contain discourse-new definite descriptions. It also shows that the pronoun model, the extensive pronoun model, and the combined model perform the same for the TREC 2004 data. This can be explained by the fact that all three models resolve pronouns, and this element is necessary and effective for data that has a higher percentage of pronouns. For the questions that contain discourse-new definite descriptions, the baseline model works the best for the TREC 2005 data and the total TREC data. This result is not surprising because in my implementation I do not process the discourse-new definite descriptions for query expansion. The baseline, however, by using the whole previous question for query expansion, provides more context information than the other models.

Figure 6.6. MRR performance on questions with discourse-new definite description in no-target situation (Baseline, Pronoun, ExtensivePronoun, and Combined)

Figure 6.7 shows the MRR performance for the questions that do not contain any definite descriptions. The extensive pronoun model and the combined model perform exactly the same. This is because the definite description resolution element in the combined model does not take any effect in the absence of definite descriptions. Besides, both the extensive pronoun model and the combined model perform better than the other models for all the datasets. This result indicates that the pronoun resolution strategy used in the extensive pronoun model is efficient, that is, to resolve otherwise unresolvable pronoun references to the highest ranked entities (according to the definiteness hierarchy) in the feeding questions.

Figure 6.7. MRR performance on questions without any definite descriptions in no-target situation (Baseline, Pronoun, ExtensivePronoun, and Combined)

In summary, I separate the questions by definite description type in this evaluation and come to the following conclusions: 1) for the direct anaphoric question group, the combined model outperforms the other models for the TREC 2005 data and the total TREC data, while for the TREC 2004 data, which has a very low percentage of direct anaphoric uses, the baseline works the best; 2) for the bridging question group, the combined model works better than the other models for the TREC 2005 data and the total TREC data, while for the TREC 2004 data, both the pronoun model and the extensive pronoun model work better than the other models; 3) for the discourse-new question group, the baseline works the best for the TREC 2005 data and the total TREC data, while for the TREC 2004 data, all three models that involve pronoun resolution work better than the baseline; 4) for the other question group, both the extensive pronoun model and the combined model work the best for all the datasets.

6.8 Summary

This chapter discusses how to employ the linguistic knowledge of the definiteness hierarchy in processing mini QA discourses. I have introduced the definiteness hierarchy of definite descriptions, the Familiarity theory, and briefly the Uniqueness theory.
In order to classify the definite descriptions appearing in context questions, I then have reviewed some major classifications in literature. Based on those work, three classes of definite descriptions are presented: discourse-new, direct anaphoric and bridging. In the implementation part, I have separated the experiments into two groups: one for the situation where no explicit targets are provided, and the other has explicit targets. Different models have been proposed for each group. In the no—target situation, the influence of the extra processing based on pronoun res- olution and definite description resolution has been investigated. The conclusion is that the combined model which incorporates both efforts performs the best. Another evalua- tion is conducted based on definite description types. Questions are separated into groups to isolate the influence of different models. The conclusion is that, although definite de- scription resolution in the definite model does not show advantage over the baseline on 155 the overall data, it does show big performance improvement on the questions that either have direct anaphoric definite descriptions or bridging definite descriptions. Also, the results also show the need to implement more strategy to process discourse-new definite descriptions. For now, the baseline seems to work the best for the discourse-new question group. Finally, the pronoun resolution is necessary for handling the data that has a high percentage of pronouns. In the target-given situation, the experiments have shown that the feeding questions are not as important as they are in the no-target situation. Moreover, the target-appending strategy was shown to be very efficient. 156 CHAPTER 7 Conclusion 7. 1 Summary In context question answering, questions are asked in a sequence. The TREC QA tasks provide targets as to indicate what all the questions in the sequence are about. The TREC 2005 QA systems employed approaches such as target-appending to relate the questions with the target. However, the questions in the sequence are related to each other in a certain way. This seems to be intuitive, however little work has been done to examine the relationship between the questions, except for the effort of resolving pronoun references. This is particularly true for the TREC QA tracks, where most participating systems did not employ discourse processing for the question series. This thesis is therefore motivated to investigate the role of discourse processing and its implication on query expansion for a sequence of questions. Chapter 1 introduces the context question problem that this dissertation aims to ad- dress. Chapter 2, Chapter 3, and Chapter 4 have introduced the background material for the study. The TREC QA tasks which inspire the current research are introduced in Chapter 2. State-of-the-art context question answering techniques used in the TREC 2005 context QA systems are also presented. Chapter 3 has presented the hypotheses for the discourse modeling presented in the dissertation. Context question sequence is treated as a mini discourse where it observes similar characteristics as other types of discourse. 157 A two-level representation of the discourse is hypothesized for the purpose of providing query terms for query expansion at the stage of processing context questions. The lexi- cal cohesion representation is targeted to capture lexical relationships for the expressions that correspond to discourse entities. 
The discourse coherence level is to capture the topic change between questions. The idea is that a discourse model that captures both levels of information will provide adequate context information to process context questions. Centering Theory is adopted in the work reported in Chapter 5, because it provides an excellent theoretical framework upon which a two-level discourse modeling is con- ducted. The main objective of Chapter 5 is to introduce centering-based models that relate discourse entities by the operations of centers to process context questions. At lexical cohesion level, questions are often linked to each other by such lexical ties as pronoun references. At discourse coherence level, a QA discourse is coherent in terms of their t0picality relationships. Adjacent questions may have the same topic, similar topic or different topics. These topics are believed to be represented by the backward looking center of each question. Therefore in Chapter 5, three discourse models were pro- posed for discourse processing based on these two levels of speculations, especially driven by Centering Theory. The reference model aims to capture the lexical relation between questions, to be more specific, to resolve the pronoun references in the questions. The forward model aims to provide more context information on top of the pronoun resolution by adding the forward looking centers from the previous question. The transition model is to selectively pick context information from the previous question based on the topic transition between two adjacent questions. The results show that the transition model based on discourse transitions and centers can significantly improve the performance of document retrieval. The results from this chapter emphasize the need to properly capture more semantic information, rather than only focusing on lexical links, which only reflect discourse cohesion relations. Chapter 6 looks at the context question problem based on another two—level representa- tion of semantic information in the QA discourse. It aims to study definite descriptions in context questions due to the frequent occurrences of the definite descriptions in the TREC 158 2005 QA test sets and limited research on this topic. The lower level representation is similar to the one used in Chapter 5. It intends to capture the lexical relations between expressions for discourse entities, pronoun references and definite description references in particular. The other level of representation is hypothesized as follows: the more familiar an entity is to the discourse, the more definite an NP form will be used to refer to the en- tity. On the familiarity assumption, discourse entities are related to each other at a higher level. In other words, the linguistic expressions used to refer to the entities and their re— lated entities would have different definiteness statuses in the QA discourse. Definiteness hierarchy is therefore employed to relate these entities together. Based on a definiteness hierarchy modified from Abbott 2004[89], different discourse models are proposed, mainly focusing on the processing of the pronouns and the definite descriptions. The typical uses of the definite descriptions in the context questions are classified into discourse-new, direct anaphoric and bridging. In general, it is found that the extra processing based on the definiteness hierarchy improves the performance significantly for the situation where no targets are explicitly provided to the question sets. 
Chapter 6 looks at the context question problem through another two-level representation of semantic information in the QA discourse. It studies definite descriptions in context questions, motivated by their frequent occurrence in the TREC 2005 QA test sets and by the limited research on this topic. The lower level of the representation is similar to the one used in Chapter 5: it captures the lexical relations between expressions for discourse entities, pronoun references and definite description references in particular. The other level of the representation is hypothesized as follows: the more familiar an entity is in the discourse, the more definite the NP form used to refer to it. On this familiarity assumption, discourse entities are related to each other at a higher level; in other words, the linguistic expressions used to refer to the entities and their related entities have different definiteness statuses in the QA discourse. A definiteness hierarchy is therefore employed to relate these entities together. Based on a definiteness hierarchy modified from Abbott 2004 [89], different discourse models are proposed, mainly focusing on the processing of pronouns and definite descriptions. The typical uses of definite descriptions in the context questions are classified into discourse-new, direct anaphoric and bridging. In general, it is found that the extra processing based on the definiteness hierarchy improves performance significantly in the situation where no targets are explicitly provided with the question sets. In the target-given situation, on the other hand, the target-appending strategy is effective compared with the other processing models. Moreover, the definiteness hierarchy based pronoun resolution model is shown to perform better than the anaphora model of Chapter 5 (the state-of-the-art NLP technology, however, is sufficient to implement both models).

The results in Chapter 5 and Chapter 6 indicate that QA discourse is coherent in that discourse entities are related to each other not only at a lexical level but also at a discourse level where more semantic information can be captured. The results in Chapter 6 also indicate the need to develop discourse processing models that can successfully predict a target. However, this seems rather unrealistic at present. The TREC test sets were prepared specifically for the QA task, in that every question was explicitly about the target. In reality, however, users may change the target gradually over the questions (resulting, for example, in shift transitions in the transition model). For my future study, it will be interesting to examine two things: 1) what linguistic cues in context questions indicate a topic change; and 2) how to build a discourse model to capture this topic change. Once these two questions are answered, it is expected that a discourse model that makes use of a system-detected target would outperform the models proposed for the no-target situation.

The major contribution of the research presented in this dissertation is that it provides working discourse models for the context question problem based on linguistic knowledge. Depending on the data characteristics (e.g., the percentage of pronouns, the distribution of definite descriptions, etc.), different models could be incorporated into state-of-the-art QA systems.
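A minimal sketch of how such data-driven model selection might be wired into a QA front end is given below; the thresholds, the profile fields and the model labels are hypothetical placeholders rather than settings derived from the experiments.

# Hypothetical sketch of the point above: choose a discourse model per question set
# based on simple data characteristics. Thresholds and labels are illustrative only.

from dataclasses import dataclass

@dataclass
class DataProfile:
    pronoun_rate: float        # fraction of questions containing a pronoun
    definite_desc_rate: float  # fraction containing a definite description
    target_given: bool         # whether TREC-style targets accompany the set

def choose_model(profile: DataProfile) -> str:
    if profile.target_given:
        return "target-appending"       # effective when targets are supplied
    if profile.definite_desc_rate > 0.3:
        return "combined (pronoun + definite description resolution)"
    if profile.pronoun_rate > 0.3:
        return "pronoun resolution"
    return "transition model"           # fall back to centering-based expansion

if __name__ == "__main__":
    print(choose_model(DataProfile(pronoun_rate=0.5, definite_desc_rate=0.2, target_given=False)))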
There are also other dimensions along which my future work will be pursued. For example, how to use linguistic knowledge and existing linguistic theories to process event-based context questions has become an interesting topic. I will also extend context question answering to fully interactive question answering and investigate the role of discourse processing in that new setting.

APPENDIX A

User data collection

A.1 Instruction

Assume there is an intelligent agent who understands natural language (English) quite well. This agent has read thousands of web pages and can answer your natural language questions by searching through those web pages. For example, if a user is interested in finding out information about volcanoes and eruption history, he/she may ask the following questions in a coherent manner (i.e., all questions asked are related and connected and serve a particular information goal), and the agent will then find answers to each of these questions.

Q1. What's the name of the volcano that destroyed the ancient city of Pompeii?
Q2. When did it happen?
Q3. How many people were killed?
Q4. Any pictures?
Q5. Where is Mount Rainier?

Now suppose you are interested in finding out some information about "Presidential Debates 2004" or "Tom Cruise". You want to ask the agent a sequence of coherent questions, subject to the following constraints:

1. The answer to each question must be found in the attached articles (if your question is indexed Qij, please underline your answer and mark it Aij in the articles).
2. Each paragraph in the articles can answer no more than one question.
3. Ask at least 6 consecutive and coherent questions.
4. Any question type is OK (e.g., what, when, where, how many, yes/no, why, etc.).

Please write down your questions and mark their answers:

Presidential Debates 2004
Q11
Q12
Q13
Q14
Q15
Q16
Q17 (optional)
Q18 (optional)

Read your questions carefully. Do those questions make sense (i.e., would you have asked the same questions if you were talking to a human agent)? What information goal did you have when you asked those questions? Are all the questions centered around this information goal? Refine your questions if necessary.

My information goal is:

A.2 Documents collection on Presidential Debates 2004

Presidential Candidates Conclude 2004 Debates

- President George W. Bush and Senator John F. Kerry faced off three times during the 2004 election: September 30, October 8, and October 13.

September 30, the University of Miami continues its exciting lineup of debate-related events

- The September 30 presidential debate at the University of Miami, which attracted a record viewing audience of 63 million people and put the University in the international spotlight, is now history, but the series of debate-related activities Celebrating American Democracy and Diversity continues.
- The debate covered Foreign Policy & Homeland Security primarily, which is believed to be the single most important issue to voters in this election.
- In the first debate, President Bush's scowls were thought to play a large part in viewers' belief that Kerry won, based on polls by CBS News and others.

Oct 8, Washington University ready for Debate

- Although voters cite Iraq as a major concern, the economy consistently ranks at the top.
- Though Mr. Bush was more composed than in last week's first presidential debate, all agreed his tone was sometimes antagonistic and he again appeared uncomfortable being challenged. Kerry, on the other hand, was viewed as measured and articulate. However, none of the experts touted a clear winner.
- But what is clear after two Bush-Kerry debates is that the candidates don't care for each other. They may shake hands, but resentment runs deep. Kerry infers Mr. Bush is simplistic to the point of dishonesty; Mr. Bush infers Kerry is complicated to the point of ineffectiveness.
- Thomson thought the debate was "much more even" than the first one. Like Thurber, Thompson said, "I suspect that this is not a debate in which there will be a substantial change one way or another."

Oct 13, ASU shines as debate host:

- Arizona State University hosts the third and final debate between President Bush and Sen. John Kerry Wednesday. The debate, to focus on domestic policy, is the only meeting between the Republican and Democratic candidates to be held in the Far West.
- More than 2,500 local, national, and international media are leaving Arizona State University with a positive and lasting impression following their experience covering the final 2004 Presidential Debate October 13 at Gammage Auditorium.

Oct 13, Bush and Kerry Clash Over Jobs and Taxes at Last Debate

- Sen. John Kerry said Wednesday night that President Bush bears responsibility for a misguided war in Iraq, lost jobs at home and mounting millions without health care. Bush tagged his Democratic rival as a lifelong liberal bent on raising taxes and government spending. Kerry and the president also debated abortion, gay rights, immigration and more in a 90-minute debate that underscored deep differences only 19 campaign days before Election Day.
- This debate was similar in format to the first: the two rivals standing behind identical lecterns set precisely 10 feet apart. Bush was on better behavior, though, and there was no grimacing and scowling this time when it was Kerry's turn to speak.

Please write down your questions and mark their answers:

Tom Cruise
Q21
Q22
Q23
Q24
Q25
Q26
Q27 (optional)
Q28 (optional)

Read your questions carefully. Do those questions make sense (i.e., would you have asked the same questions if you were talking to a human agent)? What information goal did you have when you asked those questions? Are all the questions centered around this information goal? Refine your questions if necessary.

My information goal is:

A.3 Documents collection on Tom Cruise

Tom Cruise

- That million megawatt smile has helped Tom Cruise reach the pinnacle of his profession and stay there. He's a down-to-earth movie star with huge box-office hits under his belt such as Top Gun and Mission: Impossible.
- Thomas Cruise Mapother IV was born on the 3rd of July, 1962 (eerily similar to his film Born on the 4th of July), in Syracuse, New York. He was the only boy of four children. Since his father was an electrical engineer, he must have inherited his love for acting from his mother, who was a teacher.
- His acting career really began because he injured his knee in high school and was forced to quit the amateur wrestling team.
- His popularity took a beating in movies like All the Right Moves in 1983, followed by Legend in 1985. Cruise's career began to solidify during his signature hit of the 1980s, Top Gun.
- He proved his dramatic talents in the 1988 drama Rain Man, where he co-starred with Oscar-winner Dustin Hoffman. Oliver Stone's Born on the Fourth of July (1989) earned him a Best Actor Oscar nomination for his hard-hitting portrayal of anti-war activist Ron Kovic.
- Cruise fell short for his role in Far and Away (1992) with co-star Nicole Kidman, who he later married in 1990 after sharing the screen once again in Days of Thunder.
- Cruise and his wife of 10 years, Nicole Kidman, filed for divorce in February 2001.
- He has since been seen with Spanish beauty Penelope Cruz, his Vanilla Sky co-star. Keeping with the science-fiction theme of Vanilla Sky, Cruise's next starring role is Steven Spielberg's Minority Report. A third Mission Impossible is rumored to be in the works.

Nicole Kidman

- She starred in Days of Thunder and the kinky thriller Eyes Wide Shut, opposite then husband, Tom Cruise.
- Many would be fooled into thinking that Nicole was born down under because of her noticeable Australian accent, but they would be wrong. Nicole Mary Kidman was born in Honolulu, Hawaii, on June 20, 1967. The Kidmans lived in the U.S. because Nicole's biochemist father was conducting research on breast cancer.
- Once Nicole was 4 years old, her family moved to Australia, where Nicole and her younger sister were raised under strict rules. Anthony and Janelle Kidman were extremely politically active, and instilled certain values into their daughters. The Kidman girls were even required to discuss a political issue or current affair with their parents at the end of every day.

Penelope Cruz

- There's a reason why Penelope Cruz is nicknamed the "Spanish Enchantress" and the "Madonna of Madrid" outside of her native land. Gorgeous, sultry and gifted, she has paid her dues and climbed her way from supporting actress to feature player.
- Penelope Cruz Sanchez was born in Madrid, Spain, on the 28th of April 1974. Raised along with brother Eduardo and sister Monica in Madrid, Penelope was always fond of the arts, particularly ballet and jazz. Her passion for dance led to her decision to abandon traditional schooling, focusing her time and energy instead on the graceful art of ballet.

Please write down your questions and mark their answers:

Pompeii
Q31
Q32
Q33
Q34
Q35
Q36
Q37 (optional)
Q38 (optional)

Read your questions carefully. Do those questions make sense (i.e., would you have asked the same questions if you were talking to a human agent)? What information goal did you have when you asked those questions? Are all the questions centered around this information goal? Refine your questions if necessary.

My information goal is:

A.4 Documents collection on Pompeii

Pompeii

- On August 23, 79 AD, Pompeii looked like any other busy, prosperous city. People were moving about, trading goods, news, and friendly talk.
- Three days later, on August 26, all of these sounds had fallen silent, and the place itself had vanished. Almost nothing was seen of Pompeii for more than 1500 years. Now, more than 1900 years later, we are learning more and more about the last days of Pompeii.
- What happened to Pompeii preserved a treasury of information about life in the ancient Roman Empire. You can begin your exploration of the mystery of Pompeii and the life of people in the Roman Empire by clicking on enter below. Once you see the map, you can choose any place to start, but Vesuvius might make the best beginning!
- 62, February 5: A major earthquake almost destroys the cities of Pompeii and Herculaneum. Many buildings are damaged or destroyed, and the aqueducts that bring water into Pompeii are ruined.
- 79, August 24 and 25: Vesuvius erupts, burying Pompeii in ash and cinders and covering Herculaneum in mud as hard as rock. Ash, rock and cinders fall over a large area, damaging houses in many faraway places and blackening the sky over what is now known as Naples for three days.
- 79, late August: People who escaped from Pompeii return to the city to try to find their houses and possessions. Many people dig shafts down into the town and recover some of their valuables. This effort is finally abandoned.
- 202-533: Vesuvius erupts at least five times.
- 1594: Workers digging a tunnel to supply water to a nearby village find a stone that says decurio Pompeiis. The city has been so long forgotten that most people think Pompeiis refers to a famous Roman ruler named Pompey.
- 1631: Vesuvius erupts again. This is by far the worst eruption since 79 AD. Lava flows from the volcano in seven different streams, destroying nearly all of the towns below.
- 1707: Prince d'Elboeuf, hearing of some interesting finds during the digging of a well, starts digging for treasure. At this time he does not know the name of the city he is digging up.
- 1860: Giuseppe Fiorelli is appointed as director of the Pompeiian dig. Fiorelli wants to uncover the entire Roman city. Up until this time, most people have been digging single holes or opening up small areas to look for treasure. Fiorelli wants to share the riches of the lost city with the entire world.
- 1860-1875: Under Fiorelli's watchful eye, digging is continued. Modern archeological techniques and recording systems begin to be used to catalog and identify all of the objects uncovered.
- 1875-Present: Digging continues under the guidance of many different directors.
Many of the objects uncovered are placed in a museum in nearby Naples, where they show the world of today much about ancient Roman times.
- 1913-1944: Vesuvius erupts several times, finishing the eruption cycle that started in 1631.
- Today all we can see of Pompeii are ruins. But the ruins tell us many things about the ancient world. The disaster that destroyed the city of Pompeii in 79 AD preserved forever a treasury of the past. Careful excavation and exploration of the ruins continues to add to our knowledge of ancient Roman times.

Please write down your questions and mark their answers:

Hawaii
Q41
Q42
Q43
Q44
Q45
Q46
Q47 (optional)
Q48 (optional)

Read your questions carefully. Do those questions make sense (i.e., would you have asked the same questions if you were talking to a human agent)? What information goal did you have when you asked those questions? Are all the questions centered around this information goal? Refine your questions if necessary.

My information goal is:

A.5 Documents collection on Hawaii

BIOLOGY

- The Hawaiian Islands have a wide variety of plant, marine and animal life. Vegetation zones include: coastal, dryland forest, mixed open forest, rain forest, subalpine and alpine. More than 90 percent of the native plants and animals living in Hawaii are found nowhere else in the world, and a greater variety of fish exist in Hawaiian waters than elsewhere. The humuhumunukunukuapuaa is the unofficial state fish.
- Hawaii is sometimes called the Endangered Species Capital of the World. At least one third of all the endangered species in the United States are found in Hawaii, including the Nene Goose (official state bird), the Humpback Whale (official state marine mammal), the Pacific Green Sea Turtle and the Pueo (Hawaiian owl). The exotic species, man, poses a greater threat than nature to Hawaii's native flora and fauna.

CLIMATE

- The Hawaiian Islands have only two seasons: "summer" between May and October and "winter" between October and April.
- The climate is subtropical, with a normal annual temperature of 77F, making these islands "the peacefullest, restfullest, balmiest, dreamiest haven of refuge for a worn and weary spirit the surface of the earth can offer" (Mark Twain).

CULTURAL HISTORY

- The Hawaiian Islands are stepping-stones linking East to West. Here Polynesian sensuality, American pragmatism, and Oriental exoticism weave a tapestry of cultural extremes.
- Hawaii's multi-cultural society has had major immigration from: Polynesia, United States, China, Japan, Portugal, Puerto Rico, Korea, Philippines.

ECONOMY

- Hawaii's cost of living is one of America's highest, its per capita personal income below average. In fact, Hawaii's cost of living for a family of four is estimated to be approximately 27% higher than the U.S. average for a comparable standard of living. In 1999, Hawaii's average per capita personal income of $27,544 was 3.5% below the U.S. average: the price of living in Paradise!
- Hawaii's major sources of annual income include (1998/1999): Tourism - $10.3 billion, Federal Defense Spending - $4.2 billion, Sugar - $133.1 million, Pineapple - $145.1 million.
- The 1990s have been the worst decade in Hawaii's economic history since World War II due, in large measure, to the decline in tourism from the East and the demise of the sugar and pineapple industries.
- To meet the challenges of the 21st Century, Hawaii is working to diversify its economy with a focus on industries such as science and technology, health and wellness tourism, diversified agriculture, ocean research and development, and film and television production.

GEOGRAPHY

- Hawaii is the most remote island chain in the world, over 2,000 miles from the nearest landfall. Distance makes for splendid isolation: these Polynesian islands are removed from all else but one another.
- Hawaii consists of eight major islands plus 124 minor islands, reefs and shoals, strung like a necklace across the Pacific for over 1,500 miles. The eight major islands (which make up over 99% of the total land area) are Oahu, Maui, Hawaii (known as the Big Island), Kauai, Molokai, Lanai, Kahoolawe (uninhabited) and Niihau (privately owned).

BIBLIOGRAPHY

[1] S. E. Brennan, M. Friedman, and C. Pollard, "A centering approach to pronouns," in ACL '87, (Stanford, CA), pp. 155-162, 1987.

[2] R. Geluykens, "Studies in discourse and grammar," in From Discourse Process to Grammatical Construction: On Left-Dislocation in English, vol. 1, Amsterdam/Philadelphia: John Benjamins Publishing Company, 1992.

[3] D. Crystal, Introducing Linguistics. London: Penguin, 1992.

[4] I. Roberts and R. Gaizauskas, "Evaluating passage retrieval approaches for question answering," in Proceedings of the 26th European Conference on Information Retrieval, 2004.

[5] E. D. Liddy, A. R. Diekema, and O. Yilmazel, "Context-based question answering evaluation," in Proceedings of the 27th Annual ACM-SIGIR Conference, (Sheffield, England), 2004.

[6] E. Voorhees, "Overview of the TREC 2001 question answering track," in Proceedings of TREC, (Gaithersburg, MD), November 13-16, 2001.

[7] E. Voorhees, "Overview of the TREC 2004 question answering track," in Proceedings of the 13th Text REtrieval Conference (TREC 2004), (Gaithersburg, MD), 2004.

[8] M. Wu, S. Duan, M. Shaikh, S. Small, and T. Strzalkowski, "ILQUA - an IE-driven question answering system," in Proceedings of the 14th Text REtrieval Conference (TREC 2005), (Gaithersburg, MD), 2005.

[9] K. Collins-Thompson, J. Callan, E. Terra, and C. L. Clarke, "The effect of document retrieval quality on factoid question answering performance," in SIGIR, 2004.

[10] K. Ahn, J. Bos, J. R. Curran, D. Kor, M. Nissim, and B. Webber, "Question answering with QED at TREC-2005," in Proceedings of the 14th Text REtrieval Conference (TREC 2005), (Gaithersburg, MD), 2005.

[11] D. Ferres, S. Kanaan, D. Dominguez-Sal, E. Gonzalez, A. Ageno, M. Fuentes, H. Rodriguez, M. Surdeanu, and J. Turmo, "TALP-UPC at TREC 2005: experiments using a voting scheme among three heterogeneous QA systems," in Proceedings of the 14th Text REtrieval Conference (TREC 2005), (Gaithersburg, MD), 2005.

[12] D. Jurafsky and J. H. Martin, Speech and Language Processing. NJ: Prentice Hall, 2000.

[13] M. Mulcahy, K. White, I. Gabbay, and A. O'Gorman, "Question answering using the DLT system at TREC 2005," in Proceedings of the 14th Text REtrieval Conference (TREC 2005), (Gaithersburg, MD), 2005.

[14] L. Azzopardi, K. Balog, and M. de Rijke, "Language modeling approaches for enterprise tasks," in Proceedings of the 14th Text REtrieval Conference (TREC 2005), (Gaithersburg, MD), 2005.

[15] P. Schone, G. Ciany, R. Cutts, P. McNamee, J. Mayfield, and T. Smith, "QACTIS-based question answering at TREC-2005," in Proceedings of the 14th Text REtrieval Conference (TREC 2005), (Gaithersburg, MD), 2005.
[16] R. Gaizauskas, M. A. Greenwood, H. Harkema, M. Hepple, H. Saggion, and A. Sanka, "The University of Sheffield's TREC 2005 Q&A experiments," in Proceedings of the 14th Text REtrieval Conference (TREC 2005), 2005.

[17] T. Abou-Assaleh, N. Cercone, J. Doyle, V. Keselj, and C. Whidden, "DalTREC 2005 QA system Jellyfish: Mark-and-match approach to question answering," in Proceedings of the 14th Text REtrieval Conference (TREC 2005), (Gaithersburg, MD), 2005.

[18] S. Harabagiu, D. Moldovan, C. Clark, M. Bowden, A. Hickl, and P. Wang, "Employing two question answering systems in TREC-2005," in Proceedings of the 14th Text REtrieval Conference (TREC 2005), 2005.

[19] D. Ahn, S. Fissaha, V. Jijkoun, K. Müller, M. de Rijke, and E. T. K. Sang, "Towards a multi-stream question answering-as-XML-retrieval strategy," in Proceedings of the 14th Text REtrieval Conference (TREC 2005), 2005.

[20] S. Harabagiu, D. Moldovan, M. Pasca, M. Surdeanu, R. Mihalcea, R. Girju, V. Rus, F. Lacatusu, P. Morarescu, and R. Bunescu, "Answering complex, list and context questions with LCC's question-answering server," in The Tenth Text REtrieval Conference (TREC-2001) (E. Voorhees and D. K. Harman, eds.), pp. 355-361, Gaithersburg, MD: NIST Special Publication, 2001.

[21] M. Hoey, On the Surface of Discourse. London: Allen and Unwin, 1983.

[22] D. Nunan, Introducing Discourse Analysis. Australia: Penguin English, 1993.

[23] B. J. Grosz and C. Sidner, "Attention, intention, and the structure of discourse," Computational Linguistics, vol. 12, no. 3, pp. 175-204, 1986.

[24] H. P. Grice, "Logic and conversation," in Syntax and Semantics (P. Cole and J. Morgan, eds.), pp. 41-58, Academic Press, 1975.

[25] J. R. Hobbs, "On the coherence and structure of discourse," Tech. Rep. CSLI-85-37, Center for the Study of Language and Information, Stanford University, 1985.

[26] W. C. Mann and S. A. Thompson, "Rhetorical structure theory: A theory of text organization," Tech. Rep. ISI/RS-87-190, Information Sciences Institute, University of Southern California, 1987.

[27] J. Y. Chai and R. Jin, "Discourse status for context questions," in Proceedings of the HLT-NAACL 2004 Workshop on Pragmatics in Question Answering, (Boston, MA), pp. 23-30, ACL, May 2004.

[28] H. Kamp and U. Reyle, From Discourse to Logic: Introduction to Modeltheoretic Semantics of Natural Language, Formal Logic, and Discourse Representation Theory. Kluwer, Dordrecht, 1993.

[29] B. J. Grosz, A. K. Joshi, and S. Weinstein, "Centering: a framework for modeling the local coherence of discourse," Computational Linguistics, vol. 21, no. 2, pp. 203-225, 1995.

[30] J. Hobbs, "Coherence and coreference," Cognitive Science, vol. 3, pp. 67-90, 1979.

[31] J. R. Hobbs, M. Stickel, D. Appelt, and P. Martin, "Interpretation as abduction," Tech. Rep. Note 499, AI Center, SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025, December 1990.

[32] D. Marcu, The Theory and Practice of Discourse Parsing and Summarization. Cambridge, Massachusetts; London, England: The MIT Press, 2000.

[33] D. Marcu, "A surface-based approach to identifying discourse markers and elementary textual units in unrestricted texts," in The COLING/ACL'98 Workshop on Discourse Relations and Discourse Markers, (Montreal, Canada), pp. 1-7, August 1998.
[34] A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. V. Ess-Dykema, and M. Meteer, "Dialogue act modeling for automatic tagging and recognition of conversational speech," Computational Linguistics, vol. 26, no. 3, pp. 339-373, 2000.

[35] J. L. Austin, How to Do Things with Words. Cambridge, MA: Harvard University Press, 2nd ed., 1962.

[36] A. Kehler, "Discourse coherence," in Handbook of Pragmatics (L. R. Horn and G. Ward, eds.), Basil Blackwell, 2004.

[37] S. Small, T. Liu, N. Shimizu, and T. Strzalkowski, "HITIQA: an interactive question answering system: a preliminary report," in Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering, 2003.

[38] S. Harabagiu, A. Hickl, J. Lehmann, and D. Moldovan, "Experiments with interactive question-answering," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), (Ann Arbor, MI), pp. 205-214, 2005.

[39] M. A. K. Halliday and R. Hasan, Cohesion in English. London: Longman, 1976.

[40] M. Pickering and S. Garrod, "Routinization in the interactive-alignment model of dialogue." Unpublished manuscript, 2004.

[41] H. H. Clark and D. Wilkes-Gibbs, "Referring as a collaborative process," Cognition, vol. 22, pp. 1-39, 1986.

[42] S. E. Brennan, "Lexical entrainment in spontaneous dialog," in Proceedings of the International Symposium on Spoken Dialogue (ISSD-96), (Philadelphia, PA), pp. 41-44, 1996.

[43] D. Lewis, Convention: A Philosophical Study. Basil Blackwell/Harvard University Press, 1969.

[44] H. Branigan, M. Pickering, and A. Cleland, "Syntactic coordination in dialogue," Cognition, vol. 75, pp. B13-B25, 2000.

[45] B. J. Grosz, A. Joshi, and S. Weinstein, "Towards a computational theory of discourse interpretation." Unpublished manuscript, 1986.

[46] B. J. Grosz, "The representation and use of focus in a system for understanding dialogs," in Proceedings of the 5th International Joint Conference on Artificial Intelligence, (Cambridge, Mass.), 1977.

[47] B. Grosz, "The representation and use of focus in dialogue understanding," Tech. Rep., SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025, 1977.

[48] B. J. Grosz, "Discourse analysis," in Understanding Spoken Language (D. Walker, ed.), pp. 235-268, Elsevier North-Holland, 1978.

[49] B. J. Grosz, "Focusing in dialog," in Theoretical Issues in Natural Language Processing-2, pp. 96-103, University of Illinois at Urbana-Champaign, Champaign, Illinois, 1978.

[50] B. J. Grosz, "Focusing and description in natural language dialogue," in Elements of Discourse Understanding (A. Joshi, B. Webber, and I. Sag, eds.), pp. 85-105, Cambridge University Press, 1981.

[51] C. L. Sidner, Towards a Computational Theory of Definite Anaphora Comprehension in English Discourse. Technical Report 537, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, June 1979.

[52] C. L. Sidner, "Focusing for interpretation of pronouns," American Journal of Computational Linguistics, vol. 7, no. 4, pp. 217-231, 1981.

[53] C. L. Sidner, "Focusing in the comprehension of definite anaphora," in Computational Models of Discourse (M. Brady and R. C. Berwick, eds.), pp. 267-330, Cambridge, Mass.: MIT Press, 1983. Reprinted in Grosz et al. (1986a: 363-94).

[54] A. K. Joshi and S. Kuhn, "Centered logic: the role of entity centered sentence representation in natural language inferencing," in Proceedings of the 6th International Joint Conference on Artificial Intelligence, (Tokyo, Japan), pp. 435-439, August 1979.
[55] A. K. Joshi and S. Weinstein, "Control of inference: role of some aspects of discourse structure - centering," in Proceedings of the 7th International Joint Conference on Artificial Intelligence, (Vancouver), pp. 385-387, 1981.

[56] B. J. Grosz, A. K. Joshi, and S. Weinstein, "Providing a unified account of definite noun phrases in discourse," in Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics, (Cambridge, MA), pp. 44-50, 1983.

[57] W. Chafe, "Givenness, contrastiveness, definiteness, subjects, and topics," in Subject and Topic (C. Li, ed.), pp. 25-55, New York: Academic Press, 1976.

[58] M. A. Walker, A. Joshi, and E. Prince, Centering Theory in Discourse. Clarendon Press/Oxford, 1998.

[59] S. B. Hudson-D'Zmura, The Structure of Discourse and Anaphor Resolution: The Discourse Center and the Roles of Nouns and Pronouns. PhD thesis, University of Rochester, 1988.

[60] P. C. Gordon, B. J. Grosz, and L. A. Gilliom, "Pronouns, names and the centering of attention in discourse," Cognitive Science, vol. 17, no. 3, pp. 311-347, 1993.

[61] J. Barwise and J. Perry, Situations and Attitudes. Cambridge, Mass.: Bradford Books, 1983.

[62] J. Barwise, "The situation in logic-IV: on the model theory of common knowledge," Tech. Rep. No. 122, CSLI, Stanford, California, 1988.

[63] E. F. Prince, "On the function of existential presupposition in discourse," in the 14th Regional Meeting of the Chicago Linguistic Society, (Chicago), pp. 362-376, 1978.

[64] H. H. Clark and C. R. Marshall, "Definite reference and mutual knowledge," in Elements of Discourse Understanding (A. Joshi, B. Webber, and I. Sag, eds.), Cambridge: Cambridge University Press, 1981.

[65] M. Poesio, R. Stevenson, B. di Eugenio, and J. Hitzeman, "Centering: A parametric theory and its instantiations," Computational Linguistics, vol. 30, pp. 309-363, Sept. 2004.

[66] S. Cote, "Ranking forward-looking centers," in Centering Theory in Discourse (M. Walker, A. Joshi, and E. Prince, eds.), pp. 55-69, Clarendon Press/Oxford, 1998.

[67] M. Kameyama, Zero Anaphora: The Case of Japanese. PhD thesis, Stanford University, 1985.

[68] M. Kameyama, "A property-sharing constraint in centering," in Proceedings of the 24th Annual Meeting of the Association for Computational Linguistics, (New York, NY), pp. 200-206, 1986.

[69] B. Di Eugenio, "Centering and the Italian pronominal system," in COLING 90: The 13th International Conference on Computational Linguistics, (Helsinki), pp. 20-25, August 1990.

[70] B. Di Eugenio, "Centering in Italian," in Centering Theory in Discourse (M. Walker, A. Joshi, and E. Prince, eds.), pp. 115-137, Clarendon Press/Oxford, 1998.

[71] O. Rambow, "Pragmatic aspects of scrambling and topicalization in German: a centering approach," in the Workshop on Centering Theory in Naturally-Occurring Discourse, IRCS, (University of Pennsylvania), May 1993.

[72] Ü. D. Turan, Null vs. Overt Subjects in Turkish Discourse: A Centering Analysis. PhD thesis, University of Pennsylvania, 1995.

[73] Ü. D. Turan, "Ranking forward-looking centers in Turkish: universal and language-specific properties," in Centering Theory in Discourse (M. Walker, A. Joshi, and E. Prince, eds.), pp. 139-160, Clarendon Press/Oxford, 1998.

[74] E. Miltsakaki and K. Kukich, "Evaluation of text coherence for electronic essay scoring systems," Natural Language Engineering, vol. 10, no. 1, pp. 25-55, 2004.
[75] M. A. Walker, "Evaluating discourse processing algorithms," in Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pp. 251-261, 1989.

[76] B. Baldwin, "Anaphora resolution with centering," in Workshop on Centering Theory in Naturally-Occurring Discourse, (Philadelphia, PA), May 1993.

[77] A. Kehler, "The effect of establishing coherence in ellipsis and anaphora resolution," in Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, (Columbus, OH), pp. 62-69, June 1993.

[78] M. A. Walker, M. Iida, and S. Cote, "Japanese discourse and the process of centering," Computational Linguistics, vol. 20, no. 2, pp. 193-232, 1994.

[79] M. Strube and U. Hahn, "Functional centering," in ACL-96, (Santa Cruz, CA), pp. 270-277, 1996.

[80] S. Lappin and H. Leass, "An algorithm for pronominal anaphora resolution," Computational Linguistics, vol. 20, no. 4, pp. 535-561, 1994.

[81] J. R. Hobbs, "Resolving pronoun references," Lingua, vol. 44, pp. 311-338, 1978. Reprinted in Grosz et al. (1986b).

[82] M. Kameyama, "Intrasentential centering: A case study," in Centering Theory in Discourse (M. Walker, A. Joshi, and E. Prince, eds.), pp. 89-112, Clarendon Press/Oxford, 1998.

[83] M. Strube, "Never look back: an alternative to centering," in Proceedings of COLING-ACL, (Montreal), pp. 1251-1257, 1998.

[84] E. Miltsakaki, "Locating topics in text processing," in Proceedings of CLIN, 1999.

[85] B. J. Grosz and C. L. Sidner, "Lost intuitions and forgotten intentions," in Centering Theory in Discourse (M. A. Walker, A. K. Joshi, and E. F. Prince, eds.), pp. 39-51, Clarendon, Oxford, 1998.

[86] R. Prasad, Constraints on the Generation of Referring Expressions, with Special Reference to Hindi. PhD thesis, University of Pennsylvania, 2003.

[87] E. Miltsakaki, The Syntax-Discourse Interface: Effects of the Main-Subordinate Distinction on Attention Structure. PhD thesis, University of Pennsylvania, 2003.

[88] J. Gundel, "The role of topic and comment in linguistic theory." Distributed by Indiana University Linguistics Club, Bloomington, Indiana, 1976.

[89] B. Abbott, "Definiteness and indefiniteness," in Handbook of Pragmatics (L. R. Horn and G. Ward, eds.), Oxford: Blackwell, 2004.

[90] R. Gaizauskas, M. A. Greenwood, M. Hepple, I. Roberts, and H. Saggion, "The University of Sheffield's TREC 2004 Q&A experiments," in Proceedings of the 13th Text REtrieval Conference (TREC 2004), 2004.

[91] B. M. Sundheim, "Overview of the results of the MUC-6 evaluation," in Proceedings of the Sixth Message Understanding Conference (MUC-6), (Columbia, MD), pp. 13-31, November 6-8, 1995.

[92] D. Appelt, J. Hobbs, J. Bear, D. Israel, M. Kameyama, A. Kehler, D. Martin, K. Myers, and M. Tyson, "SRI International FASTUS system MUC-6 test results and analysis," in Proceedings of the Sixth Message Understanding Conference (MUC-6), (Columbia, MD), NIST, Morgan Kaufmann Publishers, 1995.

[93] R. Gaizauskas, T. Wakao, K. Humphreys, H. Cunningham, and Y. Wilks, "University of Sheffield: Description of the LaSIE system as used for MUC-6," in Proceedings of the Sixth Message Understanding Conference (MUC-6), (Columbia, Maryland), Morgan Kaufmann Publishers Inc., 1995.

[94] K. Humphreys, R. Gaizauskas, S. Azzam, C. Huyck, B. Mitchell, H. Cunningham, and Y. Wilks, "University of Sheffield: Description of the LaSIE-II system as used for MUC-7," in Proceedings of the Seventh Message Understanding Conference (MUC-7), 1998.
[95] K. Fraurud, "Definiteness and the processing of noun phrases in natural discourse," Journal of Semantics, vol. 7, pp. 395-433, 1990.

[96] M. Poesio and R. Vieira, "A corpus-based investigation of definite description use," Computational Linguistics, vol. 24, pp. 183-216, 1998.

[97] E. F. Prince, "The ZPG letter: Subjects, definiteness, and information status," in Discourse Description: Diverse Linguistic Analyses of a Fund-Raising Text (W. C. Mann and S. A. Thompson, eds.), pp. 295-325, Philadelphia: John Benjamins, 1992.

[98] B. Birner and G. Ward, Informational Status and Noncanonical Word Order. Philadelphia: John Benjamins, 1998.

[99] M. Ariel, "Referring and accessibility," Journal of Linguistics, vol. 24, pp. 65-87, 1988.

[100] M. Ariel, Accessing Noun-Phrase Antecedents. London: Routledge, 1990.

[101] P. Christophersen, The Articles: A Study of Their Theory and Use in English. Oxford University Press, 1939.

[102] I. Heim, The Semantics of Definite and Indefinite Noun Phrases. PhD thesis, University of Massachusetts, Amherst, MA, 1982.

[103] I. Heim, "File change semantics and the familiarity theory of definiteness," in Meaning, Use and the Interpretation of Language (R. Bäuerle, C. Schwarze, and A. von Stechow, eds.), pp. 164-189, Berlin: Walter de Gruyter, 1983.

[104] D. Lewis, "Scorekeeping in a language game," Journal of Philosophical Logic, vol. 8, pp. 339-359, 1979.

[105] G. Gazdar, Pragmatics: Implicature, Presupposition, and Logical Form. New York: Academic Press, 1979.

[106] B. Abbott, "Presuppositions as nonassertions," Journal of Pragmatics, vol. 32, pp. 1419-1437, 2000.

[107] B. Russell, "On denoting," Mind, vol. 14, pp. 479-493, 1905.

[108] S. Löbner, "Definites," Journal of Semantics, vol. 4, pp. 279-326, 1985.

[109] N. Kadmon, "Uniqueness," Linguistics and Philosophy, vol. 13, pp. 273-324, 1990.

[110] J. A. Hawkins, "On (in)definite articles: implicatures and (un)grammaticality prediction," Journal of Linguistics, vol. 27, pp. 405-442, 1991.

[111] J. A. Hawkins, Definiteness and Indefiniteness. London: Croom Helm, 1978.

[112] O. Jespersen, A Modern English Grammar on Historical Principles, vol. III. London: George Allen and Unwin, 1949.

[113] H. H. Clark, "Bridging," in Thinking: Readings in Cognitive Science (P. N. Johnson-Laird and P. C. Wason, eds.), pp. 411-420, Cambridge: Cambridge University Press, 1977.