COREFERENCE RESOLUTION FOR DOWNSTREAM NLP TASKS

By

Sushanta Kumar Pani

A THESIS

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

Computer Science – Master of Science

2021

ABSTRACT

COREFERENCE RESOLUTION FOR DOWNSTREAM NLP TASKS

By

Sushanta Kumar Pani

Natural Language Processing (NLP) tasks have witnessed a significant improvement in performance
by utilizing the power of end-to-end neural network models. An NLP system built for one task can
contribute to other closely related tasks. Coreference Resolution (CR) systems work on resolving
references and are at the core of many NLP tasks. Coreference resolution refers to the linking
of repeated references to the same object in a text. CR systems can boost the performance of
downstream NLP tasks such as Text Summarization, Question Answering, and Machine Translation. We
provide a detailed comparative error analysis of two state-of-the-art coreference resolution systems
to understand the error distribution in the predicted output. Understanding the error distribution
helps to interpret system behavior. Eventually, this will contribute to the selection of an
optimal CR system for a specific target task.

Dedicated to Maa and Bapa


ACKNOWLEDGMENTS

First of all, I would like to offer sincere gratitude to my advisor Dr. Parisa Kordjamshidi for her
guidance, support, and immense patience. It has been my privilege to have Dr. Jiliang Tang, Dr.
Kristen Johnson, and Dr. Hamid Karimian on my committee, and I am thankful for their guidance.
I thank Professor Joyce Chai for guiding me during my initial days at MSU. I thank Professor Arun
Ross, Dr. Karthik Durvasula, Professor Pann-Ning Tan, Professor Xiaoming Liu, and Professor
Li Xiao for their courses that helped me improve my skills. I am thankful to Dr. Eric Torng and
Professor Sandeep Kulkarni for their help and encouragement. I thank Dr. Katy Colbry for her
research writing guidance. I want to thank Steven R. Smith, Amy King, Brenda Hodge, and Erin
Dunlop for their support and help.

I am blessed to have an awesome bunch of friends as LIAR and HLR group lab mates in
the order of acquaintance; Shane Storks, Sari Saba-Sadiya, Dr. Qiaozi Gao, Dr. Shaohua Yang,
Guangyue Xu, James Peterkin II, Dr. Quan Guo, Hossein Faghihi, Drew Hayward, Chen Zheng,
Roshanak Mirzaee, Yue Zhang, Dr. Elaheh Raisi, Darius Nafar, and Tim Moran. I thank friends
from Michigan State University: Sudarshan, Aﬀan, Apoorva, Gauri, Sneha, Aditya, and Shalin
for making life lively and colorful. I oﬀer my sincere obeisances to ISKCON Lansing Family for
their guidance and support throughout my graduate study period. I appreciate the contribution of
colleagues and friends in India towards my professional as well as personal progress. I take this
opportunity to pay the highest obeisances to all my teachers, guides, and well-wishers. I am highly
indebted to the love and care of Sri. Bhismadev Nag and Sri Purusottam Pati.

I thank everyone from both of my families for always being supportive. In particular, I thank
Maa and Bapa, who are always there for me. Life is always full of pleasure with my siblings: Nani,
Ruby, and Bhai. I thank Bhaina, Dada, and Bhauja for keeping my morals high always. I thank the
nextgen in our family: Om, Subh, Sonu, Mona, Sona, and Subham for their love and aﬀection. I
thank God for the gift as a new family with Mummy, Daddy, Poonam Di, Jop ji, Vishnu, Chetna,
and of course Rahul, Anika, and Jia+.... Life is awesome with all of you.

I admire my friend and wife Renu for being the reservoir of encouragement contributing to my
life progress.

I love everything due to Krishna...


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1 INTRODUCTION
  1.1 Coreference Resolution
  1.2 Steps in Coreference Resolution
    1.2.1 Mention Detection
    1.2.2 Optimal Clustering of Mentions

CHAPTER 2 RELATED WORK
  2.1 Coreference Resolution Models
    2.1.1 Rule-based methods
    2.1.2 Learning based
      2.1.2.1 Mention Pair
      2.1.2.2 Entity-mention
      2.1.2.3 Mention and Span Ranking
  2.2 Error Analysis Coreference Resolution Systems

CHAPTER 3 EXPERIMENTAL RESULTS
  3.1 Models
    3.1.1 BERT-base Coreference System
    3.1.2 SpanBERT-base Coreference System
  3.2 Dataset
  3.3 Evaluation Metrics in Coreference Resolution
    3.3.1 MUC
    3.3.2 B3 (B-Cubed)
    3.3.3 CEAF
    3.3.4 CoNLL as official Score
  3.4 Experimental Setup and Results
    3.4.1 Setup
    3.4.2 Results on Evaluation Metrics

CHAPTER 4 ERROR ANALYSIS
  4.1 Error Classification
    4.1.1 Transformation
    4.1.2 Mapping
  4.2 Discussion on Error Analysis
    4.2.1 Span Errors
    4.2.2 Extra Mentions and Missing Mentions
    4.2.3 Extra Entities and Missing Entities
    4.2.4 Conflated Entities and Divided Entities

CHAPTER 5 CONCLUSION AND FUTURE WORK

BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: Number of documents, entities, links and mentions in the English part of
OntoNotes v5.0 data [36]

Table 3.2: Evaluation with an average F1 score of three metrics MUC, B3 and CEAFφ4
on test dataset

Table 4.1: Counts for each error type for BERT-base and SpanBERT-base on the English
test set of the 2012 CoNLL shared task and the best performing model BERKELEY
on 2011 CoNLL shared task reported by Kummerfeld and Klein [25]

Table 4.2: Examples of Span errors with Extra text and Missing text

Table 4.3: Counts of Span Errors grouped by the labels (NP: Noun Phrase, POS: Possessive
Ending (e.g. people's, government's), .: Punctuation, SBAR: Subordinate
clause, PP: Prepositional phrase, DT: Determiner) over the extra/missing part
of the mention.

Table 4.4: Examples of Extra Mentions and Missing Mentions error.

Table 4.5: Counts of Missing and Extra Mentions errors by mention type, and some of
the common mentions.

Table 4.6: Counts of Extra and Missing Mentions, grouped by properties of the mention
and the entity it is in

Table 4.7: Examples of Extra Entities and Missing Entities error.

Table 4.8: Counts of Extra and Missing Entity errors, grouped by the composition of the
entity (Names, Nominals, Pronouns).

Table 4.9: Counts of Extra and Missing Entity errors grouped by properties of the mentions
in the entity.

Table 4.10: Counts of common Missing and Extra Entity errors where the entity has just
two mentions: a pronoun and either a nominal or a proper name.

Table 4.11: Examples of Conflated Error and Divided error.

Table 4.12: Counts of Conflated and Divided entities errors grouped by the Name/Nominal/Pronoun
composition of the parts involved.

LIST OF FIGURES

Figure 1.1: Mention detection and clustering in coreference resolution

Figure 3.1: In this example the span twenty20 cricket league is masked. The Span
Boundary Objective uses the boundary tokens output representation (x3, x7)
and position embedding p5 to predict the token (cricket) in masked span.

Figure 4.1: Transformation steps to change a predicted output (top) into a gold annotation
(at bottom). Figure after Kummerfeld and Klein [25]

CHAPTER 1

INTRODUCTION

Natural Language Processing (NLP) has witnessed performance advancement in many tasks, such
as Information Extraction, Question Answering, and Text Summarization. However, most of these
systems face challenges in resolving references. The ambiguity of reference in these systems
can be reduced by using a Coreference Resolution (CR) system to achieve a higher level of
accuracy [34]. Several methods have been utilized for coreference resolution. CR is difficult,
which is evident from the performance of these methods on commonly used evaluation metrics.
There is good scope for work on rectifying dataset annotation (missing and incomplete) and on
improving the evaluation methodology used in CR. Recent end-to-end CR systems [23, 24, 28, 29],
designed using the power of deep learning algorithms, focus primarily on improving accuracy.
However, a detailed error analysis is required [25] to compare these models. This comparative
analysis helps to select a system for a specific downstream task. The error analysis also
contributes to the interpretation of a CR system's decisions.

In this thesis, we work on two state-of-the-art end-to-end CR systems: BERT-base and
SpanBERT-base [23, 24]. We compare the performance of these systems on the CoNLL 2012 shared
task data [36]. We also perform a detailed error analysis and compute the corresponding error
distributions of these models, motivated by the work of Kummerfeld and Klein [25].

1.1 Coreference Resolution

Language-based communication has contributed significantly to the progress of the human
race. Speech and text are two core elements of language-based human communication. Understanding
speech or text requires identifying entities from the references (mentions) made to them.
Humans communicate easily due to their ability to comprehend entity references. This is still an
arduous challenge for computing-based artificial agents/systems.

Coreference Resolution (CR) detects mentions and links the mentions referring to a common
entity. Figure 1.1 has two entities, Barack Obama and Hillary Rodham Clinton. There are two
clusters of mentions {Barack Obama, He, his} and {Hillary Rodham Clinton, her, she, secretary of
state, First Lady}. The first group of mentions refers to Barack Obama, and the second group of
mentions refers to Hillary Rodham Clinton. These groups of mentions form the coreference chains,
or mention clusters, for the example passage. In this passage Hillary Rodham Clinton is an
antecedent that appears before the referring mention she. An optimal CR system is expected to
detect the mentions {her, she, secretary of state, First Lady} and link them with the common
entity Hillary Rodham Clinton.

Figure 1.1: Mention detection and clustering in coreference resolution

In this example, suppose we want to answer the question: Why did Barack Obama nominate
Hillary Rodham Clinton? Correctly linking she, her, and First Lady with Hillary Rodham Clinton
gives a clue about the foreign affairs experience that led to her nomination. It is evident
from this discussion that CR is useful for other NLP tasks too.


1.2 Steps in Coreference Resolution

The coreference resolution task is an assembly of two sub-tasks: mention detection and optimal
clustering of mentions.

1.2.1 Mention Detection

Mention detection is the first stage of coreference resolution. The objective is to find the spans of
text that constitute each mention. A mention can be a pronoun, a noun phrase, or a name. NLP systems
such as named entity recognizers and part-of-speech taggers are useful for finding mentions. These
systems are able to find most instances of pronouns, noun phrases, or names correctly. However, not
all of these instances are mentions. Mention detection algorithms are usually very liberal
in proposing candidate mentions using such NLP tools. This approach creates a large space of
candidate mentions. Hence, the mention space needs to be pruned for computational
efficiency. A classifier model can be a better choice than this pipeline approach for mention
detection. The CR system computes a mention score for each candidate in the mention space and
discards candidates with a low score. Singleton mentions (mentions that are not coreferent with
any other mention) are not annotated in many datasets, such as the 2012 CoNLL shared task data [36].
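
The propose-then-prune idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the mention scorer of any system discussed in this thesis: the function names, the `PRONOUNS` list, the `max_width` and `n_keep` parameters, and the toy scoring rule are all hypothetical.

```python
PRONOUNS = {"he", "she", "her", "his"}  # hypothetical, tiny pronoun list

def enumerate_spans(tokens, max_width=3):
    """Liberally propose every span of up to max_width tokens as (start, end)."""
    return [(start, end)
            for start in range(len(tokens))
            for end in range(start, min(start + max_width, len(tokens)))]

def toy_mention_score(tokens, span):
    """Hypothetical score: pronouns and fully capitalised spans look like mentions."""
    words = tokens[span[0]:span[1] + 1]
    if len(words) == 1 and words[0].lower() in PRONOUNS:
        return 2.0
    return float(len(words)) if all(w[0].isupper() for w in words) else 0.0

def prune(tokens, spans, n_keep):
    """Keep the n_keep best-scoring candidates, returned in document order."""
    ranked = sorted(spans, key=lambda s: toy_mention_score(tokens, s), reverse=True)
    return sorted(ranked[:n_keep])

tokens = ["Barack", "Obama", "nominated", "her"]
candidates = enumerate_spans(tokens)
mentions = prune(tokens, candidates, n_keep=2)
print(len(candidates), mentions)
```

A learned mention scorer would replace `toy_mention_score`; the enumeration and top-k pruning steps stay the same in spirit.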

1.2.2 Optimal Clustering of Mentions

Mentions with similar properties are grouped to build coreferent clusters in the second stage.
Some linguistic properties considered among these mentions are number agreement, person agree-
ment, gender or noun class agreement, binding theory constraint, recency, grammatical role, verb
semantics, and selectional restriction.


CHAPTER 2

RELATED WORK

2.1 Coreference Resolution Models

2.1.1 Rule-based methods

Like other Natural Language Processing (NLP) tasks, early Coreference Resolution (CR) used
hand-crafted rules. Most early knowledge-rich algorithms are powered by hand-crafted rules that
depend on semantic and syntactic features of the text under consideration. Hobbs' naive algorithm [21]
was one of the first methodologies for anaphora resolution. It applies a rule-based, left-to-right,
breadth-first traversal of the parse tree of a sentence to search for an antecedent. It uses world
knowledge-based selectional constraints for antecedent elimination. The rules and selectional
constraints help the algorithm converge to a single antecedent by pruning the antecedent search
space. Lappin and Leass [26] proposed a hybrid algorithm. It considers syntax as well as
discourse for pronominal anaphora resolution. It has a discourse model that consists of all
the potential antecedent references of a specific anaphor. Each antecedent has a salience value
that reflects semantic and syntactic constraints, among other features. The antecedent with the
maximum salience value is considered the best antecedent. This algorithm incorporates a signal
attenuation mechanism that halves the salience each time it propagates to the next sentence.
The BFP algorithm [5] is built on centering theory [16]. It uses discourse structure to explain
phenomena such as anaphora and coreference. Hobbs' dataset is used to evaluate the centering
theory algorithm.
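
The salience attenuation described for the Lappin and Leass algorithm can be illustrated with a small sketch. The candidate tuples and the salience values below are made-up numbers for illustration; the full algorithm derives salience from a weighted set of factors that is not reproduced here.

```python
def attenuate(salience, sentences_back):
    """Halve a salience value once for every sentence boundary crossed."""
    return salience / (2 ** sentences_back)

def best_antecedent(candidates, current_sentence):
    """Pick the candidate with the highest attenuated salience.

    candidates: list of (mention, sentence_index, salience) tuples, where
    sentence_index is the sentence in which the mention was introduced.
    """
    def attenuated(candidate):
        _, sentence_index, salience = candidate
        return attenuate(salience, current_sentence - sentence_index)
    return max(candidates, key=attenuated)[0]

# Hypothetical salience values: an older, initially stronger candidate loses
# to a recent one after two halvings (300 / 4 = 75 < 100).
candidates = [("the committee", 0, 300.0), ("the senator", 2, 100.0)]
print(best_antecedent(candidates, current_sentence=2))
```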

Rule-based algorithms have a high dependency on external knowledge. Efforts [2,17,20,27,47]
were made to reduce the dependency of rules on external knowledge. Baldwin [2] proposed
COGNIAC, a knowledge-poor coreference resolver with high precision. This model assumes
that there exists a subclass of anaphors that does not need general-purpose reasoning. The model
can distinguish anaphors that need external knowledge from those that do not. Attempts [19, 30]
were made to incorporate world knowledge into a coreference resolution system. Haghighi and
Klein [17] proposed a strong baseline that modularizes syntactic, semantic, and discourse
constraints. This model outperformed supervised as well as unsupervised systems of that time.

2.1.2 Learning based

The availability of annotated corpora such as ACE [14] and MUC [31] paved the way for learning-based
CR models. Even with these datasets, learning coreference is an arduous task in NLP. Raghunathan et
al. [38] demonstrated that a hand-engineered system built on top of parse trees outperformed earlier
learning-based approaches for coreference resolution. However, this approach was later surpassed by
the highly lexical learning method proposed by Durrett and Klein [15]. Earlier neural models [7,8,45]
achieved better performance than other machine learning-based models. All of these pipelined
approaches use a mention proposal algorithm and rely on a parser to find head features. Parsing
errors in such pipelines cause cascading errors in the overall model. Daume III and Marcu [9] were
the first to propose a non-pipeline algorithm that jointly learns mention detection and coreference
resolution. Learning-based coreference resolution systems can be broadly classified as
(1) mention-pair classifiers, (2) entity-mention models, and (3) mention-ranking models.

2.1.2.1 Mention Pair

Mention pair models [?,3,4,7,11,35,40,41,44] consider coreference as a collection of mention pair
links. These models first detect mentions, then learn pairwise mention scores to classify pairs, and
finally cluster the mention pairs. In the first step, instances of valid mentions are detected. The
algorithm proposed by Soon et al. [40] was a very popular mention creation algorithm. Subsequent
mention creation algorithms apply constraints to minimize incorrect mention instances and to remove
hard-to-train ones. In the second step, a classifier is trained to decide whether two mentions are
coreferent or not. In the third step, various clustering techniques are used to build coreference
chains.


Though mention-pair models have achieved popularity in the CR task, they have some challenges to
overcome. One issue is the transitivity constraint. If there are coreferent mention pairs (A, B) and
(B, C), then (A, C) is expected to be a coreferent mention pair as well. The transitivity property
should not be applied without considering other constraints. Suppose the classifier links He to
Clinton and Clinton to She; He and She should still not be coreferent.
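
The transitivity issue can be made concrete with a short sketch. Here pairwise decisions are closed under transitivity with a union-find structure, one common way to turn pair links into clusters; this is an illustrative choice, not the clustering step of a particular cited system, and the function name is hypothetical.

```python
def cluster_by_transitivity(pairs):
    """Group mentions by taking the transitive closure of coreferent pairs."""
    parent = {}

    def find(mention):
        parent.setdefault(mention, mention)
        while parent[mention] != mention:
            parent[mention] = parent[parent[mention]]  # path compression
            mention = parent[mention]
        return mention

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for mention in parent:
        clusters.setdefault(find(mention), set()).add(mention)
    return list(clusters.values())

# Two locally plausible pair decisions...
pairs = [("He", "Clinton"), ("Clinton", "She")]
# ...force He and She into one cluster, despite the gender mismatch.
print(cluster_by_transitivity(pairs))
```

The single resulting cluster contains both He and She, which is exactly the outcome that agreement constraints are meant to block before transitivity is applied.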

2.1.2.2 Entity-mention

Mention pair models are effective for the CR task, but they do not use entity-level information,
i.e., features between clusters of mentions. Entity-level information lets earlier CR decisions
inform new decisions. Entity-mention CR systems use entity-level information. The procedure to
find useful entity-level features is itself very challenging, so most work follows the mention
pair modeling approach. Aggregated mention pair scores from these models can be useful to define
entity-level features between clusters of mentions. Instead of mention and antecedent pairs, entity
models [6,8,18,45] focus on using past coreference decisions in new decisions. A mention
is compared with a partially formed cluster instead of individual antecedents.

2.1.2.3 Mention and Span Ranking

The mention pair model acts as a binary classifier to decide whether two mentions are coreferent
or not. However, the mention pair model provides no way to compare two candidate antecedents and
select the optimal one for a given anaphor. This issue is handled by many of the ranking models
[10, 15, 28, 29, 32, 39]. Ranking seems to be a better choice than classification for the CR
task. Denis and Baldridge [11] used a ranking loss function in place of a classification loss. A
ranking model proposed by Durrett and Klein [15] uses a log-linear model on surface features for
antecedent selection. Even though they are popular, mention ranking models are not able to use past
decisions to make new decisions.


2.2 Error Analysis Coreference Resolution Systems

Coreference Resolution research has focused on improving accuracy as measured by evaluation
metrics. These metrics provide a quantifiable summary of model performance over a pool of errors.
Most performance analysis methods evaluate nominals, proper names, and pronouns separately.
Methodologies that focus on a specific error, or on manual classification of a small set of errors,
fail to quantify the impact of these errors. Holen [22] presented a manual analysis approach with
a more comprehensive set of error types. It highlights shortcomings of evaluation metrics rather
than analyzing models. Stoyanov et al. [42] used gold annotation and evaluated improvement in
mention detection, anaphoric mention detection, and named entity recognition. They defined nine
resolution classes based on the mention types and antecedent properties. Their analysis can
characterize the variation across resolution classes but misses cascading errors when mentions are
resolved simultaneously. The CoNLL shared tasks [36, 37] provide a multi-system comparison and
measure the impact of mention and anaphoricity detection.

Kummerfeld and Klein [25] carried out a detailed analysis of errors and extended earlier work
on evaluation. Their method gives a detailed picture of the error distribution instead of just
an accuracy comparison on evaluation metrics. Hence we analyze the errors of two
current state-of-the-art systems [23,24] using their methodology.


CHAPTER 3

EXPERIMENTAL RESULTS

3.1 Models

We use two state-of-the-art coreference resolution systems [23, 24] that use SpanBERT and
BERT to embed span representations. Following the suggestion of Joshi et al. [24], we use
independent versions of BERT-base [24] and SpanBERT-base [23]. The base variants of these
models have lighter computational overhead than the large variants [13].

3.1.1 BERT-base Coreference System

Joshi et al. [24] used BERT transformer [13] based span embeddings in place of the LSTM-based span
embeddings of the earlier work of Lee et al. [29]. BERT was originally pre-trained on BookCorpus and
English Wikipedia with two training objectives: Masked Language Modeling (MLM) and Next
Sentence Prediction (NSP). The BERT encoder generates a contextualized vector representation of
each token in the input sequence.

\[ P(y) = \frac{e^{s(x, y)}}{\sum_{y' \in Y} e^{s(x, y')}} \tag{3.1} \]

The model learns the distribution P(.) over possible antecedent spans Y for each mention span
x. The scoring function s(x,y) combines the mention scores of the two spans and their joint
compatibility score. The mention score indicates how likely a span is to be a valid mention,
whereas the compatibility score measures how likely two mentions are to refer to the same entity.

\[ s(x, y) = s_m(x) + s_m(y) + s_c(x, y) \tag{3.2} \]

where:

Mention score of span x:

\[ s_m(x) = \mathrm{FFNN}_m(g_x) \tag{3.3} \]

Mention score of span y:

\[ s_m(y) = \mathrm{FFNN}_m(g_y) \tag{3.4} \]

Joint compatibility score of span x and y:

\[ s_c(x, y) = \mathrm{FFNN}_c(g_x, g_y, \phi(x, y)) \tag{3.5} \]

Spans x and y have span embeddings gx and gy, respectively. Speaker and distance information
are categorized as meta-information φ(x, y). FFNNm and FFNNc denote the feed-forward neural
networks that compute the mention score and the joint compatibility score, respectively.
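
A small numeric sketch may help to read equations 3.1 and 3.2 together. The scores below are made-up numbers standing in for the outputs of FFNNm and FFNNc; the function names are hypothetical and no neural network is involved.

```python
import math

def pair_score(s_m_x, s_m_y, s_c_xy):
    """Equation 3.2: two mention scores plus the joint compatibility score."""
    return s_m_x + s_m_y + s_c_xy

def antecedent_distribution(scores):
    """Equation 3.1: softmax over s(x, y) for every candidate antecedent y."""
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

# Hypothetical scores for the mention x = "she" and two candidate antecedents.
s_m_she = 1.0
scores = {
    "Hillary Rodham Clinton": pair_score(s_m_she, 2.0, 1.5),
    "Barack Obama": pair_score(s_m_she, 2.0, -1.5),
}
probabilities = antecedent_distribution(scores)
print(max(probabilities, key=probabilities.get))
```

Because the two candidates have the same mention score here, the compatibility score alone decides which antecedent receives most of the probability mass.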

\[ \log \prod_{i=1}^{N} \sum_{y' \in Y(i) \cap \mathrm{GOLD}(i)} P(y') \tag{3.6} \]

The marginal log-likelihood of all correct antecedents for each span is optimized based on
annotated gold clusters in the training data. The model selects the best antecedent for each span
on the basis of the learned scores and forms the coreference chains through the transitivity
property. In equation 3.6, GOLD(i) is the set of spans in the gold cluster containing span i. This
process accurately prunes spans and makes sure only gold mentions get positive updates.
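
The objective in equation 3.6 can be evaluated by hand for a toy case. In this sketch the antecedent probabilities are invented numbers, and a span with no gold antecedent is assigned a dummy antecedent written "eps", following the convention of Lee et al. [29]; that convention is an assumption here, since the text above does not spell it out.

```python
import math

def marginal_log_likelihood(antecedent_probs, gold_antecedents):
    """Sum over spans of the log of the total probability given to gold antecedents.

    antecedent_probs: one dict per span, mapping candidate antecedent -> P(y').
    gold_antecedents: one set per span, the candidates in Y(i) that are in GOLD(i).
    """
    total = 0.0
    for probs, gold in zip(antecedent_probs, gold_antecedents):
        total += math.log(sum(p for y, p in probs.items() if y in gold))
    return total

# Span 1 ("she") has two correct antecedents; span 2 has none, so only the
# dummy antecedent "eps" counts as correct (assumed convention).
antecedent_probs = [
    {"eps": 0.1, "Clinton": 0.6, "her": 0.3},
    {"eps": 0.8, "Obama": 0.2},
]
gold_antecedents = [{"Clinton", "her"}, {"eps"}]
print(marginal_log_likelihood(antecedent_probs, gold_antecedents))
```

The log of a product over spans equals the sum of per-span logs, which is why the loop accumulates a sum.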

3.1.2 SpanBERT-base Coreference System

The pre-trained SpanBERT of Joshi et al. [23] provides a better way to represent and predict spans
of text. Unlike BERT, SpanBERT masks random contiguous spans of tokens instead of individual
tokens. The authors proposed a Span Boundary Objective that predicts the entire masked span using
the representations of the span boundary tokens. SpanBERT trains the encoder on single sequences
instead of the sequence pairs that BERT uses with its NSP objective. Apart from this modified
embedding, SpanBERT-base follows the same training and clustering procedure as BERT-base.


Figure 3.1: In this example the span twenty20 cricket league is masked. The Span Boundary
Objective uses the boundary tokens output representation (x3, x7) and position embedding p5 to
predict the token (cricket) in masked span.

3.2 Dataset

We use the English portion of the CoNLL-2012 shared task data [36] for our experiments. This
data set is the one most commonly used to evaluate recent coreference models [23, 24, 28]. It is
a document-level dataset with 3384 (2802 training, 343 development, and 348 test) documents
containing 1.6M words. Table 3.1 summarizes the dataset. There are seven genres: broadcast
conversations, broadcast news, magazine texts, newswire, pivot texts, telephone conversations,
and weblogs. The annotation complexity for coreference increases non-linearly with the length of
a document, so longer documents are split into parts to reduce annotation complexity. Three
genres, telephone conversations, weblogs, and broadcast conversations, contribute a large share
of the longer documents.


Section      Documents   Words   Mentions    Links   Entities
Total             3384    1.6M     194480   135179      44203
Train             2802    1.3M     155560   120417      35134
Validation         343    160K      19156    14610       4546
Test               348    170K      19764    15232       4532

Table 3.1: Number of documents, entities, links and mentions in the English part of OntoNotes
v5.0 data [36]

3.3 Evaluation Metrics in Coreference Resolution

An evaluation metric for CR should satisfy two properties [33]: interpretability and discriminative
power. A high score on an interpretable metric suggests the model is good at detecting coreference
relations. A highly discriminative metric can differentiate a good decision from a bad one. Several
metrics have been proposed for the evaluation of CR. We consider the three evaluation metrics
commonly used in current research and in the 2012 CoNLL shared task [36] for our experiments. Each
metric focuses on a different dimension: link-based MUC [43], mention-based
B-CUBED [1], and entity-based CEAF [31].

3.3.1 MUC

Vilain et al. [43] proposed the first evaluation metric for the coreference resolution
task. The links predicted by the system are compared with the manually annotated coreference
chains, or truth links. This metric computes the number of modifications needed to change a
response set into a truth set. MUC precision and recall are calculated as follows.

In equation 3.7, |partition(p, G)| is the number of gold clusters that the predicted cluster p
intersects.

In equation 3.8, |partition(g, P)| is the number of predicted clusters that the gold cluster g
intersects.

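\[ \mathrm{MUC\,Precision}(G, P) = \frac{\sum_{p \in P} \left( |p| - |\mathrm{partition}(p, G)| \right)}{\sum_{p \in P} \left( |p| - 1 \right)} \tag{3.7} \]

\[ \mathrm{MUC\,Recall}(G, P) = \frac{\sum_{g \in G} \left( |g| - |\mathrm{partition}(g, P)| \right)}{\sum_{g \in G} \left( |g| - 1 \right)} \tag{3.8} \]

where \( \mathrm{partition}(x, Y) = \{ y \mid y \in Y \wedge y \cap x \neq \emptyset \} \).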
This metric is the least discriminative of the metrics proposed for coreference resolution. It
gives similar credit to a link whether the link joins singletons or the largest entities.
3.3.2 B3 (B-Cubed)

Bagga and Baldwin [1] consider each individual mention to calculate precision and recall. The
ﬁnal number for precision and recall are computed as follows:

\[ \mathrm{Final\,Precision} = \sum_{i=1}^{N} w_i \cdot \mathrm{Precision}_i \tag{3.9} \]

\[ \mathrm{Final\,Recall} = \sum_{i=1}^{N} w_i \cdot \mathrm{Recall}_i \tag{3.10} \]

In equations 3.9 and 3.10, N is the total number of mentions in the document (Bagga and Baldwin [1]
call them entities), and each mention i has an assigned weight wi, a precision Precisioni, and a
recall Recalli. Every mention is weighted equally, i.e. wi = 1/N.
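
A short sketch of the per-mention computation behind equations 3.9 and 3.10 follows. It assumes the usual B3 definitions: the precision of a mention is the fraction of its predicted cluster that is also in its gold cluster, and its recall is the fraction of its gold cluster that is also in its predicted cluster. Mentions missing on one side are treated as singletons there, which is an assumption of this sketch, as are the function names.

```python
def b_cubed(gold, predicted):
    """Return B3 (precision, recall) with equal weights w_i = 1/N per mention."""
    gold_of = {m: cluster for cluster in gold for m in cluster}
    pred_of = {m: cluster for cluster in predicted for m in cluster}

    def average(own, other):
        # own[m]: cluster of m on the side being scored; other[m]: on the opposite side.
        if not own:
            return 0.0
        total = 0.0
        for m, cluster in own.items():
            overlap = cluster & other.get(m, {m})  # assumption: missing -> singleton
            total += len(overlap) / len(cluster)
        return total / len(own)

    precision = average(pred_of, gold_of)
    recall = average(gold_of, pred_of)
    return precision, recall

gold = [{"a", "b", "c"}, {"d", "e"}]
predicted = [{"a", "b"}, {"c", "d", "e"}]
print(b_cubed(gold, predicted))
```

For this example the five per-mention precisions are 1, 1, 1/3, 2/3, and 2/3, so the final precision is 11/15; the recall works out to the same value.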

3.3.3 CEAF

The Constrained Entity-Alignment F-measure (CEAF) by Luo [31] compares the similarity between
entities. A similarity measure is used to find the optimal one-to-one mapping between predicted
and truth clusters. This mapping is used to calculate precision and recall. There are four
similarity measures in this metric.

φ1(G, P) considers two entities the same if all their mentions are the same.

\[ \phi_1(G, P) = \begin{cases} 1 & \text{if } P = G \\ 0 & \text{otherwise} \end{cases} \tag{3.11} \]

φ2(G, P) considers two entities the same if they share at least one mention.

\[ \phi_2(G, P) = \begin{cases} 1 & \text{if } P \cap G \neq \emptyset \\ 0 & \text{otherwise} \end{cases} \tag{3.12} \]

φ3(G, P) counts the number of common mentions between G and P.

\[ \phi_3(G, P) = |G \cap P| \tag{3.13} \]

The F-measure between G (a gold entity) and P (a predicted entity) is expressed as φ4(G, P).

\[ \phi_4(G, P) = \frac{2\,|G \cap P|}{|G| + |P|} \tag{3.14} \]

The function m(p) takes a predicted cluster p as input and returns a gold cluster g. A predicted
cluster can only be mapped to one gold cluster. CEAF precision and recall are computed as follows:

\[ \mathrm{CEAF}_{\phi_i}\,\mathrm{Precision}(G, P) = \frac{\max_m \sum_{p \in P} \phi_i(p, m(p))}{\sum_{p \in P} \phi_i(p, p)} \tag{3.15} \]

\[ \mathrm{CEAF}_{\phi_i}\,\mathrm{Recall}(G, P) = \frac{\max_m \sum_{p \in P} \phi_i(p, m(p))}{\sum_{g \in G} \phi_i(g, g)} \tag{3.16} \]
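
The CEAF computation with the φ4 similarity can be sketched for small inputs. To stay self-contained, this sketch finds the best one-to-one mapping by brute force over permutations, which is only practical for a handful of clusters; evaluation scripts typically use the Kuhn-Munkres algorithm for this step instead. The function names are hypothetical and clusters are assumed to be non-empty sets.

```python
from itertools import permutations

def phi4(a, b):
    """F-measure style similarity between two entities (sets of mentions)."""
    return 2 * len(a & b) / (len(a) + len(b))

def ceaf_phi4(gold, predicted):
    """Return CEAF-phi4 (precision, recall) using an exhaustive mapping search."""
    small, large = sorted([gold, predicted], key=len)
    # Try every one-to-one assignment of the smaller side into the larger side.
    best = max(sum(phi4(s, l) for s, l in zip(small, perm))
               for perm in permutations(large, len(small)))
    precision = best / sum(phi4(p, p) for p in predicted)
    recall = best / sum(phi4(g, g) for g in gold)
    return precision, recall

gold = [{"a", "b", "c"}, {"d", "e"}]
predicted = [{"a", "b"}, {"c", "d", "e"}]
print(ceaf_phi4(gold, predicted))
```

Here the best mapping pairs {a, b} with {a, b, c} and {c, d, e} with {d, e}, each contributing a similarity of 0.8, so precision and recall are both 0.8.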

3.3.4 CoNLL as official Score

\[ \mathrm{CoNLL}_{F1} = \frac{B^3_{F1} + \mathrm{MUC}_{F1} + \mathrm{CEAF}_{F1}}{3} \tag{3.17} \]

The official score reported in the CoNLL 2012 shared task by Pradhan et al. [36] is the unweighted
average of the F1 scores from the B3, MUC, and entity-based CEAF (denoted CEAFφ4) metrics. However,
a weighted average of these scores can be useful depending on a specific downstream task [12].
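
As a quick arithmetic check, the averaging can be written as a small helper and applied to the F1 values reported in Table 3.2. The function name and the `weights` parameter are hypothetical; the default reproduces the unweighted official score, and other weights illustrate the task-specific weighting mentioned above.

```python
def conll_f1(muc_f1, b3_f1, ceaf_f1, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three F1 scores; equal weights give the official score."""
    scores = (muc_f1, b3_f1, ceaf_f1)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# F1 values from Table 3.2 (MUC, B3, CEAF-phi4).
bert_base = conll_f1(80.96, 71.07, 67.88)
spanbert_base = conll_f1(82.91, 73.96, 71.32)
print(round(bert_base, 2), round(spanbert_base, 2))
```

Both results agree with the Average F1 column of Table 3.2 after rounding to two decimals.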

3.4 Experimental Setup and Results

3.4.1 Setup

We adapted the PyTorch implementation [46] of coreference resolution using BERT and
SpanBERT [23, 24]. These models are fine-tuned on the document-level English data of the OntoNotes
5.0 dataset [36]. Documents in this dataset are long, so multiple segments are used to
read a complete document. We train each model with maximum segment lengths of 128, 256, 384,
and 512. We randomly truncate longer documents to eleven segments to limit the memory cost of
span representations. We use the models with a maximum segment length of 128 for our analysis,
considering the findings of Joshi et al. [24]. We use a batch size of one document, similar to
Joshi et al. [24] and Lee et al. [29]. We train both models for 24 epochs with a dropout rate of
0.3. We conduct experiments on Nvidia TITAN RTX GPUs with 24GB memory. The average training time
is around 4 hours for BERT-base and SpanBERT-base.
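
The segmenting and truncation steps can be sketched as follows. This is a simplified illustration and not the adapted implementation [46]: it cuts on raw token counts, whereas a real pipeline would respect sentence boundaries and add special tokens, and it assumes that truncation keeps a random contiguous window of segments. All function and parameter names are hypothetical.

```python
import random

def split_into_segments(tokens, max_segment_len):
    """Cut a document into consecutive segments of at most max_segment_len tokens."""
    return [tokens[i:i + max_segment_len]
            for i in range(0, len(tokens), max_segment_len)]

def truncate_segments(segments, max_segments=11, rng=random):
    """Keep a random contiguous window of max_segments segments (assumed behaviour)."""
    if len(segments) <= max_segments:
        return segments
    start = rng.randint(0, len(segments) - max_segments)
    return segments[start:start + max_segments]

document = list(range(2000))           # stand-in for 2000 subword tokens
segments = split_into_segments(document, max_segment_len=128)
kept = truncate_segments(segments, max_segments=11, rng=random.Random(0))
print(len(segments), len(kept))
```

With a maximum segment length of 128, a 2000-token document yields 16 segments, of which 11 are kept for training.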

3.4.2 Results on Evaluation Metrics

                       MUC                     B3                    CEAFφ4          Average
                  P      R      F1       P      R      F1       P      R      F1        F1
BERT-base       83.41  78.65  80.96    73.98  68.39  71.07    71.44  64.66  67.88     73.30
SpanBERT-base   82.82  83.00  82.91    73.20  74.71  73.96    72.57  70.12  71.32     76.06

Table 3.2: Evaluation with an average F1 score of three metrics MUC, B3 and CEAFφ4 on test
dataset

We use the official CoNLL-2012 evaluation script [36] to report precision, recall, and F1 for
the three evaluation metrics MUC, B3, and CEAFφ4. We report the coreference score of our models as
the unweighted average F1 score of these metrics. The SpanBERT-base system outperforms the
BERT-base system in terms of MUC, B3, CEAFφ4, and the final average score. The SpanBERT-base
system has a higher coreference score because its recall is higher while its precision is
comparable to that of the BERT-base system.

CHAPTER 4

ERROR ANALYSIS

4.1 Error Classiﬁcation

Evaluation metrics for Coreference Resolution (CR) summarize overall model performance and rest
on the notion that high-scoring models make fewer prediction errors. However, these metrics are
silent about the types of error each model makes. Kummerfeld and Klein [25] carried out an
extensive error analysis of earlier CR models reported in the CoNLL-2011 Shared Task [37] and of
other publicly available models. Their analysis follows a two-step method, Transformation and
Mapping, to classify system prediction errors into seven categories.

4.1.1 Transformation

First, the system output is transformed into the gold annotation using the following five
operations, as demonstrated in Figure 4.1.

1. Alter Span modifies a predicted span so that it matches a gold span. In the Alter Span step,
   mention X in the left-most entity is modified.

2. Split divides predicted entities to form gold entities. In the Split step, the left-most
   entity is divided into two entities.

3. Remove deletes predicted mentions that are not part of any gold entity. In the Remove step,
   all X mentions are deleted.

4. Introduce creates a singleton entity (with one mention) for every gold mention missing from
   the system prediction. In the Introduce step, three new mentions are created at the
   rightmost part.

5. Merge unites a group of wrongly separated predicted entities to form one correct gold
   entity. At the end of the Merge step, similar mentions are grouped to form three entities.

Figure 4.1: Transformation steps to change a predicted output (top) into a gold annotation
(bottom). Figure after Kummerfeld and Klein [25]
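
The five operations can be illustrated with a small sketch over clusters of mention identifiers.
It is a simplified reading of the procedure and not the analysis tool of Kummerfeld and Klein
[25]: the function name transform and the operation log are hypothetical, Alter Span is skipped
by assuming that spans are already aligned, and each clustering is assumed to be a partition of
its mentions.

```python
from collections import defaultdict


def transform(predicted, gold):
    """Rewrite predicted clusters into gold clusters and log the operations."""
    gold_of = {m: g for g, entity in enumerate(gold) for m in entity}
    log = []

    # Split: divide each predicted entity by the gold entity of its mentions.
    # Mentions absent from the gold annotation form a leftover part (key None).
    parts = []
    for entity in predicted:
        groups = defaultdict(set)
        for m in entity:
            groups[gold_of.get(m)].add(m)
        if len(groups) > 1:
            log.append(("split", sorted(entity)))
        parts.extend(groups.items())

    # Remove: drop the parts whose mentions are not in any gold entity.
    kept = []
    for gold_id, part in parts:
        if gold_id is None:
            log.append(("remove", sorted(part)))
        else:
            kept.append((gold_id, part))

    # Introduce: add a singleton for every gold mention the system missed.
    predicted_mentions = set().union(*predicted)
    for m in sorted(gold_of):
        if m not in predicted_mentions:
            kept.append((gold_of[m], {m}))
            log.append(("introduce", [m]))

    # Merge: unite all parts that belong to the same gold entity.
    merged = defaultdict(set)
    pieces = defaultdict(int)
    for gold_id, part in kept:
        merged[gold_id] |= part
        pieces[gold_id] += 1
    for gold_id, count in pieces.items():
        if count > 1:
            log.append(("merge", sorted(merged[gold_id])))

    return [merged[g] for g in range(len(gold))], log


predicted = [{"a", "b", "x"}, {"c"}]
gold = [{"a", "c"}, {"b", "d"}]
clusters, operations = transform(predicted, gold)
print([name for name, _ in operations])
```

In this toy input the first predicted entity is split, the mention x is removed, the gold
mention d is introduced, and two merges rebuild the gold entities.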

4.1.2 Mapping

Second, these transformations are mapped to seven types of errors: (i) Span Error, (ii) Missing
Entities, (iii) Extra Entities, (iv) Missing Mentions, (v) Extra Mentions, (vi) Divided
Entities, and (vii) Conflated Entities. We discuss each error type in the Discussion on Error
Analysis.

4.2 Discussion on Error Analysis

In this section, we compare the results of the BERT-base and SpanBERT-base Coreference
Resolution systems. We consider seven categories of errors adapted from the work of Kummerfeld
and Klein [25], who evaluated earlier coreference models on the CoNLL-2011 shared task data
[37]. That shared task reported the performance of many CR systems of the time. Our models use
the CoNLL-2012 data [36]. The English portion of the CoNLL-2012 shared task data [36] has 1.3M
words, compared to 1M words in the CoNLL-2011 shared task data [37]. Earlier work by Björkelund
shows that adding the 160K words of evaluation data to the training data failed to improve
model performance [36] compared to models trained on the training data only. It would be unfair
to compare our models directly with the earlier models reported in their analysis [25].
However, the comparison can give some clue about how current systems behave.

                Mention                             Span     Conflated  Extra     Extra     Divided   Missing   Missing
System          Detection  MUC    B3     CEAFφ4     Errors   Entities   Mentions  Entities  Entities  Mentions  Entities
BERT-base       85.45      80.96  71.07  67.88      256      1103       507       406       1286      735       813
SpanBERT-base   86.85      82.91  73.96  71.32      272      1048       653       522       1086      589       558
BERKELEY [25]   75.57      66.43  66.17  NA         392      1694       923       833       1981      899       801

Table 4.1: Counts for each error type for BERT-base and SpanBERT-base on the English test set of
the 2012 CoNLL shared task and the best performing model BERKELEY on the 2011 CoNLL shared
task reported by Kummerfeld and Klein [25]

Error          System                         Gold
Extra text     Judy Miller as a journalist    Judy Miller
Missing text   them                           them all

Table 4.2: Examples of Span errors with Extra text and Missing text

4.2.1 Span Errors

Span errors occur due to missing or extra text in spans. Missing text is present in the gold
mention but absent from the system-predicted mention, whereas extra text is present in the
system-predicted span but absent from the gold span.
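
The distinction can be made concrete with token positions. The sketch below assumes that spans
are given as (start, end) token indices with an inclusive end; the function name span_difference
is hypothetical, and the example inputs are the two cases of Table 4.2.

```python
def span_difference(system_span, gold_span, tokens):
    """Return (extra, missing) token lists for two overlapping spans."""
    system_idx = set(range(system_span[0], system_span[1] + 1))
    gold_idx = set(range(gold_span[0], gold_span[1] + 1))
    extra = [tokens[i] for i in sorted(system_idx - gold_idx)]
    missing = [tokens[i] for i in sorted(gold_idx - system_idx)]
    return extra, missing


# Extra text: the system span is longer than the gold span
tokens = "Judy Miller as a journalist".split()
print(span_difference((0, 4), (0, 1), tokens))  # (['as', 'a', 'journalist'], [])

# Missing text: the system span is shorter than the gold span
tokens = "them all".split()
print(span_difference((0, 0), (0, 1), tokens))  # ([], ['all'])
```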

BERT-base

SpanBERT-base
Extra Missing Extra Missing
4
5
3
0
2
4

6
120
3
4
16
31

Type
NP
POS
.
SBAR
PP
DT

4
121
1
6
16
30

1
5
7
0
0
3

Table 4.3: Counts of Span Errors grouped by the labels (NP: Noun Phrase, POS: Possessive Ending
(e.g. people’s, government’s), .: Punctuation, SBAR: Subordinate clause, PP: Prepositional
phrase, DT: Determiner) over the extra/missing part of the mention.

Table 4.3 lists the parse nodes in the gold parse that cover exactly the missing or extra text.
In both models, Span Errors involve Extra text more often than Missing text. The POS
(possessive ending) node shows the largest difference between Missing and Extra counts for both
models, although these errors seem superficial. Span errors can be minimized by reducing
parsing-related errors [25]. Removing the parsing issue completely is challenging because of
inconsistencies in the annotation of the dataset.

4.2.2 Extra Mentions and Missing Mentions

Table 4.4 illustrates the opposite behavior of Extra Mentions and Missing Mentions errors. An
Extra Mentions error occurs when a predicted entity contains mentions that are not part of the
corresponding gold entity. A Missing Mentions error occurs when the predicted entity lacks
mentions that the gold entity contains.
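
For a predicted entity that has already been paired with its gold entity, both errors reduce to
set differences. The sketch below assumes that this pairing is given; the function name
mention_errors is hypothetical, and the inputs follow the examples of Table 4.4.

```python
def mention_errors(predicted_entity, gold_entity):
    """Return (extra_mentions, missing_mentions) for one entity pair."""
    predicted, gold = set(predicted_entity), set(gold_entity)
    return sorted(predicted - gold), sorted(gold - predicted)


# Extra Mention example: the system adds we, us and our to the entity
print(mention_errors(
    {"Focus Today", "we", "us", "our", "our program"},
    {"Focus Today", "our program"},
))  # (['our', 'us', 'we'], [])

# Missing Mention example: the system misses the mention this
print(mention_errors({"this SMS", "it"}, {"this SMS", "it", "this"}))  # ([], ['this'])
```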

Error             System         Gold
Extra Mention     Focus Today    Focus Today
                  we             -
                  us             -
                  our            -
                  our program    our program
Missing Mention   this SMS       this SMS
                  it             it
                  -              this

Table 4.4: Examples of Extra Mentions and Missing Mentions error.

                BERT-base          SpanBERT-base
Mention         Extra   Missing    Extra   Missing    Count
Proper Name     122     171        125     149
Nominal         229     334        290     288
Pronoun         156     230        238     152
it              32      26         46      24         1318
you             16      54         51      25         1273
we              17      48         29      25         754
us              5       6          6       2          265
that            6       7          7       7          2209
they            6       13         7       9          939
their           10      7          10      2          457
this            7       12         11      13         989

Table 4.5: Counts of Missing and Extra Mentions errors by mention type, and some of the common
mentions.

               BERT-base                              SpanBERT-base
               Nominal            Proper Name         Nominal            Proper Name
               Extra   Missing    Extra   Missing     Extra   Missing    Extra   Missing
Text_Match     53      69         60      76          63      61         60      73
Head_Match     133     184        82      123         161     150        88      110
Others         96      150        40      48          128     138        37      39
NER Matches    19      13         66      74          11      12         68      64
NER Differs    0       0          3       3           1       0          2       2
NER Unknown    210     321        53      94          278     276        55      83
Total          229     334        122     171         290     288        125     149

Table 4.6: Counts of Extra and Missing Mentions, grouped by properties of the mention and the
entity it is in

Table 4.5 lists Missing and Extra errors by the type of mention involved. It also lists some of
the most common Missing and Extra mentions. The BERT-base system, with its high precision, has
fewer Extra and more Missing mentions, whereas the SpanBERT-base system, with its high recall,
has more Extra and fewer Missing mentions. The mentions you and we occur most frequently in
Missing errors for BERT-base and in Extra errors for SpanBERT-base. We observe that Missing
mentions are penalized more heavily in SpanBERT-base than in BERT-base. We group Extra Mentions
and Missing Mentions errors by proper names and nominals in Table 4.6. The first section of the
table reports errors according to whether there is an exact string match or a head match between
the mention with the error and the mentions in the cluster. The second section considers the
named entity annotation of the mention with the error and measures whether the type of the
mention matches the type of the cluster. The two types of errors occur in balanced numbers in
most cases. One exception is observed for nominals with unknown NER in the BERT-base model.
The models included in earlier work [25] showed this exception in the exact string match case
for nominals. We differ from some of the earlier observations. Our models can identify
pleonastic pronouns more effectively than the models reported in [25]. For instances with head
matching, BERT-base performs better on Extra errors, whereas SpanBERT-base performs better on
Missing errors.
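
The first grouping of Table 4.6 can be sketched as a simple comparison between a mention and the
other mentions of its cluster. The sketch is an assumption-laden illustration: the function
name match_type is hypothetical, and the head of a mention is approximated by its last token,
whereas a real analysis would take the head from a syntactic parse.

```python
def match_type(mention, cluster_mentions):
    """Label a mention as 'text', 'head' or 'other' relative to a cluster."""

    def head(text):
        # Assumption: the last token stands in for the syntactic head
        return text.lower().split()[-1]

    text = mention.lower()
    if any(text == other.lower() for other in cluster_mentions):
        return "text"
    if any(head(text) == head(other) for other in cluster_mentions):
        return "head"
    return "other"


print(match_type("our program", ["Focus Today", "our program"]))  # text
print(match_type("the program", ["Focus Today", "our program"]))  # head
print(match_type("it", ["Focus Today", "our program"]))           # other
```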

4.2.3 Extra Entities and Missing Entities

An entity is the set of all mentions that share the same referent. A Missing Entity is a gold
entity that the system does not predict. An Extra Entity is an entity introduced by the system
that is not a gold entity. These two cases give rise to the Missing Entities and Extra Entities
errors.
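
A minimal sketch of this distinction follows. It assumes that an entity counts as Extra when
none of its mentions occurs in any gold entity, and as Missing when none of its mentions occurs
in any predicted entity; the function name entity_errors is hypothetical, and the inputs follow
the examples of Tables 4.4 and 4.7.

```python
def entity_errors(predicted, gold):
    """Return (extra_entities, missing_entities) for two lists of mention sets."""
    gold_mentions = set().union(*gold)
    predicted_mentions = set().union(*predicted)
    # Extra: a predicted entity that shares no mention with the gold annotation
    extra = [entity for entity in predicted if not (entity & gold_mentions)]
    # Missing: a gold entity that shares no mention with the prediction
    missing = [entity for entity in gold if not (entity & predicted_mentions)]
    return extra, missing


predicted = [{"Focus Today", "our program"}, {"Dear viewers", "dear viewers"}]
gold = [{"Focus Today", "our program"}, {"everyone", "you"}]
extra, missing = entity_errors(predicted, gold)
print(extra)
print(missing)
```

Entities that overlap only partially with the other annotation are left to the mention-level
and entity-level errors discussed in the other sections.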

Error            System          Gold
Extra Entity     Dear viewers    -
                 dear viewers    -
Missing Entity   -               everyone
                 -               you

Table 4.7: Examples of Extra Entities and Missing Entities error.

Composition                   BERT-base          SpanBERT-base
Name    Nominals   Pronoun    Extra   Missing    Extra   Missing
0       1          1          84      212        101     156
1       0          1          8       15         11      10
1       1          0          23      58         26      45
2       0          0          39      56         46      35
0       2          0          153     302        216     215
0       0          2          38      16         43      9
3+      0          0          3       8          4       5
0       3+         0          20      48         43      29
0       0          3+         17      22         15      9
Others                        21      76         17      45
Total                         406     813        522     558

Table 4.8: Counts of Extra and Missing Entity errors, grouped by the composition of the entity
(Names, Nominals, Pronouns).

Table 4.8 reports these two errors by the composition (name, nominal, and pronoun) of the
entities. The two errors differ noticeably. Entities containing one nominal and one pronoun
(row 0 1 1) have more Missing errors than Extra errors. Entities with two pronouns (row 0 0 2)
behave in the opposite way, with more Extra errors than Missing errors. SpanBERT-base has more
Extra errors and fewer Missing errors for entities with three or more pronouns (row 0 0 3+) or
three or more nominals (row 0 3+ 0), whereas BERT-base shows the opposite trend. Entities with
a single type of mention account for 66.50% of Extra Entity errors and 55.59% of Missing Entity
errors in BERT-base, and for 70.30% of Extra Entity errors and 54.12% of Missing Entity errors
in SpanBERT-base. These results show that entities with a single type of mention contribute
the most to these two errors.
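
The single-type shares can be recomputed from the counts in Table 4.8. The sketch below sums the
six rows in which all mentions are of one type and divides by the column totals; the variable
names are hypothetical.

```python
# Counts from Table 4.8 for the single-type rows
# (2 0 0), (0 2 0), (0 0 2), (3+ 0 0), (0 3+ 0) and (0 0 3+)
single_type = {
    ("BERT-base", "Extra"): [39, 153, 38, 3, 20, 17],
    ("BERT-base", "Missing"): [56, 302, 16, 8, 48, 22],
    ("SpanBERT-base", "Extra"): [46, 216, 43, 4, 43, 15],
    ("SpanBERT-base", "Missing"): [35, 215, 9, 5, 29, 9],
}
# Column totals from the last row of Table 4.8
totals = {
    ("BERT-base", "Extra"): 406,
    ("BERT-base", "Missing"): 813,
    ("SpanBERT-base", "Extra"): 522,
    ("SpanBERT-base", "Missing"): 558,
}

shares = {
    key: 100.0 * sum(counts) / totals[key]
    for key, counts in single_type.items()
}
for (system, error), share in shares.items():
    print(f"{system:14s} {error:8s} {share:.2f}%")
```

The computed shares agree with the percentages quoted above to within 0.01 percentage points;
the small differences come from rounding versus truncation of the last digit.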

Table 4.9 presents the entity errors with a single type of mention, categorized into three
groups: Exact, Head, and Non. Nominals account for the most occurrences and the most variation
across these categories for both SpanBERT-base and BERT-base. Head match constitutes about half
of the nominals in the Extra column as well as in the Missing column. The high share of Head
matches suggests that these neural models are good at finding the head of a mention. Table 4.8
shows that entities containing a pronoun and a nominal are the second most frequent error case.
Table 4.10 lists the most frequent pronouns for errors involving a pronoun and a name or
nominal.

We can consider pronouns as a reference to interpret these errors. A pronoun can be an Extra
mention, predicted incorrectly as coreferent, or a Missing mention, predicted incorrectly as
non-coreferent. Table 4.5 shows that these errors are biased towards Missing in BERT-base and
towards Extra in SpanBERT-base. However, the distribution over individual pronouns tells a
different story. For example, that is balanced between the two errors in Table 4.5, whereas in
Table 4.10 it is biased towards the Missing Entity error. Entities that have either a nominal
or a nominal with a pronoun dominate the Extra Entity and Missing Entity errors. We find that
head matching is quite misleading in these cases; Kummerfeld and Klein [25] reported string
match as misleading. This suggests using semantics, context, and discourse to reduce these two
errors.

                            BERT-base          SpanBERT-base
Match Type   Mention        Extra   Missing    Extra   Missing
Exact        Proper Name    22      26         26      15
             Nominal        67      71         91      49
             Pronoun        36      13         41      8
Head         Proper Name    32      46         40      25
             Nominal        127     188        173     130
             Pronoun        36      13         41      8
Non          Proper Name    10      18         10      15
             Nominal        46      162        86      114
             Pronoun        19      25         17      10

Table 4.9: Counts of Extra and Missing Entity errors grouped by properties of the mentions in the
entity.

               BERT-base          SpanBERT-base
Mention        Extra   Missing    Extra   Missing
that           27      72         31      59
it             23      59         33      40
this           14      35         16      27
they           6       12         7       4
their          2       8          2       4
them           1       7          1       6
Any pronoun    92      227        112     164

Table 4.10: Counts of common Missing and Extra Entity errors where the entity has just two
mentions: a pronoun and either a nominal or a proper name.

4.2.4 Conﬂated Entities and Divided Entities

Table 4.11 gives examples of the two errors. A Conflated Entity error occurs when mentions from
separate gold entities are predicted as a single entity. A Divided Entity error occurs when
mentions of one gold entity are predicted as separate entities.
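
Both definitions can be checked directly on two clusterings. The sketch below is an
illustration with a hypothetical function name, conflated_and_divided; it flags a predicted
entity as conflated when its mentions come from more than one gold entity, and a gold entity as
divided when its mentions are spread over more than one predicted entity. The inputs follow the
examples of Table 4.11.

```python
def conflated_and_divided(predicted, gold):
    """Return (conflated predicted entities, divided gold entities)."""
    gold_of = {m: g for g, entity in enumerate(gold) for m in entity}
    pred_of = {m: p for p, entity in enumerate(predicted) for m in entity}
    conflated = [
        entity for entity in predicted
        if len({gold_of[m] for m in entity if m in gold_of}) > 1
    ]
    divided = [
        entity for entity in gold
        if len({pred_of[m] for m in entity if m in pred_of}) > 1
    ]
    return conflated, divided


# Conflated Entity example: one predicted entity covers two gold entities
predicted = [{"the anti phased motion", "this", "it", "they"}]
gold = [{"the anti phased motion", "they"}, {"this", "it"}]
print(conflated_and_divided(predicted, gold))

# Divided Entity example: one gold entity is predicted as two entities
predicted = [{"the two of you", "both of you"},
             {"the two honorable guests", "two honorable guests"}]
gold = [{"the two of you", "both of you",
         "the two honorable guests", "two honorable guests"}]
print(conflated_and_divided(predicted, gold))
```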

Error              System                          Gold
Conflated Entity   the anti phased motion (1)      the anti phased motion (1)
                   this (1)                        this (2)
                   it (1)                          it (2)
                   they (1)                        they (1)
Divided Entity     the two of you (1)              the two of you (1)
                   the two honorable guests (2)    the two honorable guests (1)
                   both of you (1)                 both of you (1)
                   two honorable guests (2)        two honorable guests (1)

Table 4.11: Examples of Conflated Error and Divided error. The number in parentheses is the
index of the entity that contains the mention.

Incorrect Part               Rest entity                  BERT-base              SpanBERT-base
Name   Nominal   Pronoun     Name   Nominal   Pronoun     Conflated   Divided    Conflated   Divided
0      0         1+          0      0         1+          89          72         70          43
0      0         1+          0      1+        1+          110         105        112         87
0      0         1+          0      1+        0           181         231        182         228
0      1+        0           0      1+        0           128         138        145         156
0      0         1+          1+     1+        1+          74          98         69          80
0      0         1+          1+     0         1+          67          111        67          71
0      0         1+          1+     0         0           89          61         27          50
Others                                                    365         470        376         371
Total                                                     1103        1286       1048        1086

Table 4.12: Counts of Conflated and Divided entities errors grouped by the Name/Nominal/Pronoun
composition of the parts involved.

Table 4.12 lists Conflated Entities and Divided Entities errors by the composition of the part
that is split or merged and of the rest of the entity. The entries 1+ and 0 give the count of
each type of mention. Misplacement of pronouns constitutes the largest portion of these
errors: the most common errors involve parts with just pronouns. The problem becomes harder
when the remaining part of the entity has no proper name. Systems may have conflated the
pronouns of two entities, which creates the core issue of entities that consist entirely of
pronouns.

Aggregating the instances in Table 4.12 in which the incorrect part contains a single pronoun,
these account for 42.33% and 41.60% of conflated cases for BERT-base and SpanBERT-base, and for
39.81% and 41.25% of divided cases for BERT-base and SpanBERT-base. There is a good chance that
a part is both conflated with a wrong entity and divided from its true entity. A Pronoun link
error occurs when a pronoun is placed in the wrong entity. Table 4.12 shows that Pronoun link
errors are very common among Conflated Entities and Divided Entities.

CHAPTER 5

CONCLUSION AND FUTURE WORK

In this thesis, we evaluate the performance of two end-to-end Coreference Resolution (CR)
systems, BERT-base [24] and SpanBERT-base [23], on the CoNLL-2012 Shared Task data [36]. We
report their performance as unweighted average F1 scores: 73.30 for BERT-base (higher
precision) and 76.06 for SpanBERT-base (higher recall). We further investigate the error
distributions of both systems based on the work of Kummerfeld and Klein [25]. We observe that
neither model outperforms the other on all error types. BERT-base has more errors for Missing
Mentions, Missing Entities, Divided Entities, and Conflated Entities (Table 4.1). SpanBERT-base
has more errors for Span Errors, Extra Mentions, and Extra Entities. We observe that
SpanBERT-base handles recall-related issues better than BERT-base and the systems reported by
Kummerfeld and Klein [25].

Considering the patterns in Span errors, it seems that a better parsing method can help
minimize this error. We find that nominals contribute the most Missing and Extra Mentions
errors. The nominals in the dataset also have nested annotation, which could lead to
mismatches. Text match cases show a balanced distribution of Extra and Missing mention errors.
Our analysis suggests that more information needs to be included even when the span heads
match. The composition of entities is crucial for the Extra Entity and Missing Entity errors.
Entities with a single type of mention have the largest share of these errors. Among these
single-type errors, nominals with a head match contribute the most across compositions. We also
find that pronouns contribute a large portion of the errors in Conflated and Divided Entities.
A pronoun grouped into the wrong mention cluster causes a cascaded pronoun linking error.
Accurate linking of pronouns with entities is therefore desirable in this task.

Downstream NLP tasks such as Question Answering or Text Summarization can achieve better
performance by resolving references [34]. The kind of reference resolution required changes
with the needs of the respective task. Our analysis will be helpful in selecting an optimal CR
model given the objective of a downstream NLP task.

BIBLIOGRAPHY


[1] Amit Bagga and Breck Baldwin. Algorithms for scoring coreference chains. In The first
international conference on language resources and evaluation workshop on linguistics
coreference, volume 1, pages 563–566. Citeseer, 1998.

[2] Breck Baldwin. CogNIAC: high precision coreference with limited knowledge and linguistic
resources. In Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted
Texts, 1997.

[3] Eric Bengtson and Dan Roth. Understanding the value of features for coreference resolution. In
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing,
pages 294–303. Association for Computational Linguistics, 2008.

[4] Anders Björkelund and Richárd Farkas. Data-driven multilingual coreference resolution using
resolver stacking. In Joint Conference on EMNLP and CoNLL - Shared Task, pages 49–55.
Association for Computational Linguistics, 2012.

[5] Susan E. Brennan, Marilyn W. Friedman, and Carl J. Pollard. A centering approach to
pronouns. In Proceedings of the 25th Annual Meeting on Association for Computational
Linguistics, page 155–162. Association for Computational Linguistics, 1987.

[6] Kevin Clark and Christopher D. Manning. Entity-centric coreference resolution with model
stacking. In Proceedings of the 53rd Annual Meeting of the Association for Computational
Linguistics and the 7th International Joint Conference on Natural Language Processing
(Volume 1: Long Papers), pages 1405–1415. Association for Computational Linguistics,
2015.

[7] Kevin Clark and Christopher D. Manning. Deep reinforcement learning for mention-ranking
coreference models. In Proceedings of the 2016 Conference on Empirical Methods in Natural
Language Processing, pages 2256–2262. Association for Computational Linguistics, 2016.

[8] Kevin Clark and Christopher D. Manning. Improving coreference resolution by learning entity-
level distributed representations. In Proceedings of the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), pages 643–653. Association for
Computational Linguistics, 2016.

[9] Hal Daumé III and Daniel Marcu. A large-scale exploration of eﬀective global features for
a joint entity detection and tracking model. In Proceedings of Human Language Technology
Conference and Conference on Empirical Methods in Natural Language Processing, pages
97–104. Association for Computational Linguistics, 2005.

[10] Pascal Denis and Jason Baldridge. A ranking approach to pronoun resolution. In IJCAI,
volume 158821593, 2007.

[11] Pascal Denis and Jason Baldridge. Specialized models and ranking for coreference resolution.
In Proceedings of the 2008 conference on empirical methods in natural language processing,
pages 660–669, 2008.

[12] Pascal Denis and Jason Baldridge. Global joint models for coreference resolution and named
entity classification. Procesamiento del lenguaje natural, 42, 2009.

[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training
of deep bidirectional transformers for language understanding. In Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Association for Computational Linguistics, 2019.

[14] George R Doddington, Alexis Mitchell, Mark A Przybocki, Lance A Ramshaw, Stephanie M
Strassel, and Ralph M Weischedel. The automatic content extraction (ace) program-tasks,
data, and evaluation. In Lrec, volume 2, pages 837–840. Lisbon, 2004.

[15] Greg Durrett and Dan Klein. Easy victories and uphill battles in coreference resolution. In
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,
pages 1971–1982. Association for Computational Linguistics, 2013.

[16] Barbara J. Grosz, Scott Weinstein, and Aravind K. Joshi. Centering: A framework for
modeling the local coherence of discourse. Comput. Linguist., 21(2):203–225, 1995.

[17] Aria Haghighi and Dan Klein. Simple coreference resolution with rich syntactic and semantic
features. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language
Processing, pages 1152–1161, 2009.

[18] Aria Haghighi and Dan Klein. Coreference resolution in a modular, entity-centered model. In
Human Language Technologies: The 2010 Annual Conference of the North American Chapter
of the Association for Computational Linguistics, pages 385–393, 2010.

[19] Sanda Harabagiu and Steven J Maiorano. Knowledge-lean coreference resolution and its
relation to textual cohesion and coherence. In The Relation of Discourse/Dialogue Structure
and Reference, 1999.

[20] Sanda M. Harabagiu, Razvan C. Bunescu, and Steven J. Maiorano. Text and knowledge
mining for coreference resolution. In Second Meeting of the North American Chapter of the
Association for Computational Linguistics, 2001.

[21] Jerry R. Hobbs. Resolving pronoun references. Lingua, 44(4):311 – 338, 1978.

[22] Gordana Ilić Holen. Critical reﬂections on evaluation practices in coreference resolution.
In Proceedings of the 2013 NAACL HLT Student Research Workshop, pages 1–7, Atlanta,
Georgia, 2013. Association for Computational Linguistics.

[23] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy.
SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the
Association for Computational Linguistics, 8:64–77, 2020.

[24] Mandar Joshi, Omer Levy, Luke Zettlemoyer, and Daniel Weld. BERT for coreference resolu-
tion: Baselines and analysis. In Proceedings of the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pages 5802–5807. Association for Computational Linguistics,
2019.

[25] Jonathan K. Kummerfeld and Dan Klein. Error-driven analysis of challenges in coreference
resolution. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language
Processing, pages 265–277, Seattle, Washington, USA, 2013. Association for Computational
Linguistics.

[26] Shalom Lappin and Herbert J. Leass. An algorithm for pronominal anaphora resolution.
Computational Linguistics, 20(4):535–561, 1994.

[27] Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and
Dan Jurafsky. Deterministic coreference resolution based on entity-centric, precision-ranked
rules. Computational linguistics, 39(4):885–916, 2013.

[28] Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. End-to-end neural coreference
resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language
Processing, pages 188–197. Association for Computational Linguistics, 2017.

[29] Kenton Lee, Luheng He, and Luke Zettlemoyer. Higher-order coreference resolution with
coarse-to-fine inference. In Proceedings of the 2018 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies,
Volume 2 (Short Papers), pages 687–692. Association for Computational Linguistics, 2018.

[30] Tyne Liang and Dian-Song Wu. Automatic pronominal anaphora resolution in english texts. In
International Journal of Computational Linguistics & Chinese Language Processing, Volume
9, Number 1, February 2004: Special Issue on Selected Papers from ROCLING XV, pages
21–40, 2004.

[31] Xiaoqiang Luo. On coreference resolution performance metrics. In Proceedings of Human
Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing, pages 25–32, 2005.

[32] Sebastian Martschat and Michael Strube. Latent structures for coreference resolution.
Transactions of the Association for Computational Linguistics, 3:405–418, 2015.

[33] Naﬁse Sadat Moosavi and Michael Strube. Which coreference evaluation metric do you trust?
a proposal for a link-based entity aware metric. In Proceedings of the 54th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), pages 632–642.
Association for Computational Linguistics, 2016.

[34] Thomas S. Morton. Using coreference for question answering. In Coreference and Its
Applications, 1999.

[35] Vincent Ng and Claire Cardie. Improving machine learning approaches to coreference res-
olution. In Proceedings of the 40th Annual Meeting of the Association for Computational
Linguistics, pages 104–111. Association for Computational Linguistics, 2002.

[36] Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang.
CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint
Conference on EMNLP and CoNLL - Shared Task, pages 1–40. Association for Computational
Linguistics, 2012.

[37] Sameer Pradhan, Lance Ramshaw, Mitchell Marcus, Martha Palmer, Ralph Weischedel, and
Nianwen Xue. CoNLL-2011 shared task: Modeling unrestricted coreference in OntoNotes.
In Proceedings of the Fifteenth Conference on Computational Natural Language Learning:
Shared Task, pages 1–27, Portland, Oregon, USA, 2011. Association for Computational
Linguistics.

[38] Karthik Raghunathan, Heeyoung Lee, Sudarshan Rangarajan, Nathanael Chambers, Mihai
Surdeanu, Dan Jurafsky, and Christopher Manning. A multi-pass sieve for coreference reso-
lution. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language
Processing, pages 492–501, 2010.

[39] Altaf Rahman and Vincent Ng. Supervised models for coreference resolution. In Proceedings
of the 2009 conference on empirical methods in natural language processing, pages 968–977,
2009.

[40] Wee Meng Soon, Hwee Tou Ng, and Daniel Chung Yong Lim. A machine learning approach
to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544, 2001.

[41] Veselin Stoyanov, Claire Cardie, Nathan Gilbert, E. Riloff, David J. Buttler, and D. Hysom.
Reconcile: A coreference resolution research platform. 2010.

[42] Veselin Stoyanov, Nathan Gilbert, Claire Cardie, and Ellen Riloﬀ. Conundrums in noun
phrase coreference resolution: Making sense of the state-of-the-art. In Proceedings of the
Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint
Conference on Natural Language Processing of the AFNLP, pages 656–664. Association for
Computational Linguistics, 2009.

[43] Marc Vilain, John D Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. A
model-theoretic coreference scoring scheme. In Sixth Message Understanding Conference
(MUC-6): Proceedings of a Conference Held in Columbia, Maryland, 1995, 1995.

[44] Sam Wiseman, Alexander M. Rush, Stuart Shieber, and Jason Weston. Learning anaphoricity
and antecedent ranking features for coreference resolution. In Proceedings of the 53rd Annual
Meeting of the Association for Computational Linguistics and the 7th International Joint
Conference on Natural Language Processing (Volume 1: Long Papers), pages 1416–1426.
Association for Computational Linguistics, 2015.

[45] Sam Wiseman, Alexander M. Rush, and Stuart M. Shieber. Learning global features for
coreference resolution. In Proceedings of the 2016 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, pages 994–
1004. Association for Computational Linguistics, 2016.

[46] Liyan Xu and Jinho D. Choi. Revealing the myth of higher-order inference in coreference
resolution. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 8527–8533. Association for Computational Linguistics, 2020.

[47] Amir Zeldes and Shuo Zhang. When annotation schemes change rules help: A conﬁgurable
approach to coreference resolution beyond ontonotes. In Proceedings of the Workshop on
Coreference Resolution Beyond OntoNotes (CORBON 2016), pages 92–101, 2016.
