SEMANTIC ROLE LABELING OF IMPLICIT ARGUMENTS FOR NOMINAL PREDICATES

By

Matthew Steven Gerber

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Computer Science

2011

ABSTRACT

Natural language is routinely used to express the occurrence of an event and the existence of entities that participate in the event. The entities involved are not haphazardly related to the event; rather, they play specific roles in the event and relate to each other in systematic ways with respect to the event. This basic semantic scaffolding permits construction of the rich event descriptions encountered in spoken and written language. Semantic role labeling (SRL) is a method of automatically identifying events, their participants, and the relations that hold among them within textual expressions of language. Traditionally, SRL research has focused on the analysis of verbs due to their strong connection with event descriptions. In contrast, this dissertation focuses on emerging topics in noun-based (or nominal) SRL.

One key difference between verbal and nominal SRL is that nominal event descriptions often lack participating entities in the words that immediately surround the predicate (i.e., the word denoting an event). Participants (or arguments) found at longer distances in the text are referred to as implicit. Implicit arguments are relatively uncommon for verbal predicates, which typically require their arguments to appear in the immediate vicinity. In contrast, implicit arguments are quite common for nominal predicates. Previous research has not systematically investigated implicit argumentation, whether for verbal or nominal predicates. This dissertation shows that implicit argumentation presents a significant challenge to nominal SRL systems: after introducing implicit argumentation into the evaluation, the state-of-the-art nominal SRL system presented in this dissertation suffers a performance degradation of more than 8%.

Motivated by these observations, this dissertation focuses specifically on implicit argumentation in nominal SRL. Experiments in this dissertation show that the aforementioned performance degradation can be reduced by a discriminative classifier capable of filtering out nominals whose arguments are implicit. The approach improves performance substantially for many frequent predicates - an encouraging result, but one that leaves much to be desired. In particular, the filter-based nominal SRL system makes no attempt to identify implicit arguments, despite the fact that they exist in nearly all textual discourses.

As a first step toward the goal of identifying implicit arguments, this dissertation presents a manually annotated corpus in which nominal predicates have been linked to implicit arguments within the containing documents. This corpus has a number of unique properties that distinguish it from preexisting resources, few of which address implicit arguments directly. Analysis of this corpus shows that implicit arguments are frequent and often occur within a few sentences of the nominal predicate.

Using the implicit argument corpus, this dissertation develops and evaluates a novel model capable of recovering implicit arguments. The model relies on a variety of information sources that have not been used in prior SRL research.
The relative importance of these information sources is assessed and particularly troubling error types are discussed. This model is an important step forward because it unifies work on traditional verbal and nominal SRL systems. The model extracts semantic structures that cannot be recovered by applying the systems independently.

Building on the implicit argument model, this dissertation then develops a preliminary joint model of implicit arguments. The joint model is motivated by the fact that semantic arguments do not exist independently of each other. The presence of a particular argument can promote or inhibit the presence of another. Argument dependency is modeled by using the TextRunner information extraction system to gather general-purpose knowledge from millions of Internet webpages. Results for the joint model are mixed; however, a number of interesting insights are drawn from the study.

ACKNOWLEDGMENTS

I would like to thank my advisor, Dr. Joyce Chai, who has provided many years of unwavering intellectual and moral support. Joyce has worked tirelessly to keep me on track and has done a wonderful job of separating the promising research directions from the far-fetched ones that I occasionally put forward. I am fortunate to have had her as my primary collaborator.

I am also grateful for the many friendships I formed in East Lansing over the years. To my lab mates, thanks for making EB3315 a fun and (usually) productive place. Our many intense but ultimately frivolous discussions were always a welcome diversion. To everyone else, including those who have long since graduated and moved away, know that I will remember all of the poker nights and other gatherings with particular fondness.

Of course, none of this work would have happened without the lifelong love and incredible support of my parents Randy and Dee Ann, who always remained curious to learn about my research. And finally, to my wife Amanda: thank you for putting up with me for the last nine months as I have worked on this dissertation. You deserve an award. Thank you for reminding me to eat. Thank you for showing me that there is more to life than research. Thank you.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
2 Nominal semantic role labeling
 2.1 Introduction
 2.2 Related work
  2.2.1 Rule-based SRL
  2.2.2 Annotated corpora for SRL
  2.2.3 Statistical SRL
 2.3 Nominal SRL model
  2.3.1 Model formulation
  2.3.2 Model features
  2.3.3 Post-processing
 2.4 Evaluation
 2.5 Discussion
 2.6 Conclusions
3 Predicates that lack arguments: the problem of implicit argumentation
 3.1 Introduction
 3.2 Empirical analysis
 3.3 Related work
 3.4 Argument-bearing predicate model
 3.5 Evaluation
  3.5.1 Predicate evaluation
  3.5.2 Combined predicate-argument evaluation
  3.5.3 NomLex-based analysis of results
  3.5.4 Analysis of end-to-end nominal SRL speed
 3.6 Conclusions
4 Identifying implicit arguments
 4.1 Introduction
 4.2 Related work
  4.2.1 Discourse comprehension in cognitive science
  4.2.2 Automatic relation discovery
  4.2.3 Coreference resolution and discourse processing
  4.2.4 Identifying implicit arguments
 4.3 Empirical analysis
  4.3.1 Data annotation
  4.3.2 Annotation analysis
 4.4 Implicit argument model
  4.4.1 Model formulation
  4.4.2 Model features
  4.4.3 Post-processing for final output selection
  4.4.4 Computational complexity
 4.5 Evaluation
 4.6 Discussion
  4.6.1 Feature assessment
  4.6.2 Error analysis
  4.6.3 The investment and fund predicates
  4.6.4 Improvements versus the baseline
  4.6.5 Comparison with previous results
 4.7 Conclusions
5 An exploration of TextRunner for joint implicit argument identification
 5.1 Introduction
 5.2 Related work
 5.3 Joint model formulation
 5.4 Joint model features based on TextRunner
 5.5 Evaluation
 5.6 Discussion
  5.6.1 Example improvement versus local model
  5.6.2 Test collection size
  5.6.3 Toward a generally applicable joint model
 5.7 Conclusions
6 Summary of contributions and future work
 6.1 Summary of contributions
  6.1.1 A nominal SRL system for real-world use
  6.1.2 A focused, data-driven analysis of implicit arguments
  6.1.3 A novel model for implicit argument identification
 6.2 Summary of future work
APPENDIX
 A.1 Support verb identification
 A.2 Nominal predicate features
 A.3 Nominal argument features
 A.4 Role sets for the annotated predicates
 A.5 Implicit argument features
 A.6 Per-fold results for implicit argument identification
 A.7 Examples of implicit argument identification
 A.8 Forward floating feature subset selection algorithm
REFERENCES

LIST OF TABLES

2.1 Distribution of annotated NomBank arguments
2.2 Nominal argument results
3.1 Degradation of standard nominal SRL system in the all-token evaluation
3.2 Nominal predicate identification results
3.3 Combined predicate-argument identification results
3.4 Predicate and combined predicate-argument classification F1 scores for deverbal, deverbal-like, and other nominal predicates in the all-token evaluation
3.5 Empirical speed performance of the nominal SRL system
4.1 Implicit argument annotation data analysis
4.2 Targeted PMI scores between argument positions
4.3 Coreference probabilities between argument positions
4.4 Overall evaluation results for implicit argument identification
4.5 Implicit argument identification error analysis
5.1 Joint implicit argument identification evaluation results
A.1 Support verb features
A.2 Nominal predicate features
A.3 Nominal argument features
A.4 Implicit argument features
A.5 Per-fold results for implicit argument identification

LIST OF FIGURES

2.1 Position of an SRL system with respect to target applications
2.2 Split argument syntax
2.3 NomLex distribution of predicate instances
2.4 Predicate syntactic context
2.5 Global constraint violations for nominal arguments
3.1 Markability distribution of nominal predicates
3.2 Context-free grammar rules for nominal predicate classification
3.3 Nominal predicate identification with respect to the markability distribution
3.4 All-token argument classification results with respect to the markability distribution
3.5 Nominal predicate identification with respect to the NomLex classes
3.6 End-to-end nominal SRL architecture
3.7 Box plot of nominal SRL speed
4.1 Location of implicit arguments in the discourse
5.1 Effect of depth on WordNet synset similarity

CHAPTER 1

Introduction

Automatic textual analysis has successfully dealt with many aspects of the information explosion that has taken place over the last few decades. One of the most prominent types of textual analysis is what I will be referring to as non-semantic analysis. This type of analysis often manifests itself in the ubiquitous bag-of-words (and related) models of natural language. Such models are not typically concerned with the underlying meaning of text, as evidenced by their use of stemming, stop word removal, and a host of other techniques that trade semantic information for improved statistical information. While immensely successful when dealing with large objects of interest (e.g., web pages, blog entries, PDFs, etc.), these approaches do not perform well when the information need is described by a question or imperative, or when the information need is a small object such as a concise answer or the identification of an entity-entity relationship. More generally, non-semantic analyses perform relatively poorly in situations that require semantic understanding and inference.

In an effort to fill this semantic gap, the research community has proposed a wide range of resources and methods. Some of these address the semantic properties of individual words, whereas others target the semantic properties of entire sentences or discourses. Below, I give examples at a few points along this spectrum. As one moves down the list, larger units of text are analyzed in greater semantic detail.

Words and phrases

• Word sense disambiguation (Joshi et al., 2006): determining whether the word “bank” in a particular context refers to a mound of earth or a financial institution.

• Named entity identification (Bikel et al., 1999): identifying particular classes of entities, for example, people or countries.

• Lexical semantics (Fellbaum, 1998): identifying semantic relationships between concepts, for example, the fact that all humans are mammals.

Short-distance relationships

• Relations between nominals (Girju et al., 2007): identifying a product-producer relation between the words in a phrase such as “honey bee”.

• Temporal relations (Verhagen et al., 2007): identifying the relationship between a temporal expression (e.g., “yesterday”) and the events mentioned in a sentence.

Shallow sentential meaning

• Event extraction (ACE, 2007): identification of a few specific event types.

• Semantic role labeling (Surdeanu et al., 2008): identifying a large number of event types and their participants within a sentence.

Deep sentential meaning

• First-order semantics (Bos, 2005; Mooney, 2007): transforming natural language text into a first-order logic representation.
Tasks near the bottom of this list are generally more difficult because they require detailed analyses of large text fragments; however, these analyses provide a more complete semantic picture of natural language expressions. This dissertation focuses on a specific type of shallow sentential meaning called semantic role labeling (SRL). In the SRL paradigm, a predicate word (typically denoting an event) is bound to various entities in the surrounding text by means of relationships that describe the entities’ roles in the event. Consider the following example:

(1.1) [Sender John] [Predicate shipped] [Thing shipped a package] [Source from Michigan] [Destination to California].

In this example, John is purposely acting to ship a package from its source location (Michigan) to its destination location (California). The goal of automatic SRL is to identify the predicates and role-filling constituents that together provide a basic understanding of a sentence’s event structure. Traditionally, automatic SRL research has focused on verb predicates due to their strong connection with event descriptions; however, recent years have witnessed an increased emphasis on SRL for nouns, which frequently denote events and are also amenable to role analysis. Although they are related, nominal and verbal SRL exhibit important differences that must be taken into account. One key difference is that nominal SRL structures often lack argument fillers that would normally be required for the corresponding verbal SRL structure. Consider the following variant of Example 1.1:

(1.2) [Sender John] made a [Predicate shipment] [Source from Michigan] [Destination to California].

The nominal predicate in Example 1.2 does not require the Thing shipped to be overtly expressed in the sentence, whereas the verbal predicate in Example 1.1 does (it is ungrammatical otherwise). When the Thing shipped (or any other argument) is expressed elsewhere in the discourse, we have an instance of implicit argumentation. In general, implicit argumentation is extremely common for nominal predicates, but the research community has paid very little attention to it.

This dissertation focuses specifically on the issue of implicit argumentation in nominal SRL. I begin by developing a nominal SRL system capable of producing analyses similar to Example 1.2. When given a predicate known to take arguments in the current sentence, this system is able to recover the arguments with an F-measure (β = 1) score of 75.7%. This is a state-of-the-art result for the task; however, the practice of supplying a system with an argument-bearing nominal has serious implications due to implicit argumentation. When evaluated over all predicates (including those whose arguments are entirely implicit), the same system achieves an argument F1 score of only 69.3%.

In an attempt to address the issue of implicit argumentation, I develop a model that is able to accurately (F1 = 87.6%) identify nominal predicates with explicit arguments, effectively filtering out predicates whose arguments are implicit. This model pushes the argument F1 score to 71.1% for all nominal predicates. The model more than doubles the argument identification performance for particular groups of frequent nominal predicates. These are encouraging results, but they leave much to be desired. In particular, the nominal SRL system does not attempt to recover implicit arguments, which are often present in the surrounding discourse.
Motivated by the results described above, I investigate the automatic identification of implicit arguments using information from a predicate’s sentence and surrounding discourse. This represents a dramatic departure from traditional SRL approaches, which, for a given predicate, do not look past sentence boundaries for argument fillers. I base my investigation on a corpus of manually annotated implicit arguments. This corpus is one of the first of its kind and has been made freely available for research purposes. Using this corpus, I show that implicit arguments constitute a significant portion of the semantic structure of a document; they are frequent, often located within a few sentences of their respective predicates, and they provide information that cannot be recovered using standard verbal and nominal SRL techniques.

Given their importance, it is interesting to note that very little attention has been paid to the automatic recovery of implicit arguments. I address this issue by developing a model that is capable of identifying implicit arguments across sentence boundaries. Whereas traditional SRL models have relied primarily on syntactic information, the implicit argument model relies primarily on semantic information. This information comes from a variety of sources, many of which have not previously been explored. Overall, the implicit argument model achieves an F1 score of approximately 50%. This result represents the current state-of-the-art, since the task of implicit argument identification is a new one within the field.

The implicit argument model described above simplifies the modeling task by assuming that implicit arguments are independent of each other. Each candidate is classified independently of the other candidates, and a heuristic post-processing procedure is applied to arrive at the final configuration. I present a preliminary investigation of this assumption in which implicit arguments are identified in a joint fashion. The model relies, in part, on knowledge extracted from millions of Internet webpages. This knowledge serves to identify likely joint occurrences of implicit arguments. Evaluation results for this model are mixed; however, they suggest a variety of interesting future directions.

This dissertation is organized as follows. In Chapter 2, I review the theoretical status of semantic roles as well as previous SRL research. In the same chapter, I present the basic nominal SRL system mentioned above. Chapter 3 begins by introducing implicit argumentation in more detail and assessing its implications for the basic nominal SRL system. The chapter then provides a detailed description and evaluation of the nominal filtering model. Chapter 4 begins with an empirical analysis of nominal event structure, which is found to be largely implicit. As part of this analysis, I describe in detail the implicit argument annotation effort I conducted. The chapter then presents and evaluates the model for implicit argument identification. Chapter 5 presents the exploration of joint implicit argument modeling. I conclude, in Chapter 6, with a summary of contributions and future work.

CHAPTER 2

Nominal semantic role labeling

2.1 Introduction

The notion of semantic role (variously referred to as thematic relation, thematic role, and theta role) has enjoyed a long and occasionally contentious history within linguistics.
Gruber (1965), in an analysis of motion verbs, observed that certain semantic properties apply to the entity undergoing motion, regardless of that entity’s surface syntactic position. For example, consider the following alternations of the verb throw:

(2.1) John threw a ball to Mary.

(2.2) John threw Mary a ball.

(2.3) A ball was thrown to Mary by John.

(2.4) A ball was thrown by John to Mary.

(2.5) To Mary was thrown a ball by John.

...

In all cases, a ball is the entity undergoing motion; however, this entity fills the syntactic object position in Example 2.1 and the syntactic subject position in Example 2.3. Gruber introduced the term Theme to denote objects that passively undergo such actions, and made similar generalizations for other event participants. For example, John fills the role of Agent in the examples because he is the intentional causer of the event. Mary, to whom John is throwing the ball, fills the role of Recipient in the examples. These roles reflect underlying semantic properties of the entities within the context of the throw event. Semantic roles, with their power to generalize over numerous syntactic constructions, have been of great interest to a variety of researchers in fields from linguistics to philosophy to natural language processing (NLP).

However, as alluded to above, semantic roles are not uncontroversial. A wide-ranging debate has raised questions about the composition and requisite number of semantic roles. The case grammar of Fillmore (1968) and the frame-based semantics of Fillmore (1976) each posit a large number of specific semantic roles. Following these theories, entities in text are assigned a specific role based on their relation to the event under consideration. On the other end of the spectrum, Dowty (1991) posits only two roles: proto-agent and proto-patient. These two proto-roles are composed of many different “contributing properties” that entities assigned to them should possess. For example, an entity assigned to the proto-agent role should be volitional and sentient, whereas an entity assigned to the proto-patient role should involuntarily undergo a change of state. Constituents are assigned to these roles in a graded fashion depending on how many of the relevant properties they possess. These properties, though, are no more agreed upon than the various semantic roles mentioned above, so the controversy would seem far from being resolved.

Despite a lack of consensus on finer points, semantic roles have much to offer automatic natural language understanding systems. As demonstrated by Examples 2.1-2.5, semantic roles generalize over the myriad ways in which an event may be described. Thus, because events play a central role in everyday language use, the automatic identification of semantic roles should prove helpful in many NLP tasks. In general, the task of automatic semantic role labeling (SRL) is defined as follows:

Automatic SRL task: Given some unstructured text, identify the events and the fillers of the events’ semantic roles.1

1 In this dissertation, I will use the terms event and predicate interchangeably. I will do the same for semantic role and argument. Thus, I will also use predicate-argument identification to refer to the SRL task.

Figure 2.1 shows the position of an automatic SRL system with respect to unstructured text (the input) and target applications that make use of structured information (the SRL output). The figure includes three target applications to which SRL has been successfully
applied. Below, I describe these applications and the corresponding role of SRL.

Figure 2.1: Position of an SRL system with respect to target applications. Unstructured information flows in from the top. The SRL system identifies structure within this information, which is consumed by target applications. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

Automatic question answering (QA) is the task of providing a precise answer to a user’s question. Assume the following question has been issued:

(2.6) Who invented the polio vaccine?

Traditional Internet search engines are not suitable for directly answering 2.6 because they often return ranked lists of documents instead of precise answers. An SRL-based approach, on the other hand, might proceed as follows:

1. Query a large corpus of documents for exact matches to “invented the polio vaccine”.

2. Perform SRL on the returned sentences.

3. Identify and filter the Agent roles for the invent events, returning the single best as the answer to the question.

Configured this way, the system stands a chance of returning the correct answer: “Jonas Salk”. Kaisser and Webber (2007) and Pizzato and Mollá (2008) have shown that automatic QA can benefit from SRL information.

Information extraction (IE) is the task of identifying facts, relations, events, and other types of information within unstructured documents. Recently, there has been a surge of interest in Web-scale IE, where information is extracted from millions of documents. Banko et al. (2007) developed the Open IE (OIE) methodology to extract an open set of semantic relations from text in an unsupervised fashion. The standard OIE approach tends to be a low precision, high recall process. Supervised SRL, on the other hand, tends to be a high precision, low recall process, particularly for out-of-domain data in which previously unseen events are encountered. Banko and Etzioni (2008) showed how methods similar to SRL can be combined with the standard OIE approach, yielding a hybrid system with the advantages of both sub-systems.

Statistical machine translation (SMT) is a classic NLP task in which a system must translate a text from the original source language S to a target language T. A simple approach to this task is to translate each word, possibly reordering it in the target language sentence according to a distortion probability. Recently, researchers have shown that integrating syntactic information into the model can have a positive impact on translation performance (May and Knight, 2007). Following this work, Liu et al. (2010) demonstrated that SRL information can also help improve translation performance. In both cases, the translation system used the additional information to filter out less plausible translation results that contain either uncommon syntactic constructions or uncommon semantic argument arrangements.

For the three tasks described above, researchers found that system performance increased when taking SRL-based information into account. More specifically, these systems used information derived from verbal SRL analyses. Verbal SRL (demonstrated in Examples 2.1-2.5) is based on predicates that take the form of verbs within a sentence.
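Before moving on to nominal predicates, the three-step QA procedure described above can be made concrete with a small sketch. Everything in it is illustrative: search_corpus and label_roles are hypothetical stand-ins for a retrieval engine and an SRL system (they are not the systems cited in this section), and the tiny corpus and token-window extraction heuristic exist only to show how Agent fillers of invent events would be collected and ranked.

# Minimal sketch of the SRL-based QA procedure described above.
# search_corpus and label_roles are hypothetical stand-ins; the corpus
# and the extraction heuristic are fabricated for illustration only.
from collections import Counter
from typing import Dict, List


def search_corpus(query: str) -> List[str]:
    """Step 1: return sentences containing the query phrase (fake corpus)."""
    corpus = [
        "Jonas Salk invented the polio vaccine in 1955.",
        "It is well known that Jonas Salk invented the polio vaccine.",
    ]
    return [s for s in corpus if query in s]


def label_roles(sentence: str) -> Dict[str, str]:
    """Step 2: stand-in for SRL; fabricates a role labeling for 'invented'."""
    tokens = sentence.rstrip(".").split()
    i = tokens.index("invented")
    return {
        "Predicate": "invented",
        "Agent": " ".join(tokens[max(0, i - 2):i]),  # two tokens before the predicate
        "Theme": " ".join(tokens[i + 1:i + 4]),      # three tokens after the predicate
    }


def answer(query: str) -> str:
    """Step 3: return the most frequent Agent filler as the answer."""
    frames = [label_roles(s) for s in search_corpus(query)]
    return Counter(f["Agent"] for f in frames).most_common(1)[0][0]


print(answer("invented the polio vaccine"))  # -> Jonas Salk

A real pipeline would replace the fabricated components with a document retrieval engine and a trained SRL system, but the control flow - retrieve, label, aggregate over role fillers - is the same.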
Historically, semantic roles have been associated with verbs for the simple reason that nearly all verbs have semantic roles. However, other parts of speech are associated with semantic roles in precisely the same manner as verbs. This dissertation focuses on semantic roles associated with predicates in noun form (i.e., nominal predicates). To see the parallel between verbal and nominal SRL, consider the following examples:

(2.7) Freeport-McMoRan Energy Partners will be liquidated and [Theme shares of the new company] [Predicate distributed (verb)] [Destination to the partnership’s unitholders].2

(2.8) Searle will give [Destination pharmacists] [Theme brochures on the use of prescription drugs] for [Predicate distribution (noun)] in their stores.3

2 Borrowed from Kingsbury and Palmer (2003)
3 Borrowed from Meyers (2007a)

Example 2.7 uses a verbal form of distribute, and Example 2.8 uses a nominal form. As expected, the semantic properties of interest (i.e., those related to the distribute event) hold for fillers of the semantic roles regardless of the fillers’ syntactic positions or the parts of speech of their respective predicates. This is a key observation because it suggests that tasks like QA, IE, and SMT might benefit from nominal SRL just as they do from verbal SRL. At least, this might be the case if nominal predicates are also as frequent as verbal predicates. As shown in Section 2.2.2, nominal predicates are on average more frequent per document than verbal predicates.

The remainder of this chapter is structured as follows. In the next section, I review work related to the tasks of verbal and nominal SRL, paying special attention to the latter as it is the focus of this dissertation. Then, in Section 2.3, I present a nominal SRL system inspired by previous work that significantly improves the state-of-the-art, as shown in Section 2.4. This work sets the stage for a more in-depth investigation into nominal SRL, which is taken up in subsequent chapters.

2.2 Related work

As in many other NLP tasks, research in semantic role labeling has progressed from handcrafted rule systems based on human engineering to statistical systems based on supervised and unsupervised machine learning. In this section, I give a brief history of this progression, starting with rule-based systems in Section 2.2.1. I then give an overview of the relevant supervised training corpora in Section 2.2.2, followed by recent statistical SRL work in Section 2.2.3. The nominal SRL model developed in Section 2.3 draws on many of the techniques presented in this section.

2.2.1 Rule-based SRL

As noted above, early models of language semantics typically relied on large compilations of hand-coded processing rules and world knowledge. For example, much of the work done by Hirst (1987) relied on a rule-based syntactic parser and a frame-based knowledge representation similar to the one developed by Fillmore (1976). Hirst used a mapping to link syntactic constituents to their respective frame positions, and the sentence’s semantic representation was built up compositionally. A similar emphasis on hand-coded lexicons and grammars can be found in the work of Pustejovsky (1995) and Copestake and Flickinger (2000), respectively. Early work in identifying nominal argument structure used approaches similar to those discussed above. For example, Dahl et al. (1987), Hull and Gomez (1996), and Meyers et al. (1998) each employ sets of rules that associate syntactic constituents with semantic roles for nominal predicates.
Consider the following example from Dahl et al. (p. 135):

(2.9) Investigation revealed [Instrument metal] [Predicate contamination] in [Theme the filter].

The system created by Dahl et al. used the following rules to identify the contaminating substance (metal) and the contaminated entity (the filter):

1. The Instrument can be the noun preceding the predicate contamination.

2. The Theme can be the object of the prepositional phrase following contamination.

The rules defined above allow the system to properly identify the fillers of semantic roles in Example 2.9. This system was not formally evaluated, but it is reasonable to believe that the rules described above would often be correct when triggered.

The rules in Dahl et al.’s work have advantages and disadvantages that are common to rule-based semantics systems. On one hand, if a precise rule produces a prediction, that prediction is likely to be correct (e.g., the identification of the Instrument and Theme above). Furthermore, the rule sets are explanatorily powerful, as any derivation can be explained in terms of the rules that produced it. However, on the other hand, the systems described above tend to be brittle, particularly when used in novel domains or on genres of text not anticipated by the rule creators. This is the result of the all-or-nothing nature of rule-based syntactic and semantic interpretation. Given the great versatility of language, it should come as no surprise that, in many cases, a limited set of rules fails to apply (i.e., interpret) a natural language utterance. Furthermore, as noted by Copestake and Flickinger (2000), the learning curve for working with and extending some rule-based grammar systems can be prohibitively steep, making it difficult to apply such systems to new domains.

In contrast to hand-coding the behavior of the analyzer as described above, this dissertation develops methods whose behaviors are determined by supervised machine learning. As described in Section 2.3, this approach allows one to identify the optimal system behavior in a flexible, automated fashion while relying on hand-coded information that is less expensive to obtain and more likely to be agreed upon across human annotators. The following section describes a few of these annotated resources, all of which are used in this dissertation.

2.2.2 Annotated corpora for SRL

FrameNet

As mentioned previously, Fillmore (1968) developed a theory of grammar in which syntactic constituents stand in various case relations with their predicates. Example cases include Agent and Instrument, which correspond to the similarly named semantic roles presented earlier. Fillmore’s case theory was later refined by grouping cases into larger units termed frames (Fillmore, 1976). For example, in the Buy frame, one finds a Buyer, Seller, Goods, Money, etc. Each frame is also associated with a number of predicates (e.g., buy, purchase, barter) that instantiate it within a sentence.

FrameNet (Baker et al., 1998) is a machine-readable resource created by identifying and relating Fillmore’s frames and documenting their presence within natural language text. In FrameNet, case roles are called frame elements and predicates are called lexical items. These lexical items can be verbs, nouns, or adjectives.
For example, consider the Transfer frame, shown with a few of its frame elements and lexical items:

The Transfer frame
Donor: the person that begins in possession of the Theme and causes it to be in the possession of the Recipient
Theme: the object that changes ownership
Recipient: the entity that ends up in possession of the Theme
Purpose: the purpose for which the Theme is transferred
Lexical items: transfer.n, transfer.v

FrameNet arranges frames into a network by defining frame-to-frame relationships such as inheritance and causation. For example, consider the Commerce goods-transfer frame, which inherits from the Transfer frame:

The Commerce goods-transfer frame (inherits from the Transfer frame)
Seller (from Transfer.Donor): entity in possession of Goods and exchanging them for Money with a Buyer
Goods (from Transfer.Theme): anything that is exchanged for Money in a transaction
Purpose (from Transfer.Purpose): the purpose for which a Theme is transferred
Buyer (from Transfer.Recipient): entity that wants the Goods and offers Money to a Seller in exchange for them
Money (new in this frame): the thing given in exchange for Goods in a transaction

As shown above, the inheritance relation allows a general frame (e.g., Transfer) to be specialized with a particular semantic interpretation (e.g., the transfer of commercial goods). Where applicable, the inheritance relationship also holds between the frame elements of the related frames. This is indicated above, with Seller inheriting from Donor, Goods inheriting from Theme, Purpose inheriting from Purpose, and Buyer inheriting from Recipient. Each of the inheriting frame elements contains all semantic properties of the inherited frame elements and possibly adds additional semantic properties. The two frames also show that sub-frames may provide additional frame elements (e.g., Money) for the frame specialization. In total, version 1.5 of FrameNet defines 1,019 frames related with 12 different relation types.

Having established a network of frames, the FrameNet annotators manually identified instances of the frames within the British National Corpus.4 Consider the following annotation of the Commerce goods-transfer frame:

(2.10) Four years ago [Buyer I] [Predicate bought] [Goods an old Harmony Sovereign acoustic guitar] [Money for £20] [Seller from an absolute prat].

4 http://www.natcorp.ox.ac.uk

As shown in Example 2.10, not all frame elements are present in each frame annotation. Furthermore, the annotators have only identified frame elements within the sentence containing the predicate. In total, FrameNet contains approximately 150,000 annotated frame instances. As shown below in Section 2.2.3, these annotated examples can be used as supervised learning material for systems that automatically identify frames and frame elements within text.

VerbNet

The work of Kipper et al. (2000) (described more fully by Kipper (2005)) coincided roughly with the development of FrameNet. This work, inspired by the analysis of so-called verb classes by Levin (1993), resulted in a computer-readable lexicon of verb argument specifications called VerbNet. In VerbNet, verbs are collected into classes. The members of each class exhibit the same diathesis alternations, or meaning preserving transformations. An example alternation, the causative-inchoative, is shown below:

(2.11) [Agent John] [Predicate broke] [Theme the window]. (causative)

(2.12) [Theme The window] [Predicate broke]. (inchoative)

With respect to Examples 2.11 and 2.12, Levin and Kipper et al. made the following key observations:
made the following key observations: 1. The causative-inchoative alternation is (mostly) meaning preserving. That is, Example 2.11 has roughly the same semantic interpretation as Example 2.12. This instance 4 http://www.natcorp.ox.ac.uk 15 of the alternation is not completely meaning preserving because the Agent remains unspecified in Example 2.12. 2. Other verbs that are capable of undergoing the causative-inchoative alternation also appear to indicate a change of state (and vice versa). For example, close indicates a change of state and may undergo the alternation, as shown below: (a) [Agent John] [Predicate closed] [Theme the window]. (b) [Theme The window] [Predicate closed]. The verb hit, on the other hand, does not indicate a change of state and thus cannot undergo the causative-inchoative alternation:5 (a) [Agent John] [Predicate hit] [Theme the window]. (b) *[Theme The window] [Predicate hit]. Each VerbNet class defines a set of semantic roles used by verbs in the class. Furthermore, verb classes are arranged into an inheritance tree, such that sub-classes inherit the roles of super-classes (similarly to FrameNet). Currently, version 3.1 of VerbNet groups 5,725 verbs into 438 classes. This resource does not annotate instances of the verbs it contains; however, it is important because it relates semantically similar verbs to each other - a fact that will be used in Chapter 4 when predicate-predicate relations are examined. PropBank To document the different ways in which verbs can express their arguments, Kingsbury and Palmer (2003) annotated semantic role information for all main verbs in the Penn TreeBank (Marcus et al., 1993). The Penn TreeBank is a corpus of English newswire text that has been annotated for syntactic structure by humans. Kingsbury and Palmer’s resource, called Proposition Bank (or PropBank), contains more than 112,000 semantic role analyses for 3,256 distinct verbs. Instead of committing to one of the many competing theories of semantic 5 A prefixed asterisk denotes an unacceptable sentence of English. 16 roles, the creators of PropBank chose a theory-agnostic approach in which each sense of each verb is associated with its own set of roles. Each role set for a verb is contained in a frame file for the verb. To demonstrate, consider the frame for the verbal predicate distribute from the PropBank lexicon: Frame for distribute, role set 1: Arg0 : the entity that is performing the distribution Arg1 : the entity that is distributed Arg2 : the entity to which the distribution is made Next, consider an instance of distribute taken from the PropBank corpus: (2.13) Freeport-McMoRan Energy Partners will be liquidated and [Arg1 shares of the new company] [Predicate distributed] [Arg2 to the partnership’s unitholders]. In Example 2.13, the Theme and Destination from Gruber (1965) have been given the labels Arg1 and Arg2 , respectively. The interpretation of these roles is defined in the role set shown above. Because the interpretation of arguments in PropBank is verb- and sense-specific, there is no guarantee that Arg1 and Arg2 will denote the same semantic properties for other verbs in the lexicon. However, Kingsbury and Palmer (2003) note that, across verbs, Arg0 and Arg1 are very often interpretable as Agent and Theme, respectively. PropBank’s theoryagnosticism regarding semantic roles means that it is compatible with many different theories. 
For example, subsequent studies have demonstrated the feasibility of mapping PropBank argument positions into more traditional theories of semantic roles (see, for example, the PropBank-VerbNet role mapping developed by Yi et al. (2007)).

NomBank

Unlike FrameNet, which focuses primarily on verbal argument structure, and PropBank, which focuses solely on verbal argument structure, the NomBank corpus (Meyers, 2007a) focuses solely on the argument structure of nominals. NomBank inherited the lexicon design and annotation methodology used for the PropBank project. Thus, each nominal predicate is associated with a frame file that lists role sets and argument definitions similar to those given above for the verb distribute. Consider the following instance of the nominal distribution, taken from the NomBank corpus:

(2.14) Searle will give [Arg0 pharmacists] [Arg1 brochures] [Arg1 on the use of prescription drugs] for [Predicate distribution] [Location in their stores].

When possible, the creators of NomBank adapted PropBank frame files for verb-based nominal predicates such as distribution. Thus, in Example 2.14, argument positions Arg0 and Arg1 have the same semantic interpretation as argument positions Arg0 and Arg1 for the PropBank verb distribute. The compatibility between NomBank and PropBank is important because verbal and nominal predicates often interact with each other. Consider the following contrived example:

(2.15) [Arg0 John] failed to make the [Arg1 newspaper] [Predicate delivery].

To arrive at the labeling of the nominal predicate delivery in Example 2.15, the reader relies on his or her knowledge of how fail (verb), make (verb), and delivery (noun) interact in the given context. This interaction is the key to understanding many event descriptions and highlights the importance of the PropBank/NomBank compatibility - the two resources can be seamlessly integrated.

Returning to Example 2.14, notice that two spans of text are bracketed with the Arg1 label. This is an instance of split argumentation, which is also present in PropBank. A split argument is a span of text that constitutes an argument but cannot be precisely subsumed by a single syntactic parse tree node within the Penn TreeBank. Split argumentation is typically caused by syntactic analyses that are not binary branching. The syntactic parse for Example 2.14 is shown in Figure 2.2. As shown, it is impossible to select a single node that subsumes only the complete Arg1 in Example 2.14. Thus, the creators of NomBank and PropBank have elected to mark both the NP and the PP that together give the correct subsumption (i.e., brochures on the use of prescription drugs). I will return to split arguments in Section 2.3.3, where argument prediction conflicts are discussed.

Figure 2.2: Syntax of the split argument construction in Example 2.14.

Figure 2.3: Distribution of nominal instances across the NomLex classes. The y-axis denotes the percentage of all nominal instances that is occupied by nominals in the class.

Example 2.14 also demonstrates the annotation of the Location argument, which is one of many adjunct argument types that are annotated by both PropBank and NomBank (other adjuncts include Manner, Temporal, Purpose, and Direction, among others). The interpretation of an adjunct argument is the same across all predicates in PropBank and NomBank. For example, the Location argument has the same interpretation regardless of whether it is connected to the verbal predicate send or the nominal predicate flight. Because their interpretations are not predicate-specific, adjunct arguments are not included in the frame files. Instead, they are assumed to be available for all predicates.

Examples 2.14 and 2.15 involve nominal predicates that are derived from verbs. This dissertation will refer to such predicates as deverbal or event-based nominal predicates. In addition to these predicates, NomBank annotates a wide variety of nouns that are not derived from verbs and do not denote events. An example is given below of the partitive noun percent:
The interpretation of an adjunct argument is the same across all predicates in PropBank and NomBank. For example, the Location argument has the same interpretation regardless of whether it is connected to the verbal predicate send or the nominal predicate flight. Because their interpretations are not predicate-specific, adjunct arguments are not included in the frame files. Instead, they are assumed to be available for all predicates. Examples 2.14 and 2.15 involve nominal predicates that are derived from verbs. This dissertation will refer to such predicates as deverbal or event-based nominal predicates. In addition to these predicates, NomBank annotates a wide variety of nouns that are not derived from verbs and do not denote events. An example is given below of the partitive noun percent: 20 Argument Arg1 Arg0 Arg2 Temporal Arg3 Manner Count 80,102 49,823 34,850 9,495 7,611 7,210 % 40.4 25.1 17.6 4.8 3.8 3.6 Argument Location Extent Negation Adverbial Arg4 Purpose Count 5,771 865 655 591 494 444 % 2.9 0.4 0.3 0.3 0.2 0.2 Table 2.1: Distribution of annotated NomBank arguments. Argument positions with fewer than 100 occurrences are omitted. (2.16) Hallwood owns about 11 [Predicate %] [Arg1 of Integra]. In this case, the noun phrase headed by the predicate % (i.e., about 11% of Integra) denotes a fractional part of the argument in position Arg1 . Other partitive predicates behave similarly. The NomLex resource (Macleod et al., 1998) is a hand-coded lexicon that classifies the various nominal types annotated by NomBank (e.g., deverbal, partitive, and others). Figure 2.3 shows the distribution of NomBank predicate instances across the NomLex classes. Deverbal (i.e., event-denoting) nominals reside in the nom class, which is significantly larger than any other class. This is the expected result because events form the foundation of many textual discourses. In total, the NomBank corpus contains argument information for 114,574 instances of 4,704 distinct nominal predicates. Because this dissertation focuses on the automatic identification of the various argument types, it is important to understand the corresponding distribution. Table 2.1 presents this information. As shown in the table, the distribution of arguments is extremely skewed; Arg0 , Arg1 , and Arg2 account for approximately 83% of the annotated argument structure. Thus, in order for a system to perform well it must target these argument types. I conclude this section by showing that NomBank contains a significant amount of semantic information that is not present in PropBank and cannot be recovered using verbal SRL. First, note that PropBank contains approximately 49 predicates per document, whereas 21 NomBank contains approximately 50 predicates per document. Thus, NomBank predicates are just as frequent as PropBank predicates; however, this alone is not enough to show that NomBank contains novel information beyond that given by PropBank. Consider the situation in which an instance of the verb distribute is followed by an instance of the noun distribution. It is quite likely that these two predicate instances refer to the same event. Thus, extracting information from the latter might not enhance one’s understanding of the document. Analysis shows that this behavior is more the exception than the rule: 87% of NomBank predicate instances are neither preceded nor followed by corresponding PropBank predicates in the same documents. 
This fact, combined with the per-document frequency of nominal predicates mentioned above, is preliminary evidence that nominal predicate-argument structure contributes a significant amount of information to the discourse. This information should complement PropBank information, which, as described earlier, has been useful in tasks such as QA, IE, and SMT.

2.2.3 Statistical SRL

The creation of FrameNet prompted a move from hand-coded semantic processing to statistical learning-based approaches. The seminal work of Gildea and Jurafsky (2002) treated the SRL problem as a supervised learning task and used the FrameNet corpus as a source of training data. Gildea and Jurafsky employed simple maximum likelihood statistics for various lexical and syntactic features to both identify frame element boundaries within text and assign semantic role labels (e.g., Agent) to the identified frame elements. Results of this study were promising: the authors reported an overall role F1 score6 of approximately 63% on the task of combined frame element identification and labeling. Gildea and Jurafsky obtained this result using (among other things) features extracted from automatically generated syntactic parse trees.

6 In this dissertation, F1 refers to the harmonic mean of precision and recall: F1 = (2 × Precision × Recall) / (Precision + Recall), where Precision = # true positives / # predicted positives and Recall = # true positives / # existing positives.

Two results from this work have been particularly influential on subsequent work:

1. Syntactic information is essential for high-quality SRL. Many linguistic theories posit a strong connection between syntax and semantics. For example, Adger (2003) develops a framework in which all semantic roles for a predicate are assigned to syntactic constituents (p. 81). Furthermore, each predicate places syntactic (p. 84) and semantic (p. 87) restrictions on the semantic roles with which it can be associated. Applied work in SRL has found the syntactic restrictions to be particularly important (Gildea and Palmer, 2001; Punyakanok et al., 2008), and this chapter will place continued emphasis on syntactic information when identifying semantic arguments within a sentence.

2. A two-stage configuration is possible. Gildea and Jurafsky (2002) used separate classifiers to identify frame elements and apply labels to them. Numerous subsequent studies have followed this tradition; however, no compelling arguments have been offered in support of this configuration. This dissertation develops a single-stage model in which arguments are predicted directly, thus avoiding the complexities of chaining multiple classifiers together.

It is important to briefly mention the evaluation setup used by Gildea and Jurafsky (2002). The authors evaluated their system using ground-truth predicates and frames. The system’s only task was to identify and label the frame elements. Thus, although the work was promising it left many open questions. One important question was how to extend the model to a more practical scenario in which a system is given raw text and must handle all processing tasks using no ground-truth information. The current chapter also assumes ground-truth predicates. Chapter 3 explores automatic predicate identification in detail.
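To make the argument-level metric defined in the footnote above concrete, the following sketch computes precision, recall, and F1 over a predicted and a gold-standard argument labeling. It is my own illustration rather than the scoring code used in the cited studies, and the representation of an argument as a (start, end, label) span triple is an assumption made purely for the example.

# Minimal sketch of the precision/recall/F1 definitions given in the footnote.
# Arguments are represented here as (start_token, end_token, label) triples,
# an assumption made for illustration only.
from typing import Set, Tuple

Argument = Tuple[int, int, str]  # (span start, span end, role label)


def f1_score(predicted: Set[Argument], gold: Set[Argument]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for a set of predicted arguments."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Example: two predictions, one of which matches the gold standard exactly.
gold = {(0, 0, "Arg0"), (3, 5, "Arg1"), (7, 9, "Location")}
predicted = {(0, 0, "Arg0"), (3, 5, "Arg2")}
print(f1_score(predicted, gold))  # -> (0.5, 0.333..., 0.4)

Under this definition an argument counts as correct only if both its span and its label match the gold standard exactly, which is why the mislabeled (3, 5) span above counts against both precision and recall.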
Soon after its release, PropBank became a popular resource for statistical SRL researchers, supporting many studies and motivating a number of large-scale, competitive evaluation tasks (e.g., the CoNLL Shared Tasks described by Carreras and Márquez (2004), Carreras and Márquez (2005), and Surdeanu et al. (2008)). Below, I describe three key aspects of PropBank-based SRL research.

Syntactic representation

Syntactic information is essential for the SRL task; however, there are different ways to represent this syntactic information. Figure 2.2 (p. 19) demonstrates the constituency approach to syntax, which uses a context-free grammar formalism. This formalism has a long history in linguistics and is amenable to processing by algorithms such as the popular Cocke-Younger-Kasami (CYK) algorithm, whose running time is O(n³) in the sentence length. The competition organized by Carreras and Márquez (2005) used this syntactic representation. More recently, Surdeanu et al. (2008) organized a competition in which the dependency approach to syntax was explored. Although this formalism is not as rich as constituency representations (i.e., some syntactic properties cannot be captured), it has the advantage of processing algorithms with running times that are O(n) in the sentence length (Nivre, 2003). This dissertation will use the constituency formalism in order to explore some of the deeper syntactic properties of sentences.

Machine learning technique

As described by Carreras and Márquez (2005) and Surdeanu et al. (2008), a majority of the most successful SRL systems have used maximum entropy models (Berger et al., 1996) or support vector machines (Burges, 1998). These techniques accommodate large-scale datasets and often learn models that generalize well from training to testing. Most of the models developed in this dissertation are produced by the logistic regression framework (LibLinear) created by Fan et al. (2008), which is capable of handling millions of training instances and features. As noted by Hsu et al. (2010), high-dimensional data does not always benefit from a mapping into a higher-dimensional space, as is often done with SVMs. For nominal SRL, I have found that the linear models produced by LibLinear perform as well as SVMs but are significantly faster to train.

Joint inference

Many SRL systems have incorrectly assumed that the existence of one argument is independent of the existence of other arguments. Consider the following examples, created by Toutanova et al. (2008):

(2.17) [Temporal The day] that [Agent the ogre] [Predicate cooked] [Theme the children] is still remembered.

(2.18) [Theme The meal] that [Agent the ogre] [Predicate cooked] [Beneficiary the children] is still remembered.

Only one word differs between these examples (day is replaced with meal); however, the interpretations are vastly different. In 2.17 the children are cooked, whereas in 2.18 the meal is cooked. If the initial noun phrase is changed from a Temporal marker to a Theme, the roles of other constituents are also changed. This dependence between roles prompted Toutanova et al. to study joint inference across argument assignment possibilities. Similarly, Punyakanok et al. (2008) used integer linear programming to enforce constraints on joint SRL structures for verbal SRL (e.g., one constraint is that arguments cannot overlap each other in the sentence). This dissertation explores a joint inference model for nominal SRL in Chapter 5.
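As a concrete illustration of the machine-learning setup described above, the following sketch trains a linear multi-class argument classifier. The dissertation's models are built with the LibLinear package of Fan et al. (2008) directly; the sketch instead uses scikit-learn's interface to the same LIBLINEAR solver, and the feature dictionaries and labels below are invented for illustration rather than taken from the actual feature set (which is described in Section 2.3.2 and the appendix).

# Sketch of a linear argument classifier in the spirit of the LibLinear-based
# models described above. The features and labels are fabricated; scikit-learn's
# LogisticRegression wraps the same LIBLINEAR solver used by the dissertation.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# One feature dictionary per candidate parse-tree node (illustrative values only).
train_features = [
    {"head_word": "pharmacists", "phrase_type": "NP", "position": "right", "predicate": "distribution"},
    {"head_word": "brochures", "phrase_type": "NP", "position": "right", "predicate": "distribution"},
    {"head_word": "stores", "phrase_type": "PP", "position": "right", "predicate": "distribution"},
    {"head_word": "will", "phrase_type": "MD", "position": "left", "predicate": "distribution"},
]
train_labels = ["Arg0", "Arg1", "Location", "NONE"]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(train_features)  # sparse binary feature matrix

# L2-regularized logistic regression solved by LIBLINEAR; scales to large,
# sparse feature spaces without an explicit kernel mapping.
classifier = LogisticRegression(solver="liblinear", C=1.0)
classifier.fit(X, train_labels)

test = vectorizer.transform([{"head_word": "pharmacists", "phrase_type": "NP",
                              "position": "right", "predicate": "distribution"}])
print(classifier.predict(test))  # e.g., ['Arg0']

The appeal of this setup is that the feature space can be extremely sparse and high-dimensional while training remains fast, which is the property noted above in the comparison between linear models and kernelized SVMs.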
The PropBank-based SRL systems mentioned above often reach argument F1 scores approaching 80% when tested on PropBank data. However, these systems tend to encounter difficulties when tested on genres of text that differ from the training corpus. Carreras and Màrquez (2005) cite a performance drop of around 10 F1 points for all participating systems when evaluated over PropBank annotations from the Brown Corpus of Present-day American English (Kučera and Nelson, 1967). The Brown Corpus comprises approximately one million words from a variety of sources. Pradhan et al. (2008) provide an in-depth study of the effects of text genre on verbal SRL, concluding that the second stage (argument label assignment) contributes the most toward out-of-domain performance degradation. This is due, in large part, to a reliance on lexical and semantic features tuned specifically for the TreeBank corpus. A similar drop in performance can be expected for the model developed in the current chapter.

Nominal SRL

Statistics-based work on nominal SRL has lagged behind its verbal counterpart by a few years. This is probably because verbs are usually the first choice when analyzing textual semantics. However, as pointed out above, nominal predicates carry a significant amount of novel information. When it comes to automatically extracting this information, one finds a few precursors to the standard nominal SRL task. For example, Lapata (2000) developed a statistical model to classify modifiers of deverbal nouns as underlying subjects or underlying objects, where subject and object denote the grammatical position of the modifier when linked to a verb. Consider two possible interpretations of the phrase "satellite observation" below:

(2.19) [Subject Satellite] [Predicate observation] techniques are used to keep track of enemy troop movements.

(2.20) The stargazers routinely engaged in [Object satellite] [Predicate observation].

In Example 2.19, it is the satellites that are being used for observation, whereas in Example 2.20 the satellites are being observed. Lapata developed a simple statistical model to identify this distinction, which corresponds roughly to the distinction between Arg0 (subject/Agent) and Arg1 (object/Theme) in NomBank. The study did not account for other argument positions, including adjunct arguments.

In a general sense, nominal SRL is the process of identifying relations between a noun (the predicate) and other nouns in the surrounding context. In recent years, researchers have investigated a variety of noun-noun relations. Girju et al. (2007) organized a SemEval (http://www.senseval.org) competition in which systems identified noun-noun relations such as Cause-Effect, an example of which is given below:

(2.21) The individual was infected with the [Effect flu] [Cause virus].

This relation is not analogous to any of the semantic role relations discussed so far. However, the Instrument-Agency SemEval relation is:

(2.22) The [Instrument phone] [Agency operator] answered my call.

Girju et al.'s task defined five other relations and required systems to make binary decisions about whether a segment of text contained a particular relation (the relation type was given to the system at testing time). This task was later refined by Hendrickx et al. (2010), resulting in a multi-way evaluation where each test example could exhibit any of the relations.
These two tasks are certainly related to nominal SRL, but most of the relations they focus on do not have an interpretation in terms of semantic roles. Thus, the work presented in this dissertation is largely complementary to the SemEval tasks.

Although FrameNet contains some annotations for nominal predicates, NomBank (Meyers, 2007a) has been the driving force behind true nominal SRL in recent years. Based on a pre-release version of NomBank, Jiang and Ng (2006) used standard verbal SRL techniques and achieved an overall argument F1 score of 69.14% using automatically generated syntactic parse trees. Liu and Ng (2007) followed this up with a different technique (alternating structure optimization) and achieved an F1 score of 72.83%; however, the latter study used an improved version of NomBank, rendering these two results incomparable. Both studies also investigated the use of features specific to the task of NomBank SRL, but observed only marginal performance gains.

Following these initial studies, NomBank supported a series of competitive evaluation tasks hosted by the Computational Natural Language Learning (CoNLL) conference. The first task, Joint Parsing of Syntactic and Semantic Dependencies (Surdeanu et al., 2008), required systems to identify the dependency syntax for a sentence as well as the sentence's predicate-argument structure for both verbal and nominal predicates. Verbal predicate-argument structure was derived from PropBank, whereas nominal predicate-argument structure came from NomBank. A majority of systems in the 2008 CoNLL competition formulated the SRL problem as a two-stage classification problem. In the first stage, spans of text were assigned a binary label indicating whether or not the span represented an argument. In the second stage, argument spans were relabeled with a final label, which was then evaluated. For nominals, the best overall F1 score was 76.64%; however, this score is not directly comparable to the NomBank SRL results of Jiang and Ng (2006), Liu and Ng (2007), or the results in this dissertation because the evaluation metrics are not the same (see Section 2.4 for details). A similar task was run in 2009 by Hajič et al., the only fundamental difference being the inclusion of additional languages. This dissertation only investigates nominal SRL for English text.

In the remainder of this chapter, I present a statistical NomBank SRL system that will be a starting point for the chapters that follow. In Section 2.3, I describe the SRL model in terms of its formulation, features, and general operation. I then present a formal evaluation of the model in Section 2.4. Sections 2.5 and 2.6 identify a variety of problems that will be taken up in subsequent chapters.

2.3 Nominal SRL model

2.3.1 Model formulation

The following example demonstrates the testing input to the nominal SRL model:

(2.23) Searle will give pharmacists brochures on the use of prescription drugs for [Predicate distribution] in their stores.

As shown, the system is given a sequence of words and the nominal predicate. Using this information, the model must assign semantic labels (e.g., Arg0, Arg1, ..., Location, etc.) to spans of text in the sentence. The correct labeling is given in Example 2.14 (p. 18). The nominal SRL task is treated as a multi-class classification problem over parse tree nodes. Each parse tree node subsumes an unambiguous span of text. Thus, classifying a node is equivalent to labeling a span of text in the sentence (see Figure 2.2 on page 19).
All nodes are classified except those that overlap the predicate. In total, there are 22 classes representing the Arg_n and adjunct arguments. One additional class, null, is added to account for parse tree nodes whose text does not fill a semantic role. For a classifiable node n, the 23 classes are modeled in a single stage as follows:

argmax_{l ∈ Labels} Pr(Label(n) = l | f_1, ..., f_n)   (2.24)

Equation 2.24 constitutes a departure from the two-stage tradition in SRL; however, I have found that this single-stage approach tends to outperform the two-stage approach described previously. I used the multi-class logistic regression solver provided by LibLinear (Fan et al., 2008) to estimate Equation 2.24. In the following section, I give a precise specification for features f_1, ..., f_n, which are used as evidence for the prediction.

2.3.2 Model features

Starting with a wide range of features, I used a greedy selection algorithm similar to the one proposed by Pudil et al. (1994) to identify an optimal subset (see Section A.8 on page 144 for a listing of the feature selection algorithm). Table A.3 in the Appendix (p. 136) lists the selected argument features. Below, I give detailed examples for features that are not sufficiently explained in the table.

Feature 4 identifies predicate-specific argument behavior. Consider the following examples from the Penn TreeBank:

(2.25) [Arg1 Investment] [Predicate analysts] generally agree.

(2.26) The tender [Predicate offer] [Arg1 for Gen-Probe's shares] is expected to begin next Monday.

In Example 2.25 the Arg1 (entity analyzed) precedes the predicate; 95% of all analyst instances behave the same way. Compare this to Example 2.26, where the Arg1 (entity acquired) follows the predicate; ninety percent of all offer instances behave similarly. As these examples show, an argument's location relative to the predicate can depend heavily on the predicate itself. Thus, the value of Feature 4 is obtained by concatenating the predicate stem with a binary value indicating whether the candidate argument n comes before or after the predicate in the sentence. This feature would have a value of analyst:before in Example 2.25 and offer:after in Example 2.26. Many other features in Table A.3 have predicate-specific values for similar reasons.

Feature 10 captures the basic syntactic structure that surrounds the predicate. Some approaches to SRL begin with a set of pruning heuristics to eliminate unlikely candidate arguments (Xue and Palmer, 2004). These heuristics start at the predicate node and inspect the local syntactic context for particular constituents. For example, the predicate's sibling node is included in the candidate pool if it is a prepositional phrase. This situation is shown in Figure 2.4 for the destruction predicate. Instead of using heuristics, Feature 10 directly encodes the syntactic context of a predicate. The value for this feature is the context-free grammar rule that expands the predicate's parent node. With respect to Figure 2.4, this grammar rule would be NP → NN,NP.

Figure 2.4: Syntactic context of the destruction predicate. [Parse tree fragment for the phrase "John's destruction of the city".]

Feature 26 captures the syntactic relationship between the candidate argument node and the predicate. Its value is formed by traversing the parse tree from the candidate to the predicate node. At each step in the traversal, the current syntactic category and direction of movement (up or down) is recorded.
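The two syntactic features just described can be sketched directly over NLTK constituency trees. The bracketing below is an illustrative assumption, not the exact tree of Figure 2.4, and the function names are my own; the sketch simply shows how a CFG-expansion feature and a parse-tree-path feature can be read off tree positions.

```python
from nltk.tree import Tree

def expansion_rule(node):
    """Feature 10 sketch: the CFG rule that expands a node, e.g. 'NP -> NP,NN,PP'."""
    kids = [k.label() if isinstance(k, Tree) else k for k in node]
    return node.label() + " -> " + ",".join(kids)

def tree_path(tree, start, end):
    """Feature 26 sketch: node labels from the candidate argument node (position
    'start') to the predicate node (position 'end'), with '^' marking upward and
    'v' marking downward steps.  Positions must refer to non-leaf nodes."""
    common = 0
    while common < min(len(start), len(end)) and start[common] == end[common]:
        common += 1
    up = [tree[start[:i]].label() for i in range(len(start), common, -1)]
    down = [tree[end[:i]].label() for i in range(common, len(end) + 1)]
    return "^".join(up) + "^" + "v".join(down) if up else "v".join(down)

# Illustrative (assumed) bracketing for "John's destruction of the city".
t = Tree.fromstring("(NP (NP (NNP John) (POS 's)) (NN destruction) "
                    "(PP (IN of) (NP (DT the) (NN city))))")
print(expansion_rule(t))            # NP -> NP,NN,PP
print(tree_path(t, (2,), (1,)))     # PP^NPvNN, i.e. PP up to NP, down to NN
```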
In Figure 2.4, the parse tree path from the candidate argument of the city to the predicate node destruction would be PP ↑ NP ↓ NN. Since its introduction by Gildea and Jurafsky (2002), this feature has proved to be one of the most informative for the SRL task. In my nominal SRL model this feature ranks quite low because variations of it are yet more informative. For example, Features 1 and 2 make the standard path more specific by combining it with other information. Feature 13 makes the standard path more general by removing information. The feature selection algorithm determined that these variations were better suited to the nominal SRL task.

Feature 17 considers the parse tree path between the candidate argument node and so-called support verbs in the sentence. Support verbs (also called light verbs) have very little semantic meaning. Their primary purpose is to link long-distance arguments to nominal predicates that are more meaningful. Consider the following contrived example:

(2.27) [Arg0 John] [Support took] a [Predicate walk].

In Example 2.27, took does not have the usual meaning of forcibly changing possession; rather, this verb's purpose is to bring in John as the Arg0 (walker) of walk. This sentence can be paraphrased with the verb walk as "John walked." Feature 17 identifies the parse tree path between the candidate argument and the nearest support verb. As shown above (see Example 2.23), the system is not given support verb information at testing time. Thus, I created a model to automatically identify support verbs so that they may be used by this feature. The model and features used to identify support verbs are described in Appendix Section A.1 (p. 134).

Before moving on, it is worth noting that there are alternatives to the extensive feature engineering and selection process described above. Moschitti et al. (2008) present a detailed analysis of so-called tree kernels and their application to various NLP problems, SRL being their primary interest. Tree kernels provide a means for feature engineering based on the "kernel trick" that is available in learning frameworks such as support vector machines. This dissertation leaves the exploration of tree kernels to future work.

Feature binarization

Like many other machine learning toolkits, LibLinear's instance representation format requires features with numeric values. As shown above, the value range for many features has no meaningful numeric ordering. That is, a value of PP ↑ NP ↓ NN for Feature 26 cannot be meaningfully compared to other values for this feature (e.g., the NP ↑ NP ↓ NP ↓ NN path from John to destruction in Figure 2.4). Thus, it would be unwise to create a single numeric feature Parse path by mapping PP ↑ NP ↓ NN to 1 and NP ↑ NP ↓ NP ↓ NN to 2. Instead, as suggested by Hsu et al. (2010), all non-numeric features are binarized. Assume that each candidate node n is represented using only Feature 26 (the parse tree path). Also assume that this feature has two possible values (the paths mentioned above). In LibLinear, each node n would be represented as one of the following:

n has path PP ↑ NP ↓ NN: 1, 0
n has path NP ↑ NP ↓ NP ↓ NN: 0, 1
n has neither path: 0, 0

Thus, for a feature with m possible values, binarization creates m mutually exclusive binary features. Instances are represented by activating at most one of these features. The binarized feature space can be extremely large. A single word-based feature can easily binarize to 10^5 binary features.
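A minimal sketch of this binarization step is shown below: each feature=value pair is mapped to its own binary dimension and each instance is written in the sparse "label index:value" format that LibLinear consumes. The feature names, values, and labels are illustrative only.

```python
def binarize(instances, labels):
    """Map each feature=value string to its own binary dimension and emit
    LibLinear's sparse 'label index:1 index:1 ...' training lines."""
    index = {}                                    # feature=value -> dimension (1-based)
    lines = []
    for feats, label in zip(instances, labels):
        dims = []
        for name, value in sorted(feats.items()):
            key = "%s=%s" % (name, value)
            dims.append(index.setdefault(key, len(index) + 1))
        lines.append(" ".join([str(label)] + ["%d:1" % d for d in sorted(dims)]))
    return lines, index

instances = [{"path": "PP^NPvNN"}, {"path": "NPvNPvNPvNN"}]
lines, index = binarize(instances, labels=[1, 0])
print(lines)   # ['1 1:1', '0 2:1']
```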
This poses a problem for the greedy forward search algorithm described on page 144, which inspects each individual feature. Whether this is actually a problem depends on how one defines a feature. If one defines a feature to be the unbinarized version, then the parse tree path represents a single feature that can be selected. If one defines a feature to be the binarized version, then the parse tree path represents thousands of features to be selected from. I assumed the former definition when performing feature selection. For example, by including or excluding Feature 26, the selection process implicitly includes or excludes all resulting binarizations of this feature. This compromise keeps the selection process tractable.

2.3.3 Post-processing

After classifying all candidate nodes in the tree using the model described above, two steps still remain. First, a special classification must be performed on the predicate node itself. Following this, the entire assignment must be made consistent. These two steps are described below.

Incorporated arguments

An important difference between PropBank and NomBank is that the latter often applies argument labels to predicate nodes themselves, whereas the former does not. In NomBank, predicates that are also arguments are referred to as incorporated arguments. An example is given below:

(2.28) Petrolane is the second-largest [Arg1 propane] [Predicate/Arg0 distributor] [Location in the U.S.].

In 2.28, the predicate additionally assumes the Arg0 role (the entity performing the distribution). In order to account for incorporated arguments, the system uses a separate model to assign argument labels to predicate nodes. For the predicate distributor, the model assigns the label that maximizes the following probability:

P(Arg_i | distributor) = #(Arg_i, distributor) / #(distributor)   (2.29)

In Equation 2.29, #(Arg_i, distributor) is the number of times that the distributor predicate is observed with the incorporated argument label Arg_i in the training data. #(distributor) is the total number of occurrences of the distributor predicate in the training data. This simple method labels incorporated arguments with an F1 score of approximately 87%.

Figure 2.5: Global constraint violations. The circled node in 2.5a is reassigned to the null class after its score is averaged into its parent node. The circled node in 2.5b is reassigned to the null class because it has lower confidence than other nodes of the same type. The remaining Arg1 nodes in 2.5b are kept because they are siblings. This accounts for split arguments (see Section 2.2.2 for a discussion of split argument constructions). [Panel (a) shows overlapping arguments and panel (b) shows duplicate arguments; each node is annotated with its label and confidence score.]

Conflict resolution

When labeling a particular node, the feature-based logistic regression model does not take labels for other nodes into account. Neither does the model use dynamic programming to arrive at the most likely consistent assignment of labels to constituents, as done by Jiang and Ng (2006) and Toutanova et al. (2005). As a result, argument labels sometimes violate global labeling constraints, which are illustrated in Figure 2.5. These constraints are enforced using the following heuristics:

Overlapping argument heuristic Overlapping arguments arise when two nodes are labeled as arguments and one node is an ancestor of the other, as shown in Figure 2.5a.
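The maximum-likelihood incorporated-argument model of Equation 2.29 amounts to counting and normalizing. The sketch below illustrates this with toy counts, not the real NomBank statistics, and the function names are mine.

```python
from collections import Counter

def train_incorporated_model(training_instances):
    """Sketch of the maximum-likelihood incorporated-argument model (Equation 2.29).
    training_instances: (predicate_stem, incorporated_label_or_None) pairs."""
    joint = Counter()
    totals = Counter()
    for pred, label in training_instances:
        totals[pred] += 1
        if label is not None:
            joint[(pred, label)] += 1

    def best_label(pred):
        candidates = [(joint[(pred, lab)] / totals[pred], lab)
                      for (p, lab) in joint if p == pred]
        return max(candidates) if candidates else (0.0, None)
    return best_label

# Toy counts: 'distributor' incorporates Arg0 in 2 of 3 training instances.
model = train_incorporated_model([("distributor", "Arg0"), ("distributor", "Arg0"),
                                  ("distributor", None), ("analyst", "Arg0")])
print(model("distributor"))   # (0.666..., 'Arg0')
```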
If each node has the same label, the system re-scores the ancestor node with the average of the two nodes' confidence scores. The descendant node is then reassigned to the null class. If the two nodes have different labels, the node with higher confidence is kept and the other is reassigned to the null class. All reassignments to the null class are made with confidence equal to 1.0.

Duplicate argument heuristic Duplicate arguments arise when two nodes are assigned the same argument label and one is not an ancestor of the other, as shown in Figure 2.5b. If the two nodes are not siblings, the node with the higher confidence score is kept and the other is reassigned to the null class. If the two nodes are siblings, both are kept. Keeping both sibling nodes accounts for split argument constructions, which were discussed in Section 2.2.2 (p. 13).

Low confidence heuristic After the previous two heuristics are applied, all argument nodes with confidence less than a threshold t_arg are removed. The value for t_arg is found by maximizing the system's performance on a development dataset.

To summarize, when given a sentence and a nominal predicate within the sentence, the logistic regression model is applied to each node in the parse tree that does not overlap with the predicate node. The predicate node is then labeled, and the heuristics are applied to resolve argument conflicts and remove argument labels with low confidence scores.

Table 2.2: NomBank SRL results for argument prediction using automatically generated parse trees. The F1 scores were calculated by aggregating predictions across all classes.

                        Development F1    Testing F1
Jiang and Ng (2006)     0.6677            0.6914
Liu and Ng (2007)       (not reported)    0.7283
This dissertation       0.7401            0.7574

2.4 Evaluation

To test the model described in the previous section, I extracted training nodes from sections 2-21 of NomBank, keeping only those nodes that did not overlap with the predicate. LibLinear parameters were set as follows: bias = 1, c = 1, w+ = 1. I tuned the t_arg threshold using section 24 as development data (t_arg = 0.42). Finally, I used section 23 for testing; this data separation is standard for PropBank/NomBank SRL evaluations (see, for example, Carreras and Màrquez (2005)). All syntactic parse trees were generated by the August 2006 version of Charniak's re-ranking syntactic parser (Charniak and Johnson, 2005). Each annotated predicate in the testing section was presented to the system as shown in Example 2.23 (p. 28).

Table 2.2 presents the evaluation results. I calculated the F1 scores by aggregating predictions across all predicates. Precision and recall were defined in the usual way:

Precision = #(correct labels applied) / #(labels applied)   (2.30)
Recall = #(correct labels applied) / #(labels in ground-truth)   (2.31)

This evaluation methodology follows the one used by Jiang and Ng (2006) and Liu and Ng (2007); however, the results for my model are only comparable to the latter because the former used a preliminary release of NomBank. (The discrepancy between the development and testing results is likely due to poorer syntactic parsing performance on the development section (Carreras and Màrquez, 2005).)

2.5 Discussion

As can be seen, the NomBank SRL system presented in this chapter comfortably outperforms the best previous result. Because the models share many properties, it is worth discussing factors that could possibly lead to the performance difference.
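The three heuristics can be sketched as a small post-processing pass. This is a simplified sketch under two assumptions of mine: token spans stand in for tree nodes (so containment approximates ancestry) and a shared parent span stands in for siblinghood; the dictionary keys and default threshold are illustrative.

```python
def resolve_conflicts(preds, t_arg=0.42):
    """Simplified sketch of the Section 2.3.3 heuristics.  Each prediction is a
    dict {'span': (i, j), 'parent': (i2, j2), 'label': str, 'conf': float}."""
    def contains(a, b):                      # a strictly contains b
        return a != b and a[0] <= b[0] and b[1] <= a[1]

    preds = [dict(p) for p in preds]
    dropped = set()

    # Overlapping-argument heuristic.
    for i, anc in enumerate(preds):
        for j, desc in enumerate(preds):
            if i in dropped or j in dropped or not contains(anc['span'], desc['span']):
                continue
            if anc['label'] == desc['label']:
                anc['conf'] = (anc['conf'] + desc['conf']) / 2.0   # re-score ancestor
                dropped.add(j)                                      # drop descendant
            else:
                dropped.add(i if anc['conf'] < desc['conf'] else j)

    # Duplicate-argument heuristic (siblings are kept to allow split arguments).
    for i, a in enumerate(preds):
        for j, b in enumerate(preds):
            if j <= i or i in dropped or j in dropped or a['label'] != b['label']:
                continue
            if (not contains(a['span'], b['span']) and not contains(b['span'], a['span'])
                    and a['parent'] != b['parent']):
                dropped.add(j if a['conf'] >= b['conf'] else i)

    # Low-confidence heuristic.
    return [p for k, p in enumerate(preds) if k not in dropped and p['conf'] >= t_arg]
```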
First, I observed a significant performance increase when moving from a traditional two-stage pipeline to the single-stage classifier presented above. To my knowledge, the research community has not thoroughly investigated the need for a two-stage pipeline. It is, however, the computationally easier route. A two-stage approach requires a binary first-stage classifier trained over approximately 3.7 million nodes and a 22-class second-stage classifier trained over approximately 179,000 nodes. A single-stage nominal SRL classifier, on the other hand, requires a 23-class classifier trained over approximately 3.7 million nodes. In the one-versus-all approach to multi-class classification, the single-stage SRL classifier is much more computationally intensive. However, the single-stage approach is free of cascading errors, which are common in pipelined architectures such as the two-stage model. In the two-stage model, a false negative error in the first stage prevents the second stage from making a decision.

Another important difference between the current model and the other two is the treatment of overlapping argument nodes and incorporated arguments. In the work of Jiang and Ng (2006), incorporated arguments were not included in the training data despite the fact that they occur very frequently - approximately 15% of arguments in the training data are incorporated. The authors do attempt to label predicate nodes at evaluation time using the trained model, but the most important features for argument labeling (e.g., the parse tree path) are not informative for such nodes. In contrast, Liu and Ng (2007) included all parse tree nodes in the training data, even those that overlap with the predicate node (presumably, this includes the predicate node itself). At evaluation time, all nodes are classified; however, considering the fact that only 0.02% of non-incorporated arguments overlap the predicate node, this approach is likely to create more errors than correct labels. The model presented in Section 2.3 takes a hybrid approach. Nodes that overlap the predicate are not used as training data, nor are they labeled by the logistic regression model during testing; instead, predicate nodes are labeled by the simple model described in Section 2.3, which achieves an F1 score of 0.84 on incorporated arguments of all types.

2.6 Conclusions

This chapter has shown that the tremendous syntactic flexibility of natural language can be semantically normalized by analysis in terms of semantic roles. This analysis does not aim to produce a deep, complete semantic interpretation; rather, the aim is to extract shallow information reliably. This chapter has also shown that, despite its shallow nature, semantic role analysis produces a significant amount of information that can be leveraged for language processing tasks that require inference. The nominal SRL system described in this chapter produces state-of-the-art results using no manual intervention. It relies primarily on a rich syntactic analysis combined with traditional supervised machine learning. The single-stage architecture is computationally more expensive than the standard two-stage model; however, it does not require one to chain multiple components together. By carefully handling nominal-specific issues like argument incorporation, the system is able to recover arguments with an F1 score of approximately 76%. This is an encouraging result; however, the system makes two important assumptions that must be addressed:
1. Ground-truth predicates are provided to the system at testing time. This does not invalidate the evaluation methodology used in this chapter, which still provides useful information about the SRL model; however, in order to assess true end-to-end performance in a practical setting, one must remove this assumption and force the system to identify predicates as well as arguments. The following chapter does exactly this.

2. Extra-sentential arguments are similar to the arguments described in this chapter. The only difference is that extra-sentential arguments are not present in the sentence that contains the predicate. Rather, these arguments exist in some other sentence of the document. Chapter 4 will explore the nature and extraction of extra-sentential arguments, which have received relatively little attention from the research community.

CHAPTER 3

Predicates that lack arguments: the problem of implicit argumentation

3.1 Introduction

The previous chapter presented a state-of-the-art nominal SRL system that will serve as a baseline for the current chapter. The system achieves an overall argument F1 score of approximately 76% using a supervised learning approach combined with carefully constructed post-processing heuristics. Although this result is encouraging, it is produced by an evaluation methodology that has specific limitations. In particular, the evaluation (which has been used in many previous NomBank and PropBank SRL studies) provides a system with a predicate that is known to take arguments in the local context. However, nominal predicates often surface without local arguments. Consider the following instances of distribution from the Penn TreeBank:

(3.1) Searle will give [Arg0 pharmacists] [Arg1 brochures] [Arg1 on the use of prescription drugs] for [Predicate distribution] [Location in their stores].

(3.2) The [Predicate distribution] represents [NP available cash flow] [PP from the partnership] [PP between Aug. 1 and Oct. 31].

In Example 3.1, distribution is associated with arguments annotated by NomBank. In contrast, distribution in 3.2 has a noun phrase and multiple prepositional phrases in its environment (similarly to 3.1), but not one of these constituents is an argument to the marked predicate. As described by Meyers (2007a), predicate instances such as 3.1 are called "markable" because they are associated with local arguments. Predicate instances such as 3.2 are called "unmarkable" because they are not associated with local arguments. In the NomBank corpus, only markable predicate instances from the Penn TreeBank have been annotated. All other predicates have been ignored.

A number of evaluations (e.g., those described by Jiang and Ng (2006), Liu and Ng (2007), and Chapter 2 of this dissertation) have been based solely on markable predicate instances (i.e., those annotated by NomBank). This group constitutes only 57% of all nominal predicate instances found in the underlying TreeBank corpus. In order to use the output of nominal SRL systems as input for other systems (e.g., QA, IE, and SMT), it is important to develop and evaluate techniques that can handle all predicate instances instead of a select few. With respect to the evaluation procedure of the previous chapter, this amounts to eliminating the assumption that a test predicate takes arguments; instead, the SRL system must make this decision automatically for every token in the corpus. Underlying the issues described above is a phenomenon called implicit argumentation.
An implicit argument is any argument that is not annotated by NomBank. Thus, a predicate is unmarkable when all of its arguments are implicit. In this chapter, I investigate the role of implicit argumentation in nominal SRL. This is, in part, inspired by the 2008 CoNLL Shared Task (Surdeanu et al., 2008), which was the first evaluation of syntactic and semantic dependency parsing to include unmarkable nominal predicates. The current chapter extends this task to constituent parsing with techniques, evaluations, and analyses that focus specifically on implicit argumentation for nominal predicates.

In the next section, I assess the prevalence of implicit argumentation and its impact on the nominal SRL system presented in Chapter 2. I find that, when applied to all nominal instances, this system achieves an argument F1 score of only 69%, a loss of approximately 8%. In Section 3.3, I review the recent CoNLL Shared Task, noting similarities and differences with the current work. In Section 3.4, I present a model designed to filter out nominal predicates whose arguments are entirely implicit. This model reduces the aforementioned loss, particularly for nominals that are not often markable. In the analyses of Section 3.5, I find that SRL performance varies widely among specific classes of nominal predicates, suggesting interesting directions for future work. I conclude, in Section 3.6, by motivating additional work on implicit argumentation, which is taken up in the following chapter.

Figure 3.1: Distribution of nominal predicates. Each interval on the x-axis denotes a set of nominal predicates that are markable between (x − 5)% and x% of the time in the Penn TreeBank corpus. The y-axis denotes the percentage of all nominal predicate instances in TreeBank that is occupied by nominal predicates in the interval. Quartiles are marked below the intervals. For example, the 0.25 quartile at x = 0.35 indicates that approximately 25% of all nominal instances are markable 35% of the time or less.

3.2 Empirical analysis

As shown in Example 3.2, nominal predicates often surface without local arguments. In this section, I provide an analysis of implicit argumentation and its implications for the nominal SRL system developed in the previous chapter. On the whole, instances of predicates from the NomBank lexicon are markable only 57% of the time in the Penn TreeBank corpus. Figure 3.1 shows the distribution of nominal predicates in terms of the frequency with which they are markable. As shown, approximately 50% of nominal instances are markable 65% of the time or less, indicating that implicit argumentation is a very common phenomenon. This tendency toward implicit argumentation is also reflected in the percentage of roles that are filled in NomBank versus PropBank. In NomBank, 48% of possible roles are filled, whereas 61% of roles are filled in PropBank.

Table 3.1: Comparison of the markable-only and all-token evaluations of the SRL system from Chapter 2. In the all-token evaluation, argument identification is attempted for any nominal with at least one annotated (i.e., markable) instance in the training data.

                           Precision    Recall    F1
Markable-only evaluation   0.8093       0.7117    0.7574
All-token evaluation       0.6832       0.7039    0.6934
% loss                     15.58        1.10      8.45

To assess the impact of implicit argumentation, I evaluated the nominal SRL system from Chapter 2 over each token in the testing section.
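The distribution summarized in Figure 3.1 can be computed by bucketing each predicate's observed markable probability and weighting the bucket by the predicate's instance count. The sketch below uses toy input and assumed helper names; it is illustrative of the calculation, not code from the analysis itself.

```python
import math
from collections import Counter

def markable_distribution(instances, width_pct=5):
    """Sketch of the Figure 3.1 computation.  instances: (predicate_lemma,
    is_markable) pairs, one per nominal predicate instance in the corpus."""
    markable, total = Counter(), Counter()
    for lemma, is_markable in instances:
        total[lemma] += 1
        markable[lemma] += int(is_markable)

    buckets = Counter()
    for lemma, n in total.items():
        pct = 100.0 * markable[lemma] / n
        upper = max(width_pct, int(math.ceil(pct / width_pct)) * width_pct)
        buckets[upper] += n                      # weight the interval by instance count
    grand = float(sum(total.values()))
    return {b: buckets[b] / grand for b in sorted(buckets)}

# Toy input: 'distribution' markable 2 of 4 times, 'analyst' 1 of 1.
print(markable_distribution([("distribution", True), ("distribution", False),
                             ("distribution", True), ("distribution", False),
                             ("analyst", True)]))
# {50: 0.8, 100: 0.2}
```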
The system attempted argument identification for all singular and plural nouns that have at least one annotated (i.e., markable) instance in the training portion of the NomBank corpus (morphological variations included). Table 3.1 gives a comparison of the results from the markable-only and all-token evaluations. As shown, assuming that all known nouns take local arguments results in a significant performance loss. This loss is due primarily to a drop in precision caused by false positive argument predictions made for nominal predicates with no local arguments. An example of this is shown below (in this dissertation, a number following an argument label indicates prediction probability):

(3.3) [Arg0 0.64 Canadian] [Predicate investment] rules require that big foreign takeovers meet that standard.

The sentence in Example 3.3 does not contain any constituents that are considered arguments to investment under the NomBank guidelines, but the SRL system (mistakenly) identifies Canadian as filling the Arg0 position. Presumably, Canada is the entity imposing rules on those who invest; Canada is not the investing entity.

Examples such as 3.3 demonstrate an important difference between nominal predicates and verbal predicates: the former are more flexible than the latter in terms of argument realization. Both classes of predicates may undergo syntactic alternations that change the linear order of argument expression; however, only nominal predicates routinely surface without explicit arguments. Thus, the approach to SRL used for verbal predicates is not entirely appropriate for nominal predicates.

3.3 Related work

Implicit argumentation was not accounted for in large-scale evaluation tasks until the 2008 Computational Natural Language Learning (CoNLL) Shared Task on dependency parsing (Surdeanu et al., 2008). In this task, systems were required to identify both syntactic and semantic dependency structure. Ground-truth syntactic dependency structure was automatically extracted from the constituent trees contained in the Penn TreeBank. Ground-truth semantic dependencies were extracted from the annotations in PropBank (for verbs) and NomBank (for nouns). In the semantic portion of the evaluation, systems were required to identify predicating verbs and nouns in addition to the corresponding arguments. Thus, systems in this evaluation were required to process instances such as Example 3.2 (p. 40).

Among all entries to the CoNLL Shared Task organized by Surdeanu et al. (2008), the system created by Johansson and Nugues (2008) fared the best overall and near the top for nominal predicates in particular. Johansson and Nugues's system used a classification-based approach to predicate-argument identification that is similar to the one presented in this chapter. However, the authors left open two important questions. First, there is the simple question of how effectively the system identifies nominal predicates. Second, the study does not evaluate the impact of the nominal predicate classifier on overall predicate-argument identification performance. As shown in the previous section, implicit argumentation has a significant negative effect on the standard approach to nominal SRL. It is important to quantify how much of this loss can be recovered by adding a nominal predicate classifier.
In addition to answering the two questions above, this chapter develops a nominal predicate classifier that is, in many respects, simpler than the method used by Johansson and Nugues, which employed an individually trained classifier for each of the 4,704 predicates contained in NomBank. I opted for a single model capable of making predictions for all predicates in the corpus.

3.4 Argument-bearing predicate model

Given a sentence, the goal of predicate classification is to identify nouns that bear local arguments (i.e., those that would be annotated by NomBank). I treated this as a binary classification task over token nodes in the syntactic parse tree of a sentence. Once a token has been identified as bearing local arguments, it can be further processed by the argument identification model developed in Chapter 2. A token is ignored if it is not identified as argument-bearing.

The nominal predicate classifier was constructed using the greedy feature selection algorithm introduced in the previous chapter (see page 144 for details). I used the logistic regression solver of Fan et al. (2008) over a feature space binarized as described on page 32. Table A.2 (p. 135) presents the selected features. As shown by Table A.2, the sets of features selected for argument and nominal classification are quite different. Many of the features used for nominal classification were not used by Johansson and Nugues (2008) or Liu and Ng (2007). Below, I provide details for features that are not sufficiently explained in the table.

Feature 1 captures the local syntactic structure that contains the candidate predicate. As shown in Table A.2, this is the most informative feature for nominal predicate classification, surpassing the predicate text itself (Feature 2). For a candidate predicate n, Feature 1 is actually a set of sub-features, one for each parse tree node between n and the tree's root. The value of a sub-feature is the context-free grammar rule that expands the corresponding node in the tree. Each value is additionally indexed according to its tree node distance from n. An example of Feature 1 with two sub-features is shown in Figure 3.2. In this example, the candidate nominal predicate sale is being classified. The first sub-feature (Sub_1) is derived from the parent of sale, and the second (Sub_2) from sale's grandparent. The sub-features are indexed under the hypothesis that the same context-free grammar rule might indicate different outcomes at different levels in the parse tree. In Figure 3.2, Sub_2 indicates the use of a support verb structure. This in turn indicates a high likelihood that sale will take an Arg0 that linearly precedes made.

Figure 3.2: Context-free grammar rules for nominal predicate classification (Feature 1). The candidate nominal predicate sale is being classified. Arrows indicate grammar productions. [Parse tree fragment for "made a sale", with Sub_1 = NP → Det, N and Sub_2 = VP → V, NP.]

Feature 8 is a modified version of the parse tree path used for nominal argument identification. The modification is two-fold: first, the path begins at the candidate predicate and ends at the nearest support verb. As mentioned in the previous chapter, there exists a close link between nominal predicates and support verbs (see page 31). Second, the parse tree path is lexicalized, meaning it is concatenated with surface words from the beginning or end of the path. This finer-grained path captures the joint behavior of the syntactic and lexical content.
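Feature 1's indexed sub-features can be sketched as a walk from the candidate token's preterminal up to the root, emitting the CFG rule at each ancestor. The bracketing and function name below are assumptions made for illustration; the printed values match the Sub_1 and Sub_2 examples from Figure 3.2.

```python
from nltk.tree import Tree

def ancestor_rule_subfeatures(tree, token_pos):
    """Feature 1 sketch: one sub-feature per ancestor of the candidate predicate
    token, valued with the CFG rule that expands that ancestor and indexed by
    its distance from the token."""
    feats = {}
    for dist in range(1, len(token_pos) + 1):
        node = tree[token_pos[:-dist]]           # dist-th ancestor of the token node
        kids = [k.label() if isinstance(k, Tree) else k for k in node]
        feats["sub%d" % dist] = "%s -> %s" % (node.label(), ",".join(kids))
    return feats

# Illustrative (assumed) bracketing for "made a sale".
t = Tree.fromstring("(VP (V made) (NP (Det a) (N sale)))")
sale_pos = (1, 1)                                # position of the N node over 'sale'
print(ancestor_rule_subfeatures(t, sale_pos))
# {'sub1': 'NP -> Det,N', 'sub2': 'VP -> V,NP'}
```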
For example, in the tree shown in Figure 3.2, the path from sale to made with a lexicalized destination would be N ↑ NP ↑ VP ↓ V:made. A similar strategy is used for Features 5, 11, 13, 23, and 29. Lexicalization increases sparsity; however, it provides useful information and is often preferred over unlexicalized paths. Support verbs for this and other features were automatically identified using the model described on page 134.

Feature 16 leverages the existing content of PropBank to identify argument-bearing nominal predicates. The value for this feature is the probability that the context (±5 words) of a nominal predicate is generated by a unigram language model trained over the PropBank argument words for the corresponding verb. All named entities are normalized to their entity type using BBN's IdentiFinder (Bikel et al., 1999), and adverbs are normalized to their related adjective using the ADJADV dictionary provided by NomBank. The normalization of adverbs to adjectives is motivated by the fact that adverbial modifiers of verbs typically have corresponding adjectival modifiers for nominal predicates. This is shown below:

(3.4) [Arg0 John] [Predicate gossiped] [Manner quietly] with his coworkers.

(3.5) John's quiet [Predicate? gossip] was overheard.

Example 3.4 provides evidence that a predicate such as gossip takes arguments when surrounded by Person mentions and adverbs such as quietly (normalized to the adjective quiet). This information is useful when classifying the nominal predicate gossip in Example 3.5, where we find a similar named entity and adjectival modifier. Example 3.5 is indeed markable.

LibLinear model configuration

As with other LibLinear models in this dissertation, I found it helpful to adjust the per-class costs for nominal predicate classification. I used cost c = 2 and w+ = 1, which were identified during feature selection. Two additional parameters were set: (1) the classification bias (= 1), and (2) the prediction threshold t_pred. The latter functions similarly to the t_arg threshold used for argument classification. Any candidate predicate scoring higher than t_pred is passed to the argument identifier. All other candidate predicates are ignored. Actual values for t_pred are discussed in the following section.

Table 3.2: Evaluation results for identifying nominal predicates that take local arguments. The first column indicates which nominal classifier was used.

            Precision (%)    Recall (%)    F1 (%)
Baseline    55.5             97.8          70.9
MLE         68.0             90.6          77.7
LibLinear   86.6             88.5          87.6

3.5 Evaluation

I evaluated the model described above using a practical setup in which the nominal SRL system had to process every token in a sentence. The system could not safely assume that each token took local arguments; rather, this decision had to be made automatically. In Section 3.5.1, I present results for the automatic identification of nominal predicates with local arguments. Then, in Section 3.5.2, I present results for the combined task in which nominal classification is followed by argument identification.

3.5.1 Predicate evaluation

Following standard practice, I trained the nominal classifier over token nodes in TreeBank sections 2-21. All syntactic parse trees were automatically generated by Charniak's re-ranking syntactic parser (Charniak and Johnson, 2005), and only those tokens with at least one annotated (i.e., markable) instance in NomBank were retained for training. As mentioned above, the classifier imposes a prediction threshold t_pred on the classification decisions.
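The language-model scoring behind Feature 16 can be sketched as follows. The add-alpha smoothing, function names, and omission of the named-entity and adverb normalization steps are simplifying assumptions of this sketch.

```python
import math
from collections import Counter

def train_unigram_lm(argument_words, alpha=1.0):
    """Unigram LM over the PropBank argument words for one verb
    (add-alpha smoothed).  Returns a log-probability function over words."""
    counts = Counter(argument_words)
    total = sum(counts.values())
    vocab = len(counts) + 1                      # reserve mass for unseen words
    def logprob(word):
        return math.log((counts[word] + alpha) / (total + alpha * vocab))
    return logprob

def context_score(tokens, pred_index, logprob, window=5):
    """Feature 16 sketch: log-probability that the +/-5-word context of a
    nominal predicate is generated by the verb's argument-word LM."""
    lo, hi = max(0, pred_index - window), pred_index + window + 1
    context = tokens[lo:pred_index] + tokens[pred_index + 1:hi]
    return sum(logprob(w) for w in context)

lm = train_unigram_lm(["Person", "quiet", "coworkers", "Person"])
print(context_score(["Person", "'s", "quiet", "gossip", "was", "overheard"], 3, lm))
```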
The value of t_pred was found by maximizing the nominal F1 score on the development section (24) of NomBank (t_pred = 0.47). The resulting model was tested over all token nodes in section 23 of TreeBank. For comparison, I implemented the following simple classifiers:

• The baseline model classifies a token as locally bearing arguments if it is a singular or plural noun that is found to be markable at least once in the training sections of NomBank. As shown in Table 3.2, this classifier achieves nearly perfect recall. Recall is less than 100% due to (1) part-of-speech errors from the syntactic parser and (2) nominal predicates that were not annotated in the training data but exist in the testing data.

• The MLE model operates similarly to the baseline, but also produces a score for the classification. The value of the score is equal to the probability that the nominal bears local arguments, as observed in the training data. When using this model, t_pred = 0.33. As shown by Table 3.2, this exchanges recall for precision and leads to a performance increase of approximately 8 F1 points.

The last row in Table 3.2 shows the results for the feature-based nominal predicate classifier. This model outperforms the others by a wide margin, achieving balanced precision and recall and an overall F1 near 88%. In addition, the feature-based model is able to recover from part-of-speech errors because it does not filter out non-noun candidates; rather, it combines part-of-speech information with other lexical and syntactic features to classify nominal predicates.

Interesting observations can be made by grouping nominal predicates according to the probability with which they are markable in the corpus. Recall Figure 3.1 (page 42), which shows the distribution of markable nominal predicates across intervals of markability. Using this view of the data, Figure 3.3 presents the overall F1 scores for the baseline and LibLinear nominal classifiers (baseline and MLE scores are identical above the MLE threshold). As shown, gains in nominal classification diminish as nominal predicates become more reliably associated with local arguments (i.e., as one moves right along the x-axis). This is because the baseline system makes fewer errors for predicates in intervals to the right. Furthermore, nominal predicates that are rarely markable (i.e., those in interval 0.05) remain problematic due to a lack of positive training instances and the unbalanced nature of the classification task.

Overall, however, the feature-based model exhibits substantial gains versus the baseline system, particularly for nominals occupying the left-most intervals of Figure 3.3. As will be shown in the next section, these gains in nominal predicate classification transfer well to gains in argument identification.

Figure 3.3: Nominal classification performance with respect to the distribution in Figure 3.1 (page 42). The y-axis denotes the combined F1 for nominal predicates that occupy the interval given on the x-axis.

3.5.2 Combined predicate-argument evaluation

I now turn to the task of combined predicate-argument classification. In this task, systems must first identify nominal predicates that bear local arguments. I evaluated three configurations based on the nominal classifiers from the previous section. Each configuration uses the argument classification system described in Chapter 2.
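Threshold tuning on the development section amounts to a simple sweep over candidate values, keeping the one that maximizes F1. The sketch below illustrates this procedure; the grid granularity and function name are my own choices, not taken from the dissertation's implementation.

```python
def tune_threshold(scores, gold, grid=None):
    """Pick the prediction threshold that maximizes F1 on development data.
    scores: classifier confidence per candidate; gold: True if the candidate
    actually bears local arguments."""
    grid = grid or [i / 100.0 for i in range(1, 100)]
    best_t, best_f1 = 0.0, -1.0
    for t in grid:
        tp = sum(1 for s, g in zip(scores, gold) if s >= t and g)
        fp = sum(1 for s, g in zip(scores, gold) if s >= t and not g)
        fn = sum(1 for s, g in zip(scores, gold) if s < t and g)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

print(tune_threshold([0.9, 0.6, 0.4, 0.2], [True, True, False, False]))  # (0.5, 1.0)
```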
Table 3.3 presents the results of using the three configurations for combined predicate-argument classification. As shown in Table 3.3, overall argument classification F1 suffers a relative loss of more than 8% under the baseline assumption that all known nouns bear local arguments. The MLE predicate classifier is able to reduce this loss slightly. The LibLinear predicate classifier reduces this loss even further, resulting in an overall argument classification F1 of 71.1%. This improvement is the direct result of filtering out nominal instances that do not bear local arguments.

Table 3.3: Comparison of the combined predicate-argument classifiers in the all-token evaluation. The first column indicates which nominal predicate classifier was used. All configurations used the argument classification system described in Chapter 2. The second and third columns give the prediction thresholds used. The fourth column gives overall argument F1 scores, and the last column gives the loss with respect to the standard evaluation task in which the system is given an argument-bearing predicate (this was used in the previous chapter).

Predicate classifier used    t_pred    t_arg    All-token argument F1 (%)    Loss (%)
Baseline                     N/A       N/A      69.3                         8.5
MLE                          0.23      0.44     69.9                         7.7
Logistic regression          0.32      0.43     71.1                         6.1

Similarly to the predicate classification evaluation of Section 3.5.1, one can view argument classification performance with respect to the prior probability that a nominal bears local arguments as determined by the training data. This is shown in Figure 3.4 for the three configurations. The configuration using the MLE nominal predicate classifier obtained an argument F1 of zero for nominal predicates below its prediction threshold. Compared to the baseline predicate classifier, the LibLinear classifier achieved argument classification gains as large as 163% (interval 0.05), with an average gain of 58% for intervals 0.05 to 0.3. As with nominal classification, argument classification gains versus the baseline diminish for nominal predicates that occupy intervals further to the right of the graph. I observed an average gain of only 1% for intervals 0.35 through 1.00.

A couple of factors contribute to this result. First, the feature-based predicate model is not substantially more accurate than the baseline predicate model for the highest intervals (see Figure 3.3 on page 50). Second, the argument prediction model has substantially more training data for the nominal predicates in intervals 0.35 to 1.00; NomBank contains many more instances of these predicates than the predicates occupying lower intervals. Thus, even if the nominal classifier makes a false positive prediction in the 0.35 to 1.00 interval range, the argument model may correctly avoid labeling any arguments.

Figure 3.4: All-token argument classification performance with respect to the distribution in Figure 3.1 (p. 42). The y-axis denotes the combined argument F1 for nominal predicates in the interval.

As noted in Section 3.3, the results in Table 3.3 are not directly comparable to the results of the recent CoNLL Shared Task (Surdeanu et al., 2008). This is because the semantic labeled F1 score used in the Shared Task combined predicate and argument predictions into a single score. The same combined F1 score for my best two-stage nominal SRL system (logistic regression predicate and argument models) is 79.1%.
This compares favorably to the best score of 76.6% reported by Surdeanu et al.

3.5.3 NomLex-based analysis of results

As mentioned previously, NomBank annotates many classes of deverbal and non-deverbal predicates. These predicates have been semi-automatically categorized on syntactic and semantic bases by the NomLex-PLUS resource (Meyers, 2007b). To help understand what types of nominal predicates are particularly affected by implicit argumentation, I further analyzed performance with respect to these classes. Recall Figure 2.3 (p. 20), which shows the distribution of nominal predicates across classes defined by the NomLex resource. As shown in Figure 3.5, many of the most frequent classes exhibit significant gains. For example, the classification of partitive nominal predicates (13% of all nominal instances) with the LibLinear classifier results in gains of 55.5% and 33.7% over the baseline and MLE classifiers, respectively. For the five most common classes, which constitute 82% of all nominal predicate instances, I observed average gains of 27.5% and 19.3% over the baseline and MLE classifiers, respectively.

Figure 3.5: Nominal classification performance with respect to the NomLex classes in Figure 2.3. The y-axis denotes the combined F1 for nominal predicates in the class.

Table 3.4 separates predicate and argument classification results into sets of deverbal (NomLex class nom), deverbal-like (NomLex class nom-like), and all other nominal predicates. A deverbal-like predicate is closely related to some verb, although not morphologically. For example, the noun accolade shares argument interpretation with the verb award, but the two are not morphologically related.

Table 3.4: Predicate and combined predicate-argument classification F1 scores for deverbal, deverbal-like, and other nominal predicates in the all-token evaluation. The first column indicates which nominal classifier was used. All configurations used the nominal SRL system described in Chapter 2.

            Predicate F1 (%)                          Combined predicate-argument F1 (%)
            Deverbal   Deverbal-like   Other          Deverbal   Deverbal-like   Other
Baseline    79.8       67.9            67.6           70.6       67.4            74.5
MLE         83.0       73.3            74.9           72.1       66.4            76.8
LibLinear   92.6       88.3            89.1           72.8       71.8            78.5

As shown by Table 3.4, predicate classification tends to be easier - and argument classification harder - for deverbals when compared to other types of nominal predicates. For combined nominal-argument F1, the difference between deverbal/deverbal-like predicates and the others is due primarily to relational nominals, which are included in the Other column. Relational nominals are accurately classified by the logistic regression model (F1 = 0.95 in Figure 3.5); additionally, relational nominals exhibit a high rate of argument incorporation (i.e., predicate-as-argument behavior), which is easily handled by the maximum-likelihood model described in Section 2.3.

3.5.4 Analysis of end-to-end nominal SRL speed

The combined predicate-argument classification system presented in this chapter is capable of operating in an end-to-end fashion over completely unstructured text. In this section, I provide asymptotic and empirical analyses for the system's performance.
Figure 3.6: End-to-end nominal SRL architecture. [The pipeline runs documents through sentence segmentation (Gillick, 2009), part-of-speech tagging and syntactic parsing (Charniak and Johnson, 2005), and the nominal SRL stages of support verb identification, predicate identification, and argument identification / post-processing, producing SRL structures such as sell(seller = John, entity_sold = book, buyer = Mary) and pay(payer = Mary, paid = John, amount = $30).]

Processing components

Figure 3.6 shows the end-to-end nominal SRL architecture. Processing begins by segmenting each document into a sequence of sentences. This step is performed using Gillick's (2009) SVM-based segmenter. Each sentence is then tagged for part-of-speech and syntactic information using the August 2006 version of Charniak and Johnson's (2005) parser. The nominal SRL classifier chain then labels support verbs (see page 134), nominal predicates (page 45), and arguments (page 28). In the final post-processing step, argument conflicts are removed and incorporated arguments are identified (page 33). (The nominal SRL system is freely available for non-commercial use. Please contact the author at gerber.matthew@gmail.com for more information.) In the following sections, I provide analyses for various components in Figure 3.6.

Asymptotic analysis

One can analyze the computational complexity of the processing components with respect to either the number of tokens in a document (denoted by d) or the number of tokens in a sentence (denoted by s).

• Sentence segmentation is O(d). The component needs to scan the document for ambiguous punctuation marks, which might indicate the end of a sentence.

• Part-of-speech tagging and syntactic parsing is O(s^3). The Charniak parser is a re-ranker working on top of a chart parser that is O(s^3) (Charniak et al., 1998).

• Support verb identification is O(s). Each token is classified.

• Predicate identification is O(s). Each token is classified.

• Argument identification is O(s^2). The length s of a sentence is related to the number of nodes n in the sentence's perfect binary tree (this is the worst case) by s = ⌈n/2⌉. Thus, n/2 ≤ s < n/2 + 1 and 2s − 2 < n ≤ 2s. At most, then, there are 2s nodes in the tree for a sentence of length s. During argument identification, each of these nodes is classified for each predicate in the sentence. s is a theoretical upper bound for the number of predicates in a sentence of length s, giving O(s^2) for all argument node classifications; however, in practice there are far fewer than s predicates per sentence (see empirical results below).

• Argument post-processing is O(s^2). The post-processor looks at each argument node for each predicate node and detects/resolves conflicts in a constant number of operations. The maximum number of argument nodes for a predicate node is less than the number of nodes in the tree because argument nodes are not allowed to overlap
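The composition of these components can be sketched as a small driver function. The component interfaces (segmenter, parser, and the three classifiers) are hypothetical callables assumed for this sketch; the thresholds are the development-tuned values reported elsewhere in the chapter.

```python
def nominal_srl_pipeline(document, segmenter, parser, support_clf, pred_clf, arg_clf,
                         t_pred=0.47, t_arg=0.42):
    """Sketch of the end-to-end architecture in Figure 3.6.  Assumed interfaces:
    segmenter(doc) -> sentences, parser(sentence) -> tree,
    support_clf(tree) -> support-verb nodes,
    pred_clf(tree) -> [(node, score), ...],
    arg_clf(tree, predicate, supports) -> [(node, label, conf), ...]."""
    structures = []
    for sentence in segmenter(document):          # O(d) over the document
        tree = parser(sentence)                   # O(s^3) chart parsing + re-ranking
        supports = support_clf(tree)              # O(s)
        for node, score in pred_clf(tree):        # O(s)
            if score < t_pred:                    # skip nominals without local arguments
                continue
            args = [(n, lab, c) for n, lab, c in arg_clf(tree, node, supports)
                    if c >= t_arg]                # low-confidence heuristic
            structures.append((node, args))
    return structures
```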
There are O(s) predicate nodes in a sentence, for a total of O(s2 ) for argument conflict resolution. The post-processor also labels incorporated arguments for each of the detected predicates, adding O(s) for a total of O(s2 ). Note again that the practical number of predicates in a sentence of length s is far less than s. Empirical analysis To test the speed performance of the nominal SRL system empirically, I randomly selected 1000 documents from the Gigaword corpus (Graff, 2003). Documents from this corpus contain reports from a variety of newswire agencies. Thus, the genre is quite similar to that of the training data used for the nominal SRL system. In total, the 1000 document subcorpus contained approximately 232,000 tokens and 8,000 sentences. Each document was provided to the system without paragraph markings or any other structural information. The processing hardware consisted of standard desktop machine with a 2.8 GHz Pentium 4 CPU and 3 GB of main memory. Table 3.5 lists the key performance statistics. As shown, sentence splitting contributed a negligible amount of time (24 seconds) to the total of more than 3.5 hours. Three quarters of the time was spent performing part-of-speech tagging and syntactic parsing with the Char57 7 6 5 4 3 2 0 1 Processing time (seconds) POS tagging and syntactic parsing (top) Nominal SRL (bottom) 10 14 18 22 26 30 34 38 42 46 50 54 58 Sentence length Figure 3.7: Box plot of nominal SRL speed. Each vertical box shows the range of times observed for sentences with length given along the x-axis. Medians are indicated by the black bar in each box, and the box spans from the first to the third quartiles. niak parser. As mentioned above, syntactic analysis has the slowest worst-case performance of any of the components. Figure 3.7 demonstrates the cubic worst-case visually. Contrast this with the performance of the nominal SRL components, which consumed one quarter of the total time. Asymptotically, argument identification and post-processing are quadratic in the worst-case; however, the worst case (i.e., all tokens being predicates) rarely, if ever, occurs. This can be seen in Figure 3.7, where the slowing of the nominal SRL components 58 is roughly linear in the sentence length. Speed performance of the end-to-end nominal SRL system could be enhanced using two approaches. First, one could instantiate multiple instances of the architecture shown in Figure 3.6 using multiple systems. Since there are no inter-document dependencies, documents can be distributed among these systems, increasing performance. Second, one could replace the cubic time syntactic parser with a linear time dependency parser (Sagae and Lavie, 2005). Dependency parsing has received a significant amount of attention in recent years. Although its parsing formalism is not as expressive as the constituent formalism used by parsers such as Charniak’s, dependency parsing often produces useful results very efficiently. 3.6 Conclusions The application of nominal SRL to practical NLP problems requires a system that is able to accurately process each token it encounters. Previously, it was unclear whether the models proposed by Jiang and Ng (2006) and Liu and Ng (2007) would operate effectively in such an environment. The systems described by Surdeanu et al. (2008) are designed with this environment in mind, but their evaluation did not focus on the issue of implicit argumentation. These two problems motivate the work presented in this chapter. The contribution of this chapter is three-fold. 
First, it shows that the state-of-the-art nominal SRL system of the previous chapter suffers a substantial performance degradation when evaluated over nominal predicates whose arguments are implicit. Second, it identifies a set of features - many of them new - that can be used to accurately detect nominal predicates with local arguments, thus increasing the overall performance of the nominal SRL system. The nominal predicate model also allows the nominal SRL system to operate in real-world settings over completely unstructured text. Third, the evaluation results suggest interesting directions for future work. As described in Section 3.5.2, many nominal predicates do not have enough labeled training data to produce accurate argument classifiers. The generalization procedures developed by Gordon and Swanson (2007) for PropBank SRL and Padó et al. (2008) for NomBank SRL might alleviate this problem.

Most important, however, is the following observation: the logistic regression nominal predicate classifier is able to accurately filter out predicates whose arguments are implicit; however, this model cannot actually recover implicit arguments, which are often expressed in the surrounding discourse. It is also the case that nominal predicates with local arguments often have additional implicit arguments somewhere in the discourse; these, too, are ignored by the models in this chapter. In the following chapter, I describe an in-depth study of implicit arguments and their recovery from the discourse. This topic has received very little attention from NLP researchers; however, as I will show, implicit argument recovery is (1) a fundamental process within discourse semantics, and (2) a process that can be modeled effectively.

CHAPTER 4
Identifying implicit arguments

4.1 Introduction

The previous chapter showed that it is possible to accurately distinguish nominals that bear local arguments from those that do not. This is an important step because it frees us from the assumption that all nominal predicates take local arguments - an assumption shown to be false. Ultimately, the goal is to use the output of the nominal SRL system as input for other NLP tasks such as QA, IE, and SMT. Being able to process all tokens in a document is essential for such tasks. Despite these improvements, though, the system developed in the previous chapter does not address a fundamental question regarding implicit arguments: if an argument is implicit (i.e., missing) in the local context of a predicate, might the argument be located somewhere in the wider discourse? The previous chapter stopped short of answering this question, opting instead for an approach that ignores predicates whose arguments are implicit. The current chapter directly addresses this important question.

As an initial example, consider the following sentence, which is taken from the Penn TreeBank:

(4.1) A SEC proposal to ease [Arg1 reporting] [Predicate requirements] [Arg2 for some company executives] would undermine the usefulness of information on insider trades, professional money managers contend.

The NomBank role set for requirement is shown below:

Frame for requirement, role set 1:
Arg0: the entity that is requiring something
Arg1: the entity that is required
Arg2: the entity of which something is being required

In Example 4.1, the predicate has been annotated with the local argument labels provided by NomBank.
As shown, NomBank does not annotate an Arg0 for this instance of the requirement predicate; however, a reasonable interpretation of the sentence is that SEC is the entity that is requiring something.1 This dissertation refers to arguments such as SEC in Example 4.1 as implicit. When all arguments for a predicate are implicit, one obtains the situation addressed in the previous chapter; however, this is the extreme case. In Example 4.1, some arguments are local (i.e., Arg1 and Arg2) and some are not (i.e., Arg0 = SEC).

Building on Example 4.1, consider the following sentence, which directly follows Example 4.1 in the corresponding TreeBank document:

(4.2) Money managers make the argument in letters to the agency about [Arg1 rule] [Predicate changes] proposed this past summer.

The NomBank role set for change is shown below:

Frame for change, role set 1:
Arg0: the entity that initiates the change
Arg1: the entity that is changed
Arg2: the initial state of the changed entity
Arg3: the final state of the changed entity

Similarly to the previous example, 4.2 shows the local argument labels provided by NomBank. These labels only indicate that rules have been changed. For a full interpretation, Example 4.2 requires an understanding of Example 4.1. Without the latter, the reader has no way of knowing that the agency in 4.2 actually refers to the same entity as SEC in 4.1. As part of the reader's comprehension process, this entity is identified as the filler for the Arg0 role in Example 4.2. This identification must occur in order for these two sentences to form a coherent discourse.

1 The Securities and Exchange Commission (SEC) is responsible for enforcing investment laws in the United States.

From these examples, it is clear that the scope of implicit arguments quite naturally spans sentence boundaries. Thus, if one wishes to recover implicit arguments as part of the SRL process, the argument search space must be expanded beyond the traditional, single-sentence window used in virtually all prior SRL research. What can we hope to gain from such a fundamental modification of the problem? Consider the following question, which targets Examples 4.1 and 4.2 above:

(4.3) Who changed the rules regarding reporting requirements?

Question 4.3 is a factoid question, meaning it has a short, unambiguous answer in the targeted text. This type of question has been studied extensively in the Text Retrieval Conference (TREC) Question Answering Track (Dang et al., 2007). Using the evaluation data from this track, Pizzato and Mollá (2008) showed that SRL can improve the accuracy of a QA system; however, a traditional SRL system alone is not enough to recover the implied answer to Question 4.3: SEC or the agency. Successful implicit argument identification provides the answer in this case.

This chapter presents an in-depth study of implicit arguments for nominal predicates.2 The following section surveys a broad spectrum of research related to implicit argument identification. Section 4.3 describes the study's implicit argument annotation process and the data it produced. The implicit argument identification model is formulated in Section 4.4 and evaluated in Section 4.5. Discussion of results is provided in Section 4.6, and the chapter concludes in Section 4.7.

4.2 Related work

The research presented in this chapter is related to a wide range of topics in cognitive science, linguistics, and NLP. This is partly due to the discourse-based nature of the problem.
In single-sentence SRL, one can ignore the discourse aspect of language and still obtain high marks in an evaluation (for examples, see Carreras and Màrquez (2005) and Surdeanu et al. (2008)); however, implicit argumentation forces one to consider the discourse context in which a sentence exists. Much has been said about the importance of discourse to language understanding, and this section will identify the points most relevant to implicit argumentation.

2 A condensed version of this study was published by Gerber and Chai (2010).

4.2.1 Discourse comprehension in cognitive science

In linguistics, the traditional view of sentence-level semantics has been that meaning is compositional. That is, one can derive the meaning of a sentence by carefully composing the meanings of its constituent parts (Heim and Kratzer, 1998). There are counterexamples to a compositional theory of semantics (e.g., idioms), but those are more the exception than the rule. Things change, however, when one starts to group sentences together to form coherent textual discourses. Consider the following examples, borrowed from Sanford (1981) (p. 5):

(4.4) Jill came bouncing down the stairs.
(4.5) Harry rushed off to get the doctor.

Examples 4.4 and 4.5 describe three events: bounce, rush, and get. These events are intricately related. One cannot simply create a conjunction of the propositions bounce, rush, and get and expect to arrive at the author's intended meaning, which presumably involves Jill's becoming injured by her fall and Harry's actions to help her. The mutual dependence of these sentences can be further shown by considering a variant of the situation described in Examples 4.4 and 4.5:

(4.6) Jill came bouncing down the stairs.
(4.7) Harry rushed over to kiss her.

The interpretation of Example 4.6 is vastly different from the interpretation of Example 4.4. In 4.4, Jill becomes injured whereas in 4.6 she is quite happy. Examples 4.4-4.7 demonstrate the fact that sentences do not have a fixed, compositional interpretation; rather, a sentence's interpretation depends on the surrounding context. The standard compositional theory of sentential semantics largely ignores contextual information provided by other sentences. The single-sentence approach to SRL operates similarly. In both of these methods, the current sentence provides all of the semantic information. In contrast to these methods - and aligned with the preceding discussion - this chapter presents methods that rely heavily on surrounding sentences to provide additional semantic information. This information is used to interpret the current sentence in a more complete fashion.

Examples 4.4-4.7 also show that the reader's knowledge plays a key role in discourse comprehension. Researchers in cognitive science have proposed many models of reader knowledge. Schank and Abelson (1977) proposed stereotypical event sequences called scripts as a basis for discourse comprehension. In this approach, readers fill in a discourse's semantic gaps with knowledge of how a typical event sequence might unfold. In Examples 4.4 and 4.5, the reader knows that people typically call on a doctor only if someone is hurt. Thus, the reader automatically fills the semantic gap caused by the ambiguous predicate bounce with information about doctors and what they do. Similar observations have been made by van Dijk (1977) (p. 4), van Dijk and Kintsch (1983) (p. 303), Graesser and Clark (1985) (p. 14), and Carpenter et al. (1995).
Inspired by these ideas, the model developed in this chapter relies partly on large text corpora, which are treated as repositories of typical event sequences. The model uses information extracted from these event sequences to identify implicit arguments.

4.2.2 Automatic relation discovery

Examples 4.4 and 4.5 in the previous section show that understanding the relationships between predicates is a key part of understanding a textual discourse. In this section, I review work on automatic predicate relationship discovery, which attempts to extract these relationships automatically. Lin and Pantel (2001) proposed a system that automatically identifies relationships similar to the following:

(4.8) X eats Y ↔ X likes Y

This relationship creates a mapping between the participants of the two predicates. One can imagine using such a mapping to fill in the semantic gaps of a discourse that describes a typical set of events in a restaurant. In such a discourse, the author probably will not state directly that X likes Y; however, the reader might need to infer this in order to make sense of the fact that X left a large tip for the waiter. Lin and Pantel created mappings such as 4.8 using a variation of the so-called "distributional hypothesis" posited by Harris (1985), which states that words occurring in similar contexts tend to have similar meanings. Lin and Pantel applied the same notion of similarity to dependency paths. For example, the inference rule in Example 4.8 is identified by examining the sets of words in the two X positions and the sets of words in the two Y positions. When the two pairs of sets are similar, it is implied that the two dependency paths from X to Y are similar as well. In Example 4.8, the two dependency paths are as follows:

(4.9) X <--subject-- eats --object--> Y
      X <--subject-- likes --object--> Y

One drawback of this method is that it assumes the implication is symmetric. Although this assumption is correct in many cases, it often leads to invalid inferences. In Example 4.8, it is not always true that if X likes Y then X will eat Y. The opposite - that X eating Y implies X likes Y - is more plausible but not certain. Bhagat et al. (2007) extended the work of Lin and Pantel to handle cases of asymmetric relationships. The basic idea proposed by Bhagat et al. is that, when considering a relationship of the form ⟨x, p1, y⟩ ↔ ⟨x, p2, y⟩, if p1 occurs in significantly more contexts (i.e., has more options for x and y) than p2, then p2 is likely to imply p1 but not vice versa. Returning to Example 4.8, we see that the correct implication will be derived if likes occurs in significantly more contexts than eats. The intuition is that the more general concept (i.e., like) will be associated with more contexts and is more likely to be implied by the specific concept (i.e., eat). As shown by Bhagat et al., the system built around this intuition is able to effectively identify the directionality of many inference rules.

Zanzotto et al. (2006) presented another study aimed at identifying asymmetric relationships between verbs. For example, the asymmetric entailment relationship X wins → X plays holds, but the opposite (X plays → X wins) does not. This is because not all those who play a game actually win. To find evidence for this automatically, the authors examined constructions such as the following, adapted from Zanzotto et al.:

(4.10) The more experienced tennis player won the match.
The underlying idea behind the authors' approach is that asymmetric relationships such as X wins → X plays are often entailed by constructions involving agentive, nominalized verbs as the logical subjects of the main verb. In Example 4.10, the agentive nominal "player" is logical subject to "won", the combination of which entails the asymmetric relationship of interest. Thus, to validate such an asymmetric relationship, Zanzotto et al. examined the frequency of the "player win" collocation using Google hit counts as a proxy for actual corpus statistics.

A number of other studies (e.g., those by Szpektor et al. (2004) and Pantel et al. (2007)) have been conducted that are similar to the work described above. In general, such work focuses on the automatic acquisition of entailment relationships between verbs. Although this work has often been motivated by the need for lexical-semantic information in tasks such as automatic question answering, it is also relevant to the task of implicit argument identification because the derived relationships implicitly encode a participant role mapping between two predicates. For example, given a missing Arg0 for a like predicate and an explicit Arg0 = John for an eat predicate in the preceding discourse, inference rule 4.8 would help identify the implicit Arg0 = John for the like predicate. The missing link between previous work on verb relationship identification and the task of implicit argument identification is that previous verb relations are not defined in terms of the Argn positions used by NomBank. Rather, positions like subject and object are used (see Example 4.9). In order to identify implicit arguments in NomBank, one needs inference rules between specific argument positions (e.g., eat:Arg0 and like:Arg0). In the current chapter, I propose methods of automatically acquiring these fine-grained relationships for verbal and nominal predicates using existing corpora. I also propose a method of using these relationships to recover implicit arguments.

4.2.3 Coreference resolution and discourse processing

The current chapter will make heavy use of the notions of reference and coreference. The referent of a linguistic expression is the real or imagined entity to which the expression refers. Coreference, therefore, is the condition of two linguistic expressions having the same referent. In the following examples from the Penn TreeBank, the underlined spans of text are coreferential:

(4.11) "Carpet King sales are up 4% this year," said owner Richard Rippe.
(4.12) He added that the company has been manufacturing carpet since 1967.

Non-trivial instances of coreference (e.g., Carpet King and the company) allow the author to repeatedly mention the same entity without introducing redundancy into the discourse. Pronominal anaphora is a subset of coreference in which one of the referring expressions is a pronoun. For example, he in Example 4.12 refers to the same entity as Richard Rippe in Example 4.11. These examples demonstrate noun phrase coreference. Events, indicated by either verbal or nominal predicates, can also be coreferential when mentioned multiple times in a document (Wilson, 1974; Chen and Ji, 2009). For many years, the Automatic Content Extraction (ACE) series of large-scale evaluations (ACE, 2008) has provided a test environment for systems designed to identify these and other coreference relations.
Systems based on the ACE datasets typically take a supervised learning approach to coreference resolution in general (Versley et al., 2008) and pronominal anaphora in particular (Yang et al., 2008).

A phenomenon similar to the implicit argument has been studied in the context of Japanese anaphora resolution, where a missing case-marked constituent is viewed as a zero-anaphoric expression whose antecedent is treated as the implicit argument of the predicate of interest. This behavior has been annotated manually by Iida et al. (2007), and researchers have applied standard SRL techniques to this corpus, resulting in systems that are able to identify missing case-marked expressions in the surrounding discourse (Imamura et al., 2009). Sasano et al. (2004) conducted similar work with Japanese indirect anaphora. The authors used automatically derived nominal case frames to identify antecedents. However, as noted by Iida et al., grammatical cases do not stand in a one-to-one relationship with semantic roles in Japanese (the same is true for English).

Many other discourse-level phenomena interact with coreference. For example, Centering Theory (Grosz et al., 1995) focuses on the ways in which referring expressions maintain (or break) coherence in a discourse. These so-called "centering shifts" result from a lack of coreference between salient noun phrases in adjacent sentences. Discourse Representation Theory (DRT) (Kamp and Reyle, 1993) is another prominent treatment of referring expressions. DRT embeds a theory of coreference into a first-order, compositional semantics of discourse. In Centering Theory, DRT, and the ACE coreference competitions, coreference relationships hold between relatively small constituents in one or more sentences (e.g., a pronoun and its noun phrase antecedent); however, researchers have also investigated relationships that hold between larger segments of text, including full sentences. Consider the following example, adapted from Rhetorical Structure Theory (Taboada and Mann, 2006):

(4.13) [Objective The visual system resolves confusion] [Means by applying knowledge of properties of the physical world].

In Example 4.13, the objective (resolution of confusion) is accomplished by a particular means (application of knowledge). RST analyses do not depend on "trigger" words in the way that PropBank, NomBank, and FrameNet do. Rather, segments of text are identified and the relationships between them are then inferred. Prasad et al. (2008) take a slightly different approach to discourse-level annotation, one that relies heavily on lexical cues to guide the annotation process. The resulting resource, called the Penn Discourse TreeBank (PDTB), identifies RST-like relationships that obtain between large fragments of text. Consider the following example, taken from the PDTB:

(4.14) [Arg1 Use of dispersants was approved] when [Arg2 a test on the third day showed some positive results].

In Example 4.14, I have underlined the lexical item (when) that triggers the Reason discourse relationship between the bracketed spans of text. Argument position when:Arg1 indicates the effect, and argument position when:Arg2 indicates the cause. The latter contains an instance of the nominal predicate test, whose Arg1 position (the entity tested) is implicitly filled by dispersants. The identification of this implicit argument is encouraged by the discourse connective when, which indicates a strong relationship between the events described in its argument positions.
In Section 4.4.2, I will explore the use of PDTB relationships for implicit argument identification.

4.2.4 Identifying implicit arguments

Past research on the actual task of implicit argument identification tends to be sparse. Palmer et al. (1986) describe what appears to be the first computational treatment of implicit arguments. In this work, Palmer et al. manually created a repository of knowledge concerning entities in the domain of electronic device failures. This knowledge, along with hand-coded syntactic and semantic processing rules, allowed the system to identify implicit arguments across sentence boundaries. As a simple example, consider the following two sentences, borrowed from Palmer et al.:

(4.15) Disk drive was down at 11/16-2305.
(4.16) Has select lock.

Example 4.16 does not specify precisely which entity has select lock. However, the domain knowledge tells the system that only disk drive entities can have such a property. Using this knowledge, the system is able to search the local context and make explicit the implied fact that the disk drive from Example 4.15 has select lock. A similar line of work was pursued by Whittemore et al. (1991), who offer the following example of implicit argumentation (p. 21):

(4.17) Pete bought a car.
(4.18) The salesman was a real jerk.

In Example 4.17, the buy event is not associated with an entity representing the seller. This entity is introduced in Example 4.18 as the salesman, whose semantic properties satisfy the requirements of the buy event. Whittemore et al. build up the event representation incrementally using a combination of semantic property constraints and Discourse Representation Theory. The systems developed by Palmer et al. and Whittemore et al. are quite similar. They both make use of semantic constraints on arguments, otherwise known as selectional preferences. Selectional preferences have received a significant amount of attention over the years, with the work of Ritter et al. (2010) being some of the most recent. The model developed in the current chapter uses a variety of selectional preference measures to identify implicit arguments.

The implicit argument identification systems described above were not widely deployed due to their reliance on hand-coded, domain-specific knowledge that is difficult to create. Much of this knowledge targeted basic syntactic and semantic constructions that now have robust statistical models (e.g., those created by Charniak and Johnson (2005) for syntax and Punyakanok et al. (2005) for semantics). With this information accounted for, it is easier to approach the problem of implicit argumentation. Below, I describe a series of recent investigations that have led to a surge of interest in statistical implicit argument identification.

Fillmore and Baker (2001) provided a detailed case study of FrameNet frames as a basis for understanding written text (see page 13 for the details of FrameNet). In their case study, Fillmore and Baker manually build up a semantic discourse structure by hooking together frames from the various sentences. In doing so, the authors resolve some implicit arguments found in the discourse. This process is an interesting step forward; however, the authors did not provide concrete methods to perform the analysis automatically. Nielsen (2004) developed a system that is able to detect the occurrence of verb phrase ellipsis. Consider the following sentences:

(4.19) John kicked the ball.
(4.20) Bill [did], too.
The bracketed text in Example 4.20 is a placeholder for the verb phrase kicked the ball in Example 4.19, which has been elided (i.e., left out). Thus, in 4.20, Bill can be thought of as an implicit argument to some kicking event that is not mentioned. If one resolved the verb phrase ellipsis, then the implicit argument would be recovered. Nielsen created a system able to detect the presence of ellipses, producing the bracketing in 4.20. Ellipsis resolution (i.e., figuring out precisely which verb phrase is missing) was described by Nielsen (2005). Implicit argument identification for nominal predicates is complementary to verb phrase ellipsis resolution: both work to make implicit information explicit.

Burchardt et al. (2005) suggested that frame elements from various frames in a text could be linked to form a coherent discourse interpretation (this is similar to the idea described by Fillmore and Baker (2001)). The linking operation causes two frame elements to be viewed as coreferent. Burchardt et al. propose to learn frame element linking patterns from observed data; however, the authors did not implement and evaluate such a method. Building on the work of Burchardt et al., this dissertation presents a model of implicit arguments that uses a quantitative analysis of naturally occurring coreference patterns. The previous chapter, a condensed version of which was published by Gerber et al. (2009), demonstrated the importance of filtering out nominal predicates that take no local arguments. This approach leads to appreciable gains for certain nominals. However, the approach does not attempt to actually recover implicit arguments.

Most recently, Ruppenhofer et al. (2009) conducted SemEval Task 10, "Linking Events and Their Participants in Discourse", which evaluated implicit argument identification systems over a common test set. The task organizers annotated implicit arguments across entire passages, resulting in data that cover many distinct predicates, each associated with a small number of annotated instances. As described by Ruppenhofer et al. (2010), three submissions were made to the competition, with two of the submissions attempting the implicit argument identification part of the task. Chen et al. (2010) extended a standard SRL system by widening the candidate window to include constituents from other sentences. A small number of features based on the FrameNet frame definitions were extracted for these candidates, and prediction was performed using a log-linear model. Tonelli and Delmonte (2010) also extended a standard SRL system. Both of these systems achieved an implicit argument F1 score of less than 0.02. The organizers and participants appear to agree that training data sparseness was a significant problem. This is likely the result of the annotation methodology: entire documents were annotated, causing each predicate to receive a very small number of annotated examples.

In contrast to the evaluation described by Ruppenhofer et al. (2010), the study presented in this chapter focused on a select group of nominal predicates. To help prevent data sparseness, the size of the group was small, and the predicates were carefully chosen to maximize the observed frequency of implicit argumentation. I annotated a large number of implicit arguments for this group of predicates with the goal of training models that generalize well to the testing data. In the following section, I describe the implicit argument annotation process and resulting dataset.
4.3 Empirical analysis

As shown in the previous section, the existence of implicit arguments has been recognized for quite some time. However, this type of information was not formally annotated until Ruppenhofer et al. (2009) conducted their SemEval task on implicit argument identification. There are two reasons why I chose to create an independent dataset for implicit arguments. The first reason is the aforementioned sparsity of the SemEval dataset. The second reason is that the SemEval dataset is not built on top of the Penn TreeBank, which is the gold-standard syntactic base for all work in this dissertation. Working on top of the Penn TreeBank makes the annotations immediately compatible with PropBank, NomBank, and a host of other resources that also build on the TreeBank.

4.3.1 Data annotation

Predicate selection

Because implicit arguments were a new subject of annotation in the field, it was important to focus on a select group of nominal predicates. Predicates in this group were required to meet the following criteria:

1. A selected predicate must have an unambiguous role set. This criterion corresponds roughly to an unambiguous semantic sense and is motivated by the need to separate the implicit argument behavior of a predicate from its semantic meaning.

2. A selected predicate must be derived from a verb. This dissertation focuses primarily on the event structure of texts. Nominal predicates derived from verbs denote events, but there are other, non-eventive predicates in NomBank (e.g., the partitive %). This criterion also implies that the annotated predicates have correlates in PropBank with semantically compatible role sets.

3. A selected predicate should have a high frequency in the Penn TreeBank corpus. This criterion ensures that the evaluation results say as much as possible about the event structure of the underlying corpus. I calculated frequency with basic counting over morphologically normalized predicates (i.e., bids and bid are counted as the same predicate).

4. A selected predicate should express many implicit arguments. Of course, this can only be estimated ahead of time because no data exist to compute it. To estimate this value for a predicate p, I first calculated Np, the average number of roles expressed by p in NomBank. I then calculated Vp, the average number of roles expressed by the verb form of p in PropBank. I hypothesized that the difference Vp − Np gives an indication of the number of implicit arguments that might be present in the text for a nominal instance of p. The motivation for this hypothesis is as follows. Most verbs must be explicitly accompanied by specific arguments in order for the resulting sentence to be grammatical. The following sentences are ungrammatical if the parenthesized portion is left out:

(4.21) *John loaned (the money to Mary).
(4.22) *John invested (his money).

Examples 4.21 and 4.22 indicate that certain arguments must explicitly accompany loan and invest. In nominal form, these predicates can exist without such arguments and still be grammatical:

(4.23) John's loan was not repaid.
(4.24) John's investment was huge.

Note, however, that Examples 4.23 and 4.24 are not reasonable things to write unless the missing arguments were previously mentioned in the text. This is precisely the type of noun that should be targeted for implicit argument annotation. The value of Vp − Np thus quantifies the desired behavior.

Predicates were filtered according to criteria 1 and 2 and ranked according to the product of 3 and 4; a brief sketch of this ranking computation is given below.
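To make the ranking criterion concrete, the following minimal sketch (in Python) computes Np, Vp, and the frequency-weighted difference from per-instance role counts. The dictionaries, counts, and function names shown here are illustrative stand-ins rather than part of the actual annotation tooling; real values would be derived from NomBank, PropBank, and the Penn TreeBank.

    def average_roles(role_counts):
        # role_counts maps a predicate to the number of roles expressed by
        # each of its annotated instances.
        return {pred: sum(counts) / len(counts) for pred, counts in role_counts.items()}

    def rank_predicates(nominal_roles, verbal_roles, frequency):
        # Rank by frequency * (Vp - Np); larger values suggest more implicit
        # arguments per nominal instance.
        n_avg = average_roles(nominal_roles)   # Np per predicate
        v_avg = average_roles(verbal_roles)    # Vp per predicate (verb form)
        scores = {pred: frequency[pred] * (v_avg[pred] - n_avg[pred])
                  for pred in n_avg if pred in v_avg and pred in frequency}
        return sorted(scores, key=scores.get, reverse=True)

    # Toy counts for two predicates, for illustration only.
    nominal = {"bid": [1, 0, 1], "sale": [1, 2, 1]}
    verbal = {"bid": [2, 3, 2], "sale": [2, 2, 2]}
    frequency = {"bid": 90, "sale": 180}
    print(rank_predicates(nominal, verbal, frequency))  # ['bid', 'sale'] for these counts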
I then selected the top ten, which are shown in the first column of Table 4.1. The role sets (i.e., argument definitions) for these predicates can be found in Appendix Section A.4 on page 137.

Annotation procedure

I annotated implicit arguments for instances of the ten selected nominal predicates. The annotation process proceeded document-by-document. For a document d, I annotated implicit arguments as follows:

1. Select from d all non-proper singular and non-proper plural nouns that are morphologically related to the ten predicates in Table 4.1.

2. By design, each selected noun has an unambiguous role set. Thus, given the arguments supplied for a noun by NomBank, one can consult the noun's role set to determine which arguments are missing.3

3. For each missing argument position, search the current sentence and all preceding sentences for a suitable implicit argument. Annotate all suitable implicit arguments in this window.

4. As often as possible, match the extent of an implicit argument to the extent of an argument given by either PropBank or NomBank. This was done to maintain compatibility with these and other resources.

In the remainder of this dissertation, I will use iargn to refer to an implicit argument position n. I will use argn to refer to an argument provided by PropBank or NomBank. I will use p to mark predicate instances. Below, I give an example annotation for an instance of the investment predicate:

3 See page 137 for the list of role sets used in this study.

(1) Pred.     (2) # Pred.  (3) # Imp./pred.  (4)    (5)        (6)        (7)    (8)
bid                 88         1.4           26.9   0.8 (0.6)  2.2 (0.6)  73.9   2.2 (0.9)
sale               184         1.0           24.2   1.2 (0.7)  2.0 (0.7)  44.0   2.2 (0.9)
loan                84         1.0           22.1   1.1 (1.1)  2.5 (0.5)  41.7   2.1 (1.1)
cost               101         0.9           26.2   1.0 (0.7)  2.3 (0.5)  47.5   1.9 (0.6)
plan               100         0.8           30.8   1.2 (0.8)  1.8 (0.4)  50.0   2.0 (0.4)
investor           160         0.7           35.0   1.1 (0.2)  2.0 (0.7)  57.5   1.7 (0.6)
price              216         0.6           42.5   1.7 (0.5)  1.7 (0.5)  58.6   2.3 (0.6)
loss               104         0.6           33.2   1.3 (0.9)  2.0 (0.6)  48.1   1.9 (0.7)
investment         102         0.5           15.7   0.5 (0.7)  2.0 (0.7)  33.3   1.0 (1.0)
fund               108         0.5            8.3   0.3 (0.7)  2.0 (0.3)  21.3   0.9 (1.2)
Overall          1,247         0.8           28.0   1.1 (0.8)  2.0 (0.6)  47.8   1.9 (0.9)

Table 4.1: Annotation data analysis. Columns are defined as follows: (1) the annotated predicate, (2) the number of predicate instances that were annotated, (3) the average number of implicit arguments per predicate instance, (4) of all roles for all predicate instances, the percentage filled by NomBank arguments, (5) the average number of NomBank arguments per predicate instance, (6) the average number of PropBank arguments per instance of the verb form of the predicate, (7) of all roles for all predicate instances, the percentage filled by either NomBank or implicit arguments, (8) the average number of combined NomBank/implicit arguments per predicate instance. SD indicates the standard deviation with respect to an average.

(4.25) [iarg0 Participants] will be able to transfer [iarg1 money] to [iarg2 other investment funds]. The [p investment] choices are limited to [iarg2 a stock fund and a money-market fund].

NomBank does not associate this instance of investment with any arguments; however, one can easily identify the investor (iarg0), the thing invested (iarg1), and two mentions of the thing invested in (iarg2) within the surrounding discourse. Of course, not all implicit argument decisions are as easy as those in Example 4.25.
Consider the following contrived example:

(4.26) People in other countries could potentially consume large amounts of [iarg0? Coke].
(4.27) Because of this, there are [p plans] to expand [iarg0 the company's] international presence.

Example 4.27 contains one mention of the iarg0 (the agentive planner). It might be tempting to also mark Coke in Example 4.26 as an additional iarg0; however, the only reasonable interpretation of Coke in 4.26 is as a consumable fluid. Fluids cannot plan things, so this annotation should not be performed. This is a case of metonymy between Coke as a company and Coke as a drink. In all such cases, I inspected the implied meaning of the term when deciding whether to apply an implicit argument label.

Lastly, it should be noted that I placed no restrictions on embedded arguments. PropBank and NomBank do not allow argument extents to overlap. Traditional SRL systems such as the one created by Punyakanok et al. (2008) model this constraint explicitly to arrive at the final label assignment; however, as the following example shows, this constraint should not be applied to implicit arguments:

(4.28) Currently, the rules force [iarg0 executives, directors and other corporate insiders] to report purchases and [p sales] [arg1 of [iarg0 their] companies' shares] within about a month after the transaction.

Despite its embedded nature, the pronoun their in Example 4.28 is a perfectly reasonable implicit argument (the seller) for the marked predicate. Systems should be required to identify such arguments.

Inter-annotator agreement

Implicit argument annotation is a difficult task because it combines the complexities of traditional SRL annotation with those of coreference annotation. To assess the reliability of the annotation process described above, I compared my annotations to those provided by an undergraduate linguistics student who, after a brief training period, re-annotated a portion of the dataset. For each missing argument position, the student was asked to identify the textually closest acceptable implicit argument within the current and preceding sentences. The argument position was left unfilled if no acceptable constituent could be found. For a missing argument position iargn, the student's annotation agreed with my own if both identified the same implicit argument or both left iargn unfilled. The student annotated 480 of the 1,247 predicate instances shown in Table 4.1. I computed Cohen's chance-corrected kappa statistic for inter-annotator agreement (Cohen, 1960), which is based on two quantities:

po = observed probability of agreement
pc = probability of agreement by chance

The quantity 1 − pc indicates the probability of a chance disagreement. The quantity po − pc indicates the probability of agreement that cannot be accounted for by chance alone. Finally, Cohen defines κ as follows:

κ = (po − pc) / (1 − pc)

Cohen's kappa thus gives the probability that a chance-expected disagreement will not occur. When agreement is perfect, κ = 1. If the observed agreement is less than the expected chance agreement, then κ will be negative. As noted by Di Eugenio and Glass (2004), researchers have devised different scales to assess κ. Many NLP researchers use the scale created by Krippendorff (1980):

κ < 0.67            low agreement
0.67 <= κ < 0.8     moderate agreement
κ >= 0.8            strong agreement

However, Di Eugenio and Glass also note that this scale has not been rigorously defended, even by Krippendorff himself.
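As a concrete illustration of the statistic, the following sketch computes κ from observed and chance agreement. The values in the example call are hypothetical and chosen only to show the scale of the result; the actual po and pc for this study are defined in the next paragraph.

    def cohens_kappa(p_o, p_c):
        # Agreement above chance, normalized by the maximum possible agreement
        # above chance. Equals 1 for perfect agreement and is negative when
        # observed agreement falls below chance agreement.
        return (p_o - p_c) / (1.0 - p_c)

    # Hypothetical values for illustration only.
    print(cohens_kappa(p_o=0.80, p_c=0.44))  # approximately 0.64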
For the implicit argument annotation data, observed and chance agreement are defined as follows:

po = ( Σ_iargn agree(iargn) ) / N

pc = ( Σ_iargn [ PA(n) ∗ PB(n) ∗ random agree(iargn) + (1 − PA(n)) ∗ (1 − PB(n)) ] ) / N    (4.29)

where N is the total number of missing argument positions that need to be annotated, agree is equal to 1 if the two annotators agreed on iargn and 0 otherwise, PA(n) and PB(n) are the observed prior probabilities that annotators A and B assign a label of n, and random agree is equal to the probability that both annotators would select the same implicit argument for iargn when choosing randomly from the discourse. In Equation 4.29, terms to the right of + denote the probability that the two annotators agreed on iargn because they did not identify a filler for it.

Using the above values for po and pc, Cohen's kappa indicated an agreement of 64.3%. According to Krippendorff's scale, this value is borderline between low and moderate agreement. Possible causes for this low agreement include the brief training period for the linguistics student and the sheer complexity of the annotation task. If one considers only those argument positions for which both annotators actually located an implicit filler, Cohen's kappa indicates an agreement of 93.1%. This shows that much of the disagreement concerned the question of whether a filler was present. Having agreed that a filler was present, the annotators consistently selected the same filler. The student's annotations were only used to compute agreement. I performed all training and evaluation using randomized cross-validation over the annotations I created.

4.3.2 Annotation analysis

I carried out the annotation process described above on the standard training (2-21), development (24), and testing (23) sections of the Penn TreeBank. Table 4.1 on page 77 summarizes the results. Below, I highlight key pieces of information found in this table.

Implicit arguments are frequent

Column three of Table 4.1 shows that most predicate instances are associated with at least one implicit argument. Implicit arguments vary across predicates, with bid exhibiting (on average) more than one implicit argument per instance versus the 0.5 implicit arguments per instance of the investment and fund predicates. It turned out that the latter two predicates have unique senses that preclude implicit argumentation (more on this in Section 4.6).

Implicit arguments create fuller event descriptions

Role coverage for a predicate instance is equal to the number of filled roles divided by the number of roles in the predicate's role set. Role coverage for the marked predicate in Example 4.25 (p. 78) is 0/3 for NomBank-only arguments and 3/3 when the annotated implicit arguments are also considered. Returning to Table 4.1, the fourth column gives role coverage percentages for NomBank-only arguments. The seventh column gives role coverage percentages when both NomBank arguments and the annotated implicit arguments are considered. Overall, the addition of implicit arguments created a 71% relative (20-point absolute) gain in role coverage across the 1,247 predicate instances that I annotated.

[Figure 4.1: Location of implicit arguments (x-axis: number of sentences prior to the predicate; y-axis: fraction of implicitly filled argument positions resolved).
Of all implicitly filled argument positions, the y-axis indicates the percentage that are filled at least once within the number of sentences indicated by the x-axis (multiple fillers may exist for the same position).]

When I introduced the NomBank resource, I observed that approximately 87% of nominal predicate instances do not have a corresponding verbal form present in the document. This indicates that much of the information contained in NomBank is not redundant with PropBank. Because most of the NomBank predicates are novel, the implicit arguments associated with these predicates should also contribute novel information.

The Vp − Np predicate selection metric behaves as desired

The predicate selection method used the Vp − Np metric to identify predicates whose instances are likely to take implicit arguments. Column five in Table 4.1 shows that (on average) nominal predicates have 1.1 arguments in NomBank, compared to the 2.0 arguments per verbal form of the predicates in PropBank (compare columns five and six). I hypothesized that this difference might indicate the presence of approximately one implicit argument per predicate instance. This hypothesis is confirmed by comparing columns six and eight: when considering implicit arguments, many nominal predicates express approximately the same number of arguments on average as their verbal counterparts.

Most implicit arguments are nearby

In addition to the analyses described above, I examined the location of implicit arguments in the discourse. Figure 4.1 shows that approximately 56% of the implicit arguments in our data can be resolved within the sentence containing the predicate. Approximately 90% are found within the previous three sentences. The remaining implicit arguments require up to forty-six sentences for resolution. These observations are important; they show that searching too far back in the discourse is likely to produce many false positives without a significant increase in recall. Section 4.6 discusses additional implications of this skewed distribution.

4.4 Implicit argument model

4.4.1 Model formulation

Given a nominal predicate instance p with a missing argument position iargn, the task is to search the surrounding discourse for a constituent c that fills iargn. The implicit argument model conducts this search over all constituents that are marked with a core argument label (arg0, arg1, etc.) associated with a NomBank or PropBank predicate. Thus, the model assumes a pipeline organization in which a document is initially analyzed by traditional verbal and nominal SRL systems. The core arguments from this stage then become candidates for implicit argumentation. Adjunct arguments are excluded.

A candidate constituent c will often form a coreference chain with other constituents in the discourse. Consider the following abridged sentences, which are adjacent in their Penn TreeBank document:

(4.30) [Mexico] desperately needs investment.
(4.31) Conservative Japanese investors are put off by [Mexico's] investment regulations.
(4.32) Japan is the fourth largest investor in [c Mexico], with 5% of the total [p investments].

NomBank does not associate the labeled instance of investment with any arguments, but it is clear from the surrounding discourse that constituent c (referring to Mexico) is the thing being invested in (the iarg2). When determining whether c is the iarg2 of investment, one can draw evidence from other mentions in c's coreference chain. Example 4.30 states that Mexico needs investment.
Example 4.31 states that Mexico regulates investment. These propositions, which can be derived via traditional SRL analyses, should increase our confidence that c is the iarg2 of investment in Example 4.32. Thus, the unit of classification for a candidate constituent c is the three-tuple ⟨p, iargn, c′⟩, where c′ is a coreference chain comprising c and its coreferent constituents.4 I defined a binary classification function Pr(+ | ⟨p, iargn, c′⟩) that predicts the probability that the entity referred to by c fills the missing argument position iargn of predicate instance p. In the remainder of this dissertation, I will refer to c as the primary filler, differentiating it from other mentions in the coreference chain c′. In the following section, I present the feature set used to represent each three-tuple within the classification function.

4 I used OpenNLP for coreference identification: http://opennlp.sourceforge.net

4.4.2 Model features

Table A.4 (p. 140) lists all features used by the model described in this chapter. As shown, these features are quite different from those used in previous work to identify semantic arguments in the traditional nominal SRL setting (see Chapter 2 and Gerber et al. (2009)). This difference is due to the fact that syntactic information - a crucial part of traditional SRL - is not very informative for the implicit argument task, which is primarily a semantic phenomenon. Below, I give detailed explanations for features that are not sufficiently explained in Table A.4.

Feature 9 captures the semantic relationship between predicate-argument positions by examining paths between frame elements in FrameNet. SemLink5 maps PropBank argument positions to their FrameNet frame elements. For example, the arg1 position of sell maps to the Goods frame element of the Sell frame. NomBank argument positions (e.g., arg1 of sale) can be mapped to FrameNet by first converting the nominal predicate to its verb form. By mapping predicate-argument structures into FrameNet, one can take advantage of the rich network of frame-frame relations provided by the resource (see Section 2.2.2 on page 13 for an example). The value of Feature 9 has the following general form:

(4.33) Frame1.FE1 --Relation1--> Frame2.FE2 --Relation2--> ... --Relation(n−1)--> Framen.FEn

This path describes how the frame elements at either end are related. For example, consider the frame element path between the arg1 of sell and the arg1 of buy, both of which denote the goods being transferred:

(4.34) Sell.Goods --Inherits--> Giving.Theme --Causes--> Getting.Theme --Inherited by--> Buy.Goods

This path can be paraphrased as follows: things that are sold (Sell.Goods) are part of a more general giving scenario (Giving.Theme) that can also be viewed as a getting scenario (Getting.Theme) in which the buyer receives something (Buy.Goods). This complex world knowledge is represented compactly using the relationships defined in FrameNet. In my experiments, I searched all possible frame element paths of length five or less that use the following relationships:

• Causative-of
• Inchoative-of
• Inherits
• Precedes
• Subframe-of

5 http://verbs.colorado.edu/semlink

Feature 9 is helpful in situations such as the following (contrived):

(4.35) Consumers bought many [c cars] this year at reduced prices.
(4.36) [p Sales] are expected to drop when the discounts are eliminated.

In Example 4.36 we are looking for the iarg1 (thing sold) of sale.
The path shown in Example 4.34 indicates quite clearly that the candidate cars from Example 4.35, being the entity purchased, is a suitable filler for this position. Lastly, note that the value for Feature 9 is the actual path instead of a numeric value. When c forms a coreference chain of multiple elements, this feature can be instantiated using multiple values (i.e., paths). Ultimately, these instantiations are binarized into the LibLinear input format, so the existence of multiple feature values does not pose a problem.

Feature 11 checks whether two predicate-argument positions have the same thematic role and reside in the same VerbNet class. As described in Section 2.2.2 (p. 15), VerbNet is a lexicon of verb classes. Each class contains verbs that are semantically related in addition to a set of thematic roles used for all verbs in the class. The classes are arranged into an inheritance hierarchy. An example class is shown below:

54.4: appraise, approximate, price, ...
Thematic roles: Agent, Theme, Value

Here, the dot notation 54.4 should be viewed as a Gorn address (Gorn, 1967) to the verb class within the VerbNet hierarchy (not shown), which is tree-structured. Feature 11 uses these classes to identify implicit arguments in situations such as the following (contrived):

(4.37) John appraised [c the house].
(4.38) [arg0 He] determined a [p price] that was fair.

In Example 4.38 we are looking for the iarg1 (the valued item) of price. In VerbNet terms, we are looking for the 54.4.Theme argument. Candidate c is the arg1 of appraise, which maps to the VerbNet thematic role 54.4.Theme. A simple identity check allows us to fill the iarg1 of price with the house. Feature 11 is a boolean feature indicating whether this identity check is satisfied.

Feature 13 is inspired by the work of Chambers and Jurafsky (2008), who investigated unsupervised learning of narrative event sequences using pointwise mutual information (PMI) between syntactic positions. I extended this PMI score to semantic arguments instead of syntactic dependencies. Thus, the value for this feature is computed as follows:

pmi(⟨p1, argi⟩, ⟨p2, argj⟩) = log [ Pcoref_pmi(⟨p1, argi⟩, ⟨p2, argj⟩) / ( Pcoref_pmi(⟨p1, argi⟩, ∗) ∗ Pcoref_pmi(⟨p2, argj⟩, ∗) ) ]    (4.39)

I computed Equation 4.39 using carefully selected subsets of the Gigaword corpus (Graff, 2003). I first indexed the entire Gigaword corpus (approximately 10^6 documents) using the Lucene search engine.6 I then queried this index using the simple boolean query "p1 AND p2", which retrieved documents relevant to the predicates considered in Equation 4.39. I used the verbal SRL system of Punyakanok et al. (2008) and the nominal SRL system of Gerber et al. (2009) to extract arguments from these documents, and I identified coreferent arguments with OpenNLP. Assuming the resulting data has N coreferential pairs of arguments, the numerator in Equation 4.39 is defined as follows:

Pcoref_pmi(⟨p1, argi⟩, ⟨p2, argj⟩) = #coref(⟨p1, argi⟩, ⟨p2, argj⟩) / N    (4.40)

In Equation 4.40, #coref returns the number of times the given argument positions are found to be coreferential.
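The following sketch shows how the undiscounted score in Equations 4.39 and 4.40 could be computed from a list of observed coreferential argument-position pairs. The input data and names are illustrative only, and the discounting refinement described next is intentionally omitted here.

    import math
    from collections import Counter

    def targeted_pmi(coref_pairs):
        # coref_pairs: list of ((pred1, arg_i), (pred2, arg_j)) tuples, one per
        # observed coreferential pair of argument positions.
        n = len(coref_pairs)
        joint = Counter(coref_pairs)
        marginal = Counter()
        for a, b in coref_pairs:
            marginal[a] += 1
            marginal[b] += 1
        scores = {}
        for (a, b), count in joint.items():
            p_joint = count / n              # Equation 4.40
            p_a = marginal[a] / n            # coreference with any position
            p_b = marginal[b] / n
            scores[(a, b)] = math.log(p_joint / (p_a * p_b))  # Equation 4.39
        return scores

    # Toy usage with made-up observations.
    observed = [(("win", "arg1"), ("loss", "arg1"))] * 3 + \
               [(("gain", "arg1"), ("loss", "arg1"))] + \
               [(("win", "arg1"), ("victory", "arg1"))] * 2
    print(targeted_pmi(observed))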
In order to penalize low-frequency observations with artificially high scores, I used the simple discounting method described by Pantel and Ravichandran (2004), resulting in the following modification of Equation 4.40:

x = #coref(⟨p1, argi⟩, ⟨p2, argj⟩)
Pcoref_pmi(⟨p1, argi⟩, ⟨p2, argj⟩) = (x / N) ∗ (x / (x + 1))    (4.41)

Thus, if two argument positions are rarely observed as coreferent, the discount factor x / (x + 1) will be small, reducing the PMI score. The denominator in Equation 4.39 is computed with a similar discount factor:

x1 = #coref(⟨p1, argi⟩, ∗)
x2 = #coref(⟨p2, argj⟩, ∗)
Pcoref_pmi(⟨p1, argi⟩, ∗) ∗ Pcoref_pmi(⟨p2, argj⟩, ∗) = ((x1 ∗ x2) / N^2) ∗ ( min(x1, x2) / (min(x1, x2) + 1) )    (4.42)

Thus, if either of the argument positions is rarely observed as coreferent with other argument positions, the discount factor min(x1, x2) / (min(x1, x2) + 1) will be small, making the denominator of Equation 4.39 large, reducing the PMI score. In general, the discount factors reduce the PMI score for argument positions that are not frequent in the corpus. I refer to Equation 4.39 as a targeted PMI score because it relies on data that have been chosen specifically for the calculation at hand.

6 http://lucene.apache.org

Argument position   #coref with loss.arg1   Raw PMI score   Discounted PMI score
win.arg1                    37                  5.68               5.52
gain.arg1                   10                  5.13               4.64
recoup.arg1                  2                  6.99               4.27
steal.arg1                   4                  5.18               4.09
possess.arg1                 3                  5.10               3.77

Table 4.2: Targeted PMI scores between the arg1 of loss and other argument positions. The second column gives the number of times that the argument position in the row is found to be coreferent with the arg1 of the loss predicate. A higher value in this column results in a lower discount factor. See Equation 4.41 for the discount factor.

Table 4.2 shows a sample of targeted PMI scores between the arg1 of loss and other argument positions. There are two things to note about this data: first, the argument positions listed are all naturally related to the arg1 of loss. Second, the discount factor changes the final ranking by moving the less frequent recoup predicate from a raw rank of 1 to a discounted rank of 3, preferring instead the more common win predicate.

The information in Table 4.2 is useful in situations such as the following (contrived):

(4.43) Mary won [c the tennis match].
(4.44) [arg0 John's] [p loss] was not surprising.

In Example 4.44 we are looking for the iarg1 of loss. The information in Table 4.2 strongly suggests that the marked candidate c, being the arg1 of win, would be a suitable filler for this position. Lastly, note that if c were to form a coreference chain with other constituents, it would be possible to calculate multiple PMI scores. In such cases, the targeted PMI feature uses the maximum of all scores.

Feature 23 takes on a value equal to the concatenation of p.iargn (the missing argument position) and pf.argf, which is the argument position of the candidate c. To reduce data sparsity, this feature generalizes predicates and argument positions to their VerbNet classes and thematic roles using SemLink. For example, consider the following Penn TreeBank sentences:

(4.45) [arg0 The two companies] [p produce] [arg1 market pulp, containerboard and white paper]. The goods could be manufactured closer to customers, saving [p shipping] costs.

Here we are trying to fill the iarg0 of shipping. Let c′ contain a single mention, The two companies, which is the arg0 of produce.
Feature 23 is instantiated with a value of 26.4.Agent → 11.1.1.Agent, where 26.4 and 11.1.1 are the VerbNet classes that contain produce and ship, respectively. This feature captures general properties of events; in the example above, it describes the tendency of producers to also be shippers. Similarly to other non-numeric features, Feature 23 is instantiated once for each element of c′, allowing the model to consider information from multiple mentions of the same entity.

Feature 27 captures the selectional preference of a predicate p for the elements in c′ with respect to argument position iargn. In general, selectional preference scores denote the strength of attraction for a predicate-argument position to a particular word or class of words. To calculate the value for this feature, I used the information-theoretic model proposed by Resnik (1996), which is defined as follows:

Pref(p, argn, s ∈ WordNet) = ( Pr(s | p, argn) ∗ log( Pr(s | p, argn) / Pr(s) ) ) / Z

Z = Σ_{si ∈ WordNet} Pr(si | p, argn) ∗ log( Pr(si | p, argn) / Pr(si) )    (4.46)

In Equation 4.46, Pref calculates the preference for a WordNet synset s in the given predicate-argument position. Prior and posterior probabilities for s were calculated by examining the arguments present in the Penn TreeBank combined with 20,000 documents randomly selected from the Gigaword corpus. PropBank and NomBank supplied arguments for the Penn TreeBank, and I used the aforementioned verbal and nominal SRL systems to extract arguments from Gigaword. The head word for each argument was mapped to its WordNet synsets, and counts for these synsets were updated as suggested by Resnik. Two things should be noted about Equation 4.46. First, WordNet contains only those synsets that were observed in the training data. Second, the equation is defined as zero for any synset s that is in WordNet but not observed in the training data.

Equation 4.46 computes the preference of a predicate-argument position for a synset; however, a single word can map to multiple synsets if its sense is ambiguous. Given a word w and its synsets s1, s2, . . . , sn, the preference of a predicate-argument position for w is defined as follows:

Pref(p, argn, w) = ( Σ_si Pref(p, argn, si) ) / n    (4.47)

That is, the preference for a word is computed as the average preference across all possible synsets. The final value for Feature 27 is computed using the word-based preference score defined in Equation 4.47. Given a candidate implicit argument c′ comprising the primary filler c and its coreferent mentions, the following value is obtained:

Pref(p, iargn, c′) = min_{f ∈ c′} Pref(p, argn, f)    (4.48)

In Equation 4.48, each f is the syntactic head of a constituent from c′. The value of Equation 4.48 is in (−∞, +∞), with larger values indicating higher preference for c as the implicit filler of position iargn.

Argument        Raw coreference probability   Discounted coreference probability
rethink.arg1    3/6 = 0.5                     0.32
define.arg1     2/6 = 0.33                    0.19
redefine.arg1   1/6 = 0.17                    0.07

Table 4.3: Coreference probabilities between reassess.arg1 and other argument positions. See Equation 4.49 for details on the discount factor.

Feature 33 implements the suggestion of Burchardt et al. (2005) that implicit arguments might be identified using observed coreference patterns in a large corpus of text. My implementation of this feature uses the same data used for the previous feature: arguments extracted from the Penn TreeBank and 20,000 documents randomly selected from Gigaword.
Feature 33 implements the suggestion of Burchardt et al. (2005) that implicit arguments might be identified using observed coreference patterns in a large corpus of text. My implementation of this feature uses the same data used for the previous feature: arguments extracted from the Penn TreeBank and 20,000 documents randomly selected from Gigaword. Additionally, I identified coreferent arguments in this corpus using OpenNLP. Using this information, I calculated the probability of coreference between any two argument positions. As with Feature 13, I used discounting to penalize low-frequency observations, producing an estimate of coreference probability as follows:

    Coref_joint = #coref(<p1, argi>, <p2, argj>)
    Coref_marginal = #coref(<p1, argi>, *)

    Pcoref(<p1, argi>, <p2, argj>) = (Coref_joint / Coref_marginal) * (Coref_joint / (Coref_joint + 1)) * (Coref_marginal / (Coref_marginal + 1))        (4.49)

For example, I observed that the arg1 for predicate reassess (the entity reassessed) is coreferential with six other constituents in the corpus. Table 4.3 lists the argument positions with which this argument is coreferential along with the raw and discounted probabilities.

    Argument position    Raw coreference probability    Discounted coreference probability
    rethink.arg1         3/6 = 0.50                     0.32
    define.arg1          2/6 = 0.33                     0.19
    redefine.arg1        1/6 = 0.17                     0.07

Table 4.3: Coreference probabilities between reassess.arg1 and other argument positions. See Equation 4.49 for details on the discount factor.

The discounted probabilities can help identify the implicit argument in the following contrived examples:

(4.50) Senators must rethink [c their strategy for the upcoming election].

(4.51) The [p reassessment] must begin soon.

In Example 4.51 we are looking for the iarg1 of reassess. Table 4.3 tells us that the marked candidate - an arg1 to rethink - is likely to fill this missing argument position. When c forms a coreference chain with other constituents, this feature uses the minimum coreference probability between the implicit argument position and elements in the chain.
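The discounted coreference probability of Equation 4.49 is straightforward to compute from the two counts. The following sketch (with the counts taken from the reassess example) reproduces the discounted values shown in Table 4.3; it is an illustration rather than the original implementation.

    def discounted_coref_probability(joint, marginal):
        """Equation 4.49: raw coreference probability scaled by two discount factors."""
        if marginal == 0:
            return 0.0
        raw = joint / marginal
        return raw * (joint / (joint + 1.0)) * (marginal / (marginal + 1.0))

    # reassess.arg1 was observed as coreferent with six constituents in total.
    for other, joint in [("rethink.arg1", 3), ("define.arg1", 2), ("redefine.arg1", 1)]:
        print(other, round(discounted_coref_probability(joint, 6), 2))
    # -> rethink.arg1 0.32, define.arg1 0.19, redefine.arg1 0.07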
Feature 59 is similar to Feature 9 (the frame element path) except that it captures the distance of the relationship between predicate-argument positions. Consider the following VerbNet classes:

    13.2      lose, refer, relinquish, remit, resign, restore, gift, hand out, pass out, shell out
    13.5.1.1  earn, fetch, cash, gain, get, save, score, secure, steal

The path from earn to lose in the VerbNet hierarchy is as follows:

(4.52) 13.5.1.1 ↑ 13.5.1 ↑ 13.5 ↑ 13 ↓ 13.2

The path in Example 4.52 is four links long. Intuitively, earn and lose are related to each other - they describe two possible outcomes of a financial transaction. The VerbNet path quantifies this intuition, with shorter paths indicating closer relationships. This information can be used to identify implicit arguments in situations such as the following from the Penn TreeBank (abridged):

(4.53) [c Monsanto Co.] is expected to continue reporting higher [p earnings].

(4.54) The St. Louis-based company is expected to report that [p losses] are narrowing.

In Example 4.54 we are looking for the iarg0 (i.e., entity losing something) for the loss predicate. According to SemLink, this argument position maps to the 13.2.Agent role in VerbNet. In Example 4.53, we find the candidate implicit argument Monsanto Co., which is the arg0 to the earning predicate in that sentence. This argument position maps to the 13.5.1.1.Agent role in VerbNet. These two VerbNet roles are related according to the VerbNet path in Example 4.52, producing a value for Feature 59 of four. This relatively small value supports an inference of Monsanto Co. as the iarg0 for loss. It is important to note that a VerbNet path only exists when the thematic roles are identical. For example, a VerbNet path would not exist between 13.5.1.1.Theme and 13.2.Agent because the roles are not compatible. Lastly, note that c might form a coreference chain c′ with multiple elements. In such a situation, the minimum path length is selected as the value for this feature.

Feature 67 identifies the discourse relation (if any) that holds between the candidate constituent c and the filled predicate p. Consider the following example:

(4.55) [iarg0 SFE Technologies] reported a net loss of $889,000 on sales of $23.4 million.

(4.56) That compared with an operating [p loss] of [arg1 $1.9 million] on sales of $27.4 million in the year-earlier period.

In this case, a comparison discourse relation (signaled by the underlined text) holds between the first and second sentences. The coherence provided by this relation encourages an inference that identifies the marked iarg0 (the loser). The value for this feature is the name of the discourse relation (e.g., comparison) whose two discourse units cover the candidate (iarg0 above) and filled predicate (p above). Throughout my investigation, I used gold-standard discourse relations provided by the Penn Discourse TreeBank (Prasad et al., 2008).

Filler-independent features are those that do not depend on elements of c′. These features are usually specific to a particular predicate. Consider the following example:

(4.57) Statistics Canada reported that its [arg1 industrial-product] [p price] index dropped 2% in September.

The "[p price] index" collocation is rarely associated with an arg0 in NomBank or with an iarg0 in the annotated data (both argument positions denote the seller). Feature 25 accounts for this type of behavior by encoding the syntactic head of p's right sibling. The value of Feature 25 for Example 4.57 is price:index. Contrast this with the following:

(4.58) [iarg0 The company] is trying to prevent further [p price] drops.

The value of Feature 25 for Example 4.58 is price:drop. This feature captures an important distinction between the two uses of price: the former cannot easily take an iarg0, whereas the latter can. Many other features in Table A.4 depend only on the predicate and have values that take the form predicate:feature value.

4.4.3 Post-processing for final output selection

Without loss of generality, assume there exists a predicate instance p with two missing argument positions iarg0 and iarg1. Also assume that there are three candidate fillers c1, c2, and c3 within the candidate window. The discriminative model will calculate the probability that each candidate fills each missing argument position. This is depicted graphically below:

          iarg0    iarg1
    c1    0.3      0.4
    c2    0.1      0.05
    c3    0.6      0.5

There exist two constraints on possible assignments of candidates to positions. First, a candidate may not be assigned to more than one missing argument position. To enforce this constraint, only the top-scoring cell in each row is retained, leading to the following:

          iarg0    iarg1
    c1    -        0.4
    c2    0.1      -
    c3    0.6      -

Second, a missing argument position can only be filled by a single candidate. To enforce this constraint, only the top-scoring cell in each column is retained, leading to the following:

          iarg0    iarg1
    c1    -        0.4
    c2    -        -
    c3    0.6      -

Having satisfied these constraints, a threshold t is imposed on the remaining cell probabilities.7 Cells with probabilities below t are cleared. Assuming that t = 0.42, the final assignment would be as follows:

          iarg0    iarg1
    c1    -        -
    c2    -        -
    c3    0.6      -

In this case, c3 fills iarg0 with probability 0.6 and iarg1 remains unfilled. The latter outcome is desirable because not all argument positions have fillers that are present in the discourse.

7 The threshold t is learned from the training data. The learning mechanism is explained in the following section.
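The post-processing step can be written in a few lines of code. The sketch below applies the row constraint, the column constraint, and the threshold to a matrix of scores and reproduces the worked example above; the dictionary-of-dictionaries representation is simply one convenient encoding, not the actual implementation.

    def select_assignments(scores, threshold):
        """scores[candidate][position] -> probability; returns {position: (candidate, prob)}."""
        # Constraint 1: keep only each candidate's best position (top cell in each row).
        best_position = {c: max(row, key=row.get) for c, row in scores.items()}
        # Constraint 2: for each position, keep only the best remaining candidate (top cell per column).
        assignments = {}
        for c, pos in best_position.items():
            prob = scores[c][pos]
            if pos not in assignments or prob > assignments[pos][1]:
                assignments[pos] = (c, prob)
        # Threshold: drop assignments whose probability falls below t.
        return {pos: (c, p) for pos, (c, p) in assignments.items() if p >= threshold}

    scores = {"c1": {"iarg0": 0.3, "iarg1": 0.4},
              "c2": {"iarg0": 0.1, "iarg1": 0.05},
              "c3": {"iarg0": 0.6, "iarg1": 0.5}}
    print(select_assignments(scores, threshold=0.42))  # {'iarg0': ('c3', 0.6)}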
4.4.4 Computational complexity

In a practical setting, the implicit argument model described above would take as input the output of the end-to-end nominal SRL system described in Chapter 2. Thus, the implicit argument model has a best-case performance of O(s^3), owing to the cubic-time syntactic parser. This case is achieved when the linear-time predicate identifier does not identify any predicate nodes. In this case, there are no argument nodes to be used as candidate implicit arguments, and the implicit argument model will not perform any work in addition to that performed by the nominal SRL system.

The best case, however, is rare. Sentences from documents in the news genre almost always have predicates and arguments. Thus, the best case O(s^3) is not likely to be very informative. In general, it is difficult to characterize the performance of the implicit argument model accurately. This is largely because many of the features used in the model required a significant amount of data pre-processing before the evaluation experiments were conducted (see, for example, Feature 13 on page 84). In a practical setting, this pre-processing would not be possible and its associated cost would need to be incorporated into the runtime analysis. This dissertation leaves an examination of this issue to future work.

4.5 Evaluation

Data

All evaluations in this chapter were performed using a randomized cross-validation configuration. The 1,247 predicate instances were annotated document by document. In order to remove any confounding factors caused by specific documents, I first randomized the annotated predicate instances. Following this, I split the predicate instances evenly into ten folds and used each fold as testing data for a model trained on the instances outside the fold.
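The cross-validation protocol just described is simple to reproduce. The sketch below shuffles the annotated predicate instances and yields ten train/test splits; it is a generic illustration of the setup rather than the exact code used in the experiments.

    import random

    def ten_fold_splits(instances, seed=0, folds=10):
        """Randomize instances, then use each fold as test data and the rest as training data."""
        shuffled = list(instances)
        random.Random(seed).shuffle(shuffled)
        for i in range(folds):
            test = shuffled[i::folds]                                  # every tenth instance, offset by i
            train = [x for j, x in enumerate(shuffled) if j % folds != i]
            yield train, test

    # Example with 1,247 hypothetical predicate instances.
    instances = list(range(1247))
    for train, test in ten_fold_splits(instances):
        assert len(train) + len(test) == len(instances)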
During training, the system was provided with annotated predicate instances. The system identified missing argument positions and generated a set of candidates for each such position. A candidate three-tuple <p, iargn, c′> was given a positive label if the candidate implicit argument c (the primary filler) was annotated as filling the missing argument position. During testing, the system was presented with each predicate instance and was required to identify all implicit arguments for the predicate.

Throughout the evaluation process I assumed the existence of gold-standard PropBank and NomBank information in all documents. This factored out errors from traditional SRL and affected the following stages of system operation:

• Missing argument identification. The system was required to figure out which argument positions were missing. Each of the ten predicates was associated with an unambiguous role set, so determining the missing argument positions amounted to comparing the existing local arguments with the argument positions listed in the predicate's role set. Because gold-standard local NomBank arguments were used, this stage produced no errors.

• Candidate generation. As mentioned in Section 4.4.1, the set of candidates for a missing argument position contains constituents labeled with a core PropBank or NomBank argument label. Gold-standard PropBank and NomBank arguments were used; however, it is not the case that all annotated implicit arguments are given a label by PropBank or NomBank. Thus, despite the gold-standard argument labels, this stage produced errors in which the system failed to generate a true-positive candidate for an implicit argument position.

• Feature extraction. Many of the features described in Section 4.4.2 rely on underlying PropBank and NomBank argument labels. For example, the top-ranked Feature 1 relates the argument position of the candidate to the missing argument position. In my experiments, values for this feature contained no errors because gold-standard PropBank and NomBank labels were used. Note, however, that features such as Feature 13 were calculated using the output of an automatic SRL process that occasionally produces errors.

For simplicity, I also assumed the existence of gold-standard syntactic structure when possible. Because most of the features used for implicit argument identification are semantic, this assumption is not likely to have a significant impact on end performance.

Scoring metrics

I evaluated system performance using the methodology proposed by Ruppenhofer et al. (2009). For each missing argument position of a predicate instance, the system was required to either (1) identify a single constituent that fills the missing argument position or (2) make no prediction and leave the missing argument position unfilled. I scored predictions using the Dice coefficient, which is defined as follows:

    Dice(Predicted, True) = 2 * |Predicted ∩ True| / ( |Predicted| + |True| )                           (4.59)

Predicted is the set of tokens subsumed by the constituent predicted by the model as filling a missing argument position. True is the set of tokens from a single annotated constituent that fills the missing argument position. The model's prediction receives a score equal to the maximum Dice overlap across any one of the annotated fillers (AF):

    Score(Predicted) = max_{True ∈ AF} Dice(Predicted, True)                                            (4.60)

Precision is equal to the summed prediction scores divided by the number of argument positions filled by the model. Recall is equal to the summed prediction scores divided by the number of argument positions filled in the annotated data. Predictions not covering the head of a true filler were assigned a score of zero. For example, consider the following true and predicted labelings:

(4.61) True labeling: [iarg0 Participants] will be able to transfer [iarg1 money] to [iarg2 other investment funds]. The [p investment] choices are limited to [iarg2 a stock fund and a money-market fund].

(4.62) Predicted labeling: Participants will be able to transfer [iarg1 money] to other [iarg2 investment funds]. The [p investment] choices are limited to a stock fund and a money-market fund.

In the ground-truth (4.61) there are three implicit argument positions to fill. The hypothetical system has made predictions for two of the positions. The prediction scores are shown below:

    Score(iarg1 money) = Dice(money, money) = 1

    Score(iarg2 investment funds) = max{ Dice(investment funds, other investment funds),
                                         Dice(investment funds, a stock ... money-market fund) }
                                  = max{0.8, 0} = 0.8

Precision, recall, and F1 for the example predicate are calculated as follows:

    Precision = 1.8 / 2 = 0.9
    Recall = 1.8 / 3 = 0.6
    F1 = (2 * Precision * Recall) / (Precision + Recall) = 0.72

I calculated the F1 score for the entire testing fold by aggregating the counts used in the above precision and recall calculations. Similarly, I aggregated the counts across all folds to arrive at a single F1 score for the evaluated system.
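The scoring scheme can be made concrete with a short sketch that computes token-level Dice overlap, takes the maximum over annotated fillers, and accumulates precision and recall. It reproduces the worked example above; the whitespace tokenization is an assumption for illustration, and the head-coverage check mentioned above is omitted for brevity.

    def dice(predicted_tokens, true_tokens):
        """Equation 4.59: Dice coefficient over token sets."""
        p, t = set(predicted_tokens), set(true_tokens)
        return 2 * len(p & t) / (len(p) + len(t)) if p or t else 0.0

    def score_prediction(predicted, annotated_fillers):
        """Equation 4.60: maximum Dice overlap with any annotated filler."""
        return max(dice(predicted.split(), f.split()) for f in annotated_fillers)

    # Worked example from the text.
    s1 = score_prediction("money", ["money"])                                                   # 1.0
    s2 = score_prediction("investment funds",
                          ["other investment funds", "a stock fund and a money-market fund"])   # 0.8
    summed = s1 + s2
    precision = summed / 2    # two positions were filled by the model
    recall = summed / 3       # three positions are filled in the annotations
    f1 = 2 * precision * recall / (precision + recall)   # 0.72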
I used a bootstrap resampling technique similar to those developed by Efron and Tibshirani (1993) to test the significance of the performance difference between various systems. Given a test pool comprising M missing argument positions iargn along with the predictions by systems A and B for each iargn, I calculated the exact p-value of the performance difference as follows:

1. Create r random resamples from M with replacement.

2. For each resample Ri, compute the system performance difference dRi = ARi − BRi and store dRi in D.

3. Find the largest symmetric interval [min, max] around the mean of D that does not include zero.

4. The exact p-value equals the percentage of elements in D that are not in [min, max].

Experiments have shown that this simple approach provides accurate estimates of significance while making minimal assumptions about the underlying data distribution (Efron and Tibshirani, 1993). Similar randomization tests have been used to evaluate information extraction systems (Chinchor et al., 1993).
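A direct reading of steps 1-4 is sketched below. The symmetric interval around the mean that excludes zero is found by treating |mean| as the interval's half-width, and the per-resample performance is simplified here to a mean per-position score (the actual evaluation aggregates precision and recall counts per resample). This is an illustration of the procedure, not the original implementation.

    import random

    def bootstrap_p_value(scores_a, scores_b, resamples=10000, seed=0):
        """Exact p-value for the performance difference between systems A and B (steps 1-4)."""
        rng = random.Random(seed)
        m = len(scores_a)
        diffs = []
        for _ in range(resamples):
            idx = [rng.randrange(m) for _ in range(m)]        # resample positions with replacement
            d = sum(scores_a[i] for i in idx) / m - sum(scores_b[i] for i in idx) / m
            diffs.append(d)
        mean = sum(diffs) / len(diffs)
        half_width = abs(mean)                                # largest symmetric interval excluding zero
        outside = sum(1 for d in diffs if abs(d - mean) > half_width)
        return outside / len(diffs)

    # Hypothetical per-position scores for two systems over the same test pool.
    a = [1.0, 0.8, 0.0, 1.0, 0.6, 1.0, 0.0, 0.8]
    b = [0.0, 0.8, 0.0, 0.5, 0.0, 1.0, 0.0, 0.0]
    print(bootstrap_p_value(a, b))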
LibLinear model configuration

Given a testing fold Ftest and a training fold Ftrain, I performed feature selection using only the information contained in Ftrain.8 As part of the feature selection process, I conducted a grid search for the best c and w LibLinear parameters, which govern the per-class cost of mislabeling instances from a particular class (Fan et al., 2008). Setting per-class costs helps counter the effects of class size imbalance, which is severe even when selecting candidates from the current and previous few sentences (most candidates are negative). I ran the feature selection and grid search processes independently for each Ftrain. As a result, the feature set and model parameters are slightly different for each fold.9 For all folds, I used LibLinear's logistic regression solver and a candidate selection window of two sentences prior. As shown in Figure 4.1 (p. 82), this window imposes a recall upper bound of approximately 85%. The post-processing prediction threshold t was learned using a brute-force search that maximized the system's performance over the data in Ftrain.

8 See Appendix Section A.8 (p. 144) for the feature selection algorithm.
9 See Table A.5 (p. 141) for a per-fold listing of features and model parameters.

Baseline and oracle models

I compared the supervised model described above with the simple baseline heuristic defined below:

    Fill iargn for predicate instance p with the nearest constituent in the two-sentence candidate window that fills argn for a different instance of p, where all nominal predicates are normalized to their verbal forms.

The normalization allows, for example, an existing arg0 for the verb invested to fill an iarg0 for the noun investment. This heuristic outperformed a more complicated heuristic that relied on the PMI score described in Section 4.4.2. I also evaluated an oracle model that made gold-standard predictions for candidates within the two-sentence prediction window.

Results

Table 4.4 presents the evaluation results for implicit argument identification. As column two shows, the systems were tested over 966 missing argument positions for which at least one true implicit filler existed. Overall, the discriminative model increased F1 performance by 21.4 points (74.1%) compared to the baseline (p<0.0001). Predicates with the highest number of implicit arguments - sale and price - showed F1 increases of 13.7 points and 17.5 points, respectively (p<0.001 for both differences). As expected, oracle precision is 100% for all predictions, and the F1 difference between the discriminative and oracle systems is significant at p<0.0001 for all test sets. See Appendix Section A.6 (p. 141) for a per-fold breakdown of results and a listing of features and model parameters used for each fold.

I also measured human performance on this task by running the undergraduate assistant's annotations against a small portion of the evaluation data comprising 275 filled implicit arguments. The assistant achieved an overall F1 score of 56.0% using the same two-sentence candidate window used by the baseline, discriminative, and oracle models. Using an infinite candidate window, the assistant increased F1 performance to 64.2%. Although these results provide a general idea about the performance upper bound, they are not directly comparable to the cross-validated results shown in Table 4.4.

    Predicate    # Imp.   Baseline             Discriminative       pexact    Oracle                pexact
                 args.    P     R     F1       P     R     F1       (B,D)     P      R     F1       (D,O)
    sale         181      57.0  27.7  37.3     59.2  44.8  51.0     0.0003    100.0  72.4  84.0     <0.0001
    price        138      67.1  23.3  34.6     56.0  48.7  52.1     <0.0001   100.0  78.3  87.8     <0.0001
    bid          124      66.7  14.5  23.8     60.0  36.3  45.2     <0.0001   100.0  60.5  75.4     <0.0001
    investor     108      30.0   2.8   5.1     46.7  39.8  43.0     <0.0001   100.0  84.3  91.5     <0.0001
    cost          86      60.0  10.5  17.8     62.5  50.9  56.1     <0.0001   100.0  86.0  92.5     <0.0001
    loan          82      63.0  20.7  31.2     67.2  50.0  57.3     <0.0001   100.0  89.0  94.2     <0.0001
    plan          77      72.7  20.8  32.3     59.6  44.1  50.7     0.0032    100.0  87.0  93.1     <0.0001
    loss          62      78.8  41.9  54.7     72.5  59.7  65.5     0.0331    100.0  88.7  94.0     <0.0001
    fund          56      66.7  10.7  18.5     80.0  35.7  49.4     <0.0001   100.0  66.1  79.6     <0.0001
    investment    52      28.9  10.6  15.5     32.9  34.2  33.6     0.0043    100.0  80.8  89.4     <0.0001
    Overall      966      61.4  18.9  28.9     57.9  44.5  50.3     <0.0001   100.0  78.0  87.6     <0.0001

Table 4.4: Overall evaluation results for implicit argument identification. The second column gives the number of ground-truth implicitly filled argument positions for the predicate instances. P, R, and F1 indicate precision, recall, and F-measure (β = 1), respectively. pexact is the bootstrapped exact p-value of the F1 difference between two systems, where the systems are (B)aseline, (D)iscriminative, and (O)racle.

4.6 Discussion

4.6.1 Feature assessment

Previously, we assessed the importance of various implicit argument feature groups by conducting feature ablation tests (Gerber and Chai, 2010). In each test, the discriminative model was retrained and reevaluated without a particular group of features. I summarize the findings of this study below:

Semantic roles are essential. We observed statistically significant losses when excluding features that relate the semantic roles of elements in c′ to the semantic role of the missing argument position. For example, Feature 1 appears as the top-ranked feature in eight out of ten fold evaluations (see Table A.5 on page 141). This feature is formed by concatenating the filling predicate-argument position with the filled predicate-argument position, producing values such as invest.arg0-lose.arg0. This value indicates that the entity performing the investing is also the entity losing something. This type of commonsense knowledge is essential to the task of implicit argument identification.

Other information is important. Our 2010 study also found that semantic roles are only one part of the solution. Using semantic roles in isolation also produced statistically significant losses. This indicates that other features contribute useful information to the task.

Discourse structure is not essential. We also tested the effect of removing discourse relations (Feature 67) from the model. Discourse structure has received a significant amount of attention in NLP; however, it remains a very challenging problem, with state-of-the-art systems attaining F1 scores in the mid-40% range (Sagae, 2009). Our 2010 work as well as the updated work presented in this dissertation used gold-standard discourse relations from the Penn Discourse TreeBank. As shown by Sagae, these relations are difficult to extract in a practical setting. In our 2010 work, we showed that removing discourse relations from the model did not have a statistically significant effect on performance. Thus, this information should be removed in practical applications of the model, at least until better uses for it can be identified.

To further assess the relative importance of features used in this dissertation, I aggregated the feature rank information given in Table A.5 (p. 141). For each evaluation fold, each feature received a point value equal to its reciprocal rank within the feature list. Thus, a feature appearing at rank 5 for a fold would receive 1/5 = 0.2 points for that fold. I totaled these points across all folds, arriving at the values shown in the final column of Table A.4 (p. 140). The scores confirm the findings described above. The highest scoring feature relates the semantic roles of the candidate argument to the missing argument position. Non-semantic information such as the sentence distance (Feature 2) also plays a key role. Discourse structure is consistently ranked near the bottom of the list (Feature 67).
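The reciprocal-rank aggregation can be expressed in a few lines. The sketch below takes per-fold feature rankings (best feature first) and produces the summed reciprocal-rank scores; the example rankings are invented for illustration and are not the rankings from Table A.5.

    from collections import defaultdict

    def aggregate_feature_scores(per_fold_rankings):
        """Sum 1/rank over folds for each feature (rank 1 = top feature in that fold)."""
        scores = defaultdict(float)
        for ranking in per_fold_rankings:
            for rank, feature in enumerate(ranking, start=1):
                scores[feature] += 1.0 / rank
        return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

    # Two hypothetical folds: Feature 1 tops both, Feature 67 sits near the bottom of each.
    folds = [["f1", "f2", "f13", "f67"],
             ["f1", "f13", "f2", "f67"]]
    print(aggregate_feature_scores(folds))
    # {'f1': 2.0, 'f2': 0.83..., 'f13': 0.83..., 'f67': 0.5}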
4.6.2 Error analysis

Table 4.5 lists the errors made by the system and their frequencies.

    #    Description                                                            %
    1    A true filler was classified but an incorrect filler scored higher     30.6
    2    A true filler did not exist but a prediction was made                  22.4
    3    A true filler existed within the window but was not classified         21.1
    4    A true filler scored highest but below threshold                       15.9
    5    A true filler existed but not within the window                        10.0

Table 4.5: Implicit argument identification error analysis. The second column indicates the type of error that was made and the third column gives the percentage of all errors that fall into each type.

As shown, the single most common error (type 1) occurred when a true filler was classified but an incorrect filler had a higher score. This occurred in approximately 31% of the error cases. Often, though, the system did not classify a true implicit argument because such a candidate was not generated. Without such a candidate, the system stood no chance of making a correct prediction. Errors 3 and 5 combined (also 31%) describe this behavior.

Type 3 errors resulted when implicit arguments were not core (i.e., argn) arguments to other predicates. To reduce class imbalance, the system only used core arguments as candidates; however, this came at the expense of increased type 3 errors. In many cases, the true implicit argument filled a non-core (i.e., adjunct) role within PropBank or NomBank.

Type 5 errors resulted when the true implicit arguments for a predicate were outside the candidate window. Oracle recall (see Table 4.4) indicates the nominals that suffered most from windowing errors. For example, the sale predicate was associated with the highest number of true implicit arguments, but only 72% of those could be resolved within the two-sentence candidate window. Empirically, I found that extending the candidate window uniformly for all predicates did not increase F1 performance because additional false positives were identified. The oracle results suggest that predicate-specific window settings might offer some advantage for predicates such as fund and bid, which take arguments at longer ranges.
Error types 2 and 4 are directly related to the prediction confidence threshold t. The former would be reduced by increasing t and thus filtering out bad predictions. The latter would be reduced by lowering t and allowing more true fillers into the final output. However, it is unclear whether either of these actions would increase overall performance. 4.6.3 The investment and fund predicates In Section 4.4.2, I discussed the price predicate, which frequently occurs in the “[p price] index” collocation. I observed that this collocation is rarely associated with either an overt arg0 or an implicit iarg0 . Similar observations can be made for the investment and fund predicates. Although these two predicates are frequent, they are rarely associated with implicit arguments: investment takes only 52 implicit arguments and fund takes only 56 implicit arguments (see Table 4.4). This behavior is due in large part to collocations such as “[p investment] banker”, “stock [p fund]”, and “mutual [p fund]”, which use predicate senses 105 that are not eventive and take no arguments. Such collocations also violate the assumption that differences between the PropBank and NomBank argument structure for a predicate are indicative of implicit arguments (see Section 4.3.1 for this assumption). Despite their lack of implicit arguments, it is important to account for predicates such as investment and fund because the incorrect prediction of implicit arguments for them can lower precision. This is precisely what happened for the investment predicate (P = 33%). The model incorrectly identified many implicit arguments for instances such as “[p investment] banker” and “[p investment] professional”, which take no arguments. The right context of investment should help the model avoid this type of error; however in many cases this was not enough evidence to prevent a false positive prediction. Additional investigation is needed to address this type of error. 4.6.4 Improvements versus the baseline The baseline heuristic covers the simple case where identical predicates share arguments in the same position. Because the discriminative model also uses this information (see Feature 8), it is interesting to examine cases where the baseline heuristic failed but the discriminative model succeeded. Such cases represent more difficult inferences. Consider the following sentence: (4.63) Mr. Rogers recommends that [p investors] sell [iarg2 takeover-related stock]. Neither NomBank nor the baseline heuristic associate the marked predicate in Example 4.63 with any arguments; however, the feature-based model was able to correctly identify the marked iarg2 as the entity being invested in. This inference relied on a number of features that connect the invest event to the sell event (e.g., Features 1, 4, and 76). These features captured a tendency of investors to sell the things they have invested in. I conclude my discussion with an example of a complex extra-sentential implicit argument: (4.64) [arg0 Olivetti] [p exported] $25 million in “embargoed, state-of-the-art, flexible manufacturing systems to the Soviet aviation industry.” 106 (4.65) [arg0 Olivetti] reportedly began [p shipping] these tools in 1984. (4.66) [iarg0 Olivetti] has denied that it violated the rules, asserting that the shipments were properly licensed. (4.67) However, the legality of these [p sales] is still an open question. In Example 4.67, we are looking for the iarg0 of sale. 
As shown, the discriminative model was able to correctly identify Olivetti from 4.66 as the implied filler of this argument position. The inference involved two key steps. First, the model identified coreferent mentions of Olivetti in 4.64 and 4.65. In these sentences, Olivetti participates in the marked exporting and shipping events. Second, the model identified a tendency for exporters and shippers to also be sellers (e.g., Features 1, 4, and 23 made large contributions to the prediction). Using this knowledge, the system extracted information that could not be extracted by the baseline heuristic or a traditional SRL system. 4.6.5 Comparison with previous results In a previous study, we reported results similar to those in this chapter (Gerber and Chai, 2010). The key difference between the two is cross-validation, which was not used in our 2010 study. Our 2010 study used fixed partitions of training, development, and testing data. As a result, feature and model parameter selections overfit the development data; we observed a 23-point difference in F1 between the development (65%) and testing (42%) partitions. The small size of the testing set also led to small sample sizes and large p-values during significance testing. The cross-validated approach reported in this chapter alleviated both problems. The F1 difference between training and testing was approximately 10 points for all folds, and all of the data were used for testing, leading to more accurate p-values. It is not possible to directly compare the evaluation scores in the two studies; however, the methodology in the current chapter is preferable for the reasons mentioned. 107 4.7 Conclusions Chapter 3 provided a partial solution to the problem of nominals with implicit arguments. The model described in that chapter is able to accurately identify nominals whose arguments are implicit using a variety of lexical and syntactic features. This increases performance by reducing the number of false positive argument predictions; however, all implicit arguments remain unidentified, leaving a large portion of the corresponding event structures unrecognized. This chapter has presented a detailed study of implicit arguments for a select group of nominal predicates. The study was based on a manually created corpus of implicit arguments, which is freely available for research purposes. The study’s primary findings include the following: 1. Implicit arguments are frequent. Given the predicates in a document, there exist a fixed number of possible arguments that can be filled according to NomBank’s predicate role sets. Role coverage is defined as the fraction of these roles that are actually filled by constituents in the text. Using NomBank as a baseline, the study found that role coverage increases by 71% when implicit arguments are taken into consideration. 2. Implicit arguments can be automatically identified. Using the annotated data, I constructed a feature-based supervised model that is able to automatically identify implicit arguments. This model relies heavily on the traditional, single-sentence SRL structure of both nominal and verbal predicates. By unifying these sources of information, the implicit argument model provides a more coherent picture of discourse semantics than is typical in most recent work (e.g., the evaluation conducted by Surdeanu et al. (2008)). The model demonstrates substantial gains over an informed baseline, reaching an overall F1 score of 50% and per-predicate scores in the mid-50s and mid-60s. 
These results are among the first for this task. 108 3. Much work remains. The study presented in the current chapter was very focused: only ten different predicates were analyzed. The goal was to carefully examine the underlying linguistic properties of implicit arguments. This examination produced many features that have not been used in other SRL studies. The results are encouraging; however, a direct application of the model to all NomBank predicates will require a substantial annotation effort. This is because many of the most important features are lexicalized on the predicate being analyzed and thus cannot be generalized to novel predicates. Additional information might be extracted from VerbNet, which groups related verbs together. Features from this resource might generalize better because they apply to entire sets of verbs. Lastly, it should be noted that the prediction model described in this chapter is quite simple. Each candidate is independently classified as filling each missing argument position, and a heuristic post-processing step is performed to arrive at the final labeling. This approach ignores the joint behavior of semantic arguments. In the next chapter, I describe a preliminary joint model for implicit arguments that is based on a large-scale knowledge based extracted from Internet webpages. 109 CHAPTER 5 An exploration of TextRunner for joint implicit argument identification 5.1 Introduction The model described in the previous chapter uses a wide variety of lexical and semantic features to make binary implicit argument predictions for constituents in the surrounding discourse. For a predicate instance p, each candidate constituent is classified as filling each missing argument position. A heuristic post-processing procedure is then applied to arrive at the final argument structure. With the exception of this final step, the candidates and argument positions are assumed to be independent. It is easy to construct examples that violate the assumption of independent arguments. Consider the following sentences: (5.1) [c1 The president] is currently struggling to manage [c2 the country’s economy]. (5.2) If he cannot get it under control, [p loss] of [arg1 the next election] might result. In Example 5.2, we are searching for the iarg0 of loss (the entity that is losing). The sentence in 5.1 supplies two reasonable candidates for this position: c1 and c2 . If one only considers the predicate loss, then c1 and c2 would appear to be equally likely: presidents often lose things (e.g., votes and allegiance) and economies often lose things (e.g., jobs and value). However, the sentence in 5.2 supplies additional information. It tells the reader that the next election is the entity being lost. Given this information, one would likely prefer c1 over c2 because economies don’t generally lose elections, whereas presidents often do. This type of inference is common in textual discourses because authors assume a shared knowledge 110 base with their readers. This knowledge base contains information about events and their typical participants (e.g., the fact that presidents lose elections but economies do not). Inspired by the above observations, this chapter presents a preliminary exploration of the interaction that occurs between implicit arguments. 
The interaction (or joint) model relies on a knowledge base constructed by automatically mining semantic propositions from Internet webpages using the TextRunner information extraction system.1 The primary goal of this chapter is to assess whether these propositions can help identify likely joint implicit argument configurations. In the following section, I review work on joint inference within semantic role labeling. In Sections 5.3 and 5.4, I present the joint implicit argument model and its features. Evaluation results for this model are given in Section 5.5. The joint model contains many simplifying assumptions, which I address in Section 5.6. I conclude in Section 5.7. 5.2 Related work Joint models for SRL A number of recent studies have shown that semantic arguments are not independent and that system performance can be improved by taking argument dependencies into account. Consider the following examples, which were discussed in Section 2.2.3: (5.3) [Temporal The day] that [arg0 the ogre] [Predicate cooked] [arg1 the children] is still remembered. (5.4) [arg1 The meal] that [arg0 the ogre] [Predicate cooked] [Beneficiary the children] is still remembered. These examples (due to Toutanova et al. (2008)) demonstrate the importance of interargument dependencies. The fact that the sentential subject is headed by meal in 5.4 instead of day causes a dramatic change in the interpretation of the constituent following the predicate. Toutanova et al. first generated an n-best list of argument labels for a predicate 1 http://www.cs.washington.edu/research/textrunner/reverbdemo.html 111 instance. They then re-ranked this list using joint features that describe multiple arguments simultaneously. For example, one of the features captures the argument label sequence as follows for Examples 5.3 and 5.4, respectively: (5.5) [voice:passive, lemma:cook, Temporal, arg0 , Predicate, arg1 ] (5.6) [voice:passive, lemma:cook, arg1 , arg0 , Predicate, Beneficiary] The label sequences in 5.5 and 5.6 help rule out globally invalid configurations such as the following: (5.7) [voice:passive, lemma:cook, arg1 , arg0 , Predicate, arg0 ] The label sequence in 5.7 violates a commonly used constraint that allows a single constituent to be given each argument label. The unique label constraint just mentioned is also important to the work of Punyakanok et al. (2008), who formulate a variety of constraints on argument labels. Punyakanok et al. treat these constraints as binary variables within an integer linear program, which is optimized to produce the final labeling. Other constraints include the following (for a complete list, see p. 267 of the cited work): • Arguments cannot overlap the predicate. This constraint was used by the nominal SRL model presented in Chapter 2, where candidate arguments were not allowed to overlap the predicate.2 • Arguments cannot overlap each other. This constraint was also used by the nominal SRL system, which applied post-processing heuristics to remove argument overlap. Although the work of Punyakanok et al. focuses on SRL within single sentences, the key finding is pertinent to multi-sentence implicit argumentation: semantic arguments should not be predicted independently of each other. 2 The exception to this constraint for the nominal SRL system of Chapter 2 is that incorpo- rated arguments may overlap the predicate. See page 33 for details concerning incorporated arguments. 112 Following this line of work, Ritter et al. (2010) investigated joint selectional preferences. 
Traditionally, a selectional preference model provides the strength of association between a predicate-argument position and a specific textual expression. Returning to Examples 5.1 and 5.2, one sees that the selectional preference for president and economy in the iarg0 position of loss should be high because each expression denotes an entity capable of losing something. The traditional selectional preference model was used in Chapter 4 as a source of information for identifying implicit arguments. Ritter et al. extended this single-argument model using a joint formulation of Latent Dirichlet Allocation (LDA) (Blei et al., 2003). In the generative version of joint LDA, text for the argument positions is generated from a common hidden variable. This approach reflects the intuition behind Examples 5.1 and 5.2 and would help identify president as the iarg0 . Training data for the model was drawn from a large corpus of two-argument tuples extracted by the TextRunner system, which I describe next. The TextRunner information extraction system Both Ritter et al.’s model and the model described in this chapter rely heavily on information extracted by the TextRunner system (Banko et al., 2007). The TextRunner system extracts tuples from Internet webpages in an unsupervised fashion. One key difference between TextRunner and other information extraction systems is that TextRunner does not use a closed set of relations (compare to the work described by ACE (2008)). Instead, the relation set is left open, leading to the notion of Open Information Extraction (OIE). Although OIE often has lower precision than traditional information extraction, it is able to extract a wider variety of relations at precision levels that are often useful (Banko and Etzioni, 2008). Returning again to Examples 5.1 and 5.2, one can query TextRunner in the following way: TextRunner Query arg0 : ? Predicate: lose arg1 : election 113 In the TextRunner system, arg0 typically indicates the Agent and arg1 typically indicates the Theme. TextRunner provides many tuples in response to this query, two of which are shown below: (5.8) Usually, [arg0 the president’s party] [Predicate loses] [arg1 seats in the mid-term election] (5.9) [arg0 The president] [Predicate lost] [arg1 the election]. The tuples present in these sentences (and many others) suggest that presidents are capable of losing elections. This was one possible inference for Examples 5.1 and 5.2. The other possible inference - that the economy might lose the election - is not supported as strongly by tuples returned for the TextRunner query. Given all of the returned tuples, only a single one involves economy in the arg0 position: (5.10) Any president will take credit for [arg0 a good economy] or [Predicate lose] [arg1 an election] over a bad one. In 5.10, TextRunner has not analyzed the arguments correctly (president should be the arg0 , not economy). Later in this chapter, I show how evidence from the tuple lists can be aggregated such that correct analyses (5.8 and 5.9) are favored over incorrect analyses (5.10). Given the tuple-based preference for president in the arg0 of lose where the arg1 is election, the system would hopefully select c1 (The president) as the arg0 in Examples 5.1 and 5.2. 
The primary contribution of this chapter is an exploration of how such tuple-based preferences can be computed and applied to the task of implicit argument identification.3

3 Thanks to Robert Bart and Alan Ritter at the University of Washington for their assistance with the TextRunner system.

5.3 Joint model formulation

To simplify the experimental setting, the model described in this section targets the specific situation where a predicate instance p takes an implicit iarg0 as well as an implicit iarg1. Whereas the model in the previous chapter classifies candidates for these positions independently, the model in this chapter classifies joint structures by evaluating the following binary prediction function:

    P(+ | p, iarg0, ci, iarg1, cj)                                                                      (5.11)

Equation 5.11 gives the probability of the joint assignment of ci to iarg0 and cj to iarg1. Given a set of n candidates c1, ..., cn ∈ C, the best labeling is found by considering all possible assignments of ci and cj:

    arg max_{(ci, cj) ∈ C × C, i ≠ j} P(+ | p, iarg0, ci, iarg1, cj)                                    (5.12)

Consider modified versions of Examples 5.1 and 5.2:

(5.13) [c1 The president] is currently struggling to manage [c2 the country's economy].

(5.14) If he cannot get it under control before [c3 the next election], a [p loss] might result.

In this case, we are looking for the iarg0 as well as the iarg1 for the loss predicate. Three candidates c1, c2, and c3 are marked. The joint model evaluates the following probabilities, taking the highest scoring to be the final assignment:

      P(+ | loss, iarg0, president, iarg1, economy)
    * P(+ | loss, iarg0, president, iarg1, election)
      P(+ | loss, iarg0, economy, iarg1, president)
      P(+ | loss, iarg0, economy, iarg1, election)
      P(+ | loss, iarg0, election, iarg1, president)
      P(+ | loss, iarg0, election, iarg1, economy)

Intuitively, only the starred item should have a high probability. As described in the previous section, TextRunner might be capable of modeling such intuitions if the tuple data can be aggregated in the right way. In the following section, I describe such an aggregation method, which forms the basis for features used to estimate the above probabilities.
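Under the two-position assumption, the decision rule of Equation 5.12 amounts to a brute-force search over ordered candidate pairs. The sketch below enumerates the pairs and keeps the highest-scoring assignment; the scoring function stands in for the LibLinear model and the toy scores are hypothetical.

    from itertools import permutations

    def best_joint_assignment(candidates, score):
        """Equation 5.12: maximize the joint probability over ordered pairs (ci != cj).

        score(ci, cj) should return P(+ | p, iarg0=ci, iarg1=cj).
        """
        return max(permutations(candidates, 2), key=lambda pair: score(*pair))

    # Toy scores mirroring Examples 5.13 and 5.14.
    toy = {("president", "election"): 0.9, ("president", "economy"): 0.2,
           ("economy", "election"): 0.3, ("economy", "president"): 0.1,
           ("election", "president"): 0.1, ("election", "economy"): 0.1}
    print(best_joint_assignment(["president", "economy", "election"],
                                lambda ci, cj: toy.get((ci, cj), 0.0)))
    # -> ('president', 'election'): the president fills iarg0 and the election fills iarg1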
The final mapping above shows the effect of sense ambiguity, where mission in the sense of war is mapped to mission in the sense of religion. This type of error is inevitable because the mapping process selects the most common sense instead of applying a more sophisticated sense disambiguation model; however, the negative effects of sense ambiguity are mitigated by the aggregation process described below. Having mapped the answer argument of each tuple to its WordNet synset, each synset is ranked according to the number of answer arguments that it covers. For the query president, lose, ? , this produces the following ranked list of WordNet synsets: 1. election (77) 2. war (51) 3. vote (39) 4. people (34) 5. support (26) ... In the list above, I have provided a one-word paraphrase for each of the top five synsets. The number in parentheses indicates how many answer arguments are covered by the synset. These synsets indicate likely answers to the original question of “What might a president lose?”. In a similar manner, one can answer a question such as “What might lose an election?” using tuples extracted by TextRunner. The procedure described above produces the following ranked list of WordNet synsets to answer this question: ... 9. people (62) 10. Republican (51) 11. Republican party (51) 12. Hillary (50) 13. president (49) ... 117 In this case, the expected answer (president) ranks 13th in the list of answer synsets. It is important to note that lower ranked answers are not necessarily incorrect answers. It is a simple fact that a wide variety of entities can lose an election. Items 9-13 are all perfectly reasonable answers to the original question of what might lose an election. The features described later in this section will accommodate this observation. The two symmetric questions defined and answered above are closely connected to the implicit argument situation discussed previously and reproduced below: (5.15) [c1 The president] is currently struggling to manage [c2 the country’s economy]. (5.16) If he cannot get it under control before [c3 the next election], a [p loss] might result. In Example 5.16, one is searching for the implicit iarg0 and iarg1 to the loss predicate. Candidates ci and cj that truly fill these positions should be compatible with questions in the following forms: Question: What did ci lose? Answer: cj Question: What entity lost cj ? Answer: ci If either of these question-answer pairs is not satisfied, then the joint assignment of ci to iarg0 and cj to iarg1 should be considered unlikely. Using the first question-answer pair above as an example, satisfaction is determined in the following way: 1. Resolve anaphoric expressions and normalize named entities in the sentences from which ci and cj originate.4 2. Query TextRunner for ci , lose, ? , retrieving the top n tuples. 3. Map the final argument of each tuple to its WordNet synset and rank the synsets by frequency, producing the ranked list A of answer synsets. 4 I used gold-standard anaphora annotations from Weischedel and Brunstein (2005) and the automatic named entity extractor created by Bikel et al. (1999) for this purpose. 118 4. Map cj to its WordNet synset synsetcj and determine whether synsetcj exists in A. If it does, the question-answer pair is satisfied. Some additional processing is required to determine whether synsetcj exists in A. This is due to the hierarchical organization of WordNet. 
For example, suppose that synset_cj is the synset containing "primary election" and A contains synsets paraphrased as follows:

    1. election
    2. war
    3. vote
    4. people
    5. support
    ...

synset_cj does not appear directly in this list; however, its existence in the list is implied by the following hypernymy path within WordNet:

    primary election --is-a--> election

Intuitively, if synset_cj is connected to a highly ranked synset in A by a short path, then one has evidence that synset_cj answers the original question. The evidence is weaker if the path is long, as in the following example:

    open primary --is-a--> direct primary --is-a--> primary election --is-a--> election

Additionally, a path between more specific synsets (i.e., those lower in the hierarchy) indicates a stronger relationship than a path between more general synsets (i.e., those higher in the hierarchy). These two situations are depicted in Figure 5.1.

[Figure 5.1: Effect of depth on WordNet synset similarity. All links indicate is-a relationships. The figure pairs (a) entity with (b) physical entity near the top of the hierarchy and (c) body of water with (d) bay lower in the hierarchy. Although the link distance from (a) to (b) equals the distance from (c) to (d), the latter are more similar due to their lower depth within the WordNet hierarchy.]

The synset similarity metric defined by Wu and Palmer (1994) combines the path length and synset depth intuitions into a single numeric score that is defined as follows:

    sim(synset1, synset2) = 2 * depth(lca(synset1, synset2)) / ( depth(synset1) + depth(synset2) )      (5.17)

In Equation 5.17, lca returns the lowest common ancestor of the two synsets within the WordNet hierarchy. To summarize, Equation 5.17 indicates the strength of association between synset_cj (e.g., primary election) and a ranked synset synset_a from A that answers a question such as "What might a president lose?". If the association between synset_cj and synset_a is weak, then the assignment of cj to iarg1 is unlikely. The process works similarly for assessing ci as the filler of iarg0.
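Equation 5.17 can be computed directly from WordNet. The sketch below uses NLTK's WordNet interface and the maximum hypernym depth as the depth function; this follows the equation as written and may differ in minor details (e.g., root handling) from NLTK's built-in wup_similarity. The synset lookups are assumptions about the relevant WordNet entries.

    from nltk.corpus import wordnet as wn

    def wu_palmer(s1, s2):
        """Equation 5.17: 2 * depth(lca) / (depth(s1) + depth(s2))."""
        lca_candidates = s1.lowest_common_hypernyms(s2)
        if not lca_candidates:
            return 0.0
        lca = max(lca_candidates, key=lambda s: s.max_depth())
        denom = s1.max_depth() + s2.max_depth()
        return 2.0 * lca.max_depth() / denom if denom else 0.0

    # Candidate answer synset vs. a ranked answer synset from A (assumed to exist in WordNet).
    primary = wn.synsets("primary_election")[0]
    election = wn.synsets("election")[0]
    print(wu_palmer(primary, election))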
In what follows, I quantify this intuition for use in estimating the joint probability defined in Equation 5.11, which is reproduced below:

    P(+ | p, iarg0, ci, iarg1, cj)                                                                      (5.18)

In order to estimate Equation 5.18 using LibLinear, one must extract numeric features from the conditioning information. I describe these features below.

Feature 1: Maximum TextRunner association strength. Given the conditioning variables in Equation 5.18, there are two questions that can be asked:

    Question: What did ci p?        Answer: cj
    Question: What entity p cj?     Answer: ci

Each of these questions produces a ranked list of answer synsets using the approach described previously. The synset for each answer string will match zero or more of the answer synsets, and each of these matches will be associated with a similarity score as defined in Equation 5.17. Feature 1 considers all such similarity scores and selects the maximum. A high value for this feature indicates that one (or both) of the candidates (ci or cj) is likely to be an answer to its associated question and is likely to fill its associated implicit argument position.

Feature 2: Maximum TextRunner reciprocal rank. Of all the answer matches described for Feature 1, Feature 2 selects the highest ranking and forms the reciprocal rank. Thus, values for Feature 2 are in [0,1], with larger values indicating matches with higher ranked answer synsets.

Feature 3: Number of TextRunner matches. This feature records the number of matches from either of the questions described for Feature 1.

Feature 4: Summed TextRunner reciprocal rank. Feature 2 considers answer synset matches from either of the posed questions; ideally, each question-answer pair should have some influence on the probability estimate in Equation 5.18. Feature 4 looks at the answer synset matches from each question individually. The match with highest rank for each question is selected, and the combined reciprocal rank 2 / (r1 + r2) is computed. The value of this feature is zero if either of the questions fails to produce a matching answer synset.

Features 5 and 6: Local classification scores. The joint model described in this chapter does not replace the local prediction model presented in the previous chapter. The latter uses a wide variety of important features that cannot be ignored. Like previous joint models (e.g., the one described by Toutanova et al. (2008)), the joint model works on top of the local prediction model, whose scores are incorporated into the joint model as feature-value pairs. Given the local prediction scores for the iarg0 and iarg1 positions in Equation 5.18, the joint model forms two features: (1) the sum of the scores for ci filling iarg0 and cj filling iarg1, and (2) the product of these two scores.

5.5 Evaluation

I evaluated the model described in this chapter over the manually annotated implicit argument data used elsewhere in this dissertation. As mentioned in Section 5.3, all joint model experiments were conducted using predicate instances that take an iarg0 and iarg1 in the ground-truth annotations. I reused the ten-fold cross-validation setup from the previous chapter as well as the evaluation metrics (see Section 4.5, p. 96 for more details). For each evaluation fold, features were selected using only the corresponding training data. I used the forward feature subset selection algorithm from Section A.8 (p. 144) for this purpose.

For comparison with the model from the previous chapter, I also evaluated the local prediction model on the evaluation data. Because this model predicted implicit arguments independently, it continued to use the conflict resolution heuristics described on page 94. However, the prediction threshold t was eliminated because the system could safely assume that a true filler for each of the iarg0 and iarg1 positions existed.

Table 5.1 presents the evaluation results.

    Predicate     # Imp.   Local model           Joint model
                  args.    P      R      F1      P      R      F1
    price          40      65.0   65.0   65.0    67.5   67.5   67.5
    sale           34      86.5   86.5   86.5    84.3   84.3   84.3
    plan           30      60.0   60.0   60.0    56.7   56.7   56.7
    bid            26      66.7   66.7   66.7    78.2   78.2   78.2
    fund           18      83.3   83.3   83.3    83.3   83.3   83.3
    loss           14      100.0  100.0  100.0   100.0  100.0  100.0
    loan           12      63.6   58.3   60.9    50.0   50.0   50.0
    investment      8      57.1   50.0   53.3    62.5   62.5   62.5
    Overall       182      72.6   71.8   72.2    73.1   73.1   73.1

Table 5.1: Joint implicit argument identification evaluation results. The second column indicates the number of filled implicit argument positions for the corresponding predicate(s). For comparison, the full implicit argument annotation data contain approximately 1000 filled implicit argument positions (see Table 4.1 on page 77).

The first thing to note is that these results are not comparable with the results of the previous chapter. In general, performance is much higher because predicate instances reliably took implicit arguments in the iarg0 and iarg1 positions. The overall performance increase was relatively small (approximately 1 percentage point).
The bid and investment predicates showed larger gains; however, due to the small size of the test collection, the differences in F1 between the local and joint models were not significant at p = 0.10 when using the bootstrap resampling procedure described by Efron and Tibshirani (1993). 5.6 5.6.1 Discussion Example improvement versus local model The bid and investment predicates show the largest increase for the joint model versus the local model. Below, I give an example of the investment predicate for which the joint model correctly identified the iarg0 and the local model did not. (5.19) [Big investors] can decide to ride out market storms without jettisoning stock. 123 (5.20) Most often, [c they] do just that, because stocks have proved to be the best-performing long-term [Predicate investment], attracting about $1 trillion from pension funds alone. Both models identified the iarg1 as money from a prior sentence (not shown). The local model incorrectly predicted $1 trillion in Example 5.20 as the iarg0 for the investment event. This mistake demonstrates a fundamental limitation of the local model: it cannot detect simple incompatibilities in the predicted argument structure. It does not know that “money investing money” is a rare or impossible event in the real world. For the joint model’s prediction, consider the constituent marked with c in Example 5.20. This constituent is resolved to Big investors in the preceding sentence. Thus, the two relevant questions are as follows: Question: What did big investors invest? Answer: money Question: What entity invested money? Answer: big investors The first question produces the following ranked list of answer synsets (the number in parentheses indicates the number of answer arguments that mapped to the synset): money (71) amount (38) million (38) billion (22) capital (21) As shown, the answer string of money matches the top-ranked answer synset. The second question produces the following ranked list of answer synsets: company (642) people (460) government (275) 124 business (75) investor (70) In this case, the answer string Big investors matches the fifth answer synset. The combined evidence of these two question-answer pairs allows the joint system to successfully identify Big investors as the iarg0 of the investment predicate in Example 5.20. 5.6.2 Test collection size The performance improvements for the joint model versus the local model were not found to be statistically significant at p = 0.10. Other studies of joint models for SRL (e.g., the one by Toutanova et al. for verbal SRL (2008)) have shown slightly larger gains (2.8 F1 points). These gains, although modest, were statistically significant because of the larger test sample size. There are at least two ways in which the evaluation test sample in this chapter can be expanded. First, one could annotate additional implicit argument data. This is technically straightforward; however, implicit argument annotation is an expensive process. Alternatively, one could add to the test sample predicate instances that are not constrained to take both an iarg0 and an iarg1 . This approach makes the modeling task more difficult, but the difficulty is one that needs to be addressed in order for the system to be practically applicable. Below, I discuss other issues surrounding the joint model and its wider application. 5.6.3 Toward a generally applicable joint model The joint model presented in this chapter assumes that all predicate instances take an iarg0 and an iarg1 . 
This assumption clearly does not hold for real data (these positions are often not expressed in the text), but relaxing it will require investigation of the following issues.

1. Explicit arguments should also be considered when determining whether a candidate c fills an implicit argument position iargn. The motivation here is similar to that given elsewhere in the current chapter: arguments (whether implicit or explicit) are not independent. This is demonstrated by the example at the beginning of this chapter (p. 110), where election is an explicit argument to the predicate and affects the implicit argument inference. The model developed in this chapter only considers jointly occurring implicit arguments.

2. Other implicit argument positions (e.g., iarg2, iarg3, etc.) need to be accounted for as well. This will present a challenge when it comes to extracting the necessary propositions from TextRunner. Currently, TextRunner only handles tuples of the form ⟨arg0, p, arg1⟩. Other argument positions are not directly analyzed by the system; however, because TextRunner also returns the sentence from which a tuple is extracted, these additional argument positions could be identified in the following way:

(a) For an instance of the sale predicate with an arg0 of company, to find likely arg2 fillers (the entity purchasing the item), query TextRunner with ⟨company, sell, ?⟩.

(b) Perform standard verbal SRL on the sentences for the resulting tuples, identifying any arg2 occurrences.

(c) Cluster and rank the arg2 fillers according to the method described in this chapter.

This approach combines Open Information Extraction with traditional information extraction (i.e., verbal SRL).

3. Computational complexity and probability estimation are problems for many joint models. The model presented in this chapter quickly becomes computationally intractable when the number of candidates and implicit argument positions grows moderately large. This is because Equation 5.12 (p. 115) considers all possible assignments of candidates to implicit argument positions. With as few as thirty candidates and five argument positions (not uncommon), one must evaluate 30!/25! = 17,100,720 possible assignments; a brief numerical illustration appears at the end of this chapter. Although this particular formulation is not tractable, one based on dynamic programming or heuristic search might give reasonable results.

5.7 Conclusions

Previous chapters of this dissertation have investigated the nature and recovery of semantic arguments for nominal predicates. Throughout these chapters, the models have assumed that the arguments are independent of each other. This assumption simplifies the computational modeling of semantic arguments, but it ignores the joint nature of natural language. In order to take advantage of the information provided by jointly occurring arguments, the local prediction models must be enhanced.

The current chapter has presented a preliminary investigation into the joint modeling of implicit arguments for nominal predicates. The model relies heavily on information extracted by the TextRunner extraction system, which pulls propositional tuples from millions of Internet webpages. These tuples encode world knowledge that is necessary for resolving semantic arguments in general and implicit arguments in particular. This chapter has proposed methods of aggregating tuple knowledge to guide implicit argument resolution. The aggregated knowledge is applied via a re-ranking model that operates on top of the local prediction model described in the previous chapter.
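To make the tractability concern from Section 5.6.3 concrete, the short calculation below counts the candidate-to-position assignments that the exhaustive formulation of Equation 5.12 would have to score. The numbers are the ones used in the discussion above; the snippet itself is purely illustrative.

```python
from math import perm  # requires Python 3.8+

# Assigning k distinct implicit argument positions to n candidate fillers
# (with no candidate reused) yields P(n, k) = n! / (n - k)! assignments.
print(perm(30, 2))  # 870 -- two positions (iarg0, iarg1) remain manageable
print(perm(30, 5))  # 17100720 -- five positions are already intractable
```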
In general, the model and results presented in this chapter are exploratory. The performance gains of the joint model versus the local model were not found to be statistically significant, due in large part to the small size of the testing corpus. It is possible to identify cases of improvement; however, a significant amount of future work will be required to make the model more effective and applicable in a wider usage context. Additional predicates and argument positions need to be accounted for, in turn requiring more efficient computational approaches. This effort will hopefully lead to better performance using an approach that more accurately reflects the joint properties of natural language.

CHAPTER 6

Summary of contributions and future work

6.1 Summary of contributions

6.1.1 A nominal SRL system for real-world use

This dissertation has addressed a number of preexisting issues surrounding nominal semantic role labeling. Most basic among these was the need for a nominal SRL system that is capable of handling unstructured textual input. Original work in nominal SRL assumed the existence of certain pieces of the SRL structure (i.e., nominal predicates). It has always been clear that this assumption does not hold when working on raw text; however, it was unclear whether, and to what extent, removal of this assumption would affect nominal SRL performance. This dissertation has confirmed that predicate identification is a crucial part of nominal SRL for many frequent predicates. Without a predicate identification model, the nominal SRL system presented in Chapter 2 suffers an argument F1 loss of approximately 8%.

The predicate identification model allows the nominal SRL system to move out of its simplified experimental environment and into real-world settings such as the processing of raw newswire text, intelligence reports, and other sources of important information. Genre changes will undoubtedly have a negative impact on performance, but a tremendous amount of text is created in a form very similar to the training genre of Wall Street Journal newswire. Given the frequency and semantic importance of nominal predicates within this genre, it is clear that the nominal SRL system has the potential to enhance automatic understanding of important textual resources. The nominal SRL system is freely available for non-commercial use (please contact the author at gerber.matthew@gmail.com for more information).

6.1.2 A focused, data-driven analysis of implicit arguments

Traditional SRL approaches such as the one just mentioned limit the search for arguments to the sentence containing the predicate of interest. Many systems take this assumption a step further and restrict the search to the predicate's local syntactic environment; however, predicates and the sentences that contain them rarely exist in isolation. As shown throughout this dissertation, they are usually embedded in a semantically rich discourse that contains complex phenomena such as coreference, coherence, and rhetorical structure. This dissertation has endeavored to make implicit arguments part of the discourse landscape. As a first step, this dissertation presents a manually annotated corpus of implicit arguments that complements the only other currently available corpus (Ruppenhofer et al., 2010). Analyses of these data reveal a number of insights. First, one finds a correspondence between a verb and its nominal form in terms of implicit arguments.
For the predicates considered in this dissertation, if an argument is required by the verb form and missing from the nominal form, then it is likely that the argument is present in the discourse that surrounds the nominal form. Second, the data show that implicit arguments contribute a substantial amount of novel information to the text. This information is not provided by arguments to verbs, which are another primary source of semantic information. Third, implicit arguments tend to be located in or near the sentence containing the predicate of interest. This is an important property when implicit argument identification is considered; without it, the search space could easily become unmanageable.

6.1.3 A novel model for implicit argument identification

Researchers formulated the task of implicit argument identification more than two decades ago; however, the task has received relatively little attention since that time. This dissertation presents a novel model for the implicit argument identification task and evaluates the model using the manually constructed corpus described above. The model draws evidence from a variety of sources, many of them created specifically for the task. In general, the most informative features are derived from semantic sources instead of the syntactic sources commonly used by standard SRL systems. Using this information, the system is able to recover implicit arguments with an overall F1 score of 50%. This result represents the current state of the art, as there are no other reported results for this task to compare against.

Lastly, this dissertation contributes a preliminary exploration of joint modeling for implicit arguments. The other models in this dissertation assume that arguments are independent, regardless of whether they are implicit or not. The joint implicit argument model re-ranks the output of the independent model using knowledge extracted from millions of Internet webpages. This knowledge helps to identify likely joint occurrences of implicit arguments. Overall gains from this approach are small; however, the experiments and discussion constitute a starting point for future work in this direction.

6.2 Summary of future work

The models in this dissertation (with the exception of the model described in Chapter 5) apply an argument independence assumption. Under this assumption, each argument can be identified independently of each other argument. A wide range of psycholinguistic evidence suggests that this assumption does not reflect the true nature of human sentence comprehension, which builds up joint semantic structures at the sentence and discourse level (see Section 4.2 for details). Experimentally, researchers have found that joint models of semantic arguments can improve automatic identification for verbal predicates. Based on these findings, it seems natural to formulate a joint model for nominal SRL (the standard, non-implicit task described in Chapters 2 and 3).

The implicit argument model described in Chapter 4 can be improved in a variety of ways that are not directly related to joint modeling. As shown in Section 4.6, many implicit argument identification errors were caused by the absence of true implicit arguments within the set of candidate constituents. More intelligent windowing strategies, in addition to alternate candidate sources, might offer some improvement; a minimal version of such a windowing strategy is sketched below.
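The following sketch illustrates the kind of windowing strategy meant here: candidate constituents are drawn from the predicate's sentence and a small number of preceding sentences. The data structures and the two-sentence default are assumptions made for illustration rather than the implementation evaluated in Chapter 4.

```python
def candidate_constituents(parsed_sentences, predicate_sentence_index, window=2):
    """Collect candidate implicit-argument fillers from the sentence that
    contains the nominal predicate and from the `window` sentences that
    precede it.  Each parsed sentence is assumed to expose a method
    noun_phrases() returning its NP constituents."""
    start = max(0, predicate_sentence_index - window)
    candidates = []
    for sentence in parsed_sentences[start:predicate_sentence_index + 1]:
        candidates.extend(sentence.noun_phrases())
    return candidates
```

A more intelligent strategy might widen the window when nearby candidates score poorly, or draw additional candidates from coreference chains rather than from raw sentence positions alone.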
Although I consistently observed development gains from using automatic coreference resolution, this process creates errors that need to be studied more closely. It will also be important to study the implicit argument patterns of non-verbal predicates such as the partitive percent. These predicates are among the most frequent in the TreeBank and are likely to require approaches that differ from the ones pursued here.

The implicit argument model developed in Chapter 4 is not generally applicable. It is limited to the ten predicates for which there exist manually annotated training data. Additional data will be required in order to extract implicit arguments for all predicates. An entirely manual annotation project is feasible; however, it will be complicated by the fact that implicit argument annotation is labor intensive and potentially error prone. This is because the annotation process requires both argument and coreference identification, each of which is difficult by itself. Thus, it might be productive to combine additional manual implicit argument annotation with semi-supervised learning from labeled and unlabeled data. Similar approaches have been applied to the standard verbal SRL task (for recent examples, see Lang and Lapata (2010), Fürstenau and Lapata (2009), Deschacht and Moens (2009), Abend et al. (2009), and Swier and Stevenson (2004)).

Regardless of the training corpus size, the implicit argument model will inevitably encounter previously unseen predicates. The negative effects of these predicates can be mitigated by designing features that transfer well. For example, recall Feature 11 (p. 138). This feature has a value of true when the candidate implicit argument is an argument to a predicate that is in the same VerbNet class as the predicate being filled. This particular feature does not depend on the exact predicates under consideration; it only checks whether they are in the same class. As such, this feature would transfer well to predicates that were not observed in the training data. VerbNet, with its verb classes and network structure, is likely to be a good source of information.

The joint implicit argument model developed in Chapter 5 produced reasonable gains for two of the predicates, but the overall improvement was small. Additional work will be required in order to fully understand the potential of joint modeling for implicit arguments. The model will need to be extended to argument positions other than iarg0 and iarg1. This, in turn, will require a method of evaluating the possible assignments of candidates to more than two argument positions, a potentially intractable problem if done by brute force. Dynamic programming and heuristic search are possible answers to this problem.

As mentioned in the preceding section, it is possible to evaluate the standard nominal SRL model (no implicit arguments) on textual resources other than the Wall Street Journal, upon which this dissertation is based. The implicit argument model can be evaluated similarly, as long as the test cases involve only the ten predicates for which the model is designed. Such experiments are important because they test the ability of the model to generalize across domains. Domains with specialized vocabularies (biomedicine, for example) will pose significant challenges. Many of the features used throughout this dissertation are lexical in nature; that is, they depend on the actual word content of a phrase. Specialized vocabularies will not be accounted for in the training data.
A genre such as standard news reporting will pose less of a problem, but one should expect performance to drop nonetheless. Domain adaptation techniques such as those presented by Daumé et al. (2010) should have something to offer in this respect.

APPENDIX

A.1 Support verb identification

Support verbs link long-distance arguments to nominal predicates. For example:

(A.1) [Arg0 John] [Support took] a [Predicate walk].

In Example A.1, took does not have the usual meaning of forcibly changing possession; rather, this verb's purpose is to bring in John as the Arg0 (walker) of walk. I created a binary logistic regression model to automatically identify support verb tokens. The model uses the features shown in Table A.1 below. I set the LibLinear model parameters as follows: bias = 1, c = 4, w+ = 1. A prediction threshold of t = 0.294 was used at testing time. Overall support verb F1 for this model was 53.36%.

#   Feature value description
1   First word subsumed by n.
2   Semantic head of n's right sibling.
3   Context-free grammar rule that expands n's right sibling.
4   Syntactic head of n's left sibling.
5   Context-free grammar rule that expands n's grandparent.
6   Context-free grammar rule that expands n's parent.
7   Last word of n's right sibling.
8   Head word of n's right sibling.
9   Context-free grammar rule that expands n's left sibling.
10  The object head of the prepositional phrase that follows n.
11  Parse tree path to nearest passive verb.
12  Part of speech (POS) of the head word of n's right sibling.
13  The POS of n's parent's head word.
14  n's parent's head word.
15  The syntactic category of n's right sibling.
16  n's syntactic category.
17  The POS of the syntactic head word of n's left sibling.
18  Context-free grammar rule that expands n's great-grandparent.

Table A.1: Features used for support verb identification, sorted by feature selection rank. All features were based on automatically identified syntactic parse trees.

A.2 Nominal predicate features

#    Feature value description
1*   n's ancestor grammar rules.
2    n's stemmed text.
3    Syntactic category of n's right sibling.
4    First word of n's left sibling.
5*   Parse tree path from n to previous nominal, with lexicalized source.
6    The stemmed content words in a one-word window around n.
7    n's morphological suffix.
8    Parse tree path from n to closest support verb, with lexicalized destination.
9    Parse tree path to nearest passive verb.
10   Number of left siblings of n.
11   Parse tree path to previous predicate node, with lexicalized source.
12   Semantic head word of n's right sibling.
13   Parse tree path from n to previous nominal with lexicalized source and destination.
14   Syntactic head word of n's parent.
15   Head word of n's left sibling.
16*  PropBank markability score.
17   Signed token distance between n and nearest support verb.
18   Parse tree path from n to previous nominal.
19   The object head of the prepositional phrase that follows n.
20   Whether or not n is followed by a prepositional phrase.
21   Semantic head word of n's left sibling.
22   Whether or not n surfaces before a passive verb.
23   Parse tree paths from n to each support verb, including the support verb.
24   Syntactic category of n's left sibling.
25   Context-free grammar rule of n's left sibling.
26   Whether or not n is the head of its parent node.
27   Whether or not the previous term in the lexicon is the previous predicate.
28   Context-free grammar rule of n's grandparent.
29   Parse tree path from n to previous nominal, with lexicalized destination.
30   Number of right siblings of n.
31   First word of n's right sibling.
32   Last word of n's right sibling.
33   Signed token distance between n and the previous predicate.
34   Part of speech of the syntactic head of n's right sibling.

Table A.2: Nominal predicate features, sorted by gain in selection algorithm. & denotes feature concatenation. Features marked with an asterisk are explained on page 45. Johansson and Nugues (2008) used features similar to 2 and 26.

A.3 Nominal argument features

#    Feature value description
1    12 & 26.
2    Position of n relative to p (before/after) & 26.
3    First word subsumed by n.
4*   12 & Position of n relative to p (before/after).
5    12 & 14.
6    Head word of n's parent.
7    Last word subsumed by n.
8    n's syntactic category & length of 26.
9    First word of n's right sibling.
10*  Context-free grammar rule that expands the parent of p.
11   Head word of the right-most NP in n if n is a PP.
12   Stem of p according to a Porter stemmer.
13   Parse tree path from n to the lowest common ancestor (LCA) of n and p.
14   Head word of n.
15   12 & n's syntactic category.
16   Context-free grammar rule that expands n's parent.
17*  Parse tree path from n to the nearest support verb.
18   Last part of speech (POS) subsumed by n.
19   Context-free grammar rule that expands n's left sibling.
20   Head word of n, if the parent of n is a PP.
21   The POS of the head word of the right-most NP under n if n is a PP.
22   Last word of n's left sibling.
23   Syntactic category of n.
24   Whether or not n comes before a passive verb.
25   Context-free grammar rule that expands n's right sibling.
26*  Parse tree path from n to p.
27   Whether or not n is under an NP headed by p.
28   First POS subsumed by n.
29   Whether or not n is an NP headed by p and is also adjacent to a VP.
30   Signed token distance from n to p.
31   Tree depth of the LCA of n and p.
32   Syntactic category of the LCA of n and p.

Table A.3: Nominal argument features, sorted by gain in selection algorithm. n indicates the candidate argument node being classified. p indicates the predicate under consideration. & denotes feature concatenation. Features marked with an asterisk are explained on page 29. Ten of these features were not used by Jiang and Zhai (2006) or Liu and Ng (2007).

A.4 Role sets for the annotated predicates

Listed below are the role sets for the ten predicates used in Chapters 4 and 5.
Role set for bid:
  Arg0: bidder
  Arg1: thing being bid for
  Arg2: amount of the bid

Role set for investor:
  Arg0: investor
  Arg1: thing invested
  Arg2: thing invested in

Role set for sale:
  Arg0: seller
  Arg1: thing sold
  Arg2: buyer
  Arg3: price paid
  Arg4: beneficiary of sale

Role set for price:
  Arg0: seller
  Arg1: commodity
  Arg2: price
  Arg3: secondary commodity

Role set for loan:
  Arg0: giver
  Arg1: thing given
  Arg2: entity given to
  Arg3: loan against (collateral)
  Arg4: interest rate

Role set for loss:
  Arg0: entity losing something
  Arg1: thing lost
  Arg2: entity gaining thing lost
  Arg3: source of loss

Role set for cost:
  Arg1: commodity
  Arg2: price
  Arg3: buyer
  Arg4: secondary commodity

Role set for investment:
  Arg0: investor
  Arg1: thing invested
  Arg2: thing invested in

Role set for fund:
  Arg0: funder
  Arg1: thing funded
  Arg2: amount of funding
  Arg3: beneficiary

Role set for plan:
  Arg0: planner
  Arg1: thing planned
  Arg2: beneficiary of plan
  Arg3: secondary plan

A.5 Implicit argument features

#    Score  Feature value description
1    8.2    For every f, pf & argf & p & iargn.
2    4.0    Sentence distance from c to p.
3    3.6    For every f, the head word of f & the verbal form of p & iargn.
4    3.3    Same as 1 except generalizing pf and p to their WordNet synsets.
5    1.0    For every f, the WordNet synset for the head of f & the verbal form of p & iargn.
6    1.0    Whether or not c and p are themselves arguments to the same predicate.
7    0.7    p & the semantic head word of p's right sibling.
8    0.7    Whether or not any argf and iargn have the same integer argument position.
9*   0.6    Frame element path between argf of pf and iargn of p in FrameNet (Baker et al., 1998).
10   0.6    Percentage of elements in c′ that are subjects of a copular for which p is the object.
11*  0.6    Whether or not the verb forms of pf and p are in the same VerbNet class and argf and iargn have the same thematic role.
12   0.6    p & the last word of p's right sibling.
13*  0.6    Maximum targeted PMI between argf of pf and iargn of p.
14   0.5    p & the number of p's right siblings.
15   0.5    Percentage of elements in c′ that are objects of a copular for which p is the subject.
16   0.5    Frequency of the verbal form of p within the document.
17   0.5    p & the stemmed content words in a one-word window around p.
18   0.4    Whether or not p's left sibling is a quantifier (e.g., many, most, all, etc.). Quantified predicates tend not to take implicit arguments.
19   0.4    Percentage of elements in c′ that are copular objects.
20   0.4    TF cosine similarity between words from arguments of all pf and words from arguments of p.
21   0.4    Whether the path defined in 9 exists.
22   0.4    Percentage of elements in c′ that are copular subjects.
23*  0.4    For every f, the VerbNet class/role of pf/argf & the class/role of p/iargn.
24   0.4    Percentage of elements in c′ that are indefinite noun phrases.
25*  0.3    p & the syntactic head word of p's right sibling.
26   0.3    p & the stemmed content words in a two-word window around p.
27*  0.3    Minimum selectional preference between any f and iargn of p. Uses the method described by Resnik (1996) computed over an SRL-parsed version of the Penn TreeBank and Gigaword (Graff, 2003) corpora.
28   0.3    p & p's synset in WordNet.
29   0.3    Same as 27 except using the maximum.
30   0.3    Average per-sentence frequency of the verbal form of p within the document.
31   0.3    p itself.
32   0.3    p & whether p is the head of its parent.
33*  0.3    Minimum coreference probability between argf of pf and iargn of p.
34   0.3    p & whether p is before a passive verb.
35   0.3    Percentage of elements in c′ that are definite noun phrases.
36   0.3    Percentage of elements in c′ that are arguments to other predicates.
37   0.3    Maximum absolute sentence distance from any f to p.
38   0.2    p & p's syntactic category.
39   0.2    TF cosine similarity between the role description of iargn and the concatenated role descriptions of all argf.
40   0.2    Average TF cosine similarity between each argn of each pf and the corresponding argn of p, where ns are equal.
41   0.2    Same as 40 except using the maximum.
42   0.2    Same as 40 except using the minimum.
43   0.2    p & the head of the following prepositional phrase's object.
44   0.2    Whether any f is located between p and any of the arguments annotated by NomBank for p. When true, this feature rules out false positives because it implies that the NomBank annotators considered and ignored f as a local argument to p.
45   0.2    Number of elements in c′.
46   0.2    p & the first word of p's right sibling.
47   0.2    p & the grammar rule that expands p's parent.
48   0.2    Number of elements in c′ that are arguments to other predicates.
49   0.2    Nominal form of p & iargn.
50   0.2    p & the syntactic parse tree path from p to the nearest passive verb.
51   0.2    Same as 37 except using the minimum.
52   0.2    Same as 33 except using the average.
53   0.2    Verbal form of p & iargn.
54   0.2    p & the first word of p's left sibling.
55   0.2    Average per-sentence frequency of the nominal form of p within the document.
56   0.2    p & the part of speech of p's parent's head word.
57   0.2    Same as 33 except using the maximum.
58   0.1    Same as 37 except using the average.
59*  0.1    Minimum path length between argf of pf and iargn of p within VerbNet (Kipper, 2005).
60   0.1    Frequency of the nominal form of p within the document.
61   0.1    p & the number of p's left siblings.
62   0.1    p & p's parent's head word.
63   0.1    p & the syntactic category of p's right sibling.
64   0.1    p & p's morphological suffix.
65   0.1    TF cosine similarity between words from all f and words from the role description of iargn.
66   0.1    Percentage of elements in c′ that are quantified noun phrases.
67*  0.1    Discourse relation whose two discourse units cover c (the primary filler) and p.
68   0.1    For any f, the minimum semantic similarity between pf and p using the method described by Wu and Palmer (1994) over WordNet (Fellbaum, 1998).
69   0.1    p & whether or not p is followed by a prepositional phrase.
70   0.1    p & the syntactic head word of p's left sibling.
71   0.1    p & the stemmed content words in a three-word window around p.
72   0.1    Syntactic category of c & iargn & the verbal form of p.
73   0.1    Nominal form of p & the sorted integer argument indexes (the ns) from all argn of p.
74   0.1    Percentage of elements in c′ that are sentential subjects.
75   0.1    Whether or not the integer position of any argf equals that of iargn.
76   0.1    Same as 13 except using the average.
77   0.1    Same as 27 except using the average.
78   0.1    p & p's parent's syntactic category.
79   0.1    p & the part of speech of the head word of p's right sibling.
80   0.1    p & the semantic head word of p's left sibling.
81   0.1    Maximum targeted coreference probability between argf of pf and iargn of p. This is a hybrid feature that calculates the coreference probability of Feature 33 using the corpus tuning method of Feature 13.
Table A.4: Features for determining whether c fills iargn of predicate p. For each mention f (denoting a filler) in the coreference chain c′, pf and argf are the predicate and argument position of f. Unless otherwise noted, all argument positions (e.g., argn and iargn) should be interpreted as the integer label n instead of the underlying word content of the argument. The & symbol denotes concatenation; for example, a feature value of "p & iargn" for the iarg0 position of sale would be "sale-0". Features marked with an asterisk are explained in Section 4.4.2 (p. 84). The Score column gives a heuristic ranking score for the features across all evaluation folds (see page 103 for discussion).

A.6 Per-fold results for implicit argument identification

Fold 1: selected features (in rank order) 1, 2, 3, 11, 32, 8, 27, 22, 31, 10, 20, 53, 6, 16, 24, 40, 30, 38, 72, 69, 73, 19, 28, 42, 48, 64, 44, 36, 37, 12, 7; baseline F1 31.7; c = 0.25; w+ = 4; t = 0.39260; discriminative F1 47.1; oracle F1 86.7.

Fold 2: selected features 1, 3, 2, 4, 17, 13, 28, 11, 6, 18, 25, 12, 56, 29, 16, 53, 41, 31, 46, 10, 7, 51, 15, 22; baseline F1 32; c = 0.25; w+ = 256; t = 0.80629; discriminative F1 51.5; oracle F1 86.9.

Fold 3: selected features 4, 3, 2, 8, 7, 6, 59, 20, 9, 62, 37, 39, 41, 19, 10, 15, 11, 35, 61, 44, 42, 40, 32, 30, 16, 75, 33, 24; baseline F1 35.3; c = 0.25; w+ = 256; t = 0.90879; discriminative F1 55.8; oracle F1 88.1.

Fold 4: selected features 1, 2, 5, 13, 8, 49, 6, 35, 34, 14, 15, 18, 36, 28, 20, 45, 3, 43, 24, 48, 10, 29, 12, 30, 33, 65, 31, 22, 61, 16, 27, 41, 60, 55, 64; baseline F1 27.8; c = 0.25; w+ = 4; t = 0.38540; discriminative F1 45.8; oracle F1 86.5.

Fold 5: selected features 1, 2, 26, 3, 4, 23, 5, 63, 55, 6, 12, 44, 42, 65, 7, 71, 18, 15, 10, 14, 52, 34, 19, 24, 50, 58; baseline F1 25.8; c = 0.125; w+ = 1024; t = 0.87629; discriminative F1 45.9; oracle F1 88.

Fold 6: selected features 1, 3, 2, 14, 23, 38, 25, 39, 16, 6, 21, 68, 70, 58, 9, 22, 18, 31, 60, 10, 64, 15, 66, 19, 30, 51, 56, 28; baseline F1 34.8; c = 0.25; w+ = 256; t = 0.87759; discriminative F1 55.4; oracle F1 90.8.

Fold 7: selected features 1, 2, 4, 3, 47, 54, 43, 7, 33, 9, 67, 24, 36, 50, 40, 12, 21; baseline F1 22.9; c = 0.25; w+ = 256; t = 0.81169; discriminative F1 46.3; oracle F1 87.4.

Fold 8: selected features 1, 3, 2, 4, 9, 7, 14, 12, 6, 46, 30, 18, 19, 36, 48, 42, 37, 45, 60, 56, 61, 51, 15, 10, 41, 40, 25, 31, 11, 39, 62, 69, 34, 16, 33, 8, 38, 20, 78, 44, 55, 80, 53, 50, 52, 49, 24, 28, 57; baseline F1 27.1; c = 0.0625; w+ = 512; t = 0.92019; discriminative F1 47.4; oracle F1 87.2.

Fold 9: selected features 1, 5, 2, 4, 3, 21, 27, 10, 15, 9, 57, 35, 16, 25, 37, 33, 45, 24, 46, 29, 19, 34, 51, 50, 22, 48, 32, 11, 12, 58, 41, 8, 76, 18, 30, 40, 77, 6, 66, 44, 43, 79, 81, 20; baseline F1 23; c = 0.0625; w+ = 32; t = 0.67719; discriminative F1 54.1; oracle F1 85.5.

Fold 10: selected features 4, 3, 2, 17, 1, 13, 29, 12, 11, 52, 10, 15, 6, 16, 9, 22, 7, 21, 57, 19, 74, 34, 45, 20, 66; baseline F1 28.4; c = 0.0625; w+ = 512; t = 0.89769; discriminative F1 53.2; oracle F1 88.5.

Table A.5: Per-fold results for implicit argument identification. Each entry lists the fold used for testing, the selected features in rank order, the baseline F1 (%), the LibLinear cost parameter c, the LibLinear weight w+ for the positive class, the implicit argument confidence threshold t, the discriminative (LibLinear) F1 (%), and the oracle F1 (%). A bias of 1 was used for all LibLinear models.

A.7 Examples of implicit argument identification

Below, I provide two additional examples of implicit argument identification. The examples are drawn from the output of the implicit argument model described in Chapter 4.

Example 1: Within-sentence implicit argument using preference information

Consider the following sentence from article wsj 0308 in the Penn TreeBank:

(A.2) Sea Containers Ltd., in a long-awaited move to repel a [Purpose hostile takeover] [Predicate bid], said it will sell $1.1 billion of assets and use some of the proceeds to buy about 50% of [iarg1 its common shares] for $70 apiece.
This sentence describes a situation in which Sea Containers is being pursued by an outside company. The outside company is considering making an unsolicited bid for shares of Sea Containers. This move would effectively transfer ownership of Sea Containers to the outside company, and Sea Containers is trying to avoid this by purchasing a large number of its own shares.

With respect to the bid predicate in sentence A.2, we are looking for the iarg1 (the entity being bid for). The answer, although present in the same sentence as the predicate, is not local to the predicate and is not identified by the standard nominal SRL system. Identification of its common shares as the iarg1 relies primarily on selectional preference information: shares are often bid for. This fact is observed when training the selectional preference model.

Example 2: Inter-sentence implicit argument using coreference information

Consider the following sentences from article wsj 0286 in the Penn TreeBank:

(A.3) Nissan has increased earnings more than market share by cutting costs and by taking advantage of a general surge in Japanese car sales.

(A.4) But Nissan expects to earn only 120 billion yen in the current fiscal year, a modest increase of 4.7%.

(A.5) The big reason: For all its cost-cutting, [iarg0 Nissan] remains less efficient than Toyota.

(A.6) In its last fiscal year, Nissan's profit represented just 2.3% of [Predicate sales], compared with 4.3% at Toyota.

These sentences describe a situation in which Nissan has increased sales but has not experienced a commensurate increase in profit due to its inefficiencies. With respect to the sale predicate in sentence A.6, we are looking for the iarg0 (the entity performing the selling). The implicit argument model has identified Nissan from A.5 as the filler of this argument position. This is an interesting inference because sentence A.5 does not contain much supporting evidence. The key is coreference: the system has identified many coreferent mentions of Nissan throughout the text, and these mentions participate in events (e.g., earn in A.3 and A.4 and profit in A.6) that are related to the sale event in the final sentence.

A.8 Forward floating feature subset selection algorithm

F: set of n features to select from
T: set of training instances containing all n features
V: set of validation instances containing all n features

Algorithm:
B ← {}                                       # Best feature subset from F
ScoreB ← −∞                                  # Score of best feature subset
P ← {}                                       # All features used in previous round
while |F| > 0 do
    b ← null; Scoreb ← −∞                    # Track best new feature
    for all f ∈ F do
        Scoref ← eval(P ∪ {f}, T, V)         # Evaluate using previous/new feats.
        if Scoref > Scoreb then
            b ← f; Scoreb ← Scoref           # Update best new feature/score
        end if
    end for
    P ← P ∪ {b}; ScoreP ← Scoreb             # Update previous features
    F ← F − {b}                              # Remove best feature from pool
    if ScoreP > ScoreB then
        while |P| > 1 do
            r ← null; Scorer ← ScoreP        # Track best feature to remove
            for all f ∈ P do
                Scoref ← eval(P − {f}, T, V) # Evaluate after removing feature f
                if Scoref > Scorer then
                    r ← f; Scorer ← Scoref   # Update best feature to remove
                end if
            end for
            if Scorer > ScoreP then
                P ← P − {r}; ScoreP ← Scorer # Remove feature and update score
                F ← F ∪ {r}                  # Return removed feature to pool
            else
                break                        # No improvement from backtracking
            end if
        end while
        B ← P; ScoreB ← ScoreP               # Update best feature subset
    end if
end while
return B                                     # Return best feature subset

REFERENCES

Abend, O., Reichart, R., and Rappoport, A. (2009).
Unsupervised argument identification for semantic role labeling. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 28–36, Suntec, Singapore. Association for Computational Linguistics. ACE (2007). The ACE 2007 Evaluation Plan. NIST, 1.3a edition. ACE (2008). The ACE 2008 Evaluation Plan. NIST, 1.2d edition. Adger, D. (2003). Core Syntax. Oxford. Baker, C., Fillmore, C., and Lowe, J. (1998). The Berkeley FrameNet project. In Boitet, C. and Whitelock, P., editors, Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguistics, pages 86–90, San Francisco, California. Morgan Kaufmann Publishers. Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., and Etzioni, O. (2007). Open information extraction from the web. In Proceedings of the 20th International Joint Conference on Artificial Intelligence. Banko, M. and Etzioni, O. (2008). The tradeoffs between open and traditional relation extraction. In Proceedings of ACL-08: HLT, pages 28–36, Columbus, Ohio. Association for Computational Linguistics. Berger, A., Pietra, V., and Pietra, S. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22:39–71. Bhagat, R., Pantel, P., and Hovy, E. (2007). LEDIR: An unsupervised algorithm for learning directionality of inference rules. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 161–170, Prague, Czech Republic. Association for Computational Linguistics. Bikel, D. M., Schwartz, R., and Weischedel, R. M. (1999). An algorithm that learns what’s in a name. Mach. Learn., 34(1-3):211–231. Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022. 146 Bos, J. (2005). Towards wide-coverage semantic interpretation. In Proceedings of the Sixth International Workshop on Computational Semantics, pages 42–53. Burchardt, A., Frank, A., and Pinkal, M. (2005). Building text meaning representations from contextually related frames - a case study. In Proceedings of the Sixth International Workshop on Computational Semantics. Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121167. Carpenter, P. A., Miyake, A., and Just, M. A. (1995). Language comprehension: Sentence and discourse processing. Annu. Rev. Psychol., 46:91–120. Carreras, X. and M`rquez, L. (2004). Introduction to the conll-2004 shared task: Semantic a role labeling. In Proceedings of the Conference on Computational Natural Language Learning. Carreras, X. and M`rquez, L. (2005). Introduction to the CoNLL-2005 shared task: Semantic a role labeling. Chambers, N. and Jurafsky, D. (2008). Unsupervised learning of narrative event chains. In Proceedings of the Association for Computational Linguistics, pages 789–797, Columbus, Ohio. Association for Computational Linguistics. Charniak, E., Goldwater, S., and Johnson, M. (1998). Edge-based best-first chart parsing. In Sixth Workshop on Very Large Corpora, pages 127–133. Charniak, E. and Johnson, M. (2005). Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Chen, B., Su, J., and Tan, C. L. (2010). 
Resolving event noun phrases to their verbal mentions. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 872–881, Cambridge, MA. Association for Computational Linguistics. Chen, Z. and Ji, H. (2009). Graph-based event coreference resolution. In Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing (TextGraphs-4), pages 54–57, Suntec, Singapore. Association for Computational Linguistics. Chinchor, N., Lewis, D. D., and Hirschmant, L. (1993). Evaluating message understanding systems: An analysis of the third message understanding conference. Computational Linguistics, 19(3):409–450. Cohen, J. (1960). A coefficient of agreement for nominal scales. Psychological Measurement, 20(1):3746. Educational and Copestake, A. and Flickinger, D. (2000). An open-source grammar development environment and broad-coverage english grammar using hpsg. In Proc. LREC-2000. 147 Dahl, D. A., Palmer, M. S., and Passonneau, R. J. (1987). Nominalizations in pundit. In Proceedings of the 25th annual meeting on Association for Computational Linguistics, pages 131–139, Morristown, NJ, USA. Association for Computational Linguistics. Dang, H. T., Kelly, D., and Lin, J. J. (2007). Overview of the trec 2007 question answering track. In TREC. Daum´, H., Deoskar, T., McClosky, D., Plank, B., and Tiedemann, J., editors (2010). e Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing. Association for Computational Linguistics, Uppsala, Sweden. Deschacht, K. and Moens, M.-F. (2009). Semi-supervised semantic role labeling using the Latent Words Language Model. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 21–29, Singapore. Association for Computational Linguistics. Di Eugenio, B. and Glass, M. (2004). The kappa statistic: a second look. Comput. Linguist., 30(1):95–101. Dowty, D. (1991). Thematic proto-roles and argument selection. Language, 67:547–619. Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall, New York. Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. (2008). LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871– 1874. Fellbaum, C. (1998). WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press. Fillmore, C. (1968). The case for case. In Bach, E. and Harms, R., editors, Universals in Linguistic Theory. Holt, Rinehart, and Winston. Fillmore, C. (1976). Frame semantics and the nature of language. In Harnad, S., Steklis, H., and Lancaster, J., editors, Origins and Evolution of Language and Speech. The New York Academy of Sciences. Fillmore, C. and Baker, C. (2001). Frame semantics for text understanding. In Proceedings of WordNet and Other Lexical Resources Workshop, NAACL. F¨ rstenau, H. and Lapata, M. (2009). Graph alignment for semi-supervised semantic role lau beling. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 11–20, Singapore. Association for Computational Linguistics. Gerber, M. and Chai, J. (2010). Beyond NomBank: A study of implicit arguments for nominal predicates. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1583–1592, Uppsala, Sweden. Association for Computational Linguistics. 148 Gerber, M., Chai, J., and Meyers, A. (2009). The role of implicit argumentation in nominal SRL. 
In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 146–154, Boulder, Colorado. Association for Computational Linguistics. Gildea, D. and Jurafsky, D. (2002). Automatic labeling of semantic roles. Computational Linguistics, 28:245–288. Gildea, D. and Palmer, M. (2001). The necessity of parsing for predicate argument recognition. In ACL ’02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 239–246, Morristown, NJ, USA. Association for Computational Linguistics. Gillick, D. (2009). Sentence Boundary Detection and the Problem with the U.S. Proceedings of NAACL: Short Papers. In Girju, R., Nakov, P., Nastase, V., Szpakowicz, S., Turney, P., and Yuret, D. (2007). Semeval2007 task 04: Classification of semantic relations between nominals. In Proceedings of the 4th International Workshop on Semantic Evaluations. Gordon, A. and Swanson, R. (2007). Generalizing semantic role annotations across syntactically similar verbs. In Proceedings of ACL, pages 192–199. Gorn, S. (1967). Explicit definitions and linguistic dominoes. In Hart, J., editor, Systems and Computer Science, pages 77–115. University of Toronto Press, Toronto Canada. Graesser, A. C. and Clark, L. F. (1985). Structures and Procedures of Implicit Knowledge. Ablex Publishing Corporation. Graff, D. (2003). English Gigaword. Linguistic Data Consortium, Philadelphia. Grosz, B. J., Joshi, A. K., and Weinstein, S. (1995). Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):203–225. Gruber, J. (1965). Studies in Lexical Relations. PhD thesis, MIT. Hajiˇ, J., Ciaramita, M., Johansson, R., Kawahara, D., Mart´ M. A., M`rquez, L., Meyers, c ı, a ˇ ep´nek, J., Straˇ ´k, P., Surdeanu, M., Xue, N., and Zhang, A., Nivre, J., Pad´, S., Stˇ a o na Y. (2009). The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, pages 1–18, Boulder, Colorado. Association for Computational Linguistics. Harris, Z. (1985). Distributional structure. In Katz, J. J., editor, The Philosophy of Linguistics, pages 26–47. New York: Oxford University Press. Heim, I. and Kratzer, A. (1998). Semantics in Generative Grammar. Blackwell, Oxford. 149 ´ e Hendrickx, I., Kim, S. N., Kozareva, Z., Nakov, P., O S´aghdha, D., Pad´, S., Pennacchiotti, o M., Romano, L., and Szpakowicz, S. (2010). Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 33–38, Uppsala, Sweden. Association for Computational Linguistics. Hirst, G. (1987). Semantic Interpretation and the Resolution of Ambiguity. Cambridge University Press. Hsu, C.-W., Chang, C.-C., , and Lin, C.-J. (2010). A practical guide to support vector classi cation. Technical report, National Taiwan University. Hull, R. and Gomez, F. (1996). Semantic interpretation of nominalizations. In Proceedings of AAAI. Iida, R., Komachi, M., Inui, K., and Matsumoto, Y. (2007). Annotating a Japanese text corpus with predicate-argument and coreference relations. In Proceedings of the Linguistic Annotation Workshop in ACL-2007, page 132139. Imamura, K., Saito, K., and Izumi, T. (2009). Discriminative approach to predicateargument structure analysis with zero-anaphora resolution. 
In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 85–88, Suntec, Singapore. Association for Computational Linguistics. Jiang, J. and Zhai, C. (2006). Exploiting domain structure for named entity recognition. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 74–81, Morristown, NJ, USA. Association for Computational Linguistics. Jiang, Z. and Ng, H. (2006). Semantic role labeling of nombank: A maximum entropy approach. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Johansson, R. and Nugues, P. (2008). Dependency-based syntactic–semantic analysis with propbank and nombank. In CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning, pages 183–187, Manchester, England. Coling 2008 Organizing Committee. Joshi, M., Pakhomov, S., Pedersen, T., Maclin, R., and Chute, C. (2006). An end-to-end supervised target-word sense disambiguation system. In Proceedings of the Twenty-first National Conference on Artificial Intelligence, pages 1941–1942. Kaisser, M. and Webber, B. (2007). Question answering based on semantic roles. In ACL 2007 Workshop on Deep Linguistic Processing, pages 41–48, Prague, Czech Republic. Association for Computational Linguistics. Kamp, H. and Reyle, U. (1993). From Discourse to Logic. Kluwer, Dordrecht. 150 Kingsbury, P. and Palmer, M. (2003). Propbank: the next level of treebank. In Proceedings of Treebanks and Lexical Theories. Kipper, K. (2005). VerbNet: A broad-coverage, comprehensive verb lexicon. PhD thesis, Department of Computer and Information Science University of Pennsylvania. Kipper, K., Dang, H. T., and Palmer, M. (2000). Class-based construction of a verb lexicon. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 691–696. AAAI Press / The MIT Press. Krippendorff, K. (1980). Content Analysis: An Introduction to Its Methodology. Sage Publications. Kuˇera, H. and Nelson, F. W. (1967). Computational Analysis of Present-day American c English. Brown University Press, Providence, RI. Lang, J. and Lapata, M. (2010). Unsupervised induction of semantic roles. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 939–947, Los Angeles, California. Association for Computational Linguistics. Lapata, M. (2000). The automatic interpretation of nominalizations. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 716–721. AAAI Press / The MIT Press. Levin, B. (1993). English verb classes and alternations: A preliminary investigation. Chicago University Press. Lin, D. and Pantel, P. (2001). Discovery of inference rules for question-answering. Nat. Lang. Eng., 7(4):343–360. Liu, C. and Ng, H. (2007). Learning predictive structures for semantic role labeling of nombank. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 208–215, Prague, Czech Republic. Association for Computational Linguistics. Liu, X., Han, B., Li, K., Stiller, S. H., and Zhou, M. (2010). SRL-based verb selection for ESL. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1068–1076, Cambridge, MA. 
Association for Computational Linguistics. Macleod, C., Grishman, R., Meyers, A., Barrett, L., and Reeves, R. (1998). Nomlex: A lexicon of nominalizations. In Proceedings of the Eighth International Congress of the European Association for Lexicography. Marcus, M., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: the Penn TreeBank. Computational Linguistics, 19:313–330. 151 May, J. and Knight, K. (2007). Syntactic re-alignment models for machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 360– 368, Prague, Czech Republic. Association for Computational Linguistics. Meyers, A. (2007a). Annotation guidelines for NomBank - noun argument structure for PropBank. Technical report, New York University. Meyers, A. (2007b). Those other nombank dictionaries. Technical report, New York University. Meyers, A., Macleod, C., Yangarber, R., Grishman, R., Barrett, L., and Reeves, R. (1998). Using nomlex to produce nominalization patterns for information extraction. In Proceedings of the COLING-ACL Workshop on the Computational Treatment of Nominals. Mooney, R. J. (2007). Learning for semantic parsing. In Proceedings of the 8th International Conference, CICLing. Moschitti, A., Pighin, D., and Basili, R. (2008). Tree kernels for semantic role labeling. Comput. Linguist., 34(2):193–224. Nielsen, L. A. (2004). Verb phrase ellipsis detection using automatically parsed text. In COLING ’04: Proceedings of the 20th international conference on Computational Linguistics, page 1093, Morristown, NJ, USA. Association for Computational Linguistics. Nielsen, L. A. (2005). A corpus-based study of Verb Phrase Ellipsis Identification and Resolution. PhD thesis, King’s College. Nivre, J. (2003). An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pages 149–160. Pad´, S., Pennacchiotti, M., and Sporleder, C. (2008). Semantic role assignment for event o nominalisations by leveraging verbal data. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 665–672, Manchester, UK. Coling 2008 Organizing Committee. Palmer, M. S., Dahl, D. A., Schiffman, R. J., Hirschman, L., Linebarger, M., and Dowding, J. (1986). Recovering implicit information. In Proceedings of the 24th annual meeting on Association for Computational Linguistics, pages 10–19, Morristown, NJ, USA. Association for Computational Linguistics. Pantel, P., Bhagat, R., Coppola, B., Chklovski, T., and Hovy, E. (2007). ISP: Learning inferential selectional preferences. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 564–571, Rochester, New York. Association for Computational Linguistics. 152 Pantel, P. and Ravichandran, D. (2004). Automatically labeling semantic classes. In Susan Dumais, D. M. and Roukos, S., editors, HLT-NAACL 2004: Main Proceedings, pages 321–328, Boston, Massachusetts, USA. Association for Computational Linguistics. Pizzato, L. A. and Moll´, D. (2008). Indexing on semantic roles for question answering. a In Coling 2008: Proceedings of the 2nd workshop on Information Retrieval for Question Answering, pages 74–81, Manchester, UK. Coling 2008 Organizing Committee. Pradhan, S. S., Ward, W., and Martin, J. H. (2008). 
Towards robust semantic role labeling. Comput. Linguist., 34(2):289–310. Prasad, R., Lee, A., Dinesh, N., Miltsakaki, E., Campion, G., Joshi, A., and Webber, B. (2008). Penn discourse treebank version 2.0. Linguistic Data Consortium. Pudil, P., Novovicova, J., and Kittler, J. (1994). Floating search methods in feature selection. Pattern Recognition Letters, 15:1119–1125. Punyakanok, V., Roth, D., and tau Yih, W. (2005). The necessity of syntactic parsing for semantic role labeling. In International Joint Conference on Artificial Intelligence. Punyakanok, V., Roth, D., and Yih, W.-t. (2008). The importance of syntactic parsing and inference in semantic role labeling. Comput. Linguist., 34(2):257–287. Pustejovsky, J. (1995). The Generative Lexicon. The MIT Press. Resnik, P. (1996). Selectional constraints: An information-theoretic model and its computational realization. Cognition, 61:127–159. Ritter, A., Mausam, and Etzioni, O. (2010). A latent dirichlet allocation method for selectional preferences. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Ruppenhofer, J., Sporleder, C., Morante, R., Baker, C., and Palmer, M. (2009). Semeval2010 task 10: Linking events and their participants in discourse. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009), pages 106–111, Boulder, Colorado. Association for Computational Linguistics. Ruppenhofer, J., Sporleder, C., Morante, R., Baker, C., and Palmer, M. (2010). Semeval2010 task 10: Linking events and their participants in discourse. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 45–50, Uppsala, Sweden. Association for Computational Linguistics. Sagae, K. (2009). Analysis of discourse structure with syntactic dependencies and datadriven shift-reduce parsing. In Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09), pages 81–84, Paris, France. Association for Computational Linguistics. Sagae, K. and Lavie, A. (2005). A classifier-based parser with linear run-time complexity. In Proceedings of the International Workshop on Parsing Technologies. 153 Sanford, A. J. (1981). Understanding Written Language. John Wiley & Sons Ltd. Sasano, R., Kawahara, D., and Kurohashi, S. (2004). Automatic construction of nominal case frames and its application to indirect anaphora resolution. In Proceedings of Coling 2004, pages 1201–1207, Geneva, Switzerland. COLING. Schank, R. C. and Abelson, R. P. (1977). Scripts, Plans, Goals and Understanding: an Inquiry into Human Knowledge Structures. L. Erlbaum, Hillsdale, NJ. Surdeanu, M., Johansson, R., Meyers, A., M`rquez, L., and Nivre, J. (2008). The CoNLL a 2008 shared task on joint parsing of syntactic and semantic dependencies. In CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning, pages 159–177, Manchester, England. Coling 2008 Organizing Committee. Swier, R. S. and Stevenson, S. (2004). Unsupervised semantic role labelling. In Lin, D. and Wu, D., editors, Proceedings of Empirical Methods in Natural Language Processing, pages 95–102, Barcelona, Spain. Association for Computational Linguistics. Szpektor, I., Tanev, H., Dagan, I., and Coppola, B. (2004). Scaling web-based acquisition of entailment relations. In Proceedings of Empirical Methods in Natural Language Processing. Taboada, M. and Mann, W. C. (2006). Rhetorical structure theory: looking back and moving ahead. Discourse Studies, 8:423–459. Tonelli, S. 
and Delmonte, R. (2010). Venses++: Adapting a deep semantic processing system to the identification of null instantiations. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 296–299, Uppsala, Sweden. Association for Computational Linguistics. Toutanova, K., Haghighi, A., and Manning, C. D. (2005). Joint learning improves semantic role labeling. In ACL ’05: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 589–596, Morristown, NJ, USA. Association for Computational Linguistics. Toutanova, K., Haghighi, A., and Manning, C. D. (2008). A global joint model for semantic role labeling. Comput. Linguist., 34(2):161–191. van Dijk, T. A. (1977). Semantic macro structures and knowledge frames in discourse comprehension. In Just, M. A. and Carpenter, P. A., editors, Cognitive Processes in Comprehension, pages 3–32. Lawrence Erlbaum. van Dijk, T. A. and Kintsch, W. (1983). Strategies of Discourse Comprehension. Academic Press. Verhagen, M., Gaizauskas, R., Schilder, F., Hepple, M., Katz, G., and Pustejovsky, J. (2007). Semeval-2007 task 15: Tempeval temporal relation identification. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 75–80, Prague, Czech Republic. Association for Computational Linguistics. 154 Versley, Y., Ponzetto, S. P., Poesio, M., Eidelman, V., Jern, A., Smith, J., Yang, X., and Moschitti, A. (2008). BART: A modular toolkit for coreference resolution. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco. Weischedel, R. and Brunstein, A. (2005). Bbn pronoun coreference and entity type corpus. Linguistic Data Consortium. Whittemore, G., Macpherson, M., and Carlson, G. (1991). Event-building through rolefilling and anaphora resolution. In Proceedings of the 29th annual meeting on Association for Computational Linguistics, pages 17–24, Morristown, NJ, USA. Association for Computational Linguistics. Wilson, N. L. (1974). Facts, events, and their identity conditions. Philosophical Studies, 25:303–321. Wu, Z. and Palmer, M. (1994). Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, Las Cruces, New Mexico, USA. Association for Computational Linguistics. Xue, N. and Palmer, M. (2004). Proceedings of EMNLP. Calibrating features for semantic role labeling. In Yang, X., Su, J., and Tan, C. L. (2008). A twin-candidate model for learning-based anaphora resolution. Comput. Linguist., 34(3):327–356. Yi, S., Loper, E., and Palmer, M. (2007). Can semantic roles generalize across genres? In Proceedings of NAACL HLT. Zanzotto, F. M., Pennacchiotti, M., and Pazienza, M. T. (2006). Discovering asymmetric entailment relations between verbs using selectional preferences. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 849–856, Morristown, NJ, USA. Association for Computational Linguistics. 155