MODELING PHYSICAL CAUSALITY OF ACTION VERBS FOR GROUNDED LANGUAGE

UNDERSTANDING

By

Qiaozi Gao

A DISSERTATION

Submitted to

Michigan State University

in partial fulfillment of the requirements

for the degree of

Computer Science – Doctor of Philosophy

2019

ABSTRACT

MODELING PHYSICAL CAUSALITY OF ACTION VERBS FOR GROUNDED LANGUAGE

UNDERSTANDING

By

Qiaozi Gao

Building systems that can understand and communicate through human natural language is one
of the ultimate goals in AI. Decades of natural language processing research have mainly
focused on learning from large amounts of language corpora. However, human communication
relies on a significant amount of unverbalized information, which is often referred to as commonsense
knowledge. This type of knowledge allows us to understand each other’s intentions, to connect
language with concepts in the world, and to make inferences based on what we hear or read.
Commonsense knowledge is generally shared among cognitively capable individuals, and thus it is
rarely stated in human language. This makes it very difficult for artificial agents to acquire commonsense
knowledge from language corpora. To address this problem, this dissertation investigates the
acquisition of commonsense knowledge, especially knowledge related to basic actions upon the
physical world, and how that knowledge influences language processing and grounding.

Linguistic studies have shown that action verbs often denote some change of state (CoS) as the
result of an action. For example, the result of “slice a pizza” is that the state of the object (pizza)
changes from one big piece to several smaller pieces. However, the causality of action verbs and its
potential connection with the physical world have not been systematically explored. Artificial agents
often do not have this kind of basic commonsense causality knowledge, which makes it difficult for
these agents to work with humans and to reason, learn, and perform actions.

To address this problem, this dissertation models dimensions of physical causality associated
with common action verbs. Based on such modeling, several approaches are developed to incorporate
causality knowledge into language grounding, visual causality reasoning, and commonsense
story comprehension.

Copyright by
QIAOZI GAO
2019

ACKNOWLEDGEMENTS

First of all, I would like to express my great appreciation to my advisor, Dr. Joyce Y. Chai, for her
patient guidance and continuous support throughout my doctoral studies. Dr. Chai always has a
very insightful and knowledgeable view about the ﬁeld of study, and I always feel enlightened after
talking to her. Without her invaluable expertise and enthusiastic encouragement, I could not have
finished this dissertation. Her great passion and patience towards research have left an impact on
me, which will always guide me throughout my career.

I would also like to express my deep gratitude to Dr. Arun Ross, Dr. Pang-Ning Tan and Dr.
Daniel Morris, for being on my program committee and for providing a lot of guidance and help at
every milestone of my Ph.D. program.

I want to thank my fellow colleagues in the Language and Interaction Research (LAIR) group,
especially Shaohua Yang, Lanbo She, Malcolm Doering, Sari Saba-Sadiya, Changsong Liu, Rui
Fang, Guangyue Xu, Kenneth Stewart, James Peterkin II, Sarah Fillwock and James Finch, who
gave me a lot of support during the past several years. I enjoyed the discussion and teamwork
during our collaborations.

Finally, I want to express my warmest gratitude to my family. I thank my parents for their
unwavering support and love. My parents made me who I am today and I know I could never repay
them for all those lessons they have taught me. Lastly, I would like to thank my beautiful wife Xi
Liu for always being supportive and encouraging, and always pushing me to be the best version of
myself.


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1 INTRODUCTION
  1.1 Modeling Physical Causality of Action Verbs
  1.2 Physical Causality Modeling for Language Grounding Task
  1.3 Visual Causality Reasoning
  1.4 Commonsense Reasoning about Physical Actions
  1.5 Contributions
  1.6 Organization of this Dissertation

CHAPTER 2 RELATED WORK
  2.1 Theoretical Linguistics on Verbs
  2.2 Grounding Language to Perception
    2.2.1 Grounding Words in Perception
      2.2.1.1 Grounding to Discretized Perceptual Signal
      2.2.1.2 Learning from Ambiguous Parallel Data
      2.2.1.3 Grounding Verbs
      2.2.1.4 Context-Dependent Word Meaning
    2.2.2 Grounding Phrases and Sentences
      2.2.2.1 Referent Grounding
      2.2.2.2 Grounding Action Frames
      2.2.2.3 Parsing and Perception
      2.2.2.4 Jointly Modeling Parsing and Perception
      2.2.2.5 Neural Network approaches
  2.3 Natural Language Inference Tasks
    2.3.1 Recognizing Textual Entailment (RTE)
    2.3.2 Winograd Schema Challenge (WSC)
    2.3.3 Causal Reasoning
      2.3.3.1 Choice of Plausible Alternatives (COPA)
      2.3.3.2 Story Cloze Test
  2.4 Knowledge Resources
    2.4.1 Hand-built Knowledge Resources
    2.4.2 Automatically Extracted Knowledge
  2.5 Related Work in Computer Vision and Robotics
    2.5.1 Related Work in Computer Vision
    2.5.2 Related Work in Robotics

CHAPTER 3 MODELING PHYSICAL CAUSALITY OF VERBS
  3.1 Categorization of Physical Causality
    3.1.1 Linguistics Background on Action Verbs
    3.1.2 A Crowd-Sourcing Study
    3.1.3 Categorization of Change of State
    3.1.4 Evaluation: Verb Similarity Judgement and Thematic Fit Estimation
      3.1.4.1 Verb Similarity Judgement
      3.1.4.2 Thematic Fit Estimation
  3.2 Modeling Causality Knowledge via Embedding Methods
    3.2.1 Cause-Effect Data Collection
    3.2.2 Causality Embedding Models
    3.2.3 Evaluation: Causality Embedding in Causal QA
      3.2.3.1 Ranking Algorithm
      3.2.3.2 Dataset
      3.2.3.3 Models for Comparison
      3.2.3.4 Evaluation Results

CHAPTER 4 PHYSICAL CAUSALITY MODELING FOR LANGUAGE GROUNDING TASK
  4.1 Introduction
  4.2 Visual Detectors based on Physical Causality
  4.3 Verb Causality in Language Grounding
    4.3.1 Knowledge-driven Approach
      4.3.1.1 Acquiring Knowledge
      4.3.1.2 Applying Knowledge
    4.3.2 Learning-based Approach
    4.3.3 Experiments and Results
  4.4 Causality Prediction for New Verbs
  4.5 Conclusion

CHAPTER 5 VISUAL CAUSALITY REASONING
  5.1 Introduction
  5.2 Action-Effect Data Collection
    5.2.1 Actions (verb-noun pairs)
    5.2.2 Effects Described in Language
    5.2.3 Effects Depicted in Images
  5.3 Action-Effect Prediction
    5.3.1 Extracting Effect Phrases from Language Data
    5.3.2 Downloading Web Images
    5.3.3 Models
    5.3.4 Evaluation
      5.3.4.1 Methods for Comparison
      5.3.4.2 Evaluation Results
  5.4 Generalizing Effect Knowledge to New Verb-Noun Pairs
    5.4.1 Action-Effect Embedding Model
    5.4.2 Evaluation
  5.5 Discussion and Conclusion

CHAPTER 6 UNDERSTANDING PHYSICAL ACTIONS THROUGH NATURAL LANGUAGE STORIES
  6.1 Introduction
  6.2 Physical Commonsense Reasoning Tasks
    6.2.1 Data Collection through Crowdsourcing
    6.2.2 Underlying Commonsense Knowledge
    6.2.3 Comparison with Existing Tasks
  6.3 Methods
    6.3.1 The Attentive-Reader Model
      6.3.1.1 Leveraging Physical Causality Knowledge
      6.3.1.2 Typed Physical Causality Knowledge
    6.3.2 Models for Comparison
  6.4 Experiments
    6.4.1 Experimental Settings
    6.4.2 Results and Analysis
    6.4.3 Predicting Breakpoints in Negative Stories
  6.5 Summary

CHAPTER 7 CONCLUSIONS AND FUTURE DIRECTIONS

BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: Categorization of physical causality.

Table 3.2: Variability of causality labels over different object and scene conditions.

Table 3.3: Results of verb similarity judgement task using Distributional Memory (DM)
model, and concatenation model (DM+CoS). (Pearson’s correlation ρ, all
values are significant with p < 0.001.)

Table 3.4: Results of thematic fitness estimation using Distributional Memory (DM)
model and concatenation model (DM+Causality). (Pearson’s correlation ρ, all
values are significant with p < 0.001.)

Table 3.5: Example cause and effect text from our collected data.

Table 3.6: Example patterns that are used to extract state phrases (bold) from sample
sentences.

Table 3.7: MAP results for verb causality question answering task.

Table 4.1: Causality detectors applied to patient of a verb.

Table 4.2: Causality detectors for grounding source, destination, and agent.

Table 4.3: Grounding accuracy on patient role.

Table 4.4: Grounding accuracy on four semantic roles.

Table 4.5: Grounding accuracy on patient role using predicted causality knowledge.

Table 5.1: Example action and effect text from our collected data.

Table 5.2: Example patterns that are used to extract effect phrases (bold) from sample
sentences.

Table 5.3: Results for the action-effect prediction task (given an action, rank all the
candidate images).

Table 5.4: Results for the action-effect prediction task (given an image, rank all the actions).

Table 5.5: Results for the action-effect prediction task (given an action, rank all the
candidate images).

Table 5.6: Results for the action-effect prediction task (given an image, rank all the actions).

Table 5.7: Example predicted effect phrases for new verb-noun pairs. Unseen verbs and
nouns are shown in bold.

Table 6.1: Typed state attributes for physical causality knowledge.

Table 6.2: Prediction accuracy results on the physical commonsense reasoning tasks.

Table 6.3: Prediction accuracy results of training on one task and evaluating on the other task.

LIST OF FIGURES

Figure 1.1: An image showing apple slices. Question: “What actions could possibly
cause this situation?”

Figure 3.1: Distributions of causality labels for verbs clean and rinse.

Figure 3.2: Architecture of the verb causality embedding model.

Figure 4.1: Grounding semantic roles of the verb get in the sentence: the man gets a knife
from the drawer.

Figure 4.2: The CRF factor graph of the sentence: the man gets a knife from the drawer.

Figure 5.1: Positive images (top row) and negative images (bottom row) of the action
peel-orange.

Figure 5.2: Examples of image search results.

Figure 5.3: Architecture for the action-effect prediction model with bootstrapping.

Figure 5.4: Several example test images and their predicted actions and predicted effect
descriptions. The actions in blue are ground-truth labels.

Figure 5.5: Architecture of the action-effect embedding model.

Figure 6.1: Example story data for the cloze task and the ordering task. Candidates in
red are correct answers.

Figure 6.2: Interface used for annotating stories for the cloze task.

Figure 6.3: Interface used for annotating stories for the ordering task.

Figure 6.4: Network architecture for the Attentive-Reader. Note that this architecture
only shows the computation structure for the anomaly scores corresponding to
sentence 3 (score s31 and score s32). The anomaly scores for other sentences
are computed via similar processes.

Figure 6.5: Network architecture for the EntNet-based approach.

CHAPTER 1

INTRODUCTION

Linguistic studies have shown that action verbs often denote some change of state (CoS) as the
result of an action, where the change of state often involves an attribute of the direct object of the
verb [54]. For example, the result of “slice a pizza” is that the state of the object (pizza) changes
from one big piece to several smaller pieces. This change of state can be perceived from the
physical world. In Artificial Intelligence [126], decades of research on planning, dating back
to the early days of the STRIPS planner [33], have defined action schemas to capture the change
of state caused by a given action. Based on action schemas, planning algorithms can be applied
to find a sequence of actions that achieves a goal state [39]. The state of the physical world is thus a
very important notion, and changing the state becomes a driving force for agents’ actions. Motivated
by the linguistic literature on action verbs and the AI literature on action representations, we take
the view that modeling the change of physical state for action verbs, in other words, physical
causality, can better connect language to the physical world.
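
To make the notion of an action schema concrete, the following is a minimal sketch of a
STRIPS-style schema applied to the “slice a pizza” example. The predicate names and the
ActionSchema class are illustrative assumptions introduced here; they are not taken from the
STRIPS planner [33] or from the models developed in later chapters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionSchema:
    """A STRIPS-style action: preconditions plus add and delete effects."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset

    def applicable(self, state: frozenset) -> bool:
        # The action can fire only if every precondition holds in the state.
        return self.preconditions <= state

    def apply(self, state: frozenset) -> frozenset:
        # The change of state: remove the deleted facts, then add the new ones.
        if not self.applicable(state):
            raise ValueError(f"{self.name} is not applicable in this state")
        return (state - self.delete_effects) | self.add_effects


# Hypothetical predicates for the "slice a pizza" example.
slice_pizza = ActionSchema(
    name="slice(pizza)",
    preconditions=frozenset({"whole(pizza)", "holding(knife)"}),
    add_effects=frozenset({"in_pieces(pizza)"}),
    delete_effects=frozenset({"whole(pizza)"}),
)

before = frozenset({"whole(pizza)", "holding(knife)"})
after = slice_pizza.apply(before)
print(sorted(after))
```

In this sketch the change of state is simply the difference between the two sets of facts: the pizza
is no longer whole and is now in pieces, while unrelated facts carry over unchanged.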

Physical causality is one important aspect of human commonsense knowledge. Suppose we are
given the statement “the apple is in small pieces”, or an image such as the one shown in Figure 1.1.
What actions could possibly cause the situation described in the text or illustrated by the image? We
humans have no problem inferring potential causes: an external action such as cut or slice has most
likely happened to a whole apple. What allows us to make such an inference is the commonsense
knowledge we have, in this case the very basic cause-effect knowledge about how actions
(and thus action verbs) may affect the state of the world. Suppose we give the same statement
and the same image to an artificial agent. Will the agent be able to infer the potential causes? The
answer is most likely no.

Despite tremendous progress in knowledge representation, automated reasoning, and machine
learning, artificial agents still lack an understanding of naive causal relations regarding the physical
world. This is one of the bottlenecks in machine intelligence. If artificial agents are ever to become
capable of working with humans as partners, they will need to have this kind of physical action-effect
understanding to help them reason, learn, and perform actions.

Figure 1.1: An image showing apple slices. Question: “What actions could possibly cause this
situation?”

In this dissertation, to address the limitations mentioned above, a series of investigations on the
physical causality of action verbs is performed. First, crowd-sourcing experiments were designed
and conducted to collect physical causality knowledge from human users. Based on the collected
causality knowledge data, two different approaches were developed to model causality knowledge:
a categorization-based approach and a language embedding-based approach. These modeling
approaches transform human causality knowledge into machine-understandable representations, which
can enable commonsense reasoning. We then developed several approaches to incorporate physical
causality knowledge into language grounding, visual causality reasoning, and commonsense story
understanding, where such knowledge plays an important role.

1.1 Modeling Physical Causality of Action Verbs

Causation in the physical world has long been a central topic for philosophers who study
causal reasoning and explanation [28, 44], for mathematicians and computer scientists who apply
computational approaches to model cause-effect prediction [107], and for domain experts (e.g.,
medical doctors) who attempt to understand the underlying cause-effect relations (e.g., disease and
symptoms) for their particular inquiries. Apart from this wide range of topics, this dissertation
investigates a specific kind of causation, the very basic causal relations between a concrete action
(expressed in the form of a verb-noun pair such as “cut-cucumber”) and the change of the physical
state caused by this action. We believe that physical causality knowledge forms an essential
component of verb semantics, and is crucial to better connecting natural language with the physical
world.

Verb semantics have been studied extensively. Theoretical linguistics uses a frame of semantic
roles to capture the semantics of verbs [73]. Semantic roles include agent, patient, instrument, source,
destination, etc. Several knowledge base resources on verb semantics have been made available,
such as VerbNet [127], FrameNet [7], and PropBank [66]. However, these resources mainly focus
on organizing verbs into classes and representing verb semantics with action frames. They do not
provide a detailed and formal account of the potential causality denoted by verbs.

In the NLP community, there is an increasing amount of effort on capturing common knowledge
or commonsense knowledge. Except for a few studies [167] that acquire commonsense knowledge from
annotated images, most previous efforts apply information extraction techniques to extract
facts from a large amount of web data. For example, the DBpedia [71] and YAGO [144] knowledge
bases contain millions of facts about the world, such as facts about people and places. However, these
knowledge bases do not contain basic cause-effect knowledge related to concrete actions, such as
“dropping a glass will cause the glass to break into pieces” or “grinding coffee beans will cause the
coffee beans to become powder”. Lacking this kind of basic physical cause-effect knowledge hinders
artificial agents from connecting natural language to the physical world, and thus inhibits their
capability of reasoning, learning and performing actions.

Motivated by these observations, this dissertation investigates the acquisition and modeling of
commonsense causality knowledge associated with concrete action verbs. This basic cause-effect
knowledge is fundamental for human beings and is shared by cognitively capable individuals.
It is often presupposed in our communication and not explicitly stated. Thus,
it is difficult to extract cause-effect relations from existing textual data (e.g., the web). To overcome
this problem, several crowd-sourcing tasks were designed to collect physical causality data. In
these crowd-sourcing tasks, human subjects were asked to explicitly express their knowledge of
action verbs, through natural language descriptions or by answering designed multiple-choice
questions.

After data collection, we propose two diﬀerent approaches to model physical causality knowl-
edge. One approach is categorization-based, where the changes of state are categorized by the
physical attributes of objects, and the causality knowledge for an action verb is represented as
its association vector with those attributes. Another approach utilizes neural network embedding
models, where causality knowledge is modeled through similarities between language embedding
vectors.

For the first approach, in order to examine the potential types of causality associated with
action verbs, a pilot crowd-sourcing experiment was first conducted on a selected set of action
verbs. Motivated by linguistic studies on the typology of gradable adjectives, which also have a
notion of change along a scale [23], we developed a set of eighteen main categories to characterize
physical causality. The evaluation results on a verb similarity judgement task and a thematic fit
estimation task demonstrate that categorization-based causality modeling can be a good supplement
to distributional semantics for verb meanings.
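
As a rough illustration of the categorization-based representation, the sketch below turns
annotation counts into an association vector over state attributes and compares verbs by cosine
similarity. The attribute list, the counts, and the function names are assumptions made for this
example only; the actual categories are those listed in Table 3.1, and the concatenation with a
distributional vector is shown only in its simplest possible form.

```python
import math

# Hypothetical attribute inventory (a small stand-in for the real categories).
ATTRIBUTES = ["size", "shape", "location", "temperature", "wetness", "visibility"]


def causality_vector(label_counts):
    """Normalize per-attribute annotation counts into an association vector."""
    total = sum(label_counts.values())
    if total == 0:
        return [0.0] * len(ATTRIBUTES)
    return [label_counts.get(attr, 0) / total for attr in ATTRIBUTES]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def concat(distributional, causality):
    """Append the causality vector to a distributional vector for the same verb."""
    return list(distributional) + list(causality)


# Made-up annotation counts, for illustration only.
cut = causality_vector({"size": 8, "shape": 6})
slice_ = causality_vector({"size": 9, "shape": 5})
rinse = causality_vector({"wetness": 10, "visibility": 2})

print(cosine(cut, slice_) > cosine(cut, rinse))
```

With these toy counts, cut and slice share the same affected attributes and are therefore close,
whereas cut and rinse share none.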

For the second approach, we first collected a dataset of natural language cause-effect descriptions
for a set of the most frequently used action verbs. A neural network structure was developed to
learn a cause and eﬀect embedding space from the collected language data to capture common-
sense causality knowledge. The proposed embedding models were evaluated on causal question
answering, for example, to answer questions such as “what action could cause the state of the world
described in the text?” or “what state change could happen to the object as a result of this action?”
The experimental results have shown the potential of this embedding approach in enabling causal
reasoning of actions for artiﬁcial agents.
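
The causal question answering setting can be pictured as a ranking problem in the embedding
space. The sketch below is a minimal illustration under stated assumptions: the three-dimensional
vectors are made up, the helper names are hypothetical, and cosine similarity stands in for whatever
scoring function the learned model uses. Average precision is included because ranked outputs of
this kind are commonly summarized by mean average precision (MAP).

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_by_similarity(query_vec, candidates):
    """Return candidate names sorted by descending similarity to the query."""
    return sorted(candidates,
                  key=lambda name: cosine(query_vec, candidates[name]),
                  reverse=True)


def average_precision(ranking, relevant):
    """Average of precision values at the rank of each relevant item."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Toy embeddings (made up): one effect description and three candidate actions.
effect_vec = [0.9, 0.1, 0.0]  # "the apple is in small pieces"
action_vecs = {
    "cut apple": [0.8, 0.2, 0.1],
    "wash apple": [0.1, 0.9, 0.0],
    "peel apple": [0.4, 0.3, 0.6],
}

ranking = rank_by_similarity(effect_vec, action_vecs)
print(ranking, average_precision(ranking, {"cut apple"}))
```

The same ranking function can be used in the opposite direction by treating an action embedding
as the query and candidate effect embeddings as the items to be ranked.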

Further, we applied the collected physical causality knowledge, together with the different modeling
approaches, to several novel tasks, demonstrating that physical causality modeling has good
potential for building intelligent systems that can deeply understand human language and better
connect language with the physical world.

1.2 Physical Causality Modeling for Language Grounding Task

Physical causality knowledge captures potential changes of physical states caused by action
verbs. The change of state can be perceived from the physical world. Therefore, modeling physical
causality knowledge can help the machine to better ground natural language components to concepts
of the world, in other words, connecting words, phrases or sentences to objects, states and actions
in the physical world.

We conducted a study to incorporate categorization-based physical causality modeling in a language
grounding task [36]. In this task, a system is given parallel language and visual data as input,
and the goal is to ground language components to objects from the visual data. Our hypothesis is
that modeling physical causality can provide guidance for visual processing: once parallel language
and visual data about an action are given, the potential causality of the verb or the verb-noun
pair can trigger visual detectors that focus on the potential state changes caused by this
action. Applying these visual detectors to the visual data can potentially improve the performance
of grounded language understanding.

Based on the categorization of physical causality attributes, we designed a set of change-of-state
detectors to detect the corresponding changes from visual perception of the physical environment.
We further applied two approaches, a knowledge-driven approach and a learning-based approach,
to incorporate causality modeling in grounded language understanding. The knowledge-driven
approach incorporates the collected human physical causality knowledge with the change-of-state
detectors to find the best groundings for semantic roles. The learning-based approach utilizes
a Conditional Random Field (CRF) to model the relations between physical objects and language
components. Instead of using the collected human physical causality knowledge, it learns the
association between causality attributes and verbs from training data. The empirical results
demonstrate that both of these approaches achieve significantly better performance in grounding
language to perception compared to previous approaches [162].
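
The knowledge-driven idea can be sketched as follows: the verb selects which change-of-state
detectors to run, and the tracked object whose observed changes best match the expected ones is
chosen as the grounding of the patient role. Everything in this sketch is a simplifying assumption
made for illustration, including the two detectors, their thresholds, the object-state fields, and
the verb-to-attribute table; the detectors actually used are described in Chapter 4.

```python
def size_changed(before, after):
    # Assumed rule: a relative size change above 20% counts as a change.
    return abs(after["size"] - before["size"]) > 0.2 * before["size"]


def location_changed(before, after):
    # Assumed rule: a displacement of more than 10 pixels counts as a change.
    (x0, y0), (x1, y1) = before["position"], after["position"]
    return ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 > 10.0


# Hypothetical detectors keyed by causality attribute.
DETECTORS = {"size": size_changed, "location": location_changed}

# Hypothetical causality knowledge: attributes of the patient a verb tends to change.
VERB_CAUSALITY = {"cut": ["size"], "move": ["location"]}


def ground_patient(verb, tracks):
    """Pick the tracked object whose detected changes best match the verb."""
    expected = VERB_CAUSALITY[verb]

    def score(name):
        before, after = tracks[name]
        # Count how many of the expected changes were actually detected.
        return sum(DETECTORS[attr](before, after) for attr in expected)

    return max(tracks, key=score)


# Toy object tracks: (state at start of clip, state at end of clip).
tracks = {
    "cucumber": ({"size": 100.0, "position": (50, 50)},
                 {"size": 30.0, "position": (52, 50)}),
    "bowl": ({"size": 80.0, "position": (120, 40)},
             {"size": 80.0, "position": (120, 40)}),
}

print(ground_patient("cut", tracks))
```

A learning-based variant would replace the fixed VERB_CAUSALITY table with associations
estimated from training data.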

1.3 Visual Causality Reasoning

We humans share a vast amount of commonsense causality knowledge, and we use it in our
daily lives without even noticing. For example, given a verb (e.g., grind) and a noun (e.g., coffee
beans), we can predict the effect on the state of the world caused by this action. Given a photo, for
example, showing many small cucumber pieces, we can infer that some external action (e.g., cut)
on a cucumber could cause such a state. We can make such action-effect predictions because we
developed an understanding of this kind of basic action-effect relation at a very young age [6].
What about machines? Will artificial agents be able to make the same kind of predictions? The
answer is not yet.

To address this problem, we introduce a new task on naive physical action-effect prediction [37].
This task includes both cause prediction (given an image which describes a state of the world,
identify the most likely action, in the form of a verb-noun pair from a set of candidates, that
can result in that state) and effect prediction (given an action in the form of a verb-noun pair,
identify the images, from a set of candidates, that depict the most likely effects on the state of the
world caused by that action). Note that there could be different ways of formulating this problem;
for example, both causes and effects could be in the form of language or in the form of images/videos.
Here we intentionally frame the action as a language expression (i.e., a verb-noun pair) and the
effect as depicted in an image in order to make a connection between language and perception.
This connection is important for physical agents that not only can perceive and act, but also can
communicate with humans in language and act on the environment through planning. To our
knowledge, there is no prior work of this nature that attempts to connect actions (in language) and
effects (in images).

As a first step, we collected a dataset of 140 verb-noun pairs. Each verb-noun pair is annotated
with possible effects described in language and depicted in images (where language descriptions
and image descriptions are collected separately). We have developed an approach that applies
distant supervision to harness web data for bootstrapping action-effect prediction models. Our
empirical results show that, using a simple bootstrapping strategy, our approach can combine
the noisy web data with a small number of seed examples to improve action-effect prediction.
In addition, for a new verb-noun pair, our approach can infer its effect descriptions and predict
action-effect relations based on only three image examples. This opens up the possibility for humans
to teach robots new tasks through language communication and a small number of examples.
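
One generic way to combine a few clean seed examples with noisy web data is sketched below.
It is an assumption-laden illustration rather than the procedure of Chapter 5: images are reduced
to made-up two-dimensional feature vectors, the model is a nearest-centroid classifier, and a web
example is kept only when the current model agrees with its noisy label.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit(examples):
    """Nearest-centroid model: one mean feature vector per action label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(vs) for label, vs in by_label.items()}


def nearest_label(x, model):
    """Predict the label whose centroid is closest in squared distance."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda label: dist2(model[label]))


def bootstrap(seed, web, rounds=2):
    """Retrain on seeds plus the web examples the current model agrees with."""
    train = list(seed)
    for _ in range(rounds):
        model = fit(train)
        kept = [(x, y) for x, y in web if nearest_label(x, model) == y]
        train = list(seed) + kept
    return fit(train), train


# Made-up features: clean seed examples and noisy web search results.
seed = [([1.0, 0.0], "cut apple"), ([0.0, 1.0], "peel apple")]
web = [
    ([0.9, 0.2], "cut apple"),
    ([0.1, 0.8], "peel apple"),
    ([0.1, 0.9], "cut apple"),  # noisy: looks like "peel apple"
]

model, train = bootstrap(seed, web)
print(len(train), nearest_label([0.8, 0.1], model))
```

In this toy run the mislabeled web example is filtered out in every round, so the final model is
trained on the two seeds plus the two web examples that are consistent with them.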

1.4 Commonsense Reasoning about Physical Actions

While it is trivial for humans to use natural language to communicate about actions and changes
in the physical world, machines still struggle to develop similar skills. To investigate a deeper
understanding of human natural language, we created a new language benchmark, which can be used
to evaluate machines’ capability of understanding and reasoning about human physical actions.
This benchmark contains short stories created by human annotators. Each story describes a short
sequence of human physical actions in our daily lives. For example, a story could describe the
action sequence of making a sandwich in the kitchen, or the actions of repairing a bike in the garage.
Based on the collected stories, we present two tasks for evaluating machine reading systems: the
cloze task (selecting the correct sentence to ﬁll in the blank in a story) and the ordering task
(selecting the correct order of sentences in a story).

Although the proposed tasks are easy for humans to solve, they are very challenging for
machines. An analysis shows that understanding the stories and solving these tasks requires various
types of commonsense knowledge, e.g., knowledge about action verbs, objects, and naive physics
rules. Therefore, we believe this benchmark will be a valuable resource for evaluating machines’
capability of acquiring and applying physical commonsense knowledge. Further, the setting of two
subtasks can be naturally used to evaluate models’ generalization ability, via training on one task
and evaluating on the other task. If a model can successfully learn the fundamental knowledge and
the reasoning abilities via training on the data of one subtask, it can potentially perform well on the
other subtask. By doing this, we encourage models that focus on learning underlying knowledge
instead of overﬁtting to shallow language patterns.

A neural network model was proposed for tackling the commonsense reasoning tasks. This
model solves both the cloze task and the ordering task via explicitly examining the compatibility of
each action with its context in those stories. Since the action-eﬀect knowledge plays an essential role
in understanding these commonsense stories, we further incorporated physical causality knowledge
into the proposed model. Experiments were designed to compare the proposed model with several
state-of-the-art models for machine comprehension tasks. The results demonstrate the eﬀectiveness
of the proposed model, and further show the improvement introduced by external physical causality
knowledge. The results also suggest that this benchmark is challenging for current approaches,
and that better solving this task requires a wider range of commonsense knowledge and richer semantic
representations of actions and objects.

1.5 Contributions

In this dissertation, we investigate verb semantics from a new angle: how action verbs
may change the state of the physical world. The contributions of this dissertation are listed
below:

1. A categorization of physical causality was developed, motivated by existing theoretical
linguistic studies. This categorization provides a stepping stone for systematically exploring
the physical causality knowledge.

2. Two human-annotated physical causality knowledge datasets were created. One dataset was
annotated with causality attributes deﬁned in this dissertation. Another dataset was annotated
with open-ended natural language.

3. Two novel approaches were presented to solve the semantic role grounding task via causality
modeling. The empirical results have shown the potential of causality modeling on connecting
language with the physical world.

4. A physical causality embedding structure was proposed. The embedded cause-eﬀect knowl-
edge will allow the agent to better infer underlying causes or predict potential eﬀects given a
situation. It can be applied to answer causal questions, which are an important type of ques-
tions for artiﬁcial agents, yet not well explored in either traditional QA or Visual Question
Answering (VQA).

5. The bootstrapping approach for visual causal reasoning provides a cost-eﬃcient way to
connect causality embedding with a large number of images from the web. This approach is
general and can be extended to other applications involving visual causal reasoning.

6. A benchmark dataset for the physical commonsense reasoning task was created. This dataset
evaluates a system’s capability of understanding and reasoning about state changes in the
physical world.

7. A novel approach that leverages external knowledge for the physical commonsense reasoning
task was proposed. Empirical results have shown the potential of physical causality
knowledge for helping machines better comprehend and reason about commonsense
stories.

1.6 Organization of this Dissertation

The rest of this dissertation is organized as follows. In Chapter 2, we review works from diﬀerent
research ﬁelds that are closely related to our study of physical causality knowledge. Chapter 3
presents our modeling of physical causality knowledge, a categorization-based approach and a
language embedding-based approach. Both approaches are also evaluated on several preliminary
tasks. In Chapter 4, we utilize the categorization of causality attributes and causality knowledge
to improve the language grounding task. In Chapter 5, we introduce the visual causality reasoning
task, as well as a bootstrapping approach that harnesses a large number of web images to tackle this
task. In Chapter 6, we introduce the benchmark dataset for understanding human physical actions
through natural language stories, as well as several neural models trying to solve the proposed
tasks. Finally, in Chapter 7, we summarize this dissertation and discuss several promising future
directions.

CHAPTER 2

RELATED WORK

The research work in this dissertation is motivated by studies from multiple research fields, including
natural language processing, theoretical linguistics, psycholinguistics, computer vision, and robotics.

2.1 Theoretical Linguistics on Verbs

Verb semantics have been studied extensively in linguistics [110, 73, 7, 66]. Previous work [54]
has divided action verbs into manner verbs and result verbs. Result verbs usually specify a result
effect of an action, which often indicates an object’s change of state [74]. Hovav and Levin [54]
also propose that result verbs often specify movement along a scale. A scale usually denotes
an attribute of an object, such as size, temperature, or cost. For example, “Mary shortened the skirt”
indicates that the length of the object skirt has decreased. The analysis of gradable predicates in
terms of scale structure motivates us to model verb causality using object attribute categories. A
detailed description of scale structure can be found in Kennedy and McNally’s work [60].

Several large-scale verb lexicon databases have been built, for example, VerbNet [127], FrameNet [7],
and PropBank [66]. These resources have enabled significant strides in computational semantic
processing such as semantic role labeling [105, 109, 20, 175] and its applications in information
extraction [29] and question answering [135]. While instrumental for text processing, the current
modeling of verb semantics only plays a limited role in moving language processing towards the
physical world. Despite an increasing research eﬀort on grounding language to the environment,
connections that link verbs to perception and action in the physical world are still missing. There-
fore, the study of verb causality knowledge in this dissertation could be a valuable supplement to
these existing knowledge bases.

2.2 Grounding Language to Perception

We humans use natural language to communicate about things in the physical world: objects,
actions, events, and their properties and relations. However, over decades of language processing
research, linguistic meaning has been explained mainly by symbolic models. The circular definitions
in symbolic explanations of linguistic meaning restrict language meaning to symbols alone. For
example, if a person had to learn his first language only from a dictionary, he would be passing
endlessly from one meaningless symbol to another. Grounding symbol meaning in something other
than just more meaningless symbols is the task of the symbol grounding problem [50]. Researchers
have been trying to give machines the same abilities as humans, to “bridge the symbolic realm of
language with the physical realm of real-world referents” [124].

Recent years have seen an increasing amount of work on grounding language to perception [171,
150, 79, 98, 78]. The common goal of language grounding research is to enable machines to
automatically acquire beliefs about the physical world and to exchange their beliefs with humans
through natural language communication.

Solving the symbol grounding problem and connecting language and the physical world is
fundamental to many tasks, from identifying context-dependent shifts of word meanings, to enabling
situated natural language communication between humans and robots. Here we give a brief review
of existing work on language grounding.

2.2.1 Grounding Words in Perception

One of the most fundamental tasks in grounded language learning is to associate words with
perceptual input.

2.2.1.1 Grounding to Discretized Perceptual Signal

Words are discrete symbols and perceptions are usually represented by continuous sensory data.
Therefore a common way of connecting them is to discretize the sensory feature space into categories
that are associated with linguistic words. Examples include models for grounding color names [38,
116, 56, 88] and grounding spatial terms [115, 140, 47]. In computer vision, this task of associating
linguistic labels with perceptual categories is usually called recognition, e.g., object recognition,
action recognition.

Grounding color terms is an actively studied topic in linguistics and cognitive science, since
color is an important object property in the human visual system. Studies of color
name grounding could also inspire new models for learning vague meanings in other continuous domains
such as quantity, space, and time.

In computational systems, color is usually represented by values in a color space, e.g., red-
green-blue (RGB) or hue-saturation-value (HSV) color space. A cross-linguistic study of color
naming [116] shows that the color prototypes for English are close to the clusters in other
languages, e.g., white, black, red, green, yellow, and blue. This result suggests that the human
perceptual system tends to have a strong bias on the meaning of basic color terms. The task of
grounding color terms is usually done by associating each color term with a prototypical point [2]
or a convex region in an underlying color space [38, 56]. Since the association between words
and perception is not deﬁnitive, McMahan and Stone [88] propose a Bayesian model of color
naming that takes into account the uncertainty in categorization boundaries and distributions over
vocabulary.
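
As an illustration of the prototype-based view described above, the following minimal sketch grounds an RGB value to the color term with the nearest prototype. The prototype coordinates, the choice of RGB space, and the name `ground_color` are our assumptions for illustration and are not taken from the cited work.

```python
import math

# Hypothetical RGB prototypes for six basic color terms (illustrative values only).
PROTOTYPES = {
    "white": (255, 255, 255),
    "black": (0, 0, 0),
    "red": (255, 0, 0),
    "green": (0, 128, 0),
    "yellow": (255, 255, 0),
    "blue": (0, 0, 255),
}


def ground_color(rgb):
    """Return the color term whose prototype is closest to `rgb` (Euclidean distance)."""
    return min(PROTOTYPES, key=lambda term: math.dist(PROTOTYPES[term], rgb))


if __name__ == "__main__":
    for sample in [(250, 20, 30), (240, 230, 40)]:
        print(sample, "->", ground_color(sample))
```

Under Euclidean distance, nearest-prototype assignment partitions the space into convex cells, so this sketch is also a simple instance of the convex-region view. It ignores the uncertainty that the Bayesian model of [88] is designed to capture.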

Grounding spatial terms is another actively studied topic. Regier and Carlson [115] propose
the attention vector-sum (AVS) model to predict the acceptability judgement of linguistic spatial
terms given two objects in a two-dimensional space. They use a vector-sum representation to model
human attention. Skubic et al. [140] use a histogram of forces to model spatial
relations between 2D objects. In Guadarrama et al. [47] and Golland et al. [43], spatial relations
between 3D objects are learned through logistic regression.

Object recognition tasks in computer vision can also be seen as grounding tasks. In object
recognition tasks, we assign name tags to objects in the image. Generally, this is also a process of
grounding language (object labels) to perceptual signals (object images). Thanks to large-scale
image recognition datasets (e.g., ImageNet [22]), computer algorithms are closing the performance
gap with humans on object recognition tasks, or even overtaking human performance [51].
However, this does not mean computers have the same level of ability as humans in grounding
language to perception, because there are clear limitations in those studies: algorithms are
only trained to recognize a fixed set of discrete categories (usually up to thousands of classes).
Their training data are provided either with explicit labels for the image classification task, or with
localization annotations (e.g., bounding boxes) for the object localization task. Fully annotated image
datasets usually cost a lot to create and they have very limited categories. The number of categories
provided by vision systems is still far from human-level vision.

That said, the task of object recognition still provides us valuable knowledge and tools for more
complex language grounding tasks, like the representation of images (e.g., bag-of-visual-words,
pre-trained CNN features).

Attribute-based recognition has recently gained in popularity in the computer vision commu-
nity. Instead of assigning name tags for objects in an image, attribute-based recognition describes
object properties using learned attributes [30]. Since attributes can be shared by diﬀerent objects,
the learned attributes can be used to recognize novel objects with few or zero training examples
[1, 57]. The attribute words can be seen as an intermediate representation that bridges the
visual space and the label space, therefore they provide useful information about relations between
class labels. However, attribute learning also requires more human annotation effort,
since attribute annotations must be collected.
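
To make the role of attributes as an intermediate representation concrete, the sketch below recognizes a class with no training images by matching predicted attribute scores against per-class attribute signatures. The attribute names, class signatures, and the squared-distance matching rule are illustrative assumptions, not the method of [30], [1], or [57].

```python
# Hypothetical binary attribute signatures: (striped, four_legged, has_wings).
CLASS_ATTRIBUTES = {
    "zebra": (1, 1, 0),
    "horse": (0, 1, 0),
    "eagle": (0, 0, 1),
}


def predict_class(attribute_scores):
    """Pick the class whose signature is closest to the predicted attribute scores."""

    def distance(signature):
        # Squared Euclidean distance between a class signature and the scores.
        return sum((s - a) ** 2 for s, a in zip(signature, attribute_scores))

    return min(CLASS_ATTRIBUTES, key=lambda name: distance(CLASS_ATTRIBUTES[name]))


if __name__ == "__main__":
    # Scores as they might come from per-attribute classifiers run on one image.
    print(predict_class((0.9, 0.8, 0.1)))
```

Because the attribute classifiers are shared across classes, adding a novel class in this sketch only requires writing down its attribute signature.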

2.2.1.2 Learning from Ambiguous Parallel Data

The works above need to learn from fully annotated language and perception data, which are expensive
to collect and usually cover a very limited vocabulary. Now we look at the more general task of learning
from parallel language and perception data that have some sort of ambiguity. For example, when
learning from parallel image and sentence data, we need to deal with the association relation
between words and image locations; in the task of situated language perception, apart from the
ambiguity in aligning language and perceptual input, we also need to deal with speech recognition
errors.

Learning the joint distribution of words and image features. There are large numbers of
data sets that consist of parallel image and language data. These data usually do not contain
alignment between words and image regions. In order to learn word grounding from this kind of
image dataset, Barnard et al. [9] present an approach that links image segments and word semantics
through clustering. The clusters capture the joint distribution of words and image segments.
Barnard et al. [8] study several models for learning the joint distribution of image regions and
words, including a multi-modal extension of Latent Dirichlet Allocation. Yu and Ballard [169] use
a generative graphical model to model the correspondence between words and objects. It first generates a
latent variable, then visual objects are generated based on the latent variable, and ﬁnally, words are
generated conditioned on visual objects.

Models of infant word learning. In the NLP and cognitive science communities, much
research focuses on understanding the mechanism by which humans acquire language. For
example, the CELL system [125] acquires word meaning from speech and image input, mimicking
the language learning process of infants. In this system, word meanings are represented by prototype
feature vectors along with radii around them. A prototype can be seen as an ideal point for that
category in the feature space. If an input perceptual feature vector is within the radius of a prototype,
it will be treated as a member of this perceptual category.
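
The prototype-and-radius representation described above can be sketched as a simple membership test. The two-dimensional feature space, the example words, and the numeric values below are hypothetical and serve only to illustrate the idea; they are not taken from the CELL system [125].

```python
import math

# Hypothetical word categories: each has a prototype feature vector and a radius.
CATEGORIES = {
    "ball": {"prototype": (0.9, 0.1), "radius": 0.3},
    "cup": {"prototype": (0.2, 0.8), "radius": 0.25},
}


def matching_words(features):
    """Return every word whose prototype lies within its radius of `features`."""
    return [
        word
        for word, category in CATEGORIES.items()
        if math.dist(category["prototype"], features) <= category["radius"]
    ]


if __name__ == "__main__":
    print(matching_words((0.8, 0.2)))  # close to the "ball" prototype
    print(matching_words((0.5, 0.5)))  # outside both radii
```

Unlike nearest-prototype classification, this test can return no word at all when the input falls outside every radius.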

Weakly supervised object recognition/localization. As mentioned earlier, the number of
categories provided by vision systems is still far from human-level visual perception. Apart from
the large number of object categories, real-world objects also have different states. For example, a
potato can be in the state of “peeled”, “in pieces”, “cooked”, etc. Recognizing object state is a
more diﬃcult task. Due to the high cost of human annotation, it is very diﬃcult to establish an
open-domain image dataset that covers a large set of objects with state annotations. Therefore
people seek ways of utilizing parallel language and vision data in weakly supervised settings. With
the recent success of the Deformable Parts Model (DPM) detector [106], weakly-supervised object
localization techniques [102] have risen back to popularity. These works use weak supervision
from web search images to train concept detectors (localization). Using automatically retrieved web search
images helps to remove the obstacle of the high cost of collecting fully annotated image datasets. In
Chapter 5, we borrow a similar idea, using web search images with distant supervision to
facilitate the learning of action-effect prediction.

2.2.1.3 Grounding Verbs

Based on the representation of verb meanings, verb grounding research can be categorized into
three types: 1) representation using world state, 2) representation using action control
structures, and 3) representation using motion profiles.

Representation using world state. World state includes object properties (e.g., color, shape)
and object relations (e.g., spatial relation). In Siskind’s work [137, 138, 139], state changes in
force-dynamic relations between participant objects are visually recognized, and their temporal
schemas are used to infer actions (verbs). Yang et al. [164] use a visual semantic graph to represent
the consequence of manipulation actions. A number of works [133, 131, 94] explicitly model verbs
with predicates describing the resulting states of actions.

Representation using action control structures. In robotics, for a robot to carry
out an action, the verb meaning needs to be grounded to the control of the action. Bailey et al. [5] use
x-schemas to represent action verbs, which capture verb semantics using action control structures.
Misra et al. [93] propose a data-driven approach to ground natural language commands to sequences
of robot basic actions.

Representation using motion proﬁles. Motion proﬁles are widely used for recognizing
actions through computer vision [129, 151]. Approaches in this category usually perform well in
recognizing actions from human gestures and motions. However, they usually rely heavily on the
training data (e.g., actors, lighting conditions, camera angle), and can hardly be generalized to
grounding action verbs to robot actions.

Although much work has been done on grounding language to perception, no previous work
has investigated the link between physical causality denoted by action verbs and the change of state
visually perceived. Chapter 4 intends to address this limitation and examine whether the causality
denoted by action verbs can provide top-down information to guide visual processing and improve
grounded language understanding. In this dissertation, we are particularly interested in using world
state to represent verb meaning. World state changes capture the causality information of action
verbs, thus they can be beneﬁcial for action reasoning and planning. Also as shown by experimental
results [139, 164], using state change can be a more robust way to model verb meanings than using
motion proﬁles.

2.2.1.4 Context-Dependent Word Meaning

Note that the grounding of words can be influenced by language context; e.g.,
the RGB values of “red wine” and “red hair” are likely to be different. The work in [38] models the shift of
word meaning given contexts. In fact, word meanings are determined by both linguistic convention
and visual perception. Experiments in [18] show that human understanding of language depends
on the listener’s evaluation of how to achieve the goal based on the current situation. McMahan and
Stone [88] claim that it is not accurate to use deﬁnitive mappings between words and the world.
Their work models speaker judgment and speaker choice in the grounding of color words.

To summarize, a typical system that learns grounded word meanings usually takes two steps:
1) “parsing” the perception into ontological types and relations that could be explained by human
language semantics; 2) learning the association between words and those perception categories.
Sometimes these two steps are done jointly.

2.2.2 Grounding Phrases and Sentences

2.2.2.1 Referent Grounding

Referent grounding is the task of resolving referring expressions to a referent, the entity in the
physical world to which they are intended to refer. In [122], a perspective-taking mechanism is used
to find the referent based on a set of language descriptors. When the given information is not
sufficient to make a prediction, the agent can automatically raise a question for more distinguishing
information.

In [85] and [62], the objects are distinct and represented via symbolically speciﬁed properties,
such as color and shape. They use pre-defined property classifiers, like ‘green’ and ‘triangle’, to
identify object properties. In [63], word meaning is represented by a function from object perception
features to a score of how well the word and that object fit each other. Phrases are represented by
compositions of individual words. For example, a simple noun phrase is composed by averaging
word meaning functions, while a relational phrase (e.g., the book to the left of the mug) is composed by
multiplying different word meaning functions.
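
The two composition rules described above can be sketched as follows. This is a simplified illustration under our own assumptions: the hand-written scoring functions, the feature names, and the helpers `noun_phrase` and `relational_phrase` are hypothetical, whereas in [63] the word meaning functions are learned from perception features.

```python
# Hypothetical word meaning functions: object features -> fit score in [0, 1].
def red(obj):
    return obj["redness"]


def book(obj):
    return obj["bookness"]


def mug(obj):
    return obj["mugness"]


def left_of(obj, landmark):
    # Crude spatial relation: 1.0 if obj is left of the landmark, else 0.0.
    return 1.0 if obj["x"] < landmark["x"] else 0.0


def noun_phrase(*words):
    """Compose a simple noun phrase by averaging its word scores."""
    return lambda obj: sum(word(obj) for word in words) / len(words)


def relational_phrase(target, relation, landmark):
    """Compose a relational phrase by multiplying the component scores."""
    return lambda obj, lm: target(obj) * relation(obj, lm) * landmark(lm)


# Two illustrative objects in a scene.
obj_a = {"redness": 0.9, "bookness": 0.7, "mugness": 0.0, "x": 1.0}
obj_b = {"redness": 0.1, "bookness": 0.0, "mugness": 0.8, "x": 3.0}

the_red_book = noun_phrase(red, book)
book_left_of_mug = relational_phrase(book, left_of, mug)

if __name__ == "__main__":
    print(the_red_book(obj_a), the_red_book(obj_b))
    print(book_left_of_mug(obj_a, obj_b), book_left_of_mug(obj_b, obj_a))
```

Multiplication makes the relational phrase score drop to zero as soon as any component (target, relation, or landmark) fails to fit, whereas averaging lets one word partially compensate for another.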

2.2.2.2 Grounding Action Frames

Several works from the computer vision community focus on the extraction of action frames from
images [48, 168]. Yatskar et al. [168] create an image dataset, imSitu, where images are annotated
with activities and semantic roles from FrameNet. The goal is to detect the activity and localize
the objects of interactions from image input.

Yang et al. [162] extend traditional semantic role labeling (SRL) to grounded SRL where
arguments of verbs are grounded to participants of actions in the physical world. Using a graphical
probabilistic model to jointly learn the correspondence between language and vision, their approach
grounds both explicit semantic roles and implicit semantic roles. In Chapter 4 of this dissertation,
we model the physical causality of action verbs from crowd-sourced data, and demonstrate that
physical causality modeling helps with the grounded semantic role labeling task.

2.2.2.3 Parsing and Perception

Many researchers treat grounded language acquisition as two subproblems: parsing and percep-
tion [70]. The ﬁrst step is semantic parsing, which tries to map language to formal meaning
representation. One of the most commonly used meaning representations is first-order logic. The
second step is mapping the meaning representation to perception categories. Research in this
direction often focuses on the semantic parsing step, assuming that we have easy access to a logical
representation of the world.

Zelle and Mooney [174] propose a natural language interface for database queries. Based
on the CHILL parser acquisition system [172, 173], this interface transforms natural language
queries into a logic form which can be used to query the database. Here the logic form bridges the
natural language and database slots. The parser acquisition is regarded as a problem of learning
search-control rules in a logic program through inductive logic programming. Apart from the
logic form representation, there are other forms of semantic meaning representations. Lu et al.
[83] propose using hybrid trees to represent semantic meanings, where each tree node
includes both natural language words and the corresponding meaning elements.

Borschinger et al. [14] propose that the grounded language learning problem can be solved by
addressing the unsupervised Probabilistic Context Free Grammar (PCFG) induction problem. In
their approach, the semantic information is encoded as part of the text string. However, this approach
includes every possible meaning representation constituent as a nonterminal in the PCFG. It will
have diﬃculties when dealing with complex sentences with a large number of potential meanings,
since the number of possible subgraphs grows exponentially. Later, Kim and Mooney [64] propose
to address the combinatorial explosion problem by introducing the Lexeme Hierarchy Graph (LHG),
where a hierarchy of semantic lexemes is built for each ambiguous landmarks plan. In another
work, Lin et al. [76] study the task of retrieving videos using complex natural language queries.
In their proposed approach, a sentence is first parsed into a semantic graph, and objects and their
motions are detected from the video, then language is matched to visual concepts using a generalized
bipartite matching algorithm.

Combinatory Categorial Grammar (CCG) is a popular tool for semantic parsing. It can model
both the syntax and the semantics (expressions in λ-calculus) of a sentence. Matuszek et al. [86]
propose a framework that uses CCG parsing to ground natural language sentences to perceptions.
A sentence is first parsed into logical forms using probabilistic CCG. Then an explicit model is
used to align logical constants and perception attribute classifiers. However, this work only studies a
very limited number of objects and several attributes, like color and shape.

2.2.2.4 Jointly Modeling Parsing and Perception

There are also some grounding works that do not need explicit semantic parsing as a first
step. Yu and Siskind [171] propose an unsupervised approach to ground language descriptions
to video clips of humans interacting with multiple objects. They use an HMM-based model to
jointly learn the object tracking and word meaning grounding. In their approach, each language
element (verb, noun, adjective, adverb, preposition, etc.) is modeled by an HMM, and the visual
perception is represented by object detection results. Later, this model was extended to handle
different application tasks [170]: 1) language generation from vision, 2) video/image retrieval
using language queries, and 3) language-guided activity recognition.

In [147, 148], a probabilistic graphical model was created to map natural language commands
to physical world groundings, like objects, paths and locations. Each command sentence is ﬁrst
decomposed into Spatial Description Clauses (SDCs) [68] with types of event, object, path,
and place. The system then infers groundings in the world corresponding to each SDC using a
Conditional Random Field (CRF) model.

Artzi and Zettlemoyer [4] present a joint model of meaning and context for interpreting and
executing instructional sentences, using a grounded CCG semantic parsing approach. The joint
modeling improves grounding performance by providing situated environment cues, like the set
of visible objects. Matuszek et al. [87] build a joint model of linguistic meaning and action
execution in a grounded CCG semantic parsing framework. In this work, a parser is learned to map
language instructions to robot control language (RCL), which can later be executed by the robot in
a simulation environment.

The works above, which jointly address parsing and perception, have some drawbacks, including: 1)
the learning phase of these models requires large amounts of manual annotation, and 2) the semantic
representations are limited by pre-defined predicates [68, 147, 87]. Krishnamurthy and Kollar [70]
partially address these limitations by introducing a Logical Semantics with Perception (LSP) model,
which jointly models the perception information and language meanings. Given
an environment with multiple objects, their task is to map natural language statements to the referents in the
environment. The authors introduce a weakly supervised method to train the LSP.

2.2.2.5 Neural Network Approaches

Recent research deploys deep-learning frameworks to model both images and word sequences.
End-to-end deep neural network models have shown good performance in the tasks of visual description
generation [26] and visual question answering [3]. One drawback of end-to-end deep learning
frameworks is the lack of transparency. If the system fails on one example, it is hard for humans
to understand the reason. Therefore, researchers have been trying to develop neural models with
explicit intermediate representations, e.g., the use of attention models.

Karpathy et al. [59] present a learning approach that grounds dependency-tree relations to
regions in the image using a ranking technique. Rohrbach et al. [120] use an attention model to ground
phrases to image regions; their model works both with and without grounding supervision. Recently,
researchers have started to apply caption generation frameworks to the grounding task. Hu et al. [55] propose the
Spatial Context Recurrent ConvNet (SCRC) to transfer visual-linguistic knowledge from image
captioning tasks to facilitate the grounding of language queries to bounding boxes in the image.

A common approach in processing parallel image and language data is to learn an embedding
model that maps text and images into a shared latent space. In this shared space, vector representa-
tions for text and images can be compared directly. Therefore it is very convenient to retrieve related
images given text, or retrieve related text given an image. For example, Wang et al. [152] propose
an approach called Deep Structure-Preserving Embedding, which formulates the image-sentence
retrieval as a ranking problem.
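
The retrieval step in a shared latent space can be sketched as a nearest-neighbor ranking. The sketch below assumes that text and images have already been mapped into the same space by some trained embedding model; the three-dimensional vectors, the image identifiers, and the use of cosine similarity are illustrative assumptions and do not reproduce the model of [152].

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical image embeddings, assumed to live in the shared text-image space.
IMAGE_EMBEDDINGS = {
    "img_sliced_pizza": (0.9, 0.1, 0.2),
    "img_bike_repair": (0.1, 0.8, 0.3),
}


def retrieve_images(text_embedding):
    """Rank image ids by cosine similarity to the text embedding, best first."""
    return sorted(
        IMAGE_EMBEDDINGS,
        key=lambda image_id: cosine(IMAGE_EMBEDDINGS[image_id], text_embedding),
        reverse=True,
    )


if __name__ == "__main__":
    # A hypothetical embedding of a sentence such as "slice a pizza".
    print(retrieve_images((0.8, 0.2, 0.1)))
```

Retrieving text given an image works the same way with the roles of the two modalities swapped.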

2.3 Natural Language Inference Tasks

Recent years have seen a trend in which new language processing benchmarks shift from
targeting only linguistic context to requiring world knowledge and reasoning to solve.
For instance, the Winograd Schema Challenge [72] requires commonsense knowledge to resolve
pronouns. Some of the machine comprehension benchmarks [153, 113, 95] require comprehending
and reasoning based on supporting documents, where world knowledge can be critical in deeply
understanding the documents.

Natural Language Inference (NLI) is one of the most widely studied tasks that focus on
understanding and reasoning about natural language. NLI is the task of reasoning about the truth
of a linguistic statement given a premise. Researchers have created many challenging tasks
that require language understanding and inference. Solving these challenges usually requires a large
amount of knowledge. Next we will review some popular language inference challenges and
frameworks.

2.3.1 Recognizing Textual Entailment (RTE)

The Recognizing Textual Entailment (RTE) task aims to evaluate machines’ capability of deter-
mining whether the meaning of one text is entailed by another text [21]. Given two sentences A
and B, we say A entails B if a human reading A would infer that B is true. For example:

• Text: The purchase of Houston-based LexCorp by BMI for $2Bn prompted widespread sell-
oﬀs by traders as they sought to minimize exposure. LexCorp had been an employee-owned
concern since 2008.

– Hyp 1: BMI acquired an American company.

– Hyp 2: BMI bought employee-owned LexCorp for $3.4Bn.

– Hyp 3: BMI is an employee-owned concern.

• Text: On 18 April 1955, Aortic aneurism killed Albert Einstein. This is when blood vessels
gather in the aorta stretching out this part of the heart.

– Hyp 1: A health issue caused Einstein to die.

– Hyp 2: The Bell Inequalities were not presented while Einstein was alive.

– Hyp 3: Einstein was executed by Nazi Germany.

In RTE tasks, the text is always an inherent part of the inference process for predicting the
answer. In other words, solving a valid RTE question always requires information from both
sentences.

2.3.2 Winograd Schema Challenge (WSC)

The Winograd Schema Challenge (WSC) was proposed by Hector Levesque [72]. One of the
motivations of this challenge is to provide an alternative to the Turing Test, by enforcing human-like
reasoning. Unlike the Turing Test, the Winograd Schema Challenge does not involve a conversation between
a human and a machine. Instead, the machine needs to answer binary questions. For example:

• The trophy would not fit in the brown suitcase because it was too big (small). What was too
big (small)?

– Answer 0: the trophy

– Answer 1: the suitcase

• The town councilors refused to give the demonstrators a permit because they feared (advocated)
violence. Who feared (advocated) violence?

– Answer 0: the town councilors

– Answer 1: the demonstrators

Note that there is a special word (“big”, “feared”) in each of the original sentences. If we switch this word
with its alternative (“small”, “advocated”), the correct answer becomes the opposite. In the first
example, when “big” is used in the sentence, the answer is “the trophy”; when “small” is used in
the sentence, the correct answer is “the suitcase”.

The questions in the Winograd Schema Challenge are carefully designed to test a system’s world
knowledge and human-like reasoning capability. Levesque introduces guidelines for creating the
questions: 1) the questions are easy for humans to answer; 2) the questions cannot be solved by
coreference resolution techniques like selectional restrictions; 3) mining statistical measures from
large text corpora will not suffice to solve them. These guidelines provide valuable insights into
the process of creating natural language understanding benchmarks. Recently, people have become
more aware of the problem that machine learning models often memorize shallow statistical
cues instead of truly understanding natural language [58, 100]. In Levesque’s design, swapping the
special word with its alternative leads to the opposite ground-truth answer. This is closely related
to the idea of building adversarial examples to eliminate the power of memorizing shallow
statistical cues [100].

2.3.3 Causal Reasoning

The notion of causality or causation has been explored in psychology, linguistics, and computational
linguistics from a wide range of perspectives. For example, diﬀerent types of causal relations such as
causing, enabling, and preventing [42, 156] have been studied extensively as well as their linguistic
expressions [155, 141, 99] and automated extraction of causal relations from text [12, 97, 111, 118].

2.3.3.1 Choice of Plausible Alternatives (COPA)

The Choice of Plausible Alternatives (COPA) task [119] is created to evaluate a system’s capability
of dealing with commonsense causal reasoning. It contains a thousand questions and each question
contains a premise and two alternatives. The task is to select one of the alternatives that is most
likely to be the cause or eﬀect of the premise. Here are some examples:

• Premise: The man broke his toe. What was the CAUSE of this?

– Alternative 1: He got a hole in his sock.

– Alternative 2: He dropped a hammer on his foot.

• Premise: I knocked on my neighbor’s door. What happened as a RESULT?

– Alternative 1: My neighbor invited me in.

– Alternative 2: My neighbor left his house.

COPA is a relatively challenging task in natural language understanding, since solving
these questions requires more commonsense knowledge than the training data can provide, and
causal relations in text are usually difficult to obtain. Common ways of acquiring causality
knowledge adopted by existing approaches include automatic extraction of causal relations and
pre-training on large-scale external datasets. Luo et al. [84] proposed a text-based approach
for the COPA task. They extracted cause-effect terms from a large web corpus, and their approach
achieves 70.2% accuracy. Li et al. [75] train a neural network model using data from two external
datasets. They feed the model training examples from other tasks by transforming data from
those tasks into COPA-style plausibility questions.

2.3.3.2 Story Cloze Test

The Story Cloze Test [95] consists of ﬁve-sentence stories with two alternate endings, requiring a
system to decide which ending is more plausible. Below is an example story:

Context: Karen was assigned a roommate her ﬁrst year of college. Her roommate asked
her to go to a nearby city for a concert. Karen agreed happily. The show was absolutely
exhilarating.

– Right Ending: Karen became good friends with her roommate.

– Wrong Ending: Karen hated her roommate.

This benchmark contains many everyday events, which relate to a wide variety of commonsense
knowledge. Most of the stories focus on human emotions, intentions, and attitudes, i.e., naive
psychology. In this dissertation, we are more interested in physical commonsense knowledge.

Many research works also model cause-effect relations [40, 19, 24, 163], particularly
for question answering (e.g., addressing why questions). Most of these works address
high-level causal relations between events, for example, “the collapse of the housing bubble”
causes the effect of “stock prices to fall” [130]. They are not concerned with the lower-level
cause-effect relations associated with concrete actions. Unlike these previous works, this
dissertation has a specific focus on the physical causality of action verbs, in other words, change
of state in the physical world caused by action verbs as described in [54].

2.4 Knowledge Resources

One bottleneck towards natural language inference is that machines lack world knowledge.
Thus, there is an increasing amount of effort on developing knowledge representations and building
knowledge resources. In this section, we discuss some of the major knowledge resources. Based
on how they are created, we categorize them into hand-built knowledge resources and
automatically extracted knowledge resources.

2.4.1 Hand-built Knowledge Resources

WordNet [92] is a large lexical database for English. Words are grouped into synsets and then
organized into a network structure. Each synset is linked with other synsets by
conceptual relations. Nouns and verbs are arranged into hierarchies, with hypernymy
and hyponymy representing links going up or down the hierarchies. WordNet contains
117,000 synsets and provides different types of knowledge regarding the relations between them.
However, WordNet mainly focuses on conceptual-semantic and lexical relations; it does not contain
commonsense knowledge about how the world changes.

ConceptNet [81] is a network that records large amounts of commonsense knowledge. Here
the term commonsense refers to “the millions of basic facts and understandings possessed by
most people.” ConceptNet is built on the human-annotated database Open Mind Common
Sense (OMCS) [136]. Extraction rules are designed to automatically extract ConceptNet’s binary
relations from the OMCS sentences. ConceptNet has since grown to include knowledge from other
human-built resources. ConceptNet 5.5 [142] contains over 21 million links between over 8
million nodes. Although ConceptNet has grown rapidly in size and coverage, it is still far from
capturing human-level commonsense knowledge. For example, causal relations are still sparse in
ConceptNet, making it difficult to make inferences in a human-like way.

2.4.2 Automatically Extracted Knowledge

Human annotation is usually very expensive to obtain. Therefore, many studies have
sought to automatically extract knowledge from existing large collections of data. Except
for a few that acquire knowledge from images [167], most previous efforts apply information
extraction techniques to extract facts from large amounts of text data [27, 112]. A commonly
adopted approach is to discover relations between named entities and automatically extract facts
about those entities from raw textual data [27, 112]. DBPedia [71], Freebase [13], and YAGO [144]
extract structured information from document repositories on Wikipedia. Wikipedia is an ideal
resource for knowledge extraction since it is maintained by a large community and it contains
semi-structured documents that have great semantic heterogeneity.

These automatically mined knowledge bases cover millions of facts about the world, such as
people and places, saving a lot of time and expense compared with human annotation. However,
they emphasize relations and properties related to named entities (e.g., places, people, and
organizations). They do not contain an important type of commonsense knowledge, which people
usually do not mention explicitly, but which is still critical for our communication. Physical
causality knowledge is of this type.

2.5 Related Work in Computer Vision and Robotics

The idea of modeling object physical state change has also been studied in the computer vision
community and the robotics community.

2.5.1 Related Work in Computer Vision

A recent trend in computer vision has started looking into intermediate representations beyond
lower-level visual features for action recognition, for example, by incorporating object aﬀor-
dances [69] and causality between actions and objects [31]. Fathi and Rehg [31] have broken down
detection of actions to detection of state changes from video frames. Yang and colleagues [164, 165]
have developed an object segmentation and tracking method to detect state changes (or, in their
terms, consequences of actions) for action recognition. More recently, Fire and Zhu [34] have
developed a framework to learn perceptual causal structures between actions and object statuses
in videos. However, these previous works focus only on the visual presentation of motion effects.
In Chapter 5, we aim to make a connection between visual presentation and human language
descriptions.

Recent years have seen an increasing amount of work integrating language and vision, for
example, visual question answering [3, 35, 82]. Diﬀerent approaches have been developed such
as Multimodal Compact Bilinear Pooling (MCB) [35], Dynamic Memory Network [158], and the
use of external knowledge bases [157]. Most of this work focuses on yes/no questions
and what-type questions related to object recognition. While many approaches require a large
amount of training data, more recent works have developed zero/few shot learning for language
and vision [96, 160, 159, 161, 149]. Diﬀerent from these previous works, in Chapter 5 of this
dissertation, we introduce a new task that connects language with vision for physical action-eﬀect
prediction, focusing on the causal relation between actions and state changes depicted by both
language and visual data.

2.5.2 Related Work in Robotics

In the robotics community, an important task is to enable robots to follow human natural language
instructions. Previous works [134, 94, 131, 132] explicitly model verb semantics as desired goal
states and thus link natural language commands with underlying planning systems for action
planning and execution. In these works, action schemas are defined to capture the change of state
caused by a given action. Based on action schemas and the goal state, planning algorithms can be
applied to find a sequence of actions to achieve the goal [39]. Therefore, the state of the physical
world is a very important notion, and changing the state becomes a driving force for a robot’s
actions. However, these studies were carried out either in a simulated world or in a carefully
curated simple environment within the limitations of the robot’s manipulation system. They also
focus only on a very limited set of domain-specific actions, which often involve only a change of
location. In Chapter 5 of this dissertation, we study a set of open-domain physical actions and a
variety of effects perceived from the environment (i.e., from images).

CHAPTER 3

MODELING PHYSICAL CAUSALITY OF VERBS

In order to enable a robot or a computer to acquire and utilize the physical causality knowledge
of verbs, we need to collect this kind of knowledge and transfer it into machine-understandable
representations. Usually the most natural way for humans to pass on knowledge is through language.
However, physical causality knowledge is usually not explicitly stated in human language, since
we assume everyone possesses this kind of commonsense knowledge. This makes it difficult to
extract physical causality knowledge from existing language datasets, like large-scale text corpora.
Therefore, in this study, crowd-sourcing tasks were designed to collect physical causality data. In
these crowd-sourcing tasks, human subjects were asked to explicitly express their knowledge of
action verbs, through natural language or through answering designed multiple choice questions.

After data collection, we investigate two different approaches to model physical causality
knowledge. In one approach, the changes of state are categorized into classes, and the causality
knowledge for an action verb is represented as its associations with those change of state classes.
In the other approach, language descriptions of actions and their effects are embedded using a
neural network into a common vector space. The causality knowledge is modeled through
similarities between embedding vectors.

3.1 Categorization of Physical Causality 1

3.1.1 Linguistics Background on Action Verbs

Verb semantics have been studied extensively in linguistics [110, 73, 7, 66]. In this dissertation, we
focus only on concrete action verbs (such as run, throw, cook), which denote physical actions in
the world, rather than verbs denoting states or abstract actions that cannot be visually perceived.
Hovav and Levin [54] propose that action verbs can be divided into two types: manner verbs that
“specify as part of their meaning a manner of carrying out an action” (e.g., nibble, rub, scribble,
sweep, flutter, laugh, run, swim), and result verbs that “specify the coming about of a result state”
(e.g., clean, cover, empty, fill, chop, cut, melt, open, enter). Result verbs can be further classified
into three categories: Change of State verbs, Inherently Directed Motion verbs, and Incremental
Theme verbs [74]. Change of State verbs denote a change in a property of an object (e.g., “to melt”).
Inherently Directed Motion verbs indicate a movement with regard to a landmark object (e.g., “to
arrive”). Incremental Theme verbs denote an incremental change of an object, such as a change in
mass, volume, or area (e.g., “to eat”). This dissertation focuses mainly on result verbs. Unlike Levin
and Hovav’s definition of Change of State verbs, here the term change of state is used in a more
general way, such that the location, volume, and area of an object are part of its state.

1 This is joint work with Malcolm Doering. Part of this section (Sections 3.1.1, 3.1.2, and 3.1.3)
is also included in Doering’s Master of Science dissertation [25].

Previous linguistic studies have also shown that result verbs often specify movement along a
scale [54]. A scale usually denotes an attribute of an object, like size, temperature, cost. For
example, “Mary shortened the skirt” indicates that the length of the object skirt has decreased. A
detailed description of scale structure can be found in Kennedy and McNally’s work [60].

Interestingly, gradable adjectives also have their semantics defined in terms of a scale structure.
Dixon and Aikhenvald have defined a typology for adjectives which includes categories such as
Dimension, Color, Physical Property, Quantification, and Position [23]. The connection between
gradable adjectives and result verbs through scale structure motivates us to use the Dixon typology
as a basis for defining our categorization of causality for verbs.

In summary, previous linguistic literature has provided abundant evidence and discussion on
change of state for action verbs. It has also provided extensive knowledge on potential dimensions
that can be used to categorize change of state as described in this work.

3.1.2 A Crowd-Sourcing Study

Motivated by the above linguistic insight, we conducted a pilot study to examine the feasibility
of causality modeling using a small set of verbs which appear in the TACoS corpus [117]. This
corpus contains paired text and video data, where the videos capture different human subjects doing
cooking activities, and the text sentences describe the actions of the human subjects. The TACoS
dataset contains mainly descriptions of physical actions, and a majority of the verbs are result
verbs, which denote changes of state that can be observed in the world. Therefore, the TACoS
dataset is very suitable for our study.

More speciﬁcally, we chose ten verbs (clean, rinse, wipe, cut, chop, mix, stir, add, open, shake)
based on the criteria that they occur relatively frequently in the corpus and take a variety of diﬀerent
objects as their patient. We paired each verb with three diﬀerent objects in the role of patient.
Nouns (e.g., cutting board, dish, counter, knife, hand, cucumber, beans, leek, eggs, water, break,
bowl, etc.) were chosen based on the criteria that they represent objects dissimilar to each other,
since we want to investigate the verb causality knowledge under diﬀerent contexts.

Each verb-noun pair was presented to human annotators via Amazon Mechanical Turk (AMT)
and they were asked to describe (by text) the changes of state that occur to the object as a result of the
verb. The descriptions were collected under two conditions: (1) without showing the corresponding
video clips (so annotators would have to use their imagination of the physical situation) and (2)
showing the corresponding video clips. For each condition and each verb-noun pair, we collected
30 annotators’ responses, which resulted in a total of 1800 natural language responses describing
changes of state.

3.1.3 Categorization of Change of State

Based on Dixon and Aikhenvald’s typology for adjectives [23] and human annotators’ responses,
we identiﬁed a categorization to characterize causality, as shown in Table 3.1. This categorization is
also driven by the expectation that these attributes can be potentially recognized from the physical
world by artiﬁcial agents. The ﬁrst column speciﬁes the type of state change and the second
column speciﬁes speciﬁc attributes related to the type. The third column speciﬁes some possible
values associated with the attribute, e.g., it could be a binary categorization on whether a change
happens or not (i.e., changes), or a direction along a scale (i.e., increase/decrease), or a speciﬁc
value (i.e., specific such as “five pieces”). In total, we have identified eighteen causality categories
corresponding to eighteen attributes as shown in Table 3.1.

Type               Attribute             Attribute Value
Dimension          Size, length, volume  Changes, increases, decreases, specific
                   Shape                 Changes, specific (cylindrical, flat, etc.)
Color/Texture      Color                 Appear, disappear, changes, mix, separate, specific (green, red, etc.)
                   Texture               Changes, specific (slippery, frothy, etc.)
Physical Property  Weight                Increase, decrease
                   Flavor, smell         Changes, intensifies, specific
                   Solidity              Liquefies, solidifies, specific
                   Wetness               Becomes wet(ter), dry(er)
                   Visibility            Appears, disappears
                   Temperature           Increases, decreases
                   Containment           Becomes filled, emptied, hollow
                   Surface Integrity     A hole or opening appears
Quantification     Number of pieces      Increases, one becomes many, decreases, many become one
Position           Location              Changes, enter/exit container, specific
                   Occlusion             Becomes covered, uncovered
                   Attachment            Becomes detached
                   Presence              No longer present, becomes present
                   Orientation           Changes, specific

Table 3.1: Categorization of physical causality.

Through analyzing the crowd-sourcing data, we have made several interesting and important
observations:

1. A verb can be associated with multiple changes of state. Our data show that a single human
description contains up to three different changes of state. 43% of the descriptions contained only
a single change of state, and 36% contained no change of state. 19% described two CoS
and 2% described three CoS. Since our causality categories are mainly designed to capture low-level
states, they do not include higher-level attributes like cleanliness. Some of the descriptions
counted as no change of state actually describe changes of such high-level state attributes.

Figure 3.1 shows the distributions of causality labels applied to two verbs, clean and rinse.
Intuitively, these two verbs have similar meanings. As shown in the figure, their distributions of
causality labels are also similar: both have high weights on PresenceOfObject and Wetness.
However, the differences between these two verbs are also captured by the distributions. Humans
tend to describe the effect of clean more often with the PresenceOfObject label, and the effect
of rinse more often with the Wetness label. This is partly because the result verb clean is more
related to the final state that “dirt is no longer present”, while the manner verb rinse is more
related to the use of water.

Figure 3.1: Distributions of causality labels for verbs clean and rinse.

2. The causality for a verb is context dependent. Human descriptions of verb causality depend
not only on the nouns filling a particular semantic role for the verb, but also on the physical
scenes in which the verb-noun pairs appear. Based on the collected data, we developed a metric
called variability, using Jensen-Shannon divergence (JSD) to compare the distributions of causality
labels associated with a verb under different conditions (e.g., taking different nouns or whether a
video clip was shown or not).

The JSD of two distributions P and Q is deﬁned as below.

\[ \mathrm{JSD}(P \,\|\, Q) = \frac{1}{2} D(P \,\|\, M) + \frac{1}{2} D(Q \,\|\, M), \tag{3.1} \]

where M = (P + Q)/2, and D is the well-known Kullback-Leibler divergence. JSD is a symmetric
measure (i.e., JSD(P||Q) = JSD(Q||P)). A smaller JSD means the two distributions are more
similar.
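
As a minimal sketch of Equation 3.1, the snippet below computes JSD for two discrete distributions given as lists of probabilities. The function names, the base-2 logarithm, and the example distributions are assumptions made for illustration; the text does not specify a logarithm base or an implementation.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q), in bits (base-2 log is an assumption).

    Terms with p[i] == 0 contribute nothing, following the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence of two discrete distributions (Equation 3.1)."""
    # M is the element-wise average of P and Q, so M[i] > 0 whenever P[i] > 0 or Q[i] > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical causality-label distributions over three attributes.
clean = [0.6, 0.3, 0.1]
rinse = [0.4, 0.5, 0.1]
print(jsd(clean, rinse), jsd(rinse, clean))  # the two values agree, since JSD is symmetric
```

With a base-2 logarithm, JSD ranges from 0 (identical distributions) to 1 (distributions with disjoint support).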

Verb     +/-Scene   3 Objects
clean    0.03       0.04
rinse    0.01       0.05
wipe     0.02       0.14
cut      0.01       0.02
chop     0.02       0.03
mix      0.05       0.13
stir     0.09       0.21
add      0.12       0.22
open     0.09       0.32
shake    0.18       0.42

Table 3.2: Variability of causality labels over different object and scene conditions.

Variability describes how the causality label distributions of a verb vary with diﬀerent condi-
tions. The conditions include ﬁlling the patient role with diﬀerent objects, or whether a video clip
was shown to the human annotators. The variability is deﬁned as below.

\[ \text{variability} = \frac{\sum_{(i,j),\, i \neq j} \mathrm{JSD}(d_i, d_j)}{\text{num pairs}} \tag{3.2} \]

where (i, j) indicates a pair of two distributions. The variability is the average JSD over all pairs
of causality label distributions. Each pair of distributions corresponds to a pair of different values
for the context variable. For example, the variability over the object conditions of the verb chop is
calculated by averaging the JSD of three unique pairs of causality label distributions, i.e., averaging
over JSD(chop cucumber, chop bean), JSD(chop cucumber, chop leek), and JSD(chop bean, chop
leek).
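
The variability computation in Equation 3.2 can be sketched as follows. This is an illustrative sketch only: the function names and the three toy distributions for chop are hypothetical, and the base-2 logarithm is an assumption. It averages over unordered pairs, which gives the same value as averaging over ordered pairs because JSD is symmetric.

```python
import math
from itertools import combinations


def jsd(p, q):
    """Jensen-Shannon divergence (base-2 log assumed) of two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def variability(distributions):
    """Average JSD over all unique pairs of label distributions (Equation 3.2)."""
    pairs = list(combinations(distributions, 2))
    return sum(jsd(p, q) for p, q in pairs) / len(pairs)


# Hypothetical causality-label distributions of "chop" with three different patients.
chop = {
    "cucumber": [0.70, 0.20, 0.10],
    "bean": [0.65, 0.25, 0.10],
    "leek": [0.75, 0.15, 0.10],
}
print(variability(list(chop.values())))  # averaged over the three unique pairs
```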

The variabilities over objects and scenes for each verb are shown in Table 3.2. The second
column of the table shows that whether or not video clips are shown can influence human
judgement of action effects, and that some verbs (e.g., shake) are more sensitive to changes of
visual scenes. The third column of the table shows that for some verbs, their causality information
is also closely related to the objects. These observations indicate that the causality of a verb
depends on its context.

3. Causality models can be used to reflect similarities between verbs. Based on the data,
we further applied Jensen-Shannon divergence (JSD) to calculate the divergence of causality label
distributions between different verbs. Our results indicate that similarity between verbs based
on causality distributions is consistent with similarity based on verb semantics. For example, for
two pairs of similar verbs, JSD(cut, chop) = 0.01 and JSD(mix, stir) = 0.03, while for two pairs of
dissimilar verbs, JSD(cut, shake) = 0.59 and JSD(rinse, chop) = 0.68. This shows that the causality
labels for verbs, while adding another dimension to verb semantics, still preserve the original
meaning of these verbs.

In summary, the results from our empirical studies, although preliminary due to a small dataset,
have shown it is possible to systematically model causality knowledge for a set of common verbs
through crowd-sourcing studies. These results have motivated us to conduct more in-depth inves-
tigations on modeling and utilizing causality knowledge.

3.1.4 Evaluation: Verb Similarity Judgement and Thematic Fit Estimation

In this section, we demonstrate that verb causality categorizations can potentially improve
semantic modeling of verbs based on distributional semantics.

We collected a larger dataset of verb causality annotations based on sentences from the TACoS
Multilevel corpus [121], through crowd-sourcing on Amazon Mechanical Turk. Annotators were
shown a sentence containing a verb-object pair (e.g., “The person chops the cucumber into slices
on the cutting board”). They were asked to annotate the change of state that occurred to the
patient as a result of the verb by choosing up to three options from the 18 causality attributes. Each
sentence was annotated by three different annotators.

The dataset contains 4391 sentences, covering 178 verbs, 260 nouns, and 1624 verb-object
pairs. Note that multiple sentences can have the same verb-object pair. Each verb-object pair
always contains a single verb, but can have two or more object nouns. 41.6% of the verb-object
pairs contain two or more object nouns, e.g., “move-egg, bowl”. The causality annotation result for
each sentence is represented as an 18-dimensional binary vector; each dimension is 1 if at least
two annotators labeled the corresponding causality attribute as true, and 0 otherwise. In 80% of the
vectors, only one dimension is 1, showing that on most sentences, at least two annotators agreed on
one causality attribute. In 3% of the vectors, more than one dimension is 1, meaning at least two
annotators agreed on more than one attribute. 17% of the vectors are zero vectors, meaning no
attribute was agreed upon by at least two of the three annotators. All the experiments in this
section are conducted on this dataset.
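
The aggregation rule described above (a dimension is 1 when at least two of the three annotators chose that attribute) can be sketched as below. The function name, the use of integer attribute indices, and the example annotations are hypothetical and serve only to illustrate the rule.

```python
NUM_ATTRIBUTES = 18  # one dimension per causality attribute in Table 3.1


def aggregate_annotations(annotations, min_votes=2):
    """Turn per-annotator attribute choices into one binary causality vector.

    `annotations` holds one set of attribute indices (0..17) per annotator.
    A dimension is 1 if at least `min_votes` annotators selected that attribute.
    """
    votes = [0] * NUM_ATTRIBUTES
    for selected in annotations:
        for index in selected:
            votes[index] += 1
    return [1 if count >= min_votes else 0 for count in votes]


# Hypothetical annotations for one sentence: two of three annotators pick attribute 12.
vector = aggregate_annotations([{12, 0}, {12}, {5}])
print(vector)

# If no attribute is shared by at least two annotators, the result is a zero vector.
print(aggregate_annotations([{0}, {1}, {2}]))
```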

If an annotator believes that none of the 18 attributes is applicable to the verb, they can
instead select “Current change of state frame is not applicable” (CoS-NA) or “No change of
state” (No-CoS). In the overall annotation results, less than 1 percent of instances are labeled with
CoS-NA or No-CoS, illustrating that the coverage of the proposed causality label categorization is
quite thorough.

3.1.4.1 Verb Similarity Judgement

Distributional Semantic Models (DSM) [10, 17] use contextual distributions to represent word
meaning. However, using only contextual information does not provide a complete picture of
word meaning. In this section, we augment verb representations with causality information, and
evaluate the performance of the augmented models against human-annotated verb similarity
datasets. Since causality information captures possible changes of the state of the physical world
denoted by the verb, it could be a good supplement to the contextual information of DSM in terms
of verb semantics.

For each verb, we use an 18-dimensional vector to represent its causality information. The
vector is obtained by averaging all causality vectors of the sentences that contain this verb. For
the contextual information, we adopt the Distributional Memory (typeDM) [10], from which we
can get a vector representation for each verb. DM was constructed from three large-scale corpora:
ukWaC, WackyPedia, and BNC. To combine the DM vector F_t and the causality vector F_s, we use
the linear weighted combination function from [17]:

\[ F = \alpha \times F_t \oplus (1 - \alpha) \times F_s \tag{3.3} \]

where ⊕ is the vector concatenation operator. The parameter α can be determined from a
development dataset.
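
Equation 3.3 can be read as scaling each vector by its weight and then concatenating the results. The sketch below illustrates that reading; the function name and the toy vectors are hypothetical, and any normalization of the two vectors before concatenation is not specified in the text and is therefore omitted here.

```python
def combine(f_t, f_s, alpha=0.5):
    """Weighted concatenation F = alpha * F_t (+) (1 - alpha) * F_s from Equation 3.3.

    f_t: distributional (DM) vector; f_s: causality vector.
    The result has len(f_t) + len(f_s) dimensions.
    """
    return [alpha * x for x in f_t] + [(1 - alpha) * x for x in f_s]


# Hypothetical low-dimensional stand-ins for a DM vector and a causality vector.
dm_cut = [0.2, 0.8, 0.4]
causality_cut = [1.0, 0.0]
print(combine(dm_cut, causality_cut, alpha=0.5))
```

Setting alpha to 1 keeps only the distributional part and setting it to 0 keeps only the causality part, so alpha controls how much each information source contributes to any similarity computed on the combined vector.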

As there is no existing word similarity dataset that has good coverage of the verbs we study,
we need to develop new benchmarks. Following previous work on similarity measurement [16, 10],
we developed two benchmarks. Each benchmark contains 378 pairs of frequent verbs in
the TACoS dataset. Each pair has an averaged similarity score obtained by crowd-sourcing on
Amazon Mechanical Turk. Ten different annotators were asked to rate each verb pair on a scale
from 1 to 5. For example, in one of the benchmarks, the pair cut-slice receives a high average
rating of 4.2, while clean-pull receives a low rating of 1.2. The only difference between the two
benchmarks is that, during the collection of one, annotators were informed that these verbs describe
cooking activities in the kitchen, while no such information was provided during the other. In this
way, we can get human judgement of verb similarity both in a specific domain and in the general
domain.

We evaluate the models of verb meaning in terms of their Spearman correlation to the human
rating benchmarks. Cosine similarity is used to measure the similarity between two verbs in these
vector models. In order to tune the parameter α, the general domain benchmark was divided into
a development set and a test set, each containing half of the data. We found the optimal value of α
is around 0.5 on the development set. We set α = 0.5 for the experiments. Table 3.3 reports the
evaluation results. No significant differences were observed between results on the two benchmarks.
As expected, the concatenation model (DM+Causality) clearly outperforms the DM model on both
benchmarks. This illustrates the effectiveness of causality information in capturing verb meaning.
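
The evaluation procedure described above (cosine similarity between verb vectors, then rank correlation with human ratings) can be sketched with the standard library alone. All names, vectors, and the third rating below are hypothetical; only the cut-slice (4.2) and clean-pull (1.2) ratings are taken from the text. Spearman correlation is computed here as the Pearson correlation of average ranks, which is its standard definition.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


# Hypothetical verb vectors and human ratings for three verb pairs.
vectors = {
    "cut": [0.9, 0.1, 0.3], "slice": [0.8, 0.2, 0.3],
    "clean": [0.1, 0.9, 0.2], "pull": [0.7, 0.0, 0.9], "stir": [0.3, 0.5, 0.6],
}
pairs = [("cut", "slice"), ("clean", "pull"), ("cut", "stir")]
human = [4.2, 1.2, 2.0]  # the last rating is made up for illustration
model = [cosine(vectors[a], vectors[b]) for a, b in pairs]
print(spearman(model, human))
```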

Model          General Domain   Cooking Domain
DM             0.4460           0.4382
DM+Causality   0.5554           0.5328

Table 3.3: Results of verb similarity judgement task using Distributional Memory (DM) model,
and concatenation model (DM+Causality). (Pearson’s correlation ρ, all values are significant with
p < 0.001.)

3.1.4.2 Thematic Fit Estimation

First we define the causality vector for a verb and the affordance vector for a noun. The causality
vector for a verb is the same vector defined in the last application, which is calculated by averaging
all causality vectors of the sentences that contain this verb. This vector shows the possible changes
of state to the physical world caused by the verb. The affordance vector for a noun is calculated
by averaging all causality vectors of the sentences that contain this noun in the patient role. This
vector shows the possible changes of state for the corresponding object as a consequence of actions.
Thus, we can estimate how plausibly a noun fills the patient role (or direct object) of a verb by
calculating the similarity between the two vectors. For example, the causality vector of “cut” has a
heavy weight on the causality attribute “Quantity”, and the affordance vector of “carrot” also has a
heavy weight on the same attribute, thus we can tell “carrot” fits well as the object of verb “cut”.
I implemented typeDM [10] for comparison, since it has shown state-of-the-art performance
in thematic fit tasks [10, 46]. In the typeDM model, to determine how well a noun fits the patient
role of a verb, we first find the 20 most popular nouns for the patient role of the verb, by counting
the syntactic dependency links of object. Then a centroid is calculated through normalizing and
averaging the DM vectors of the 20 nouns. The thematic fit score is the cosine similarity between
the DM vector of the target noun and the centroid.
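
The centroid-based baseline described above can be sketched as follows. This is not the typeDM implementation itself: the function names are hypothetical, the toy vectors stand in for DM vectors, and only three patient nouns are used instead of 20 to keep the example short.

```python
import math


def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def centroid(vectors):
    """Normalize each vector, then average them dimension by dimension."""
    unit = [normalize(v) for v in vectors]
    return [sum(column) / len(unit) for column in zip(*unit)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def thematic_fit(target_noun_vec, top_patient_vecs):
    """Cosine similarity between the target noun vector and the patient centroid."""
    return cosine(target_noun_vec, centroid(top_patient_vecs))


# Hypothetical vectors of the most frequent patients of "cut", and two candidate nouns.
top_patients_of_cut = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2]]
print(thematic_fit([0.85, 0.2], top_patients_of_cut))  # close to the centroid
print(thematic_fit([0.1, 0.9], top_patients_of_cut))   # far from the centroid
```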

To show the advantage of including the causality information in the thematic fit estimation task,
again we adopted the concatenation model from Equation 3.3 to integrate DM and CoS information
(α = 0.5). The cosine similarity between vectors is used to measure the plausibility of a noun being
the object of a verb.
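
The causality and affordance vectors defined at the start of this section can be sketched as below. The records, the three-dimensional toy vectors (instead of 18 dimensions), and the function names are hypothetical; the sketch only illustrates averaging sentence-level causality vectors per verb and per patient noun, then comparing the two with cosine similarity.

```python
import math
from collections import defaultdict


def mean_vectors(records, key_index):
    """Average the causality vectors of all sentences sharing the same key.

    Each record is (verb, patient_noun, causality_vector); key_index 0 groups by
    verb (causality vectors), key_index 1 groups by noun (affordance vectors).
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[2])
    return {
        key: [sum(column) / len(vectors) for column in zip(*vectors)]
        for key, vectors in groups.items()
    }


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical sentence-level annotations with 3-dimensional causality vectors.
sentences = [
    ("cut", "carrot", [1, 0, 0]),
    ("cut", "bread", [1, 0, 1]),
    ("rinse", "carrot", [0, 1, 0]),
]
causality = mean_vectors(sentences, key_index=0)   # one vector per verb
affordance = mean_vectors(sentences, key_index=1)  # one vector per patient noun

print(cosine(causality["cut"], affordance["bread"]))    # high: shared attributes
print(cosine(causality["rinse"], affordance["bread"]))  # zero: no shared attributes
```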

Since there is no existing thematic ﬁt benchmark that has a good coverage on the verbs and
nouns we study, we created a new dataset of human judgements on thematic ﬁt of patient role,
following previous work on thematic ﬁt estimation [89, 104]. 32 verbs and 36 nouns were sampled
from the TACoS Multilevel dataset. These verbs and nouns were used to randomly construct 520
verb-noun pairs. Each verb-noun pair was rated by 5 diﬀerent annotators from Amazon Mechanical
Turk. They rated the pair based on the plausibility of the noun as patient of the verb. The rating was

39

on a scale 1 to 5, and judgement were then averaged from 5 annotators (e.g., cook-broccoli receives
a high average rating 4.8, cut-salt receives a low rating 1.0). Tabel 3.4 reports the evaluation
results. The concatenation model signiﬁcantly outperforms DM model, indicating that causality
information play an important role in measuring thematic ﬁtness.
Model            Pearson’s ρ
DM               0.3007
DM+Causality     0.3732

Table 3.4: Results of thematic fitness estimation using the Distributional Memory (DM) model and
the concatenation model (DM+Causality). (Pearson’s correlation ρ; all values are significant with
p < 0.001.)

The above two applications of modeling physical causality of verbs illustrate that this kind of
knowledge is an important complement to the distributional semantics of verbs. It can be used not
only to measure similarity among words, but also to capture more abstract semantic relations.

3.2 Modeling Causality Knowledge via Embedding Methods

Previous discussions have shown the potential of modeling verb causality knowledge using pre-
deﬁned categories. However, the most natural way for humans to communicate and pass knowledge
is through open-ended language. In this section, our goal is to directly model causality knowledge
from human natural language, instead of manually translating natural language into pre-deﬁned
categories.

3.2.1 Cause-Eﬀect Data Collection

As mentioned earlier, the commonsense causality knowledge associated with concrete action verbs
is often pre-supposed and not explicitly stated in language. It is diﬃcult to extract cause-eﬀect
data from existing text collections. Therefore, we applied human computing and collected a set of
cause-eﬀect data through crowd-sourcing.

We started with the 1000 most frequent English verbs from the Corpus of Contemporary American
English. By querying these verbs in two dictionaries (the LDOCE dictionary and the American
Heritage 3rd edition) using patterns provided by the dictionaries (e.g., transitive verbs with direct
object), we identified a subset of verbs which take concrete nouns as their patient (in other words,
direct object). We then extracted all example sentences for this subset of verbs from the dictionaries.
Finally, two undergraduate students manually extracted the verb and its patient (i.e., the noun that
serves as direct object) from each example sentence to form a verb-noun pair (also referred to as a
verb-patient pair). Only those verb-noun pairs where the verb has a clear effect on the state of
the world related to the noun were chosen for our crowd-sourcing data collection. This process
resulted in a total of 558 verb-noun pairs with 251 different verbs and 356 different nouns.

Cause Text      Effect Text
slice bread     The bread went from being a solid loaf to several pieces.
file nails      The nails became smooth.
fry potato      The potatoes become crisp and golden and go from raw to cooked.
stain carpet    There is a visible soiled mark on the carpet.

Table 3.5: Example cause and effect text from our collected data.

The crowd-sourcing data collection was carried out on Amazon Mechanical Turk. Annotators
were shown a verb-noun pair, and they were asked to use their own words to describe what changes
might occur to the object (denoted by the noun) as a result of the action (denoted by the verb). Each
verb-noun pair was annotated by 10 diﬀerent annotators, which has led to a total of 5580 eﬀect
descriptions. Table 3.5 shows some examples of collected eﬀect descriptions.

3.2.2 Causality Embedding Models

We propose a text embedding method to model verb causality knowledge. The structure of our
model is shown in Figure 3.2. It is composed of two sub-networks: one for verb-noun pairs (i.e.,
cause) and the other for effect descriptions (i.e., effect). The cause and effect can be represented
either by words or phrases (as explained later) using their pre-trained embeddings $v_c$ and $v_e$. Each
pre-trained embedding is fed to a fully-connected layer and transformed into a new (or adapted)
cause embedding $\hat{v}_c$ or effect embedding $\hat{v}_e$. The adapted embeddings are learned by
minimizing the following loss function $l$:

\[ l = \left[ s(\hat{v}_c, \hat{v}_e) - \gamma \right]^2 \tag{3.4} \]

where

\[ \gamma = \begin{cases} 1, & \text{if } (c, e) \in C \\ s(v_c, v_e), & \text{if } (c, e) \notin C \end{cases} \tag{3.5} \]

Here $s(\cdot, \cdot)$ is the cosine similarity between vectors, and $C$ is the set of cause-effect tuples in
our collected data. Suppose $c$ is an input for cause and $e$ is an input for effect. This loss function
learns a new cause and effect space that maximizes the similarity between $c$ and $e$ if they have a
cause-effect relation (i.e., $(c, e) \in C$) while maintaining their original similarity if they do not
(i.e., $(c, e) \notin C$). Essentially this approach learns an adaptation of the original embedding
space, which is encoded by two nonlinear transforms. To prevent overfitting, we also
add a dropout layer with probability 0.5 to the input of the adapted embedding layer.

Figure 3.2: Architecture of the verb causality embedding model.
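
The target and loss of Equations 3.4 and 3.5 can be illustrated with a short sketch. It only computes the loss value for one pair, given similarities that are assumed to have been computed elsewhere; the function name `causality_loss` and its arguments are hypothetical, and the training of the fully-connected layers is not shown.

```python
def causality_loss(adapted_sim, pretrained_sim, is_cause_effect):
    """Squared loss of Equation 3.4 with the target gamma of Equation 3.5.

    adapted_sim     -- cosine similarity of the adapted embeddings
    pretrained_sim  -- cosine similarity of the pre-trained embeddings
    is_cause_effect -- True if the pair is in the collected set C
    """
    gamma = 1.0 if is_cause_effect else pretrained_sim
    return (adapted_sim - gamma) ** 2

# A positive pair is pulled towards similarity 1 ...
loss_positive = causality_loss(0.5, 0.2, True)
# ... while a negative pair is pulled back to its original similarity.
loss_negative = causality_loss(0.5, 0.2, False)
```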

As mentioned earlier, cause and eﬀect can be represented by either words or phrases as follows.

[Figure 3.2 labels: Cause Text (cut cucumber; throw baseball; pile boxes; ...) → Pre-trained Embedding
→ Causal Embedding; Effect Text (cucumber into small pieces; cucumber is split; cucumber no longer
long; ...) → Pre-trained Embedding → Causal Embedding; Cosine Loss.]

Example patterns                             Example extracted State Phrases (bold)
VP with a verb ∈ {be, become, turn, get}     The ship is destroyed.
VP + PRT                                     The wall is knocked off.
VP + ADVP                                    The door swings forward.
ADJP                                         The window would begin to get clean.
PP + NP                                      The eggs are divided into whites and yolks.

Table 3.6: Example patterns that are used to extract state phrases (bold) from sample sentences.

Word Causality Embedding (cEmbedWord). In this setting, the cause and effect text are
first broken into words. For a verb-noun pair (as cause) and one of its effect descriptions (as
effect), after filtering out stop words, each word in the verb-noun pair is coupled with each word
in the effect description to generate a cause-effect tuple. In this setting, we use the 300-dimensional
Word2Vec [90] vectors pre-trained on the Google News corpus.
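
The word-level coupling can be sketched as a Cartesian product of the filtered cause words and effect words. The stop-word list below is a tiny illustrative assumption (the list actually used is not specified here), as are the whitespace tokenization and the function name `word_tuples`.

```python
from itertools import product

# Illustrative stop-word list only; an assumption for this sketch.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "from", "into", "of", "and"}

def word_tuples(cause_text, effect_text):
    """Pair every non-stop word of the cause with every non-stop word of the effect."""
    cause_words = [w for w in cause_text.lower().split() if w not in STOP_WORDS]
    effect_words = [w.strip(".,") for w in effect_text.lower().split()]
    effect_words = [w for w in effect_words if w and w not in STOP_WORDS]
    return list(product(cause_words, effect_words))

pairs = word_tuples("file nails", "The nails became smooth.")
```

Each resulting pair would then be treated as one positive (cause, effect) training tuple.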

Phrase Causality Embedding (cEmbedPhrase). In this setting, we first apply chunking
(shallow parsing) using the SENNA software [20] to break an effect description into phrases such
as noun phrases (NP), verb phrases (VP), prepositional phrases (PP), adjective phrases (ADJP),
adverb phrases (ADVP), etc. After examining the syntactic structure of the collected effect
descriptions, we found that most of the descriptions follow simple syntactic patterns. For a
verb-noun pair, around 80% of its effect descriptions start with the same noun as the subject.
In an effect description, the change of state associated with the noun is mainly captured by some
key phrases. For example, an adjective phrase usually describes a physical state; verbs like be,
become, turn, and get often indicate a description of a change of state. Based on these observations,
we defined a set of patterns to identify phrases that describe physical states of an object. We call
these phrases state phrases. Table 3.6 shows some example patterns for identifying state phrases,
together with example state phrases extracted using these patterns. In addition, if an effect sentence
begins with a noun phrase as the subject, we also concatenate that noun phrase with each of the
extracted state phrases.

After extracting state phrases from an effect description, we couple the corresponding verb-noun
phrase with each of the extracted state phrases to form a (cause, effect) tuple. If no phrase
is extracted from an effect description, we treat the whole description as a long phrase to form the
tuple. We encode phrases into vector representations using skip-thought, an RNN pre-trained on a
large-scale book corpus [67].

3.2.3 Evaluation: Causality Embedding in Causal QA

The learned causality embedding can be applied to Causal Question Answering (cQA). This cQA is
diﬀerent from traditional question answering that involves cause-eﬀect relations between high-level
events. It was mainly designed to test machines/artiﬁcial agents’ ability in causal reasoning related
to concrete actions. More speciﬁcally, we evaluate two types of questions.

Cause-to-Eﬀect (Cause2Effect) questions. Given a verb-noun phrase, the question is: “what
would likely happen to the object denoted by the noun as a result of the action denoted by the verb?”
The answer would be an eﬀect description describing the potential eﬀect.

Eﬀect-to-Cause (Effect2Cause) questions. Given a description illustrating a state of the
world, the question is: “what action would likely cause the state of the world described in the text?”
The answer would be a verb-noun pair that can potentially serve as the cause.

3.2.3.1 Ranking Algorithm

We adopt a simple ranking algorithm to retrieve answers for a question. Given a query q (i.e., either
a verb-noun pair for Cause2Effect questions or a description of the state for Effect2Cause
questions), we rank all candidate answers a based on their similarity score with the query in the
embedded space as in the following.

\[ \mathrm{score}(q, a) = \frac{1}{|q|} \sum_{c \in q} \max_{e \in a} s(\hat{v}_c, \hat{v}_e) \tag{3.6} \]

where $s(\hat{v}_c, \hat{v}_e)$ is the cosine similarity between two words (or phrases) $c$ and $e$ in the
causality embedding space, and $|q|$ is the number of words (or phrases) in the query.
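
A minimal sketch of the ranking in Equation 3.6 is given below. It assumes that the query and each candidate answer have already been mapped to lists of vectors in the causality embedding space; the two-dimensional vectors and the names `score` and `rank` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def score(query_vecs, answer_vecs):
    """Equation 3.6: average, over query items, of the best match in the answer."""
    best = [max(cosine(c, e) for e in answer_vecs) for c in query_vecs]
    return sum(best) / len(query_vecs)

def rank(query_vecs, candidates):
    """Sort candidate names (name -> list of vectors) by descending score."""
    return sorted(candidates, key=lambda name: score(query_vecs, candidates[name]),
                  reverse=True)

# Hypothetical adapted embeddings for a two-item query and two candidate answers.
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = {"a2": [[1.0, 0.0]], "a1": [[1.0, 0.0], [0.0, 1.0]]}
ranking = rank(query, candidates)
```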


3.2.3.2 Dataset

We use our collected data (described in Section 3.2.1) for this task. The verb-noun phrases were
split into 70%, 10%, and 20% for training, development, and testing respectively. The model
parameters were selected based on the performance on the development set. Note that each unique
verb-noun pair only appears in one of the training, validation and testing sets. The goal here
is to evaluate whether the learned causality model can be applied to answer questions related to
unknown verb-noun pairs.

3.2.3.3 Models for Comparison

We compare the following models:

(1) The word embedding model (cEmbedWord) described in Section 3.2.2. The dimension of
pre-trained Word2Vec embeddings is 300. The dimension of new word embeddings is set to 100.
During training the negative sampling ratio is set to ﬁve. That is, for each positive cause-eﬀect
sample, ﬁve negative samples are created through random sampling.

(2) The phrase model (cEmbedPhrase) described in Section 3.2.2. The dimension of pre-
trained skip-thoughts embeddings is 4800. The dimension of new phrase embeddings is set to 800.
The negative sampling ratio is set to ﬁve during training.

(3) A baseline causal alignment model (cAlign). Alignment models have been successfully
used in traditional QA tasks [146, 166, 130]. Here we use IBM Model 1 [15] and GIZA++ tool
[101]. This baseline model is trained to “translate” questions to answers, using the question-answer
training set.

(4) A random baseline to show the absolute lower bound.

3.2.3.4 Evaluation Results

The above models were ﬁrst trained using the training data. As a ranked list of answers is retrieved,
we apply mean average precision (MAP) as an evaluation metric.
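
For reference, mean average precision over ranked answer lists can be computed as in the following sketch. It uses the standard definition of average precision; the toy rankings and relevance sets are made up for illustration and do not come from the collected data.

```python
def average_precision(ranked, relevant):
    """Average of precision@k over the ranks k at which a relevant item appears."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs is a list of (ranked_list, relevant_set) pairs, one per question."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two toy questions with hypothetical answer identifiers.
runs = [(["a", "b", "c"], {"a", "c"}), (["x", "y"], {"y"})]
map_score = mean_average_precision(runs)
```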

Model            Cause2Effect   Effect2Cause
cEmbedPhrase     0.7274         0.8132
cEmbedWord       0.7909         0.6478
cAlign           0.4498         0.4723
random           0.0144         0.0494

Table 3.7: MAP results for verb causality question answering task.

Table 3.7 shows the evaluation results. In general, our embedding models demonstrate good
performance, considering that all the verb-noun pairs in the test data have never been seen in the
training and validation data before. Both models signiﬁcantly outperform the baseline (cAlign).
This suggests that embedding models have a good potential in modeling physical causality knowl-
edge.

CHAPTER 4

PHYSICAL CAUSALITY MODELING FOR LANGUAGE GROUNDING TASK

4.1 Introduction

Although recent years have seen an increasing amount of work on grounding language to
perception [171, 150, 79, 98, 78], no previous work has investigated the link between physical
causality denoted by action verbs and the change of state visually perceived. In this chapter, we
intend to address this limitation and examine whether the causality denoted by action verbs can
provide top-down information to guide visual processing and improve language grounding.

In the language grounding task, the input is parallel language and visual data, and the goal is to
ground language components to entities in the visual data. Our expectation is that the categorization
of physical causality can provide guidance for visual processing: once a parallel language and visual
data about an action is given, the potential causality of the verb or the verb-noun pair can trigger
some visual detectors that mainly focus on the potential state changes caused by this action. And
applying these visual detectors to the visual data can potentially improve the performance of
grounded language understanding.

Based on the categorization of physical causality attributes, we designed a set of change-of-state
detectors to detect the corresponding changes from video data. We further applied two approaches,
a knowledge-driven approach and a learning-based approach, to incorporate causality modeling
in grounding. The empirical results have demonstrated that both of these approaches achieve
signiﬁcantly better performance compared to previous approaches. Moreover, we have shown that
causality knowledge for verbs can be generalized to novel verbs through simple learned models.

This chapter has been published in the following paper: Qiaozi Gao, Malcolm Doering, Shao-
hua Yang, and Joyce Chai. Physical causality of action verbs in grounded language understanding.
In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), vol. 1, pp. 1814-1824. 2016.


4.2 Visual Detectors based on Physical Causality

An important motivation of modeling physical causality is to provide guidance for visual pro-
cessing. Our hypothesis is that once a language description is given together with its corresponding
visual scene, potential causality of verbs or verb-noun pairs can trigger some visual detectors
associated with the scene. This can potentially improve grounded language understanding (e.g.,
grounding nouns to objects in the scene). Next we give a detailed account on these visual detectors
and their role in grounded language understanding.

The changes of state associated with the eighteen attributes can be detected from the physical
world using various sensors. In this work, we only focus on attributes that can be detected by visual
perception. More speciﬁcally, we chose the subset: Attachment, NumberOfPieces, Presence,
Visibility, Location, Size. They are chosen because: 1) according to the pilot study, they are highly
correlated with our selected verbs; and 2) they are relatively easy to detect visually.

Corresponding to these causality attributes, we deﬁned a set of rule-based detectors as shown
in Table 4.1. These in fact are very simple detectors, which consist of four major detectors and
a reﬁned set that distinguishes directions of state change. These visual detectors are speciﬁcally
applied to the potential objects that may serve as patient for a verb to identify whether certain
changes of state occur to these objects in the visual scene.

Attribute        Rule-based Detector                          Refined Rule-based Detector
Attachment /     Multiple object tracks merge into one,       Multiple tracks merge into one.
NumberPieces     or one object track breaks into multiple.    One track breaks into multiple.
Presence /       Object track appears or disappears.          Object track appears.
Visibility                                                    Object track disappears.
Location         Object’s final location is different from    Location shifts upwards.
                 the initial location.                        Location shifts downwards.
                                                              Location shifts rightwards.
                                                              Location shifts leftwards.
Size             Object’s x-axis length or y-axis length      Object’s x-axis length increases.
                 is different from the initial values.        Object’s x-axis length decreases.
                                                              Object’s y-axis length increases.
                                                              Object’s y-axis length decreases.

Table 4.1: Causality detectors applied to patient of a verb.
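
To make the refined Location and Size detectors in Table 4.1 concrete, the sketch below applies them to a track given as a list of bounding boxes. Several details are assumptions made for illustration: boxes are `(x_min, y_min, x_max, y_max)` tuples in image coordinates with y growing downward, and the pixel tolerance `tol` is a hypothetical parameter that is not taken from the original work.

```python
def center(box):
    """Center point of an (x_min, y_min, x_max, y_max) bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

def location_changes(track, tol=5.0):
    """Refined Location detectors: compare the first and last box centers."""
    (x0, y0), (x1, y1) = center(track[0]), center(track[-1])
    return {
        "rightwards": x1 - x0 > tol,
        "leftwards": x0 - x1 > tol,
        "downwards": y1 - y0 > tol,  # assumes image y grows downward
        "upwards": y0 - y1 > tol,
    }

def size_changes(track, tol=5.0):
    """Refined Size detectors: compare the first and last box side lengths."""
    w0, h0 = track[0][2] - track[0][0], track[0][3] - track[0][1]
    w1, h1 = track[-1][2] - track[-1][0], track[-1][3] - track[-1][1]
    return {
        "x_increases": w1 - w0 > tol,
        "x_decreases": w0 - w1 > tol,
        "y_increases": h1 - h0 > tol,
        "y_decreases": h0 - h1 > tol,
    }

# A hypothetical track that slides to the right without changing size.
moved = [(0, 0, 10, 10), (10, 0, 20, 10), (20, 0, 30, 10)]
# A hypothetical track whose box widens along the x-axis.
widened = [(0, 0, 10, 10), (0, 0, 30, 10)]
```

The boolean outputs of such detectors can be collected into a detection vector for each candidate track.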


4.3 Verb Causality in Language Grounding

In this section, we demonstrate how verb causality modeling and visual detectors can be used
together for a language grounding task. As shown in Figure 4.1, given a video clip V of human
action and a parallel sentence S describing the action, our goal is to ground diﬀerent semantic roles
of the verb (e.g., get) to objects in the video. This is similar to the grounded semantic role labeling
task [162]. Here, we focus on a set of four semantic roles {agent, patient, source, destination}. We
also assume that we have object and hand tracking results from video data. Each object in the video
is represented by a track, which is a series of bounding boxes across video frames. Thus, given a
video clip and a parallel sentence, the task is to ground semantic roles of the verb λ1, λ2, . . . , λk to
object (or hand) tracks γ1, γ2, . . . , γn in the video.¹ We applied two approaches to this problem.

4.3.1 Knowledge-driven Approach

We intend to establish that the knowledge of physical causality for action verbs can be acquired
directly from the crowd and such knowledge can be coupled with visual detectors for grounded
language understanding.

4.3.1.1 Acquiring Knowledge

To acquire knowledge of verb causality, we collected a larger dataset of causality annotations
based on sentences from the TACoS Multilevel corpus [121], through crowd-sourcing on Amazon
Mechanical Turk. Annotators were shown a sentence containing a verb-patient pair (e.g., “The
person chops the cucumber into slices on the cutting board”) and were asked to annotate the
change of state that occurred to the patient as a result of the verb by choosing up to three options
from the 18 causality attributes. Each sentence was annotated by three different annotators.

¹ For manipulation actions, the agent is almost always one of the human’s hands (or both hands).
So we constrain the grounding of the agent role to hand tracks, and constrain the grounding of the
other roles to object tracks.


Figure 4.1: Grounding semantic roles of the verb get in the sentence: the man gets a knife from the
drawer.

This dataset contains 4391 sentences, with 178 verbs, 260 nouns, and 1624 verb-noun pairs.
After summarizing the annotations from the three annotators, each sentence is represented by
an 18-dimensional causality vector. In the vector, an element is 1 if at least two annotators labeled
the corresponding causality attribute as true, and 0 otherwise. For 83% of the annotated sentences, at
least one causality attribute was agreed on by at least two annotators.

From the causality annotation data, we can extract a verb causality vector c(v) for each verb v by
averaging the causality vectors of all sentences that contain this verb.
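
The aggregation just described (majority vote per sentence, then averaging per verb) can be sketched as follows. For brevity the example uses three attributes instead of eighteen, and the annotations are invented; `sentence_vector` and `verb_vector` are names introduced for this illustration.

```python
def sentence_vector(annotations):
    """1 where at least two annotators selected the attribute, otherwise 0.

    annotations is a list of binary lists, one per annotator.
    """
    num_attributes = len(annotations[0])
    return [1 if sum(a[i] for a in annotations) >= 2 else 0
            for i in range(num_attributes)]

def verb_vector(sentence_vectors):
    """Average the causality vectors of all sentences containing the verb."""
    n = len(sentence_vectors)
    return [sum(column) / n for column in zip(*sentence_vectors)]

# Hypothetical annotations from three annotators for two sentences of one verb.
s1 = sentence_vector([[1, 0, 1], [1, 0, 0], [0, 1, 1]])
s2 = sentence_vector([[1, 0, 0], [1, 0, 0], [0, 0, 1]])
c_v = verb_vector([s1, s2])
```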

4.3.1.2 Applying Knowledge

[Figure 4.1 annotations: Language description: “The man gets a knife from the drawer.” Verb: “get”.
Agent: grounded to the hand in the green box. Patient: “knife”, grounded to the object in the red box.
Source: “drawer”, grounded to the object in the blue box.]

Since the collected causality knowledge was only for the patient, we first look at the grounding of
the patient. Given a sentence containing a verb v and its patient, we want to ground the patient to one
of the object tracks in the video clip. Suppose we have the causality knowledge, i.e., c(v), for the
verb. For each candidate track in the video, we can generate a causality detection vector d(γi)
using the pre-defined causality detectors. A straightforward way is to ground the patient to the
object track whose causality detection results have the best coherence with the causality knowledge
of the verb. The coherence is measured by the cosine similarity between c(v) and d(γi).²
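
The coherence-based selection of the patient track can be sketched as below. The causality vector and the detection vectors are hypothetical three-dimensional examples, and `ground_patient` is a name introduced only for this illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def ground_patient(verb_causality, track_detections):
    """Return the track whose detection vector d is most similar to c(v)."""
    return max(track_detections,
               key=lambda name: cosine(verb_causality, track_detections[name]))

# Hypothetical c(v) and detection vectors d for three candidate tracks.
c_v = [0.9, 0.1, 0.6]
detections = {
    "knife": [1, 0, 1],
    "drawer": [0, 0, 0],    # no change of state detected
    "cucumber": [0, 1, 0],
}
patient_track = ground_patient(c_v, detections)
```

A track with no detected change of state gets similarity 0 in this sketch, so it is selected only if no other candidate matches the verb's causality at all.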

Semantic Role    Rule-based Detector
Source           Patient track appears within its bounding box.
                 Its track is overlapping with the patient track at the initial frame.
Destination      Patient track disappears within its bounding box.
                 Its track is overlapping with the patient track at the final frame.
Agent            Its track is overlapping with the patient track when the patient track
                 appears or disappears.
                 Its track is overlapping with the patient track when the patient track
                 starts moving or stops moving.

Table 4.2: Causality detectors for grounding source, destination, and agent.

Since objects in other semantic roles often have relations with the patient during the action, once
we have grounded the patient, we can use it as an anchor point to ground the other three semantic
roles. To do this, we deﬁne two new detectors for grounding each role as shown in Table 4.2. These
detectors are designed using some common sense knowledge, e.g., source is likely to be the initial
location of the patient; destination is likely to be the ﬁnal location of the patient; agent is likely
to be the hand that touches the patient. With these new detectors, we simply ground a role to the
object (or hand) track that has the largest number of positive detections from the corresponding
detectors.
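
As an illustration of the counting rule, the sketch below implements simplified versions of the two source detectors and grounds the source to the track with the most positive detections. The track representation is an assumption: each track is a frame-aligned list of `(x_min, y_min, x_max, y_max)` boxes, with `None` where the object is not visible. All function names are hypothetical.

```python
def boxes_overlap(a, b):
    """True if two axis-aligned boxes share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def contains(outer, inner):
    """True if the inner box lies entirely within the outer box."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def first_visible(track):
    """Index of the first frame in which the track has a box."""
    return next(i for i, box in enumerate(track) if box is not None)

def source_detections(candidate, patient):
    """Simplified versions of the two source detectors."""
    t = first_visible(patient)
    appears_inside = (t > 0 and candidate[t] is not None
                      and contains(candidate[t], patient[t]))
    overlaps_at_start = (candidate[0] is not None and patient[0] is not None
                         and boxes_overlap(candidate[0], patient[0]))
    return [appears_inside, overlaps_at_start]

def ground_role(detections_by_track):
    """Pick the track with the largest number of positive detections."""
    return max(detections_by_track, key=lambda name: sum(detections_by_track[name]))

# Hypothetical scene: the knife first becomes visible inside the drawer's box.
knife = [None, (12, 12, 18, 18), (40, 40, 46, 46)]
tracks = {"drawer": [(10, 10, 30, 30)] * 3, "board": [(35, 35, 60, 60)] * 3}
source = ground_role({name: source_detections(t, knife) for name, t in tracks.items()})
```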

It is worth noting that although currently we only acquired knowledge for verbs that appear in
the cooking domain, the same approach can be extended to verbs in other domains. The detectors
associated with attributes are expected to remain the same. The significance of this
knowledge-driven method is that, once you have the causality knowledge of a verb, it can be directly
applied to any domain without additional training.

² In the case that not every causality attribute has a corresponding detector, we need to first
condense c(v) to the same dimensionality as d(γi).

Figure 4.2: The CRF factor graph of the sentence: the man gets a knife from the drawer.

4.3.2 Learning-based Approach

Our second approach is based on learning from training data. A key requirement for this approach
is the availability of annotated data where the arguments of a verb are already correctly grounded
to the objects in the visual scene. Then we can learn the association between detected causality
attributes and verbs. We use Conditional Random Field (CRF) to model the semantic role grounding
problem. In this approach, causality detection results are used as features in the model.

An example CRF factor graph is shown in Figure 4.2. The structure of CRF graph is created
based on the extracted semantic roles, which already abstracts away syntactic variations such
as active/passive constructions. This CRF model is similar to the ones in [147] and [162], where
φ1, . . . , φ4 are binary random variables, indicating whether the grounding is correct. In the learning
stage, we use the following objective function:

\[ p(\Phi \mid \lambda_1, \ldots, \lambda_k, \gamma_1, \ldots, \gamma_k, v) = \frac{1}{Z} \prod_i \Psi_i(\phi_i, \lambda_i, \gamma_1, \ldots, \gamma_k, v) \tag{4.1} \]

where Φ is the binary random vector [φ1, . . . , φk], and v is the verb. Z is the normalization constant.
Ψi is the potential function that takes the following log-linear form:

\[ \Psi_i(\phi_i, \lambda_i, \Gamma, v) = \exp\left( \sum_l w_l f_l(\phi_i, \lambda_i, \Gamma, v) \right) \tag{4.2} \]

where fl is a feature function, wl is the feature weight to be learned, and Γ = [γ1, . . . , γk] are the
groundings. In our model, we use the following features:

1. Joint features between a track label of γi and a word occurrence in λi.

2. Joint features between each of the causality detection results and a verb v. Causality detection
includes all the detectors in Table 4.1 and Table 4.2. Note that the causality detectors shown
in Table 4.2 capture relations between groundings of diﬀerent semantic roles.

During parameter learning, we use gradient ascent with L2 regularization.
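
The log-linear potential of Equation 4.2 and the unnormalized product in Equation 4.1 can be illustrated numerically. The sketch assumes the feature values have already been computed for a given configuration; the weights and feature vectors are invented, and neither the normalization constant Z nor the learning procedure is shown.

```python
import math

def potential(weights, features):
    """Equation 4.2: exponential of the weighted sum of feature values."""
    return math.exp(sum(w * f for w, f in zip(weights, features)))

def unnormalized_score(weights, features_per_role):
    """Product of the per-role potentials from Equation 4.1, without the 1/Z factor."""
    score = 1.0
    for features in features_per_role:
        score *= potential(weights, features)
    return score

# Hypothetical weights and binary feature vectors for two semantic roles.
weights = [0.5, -1.0]
psi = potential(weights, [1, 1])
joint = unnormalized_score(weights, [[1, 0], [0, 1]])
```

Because the potentials are exponentials, their product equals the exponential of the summed weighted features, so comparing candidate groundings can be done in log space.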

Compared to [147] and [162], a key diﬀerence in our model is the incorporation of causality
detectors. These previous works [147, 162] apply geometric features, for example, to capture
relations, distance, and relative directions between grounding objects. These geometric features
can be noisy. In our model, features based on causality detectors are motivated and informed by
the underlying causality models for corresponding action verbs.

In the inference step, we want to ﬁnd the most probable groundings. Given a video clip and its
parallel sentence, we ﬁx the Φ to be true, and search for groundings γ1, . . . , γk that maximize the
probability as in Equation 4.1. To reduce the search space we apply beam search to ground in the
following order: patient, source, destination, agent.
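
The ordered beam search can be sketched generically as follows. The scoring function `toy_log_potential` is purely hypothetical and stands in for the learned log-potentials; the beam width, the role and track names, and the omission of the hand-track constraint on the agent role are likewise simplifications for illustration.

```python
def beam_search(roles, tracks, log_potential, beam_width=2):
    """Assign one track to each role in the given order, keeping only the
    beam_width highest-scoring partial assignments after each role."""
    beam = [({}, 0.0)]
    for role in roles:
        expanded = []
        for assignment, score in beam:
            for track in tracks:
                new_assignment = {**assignment, role: track}
                new_score = score + log_potential(role, track, assignment)
                expanded.append((new_assignment, new_score))
        expanded.sort(key=lambda item: item[1], reverse=True)
        beam = expanded[:beam_width]
    return beam[0]

def toy_log_potential(role, track, partial):
    """Hypothetical scores: the source depends on the already grounded patient."""
    if role == "patient":
        return 1.0 if track == "knife" else 0.0
    if role == "source":
        return 1.0 if track == "drawer" and partial.get("patient") == "knife" else 0.0
    return 0.0

best_assignment, best_score = beam_search(
    ["patient", "source"], ["knife", "drawer"], toy_log_potential)
```

Grounding the patient first lets later roles condition on it, which mirrors the role of the patient as an anchor in the detectors of Table 4.2.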

4.3.3 Experiments and Results

We conducted our experiments using the dataset from [162]. This dataset was developed from a
subset of the TACoS corpus [117]. It contains parallel video clips and natural language descriptions.


The videos capture humans performing two cooking tasks, “cutting cucumber” and “cutting bread”.
Each cooking task has 5 different people performing it, and all the videos were split into pairs of
video clips and corresponding sentences. For each video clip, objects are annotated with bounding
boxes, tracks, and labels (e.g., “cucumber”, “cutting board”). For each sentence, the semantic
roles of a verb are extracted using PropBank [66] deﬁnitions and each of them is annotated with
the ground truth groundings in terms of the object tracks in the corresponding video clip. We
selected the 11 most frequent verbs (get, take, wash, cut, rinse, slice, place, peel, put, remove,
open) and the 4 most frequent explicit semantic roles (agent, patient, source, destination) in this
evaluation. In total, this dataset includes 977 pairs of video clips and corresponding sentences, and
1096 verb-patient occurrences.

We compare our knowledge-driven approach (VC-Knowledge) and learning-based approach
(VC-Learning) with the following two baselines.

Label Matching. This method simply grounds the semantic role to the track whose label
matches the word phrase. If there are multiple matching tracks, it will randomly choose one of
them. If there is no matching track, it will randomly select one from all the tracks.

Yang et al., 2016. This work studies grounded semantic role labeling. The evaluation data
from this work is used in this study. It is a natural baseline for comparison.

To evaluate the learning-based approaches such as VC-Learning and (Yang, et al., 2016), 75%
of video clips with corresponding sentences were randomly sampled as the training set. The
remaining 25% were used as the test set. For approaches which do not need training such as Label
Matching and VC-Knowledge, we used the same test set to report their results.

The results of the patient role grounding for each verb are shown in Table 4.3. The results
of grounding all four semantic roles are shown in Table 4.4. The scores in bold are statistically
signiﬁcant (p < 0.05) compared to the Label Matching method. The scores with an asterisk (∗) are
statistically signiﬁcant (p < 0.05) compared to (Yang et al., 2016).

As it can be difficult to obtain labels for the tracks, especially when the vision system encounters
novel objects, we further conducted several experiments assuming we do not know the labels for
the object tracks. In this case, only geometric information of tracked objects is available. Table 4.3
and Table 4.4 also include these results.

                    All    take   put    get    cut    open   wash   slice  rinse  place  peel   remove
# Instances         279    58     15     47     29     6      28     13     29     29     10     15

With Ground-truth Track Labels
Label Matching      67.7   70.7   46.7   72.3   69.0   16.7   85.7   69.2   82.8   37.9   90.0   60.0
Yang et al., 2016   84.6   93.2   91.7   93.6   77.8   80.0   93.5   86.7   90.0   66.7   80.0   38.9
VC-Knowledge        89.6∗  94.8   73.3   100∗   93.1   83.3   100    92.3   96.6   58.6   90.0   73.3∗
VC-Learning         90.3∗  94.8   86.7   100∗   93.1   83.3   89.3   92.3   96.6   75.9   80.0   66.7∗

Without Track Labels
Label Matching      9.0    12.1   13.3   2.1    10.3   16.7   3.6    7.7    10.3   10.3   20.0   6.7
Yang et al., 2016   24.5   11.9   8.3    17.0   40.0   10.0   29.0   40.0   50.0   0      60.0   11.1
VC-Knowledge        60.2∗  82.8∗  60.0∗  87.2∗  41.4   50.0   39.3   46.2   58.6   48.3∗  10.0   40.0
VC-Learning         71.7∗  91.4∗  33.3   87.2∗  51.7   83.3∗  46.4   84.6∗  72.4   65.5∗  80.0   60.0∗

Table 4.3: Grounding accuracy on patient role

                      Overall  Agent  Patient  Source  Destination
Number of Instances   644      279    279      51      35

With Ground-truth Track Labels
Label Matching        66.3     68.5   67.7     41.2    74.3
Yang et al., 2016     84.2     86.4   84.6     72.6    81.6
VC-Knowledge          86.8     89.3   89.6∗    60.8    82.9
VC-Learning           88.2∗    88.2   90.3∗    76.5    88.6

Without Track Labels
Label Matching        33.5     66.7   9.0      7.8     2.9
Yang et al., 2016     48.2     86.1   24.5     15.7    13.2
VC-Knowledge          69.9∗    89.6   60.2∗    45.1∗   25.7
VC-Learning           75.0∗    87.1   71.7∗    41.2∗   54.3∗

Table 4.4: Grounding accuracy on four semantic roles

From the grounding results, we can see that causality modeling is very effective
in grounding semantic roles. First of all, both the knowledge-driven approach and the
learning-based approach outperform the two baselines. In particular, our knowledge-driven approach
(VC-Knowledge) even outperforms the trained model (Yang et al., 2016). Our learning-based approach
(VC-Learning) achieves the best overall performance. In the learning-based approach, causality
detection results can be seen as a set of intermediate visual features. The reason that our
learning-based approach significantly outperforms the similar model in (Yang et al., 2016) is that the
causality categorization provides a good guideline for designing intermediate visual features. These
causality detectors focus on the changes of state of objects, which are more robust than the geometric
features used in (Yang et al., 2016).

In the setting of no object recognition labels, VC-Knowledge and VC-Learning also generate
signiﬁcantly better grounding accuracy than the two baselines. This once again demonstrates the
advantage of using causality detection results as intermediate visual features. All these results
illustrate the potential of causality modeling for grounded language understanding.

The results in Table 4.4 also indicate that grounding source or destination is more diﬃcult than
grounding patient or agent. One reason could be that source and destination do not exhibit obvious
change of state as a result of action, so their groundings usually depend on the correct grounding
of other roles such as patient.

Since automated tracking for this TACoS dataset is notably diﬃcult due to the complexity of
the scene and the lack of depth information, our current results are based on annotated tracks. But
object tracking algorithms have made signiﬁcant progress in recent years [164, 91]. We intend to
apply our algorithms with automated tracking on real scenes in the future.

4.4 Causality Prediction for New Verbs

While various methods can be used to acquire causality knowledge for verbs, it may be the
case that during language grounding, we do not know the causality knowledge for every verb.
Furthermore, manual annotation/acquisition of causality knowledge for all verbs can be time-
consuming. In this section, we demonstrate that the existing causality knowledge for some seed
verbs can be used to predict causality for new verbs of which we have no knowledge.

We formulate the problem as follows. Suppose we have causality knowledge for a set of seed
verbs as training data. Given a new verb, whose causality knowledge is not known, our goal is to
predict the causality attributes associated with this new verb. Although the causality knowledge
is unknown, it is easy to compute a Distributional Semantic Model (DSM) vector for this verb. Our
goal is then to find the most probable causality vector $c'$ given that vector:

\[ \arg\max_{c'} \; p(c' \mid \mathbf{v}) \tag{4.3} \]

where $\mathbf{v}$ is the DSM vector for the verb v. The usage of DSM vectors is based on our hypothesis
that the textual context of a verb can reveal its possible causality information. For example, the
contextual words “pieces” and “halves” may indicate the CoS attribute “NumberOfPieces” for the
verb “cut”.

We simplify the problem by assuming that the causality vector c′ takes binary values, and also
assuming the independence between diﬀerent causality attributes. Thus, we can formulate this task
as a group of binary classiﬁcation problems: predicting whether a particular causality attribute
is positive or negative given the DSM vector of a verb. We apply logistic regression to train a
separate classiﬁer for each attribute. Speciﬁcally, for the features of a verb, we use the Distributional
Memory (typeDM) [10] vector. The class label indicates whether the corresponding attribute is
associated with the verb.
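
The per-attribute formulation above can be illustrated with a minimal sketch. It is not the implementation used in our experiments: the two-dimensional "DSM" vectors, the seed labels, and the helper names (train_logistic, predict_causality) are all made up for illustration, and plain batch gradient descent stands in for whatever logistic-regression solver is actually used. Under the independence assumption, maximizing p(c′ | v) reduces to thresholding each attribute's probability separately.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit one binary logistic-regression classifier by batch gradient descent."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            for j in range(dim):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def predict_causality(models, v):
    """With independent attributes, the arg max over c' is taken attribute by attribute."""
    prediction = {}
    for attr, (w, b) in models.items():
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, v)) + b)
        prediction[attr] = 1 if p >= 0.5 else 0
    return prediction

# Toy seed verbs with invented 2-d vectors (real typeDM vectors are much larger).
seed_vectors = {"cut": [1.0, 0.1], "slice": [0.9, 0.2],
                "take": [0.1, 1.0], "put": [0.2, 0.9]}
# Invented binary labels: is the attribute associated with the verb?
seed_labels = {
    "NumberOfPieces": {"cut": 1, "slice": 1, "take": 0, "put": 0},
    "Location":       {"cut": 0, "slice": 0, "take": 1, "put": 1},
}

verbs = sorted(seed_vectors)
X = [seed_vectors[v] for v in verbs]
models = {attr: train_logistic(X, [labels[v] for v in verbs])
          for attr, labels in seed_labels.items()}

# Hypothetical vector for a new verb that is distributionally close to "cut".
new_verb_prediction = predict_causality(models, [0.95, 0.15])
```

One classifier per attribute keeps each decision simple, at the cost of ignoring correlations between attributes.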

                  All    take   put    get    cut    open   wash   slice  rinse  place  peel   remove
VC-Knowledge      89.6   94.8   73.3   100    93.1   83.3   100    92.3   96.6   58.6   90.0   73.3
P-VC-Knowledge    89.9   96.6   73.3   100    96.6   100    66.7   92.3   96.6   65.5   90.0   60.0

Table 4.5: Grounding accuracy on patient role using predicted causality knowledge.

In our experiment we chose six attributes to study: Attachment, NumberOfPieces, Presence,
Visibility, Location, and Size. For each one of the eleven verbs in the grounding task, we predict
its causality knowledge using classiﬁers trained on all other verbs (i.e., 177 verbs in training set).
To evaluate the predicted causality vectors, we applied them in the knowledge-driven approach
(P-VC-Knowledge). Grounding results were compared with the same method using the causality
knowledge collected via crowd-sourcing. Table 4.5 shows the grounding accuracy on the patient
role for each verb. For most verbs, using the predicted knowledge achieves very similar performance
compared to using the collected knowledge. The overall grounding accuracy of using the predicted
knowledge on all four semantic roles is only 0.3% lower than using the collected knowledge. This
result demonstrates that physical causality of action verbs, as part of verb semantics, can be learned
through Distributional Semantics.

4.5 Conclusion

In this chapter, we have applied the category-based causality modeling to the task of grounding
semantic roles to the environment using two approaches: a knowledge-based approach and a
learning-based approach.

Our empirical evaluations have shown encouraging results for both approaches. When annotated
data is available (in which semantic roles of verbs are grounded to physical objects), the learning-
based approach, which learns the associations between verbs and causality detectors, achieves
the best overall performance. On the other hand, the knowledge-based approach also achieves
competitive performance (even better than previous learned models), without any training. The
most exciting aspect about the knowledge-based approach is that causality knowledge for verbs
can be acquired from humans (e.g., through crowd-sourcing) and generalized to novel verbs about
which we have not yet acquired causality knowledge.

In the future, we plan to build a resource for modeling physical causality for action verbs. As
object recognition and tracking are undergoing significant advancements in the computer vision field,
such a resource together with causality detectors can be immediately applied to any application
that requires grounded language understanding.

CHAPTER 5

VISUAL CAUSALITY REASONING

5.1 Introduction

We humans rely on a vast amount of commonsense causality knowledge to understand and
reason about the changing world states caused by actions. However, machines do not have such
knowledge, which hinders their capability to reason, learn, and perform actions. To address this
problem, we introduce a new task on naive physical action-eﬀect prediction, which models the
relations between concrete actions (expressed in the form of verb-noun pairs) and their eﬀects on
the state of the physical world as depicted by images. This task includes both cause prediction:
given an image which describes a state of the world, identify the most likely action (in the form of
a verb-noun pair, from a set of candidates) that can result in that state; and eﬀect prediction: given
an action in the form of a verb-noun pair, identify images (from a set of candidates) that depict the
most likely eﬀects on the state of the world caused by that action.

We develop an approach that utilizes natural language effect descriptions as side knowledge to
help acquire web image data and bootstrap training. The empirical results show that, using
a simple bootstrapping strategy, our approach can combine the noisy web data with a small number
of seed examples to improve action-effect prediction. In addition, for a new verb-noun pair, our
approach can infer its effect descriptions and predict action-effect relations based on only a few
image examples.

This chapter has been published in the following paper: Qiaozi Gao, Shaohua Yang, Joyce
Chai, and Lucy Vanderwende. What action causes this? Towards naive physical action-effect
prediction. In Proceedings of the 56th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pp. 934-945. 2018.

5.2 Action-Eﬀect Data Collection

First we collected a dataset to support the investigation on physical action-eﬀect prediction. This
dataset consists of actions expressed in the form of verb-noun pairs, eﬀects of actions described
in language, and eﬀects of actions depicted in images. Note that, as we would like to have a wide
range of possible eﬀects, language data and image data are collected separately.

5.2.1 Actions (verb-noun pairs)

We selected 40 nouns that represent everyday life objects, most of which are from the COCO
dataset [77], with a combination of food, kitchen ware, furniture, indoor objects, and outdoor
objects. We also identified the 3000 most frequently used verbs from the Google Syntactic N-gram
dataset [41] (Verbargs set). We then extracted the most frequent verb-noun pairs that contain a verb
from the top 3000 verbs and a noun from the 40 nouns, and that hold a dobj (i.e., direct object)
dependency relation. This resulted in 6573 candidate verb-noun pairs. As changes to an object can occur at
various dimensions (e.g., size, color, location, attachment, etc.), we manually selected a subset of
verb-noun pairs based on the following criteria: (1) changes to the objects are visible (as opposed
to other types such as temperature change, etc.); and (2) changes reﬂect one particular dimension
as opposed to multiple dimensions (as entailed by high-level actions such as “cook a meal”, which
correspond to multiple dimensions of change and can be further decomposed into basic actions).
As a result, we created a subset of 140 verb-noun pairs (containing 62 unique verbs and 39 unique
nouns) for our investigation.

5.2.2 Eﬀects Described in Language

The basic knowledge about physical action-effect is so fundamental and widely shared among humans
that it is often presupposed in our communication and not explicitly stated. Thus, it is difficult to extract
naive action-effect relations from existing textual data (e.g., the web). This kind of knowledge is also
not readily available in commonsense knowledge bases such as ConceptNet [143]. To overcome
this problem, we applied crowd-sourcing (Amazon Mechanical Turk) and collected a dataset of
language descriptions describing effects for each of the 140 verb-noun pairs. The annotators were
shown a verb-noun pair, and were asked to use their own words and imaginations to describe what
changes might occur to the corresponding object as a result of the action. Each verb-noun pair was
annotated by 10 different annotators, which led to a total of 1400 effect descriptions. Table 5.1
shows some examples of collected effect descriptions. These effect language descriptions allow us
to derive seed effect knowledge in a symbolic form.

Action          Effect Text
ignite paper    The paper is on fire.
soak shirt      The shirt is thoroughly wet.
fry potato      The potatoes become crisp and golden.
stain shirt     There is a visible mark on the shirt.

Table 5.1: Example action and effect text from our collected data.

5.2.3 Eﬀects Depicted in Images

For each action, three students searched the web and collected a set of images depicting potential
eﬀects. Speciﬁcally, given a verb-noun pair, each of the three students was asked to collect at
least 5 positive images and 5 negative images. Positive images are those deemed to capture the
resulting world state of the action. And negative images are those deemed to capture some state
of the related object (i.e., the nouns in the verb-noun pairs), but are not the resulting state of the
corresponding action. Then, each student was also asked to provide positive or negative labels for
the images collected by the other two students. As a result each image has three positive/negative
labels. We keep only the images on whose labels all three students agree. In total, the dataset
contains 4163 images. On average, each action has 15 positive images, and 15 negative images.
Figure 5.1 shows several examples of positive images and negative images of the action peel-orange.
The positive images show an orange in a peeled state, while the negative images show oranges in
diﬀerent states (orange as a whole, orange slices, orange juice, etc.).

Figure 5.1: Positive images (top row) and negative images (bottom row) of the action peel-orange.

5.3 Action-Eﬀect Prediction

Action-effect prediction aims to connect actions (as causes) to the effects of actions. Specifically,
given an image which depicts a state of the world, our task is to predict what concrete actions
could cause the state of the world. This task is diﬀerent from traditional action recognition as the
underlying actions (e.g., human body posture/movement) are not captured by the images. In this
regard, it is also diﬀerent from image description generation.

We frame the problem as a few-shot learning task, by only providing a few human-labelled
images for each action at the training stage. Given the very limited training data, we attempt to
make use of web-search images. Web search has been adopted by previous computer vision studies
to acquire training data [32, 61, 11, 103]. Compared with human annotations, web-search comes
at a much lower cost, but with a trade-oﬀ of poor data quality. To address this issue, we apply a
bootstrapping approach that aims to handle data with noisy labels.

The first question is what search terms should be used for image search. There are two options.
The first option is to directly use the action terms (i.e., verb-noun pairs) to search images;
the downloaded web images are referred to as action web images. However, as the desired images should
depict effects of an action, terms describing effects become a natural choice. The second
option is therefore to use the key phrases extracted from language effect descriptions to search the web. The
downloaded web images are referred to as effect web images.

Example patterns                            Extracted Effect Phrases (bold)
VP with a verb ∈ {be, become, turn, get}    The ship is destroyed.
VP + PRT                                    The wall is knocked off.
VP + ADVP                                   The door swings forward.
ADJP                                        The window would begin to get clean.
PP + NP                                     The eggs are divided into whites and yolks.

Table 5.2: Example patterns that are used to extract effect phrases (bold) from sample sentences.

5.3.1 Extracting Eﬀect Phrases from Language Data

We first apply chunking (shallow parsing) using the SENNA software [20] to break an effect
description into phrases such as noun phrases (NP), verb phrases (VP), prepositional phrases (PP),
adjective phrases (ADJP), adverb phrases (ADVP), etc. After some examination, we found that most of the
effect descriptions follow simple syntactic patterns. For a verb-noun pair, around 80% of its effect
descriptions start with the same noun as the subject. In an effect description, the change of state
associated with the noun is mainly captured by some key phrases. For example, an adjective phrase
usually describes a physical state; verbs like be, become, turn, get often indicate a description of
a change of state. Based on these observations, we defined a set of patterns to identify phrases that
describe physical states of an object. In total, 1997 effect phrases were extracted from the language
data. Table 5.2 shows some example patterns and example effect phrases that are extracted.
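
As a rough illustration of how such patterns can be applied, the sketch below matches the five patterns of Table 5.2 against an already chunked sentence. The chunk representation (a list of (tag, text) pairs), the list of inflected state verbs, the function name, and the chunk boundaries in the examples are assumptions made for this sketch; they are not the SENNA output format and not necessarily the exact rules used to obtain the 1997 phrases.

```python
# Inflected forms of be / become / turn / get (an illustrative, incomplete list).
STATE_VERBS = {"be", "is", "are", "was", "were", "been", "being",
               "become", "becomes", "became",
               "turn", "turns", "turned",
               "get", "gets", "got"}

def extract_effect_phrases(chunks):
    """Return candidate effect phrases from one chunked effect description.

    chunks: list of (tag, text) pairs in sentence order,
            e.g. [("NP", "The wall"), ("VP", "is knocked"), ("PRT", "off")].
    """
    phrases = []
    for i, (tag, text) in enumerate(chunks):
        nxt = chunks[i + 1] if i + 1 < len(chunks) else None
        if tag == "VP":
            if nxt is not None and nxt[0] in ("PRT", "ADVP"):
                # Patterns "VP + PRT" and "VP + ADVP".
                phrases.append(text + " " + nxt[1])
            elif any(word in STATE_VERBS for word in text.lower().split()):
                # Pattern "VP with a verb in {be, become, turn, get}".
                phrases.append(text)
        elif tag == "ADJP":
            # Pattern "ADJP": an adjective phrase usually names a state.
            phrases.append(text)
        elif tag == "PP" and nxt is not None and nxt[0] == "NP":
            # Pattern "PP + NP".
            phrases.append(text + " " + nxt[1])
    return phrases

# Hand-chunked versions of sentences from Table 5.2 (chunk boundaries assumed).
ship = [("NP", "The ship"), ("VP", "is destroyed")]
wall = [("NP", "The wall"), ("VP", "is knocked"), ("PRT", "off")]
eggs = [("NP", "The eggs"), ("VP", "are divided"),
        ("PP", "into"), ("NP", "whites and yolks")]

ship_phrases = extract_effect_phrases(ship)
wall_phrases = extract_effect_phrases(wall)
eggs_phrases = extract_effect_phrases(eggs)
```

A single sentence may yield several phrases (as in the eggs example), which is compatible with extracting more phrases than there are descriptions.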

5.3.2 Downloading Web Images

The purpose of querying a search engine is to retrieve images of objects in certain effect states. To
form image search keywords, the effect phrases are concatenated with the corresponding noun
phrases, for example, “apple + into thin pieces”. The image search results are downloaded and used
as supplementary training data for the action-effect prediction models. However, web images can
be noisy. First of all, not all of the automatically extracted effect phrases describe visible states of
objects. Even if a phrase represents visible object states, the retrieved results may not be relevant.
Figure 5.2 shows some example image search results using queries describing the object name
“book”, and describing the object state such as “book is on fire”, “book is set aflame”. These state
phrases were used by human annotators to describe the effect of the action “burn a book”. We
can see that the images returned from the query “book is set aflame” do not depict the physical
effect state of “burn a book”. Therefore, it is important to identify images with relevant effect states
to train the model. To do that, we applied a bootstrapping method to handle the noisy web images,
as described in Section 5.3.3. An action (i.e., a verb-noun pair) has multiple corresponding
effect phrases, and all of their image search results are treated as training images for this action.

Figure 5.2: Examples of image search results.

Since both the human annotated image data (Section 5.2) and the web-search image data were
obtained from Internet search engines, they may contain duplicates. As part of the annotated images
are used as test data to evaluate the models, it is important to remove duplicates. We designed a
simple method to remove any images from the web-search image set that have a duplicate in the
human annotated set. We first embed all images into feature vectors using pre-trained CNNs. For
each web-search image, we calculate its cosine similarity score with each of the annotated images,
and we simply remove the web images that have a score larger than 0.95.
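
The duplicate filter can be sketched as follows. The feature vectors here are tiny invented lists standing in for CNN features, and the function and variable names are hypothetical; only the cosine similarity test and the 0.95 threshold come from the description above.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def remove_duplicates(web_features, annotated_features, threshold=0.95):
    """Return ids of web images that are not near-duplicates of any annotated image."""
    kept = []
    for web_id, w in web_features.items():
        if all(cosine_similarity(w, a) <= threshold
               for a in annotated_features.values()):
            kept.append(web_id)
    return kept

# Invented 3-d features; real CNN features have far more dimensions.
annotated = {"ann_1": [1.0, 0.0, 0.2]}
web = {"web_1": [0.98, 0.01, 0.21],   # almost identical to ann_1
       "web_2": [0.0, 1.0, 0.3]}      # clearly different

kept_ids = remove_duplicates(web, annotated)
```

Comparing every web image against every annotated image is quadratic in the worst case, which is acceptable at the scale of a few thousand annotated images.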

5.3.3 Models

We formulate the action-eﬀect prediction task as a multi-class classiﬁcation problem. Given an
image, the model will output a probability distribution q over the candidate actions (i.e., verb-noun
pairs) that can potentially cause the eﬀect depicted in the image.

Specifically for model training, we are given a set of human annotated seeding image data {x, t}
and a set of web-search image data {x′, t′}. Here x and x′ are the images (depicting effect states),
and t and t′ are their classification targets (i.e., actions that cause the effects). Each target vector is
the observed image label, t ∈ {0, 1}^C with Σ_i t_i = 1, where C is the number of classes (i.e., actions). The
human annotated targets t can be trusted, but the targets of web-search images t′ are usually very
noisy. The bootstrapping method has been shown to be an effective way to handle noisily labelled
data [123, 154, 114]. The objective of the cross-entropy loss is defined as follows:

    L(t, q) = \sum_{i=1}^{C} t_i \log(q_i),        (5.1)

where q are the predicted class probabilities, and C is the number of classes. To handle the
noisy labels in the web-search data {x′, t′}, we adopt a bootstrapping objective following Reed’s
work [114]:

    L(t', q) = \sum_{i=1}^{C} \left[ \beta t'_i + (1 - \beta) z_i \right] \log(q_i),        (5.2)

where β ∈ [0, 1] is a model parameter to be assigned, and z is the one-hot vector of the prediction q, i.e.,
z_i = 1 if i = \arg\max_k q_k, k = 1 . . . C, and z_i = 0 otherwise.

The model architecture is shown in Figure 5.3. After each training batch, the current model
is used to make predictions q on images in the next batch. The target probabilities are
calculated as a linear combination of the current predictions q and the observed noisy labels t′. The
idea behind this bootstrapping strategy is to ensure the consistency of the model’s predictions. By
first initializing the model on the seeding image data, the bootstrapping approach allows the model
to place more trust in the web images that are consistent with the seeding data.
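
To make the two objectives concrete, the following sketch evaluates Equations 5.1 and 5.2 for a single image. The function names and the toy numbers are illustrative only. The expressions are implemented with the sign exactly as written in the equations; a training loop that minimizes a loss would presumably use their negation, which is an assumption on our part rather than something stated above.

```python
import math

def one_hot_argmax(q):
    """z in Equation 5.2: one-hot vector of the currently predicted class."""
    k = max(range(len(q)), key=lambda i: q[i])
    return [1.0 if i == k else 0.0 for i in range(len(q))]

def cross_entropy_objective(t, q):
    """Equation 5.1 for one image: sum_i t_i * log(q_i). Requires q_i > 0."""
    return sum(ti * math.log(qi) for ti, qi in zip(t, q))

def bootstrap_objective(t_noisy, q, beta):
    """Equation 5.2 for one image: sum_i [beta * t'_i + (1 - beta) * z_i] * log(q_i)."""
    z = one_hot_argmax(q)
    return sum((beta * ti + (1.0 - beta) * zi) * math.log(qi)
               for ti, zi, qi in zip(t_noisy, z, q))

# Toy example with C = 3 classes: the noisy web label says class 1,
# while the current model prefers class 0.
q = [0.7, 0.2, 0.1]
t_noisy = [0.0, 1.0, 0.0]
value = bootstrap_objective(t_noisy, q, beta=0.3)
```

With β = 1 the bootstrapping objective reduces to Equation 5.1, and with smaller β the model's own prediction carries more weight than the observed web label.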

Figure 5.3: Architecture for the action-eﬀect prediction model with bootstrapping.

5.3.4 Evaluation

We evaluate the models on the action-eﬀect prediction task. Given an image that illustrates a state
of the world, the goal is to predict what action could cause that state. Given an action in the form
of a verb-noun pair, the goal is to identify images that depict the most likely eﬀects on the state of
the world caused by that action.

For each of the 140 verb-noun pairs, we use 10% of the human annotated images as the seeding
image data for training, 30% for development, and the remaining 60% for test. The seeding image
data set contains 408 images. On average, each verb-noun pair has fewer than 3 seeding images
(including positive images and negative images). The development set contains 1252 images. The
test set contains 2503 images. The model parameters were selected based on the performance on
the development set.

As a given image may not be relevant to any effect, we add a background class to refer to images
where effects are not caused by any action in the space of actions. So the total number of classes for our
evaluation model is 141. For each verb-noun pair and each of the effect phrases, around 40 images
were downloaded from the Bing image search engine and used as candidate training examples. In
total we have 6653 action web images and 59575 effect web images.

5.3.4.1 Methods for Comparison

All the methods compared are based on one neural network structure. We use ResNet [51] pre-
trained on ImageNet [22] to extract image features. The extracted image features are fed to a fully
connected layer with rectiﬁed linear units and then to a softmax layer to make predictions. More
speciﬁcally, we compare the following conﬁgurations:

(1) BS+Seed+Act+Eﬀ. The bootstrapping approach trained on the seeding images, the action
web images, and the eﬀect web images. During the training stage, the model was ﬁrst trained on
the seeding image data using vanilla cross-entropy objective (Equation 5.1). Then it was further
trained on a combination of the seeding image data and web-search data using the bootstrapping
objective (Equation 5.2). In the experiments we set β = 0.3.

(2) BS+Seed+Act. The bootstrapping approach trained in the same fashion as (1). The only
difference is that this method does not use the effect web images.

(3) Seed+Act+Eff. A baseline method trained on a combination of the seeding images, the action
web images, and the effect web images, using the vanilla cross-entropy objective.

(4) Seed+Act. A baseline method trained on a combination of the seeding images and the action
web images, using the vanilla cross-entropy objective.

(5) Seed. A baseline method that was only trained on the seeding image data, using the vanilla
cross-entropy objective.

5.3.4.2 Evaluation Results

We apply the trained classiﬁcation model to all of the test images. Based on the matrix of prediction
scores, we can evaluate action-eﬀect prediction from two angles: (1) given an action class, rank all
the candidate images; (2) given an image, rank all the candidate action classes. Tables 5.3 and 5.4
show the results for these two angles respectively. We report both mean average precision (MAP)
and top prediction accuracy.
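
The two metrics can be computed from ranked score lists as sketched below. This is an illustrative implementation rather than our evaluation script: the function names are hypothetical, and "top-k accuracy" is assumed here to mean that at least one correct candidate appears among the k highest-scoring ones.

```python
def ranked_indices(scores):
    """Candidate indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def average_precision(scores, relevant):
    """AP of one ranked list; relevant[i] is True if candidate i is correct."""
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(ranked_indices(scores), start=1):
        if relevant[i]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def top_k_hit(scores, relevant, k):
    """True if a correct candidate is among the k highest-scoring ones."""
    return any(relevant[i] for i in ranked_indices(scores)[:k])

def evaluate(score_rows, relevance_rows, ks=(1, 5, 20)):
    """MAP and top-k accuracy over queries (actions or images, depending on the angle)."""
    n = len(score_rows)
    pairs = list(zip(score_rows, relevance_rows))
    result = {"MAP": sum(average_precision(s, r) for s, r in pairs) / n}
    for k in ks:
        result["Top %d" % k] = sum(top_k_hit(s, r, k) for s, r in pairs) / n
    return result

# Toy example: one action with four candidate images, two of them correct.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
summary = evaluate([[0.9, 0.8, 0.7, 0.6], [0.1, 0.9]],
                   [[True, False, True, False], [True, False]], ks=(1, 2))
```

Transposing the score matrix switches between the two angles: rows indexed by actions give the image-ranking view, and rows indexed by images give the action-ranking view.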

Figure 5.4: Several example test images and their predicted actions and predicted effect descriptions.
The actions in blue are ground-truth labels.

                    MAP     Top 1   Top 5   Top 20
BS+Seed+Act+Eff     0.290   0.414   0.750   0.921
BS+Seed+Act         0.252   0.414   0.721   0.893
Seed+Act+Eff        0.247   0.314   0.679   0.886
Seed+Act            0.241   0.371   0.650   0.814
Seed                0.182   0.329   0.629   0.807

Table 5.3: Results for the action-effect prediction task (given an action, rank all the candidate
images).

                    MAP     Top 1   Top 5   Top 20
BS+Seed+Act+Eff     0.660   0.523   0.843   0.954
BS+Seed+Act         0.642   0.508   0.802   0.924
Seed+Act+Eff        0.289   0.176   0.398   0.625
Seed+Act            0.481   0.301   0.724   0.926
Seed                0.634   0.520   0.765   0.892

Table 5.4: Results for the action-effect prediction task (given an image, rank all the actions).

Overall, BS+Seed+Act+Eff gives the best performance. Comparing the bootstrapping approaches
with the baseline approaches (i.e., BS+Seed+Act+Eff vs. Seed+Act+Eff, and BS+Seed+Act
vs. Seed+Act), the bootstrapping approaches clearly outperform their counterparts, demonstrating
their ability to handle noisy web data. Comparing BS+Seed+Act+Eff with BS+Seed+Act, we can
see that BS+Seed+Act+Eff performs better. This indicates that the use of effect descriptions can bring
more relevant images to train better models for action-effect prediction.

In Table 5.4, the poor performance of Seed+Act+Eﬀ and Seed+Act shows that it is risky to
fully rely on the noisy web search results. These two methods had trouble in distinguishing the
background class from the rest.

We further trained another multi-class classiﬁer with web eﬀect images, using their correspond-
ing eﬀect phrases as class labels. Given a test image, we apply this new classiﬁer to predict the
eﬀect descriptions of this image. Figure 5.4 shows some example images, their predicted actions
based on our bootstrapping approach and their predicted eﬀect phrases based on the new classiﬁer.
These examples also demonstrate another advantage of incorporating seed eﬀect knowledge from
language data: it provides state descriptions that can be used to better explain the perceived state.
Such explanation can be crucial in human-agent communication for action planning and reasoning.

5.4 Generalizing Eﬀect Knowledge to New Verb-Noun Pairs

In real applications, it is very likely that we do not have the eﬀect knowledge (i.e., language
eﬀect descriptions) for every verb-noun pair. And annotating eﬀect knowledge using language
(as shown in Section 5.2) can be very expensive. In this section, we describe how to potentially
generalize seed eﬀect knowledge to new verb-noun pairs through an embedding model.

5.4.1 Action-Eﬀect Embedding Model

The structure of our model is shown in Figure 5.5. This model is based on the causality embedding
model in Chapter 3.2.2. It is composed of two sub-networks: one for verb-noun pairs (i.e., action)
and the other one for effect phrases (i.e., effect). The action or effect is fed into an LSTM encoder
and then to two fully-connected layers. The output is an action embedding v_c and an effect embedding
v_e. The networks are trained by minimizing the following cosine embedding loss function:

    L(v_c, v_e) =
    \begin{cases}
      1 - s(v_c, v_e),        & \text{if } (c, e) \in T \\
      \max(0, s(v_c, v_e)),   & \text{if } (c, e) \notin T
    \end{cases}

where s(·, ·) is the cosine similarity between vectors and T is a collection of action-effect pairs. Suppose c
is an input for action and e is an input for effect; this loss function will learn an action and effect
semantic space that maximizes the similarities between c and e if they have an action-effect relation
(i.e., (c, e) ∈ T). During training, the negative action-effect pairs (i.e., (c, e) ∉ T) are randomly
sampled from data. In the experiments, the negative sampling ratio is set to 25. That is, for each
positive action-effect pair, 25 negative pairs are created through random sampling.

Figure 5.5: Architecture of the action-effect embedding model.
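
The loss and the negative sampling step can be sketched on plain vectors as follows. The LSTM encoders and fully-connected layers are omitted, so v_c and v_e are simply given lists of numbers; the function names and the toy pairs are hypothetical, and sampling negatives uniformly with replacement is an assumption about a detail not specified above.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def cosine_embedding_loss(v_c, v_e, is_pair):
    """1 - s for pairs in T, max(0, s) for pairs not in T."""
    s = cosine(v_c, v_e)
    return 1.0 - s if is_pair else max(0.0, s)

def sample_negatives(action, positives, all_effects, ratio=25, rng=random):
    """For each positive pair of `action`, draw `ratio` effects not paired with it in T."""
    pool = [e for e in all_effects if (action, e) not in positives]
    n_pos = sum(1 for (a, _) in positives if a == action)
    return [rng.choice(pool) for _ in range(ratio * n_pos)] if pool else []

# Toy collection T of action-effect pairs (invented for illustration).
T = {("slice apple", "into many small pieces"), ("ignite paper", "is on fire")}
effects = ["into many small pieces", "is on fire", "is thoroughly wet"]
negatives = sample_negatives("slice apple", T, effects)

aligned = cosine_embedding_loss([1.0, 0.0], [1.0, 0.0], is_pair=True)
orthogonal_negative = cosine_embedding_loss([1.0, 0.0], [0.0, 1.0], is_pair=False)
similar_negative = cosine_embedding_loss([1.0, 0.0], [1.0, 0.0], is_pair=False)
```

The max(0, ·) term means that negative pairs are only penalized while their embeddings still point in a similar direction; once the similarity is zero or negative they contribute nothing.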

At the inference step, given an unseen verb-noun pair, we embed it into the action and eﬀect
semantic space. Its embedding vector will be used to calculate similarities with all the embedding
vectors of the candidate eﬀect phrases.
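
The inference step amounts to a nearest-neighbour search in the shared space, as in the sketch below. The embeddings are invented two-dimensional vectors rather than outputs of the trained sub-networks, and the function name and the default k are illustrative choices.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def top_effect_phrases(action_vec, effect_vecs, k=5):
    """Rank candidate effect phrases by cosine similarity to the action embedding."""
    ranked = sorted(effect_vecs,
                    key=lambda phrase: cosine(action_vec, effect_vecs[phrase]),
                    reverse=True)
    return ranked[:k]

# Invented embeddings for three candidate effect phrases.
candidates = {
    "into smaller pieces": [0.9, 0.1],
    "is on fire":          [0.0, 1.0],
    "is sliced":           [0.8, 0.3],
}
# Invented embedding of an unseen verb-noun pair.
predicted = top_effect_phrases([1.0, 0.0], candidates, k=2)
```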

                     MAP     Top 1   Top 5
BS+Seed+Act+Eff      0.529   0.643   0.928
BS+Seed+Act+pEff     0.507   0.642   0.893
BS+Seed+Act          0.435   0.643   0.964
Seed                 0.369   0.678   0.786

Table 5.5: Results for the action-effect prediction task (given an action, rank all the candidate
images).

                     MAP     Top 1   Top 5
BS+Seed+Act+Eff      0.733   0.574   0.947
BS+Seed+Act+pEff     0.729   0.551   0.961
BS+Seed+Act          0.724   0.557   0.933
Seed                 0.705   0.557   0.898

Table 5.6: Results for the action-effect prediction task (given an image, rank all the actions).

5.4.2 Evaluation

We divided the 140 verb-noun pairs into 70% training set (98 verb-noun pairs), 10% development
set (14) and 20% test set (28). For the action-eﬀect embedding model, we use pre-trained GloVe
word embeddings [108] as input to the LSTM. The embedding model was trained using the language
eﬀect data corresponding to the training verb-noun pairs, and then it was applied to predict eﬀect
phrases for the unseen verb-noun pairs in the test set. For each unseen verb-noun pair, we collected
its top ﬁve predicted eﬀect phrases. Each predicted eﬀect phrase was then used as query keywords
to download web effect images. This set of web images is referred to as pEff and is used in
training the action-effect prediction model.

For each of the 28 test (i.e., new) verb-noun pairs, we use the same ratio of 10% (about 3 examples)
of the human annotated images as the seeding images, which were combined with downloaded web
images to train the prediction model. The remaining images are split into a development set (30%)
and a test set (60%). We compare the following configurations:

(1) BS+Seed+Act+pEff. The bootstrapping approach trained on the seeding images, the action
web images, and the web images downloaded using the predicted effect phrases.

(2) BS+Seed+Act+Eff. The bootstrapping approach trained on the seeding images, the action
web images, and the effect web images (downloaded using ground-truth effect phrases).

(3) BS+Seed+Act. The bootstrapping approach trained on the seeding images and the action
web images.

(4) Seed. A baseline only trained on the seeding images.

Tables 5.5 and 5.6 show the results for the action-effect prediction task for unseen verb-noun
pairs. From the results we can see that BS+Seed+Act+pEff achieves performance close to that of
BS+Seed+Act+Eff, which uses human annotated effect phrases. In most cases,
BS+Seed+Act+pEff also outperforms the baseline, which seems to point to the possibility that a semantic
embedding space can be employed to extend effect knowledge to new verb-noun pairs. However,
the current results are not conclusive, partly due to the small test set. More in-depth evaluation
is needed in the future.

Action Text     Predicted Effect Text
chop carrot     carrot into sandwiches, carrot is sliced, carrot is cut thinly,
                carrot into different pieces, carrot is divided
ignite paper    paper is being charred, paper is being burned, paper is set,
                paper is being destroyed, paper is lit
mash potato     potato into chunks, potato into sandwiches, potato into slices,
                potato is chewed, potato into smaller pieces

Table 5.7: Example predicted effect phrases for new verb-noun pairs. Unseen verbs and nouns are
shown in bold.

Table 5.7 shows top predicted eﬀect phrases for several new verb-noun pairs. After analyzing
the action-eﬀect prediction results we notice that generalizing the eﬀect knowledge to a verb-noun
pair that contains an unseen verb tends to be more diﬃcult than generalizing to a verb-noun pair
that contains an unseen noun. Among the 28 test verb-noun pairs, 12 contain unseen verbs
and known nouns, and 7 contain unseen nouns and known verbs. For the task of ranking images
given an action, the mean average precision is 0.447 for the unseen verb cases and 0.584 for the
unseen noun cases. Although not conclusive, this might indicate that verbs tend to capture more
information about the effect states of the world than nouns.

5.5 Discussion and Conclusion

When robots operate in the physical world, they not only need to perceive the world, but also
need to act upon it. They need to understand the current state, to map their goals to the world
state, and to plan for actions that can lead to the goals. All of these point to the importance of the
ability to understand causal relations between actions and the state of the world. To address this
issue, this work introduces a new task on action-eﬀect prediction.

Particularly, we focus on modeling the connection between an action (a verb-noun pair) and its
effect as illustrated in an image, and treat natural language effect descriptions as side knowledge
to help acquire web image data and bootstrap training. Our current model is very simple and
its performance is yet to be improved. We plan to apply more advanced approaches in the future,
for example, attention models that jointly capture actions, image states, and effect descriptions.
We also plan to incorporate action-effect prediction into human-robot collaboration, for example, to
bridge the gap of commonsense knowledge about the physical world between humans and robots.

This chapter presents an initial investigation on action-effect prediction. There are many
challenges and unknowns, from problem formulation to knowledge representation; from learning
and inference algorithms to methods and metrics for evaluations. Nevertheless, we hope this work
can motivate more research in this area, enabling physical action-eﬀect reasoning, towards agents
which can perceive, act, and communicate with humans in the physical world.

CHAPTER 6

UNDERSTANDING PHYSICAL ACTIONS THROUGH NATURAL LANGUAGE STORIES

6.1 Introduction

To further investigate machines’ ability to reason about the causes and effects of physical actions as
part of language understanding, we create a new language benchmark. This benchmark contains
short stories created by human annotators. Each story describes a short sequence of human physical
actions in our daily lives. For example, a story could describe the action sequence of making a
sandwich in the kitchen, or packing a suitcase in the bedroom. Based on the collected stories, we
create two tasks to evaluate machine reading systems. The ﬁrst task is to select the correct sentence
from two alternatives to ﬁll in the blank in a story, and it is called the cloze task. The second task
is to select the correct order of sentences in a story, and it is called the ordering task.

Although the proposed tasks are easy for humans to solve, they are very challenging for
machines. An analysis shows that understanding the stories and solving these tasks requires various
types of commonsense knowledge, e.g., knowledge about action verbs, objects, and naïve physics
rules. Therefore, we believe this benchmark will be a valuable resource for evaluating machines’
capability of acquiring and applying physical commonsense knowledge. Further, the setting of two
sub-tasks can be naturally used to evaluate a model’s generalization ability, via training on one task
and evaluating on the other task. If a model can successfully learn the fundamental knowledge and
the reasoning abilities via training on the data of one sub-task, it can potentially perform well on the
other sub-task. By doing this, we encourage models that focus on learning underlying knowledge
instead of over-ﬁtting to shallow statistical cues.

To tackle the commonsense reasoning tasks, we present a new neural network model. This
model solves both the cloze task and the ordering task by explicitly examining the compatibility of
each action with its context in those stories. Since the action-effect knowledge plays an essential role
in understanding these commonsense stories, we further incorporated physical causality knowledge
into the proposed model. Experiments were designed to compare the proposed model with several
state-of-the-art models for machine comprehension tasks. The results demonstrate the effectiveness
of the proposed model, and further show the improvement introduced by external physical causality
knowledge. The results also suggest that this benchmark is challenging for current approaches,
and that better solving this task requires a wider range of commonsense knowledge and richer semantic
representation of actions and objects.

6.2 Physical Commonsense Reasoning Tasks

The proposed benchmark includes two subtasks. The cloze task is to select the correct sentence
to ﬁll in the blank in a story. The ordering task is to select the correct order of sentences in a story.
In both tasks, each story describes a short sequence of human physical actions in our daily lives.
Examples are shown in Figure 6.1.

Figure 6.1: Example story data for the cloze task and the ordering task. Candidates in red are
correct answers.

For the cloze task, one sentence in an original story is replaced with a blank. That sentence
is then put together with a distraction sentence to form the candidate set. Given the story with a
blank, a system needs to select the correct sentence from the candidates to fill in the blank. The
distraction sentences are created in such a way that they describe very common human actions in
the corresponding environment, but adding them to the story makes the story irrational in the
physical world.

For the ordering task, two sentences in an original story are chosen and their positions are
switched. These sentences are selected in a manner that if we switch their positions, the story
becomes irrational in the physical world. Given the original story and the reordered story, a system
needs to determine which story makes more sense.

In our data, the cloze task and the ordering task are diﬀerent in their setups, but they are also
closely related, since both of them rely on the knowledge of action prerequisites and eﬀects, and the
capability of tracking the state changes introduced by human actions. The design of including two
different but closely related subtasks is motivated by recent criticism of the data biases introduced
into natural language benchmarks during data collection [128, 49]. For example, Schwartz et al.
[128] have shown that the Story Cloze Test [95] (which has a setting similar to that of our cloze task)
can be solved with up to 75% accuracy by only exploiting stylistic features, even without looking at
the story context. With two parallel tasks in this benchmark, we can use them to evaluate a model’s
generalization ability, via training on one task and evaluating on the other task. If a model can
successfully acquire the underlying commonsense knowledge and learn the reasoning abilities via
training on the data of one task, it is very likely to perform well on the other task. By doing this,
we encourage models that focus on learning underlying knowledge instead of overﬁtting to shallow
language patterns.

6.2.1 Data Collection through Crowdsourcing

We collected a set of human-written stories via Amazon Mechanical Turk. Each story describes
a sequence of physical actions in human daily lives. During data collection, the annotators were
shown a person’s name and a location name, and they were asked to use their imagination to
write a short story describing a sequence of physical actions the person takes in that location.
Possible locations include kitchen, living room, bathroom, garage, office, and park. Several
requirements were given to the annotators: 1) All described actions should be entirely realistic; 2)
The actions should be carried out in a short time period; 3) The story must include at least five
sentences.

Figure 6.2: Interface used for annotating stories for the cloze task.

Figure 6.3: Interface used for annotating stories for the ordering task.

After collecting the original stories, we asked a diﬀerent group of annotators to read the stories
and prepare them for the cloze task and the ordering task. Speciﬁcally, to prepare data for the
cloze task, we asked annotators to write a new sentence to replace an original sentence in the story,
such that the story after replacement is not likely to happen in the physical world. The annotation
interface is shown in Figure 6.2. The new sentences are used as distraction alternatives in the
cloze task. To make this task more challenging, we asked the annotators to come up with sentences
that are entirely realistic on their own. For example, sentences like “Mary fried eggs on the printer”
or “Tom ate the spoon” are not acceptable, since they are not realistic. In this way, one cannot
determine which sentence correctly fills the blank by looking at the candidate sentences alone;
the candidates must always be put back into the story context.

To prepare data for the ordering task, we asked annotators to switch two sentences in the original
story, so that the story after switching is not likely to happen in the physical world. The annotation
interface is shown in Figure 6.3. After data collection, we also filtered out the articles “the”, “a”, and
“an” to remove some trivial cues for the correct order of sentences: “the”
is usually used to refer to something mentioned before, while “a” and “an” are usually used to refer
to something not mentioned before. For example, a system can easily determine the order of the
following two sentences, “Tom got an apple out of the fridge” and “Tom peeled the apple with a
knife”, only by looking at the usage of “an apple” and “the apple”.
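The article-filtering step described above can be sketched as follows. This is only a minimal illustration: it assumes whitespace tokenization and case-insensitive matching, and the function name `strip_articles` is hypothetical rather than taken from the original preprocessing code.

```python
ARTICLES = {"the", "a", "an"}


def strip_articles(sentence: str) -> str:
    """Remove the articles "the", "a", and "an" from a sentence.

    Assumption: tokens are separated by whitespace and an article never
    carries attached punctuation.
    """
    kept = [tok for tok in sentence.split() if tok.lower() not in ARTICLES]
    return " ".join(kept)


first = strip_articles("Tom got an apple out of the fridge.")
second = strip_articles("Tom peeled the apple with a knife.")
print(first)   # Tom got apple out of fridge.
print(second)  # Tom peeled apple with knife.
```

After filtering, both sentences mention only "apple", so the article contrast can no longer reveal their order.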

In total, we collected 727 human-written stories. Based on these original stories, we
created 1,672 instances for the cloze task and 4,577 instances for the ordering task.

6.2.2 Underlying Commonsense Knowledge

After data collection, we analyzed the task data and identified several categories of commonsense
knowledge that are essential for solving the tasks. Here we list the knowledge categories and show
task examples that require them.

1. Verb Causality Knowledge describes how a physical action changes the physical states of the
objects involved. For example, the key to solving the following cloze problem is knowing that
the action bake the potato causes the potato to become hot.

- Story:
  1. Tom preheated the oven.
  2. Tom took out a potato from the fridge.
  3. Tom put the potato in a metal pan.
  4. Tom baked the potato in the oven.
  5. ________.

- Select the correct sentence to fill in the blank:
  A. Tom sprinkled some grated cheese on the potato.
  B. Tom ate the cold potato.

- Correct Answer: A

2. Action Precondition is the requirement that must be satisfied before an action happens. For
example, one can cut a solid object but not a liquid, and one can stir a liquid but not a solid object.
To solve the following cloze problem, one needs to know that the butter is in liquid form after
melting (verb causality knowledge) and that a liquid cannot be cut (action
precondition knowledge).

- Story:
  1. Tom took the potato out of the oven.
  2. Tom mashed the potato.
  3. Tom melted butter in the microwave.
  4. ________.
  5. Tom ate the mashed potato with a spoon.

- Select the correct sentence to fill in the blank:
  A. Tom put the mashed potato and butter in a bowl.
  B. Tom cut the butter into cubes.

- Correct Answer: A

3. Object Functionality involves information about speciﬁc functions of objects, especially
for tools. For example, a microwave oven can be used to heat objects, and a wrench can be used to
repair cars. In the following ordering task, it is critical to infer that the wrench was used to tighten
the bolt.

- Story:
  1. John opened the toolbox.
  2. John took out the wrench.
  3. John tightened a bolt on his bicycle.
  4. John put the wrench back in the box.
  5. John rode the bicycle to the store.

- Select the correct order:
  A. 12345
  B. 13245

- Correct Answer: A

4. Intuitive Physics is humans’ common understanding of basic physical phenomena. For
example, a solid object cannot pass through another solid object; an existing object continues to
exist unless it is moved away or destroyed; and if an object is in a container and the container moves,
the object moves along with it. Psychological studies have shown that these naive physics rules usually
develop at a very young age, even before the development of language ability [6]. Thus, we rarely
express this kind of information explicitly in our communication; we simply assume that everyone
knows it. If an AI system is ever to deeply understand human natural language, it needs to acquire
this kind of knowledge. For the following cloze problem, the key is to infer that the hammer is not
accessible since it is locked in the trunk.

- Story:
  1. John took off a bucket from the shelf.
  2. John picked up a hammer and a rope from the floor.
  3. John put the hammer and rope into the bucket.
  4. John locked the bucket in his car trunk.
  5. ________.

- Select the correct sentence to fill in the blank:
  A. John used the hammer to repair the bike.
  B. John drove his car into the street.

- Correct Answer: B

6.2.3 Comparison with Existing Tasks

The proposed tasks are similar to existing machine comprehension tasks, for example, bAbI [153],
SQuAD [113], and the Story Cloze Test [95], since in all of these tasks a model needs to make
predictions based on its understanding of the provided supporting text. However, the proposed
tasks also differ from them. In bAbI [153] and SQuAD [113], the evaluation is done in
a question answering setting, where the input includes a supporting document together with a
question, and the model needs to select words from the supporting document as the answer. In the
Story Cloze Test [95], the input includes a short story and two alternative endings to the story, and
the model needs to predict which ending is the correct one. Our proposed cloze task is very
similar to the Story Cloze Test, except that the blanks are not always at the end of the stories.
Moreover, most of the stories in the Story Cloze Test focus on human emotions, intentions, and
attitudes (i.e., naive psychology), while our tasks focus specifically on human physical actions.

6.3 Methods

For the cloze task, a model needs to select between two candidate sentences to fill
in the blank. For the ordering task, a model needs to select between two sequences of
sentences. To unify the two proposed tasks, we treat them as a story ranking task:
given two candidate stories, predict which one is more rational. For the cloze task, we get two
candidate stories by replacing the blank with each candidate sentence. For the ordering task, we
get two candidate stories by treating each sequence as a candidate story.
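The reduction of both tasks to story ranking can be sketched as follows. This is an illustrative sketch only: the function names, the use of `None` to mark the blank, and the 1-based order lists are assumptions made here, not details given in the text.

```python
from typing import List, Optional, Sequence, Tuple

Story = List[str]


def cloze_to_candidates(story: Sequence[Optional[str]],
                        options: Tuple[str, str]) -> Tuple[Story, Story]:
    """Fill the single blank (marked by None) with each option in turn."""
    blank = list(story).index(None)
    filled = []
    for option in options:
        candidate = list(story)
        candidate[blank] = option
        filled.append(candidate)
    return filled[0], filled[1]


def ordering_to_candidates(sentences: Sequence[str],
                           orders: Tuple[Sequence[int], Sequence[int]]
                           ) -> Tuple[Story, Story]:
    """Arrange the sentences according to each candidate order (1-based)."""
    first = [sentences[i - 1] for i in orders[0]]
    second = [sentences[i - 1] for i in orders[1]]
    return first, second


cloze = cloze_to_candidates(
    ["Tom baked the potato in the oven.", None],
    ("Tom sprinkled some grated cheese on the potato.",
     "Tom ate the cold potato."),
)
ordering = ordering_to_candidates(
    ["John opened the toolbox.",
     "John took out the wrench.",
     "John tightened a bolt on his bicycle."],
    ([1, 2, 3], [1, 3, 2]),
)
```

Either call yields a pair of complete stories, so a single scoring model can be applied to both tasks and the story with the lower anomaly score selected.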

To tackle the proposed commonsense reasoning tasks, we propose a neural network model that
explicitly examines each action in terms of its compatibility with the actions happening before it
and actions happening after it. This model is motivated by the fact that the order of sentences is
very important in understanding a narrative.

As discussed earlier, solving these tasks requires a lot of commonsense knowledge. Given the
fact that commonsense knowledge is not usually explicitly stated in natural language, and also given
the limited size of our data set, it is not practical to acquire all the commonsense knowledge from
training data. So in this work, we also explore methods that can leverage external commonsense
knowledge for better understanding and reasoning about the stories.

6.3.1 The Attentive-Reader Model

Figure 6.4: Network architecture for the Attentive-Reader. Note that this architecture only shows
the computation structure for the anomaly scores corresponding to sentence 3 (score s31 and score
s32). The anomaly scores for other sentences are computed via similar processes.

The architecture of the Attentive-Reader is shown in Figure 6.4. In this model, we adopt
sentence-level representations. We first use a bi-directional gated recurrent unit (Bi-GRU) to embed the
sentences into vectors. Then, for each sentence, we examine its compatibility with the rest
of the story. Taking sentence 3 as the target sentence, we use its embedding vector e3 to attend to
every sentence before it (e1 and e2) and every sentence after it (e4 and e5), separately. For the
sentences before it, we calculate the attention weights with

\[
\alpha_i = \frac{\exp(e_3 W_a e_i)}{\sum_{j<3} \exp(e_3 W_a e_j)},
\tag{6.1}
\]

where Wa is the parameter matrix to be learned. Then we represent the before-context with a
weighted sum

\[
c_{12} = \sum_{i<3} \alpha_i e_i.
\tag{6.2}
\]

Then the context representation is used to calculate the anomaly score s31 between the target
sentence and the context before it.

\[
s_{31} = \tanh(W_1 [c_{12} : e_3] + b_1)
\tag{6.3}
\]

Here W1 and b1 are parameters to be learned, and “:” denotes concatenation. The after-context
representation c45 and the anomaly score s32 are computed in a similar way. After calculating the
anomaly scores for every target sentence, we apply max- or mean-pooling over them to generate the
final score. We use a cross-entropy loss at the final layer.
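The computation in Eqs. (6.1) to (6.3) can be illustrated with a small pure-Python sketch. Several simplifications here are assumptions rather than details from the text: toy 2-dimensional vectors stand in for the Bi-GRU sentence embeddings, the anomaly score is taken to be a scalar (so W1 becomes a weight vector `w_1`), the same `w_1` and `b_1` are reused for the before- and after-context, and all function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def bilinear(u: Vector, W: Sequence[Vector], v: Vector) -> float:
    """Bilinear form u W v, used as the attention logit in Eq. (6.1)."""
    return sum(u[r] * W[r][c] * v[c]
               for r in range(len(u)) for c in range(len(v)))


def attend(target: Vector, context: Sequence[Vector],
           W_a: Sequence[Vector]) -> Tuple[List[float], Vector]:
    """Eqs. (6.1)-(6.2): softmax weights over the context and their weighted sum."""
    logits = [bilinear(target, W_a, e) for e in context]
    top = max(logits)  # subtracting the max leaves the softmax unchanged
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    alphas = [x / total for x in exps]
    summary = [sum(a * e[k] for a, e in zip(alphas, context))
               for k in range(len(target))]
    return alphas, summary


def anomaly_score(summary: Vector, target: Vector,
                  w_1: Vector, b_1: float) -> float:
    """Eq. (6.3) with a scalar output: tanh(w_1 . [c : e] + b_1)."""
    concat = summary + target  # list concatenation plays the role of ":"
    return math.tanh(sum(w * x for w, x in zip(w_1, concat)) + b_1)


def sentence_scores(embeddings: Sequence[Vector], t: int,
                    W_a: Sequence[Vector], w_1: Vector,
                    b_1: float) -> List[float]:
    """Scores of sentence t (0-based) against its before- and after-context."""
    scores = []
    for context in (embeddings[:t], embeddings[t + 1:]):
        if context:  # the first and last sentences have only one side
            _, summary = attend(embeddings[t], context, W_a)
            scores.append(anomaly_score(summary, embeddings[t], w_1, b_1))
    return scores


def story_score(embeddings: Sequence[Vector], W_a: Sequence[Vector],
                w_1: Vector, b_1: float,
                pool: Callable[[List[float]], float] = max) -> float:
    """Pool all per-sentence anomaly scores into one story-level score."""
    return pool([s for t in range(len(embeddings))
                 for s in sentence_scores(embeddings, t, W_a, w_1, b_1)])


# Toy values only; a trained model would learn W_a, w_1, and b_1.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
W_a = [[1.0, 0.0], [0.0, 1.0]]
w_1 = [0.1, 0.2, 0.3, 0.4]
b_1 = 0.0

alphas, c_12 = attend(embeddings[2], embeddings[:2], W_a)
s_3 = sentence_scores(embeddings, 2, W_a, w_1, b_1)
final = story_score(embeddings, W_a, w_1, b_1)
```

Passing `pool=lambda xs: sum(xs) / len(xs)` gives the mean-pooling variant instead of max-pooling.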

6.3.1.1 Leveraging Physical Causality Knowledge

Since the action-eﬀect knowledge plays an essential role in understanding these commonsense
stories, we further incorporated physical causality knowledge into the proposed model. To inject
the verb causality knowledge, we introduce an external knowledge module in the Attentive-Reader
architecture. This module takes external knowledge in the form of natural language sentences. As
shown in Figure 6.4, we use the same bi-directional gated recurrent unit (Bi-GRU) to embed the
knowledge into vector representations. Later they will be concatenated with story sentence vectors
to form knowledge-aware sentence representations.

6.3.1.2 Typed Physical Causality Knowledge

To form the external knowledge base, we start with the category-based knowledge data in
Chapter 4.3.1. Given a verb, this knowledge data only tells us which state categories are likely to
change, but cannot tell us how they will change. Theoretical linguistic studies on verbs have
shown that result verbs often specify movement along a scale [60]. Inspired by this, we introduce
state-changing directions (or types) to the categories. Specifically, we selected a subset (presence,
integrity, location, containment, temperature, wetness) of the 18 attributes from Chapter 3.1, and
added types to them (shown in Table 6.1). Then, we manually annotated the 329 transitive verbs
from the current story dataset based on the typed causality attributes. When applying the knowledge
to a story sentence, we first run dependency parsing on the sentence to find the verb, direct
object, and location (if one exists). After extracting the typed attribute values for this verb, we replace
the terms object and location with the corresponding terms in the sentence. For example, the
external knowledge for “Tom put the potato in the fridge” will be “potato be in fridge”.

State attribute    Typed attribute values
Presence           object be present; object be not present
Location           object be in location; object be out of location
Integrity          object be broken; object be integral
Containment        object be full; object be empty
Wetness            object be wet; object be dry
Temperature        object be cold; object be hot

Table 6.1: Typed state attributes for physical causality knowledge.
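The slot-filling step that turns a typed attribute value into a knowledge sentence can be sketched as follows. The verb-to-template dictionary below is a hypothetical stand-in for the manual annotations, and the sketch assumes that dependency parsing has already produced the verb lemma, direct object, and optional location.

```python
from typing import Dict, List, Optional

# Hypothetical annotations: verb lemma -> typed attribute values (Table 6.1 templates).
VERB_KNOWLEDGE: Dict[str, List[str]] = {
    "put": ["object be in location"],
    "bake": ["object be hot"],
}


def instantiate_knowledge(verb: str, obj: str,
                          location: Optional[str] = None) -> List[str]:
    """Replace the `object` and `location` slots of each template for a verb."""
    facts = []
    for template in VERB_KNOWLEDGE.get(verb, []):
        words = template.split()
        if "location" in words and location is None:
            continue  # the template needs a location the sentence does not give
        filled = [obj if w == "object" else location if w == "location" else w
                  for w in words]
        facts.append(" ".join(filled))
    return facts


# "Tom put the potato in the fridge" -> verb "put", object "potato", location "fridge"
put_facts = instantiate_knowledge("put", "potato", "fridge")
print(put_facts)  # ['potato be in fridge']
```

The resulting strings are plain sentences, so they can be embedded by the same sentence encoder as the story sentences.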

6.3.2 Models for Comparison

The EntNet-Reader Model is based on the Recurrent Entity Network (EntNet) [52, 80], a neural
framework with external memory chains. Neural models with long-term memory and attention
mechanism have exhibited good reasoning capabilities in machine comprehension tasks [145, 45,
52, 80]. In particular, the EntNet model has proven very effective on similar
reasoning tasks such as bAbI [52], CBT (Children’s Book Test) [53, 52], and the Story Cloze Test [80].

Figure 6.5: Network architecture for the EntNet-based approach.

The architecture of the EntNet-Reader is shown in Figure 6.5. First, a bi-directional gated
recurrent unit (Bi-GRU) is used to embed the context information of the story at the word level. Then
the context-dependent word representations are taken as input to a bi-directional Recurrent Entity
Network (EntNet) [52, 80], where the model tracks the state of the world with memory chains. Each
memory chain is a special RNN in which a key governs what kind of information
can pass the gates and be stored in the memory. The state representations of all memory chains
are then gathered into a 2D array, and a convolution filter is applied on top of it. Specifically, the
filter covers the memory states from two adjacent time points and outputs an anomaly score. This
score indicates how irrational the next state or action is, given the world state at the previous time
point. Lastly, a max- or mean-pooling layer outputs the final anomaly score. We adopt a
cross-entropy loss on the final score.
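The idea of a filter that spans two adjacent time points, followed by pooling, can be illustrated with a short sketch. This is a deliberately simplified assumption-laden version: each time point's memory chains are flattened into one vector, the filter is a single linear map with no nonlinearity, and the weights are toy values rather than learned parameters.

```python
from typing import List, Sequence


def adjacent_step_scores(states: Sequence[Sequence[float]],
                         weights: Sequence[float],
                         bias: float = 0.0) -> List[float]:
    """Slide a width-2 filter over time: one score per pair of adjacent steps."""
    scores = []
    for prev, curr in zip(states, states[1:]):
        window = list(prev) + list(curr)  # memory states at t-1 and t
        scores.append(sum(w * x for w, x in zip(weights, window)) + bias)
    return scores


# Toy memory states for three time points (two values per time point).
states = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
# Toy filter: responds to a change in the first component between steps.
weights = [-1.0, 0.0, 1.0, 0.0]

scores = adjacent_step_scores(states, weights)
final_score = max(scores)  # max-pooling over time
```

With these toy values the filter gives a score of 0.0 for the first transition, where nothing changes, and 1.0 for the second, so max-pooling reports the second transition.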

The Bi-GRU Baseline. For comparison, we also introduce a baseline model that uses a Bi-GRU
to embed the whole story into a single vector representation and then generates the final prediction
score with an MLP (multilayer perceptron). Again, we use cross entropy as the loss function.

6.4 Experiments

6.4.1 Experimental Settings

For both the cloze task and the ordering task, we randomly divided the data for training (20%),
validation (20%) and test (60%). The task instances derived from one original story all appear in
the same set (either in training set, validation set or test set). This data split strategy is to prevent a
trivial solution that memorizes positive action sequences from the training data.
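A story-level split of this kind can be sketched as follows. The function name, the `(story_id, instance)` representation, and the fixed random seed are assumptions for illustration; the default fractions simply follow the percentages stated above.

```python
import random
from typing import Dict, List, Sequence, Tuple

Instance = Tuple[int, str]  # (story_id, task instance)


def split_by_story(instances: Sequence[Instance],
                   fractions: Tuple[float, float, float] = (0.2, 0.2, 0.6),
                   seed: int = 0) -> Dict[str, List[Instance]]:
    """Assign whole stories to splits so that all instances derived from
    one original story land in the same split."""
    story_ids = sorted({story_id for story_id, _ in instances})
    random.Random(seed).shuffle(story_ids)
    n_train = round(fractions[0] * len(story_ids))
    n_valid = round(fractions[1] * len(story_ids))

    split_of: Dict[int, str] = {}
    for rank, story_id in enumerate(story_ids):
        if rank < n_train:
            split_of[story_id] = "train"
        elif rank < n_train + n_valid:
            split_of[story_id] = "validation"
        else:
            split_of[story_id] = "test"

    splits: Dict[str, List[Instance]] = {"train": [], "validation": [], "test": []}
    for story_id, instance in instances:
        splits[split_of[story_id]].append((story_id, instance))
    return splits


# Ten toy stories with three derived instances each.
instances = [(story, f"story{story}-instance{k}")
             for story in range(10) for k in range(3)]
splits = split_by_story(instances)
```

Because the shuffle operates on story identifiers rather than on instances, no story contributes instances to more than one split.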

For all the models, pre-trained 300-dimensional GloVe embeddings [108] are used as input
word embeddings. The hidden sizes of the Bi-GRU and EntNet are set to 300. Training is carried out
with the Adam optimizer [65] and a batch size of 32.

Using 100% of the training data
Bi-GRU EntNet Attentive Attentive+KB
0.634
0.662

0.668
0.648

0.701
0.687

0.681
0.682
Using 67% of the training data
Bi-GRU EntNet Attentive Attentive+KB
0.622
0.644

0.660
0.653
Using 33% of the training data
Bi-GRU EntNet Attentive Attentive+KB
0.565
0.619

0.688
0.684

0.647
0.623

0.585
0.619

0.597
0.628

0.630
0.656

Cloze Task

Ordering Task

Cloze Task

Ordering Task

Cloze Task

Ordering Task

Table 6.2: Prediction accuracy results on the physical commonsense reasoning tasks.

6.4.2 Results and Analysis

Table 6.2 shows the evaluation results for diﬀerent models on the cloze task and the ordering task.
We vary the amount of training data to evaluate how the models perform with different training set sizes. Here
EntNet refers to the EntNet-Reader, Attentive refers to the Attentive-Reader, and KB denotes the
use of external physical causality knowledge.

Overall the Attentive-Reader performs better than the EntNet-Reader and the Bi-GRU model.
This might suggest that the sentence-level representation works better on the proposed tasks. After
introducing external causality knowledge, the Attentive-Reader achieves the best performance on
both tasks. This indicates the eﬀectiveness of the external knowledge module in Figure 6.4, together
with the typed physical causality knowledge base.

Given that a random guessing method could achieve 0.5 accuracy, these results also suggest
that this benchmark is very challenging for current approaches. More advanced approaches are
required, with better coverage of commonsense knowledge and richer semantic representation of
actions and world states.

As mentioned earlier, although the cloze task and the ordering task differ in their task
setups, they require very similar reasoning processes to solve. The keys to tackling both tasks
are reliably tracking object state changes and detecting anomalies in action
preconditions and effects. To evaluate how different models generalize to an unseen task, we carried
out a new set of experiments, training these models on one task and testing them on the other
task.

Cloze - Ordering
Ordering - Cloze

0.550
0.561

Bi-GRU EntNet Attentive Attentive+KB
0.535
0.543

0.545
0.560

0.579
0.607

Table 6.3: Prediction accuracy results of training on one task and evaluating on the other task.

The results of evaluating on a new task are shown in Table 6.3. Again, the best-performing model
is the Attentive-Reader with external causality knowledge. This suggests that the typed physical
causality knowledge helps the model track object state changes better instead of overfitting to
shallow language patterns.

6.4.3 Predicting Breakpoints in Negative Stories

For both the EntNet-Reader model and the Attentive-Reader model, the network structures
enable predictions about which part of a story does not make sense. For
example, each of the intermediate anomaly scores s21, s31, . . . , s42 indicates the compatibility of
the corresponding sentence with the sentences before it or after it. Therefore, the sentence with the
maximum anomaly score is the one that, according to the trained model, is most likely to conflict
with its context.
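Selecting the breakpoint from the intermediate anomaly scores amounts to an argmax, as the following sketch shows. The score values are invented for illustration and are not model outputs; the dictionary layout (sentence number mapped to its before- and after-context scores) is likewise an assumption.

```python
from typing import Dict, List


def predict_breakpoint(scores_by_sentence: Dict[int, List[float]]) -> int:
    """Return the sentence whose largest anomaly score is the highest overall."""
    return max(scores_by_sentence,
               key=lambda idx: max(scores_by_sentence[idx]))


# Hypothetical anomaly scores for a five-sentence story. The first and last
# sentences have a single score because they have context on one side only.
toy_scores = {1: [0.1], 2: [0.2, 0.1], 3: [0.3, 0.2], 4: [0.4, 0.6], 5: [0.9]}
breakpoint_sentence = predict_breakpoint(toy_scores)
print(breakpoint_sentence)  # 5
```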

Note that in each negative story (one formed by selecting the wrong candidate sentence in the cloze
task or the wrong order in the ordering task), at least two sentences conflict with each
other. A manual analysis of the results (models trained using 100% of the training data) shows that both
the Attentive-Reader and the EntNet-Reader have a good chance (around 80%) of successfully finding
at least one of the conflicting sentences in negative stories. The following are several examples of
the Attentive-Reader’s breakpoint predictions on negative stories.

Negative story 1 (cloze):
  1. Mary took pan from cupboard.
  2. Mary put pan on stove.
  3. Mary took out bowl.
  4. Mary cracked open egg.
  5. Mary made hardboiled egg.
Conflicting sentences: 4 and 5
Model’s prediction: 5

Negative story 2 (cloze):
  1. John locked window.
  2. John unplugged tv.
  3. John turned off fan.
  4. John closed up his suitcase.
  5. John stayed in and watched tv.
Conflicting sentences: 2 and 5
Model’s prediction: 2

Negative story 3 (ordering):
  1. John opened freezer and took out ice cream.
  2. John scooped out some ice cream and put it in blender.
  3. John cleaned up his mess with broom.
  4. John knocked over bowl of butter.
  5. John put fruit in blender and made some shake.
Conflicting sentences: 3 and 4
Model’s prediction: 3

6.5 Summary

In this chapter we propose a new benchmark for physical commonsense reasoning. To the best
of our knowledge, this is the first crowdsourced natural language story dataset specifically targeted
at evaluating machines’ capability of understanding and reasoning about human physical actions.
This benchmark contains two sub-tasks: a cloze task and an ordering task. The setting of two
sub-tasks in this benchmark can be naturally used to evaluate a model’s generalization ability, via
training on one task and evaluating on the other task. We believe this benchmark will serve as a
valuable resource for physical commonsense reasoning.

As the ﬁrst attempt to tackle the proposed tasks, we present a neural network architecture
together with an external knowledge module. This model solves both the cloze task and the
ordering task via explicitly examining the compatibility of each action with its context in those
stories. The experimental results demonstrate that the best-performing setup is the proposed model
with typed verb causality annotation as external knowledge. The relatively low performance of all
the tested models suggests that the proposed tasks are far from being solved by current approaches.
A careful analysis of the task data suggests that future investigations could focus on modeling a
wider range of commonsense knowledge and providing richer semantic representations of actions
and objects.

CHAPTER 7

CONCLUSIONS AND FUTURE DIRECTIONS

This dissertation presents a series of investigations on collecting, modeling, and utilizing physical
causality knowledge of action verbs. First, physical causality knowledge was collected from human
contributors through crowdsourcing. Two representation methods were adopted to model physical
causality knowledge: one is based on pre-defined categories, and the other is based on natural
language embedding models. Both approaches have demonstrated their potential for modeling
verb semantics and connecting language to the physical world. We further incorporated causality
modeling in solving several challenging tasks: language grounding, visual causality reasoning, and
commonsense story understanding.

In Chapter 4, we applied the category-based causality modeling to the task of grounding
semantic roles from sentences to visual perceptions using two approaches: a knowledge-based
approach and a learning-based approach. The empirical evaluations have demonstrated that both
of the proposed approaches outperform previous work, indicating that causality categorization
provides a good guideline for designing intermediate visual features. Moreover, we have shown
that physical causality knowledge can be generalized to novel verbs using simple learned models.
In Chapter 5, we introduced a novel task of visual causality reasoning, which focuses on the
connection between an action (a verb-noun pair) and its eﬀect as illustrated in an image. We
have developed an approach that applies distant supervision to harness web data for bootstrapping
action-eﬀect prediction models. The empirical results have shown that, using a simple bootstrapping
strategy, our approach can combine the noisy web data with a small number of seed examples to
improve action-effect prediction. Furthermore, our approach can infer effect descriptions for new
verb-noun pairs and thus facilitate the training of action-effect prediction. This opens up the
possibility for humans to teach robots new tasks through language communication and a small number
of examples.

In Chapter 6, we introduced a new benchmark for physical commonsense reasoning, which
contains two sub-tasks, a cloze task and an ordering task. This benchmark evaluates a system’s
capability of understanding and reasoning about human physical actions from story data. We
presented a novel neural network model that explicitly examines the compatibility of each
sentence with its context. Experimental results have demonstrated the effectiveness of the proposed
model, and further show the improvement introduced by incorporating external physical causality
knowledge.

Apart from the studies presented in this dissertation, there are many interesting and promising
directions left for future exploration. Here we mention several:

1. Building a general purpose knowledge base for physical causality of action verbs. Our studies
have shown that the proposed physical causality knowledge datasets are good supplements to
current verb meaning models and resources. Thus, a natural extension is to build a large-scale
physical causality knowledge base for open-domain tasks.

2. Connecting verb causality knowledge with other types of intuitive physics knowledge. As
mentioned in Chapter 6.2, there are diﬀerent types of commonsense knowledge closely
related to the understanding and reasoning about physical actions. Therefore, one interesting
research topic is to explore how to acquire these diﬀerent types of knowledge and model them
with a uniﬁed framework.

3. Exploring methods that better ground language to perceptions. In Chapter 5 we present
an initial investigation on connecting language with action effect images. There are many
challenges and unknowns in grounding language to more complex forms of perception, from
video data to live human-robot interactions.

4. Extending verb causality knowledge to metaphorical uses. The studies we have done
focus only on literal uses of verbs, i.e., mainly on concrete actions applied to concrete
objects. However, metaphorical uses of verbs are very common in human natural language.
For example, “Two planes were shot down” is a literal usage of the verb “shot”, while
“The proposals were shot down” is a metaphorical usage. Given the value of verb causality
knowledge for literal verb uses, one would anticipate that extending verb causality knowledge
to metaphorical uses will help in comprehending and reasoning about natural language in more
general situations.

BIBLIOGRAPHY

[1] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding
for attribute-based classification. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 819–826, 2013.

[2] Jacob Andreas and Dan Klein. Grounding language with points and paths in continuous
spaces. In Proceedings of the Eighteenth Conference on Computational Natural Language
Learning, pages 58–67, 2014.

[3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra,
C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of
the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.

[4] Yoav Artzi and Luke Zettlemoyer. Weakly supervised learning of semantic parsers for
mapping instructions to actions. Transactions of the Association for Computational Linguistics,
1:49–62, 2013.

[5] David Bailey, Nancy Chang, Jerome Feldman, and Srini Narayanan. Extending embodied
lexical development. In Proceedings of the Twentieth Conference of the Cognitive Science
Society, pages 84–89, 1998.

[6] Renée Baillargeon. Infants’ physical world. Current directions in psychological science,
13(3):89–94, 2004.

[7] Collin F Baker, Charles J Fillmore, and John B Lowe. The Berkeley FrameNet project. In
Proceedings of the 17th international conference on Computational linguistics-Volume 1,
pages 86–90. Association for Computational Linguistics, 1998.

[8] Kobus Barnard, Pinar Duygulu, David Forsyth, Nando de Freitas, David M Blei, and
Michael I Jordan. Matching words and pictures. Journal of machine learning research,
3(Feb):1107–1135, 2003.

[9] Kobus Barnard and David Forsyth. Learning the semantics of words and pictures. In
Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 2,
pages 408–415. IEEE, 2001.

[10] Marco Baroni and Alessandro Lenci. Distributional memory: A general framework for
corpus-based semantics. Computational Linguistics, 36(4):673–721, 2010.

[11] Tamara L Berg, Alexander C Berg, and Jonathan Shih. Automatic attribute discovery and
characterization from noisy web data. In European Conference on Computer Vision, pages
663–676. Springer, 2010.

[12] Eduardo Blanco, Nuria Castell, and Dan I Moldovan. Causal relation extraction. In Pro-
ceedings of the Sixth International Language Resources and Evaluation (LREC’08), 2008.

[13] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a
collaboratively created graph database for structuring human knowledge. In Proceedings of
the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250.
ACM, 2008.

[14] Benjamin Börschinger, Bevan K Jones, and Mark Johnson. Reducing grounded learning
tasks to grammatical inference. In Proceedings of the Conference on Empirical Methods in
Natural Language Processing, pages 1416–1425. Association for Computational Linguistics,
2011.

[15] Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and Robert L Mercer.
The mathematics of statistical machine translation: Parameter estimation. Computational
linguistics, 19(2):263–311, 1993.

[16] Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. Distributional semantics
in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computa-
tional Linguistics: Long Papers-Volume 1, pages 136–145. Association for Computational
Linguistics, 2012.

[17] Elia Bruni, Giang Binh Tran, and Marco Baroni. Distributional semantics from text and
images. In Proceedings of the GEMS 2011 workshop on geometrical models of natural
language semantics, pages 22–32. Association for Computational Linguistics, 2011.

[18] Craig G Chambers, Michael K Tanenhaus, and James S Magnuson. Actions and aﬀordances
in syntactic ambiguity resolution. Journal of experimental psychology: Learning, memory,
and cognition, 30(3):687, 2004.

[19] Stephen V Cole, Matthew D Royal, Marco G Valtorta, Michael N Huhns, and John B
Bowles. A lightweight tool for automatically extracting causal relationships from text. In
SoutheastCon, 2006. Proceedings of the IEEE, pages 125–129. IEEE, 2005.

[20] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and
Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine
Learning Research, 12(Aug):2493–2537, 2011.

[21] Ido Dagan, Oren Glickman, and Bernardo Magnini. The pascal recognising textual en-
tailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer,
2005.

[22] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A
large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.
CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.

[23] Robert MW Dixon and Alexandra Y Aikhenvald. Adjective Classes: A Cross-linguistic
Typology. Explorations in Language and Space C. Oxford University Press, 2006.

[24] Quang Xuan Do, Yee Seng Chan, and Dan Roth. Minimally supervised event causality
identiﬁcation. In Proceedings of the Conference on Empirical Methods in Natural Language
Processing, pages 294–303. Association for Computational Linguistics, 2011.

[25] Malcolm Doering. Verb semantics as denoting change of state in the physical world. Michigan
State University, 2015.

[26] Jeﬀrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini
Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks
for visual recognition and description. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 2625–2634, 2015.

[27] Mark Dredze, Paul McNamee, Delip Rao, Adam Gerber, and Tim Finin. Entity disambigua-
tion for knowledge base population. In Proceedings of the 23rd International Conference
on Computational Linguistics, pages 277–285. Association for Computational Linguistics,
2010.

[28] Curt J Ducasse. On the nature and the observability of the causal relation. The Journal of
Philosophy, 23(3):57–68, 1926.

[29] Emanuele Bastianelli, Giuseppe Castellucci, Danilo Croce, and Roberto Basili. Textual
inference and meaning representation in human robot interaction. In Joint Symposium on
Semantic Processing, page 65, 2013.

[30] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their
attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages
1778–1785. IEEE, 2009.

[31] Alireza Fathi and James M Rehg. Modeling actions through state changes. In Computer
Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2579–2586. IEEE,
2013.

[32] Robert Fergus, Li Fei-Fei, Pietro Perona, and Andrew Zisserman. Learning object categories
from google’s image search. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International
Conference on, volume 2, pages 1816–1823. IEEE, 2005.

[33] Richard E. Fikes and Nils J. Nilsson. Strips: A new approach to the application of theorem
proving to problem solving. In Proceedings of the 2nd International Joint Conference on
Artiﬁcial Intelligence, IJCAI’71, pages 608–620, San Francisco, CA, USA, 1971. Morgan
Kaufmann Publishers Inc.

[34] Amy Fire and Song-Chun Zhu. Learning perceptual causality from video. ACM Transactions
on Intelligent Systems and Technology (TIST), 7(2):23, 2015.

[35] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus
Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual
grounding. arXiv preprint arXiv:1606.01847, 2016.

[36] Qiaozi Gao, Malcolm Doering, Shaohua Yang, and Joyce Y Chai. Physical causality of action
verbs in grounded language understanding. In Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics (ACL), volume 1, pages 1814–1824, 2016.

[37] Qiaozi Gao, Shaohua Yang, Joyce Chai, and Lucy Vanderwende. What action causes this?
towards naive physical action-eﬀect prediction. In Proceedings of the 56th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), pages 934–945,
2018.

[38] Peter Gardenfors. Conceptual spaces as a framework for knowledge representation. Mind
and Matter, 2(2):9–27, 2004.

[39] Malik Ghallab, Dana Nau, and Paolo Traverso. Automated planning: theory & practice.
Elsevier, 2004.

[40] Roxana Girju, Dan I Moldovan, et al. Text mining for causal relations. In FLAIRS Conference,
pages 360–364, 2002.

[41] Yoav Goldberg and Jon Orwant. A dataset of syntactic-ngrams over time from a very
large corpus of english books. In Second Joint Conference on Lexical and Computational
Semantics (* SEM), Volume 1: Proceedings of the Main Conference and the Shared Task:
Semantic Textual Similarity, volume 1, pages 241–247, 2013.

[42] Eugenia Goldvarg and Philip N Johnson-Laird. Naive causality: A mental model theory of
causal meaning and reasoning. Cognitive science, 25(4):565–610, 2001.

[43] Dave Golland, Percy Liang, and Dan Klein. A game-theoretic approach to generating
spatial descriptions. In Proceedings of the 2010 conference on empirical methods in natural
language processing, pages 410–419. Association for Computational Linguistics, 2010.

[44] Alison Gopnik, Laura Schulz, and Laura Elizabeth Schulz. Causal learning: Psychology,
philosophy, and computation. Oxford University Press, 2007.

[45] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka
Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho,
John Agapiou, et al. Hybrid computing using a neural network with dynamic external
memory. Nature, 538(7626):471, 2016.

[46] Clayton Greenberg, Asad Sayeed, and Vera Demberg. Improving unsupervised vector-
space thematic ﬁt evaluation via role-ﬁller prototype clustering. In Proceedings of the 2015
conference of the North American chapter of the Association for Computational Linguistics–
Human Language Technologies, Denver, USA, 2015.

[47] Sergio Guadarrama, Lorenzo Riano, Dave Golland, Daniel Go, Yangqing Jia, Dan Klein,
Pieter Abbeel, Trevor Darrell, et al. Grounding spatial relations for human-robot interaction.
In Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on,
pages 1640–1647. IEEE, 2013.

[48] Saurabh Gupta and Jitendra Malik. Visual semantic role labeling. arXiv preprint
arXiv:1505.04474, 2015.

[49] Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R Bowman,
and Noah A Smith. Annotation artifacts in natural language inference data. arXiv preprint
arXiv:1803.02324, 2018.

[50] Stevan Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena,
42(1-3):335–346, 1990.

[51] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for
image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 770–778, 2016.

[52] Mikael Henaﬀ, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. Tracking
the world state with recurrent entity networks. arXiv preprint arXiv:1612.03969, 2016.

[53] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The goldilocks prin-
ciple: Reading children’s books with explicit memory representations. arXiv preprint
arXiv:1511.02301, 2015.

[54] Malka Rappaport Hovav and Beth Levin. Reﬂections on Manner / Result Complementarity.
Lexical Semantics, Syntax, and Event Structure, pages 21–38, 2010.

[55] Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, and Trevor Darrell.
Natural language object retrieval. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 4555–4564, 2016.

[56] Gerhard Jäger. Natural color categories are convex sets. In Logic, language and meaning,
pages 11–20. Springer, 2010.

[57] Dinesh Jayaraman and Kristen Grauman. Zero-shot recognition with unreliable attributes.
In Advances in neural information processing systems, pages 3464–3472, 2014.

[58] Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension
systems. arXiv preprint arXiv:1707.07328, 2017.

[59] Andrej Karpathy, Armand Joulin, and Li F Fei-Fei. Deep fragment embeddings for bidi-
rectional image sentence mapping. In Advances in neural information processing systems,
pages 1889–1897, 2014.

[60] Christopher Kennedy and Louise McNally. Scale structure and the semantic typology of
gradable predicates. Language, 81(2):345–381, 2005.

[61] Lyndon S Kennedy, Shih-Fu Chang, and Igor V Kozintsev. To search or to label?: predicting
the performance of search-based automatic image classiﬁers. In Proceedings of the 8th ACM
international workshop on Multimedia information retrieval, pages 249–258. ACM, 2006.

[62] Casey Kennington, Spyros Kousidis, and David Schlangen. Situated incremental natural lan-
guage understanding using a multimodal, linguistically-driven update model. In Proceedings
of COLING 2014, the 25th International Conference on Computational Linguistics: Tech-
nical Papers, pages 1803–1812, 2014.

[63] Casey Kennington and David Schlangen. Simple learning and compositional application of
perceptually grounded word meanings for incremental reference resolution. In Proceedings
of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th
International Joint Conference on Natural Language Processing (Volume 1: Long Papers),
volume 1, pages 292–301, 2015.

[64] Joohyun Kim and Raymond J Mooney. Unsupervised pcfg induction for grounded language
learning with highly ambiguous supervision. In Proceedings of the 2012 Joint Conference on
Empirical Methods in Natural Language Processing and Computational Natural Language
Learning, pages 433–444. Association for Computational Linguistics, 2012.

[65] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.

[66] Paul Kingsbury and Martha Palmer. From treebank to propbank. In Proceedings of the 3rd
International Conference on Language Resources and Evaluation (LREC2002), 2002.

[67] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Anto-
nio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in neural information
processing systems, pages 3294–3302, 2015.

[68] Thomas Kollar, Stefanie Tellex, Deb Roy, and Nicholas Roy. Toward understanding natural
language directions. In Proceedings of the 5th ACM/IEEE international conference on
Human-robot interaction, pages 259–266. IEEE Press, 2010.

[69] Hema Swetha Koppula, Rudhir Gupta, and Ashutosh Saxena. Learning human activities
and object aﬀordances from rgb-d videos. The International Journal of Robotics Research,
32(8):951–970, 2013.

[70] Jayant Krishnamurthy and Thomas Kollar. Jointly learning to parse and perceive: Connecting
natural language to the physical world. Transactions of the Association for Computational
Linguistics, 1:193–206, 2013.

[71] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N
Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al.
Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia. Semantic
Web, 6(2):167–195, 2015.

[72] Hector Levesque, Ernest Davis, and Leora Morgenstern. The winograd schema challenge.
In Thirteenth International Conference on the Principles of Knowledge Representation and
Reasoning, 2012.

[73] Beth Levin. English verb classes and alternations: A preliminary investigation. University
of Chicago press, 1993.

[74] Beth Levin and Malka Rappaport Hovav. Lexicalized scales and verbs of scalar change. In
46th Annual Meeting of the Chicago Linguistics Society, 2010.

[75] Zhongyang Li, Tongfei Chen, and Benjamin Van Durme. Learning to rank for plausible
plausibility. arXiv preprint arXiv:1906.02079, 2019.

[76] Dahua Lin, Sanja Fidler, Chen Kong, and Raquel Urtasun. Visual semantic search: Retriev-
ing videos via complex textual queries. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 2657–2664, 2014.

[77] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan,
Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In
European conference on computer vision, pages 740–755. Springer, 2014.

[78] Changsong Liu and Joyce Y. Chai. Learning to mediate perceptual diﬀerences in situated
human-robot dialogue. In Proceedings of the 29th AAAI Conference on Artiﬁcial Intelligence
(AAAI’15), pages 2288–2294, Austin, TX, 2015.

[79] Changsong Liu, Lanbo She, Rui Fang, and Joyce Y. Chai. Probabilistic labeling for eﬃcient
referential grounding based on collaborative discourse. In Proceedings of the 52nd Annual
Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages
13–18, Baltimore, MD, 2014.

[80] Fei Liu, Trevor Cohn, and Timothy Baldwin. Narrative modeling with memory chains and
semantic supervision. arXiv preprint arXiv:1805.06122, 2018.

[81] Hugo Liu and Push Singh. Conceptnet - a practical commonsense reasoning tool-kit. BT
technology journal, 22(4):211–226, 2004.

[82] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-
attention for visual question answering. In Advances In Neural Information Processing
Systems, pages 289–297, 2016.

[83] Wei Lu, Hwee Tou Ng, Wee Sun Lee, and Luke S Zettlemoyer. A generative model for
parsing natural language to meaning representations.
In Proceedings of the Conference
on Empirical Methods in Natural Language Processing, pages 783–792. Association for
Computational Linguistics, 2008.

[84] Zhiyi Luo, Yuchen Sha, Kenny Q Zhu, Seung-won Hwang, and Zhongyuan Wang. Com-
monsense causal reasoning between short texts. In Fifteenth International Conference on
the Principles of Knowledge Representation and Reasoning, 2016.

[85] Cynthia Matuszek, Liefeng Bo, Luke Zettlemoyer, and Dieter Fox. Learning from un-
scripted deictic gesture and language for human-robot interactions. In Twenty-Eighth AAAI
Conference on Artiﬁcial Intelligence, 2014.

[86] Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox.
A joint model of language and perception for grounded attribute learning. arXiv preprint
arXiv:1206.6423, 2012.

[87] Cynthia Matuszek, Evan Herbst, Luke Zettlemoyer, and Dieter Fox. Learning to parse natural
language commands to a robot control system. In Experimental Robotics, pages 403–415.
Springer, 2013.

[88] Brian McMahan and Matthew Stone. A bayesian model of grounded color semantics.
Transactions of the Association for Computational Linguistics, 3:103–115, 2015.

[89] Ken McRae, Michael J Spivey-Knowlton, and Michael K Tanenhaus. Modeling the inﬂuence
of thematic ﬁt (and other constraints) in on-line sentence comprehension. Journal of Memory
and Language, 38(3):283–312, 1998.

[90] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeﬀ Dean. Distributed
representations of words and phrases and their compositionality. In Advances in neural
information processing systems, pages 3111–3119, 2013.

[91] Anton Milan, Stefan Roth, and Kaspar Schindler. Continuous energy minimization for
multitarget tracking. Pattern Analysis and Machine Intelligence, IEEE Transactions on,
36(1):58–72, 2014.

[92] George A Miller. Wordnet: a lexical database for english. Communications of the ACM,
38(11):39–41, 1995.

[93] Dipendra K Misra, Jaeyong Sung, Kevin Lee, and Ashutosh Saxena. Tell me dave: Context-
sensitive grounding of natural language to manipulation instructions. Robotics: Science and
Systems (RSS), 2014.

[94] Dipendra Kumar Misra, Kejia Tao, Percy Liang, and Ashutosh Saxena. Environment-driven
lexicon induction for high-level instructions. In Proceedings of the 53rd Annual Meeting of
the Association for Computational Linguistics and the 7th International Joint Conference on
Natural Language Processing (Volume 1: Long Papers), volume 1, pages 992–1002, 2015.

[95] Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy
Vanderwende, Pushmeet Kohli, and James Allen. A corpus and cloze evaluation for deeper
understanding of commonsense stories. In Proceedings of the 2016 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language
Technologies, pages 839–849, 2016.

[96] Tanmoy Mukherjee and Timothy Hospedales. Gaussian visual-linguistic embedding for
zero-shot recognition. In Proceedings of the 2016 Conference on Empirical Methods in
Natural Language Processing, pages 912–918, 2016.

[97] Rutu Mulkar-Mehta, Christopher Welty, Jerry R Hobbs, and Eduard Hovy. Using granularity
concepts for discovering causal relations. In Proceedings of the FLAIRS conference, 2011.

[98] Iftekhar Naim, Young C. Song, Qiguang Liu, Liang Huang, Henry Kautz, Jiebo Luo, and
Daniel Gildea. Discriminative unsupervised alignment of natural language instructions
with corresponding video segments. In Proceedings of NAACL HLT 2015, pages 164–174,
Denver, Colorado, May–June 2015. Association for Computational Linguistics.

[99] Ad Neeleman, Hans Van de Koot, et al. The linguistic expression of causation. The Theta
System: Argument Structure at the Interface, page 20, 2012.

[100] Timothy Niven and Hung-Yu Kao. Probing neural network comprehension of natural
language arguments. arXiv preprint arXiv:1907.07355, 2019.

[101] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment
models. Computational Linguistics, 29(1):19–51, 2003.

[102] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Is object localization for free?
- weakly-supervised learning with convolutional neural networks. In The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), June 2015.

[103] Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkilä, and Naokazu Yokoya. Learning
joint representations of videos and sentences with web image search. In European Conference
on Computer Vision, pages 651–667. Springer, 2016.

[104] Ulrike Padó. The integration of syntax and semantic plausibility in a wide-coverage model
of human sentence processing. PhD thesis, Universitätsbibliothek, 2007.

[105] Martha Palmer, Daniel Gildea, and Paul Kingsbury. The proposition bank: An annotated
corpus of semantic roles. Computational linguistics, 31(1):71–106, 2005.

[106] Megha Pandey and Svetlana Lazebnik. Scene recognition and weakly supervised object
localization with deformable part-based models. In 2011 International Conference on Com-
puter Vision, pages 1307–1314. IEEE, 2011.

[107] Judea Pearl et al. Causal inference in statistics: An overview. Statistics surveys, 3:96–146,
2009.

[108] Jeﬀrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for
word representation. In Proceedings of the 2014 conference on empirical methods in natural
language processing (EMNLP), pages 1532–1543, 2014.

[109] Sameer S Pradhan, Wayne Ward, Kadri Hacioglu, James H Martin, and Daniel Jurafsky.
Shallow semantic parsing using support vector machines. In HLT-NAACL, pages 233–240,
2004.

[110] J Pustejovsky. The syntax of event structure. Cognition, 41(1-3):47–81, 1991.

[111] Kira Radinsky, Sagie Davidovich, and Shaul Markovitch. Learning causality for news events
prediction. In Proceedings of the 21st international conference on World Wide Web, pages
909–918. ACM, 2012.

[112] Nazneen Fatema Rajani and Raymond J Mooney. Combining supervised and unsupervised
ensembles for knowledge base population. In Proceedings of the 2016 Conference on
Empirical Methods in Natural Language Processing (EMNLP-16), 2016.

[113] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+
questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.

[114] Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and
Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping.
arXiv preprint arXiv:1412.6596, 2014.

[115] Terry Regier and Laura A Carlson. Grounding spatial language in perception: an empirical
and computational investigation. Journal of experimental psychology: General, 130(2):273,
2001.

[116] Terry Regier, Paul Kay, and Richard S Cook. Focal colors are universal after all. Proceedings
of the National Academy of Sciences of the United States of America, 102(23):8386–8391,
2005.

[117] Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and
Manfred Pinkal. Grounding action descriptions in videos. Transactions of the Association
for Computational Linguistics (TACL), 1:25–36, 2013.

[118] Mehwish Riaz and Roxana Girju. In-depth exploitation of noun and verb semantics to
identify causation in verb-noun pairs. In 15th Annual Meeting of the Special Interest Group
on Discourse and Dialogue (SIGDial), page 161, 2014.

[119] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alter-
natives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium
Series, 2011.

[120] Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele.
Grounding of textual phrases in images by reconstruction. In European Conference on
Computer Vision, pages 817–834. Springer, 2016.

[121] Anna Rohrbach, Marcus Rohrbach, Wei Qiu, Annemarie Friedrich, Manfred Pinkal, and
Bernt Schiele. Coherent multi-sentence video description with variable level of detail. In
Pattern Recognition, pages 184–195. Springer, 2014.

[122] Raquel Ros, Séverin Lemaignan, E Akin Sisbot, Rachid Alami, Jasmin Steinwender, Katha-
rina Hamann, and Felix Warneken. Which one? grounding the referent based on eﬃcient
human-robot interaction. In 19th International Symposium in Robot and Human Interactive
Communication, pages 570–575. IEEE, 2010.

[123] Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. Semi-supervised self-training
of object detection models. In Application of Computer Vision, 2005. WACV/MOTIONS’05
Volume 1. Seventh IEEE Workshops on, volume 1, pages 29–36. IEEE, 2005.

[124] Deb Roy. Grounding words in perception and action: computational insights. Trends in
cognitive sciences, 9(8):389–396, 2005.

[125] Deb Roy and Alex Pentland. Learning words from sights and sounds: A computational
model. Cognitive science, 26(1):113–146, 2002.

[126] S. Russell and P. Norvig. Artiﬁcial Intelligence: A Modern Approach. Prentice Hall, 2010.

[127] Karin Kipper Schuler. VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. PhD
thesis, University of Pennsylvania, 2005.

[128] Roy Schwartz, Maarten Sap, Ioannis Konstas, Li Zilles, Yejin Choi, and Noah A Smith. The
eﬀect of diﬀerent writing tasks on linguistic style: A case study of the roc story cloze task.
arXiv preprint arXiv:1702.01841, 2017.

[129] Paul Scovanner, Saad Ali, and Mubarak Shah. A 3-dimensional sift descriptor and its
application to action recognition. In Proceedings of the 15th ACM international conference
on Multimedia, pages 357–360. ACM, 2007.

[130] Rebecca Sharp, Mihai Surdeanu, Peter Jansen, Peter Clark, and Michael Hammond. Creating
causal embeddings for question answering with minimal supervision. In Proceedings of the
2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages
138–148, 2016.

[131] Lanbo She and Joyce Chai. Incremental acquisition of verb hypothesis space towards
physical world interaction. In Proceedings of the 54th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), volume 1, pages 108–117, 2016.

[132] Lanbo She and Joyce Chai. Interactive learning of grounded verb semantics towards human-
robot communication. In Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1634–1644, 2017.

[133] Lanbo She, Yu Cheng, Joyce Y Chai, Yunyi Jia, Shaohua Yang, and Ning Xi. Teaching
robots new actions through natural language instructions. In The 23rd IEEE International
Symposium on Robot and Human Interactive Communication, pages 868–873. IEEE, 2014.

[134] Lanbo She, Shaohua Yang, Yu Cheng, Yunyi Jia, Joyce Chai, and Ning Xi. Back to the blocks
world: Learning new actions through situated human-robot dialogue. In Proceedings of the
15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL),
pages 89–97, 2014.

[135] Dan Shen and Mirella Lapata. Using semantic roles to improve question answering. In
EMNLP-CoNLL, pages 12–21, 2007.

[136] Push Singh, Thomas Lin, Erik T Mueller, Grace Lim, Travell Perkins, and Wan Li Zhu. Open
mind common sense: Knowledge acquisition from the general public. In OTM Confederated
International Conferences "On the Move to Meaningful Internet Systems", pages 1223–1237.
Springer, 2002.

[137] Jeﬀrey M Siskind. Naive physics, event perception, lexical semantics, and language
acquisition. Technical report, DTIC Document, 1993.

[138] Jeﬀrey Mark Siskind. Grounding language in perception. Artiﬁcial Intelligence Review,
8(5-6):371–391, 1994.

[139] Jeﬀrey Mark Siskind. Grounding the lexical semantics of verbs in visual perception using
force dynamics and event logic. J. Artif. Intell. Res. (JAIR), 15:31–90, 2001.

[140] Marjorie Skubic, Dennis Perzanowski, Samuel Blisard, Alan Schultz, William Adams,
Magda Bugajska, and Derek Brock. Spatial language for human-robot dialogs. IEEE Trans-
actions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 34(2):154–
167, 2004.

[141] Grace Song and Phillip Wolﬀ. Linking perceptual properties to the linguistic expression of
causation. Language, culture and mind, pages 237–250, 2003.

[142] Robert Speer, Joshua Chin, and Catherine Havasi. Conceptnet 5.5: An open multilingual
graph of general knowledge. In Thirty-First AAAI Conference on Artiﬁcial Intelligence,
2017.

[143] Robert Speer and Catherine Havasi. Representing general relational knowledge in
conceptnet 5. In LREC, pages 3679–3686, 2012.

[144] Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: a core of semantic
knowledge. In Proceedings of the 16th international conference on World Wide Web, pages
697–706. ACM, 2007.

[145] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In
Advances in neural information processing systems, pages 2440–2448, 2015.

[146] Mihai Surdeanu, Massimiliano Ciaramita, and Hugo Zaragoza. Learning to rank answers
to non-factoid questions from web collections. Computational linguistics, 37(2):351–383,
2011.

[147] Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R Walter, Ashis Gopal Banerjee,
Seth J Teller, and Nicholas Roy. Understanding natural language commands for robotic
navigation and mobile manipulation. In AAAI, volume 1, page 2, 2011.

[148] Stefanie Tellex, Pratiksha Thaker, Joshua Joseph, and Nicholas Roy. Learning perceptually
grounded word meanings from unaligned parallel data. Machine Learning, 94(2):151–167,
2014.

[149] Yao-Hung Hubert Tsai and Ruslan Salakhutdinov. Improving one-shot learning through
fusing side information. arXiv preprint arXiv:1710.08347, 2017.

[150] Matthew R Walter, Sachithra Hemachandra, Bianca Homberg, Stefanie Tellex, and Seth
Teller. Learning semantic maps from natural language descriptions. In Robotics: Science
and Systems, 2013.

[151] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Action recognition
by dense trajectories. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE
Conference on, pages 3169–3176. IEEE, 2011.

[152] Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-
text embeddings. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 5005–5013, 2016.

[153] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merriënboer,
Armand Joulin, and Tomas Mikolov. Towards ai-complete question answering: A set of
prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.

[154] Max Whitney and Anoop Sarkar. Bootstrapping via graph propagation. In Proceedings of the
50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume
1, pages 620–628. Association for Computational Linguistics, 2012.

[155] Phillip Wolﬀ. Direct causation in the linguistic coding and individuation of causal events.
Cognition, 88(1):1–48, 2003.

[156] Phillip Wolﬀ and Grace Song. Models of causation and the semantics of causal verbs.
Cognitive Psychology, 47(3):276–332, 2003.

[157] Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den Hengel. Ask me
anything: Free-form visual question answering based on knowledge from external sources.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
4622–4630, 2016.

[158] Caiming Xiong, Stephen Merity, and Richard Socher. Dynamic memory networks for visual
and textual question answering. arXiv, 1603, 2016.

[159] Xun Xu, Timothy Hospedales, and Shaogang Gong. Transductive zero-shot action recogni-
tion by word-vector embedding. International Journal of Computer Vision, 123(3):309–333,
2017.

[160] Xun Xu, Timothy M Hospedales, and Shaogang Gong. Multi-task zero-shot action recogni-
tion with prioritised data augmentation. In European Conference on Computer Vision, pages
343–359. Springer, 2016.

[161] Zhongwen Xu, Linchao Zhu, and Yi Yang. Few-shot object recognition from machine-
labeled web images. In The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2017.

[162] Shaohua Yang, Qiaozi Gao, Changsong Liu, Caiming Xiong, Song-Chun Zhu, and Joyce Y.
Chai. Grounded semantic role labeling. In Proceedings of the 2016 Conference of the North
American Chapter of the Association for Computational Linguistics, San Diego, CA, 2016.

[163] Xuefeng Yang and Kezhi Mao. Multi level causal relation identiﬁcation using extended
features. Expert Systems with Applications, 41(16):7171–7181, 2014.

[164] Yezhou Yang, Cornelia Fermüller, and Yiannis Aloimonos. Detection of manipulation
action consequences (mac). In Computer Vision and Pattern Recognition (CVPR), 2013
IEEE Conference on, pages 2563–2570. IEEE, 2013.

[165] Yezhou Yang, Anupam Guha, C Fermuller, and Yiannis Aloimonos. A cognitive system
for understanding human manipulation actions. Advances in Cognitive Systems, 3:67–86,
2014.

[166] Xuchen Yao and Benjamin Van Durme. Semi-markov phrase-based monolingual alignment.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing,
pages 590–600. Association for Computational Linguistics, 2013.

[167] Mark Yatskar, Vicente Ordonez, and Ali Farhadi. Stating the obvious: Extracting visual
common sense knowledge. In Proceedings of NAACL-HLT, pages 193–198, 2016.

[168] Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. Situation recognition: Visual semantic
role labeling for image understanding. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 5534–5542, 2016.

[169] Chen Yu and Dana H Ballard. On the integration of grounding language and learning objects.
In AAAI, volume 4, pages 488–493, 2004.

[170] Haonan Yu, N Siddharth, Andrei Barbu, and Jeﬀrey Mark Siskind. A compositional frame-
work for grounding language inference, generation, and acquisition in video. Journal of
Artiﬁcial Intelligence Research, 52:601–713, 2015.

[171] Haonan Yu and Jeﬀrey Mark Siskind. Grounded language learning from video described with
sentences. In Proceedings of the 51st Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 53–63, 2013.

[172] John M Zelle and Raymond J Mooney. Learning semantic grammars with constructive
inductive logic programming. In AAAI, pages 817–822, 1993.

[173] John M Zelle and Raymond J Mooney. Inducing deterministic prolog parsers from treebanks:
A machine learning approach. In AAAI, pages 748–753, 1994.

[174] John M Zelle and Raymond J Mooney. Learning to parse database queries using inductive
logic programming. In Proceedings of the national conference on artiﬁcial intelligence,
pages 1050–1055, 1996.

[175] Jie Zhou and Wei Xu. End-to-end learning of semantic role labeling using recurrent neural
networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational
Linguistics and the 7th International Joint Conference on Natural Language Processing
(Volume 1: Long Papers), pages 1127–1137, Beijing, China, July 2015. Association for
Computational Linguistics.
