GROUNDED LANGUAGE PROCESSING FOR ACTION UNDERSTANDING AND JUSTIFICATION

By

Shaohua Yang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science — Doctor of Philosophy

2019

ABSTRACT

GROUNDED LANGUAGE PROCESSING FOR ACTION UNDERSTANDING AND JUSTIFICATION

By

Shaohua Yang

Recent years have witnessed an increasing interest in cognitive robots entering our lives. In order to reason, collaborate, and communicate with humans in the shared physical world, agents need to understand the meaning of human language, especially of actions, and connect them to the physical world. Furthermore, to make the communication more transparent and trustworthy, agents should have a human-like ability to justify their actions and explain their decision-making behaviors. The goal of this dissertation is to develop approaches that learn to understand actions in the perceived world through language communication. Towards this goal, we study three related problems.¹

¹ This dissertation was supported by IIS-1617682 from the National Science Foundation and the DARPA XAI program under a subcontract from UCLA (N66001-17-2-4029).

Semantic role labeling captures semantic roles (or participants) such as agent, patient, and theme associated with verbs in text. While it provides important intermediate semantic representations for many traditional NLP tasks, it does not capture grounded semantics with which an artificial agent can reason, learn, and perform actions. We utilize semantic role labeling to connect visual semantics with linguistic semantics. On one hand, this structured semantic representation helps extend traditional visual scene understanding beyond simple object recognition and relation detection, which is important for human-robot collaboration tasks. On the other hand, due to the shared common ground, not every language instruction is fully specified explicitly. We propose to ground not only explicit semantic roles, but also implicit roles that are left unstated during communication. Our empirical results show that by incorporating semantic information, we achieve better grounding performance and a better semantic representation of the visual world.

Another challenge for an agent is to explain to humans why it recognizes an ongoing activity as a certain action. With recent advances in deep learning, many models have proven very effective at action recognition. However, most of them function as black boxes and provide no interpretation of the decisions they make. To enable collaboration and communication between humans and agents, we developed a generative conditional variational autoencoder (CVAE) approach which allows the agent to learn to acquire commonsense evidence for action justification. Our empirical results show that, compared to a typical attention-based model, CVAE has a significantly higher explanation ability in terms of identifying correct commonsense evidence to justify perceived actions. An experiment on communication grounding further shows that the commonsense evidence identified by CVAE can be communicated to humans to achieve a significantly higher common ground between humans and agents.

The third problem combines action grounding with action justification in the context of visual commonsense reasoning. Humans have tremendous visual commonsense knowledge with which to answer such questions and justify the rationale, but agents do not.
On one hand, this process requires the agent to jointly ground both the answers and the rationales to the images. On the other hand, it also requires the agent to learn the relation between the answer and the rationale. We propose a deep factorized model to better understand the relations between the image, question, answer, and rationale. Our empirical results show that the proposed model outperforms strong baselines in overall performance. By explicitly modeling the factors of language grounding and commonsense reasoning, the proposed model provides a better understanding of the effects of these factors on grounded action justification.

ACKNOWLEDGMENTS

First and foremost, I am tremendously grateful to my advisor Dr. Joyce Y. Chai for her continuous support and guidance. She showed me how to think critically, explore new problems, ask good questions, and do good research. All of these experiences will have a great influence on my whole life. Her great insights into human-robot interaction and action understanding have always shed light on the problems I have been working on. Without her continuous advice, inspiration, and guidance during my PhD study, this work would have been impossible.

I would also like to thank my dissertation committee members: Dr. Arun Ross, Dr. Xiaoming Liu, and Dr. Taosheng Liu. I greatly appreciate their valuable feedback on every step of my PhD journey.

I am very happy to have had the opportunity to collaborate with an amazing group of students and researchers: Dr. Changsong Liu, Dr. Lanbo She, and Dr. Rui Fang provided great suggestions and direction when I started my research career as a PhD student. Thanks to Dr. Qiaozi Gao and Sari Saba-Sadiya for their great efforts and enlightening comments. I also appreciate my co-authors on various papers: Dr. Yu Cheng, Dr. Yunyi Jia, Dr. Ning Xi, Dr. Caiming Xiong, Dr. Songchun Zhu, Malcolm Doering, Nishant Shukla, Yunzhong He, Guangyue Xu, and Dr. Lucy Vanderwende. I would like to thank all my friends at MSU, who made my time at MSU enjoyable.

Finally, this thesis is dedicated to my parents, for all the years of your selfless love and support.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1 Introduction
  1.1 Background
  1.2 Challenges
  1.3 Contributions
  1.4 Organization of Dissertation
Chapter 2 Related Work
  2.1 Verb Semantics
    2.1.1 Distributed Semantics
    2.1.2 Semantic Role Labeling
  2.2 Grounded Language Learning
  2.3 Explainable Artificial Intelligence
  2.4 Visual Commonsense Reasoning
Chapter 3 Grounded Semantic Role Labeling
  3.1 Introduction
  3.2 Grounded Semantic Role Labeling
    3.2.1 Problem Formulation
  3.3 CRF-based Learning and Inference
    3.3.1 Conditional Random Field
    3.3.2 CRF for Grounded Semantic Role Labeling
  3.4 Dataset for Grounded Semantic Role Labeling
    3.4.1 Dataset Collection
    3.4.2 Automated Processing
  3.5 Evaluation of Grounded Semantic Role Labeling
    3.5.1 Experimental Setup
    3.5.2 Results
    3.5.3 Discussion
  3.6 Conclusion
Chapter 4 Commonsense Action Explanation in Human-Agent Communication
  4.1 Introduction
  4.2 A Study On Justification Explanation
  4.3 Method
    4.3.1 Conditional Variational Autoencoder
    4.3.2 Conditional Variational Autoencoder with Supervision (CVAE+SV)
  4.4 Data Collection
  4.5 Evaluation on Action Explanation
    4.5.1 Action Prediction and Explanation
    4.5.2 Incremental Study
    4.5.3 Semi-Supervised Learning
    4.5.4 Visual Simulator
  4.6 Commonsense Justification towards Common Ground
    4.6.1 Experiment Setup
    4.6.2 Experimental Results
  4.7 Conclusion
Chapter 5 Grounded Action Justification
  5.1 Introduction
  5.2 R2C Dataset Augmentation
  5.3 Deep Factorized Joint Modeling for Grounded Question Answering and Explanation
    5.3.1 Motivation
    5.3.2 Deep Factorized Modeling
      5.3.2.1 Visual Question Answering
      5.3.2.2 Visual Question Rationale (VQR)
      5.3.2.3 Causal Matching Between the Answer and the Rationale
    5.3.3 Training and Inference
  5.4 Experiments and Results
    5.4.1 Dataset Statistics
    5.4.2 Results
    5.4.3 Ablation Study
    5.4.4 Error Analysis
  5.5 Conclusion
Chapter 6 Conclusion and Future Work
  6.1 Conclusions
  6.2 Future Directions
BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: Statistics for a set of verbs and their semantic roles in our annotated dataset. The entry indicates the number of explicit/implicit roles for each category. "–" denotes no such role is observed in the data.
Table 3.2: Evaluation results based on annotated language parsing.
Table 3.3: Evaluation results based on automated language parsing.
Table 4.1: Statistics for the average relations/attributes Mean and Std for each verb in the dataset.
Table 4.2: Statistics for the categories of annotated relations/attributes for each verb.
Table 4.3: Action Prediction Accuracy and Evidence Selection MAP.
Table 4.4: Results from the human study on communication grounding.
Table 4.5: Target Actions and Simple/Hard Confusion Actions.
Table 4.6: Results from the human subject study on common ground.
Table 5.1: Basic statistics for the augmented R2C dataset.
Table 5.2: Validation and Test Accuracy for all the models.
Table 5.3: Ablation Study Results.
Table 5.4: Effect of factors on different actions.
Table 5.5: Error rates among samples with different lengths of ground truth rationales.

LIST OF FIGURES

Figure 3.1: An example of grounded semantic role labeling for the sentence the woman takes out a cucumber from the refrigerator. The left hand side shows three frames of a video clip with the corresponding language description. The objects in the bounding boxes are tracked and each track has a unique identifier. The right hand side shows the grounding results where each role including the implicit role (destination) is grounded to a track id.
Figure 3.2: The CRF structure of the sentence "the person takes out a cutting board from the drawer". The text in the square brackets indicates the corresponding semantic role.
Figure 3.3: The linear conditional random field structure.
Figure 3.4: The TACOS dataset.
Figure 3.5: The VATIC Annotation Interface for the TACOS dataset.
Figure 3.6: The relation between the accuracy and the entropy of each verb's patient from the gold language, gold visual recognition/tracking setting. The entropy for the patient role of each verb is shown below the verb.
Figure 4.1: Graphical model representation of the Conditional Variational Auto-encoder.
Figure 4.2: System Architecture for the CVAE model.
Figure 4.3: Example Crowdsourcing Annotations in which bold relations/attributes are annotated as gold.
Figure 4.4: The system architecture for the attention-based method.
Figure 4.5: Action prediction accuracy in the incremental study.
Figure 4.6: Evidence selection MAP in the incremental study.
Figure 4.7: Evidence selection MAP for semi-supervised learning.
Figure 4.8: Action Accuracy for the Visual Simulator.
Figure 4.9: Evidence MAP for the Visual Simulator.
Figure 4.10: The experimental setup for the human subject study examining the role of commonsense justification towards common ground.
Figure 4.11: Examples of the communication grounding study based on different models.
Figure 5.1: An example augmented R2C sample.
Figure 5.2: A graphical model representation of the joint VCR task.
Figure 5.3: The neural network architecture for the deep factorized network.
Figure 5.4: The length distribution of the sentences in the training dataset.
Figure 5.5: The length distribution of the sentences in the validation dataset.
Figure 5.6: The length distribution of the sentences in the testing dataset.
Figure 5.7: The histogram of action frequency in the training dataset.
Figure 5.8: Attention visualization for the VQA factor.
Figure 5.9: Attention visualization for the VQR factor.
Figure 5.10: Attention visualization for the AR factor.
Figure 5.11: The ratio of different error types.
Figure 5.12: The inverse ratio of different factors.

Chapter 1 Introduction

1.1 Background

It has been a long dream to have intelligent agents that can help and collaborate with humans in everyday life, such as housekeeping, medical assistance, and so on. Imagine such a scene: you instruct a robot in natural language to complete a cooking task in the kitchen. When you say "take out the vegetable", the agent can quickly move to the location where the vegetable is and bring it back to you. In order for the agent to successfully complete this task, it needs advanced perception and reasoning abilities.

Compared with traditional robots, which are hard-coded to perform specific tasks, this new generation of robots needs to adapt to dynamic tasks and environments. That is to say, we cannot rely on fixed, pre-programmed procedures.
It is impossible to enumerate all possible tasks and environments and write programs for each of them. Instead, we need to develop robots that can learn and generalize to new environments and new tasks through language communication. To achieve this goal, we need to develop intelligent agents that can connect natural language with the robot's perception. During this process, recognizing and understanding actions is of significant importance, as verbs constitute a core part of natural language instructions. Although there has been a lot of research on object recognition and referential grounding, less work has been done to bridge structured linguistic semantics and visual perception, which is essential for human-robot interaction.

Machine learning approaches, especially deep learning approaches, have achieved exciting performance on various tasks. However, they often function as black boxes whose behaviors are hard to interpret. To address this problem, recent years have witnessed an increasing interest in explainable artificial intelligence (XAI). The goal is to make black-box machine learning models more transparent and trustworthy. For example, in the medical diagnosis field, it is helpful for an agent to provide some evidence or justification to doctors for any predictions or recommendations it makes. Some recent work also performs post-hoc analysis to understand the behaviors of neural networks through feature visualization, attention visualization, and so on.

Different from post-hoc explanations, we propose to jointly model the prediction and justification processes, as they are often coupled in human cognition. In this way, the agent provides not only predictions but also the reasoning that justifies them. Such justification will allow human users to better interpret machines' behaviors and enable more mutual understanding and common ground.

In the following sections of this chapter, we first identify the research challenges. Then we describe our contributions towards the goal of grounded action understanding and justification.

1.2 Challenges

As discussed in the background section, it is important for robots not only to automatically connect the visual world to natural language, but also to infer justifications behind decision making. To address this issue, the following questions need to be addressed:

1. Natural language is represented as discrete and symbolic word sequences, but the robots' surrounding worlds are commonly continuous in nature, such as images and videos. There exists a big gap between representations of natural language and the visual world. How can we connect the discrete language representation and the continuous visual representation so that AI agents can understand the meanings of language with respect to the physical world?

2. During human-human conversation, some information is not explicitly mentioned because it is commonly known and shared by both parties. However, for a robot that lacks commonsense knowledge, it is important to infer the implicit information to get a comprehensive understanding of human language. Whether and how this implicit knowledge can be acquired remain open problems.

3. To allow agents to reason about and justify recognized actions, the first key step is to understand how humans make justifications and what strategies humans apply to justify a perceived action.
Exploring an effective representation for action justification is of significant importance for further problem formulation.

4. The relations between justifications and predictions are complex. On one hand, the justifications (or evidence) provide strong support for the predicted action; on the other hand, the action prediction gives more context about which justifications to select. The physical world contains multiple actions, and different actions may have different justifications, so jointly modeling action prediction and justification is important for handling the diversity of actions and justifications. Furthermore, annotating action justifications is time-consuming and expensive; can we build models that learn to predict and justify efficiently while alleviating human annotation effort?

5. Finally, instead of understanding action prediction and justification purely based on textual relations, it is more meaningful to connect action prediction and action justification with the visual world. The agent needs to ground not only the target action but also the action justifications. How can we model the relations between the visual world, actions, and action justifications to allow agents to get a comprehensive understanding of actions?

1.3 Contributions

Towards the goal of building agents that can ground action semantics and action justifications, the contributions of this dissertation are as follows:

1. To understand natural language, a variety of semantic representations have been proposed and studied by linguists. One such representation is distributed semantics, which represents each word as a continuous vector in a high-dimensional space, but it is limited in learning structured relations between different words or entities in a sentence. To explicitly model these relations, frame-based verb semantics defines a frame of thematic roles (also referred to as semantic roles or verb arguments) for each verb to capture the semantics of different verbs [52]. For example, a verb can be characterized by agent (i.e., the animator of the action) and patient (i.e., the object on which the action is acted upon), and other roles such as instrument, source, destination, etc. Given a verb frame, the goal of Semantic Role Labeling is to identify linguistic entities from the text that serve different thematic roles [14, 28, 67, 111]. For instance, given the sentence the woman takes out a cucumber from the refrigerator, takes out is the main verb (also called the predicate); the noun phrase the woman is the agent of this action; a cucumber is the patient; and the refrigerator is the source. Frame-based verb semantics captures verb meaning explicitly by connecting it with other words/entities, which is easier for humans to interpret and has a more fine-grained structure. Linguists have developed several large frame-based knowledge bases, including VerbNet [85], FrameNet [3], and PropBank [45]. However, based on purely symbolic representations, it is difficult for the robot to connect text to the situated world. In the previous example, the robot needs to understand what the cucumber means, where it is, and what the relation is between the object cucumber and the action takes out. To overcome this limitation, we developed an approach to jointly understand language and vision by incorporating linguistic semantic role information.
To be specific, we propose a probabilistic graphical model to ground each semantic role to possible object tracks in the perceived world. In this way, we connect low-level image pixels to high-level linguistic structure.

2. During human-human conversation, not all content is explicitly stated, as some experience or information is assumed to be shared. As in the previous example, the semantic role destination is not explicitly mentioned in the human instruction, but it is important for allowing the agent to execute the action. Motivated by this phenomenon, our approach simultaneously grounds both explicit and implicit semantic roles. Although the destination is missing from the language, its grounding depends closely on the groundings of other roles, including the action and the patient. Filling all the semantic slots of a physical action is especially important for the robot to build a semantic map and conduct planning that connects with low-level actions.

3. Our empirical results on grounded semantic role labeling demonstrate a significant performance improvement compared with a previous benchmark that does not model semantic context information or implicit semantic roles. In addition, we collected an additional layer of annotation on top of part of the TACOS dataset, which captures the structure of actions informed by semantic roles from the video. The annotated data is publicly available for download.¹ It will provide a benchmark for future work on the grounded semantic role labeling task.

¹ https://github.com/yangshao/gsrl

4. Before diving into the action justification problem, we need to first identify key structures of human justifications for action recognition. We conducted a human study to collect real human justifications. After careful manual analysis, we identified several key dimensions of commonsense knowledge, from a human's perspective, for justifying concrete actions in the physical environment. These dimensions provide an important basis for explanations that are aligned with humans' commonsense knowledge about actions. More importantly, we can make use of these key dimensions to derive useful structured representations for action justification modeling.

5. We proposed an unsupervised conditional variational autoencoder (CVAE) based method to jointly learn to predict actions and select commonsense evidence as action justification. The CVAE naturally models the generation process of both action prediction and commonsense evidence selection. Inferring commonsense evidence corresponds to posterior inference in the CVAE: what evidence could support the predicted action? This formulation is flexible and naturally incorporates the action as context. Inferring actions can be seen as the forward process: first selecting evidence, then using the evidence to make a prediction. These two processes are jointly learned in our proposed framework. Furthermore, we extend the unsupervised setting to a semi-supervised setting that adds supervision on the latent commonsense evidence, to verify whether it can improve both action prediction and action justification performance. Our experimental results show better performance on both action prediction and justification. To test communication grounding efficiency, we designed human studies showing that our method achieves better communication grounding than strong baselines.
The dataset will be made available to the community and will serve as a benchmark for future work on this topic.

6. Despite the success of jointly modeling action prediction and justification, our previous work is limited to simple scenarios that contain only a small number of actions without complex justifications. In addition, our previous work made the strong assumption that all justifications are pre-extracted from the image, which is not realistic. To overcome these limitations, we propose a joint factorized model that extends the traditional visual question answering task with additional justification inference. In this new setting, the question targets complex activities that are not limited to a single verb but may contain multiple actions with different arguments. At the same time, the correct justification needs not only to explain the answer well but also to be grounded in the visual world. This requires the agent to understand the joint relation between the image, question, answer, and justification. Compared with previous work based on a two-step inference process, we factorize the complex interaction into small local interactions. Our experimental results demonstrate the effectiveness of the proposed factorized modeling. In addition, we carefully analyzed how different factors influence the final performance through detailed ablation studies.

1.4 Organization of Dissertation

The rest of the chapters are organized as follows. Related work is introduced in Chapter 2. In Chapter 3, we detail how we formulate the grounded semantic role labeling problem and how we collected the dataset for it; we then formally introduce the graphical model used to solve this problem, and a series of experiments demonstrates the effectiveness of the proposed method. Chapter 4 introduces the joint action recognition and justification problem, as well as how different variants of the method can be used to alleviate the data annotation effort; detailed experimental results demonstrate the effectiveness of the proposed method. In Chapter 5, we show how to use a deep factorized model to solve the grounded action justification problem; detailed ablation studies demonstrate how different factors contribute to the joint grounded action recognition and justification problem. Finally, we discuss some possible future directions in Chapter 6.

Chapter 2 Related Work

Learning to understand grounded action meanings is related to multiple research areas, from traditional linguistic studies on verb semantics to grounded language learning, explainable artificial intelligence, and commonsense reasoning. In this chapter, we discuss related work in these areas.

2.1 Verb Semantics

How to represent the meaning of words, sentences, or even documents has been a long-studied problem in natural language processing [3, 45, 52, 72]. Actions, usually expressed by verbs, indicate events happening in the physical world. Verbs are among the most important components of a sentence, as they connect the other components, including nouns, adverbs, and so on. According to the linguistic theory of verbs [35], action verbs are mainly divided into two sub-categories: manner verbs and result verbs. Manner verbs are defined as verbs that "specify as part of their meaning a manner of carrying out an action", and result verbs as verbs that "specify the coming about of a result state".
Example manner verbs include nibble, rub, laugh, and so on. Example result verbs include clean, fill, chop, and so on. Different kinds of verbs show different properties. For example, compared with manner verbs, result verbs have more obvious state changes that indicate possible action justifications, whereas for manner verbs, sub-actions are important indicators of commonsense evidence. In our work, we are interested in how to represent and understand action meanings.

2.1.1 Distributed Semantics

One established family of semantic representations is distributed semantics: words/verbs are represented as continuous vectors in a high-dimensional space. The core idea behind distributed semantics is that words occurring in similar linguistic contexts tend to have related meanings. One of the earliest representations is the bag-of-words (one-hot) representation: for a fixed vocabulary V containing n unique words, each word w ∈ V is represented as a vector v in which vi = 1 and all other entries are 0, where i is the index of w in the vocabulary V. The problem with the bag-of-words representation is that it does not reflect the similarity between related words. To alleviate this limitation, some works [64] represent words using lower-dimensional continuous vectors so that similar words are close in the space. Other works use matrix-factorization-based methods for distributed representations [33, 50]. Topic models [6] are generative Bayesian methods that model the relations between words and documents through latent topics.

With the recent advances of deep learning in natural language processing, many new algorithms have been proposed to learn effective word representations. Word2vec [62] and GloVe [69] are two of the most popular methods that utilize large amounts of unlabeled text to learn distributed word representations based on the distributional assumption. However, both are context-independent representations whose meanings do not depend on the specific context. More recently, ELMo [70] and BERT [18] were proposed to extend context-independent representations to context-dependent ones: the same word may have different representations when it appears in different contexts. Nowadays, pre-trained distributed word embeddings are widely used for most natural language processing tasks. One of the main reasons is that pre-trained word embeddings carry a lot of commonsense knowledge, which benefits the target task and helps alleviate deep learning's requirement for large datasets.

2.1.2 Semantic Role Labeling

Although distributed semantics is a good choice for exploring word similarities, it is hard to capture fine-grained relations between words or entities. To understand the semantics of actions/verbs, a more structured way is to identify the semantic relations between entities and the events they participate in. Concretely, the goal is to identify who did what to whom, when, and where, given a natural language sentence. Linguists have proposed frame-based verb semantic representations for this purpose: for each verb sense (one verb may have more than one sense), a frame consisting of different slots is defined as a template whose values are instantiated for a specific sentence describing the situation in which the verb occurs. For example, the slots of the verb cut include agent, patient, source, instrument, and so on.
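As a minimal illustrative sketch (the role inventory and helper function below are assumptions for illustration, not drawn from any of the frame resources discussed in this chapter), such a frame can be viewed as a template of role slots that a specific sentence fills in:

```python
# Illustrative only: a toy frame template for the verb "cut" and its
# instantiation for "the man is cutting the vegetable with the knife".
CUT_FRAME = {
    "predicate": "cut",
    "roles": ["agent", "patient", "source", "instrument", "location"],
}

def instantiate_frame(frame, fillers):
    """Fill the frame's role slots with phrases from a sentence; unfilled slots stay None."""
    return {role: fillers.get(role) for role in frame["roles"]}

instance = instantiate_frame(
    CUT_FRAME,
    {"agent": "the man", "patient": "the vegetable", "instrument": "the knife"},
)
# {'agent': 'the man', 'patient': 'the vegetable', 'source': None,
#  'instrument': 'the knife', 'location': None}
```

Slots left unfilled by the sentence (here, source and location) correspond to the implicit roles that later chapters ground from the visual context.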
To facilitate the study of semantic role labeling, researchers have developed semantic grammars and manually annotated linguistic resources, including several large-scale frame-based knowledge bases such as VerbNet [85], FrameNet [3], and PropBank [45]. These linguistic resources have greatly accelerated a variety of statistical approaches to semantic role labeling [14, 67, 71, 111]. For example, the sentence "the man is cutting the vegetable with the knife" contains the predicate cutting, the patient vegetable, and the tool knife. Semantic role labeling has been widely used in many natural language processing applications, including information extraction [22], question answering [86], and summarization [42].

2.2 Grounded Language Learning

Traditional semantic role labeling plays a very important role in many natural language processing applications. However, it is not grounded to the physical world, which makes it hard for agents to understand the situation and then perform the specified action.

Recent years have witnessed an increasing amount of work on multi-modal learning integrating language and vision, including image annotation [40, 73], image/video caption generation [19, 21, 48, 66, 96], video-sentence alignment [59, 65], scene generation [8], and multi-modal embeddings incorporating language and vision [7, 51].

One of the fundamental tasks in grounded language learning is to associate words with perceptual input. Words are discrete symbols, and perceptions are usually represented by continuous sensory data. Therefore, a common way of connecting them is to discretize the sensory feature space into categories that are associated with linguistic words, such as grounding color names [27, 36, 61, 76] and grounding spatial terms [29, 75, 87]. More relevant to our work is recent progress on grounded language understanding, which involves learning meanings of words through connections to machine perception [82] and grounding language expressions to the shared visual world, for example, to visual objects [55, 56], physical landmarks [90, 91], and perceived actions or activities [2, 91].

Different approaches and emphases have also been explored for grounded language learning. For example, linear programming has been applied to mediate perceptual differences between humans and robots for referential grounding [55]. Approaches to semantic parsing have been applied to ground language to internal world representations [2, 9]. Logical Semantics with Perception (LSP) [47] was applied to ground natural language queries to visual referents through jointly parsing natural language (combinatory categorial grammar (CCG)) and visual attribute classification. Graphical models have been applied to word grounding. For example, a generative model was applied to integrate And-Or-Graph representations of language and vision for joint parsing [93]. A Factorial Hidden Markov Model (FHMM) was applied to learn the meanings of nouns, verbs, prepositions, adjectives, and adverbs from short video clips paired with sentences [104]. Discriminative models have also been applied to ground human commands or instructions to perceived visual entities, mostly for robotic applications [90, 91]. More recently, deep learning has been applied to ground phrases to image regions [39].
2.3 Explainable Artificial Intelligence

Advanced machine learning, especially deep learning, has proven effective in many applications such as image classification and machine translation. However, these approaches do not provide meaningful, human-interpretable explanations of model behavior. This makes it difficult for artificial agents to collaborate with humans, as it is crucial for humans to understand an agent's capabilities and limitations. To address this problem, there has been growing interest in explainable artificial intelligence. For example, approaches have been proposed to generate high-precision rules that explain classifiers' decisions [79, 80]. Specifically for Convolutional Neural Networks (CNNs), recent work addresses their interpretability by mining semantic meanings of filters [108, 109] or by generating language explanations [32, 68]. Interpreting neural models also helps analyze the linguistic characteristics of Alzheimer's disease patients [38]. An increasing amount of work on the Visual Question Answering (VQA) task [1, 58] has also looked into more interpretable approaches by utilizing attention-based models [24] or reasoning based on explicit evidence [99].

Another trend is to understand physical actions by modeling physical attributes (including causal attributes) [23, 25, 26, 106]. Physical attributes and the related commonsense knowledge are important sources for explanation generation. Previous works try to acquire commonsense knowledge from image annotations [103] or learn commonsense knowledge from visual abstraction [95]. Recent related work also focuses on commonsense knowledge related to humans' mental states [74]. Different from the above works, our work focuses on learning to acquire commonsense evidence for action justification.

2.4 Visual Commonsense Reasoning

Deep learning based methods have achieved great performance on many vision tasks and applications; on some tasks they even surpass human-level performance. However, most of these applications still capture superficial semantics, such as recognizing the objects in an image and identifying basic properties including colors and locations. Humans have much stronger reasoning abilities in situated visual scenes. Much of this knowledge about the visual world is visual commonsense knowledge, ranging from low-level spatial understanding to high-level causal inference. In order to develop agents that behave like humans, the agents must be able to acquire such visual commonsense and to infer more complex scenarios based on this visual commonsense knowledge.

Some work [103] starts from extracting visual commonsense directly from image annotations, specifically mining object-object relations and further extending them to entailment relations based on statistics from an annotated image corpus. However, this method requires large-scale dense image annotations.

Other works extend traditional recognition-based visual tasks to reasoning-based tasks [1]. One of the most popular tasks is visual question answering (VQA): given an image and a question, the model needs to learn how to answer the visual question. However, the main limitation of the VQA task is that, in most current open-source datasets, most questions are not related to visual commonsense and still target recognition-level semantics.
Recently, the Recognition to Cognition (R2C) work [105] collected a new large dataset focusing on visual commonsense phenomena in the movie domain. Their setting is similar to the VQA setting; the difference is that their questions require an in-depth understanding of the visual semantics to answer, which they call cognition-level semantics. For each sample, they also provide explanation choices so that the model can learn which explanation justifies the answer. Compared with explanation-generation tasks, framing this as a multiple-choice problem makes evaluation easier and more robust. The R2C task can be seen as a combination of the Visual Question Answering task and the Visual Commonsense Reasoning task.

Chapter 3 Grounded Semantic Role Labeling

In the previous chapters, we briefly reviewed the background of grounded language learning and related work. This chapter¹ is organized as follows: first, we introduce the background and motivate the new grounded semantic role labeling task; second, we show how we formulate the problem in a graphical model framework; third, we describe a subset of the TACOS corpus and analyze its statistics; finally, we present a set of experiments, discuss the results, and conclude.

¹ Grounded Semantic Role Labeling. Shaohua Yang, Qiaozi Gao, Changsong Liu, Caiming Xiong, Song-Chun Zhu, Joyce Y. Chai. Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), San Diego, CA, June 12-17, 2016.

3.1 Introduction

Linguistic studies capture the semantics of verbs by their frames of thematic roles (also referred to as semantic roles or verb arguments) [52]. For example, a verb can be characterized by agent (i.e., the animator of the action) and patient (i.e., the object on which the action is acted upon), and other roles such as instrument, source, destination, etc. Given a verb frame, the goal of Semantic Role Labeling (SRL) is to identify linguistic entities from the text that serve different thematic roles [14, 28, 67, 111]. For example, given the sentence the woman takes out a cucumber from the refrigerator, takes out is the main verb (also called the predicate); the noun phrase the woman is the agent of this action; a cucumber is the patient; and the refrigerator is the source.

Figure 3.1: An example of grounded semantic role labeling for the sentence the woman takes out a cucumber from the refrigerator. The left hand side shows three frames of a video clip with the corresponding language description. The objects in the bounding boxes are tracked and each track has a unique identifier. The right hand side shows the grounding results where each role including the implicit role (destination) is grounded to a track id.

SRL captures important semantic representations for actions associated with verbs, which have been shown to be beneficial for a variety of applications such as information extraction [22] and question answering [86]. However, traditional SRL is not targeted at representing verb semantics that are grounded to the physical world so that artificial agents can truly understand the ongoing activities and (learn to) perform the specified actions. To address this issue, we propose a new task of grounded semantic role labeling. Figure 3.1 shows an example of grounded SRL. The sentence the woman takes out a cucumber from the refrigerator describes an activity in a visual scene.
The semantic role representation from linguistic processing (including implicit roles such as destination) is first extracted and then grounded to tracks of visual entities as shown in the video. For example, the verb phrase take out is grounded to a trajectory of the right hand. The role agent is grounded to the person who actually performs the take-out action in the visual scene (track 1); the patient is grounded to the cucumber taken out (track 3); and the source is grounded to the refrigerator (track 4). The implicit role of destination (which is not explicitly mentioned in the language description) is grounded to the cutting board (track 5).

[Figure 3.1 labels: The woman takes out a cucumber from the refrigerator. Predicate "takes out": track 1; Agent "The woman": track 2; Patient "a cucumber": track 3; Source "from the refrigerator": track 4; Destination (implicit): track 5.]

To tackle this problem, we have developed an approach to jointly process language and vision by incorporating semantic role information. In particular, we use a benchmark dataset (TACOS), which consists of parallel video and language descriptions in a complex cooking domain [77], in our investigation. We have further annotated several layers of information for developing and evaluating grounded semantic role labeling algorithms. Compared to previous works on language grounding [47, 90, 104], our work presents several contributions. First, beyond arguments explicitly mentioned in language descriptions, our work simultaneously grounds explicit and implicit roles in an attempt to better connect verb semantics with actions in the underlying physical world. By incorporating semantic role information, our approach has led to better grounding performance. Second, most previous works only focused on a small number of verbs with limited activities. We base our investigation on a wider range of verbs and in a much more complex domain where object recognition and tracking are notably more difficult. Third, our work results in additional layers of annotation to part of the TACOS dataset. This annotation captures the structure of actions informed by semantic roles from the video. The annotated data is available for download.² It will provide a benchmark for future work on grounded SRL.

² https://github.com/yangshao/gsrl

3.2 Grounded Semantic Role Labeling

3.2.1 Problem Formulation

Given a sentence S and its corresponding video clip V, our goal is to ground explicit/implicit roles associated with a verb in S to video tracks in V. In this chapter, we focus on the following set of semantic roles: {predicate, patient, location, source, destination, tool}. In the cooking domain, as actions always involve hands, the predicate is grounded to the hand pose represented by a trajectory of the relevant hand(s). Normally, the agent would be grounded to the person who performs the action; as there is only one person in the scene, we ignore the grounding of the agent in this work. Video tracks capture tracks of objects (including hands) and locations. For example, in Figure 3.1, there are 5 tracks: human, hand, cucumber, refrigerator, and cutting board. Regarding the representation of locations, instead of discretizing the whole image into many small regions (a large search space), we create locations corresponding to five spatial relations (center, up, down, left, right) with respect to each object track, which means we have five times as many location candidates as object tracks.
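As a minimal sketch (with assumed track identifiers and helper names, for illustration only, not the code used in this work), the grounding candidate space induced by this representation can be enumerated as follows: every object track is a candidate grounding, and each track also induces five location candidates.

```python
# Illustrative enumeration of the grounding candidate space described above.
SPATIAL_RELATIONS = ["center", "up", "down", "left", "right"]

def grounding_candidates(tracks):
    """tracks: list of track identifiers, e.g. ["track3_cucumber", "track4_refrigerator"]."""
    object_candidates = [("object", t) for t in tracks]
    location_candidates = [("location", t, rel) for t in tracks for rel in SPATIAL_RELATIONS]
    return object_candidates + location_candidates

tracks = ["track2_person", "track3_cucumber", "track4_refrigerator", "track5_cutting_board"]
print(len(grounding_candidates(tracks)))  # 4 object candidates + 4 * 5 = 24 candidates in total
```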
For instance, in Figure 3.1, the source is grounded to the center of the bounding boxes of the refrigerator track, and the destination is grounded to the center of the cutting board track.

We use a Conditional Random Field (CRF) to model this problem. An example CRF factor graph is shown in Figure 3.2. The CRF structure is created based on information extracted from language. More specifically, s1, ..., s6 refer to the observed text spans and their semantic roles. Notice that s6 is an implicit role, as there is no text in the sentence describing the destination. Also note that the whole prepositional phrase "from the drawer" is identified as the source rather than "the drawer" alone. This is because prepositions play an important role in specifying location information. For example, "near the cutting board" describes a location that is near, but not exactly at, the location of the cutting board.

Figure 3.2: The CRF structure of the sentence "the person takes out a cutting board from the drawer". The text in the square brackets indicates the corresponding semantic role.

Here v1, ..., v6 are grounding random variables which take values from object tracks and locations in the video clip, and φ1, ..., φ6 are binary random variables which take values in {0, 1}. When φi equals 1, vi is the correct grounding of the corresponding linguistic semantic role; otherwise it is not. The introduction of the random variables φi follows previous work by Tellex and colleagues [90], which makes CRF learning more tractable.

3.3 CRF-based Learning and Inference

3.3.1 Conditional Random Field

A Conditional Random Field is a discriminative graphical model which models the conditional probability distribution p(Y | X), in which X is the structured input and Y is the structured output. The output Y forms a Markov random field. Of all the CRF variants, the linear-chain CRF is the most commonly used; it was first proposed for segmenting and labeling sequence data [49]. The conditional probability is represented as

$$p(Y = y \mid X) = \frac{1}{Z(X)} \prod_i \Phi_i(X, y)$$

in which the $\Phi_i(X, y)$ are the factors' potential functions and $Z(X)$ is the normalization constant. An example CRF for linear sequence tagging is shown in Figure 3.3.

Figure 3.3: The linear conditional random field structure.

3.3.2 CRF for Grounded Semantic Role Labeling

Different from the common sequence labeling problem, grounded semantic role labeling is a more complex structured prediction task. In the GSRL CRF model, we do not directly model the objective function as p(v1, ..., vk | S, V), where S refers to the sentence, V refers to the corresponding video clip, and vi refers to a grounding variable, because gradient-based learning would need the expectation over v1, ..., vk, which is infeasible. We instead use the following objective function: P(φ | s1, s2, ..., sk, v1, v2, ..., vk, V), where φ is a binary random vector [φ1, ..., φk] indicating whether the grounding is correct. The objective function is factorized as follows:

$$P(\phi \mid s_1, \dots, s_k, v_1, \dots, v_k, V) = \frac{1}{Z} \prod_i \psi(\phi_i, s_i, v_i, V) = \prod_i \frac{1}{Z_i} \exp\{w^\top F(\phi_i, s_i, v_i, V)\} = \prod_i P(\phi_i \mid s_i, v_i, V)$$

where ψ is the potential function, w refers to the parameters, and F(φi, si, vi, V) denotes a factor feature vector.
Z and Zi are normalization constants:

$$Z = \sum_{\phi} \prod_i \psi(\phi_i, s_i, v_i, V), \qquad Z_i = \sum_{\phi_i} \psi(\phi_i, s_i, v_i, V)$$

We can see that Z decomposes into the product of the Zi because each factor relates to only one φi. In this way, the objective function factorizes according to the structure of language, with local normalization at each factor.

Gradient ascent with L2 regularization was used for parameter learning to maximize the objective function

$$L = \log P(\phi \mid s_1, s_2, \dots, s_k, v_1, v_2, \dots, v_k, V)$$

Taking the derivative of L, we get

$$\frac{\partial L}{\partial w} = \sum_i F(\phi_i, s_i, v_i, V) - \sum_i \mathbb{E}_{P(\phi_i \mid s_i, v_i, V)}\big[F(\phi_i, s_i, v_i, V)\big]$$

where F refers to the feature function. During training, we also use random groundings as negative samples for discriminative training. The update rule is w_{t+1} = w_t + η ∂L/∂w, where η is the step size. Learning is tractable because each φi is a binary random variable, so computing its expectation is straightforward.

For inference, the linear-chain CRF has a polynomial-time exact decoding algorithm, but for general graph structures there is no efficient exact algorithm, and the search space can become very large as the number of objects in the world increases. To address this problem, we apply beam search for approximate inference. Specifically, we select an easy-to-hard inference order: we first ground roles including patient and tool, and then other roles including location, source, destination, and predicate. We empirically tried different beam search orders and found that this order achieves the best performance.
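As a minimal illustrative sketch (the candidate lists, feature callback, and helper names below are assumptions for illustration rather than our actual implementation), the locally normalized factor probability and the role-by-role beam search described above can be outlined as follows:

```python
import math

def factor_prob(w, feat):
    """P(phi_i = 1 | s_i, v_i, V) under local normalization over phi_i in {0, 1}.
    `feat(phi)` returns the factor feature vector F(phi, s_i, v_i, V) as a list of floats."""
    score1 = sum(wk * fk for wk, fk in zip(w, feat(1)))
    score0 = sum(wk * fk for wk, fk in zip(w, feat(0)))
    m = max(score0, score1)  # subtract the max for numerical stability
    return math.exp(score1 - m) / (math.exp(score0 - m) + math.exp(score1 - m))

# Easy-to-hard grounding order from Section 3.3.2.
ROLE_ORDER = ["patient", "tool", "location", "source", "destination", "predicate"]

def beam_search_grounding(w, candidates, features, beam_size=5):
    """Ground roles one at a time, keeping the `beam_size` best partial assignments.
    `candidates[role]` lists candidate tracks/locations for that role;
    `features(role, cand, partial)` returns a feat(phi) callback conditioned on the
    partial grounding made so far."""
    beam = [({}, 0.0)]  # (partial grounding, accumulated log score)
    for role in (r for r in ROLE_ORDER if r in candidates):
        expanded = []
        for assignment, log_score in beam:
            for cand in candidates[role]:
                p = factor_prob(w, features(role, cand, assignment))
                expanded.append(({**assignment, role: cand}, log_score + math.log(p + 1e-12)))
        beam = sorted(expanded, key=lambda x: -x[1])[:beam_size]
    return beam[0][0]  # highest-scoring complete grounding
```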
The annotation interface is shown in Figure 3.5. For each sentence, we annotated the ground truth parsing structure and the semantic frame for each verb. The ground truth parsing structure is the representation of dependency parsing results. The semantic frame of a verb includes slots, fillers, and their groundings. For each semantic role (including both explicit roles and implicit roles) of a given verb, we also annotated the ground truth grounding in terms of the object tracks and locations. In total, our annotated dataset includes 976 pairs of video clips and corresponding sentences, 1094 verbs occurrences, and 3593 groundings of semantic roles. To check annotation agreement, 10% of the data was annotated by two annotators. The kappa statistics is 0.83 [13]. Source 102 / 149 Table 3.1: Statistics for a set of verbs and their semantic roles in our annotated dataset. The entry indicates the number of explicit/implicit roles for each category. “–” denotes no such role is observed in the data. Verb take put get cut open wash slice rinse place peel Destn Location 2 / 248 75 / 19 0 / 239 3 / 131 0 / 23 26 / 58 2 / 68 8 / 64 5 / 130 2 / 21 2 / 82 2 / 66 62 / 190 64 / 64 – 1 / 27 – – – – – – – – – – – 0 / 74 105 / 7 Tool – – – remove 2 / 27 – – Patient 251 / 0 94 / 0 247 / 0 134 / 1 23 / 0 93 / 0 69 / 1 76 / 0 104 / 1 29 / 0 40 / 0 – – – – – – – 34 / 6 25 From this dataset, we selected 11 most frequent verbs (i.e., get, take, wash, cut, rinse, slice, place, peel, put, remove, open) in our current investigation for the following reasons. First, they are used more frequently so that we can have sufficient samples of each verb to learn the model. Second, they cover different types of actions: some are more related to the change of the state such as take, and some are more related to the process such as wash. As it turns out, these verbs also have different semantic role patterns as shown in Table 3.1. The patient roles of all these verbs are explicitly specified. This is not surprising as all these verbs are transitive verbs. There is a large variation for other roles. For example, for the verb take, the destination is rarely specified by linguistic expressions (i.e., only 2 instances), however it can be inferred from the video. For the verb cut, the location and the tool are also rarely specified by linguistic expressions. Nevertheless, these implicit roles contribute to the overall understanding of actions and should also be grounded too. 3.4.2 Automated Processing To build the structure of the CRF as shown in Figure 3.2 and extract features for learning and inference, we have applied the following approaches to process language and vision. Language Processing. Language processing consists of three steps to build a structure containing syntactic and semantic information. First, the Stanford Parser [60] is applied to create a depen- dency parsing tree for each sentence. Second, Senna [14] is applied to identify semantic role labels for the key verb in the sentence. The linguistic entities with semantic roles are matched against the dependency nodes in the tree and the corresponding semantic role labels are added to the tree. Third, for each verb, the PropBank [67] entries are searched to extract all relevant semantic roles. The implicit roles (i.e., not specified linguistically) are added as direct children of verb nodes in the tree. Through these three steps, the resulting tree from language processing has both explicit 26 and implicit semantic roles. 
These trees are further transformed into CRF structures based on a set of rules.

Vision Processing. A set of visual detectors is first trained for each type of object. Here a random forest classifier is adopted; more specifically, we use 100 trees with HoG features [15] and color descriptors [94]. Both HoG and color descriptors are used because some objects, such as knives and humans, are characterized more by structure, while others, such as towels, are characterized more by texture. With the learned object detectors, given a candidate video clip, we run the detectors at every 10th frame (less than 0.5 seconds apart) and find the candidate windows for which the detector score for the object is larger than a threshold (set to 0.5). Then, using the detected window as a starting point, we adopt tracking-by-detection [16] to track the object forward and backward and obtain a candidate track with this object label.

Feature Extraction. Features in the CRF model can be divided into the following three categories:

1. Linguistic features include word occurrence and semantic role information. They are extracted by language processing.

2. Track label features are the label information for tracks in the video. The labels come from human annotation or automated visual processing, depending on the experimental setting (described in Section 3.5.1).

3. Visual features are a set of features involving geometric relations between tracks in the video. One important feature is the histogram comparison score, which measures the similarity between distance histograms. Specifically, histograms of distance values between the tracks of the predicate and of other roles for each verb are first extracted from the training video clips. For an incoming distance histogram, we calculate its Chi-square distances [107] from the pre-extracted training histograms with the same verb and the same role. Its histogram comparison score is set to the average of the 5 smallest Chi-square distances. Other visual features include geometric information for single tracks and geometric relations between two tracks. For example, size, average speed, and moving direction are extracted for a single track; average distance, size ratio, and relative direction are extracted between two tracks. Continuous features are discretized into uniform bins.

To ground language into tracks from the video, instead of using track label features or visual features alone, we use their Cartesian product with linguistic features. To learn the behavior of different semantic roles of different verbs, visual features are combined with the presence of both verbs and semantic roles through a Cartesian product. To learn the correspondence between track labels and words, track label features are combined with the presence of words, also through a Cartesian product.

To train the model, we randomly selected 75% of the 976 annotated pairs of video clips and corresponding sentences as the training set. The remaining 25% were used as the test set.

3.5 Evaluation of Grounded Semantic Role Labeling

3.5.1 Experimental Setup

Comparison. To evaluate the performance of our approach, we compare it with two approaches.

• Baseline: To identify the grounding for each semantic role, the first baseline chooses the most probable track based on the conditional distribution of object types given the verb and semantic role.
If an object type corresponds to multiple tracks in the video, e.g., multiple drawers or knives, we randomly select one of the tracks as the grounding. We ran this baseline method five times and report the average performance.

• Tellex (2011): The second approach we compare with is based on an implementation of [90]. The difference is that it does not explicitly model fine-grained semantic role information. For a better comparison, we map the grounding results from this approach to the different explicit semantic roles according to the SRL annotation of the sentence. Note that this approach is not able to ground implicit roles.

More specifically, we compare these two approaches with two variations of our system:

• GSRLwo_V: The CRF model using linguistic features and track label features (described in Section 3.4.2).

• GSRL: The full CRF model using linguistic features, track label features, and visual features (described in Section 3.4.2).

Configurations. Both automated language processing and vision processing are error-prone. To further understand the limitations of grounded SRL, we compare performance under different configurations along two dimensions: (1) the CRF structure is built upon annotated ground-truth language parsing versus automated language parsing; (2) object tracking and labeling is based on annotation versus automated processing. This leads to four different experimental configurations.

Table 3.2: Evaluation results based on annotated language parsing. (For the methods Baseline, Tellex (2011), GSRLwo_V, and GSRL, the table reports grounding performance for the Predicate, Patient, Source, Destination, Location, and Tool roles, each split into explicit and implicit, together with overall Explicit, Implicit, and All columns. The upper panel reports accuracy in the gold recognition/tracking setting; the lower panel reports approximated accuracy in the automated recognition/tracking setting and includes an Upper_Bound row. The individual cell values could not be reliably realigned from the extracted layout.)

Table 3.3: Evaluation results based on automated language parsing. (Same layout as Table 3.2; the individual cell values could not be reliably realigned from the extracted layout.)

Evaluation Metrics. For experiments based on annotated object tracks, we can simply use the traditional accuracy, which directly measures the percentage of grounded tracks that are correct. However, for experiments using automated tracking, evaluation can be difficult as tracking itself poses significant challenges: the grounding results (to tracks) cannot be directly compared with the annotated ground-truth tracks. To address this problem, we define a new metric called approximate accuracy. This metric is motivated by previous computer vision work that evaluates
tracking performance [4]. Suppose the ground-truth grounding for a role is track gt and the predicted grounding is track pt. The two tracks gt and pt are often not identical (although they may overlap). Suppose the number of frames in the video clip is k. For each frame, we calculate the distance between the centroids of the two tracks; if the distance is below a predefined threshold, we consider the two tracks to overlap in that frame. We consider the grounding correct if the ratio of overlapping frames between gt and pt exceeds 50%. As can be seen, this is a lenient, approximate measure of accuracy.

3.5.2 Results

The results based on ground-truth language parsing are shown in Table 3.2, and the results based on automated language parsing are shown in Table 3.3. For results based on annotated object tracking, performance is reported as accuracy; for results based on automated object tracking, performance is reported as approximate accuracy. When the number of test samples is less than 15, we do not show the result, as it tends to be unreliable (shown as NA). Tellex (2011) does not address implicit roles (shown as "–"). The best performance score is shown in bold. We also conducted two-tailed bootstrap significance testing [20]. A score with a "∗" is statistically significant (p < 0.05) compared to the baseline approach, and a score with a "+" is statistically significant (p < 0.05) compared to the approach of [90].

For experiments based on automated object tracking, we also calculated an upper bound to assess the best possible performance achievable by a perfect grounding algorithm given the current vision processing results. This upper bound is calculated by grounding each role to the track closest to the ground-truth annotated track. For the experiments based on annotated tracking, the upper bound would be 100%. This measure provides some understanding of how good the grounding approach is given the limitations of vision processing. Note that the grounding results in the gold and automatic language processing settings are not directly comparable, as automatic SRL can misidentify frame elements.
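To make the approximate accuracy metric used in Tables 3.2 and 3.3 concrete, here is a minimal sketch of how it could be computed for one grounding; the per-frame centroid representation, the restriction to frames shared by both tracks, and the pixel threshold value are illustrative assumptions rather than the exact implementation.

```python
import math

def approximate_accuracy_match(gt_centroids, pt_centroids, dist_threshold=50.0):
    """Return True if the predicted track is an acceptable grounding for the
    ground-truth track: the two tracks' centroids must lie within
    `dist_threshold` pixels in more than 50% of the shared frames."""
    shared = set(gt_centroids) & set(pt_centroids)
    if not shared:
        return False
    overlapping = 0
    for f in shared:
        (gx, gy), (px, py) = gt_centroids[f], pt_centroids[f]
        if math.hypot(gx - px, gy - py) < dist_threshold:
            overlapping += 1
    return overlapping / len(shared) > 0.5

# Example with per-frame centroids (frame index -> (x, y)).
gt = {0: (100, 100), 1: (102, 101), 2: (104, 103)}
pt = {0: (110, 108), 1: (130, 160), 2: (109, 105)}
print(approximate_accuracy_match(gt, pt))  # True: 2 of 3 frames overlap
```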
3.5.3 Discussion

As shown in Table 3.2 and Table 3.3, our approach consistently outperforms the baseline (for both explicit and implicit roles) and the Tellex (2011) approach. Under the configuration of gold recognition/tracking, the incorporation of visual features further improves the performance. However, this performance gain is not observed when automated object tracking and labeling is used. One possible explanation is that, as we only had limited data, we did not use separate data to train models for object recognition/tracking, so the GSRL model was trained with gold recognition/tracking data and tested with automated recognition/tracking data.

Figure 3.6: The relation between the accuracy and the entropy of each verb's patient role under the gold language, gold visual recognition/tracking setting. The entropy for the patient role of each verb is shown below the verb.

By comparing our method with Tellex (2011), we can see that by incorporating fine-grained semantic role information, our approach achieves better performance on almost all the explicit roles (except for the patient role under the automated tracking condition). The results also show that some roles are easier to ground than others in this domain. For example, the predicate role is grounded to the hand tracks (either the left hand or the right hand); there is not much variation, so the simple baseline can achieve fairly high performance, especially when annotated tracking is used. The same holds for the location role, as most locations are near the sink when the verb is wash, and near the cutting board for verbs like cut. However, for the patient role, there is a large difference between our approach and the baseline approaches, as there is a larger variation in the types of objects that can fill the role for a given verb.

For experiments with automated tracking, the upper bound for each role also varies. Some roles (e.g., patient) have a fairly low upper bound, and the accuracy of our full GSRL model is already quite close to it. For other roles, such as predicate and destination, there is a larger gap between the current performance and the upper bound. This difference reflects the model's capability in grounding different roles.

Figure 3.6 shows a close-up look at the grounding performance for the patient role of each verb under the gold parsing and gold tracking configuration. We only show the results for the patient role here because every verb has this role to be grounded. For each verb, we also calculated its entropy based on the distribution of the types of objects that can serve as its patient role in the training data. The entropy is shown at the bottom of the figure. For verbs such as take and put, our full GSRL model leads to much better performance compared to the baseline. As the baseline approach relies on the entropy of the potential groundings for a role, we further measured the performance improvement and calculated the correlation between the improvement and the entropy of each verb. The Pearson coefficient between the entropy and the improvement of GSRL over the baseline is 0.614. This indicates that the improvement from GSRL is positively correlated with the entropy associated with a role, implying that the GSRL model can better handle more uncertain situations. For the verb cut, the GSRL model performs slightly worse than the baseline.
One explanation is that the possible objects that can serve as the patient of cut are relatively constrained, so simple features might be sufficient; a large number of features may introduce noise and thus hurt performance.

We further compare the performance of our full GSRL model with Tellex (2011) (also shown in Figure 3.6) on the patient role of different verbs. Our approach outperforms Tellex (2011) on most of the verbs, especially put and open. A closer look at the results shows that in those cases the patient roles are often specified by pronouns. Therefore, the track label features and linguistic features are not very helpful, and the correct grounding mainly depends on visual features. Our full GSRL model can better capture the geometric relations between different semantic roles by incorporating fine-grained role information.

3.6 Conclusion

This chapter investigates a new problem of grounded semantic role labeling. Besides semantic roles explicitly mentioned in language descriptions, our approach also grounds implicit roles which are not explicitly specified. As implicit roles also capture important participants in an action (e.g., tools used in the action), our approach provides a more complete representation of action semantics which can be used by artificial agents for further reasoning and planning toward the physical world. Our empirical results in a complex cooking domain have shown that, by combining semantic role information with visual features, our approach achieves better performance than baseline approaches. Our results have also shown that grounded semantic role labeling is a challenging problem which often depends on the quality of automated visual processing (e.g., object tracking and recognition).

There are several directions for future improvement. First, the current alignment between a video clip and a sentence is generated by heuristics which are error-prone. One way to address this is to treat alignment and grounding as a joint problem. Second, our current visual features have not proven effective, especially when they are extracted from automatic visual processing. This is partly due to the complexity of the scenes in the TACOS dataset and the lack of depth information. Recent advances in object tracking algorithms [63, 101], together with 3D sensing, can be explored in the future to improve visual processing. Moreover, linguistic studies have shown that action verbs such as cut and slice often denote some change of state as a result of the action [34, 35]. The change of state can be perceived from the physical world. Thus another direction is to systematically study the causality of verbs. Causality models for verbs can potentially provide top-down information to guide intermediate representations for visual processing and improve grounded language understanding.

The capability of grounding semantic roles to the physical world has many important implications. It will support the development of intelligent agents which can reason and act upon the shared physical world. For example, unlike traditional action recognition in computer vision [98], grounded SRL provides a deeper understanding of activities, including the participants in the actions, guided by linguistic knowledge.
For agents that can act upon the physical world, such as robots, grounded SRL will allow the agents to acquire the grounded structure of human commands and thus perform the requested actions through planning (e.g., to follow the command "put the cup on the table"). Grounded SRL will also contribute to robot action learning, where humans can teach the robot new actions (e.g., simple cooking tasks) through both task demonstration and language instruction.

Chapter 4 Commonsense Action Explanation in Human-Agent Communication

In the previous chapter, we conducted a comprehensive study on the grounded semantic role labeling task in order to understand verb semantics in the physical, situated world. In this chapter¹, we look into a more interesting problem. Grounded semantic role labeling aims to answer questions like "what is the relation between the action and the object/location in the physical world?". For commonsense reasoning and human interpretation, however, we are more interested in asking: why do you think this action is happening? This is a more challenging task than grounded semantic role labeling. In this chapter, we first conduct a study on human justifications. We then give a formal formulation for commonsense action justification and detail the process of crowd-sourcing the data. Empirical experiments are conducted to demonstrate the effectiveness of the proposed method. Finally, we present a novel human study to verify communication grounding efficiency across a variety of methods.

¹Commonsense Justification for Action Explanation. Shaohua Yang, Qiaozi Gao, Sari Saba-Sadiya, Joyce Y. Chai. Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 1-4, 2018.

4.1 Introduction

When collaborating with artificial agents, it is important for humans to understand the agents' abilities and limitations (e.g., understand why a decision is made by the agent) so that humans can be more cooperative in joint tasks (e.g., decide when to trust the agent's prediction). To address this issue, recent years have seen an increasing effort on Explainable AI (XAI), which attempts to develop explainable models that can explain the agent's decision making while maintaining a high level of performance. There are two types of explanation: introspective explanation, which addresses the decision-making process, and justification explanation, which gathers evidence to support a certain decision [5, 68]. In this chapter we focus on justification explanation: identifying commonsense evidence, particularly for action justification.

Although one of the end goals of this investigation is to support perception, our current focus is on higher-level commonsense reasoning for action explanation. Therefore this work is based on a symbolic representation of the world, without involving vision processing. Specifically, our task is framed as follows: given many symbolic descriptions of the physical world (e.g., object relations and attributes resulting from vision or other processing), how can the agent identify a small set of descriptions which justify an action in line with humans' commonsense knowledge? The lack of commonsense knowledge is a major bottleneck for artificial agents and jeopardizes the common ground between humans and agents for successful communication. If artificial agents are ever to become partners with humans in joint tasks, the ability to learn and acquire commonsense evidence for action justification is fundamental.
This chapter intends to address this important yet understudied problem.

As a first step in our investigation, we initiated a human study to identify key dimensions of commonsense reasoning, from the human's point of view, that justify an action. We then developed an explainable model based on the generative conditional variational autoencoder (CVAE) that models perceived attributes/relations as latent variables to learn the association between commonsense evidence and actions. Our empirical results on a subset of the Visual Genome data [46] show that, compared to a typical attention-based model, CVAE has a significantly higher explanation ability in terms of identifying correct commonsense evidence to justify the recognized action. When supervision of commonsense evidence is added during training, both the explainability and the performance (i.e., action prediction) are further improved. In addition, we evaluated the role of commonsense evidence in communication grounding between humans and agents. Our experimental results show that the commonsense evidence generated by CVAE leads to a significantly higher common ground on actions.

The contributions of this chapter are threefold. First, we identified several key dimensions of commonsense knowledge, from a human's perspective, used to justify concrete actions in the physical environment. These dimensions provide a basis for justification explanation that is aligned with humans' commonsense knowledge about the action. Second, we proposed a method using CVAE to jointly learn to predict actions and select commonsense evidence as action justification. CVAE naturally models the generation process of both actions and commonsense evidence; inferring commonsense evidence is equivalent to posterior inference in the CVAE model, which is flexible and powerful in incorporating actions as context. Our experimental results show a higher explainability of CVAE in action justification without sacrificing performance. Finally, our dataset of commonsense evidence for action explanation, together with our proposed methods, will be made available to the community. It will serve as a baseline for future work on this topic.

4.2 A Study On Justification Explanation

While there is a rich literature on explanations in Psychology, Philosophy, and Linguistics [17, 57, 92], the kind of concrete physical actions we are interested in is rarely studied in previous work. To address this issue, our work began with a small human study that would enable us to identify a low-level and quantitatively useful taxonomy of commonsense for explaining actions that can be perceived from the physical world. We created a set of 12 short video clips (each about 14 seconds) from the Microsoft Research Video to Text dataset [100]. For each video clip, we asked human subjects to explain why they think a certain action is happening in the video. A total of about 170 responses from 67 participants were collected.² After careful examination, we came up with the following dimensions which capture commonsense explanations for actions.

• Transitive-relations: This kind of explanation does not directly focus on the structural relations between an action and its participants, but rather transits to the relation between a participant and something else (potentially related), for example, using a woman wears an apron to justify the action cook. In the collected responses, 64% used transitive relations.
(Most subject responses contain multiple categories of explanation.)

• Sub-actions: Almost 75% of the responses used the existence of sub-actions as evidence (for example, the action is cook because there are sub-actions of cutting and heating meat).

• Spatial-relations: Around 15% of the responses used spatial relations involving the participants of the action, for example, the knife is on the cutting board, the water is in the bottle, etc.

• Effect-state: Over 28% of the responses cited a change in the state of an object, in other words the effect state, as evidence, such as cucumber in small pieces as evidence for chop.

• Associated-attributes: Other attributes associated with the participants of the action, but not the effect state of the participants (20%). While these attributes are not directly related to the action, they are linked to the action by association. For example, banana is sliced is used as evidence to justify blend.

• Other: Participants also cited other commonsense such as the "definition" of the action (5%) or the manner associated with different sub-actions (12%).

²The full survey along with other collected data will be released.

Except for the category Other, which cannot be perceived from an image, the other five categories can potentially be perceived and used as commonsense evidence to justify a perceived action. These five categories of commonsense explanations are used in our computational models described next.

4.3 Method

We formulate our task as follows: given a set of relations R and a set of attributes E, the goal is to jointly select evidence z and predict the target action a ∈ A, where A is the vocabulary of actions. R is represented as {r_1, r_2, ..., r_m}, where each r_i is a tuple (r_i^p, r_i^s, r_i^o) corresponding to predicate, subject, and object. E refers to {e_1, e_2, ..., e_n}, where each e_i is a tuple (e_i^o, e_i^p) corresponding to the object and its attribute. We introduce z as a discrete vector (z_1, z_2, ..., z_{m+n}), where z_i ∈ {0, 1} is a hidden explainable variable. z is interpreted as an evidence selector: z_i = 1 means the corresponding relation/attribute justifies the target action a. Given these definitions, our target is to learn the probability p(a, z|R, E).

Figure 4.1: Graphical model representation of the conditional variational autoencoder.

4.3.1 Conditional Variational Autoencoder

The variational autoencoder (VAE) [44] was proposed as a generative model that combines the power of directed graphical models, with continuous or discrete latent variables, and neural networks. The VAE models the generative process of a random variable x as follows: first the latent variable z is generated from a prior probability distribution p(z), then a data sample x is generated from a conditional probability distribution p(x|z). The CVAE [110] is a natural extension of the VAE: both the prior distribution and the conditional distribution are now conditioned on an additional context c, i.e., p(z|c) and p(x|z, c).

For our task, we decompose the inference problem p(a, z|R, E) into two smaller problems. The first sub-problem is to infer p(a|R, E), which we call the performer. The second is to infer p(z|a, R, E), which we call the explainer. These two problems are closely coupled, hence we model them jointly. The probability distribution p(a|R, E) can be written as:

p(a|R, E) = ∑_z p_θ(a|z, R, E) p(z|R, E)

The corresponding graphical representation is shown in Figure 4.1. Directly optimizing this conditional probability is not feasible.
Usually the Evidence Lower Bound (ELBO) [88] is optimized, which can be derived as follows:

ELBO(a, R, E; θ, φ) = −KL(q_φ(z|a, R, E) || p_θ(z|R, E)) + E_{q_φ(z|a,R,E)}[log p_θ(a|z, R, E)] ≤ log p(a|R, E)   (4.1)

With the first term, the KL divergence, we minimize the distance between the posterior distribution and the prior distribution. With the second term, we maximize the expectation of the target action under the posterior latent distribution. In most previous work using VAEs, the hidden representation z has no explicit meaning and is thus hard for humans to interpret; for example, z is simply assumed to follow a Gaussian or a categorical distribution. In order to have a more explicit representation for the purpose of explanation, our latent discrete variable z is used to indicate whether the corresponding relation or attribute can be used for explanation. In the above ELBO equation, p(a|R, E) is the performer and q_φ(z|a, R, E) is the explainer. Thus we can learn the performer and explainer jointly.

Figure 4.2: System architecture for the CVAE model.

The whole system architecture is shown in Figure 4.2. From an image, we first extract a candidate relation set R and attribute set E from human image descriptions or trained visual detectors. Every relation r and attribute e is embedded using a gated recurrent neural network (GRU) [11]:

r_emb = GRU([r^p, r^s, r^o])
e_emb = GRU([e^o, e^p])

The action a is represented by a GloVe embedding [69], followed by another non-linear layer:

a_emb = ReLU(W_i a_glove + b_i)

where a_glove ∈ R^k is the pre-trained GloVe embedding. Then the latent variable z can be calculated as:

q_φ(z|a, R, E) = softmax(W_z [U; a_emb] + b_z)

where U = [r_1^emb, ..., r_m^emb, e_1^emb, ..., e_n^emb], [U; a_emb] denotes the concatenation of U and a_emb, and W_z ∈ R^{2×2k} since we assume each z_i belongs to one of the two classes {0, 1}. The prior distribution can be calculated as:

p_θ(z|R, E) = softmax(W′_z U + b′_z)

The main idea is that sampling a discrete z from such a softmax distribution is equivalent to taking

z = one_hot(argmax_i (log(π_i) + g_i))

where π_i is the i-th logit of the softmax distribution p(z) and g_i is a sample drawn from Gumbel(0, 1). The argmax is further approximated by a continuous, differentiable function:

z_i = exp((log(π_i) + g_i)/τ) / ∑_{j=1}^{2} exp((log(π_j) + g_j)/τ)

τ is a temperature that controls the accuracy of this approximation: the smaller τ is, the closer the approximated distribution is to the true distribution. We denote ẑ = gumbel_softmax(z) as the discrete approximation of z, where ẑ ∈ R^{(m+n)×1}.

The KL divergence between the prior random variable z_prior from p_θ(z|R, E) and the posterior random variable z_posterior from q_φ(z|a, R, E) is:

KL(z_prior, z_posterior) = −p_i log(p′_i / p_i) − (1 − p_i) log((1 − p′_i)/(1 − p_i))

where z_prior ∼ Bern(p_i) and z_posterior ∼ Bern(p′_i).

Another challenge is that z is a discrete variable, which blocks the gradient and makes end-to-end training infeasible. Gumbel-Softmax [37] is a re-parameterization trick for handling discrete variables in neural networks, and we use this trick to sample the discrete z.
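As a minimal illustration of the Gumbel-Softmax relaxation described above, the sketch below draws a relaxed binary sample for each z_i from its two class logits. It is a standalone NumPy sketch with assumed logit values, not the actual model code.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_sample(logits, tau=0.5):
    """Relaxed sample from a categorical distribution given unnormalized
    logits (one row per variable, one column per class)."""
    # g ~ Gumbel(0, 1), obtained as -log(-log(U)) with U ~ Uniform(0, 1)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))   # numerically stable softmax
    return y / y.sum(axis=-1, keepdims=True)

# Assumed logits for three evidence variables z_i over the classes {0, 1}.
logits = np.log(np.array([[0.9, 0.1],
                          [0.2, 0.8],
                          [0.5, 0.5]]))
sample = gumbel_softmax_sample(logits, tau=0.5)
z_hat = sample[:, 1]   # relaxed "probability" that z_i = 1
print(np.round(z_hat, 3))
```

Lower temperatures push the relaxed samples toward one-hot vectors while keeping the computation differentiable, which is what allows the evidence selector to be trained end to end.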
Then we apply weighted sum pooling between the discretized z and U:

h_z = ReLU(∑_i z_i · U_i)
h = ReLU(W_h h_z + b_h)
p_θ(a|z, R, E) = softmax(W h + b)

During training, we also add a sparsity regularization on the latent variable z in addition to the ELBO, so our final training objective is

L_CVAE = −ELBO(a, R, E; θ, φ) + β KL(q_φ(z|a, R, E) || Bern(0))   (4.2)

During testing, we have two objectives. First, we want to infer the target action a, which can be computed through sampling:

p(a|R, E) = ∑_z p_θ(z|R, E) p_θ(a|z, R, E) ≈ (1/S) ∑_{s=1}^{S} p_θ(a|z_s, R, E)   (4.3)

where z_s ∼ p(z|R, E) and S is the number of samples. After obtaining the predicted action â, the posterior explanation is inferred as q_φ(z|â, R, E).

4.3.2 Conditional Variational Autoencoder with Supervision (CVAE+SV)

In this setting, we assume we have supervision for the discrete latent variable z, which makes this more of a multi-task setting. We optimize both the action prediction loss and the evidence selection loss. The final loss function is defined as:

L_SV = λ L_CVAE + (1 − λ) L_evidence

where

L_evidence = −∑_k (z_k log p(ẑ_k) + (1 − z_k) log(1 − p(ẑ_k)))

in which z_k ∈ {0, 1} is the ground-truth label, ẑ_k is the predicted label, and λ is a hyper-parameter.

Figure 4.3: Example crowdsourcing annotations, in which bold relations/attributes are annotated as gold. (Drink: (hold, hand, bottle), (near, bottle, mouth), (in, water, bottle), (hold, woman, racket), (racket, orange), (shirt, white). Chop: (carve, knife, meat), (use, man, knife), (on, fork, meat), (under, stove, pan), (meat, sliced), (fork, long). Feed: (eat, bird, fruit), (on, fruit, hand), (on, bird, hand), (on, neck, bird), (apple, green), (beak, orange).)

4.4 Data Collection

To evaluate our method, we created a dataset based on the Visual Genome (VG) data [46]. Each image in the VG dataset is annotated with bounding boxes, as well as relations and attributes describing the bounding boxes. The available annotations provided an ideal setup which allowed us to focus on studying commonsense explanation.

Table 4.1: Mean and standard deviation of the number of relations/attributes (all and gold) per image for each verb in the dataset.

Verb    Ave_Rel        Ave_Gold_Rel   Ave_Att        Ave_Gold_Att
feed    15.49 ± 7.55   2.79 ± 1.28    12.48 ± 7.11   0.26 ± 0.48
pull    14.62 ± 9.36   1.86 ± 0.84    13.60 ± 7.52   0.20 ± 0.45
ride    12.42 ± 7.18   1.69 ± 0.83    12.20 ± 7.13   0.13 ± 0.40
drink   15.16 ± 9.89   2.41 ± 1.14    10.86 ± 6.52   0.30 ± 0.56
chop    12.00 ± 7.22   2.41 ± 1.66    15.09 ± 6.82   1.60 ± 1.33
brush   15.40 ± 8.93   2.26 ± 1.08    12.31 ± 8.91   0.22 ± 0.49
fry     14.02 ± 7.02   2.72 ± 2.06    15.31 ± 7.16   0.91 ± 1.26
bake    13.31 ± 7.27   2.25 ± 1.69    13.44 ± 6.84   0.93 ± 1.06
blend   14.37 ± 6.37   2.56 ± 1.84    15.22 ± 7.18   0.15 ± 0.40
eat     15.08 ± 6.87   2.52 ± 1.08    11.98 ± 6.50   0.41 ± 0.70

More specifically, we selected ten frequently occurring actions (feed, pull, ride, drink, chop, brush, fry, bake, blend, eat) and manually identified a set of images depicting these actions. This leads to a dataset of 853 images, where each image comes with a ground-truth action and annotated bounding boxes, as well as corresponding relations and attributes. We then showed each image to the crowd (through Amazon Mechanical Turk) and instructed the turkers to choose justifying relations and attributes from a list. Each image was annotated by three turkers. The relations or attributes selected by two or more turkers are considered gold and can be used to explain or justify the perceived action. Table 4.1 shows some basic statistics for each action. The average number of relations and attributes per image varies only slightly across actions.
However, only a small percentage of them are considered gold. What is interesting is that the percentage of attributes considered gold is significantly lower than that of the relations. The sparsity of gold relations/attributes shows that learning an explainer for a target action is a challenging task. Some example image annotations are shown in Figure 4.3; for each image, we show a subset of relations/attributes, with the gold commonsense features marked in bold.

We further categorize the gold relations and attributes into the commonsense categories discussed in Section 4.2. As shown in Table 4.2, the ratios of transitive relations are similar across actions. The ratios of spatial relations and sub-actions vary across verbs; for instance, ride, bake, and blend tend to be explained by spatial relations more often than by sub-actions. In terms of attributes, feed, pull, and ride cannot be explained by effect states, while chop is mainly explained by the effect state of its direct object.

Table 4.2: Statistics for the categories of annotated relations/attributes for each verb.

Verb    Rel_Transitive  Rel_Sub_Action  Rel_Spatial  Att_Effect  Att_Associated
feed    0.10            0.45            0.45         0.0         1.0
pull    0.14            0.46            0.40         0.0         1.0
ride    0.15            0.13            0.72         0.0         1.0
drink   0.11            0.32            0.57         0.14        0.86
chop    0.11            0.29            0.60         0.82        0.18
brush   0.13            0.39            0.48         0.05        0.95
fry     0.12            0.17            0.71         0.53        0.47
bake    0.18            0.11            0.71         0.34        0.66
blend   0.15            0.09            0.76         0.22        0.78
eat     0.09            0.43            0.48         0.27        0.73

Figure 4.4: The system architecture for the attention-based method.

4.5 Evaluation on Action Explanation

In this section, we first evaluate the performance of action prediction and the explainer. We then compare two naive incremental learning strategies to show how they influence the final model's performance. We also show that when data annotation is limited, a semi-supervised method can help improve performance. To evaluate our model, we randomly split our dataset (853 images) into 60% for training, 20% for validation, and 20% for testing. For all models we use the Adam optimizer with a starting learning rate of 1e-4. All other hyperparameters are tuned on the validation set.

Table 4.3: Action prediction accuracy and evidence selection MAP.

Method       Action Accuracy  Evidence MAP
Random       0.1              0.251
Attention    0.789            0.442
CVAE         0.835            0.572
CVAE+SV      0.871            0.690
Upper Bound  0.918            1.0

4.5.1 Action Prediction and Explanation

For all experiments in this chapter, we use the annotated relations/attributes from the original Visual Genome data. As the state-of-the-art recall@50 for relation detection with a limited vocabulary is only around 20% [53], using annotated relations and attributes allows us to focus on the study of commonsense evidence and its role in justification and communication grounding. We use the Mean Average Precision (MAP) metric for evidence evaluation, as we want to rank good evidence higher than other evidence.

Methods for Comparison. We compare the following methods:

(1) CVAE. The conditional variational autoencoder model presented in Section 4.3.1.

(2) CVAE+SV. The CVAE model with supervision, as presented in Section 4.3.2.

(3) Upper Bound. We also calculate the upper bound of the CVAE model using the human-annotated gold evidence.

(4) Attention. We use an attention-based model as one of the baseline methods.
It is similar to the model presented in [102]. The architecture is shown in Figure 4.4. The attention is calculated as:

α_i = exp(u_i^T v) / ∑_j exp(u_j^T v)

where v is a context parameter and u_i is the GRU embedding of the corresponding relation/attribute.

(5) Random. A baseline method that randomly ranks all actions and evidence.

Table 4.4: Results from the human study on communication grounding. (For the Attention, CVAE, CVAE+SV, and Gold conditions under the easy and hard settings, the table reports the breakdown of trials into the categories M+H+, M+H-, M-H+, and M-H-; the individual entries could not be reliably realigned from the extracted layout.)

Evaluation Results. The results are shown in Table 4.3. Since the Upper Bound method directly uses the human-annotated gold evidence, its MAP for selecting evidence is always 1.0. The CVAE model outperforms the attention-based model in both the action prediction and evidence selection tasks. This indicates that the CVAE model provides better guidance for evidence selection during the training process. Furthermore, after adding the evidence supervision, the CVAE+SV model gives even better performance in both action prediction and evidence selection. We notice that for the CVAE+SV model, the action prediction accuracy approaches the upper bound of 91.8%; however, the evidence selection MAP is still far from the upper bound even with supervision.

4.5.2 Incremental Study

Humans learn new knowledge incrementally through interaction, but it is very challenging for machines to learn in an incremental way. In this section, we are interested in exploring how the CVAE and CVAE+SV models work in a simulated incremental setting. Specifically, we assume the training data are received in sequential order instead of all at once.

We evaluate two simple incremental strategies. The naive incremental strategy is to retrain the model using all available training data whenever new data arrive. The local incremental strategy is to fine-tune the model using only the newly arrived data, with parameters initialized from the previous best model. The local incremental strategy usually has a shorter training time than the naive incremental strategy, since it uses fewer training samples at each step. We use the same training/validation/test split as in Section 4.5.1.

Figure 4.5: Action prediction accuracy in the incremental study.

Figure 4.6: Evidence selection MAP in the incremental study.

Figure 4.5 and Figure 4.6 show the results of the incremental study. Overall, the naive incremental strategy outperforms the local incremental strategy, and the local strategy falls further behind as the ratio of training data increases. When supervision is added, the local CVAE+SV model performs better, but still worse than the naive incremental strategy.

4.5.3 Semi-Supervised Learning

Figure 4.7: Evidence selection MAP for semi-supervised learning.
Although we have shown that adding supervision on the latent variable z improves model performance, collecting this label information through human annotation is usually time-consuming and expensive. In this section, we explore how semi-supervised learning can help alleviate this difficulty. As a generative model, the VAE has shown its advantage in semi-supervised learning [43]. In fact, our task is simpler, as our latent variable z is also the target label y. Following the method in [43], our semi-supervised learning loss function is defined as:

L = ∑_{(a,R,E,z)∼p_l} L_SV + ∑_{(a,R,E)∼p_u} L_CVAE

where L_SV is defined in Section 4.3.2 and L_CVAE is detailed in Section 4.3.1. In other words, data samples with evidence labels are fed to L_SV, and the rest are fed to L_CVAE.

The results are shown in Figure 4.7, where the x-axis shows the ratio of labeled examples. The incremental Naive CVAE+SV model only uses the labeled evidence examples, while the Semi CVAE model also uses unlabeled evidence examples. The figure shows that the Semi CVAE model outperforms the Naive CVAE+SV model. This indicates that the semi-supervised method can improve evidence selection by making use of unlabeled examples.

4.5.4 Visual Simulator

In the previous experiments, we assumed no visual preprocessing and directly used the textual input provided by Visual Genome. Detecting relations from an image is a very hard problem: even the current state-of-the-art recall@50 is only around 20%, with a small, limited vocabulary of objects and relation predicates, which is far from practical for our case. First, our data contain a very large vocabulary (more than 1,000 objects and predicates in total), and the common practice of selecting the most frequently occurring vocabulary can hardly cover the human-annotated gold evidence due to sparsity. Second, the label set also contains overlapping labels which are very similar. Third, the Visual Genome data suffer from missing annotations: an image contains many relations, while the textual annotation only covers a small fraction of them.

Therefore, we use Faster R-CNN [78] to detect the object bounding boxes and object classes. We then use the structural ranking-based method [54] to classify relation predicates and object attributes. We use all the data to build the visual model. For our task, we blurred all images with a kernel size of 5 and used the top predicted relations/attributes as noise, mixed with the human-annotated gold evidence, as input. We define the noisy ratio as the ratio between noisy evidence and gold evidence, to estimate the robustness of the model. The reason for doing so is that even when we train and test on the same dataset, the recall of the gold evidence is not high, due to the prevalence of ordinary relations and attributes such as "in", "on", "color" and so on.

Figure 4.8: Action accuracy for the visual simulator.

In Figure 4.8 and Figure 4.9, we compare the performance of different methods under different noisy ratio conditions. Each value is computed by averaging over 3 runs. The CVAE model's performance is slightly better than that of the attention-based model, but the gap is not large.
Figure 4.9: Evidence MAP for the visual simulator.

Through manually checking the predicted noisy evidence, we find that it has a different distribution from the noisy evidence extracted from image descriptions and is not human-like, due to the spatial bias of 2D images (most of the top relation predicates are "in" and "on"). How to improve the visual simulator so that it generates human-like evidence candidates is still an open problem that we will explore in future work.

Figure 4.10: The experimental setup for the human subject study examining the role of commonsense justification towards common ground.

4.6 Commonsense Justification towards Common Ground

Figure 4.11: Examples of the communication grounding study based on different models. (Two examples are shown. For an image with gold action Bake, the attention model predicts Eat, while the human guesses Bake from its evidence; under CVAE, CVAE+SV, and the gold evidence, both the agent prediction and the human guess are Bake. For an image with gold action Brush, all models predict Brush, but the attention model's evidence leads the human to guess Skin, whereas the other conditions lead the human to Brush.)

In human-agent communication, the success of communication largely depends on common ground, which captures shared knowledge, beliefs, or past experience [12]. As commonsense evidence is what humans use to justify actions, we hypothesize that communicating such evidence can help establish common ground between humans and agents. To validate this hypothesis, we conducted a human-subject experiment to examine the role of commonsense justification in facilitating common ground.

4.6.1 Experiment Setup

Figure 4.10 shows the setup of our experiment. The agent is provided with an image and applies various models (e.g., CVAE) to jointly predict the action and identify commonsense evidence. The human is provided with a list of six action choices and does not have access to the image. The agent communicates to the human only the identified commonsense evidence, and the human guesses the action from the candidate list purely based on the communicated evidence. The idea is that, if the human and the agent share the same beliefs about the evidence that justifies an action, then the action guessed by the human should be the same as the action predicted by the agent.

Generating Distracting Verbs. For each image, the human is provided with a list of six action/verb candidates. To generate this list, we mix four distracting verbs with the ground-truth action verb plus a default Other.
Most of the distracting verbs come from the concrete action verbs made available by [26]. We first manually filtered out verbs that have the same meaning as the ground-truth verb. We then selected two groups of distracting verbs: an easy group (where the distracting verbs are farther from the ground-truth verb in the embedding space, with an average similarity of 0.284) and a hard group (closer to the ground-truth verbs, with an average similarity of 0.479). A temperature-based softmax distribution [10] was used to sample the easy and hard distracting verbs based on cosine similarity between pre-trained GloVe [69] embeddings. The selected confusion verbs are listed in Table 4.5.

Process. A total of 170 images were used in this experiment, and 24 workers from AMT participated in our study. For each image, we applied three different models (the Attention baseline, CVAE, and CVAE+SV) to generate the commonsense evidence. An upper bound based on gold commonsense evidence was also measured. Note that the agent has no knowledge of the human's action choices when generating the commonsense evidence. Theory of mind is an important aspect of human-agent communication; incorporating the human's action choices when justifying an action is an interesting but different problem which requires different solutions. In this chapter, we focus only on the situation where the mind of the human is opaque to the agent. For each model and each image under the easy or hard configuration, the top five pieces of predicted commonsense evidence (associated with the predicted action) were shown to a worker. The worker was then asked to select the most probable action from the distractor list based only on these five pieces of evidence. We randomly assigned three workers to each image. The majority among the three selections was taken as the final answer; if all three selections disagreed, one worker's choice was randomly selected as the final answer.

Table 4.5: Target actions and easy/hard confusion actions. (For each of the ten target actions, the table lists five easy confusion actions and five hard confusion actions, each list ending with the default Other; the assignment of the confusion lists to the target actions could not be reliably recovered from the extracted layout.)

Table 4.6: Results from the human subject study on common ground. (Common ground rates for the Attention, CVAE, CVAE+SV, and Gold conditions under the easy and hard settings; the individual entries could not be reliably realigned from the extracted layout.)

Metrics for Common Ground. We use the agreement between the action guessed by the human and the action predicted by the agent to measure how well the selected commonsense evidence serves to bring the human and the agent to a common ground on perceived actions. More formally, as shown in Figure 4.10, given an image, suppose its ground-truth action is A, the action predicted by the agent/machine is Am, and the action guessed by the human is Ah; the common ground is defined as Am = Ah = A. Here we also enforce that the predicted action should be the same as the ground-truth action.
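As a small illustration of this metric, the sketch below computes the common ground rate over a list of trials; the trial tuples are hypothetical examples, not data from the study.

```python
def common_ground_rate(trials):
    """Fraction of trials where the machine prediction (Am), the human guess (Ah),
    and the ground-truth action (A) all agree: Am = Ah = A."""
    hits = sum(1 for a, am, ah in trials if a == am == ah)
    return hits / len(trials)

# Hypothetical trials: (ground truth, machine prediction, human guess).
trials = [("bake", "bake", "bake"),
          ("brush", "brush", "skin"),
          ("chop", "chop", "chop"),
          ("drink", "eat", "drink")]
print(common_ground_rate(trials))  # 0.5
```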
The percentage of trials, under each model, that led to a common ground is measured and compared.

4.6.2 Experimental Results

Table 4.6 shows the comparison among the various models and the upper bound, where the gold commonsense evidence is provided to the human. It is not surprising that performance on common ground is worse in the hard configuration, as the distracting verbs are more similar to the target action. The CVAE-based methods are better than the attention-based method at facilitating common ground.

Figure 4.11 shows two examples of the top five pieces of predicted evidence under different models. For each model, it also shows the action predicted by the agent (Am) and the action guessed by the human (Ah). In both examples, all models were able to establish a common ground except for the attention-based model. The evidence selected by the CVAE+SV model is clearly more accurate than that of the CVAE model and is closer to the ground-truth evidence. The second example shows that although the attention-based model predicts the correct target action, it fails to convey correct commonsense evidence to establish a common ground with the human.

4.7 Conclusion

This chapter describes an approach to action justification using commonsense evidence. As demonstrated in our experiments, commonsense evidence is selected to align with humans' justification of an action and is therefore critical in establishing a common ground between humans and agents. As a first step in our investigation, this work is based on annotated symbolic descriptions from perception. This assumption allows us to focus on higher-level commonsense reasoning and supports a better understanding of the role of commonsense evidence in explanation and communication grounding. Our future work will extend the model and the findings from this work to vision processing that will not only identify commonsense evidence but also explain where and how in the perceived environment the evidence is gathered.

Chapter 5 Grounded Action Justification

In the previous chapter, we conducted pilot studies on the relations between physical actions and latent, structured commonsense justifications. However, those experiments were based on purely textual inputs (we assumed the relations and attributes of the image were given), which is not realistic in the real world. Furthermore, the empirical results were obtained on a limited action vocabulary; how to efficiently deal with the diverse actions of the real world is still an open and challenging problem. In this chapter, we address the problem of learning to justify perceived actions through natural language rationales. We propose a deep factorized network which jointly models the relations between the shared environment, perceived actions, and action justifications. Our empirical results show that the proposed model outperforms strong baselines in overall performance. By explicitly modeling factors of language grounding and commonsense reasoning, the proposed model provides a better understanding of the effects of these factors on grounded action justification.

5.1 Introduction

To advance research on multi-modal visual understanding, researchers have proposed various language and vision tasks as test beds for different methods. One of the most popular tasks is visual question answering: given a query and an image, how can the correct answer be selected from a list of candidate answers?
However, for commonly used visual question answering datasets, most questions target objects or properties and do not require higher-level cognitive reasoning ability. Motivated by this, the Recognition to Cognition dataset [105] was collected and a new visual commonsense reasoning (VCR) task was proposed: given an image and a question, the goal is to select the correct answer and the corresponding rationale from a list of candidates. The VCR task is closely related to our previous work, but they differ in the following respects:

• Our previous work assumes that the given inputs are textual representations, and we directly model the linguistic text without grounding the action justifications.

• Our previous work performs experiments on a limited action vocabulary, while the VCR dataset is collected from complex movie scenes which contain complex and diverse actions. Besides, our previous work represents actions as single verbs rather than action phrases or sentences, whereas in the VCR dataset the gold answers can be long sentence descriptions. The VCR task is thus more challenging than our task in Chapter 4.

• Our previous work assumes structured but simplified justifications consisting of independent relations and attributes, while in the VCR task the rationales are complex sentences involving multiple verbs, objects, attributes, and their interactions.

Mathematically, the visual commonsense reasoning task is defined as:

(A, R) = argmax_{A,R} p(A, R | Q, I)

where I is the given image, Q is the question, A is one of the answer choices, and R is one of the rationale choices. In the previous work [105], this joint process is modeled as a two-step process:

p(A, R | Q, I) ∝ p(A | Q, I) p(R | Q, I, A)

where the first step selects the best answer choice based on the question and image, and the second step selects the most probable rationale according to the image, question, and inferred answer. However, this two-step process is counter-intuitive with respect to how humans solve such questions: when humans answer visual questions, the answering process and the rationale reasoning process interact and often happen simultaneously. We therefore propose a joint learning and inference method to solve the problem p(A, R | Q, I).

Specifically, to learn to justify perceived actions through natural language rationales, we formulate the problem as follows: given an image (I) and a question (Q) about the activity in the image (e.g., "what is person 1 doing?"), the goal is to identify an answer (A) from a list of potential answers (e.g., "person 1 is drinking water") and, at the same time, a rationale (R) that supports the answer from a set of rationale candidates (e.g., "person 1 is holding a cup"). Our solution is to jointly infer the (A, R) pair that maximizes P(A, R | Q, I), as rationales and predictions often support each other in the decision-making process. We do not consider rationales as post-hoc justifications as in the original VCR task. This is a key difference between our setup and the VCR setup. Because of this difference, the original VCR dataset (i.e., the R2C dataset) cannot be directly applied here; instead, we augment a portion of the original R2C dataset for our investigation of the joint problem (details are described in Section 5.2).
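To make the difference between the two-step pipeline and the joint formulation concrete, here is a minimal sketch that enumerates the candidates under both strategies; the candidate sets and scores are hypothetical stand-ins for learned models, not the networks used in this chapter.

```python
import random

rng = random.Random(0)
answers = ["A0", "A1", "A2", "A3"]
rationales = [f"R{i}" for i in range(16)]

# Hypothetical scores standing in for p(A|Q,I) and p(R|Q,I,A); in a real
# system these would come from learned networks conditioned on Q and I.
p_a = {a: rng.random() for a in answers}
p_r_given_a = {(a, r): rng.random() for a in answers for r in rationales}

# Two-step inference: pick the best answer first, then its best rationale.
a_hat = max(answers, key=lambda a: p_a[a])
r_hat = max(rationales, key=lambda r: p_r_given_a[(a_hat, r)])

# Joint inference: search over all (answer, rationale) pairs at once, so a
# slightly weaker answer with a strongly supporting rationale can still win.
a_joint, r_joint = max(
    ((a, r) for a in answers for r in rationales),
    key=lambda ar: p_a[ar[0]] * p_r_given_a[ar],
)
print("two-step:", (a_hat, r_hat), " joint:", (a_joint, r_joint))
```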
As there are intrinsic relations between the image, question, answers, and rationales, we develop a factorized deep neural network which explicitly models these relations to capture the following intuitions: (A, Q, I): a good answer has to be grounded to the image content given the question. (R, Q, I): a good rationale has to be grounded to the image content given the question. (A, R): a good rationale for an answer should follow general commonsense knowledge. The first two factors (i.e., (A, Q, I) and (R, Q, I)) concern the ability to ground language to perception (together they are referred to as the language grounding factor), and the third factor addresses the ability to reason with commonsense knowledge that may hold between answers and rationales (i.e., the commonsense reasoning factor).

The contributions of this work are that, instead of treating rationales as post-hoc justifications, we propose a model that jointly infers actions and rationales. By decomposing the complex relations between images, questions, action answers, and action rationales, the proposed model not only outperforms strong baselines, but more importantly promotes a better understanding of the role of language grounding and commonsense reasoning in grounded action justification.

5.2 R2C Dataset Augmentation

The original R2C dataset was collected through Amazon Mechanical Turk. Each turker is given an image with object detections and contextual descriptions, and is asked to generate one to three questions with answers and rationales. The images are extracted from movie clips. To generate the negative candidates for the answers and rationales, the authors proposed a so-called adversarial matching method with two steps:

• In the first step, the negative answer candidates are selected from the pool of all answers as those most similar to the query.

• In the second step, the negative rationale candidates are selected from the pool of all rationales as those most similar to the query and the ground-truth answer.

After these two steps, one sample is composed of one query, one image, four candidate answer choices that contain the ground-truth answer, and four candidate rationale choices that contain the ground-truth rationale.

Given this two-step generation process, a problem arises when we jointly infer the answer and the rationale: the answer bias problem. Because all four rationale candidates are related to the correct answer, the rationale set leaks information about the answer. To fix this bias, one possible solution is to introduce negative rationales that are similar not only to the correct answer but also to the negative answers. Fortunately, in the original R2C dataset, to avoid linguistic bias, each answer candidate (including the wrong answers) appears at least once as the correct answer in some other sample. We can therefore borrow rationale candidates for each negative answer from the sample where it appears as the correct answer, which means that each sample now has 16 rationale candidates (of which 12 are augmented rationales). In this way, we eliminate the answer bias by forcing the model not only to reason about the textual relations between the answer and the rationale, but also to ground the justifications.
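To make the borrowing step concrete, here is a minimal sketch of the idea. The field names and data layout are simplified assumptions for illustration (they roughly follow common VCR-style JSON fields but are not the actual augmentation code), and the object-tag re-mapping discussed next is omitted.

```python
# A sketch of the rationale-borrowing step: every negative answer also appears
# as the correct answer of some other sample, so its four rationale candidates
# can be borrowed from that donor sample.

def augment_rationales(sample, samples_by_gold_answer):
    """Return 16 rationale candidates: 4 original + 4 borrowed per negative answer."""
    rationales = list(sample["rationale_choices"])        # the 4 original candidates
    for idx, answer in enumerate(sample["answer_choices"]):
        if idx == sample["answer_label"]:
            continue                                      # skip the gold answer
        donor = samples_by_gold_answer[answer]            # sample where this answer is gold
        rationales.extend(donor["rationale_choices"])     # borrow its 4 rationales
    return rationales                                     # 4 + 3 * 4 = 16 candidates

# Hypothetical usage: index the corpus by gold answer, then augment each sample.
# samples_by_gold_answer = {s["answer_choices"][s["answer_label"]]: s for s in corpus}
# for s in corpus:
#     s["rationale_choices_aug"] = augment_rationales(s, samples_by_gold_answer)
```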
The last remaining problem with this dataset augmentation is how to fill in the object tagging for the newly augmented rationales. In the originally collected dataset, the turkers directly use bounding-box tagging numbers instead of object names in the rationales. For example, one possible rationale is "[2] is a professional musician in an orchestra.", where [2] is the tagging number of one of the bounding boxes in the image. When we augment new rationales, we need to re-map the old tagging numbers to the new tagging numbers. We use a heuristic-based method to do the re-mapping.

• Since the same answer appears in two samples, sample si where it is the correct answer and sample sj where it is a wrong answer, we build object tagging mappings between these two samples for the objects appearing in that shared answer. Following this mapping, we map sample si's tagging numbers to sample sj's tagging numbers for all of the rationales that sample sj borrows.

• For tagging numbers not covered by the first step's mapping, we try to match them to tagging numbers in the other sample that carry the same object label. If multiple tagging numbers share the same label, we randomly assign one of them to complete the mapping.

After these steps, we obtain an unbiased augmented dataset for the joint VCR task. Figure 5.1 shows a concrete augmented sample. The upper left corner shows the image with automatically detected bounding boxes. The upper right corner shows the question and the four answer choices; the answer in red is the ground truth. Below the image we show the candidate rationales. Each answer corresponds to four candidate rationales (for simplicity, only one rationale per answer is shown). The rationales for answer 1 come from the original R2C dataset, and all the other rationales, colored green, come from the data augmentation process.

Figure 5.1: An example augmented R2C sample.

5.3 Deep Factorized Joint Modeling for Grounded Question Answering and Explanation

In this section, we detail the proposed method. First we briefly review and discuss the intuitions behind the model. Then we describe the model architecture and the learning process mathematically.

5.3.1 Motivation

Humans tend to answer visual questions and generate explanations for their answers in a single joint process rather than a post-hoc two-step process. Motivated by this intuition, we propose a joint algorithm that predicts the answer and the rationale simultaneously. The benefits of this joint process are twofold: first, it is more consistent with the human thinking process, since humans do not separate visual question answering and explanation into two stages; second, joint modeling can better incorporate the interactions between language and vision, which helps alleviate the error propagation problem of the two-step process.
5.3.2 Deep Factorized Modeling

Following the theory of undirected graphical models, we can write the probability distribution p(A, R | Q, I) as

p(A, R | Q, I) = (1/Z) exp(Φ(A, R, Q, I))

where Φ(A, R, Q, I) is the potential function of the factor and Z is the normalization constant

Z = Σ_{A,R} exp(Φ(A, R, Q, I))

The meanings of the random variables A, R, Q, I are consistent with the previous definitions. The key question then becomes how to factorize the single large factor Φ(A, R, Q, I). Before answering this question, consider what makes a good answer and a good rationale:

• A good answer must be well grounded to the image given the question.

• A good rationale must be well grounded to the image given the question.

• A good answer and a good rationale must match each other well.

Based on these three assumptions, we factorize the large factor into three smaller factors, as shown in Figure 5.2. Formally, the factor model can be written as

Φ(A, R, Q, I) = Φ(A, Q, I) + Φ(R, Q, I) + Φ(A, R)

Figure 5.2: A graphical model representation of the joint VCR task.

The factor Φ(A, Q, I) is similar to the VQA task in that it captures the grounded semantics of the answer given the image and the question. The factor Φ(R, Q, I) captures the grounded semantics of the rationale given the image and the question. The last factor Φ(A, R) captures the correlations between the answer and the rationale. By jointly optimizing these three factors, we hope to infer answers and rationales that are both grounded in the image and match each other causally. For each factor, we use a neural network to approximate the potential function. In the following parts, we detail each factor's neural architecture, which is shown in Figure 5.3.

5.3.2.1 Visual Question Answering

In this subsection, we build a neural network to predict the factor potential Φ(A, Q, I). The architecture is shown in the left part of Figure 5.3.

Input Representation: We use Bert [18] to represent the question, the answer and the rationale. Let the question Q be q_1, q_2, ..., q_{lq}, where each q_i is a word and lq is the length of the question. Similarly, let the answer A be a_1, a_2, ..., a_{la}, where each a_i is a word and la is the length of the answer. Using Bert, we transform the question into a matrix Q_I ∈ R^{lq×768} and the answer into a matrix A_I ∈ R^{la×768}.
For the image I, we use a pre-trained ResNet [31] to extract visual features I_h ∈ R^{c×w×h}, and we reshape this three-dimensional tensor into a two-dimensional matrix I_h ∈ R^{(w·h)×c}.

Figure 5.3: The neural network architecture of the deep factorized network.

Word-Level Grounding: The first step is word-level grounding, whose goal is to enrich each word with a corresponding visual representation. For example, for the objects and nouns appearing in Figure 5.1, we hope the model can automatically learn where in the image to focus.
To achieve this, we apply a bilinear attention mechanism between the Bert representations of the words in a sentence and the visual representation of the image:

s_{ij} = Q_I^i M_1 I_h^j
The attention weights are obtained by normalizing the scores s_{ij} over the image regions:

a_{ij} = s_{ij} / Σ_j s_{ij}

where Q_I^i is the Bert representation of the i-th word in the question, and I_h^j is the approximate j-th regional representation of the image. The final grounded visual representation of each word is calculated as

v_i = Σ_j a_{ij} I_h^j

In our augmented R2C dataset, some words are represented by bounding-box tagging numbers, as noted before. For these words we directly use the corresponding bounding-box visual features extracted by Mask R-CNN [30]; for words without a bounding-box tag, we use the attended visual representation. We then concatenate the visual representation v_i with the original Bert representation Q_I^i to form a new sentence representation for the question and the answer:

ˆQ_I = [Q_I^1, v_1; ...; Q_I^{lq}, v_{lq}]
ˆA_I = [A_I^1, v*_1; ...; A_I^{la}, v*_{la}]

where v*_i is the corresponding visual representation of the i-th word in the answer.

Contextualized word representation: To obtain a better meaning representation of the words, we use co-attention to compute contextualized representations of the question and the answer. From the previous step, we have the grounded question representation ˆQ_I ∈ R^{lq×d} and the grounded answer representation ˆA_I ∈ R^{la×d}. First we calculate a correlation score matrix:

S = ˆQ_I M_2 ˆA_I^T

Each row of S represents how a word in the question attends to the words in the answer, and each column represents how a word in the answer attends to the words in the question. Applying the same attention mechanism as above, we obtain the answer-guided question representation ˆQ*_I ∈ R^{lq×d} and the question-guided answer representation ˆA*_I ∈ R^{la×d}. The final contextualized question and answer representations are Q = [ˆQ_I, ˆQ*_I] and A = [ˆA_I, ˆA*_I].
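The word-level grounding and co-attention steps above can be summarized in a small numpy sketch with toy dimensions; it is an illustration under simplifying assumptions, not the actual implementation. The real model uses 768-dimensional Bert features and ResNet region features, and M_1, M_2 here are random stand-ins for the learned bilinear matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy dimensions: 6 question words, 5 answer words, 49 image regions.
lq, la, n_regions, d_txt, d_img = 6, 5, 49, 32, 16
rng = np.random.default_rng(0)
Q_I = rng.normal(size=(lq, d_txt))         # Bert features of question words
A_I = rng.normal(size=(la, d_txt))         # Bert features of answer words
I_h = rng.normal(size=(n_regions, d_img))  # reshaped image feature map (w*h regions)
M1 = rng.normal(size=(d_txt, d_img))       # bilinear word-region grounding matrix

# Word-level grounding: attend each word over image regions, then concatenate.
def ground(words):
    a = softmax(words @ M1 @ I_h.T, axis=1)          # attention weights a_ij
    return np.concatenate([words, a @ I_h], axis=1)  # [word feature ; grounded visual]

Q_hat, A_hat = ground(Q_I), ground(A_I)

# Co-attention between the grounded question and answer representations.
d = Q_hat.shape[1]
M2 = rng.normal(size=(d, d))
S = Q_hat @ M2 @ A_hat.T                   # correlation scores (lq, la)
Q_star = softmax(S, axis=1) @ A_hat        # answer-guided question representation
A_star = softmax(S.T, axis=1) @ Q_hat      # question-guided answer representation

Q_final = np.concatenate([Q_hat, Q_star], axis=1)
A_final = np.concatenate([A_hat, A_star], axis=1)
print(Q_final.shape, A_final.shape)        # (6, 96) (5, 96)
```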
To obtain a summary of the textual information, we run a bi-directional LSTM over both the question and the answer:

h_f^q = LSTM_forward(Q),  h_b^q = LSTM_backward(Q),  h^q = [h_f^q, h_b^q]
h_f^a = LSTM_forward(A),  h_b^a = LSTM_backward(A),  h^a = [h_f^a, h_b^a]

We concatenate h^q and h^a to obtain the final textual representation h = [h^q, h^a].

Sentence-Level Grounding: We use the textual summary vector to re-attend to the visual representation and obtain a better visual summary. Specifically, we use the summary vector h to attend over the visual representation I_h. Applying the same attention mechanism as in word-level grounding, we obtain a visual representation h_visual. To produce the final factor potential, we first concatenate h and h_visual into h_final and then apply a fully connected linear layer:

f = W h_final + b

5.3.2.2 Visual Question Rationale (VQR)

The goal of this module is to compute the grounding score of a rationale given the image and the question, so we use an architecture similar to the VQA module; the detailed architecture is shown in Figure 5.3. The difference is that the answers in the VQA module are replaced with rationales. These two modules also share all of their parameters, which helps reduce the risk of overfitting.

5.3.2.3 Causal Matching Between the Answer and the Rationale

The last module is answer-rationale matching. The essence of this module is to learn the causal relation between sentences. The right part of Figure 5.3 shows how we perform the causal matching between the answer and the rationale. The procedure follows a simplified version of the VQA/VQR pipeline without visual information. We first use Bert to embed both the answer and the rationale, then build contextualized answer and rationale representations with the co-attention mechanism. We use a shared long short-term memory unit to encode these two sequential representations. Finally, we concatenate the two embeddings and apply a final linear layer to obtain the causal factor potential.

5.3.3 Training and Inference

During training, we use maximum likelihood as the training criterion:

L = log P(A, R | Q, I)
  = log [ exp(Φ(A, R, Q, I)) / Σ_{A',R'} exp(Φ(A', R', Q, I)) ]                      (5.1)
  = Φ(A, Q, I) + Φ(R, Q, I) + Φ(A, R) − log Σ_{A',R'} exp(Φ(A', R', Q, I))

All parameters are optimized jointly, end to end. During inference, we seek the answer A and rationale R such that

(A, R) = argmax_{A,R} (Φ(A, Q, I) + Φ(R, Q, I) + Φ(A, R))

Adaptive Factor Weighting (AFW): The loss above assumes that all three factors are equally important. It is more intuitive, however, to use adaptive weights for the different factors based on the whole context. We therefore decompose the large factor as

Φ(A, R, Q, I) = α_1 Φ(A, Q, I) + α_2 Φ(R, Q, I) + α_3 Φ(A, R)

where the weight vector α is learned as part of the model:

s_α = g(A, R, Q, I),  α = softmax(s_α)

The function g is a multi-layer network learned on the concatenation of the representations of A, R, Q and I. Correspondingly, during inference the best answer and rationale are inferred by

(A, R) = argmax_{A,R} (α_1 Φ(A, Q, I) + α_2 Φ(R, Q, I) + α_3 Φ(A, R))

5.4 Experiments and Results

In this section, we first report basic statistics of the augmented dataset. Then we present experimental results comparing different methods. Third, we explore several ablation studies to better understand how much the different modules contribute.
Finally, we present a qualitative analysis of the attention mechanisms used in the different modules.

5.4.1 Dataset Statistics

We first report basic statistics of the training and testing data. As stated before, after augmenting the original R2C dataset, each sample consists of 1 image, 1 visual question, 4 answer choices and 16 rationale choices. The full dataset is divided into 10 folds with disjoint image sets; we randomly choose 8 folds for training, 1 fold for validation and the remaining fold for testing. The original dataset contains many kinds of questions, including activity, explanation, temporal and others. Since this work focuses on actions, we use a simple rule to filter action-related samples from the original dataset: we extract samples whose question contains one of the following keywords: doing, looking, event, playing, preparing. Table 5.1 shows the basic statistics of the train/validation/test splits, including the number of samples, the total number of tokens, and the number of unique images.

Table 5.1: Basic statistics for the augmented R2C dataset.

                  Train    Validation   Test
No. of Samples    33687    4811         4816
No. of Tokens     10.8m    1.55m        1.57m
No. of Images     27193    3878         3869

Table 5.2: Validation and test accuracy for all the models.

Method                         Validation   Test
R2C                            0.344        0.348
R2C_beam                       0.364        0.364
VAE                            0.245        0.248
MLP                            0.342        0.327
Factorized Model (Our Work)    0.391        0.385

In Figure 5.4, Figure 5.5 and Figure 5.6, we show the length distributions of the questions, answers and rationales on the training, validation and testing sets. From these three figures, we can see that the questions, answers and rationales have similar distributions across the three splits. Comparing the histograms within the same split, we can clearly see that the rationales are on average longer than the answers.

Figure 5.4: The length distribution of the sentences in the training dataset.
Figure 5.5: The length distribution of the sentences in the validation dataset.
Figure 5.6: The length distribution of the sentences in the testing dataset.

5.4.2 Results

To evaluate the effectiveness of our method, we compare our joint model against the following baselines:

• R2C: the original method proposed in [105], which uses a two-step strategy: first predict the answer, then predict the rationale.

• R2C_beam: an extension of the R2C method that, when inferring the rationale, uses beam search to find the best combination of answer and rationale predictions.

• VAE: an extension of our previous VAE-based method that incorporates the question and the visual information.

• MLP: an MLP baseline for the joint R2C task, a common baseline for VQA-style tasks.

Figure 5.7: The histogram of action frequency in the training dataset.

Since we use Bert as the linguistic representation, the first step of our training process is to finetune the Bert model on our dataset. Specifically, we generate positive and negative question-answer, answer-rationale and question-rationale pairs, and treat the Bert finetuning as a two-way binary classification problem. Following the setup of the original R2C work, we finetune with the Adam optimizer and a learning rate of 2 × 10^-5. For training all models, we use the Adam optimizer with a learning rate of 2 × 10^-4 and a weight decay of 10^-4. The gradients are clipped to an L2 norm of at most 1.0. We set the hidden layer dimension to 512.
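As an illustration of the objective in Eq. (5.1) together with the optimizer settings above, the following PyTorch-style sketch handles a single example. It is not the actual training code: the factor potentials are random tensors standing in for the outputs of the three factor networks, the gold indices are dummies, and adaptive factor weighting is omitted.

```python
import torch
import torch.nn.functional as F

# Random stand-ins for the factor potentials of one example
# with 4 answer candidates and 16 rationale candidates.
phi_aqi = torch.randn(4, requires_grad=True)      # Φ(A, Q, I) per answer
phi_rqi = torch.randn(16, requires_grad=True)     # Φ(R, Q, I) per rationale
phi_ar = torch.randn(4, 16, requires_grad=True)   # Φ(A, R) per pair

# Joint potential over all (answer, rationale) pairs.
phi = phi_aqi[:, None] + phi_rqi[None, :] + phi_ar  # shape (4, 16)

# Maximum likelihood = cross-entropy over the 64 flattened pairs (Eq. 5.1).
gold_answer, gold_rationale = 2, 7                  # dummy gold indices
target = torch.tensor([gold_answer * 16 + gold_rationale])
loss = F.cross_entropy(phi.reshape(1, -1), target)

params = [phi_aqi, phi_rqi, phi_ar]                 # in practice: model.parameters()
optimizer = torch.optim.Adam(params, lr=2e-4, weight_decay=1e-4)
loss.backward()
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
optimizer.step()

# Inference simply picks the highest-scoring (answer, rationale) pair.
best = int(phi.argmax())
print("predicted answer:", best // 16, "rationale:", best % 16)
```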
Table 5.2 shows the performance of the different methods on the validation and test sets. First, our factorized model performs best among all methods. Second, the best model besides ours is the R2C_beam method, even though it is a two-step method. Third, although VAE and MLP are also joint models, they do not perform well compared with the R2C method, which shows that carefully modelling the interaction between the different parts or modules is also very important. The VAE-based method performs much worse than the other methods; one possible reason is that, to make the whole VAE system end-to-end differentiable, the approximate Gumbel-softmax sampling of the rationale given the answer is difficult to learn effectively. Comparing the R2C method and R2C_beam, we can see that beam search effectively mitigates the errors propagated during R2C's independent inference process.

5.4.3 Ablation Study

To better understand how the different factors contribute to our model, we conducted ablation studies, both quantitative and qualitative, of how important the different factors are. First we test how important each factor designed for the joint R2C problem is. We are interested in the following settings for the Factorized Model (FM):

• FM-AFW: the factorized model without adaptive factor weighting. The goal of this setting is to test how much the adaptive factor weighting contributes.

• FM-QA: the factorized model without the visual question answering factor. The goal is to test how important the QA factor is.

• FM-QR: the factorized model without the visual question rationale factor. The goal is to test how important the QR factor is.

• FM-AR: the factorized model without the answer-rationale causal matching factor. The goal is to test how important the AR factor is.

• FM-IM: the factorized model without the visual information. In this setting, we mask all of the visual context and test how a purely text-based system performs on the joint R2C task.

The results for the ablation studies are shown in Table 5.3.

Table 5.3: Ablation study results.

Method    Validation   Test
FM        0.391        0.385
FM-AFW    0.382        0.371
FM-QA     0.346        0.333
FM-QR     0.312        0.309
FM-AR     0.209        0.194
FM-IM     0.271        0.265

From this table, we can see that the different factors influence the model's performance to different degrees. Adding adaptive factor weighting improves the model's performance on both validation and test. Among the three factors QA, QR and AR, the AR factor plays the most important role, as performance drops the most when it is removed; the three factors rank AR, QR, QA in decreasing order of importance. We can also see that even when we completely remove the visual input, validation and test performance decrease, yet they remain higher than in the setting that removes the AR factor.

Next, we perform a qualitative analysis to understand what exactly the model learns. Because this work mainly concerns actions, the first step is to examine the distribution of actions in our training dataset. To extract the actions from the ground-truth answers of the training samples, we run dependency parsing on each ground-truth answer and then select verb or verb+noun patterns. The histogram of action frequencies is shown in Figure 5.7.
The x-axis shows the action frequency, while the y-axis shows the histogram count. We can see a severe long-tail distribution in which most actions occur fewer than 5 times. This also helps explain why introducing Bert improves the final performance so much in the R2C work [105]: Bert is pre-trained on a large corpus with language modelling, so the learned representation leads to better generalization.

To understand in more detail how different factors influence different actions, we report the accuracy for specific actions under the different ablation settings. Specifically, for each of the eight most frequent actions, we calculate its accuracy under the different ablation studies. The results are shown in Table 5.4.

Table 5.4: Effect of factors on different actions.

        Look    Get     Watch   Talk    Dance   Wait    Drink   Eat
No.     516     180     152     137     107     89      89      89
DFN     0.389   0.461   0.428   0.365   0.411   0.404   0.337   0.472
-AR     0.219   0.267   0.184   0.182   0.299   0.146   0.269   0.213
-IM     0.264   0.3     0.316   0.248   0.215   0.213   0.213   0.326

Different factors can influence different actions differently. For the majority of actions, except for dance and drink, the reasoning factor plays a more important role than the language grounding factor. A possible explanation is that language grounding is an extremely challenging task and the learned grounding model is still quite limited, whereas the reasoning model does a better job of capturing commonsense relations between an answer and a rationale. We think the use of Bert contributes to this advantage, as Bert has shown strong performance on many commonsense reasoning tasks [41, 83, 84, 89].

Visual Attention Analysis. We show visualizations of the visual attention learned by the VQA and VQR factors. Example visualizations of the sentence-level answer attention over the image are shown in Figure 5.8. In the first example, the model's attention focuses mainly on the surroundings of the object "stuff" mentioned in the answer. In the second example, the main attention is on the man's hand and the sheep, with some noisy attention on other people. In the last example, the answer's visual attention is correctly placed on the tower that the people are looking at. These examples suggest that the model learns where to focus and extracts useful visual features for answering the visual question.

Figure 5.8: Attention visualization for the VQA factor.

Similarly, Figure 5.9 shows, for pairs of questions and rationales, the corresponding attention maps the model learns. For example, in the third example the attention is mainly placed on the map and the arm, indicating a holding action.

Figure 5.9: Attention visualization for the VQR factor.

Causal Attention Analysis. We also visualize the attention mapping between the answer and the rationale. This correlation reflects what we call the causal relation, which our model treats as undirected. Here we visualize the attention from an answer to a rationale, meaning that we normalize the learned correlation matrix along the rationale direction, so that we can see which rationale words are important for each word in the answer. Example visualizations are shown in Figure 5.10. In the first example, the word "performing" correlates mainly with the words "higher pedestal" and "playing music". In the second example, almost all words in the answer correlate with the words "someone" and "hear".

Figure 5.10: Attention visualization for the AR factor.
Two characteristics we observe here: first, the correspondences between the same person names are not necessarily aligned in this process; second, the answer words tend to attend to far fewer words in the rationale sentence. One possible explanation is that the fine-tuned Bert representation is contextualized, so even identical words in different contexts may not have very similar vector representations.

5.4.4 Error Analysis

From the previous results, we can see that there is still a big gap between our results and human-level performance. To better understand where the errors come from, we analyze the prediction errors from two perspectives. First, we categorize the errors as follows. Note that we augment each original answer with 4 rationales, one of which is the correct rationale for that answer in another sample, so a wrong answer can still be "matched" by its borrowed rationale.

• A-R+: the predicted answer is wrong, but the predicted rationale is correct.

• A+R-: the predicted answer is correct, but the predicted rationale is wrong.

• A-R-: both the predicted answer and the predicted rationale are wrong, and they do not match each other.

• A-R-*: both the predicted answer and the predicted rationale are wrong, but they match each other.

The ratios of the different error types are shown in Figure 5.11. The most frequent error is A+R-, in which the answer prediction is correct while the rationale prediction is wrong. The least frequent error is A-R+, in which the answer prediction is wrong and the rationale is correct. Together, this suggests that inferring answers is easier than inferring rationales.

We also analyze the errors from the perspective of the factors. For each sample, let the gold answer-rationale pair be (a, r) and the predicted pair be (ap, rp). For each pair we have three factor scores, s_vqa, s_vqr and s_ar. Ideally, the scores of the gold pair should be larger than those of the predicted pair. For each score, we check whether this ordering holds, and we report the following three ratios:

• AR_MM: the ordering of the AR factor scores is mismatched.

• QR_MM: the ordering of the QR factor scores is mismatched.

• QA_MM: the ordering of the QA factor scores is mismatched.

The ratios of these three types of mismatches are shown in Figure 5.12. The most frequent mismatch type is AR_MM, which means the most challenging part of this task is learning the commonsense matching factor between the answer and the rationale. The figure also shows that the fewest mismatches occur for the QA factor, which coincides with the previous error-type analysis: the error on the answer is lower than the error on the rationale.

Figure 5.12: The inverse ratio of different factors.
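A minimal sketch of how these mismatch ratios can be computed from per-sample factor scores is given below; the field names are hypothetical and stand in for whatever structure stores the gold and predicted pairs' scores.

```python
# For each sample, compare the factor scores of the gold (a, r) pair against
# those of the predicted (ap, rp) pair; a mismatch occurs when the gold pair
# does not score higher on that factor.

def mismatch_ratios(samples):
    counts = {"QA_MM": 0, "QR_MM": 0, "AR_MM": 0}
    for s in samples:
        gold, pred = s["gold_scores"], s["pred_scores"]   # keys: "qa", "qr", "ar"
        if gold["qa"] <= pred["qa"]:
            counts["QA_MM"] += 1
        if gold["qr"] <= pred["qr"]:
            counts["QR_MM"] += 1
        if gold["ar"] <= pred["ar"]:
            counts["AR_MM"] += 1
    n = max(len(samples), 1)
    return {k: v / n for k, v in counts.items()}
```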
A quick error analysis shows that model performance correlates with the length of the ground-truth rationales, as shown in Table 5.5. Although the ground-truth length distribution is nearly uniform in the training dataset, our results show that the shorter a rationale is, the less likely the model is to pick it up. This suggests that providing more contextual information in the rationales may help improve model performance.

Table 5.5: Error rates among samples with different lengths of ground truth rationales.

Length       [0,10)   [10,20)   [20,30)   [30,40)   [40,+∞)
Error rate   0.758    0.631     0.530     0.470     0.455

Figure 5.11: The ratio of different error types.

5.5 Conclusion

In this work, we propose a joint learning setting for the visual question answering and rationale task. Compared with the previous R2C framework, we argue that it is better to learn the answer and the rationale jointly rather than in a two-step process. To adapt to this joint setting, we augment the original R2C dataset with more negative rationales to eliminate the dataset's answer bias. On the new dataset, we propose a pairwise deep factorization model to model the joint probability of the answer and the rationale given the image and the question. In essence, this joint task attempts to solve visual question answering and natural language causal inference together. Finally, we conduct comprehensive experiments and ablations to verify the effectiveness of the proposed method. However, the current best performance is still far from human performance, and our dataset analysis revealed extreme action sparsity. How to effectively mitigate this sparsity and introduce more external knowledge into the learning process will be an interesting direction for future work.

Chapter 6

Conclusion and Future Work

6.1 Conclusions

Humans do not learn to understand the world from linguistic descriptions or visual signals in isolation. We learn through a combination of different sources of signals from the surrounding world, which is also called multi-modal learning. Imagine how, as a child, you learned the action "pick up": it often came with a combination of visual demonstration and language instruction. Humans have a strong ability to synthesize information and learn from different sources, including text, vision and speech. In order to build agents that can truly understand human utterances, execute instructions and even generate explanations or rationales, the agents need to understand the grounded meanings of text by connecting it to the physical world. To achieve this goal, we investigate the problem of grounded language learning of physical actions by connecting low-level visual semantics with high-level linguistic semantics. At the same time, we also explore how to connect grounded rationales with action predictions.

• In Chapter 3, we propose a new task, grounded semantic role labeling, to bridge the gap between high-level structured linguistic semantics and low-level visual information such as object tracks and attributes. From the linguistic
From the linguistics perspec- tive, the semantics of actions/verbs are represented as frames with slots and values which 87 characterize key properties describing the action. These properties are called semantic roles including patient, location tool and so on. For each semantic role, we ground it to visual elements in the physical world: the objects in the video clips for our study. Besides explicit semantic roles mentioned in the linguistic descriptions, we also ground the implicit semantic roles which happens in the world but are not explicitly mentioned in the description because they are also very important for the agent to learn to understand and interpret the action. As shown in our experiments, the agent can have a better grounded understanding of the environment with the incorporation of the semantic information. • Chapter 4 discusses an approach that attempts to infer commonsense justifications for phys- ical actions in human-agent communication. To understand the action meaning, not only recognizing what is happening is important, but also providing commonsense evidence to support the decision the agent made. On one hand, it helps improves the trust between the human and the agent. On the other hand, it also helps improve the communication ground- ing during communication. We propose a generative modeling framework to jointly infer the action and the corresponding commonsense justification. We decompose explanations into relations and attributes, then model the evidence selection problem as a latent variable inference problem. Our empirical evaluations show that this joint inference model achieves better performance compared to previous competitive methods. Furthermore, as our latent variables are interpretable, we add the supervision to the latent variable and show that it ac- tually improves both the evidence selection and action prediction performance. Lastly, we design a human study to verify that our propose joint model helps improve communication grounding between humans and agents. • In Chapter 5, we focus on the problem of grounded action justification. We propose to solve 88 the joint visual commonsense reasoning task: given an image and a question, the goal is to select the answer and provide the corresponding rationale. A new factorized neural model is developed to better understand relations between the image, question, answer and rationales. Specifically, we decompose the problem into three small factors including Image-Question- Answer factor, Image-Question-Rationale factor and Image-Answer-Rationale factor. Ex- perimental results show that the factorized model achieves better accuracy. Besides, the comprehensive ablation study results show that different factors are essential for the final performance, and the Image-Answer-Rationale factor plays the most significant role for the final performance. 6.2 Future Directions This dissertation explores different approaches for grounded action understanding and justification through language communication. To extend these approaches to a variety of real world applica- tions, possible future works are described as follows: • Data efficient Learning for grounded semantic role labeling. Currently the supervised graphical model based method for grounded semantic role labeling requires a lot of labeled data which are time-consuming and expensive. One important future research direction is to explore semi-supervised methods to alleviate the burden of data labeling. 
For example, how can we effectively incorporate unlabeled data to improve generalization?

• Deep learning for grounded semantic role labeling. Recently, deep learning based methods have shown great potential for multi-modal problems, for several reasons: first, they can learn better visual representations and context-dependent word representations; second, deep architectures can better capture the interaction of information flowing across modalities. One possible research direction is to further improve grounded semantic role labeling by modeling both the visual context and the linguistic semantic roles as graphs and using graph-based neural networks to predict the target groundings.

• Incorporating commonsense knowledge for grounded action justification. Our current system's performance is still far from human performance on commonsense justification. We find that incorporating pre-trained representations such as Bert is greatly helpful for improving accuracy, since the pre-trained Bert embeddings capture commonsense knowledge from large-scale external data. A natural follow-up question is whether we can effectively incorporate an action knowledge base into our model to improve performance, for example, how to inject ConceptNet or VerbNet knowledge into the deep neural network.

• Grounded justification generation. Currently we mainly frame the grounded action justification problem as a ranking or classification task. Compared with language generation tasks, this choice makes evaluation easier and more reliable. But language generation is a more realistic setting, as it does not require providing specific candidates. A possible extension is therefore: given an image, a question and several answer candidates, how can we select the correct answer and generate natural language rationales justifying our prediction?

As more and more multi-media applications enter our daily lives, such as movies with captions and images with descriptions, it will become increasingly important to build intelligent agents that understand the world through multi-modal learning. Despite the efforts made in this dissertation, many important and interesting problems remain open. We believe that future research on this topic is of great value for making fundamental advances in AI.

BIBLIOGRAPHY

[1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.

[2] Y. Artzi and L. Zettlemoyer. Weakly supervised learning of semantic parsers for mapping instructions to actions. TACL, 1:49–62, 2013.

[3] C. F. Baker, C. J. Fillmore, and J. B. Lowe. The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 1, pages 86–90. Association for Computational Linguistics, 1998.

[4] F. Bashir and F. Porikli. Performance evaluation of object detection and tracking systems. In Proceedings of the 9th IEEE International Workshop on PETS, pages 7–14, 2006.

[5] O. Biran and K. McKeown. Human-centric justification of machine learning predictions. IJCAI, Melbourne, Australia, 2017.

[6] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[7] E. Bruni, N.-K. Tran, and M. Baroni. Multimodal distributional semantics. Journal of Artificial Intelligence Research (JAIR), 49:1–47, 2014.

[8] A. Chang, W. Monroe, M. Savva, C. Potts, and C. D. Manning. Text to 3D scene generation with rich lexical grounding. arXiv preprint arXiv:1505.06289, 2015.