USING SEMANTIC STRUCTURE OF THE DATA AND KNOWLEDGE IN QUESTION ANSWERING SYSTEMS

By Chen Zheng

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy

2022

ABSTRACT

Understanding and reasoning over natural language is one of the most crucial and long-standing challenges in Artificial Intelligence (AI). Question answering (QA) is the task of automatically answering questions posed by humans in a natural language form. It is an important criterion to evaluate the language understanding and reasoning capabilities of AI systems. Though machine learning systems for QA have shown tremendous success in language understanding, they still suffer from a lack of interpretability and generalizability, in particular when complex reasoning is required to answer the questions. In this dissertation, we aim to build novel QA architectures that answer complex questions using the explicit relational structure of the raw data, that is, text and image, and exploiting external knowledge. We investigate a variety of problems, namely answering natural language questions when the answer can be found in multiple modalities: 1) textual documents (Document-level QA), 2) images (Cross-Modality QA), 3) knowledge graphs (Commonsense QA), and 4) a combination of text and knowledge graphs. First, for Document-level QA, we develop a new technique, the Semantic Role Labeling Graph Reasoning Network (SRLGRN), via which the explicit semantic structure of multiple textual documents is used. In particular, based on semantic role labeling, we form a multi-relational graph that jointly learns to find cross-paragraph reasoning paths and answers multi-hop reasoning questions. Second, for the type of QA that requires causal reasoning over textual documents, we propose a new technique, the Relational Gating Network (RGN), that jointly learns to extract the entities and their relations to help highlight the important entity chains and find how those affect each other. Third, for the type of questions that require complex reasoning over language and vision modalities (Cross-Modality QA), we propose a new technique, Cross-Modality Relevance (CMR). This technique considers the relevance between textual tokens and visual objects by aligning the two modalities. Fourth, for answering questions based on given Knowledge Graphs (KG), we propose a new technique, the Dynamic Relevance Graph Network (DRGN). This technique is based on a graph neural network and re-scales the importance of the neighbor nodes in the graph dynamically by training a relevance matrix. The new neighborhoods trained by relevance help fill in the knowledge gaps in the KG for more effective knowledge-based reasoning. Fifth, for answering questions using a combination of textual documents and an external knowledge graph, we propose a new technique, the Multi-hop Reasoning Network over Relevant Commonsense Subgraphs (MRRG). The MRRG technique extracts the most relevant KG subgraph for each question and document and uses that subgraph combined with the textual content and question representations for answering complex questions. We improve the performance, interpretability, and generalizability on various challenging QA benchmarks based on different modalities. Our ideas have proven to be effective in multi-hop reasoning, causal reasoning, cross-modality reasoning, and knowledge-based reasoning.

To my grandfather Mr. Enke Zheng and my father Mr. Wanying Zheng.
ACKNOWLEDGEMENTS

The past years at Michigan State University have been an unforgettable experience for me. Surprisingly, I have done much exciting research on helping artificial intelligence to understand and reason over natural language. Luckily, in my four-and-a-half-year Ph.D. journey, I met many outstanding people who encouraged me, helped me, and supported me, including my advisor, lab partners, intern colleagues, friends, and most importantly, family. First and foremost, I would like to express my deepest appreciation to my advisor Dr. Parisa Kordjamshidi. Parisa is an awesome advisor who helped me in my pursuit of scientific research. I learned a lot from Parisa, not only how to conduct novel NLP research, but also how to be a high-standard researcher. I really appreciate her earnest and rigorous attitude toward scientific research. I appreciate that Parisa never limited my research direction. She always supported my ideas and motivated me to think deeply about the research challenges. I will always remember every detailed and meaningful comment and the inspiration from her, and I will never forget her helpful revisions on every sentence and paragraph of all my published papers on countless days and nights. In addition, I would like to thank my Ph.D. committee members, Dr. Jiayu Zhou, Dr. Kristen Johnson, and Dr. Taiquan Peng, for their valuable and insightful comments and suggestions in completing this dissertation. Next, I would like to thank my Heterogeneous Learning and Reasoning Lab. The HLR lab has witnessed the growth of my Ph.D. career at MSU. I would like to thank my lab partners, Yue Zhang (my most trusted and reliable friend throughout my Ph.D. career), Hossein Faghihi, Roshanak Mirzaee, Guangyue Xu, Drew Hayward, Juan Castro-Garcia, Darius Nafar, Danial Kamali, Sushanta K. Pani, Hamid Karimian and Elham Barezi. Here, I want to especially thank Dr. Quan Guo, who is a postdoc in our laboratory and also my roommate. Thank you for providing help not only for research but also for daily life. Before pursuing my Ph.D., I had never heard of natural language processing (NLP) and deep learning until I met my big brother, Dr. Shuangfei Zhai. He helped me open the door to how to conduct NLP research. Besides, even now, I am so impressed with the Stanford CS224d online open course taught by Dr. Richard Socher. I cannot overstate how much valuable knowledge of NLP and deep learning I gained in this class. I want to especially thank these two talented AI scientists who helped me get started with NLP and deep learning. I had three wonderful internships at the Baidu NLP group, the JD.com Information Retrieval group, and the Bytedance AML group. I would like to express my thankfulness to my mentors, Dr. Shengxian Wan and Yu Sun, at Baidu. This internship experience made me realize that NLP can benefit countless people's lives. Moreover, I would like to thank my mentors, Dr. Wen-yun Yang and Songlin Wang, at JD.com. This internship experience made me understand the challenge and importance of designing a robust retrieval system. I learned a lot and put many exciting ideas into the retrieval project. Furthermore, I would like to thank my mentors, Dr. Guokun Lai and Youlong Cheng, at Bytedance. This meaningful internship experience helped me to grow as a research scientist, putting scientific research ideas into the Tiktok system and benefiting people's lifestyles in many ways. I also would like to thank all my colleagues for their help on my intern projects.
I want to mention Jiaxiang Liu, Shuohuan Wang, Weichong Yin, Fei Yu, Xiangyang Zhou, Lu Li, Yan Zeng, Zhou Xin, Dianhai Yu (Baidu), Han Zhang, Kang Zhang, Shang Wang (JD.com), and Wumo Yan, Jimmy Kim, Wen Liang (Bytedance). Outside of NLP research, I am fortunate to be surrounded by many friends. I would like to thank my extremely kind piano teacher, Dante Li. She made me play the piano more sweetly and pleasingly and helped me to improve my piano level rapidly. I would like to thank Qiwen Sheng and Weiyang Yang for every wonderful trip and delicious dinner. I would like to thank Qian Song for accompanying me to walk along the Manhattan River with her lovely dog during the most depressing period of the COVID-19 situation. I would like to thank Yijun Zheng and Yuchong Chen; I witnessed their sweet love and their journey from being my two single roommates to a young married couple. I would like to thank Weixing Ren for keeping our friendship for more than fourteen years. Moreover, I would like to mention some of my best friends: Yue Zhang, Dong Chen, Meng Xu, Bin Wang, Lu Lu, Haochen Liu, Yang Zheng, Yubo Wang, Zhanwang Chen, Ying Cao, and Ruolei Xia.

Finally, I would like to express my deepest appreciation to my family. I have the warmest family in the world. I always thank you for your unconditional support. My father, Wanying Zheng, usually told me that they were proud of me. As the last sentences in the acknowledgment, I would like to say: This dissertation is dedicated to my family. I am proud to be the child of my family forever.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
1.1 Motivation
1.2 Challenges and Contributions of the Dissertation
1.3 Outline of the Dissertation
CHAPTER 2 BACKGROUND AND RELATED WORK
2.1 Background
2.1.1 Background of Transformer Architecture
2.1.2 Background of Graph Neural Networks
2.2 Related Work
2.2.1 Document-level QA
2.2.2 Cause-effect QA
2.2.3 Cross-Modality QA
2.2.4 Knowledge based QA
CHAPTER 3 MULTI-HOP REASONING FOR DOCUMENT-LEVEL QA
3.1 Background and Motivation
3.2 Semantic Role Labeling Graph Reasoning Network
3.2.1 Problem Formulation
3.2.2 Paragraph Selection
3.2.2.1 First Round Paragraph Selection
3.2.2.2 Second Round Paragraph Selection
3.2.3 Heterogeneous SRL Graph Construction
3.2.4 Graph Encoder
3.2.5 Supporting-Fact Prediction
3.2.6 Answer Span Prediction
3.2.7 Objective Function
3.3 Experiments
3.3.1 Dataset Description
3.3.2 Implementation Details
3.3.3 Baseline Models
3.4 Experimental Results and Analysis
3.4.1 Results
3.4.2 Model Analysis
3.4.3 Qualitative Analysis
3.5 Summary
CHAPTER 4 CAUSAL REASONING FOR DOCUMENT-LEVEL QA
4.1 Background and Motivation
4.2 Relational Gating Network
4.2.1 Problem Formulation
4.2.2 Entity Representations
4.2.3 Entity Gating
4.2.4 Relation Gating
4.2.5 Contextual Interaction Module
4.2.6 Output Prediction
4.3 Experiments
4.3.1 Dataset Description
4.3.2 Implementation Details
4.4 Results and Discussion
4.4.1 Result Comparison
4.4.2 Model Analysis
4.4.3 Qualitative Analysis
4.5 Summary
CHAPTER 5 RELATIONAL REASONING FOR CROSS-MODALITY QA
5.1 Background and Motivation
5.2 Cross-Modality Relevance
5.2.1 Problem Formulation
5.2.2 Representation Alignment
5.2.3 Entity Relevance
5.2.4 Relational Relevance
5.2.5 Training
5.3 Experiments
5.3.1 Dataset Description
5.3.2 Implementation Details
5.3.3 Baseline Description
5.4 Results and Discussion
5.4.1 Result Comparison
5.4.2 Model Analysis
5.4.3 Qualitative Analysis
5.5 Summary
CHAPTER 6 COMMONSENSE REASONING FOR KNOWLEDGE BASED QA
6.1 Background and Motivation
6.2 Dynamic Relevance Graph Network
6.2.1 Problem Formulation
6.2.2 Model Description
6.2.3 Language Context Encoder
6.2.4 KG Subgraph Construction
6.2.5 Graph Neural Network Module
6.2.6 Answer Prediction
6.3 Experiments
6.3.1 Dataset Description
6.3.2 Implementation Details
6.3.3 Baseline Description
6.4 Results and Discussion
6.4.1 Result Comparison
6.4.2 Model Analysis
6.4.3 Qualitative Analysis
6.5 Summary
CHAPTER 7 EXPLOITING COMMONSENSE KNOWLEDGE FOR DOCUMENT-LEVEL QA
7.1 Background and Motivation
7.2 Model Description
7.2.1 Candidate Triplet Extraction from KG
7.2.2 KG Attention
7.2.3 Commonsense Subgraph Construction
7.2.4 Reasoning over Document-level QA
7.2.5 Answer Prediction
7.2.6 Training Strategy
7.3 Experiments
7.3.1 Dataset Description
7.3.2 Implementation Details
7.3.3 Baseline Description
7.4 Results and Discussion
7.4.1 Result Comparison
7.4.2 Model Analysis
7.4.3 Qualitative Analysis
7.5 Summary
CHAPTER 8 CONCLUSION AND FUTURE DIRECTIONS
8.1 Summary of Contributions
8.2 Future Directions
8.2.1 Prompt Learning for Question Answering
8.2.2 Integration of Domain-Knowledge into Question Answering
BIBLIOGRAPHY

CHAPTER 1 INTRODUCTION

1.1 Motivation

Understanding and reasoning over natural language play a significant role in many real-world artificial intelligence applications. Question Answering (QA) is one of the most crucial problems in evaluating the understanding and reasoning over natural language text [41, 2]. Question answering is a computer science discipline within the fields of Natural Language Processing (NLP), Machine Learning, and Information Retrieval (IR), which is concerned with building systems that automatically answer questions posed by humans in a natural language. Nowadays, QA systems are widely used in many real-world applications, such as search engines (Google, Bing, Baidu), reading comprehension, and AI conversational systems (Alexa assistant). In this dissertation, we address different types of QA problems categorized into five classes. We use a simple and straightforward example of a "crying child" story shown in Figure 1.1 to introduce these five types of QA problems.

[Figure 1.1 The "Crying child" example of four categories of QA tasks. The figure pairs a short text ("The boy was sitting on the sofa and crying. I hugged him and asked about the issue and the reason why he was crying. He said he lost his favorite toy. I drove him to the gift store to buy him a new toy."), an image, and a small ConceptNet-style knowledge graph relating car, drive, driver, and transport (RelatedTo and MannerOf edges) with five questions: 1. Where was the child sitting when he was crying? Answer: Sofa (Document-level QA). 2. What did cause the child to cry? Answer: The child lost the toy (Causal-effect QA). 3. Where is the child sitting in the picture? Answer: Arm (Cross-modality QA). 4. Is the car used for transportation? Answer: Yes (Knowledge based QA). 5. Where was the child sitting when I drove him to the store? Answer: Car (Document-level & Knowledge based QA). Questions 1, 2, and 5 rely on the text, question 3 on the image, and question 4 on the knowledge graph.]

• Document-level Question Answering. The first question in Figure 1.1 shows a straightforward example of a Document-level QA task. Given the question "Where was the child sitting when he was crying?" and the text "The boy was sitting on the sofa and crying", the problem is to find the answer "sofa". This type of QA allows humans to ask questions based on the given document. The QA system requires reading and understanding the text, capturing the line of reasoning from the document, and answering the question effectively [93].

• Cause-effect Question Answering. The second question in Figure 1.1 shows a simple example of Cause-effect QA [107]. Given the question "What did cause the child to cry?", we extract the causal events "child cry" and "the child lost the toy" in the text and answer the question.
Cause-effect QA is a particular type of document-level QA, where the QA system should understand the cause and effect events and find explicit causal relationships between events.

• Cross-modality Question Answering. Cross-modality QA combines multidisciplinary fields, including language, vision, speech processing, etc. [104]. We select Visual Question Answering (VQA) as the Cross-modality QA task in this dissertation. The VQA task aims to answer a natural language question using an image. Vision-and-language reasoning requires the understanding of visual contents, language semantics, cross-modality alignments, and relationships between the two modalities [118]. Let us look back to the "crying child" story and the third question, "Where is the child sitting in the picture?" We cannot state the answer just based on the text. However, after providing the image, we can quickly know the correct answer, "arm".

• Knowledge based Question Answering. Commonsense knowledge reflects the natural understanding of the world and human behavior. Structured knowledge is another modality of resources that can be fed into QA systems for answering natural language questions [61]. This type of QA task aims to answer natural language questions utilizing a knowledge base or a knowledge graph. The fourth question in Figure 1.1, "Is the car used for transportation?", shows an example in which QA requires commonsense knowledge. In this case, QA systems should utilize commonsense like a human being about "car" → "used for" → "transportation".

• Combine Document-level and Knowledge-based QA. In some document-level QA scenarios, the contents included in a given text are sufficient to find the answer. However, there are many cases in which the required knowledge is not included in the text itself [125, 107]. The fifth question in Figure 1.1, "Where was the child sitting when I drove him to the toy store?", shows an example in which QA requires commonsense knowledge. Individuals can provide the answer "car" because the human has the commonsense knowledge "drive" → "car" in their mind.

Traditionally, building QA systems has relied on natural language processing (NLP) technologies as backbones, including semantic role labeling [38], named entity recognition [57], part-of-speech tagging [70], relation extraction [95], text matching [138], etc. Intuitively, an ideal QA system should be able to understand the meanings of the text and the semantic relations between questions, documents, and answers. Over the past decade, deep learning, a particular category of machine learning, has achieved great success in multiple real-world NLP tasks, especially in the Question Answering domain [87, 125, 4, 106, 103]. Specifically, a deep neural network is constructed from many neural layers. Each neural layer includes a massive number of "computational neurons" represented by scalars, tensors, and matrices. The neurons between two layers have connection edges, and the neural network propagates information via forward and backward directions. Deep learning QA architectures automatically extract contextual and semantic features by pre-training on large corpora and learn hundred-dimensional dense vectors to represent a word, a phrase, a sentence, or even a document. Based on these conceptually simple but empirically powerful language representations, large-scale language models (LMs), like BERT [24] and RoBERTa [65], have achieved success in many QA benchmarks.
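To give a concrete sense of how such pre-trained LMs are typically applied to extractive QA, the following is a minimal, generic sketch using the Hugging Face transformers library; it is not one of the systems developed in this dissertation, the checkpoint name is only a placeholder, and in practice a model fine-tuned on a QA dataset such as SQuAD would be needed for sensible predictions.

```python
# Minimal sketch: extractive QA with a pre-trained Transformer encoder.
# Assumes the Hugging Face `transformers` library; "bert-base-uncased" is a
# placeholder -- a QA-fine-tuned checkpoint is needed for meaningful answers.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

question = "What causes precipitation to fall?"
context = ("In meteorology, precipitation is any product of the condensation "
           "of atmospheric water vapor that falls under gravity.")

# Encode the (question, context) pair as a single token sequence.
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The QA head scores every token as a potential answer start/end position.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0][start:end + 1])
print(answer)  # ideally "gravity" when a fine-tuned checkpoint is used
```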
However, most of the current QA architectures directly utilize LMs to predict the answer but fall short of providing interpretable predictions. The semantic structures of the data and knowledge in the corpus are not explicitly stated but rather implicitly learned from a large corpus. It is thus difficult to create an explicit reasoning chain, capture high-order relations for the generalizability of reasoning, or establish the evidence used in the reasoning process. In other words, most of the existing deep learning QA works cannot track the explicit semantic relationships from various modalities, including textual documents, images (Cross-Modality QA), knowledge graphs, etc. In this dissertation, five critical challenges and contributions are addressed to make neural networks more effective for various QA tasks.

1.2 Challenges and Contributions of the Dissertation

[Figure 1.2 Two benchmark examples of the document-level Question Answering Task. The left side is the example of the SQuAD benchmark: Question: "What causes precipitation to fall?"; Paragraph: "In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow, graupel and hail."; Answer: gravity. The right side is the example of the HotpotQA benchmark: Question: "What team did the recipient of the 2007 Brownlow Medal play for?"; Paragraph 1, Title "2007 Brownlow Medal": 0. "The 2007 Brownlow Medal was the 80th year the award ... (AFL) home and away season." 1. "Jimmy Bartel won the medal by polling twenty-nine votes ..."; Paragraph 2, Title "Jimmy Bartel": 0. "James Ross Bartel (born 4 December 1983) is a former Australian rules footballer who played for the Geelong Football Club in the ..." 1. "A utility, 1.87 m tall and weighing 86 kg, Bartel is able ..."; Answer: Geelong Football Club; Support fact: ["2007 Brownlow Medal", 1], ["Jimmy Bartel", 0].]

Challenge 1: Multi-hop Reasoning for Document-level QA. Answering questions over long documents usually requires the ability to understand the entities and find their connections throughout the whole document to be able to reason over them in multiple steps. The left side of Figure 1.2 shows an example of one-hop document-level QA from the SQuAD benchmark [87]. Given the question "What causes precipitation to fall?" and a paragraph, we obtain the answer "gravity" after we read the sentence "Precipitation is any product of the condensation of atmospheric water vapor that falls under gravity". In contrast to one-hop question answering, where answers can be derived directly from a single paragraph [87, 86], many recent studies on question answering focus on multi-hop reasoning across multiple documents or paragraphs and aim to build multi-hop reasoning chains to capture the explicit semantic structure of the documents. In Section 2.2.1, we overview the related work about these recent studies in more detail. Even after successfully identifying a reasoning chain through multiple paragraphs, it remains a critical challenge to collect evidence from different granularity levels (e.g., paragraphs, sentences, entities) for jointly predicting the answer and lines of reasoning. The right side of Figure 1.2 shows an example of a complicated multi-hop reasoning QA task from the HotpotQA benchmark [125]. Given the question "What team did the recipient of the 2007 Brownlow Medal play for?" and ten documents, we should find two sub-questions: 1) Who won the medal? 2) Where is this person playing?
Then we identify the two relevant documents and extract the reasoning chain: "2007 Brownlow Medal" → "Jimmy Bartel won the medal" → "Jimmy Bartel played for Geelong Football Club". Finally, we obtain the answer "Geelong Football Club" by the above line of reasoning.

Contribution: To solve the first challenge, we utilize semantic role labeling to extract the semantic structure of the sentences. Semantic role labeling provides the semantic structure in terms of argument-predicate relationships [38], such as "who did what to whom." We innovatively construct a graph with entities and multiple relational edges from documents using semantic role labeling (SRL). We connect the SRL graphs using shared entities. This semantic structure of multiple documents can significantly improve the multi-hop reasoning capacity to find the line of reasoning to answer the questions. Then we use a graph neural model as the backbone to learn the graph node representations. We jointly train a multi-hop supporting fact prediction module that finds the cross-paragraph reasoning path, and an answer prediction module that obtains the final answer. Our experiments show that using the semantic structure of the document is effective in finding the cross-paragraph reasoning path and answering the question.

[Figure 1.3 Two examples of the Cause-effect Question Answering Task. Document: "1. A frog lays eggs in the water. 2. Tadpoles develop inside of the eggs. 3. The eggs hatch. 4. The tadpoles eat and grow. 5. The tadpoles grow legs and form into frogs. 6. The frogs leave the water." Questions: "1. Suppose tadpoles eat more food happens, how will it affect frogs? (A) More (B) Less (C) No effect" and "2. Suppose the weather is unusually bad happens, how will it affect the tadpoles that will need food? (A) More (B) Less (C) No effect".]

Challenge 2: Causal Reasoning for Document-level QA. Cause-effect QA is a special type of Document-level QA. Causal reasoning requires the machine to effectively extract the explicit causal relationships between cause and effect events (entities) over the entire document. For example, predicting that a "sunny day" results in the direct effect "sunshine" is less challenging than predicting its indirect effect on "photosynthesis". Figure 1.3 shows an example of a cause-effect QA task from the WIQA benchmark [107]. Given the procedural story and the question "Suppose tadpoles eat more food happens, how will it affect more frogs?", the following line of causal reasoning should be extracted from the text: "tadpole eat" → "tadpole grow", and "tadpole grow" → "tadpole form into frog". In Section 2.2.1, we overview the related work about finding the line of causal reasoning and its limitations in more detail.

Contribution: To solve the second challenge, we aim to find relations between entities and the line of causal reasoning. Concretely, we build an entity gating module to extract and filter the involved entities in the question and context.
Furthermore, we design a relation gating module with an alignment of entities to capture the higher-order chain of causal reasoning based on pairwise relations. Moreover, we propose an efficient module, called the contextual interaction module, to incorporate cross-information from question and content interactions during training in an efficient way and to help align the entities.

[Figure 1.4 Two benchmark examples of the Cross-Modality Question Answering task. The left side is an example of the VQA benchmark (the question "Where is the child sitting?" about an image, with candidate answers such as "Fridge" and "Arm"), while the right side is an example of the NLVR benchmark (the statement "The left image contains twice the number of dogs as the right image, and at least two dogs in total are standing" paired with two images, to be judged true or false).]

Challenge 3: Relational Reasoning for Cross-Modality QA. In cross-modality QA, we require an understanding of both language and vision modalities and their connections, and we need to reason over them to be able to answer the questions. One line of research addresses this challenge by learning representations for cross-modality data and enabling reasoning for target tasks. This is done by the alignment of the representations for the multiple modalities. Researchers develop models by training features and aligning representations using Transformer architectures as the backbone [104]. However, these approaches have well-known issues for robust joint representations and reasoning for cross-modality QA [59]. In Section 2.2.3, we describe the related work about these approaches in detail. Explicit modeling of entities and relations in the neural model is one key factor that can alleviate this problem but is less explored.
The right side of Figure 1.4 shows an example of a cross-modality QA task from the NLVR benchmark [101]. Given two pictures and the statement, "The left image contains twice the number of dogs as the right image, and at least two dogs in total are standing," we should know the number of standing dogs in the left image and the right image and reason over the "twice" relation.

Contribution: To solve the third challenge, we aim to explicitly ground the entities as well as their relationships from the language modality into the vision modality. We propose a novel Cross-Modality Relevance (CMR) architecture that considers the relevance between textual token representations and visual object representations by explicitly aligning them in the two modalities. The relevance metric between the two modalities is shown to be helpful for aligning multiple modality spaces in our work. We model the higher-order relational relevance for the generalizability of reasoning between entity relations in the text and object relations in the image.

[Figure 1.5 Two benchmark examples of the document-level Question Answering Task. The left side is the example of the WikiQA benchmark (Q: "What is the largest island in the world?" A: "Greenland"), while the right side is the example of the CommonsenseQA benchmark ("The student practiced his guitar often, where is he always spent his free period?" A. Music room B. Toy store C. Concert).]

Challenge 4: Commonsense Reasoning for Knowledge based QA. Knowledge base QA is the task of answering questions given a structured source of knowledge, e.g., a Knowledge Graph (KG). However, this task is challenging because, firstly, the given KG is often very large, and secondly, the answer cannot be directly retrieved; instead, multiple steps of reasoning over the KG are needed to obtain the answer. The common approach for solving this problem is to extract a subgraph that is relevant to the question [30]. However, the challenge is that the extracted KG subgraph sometimes misses some edges between entities, which breaks the chain of reasoning. This issue can be due to two possible scenarios. First, the KG is originally imperfect and does not include all the required edges. Second, when constructing the subgraph, some intermediate concept (entity) nodes and edges are omitted [30]. In such cases, the subgraph does not contain a complete chain of reasoning. The right side of Figure 1.5 shows an example from the CommonsenseQA benchmark. Given the question "The student practiced his guitar often, where is he always spent his free period?", the model should understand "free period" in the question and exploit the line of knowledge reasoning: "guitar" → "instrument (missing)" → "music room".

Contribution: To solve the fourth challenge, we aim to recover missing edges in the KG that are needed for finding the line of reasoning and answering the questions. We use ConceptNet [97], a general-domain knowledge graph, as the commonsense KG. The ConceptNet graph has multiple semantic relational edges, e.g., HasProperty, IsA, AtLocation, etc. We extract the entities and retrieve related external knowledge from the KG. Then, we construct a KG subgraph as part of the QA model to help fill the knowledge gaps and perform multi-hop reasoning. We propose a novel Dynamic Relevance Graph Network (DRGN) that learns the node representations while exploiting
the existing edges in the KG and establishes new direct edges between graph nodes based on the relevance scores. As a byproduct, our model improves the handling of negative questions by deeply considering the relevance between the question node and the graph entities.

[Figure 1.6 One example of Exploiting Commonsense Knowledge for Document-level QA. Question: "Suppose the soil is rich in nutrients happens, how will it affect seeds are produced. (A) More (B) Less (C) No effect". Document: "1. A plant produces a seed. 2. The seed falls to the ground. 3. The seed is buried. 4. The seed germinates. 5. A plant grows. 6. The plant produces flowers. 7. The flowers produce more seeds." Commonsense Knowledge: "Nutrient is used for plant growth."]

Challenge 5: Exploiting Commonsense Knowledge for Document-level QA. Sometimes, answering questions over documents not only requires finding the line of reasoning in the whole document but also exploiting external knowledge to help complete the document-level reasoning chain. However, the challenge is effectively extracting the most relevant external information and reducing the noise from the large KG. Irrelevant and noisy knowledge from the KG will seriously mislead the QA model in predicting the answer. Few sophisticated techniques have been proposed for using external knowledge explicitly in Document-level QA tasks [107, 106]. Figure 1.6 shows an example of Exploiting Commonsense Knowledge for Document-level QA. Given the question, "Suppose the soil is rich in nutrients happens, how will it affect seeds are produced", the model should understand "A plant produces a seed", and exploit the external knowledge "Nutrient is used for plant growth" to fill in the knowledge gap between the question and the text and find the answer.

Contribution: To solve the fifth challenge, we aim to effectively learn to find the most relevant KG subgraph in a given large KG and combine that with the document-level information to answer the questions. We propose a Multi-hop Reasoning Network over Relevant Commonsense Subgraphs (MRRG) architecture that extracts the entities from the document and learns to retrieve the relevant external knowledge from the KG using a novel KG attention neural mechanism [137]. Then, we construct a KG subgraph and use it as a part of the document-level QA model to help perform multi-hop reasoning and find the answer.

1.3 Outline of the Dissertation

The rest of this dissertation is organized as follows:

• In Chapter 2, following the Introduction chapter, we describe the background and related works about document-level QA, cause-effect QA, cross-modality QA, and knowledge-based QA.

• In Chapter 3, we present our work on multi-hop reasoning for Document-level QA. We describe our Semantic Role Labeling Graph Reasoning Network (SRLGRN) for solving the multi-hop reasoning challenge. We clarify how it exploits the semantic structure of multiple documents based on semantic role labeling models and forms a novel multi-relational graph. We evaluate our SRLGRN architecture on the HotpotQA and SQuAD benchmarks. The experimental results and analysis indicate the effectiveness of SRLGRN on the Document-level QA task.
• In Chapter 4, we present our work on causal reasoning for Document-level QA. We describe an end-to-end Relational Gating Network (RGN) to solve the causal reasoning challenge. We clarify how it finds explicit causal relationships between entities to facilitate causal reasoning over the whole document. We evaluate the model performance on the WIQA benchmark. The analysis demonstrates the effectiveness of the proposed entity gating module, relation gating module, and contextual interaction module in the RGN model.

• In Chapter 5, we present our work on relational reasoning for Cross-Modality QA. We describe a Cross-Modality Relevance (CMR) architecture to solve the challenges of cross-modality QA by learning and reasoning over vision and text. CMR considers the relevance between textual token representations and visual object representations by explicitly aligning them in the two modalities. We model the higher-order relational relevance for reasoning between entity relations in the text and object relations in the image. We evaluate the proposed CMR architecture on the NLVR and VQA benchmarks. Moreover, we perform a detailed analysis of our CMR approach to show the effectiveness of entity relevance and relational reasoning.

• In Chapter 6, we present our work on commonsense reasoning for Knowledge based QA. We describe a novel Dynamic Relevance Graph Network (DRGN) to solve the commonsense reasoning challenge by exploiting the existing relations in the KG and re-scaling the importance of the neighbor nodes in the graph based on training a dynamic relevance matrix. Our proposed approach shows competitive performance on two QA benchmarks, CommonsenseQA and OpenbookQA. The experimental results and analysis demonstrate that our DRGN model facilitates answering complex questions that need multiple hops of knowledge reasoning.

• In Chapter 7, we present our work on knowledge reasoning for document-level QA. We describe the Multi-hop Reasoning Network over Relevant Commonsense SubGraphs (MRRG) to solve the knowledge reasoning challenges by exploiting an external knowledge subgraph that contains the most relevant information extracted from a large KG using a novel KG attention neural mechanism. We evaluate our model on the WIQA benchmark. The experimental results and analysis indicate that our MRRG model helps in filling the knowledge gaps between the question and the document and performing reasoning over the extracted knowledge.

• In Chapter 8, we draw the conclusions of the dissertation and discuss several future directions.

CHAPTER 2 BACKGROUND AND RELATED WORK

In this chapter, we first provide a background of the Transformer architecture and Graph Neural Networks, which are the two main architectural components that we use in our neural models for QA systems. Then we introduce the related work about document-level QA, cause-effect QA, cross-modality QA, and knowledge based QA.

2.1 Background

2.1.1 Background of Transformer Architecture

The Transformer architecture is a stacked self-attention model for learning effective natural language features [111]. The Transformer has been shown to achieve extraordinary success in natural language processing, not only for better performance but also for efficiency due to its parallel computations. The Transformer architecture uses the self-attention mechanism and multi-head attention as two key components to extract each token's feature by learning from the features of all the other tokens, trained on huge natural language corpora [24, 124, 83].
Self-attention is the "soul" mechanism to learn token representations based on the scaled dot-product operator. The input of self-attention consists of queries $Q$ of dimension $d_q$, keys $K$ of dimension $d_k$, and values $V$ of dimension $d_v$. The self-attention process is computed as follows:

$$\mathrm{SelfAttn}(Q, K, V) = \mathrm{softmax}\Big(\frac{QK^{\top}}{\sqrt{d_k}}\Big)V, \quad (2.1)$$

where $QK^{\top}$ is the dot-product operation between the queries and keys. Multi-head attention is the "brain" module of the Transformer architecture, allowing for attending to parts of the sequence differently and running the self-attention mechanism $h$ times in parallel. Then, all the single-head attention outputs are combined together to obtain the integrated attention output. The multi-head attention is computed as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\,W^{O}, \quad (2.2)$$

$$\mathrm{head}_i = \mathrm{SelfAttn}(QW_i^{Q}, KW_i^{K}, VW_i^{V}), \quad (2.3)$$

where $W^{O}$, $W^{Q}$, $W^{K}$, and $W^{V}$ are learnable parameter matrices.

In recent years, Bidirectional Encoder Representations from Transformers (BERT) [24] and the Robustly Optimized BERT Pretraining Approach (RoBERTa) [65] have been proposed and widely deployed in countless natural language processing tasks, especially in Question Answering [74, 139, 125]. We take the BERT architecture as an example. BERT utilizes a bidirectional self-attention Transformer as the backbone to learn pre-trained deep bidirectional representations that consider both left and right contexts from a large-scale unlabeled corpus. Moreover, to better pre-train contextualized representations, the BERT architecture employs two unsupervised tasks, Masked Language Model and Next Sentence Prediction. Furthermore, BERT uses an English Wikipedia corpus that contains 2,500 million words and BooksCorpus [142], which contains 800 million words, to train the architecture. However, the obstacle of BERT is its memory limitation because of millions or billions of parameters. To address this issue, ALBERT [55] utilizes two technologies, factorized embedding parameterization and cross-layer parameter sharing, to lower memory consumption and increase the training speed of BERT. Researchers have also extended Transformers with both textual and visual modalities [59, 102, 104, 99, 109]. Sophisticated pre-training strategies were introduced to boost the performance [104].

2.1.2 Background of Graph Neural Networks

The Graph Neural Network (GNN), which generalizes deep learning to structured graphs, has attracted increasing attention and gained valuable significance in various Natural Language Processing tasks, including Question Answering, Machine Reading Comprehension, etc. Graph Neural Networks can effectively learn robust representations from nodes, edges, and relations between the nodes in the structured graph [140, 37, 129]. Graph neural networks follow two types of approaches, which are spectral graph approaches and spatial graph approaches [23, 140, 52, 112]. Spectral graph approaches learn the spectral representation of the graphs. Spectral graph methods commonly use the graph Fourier transform and graph convolution operator in the spectral domain [140]. The graph convolutional network (GCN) [52] is a classic multi-layer graph neural network and a typical spectral graph approach. For each layer of GCN, the node representations capture the information of their neighborhood nodes and edges via message passing and the graph convolutional operation. R-GCN is a variation of GCN that deals with the multi-relational graph structure [90].
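To make the two building blocks described in Section 2.1 concrete, the following minimal NumPy sketch illustrates the scaled dot-product and multi-head attention of Eqs. (2.1)–(2.3) and the neighborhood aggregation of a single GCN layer (using the standard symmetric normalization of Kipf and Welling). This is only an illustrative sketch, not the implementation used in the models of this dissertation; all names and shapes are chosen for readability.

```python
# Illustrative NumPy sketches of scaled dot-product / multi-head attention
# (Eqs. 2.1-2.3) and one GCN layer. Not the dissertation's implementation.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Scaled dot-product attention, Eq. (2.1): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """Multi-head attention, Eqs. (2.2)-(2.3): h parallel heads, then Concat(.) W_O.

    W_q, W_k, W_v are lists of per-head projection matrices; W_o is the output
    projection applied to the concatenated heads.
    """
    heads = [self_attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

def gcn_layer(A, H, W):
    """One GCN layer: aggregate neighbor features with the normalized adjacency.

    A: (n, n) adjacency matrix, H: (n, d) node features, W: (d, d') weights.
    """
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))    # D^{-1/2}
    return np.maximum(0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)  # ReLU
```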
The Adaptive Graph Convolution Network (AGCN) [60] learns the underlying relations and the residual graph Laplacian to improve spectral graph performance. Meanwhile, some variants of GCN replace the graph Fourier transform with other transform formats. For example, the Graph Wavelet Neural Network (GWNN) [123] applies the graph wavelet transform to the graph and achieves better performance compared to the graph Fourier transform in some tasks. Spatial graph approaches learn spatial graph representations based on the graph topology and utilize the spatial information of the nodes directly [140, 7, 33]. For example, the graph attention network (GAT) [112] uses the graph attention layer and multi-head attention mechanism (like the Transformer) on spatial graphs to learn the node representations efficiently. In particular, the graph attention layer consists of three components: (1) Linear Transformation, which applies a learned parameter matrix to the feature vectors of the nodes; (2) Computation and Normalization of Attention Coefficients, which determines the relative importance of neighboring features to each other; and (3) Computation of Final Output Features, which applies a non-linear transformation to generate the output node representations.

2.2 Related Work

In this dissertation, we address different types of QA that are categorized into four classes, including document-level QA, cause-effect QA, cross-modality QA, and knowledge based QA. In the following subsections, we describe the relevant benchmarks for each QA class and point to the related published QA architectures.

2.2.1 Document-level QA

Many QA tasks have been proposed to evaluate the language understanding capabilities of machines [87, 47, 28]. These tasks are single-hop QA and consider answering a question given only one single paragraph. These single-hop QA benchmarks, such as TriviaQA [47] and SearchQA [28], and MRC datasets, like SQuAD [87], rarely require complex reasoning (i.e., a chain of reasoning) to obtain the answer. In recent years, several multi-hop QA datasets, such as WikiHop [120] and HotpotQA [125], were proposed. They provide both multiple paragraphs and the ground-truth line of reasoning from question to answer. Those QA datasets require a multi-hop reasoning model to learn the cross-paragraph reasoning paths to predict the correct answer. Primary studies prefer to use a neural retriever model and a neural reader model to solve the challenges of document-level multi-hop QA tasks [74, 139, 125]. First, they use a neural retriever model to find the relevant paragraphs to the question. After that, a neural reader model is applied to the selected paragraphs for answer prediction. Although these approaches obtain promising results, the performance of evaluating multi-hop reasoning capability is unsatisfactory [74]. Recently proposed multi-hop QA models [110, 122, 29] utilize the semantic structures of the data and construct a semantic graph in different ways. For example, Coref-GRN [25] utilizes co-reference resolution to build an entity graph. MHQA-GRN [96] is an updated version of Coref-GRN that adds sliding windows. Entity-GCN [13] builds the graph using entities and different types of edges called match edges and complement edges. DFGN [122] and SAE [110] construct an entity graph through named entity recognition (NER).
Besides, some QA research works construct an entity graph using Spacy (https://spacy.io) or Stanford CoreNLP [71] and then apply a graph model to infer the entity path from the question to the answer [14, 122, 16, 29]. In contrast to the models mentioned above, our SRLGRN replaces entity-based graphs with semantic role labeling graphs to take the semantic structure of the sentences into account. Semantic role labeling provides the semantic structure of the sentence in terms of argument-predicate relationships [141, 105, 72, 39, 38], such as "who did what to whom." In Chapter 3, our SRLGRN model utilizes a graph convolutional network [52] as the backbone to learn the representations of the SRL graph, find the cross-paragraph reasoning path, and answer the question.

2.2.2 Cause-effect QA

Cause-effect Question Answering is a particular type of QA that aims to find relations between entities and the line of causal reasoning. Several new QA benchmarks were created in recent years for this purpose [20, 21]. In particular, the WIQA benchmark [107] was proposed, which aims to solve the so-called "what ... if" kind of questions; it contains multi-hop causal reasoning and commonsense reasoning, making the task more challenging. Multiple previous works achieved impressive performance by modeling cause-and-effect entity representations on cause-effect QA [69, 44, 5, 107]. For example, the REM-Net [44] architecture refines the evidence by a recursive memory mechanism and then uses a generative model to predict the answer. The Logic-Guided model [5] uses logic rules, including symmetry and transitivity rules, as regularization during training to impose consistency between the answers to multiple questions. However, these QA models fail to answer the questions when causal reasoning is required [27]. Therefore, we propose the Relational Gating Network (RGN) described in Chapter 4. RGN finds the line of causal reasoning and relations using entity gating and relation gating modules to solve the causal reasoning challenge. Moreover, there are many cases in which the required knowledge for answering the question is not included in the document itself [107]. In other words, answering questions over documents not only requires finding the line of reasoning in the whole document, but also exploiting external knowledge to help complete the Document-level reasoning chain. EIGEN [69] builds an event influence graph based on a document and leverages LMs to create the chain of reasoning to predict the answer. However, EIGEN cannot deal with the challenge when the required knowledge is not in the given document. To address this challenge, we propose the MRRG architecture, described in Chapter 7, that captures the entities from the document and extracts external knowledge from the KG.

2.2.3 Cross-Modality QA

Real-world problems often involve data from multiple modalities and resources. Learning and decision-making based on natural language and visual information have attracted the attention of many researchers because they expose many exciting research challenges to the AI community. Solving a problem at hand usually requires reasoning about the components across all the involved modalities [62, 54, 46]. The VQA benchmark [4, 36] contains open-ended questions about images that require an understanding of and reasoning about language and visual components. In addition, the natural language visual reasoning (NLVR) [100, 101] benchmark was proposed, which asks models to determine whether a sentence matches the image.
Moreover, VQACP [1] was proposed to evaluate the capacity of language and visual understanding. Besides, several datasets contain extensive visual information such as bounding boxes, labels, etc., e.g., Flickr30k [81] and Visual Genome [54]. In addition, some visual Question Answering tasks aim to learn visual relation facts with a rich structure, such as FVQA [116], R-VQA [67], and KVQA [92]. The Video Question Answering task is a special type of visual Question Answering. Several related benchmarks were published, such as PororoQA [75], Social-IQ [130], TVQA [56], and MovieQA [114]. There are several challenges in learning and reasoning over cross-modality QA, including understanding visual contents, language semantics, and relationships between the two modalities [45, 80, 82]. Researchers develop models by learning the joint features using Transformer architectures [59, 104]. For instance, VisualBERT [59] consists of Transformer layers that align textual and visual representation spaces with self-attention. LXMERT [104] aims to learn cross-modality encoder representations from a cross-Transformer architecture. Besides, LXMERT pre-trains the architecture with a large number of image-sentence pairs, via five diverse representative pre-training tasks. Moreover, contrastive learning positively influences learning robust joint representations for two modalities [82]. On the vision side, contrastive loss brings visual representations of two similar images closer together while distinguishing the representations of two dissimilar images [68, 50]. On the language side, contrastive loss makes two language representations closer [34, 117]. However, those approaches do not consider relational reasoning [82]. In contrast to these methods, we propose a novel Cross-Modality Relevance (CMR) architecture in Chapter 5 that exploits the textual and visual entities and relations and finds their relevance in the two modalities for learning joint representations. In addition, we model the higher-order relational relevance for aligning not only textual/visual entities but also the relations between them in the text and image.

2.2.4 Knowledge based QA

Structured knowledge is another type of modality that can be fed into QA systems for answering natural language questions. Some benchmarks for QA systems that provide structured sources of knowledge were published in recent years [26], such as QALD [15], WebQuestions [9], SimpleQuestions [11], and KBQA [19]. To answer the questions in these benchmarks, the structured sources of explicit knowledge are different from each other. Specifically, WebQuestions and SimpleQuestions contain questions that can be answered using Freebase [10], while QALD uses DBpedia [8] as the knowledge source. Meanwhile, CommonsenseQA [103] and OpenbookQA [73] are two benchmarks focusing on commonsense question answering that require external knowledge provided in ConceptNet [97]. However, current QA models cannot effectively utilize the KG's information [30] and mostly rely on implicit knowledge stored in large language models [24, 30]. The reason is that the existing KGs are usually large and contain many nodes that are irrelevant to the question and text. Moreover, with larger KGs, the computational complexity of learning over them will increase. To deal with this issue, pruning KG nodes based on a variety of metrics has been proposed [23, 140, 112, 37, 129].
Moreover, GraphTransformer[53] and QAGNN [127] include the sentence node in the graph, while HGN [29] and SRLGRN [134] add the paragraph node and sentence node to construct a hierarchical graph structure. However, the extracted KG subgraph sometimes misses some edges between entities, which breaks the chain of reasoning [136]. To solve this challenge, in contrast to the models mentioned above, we proposed a Dynamic Relevance Graph Network (DRGN), described in Chapter 6, that learns the node representations 18 while exploiting the existing edges in KG and establishes new direct edges between graph nodes based on the relevance scores. 19 CHAPTER 3 MULTI-HOP REASONING FOR DOCUMENT-LEVEL QA 3.1 Background and Motivation Understanding and reasoning over natural language play a significant role in Machine Reading Comprehension (MRC) and Question Answering (QA). Several types of QA tasks have been proposed in recent years to evaluate the language understanding capabilities of machines [87, 47, 28]. However, most of the current benchmarks focus on simple single-hop QA problems over a single paragraph. Many existing neural models rely on learning context and type-matching heuristics [119]. Those rarely build reasoning modules but achieve promising performance on single-hop QA tasks. The main reason is that these single-hop QA tasks lack an in-depth evaluation of the reasoning capabilities of the learning models because they do not require complex reasoning. Question 430: What team did the recipient of the 2007 Brownlow Medal play for? Paragraph 1: Title: "2007 Brownlow Medal" 0. “The 2007 Brownlow Medal was the 80th year the award … (AFL) home and away season." 1. “Jimmy Bartel won the medal by polling twenty-nine votes ..." Paragraph 2: Title: "Jimmy Bartel" 0: “James Ross Bartel (born 4 December 1983) is a former Australian rules footballer who played for the Geelong Football Club in the …" 1: "A utility, 1.87 m tall and weighing 86 kg , Bartel is able …" Paragraph 10: Title: "2005 Brownlow Medal" 0: "The 2005 Brownlow Medal was the 78th year the award …" 1: "Ben Cousins of the West Coast Eagles won the medal …" Answer: Geelong Football Club Support fact: ["2007 Brownlow Medal", 1], ["Jimmy Bartel", 0] Figure 3.1 An example of HotpotQA data. 20 Recently multi-hop QA benchmarks, such as HotpotQA [125] and WikiHop [120], have been proposed to assess the multi-hop reasoning ability of the learning models. HotpotQA task provides annotations to evaluate document-level question answering and finding supporting facts. Providing supervision for supporting facts improves the explainability of the predicted answer because they clarify the cross-paragraph reasoning path. Due to the requirement of multi-hop reasoning over multiple documents with strong distractions, multi-hop QA tasks are challenging. Figure 3.1 shows an example of HotpotQA. Given a question and 10 paragraphs, only paragraph 1 and paragraph 2 are relevant. The second sentence in paragraph 1 and the first sentence in paragraph 2 are the supporting facts. The answer is “Geelong Football Club”. Primary studies in HotpotQA task prefer to use a reading comprehension neural model [74, 139, 125]. First, they use a neural retriever model to find the relevant paragraphs to the question. After that, a neural reader model is applied to the selected paragraphs for answer prediction. Although these approaches obtain promising results, the performance of evaluating multi-hop reasoning capability is unsatisfactory [74]. 
To solve the multi-hop reasoning problem, some previous models tried to construct an entity graph using Spacy1 or Stanford CoreNLP [71] and then applied a graph model to infer the entity path from question to the answer [14, 122, 16, 29]. However, these models ignore the importance of the semantic structure of the sentences and the edge information and entity types in the entity graph. To take the in-depth semantic roles and semantic edges between words into account, here we use semantic role labeling (SRL) graph as the backbone of a graph convolutional network. Semantic role labeling provides the semantic structure of the sentence in terms of argument-predicate relationships [38]. The argument-predicate relationship graph can significantly improve the multi- hop reasoning results. Our experiments show that SRL is effective in finding the cross-paragraph reasoning path and answering the questions. Our proposed Semantic Role Labeling Graph Reasoning Network (SRLGRN) jointly learns to find cross-paragraph reasoning paths and answer questions on multi-hop QA. In the SRLGRN 1 https://spacy.io 21 model, firstly, we train a paragraph selection module to retrieve gold documents and minimize distractors. Second, we build a heterogeneous document-level graph that contains sentences as nodes (question, title, and sentences) and SRL sub-graphs, including semantic role labeling arguments as nodes and predicates as edges. Third, we train a graph encoder to obtain the graph node representations that incorporate the argument types and the semantics of the predicate edges in the learned representations. Finally, we jointly train a multi-hop supporting fact prediction module that finds the cross-paragraph reasoning path and answers prediction module that obtains the final answer. Notice that both supporting fact prediction and answer prediction are based on contextual semantics graph representations as well as token-level BERT pre-trained representations. The contributions of this work are as follows: 1) We propose the SRLGRN framework that considers the semantic structure of the sentences in building a reasoning graph network. Not only the semantics roles of nodes but also the semantics of edges are exploited in the model. 2) We evaluate and analyze the reasoning capabilities of the semantic role labeling graph compared to usual entity graphs. The fine-grained semantics of the SRL graph help in both finding the answer and the explainability of the reasoning path. 3) Our proposed model obtains competitive results on both HotpotQA (Distractor setting) and the SQuAD benchmarks. 3.2 Semantic Role Labeling Graph Reasoning Network Our proposed SRLGRN approach is composed of Paragraph Selection, Graph Construction, Graph encoder, Supporting Fact prediction, and Answer Span prediction modules. Figure 3.2 shows the proposed architecture. In this section, we introduce our approach in detail and then explain how to train it with an efficient algorithm. 
22 Paragraph Selected Graph Construction Input Prediction Selection Paragraphs & Graph Encoder Sentence 1 7 95 .55 ./5 .6 .4 6 Nodes Gr ap hS en te n Question Add & Norm ces Argument Feed Forward Nodes G Support Fact ra Prediction ph Add & Norm A Question rg Se nte Paragraph 1 Paragraph 1 um nc en es Paragraph 2 MUL Paragraph 2 Context Encoder ts ATT Paragraph 3 Sentence … q q ATT ATT [CLS] Emb q k v Answer Span Add & Norm Feed Forward Add & Norm Add & Norm Feed Forward Add & Norm Paragraph n-1 k … k ns Prediction ke Paragraph n To MUL MUL v v Token Emb Figure 3.2 Our proposed SRLGRN model is composed of Paragraph Selection, Graph Construction, Graph Encoder, Supporting Fact prediction, and Answer Span prediction. 3.2.1 Problem Formulation Formally, the problem is to predict supporting fact H ( and answer span H 0=B given input question @ and candidate paragraphs. Each paragraph content C = {C, B1 , . . . , B= } includes title C and several sentences {B1 , . . . , B= }. 3.2.2 Paragraph Selection Most of the paragraphs are distractors in the HotpotQA task [125]. SRLGRN can select gold documents and minimize distractors from given # documents by a Paragraph Selection module. The Paragraph Selection is based on the pre-trained BERT model [24]. Our Paragraph Selection module has two phases explained in section 3.2.2.1 and section 3.2.2.2. 3.2.2.1 First Round Paragraph Selection For every candidate paragraph, we take the question @ and the paragraph content C to form the text input [[⇠ !(]; @; [(⇢ %]; C], where [CLS] and [SEP] are the BERT special tokens in the tokenizer process [24]. We form the input and feed it into a pre-trained BERT encoder to obtain token representations. Then we use ⌫⇢ ')[⇠ !(] token representation as the summary representation of the paragraph. Meanwhile, we utilize a two-layer MLP to output the relevance score, H B4; . The paragraph which obtains the highest relevance score is selected as the first relevant context. We 23 % '" !"" !#" !&" !$" '# !"# !## !&# !$# for seven seasons pl :TEMPORAL ay former NASCAR ed e a former football b eca m driver :ARG William Keith became Jerry Michael Bostic: ARG became player: ARG Glanville :ARG e bo ed am bor rn p la y b ec n National Football former sportscaster January 17, 1961 October 14, 1941 League : LOC :ARG :TEMPORAL : TEMPORAL !#" : William Keith Bostic (born January 17, 1961) !## : Jerry Michael Glanville (born October 14, 1941) !$# !"" became a former football player who played for became a former football player, former NASCAR seven seasons in the National Football League. driver, and former sportscaster. Figure 3.3 An example of Heterogeneous SRL Graph. The question is “Who is younger Keith Bostic or Jerry Glanville?” The circles show the document-level nodes, i.e., sentences. The blue squares show the argument nodes. The argument nodes include argument phrases and argument-type information. The solid black lines are semantic edges between two arguments carrying the predicate information. The black dashed lines show the edges between sentence nodes and argument nodes. The red dashed lines show the edges between two sentences if there exists a shared argument (based on an exact string match). The orange blocks are the SRL argument-predicate sub-graphs for 9 sentences. B8 means the 9-th sentence from the 8-th paragraph. concatenate @ to the selected paragraph as @ =4F for the next round of paragraph selection. 
3.2.2.2 Second Round Paragraph Selection For the remaining # 1 candidate paragraphs, we use the same model as first-round paragraph selection to generate a relevance score that takes @ =4F and paragraph content as input. We call this process as second-round paragraph selection. Similar to section 3.2.2.1, one of the remaining candidate paragraphs with the highest score is selected. Afterward, we concatenate the question and the two selected paragraphs to form a new context used as the input text for the graph construction. 3.2.3 Heterogeneous SRL Graph Construction We build a heterogeneous graph that contains document-level sub-graph S and argument-predicate SRL sub-graph A6 for each data instance. In the graph construction process, the document level sub-graph S includes question @, titles C, and sentences B from the selected paragraphs. The argument-predicate SRL sub-graphs A6, including arguments as nodes and the predicates as 24 edges, are generated using AllenNLP-SRL model [94]. Each argument node is the concatenation of argument phrase and argument type, including “TEMPORAL”, “LOC”, etc. Figure 3.3 describes the construction of the heterogeneous graph. The edges of the heterogeneous graph are added as follows: 1) There will be an edge between a sentence and an argument if the argument appears in the sentence (the black dashed lines in Figure 3.3); 2) Two sentences will have an edge if they share an argument by exact matching (the red dashed lines); 3) Two argument nodes A68 and A6 9 will have an edge if they share a predicate (the black solid lines); 4) There will be an edge between the question and a sentence if their arguments exactly matching their lexical surface (the red dashed lines). Figure 3.3 shows an example of a heterogeneous SRL graph. B12 and B22 are connected because of a shared argument node “a former football player: ARG”. Besides, the shared argument node has several semantic edges, such as “played” and “became”. In this way, the shared argument node and other connected argument nodes have argument-predicate relationships. We create two matrices based on the constructed graph. We will describe the way we use these matrices in section 3.2.4. We build a weight matrix to express the predicate-based semantics of the edges and a weight matrix to express various types of edges. The semantic edge matrix is a matrix that stores the word index of the predicates that is shared between the two arguments. We initialize all the elements of with empty, ;. If two argument nodes A68 and A6 9 are related to the same predicate, we add that predicate word index to ( A6 8 , A6 9 ) . is a matrix that stores different types of edge weights. We divide the edges into three types: sentence-argument edges, argument-argument edges, and sentence-sentence edges. The weight of a sentence-sentence edge is 1 when two sentences share an argument. Meanwhile, the weight of a sentence-argument edge is 1 if there exists an edge between a sentence and an argument. If two argument nodes have an edge, the weight can be calculated by point-wise mutual information (PMI) [12]. The reason we use PMI is that it can better explain associations between nodes compared to the traditional co-occurrence count method [126]. 25 3.2.4 Graph Encoder Here we provide a background to Graph convolutional network that we use in our model to obtain graph embedding. We introduce the Graph Convolution Network [52] to obtain the graph embeddings. 
The Graph Convolution Network (GCN) is a multi-layer network that uses the graph input directly and generates embedding vectors of the graph. GCN plays an essential role in incorporating neighborhood nodes and helps in capturing the structural graph information. The SRL graph uses the semantic structure of the sentence to form the graph nodes and semantic edges. For instance, the GCN nodes of the document level sub-graph help in finding the supporting fact path, while GCN nodes of the argument-predicate level sub-graph help in identifying the text span of the potential answers. In this work, we consider a two-layer GCN to allow message-passing operations and learn the graph embeddings. The graph embeddings are computed as follows: 1 1 ⌧ 0 = (⇡ 2 ⇡ 2 ) [- A6 ; -S ],1 , (3.1) 1 1 ⌧ = (⇡ 2 ⇡ 2 ) 5 (⌧ 0),2 , (3.2) where ⌧ 0 and ⌧ are graph embedding outputs of two GCN layers that incorporate higher-order neighborhood nodes by stacking GCN layers. 5 (G) is an activation function, ⇡ is the degree matrix of the graph [52], is the heterogeneous edge parameters matrix, and ,1 and ,2 are the learned parameters. - represents node embeddings, including argument-predicate embedding - A6 and sentence embedding -S . Given graph embedding ⌧, we use ⌧ S to represent document-level node embeddings, and ⌧ A6 to represent argument-predicate level node embeddings. 3.2.5 Supporting-Fact Prediction The goal of supporting fact (SF) prediction is to find the supporting fact path that is necessary to obtain the answer. Inspired by previous research [6], we utilize RNN with a beam search to find the best document-level supporting fact path. This approach turns out to be effective for selecting the 26 q )* )+ Done Output path V V V V SF hidden states h1 W h2 W h3 W h4 U U U U !"#$%&' + (' q ), )- )* )+ Figure 3.4 An example of Supporting Fact Prediction. Done SF reasoning path. Notice that, our supporting fact prediction not only relies on BERT and RNN V but also incorporates document-level graph node embeddings ⌧ S . h4 U Formally, we use the concatenation of the graph sentence embedding, ⌧ S (blue circles in null Figure 3.4), and BERT’s [CLS] representation of the candidate sentence (orange circles) to represent the candidate supporting fact sentence -S20=3 . The process of selecting supporting facts is as follows: ⌘C = f(, ⌘C 1 + * -S20=3 + 1 ⌘ ), (3.3) > C = + ⌘C + 1 > , (3.4) where ⌘C is the hidden state of the RNN at the C-th supporting fact selection step, f is the activation function. ,, *, +, 1 ⌘ and 1 > are the parameters. Finally, we use the beam search to output SF paths, choosing the highest scored path as our final supporting fact answer H SF : ÷ H SF = arg max >C , (3.5) 1C) where ) is the maximum number of reasoning hops. We penalize with the cross-entropy loss. Figure 3.4 shows an example of the predicted SF process. Based on the constructed heterogeneous graph, two sentence nodes have an edge if they share an argument. We start from question node @ as the first input sentence. Since @ is a unique input, we select @ as the first SF candidate. In the second step, two candidate sentence nodes, B2 and B3 , which are neighbor nodes of @, are chosen as the input. We separately feed B2 and B3 to the RNN layers. The sentence B3 that obtains a larger logit score is selected as the second SF candidate of the reasoning path. In the third step, B4 and B5 27 are neighbor nodes of the second SF, B3 . Then the model chooses B5 as the third SF. In the end, B1 , B3 , and B5 are the supporting facts. 
3.2.6 Answer Span Prediction The goal of the answer prediction module is to output “yes”, “no”, or answer span for the final answer. We first design an answer type classification based on BERT and an additional two fully connected feed-forward layers. If the highest probability of type classification is “yes” or “no”, we directly output the answer. The input of type classification is ⌫⇢ ')[⇠ !(] . The answer type H CH ?4 can be calculated as H type = " !%type ( [⌫⇢ ')[⇠ !(] ]). (3.6) If the answer is not “yes” or “no”, we compute the logit of every token to find the start position 8 and end position 9 for the answer span. The input token representation is the concatenation of BERT token representation ⌫⇢ ')C>: and graph embedding ⌧ A6 . The answer span H 0=B can be computed as 9 H ans = arg max H8BC0AC H 4=3 , (3.7) 8, 9, 8 9 H8start = " !%start ( [⌫⇢ ')C>:8 ; ⌧8 A6 ]), (3.8) H8end = " !%end ( [⌫⇢ ')C>: 8 ; ⌧8 A6 ]), (3.9) where H ans is the index pair of (start position, end position), H8start represents the logit score of the 8-th word as the start position, and H8end represents the logit score of the 8-th word as the end position. 3.2.7 Objective Function Inspired by [122] and [110], the joint objective function includes the sum of cross-entropy losses for the span prediction ! ans , answer type classification ! type , and supporting fact prediction ! SF . The loss function is computed as follows: ! joint = ! ans + ! SF + ! type 28 = _ 1 ( H start log H start H end log H end ) _ 2 H SF log H SF _ 3 H type log H type , where _ 1 , _ 2 , and _ 3 are hyper-parameters which are weighting factors indicating the importance of each component in the loss. 3.3 Experiments 3.3.1 Dataset Description We use the HotpotQA dataset [125], a popular benchmark for multi-hop QA tasks, for the main evaluation of the SRLGRN. For each question in the Distractor Setting, two gold paragraphs and 8 distractor paragraphs, which are collected by a high-quality TF-IDF retriever from Wikipedia, are provided. Only gold paragraphs include ground-truth answers and supporting facts. In addition, we use Machine Reading Comprehension task, Stanford Question-Answering Dataset (SQuAD) v1.1 [87] and v2.0 [86], to demonstrate the language understanding ability of our model. 3.3.2 Implementation Details We implemented SRLGRN using PyTorch. We use a pre-trained BERT-base language model with 12 layers, 768-dimensional hidden size, 12 self-attention heads, and around 110" parameters [24]. We keep 256 words as the maximum number of words for each paragraph. For the graph construction module, we utilize a semantic role labeling model [94] from AllenNLP2 to extract the predicate-argument structure. For the graph encoder module, we use 300-dimensional GloVe [79] pre-trained word embedding. The model is optimized using Adam optimizer [51]. 3.3.3 Baseline Models In this subsection, we select three SOTA models as our main baselines. In particular, Multi-Paragraph Reading Comprehension Model [16] uses a neural retriever model and a neural reader model to find 2 https://demo.allennlp.org/semantic-role-labeling. 29 the span answer. In addition, we select DFGN [122] and SAE [110] models that construct entity graphs through named entity recognition (NER) as the backbone to find the supporting fact path. Multi-Paragraph Reading Comprehension Model [16] is our first strong baseline. This baseline model combines the popular technical neural modules as the components in the QA domain, including self-attention and bi-attention modules [91]. 
DFGN [122] is a strong baseline method for the HotpotQA task. DFGN builds an entity graph from the text. Moreover, DFGN includes a dynamic fusion layer that helps in finding relevant supporting facts. SAE [110] is the recent SOTA model that is an effective Select, Answer and Explain system for multi-hop QA. SAE is a pipeline system that first selects the relevant paragraphs and uses the selected paragraphs to predict the answer and the supporting fact. 3.4 Experimental Results and Analysis 3.4.1 Results Ans(%) Sup(%) Joint(%) Model EM F1 EM F1 EM F1 Baseline Model [125] 45.60 59.02 20.32 64.49 10.83 40.16 KGNN [128] 50.81 65.75 38.74 76.79 22.40 52.82 QFE [76] 53.86 68.06 57.75 84.49 34.63 59.61 DecompRC [74] 55.20 69.63 - - - - DFGN [122] 56.31 69.69 51.50 81.62 33.62 59.82 TAP 58.63 71.48 46.84 82.98 32.03 61.90 SAE-base [110] 60.36 73.58 56.93 84.63 38.81 64.96 ChainEx [14] 61.20 74.11 - - - - HGN-base [29] - 74.76 - 86.61 - 66.90 SRLGRN-base 62.65 76.14 57.30 85.83 39.41 66.37 Table 3.1 HotpotQA Result on Distractor setting. Except for the Baseline model, all models deploy BERT-base uncased as the pre-training language model to compare the performance. Evaluation metrics. In the HotpotQA benchmark, two sub-tasks are included in this dataset: Answer prediction and Supporting facts prediction. For each sub-task, Exact Match (EM) and Partial Match (F1) are two official evaluations that were proposed in [87]. Given the precision and 30 recall of the answer span prediction and the supporting facts, respectively, the joint Exact Match (EM) and joint Partial Match (F1) scores are computed to evaluate the model performance on the HotpotQA Distractor Setting. Table 3.1 shows the results of HotpotQA (Distractor setting). We can observe the SRLGRN model outperforms the previous state-of-the-art results in most of the evaluation criteria. Our model obtains a Joint Exact Matching (EM) score of 39.41% and a Joint Partial Matching (F1) score of 66.37% that combines the evaluation of answer spans and supporting facts. Our SRLGRN model has a significant improvement, about 28.58%, on Joint EM and 26.21% on F1, over the Baseline Model [125]. Compared to the current published state of the art, i.e. SAE model [110], SRLGRN improves the results for 2.29% on the Joint Exact Match and 2.56% on the joint F1. To our analysis, the main reason for the effectiveness of our model is that it uses not only token-level BERT representations but also uses graph-level SRL node representations that help in learning the line of the multi-hop reasoning. 3.4.2 Model Analysis Our framework provides an effective way for multi-hop reasoning taking advantage of the SRL graph and the pre-trained language models. In the following section, we give a detailed analysis of the SRLGRN model. Effect of SRL Graph. The SRL graph extracts argument-predicate relationships, including in-depth semantic roles and semantic edges. The constructed graph is the basis of reasoning as the inputs of each hop are directly selected from the SRL graph, as shown in Figure 3.4. The SRL graph provides a rich graph network, that is, providing the key semantic edges between the words to cover reasoning paths, see Figure 3.3. Compared to the NER graph in the previous models [122], the proposed SRL graph covers the 86.5% of reasoning paths for the data samples. The NER graph of DFGN can only cover 68.7% of the reasoning paths [122]. 
The coverage of the semantic edges in the graph is one major reason 31 that the SRLGRN model has higher accuracy compared to other published models. As shown in Table 3.1, the SRLGRN improves 5.79% on joint EM and 6.55% on joint F1 over DFGN that is based on the NER graph. Ans(%) Ablation Model EM F1 w/o graph 53.06 67.68 Graph w/o Argument type and Semantic edge 60.10 73.24 Joint w/o joint training 58.50 71.58 ALBERT-base 59.87 74.20 Language BERT-base 62.65 76.14 Table 3.2 SRLGRN ablation study on HotpotQA. To evaluate the effectiveness of the types of semantic roles and the edge types, we perform an ablation study. First, we removed the whole SRL graph. Second, we removed the predicate-based edge information from the SRL graph. Table 3.2 shows the results. The complete SRLGRN improves 8.46% on the F1 score compared to the model without the SRL graph. The model loses the connections used for multi-hop reasoning if we remove the SRL graph and only use BERT for answer prediction. We also observe that the F1 score of answer span prediction decreases 2.9%, if we did not incorporate semantic edge information and argument types. The reason is that removing predicate edges and argument types will destroy the argument-predicate relationships in the SRL graph and breaks the chain of reasoning. For example, in Figure 3.3, the main arguments of the two supporting facts in B12 and B22 (William and Jerry) are connected with a predicate edge, “born”, to the temporal information necessary for finding the answer. Both the “born” edge and the adjunct temporal roles are the key information in the two sentences to find the final answer to this question. The shared ARG node, “football player”, also helps to connect the line of reasoning between the two sentences. These two results indicate that both semantic roles and semantic edges in the SRL graph are essential for the SRLGRN performance. In a different experiment, we tested the influence of the joint training of the supporting facts and 32 answer prediction. As shown in Table 3.2, the performance will decrease by 4.56% when we do not train the model jointly. Effect of Language Models. We use two popular and widely-used pre-trained language rep- resentation models, BERT and ALBERT [55]. The last two lines of Table 3.2 show the results. Although BERT achieves relatively better performance, ALBERT architecture has significantly fewer parameters (18x) and is faster (about 1.7x running time) than BERT. In other words, ALBERT reduces memory consumption by cross-layer parameter sharing, increases the speed, and obtains a satisfactory performance. Effect of SRLGRN on Single-hop QA. We evaluate the SRLGRN (excluding the paragraph selection module) on SQuAD [87] to demonstrate its reading comprehension ability. We evaluate the performance on both SQuAD v1.1 and SQuAD v2.0. Table 3.3 describes the results of several baseline methods on SQuAD v1.1. Our model obtains a 1.8% improvement over BERT-large and a 1.6% improvement over BERT-large+TriviaQA [24]. Ans(%) Model EM F1 Human 82.3 91.2 BERT-base 80.8 88.5 BERT-large 84.1 90.9 BERT-large+TriviaQA 84.2 91.1 BERT-large+SRLGRN 85.4 92.7 Table 3.3 SQuAD v1.1 performance. We further test the SRLGRN on SQuAD v2.0. The main difference is that SQuAD v2.0 combines answerable questions (like SQuAD v1.1) with unanswerable questions [86]. Table 3.4 shows that our proposed approach improves the performance of the SQuAD benchmark compared to several recent strong baselines. 
We recognize that our SRLGRN improves 7.1% on EM compared to the robust BERT-large model and improves 1.0% on EM compared to SemBERT [132]. The two experiments on SQuAD 33 Ans(%) Model EM F1 Human 86.3 89.0 ELMo+DocQA [86] 65.1 67.6 BERT-large [24] 78.7 81.9 SemBERT [132] 84.8 87.6 BERT-large+SRLGRN 85.8 87.9 Table 3.4 SQuAD v2.0 performance. v1.1 and SQuAD v2.0 demonstrate the significance of the SRL graph and the graph encoder. Error Type Model Prediction Label washington dc district of columbia sars severe acute respiratory syndrome Synonyms ey ernst young writer author australian australia MLV hessian hessians mcdonald’s, co mcdonalds 1946 1945 Month-Year 25, november, 2015 3, december 10, july, 1873 1, september, 1864 11 10 Number fourth 4 2402 5922 External Knowledge Coker NCAA I FBS football taylor, swift usher Other film documentary fourteenth 500th episode Table 3.5 Error types on HotpotQA dev set. 3.4.3 Qualitative Analysis Synonyms Answers is the most frequent cause of the reported errors in many cases where the predicted answer is semantically correct. As shown in the first row of Table 3.5, our predicted answer and gold label have the same meaning. For example, SRLGRN predicts "sars", while the 34 label is "severe acute respiratory syndrome." We know that "sars" is the abbreviation of the gold label. Question: Which is a type of herb, Brassia or Achimenes? Supporting Facts: Comparison 1. Brassia is a genus of orchids classified in the Oncidiinae subtribe. 2. Achimenes is a genus of about 25 species of tropical and subtropical rhizomatous perennial herbs in the flowering plant family Gesneriaceae. Answer: Achimenes Successful Cases Question: When was the University established where Laura Landweber is a professor? Supporting Facts: 1. As of 2016, she is a professor of biochemistry and molecular biophysics and of biological sciences at Columbia University. Bridge 2. Columbia University (Columbia; officially Columbia University in the City of New York), established in 1754, is a private Ivy League research university in Upper Manhattan, New York City, often cited as one of the world's most prestigious universities. Answer: 1754 Question: Luke Null is an actor who was on the program that premiered its 43rd season on which date? Wrong Paragraph Selection: 1. Luke Null 2. 43rd Battalion (Australia) Label Paragraphs Selection : 1. Luke Null 2. Saturday Night Live Wrong Supporting Facts: Paragraph 1. Luke Null is an American actor, comedian, and singer, who currently works as a cast member on "Saturday Night Live", having joined the show Selection at the start of its forty-third season. 2. The forty-third season of the NBC comedy series "Saturday Night Live" premiered on September 30, 2017 with host Ryan Gosling and musical guest Jay-Z during the 2017-2018 television season. Answer: September 30, 2017 Failing Question: Who is younger, Wayne Coyne or Toshiko Koshijima? Supporting Facts: Cases 1. Wayne Michael Coyne (born January 13, 1961) is an American musician. Comparison 2. Toshiko Koshijima ( , Koshijima Toshiko , born March 3, 1980 in Kanazawa, Ishikawa) is a Japanese singer. Wrong Answer: Wayne Coyne Answer: Toshiko Koshijima Question: What Division was the college football team that fired their head coach on November 24, 2006? Supporting Facts: 1. The 2006 Miami Hurricanes football team represented the University of Miami during the 2006 NCAA I FBS football season. Bridge 2. Coker was fired by Miami on November 24, 2006 following his sixth loss that season. 
Wrong Answer: Coker Label Answer: NCAA Division I FBS football Figure 3.5 Successful cases and Failing cases on our proposed SRLGRN framework. Minor Lexical Variation (MLV) is another major cause of mistakes in the SRLGRN model. As shown in the second row of Table 3.5, our model’s predicted answer is "Australian", while the gold label is "Australia". Many wrong predictions occur in the singular noun versus plural noun selection. Dates and numbers are other common mistakes. By looking into the graphs, we observed that sometimes SRLGRN predicts the wrong answer when two or more arguments of the same type, in particular with “TEMPORAL” types of arguments, are connected to the same predicate. In such cases, it is hard to disambiguate the actual time that is the answer to the question. Paragraph Selection is the cause of a small portion of errors in the SRLGRN model. As shown in Figure 3.5, the model chooses the wrong paragraph “43rd Battalion”. The reason is that “43rd” is a distractor since the “43rd season” appears in the question. The paragraph “Saturday Night Live” is the correct relevant paragraph that includes both “forty-third season” and the true answer. To 35 resolve this issue in the future, we will try to combine our model with a robust retrieval module designed for multi-hop QA similar to the Multi-step entity-centric model proposed in [35]. Comparison questions seem to be a source of error when answering the questions in HotpotQA. For example, as is shown in Figure 3.5, the question is “ Who is younger, Wayne Coyne or Toshiko Koshijima?” To correctly answer the “comparison” type of question, the model requires the ability to compare two entities that existed in the question. In the failing case of Figure 3.5, we predict the wrong answer “Wayne Coyne”. The model keeps answering “Wayne Coyne” even after replacing the word “younger” with “older”, which happens to be the correct answer this time. q: What Division was the college football team .56 : Coker was fired by Miami on that fired their head coach on November 24, 2006 November 24, 2006. Coker :ARG head coach :ARG fir fire e November 24, fire fir e 2006 :TEMPORAL the college football Miami :ARG team :ARG 7 .56 .55 rep res en t University of Miami Hurricanes Miami : ARG football team :ARG rep rep res res en t en t NCAA I FBS football season : LOC 2006: TEMPORAL .55 : The 2006 Miami Hurricanes football team represented the University of Miami during the 2006 NCAA I FBS football season. Figure 3.6 The “Disconnection” failing case that SRL fails to lead to the correct answer. The meaning of different lines and node colors are the same as Figure 3.3. Disconnection is another type of error. While the SRL graph helps if finding the chain of reasoning, in some cases, the line of reasoning was broken. By looking into the errors, we realized in most cases, obtaining the answer needed multiple hops of reasoning while external knowledge was required to form the connections. As is shown in “Disconnection” failing cases of Figure 3.5, 36 the selected paragraphs do not show the relation between “Coker” and “Miami Hurricanes football team”. Figure 3.6 describes the SRL construction for this failing case. The second supporting fact and the question have the same temporal argument node “November 24, 2006”. However, there is no chain between the first supporting fact and the second supporting fact due to the lack of external knowledge that can connect “Coker”, “coach” and “Miami Hurricanes football team”. 
Therefore, the isolated reasoning chain leads to a wrong answer. 3.5 Summary We proposed a novel semantic role labeling graph reasoning network (SRLGRN) to deal with multi-hop QA. SRLGRN has a graph convolutional network (GCN) as the backbone, which is created based on the semantic structure of the multiple documents. We innovatively construct a graph with entities and multiple relational edges from documents using semantic role labeling (SRL). This semantic structure of multiple documents can significantly improve the multi-hop reasoning capacity to find the line of reasoning to answer the questions. We jointly train a supporting fact prediction module that finds the cross-paragraph reasoning path, and an answer prediction module that obtains the final answer. SRLGRN exceeds most of the SOTA results on the HotpotQA benchmark. Moreover, we evaluate the model (excluding the paragraph selection module) on other reading comprehension benchmarks. Our approach achieves competitive performance on SQuAD v1.1 and v2.0. 37 CHAPTER 4 CAUSAL REASONING FOR DOCUMENT-LEVEL QA 4.1 Background and Motivation Cause-effect QA is a specific type of question answering over a given document in which the questions ask about the causal impact of entities or events on each other. The recent research on reasoning over cause-effect QA has achieved promising results [87, 86, 40, 20, 106]. Specific to this problem, the WIQA benchmark [107] was proposed for the evaluation of causal reasoning capabilities of learning models on a procedural text by introducing “what . . . if” reasoning. The “what . . . if” reasoning task is a type of cause-effect QA that relates to reading comprehension, multi-hop reasoning, and commonsense reasoning. This task is rich in containing various challenging linguistic and semantic phenomena. The “what . . . if” reasoning is built based on linguistics and generating possible cause-effect relationships expressed in the context of a paragraph. Its goal is to predict what would happen if a process was perturbed in some way. It requires understanding and tracing the changes in events and entities through a paragraph. Figure 4.1 shows some examples of the WIQA benchmark. There are two types of questions in the dataset, including in-paragraph, where the answer to the question is in the procedure itself, and out-of-paragraph, where the answer does not exist in the text and needs external knowledge [107]. There are several challenges in the “what . . . if” cause-effect QA. The first challenge is reasoning over the comparative expressions for describing the effect of the entities on each other in the text that can convey a positive or negative effect (promoting or demoting each other). For example, comparative expressions such as (larger, smaller), (more, less), (higher, lower). This task requires extracting the important entities through the procedural text and understanding their influences. BERT is used as a strong baseline in [107] to predict answers by implicit representations. However, they ignore explicit comparative expressions between entities and the way they affect each other. The second challenge is causal reasoning over relations between pairs of entities. Although recent pre-trained language models (LM) achieve promising performance on QA, there is still a 38 Procedural Text: 1. A frog lays eggs in the water. 2. Tadpoles develop inside of the eggs. 3. The eggs hatch. 4. The tadpoles eat and grow. 5. The tadpoles grow legs and form into frogs. 6. The frogs leave the water. 
Questions and Answers: 1. Suppose tadpoles eat more food happens, how will it affect more frogs? (A) More (B) less (C) No effect 2. Suppose the weather is unusually bad happens, how will it affect the tadpoles will need more food? (A) More (B) less (C) No effect Figure 4.1 WIQA task contains procedural paragraphs and a large collection of “what . . . if” questions. The bold font candidate answers are the gold answers. gap between LM and human performance due to the lack of causal reasoning over entities [30]. For example, given the question “suppose more animals that hunt frogs happen, how will it affect more tadpoles lose”, the LM has difficulty to consider the relation “hunt” between the entity pair (“animals”, “frogs”). This recent research work [5] uses a Transformer model with regularization to produce consistent answers to multiple related questions. The model obtains a good result with augmented data following logical constraints. However, these constraints ignore the importance of causal reasoning, and can not capture the higher-order chain of causal reasoning. The third challenge is the lexical variability in expressing the same concept, which makes entity alignment hard. For example, the same entities and events are referred to by different terms, like (insect, bee), (become, form). Entity alignment requires the alignment between question and paragraph entities, and the alignment between the entities appearing in the different paragraphs themselves. Unfortunately, all current works ignore the importance of entity alignment for tracing the entities and finding the relation between different entities in the question and paragraph. Therefore, we propose a novel end-to-end Relational Gating Network (RGN) for causal reasoning over cause-effect QA. The RGN framework answers the “what . . . if” questions and solves challenges 39 of comparative expressions, causal reasoning, and entity alignment. RGN jointly learns to extract the key entities through our entity gating mechanism, finds the line of reasoning and relations between the key entities through the relation gating mechanism, and captures the entity alignment through contextual entity interaction. The main motivation of the two gating mechanisms is to learn the line of causal reasoning. Concretely, we build an entity gating module to extract and filter the key entities in the question and context and highlight the entities that are compared qualitatively. Furthermore, we design a relation gating module with an alignment of entities to capture the higher-order chain of causal reasoning based on pairwise relations. Moreover, we propose an efficient module, called contextual interaction module, to incorporate cross-information from Question and Content interactions during training in an efficient way to help entities alignments. The contributions of this chapter are as follows: 1) We propose a Relational Gating Network (RGN) that captures the most important entities and relationships involved in comparative expressions and causal reasoning. 2) We propose a contextual interaction module to effectively and efficiently align the question and paragraph entities. 3) We evaluate the methods and analyze the results on the “what . . . if” question answering using the WIQA benchmark. We improve the recent state-of-the-art results and show the significance of the entity gating module and relation gating module on causal reasoning over text. 
4.2 Relational Gating Network Relational Gating Network (RGN) aims to establish an end-to-end architecture for reasoning over cause-effect QA. RGN model uses an entity gating module to extract and filter the critical entities in question and paragraph content. We enable a higher-order chain of causal reasoning based on the pairwise relationships between key entities through the relation gating mechanism. We propose a contextual interaction module to improve entity alignment in an efficient way. Figure 4.2 shows the proposed architecture. This section introduces our network and the training approach in detail. 40 Question Entity Gating Contextual Interaction + Filter Module !! Self: Q Cross: Q Question "" Self: C Cross: C MLP "#$% Paragraph Content … … Paragraph Entity Gating Candidate Relation Gating Content + Filter Relations + Filter Figure 4.2 Relational Gating Network (RGN) is composed of pre-training contextual representation, entity gating module, relation gating module, and contextual interaction module followed by a task-specific classifier. 4.2.1 Problem Formulation Formally, the task is to select one of the candidate’s answers 0, (A) More; (B) Less; (C) No effect, given a question @ and the paragraph content C. The paragraph content includes several sentences C = {B1 , B2 , . . . , B= }. For each data sample, the data format is a triplet of (@, C, a). 4.2.2 Entity Representations For each data sample, we form the input ! by concatenating the question @ and the paragraph content C as follows: ! = [[⇠ !(]; @; [(⇢ %]; C], (4.1) where [CLS] and [SEP] are the special tokens used in Language Models (LMs) [65]. We feed input ! to a pre-trained LM to obtain all question and content token representations. Meanwhile, we use ⇢ [⇠ !(] representation as the summary representation of the paragraph. After that, we obtain the RoBERTa token representations, ⇢ [⇠ !(] , ⇢ @ , and ⇢ C , which are shown as follows: ⇢ @ = [⇢ @F 1 , ⇢ @F 2 , . . . , ⇢ @F < ] 2 R<⇥3 , (4.2) 41 ⇢ C = [⇢ CF 1 , ⇢ CF 2 , . . . , ⇢ CF = ] 2 R=⇥3 , (4.3) where ⇢ @ represents the list of question representations, ⇢ C represents the list of paragraph content representations, 3 is the learned representation dimension for tokens, < represents the max length of the question, and = represents the max length of the paragraph content. 4.2.3 Entity Gating The intuition behind the entity gating module is to filter several key entity representations from both question, ⇢ @ , and paragraph content, ⇢ C . We call this process entity gating which is shown in Figure 4.2. Given the question ⇢ @ , for each entity ⇢ @F 8 2 ⇢ @ , we use a multi-layer perceptron and a softmax layer to obtain an entity importance score *@F 8 : exp " !%(⇢ @F 8 ) *@F 8 =Õ ⇣ ⌘, (4.4) F9 < 9=1 exp " !%(⇢ @ ) ⇢ @0 = *@ ⇢ @ 2 R<⇥3 . (4.5) We compute the new entity representations ⇢ @0 by multiplying the entity representations and their scores in *@ . Then we choose the most important entities with top-: scores. We denote the set of filtered key entities after gating the question as +@ = [+@1 , +@2 , . . . , +@: ] 2 R :⇥3 . +@ is the list of question gated entity representations, : is the number of filtered entities and +@8 is 3-dimensional embedding for the 8-th filtered entities. Notice that the process of computing paragraph entity gating +C 2 R :⇥3 is the same as the question entity gating +@ . Using the entity gating mechanism improves the generalizability of our deep model as we can explicitly see the selected entities and comparative expressions. 
The detailed analysis of entity gating is shown in Section 4.4.2. 4.2.4 Relation Gating We consider the representations beyond entities by using a Relation Gating module. This extension allows RGN to capture the higher-order chain of causal reasoning based on pairwise relations, which 42 is the main contribution in this chapter. The pairs of entities enable the model to understand the relationships between words and find the line of causal reasoning. Moreover, relation gating aims to pair un-directed relations between entities for capturing the crucial relations, like “tadpole (losses) tail” and “less severe”, as well as the pairs of entities that help to understand the line of reasoning. We call this process relation gating module, which is shown in Figure 4.2. In this module, first, we concatenate +@ and +C , which are obtained from Section 4.2.3 and form candidate set + = {+@ ; +C }. Then we pair every two gated entities and form +A4; = [+ 8 ; + 9 ] 2 R1⇥23 . 8, 9 Furthermore, the candidate relational representation, +A4; , is a non-linear mapping R23 ! R23 modeled by fully connected layers from candidate relation. 1 2 A +A4; = [+A4; , +A4; , . . . , +A4; ] 2 RA⇥23 , 2:⇥(2: 1) where A is the size of total relation candidate pairs, that is, A = 2 . Given each candidate relation +A4; 8 , we compute a multi-layer perceptron and a softmax layer to obtain a relational importance score, ) 8 : ⇣ ⌘ exp " !%(+A4; 8 ) )8 = Õ ⇣ ⌘, (4.6) ? 9=1 exp " !%(+ 9 A4; ) 0 +A4; = )+A4; 2 R :⇥23 . (4.7) We compute the new relation representation +A4; 0 by multiplying the relation representations and their scores in ). We select the relations with top-: scores because using all the scores increases the number of parameters and the computational cost significantly. Moreover, the redundant entities make learning harder and consequently less accurate. We denote the set of filtered key relations after gating relations as A4; 2 ' :⇥23 to the gated relation representation. 4.2.5 Contextual Interaction Module Entity alignment is one of the challenges in Document-level QA. Although we propose entity and relation gating in the above sections separately, aligning questions with the paragraph is still 43 Cross interaction: Question Cross interaction: Context Cross Cross W W Question !" Context !! Self interaction: Question Self interaction: Context Self Self W W Question Entity Gating Context Entity Gating Figure 4.3 Contextual Interaction Module comprises self-interactions and cross-interactions. The inputs are the question’s and paragraph’s filtered entities representations, and the outputs are question and paragraph contextual representations. important. We found that a simple concatenation of gated entity representations from the question +@ and the paragraph content +C shows a good performance. However, concatenated representations and multi-layer perceptrons have a limited capacity for modeling the interactions. As shown in Figure 4.3, we have developed a novel and fast encoding model, namely Contextual Interaction Module. The model needs to incorporate information from Question-Content interactions and, meanwhile, avoid expensive architectures such as Multi-Head attentions [111] because those are infeasible for large-scale datasets. Thus, we developed a model that uses only linear projections and inner products of both sides, i.e., question and context, and we apply a mechanism like simplified self-attention to model the interactions as described below. 
B4; 5 Given the +@ , we compute the self-interaction of the question’s gated entities, @ , B4; 5 @ = +@) , B4; 5 +@ 2 R :⇥3 , (4.8) where , B4; 5 2 R3⇥: is a projection matrix. The cross-interactions between gated entities and paragraph entities can be calculated as 2A>BB @ = +C) , 2A>BB+@ 2 R :⇥3 , (4.9) 44 B4; 5 where , 2A>BB 2 R3⇥: is also a projection matrix. Then we concatenate the two matrices @ and 2A>BB . Finally, we obtain the question contextual representation as follows: @ @ B4; 5 @ =[ @ ; 2A>BB @ ] 2 R :⇥23 (4.10) Notice that the process of paragraph contextual representation C 2 ' :⇥23 is the same as the question contextual representation @. Therefore, the output includes two representations. One is the paragraph contextual representation containing information from the question, and the question contextual representation contains information from the paragraph. 4.2.6 Output Prediction After acquiring all the contextual entity representations and gated relations representations, we concatenate them and use the result as the final representation, . The process is described as follows: =[ @; 2; A4; ] 2 R3:⇥23 (4.11) Finally, a task-specific classifier MLP ( ) predicts the output. 4.3 Experiments Data Train Dev Test V1 Test V2 Total Questions 29808 6894 3993 3003 43698 in-para 7303 1655 935 530 10423 Question out-of-para 12567 2941 1598 1218 18326 type no-effect 9936 2298 1460 1255 14949 Total 29808 6894 3993 3003 43698 #hops=0 9936 2298 1460 1255 14949 Number #hops=1 6754 1510 835 245 9254 of hops #hops=2 8969 2145 1153 1027 13294 #hops=3 4149 941 545 476 6111 Total 29808 6894 3993 3003 43698 Table 4.1 WIQA Dataset Statistics. 45 4.3.1 Dataset Description WIQA [107] benchmark contains procedural paragraphs and a large collection of “what . . . if” questions. The task is to answer the questions given paragraph contents and a list of the candidate’s answers. Table 4.1 shows the detailed data statistics and data distribution of the WIQA dataset. 4.3.2 Implementation Details We implemented RGN using PyTorch. We used RoBERTa-Base Language Model as the backbone to train our model. All of the representations are 768 dimensions. For each data sample, we keep 128 tokens as the max length for the question, and 256 tokens as the max length for paragraph contents. Notice that both gated entity representations for question and paragraph use : = 10 for selecting top-: entities in our experiments. The value of this hyper-parameter was selected after experimenting with various values in {3, 5, 7, 10, 15, 20} using the development dataset. For the Gated relation representations, top-10 ranked pairs are used to reduce the computational cost and reduce the unnecessary relations. In the relation gating process, we use two hidden layers for multi-layer perceptrons. The task-specific output classifier contains two MLP layers. The model is optimized using the Adam optimizer. The training batch size is 4. During training, we freeze the parameters of RoBERTa in the first two epochs, and we stop the training after no performance improvements are observed on the development dataset, which happens after 8 epochs. 4.4 Results and Discussion 4.4.1 Result Comparison We show the model performance on the WIQA benchmark compared to various strong baselines in Table 4.2 and Table 4.3. We observe that, in general, Transformer-based models outperform other models, like Deomp-Attn [78]. 
This promising performance demonstrates the effectiveness of Transformers [111] and large-scale pre-trained Language Models [24, 65]. Moreover, our RGN achieves state-of-the-art results compared to all baseline models. Especially, RGN outperforms [107] 46 Models in-para out-of-para no-effect Test V1 Acc Majority 45.46 49.47 55.0 30.66 Adaboost [24] 49.41 36.61 48.42 43.93 Decomp-Attn [78] 56.31 48.56 73.42 59.48 BERT (no para) [24] 60.32 43.74 84.18 62.41 BERT [107] 79.68 56.13 89.38 73.80 RoBERTa [107] 74.55 61.29 89.47 74.77 EIGEN [69] 73.58 64.04 90.84 76.92 REM-Net [44] 75.67 67.98 87.65 77.56 Logic-Guided [5] - - - 78.50 RGN 80.32 68.63 91.06 80.18 Human - - - 96.33 Table 4.2 Model Comparisons on WIQA test V1 dataset. WIQA test data has four categories, including in-paragraph accuracy, out-of-paragraph accuracy, no-effect accuracy, and overall test accuracy. Models in out no-eff Test V2 Random 33.33 33.33 33.33 33.33 Majority 00.00 00.00 100.0 41.80 RoBERTa 70.69 60.20 91.11 75.34 REM-Net 70.94 63.22 91.24 76.29 REM-Net (RoBERTa-large) 76.23 69.13 92.35 80.09 QUARTET 74.49 65.65 95.30 82.07 [85] RGN (RoBERTa-base) 75.91 66.15 92.12 79.95 RGN (RoBERTa-large) 78.40 68.83 93.01 82.46 Human - - - 96.30 Table 4.3 Model Comparisons on WIQA test V2. “In” represents in-paragraph accuracy, “out” represents out-of-paragraph accuracy, and “no-eff” represents no effect accuracy, . by 6.38% and outperforms current state-of-the-art model on test V1, logic-guided [5], by around 1.6%. Moreover, our RGN model achieves the SOTA on WIQA test V2. The improved performance demonstrates that entity gating, relation gating, and contextual interaction module are effective for “what . . . if” causal reasoning. We provide a detailed analysis of the advantage of RGN from different perspectives. 47 Model # hops = 1 # hops = 2 # hops = 3 BERT(no para) 58.1% 47.3% 42.8% BERT 71.6 % 62.5% 59.5% RoBERTa 73.5 % 63.9% 61.1% EIGEN 78.78 % 63.49% 68.28 % RGN 80.5% 71.2% 70.0% Table 4.4 The accuracy when the number of hops increases. 4.4.2 Model Analysis Effects on Causal Reasoning and Multi-Hops: In-para and out-of-para question categories require multiple hops of causal reasoning to answer the questions. As shown in Table 4.4, we found that the accuracy improved 7.0% for 1 hop, 7.3% for 2 hops, and 8.9% for 3 hops compared to RoBERTa which does not have the two gating mechanisms and Contextual Interaction Module. As we expect, the RGN framework has made tremendous progress in causal reasoning with multiple hops, and the improvement in the performance of the baselines is more when the number of hops increases. For qualitative analysis, we show successful cases from our RGN in Figure 4.4. We observe that RGN is capable of bridging question and paragraph content by extracting key entities. In the successful cases, which is shown in Figure 4.4, RGN helps in constructing the chain of “water droplets are in clouds ! droplets combine to form bigger drops in the clouds” through key entities “water”, “clouds”, and “droplets”. Moreover, we observe that the key entities “water”, “clouds”, and “droplets” obtain high gating entity scores. Effects of Entity Gating: As shown in Table 4.5, in the first ablation study, we remove the entity gating and relation gating modules. Notice that the contextual interaction module uses the whole question entities and paragraph entities when RGN does not use these two modules. Using whole entities significantly increases the computational cost. 
Moreover, Table 4.5 shows that the accuracy is lower by about 5.3% compared to the full RGN when applied to the development dataset. This experiment demonstrates that using all the entities without a gating mechanism has a negative influence on the contextual interaction module and drops the performance.

Ablations                   in    out   no-eff  dev acc
RGN (w/o gating ent & rel)  76.2  61.1  89.2    75.3
RGN (w/o gating rel)        78.4  63.6  89.9    77.4
RGN                         81.7  69.2  91.3    80.6
RGN (w/o CIM)               80.2  68.4  90.5    79.7
RGN (- CIM + Multi-Head)    81.3  68.9  91.7    80.3
RGN (add regularization)    82.0  69.1  91.6    80.8
Table 4.5 Ablation study. CIM: Contextual Interaction Module.

Effect of Relation Gating: The goal of relation gating is to capture the higher-order chain of causal reasoning based on pairwise relations. The relation gating module extracts the important candidate relations by pairing up entities after gating the entities. More importantly, the relation gating module helps in understanding the connections between entities and finding the line of causal reasoning. Our model captures the important pairs of influencing entities "tadpole (loses) tail" and "animal (hunts) frog". When we keep the entity gating module and remove the relation gating module, we observe that the accuracy on WIQA decreases by 3.3% compared to the full RGN architecture. Moreover, the model without the relation gating module cannot capture the key relations. The results show that the performance on the out-of-para questions decreases by 5.6% compared to the full RGN model. Section 4.4.3 shows more examples and analysis.
Effects of Contextual Interaction Module (CIM): The WIQA research work [107] shows that around 15% of the influence changes are difficult to handle in the entity alignment part due to language variability. In other words, paragraph entities use different terms, such as ("removes", "expels"), to express the same semantics. The problem of language variability becomes more severe for the multi-hop cases that require aligning the question with several sentences in the paragraph. Without the Contextual Interaction Module, the development accuracy decreases by more than 1%. As shown in Table 4.4, the accuracy improves significantly on the direct effect (1 hop) and the indirect effects (2 hops or 3 hops) compared to all strong baselines. This demonstrates the effectiveness of the interaction module. In an additional experiment, we replaced the CIM with Multi-Head attention that uses an encoder composed of a stack of N = 6 identical layers. Each layer has two sub-layers: the first is multi-head self-attention, and the second is a fully connected network [111]. The computational time was 936 ms/batch for our contextual interaction module, while it was 3002 ms/batch for the Transformer, and the accuracy is fairly similar.

[Figure 4.4 shows two successful and two failing cases of the RGN network; each case consists of a "what ... if" question, the procedural paragraph, and the gold answer, and the cases are discussed in Section 4.4.3.]
Figure 4.4 Successful and failing cases of RGN network.

4.4.3 Qualitative Analysis
For a better understanding of how our proposed model performs qualitatively, we show successful and failing cases from our RGN framework in Figure 4.4. We can observe that RGN is surprisingly capable of bridging the question and content in the in-para category. Although the RGN framework has achieved state-of-the-art performance, the framework cannot always capture the line of causal reasoning. The bottom part of Figure 4.4 shows some failing cases. In the first failing case, RGN gives a wrong prediction because the content sentence "the plant dies" is captured as a strong negative influence by our model. Although our model bridges the relation between "fruit" and "plant", the critical term "dies" obtains a high gating score and misleads our final prediction.
Commonsense reasoning is the other type of error made by the RGN model. There are two types of questions in the dataset: in-paragraph, where the answer to the question is in the text itself, and out-of-paragraph, where the answer does not exist in the text and an external source of knowledge is required [107]. For example, in the second failing case of Figure 4.4, the question contains "climate change," and the paragraph does not contain the cause of the "climate change". This needs external knowledge connecting "climate change" and "water evaporating". Since answering the question requires external knowledge, it is hard to build a causal relationship for this example. However, the improvement on the out-of-paragraph questions is due to observing multiple examples in the dataset that use the same type of commonsense. Because relational gating helps to find the line of reasoning, our RGN model captures those relations from observing them frequently and learns shortcuts. For example, in the second successful case of Figure 4.4, the relational gating module captures the pairwise relation between "heart body" and "blood body" due to multiple occurrences in the data, filling the information gap for reasoning.
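To make the gating idea discussed in this analysis more concrete, the following is a minimal PyTorch sketch of an entity-gating step: each candidate entity representation receives a learned gating score and only the highest-scoring entities are kept and re-weighted. This is a simplified illustration with hypothetical shapes and a single linear scorer, not the exact RGN module.

```python
import torch
import torch.nn as nn

class EntityGate(nn.Module):
    """Minimal sketch of entity gating: score each candidate entity and
    keep only the top-scoring ones for later interaction. This is a
    simplified stand-in, not the exact RGN gating module."""
    def __init__(self, d: int, keep: int):
        super().__init__()
        self.scorer = nn.Linear(d, 1)
        self.keep = keep

    def forward(self, entities: torch.Tensor):
        # entities: (num_entities, d) contextual representations of candidate entities
        scores = torch.sigmoid(self.scorer(entities)).squeeze(-1)   # gating score per entity
        top = torch.topk(scores, k=min(self.keep, entities.size(0)))
        gated = entities[top.indices] * top.values.unsqueeze(-1)    # re-weight the kept entities
        return gated, top.indices, scores

gate = EntityGate(d=768, keep=5)                       # hypothetical dimensions
gated, kept_idx, scores = gate(torch.randn(12, 768))   # e.g., 12 candidate entities
```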
4.5 Summary
In this chapter, we propose an end-to-end Relational Gating Network (RGN) to perform "what . . . if" causal reasoning over text for answering cause-effect questions. In particular, we propose an entity gating module, a relation gating module, and a contextual interaction module to find the answer. We demonstrate that the proposed approach can effectively solve the challenges in "what . . . if" reasoning, including causal reasoning, comparative expressions, and entity alignment. We evaluate our RGN on the WIQA benchmark and achieve state-of-the-art performance. Our gating mechanism and contextual interaction module can be easily used in solving various QA tasks that need to reason over entities and their relationships and follow a procedure. The gating mechanism can be extended to work at various levels of granularity, such as sentence and paragraph levels, to filter important pieces of information and to find the line of reasoning for answering the questions.

CHAPTER 5
RELATIONAL REASONING FOR CROSS-MODALITY QA

5.1 Background and Motivation
In many real-world situations, the answers to natural language questions can be found in different types of modalities. One important modality that can convey information is the visual one. The problem of answering natural language questions based on a given image is called visual question answering (VQA). VQA requires the understanding of visual contents, language semantics, cross-modality alignments, and relationships between the two modalities [118, 4, 36, 100, 101]. Recently, there have been many efforts to build such multi-modal QA benchmarks [62, 54, 46, 4, 100, 36, 101]. Inspired by the effectiveness of deep learning [24], researchers develop deep architectures for multi-modal QA by learning representations for each modality, combining the two representations, and predicting answers [59, 104]. For instance, VisualBERT [59] consists of Transformer layers that separately learn textual and visual representations with the self-attention module. LXMERT [104] learns entity representations by concatenating textual tokens and visual objects and using a cross-modality Transformer architecture. However, the current performance of these models is unsatisfactory because conventional deep learning models have difficulties in learning a robust joint representation and relational reasoning across modalities. Our hypothesis is that exploiting the structure of the entities and their relationships in the two modalities and explicitly aligning them is one key factor that can facilitate solving the challenges of multi-modal QA, but this is less explored. In our proposed model, we learn robust joint representations by directly modeling the relations between different modality components based on relevance scores, inspired by ideas from the information retrieval literature [77, 113, 24, 111].

[Figure 5.1 shows one VQA example (an image paired with the question "Where is the child sitting?") and one NLVR2 example (two images paired with the statement "The left image contains twice the number of dogs as the right image, and at least two dogs in total are standing.").]
Figure 5.1 Two examples of the Cross-Modality Question Answering task. The left side is an example of the VQA benchmark, while the right side is an example of the NLVR benchmark.

Following the above-mentioned hypothesis, in this chapter, we propose a novel cross-modality relevance (CMR) architecture that considers the relevance between textual token representations and visual object representations for explicitly aligning them. We first encode the data from each modality with single-modality Transformers, then combine the two encoded representations and pass them into a cross-modality Transformer. We consistently refer to the words in the text and the objects in the images (i.e., bounding boxes in images) as "entities" and their representations as "Entity Representations". We use the relevance between the components of the two modalities to model the alignment between them. We measure the relevance between their entities, called "Entity Relevance", and the high-order relevance between their relations, called "Relational Relevance". We learn representations from the affinity matrix of the relevance scores by convolutional layers and fully-connected layers. Finally, we predict the answer based on the relevance representations.
The contributions of this chapter are as follows: 1) We propose a cross-modality relevance (CMR) architecture that considers entity relevance and high-order relational relevance for aligning the two modalities. 2) We evaluate the method and analyze the results on both VQA and NLVR tasks using the VQA v2.0 and NLVR2 benchmarks, respectively. We improve the state-of-the-art published results on both tasks. Our analysis shows the significance of exploiting relevance for relational reasoning for cross-modality QA.

5.2 Cross-Modality Relevance
Figure 5.2 shows our proposed Cross-Modality Relevance (CMR) architecture. As an end-to-end model, it encodes the relevance between the components of the input modalities under task-specific supervision. We further add a high-order relevance between relations that occur in each modality. This architecture can help to solve tasks that need reasoning over two modalities based on their relevance. In this section, we first formulate the problem. Then we explain each component of the CMR model, the loss function, and the training procedure of CMR in detail.

[Figure 5.2 depicts the CMR pipeline: raw text and image inputs, single-modality Transformers (BERT for text, Faster-RCNN features followed by a visual Transformer), a cross-modality Transformer producing entity representations, the entity relevance affinity matrix with its CNN, the relational relevance affinity matrix with its CNN, and the task-specific output classifier.]
Figure 5.2 Cross-Modality Relevance model is composed of the single-modality transformer, cross-modality transformer, entity relevance, and high-order relational relevance, followed by a task-specific classifier.

5.2.1 Problem Formulation
Formally, the problem is to model a mapping from a cross-modality data sample $D = \{D^\mu\}$ to an output $y$ in a target task, where $\mu$ denotes the type of modality, and $D^\mu = \{d^\mu_1, \cdots, d^\mu_{N^\mu}\}$ is the set of entities in modality $\mu$. In visual question answering (VQA), the task is to predict an answer given two modalities, that is, a textual question ($D^t$) and a visual image ($D^v$). In NLVR, given a textual statement ($D^t$) and an image ($D^v$), the task is to determine the correctness of the textual statement.

5.2.2 Representation Alignment
Single Modality Representations. For the textual modality $D^t$, we utilize BERT [24], as shown in the bottom-left part of Figure 5.2, which is a multi-layer Transformer [111] with three different inputs: token embeddings [121], segment embeddings, and position embeddings. We refer to all the words as the entities in the textual modality and use the BERT representations as the textual single-modality representations $s^t_1, \cdots, s^t_{N^t}$. We assume to have $N^t$ words as textual entities. For the visual modality $D^v$, as shown in the top-left part of Figure 5.2, Faster-RCNN [88] is used to generate regions of interest (ROIs), extract dense encoding representations of the ROIs, and predict the class of each ROI. We refer to the ROIs on images as the visual entities $d^v_1, \cdots, d^v_{N^v}$. We consider a fixed number, $N^v$, of visual entities with the highest probabilities predicted by Faster-RCNN each time. The dense representation of each ROI is a local latent representation, a 2048-dimensional vector [88].
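As a minimal, hedged sketch of how the two sets of single-modality entity features could be assembled, the snippet below obtains token-level textual representations from a pre-trained BERT (via the HuggingFace transformers library) and uses a random tensor as a stand-in for the precomputed 2048-dimensional Faster-RCNN ROI features; the final linear layer stands in for the feed-forward projection described in the next paragraph. The model names and shapes are illustrative assumptions, not the exact CMR implementation.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

question = "Where is the child sitting?"
enc = tokenizer(question, return_tensors="pt")
with torch.no_grad():
    textual_entities = bert(**enc).last_hidden_state[0]   # (N_t, 768) token-level representations

num_rois = 36                                              # fixed number of ROIs per image
visual_roi_features = torch.randn(num_rois, 2048)           # placeholder for Faster-RCNN ROI features
project = torch.nn.Linear(2048, 768)                        # map visual features toward the shared dimension d
visual_entities = project(visual_roi_features)               # (N_v, 768)
```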
To enrich the visual entity representation with the visual context, we further project the vectors with feed-forward layers and encode them with a single-modality Transformer, as shown in the second column of Figure 5.2. The visual Transformer takes the dense representation, segment embedding, and bounding-box positional embedding [104] as input and generates the single-modality representations $s^v_1, \cdots, s^v_{N^v}$. In case there are multiple images, for example, the NLVR data (NLVR2) has two images in each example, each image is encoded by the same procedure, and we keep $N^v$ visual entities per image. We restrict all the single-modality representations to vectors of the same dimension $d$. However, these single-modality representations still need to be aligned.
Cross-Modality Alignment. To align the single-modality representations in a unified representation space, we introduce a cross-modality Transformer, as shown in the third column of Figure 5.2. All the entities are treated uniformly in the cross-modality Transformer. Given the set of entity representations from all modalities, we define the matrix with all the elements in the set, $S = [s^t_1, \cdots, s^t_{N^t}, s^v_1, \cdots, s^v_{N^v}] \in \mathbb{R}^{d \times (N^t + N^v)}$. Each cross-modality self-attention calculation is computed as follows [111]:
$$\mathrm{Attention}(K, Q, V) = \mathrm{softmax}\left(\frac{K^\top Q}{\sqrt{d}}\right) V, \quad (5.1)$$
where, in our case, the key $K$, query $Q$, and value $V$ are all the same tensor $S$. (Note that we keep the usual notation of the attention mechanism for this equation; the notation might be overloaded in other parts of this chapter.) A cross-modality Transformer layer consists of a cross-modality self-attention representation followed by a residual connection with normalization from the input representation, a feed-forward layer, and another residual connection with normalization. We stack several cross-modality Transformer layers to get a uniform representation over all modalities. We refer to the resulting unified representations as the entity representations and denote the set of entity representations of all the entities as $s'^t_1, \cdots, s'^t_{N^t}, s'^v_1, \cdots, s'^v_{N^v}$. Although the representations are still organized by their original modalities per entity, they carry the information from the interactions with the other modality and are aligned in a uniform representation space. The entity representations, shown as the fourth column in Figure 5.2, alleviate the gap between representations from different modalities, as we will show in the ablation studies, and allow them to be matched in the following steps.
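To make Equation 5.1 concrete, the following is a minimal PyTorch sketch of a single scaled dot-product self-attention step over the concatenated entity matrix $S$. It omits multi-head attention, residual connections, normalization, and the feed-forward sub-layer, and the entity counts and dimension are only illustrative.

```python
import torch
import torch.nn.functional as F

def cross_modality_self_attention(S: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product self-attention of Eq. (5.1) applied to the
    concatenated entity matrix S of shape (d, N_t + N_v), where the
    key, query, and value are all S itself."""
    d = S.size(0)
    scores = S.transpose(0, 1) @ S / d ** 0.5      # (N, N) pairwise entity scores
    weights = F.softmax(scores, dim=-1)            # attention weights over the other entities
    return S @ weights.transpose(0, 1)             # re-weighted entity representations, (d, N)

# Toy usage: 20 textual + 36 visual entities with d = 768, as in Section 5.3.2.
s_text, s_vis = torch.randn(768, 20), torch.randn(768, 36)
S = torch.cat([s_text, s_vis], dim=1)
S_attended = cross_modality_self_attention(S)      # same shape as S
```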
5.2.3 Entity Relevance
Exploiting relevance, independent of the input representation, plays a critical role in the reasoning ability required in many tasks such as information retrieval and visual question answering. To consider the entity relevance between two modalities $D^\mu$ and $D^\nu$, the entity relevance representation is calculated as shown in Figure 5.2. Given the entity representation matrices $S'^\mu = [s'^\mu_1, \cdots, s'^\mu_{N^\mu}] \in \mathbb{R}^{d \times N^\mu}$ and $S'^\nu = [s'^\nu_1, \cdots, s'^\nu_{N^\nu}] \in \mathbb{R}^{d \times N^\nu}$, the relevance representation is calculated by
$$A^{\mu,\nu} = (S'^\mu)^\top S'^\nu, \quad (5.2a)$$
$$\mathcal{M}(D^\mu, D^\nu) = \mathrm{CNN}^{D^\mu, D^\nu}(A^{\mu,\nu}), \quad (5.2b)$$
where $A^{\mu,\nu}$ is the affinity matrix of the two modalities, as shown on the right side of Figure 5.2, and $A^{\mu,\nu}_{ij}$ is the relevance score of the $i$-th entity in $D^\mu$ and the $j$-th entity in $D^\nu$. $\mathrm{CNN}^{D^\mu,D^\nu}(\cdot)$ is a Convolutional Neural Network, corresponding to the sixth column of Figure 5.2, which contains several convolutional layers and fully connected layers. Each convolutional layer is followed by a max-pooling layer. Fully connected layers finally map the flattened feature maps to a $d$-dimensional vector. We refer to $e^{D^\mu, D^\nu} = \mathcal{M}(D^\mu, D^\nu)$ as the entity relevance representation between $\mu$ and $\nu$.
We compute the relevance between the different modalities. For the modalities considered in this chapter, when there are multiple images in the visual modality, we calculate the relevance representation between them too. In particular, for the VQA benchmark, the above setting results in one entity relevance representation: a textual-visual entity relevance $e^{D^t, D^v}$. For the NLVR2 benchmark, there are three entity relevance representations: two textual-visual entity relevances $e^{D^t, D^{v_1}}$ and $e^{D^t, D^{v_2}}$, and a visual-visual entity relevance $e^{D^{v_1}, D^{v_2}}$ between the two images. Entity relevance representations will be flattened and joined with other features in the next layer of the network.
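A minimal sketch of Equations 5.2a and 5.2b is given below: the affinity matrix between two entity sets is fed to a small CNN with max-pooling followed by fully connected layers. The kernel sizes and channel counts are illustrative assumptions; the chapter only specifies two convolutional layers, each followed by max-pooling, and fully connected layers.

```python
import torch
import torch.nn as nn

class EntityRelevance(nn.Module):
    """Affinity matrix between two entity sets (Eq. 5.2a) followed by a small
    CNN + MLP head that maps it to a d-dimensional relevance vector (Eq. 5.2b)."""
    def __init__(self, n_mu: int, n_nu: int, d: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        flat = 16 * (n_mu // 4) * (n_nu // 4)
        self.fc = nn.Sequential(nn.Linear(flat, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, s_mu: torch.Tensor, s_nu: torch.Tensor) -> torch.Tensor:
        # s_mu: (d, N_mu), s_nu: (d, N_nu) entity representations from the cross-modality Transformer
        affinity = s_mu.transpose(0, 1) @ s_nu                   # (N_mu, N_nu) pairwise relevance scores
        feat = self.conv(affinity[None, None])                   # add batch and channel dimensions
        return self.fc(feat.flatten(start_dim=1)).squeeze(0)     # d-dimensional relevance representation

# Toy usage with 20 textual and 36 visual entities, d = 768.
rel = EntityRelevance(20, 36, 768)
e_tv = rel(torch.randn(768, 20), torch.randn(768, 36))           # shape: (768,)
```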
5.2.4 Relational Relevance
We also consider the relevance beyond entities, that is, the relational relevance of the entities' relations. This extension allows our CMR to capture higher-order relational relevance between different modalities. We consider pair-wise relations between entities in each modality and calculate the relevance of the relations across modalities. The procedure is similar to entity relevance, as shown in Figure 5.2. We denote the relational representation as a non-linear mapping $\mathbb{R}^{2d} \rightarrow \mathbb{R}^d$ modeled by fully-connected layers applied to the concatenation of the representations of the entities in the relation:
$$r^\mu_{(i,j)} = \mathrm{MLP}^{\mu,1}([s'^\mu_i, s'^\mu_j]) \in \mathbb{R}^d. \quad (5.3)$$
The relational relevance affinity matrix can be calculated by matching the relational representations, $\{r^\mu_{(i,j)}, \forall i < j\}$, from the different modalities. However, there are $C^2_{N^\mu}$ possible pairs in each modality $D^\mu$, most of which are irrelevant. The relational relevance representations will be sparse because of the irrelevant pairs on both sides. Computing the relevance score of all possible pairs would introduce a large number of unnecessary parameters, which makes the training more difficult.
We propose to rank the relation candidates (i.e., pairs) by the intra-modality relevance score and the inter-modality importance. Then we compare the top-$K$ ranked relation candidates between the two modalities, as shown in Figure 5.3. For the intra-modality relevance score, shown in the bottom-left part of the figure, we estimate a normalized score based on the relational representation with a softmax layer:
$$U^\mu_{(i,j)} = \frac{\exp\left(\mathrm{MLP}^{\mu,2}(r^\mu_{(i,j)})\right)}{\sum_{k<l} \exp\left(\mathrm{MLP}^{\mu,2}(r^\mu_{(k,l)})\right)}. \quad (5.4)$$
To evaluate the inter-modality importance of a relation candidate, which is a pair of entities in the same modality, we first compute the relevance of each entity in the text with respect to the visual objects. As shown in Figure 5.3, we compute a vector whose $i$-th element is the relevance of word $i$ to its most relevant visual object, and we denote this importance vector as $v^t$.

[Figure 5.3 depicts how the top-K relations of each modality are selected from the entity relevance affinity matrix, the candidate relation scores, and the intra-modality ranking scores, and then compared across modalities.]
Figure 5.3 Relational Relevance is the relevance of top-K relations in terms of intra-modality relevance score and inter-modality importance.

This helps to focus on words that are grounded in the visual modality. We use the same procedure to compute the most relevant words for each visual object. Then we calculate the relation candidates' importance matrix $V^\mu$ by an outer product, $\otimes$, of the importance vectors as follows:
$$v^\mu_i = \max_j A^{\mu,\nu}_{ij}, \quad (5.5a)$$
$$V^\mu = v^\mu \otimes v^\mu, \quad (5.5b)$$
where $v^\mu_i$ is the $i$-th scalar element in $v^\mu$ that corresponds to the $i$-th entity, and $A^{\mu,\nu}$ is the affinity matrix calculated by Equation 5.2a. Notice that the inter-modality importance $V^\mu$ is symmetric. The upper triangular part of $V^\mu$, excluding the diagonal, indicates the importance of the elements with the same index in the intra-modality relevance scores $U^\mu$. The ranking score for the candidates is the combination (here, the product) of the two scores, $W^\mu_{(i,j)} = U^\mu_{(i,j)} \times V^\mu_{ij}$. We select the set of top-$K$ ranked candidate relations $\mathcal{K}^\mu = \{\kappa_1, \kappa_2, \cdots, \kappa_K\}$ and reorganize the representations of the top-$K$ relations as $R^\mu = [r^\mu_{\kappa_1}, \cdots, r^\mu_{\kappa_K}] \in \mathbb{R}^{d \times K}$. The relational relevance representation between $\mathcal{K}^\mu$ and $\mathcal{K}^\nu$ can be calculated similarly to the entity relevance representations shown in Figure 5.2:
$$\mathcal{M}(\mathcal{K}^\mu, \mathcal{K}^\nu) = \mathrm{CNN}^{\mathcal{K}^\mu, \mathcal{K}^\nu}\left((R^\mu)^\top R^\nu\right). \quad (5.6)$$
$\mathcal{M}(\mathcal{K}^\mu, \mathcal{K}^\nu)$ has its own parameters and results in a $d$-dimensional feature $e^{\mathcal{K}^\mu, \mathcal{K}^\nu}$.
In particular, for the VQA task, the above setting results in one relational relevance representation: a textual-visual relevance $\mathcal{M}(\mathcal{K}^t, \mathcal{K}^v)$. For the NLVR task, there are three relational relevance representations: two textual-visual relational relevances $\mathcal{M}(\mathcal{K}^t, \mathcal{K}^{v_1})$ and $\mathcal{M}(\mathcal{K}^t, \mathcal{K}^{v_2})$, and a visual-visual relational relevance $\mathcal{M}(\mathcal{K}^{v_1}, \mathcal{K}^{v_2})$ between the two images. Relational relevance representations will be flattened and joined with other features in the next layers of the network.
After acquiring all the entity and relational relevance representations, namely $e^{D^\mu, D^\nu}$ and $e^{\mathcal{K}^\mu, \mathcal{K}^\nu}$, we concatenate them and use the result as the final feature $F = [e^{D^\mu, D^\nu}, \cdots, e^{\mathcal{K}^\mu, \mathcal{K}^\nu}, \cdots]$. A task-specific classifier $\mathrm{MLP}(F)$ predicts the output of the target task, as shown in the right-most column of Figure 5.2.

5.2.5 Training
In the CMR architecture, we predict the output $y$ of a specific task from the final feature $F$ with a classification function. The gradient of the loss function is back-propagated to all the components in CMR to penalize the prediction and adjust the parameters. We freeze the parameters of BERT for the textual modality and Faster-RCNN for the visual modality. The parameters of the following parts are updated by gradient descent: the single-modality Transformers, the cross-modality Transformers, $\mathrm{CNN}^{D^\mu,D^\nu}(\cdot)$, $\mathrm{CNN}^{\mathcal{K}^\mu,\mathcal{K}^\nu}(\cdot)$, $\mathrm{MLP}^{\mu,1}(\cdot)$, and $\mathrm{MLP}^{\mu,2}(\cdot)$ for all modalities and modality pairs, and the task-specific classifier $\mathrm{MLP}(F)$. The VQA task can be formulated as a multi-class classification that chooses a word to answer the question. We apply a softmax classifier on $F$ and penalize it with the cross-entropy loss. For the NLVR2 dataset, the task is a binary classification that determines whether the statement is correct with regard to the images. We apply a binary classifier on $F$ and penalize it with the cross-entropy loss.
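The following is a minimal sketch of the optimization setup just described, using the optimizer settings that appear later in Section 5.3.2. The `cmr`, `cmr.bert`, and `cmr.faster_rcnn` names are hypothetical placeholders for the CMR model and its frozen feature extractors; only the freezing policy, the loss, and the hyperparameters reflect the text.

```python
import torch
import torch.nn as nn

def build_optimizer(cmr: nn.Module) -> torch.optim.Adam:
    # Freeze the pre-trained feature extractors (BERT and Faster-RCNN).
    for p in cmr.bert.parameters():
        p.requires_grad = False
    for p in cmr.faster_rcnn.parameters():
        p.requires_grad = False
    trainable = [p for p in cmr.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=1e-4, betas=(0.9, 0.999),
                            eps=1e-6, weight_decay=0.01)

def training_step(cmr, optimizer, batch):
    logits = cmr(batch)                                   # task-specific classifier applied to the final feature F
    loss = nn.functional.cross_entropy(logits, batch["label"])
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(cmr.parameters(), max_norm=1.0)   # max gradient norm clip of 1.0
    optimizer.step()
    return loss.item()
```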
5.3 Experiments
In this section, we introduce the datasets, experimental settings, and results compared to state-of-the-art published works.

5.3.1 Dataset Description
NLVR2 [101] is a dataset that targets joint reasoning about natural language descriptions and related images. Given a textual statement and a pair of images, the task is to indicate whether the statement correctly describes the two images. NLVR2 contains 107,292 examples of sentences paired with images and is designed to emphasize semantic diversity, compositionality, and visual reasoning challenges.
VQA v2.0 [36] is an extended version of the VQA dataset. It contains 204,721 images from MS COCO [62], paired with 1,105,904 free-form, open-ended natural language questions and answers. These questions are divided into three categories: Yes/No, Number, and Other.

5.3.2 Implementation Details
We implemented CMR using PyTorch. We use 768-dimensional single-modality representations. For the textual modality, the pre-trained BERT base model [24] is used to generate the single-modality representation. For the visual modality, we use Faster-RCNN pre-trained by BUTD [3], followed by a five-layer Transformer. Parameters in BERT and Faster-RCNN are fixed. For each example, we keep 20 words as textual entities and 36 ROIs per image as visual entities. For relational relevance, the top-10 ranked pairs are used. For each relevance CNN, $\mathrm{CNN}^{D^\mu,D^\nu}(\cdot)$ and $\mathrm{CNN}^{\mathcal{K}^\mu,\mathcal{K}^\nu}(\cdot)$, we use two convolutional layers, each followed by max-pooling, and fully connected layers. For the relational representations and their intra-modality relevance scores, $\mathrm{MLP}^{\mu,1}(\cdot)$ and $\mathrm{MLP}^{\mu,2}(\cdot)$, we use one hidden layer each. The task-specific classifier $\mathrm{MLP}(F)$ contains three hidden layers. The model is optimized using the Adam optimizer with $\alpha = 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-6}$. The model is trained with a weight decay of 0.01, a max gradient normalization clip of 1.0, and a batch size of 32.

5.3.3 Baseline Description
We briefly describe four recent SOTA baselines. The first two baselines use non-Transformer neural models as the backbone, while the other two baselines use Transformer-based architectures.
Compositional Attention Network (MAC) [45] is a fully differentiable neural network that aims to facilitate machine reasoning. The model designs explicit and structured reasoning with a new recurrent Memory, Attention, and Composition cell.
Feature-wise Linear Modulation (FiLM) [80] is a strong baseline on visual reasoning tasks. In the FiLM model, each layer influences the neural network computation via a feature-wise affine transformation based on conditioning information.
VisualBERT [59] is an end-to-end model for language and vision tasks, consisting of Transformer layers that align textual and visual representations with self-attention. VisualBERT and CMR have a similar cross-modality alignment approach. However, VisualBERT only uses the Transformer representations, while CMR uses the relevance representations.
LXMERT [104] aims to learn cross-modality encoder representations from Transformers. It pre-trains the model with a set of tasks and fine-tunes it on another set of specific tasks. LXMERT is the currently published state-of-the-art on both NLVR2 and VQA v2.0.

Models       Dev%  Test%
N2NMN        51.0  51.1
MAC-Network  50.8  51.4
FiLM         51.0  52.1
CNN+RNN      53.4  52.4
VisualBERT   67.4  67.0
LXMERT       74.9  74.5
CMR          75.4  75.3
Table 5.1 Accuracy on NLVR2.

             Dev%     Test Standard%
Model        Overall  Y/N    Num    Other  Overall
BUTD         65.32    81.82  44.21  56.05  65.67
ReGAT        70.27    86.08  54.42  60.33  70.58
ViLBERT      70.55    -      -      -      70.92
VisualBERT   70.80    -      -      -      71.00
BAN          71.4     87.22  54.37  62.45  71.84
VL-BERT      71.79    87.94  54.75  62.54  72.22
LXMERT       72.5     87.97  54.94  63.13  72.54
CMR          72.58    88.14  54.71  63.16  72.60
Table 5.2 Accuracy on VQA v2.0.

5.4 Results and Discussion
5.4.1 Result Comparison
NLVR2: The results of the NLVR task are listed in Table 5.1.
Transformer-based models (VisualBERT, LXMERT, and CMR) outperform the other models (N2NMN [42], MAC [45], and FiLM [80]) by a large margin. This is due to the strong pre-trained single-modality representations and the Transformers' ability to learn the representations. Furthermore, CMR shows the best performance compared to all Transformer-based baseline methods and achieves the state-of-the-art. VisualBERT and CMR have similar cross-modality alignment approaches. CMR outperforms VisualBERT by 12.4%. The gain mainly comes from the entity relevance and relational relevance that model the relations.

Textual  Visual  Cross  Dev%  Test%
12       3       3      74.1  74.4
12       4       4      74.9  74.7
12       5       5      75.4  75.3
12       6       6      75.5  75.1
Table 5.3 Accuracy on NLVR2 of CMR with various Transformer sizes. The numbers in the left part of the table indicate the number of self-attention layers.

Models                               Dev%  Test%
CMR                                  75.4  75.3
without Single-Modality Transformer  68.2  68.5
without Cross-Modality Transformer   59.7  59.1
without Entity Relevance             70.6  71.2
without Relational Relevance         73.0  73.4
Table 5.4 Test accuracy of different variations of CMR on NLVR2.

VQA v2.0: In Table 5.2, we show the comparison with published models, excluding the ensemble ones. Most competitive models are based on Transformers (ViLBERT [66], VisualBERT [59], VL-BERT [99], LXMERT [104], and CMR). BUTD [3, 108], ReGAT [58], and BAN [49] also employ an attention mechanism for a relation-aware model. The proposed CMR achieves the best test accuracy on Y/N questions and Other questions. However, CMR does not achieve the best performance on Number questions. This is because Number questions require the ability to count within one modality, while CMR focuses on modeling relations between modalities. Performance on counting might be improved by explicit modeling of quantity representations. CMR also achieves the best overall accuracy. In particular, we can see a 2.3% improvement over VisualBERT [59], consistent with the above-mentioned NLVR2 results. This shows the significance of the entity and relational relevance.

5.4.2 Model Analysis
To better understand the influence of each part of CMR, we perform an ablation study. Table 5.4 shows the performance of four variations on NLVR2.

[Figure 5.4 visualizes the entity affinity matrix for the sentence "The bird on the branch is looking to left" paired with an image.]
Figure 5.4 The entity affinity matrix between textual (rows) and visual (columns) modalities. The darker color indicates a higher relevance score. The ROIs with a maximum relevance score for each word are shown paired with the words.

Effect of Single Modality Transformer. In the first ablation study, we remove both the textual and visual single-modality Transformers and instead map the raw input to the $d$-dimensional space with a linear transformation. Notice that the raw input of the textual modality is the WordPiece [121] embeddings, segment embeddings, and position embeddings of each word, while that of the visual modality is the 2048-dimensional dense representation of each ROI extracted by Faster-RCNN. It turns out that removing the single-modality Transformers decreases the accuracy by 9.0%. Single-modality Transformers play a critical role in producing a strong contextualized representation for each modality.
Effect of Cross-Modality Transformer. We remove the cross-modality Transformer and use the single-modality representations as entity representations. As shown in Table 5.4, the model degenerates dramatically, and the accuracy decreases by 16.2%.
The huge accuracy gap demonstrates the unparalleled contribution of the cross-modality Transformer to aligning representations from the input modalities.

Figure 5.5 The relation ranking score of two example sentences. The darker color indicates a higher ranking score.

Effect of Entity Relevance. We remove the entity relevance representation $e^{D^\mu,D^\nu}$ from the final feature $F$. As shown in Table 5.4, the test accuracy is reduced by 5.4%. This is a significant difference in performance among Transformer-based models [59, 66, 104]. To highlight the significance of entity relevance, we visualize an example affinity matrix in Figure 5.4. The two major entities, "bird" and "branch", are matched perfectly. More interestingly, the three ROIs matching the phrase "looking to left" capture an indicator (the beak), a direction (left), and the semantics of the whole phrase.
Effect of Relational Relevance. We remove the relational relevance representation $e^{\mathcal{K}^\mu,\mathcal{K}^\nu}$ from the final feature $F$. A 2.5% decrease in test accuracy is observed in Table 5.4. We argue that CMR models high-order relations, which are not captured by entity relevance, by modeling relational relevance. We present two examples of textual relation ranking scores in Figure 5.5. The learned ranking scores highlight the important pairs, for example "gold - top" and "looking - left", which describe the important relations in the textual modality.

5.4.3 Qualitative Analysis
To investigate the influence of model size, we empirically evaluated CMR on NLVR2 with various Transformer sizes, which contain most of the parameters of the model. All other details are kept the same as the descriptions in Section 5.3.2. The textual Transformer remains at 12 layers because it is the pre-trained BERT. Our model contains 285M parameters. Among these, around 230M parameters belong to the pre-trained BERT and the Transformers. Table 5.3 shows the results. As we increase the number of layers in the visual Transformer and the cross-modality Transformer, the accuracy tends to improve. However, the performance becomes stable when there are more than five layers. We choose five layers for the visual Transformer and the cross-modality Transformer in the other experiments.

5.5 Summary
In this chapter, we propose a novel cross-modality relevance (CMR) approach for language and vision reasoning. In particular, we claim the significance of the relevance between the components of the two modalities for reasoning, which includes entity relevance and relational relevance. We propose an end-to-end Cross-Modality Relevance (CMR) architecture that is tailored for language and vision reasoning. We evaluate the proposed CMR on the NLVR and VQA tasks. Our approach exceeds the state-of-the-art on the NLVR2 and VQA v2.0 datasets. The experiments and the empirical analysis demonstrate CMR's capability of modeling relational relevance. This result indicates the significance of exploiting relevance. Our proposed architectural component for exploiting relevance can be used independently from the full CMR architecture and is potentially applicable to other multi-modality tasks.

CHAPTER 6
COMMONSENSE REASONING FOR KNOWLEDGE BASED QA

6.1 Background and Motivation
Large-scale pre-trained language models (LMs) have been shown to cover large amounts of world knowledge and common sense and have achieved success on many QA benchmarks [87, 86, 74, 125]. However, current research shows that LMs have difficulties in answering questions merely based on their implicit knowledge [127, 30].
Therefore, using external sources of knowledge explicitly in the form of knowledge graphs (KGs) is a recent trend in Question Answering [61, 30]. Figure 6.1, taken from the CommonsenseQA benchmark, shows an example for which answering the question requires commonsense reasoning. In this example, the external KG provides the required background information to obtain the reasoning chain from question to answer. We highlight two challenges in this type of QA task: (a) the extracted KG subgraph sometimes misses some edges between entities, which breaks the chain of reasoning; (b) the semantic context of the question and its connection to the answer is not used properly; for example, reasoning when negative terms exist in the question, such as no and not, is problematic.
Challenge (a) is caused by the following reasons. First, the knowledge graph is originally imperfect and does not include the required edges. Second, since the size of knowledge graphs is tremendously large, many models use a subgraph of the KG for each example [61, 31, 127]. However, to reduce the size of the graph, most of the models select the entities that appear in two-hop paths [30]. Consequently, some intermediate concept (entity) nodes and edges are missed in the extracted KG subgraph. In such cases, the subgraph does not contain a complete chain of reasoning. Third, the current models often cannot reason over paths when there is no direct connection between the involved concepts. While finding the chain of reasoning in QA is challenging in general [135], this problem is more critical when the KG is the only source of knowledge and there are missing edges. Looking back at Figure 6.1, the KG subgraph misses the direct connection between guitar and playing instrument (green arrow).

[Figure 6.1 shows the question "The student practiced his guitar often, where is he always spent his free period?" with candidate answers A. music room, B. rock band, C. toy store, D. stage, E. concert, together with the extracted KG subgraph connecting the question node Q and the question entities (guitar, free period) to the candidate answer entities through relations such as IsA, UsedFor, and AtLocation.]
Figure 6.1 An example of the CommonsenseQA benchmark. Given the question node Q, question entity nodes (blue boxes), correct answer entity node (red box), and wrong answer entity nodes (orange boxes), we predict the answer by reasoning over the question and the extracted KG subgraph.

For challenge (b), about considering question semantics, as KagNet [61] points out, previous models are not sensitive to negation words and consequently predict opposite answers. The QA-GNN [127] model is the first work to deal with negative questions. QA-GNN improves the reasoning under negation, to some extent, by adding the QA global node to the graph. However, the challenge still exists.
To solve the above challenges, we propose a novel architecture called Dynamic Relevance Graph Network (DRGN). The motivation of our proposed DRGN is to recover the missing edges and establish direct connections between relevant concepts to facilitate multi-hop reasoning. In particular, the DRGN model uses a relational graph network module while considering the importance of the neighbor nodes using an additional relevance matrix. It can potentially recover the missing edges by establishing direct connections based on the relevancy of the node representations in the KG during training. The module can potentially capture the connections between distant nodes while benefiting from the existing KG edges.
Our proposed model learns representations directly based on the relevance scores between subgraph entity pairs, which are computed by an inner product operation. At each convolutional layer of the graph neural network, we compute the inner product of the nodes based on their current layer's node representations dynamically, build the neighborhoods based on this relevance measure, and form a relevance matrix accordingly. This can be seen as a way to learn new edges as training goes forward in each layer while influencing the weights of the neighbors dynamically based on their relevance. As shown in Figure 6.1, the relevance score between guitar and playing instrument is stronger than for other nodes in the subgraph. Moreover, since the graph includes the question node, the relevance between the question node and the entity nodes is computed at every layer, making use of the contextual information more effectively. It becomes more evident that the student will spend the free period in the music room rather than the concert.
In summary, the contributions of this work are as follows: (1) The proposed DRGN architecture exploits the existing edges in the KG subgraph while explicitly using the relevance between the nodes to establish direct connections and recover the possibly missing edges dynamically. This technique helps in capturing the reasoning path in the KG for answering the question. (2) Our model exploits the relevance between the question and the graph entities, which helps consider the semantics of the question explicitly in the graph reasoning and boosts the performance. In particular, it improves dealing with negation. (3) Our proposed model obtains competitive results on both the CommonsenseQA and OpenbookQA benchmarks. Our analysis demonstrates the significance and effectiveness of the DRGN model.

6.2 Dynamic Relevance Graph Network
In this section, we first define the problem formally. Then we explain each component of our proposed model, the loss function, and the training procedure in detail.

6.2.1 Problem Formulation
The task of QA over pure knowledge is to choose a correct answer $a^{ans}$ from a set of $N$ candidate answers $\{a_1, a_2, ..., a_N\}$ given an input question $q$ and an external knowledge graph (KG). Since knowledge graphs are often huge, as a part of the solution, we only consider a subgraph of the KG as input for each example. A subgraph is selected for each example based on a previously proposed approach [31], which constructs a subgraph from the KG that contains the entities mentioned in the question and answer choices.

[Figure 6.2 depicts the DRGN pipeline: the language context encoder over the question and candidate answer, the KG subgraph with the added question node, the graph neural network layers with the dynamic relevance matrices, and the answer prediction MLP.]
Figure 6.2 Our proposed DRGN model is composed of the Language Context Encoder module, KG Subgraph Construction module, Graph Neural Network module, and Answer Prediction module. The blue color entity nodes represent the entities mentioned in the question. The yellow color node represents the answer node. The red color node is the question node. We use different colors to draw dynamic relevance matrices 1 and 2 because the relevance matrix changes dynamically in each graph neural layer.

6.2.2 Model Description
Figure 6.2 shows the proposed Dynamic Relevance Graph Network (DRGN) architecture.
Our DRGN includes four modules: the Language Context Encoder module, the KG Subgraph Construction module, the Graph Neural Network module, and the Answer Prediction module. In this section, we describe the details of our approach and the way we train our model efficiently.

6.2.3 Language Context Encoder
For the given question $q$ and each candidate answer $a_i$, we concatenate them to form the Language Context $L$:
$$L = [[CLS]; q; [SEP]; a_i], \quad (6.1)$$
where [CLS] and [SEP] are the special tokens used by large-scale pre-trained Language Models (LMs). We feed the input $L$ to a pre-trained LM encoder to obtain token representations, denoted as $h^L \in \mathbb{R}^{|L| \times d}$, where $|L|$ represents the length of the sequence. Then we use the [CLS] representation, denoted as $h^{[CLS]} \in \mathbb{R}^d$, as the representation of $L$.

6.2.4 KG Subgraph Construction
We use ConceptNet [97], a general-domain knowledge graph, as the commonsense KG. The ConceptNet graph has multiple semantic relational edges, e.g., HasProperty, IsA, AtLocation, etc. We follow the MHGRN [30] research work to construct the subgraphs from the KG for each example. The subgraph entities are selected by exact match between n-gram tokens and ConceptNet concepts using some normalization rules. Then another set of entities is added to the subgraph by following the KG paths of two hops of reasoning based on the current entities in the subgraph. Furthermore, we add the semantic context of the question as a separate node to the subgraph. This node provides additional question context to the KG subgraph, $G^{sub}$, as suggested by QA-GNN [127]. We link the question node to the entity nodes mentioned in the question. The semantic context of the question node $Q$ is initialized by the [CLS] representation described in Section 6.2.3. The initial representations of the other entities are derived by applying RoBERTa and pooling over their contained tokens [30].

6.2.5 Graph Neural Network Module
The basis of our learning representation is the Multi-relational Graph Convolutional Network (R-GCN) [90]. R-GCN is an extension of GCN that operates on a graph with multi-relational edges between nodes. In our case, the relation types between entities are taken from the 17 semantic relations of ConceptNet. Meanwhile, an additional type is added to represent the relationship between the question node and the question entities, making the graph structure different from previous works. We denote the set of relations as $R$.
Our Dynamic Relevance Graph Network (DRGN) architecture is a variation of the R-GCN model. To establish direct connections between the graph nodes and re-scale the importance of the neighbors, we compute the relevance score between the nodes dynamically at each graph layer based on their current learned representations. Then we build the neighborhoods based on this relevance measure and form a relevance matrix, $M_{rel}$, accordingly. This can be seen as a way to learn new edges based on the relevance of the nodes as training goes forward in each graph layer. We use the inner product to compute the relevance matrix:
$$M_{rel}^{(l)} = h^{(l)\top} h^{(l)} \in \mathbb{R}^{(|V|+1) \times (|V|+1)}, \quad (6.2)$$
where $|V|$ is the number of graph entity nodes, and 1 is added due to using the question node in the graph. The relevance matrix re-scales the weights and influences the way the neighborhood nodes' representations are aggregated in the R-GCN model. $M_{rel}$ is computed dynamically, and the relevance scores change while the representations are computed at each graph layer.
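As a minimal sketch of Equation 6.2, the relevance matrix is simply the Gram matrix of the current node representations, recomputed after every graph layer. The node count and dimension below are illustrative assumptions.

```python
import torch

def dynamic_relevance_matrix(h: torch.Tensor) -> torch.Tensor:
    """Eq. (6.2): inner-product relevance between all pairs of node
    representations at the current graph layer.
    h: node representation matrix of shape (|V| + 1, d); the extra row
    is the question node."""
    return h @ h.transpose(0, 1)          # shape: (|V| + 1, |V| + 1)

# Toy usage: a subgraph with 30 entity nodes plus the question node, d = 200 (hypothetical).
h_l = torch.randn(31, 200)
m_rel = dynamic_relevance_matrix(h_l)
# m_rel[i, j] re-scales the message from node j to node i in the update of Eq. (6.3),
# and m_rel is recomputed from the updated representations after every layer.
```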
In our proposed relational graph, the forward-pass message-passing update of a node representation $h_i$ is calculated as follows:
$$h_i^{(l+1)} = \sigma\Big(\sum_{r \in R}\sum_{j \in N_i^r} \frac{1}{d_{i,r}} W_r^{(l)} \cdot \big(M_{rel}^{(l)}[i,j]\, h_j^{(l)}\big) + W_0^{(l)} \cdot \big(M_{rel}^{(l)}[i,i]\, h_i^{(l)}\big)\Big) \in \mathbb{R}^d, \quad (6.3)$$
where $N_i^r$ represents the neighbor nodes of node $i$ under relation $r$, $r \in R$, $\sigma$ is the activation function, $d_{i,r}$ is a normalization constant, and $W_r$ denotes the learnable parameters. Besides, we calculate the updated question node representation as follows:
$$h_Q^{(l+1)} = \sigma\Big(\sum_{j \in N_Q} W_Q^{(l)} \cdot f_c\big([h_Q^{(l)}; M_{rel}^{(l)}[Q,j]\, h_j^{(l)}]\big) + W_0^{(l)} \cdot \big(M_{rel}^{(l)}[Q,Q]\, h_Q^{(l)}\big)\Big) \in \mathbb{R}^d, \quad (6.4)$$
where $f_c$ is a two-layer MLP and $h_Q$ is the question node representation. Finally, we stack the node representations to form $h'^{(l+1)}$:
$$h'^{(l+1)} = [h_0^{(l+1)}; h_1^{(l+1)}; \cdots; h_{|V|}^{(l+1)}; h_Q^{(l+1)}] \in \mathbb{R}^{(|V|+1) \times d}. \quad (6.5)$$
We then compute the $(l+1)$-th layer's dynamic relevance matrix $M_{rel}^{(l+1)}$, which holds the relevance scores of the node representations. Finally, we multiply $M_{rel}^{(l+1)}$ with the node representation matrix $h'^{(l+1)}$, which helps the node representations learn the weights of the edges based on the learned relevance and, specifically, includes the additional relevance edges between the nodes during the message passing, as follows:
$$h^{(l+1)} = \sigma\big(M_{rel}^{(l+1)} \cdot h'^{(l+1)} \cdot W_g\big) \in \mathbb{R}^{(|V|+1) \times d}, \quad (6.6)$$
where $W_g$ denotes the learnable parameters.

6.2.6 Answer Prediction
Given the Language Context $L$ and the KG subgraph, we use the information from the language representation $h^{[CLS]}$, the question node representation $h_Q$ learned from the KG subgraph, and the KG subgraph representation pooled from the last graph layer, $pool(h^{G^{sub}})$, to calculate the scores of the candidate answers as follows:
$$p(a \mid L, G^{sub}) = f_{out}\big([h^{[CLS]}; h_Q; pool(h^{G^{sub}})]\big), \quad (6.7)$$
where $f_{out}$ is a two-layer MLP. Finally, we choose the highest-scored answer from the $N$ candidate answers as the prediction output. We use the cross-entropy loss to optimize the end-to-end model.

6.3 Experiments
6.3.1 Dataset Description
We evaluate our model on two different QA benchmarks, CommonsenseQA [103] and OpenbookQA [73]. Both benchmarks come with an external knowledge graph; we use ConceptNet as the external knowledge graph for both benchmarks.
CommonsenseQA is a QA dataset that requires human commonsense reasoning capacity to answer the questions. Each question in CommonsenseQA has five candidate answers without any extra information. The dataset consists of 12,102 questions.
OpenbookQA is a multiple-choice QA dataset that requires reasoning with commonsense knowledge. The OpenbookQA benchmark is a well-defined subset of science QA [17] that requires finding the chain of commonsense reasoning to answer a question. Each data sample includes the question, scientific facts, and candidate answers. In our experimental setting, the scientific facts are added to the question part. This makes the problem formulation consistent with the CommonsenseQA setting.

6.3.2 Implementation Details
We implemented our DRGN architecture using PyTorch. We use the pre-trained RoBERTa-large [65] to encode the question. We use the cross-entropy loss and the RAdam optimizer [63] to train our end-to-end architecture. The batch size is set to 16, and the maximum text input sequence length is set to 128. Our model uses an early stopping strategy during training. We use a 3-layer graph neural module in our experiments; Section 6.4.2 describes the effect of different numbers of layers. The learning rate for the LM is 1e-5, while the learning rate for the graph module is 1e-3.
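A minimal sketch of the two-learning-rate setup just described is given below. The `drgn`, `encoder`, and `gnn` names are hypothetical placeholders for the model, its RoBERTa-large context encoder, and its graph module (the remaining modules are omitted for brevity); only the hyperparameters come from this section.

```python
import torch

def build_drgn_optimizer(drgn: torch.nn.Module) -> torch.optim.Optimizer:
    param_groups = [
        {"params": drgn.encoder.parameters(), "lr": 1e-5},  # pre-trained LM
        {"params": drgn.gnn.parameters(), "lr": 1e-3},      # graph neural module
    ]
    # torch.optim.RAdam is available in recent PyTorch releases; the RAdam
    # implementation of [63] could be substituted here.
    return torch.optim.RAdam(param_groups)
```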
6.3.3 Baseline Description
We select three SOTA models as our main baselines. One model is KagNet [61], which finds the line of reasoning without using a graph neural module. We use two more models, MHGRN [30] and QA-GNN [127], that use a graph neural module as the backbone to find the line of reasoning over the knowledge graph.
KagNet [61] is a path-based model that models the multi-hop relations by extracting relational paths from the Knowledge Graph and then encoding the paths with an LSTM sequence model.
MHGRN [30]: the Multi-hop Graph Relation Network (MHGRN) is a strong baseline. The MHGRN model applies LMs to encode the question and answer context, uses a GNN encoder to learn graph representations, and chooses the candidate answers using these two encoders.
QA-GNN [127] is the recent SOTA model that uses a working graph to jointly train over the language and the KG subgraph. The model jointly reasons over the question and the KG and jointly updates the representations. QA-GNN uses GAT as the backbone to do message passing on the graph. To learn the semantic edge information, QA-GNN directly adds the edge representation to the local node representation and cannot learn the global structure of the edges, which is inefficient. In contrast, our model uses the global multi-relational adjacency matrices to learn the edge information.

Models         Dev ACC%  Test ACC%
RoBERTa-no KG  69.6%     67.8%
R-GCN          72.6%     68.4%
GconAttn       72.6%     68.5%
KagNet         73.3%     69.2%
RN             73.6%     69.5%
MHGRN          74.4%     71.1%
QA-GNN         76.5%     73.4%
DRGN           78.2%     74.0%
Table 6.1 Dev accuracy and Test accuracy (In-House split) of various models on the CommonsenseQA benchmark, following [61].

[Figure 6.3 compares, for the question "The student practiced his guitar often, where is he always spent his free period?", the reasoning chains selected by our model (left, orange edges) and by the baseline models (right, grey edges) over the extracted KG subgraph.]
Figure 6.3 The complete reasoning chain from the question node to the candidate answer node. The blue nodes are question entity nodes, and the red and green nodes are the candidate answer nodes. The thicker edges indicate a higher relevance score to the neighborhood node, while the thinner edges indicate a lower score. The left side is the reasoning chain selected by our model (orange edges), while the right side is selected by the baseline models (grey edges).

Models                             Dev    Test
RoBERTa-large                      66.7%  64.8%
R-GCN                              65.0%  62.4%
GconAttn                           64.5%  61.9%
RN                                 66.8%  65.2%
MHGRN                              68.1%  66.8%
QA-GNN                             68.9%  67.8%
DRGN                               70.1%  69.6%
AristoRoBERTaV7                    79.2%  77.8%
T5 (3 Billion Parameters)          -      83.2%
UnifiedQA (11 Billion Parameters)  -      87.2%
AristoRoBERTaV7+MHGRN              78.6%  80.6%
AristoRoBERTaV7+QA-GNN             80.4%  82.8%
AristoRoBERTaV7+DRGN               81.8%  84.1%
Table 6.2 Development and Test accuracy of various models on the OpenbookQA benchmark.

6.4 Results and Discussion
6.4.1 Result Comparison
Table 6.1 shows the performance of different models on the CommonsenseQA benchmark. KagNet and MHGRN are two strong baselines. Our model outperforms KagNet by 4.8% and MHGRN by 2.9% on the CommonsenseQA benchmark. This result shows the effectiveness of our DRGN architecture. Table 6.2 shows the performance on the OpenbookQA benchmark.
There are a few recent papers that exploit larger LMs, such as T5 [84], which contains 3 billion parameters (10x larger than our model), and UnifiedQA [48] (32x larger). For a fair comparison, we use the same RoBERTa setting for the input representation when we evaluate on OpenbookQA. Our model's performance would potentially improve with these larger LMs. To demonstrate this point, we ran additional experiments using AristoRoBERTaV7 [18] as the backbone to train our model. Our model achieves better performance when using the larger LM compared to the other baseline models. This performance shows that the more implicit information is learned from pre-trained language models, the more effective the relevance information established between graph nodes becomes. We should note that GREASELM [131] and GSC [115] are the two most recent models that were developed in parallel with our DRGN. GREASELM aims to ground the language context in a commonsense knowledge graph by fusing token representations from pretrained LMs and a GNN over Modality Interaction layers [131]. GSC designs a Graph Soft Counter layer [115] to enhance the graph reasoning capacity. Our results are competitive with the ones reported in those parallel works, while each work emphasizes different contributions.

[Figure 6.4 contrasts the reasoning chains of MHGRN, QA-GNN, and DRGN on the positive question "Why do parents encourage their kids to play baseball?" and its negated variant "Why don't parents encourage their kids to play baseball?", both with candidate answers A. round, B. cheap, C. break window, D. hard, E. fun to play.]
Figure 6.4 The case study of the negation examples. The question in the bottom box includes the negation words. The red colored text represents the gold answer, and the purple colored text represents the wrong answer. In the blue box, each line represents the commonsense reasoning chain of each model.

6.4.2 Model Analysis
In this section, we analyze the effectiveness of our DRGN model in recovering the missing edges and establishing direct connections based on the relevancy of the node representations in the KG.
Effects on Finding the Line of Reasoning: As we described in Section 6.2.4, to keep the graph size small, most of the models construct the KG subgraph by selecting the entities that appear in two-hop paths. Therefore, some intermediate concept nodes and edges are missed in the extracted KG subgraph, and the complete reasoning chain from the question entity node to the candidate answer node cannot be found. For example, as shown in Figure 6.3, the question is "The student practiced his guitar often, where is he always spent his free period?" and the answer is "music room". The reasoning chain includes 2 hops, that is, "guitar → playing instrument → music room".

Models         Test ACC% (Overall)  Test ACC% (questions w/ negation)
RoBERTa-large  68.7%                54.2%
KagNet         69.2%                54.2%
MHGRN          71.1%                54.8%
QA-GNN         73.4%                58.8%
DRGN           75.0%                60.1%
Table 6.3 Performance on questions with negation in the In-house split test of CommonsenseQA.
Since the constructed graph misses the direct edge between "guitar" and "playing instrument", the MHGRN and QA-GNN baselines select the wrong intermediate node and predict the wrong answers "concert" and "rock band" through the grey edges shown in Figure 6.3. In contrast, our DRGN model makes a correct prediction by computing the relevance scores of the nodes based on their learned representations and forming new edges accordingly. As described in Section 6.2.4, our model initializes the entity node representations with large-scale pre-trained language models (LMs). The implicit representations of LMs are learned from huge corpora, so knowledge is implicitly captured. Therefore, these two entities, "guitar" and "playing instrument", start with an implicit connection. By looking at the relevance changes, after several layers of graph encoding, the relevance score between "guitar" and "playing instrument" becomes stronger. In contrast, the relevance score between "guitar" and "concert" becomes weaker because of the contextual information "free period". This is the primary reason why our DRGN model obtains the correct reasoning chain.
Effects on Semantic Context: While the graph has a broad coverage of knowledge, the semantic context of the question and its connection to the answer is not used properly; for example, dealing with negation does not perform well [127]. Since our dynamic relevance matrix includes the semantic context of the question, the relevance between the question and the graph entities is computed at every graph neural layer while considering the negation in the node representations. Intuitively, this should improve the handling of negative questions in our model. To analyze this hypothesis for the DRGN architecture, we compare the performance of various models on questions containing negative words (e.g., no, not, nothing, unlikely) from CommonsenseQA, following recent research [127]. The result is shown in Table 6.3. We observe that the baseline models KagNet and MHGRN provide limited improvements over RoBERTa on questions with negation words (+0.4%). However, our DRGN model exhibits a huge boost (+5.9%). Moreover, the DRGN model gains a larger improvement in accuracy compared to the QA-GNN model, demonstrating the effectiveness of considering the relevance between question semantics and graph entities, which experimentally confirms our hypothesis. An additional ablation study in Table 6.5 confirms this idea further: when removing the question information from DRGN, we observe that the performance on negation becomes close to that of MHGRN.
Figure 6.4 shows qualitative examples of the positive and negative questions. For the positive question, all the models, including MHGRN, QA-GNN, and our architecture, obtain the same reasoning chain "play baseball -(used for)→ baseball -(has property)→ fun to play". However, when the negative words are added, MHGRN obtains the same reasoning chain as in the positive situation, while QA-GNN and DRGN find the correct reasoning chain. One interesting finding is that DRGN can detect the direct connection using fewer hops to establish the reasoning chain.
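The chapter does not spell out how the negation subset of Table 6.3 is selected, so the following is only an assumed sketch: a simple keyword match over the question text using the example words named above, which may differ from the exact procedure used here and in [127].

```python
# Hedged sketch of selecting questions with negation words; the word list and
# matching rule are assumptions for illustration.
NEGATION_WORDS = {"no", "not", "nothing", "unlikely"}  # examples named in the text

def has_negation(question: str) -> bool:
    tokens = question.lower().replace("n't", " not").split()
    return any(tok.strip(".,?!") in NEGATION_WORDS for tok in tokens)

questions = [
    "Why do parents encourage their kids to play baseball?",
    "Why don't parents encourage their kids to play baseball?",
]
negation_subset = [q for q in questions if has_negation(q)]  # keeps only the second question
```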
Models            Time                    Space
ℓ-layer KagNet    O(|R|^ℓ |V|^{ℓ+1} ℓ)    O(|R|^ℓ |V|^{ℓ+1} ℓ)
ℓ-layer MHGRN     O(|R|^2 |V|^2 ℓ)        O(|R||V|ℓ)
ℓ-layer QA-GNN    O(|V|^2 ℓ)              O(|R||V|ℓ)
ℓ-layer DRGN      O(|R|^2 |V|^2 ℓ)        O(|R||V|^2 ℓ)
Table 6.4 The time complexity and space complexity comparison between DRGN and the baseline models.

Effects of Number of Graph Layers The number of graph layers is an influencing factor for the DRGN architecture because our relevance matrix is computed dynamically, and the relevance scores change as the representations are recomputed at each graph layer. We evaluate the effect of the number of layers ℓ for the baseline models and our DRGN by measuring performance on CommonsenseQA. As shown in Figure 6.5, increasing ℓ continues to bring benefits until ℓ = 4 for DRGN. We compare the performance after adding each layer for MHGRN, QA-GNN, and our DRGN. We observe that DRGN consistently achieves the best performance with the same number of layers as the baselines.

Figure 6.5 The effect of the number of layers (ℓ = 1 to 4) in the QA-GNN, MHGRN, and DRGN models on CommonsenseQA development accuracy (ranging between 0.70 and 0.80).

Table 6.4 shows the time complexity and the space complexity comparison between the DRGN model and the baseline models. We compare the computational complexity based on the number of layers ℓ, the number of nodes |V|, and the number of relations |R|. Our model and MHGRN have the same time complexity because both models use the R-GCN model as the backbone. In contrast, QA-GNN directly adds the edge representation to the local node representation during the graph pre-processing step and learns the graph node representation without the global semantic relational adjacency matrices. After adding the dynamic relevance matrix at each graph layer, our DRGN model achieves better performance compared to the other baseline architectures. Regarding space complexity, our model's cost is slightly larger than that of MHGRN because DRGN introduces the extra dynamic relevance matrix. However, this cost depends on the size of the subgraph, which is usually small, and it leads to a large improvement.

Models                            Dev ACC
DRGN w/o KG subgraph              69.6%
 + KG subgraph                    72.6%
 + relational edges in graph      73.7%
 + question node in graph         74.9%
 + dynamic relevance matrix       78.2%
Table 6.5 Ablation study on the CommonsenseQA dataset.

6.4.3 Qualitative Analysis

To evaluate the effectiveness of the various components of DRGN, we perform an ablation study on the CommonsenseQA development benchmark. Table 6.5 shows the results of the ablation study. First, we remove the whole commonsense subgraph. Our model without the subgraph obtains 69.6% on CommonsenseQA. This shows how well the implicit knowledge in the language model can answer the questions without the external KG, which is not high-performing but is still impressive. After adding the KG subgraph, the accuracy improves to 72.6% on the CommonsenseQA benchmark. Second, we keep the KG subgraph and add the multiple relational edge information from the subgraph (described in Section 6.2.5). With the relational edges, the accuracy improves to 73.7%. This result shows that the multiple relational edges help in learning better graph node representations and obtaining higher performance. Third, we keep the multi-relational subgraph and add the question node. In other words, we incorporate the semantic relationship between the question node and the graph entities. The accuracy of the model improves to 74.9%. This demonstrates the importance of the relevance mechanism between the question information and the KG subgraph. Finally, we add the most important component, the dynamic relevance matrix, to each graph layer.
The large improvement demonstrates the importance of the dynamic relevance matrix and the effectiveness of the DRGN architecture.

6.5 Summary

In this chapter, we propose a novel Dynamic Relevance Graph Network (DRGN) architecture for commonsense question answering given an external source of knowledge in the form of a Knowledge Graph. Our model learns the graph node representations while a) exploiting the existing relations in the KG, b) re-scaling the importance of the neighbor nodes in the graph based on a trained dynamic relevance matrix, and c) establishing direct connections between graph nodes based on measuring the relevance scores of the nodes dynamically during training. The dynamic relevance edges help in finding the chain of reasoning when there are missing edges in the original KG. Our quantitative and qualitative analysis shows that the proposed approach facilitates answering complex questions that need multiple hops of reasoning. Furthermore, since DRGN uses the relevance between the question node and the graph entities, it exploits the richer semantic context of the question in graph reasoning, which leads to improved performance on questions with negation. Our proposed approach shows competitive performance on two QA benchmarks, CommonsenseQA and OpenbookQA.

CHAPTER 7
EXPLOITING COMMONSENSE KNOWLEDGE FOR DOCUMENT-LEVEL QA

7.1 Background and Motivation

Solving Question Answering (QA) problems usually requires both understanding and reasoning over natural language. In recent years, large-scale pre-trained Language Models (LMs) have made breakthrough progress and demonstrated effectiveness in language understanding on many Question Answering tasks [107, 85]. A large amount of world knowledge is stored implicitly in language models; it can be directly encoded and, sometimes, help in Document-level QA [24]. For example, as shown in question 1 of Figure 7.1, "suppose plants will produce more seeds happens, how will it affect plants", the knowledge contained in the given text (a plant produces a seed, the seed germinates, the plant grows) is sufficient to predict the answer. However, there are many cases in which the required knowledge is not included in the text itself. For example, for question 2 in Figure 7.1, the information about the effect of "nutrients" on the seeds does not exist in the text. Therefore, an external source of knowledge is required to answer the question.

Procedural Text: 1. A plant produces a seed. 2. The seed falls to the ground. 3. The seed is buried. 4. The seed germinates. 5. A plant grows. 6. The plant produces flowers. 7. The flowers produce more seeds.
Questions and Answers:
1. suppose plants will produce more seeds happens, how will it affect less plants. (A) More (B) Less (C) No effect
2. suppose the soil is rich in nutrients happens, how will it affect more seeds are produced. (A) More (B) Less (C) No effect
3. suppose The sun comes out happens, how will it affect less plants. (A) More (B) Less (C) No effect
Figure 7.1 WIQA contains procedural text and different types of questions. The bold choices are the answers.

There are several existing resources that contain world knowledge and commonsense. Examples are knowledge graphs (KGs) like ConceptNet [97] and ATOMIC [89]. Looking back at question 2 in Figure 7.1, we observe that an explicit line of reasoning can be generated after providing the external knowledge triplets (nutrient, related to, soil) and (soil, related to, seed) derived from ConceptNet.
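To illustrate how such triplets supply an explicit line of reasoning, the short Python sketch below searches a toy set of ConceptNet-style triples for a two-hop chain between a question concept and a document concept. The triples and the helper name two_hop_chain are illustrative only; this is not the retrieval code used in this chapter.

from collections import defaultdict

# Toy ConceptNet-style triples (head, relation, tail); illustrative only.
triples = [
    ("nutrient", "related_to", "soil"),
    ("soil", "related_to", "seed"),
    ("seed", "used_for", "planting"),
]

neighbors = defaultdict(list)
for head, rel, tail in triples:
    neighbors[head].append((rel, tail))

def two_hop_chain(source, target):
    """Return an explicit 2-hop chain source -> mid -> target if one exists."""
    for r1, mid in neighbors[source]:
        for r2, tail in neighbors[mid]:
            if tail == target:
                return [(source, r1, mid), (mid, r2, target)]
    return None

print(two_hop_chain("nutrient", "seed"))
# [('nutrient', 'related_to', 'soil'), ('soil', 'related_to', 'seed')]

In this toy example, the recovered chain nutrient → soil → seed is exactly the knowledge missing from the procedural text for question 2.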
Two challenges exist in procedural text reasoning with external KGs. The first challenge is effectively extracting the most relevant parts of the external knowledge and reducing the irrelevant information from the KG. The second challenge is reasoning over the extracted knowledge. Irrelevant knowledge from the KG will mislead the QA model in predicting the answer. Moreover, there are fewer sophisticated techniques proposed for using external knowledge explicitly (i.e., not through LMs) in document-level QA tasks. REM-Net [44] uses commonsense for WIQA: it uses a memory network to extract the relevant triplets from the knowledge graph, which addresses the first challenge. However, this work has no specific mechanism for reasoning over the extracted knowledge. It simply uses a multi-head attention operator, which combines the knowledge triplets and documents as input, to predict the answer. DFGN [122] and SAE [110] construct entity graphs using named entity recognition (NER) as the backbone to do multi-hop reasoning over the text itself. However, these models cannot deal with cases where the required knowledge is not in the given document.

To solve these two challenges, we propose a Multi-hop Reasoning network over Relevant CommonSense SubGraphs (MRRG) that deals with the challenge of document-level QA when the answer requires combining two modalities, the document and an external KG. Our motivation is to effectively and efficiently extract the most relevant information from a large KG to help procedural reasoning. First, we extract the entities and retrieve related external triplets from the KG by learning to select the triplets most relevant to a given text. In particular, we propose the KG Attention module to extract the most relevant triplets from the large KG given the text and question and to reduce the irrelevant concepts among the candidate triplets. Then, we construct a commonsense subgraph based on the extracted KG triplets in a pipeline. We use the extracted subgraphs as part of the end-to-end QA model to help fill in the knowledge gaps in the text and perform multi-hop reasoning. Our model predicts the answer by reasoning over the contextual interaction representations of the text and by learning graph representations over the KG subgraphs.

Figure 7.2 The MRRG model (A: KG-attention Triplet Selection; B: Multi-hop Reasoning) is composed of Candidate Triplet Extraction, KG Attention, Commonsense Subgraph Construction, Text encoder with contextual interaction, Graph Reasoning, and Answer prediction modules.

We evaluate our MRRG on the WIQA benchmark. The MRRG model achieves SOTA results and brings significant improvements compared to the existing baselines. The contributions of our work are: 1) We train a separate module that extracts the relevant parts of the KB given the procedure and the question, reducing the noisy and inefficient usage of the information in large KBs.
2) Our end-to-end model uses the extracted QA-dependent KG subgraph to guide the reasoning over the procedural text. 3) Our MRRG achieves SOTA on the WIQA benchmark.

7.2 Model Description

Figure 7.2 shows the proposed architecture. We have numbered the parts in the figure, and here we point to the functionality of each part. (1) We extract the entities from the question and context in a preprocessing step and use them to retrieve the set of candidate triplets from ConceptNet. (2) We propose a novel KG Attention module to extract the most relevant triplets and reduce the noisy concepts among the candidate triplets. (3) We augment the commonsense subgraph based on the relevant triplets. (4) We train a model that uses the commonsense subgraph in a relational graph network, together with a text encoder over the question and document, to do procedural reasoning. Below, we describe the details of each module.

7.2.1 Candidate Triplet Extraction from KG

Given the input question $q$ and context $c$, we extract the contextual entities (concepts) using an off-the-shelf Open Information Extraction (OpenIE) model [98]. For each extracted entity $t_{in}$, we retrieve the relational triplets $t = (t_{in}, r, t_{out})$ from the KG, where $t_{out}$ is the concept taken from ConceptNet and $r$ is a semantic relation type. We then apply a pre-trained Language Model, RoBERTa, to obtain the representation $E_t$ of each triplet:

$E_t = f_{LM}([t_{in}, r, t_{out}]) \in \mathbb{R}^{3 \times d}$,   (7.1)

where $f_{LM}$ denotes the language model operation, and the triplets are given to the LM as a sequence of concepts and relations.

7.2.2 KG Attention

The KG attention module is shown in Figure 7.3. We concatenate $q$ and $c$ to form $Q$:

$Q = [[CLS]; q; [SEP]; c]$,   (7.2)

where [CLS] and [SEP] are special tokens of the LM tokenizer [65]. We use RoBERTa to obtain the token representations $E_{[CLS]}$, $E_q$, and $E_c$. $E_{[CLS]}$ is the summary representation of the question and paragraph, $E_q$ is the list of question token embeddings, and $E_c$ is the list of paragraph token embeddings output by RoBERTa. Given the triplet representation $E_t$ generated by the triplet extraction described in Section 7.2.1, we build a context-triplet pair $E_{z_t}$ as follows:

$E_{z_t} = [E_{[CLS]}; E^{in}_t; E^{r}_t; E^{out}_t]$,   (7.3)

where $E^{in}_t$ is the representation of the head entity from the text, $E^{out}_t$ is the representation of the tail entity from the KG, and $E^{r}_t$ is the representation of the relation. Afterward, we compute the context-triplet pair attention with a softmax layer to output the Context-Triplet pairwise importance Score $CTS$. The process is computed as follows:

$CTS_t = \frac{\exp(MLP(E_{z_t}))}{\sum_{j=1}^{m} \exp(MLP(E_{z_j}))}$.   (7.4)

Then we choose the top-$k$ relevant triplets with the highest $CTS$ scores and use these relevant triplets to construct the subgraph. For each selected triplet, we obtain the attended triplet representation $E^{a}_t$ as follows:

$E^{a}_t = [E^{in,a}_t, E^{r}_t, E^{out,a}_t] \in \mathbb{R}^{3 \times d}$,   (7.5)
$E^{in,a}_t = f_{in}([CTS_t \cdot E^{in}_t; CTS_t \cdot E^{r}_t])$,   (7.6)
$E^{out,a}_t = f_{out}([CTS_t \cdot E^{out}_t; CTS_t \cdot E^{r}_t])$,   (7.7)

where $f_{in}$ and $f_{out}$ are MLP layers, $[;]$ denotes concatenation, and $\cdot$ denotes scalar multiplication.

7.2.3 Commonsense Subgraph Construction

We construct the commonsense subgraph $G_s$ based on the relevant triplets from the KG attention for each question and answer pair. We add more edges to the subgraph as follows: two entities in the triplets will have an edge if a relation $r$ in the KG exists between them. We use $E^{in,a}_t$ and $E^{out,a}_t$ as the initial node representations $h^{(0)}$ of the KG subgraph, which are used in the RGCN formulation in Section 7.2.4.
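As a rough illustration of the scoring in Equations (7.4)–(7.7), the following PyTorch sketch scores candidate triplets against the [CLS] summary representation and keeps the top-k. It assumes pre-computed RoBERTa embeddings passed in as tensors and uses a single MLP scorer; the module and argument names (KGAttentionSelector, e_in, e_rel, e_out) are illustrative rather than the actual MRRG implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class KGAttentionSelector(nn.Module):
    """Scores candidate triplets against the [CLS] summary of (question, document)
    and keeps the top-k most relevant ones (a rough sketch of Eq. 7.4-7.7)."""

    def __init__(self, dim, k=5):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(4 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.f_in = nn.Linear(2 * dim, dim)    # stands in for the f_in MLP
        self.f_out = nn.Linear(2 * dim, dim)   # stands in for the f_out MLP
        self.k = k

    def forward(self, cls_rep, e_in, e_rel, e_out):
        # cls_rep: (dim,); e_in / e_rel / e_out: (n_triplets, dim)
        n = e_in.size(0)
        pair = torch.cat([cls_rep.expand(n, -1), e_in, e_rel, e_out], dim=-1)
        cts = F.softmax(self.scorer(pair).squeeze(-1), dim=0)   # context-triplet scores
        top = torch.topk(cts, min(self.k, n)).indices           # indices of the top-k triplets
        w = cts[top].unsqueeze(-1)
        # attended head/tail representations for the selected triplets
        h_in = self.f_in(torch.cat([w * e_in[top], w * e_rel[top]], dim=-1))
        h_out = self.f_out(torch.cat([w * e_out[top], w * e_rel[top]], dim=-1))
        return top, h_in, h_out

# Toy usage with random stand-in embeddings of dimension 768.
selector = KGAttentionSelector(dim=768, k=3)
idx, h_in, h_out = selector(torch.randn(768), torch.randn(10, 768),
                            torch.randn(10, 768), torch.randn(10, 768))

In this sketch, h_in and h_out play the role of the initial node representations $h^{(0)}$ that are fed to the graph reasoning encoder described next.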
7.2.4 Reasoning over Document-level QA

To facilitate finding the answer, our MRRG architecture is composed of two modules: the Graph Reasoning Encoder module and the Text Contextual Interaction Encoder module.

Graph Reasoning Encoder: this module is shown in Figure 7.2-B. Given the subgraph $G_s$, we use RGCN [90] to learn the representations of the relational graph. RGCN learns graph representations by aggregating messages from a node's direct neighbors along relational semantic edges. The $(l+1)$-th layer node representation $h^{(l+1)}_i$ is updated based on the neighborhood node representations $h^{(l)}_j$ from the $l$-th layer, multiplied by the relational matrices $W^{(l)}_{r_1}, \ldots, W^{(l)}_{r_{|R|}}$. The representation $h^{(l+1)}_i$ is computed as follows:

$h^{(l+1)}_i = \sigma\left(\sum_{r \in R} \sum_{j \in N^r_i} \frac{1}{|N^r_i|} W^{(l)}_r h^{(l)}_j + W^{(l)}_0 h^{(l)}_i\right)$,   (7.8)

where $\sigma$ denotes a non-linear activation function and $N^r_i$ is the set of neighbor indices of node $i$ under semantic relation $r$. Finally, we obtain the subgraph representation $E_{G_s}$ after several hops of message passing.

Text Contextual Interaction Encoder: We obtained the contextual token representations $E_{[CLS]}$, $E_q$, and $E_c$ in the KG attention module described in Section 7.2.2. Following the BiDAF work [91], we utilize a contextual interaction module that feeds $E_q$ and $E_c$ into a Context-to-Question Attention,

$E_{c \rightarrow q} = softmax(sim(E_q, E_c)) E_q$,   (7.9)

and a Question-to-Context Attention $E_{q \rightarrow c}$, to obtain the contextual interaction between the question and the context. Then we use an LSTM to obtain the hidden state representations $H_{q \rightarrow c} = LSTM(E_{q \rightarrow c})$ and $H_{c \rightarrow q} = LSTM(E_{c \rightarrow q})$.

Figure 7.3 The architecture of training the KG Attention module.

7.2.5 Answer Prediction

We concatenate $E_{[CLS]}$, $H_{q \rightarrow c}$, $H_{c \rightarrow q}$, and the compact subgraph representation $E_{G_s}$ obtained from attentive pooling, and use the result as the final representation:

$a = [E_{[CLS]}; H_{q \rightarrow c}; H_{c \rightarrow q}; E_{G_s}]$.   (7.10)

Then we utilize an MLP classifier over $a$ to predict the answer.

7.2.6 Training Strategy

Training KG Attention for Triplet Selection: Figure 7.3 and the left block of Figure 7.2 show the same triplet selection model. The same KG attention module described in Section 7.2.2 is used, and extra MLP layers are added to the module for training, as shown in Figure 7.3. The MLP is applied on the concatenation $[E_{[CLS]}; E_q; E_c; E^{a}_{t_1}; \ldots; E^{a}_{t_k}]$ to predict the answer. We use cross-entropy as the loss function to train the model.

Training End-to-End MRRG: After pre-training the KG attention module, we keep the learned parameters, extract the most relevant concepts, and construct the multi-relational commonsense subgraph $G_s$. We combine the subgraph representation and the text interaction representation as input to train the answer prediction module with a cross-entropy loss.

7.3 Experiments

7.3.1 Dataset Description

The WIQA benchmark [107] is a large collection of Document-level QA examples. WIQA contains two types of questions: 1) questions that can be answered directly based on the text, called in-paragraph questions; 2) questions that require external knowledge to be answered, called out-of-paragraph questions.
WIQA contains 29808 training samples, 6894 development samples, 3993 test samples (test V1), and 3003 test samples (test V2).

7.3.2 Implementation Details

We implemented our MRRG framework using PyTorch. We use a pre-trained RoBERTa [65] to encode the contextual information in the input. The maximum number of triplets is 50, and the maximum number of nodes in the graph is 100. Further details of the hyper-parameters of the graph are shown in Table 7.3. The maximum number of words for the paragraph context is 256. For the graph construction module, we utilize the Open Information Extraction model [98] from AllenNLP (https://demo.allennlp.org/open-information-extraction) to extract the entities. The maximum number of hops for the graph module is 3. The learning rate is 1e-5. The model is optimized using the Adam optimizer [51].

7.3.3 Baseline Description

We briefly describe the recent SOTA baselines that use a Transformer-based language model as the backbone. The description of each strong baseline is given below:

EIGEN [69] is a baseline that builds an event influence graph based on a document and leverages LMs to create the chain of reasoning to predict the answer. However, EIGEN does not use any external knowledge to solve the problem.

Logic-Guided [5] uses logic rules, including symmetry and transitivity rules, to augment the training data. Moreover, Logic-Guided uses the rules as a regularization term during training to impose consistency between the answers to multiple questions.

RGN [135] is the recent SOTA baseline that utilizes a gating network [133] to jointly learn to extract the key entities through an entity gating module, find the line of reasoning and the relations between the key entities through a relation gating module, and capture the entity alignment through a contextual entity module.

REM-Net [44] proposes a recursive erasure memory network to find the line of reasoning. Specifically, REM-Net refines the evidence by a recursive memory mechanism and then uses a generative model to predict the answer. REM-Net is the only work that uses external knowledge for WIQA. It does so by training an attention module that encodes the KG triplet representations for finding the answer. It does not explicitly select the most relevant triplets as we do, and graph reasoning is not exploited for finding the chain of reasoning.

7.4 Results and Discussion

7.4.1 Result Comparison

Table 7.1 and Table 7.2 show the performance of MRRG on the WIQA task compared to other baselines on the two different test sets, V1 and V2. First, both tables show that our proposed KG Attention triplet selection model outperforms RoBERTa and has a 3.3% improvement in the out-of-para category. Second, our MRRG achieves SOTA results compared to all baseline models.

Models                                      in-para   out-of-para   no-effect   Test V1 Acc
Majority                                    45.46     49.47         55.0        30.66
Polarity                                    76.31     53.59         27.0        39.43
Adaboost [32]                               49.41     36.61         48.42       43.93
Decomp-Attn [78]                            56.31     48.56         73.42       59.48
BERT (no para) [24]                         60.32     43.74         84.18       62.41
BERT [107]                                  79.68     56.13         89.38       73.80
EIGEN [69]                                  73.58     64.04         90.84       76.92
REM-Net [44]                                75.67     67.98         87.65       77.56
Logic-Guided [5]                            -         -             -           78.50
RoBERTa + KG-attention Triplet Selection    72.21     64.60         89.13       75.22
MRRG (RoBERTa-base)                         79.85     69.93         91.02       80.06
Human                                       -         -             -           96.33
Table 7.1 Model comparisons on the WIQA test V1 dataset. WIQA has four evaluation metrics: in-paragraph, out-of-paragraph, no-effect, and overall test accuracy.
MRRG achieves SOTA on the in-para, out-of-para, and no-effect questions in WIQA V1 and V2.

Models                                      in-para   out-of-para   no-effect   Test V2 Acc
Random                                      33.33     33.33         33.33       33.33
Majority                                    00.00     00.00         100.0       41.80
BERT                                        70.57     58.54         91.08       74.26
REM-Net                                     70.94     63.22         91.24       76.29
REM-Net (RoBERTa-large)                     76.23     69.13         92.35       80.09
QUARTET (RoBERTa-large) [85]                74.49     65.65         95.30       82.07
RGN [135]                                   75.91     66.15         92.12       79.95
RoBERTa + KG Attention Triplet Selection    70.02     62.30         91.23       75.86
MRRG (RoBERTa-base)                         76.80     67.83         92.28       80.39
MRRG (RoBERTa-large)                        78.82     71.10         93.53       82.95
Human                                       -         -             -           96.30
Table 7.2 Model comparisons on the WIQA test V2 dataset.

7.4.2 Model Analysis

Effects of Using External Knowledge In WIQA, all the baseline models achieve significantly lower accuracy in the out-of-para category than in the in-para and no-effect categories. MRRG achieves SOTA in the out-of-para category because it uses highly relevant commonsense subgraphs. As shown in Table 7.2, the advantage of the MRRG model is reflected in the out-of-para questions. MRRG improves 4.61% over REM-Net. Notice that REM-Net is the only other model that utilizes external knowledge on WIQA. Figure 7.4 shows a case in which "soil" and "nutrient" only appear in the question and do not exist in the text. The baseline models fail to answer this out-of-para question due to missing external knowledge. However, our model predicts the correct answer by explicitly incorporating the triplets (nutrient, relatedto, soil) and (soil, relatedto, seed) that connect the critical information between the question and the document.

Figure 7.4 Case study of the MRRG Framework. "+interaction" means adding the contextual interaction module. "KG ATTN" means adding the KG Attention Triplet Selection module. A cross indicates that the model failed to predict the correct answer, and a checkmark means the prediction was successful with the included module.

Effect of Combining Knowledge Reasoning and Multi-hop Reasoning Both in-para and out-of-para types of questions require multiple hops of reasoning to find the answer in the WIQA benchmark. MRRG makes a sharp improvement in reasoning with multiple hops due to the effectiveness of the extracted commonsense subgraph. In particular, the MRRG model accuracy improved 2% for 1 hop, 8% for 2 hops, and 2% for 3 hops compared to EIGEN. We study some cases to analyze the multi-hop reasoning and the reasoning chains. In the third case in Figure 7.4, the extracted relevant triplets (land, relatedto, surface) and (surface, relatedto, igneous rock) construct a two-hop reasoning chain "land → surface → igneous rock" that helps MRRG find the correct answer.

7.4.3 Qualitative Analysis

Table 7.3 shows the ablation study results of MRRG on the WIQA benchmark. First, we remove the commonsense subgraph and the graph network. The accuracy decreases by 3.4% compared to MRRG. This demonstrates the effectiveness of using external knowledge graphs for Document-level QA. Second, we report results on the impact of changing the dimensionality of the node representations in the model. We try different dimensions for the graph representation; the best performance is achieved with a dimension of 100. In an additional experiment, we use the KG attention triplet selection module to directly predict the answer, without the pipeline of constructing the subgraph and using the graph reasoning module. We show this result as KG Attention Triplet Selection in Table 7.3.
The result shows that removing the triplet selection module decreases the accuracy by 1.8%. This demonstrates that the KG attention neural mechanism itself helps in extracting the most relevant information from a large KG and filling the knowledge gaps in the document.

Ablation Model                                 Dev Acc
Text only: RoBERTa-base                        75.51%
Text only: KG Attention Triplet Selection      77.39%
Text+Graph: GNN dim=50                         79.18%
Text+Graph: GNN dim=100                        80.30%
Text+Graph: GNN dim=200                        79.88%
Table 7.3 Ablation and hyper-parameter choices on WIQA. "GNN dim" is the dimension of the graph representation.

7.5 Summary

We propose the MRRG model for using external knowledge graphs in reasoning over procedural text. Our model extracts a relevant subgraph from the KG for each question and uses that knowledge subgraph to answer the question. The extracted subgraph includes the reasoning path for answering the question and helps in filling the knowledge gap between the question and the text. We evaluate MRRG on WIQA and achieve SOTA performance.

CHAPTER 8
CONCLUSION AND FUTURE DIRECTIONS

In this chapter, we summarize the work presented in this dissertation and highlight the contributions. We also discuss several potential directions for future work.

8.1 Summary of Contributions

This dissertation proposes new techniques for exploiting external knowledge and the semantic structure of data in different modalities in QA systems. My study covers a broad range of QA problems where the answer to a natural language question can be found in multiple modalities, including textual documents (Document-level QA), images (Cross-Modality QA), knowledge graphs (Commonsense QA), and combinations of text and knowledge graphs.

In Chapter 3 of this dissertation, we addressed the challenges of Document-level QA. In particular, we focused on answering questions that need multiple hops of reasoning spanning multiple documents. We exploited the semantic structure of multiple documents to find the line of reasoning needed to answer the questions. We extracted a graph with entities and multiple relational edges from documents using semantic role labeling (SRL) and connected the SRL graphs using shared entities. We proposed a Semantic Role Labeling Graph Reasoning Network (SRLGRN) that utilizes an LM and a GNN as the backbone to find the cross-paragraph reasoning paths while answering the questions. Exploiting the semantic structure of the documents makes the line of reasoning more explicit and explainable.
Our proposed model obtains competitive results on both multi-hop and single-hop document-level QA benchmarks, including HotpotQA and SQuAD.

In Chapter 4 of this dissertation, we addressed the challenges of cause-effect QA, a special type of Document-level QA. In contrast to relying on the implicit representations of pre-trained language models, finding explicit causal relationships between entities facilitates causal reasoning over the whole document. We proposed a Relational Gating Network (RGN) that jointly extracts the most important entities and models their relations explicitly. The RGN contains an entity gating module, a relation gating module, and a contextual interaction module. These modules help solve different aspects of the cause-effect QA challenges, including multi-hop causal reasoning and entity alignment. We demonstrated that modeling pairwise relationships helps to capture higher-order relations. Our proposed approach achieves state-of-the-art results on the cause-effect QA benchmark, WIQA.

In Chapter 5 of this dissertation, we addressed the challenges of Visual Question Answering, a classic type of cross-modality QA. Our main contribution was to explicitly ground the entities, as well as their relationships, from the language modality into the vision modality. We proposed a novel Cross-Modality Relevance (CMR) architecture, an end-to-end framework that considers the relevance between textual token representations and visual object representations by explicitly aligning the two modalities. We model the higher-order relational relevance for the generalizability of reasoning between entity relations in the text and object relations in the image. Our proposed CMR approach shows competitive performance on two different language and vision benchmarks, NLVR and VQA. The proposed architecture improves robustness and effectiveness compared to the previous state-of-the-art models.

In Chapter 6 of this dissertation, we addressed the challenge of knowledge-based QA given an external source of knowledge in the form of a Knowledge Graph. The main contribution has been recovering missing edges in the KG that are needed for finding the line of reasoning and answering the questions. We proposed a novel Dynamic Relevance Graph Network (DRGN) that learns the node representations while a) exploiting the existing edges in the KG, b) establishing direct edges between graph nodes based on their relevance scores, and c) re-scaling the importance of the neighbor nodes in the graph based on a trained dynamic relevance matrix. As a byproduct, our model improved the handling of questions with negation by deeply considering the relevance between the question node and the graph entities. Our proposed approach showed competitive performance on two QA benchmarks, CommonsenseQA and OpenbookQA, compared to the state-of-the-art published architectures.

In Chapter 7 of this dissertation, we dealt with the challenge of document-level QA when the answer needs a combination of modalities, that is, both the document and an external KG. The main contribution is to effectively extract the most relevant external information from a given large KG and combine it with the document-level information to answer the questions. We proposed a novel architecture called MRRG that extracts the entities from the document and learns to retrieve the relevant external knowledge from the KG using a novel neural KG attention mechanism.
Then, we constructed a KG subgraph as part of the document-level QA model to help fill in the knowledge gaps and facilitate multi-hop reasoning. We evaluated our model on the commonly used WIQA benchmark for this task. The proposed model achieves SOTA and brings significant improvements.

8.2 Future Directions

Beyond the topics covered in this dissertation, there are many exciting new directions related to the Question Answering problem, including Prompt Learning for Question Answering and the Integration of Domain-Knowledge into Question Answering. In the following subsections, we point to some of these future directions for QA.

8.2.1 Prompt Learning for Question Answering

Recent research on large-scale pre-trained LMs demonstrates that a unified paradigm [48] could potentially be applied to solve various existing NLP tasks. Developing a unified framework for QA based on prompt learning is becoming a new trend for solving various QA tasks. QA architectures are usually based on a supervised learning paradigm. In general, these QA architectures take an input x (question, context, image, knowledge, etc.) and predict an output y (yes/no, an answer span, one of multiple candidate answers, etc.) as P(y|x; θ) in a "pre-train, fine-tune" architecture, where θ represents the learned parameters of the model. However, prompt-based QA architectures reformulate the original input x into a prompt T(x), where T is a prompting transformation function. In general, the generated prompt T(x) has several empty slots, like a cloze test, that require filling in. The empty slots correspond to the outputs y in a "pre-train, prompt, and predict" architecture. From the application point of view, UnifiedQA [48] is a pioneering research work that reformulates various QA tasks as a unified text generation prompting problem. The UnifiedQA model first generates the prompts from the questions and the corresponding context, then utilizes a pre-trained sequence-to-sequence LM, the T5 model, to predict the answer directly. As mentioned in [64], prompt learning models use the "pre-train, prompt, and predict" architecture to achieve SOTA on many QA tasks, including Document-level QA and Knowledge-based QA. Following this new trend of prompt-based tuning, we can point to two possible directions for future research. The first is how to develop prompt learning for QA over different structured modalities, such as relational knowledge graphs and SQL tables. The second is how to design prompts that can learn the type of reasoning needed for generating the output, for example, different types of reasoning, including spatial, temporal, and compositional reasoning, to enhance transferability and generalizability among different types of QA.

8.2.2 Integration of Domain-Knowledge into Question Answering

The integration of explicit domain knowledge can alleviate deep learning QA challenges [22], including inconsistent decisions and low performance on tasks with complex reasoning. The domain knowledge can be represented through explicit constraints such as logical rules, context-free grammars, or probabilistic relations. While there are many recent research efforts on the integration of knowledge graphs based on neural representations, using knowledge in symbolic form with explicit reasoning in neural models is less explored. Given the challenges that we faced on complex QA reasoning problems with long hops of reasoning, we think the neuro-symbolic direction is key for better generalizability of the models. This is very important in cross-modality QA, where multiple modalities need to be understood and grounded in each other. There are very recent neuro-symbolic solutions for solving visual question answering with external knowledge, such as the VQAR task [43]. In VQAR, given a query, "Identify the tall animal on the left.", we require external knowledge and commonsense reasoning ("what is the tall animal in the real world") and spatial reasoning ("which animal is on the left of the image") to answer the question. While these solutions are very futuristic and interesting, the main issue in dealing with this task is their scalability and efficiency, which limit their practicality for real scenarios. Exploiting symbolic reasoning over commonsense in VQA raises efficiency problems. On the image side, the obtained visual features (e.g., objects, attributes, and relations) come from a deep learning model. As the amount of visual information detected from the image, such as bounding boxes, increases, the time complexity of computing the deep learning model becomes extremely high. Moreover, although integrating commonsense knowledge into VQA can possibly offer good interpretability, the models are hardly scalable because the number of knowledge facts used in each data example is huge.

BIBLIOGRAPHY

[1] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don't just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4971–4980, 2018.

[2] Ali Mohamed Nabil Allam and Mohamed Hassan Haggag. The question answering systems: A survey. International Journal of Research and Reviews in Information Sciences (IJRRIS), 2(3), 2012.

[3] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.

[4] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.

[5] Akari Asai and Hannaneh Hajishirzi. Logic-guided data augmentation and regularization for consistent question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5642–5650. Association for Computational Linguistics, July 2020.

[6] Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. Learning to retrieve reasoning paths over wikipedia graph for question answering. In ICLR, 2020.

[7] J. Atwood and D. Towsley. Diffusion-convolutional neural networks. In NIPS, 2016.

[8] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. Dbpedia: A nucleus for a web of open data. In The semantic web, pages 722–735. Springer, 2007.

[9] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1533–1544, 2013.

[10] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge.
This is very important in cross-modality QA, where multiple modalities need to be understood and grounded in each other. There are very recent neuro-symbolic solutions to solve visual question answering with external knowledge such as in VQAR task [43]. In VQAR, given a query, “Identify the tall animal on the left.”, we require external knowledge and commonsense reasoning (“what is the tall animal in the real world”), and spatial reasoning (“which animal is on the left of the image”), to answer the question. While these solutions are very futuristic and interesting, the main issue in dealing with this task is their scalability and efficiency to make them practical for real scenarios. Exploiting symbolic reasoning over commonsense in VQA will raise efficiency problems. On the image side, 98 the obtained visual features (e.g., object, attribute, and relation) are associated with the deep learning model. With the increase of the extensive visual information, such as bounding boxes, detected from the image, the time complexity of computing the deep learning model will be extremely high. Moreover, although integrating commonsense knowledge into VQA can possibly offer good interpretability, the models are hardly scalable because the number of knowledge facts using in each data example is huge. 99 BIBLIOGRAPHY [1] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4971–4980, 2018. [2] Ali Mohamed Nabil Allam and Mohamed Hassan Haggag. The question answering systems: A survey. International Journal of Research and Reviews in Information Sciences (IJRRIS), 2(3), 2012. [3] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018. [4] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015. [5] Akari Asai and Hannaneh Hajishirzi. Logic-guided data augmentation and regularization for consistent question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5642–5650. Association for Computational Linguistics, July 2020. [6] Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. Learning to retrieve reasoning paths over wikipedia graph for question answering. In ICLR, 2020. [7] J. Atwood and D. Towsley. Diffusion-convolutional neural networks. In NIPS, 2016. [8] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. Dbpedia: A nucleus for a web of open data. In The semantic web, pages 722–735. Springer, 2007. [9] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1533–1544, 2013. [10] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. 
In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250, 2008. [11] Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. Large-scale simple question answering with memory networks. ArXiv, abs/1506.02075, 2015. [12] Gerlof Bouma. Normalized (pointwise) mutual information in collocation extraction. Pro- ceedings of GSCL, pages 31–40, 2009. 100 [13] Nicola De Cao, Wilker Aziz, and Ivan Titov. Question answering by reasoning across documents with graph convolutional networks. In NAACL-HLT, 2019. [14] Jifan Chen, Shih-Ting Lin, and Greg Durrett. Multi-hop question answering via reasoning chains. ArXiv, abs/1910.02610, 2019. [15] Philipp Cimiano, Vanessa Lopez, Christina Unger, Elena Cabrio, Axel-Cyrille Ngonga Ngomo, and Sebastian Walter. Multilingual question answering over linked data (qald-3): Lab overview. In International conference of the cross-language evaluation forum for european languages, pages 321–332. Springer, 2013. [16] Christopher Clark and Matt Gardner. Simple and effective multi-paragraph reading compre- hension. In ACL, 2018. [17] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018. [18] Peter Clark, Oren Etzioni, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Niket Tandon, et al. From’f’to’a’on the ny regents science exams: An overview of the aristo project. arXiv preprint arXiv:1909.01958, 2019. [19] Wanyun Cui, Yanghua Xiao, Haixun Wang, Yangqiu Song, Seung-won Hwang, and Wei Wang. Kbqa: learning question answering over qa corpora and knowledge bases. arXiv preprint arXiv:1903.02419, 2019. [20] Bhavana Dalvi, Lifu Huang, Niket Tandon, Wen tau Yih, and P. Clark. Tracking state changes in procedural text: a challenge dataset and models for process paragraph comprehension. In NAACL-HLT, 2018. [21] Bhavana Dalvi, Niket Tandon, Antoine Bosselut, Wen-tau Yih, and Peter Clark. Everything happens for a reason: Discovering the purpose of actions in procedural text. In EMNLP, 2019. [22] Tirtharaj Dash, Sharad Chitlangia, Aditya Ahuja, and Ashwin Srinivasan. A review of some techniques for inclusion of domain-knowledge into deep neural networks. Scientific Reports, 12(1):1–15, 2022. [23] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016. [24] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. 101 [25] Bhuwan Dhingra, Qiao Jin, Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Neural models for reasoning over multiple mentions using coreference. In NAACL-HLT, 2018. [26] Dennis Diefenbach, Vanessa Lopez, Kamal Singh, and Pierre Maret. Core techniques of question answering systems over knowledge bases: a survey. Knowledge and Information systems, 55(3):529–569, 2018. [27] Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 
Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL-HLT, 2019. [28] Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Güney, Volkan Cirik, and Kyunghyun Cho. Searchqa: A new q&a dataset augmented with context from a search engine. ArXiv, abs/1704.05179, 2017. [29] Yuwei Fang, Siqi Sun, Zhe Gan, Rohit Pillai, Shuohang Wang, and Jingjing Liu. Hierarchical graph network for multi-hop question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8823–8838, Online, November 2020. Association for Computational Linguistics. [30] Yanlin Feng, Xinyue Chen, Bill Yuchen Lin, Peifeng Wang, Jun Yan, and Xiang Ren. Scalable multi-hop relational reasoning for knowledge-aware question answering. In EMNLP, 2020. [31] Yanlin Feng, Xinyue Chen, Bill Yuchen Lin, Peifeng Wang, Jun Yan, and Xiang Ren. Scalable multi-hop relational reasoning for knowledge-aware question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1295–1309, Online, November 2020. Association for Computational Linguistics. [32] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995. [33] Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. Large-scale learnable graph convolutional networks. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018. [34] Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. [35] Ameya Godbole, D. Kavarthapu, R. Das, Zhiyu Gong, A. Singhal, Hamed Zamani, Mo Yu, Tian Gao, Xiaoxiao Guo, M. Zaheer, and A. McCallum. Multi-step entity-centric information retrieval for multi-hop question answering. In MRQA@EMNLP, 2019. [36] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6325–6334, July 2017. 102 [37] William L. Hamilton, Zhitao Ying, and J. Leskovec. Inductive representation learning on large graphs. In NIPS, 2017. [38] Luheng He, Kenton Lee, Omer Levy, and Luke Zettlemoyer. Jointly predicting predicates and arguments in neural semantic role labeling. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2018. [39] Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. Deep semantic role labeling: What works and what’s next. In ACL, 2017. [40] Mikael Henaff, J. Weston, Arthur Szlam, Antoine Bordes, and Y. LeCun. Tracking the world state with recurrent entity networks. In ICLR, 2017. [41] Lynette Hirschman and Robert Gaizauskas. Natural language question answering: the view from here. natural language engineering, 7(4):275–300, 2001. [42] Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017. [43] Jiani Huang, Ziyang Li, Binghong Chen, Karan Samel, Mayur Naik, Le Song, and Xujie Si. Scallop: From probabilistic deductive databases to scalable differentiable reasoning. 
Advances in Neural Information Processing Systems, 34:25134–25145, 2021. [44] Yinya Huang, Meng Fang, Xunlin Zhan, Qingxing Cao, Xiaodan Liang, and Liang Lin. Rem-net: Recursive erasure memory network for commonsense evidence refinement. In AAAI, 2021. [45] Drew A Hudson and Christopher D Manning. Compositional attention networks for machine reasoning. In International Conference on Learning Representations (ICLR), 2018. [46] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. [47] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In ACL, 2017. [48] Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. UNIFIEDQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1896–1907, Online, November 2020. Association for Computational Linguistics. [49] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. In Advances in Neural Information Processing Systems, pages 1564–1574, 2018. 103 [50] Seonhoon Kim, Seohyeong Jeong, Eunbyul Kim, Inho Kang, and Nojun Kwak. Self- supervised pre-training and contrastive representation learning for multiple-choice video qa. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13171–13179, 2021. [51] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. [52] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017. [53] Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan, Mirella Lapata, and Hannaneh Hajishirzi. Text generation from knowledge graphs with graph transformers. In NAACL, 2019. [54] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017. [55] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. In ICLR, 2020. [56] Jie Lei, Licheng Yu, Tamara L. Berg, and Mohit Bansal. Tvqa+: Spatio-temporal grounding for video question answering. In ACL, 2020. [57] Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering, 34(1):50–70, 2020. [58] Linjie Li, Zhe Gan, Yu Cheng, and Jingjing Liu. Relation-aware graph attention network for visual question answering. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 10312–10321, 2019. [59] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019. [60] Ruoyu Li, S. Wang, Feiyun Zhu, and J. Huang. Adaptive graph convolutional neural networks. In AAAI, 2018. [61] Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren. 
KagNet: Knowledge-aware graph networks for commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2829–2839, Hong Kong, China, November 2019. Association for Computational Linguistics. [62] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 104 [63] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. In ICLR, 2020. [64] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586, 2021. [65] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019. [66] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visi- olinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23, 2019. [67] Pan Lu, Lei Ji, Wei Zhang, Nan Duan, Ming Zhou, and Jianyong Wang. R-vqa: learning visual relation facts with semantic attention for visual question answering. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1880–1889, 2018. [68] Jianjie Luo, Yehao Li, Yingwei Pan, Ting Yao, Hongyang Chao, and Tao Mei. Coco-bert: Improving video-language pre-training with contrastive cross-modal matching and denoising. In Proceedings of the 29th ACM International Conference on Multimedia, pages 5600–5608, 2021. [69] Aman Madaan, Dheeraj Rajagopal, Yiming Yang, Abhilasha Ravichander, Eduard Hovy, and Shrimai Prabhumoye. Eigen: Event influence generation using pre-trained language models. arXiv preprint arXiv:2010.11764, 2020. [70] Christopher D Manning. Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In International conference on intelligent text processing and computational linguistics, pages 171–189. Springer, 2011. [71] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. The stanford corenlp natural language processing toolkit. In ACL, 2014. [72] Diego Marcheggiani, Anton Frolov, and Ivan Titov. A simple and accurate syntax-agnostic neural model for dependency-based semantic role labeling. In CoNLL, 2017. [73] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. [74] Sewon Min, Victor Zhong, Luke Zettlemoyer, and Hannaneh Hajishirzi. Multi-hop reading comprehension through question decomposition and rescoring. In ACL, 2019. 105 [75] Kyung min Kim, Min-Oh Heo, Seongho Choi, and Byoung-Tak Zhang. Deepstory: Video story qa by deep embedded memory networks. In IJCAI, 2017. 
[76] Kosuke Nishida, Kyosuke Nishida, Masaaki Nagata, Atsushi Otsuka, Itsumi Saito, Hisako Asano, and Junji Tomita. Answering while summarizing: Multi-task learning for multi-hop qa with evidence extraction. In ACL, 2019. [77] Liang Pang, Yanyan Lan, J. Guo, Jun Xu, Shengxian Wan, and X. Cheng. Text matching as image recognition. In AAAI, 2016. [78] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255, Austin, Texas, November 2016. Association for Computational Linguistics. [79] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association for Computational Linguistics, October 2014. [80] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. [81] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015. [82] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. [83] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. [84] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. [85] Dheeraj Rajagopal, Niket Tandon, Peter Clark, Bhavana Dalvi, and Eduard Hovy. What-if I ask you to explain: Explaining the effects of perturbations in procedural text. In Findings of EMNLP 2020, 2020. [86] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia, July 2018. Association for Computational Linguistics. 106 [87] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, November 2016. Association for Computational Linguistics. [88] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015. [89] Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A Smith, and Yejin Choi. Atomic: An atlas of machine commonsense for if-then reasoning. 
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3027–3035, 2019. [90] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European semantic web conference, pages 593–607. Springer, 2018. [91] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In ICLR, 2017. [92] Sanket Shah, Anand Mishra, Naganand Yadati, and Partha Pratim Talukdar. Kvqa: Knowledge- aware visual question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8876–8884, 2019. [93] Yashvardhan Sharma and Sahil Gupta. Deep learning approaches for question answering system. Procedia computer science, 132:785–794, 2018. [94] Peng Shi and Jimmy Lin. Simple bert models for relation extraction and semantic role labeling. arXiv preprint arXiv:1904.05255, 2019. [95] Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. Matching the blanks: Distributional similarity for relation learning. arXiv preprint arXiv:1906.03158, 2019. [96] Linfeng Song, Zhiguo Wang, Mo Yu, Yue Zhang, Radu Florian, and Daniel Gildea. Exploring graph-structured passage representation for multi-hop reading comprehension with graph neural networks. ArXiv, abs/1809.02040, 2018. [97] Robyn Speer, Joshua Chin, and Catherine Havasi. Conceptnet 5.5: An open multilingual graph of general knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017. [98] Gabriel Stanovsky, Julian Michael, Luke Zettlemoyer, and Ido Dagan. Supervised open information extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 885–895, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. 107 [99] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-linguistic representations. In International Conference on Learning Representations, 2020. [100] Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi. A corpus of natural language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 217–223, Vancouver, Canada, July 2017. Association for Computational Linguistics. [101] Alane Suhr, Stephanie Zhou, Iris D. Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. In ACL, 2018. [102] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7463–7472, 2019. [103] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. [104] Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019. 
[105] Zhixing Tan, Mingxuan Wang, Jun Xie, Yidong Chen, and Xiaodong Shi. Deep semantic role labeling with self-attention. In AAAI, 2018.
[106] Niket Tandon, Bhavana Dalvi, Joel Grus, Wen-tau Yih, Antoine Bosselut, and Peter Clark. Reasoning about actions and state changes by injecting commonsense knowledge. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 57–66, Brussels, Belgium, October-November 2018. Association for Computational Linguistics.
[107] Niket Tandon, Bhavana Dalvi, Keisuke Sakaguchi, Peter Clark, and Antoine Bosselut. WIQA: A dataset for “what if...” reasoning over procedural text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6076–6085, Hong Kong, China, November 2019. Association for Computational Linguistics.
[108] Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4223–4232, 2018.
[109] Yao-Hung Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. Multimodal transformer for unaligned multimodal language sequences. In ACL, 2019.
[110] Ming Tu, Kevin Huang, Guangtao Wang, Jing Huang, Xiaodong He, and Bufang Zhou. Select, answer and explain: Interpretable multi-hop reading comprehension over multiple documents. In AAAI, 2019.
[111] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[112] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
[113] Shengxian Wan, Yanyan Lan, J. Guo, Jun Xu, Liang Pang, and X. Cheng. A deep architecture for semantic matching with multiple positional sentence representations. In AAAI, 2016.
[114] Bo Wang, Youjiang Xu, Yahong Han, and Richang Hong. Movie question answering: Remembering the textual cues for layered visual contents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[115] Kuan Wang, Yuyu Zhang, Diyi Yang, Le Song, and Tao Qin. GNN is a counter? Revisiting GNN for question answering. In ICLR, 2022.
[116] Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. FVQA: Fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(10):2413–2427, 2017.
[117] Siyuan Wang, Wanjun Zhong, Duyu Tang, Zhongyu Wei, Zhihao Fan, Daxin Jiang, Ming Zhou, and Nan Duan. Logic-driven context extension and data augmentation for logical reasoning of text. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1619–1629, Dublin, Ireland, May 2022. Association for Computational Linguistics.
[118] Zixu Wang, Yishu Miao, and Lucia Specia. Cross-modal generative augmentation for visual question answering. arXiv preprint arXiv:2105.04780, 2021.
[119] Dirk Weissenborn, Georg Wiese, and Laura Seiffe. Making neural QA as simple as possible but not simpler. In CoNLL, 2017.
[120] Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302, 2018.
[121] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.
[122] Yunxuan Xiao, Yanru Qu, Lin Qiu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. Dynamically fused graph network for multi-hop reasoning. In ACL, 2019.
[123] Bingbing Xu, Huawei Shen, Qi Cao, Yunqi Qiu, and Xueqi Cheng. Graph wavelet neural network. In ICLR, 2019.
[124] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS, 2019.
[125] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP, 2018.
[126] Liang Yao, Chengsheng Mao, and Yuan Luo. Graph convolutional networks for text classification. In AAAI, 2019.
[127] Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec. QA-GNN: Reasoning with language models and knowledge graphs for question answering. In NAACL, 2021.
[128] D. Ye, Yankai Lin, Deming Ye, Zhenghao Liu, Z. Liu, and Maosong Sun. Multi-paragraph reasoning with knowledge-enhanced graph neural network. ArXiv, abs/1911.02170, 2019.
[129] Rex Ying, Ruining He, K. Chen, Pong Eksombatchai, William L. Hamilton, and J. Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.
[130] Amir Zadeh, Michael Chan, Paul Pu Liang, Edmund Tong, and Louis-Philippe Morency. Social-IQ: A question answering benchmark for artificial social intelligence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8807–8817, 2019.
[131] Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga, Hongyu Ren, Percy Liang, Christopher D. Manning, and Jure Leskovec. GreaseLM: Graph reasoning enhanced language models. In ICLR, 2022.
[132] Zhuosheng Zhang, Yu-Wei Wu, Hai Zhao, Zuchao Li, Shuailiang Zhang, Xi Zhou, and Xiaodong Zhou. Semantics-aware BERT for language understanding. In AAAI, 2019.
[133] Chen Zheng, Quan Guo, and Parisa Kordjamshidi. Cross-modality relevance for reasoning on language and vision. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7642–7651. Association for Computational Linguistics, July 2020.
[134] Chen Zheng and Parisa Kordjamshidi. SRLGRN: Semantic role labeling graph reasoning network. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8881–8891, Online, November 2020. Association for Computational Linguistics.
[135] Chen Zheng and Parisa Kordjamshidi. Relational gating for “what if” reasoning. In Zhi-Hua Zhou, editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 4015–4022. International Joint Conferences on Artificial Intelligence Organization, August 2021. Main Track.
[136] Chen Zheng and Parisa Kordjamshidi. Dynamic relevance graph network for knowledge-aware question answering. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1357–1366, Gyeongju, Republic of Korea, October 2022. International Committee on Computational Linguistics.
[137] Chen Zheng and Parisa Kordjamshidi. Relevant CommonSense subgraphs for “what if...” procedural reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1927–1933, Dublin, Ireland, May 2022. Association for Computational Linguistics.
[138] Chen Zheng, Yu Sun, Shengxian Wan, and Dianhai Yu. RLTM: An efficient neural IR framework for long documents. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 5457–5463. International Joint Conferences on Artificial Intelligence Organization, July 2019.
[139] Victor Zhong, Caiming Xiong, Nitish Shirish Keskar, and Richard Socher. Coarse-grain fine-grain coattention network for multi-evidence question answering. In ICLR, 2019.
[140] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and M. Sun. Graph neural networks: A review of methods and applications. AI Open, 1:57–81, 2020.
[141] Jie Zhou and Wei Xu. End-to-end learning of semantic role labeling using recurrent neural networks. In ACL, 2015.
[142] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27, 2015.