SPATIAL LANGUAGE UNDERSTANDING: DEEP LEARNING, REASONING, AND EVALUATION

By

Roshanak Mirzaee Mazrae

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy

2024

ABSTRACT

Spatial language understanding plays an essential role in human communication and perception of the physical world. It encompasses how people describe, understand, and communicate spatial relationships between objects and environmental entities, such as location, orientation, distance, and relative position. Spatial language processing presents numerous challenges, which often stem from the inherent ambiguity of natural language in describing spatial relations or from the complexity of the spatial reasoning needed to infer indirect relations, particularly when multi-hop reasoning is required. Despite the remarkable achievements of pretrained language models (PLMs) in various natural language processing (NLP) tasks, their effectiveness in spatial language processing has not yet been thoroughly examined. Therefore, in this thesis, we first aim to evaluate these models' performance in multi-hop spatial reasoning. Second, we propose deep learning methods and models that achieve better multi-hop spatial reasoning performance in both controlled and real-world settings. As a result, this thesis makes four main contributions to the understanding of and reasoning over spatial language. The first contribution is a set of novel question-answering benchmarks for evaluating the spatial reasoning capability of deep neural models. These benchmarks include complex and realistic spatial phenomena not covered in previous work, making them more challenging for state-of-the-art PLMs. The second contribution is an approach to generating large-scale distant supervision for the spatial question answering and spatial role labeling tasks to enhance models' spatial language understanding. We design grammar and reasoning rules to automatically generate spatial descriptions of scenes and corresponding QA pairs. In this approach, we integrate a diverse set of spatial relation types and expressions, complemented by additional functions, to enhance the flexibility and extensibility of the data generation process. Further training PLMs on this data significantly improves their spatial understanding capability, enabling them to better solve other benchmarks and external datasets. The third contribution explores the potential benefits of disentangling the processes of information extraction and reasoning in neural models to address the challenges of multi-hop spatial reasoning. To explore this, we design various models that disentangle extraction and reasoning (either symbolic or neural) and compare them with state-of-the-art baselines that have no explicit design for these parts. Our experimental results consistently demonstrate the efficacy of disentangling, showcasing its ability to enhance models' generalizability within realistic data domains. Ultimately, the fourth contribution investigates the role of Large Language Models (LLMs) in multi-hop spatial reasoning tasks, focusing on their performance with and without in-context learning. In addition, we integrate LLMs as extraction modules within a pipeline of extraction and symbolic reasoning.
While our case studies in controlled environments indicate the benefits of this idea, our experiments in real-world settings reveal that the model's efficiency decreases due to escalating errors in the extraction process. We also utilize probabilistic logical reasoning and LLMs' commonsense knowledge, improving the pipeline model's performance in real-world applications. Despite these enhancements, the pipeline model continues to exhibit inferior performance compared to standalone LLMs.

Copyright by
ROSHANAK MIRZAEE MAZRAE
2024

In the name of God,

ACKNOWLEDGEMENTS

Embarking on the journey to complete my PhD has been a tremendous undertaking that would not have been possible without the unwavering support and encouragement of numerous individuals who have touched my life in profound ways. First and foremost, I express my deepest gratitude to my parents, Nasser and Fatemeh. Their boundless love, tireless belief in me, and guidance have been the cornerstone of my academic pursuits since the very beginning. They not only instilled in me the courage to dream big and the resilience to pursue those dreams tirelessly, but also made countless sacrifices to ensure I had the opportunities I needed to succeed. To my spouse, teammate, and most importantly, best friend, Hossein, who has been my pillar of strength, confidant, and closest ally throughout this challenging journey, I owe an immeasurable debt of gratitude. His unwavering support was a light through the darkest times, inspiring me to persevere and push beyond my limits. Additionally, I extend my heartfelt thanks to my sisters, Faranak and Baharak, and my friends, whose willingness to lend a listening ear or a helping hand has made all the difference. To all those who have supported me, believed in me, and encouraged me along the way, I offer my sincerest thanks. Their faith in me has been a driving force behind my success, and I am profoundly grateful for their role in shaping my academic journey.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION . . . 1
  1.1 Motivation . . . 1
  1.2 Challenges and Contributions . . . 3
CHAPTER 2 BACKGROUND KNOWLEDGE AND RELATED WORKS . . . 7
  2.1 Information Extraction . . . 7
  2.2 Question Answering Task . . . 10
  2.3 Pretrained Language Models . . . 13
  2.4 Transfer Learning . . . 16
  2.5 Deterministic Vs. Probabilistic Reasoning . . . 16
CHAPTER 3 EVALUATION OF PRETRAINED LANGUAGE MODELS . . . 18
  3.1 Introduction . . . 18
  3.2 SpartQA-Human Collection . . . 20
  3.3 SpartQA-Auto . . . 23
  3.4 Spatial Question Answering Models . . . 29
  3.5 Experiments and Results . . . 31
  3.6 Related Research . . . 35
  3.7 Summary . . . 36
CHAPTER 4 TRANSFER LEARNING FOR SPATIAL QUESTION ANSWERING . . . 37
  4.1 Introduction . . . 37
  4.2 SpaRTUN: A Synthetic Data with Broader Relation Coverage . . . 39
  4.3 Spatial Question Answering in Real-World Domain . . . 45
  4.4 Experiments and Results . . . 47
  4.5 Related Research . . . 56
  4.6 Summary . . . 57
CHAPTER 5 SPATIAL ROLE LABELING MODELS . . . 58
  5.1 Introduction . . . 58
  5.2 Models . . . 58
  5.3 Experiments and Results . . . 60
  5.4 Related Research . . . 62
  5.5 Summary . . . 62
CHAPTER 6 DISENTANGLING EXTRACTION AND REASONING IN MULTI-HOP SPATIAL REASONING . . . 64
  6.1 Introduction . . . 64
  6.2 Related Research . . . 66
  6.3 Proposed Models . . . 67
  6.4 Experiments . . . 72
  6.5 Results and Discussion . . . 74
  6.6 Summary . . . 81
CHAPTER 7 EXPLORING SPATIAL REASONING WITH LARGE LANGUAGE MODELS . . . 83
  7.1 Introduction . . . 83
  7.2 Large Language Models (LLM) . . . 85
  7.3 Probabilistic Spatial Reasoning . . . 88
  7.4 Experiments and Results . . . 91
  7.5 Summary . . . 102
CHAPTER 8 CONCLUSION AND FUTURE DIRECTIONS . . . 103
  8.1 Summary of Contributions . . . 103
  8.2 Future Directions . . . 105
PUBLISHED PAPERS . . . 107
BIBLIOGRAPHY . . . 108

CHAPTER 1 INTRODUCTION

1.1 Motivation

Spatial language understanding plays a crucial role in human communication and perception of the physical world. Spatial language refers to the way people describe, understand, and use the spatial relationships between objects and entities in the environment, such as location, orientation, distance, and relative position. Understanding spatial language is important in many natural language processing (NLP) applications, such as navigation [156, 155, 9, 111, 52], the medical domain [16, 47, 81], robotics [134, 49], human-machine interaction [60, 111], dialogue systems [130], and natural language interfaces for smart homes and buildings [96, 48, 63].
These applications can become more intuitive and effective by enabling machines to understand and interpret spatial language. In addition, spatial language understanding can provide insights into human cognition and perception. On the other hand, recent transformer-based pretrained language models1 (PLMs) have already demonstrated exceptional performance in various natural language processing tasks. However, their capacity for spatial language understanding and reasoning remains uncertain. As a result, this thesis seeks to evaluate the strengths and weaknesses of PLMs in this context. Following this assessment, our next goal is to develop methods and models with enhanced spatial language understanding. This could involve improving the spatial reasoning capabilities of existing models or designing entirely new architectures to achieve better performance. Ultimately, these advances will contribute to developing more effective systems that rely on spatial language understanding.

1 For brevity, we will refer to these as language models or PLMs.

We describe spatial language understanding as identifying explicit spatial concepts, such as entities and relations directly mentioned in the text, followed by reasoning to deduce implicit relations that can be logically inferred. We term the former task Spatial Role Labeling (SpRL) and the latter Spatial Reasoning. Figure 1.1 shows an example of Spatial Role Labeling to find explicit relations, such as "FRONT(a grey car, a grey house)", and Spatial Reasoning to find implicit ones, such as "BEHIND(the plants, the car)". Extracting implicit relations requires knowledge of spatial rules over various spatial relation types. For instance, logically, we know that if "a grey car" is in front of "a grey house", then "the car" will be behind "the house". This is an example of a reasoning rule, also known as the inverse rule.

Figure 1.1 Spatial QA example from the ReSQ dataset. To do the reasoning, we start by searching for the plants in the question triplet, extract the explicit triplets (1, 2), use spatial rules (e.g., INVERSE) to infer implicit triplets (3, 4, 5), and in the last step use 5 and 6 to answer the question triplet. NTPP: Non-Tangential Proper Part (Table 2.1).

Traditionally, evaluating general language understanding has involved formal spatial semantic extractions and mapping language to symbolic representations that machines can reason over. However, in recent years, question answering (QA) has been used as a criterion for evaluating deep neural models, such as language models. In the QA setting, the model is given both the textual context and the question, and the latent layers of these models are expected to perform extraction and reasoning to predict the final answer. For example, given the context and question in Figure 1.1, a language model [119, 70] can predict the final answer (Yes), which means that the BEHIND relation holds between the plants and the car, without explicit extraction and reasoning.
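To make the triplet representation and the inverse rule above concrete, here is a minimal Python sketch. It is illustrative only and not the thesis's implementation; the relation names follow Table 2.1, and the facts are those of the Figure 1.1 example.

```python
# A minimal sketch (not the thesis's implementation) of the inference step
# illustrated in Figure 1.1: spatial facts are stored as (relation, arg1, arg2)
# triplets, and the INVERSE rule maps a relation to its converse.
INVERSE = {
    "FRONT": "BEHIND", "BEHIND": "FRONT",
    "LEFT": "RIGHT", "RIGHT": "LEFT",
    "ABOVE": "BELOW", "BELOW": "ABOVE",
    "NTPP": "NTPPI", "NTPPI": "NTPP",
}

def apply_inverse(facts):
    """Add the converse of every triplet, e.g.
    FRONT(a grey car, a grey house) -> BEHIND(a grey house, a grey car)."""
    return facts | {(INVERSE[r], y, x) for (r, x, y) in facts if r in INVERSE}

# Explicit triplets extracted (SpRL) from the story in Figure 1.1.
explicit = {("FRONT", "a grey car", "a grey house"),
            ("NTPP", "plants", "a grey house")}
implicit = apply_inverse(explicit)
print(("BEHIND", "a grey house", "a grey car") in implicit)  # True
```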
In this work, we follow both evaluation paradigms to address the challenges of spatial language understanding. We explore explicit semantic representations and reasoning and their influence on question answering when complex reasoning is required. In the remainder of this chapter, we discuss the existing challenges in this research area and outline our approach to addressing them.

1.2 Challenges and Contributions

Spatial language processing tasks entail numerous challenges, which may stem from, first, the ambiguity of natural language in describing relations and, second, the complexity of the spatial reasoning needed to infer indirect relations. This study focuses on the latter case, specifically the multi-hop spatial reasoning challenge in question answering. In the following, we identify the two main challenges addressed in this study and our four contributions for tackling them.

Challenge 1: Lack of Evaluation Benchmark. Modern language models (LM), e.g., BERT [18], ALBERT [59], and XLNet [147], have seen great success in natural language processing (NLP). However, there has been limited investigation into the spatial reasoning capabilities of LMs. To the best of our knowledge, bAbI [143] is the only dataset with direct textual spatial question answering2, and it is synthetic and overly simplified. In particular, (1) the underlying scenes are spatially simple, with only three objects and relations in only four directions. (2) The stories for these scenes are two short, templated sentences, each describing a single relation between two objects. (3) The questions typically require up to two steps of reasoning due to the simplicity of those stories. Hence, the evaluation of LMs on this dataset does not demonstrate their actual spatial reasoning capability.

2 At the time of this project.

Contribution 1: To address these issues, this thesis proposes three new datasets. The first one is a human-curated benchmark, SpartQA-Human. Specifically, this dataset provides the following characteristics: (1) it is built on images containing more objects with richer spatial structures. (2) Its stories are more natural, have more sentences, and are richer in spatial relations in each sentence. (3) Its questions require deeper reasoning and have four types: find relation (FR), find blocks (FB), choose object (CO), and yes/no (YN), which allows for more fine-grained analysis of models' capabilities. Using this dataset, we evaluate the performance of modern LMs on multi-hop spatial reasoning.

The human-generated datasets, despite helping study the problem as evaluation benchmarks, are less helpful for training models that can reliably understand spatial language due to their small size. As a solution, this work proposes a way to obtain large synthetic supervision signals for spatial reasoning, called SpartQA-Auto. As spatial relationships are rarely mentioned in existing corpora, we take advantage of spatial language being grounded in the geometry of visual scenes. Thus, we generate stories for existing synthetic images [123] automatically via our newly designed context-free grammars (CFG) and context-sensitive rules. Then, we evaluate LMs' spatial reasoning capability on this large and complex benchmark. We also provide another human-generated dataset, named ReSQ, to evaluate neural models on real-world spatial question answering. Compared to SpartQA-Human, this benchmark is more readable for humans and needs both spatial and commonsense reasoning to be answered.

Challenge 2: Complexity of Multi-hop Spatial Reasoning in Neural Models.
After evaluating the spatial reasoning capability of language models, enhancing this capability is a challenging and complex task that requires a multidisciplinary approach involving advances in dataset creation, neural network architectures, and training methodologies. As we mentioned before, deep neural models rely on a large amount of training data, which may not always be available in practical scenarios (e.g., SpartQA-Human, ReSQ). Besides, the existing large synthetic datasets (e.g., bAbI, SpartQA-Auto) often focus on specific types of relations with small coverage of the spatial semantics needed for spatial language understanding in various domains.

Contribution 2: As a solution, we use transfer learning as an effective method [109, 2, 77] for pretraining and further fine-tuning PLMs on our target tasks. Transfer learning is a simple way of using synthetic data as an extra source of supervision at no annotation cost. SpartQA-Auto proves to be a rich source of spatial knowledge that improved the performance of LMs on SpartQA-Human and on different data domains. We also build another synthetic dataset, called SpaRTUN, to serve as a source of supervision with broad coverage of spatial relation types and expressions (which is easily extendable to other reasoning tasks). We evaluate the transferability of the models pretrained with SpaRTUN on multiple spatial question answering and spatial information extraction benchmarks and show significant improvements over SOTA results.

Contribution 3: While transfer learning helps improve spatial reasoning, the challenge still remains due to the significant gap between human and machine performance. As another direction for solving spatial reasoning, we investigate the interoperability between extraction and reasoning and evaluate whether disentangling information extraction and reasoning at the modeling level can help solve spatial reasoning tasks. To explore this, we design various models that disentangle extraction and reasoning (either symbolic or neural) and compare them with state-of-the-art (SOTA) baselines with no explicit design for these parts. We assess the performance of our proposed models using various spatial question answering (SQA) datasets, showcasing the efficiency of our methods in both controlled and real-world settings. In controlled environments, our framework, which used separated extraction and symbolic reasoning, outperformed existing SOTA models by a significant margin, even when using the same or less training data. However, in real-world situations characterized by higher ambiguity in natural language and more complex rules, our model with distinct layers for extraction and reasoning demonstrated superior performance.

Contribution 4: In our final contribution, we investigate the role and effectiveness of Large Language Models (LLMs) in spatial reasoning tasks. First, we evaluate them as end-to-end models on the SQA task and highlight their shortcomings. Considering the strength of LLMs in information extraction, we also utilize LLMs as the extraction module within a pipeline of disentangled extraction and reasoning. In this model, the LLM is responsible for extracting direct and commonsense spatial relations from the text, which are then passed to deterministic or probabilistic spatial reasoners. Our case studies and prior experimental results in controlled environments demonstrate that this methodology effectively addresses spatial reasoning problems that end-to-end LLMs find challenging.
Our experiments in a realistic environment demonstrate that probabilistic reasoning outperforms deterministic reasoning in this pipeline, and that employing LLMs' commonsense knowledge enhances its performance by covering commonsense rules and relations missed by the designed spatial reasoners. However, considering the inherent challenges of LLMs in extracting spatial information from ambiguous natural language, this approach faces limitations in outperforming end-to-end LLMs in real-world settings.

CHAPTER 2 BACKGROUND KNOWLEDGE AND RELATED WORKS

This chapter aims to provide a comprehensive overview of the literature and research relevant to our work. We start with question answering and information extraction, two fundamental natural language processing tasks. Question answering involves answering questions posed in natural language [107, 148], while information extraction involves identifying and extracting relevant information from unstructured text [151]. Specifically, we focus on two spatial tasks: spatial question answering (useful for spatial reasoning) and spatial information extraction (a.k.a. spatial role labeling (SpRL)), where the goal is to answer questions or extract information about spatial entities, such as locations, directions, and distances. This chapter surveys related work on these tasks, including existing benchmarks and recent advances in deep learning methods, such as transformers and transfer learning, a widespread technique in natural language processing for leveraging pretrained models to improve performance on downstream tasks. The chapter concludes by describing deterministic and probabilistic reasoning and their distinctions.

2.1 Information Extraction

Information extraction (IE) is the process of automatically extracting structured information from unstructured or semi-structured text [114]. This task aims to identify and extract specific information, such as named entities, events, and relationships between entities. One key challenge in IE is dealing with the variability and ambiguity of natural language text. To address this challenge, researchers have developed various approaches, including rule-based methods [145, 140, 28], statistical methods [141], and deep learning methods [66, 26]. These approaches have been applied to various areas, including news articles [153, 44] and social media data [75, 7]. Several benchmarks have been provided for this task, such as DocRED [151], SciERC [73], TAC KBP 2017 [31], and ACE 2005 [136]. These benchmarks have been used to evaluate various IE models that have achieved SOTA performance. Below are some examples of models proposed to tackle IE challenges:

BERT-based models: BERT is a pretrained language model (see Section 2.3) fine-tuned for various IE tasks, including named entity recognition and relation extraction [18]. Several studies have shown that BERT-based models can achieve SOTA performance on IE tasks (e.g., [68, 119]).

Graph-based models: Graph-based models have been proposed for relation extraction tasks, where entities are represented as nodes in a graph and relations are represented as edges between nodes [154]. These models achieve competitive performance on relation extraction tasks.

Multi-task learning models: Multi-task learning models can jointly learn multiple IE tasks, such as named entity recognition and relation extraction, using a single model [73]. These models achieve competitive performance on multiple IE tasks.
Few-shot learning models: Few-shot learning models can be used to adapt to new IE tasks with limited labeled data. For example, GPT-3 (Generative Pre-trained Transformer 3) performs well on named entity recognition and relation extraction tasks with only a few labeled examples [8].

2.1.1 Spatial Information Extraction

Figure 2.1 An example from the mSpRL benchmark with spatial role labeling annotation.

Spatial information extraction, a.k.a. spatial role labeling (SpRL) [54, 56], is the task of identifying and classifying the spatial roles (Trajector, Landmark, and Spatial indicator) and their relations. We call the former spatial role extraction and the latter spatial relation1 extraction. Initially, we provide an overview of the spatial roles employed in this task. An example of SpRL identification is shown in Figure 2.1. The study by [56] defines three distinct roles: Trajector, Landmark, and Spatial indicator.

1 In different works, like [54], the triplet and relation are used interchangeably.

Trajector: an entity referred to by its location or position. It can be static or dynamic, referring to people, objects, or events. In this study, we only focus on the static form of Trajectors.

Landmark: the reference entity with respect to which the location or motion of the Trajector is indicated.

Spatial indicator: an expression that establishes constraints on the spatial properties of the Trajector, such as its location relative to the Landmark (e.g., in, on).

Spatial relation: a triplet of (Trajector, Spatial indicator, Landmark) with an assigned relation type, which shows the relation between the Trajector and the Landmark expressed by the Spatial indicator.

SpRL is particularly challenging due to the ambiguity of natural language in expressing spatial relations [6]. For instance, in Figure 1.1, it is ambiguous whether both "the window" and "plants" are on "the balcony" or only the "plants"; likewise, in the phrases "plants on the balcony" and "plants on the wall", on refers to a different position between the plants, the balcony, and the wall. As a solution to this issue, we first extract spatial relations and then map them to a formal representation (see Table 2.1), such as mapping (plants, on, the balcony) to NTPP and (plants, on, the wall) to FRONT and EC.

Only a few benchmarks exist with spatial role and relation annotations, and they are small-scale datasets for studying the SpRL problem. mSpRL [55] is a human-curated dataset provided for the SpRL task. This dataset contains spatial descriptions of real-world images collected from the ImageCLEF IAPR TC-12 dataset [33] and the corresponding SpRL annotations. An SpRL annotation can have the following spatial semantic components [162] in a static environment: trajector (the main entity), landmark (the reference entity), and spatial_indicator (the spatial term describing the relationship between the trajector and landmark). A dynamic environment can also have path, region, direction, and motion. Figure 2.1 shows an example of context and spatial role labeling from the mSpRL benchmark. The previous SOTA on this benchmark belongs to "R-Inf" [78], which uses external multi-modal resources and global inference.

SpaceEval (SemEval-2015 Task 8) [1] focuses on evaluating models for the extraction of spatial information from text. The goal of SpaceEval was to identify and classify spatial relations between entities mentioned in the text, as well as to identify spatial signals that indicate the presence of a spatial relation.
The task involved annotating a corpus of texts with spatial information, for example, identifying the types of spatial relations that exist between entities in the text, such as "located in," "part of," and "near." The corpus consisted of a set of texts from different domains, such as news articles, scientific papers, and travel guides. The previous SOTA models on this benchmark use a generation and a classification component for generating and extracting relations [139], or a model containing RoBERTa as the word representation and a CRF [95].

2.2 Question Answering Task

Question answering is a helpful format for evaluating machines' capability of reading comprehension [30], and many recent works have been implementing this strategy to test machines' understanding of linguistic formalisms [37, 85, 64, 45, 97, 21]. An important advantage of QA is using natural language to annotate natural language, thus having the flexibility to get annotations on complex phenomena such as spatial reasoning. Besides, question answering tasks can get even more challenging when requiring multi-hop reasoning. Recently, pretrained language models have achieved SOTA results on most QA benchmarks [70, 142]. We discuss language models more in the next section.

2.2.1 Spatial Question Answering

Despite the high performance of recent deep neural models on question answering (QA) tasks, solving questions that require multi-hop reasoning is still challenging [82]. Usually, various types of reasoning, such as spatial [160], temporal [106], numerical [92], commonsense [127], and many more, are needed to solve QA tasks. In this work, we focus on spatial reasoning and, as a result, on spatial question answering tasks. Below, we first describe spatial reasoning and then describe it as the task of spatial question answering.

Formalism (General Type / Specific value) | Spatial type (Spatial value) | Expressions (e.g.)
Topological / RCC8 | DC (disconnected) | disjoint
Topological / RCC8 | EC (Externally Connected) | touching
Topological / RCC8 | PO (Partially Overlapped) | overlapped
Topological / RCC8 | EQ (Equal) | equal
Topological / RCC8 | TPP (Tangential Proper Part) | covered by
Topological / RCC8 | NTPP (Non-Tangential Proper Part) | in, inside
Topological / RCC8 | TPPI (Tangential Proper Part inverse) | covers
Topological / RCC8 | NTPPI (Non-Tangential Proper Part inverse) | has
Directional / Relative | LEFT, RIGHT | left of, right of
Directional / Relative | BELOW, ABOVE | under, over
Directional / Relative | BEHIND, FRONT | behind, in front
Distance / Qualitative | Far, Near | far, close

Table 2.1 Spatial relation types and examples of spatial language expressions.

2.2.1.1 Spatial Reasoning

Spatial reasoning over text can be described as inferring the implicit2 spatial relations from the direct relations described in the text. To better understand spatial reasoning, we must first describe two key concepts: spatial relation types and rules. Table 2.1 presents sixteen 3D spatial relation types collected from existing resources [108, 144, 110]. These relations have been categorized into three groups: Directional, Topological, and Distance relations. Directional relations describe how objects or entities are positioned relative to each other in terms of orientation or direction. Topological relations refer to spatial relationships between objects or entities based on their relative position and shape rather than their orientation or direction. These relationships describe how objects or entities are spatially connected to each other, such as touching, overlapping, or containing. Figure 2.2 shows RCC8, the eight different topological relations described in [110].
Distance relations are scalar and can be expressed either qualitatively, as in "near" or "far," or quantitatively, as in, for instance, "two meters far" [54]. This work, however, concentrates only on the former type of expression.

2 By implicit, we mean indirect relations, not metaphoric usages or implicit meanings of the relations.

Figure 2.2 Eight topological relations in RCC8. Images from Wikipedia.

Spatial rules refer to the logical principles that govern how spatial relations operate and interact with each other. These rules can assist in reasoning about and inferring spatial relations from text [144, 20]. For example, the transitivity rule for topological relations states that if object A is "inside" object B, and object B is "inside" object C, then object A is "inside" object C. Similarly, the inverse rule for directional relations states that if object A is "behind" object B, then object B is "in front of" A. Understanding these spatial rules is crucial for accurate spatial language understanding and reasoning. We define five spatial rules in our work, Not, Inverse, Symmetry, Transitivity, and Combination, all of which are defined in Table 2.2.

Rule | Condition | Definition
Not | ∀(X, Y) ∈ Entities, R ∈ {Dir ∨ PP} | IF R(X, Y) ⇒ NOT(R_reverse(X, Y))
Inverse | ∀(X, Y) ∈ Entities, R ∈ {Dir ∨ PP} | IF R(Y, X) ⇒ R_reverse(X, Y)
Symmetry | ∀(X, Y) ∈ Entities, R ∈ {Dis ∨ (RCC − PP)} | IF R(Y, X) ⇒ R(X, Y)
Transitivity | ∀(X, Y, Z) ∈ Entities, R ∈ {Dir ∨ PP} | IF R(X, Z), R(Z, Y) ⇒ R(X, Y)
Combination | ∀(X, Y, Z, H) ∈ Entities, R ∈ Dir, ∗PP ∈ PP | IF ∗PP(X, Z), R(Z, H), ∗PPi(H, Y) ⇒ R(X, Y)

Table 2.2 Designed spatial rules. Dir: Directional relations (e.g., LEFT), Dis: Distance relations (e.g., FAR), PP: all proper part relations (NTPP, NTPPI, TPPI, TPP), RCC − PP: all RCC8 relations except the proper part relations. ∗PP: one of TPP or NTPP. ∗PPi: one of NTPPI or TPPI.

Given a set of spatial primitives, relation types, and rules, we define spatial reasoning as the procedure of applying spatial rules to explicit relation types in written language, aiming to infer implicit relations. In spatial question answering, given a scene description, the task is to answer questions about the spatial relations between entities. Here, we focus on challenging questions that need multi-hop spatial reasoning over explicit relations. Figure 1.1 shows an example of the spatial reasoning process over a given text. To find the answer to the question, we first need to identify all spatial entities and relations (SpRL) and then apply rules to find the asked relation.

Spatial reasoning phenomena have been covered minimally in the existing benchmarks we mentioned. To the best of our knowledge, Tasks 17 and 19 of the bAbI project [143] formed the only QA dataset focused on textual spatial reasoning until our project. Task 17 is on spatial reasoning and contains binary Yes/No questions. Task 19 is on pathfinding and contains FR questions with answers in the {West, East, North, South} set. The original dataset contains west, east, north, and south, which we mapped to their corresponding relative relation types. bAbI is synthetic and does not reflect the complexity of spatial reasoning in natural language. Solving Tasks 17 and 19 of bAbI typically does not require sophisticated reasoning, which is an important capability emphasized by more recent works (e.g., [23, 50, 148, 14, 97]).
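To make the mechanics of these rules concrete, the following is a minimal Python sketch of forward chaining over spatial facts. It is illustrative only and not the thesis's reasoner; relation names follow Table 2.1, only a subset of the rules in Table 2.2 is implemented, and the Combination rule appears in a simplified form that propagates a directional relation from a containing object to the objects inside it.

```python
# A minimal sketch of rule-based spatial inference by forward chaining,
# in the spirit of Table 2.2 (not the thesis's implementation).
DIR = {"LEFT": "RIGHT", "RIGHT": "LEFT", "ABOVE": "BELOW",
       "BELOW": "ABOVE", "FRONT": "BEHIND", "BEHIND": "FRONT"}
PP = {"TPP": "TPPI", "NTPP": "NTPPI", "TPPI": "TPP", "NTPPI": "NTPP"}
SYMMETRIC = {"FAR", "NEAR", "DC", "EC", "PO", "EQ"}  # Dis and RCC8 minus PP

def step(facts):
    """Apply one pass of the Inverse, Symmetry, Transitivity, and simplified
    Combination rules to a set of (relation, arg1, arg2) triplets."""
    new = set(facts)
    for (r, x, y) in facts:
        if r in DIR: new.add((DIR[r], y, x))          # Inverse (directional)
        if r in PP:  new.add((PP[r], y, x))           # Inverse (proper parts)
        if r in SYMMETRIC: new.add((r, y, x))         # Symmetry
        for (r2, x2, y2) in facts:
            if y != x2:
                continue
            if r in DIR and r2 == r:                  # Transitivity (directional)
                new.add((r, x, y2))
            if r in ("TPP", "NTPP") and r2 in DIR:    # Simplified Combination
                new.add((r2, x, y2))
    return new

def saturate(facts):
    """Repeat rule application until no new facts are derived."""
    while True:
        nxt = step(facts)
        if nxt == facts:
            return facts
        facts = nxt

facts = saturate({("NTPP", "plants", "house"), ("FRONT", "car", "house")})
print(("BEHIND", "plants", "car") in facts)  # True: Inverse, then Combination
```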
2.3 Pretrained Language Models

Pretrained language models (PLMs) are large neural network models that have been trained on vast amounts of text data, such as the entire text of Wikipedia or the Common Crawl web corpus, in an unsupervised or supervised way. These models use transformers to learn the underlying patterns and structure of natural language and can then be fine-tuned for specific NLP tasks such as question answering [107], information extraction [75], or sentiment analysis [79].

2.3.1 Transformer

The transformer is a deep neural network architecture introduced in [133] that has significantly impacted NLP tasks. Unlike previous neural network architectures, which relied on recurrent neural networks (RNNs) or convolutional neural networks (CNNs), transformers use an attention mechanism to capture dependencies between different parts of a sequence of inputs. The transformer architecture enables the model to learn long-range dependencies between different parts of a sentence, allowing it to capture the context and meaning of words in a way that previous neural network architectures could not. The basic building block of a transformer is the self-attention mechanism, which computes the importance of each word in a sequence relative to every other word in the sequence. This allows the model to focus on the most relevant words in the sequence when making predictions. The transformer architecture consists of a series of encoder layers to process the input sequence and decoder layers to generate the output sequence. Each layer contains a multi-head self-attention mechanism, which allows the model to attend to different parts of the sequence simultaneously, as well as a feed-forward neural network that processes the output of the self-attention mechanism. During training, the model is typically pretrained on a large corpus of text using an unsupervised or supervised learning task, such as masked language modeling (MLM) or next sentence prediction. This pretraining enables the model to learn a general representation of natural language that can be fine-tuned on specific downstream tasks with smaller datasets. During inference, the model takes an input sequence and generates an output sequence by iteratively processing the sequence through the encoder and decoder layers. Overall, transformers have revolutionized natural language processing by enabling models to better capture the complex relationships between different parts of a sequence and achieve state-of-the-art performance on a range of NLP benchmarks.

Once pretrained, language models that use the transformer's flexibility and effectiveness can be fine-tuned on specific NLP tasks with much smaller amounts of data. Fine-tuning involves training the model on a smaller, task-specific dataset to improve its performance on that particular task. Using pretrained language models as a starting point, NLP models can achieve state-of-the-art performance on a wide range of tasks with less data and training time than would be required to train a model from scratch. Some examples of popular pretrained language models are:

BERT (Bidirectional Encoder Representations from Transformers) is a pretrained language model that uses a bidirectional transformer to learn contextual relations between words in a text. BERT uses bidirectional context, where the model is trained to predict missing words in a sentence based on the surrounding words on both sides rather than just the words that come before or after the missing word.
This allows BERT to capture more complex relationships between words and produce more accurate predictions. BERT has achieved SOTA performance on various NLP tasks, including question answering.

RoBERTa (Robustly Optimized BERT Pretraining Approach) [70] is an improved version of BERT. RoBERTa uses a larger training corpus and a longer training time to perform even better than BERT on several NLP benchmarks, including question answering.

ALBERT (A Lite BERT) is a modification to improve efficiency and reduce the number of parameters in BERT, which makes it a useful model for low-resource environments and applications where memory and computation resources are limited. ALBERT performs better than BERT on several NLP tasks while using fewer parameters.

XLNet [147] is another pretrained language model that uses a permutation-based approach to capture both forward and backward dependencies between words in a text. While BERT predicts masked words given their surrounding context, XLNet takes a different approach and uses an auto-regressive model to predict the probability of each word in the sequence given all of the previous words. This enables the model to capture bidirectional relationships between the words in the sequence without needing an MLM objective. XLNet's training objective, called "permutation language modeling," samples a random permutation of the input sequence and trains the model to predict the probability of the next word in the sequence given the previous words in the sampled permutation.

T5 (Text-to-Text Transfer Transformer) [104] is a transformer-based model designed to generalize across a wide range of natural language tasks by casting them all as a text-to-text problem, where the model maps a given input text to an output text. This approach allows T5 to generalize to new tasks without additional training by simply changing the task representation in the input.

GPT-3 (Generative Pre-trained Transformer 3) [8] is an autoregressive language model that uses a transformer-based architecture. It is trained using a general language modeling objective, where the model is trained to predict the next word in a given sequence of text. This unsupervised pretraining allows the model to learn a rich representation of language, capturing both local and global dependencies in text. GPT-3 is notable for its massive size, with 175 billion parameters, making it the largest publicly available language model at the time of its release. While GPT-3 has shown significant improvements in language generation and understanding, it also has limitations, such as a tendency to generate biased or offensive language and difficulty understanding context and sarcasm in text. Nevertheless, GPT-3 has shown the potential for large-scale language models (LLMs) to transform the field of natural language processing and enable a wide range of new applications.

2.4 Transfer Learning

Transfer learning has been used effectively in different NLP tasks to further fine-tune PLMs [25, 120, 109, 2, 77]. It is a machine learning technique in which a model is first trained on a large dataset or task, and the knowledge gained from that training is then applied to a different, usually smaller, dataset or task. This approach can be more efficient than training a model from scratch on a smaller dataset or task, as the pretrained model already has a rich representation of the underlying features of the data.
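As a concrete illustration of this two-stage recipe, the sketch below first trains a PLM on a large synthetic QA set and then fine-tunes the same weights on a small target set. It assumes the HuggingFace Transformers and PyTorch APIs and uses toy in-line data; neither the library choice nor the data format is prescribed by the thesis.

```python
# A minimal transfer-learning sketch (assumed APIs: HuggingFace Transformers,
# PyTorch): step 1 trains on large synthetic QA data, step 2 fine-tunes on the
# small target set. Data here is toy/hypothetical.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = {"Yes": 0, "No": 1, "DK": 2}
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS))

def encode(batch):
    # Each example is (story, question, answer); story and question are paired
    # in a single input, as in standard QA fine-tuning of encoder PLMs.
    stories, questions, answers = zip(*batch)
    enc = tok(list(stories), list(questions),
              padding=True, truncation=True, return_tensors="pt")
    enc["labels"] = torch.tensor([LABELS[a] for a in answers])
    return enc

def train(model, examples, epochs=1, lr=2e-5):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(examples, batch_size=8, shuffle=True, collate_fn=encode)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss
            loss.backward()
            opt.step()
            opt.zero_grad()

# Hypothetical data: synthetic_qa stands in for a large auto-generated set,
# target_qa for a small human-annotated set (e.g., SpartQA-Human).
synthetic_qa = [("A circle is above a square.", "Is the square below the circle?", "Yes")]
target_qa = [("There are three blocks A, B, and C.", "Is there a square in A?", "DK")]

train(model, synthetic_qa)  # step 1: extra supervision from synthetic data
train(model, target_qa)     # step 2: fine-tune on the target task
```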
Using pretrained language models such as BERT or GPT-3, which have already been pretrained on large corpora, and then fine-tuning them on a smaller, labeled dataset for a specific task is a prevalent approach in today's NLP research. By leveraging the pretrained language model's knowledge of the language, the fine-tuned model can often achieve SOTA performance with much less training data than would be required to train a model from scratch. Besides transfer learning, several other approaches are used to tackle the lack of training data in various NLP areas, such as providing techniques to label unlabeled data [24], using semi-supervised models [132, 65], or data augmentation with synthetic data [67, 86]. However, transfer learning is a simple way of using synthetic data as an extra source of supervision at no annotation cost. Compared to the augmentation methods, the data used for transfer learning only needs to be close to the target task/domain [76] and not necessarily the same.

2.5 Deterministic Vs. Probabilistic Reasoning

Deterministic reasoning involves making decisions based on certain, fixed rules or logic. In this approach, the outcome of a given situation is determined by the specific conditions and rules applied to it. If the same conditions and rules are applied again, the outcome will be the same every time. Deterministic reasoning is often used when the rules and relationships between variables are well known and do not change.

Probabilistic reasoning involves making decisions based on probabilities and uncertainty. In this approach, the outcome of a given situation is not fixed and can vary depending on the probabilities of different events. Probabilistic reasoning is often used in uncertain scenarios or when the rules and relationships between different variables are not well known.

The main difference between deterministic and probabilistic reasoning is how they deal with uncertainty. Deterministic reasoning assumes that the outcome of a situation is certain and can be predicted exactly if the rules and conditions are known. Probabilistic reasoning acknowledges that the outcome of a situation may be uncertain and can vary based on probabilities.

CHAPTER 3 EVALUATION OF PRETRAINED LANGUAGE MODELS

3.1 Introduction

Modern language models (LM), e.g., BERT [18], ALBERT [59], and XLNet [147], have seen great success in natural language processing (NLP). However, there has been limited investigation into the spatial reasoning capabilities of LMs. To the best of our knowledge, bAbI [143] (Figure 3.1) is the only dataset with direct textual spatial question answering (QA) (Task 17), but it is synthetic and overly simplified: (1) The underlying scenes are spatially simple, with only three objects and relations in only four directions. (2) The stories for these scenes are two short, templated sentences, each describing a single relation between two objects. (3) The questions typically require up to two steps of reasoning due to the simplicity of those stories.

Figure 3.1 An example of the bAbI dataset, Task 17.

To address these issues, in this chapter, we propose a new dataset, SpartQA1 (see Fig. 3.2). This benchmark contains more realistic spatial phenomena not covered by prior work and is challenging for state-of-the-art language models (LM). Specifically, (1) SpartQA is built on NLVR's [123] images containing more objects with richer spatial structures (Fig. 3.3b).
(2) SpartQA's stories are more natural, have more sentences, and are richer in spatial relations in each sentence. (3) SpartQA's questions require deeper reasoning and have four types: find relation (FR), find blocks (FB), choose object (CO), and yes/no (YN), which allows for more fine-grained analysis of models' capabilities.

1 SPAtial Reasoning on Textual Question Answering.

We showed annotators random images from NLVR and instructed them to describe objects and relationships, not exhaustively, so as to preserve naturalness (Sec. 3.2). In total, we obtained 1.1k unique QA pair annotations on spatial reasoning, evenly distributed among the aforementioned types. Similar to bAbI, we keep this dataset on a relatively small scale and suggest using as little training data as possible.

Figure 3.2 An example story and corresponding questions and answers in SpartQA-Human.

We thus also propose a way to generate a large synthetic benchmark for spatial reasoning (Sec. 3.3). As spatial relationships are rarely mentioned in existing corpora, we take advantage of the fact that spatial language is grounded in the geometry of visual scenes. We are able to generate stories for NLVR images [123] automatically via our newly designed context-free grammars (CFG) and context-sensitive rules. In the story generation process, we store the information about all objects and relationships such that QA pairs can also be generated automatically. In contrast to bAbI, we use various spatial rules to infer new relationships in these QA pairs, which requires more complex reasoning capabilities. Hereafter, we call this automatically generated dataset SpartQA-Auto and the human-annotated one SpartQA-Human.

Our contributions can be summarized as follows. First, we propose the first human-curated benchmark, SpartQA-Human, for spatial reasoning, with richer spatial phenomena than the prior synthetic dataset bAbI (Task 17). Experiments show that modern LMs (e.g., BERT) do not perform well in this low-resource setting. Second, we exploit the scene structure of images and design novel CFGs and spatial reasoning rules to automatically generate data (i.e., SpartQA-Auto) for spatial reasoning over text. Third, we also propose consistency and contrast sets to further evaluate the consistency of the spatial reasoning capability of models.

Figure 3.3 Example from SpartQA (specifically from SpartQA-Auto). (a) An example story and corresponding questions and answers in SpartQA-Auto. (b) An example NLVR image and the scene created in Fig. 3.3a, where the blocks in the NLVR image are rearranged.
3.2 SpartQA-Human Collection

In this section, to mitigate the aforementioned problems of Task 17 of bAbI, i.e., simple scenes, stories, and questions, we describe the data annotation process of SpartQA-Human and explain how those problems were addressed.

First, we randomly selected a subset of NLVR images, each of which has three blocks containing multiple objects (see Figure 3.3b). The scenes shown by these images are more complicated than those described by bAbI because (1) there are more objects in NLVR images; (2) the spatial relationships in NLVR are not limited to just four relative directions, as objects are placed arbitrarily within blocks.

Figure 3.4 For "A blue circle is above a big triangle. To the left of the big triangle, there is a square," if the question is "Is the square to the left of the blue circle?", the answer is neither Yes nor No. Thus, the correct answer is "Do not Know" (DK) in our setting.

Second, two student volunteers produced a textual description of those objects and their corresponding spatial relationships based on these images. Since the blocks are always horizontally aligned in each NLVR image, to allow for more flexibility, annotators could also rearrange these blocks (see Fig. 3.2). Relationships between objects within the same block can take the form of relative direction (e.g., left or above), qualitative distance (e.g., near or far), or topological relationship (e.g., touching or containing). However, we instructed the annotators not to describe all objects and relationships, (1) to avoid unnecessarily verbose stories, and (2) to intentionally miss some information to enable more complex reasoning later. Therefore, annotators describe only a random subset of blocks, objects, and relationships.

To query more interesting phenomena, annotators were encouraged to write questions requiring detecting relations and reasoning over them using multiple spatial rules. A spatial rule can be one of the transitivity (A → B, B → C ⇒ A → C), symmetry (A → B ⇒ B → A), converse ((A, R, B) ⇒ (B, reverse(R), A)), inclusion (obj1 in A), and exclusion (obj1 not in B) rules.

There are four types of questions (Q-Type). (1) FR: find the relation between two objects. (2) FB: find the block that contains certain object(s). (3) CO: choose between two objects mentioned in the question that meet certain criteria. (4) YN: a yes/no question that tests whether a claim about a spatial relationship holds. FB, FR, and CO questions are formulated as multiple-choice questions2 and receive a list of candidate answers, and the answer to YN questions is chosen from Yes, No, or "DK" (Do not Know).

2 CO can be considered as both a single-choice and a multiple-choice question.
(3) CO: choose between two objects mentioned in the question that meet certain criteria. (4) YN: a yes/no question that tests if a claim on spatial relationship holds. FB, FR, and CO questions are formulated as multiple-choice questions2 and receive a list of candidate answers, and YN questions’ answer is choosing from Yes, No, or “DK” (Do not Know). 2CO can be considered as both single-choice and multiple-choices question. 21 The “DK” option is due to the open-world assumption of the stories, where if something is not described in the text, it is not considered as false (See Fig. 3.4). Finally, annotators were able to create 1.1k QA pairs on spatial reasoning on the generated descriptions distributed among the aforementioned types. We intentionally keep this data on a relatively small scale due to two reasons. First, there has been some consensus in our community that modern systems, given their sufficiently large model capacities, can easily find shortcuts and overfit a dataset if provided with large training data [29, 115]. Second, collecting spatial reasoning QAs is very costly: The two annotators spent 45-60 mins on average to create a single story with 8-16 QA pairs. We estimate that SpartQA-Human cost about 100 hours in total. The expert performance on 100 examples of SpartQA-Human’s test set measured by their accuracy of answering the questions is 92% across four Q-Types on average, indicating its high quality. 3.2.1 Collect Spatial Role Labeling Annotations Collecting spatial role labeling annotation is a complex and time-consuming task that involves identifying and labeling the spatial relationships between different objects or entities within a given context. This process requires a great deal of domain-specific knowledge and a keen understanding of linguistic and syntactic structures. In this task, annotators must possess a deep understanding of the spatial linguistic and semantic nuances of the language they are annotating. This requires extensive training and expertise, which can be challenging to find and cultivate. Here, we asked two expert volunteers familiar with spatial role labeling tasks to annotate the story/questions of this dataset. They spent 270 hours annotating 2k sentences in SpartQA-Human using WebAnno framework3. Obtaining consistent annotations can be complex, as different annotators may interpret the same context or scenario differently. This can lead to inconsistencies and inaccuracies in the annotations, significantly impacting the performance of any models trained on the data. To resolve this, another expert annotator checked the annotation and discarded the erroneous ones. As a result, half of this training data, and full dev and test data are annotated with SpRL tags (Check Tab 5.2). 3https://webanno.github.io/webanno/ 22 Dataset SpartQA-Human (story) SpartQA-Human (question) Train Dev Test 272 99 176 367 127 155 Table 3.1 Number of sentences in SpartQA-Human. 3.3 SpartQA-Auto Since human annotations are costly, it is important to investigate ways to generate a synthetic evaluation and supervision signals for spatial reasoning. However, unlike conventional synthetic approaches (e.g., [88, 152, 158]) where synthetic data can be selected from large corpora by implementing specialized filtering rules, spatial reasoning does not often appear in existing corpora. 
Therefore, similar to SpartQA-Human, we take advantage of the ground truth of NLVR images, design CFGs to generate stories, and use spatial reasoning rules to ask and answer spatial reasoning questions. This automatically generated data is called SpartQA-Auto, and below we describe its generation process in detail.

3.3.1 Story Generation

Since NLVR comes with structured descriptions of the ground-truth locations of objects, we can choose random blocks and objects from each image programmatically. The benefit is two-fold. First, a random selection of blocks and objects allows us to create multiple stories for each image; second, this randomness also creates spatial reasoning opportunities with missing information. Once we decide on a set of blocks and objects to be included, we determine their relationships: relationships between blocks are generated randomly, while relationships between objects are taken from the ground truth of the images. Now we have a scene containing a set of blocks and objects and their associated relationships. To produce a story for this scene, we design CFGs that generate natural language sentences describing those blocks, objects, and relationships in various expressions (see Fig. 3.5 for two portions of our CFG describing relative and nested relations between objects). Being grounded to visual scenes guarantees the spatial coherence of a story, and using CFGs yields grammatically correct sentences with varied expressions. We also design context-sensitive rules that limit the options for each CFG variable based on the chosen entities (e.g., black circle) or on what is described in the previous sentences (e.g., "Block A has a circle. The circle is below a triangle."). Table 3.2 shows some generated sentences in SpartQA-Auto with specific features that challenge models to understand different forms of relation description in spatial language.

Figure 3.5 Two parts of our designed CFG: (a) part of the grammar describing relations between objects, with rules such as Article -> the | a, Relation -> above | left | ..., Size -> small | medium | big, Color -> yellow | blue | black, Shape -> square | triangle | circle, Ind_shape -> shape | object | thing, producing sentences such as "The big black shape is above the medium triangle."; (b) part of the grammar describing nested relationships, producing sentences such as "The big black shape is above the object that is to the right of the medium triangle."

Table 3.2 Examples of particular features of SpartQA-Auto:
(1) "Block A is above Block C and B." Feature: using conjunction to describe the relation between more than two blocks.
(2) "The small circle is above the yellow square and the big black shape." Feature: using conjunction to describe relationships between more than two objects.
(3) "The yellow square number one is to the right of and above the blue circle." Feature: using conjunction for more than one relation.
(4) "Block B has two medium yellow squares and two blue circles." Feature: describing a group of objects with the same properties; in the following sentences, they are mentioned by an assigned number, for example, "the blue circle number two."
(5) "The blue circle is below the object which is to the right of the big square." Feature: using nested relations between objects in their description.
(6) "A small blue circle is near to the big circle. It is to the left of the medium yellow square." Feature: using coreference for an entity described in the previous sentences.
(7) "There is a block named A. One small yellow square is touching the bottom edge of this block." Feature: the verb matches the number of the subject.
(8) "What is the relation between the black object and a big circle?" Feature: using "shape," "object," and "thing" as general descriptions of an object; here, the black object could be the "black triangle" or the "black circle" mentioned in the story.

3.3.2 Question Generation

To generate questions based on a passage, there are rule-based systems [38, 58], neural networks [22], and their combinations [19]. However, in our approach, the program stores the information about the entities and their relationships while generating each story. Thus, without processing the raw text, which is error-prone, we generate questions by only looking at the stored data. The question generation operates based on four primary functionalities: Choose-objects, Describe-objects, Find-all-relations, and Find-similar-objects. These modules control the logical consistency, the correctness, and the number of reasoning steps required for each question.

Choose-objects randomly chooses up to three objects from the set of possible objects in a story under a set of constraints, such as preventing the selection of similar objects or excluding objects whose relations are directly mentioned in the text.

Describe-objects generates a mention phrase for an object using parts of its full name (as presented in the story). The generated phrase points either to a unique object or to a group of objects, such as "the big circle" or "big circles." To describe a unique object, it chooses an attribute or a group of attributes that apply to a single object among the others in the story. To increase the number of reasoning steps, the description may use the relationship of the object to other objects instead of a direct unique description, for example, "the circle which is above the black triangle."

Find-all-relations completes the relationship graph between objects by applying a set of spatial rules such as transitivity, symmetry, converse, inclusion, and exclusion on top of the direct relations described in the story. As shown in Fig. 3.6, it performs an exhaustive search over all combinations of relations that link two objects to each other (a small sketch of this closure procedure is shown below).

Figure 3.6 Finding the implicit relation between obj1 and obj4 by the transitivity rule: (1) find the set of objects that have a relation with obj1 and continue the same process on the new set until obj4 is found; (2) take the union of the intermediate relations between these two objects as the final answer.

Find-similar-objects finds all the mentions in the story that match a description used in the question. For instance, for the question "Is there any blue circle above the big blue triangle?", this module finds all the mentions in the story matching the description "a blue circle."

Similar to SpartQA-Human, we provide four Q-Types: FR, FB, CO, and YN. To generate FR questions, we choose two objects using the Choose-objects module and ask about their relationships. The YN Q-Type is similar to FR, but the question specifies one relationship of interest, chosen from all relations extracted by the Find-all-relations module, to be asked about the objects. Since Yes/No questions are usually simpler problems, we make this question type more complex by adding quantifiers ("all" and "any"). These quantifiers evaluate the models' capability to aggregate relations among more than two objects in the story and to reason over all found relations to reach the final answer.
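Before turning to the FB and CO Q-Types, the sketch below illustrates the closure computed by Find-all-relations, as referenced above. It is a minimal Python illustration under assumed relation names and only two rule families (converse and transitivity); the released SpartQA-Auto generator implements a larger rule set, including symmetry, inclusion, and exclusion.

```python
from itertools import product

# Converse pairs: if LEFT(a, b) holds, then RIGHT(b, a) holds as well.
CONVERSE = {"left": "right", "right": "left", "above": "below", "below": "above"}
# Directional relations compose with themselves (transitivity): LEFT o LEFT -> LEFT.
TRANSITIVE = {"left", "right", "above", "below"}

def find_all_relations(facts):
    """Complete a set of (head, relation, tail) triplets with converse and
    transitivity rules, in the spirit of the exhaustive search of Fig. 3.6."""
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        # Converse rule: add the symmetric counterpart of every known triplet.
        for h, r, t in list(closure):
            if r in CONVERSE and (t, CONVERSE[r], h) not in closure:
                closure.add((t, CONVERSE[r], h))
                changed = True
        # Transitivity rule: r(a, b) and r(b, c) imply r(a, c).
        for (h1, r1, t1), (h2, r2, t2) in product(list(closure), repeat=2):
            if r1 == r2 and r1 in TRANSITIVE and t1 == h2 and h1 != t2:
                if (h1, r1, t2) not in closure:
                    closure.add((h1, r1, t2))
                    changed = True
    return closure

story = {("obj1", "left", "obj2"), ("obj2", "left", "obj3"), ("obj4", "right", "obj2")}
derived = find_all_relations(story)
print(("obj1", "left", "obj4") in derived)  # True: obj4 is right of obj2, so obj2 is left of obj4
```

Iterating to a fixed point is a simple way to mimic the exhaustive search of Fig. 3.6 for small scenes.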
In FB Q-Type, we mention an object by its indirect relation to another object using the nested relation in Describe-objects module and ask to find the blocks containing or not containing this object. Finally, the CO question selects an anchor object (Choose-objects) and specifies a relationship ( using Find-all-relations) in the question. Two other objects are chosen as candidates to check whether the specified relationship holds between them and the anchor object. We tend to force the algorithm to choose objects as candidates that at least have one relationship to the anchor object. 3.3.2.1 Question Templates and Statistics Information Table 3.3 shows the templates used to create questions in SpartQA-Auto. The “” is a variable replaced by objects from the story (using Choose-objects and Describe-objects modules), and the “” variable can be replaced by the chosen relations between objects (using Find-all-relations module). The articles and the indefinite pronouns in each template play an essential role in understanding the question’s objective. For example, “Are all blue circles near to a triangle?” is different from “Are there any blue circles near to a triangle?”, and “Are there any blue circles near to all triangles?”. 26 Left (obj1 , obj2)Touching (obj2 , obj3)Right (obj4 , obj2)? (obj1 , obj4)left ^ left => leftLeft (obj1 , obj4)Obj1Obj2Obj3Obj4Obj3left~right = left1Obj423 Q-Type Q-Templates FR what is the relation between and ? CO YN FB What is the ? an or an ? Which object is an ? the or the ? Is (the | a )(the | a) ? Is there any s all s? Which block has an ? Which block doesn’t have an ? Yes, No, Don’t Know Name of blocks, None Table 3.3 Questions and answers templates. Candidate answer Left, Right, Below, Above, Touching, Far from, Near to Object1, object2, Both, None Therefore, we check the uniqueness of the object definition, using “a” or “the” in proper places and randomly place the terms “any” or “all” in the YN questions to generate different questions. Answer generation We compute all direct and indirect relationships between objects using Find-all-relations function and based on the Q-Types generate the final answer. For instance, in YN Q-Type if the asked relation exists in the found relations, the answer is "Yes", if the inverse relation exists it must be "No", and otherwise, it is "DK"4. 3.3.3 Corpus Statistics We generate the train, dev, and test set splits based on the same splits of the images in the NLVR dataset. On average, each story contains 9 sentences (Min:3, Max: 22) and 118 tokens (Min: 66, Max: 274). Also, the average tokens of each question (on all Q-Type ) is 23 (Min:6, Max: 57). Table 3.4 shows the total number of each question type in SpartQA-Auto and Table 3.5 shows the percentage of each correct label in train and test sets. Table 3.5 shows the percentage of correct labels in train and test sets. In multi-choice Q-Types, more than one label can be true. 3.3.4 Extra Annotations Alongside the main SpartQA-Auto’s stories and questions, we provided some extra annotation to help the models understand the spatial language better. 4The SpartQA-Auto generation code and the file of dataset are available at https://github.com/HLR/SpartQA_ generation 27 Sets SpartQA-Human: Test Train SpartQA-Auto: Seen Test Unseen Test Dev Train FB FR YN CO Total 104 154 105 149 194 162 107 151 510 616 3872 3872 3842 23654 3712 3721 3742 23302 3896 3896 3860 23968 3594 3598 3579 22794 15074 15087 15023 93673 Table 3.4 Number of questions per Q-Type. 
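To make the answer-generation rule for YN questions concrete, here is a minimal sketch under the open-world assumption described above; the relation encoding and function name are assumptions for illustration, not the released implementation.

```python
INVERSE = {"left": "right", "right": "left", "above": "below", "below": "above"}

def answer_yn(question_triplet, derived_relations):
    """Answer a YN question from the completed relation set, using the
    open-world assumption: unstated facts are unknown, not false."""
    head, relation, tail = question_triplet
    if (head, relation, tail) in derived_relations:
        return "Yes"
    if (head, INVERSE[relation], tail) in derived_relations:
        return "No"
    return "DK"  # neither the relation nor its inverse can be derived

# Relations already completed by Find-all-relations for a toy story.
derived = {("circle", "above", "triangle"), ("triangle", "below", "circle")}
print(answer_yn(("circle", "above", "triangle"), derived))  # Yes
print(answer_yn(("circle", "below", "triangle"), derived))  # No
print(answer_yn(("circle", "left", "triangle"), derived))   # DK
```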
3.3.4.1 Detailed Annotation and Scene-Graphs Figure 3.7 Scene-graph. Providing in-depth human annotations is quite expensive and time-consuming. In SpartQA- Auto, we generated a fine-grained scene-graph based on the story. This scene graph contains the blocks’ description, their relations, and the objects’ attributes alongside their direct relations. The scene graphs can be used for the models to understand all spatial relations directly mentioned in the textual context. Figure 3.7 shows an example of this scene-graph. The scene graph can provide strong supervision for question answering challenges and can be used to evaluate models based on their steps of reasoning and decisions. 3.3.4.2 SpRL Annotation We also provided spatial annotations for each sentence and question, based on the Spatial Role Labeling annotation scheme [54](Fig. 3.8). This annotation is generated by hand-crafted rules during the main data generation. SpRL is used for recognizing spatial expressions and arguments in a sentence. This annotation is useful for applications that need to detect and reason about spatial 28 Q-Type FR (Multiple Choices) FB (multiple Choices) CO (Single choice) YN (Single choice) Candidate Answers Left Right Above Below Near to Far from Touching DK A B C [] Object1 Object2 Both None Yes No DK train 20.7 21.4 26.9 37.2 5.8 1.3 0.57 0.52 49.8 50.1 35.1 7.1 25.4 25.3 44.3 4.9 53.3 18.7 27.8 test 17.9 16.7 25.4 42.9 2.9 0.56 0.27 0.32 49.4 50 62 90.5 26 24.9 43.9 5.0 50.5 23.6 25.9 Table 3.5 The percentage of each correct label in all samples. The candidate answers for the FB Q-Type can be varied based on its story. CO can be considered a multiple-choice or single-choice question. E.g., in "which object is above the triangle? the blue circle or the black circle?" you can consider two labels with boolean classification on each "blue circle" and "black circle" or consider it as a four-label classification: "blue circle," "black circle," "both of them," and "None of them." * DK, None, [], all mean none of the actual labels are correct. expressions and arguments. 3.4 Spatial Question Answering Models This section describes the model architectures on different Q-Types: FR, YN, FB, and CO. All Q-Types can be cast into a sequence classification task, and the three transformer-based LMs tested in this paper, BERT [18], ALBERT [59], and XLNet [147], can all handle this type of tasks by classifying the representation of [CLS], a special token prepended to each target sequence (see Figure 3.9). Depending on the Q-Type, the input sequence and how we do inference may be different. FR and YN both have a predefined label set as candidate answers, and their input sequences are both the concatenation of a story and a question. While the answer to a YN question is a single label chosen from Yes, No, and DK, FR questions can have multiple correct answers. Therefore, we treat 29 Figure 3.8 SpRL annotation for an example sentence from SpartQA. (a) LM𝑄 𝐴 Architecture for FR and YN Q-Types (b) LM𝑄 𝐴 Architecture for CO and FB Q-Types Figure 3.9 LM𝑄 𝐴 for Spatial Reasoning over Text. each candidate answer to FR as an independent binary classification problem, and take the union as the final answer. As for YN, we choose the label with the highest confidence (Fig 3.9a). As the candidate answers to FB and CO are not fixed and depend on each story and its question the input sequences to these Q-Types are concatenated with each candidate answer. 
Since the model defined for YN and FR yields moderately less accurate results on the FB and CO Q-Types, we add an LSTM [40] layer to improve it. Hence, to find the final answer, we run the model with each candidate answer and then apply an LSTM layer on top of all token representations. Then, we use the last vector of the LSTM outputs for classification (Fig. 3.9b). The final answers are selected based on Eq. (3.1):

\begin{aligned}
x_i &= [s, c_i, q] \\
\vec{T}_i &= [\vec{t}_i^{\,1}, \ldots, \vec{t}_i^{\,m_i}] = LM(x_i) \\
[\vec{h}_i^{\,1}, \ldots, \vec{h}_i^{\,m_i}] &= \mathrm{LSTM}(\vec{T}_i) \\
\vec{y}_i &= [y_i^0, y_i^1] = \mathrm{Softmax}(\vec{h}_i^{\,m_i} W) \\
\text{Answer} &= \{\, c_i \mid \arg\max\nolimits_j (y_i^j) = 1 \,\}
\end{aligned} \quad (3.1)

where s is the story, c_i is the candidate answer, q is the question, [ ] indicates the concatenation of the listed vectors, and m_i is the number of tokens in x_i. The parameter vector W is shared across all candidates.

3.4.1 Training and Inference

We train the models based on the summation of the cross-entropy losses of all binary classifiers in the architecture. For the FR and YN Q-Types, there are multiple classifiers, while only one classifier is used for the CO and FB Q-Types. During the inference phase, we remove inconsistent answers in post-processing for the FR and YN Q-Types. For instance, on FR, the left and right relations between two objects cannot be valid at the same time. For YN, as there is only one valid answer among the three candidates, we select the candidate with the maximal predicted probability of being the true answer.

3.5 Experiments and Results

All experiments use the models proposed in Sec. 3.4. We use AdamW [72] with a 2 × 10−6 learning rate and Focal Loss [69] with γ = 2 for training all the models.5 To evaluate our models, we use the accuracy metric, which computes the ratio of correct predictions to all questions.

5All codes are available at https://github.com/HLR/SpartQA-baselines

To assess the quality of the proposed datasets, we measure human performance on a sample of these datasets. The evaluation involved 100 randomly selected questions from the SpartQA-Human test set. The participants were graduate students who received training with a few examples from the dataset before answering the final questions. The high accuracy rates observed suggest that the questions are understandable and suitable for evaluation. Moreover, upon examining respondents' incorrect answers, we found that errors typically occurred when the questions required a high number of reasoning steps, leading to confusion in linking them together.

3.5.1 SpartQA-Human Evaluation

Table 3.6 shows performance on SpartQA-Human in a low-resource setting, where 0.6k QA pairs from SpartQA-Human are used for fine-tuning these LMs and 0.5k for testing (see Table 3.4 for information on this split).6 During our annotation, we found that the descriptions of "near to" and "far from" vary largely between annotators. Therefore, we exclude these two relations from the FR Q-Type in our evaluations.

6Note this low-resource setting can also be viewed as a spatial reasoning probe to these LMs [128].
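As a companion to Eq. (3.1), the following is a minimal PyTorch sketch of the candidate-scoring architecture used for the FB and CO Q-Types; it is an illustrative reimplementation (model name, input ordering, and hyperparameters are assumptions), not the released baseline code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class CandidateScorer(nn.Module):
    """Scores one candidate answer c_i given story s and question q (Eq. 3.1):
    token representations from a PLM, an LSTM over them, and a binary classifier
    on the last LSTM state. The classifier (W) is shared across candidates."""
    def __init__(self, plm_name="bert-base-uncased", hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(plm_name)
        self.lstm = nn.LSTM(self.encoder.config.hidden_size, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, 2)  # [not-an-answer, answer]

    def forward(self, input_ids, attention_mask):
        tokens = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        states, _ = self.lstm(tokens)             # (batch, m_i, hidden)
        return self.classifier(states[:, -1, :])  # logits from the last LSTM output

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = CandidateScorer()
story = "Block A has a blue circle. The blue circle is above a black triangle."
question = "Which object is above the black triangle?"
answers = []
for candidate in ["the blue circle", "the black triangle"]:
    enc = tokenizer(story + " " + candidate, question, return_tensors="pt", truncation=True)
    logits = model(enc["input_ids"], enc["attention_mask"])
    if logits.softmax(-1).argmax(-1).item() == 1:  # Answer = {c_i | argmax_j y_i^j = 1}
        answers.append(candidate)
print(answers)  # untrained weights, so predictions are arbitrary until fine-tuning
```

In the reported experiments, the shared classifier is trained with AdamW and Focal Loss as described above; the sketch only illustrates the forward pass and candidate selection.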
# Model 1 Majority 2 BERT 3 Human FB 28.84 16.34 91.66 FR 24.52 20 95.23 CO 40.18 26.16 91.66 YN 53.60 45.36 90.69 Avg 36.64 30.17 92.31 Table 3.6 Result of three baselines on SpartQA-Human. All systems are fine-tuned on the training data of SpartQA-Human, Avg: The micro-average on all four Q-Types. As it is shown in Table 3.6, BERT language model, which outperforms many complex deep neural networks in other benchmarks [11], cannot perform well on SpartQA-Human. The results of this model even are lower than the Majority baseline which predicts the label that is most common in each set on each question type. This low performance indicates the complexity of the spatial reasoning (spatial question answering) task and the poor capability of powerful language models to capture that. We also can see from Table 3.6 that BERT performance falls behind human performance by a large margin. 6Note this low-resource setting can also be viewed as a spatial reasoning probe to these LMs [128]. 32 # Models FB Seen Unseen 48.70 48.70 1 Majority 69.38 2 BERT 87.13 83.53 3 ALBERT 97.66 84.85 98.00 4 XLNet 5 Human 85 FR Seen Unseen 40.81 40.81 73.71 85.68 83.70 91.61 91.63 94.60 CO Seen Unseen 20.38 20.59 61.09 71.44 84.55 95.20 90.88 97.11 YN Seen Unseen 49.91 49.94 76.81 78.29 75.05 79.38 78.54 79.91 90 94.44 90 Table 3.7 Spatial reasoning is challenging. We further pretrain three transformer-based LMs, BERT, ALBERT, and XLNet, on SpartQA-Auto, and test their accuracy in three ways: Seen and Unseen are both from SpartQA-Auto, where Unseen has applied minor modifications to its vocabulary; to get those Human columns, all models are fine-tuned on SpartQA-Human’s training data. Human performance on Seen and Unseen is the same since the changes applied to Unseen does not affect human reasoning. 3.5.2 SpartQA Analysis In addition to BERT, we continue to test another two LMs, ALBERT and XLNet (Table 3.7). We further pretrain these LMs on SpartQA-Auto, and test them on SpartQA-Human and two held-out test sets of SpartQA-Auto, Seen and Unseen. Note that when a system is tested against SpartQA-Human, it is fine-tuned on SpartQA-Human’s training data following its further pretraining on SpartQA-Auto. We propose an unseen test set alongside the seen test of SpartQA-Auto to check whether a model is using shortcuts in the language surface by describing objects and relations with new vocabularies in the samples. We use the unseen set to test to what extent the baseline models use shortcuts in the language surface. This set applies minor modifications randomly on a number of stories and questions to change the names of shapes, colors, sizes, and relationships in the vocabulary of the stories, which do not influence the reasoning steps (describing relationships using different language expressions). The modification choices are described in Table 3.8. All models perform worst in YN across all Q-Types, which suggests that YN presents more complex phenomena, probably due to the additional quantifiers in the questions. XLNet performs the best on all Q-Types except its accuracy on SpartQA-Human’s YN section. However, the drops in Unseen and human suggest overfitting on the training vocabulary. 
The low accuracies on the human test set from all models show that solving this benchmark is still a challenging problem and requires more sophisticated methods like considering spatial roles and relations extraction [54, 13, 105] to 33 Type Shapes Relations Colors Size Original Set Square, Circle, Triangle Left, Right, Above, Below Yellow, Black, Below Small, Medium, Big Unseen Set Rectangle, Oval, Diamond Left side, Right side, Top, Under Green, Red, White Little, Midsize, Large Table 3.8 Modifications on the unseen set. understand stories and questions better. 3.5.3 Consistency Evaluation To evaluate the reliability of the models, we also provide two extra consistency and contrast test sets. Consistency set is made by changing a part of the question in a way that seeks for the same information [43, 124]. Given a pivot question and answer of a specific consistency set, answering other questions in the set does not need extra reasoning over the story. For instance, for the question, “What is the relation between the blue circle and the big shape? Left,” we create a similar question in the form of “What is the relation between the big shape and the blue circle? Right”. Answering these questions around a pivot question is possible for humans without the need for extra reasoning over the story and based on the main questions’ answers. Hence, the evaluation on this set shows that models understand the real underlying semantics rather than overfit on the structure of questions. Contrast set is made by minimal modification in a question to change its answer [29]. For contrast sets, there is a need to go back to the story to find the new answer for the question’s minor variations. The consistency and contrast sets are evaluated only on the correctly predicted questions to check if the actual understanding and reasoning occur. This ensures the reliability of the models. For instance, in the question “Is the blue circle below the black triangle? Yes,” we create a contrast question “Is the blue circle below all triangles? No” by changing “the black triangle” to “all triangles”. The evaluation of this set shows the robustness of the model and its sensitivity to 34 semantic changes when there are minor changes in the language surface. Models BERT AlBERT XLNet FB FR CO YN Consistency Consistency Contrast Consistency Contrast Consistency Contrast 69.44 84.77 85.2 76.13 82.42 88.56 42.47 41.69 50 16.99 58.42 71.10 15.58 62.51 72.31 48.07 48.78 51.08 71.41 69.19 69.18 Table 3.9 Evaluation of consistency and semantic sensitivity of models in Table 3.7. All the results are on the correctly predicted questions of Seen test set of SpartQA-Auto. Based on the papers [43, 124], consistency, and contrast set should be generated manually to control the semantic change. In our case, we are probing the spatial language understanding of models, so we must change parts of the questions that affect spatial understanding, which can be implemented automatically by some static rules. Table 3.9 shows the result of this evaluation on four Q-Types of SpartQA-Auto, where we can see, for another time, that the high scores on the Seen test set are likely due to overfitting on training data rather than correctly detecting spatial terms and reasoning over them. All of the mentioned experiments clearly indicate that current SOTA and powerful language models cannot perform spatial reasoning on the provided benchmarks. 
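The consistency and contrast questions of Section 3.5.3 are generated by static rules over the stored question triplets; the sketch below illustrates the two rule types with simplified, assumed templates rather than the exact generation rules used for the reported sets.

```python
CONVERSE = {"left": "right", "right": "left", "above": "below", "below": "above"}

def consistency_question(obj1, obj2, answer):
    """Swap the two objects of an FR question; the gold label flips to its
    converse, so answering it needs no extra reasoning over the story."""
    return f"What is the relation between {obj2} and {obj1}?", CONVERSE[answer]

def contrast_question(obj, relation, landmark):
    """Minimally edit a YN question (here: widen the landmark with 'all') so
    the answer may change and the story has to be re-checked."""
    noun = landmark.split()[-1]  # "the black triangle" -> "triangle"
    return f"Is {obj} {relation} all {noun}s?"

q, gold = consistency_question("the blue circle", "the big shape", "left")
print(q, gold)  # "What is the relation between the big shape and the blue circle?" right
print(contrast_question("the blue circle", "below", "the black triangle"))
```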
Also, experiments showed that the high result of LMs on SpartQA-Auto are due to the overfitting of the models on fixed pattern and cannot be repeated after minor changes. 3.6 Related Research Question answering is a useful format to evaluate machines’ capability of reading compre- hension [30] and many recent works have been implementing this strategy to test machines’ understanding of linguistic formalisms: [37, 85, 64, 45, 97, 21]. An important advantage of QA is using natural language to annotate natural language, thus having the flexibility to get annotations on complex phenomena such as spatial reasoning. However, spatial reasoning phenomena have been covered minimally in the existing works. To the best of our knowledge, Task 17 of the bAbI project [143] is the only QA dataset focused on textual spatial reasoning. However, bAbI is synthetic and does not reflect the complexity of the spatial reasoning in natural language. Solving Task 17 of bAbI typically does not require 35 sophisticated reasoning, which is an important capability emphasized by more recent works (e.g., [23, 50, 148, 14, 97]). Spatial reasoning is arguably more prominent in multi-modal QA benchmarks, e.g., NLVR [123], VQA [4], GQA [43], CLEVR [46]. However, those spatial reasoning phenomena are mostly expressed naturally through images, while this paper focuses on studying spatial reasoning on natural language. Some other works on visual-spatial reasoning are based on geographical information inside maps and diagrams [42] and navigational instructions [9, 3]. As another approach to evaluate spatial reasoning capabilities of models, a dataset proposed in [32] generates a synthetic training set of spatial sentences and evaluates the models’ ability to generate spatial facts and sentences containing composition and decomposition of relations on grounded objects. 3.7 Summary Spatial reasoning is an important problem in natural language understanding. We propose the first human-created QA benchmark on spatial reasoning, and experiments show that SOTA pretrained language models (LM) do not have the capability to solve this task given limited training data, while humans can solve those spatial reasoning questions reliably [90]. We also provide a synthetic benchmark using hand-crafted grammar and spatial reasoning rules to automatically generate a large corpus of spatial descriptions and corresponding question-answer annotations. Even though LMs can solve the main test set of this benchmark, by providing various evaluation test sets we showed that these high results are due to the overfitting, and the models do not learn spatial concepts and how to do spatial reasoning over them. 36 CHAPTER 4 TRANSFER LEARNING FOR SPATIAL QUESTION ANSWERING 4.1 Introduction The previous chapter showed how current SOTA language models cannot resolve spatial question answering tasks. As fine-tuning LMs has become a common baseline approach to knowledge transfer from a source dataset to a target task, including but not limited to [101, 158, 36], in this chapter, we study this method to increase the capability of spatial reasoning of modern LMs. This fine-tuning process is also known as further pretraining, to distinguish with the fine-tuning process on one’s target task. It is an open problem to find out better transfer learning techniques than simple further pretraining, as suggested in [35, 51], which is beyond the scope of this work. One of the key requirements in this method is access to large training data. 
As we discussed before, SpartQA-Auto is the only available large benchmark on SQA. We use this dataset as a synthetic source of supervision to further pretrain the BERT language model and then fine-tune and evaluate the model on SpartQA-Human with a similar domain and some other target benchmarks. Experiments show that, by further pretraining on SpartQA-Auto, we improve LMs’ performance on SpartQA-Human by a large margin. The spatially-improved LMs also show stronger performance on two external QA datasets, bAbI and boolQ [11]: BERT further pretrained on SpartQA-Auto only requires half of the training data to achieve 99% accuracy on bAbI as compared to the original BERT; on boolQ’s development set, this model shows better performance than BERT, with 2.3% relative error reduction.1 4.1.1 Improve the Coverage of Spatial Relations The current synthetic datasets [90, 143, 118] often focus on specific types of relations with a small coverage of spatial semantics needed for spatial language understanding in various domains. Figure 4.1 indicates the coverage of sixteen spatial relation types (in Table 2.1) collected from existing resources [108, 144, 110]. The human-generated datasets [90, 55], despite helping study the 1To the best of our knowledge, the test set or leaderboard of boolQ has not been released yet. 37 Figure 4.1 The comparative coverage of relation types based on Table 2.1 for SQA datasets. problem as evaluation benchmarks, are less helpful for training models that can reliably understand spatial language due to their small size [90]. In this work, we build a new synthetic dataset on SQA, called SpaRTUN2 (Figure 4.2) to provide a source of supervision with broad coverage of spatial relation types and expressions3. To generate SpaRTUN, we follow the idea of SpartQA benchmark and generate scene graphs from a set of images. The edges in this graph yield a set of triplets such as ABOVE(blue circle, red triangle), which are used to generate a scene description (i.e., a story). In SpaRTUN, we map the spatial relation types in triplets (e.g., ABOVE) to a variety of spatial language expressions (e.g., over, north, above) to enable the transfer learning for various data domains 4. We also build a logical spatial reasoner to compute all possible direct and indirect spatial relations between graph nodes. Then, the questions of this dataset are selected from the indirect relations. To evaluate the effectiveness of SpaRTUN in transfer learning, we created another dataset named ReSQ5 (Figure 6.1). This dataset is built on mSpRL [55] corpus while we added human-generated spatial questions and answers to its real image descriptions. This dataset comparatively reflects more realistic challenges and complexities of the SQA problem. We analyze the impact of SpaRTUN as source of extra supervision on several SQA benchmarks. 2Spatial Reasoning and role labeling for Text UNnderstanding 3We only consider explicit spatial semantics and the Metaphoric usages and implicit meaning are not covered in this work. 4The full list of spatial expressions used in this dataset and the dataset generation code are provided in https: //github.com/HLR/SpaRTUN. 5Real-world Spatial Questions 38 Our results show that further pretraining models with SpaRTUN for SQA task improves the result of previous models on ReSQ, StepGame, and SpartQA-Human benchmarks. 
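To illustrate the mapping from relation types to varied surface expressions described above, here is a small self-contained sketch; the abbreviated vocabulary and function name are assumptions for this example, and the released SpaRTUN generator uses CFGs with a much larger expression list.

```python
import random

# Abbreviated expression vocabulary; the real dataset draws on several
# formalisms (directional, cardinal, clock-face) per relation type.
EXPRESSIONS = {
    "ABOVE": ["above", "over", "north of"],
    "BELOW": ["below", "under", "south of"],
    "NTPP":  ["in", "inside", "within"],
    "FRONT": ["in front of"],
}

def triplet_to_sentence(relation, trajector, landmark, rng=random):
    """Render one scene-graph triplet, e.g. ABOVE(blue circle, red triangle),
    as a natural-language sentence with a randomly chosen expression."""
    expression = rng.choice(EXPRESSIONS[relation])
    return f"The {trajector} is {expression} the {landmark}."

random.seed(0)
story_triplets = [("ABOVE", "blue circle", "red triangle"),
                  ("NTPP", "red triangle", "white box")]
print(" ".join(triplet_to_sentence(*t) for t in story_triplets))
```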
Furthermore, studying the broad coverage of spatial relation expressions of SpaRTUN in realistic domains demonstrates that this feature is a key factor for transfer learning. The contributions of this paper can be summarized as: (1) SpartQA-Auto proves to be a rich source of spatial knowledge that improved the performance of LMs on SpartQA-Human as well as on different data domains such as bAbI and boolQ. (2) We build a new synthetic dataset to serve as a source of supervision and transfer learn- ing for spatial language understanding tasks with broad coverage of spatial relation types and expressions (which is easily extendable). (3) We provide a human-generated dataset to evaluate the performance of transfer learning on real-world spatial question answering. (4) We evaluate the transferability of the models pretrained with SpaRTUN on multiple SQA and SpRL benchmarks and show significant improvements in SOTA results. 4.2 SpaRTUN: A Synthetic Data with Broader Relation Coverage The synthetic datasets on SQA often focus on specific types of relations with a small coverage of spatial semantics needed for spatial language understanding in various domains. Figure 4.1 indicates the coverage of sixteen spatial relation types (in Table 2.1) collected from existing resources [108, 144, 110]. The human-generated datasets, despite helping study the problem as evaluation benchmarks, are less helpful for training models that can reliably understand spatial language due to their small size. In this section, we build a new synthetic dataset on SQA, called SpaRTUN6 (Figure 4.2) to provide a source of supervision with broad coverage of spatial relation types and expressions7. To provide a source of supervision for spatial language understanding tasks, we generate a synthetic dataset with SQA format that contains SpRL annotation of sentences. We build this dataset 6Spatial Reasoning and role labeling for Text UNnderstanding 7We only consider explicit spatial semantics and the Metaphoric usages and implicit meaning are not covered in this work. 39 Figure 4.2 SpaRTUN - A synthetic large dataset provided as a source of supervision. Figure 4.3 The data construction flow of SpaRTUN. First, we generate scene graphs from NLVR images. Then a spatial reasoner validates each path between each pair of entities in this graph. All facts (𝐹) in the selected 𝑝𝑎𝑡ℎ and some extra facts (𝐸) from the scene graph are selected as story triplets, and the start and end nodes of the 𝑝𝑎𝑡ℎ are selected as question triplets. Finally, we pass all triplets to a text generation module and compute the final answer. We ignore paths with length one (e.g., 𝐴( 𝐴𝐵𝑂𝑉 𝐸)𝐶) and only keep questions that need multi-hop reasoning. by expanding SpartQA in multiple aspects. The following additional features are considered in creating SpaRTUN: F1) A broad coverage of various types of spatial relations and including rules of reasoning over their combinations (e.g. NTPP(𝑎, 𝑏), LEFT(𝑏, 𝑐) → LEFT(𝑎, 𝑐) ) in various domains. F2) A broad coverage of spatial language expressions and utterances used in various domains. F3) Including extra annotations such as the supporting facts and number of reasoning steps for SQA to be used in complex modeling. In the rest of this section, we describe the details of creating SpaRTUN and the way we support the above mentioned features. Figure 4.3 depicts SpaRTUN data construction flow. 40 Q: Is the yellow apple to the west of the yellowwatermelon?YesThree boxes called one, two and three exist inan image. 
Box one contains a big yellowmelon and a small orange watermelon. Boxtwo has a small yellow apple. A smallorange apple is inside and touching this box.Box one is in box three. Box two is tothe south of, far from and to the west of boxthree. A small yellow watermelon is insidebox three. Q: Where is box two relative to the yellowwatermelon? Left, Below, FarABCA (ABOVE) CXKHYA (NTPP) X (FRONT) Y (TPPI) BA (NTPP) X (NTPPI) C ABOVEBELOWQ: FRONT(A,B)F1: NTPP(A,X)F2: FRONT(X,Y) F3: TPPI(Y,B)E1: NTPP(H,Y)Extra relationTextGenerationCFGfuncfuncSpatial ReasonerStorySpatial ReasonerAnswerNLVR images & scene graphNew scene graphPath validation forquestion triplet selectionTriplets of questions (Q)and story (F+E)Finding answer based on story Question Spatial Relation Computation. Following SpartQA-Auto, we use the NLVR scene graphs [123] and compute relations between objects in each block based on their given coordinates. NLVR is limited to 2D relation types8, therefore to add more dimensions (FRONT and BEHIND), we randomly change the LEFT and RIGHT to BEHIND and FRONT in a subset of examples. Moreover, there are no relations between blocks in NLVR descriptions. To expand the types of relations, we extend this limitation and randomly assign relations9 to the blocks while ensuring the spatial constraints are not violated. Then, we create a new scene graph with computed spatial relations. The nodes in this graph represent the entities (objects or blocks), and the directed edges are the spatial relations. Question Selection. There are several paths between each pair of entities in the generated scene graph. We call a path valid if at least one relation can be inferred between its start and end nodes can be inferred. For example, in Figure 4.3, NTPP( 𝐴, 𝑋), FRONT(𝑋, 𝑌 ), TPPI(𝑌 , 𝐵) is valid since it results in FRONT( 𝐴, 𝐵) while NTPP( 𝐴, 𝑋), NTPPI(𝑋, 𝐶) is not a valid path –there is no rules of reasoning that can be applied to infer new relations. To verify the validity of each path, we pass its edges, represented as triplets in the predicate- arguments form to a logical spatial reasoner (implemented in Prolog) and query all possible relations between the pair. The number of triplets in each path represents the number of reasoning steps for inferring the relation. We generate the question triplets from the paths with the most steps of reasoning (edges). This question will ask about the spatial relationship between the head and tail entity of the selected path. The triplets in this path are used to generate the story and are annotated as supporting facts. Additionally, the story will include additional information (extra triplets) unnecessary for answering the question to increase the complexity of the task. Spatial Reasoner. We implement several rules (in the form of Horn clauses shown in Table 6.2) in Prolog, which express the logic between the relation types (described in Table 2.1) in various formalisms and model the logical spatial reasoning computation. Given the facts and queries in 8The relations types included in NLVR are: DC, EC, *PP relations, LEFT, RIGHT, BELOW, and ABOVE. 9All relation in Table 2.1 except EQ 41 Prolog format, the spatial reasoner can carry out the reasoning process and provide an answer to any given query. The reasoner matches variables in the program with concrete values and a backtracking search to explore different possibilities for each rule until a solution is found. 
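The reasoner itself is implemented in Prolog; as a rough Python approximation of how a candidate path is validated during question selection, the sketch below composes relations along the path with a hand-picked, assumed subset of the rules (the full rule set covers the combinations in Table 6.2 and relies on backtracking search rather than a single left-to-right fold).

```python
# Pairwise composition table (Horn-clause style): COMPOSE[(r1, r2)] gives the
# relation inferred between a and c from r1(a, b) and r2(b, c), if any.
# Only a few illustrative rules are shown, as assumptions for this sketch.
COMPOSE = {
    ("NTPP", "FRONT"): "FRONT",   # a inside b, b in front of c  => a in front of c
    ("FRONT", "TPPI"): "FRONT",   # a in front of b, b covers c  => a in front of c
    ("NTPP", "ABOVE"): "ABOVE",
    ("ABOVE", "ABOVE"): "ABOVE",  # transitivity
}

def infer_along_path(path):
    """Fold the composition table over a path of (head, rel, tail) triplets and
    return the inferred relation between the path's endpoints, or None."""
    current = path[0][1]
    for _, rel, _ in path[1:]:
        current = COMPOSE.get((current, rel))
        if current is None:
            return None  # nothing can be inferred: the path is invalid
    return current

valid = [("A", "NTPP", "X"), ("X", "FRONT", "Y"), ("Y", "TPPI", "B")]
invalid = [("A", "NTPP", "X"), ("X", "NTPPI", "C")]
print(infer_along_path(valid))    # FRONT -> yields a multi-hop question about (A, B)
print(infer_along_path(invalid))  # None  -> path discarded
```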
Compared to previous tools [144], we are the first to include the spatial, logical computation between multiple formalisms. This reasoner validates the question/queries based on the given facts. For instance, by using the Combination rule in Table 6.2 over the set of facts {NTPP( 𝐴, 𝑋), FRONT(𝑋, 𝑌 ), TPPI(𝑌 , 𝐵)}, the reasoner returns True for the query FRONT( 𝐴, 𝐵) and False for FRONT(𝐵, 𝐴) or BEHIND( 𝐴, 𝐵). We consider the logic rules mentioned in Table 6.2 and in the form of the Horn clauses. we collect the different combinations of spatial relations mentioned in Table 2.1 and implement the logic-based spatial reasoner. Figure 4.4a shows an example of some parts of our code on LEFT relation. In Figure 4.4b, on the left, some facts are given, and the query “𝑛𝑡 𝑝 𝑝𝑖(𝑟𝑜𝑜𝑚, 𝑋)” asks about all objects that existed in the room. Below each query, there are all possible predictions for them. Text generation. The scene description is generated from the selected story triplets in question selection phase and using a publicly available context-free grammar (CFG) provided in SpartQA- Auto. However, we increase the variety of spatial expressions by using a vocabulary of various entity properties and relation expressions (e.g., above, over, or north for ABOVE relation type) taken from existing resources [27, 80, 71, 122, 39] We map the relation types and the entity properties to the lexical forms in our collected vocabulary (check Figures 4.5 and 4.6). For the question text, we generate the entity description and relation expression for each question triplet. The entity description is generated based on a subset of its properties in the story. For instance, an expression such as “a black object” can be generated to refer to both “a big black circle” and “a black rectangle”. We generate two question types, YN (Yes/No) questions that ask whether a specific relation exists between two entities, and FR (Find Relations) questions that ask about all possible relations between them. To make YN questions more complex, we add quantifiers (“all” and “any”) to the entities’ descriptions. Our text generation method can flexibly use an extended vocabulary to provide a richer corpus 42 (a) Examaple of implemented rule clauses in Prolog. (b) Example of Facts, Query, and answer of implemented model Figure 4.4 Logic-bases spatial reasoner. to supervise new target tasks when required. Finding Answers. We search all entities in the story based on the entity descriptions (e.g., all circles, a black object) in each question and use the spatial reasoner to find the final answer. SpRL Annotations. Along with generating the sentences for the story and questions, we auto- matically annotate the described spatial configurations with spatial roles and relations (trajector, landmark, spatial indicator, spatial type, triplet, entity ids). These annotations are based on a 43 Figure 4.5 List of relation expression supported in SpaRTUN. This can easily be extended based on the target task. Figure 4.6 List of entities properties supported in SpaRTUN. 
44 FormalismTypeCardinalsClocksDirectionalLEFT"to the left of", "on the left side of", "to the left-hand side of""west of", "to the west of""at 9:00 position relative to", "at 9:00 position regarding to", "at 9 o'clock position regarding to"RIGHT"to the right of", "on the right side of", "to the right-hand side of""east of", "to the east of""at 3:00 position relative to", "at 3:00 position regarding to", "at 3 o'clock position regarding to"BELOW"above", "over""north of", "to the north of""at 12:00 position relative to", "at 12:00 position regarding to", "at 12 o'clock position regarding to"ABOVE"below", "under""south of", "to the south of""at 6:00 position relative to", "at 6:00 position regarding to", "at 6 o'clock position regarding to"BEHIND"behind", "on the back of"FRONT"in front of", "after"DistancesFAR"far from", "farther from", "away from"NEAR"near to", "close to"TopologicalDCdisconnected fromEC"touch[es]"PO"overlap[s]"EQ-TPP"covered by", "inside and touching"TPPI"cover[s]"NTPP"in", "inside", "within"NTPPI"ha[s/ve]","contain[s]"PropertiesblockBlock, boxblocksBlocks, boxesobject_general_namething, object, shape, fruitobjects_general_namethings, objects, shapes, fruitsblock_nameAAA, BBB, CCC, DDD, EEE, JJJ, HHH, JJJ, LLL, KKK, one, two, three.coloryellow, black, blue, green, red, orange, grey, white, purplesizesmall, big, medium, midsize, large, tiny, littletypecircle, oval, square, rectangle, dimond, star, triangle, hexagon, pentagon, watermelon, apple, melon.typescircles, ovals, squares, rectangles, dimonds, stars, triangles, hexagons, pentagons, watermelons, apples, melons. previously proposed annotation scheme of SpRL and provide free annotations for the SpRL task. 4.2.1 Statistic Information: Each example in SpaRTUN contains a story that describes the spatial relation between entities and some questions which ask about indirect relations between entities. On average, each story contains eight sentences and 91 tokens, which describe ten relations on average. To generate SpaRTUN, we use 6.6k NLVR scene graphs for training and 1k for each dev and test set. We follow SpartQA for dataset split. Thus, we collect 20k training, 3k dev, and 3k test examples for each FR and YN question (see Table 4.1)10. On average, each story of SpaRTUN contains eight sentences and 91 tokens that describe, on average ten relations between different mentions of entities. 54% of the YN questions have "Yes" answers and the rest (46%) have "No." In FR questions which can have multiple answers, the label are true as below: { left : 10%, right:10%, above: 27%, below: 26%, behind: 19%, front: 10%, near: 2%, far: 15%, dc: 26%, ec: 7%, po: 0.2%, tpp: 2%, ntpp: 10%, tppi: 3%, and ntppi: 8% }. 4.2.2 Version Updates We introduce an updated version of the dataset wherein we have eliminated elements contributing to coreference and pronoun complexity, as well as quantifiers in the Yes/No questions. This modification has been made to ensure that the evaluation of the models is focused solely on their capacity to execute multi-hop spatial reasoning tasks. To facilitate this, each variable within the dataset is assigned a unique name/identifier, which is then consistently utilized across all corresponding sentences and questions. These changes minimize ambiguity, thereby allowing for a more accurate assessment of the models’ spatial reasoning capabilities. 4.3 Spatial Question Answering in Real-World Domain To evaluate the effectiveness of SpaRTUN in transfer learning, we created another dataset named ReSQ11 (Figure 6.1). 
This dataset is built on mSpRL [55] corpus while we added human-generated 10All data are provided in the English language.: The corpus is in English. 11Real-world Spatial Questions 45 spatial questions and answers to its real image descriptions. This dataset comparatively reflects more realistic challenges and complexities of the SQA problem. Figure 4.7 ReSQ - A human-generated dataset for probing the models on realistic SQA. We created this dataset to reflect the natural complexity of real-world spatial descriptions and questions. We asked three volunteers (English-speaking undergrad students) to generate at least four Yes/No questions for contexts in mSpRL dataset that contains complex human-generated sentences. The questions require at least one step of reasoning. The advantage of ReSQ is that the human-generated spatial descriptions and their spatial annotations already exist in the original dataset. On average, the annotators spent 20 minutes generating questions for each group of sentences, which, in total, cost 210 hours to generate the whole data. After gathering the data, another undergrad student checked the questions and removed the incorrect ones. The train set is provided on the train set of mSpRL, and since mSpRL does not have a dev set, we split the 32% of test data (equal to 20% of the training set) and keep it as the dev set. 50% of questions in this dataset have "Yes," and 50% have "No" answers. The statistics of this dataset are provided in Table 4.1. One of the challenges of the ReSQ, which is not addressed here, is that the questions require spatial commonsense knowledge in addition to capturing the spatial semantics. For example, by using commonsense knowledge from the sentence, “a lamp hanging on the ceiling”, we can infer that the lamp is above all the objects in the room. 4.3.1 Version Updates In the updated version of this dataset, we have implemented several enhancements to improve its quality and utility for research in spatial reasoning. These updates include: 46 A grey car is parking in front of a grey housewith brown window frames and plants on thebalcony.Q: Are the plants in front of the car?NoQ: Are the plants in the house? Yes 1. Clarification of Question Text: We have reviewed and eliminated ambiguous questions or ones that contained incorrect entity mentions. This step ensures that the questions in the dataset are unambiguous, facilitating more accurate evaluations. 2. Verification and Correction of Answers: The answers to the questions in the dataset have been thoroughly double-checked and corrected where necessary. This update ensures the reliability and accuracy of the dataset. 3. Introduction of "non-commonsense" and "commonsense" Labels: We have added labels to the questions to indicate whether they require "non-commonsense" or "commonsense" spatial reasoning. This update helps to have a finer granularity in evaluating and analyzing the performance of models on different types of reasoning tasks. 4. Expansion with Additional Non-Commonsense Questions: We have enriched the dataset by adding extra questions that specifically focus on evaluating models’ multi-hop spatial reasoning capabilities while disregarding their commonsense understanding. Overall, these updates enhance the dataset’s quality, making it a more reliable and effective tool for evaluating the spatial reasoning capabilities of various models. 
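Since each question in the updated ReSQ carries a label indicating whether it requires commonsense reasoning, accuracy can be reported per label; a trivial sketch is shown below, where the field names are assumptions about the data format rather than the released schema.

```python
def accuracy_by_label(examples):
    """Report overall accuracy plus separate accuracies for questions tagged as
    needing commonsense vs. non-commonsense spatial reasoning."""
    buckets = {"all": [], "commonsense": [], "non-commonsense": []}
    for ex in examples:
        correct = ex["prediction"] == ex["answer"]
        buckets["all"].append(correct)
        buckets[ex["reasoning_type"]].append(correct)  # assumed label field
    return {k: sum(v) / len(v) for k, v in buckets.items() if v}

examples = [
    {"answer": "Yes", "prediction": "Yes", "reasoning_type": "non-commonsense"},
    {"answer": "No",  "prediction": "Yes", "reasoning_type": "commonsense"},
]
print(accuracy_by_label(examples))  # {'all': 0.5, 'commonsense': 0.0, 'non-commonsense': 1.0}
```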
4.4 Experiments and Results Dataset bAbI SpartQA-Auto (YN) SpartQA-Auto (FR) SpartQA-Human (YN) SpartQA-Human (FR) ReSQ StepGame SpaRTUN (YN) SpaRTUN (FR) Train Dev 992 8992 26152 3860 25744 3780 51 162 28 149 1008 333 50000 1000 20334 3152 18400 2818 Test 992 3896 3797 143 77 610 10000 3193 2830 Table 4.1 Size of SQA benchmarks. In this section, we analyze the impact of our synthetic benchmarks, SpartQA-Auto and SpaRTUN, as an extra source of supervision on SQA tasks using the PLM-based models described in Section 3.4. In all experiments, we compare the performance of models fine-tuned with the target datasets with and without further pretraining on synthetic supervision (SynSup). We select bAbI 47 tasks 17 and 19, StepGame, and boolQ alongside our proposed benchmarks SpartQA-Human and ReSQ as our target datasets. The baselines for all experiments include a majority baseline (MB) which predicts the most repeated label as the answer to all questions, and a pretrained language model, that is, BERT here. We also report the human accuracy in answering the questions for the human-generated datasets12. For all experiments, to evaluate the models, we measure the accuracy which is the percentage of correct predictions in the test sets. 4.4.1 Experiments Configuration Since the datasets that we use contain different question types, we supervise the models based on the same question type as the target task13. We use the huggingFace14 implementation of pretrained BERT which has 768 hidden dimensions. All models are trained on the training set, evaluated on the dev set, and reported the result on the test set. For training, we train the model until no changes happen on the dev set and then store and use the best model on the dev set. We use AdamW ([72]) optimizer on all models and modules. We use Focal Loss [69] with 𝛾 = 2. 𝐿𝑜𝑠𝑠 = ∑︁ CrossEntropyLoss(p′, y′) + BCELoss(p, y) (4.1) Besides, The result of BERT model trained and tested on SpaRTUN and SpaRTUN are provided in Table 4.2. SpaRTUN-Simple only contains one spatial expression for each relation type, and SpaRTUN-Clock contains all relation expressions plus clock expressions (Column 5 in Figure 4.5) for relation types. All codes are publicly available at https://github.com/HLR/Spatial-QA-tasks. 12All human results gathered by scoring the human answers over a subset of the test set. 13StepGame only has FR question types. Hence, we use the model trained on FR questions for both FR and YN target tasks. 14https://huggingface.co/transformers/v2.9.1/model_doc/bert.html 48 Dataset SpaRTUN SpaRTUN-Simple SpaRTUN-Clock SpartQA YN 92.83 90.30 - 82.05 FR 93.66 93.66 87.13 94.17 Table 4.2 Result of BERT (SQA) model trained and test on two synthetic supervision data. 4.4.2 Further Pretraining on SpartQA-Auto Improves Spatial Reasoning Table 4.3 shows performance on SpartQA-Human in a low-resource setting, where 0.6k QA pairs from SpartQA-Human are used for fine-tuning these LMs and 0.5k for testing. # Model 1 Majority 2 BERT 3 BERT (Stories only; MLM) 4 BERT (SpartQA-Auto; MLM) 5 BERT (SpartQA-Auto) 6 Human FB 28.84 16.34 21.15 19.23 62.5 91.66 FR 24.52 20 16.19 29.54 46.66 95.23 CO 40.18 26.16 27.1 32.71 32.71 91.66 YN 53.60 45.36 51.54 47.42 47.42 90.69 Avg 36.64 30.17 32.90 34.88 47.25 92.31 Table 4.3 Further pretraining BERT on SpartQA-Auto improves accuracies on SpartQA-Human. All models are fine-tuned on the training data of SpartQA-Human, but line 3-5 are also further pretrained in different ways. 
line 3: further pretrained on the stories from SpartQA-Auto as a masked language model (MLM) task. line 4: further pretrained on both stories and QA annotations as MLM. line 5: the proposed model that is further pretrained on SpartQA-Auto as a QA task. Avg: the micro-average on all four Q-Types.

In Table 4.3, line 5, BERT (SpartQA-Auto), is the proposed method of further pretraining BERT on SpartQA-Auto. We can see that line 2, the original BERT, performs consistently lower than BERT (SpartQA-Auto), indicating that having SpartQA-Auto as a further pretraining task improves BERT's spatial understanding. In addition, we implement another two baselines. BERT (Stories only; MLM): further pretraining BERT only on the stories of SpartQA-Auto as a masked language model (MLM) task; BERT (SpartQA-Auto; MLM): we convert the QA pairs in SpartQA-Auto into textual statements and further pretrain BERT on the text as an MLM (see Fig. 4.8 for an example conversion). To convert each question and its answer into a sentence, we utilize static templates for each question type, which remove the question words and rearrange the other parts into a sentence.

Figure 4.8 Converting a triplet of (paragraph, question, answer) into a single piece of text for the MLM task. For example, the paragraph "A big circle is above a triangle. A blue square is below the triangle." with the question "What is the relation between the circle and the blue object?" and the answer "Above" becomes "A big circle is above a triangle. A blue square is below the triangle. The circle is [MASK] the blue object.", where the masked token is "above."

We can see that line 3 slightly improves over the original BERT, an observation consistent with many prior works showing that seeing more text generally helps an LM (e.g., [34]). The significant gap between line 3 and the proposed line 5 (BERT (SpartQA-Auto)) indicates that the supervision signal comes more from our annotations in SpartQA-Auto than from seeing more unannotated text. Line 4 is another way to make use of the annotations in SpartQA-Auto, but it is shown to be not as effective as further pretraining BERT on SpartQA-Auto as a QA task. While the proposed BERT (SpartQA-Auto) overall performs better than the other three baseline models, one exception is its accuracy on YN, which is lower than that of line 3. Since all models' YN accuracies are also lower than the majority baseline, we hypothesize that this is due to imbalanced data. To verify this, we compute the F1 score for the YN Q-Type in Table 4.4, where we see that all models effectively achieve better scores than the majority baseline. However, further pretraining BERT on SpartQA-Auto still does not beat the other baselines, which implies that straightforward pretraining is not necessarily helpful in capturing the complex reasoning phenomena required by YN questions.

Table 4.4 F1 scores on the YN Q-Type: Majority 35; BERT 50; BERT (Stories only; MLM) 53; BERT (SpartQA-Auto; MLM) 48; BERT (SpartQA-Auto) 48. Switching from accuracy in Table 4.3 to F1 shows that the models are all performing better than the majority baseline on the YN Q-Type.

The human performance is evaluated on 100 random questions from each of the SpartQA-Auto and SpartQA-Human test sets. The respondents are graduate students who were trained with a few examples of the dataset before answering the final questions. We can see from Table 4.3 that all models' performances fall behind human performance by a large margin.

4.4.2.1 Extrinsic Evaluation

In this subsection, we show that further pretraining BERT on SpartQA-Auto achieves better performance on two extrinsic evaluation datasets, namely bAbI and boolQ.
Figure 4.9 Learning curve of BERT and BERT further pretrained on SpartQA-Auto on bAbI. We draw the learning curve on bAbI, using the original BERT as a baseline and BERT further pretrained on SpartQA-Auto (Fig. 4.9). Although both models achieve perfect accuracy given large enough training data (i.e., 5k and 10k), BERT (SpartQA-Auto) is showing better scores given less training data. Specifically, to achieve an accuracy of 99%, BERT (SpartQA-Auto) requires 1k training examples, while BERT requires twice as much. We also notice that BERT (SpartQA-Auto) converges faster in our experiments. As another evaluation dataset, we chose boolQ [11] (Check Figure 4.10) for two reasons. First, we needed a QA dataset with Yes/No questions. To our knowledge boolQ is the only available one used in the recent work. This dataset is provided on Wikipedia and is designed to be answerable by reading the corresponding passage of text and do not require any external knowledge. Second, indeed, SpartQA and boolQ are from different domains, however, boolQ needs multi-step reasoning in which we wanted to see if SpartQA helps. 51 Figure 4.10 An example of boolQ dataset. Model Majority baseline Recurrent model (ReM) ReM fine-tuned on SQuAD ReM fine-tuned on QNLI ReM fine-tuned on NQ BERT (our setup) BERT (SpartQA-Auto) Accuracy 62.2 62.2 69.8 71.4 72.8 71.9 74.2 Table 4.5 Models performances on the dev set of boolQ (since the test set is not available to us). Top: numbers reported in [11]. Bottom: numbers from our experiments. BERT (SpartQA-Auto): further pretraining BERT on SpartQA-Auto as a QA task. Table 4.5 shows that further pretraining BERT on SpartQA-Auto yields a better result than the original BERT and those reported numbers in clark2019boolq, which also tested on various distant supervision signals such as SQuAD [107], Google’s Natural Question dataset NQ [57], and QNLI from GLUE [138]. We observe that many of the boolQ examples answered correctly by the BERT further pretrained on SpartQA-Auto require multi-step reasoning. Our hypothesis is that since solving SpartQA-Auto questions needs multi-step reasoning, fine-tuning BERT on SpartQA-Auto generally improves this capability of the base model. 4.4.3 Imapct of Broader Relation Coverage on Transfer Learning The following experiments demonstrate the impact SpaRTUN with broader coverage of spatial relation types and expressions on transfer learning for SQA benchmarks compared to different supervisions. 52 Model MB BERT BERT BERT BERT BERT SynSup - - SpartQA-Auto StepGame SpaRTUN-Simple SpaRTUN 171k 51.9 87.39 90.42 87.39 92.43 90.02 19500 10.6 34.53 100 99.89 98.99 99.89 Table 4.6 Impact of using synthetic supervision on the bAbI tasks. All the models are further fine-tuned on the training set of task 17 (size = 1k) and 19 (size = 500), and test on bAbI test sets. bAbI(task 17 and 19): Task 17 is on spatial reasoning and contains binary Yes/No questions. Task 19 is on pathfinding and contains FR questions with answers in {LEFT, RIGHT, ABOVE, BELOW} set. The original dataset contains west, east, north, and south, which we mapped to their corresponding relative relation type. Due to the simplicity of bAbI dataset, our experiments show that PLM can solve this benchmark with 100% accuracy when trained on the full training data(10k). Hence we run our experiment on only 1k and 500 training examples of task 17 and task 19, respectively. Table 4.6 demonstrates the impact of synthetic supervision on both tasks of bAbI. 
The results with various synthetic data are fairly similar for these two tasks. However, pretraining the model with the simple version of SpaRTUN, named SpaRTUN-Simple, performs better than other synthetic datasets on task 17. This can be due to the fewer relation expressions in SpaRTUN-Simple, which follows the same structure as task 17. Model MB BERT BERT BERT BERT BERT Human SynSup - - SpartQA-Auto StepGame SpaRTUN-Simple SpaRTUN - YN 53.60 49.65 39.86 44.05 44.75 48.25 90.69 FR 24.52 18.18 48.05 11.68 37.66 50.64 95.23 Table 4.7 Transfer learning on SpartQA-Human. SpartQA-A stands for SpartQA-Auto. SpartQA-Human: In the next experiment, we investigate the impact of SpaRTUN on SpartQA- Human result. Comparing the results in Table 4.7, we find that even though the classification layer for SpartQA-Auto and SpartQA-Human are the same, the model trained on SpaRTUN has a better 53 Figure 4.11 StepGame. An example of questions that need ten steps of reasoning. transferability. It achieves 2.6% better accuracy on FR and 9% better accuracy on YN questions compared to SpartQA-Auto. YN is, yet, the most challenging question type in SpartQA-Human and none of the PLM-based models can reach even the simple majority baseline. k steps of reasoning SynSup Model TP-MANN - - BERT SpartQA-Auto BERT SpaRTUN-Simple BERT SpaRTUN BERT 1 85.77 98.44 98.63 98.70 98.55 2 60.31 94.77 94.95 95.21 95.02 4 3 37.45 50.18 71.7 91.78 91.94 77.74 92.46 77.93 79.1 92.04 5 31.25 57.56 68.37 69.53 70.34 6 28.53 50.34 61.67 62.14 63.39 8 7 9 22.52 35.41 46.86 10 21.46 26.45 23.67 33.62 45.17 39.69 57.95 50.82 44.03 57.37 48.79 44.67 42.72 52.09 48.36 45.68 58.74 Table 4.8 Result of models with and without extra synthetic supervision on StepGame. StepGame [118]: is a synthetic SQA dataset (See Figure 4.11) containing FR questions which need 𝑘 reasoning steps to be answered (𝑘 = 1 to 10). This paper was published in 2022 with the same contribution as SpartQA-Auto. However, it contains more relation expressions and less relation types and rules. The answer to each question is one relation in {left, right, below, above, lower-left, upper-right, lower-right, upper-left} set. This dataset was provided one year after SpartQA with the same purpose. However, it contains fewer relation types and rules. Table 4.8 demonstrates our experiments on StepGame. BERT without any extra supervision significantly outperforms the best reported model in shi2022stepgame, TP-MANN, which is based 54 Story: 0:"B is south east of J."1:"X is under E."2:"K is to the left of Z and is on the same horizontal plane." 3:"If L is the center of a clock face, E is located between 10 and 11." 4:"S is positioned above Q."5:"Q is diagonally to the bottom right of L." 6:"C and S are horizontal and C is to the left of S."7:"I is above B with a small gap between them." 8:"E is above N and to the left of N." 9:"Q is below and to the right of B." 10:"X is to the left of C with a small gap between them." question:"What is the relation of the agent L to the agent J? "lower-right" on a neural memory network. As expected, all the PLM-based models almost solve the questions with one step of reasoning (i.e. where the answer directly exists in the text). However, with increasing the steps of reasoning, the performance of the models decreases. Comparing the impact of different synthetic supervision, SpaRTUN achieves the best result on 𝑘 > 3. For questions with 𝑘 <= 3, SpaRTUN-Simple achieves competitive similar results compared to SpaRTUN. 
Overall, the performance gap in SpaRTUN-Simple, SpartQA-Auto and SpaRTUN shows that more coverage of relation expressions in SpaRTUN is effective. Model MB BERT BERT BERT BERT BERT Human SynSup - - SpartQA-Auto StepGame SpaRTUN-Simple SpaRTUN - Accu 50.21 57.37 55.08 60.14 58.03 63.60 90.38 Table 4.9 Results with and without extra supervision on ReSQ. The Human accuracy is the performance of human on answering a subset of test set. ReSQ: In the next experiment, we show the influence of SpaRTUN on real-world examples, which contain more types of spatial relations and need more rules of reasoning to be solved. Table 4.9 shows the result of transfer learning on ReSQ. To compute the human accuracy, we asked two volunteers (one from those who created the questions and one new volunteer) to answer 100 questions from the test set of ReSQ. Finally, a third student graded their answers and computed the accuracy. This result shows that the limited coverage of spatial relations and expression in SpartQA-Auto impacts the performance of BERT negatively. However, further pretraining BERT on SpaRTUN-Simple improves the result on ReSQ. This can be due to the higher coverage of relation types in SpaRTUN-Simple than SpartQA-Auto. Using SpaRTUN for further pretraining BERT has the best performance and improves the result by 5.5%, indicating its advantage for transferring knowledge to solve real-world spatial challenges. In conclusion, our experiments show the efficiency of SpaRTUN in improving the performance 55 of models on different benchmarks due to the flexible coverage of relation types and expressions. 4.5 Related Research Requiring large amounts of annotated data is a well-known issue in training complex deep neural models [161] that is extended to spatial language processing tasks. In our study, we noticed that all available large datasets on SQA task including bAbI [143], SpartQA-Auto, and StepGame [118] are, all, synthetic. bAbI is a simple dataset that covers a limited set of relation types, spatial rules, and vocabulary. StepGame focuses on a few relation types but with more relation expressions for each and considers multiple reasoning steps. SpartQA-Auto, comparatively, contains more relation types and needs complex multi-hop spatial reasoning. However, it contains a single linguistic spatial expression for each relation type. All of these datasets are created based on controlled toy settings and are not comparable with real-world spatial problems in the sense of realistic language complexity and coverage of all possible relation types. SpartQA-Human is a human-generated version of SpartQA-Auto with more spatial expressions. However, this dataset is provided for probing purposes and has a small training set that is not sufficient for effectively training deep models. Transfer learning has been used effectively in different NLP tasks to further fine-tune the PLMs [109, 2, 77]. Besides transfer learning, several other approaches are used to tackle the lack of training data in various NLP areas, such as providing techniques to label the unlabeled data [24], using semi-supervised models [132, 65] or data augmentation with synthetic data [67, 86]. However, transfer learning is a simple way of using synthetic data as an extra source of supervision at no annotation cost. Compared to the augmentation methods, the data in the transfer learning only needs to be close to the target task/domain [76] and not necessarily the same. 
Using logical reasoning to build datasets that need complex reasoning for question answering has been explored before [12, 113]. More recent efforts even use the path of reasoning and train models to follow it [126]. However, there are no previous works that model spatial reasoning as we do here, with broad coverage of spatial logic.
4.6 Summary
To improve LMs' capability in spatial reasoning, we use a synthetic dataset, SpartQA-Auto, which contains a large corpus of spatial descriptions and corresponding question-answer annotations, as a source of supervision for transfer learning on spatial question answering (SQA), and show that further pretraining LMs on this extra supervision significantly enhances their spatial language understanding and reasoning. We also show that a spatially improved LM can achieve better results on two extrinsic datasets (bAbI and boolQ). We also create a new synthetic dataset and demonstrate that expanding the coverage of relation types and combinations and of spatial language expressions can provide a more robust source of supervision for pretraining and transfer learning [89]. As a result, this data improves the models' performance in many experimental scenarios on both tasks when tested on various evaluation benchmarks. This data includes the rules of spatial reasoning and the chain of logical reasoning for answering the questions, which can be used for further research in the future. Moreover, we provide a human-generated dataset on a realistic SQA task that can be used to evaluate models and methods for spatial language understanding in real-world problems.
CHAPTER 5
SPATIAL ROLE LABELING MODELS
5.1 Introduction
As we described in Section 2.1.1, Spatial Role Labeling (SpRL), which can also be considered spatial information extraction, is the task of identifying and classifying the spatial roles and their relations. Since this task is fundamental and one of the key points of spatial language understanding, in this section we provide models to solve it. We call the former spatial role extraction and the latter spatial relation extraction (Figure 5.1); note that in different works, e.g., [54], the terms triplet and relation are used interchangeably. The SpRL model is a pipeline of two modules, one for spatial role extraction and another for spatial relation extraction (Figure 5.1). The spatial relation extraction model is implemented for two joint tasks, spatial triplet validation and spatial relation type classification. We show how this model outperforms previous SOTAs on the mSpRL benchmark. Next, we analyze the impact of SpaRTUN as a source of extra supervision on several SpRL benchmarks. To the best of our knowledge, we are the first to use synthetic supervision for the SpRL task. Our results show that the auto-generated data successfully improves the SOTA results on mSpRL and SpartQA-Human, which are annotated for the SpRL task.
5.2 Models
In this section, we describe the implemented SpRL models. Several neural models have been proposed to solve spatial role labeling [83, 74, 15]. We take a similar approach to prior research [119] for the extraction of spatial roles (entities (Trajector/Landmark) and spatial indicators). First, we separately tokenize each sentence in the context and use a PLM (BERT here) to compute the token representations. Next, we apply a BIO tagging layer on the token representations using the (O, B-entity, I-entity, B-indicator, I-indicator) tags. A softmax layer on the BIO tagger output is used to select the spatial entities and spatial indicators with the highest probability.
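The role-extraction architecture just described (BERT token representations followed by a BIO tagging layer and a softmax) can be sketched as follows. This is an illustrative PyTorch-style sketch under our own naming and hyperparameter assumptions, not the exact implementation used in the experiments.

    import torch
    import torch.nn as nn
    from transformers import BertModel

    BIO_TAGS = ["O", "B-entity", "I-entity", "B-indicator", "I-indicator"]

    class SpatialRoleTagger(nn.Module):
        """Sketch of the spatial role extraction module: BERT token
        representations -> linear BIO tagging layer -> softmax selection."""
        def __init__(self, model_name="bert-base-uncased"):
            super().__init__()
            self.bert = BertModel.from_pretrained(model_name)
            self.tagger = nn.Linear(self.bert.config.hidden_size, len(BIO_TAGS))

        def forward(self, input_ids, attention_mask):
            tokens = self.bert(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
            return self.tagger(tokens)  # (batch, seq_len, num_tags) logits

        @torch.no_grad()
        def predict_tags(self, input_ids, attention_mask):
            # softmax over the tag logits and pick the highest-probability
            # tag for each token, as described above
            probs = self.forward(input_ids, attention_mask).softmax(dim=-1)
            return probs.argmax(dim=-1)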
For training, we use cross-entropy loss given the spatial annotations.
Figure 5.1 The spatial role labeling model includes two separately trained modules. E: entity, SP: spatial indicator. As an example, the triplet (a grey car, front, a grey house) is correct with spatial_type = FRONT, while (a grey house, front, a grey car) is incorrect with spatial_type = NaN.
For the spatial relation extraction model, similar to [150, 119], we use BERT and a classification layer to extract correct triplets. Given the output of the spatial role extraction model, for each combination of (spatial entity (tr), spatial indicator (sp), spatial entity (lm)) in each sentence, we create an input of the form [CLS, tr, SEP, sp, SEP, lm, SEP, sentence, SEP] and pass it to the BERT model. To indicate the position of each spatial role in the sentence, we use segment embeddings and add 1 if it is a role position and 0 otherwise. The [CLS] output of BERT is passed to a one-layer MLP that provides the probability for the triplet. Compared to the prior research, we predict the spatial type of each triplet as an auxiliary task for spatial relation extraction. To this aim, we apply another multi-class classification layer on the same [CLS] token; the classes are the relation types in Table 2.1 alongside a NaN class for incorrect triplets. For spatial role extraction, we use cross-entropy loss for BIO tagging, and for spatial relation extraction, we use the summation of the losses of the spatial relation (triplet) and relation type classification parts:

Loss = Σ CrossEntropyLoss(p′, y′) + BCELoss(p, y)    (5.1)

Formalism (General Type)   Spatial type (specific value)              Expressions (e.g.)
Topological (RCC8)         DC (disconnected)                          disjoint
                           EC (Externally Connected)                  touching
                           PO (Partially Overlapped)                  overlapped
                           EQ (Equal)                                 equal
                           TPP (Tangential Proper Part)               covered by
                           NTPP (Non-Tangential Proper Part)          in, inside
                           TPPI (Tangential Proper Part inverse)      covers
                           NTPPI (Non-Tangential Proper Part inverse) has
Directional (Relative)     LEFT, RIGHT                                left of, right of
                           BELOW, ABOVE                               under, over
                           BEHIND, FRONT                              behind, in front
Distance (Qualitative)     Far, Near                                  far, close
Table 5.1 Spatial relation types and examples of spatial language expressions.

5.3 Experiments and Results
The pipeline model described in Section 5.2 contains three main parts: a model for spatial role extraction (SRole), a model for spatial relation extraction (SRel), and a model for spatial relation type classification (SType), which we analyze separately (see Table 6.4). We use the macro F1-score (the mean of the F1 for each class) to evaluate the performance of the SRole, SRel, and SType models. Here, we first analyze the superiority of our proposed model compared to the previous SOTA model on mSpRL. Next, we analyze the influence of the extra synthetic supervision on the SpRL task when evaluated on the human-generated datasets, mSpRL and SpartQA-Human.
To do this, we further pretrain the BERT module in these models and then fine-tune it on the target domain. Table 5.2 shows the number of sentences in each SpRL benchmark. As shown in Table 6.4, all SpRL submodules achieve high results on synthetic datasets, SpartQA and SpaRTUN. The reason behind this good performance is that these datasets may contain less ambiguity coming from the natural language expressions. Therefore, the BERT-base models can 60 Dataset SpartQA-Auto (story) SpartQA-Auto (question) SpartQA-Human (story) SpartQA-Human (question) SpaRTUN (story) SpaRTUN (question) mSpRL Train Dev 25755 23584 176 155 48368 38734 481 16214 15092 99 127 7031 5970 - Test 16336 15216 272 367 7191 6023 461 Table 5.2 Number of sentences of SpRL benchmarks. To train the SpartQA-Auto, we only use the 3k training examples (23 - 25k sentences). Dataset mSpRL Model R-Inf* BERT BERT (SpartQA-Auto) BERT (SpaRTUN) BERT* (SpaRTUN) BERT SRole 80.92 88.59 88.41 88.03 87.74 55.8 BERT (SpartQA-Auto) SpartQA-Human 57.28 BERT (SpaRTUN) 72.53 BERT BERT SpartQA-Auto 99.92 SpaRTUN 99.96 SRel 68.78 69.12 68.84 71.23 70.18 S: 57.43 Q: 52.55 S: 58.32 Q: 55.17 S: 60.24 Q: 61.53 S: 99.72 Q: 98.36 S: 99.18 Q: 97.68 SType - 19.79 23.65 26.02 43.79 39.34 48.74 48.07 99.05 98.62 98.57 98.11 Table 5.3 Extraction modules used in PistaQ. Q: question and S: stories. Certain sentences in the mSpRL datasets are considered as conveying no spatial information by the original SpRL dataset annotators. These sentences are included in experiments where specified by an asterisk (*) and are excluded from all other experiments. easily capture the spatial language’s syntactic patterns that are needed for extracting the roles and direct relations from the large training set. Table 6.4 also demonstrates the influence of synthetic supervision in spatial role extraction evaluated on mSpRL and SpartQA-Human. We compare the result of the SRole model with the previous SOTA, “R-Inf” [78], on mSpRL dataset. R-Inf uses external multi-modal resources and global inference. All of the BERT-based SRole models outperform the R-Inf, which shows the power of PLMs for this task. However, since 61 the accuracy of the SRole is already very high, using synthetic supervision shows no improvements compared to the model that only trained with mSpRL training set for the SRole. In contrast, on SpartQA-Human, using synthetic supervision helps the model perform better. Especially, using SpaRTUN increases the performance of the SRole model dramatically by 15%. In Table 6.4, we show the result of the SRel model (containing spatial relation extraction and spatial relation type classification) for spatial relation extraction, with and without extra supervision from synthetic data. Same as the SRole model, extra supervision from SpaRTUN achieves the best result when tested on SpartQA-Human. For mSpRL, we compared the SRel model with R-Inf on spatial relation extraction. As table 6.4 demonstrates we improve the SOTA by 2.6% on F1 measure using SpaRTUN as synthetic supervision. Also, the model further pretrained on SpartQA-Auto gets lower results than the model with no extra supervision due to the limited relation expressions used in this data. In conclusion, our experiments show the efficiency of SpaRTUN in improving the performance of models on different benchmarks due to the flexible coverage of relation types and expressions. 
5.4 Related Research
For the SpRL task, mSpRL and SpaceEval (SemEval-2015 task 8) [102] are the two available datasets with spatial role and relation annotations. These are small-scale datasets for studying the SpRL problem. Among the previous works that tried transfer learning for the SpRL task, [94] only applied it to the word embeddings of their SpRL model, and [119] used a PLM without any specifically designed dataset for further pretraining. These issues motivated us to create SpaRTUN for further pretraining and transfer learning for SQA and SpRL.
5.5 Summary
We demonstrate that our proposed SpRL model surpasses the previous SOTA model on the mSpRL benchmark [89]. Besides, we show that our generated synthetic dataset can be helpful as a source of supervision for transfer learning for spatial role labeling (SpRL) tasks. We show that expanding the coverage of relation types and combinations and of spatial language expressions can provide a more robust source of supervision for pretraining and transfer learning. As a result, this data improves the models' performance in many experimental scenarios on both tasks when tested on various evaluation benchmarks.
CHAPTER 6
DISENTANGLING EXTRACTION AND REASONING IN MULTI-HOP SPATIAL REASONING
6.1 Introduction
Despite the high performance of recent pretrained language models on question-answering (QA) tasks, solving questions that require multi-hop reasoning is still challenging [82]. In Chapter 3, we investigated the use of general end-to-end deep neural models such as pretrained language models (PLMs) [91] in spatial question answering (SQA). PLMs show reasonable performance on the SQA problem and can implicitly learn spatial rules from a large set of training examples. However, the black-box nature of PLMs makes it unclear whether these models make the abstractions necessary for spatial reasoning or whether their decisions are based solely on patterns observed in the data. As a solution for better multi-hop reasoning, recent research has investigated the impact of using fine-grained information extraction modules such as Named Entity Recognition (NER) [93, 84], gated Entity/Relation [157], or semantic role labels (SRL) [116, 25] on the performance of models. In a different thread, cognitive studies [121, 20] show that when the given information is short, humans also form spatial abstractions and use spatial rules to infer implicit information. Figure 6.1 shows an example of such extractions.
Building upon these findings, we aim to address the limitations of end-to-end models and capitalize on the advantages of fine-grained information extraction in solving SQA. Thus, we propose models that disentangle the language understanding and spatial reasoning computations as two separate components. Specifically, we first design a pipeline model that includes trained neural modules for extracting direct fine-grained spatial information from the text and performing symbolic spatial reasoning over it. The second model is simply an end-to-end PLM that uses the annotations employed by the extraction modules of the pipeline model in the form of extra QA supervision. This model aims to demonstrate the advantages of using separate extraction modules compared to a QA-based approach while utilizing the same amount of supervision. Finally, the third model is an end-to-end PLM-based relation extraction model that has explicit latent layers to disentangle the extraction and reasoning inside the model. This model incorporates a neural spatial reasoner, which is trained to identify all spatial relations between each pair of entities.
Figure 6.1 An example of the steps of spatial reasoning on the ReSQ dataset.
We begin by searching for the plants from the question triplet within the text, enabling us to extract the explicit triplets (1, 2). Next, we apply rules such as INVERSE to deduce the implicit triplets (3, 4, 5). Then, utilizing triplets 5 and 6, we determine the final answer, 'Yes'. NTPP: Non-Tangential Proper Part (Table 6.1). In the figure, the story is "A grey car is parking in front of a grey house with brown window frames and plants on the balcony." and the question is "Are the plants behind the car?"; extraction yields the explicit triplets FRONT(a grey car, a grey house) and NTPPI(a grey house, plants); INVERSE yields NTPP(plants, a grey house) and BEHIND(a grey house, a grey car); COMBINE yields BEHIND(plants, a grey car); and coreference links 'the car' to 'a grey car'.

Formalism (General Type)   Spatial Type                 Expressions (e.g.)
Topological (RCC8)         DC (disconnected)            disjoint
                           EC (Externally Connected)    touching
                           PO (Partially Overlapped)    overlapped
                           EQ (Equal)                   equal
                           TPP                          covered by
                           NTPP                         in, inside
                           TPPI                         covers
                           NTPPI                        has
Directional (Relative)     LEFT, RIGHT                  left of, right of
                           BELOW, ABOVE                 under, over
                           BEHIND, FRONT                behind, in front
Distance                   Far, Near                    far, close
Table 6.1 List of spatial relation formalisms and types.

We evaluate the proposed models on multiple SQA datasets, demonstrating the effectiveness of the disentangled extraction and reasoning approach in controlled and realistic environments. Our pipeline outperforms existing SOTA models by a significant margin on benchmarks with a controlled environment (toy tasks) while utilizing the same or less training data. However, in real-world scenarios, with higher ambiguity of natural language for extraction and more rules to cover, our end-to-end model with explicit layers for extraction and reasoning performs better. These results show that disentangling extraction and reasoning benefits deterministic spatial reasoning and improves generalization in realistic domains, despite the coverage limitations and sensitivity to noise of symbolic reasoning. These findings highlight the potential of leveraging language models for information extraction tasks and emphasize the importance of explicit reasoning modules rather than solely depending on black-box neural models for reasoning.
6.2 Related Research
End-to-end models on SQA: To solve SQA tasks, recent research evaluates the performance of different deep neural models such as Memory Networks [118, 125], Self-attentive Associative Memory [61], subsymbolic fully connected neural networks [160], and Recurrent Relational Networks (RRN) [100]. In Chapter 3, we used transfer learning and provided large synthetic supervision that enhances the performance of PLMs on spatial question answering. However, the results show a large gap between models and human performance on human-generated data. Besides, none of these models use explicit spatial semantics to solve the task. The only attempt towards integrating spatial semantics into the spatial QA task is a baseline model we introduced, which uses rule-based spatial semantics extraction for reasoning on bAbI (task 17) and achieves 100% accuracy without using any training data.
Extraction and Reasoning: While prior research has extensively explored the use of end-to-end models for learning reasoning rules [87, 103], there is limited discussion on separating the extraction and reasoning tasks. [98] utilizes LMs to generate new sentences and extract facts while using some symbolic rules to ensure consistency between the generated sentences. Similarly, ThinkSum [99] uses LMs for knowledge extraction (Think) and separate probabilistic reasoning (Sum), which sums the probabilities of the extracted information. However, none of these works are on multi-step or spatial reasoning.
Figure 6.2 PistaQ pipeline based on disentangled extraction and reasoning. In this model, facts, e.g., FRONT(grey car, grey house), are extracted from the story and linked by coreference modules. The R-Coref equates 'the car' from the question with 'a grey car' in the story and forms a query. This query, along with the facts, is sent to the spatial reasoner. Finally, the spatial reasoner employs the FRONT and BEHIND rules and returns True as the answer. (The rules shown in the figure are FRONT(X, Z) :- FRONT(X, Y), NTPPI(Y, Z) and BEHIND(X, Y) :- FRONT(Y, X).)
6.3 Proposed Models
To understand the effectiveness of disentangling the extraction and reasoning modules, we provide three groups of models for the spatial question answering (SQA) task. The first model is a pipeline of extraction and symbolic reasoning (§6.3.1), the second model is an end-to-end PLM that uses the same spatial information supervision but in a QA format (§6.3.2), and the third model is an end-to-end neural model with explicit layers of extraction and reasoning (§6.3.3). We elaborate on each of these models in the subsequent sections.
6.3.1 Pipeline of Extraction and Reasoning
Here, we describe our suggested pipeline model designed for the spatial question answering task, referred to as PistaQ (PIpeline model for SpaTiAl Question answering). As shown in the extraction part of Figure 6.2, the spatial information is extracted first and forms a set of triplets for a story (Facts) and a question (Query). Then a coreference resolution module is used to connect these triplets to each other. Given the facts and queries, the spatial reasoner infers all implicit relations. The answer generator then produces the final answer. Below we describe each module in more detail.

Rule          Quantification                                    Definition
Not           ∀(X, Y) ∈ Entities, R ∈ {Dir ∨ PP}                IF R(X, Y) ⇒ NOT(R_reverse(X, Y))
Inverse       ∀(X, Y) ∈ Entities, R ∈ {Dir ∨ PP}                IF R(Y, X) ⇒ R_reverse(X, Y)
Symmetry      ∀(X, Y) ∈ Entities, R ∈ {Dis ∨ (RCC8 − PP)}       IF R(Y, X) ⇒ R(X, Y)
Transitivity  ∀(X, Y, Z) ∈ Entities, R ∈ {Dir ∨ PP}             IF R(X, Z), R(Z, Y) ⇒ R(X, Y)
Combination   ∀(X, Y, Z, H) ∈ Entities, R ∈ Dir, ∗PP ∈ PP       IF ∗PP(X, Z), R(Z, H), ∗PPi(H, Y) ⇒ R(X, Y)
Table 6.2 Designed spatial rules [89]. Dir: directional relations (e.g., LEFT), Dis: distance relations (e.g., FAR), PP: all proper part relations (NTPP, NTPPI, TPPI, TPP), RCC8 − PP: all RCC8 relations except the proper part relations. ∗PP: one of TPP or NTPP. ∗PPi: one of NTPPi or TPPi.

Spatial Role Labeling (SpRL) is the task of identifying and classifying the spatial roles of phrases within a text (including the Trajector, Landmark, and Spatial Indicator) and formalizing their relations [54].
Here, we use the same SpRL modules as in Chapter 5. This model first computes the token representations of a story and its question using a BERT model. Then a BIO tagging layer is applied on the token representations using the (O, B-entity, I-entity, B-indicator, and I-indicator) tags. Finally, a softmax layer on the BIO tagger output selects the spatial entities (Trajector/Landmark), e.g., 'grey car' or 'plants' in Figure 6.2, and the spatial indicators, e.g., 'in front of' in Figure 6.2. Given the output of the spatial role extraction module, for each combination of (Trajector, Spatial Indicator, Landmark) in each sentence, we create a textual input of the form [CLS, traj, SEP, indic, SEP, land, SEP, sentence, SEP] and pass it to a BERT model. The [CLS] output of BERT is passed to a one-layer MLP that provides the probability for each triplet. To apply the logical rules on the triplets, we need to assign a relation type to each triplet. To this aim, we use another multi-class classification layer on the same [CLS] token to identify the spatial type of the triplet. The classes are the relation types in Table 6.1 alongside a NaN class for triplets with no spatial meaning. For instance, in Figure 6.2, (grey car, in front of, grey house) is a triplet with FRONT as its relation type, while (grey house, in front of, grey car) is not a triplet and its relation type is NaN. We use a joint loss function for triplet and relation type classification to train the model.
Coreference Resolution: Linking the extracted triplets from the stories is another important step required in this task, as different phrases or pronouns may refer to the same entity. To make such connections, we implement a coreference resolution model based on [62], extract all antecedents of each entity, and assign a unique id to them. In contrast to previous work, we have extended the model to support plural antecedents (e.g., two circles). We first collect all mentions of each entity predicted by spatial role extraction or given by the role annotations, then assign an id to the same mentions and include that id in each triplet. For example, for BELOW(a cat, a grey car) and FRONT(the car, a church), id 1 = a cat, 2 = a grey car, the car, and 3 = a church, so we create new triplets in the form BELOW(1, 2) and FRONT(2, 3). To train the model, we pair each mention with its previous antecedent and use cross-entropy loss to penalize the model if the correct pair is not chosen. For singletons and the starting mention of an object, the model should return class 0, which is the [CLS] token. Since the previous model does not support plural antecedents (e.g., two circles), we add this support by considering shared entities in pairs; for example, both (two circles, the black circle) and (two circles, the blue circle) are true pairs.
Figure 6.3 The coreference resolution model [62]. (a) The coreference resolution model structure. (b) The formula for computing the coreference scores.
As an instance of the importance of coreference resolution in spatial reasoning, consider this context: “block A has one black and one green circle. The black circle is above a yellow square. The yellow square is to the right of the green circle.
Which object in block A is to the left of a yellow square?” The reasoner must know that the ‘one green circle’ in NTPPI(block A, one green circle) 69 and the ‘the green circle’ in RIGHT( the yellow square, the green circle) are the same to be able to connect them using transitivity rule and find the answer. To find the mentions of the question entities in the story and create the queries, we use a Rule-based Coreference (R-Coref) based on exact/partial matching. In Figure 6.2, ‘the car’ in the question has the same id as ‘the grey car’ from the story’s triplets. Logic-based Spatial Reasoner To do symbolic spatial reasoning, we use the reasoner from Chapter 3. This reasoner is implemented in Prolog and utilizes a set of rules on various relation types, as illustrated in Table 6.2. Given the facts and queries in Prolog format, the spatial reasoner can carry out the reasoning process and provide an answer to any given query. The reasoner matches variables in the program with concrete values and a backtracking search to explore different possibilities for each rule until a solution is found. As shown in Figure 6.2, the reasoner uses a FRONT and a BEHIND rules over the facts and generates the True response for the query. 6.3.2 PLMs Using SpRL Annotations To have a fair comparison between the QA baselines and models trained on SpRL supervision, we design BERT-EQ4. We convert the SpRL annotation into extra YN questions5 asking about explicit relations between a pair of entities. To generate extra questions, we replace triplets from the SpRL annotation into the “Is [Trajector] [Relation*] [Landmark]?” template. The [Trajector] and [Landmark] are the entity phrases in the main sentence ignoring pronouns and general names (e.g., “an object/shape”). The [Relation*] is a relation expression (examples presented in Table 6.1) for the triplet relation type. To have equal positive and negative questions, we reverse the relation in half of the questions. We train BERT-EQ using both original and extra questions by passing the “question+story” into a BERT with answers classification layers. 6.3.3 PLMs with Explicit Extractions As another approach, we aim to explore a model that disentangles the extraction and reasoning parts inside a neural model. Here, rather than directly predicting the answer from the output of 4BERT+Extra Question 5This augmentation does not apply to FR type since it inquires about all relations between the two asked entities. 70 PLMs (as typically done in the QA task), we introduce explicit layers on top of PLM outputs. These layers are designed to generate representations for entities and pairs of entities, which are then passed to neural layers to identify all relations. We call this model SREQA6, which is an end-to-end spatial relation extraction model designed for QA. Figure 6.4 illustrates the structure of this model. (a) Model structure. First, entity mentions such as ‘plants’ and ‘grey car’ are selected from the BERT output and the entity representation is formed. Next, triplets like (‘plants’, ‘car’, [CLS]) are generated and fed into the reasoning component. The collective output of all relation classifiers determines the relationships between each pair. *All hatched parts are trained end-to-end. The rest of the data is obtained from annotations or off-the-shelf modules. (b) The source of supervision in each step of training. In step#1, we train the model on all story relations, and in step#2, we only train it on question relations. 
These modules and data are the same as the ones used in PistaQ.
Figure 6.4 The SREQA model with explicit neural layers to disentangle the extraction and reasoning parts.
In this model, we first select the entity mentions (M_j(E1)) from the BERT token representations and pass them to the extraction part shown in Figure 6.4a. Next, the model computes the entity representation (M(E1)) by summing the BERT token representations of all of the entity's mentions and passing the sum to an MLP layer. Then, for each pair of entities, a triplet is created by concatenating the two entity representations of the pair and the BERT [CLS] token representation. This triplet is passed through an MLP layer to compute the final pair representation. Next, in the reasoning part of Figure 6.4a, for each relation type in Table 6.1, we use a binary 2-layer MLP classifier to predict the probability of that relation between the pair. We remove inconsistent relations by selecting the one with the higher probability at inference time; e.g., LEFT and RIGHT cannot be true at the same time. The final output is a list of all possible relations for each pair. This model is trained using the summation of the focal losses [69] of all relation classifiers.
We train SREQA in two separate steps. In the first step, the model is trained on a relation extraction task that extracts all direct and indirect relations between each pair of entities only from the stories. The top row of Figure 6.4b shows the annotations and modules employed in this step to gather the necessary supervision. We use the entity and coreference annotations to select the entity mentions from the BERT output. To compute the relation supervision for each pair of entities, we employ the spatial reasoner from PistaQ and apply it to the direct relations (triplets) from the SpRL annotation, which are connected by the coreference annotations. This training step is only feasible for datasets with available SpRL and coreference annotations. In the next step, we further train SREQA on extracting the question relation using QA supervision. As shown in the bottom row of Figure 6.4b, we employ the trained spatial role extraction model used in PistaQ to identify the entities in the question and use R-Coref to find the mentions of these entities in the story. To obtain the relation supervision, we convert the question answers to relation labels. For FR questions, the label is the same as the actual answer, which lists all relations. For YN questions, the question relation is converted to a label based on the Yes/No answer. For example, in Figure 6.4a, the question relation is 'BEHIND' and the answer is Yes, so the label for the BEHIND classifier is 1. We evaluate the SREQA model's performance in predicting the correct answers to the test set questions, in the same way as training step 2.
6.4 Experiments
We evaluate the proposed models on the datasets presented in Chapter 3: SpartQA, SpaRTUN, and ReSQ.
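As a rough illustration of the SREQA extraction and reasoning layers described in Section 6.3.3, the pair representation and the per-relation binary classifiers could look like the following sketch; the dimensions, names, and relation list are assumptions for illustration, not the thesis code.

    import torch
    import torch.nn as nn

    # relation set of Table 6.1 (illustrative ordering)
    RELATION_TYPES = ["LEFT", "RIGHT", "ABOVE", "BELOW", "BEHIND", "FRONT",
                      "NEAR", "FAR", "DC", "EC", "PO", "EQ",
                      "TPP", "NTPP", "TPPI", "NTPPI"]

    class PairRelationHead(nn.Module):
        """Sketch: build a pair representation from two entity representations
        plus the [CLS] vector, then apply one binary classifier per relation."""
        def __init__(self, hidden=768):
            super().__init__()
            self.entity_mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
            self.pair_mlp = nn.Sequential(nn.Linear(3 * hidden, hidden), nn.ReLU())
            self.classifiers = nn.ModuleDict(
                {r: nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))
                 for r in RELATION_TYPES})

        def forward(self, mentions_e1, mentions_e2, cls_vec):
            # entity representation = MLP over the sum of its mention vectors
            e1 = self.entity_mlp(mentions_e1.sum(dim=0))
            e2 = self.entity_mlp(mentions_e2.sum(dim=0))
            pair = self.pair_mlp(torch.cat([e1, e2, cls_vec], dim=-1))
            # one independent probability per relation type for this pair
            return {r: torch.sigmoid(clf(pair)).squeeze(-1)
                    for r, clf in self.classifiers.items()}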
The first two datasets, SpartQA and SpaRTUN, are pre-annotated with both SpRL and coreference annotations. For the ReSQ dataset, we utilize the provided SpRL annotations from mSpRL. 72 6.4.1 Model Configurations & Baselines We compare the models described in section 6.3 with the following baselines. Majority Baseline: This baseline selects the most frequent answer(s) in each dataset. GT-PistaQ: This model uses ground truth (GT) values of all involved modules in PistaQ to eliminate the effect of error propagation in the pipeline. This baseline is used to evaluate the alignments between the questions and story entities and the reasoning module in solving the QA task. It also gives an upper bound for the performance of the pipeline model, as the extraction part is perfect. BERT: We select the same BERT described in Chapter 3 as a candidate PLM that entangles the extraction and reasoning steps. In this model, the input of the “question+story” is passed to the BERT, and the [𝐶 𝐿𝑆] representation is used for the answer classification. GPT-3.5: GPT-3.5 [8] baselines (GPT-3.5 text-davinci-003) is selected as a candidate of generative larger language models which already passes many SOTAs in reasoning tasks [5, 53]. We use 𝑍𝑒𝑟𝑜_𝑠ℎ𝑜𝑡 and 𝐹𝑒𝑤_𝑠ℎ𝑜𝑡 (In-context learning with few examples) settings to evaluate this model on the human-generated benchmarks. We also evaluate the Chain-of-Thoughts (CoT) prompting method [142] to extend the prompts with manually-written reasoning steps. We provide more details of the performance of LLMs on spatial language understanding tasks in Chapter 7. When applicable, we further train models on SpaRTUN synthetic data shown by “*”. We use the huggingFace7 implementation of pretrained BERT base models, which have 768 hidden dimensions. All models are trained on the training set, evaluated on the dev set, and reported the result on the test set. For training, we train the model until no changes happen on the dev set and then store and use the best model on the dev set. We use AdamW ([72]), and learning rates from 2 × 10−6, 2 × 10−5 (depends on the task and datasets) on all models and modules. We used the same configuration and setting for the extraction modules and the baselines as previous works [89]. For SREQA models we use learning rates of 2 × 10−5, 4 × 10−6 for SREQA(story) and SREQA(question) respectively. To run the models, we use a machine with an Intel Core i9-9820X (10 cores, 3.30 GHz) CPU and Titan 7https://huggingface.co/transformers/v2.9.1/model_doc/bert.html 73 RTX with NVLink as GPU. For GPT-3.5, we use Instruct-GPT, davinci-0038. The cost for running GPT-3.5 on the human-generated benchmarks was 0.002$ per 1k tokens. All codes are publicly available at https://github.com/RshNk73/PistaQ-SREQA. 6.5 Results and Discussion Here, we discuss the influence of disentangling extraction and reasoning manifested in PistaQ and SREQA models compared to various end-to-end models with no explicit design for these modules, such as BERT, BERT-EQ, and GPT-3.5. Table 6.3 shows the list of these models with the sources of their supervision as well as extra off-the-shelf or rule-based modules employed in them. 
Supervisions QA - QA(8 ex) Model BERT GPT-3.5𝑧𝑒𝑟 𝑜_𝑠ℎ𝑜𝑡 GPT-3.5 𝑓 𝑒𝑤_𝑠ℎ𝑜𝑡 GPT-3.5 𝑓 𝑒𝑤_𝑠ℎ𝑜𝑡 +CoT QA(8 ex) + CoT BERT-EQ SREQA SREQA* PistaQ PistaQ𝑧𝑒𝑟 𝑜_𝑠ℎ𝑜𝑡 QA +SpRL(S) QA +SpRL(all)+Coref Reasoner, R-Coref QA + SpRL(Q) SpRL(all) + Coref - R-Coref Reasoner, R-Coref Reasoner, R-Coref Rule-based Modules - - - - - Table 6.3 The list of annotations from the target benchmarks and rule-based modules employed in each model. We use a quarter of SpRL annotations to train the modules on auto-generated benchmarks. S: Stories, Q: Questions, All: Stories+Questions. Since the performance of extraction modules, Spatial Role Labeling (SpRL) and Coreference Resolution (Coref), directly contribute to the final accuracy of the designed models, we have evaluated these modules and reported the results in Table 6.4. We choose the best modules on each dataset for experiments. For a detailed discussion on the performance of these modules, see Section 6.5.3. 6.5.1 Result on Controlled Environment Table 6.5 shows the performance of models on two auto-generated benchmarks, SpaRTUN and SpartQA-Auto. We can observe that PistaQ outperforms all PLM baselines and SREQA. 8from https://beta.openai.com 74 Dataset mSpRL mSpRL* Coref - - SRole 88.59 88.03 Human 82.16 55.8 Human* 81.51 72.53 SpartQA 99.83 99.92 SpaRTUN 99.35 99.96 SRel 69.12 71.23 S: 57.43 Q: 52.55 S: 60.24 Q: 61.53 S: 99.72 Q: 98.36 S: 99.18 Q: 97.68 SType 19.79 23.65 43.79 39.34 48.74 48.07 99.05 98.62 98.57 98.11 Table 6.4 Performance of the extraction modules. Q: question. S: stories. HUMAN: SpartQA- Human. SpartQA: SpartQA-Auto. *Further pretraining modules on SpaRTUN. We report macro F1 for SpRL and the accuracy of the Coref modules. # Models 1 Majority baseline 2 GT-PistaQ 3 BERT 4 BERT-EQ SREQA 5 PistaQ 6 SpartQA-Auto SpaRTUN YN FR YN 51.82 14.23 53.62 99.51 99.43 99.07 91.80 84.88 91.80 90.71 N/A 85.60 85.11 83.31 88.21 97.56 94.52 96.37 FR 44.35 98.99 94.17 N/A 86.88 98.02 Table 6.5 Results on auto-generated datasets. We use the accuracy metric for both YN and FR questions. This outcome first highlights the effectiveness of the extraction and symbolic reasoning pipeline compared to PLMs in addressing deterministic reasoning within a controlled environment. Second, it shows that disentangling extraction and reasoning as a pipeline works better than explicit neural layers in SQA with a controlled environment. The complexity of these environments is more related to conducting several reasoning steps and demands accurate, logical computations where a rule-based reasoner excels. Thus, the result of PistaQ with a rule-based reasoner module is also higher than SREQA with a neural reasoner. The superior performance of PistaQ over BERT suggests that SpRL annotations are more effective in the PistaQ pipeline than when utilized in BERT-EQ in the form of QA supervision. Note that the extraction modules of PistaQ achieve perfect results on auto-generated benchmarks while trained only on a quarter of the SpRL annotations as shown in Table 6.5. However, BERT-EQ uses all the original dataset questions and extra questions created from the full SpRL annotations. 
75 # Models YN Acc 52.44 1 Majority baseline 79.72 2 GT-PistaQ 51.74 BERT 3 48.95 4 BERT* 5 GPT-3.5𝑍𝑒𝑟 𝑜_𝑠ℎ𝑜𝑡 45.45 6 GPT-3.5𝐹𝑒𝑤_𝑠ℎ𝑜𝑡 60.13 7 GPT-3.5𝐹𝑒𝑤_𝑠ℎ𝑜𝑡 +CoT 62.93 50.34 8 45.45 9 53.23 10 46.96 11 75.52 12 BERT-EQ BERT-EQ* SREQA SREQA* PistaQ FR R 14.28 66.04 30.13 49.10 22.42 54.10 37.92 - - 13.85 25.79 35.93 P 29.87 96.38 30.74 60.96 40.13 45.20 57.18 - - 15.68 18.70 72.11 F1 6.57 75.16 28.17 50.56 16.51 44.28 38.47 - - 13.70 24.61 46.80 Table 6.6 Results on SpartQA-Human. We use accuracy on YN questions and average Precision (P), Recall (R), and Macro-F1 on FR question types. *Using SpaRTUN supervision for further training. Table 6.6 demonstrates the results of models on SpartQA-Human with a controlled environment setting. As can be seen, our proposed pipeline, PistaQ, outperforms the PLMs by a margin of 15% on YN questions, even though the extraction modules, shown in Table 6.4, perform low. This low performance is due to the ambiguity of human language and smaller training data. We also evaluate PistaQ on SpartQA-Human FR questions using Macro_f1 score on all relation types. PistaQ outperforms all other baselines on FR questions, except for BERT*. There are two main reasons behind the inconsistency in performance between YN and FR question types. The first reason is the complexity of the YN questions, which goes beyond the basics of spatial reasoning and is due to using quantifiers (e.g., all circles, any object). While previous studies have demonstrated that PLMs struggle with quantifiers [91], the reasoning module in PistaQ can adeptly handle them without any performance loss. Second, further analysis indicates that PistaQ predicts ‘No’ when a relationship is not extracted, which can be correct when the answer is ‘No’. However, in FR, a missed extraction causes a false negative, which decreases the F1 score. 6.5.2 Results on Real-world Setting We select ReSQ as an SQA dataset with realistic settings and present the result of models on this dataset in Table 6.7. To evaluate PistaQ on ReSQ, we begin by adapting its extraction modules through training on 76 # Models 1 Majority baseline 2 BERT 3 BERT*𝑍𝑒𝑟𝑜_𝑠ℎ𝑜𝑡 4 BERT* 5 GPT-3.5𝑍𝑒𝑟𝑜_𝑠ℎ𝑜𝑡 6 GPT-3.5𝐹𝑒𝑤_𝑠ℎ𝑜𝑡 7 GPT-3.5𝐹𝑒𝑤_𝑠ℎ𝑜𝑡+CoT 8 BERT-EQ 9 BERT-EQ*𝑍𝑒𝑟𝑜_𝑠ℎ𝑜𝑡 10 BERT-EQ* 11 SREQA 12 SREQA*𝑍𝑒𝑟𝑜_𝑠ℎ𝑜𝑡 13 SREQA* 14 PistaQmSpRL 15 PistaQ SpaRTUN+mSpRL 16 Human Accuracy 50.21 57.37 49.18 63.60 60.32 65.90 67.05 56.55 51.96 61.47 53.15 53.32 69.50 41.96 47.21 90.38 Table 6.7 Result on ReSQ. *Further training on SpaRTUN. The 𝑍𝑒𝑟𝑜_𝑠ℎ𝑜𝑡 refers to evaluation without further training on ReSQ or mSpRL training data. the corresponding dataset. We train the SpRL modules on both mSpRL and SpaRTUN, and the performance of these models is presented in Table 6.4. As the mSpRL dataset lacks coreference annotations, we employ the model trained on SpaRTUN for this purpose. Rows 14 and 15 in Table 6.7 show the performance of the PistaQ on ReSQ is inferior compared to other baselines. We analyze the first 25 questions from the ReSQ test set to find the reason. We find that 18 out of 25 (72%) questions required spatial commonsense information and cannot be answered solely based on the given relations in the stories. From the remaining seven questions, only two can be answered using the SpRL annotations provided in the mSpRL dataset (8%). Our analysis shows that the SpRL annotations of mSpRL are not fully practical in PistaQ due to three main reasons: 1. 
No Commonsense: Even though commonsense knowledge is required to answer these examples, this information is out of the scope of the annotations in the mSpRL dataset. 2. Missed annotations: As shown in Figure 6.5, some of the prepositions were not considered spatial concepts by the SpRL annotators but were needed by our reasoning procedure to find the answers, e.g., NTPP(bar, with, chair). 3. No coreference: The coreference annotation is not supported in this dataset; e.g., "L2: it" and "T2: a bench" are the same entity with different mentions, but they are annotated with different ids. These missed coreferences result in fewer connections between entities and fewer inferred relations.
Figure 6.5 An example of the limitations of the mSpRL and coreference annotations for answering ReSQ questions. The answers to the questions were predicted wrongly for two main reasons: first, the missed commonsense knowledge in question 2, and second, the limited coverage of the ground truth annotation in mSpRL in question 1. In the figure, the story is "behind it a bar with chairs and two people, and a bench with one person lying on it. Upper level with doors and a blue rail." Question 1: "Are the people behind the bar?" Answer: Yes; predicted answer based on the mSpRL annotations: No. Question 2: "Is the door above the bar?" Answer: Yes; predicted answer based on the relations in the text: No; predicted answer based on commonsense (the upper level is above the main level): Yes. The mSpRL annotations are the triplets BEHIND(a bar (id: t1), behind, it (id: l1)), BEHIND(a bench (id: t2), behind, it (id: l1)), and EC(one person (id: t3), on, it (id: l2)).
Hence, the low performance of PistaQ is attributed, first, to the absence of integrated commonsense information in this model and, second, to the errors in the extraction modules, which are propagated to the reasoning module. As shown in Table 6.7, the best result on ReSQ is achieved by the SREQA* model. Compared to SREQA, SREQA* is trained on SpaRTUN instead of mSpRL (as mentioned, we use the mSpRL annotations for the ReSQ dataset) in the first step of training. mSpRL lacks some of the SpRL and coreference annotations needed to answer ReSQ questions. In the absence of this information, collecting the supervision for the first phase of training results in a significant number of missed relations. Therefore, as shown in row 11 of Table 6.7, employing mSpRL in the first training phase decreases the performance, while replacing it with SpaRTUN in SREQA* significantly enhances the results. SREQA* surpasses the PLMs trained on QA and QA+SpRL annotations, showcasing the advantage of this model's design in utilizing QA and SpRL data within explicit extraction layers and in the data preprocessing. Also, the better performance of this model compared to PistaQ demonstrates how the end-to-end structure of SREQA can handle the errors from the extraction part and can also capture some rules and commonsense knowledge from the ReSQ training data that are not explicitly supported in the symbolic reasoner.
In conclusion, compared to PLMs, disentangling extraction and reasoning as a pipeline shows superior performance in deterministic spatial reasoning within controlled settings. Moreover, explicitly training the extraction module proves advantageous in leveraging the SpRL annotation more effectively compared to using this annotation in QA format in end-to-end training.
Comparison between disentangling extraction and reasoning as a pipeline and incorporating them within an end-to-end model demonstrates that the end-to-end model performs better in realistic domains, even better than PLMs. The end-to-end architecture of this model effectively enhances generalization in the real-world setting and addresses some of the limitations of rule coverage and commonsense knowledge.
6.5.3 Evaluating Extraction and Reasoning Modules
Here, we discuss each module used in PistaQ and its performance, including the Spatial Role Labeling (SpRL), Coreference Resolution, and spatial reasoner modules.
6.5.3.1 Spatial Role Labeling (SpRL)
The SpRL module is divided into three sub-modules, namely, spatial role extraction (SRole), spatial relation extraction (SRel), and spatial type classification (SType). Since the questions (Q) and stories (S) have different annotations (questions have missing roles), we separately train and test the SRel and SType modules. We only use these modules on sentences that convey spatial information in each benchmark. To measure the performance of the SpRL modules, we use the macro average of the F1 measure over the labels. These modules are evaluated on the three datasets that provide SpRL annotations: mSpRL, SpartQA, and SpaRTUN. When training the SpRL module on the auto-generated benchmarks, we achieved a performance of 100% using only a quarter of the training data; therefore, we stopped further training. As shown in Table 6.4, all SpRL sub-modules achieve high performance on the synthetic datasets, SpartQA and SpaRTUN. This good performance is because these datasets may contain less ambiguity in their natural language expressions. Therefore, the BERT-base models can easily capture the syntactic patterns needed for extracting the roles and direct relations from the large training set.
6.5.3.2 Coreference Resolution (Coref) in Spatial Reasoning
To evaluate the coreference resolution module (Coref in Table 6.4), we compute the accuracy of the pairs predicted as corefs. The Coref model achieves high performance on all datasets. The performance is slightly lower on the SpartQA-Human dataset when SpaRTUN is employed for additional pretraining. However, we observe many errors in the annotations of SpartQA-Human, and the pretrained model is, in fact, making more accurate predictions than what is reflected in the evaluation.
6.5.3.3 Logic-based Spatial Reasoner
Datasets SpartQA-Auto YN FR YN FR YN Q-Type Total A C R 1 13 0 0 3 SpartQA-Human 7 5 4 0 20 10 20 9 35 6 18 38 13 35 29 SpaRTUN
Table 6.8 Analyzing wrong predictions in GT-PistaQ. A: missing/errors in annotation, C: rule-based coreference issues in connecting extracted information, R: shortcomings of the reasoner.
To evaluate the performance of the logic-based reasoner in isolation, we use GT-PistaQ. We look into the errors of this model and categorize them based on the source of error. The categories are missing/wrong ground truth direct annotations (A), rule-based coreference errors (C) in connecting the extracted information before passing it to the reasoner, and the low coverage of spatial concepts in the reasoner (R). As shown in Table 6.8, the spatial reasoner causes no errors for SpaRTUN since the same reasoner has been used to generate it. However, the reasoner does not cover spatial properties of entities (e.g., right edge in "touching right edge") in SpartQA and causes wrong predictions in those cases.
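To make the rule application in Table 6.2 more concrete, the following is a small illustrative forward-chaining sketch of the INVERSE rule and a simplified three-variable COMBINE step applied to the running example of Figures 6.1 and 6.2. The actual reasoner is a Prolog program with full rule coverage and backtracking search; this Python rendering is only an approximation for illustration.

    # Simplified sketch of inverse and combination-style rule application.
    INVERSE = {"FRONT": "BEHIND", "BEHIND": "FRONT",
               "LEFT": "RIGHT", "RIGHT": "LEFT",
               "ABOVE": "BELOW", "BELOW": "ABOVE",
               "NTPP": "NTPPI", "NTPPI": "NTPP"}
    PP_RELATIONS = {"TPP", "NTPP", "TPPI", "NTPPI"}

    def infer(facts):
        """Forward-chain over a set of (relation, x, y) facts until fixpoint."""
        facts = set(facts)
        while True:
            new = set()
            # Inverse rule: R(y, x) => R_reverse(x, y)
            for (r, x, y) in facts:
                if r in INVERSE:
                    new.add((INVERSE[r], y, x))
            # Combination-style rule: *PP(x, z) and directional R(z, h) => R(x, h)
            for (r1, x, z) in facts:
                if r1 in ("TPP", "NTPP"):
                    for (r2, z2, h) in facts:
                        if z2 == z and r2 in INVERSE and r2 not in PP_RELATIONS:
                            new.add((r2, x, h))
            if new <= facts:
                return facts
            facts |= new

    story = {("FRONT", "a grey car", "a grey house"),
             ("NTPPI", "a grey house", "plants")}
    derived = infer(story)
    # BEHIND(plants, a grey car) is derived: NTPPI is inverted to
    # NTPP(plants, a grey house), FRONT is inverted to
    # BEHIND(a grey house, a grey car), and the combination step then
    # propagates BEHIND from the house to the plants inside it.
    assert ("BEHIND", "plants", "a grey car") in derived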
6.5.4 SREQA on All Story Relations
Table 6.9 displays the results of the SREQA model trained and tested solely on the all-story-relation extraction part (step 1). During the evaluation, we apply the same data preprocessing and gather annotations of all relations between the stories' entities, and we select the best model based on performance on the development set.

Dataset                  F1 on SREQA
SpaRTUN                  96.37
SpartQA-Auto             97.78
SpartQA-Human            23.79
mSpRL (used in ReSQ)     16.59
Table 6.9 The results of the SREQA model only trained and tested on all story relations of the SQA datasets.

Notably, the performance on the human-generated datasets, SpartQA-Human and ReSQ, is significantly lower compared to the auto-generated datasets. As discussed earlier, the mSpRL dataset contains missed annotations, resulting in the omission of several relations between the stories' entities and incomplete training data for this phase. Similarly, the SpartQA-Human SpRL annotation also exhibits some noise, particularly in the coreference annotation, leading to issues similar to those observed in mSpRL regarding the annotation of all story relations. Consequently, this reduced performance in all-story-relation extraction impacts the overall performance of the main SREQA model trained using the two steps; however, as illustrated by the results of SREQA* in Table 6.7, which utilizes SpaRTUN instead of mSpRL for training on all story relations, the performance substantially improves on the ReSQ dataset.
6.6 Summary
We investigate the benefits of disentangling the processes of extracting spatial information and reasoning over it [112]. To this end, we devised a series of experiments utilizing PLMs for spatial information extraction coupled with a symbolic reasoner for inferring indirect relations. The outcomes of our experiments provide noteworthy insights: (1) Our observations in controlled experimental conditions demonstrate that disentangling extraction and symbolic reasoning, compared to PLMs, enhances the models' reasoning capabilities, even with comparable or reduced supervision. (2) Despite the acknowledged fragility of symbolic reasoning in real-world domains, our experiments highlight that employing explicit extraction layers and utilizing the same symbolic reasoner in data preprocessing enhances the reasoning capabilities of models. These results emphasize the advantage of disentangling extraction and reasoning in spatial language understanding.
CHAPTER 7
EXPLORING SPATIAL REASONING WITH LARGE LANGUAGE MODELS
7.1 Introduction
Recent research highlights the high performance of Large Language Models (LLMs) on numerous tasks using Zero_shot [5] or Few_shot prompting, a.k.a. in-context learning, methods [142]. Zero_shot prompting refers to the ability of a model to perform a task without any task-specific training examples, relying solely on its pretrained knowledge. On the other hand, Few_shot prompting involves providing the model with a small number of task-specific examples to help it adapt to the task and improve its performance. These prompting methods demonstrate the LLMs' ability to generalize and adapt to various problems with minimal task-specific information [10, 8] and to perform well even without extensive training data for specific tasks. However, despite their success in many domains, LLMs have limitations on multi-hop reasoning tasks [41, 131].
For instance, [5] indicates that ChatGPT (GPT-3.5-Turbo) struggles to perform well on the SQA (multi-hop Spatial Question Answering) task when evaluated using Zero_shot settings on human-generated test cases. This highlights the challenges LLMs face in reasoning tasks that require more than just pattern recognition and necessitate a deeper understanding of the context and relationships within the data. To address these challenges, new prompting techniques such as Chain-of-Thought (CoT) [142] or Least-to-Most prompting [159] have been introduced. These promptings involve instructing the LLM to generate a step-by-step solution or explanation for a given problem before arriving at the final answer, encouraging the model to think more systematically. By leveraging this method, LLMs can overcome some of the limitations observed in Few_shot evaluations and achieve better performance on complex multi-hop reasoning tasks. In this chapter, we first evaluate LLMs on SQA tasks using Few_shot and Chain-of-Thought (CoT) prompting strategies and compare them with Zero_shot evaluation. Afterward, we introduce an approach in multi-hop reasoning by leveraging the superior performance of LLMs in information extraction tasks [137, 135, 146]. Specifically, we employ an LLM as the extraction module within the proposed pipeline model of extraction and symbolic reasoning, detailed in Chapter 6. As 83 discussed in Section 6.5.1, the efficacy of this pipeline is prominent when the extraction module functions optimally. Hence, our objective in this experiment is to evaluate the impact of LLMs as a potent extraction module within this framework. Figure 7.1 An example of the GPT-3.5 output for a pair of entities. GPT-3.5 output the first one with the highest probability as the final output. When incorporating language models for information extraction within the extraction and reasoning pipeline, we disregard the inherent uncertainty in their predictions and instead select the predictions based on the highest probability (uncertainty score). However, incorporating this uncertainty can enable a more informed and context-aware utilization of language model outputs. Hence, in another line of experiments, we aim to evaluate LLMs in the same extraction and symbolic reasoning pipeline while employing the uncertainty of their prediction in probabilistic logical reasoning rather than a deterministic method. In probabilistic reasoning, first, we select the top five spatial relation predictions of LLM, along with corresponding probability values, for each pair of entities (e.g., Figure 7.1). Next, all this pertinent information is conveyed to a probabilistic reasoner to address the SQA task. As discussed in Section 6.5.2, another challenge associated with using symbolic reasoning in the pipeline model is its weakness in covering all existing spatial rules, including real-world commonsense knowledge, e.g., the concept that a roof is always above a house. Thus, in our last experiment, we capitalize on the comprehensive commonsense knowledge of LLMs about general space. In this experiment, we utilize LLMs to extract spatial commonsense knowledge that may not be explicitly present in the text. This information is passed to a probabilistic spatial reasoner 84 alongside the direct information to solve the SQA task. This experiment highlights the potential of LLMs to provide a more holistic understanding of the spatial world used in the pipeline of extraction and reasoning. 
The outcomes of our experimental evaluation demonstrate the following insights:

• LLMs exhibit better spatial reasoning performance when utilizing Few_shot or CoT prompting compared to the Zero_shot evaluation. However, their performance still falls short of human performance on this task.

• Our case studies and the experimental results of concurrent research in controlled SQA environments reveal the impact of LLMs as extraction modules in providing precise information within the pipeline of extraction and reasoning. This pipeline model outperforms SOTA baselines, including LLMs with in-context learning. However, our exploration of real-world SQA tasks unveils a different outcome. The complexity inherent in spatial language hinders the proficiency of LLMs in extracting spatial information from real-world text. Hence, on the real-world SQA task, using LLMs in the pipeline model demonstrates inferior performance compared to end-to-end LLMs. Despite this observation, our experiments provide valuable insights:

1. Employing an LLM as the extraction module in this pipeline provides more precise information for symbolic spatial reasoning compared to BERT-based modules.

2. Utilizing the LLM's probabilistic outputs in probabilistic logical reasoning leads to better results than deterministic reasoning.

3. Leveraging LLMs' commonsense knowledge can enhance the pipeline model's performance by covering some rules and relations that the designed symbolic reasoning module does not.

7.2 Large Language Models (LLMs)

In this study, we select GPT-3.5 as a representative LLM and use the framework developed by OpenAI as the foundation for our experiments. GPT-3.5, with 175 billion parameters, is a state-of-the-art LLM that offers several advantages over other available open-source LLMs. GPT-3.5 outperforms LLaMA and PaLM on the MMLU and GSM8K benchmarks, as shown in Figure 7.2 [129]. This suggests that GPT-3.5 has a slight edge over LLaMA and PaLM on question answering tasks. GPT4 already surpasses all of these results, but we limited our usage of GPT4 to a small set of test examples due to budgetary constraints.

Figure 7.2 Comparison of different LLMs on natural language processing benchmarks.

One of the primary benefits of GPT-3.5 is its extensive knowledge base, which encompasses a wide range of topics and domains. This extensive knowledge allows the model to provide more accurate and contextually relevant responses even with Few_shot prompting, making it particularly effective for our tasks.

7.2.1 OpenAI API

The OpenAI API provides an interface for interacting with OpenAI's Large Language Models, including GPT-3.5 and GPT-4. To use this API, we send a request to the API endpoint, passing the prompt and any desired parameters in the request body. The API returns a JSON response containing the model's output. The following is the list of parameters we employed to customize the OpenAI framework in our experiments.

1. Prompt: This parameter specifies the input prompt for the model. It can be a string or a list of strings, and it sets the context for the model's response.

2. Temperature: This parameter controls the randomness of the model's output. A higher value (e.g., 0.8) makes the output more random and creative, while a lower value (e.g., 0.2) makes it more deterministic and focused.

3. Max_tokens: This parameter sets the maximum length of the generated text to prevent overly long outputs.
4. Log_probs: This parameter specifies the number of most likely tokens for which the model should return the log probabilities.

5. Top_k: This parameter selects the top-k most probable tokens. It controls the diversity of the output by restricting the set of tokens considered during generation.

We use GPT-3.5 davinci-text-003 (from https://beta.openai.com) in our experiments evaluating LLMs on the spatial reasoning task. The cost of running this model on the human-generated benchmarks is $0.002 per 1K tokens. To use GPT-3.5 for information extraction, we use turbo-text-instruct (a.k.a. ChatGPT) at a cost of $0.0001 per 1K tokens. We also use the GPT4 playground in our experiments.

7.2.2 Prompt Engineering

Selecting and providing pertinent prompts is crucial, as the prompt can significantly influence the model's performance. We employ various prompting techniques in our experiments, including Zero_shot, Few_shot, and Chain-of-Thought (CoT) prompting, to evaluate the model's performance under different conditions. In the Zero_shot setting, the prompt given to the model is formatted as "Context: story. Question: question?" and the model returns the answer to the question. In the Few_shot setting, we add two random examples from the training data containing the story, all of its questions, and their answers. Figure 7.3 depicts a prompt example for SpartQA-Human YN questions passed to GPT-3.5. In Few_shot+CoT, we follow the same idea as [142] and manually write the reasoning steps for eight questions (from two random stories). The input is then formatted as "Context: story. Question: CoT. Answer. Asked Context: story. Question: question?". Figure 7.4 shows an example of these reasoning steps for the ReSQ dataset.

Figure 7.3 Example of the input for GPT-3.5 (Few_shot). The Zero_shot setting is the same but without the in-context training examples.

7.2.3 Engineering Over the LLM Output

Output engineering is essential for extracting meaningful information from the LLM's output. In our experiments, we implement various output processing techniques, such as filtering and ranking over the model's predictions. This step transforms the raw string output of the LLM into structured data that can be used for final evaluation or as input to other modules. Given that LLMs excel at imitating observed formats, we instructed them to generate predictions in a structured format specified in the prompt (e.g., as illustrated in Figure 7.5). This approach significantly simplified the engineering process over the string output of the LLM.

7.3 Probabilistic Spatial Reasoning

As discussed in Chapter 6, one of the primary reasons for the lower performance of the PistaQ model on real-world data is the error propagation from the information extraction modules to the reasoning module. One potential solution to this issue is to pass the uncertain information, including the probabilities of the relations extracted by the LLM, to a probabilistic reasoner instead of a deterministic logical reasoner. Employing deterministic reasoning has two primary drawbacks. First, the probability of the predicted token with the highest probability may not be significantly high; for example, in Figure 7.1, the probability of "covers" is 0.219, while the probability of the second-best token, "near", is 0.17. Ultimately, the model chooses "covers" as the final output even though "near" has a close probability. Second, the next-best predictions might also offer valuable insights.
Figure 7.4 Example of the input for GPT-3.5 (Few_shot+CoT) with human-written Chain-of-Thoughts.

For example, in Figure 7.1, if we inquire whether the "pillow" is near the "beds" and consider only the top-1 token prediction, the probability of "near" is 0 and the final answer is "No". However, if we include the top-2 prediction, the probability of "near" becomes 0.17, resulting in a "Yes" answer. Hence, employing the top-k (here we select k = 5) predictions of the LLM, together with their probability values, helps solve the SQA task more effectively.

To implement the probabilistic logical reasoner, we leverage probabilistic Prolog, known as Problog [17]. Problog is a probabilistic logic programming language that extends traditional logic programming with the ability to reason about uncertainty. It allows users to write programs that define relationships between events or conditions and assign probabilities to those relationships. In Problog, probabilities can be assigned to logical facts and rules, and queries can be made about the probability of different events or conditions given a set of evidence or assumptions. Figures 7.6 and 7.7 show examples of predictions with and without probability computation. The 0.8 and 0.6 are the predicted probabilities of the spatial relation extraction, and the rules come directly from the spatial reasoner proposed in Chapter 4. Problog computes the probability of a query based on all possible ways the query can be true. As a result, above(a, c) can be True with probability 0.48.

Figure 7.5 The example of prompts used for LLMs (GPT-3.5-Turbo) in information extraction.
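A minimal sketch of this computation, run through the problog Python package (an assumption on our side; the thesis only states that Problog is used). The two weighted facts and the transitivity rule below are consistent with the 0.48 result quoted above, since 0.8 * 0.6 = 0.48.

from problog.program import PrologString
from problog import get_evaluatable

# Two uncertain facts produced by the extraction step and one rule of the reasoner.
model = """
0.8::above(a, b).
0.6::above(b, c).
above(X, Z) :- above(X, Y), above(Y, Z).
query(above(a, c)).
"""

# Problog sums over all proofs of the query; here there is a single proof
# that uses both facts, so the query probability is 0.8 * 0.6 = 0.48.
result = get_evaluatable().create_from(PrologString(model)).evaluate()
print(result)  # {above(a,c): 0.48}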
7.3.1 Extracting Probabilistic Facts from the LLM

To extract the probabilities of tokens from GPT-3.5, we use the "log_probs" parameter of the API, which specifies the number of most likely tokens for which the model returns the log probabilities.

Figure 7.6 The result of Problog without any probabilistic prediction.

Figure 7.7 The result of Problog with probabilistic facts.
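As a hedged sketch, the call below shows how this looks with the legacy OpenAI Completions endpoint (openai-python v0.x style); the exact field names can differ across API versions, the prompt variable is a placeholder, and the model identifier follows the naming used in this chapter.

import math
import openai  # requires openai.api_key to be set

extraction_prompt = "..."  # a relation-extraction prompt in the style of Figure 7.5

response = openai.Completion.create(
    model="text-davinci-003",   # the davinci completion model referred to above
    prompt=extraction_prompt,
    temperature=0,              # deterministic decoding for extraction
    max_tokens=64,
    logprobs=5,                 # return the five most likely tokens per position
)

choice = response["choices"][0]
tokens = choice["logprobs"]["tokens"]              # generated tokens, used to locate the relation token
top_logprobs = choice["logprobs"]["top_logprobs"]  # one {token: logprob} dict per position

# Convert the alternatives at a given position (e.g., the relation token that
# appears inside brackets, cf. Figure 7.9) into plain probabilities.
def alternatives(position):
    return {tok.strip(): math.exp(lp) for tok, lp in top_logprobs[position].items()}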
Following this, from the "log_probs" output of the LLM, we identify alternative predictions and their associated probabilities for the relation token, as shown in Figure 7.8. In addition, to identify the relation token in the LLM output, a specific pattern is defined within the prompt of GPT-3.5; for instance, as shown in Figure 7.9, the relation tokens always appear inside brackets.

Figure 7.8 Example of the probabilities of the top-5 tokens extracted from GPT-3.5.

Figure 7.9 Example of the pattern repeated in the prompt of the LLM. The relation tokens are always inside brackets ('[ ]'), which helps locate them in the output.

7.4 Experiments and Results

In this section, we evaluate GPT-3.5 on two human-generated SQA benchmarks, SpartQA-Human and ReSQ. We compare the performance of GPT-3.5 against a suite of models including BERT, BERT* (BERT further fine-tuned on SpaRTUN supervision), PistaQ, and SREQA, described in Chapter 6. Both BERT and BERT* serve as candidate PLMs in which the input "question + story" is passed to BERT with answer classification layers. The PistaQ model is a pipeline that disentangles extraction and symbolic reasoning, while SREQA is an end-to-end neural model featuring explicit layers for extraction and reasoning. Additionally, we assess the performance of using an LLM as the extraction module within PistaQ, referred to as LLM_PistaQ. When the probabilistic output of the LLM is integrated into this model, we call it LLM_PistaQ_prob.

7.4.1 LLM with Few_shot Prompting

Recent research shows the high performance of LLMs with Zero_shot and Few_shot settings on many tasks [10, 8]. However, [5] shows that ChatGPT (GPT-3.5-Turbo) with Zero_shot evaluation cannot perform well on the SpartQA-Human test cases. Similarly, our experiments, shown in Tables 6.6 and 6.7, indicate the lower performance of GPT-3.5 (davinci) with Zero_shot and Few_shot settings compared to humans and to our models PistaQ and SREQA. Figure 7.10 shows the predictions of three LLMs, GPT-3.5, GPT4, and PaLM2, on a SpartQA-Human example. Although [142] shows that CoT prompting improves the performance of PaLM on multi-step reasoning tasks, its spatial reasoning capability still does not meet expectations.

7.4.1.1 LLMs as Extraction Module in PistaQ

Recent studies [117, 137] show that LLMs have promising performance in information retrieval. Building upon this, we first employ GPT-3.5 with Few_shot prompting to extract information from a set of SpartQA-Human and ReSQ examples. The extraction process covers entity, relation, relation type, and coreference extraction from the story, as well as entity and relation extraction from the question. Additionally, the LLM is employed to identify mentions of question entities within the text. We construct multiple manually crafted prompt examples for each extraction task, as depicted in Figure 7.5. The extracted information is subsequently utilized within the PistaQ framework, called LLM_PistaQ. We conduct a case study, illustrated in the last row of Figure 7.11, which highlights how combining LLM extraction and symbolic reasoning enables answering questions that LLMs alone struggle to address. Furthermore, Figure 7.12 provides a comparison between the trained BERT-based spatial information extraction modules and GPT-3.5 with Few_shot prompting in PistaQ.
GPT-3.5 extracts more accurate information, leading to a correct answer on a question that the end-to-end LLMs cannot solve. Furthermore, a concurrent study by Yang et al. (2023) [149] showcases the efficacy of LLMs as extraction modules coupled with symbolic reasoning within the controlled environment of the StepGame dataset. As depicted in Figure 7.13, the LLM excels in the extraction process when the synthetic context follows a fixed pattern (although, since the StepGame dataset lacks explicit SpRL supervision, the performance of this model on the extraction task itself remains unassessed). Consequently, the disentangled model, which combines GPT-3 for extraction with symbolic reasoning using Answer Set Programming (ASP), demonstrates superior performance compared to end-to-end models such as BERT* (SynSup) or GPT-3.5+CoT.

Figure 7.10 An example of three LLMs, GPT-3.5 (davinci-text-003), GPT4, and PaLM2, evaluated on the SQA task with Zero_shot and CoT prompting.

In this experiment, we aim to evaluate the efficacy of LLM_PistaQ on the real-world SQA benchmark, ReSQ. Spatial information (SpRL) in real-world data is significantly more complicated than the fixed patterns of synthetic data, as it involves the ambiguity and intricacy of natural language.
Figure 7.11 An example of Large Language Model (LLM) predictions on the SQA task evaluated with CoT prompting. The last row shows an example of using GPT-3.5-Turbo for information extraction in LLM_PistaQ.
Figure 7.12 An example of using BERT-based SpRL and GPT-3.5 as information extraction in PistaQ on a ReSQ example.

For example, correctly extracting the relation (a large window, on the left, a room) from the sentence "a room with walls made of wood and a large window with white curtains on the left" is challenging even for potent LLMs. We first evaluate the performance of GPT-3.5 on information extraction within the context of ReSQ. Since ReSQ derives its textual context from the mSpRL dataset, which includes spatial information annotations, we evaluate GPT-3.5 on this dataset. Table 7.1 shows GPT-3.5's performance on entity extraction. The results reveal a higher Recall for GPT-3.5 compared to the other models, while the Precision is considerably lower.
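The entity and relation scores in Tables 7.1 and 7.2 are set-overlap metrics over the predicted versus annotated items. The sketch below illustrates only the simplest exact-match variant over relation triplets (the thesis reports macro-averaged scores, which are computed per relation type, so the helper name and matching criterion here are illustrative assumptions). The example reuses the merged "lamps and windows" span discussed with Figure 7.14, which counts as an error under exact matching.

def precision_recall_f1(predicted, gold):
    # Exact-match overlap between (trajector, indicator, landmark) triplets.
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [("lamps", "on", "the ground level"), ("windows", "on", "the ground level")]
pred = [("lamps and windows", "on", "the ground level")]  # entities not separated
print(precision_recall_f1(pred, gold))  # (0.0, 0.0, 0.0): penalized despite being nearly right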
This discrepancy between Recall and Precision is attributed to the numerous false-positive predictions that are in fact correctly made by GPT-3.5 but are not covered by the mSpRL annotations (Figure 6.5 shows some of these missed annotations in mSpRL). Hence, GPT-3.5 can extract information that is not even annotated in mSpRL but is required to solve ReSQ questions.

Figure 7.13 Results from [149] on StepGame with different numbers of reasoning steps (k). GPT-3+ASP is a model that disentangles extraction and reasoning: GPT-3 is used for extraction, and ASP is the symbolic reasoner. SynSup corresponds to BERT* here.

Entity Extraction (mSpRL)    Recall    Precision    Macro-F1
R-Inf [78]                   71.99     75.24        73.56
BERT-Based                   88.52     75.89        81.72
BERT-Based (SpaRTUN)         89.52     74.65        81.06
GPT-3.5-based                91.63     49.72        64.47

Table 7.1 The result of models on the entity extraction task. Since the annotation in mSpRL does not contain all information, Recall is considered the most crucial metric.

Table 7.2 compares GPT-3.5 and the other baselines on the spatial relation extraction task. In this table, GPT-3.5 exhibits inferior performance compared to the other baselines; however, it is essential to acknowledge that this result may not precisely reflect the true capabilities of GPT-3.5. Figure 7.14 presents an example of the complexity of evaluating GPT-3.5's extraction performance based on the mSpRL annotations. First, GPT-3.5 predicts more correct spatial relations and relation types than those annotated in mSpRL. Second, the model correctly identifies the first two relations in mSpRL; however, due to errors in entity extraction, where "lamps and windows" are not separated into two distinct entities, this prediction is labeled as incorrect. Thus, evaluating GPT-3.5's relation extraction performance against the mSpRL annotations may not be a fair comparison.

Relation Extraction (mSpRL)  Recall    Precision    Macro-F1
R-Inf [78]                   68.25     70.23        69.23
BERT-Based                   81.31     60.11        69.12
BERT-Based (SpaRTUN)         89.13     59.32        71.23
GPT-3.5-based                34.09     16.48        22.22

Table 7.2 The result of models on the relation extraction task.

Figure 7.14 An example of relation extraction using GPT-3.5.

We therefore compare these models based on their performance within the pipeline framework. As depicted in Table 7.3, LLM_PistaQ surpasses PistaQ, which employs BERT-based extraction modules. This outcome highlights the high capability of LLMs in information extraction compared to BERT-based modules. However, LLM_PistaQ still underperforms the end-to-end language models. This performance gap is primarily attributed to 1) the inherent complexity of the extraction task in real-world environments, 2) errors in the output engineering used to structure the string outputs generated by GPT-3.5, and 3) the limitations of the designed symbolic reasoner in covering real-world commonsense rules. It is noteworthy that, as discussed in Chapter 6, out of 25 sampled questions from ReSQ, only seven can be solved without relying on spatial commonsense information. Our analysis shows that LLM_PistaQ can solve these seven questions.

In further exploration, we aim to integrate LLMs as neural spatial reasoners within a pipeline structure of extraction and reasoning. The process, illustrated in Figure 7.15, involves augmenting
the prompt of GPT-3.5 with both the extracted information from the LLM and the composed Chain-of-Thoughts (CoTs) based on this information. Contrary to expectations, the obtained result (62.62%) is even lower than that of GPT-3.5-CoT with the original text (67.05%) when evaluated on the ReSQ dataset. This outcome emphasizes that, without further fine-tuning, LLMs comprehend information expressed in natural language better than the same information presented as structured data.

Figure 7.15 We employ LLMs in both the extraction and reasoning tasks, but in a disentangled manner. We first extract information using LLMs and subsequently incorporate this extracted information into the prompt alongside written CoTs based on the extracted data.

7.4.2 LLM with Probabilistic Logical Reasoning

As illustrated in Figures 7.1 and 7.8, considering the top-5 relation tokens from the LLM prediction can yield more insight and provide additional information for each entity pair. In this experiment, we integrate the probabilities of the information extracted by GPT-3.5 with the probabilistic symbolic reasoner in the PistaQ pipeline and call the resulting model LLM_PistaQ_prob. We select the top-5 relation predictions made by GPT-3.5, along with their corresponding probabilities, for each pair of entities with a direct relation in the text. Subsequently, the extracted relations are fed into the probabilistic logical reasoner.
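As a minimal sketch (the helper name is hypothetical), the top-5 predictions for one entity pair can be serialized as weighted Problog facts before being handed to the reasoner. The probabilities in the example are the four values shown for the phone/night-table pair in Figure 7.16.

def to_problog_facts(predictions):
    """predictions: {(trajector, landmark): {relation: probability, ...}, ...}"""
    lines = []
    for (trajector, landmark), topk in predictions.items():
        for relation, probability in topk.items():
            lines.append(f"{probability:.3f}::{relation}('{trajector}', '{landmark}').")
    return "\n".join(lines)

topk_predictions = {("phone", "the right table"): {
    "right": 0.293, "above": 0.208, "front": 0.195, "below": 0.015}}
print(to_problog_facts(topk_predictions))
# 0.293::right('phone', 'the right table').
# 0.208::above('phone', 'the right table').  ... and so on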
The results of this evaluation are presented in Table 7.3.

#    Models                                Accuracy
1    Majority baseline                     50.21
2    BERT                                  57.37
3    BERT* (Zero_shot)                     49.18
4    BERT*                                 63.60
5    GPT-3.5 (Zero_shot)                   60.32
6    GPT-3.5 (Few_shot)                    65.90
7    GPT-3.5 (Few_shot + CoT)              67.05
8    SREQA                                 53.15
9    SREQA* (Zero_shot)                    53.32
10   SREQA*                                69.50
11   PistaQ (BERT)                         47.21
12   LLM_PistaQ                            50.01
13   LLM_PistaQ + Commonsense              51.48
14   LLM_PistaQ_prob                       52.12
15   LLM_PistaQ_prob (Only Commonsense)    49.50
16   LLM_PistaQ_prob + Commonsense         54.76
17   Human                                 90.38

Table 7.3 Results on ReSQ. * denotes further training on SpaRTUN. Zero_shot refers to evaluation without further training on the ReSQ or mSpRL training data.

Figure 7.16 Comparing probabilistic and deterministic reasoning on an example from ReSQ. The threshold used to generate the Yes/No answers is 0.1, obtained by trial and error.

Comparing lines 12 and 14 of Table 7.3 shows that when we consider the uncertainty of the extracted relations (facts) in the probabilistic reasoning of LLM_PistaQ_prob, the outcomes are superior to those obtained when only the single most likely facts are used in the deterministic reasoning of LLM_PistaQ. Figure 7.16 illustrates an example of this comparison. The 'above' and 'below' relations are missed in the deterministic fact extraction, while in the probabilistic one, the second- and fourth-ranked relations are 'above' and 'below'. As a result, LLM_PistaQ_prob correctly predicts the 'Yes' answer. These results highlight the critical insight that the prediction with the highest probability is not necessarily the most accurate one. Moreover, considering the top-5 predictions allows multiple potential relations between entities to be taken into account, providing a more comprehensive and nuanced understanding of the relationships within the data.
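A sketch of the Figure 7.16 comparison in the same problog setup as before; the inverse rule and the 0.1 decision threshold are taken from the description and caption above, while the predicate and entity names are simplified stand-ins (the full rule set comes from the Chapter 4 reasoner).

from problog.program import PrologString
from problog import get_evaluatable

# Probabilistic facts for the phone / night-table pair from Figure 7.16.
# A deterministic pipeline would instead keep only 1.0::right(phone, table).
program = """
0.293::right(phone, table).
0.208::above(phone, table).
0.195::front(phone, table).
0.015::below(phone, table).

below(X, Y) :- above(Y, X).   % inverse rule: if Y is above X, then X is below Y
query(below(table, phone)).
"""

result = get_evaluatable().create_from(PrologString(program)).evaluate()
# below(table, phone) succeeds with probability 0.208 via the inverse rule;
# with the 0.1 threshold from the Figure 7.16 caption, the answer becomes "Yes",
# whereas the deterministic variant returns probability 0.0 and answers "No".
print(result)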
Despite the demonstrated improvement of LLM_PistaQ_prob over LLM_PistaQ when employing probabilistic reasoning, it is noteworthy that the performance of this model still falls short of that achieved by the other language models and the SREQA models. While the transparent structure of the PistaQ framework facilitates understanding of the prediction process, interpreting the superior performance of language models on this task remains challenging due to their black-box nature. This higher performance may be attributed to better spatial and commonsense reasoning capabilities of LLMs, or it may simply result from random predictions on Yes/No questions.

7.4.3 LLM as Spatial Commonsense Knowledge Extractor

One advantage of using LLMs as end-to-end models, compared to LLM_PistaQ, is their notable commonsense reasoning capability when generating responses for ReSQ examples. This capability poses a considerable challenge for the PistaQ and LLM_PistaQ_prob models, which rely on explicit information in the text and cover a limited set of spatial rules. We propose a novel methodology to leverage the commonsense knowledge embedded in LLMs within our model. In this approach, we provide tuples of all entities present in the text (extracted using LLMs), without any context, as input to the LLM and ask for spatial commonsense relations. By excluding the main story text, the LLM returns the most common relation between each pair of entities based on its generalized knowledge of space. These relations are then incorporated, together with the direct facts from the text, into the probabilistic spatial reasoner.

Comparing the outcomes presented in lines 14 and 16 of Table 7.3, we observe an enhancement in the performance of LLM_PistaQ_prob when commonsense knowledge is utilized. Figure 7.17 depicts an example highlighting the benefits of using commonsense knowledge in probabilistic reasoning. The relation between 'the balcony' and 'the coop' is not stated in any sentence of the story. However, based on the common understanding that balconies are typically on upper floors while coops are on the ground, it can be inferred that a balcony should be above a coop. This commonsense knowledge is reflected in the LLM output with a probability of 0.48. This information is then utilized in the probabilistic reasoning, resulting in a "Yes" answer.

Figure 7.17 An illustration of the advantages of employing commonsense knowledge in probabilistic reasoning. Without the extra commonsense knowledge, the question cannot be answered.

Furthermore, an additional experiment assesses the impact of incorporating commonsense knowledge into deterministic reasoning. As illustrated in lines 12 and 13 of Table 7.3, integrating commonsense knowledge deterministically also improves results. However, employing commonsense knowledge in probabilistic reasoning surpasses deterministic reasoning by a 3.3% margin, highlighting the added value of probabilistic reasoning in this context. Similar to previous experiments, incorporating such knowledge from GPT-3.5 into the pipeline model cannot outperform the GPT-3.5 or GPT-3.5+CoT models. While this method increases the amount of information extracted compared to the primary PistaQ model, the remaining extraction errors prevent it from surpassing the end-to-end LLMs.
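A hedged sketch of this commonsense-querying step; the prompt wording, helper name, and one-word answer format are illustrative assumptions rather than the exact prompts used in our experiments, and the call again uses the legacy Completions endpoint.

import math
import openai  # requires openai.api_key to be set

def commonsense_relation(entity_a, entity_b):
    """Ask the LLM for the typical spatial relation between two entities, with no story context."""
    prompt = (
        "Without any other context, what is the most common spatial relation between "
        f"'{entity_a}' and '{entity_b}'? Answer with a single word such as "
        "above, below, front, behind, near, or inside."
    )
    response = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, temperature=0, max_tokens=3, logprobs=5
    )
    top = response["choices"][0]["logprobs"]["top_logprobs"][0]  # alternatives for the first token
    return {token.strip().lower(): math.exp(lp) for token, lp in top.items()}

# e.g., commonsense_relation("a balcony", "a coop") may yield {"above": 0.48, ...},
# which is then added as a weighted fact alongside the relations extracted from the story.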
7.5 Summary

In conclusion, our research has elucidated several key findings regarding the role of LLMs in spatial reasoning tasks. We have shown that even though Chain-of-Thought prompting can enhance the performance of LLMs on the SQA task, they still have shortcomings and remain far behind human performance. Our case studies show that, despite the inherent limitations of LLMs in spatial reasoning, capitalizing on their potential for information extraction within a disentangled structure of extraction and reasoning can yield some benefits. However, our experimental results in a real-world environment indicate that this high performance is largely confined to controlled environments with fixed patterns of spatial descriptions and is difficult to achieve in real-world scenarios. GPT-3.5 exhibits limitations in extracting complex spatial information from natural language, which directly affects the symbolic reasoner and results in performance lower than that of end-to-end LLMs.

In addition, we propose approaches to enhance the performance of the pipeline model with an LLM as the extraction module. Our findings demonstrate that leveraging probabilistic outputs from LLMs in probabilistic logical reasoning produces superior results compared to deterministic reasoning. This emphasizes the advantage of considering the uncertainty of language models' outputs in reasoning tasks. Furthermore, incorporating LLMs' commonsense knowledge significantly improves the pipeline model's performance by encapsulating some commonsense rules and relations that the symbolic reasoning module may not cover. However, in the end, due to errors from the extraction modules, the pipeline model still underperforms the end-to-end language models in realistic domains.

CHAPTER 8
CONCLUSION AND FUTURE DIRECTIONS

This chapter summarizes the research presented in this thesis and emphasizes the outcomes of each contribution. Furthermore, we outline some interesting future directions that build on the foundation of this work.

8.1 Summary of Contributions

In the first contribution, this thesis addresses the lack of evaluation benchmarks for multi-hop spatial reasoning by proposing three novel textual corpora for spatial question answering (SQA). These new benchmarks represent a significant contribution to the field, enabling more robust evaluation and comparison of different models on spatial reasoning tasks. The first benchmark, SpartQA-Human, offers a human-generated dataset of spatial questions and answers based on controlled visual scenes.
This contribution allows researchers to evaluate language models using data that reflects human perception and interpretation of spatial relationships. The second benchmark, SpartQA-Auto, is an automatically generated corpus using hand-crafted grammar and spatial reasoning rules. This contribution provides a large dataset for training and evaluating models, which can be especially useful for early-stage experiments and simulations. The third benchmark, ReSQ, is a realistic human-curated SQA dataset derived from spatial role labeling (SpRL). This contribution extends existing benchmarks with spatial semantic annotations, offering a unique resource for evaluating language models in real-world contexts. Through extensive experiments, we demonstrated that state-of-the-art pre-trained language models struggle with spatial reasoning tasks on human-generated text, emphasizing the need for improved models. Our findings also highlighted that perfect results on auto-generated data might indicate overfitting due to limited vocabulary and grammatical structures, underscoring the importance of diverse and realistic benchmarks.

The second main contribution is enhancing the spatial reasoning capability of neural models through transfer learning. By creating a new synthetic dataset, SpaRTUN, which covers a broad range of spatial relation types, combinations, and spatial language expressions, we provide a comprehensive source of supervision for pretraining and transfer learning. This dataset allows for extensive pretraining of language models (LMs), leading to significantly improved performance on spatial question answering and spatial role labeling tasks. Moreover, our experiments show that language models pretrained with SpaRTUN exhibit superior results on several external datasets, indicating the robustness of our approach. The enhanced spatial reasoning in these models suggests that SpaRTUN offers a richer source of supervision for pretraining due to its expanded coverage of spatial relation types and combinations compared to other synthetic data sources. This broader scope contributes to the field by providing a more effective foundation for developing spatially-aware neural models.

As our third main contribution, we investigate the benefits of disentangling the processes of extracting spatial information and reasoning over it. To this end, we devised a series of experiments utilizing PLMs for spatial information extraction coupled with a symbolic reasoner for inferring indirect relations. The outcomes of our experiments provide noteworthy insights: (1) Our observations in controlled experimental conditions demonstrate that disentangling extraction and symbolic reasoning, compared to PLMs, enhances the models' reasoning capabilities, even with comparable or reduced supervision. (2) Despite the acknowledged fragility of symbolic reasoning in real-world domains, our experiments highlight that employing explicit extraction layers and utilizing the same symbolic reasoner in data preprocessing enhances models' reasoning capabilities. These outcomes highlight the importance of distinguishing the processes of extraction and reasoning, offering a pathway toward building more robust models for spatial language understanding. Our results suggest that this approach can lead to better-performing models in various applications involving complex spatial reasoning tasks.
As our fourth and last contribution, our investigation has uncovered insights into the function and efficacy of Large Language Models (LLMs) in spatial reasoning tasks. We demonstrate that LLMs, even those employing Few_shot or Chain-of-Thought prompting techniques, still face challenges in solving spatial reasoning tasks. Our case studies and experiments reveal that while LLMs face challenges in spatial reasoning, harnessing their strengths in information extraction within a pipeline of extraction and reasoning can lead to improvements, especially in controlled environments. However, in a real-world setting, this approach falls short of outperforming end-to-end language models due to the limitations of GPT-3.5 in extracting complicated spatial information from natural language. Moreover, our experimental outcomes of using an LLM in this pipeline model indicate that employing the probabilistic outputs from LLMs in probabilistic logical reasoning consistently outperforms deterministic reasoning. This result highlights the benefits of incorporating uncertainties in reasoning tasks. Finally, using commonsense knowledge from LLMs, which capture commonsense rules and relations, enhances the performance of the pipeline model. Even though these methods enhance knowledge acquisition from the context during the extraction phase, none can fully mitigate the extraction errors of the LLMs. Hence, the pipeline model still underperforms the end-to-end language models.

8.2 Future Directions

In this section, we outline several promising directions for future research that build upon the findings and methodologies established in this thesis. While our work has made significant contributions to the field of spatial reasoning, there is ample scope for extending and refining our approach to explore new areas and address potential limitations. Below we detail potential avenues for future work.

• Methods For Prompt Engineering: In the pipeline of extraction and reasoning, one of the shortcomings of employing LLMs lies in their subpar performance in extracting spatial relations from text. A hypothesis for addressing this issue is refining prompt engineering, potentially improving the in-context learning of LLMs for spatial information tasks. Investigating automated methods, with an LLM in the loop, to discover optimal prompts for this specific task could be an intriguing avenue for future research. For instance, we can maintain a collection of prompts and select the optimal prompt based on the test example to enhance performance.

• Exploring Other Reasoning Types: In this thesis, we focus on spatial reasoning, which involves both language understanding and multi-step reasoning. An extension would be to apply our approaches to other types of reasoning that require logical rules, such as temporal reasoning and arithmetic reasoning, to provide valuable insights into the versatility and effectiveness of these methods. For instance, the rules over RCC-8 relations, which are designed into our rule-based spatial reasoner, can also be practical in temporal reasoning. Consequently, incorporating an off-the-shelf model for temporal information extraction within the context of a disentangled extraction and reasoning framework could be beneficial.

• Expanding to Other Languages: The current work primarily targets the English language, but the methodology can easily be adapted to other languages that share similar grammatical structures.
An intriguing avenue for future work is applying our approach to different languages and assessing the benefits of disentangling extraction and reasoning. This would require developing customized extraction modules while utilizing the same spatial reasoner in our designed pipeline model.

• Optimizing LLMs for Spatial Reasoning: Our research demonstrates that LLMs may not excel at spatial reasoning tasks when acting alone but show improvement when used with symbolic reasoning within a disentangled model. Additionally, we observe that the performance of LLMs diminishes when extracted information is presented in the in-context learning prompt as structured data. A promising direction for future research would be to investigate strategies that allow LLMs to utilize such data without the need for further fine-tuning.

PUBLISHED PAPERS

1. Roshanak Mirzaee, Hossein Rajaby Faghihi, Qiang Ning, and Parisa Kordjamshidi. 2021. SPARTQA: A Textual Question Answering Benchmark for Spatial Reasoning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4582–4598, Online. Association for Computational Linguistics.

2. Roshanak Mirzaee and Parisa Kordjamshidi. 2022. Transfer Learning with Synthetic Corpora for Spatial Role Labeling and Reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6148–6165, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

3. Roshanak Mirzaee and Parisa Kordjamshidi. 2023. Disentangling Extraction and Reasoning in Multi-hop Spatial Reasoning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore. Association for Computational Linguistics.

4. Roshanak Mirzaee and Parisa Kordjamshidi. 2023. Dual-Phase Models for Extracting Information and Symbolic Reasoning: A Case-Study in Spatial Reasoning. In Proceedings of STRL-23, the Second International Workshop on Spatio-Temporal Reasoning and Learning, Macao, S.A.R. CEUR Workshop Proceedings.

BIBLIOGRAPHY

[1] Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. SemEval-2012 task 6: A pilot on semantic textual similarity. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 385–393, Montréal, Canada, 7-8 June 2012. Association for Computational Linguistics.

[2] Reem Alrashdi and Simon O'Keefe. Automatic Labeling of Tweets for Crisis Response Using Distant Supervision, pages 418–425. Association for Computing Machinery, New York, NY, USA, 2020.

[3] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018.

[4] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.

[5] Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung.
A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity, 2023.

[6] John A. Bateman, Joana Hois, Robert Ross, and Thora Tenbrink. A linguistic ontology of space for natural language processing. Artificial Intelligence, 174(14):1027–1071, 2010.

[7] Robert Bossy, Wiktoria Golik, Zorana Ratkovic, Philippe Bessieres, and Claire Nédellec. BioNLP shared task 2013 – an overview of the bacteria biotope task. In Proceedings of the BioNLP Shared Task 2013 Workshop, pages 161–169, 2013.

[8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

[9] Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. TOUCHDOWN: Natural language navigation and spatial reasoning in visual street environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12538–12547, 2019.

[10] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

[11] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, 2019.

[12] Peter Clark, Oyvind Tafjord, and Kyle Richardson. Transformers as soft reasoners over language. In Christian Bessiere, editor, Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 3882–3890. International Joint Conferences on Artificial Intelligence Organization, July 2020. Main track.

[13] Soham Dan, Parisa Kordjamshidi, Julia Bonn, Archna Bhatia, Zheng Cai, Martha Palmer, and Dan Roth. From spatial relations to spatial configurations. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5855–5864, Marseille, France, May 2020. European Language Resources Association.

[14] Pradeep Dasigi, Nelson F. Liu, Ana Marasović, Noah A. Smith, and Matt Gardner. Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5925–5932, 2019.

[15] Surabhi Datta and Kirk Roberts. A hybrid deep learning approach for spatial trigger extraction from radiology reports. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 2020, page 50. NIH Public Access, 2020.

[16] Surabhi Datta, Yuqi Si, Laritza Rodriguez, Sonya E Shooshan, Dina Demner-Fushman, and Kirk Roberts. Understanding spatial language in radiology: Representation framework, annotation, and spatial relation extraction from chest x-ray reports using deep learning. Journal of Biomedical Informatics, 108:103473, 2020.

[17] Luc De Raedt, Angelika Kimmig, and Hannu Toivonen. Problog: A probabilistic Prolog and its application in link discovery. In IJCAI, volume 7, pages 2462–2467.
Hyderabad, 2007. [18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019. [19] Kaustubh Dhole and Christopher D. Manning. Syn-QG: Syntactic and shallow semantic rules for question generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 752–765, 2020. [20] Emmanuelle-Anna Dietz, Steffen Hölldobler, and Raphael Höps. A computational logic approach to human spatial reasoning. In 2015 IEEE Symposium Series on Computational Intelligence, pages 1627–1634. IEEE, 2015. 109 [21] Xinya Du and Claire Cardie. Event extraction by answering (almost) natural questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. [22] Xinya Du, Junru Shao, and Claire Cardie. Learning to Ask: Neural question generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1342–1352, 2017. [23] Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, 2019. [24] Saman Enayati, Ziyu Yang, Benjamin Lu, and Slobodan Vucetic. A visualization approach for rapid labeling of clinical notes for smoking status extraction. In Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances, pages 24–30, Online, June 2021. Association for Computational Linguistics. [25] Hossein Rajaby Faghihi, Parisa Kordjamshidi, Choh Man Teng, and James Allen. The role of semantic parsing in understanding procedural text. arXiv preprint arXiv:2302.06829, 2023. [26] Yixing Fan, Xiaohui Xie, Yinqiong Cai, Jia Chen, Xinyu Ma, Xiangsheng Li, Ruqing Zhang, Jiafeng Guo, et al. Pre-training methods in information retrieval. Foundations and Trends® in Information Retrieval, 16(3):178–317, 2022. [27] John Freeman. The modelling of spatial relations. Computer graphics and image processing, 4(2):156–171, 1975. [28] Dayne Freitag, John Cadigan, Robert Sasseen, and Paul Kalmar. Valet: rule-based information extraction for rapid deployment. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 524–533, 2022. [29] Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. Evaluating models’ local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, 2020. [30] Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi, Alon Talmor, and Sewon Min. Question Answering is a Format; when is it useful? ArXiv, abs/1909.11291, 2019. [31] Jeremy Getman, Joe Ellis, Zhiyi Song, Jennifer Tracey, and Stephanie M Strassel. 
Overview of linguistic resources for the tac kbp 2017 evaluations: Methodologies and results. In TAC, 2017. 110 [32] Mehdi Ghanimifard and Simon Dobnik. Learning to compose spatial relations with grounded neural language models. In IWCS 2017-12th International Conference on Computational Semantics-Long papers, 2017. [33] M Grubinger and C Leung. A benchmark for performance calibration in visual information search. In Proceedings of The 2003 International Conference on Visual Information Systems (VIS 2003), pages 414–419, 2003. [34] Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. Don’t Stop Pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, 2020. [35] Hangfeng He, Qiang Ning, and Dan Roth. QuASE: Question-answer driven sentence encoding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8743–8758, Online, July 2020. Association for Computational Linguistics. [36] Hangfeng He, Mingyuan Zhang, Qiang Ning, and Dan Roth. Foreshadowing the benefits of incidental supervision. arXiv preprint arXiv:2006.05500, 2020. [37] Luheng He, Mike Lewis, and Luke Zettlemoyer. Question-Answer Driven Semantic Role Labeling: Using natural language to annotate natural language. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 643–653, 2015. [38] Michael Heilman and Noah A Smith. Question generation via overgenerating transforma- tions and ranking. Technical report, CARNEGIE-MELLON UNIV PITTSBURGH PA LANGUAGE TECHNOLOGIES INST, 2009. [39] A. Herskovits. Language and Spatial Cognition: An Interdisciplinary Study of the Prepositions in English. Cambridge University Press, 1986. [40] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. [41] Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403, 2022. [42] Zixian Huang, Yulin Shen, Xiao Li, Yu’ang Wei, Gong Cheng, Lin Zhou, Xinyu Dai, and Yuzhong Qu. GeoSQA: A benchmark for scenario-based question answering in the geography domain at high school level. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5866–5871, 2019. [43] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019. [44] Gilles Jacobs and Véronique Hoste. Sentivent: enabling supervised information extraction of company-specific events in economic and financial news. Language Resources and Evaluation, 56(1):225–257, 2022. 111 [45] Zhen Jia, Abdalghani Abujabal, Rishiraj Saha Roy, Jannik Strötgen, and Gerhard Weikum. TempQuestions: A benchmark for temporal question answering. In Companion Proceedings of the The Web Conference 2018, pages 1057–1062, 2018. [46] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2901–2910, 2017. [47] Maged N Kamel Boulos, Guochao Peng, and Trang VoPham. 
An overview of geoai applications in health and healthcare. International journal of health geographics, 18(1):1–9, 2019. [48] Bastian Karstaedt and Birgit Wendholt. Towards semantic interpretations of spatial information in adaptive smart homes. In Forum Bauinformatik, page 8. Citeseer, 2011. [49] William G Kennedy, Magdalena D Bugajska, Matthew Marge, William Adams, Benjamin R Fransen, Dennis Perzanowski, Alan C Schultz, and J Gregory Trafton. Spatial representation and reasoning for human-robot collaboration. In AAAI, volume 7, pages 1554–1559, 2007. [50] Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. Looking Beyond the Surface:a challenge set for reading comprehension over multiple sentences. In Proceedings of North American Chapter of the Association for Computational Linguistics (NAACL), 2018. [51] Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. UnifiedQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1896–1907, 2020. [52] Hyounghun Kim, Abhaysinh Zala, Graham Burri, Hao Tan, and Mohit Bansal. ArraMon: A joint navigation-assembly instruction interpretation task in dynamic environments. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3910–3927, Online, November 2020. Association for Computational Linguistics. [53] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022. [54] Parisa Kordjamshidi, Marie-Francine Moens, and Martijn van Otterlo. Spatial Role Labeling: In Proceedings of the Seventh conference on Task definition and annotation scheme. International Language Resources and Evaluation (LREC’10), pages 413–420. European Language Resources Association (ELRA), 2010. [55] Parisa Kordjamshidi, Taher Rahgooy, Marie-Francine Moens, James Pustejovsky, Umar Manzoor, and Kirk Roberts. Clef 2017: Multimodal spatial role labeling (msprl) task overview. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 367–376. Springer, 2017. 112 [56] Parisa Kordjamshidi, Martijn Van Otterlo, and Marie-Francine Moens. Spatial role labeling: Towards extraction of spatial relations from natural language. ACM Transactions on Speech and Language Processing (TSLP), 8(3):1–36, 2011. [57] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural Questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019. [58] Igor Labutov, Sumit Basu, and Lucy Vanderwende. Deep questions without deep under- standing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 889–898, 2015. [59] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations, 2020. [60] Christian Landsiedel, Verena Rieser, Matthew Walter, and Dirk Wollherr. A review of spatial reasoning and interaction for real-world robotics. 
Advanced Robotics, 31(5):222–242, 2017. [61] Hung Le, Truyen Tran, and Svetha Venkatesh. Self-attentive associative memory. International Conference on Machine Learning, pages 5682–5691. PMLR, 2020. In [62] Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. End-to-end neural coreference In Proceedings of the 2017 Conference on Empirical Methods in Natural resolution. Language Processing, pages 188–197, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. [63] Jumphon Lertlakkhanakul, Jin Won Choi, and Mi Yun Kim. Building data model and simulation platform for spatial interaction management in smart home. Automation in Construction, 17(8):948–957, 2008. [64] Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. Zero-Shot relation extraction via reading comprehension. In CONLL, pages 333–342, 2017. [65] Dongdong Li, Zhaochun Ren, Pengjie Ren, Zhumin Chen, Miao Fan, Jun Ma, and Maarten In de Rijke. Semi-supervised variational reasoning for medical dialogue generation. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 544–554, 2021. [66] Jianfu Li, Qiang Wei, Omid Ghiasvand, Miao Chen, Victor Lobanov, Chunhua Weng, and Hua Xu. A comparative study of pre-trained language models for named entity recognition in clinical trial eligibility criteria from multiple corpora. BMC Medical Informatics and Decision Making, 22(3):1–10, 2022. [67] Tao Li, Vivek Gupta, Maitrey Mehta, and Vivek Srikumar. A logic-driven framework for consistency of neural models. arXiv preprint arXiv:1909.00126, 2019. 113 [68] Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. A unified mrc framework for named entity recognition. arXiv preprint arXiv:1910.11476, 2019. [69] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. [70] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019. [71] Kate Lockwood, Ken Forbus, D Halstead, and Jeffrey Usher. Automatic categorization of spatial prepositions. In Proceedings of the 28th annual conference of the cognitive science society, pages 1705–1710, 2006. [72] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. [73] Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. arXiv preprint arXiv:1808.09602, 2018. [74] Oswaldo Ludwig, Xiao Liu, Parisa Kordjamshidi, and Marie-Francine Moens. Deep embedding for spatial role labeling. arXiv preprint arXiv:1603.08474, 2016. [75] Kai Ma, Yongjian Tan, Miao Tian, Xuejing Xie, Qinjun Qiu, Sanfeng Li, and Xin Wang. Extraction of temporal information from social media messages using the bert model. Earth Science Informatics, 15(1):573–584, 2022. [76] Kaixin Ma, Filip Ilievski, Jonathan Francis, Yonatan Bisk, Eric Nyberg, and Alessandro Oltramari. Knowledge-driven data construction for zero-shot evaluation in commonsense question answering. In 35th AAAI Conference on Artificial Intelligence, 2021. [77] Arjun Magge, Davy Weissenbacher, Abeed Sarker, Matthew Scotch, and Graciela Gonzalez- Hernandez. 
Deep neural networks and distant supervision for geographic location mention extraction. Bioinformatics, 34(13):i565–i573, 06 2018. [78] Umar Manzoor and Parisa Kordjamshidi. Anaphora resolution for improving spatial relation extraction from text. In Proceedings of the First International Workshop on Spatial Language Understanding, pages 53–62, 2018. [79] Rui Mao, Qian Liu, Kai He, Wei Li, and Erik Cambria. The biases of pre-trained language models: An empirical study on prompt-based sentiment analysis and emotion detection. IEEE Transactions on Affective Computing, 2022. [80] David M Mark et al. Languages of spatial relations: Researchable questions & NCGIA research agenda. National Center for Geographic Information and Analysis Santa Barbara . . . , 1989. 114 [81] Wouter Massa, Parisa Kordjamshidi, Thomas Provoost, and Marie-Francine Moens. Machine reading of biological texts: bacteria-biotope extraction. In Proceedings of the 6th international conference on bioinformatics models, methods and algorithms, pages 55–64. SCITEPRESS, 2015. [82] Vaibhav Mavi, Anubhav Jangra, and Adam Jatowt. A survey on multi-hop question answering and generation. arXiv preprint arXiv:2204.09140, 2022. [83] Alexey Mazalov, Bruno Martins, and David Matos. Spatial role labeling with convolutional neural networks. In Proceedings of the 9th Workshop on Geographic Information Retrieval, pages 1–7, 2015. [84] Ana Cristina Mendes, Luísa Coheur, and Paula Vaz Lobo. Named entity recognition in questions: Towards a golden collection. In LREC, 2010. [85] [86] Julian Michael, Gabriel Stanovsky, Luheng He, Ido Dagan, and Luke Zettlemoyer. Crowd- sourcing question-answer meaning representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 560–568, 2018. Junghyun Min, R. Thomas McCoy, Dipanjan Das, Emily Pitler, and Tal Linzen. Syntactic data augmentation increases robustness to inference heuristics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2339–2352, Online, July 2020. Association for Computational Linguistics. [87] Pasquale Minervini, Sebastian Riedel, Pontus Stenetorp, Edward Grefenstette, and Tim Rock- täschel. Learning reasoning strategies in end-to-end differentiable proving. In International Conference on Machine Learning, pages 6938–6949. PMLR, 2020. [88] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011, 2009. [89] Roshanak Mirzaee and Parisa Kordjamshidi. Transfer learning with synthetic corpora for spatial role labeling and reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6148–6165, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. [90] Roshanak Mirzaee, Hossein Rajaby Faghihi, Qiang Ning, and Parisa Kordjamshidi. SPARTQA: A textual question answering benchmark for spatial reasoning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4582–4598, Online, June 2021. Association for Computational Linguistics. 
[91] Roshanak Mirzaee, Hossein Rajaby Faghihi, Qiang Ning, and Parisa Kordjamshidi. SPARTQA: A textual question answering benchmark for spatial reasoning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: 115 Human Language Technologies, pages 4582–4598, Online, June 2021. Association for Computational Linguistics. [92] Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and Ashwin Kalyan. Lila: A unified benchmark for mathematical reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022. [93] Diego Mollá, Menno Van Zaanen, Daniel Smith, et al. Named entity recognition for question answering. 2006. [94] Alaeddine Moussa, Sébastien Fournier, Khaoula Mahmoudi, Bernard Espinasse, and Sami Faiz. Spatial role labeling based on improved pre-trained word embeddings and transfer learning. Procedia Computer Science, 192:1218–1226, 2021. [95] Alaeddine Moussa, Sebastien Fournier, Khaoula Mahmoudi, Bernard Espinasse, and Sami Faiz. Mixing static word embeddings and roberta for spatial role labeling. Procedia Computer Science, 207:2950–2957, 2022. Knowledge-Based and Intelligent Information Engineering Systems: Proceedings of the 26th International Conference KES2022. [96] Huansheng Ning, Feifei Shi, Tao Zhu, Qingjuan Li, and Liming Chen. A novel ontology consistent with acknowledged standards in smart homes. Computer Networks, 148:101–107, 2019. [97] Qiang Ning, Hao Wu, Rujun Han, Nanyun Peng, Matt Gardner, and Dan Roth. TORQUE: A reading comprehension dataset of temporal ordering questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1158–1172, 2020. [98] Maxwell Nye, Michael Tessler, Josh Tenenbaum, and Brenden M Lake. Improving coherence and consistency in neural sequence models with dual-system, neuro-symbolic reasoning. Advances in Neural Information Processing Systems, 34:25192–25204, 2021. [99] Batu Ozturkler, Nikolay Malkin, Zhen Wang, and Nebojsa Jojic. Thinksum: Probabilistic reasoning over sets using large language models. arXiv preprint arXiv:2210.01293, 2022. [100] Rasmus Berg Palm, Ulrich Paquet, and Ole Winther. Recurrent relational networks. arXiv preprint arXiv:1711.08028, 2017. [101] Jason Phang, Thibault Févry, and Samuel R Bowman. Sentence encoders on stilts: Sup- plementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088, 2018. [102] James Pustejovsky, Parisa Kordjamshidi, Marie-Francine Moens, Aaron Levine, Seth In Proceedings of Dworman, and Zachary Yocum. Semeval-2015 task 8: Spaceeval. the 9th International Workshop on Semantic Evaluation (semeval 2015), pages 884–894. ACL, 2015. 116 [103] Meng Qu, Junkun Chen, Louis-Pascal Xhonneux, Yoshua Bengio, and Jian Tang. Rnnlogic: Learning logic rules for reasoning on knowledge graphs. In International Conference on Learning Representations, 2021. [104] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020. [105] Taher Rahgooy, Umar Manzoor, and Parisa Kordjamshidi. Visually guided spatial relation extraction from text. 
In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 788–794, 2018. [106] Hossein Rajaby Faghihi and Parisa Kordjamshidi. Time-stamped language model: Teaching language models to understand the flow of events. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4560–4570, Online, June 2021. Association for Computational Linguistics. [107] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, 2016. [108] David A Randell, Zhan Cui, and Anthony G Cohn. A spatial logic based on regions and connection. KR, 92:165–176, 1992. [109] Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. Impact of pretraining term frequencies on few-shot reasoning. arXiv preprint arXiv:2202.07206, 2022. [110] Jochen Renz and Bernhard Nebel. Qualitative spatial reasoning using constraint calculi. In Handbook of spatial logics, pages 161–215. Springer, 2007. [111] Homero Roman Roman, Yonatan Bisk, Jesse Thomason, Asli Celikyilmaz, and Jianfeng Gao. RMM: A recursive mental model for dialogue navigation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1732–1745, Online, November 2020. Association for Computational Linguistics. [112] Parisa Kordjamshidi Roshanak Mirzaee. Disentangling extraction and reasoning in multi-hop spatial reasoning. In The 2023 Conference on Empirical Methods in Natural Language Processing, 9. [113] Mohammed Saeed, Naser Ahmadi, Preslav Nakov, and Paolo Papotti. RuleBERT: Teaching In Proceedings of the 2021 Conference on soft rules to pre-trained language models. Empirical Methods in Natural Language Processing, pages 1460–1476, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. [114] Sunita Sarawagi et al. Information extraction. Foundations and Trends® in Databases, 1(3):261–377, 2008. 117 [115] Priyanka Sen and Amir Saffari. What do models learn from question answering datasets? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2429–2438, 2020. [116] Dan Shen and Mirella Lapata. Using semantic roles to improve question answering. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 12–21, Prague, Czech Republic, June 2007. Association for Computational Linguistics. [117] Tao Shen, Guodong Long, Xiubo Geng, Chongyang Tao, Tianyi Zhou, and Daxin Jiang. Large language models are strong zero-shot retriever. arXiv preprint arXiv:2304.14233, 2023. [118] Zhengxiang Shi, Qiang Zhang, and Aldo Lipani. Stepgame: A new benchmark for robust multi-hop spatial reasoning in texts. In Proceedings of the Association for the Advancement of Artificial Intelligence, AAAI ’22, 2022. [119] Hyeong Jin Shin, Jeong Yeon Park, Dae Bum Yuk, and Jae Sung Lee. Bert-based spatial information extraction. In Proceedings of the Third International Workshop on Spatial Language Understanding, pages 10–17, 2020. [120] Rahil Soroushmojdehi, Sina Javadzadeh, Alessandra Pedrocchi, Marta Gandolla, et al. 
Transfer learning in hand movement intention detection based on surface electromyography signals. Frontiers in Neuroscience, 16:1–18, 2022. [121] Keith Stenning and Michiel Van Lambalgen. Human reasoning and cognitive science. MIT Press, 2012. [122] Kristin Stock, Christopher B Jones, Shaun Russell, Mansi Radke, Prarthana Das, and Niloofar Aflaki. Detecting geospatial location descriptions in natural language text. International Journal of Geographical Information Science, 36(3):547–584, 2022. [123] Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi. A corpus of natural language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 217–223, Vancouver, Canada, July 2017. Association for Computational Linguistics. [124] Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6418–6428, 2019. [125] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. arXiv preprint arXiv:1503.08895, 2015. [126] Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. ProofWriter: Generating implications, proofs, and abductive statements over natural language. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3621–3634, Online, August 2021. Association for Computational Linguistics. 118 [127] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, 2019. [128] Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R Bowman, Dipanjan Das, et al. What do you learn from context? probing for sentence structure in contextualized word representations. arXiv preprint arXiv:1905.06316, 2019. [129] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. [130] Takuma Udagawa, Takato Yamazaki, and Akiko Aizawa. A linguistic analysis of visually grounded dialogues based on spatial expressions. In Findings of the Association for Compu- tational Linguistics: EMNLP 2020, pages 750–765, Online, November 2020. Association for Computational Linguistics. [131] Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Large language models still can’t plan (a benchmark for llms on planning and reasoning about change). arXiv preprint arXiv:2206.10498, 2022. [132] Emile Van Krieken, Erman Acar, and Frank Van Harmelen. Semi-supervised learning using differentiable reasoning. arXiv preprint arXiv:1908.04700, 2019. [133] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017. [134] Sagar Gubbi Venkatesh, Anirban Biswas, Raviteja Upadrashta, Vikram Srinivasan, Partha Talukdar, and Bharadwaj Amrutur. 
Spatial reasoning from natural language instructions for robot manipulation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 11196–11202. IEEE, 2021. [135] Somin Wadhwa, Silvio Amir, and Byron C Wallace. Revisiting relation extraction in the era of large language models. arXiv preprint arXiv:2305.05003, 2023. [136] Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. Ace 2005 multilingual training corpus. Linguistic Data Consortium, Philadelphia, 57:45, 2006. [137] Zhen Wan, Fei Cheng, Zhuoyuan Mao, Qianying Liu, Haiyue Song, Jiwei Li, and Sadao Kurohashi. Gpt-re: In-context learning for relation extraction using large language models. arXiv preprint arXiv:2305.02105, 2023. [138] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, 2018. 119 [139] Feng Wang, Peifeng Li, and Qiaoming Zhu. A hybrid model of classification and generation In Proceedings of the 29th International Conference on for spatial relation extraction. Computational Linguistics, pages 1915–1924, Gyeongju, Republic of Korea, October 2022. International Committee on Computational Linguistics. [140] Mengru Wang, Jianming Zheng, Fei Cai, Taihua Shao, and Honghui Chen. Drk: Discrimina- tive rule-based knowledge for relieving prediction confusions in few-shot relation extraction. In Proceedings of the 29th International Conference on Computational Linguistics, pages 2129–2140, 2022. [141] Zhile Wang, Jianhua Yang, and Yu Guo. Unknown fault feature extraction of rolling bearings under variable speed conditions based on statistical complexity measures. Mechanical Systems and Signal Processing, 172:108964, 2022. [142] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022. [143] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards ai-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015. [144] Diedrich Wolter. Sparq-a spatial reasoning toolbox. In AAAI Spring Symposium: Benchmark- ing of Qualitative Spatial and Temporal Reasoning Systems, page 53, 2009. [145] Lang-Tao Wu, Jia-Rui Lin, Shuo Leng, Jiu-Lin Li, and Zhen-Zhong Hu. Rule-based information extraction for mechanical-electrical-plumbing-specific semantic web. Automation in Construction, 135:104108, 2022. [146] Xin Xu, Yuqi Zhu, Xiaohan Wang, and Ningyu Zhang. How to unleash the power of large language models for few-shot relation extraction? arXiv preprint arXiv:2305.01555, 2023. [147] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pages 5754–5764, 2019. [148] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. pages 2369–2380, October-November 2018. [149] Zhun Yang, Adam Ishay, and Joohyung Lee. Coupling large language models with logic programming for robust and general reasoning from text. 
In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 5186–5219, Toronto, Canada, July 2023. Association for Computational Linguistics. [150] Liang Yao, Chengsheng Mao, and Yuan Luo. KG-BERT: Bert for knowledge graph completion. arXiv preprint arXiv:1909.03193, 2019. 120 [151] Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. Docred: A large-scale document-level relation extraction dataset. arXiv preprint arXiv:1906.06127, 2019. [152] Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 conference on empirical methods in natural language processing, pages 1753–1762, 2015. [153] Chong Zhang, Jiagao Lyu, and Ke Xu. A storytree-based model for inter-document causal relation extraction from news articles. Knowledge and Information Systems, 65(2):827–853, 2023. [154] Ningyu Zhang, Xiang Chen, Xin Xie, Shumin Deng, Chuanqi Tan, Mosha Chen, Fei Huang, Luo Si, and Huajun Chen. Document-level relation extraction as semantic segmentation. arXiv preprint arXiv:2106.03618, 2021. [155] Yue Zhang, Quan Guo, and Parisa Kordjamshidi. Towards navigation by reasoning over spatial configurations. In Proceedings of Second International Combined Workshop on Spatial Language Understanding and Grounded Communication for Robotics, pages 42–52, Online, August 2021. Association for Computational Linguistics. [156] Yue Zhang and Parisa Kordjamshidi. Explicit object relation alignment for vision and language navigation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 322–331, Dublin, Ireland, May 2022. Association for Computational Linguistics. [157] Chen Zheng and Parisa Kordjamshidi. Relational gating for" what if" reasoning. arXiv preprint arXiv:2105.13449, 2021. [158] Ben Zhou, Qiang Ning, Daniel Khashabi, and Dan Roth. Temporal common sense acquisition with minimal supervision. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7579–7589, 2020. [159] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022. [160] Rui Zhu, Krzysztof Janowicz, Ling Cai, and Gengchen Mai. Reasoning over higher-order qualitative spatial relations via spatially explicit neural networks. International Journal of Geographical Information Science, pages 1–32, 2022. [161] Xiangxin Zhu, Carl Vondrick, Charless C Fowlkes, and Deva Ramanan. Do we need more training data? International Journal of Computer Vision, 119(1):76–92, 2016. [162] Jordan Zlatev. Holistic spatial semantics of thai. In Cognitive linguistics and non-Indo- European languages, pages 305–336. De Gruyter Mouton, 2008. 121