EXPLOITING SEMANTIC STRUCTURES TOWARD PROCEDURAL REASONING

By Hossein Rajaby Faghihi

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy

2024

ABSTRACT

Reasoning over procedural text, which encompasses texts such as recipes, manuals, and 'how to' tutorials, presents formidable challenges due to the dynamic nature of the world it describes. These challenges are embodied in tasks such as 1) tracking entities and their status changes (entity tracking) and 2) summarization of the process (procedural summarization). This thesis aims to enhance the representation of and reasoning over textual procedures by harnessing semantic structures in the input text and imposing constraints on the models' output. It delves into using semantic structures derived from the text, including relationships between actions and objects, semantic parsing of instructions, and the sequential structure of actions. Additionally, the thesis investigates the integration of structural and semantic constraints within neural models, resulting in coherent and consistent outputs that align with external knowledge.

The thesis contributes significantly to three main areas: entity tracking, procedural abstraction, and the integration of constraints in deep learning. In the entity tracking task, we have made four primary contributions: 1) Developed a novel architecture for encoding event flow in pretrained language models. 2) Enabled seamless transfer learning from diverse corpora through task reformulation. 3) Enhanced language models by incorporating knowledge from semantic parsers and leveraging ontological abstraction of actions. 4) Created a new evaluation scheme considering fine-grained semantics in tracking entities. Regarding procedural summarization, the thesis proposes a model with an explicit latent space for the procedure that is indirectly supervised to ensure the summary's action order corresponds to the order of events in the multi-modal instructions. In the realm of integrating domain knowledge with deep neural networks, the thesis makes two significant contributions: 1) it contributes to the development of a generic framework that facilitates the incorporation of first-order logical constraints in neural models, and 2) it creates a new benchmark for evaluating constraint integration methods across five categories of tasks. This benchmark introduces novel evaluation criteria and offers valuable insights into the effectiveness of constraint integration methods across various tasks.

Copyright by HOSSEIN RAJABY FAGHIHI 2024

TABLE OF CONTENTS

LIST OF ABBREVIATIONS

CHAPTER 1  INTRODUCTION
    1.1 Motivation
    1.2 Procedural Reasoning Tasks
    1.3 Contributions
    1.4 Manuscript Outline

CHAPTER 2  BACKGROUND
    2.1 Language Models
    2.2 Learning Objectives
    2.3 NLP Tasks
    2.4 Benchmarks

CHAPTER 3  ENTITY TRACKING
    3.1 Task Definition
    3.2 Challenges
    3.3 Task Evaluation
    3.4 Related Research
    3.5 Encoding Temporal Information
    3.6 Exploiting Semantic Parsers to Understand the Process

CHAPTER 4  MULTI-MODAL PROCEDURAL ABSTRACTION
    4.1 Task Definition and Challenges
    4.2 Latent Alignment of Procedural Concepts
    4.3 Experiments and Discussion

CHAPTER 5  NEURAL LEARNING WITH LOGICAL CONSTRAINTS
    5.1 Declarative Constraint Integration Framework
    5.2 Constraint Integration Benchmark
    5.3 Experiments and Discussion

CHAPTER 6  PROCEDURAL REASONING WITH LOGICAL CONSTRAINTS
    6.1 Procedural Reasoning Constraints
    6.2 Consistent Procedural Reasoning within Generative Models
    6.3 Enhanced Inference toward Consistent Procedural Reasoning
    6.4 Conclusion

CHAPTER 7  CONCLUSION & FUTURE WORK
    7.1 Summary of Contributions
    7.2 Future Directions

BIBLIOGRAPHY
LIST OF ABBREVIATIONS

MSU    Michigan State University
NLP    Natural Language Processing
AI     Artificial Intelligence
NER    Named Entity Recognition
PLM    Pre-trained Language Models
LLM    Large Language Models
TSLM   Time-Stamped Language Models

CHAPTER 1
INTRODUCTION

1.1 Motivation

Natural language understanding concerns developing algorithms and models that can interpret and analyze human language. The ability to understand natural language has numerous practical applications, such as improving search engines [1], automatic customer service [30], and creating intelligent personal assistants [79]. However, natural language understanding remains a challenging problem due to the complexity and variability of human language. In recent years, there has been significant progress in developing machine learning models and techniques to tackle this problem.

In this manuscript, we focus on natural language understanding in the domain of procedural text. Procedural texts, such as recipes [16, 184], manuals and tutorials [170, 180], and stories [160], provide step-by-step instructions for various tasks. Understanding procedural text is important in natural language processing [171] and robotics [164].
In addition to the general challenges of natural language understanding, comprehending procedural text is difficult due to the dynamic nature of the environment described in the process. This results in new challenges specific to this domain. We pinpoint some of these as follows:

• Reasoning over temporal relationships: Procedural texts describe a sequence of steps that must be performed in a specific order. For example, if the instruction is "add the flour after mixing the wet ingredients", the agent should understand that the sequence of actions is "mix the wet ingredients" and then "add the flour".

• Reasoning over the global context: This challenge arises when understanding a particular step requires information from previous or future steps. For instance, a recipe may instruct to "stir the vegetables" followed by "take them out of the pan". Without looking at the subsequent steps, it may not be clear where the vegetables are being stirred.

• Grounding actions: Procedural texts often use synonyms or multiple meanings of a word to describe the same physical action. For example, a recipe might use the terms "chop," "dice," and "mince" to refer to the action of cutting ingredients into small pieces.

• Grounding entities: Procedural texts may have complex entity grounding, making it difficult to determine which entity is being referred to in a particular step. For example, grounding the vegetables in the "stir the vegetables" instruction is difficult when the recipe contains multiple ingredients, such as {parsley, carrot, meat, and sugar}. One requires additional knowledge about the type of ingredients to understand this instruction fully.

• Reasoning over action sequences: Procedural texts require a deep understanding of the actions being performed, including the sequence and dependencies between them. For instance, if a process describes the entity 'animal' as 'dying' (being destroyed) at step i, the 'animal' cannot be moved at step i + 1.

• Reasoning over causal relations: Procedural texts require the ability to reason about causality, to understand how actions relate to each other, and what the consequences of a specific action might be. For instance, in the sequence of information "1. Wind creates waves 2. Waves hit rocks on the beach", it can be inferred that calm weather would result in smaller waves and, hence, less damage to the rocks on the beach.

• Reasoning over actions and their consequences: There are complex dependencies between the actions and the changes in the world that should be understood. For instance, if the entity 'water' is detected to be moving in the sentence 'water from the soil is absorbed', it should be inferred that the new location of 'water' is different from 'soil'.

These challenges have been embodied in multiple tasks, such as entity tracking [32, 16], procedural abstraction/summarization [184], and causal reasoning [169]. This thesis focuses on entity tracking and procedural abstraction, delving deeper into each task's challenges in their corresponding sections in the subsequent chapters.
In addition to the intrinsic challenges of understanding procedural text due to its dynamic nature, the available benchmarks for this task are mostly small and provide smaller amounts of annotated data compared to many other NLP tasks. This is due to the costly and complex human process of providing fine-grained annotations of world states during the process. Training neural models to learn from a smaller set of annotated data can result in challenges in generalization and may cause overfitting.

In order to tackle the challenges discussed earlier, our objective is to leverage the semantic structures that can be derived from procedural text, as well as the interdependencies among the various decisions made by neural models (also known as structured prediction), especially in models designed for procedural reasoning. Prior to outlining our contributions, we discuss the scenarios of procedural understanding that are the focus of our investigation in this thesis.

1.2 Procedural Reasoning Tasks

Here, we briefly describe the main procedural reasoning problems that we focus on, that is, entity tracking and procedural abstraction. Entity tracking involves monitoring entities' states as they undergo a series of actions, while procedural abstraction aims to generate a condensed summary of the steps involved in a process. Both tasks are essential for a comprehensive understanding of the process, as they capture different aspects of the task.

Paragraph (sentences)                        State number   Water   Light   CO2    Mixture   Sugar
(Before the process starts)                  State 0        Soil    Sun     ?      -         -
Roots absorb water from soil                 State 1        Root    Sun     ?      -         -
The water flows to the leaf                  State 2        Leaf    Sun     ?      -         -
Light from the sun and CO2 enter the leaf    State 3        Leaf    Leaf    Leaf   -         -
Water, light, and CO2 combine into mixture   State 4        -       -       -      Leaf      -
Mixture forms sugar                          State 5        -       -       -      -         Leaf

Table 1.1 An example of procedural text and its annotations from the Propara dataset [32]. Water, Light, CO2, Mixture, and Sugar are the participants. "-" means the entity does not exist. "?" means the location of the entity is unknown.

Entity Tracking. Table 1.1 presents an illustrative example of the entity tracking task. This task revolves around processing step-by-step instructions along with the entities involved in a given process, with the aim of generating comprehensive annotations that trace the evolution of these entities' properties over the course of the instructions. The property of interest shown in Table 1.1 is the location of entities. To expound further, consider the provided sentence, "Roots absorb water from soil," and focus on the entity of interest, namely, "water." In this context, the primary objective of the entity tracking task is to discern and highlight the relocation of the "water" entity as it transitions from its original location in the "soil" to its eventual destination in the "root."

Figure 1.1 A sample input for the task of procedural summarization in a simplified setting. This task is known as Textual Cloze.

Procedural Abstraction. The procedural abstraction task refers to the process of condensing a series of actions and their associated outcomes in order to provide a concise representation that accurately captures the original sequence of events. To facilitate the examination of this task, a simplified variant can be explored, wherein the abstract consists of only four action phrases that effectively convey the chronological progression of the process. An illustrative example of this task is presented in Figure 1.1. In this example, the sequence denoted by "1. Preparation, 2. Making the Dough, 3. Preparing Veggies, and 4. Baking" serves as an abstract and succinct action-oriented interpretation of the procedure of baking "Pizza Pancakes". Compared to the full summarization task, it is worth noting that this simplified version deliberately overlooks the element of "comprehensiveness" and relaxes the requirements for "coherence".
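Coming back to the entity tracking example, the location grid in Table 1.1 can be pictured as a mapping from each participant to its per-state locations. The following minimal Python sketch is purely illustrative: the entity and location names are taken from Table 1.1, but the dictionary layout is ours and not the dataset's actual file format.

    # Location of each participant after each state (state 0 = before the process).
    # "-" marks a non-existing entity and "?" an unknown location.
    annotations = {
        "water":   ["Soil", "Root", "Leaf", "Leaf", "-",    "-"],
        "light":   ["Sun",  "Sun",  "Sun",  "Leaf", "-",    "-"],
        "CO2":     ["?",    "?",    "?",    "Leaf", "-",    "-"],
        "mixture": ["-",    "-",    "-",    "-",    "Leaf", "-"],
        "sugar":   ["-",    "-",    "-",    "-",    "-",    "Leaf"],
    }

    def location(entity, state):
        """Return the annotated location of `entity` after `state` steps."""
        return annotations[entity][state]

    assert location("water", 0) == "Soil" and location("water", 1) == "Root"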
1.3 Contributions

In this thesis, our primary aim is to exploit the inherent semantic structures in language and constraints in structured prediction tasks to advance procedural reasoning over text. Firstly, we focus on utilizing the input semantic structures derived from the text, such as the interrelationships between actions and objects, the semantic parse of the instructions, and the relative ordering of actions. By leveraging these semantic structures, we can enhance the understanding and representation of procedural knowledge. Secondly, we aim to harness the power of structural and semantic constraints over the neural model's decisions and latent variables. This line of inquiry falls within the domain of structured output prediction, where the model outputs are correlated within a certain structure. For instance, in predicting actions associated with a particular entity at each step, there exists a sequential constraint: if the action at step i is "create," it is impossible for the action at step i + 1 to also be "create." By incorporating these structural constraints, we can guide the neural models to generate more coherent and realistic outputs, aligning them with the inherent constraints of procedural knowledge.

In pursuit of our objective, we put forth a range of innovative neural architectures that leverage semantic enhancements applied to the instructions, as well as novel learning objectives devised based on the interconnections among latent and output variables of neural models. Our primary focus is on addressing the challenges inherent in procedural understanding, which we tackle through formulating two distinct tasks: entity tracking and procedural abstraction. To harness the semantic dependencies present in the latent and output variables of neural models, we delve into the broader domain of integrating domain knowledge with deep neural networks (DNNs). Within this framework, we specifically concentrate on incorporating knowledge that can be expressed as first-order logical constraints. By integrating domain knowledge, we strive to enhance the reasoning capabilities and overall performance of the neural models, especially in capturing the intricacies of procedural reasoning.

In summary, our main contributions toward the task of entity tracking are:

• We design a novel mechanism to encode the relative time of events and capture the flow of events within language models. This enables us to reason over the whole context of the process while being able to shift focus from one step to another. Our method significantly enhances language models' ability to reason over action sequences and both local and global contexts. We evaluate this mechanism in the task of entity tracking and achieve state-of-the-art results.

• We provide reformulations of the entity tracking task, enabling effective transfer learning from generic-domain question answering benchmarks to the procedural domain. This reformulation helps address the challenges arising from the limited annotations available for the entity tracking task.

• We provide a novel approach to utilize symbolic and neural semantic parsers alongside ontological hierarchies of actions in understanding procedural texts. This approach combines semantic parses, represented as graphs and encoded via graph attention networks, with pretrained language models. Our proposal improves the ability of baseline models to address challenges in grounding actions, reasoning over action sequences, and understanding actions and their consequences.
This approach also enhances the explainability of neural decision-making processes and improves model generalizability through proper abstractions over textual information. We evaluate this approach in the entity tracking task and achieve further improvements over the state-of-the-art results.

• We provide new evaluation metrics to better understand the challenges of the entity tracking task based on the difficulty of the reasoning steps required for tracking the entities.

• We are the first to enforce global consistency in the entity tracking task, which ensures decision consistency by considering all the model decisions in parallel rather than sequentially, resulting in better performance compared to sequential inference.

Our contributions toward the task of procedural abstraction are listed below:

• We propose a new training objective that utilizes distant supervision signals based on constraints on generating summaries of recipes. These supervision signals are applied in a latent matching space between summary points and the instruction steps, and the model is enforced to match earlier summary points to earlier steps of the process.

• We propose and showcase the benefits of using a multi-modal architecture over a single textual-modality model in generating process abstractions.

• We analyze and provide insights and reasons on cases where the multi-modal information is helpful or harmful in processing the process instructions.

Finally, our main contributions toward the integration of domain knowledge with DNNs are:

• We develop a generic framework for integrating constraints with deep neural models (DomiKnowS). This framework provides a declarative interface to define knowledge and computational units while providing a seamless transition between multiple integration techniques.

• We create a new evaluation benchmark (GLUECons), which facilitates research on integrating explicit knowledge of the task with learning from data. We further showcase the advantage of integrating domain knowledge with deep neural models in multiple tasks and under various configurations of data size and network parameters.

• We propose a novel framework for mapping decisions originating from various models into comparable values in a unified framework for enforcing consistency in the heterogeneous decision-making process.

1.4 Manuscript Outline

In this section, we provide a short overview of the chapters of this document and discuss the structure of the remaining chapters.

Chapter 3 focuses on the entity tracking task. It begins with a formal definition of the task in Section 3.1, followed by a detailed discussion of the specific challenges associated with entity tracking in Section 3.2. Section 3.3 explores existing evaluation metrics for this task, while Section 3.4 provides an overview of related research. Next, Section 3.5 will cover our approach for reformulating the task of entity tracking as question answering and a novel method to encode the relative time frame of events in pretrained language models. Our contributions discussed in this section are published as part of the paper "Time-Stamped Language Model: Teaching Language Models to Understand the Flow of Events" [48] in NAACL 2021. Section 3.6 will cover our main contributions toward utilizing semantic parsers for understanding procedural text in the form of entity tracking. The results of this effort are part of the paper "The Role of Semantic Parsing in Understanding Procedural Text" [50] in EACL 2023.
Chapter 4 delves into the task of procedural abstraction. It examines the simplified action-based abstraction task. The approach for mapping the temporal order of actions between the abstract and the process is described in Section 4.1. The results of this work were published in ALVR 2021 under the title "Latent alignment of procedural concepts in multimodal recipes" [131].

In Chapter 5, we explain our general approach for integrating domain knowledge with neural learning. Section 5.1 describes our declarative learning-based framework, DomiKnowS, for knowledge integration. Section 5.2 introduces GLUECons, a new benchmark for evaluating constraint integration methods. Our efforts in designing DomiKnowS are presented in EMNLP 2021 under the title "DomiKnowS: A Library for Integration of Symbolic Domain Knowledge in Deep Learning" [47]. GLUECons is also presented in AAAI 2023 under the name "GLUECons: A Generic Benchmark for Learning Under Constraints" [51].

In Chapter 6, we combine two key aspects of our research: procedural reasoning and the integration of domain knowledge with neural learning. To fully harness the potential of our proposed approach for integrating knowledge, we extend existing methods to ensure consistency during inference. These extensions are tailored to handle scenarios where we need to combine decisions with different characteristics, such as size and accuracy. In Section 6.3, we explain our new technical framework for these extensions. Furthermore, in Section 6.2, we propose a novel technique for procedural reasoning within a generative language model. Our approach aims at enhancing the consistency of reasoning within the language model through data augmentation and knowledge integration.

Finally, in Chapter 7, we summarize our contributions and propose future directions and possible extensions of our work.

CHAPTER 2
BACKGROUND

In this chapter, we review the basic definitions and related work that are essential to understanding the contributions of this work. We describe the general research relevant to each contribution in the subsequent chapters.

2.1 Language Models

Throughout this document, we refer to language models in terms of transformer-based neural architectures, which are mostly pre-trained on the masked/next token prediction task. Transformer-based language models are a type of deep learning model used for natural language processing tasks such as language generation [145], machine translation [173], and text classification [119]. They are based on the Transformer architecture introduced by [173] in 2017.

2.1.1 Transformers

The Transformer architecture [173] is designed to address the limitations of traditional recurrent neural networks (RNNs) [107] and convolutional neural networks (CNNs) [72] in processing long-range dependencies in sequential data such as text. It does this by introducing self-attention mechanisms that allow the model to selectively attend to different parts of the input sequence, regardless of their position in the sequence.

Masked Token Prediction. Masked token prediction is a task where the model is given a sequence of tokens with some tokens randomly masked, and the model is asked to predict the masked tokens based on the context of the surrounding tokens. Formally, given a sequence of tokens X = {x1, x2, ..., xn}, a subset of the tokens is randomly selected and replaced with a special [MASK] token to obtain a masked sequence Y = {y1, y2, ..., yn}, where yj is either xj or [MASK].
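To make the masking step concrete, the snippet below is a minimal Python sketch (the function and variable names are ours) that corrupts a token sequence X into a masked sequence Y. Production implementations such as BERT's additionally keep some selected tokens unchanged or replace them with random tokens, which is omitted here.

    import random

    MASK = "[MASK]"

    def mask_tokens(tokens, mask_prob=0.15, seed=13):
        """Replace a random subset of tokens with [MASK] and remember their positions."""
        rng = random.Random(seed)
        masked, positions = [], []
        for j, tok in enumerate(tokens):
            if rng.random() < mask_prob:
                masked.append(MASK)
                positions.append(j)
            else:
                masked.append(tok)
        return masked, positions

    X = "roots absorb water from the soil".split()
    Y, masked_positions = mask_tokens(X)
    # The model must recover X[j] for every j in masked_positions.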
The goal of the model is to predict the original tokens at the masked positions in Y, i.e., to find the sequence Z = {z1, z2, ..., zn}, where zj = xj if yj = xj, and zj is the predicted token if yj = [MASK].

The objective function used to train the model is typically the cross-entropy loss between the predicted probability distribution over the vocabulary and the true distribution of the original tokens. Let ŷ_j be the predicted distribution over the vocabulary for the j-th masked position, and let p_{x_j} be the true probability distribution of the original token x_j. The loss function can be defined as:

    L_mask = − Σ_{j=1}^{n} Σ_{i=1}^{|V|} p_{x_j, i} log ŷ_{j, i}

where V is the vocabulary of the language.

Next Token Prediction. Next token prediction is a task where the model is given a sequence of tokens and is asked to predict the next token in the sequence. Formally, given a sequence of tokens X = {x1, x2, ..., xn}, the model is asked to predict the probability distribution over the vocabulary of the next token x_{n+1}. The objective function used to train the model is typically the cross-entropy loss between the predicted probability distribution over the vocabulary and the true distribution of the next token x_{n+1}. The model is trained in an iterative approach, where the true next token is provided as input to the model at each time step during training.

Let ŷ be the predicted distribution over the vocabulary for the next token, and let p_{x_{n+1}} be the true probability distribution of the next token. The loss function can be defined as:

    L_next = − Σ_{i=1}^{|V|} p_{x_{n+1}, i} log ŷ_i

where V is the vocabulary of the language.

During inference, the model can generate text by recursively predicting the next token given the previously generated tokens. Let X^t = {x1, x2, ..., xt} be the first t tokens of the sequence, and let ŷ_{t+1} be the predicted distribution over the vocabulary for the next token given X^t. The model can then sample a token from the predicted distribution and add it to the sequence to obtain X^{t+1}. This process can be repeated until a stopping criterion is met, such as reaching a maximum sequence length or generating an end-of-sentence token.

Next token prediction is a fundamental task in language modeling, and it is often used as a pretraining task for more complex natural language processing tasks, such as machine translation and question answering. The success of next token prediction relies on the ability of the model to capture the dependencies between the tokens in the sequence and to learn a representation of the language that can generalize to new tasks.
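The recursive generation loop described above can be written down in a few lines. The sketch below performs greedy decoding under the assumption that `model` is any autoregressive network mapping a 1 x t tensor of token ids to 1 x t x |V| logits; sampling from the softmax distribution could replace the argmax.

    import torch

    def greedy_generate(model, token_ids, max_new_tokens, eos_id=None):
        """Recursively append the most likely next token (greedy decoding)."""
        for _ in range(max_new_tokens):
            logits = model(token_ids)                 # shape: (1, t, |V|)
            next_id = logits[0, -1].argmax()          # prediction for x_{t+1}
            token_ids = torch.cat([token_ids, next_id.view(1, 1)], dim=1)
            if eos_id is not None and next_id.item() == eos_id:
                break                                 # end-of-sentence token generated
        return token_ids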
2.1.2 Encoder-only Models

Encoder-only models are a type of transformer-based language model that consists of an encoder component only, without a decoder. This means that the model is trained to generate representations of input sequences that capture the contextual relationships between the tokens in the sequence, but it does not generate output sequences. Two popular examples of encoder-only models are BERT (Bidirectional Encoder Representations from Transformers) [42] and RoBERTa (Robustly Optimized BERT Pretraining Approach) [95].

Formally, the input to the encoder-only model is a sequence of tokens X = {x1, x2, ..., xn}, which is first tokenized and then fed into the model. The model generates a sequence of hidden representations H = {h1, h2, ..., hn}, where hi is the hidden representation of the i-th token in the input sequence.

The architecture of the encoder-only model is based on the transformer architecture, which consists of a stack of L identical layers. Each layer has two sub-layers: a self-attention mechanism and a position-wise fully connected feed-forward network. The self-attention mechanism allows the model to attend to different positions of the input sequence to generate the hidden representation of each token. The position-wise feed-forward network allows the model to capture complex interactions between the tokens. The objective function used to train the encoder-only model is typically the masked language modeling task.

BERT. BERT (Bidirectional Encoder Representations from Transformers) [42] is an encoder-only transformer-based language model developed by Google AI Language. It was introduced in 2018 and has achieved state-of-the-art results on various natural language processing tasks, including question answering, sentiment analysis, and named entity recognition.

The architecture of BERT is based on a transformer encoder that is pre-trained on large amounts of text data using the masked language modeling task and the next sentence prediction task. During training, a subset of the tokens in the input sequence is randomly masked, and the model is trained to predict the original tokens based on the context of the surrounding tokens. The next sentence prediction task involves predicting whether two sentences are consecutive in a document or whether they are randomly sampled from the corpus.

After pre-training, the weights of the BERT model can be fine-tuned on a downstream task with a smaller labeled dataset, such as sentiment analysis or named entity recognition. Fine-tuning involves adding a task-specific output layer on top of the pre-trained encoder and training the model to minimize the task-specific loss function.

One of the key features of BERT is its bidirectional nature, which allows it to capture the contextual relationships between the tokens in both directions. This is achieved through the use of the masked language modeling task, which requires the model to predict the original tokens based on the context of both the preceding and following tokens.

RoBERTa. RoBERTa (Robustly Optimized BERT Pretraining Approach) [95] is an encoder-only transformer-based language model developed by Facebook AI Research. It was introduced in 2019 and is an extension of the BERT model, with improvements to the pre-training objectives and hyperparameters.

The architecture of RoBERTa is similar to that of BERT, with a transformer encoder that is pre-trained on large amounts of text data using a variety of pre-training objectives, including a new strategy called dynamic masking. With dynamic masking, the masking pattern is generated anew each time a sequence is fed to the model, so the masked tokens change from one iteration to another, as opposed to the single static masking performed once during preprocessing in the original masked language modeling setup.

RoBERTa also makes several hyperparameter optimizations to the BERT model, such as increasing the batch size, training for longer, and using a larger number of training steps. These optimizations result in a more robust pre-trained model that is less sensitive to hyperparameter choices and can be fine-tuned on a wider range of downstream tasks.
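As a concrete usage illustration, the hedged sketch below queries a pre-trained RoBERTa masked language model through the Hugging Face transformers library (the library and the 'roberta-base' checkpoint are assumptions of this example); it prints the most likely filler for the masked position.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForMaskedLM.from_pretrained("roberta-base")

    text = f"Roots absorb water from the {tokenizer.mask_token}."
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Locate the masked position and take its highest-scoring vocabulary entry.
    mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    predicted_id = logits[0, mask_index].argmax()
    print(tokenizer.decode([predicted_id.item()]))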
In addition, RoBERTa uses a technique called byte-pair encoding (BPE) to handle out-of-vocabulary (OOV) words, which involves breaking down rare or unknown words into subword units. This allows the model to handle rare or unseen words better and improves its generalization ability.

2.1.3 Encoder-Decoder Models

Encoder-decoder models are a class of neural network architectures commonly used in natural language processing tasks such as machine translation and text summarization. They consist of an encoder network that processes the input sequence and generates a fixed-size representation of it and a decoder network that generates the output sequence based on the encoder representation.

To adapt encoder-only models to the encoder-decoder architecture, a decoder network is added on top of the pre-trained encoder. The encoder network first processes the input sequence to generate a fixed-size representation h, which is then passed to the decoder network. The decoder generates the output sequence y = {y1, ..., ym} one token at a time, where yi represents the i-th token in the output sequence and m is the length of the output sequence.

During decoding, the model uses an attention mechanism to dynamically weigh the importance of each encoder representation based on the decoder's current state and the previous output tokens. This allows the model to capture long-range dependencies between the input and output sequences and generate fluent and coherent output.

Encoder-decoder models can be trained end-to-end using a variety of loss functions, such as cross-entropy loss or reinforcement learning. During training, the model learns to generate output sequences that are close to the target sequences based on the chosen loss function.

Overall, encoder-decoder models have proven to be a powerful tool in natural language processing and have achieved impressive results on tasks such as machine translation and text summarization. The transformer-based architecture, in particular, has become a popular choice for encoder-decoder models due to its strong performance and ability to capture long-range dependencies in the input and output sequences.

T5. The T5 (Text-to-Text Transfer Transformer) [129] model is a transformer-based encoder-decoder architecture that was introduced in 2019. T5 is trained in a text-to-text format, where both the input and output are text sequences. This allows T5 to perform a wide range of tasks, including language modeling, text classification, and machine translation, among others.

T5 is further fine-tuned on predicting the output sequence from the input sequence, given a task-specific prefix that is added to the input sequence. During fine-tuning, the model is presented with a range of input-output pairs, each with a unique prefix, and learns to generate the corresponding output sequence. Using a prefix ensures that the model knows which task it is performing, allowing it to generalize to new tasks by simply adding a new prefix.

The base variant of the architecture, T5-Base, consists of 12 encoder and 12 decoder layers with 768 hidden units and 12 attention heads. In addition to the base model, T5 also includes larger models such as T5-Large, T5-3B, and T5-11B, with up to 11 billion parameters.
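As an illustration of the text-to-text interface, the hedged sketch below runs a pre-trained T5 checkpoint through the Hugging Face transformers library on a translation input; the checkpoint name and library availability are assumptions of this example. The task prefix at the start of the string is what tells the model which task to perform.

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("t5-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

    # The prefix "translate English to German:" selects the task.
    text = "translate English to German: The water flows to the leaf."
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Swapping the prefix (e.g., "summarize:") reuses the same model for a different task, which is the property referred to above.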
BART. The BART (Bidirectional and Auto-Regressive Transformer) [85] model is another transformer-based encoder-decoder architecture introduced by Facebook AI Research in 2019. Like other encoder-decoder models, BART consists of an encoder network that generates a fixed-size representation of the input sequence and a decoder network that generates the output sequence based on the encoder representation. However, BART differs from other models in its pre-training and fine-tuning procedures.

BART is pre-trained using a denoising autoencoder objective, where the model is trained to reconstruct a noisy version of a text sequence. This involves randomly masking some tokens in the input sequence and training the model to predict the original sequence from the masked version. Additionally, BART is trained using both bidirectional and left-to-right language modeling objectives, allowing it to capture both global and local context information.

BART also uses a different fine-tuning procedure than other models. Instead of fine-tuning the entire model on a specific task, BART fine-tunes only the decoder network while keeping the encoder fixed. This is done by adding a task-specific output layer on top of the decoder network and fine-tuning it on the task-specific data. This approach is known as "discriminative fine-tuning" and has improved performance on downstream tasks while reducing the risk of overfitting.

BART uses a transformer architecture similar to that of the T5 model, with both base and large versions available. The base version consists of 6 encoder and 6 decoder layers with 768 hidden units and 12 attention heads, while the large version consists of 12 encoder and 12 decoder layers with 1024 hidden units and 16 attention heads.

2.1.4 Decoder-only Models

In contrast to encoder-only and encoder-decoder models, decoder-only models are designed to generate text without using an encoder network. Instead, these models use a large decoder network to generate text directly based on a given prompt or conditioning information.

Decoder-only models are trained using a variant of the autoregressive language modeling objective. The model is trained to predict the next token in a text sequence given the preceding tokens. However, unlike encoder-decoder models, decoder-only models do not require an encoder to generate the conditioning information for the decoder. Instead, the conditioning information is typically provided as a fixed-size vector or a sequence of tokens.

Decoder-only models have the advantage of being computationally efficient since they do not require an encoder network. However, they may struggle with tasks that require the model to generate text based on complex input sequences or long-term dependencies.

GPT. One popular type of decoder-only model is GPT (Generative Pre-trained Transformer) [128], introduced by OpenAI in 2018. GPT uses a transformer architecture similar to that of the encoder-decoder models but with only a decoder network. The model is pre-trained on a large corpus of text using the autoregressive language modeling objective and can then be fine-tuned on various natural language processing tasks. GPT is highly effective at various natural language generation tasks, including text completion, text generation, and dialog generation. GPT-3 contains 175 billion parameters, making it the largest and most powerful language model to date.
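For completeness, the following hedged sketch generates a continuation with a small decoder-only model (GPT-2) via the transformers library; greedy decoding is used for determinism, and the checkpoint name is only an example.

    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "Roots absorb water from the soil, and then"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=False)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))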
2.2 Learning Objectives

Cross-Entropy. Cross-entropy is a commonly used loss function in machine learning, particularly for classification tasks. Given a set of predicted probabilities ŷ_i and a set of true probabilities y_i, the cross-entropy between these two probability distributions is defined as:

    H(y, ŷ) = − Σ_{i=1}^{n} y_i log ŷ_i    (2.1)

where n is the number of classes in the classification problem. The cross-entropy measures the difference between the predicted probabilities and the true probabilities, with a lower value indicating better agreement between the two distributions.

In the context of neural network training, cross-entropy is used as a loss function to optimize the model's parameters. For example, in a classification problem with k classes, the cross-entropy loss for a single training example is defined as:

    L(y, ŷ) = − Σ_{i=1}^{k} y_i log ŷ_i    (2.2)

where y is a one-hot vector representing the true class label, and ŷ is the vector of predicted class probabilities output by the model. The overall cross-entropy loss for a batch of training examples is typically calculated as the mean of the individual losses:

    L(θ) = (1/m) Σ_{i=1}^{m} L(y_i, ŷ_i)    (2.3)

where θ represents the model's parameters, and m is the batch size. In practice, cross-entropy is a widely used loss function in deep learning. It is often used in combination with other techniques, such as gradient descent optimization and regularization, to train neural networks on a variety of tasks.

Triplet Loss. Triplet loss is a type of loss function used in metric learning, which aims to learn a mapping from input data to a metric space such that the distance between points in the metric space corresponds to their similarity in the input space. Triplet loss is commonly used in tasks such as face recognition, image retrieval, and person re-identification.

In triplet loss, the model is trained using triplets of samples: an anchor sample, a positive sample, and a negative sample. The goal of the model is to learn embeddings such that the distance between the anchor and the positive sample is minimized while the distance between the anchor and the negative sample is maximized. The triplet loss is defined as follows:

    L_triplet = max(0, d(a, p) − d(a, n) + α)    (2.4)

where a, p, and n are the embeddings of the anchor, positive, and negative samples, respectively, d(·, ·) is a distance function such as the Euclidean distance or cosine similarity, and α is a margin that defines the minimum distance that should be maintained between the anchor and negative samples.

In practice, triplet loss can be challenging to train, as it requires a careful selection of triplets to ensure that the loss is informative and not trivially satisfied. One common approach is to use hard negative mining, where the negative sample is selected to be the most difficult sample that violates the triplet constraint. Another approach is to use semi-hard negative mining, where the negative sample is selected to be close to the anchor but still violates the triplet constraint.
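Both objectives above have direct counterparts in standard deep learning toolkits. The snippet below is a small PyTorch sketch of the triplet loss in Equation 2.4 with the Euclidean distance; PyTorch also provides a built-in nn.TripletMarginLoss with the same semantics, and the cross-entropy of Equation 2.2 is available as F.cross_entropy.

    import torch
    import torch.nn.functional as F

    def triplet_loss(anchor, positive, negative, margin=1.0):
        """max(0, d(a, p) - d(a, n) + margin), averaged over the batch (Eq. 2.4)."""
        d_ap = F.pairwise_distance(anchor, positive)   # Euclidean distance d(a, p)
        d_an = F.pairwise_distance(anchor, negative)   # Euclidean distance d(a, n)
        return torch.clamp(d_ap - d_an + margin, min=0).mean()

    a, p, n = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128)
    loss = triplet_loss(a, p, n)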
2.3 NLP Tasks

Text Classification. Text classification is a fundamental task in natural language processing (NLP) that involves assigning predefined categories or labels to a given input text. Text classification has many real-world applications, such as spam filtering [94], sentiment analysis [119], and topic classification [123].

In text classification, the input text is usually represented as a sequence of tokens, such as words or subwords, and the task is to predict the most appropriate label or category for the input text from a predefined set of labels. The labels can be binary, such as positive or negative sentiment, or multiclass, such as different types of news articles.

To train a text classification model, we need a labeled dataset that includes a set of input texts and their corresponding labels. The dataset is usually divided into three sets: a training set, a validation set, and a test set. The training set is used to optimize the model parameters, the validation set is used to tune the hyperparameters of the model and prevent overfitting, and the test set is used to evaluate the final performance of the model on unseen data.

Question-Answering. Question answering (QA) is a task in natural language processing (NLP) that involves answering a given question based on a given context or document. In QA, the input consists of a question and a context, which is usually a paragraph or a document that contains the information necessary to answer the question. The task is to extract the answer from the context. QA can be categorized into two types: extractive and abstractive. In extractive QA, the answer is a span of text from the context, while in abstractive QA, the answer is free-form and is generated based on the context and the question. QA can be formulated as a machine learning task, where the model is trained on a dataset of question-answer pairs.

Semantic Role Labeling. Semantic role labeling (SRL) [105] is a task in natural language processing (NLP) that involves identifying the semantic roles played by different entities and events in a sentence. The task of SRL is to assign a semantic label to each constituent of a sentence, such as the subject, object, and predicate. These labels indicate the role of each constituent in the underlying event or action described by the sentence.

SRL is typically formulated as a supervised machine learning task, where the model is trained on a dataset of annotated sentences. The input to an SRL model is a sentence, and the output is a set of semantic labels that correspond to the roles of the constituents in the sentence. For example, in the sentence "John ate the pizza with a fork", the subject "John" is assigned the semantic role of agent, the object "the pizza" is assigned the semantic role of patient, and the prepositional phrase "with a fork" is assigned the semantic role of instrument.

2.4 Benchmarks

2.4.1 Question-Answering

There are several benchmarks for QA in NLP, which are used to evaluate the performance of different QA models.

Stanford Question Answering Dataset (SQuAD) [132]: In our research, we use SQuAD as part of our neural learning process. The Stanford Question Answering Dataset (SQuAD) is a benchmark for QA in NLP that consists of a set of questions and their corresponding context passages. The dataset contains over 100,000 question-answer pairs, and the questions are designed to test the model's ability to understand the context and extract the relevant information to answer the question.

2.4.2 Entity Tracking

Propara [32]: This dataset was created as a benchmark for procedural text understanding to track entities at each step of a process. Propara contains 488 paragraphs and 3,300 sentences with annotations that are provided by crowd-workers. The annotations (~81,000) are the locations of entities at each step of the process. The location can be either the name of the location, an unknown location, or specified as non-existence.

NPN-Cooking [16]: This is a benchmark containing textual cooking instructions.
Annotators have specified the ingredients of each recipe alongside the changes happening to each ingredient during the steps of the instructions. These changes are reported in categories such as location, temperature, cleanliness, and shape. We evaluate our model on the location prediction task of this benchmark, which is the hardest task due to having more than 260 candidate answers. We do not use the candidates to find the locations in our setting; instead, we find a span of the text as the final location answer. This is a relatively harder setting but more flexible and generalizable than the classification setting.

2.4.3 Procedural Summarization

RecipeQA: RecipeQA [184] is a dataset designed for multimodal comprehension of cooking recipes. It contains more than 36,000 question-answer pairs that are automatically generated from around 20,000 unique recipes with step-by-step instructions and images. The questions in RecipeQA require a joint understanding of multiple modalities, such as titles, descriptions, or images, and solving them involves capturing the temporal flow of events and making sense of procedural knowledge. The dataset provides a unique opportunity to study the ability of models to reason about procedural knowledge and to perform complex reasoning tasks that require multimodal comprehension.

CHAPTER 3
ENTITY TRACKING

In this chapter, we delve deeper into the task of entity tracking. Entity tracking aims to understand the evolution of entities inside a process. This can be studied by following the important properties of each entity after each action in the process. We provide a formal definition of the task in Section 3.1. This task poses unique challenges compared to the conventional reading comprehension task due to the dynamic nature of the world described in the text. We will discuss some of the main challenges of this task in Section 3.2.

Our study on the entity tracking task relies on the two benchmarks of Propara [32] and NPN-Cooking [16], both of which provide fine-grained annotation for the properties of entities after each step of the process. The evaluation of this task goes beyond the simple F1 accuracy measures for directly comparing the model decisions and ground truth and proposes the evaluation of high-level abstractions of the process, such as its conversions, inputs, and outputs. We discuss this in more detail in Section 3.3.

As the main goal of this thesis is to investigate approaches to benefit from the available semantic structures for the task, we propose two approaches to enhance pre-trained language models with knowledge about the relative time frame of events and semantic relations between entities and actions. We investigate the former in Section 3.5 and the latter in Section 3.6.

3.1 Task Definition

An example of a procedural text is shown in Figure 3.1. The example is taken from the Propara [32] dataset and shows the process of oil creation. At each row, the first column is the list of the sentences, each of which forms one step of the procedure. The second column contains the number of steps in the process, and the rest are the entities interacting in the process and their location at each step. The location of entities at step 0 is their initial location, which is not affected by this process. If an entity has a known or unknown location (specified by "?") at step 0, we call it an input.

Sentences                                         Step   plant                animal               bone       oil
(Before the process begins)                       0      ?                    ?                    -          -
Plants and animals die in a watery environment    1      watery environment   watery environment   -          -
Over time, sediments build over                   2      sediment             sediment             -          -
the body decomposes                               3      sediment             -                    sediment   -
Gradually buried material becomes oil             4      -                    -                    -          sediment

Figure 3.1 An example of procedural text and its annotation (location of objects). '-' means the entity does not exist; '?' means the entity's location is unknown.
The procedural reasoning task can be formally defined by a procedural text including m steps, S = {s1, s2, ..., sm}, a set of entities E = {e1, e2, ..., en}, where n is the number of entities, and a set of properties. Specifically, in the Propara dataset, the property of interest is only the location of the entities, P^L = {p^{L,0}_1, p^{L,1}_1, ..., p^{L,m}_n}, where p^{L,t}_j denotes the location of the j-th entity at step t. In Propara, the location prediction starts at step 0, which indicates the entity's location before the process begins. The location of an entity can either be known (represented by a string) or unknown (represented by "?"). Similar to prior research [170], the location property is used to infer a set of actions A = {a^1_1, a^1_2, ..., a^m_n}, where a^t_j denotes the action type applied to entity j at step t.

3.2 Challenges

Inferring actions and their impact on entities involved in a procedural text can be challenging in various aspects. First, there are dependencies between steps to be considered in predicting a plausible action set. For instance, an entity destroyed at step t of the process cannot be moved again at step t + 1. Second, some sentences contain ambiguous local signals by including multiple action verbs. For example, "The oxygen is consumed in the process of forming carbon dioxide.", where the oxygen is being destroyed, and the carbon dioxide is being created. Third, the sentences are incomplete in some steps. For instance, a step of the process might only indicate "is buried in mud", which cannot be understood without context. Fourth, finding the properties of some entities may require reasoning over both the global context and local relations. For instance, in the sentences "1. Magma rises to the surface. 2. Magma cools to form lava", the location of 'Lava' after step 2 should be inferred from the prior location of 'Magma', which is indicated in its previous step. Fifth, common sense is required to understand some consequences. For example, in Figure 3.1, step 3, one should use common sense to realize that the 'decomposing body' would expose the 'bones', which will be left behind in the 'sediment'. Sixth, understanding some relations requires advanced co-reference resolution. In Figure 3.1, step 4, a complex co-reference resolution is required to understand that the 'buried material' refers to both the 'plants' and the animals' 'bones' and that they are transforming into the 'oil'.

3.3 Task Evaluation

Sentence-level evaluation is introduced in [32] for the Propara dataset. This evaluation focuses on the following three categories:

• Cat1: Is e created (destroyed/moved) during the process?
• Cat2: When is e created (destroyed/moved) during the process?
• Cat3: Where is e created (destroyed/moved from or to) during the process?

Document-level evaluation is a more comprehensive evaluation process and was introduced later in [167] for the Propara benchmark.
Currently, this is the default evaluation in the Propara leaderboard, containing four criteria:

• What are the Inputs? Which entities existed before the process began and do not exist after the process ends?
• What are the Outputs? Which entities got created during the process?
• What are the Conversions? Which entities got converted to other entities?
• What are the Moves? Which entities moved from one location to another?

The document-level evaluation requires models to reformat their predictions in a tabular format, as shown in Table 3.1. At each row of this table, for each entity at a specific step, we can see the action applied to that entity, the location of that entity before that step, and the location of the entity after that step. The action takes values from a predefined set including "None", "Create", "Move", and "Destroy". The exact action can be specified based on the before and after locations.

Step   Entity   Action    Before   After
1      Water    Move      Root     Leaf
2      Water    Destroy   Leaf     -
1      Sugar    None      -        -
2      Sugar    Create    -        Leaf

Table 3.1 A sample annotation available in the Propara dataset, which is directly used for the document-level evaluation.
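Since the action label is fully determined by the before/after locations, the rule can be written down directly. The following is a small illustrative sketch (the function name is ours), consistent with the format of Table 3.1:

    NON_EXISTENT = "-"   # the entity does not exist at this point

    def infer_action(before, after):
        """Derive the document-level action from an entity's before/after locations."""
        if before == NON_EXISTENT and after != NON_EXISTENT:
            return "Create"
        if before != NON_EXISTENT and after == NON_EXISTENT:
            return "Destroy"
        if before != after:        # both locations exist but differ
            return "Move"
        return "None"              # unchanged (including staying non-existent)

    assert infer_action("-", "Leaf") == "Create"
    assert infer_action("Root", "Leaf") == "Move"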
3.4 Related Research

ScoNe [97], NPN-Cooking [16], bAbI [175], ProcessBank [13], and Propara [32] are benchmarks proposed to evaluate models on procedural text understanding. ProcessBank [13] contains procedural paragraphs and is mainly concentrated on extracting arguments and relations for the events rather than tracking the states of entities. ScoNe [97] aims to handle co-reference in a procedural text expressed about a simulated environment. bAbI [175] is a simpler machine-generated textual dataset containing multiple procedural tasks such as motion tracking, which has encouraged the community to develop neural network models supporting explicit modeling of memories [162, 146] and gated recurrent models [25, 64]. NPN-Cooking [16] contains recipes annotated with the state changes of ingredients on criteria such as location, temperature, and composition. Propara [32] provides procedural paragraphs and detailed annotations of entity locations and the status of their existence at each step of a process.

Some earlier models on procedural text understanding, inspired by the bAbI dataset, are the Memory Network architecture [162], the Recurrent Relational Network (RRN) [146], and gated recurrent models such as GRU [25] and Recurrent Entity Networks (EntNet) [64]. EntNet uses dynamic memory for the world's hidden state, which will be updated based on a gated mechanism at each step. RRN augments neural networks with the capacity to do multi-step relational reasoning while keeping a memory of hidden states (a separate memory per entity) at each step and updating that based on the memory representations and newly retrieved representations at each step.

Inspired by the Propara and NPN-Cooking benchmarks, recent research has focused on tracking entities in a procedural text. Datasets such as Procedural Cyber-Security text [122] and OpenPI [170] are further designed with a similar task in mind. To address the challenges in this type of reasoning over procedural text, various models have been proposed. Query Reduction Networks (QRN) [151] perform gated propagation of a hidden state vector at each step. The Neural Process Network (NPN) [16] computes the state changes at each step by looking at the predicted actions and involved entities. Prolocal [32] predicts locations and status changes locally based on each sentence and then globally propagates the predictions using a persistence rule. Proglobal [32] predicts the status changes and locations over the whole paragraph using distance values at each step and predicts the current status based on the current representation and the predictions of the previous step. ProStruct [167] aims to integrate manually extracted rules or knowledge-base information from VerbNet [150] as constraints to inject common sense into the model. KG-MRC [35] uses a dynamic knowledge graph of entities over time and predicts locations with spans of the text by utilizing reading comprehension models. Ncet [61] updates entity representations based on each sentence and connects sentences together with an LSTM. To ensure the consistency of predictions, Ncet uses a neural CRF over the changing entity representations. XPAD [34] is also proposed to make dependency graphs on the Propara dataset to explain the dependencies of events over time. Most recently, DynaPro [6] feeds an incremental input to pre-trained LMs' question answering architecture to predict entity status and transitions jointly. Recent models have further investigated integrating common sense (KOALA) [192], utilizing large generative language models (LEMON) [153], or using both the question answering setting and sequential structural constraints at the same time (CGLI) [99] to address this task.

Procedural reasoning has also been pursued within the multi-modality domain [184, 131, 5], which has the additional challenges of aligning the representation spaces of different modalities. Additionally, recent research has also investigated the causality of past events in enabling future events, through different datasets such as WIQA [169] and Trip [160] for story understanding.

3.5 Encoding Temporal Information

Current language models convey rich linguistic knowledge and can serve as a strong basis for solving various NLP tasks [95, 43, 187]. That is why most of the state-of-the-art models on procedural reasoning are also built on top of current language models [6, 61]. However, they do not contain a dedicated representation to understand the relative time of actions when encoding a procedural text. Here, we propose a new approach for feeding the procedural information into LMs in a way that the LM-based QA models are aware of the steps taken so far and can answer the questions related to each specific time frame in the procedure.

We propose the Time-Stamped Language Model (TSLM), which uses timestamp embedding to encode the past, current, and future time of events as a part of the input to the model. TSLM utilizes timestamp embedding to answer differently to the same question and context based on different steps of the process. As we do not manually change the portion of the input, our approach enables us to benefit from LMs pre-trained on other QA benchmarks by using their parameters to initialize our model and adapting their architecture by introducing a new embedding type. Here, we use RoBERTa [95] as our baseline language model.

We evaluate our model on two benchmarks, Propara [32] and NPN-Cooking [16]. Propara contains procedural paragraphs describing a series of events with detailed annotations of the entities along with their status and location. NPN-Cooking contains cooking recipes annotated with their ingredients and their changes after each step in criteria such as location, cleanliness, and temperature.

TSLM differs from previous research as its primary focus is on using pre-trained QA models and integrating the flow of events in the global representation of the text rather than manually changing the part of the input fed to the model at each step.
In contrast to DynaPro [6], we explicitly inject past, current, and future timestamps into the language model's input and implicitly train the model to understand the flow of events rather than manually feeding a different portion of the context at each step. TSLM outperforms the state-of-the-art models in nearly all metrics of the two different evaluations defined on the Propara dataset. Results show a 3.1% F1 score improvement and a 10.4% improvement in recall. TSLM also achieves the state-of-the-art result on location accuracy on the NPN-Cooking location change prediction task by a margin of 1.55%. In summary, our contribution is as follows:

• We propose the Time-Stamped Language Model (TSLM) to encode the meaning of past, present, and future steps in processing a procedural text with language models.
• Our proposal enables procedural text understanding models to benefit from LM-based QA models pre-trained on general-domain QA benchmarks.
• TSLM outperforms the state-of-the-art models on the Propara benchmark on both document-level and sentence-level evaluations. TSLM also improves the performance of state-of-the-art models on the location prediction task of the NPN-Cooking [16] benchmark.
• Improving over two different procedural text understanding benchmarks suggests that our approach is effective, in general, for solving problems that require the integration of the flow of events in a process.

3.5.1 Model Components

QA Adaptation To predict the status and the location of entities at each step, we model F with a question-answering setting. For each entity e, we form the input Q_e as follows:

$Q_e = \text{[CLS] Where is } e\,\text{? [SEP] } s_1 \text{ [SEP] } s_2 \text{ [SEP]} \dots s_n \text{ [SEP]}$   (3.1)

Although Q_e is not a step-dependent representation and does not incorporate any different information for each step, our mapping function needs to generate different answers for the question "Where is entity e?" based on each step of the procedure.

Paragraph (before the process starts): Roots absorb water from soil. The water flows to the leaf. Light from the sun and CO2 enter the leaf. Water, light, and CO2 combine into mixture. Mixture forms sugar.

                Participants
State number    Water   Light   CO2    Mixture   Sugar
State 0         Soil    Sun     ?      -         -
State 1         Root    Sun     ?      -         -
State 2         Leaf    Sun     ?      -         -
State 3         Leaf    Leaf    Leaf   -         -
State 4         -       -       -      Leaf      -
State 5         -       -       -      -         Leaf

Table 3.2 An example of procedural text and its annotations from the Propara dataset [32]. "-" means the entity does not exist. "?" means the location of the entity is unknown.

For instance, consider the example in Table 3.2 and the question "Where is water?"; our model should generate different answers at four different steps. The answer will be "root", "leaf", "leaf", and "non-existence" for steps 1 to 4, respectively. To model this, we create pairs (Q_e, t_i) for each i ∈ {0, 1, ..., n}. For each pair, Q_e is timestamped according to t_i using the Timestamp(.) function described in Sec. 3.5.1 and mapped to an updated step-dependent representation, $Q^{t_i}_e = \mathrm{Timestamp}(Q_e, t_i)$. The updated input representation is fed to a language model (here RoBERTa) to obtain the step-dependent entity representation, R^{t_i}_e, as shown in Equation 3.2. We discuss the special case of i = 0 in more detail in Sec. 3.5.1.

$R^{t_i}_e = \mathrm{RoBERTa}(Q^{t_i}_e)$   (3.2)
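As a concrete illustration of Equation 3.1 and the (Q_e, t_i) pairs, the sketch below builds one timestamped input per step for a toy two-step paragraph. The helper name, the whitespace tokenization, and the explicit [CLS]/[SEP] handling are simplified placeholders rather than the exact preprocessing used in our implementation.

```python
# A minimal sketch (illustrative helper, simplified whitespace tokenization):
# build one (tokens, timestamp values) pair per step t_i of the process.

def build_inputs(entity, steps):
    """Return one (tokens, timestamp_values) pair for each i in {0, ..., n}."""
    question = f"[CLS] Where is {entity} ? [SEP]".split()
    examples = []
    for i in range(len(steps) + 1):                  # i = 0 is the pre-process state
        tokens, stamps = list(question), [0] * len(question)   # 0 = question tokens
        for j, step in enumerate(steps, start=1):
            step_tokens = step.split() + ["[SEP]"]
            if i == 0:
                value = 2                            # state 0: whole paragraph is "current"
            elif j < i:
                value = 1                            # past
            elif j == i:
                value = 2                            # current
            else:
                value = 3                            # future
            tokens += step_tokens
            stamps += [value] * len(step_tokens)
        examples.append((tokens, stamps))
    return examples

steps = ["Roots absorb water from soil .", "The water flows to the leaf ."]
for tokens, stamps in build_inputs("water", steps):
    print(list(zip(tokens, stamps)))
```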
We use the step-dependent entity representation, R^{t_i}_e, and forward it to another mapping function g(.) to obtain the location and status of the entity e in the output. In particular, the output includes the following three vectors: a vector representing the predictions of the entity status S, a vector with each token's probability of being the start of the location span L, and a third vector carrying the probability of each token being the last token of the location span. The outputs of the model are computed according to Equation 3.3.

$\mathit{status}, \mathit{Start\_prob}, \mathit{End\_prob} = g(R^{t_i}_e)$   (3.3)

where R_e is the token-representation output of RoBERTa [95], and g(.) is a function we apply to the token representations to get the final predictions. We will discuss each part of the model separately in the following sections.

Figure 3.2 An example of timestamp embedding in a procedural text. The question is always ignored with the value "0". At each step i, the tokens from that step are paired with the "current" value, tokens from the steps before i are paired with the "past" value, and tokens from the steps after i are paired with the "future" value.

Timestamp Embedding The timestamp embedding adds the step information to the input Q_e so that it is considered in the attention mechanism. The step attention is designed to distinguish between current (what is happening now), past (what has happened before), and future (what has not yet happened) information. We use the mapping function Timestamp(.) on the pair (Q_e, t_i) to add a number along with each token in Q_e and retrieve the step-dependent input Q^{t_i}_e, as shown in Figure 3.2. The Timestamp(.) function integrates the past, current, and future representations into all of the tokens related to each part. It assigns the number 1 to past, 2 to current, and 3 to future tokens in the paragraph by considering one step of the process as the current event. These values are used to compute an embedding vector for each token, which is added to its initial representation as shown in Figure 3.3. The special number 0 is assigned to the question tokens, which are not part of the process timeline. For predicting State 0 (the process inputs), we set all the paragraph information as the current step.

Status classification To predict the entities' status, we apply a linear classification module on top of the CLS token representation in R_e as shown in Equation 3.4.

$\mathit{Attribute} = \mathrm{Softmax}(W^T R_e^{C})$   (3.4)

where R_e^C is the representation of the CLS token, which is the first token in R_e.

Span prediction We predict a location span for each entity at each step of the process. As shown in Equation 3.5, we follow the popular approach of selecting start/end tokens to detect a span of the text as the final answer. We compute the probability of each token being the start or the end of the answer span. If the index with the highest probability of being the start token is token_start and that of the end token is token_end, the answer location will be Location = P[token_start : token_end].

$\mathit{Start\_prob} = \mathrm{Softmax}(W^T_{start} R^{t_i}_e)$
$\mathit{End\_prob} = \mathrm{Softmax}(W^T_{end} R^{t_i}_e)$
$\mathit{token}_{start} = \arg\max_i \mathit{Start\_prob}$
$\mathit{token}_{end} = \arg\max_i \mathit{End\_prob}$   (3.5)
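The following PyTorch sketch shows one way such a timestamp embedding can be injected into a HuggingFace-style encoder and combined with the status head of Equation 3.4 and the span heads of Equation 3.5. The module layout, the three status classes, and the reliance on passing inputs_embeds (so that position and type embeddings are still added inside the encoder) are illustrative assumptions, not the exact TSLM implementation.

```python
import torch.nn as nn

class TimestampedQAHead(nn.Module):
    """Illustrative sketch: timestamp embedding added to the word embeddings,
    followed by a status classifier and start/end span-prediction heads."""
    def __init__(self, encoder, hidden_size=768, num_status=3):  # e.g. "-", "?", known location
        super().__init__()
        self.encoder = encoder                              # an assumed RoBERTa-like encoder
        self.timestamp_emb = nn.Embedding(4, hidden_size)   # 0=question, 1=past, 2=current, 3=future
        self.status_head = nn.Linear(hidden_size, num_status)
        self.start_head = nn.Linear(hidden_size, 1)
        self.end_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, timestamp_ids, attention_mask):
        # word embeddings + timestamp embedding; position/type embeddings are
        # added inside the encoder when inputs_embeds is passed (HuggingFace behavior).
        embeds = self.encoder.embeddings.word_embeddings(input_ids)
        embeds = embeds + self.timestamp_emb(timestamp_ids)
        hidden = self.encoder(inputs_embeds=embeds,
                              attention_mask=attention_mask).last_hidden_state
        status_logits = self.status_head(hidden[:, 0])        # CLS token (Eq. 3.4)
        start_logits = self.start_head(hidden).squeeze(-1)    # Eq. 3.5
        end_logits = self.end_head(hidden).squeeze(-1)
        return status_logits, start_logits, end_logits
```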
3.5.2 Training & Inference

Training We use the cross-entropy loss function to train the model. At each prediction for entity e at timestamp t_i, we compute one loss value loss_attribute for the status prediction and one loss value loss_location for the span selection. The value loss_location is the summation of the losses of the start-token and end-token predictions, $loss_{location} = loss_{location_{start}} + loss_{location_{end}}$. The final loss of entity e at time t_i is computed as in Equation 3.6.

$Loss^e_i = loss^e_{i,attribute} + loss^e_{i,location}$   (3.6)

Inference At inference time, we apply two different post-processing rules to the outputs of the model. First, we impose that the final selected location answer should be a noun phrase in the original procedure. Considering that a location span is a noun phrase, we limit the model to performing the softmax over tokens of noun phrases in the paragraph when selecting the start and end tokens. Second, we apply consistency rules to make sure that our predicted statuses of entities are consistent. We define the two following rules:

• An entity cannot be created if it has already been destroyed: if $S^{t_i}_e$ is an unknown or known location while $S^{t_{i-1}}_e$ is "non-existence" (the entity is created at step i), and $S^{t_j}_e$ is "non-existence" while $S^{t_{j-1}}_e$ is an unknown or known location (the entity is destroyed at step j), then i < j.
• An entity cannot be created or destroyed twice in a process: if $S^{t_j}_e$ and $S^{t_i}_e$ are both "-" while $S^{t_{j-1}}_e$ and $S^{t_{i-1}}_e$ are both either known or unknown locations, then i = j.

$S^{t_i}_e$ is the status of entity e at step t_i of the process.

Figure 3.3 An overview of the proposed model. The "Timestamp Embedding" module is introduced in this work, and the rest is taken from the basic language model architecture.

Model              Sentence-level                                    Document-level
                   Cat1    Cat2    Cat3    MacroAvg   MicroAvg       P      R      F1
ProLocal [32]      62.7    30.5    10.4    34.5       34.0           77.4   22.9   35.3
ProGlobal [32]     63.0    36.4    35.9    45.1       45.4           46.7   52.4   49.4
EntNet [64]        51.6    18.8    7.8     26.1       26.0           50.2   33.5   40.2
QRN [151]          52.4    15.5    10.9    26.3       26.5           55.5   31.3   40.0
KG-MRC [35]        62.9    40.0    38.2    47.0       46.6           64.5   50.7   56.8
NCET [61]          73.7    47.1    41.0    53.9       54.0           67.1   58.5   62.5
XPAD [34]          -       -       -       -          -              70.5   45.3   55.2
ProStruct [167]    -       -       -       -          -              74.3   43.0   54.5
DYNAPRO [6]        72.4    49.3    44.5    55.4       55.5           75.2   58.0   65.5
TSLM (Our Model)   78.81   56.8    40.9    58.83      58.37          68.4   68.9   68.6

Table 3.3 Results from the sentence-level and document-level evaluations on Propara. The Cat_i evaluations are defined in Section 3.5.3.1.

We do not apply an optimization/search algorithm to find the best assignment over the predictions according to the defined constraints. The constraints are only applied based on the order of the steps to ensure that the later predictions are consistent with the ones made before.

3.5.3 Experiments

Implementation Details We use the SGD optimizer implemented in PyTorch [124] to update the model parameters. The learning rate for the Propara implementation is set to 3e-4 and is updated by a scheduler with a 0.5 coefficient every 50 steps. We use 1e-6 as the learning rate and a scheduler with a 0.5 coefficient to update the parameters every ten steps for the NPN-Cooking implementation. The implementation code is publicly available on GitHub (https://github.com/HLR/TSLM). We use the RoBERTa [95] question-answering architecture provided by HuggingFace [178]. RoBERTa is pre-trained on SQuAD [132] and used as our base language model to compute the token representations.
Our model executes batches containing one entity at every step and makes updates based on the average loss of entities per procedure. The network parameters are updated after executing one whole example.

3.5.3.1 Evaluation

Propara: We have to process our (Status S, Location L) predictions at each step to generate a tabular format similar to Table 3.1. We define r^i_e as a row in this table that stores the predictions related to entity e at step t_i. To fill this row, we first process the status predictions. If the status prediction S is either "-" or "?", we fill that value directly into the after-location column. The before-location column value of r^i_e is always equal to the after-location column value of r^{i-1}_e. If the status is predicted to be a "Known Location", we fill the predicted location span L into the after-location column of r^i_e. The action column is filled based on the data provided in the before- and after-location columns. If the before location is "-" and the after location is not "-", then the action is "Create"; if the before location is not "-" and the after location is "-", then the action is "Destroy". If the before and after locations are equal, then the action is "None", and if the before and after locations are both spans and are different from each other, the action is "Move".

NPN-Cooking location changes: We evaluate our model on the NPN-Cooking benchmark by computing the accuracy of the predicted locations at the steps where the locations of ingredients change. We use the portion of the data that has been annotated with location changes to train and evaluate our model. In this evaluation, we do not use the status prediction part of the proposed TSLM model. Since training our model on the whole training set takes a very long time (around 20 hours per iteration), we use fewer samples for training. This practice is also used in other prior work [35].
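To make the Propara conversion described above concrete, the sketch below derives the action column from a sequence of predicted after-locations for a single entity. The helper name and the list-of-strings data format are illustrative, not the exact post-processing code.

```python
def fill_table(after_locations):
    """Sketch: derive Propara-style (before, after, action) rows from the
    predicted after-locations of one entity at states 0..n.
    '-' = non-existence, '?' = unknown location. Illustrative helper only."""
    rows = []
    for step in range(1, len(after_locations)):
        before, after = after_locations[step - 1], after_locations[step]
        if before == "-" and after != "-":
            action = "Create"
        elif before != "-" and after == "-":
            action = "Destroy"
        elif before == after:
            action = "None"
        else:                       # both are (possibly unknown) locations and differ
            action = "Move"
        rows.append({"step": step, "before": before, "after": after, "action": action})
    return rows

# Example: water in Table 3.2 -> Move, Move, None, Destroy, None
print(fill_table(["soil", "root", "leaf", "leaf", "-", "-"]))
```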
3.5.3.2 Results

The performance of our model on the Propara dataset [32] is quantified in Table 3.3. The results show that our model improves the SOTA by a 3.1% margin in the F1 score and improves the recall metric by 10.4% on the document-level evaluation. On the sentence-level evaluation, we outperform SOTA models by 5.11% in Cat1, 7.49% in Cat2, and by a 3.4% margin in the macro-average. We report Table 3.3 without considering the consistency rules and evaluate their effect in the ablation study in Sec. 3.5.3.3.

Model              Accuracy   Training Samples       Prediction task
NPN-Cooking [16]   51.3       ~83,000 (all data)     Classification
KG-MRC [35]        51.6       ~10,000                Span Prediction
DynaPro [6]        62.9       ~83,000 (all data)     Classification
TSLM (Our Model)   63.73      ~10,000                Span Prediction
TSLM (Our Model)   64.45      ~15,000                Span Prediction

Table 3.4 Results on the NPN-Cooking benchmark. The class prediction and span prediction tasks are the same but use two different settings: one selects among candidates, and the other chooses a span from the recipe. However, each model has used a different setting and a different portion of the training data. The information on the data splits was not available, which makes a fair comparison hard.

Criteria       Precision   Recall   F1
Inputs         71.3        89.8     79.5
Outputs        91.4        85.6     88.4
Conversions    56.7        57.7     57.2
Moves          56.0        40.5     47.0

Table 3.5 Detailed analysis of TSLM performance on the Propara test set on the four criteria defined in the document-level evaluation.

In Table 3.5, we report a more detailed quantitative analysis of the TSLM model's performance based on each criterion defined in the document-level evaluation. Table 3.5 shows that our model performs best on detecting the procedure's outputs and worst on detecting the moves. Detecting moves is inherently hard for TSLM as it predicts outputs based on the whole paragraph at once. Outperforming SOTA results on input and output detection suggests that the TSLM model can understand the interactions between entities and detect the entities that exist before the process begins. The detection of input entities is one of the weak aspects of previous research that we improve here. A recent unpublished work [192] reports better results than our model. However, its primary focus is on common-sense reasoning, and its goal is orthogonal to our main focus in proposing the TSLM model. Such approaches can later be integrated with TSLM to benefit from common-sense knowledge in solving the Propara dataset. The reason that TSLM performs better on recall and worse on precision is that our model looks at the global context, which increases recall and lowers precision when local information is strongly important. The same phenomenon (better recall) is observed in ProGlobal, which also considers global information as we do, compared to ProLocal.

Table 3.4 shows our results on the NPN-Cooking benchmark for the location prediction task. The results are computed by only considering the steps that contain a location change and are reported as the accuracy of predicting those changes. Our results show that TSLM outperforms the SOTA models by a 1.55% margin in accuracy even after training on only 15,000 training samples. To be comparable with the KG-MRC [35] experiment on NPN-Cooking, which is only trained on 10k samples, we also report the performance of our model trained on the same number of samples, where TSLM obtains a 12.1% improvement over the performance of KG-MRC [35].

3.5.3.3 Ablation Study

To evaluate the importance of each module, one at a time, we report the performance of TSLM after removing the noun-phrase filtering at inference, the consistency rules, the timestamp embedding, and the SQuAD [132] pre-training, and after replacing RoBERTa [95] with BERT [43]. These variations are evaluated on the development set of the Propara dataset and reported in Table 3.6. As stated before and shown in Table 3.6, the timestamp embedding cannot truly be removed, as it is the only part of the model enabling changes in the answer at each step. Hence, removing it prevents the model from converging and yields a 25% decrease in the F1 score. The simple consistency and span-filtering rules are relatively easy for the model to learn from the available data; therefore, adding them does not affect the final performance of the model. The TSLM_BERT experiment is designed to ensure a fair comparison with previous research [6], which has used BERT as its base language model. The comparison of TSLM_BERT to "- SQuAD Pre-training" and "- Timestamp Embedding" in Table 3.6 indicates that using RoBERTa instead of BERT is not as important as our main proposal (the timestamp encoding) in the TSLM model. Also, TSLM_BERT achieves a 66.7% F1 score on the Propara test set, which is 1.2% better than the current SOTA performance. By removing the SQuAD pre-training phase, the model performance drops by 10.6% in the F1 score.
This indicates that, despite the differences between procedural text understanding and general MRC tasks, it is quite beneficial to design methods that can transfer knowledge from other QA data sources to help with procedural reasoning. This is crucial since annotating procedural texts is relatively more expensive and time-consuming.

Model                     P      R      F1
TSLM_RoBERTa              72.9   74.1   73.5
- constraints             73.8   73.3   73.5
- noun-phrase filtering   73.5   73.3   73.4
- SQuAD Pre-training      78.8   52.2   62.8
- Timestamp Embedding     94.6   32.6   48.5
TSLM_BERT                 69.2   73.5   71.3

Table 3.6 Ablation study results on the development set of the Propara document-level task. "- constraints", "- noun-phrase filtering", and "- Timestamp Embedding" show our model's performance when removing those modules. "- SQuAD Pre-training" is when we do not pre-train our base language model on SQuAD. TSLM_BERT is when we use BERT as the base language model.

3.5.4 Discussion

We provide more samples to support our hypothesis about solving the procedural reasoning task and answer some of the main questions about the ideas presented in the TSLM model.

Why is the whole context important? The main intuition behind TSLM is that the whole context, not just the previous information, matters in reasoning over a process. Here, we provide some samples from Propara to show why this intuition is correct. Consider this partial paragraph: "Step i: With enough time the pressure builds up greatly. Step i+1: The resulting volcano may explode." Looking at the annotated status and location, the "volcano" is created at step i without even being mentioned in that step. This is only detectable by looking at the next step, which says "The resulting volcano...". As another example, consider this partial paragraph: "Step i: Dead plants form layers called peat. ... Step i+3: Pressure squeezes water out of the peat." The annotation indicates that the location of "water" changes to "peat" at step i, which is only possible to detect if the model is aware of the following steps indicating that the water comes out of the peat.

Positional embedding vs. timestamp encoding: As mentioned before, the whole context (future and past events) is essential for procedural reasoning at a specific step. However, the reasoning should focus on one step at a time, given the whole context. While positional encoding encodes the order of information at the token level for reasoning over the entire text, we need another level of encoding to specify the steps' positions (boundaries) and, more importantly, to indicate the step that the model should focus on when answering a question.

Advantages/Disadvantages of the TSLM model: TSLM integrates higher-level information into the token representations. This higher-level information can come from the event sequence (time of events), the sentence level, or any other source higher than the token level. The first advantage of TSLM is that it enables designing a model that is aware of the whole context, while previous methods had to customize the input at each step to contain only the information of earlier steps. Furthermore, TSLM enables us to use QA models pre-trained on other datasets without requiring us to retrain them with the added timestamp encoding. One main disadvantage of the TSLM model, which is natural given the larger context setting of this model, is that it is not sensitive to local changes, which is consistent with the observation in the comparison between the ProGlobal and ProLocal models.
3.6 Exploiting Semantic Parsers to Understand the Process

We discussed a series of challenges for the procedural reasoning task in Section 3.2. Except for common sense [192] and the ability to make consistent global decisions [61], the other challenges have at best been tackled indirectly in recent research [68, 49]; they have neither been addressed explicitly nor properly evaluated to measure the success in resolving them. Here, we evaluate whether semantic parsers can alleviate some of these challenges. Semantic parsers provide semantic frames identifying predicates and their arguments in a sentence. For instance, in the sentence "Move bag to the yard", "Move" is the predicate, and "bag" and "the yard" are the arguments with types "affected" (referred to as "Patient" in some other parsing formalisms) and "location", respectively. Such semantic information can help disambiguate multi-verb local connections between predicates and arguments [68]. It can also provide meaningful local relations, making it easier to connect global information to infer entities' states. For instance, in the sentence "Magma cools to form lava", "Magma" is noted as the "affected" and "lava" is the result of the predicate "form". This makes it easier to infer that the location of "lava" should match the last location of "magma".

For our study, we consider both classic semantic role labeling (SRL; https://demo.allennlp.org/semantic-role-labeling), based on [152], which is a relatively shallow semantic parsing model, and the deep semantic parser TRIPS (http://trips.ihmc.us/parser/cgi/parse) [53, 4]. To investigate the effect of semantic parsing on procedural reasoning, we analyze its effect both as a standalone symbolic model and when integrated into a neuro-symbolic model that combines semantic parsing with state-of-the-art neural models to solve the procedural reasoning task. First, we design a set of heuristics to extract a symbolic abstraction from the TRIPS parser, called PROPOLIS. We use this baseline to further showcase the effectiveness of semantic parsing information in solving the procedural task. Next, we integrate the semantic parsers with two well-established procedural reasoning neural backbones, namely NCET [61] and TSLM [49] (and its extension CGLI [99]), by encoding the semantic relations with a graph attention network (GAT) [154].

For our experiments, we use the Propara dataset [32], which introduces the procedural reasoning task over natural events described in English. We realized that the existing evaluation metrics of this dataset do not reflect the actual performance of the models and fail to identify the challenges and shortcomings of the models. Consequently, we propose new evaluation criteria to shed light on the differences between the models, even when they perform similarly based on the prior metrics. In summary, our contributions are (1) proposing a symbolic model (PROPOLIS) to solve the procedural reasoning task based on semantic parsing, (2) proposing a set of new evaluation metrics that can identify the strengths and weaknesses of the models, and (3) showcasing the benefits of integrating semantic parsing into neural models. The code and models proposed in this work are available on GitHub (https://github.com/HLR/ProceduralSemanticParsing).

3.6.1 Semantic Parsing

We investigate two different modeling approaches to solve this problem. First, we use a symbolic, parsing-based model, and second, we integrate semantic parsing with neural models. We use two different sources for semantic extraction: SRL and TRIPS.
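Before comparing the two parsers, the following toy sketch shows the kind of node/edge structure that can be extracted from a single semantic frame, using a hypothetical parse of "Move bag to the yard". The dictionary format and edge labels are illustrative; this is not the raw output of either SRL or TRIPS.

```python
# Illustrative sketch: turning one semantic frame into nodes and labeled edges.
frame = {                      # hypothetical parser output for "Move bag to the yard"
    "predicate": "Move",
    "arguments": [("bag", "affected"), ("the yard", "to_location")],
}

nodes = [frame["predicate"]] + [text for text, _ in frame["arguments"]]
# predicate-argument edges, labeled with the semantic role
edges = [(frame["predicate"], text, role) for text, role in frame["arguments"]]
# arguments of the same frame are also connected (argument-to-argument)
edges += [(a, b, "same_frame")
          for i, (a, _) in enumerate(frame["arguments"])
          for b, _ in frame["arguments"][i + 1:]]

print(nodes)    # ['Move', 'bag', 'the yard']
print(edges)    # predicate-argument and argument-argument connections
```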
Figure 3.4 The Semantic Role Labeling annotation for the sentence "Move the book in the shelf to the library" represented as a graph.

In general, SRL is coarse-grained and shallow compared to TRIPS. The connections in TRIPS are not limited to the pairwise connections between predicates and arguments but are extended to the semantic connections between any two words. Since TRIPS relies on a general-purpose ontology, it also augments the arguments and predicates with additional information about a set of possible features (mobility, container, negation) and a mapping of the words to hierarchical ontology classes (i.e., mapping "water" to "beverage"). SRL is centered around the semantic frames of the verbs (predicates) and identifies each predicate's main and adjunct (mainly time and location) arguments in the sentence. Figures 3.4 and 3.5 show examples of the SRL and TRIPS parses, respectively. The symbolic model only uses the TRIPS parser, as it provides more extended extractions and more meaningful relations, while both SRL and TRIPS are used for the integration with the neural baselines.

3.6.2 PROPOLIS: Symbolic Procedural Reasoning

We propose the PROPOLIS model, which solves the procedural reasoning task merely by symbolic semantic parsing. PROPOLIS operates on the TRIPS parse in three steps. First, it makes an abstraction over the original parse to summarize the information in the graph and include a smaller set of actions and changes in objects and their locations. Second, it uses a set of rules to transform the abstracted parses into clear actions and identifies the objects affected by the actions, using the semantic roles, while extracting an ending or starting location. Lastly, it performs global reasoning to connect the local decisions and produce a consistent sequential set of actions/locations for each entity of interest.

Figure 3.5 The TRIPS parse for the sentence "Move the book in the shelf to the library" represented as a graph with named edges.

3.6.2.1 Graph Abstraction

The original TRIPS parse includes many concepts and edges that do not directly affect entities' location or existence. Therefore, we make a more concise graph abstraction to facilitate processing the entities, actions, and locations. To obtain a more informative abstraction, firstly, the relevant classes of the TRIPS ontology are mapped to the action classes defined in the Propara dataset (Create, Move, or Destroy). For instance, the verb "flow" is first mapped to the "fluidic motion" class in the TRIPS ontology, which is a child of the "motion" class, and the "motion" class is mapped to the "Move" action in the Propara dataset. This helps distinguish the predicates that signal a change in the location or existence of objects. Second, the important arguments are identified in the parse, and the locations are extracted. The graph is decomposed into a set of events with their arguments. Each event may contain different roles such as "agent", "affected", "result", "to_location", "from_location", or other roles required by its semantic frame.

Main Predicate   Roles                        Decisions
Move             Affected, Agent              The "Affected" is being moved.
Move             Agent                        The "Agent" is being moved.
Destroy          Affected                     The "Affected" is being destroyed.
Create           Affected_Result, Affected    The "Affected_Result" is being created.
Create           Affected                     The "Affected" is being created.
Change           Affected, Res                The "Affected" is being destroyed, and the "Res" is being created.

Table 3.7 The list of rules used to evaluate the effect of actions on various roles of the semantic frame.
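A minimal sketch of how the templates in Table 3.7 can be applied, assuming an abstracted frame is given as a predicate class plus a set of roles. The rule ordering (more specific role sets first) and the helper names are illustrative, not the exact PROPOLIS rule engine.

```python
# Toy sketch of the rule templates in Table 3.7: map an abstracted frame
# (predicate class + observed roles) to a local decision. Illustrative only.
RULES = [  # checked in order, so more specific role sets come first
    ("Move",    {"Affected", "Agent"},           'The "Affected" is being moved'),
    ("Move",    {"Agent"},                       'The "Agent" is being moved'),
    ("Destroy", {"Affected"},                    'The "Affected" is being destroyed'),
    ("Create",  {"Affected_Result", "Affected"}, 'The "Affected_Result" is being created'),
    ("Create",  {"Affected"},                    'The "Affected" is being created'),
    ("Change",  {"Affected", "Res"},             'The "Affected" is destroyed, the "Res" is created'),
]

def local_decision(predicate, roles):
    """Return the first matching decision template, or None if no rule fires."""
    for pred, required, decision in RULES:
        if predicate == pred and required <= set(roles):
            return decision
    return None

print(local_decision("Move", {"Affected": "magma", "Agent": "pressure"}))
print(local_decision("Create", {"Affected": "lava"}))
```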
Because the TRIPS system handles much of the variation expected in sentence constructions, we can use a relatively compact specification for defining the events and relationships of interest while coping with fairly complex and nested formulations. We capitalized on the TRIPS ontology and parser to develop a compact and easy-to-maintain specification of event extraction rules. Instead of having to write one rule to match each keyword/phrase that could signify an event, many of these words/phrases have already been systematically mapped to a few types in the TRIPS ontology. For instance, demolish, raze, eradicate, and annihilate are all mapped to the TRIPS ontology type "ONT::DESTROY". In addition, the semantic roles are consistent across different ontology types. The parser handles various surface structures, and the logical form contains normalized semantic roles. For example, for the following phrasings:

• The bulldozer demolished the building
• The building was demolished
• The demolition of the building
• Building demolition

all the parses result in the same basic logical form with the semantic roles "AFFECTED: the building" and, where applicable, "AGENT: the bulldozer". Thus, we needed very few extraction rule specifications for each event type, covering a wide range of words and syntactic patterns.

3.6.2.2 Rule-based Local Decisions

We use a set of heuristic rules to map the abstracted graph onto actual actions over the entities of interest. The rules are written according to the semantic frames and the types of predicates and arguments in each parse. For instance, if a semantic frame is mapped to "Move" and has both the "agent" and "affected" arguments, then the "affected" argument specifies the object being moved. The same frame with only an "agent" argument indicates a move of the object in the "agent" role. Table 3.7 shows the most frequent templates we used to transform the local parses into actual decisions over the entities. To handle the location arguments from the parses, we also consider the two cases of "from_loc" and "to_loc". In the specific case of a "destroy" event, any location attached to the semantic frame is considered the "from_loc" of the item being destroyed.

3.6.2.3 Global Reasoning

The first two steps are based merely on the local sentence-level actions of each step. We need additional global reasoning over the whole procedure to predict the outputs. Global reasoning ensures that local decisions form a valid global sequence of actions for a given entity. For instance, if an entity is predicted to be destroyed at step 2 and moved at step 3, we consider the "destroyed" action a wrong local decision, since a destroyed object cannot move later in the process. The graph also contains passive indications of object location in phrases such as "the book on the shelf", or even indications of prior locations in the form of a "from_location" argument. These phrases do not generate actions but provide information that should be used in previous steps.
For example, if step t has a local prediction "Move" for entity e with no target location, and step t+1 has a "from_location" for entity e, then that "from_location" should be used as the target location of the "Move" action in the previous step. To perform the global reasoning over the local predictions, we first do a forward pass through the action and location predictions and ensure they are globally consistent. To do so, we start from the first predicted action and check the following for every next-step prediction (a sketch of this pass is given after these rules):

• If the current action is "None", then we skip this step.
• If the last observed action is "Create" or "Move":
  – If the current action is "Create" and the location of this action is the same as the last observed location, then the new "Create" action is transformed to "None".
  – If the current action is "Create" and the location of this action is different from the last observed location, then the new action is changed to "Move".
  – Otherwise, the new action is kept the same, and the last observed action is updated.
• If the last observed action is "Destroy":
  – If the current action is "Destroy" and it has a location different from the last observed location, then the action is changed to "Move".
  – If the current action is "Destroy" and it has a location similar to the last observed location, then the action is changed to "None".
  – Otherwise, the new action is kept the same, and the last observed action is updated.

After fixing the sequence of actions, we first check whether the entity gets created at any of the steps or is just moved or destroyed during the process. If the entity is not created, its initial location is equal to the first "from_loc" in any subsequent actions. We then use the following criteria to fix the locations in a forward pass over the local decisions:

• If the action is "Move" but there is no final location, the final location is the first "from_loc" from any of the subsequent actions before the next "Move" event.
• If the object is being "Moved", then its final location should be changed. If the action does not indicate a new location or the information is missing, we replace the final location with "?" to indicate an unknown location.
• If the action is "None", the last location is kept unchanged for the new step.
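The action-fixing forward pass can be summarized by the following simplified sketch. It covers only the action rules above (not the location fixing) and updates the last observed action for every non-"None" prediction, which is a simplifying assumption rather than the exact PROPOLIS logic.

```python
def make_consistent(actions, locations):
    """Sketch of the action-fixing forward pass: repair local (action, location)
    predictions so they form a valid global sequence for one entity."""
    last_action, last_loc = None, None
    fixed = []
    for action, loc in zip(actions, locations):
        if action == "None":
            fixed.append(("None", loc))
            continue
        if last_action in ("Create", "Move") and action == "Create":
            action = "None" if loc == last_loc else "Move"
        elif last_action == "Destroy" and action == "Destroy":
            action = "None" if loc == last_loc else "Move"
        if action != "None":
            last_action, last_loc = action, loc
        fixed.append((action, loc))
    return fixed

# e.g., a second "Create" at a new location is turned into a "Move"
print(make_consistent(["Create", "None", "Create"], ["soil", "soil", "leaf"]))
```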
Figure 3.6 The QA graph for the query "Where is the book?" and the sentence "Move the book on the shelf to the library".

3.6.3 Integration with Neural Models

Here, we investigate whether explicitly incorporating semantic parsers into neural models can help better understand the procedural text. We choose two of the recently proposed and most commonly used backbone architectures for procedural reasoning tasks, namely NCET [61] and TSLM [49] (and its extension CGLI [99]). Similar to [68], we rely on a graph attention network (GAT) to integrate the information from the semantic parsers into the neural baselines. Following [68], the nodes in this graph are either (1) predicates in the semantic frames, (2) mentions of entities of interest (exact match or co-reference), or (3) noun phrases in the sentence. An edge in the SRL graph exists between two nodes if they have a (predicate, argument) connection or if they are both part of the same verb semantic frame (argument to argument) [68]. It is relatively straightforward to build a semantic graph with the TRIPS parser because it outputs the parse as a graph. An edge is created between any pair of nodes (phrases) in the graph if any subsets of these two phrases are connected in the original parse. The edge types are preserved. Since not all the nodes in the original parse are present in the new simplified graph, we may lose some key connections. To fix this, if two nodes (phrases) are not connected in the new graph but were connected in the original one, we find the shortest path between them in the original parse and connect them with a new edge whose type is the concatenation of all the edge types on the path. Lastly, nodes are connected across sentences based on either an exact match or co-reference resolution.

Both the NCET and TSLM models are trained with cross-entropy to compute the loss for both actions and locations. The final loss of the model is calculated as $L_{total} = L_{action} + \lambda * L_{location}$, where λ is a balancing hyper-parameter.

3.6.3.1 Integration with NCET as Backbone

The NCET model uses a language model to encode the context of the procedure and compute representations for mentions of entities, verbs, and locations. These representations are used in two sub-modules to predict actions and locations. To integrate the semantic parsers with the NCET architecture, we use the output of the language model to initialize the semantic graph representations. Then multiple layers of a graph attention network (based on TransformerConv [154]) are applied to encode the graph structure. We combine the updated graph representations with the initial mention representations. These combined representations are later used in the subsequent prediction modules. More formally, we start by using a language model to encode the context of the process, $h' = \mathrm{LM}(S)$, where S is the procedure and h' is the embedding output of the language model. The representations are further encoded by a BiLSTM, $h = \mathrm{BiLSTM}(h')$.

Graph Attention Network Since each node in the semantic graph corresponds to a subset of tokens in the original paragraph, we use the mean of these tokens' representations to initialize the node embeddings, denoted as $v^0_i$. If the graph contains edge types, the edge between two nodes i and j is denoted by $e_{ij}$ and is represented by the average token embedding obtained through the same LM used for encoding the story, $e_{ij} = \mathrm{Mean}(\mathrm{LM}(e^{text}_{ij}))$. Lastly, we use C layers of TransformerConv [154] to encode the graph structure. TransformerConv uses the following formula to update the representation of the nodes $v_i$ in the graph:

$v^{l+1}_i = W_1 v^l_i + \sum_{j \in N(i)} \alpha_{i,j} \left( W_2 v^l_j + W_6 e_{ij} \right)$,

where N(i) represents the neighbors of node i in the graph, l is the layer, and the coefficient $\alpha_{i,j}$ is computed using the following formula:

$\alpha_{i,j} = \mathrm{softmax}\left( \frac{(W_3 v^l_i)^\top (W_4 v^l_j + W_6 e_{ij})}{\sqrt{d}} \right)$

Representing Mentions To integrate the semantic parses with the baseline model, we use the representations obtained from both the language model and the graph encoder to represent entities, verbs, and locations in the process. Mention representations are denoted by $r^m_t = [M(h^m_t); M(h^{g_m}_t)]$, where t is one step of the process, $h^m_t$ is the average representation of the tokens in the story corresponding to mention m in step t, $h^{g_m}_t$ is the average embedding of the nodes corresponding to mention m in step t, and the function M replaces the representation with zero if there is no mention of m in step t.
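A sketch of the C-layer graph encoder, assuming the TransformerConv operator from PyTorch Geometric. The layer count, hidden size, and number of heads below are illustrative rather than the tuned values used in our experiments.

```python
import torch
from torch_geometric.nn import TransformerConv

class SemanticGraphEncoder(torch.nn.Module):
    """Illustrative sketch of the C-layer TransformerConv encoder over the
    semantic graph; node/edge features come from the language model."""
    def __init__(self, hidden_size=768, num_layers=2, heads=4):
        super().__init__()
        self.layers = torch.nn.ModuleList([
            TransformerConv(hidden_size, hidden_size // heads,
                            heads=heads, edge_dim=hidden_size)
            for _ in range(num_layers)
        ])

    def forward(self, node_embs, edge_index, edge_embs):
        # node_embs: [num_nodes, hidden] (mean of the tokens each node spans)
        # edge_index: [2, num_edges]; edge_embs: [num_edges, hidden]
        for conv in self.layers:
            node_embs = conv(node_embs, edge_index, edge_attr=edge_embs)
        return node_embs

# toy usage with random features for a 3-node graph (edges 0->1 and 1->2)
x = torch.randn(3, 768)
edge_index = torch.tensor([[0, 1], [1, 2]])
edge_attr = torch.randn(2, 768)
print(SemanticGraphEncoder()(x, edge_index, edge_attr).shape)   # torch.Size([3, 768])
```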
Location Prediction We first encode the pairwise representation of an entity e and a location candidate lc at each step t, denoted by $x^{e,lc}_t = [r^e_t ; r^{lc}_t]$. Next, we use an LSTM to encode the step-wise flow of the pair representation and obtain $\bar{h}^{e,lc}_t = \mathrm{LSTM}(x^{e,lc}_t)$. Finally, the probability of each location candidate lc being the location of entity e at step t is calculated by a softmax over the potential candidates, $p^{e,lc}_t = \mathrm{Softmax}(W_{loc} \bar{h}^{e,lc}_t)$, where $W_{loc}$ contains the learnable parameters of a single multi-layer perceptron.

Action Prediction To predict the action for entity e at step t, we create a new representation for the entity based on its mention and the sentence verbs, denoted by $x^e_t = [r^e_t ; \mathrm{Mean}_{v \in np_{e_t}} r^v_t]$, where $np_{e_t}$ is the set of verbs whose corresponding node in the graph has a path to any node representing entity e in step t. The final representations are then produced using a BiLSTM over the steps, $h^e_t = \mathrm{LSTM}(x^e_t)$. Lastly, a neural CRF layer is used to consider the sequential structure of the actions by learning transition scores during the training of the model [61]. The set of possible actions is shown in Table 3.8.

Tag   Description
O_D   Entity does not exist after getting destroyed
O_C   Entity does not exist before getting created
E     Entity exists and does not change
C     Entity is created
D     Entity is destroyed

Table 3.8 The list of output tags/actions in the Propara dataset, introduced to facilitate a linear correlation between actions.

        Local   Global Loc   Global Ent             Global Loc and Ent   Ambiguous
        Both    Both         Actions   Locations    Both                 Actions
Train   885     367          438       340          114                  593
Dev     116     44           66        3            9                    76
Test    105     61           98        71           18                   110

Table 3.9 The number of decisions per evaluation category with the new decision-level metric. "Both" refers to both location and action decisions and is used since the number of those decisions is the same in most cases. The number of decisions in the "Global Ent" case can differ between actions and locations because this category also considers "destroy" events with no corresponding locations.
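The action head can be sketched as a BiLSTM followed by a linear-chain CRF over the tags of Table 3.8. The snippet below uses the third-party pytorch-crf package as one possible CRF implementation and illustrative sizes; it is not the exact NCET module.

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # pytorch-crf package; one possible CRF implementation

TAGS = ["O_C", "O_D", "E", "C", "D"]   # Table 3.8

class ActionTagger(nn.Module):
    """Sketch: BiLSTM over per-step entity representations, then a CRF that
    learns transition scores between the action tags."""
    def __init__(self, input_size=768, hidden_size=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_size, len(TAGS))
        self.crf = CRF(len(TAGS), batch_first=True)

    def forward(self, step_reprs, gold_tags=None):
        emissions = self.proj(self.lstm(step_reprs)[0])   # [batch, steps, num_tags]
        if gold_tags is not None:
            return -self.crf(emissions, gold_tags)        # negative log-likelihood loss
        return self.crf.decode(emissions)                 # best tag sequence per entity

x = torch.randn(1, 6, 768)                                # one entity, six steps
print(ActionTagger()(x))                                  # e.g. [[2, 2, 3, 2, 4, 1]] (untrained)
```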
3.6.3.2 Integration with TSLM as Backbone

We discussed the TSLM model in more detail in Section 3.5. The TSLM [49] model reformulates the procedural reasoning task as a question-answering problem. The model simply asks the question "Where is entity e?" at each step of the process. To include the context of the whole process when asking the same question at different steps, TSLM further introduces a time-aware language model that can encode additional information about the time of events. Given the new encoding, each step of the process is mapped to either past, present, or future. TSLM uses the answer to the question at each step to form a sequence of decisions over the location of entity e. To integrate the semantic graph with this model, we first extend the graph by adding a question node. The graph is then initialized using the time-aware language model. The encoded representations of the graph, after applying multiple layers of GAT, are combined with the original token representations and used for extracting the answer to the question.

Initial Representation For each entity e and timestamp t, the string "where is e? s1 s2 ... sm" is fed into the time-aware language model. Accordingly, the token representations for timestamp t are $h_t = \mathrm{LM}(S, t)$.

Graph Attention Module Inspired by [194], we add new nodes to the semantic graph to represent the question and each step of the process. We connect the question node to every node in the graph representing the entity of interest e, and each step node to all the tokens in its corresponding sentence. All the node embeddings are initialized with the average embedding of their corresponding tokens in the procedure. We use C layers of a graph attention network (TransformerConv), similar to Section 3.6.3.1, to encode the graph structure.

Location Prediction For predicting the locations of entities, that is, the answer to the question, we predict the answer among a set of location candidates. This is different from the common practice of predicting start/end tokens. We represent each location candidate by combining representations from both the graph and the time-aware language model, denoted by $r^{lc}_t = [h^{lc}_t ; g^{lc}_t]$, where $g^{lc}_t$ is the representation of lc from the last layer of the GAT and $h^{lc}_t$ is its representation from the time-aware language model. The answer is then selected by calculating a softmax over the set of location candidates, $p^{lc}_t = \mathrm{Softmax}(W_{location} r^{lc}_t)$.

Action Prediction Similar to the CGLI [99] model, we explicitly predict the actions of entities alongside the locations. First, the model extracts each timestamp's "CLS" token and builds sequential pairs $(CLS^e_t, CLS^e_{t+1})$. Then, it produces a change representation vector for each of these pairs, denoted by $r^e_t = F(CLS^e_t ; CLS^e_{t+1})$. Lastly, the sequence of $r^e_t$ logits is passed through the same neural CRF layer used by the NCET model, introduced in Section 3.6.3.1, to generate the final probability of the actions.

3.6.4 Evaluation

We use three evaluation metrics to analyze the performance of the symbolic, sub-symbolic, and neural baselines. The first metric is sentence-level and was proposed in [32]. The second metric is a document-level evaluation proposed by [167]. Both of these metrics evaluate higher-level procedural concepts that can be inferred from the model predictions rather than the raw decisions. These metrics give more importance to the actions compared to the location decisions. Although they can successfully evaluate some aspects of the models, they fail to measure the research progress in addressing the challenges of the procedural reasoning task. We extend these evaluations with a new decision-level evaluation metric that considers almost all model decisions with similar weights and evaluates the models based on the difficulty of the reasoning process.

Both sets of existing evaluation metrics of the Propara dataset do not directly evaluate the model's predictions but rather evaluate higher-level procedural concepts that can be inferred from the sets of decisions (i.e., an entity being an input/output). Given their evaluation criteria, one model may surpass another in the number of correct decisions but still obtain a lower performance. Therefore, we propose a new evaluation metric (decision-level) that directly evaluates the models' decisions. This evaluation metric is designed to consider the difficulty of the reasoning process and to help better identify the core challenges of the task. We divide the set of decisions into five categories based on the presence of the entity e and the location l at each step t. We denote any mention of e by m_e, any mention of l by m_l, the action for entity e at step t by tag^e_t, and the text of the current step by S_t. The following specifies the five categories and how a decision falls under each of them.
Local Decision: A decision where (1) m_e ∈ S_t, (2) m_l ∈ S_t, and (3) tag^e_t ∈ {Move, Create}.
Global Location Decision: A decision where (1) m_e ∈ S_t, (2) m_l ∉ S_t, and (3) tag^e_t ∈ {Move, Create}.
Global Entity Decision: A decision where (1) m_e ∉ S_t, (2) m_l ∈ S_t or l = "-", and (3) tag^e_t ∈ {Move, Create, Destroy}.
Global Entity and Location Decision: A decision where (1) m_e ∉ S_t, (2) m_l ∉ S_t, and (3) tag^e_t ∈ {Move, Create}.
Ambiguous Local Action: A decision where (1) m_e ∈ S_t and (2) S_t contains multiple action verbs.

Table 3.9 shows the detailed statistics of the number of decisions falling under each of these five categories for the Propara dataset. Evaluating the performance of models with the new decision-level metric clarifies the lower-level challenges in reasoning over the states and locations of entities simultaneously. Getting accurate predictions in any of these categories of decisions requires the models to have different reasoning capabilities.

The local decisions mostly require a sentence-level understanding of the action and its consequences. The global location decisions require reasoning over the current step and the ability to connect the local information to the global context. The predictions for the global entity category mostly require reasoning over complex co-references (we have already considered simple co-references such as pronouns as mentions of the entity) or the ability to recover missing pronouns in a sentence such as "Gradually mud piles over (them)". The global entity and location decisions are the most challenging cases, which require reasoning over local and global contexts, complex co-reference resolution, and the handling of missing pronouns. The ambiguous decisions mainly require local disambiguation of (entity, role, predicate) connections when multiple predicates are present in the sentence. Moreover, common sense is required for a subset of all the decision categories.

3.6.5 Experiments

Implementation details We use the PyTorch Geometric library (https://pytorch-geometric.readthedocs.io) to implement all the graph attention models and the HuggingFace library [177] to implement the language models. For the NCET model and its extensions based on semantic parsers, the best model is selected by a search over λ ∈ {0.3, 0.4} and the learning rate in {3e-5, 3.5e-5, 5e-5}. The number of graph attention layers is set to 2, and the batch size is set to 8 procedures. All models use BERT-base as the language model for encoding the context. We further use RAdam [93] to optimize the parameters of the language models, the LSTM, and the classifiers. For the CGLI method, we use the exact hyper-parameters specified in [99]. We further use 15 layers of graph attention network with the input taken from the fifth layer of the time-aware language model. The gradients from the graph attention network (GAT) do not back-propagate to the original language model and only affect the parameters of the GAT model.
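Before turning to the results, the categorization of Section 3.6.4 can be summarized as a small decision function. The sketch below is a simplification that assumes boolean mention flags and a fixed priority order among the categories; the actual bookkeeping operates on mention spans in the step text.

```python
def decision_category(entity_mentioned, location_mentioned, action, multi_verb_step):
    """Toy sketch of bucketing one gold decision into the categories of
    Section 3.6.4 (simplified; categories are applied in a fixed priority order)."""
    if entity_mentioned and multi_verb_step:
        return "Ambiguous Local Action"
    if entity_mentioned and location_mentioned and action in {"Move", "Create"}:
        return "Local"
    if entity_mentioned and not location_mentioned and action in {"Move", "Create"}:
        return "Global Location"
    if not entity_mentioned and location_mentioned and action in {"Move", "Create", "Destroy"}:
        return "Global Entity"
    if not entity_mentioned and not location_mentioned and action in {"Move", "Create"}:
        return "Global Entity and Location"
    return "Other"

print(decision_category(True, False, "Move", multi_verb_step=False))   # Global Location
```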
#Row  Models                       Sentence-level evaluation                       Document-level evaluation
                                   Cat1    Cat2    Cat3    Macro-avg  Micro-avg    Precision  Recall  F1
1     ProLocal                     62.7    30.5    10.4    34.5       34.0         77.4       22.9    35.3
2     ProGlobal                    63      36.4    35.9    45.1       45.4         46.7       52.9    49.4
3     KG-MRC                       62.9    40      38.2    47         46.6         64.5       50.7    56.8
4     PROPOLIS (ours)              69.9    37.71   5.6     37.74      36.67        70.9       50.0    58.7
5     NCET (re-implemented)        75.54   45.46   41.6    54.2       54.38        68.4       63.6    66
6     REAL (re-implemented)*       78.9    48.31   41.62   56.29      56.35        67.3       64.9    66.1
7     NCET + SRL (ours)            77.1    46.35   42      55.16      55.32        67.8       65.2    66.5
8     NCET + TRIPS (ours)          77.1    48.12   43.36   56.19      56.32        72.5       65.4    68.8
9     NCET + TRIPS (Edge) (ours)   75.68   47.6    45.71   56.33      56.37        69.9       65.5    67.6
10    NCET + PROPOLIS (ours)       78.54   48.69   44.26   57.16      57.31        74.6       65.8    69.9
11    DynaPro                      72.4    49.3    44.5    55.4       55.5         75.2       58      65.5
12    KOALA                        78.5    53.3    41.3    57.7       57.5         77.7       64.4    70.4
13    TSLM                         78.81   56.8    40.9    58.83      58.37        68.4       68.9    68.6
14    CGLI                         80.3    60.5    48.3    63.0       62.7         74.9       70      72.4
15    CGLI + TRIPS (ours)          80.62   58.94   49.08   62.88      62.68        74.5       68.5    71.4

Table 3.10 The table of results based on the sentence-level and document-level evaluations of the Propara dataset. * Since the code for the REAL model is not available, we have re-implemented the architecture based on the guidelines of the paper and communications with the authors. In the "+ PROPOLIS" setting, the graph is first abstracted using the PROPOLIS graph abstraction phase and then used instead of the TRIPS parse as input to the model.

Model             Local                Global Loc           Global Ent           Global Loc and Ent    Amb
                  L      Both   A      L      Both   A      L      Both   A      Both   L      A       A
KOALA             65.7   59.0   74.3   24.6   22.9   86.9   7.0    0.0    1.0    0.0    11.1   5.6     73.63
PROPOLIS          19.0   19.0   55.2   1.6    1.6    63.9   9.9    0.0    0.0    0.0    0.0    0.0     52.7
NCET              62.8   60.0   69.5   36.1   29.5   70.5   5.6    0.0    3.1    0.0    0.0    0.0     57.2
NCET + SRL        65.7   61.9   68.6   36.1   31.1   77.0   5.6    0.0    10.2   0.0    5.5    5.5     62.7
NCET + TRIPS      67.6   63.8   71.4   42.6   36.1   75.4   9.9    2.8    10.2   0.0    11.1   5.5     63.6
NCET + PROPOLIS   64.8   61.9   71.4   36.1   34.4   83.6   7.0    0.0    3.1    0.0    5.5    5.5     70.9
CGLI              62.9   54.3   65.7   59.0   50.8   75.4   19.7   11.3   19.4   11.1   27.8   22.2    70.0
CGLI + TRIPS      70.5   61.9   75.2   60.6   52.2   80.3   22.5   12.7   17.3   16.7   27.8   27.8    74.5

Table 3.11 The results of the models on the new extended evaluation metric (decision-level) in terms of accuracy (%). "A" means the action is correct, "L" means the location is correct, and "Both" means both the action and the location are correct. "Amb" refers to the local ambiguous cases.

3.6.5.1 Results

Here, we summarize the performance of strong baselines compared with the symbolic (PROPOLIS) and integrated models. Table 3.10 shows the performance of the models on the two conventional metrics of the Propara dataset, and Table 3.11 shows the performance of the models based on the decision-level metric. We summarize our findings as a set of question-answer pairs.

Q1. Can semantic parsing alone solve the problem reasonably? Based on Table 3.10, the PROPOLIS model outperforms many of the neural baselines (compare the document-level F1 score of row #4 to rows #1 to #3), showing that deep semantic parsing can, to some extent, provide a general solution for the procedural reasoning task without the need for training data. This model performs relatively well on action-based decisions (Cat1) but fails to extract the proper location decisions (Cat3). This is because many locations are inferred based on common sense rather than the verb semantic frames. Notably, the set of rules written on top of PROPOLIS is local and simple and can be further expanded to improve performance.
Table 3.11 further indicates that the predictions of the PROPOLIS model for the actions are much closer to the SOTA models than its predictions for the entities' locations. The good performance of PROPOLIS on the action decisions of the "Global Location" category further shows that the local context can mostly indicate the action, even if retrieving the result of the action (the location) requires more reasoning steps. Lastly, since PROPOLIS is a model built over local semantic frames, it dramatically fails to make accurate decisions when the entity does not appear in the sentence (Global Ent).

Q2. Can the integration of semantic parsing improve the neural models? We evaluate this based on the two strong baselines, NCET and TSLM. When semantic parsers are integrated into NCET, all three evaluation metrics improve (compare rows #7 to #10 with row #5). This improvement is even larger if the source of the graph is the abstracted parse from the PROPOLIS method (row #10). Semantic parsers improve NCET's performance in all categories of decisions, particularly on locally ambiguous sentences and on decisions requiring reasoning over global locations. Notably, the integration of PROPOLIS with the NCET model significantly boosts the ability to disambiguate local information in sentences with multiple action verbs. The integration of the semantic graph slightly hurts the performance of the CGLI baseline when using the conventional metrics (1%). However, it outperforms this baseline on "Cat3" (0.78%), which is the only evaluation that directly considers location predictions. Notably, the original CGLI model (baseline) uses classifiers pre-trained on SQuAD [132] to predict the start/end tokens of the locations in the paragraph (the answer to the question). However, since the integrated method extracts candidates from the graph in the form of spans, it cannot reuse the same pre-trained classifier parameters. This may contribute to the drop in performance, since CGLI performs 2% lower on the document-level F1 score when SQuAD pre-training is removed [99]. Despite the drop in performance based on the conventional metrics, the integrated QA model (CGLI + TRIPS) outperforms the baseline on almost all the criteria of the new evaluation (Table 3.11), especially on decisions that only require local reasoning or local disambiguation. This is due to the global nature of the TSLM (or CGLI) backbone, which predicts the locations based on the whole story and ignores many of the local signals, whereas the graph can help directly extract the local relations.

Q3. How can the decision-level metrics help understand models' weaknesses and strengths? Based on the results in Table 3.11, the NCET model is better at reasoning over the local context than the global context. It also becomes clear that although the TSLM (or CGLI) model can properly reason over multiple steps, it is not as competitive as the NCET model in the local cases. However, integrating semantic parsers improves the models, closing the gap on both the local and global aspects, and has a complementary influence on the initial performance of the baselines. As a general conclusion based on our new evaluation metrics, we can argue that the most challenging decisions are the ones that require reasoning over missing mentions of entities in the local context. Addressing this challenge may require external reasoning over common sense, performing complex co-reference resolution, or handling missing pronouns.
3.6.6 Discussion

Here, we discuss some of the potential concerns that may arise with the usage of symbolic systems such as TRIPS and with the new evaluation criteria.

Coverage and rule crafting of PROPOLIS. Our implementation of the symbolic method and the integrated models relies on the knowledge extracted from the very fine-grained semantics covered in TRIPS. Consequently, only a small mapping effort was needed to create such a system. The mapping between the actions in Propara and verbs is straightforward since verbs are automatically mapped onto ontological classes that provide the type of action based on the parse. Hence, defining the mapping rules for the most general relevant ontology types of verbs is sufficient, because all the descendant types will follow the same mapping (see Table 3.7). Additionally, the effort needed for the pre-processing and the design of the mapping rules is similar to the hyperparameter tuning of neural models. Since the mapping is based on common sense rather than trial and error, as in hyperparameter tuning, finding an optimal solution may even take less effort.

Out-of-vocabulary words in parses. TRIPS automatically maps words to ontology classes using WordNet [110]. This gives us considerable vocabulary coverage and reduces the OOV risk. TRIPS can identify the role of unseen words (not available in WordNet) based on the sentence syntax and will not produce errors when encountering unseen words. In the same way, PROPOLIS and the integrated models will not be affected.

Effectiveness of the new evaluation metric. The previously proposed high-level evaluations are strict and do not accurately reflect the quality and quantity of the lower-level model decisions. Thus, they do not adequately reveal the models' abilities. For example, two models may have the same performance value of 20% when compared on the high-level metrics, while their decision accuracies may be 60% and 10%. This issue is also reflected during training, when the models' performance on the high-level metrics remains the same across epochs despite the decisions on the training set continuing to improve. Therefore, it seems more appropriate to evaluate the models based on the same objective criteria used for training them (decision-level). However, the previously used metrics can serve as secondary evaluations to measure how well the models capture higher-level procedural concepts.

CHAPTER 4
MULTI-MODAL PROCEDURAL ABSTRACTION

Procedural abstraction is challenging due to the difficulty of identifying evolving events and entities. In contrast to the summarization task, where the summary represents the whole context and usually contains sentences from the original content, the abstraction is a very short description of the process that tries to capture its essential actions and objects in an abstract format. To investigate this task, we study a simplified version, where the abstract is a sequence of four short action phrases that should make sense in a specific chronological order with respect to the original text. Here, we use the RecipeQA benchmark and its Textual Cloze task. Figure 4.1 shows an example of the task. We discuss the task and our proposal for integrating a latent alignment between the abstract and the recipe steps in more detail in Section 4.1.

4.1 Task Definition and Challenges

We tackle the task of procedural reasoning in a multimodal setting for understanding cooking recipes. The RecipeQA dataset [184] contains recipes uploaded by internet users.
Thus, understanding the text is challenging due to the varied language usage and the informal nature of user-generated texts. The recipes are accompanied by user-provided images, which are taken in an unconstrained environment. This exposes a level of difficulty similar to real-world problems. The tasks proposed with the dataset include textual cloze, visual cloze, visual ordering, and visual coherence. Here, we focus on textual cloze. An example of this task is shown in Figure 4.1.

4.1.1 Task Definition
The task involves processing a set of multimodal instructions denoted by I. The input also includes three textual items from the question, denoted by q_i, q_j, q_k, and a placeholder, denoted by p_l, that is to be replaced by the correct answer. The index l can fall between any two of the question-item indexes, i.e., i < l < j or j < l < k. The task requires selecting the correct answer, denoted by a, from a set of four options, denoted by A = {a_1, a_2, a_3, a_4}, based on the given set of instructions.

Figure 4.1 A sample of the textual cloze task in RecipeQA, which is a multiple-choice question with one valid answer.

Let S denote the sequence of the three question items and the correct answer that correctly describes the steps of the recipe. Then, the task involves finding the correct answer a such that the sequence S is a valid description of the recipe. Mathematically, the task can be formalized as follows. Given a set of multimodal instructions I = {i_1, i_2, ..., i_n}, where each i_j is a combination of visual and textual information, and three textual question items q_i, q_j, and q_k with a placeholder p_l, the task is to select the correct answer a from the set of options A = {a_1, a_2, a_3, a_4} such that the sequence S = {q_i, q_j, q_k, a} (sorted by the item indexes) correctly describes the steps of the recipe, i.e., S is a valid recipe sequence:

a = arg max_{a_i ∈ A} P(S | I, a_i),

where P(S | I, a_i) denotes the probability of the sequence S given the set of instructions I and the option a_i.

4.2 Latent Alignment of Procedural Concepts
To design a semantically augmented model, we rely on the intuition that each given question item and each candidate answer describes exactly one step of the recipe. Hence, we design a model that makes explicit alignments between the candidate answers and each step and uses those alignment results together with the question information. This alignment space is latent because no direct supervision for the alignments is provided in the annotations.

Using multimodal information and representations by forming a joint space for comparison has been broadly investigated in recent research [66, 179, 86, 161, 189, 52, 166, 116]. Our work differs from those efforts in that we do not have direct supervision on the multimodal alignments. Moreover, the task we are solving uses the sequential nature of the visual and textual modalities as a weak source of supervision to build a neural model that compares the textual representations of the context and the answers given a question representation. Our model exploits the latent alignment space and the positional encoding of questions and answers while applying a novel approach for constraining the output space of the latent alignment. Moreover, we exploit cross-modality representations based on cross-attention to investigate the benefits of information flow between images and instructions. We compare our results to the baselines provided in [184] and achieve state-of-the-art results, improving by over 19%.
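As a minimal illustration of this formulation, answer selection reduces to an argmax over candidate-conditioned sequence scores (the scorer and the data layout below are hypothetical, not the model described in the next section):

import torch

def select_answer(score_fn, instructions, question_items, candidates, placeholder_pos):
    # question_items: list of (position, text); candidates: list of answer texts.
    # score_fn is a hypothetical scorer returning a scalar tensor approximating log P(S | I, a_i).
    scores = []
    for a in candidates:
        items = question_items + [(placeholder_pos, a)]
        sequence = [text for _, text in sorted(items)]  # order the items by position index
        scores.append(score_fn(instructions, sequence))
    return int(torch.argmax(torch.stack(scores)))       # index of the selected answer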
4.2.1 Forming Latent Alignments To incorporate the order of the sequence in question items and the placeholder, we utilize a one-hot encoding vector of positions to be concatenated with the candidate answers and question items’ representations. We give the instructions to a sentence splitter using Stanford Core NLP library [102]. The output is then tokenized by Flair data structure [3] and embedded with BERT [43]. The words’ embeddings are passed to an LSTM layer and the last layer is used as the instruction representation. We propose two different approaches to include image representations. These proposals are described in Section 4.3.2.1. An overview of our approach is shown in Figure 4.2. Question representation is the last layer of an LSTM on question items. The representation of each question item is the concatenated vector of a one-hot position encoding and word embed- ding obtained from BERT. The candidate answers’ representations are computed using the same approach. We concatenate the question representation to each instruction. Then, the similarity of each candidate answer and instruction is computed using the cosine similarity and forms a similarity matrix. We use S to denote the similarity matrix. The rows of this matrix are candidate answers, and the columns represent the recipe steps. The value of Sij indicates the similarity score 58 Figure 4.2 An overview of the proposed model. The Bert Embeddings are used in a frozen manner while the LSTM would further fine-tune the representation based on the target task. of candidate i and step j. For training the model, we define two different objectives directly applied to the similarity matrix. The textual cloze task does not have the direct supervision required for the alignment between candidates and steps, and our objective is designed to use the answer to the question to train this latent space of alignments. For imposing the constraint of the alignment to be disjoint between steps and candidates, one way is to simply compute the maximum of each row in the similarity matrix and use that as the aligned step for each candidate answer; However, we introduce constrained max-pooling that is a more sophisticated approach as shown in Figure 4.3. We compare these two alternatives in the experimental results. We apply an iterative process to select the most related pair of instructions (a column) and answer candidate (a row) while removing the related column and row each time until all candidate answers find their aligned instruction. We denote the final selected maximum scores by m = S1i1, S2i2, S3i3, S4i4 the index of the step with maximum alignment score with candidate cand for all pairs of candidates , where ic ∈ 1, number_of _steps is c and d, c ≠ d =⇒ ic ≠ id. Respectively, we define two following objectives. The first objective maximizes the distance between the maximum score of the correct answer and the maximum score of another random wrong answer candidate. Furthermore, fixing the instruction with the maximum alignment with the correct answer decreases the score of the other candidates’ alignments with that instruction. The second objective increases the maximum similarity score of the answer to approach one while 59 Figure 4.3 The matrix operation for constrained max-pooling. After selecting each maximum value in the available matrix, the row and the column corresponding to the value are removed from the matrix. decreasing the other maximum scores to be lower than 0.1. 
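Before formalizing the two objectives below, the constrained max-pooling selection described above can be sketched as follows (a minimal NumPy illustration of the iterative row/column removal in Figure 4.3, not the actual implementation):

import numpy as np

def constrained_max_pooling(S):
    # S: (num_candidates, num_steps) similarity matrix with num_steps >= num_candidates.
    # Returns a step index i_c for each candidate c such that c != d implies i_c != i_d.
    S = S.astype(float).copy()
    assignment = {}
    for _ in range(S.shape[0]):
        c, j = np.unravel_index(np.nanargmax(S), S.shape)  # largest remaining score
        assignment[c] = j
        S[c, :] = np.nan  # remove the candidate's row
        S[:, j] = np.nan  # remove the step's column
    return assignment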
Loss = max(0, S_{r i_r} − S_{a i_a} + 0.1) + Σ_{c=1, c≠a}^{4} max(0, S_{c i_a} − S_{a i_a} + 0.1)   (4.1)

Loss = (1 − S_{a i_a}) + Σ_{c=1, c≠a}^{4} max(0, S_{c i_c} − 0.1)   (4.2)

where a ∈ {1, 2, 3, 4} is the index of the correct answer and r is a random index from {1, 2, 3, 4} − {a}. The main difference between Objective 4.1 and Objective 4.2 is the regularization term on the selected instruction column of the alignment matrix.

4.3 Experiments and Discussion
4.3.1 Baselines
Hasty Student [172] is a simple approach that considers only the similarity between elements of the question and the candidate answers. This baseline fails to obtain good results due to the intrinsic nature of the task. Impatient Reader [65] computes attention from the answers to the recipe for each candidate and, despite being a more complex approach, still fails to obtain good results on the task. Moreover, the multimodal Impatient Reader approach uses both the instructions and the corresponding images.

Models                                                        Accuracy   p@2
Human                                                         73.6       -
Hasty Student                                                 26.89      -
Impatient Reader                                              28.03      -
Impatient Reader (multimodal)                                 29.07      -
Model-Obj 4.1                                                 46.35      78.7
Model-Obj 4.2                                                 43.36      -
Model-Obj 4.1 (multimodal)                                    45.41      77.5
Model-Obj 4.1 (multimodal) + LXMERT                           47.5       76.3
Model-Obj 4.1 (multimodal) + LXMERT - ConstrainedMaxPooling   46.9
Table 4.1 Evaluation on the test set of the RecipeQA dataset. The multimodal setting applies a simple late fusion, while LXMERT applies an early fusion of the modalities.

4.3.2 Results
The RecipeQA textual cloze task contains 7837 training, 961 validation, and 963 test examples. A learning rate of 4e-1 is used for the first half of the training iterations and 8e-2 for the second half. We use a momentum of 0.9 for all variations of our model. We train for 30 iterations with a batch size of 1 and optimize the weights using an SGD optimizer. For word embeddings, the pre-trained BERT embeddings in the Flair framework are used. For the image representations, ResNet50 [63] pre-trained on ImageNet [142] using the PyTorch library [125] is applied.

Table 4.1 presents the experimental results. We refer to the model variations that use the loss objective in Equation (4.1) as Model-Obj 4.1 and the ones that use the loss in Equation (4.2) as Model-Obj 4.2. Using the objective in Equation (4.1) yields better results in all experiments. This indicates the benefit of using the column-wise disjointness constraint on the similarity matrix. Also, using multimodal information yields a 1.12% improvement. We elaborate further on the comparison between multimodal and unimodal results in Section 4.3.3. Our PyTorch implementation is publicly available on GitHub1.
1https://github.com/HLR/LatentAlignmentProcedural

4.3.2.1 Multimodal Results
To investigate the usefulness of the images in solving the textual cloze task, we propose two different models that incorporate the image representations in addition to the textual information of the recipe steps. The first variation receives ResNet50 representations of the images and, after applying an LSTM layer, uses its last hidden state as the image representation. Finally, it concatenates the image representation with the question and instruction representations in the main architecture before applying the MLP and computing the cosine similarities. As shown in Figure 4.4, the second variation uses the more complex architecture introduced in LXMERT [166]. We modify the LXMERT architecture and apply it to the word embeddings and image representations to let information flow from one modality to the other.
The updated word embeddings and image representations are passed to an LSTM, and its last hidden state is used to represent the visual and textual information of a step. In the end, these representations are concatenated with each other and with the question representation to build the instruction vector representation. We report the results of these model variations in Table 4.1. Using the cross-modality representations based on LXMERT provides an effective way for information to flow between text and image and yields the best results.

Figure 4.4 Showing the use of LXMERT for integrating multimodal information on steps. The model parameters of LXMERT are frozen.

4.3.3 Discussion and Analysis
We performed a qualitative analysis of several examples and their results to better understand the behavior of the proposed model. Our model can detect almost all candidates that match the instructions (when multiple matches exist) but fails to choose the one that completes the sequence of the question items. This indicates a shortage of procedural cues inside our architecture, while the latent alignment itself proves to be practical. Analyzing the results, we found interesting cases where multimodal or unimodal architectures could yield more accurate predictions.

Multimodal -, Unimodal +:
• Images contain misleading information (see the example in Figure 4.5).
• Image quality is low.
• Images do not show the steps correctly.
• The text contains direct mentions of candidate answers.

Figure 4.5 Showcasing a setting where utilizing the vision modality misleads the model into choosing apple slices (object) rather than the cutting option (action).

Multimodal +, Unimodal -:
• The sequence of images provides detailed steps with good quality.
• The entities in the candidate answers are shown in the pictures but not in the text.
• The recipe instructions are very short, and the images provide more information.

In some cases, the multimodal information can fix the errors that result from ignoring the order of events in the proposed architecture. Our intuition is that, although the textual model does not contain information from previous steps, the images carry useful information about what has already been done. Figure 4.6 shows an example of this, where coreference resolution is required to answer the question correctly.

Figure 4.6 Showcasing a setting where the vision modality helps the model understand that "it" refers to bread rather than a sandwich.

Experimenting with ResNet101 for the multimodal architecture resulted in lower performance. We confirmed this by re-implementing the Hasty Student approach with ResNet101 on the visual coherence task (which reaches 68% accuracy with ResNet50) and obtained 35% lower performance than the model utilizing ResNet50. This may be due to the low quality of the images, which introduces extra noise when a deeper network is used. Thus, ResNet50 achieves better accuracy by producing more abstract representations of the images.

CHAPTER 5
NEURAL LEARNING WITH LOGICAL CONSTRAINTS
Recent advances in machine learning have proven very effective in solving real-world problems in various areas, such as vision and language. However, challenges remain. First, machine learning models mostly fail to perform well on complex tasks where reasoning is crucial [149], whereas human performance does not drop as much when more steps of reasoning are required.
Second, deep neural networks (DNNs) are known to be data-hungry, making them struggle on tasks where annotated data is scarce [87, 198]. Third, models often provide results that are inconsistent [88, 57] even when they perform well on the task. Prior research has shown that even large pre-trained language models that perform well on a specific task may suffer from inconsistent decisions and exhibit unreliability when attacked with adversarial examples and specialized test sets that evaluate their logical consistency [57, 113]. This is a major concern especially when interpretability is required [106] or when there are security concerns over applications relying on the decisions of DNNs [19].

To address these challenges, one direction that prior research has investigated is neuro-symbolic approaches as a way to exploit both symbolic reasoning and sub-symbolic learning. Here, we focus on a subset of these approaches for the integration of external knowledge in deep learning. Knowledge can be represented through various formalisms such as logic rules [67, 118], knowledge graphs [195], context-free grammars [41], algebraic equations [159], or probabilistic relations [28]. A more detailed investigation of available sources of knowledge and techniques to integrate them with DNNs is surveyed in [174, 36]. Although integrating knowledge into DNNs is done in many different forms, we focus on explicit knowledge about the latent and/or output variables. More specifically, we consider the type of knowledge that can be represented as declarative constraints imposed (in a soft or hard way) on the models' predictions, during training or at inference time. The term knowledge integration is used in the scope of this assumption in the remainder of this chapter.

In this chapter, we discuss our efforts to progress research toward integrating domain knowledge into deep neural models. Here, the domain knowledge is expressed as a set of constraints over the latent/output variables of the neural network. To facilitate research in this direction, we first propose a declarative framework, DomiKnowS. This framework aims to provide a declarative interface to define the knowledge of the task in terms of a graph of the concepts involved in the task, a first-order logic interface to define constraints over the concepts, an interface to design neural components, and an interface to design learning or execution objectives. DomiKnowS provides seamless integration of knowledge, in the form of constraints, with deep neural models by implementing generalizable conversions from the first-order logic constraints to the various formalisms used in the integration methods. We discuss this framework in more detail in Section 5.1. Implementing and designing DomiKnowS has been a large team effort; my contributions have been developing various techniques to enhance the generalizability of the framework, implementing and developing constraint integration methods, debugging the framework's capabilities, and showcasing its abilities on multiple challenging tasks.

Building on top of DomiKnowS, we provide the first standard collection of benchmarks (GLUECons) to evaluate the integration methods. This benchmark provides extensive evaluation criteria and task categorizations that help identify the advantages and disadvantages of various constraint integration methods.
GLUECons provides an overview of the existing approaches for constraint integration during training or inference on nine distinct tasks grouped into five categories based on the source of the constraints. We discuss this benchmark in Section 5.2. GLUECons is a large team effort comprising over 100 experiments conducted over various timelines. My contribution to the project involves proposing the need for the benchmark, leading the team effort, proposing new evaluation criteria, summarizing the results of the large experiment pool into discussion points, and implementing the experiments needed for several of the selected tasks in the benchmark.

5.1 Declarative Constraint Integration Framework
Current deep learning architectures are known to be data-hungry, with issues mainly in generalizability and explainability [120]. While these issues are hot research topics, one approach to address them is to inject external knowledge directly into the models when possible. While learning from examples revolutionized the way that intelligent systems are designed to gain knowledge, many tasks lack adequate data resources. Generating examples to capture knowledge is an expensive and lengthy process, and it is especially inefficient when such knowledge is available explicitly. Therefore, one main motivation of DomiKnowS is to facilitate the integration of domain knowledge in deep learning architectures, particularly when this knowledge is represented symbolically. We highlight the components of this framework that help combine learning from data with exploiting knowledge in learning, including 1) learning problem specification, 2) knowledge representation, and 3) algorithms for the integration of knowledge and learning. Currently, the DomiKnowS implementation relies on PyTorch and off-the-shelf optimization solvers such as Gurobi. However, it can be extended by developing hooks for other solvers and deep learning libraries, since the interface is generic and independent of the underlying computational modules.

In general, the integration of domain knowledge can be done 1) by using pretrained models and transferring knowledge [42, 113], 2) by designing architectures that integrate knowledge expressed in knowledge bases (KB) and knowledge graphs (KG) in a way that the KB/KG context influences the learned representations [185, 163], or 3) by using the knowledge explicitly and logically as a set of constraints or preferences over the inputs or outputs [89, 118, 115, 159]. Our current library aims at facilitating the third approach. While applying constraints on the input is technically trivial and could be done in a data pre-processing step, applying constraints over outputs and considering those structural constraints during training is a research challenge [118, 89, 60]. This requires encoding the knowledge at the algorithmic level. However, although such constraints can be expressed logically and symbolically, current machine-learning libraries lack a language to express this knowledge in a principled way. In the DomiKnowS library, the domain knowledge is provided symbolically by the user, using a logical language that we have defined.
This knowledge is used in various ways: a) as soft constraints, by considering the violations as a part of the loss function; this is done using a primal-dual formulation [118] and can be expanded to probabilistic and sampling-based approaches [183] or by mapping the constraints to differentiable operations [90]; b) by mapping the constraints to an integer linear program and performing inference-based training by masking the loss [60]. Independent of the training paradigm, the constraints can always be used as hard constraints during inference or not used at all. An interactive online demo of the framework is available at Google Colab1, and the framework is accessible on GitHub2.
1https://hlr.github.io/domiknows-nlp/
2https://github.com/HLR/DomiKnowS

5.1.1 Related Research
Integrating domain knowledge in learning relates to tools that try to express prior or posterior information about variables beyond what is in the data. This relates to probabilistic programming languages such as [126], Venture [103], Stan [22], and InferNet [112]. The logical expression of domain knowledge is used in probabilistic logical programming languages such as ProbLog [37], PRISM [147], and the recent neural extension of ProbLog, DeepProbLog [100], as well as in statistical relational learning tools such as Markov logic networks [45], probabilistic soft logic [18], and Bayesian Logic (BLOG) [109]; it is also loosely related to learning over graph structures [196]. Considering the structure of the output without its explicit declaration is addressed in structured output prediction tools [141]. This library is mostly related to previous efforts on learning-based programming and the integration of logical constraints in learning with classical machine learning approaches [136, 73, 74]. Our framework makes this connection to deep neural network libraries and arbitrarily designed architectures. The unique feature of our library is that the graph structure is defined symbolically based on the concepts in the domain. Unlike Torch-Struct [141], our library is independent of the underlying algorithms, and arbitrary structures can be expressed and used with various underlying algorithms. In contrast to DeepProbLog, we are not limited to probabilistic inference, and any solver can be used for inference depending on the training paradigm that is used for exploiting the logical constraints. Probabilistic soft logic is another framework that considers logical constraints in learning by mapping the constraint declarations to a hinge-loss Markov random field [11]. DRaiL is another declarative framework that uses logical constraints on top of deep learning and converts them to an integer linear program at inference time [191]. None of the above-mentioned frameworks accommodates working with raw sensory data, nor do they help in putting such data into an operational structure that can form the domain predicates and be used by learning modules; our framework tries to address that challenge. We support training paradigms that use inference as a black box, and in those cases, any constraint optimization, logical inference engine, or probabilistic inference tool can be integrated and used based on our abstraction and the provided modularity.

5.1.2 Declarative Learning-based Programming
We use the Entity-Mention-Relation (EMR) extraction task to describe the framework.
Given an input text such as "Washington is employed by Associated Press.", the task is to extract the entities and classify their types (e.g., people, organizations, and locations) as well as the relations between them (e.g., works for, lives in). For example, for the above sentence, [Washington] is a person, [Associated Press] is an organization, and the relationship between these two entities is work-for. We choose this task as it includes the prediction of multiple outputs at the sentence level, while there are global constraints over the outputs.3 In DomiKnowS, first, using our Python-based specification language, the user describes the problem and its logical constraints declaratively and independently of the solutions. Second, the user defines the necessary computational units (here, PyTorch-based architectures) and connects the solution to the problem specification. Third, a program instance is created to execute the model using a background knowledge integration method with respect to the problem description.
3Please note this is just an example of a learning problem and does not have anything to do with the main functionality of the framework.

5.1.2.1 Problem Specification
To model a problem in DomiKnowS, the user should specify the problem domain as a conceptual graph G(V, E). The nodes in V represent concepts, and the edges in E are relationships. Each node can take a set of properties P = {P_1, P_2, ..., P_n}. Later, the logical constraints are expressed using the concepts in the graph. In the EMR task, the graph contains some initial NLP concepts such as sentence, phrase, and pair, and additional domain concepts such as people, organization, and work-for.

Concepts. Each problem definition can contain three main types of concepts (nodes).

Basic Concepts define the structure of the input of the learning problem. For instance, sentence, phrase, and word are all basic concepts that can be defined in the EMR task.

Compositional Concepts are used to define many-to-many relationships between the basic concepts. Here, the pair concept in the EMR task is a compositional concept. It is used as the basic concept for the relation extraction task.

Decision Concepts are derived concepts that are usually the outputs of the problem and subject to prediction. They are derived from basic or compositional concepts. The people, organization, and work-for concepts are examples of derived concepts in the EMR conceptual graph.

Following is a partial snippet showing the definition of basic and compositional concepts for the EMR task.

word = Concept(name='word')
phrase = Concept(name='phrase')
sentence = Concept(name='sentence')
pair = Concept(name='pair')

The following snippet shows the definition of some derived concepts in the EMR example.

entity = phrase(name='entity')
people = entity(name='people')
org = entity(name='organization')
location = entity(name='location')
work_for = pair(name='work_for')
located_in = pair(name='located_in')

The entity, people, organization, and location concepts are derived from the phrase concept, and the rest are derived from the pair concept.

Edges. After defining the concepts, the user should specify the existing relationships between them as edges in the conceptual graph. Edges are used to either map instances from one concept to another or generate instances of a concept from another concept. DomiKnowS only supports a set of predefined edge types, namely is_a, has_a, and contains. is_a is automatically defined between a derived concept and its parent.
In the EMR example, there is an is_a edge between people and entity. The is_a edge is mostly used to introduce hierarchical constraints and to relate the basic and derived concepts. has_a connects a compositional concept to its components (also referred to as arguments). In the EMR example, the pair concept has two has_a edges to the phrase concept to specify the arg1 and arg2 of the composition. We allow an arbitrary number of arguments in a has_a relationship; see below.

pair.has_a(arg1=phrase, arg2=phrase)

The contains edge defines a one-to-many relationship between two concepts to represent a (parent, child) relationship. Here, the number of parents of a concept is not necessarily limited to one. Following is a sample snippet defining a contains edge between sentence and phrase:

sentence.contains(phrase)

Global Constraints. The constraint definition is where the prior knowledge of the problem is specified to enable domain knowledge integration. The constraints of each task should be defined on top of the problem specification using the concepts and relationships defined there. The constraints can be 1) automatically inferred from the conceptual graph structure, 2) extracted from a standard ontology formalism (here, OWL4), or 3) explicitly defined using the logical constraint language of DomiKnowS. The framework internally uses the defined constraints at training time or in the inference-time optimization, depending on the integration method selected for the task. Here is an example of a constraint written in DomiKnowS's logical constraint language for the EMR task:
4Web Ontology Language

ifL(work_for('x'), andL(people(path=('x', arg1)), organization(path=('x', arg2))))

The above constraint indicates that a work_for relationship only holds between people and organization. Other syntactic variations of this constraint are shown in the Appendix. To process constraints, DomiKnowS maps them to a set of equivalent algebraic inequalities or to their soft logic interpretation, depending on the integration method. We discuss this further in Section 5.1.3.2. Following is an example of the mapping between the ontology-based definition, the graph structure, and the logical Python constraint. The same knowledge can be expressed as an OWL ontology axiom, as the equivalent graph structure definition:

work_for.has_a(arg1=people, arg2=organization)

or in DomiKnowS's constraint language representation:

ifL(work_for('x'), andL(people(path=('x', arg1)), organization(path=('x', arg2))))

All three forms above represent the same knowledge: a work_for relationship only holds between people and organization. In order to map this logical constraint to ILP, the solver collects sets of candidates for each concept appearing in the constraint. ILP inequalities are created for each combination of the candidate sets. The inner nested andL logical expression is translated to a set of three algebraic inequalities. A new variable (varAND) is created to transfer the result of the inner expression into the outer one.

varAND <= varPhraseIsPeople
varAND <= varPhraseIsOrganization
varPhraseIsPeople + varPhraseIsOrganization <= varAND + 1

The outer ifL expression is translated to a single algebraic inequality (referring to the variable varAND):

varPhraseIsWorkFor <= varAND

5.1.2.2 Model Declaration
The model declaration phase defines the computational units of the task. The basic building blocks of the model in DomiKnowS are sensors and learners, which are used to define either deterministic or probabilistic functionalities of the model.
Sensors and learners interact with the conceptual graph by defining properties on the concepts (nodes). Each sensor/learner receives a set of inputs, either from the raw data or from property values on the graph, and introduces new property values. Sensors are computational units with no trainable parameters, while learners are the ones that contain the neural models. As stated before, the model declaration phase only defines the connection of the graph properties to the computational units; the execution is done later by the program instances.

The user can use any deep learning architecture compatible with PyTorch modules, alongside the set of pre-designed and commonly used neural architectures currently in the framework. To facilitate modeling different architectures and computational algorithms in DomiKnowS, we provide a set of predefined sensors for basic mathematical operations and linguistic feature extraction. Following is a short snippet defining some sensors/learners for the EMR task.

phrase['w2v'] = FunctionalSensor('text', forward=word2vec)
phrase[people] = ModuleLearner('w2v', module=Classifier(FEATURE_DIM))
pair[work_for] = ModuleLearner('emb', module=Classifier(FEATURE_DIM*2))

In this example, the Word2Vec sensor is used to obtain token representations from the "text" property of each phrase. There is also a straightforward linear neural model to classify phrases and pairs into different classes such as people, organization, etc.

5.1.3 Learning & Evaluation in DomiKnowS
To execute the defined model with respect to the specified conceptual graph, DomiKnowS uses program instances. A program instance is responsible for running the model, applying loss functions, optimizing the parameters, connecting the output decisions to the inference algorithms, and generating the final results and metrics. Executing the program instance relies on the problem graph, the model declaration, dataloaders, and a backbone data structure called DataNode. The DataLoader provides an iterable object to loop over the data. The DataNode is an instance of the conceptual graph that keeps track of the data instances and stores the computational results of the sensors and learners. For the EMR task, the program definition is as follows:

program = Program(graph, poi=(sentence, phrase, pair), loss=NBCrossEntropyLoss(), metric=PRF1())

Here, the concepts passed to the poi field specify the training points of the program. This enables the user to train the task based on any subset of the concepts defined in the model.

The user should specify the domain knowledge integration method for each program instance. The available methods for integration are discussed in the next sections. After initializing the program, the user can call the train, test, and prediction functionalities to train and evaluate the designed model. The snippet below runs training and evaluation on the EMR task:

program.train(train_reader, test_reader, epochs=10, Optim=torch.optim.SGD(param, lr=.001))
program.test(new_test_reader)

Here, the user specifies the dataloaders for the different sets of data and the hyper-parameters required to train the model.

5.1.3.1 Program Composition
The program instances allow the user to define different training tasks without extra effort to change the underlying models. For example, one can define end-to-end models, pipelines, and two-step tuning paradigms just by defining different program instances and calling them one after another.
For instance, we can seamlessly switch between the following learning paradigm variations on the EMR task.

End-to-end training:

program = Program(graph, poi=(phrase, sentence, pair))
program.train()

Pre-train phrase, then train only on the pairs:

program_1 = Program(graph, poi=(phrase, sentence))
program_2 = Program(graph, poi=(pair))
program_1.train(); program_2.train()

Pre-train phrase and use the result in end-to-end training:

program_1 = POIProgram(graph, poi=(phrase, sentence), ...)
program_2 = POIProgram(graph, poi=(phrase, sentence, pair), ...)
program_1.train(...)
program_2.train(...)

5.1.3.2 Inference and Optimization
DomiKnowS provides access to a set of approaches to integrate background knowledge in the form of constraints on the output decisions or latent variables/concepts. Currently, DomiKnowS addresses three different paradigms for integration: 1) learning + prediction-time inference (L+I), 2) training-time integration with hard constraints, and 3) training-time integration with soft constraints. The first method, which we refer to as enforcing global constraints, can also be combined with and applied on top of the second and third approaches at inference time.

Prediction-time Inference: In the back-end of DomiKnowS, ILP5 solvers are used to make inferences under global linear constraints [138]. The constraints are denoted by C(y) ≤ 0. Without loss of generality, we denote the structured output as a binary vector y ∈ {0, 1}^n. Given local predictions F(θ) from the neural network, the global inference can be modeled as maximizing the combination of log-probability scores subject to the constraints [138, 60]:

F*(θ) = arg max_y log F(θ)^⊤ y   subject to   C(y) ≤ 0.   (5.1)

To handle constraints in ILP, we create variables for each local decision of the instances and transform the logical constraints into algebraic inequalities [136] in terms of those variables. Auxiliary variables are added to represent the nested constraints. The inference method can be extended to support other approaches, such as probabilistic inference and dynamic programming, in the future without modifying the other parts of the framework.
5Integer Linear Programming

Integration of hard constraints in training: Here, we use our proposed inference-masked loss (IML) approach [60], which constructs a mask over the local predictions based on the global inference results. The main intuition is to avoid updating the model based on local violations when the global inference can recover the true labels from the current predictions. Given the structured prediction F(θ) from a neural network and its global inference F*(θ) subject to the constraints, IML extends the negative log-likelihood as follows:

L_IML(F(θ), Y) = −((1 − F*(θ)) ⊙ Y)^⊤ log F(θ),   (5.2)

where Y is the structured ground-truth label vector and ⊙ denotes the element-wise product. We implemented L_IML(λ), which balances negative log-likelihood and IML with a factor λ, as introduced in [60]. IML works best for very low-resource tasks where label disambiguation cannot be learned from the data but can be done based on the available relational constraints between output variables. The constraint mapping for IML uses the same module in DomiKnowS that is implemented to use the global constraint optimization tool (here, ILP).
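A minimal sketch of the IML idea in Equation (5.2) follows (assuming binary gold and inference tensors of the same shape as the predictions; the interpolation form is one plausible reading of L_IML(λ), not the exact DomiKnowS implementation):

import torch

def iml_loss(log_probs, inferred, gold, lam=0.5):
    # log_probs: log F(theta); inferred: binary F*(theta) from global inference; gold: binary Y.
    nll = -(gold * log_probs).sum()                        # standard negative log-likelihood
    iml = -(((1.0 - inferred) * gold) * log_probs).sum()   # skip gold labels already recovered by inference
    return lam * nll + (1.0 - lam) * iml                   # L_IML(lambda)-style interpolation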
Integration of soft constraints in training: We use the primal-dual formulation of constraints proposed in [117] to integrate soft constraints into the training of the models. Primal-Dual considers the constraints in neural network training by augmenting the loss function with Lagrangian multipliers Λ for the violations of the constraints by the set of predictions. The constraints are regularized through a hinge function over the constraint terms, [C(F(θ))]. The problem is formulated as a min-max optimization that maximizes the Lagrangian function over the multipliers to enforce the constraints and minimizes it over the parameters of the neural network. Instead of solving the min-max primal, we solve the max-min dual of the original problem:

max_Λ min_θ L(F(θ), Y) + Λ^⊤ [C(F(θ))].   (5.3)

During training, we optimize by alternating minimization and maximization steps. With the Primal-Dual strategy, the model learns to obey the constraints without requiring any additional inference. Primal-Dual is less time-consuming at prediction time than the previous methods, as it does not need an additional inference-time optimization phase. It can also be used in a semi-supervised setting, exploiting domain knowledge instead of labeled data. Handling constraints in Primal-Dual is done by mapping them to their respective soft logical interpretations [118]. It is an open research topic to identify which of the integration methods performs best for different tasks. However, DomiKnowS makes it effortless to use one problem specification and run all the aforementioned methods.

5.2 Constraint Integration Benchmark
Hurdle of Knowledge Integration. Unfortunately, most prior research on knowledge integration has focused only on evaluating the proposed method against baseline DNN architectures that ignore the knowledge. Consequently, despite each method providing evidence of its effectiveness [67, 118], there is no comprehensive analysis that can provide a better understanding of the use cases, advantages, and disadvantages of the methods, especially when compared with each other. The lack of such analysis has made it hard for a broader community to apply these approaches to a more diverse set of tasks and to provide a clear comparison with existing methods. We mainly attribute this to three factors: 1) the lack of a standard benchmark with systematic baselines, 2) the difficulty of finding appropriate tasks where constraints are applicable, and 3) the lack of supporting libraries for implementing various integration techniques. Due to these three factors, many research questions are left open for the community, such as (1) the difference in the performance of models when knowledge is integrated during inference vs. training or both, (2) the comparison of the influence of integration methods when combined with simpler vs. more complex baselines, (3) the effectiveness of training-time integration models in reducing constraint violations, and (4) the impact of data size on the effectiveness of the integration methods.

Common Ground for Comparison. Our contribution is providing a common ground for comparing techniques for knowledge integration by collecting a new benchmark to facilitate research in this area.
Our new benchmark, called GLUECons, contains a collection of tasks suitable for constraint integration, covering a spectrum of constraint complexity, from basic linear constraints such as mutual exclusivity to more complex constraints expressed in first-order logic with quantifiers. We organize the tasks in a repository with a unified structure, where each task contains a set of input examples, their output annotations, and a set of constraints (written in first-order logic). We limit the scope of knowledge in GLUECons to logical constraints6.
6Throughout this chapter, we use the terms constraint integration, knowledge integration, and integration methods interchangeably to refer to the process of integrating knowledge into DNNs.

Selected Tasks. GLUECons contains tasks ranging over five different types of problems categorized based on the type of available knowledge. These include 1) Classification with label dependencies: mutual exclusivity in multiclass classification using MNIST [81] and hierarchical image classification using CIFAR-100 [77]; 2) Self-consistency in decisions: What-If Question Answering [168], Natural Language Inference [17], and BeliefBank [71]; 3) Consistency with external knowledge: entity and relation extraction using CoNLL2003 [144]; 4) Structural consistency: BIO tagging; 5) Constraints in the (un/semi)supervised setting: MNIST Arithmetic and Sudoku. These tasks either use existing datasets or are extensions of existing tasks, reformulated so that the usage of knowledge is applicable to them. We equip these tasks with constraint specifications and baseline results.

Evaluation. For a fair evaluation and to isolate the effect of the integration technique, we provide a repository of models and code for each task in both the PyTorch [125] and DomiKnowS [47] frameworks. As described before, DomiKnowS makes it easier to consistently test different integration methods while the rest of the configuration remains unchanged. For a more comprehensive evaluation, we introduce a set of new criteria in addition to the original task performances, measuring 1) the effectiveness of the techniques in increasing consistency with the knowledge, 2) the execution run-time, 3) the effectiveness of the methods in the low-data regime, 4) the ability to reduce the need for complex models, and 5) the ability to express various forms of knowledge.

Baselines. We analyze and evaluate a set of knowledge integration methods to serve as baselines for GLUECons. Our baselines cover a set of fundamentally different integration methods, where the integration is addressed either during inference or during training of DNNs. GLUECons can be used as a blueprint to highlight the importance of integrating constraints with DNNs for different types of tasks and provides inspiration for building such constraints when working on new tasks.

5.2.1 Constraint Integration in Prior Research
Knowledge integration is often considered a subset of neuro-symbolic approaches [39, 7, 69], which build on the intersection of neural learning and symbolic reasoning. One survey organizes prior research on knowledge integration along three directions: the knowledge source, the knowledge representation, and the stage of knowledge integration. Another study examines existing methods in which the integration is done by transforming either the input data, the loss function, or the model architecture itself.
Knowledge integration has also been investigated in probabilistic learning frameworks [38, 135, 12] and their modern extensions that use neural learning [101, 69, 176]. Recent research has explored knowledge integration by bypassing formal representations and expressing knowledge in natural language as a part of the textual input [143, 26]. As for formal representations, knowledge integration has been addressed at both inference [83, 148, 31] and training time [67, 118, 182].

Inference-Time Integration. The inference-based integration techniques optimize over the output decisions of a DNN, where the solution is restricted by a set of constraints expressing the knowledge [138, 23]. These methods aim at finding a valid set of decisions given the constraints, while their objective is formed using the output scores/probabilities generated by the learning models. As a result of this fixed objective, and because approximation approaches are generally used to find the best solution, we expect that the type of optimization technique will not significantly affect the performance of inference-time integration methods; our results, discussed in later sections, provide multiple pieces of evidence confirming this hypothesis. Prior research has investigated such integration using variants of beam search [62, 15, 31], path search algorithms [98], linear programming [138, 139, 23], finite-state/push-down automata [41], or gradient-based optimization at inference time [83, 84]. We use the Integer Linear Programming (ILP) approach [138, 139] to evaluate the integration of the constraints at inference time. We use off-the-shelf ILP tools that perform an efficient search and offer a natural way to integrate constraints. However, constraints must be converted to a linear form to exploit these tools [47, 76, 75].

Training-Time Integration. Several recent techniques have been proposed for knowledge integration at training time [118, 67, 182]. Using constraints during training usually requires finding a differentiable function expressing the constraint violation. This makes it possible to train the model to minimize the violations as a part of the loss function. Integrating knowledge into the training loop of DNNs is a challenging task. However, it can be more rewarding than the inference-based integration methods, as it reduces the computational overhead by alleviating the need for using constraints during inference. Although such methods cannot guarantee that the output decisions follow the given constraints without applying further operations at inference time, they can substantially improve the consistency with the constraints [88]. Prior research has investigated this through various soft interpretations of logic rules [118, 9], rule-regularized supervision [67, 59], reinforcement learning [186], and black-box semantic [182] or sampling [2] loss functions, which directly train the network parameters to output a solution that obeys the constraints. To cover a variety of techniques from previous research, we select the Primal-Dual (PD) [118] and Sampling-Loss (SampL) [2] methods as baselines for our new benchmark. The PD approach relies on a soft logic interpretation of the constraints, while SampL is a black-box constraint integration method.
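To make this contrast concrete, a grounded implication constraint such as work_for(x, y) ⇒ person(x) can be relaxed into a differentiable violation term under a Łukasiewicz-style soft logic (a simplified sketch; the exact surrogate used by PD in [118] may differ):

import torch

def implication_violation(p_antecedent, p_consequent):
    # Soft violation of "A implies B" given predicted probabilities.
    # Under a Lukasiewicz relaxation the implication holds with degree
    # min(1, 1 - p_A + p_B); the violation is the amount by which p_A exceeds p_B.
    return torch.relu(p_antecedent - p_consequent)

# During training, one such term per grounded constraint would be added to the task
# loss, weighted by a Lagrangian multiplier in the primal-dual scheme (or summed with
# a fixed weight in simpler regularization approaches).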
We discuss some of the existing methods in more detail in Section 5.2.4 (Baselines).

Applications and Tasks. Constraint integration has been investigated for several applications in prior research, including SQL query generation [148], program synthesis [10, 46], semantic parsing [27, 82], question answering [9], entity and relation extraction [59], sentiment analysis [67], visual question answering [69], image captioning [8], and even text generation [98].

5.2.2 Criteria of Evaluation
We extend the evaluation of the constraint integration methods beyond measuring task performance. The proposed evaluation criteria for such an extended comparison are as follows.

Individual metrics of each task: The first criterion for evaluating the methods is the conventional metric of each task, such as accuracy or precision/recall/F1 measures.

Constraint Violation: Even when the integration method cannot improve the model's performance, improving the consistency of its predictions will make the neural model more reliable. A consistency measure quantifies the success of the integration method in training a neural network to follow the given constraints. We measure consistency in terms of constraint violation. We compute the ratio of violated constraints over all predicted outputs. A smaller number indicates fewer constraint violations and, consequently, higher consistency with the available knowledge.

Execution Run-Time: Another critical factor in comparing the constraint integration methods is the run-time overhead. This factor becomes even more critical when the integration happens during inference. This criterion helps in analyzing the adequacy of each technique for each application based on the available resources and the time sensitivity of the decision-making for that application. We measure this criterion by simply computing the execution time of each integration method both during training and during inference. This metric can reflect the overhead of each integration method more accurately by taking into account the new parameters that must be optimized and the additional computations required with respect to the complexity of the constraints.

Low-data vs. full-data performance: For many problems, no large dataset is available, due either to the high cost or to the infeasibility of obtaining labeled data. Integrating constraints with deep neural learning has been most promising in such low-resource settings [118, 59]. We measure the improvement resulting from the integration methods on both low and full data. This evaluation helps in choosing the most impactful integration method based on the amount of available data when applying integration methods to a specific task.

Simple baseline vs. complex baseline: An expected impact of constraint integration in DNNs is to alleviate the need for a large set of parameters and achieve the same performance using a smaller/simpler model. Additionally, it is important to evaluate whether the integration method only affects smaller networks or whether very large SOTA models can be improved too. This indicates whether large networks/pre-trained models can already capture the underlying knowledge from the data or whether explicit constraint integration is needed to inject such knowledge. In addition to the number of parameters, this metric also explores whether knowledge integration can reduce the need for pre-training. This is especially important for the natural language domain, where large pre-trained language models prevail.
Constraint Complexity: This criterion evaluates the limitations of each method in integrating different types of knowledge. Some methods consider the constraints a black box with arbitrary complexity, while others may only model a specific form of constraint. This criterion specifies the form/complexity of the constraints that are supported by each technique. To evaluate this, we characterize a set of constraint complexity levels and evaluate whether each technique can model such constraints.

5.2.3 Selected Tasks
GLUECons aims to provide a basis for comparing constraint integration methods. We have selected or created a collection of tasks where constraints can potentially play an important role in solving them. We provide five different problem categories containing a total of nine tasks. More details of the tasks' constraints are available in the Appendix. This collection includes a spectrum ranging from very classic structured output prediction tasks, such as multi-class classification, to more involved structures and knowledge, such as entity and relation extraction and Sudoku.

5.2.3.1 Classification with Label Dependency
Simple Image Classification. In this task, we utilize the classic MNIST [40] dataset and classify images of handwritten digits in the range of 0 to 9. The constraint used here is the mutual exclusivity of the ten digit classes. Each image can only have one valid digit label, as expressed in the following constraint:

IF digit_i(x) ⇒ ¬ (∨_{j ∈ 0..9, j ≠ i} digit_j(x)),

where digit_i(x) is 1 if the model has predicted x to be an image representing the digit i. This task is used as a basic validation of the constraint integration methods, though it is not very challenging and can also be addressed by a "Softmax" function.

Hierarchical Image Classification. The hierarchical relationships between labels present a more complex label dependency in multi-label and multi-class tasks. We use CIFAR-100 [78], which includes 100 image classes, each belonging to one of 20 parent classes, forming a hierarchical structure. This dataset of 60k images is an extension of the classic CIFAR-10 [78]. To create a smaller dataset, we select 10% of these 60k images. For this task, the output is a set of labels for each image, including one label for each level. The constraints are defined as:

IF L1 ⊂ L2 : L1(x) ⇒ L2(x),

where L1 and L2 are labels, L1(x) is True only if the model assigns label L1 to x, and L1 ⊂ L2 indicates that L1 is a subclass of L2.

5.2.3.2 Self Consistency in Decisions
DNNs are subject to inconsistency over multiple decisions while being adept at answering specific questions [21]. Here, we choose three tasks to evaluate whether constraints help ensure consistency between decisions.

Causal Reasoning. WIQA [168] is a question-answering (QA) task that aims to find the line of causal reasoning by tracking the causal relationships between cause and effect entities in a document. The dataset contains 3993 questions. Following [9], we impose symmetry and transitivity constraints on the sets of related questions. For example, the symmetry constraint is defined as follows:

symmetric(q, ¬q) ⇒ F(q, C) ∧ ¬F(¬q, C),

where q and ¬q represent the question and its negated variation, C denotes the document, and ¬F is the opposite of the answer F.

Natural Language Inference. Natural Language Inference (NLI) is the task of evaluating a hypothesis given a premise, both expressed in natural language text.
Each example contains a premise (p), a hypothesis (h), and a label/output (l) that indicates whether h is "entailed," "contradicted," or "neutral" given p. Here, we evaluate whether NLI models benefit from consistency rules based on logical dependencies. We use the SNLI [17] dataset, which includes 500k examples for training and 10k for evaluation. Furthermore, we include AESIM 1000 [111], which is an augmented set over the original dataset containing more related hypothesis and premise pairs on which to enforce the constraints. Four consistency constraints (symmetric/inverse, transitive) are defined based on the (Hypothesis, Premise) pairs. An example constraint is as follows:

neutral(h, p) ⇒ ¬contradictory(p, h),

where neutral(h, p) is True if h is undetermined given p. The complete constraints are described in [111].

Belief Network Consistency. The main goal of this task is to impose global belief constraints that encourage models to hold consistent beliefs. As humans, when we reason, we often rely upon our previous beliefs about the world, whether true or false. We can always change our minds about previous information based on new information, but new beliefs should not contradict previous ones. Here, entities and their properties are used as facts. We form a global belief network that must be consistent with the implications derived from a given knowledge base. We use the BeliefBank [71] dataset to evaluate how well various techniques preserve consistency. The dataset consists of 91 entities and 23k (2k train, 1k dev, 20k test) related facts extracted from ConceptNet [158]. There are 4k positive and negative implications between the facts in the form of a constraint graph. For example, the fact "is a bird" would imply "can fly," and the fact "can fly" would refute the fact "is a dog". Formally, the constraints are defined as follows:

∀F1, F2 ∈ Facts:
IF (F1, F2) ∈ Pos Imp ⇒ ¬F1(x) ∨ F2(x)
IF (F1, F2) ∈ Neg Imp ⇒ ¬F1(x) ∨ ¬F2(x),

where "Pos Imp" denotes a positive implication.

5.2.3.3 Consistency with External Knowledge
This set of tasks evaluates the constraint integration methods in applying external knowledge to the DNNs' outputs.

Entity Mention and Relation Extraction (EMR). This task is to extract entities and their relationships from a document. Here, we focus on the CoNLL2003 [144] dataset, which contains about 1400 articles. There are two types of constraints involved in this task: 1) mutual exclusivity between entity/relationship labels, and 2) a restriction on the types of entities that may engage in certain relationships. An example constraint between entities and relationship types is as follows:

IF Work_for(x1, x2) ⇒ Person(x1) ∧ Org(x2),

where Predicate(x) is True if the network predicts input x to be of type Predicate.

5.2.3.4 Structural Consistency
In this set of tasks, we evaluate the impact of constraint integration methods in incorporating structural knowledge over the task's outputs.

BIO Tagging. The BIO tagging task aims to identify spans in sentences by tagging each token with one of the "Begin," "Inside," and "Outside" labels. Each tagging output belongs to a discrete set of BIO tags T ∈ ['O', 'I-*', 'B-*'], where '*' can be any type of entity. Words tagged with O are outside of named entities, while the 'B-*' and 'I-*' tags are used for an entity's beginning and inside parts. We use the CoNLL-2003 [144] benchmark to evaluate this task. This dataset includes 1393 articles and 22137 sentences.
The constraints of the BIO tagging task are the valid BIO sequential transitions; for example, the "before" constraint is defined as follows: IF I(x_{i+1}) ⇒ B(x_i), meaning a 'B-*' tag should appear before an 'I-*' tag, where x_i and x_{i+1} are any two consecutive tokens.

5.2.3.5 (Un/Semi) Supervised Learning

We select a set of tasks for which the constraints can alleviate the need for direct supervision and provide a distant signal for training DNNs.

Arithmetic Operation as Supervision for Digit Classification. We use the MNIST Arithmetic [14] dataset. The goal is to train the digit classifiers by receiving supervision merely from the sum of digit pairs. For example, for image pairs of 5 and 3 in the training data, we only know their labels' sum is 8. This dataset is relatively large, containing 10k image pairs for training and 5k for testing. This task's constraint forces the networks to produce predictions for pairs of images where the summation matches the ground-truth sum. The following logical expression is an example constraint for this task: S{img1, img2} ⇒ ⋁_{M = max(0, S−9)}^{min(S, 9)} ( M(img1) ∧ (S − M)(img2) ), where S{img1, img2} indicates that the given summation label is S and M(img_i) indicates that the i-th image has the label M.

Sudoku. This task evaluates whether constraint integration methods can help DNNs solve a combinatorial search problem such as Sudoku. Here, integration methods are used as an inference algorithm with the objective of solving one Sudoku, while the only source of supervision is the Sudoku constraints. As learning cannot be generalized in this setting, it should be repeated for each input. The input is one Sudoku table partially filled with numbers, and the task is to fill in a number in each cell such that "there should not be two cells in the same row/block/column with the same value," or, formally: IF digit_i(x) ∧ (same_row(x, y) ∨ same_col(x, y) ∨ same_block(x, y)) ⇒ ¬ digit_i(y), where x and y are variables ranging over the cells of the table, i ∈ [0, n] for an n ∗ n Sudoku, and digit_i(x) is True only if the value of x is predicted to be i. For this task, we use an incomplete 9 ∗ 9 Sudoku for the full-data setting and a 6 ∗ 6 Sudoku representing the low-data setting.
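As a concrete illustration of the arithmetic-supervision constraint above, the disjunction over M simply enumerates the digit pairs that are consistent with a given sum label. The sketch below is ours, not the benchmark code; it lists those pairs and computes the probability mass two digit classifiers place on them, assuming the two decisions are independent.

```python
def admissible_digit_pairs(s):
    # All (M, S - M) digit pairs allowed by the constraint, with M in [max(0, S-9), min(S, 9)].
    return [(m, s - m) for m in range(max(0, s - 9), min(s, 9) + 1)]

def constraint_probability(p1, p2, s):
    # Total probability the two independent digit classifiers assign to sum-consistent pairs;
    # a training-time integration method would push this quantity toward 1.
    return sum(p1[m] * p2[n] for m, n in admissible_digit_pairs(s))

print(admissible_digit_pairs(15))                    # [(6, 9), (7, 8), (8, 7), (9, 6)]
uniform = [0.1] * 10                                 # a hypothetical, completely uncertain classifier
print(constraint_probability(uniform, uniform, 8))   # 0.09: nine admissible pairs out of 100
```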
5.2.4 Baselines

For constraints during training, we use the two following approaches.

Primal-Dual (PD). This approach [118] converts the constrained optimization problem into a min-max optimization with Lagrangian multipliers for each constraint and augments the original loss of the neural models. This new loss value quantifies the amount of violation of each constraint by means of a soft logic surrogate. During training, the decisions are optimized by minimizing the original loss and the constraint violation, given the labels, while maximizing the Lagrangian multipliers to enforce the constraints. It is worth noting that all related work in which constraint violation is incorporated as a regularization term in the loss objective follows very similar variations of this optimization formulation.

Inference-Masked Loss (IML). This method [59] applies a mask over the loss of the erroneous instances for which an inference mechanism can recover the correct labels. This approach has been mainly proposed to avoid overfitting the training data when the number of training samples is low. By not updating all possible labels in the training set, the method relies on the existing constraints and the knowledge injected into the text to recover the correct labels by considering the correlation of different output variables in a structured prediction task. Although this method affects the neural network weights during training, it still requires applying inference-time methods at prediction time.

Semantic-Loss (SemL). This approach [182] proposes a semantic loss function to bridge between the neural decisions and the logical constraints. The loss function reflects the closeness of the neural network outputs to satisfying the constraints by calculating the satisfaction given any black-box solver. In this case, the authors have used logical circuits to implement an efficient search over the possible output space and find satisfying instantiations. After finding all the satisfying instantiations, the loss simply promotes the sum of the conditional probabilities of such instantiations, assuming that each output is an independent decision. The loss formulation is L_s(α, p) ∝ − log Σ_{x |= α} Π_{i: x |= X_i} p_i Π_{i: x |= ¬X_i} (1 − p_i), where α is the constraint, p is the set of probabilities of the outputs generated by the network, X is an instantiation of the outputs, and x |= α means that an instantiation x satisfies the constraint α.

Sampling-Loss (SampL). This approach [2] is an extension of the semantic loss [182] where, instead of searching over all the possibilities in the output space to find satisfying cases, it randomly generates a set of assignments for each variable using the probability distribution of the neural network's output. The loss function is formed as L_S(α, p) = ( Σ_{x_i ∈ X ∧ x_i |= α} p(x_i | p) ) / ( Σ_{x_i ∈ X} p(x_i | p) ), where X is the set of all possible assignments to all output variables, x_i is one assignment, and α is one of the constraints.

[Table 5.1: Baselines for each task. The basic models we used are RoBERTa [96], BERT [42], W2V [108], CNN [81], and MLP. The simple baseline means fewer parameters. KEYS: Cls.=Classification, Hier.=Hierarchical, NLI=Natural Language Inference, Rea.=Reasoning, Ari.=Arithmetic, BeliefNet=Belief Network, W2V=Word2Vec. ∗ For the Sudoku task, the model is not a generalizable DNN and relies on the integration methods as an inference algorithm to solve one specific table.]

To utilize the constraints during prediction, we use the following approaches.

Integer Linear Programming (ILP). ILP [138] is used to formulate an optimization objective in which we want to find the most probable solution for log F_θ^⊤ y, subject to the constraints. Here, y is the unknown variable in the optimization objective, and F_θ is the vector of the network's output probabilities for each variable in y. The constraints on y are formulated as C(y) ≤ 0.

Search and Dynamic Programming. For some of the proposed benchmarks, when applicable, we use A∗ search or the Viterbi algorithm to choose the best output at prediction time given the generated probability distribution of the final trained network [98].
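Before turning to the experiments, the semantic loss above can be made concrete with a small brute-force sketch of ours; the actual implementations rely on logical circuits or sampling precisely to avoid this enumeration. It assumes independent Boolean outputs and uses an `exactly_one` constraint as a running example.

```python
import itertools, math

def semantic_loss(probs, constraint):
    # probs[i]: predicted probability that Boolean output X_i is true.
    # constraint: function over a tuple of Booleans, True iff the assignment satisfies alpha.
    total = 0.0
    for assignment in itertools.product([False, True], repeat=len(probs)):
        if constraint(assignment):
            weight = 1.0
            for p, value in zip(probs, assignment):
                weight *= p if value else (1.0 - p)
            total += weight
    return -math.log(total + 1e-12)  # -log of the probability mass on satisfying assignments

# Mutual exclusivity over three labels: exactly one output may be true.
exactly_one = lambda xs: sum(xs) == 1
print(semantic_loss([0.8, 0.1, 0.1], exactly_one))  # small loss: mass mostly on valid assignments
print(semantic_loss([0.8, 0.7, 0.6], exactly_one))  # larger loss: mass on invalid assignments
```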
5.3 Experiments and Discussion

This section highlights our experimental findings using the proposed baselines, tasks, and evaluation criteria. Details on the experimental designs, training hyper-parameters, code, models, and results can be found on our website (https://hlr.github.io/gluecons/). The basic architectures for each task are shown in Table 5.1. The results of the experiments are summarized in Table 5.3. The columns represent evaluation criteria, and the rows represent tasks and their baselines. Each task's model 'row' records the strong/simple baseline's performance without constraint integration, and below that, the improvements or drops in performance after adding constraints are reported.

[Table 5.2: The limitations of the integration methods with respect to constraint types. KEYS: NC=Needs Conversion, NG=No Generalization, Mut Excl.=Mutual Exclusivity, Seq.=Sequential, Struc.=Structure, Lin.=Linear, Log.=Logical, Const.=Constraint, Quan.=Quantifiers, Prog Const=Any constraint encoded as a program; A∗ abbreviates the A∗ search algorithm.]

Here, we summarize the findings of these experiments by answering the following questions.

What are the key differences in inference-time and training-time integration performance? Notably, using ILP only at inference time outperformed the other baselines in most of the tasks. However, it fails to perform better than the training-time integration methods when the base model is wildly inaccurate in generating the probabilities for the final decisions. This phenomenon happened in our experiments in the semi-supervised setting and can be seen when comparing rows [#44, #45] to #46. In this case, inference alone cannot help correct the model, and global constraints should be used as a source of supervision to assist with the learning process. ILP performs better than the training-time methods when applied to simpler baselines (see the column 'Simple Baseline Performance'). However, the amount of improvement does not differ significantly when applying ILP to the simpler baselines compared to the strong ones. Additionally, the training-time methods perform relatively better on simpler baselines than on the strong ones (either the drop is smaller or the improvement is higher); compare the columns 'Strong Baseline' and 'Simple Baseline' for the '+PD' and '+SampL' rows.

How does the size of data affect the performance of the integration techniques? The integration methods are exceptionally effective in the low-data regime when the constraints come from external knowledge or structural information.
This becomes evident when we compare the results of 'EMR' and 'BIO tagging' with the 'Self Consistency in Decision Dependency' tasks in the column 'Low data Performance'.

[Table 5.3: Impact of constraint integration. F1, A, and CS denote the F1-measure, accuracy, and constraint-satisfaction metrics, respectively, showing model performance. The full data of the Sudoku task is a 9 ∗ 9 table. *: The 'Constraint Violation' values are calculated for the strong baselines trained with full data. ms: Run-Time is computed per example/batch and is reported in milliseconds. ↑ indicates an improvement and ↓ indicates a drop in performance compared to the initial Model. Run times are recorded on a machine with an Intel Core i9-9820X (10 cores, 3.30 GHz) CPU and a Titan RTX with NVLink as GPU. KEYS: EK=external knowledge.]

This is because such constraints can inject additional information into the models, compensating for the lack of training data.
However, when constraints are built over the self-consistency of decisions, they are less helpful in low-data regimes (rows #12 to #29), though a positive impact is still visible in many cases. This observation can be justified since there are fewer applicable global constraints between examples in the low-data regime. Typically, batches of the full data may contain tens of relationships leading to consistency constraints over their output, while batches of the low data may contain far fewer relationships. The same effect appears when smaller batch sizes are used for training.

Does constraint integration reduce the constraint violation? Since our inference-time integration methods search for a solution consistent with the constraints, they always have a constraint violation rate of 0%. Training-time integration methods, in contrast, cannot fully guarantee consistency. However, it is worth noting that these methods have successfully reduced the constraint violation in our experiments even when the performance of the models is not substantially improved or is even slightly hurt (see rows #18 and #20, rows #24 and #26, and rows #30 to #32). In general, SampL had a more significant impact than PD on making models consistent with the available task knowledge (compare the rows with '+PD' and '+SampL' in the column 'Constraint Violation').

How do the integration methods perform on simpler baselines? According to our experiments, there is a significant difference between the performance of the integration methods applied to simple and strong baselines when the source of the constraints is external (the BIO tagging, EMR, Simple Image Cls, and Hierarchical Image Cls tasks). Moreover, we find that ILP applied to a simple baseline can sometimes achieve a better outcome than a strong model without constraints. This is, in particular, seen in the two cases of EMR and Causal Reasoning, where the difference between the simple and strong baselines is in using a pre-trained model. Thus, explicitly integrating knowledge can reduce the need for pre-training. In such settings, constraint integration compensates for pre-training a network with vast amounts of data by injecting domain knowledge for specific tasks. Additionally, the substantial influence of the integration methods on simple baselines compared to strong ones in these specific tasks indicates that constraint integration is more effective when the knowledge has presumably not already been learned (at some level) from the patterns in the historical data used to pre-train large language models.
For example, the performance on the NLI task on low-data can yield over 10% improvement with the combination of PD and ILP, while ILP on its own can only improve around 7%. The rationale behind these observations needs to be further investigated. However, this can be attributed to better local predictions of the training-time integration methods that make the inference-time prediction more accurate. A more considerable improvement is achieved over the initial models when these predictions are paired with global constraints during ILP (see rows #16, #28, #29, and #41). What type of constraints can be integrated using each method? Table 5.2 summarizes the limitations of each constraint integration method to encode a specific type of knowledge. We have included “Softmax” in this table since it can be used to support mutual exclusivity directly in DNN. However, “Softmax” or similar functions are not extendable to more general forms of constraints. SampL is the most powerful method that is capable of encoding any arbitrary program as a constraint. This is because it only needs to evaluate each constraint based on its satisfaction or violation. A linear constraint can be directly imposed by PD and ILP methods. However, first-order logic constraints must be converted to linear constraints before they can be directly applied. Still, 8https://www.gurobi.com/ 94 PD and ILP methods fail to generalize to any arbitrary programs as constraints. The A∗ search can generally be used for mutual exclusivity and sequential constraints, but it cannot provide a generic solution for complex constraints as it requires finding a corresponding heuristic. [23] show A∗ with constraints can be applied under certain conditions and when the feature function is decomposable. 95 CHAPTER 6 PROCEDURAL REASONING WITH LOGICAL CONSTRAINTS This chapter discusses integrating constraints in procedural reasoning tasks, focusing on entity tracking. This task is a structured prediction challenge, where constraints help guide predictions about entities’ actions and locations. Our goal is to improve the accuracy of model predictions by using constraint integration techniques. Previous methods for tracking entities mainly used sequential inference, depending on hand- coded algorithms or CRF (Conditional Random Field) decoders [165] along with the Viterbi algorithm [54] to identify important sequences of actions. These approaches adjust entity locations to match the sequences of actions identified. Our work introduces a new way to consider constraints globally in structured prediction tasks through the use of DomiKnowS [47]. Our main contribution is a global consistency approach that ensures predictions about actions and locations are coherent during entity tracking. We use an Integer Linear Programming (ILP) [139] method to enforce this consistency during inference. We also use a generative model as a baseline to predict actions and locations. This model can predict locations not directly mentioned in the text and tackle different types of questions within a single architecture, like yes/no questions and extracting phrases. The generative model also allows for the consideration of more types of questions derived from the dataset’s labels. For example, by looking at location annotations, we can determine if an entity is an input or output in a process. 
Taking into account these additional questions and the original dependencies between the action and location queries, our evaluation framework checks how well the models understand the processes. We aim to improve both the coherence of the models and, possibly, their overall performance. Applying ILP to a large set of question types with various difficulty levels and potential output spaces further motivates us to investigate the success of such inference techniques in properly resolving inconsistencies between inherently different decisions. We further use the procedural reasoning task to investigate the potential limitations of the current formulation for applying the ILP method to the model-generated probabilities and propose enhancements to the formulation by considering additional factors such as the decision's prior probability, confidence (entropy), and accuracy. We evaluate the performance of the newly proposed inference method on a series of toy tasks regarding hierarchical classification and on the close-to-real-world task of entity tracking.

6.1 Procedural Reasoning Constraints

In this chapter, we examine the task of procedural reasoning with a focus on entity tracking. Our study uses the Propara dataset [33], which provides a series of procedural steps, a list of entities of interest, and the locations of these entities after each process step. An illustration from the dataset is shown in Figure 6.1.

Figure 6.1 Example of a procedural reasoning task in the Propara dataset. '-' indicates the entity is absent, and '?' indicates an unknown location.
Participants: plant | animal | bone | oil
Before the process begins: ? | ? | - | -
1. Plants and animals die in a watery environment: watery environment | watery environment | - | -
2. Over time, sediments build over: sediment | sediment | - | -
3. The body decomposes: sediment | - | sediment | -
4. Gradually buried material becomes oil: - | - | - | sediment

Understanding how entity locations change requires models to grasp the actions occurring at each step and their impact on each entity. Previous research [33] has explored this task with the dual objective of predicting both the actions affecting entities and their locations at every step. The models can predict an action as 'No Change', 'Move', 'Create', or 'Destroy', alongside a location drawn from the text. These action-location decisions follow a logical dependency, expressed as Action-Action and Action-Location constraints. We list these constraints below:

Action-Location Constraints
• Move: If an entity moves at step i, its location at step i − 1 and step i must differ.
• Create: If an entity is created at step i, its location must be 'None' at step i − 1 and either 'Unknown' or a valid phrase at step i.
• Destroy: If an entity is destroyed at step i, it must exist (not 'None') at step i − 1, but its location at step i should be 'None'.
• No Change: If there is no change to an entity, its location at steps i − 1 and i should match. 'No Change' can further be divided into 'Exists' and 'Does not Exist', indicating whether the location should be 'None'.

Action-Action Constraints
• Move: If an item moves at step i, it must have existed at step i − 1. A 'destroy' action should not precede step i without an intervening 'create'.
• Create: An item created at step i should not exist at step i − 1. There should not be a 'create' or 'move' before step i without a 'destroy' in between.
• Destroy: An item destroyed at step i must have been present at step i − 1. A 'destroy' should not precede this without a 'create' in between.
• Exists: If an item is marked as existing at step i, it must have been present at step i − 1, with no ‘destroy’ before step i without an intervening ‘create.’ • Does not Exist: An item marked as non-existent at step i should not have been present at step i − 1. There should be no ‘create’ or ‘Move’ before step i without a ‘destroy’ in between. To model these action-action constraints in a sequential model like Conditional Random Field (CRF), previous work has differentiated the ‘Does not Exist’ action into ‘O_C‘ (not created 98 yet) and ‘O_D‘ (already destroyed). This distinction aids in modeling the dependencies between consecutive steps without examining the entire history of actions. The probabilities of moving between nodes in the CRF (actions) are based on dataset statistics (frequency of action sequences in the training data). Invalid sequences are assigned zero probability, preventing their selection as the final output by the inference (Viterbi algorithm). However, this might be unnecessary if an inference model can effectively search through possible sequences and apply constraints globally across non-consecutive steps. 6.2 Consistent Procedural Reasoning within Generative Models In recent years, there’s been a notable increase in the use of generative models across various fields such as natural language processing [157], computer vision [29], robotics [134, 20], and more [190]. A key strength of these models, particularly when they’re fine-tuned with specific instructions, lies in their flexibility to tackle different tasks within a single framework by simply altering the input prompt to include task-specific metadata [155]. However, despite their remarkable success in numerous applications, Generative Models, in- cluding Large Language Models (LLMs), face challenges in tasks that require complex, multi-step reasoning [114, 91] and exhibit inconsistencies in their outputs [58, 24]. This section aims to assess how well generative models can process and reason about proce- dural information, focusing particularly on tracking the status of entities throughout a narrative. Successful tracking demands an understanding of the physical world, entity attributes, the impact of various actions, and the ability to reason across past, present, and future events. We propose that a model’s understanding of a process is incomplete if it cannot maintain consistency across related tasks. For example, recognizing an entity’s location change should inherently mean understanding that the entity has been moved, even when phrased differently. Assessing a model’s proficiency in procedural reasoning requires a set of interconnected tasks. Previous research in entity tracking has seen models predict both actions and locations of entities at each step. Dependencies between actions have been managed in sequential models like CRFs to ensure consistency and leverage statistical patterns in action changes [33]. Without resorting 99 to CRFs, some studies have introduced an inference stage that processes decisions sequentially, adjusting subsequent decisions based on earlier ones [193, 49]. Although entity tracking datasets traditionally only mark the status of entities at each step, this information can be enriched to infer actions. Previous studies have expanded the datasets to teach models to predict actions [33] explicitly and tested them on more complex concepts like identifying input and output entities [167]. 
Building on these foundations, we generate interconnected questions about entities and the overall process, forming consistency sets for evaluating LLMs’ ability to reason with consistency. We broaden the scope of questions from mere location and action identification to more detailed inquiries like detecting location changes relative to time, determining input/output entities, offering multiple-choice action predictions, identifying entity conversions, pinpointing creation times, and recognizing boolean state changes. By utilizing this enriched dataset, we aim to scrutinize the model’s competency in procedural reasoning through its consistency in answering these interconnected questions. Additionally, we explore the effects of increased supervisory labels (varied question types and phrasings) and the application of a neuro-symbolic approach to enforce output constraints on the generative model’s reasoning processes. Our contributions are summarized as follows: • Extension of the Propara benchmark to incorporate consistency sets for enhanced procedural reasoning evaluation. • Analysis of a generative model’s consistent reasoning capabilities using the expanded con- sistency sets. • Development and assessment of a neuro-symbolic method to directly apply constraints on the generative model’s outputs during procedural reasoning. 6.2.1 Datasets and Consistency Protocols 6.2.1.1 Dataset We employ the Propara dataset to construct sets for consistency checking. The consistency set would include various question types, where their answers have correlated reasoning steps. Utilizing 100 the question-answering format enables us to standardize the structure of different reasoning tasks. Previous research [49] has explored the adaptation of the location detection task into question forms such as "What is the location of the entity ‘entity’ at step i?". We expand upon these initial questions regarding entity locations with an extended assortment of questions. • After Location: "Where is ent located right after the completion of step i?" • Prior Location: "Where is ent located right before the commencement of step i?" • Multiple Actions: "What is the condition of ent at step i? (options listed here)?" • Boolean Actions (move, create, destroy): "Is the entity ent action-verb at step i?", encom- passing 4 variations • Location Change: "Does the location of entity ent alter during step i?" • Input and Alternate Input: "Is ent contributing to the subsequent process?" and "Does ent exist before the upcoming process?" • Output and Alternate Output: "Is ent a result of the following process?" and "Does ent remain after the process concludes?" • Timing (Create/Destroy): "During which step is ent action-verb? (step options including ’Never’)" • Existence Before/After: "Does entity ent exist immediately (before/after) step i (be- gins/ends)?" • Specific Location Change: "What becomes the location of ent after undergoing (action- verb) at step i?" • Combination of Action & Location: "What is the status, preceding location, and subsequent location of ent at step i?" 101 The complexity of each question type may vary based on the level of reasoning needed for resolution. For example, identifying an entity as a process’s outcome might necessitate an analysis across the entire process and its steps (the model could seek a shortcut by pinpointing the entity’s last mention and determining the relevant action). 
This set of additional questions is intended to facilitate the assessment of the model’s comprehension of the described processes. 6.2.1.2 Consistency Rules Given that the augmented question sets are derived from a singular annotation source, several dependencies exist among their anticipated responses. Below, we enumerate several rules that should be adhered to among the responses: • Locational Continuity: The location following step i should align with the location prior to step i 1. • Input Consistency: Both input queries should yield identical responses. • Output Consistency: Both output queries should yield identical responses. • Action Consistency: Boolean actions should concur with the multi-action responses. • Timing and Action Validity: The timing query should identify the step where the corre- sponding boolean action query affirms the action. • Input Validation and Initial State: An input is valid only if the preceding location for step 1 is either known or unknown. • Output Verification and Final State: An output is valid only if the final subsequent location is identified as either known or unknown. • Initial Action Restriction for Inputs: If an entity is an input, its initial action cannot be labeled as “create.” • Existential Continuity and Location: If an entity exists, it must possess a definable location, whether known or unknown. 102 • Implications of Action on Location: The action at step i should correlate with its effects on the after locations for steps i − 1 and i. 6.2.2 Experimental Evaluations To evaluate the capacity of generative models for performing consistent reasoning within procedural narratives, we conduct two primary sets of experiments. Initially, we measure the model’s performance in its base state—without any adjustments or modifications to its original predictions—under two scenarios: 1) training with the original dataset (comprising Location and Action information) and 2) training with a dataset that has been expanded to include additional information. These tests aim to determine whether the model’s consistency and overall performance can be enhanced through the implicit cues gained during the training phase. Additionally, we utilize the rate of constraint satisfaction as a measure of the consistency of decisions. In a second series of experiments, we examine the performance of the model after implementing strategies designed to achieve 100% consistency in its decisions. We explore three methods to ensure decisional consistency: 1) a sequential consistency enforcement approach crafted by experts, which adjusts sequences of actions and corresponding locations; 2) a global inference strategy (Integer Linear Programming, ILP) that addresses inconsistencies by solving an optimization problem within the task’s constraints, using only decisions from the original question types; and 3) the application of this global inference approach to rectify inconsistencies across all decision types, considering a comprehensive set of constraints. Baseline (T5): The T5 model, known for its powerful encoder-decoder capabilities and pretraining on various downstream tasks with specific prefixes, serves as our baseline. Within this framework, we evaluate both the base and large variants of the T5 model, whether employing pre-trained parameters or after fine-tuning on the UnifiedQA—a general question-answering dataset. Our evaluation extends to how this benchmark model performs when fine-tuned on both the original and augmented versions of the Propara dataset. 
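A minimal sketch of ours, not part of the benchmark implementation, shows how two of the consistency rules listed above, Locational Continuity and Input Validation, can be checked over a single entity's answers; the field names and example values are illustrative only.

```python
# `prior_loc[i]` / `after_loc[i]`: answers to the Prior/After Location questions at step i.
def check_locational_continuity(prior_loc, after_loc):
    # The location right after step i must equal the location right before step i+1.
    return all(after_loc[i] == prior_loc[i + 1] for i in range(len(after_loc) - 1))

def check_input_initial_state(is_input, prior_loc):
    # An entity can be an input only if it already exists before step 1,
    # i.e., its prior location is a known phrase or 'unknown', not 'none'.
    return (not is_input) or prior_loc[0] != "none"

prior_loc = ["unknown", "watery environment", "sediment"]
after_loc = ["watery environment", "sediment", "sediment"]
print(check_locational_continuity(prior_loc, after_loc))  # True
print(check_input_initial_state(True, prior_loc))         # True: the entity exists before step 1
```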
Size | Pretrain | Augmented Set | Consistency Rate (%) | Action | Loc | Total Macro Avg | Total Micro Avg
Base | - | - | 56.76 | 70.25 | 63.19 | 66.72 | 66.49
Base | UnifiedQA | - | 56.61 | 70 | 63.61 | 66.30 | 66.13
Base | - | Yes | 56.32 | 68.89 | 63.56 | 66.72 | 66.52
Base | UnifiedQA | Yes | 56.35 | 71.32 | 63.24 | 67.28 | 67.02
Large | - | - | 56.44 | 72.1 | 65.34 | 68.72 | 68.50
Large | UnifiedQA | - | 56.39 | 70.96 | 65.81 | 68.39 | 68.22
Large | - | Yes | 56.32 | 72.28 | 66.49 | 69.38 | 69.20
Large | UnifiedQA | Yes | 56.62 | 73.65 | 64.50 | 69.07 | 68.77
Table 6.1 The results for the fine-tuned versions of various T5 models on the original and augmented sets of the Propara dataset.

Model | Method | Action | Loc | Total Macro Avg | Total Micro Avg
Base | Expert-Designed | 69.05 | 63.925 | 66.487 | 66.32
Base | ILP | 70.13 | 64.707 | 67.418 | 67.24
Base - UnifiedQA | Expert-Designed | 65.83 | 62.979 | 64.405 | 64.31
Base - UnifiedQA | ILP | 68.51 | 63.401 | 65.956 | 65.79
Large | Expert-Designed | 70.96 | 65.022 | 67.991 | 67.80
Large | ILP | 72.34 | 65.029 | 68.684 | 68.44
Large - UnifiedQA | Expert-Designed | 70.43 | 63.243 | 66.836 | 66.60
Large - UnifiedQA | ILP | 70.97 | 64.607 | 67.789 | 67.58
Table 6.2 The results of applying various inference techniques on the T5 baseline, comparing the traditional ILP method and the Expert-Designed inference methods.

6.2.2.1 Initial Model Performance Evaluation

This phase focuses on assessing the baseline model's inherent performance without applying consistency enforcement strategies. Table 6.1 details the outcomes from testing the model under all configurations without modifying its decision-making post hoc. It is noteworthy that the models were supervised either with the original dataset, incorporating only action and location queries, or with the enhanced dataset. From Table 6.1, we deduce four main observations: 1) performance escalates with an increase in model size, 2) contrary to expectations, leveraging pre-training on the UnifiedQA dataset adversely affects performance on this specific task, 3) incorporating the enhanced dataset markedly boosts the model's efficacy, and 4) additional pre-training, fine-tuning for consistency, or enlarging the model size does not compromise the model's ability to render consistent reasoning across the original set of questions.

Model | Method | Action | Action F1 | Loc | Total Macro Avg | Total Micro Avg
Base | Expert-Designed | 69.05 | 53.69 | 61.937 | 65.493 | 65.26
Base | ILP | 70.20 | 56.19 | 60.633 | 65.417 | 65.10
Base | ILP (aug) | 70.25 | 57.02 | 61.622 | 65.936 | 65.65
Base - UnifiedQA | Expert-Designed | 69.77 | 53.71 | 64.189 | 66.979 | 66.80
Base - UnifiedQA | ILP | 71.86 | 58.38 | 64.555 | 68.207 | 67.97
Base - UnifiedQA | ILP (aug) | 69.23 | 57.08 | 61.043 | 65.136 | 64.87
Large | Expert-Designed | 70.97 | 55.85 | 63.713 | 67.341 | 67.10
Large | ILP | 71.56 | 59.26 | 63.145 | 67.352 | 67.08
Large | ILP (aug) | 72.16 | 60.01 | 64.608 | 68.384 | 68.14
Large - UnifiedQA | Expert-Designed | 72.94 | 60.15 | 63.399 | 68.169 | 67.86
Large - UnifiedQA | ILP | 73.54 | 61.2 | 63.355 | 68.447 | 68.11
Large - UnifiedQA | ILP (aug) | 73.24 | 60.65 | 63.610 | 68.425 | 68.11
Table 6.3 The results of applying inference methods on the T5 models when trained on the augmented training sets. ILP (aug) refers to the optimization problem that also considers decisions from the augmented question types in the constraint-satisfaction process.

6.2.2.2 Logical Post-Processing Evaluation

Table 6.2 illustrates the outcomes of implementing the three aforementioned inference methodologies on the model's decisions. We assess the impact of these inference strategies on the precision of responses to individual question types and the F1 scores for action questions regarding creation, destruction, and movement. The rationale behind incorporating the F1 score is to mitigate the inflated accuracy rates for actions, which could arise if the model predominantly forecasts "No Change" or "Exists" across most steps.
The essence of this task is to accurately forecast the concrete actions, including creation, destruction, and movement. The F1 score thus helps highlight the significance of these actions while also accounting for any false positives. As indicated in Table 6.2, the global inference method (ILP) generally equals or surpasses the expert-designed strategy in effectiveness, occasionally outdoing it by more than 1%. Table 6.3 further demonstrates that in most cases (three out of four), incorporating all question types into the logical reasoning phase substantially improves model performance compared to the other inference approaches. Our experiments with the procedural reasoning task indicated that applying the ILP method to decisions that have various output sizes mostly alters the decisions made in larger output spaces in favor of preserving the original model decisions made in smaller output spaces. For instance, the ILP method applied to the two question types, Actions and Locations, heavily updated the location decisions while rarely changing the decisions regarding actions. This can be seen by comparing the base performance of the models in Table 6.1 with their corresponding ILP results in Table 6.3.

6.3 Enhanced Inference toward consistent Procedural Reasoning

To comprehend procedural text, multiple neural models collaborate to establish temporal relationships between actions, reveal semantic relations, and discern entity properties like location and temperature [50, 16, 70]. Each model responsible for these decisions exhibits distinct decision characteristics, output sizes, uncertainty levels, and varying expected accuracy levels. Resolving inconsistencies and aligning these diverse neural decisions is crucial for comprehensively understanding the underlying process. In many instances, raw model outputs lack usability without enforcing consistency. In tasks like hierarchical image classification, with independent models for each hierarchy level, the outputs should adhere to the known hierarchical relationships. For example, the combination "Plant, Chair, Armchair" lacks validity and requires post-processing for downstream applications. A similar requirement extends to generative models in text summarization [98] and image captioning [8]. Prior studies have proposed techniques for handling inconsistencies in correlated decisions during both inference [55, 148, 31, 23, 59] and training [67, 118, 182] of neural models. This work focuses on resolving these inconsistencies at inference time, where the goal is to ensure that outputs align with the task constraints while preserving or enhancing the original model performance without additional training. Integer Linear Programming (ILP) [140] stands out as a robust approach to addressing decision inconsistencies. ILP is a global optimization framework that seeks to find the best configuration of variables while meeting specified constraints. It is known for its efficiency and capability to produce globally optimal solutions, distinguishing it from alternatives like beam search. The ILP formulation is as follows:

Objective: Maximize P⊤y subject to C(y) ≤ 0,   (6.1)

where the constraints are denoted by C(·) ≤ 0, the decision variables are denoted by y ∈ R^n, and the vector containing the local weights of the variables is denoted by P.
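As a minimal sketch of Eq. 6.1 with made-up probabilities and a toy two-level hierarchy, the same objective can be written with the open-source PuLP modeling library and its default CBC solver; our experiments use Gurobi instead, and the label names below are purely illustrative.

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum

# Hypothetical local probabilities P for four binary decision variables y.
probs = {"dog": 0.7, "cat": 0.2, "animal": 0.3, "plant": 0.6}
parent_of = {"dog": "animal", "cat": "animal"}   # toy hierarchy: child => parent

prob = LpProblem("consistent_decoding", LpMaximize)
y = {label: LpVariable(f"y_{label}", cat="Binary") for label in probs}

# Objective: maximize P^T y.
prob += lpSum(probs[label] * y[label] for label in probs)

# C(y) <= 0: mutual exclusivity within each level, and child-implies-parent.
prob += y["dog"] + y["cat"] <= 1
prob += y["animal"] + y["plant"] <= 1
for child, parent in parent_of.items():
    prob += y[child] <= y[parent]

prob.solve()
print([label for label in probs if y[label].value() == 1])
# ['dog', 'animal']: selecting 'dog' forces 'animal', beating the locally higher-scoring 'plant'.
```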
To apply ILP to resolve conflicts among the decisions of neural models, prior work [137, 127, 121, 60] has defined P to be the vector of raw probabilities of the local decisions, P = (p1, ..., pn), where pi corresponds to the probability generated by a certain model for the i-th decision variable (yi). The global inference is modeled to maximize the combination of probabilities subject to the constraints. Previous use of ILP has proven effective in ensuring decision consistency in certain cases [51] but did not address model heterogeneity. This problem becomes more dominant in scenarios where the output probabilities come from independent models, making them less directly comparable. We extend the ILP formulation to address this limitation beyond just considering the raw model probabilities. Instead, we map these raw scores into globally comparable values, facilitating a more balanced global optimization. We achieve this by incorporating additional information, such as decision confidence, expected model accuracy, and estimated prior probabilities. While previous studies have explored the integration of uncertainty in modeling the training objective [181, 56, 197], our work represents a novel effort in systematically incorporating multiple factors of this nature into the inference process for interrelated decisions to leverage external knowledge effectively.

6.3.1 Method

Our objective is to devise an improved scoring system, generating new local variable weights (importance) W in the ILP formulation. Thus, we modify the original objective function as follows:

Maximize W⊤y,   (6.2)

where W = (w1, ..., wn). To determine the new weights, we aim to find the scoring function G, which normalizes the local predictions of each model and maps them into globally comparable values. For each model m with multi-class decisions, we denote the output probabilities after applying a SoftMax layer as Pm ⊂ P. The scoring function G transforms these raw probabilities into new weights Wm ⊂ W to indicate the importance of the variables within the ILP objective, i.e., Wm = G(Pm, m). This subsection explores different options for the function G and provides an intuitive understanding of their rationale.

6.3.1.1 Prior Probability (Output Size)

We consider a normalization factor based on prior probabilities to facilitate a fair comparison among decisions with varying output sizes. For an N-class output, the prior probability of each label is 1/N (assuming a uniform distribution). This implies an inherent disadvantage for decisions made in larger output spaces. Thus, we normalize the raw probabilities by dividing them by their respective priors and define G(Pm, m) = Pm × N. This factor adjusts the weight of decisions relative to the size of the output space.

6.3.1.2 Entropy and Confidence

The outputs generated by models often exhibit varying levels of confidence. While raw probabilities alone may adequately indicate the model's confidence in individual Boolean decisions, a more sophisticated approach is required for assessing the models' confidence in multi-class classification. We propose incorporating the entropy of the label distribution as an additional factor to assess the model's decision-making confidence. As lower entropy corresponds to higher confidence, we use the inverse of the entropy, combined with the output-size normalization N, as a factor in forming the decision weight function G(Pm, m) = (Pm × N) / Entropy(Pm).
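As a minimal sketch of ours, the two factors above can be computed directly from a model's softmax output; the expected-accuracy factor introduced in the next subsection multiplies in the same way. The example distributions are hypothetical.

```python
import numpy as np

def prior_weight(probs):
    # 6.3.1.1: G(P_m, m) = P_m * N, with N the size of this model's output space.
    p = np.asarray(probs, dtype=float)
    return p * len(p)

def entropy_weight(probs):
    # 6.3.1.2: G(P_m, m) = P_m * N / Entropy(P_m); confident (low-entropy) models get larger weights.
    p = np.asarray(probs, dtype=float)
    entropy = -np.sum(p * np.log(p + 1e-12))
    return p * len(p) / max(entropy, 1e-12)

# A confident binary decision and a flat six-way decision become more comparable:
print(entropy_weight([0.9, 0.1]))     # roughly [5.5, 0.6]
print(entropy_weight([1 / 6] * 6))    # roughly 0.56 for every label
```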
6.3.1.3 Expected Models' Accuracy

Assigning higher weights to the probabilities generated by more accurate models aligns the optimal solution with the overall performance of the underlying models. This approach mitigates the influence of poor-quality decisions, which can negatively impact the others in the global setting. We define the decision weight function G as G(Pm, m) = Pm × Accm, where Accm represents the accuracy of the corresponding model, measured in isolation. To mimic real-world settings where test labels are unavailable during inference, we use the models' accuracies on a probe/dev set.

6.3.2 Empirical Study

We assess the impact of integrating the proposed factors into the ILP formulation on structured prediction tasks. Our approach is particularly suited for hierarchical structures encompassing multiple classes at different granularity levels, such as classical hierarchical classification problems. Additionally, we are the first to investigate the influence of enforcing global consistency on the procedural reasoning task, a complex real-world problem. To implement our method, we rely on the DomiKnowS framework [130], which offers a versatile platform for implementing and evaluating techniques that leverage external logical knowledge, with minimal effort, on structured output prediction tasks.

6.3.2.1 Metrics and Evaluation

We compare our method against two inference-time approaches: sequential decoding and basic ILP (ILP without our refinement). In contrast to ILP, sequential decoding, which relies on expert-designed rules or programs to enforce consistency, is unique to each dataset. In addition to conventional metrics (e.g., accuracy/F1), we include measurements that evaluate the changes applied by the inference techniques: (1) the total number of changes (C), (2) the percentage of incorrect-to-correct changes (+C), and (3) the percentage of correct-to-incorrect changes (-C). We further evaluate all the baselines and inference methods on (1) the percentage of decisions satisfying the task constraints and (2) Set Correctness, the percentage of correct sets of interrelated decisions (i.e., the predictions of all levels in the hierarchy must be correct for an image).

Number of Changes. This metric quantifies the post-inference changes in decisions, specifically assessing the extent to which the original decisions are altered due to the inference constraints. It serves as a crucial indicator of whether the optimization method treats all decisions equally or prefers certain decisions over others. A genuinely global optimization method will result in multiple decision changes, promoting a more balanced distribution of alterations across all decisions. In contrast, expert-written strategies tend to favor specific decisions. This metric is straightforward to calculate by comparing the differences between decisions before and after applying the inference mechanism.

Ratio of Incorrect-to-Correct Changes (+C). This metric reveals the proportion of post-inference changes that are deemed favorable. While this metric may not carry substantial standalone significance, it is a valuable means to compare different inference techniques. A higher ratio signifies that the inference method has been more successful in deducing accurate labels based on the imposed constraints.

Ratio of Correct-to-Incorrect Changes (-C). This number shows the extent of undesirable changes made after inference. A lower ratio means the inference method has done a better job of preventing errors while ensuring the output adheres to the constraints.
Satisfaction Rate This metric shows how well predictions align with constraints. We calculate it by generating constraint instances from related decisions and counting the satisfying cases against all possible instances. Inference techniques guarantee that modified decisions always adhere to the constraints, resulting in a satisfaction rate of 100%. Correctly Predicated Sets of Interrelated Decisions This metric is crucial for assessing the practical usefulness of the output from inference techniques or the original network decisions in downstream applications. The primary objective of inference mechanisms is to boost the percentage of these fully satisfying cases compared to the model’s original performance, all while ensuring that the decisions align with the task’s constraints. For instance, in a hierarchical classification task, we consider one instance to be correct only when the decisions at all levels are simultaneously accurate. 6.3.2.2 Tasks Procedural Reasoning Task: Procedural reasoning tasks entail tracking entities within a narrative. Following [48], we formulate this task as Question-Answering (QA). Two key questions are addressed for each entity e and step i: (1) Where is e located in step i? and (2) What action is performed on e at step i?. The decision output of this task exhibits heterogeneity, encompassing a diverse range of possible 110 actions (limited multi-class) and varied locations derived from contextual information (spans). The task constraints establish relationships between action and location decisions as well as among action decisions at different steps. For instance, the ‘Destroy, Move’ sequence represents an invalid assignment for action predictions at steps i and i 1. Dataset: We utilize the Propara dataset [33], a small dataset focusing on natural events. This dataset provides annotations for involved entities and their corresponding location changes. The label set is further expanded to include information on actions, which can be inferred from the sequence of locations. Baseline: We employ a modified version of the MeeT architecture [156] as our baseline for this task. The MeeT model is designed to ask the two aforementioned questions at each step and employs a generative model (T5-large) to answer those questions. The Sequential Decoding baseline resolves action inconsistencies in a sequential stepwise manner (first to last), followed by the selection of locations accordingly. Propara Dataset: The Propara dataset serves as a procedural reasoning benchmark, primarily devised to assess the ability of models to track significant entities across a series of events effectively. The stories within this dataset revolve around natural phenomena, such as photosynthesis. The annotation process involves capturing crucial entities and their corresponding locations at each step of the process, which are obtained through crowd-sourcing efforts. Figure 6.1 depicts an illustrative example of this dataset. The sequence of locations pertaining to each entity can be further extended to infer the actions or status of the entity at each step. Previous studies [34] have proposed six possible actions for each entity at each step, namely ’Create,’ ’Move,’ ’Exist,’ ’Destroy,’ ’Prior,’ and ’Post.’ In this context, ’Prior’ signifies an entity that has not yet been created, while ’Post’ denotes an entity that has already been destroyed. As for the baseline, we employ a modified version of the MeeT [156] architecture. 
The architecture utilizes T5-Large [129] as the backbone and employs a Question-Answering framework to extract the location and action of each entity at each step. The format of the input to the model 111 is as follows for entity e and step i: "Where is e located in sent i? Sent 1: ..., Sent 2: ..., ...". For extracting the action, the set of options is also passed as input, resulting in the modification of the question to "What is the status of entity e in sent i? (a) Create (b) Move (c) Destroy (d) Exist (e) Prior (f) Post". Although the original model of MeeT incorporates a Conditional Random Field (CRF) [80] layer during inference to ensure consistency among action decisions, we exclude this layer from our baseline. This decision is motivated by two reasons. Firstly, the use of CRF in this context is not generalizable as it relies on training data statistics for defining transitional scores. Secondly, we intend to impose consistency using various inference mechanisms and consider a joint framework to ensure both locations and actions exhibit consistency. Additionally, while the MeeT baseline employs two independent T5-Large models for each question type (location and action), our baseline utilizes the same model for both question types. For the sequential decoding technique to enforce sequential consistency among the series of interrelated action and location decisions, we utilize the post-processing code presented in [50]. Hierarchical Classification Task: This task involves classifying inputs into various categories at distinct levels of granular- ity, establishing parent-child relationships between the classes where those follow a hierarchical structure. Datasets: We employ three different datasets. (1) A subset of the Flickr dataset [188] with two hierarchical levels for the classification of images with types of Animal, Flower, and Food, (2) 20News dataset for text classification, where the label set is divided into two levels, and (3) The OK-VQA benchmark [104], a subset of the COCO dataset [92]. OK-VQA establishes the hierarchical relations between labels into four levels based on ConceptNet triplets and the dataset’s knowledge base. Baselines: ResNet [63] and BERT [44] are used to obtain representations for the image and text modalities, respectively. Linear classification layers are applied to convert obtained representations into decisions. The Sequential Decoding is top-down, bottom-up, and a two-stage (1) top-down 112 on ‘None’ values and (2) bottom-up on labels for Animal/Flower/Food, 20 News, and VQA tasks, respectively. Animal/Flower/Food Dataset: The dataset1 employed in this study is sourced from the online platform ’Flickr’ and encompasses a total of 5439 images classified into three primary categories, namely ’Flower,’ ’Animal,’ and ’Food.’ Without an officially designated test set, a random par- titioning strategy is adopted to ensure comparability in the distribution of training and testing instances. Consequently, the resulting splits are utilized within the experimental framework. The training subset encompasses 4531 images, while the test set comprises 1088 images. The dataset further comprises various sub-categories, including ’cat,’ ’dog,’ ’monkey,’ ’squirrel,’ ’daisy,’ ’dan- delion,’ ’rose,’ ’sunflower,’ ’tulip,’ ’donuts,’ ’lasagna,’ ’pancakes,’ ’pizza,’ ’risotto,’ and ’salad.’ It should be noted that the data distribution across labels is not balanced, posing a more challenging classification task. 
This dataset is employed as a simplified scenario to illustrate the benefits of the proposed inference approach. As the baseline for this task, we use ResNet-50 to represent the images and add a single layer MLP on top for each level. The model is further trained by Cross-Entropy objective and AdamW as optimizer. The sequential decoding strategy for this dataset propagates labels in a top-down manner, where the most likely children of the selected Level1 decisions are chosen as the prediction at Level2. 20News Dataset: This dataset comprises a collection of diverse news articles classified into 23 distinct categories. In order to capture the hierarchical structure inherent in the dataset’s labels, we partition these categories into two levels. It should be noted that certain higher-level concepts lack corresponding lower-level labels, necessitating the inclusion of a ’None’ label at level 2. Furthermore, we perform a removal process on the initially annotated data containing the ’None’ labels, as this subset primarily consists of noisy documents that do not align with any categories present within the dataset. It is crucial to differentiate this removal process from the intentional addition of the ’None’ label at level 2, which we manually introduced. 1https://github.com/kaustubh77/Multi-Class-Classification 113 As the baseline for this task, we initially employed the Bert-Base encoder to generate repre- sentations for each news story. Due to the limited context size of Bert, which is constrained to a maximum of 512 tokens, we truncate the news articles accordingly and utilize the CLS token as the representative embedding for the entire article. For Level 1, a 2-layer Multilayer Perceptron (MLP) architecture is employed, with LeakyReLU as the chosen activation function. Additionally, Level 2 decisions are made using a single-layer MLP. The model is optimized using the AdamW optimizer during the training process, with the Cross-Entropy loss function employed. The sequential decoding strategy in this dataset is a bottom-up strategy. Here, the model’s decision from Level 2 is propagated into Level 1 without looking further into the initial probabilities generated by the model at that level. OK-VQA (COCO) Dataset: The OK-VQA dataset is primarily introduced as a means to propose an innovative task centered around question-answering utilizing external knowledge. A subset of the COCO dataset is employed to construct this dataset, with augmented annotations obtained through crowdsourcing. While the main objective of the dataset revolves around question answering, it is important to note that it encompasses two levels of annotation. These annotations indicate the answer to the given question and provide additional clarifications regarding the types of objects depicted in the corresponding images. To leverage knowledge pertaining to image type relationships, the label set is expanded to include supplementary high-level concepts. A knowledge base is also provided, delineating parent-child relationships between these labels. The dataset comprises a total of 500 object labels. To enhance the breadth of knowledge encompassed by the dataset, we incorporate additional information from ConceptNet to establish comprehensive relationships among the labels. Notably, both the new information and the original knowledge base may contain noisy information. In conjunction with the original knowledge base, this forms a four-level hierarchical dependency among the initial 500 labels. 
In this study, we employ the Faster R-CNN framework [133] along with ResNet-110 as the chosen methodology to represent individual objects within images. Subsequently, a one-layer Multilayer Perceptron (MLP) architecture is utilized to classify the images at each level of the hierarchical structure. It should be noted that the number of positive examples (i.e., labels that are not denoted as 'None') decreases as we move toward lower levels of the hierarchy. To address this, we perform subsampling on the 'None' labels for the corresponding classifiers at those levels. The models are trained with the Cross-Entropy loss function and the AdamW optimizer. The sequential decoding strategy for this dataset is a two-stage top-down and then bottom-up process: 'None' labels are first propagated from Level 1 to Level 4, and then the selected label (if not 'None') from Level 4 is propagated bottom-up to Level 1. Since each label at Level n has only one parent at Level n-1, this process does not need to look into the original model probabilities for propagation.

6.3.2.3 Results

Tables 6.4, 6.5, 6.6, and 6.7 display the results for the Animal/Flower/Food, Ok-VQA, 20News, and Propara datasets described in Section 6.3.2.2. For close results, we use multiple seeds to validate reliability. Across experiments, the basic ILP technique favors decisions in smaller output spaces due to higher probability magnitudes (e.g., more changes in Actions than in Locations in Table 6.7). Our newly proposed variations can effectively mitigate this problem and perform a more balanced optimization.

Model           Level 1 (3)    Level 2 (15)                        Average
                Acc            Acc      C      +C        -C        Acc
Baseline        86.12          54.85    -      -         -         70.48
Sequential      86.12          54.39    32     15.625    37.5      70.25
ILP             86.07          54.43    16     12.5      37.5      70.25
+ Acc           86.14          54.41    29     13.79     37.93     70.27
+ Prior         86.30          54.78    8      12.5      25        70.54
+ Ent           86.09          54.41    20     10        40        70.25
+ Ent + Prior   86.42          54.82    7      14.29     28.57     70.62
+ All           86.17          54.50    16     12.5      37.5      70.33

Table 6.4 Results on the Animal/Flower/Food dataset over four random seeds. Reported values are the average scores of runs with close variances for all techniques (Level 1: ±1.6 and Level 2: ±0.5). C values are derived from the best run. n in Level (n) denotes the number of output-space classes. Prior: Prior Probability, and Ent: Entropy.

Model           Level 1 (274)   Level 2 (158)   Level 3 (63)   Level 4 (8)   Average
Baseline        56.73           54.45           43.43          17.68         54.64
Sequential      55.81           53.17           43.44          24.18         53.72
ILP             52.38           46.33           49.66          28.43         50.17
+ Acc           55.65           54.67           48.15          23.73         54.23
+ Prior         56.35           53.36           48.11          23.86         54.54
+ Ent           56.43           53.25           48.1           24.02         54.56
+ Ent + Prior   56.79           52.93           47.53          23.75         54.61
+ All           56.84           52.66           46.98          22.63         54.5

Table 6.5 The results on the Ok-VQA dataset. The values represent the F1 measure. Levels 2, 3, and 4 contain 'None' labels. The low F1 measure of the lower levels is due to a huge number of False Positives.

Animal/Flower/Food: The sequential decoding results establish that enforcing the decisions
originating from the model with better accuracy and a smaller output size (Level 1) onto the other decisions may even hurt them (Level 2). In such scenarios, the inclusion of Expected Accuracy favors the dominant decisions and adversely affects performance. However, the inclusion of Prior Probability effectively achieves a balanced comparison among decisions. In this task, despite the basic ILP formulation being detrimental, some of the new variations can even surpass the original baseline performance.

Ok-VQA: The baseline exhibits lower accuracy in lower-level decisions with smaller output sizes. When applying the basic ILP method under these circumstances, a significant decline in results is observed, even below that of sequential decoding. However, incorporating any of our proposed factors leads to substantial improvements compared to the basic ILP formulation (over 4% improvement) and can surpass the performance of sequential decoding. In particular, combining Entropy and Prior Probability achieves the best performance. Notably, although the baseline model has higher overall performance, its inconsistent outputs are unreliable for determining the object label (see Table 6.8).

                Level 1 (16)                    Level 2 (8)              Average
Model           F1      C      +C      -C      F1      C      +C        F1
Baseline        73.62   -      -       -       75.13   -      -         74.01
Sequential      72.99   330    20.6    46.36   75.13   0      0.00      73.55
ILP             73.53   225    25.78   39.55   75.46   68     63.24     74.03
+ Acc           73.57   212    26.89   39.62   75.45   73     64.39     74.05
+ Prior         73.35   161    25.46   39.13   75.35   94     65.96     74.01
+ Ent           73.54   205    26.34   40      75.39   75     64        74.02
+ Ent + Prior   73.63   125    26.4    36      75.49   112    68.75     74.12
+ All           73.64   131    25.95   35.11   75.52   111    68.47     74.13

Table 6.6 The results on the 20News dataset, comparing various inference settings with the initial performance. Here, the -C of Level 2 is 0 in all cases.

20News: The baseline performance is similar across the different decisions. Thus, in isolation, considering either the Expected Accuracy or the Prior Probability does not substantially impact the global optimization process. However, including all proposed factors (Entropy, Accuracy, and Prior Probability) leads to a balanced and optimal solution. Although the overall task performance in this experiment does not show significant improvements, this is mainly because the initial decision inconsistencies are minimal. Nevertheless, evaluating the positive and negative changes provides valuable insights into the significance of incorporating the proposed factors.

                Actions (6)                     Locations (*)                   Average
Model           Acc     C      +C      -C      Acc     C      +C      -C       Acc
Baseline        73.05   -      -       -       68.21   -      -       -        70.47
Sequential      71.56   75     13.33   46.66   67.63   255    27.8    32.2     69.47
ILP             73      63     36.5    38.1    66.38   217    19.8    35.9     69.47
+ Acc           73      63     36.5    38.1    66.43   217    19.8    35.9     69.50
+ Prior         72.88   119    31.93   34.45   67.54   138    23.2    32.6     70.03
+ Ent           72.93   63     34.92   38.1    66.38   219    19.6    35.6     69.44
+ Ent + Prior   71.62   209    25.83   37.32   68.16   53     26.4    28.3     69.78
+ All           71.74   198    25.75   36.86   68.27   72     29.2    27.8     69.89

Table 6.7 Results on the Propara dataset. The dataset comprises 1910 location decisions and 1674 action decisions. *The output size of the location decisions depends on the context of each procedure.

Propara: This is an example of a real-world task that involves hundreds of constraints and thousands of variables when combining decisions across entities and steps. Once again, the basic ILP and the Expected Accuracy factor prioritize decisions from the smaller output space (Actions). However, the Prior Probability factor enables a more comparable space for resolving inconsistencies. Notably, the higher baseline performance is attributed to inconsistencies and cannot be used when reasoning about the process (see Table 6.8).
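To make the comparison between the ILP variants more concrete, the sketch below shows one way the hierarchical consistency constraints and the per-decision weighting factors (prior probability, entropy, and expected accuracy) could be combined in an integer linear program. It is illustrative only: the toy numbers, the specific weighting formula, and the use of the off-the-shelf PuLP solver are our assumptions and do not reproduce the exact formulation used in these experiments.

# An illustrative ILP for two dependent decisions (a sketch, not the thesis formulation).
# A binary variable x[level, label] selects one label per level; a child label may only
# be selected together with its parent. Each candidate's raw probability is weighted by
# per-decision factors (prior probability, entropy, expected accuracy) so that decisions
# from small output spaces do not dominate the objective.
import math

import pulp

probs = {   # raw model probabilities per level (hypothetical numbers)
    1: {"Animal": 0.45, "Flower": 0.40, "Food": 0.15},
    2: {"dog": 0.30, "rose": 0.35, "pizza": 0.35},
}
parent = {"dog": "Animal", "rose": "Flower", "pizza": "Food"}
expected_acc = {1: 0.86, 2: 0.55}   # e.g., validation accuracy of each classifier


def weight(level, label):
    p = probs[level]
    prior = 1.0 / len(p)                                   # uniform prior over this output space
    entropy = -sum(q * math.log(q) for q in p.values())    # uncertainty of this decision
    # One possible combination of the three factors; the experiments may combine them differently.
    return expected_acc[level] * math.log(p[label] / prior) / (1.0 + entropy)


problem = pulp.LpProblem("consistent_decoding", pulp.LpMaximize)
x = {(l, c): pulp.LpVariable(f"x_{l}_{c}", cat="Binary") for l in probs for c in probs[l]}

problem += pulp.lpSum(weight(l, c) * x[l, c] for (l, c) in x)     # weighted objective
for l in probs:                                                   # exactly one label per level
    problem += pulp.lpSum(x[l, c] for c in probs[l]) == 1
for c in probs[2]:                                                # child implies parent
    problem += x[2, c] <= x[1, parent[c]]

problem.solve(pulp.PULP_CBC_CMD(msg=False))
print(sorted(c for (l, c) in x if x[l, c].value() == 1))

Unlike the sequential strategies, nothing here fixes one decision before the other: the solver trades off the (weighted) evidence from both levels simultaneously, while the constraints guarantee a hierarchically consistent pair of labels.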
Dataset         Model          Satisfaction   Set Correctness
Animal/Flower   Baseline       96.4           53.40
                Sequential     100            54.50
                Ent + Prior    100            54.50
VQA             Baseline       38.99          54.43
                Sequential     100            57.11
                Ent + Prior    100            58.92
Propara         Baseline       45.12          23.30
                Sequential     100            28.81
                Prior          100            30.93

Table 6.8 Results of our proposed technique, the baselines, and the expert-written decoding strategies in terms of constraint satisfaction and set correctness. The Set Correctness metric reflects the practical usability of sets of dependent decisions in downstream applications.

Constraints: Table 6.8 presents the satisfaction results and the set correctness metric across the various datasets. It is evident that our newly proposed method significantly outperforms the baseline in both of these metrics. Notably, the degree of improvement in set correctness is more pronounced when the initial consistency of the baseline is lower. This observation underscores the substantial significance of our proposed technique in ensuring the practical utility of model decisions in downstream applications by substantially increasing the proportion of correct interrelated decision sets. Furthermore, compared to sequential decoding, our proposed solutions demonstrate even greater performance enhancements, particularly in scenarios where the task complexity is higher and global inference can exert its maximum effectiveness.

6.3.3 Discussion

Here, we address some of the key questions about this work.

Q1: Which metric is most important among the ones evaluated here? All the assessed metrics provide insights into the model's performance. Among these, the Set Correctness score offers a comprehensive evaluation that combines constraint satisfaction and correctness, indicating the proportion of output decisions suitable for safe use in downstream tasks. When comparing different ILP variations, the primary focus should be on the original task performance, since they all share the same high satisfaction score of 100%. Additionally, the Change metric helps reveal whether an ILP variation conducts a truly global optimization or exhibits a bias towards specific prediction classes. In comparing the baseline method with the inference techniques, it is essential to consider both the satisfaction and set correctness scores. This is because the raw model predictions, as initially generated, may not be directly acceptable. For instance, if a model predicts a "Move" action for entity A at step 4, but the location prediction does not indicate a change in location, it becomes unclear whether entity A changed locations.

Q2: Why utilize the model's overall accuracy in the score function instead of its accuracy for a specific decision variable? In our context, we assume that each decision type corresponds to a specific model. Therefore, assessing the model's accuracy is the same as evaluating the accuracy of a particular decision type. If a single model supplies multiple decision types, we can easily expand this concept to evaluate the accuracy of each decision type individually within the same framework.

Q3: What is the main difference between the sequential decoding strategy and the ILP formulation? The sequential decoding strategy is a domain-specific, expert-crafted technique employed for addressing decision inconsistencies in accordance with task constraints.
In contrast, the ILP (Integer Linear Programming) formulation offers a more general, non-customized approach that is not tailored to individual tasks. Sequential decoding strategies typically involve rules or programs that prefer a specific decision while adjusting other decisions to align with it. This approach prioritizes decision alignment over considering the probabilities associated with these decisions. The ILP optimization process, on the other hand, seeks the optimal solution by considering both the raw probabilities from the models and the imposed constraints.

6.4 Conclusion

This chapter introduced an expanded benchmark aimed at assessing a model's ability to maintain consistency while reasoning over procedural texts in the context of entity tracking. Our enhanced benchmark encompasses a set of interrelated questions, all of which necessitate a fundamental comprehension of the described process to deduce accurate answers. We appraised the state-of-the-art model's efficacy in this domain using our evaluation framework. Our observations revealed a notable inconsistency in the model's reasoning, particularly concerning the initial queries related to actions and locations. Furthermore, our findings demonstrated that incorporating a set of consistency-focused questions into the model's training regimen, despite these questions being derived from the same original dataset, improves its performance when addressing the initial question types. Moreover, we developed and assessed various strategies to enforce logical consistency within the model's outputs. The outcomes of these experiments suggest that adopting a global optimization approach for inference tends to be more advantageous than implementing sequential logical corrections, particularly when applying a comprehensive suite of constraints to the decision-making process. We further proposed an approach for considering uncertainty and confidence measures, including the decisions' prior probability, entropy, and expected accuracy, alongside raw probabilities when making globally consistent decisions based on diverse models. We demonstrated the effectiveness of incorporating this idea within the ILP formulation through experiments on four datasets, including the procedural reasoning task. This contribution represents a significant advancement in integrating large models into a unified decision-making framework for complex tasks that require interrelated decisions.

CHAPTER 7
CONCLUSION & FUTURE WORK

This chapter summarizes our contributions and conclusions regarding the two directions that we took in this research: exploiting semantic structures in procedural reasoning and integrating logical constraints into deep neural models. Moreover, we outline a series of ideas and directions for extending our current techniques and addressing the remaining challenges in procedural reasoning.

7.1 Summary of Contributions

Reasoning over procedural text presents a significant challenge for machine learning models due to the dynamic nature of the described world. Despite recent advances and promising performances achieved by building on Pretrained Language Models (PLMs), a notable performance gap between neural models and human capabilities persists. In this thesis, we introduced various strategies to improve language models' proficiency in complex reasoning over procedural contexts, particularly focusing on the entity tracking and procedural summarization tasks.
We also proposed a new framework and benchmark to advance research on integrating domain knowledge with deep neural networks.

7.1.1 Entity Tracking

Our first proposal aimed at enhancing understanding and reasoning over the temporal dynamics of events in procedural contexts. We introduced the Time-Stamped Language Model (TSLM), which augments PLMs with an additional embedding layer representing the temporal dimension of actions (past, present, and future). This enhancement allows PLMs to effectively tackle entity tracking by reasoning comprehensively across the process context and dynamically shifting focus between steps. This capability is crucial for neural models to utilize the preceding and forthcoming contexts to resolve ambiguities and fill information gaps in procedural instructions. Furthermore, recognizing the importance of understanding actions, their relationships with objects, and their impacts on entities for entity tracking, we proposed integrating semantic parsers with procedural reasoning modules. This approach allows neural models to reason directly about actions and entities, thereby abstracting and interpreting the significance of actions more accurately. We developed a symbolic model based on semantic parsers, achieving superior performance over several neural models. Additionally, we extended this approach by encoding semantic parses into graph representations for procedural reasoning modules. Such integration with various base neural models proved effective in improving the state-of-the-art models.

7.1.2 Procedural Abstraction

We introduced a novel method that trains a similarity matrix to correlate critical actions in the summary with the procedural instructions. A set of constraints was designed to ensure that the summary's sequential integrity accurately reflects the process flow. This modeling effort yielded state-of-the-art results on the RecipeQA benchmark for the textual cloze task involving multi-modal recipes. We also demonstrated the advantages of utilizing both text and image modalities in summarization, leveraging entangled representations from both to enhance task performance.

7.1.3 Integration of Logical Constraints with Deep Neural Networks

We developed DomiKnowS, a novel declarative framework for incorporating domain knowledge into deep neural networks. DomiKnowS offers a modular interface for defining knowledge and computational units and facilitates various methods for integrating human knowledge with data-derived learning. Utilizing DomiKnowS's flexibility, we established a new benchmark (GLUECons) for evaluating the effectiveness of integrating constraints with deep neural networks. GLUECons comprises nine benchmarks across five task categories, expanding the evaluative scope to assess the general applicability of constraint integration methods across diverse scenarios rather than focusing solely on task-specific performance.

7.1.4 Procedural Reasoning Considering Logical Constraints

We pioneered applying a global constraint integration technique to complex tasks such as procedural reasoning. Given the extensive constraint space associated with procedural reasoning, we enriched the dataset by creating consistency sets (various relevant question/task types) to probe the consistency of neural model reasoning and explore the potential of further constraints to enhance model performance.
Acknowledging the diverse decision-making requirements inherent in procedural reasoning, we proposed an enhancement to integer linear programming (ILP) to ensure decision consistency across heterogeneous outputs. This approach accounts for additional factors such as confidence, prior probability expectations, and decision accuracy, demonstrating improved performance across multiple tasks, including procedural reasoning.

7.2 Future Directions

This section proposes several avenues for addressing the remaining challenges in procedural reasoning and integrating constraints with neural models.

Constraints and Generative AI: Our research explored strategies for integrating constraints with computational units (neural models) by treating them as black-box entities and utilizing model-generated probabilities. However, this work was mostly limited to the scope of structured prediction tasks, under the assumption that changing the probability of one decision does not directly affect the raw probabilities generated by the model for other decisions. The generative AI paradigm, especially auto-regressive techniques, instead produces dependent probabilities for each output variable, complicating ad-hoc inference adjustments during runtime. This scenario necessitates techniques for approximating or considering the effects of such adjustments during the inference phase.

Model Consistency in Complex Reasoning Tasks: We highlighted the inconsistency of language models in reasoning over interconnected decisions in procedural reasoning. While inference techniques can address inconsistencies, simply training models on interrelated decisions does not inherently enhance consistency. Improving model consistency during training, without reliance on inference techniques, remains an open challenge. Integrating a soft interpretation of logical constraint violations as loss objectives may enhance consistent model performance.

Interpretability in Procedural Reasoning: The detailed annotation of tasks like entity tracking provides valuable step-by-step insights into entity evolution during processes. Yet, the underlying reasoning steps for each decision are often not explicitly documented in current datasets. Extending datasets to include these reasoning steps could enable models to perform more comprehensive reasoning, from textual analysis to abstract procedural understanding.

Utilizing Entity Tracking in Story Understanding: In this research, we have investigated the underlying models for entity tracking. An interesting future direction is to consider the effect of such underlying knowledge in story completion and understanding, where entity tracking plays a vital role.

BIBLIOGRAPHY

[1] [2] [3] [4] Jordi Adell, Antonio Bonafonte, Antonio Cardenal, Marta R. Costa-Jussà, José A. R. Fonollosa, Asunción Moreno, Eva Navas, and Eduardo R. Banga. BUCEADOR, a multi-language search engine for digital libraries. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 1705–1709, Istanbul, Turkey, May 2012. European Language Resources Association (ELRA). Kareem Ahmed, Tao Li, Thy Ton, Quan Guo, Kai-Wei Chang, Parisa Kordjamshidi, Vivek Srikumar, Guy Van den Broeck, and Sameer Singh. Pylon: A pytorch framework for learning with constraints. In NeurIPS 2021 Competitions and Demonstrations Track, pages 319–324. PMLR, 2022. Alan Akbik, Duncan Blythe, and Roland Vollgraf. Contextual string embeddings for sequence labeling.
In COLING 2018, 27th International Conference on Computational Lin- guistics, pages 1638–1649, 2018. James F Allen and Choh Man Teng. Broad coverage, domain-generic deep semantic parsing. In 2017 AAAI Spring Symposium Series, 2017. [5] Mustafa Sercan Amac, Semih Yagcioglu, Aykut Erdem, and Erkut Erdem. Procedural reasoning networks for understanding multimodal procedures. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 441–451, 2019. [6] [7] [8] [9] Aida Amini, Antoine Bosselut, Bhavana Dalvi Mishra, Yejin Choi, and Hannaneh Hajishirzi. Procedural reading comprehension with attribute-aware context flow. In Proceedings of the Conference on Automated Knowledge Base Construction (AKBC), 2020. Saeed Amizadeh, Hamid Palangi, Alex Polozov, Yichen Huang, and Kazuhito Koishida. Neuro-symbolic visual reasoning: Disentangling. In ICML, pages 279–290. PMLR, 2020. Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Guided open vocab- ulary image captioning with constrained beam search. In EMNLP, pages 936–945, 2017. Akari Asai and Hannaneh Hajishirzi. Logic-guided data augmentation and regularization for consistent question answering. arXiv preprint arXiv:2004.10157, 2020. [10] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021. [11] Stephen H. Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. Hinge-loss markov random fields and probabilistic soft logic. Journal of Machine Learning Research (JMLR), 18:1–67, 2017. 125 [12] Stephen H. Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. Hinge-loss markov random fields and probabilistic soft logic. JMLR, 18:1–67, 2017. [13] Jonathan Berant, Vivek Srikumar, Pei-Chun Chen, Abby Vander Linden, Brittany Harding, Brad Huang, Peter Clark, and Christopher D Manning. Modeling biological processes for reading comprehension. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1499–1510, 2014. [14] Marcus D Bloice, Peter M Roth, and Andreas Holzinger. Performing arithmetic using a neural network trained on digit permutation pairs. In ISMIS, pages 255–264. Springer, 2020. [15] Sebastian Borgeaud and Guy Emerson. Leveraging sentence similarity in natural language generation: Improving beam search using range voting. In 4th WNGT, pages 97–109, 2020. [16] Antoine Bosselut, Omer Levy, Ari Holtzman, Corin Ennis, Dieter Fox, and Yejin Choi. Simu- lating action dynamics with neural process networks. In Proceedings of the 6th International Conference for Learning Representations (ICLR), 2018. [17] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In EMNLP, pages 632–642, Lisbon, Portugal, September 2015. Association for Computational Linguistics. [18] Matthias Broecheler, Lilyana Mihalkova, and Lise Getoor. Probabilistic similarity logic. In Conference on Uncertainty in Artificial Intelligence, 2010. [19] Miles Brundage, Shahar Avin, Jasmine Wang, Haydn Belfield, Gretchen Krueger, Gillian Hadfield, Heidy Khlaaf, Jingying Yang, Helen Toner, Ruth Fong, et al. Toward trust- worthy ai development: mechanisms for supporting verifiable claims. arXiv preprint arXiv:2004.07213, 2020. [20] Judith Bütepage, Ali Ghadirzadeh, Özge Öztimur Karadaˇg, Mårten Björkman, and Danica Kragic. 
Imitating by generating: Deep generative models for imitation of interactive tasks. Frontiers in Robotics and AI, 7:47, 2020. [21] Oana-Maria Camburu, Brendan Shillingford, Pasquale Minervini, Thomas Lukasiewicz, and Phil Blunsom. Make up your mind! adversarial generation of inconsistent natural language explanations. arXiv preprint arXiv:1910.03065, 2019. [22] Bob Carpenter, Andrew Gelman, Matthew Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, Articles, 76(1):1–32, 2017. [23] Ming-Wei Chang, Lev Ratinov, and Dan Roth. Structured learning with constrained condi- tional models. Machine learning, 88(3):399–431, 2012. 126 [24] Angelica Chen, Jason Phang, Alicia Parrish, Vishakh Padmakumar, Chen Zhao, Samuel R Bowman, and Kyunghyun Cho. Two failures of self-consistency in the multi-step reasoning of llms. arXiv preprint arXiv:2305.14279, 2023. [25] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar, October 2014. Association for Computational Linguistics. [26] Peter Clark, Oyvind Tafjord, and Kyle Richardson. Transformers as soft reasoners over language. arXiv preprint arXiv:2002.05867, 2020. [27] James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. Driving semantic parsing from the world’s response. In 14th CoNLL, pages 18–27, 2010. [28] Anthony Costa Constantinou, Norman Fenton, and Martin Neil. Integrating expert knowl- edge with data in bayesian networks: Preserving data-driven expectations when the expert variables remain unobserved. Expert systems with applications, 56:197–208, 2016. [29] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. [30] Lei Cui, Shaohan Huang, Furu Wei, Chuanqi Tan, Chaoqun Duan, and Ming Zhou. Su- perAgent: A customer service chatbot for E-commerce websites. In Proceedings of ACL 2017, System Demonstrations, pages 97–102, Vancouver, Canada, July 2017. Association for Computational Linguistics. [31] Daniel Dahlmeier and Hwee Tou Ng. A beam-search decoder for grammatical error correc- tion. In EMNLP 2012, pages 568–578, 2012. [32] Bhavana Dalvi, Lifu Huang, Niket Tandon, Wen-tau Yih, and Peter Clark. Tracking state changes in procedural text: a challenge dataset and models for process paragraph com- prehension. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1595–1604, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. [33] Bhavana Dalvi, Lifu Huang, Niket Tandon, Wen-tau Yih, and Peter Clark. Tracking state changes in procedural text: a challenge dataset and models for process paragraph com- prehension. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1595–1604, 2018. [34] Bhavana Dalvi, Niket Tandon, Antoine Bosselut, Wen-tau Yih, and Peter Clark. Everything 127 happens for a reason: Discovering the purpose of actions in procedural text. 
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4496–4505, Hong Kong, China, November 2019. Association for Computational Linguistics. [35] Rajarshi Das, Tsendsuren Munkhdalai, Xingdi Yuan, Adam Trischler, and Andrew McCal- lum. Building dynamic knowledge graphs from text using machine reading comprehension. In International Conference on Learning Representations, 2018. [36] Tirtharaj Dash, Sharad Chitlangia, Aditya Ahuja, and Ashwin Srinivasan. A review of some techniques for inclusion of domain-knowledge into deep neural networks. Scientific Reports, 12(1):1–15, 2022. [37] Luc De Raedt, Angelika Kimmig, and Hannu Toivonen. Problog: a probabilistic Prolog and its application in link discovery. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 2468–2473. AAAI Press, 2007. [38] Luc De Raedt, Angelika Kimmig, and Hannu Toivonen. Problog: A probabilistic prolog and its application in link discovery. In IJCAI, volume 7, pages 2462–2467. Hyderabad, 2007. [39] Luc De Raedt, Robin Manhaeve, Sebastijan Dumancic, Thomas Demeester, and Angelika Kimmig. Neuro-symbolic= neural+ logical+ probabilistic. In NeSy’19 workshop @ IJCAI, 2019. [40] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012. [41] Daniel Deutsch, Shyam Upadhyay, and Dan Roth. A general-purpose algorithm for con- strained sequential inference. In Proceedings of the 23rd CoNLL, pages 482–492, 2019. [42] [43] [44] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre- training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. 128 [45] Perdo Domingos and Matthew Richardson. Markov logic: A unifying framework for statis- tical relational learning. In ICML’04 Workshop on Statistical Relational Learning and its Connections to Other Fields, pages 49–54, 2004. [46] Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sablé-Meyer, Lucas Morales, Luke Hewitt, Luc Cary, Armando Solar-Lezama, and Joshua B Tenenbaum. Dreamcoder: Boot- strapping inductive program synthesis with wake-sleep library learning. In Proceedings of the 42nd ACM SIGPLAN PLDI, pages 835–850, 2021. [47] Hossein Rajaby Faghihi, Quan Guo, Andrzej Uszok, Aliakbar Nafar, and Parisa Kord- jamshidi. Domiknows: A library for integration of symbolic domain knowledge in deep learning. In EMNLP: System Demonstrations, pages 231–241, 2021. [48] Hossein Rajaby Faghihi and Parisa Kordjamshidi. 
Time-stamped language model: Teaching language models to understand the flow of events. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4560–4570, 2021. [49] Hossein Rajaby Faghihi and Parisa Kordjamshidi. Time-stamped language model: Teaching language models to understand the flow of events. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4560–4570, 2021. [50] Hossein Rajaby Faghihi, Parisa Kordjamshidi, Choh Man Teng, and James Allen. The role of semantic parsing in understanding procedural text. arXiv preprint arXiv:2302.06829, 2023. [51] Hossein Rajaby Faghihi, Aliakbar Nafar, Chen Zheng, Roshanak Mirzaee, Yue Zhang, An- drzej Uszok, Alexander Wan, Tanawan Premsri, Dan Roth, and Parisa Kordjamshidi. Glue- cons: A generic benchmark for learning under constraints. arXiv preprint arXiv:2302.10914, 2023. [52] Haoqi Fan and Jiatong Zhou. Stacked latent attention for multimodal reasoning. In Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1072–1080, 2018. [53] George Ferguson, James F Allen, et al. Trips: An integrated intelligent problem-solving assistant. In Aaai/Iaai, pages 567–572, 1998. [54] G.D. Forney. The viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, 1973. [55] Markus Freitag and Yaser Al-Onaizan. Beam search strategies for neural machine translation. ACL 2017, page 56, 2017. [56] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 129 1050–1059. PMLR, 2016. [57] Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. Evaluating nlp models via contrast sets. ArXiv, abs/2004.02709, 2020. [58] Yuling Gu, Bhavana Dalvi, and Peter Clark. Do language models have coherent mental models of everyday things? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1892–1913, 2023. [59] Quan Guo, Hossein Rajaby Faghihi, Yue Zhang, Andrzej Uszok, and Parisa Kordjamshidi. Inference-masked loss for deep structured output learning. In 29th IJCAI, pages 2754–2761, 2021. [60] Quan Guo, Hossein Rajaby Faghihi, Yue Zhang, Andrzej Uszok, and Parisa Kordjamshidi. Inference-masked loss for deep structured output learning. In Christian Bessiere, editor, Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 2754–2761. International Joint Conferences on Artificial Intelligence Or- ganization, 7 2020. Main track. [61] Aditya Gupta and Greg Durrett. Tracking discrete and continuous entity state for process understanding. In Proceedings of the Third Workshop on Structured Prediction for NLP, pages 7–12, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. [62] James Hargreaves, Andreas Vlachos, and Guy Emerson. Incremental beam manipulation for natural language generation. In 16th EACL, pages 2563–2574, 2021. 
[63] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [64] Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. Tracking the world state with recurrent entity networks. In 5th International Conference on Learning Representations, ICLR 2017, 2017. [65] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, In Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. Advances in neural information processing systems, pages 1693–1701, 2015. [66] Jack Hessel, Lillian Lee, and David Mimno. Unsupervised discovery of multimodal links in multi-image, multi-sentence documents. In EMNLP, 2019. 130 [67] Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. Harnessing deep neural networks with logic rules. In 54th ACL, pages 2410–2420, 2016. [68] Hao Huang, Xiubo Geng, Jian Pei, Guodong Long, and Daxin Jiang. Reasoning over entity- action-location graph for procedural text understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5100–5109, Online, August 2021. Association for Computational Linguistics. [69] Jiani Huang, Ziyang Li, Binghong Chen, Karan Samel, Mayur Naik, Le Song, and Xujie Si. Scallop: From probabilistic deductive databases to scalable differentiable reasoning. Neurips, 2021. [70] Yifan Jiang, Filip Ilievski, and Kaixin Ma. Transferring procedural knowledge across commonsense tasks. arXiv preprint arXiv:2304.13867, 2023. [71] Nora Kassner, Oyvind Tafjord, Hinrich Schütze, and Peter Clark. Beliefbank: Adding memory to a pre-trained language model for a systematic notion of belief. arXiv preprint arXiv:2109.14723, 2021. [72] Phil Kim and Phil Kim. Convolutional neural network. MATLAB deep learning: with machine learning, neural networks and artificial intelligence, pages 121–147, 2017. [73] P. Kordjamshidi, D. Roth, and H. Wu. Saul: Towards declarative learning based program- ming. In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), 7 2015. [74] Parisa Kordjamshidi, Daniel Khashabi, Christos Christodoulopoulos, Bhargav Mangipudi, Sameer Singh, and Dan Roth. Better call saul: Flexible programming for learning and inference in nlp. In Proc. of the International Conference on Computational Linguistics (COLING), 2016. [75] Parisa Kordjamshidi, Daniel Khashabi, Christos Christodoulopoulos, Bhargav Mangipudi, Sameer Singh, and Dan Roth. Better call Saul: Flexible programming for learning and inference in NLP. In COLING, pages 3030–3040, Osaka, Japan, December 2016. COLING. [76] Parisa Kordjamshidi, Dan Roth, and Hao Wu. Saul: Towards declarative learning based programming. In 2015 AAAI Fall Symposium Series, 2015. [77] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical Report 0, University of Toronto, Toronto, Ontario, 2009. [78] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. NeurIPS, 25, 2012. 131 [79] Igor Labutov, Shashank Srivastava, and Tom Mitchell. LIA: A natural language pro- grammable personal assistant. 
In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 145–150, Brussels, Bel- gium, November 2018. Association for Computational Linguistics. [80] John Lafferty, Andrew McCallum, and Fernando CN Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001. [81] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [82] Celine Lee, Justin Gottschlich, and Dan Roth. Toward code generation: A survey and lessons from semantic parsing. arXiv preprint arXiv:2105.03317, 2021. [83] [84] Jay Yoon Lee, Sanket Vaibhav Mehta, Michael Wick, Jean-Baptiste Tristan, and Jaime Car- bonell. Gradient-based inference for networks with output constraints. In AAAI, volume 33, pages 4147–4154, 2019. Jay Yoon Lee, Michael L Wick, Jean-Baptiste Tristan, and Jaime G Carbonell. Enforcing output constraints via sgd: A step towards neural lagrangian relaxation. In AKBC@ NIPS, 2017. [85] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to- sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online, July 2020. Association for Computational Linguistics. [86] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019. [87] Rumeng Li, Xun Wang, and Hong Yu. Metamt, a meta learning method leveraging multiple domain data for low resource machine translation. In AAAI, volume 34, pages 8245–8252, 2020. [88] Tao Li, Vivek Gupta, Maitrey Mehta, and Vivek Srikumar. A logic-driven framework for consistency of neural models. In EMNLP-IJCNLP, pages 3924–3935, 2019. [89] Tao Li and Vivek Srikumar. Augmenting neural networks with first-order logic. In Proceed- ings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 292–302, 2019. [90] Tao Li and Vivek Srikumar. Augmenting neural networks with first-order logic. In Anna Korhonen, David R. Traum, and Lluís Màrquez, editors, Proceedings of the 57th Conference 132 of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 292–302. Association for Computational Linguistics, 2019. [91] Xingxuan Li, Liying Cheng, Qingyu Tan, Hwee Tou Ng, Shafiq Joty, and Lidong Bing. Unlocking temporal question answering for large language models using code execution. arXiv preprint arXiv:2305.15014, 2023. [92] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. [93] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019. [94] Xiaoxu Liu, Haoye Lu, and Amiya Nayak. A spam transformer model for sms spam detection. IEEE Access, 9:80253–80263, 2021. 
[95] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019. [96] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. ArXiv, abs/1907.11692, 2019. [97] Reginald Long, Panupong Pasupat, and Percy Liang. Simpler context-dependent logical forms via model projections. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1456–1465, Berlin, Germany, August 2016. Association for Computational Linguistics. [98] Ximing Lu, Peter West, Rowan Zellers, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Neurologic decoding:(un) supervised neural text generation with predicate logic constraints. In NAACL, pages 4288–4299, 2021. [99] Kaixin Ma, Filip Ilievski, Jonathan Francis, Eric Nyberg, and Alessandro Oltramari. Coa- lescing global and local information for procedural text understanding. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1534–1545, 2022. [100] Robin Manhaeve, Sebastijan Dumancic, Angelika Kimmig, Thomas Demeester, and Luc De Raedt. Deepproblog: Neural probabilistic logic programming. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. 133 [101] Robin Manhaeve, Sebastijan Dumancic, Angelika Kimmig, Thomas Demeester, and Luc De Raedt. Deepproblog: Neural probabilistic logic programming. NeurIPS, 2018. Code is available. [102] Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. The stanford corenlp natural language processing toolkit. In Pro- ceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pages 55–60, 2014. [103] Vikash K. Mansinghka, Daniel Selsam, and Yura N. Perov. Venture: a higher-order proba- bilistic programming platform with programmable inference. CoRR, abs/1404.0099, 2014. [104] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3195–3204, 2019. [105] Lluís Màrquez, Xavier Carreras, Kenneth C Litkowski, and Suzanne Stevenson. Semantic role labeling: an introduction to the special issue, 2008. [106] Sherin Mary Mathews. Explainable artificial intelligence applications in nlp, biomedical, and malware classification: a literature review. In Intelligent computing:proceedings of the computing conference, pages 1269–1292. Springer, 2019. [107] Larry R Medsker and LC Jain. Recurrent neural networks. Design and Applications, 5:64–67, 2001. [108] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. [109] Brian Milch, Bhaskara Marthi, Stuart Russell, David Sontag, Daniel L. Ong, and Andrey Kolobov. BLOG: Probabilistic models with unknown objects. In Proceedings of the Inter- national Joint Conference on Artificial Intelligence (IJCAI), 2005. [110] George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995. 
[111] Pasquale Minervini and Sebastian Riedel. Adversarially regularising neural nli models to integrate logical background knowledge. arXiv preprint arXiv:1808.08609, 2018. [112] Tom Minka, John M. Winn, John P. Guiver, and David A. Knowles. Infer.NET 2.5, 2012. Microsoft Research Cambridge. http://research.microsoft.com/infernet. [113] Roshanak Mirzaee, Hossein Rajaby Faghihi, Qiang Ning, and Parisa Kordjamshidi. Spartqa: A textual question answering benchmark for spatial reasoning. In NAACL, pages 4582–4598, 134 2021. [114] Roshanak Mirzaee and Parisa Kordjamshidi. Disentangling extraction and reasoning in multi-hop spatial reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3379–3397, 2023. [115] Nikhil Muralidhar, Mohammad Raihanul Islam, Manish Marwah, Anuj Karpatne, and Naren Ramakrishnan. Incorporating prior domain knowledge into deep neural networks. In 2018 IEEE International Conference on Big Data (Big Data), pages 36–45. IEEE, 2018. [116] Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual attention networks for multimodal reasoning and matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 299–307, 2017. [117] Yatin Nandwani, Abhishek Pathak, Mausam, and Parag Singla. A primal dual formulation for deep learning with constraints. In NeurIPS, 2019. [118] Yatin Nandwani, Abhishek Pathak, and Parag Singla. A primal dual formulation for deep learning with constraints. NeurIPS, 32, 2019. [119] Usman Naseem, Imran Razzak, Katarzyna Musial, and Muhammad Imran. Transformer based deep intelligent contextual embedding for twitter sentiment analysis. Future Genera- tion Computer Systems, 113:58–69, 2020. [120] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 427–436, 2015. [121] Qiang Ning, Zhili Feng, Hao Wu, and Dan Roth. Joint reasoning for temporal and causal relations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2278–2288, 2018. [122] Kuntal Kumar Pal, Kazuaki Kashihara, Pratyay Banerjee, Swaroop Mishra, Ruoyu Wang, and Chitta Baral. Constructing flow graphs from procedural cybersecurity texts. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3945–3957, Online, August 2021. Association for Computational Linguistics. [123] Raghavendra Pappagari, Piotr Zelasko, Jesús Villalba, Yishay Carmiel, and Najim Dehak. Hierarchical transformers for long document classification. In 2019 IEEE automatic speech recognition and understanding workshop (ASRU), pages 838–844. IEEE, 2019. [124] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differen- tiation in pytorch. 2017. 135 [125] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, In H. Wallach, H. Larochelle, A. Beygelzimer, high-performance deep learning library. F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. 
Curran Associates, Inc., 2019. [126] Avi Pfeffer. Practical Probabilistic Programming. Manning Publications, 2016. [127] Vasin Punyakanok, Dan Roth, Wen-tau Yih, and Dav Zimak. Semantic role labeling via inte- ger linear programming inference. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, pages 1346–1352, 2004. [128] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018. [129] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020. [130] Hossein Rajaby Faghihi, Quan Guo, Andrzej Uszok, Aliakbar Nafar, and Parisa Kord- jamshidi. DomiKnowS: A library for integration of symbolic domain knowledge in deep learning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 231–241, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. [131] Hossein Rajaby Faghihi, Roshanak Mirzaee, Sudarshan Paliwal, and Parisa Kordjamshidi. Latent alignment of procedural concepts in multimodal recipes. In Proceedings of the First Workshop on Advances in Language and Vision Research, pages 26–31, 2020. [132] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, November 2016. Association for Computational Linguistics. [133] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015. [134] Francesco Riccio, Roberto Capobianco, and Daniele Nardi. Guess: Generative modeling In Proceedings of the 19th of unknown environments and spatial abstraction for robots. International Conference on Autonomous Agents and MultiAgent Systems, pages 1978–1980, 2020. 136 [135] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine learning, 62(1):107–136, 2006. [136] N. Rizzolo and D. Roth. Learning based Java for rapid development of NLP systems. In Pro- ceedings of the Seventh Conference on International Language Resources and Evaluation, 2010. [137] Nick Rizzolo and Dan Roth. Integer linear programming for coreference resolution. Anaphora Resolution: Algorithms, Resources, and Applications, pages 315–343, 2016. [138] D. Roth and W. Yih. Integer linear programming inference for conditional random fields. In ICML, pages 737–744, 2005. [139] Dan Roth and Vivek Srikumar. Integer linear programming formulations in natural language processing. In 15th EACL: Tutorial Abstracts, 2017. [140] Dan Roth and Wen-tau Yih. Integer linear programming inference for conditional random fields. In Proceedings of the 22nd international conference on Machine learning, pages 736–743, 2005. [141] Alexander Rush. Torch-struct: Deep structured prediction library. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 335–342, Online, July 2020. Association for Computational Linguistics. 
[142] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015. [143] Mohammed Saeed, Naser Ahmadi, Preslav Nakov, and Paolo Papotti. Rulebert: Teaching soft rules to pre-trained language models. In EMNLP, pages 1460–1476, 2021. [144] Erik F Sang and Fien De Meulder. Introduction to the conll-2003 shared task: Language- independent named entity recognition. arXiv preprint cs/0306050, 2003. [145] Marvin C Santillan and Arnulfo P Azcarraga. Poem generation using transformers and doc2vec embeddings. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2020. [146] Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Theophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap. Relational recurrent neural networks. In Advances in neural information processing systems, pages 7299–7310, 2018. [147] Taisuke Sato and Yoshitaka Kameya. Prism: A language for symbolic-statistical modeling. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 137 1330–1339, 1997. [148] Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. Picard: Parsing incrementally for constrained auto-regressive decoding from language models. In EMNLP, pages 9895–9901, 2021. [149] Moritz Schubotz, Philipp Scharpf, Kaushal Dudhat, Yash Nagar, Felix Hamborg, and Bela Gipp. Introducing mathqa: a math-aware question answering system. IDD, 2018. [150] Karin Kipper Schuler. Verbnet: A broad-coverage, comprehensive verb lexicon. 2005. [151] Min Joon Seo, Sewon Min, Ali Farhadi, and Hannaneh Hajishirzi. Query-reduction networks for question answering. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. [152] Peng Shi and Jimmy Lin. Simple bert models for relation extraction and semantic role labeling. arXiv preprint arXiv:1904.05255, 2019. [153] Qi Shi, Qian Liu, Bei Chen, Yu Zhang, Ting Liu, and Jian-Guang Lou. Lemon: Language-based environment manipulation via execution-guided pre-training. arXiv preprint arXiv:2201.08081, 2022. [154] Yunsheng Shi, Zhengjie Huang, Shikun Feng, Hui Zhong, Wenjin Wang, and Yu Sun. Masked label prediction: Unified message passing model for semi-supervised classification. arXiv preprint arXiv:2009.03509, 2020. [155] Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, and Tom Goldstein. On the exploitability of instruction tuning. Advances in Neural Information Processing Systems, 36, 2024. [156] Janvijay Singh, Fan Bai, and Zhen Wang. Entity tracking via effective use of multi-task learning model and mention-guided decoding. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1255–1263, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. [157] Shashi Kant Singh, Shubham Kumar, and Pawan Singh Mehra. Chat gpt & google bard ai: A review. In 2023 International Conference on IoT, Communication and Automation Technology (ICICAT), pages 1–6. IEEE, 2023. [158] Robyn Speer, Joshua Chin, and Catherine Havasi. Conceptnet 5.5: An open multilingual graph of general knowledge. CoRR, abs/1612.03975, 2016. [159] Russell Stewart and Stefano Ermon. 
Label-free supervision of neural networks with physics and domain knowledge. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. 138 [160] Shane Storks, Qiaozi Gao, Yichi Zhang, and Joyce Chai. Tiered reasoning for intuitive physics: Toward verifiable commonsense language understanding. In Findings of the Asso- ciation for Computational Linguistics: EMNLP 2021, pages 4902–4918, 2021. [161] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-linguistic representations. In Eighth International Conference on Learning Representations (ICLR), April 2020. [162] Sainbayar Sukhbaatar, arthur szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28, pages 2440–2448. Curran Associates, Inc., 2015. [163] Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William Cohen. Open domain question answering using early fusion of knowledge bases and text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4231–4242, 2018. [164] Jiankai Sun, De-An Huang, Bo Lu, Yun-Hui Liu, Bolei Zhou, and Animesh Garg. Plate: IEEE Robotics and Visually-grounded planning with transformers in procedural tasks. Automation Letters, 7(2):4924–4930, 2022. [165] Charles Sutton and Andrew McCallum. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4):267–373, 2012. [166] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5100–5111, Hong Kong, China, November 2019. Association for Computational Linguistics. [167] Niket Tandon, Bhavana Dalvi, Joel Grus, Wen-tau Yih, Antoine Bosselut, and Peter Clark. Reasoning about actions and state changes by injecting commonsense knowledge. In Pro- ceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 57–66, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. [168] Niket Tandon, Bhavana Dalvi, Keisuke Sakaguchi, Peter Clark, and Antoine Bosselut. WIQA: A dataset for “what if...” reasoning over procedural text. In EMNLP, 2019. [169] Niket Tandon, Bhavana Dalvi, Keisuke Sakaguchi, Peter Clark, and Antoine Bosselut. Wiqa: A dataset for “what if...” reasoning over procedural text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6076–6085, 2019. 139 [170] Niket Tandon, Keisuke Sakaguchi, Bhavana Dalvi, Dheeraj Rajagopal, Peter Clark, Michal Guerquin, Kyle Richardson, and Eduard Hovy. A dataset for tracking entities in open domain procedural text. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6408–6417, Online, November 2020. Association for Computational Linguistics. [171] Niket Tandon, Keisuke Sakaguchi, Bhavana Dalvi, Dheeraj Rajagopal, Peter Clark, Michal Guerquin, Kyle Richardson, and Eduard Hovy. A dataset for tracking entities in open domain procedural text. 
[172] Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. MovieQA: Understanding stories in movies through question-answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4631–4640, 2016.
[173] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[174] Laura von Rueden, Sebastian Mayer, Katharina Beckh, Bogdan Georgiev, Sven Giesselbach, Raoul Heese, Birgit Kirsch, Julius Pfrommer, Annika Pick, Rajkumar Ramamurthy, et al. Informed machine learning – a taxonomy and survey of integrating knowledge into learning systems. arXiv preprint arXiv:1903.12394, 2019.
[175] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.
[176] Thomas Winters, Giuseppe Marra, Robin Manhaeve, and Luc De Raedt. DeepStochLog: Neural stochastic logic programming. arXiv preprint arXiv:2106.12574, 2021.
[177] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, October 2020. Association for Computational Linguistics.
[178] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. HuggingFace's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
[179] Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, and Wei-Ying Ma. Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6609–6618, 2019.
[180] Te-Lin Wu, Alex Spangher, Pegah Alipoormolabashi, Marjorie Freedman, Ralph Weischedel, and Nanyun Peng. Understanding multimodal procedural knowledge by sequencing multimodal instructional manuals. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4525–4542, 2022.
[181] Yijun Xiao and William Yang Wang. Quantifying uncertainties in natural language processing tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7322–7329, 2019.
[182] Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Van den Broeck. A semantic loss function for deep learning with symbolic knowledge. In ICML, pages 5502–5511. PMLR, 2018.
[183] Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Van den Broeck. A semantic loss function for deep learning with symbolic knowledge. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5502–5511. PMLR, 10–15 Jul 2018.
[184] Semih Yagcioglu, Aykut Erdem, Erkut Erdem, and Nazli Ikizler-Cinbis. RecipeQA: A challenge dataset for multimodal comprehension of cooking recipes. In EMNLP, 2018.
[185] Bishan Yang and Tom Mitchell. Leveraging knowledge bases in LSTMs for improving machine reading. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1436–1446, 2017.
[186] Tsung-Yen Yang, Michael Y Hu, Yinlam Chow, Peter J Ramadge, and Karthik Narasimhan. Safe reinforcement learning with natural language constraints. NeurIPS, 34:13794–13808, 2021.
[187] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5753–5763, 2019.
[188] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
[189] Zhou Yu, Yuhao Cui, Jun Yu, Dacheng Tao, and Qi Tian. Multimodal unified attention networks for vision-and-language interactions. arXiv preprint arXiv:1908.04107, 2019.
[190] Peng Zhang and Maged N Kamel Boulos. Generative AI in medicine and healthcare: Promises, opportunities and challenges. Future Internet, 15(9):286, 2023.
[191] Xiao Zhang, Maria Leonor Pacheco, Chang Li, and Dan Goldwasser. Introducing DRAIL – a step towards declarative deep relational learning. In Proceedings of the Workshop on Structured Prediction for NLP, pages 54–62, Austin, TX, November 2016. Association for Computational Linguistics.
[192] Zhihan Zhang, Xiubo Geng, Tao Qin, Yunfang Wu, and Daxin Jiang. Knowledge-aware procedural text understanding with multi-stage training. In Proceedings of the Web Conference 2021, 2021.
[193] Zhihan Zhang, Xiubo Geng, Tao Qin, Yunfang Wu, and Daxin Jiang. Knowledge-aware procedural text understanding with multi-stage training. In Proceedings of the Web Conference 2021, pages 3512–3523, 2021.
[194] Chen Zheng and Parisa Kordjamshidi. SRLGRN: Semantic role labeling graph reasoning network. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8881–8891, 2020.
[195] Chen Zheng and Parisa Kordjamshidi. Relevant commonsense subgraphs for “what if...” procedural reasoning. In Findings of ACL 2022, pages 1927–1933, 2022.
[196] Da Zheng, Minjie Wang, Quan Gan, Zheng Zhang, and George Karypis. Learning graph neural networks with deep graph library. In WWW '20, New York, NY, USA, 2020. Association for Computing Machinery.
[197] Lingxue Zhu and Nikolay Laptev. Deep and confident prediction for time series at Uber. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pages 103–110. IEEE, 2017.
[198] Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for low-resource neural machine translation. In EMNLP, pages 1568–1575, 2016.