REFERRING EXPRESSION GENERATION TOWARDS MEDIATING SHARED PERCEPTUAL BASIS IN SITUATED DIALOGUE

By

Rui Fang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science - Doctor of Philosophy

2015

ABSTRACT

REFERRING EXPRESSION GENERATION TOWARDS MEDIATING SHARED PERCEPTUAL BASIS IN SITUATED DIALOGUE

By Rui Fang

Situated human-robot dialogue has received increasing attention in recent years. In situated dialogue, robots/artificial agents and their human partners are co-present in a shared physical world. Robots need to automatically perceive and make inferences about the shared environment. Due to a robot's limited perceptual and reasoning capabilities, its representation of the shared world is often incomplete, error-prone, and significantly mismatched from that of its human partner. Although the two are physically co-present, a joint perceptual basis between the human and the robot cannot be established. Referential communication between the human and the robot therefore becomes difficult. Robots need to collaborate with their human partners to establish a joint perceptual basis, and referring expression generation (REG) thus becomes an important problem in situated dialogue. REG is the task of generating referring expressions to describe target objects such that the intended objects can be correctly identified by the human. Although extensively studied, most existing REG algorithms were developed and evaluated under the assumption that agents and humans have access to the same kind of domain information. This is clearly not true in situated dialogue. The objective of this thesis is to investigate how to generate referring expressions that mediate the mismatched perceptual basis between humans and agents.

As a first step, a hypergraph-based approach is developed to account for group-based spatial relations and uncertainties in perceiving the environment. This approach outperforms a previous graph-based approach with an absolute gain of 9%. However, while these graph-based approaches perform effectively when the agent has perfect knowledge or perception of the environment (e.g., 84%), they perform rather poorly when the agent has imperfect perception of the environment (e.g., 45%). This large performance gap indicates that when the agent applies traditional approaches (which usually generate a single minimum description) to describe target objects, the intended objects often cannot be correctly identified by the human. To address this problem, motivated by collaborative behaviors in human referential communication, two collaborative models are developed for REG - an episodic model and an installment model. In both models, instead of generating a single referring expression to describe a target object as in previous work, the agent generates multiple small expressions that lead to the target object, with the goal of minimizing the collaborative effort. In particular, the installment model incorporates human feedback in a reinforcement learning framework to learn optimal generation strategies. Our empirical results have shown that the episodic model and the installment model outperform our non-collaborative hypergraph-based approach with absolute gains of 6% and 21% respectively. Lastly, the collaborative models are further extended to embodied collaborative models to facilitate human-robot interaction.
These embodied models seamlessly incorporate robot gesture behaviors (i.e., pointing to an object) and the human's gaze feedback (i.e., looking at a particular object) into the collaborative model for REG. The empirical results have shown that when robot gestures and human verbal feedback are incorporated, the new collaborative model achieves over 28% absolute gains compared to the baseline collaborative model. This thesis further discusses the opportunities and challenges brought by modeling embodiment in collaborative referential communication in human-robot interaction.

ACKNOWLEDGMENTS

First and foremost I want to thank my Ph.D. advisor Dr. Joyce Chai. I thank her for taking me on as a Ph.D. student in computer science. I thank her for teaching me how to select research problems, think about them critically, solve them, and present their solutions. I thank her for teaching me how to write research papers. She has guided me with her vast experience and invaluable suggestions, lit the way in my darkest times, and encouraged me throughout my Ph.D. life. I can truly say that without her, this thesis would never have happened.

I would also like to express my thanks to all members of the Language and Interaction Research (LAIR) Group. I thank all of you for your valuable discussions and comments. Your collaboration and constructive criticism have been tremendous assets throughout my Ph.D. study.

Thanks also to my parents, who have been extremely understanding and supportive of my Ph.D. studies. I thank my cute little daughter TiTi for coming into my life during my Ph.D. study. Finally, the warmest of thanks to my wonderful wife, Jingmin Liu, who has encouraged me so much over these years. I thank her for her understanding and patience while I was away from home during my Ph.D. study. I owe everything to her; without her everlasting love, this thesis would never have been completed. To her, I dedicate this work.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1 Introduction
1.1 Referring Expression Generation
1.2 REG in Situated Dialogue
1.3 Contributions of this Dissertation
1.4 Outline

Chapter 2 Related Work
2.1 Greedy Search Approaches
2.2 Incremental Approach
2.3 Graph-based Approaches
2.4 Plan-based Approaches
2.5 Reinforcement Learning based Approaches
2.6 Other Approaches
2.7 Evaluation of REG Approaches
2.7.1 Corpora for REG Evaluation
2.7.1.1 The TUNA Corpus
2.7.1.2 The GRE3D3 Corpus
2.7.1.3 The GIVE-2 Corpus
2.7.2 Evaluation Metrics
2.7.2.1 Human-likeness Metrics
2.7.2.2 Identification Accuracy
Chapter 3 Graph-based Approach for REG
3.1 Introduction
3.2 Hyper-graph based REG
3.2.1 Motivation
3.2.2 Hypergraph Representation of the Perceived Scenes
3.2.3 Hypergraph Pruning
3.2.4 Symbolic Descriptors for Attributes
3.2.4.1 Lexicon with Grounded Semantics
3.2.4.2 Attribute Descriptors and Cost Functions
3.2.5 Graph Matching for REG
3.3 Empirical Evaluations
3.3.1 Evaluation Setup
3.3.1.1 Mismatch Views
3.3.1.2 Experiment Setup using Amazon Mechanical Turk
3.3.2 Generation Strategies
3.3.3 Evaluation Results
3.3.3.1 The Role of Cost Functions
3.3.3.2 The Role of Imperfect Perception
3.3.3.3 The Role of Extra Effort
3.4 Conclusion

Chapter 4 Two Collaborative Models for REG
4.1 Introduction
4.2 Collaborative Behaviors in Human-Human Referential Communication
4.3 Two Collaborative Models for REG
4.3.1 Episodic Model
4.3.2 Installment Model
4.3.2.1 Reinforcement Learning Basis
4.3.2.2 State
4.3.2.3 Action
4.3.2.4 Transition
4.3.2.5 Reward
4.3.2.6 Policy
4.3.2.7 Features
4.3.2.8 Learning
4.4 Empirical Evaluations
4.4.1 Experimental Setup
4.4.2 Training of the Installment Model
4.4.3 Evaluation Results
4.4.3.1 Comparison of the Episodic Model with the Non-Collaborative Model
4.4.3.2 Comparison of the Installment Model with the Episodic Model
4.5 Conclusion
Chapter 5 Embodied Collaborative Models for REG
5.1 Embodied Collaborative Models for REG
5.2 Empirical Studies
5.2.1 Design Factors
5.2.2 Experimental Tasks
5.2.3 Procedures
5.2.4 Empirical Results
5.2.5 Performance
5.2.5.1 The role of gesture (on accuracy)
5.2.5.2 The role of feedback
5.2.5.3 Effects on number of installments
5.2.5.4 Gesture on gaze indicated focus
5.2.5.5 Gesture and gaze fixation time
5.3 Discussion
5.4 Conclusions

Chapter 6 Future Directions
6.1 Dialogue Context
6.2 Joint Optimization
6.3 Online Learning

REFERENCES

LIST OF TABLES

Table 3.1: Results with different cost functions
Table 3.2: Results of comparing perfect perception and imperfect perception of the shared world
Table 3.3: Results of comparing minimum effort and extra effort using hypergraphs
Table 4.1: Features used in the installment model. We use the following abbreviations - o: an object to describe, re: generation strategy for an object, lm: landmark object, tar: the target object, sp: spatial relation
Table 4.2: Comparison of accuracy for referent identification based on expressions generated from three models
Table 5.1: Features used in the installment model [1]
Table 5.2: Comparison of accuracy for the eight experiment conditions
Table 5.3: Accuracy of the object identification game, when gesture is present (R+) or absent (R-)
Table 5.4: Rate of successful object identification games, with different feedback conditions
Table 5.5: Statistics for landmark and intended object in the Gaze feedback only condition (G) and the Gaze and Verbal feedback condition (VG)
Table 5.6: Rate of different gaze patterns, under the Verbal Only condition and the Verbal and Gesture condition

LIST OF FIGURES

Figure 1.1: The pipeline for Natural Language Generation [2]
Figure 1.2: The experiment setup in [3] for the investigation of collaborative referring under mismatched perceptual capabilities
Figure 2.1: An algorithm for the Greedy Heuristic: the Full Brevity algorithm
Figure 2.2: The algorithm that takes REG as a search problem
Figure 2.3: The Incremental Algorithm
Figure 2.4: An example domain represented as a labeled directed graph: object 1 is a cup and it is to the left of object 2, which is a box
Figure 2.5: The graph-based REG algorithm
Figure 2.6: Deictic gesture pointing to the scene [4, 5]
Figure 2.7: Deictic gesture graph of the scene [4, 5]
Figure 2.8: An example scene from the TUNA corpus [6]. An example description for the target object: 'the black chair on the right'
Figure 2.9: An example scene from the GRE3D corpus [7]. A description generated by a human: 'the small green ball on the top of a box'
Figure 2.10: An example scene from the GIVE-2 corpus [8]. An example instruction given by a human: 'keep going through the room and make a left, click that yellow button on your left'
Figure 3.1: An original scene and its impoverished scene processed by a CV algorithm
Figure 3.2: An example of a hypergraph representing the perceived scene (a partial scene including only objects 7, 8, 9, 11, 13 for Figure 3.1(a))
Figure 3.3: Hypergraph-based REG
Figure 3.4: The procedure of generating the impoverished image from the original image
Figure 3.5: A web-based interface shown to the Mechanical Turk workers
Figure 4.1: An original scene shown to the human and the re-rendering of the robot's internal representation from [9]. Each object/group has an ID associated for identification purposes
Figure 4.2: The Episodic Model for REG
Figure 4.3: The Agent-Environment Interface in MDP [10]
Figure 4.4: The Installment Model for REG
Figure 4.5: A web-based interface shown to the Mechanical Turk workers for the episodic model
Figure 4.6: A web-based interface shown to the Mechanical Turk workers for the installment model
Figure 4.7: The number of objects correctly identified at an interval of every 200 training sessions
Figure 4.8: The distribution of failure cases at the 1st, 2nd, 3rd and after the 3rd expression in the episodic model
Figure 5.1: An example of situated setup and referential communication
Figure 5.2: The average number of installments per game in each of the feedback conditions
Figure 5.3: The average fixation time (in seconds) for the conditions with and without gesture

Chapter 1 Introduction

1.1 Referring Expression Generation

Natural Language Generation (NLG) is one of the major sub-fields of natural language processing [2]. NLG is the task of generating natural language surface forms from a machine representation of non-linguistic information such as a knowledge base or a logical form. Traditionally, NLG systems have used a pipeline architecture (e.g., Figure 1.1) that divides the generation process into a number of distinct stages of decision making. While different system designers have chosen different numbers of modules (see [2] for a detailed overview of generation tasks), the core tasks of any NLG system are typically the choice of semantic concepts, their organization into distinct sub-messages, and their realization as a string of words. These three tasks are commonly referred to as document planning, microplanning and surface realization, each with its own subtasks. Below is an overview of the generation tasks that are performed within each stage.

• Document planning

1. Content determination is the task of converting the raw data input into the specific kind of data objects (or 'messages') that serve as a basis for the subsequent generation tasks. An important subtask here is to decide which input information to express in the text.

2. Document structuring (or discourse planning) is the process of ordering the information to be expressed and determining the structure of the output text. The result is a text plan, often in the form of a tree structure representing the order and grouping of the messages, and the relations between them.

• Microplanning

1. Lexicalisation is the task of choosing the right words to express the input information. Often, a concept can be expressed using different words or phrases; the task of the lexicalization procedure is to choose the one that is most appropriate in the given context.

2. Aggregation is the process of deciding which information to put in one sentence. Pieces of information that form separate input messages may be joined together and expressed using one sentence which, for instance, contains a conjunction or a relative clause.

3. Referring expression generation is the task of creating phrases to identify domain entities. This task involves choosing the type of expression (e.g., a pronoun or a definite description) and choosing the properties to include in the description.

• Surface realization

1. Linguistic realization is the task of creating grammatical sentences. This involves the application of syntactic and morphological rules that determine aspects like word order and agreement.

2. Structure realization is the task of converting the output text into some required external format (e.g., by adding HTML mark-up), which should be correct according to the rules of syntax, morphology, and orthography.

Figure 1.1: The pipeline for Natural Language Generation [2].
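To make this division of labor concrete, the sketch below models the three stages as simple composable functions over a toy weather-report domain. It is only an illustration of the pipeline idea sketched above; the data structures, function names, and example input are assumptions made for this example and do not come from any particular NLG system.

# A minimal sketch of the classic three-stage NLG pipeline.
# The domain (weather records) and all function names are illustrative
# assumptions; real systems decompose these stages much further.

def document_plan(records):
    """Content determination + document structuring:
    decide what to say and in what order."""
    messages = [r for r in records if r["important"]]
    return sorted(messages, key=lambda r: r["time"])

def microplan(messages):
    """Lexicalisation, aggregation, and referring expression
    generation: decide how to say each message."""
    sentences = []
    for m in messages:
        sentences.append({"subject": "the temperature",
                          "verb": "reached",
                          "object": f"{m['temp']} degrees"})
    return sentences

def realize(sentences):
    """Linguistic and structure realization: produce surface text."""
    return " ".join(f"{s['subject'].capitalize()} {s['verb']} {s['object']}."
                    for s in sentences)

if __name__ == "__main__":
    data = [{"time": 1, "temp": 21, "important": True},
            {"time": 0, "temp": 18, "important": False}]
    print(realize(microplan(document_plan(data))))
    # -> "The temperature reached 21 degrees."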
From this pipelined architecture of NLG, we can see that Referring Expression Generation (REG) is a subtask of NLG, which involves creating referring expressions (noun phrases) that can identify a referent to a listener. These expressions generally take the form of a definite noun phrase such as 'the small yellow ball'. However, when we consider REG in isolation, it quickly becomes clear that things are more complicated than this. It appears that some of the other subtasks within the NLG pipeline are in actual fact tasks that have to be tackled within REG itself as well. At a minimum, REG has to perform the tasks of content determination (selecting the properties of the target referent to be mentioned), lexicalization (choosing the words to represent the properties), and linguistic realization (constructing a grammatically correct noun phrase). By far the majority of work in REG focuses on the content determination subtask.

The Task of REG. The main function of referring expressions is to distinguish the target referent from the objects around it. Other functions that referring expressions can fulfill, such as describing, convincing, informing or directing, are beyond the current main scope of REG. Assume that a finite domain O consists of objects with attributes A. For example, O = {o1, o2, o3} and A = {Type, Color, Position}. The attribute-value pairs for each object in domain O are

o1: {Type - apple, Color - red, Position - left}
o2: {Type - apple, Color - yellow, Position - left}
o3: {Type - banana, Color - yellow, Position - right}

The REG task can now be defined as follows: given a target (or referent) object tar ∈ O, find a set of attribute-value pairs L whose conjunction is true of the target but not of any of the distractors (i.e., O − {tar}, the domain objects different from the target). L is called a distinguishing description of the target. For example, suppose that o1 is the target (and hence o2 and o3 are the distractors); then L could be (Type-apple, Color-red), which could be realized as 'the red apple'. Similarly, if o2 were the target, L could be (Color-yellow, Position-left), realized as 'the yellow object on the left'.
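This definition can be made concrete in a few lines of code. The sketch below encodes the three-object example above as attribute-value dictionaries and checks whether a candidate description is distinguishing; the dictionary encoding and function names are my own illustrative choices rather than a specific published algorithm.

# A minimal sketch of the REG task definition, using the apple/banana
# example above. The dictionary encoding is an illustrative assumption.

DOMAIN = {
    "o1": {"Type": "apple",  "Color": "red",    "Position": "left"},
    "o2": {"Type": "apple",  "Color": "yellow", "Position": "left"},
    "o3": {"Type": "banana", "Color": "yellow", "Position": "right"},
}

def matches(obj_props, description):
    """True if every attribute-value pair in the description holds for the object."""
    return all(obj_props.get(attr) == value for attr, value in description)

def is_distinguishing(description, target, domain=DOMAIN):
    """A description L is distinguishing if it is true of the target
    and false of every distractor (the domain minus the target)."""
    if not matches(domain[target], description):
        return False
    distractors = [o for o in domain if o != target]
    return not any(matches(domain[o], description) for o in distractors)

print(is_distinguishing([("Type", "apple"), ("Color", "red")], "o1"))        # True: 'the red apple'
print(is_distinguishing([("Color", "yellow"), ("Position", "left")], "o1"))  # False: describes o2
print(is_distinguishing([("Color", "yellow"), ("Position", "left")], "o2"))  # True: 'the yellow object on the left'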
1.2 REG in Situated Dialogue

Situated human-robot dialogue has received increasing attention in recent years. In situated dialogue, robots/artificial agents and their human partners are co-present in a shared physical world. Robots need to automatically perceive and make inferences about the shared environment. Due to the robot's limited perceptual and reasoning capabilities, its representation of the shared world is often incomplete, error-prone, and significantly mismatched from that of its human partner. Although the two are physically co-present, a joint perceptual basis between the human and the robot cannot be established [11]. Thus, referential communication between the human and the robot becomes difficult.

To investigate collaborative referring under mismatched perceptual capabilities, we previously conducted experiments on human-human interaction [3]. In these experiments, two human subjects play a set of naming games. One subject (referred to as the human-player) is provided with an original image containing over ten objects (Figure 1.2(a)). Several of these objects have secret names. The other subject (referred to as the robot-player) only has access to an impoverished image of the same scene (Figure 1.2(b)) to mimic the lower perceptual capability of a robot.

Figure 1.2: The experiment setup in [3] for the investigation of collaborative referring under mismatched perceptual capabilities.

The human-player's goal is to communicate the names of target objects to the robot-player so that the robot-player knows which object in his view has what name. The impoverished image was automatically created by applying standard computer vision algorithms and thus may contain different types of processing errors (e.g., mis-segmentation and/or mis-recognition). Using this setup, we have collected a set of dialogues. The following shows an example dialogue segment (collected using the images shown in Figure 1.2):

H: there is basically a cluster of four objects in the upper left, do you see that
R: yes
H: ok, so the one in the corner is a blue cup
R: I see there is a square, but fine, it is blue
H: alright, I will just go with that, so and then right under that is a yellow pepper
R: ok, I see apple but orangish yellow
H: ok, so that yellow pepper is named Brittany
R: uh, the bottom left of those four? Because I do see a yellow pepper in the upper right
H: the upper right of the four of them?
R: yes
H: ok, so that is basically the one to the right of the blue cup
R: yeah
H: that is actually an apple, it is green, I guess it has some amount of yellow on it, but that is a green apple and it is named Ashley

This example demonstrates two important characteristics of referential communication under mismatched perceptual capabilities.

• First, conversation partners rely on both object-specific properties (e.g., object class, color) and spatial relations to describe objects in the environment. Spatial expressions include not only binary relations (e.g., the one to the left of the blue cup), but also group-based references [12, 13, 14, 15] (e.g., the upper right of the four of them).

• Second, because the shared perceptual basis is missing here, the partners make extra efforts to refer and to ground references. For example, the human-player goes through step-by-step installments [16] to come to the targeted object. The robot-player often proactively provides what he perceives from the environment. The human-player and the robot-player collaborate with each other through iterative presentation-acceptance phases as described in the Contribution Model proposed in [17, 11].

These observations indicate that, since robots need to collaborate with human partners to establish a joint perceptual basis, referring expression generation (REG) becomes an equally important problem in situated dialogue. Robots have much lower perceptual capabilities of the environment than humans. In order for a robot to effectively generate referential descriptions about the environment so that its human partner can understand which objects are being referred to, an approach to REG in situated dialogue should have at least the following features:

• It should capture not only binary relations but also group-based relations.

• It should go beyond traditional approaches that rely purely on a one-shot description in a single utterance. For example, it should incorporate the step-by-step collaborative dynamics of the discourse as the conversation proceeds.

• It should have an error-tolerance mechanism to make the referential communication successful.

There has been a tremendous amount of work on REG in the last two decades. REG has traditionally been formulated as a problem of generating a single noun phrase (possibly with multiple modifiers and prepositional phrases) that can uniquely describe a target object among multiple objects [18, 19] so that addressees (i.e., humans) can correctly identify the intended object given this expression.
Although extensively studied, most existing REG algorithms were developed and evaluated under the assumption that agents and humans have access to the same kind of domain information. For example, many experimental setups [20, 7, 21, 22] were developed based on a visual world for which the internal representation is assumed to be known and can be represented symbolically. However, this assumption no longer holds in situated dialogue with robots. There are two important distinctions in situated dialogue:

• First, perfect knowledge of the environment is not available to the agent ahead of time. The agent needs to automatically make inferences to connect recognized lower-level visual features with symbolic labels or descriptors. Both recognition and inference are error-prone and full of uncertainties.

• Second, in situated dialogue the agent and the human have mismatched representations of the environment. The agent needs to take this difference into consideration to identify the most reliable features for REG.

Given these two distinctions, two research questions arise:

• How well do state-of-the-art REG approaches apply under a mismatched perceptual basis in situated dialogue?

• If state-of-the-art REG approaches are not applicable, how shall we develop REG approaches that are suitable for situated dialogue?

1.3 Contributions of this Dissertation

This thesis addresses the above two research questions. The main contributions of this thesis are:

• A hypergraph approach to REG, which is an extension of a well-known graph-based approach [23] that has been shown to be effective in previous work [24, 25]. We extended the regular graph representation into a hypergraph representation to account for group-based spatial relations that are important for visual descriptions. Unlike the previous regular graph approach, we incorporated uncertainties in vision perception into the cost functions. Our empirical results have demonstrated that:

1. In order to address the agent's limited perceptual capability in situated dialogue, REG algorithms will need to take into account uncertainties in perception and reasoning. (More specifically, the cost functions should be tied to the agent's ability to perceive and infer the environment.)

2. Group-based information appears more reliable and thus should be modeled by any approach that deals with automated perception of spatially rich scenes.

3. Hypergraph-based approaches perform effectively under perfect perception, but they suffer a performance drop of around 40% under imperfect perception. This significant degradation calls for new solutions to REG in situated dialogue.

• Two computational collaborative models for REG: an episodic model and an installment model. Motivated by collaborative behaviors in human-human referential communication, our collaborative models, instead of generating a single referring expression to describe a target object as in previous work, generate multiple small noun phrases that gradually lead to the target object, with the goal of minimizing the collaborative effort. Our empirical results have demonstrated that:

1. The weights of the features which affect the decision of how to generate referring expressions can be learned through online interaction with participants.

2. The integration of human feedback in the REG model improves the accuracy of the referential identification task. Our current model could serve as a baseline for further work that incorporates richer dialogue interaction strategies.
• Embodied collaborative models for REG, which take embodiment into account and apply to physical interaction. In particular, these models seamlessly incorporate robot gesture behaviors (i.e., pointing to an object) and the human's gaze feedback (i.e., looking at a particular object) into the collaborative model for REG. Our empirical results have demonstrated that, among different strategies incorporating robot gestures and human feedback, when robot gestures and human verbal feedback are incorporated, the new embodied collaborative model achieves over 28% absolute gains compared to the baseline collaborative model.

1.4 Outline

The remainder of this thesis is structured as follows:

• Chapter 2 describes earlier work related to this thesis, including previous work in REG categorized into several groups: greedy search approaches, incremental approaches, graph-based approaches, plan-based approaches, reinforcement learning based approaches, and others. Some of them directly inspire our study in this thesis.

• Chapter 3 describes the hypergraph approach to REG, an extension of a previous well-known graph-based approach. We show how to represent the perceived environment as a hypergraph, and how we incorporate uncertainties in vision perception into the hypergraph. We then formalize REG as a subgraph isomorphism problem, and present the algorithm that searches for the best solution to the problem. The evaluation of the hypergraph approach under a mismatched perceptual basis is also presented. We conclude with a call for new approaches to REG in situated dialogue.

• Chapter 4 describes the two collaborative models for REG, which do not follow the traditional paradigm of REG algorithms. Instead, the models are inspired by the behaviors observed in human-human referential communication. The evaluation and comparison with the hypergraph-based approach are presented under the mismatched perceptual basis. We also discuss the importance of human feedback in the collaborative referring process and possible future directions.

• Chapter 5 describes the embodied collaborative models for REG, which take embodiment into account and apply to physical interaction. We show how these models seamlessly incorporate robot gesture behaviors (i.e., pointing to an object) and the human's gaze feedback (i.e., looking at a particular object) into the collaborative model for REG. Our empirical results also demonstrate that, when robot gestures and human verbal feedback are incorporated, the new embodied collaborative model achieves over 28% absolute gains compared to the baseline collaborative model.

• Chapter 6 describes my future plans towards situated human-robot dialogue.

Chapter 2 Related Work

Research in REG primarily focuses on the sub-task of content determination, which selects a set of properties that may be used to construct the final surface expression. This content determination task is generally guided by Grice's maxims [26], as shown below:

• Quantity

1. Make your contribution as informative as is required.
2. Do not make your contribution more informative than is required.

• Quality

1. Do not say what you believe to be false.
2. Do not say that for which you lack adequate evidence.

• Relation

1. Be relevant.

• Manner

1. Avoid obscurity of expression.
2. Avoid ambiguity.
3. Be brief (avoid unnecessary prolixity).
In practice, the content determination task for REG is generally optimized to meet two main different goals: • Human-likeness: be identical to those a human would say in the same situation. • Distinction: be unique to the intended referent with respect to its distractors. With this in mind, next we will give a brief review of the REG algorithms. 2.1 Greedy Search Approaches The first classical algorithm is the Full Brevity algorithm (Dale 1989). The algorithm (as shown in Figure 2.1) deploys a search over the set of all properties of the target referent to choose the smallest possible subset that can distinguishes the target from its all distractors. In other words, it aims to search for the shortest distinguishing referring expression for the intended object. In order to find such minimal descriptions, the algorithm first checks whether there exists a single attribute of the intended object that can rule out all the distractors. If this fails, the algorithm further checks all possible combinations of two attributes to see if this can uniquely identify the target, and so on. The algorithm attempts to meet principles from Grice’s conversational maxims [26]: subsequent work on the generation of distinguishing descriptions. • Maxim of Quantity states that a referring expression must contain enough information to distinguish the target referent unambiguously from all distractors and enable the listener to resolve the reference. And a referring expression should not contain more information than necessary for identification, as that might result in unintended implication for the listener. 13 Figure 2.1: An algorithm for the Greedy Heuristic: the Full Brevity algorithm • Maxim of Relation specifies that a referring expression should be pertinent to the abilities of the listener by only including contents that the listener knows or can perceive easily. Full brevity is a computationally intractable problem, it has been superseded by the more popular Incremental Algorithm. However, many ideas and concepts from Full Brevity algorithm. One of this is viewing a number of REG algorithms as search problem and can be a unified as in a search framework [27]. The underlying mechanism of REG approach is a search over the properties of the intended referent. A REG problem is then defined as a state consisting of the description L constructed so far, the set of distractors D that L might refer to, and the set of properties P r that have not been included in L yet. In the initial state, L is empty, D contains all other objects in the 14 domain and P r contains all properties known to be true of r. In the goal state, L contains some of the properties from P r and D is empty, which means that L only refers to r. In Bohnet and Dale’s model, the search proceeds from state to state by moving properties between P r and L and adjusting D accordingly. The search process and its outcome can be influenced by means of a cost function and different ways of queuing the properties in P and expanding the search graph. This procedure can be be sketched as [28] general algorithm for search: Figure 2.2: The algorithm that takes REG as a search problem 2.2 Incremental Approach The Incremental Algorithm (IA) [18] which takes as an input a domain O containing domain objects with their corresponding properties, and a preference list P representing the order in which the attributes will be considered for selection. 
2.2 Incremental Approach

The Incremental Algorithm (IA) [18] takes as input a domain O containing domain objects with their corresponding properties, and a preference list P representing the order in which the attributes will be considered for selection. One particular object in O is the target r to be described by a description (i.e., a uniquely identifying set of properties), and the remaining objects are assumed to be distractors. The goal of the algorithm is to compute a list of attributes L such that L uniquely describes r and no other distractor in O. Figure 2.3 shows the IA algorithm.

Figure 2.3: The Incremental Algorithm

The IA works by iterating over the list of preferred attributes P and adding to the description L under construction those attributes of the target object r that help rule out at least one distractor. The algorithm succeeds when all the distractors have been ruled out (i.e., at the point at which L can uniquely identify r); it fails if all the properties have been processed and the distractor set is not empty. Due to its computational efficiency and simplicity, as well as its psycholinguistic plausibility, the IA has become the most implemented and built-upon REG algorithm in the literature.
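A minimal sketch of this selection loop is shown below, again reusing the attribute-value domain encoding from Chapter 1; the preference order and helper names are illustrative assumptions rather than the exact published pseudocode.

# A minimal sketch of the Incremental Algorithm: walk the preference
# order and keep any attribute of the target that removes at least one
# remaining distractor. Stop as soon as the distractor set is empty.

def incremental_algorithm(target, domain, preference_order):
    L = []                                              # description under construction
    distractors = {o for o in domain if o != target}
    for attr in preference_order:
        value = domain[target].get(attr)
        if value is None:
            continue
        ruled_out = {o for o in distractors if domain[o].get(attr) != value}
        if ruled_out:                                   # attribute has discriminatory power
            L.append((attr, value))
            distractors -= ruled_out
        if not distractors:
            return L                                    # success: L uniquely identifies the target
    return None                                         # failure: distractors remain

DOMAIN = {
    "o1": {"Type": "apple",  "Color": "red",    "Position": "left"},
    "o2": {"Type": "apple",  "Color": "yellow", "Position": "left"},
    "o3": {"Type": "banana", "Color": "yellow", "Position": "right"},
}
print(incremental_algorithm("o1", DOMAIN, ["Type", "Color", "Position"]))
# [('Type', 'apple'), ('Color', 'red')] -> 'the red apple'

Note that, unlike the full-brevity sketch above, this run keeps Type-apple even though Color-red alone would suffice, a consequence of committing to attributes in preference order rather than searching for the globally shortest description.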
• They extended the IA in a way which the preference order is dynamically adjusted at runtime, the properties are also sorted according to their discriminatory quotient at the run-time. [35] used the IA for incremental generation of spatial referring expressions in situated dialog, they extended the IA in two directions: • First, a model of object salience is integrated by modifying the condition under which a description is distinguishing: • Second, Instead of constructing a full relational scene model, they incrementally construct a series of reduced scene used by the algorithm. 2.3 Graph-based Approaches Most early REG algorithms represent domain in a very simple type of knowledge representation: a knowledge base that contains the set of all objects in the domain, and each object has a set of properties that are true of it, and a set of spatial relations among objects. Sometimes this is varied by instead listing for each property which entities have this property. This knowledge base can be represented by a propositional database. But research in Knowledge base suggest that such way of knowledge modeling may not be wise in the long run. There are some researchers that try to 18 Figure 2.4: an example domain represented as a labeled directed graph: object 1 is a cup and it is to the left of object 2 which is a box. formalize REG with different mathematical and computational frameworks. These formalizations are different from traditional REG approaches in several ways: • The representation of the domain information • The representation of the semantic content of a referring expression • The way to search for a distinguishing description One notable algorithm is the graph-based approach [23], which represents the domain including the target referent and distractor objects as a labeled directed graph and the REG is re-formalized as the subgraph construction problem. In the domain, each object is represented as a node, the properties of an object are represented as self-looping arcs on the object node, and spatial relations between objects are represented as arcs between nodes. The graph representation of a visual scene models each object of the scene as a vertex in the scene graph. Attributes of an object such as type, color, type or size are represented as selflooping arcs on the corresponding vertex. They are labeled with the attribute names and the values the object holds. Relations between objects, for example above or to the left of, are modeled as 19 Figure 2.5: The graph-based REG algorithm arcs between the corresponding vertices. Figure 2.4) shows a sample scene graph containing two objects: a cup and a box. Two different graphs play a role in the algorithm: a scene graph representing the domain, and a referring graph representing the content of a referring expression. To generate a distinguishing description, the graph-based algorithm searches for a subgraph of the scene graph that uniquely identifies the target referent, called a distinguishing graph. Starting with the subgraph only containing the vertex which represents the target referent, it performs a breadth-first search over the edges connected to the subgraph found so far. It searches the space exhaustively, but uses a Branch and Bound cost-based heuristic to effectively prune the search space. 
Figure 2.6: Deictic gesture pointing to the scene [4, 5]

Figure 2.7: Deictic gesture graph of the scene [4, 5]

The graph-based approach generates referring expressions that can identify the target object. One natural extension of the graph-based approach is to generate multimodal expressions to identify the target object, e.g., a pointing gesture together with a referring expression. [4, 5] extended the graph-based framework to integrate pointing gestures with referring expressions (Figures 2.6 and 2.7 show the basic idea). Their method is based on the psycholinguistic finding that people combine gestural and verbal information when referring. Their approach represents pointing gestures of different preciseness as self-loop arcs on the target referent in the domain graph. The more imprecise the pointing gesture, the more of the target's closest neighbors are included in the gesture and therefore carry the same pointing edge. The cost of pointing edges is determined by the size of the target referent and the distance that the pointing device has to travel to make the associated pointing gesture. The more imprecise the pointing gesture, the less effort is involved in bringing the pointing device into the correct position and therefore the cheaper the pointing gesture.

Based on the experimental observation that the verbal information in such multimodal referring expressions is often redundant when the pointing gesture is taken into consideration, [36] further proposed an adaptation of their algorithm to allow the generation of over-specified multimodal descriptions. They use a certainty score that represents the speaker's estimate of how likely the listener is to misinterpret the referring expression under construction. The certainty score of a property is determined by a hierarchy over the domain attributes, whereby absolute attributes, such as color, have a higher certainty score than relative ones, such as size. The certainty score of a pointing gesture depends on its preciseness and is calculated in a similar way to the cost of pointing gestures. The certainty score of a referring expression is the sum of the certainty scores of the properties and pointing gestures contained in it. As long as the overall score is below a certain threshold, more properties have to be included, even if the description is already distinguishing from a logical point of view.

Another extension to the graph-based approach is Conceptual Graphs-based generation of referring expressions [37]. In this work, the author used Conceptual Graphs as the representational formalism for referring expression generation.
A Conceptual Graph consists of a bipartite graph, which specifies the factual knowledge about a domain, and a number of support hierarchies which capture ontological knowledge about the types of entities and relations that can be involved in the domain graphs. This formalism has several advantages:

• First, Conceptual Graphs are firmly anchored in First-Order Logic, which allows tapping into existing ontologies and performing automatic inference.

• Second, binary relations are handled in the same way as atomic properties, and relations of any higher order can be represented and treated in a natural way without complex adaptations of the knowledge representation.

2.4 Plan-based Approaches

Planning approaches to REG typically define a set of logical constraints, capturing linguistic or contextual knowledge, and then use planning techniques to find one or several ways to meet these constraints.

[38] treated the processes of generating referring expressions and identifying their referents as a collaborative activity [16]. The collaborative process starts with one participant presenting an initial referring expression. The other participant may accept it, reject it, or postpone the judgment. If a presentation is not accepted, then either participant needs to refashion it. The refashioned expression is then judged again, and the process iterates until the current presentation is accepted. Based on this process, Heeman developed a plan-based computational approach to collaborative REG, where referring is modeled as goal-directed behavior. This approach - based on Clark and Wilkes-Gibbs's model [16] - is a computational model of how dialogue participants collaborate in making a successful referring action.

[39] investigated collaboration on references to objects that are not mutually known to the conversational participants (such as references to landmarks in direction-giving dialogues). Edmonds focuses on dialogue situations where an agent's first attempt at describing a referent is judged insufficient by the recipient and the agents collaborate on expanding the description to provide further information. Edmonds views referent identification as a collaborative activity and formalizes it within a planning framework.

A more recent approach to constraint satisfaction is that of [40], who present an approach to generating scripted dialogues. Their approach particularly addresses the case where several conflicting constraints need to be satisfied together. The challenge here is to balance the constraints against each other if it is not possible to satisfy them all optimally.

[41] presented an approach for context-sensitive referring expression generation based on a theoretical model of salience grounded in functional and cognitive linguistics. The model takes into account two measures of salience, one focusing on speaker-related salience and one focusing on hearer-related salience. Given a set of candidate realizations for a referring expression, the best candidate can be chosen according to both salience measures.

[42] developed a plan-based model for response generation during collaborative planning. Their approach is based on a recursive Propose-Evaluate-Modify framework for modeling collaboration.

[43] present an approach based on Lexicalized Tree Adjoining Grammar, where the planner's goal is to obtain a grammatically complete derivation tree which also satisfies a set of semantic and pragmatic constraints. To reach this goal, a surface realization needs to be found.
All semantic and pragmatic constraints introduced by a surface form are directly encoded into the planner so that they can be taken into account when choosing the best surface realization, given a concept and context as well as the constraints specified by the realization itself. There are two advantages to this approach:

1. Any planning algorithm can be used, which can make the generation process very quick and efficient.

2. Sentence planning and realization are combined into a single process.

[44] extend this approach to situated generation, where the goal of the planner is to generate unambiguous referring expressions in a virtual 3D environment. This goal can be achieved by making use of all the available linguistic and non-linguistic context. [45] develop a maximum entropy based model - an extension of this model - to deal with alternative surface realizations. Their model aims to find the surface realizations that a user will consider easiest to resolve.

2.5 Reinforcement Learning based Approaches

Reinforcement learning approaches optimize sequential decision-making problems by mapping situations to actions with the goal of maximizing a long-term reward. The main ingredients of a reinforcement learning approach are:

• a set of states S that represents the agent's knowledge of the previous and current environment.

• a set of actions A that represents the agent's potential behaviors in a given state.

• a state transition function T that specifies the next state of the agent after taking an action in a state.

• a reward function R that assigns a numeric value to the agent for taking an action in a state.

• a policy that specifies what actions to take in what states.

We refer to [10] for a comprehensive overview of reinforcement learning approaches. Reinforcement learning has been successfully applied to adaptive Natural Language Generation [46, 47, 48, 49, 50, 51], where it is used to automatically optimize the generation policy under uncertainty. We now review some of this work.

[47] present an optimization approach for information presentation in a tourist information domain. Their system has two main choices to consider:

1. which information presentation strategy to choose among recommendation, comparison, summary, and any ordered combination of these strategies.

2. which attributes to include in the presentation, among food quality, location, price range, service quality and cuisine type.

The main trade-off the learning system needs to optimize is to present enough information to users to give them the feeling of having a good overview of the options, while at the same time keeping utterances short and understandable. The learning system was trained using a reward function that optimizes the sentence length, the user reaction, and the number of database hits presented. For the user reaction, the authors hand-coded numeric values associated with each possible user reaction (e.g., the user selects an item, the user adds further constraints to the search, etc.). The reward function and the n-gram-based simulated environment in which the agent was trained were estimated from data collected in a Wizard-of-Oz setting. Learning was performed using the SARSA algorithm. In a simulation-based evaluation, the learned policy was shown to outperform a supervised learning baseline.
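As an illustration of the kind of update these systems rely on, the sketch below shows a tabular on-policy SARSA learner over toy state and action labels. The states, actions, and reward function here are placeholders invented for the example; the cited systems learn over much richer dialogue features and user simulations.

# A minimal sketch of tabular on-policy SARSA, the kind of update used by
# the RL-based NLG systems above. States, actions, and the reward function
# are toy placeholders, not the cited systems' actual representations.
import random
from collections import defaultdict

ACTIONS = ["recommend", "compare", "summarize"]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

Q = defaultdict(float)                 # Q[(state, action)] -> value estimate

def choose(state):
    """Epsilon-greedy action selection."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def simulate(state, action):
    """Placeholder environment: returns (reward, next_state, done)."""
    reward = 1.0 if action == "recommend" else -0.1
    return reward, "user_responded", action == "recommend"

for episode in range(500):
    state, action = "start", choose("start")
    done = False
    while not done:
        reward, next_state, done = simulate(state, action)
        next_action = choose(next_state)
        # SARSA update: bootstrap on the action actually chosen next.
        target = reward + (0 if done else GAMMA * Q[(next_state, next_action)])
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state, action = next_state, next_action

print(max(ACTIONS, key=lambda a: Q[("start", a)]))   # likely 'recommend'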
[46] present an approach for referring expression generation in troubleshooting dialogue domains, which is able to tailor its generation actions to the user's individual prior knowledge of the domain. The dialogue setting is a system helping a user to set up a broadband connection. This domain allows many choices for referring to technical objects: the choices range from using technical jargon, such as saying 'the broadband filter', to using referential descriptions, such as 'the big black box'. An important and interesting characteristic here is that the user's knowledge is dynamically updated during the interaction due to a learning effect, so the system needs to adapt to a constantly changing user model. The crucial trade-off the system encounters is therefore to adapt maximally to the user's individual information needs. For example, the system should try not to use terms that the user does not understand, but it should also not engage in unnecessarily long descriptions. Training of the learning agent was done using the on-policy SARSA algorithm [10] against a user simulation that was estimated from data collected in a Wizard-of-Oz setting. The user's responses are simulated based on the system's last dialogue act, the user's prior knowledge of the domain, and previous clarification requests.

[52] use reinforcement learning to learn a policy for the generation of temporal expressions. They train a learning agent in a simulated environment that was estimated from human ratings of temporal expressions. They use a hand-crafted reward function that assesses the user preference, the likelihood that the user requests clarification, and the length of the description. They showed that users interacting with the learned policy have a significantly higher subjective task success rate and user satisfaction score than with other baseline approaches.

[51] presented hierarchical reinforcement learning models for task-oriented Natural Language Generation in the navigation instruction domain, where generating a navigation instruction is a mix of navigation strategy generation and referring expression generation. Their approach jointly optimizes the navigation strategies (e.g., high-level navigation such as 'go down the hallway', low-level navigation such as 'turn right') and referring expression generation (e.g., mentioning the color or spatial relation of the referent) in direction-giving dialogues within virtual 3D worlds [8].

2.6 Other Approaches

[53] proposed a computational model for REG. The model treats the REG task as a constrained search problem, which takes syntax, lexical choice and content selection into account as search constraints. The model was evaluated using synthesized images. [21] treat language generation as a game-theoretic model in which a set of semantic and pragmatic constraints needs to be satisfied. They address the task of generating spatial descriptions so as to allow the hearer to identify a target object as easily as possible. They present results showing that a generation model that takes the pragmatic effects of its utterances on the hearer into account significantly outperforms a model that does not. [13] formalized a referring expression as a sequence of groups which reflects the subject's narrowing-down strategy in identifying the referent object.
Given a domain and a referent object, the method constructs a sequence of groups starting from the largest group containing the referent, and recursively narrows down the group until only the referent is identified. The authors extended this approach [14] to handle more referring strategies, e.g., descriptions with only the referent object or in combination with binary relations between objects. The entire sequence is then rendered linguistically.

[54] present a minimally supervised learning approach to the generation of route instructions in outdoor environments. They use the Expectation-Maximization algorithm to align geographical route representations with corresponding linguistic realizations taken from an annotated corpus of human data. They argue that route instruction generation can be a highly context-dependent task, given the adaptations that an NLG system is required to make to specific spatial configurations and environments, which may also include diverse spatial objects.

[55] describe an approach to generating text descriptions of web resources, where the context for generation is represented in a declarative way through the construction of policies. The context in their model is defined by permissions, prohibitions and obligations regarding particular users or resources. In their text description generation domain, different users of the platform may have different permissions to access information that is stored on the platform, and this needs to be taken into account by the NLG system.

[56] introduced a chart-based overgenerate-and-rank approach to REG which separates the representational form of the knowledge base from that of the referring expression being built. This allows the algorithm to logically infer information that is not represented explicitly in the knowledge base, rather than adding such information to the knowledge representation, as was done in extensions to the IA and the graph-based framework described above. The algorithm first builds up all possible combinations of properties to describe all objects and sets of objects in the domain. It does this by recursively combining basic descriptions using logical connectives and relations between objects, always keeping track of the extension sets of the logical forms constituting the descriptions. The combinations are stored in a chart to facilitate reuse of intermediate results. The chart is pruned by the requirement that every combination of properties produced has to be realizable linguistically, making this one of the rare approaches that integrate surface realization with content selection. To guide the search for 'optimal' referring expressions, Varges suggested a number of constraints that either apply during the chart-building process or filter out unwanted solutions afterwards. [57] extended his algorithm to be able to produce quantified expressions, which is, as they show, impossible in approaches such as the IA and the graph-based approach.

Description Logics constitute yet another approach to the generation of referring expressions that is based on viewing the semantics of a referring expression as a logical formula.
Description Logics are a well-established First-Order Logic formalism and have been used to combine basic formulas (or descriptions) via logical connectives to generate descriptions for all objects and sets of objects in parallel [58].

[59] present an approach for route instruction generation in urban environments which takes different properties of the user and the environment into account. In their approach, route instructions are adapted to the different degrees of prior knowledge that users have about the navigation environment (e.g., their familiarity with the area). Moreover, when generating route instructions, their approach takes into account the saliency of geographical properties in the environment.

[60] present an approach for generating referring expressions for an interaction task in a 3D virtual environment. Their algorithm is based on a decision tree classifier using features of the environment, including the user's viewing angle, their distance from the target referent, the number of the target referent's distractors, as well as its visibility and the dialogue history. They present results from a human rating study in which the referring expressions generated by their algorithm were rated favorably.

Recently, there has been increasing interest in generating multimodal reference. For example, based on the findings that deictic gesture is useful in referential communication, [61, 62, 63, 64] explored the role of salience in the generation of referring expressions with gesture. [65, 66] combined deictic gesture and speech to generate a referring act. [5] extended a graph-based non-collaborative REG model to generate multimodal reference. Motivated by psycholinguistic findings that eye gaze is tightly linked to human language processing in both language interpretation and production [67], [68] developed an interactive REG system that tracks the hearer's eye gaze to monitor the hearer's interpretation of the referring expressions the system generates. Their REG system is embodied in nature, but it was built in a virtual environment where the agent and the human are assumed to have the same representation of the shared environment. In addition, their system only generates verbal descriptions, not multimodal referring expressions.

More recently, work in REG has identified the importance of modeling the variation observed in referring expressions generated by humans. Instead of generating a single referring expression for the intended referent, [69, 70] proposed approaches that can generate a distribution of referring expressions for a referent object or a set of objects.

Earlier work by [71] looked into the problem of mismatched knowledge between conversation partners for REG. The approach is a direct extension of the incremental algorithm [18]. However, this work only provides a proof-of-concept example to illustrate the idea; no empirical evaluation was given.

All these previous works have motivated our present investigation. We are interested in REG under a mismatched perceptual basis between conversation partners, where the agent has imperfect perception and knowledge of the shared environment. In particular, we first took a well-studied graph-based approach [23] and extended it to incorporate group spatial relations and uncertainties associated with automated perception of the environment. The reason we chose a graph-based approach is that graph representations are widely used in the fields of computer vision (CV) and pattern recognition to represent spatially rich scenes.
Nevertheless, the findings from this investigation provide insight into other approaches. Motivated by collaborative behaviors observed in human referential communication, we then develop two collaborative models - an episodic model and an installment model - for REG. In both models, instead of generating a single referring expression to describe a target object as in previous work, the agent generates multiple small expressions that lead to the target object, with the goal of minimizing the collaborative effort. In particular, our installment model incorporates human feedback in a reinforcement learning framework to learn the optimal generation strategies and obtains better performance. We have also developed embodied collaborative models for REG. These models seamlessly incorporate robot gesture behaviors (i.e., pointing to an object) and the human's gaze feedback (i.e., looking at a particular object) into the collaborative model for REG. We examined different strategies for incorporating robot gestures and human feedback. Our empirical results have shown that when robot gestures and human verbal feedback are incorporated, the new collaborative model achieves over 28% absolute gains compared to the baseline collaborative model. This thesis also discusses the opportunities and challenges brought by modeling embodiment in collaborative referential communication in human-robot interaction.

2.7 Evaluation of REG approaches

The evaluation of REG approaches can be task-based, where humans rate the usefulness of generated referring expressions for a particular task (e.g., identifying objects), or corpus-based, where generated expressions are compared to human-produced expressions in a given data set. As more and more suitable data sets have become available, corpus-based evaluation has become the dominant paradigm. We next introduce the currently available data sets, and discuss what metrics are used to evaluate REG algorithms on those data sets.

2.7.1 Corpora for REG evaluation

2.7.1.1 The TUNA corpus

The TUNA corpus [6] is a data set which was collected via a web-based experiment. During the experiment, both singular and plural descriptions of objects were gathered by showing participants one or two targets. Targets were always displayed together with distractors, and the resulting domain objects were randomly positioned in a 3 x 5 grid. An example scene is shown in Figure 2.8.

Figure 2.8: An example scene from the TUNA corpus [6]. An example description for the target object: 'the black chair on the right'.

The TUNA corpus contains two different domains: a furniture domain and a people domain. The furniture domain has pictures of furniture and household items, and the items in the domain were manipulated so that, besides the type attribute (e.g., desk, chair, fan), color, orientation and size are also systematically varied. In the people domain, the number of possible attributes and values is much larger. This domain contains a set of black and white photographs of people. Properties of these photographs include gender, head orientation, shirt, suit and tie, beard, age, glasses, hair and location. The TUNA data set has formed the basis of three shared REG challenges.

2.7.1.2 The GRE3D3 corpus

Psycholinguistic findings suggest that the use of spatial relations causes a higher cognitive load for both speaker and listener, so there exists a prevalent hypothesis that spatial relations should only be used as a last resort.
Many REG algorithms do not attempt to use spatial relations between the target referent and landmark objects in their descriptions. The GRE3D3 corpus [7] was created to explore the usefulness of spatial relations in REG. The corpus is based on visual stimuli of simple 3D scenes containing a small number of geometric objects, as shown in Figure 2.9. The data were collected by directing participants to a website, which allowed the collection of a very large number of samples. GRE3D3 contains 720 referring expressions. The authors created a second corpus, GRE3D7, which contains 4480 referring expressions and is by far the largest existing collection of context-independent distinguishing descriptions.

Figure 2.9: An example scene from the GRE3D corpus [7]. A description generated by a human: 'the small green ball on the top of a box'.

2.7.1.3 The GIVE-2 Corpus

Interpreting and generating natural-language instructions in a situated environment have received significant attention in the past few years. Most recently, the Challenge on Generating Instructions in Virtual Environments (GIVE) [72] has attracted considerable interest in the Natural Language Generation community. GIVE is a shared task in which NLG systems must generate real-time instructions that guide a user in a virtual world. It offers novel possibilities for exploring the linguistic and non-linguistic issues involved in situated NLG, while supporting a solid approach to NLG system evaluation. The GIVE-2 corpus [8] is a multilingual corpus of human instruction giving in virtual environments. It is designed to support the development of NLG systems for the GIVE Challenge. The corpus consists of 63 American English and 45 German written discourses in which one subject guided another in a treasure hunting task in the spirit of the GIVE-2 virtual worlds, as shown in Figure 2.10. The corpus exhibits varied instruction-giving behavior, and can thus serve both as a source of data and as a point of comparison for natural language generation systems.

Figure 2.10: An example scene from the GIVE-2 corpus [8], shown as (a) a bird's-eye view and (b) the view of the virtual environment. An example instruction given by a human: 'keep going through the room and make a left, click that yellow button on your left'.

2.7.2 Evaluation Metrics

The emergence of the corpora discussed earlier has greatly facilitated the empirical evaluation of REG algorithms. We now describe how corpora can be used for evaluation purposes.

2.7.2.1 Human-likeness Metrics

Human-generated language is by definition natural, and also generally understandable. How can human references be compared with those produced by a REG algorithm? One way is to look for measures that compute the overlap at the level of properties; one source of inspiration may come from the kind of measures researchers in biology and information retrieval have been using (van Rijsbergen 1979). One of these is the Dice (1945) coefficient, which was originally proposed as a measure of ecological association between species, and applied to REG by Gatt et al. (2007). The Dice coefficient is computed by scaling the number of properties (that is, attribute-value pairs) that two descriptions have in common by the size of the two sets combined:

dice(A, B) = \frac{2 \times |A \cap B|}{|A| + |B|}

The Dice measure ranges from 0 (no agreement, i.e., no properties shared between A and B) to 1 (complete agreement; A and B share all properties). It is a simple and straightforward measure of overlap (see the short illustration below), but it does have at least two disadvantages.
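As a small illustration of how the Dice score is computed over sets of attribute-value pairs (the attribute names and values below are hypothetical and not drawn from any particular annotation scheme):

```python
def dice(a, b):
    """Dice coefficient between two sets of attribute-value pairs."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical example: a human description vs. a generated one.
human = {("type", "chair"), ("colour", "black"), ("position", "right")}
generated = {("type", "chair"), ("colour", "black")}
print(dice(human, generated))   # 2 * 2 / (3 + 2) = 0.8
```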
First, it assumes that all properties are independent and that all are equally different from each other. Suppose a human participant referred to o1 in our example domain as the red apple on the left, and consider the following two references produced by a REG algorithm: the red apple and the apple on the left. Both omit one property from the human reference and thus have the same Dice score. But only the former is distinguishing; the latter is not. This problem could be solved by adopting a binary weighted version of the Dice metric which would multiply the Dice score by 1 for a distinguishing description and by 0 for a non-distinguishing one. Another metric is MASI (Measuring Agreement on Set-valued Items) [73]:

MASI(A, B) = \lambda \times \frac{|A \cap B|}{|A \cup B|}

The weighting function \lambda biases the score in favor of similarity where one set (apple) is a subset of the other (fruit):

\lambda = \begin{cases} 0 & \text{if } A \cap B = \emptyset \\ 1 & \text{if } A = B \\ 2/3 & \text{if } A \subset B \text{ or } B \subset A \\ 1/3 & \text{otherwise} \end{cases}

2.7.2.2 Identification Accuracy

Besides human-likeness, it is also important to explore to what extent the generated expressions help addressees identify the target referent. This can be studied experimentally in several different ways. For example, [74] first show a generated description to participants. After the participants read the description, a visual scene appears and the participants are asked to choose the intended referent. This experiment gives rise to three extrinsic evaluation metrics: the reading time, the identification time and the accuracy, defined as the percentage of generated referring expressions where the referents are correctly identified.

Chapter 3 Graph-based Approach for REG

This chapter focuses on the graph-based approach for REG in the traditional paradigm.1

1 The majority of this chapter was published in the following paper: Rui Fang, Changsong Liu, Lanbo She, Joyce Y. Chai. Towards Situated Dialogue: Revisiting Referring Expression Generation. Conference on Empirical Methods in Natural Language Processing (EMNLP), Seattle, USA, 2013.

3.1 Introduction

Towards mediating a shared perceptual basis in situated dialogue, our previous work [3] conducted experiments to study referential communication between partners with mismatched perceptual capabilities. We simulated mismatched capabilities by making an original scene (Figure 3.1(a)) available to a director (simulating higher perceptual calibre) and a corresponding impoverished scene (Figure 3.1(b)) available to a matcher (simulating lowered perceptual calibre). The impoverished scene is created by re-rendering the automated recognition results produced by a CV algorithm on the original scene. An example of an original scene and its impoverished scene is shown in Figure 3.1. Using this setup, the director and the matcher were instructed to collaborate with each other on naming games. Through these games, data was collected on how partners with mismatched perceptual capabilities collaborate to ground their referential communication.

The setup in [3] is intended to simulate situated dialogue between a human (like the director) and a robot (like the matcher). The robot has a significantly lowered ability in perception and reasoning. The robot's internal representation of the shared world will be much like the impoverished scene, which contains many recognition errors. We have pointed out earlier that there are two important distinctions in situated dialogue:

• First, perfect knowledge of the environment is not available to the agent ahead of time.
The agent needs to automatically make inferences to connect recognized lower-level visual features with symbolic labels or descriptors. Both recognition and inference are error-prone and full of uncertainties.

• Second, in situated dialogue the agent and the human have mismatched representations of the environment. The agent needs to take this difference into consideration to identify the most reliable features for REG.

Given these two distinctions, it is not clear whether state-of-the-art REG approaches are applicable under a mismatched perceptual basis in situated dialogue. In this chapter, we first extend a state-of-the-art REG algorithm with hypergraph-based representations and illustrate how uncertainties from automated perception can be incorporated. We then describe an empirical study using Amazon Mechanical Turk for evaluating generated referring expressions. Finally, we present evaluation results and discuss potential future directions.

3.2 Hyper-graph based REG

3.2.1 Motivation

The data from [3, 75] shows that different strategies were used by conversation partners to produce referential descriptions. Besides directly describing attributes or binary relations with a relatum, they often use group-based descriptions (e.g., a cluster of four objects on the right). This is mainly due to the fact that some objects are simply not recognizable to the matcher.

Figure 3.1: An original scene (a) and the corresponding impoverished scene (b) processed by a CV algorithm.

Binary spatial relations alone sometimes cannot adequately describe the target object, so the matcher must resort to group information to distinguish the target object from the rest of the objects. For example, suppose the matcher needs to describe target object 5 in Figure 3.1(b); he/she may have to start by indicating the group of three objects at the bottom and then specify the relationship (i.e., top) of the target object within this group. The importance of group descriptions has been shown not only here, but also in previous works on REG [13, 14, 15]. While the original graph-based approach can effectively represent attributes and binary relations between objects [23], it is insufficient to capture within-group or between-group relations. Therefore, to address the low perceptual capabilities of artificial agents, we introduce hypergraphs to represent the shared environment. Our approach has two unique characteristics compared to previous graph-based approaches:

• A hypergraph representation is more general than a regular graph. Besides attributes and binary relations, it can also represent group-based relations.
• Unlike previous work, here the generation of hypergraphs is completely driven by automated perception of the environment. This is done by incorporating uncertainties in perception and reasoning into cost functions associated with graphs.

Next we give a detailed account of the hypergraph representation, the cost functions incorporating uncertainties, and the search algorithm for REG.

3.2.2 Hypergraph Representation of the Perceived Scenes

A directed hypergraph G [76] is a tuple of the form G = ⟨X, A⟩, in which

X = \{x_m\}
A = \{a_i = (t_i, h_i) \mid t_i \subseteq X, h_i \subseteq X\}

Similar to regular graphs, a hypergraph consists of a set of nodes X and a set of arcs A. However, different from regular graphs, each arc in A is considered a hyperarc in the sense that it can capture relations between any two subsets of nodes: a tail (t_i) and a head (h_i).
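As a rough data-structure sketch of the scene hypergraph just defined (a sketch only; the attribute and relation labels, and the particular arcs, are hypothetical examples rather than the system's actual label set):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class HyperArc:
    """A hyperarc connects a tail subset of nodes to a head subset; a
    self-looping arc (tail == head) carries an attribute of that subset."""
    tail: frozenset
    head: frozenset
    label: str        # e.g. "type:cup", "above", "group-of-2" (hypothetical)
    cost: float       # uncertainty-based cost; lower means more reliable

@dataclass
class SceneHypergraph:
    nodes: set = field(default_factory=set)    # ids of perceived objects
    arcs: list = field(default_factory=list)   # list of HyperArc

# Hypothetical fragment with two perceived objects and one perceived group.
g = SceneHypergraph(nodes={7, 9})
g.arcs.append(HyperArc(frozenset({9}), frozenset({9}), "type:cup", cost=0.4))
g.arcs.append(HyperArc(frozenset({9}), frozenset({7}), "above", cost=0.1))
g.arcs.append(HyperArc(frozenset({7, 9}), frozenset({7, 9}), "group-of-2", cost=0.2))
```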
Therefore, a hypergraph is a generalization of a regular graph. It becomes a regular graph if the cardinalities of both the tail and the head are restricted to one for all hyperarcs. While regular graphs are commonly used to represent binary relations between two nodes, hypergraphs provide a more general representation for n-ary relations among multiple nodes.

We use hypergraphs to represent the agent's perceived physical environment (also called scene hypergraphs). More specifically, each perceived object is represented by a node in the graph. Each perceived visual attribute of an object (e.g., color, size, type information) or a group of objects (e.g., the number of objects in the group, location) is captured by a self-looping hyperarc. Hyperarcs are also used to capture the spatial relations between any two subsets of nodes, whether it is a relation between two objects, or between two groups of objects, or between one or more objects within a group of objects. For example, Figure 3.2 shows a hypergraph created for part of the impoverished scene shown in Figure 3.1(b) (i.e., the upper right corner including objects 7, 8, 9, 11, and 13). One important characteristic is that, because the graph is created based on an automated vision recognition system, the values of an attribute or a relation in the hypergraph are numeric (except for the type attribute). For example, the value of the color attribute is the RGB distribution extracted from the corresponding visual object, the value of the size attribute is the width and height of the bounding box, and the value of the location attribute is a function of spatial coordinates. These numerical features will be further converted to symbolic labels with certain confidence scores (described later in Section 3.2.4.2).

3.2.3 Hypergraph Pruning

The perceived visual scene can be represented as a complete hypergraph, in which any pair of subsets of nodes is connected by a hyperarc. However, such a complete hypergraph is not only computationally inefficient but also unnecessary. Instead of keeping all possible n-ary relations (i.e., hyperarcs), we only retain those relations that are likely to be used by humans to produce referring expressions, based on two heuristics.

The first heuristic is based on perceptual principles, also called the Gestalt Laws of perception [77], which describe how people perceive groups and objects in visual information. The word 'Gestalt' means whole or complete in German. The laws of grouping are shown below.

• Law of Proximity. The law of proximity states that objects that lie close together are often perceived as groups. Even if their shapes, sizes, and object types are different, they will be perceived as a group if they are close together, and objects that are far apart are often perceived as separate objects.
• Law of Similarity. The law of similarity states that similar objects are likely to be perceptually grouped together. This similarity can occur in the form of shape, size, color, shading or other qualities. Objects of similar shape, size or color are more likely to form groups than objects differing along these dimensions.
• Law of Closure. The law of closure states that human vision perceives objects such as shapes, letters and pictures as wholes even when they are incomplete or partially occluded by other objects. For example, when parts of a shape's border are missing, human perception tends to see the shape as complete and thus fills in the missing visual gap.
• Law of Symmetry.
The law of symmetry states that human vision perceives objects as symmetrical and forming around a center point. Therefore, when two symmetrical parts are unconnected, human vision tends to perceptually connect them to form a coherent shape.
• Law of Common Fate. The law of common fate states that human vision perceives moving elements as a whole when they are seen moving in the same direction at the same rate.
• Law of Continuity. The law of continuity states that objects tend to be grouped together if they form a line or follow an established direction. When there is an intersection between objects, human vision tends to perceive each object as a single object, even when the objects overlap with each other.
• Law of Good Gestalt. The law of good gestalt states that objects tend to be perceptually grouped together if they form a pattern that is simple, regular, symmetrical and in order.
• Law of Past Experience. The law of past experience states that objects tend to be grouped together if they were observed together in the past. If objects tend to be observed within small temporal intervals, they are more likely to be grouped together.
• Law of Figure and Ground. The law of figure and ground states that if visual attention is on part of a scene, the remaining visual information tends to be ignored and becomes part of the background.

Among these laws, two well-known principles of perceptual grouping are proximity and similarity. Based on these two principles, previous works have developed different algorithms for perceptual grouping [78, 79]. In our investigation, we adopted Gatt's algorithm [79], which has been shown to be more accurate for spatial grouping. Given the results from spatial grouping, we only retain hyperarcs that represent spatial relations between two objects, between two perceived groups, between one object and a perceived group, or between one object and the group it belongs to.

The second heuristic is based on the observation that, given a certain orientation, people tend to use a relatum that is closer to the referent rather than more distant relata. In other words, people are less likely to refer to an object relative to a distant relatum when there is a closer relatum. For example, when referring to the stapler (object 9 in Figure 3.1(a)), it is more likely to use "the stapler above the battery" than "the stapler above the cellphone". Based on this observation, we prune the hypergraphs by only retaining hyperarcs between an object and its closest relatum for each possible orientation. Figure 3.2 shows the resulting hypergraph representing a subset of objects (7, 8, 9, 11, and 13) in Figure 3.1(a).

Figure 3.2: An example hypergraph representing the perceived scene (a partial scene including only objects 7, 8, 9, 11, and 13 from Figure 3.1(a)).

3.2.4 Symbolic Descriptors for Attributes

As mentioned earlier, the values of the attributes of objects and their relations are numerical in nature. In order for the agent to generate natural language descriptions, the first step is to assign symbolic labels or descriptors to those attributes and relations. Next we describe how we use a lexicon with grounded semantics in this process.

3.2.4.1 Lexicon with Grounded Semantics

Grounded semantics provides a bridge to connect symbolic labels or words with lower-level visual features [80].
Previous work has developed various approaches for grounded semantics, mainly for the reference resolution task, i.e., identifying visual objects in the environment given language descriptions [81, 82, 83, 84, 3]. For the referring expression generation task here, we also need a lexicon with grounded semantics. In our lexicon, the semantics of each category of words is defined by a set of semantic grounding functions that are parameterized on visual features. For example, for the color category the semantics is defined as a multivariate Gaussian distribution over the RGB distribution. Specific words such as green, red, or blue have different means and covariances, as follows:

color:red: f_r(\vec{v}_{color}) = \mathcal{N}(\vec{v}_{color} \mid \mu_1, \Sigma_1)
color:green: f_g(\vec{v}_{color}) = \mathcal{N}(\vec{v}_{color} \mid \mu_2, \Sigma_2)
color:blue: f_b(\vec{v}_{color}) = \mathcal{N}(\vec{v}_{color} \mid \mu_3, \Sigma_3)

The above functions define how likely a set of recognized visual features (i.e., \vec{v}_{color}) describing the color dimensions (i.e., the RGB distribution) is to match the color terms red, green, and blue. For spatial relation terms such as above, below, left, and right, the semantic grounding functions take both vertical and horizontal coordinates of two objects, as follows:2

spatialRel:above(a, b) = f_{above}(\vec{v}_{a,loc}, \vec{v}_{b,loc}) = \begin{cases} 1 - \frac{|x_a - x_b|}{400} & \text{if } y_a < y_b \\ 0 & \text{otherwise} \end{cases}

spatialRel:left(a, b) = f_{left}(\vec{v}_{a,loc}, \vec{v}_{b,loc}) = \begin{cases} 1 - \frac{|y_a - y_b|}{400} & \text{if } x_a < x_b \\ 0 & \text{otherwise} \end{cases}

2 The size of the overall scene is 800 x 800.

Using the above convention, we have defined semantic grounding functions for size category words (e.g., small and big) and absolute position words (e.g., top, below, left, and right). In addition, we use object recognition models [85] to define class type category words such as apple and orange used in our domain.

3.2.4.2 Attribute Descriptors and Cost Functions

Given the lexicon with grounded semantics described above, the numerical attributes captured in the scene hypergraph can be converted to symbolic descriptors. For each attribute (e.g., color) or relation, the corresponding visual feature vector (e.g., \vec{v}_{color}) is plugged into the semantic grounding functions for the corresponding category of words. The word that best describes the attribute is chosen as the descriptor for that attribute. For example, given an RGB color distribution \vec{v}_{color}, we can find the color descriptor as follows:

w^* = \arg\max_{w \in \{red, green, blue\}} f_w(\vec{v}_{color})

For each attribute or relation, we can find a best descriptor in this manner. In addition, we also obtain a numerical value (returned from the semantic grounding functions) that measures how well this descriptor describes the corresponding visual features. Intuitively, one would choose a descriptor that closely matches the visual features. Based on this intuition, we define the cost for each attribute A as follows:

cost(A) = 1 - f_{w^*}(\vec{v}_A)

where w^* is the best descriptor for the attribute. Given an attribute, the better the descriptor matches the extracted visual features, the lower the cost of the corresponding hyperarc.

3.2.5 Graph Matching for REG

Now the hypergraph representing the perceived environment has symbolic descriptors for its attributes and relations, together with corresponding costs. Given this representation, REG can be formulated as a graph matching problem similar to that described in [23]. We use the same Branch and Bound algorithm described in [23]. Figure 3.3 shows our hypergraph-based referring expression generation approach.
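Before turning to the cost of a full hypothesis graph, here is a rough sketch of the descriptor-selection and attribute-cost step of Section 3.2.4.2, using the spatial-relation grounding functions given above. The coordinates and the small candidate-word dictionary are hypothetical; the real lexicon also contains the Gaussian color models and the recognition-based type models.

```python
SCENE = 800.0   # overall scene size, as noted in the footnote above

def f_above(loc_a, loc_b):
    (xa, ya), (xb, yb) = loc_a, loc_b      # image coordinates, y grows downward
    return 1 - abs(xa - xb) / (SCENE / 2) if ya < yb else 0.0

def f_left(loc_a, loc_b):
    (xa, ya), (xb, yb) = loc_a, loc_b
    return 1 - abs(ya - yb) / (SCENE / 2) if xa < xb else 0.0

def best_descriptor(features, grounding_fns):
    """Pick the word whose grounding function best matches the features,
    and return it together with its cost (1 minus the matching score)."""
    word, score = max(((w, f(*features)) for w, f in grounding_fns.items()),
                      key=lambda ws: ws[1])
    return word, 1 - score

# Hypothetical relation between object a at (100, 120) and object b at (130, 400):
rel_fns = {"above": f_above, "left of": f_left}
print(best_descriptor(((100, 120), (130, 400)), rel_fns))   # ('above', ~0.075)
```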
The cost of a hypothesis hypergraph H is denoted as cost(H) and is defined as:

cost(H) = \sum_{x \in X} cost(x) + \sum_{a \in A} cost(a)

The cost function is monotonic. This implies that extending a hypergraph H with a hyperarc a can only result in a more expensive hypergraph, that is, cost(H + a) > cost(H). The monotonicity of the cost function helps reduce the search space.

The distractor set contains every node d in G that is not the target node v but that H can "refer to". By "refer to", we mean that there exists a subgraph G' \subseteq G with H \cong G', where the bijection b between the nodes of G' and H satisfies b(d) = v.

Figure 3.3: Hypergraph-based REG.

In this approach, a hypothesis hypergraph (starting with one node representing the target object) is gradually expanded by adding a least-cost hyperarc from the scene hypergraph. At each expansion, the hypothesis graph is matched against the scene hypergraph to decide whether it matches any nodes other than the target node in the scene hypergraph. The expansion stops if the hypothesis graph does not cover any other nodes except the target node. At this point, the hypothesis graph captures all the content (e.g., attributes and relations) required to uniquely describe the target object. We then apply a set of simple generation templates to generate the surface form of the referring expression based on the hypothesis graph.

3.3 Empirical Evaluations

3.3.1 Evaluation Setup

3.3.1.1 Mismatched Views

To simulate the mismatched perceptual basis between a human and an artificial agent, we applied standard computer vision algorithms to process the original image and generate the impoverished representation of the same scene. This procedure is illustrated in Figure 3.4. To create the impoverished image, we first used the Otsu algorithm [86] to separate foreground objects from the background. Then each segmented object was fed into a feature extraction routine that computed a set of region-based and contour-based shape features of the object [85]. The feature vector of the object was then compared with all the known objects in a knowledge base, and the object was recognized as the class of its nearest neighbor in the knowledge base. After this segmentation → feature extraction → recognition pipeline, the final outcome was displayed as an abstract illustration in the impoverished image. For instance, if an object was recognized as a pear, an abstract illustration of a pear was displayed in the impoverished image at the position of the original object. The color of the illustration was set to the average color of the pixels of the original object, and the height and width were set according to the bounding box of the original object.

Figure 3.4: The procedure of generating the impoverished image from the original image.

3.3.1.2 Experiment Setup using Amazon Mechanical Turk

To evaluate the performance of this hypergraph-based approach to REG, we conducted a comparative study using crowd-sourcing. More specifically, we created 48 different scenes similar to that in Figure 3.1(a). Each scene has 13 objects on average, and there are 621 objects in total. For each of these scenes, we applied the above impoverished scene generation procedure and generated scene hypergraphs as described in Section 3.2.2. We then used different generation strategies (varied in terms of graph representations and cost functions, to be explained in Section 3.3.2) to automatically generate referring expressions to refer to each object.
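For reference, the following is a simplified sketch of the hypothesis-expansion loop of Section 3.2.5 that produces these descriptions. It shows only a greedy core of the search (the actual system performs a full Branch and Bound search), and `matches` and `candidate_arcs` are hypothetical helper functions standing in for subgraph matching against the scene hypergraph.

```python
def generate_description(scene, target, matches, candidate_arcs):
    """Greedy sketch: grow a hypothesis graph around the target node until
    it no longer matches any distractor node in the scene hypergraph."""
    hypothesis = {"nodes": {target}, "arcs": []}
    while True:
        # Distractors: scene nodes other than the target that the current
        # hypothesis graph could still "refer to".
        distractors = [n for n in matches(hypothesis, scene) if n != target]
        if not distractors:
            return hypothesis            # uniquely identifies the target
        arcs = candidate_arcs(scene, hypothesis)
        if not arcs:
            return None                  # no distinguishing description found
        best = min(arcs, key=lambda a: a.cost)
        hypothesis["arcs"].append(best)
        hypothesis["nodes"] |= set(best.tail) | set(best.head)
```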
To evaluate the quality of these generated referring expressions, we used Amazon Mechanical Turk to solicit feedback from the crowd.3 Through an interface as shown in Figure 3.5, we displayed an original scene and the generated referring expressions (from different generation strategies) in a random order. We asked each turker to select the object in the scene that he/she believed was the one referred to by the shown referring expression (i.e., a reference identification task). Each referring expression received three votes from the crowd. In total, 217 turkers participated in our experiment.

3 To control the quality of crowdsourcing, we recruited participants based on the following criteria: participants' locations are limited to the United States; the approval rate for each participant's previous work is greater than or equal to 95%; and the number of each participant's previously approved works is greater than or equal to 1000.

Figure 3.5: A web-based interface shown to the Mechanical Turk workers.

3.3.2 Generation Strategies

We applied a set of different strategies to generate referring expressions for each object. The variations lie in two dimensions:

• Graph representations: using a hypergraph to represent the perceived scene as described in Section 3.2.2 versus using a regular graph as introduced in [23];
• Cost functions for attributes and relations: cost functions that have been used in previous works [87, 88] and cost functions that incorporate uncertainties of perception as described in Section 3.2.4.2.

Cost functions play an important role in graph-based approaches [23]. Previous works have examined different types of cost functions [87, 88, 89]. We adopted some commonly used cost functions from previous work together with the cost functions defined here. In particular, we experimented with the following cost functions:

• Simple Cost: The costs for all hyperarcs are set to 1. With this cost function, the graph-based algorithm resembles the Full Brevity algorithm of Dale [90] in that a shortest distinguishing description is preferred.
• Absolute Preferred: The costs for hyperarcs representing absolute attributes (e.g., type, color, and position) are set to 1. The costs for relative attributes (e.g., size) and relations are set to 2. This cost function mimics humans' preference for absolute attributes over relative ones [18].
• Relative Preferred: The costs for hyperarcs representing absolute attributes are set to 2, and those for relative attributes and relations are set to 1. This cost function has been applied previously to emphasize the importance of spatial relations in REG [7].
• Uncertainty Based: The costs for all hyperarcs are defined by incorporating uncertainties from perception as described in Section 3.2.4.2.
• Uncertainty Relative Preferred: To emphasize the importance of spatial relations as demonstrated in situated interaction [83, 35], the costs for hyperarcs representing relative attributes and relations are divided by 3. This cost function allows the algorithm to prefer spatial relations through the reduced cost.

Note that we only tested a few (not all) of the commonly used cost functions proposed by previous work [23, 87, 88, 89]. For example, we did not include the stochastic cost function, which is defined based on the frequencies of attribute selection in training data [23]. On the one hand, we did not have a large set of human descriptions of the impoverished scene to learn the stochastic cost.
On the other hand, it is not clear whether human strategies for describing the impoverished scene should be used to represent optimal strategies for the robot. Nevertheless, the above cost functions allow us to evaluate whether incorporating perceptual uncertainties makes a difference in REG performance.

3.3.3 Evaluation Results

As mentioned earlier, each generated referring expression received three independent votes regarding its referent from the crowd. The referent with the most votes is taken as the predicted referent and is used for evaluation. If all three votes are different, then by default it is deemed that the referent is not correctly identified for that expression. We use the accuracy of the referential identification task (i.e., the percentage of generated referring expressions where the referents are correctly identified) as the metric to evaluate the different generation strategies described in Section 3.3.2.

3.3.3.1 The Role of Cost Functions

Table 3.1 shows the results based on different cost functions and different graph representations.

Cost Function                Regular Graph    Hypergraph
Simple Costs                 33.2%            33.3%
Absolute Preferred           30.1%            30.3%
Relative Preferred           31.1%            35.4%
Uncertainty Based            35.7%            37.5%
Uncertainty Rel. Prefer.     36.7%            45.2%
Table 3.1: Results with different cost functions

There are several observations.

• First, when the agent does not have perfect knowledge of the environment and has to automatically infer the environment, as in our setting here, cost functions based on uncertainties of perception lead to better results. This holds for both regular graphs and hypergraphs. This result is not surprising and indicates that cost functions should be tied to the agent's ability to perceive and infer the environment. The uncertainty-based cost functions allow the agent to prefer reliable attributes or relations.
• Second, consistent with previous work [7], we observed the importance of spatial relations. Especially when the perceived world is full of uncertainties, spatial relations tend to be more reliable. In particular, as shown in Table 3.1, using hypergraphs enables generating group-based relations and results in significantly better performance (45.2%) compared to regular graphs (36.7%) (p = 0.002).

Note that our current cost function only includes uncertainties of the agent's own perception, in a simplistic form. When humans and agents have a mismatched perceptual basis, the human's model of comprehension and tolerance of inaccurate descriptions could play a role in REG. Incorporating human models in the cost function will require in-depth empirical studies, and we leave that to future work.

3.3.3.2 The Role of Imperfect Perception

To further understand the role of hypergraphs in mediating mismatched perceptions between humans and agents, we created a perfect scene regular graph and a perfect scene hypergraph (representing the agent's perfect knowledge of the environment) for each of the 48 scenes used in the experiments. In each of these scene graphs, the attribute and relation descriptors are manually provided. We further applied the Absolute Preferred cost function (which has shown competitive performance in previous work) to generate referring expressions for each object. Again, each referring expression received three votes from the crowd.
Table 3.2 shows the results comparing two conditions:

• (1) REs generated (by the Absolute Preferred cost function) based on the perfect graphs, which represent the agent's perfect knowledge and perception of the environment; and
• (2) REs generated based on automatically created graphs (by the Uncertainty Relative Preferred cost function), which represent the agent's imperfect knowledge of the environment as a result of automated recognition and inference.

Environment      Perfect Perception    Imperfect Perception
Regular Graph    80.4%                 36.7%
Hypergraph       84.2%                 45.2%
Table 3.2: Results comparing perfect perception and imperfect perception of the shared world.

The results show that, given perfect knowledge of the environment, hypergraphs only perform marginally better than regular graphs (p = 0.07). Given imperfect knowledge of the environment, hypergraphs significantly outperform regular graphs by taking advantage of spatial grouping information (p = 0.002). It is worthwhile to mention that currently we use spatial proximity to identify groups. However, the hypergraph-based approach is not restricted to spatial grouping. In theory, it can represent any type of group based on different similarity criteria.

Furthermore, our results show that the graph-based approaches perform quite competitively under the condition of perfect knowledge and perception. Although evaluated on different data sets, this result is consistent with results from previous work [24, 25]. However, what is more interesting here is that while graph-based approaches perform well when the agent has knowledge of the environment as perfect as its human partner's, these approaches fall apart with close to a 40% performance degradation when applied to the situation where the agent's representation of the shared world is problematic and full of mistakes.

These results indicate that REG for automatically perceived scenes can be extremely challenging. Many errors result from automated perception and reasoning that affect the internal representation of the world and thus the generated REs. In our experiments, we applied a very basic CV algorithm, which resulted in rather poor performance on our data: overall, 60.3% of objects in the original scene are mis-recognized, and 10.5% of objects are mis-segmented. We think this poor CV performance represents a more challenging problem. Some errors, such as recognition errors, can be bypassed using our current approach based on hypergraphs. For example, in Figure 3.1, target objects 9 (a stapler) and 13 (a key) are mis-recognized as a cup and a pen. Using our hypergraph-based approach, for target object 9, instead of generating "a small cup" (as in the case of using regular graphs), "a gray object on the top within a cluster of four objects" is generated. For target object 13, instead of "a pen" as generated by regular graphs, "a small object on the right within a cluster of 4" is generated. Even with recognition errors, these group-based descriptions will allow the listener to identify target objects in their representation correctly.

Nevertheless, many processing errors cannot be handled by our current approach. For example, an object can be mistakenly segmented into multiple parts, or several objects can be mistakenly grouped into one object.
In addition, our current semantic grounding functions are simple. Sometimes they do not provide correct descriptors for the extracted visual features. More sophisticated functions that better reflect human visual perception [91, 92, 93] should be pursued in the future.

3.3.3.3 The Role of Extra Effort

While REG systems have a tendency to produce minimal descriptions, recent psycholinguistic studies have shown that speakers do not necessarily follow Grice's maxim of quantity, and they tend to provide redundant properties in their descriptions [94, 95, 96]. With this in mind, we conducted a very simple evaluation of the role of extra effort. Once a set of descriptors is selected based on the minimum cost, one additional descriptor (with the least cost among the remaining attributes or relations) is added to the referential description. We once again solicited crowd feedback on this set of expressions generated with extra effort. Each expression again received three votes from the crowd, and evaluation is again based on the identification accuracy.

Table 3.3 shows the results comparing minimum effort with extra effort when using hypergraphs to generate REs.

                  Perfect Perception    Imperfect Perception
Minimum Effort    84.2%                 45.2%
Extra Effort      88.1%                 51.5%
Table 3.3: Results comparing minimum effort and extra effort using hypergraphs.

As indicated here, extra effort (adding one additional descriptor) leads to more comprehensible REs, with a 3.9% improvement under perfect perception and a 6.3% improvement under imperfect perception (both significant, p < 0.05). The improvement is larger under imperfect perception. This seems to indicate that exploring extra effort in REG could help mediate mismatched perceptions in situated dialogue. This finding is in line with psycholinguistic studies showing that over-specification can speed up the object identification process [96]. However, a better understanding of how to engage in such extra effort will be required in the future.

3.4 Conclusion

In situated dialogue, humans and agents have mismatched perceptions of the shared environment. To facilitate successful referential communication between a human and an agent, the agent needs to take such discrepancies into consideration and generate referential descriptions that can be understood by its human partner. With this in mind, we revisited the problem of referring expression generation in the context of mismatched perceptions between humans and agents. In particular, we applied and extended a state-of-the-art graph-based approach [23] in this new setting. Our empirical results have shown that:

• To address the agent's limited perceptual capability, REG algorithms will need to take into account the uncertainties in perception and reasoning. More specifically, the cost functions should be tied to the agent's ability to perceive and infer the environment.
• Group-based information appears more reliable; explicitly modeling group-based information could circumvent some of the perceptual errors that occur in perceiving the environment. Thus group-based information should be modeled by any approach that deals with automated perception of spatially rich scenes.
• While graph-based approaches have been shown to be effective when the agent's knowledge of the environment is as complete as its human partner's, these approaches are often inadequate when humans and agents have mismatched representations of the shared world. This implies that the traditional paradigm of REG might not be applicable in situated dialogue.

Our empirical results here call for new solutions to address the mismatched perceptual basis.
Previous work indicated that referential communication is a collaborative process [16, 38]. Conversation partners make extra effort to collaborate with each other. For the situation with a mismatched perceptual basis, a potential solution thus should go beyond the objective of generating a minimum description, and move towards a collaborative model which incorporates immediate feedback from the conversation partner [39].

Chapter 4 Two Collaborative Models for REG

This chapter focuses on collaborative models for REG.1

1 The majority of this chapter was published in the following paper: Rui Fang, Malcolm Doering, Joyce Y. Chai. Collaborative Models for Referring Expression Generation in Situated Dialogue. Proceedings of the Twenty-Eighth National Conference on Artificial Intelligence (AAAI), Québec, Canada, 2014.

4.1 Introduction

Recently, there has been increasing interest in REG for situated dialogue [60, 35, 44, 97, 51]. While traditional approaches work well for situated dialogue in a virtual world, they are not adequate to support situated dialogue in a physical world because of the mismatched perceptual basis between the human and the agent. To investigate the problem of REG under a mismatched perceptual basis, in Chapter 3 we conducted a study using artificial scenes (an example is shown in Figure 3.1(a)). The original scene is what a human sees, and the corresponding impoverished scene is a re-rendering of the robot's internal representation of the same scene, created by processing the original scene with a computer vision algorithm [85] and a perceptual grouping algorithm [79]. Each object/group has an associated ID for identification purposes. This example demonstrates the perceptual differences between the human and the agent. For example, in the impoverished scene, some objects (e.g., objects 19 and 20) are missing or mis-recognized (e.g., objects 2 and 10), and perceptual groupings are also different (e.g., group 18 contains only 2 objects).

Using these scenes, we applied a hypergraph-based approach to REG. Our experimental results have shown that although the hypergraph-based approach performs better than the leading approach based on regular graphs [23], our approach only achieved 45% accuracy (in referential identification) under the mismatched perceptual basis. Our results indicate that, in situations where the shared perceptual basis is missing, if the agent applies traditional approaches to generate referring expressions, the intended objects often cannot be correctly identified by the human.

Inspired by these findings, we have developed collaborative models for REG that particularly address the mismatched perceptual basis. We use the same scenes and target objects as in Chapter 3 in our evaluation in order to have a valid comparison. Next, we first introduce observations of collaborative behaviors in human referential communication. Based on these observations, we describe our two collaborative models for REG, which do not follow the traditional paradigm of REG algorithms. We then present the evaluation and the comparison with the hypergraph-based approach (introduced in Chapter 3) under the mismatched perceptual basis. We also discuss the importance of human feedback in the collaborative referring process and possible future directions.

4.2 Collaborative Behaviors in Human-Human Referential Communication

Previous psycholinguistic studies have indicated that referential communication is a collaborative process [16].
To minimize the collaborative effort, partners tend to go beyond issuing an elementary referring expression (i.e., a single noun phrase), and instead use other types of expressions such as episodic, installment, self-expansion, self-correction, etc. Clark and Bartenger [2] have identified nine different types of noun phrases in referring expressions [16, 98], listed below. The first and second types involve only one-way communication: they represent the approach most current and past research on REG has taken. In all other types there is collaboration between the director and the matcher to identify the right figure.

• Elementary noun phrases: These are the only non-iterated phrases, such as: The triangle rabbit.
• Self-corrected noun phrases: Phrases where the director corrects himself, such as: The one with the large square, ehh, triangle in the middle.
• Other-corrected noun phrases: Noun phrases that are corrected by the matcher, such as: Director: ... with the triangle on top. Matcher: Don't you mean the square? Director: Yes, you're right.
• Expanded noun phrases: The speaker at first produces a complete and distinguishing description and only continues describing because the hearer did not understand it. The difference with the installment noun phrase is that there the speaker does not produce a matching description. An example: Director: The next one is the rabbit. Matcher: Uhh.... Director: That's asleep. It looks like it's got pointy ears.
• Episodic noun phrases: Noun phrases produced in two or more easily distinguished intonation units, enabling the matcher to think and interact, such as: The next one is a rabbit, with ears and a head pointing down.
• Installment noun phrases: The speaker starts with the first part of a distinguishing description, waits for acknowledgment, and after that continues. It looks much like the episodic noun phrase; the only real difference is the absence of a response from the matcher in the episodic noun phrase. An example: Director: The next one is the one who's sitting, Matcher: Okay. Director: With his knees bended.
• Trial noun phrases: Almost the same as the previous type; instead of just waiting, the director asks a yes-no question, like: You see the ice-skater? Depending on the answer, the director will continue differently.
• Holder noun phrases: Noun phrases that are used as a stand-in when a speaker does not know how to refer to a certain object, such as whatchamacallit in: The eh-hmm - whatchamacallit, the, eh-mm... the rabbit.
• Invited noun phrases: With these, directors invite the matcher to complete the expression, for example: Director: and the last figure is ehh... Matcher: the rabbit.

Among these nine types of referring expressions, episodic descriptions and installment descriptions are particularly collaborative. Unlike a single elementary description of a target object, an episodic description is produced in two or more easily distinguished episodes or intonation units. Here is an example of an episodic description from [75].

A: below the orange, next to the apple, it's the red bulb.

An installment behavior is similar to the episodic behavior in the sense that it also breaks down generation into smaller episodes. The difference is that explicit feedback from the addressee is solicited before the speaker moves to the next episode. Here is an example of an installment from [75].

A: under the pepper we just talked about.
B: yes.
A: there is a group of three objects.
B: OK.
A: there is a yellow object on the right within the group.

The generation of episodic or installment descriptions is not intended to minimize the speaker's own effort, but rather to minimize the collaborative effort so that the addressee can quickly identify the referent. These collaborative behaviors from human-human referential communication have motivated our collaborative models.

4.3 Two Collaborative Models for REG

Inspired by the episodic noun phrases and installment noun phrases used in human-human dialogue, we developed two collaborative computational models for REG:

• Episodic Model. The episodic model generates referring expressions (REs) in an "episodic" fashion: it generates a RE as a sequence of smaller noun phrases which lead to the target object.
• Installment Model. The installment model generates REs in an "installment" fashion: it generates one small noun phrase, waits for the partner's response, and then generates another small noun phrase. This process iterates until the target object is reached.

Figure 4.1: An original scene shown to the human (a) and the corresponding impoverished re-rendering of the robot's internal representation (b), from [9]. Each object/group has an associated ID for identification purposes.

For both models the goal is to construct a sequence of objects to describe, where the target object is the final object to be described in the sequence. Both models can choose to directly describe the target object (as traditional REG methods do) if such a choice is deemed to have the lowest overall cost. But in general these models often find an object that is "easier" to describe and then gradually lead to the target object. For example, in Figure 4.1 suppose the robot wants to describe the target object 7. Note that it is hard to describe this object directly based on its features. So both models will search for a sequence of objects that lead to the target. In the episodic model, a sequence could be object 5 → group 15 → object 7, which could be linguistically realized as: "a red pepper, below that a group of 3 objects, and a yellow object on the right within the group". In the installment model, once an object (e.g., object 5) is chosen, the model will ask the human to provide feedback. Then, based on the feedback, it will search for another object and iterate this process until the target object is reached.

4.3.1 Episodic Model

The episodic model is based on the Branch-and-Bound search method [99]. It searches for the path to a target object with the overall least cost. We represent our problem space as a directed graph G = (N, A), in which

N = \{n_i\}
A = \{a_j = \langle t_j, h_j \rangle \mid t_j, h_j \in N\}

N is a set of nodes, and A is a set of arcs. Each node n ∈ N represents a perceived object or a perceptual grouping of objects by the agent, together with one of all possible concatenations of linguistic descriptors describing attributes (e.g., type, color, etc.). Each descriptor comes with a cost, which indicates how accurately the linguistic descriptor matches the underlying physical features. We use the 'Uncertainty Relative Preferred' cost functions from [9], where the costs for hyperarcs representing relative attributes and relations are divided by 3. This cost function allows the algorithm to prefer spatial relations through the reduced cost. A path in graph G is a sequence of nodes ⟨n_0, ..., n_k⟩ such that ⟨n_{i-1}, n_i⟩ ∈ A.
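Before defining the path cost formally, here is a small sketch of how such a path and its cost can be represented; the node and arc costs below are hypothetical, and the computation mirrors the cost function given next.

```python
def path_cost(path, node_cost, arc_cost):
    """path: a sequence of node ids, e.g. ['object5', 'group15', 'object7']."""
    total = sum(node_cost[n] for n in path)
    total += sum(arc_cost[(a, b)] for a, b in zip(path, path[1:]))
    return total

# Hypothetical costs for the example sequence object 5 -> group 15 -> object 7:
node_cost = {"object5": 0.1, "group15": 0.3, "object7": 0.6}
arc_cost = {("object5", "group15"): 0.2, ("group15", "object7"): 0.2}
print(path_cost(["object5", "group15", "object7"], node_cost, arc_cost))  # 1.4
```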
The cost of a path is the sum of the costs of all the nodes and arcs on the path:

cost(⟨n0, ..., nk⟩) = ∑_{i=0}^{k} cost(ni) + ∑_{i=1}^{k} cost(⟨ni−1, ni⟩)

Figure 4.2 details our episodic model for generating episodic expressions. The algorithm starts with an initial state in which the path contains only the root node n0, the best path is ⊥ (none), and the bound is set to infinity. The algorithm then recursively extends the path ⟨n0⟩ until the minimum-cost path is found. The Boolean function isDiscriminating() measures whether a path can uniquely identify the object nk in the path. The Boolean function isGoal() tests whether nk is the target node.

4.3.2 Installment Model

In the installment model, we treat the collaborative referring expression generation task as a sequential decision-making problem and formalize it under the reinforcement learning framework. Our goal is to learn a good strategy for constructing a series of objects in the scene that leads to the successful identification of a target object. Our model is inspired by [100, 101], where reinforcement learning is applied to interpret navigational natural language instructions. Here we apply reinforcement learning to optimize the generation strategy so that it generates the most appropriate referring expressions to successfully identify the target object. We first introduce the basic reinforcement learning concepts and then show how we formalize our installment model under the reinforcement learning framework.

4.3.2.1 Reinforcement Learning Basics

Reinforcement learning is based on the Markov Decision Process (MDP). The main ingredients of a reinforcement learning approach include:

• a set of states S = {s1, s2, ..., sN} that represents the agent's knowledge of the previous and current context. In the REG task, the state should contain all the relevant information that is important for the agent to generate referring expressions. States need to satisfy the Markov property, which requires that the current state contain all the information needed such that decisions made on the basis of the current state are just as good as decisions made on the basis of the entire history of states.

• a set of actions A = {a1, a2, ..., aM} that represents the agent's potential behavior in a given state. For the REG task, an action needs to decide which object to describe and how to describe it (e.g., what attributes to include in the description, and what relations to mention with respect to other objects).

• a state transition function T that specifies the next state of the agent after taking an action in a state. The transition can be stochastic or deterministic. For the REG task, after the agent generates a particular referring expression in the current state, how does this referring expression change the environment? (E.g., what is the user's response? How many objects are still available to describe?)

• a reward function R that assigns a numeric value to the agent for taking an action in a state. A reward represents the goal that we want the learning agent to achieve, so it is critical that the rewards we set up truly indicate what we want accomplished. In the REG task, the assignment of reward could be based on different criteria, e.g.:
1. how well the description can uniquely identify the target object and rule out all other distractors;
2. how complex the description is (e.g., the length of the description);
3. what kinds of descriptors are included in the description (e.g., is color mentioned in the description?);
4. how confident the agent is in using a given description for a given target object; etc.

• a policy π, which is any mapping from states to actions. A policy specifies what action to take in each state. In our REG task, a policy decides which object to describe and how to describe it in a given situation.

The dynamics of an MDP can be described as follows. At the beginning of an interaction between the agent and the environment, at time step t = 0, the agent receives a representation of the environment, called the state st ∈ S. It then performs an action at ∈ A, the set of actions available in state st. As a result, the agent receives a reward rt+1 ∈ R and observes the next state st+1 ∈ S, which is the updated environment state. This process can be seen as a finite sequence of states, actions, and rewards {s0, a0, r1, s1, a1, r2, ..., st} and is illustrated in Figure 4.3.

The agent's goal is to learn an optimal policy π∗, a mapping from every state to an action that yields the highest expected return. In particular, we wish to develop an agent that learns an optimal generation policy, i.e., a mapping from any generation context s to a generation action a that maximizes the expected success of the generated utterance. Note that the agent does not learn a policy that optimizes immediate rewards. Rather, its goal is to optimize the cumulative reward obtained over an entire sequence of actions. Thus, to identify the actions that yield the best expected long-term return, the agent applies a trial-and-error strategy, trying individual actions many times in different states in order to estimate their true reward. We refer to [10] for a comprehensive overview of reinforcement learning approaches. We now present the detailed formalization of the installment model under the reinforcement learning framework.

4.3.2.2 State

The state set is denoted as S, s ∈ S. Let O be the set of all perceived objects and perceptual groups in the scene. A state s = ⟨t, lm, re, W⟩ of our sequential decision-making problem contains the target object t ∈ O, the landmark lm ∈ O together with its generation strategy re ⊂ RE (explained in the Action section), which has been confirmed by the user (e.g., when the user accepts the description of the landmark object), and a set W of objects in the scene that have not yet been used but can potentially be used as landmark objects: W = {w1, ..., wn}, wi ∈ O. In the initial state, lm is none and W = O. When the user accepts the description of an object or a group, it becomes the landmark object lm. When the user rejects the description of an object or a group, that description is removed from RE. In the terminal state, the target object is reached and becomes the current landmark object, t = lm.

4.3.2.3 Action

An action in our problem describes an object in relation to a landmark object. The action set is denoted as A, a ∈ A. An action a = ⟨o, re, sp⟩ is composed of an object to be described o ∈ W, its generation strategy re ⊂ RE, and a spatial relation sp ∈ SP to the landmark object lm. Here, RE represents the strategies that can be used to describe an object. The space of RE consists of all possible concatenations of the following dimensions:

• desType: describes the type of the object, such as "apple" or "pen".

• desColor: describes the color of the object, such as "red" or "yellow".

• desSize: describes the size, such as "big" or "small".
• desSpaLoc: describes the spatial location with respect to the whole scene, e.g., "on the left", "on the right", etc.

• desNumGroup: describes the number of objects within a group.

SP represents the strategies that can be used to describe a spatial relation between two objects, between two perceptual groups, between an object and a perceptual group, or between an object and the group it belongs to. The space of SP consists of the following dimensions:

• desBetween: describes the spatial relation between two objects/groups, e.g., "below", "above", "to the left of", and "to the right of".

• desInGroup: describes the spatial relation between an object and the group it belongs to, e.g., "on the right within", "on the left within", "on the bottom within", "on the top within".

For example, an action sequence could be

⟨object5, [desColor, desType], [none]⟩
↓
⟨group15, [desNumGroup], [desBetween]⟩
↓
⟨object7, [desColor], [desInGroup]⟩

These actions capture the decisions on how to generate an expression to describe an intended object. Given an action, the surface form of an expression can be generated through templates.

4.3.2.4 Transition

T(s, a) is the transition function. It takes an action a = ⟨o, re, sp⟩ in the current state s = ⟨tar, lm, re, W⟩ and gives the next state s′ = ⟨tar, lm′, re′, W′⟩. Note that the action does not change the target tar. Rather, the landmark object and W′ are affected. Given the expression generated by a particular action, the human's response leads to the update of the landmark object lm′ and its description re′, as well as the set W′ of remaining objects that can be used as landmark objects in the future.

4.3.2.5 Reward

The goal of the reinforcement learning algorithm is to identify a policy that maximizes the expected utility Ut = ∑_{k=0}^{∞} γ^k r_{t+k}, where 0 ≤ γ ≤ 1. The reward function r(s, a) : S × A → R specifies the numeric reward that the agent receives for taking action a in state s. It is defined as follows:

r = 100 if the terminal state is reached and the target object is correctly identified;
r = 10 if the terminal state is not reached and the current object is correctly identified;
r = −1 otherwise.

4.3.2.6 Policy

A policy π : S → A is a mapping from states to actions. We can determine the expected return of a policy by estimating its action-value function. The action-value function Qπ(s, a) of a policy π is the expected return for starting in state s, taking action a, and then following π:

Qπ(s, a) = Eπ[Ut | st = s, at = a]

The optimal action-value function, which yields the best expected return for each action a taken in state s, is defined as

Q∗(s, a) = max_π Qπ(s, a)

The optimal policy π∗ is therefore the policy that chooses the action a that maximizes Q(s, a) for all s ∈ S:

π∗(s) = arg max_a Q(s, a)

Basic reinforcement learning approaches use a table to represent the action values. However, as the number of states in an environment increases, it becomes infeasible for an agent to visit all possible states enough times to find the optimal actions for those states. Furthermore, it becomes important to be able to generalize the learning experience gained in one state to other states in the environment. One common way to handle the large state space and the generalization issue is through function approximation [10]. Existing function approximation methods for value prediction include artificial neural networks, decision trees, and linear functions.
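Before turning to the linear function approximation used in this work, the formalization above can be summarized in a short sketch. This is a minimal illustration in Python; the class and field names are ours, and the exact conditioning of the reward on (s, a) is simplified, so it should not be read as the thesis implementation.

from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative descriptor inventories, following Section 4.3.2.3.
RE_DIMS = ("desType", "desColor", "desSize", "desSpaLoc", "desNumGroup")
SP_DIMS = ("desBetween", "desInGroup", "none")

@dataclass(frozen=True)
class State:                        # s = <tar, lm, re, W>
    target: str                     # target object id, e.g. "object7"
    landmark: Optional[str]         # confirmed landmark object/group, None initially
    landmark_re: Tuple[str, ...]    # confirmed generation strategy for the landmark
    available: frozenset            # W: objects/groups not yet used as landmarks

@dataclass(frozen=True)
class Action:                       # a = <o, re, sp>
    obj: str                        # object/group to describe; must be in state.available
    re: Tuple[str, ...]             # a concatenation of dimensions from RE_DIMS
    sp: str                         # one dimension from SP_DIMS, relative to the landmark

def is_terminal(s: State) -> bool:
    # Terminal when the target itself has become the confirmed landmark (t = lm).
    return s.landmark == s.target

def reward(next_state: State, identified_correctly: bool) -> float:
    # Reward values from Section 4.3.2.5; whether the described object was
    # correctly identified is treated here as a flag supplied by the environment.
    if is_terminal(next_state) and identified_correctly:
        return 100.0
    if identified_correctly:
        return 10.0
    return -1.0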
We apply linear function approximation in this work. We define a mapping ϕ that assigns a feature vector to each state-action pair. The action value Q(s, a) of a state-action pair (s, a) is then obtained by linearly combining the components of ϕ(s, a) with the weights θ:

Q(s, a) = ∑_{i=1}^{d} θi ϕi(s, a)

Here we represent the Q-function as a linear combination of features: the agent extracts the relevant information it needs from the state through feature extraction and uses the resulting feature vector to calculate the approximate value of taking an action in that state. This is done by taking the dot product of the feature vector and a parameter vector. In this way, similar state features lead to similar values, and the learned weights closely approximate the true expected reward. The goal of reinforcement learning with function approximation is then to learn the best values for this parameter vector.

#    Feature Description                                            Learned Weight
1    normalized sum of vision confidence of all descriptors in re   0.92
2    is spatial location descriptor in re?                          0.54
3    visual confidence of spatial relation descriptor               0.52
4    vision confidence of type descriptor                           0.51
5    vision confidence of spatial location descriptor               0.48
6    is type descriptor in re?                                      0.21
7    number of descriptors in re                                    0.19
8    vision confidence of size descriptor                           0.13
9    is size descriptor in re?                                      0.13
10   is color descriptor in re?                                     0.10
11   vision confidence of color descriptor                          0.09
12   can re together with sp to lm uniquely identify an object?     0.88
13   is there a sp between o and tar?                               0.51
14   number of spatial links from lm                                0.23
15   number of spatial links to o (in degree)                       0.01
16   number of spatial links from o (out degree)                    0.01
17   is o and lm in the same group?                                 1.00
18   is o a group?                                                  0.97
19   is lm a group and o in lm?                                     0.96
20   is o a group and tar in o?                                     0.90
21   is o and tar in the same group?                                0.51
22   is o a group and lm in o?                                      0.11

Table 4.1: Features used in the installment model and their learned weights. We use the following abbreviations. o: an object to describe; re: generation strategy for an object; lm: landmark object; tar: the target object; sp: spatial relation.

4.3.2.7 Features

We use features from both the state and the action to model the Q-function, as summarized in Table 4.1. Features 1-11 are inspired by [9], which demonstrates that encoding the visual confidence of a symbolic descriptor into the cost function improves performance. Feature 12 models the discriminating power of a referring expression, which is the key requirement for traditional REG approaches. Features 13-16 measure the spatial relations between objects. Features 17-22 are inspired by previous work indicating that group descriptions are important to REG [13, 9].

4.3.2.8 Learning

To learn the weights θ we follow the method in [101], which uses SARSA [10] with linear function approximation. The learning model is shown in Figure 4.4. We use

Pr(at | st; θ) = exp(θ⊤ϕ(st, at)) / ∑_{a′} exp(θ⊤ϕ(st, a′))

to choose the best action based on the current estimate of θ, with ϵ-greedy exploration (ϵ = 0.2; that is, 20% of the time we choose an action at random). The learning rate is set to αt = 30 / (30 + t), and we stop training when the magnitude of the update ∥θt+1 − θt∥ is smaller than 0.0001. These parameters were tuned based on our own empirical interaction with the installment model.
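As an illustration of this learning procedure, the following is a minimal Python sketch of SARSA with linear function approximation in the spirit of Section 4.3.2.8. The environment interface (reset, actions, step), the feature function, and the discount factor are assumptions of this sketch rather than the thesis implementation, and the softmax scores are used here only to rank actions under ϵ-greedy exploration.

import math
import random

def q_value(theta, phi):
    # Linear action value: Q(s, a) = sum_i theta_i * phi_i(s, a).
    return sum(t * f for t, f in zip(theta, phi))

def choose_action(theta, state, actions, features, epsilon=0.2):
    # epsilon-greedy exploration; otherwise pick the action with the highest
    # softmax score exp(theta^T phi(s, a)) (equivalently, the highest Q value).
    if random.random() < epsilon:
        return random.choice(actions)
    scores = [math.exp(q_value(theta, features(state, a))) for a in actions]
    return actions[max(range(len(actions)), key=lambda i: scores[i])]

def sarsa_train(env, features, dim, gamma=0.95, tol=1e-4):
    # SARSA with linear function approximation. `env` is a hypothetical
    # interface exposing reset() -> state, actions(state) -> list of actions,
    # and step(state, action) -> (reward, next_state, done); `features(s, a)`
    # returns a feature vector of length `dim`.
    theta = [0.0] * dim
    t = 0
    while True:
        old_theta = list(theta)
        s = env.reset()
        a = choose_action(theta, s, env.actions(s), features)
        done = False
        while not done:
            r, s_next, done = env.step(s, a)
            phi = features(s, a)
            if done:
                target = r
            else:
                a_next = choose_action(theta, s_next, env.actions(s_next), features)
                target = r + gamma * q_value(theta, features(s_next, a_next))
            alpha = 30.0 / (30.0 + t)      # learning rate alpha_t = 30 / (30 + t)
            delta = target - q_value(theta, phi)
            theta = [th + alpha * delta * f for th, f in zip(theta, phi)]
            if not done:
                s, a = s_next, a_next
            t += 1
        # Stop when the weight change over an episode falls below the threshold.
        if sum((x - y) ** 2 for x, y in zip(theta, old_theta)) ** 0.5 < tol:
            return theta

In the actual training described in the next section, one step of this loop corresponds to showing a generated expression to a crowd worker and reading back which object the worker identified.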
4.4 Empirical Evaluations 4.4.1 Experimental Setup To evaluate the performance of both the episodic model and the installment model to REG, we conducted an empirical study using crowd-sourcing from the Amazon Mechanical Turk. Through a web interface (e.g., Figure 4.5 and 4.6) we displayed an original scene and a description generated 76 to refer to an object in the scene. We asked each turkers to choose the object he/she believed was referred to by the description. They can also choose if none of the objects or multiple objects were considered to be referred to. Similar to [9], each description received three votes regarding its referent from the crowd 2 . The referent with a majority voting was taken as the identified referent and was used to calculate the performance metric: the accuracy of referent identification (i.e., the percentage of generated referring expressions where the target object are correctly identified). If all three votes were different, then the referent was evaluated as not correctly identified for that expression. 4.4.2 Training of the Installment Model In order to learn the weights for features in the installment model, we first created 32 different training scenes similar to the scenes used in [9], then used the Amazon Mechanical Turk to solicit feedback from the crowd. The training was divided into sessions where each session was used to identify only one target object. More specifically, in each session, the system applied Figure 4.4 to pick an action. Then a referring expression was generated based on this action and shown to the user. The user was then asked to identify the referent based on this expression. Based on the user’s feedback, the internal state would be updated and the system would pick the next action accordingly. This process iterates until one of the following three conditions was met: • The system reached the target object and describes it. • There was no more action available for the system to take. • The number of interactions exceeds 13, which is the average number of objects in a scene. 2 To have a fair comparison, we use the same quality control of crowdsourcing as used in [9]. 77 When one training session ended, a new training session for a new object would start. We stopped training when the algorithm converged. We had a total number of 1860 training sessions and a total number of 6775 interaction with the user, which resulted in an average of 3.6 interactions per session. Figure 4.7 shows the number of objects correctly identified during training at an interval of 200 sessions. From Figure 4.7, we observe a trend that the number of correctly identified objects gradually increases, indicating that the system is learning in the right direction towards the true underlying parameter θ. Note that we stopped the training when the algorithm converged. However, as shown in Figure 4.7, the performance seems to pick up during the last 200 sessions. This suggests that further training may improve the performance. The weights learned for each feature are shown in the third column of Table 4.1. We can observe that, group-based features (e.g., features 17 - 20) are assigned relatively higher weights. Through the interaction with the user, the system learns that a strategy of describing an object with reference to groups has a higher probability of correctly identifying the target object. This learning result is consistent with previous work showing that group descriptions are important. 
However, in contrast to previous work, our system learns the importance of group descriptions through interaction and gradually assigns higher weights to them. Visual confidence (feature 1) and the discriminating power (feature 12) of a referring expression also receive relatively high weights.

Given these features with learned weights, we can modify the episodic model in two ways:

1. using the trained installment model, but assuming that the turker's feedback is the intended object described by the installment model (we name this the Episodic Model II);

2. using the features with learned weights to guide the Branch-and-Bound search in the episodic model (we name this the Episodic Model III).

Model                          Accuracy
Non-collaborative model [9]    47.2%
The episodic model             53.6%
The episodic model II          56.9%
The episodic model III         57.8%
The installment model          68.9%

Table 4.2: Comparison of accuracy for referent identification based on the expressions generated by the different models.

4.4.3 Evaluation Results

In the testing phase, we applied the learned parameters to the testing scenes and evaluated how well the turkers could identify target objects based on the expressions generated by the learned policy. (We randomly recruited the turkers from Amazon Mechanical Turk using the same quality control criteria; the turkers were presented with the same interface as in the training phase.) We used the same 48 scenes used in [9] for evaluation. Each testing scene has 13 objects on average, and there are 583 objects in total. (Here we removed 38 split objects from the target object set, so the adjusted results shown in Table 4.2 are slightly different from the results reported in [9].) The turkers went through all the testing scenes, and we recorded the number of responses the turkers made while identifying the target objects, as well as the total number of target objects that were correctly identified. Each expression received three votes from the crowd, and the referent with the majority vote was used in calculating the accuracy. Table 4.2 shows the results of our two collaborative models compared with the non-collaborative model [9].

• The two collaborative models significantly outperform the non-collaborative approach, with absolute performance gains of over 6% and 21%, respectively.

• The installment model further significantly outperforms the episodic model, with an absolute performance gain of 15% (Pearson's chi-square test, χ2 = 20.11, p < 0.001, with Tukey's post-hoc test). The average number of turns in the installment model is 2.35 (standard deviation = 1.07).

4.4.3.1 Comparison of the Episodic Model with the Non-Collaborative Model

The episodic model outperforms the non-collaborative model. This is mainly due to two reasons:

• First, instead of directly describing the target object, the episodic model provides more options for describing the target object, and these options have a lower cost than the direct description used in the non-collaborative model. In the episodic model, 72% of the target objects are not described directly (i.e., with one long noun phrase describing the target), which indicates that, in these cases, describing other objects first and then gradually approaching the description of the target object has a lower overall cost.

• The second reason is the complexity of interpreting a description. The average number of descriptors in a referring expression generated by the non-collaborative model (Mean = 4.3, SD = 1.23) is larger than that in each small noun phrase generated by the episodic model (Mean = 2.2, SD = 0.78) (t-test, p < 0.001).
In the non-collaborative model, the target object can be described in relation to other objects. Although as a whole the overall description should be able to distinguish the target from the rest of objects, the expressions to describe each of the objects (in relation to the target) do not need to be distinguishing. In contrast, the smaller descriptions generated by the episodic model can already distinguish the intended object from the rest. We believe these behaviors contribute to a less complexity in interpreting episodic expressions. 4.4.3.2 Comparison of the Installment Model with the Episodic Model. There are two main reasons that the installment model outperforms the episodic model. 80 • First, the installment model incorporates many more features than the episodic model, and the weight of those features are learned through on-line interaction with users. The episodic model only relies on two features when searching for the best path to a target object: vision uncertainty and the discriminating power of a path. These two features roughly corresponds to feature 1 and 12 in Table 4.1. Realizing this reason and to have a fair comparison, we build episodic model III, which applies the same features as used in the installment. The results show that, the installment model still outperform the episodic model III with absolute performance gain of 11%. • Second, in the episodic model there is no intermediate feedback from humans, while in the installment model the system selects the next object to describe based on the intermediate feedback from the human partner. To explore the role of intermediate feedback, we build Epsodic Model II, which uses the trained installment model, but assume that turker’s feedback is the always the intended object described by the installment model. The results show that with intermediate feedback, the installment model outperforms the episodic model II with an absolute performance gain of 12%. Figure 4.8 shows among all failed cases in the episodic model, where do these fail case occur. Among all the successful sessions (where the target object is correctly identified), 18.7% of them in fact encountered some problems. (The turkers could not identify the object at the intermediate step.) However, the system was able to get around the problem and chose to describe another object. It is the intermediate feedback that guides the installment model to generate referring expressions that lead to the correct identification of target objects. Here is an example comparison between the episodic model and installment model. The target object is object 10 in Figure 4.1. For the episodic model the sequence of objects to describe is 81 group18 → object10. It is linguistically realized as: ‘a group of 2’ → ‘an object on the top within the group of 2’. While, for installment model: System: ‘a group of 2’ User: None of the above System: ‘a group of 4’ User: group 16 System: ‘an orange object below the group of 4’ User: object 10 4.5 Conclusion In situated dialogue, humans and robots have mismatched perceptions of the shared world. To facilitate successful referential communication between a human and a robot, the robot needs to take into account such discrepancies and generate referring expressions that can be understood by its human partner. Motivated by collaborative behaviors in human-human referential communication, we developed two collaborative models for REG. 
In contrast to previous non-collaborative models, which tend to generate a single long description of a target object, our models generate multiple short expressions describing easier-to-identify landmark objects that eventually lead to the target object. The goal of these models is to minimize the collaborative effort between the human and the robot. Our empirical results have shown that the two collaborative models significantly outperform the non-collaborative model. Although our current models, especially the installment model, have yielded encouraging results, several problems need to be addressed in the future.

• First, the cost function we used for each descriptor is predefined. As the environment changes, the agent may have different confidence in capturing different dimensions of perception. Thus, one future direction is to automatically learn and adjust these cost functions in new environments.

• Second, our current evaluation is based only on simplified scenes and does not address the complexities of true human-robot dialogue. Even our installment model, although it follows the definition from Clark and Wilkes-Gibbs (1986), does not fully resemble human-human dialogue, where the addressee can take a more active role in referential grounding. Our future work will extend the current work to real human-robot dialogue and incorporate more interaction strategies and non-verbal modalities (e.g., gaze direction from the human) as intermediate feedback for generating referring expressions.

• Third, because it is based on the reinforcement learning framework, the installment model needs thousands of interactions between the user and the system in order to converge. This poses the question of how to speed up the learning phase of the installment model in real human-robot interaction.

• Fourth, although the installment model learns a policy for generating descriptions, it does not explicitly learn where the mismatched perception occurs (e.g., after learning, the scene hypergraph is still full of errors: objects are still mis-recognized or missing). One direct extension is therefore to directly mediate the misalignment in perception, so that after learning the robot's visual perception system is improved and its recognition performance is better.

Figure 4.2: The episodic model for REG.
Figure 4.3: The agent-environment interface in an MDP [10].
Figure 4.4: The installment model for REG.
Figure 4.5: The web-based interface shown to the Mechanical Turk workers for the episodic model.
Figure 4.6: The web-based interface shown to the Mechanical Turk workers for the installment model.
Figure 4.7: The number of objects correctly identified in each interval of 200 training sessions.
Figure 4.8: The distribution of failure cases at the 1st, 2nd, 3rd, and after the 3rd expression in the episodic model.

Chapter 5 Embodied Collaborative Models for REG

This chapter focuses on embodied collaborative models for REG in situated dialogue. (The majority of this chapter was published in: Rui Fang, Malcolm Doering, Joyce Y. Chai. Embodied Collaborative Model for Referring Expression Generation in Situated Human-Robot Interaction. Proceedings of the 10th Annual ACM/IEEE International Conference on Human-Robot Interaction (HRI), Portland, USA, 2015.) In situated dialogue, robots and humans have different representations of the shared environment because of their mismatched perceptual capabilities. The robot's representation of the shared environment is often incomplete and error-prone. When a shared perceptual basis is missing, referential communication between partners becomes difficult [11]. Specifically for the task of REG, [9] has shown that when the human and the agent have a mismatched perceptual basis, traditional approaches to REG tend to break down.
A competitive algorithm that achieves over 80% accuracy (in referent identification) when the human and the agent have the same representation of the world (i.e., the traditional setting) obtains only just over 40% accuracy under a mismatched perceptual basis [9]. Based on this observation, [1] developed two computational collaborative models for REG: an episodic model and an installment model. These collaborative models, motivated by collaborative referential behaviors in human-human communication [16], do not generate a single referring expression; instead, they generate multiple small noun phrases that gradually lead to the target object, with the goal of minimizing the collaborative effort. In a web-based study where 2D images were used as the shared environment between an agent and a human [1], empirical results have shown that the collaborative models can better handle mismatched representations between the agent (which only has an impoverished representation of the image due to its limited object recognition capability) and the human (who has a rich representation of the shared image), and that they significantly outperform the traditional approaches.

While our previous results on collaborative models are encouraging, the use of 2D images as a shared environment is a great simplification of the shared physical world in human-robot interaction. Collaborative models have not been applied to real-time human-robot interaction in a physical environment. In situated human-robot interaction, a key characteristic is embodiment: agents (i.e., robots) and humans both have physical bodies, and they can use non-verbal modalities (e.g., gesture and eye gaze) to make references to the shared world and to provide immediate feedback. It is not clear to what degree the collaborative models can take embodiment into account and apply to physical interaction.

To address this issue, we have developed embodied collaborative models for REG. These models seamlessly incorporate robot gesture behaviors (i.e., pointing to an object) and the human's gaze feedback (i.e., looking at a particular object) into the collaborative model for REG. We examined different strategies for incorporating robot gestures and human feedback. Our empirical results have shown that when robot gestures and human verbal feedback are incorporated, the new collaborative model achieves over 28% absolute gains compared to the baseline collaborative model. This chapter discusses the opportunities and challenges brought by modeling embodiment in collaborative referential communication in human-robot interaction.

5.1 Embodied Collaborative Models for REG

Inspired by the collaborative behaviors in referential communication, [1] developed two computational collaborative models for REG: an episodic model and an installment model. These collaborative models, instead of generating a single referring expression to describe a target object, generate multiple small noun phrases that gradually lead to the target object, with the goal of minimizing the collaborative effort.
In particular, the installment model treats the collaborative referring expression generation task as a sequential decision-making problem and formalizes it under the reinforcement learning framework. The goal is to learn a good policy for generating installments in different situations that eventually lead to the successful identification of a target object. Through a web-based crowd-sourcing interface, interacting with users over 2D images, the weights for a set of features are learned to represent Q(s, a), the utility of a particular state s (which captures the agent's representation of the shared environment) and a particular generation action a (which represents a generation strategy, e.g., which attributes to choose to describe the object):

Q(s, a) = θ⊤ϕ(s, a)

where ϕ assigns a feature vector to each state-action pair and θ is the weight vector of the features. The utility of a state-action pair Q(s, a) is thus defined as a linear combination of the features. Once the weights for the features are learned, they are applied to calculate Q(s, a) for any state-action pair. During real-time generation, the agent picks the generation action a∗ that maximizes Q(s, a):

a∗ = arg max_{a∈A} Q(s, a)

The above collaborative model was only applied and evaluated in a simplified environment with 2D images on a computer screen [1]. Human-robot interaction is much more complex. To extend the collaborative model to real-time human-robot interaction and seamlessly incorporate embodiment into the model, we pursued two directions: (1) incorporating the robot's gesture (i.e., pointing) into the collaborative model to generate multimodal referring expressions; and (2) incorporating the human's real-time feedback (verbal feedback, gaze feedback, or both) into the collaborative model. Figure 5.1(a) shows a typical setup in which the embodied collaborative model is examined.

Incorporation of the robot's gesture in referring acts. Instead of generating verbal descriptions only, non-verbal behaviors such as a pointing gesture can also be included. Here, we treat the cost of gesture generation as a feature and incorporate it with the other features learned from [1] into the collaborative model.

#    Feature Description                                            Learned Weight
1    normalized sum of vision confidence of all descriptors in re   0.92
2    is spatial location descriptor in re?                          0.54
3    visual confidence of spatial relation descriptor               0.52
4    vision confidence of type descriptor                           0.51
5    vision confidence of spatial location descriptor               0.48
6    is type descriptor in re?                                      0.21
7    number of descriptors in re                                    0.19
8    vision confidence of size descriptor                           0.13
9    is size descriptor in re?                                      0.13
10   is color descriptor in re?                                     0.10
11   vision confidence of color descriptor                          0.09
12   can re together with sp to lm uniquely identify an object?     0.88
13   is there a sp between o and tar?                               0.51
14   number of spatial links from lm                                0.23
15   number of spatial links to o (in degree)                       0.01
16   number of spatial links from o (out degree)                    0.01
17   is o and lm in the same group?                                 1.00
18   is o a group?                                                  0.97
19   is lm a group and o in lm?                                     0.96
20   is o a group and tar in o?                                     0.90
21   is o and tar in the same group?                                0.51
22   is o a group and lm in o?                                      0.11
23   the cost of producing a pointing gesture                       0.50

Table 5.1: Features used in the installment model [1]. Abbreviations are as in Table 4.1.

Table 5.1 shows the features and their associated weights. We directly adopt the features and the learned weights for the first 22 features. The 23rd feature represents the cost of gesture generation; we empirically set its weight to 0.5, a middle value, since we have no evidence to either promote or discount it. (Because it is challenging to conduct studies with a large enough number of robot interactions to learn this weight through reinforcement learning, we decided to set it empirically in this initial investigation.) The cost of a pointing gesture itself depends on several factors: the distance from the robot to the target object, the size of the target object, the adjacency of other objects to the target object, etc. Inspired by previous work on the costs of pointing [102, 103], we define the cost of a pointing gesture to an object as follows:

cost = log2(1.5 × Distance / Size + 1)   if the object to be described is in a group
cost = log2(Distance / Size + 1)         otherwise
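A minimal Python sketch of this pointing-cost computation follows; the function and argument names are ours, and the distance and object size would come from the robot's perception of the scene.

import math

def pointing_cost(distance: float, size: float, in_group: bool) -> float:
    # Cost of a pointing gesture as defined above: distance is the robot-to-object
    # distance and size is the object's size, in the same (consistent) units;
    # in_group indicates whether the object to be described is in a perceptual group.
    ratio = distance / size
    if in_group:
        return math.log2(1.5 * ratio + 1)
    return math.log2(ratio + 1)

# Example with assumed values: a 5 cm wide object 80 cm away.
print(pointing_cost(0.8, 0.05, in_group=True))    # ~4.64
print(pointing_cost(0.8, 0.05, in_group=False))   # ~4.09

This cost is then combined with the learned weight of feature 23 when the robot scores a candidate referring act.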
Given a situation s, the robot applies the features and their associated weights in Table 5.1 to calculate Q(s, a) and then chooses the generation action a∗ that maximizes Q(s, a). For example, in Figure 5.1(b), R1 includes a gesture as part of the referring expression, while R2 does not.

Figure 5.1: An example of the situated setup and referential communication: (a) an example of the situated setup; (b) an example of referring expressions in installments.

Incorporation of eye gaze as feedback. Previous psycholinguistic studies have shown that human eye gaze is directly linked with language comprehension: immediately after hearing a referring expression, the hearer's eyes move to the objects being referred to. Motivated by this finding, we incorporate the user's real-time gaze and verbal information as intermediate feedback in the installment model.

In order to use the human's gaze information as feedback, we first must determine how the gaze is distributed among the objects in the scene over the time immediately following an RE such as those in Figure 5.1(b). Specifically, starting from the onset of an RE uttered by the robot, we collect at least 200 gaze readings (taking 7.7 seconds on average) from the participant, each indicating which object in the scene the gaze is focused on. From these 200 gaze readings we calculate a distribution showing which objects are focused on the most. Based on this distribution we make two decisions: (1) which object receives most of the focus? and (2) is the distribution on this object high enough to assume that the human believes it to be the referent of the robot's RE? We answer the first question by taking the object with the highest gaze distribution, and the second by checking whether the distribution of the object with the highest value is significantly (at least two times) greater than that of the object with the second highest value. Thus, our gaze processing algorithm returns an (objectId, YES/NO) tuple that can be incorporated as feedback into the installment model in various ways.

The gaze feedback can be incorporated into the installment model in three specific ways.

• Gaze Only Feedback: In this method of incorporating the gaze feedback, the robot receives the tuple output by the gaze processing algorithm. If YES, the human's gaze is mostly focused on a single object; the installment is considered successful, and objectId becomes the landmark for the next RE. Else, if NO, the human's gaze is not focused on a single object, and the robot will try a different RE.
• Gaze and Verbal Feedback Combined To incorporate both the gaze and verbal information from the human as intermediate feedback, we must handle all cases where the gaze feedback 93 and verbal feedback conflict, such as when the human responds “yes” but the gaze indicates that no unique object was identified. We manage these cases in the following ways: when the user responds “No” then the gaze information is disregarded and the robot follows the verbal feedback. When the user says “Yes” then there are two cases: (1) if the gaze processing algorithm returns NO then the robot assumes the human understands and the robot’s intended object becomes the next landmark. (2) if the gaze processing algorithm returns YES then the robot assumes that the human believes it is talking about the object with the highest gaze distribution and it becomes the next landmark. • No Gaze Feedback In this method gaze is not used as feedback to modify the behavior of the installment model. However, the gaze can be recorded and processed with the gaze processing algorithm in order to later analyze the human’s gaze behavior. Figure 5.1 shows an example of how (verbal) feedback from the user determines which object/group becomes the landmark in the next RE. The robot starts by asking, “Do you see a group of two on the left?”. After the user responds in the affirmative, that group becomes the landmark in the next utterance. The robot describes the next object in relation to this landmark, “... do you see an object in the back that is within that group?” Note that, instead of generating a single minimum description to describe a target object (potentially with pointing gesture), our model actually plans for incrementally generating smaller episodes that lead to the target object. The planning is based on what the robot perceives from the environment and what feedback is provided by the human. Note that, when the gesture option is available, it does not mean the referring expression generated will have to use gesture. Whether the gesture is used or not depends on how it weighs against other features extracted from the environment. Although the current model is built upon the results (22 visual features) from our previous study from a 2D image, this work demonstrates that the cost of gesture can be integrated seam94 lessly with the visual features to apply to truly situated environments. As far as we know, this is the first model for referring expression generation in situated human robot dialogue that incorporates embodiment in the planning for generating episodes towards the target object. 5.2 Empirical Studies Using the embodied collaborative models for REG described above, we conducted an empirical study. The goal was to examine the performance of embodied collaborative models in situated human-robot interaction and understand the role of embodiment in language generation. 5.2.1 Design Factors Our experiments were designed to take into consideration the following two main issues relating to the embodiment of the REG models in a robot: Robot’s embodied generation strategy. We want to explore to what degree gesture may help referential communication. In our experiments we specifically modeled two methods of generation. • Verbal Only (R-) refers to the condition where only verbal referring expressions are generated, using the installment model. 
In this condition the robot will first refer to the landmark object saying “That object/group I was just talking about/you were just looking at...” Then after a short pause the robot will refer to the current object, “Do you see a ...” in some relation to the landmark. • Verbal and Gesture (R+) refers to the condition where both verbal referring expressions and pointing gestures are generated, using the installment model. In this condition the robot will first verbally refer to the landmark object as in R- and may simultaneously point towards the landmark. Subsequently, the robot will verbally refer to the current object in the same 95 way and may also point towards that object. Incorporating human’s feedback. We want to examine how different ways of incorporating the human’s feedback affects the performance of the installment model for REG. • No Feedback (N) refers to the condition where no feedback is used. This is similar to the Episodic behavior [1], which generates an RE in a sequence of smaller noun phrases that lead to the target object. The user was not asked to provide any intermediate verbal feedback; instead the system assumes that user can correctly follow an installment before generates the next installment. Gaze is tracked for later analysis. • Verbal Only (V) refers to the condition where only verbal information is used as the feedback from human. In this condition the human is allowed to respond “yes”, “no”, or “repeat” if they did not hear clearly. If the response is “yes” the robot assumes that the human understands the generated RE and the current intended object becomes the landmark for the next step. If the response is “no” the robot assumes the human does not understand the intended object and will try another RE. Lastly, if the response is “repeat” the robot repeats the last generated RE. • Gaze Only (G) refers to the condition where only gaze is used as the feedback from human. In this condition the gaze of the human is incorporated as described in Section 3. The gaze is tracked and used as feedback to the installment model. • Gaze and Verbal (VG) refers to the condition where both gaze and verbal information are used as the feedback from human. The gaze and verbal feedback are combined as described in Section 3. These design factors lead to 4 × 2 = 8 experimental conditions. 96 5.2.2 Experimental Tasks We designed an object identification game to evaluate the effects of the feedback and gesture conditions on referential communication. Figure 5.1(a) shows an example of the situated setup for the experiment. A human subject and our NAO robot were positioned in front of a table which held ten everyday objects. The subject was wearing an ASL mobile eye tracker. The NAO robot would generate referring expressions in an installment fashion to describe these objects using different strategies (depending on the experimental setting). The human subject’s goal was to identify which object the NAO robot was talking about. We compare the object identified by the human with the intended target object to compute the accuracy of a particular generation strategy. 5.2.3 Procedures We designed four scenes with ten objects of various types in different configurations, six of which were randomly chosen as target objects for the robot. These four scenes were processed by a computer vision algorithm [85] and a perceptual grouping algorithm [79] and the results were given to the robot as its internal representations of the scenes. 
Here, we applied a very basic CV algorithm 3 which resulted in rather poor performance in our data: overall, about 73% of objects in the original scene are mis-recognized or mis-segmented. We think this poor CV performance represents a more challenging problem. These four scenes were independent from each other (e.g., an apple in two different scenes may have two different recognition results). To each subject a set 3 Computer vision algorithms have advanced significantly in recent years. However, they are far from being perfect. It is very likely, in a new situation or changing environment, that the robot which is equipped with modern CV algorithms can only correctly recognize a few objects (not all objects) in the shared environment with the human. This is the situation our work intends to address. Therefore, we intentionally applied a very simple CV algorithm so that the misrecognition rate would be high. This is a different setting than previous work on referring expression generation. 97 of games was assigned, where each game had a different combination of scene, feedback condition, gesture condition, and target object. Different from [63], Our user study was designed not for studying the role of gestures in referring communication. We designed the studies to evaluate our computational model and address how to automatically generate verbal and gesture expressions on the fly given human’s feedback and different visual configurations of the environment. We recruited a total of 24 human subjects to participate in our study. These subjects were not aware of any part of the system or the study design. During the experiments, each participant played 24 object identification games - 6 games on each of four separate scenes. Each scene was paired with one of the four feedback conditions such that the participant interacted with the robot under each condition. Moreover, in each of the 6 games on a scene, the target object was changed. Half of these 6 games were played under the Verbal and Gesture condition (R+) and the other half under the Verbal Only condition (R-). For each participant the scenes were presented in one of the 24 possible permutations so that the order would not have an effect on the experiment. Furthermore, the orders of the targets were randomized. Based on this design each participant played three games in each of the 8 feedback-gesture conditions, resulting in 72 games total played in each condition. 5.2.4 Empirical Results Our experiments resulted in a total of 576 object identification games based on various scenes and target objects. To help interpret the results, we reiterate some key terms that will be used throughout this section: The target object is the object the robot wants the human to identify in the end. An intended object at each installment is an object the robot refers to on the path to the target. A landmark object is the object on the path (i.e. an intended object) that has been confirmed. 98 5.2.5 Performance Table 5.2 shows the identification accuracy of the target object under the eight conditions. There is a significant difference among the settings (χ2 = 56.97, p < 0.0001). A Tukey post-hoc test reveals that this difference is significant (at p < 0.05) between VR+ and NR-, VR+ and VR-, VR+ and VGR-, VR+ and GR-, VR- and NR+, VGR+ and GR-, VGR- and GR+, VGR- and NR+, GR+ and GR-, GR- and NR+. The gesture with verbal feedback condition performs significantly better than most of the other models with an accuracy of 0.88. 
Furthermore, each model that incorporates gesture into its generation strategy performs significantly better than its corresponding model without gesture. These results indicate the importance of embodiment in RE generation in the situated environment.

Feedback Condition     Verbal Only (R-)    Verbal and Gesture (R+)
Verbal Only (V)        0.53                0.88
Gaze Only (G)          0.40                0.72
Verbal + Gaze (VG)     0.47                0.67
No Feedback (N)        0.60                0.79

Table 5.2: Comparison of accuracy for the eight experimental conditions, by feedback condition (rows) and generation strategy (columns).

5.2.5.1 The role of gesture (on accuracy)

First, we examine the role of the robot's gesture as part of REG. Table 5.3 shows the identification rate of the target object under the Verbal Only condition (R-) and the Verbal and Gesture condition (R+). When gesture is incorporated into the REG model, the identification accuracy is much higher, at 0.76, than when gesture is not used at all, 0.50. There is a significant difference between the two settings (χ2 = 43.11, p < 0.0001), and gesture was used at a rate of 0.89 in the REs generated in the R+ condition.

Our observation is that, in general, in the conditions where gesture is involved, the object identification accuracy is higher. In the R+ condition, participants were able to find the intended object based on the additional information the robot's pointing behavior provided on top of the verbal RE. An example of the usefulness of gesture is apparent in the gaze feedback only condition (G). Observing the experiment under this condition, where only gaze feedback is used to decide the next landmark, several participants appeared to have difficulty understanding which object was the landmark because they were not aware of which object they had been looking at previously (the landmark is referenced as "That object you were just looking at ..."). In this condition, gesture was useful for maintaining the common ground between the dialogue participants by indicating the landmark object to the human. Overall, gesture is a useful modality for improving the effectiveness of REs generated in the situated environment.

                           Num. Games    Num. Success    Accuracy
Verbal only (R-)           288           144             0.50
Verbal and gesture (R+)    288           220             0.76

Table 5.3: Accuracy of the object identification game when gesture is present (R+) or absent (R-).

5.2.5.2 The role of feedback

Second, we examined the effect of the different types of feedback on object identification accuracy. We analyzed the effects of integrating the human's real-time eye gaze and verbal feedback into the installment model. Table 5.4 shows the object identification rate under the four feedback conditions. For object identification accuracy, there is a marginal difference among the settings (χ2 = 10.81, p = 0.013).

                 Without Feedback    With Feedback
                 N                   V        G        VG
Num. Games       144                 144      144      144
Num. Success     100                 101      81       82
Accuracy         0.69                0.70     0.56     0.57

Table 5.4: Rate of successful object identification games under the different feedback conditions.

Our observation is that, when the intermediate feedback includes eye gaze (G and VG), the performance in terms of referential identification accuracy is lower than without gaze (N and V). Although we hoped that the incorporation of eye gaze as intermediate feedback would improve the performance of the installment model, the empirical results have shown otherwise. This is likely due to the inaccuracy of the eye gaze feedback. As a consequence of our method of measuring the subjects' eye gaze, accidental movements of the subjects' heads during the experiment may cause the coordinates of gaze readings to shift to objects near the actual focus of gaze.
The empirical results indicate that our current way of incorporating gaze is not successful. For a more in-depth analysis based on our data, we separate out the cases where the feedback indicated by gaze coincides with the intended landmark object, and we examine whether this is a strong signal of understanding and whether it improves the accuracy of identifying the target object in the end. Specifically, for the G and VG feedback conditions, we calculate how often the gaze feedback actually provides a fixated object that is used as a landmark object, and how often this fixated object is in fact the intended object. Table 5.5 shows the results, separated into the G and VG games that succeeded and those that failed (failed meaning the participant did not identify the correct referent when asked at the end of the game).

                                               Gaze Only (G)          Gaze + Verbal (VG)
                                               Success     Fail       Success     Fail
Num. installments                              185         235        204         269
Num. landmarks established by gaze             164         200        140         140
Rate                                           0.886       0.851      0.686       0.520
Num. landmarks established by gaze that
match the intended object                      39          21         27          11
Rate                                           0.238       0.105      0.193       0.079

Table 5.5: Statistics for landmarks and intended objects in the Gaze Only feedback condition (G) and the Gaze and Verbal feedback condition (VG).

For the G condition, the data show that for both successful and unsuccessful games the landmark was established by the gaze feedback at about the same rate (88.6% vs. 85.1%, χ2 = 1.124, p = 0.289). More interestingly, the rate of landmarks established by gaze which actually matched the intended object (the participant gazed at the intended object and that object became the next landmark) is twice as high in successful games as it is in failed games (23.8% vs. 10.5%, χ2 = 11.545, p = 0.000679). This shows that when the human is able to correctly identify the current intended object and the gaze feedback works properly, it leads to successful target identification.

For the VG condition, we see similar statistics as with the G condition. One difference is that here the rate of landmarks established by gaze feedback (in the VG condition this only happens when the verbal response is "yes") is much higher for successful games than for failed games (68.6% vs. 52.0%, χ2 = 13.201, p = 0.000279). One of the reasons these rates are lower than in the G condition is the integration of verbal feedback. But we are not sure why there is such a large difference in this statistic between the successful and unsuccessful games in the VG condition compared to the G condition (there was no significant difference in the G condition, 88.6% vs. 85.1%). We wonder whether it has to do with how the gaze feedback and the verbal feedback are integrated. Furthermore, as with the G condition, the rate of landmarks established by gaze which actually matched the intended object (the participant gazed at the intended object and that object became the next landmark) is greater in successful games than in failed games (19.3% vs. 7.9%, χ2 = 7.795, p = 0.00524). The difference in rate is even higher here than for the G condition. Similarly, this shows that when the human is able to correctly identify the current intended object and the gaze feedback works properly, it leads to successful target identification.

5.2.5.3 Effects on the number of installments

There is a significant difference among the feedback conditions in the average number of installments per game (F(3, 572) = 7.583, p < 0.0001). Figure 5.2 shows the average number of installments per game in each of these feedback conditions.
Games in the verbal feedback condition (V) lasted slightly longer, with an average of 2.36 installments per game, than games in the no feedback condition (N), with 2.07 installments per game. Games in both the gaze only (G) and the verbal-gaze combined (VG) conditions lasted significantly longer.

Figure 5.2: The average number of installments per game in each of the feedback conditions.

On the other hand, statistical tests have shown that the presence of gesture does not have a significant effect on the number of installments per session.

5.2.5.4 Gesture and gaze-indicated focus

We have also examined the effect of the robot's gesture on the human's focus as indicated by gaze. More specifically, we compare the proportions of different eye gaze patterns in the conditions where gesture is present and absent. The gaze responses after each RE were classified into three non-mutually exclusive classes. The conditions for being classified into each of the three classes are listed below.

• IntendedObject: The object with the highest distribution of gaze readings is the intended object or is contained in the intended group.

• LandmarkObject: The object with the highest distribution is (contained in) the landmark.

• OtherObject: The object with the highest distribution is neither (contained in) the intended object/group nor the landmark, i.e., the focus indicated by gaze is on objects other than the target or landmark objects.

In order to classify a gaze response, the object with the highest distribution was calculated in the same way as for the gaze feedback in the G and VG conditions, except that the object/group with the highest distribution was not compared to the object with the second highest distribution. By non-mutually exclusive we mean, for example, that it is possible for a gaze response to be classified into both the IntendedObject class and the LandmarkObject class if the target is contained in a landmark group.

                           Total Num.   Num. IntendedObject   Rate   Num. LandmarkObject   Rate   Num. OtherObject   Rate
Verbal Only (R-)           656          206                   0.31   212                   0.32   298                0.45
Verbal and Gesture (R+)    739          327                   0.44   252                   0.34   248                0.34

Table 5.6: Rate of different gaze patterns under the Verbal Only condition (R-) and the Verbal and Gesture condition (R+).
Fixation times for successful gaze responses (those in the IntendedObject class, where the intended object received the highest distribution of gaze readings) were computed by finding the distributions of the eye gaze over the objects in the scene after each gaze reading is taken. We find the length of time after the first gaze reading is taken at which the intended object obtains the highest distribution and after which no other object obtains a higher distribution than the intended object. We call this length the fixation time.

There is a significant difference between the two settings (t = 2.43, p < 0.015). The gaze fixation data indicate that the presence of gesture during the RE (R+) helps the participant’s gaze to fixate on the intended object in less time (3.73 s) than when gesture is absent (R-, 4.17 s). This is likely because the robot points in the direction of the intended object, immediately reducing the search space to a subset of objects in the scene and allowing the human to identify the object being referred to more quickly.

Figure 5.3: The average fixation time (in seconds) for the conditions with and without gesture

5.3 Discussion

One of the important characteristics of situated interaction is embodiment. Humans and robots have physical bodies in the shared environment. Humans can refer to objects by generating a referring expression, by simply pointing to the object, or by a combination of both. The decision of how to describe objects and when to incorporate gesture into the referring act is complex. As an initial step in our investigation, we empirically set the weight for the gesture feature to take embodiment into account, and directly adopted the weights learned for other features [1] to address the robot’s and the human’s mismatched representations of the shared world. Undoubtedly, the 2D crowd-sourcing setting is different from the 3D physical environment where we conducted our study. How gesture interacts with other features to generate referring expressions needs to be learned in physical-world settings. We leave this to our future work.

Our empirical results have demonstrated that incorporating gesture into the collaborative model significantly outperforms models without gestures. Our observations have shown that gestures are more useful when spatial descriptions such as “on the left/right” or “in the front/back” are used. Gestures remind the user which frame of reference is being used (the human’s perspective vs. the robot’s perspective). Furthermore, in a scene where everything is technically in the “front” of the user, gesture disambiguates unclear spatial relations. Gesture is also very useful in the gaze-only condition because it is difficult for the human to remember what object they were looking at previously, i.e., which object is the landmark. For example, when hearing “That object you just looked at...”, users are not necessarily aware of the previous object that they looked at unless they catch on to the way the robot is using their gaze feedback.

Eye gaze data obtained from the Mobile Eye tracking system still comes with much noise and might not be suitable for making fine-grained decisions about which object the human is looking at. In the conditions with eye gaze as feedback (G, VG), the rate of successful games, where the target is identified, is lower than in the conditions without eye gaze (e.g., the N condition). This result seems to conflict with previous studies (e.g., [68]).
However, the setting in [68] is human-computer interaction mediated through a computer screen, which is very different from our situated interaction. The former represents a simplified setting in which a monitor-mounted gaze tracker tends to be more reliable. Better use of mobile eye trackers or other devices (e.g., Kinect) in physical environments needs to be explored, as does the granularity of representing gaze feedback (e.g., focused object versus focused region).

In this study we used an empirical heuristic to decide how to process the human’s gaze and use it as feedback. Often, however, the verbal response from the human does not match the gaze response; e.g., the human says “Yes” but the gaze feedback does not indicate that the gaze has fixated on a unique object. We have recorded a large amount of gaze data from participants simultaneously with yes/no verbal feedback. In the future we could use this data in a supervised learning approach to learn how to better classify human gaze patterns.

5.4 Conclusions

In this chapter we have developed an embodied collaborative model for referring expression generation which makes use of non-verbal modalities during interaction. We have conducted an experiment to evaluate the new embodied collaborative model through situated, physical human-robot interaction and to understand the role of embodiment in language generation. Our empirical results have shown that incorporating non-verbal modalities such as gesture into the embodied collaborative model leads to better performance. However, incorporating eye gaze as intermediate feedback for the model does not perform as well as we expected.

Chapter 6 Future Directions

6.1 Dialogue Context

Currently in our embodied collaborative model, a human’s verbal feedback is limited to a simple YES/NO response. If the response is YES, the robot assumes that the human understands the generated RE, and the current intended object becomes the landmark for the next step. If the response is NO, the robot assumes the human does not understand the intended object and will try another RE. In real, natural human-robot interaction, humans could give different types of responses; for example, they can ask for clarification, or they can describe what they perceive. How to dynamically incorporate rich types of user response and dialogue context into the embodied model will be critical for REG systems in practical use.

6.2 Joint Optimization

The idea of jointly optimizing a set of distinct but related sub-tasks is likely to improve the performance of systems. Specifically for NLG systems, designers have traditionally used a pipeline architecture to divide the generation process into the distinct stages of content selection, utterance planning, and surface realization, which choose the semantics, organization, and realization of an utterance. However, this sequential model does not account for the inter-dependencies that exist among these stages. Joint models (e.g., [104]) that consider decisions at all NLG stages in interdependence with each other could possibly obtain overall better system performance.

6.3 Online Learning

Our installment model is based on reinforcement learning; the model was trained by learning from interactions with real users via crowd-sourcing. A reinforcement learning agent usually needs several thousand interactions to learn a good policy for a domain, which makes learning from real human-robot interactions impractical.
Approaches that can learn generation strategies in a fast and efficient way are worth further exploration.

REFERENCES

[1] R. Fang, M. Doering, and J. Y. Chai, “Collaborative models for referring expression generation in situated dialogue,” in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27–31, 2014, Québec City, Québec, Canada, pp. 1544–1550, 2014.
[2] E. Reiter and R. Dale, Building Natural Language Generation Systems. Cambridge University Press, 2000.
[3] C. Liu, R. Fang, and J. Y. Chai, “Towards mediating shared perceptual basis in situated dialogue,” in Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL ’12, (Stroudsburg, PA, USA), pp. 140–149, Association for Computational Linguistics, 2012.
[4] E. Krahmer and I. van der Sluis, “A new model for the generation of multimodal referring expressions,” in Proceedings of the 9th European Workshop on Natural Language Generation, (Budapest, Hungary), pp. 47–57, 2003.
[5] I. F. van der Sluis, Multimodal Reference, Studies in Automatic Generation of Multimodal Referring Expressions. PhD thesis, Tilburg University, 2005.
[6] A. Gatt, I. van der Sluis, and K. van Deemter, “Manual for the tuna corpus: Referring expressions in two domains,” tech. rep., University of Aberdeen, 2006.
[7] J. Viethen and R. Dale, “The use of spatial relations in referring expression generation,” in Proceedings of the Fifth International Natural Language Generation Conference, INLG ’08, (Stroudsburg, PA, USA), pp. 59–67, Association for Computational Linguistics, 2008.
[8] A. Gargett, K. Garoufi, A. Koller, and K. Striegnitz, “The give-2 corpus of giving instructions in virtual environments,” in Proceedings of the 7th Conference on International Language Resources and Evaluation (LREC), 2010.
[9] R. Fang, C. Liu, L. She, and J. Y. Chai, “Towards situated dialogue: Revisiting referring expression generation,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), (Seattle, Washington, USA), pp. 392–402, Association for Computational Linguistics, October 2013.
[10] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning. Cambridge, MA, USA: MIT Press, 1st ed., 1998.
[11] H. Clark and S. Brennan, “Grounding in communication,” Perspectives on socially shared cognition, vol. 13, pp. 127–149, 1991.
[12] T. Tenbrink and R. Moratz, “Group-based spatial reference in linguistic human-robot interaction,” Spatial Cognition and Computation, vol. 6, pp. 63–106, 2003.
[13] K. Funakoshi, S. Watanabe, N. Kuriyama, and T. Tokunaga, “Generation of relative referring expressions based on perceptual grouping,” in COLING, 2004.
[14] K. Funakoshi, S. Watanabe, and T. Tokunaga, “Group-based generation of referring expressions,” in INLG, pp. 73–80, 2006.
[15] S. Weijers, “Referring expressions with groups as landmarks,” vol. 15, University of Twente, 2011.
[16] H. H. Clark and D. Wilkes-Gibbs, “Referring as a collaborative process,” Cognition, vol. 22, pp. 1–39, 1986.
[17] H. H. Clark and E. F. Schaefer, “Contributing to discourse,” Cognitive Science, vol. 13, pp. 259–294, 1989.
[18] R. Dale, “Computational interpretations of the gricean maxims in the generation of referring expressions,” Cognitive Science, vol. 19, pp. 233–263, 1995.
[19] E. Krahmer and K. van Deemter, “Computational generation of referring expressions: A survey,” Computational Linguistics, vol. 38(1), pp. 173–218, 2012.
[20] A. Gatt, I. van der Sluis, and K.
van Deemter, “Evaluating algorithms for the generation of referring expressions using a balanced corpus,” in Proceedings of the Eleventh European Workshop on Natural Language Generation, ENLG ’07, (Stroudsburg, PA, USA), pp. 49–56, Association for Computational Linguistics, 2007.
[21] D. Golland, P. Liang, and D. Klein, “A game-theoretic approach to generating spatial descriptions,” in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, (Stroudsburg, PA, USA), pp. 410–419, Association for Computational Linguistics, 2010.
[22] K. Striegnitz, H. Buschmeier, and S. Kopp, “Referring in installments: a corpus study of spoken object references in an interactive virtual environment,” in Proceedings of the Seventh International Natural Language Generation Conference, INLG ’12, (Stroudsburg, PA, USA), pp. 12–16, Association for Computational Linguistics, 2012.
[23] E. Krahmer, S. van Erk, and A. Verleg, “Graph-based generation of referring expressions,” Computational Linguistics, vol. 29, pp. 53–72, Mar. 2003.
[24] A. Gatt and A. Belz, “Attribute selection for referring expression generation: new algorithms and evaluation methods,” in Proceedings of the Fifth International Natural Language Generation Conference, INLG ’08, (Stroudsburg, PA, USA), pp. 50–58, Association for Computational Linguistics, 2008.
[25] A. Gatt, A. Belz, and E. Kow, “The tuna-reg challenge 2009: overview and evaluation results,” in Proceedings of the 12th European Workshop on Natural Language Generation, ENLG ’09, (Stroudsburg, PA, USA), pp. 174–182, Association for Computational Linguistics, 2009.
[26] P. Grice, Logic and conversation. Academic Press, New York, 1975.
[27] B. Bohnet and R. Dale, “Viewing referring expression generation as search,” in Proceedings of the 19th International Joint Conference on Artificial Intelligence, IJCAI’05, (San Francisco, CA, USA), pp. 1004–1009, Morgan Kaufmann Publishers Inc., 2005.
[28] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Pearson Education, 2nd ed., 2003.
[29] M. Theune, From data to speech: language generation in context. PhD thesis, Eindhoven University of Technology, The Netherlands, 2000.
[30] E. Krahmer and M. Theune, “Efficient context-sensitive generation of referring expressions,” in Information Sharing: Reference and Presupposition in Language Generation and Interpretation, Center for the Study of Language and Information-Lecture Notes (K. van Deemter and R. Kibble, eds.), vol. 143 of CSLI Lecture Notes, pp. 223–263, Stanford, USA: CSLI Publications, 2002.
[31] B. J. Grosz, S. Weinstein, and A. K. Joshi, “Centering: A framework for modeling the local coherence of discourse,” Computational Linguistics, vol. 21, pp. 203–225, 1995.
[32] K. van Deemter, “Generating referring expressions: Boolean extensions of the incremental algorithm,” Comput. Linguist., vol. 28, pp. 37–52, Mar. 2002.
[33] A. Siddharthan and A. Copestake, “Generating referring expressions in open domains,” in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, ACL ’04, (Stroudsburg, PA, USA), Association for Computational Linguistics, 2004.
[34] A. Siddharthan and A. Copestake, “Evaluating an open domain gre algorithm on closed domains. system ids: Cam-b, cam-t, cam-bu and cam-tu,” in Proceedings of the Workshop on Using Corpora for NLG: Language Generation and Machine Translation (UCNLG+MT), 2007.
[35] J. D. Kelleher and G.-J. M.
Kruijff, “Incremental generation of spatial referring expressions in situated dialog,” in Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, ACL-44, (Stroudsburg, PA, USA), pp. 1041–1048, Association for Computational Linguistics, 2006.
[36] I. van der Sluis and E. Krahmer, “Towards the generation of overspecified multimodal referring expressions,” in Proceedings of the Symposium on Dialogue Modelling and Generation at the 15th Annual Meeting of the Society for Text and Discourse (STD-05), (Amsterdam), 6–9 July 2005.
[37] M. Croitoru and K. Van Deemter, “A conceptual graph approach to the generation of referring expressions,” in Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI’07, pp. 2456–2461, 2007.
[38] P. A. Heeman and G. Hirst, “Collaborating on referring expressions,” Computational Linguistics, vol. 21, pp. 351–382, 1995.
[39] P. G. Edmonds, “Collaboration on reference to objects that are not mutually known,” in Proceedings of the 15th Conference on Computational Linguistics - Volume 2, COLING ’94, (Stroudsburg, PA, USA), pp. 1118–1122, Association for Computational Linguistics, 1994.
[40] P. Piwek and K. van Deemter, “Generating under global constraints: The case of scripted dialogue,” Research on Language and Computation, vol. 5(2), pp. 237–263, 2007.
[41] C. Chiarcos, “Evaluating salience metrics for the context-adequate realization of discourse referents,” in Proceedings of the 13th European Workshop on Natural Language Generation, ENLG ’11, (Stroudsburg, PA, USA), pp. 32–43, Association for Computational Linguistics, 2011.
[42] J. Chu-Carroll and S. Carberry, “Collaborative response generation in planning dialogues,” Comput. Linguist., vol. 24, pp. 355–400, Sept. 1998.
[43] A. Koller and M. Stone, “Sentence generation as a planning problem,” in ACL (J. A. Carroll, A. van den Bosch, and A. Zaenen, eds.), The Association for Computational Linguistics, 2007.
[44] K. Garoufi and A. Koller, “Automated planning for situated natural language generation,” in ACL, pp. 1573–1582, The Association for Computational Linguistics, 2010.
[45] K. Garoufi and A. Koller, “Combining symbolic and corpus-based approaches for the generation of successful referring expressions,” in Proceedings of the 13th European Workshop on Natural Language Generation, ENLG ’11, (Stroudsburg, PA, USA), pp. 121–131, Association for Computational Linguistics, 2011.
[46] S. Janarthanam and O. Lemon, “Adaptive referring expression generation in spoken dialogue systems: Evaluation with real users,” in Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL ’10, (Stroudsburg, PA, USA), pp. 124–131, Association for Computational Linguistics, 2010.
[47] V. Rieser and O. Lemon, “Empirical methods in natural language generation,” ch. Natural Language Generation As Planning Under Uncertainty for Spoken Dialogue Systems, pp. 105–120, Berlin, Heidelberg: Springer-Verlag, 2010.
[48] O. Lemon, “Learning what to say and how to say it: Joint optimisation of spoken dialogue management and natural language generation,” Comput. Speech Lang., vol. 25, pp. 210–221, Apr. 2011.
[49] V. Rieser and O. Lemon, “Learning and evaluation of dialogue strategies for new applications: Empirical methods for optimization from small data sets,” Comput. Linguist., vol. 37, pp. 153–196, Mar. 2011.
[50] N. Dethlefs and H.
Cuayáhuitl, “Hierarchical reinforcement learning and hidden Markov models for task-oriented natural language generation,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT ’11, (Stroudsburg, PA, USA), pp. 654–659, Association for Computational Linguistics, 2011.
[51] N. Dethlefs, H. Cuayáhuitl, and J. Viethen, “Optimising natural language generation decision making for situated dialogue,” in SIGDIAL Conference, pp. 78–87, 2011.
[52] S. Janarthanam, H. Hastie, O. Lemon, and X. Liu, “‘The day after the day after tomorrow?’: A machine learning approach to adaptive temporal expression generation: Training and evaluation with real users,” in Proceedings of the SIGDIAL 2011 Conference, SIGDIAL ’11, (Stroudsburg, PA, USA), pp. 142–151, Association for Computational Linguistics, 2011.
[53] D. Roy, “Learning visually grounded words and syntax of natural spoken language,” Evolution of Communication, vol. 4, 2002.
[54] M. Roth and A. Frank, “Computing em-based alignments of routes and route directions as a basis for natural language generation,” in Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10, (Stroudsburg, PA, USA), pp. 958–966, Association for Computational Linguistics, 2010.
[55] T. Bouttaz, E. Pignotti, C. Mellish, and P. Edwards, “A policy-based approach to context dependent natural language generation,” in Proceedings of the 13th European Workshop on Natural Language Generation, ENLG ’11, (Stroudsburg, PA, USA), pp. 151–157, Association for Computational Linguistics, 2011.
[56] S. Varges, “Overgenerating referring expressions involving relations,” in Proceedings of the 3rd International Conference on Natural Language Generation, pp. 171–181, 2004.
[57] S. Varges and K. van Deemter, “Generating referring expressions containing quantifiers,” in Proceedings of the 6th International Workshop on Computational Semantics, 2005.
[58] C. Areces, A. Koller, and K. Striegnitz, “Referring expressions as formulas of description logic,” in Proceedings of the Fifth International Natural Language Generation Conference, INLG ’08, (Stroudsburg, PA, USA), pp. 42–49, Association for Computational Linguistics, 2008.
[59] A. K. Nina Dethlefs, Yunhui Wu and S. Winter, “Generation of adaptive route descriptions in urban environments,” Spatial Cognition and Computation, vol. 11(2), pp. 153–177, 2011.
[60] L. Stoia, D. M. Shockley, D. K. Byron, and E. Fosler-Lussier, “Noun phrase generation for situated dialogs,” in Proceedings of the Fourth International Natural Language Generation Conference, (Sydney, Australia), pp. 81–88, Association for Computational Linguistics, July 2006.
[61] C. J. Fillmore, “Towards a descriptive framework for spatial deixis,” in Speech, Place, and Action (R. J. Jarvella and W. Klein, eds.), pp. 31–59, Chichester: Wiley, 1982.
[62] S. Goldin-Meadow, “The role of gesture in communication and thinking,” Trends Cogn. Sci., 1999.
[63] A. Sauppé and B. Mutlu, “Robot deictics: How gesture and context shape referential communication,” in Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction, HRI ’14, (New York, NY, USA), pp. 342–349, ACM, 2014.
[64] P. Piwek, “Salience in the generation of multimodal referring acts,” in Proceedings of the 2009 International Conference on Multimodal Interfaces, ICMI-MLMI ’09, (New York, NY, USA), pp. 207–210, ACM, 2009.
[65] A. Gatt and P.
Paggio, “What and where: An empirical investigation of pointing gestures and descriptions in multimodal referring actions,” in Proceedings of the 14th European Workshop on Natural Language Generation, (Sofia, Bulgaria), pp. 82–91, Association for Computational Linguistics, August 2013.
[66] A. Gatt and P. Paggio, “Learning when to point: A data-driven approach,” in Proceedings of the 25th International Conference on Computational Linguistics (COLING ’14), 2014.
[67] M. Tanenhaus, M. Spivey-Knowlton, K. Eberhard, and J. Sedivy, “Integration of visual and linguistic information during spoken language comprehension,” Science, vol. 268, pp. 1632–1634, 1995.
[68] A. Koller, M. Staudte, K. Garoufi, and M. Crocker, “Enhancing referential success by tracking hearer gaze,” in Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL ’12, (Stroudsburg, PA, USA), pp. 30–39, Association for Computational Linguistics, 2012.
[69] M. Mitchell, K. van Deemter, and E. Reiter, “Generating expressions that refer to visible objects,” in Proceedings of NAACL-HLT 2013, pp. 1174–1184, 2013.
[70] N. FitzGerald, Y. Artzi, and L. Zettlemoyer, “Learning distributions over logical forms for referring expression generation,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, (Seattle, Washington, USA), pp. 1914–1925, Association for Computational Linguistics, October 2013.
[71] H. Horacek, “Generating referential descriptions under conditions of uncertainty,” in Proceedings of the 10th European Workshop on Natural Language Generation (ENLG), pp. 58–67, Aberdeen, UK, 2005.
[72] D. Byron, A. Koller, and K. Striegnitz, “Report on the first nlg challenge on generating instructions in virtual environments,” in Proceedings of the 12th European Workshop on Natural Language Generation, 2009.
[73] R. Passonneau, “Using centering to relax gricean informational constraints on discourse anaphoric noun phrases,” Language and Speech, vol. 39, pp. 229–264, 1996.
[74] A. Gatt, A. Belz, and E. Kow, “The tuna challenge 2008: Overview and evaluation results,” in Proceedings of the 5th International Conference on Natural Language Generation (INLG), 2008.
[75] C. Liu, R. Fang, L. She, and J. Chai, “Modeling collaborative referring for situated referential grounding,” in Proceedings of the SIGDIAL 2013 Conference, (Metz, France), pp. 78–86, Association for Computational Linguistics, August 2013.
[76] G. Gallo, G. Longo, S. Pallottino, and S. Nguyen, “Directed hypergraphs and applications,” Discrete Applied Mathematics, vol. 42(2), pp. 177–201, 1993.
[77] R. Sternberg, Cognitive Psychology, Third Edition. Thomson Wadsworth, 2003.
[78] K. R. Thórisson, “Simulated perceptual grouping: An application to human-computer interaction,” in Proceedings of the Sixteenth Annual Conference of the Cognitive Science Society, pp. 876–881, 1994.
[79] A. Gatt, “Structuring knowledge for reference generation: A clustering algorithm,” in Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, pp. 321–328, 2006.
[80] S. Harnad, “The symbol grounding problem,” Physica D, vol. 42, pp. 335–346, 1990.
[81] S. S. Dhande, “A computational model to connect gestalt perception and natural language,” Master’s thesis, Massachusetts Institute of Technology, 2003.
[82] P. Gorniak and D.
Roy, “Grounded semantic composition for visual scenes,” Journal of Artificial Intelligence Research, vol. 21, pp. 429–470, 2004.
[83] T. Tenbrink and R. Moratz, “Group-based spatial reference in linguistic human-robot interaction,” Spatial Cognition and Computation, vol. 6, pp. 63–106, 2003.
[84] A. Siebert and D. Schlangen, “A simple method for resolution of definite reference in a shared visual context,” in Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, SIGdial ’08, (Stroudsburg, PA, USA), pp. 84–87, Association for Computational Linguistics, 2008.
[85] D. Zhang and G. Lu, “An integrated approach to shape based image retrieval,” in Proc. of 5th Asian Conference on Computer Vision (ACCV), pp. 652–657, 2002.
[86] N. Otsu, “A threshold selection method from gray-level histogram,” IEEE Trans. Syst. Man Cybern., vol. 9(1), pp. 62–66, 1979.
[87] M. Theune, P. Touset, J. Viethen, and E. Krahmer, “Cost-based attribute selection for gre (graph-sc/graph-fp),” in Proceedings of the MT Summit XI Workshop on Using Corpora for NLG: Language Generation and Machine Translation (UCNLG+MT), 2007.
[88] E. Krahmer, M. Theune, J. Viethen, and I. Hendrickx, “Graph: The costs of redundancy in referring expressions,” in Proceedings of the 5th International Natural Language Generation Conference (INLG 2008), pp. 227–229, 2008.
[89] M. Theune, R. Koolen, E. Krahmer, and S. Wubben, “Does size matter – how much data is required to train a reg algorithm?,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, (Portland, Oregon, USA), pp. 660–664, Association for Computational Linguistics, June 2011.
[90] R. Dale, Generating Referring Expressions: Constructing Descriptions in a Domain of Objects and Processes. The MIT Press, Cambridge, Massachusetts, 1992.
[91] T. Regier, The human semantic potential. The MIT Press, Cambridge, Massachusetts, 1996.
[92] A. Mojsilovic, “A computational model for color naming and describing color composition of images,” IEEE Transactions on Image Processing, vol. 14, pp. 690–699, 2005.
[93] M. Mitchell, K. van Deemter, and E. Reiter, “Two approaches for generating size modifiers,” in Proceedings of the 13th European Workshop on Natural Language Generation, ENLG ’11, (Stroudsburg, PA, USA), pp. 63–70, Association for Computational Linguistics, 2011.
[94] P. W. Jordan and M. Walker, “Learning attribute selections for non-pronominal expressions,” in Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pp. 181–190, 2000.
[95] E. Belke and A. S. Meyer, “Tracking the time course of multidimensional stimulus discrimination: Analyses of viewing patterns and processing times during “same”–“different” decisions,” European Journal of Cognitive Psychology, vol. 14(2), pp. 237–266, 2002.
[96] A. Arts, A. Maes, L. Noordman, and C. Jansen, “Overspecification facilitates object identification,” Journal of Pragmatics, vol. 43(1), pp. 361–374, 2011.
[97] K. Striegnitz, A. Denis, A. Gargett, K. Garoufi, A. Koller, and M. Theune, “Report on the second second challenge on generating instructions in virtual environments (give-2.5),” in Proceedings of the Generation Challenges Session at the 13th European Workshop on Natural Language Generation, (Nancy, France), pp. 270–279, Association for Computational Linguistics, 2011.
[98] H. Clark and A. Bangerter, Changing ideas about reference, pp. 25–49. Experimental pragmatics, Palgrave Macmillan, 2004.
[99] T. L.
Morin and R. E. Marsten, Branch-and-bound strategies for dynamic programming. MIT, 1974.
[100] S. R. K. Branavan, H. Chen, L. S. Zettlemoyer, and R. Barzilay, “Reinforcement learning for mapping instructions to actions,” in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1, ACL ’09, (Stroudsburg, PA, USA), pp. 82–90, Association for Computational Linguistics, 2009.
[101] A. Vogel and D. Jurafsky, “Learning to follow navigational directions,” in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, (Stroudsburg, PA, USA), pp. 806–814, Association for Computational Linguistics, 2010.
[102] P. M. Fitts, “The information capacity of the human motor system in controlling the amplitude of movement,” Journal of Experimental Psychology, vol. 47, pp. 381–391, 1954.
[103] I. S. MacKenzie, A. Sellen, and W. A. S. Buxton, “A comparison of input devices in element pointing and dragging tasks,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’91, (New York, NY, USA), pp. 161–166, ACM, 1991.
[104] N. Dethlefs and H. Cuayáhuitl, “Combining hierarchical reinforcement learning and Bayesian networks for natural language generation in situated dialogue,” in Proceedings of the 13th European Workshop on Natural Language Generation, ENLG ’11, (Stroudsburg, PA, USA), pp. 110–120, Association for Computational Linguistics, 2011.