REFERENTIAL GROUNDING TOWARDS MEDIATING SHARED PERCEPTUAL BASIS IN SITUATED DIALOGUE

By

Changsong Liu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science - Doctor of Philosophy

2015

ABSTRACT

REFERENTIAL GROUNDING TOWARDS MEDIATING SHARED PERCEPTUAL BASIS IN SITUATED DIALOGUE

By Changsong Liu

In situated dialogue, although an artificial agent (e.g., robot) and its human partner are co-present in a shared environment, they have significantly mismatched perceptual capabilities (e.g., recognizing objects in the surroundings). When a shared perceptual basis is missing, it becomes difficult for the agent to identify referents in the physical world that are referred to by the human (i.e., a problem of referential grounding). The work presented in this dissertation focuses on computational approaches to enable robust and adaptive referential grounding in situated settings.

First, graph-based representations are employed to capture a human speaker's linguistic discourse and an agent's visual perception. Referential grounding is then formulated as a graph-matching problem, and a state-space search algorithm is applied to ground linguistic references onto perceived objects. Furthermore, hypergraph representations are used to account for group-based descriptions, and one prevalent pattern of collaborative communication observed from a human-human dialogue dataset is incorporated into the search algorithm. This graph-matching based approach thus provides a principled way to model and utilize spatial relations, group-based descriptions, and collaborative referring discourse in situated dialogue. Evaluation results demonstrate that, when the agent's visual perception is unreliable due to computer vision errors, the graph-based approach significantly improves referential grounding accuracy over a baseline which only relies on object-properties.

Second, an optimization based approach is proposed to mediate the perceptual differences between an agent and a human. Through online interaction with the human, the agent can learn a set of weights which indicate how reliably/unreliably each dimension (object type, object color, etc.) of its perception of the environment maps to the human's linguistic descriptors. Then the agent can adapt to the situation by applying the learned weights to the grounding process and/or adjusting its word grounding models accordingly. Empirical evaluation shows this weight-learning approach can successfully adjust the weights to reflect the agent's perceptual insufficiencies. The learned weights, together with updated word grounding models, can lead to a significant improvement for referential grounding in subsequent dialogues.

Third, a probabilistic labeling algorithm is introduced to handle uncertainties from visual perception and language processing, and to potentially support generation of collaborative responses in the future. The probabilistic labeling algorithm is formulated under the Bayesian reasoning framework. It provides a unified probabilistic scheme to integrate different types of evidence from the collaborative referring discourse, and to generate ranked multiple grounding hypotheses for follow-up processes. Evaluated on the same dataset, probabilistic labeling significantly outperforms state-space search in both accuracy and efficiency.
All these approaches contribute to the ultimate goal of building collaborative dialogue agents for situated interaction, so that the next generation of intelligent machines/devices can better serve human users in daily work and life.

Copyright by
CHANGSONG LIU
2015

To my family.

ACKNOWLEDGMENTS

First of all, I want to express my deepest gratitude to my advisor, Dr. Joyce Chai, for her support, guidance, encouragement, and trust during the past several years. I am grateful for the trust and freedom she offered, with which I could pursue the various research directions that interested me. Her insightful advice on tackling difficult problems, writing convincing papers, giving strong presentations, and many other research-related matters has been of great help in my becoming a qualified researcher. Her passion for research and diligence in work always motivate me to immerse myself in study and research, and to love it. I also deeply appreciate her patience and encouragement, which helped me to learn from failures and overcome difficulties. I am very fortunate to have had Dr. Chai as my PhD advisor, and I will follow the great example she has set for us throughout my career.

I am also grateful to my PhD committee members, Dr. Erik Altmann, Dr. Rong Jin, and Dr. Yiying Tong, for their insightful criticism, inspiring advice, and kind patience. I enjoyed sitting in Dr. Altmann's cognitive psychology class, from which I learned many interesting and inspiring cases of experiment-based studies. His deep insight into the strengths and weaknesses of the studies discussed helped me to better design and conduct my own data collection experiments. It was also a very pleasant learning experience to take both Dr. Jin's and Dr. Tong's classes in the field of computer science. I came to deeply appreciate the beauty and power of computational methods in these classes, and learned to use mathematics as a tool to tackle hard research problems, including those that contribute to this dissertation.

I am also fortunate to have had two summers of internship at SRI International, where I could truly experience how multidisciplinary research has led to many world-changing innovations such as Siri. I especially want to express my gratitude to Dr. Edgar Kalns, who provided me with such a great opportunity. I also would like to thank Mr. Girish Acharya and many other SRI colleagues for their collaboration and support, which made my two summers productive and fruitful.

In the past several years, I have greatly enjoyed the collaboration and friendship with my colleagues at the LAIR lab: Rui Fang, Lanbo She, Shaohua Yang, Malcolm Doering, Tyler Baldwin, Matthew Gerber, Qiaozi Gao, and many others. It has been very rewarding to work with them, and joyful to play with them. I wish them all the best in their future studies, careers, and lives. I also appreciate the help from the staff of the CSE department, especially Linda Moore and Norma Teague. I wish Linda and Norma a very happy retirement.

Last but not least, this journey would not have been possible without the love and support of my family. I especially want to thank my best friend, soul mate, and other half, Ou Ni. I am very grateful for having her in my life; she truly loves me, cares for me, and goes through all the ups and downs hand in hand with me. Even when I sometimes feel lost and don't have faith in myself, she always has faith in me and brings me the belief in a bright future. Another special thanks goes to my dear daughter Nini and son Mike, my bundles of joy.
Life feels so sweet whenever I am with them, or thinking about them. Every moment becomes a cherishable memory because of them. To both my parents and parents-in-law, I owe a debt of gratitude which I shall never be able to repay, but I will do my best to deserve and return their unwavering and unconditional love and support. The eternal love of my family is the truest reason and motivation for me to become a better man, to contribute more to our world, and to live a more worthwhile life.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1 Introduction
  1.1 Situated Referential Communication
  1.2 A Human-human Dialogue Experiment
    1.2.1 Experiment Design
    1.2.2 Observations
    1.2.3 Patterns of Collaborative Referential Communication
  1.3 Contributions
  1.4 Organization of the Dissertation
Chapter 2 Related Work
  2.1 Linguistic Studies on Referential Communication
  2.2 Computational Approaches to Situated Language Grounding
Chapter 3 A Graph-matching Approach
  3.1 System Overview
  3.2 Graph-matching for Situated Referential Grounding
    3.2.1 Graph Representation
    3.2.2 State-space Search Based Graph-matching
    3.2.3 Modeling Collaborative Referring
  3.3 Evaluation
    3.3.1 Data
    3.3.2 Results
  3.4 Conclusion and Discussion
Chapter 4 Learning to Mediate Perceptual Differences
  4.1 Motivation
  4.2 Weight Learning for Situated Referential Grounding
    4.2.1 Inexact Matching with Weighted Attributes
    4.2.2 Optimization for Weight Learning
  4.3 Evaluation
    4.3.1 Data
    4.3.2 Weight-learning Results on Object-specific Attributes
    4.3.3 Weight-learning Results at Word Level
  4.4 Conclusion and Discussion
Chapter 5 A Probabilistic Labeling Approach
  5.1 Motivation
  5.2 Limitation of the State-Space Search Based Approach
  5.3 Probabilistic Labeling for Referential Grounding
    5.3.1 Automatic Language Processing
    5.3.2 Tracking Dialogue Dynamics via Coreference
    5.3.3 Iterative Probabilistic Labeling Algorithm
  5.4 Evaluation
    5.4.1 Data
    5.4.2 Results
  5.5 Conclusion and Discussion
Chapter 6 Conclusion and Future Work
REFERENCES

LIST OF TABLES

Table 3.1 Commonly used attributes and relations and their typical values in our collected data.
Table 3.2 Averaged referential grounding accuracies under different settings.
Table 3.3 Matching accuracies of three groups of dialogues (all the matching results here are produced using hypergraphs).
Table 4.1 Attributes and relations used in the referring expressions in the human-robot data.
Table 4.2 The final weights learned after 20 training dialogues (averaged over 100 runs of simulation).
Table 4.3 Final learned weights for some common words of the two attributes after 20 training dialogues (averaged over 100 runs of simulation).
Table 5.1 Comparison of the reference grounding performances of a random guess baseline, Probabilistic Labeling (P.L.) and State-Space Search (S.S.S.), and P.L. using manually annotated coreference.

LIST OF FIGURES

Figure 1.1 An illustration of the mismatched perceptions of the shared environment between a human and an agent.
Figure 1.2 Our experimental system. Two partners collaborate on an object naming task using this system. The director on the left side is shown an original image, while the matcher on the right side is shown an impoverished version of the original image.
Figure 1.3 The procedure of generating the impoverished image from the original image.
Figure 3.1 An overview of our graph-matching based referential grounding system.
Figure 3.2 An illustration of the graph representations in our method. The discourse graph is created from formal semantic representations of the linguistic discourse; the vision graph is created by applying CV algorithms on the visual scene. Given the two graphs, referential grounding is to find a proper node-to-node mapping from the discourse graph to the vision graph.
Figure 3.3 An illustration of the hypergraph representation for modeling group-based descriptions.
Figure 3.4 An example of the state-space search tree.
Figure 3.5 An example of incorporating collaborative efforts in an unfolding dialogue into graph representations.
Figure 3.6 An illustration of the constrained state-space search through utilizing the grounded nodes.
Figure 4.1 An example of situated setup and human-robot dialogue.
Figure 4.2 Weight-learning and referential grounding results after each training dialogue (averaged over 100 runs).
Figure 4.3 An illustration of the different viewing directions. Direction 1 is the robot's original viewing direction in the experiments, which is the same as the human subject's. Direction 2 is the misaligned viewing direction simulated by decreasing the x coordinates from direction 1. Because of the narrow viewing range of the robot's camera, all the objects become closer to the left edge of its field-of-view (i.e., their x coordinates decrease) under direction 2.
Figure 4.4 Word level weight-learning evaluation results after each training dialogue (averaged over 100 runs of simulation).
Figure 4.5 Word level weight-learning results with model updating (averaged over 100 runs of simulation).
Figure 4.6 Two grounding models for "left". A grounding model here is a scoring function w.r.t. the x-coordinate of an object's center of mass. It measures how well an object's location conforms to the descriptor "left". Model 1 is the Gaussian model; Model 2 is the exponential decay function model.
Figure 5.1 Average running time of the state-space search algorithm with respect to the number of nodes to be grounded in a dialogue graph. The red curve is the state-space search algorithm and the blue curve is the probabilistic labeling algorithm.

Chapter 1 Introduction

As a new generation of intelligent machines/devices starts to emerge in our daily life, techniques that enable efficient and effective human-machine interaction have become increasingly important. In particular, using natural language to dialogue with these intelligent "agents" has become an important direction to pursue [1, 2]. Different from traditional telephone-based dialogue systems (e.g., [3, 4]) and conversational interfaces (e.g., [5, 6]), humans and artificial agents (e.g., robots) are now co-present in a shared physical environment to achieve joint tasks. The meaning of language thus depends on the physical environment and the goals of communication partners. One cannot understand the dialogue without knowledge of the immediate physical and intentional context [7, 8]. This kind of dialogue is called "situated dialogue" [9].

In situated dialogue, although an artificial agent and its human partner are co-present in a shared environment, they have different capabilities in perceiving and reasoning about the physical world. A shared perceptual basis, which plays an important role in supporting situated dialogue between the human and the agent [10], is thus missing. Dialogue between them then becomes difficult, and both the agent and the human will need to make extra efforts to mediate a joint perceptual basis and reach a mutual understanding [11].
To address this challenging problem, our ultimate goal is to develop situated dialogue systems that are robust and adaptive enough to handle the inaccuracies and uncertainties of perceiving the physical environment, and that can interact with humans collaboratively to mediate a joint perceptual basis and facilitate the communication.

1.1 Situated Referential Communication

The work presented in this dissertation focuses on situated referential communication between human users and artificial agents (e.g., robots). A situated dialogue often involves objects and their identities in the shared environment. One critical task thus is referential communication - a process to establish mutual understanding between conversation partners about the intended referents [12]. The agent needs to identify referents in the environment that are specified by its human partner, and the human needs to recognize that the intended referents are correctly understood. It is critical for the agent and its human partner to quickly and reliably reach mutual acceptance of the referents before the dialogue can move forward [11].

Although referential communication between human partners can often be done easily, it becomes more difficult between a human and an artificial agent because of their "mismatched" perceptual capabilities. For example, let's consider the situation illustrated in Figure 1.1. Suppose the human wants to refer to the toy panda by issuing an utterance such as "Give me the black panda to the right of the blue cup". If the hearer is another human, she/he would have no problem identifying the intended object. However, it becomes problematic for the robot to correctly identify the referent due to the robot's mismatched perceptual capabilities. As illustrated in Figure 1.1, the robot has limited object recognition and color perception: the blue cup is mis-recognized as a blue can; the toy panda and the computer mouse are not recognizable; the color of the toy panda is perceived as white rather than black.

Figure 1.1 An illustration of the mismatched perceptions of the shared environment between a human and an agent.

With such mismatched perceptions, relying on object-specific properties (e.g., object-class and color) becomes insufficient for referential communication, and a single exchange is often not adequate. Therefore, the agent and the human need to make extra efforts to establish a joint perceptual basis for successful referential communication.

To mediate a joint perceptual basis between an agent and a human in situated dialogue, previous studies have suggested two important strategies. The first one is to rely more on spatial language (e.g., [13, 14, 15]). This is not only because modern robotic/intelligent agents are often equipped with advanced sensors for acquiring accurate spatial information, but also because spatial language usually can uniquely identify the referents if properly stated. The second strategy is to utilize extended and collaborative dialogue (e.g., [12, 16, 17, 18]). For example, the speaker can refer to the intended object iteratively: the speaker first issues an initial installment, and then continues with further refashionment (e.g., expansion, replacement, or repair) based on the hearer's immediate feedback.
The hearer, on the other hand, can provide useful feedback based on which further refashionment can be made, or directly refashion what the speaker just said in a relevant next turn (see [12] and [16] for a formal account of these collaborative actions in referential communication). Through utilizing spatial language and collaborative dialogue, human partners can often succeed in referential communication even under mismatched perceptions (see the next section for the experiment we conducted to investigate human partners' behaviors under simulated perceptual differences). Therefore, it is important that a situated dialogue system can also capture and utilize spatial language and dialogue dynamics, and engage in communication collaboratively. It is the goal of this thesis to develop computational approaches that enable such situated dialogue systems.

1.2 A Human-human Dialogue Experiment

Although previous studies had done some experiments on human-human (e.g., [12]) and human-agent (e.g., [19]) referential communication in situated settings, the problem of mismatched perceptual capabilities between the conversation partners had not been taken into consideration in these experiments. To address this problem, we conducted an empirical study to investigate how two human partners collaborate and mediate a joint basis when they have simulated perceptual differences.

Our experiment was inspired by the "ablated human capability" studies that proved useful in investigating the problem-solving strategies humans resort to when they encounter machine-like speech recognition errors [20, 21]. In the ablated capability studies, a human subject's capacities were incrementally restricted in the direction of a real system, so that the system could learn better error-handling strategies from the way the human handled the same errors. To implement this idea in our experiment, we ablated one human participant's perceptual capability by showing her/him an impoverished image. The impoverished image simulated what a computer-vision based agent could perceive from the environment (i.e., the original image), thus we could observe how human participants strive to overcome the mismatched perceptions in situated referential communication.

1.2.1 Experiment Design

The setup of our experimental system is shown in Figure 1.2. Two partners (a director and a matcher) collaborate on an object naming task. They both face the same scene, which is composed of some daily-life items (office supplies, fruits, etc.). However, what they actually see is different: the director is shown the original image of the scene, whereas the matcher is shown an impoverished version of the original image. An example of the two different images is illustrated at the top of Figure 1.2.

To faithfully simulate the perceptual capability of an artificial agent, we applied standard computer vision algorithms to process the original image and generate the impoverished representation of the same scene. This procedure is illustrated in Figure 1.3. To create the impoverished image, we first used the OTSU algorithm to separate foreground objects from the background [22]. Then each segmented object was fed into a feature extraction routine that computed a set of region-based and contour-based shape features of the object [23]. The feature vector of the object was then compared with all the "known" objects in a knowledge base, and the object was recognized as the class of its nearest neighbor in the knowledge base.
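For concreteness, the following is a minimal sketch of this segmentation, feature extraction, and recognition pipeline, assuming OpenCV (version 4 or later) and NumPy. The function names, the use of Hu moments as the shape descriptor, and the flat (label, feature vector) format of the knowledge base are illustrative assumptions for this sketch rather than the exact routines used in our implementation.

import cv2
import numpy as np

def segment_objects(image_bgr):
    """Separate foreground objects from the background with Otsu thresholding."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return contours

def shape_features(contour):
    """Compute a small contour-based shape descriptor (here: Hu moments)."""
    return cv2.HuMoments(cv2.moments(contour)).flatten()

def recognize(features, knowledge_base):
    """Label an object with the class of its nearest neighbor in the knowledge base."""
    labels, feats = zip(*knowledge_base)          # knowledge_base: [(label, feature_vector), ...]
    dists = np.linalg.norm(np.array(feats) - features, axis=1)
    return labels[int(np.argmin(dists))]

def impoverished_description(image_bgr, contour, knowledge_base):
    """Produce what the matcher's display needs: class, average color, and bounding box."""
    label = recognize(shape_features(contour), knowledge_base)
    x, y, w, h = cv2.boundingRect(contour)
    region = np.zeros(image_bgr.shape[:2], dtype=np.uint8)
    cv2.drawContours(region, [contour], -1, 255, thickness=-1)
    avg_color = cv2.mean(image_bgr, mask=region)[:3]
    return {"class": label, "color": avg_color, "bbox": (x, y, w, h)}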
After this segmentation → feature extraction → recognition pipeline, the final outcome was then displayed as an abstract illustration in the impoverished image. For instance, if an object was recognized as a pear, an abstract illustration of a pear was displayed in the impoverished image at the position of the original object. The color of the illustration was set to the average color of the pixels of the original object, and the height and width were set according to the segmented bounding box of the original object.

Figure 1.2 Our experimental system. Two partners collaborate on an object naming task using this system. The director on the left side is shown an original image, while the matcher on the right side is shown an impoverished version of the original image.

Figure 1.3 The procedure of generating the impoverished image from the original image.

In the object naming task, the director's goal was to communicate the "secret names" of some randomly selected objects (i.e., target objects) in the original image to the matcher, so that the matcher would know which object had what name. As shown in the example image in Figure 1.2, those secret names were displayed only on the director's side but not the matcher's. Once the matcher believed that the name of a target object was correctly communicated, s/he recorded the name by clicking on the target and repeating the name. A task was considered complete when the matcher had recorded the names of all the target objects.

1.2.2 Observations

Using this experimental setup, we collected a set of human-human dialogues on the object-naming task (namely, a referential communication task). All the participants were undergraduate or graduate students recruited from the campus of Michigan State University. Consistent with previous studies (e.g., [12, 14]), our collected data have demonstrated overwhelming use of spatial relations and collaborative referring actions to overcome the mismatched perceptions. Here are two example excerpts from the collected data, in which D stands for the director and M for the matcher.

Example 1:
D: what I am seeing on my screen are three apples (1)
D: and there is an apple that is directly below, slightly to the right of the battery (2)
M: ok (3)
D: and then there is an apple to the right of that (4)
D: and there is an apple below that last apple (5)
M: ok (6)
D: so the apple directly below the battery is called Alexis (7)
M: ok, this is Alexis (8)
D: and then to the right of Alexis is an apple (9)
D: and then below that apple is (10)
D: I am sorry, actually that is a red pear (11)
D: but it looks like an apple (12)
... ...

Example 2:
D: there is basically a cluster of four objects in the upper left, do you see that (1)
M: yes (2)
D: ok, so the one in the corner is a blue cup (3)
M: I see there is a square, but fine, it is blue (4)
D: alright, I will just go with that, so and then right under that is a yellow pepper (5)
M: ok, I see apple but orangish yellow (6)
D: ok, so that yellow pepper is named Brittany (7)
M: uh, the bottom left of those four? Because I do see a yellow pepper in the upper right (8)
D: the upper right of the four of them? (9)
M: yes (10)
D: ok, so that is basically the one to the right of the blue cup (11)
M: yeah (12)
D: that is actually an apple (13)
D: that is a green apple and it is named Ashley (14)
... ...
As we can see from the two examples, the participants relied on both object-specific properties and spatial relations for communicating the identities of intended objects. The most commonly used object properties included object-class, color, and spatial location. Other properties such as size, length, and shape were also used. For spatial relations, the most commonly used were the projective relations [24], such as right, left, above, and below. Besides, as demonstrated in Example 2 (utterances (1), (8), and (9)), descriptions based on grouping of multiple objects were also very useful [25].

Furthermore, as illustrated by these two examples, referential communication under mismatched perceptions was often a highly incremental and collaborative process. In Example 1, after the director initiated "three apples" in utterance (1), he kept elaborating on them till they were accepted by the matcher. Each of the director's following utterances can be viewed as either an expansion (e.g., utterances (4), (5), (7), and (9)) or a replacement (e.g., utterance (11)) of a previous utterance. In Example 2, the matcher played a more active role in the dialogue. He proactively described what he perceived (utterances (4) and (6)), and asked for clarification when he could not uniquely resolve the reference (utterance (8)). The director, on the other hand, could either accept the matcher's presentation and further expand it (utterance (5)), or refashion it with a replacement (utterance (13)).

All these observed dynamics conform to the collaborative referring model in the literature [12, 16]. We thus applied this model to categorize all the prevalent patterns of these collaborative dialogues in our data (see the next section). These categories have provided important guidance for developing our computational approaches to support collaborative referential communication.

1.2.3 Patterns of Collaborative Referential Communication

In our data, referential communication between the director and the matcher generally falls into two phases: a presentation phase and an acceptance phase. A presentation phase can be in one of the following forms:

• A complete description: the speaker issues a complete description in a single turn. For example, "there is a red apple on the top right".

• An installment: a description is divided into several parts/installments, each of which needs to be confirmed before continuing to the rest. For example,
A: under the big green cup we just talked about,
B: yes
A: there are two apples,
B: OK
A: one is red and one is yellow.
......

• An expansion: a description that adds more information to the previous presentation, such as speaker A's second and third utterances in the last example.

• A replacement: a description that replaces some information in the previous presentation. For example, "that is actually an apple (not a pepper)".

• A trial: a description with a try marker. For example, "Is there a red apple on the top right?"

In an acceptance phase, the other interlocutor (i.e., the hearer) can either accept or reject the current presentation, or postpone the decision. Two major forms of accepting a presentation are observed in our data:

• An acknowledgement: the hearer explicitly shows his/her understanding, using assertions (e.g., "Yes", "Right", "I see") or affirmative continuers (e.g., "uh huh", "OK").

• A relevant next turn: the hearer proceeds to the next turn that is relevant to the current presentation.
For example, A says "I see a red apple" and directly following that B says "there is also a green apple to the right of that red one".

Furthermore, there are also two forms of rejecting a presentation:

• A rejection: the hearer explicitly rejects the current presentation, for example, "I don't see any apple".

• An alternative description: the hearer instead presents an alternative description. For example, A says "there is a red apple on the top left," and immediately following that B says "I only see a red apple on the right".

Besides explicitly accepting or rejecting a presentation, the hearer can also postpone the decision and wait or ask for more information. For example:

• A clarification question/request: the hearer asks a question to let the speaker clarify, such as "uh, the bottom left of those four?", "I see a blue object there, is that what you are talking about?"

Actually, the acceptance of a previous presentation often represents a new presentation itself, which then triggers further acceptance. For example, a relevant next turn as an acceptance can also be viewed as a new presentation itself. An alternative description is also a replacement. And a clarification question often presents a trial that awaits further response. In general, referential communication in our data emerges as a hierarchical structure of recursive presentation-acceptance phases. It is a highly collaborative process that cannot be moved forward without joint efforts from both dialogue participants.

1.3 Contributions

The observations from our empirical study have indicated that, to enable effective and efficient situated referential communication between human users and artificial agents, a computational approach should take the following issues into consideration:

• It needs to model different types of referring expressions, i.e., referring expressions based on object-specific properties and spatial relations. In particular, spatial relations can provide very useful information for identifying intended objects when object-specific properties alone are not sufficient due to mismatched perceptual capabilities.

• The model of the linguistic content also needs to capture the rich dynamics of collaborative dialogue. It should identify and capture various relationships between interrelated utterances in the discourse, for instance, those patterns of collaborative referential communication we discussed in the previous section.

• Besides modeling the linguistic discourse, it needs to represent the perceived visual features and spatial relations of the physical objects in the environment. Then a computational approach should use the semantic representation of the linguistic discourse as constraints to search for feasible mappings from the model of the linguistic discourse to the model of the perceived environment. We call such a process "situated referential grounding" (a toy sketch of this view of grounding as matching is given after this list).

• Because of the mismatched perceptual capabilities between a human and an agent, situated referential grounding needs some level of approximation without enforcing complete satisfaction of the constraints. Furthermore, the dynamics of collaborative dialogue should be utilized to overcome mismatched perceptions and facilitate referential grounding.

These issues comprise the basic requirements that a computational approach to situated referential grounding needs to fulfill.
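To make this "grounding as matching" view concrete, the toy sketch below treats each linguistic reference as a node with soft attribute constraints, treats a spatial relation as a soft constraint between a pair of nodes, and searches for the node-to-object mapping with the highest overall score. The attribute set, the particular scores, and the brute-force search are illustrative assumptions only; the actual representations and algorithms developed in this dissertation are presented in Chapters 3-5.

from itertools import permutations

def attr_score(ref, obj):
    """Soft compatibility between a linguistic reference and a perceived object."""
    score = 1.0
    for attr in ("type", "color"):
        if attr in ref:
            score *= 0.9 if ref[attr] == obj.get(attr) else 0.1  # approximate, not hard, matching
    return score

def rel_score(relation, obj_a, obj_b):
    """Soft score for a binary spatial relation between two perceived objects."""
    if relation == "left_of":
        return 0.9 if obj_a["x"] < obj_b["x"] else 0.1
    return 0.5  # unknown relation: uninformative

def ground(references, relations, objects):
    """Exhaustively search node-to-node mappings and keep the best-scoring one."""
    best, best_score = None, -1.0
    for perm in permutations(range(len(objects)), len(references)):
        score = 1.0
        for i, ref in enumerate(references):
            score *= attr_score(ref, objects[perm[i]])
        for rel, i, j in relations:             # e.g., ("left_of", 1, 0): reference 1 is left of reference 0
            score *= rel_score(rel, objects[perm[i]], objects[perm[j]])
        if score > best_score:
            best, best_score = perm, score
    return best, best_score

# "the black panda to the right of the blue cup" -> two references plus one relation
refs = [{"type": "panda", "color": "black"}, {"type": "cup", "color": "blue"}]
rels = [("left_of", 1, 0)]                      # the cup is to the left of the panda
objs = [{"type": "can", "color": "blue", "x": 10},       # the mis-recognized cup
        {"type": "unknown", "color": "white", "x": 40}]  # the unrecognized panda
print(ground(refs, rels, objs))                 # best mapping: panda -> objs[1], cup -> objs[0]

In this example the panda is still grounded onto the unrecognized white object, because the soft scores let the spatial relation compensate for the mismatched object type and color, which is exactly the behavior a situated referential grounding approach needs under mismatched perceptions.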
Besides, some other features/functionalities are also desirable to support robust and adaptive referential grounding, which include but are not limited to:

• It is desirable that the computational approach be based on a probabilistic framework, within which the uncertainties from different sources (e.g., visual perception, speech and language processing) can be handled under a unified probabilistic scheme. Using such a probabilistic approach, the referential grounding component should efficiently generate ranked grounding hypotheses to facilitate the dialogue manager's decision making.

• Since an agent's perceptual and grounding abilities can be affected by many situational factors, such as a noisy environment, faulty sensors, and human speakers' individual differences, even a previously well-performing system can become unreliable and need to be adjusted in a new situation. It is thus important to develop automatic learning mechanisms that allow the system to adapt efficiently to changed situations.

Towards the ultimate goal of building robust and adaptive dialogue systems for situated referential communication, this dissertation has made the following contributions to address all these issues.

First, graph-based representations are employed to capture a human speaker's linguistic discourse and an agent's visual perception. Referential grounding is then formulated as a graph-matching problem, and a state-space search algorithm is applied to ground linguistic references onto perceived objects. Furthermore, hypergraph representations are used to account for group-based descriptions, and one prevalent pattern of collaborative communication observed from a human-human dialogue dataset is incorporated into the search algorithm. This graph-matching based approach thus provides a principled way to model and utilize spatial relations, group-based descriptions, and collaborative referring discourse in situated dialogue. Evaluation results demonstrate that, when the agent's visual perception is unreliable due to computer vision errors, the graph-based approach significantly improves referential grounding accuracy over a baseline which only relies on object-properties.

Second, an optimization based approach is proposed to mediate the perceptual differences between an agent and a human. Through online interaction with the human, the agent can learn a set of weights which indicate how reliably/unreliably each dimension (object type, object color, etc.) of its perception of the environment maps to the human's linguistic descriptors. Then the agent can adapt to the situation by applying the learned weights to the grounding process and/or adjusting its word grounding models accordingly. Empirical evaluation shows this weight-learning approach can successfully adjust the weights to reflect the agent's perceptual insufficiencies. The learned weights, together with updated word grounding models, can lead to a significant improvement for referential grounding in subsequent dialogues.

Third, a probabilistic labeling algorithm is introduced to handle uncertainties from visual perception and language processing, and to potentially support generation of collaborative responses in the future. The probabilistic labeling algorithm is formulated under the Bayesian reasoning framework. It provides a unified probabilistic scheme to integrate different types of evidence from the collaborative referring discourse, and to generate ranked multiple grounding hypotheses for follow-up processes.
Evaluated on the same dataset, probabilistic labeling significantly outperforms state-space search in both accuracy and efficiency.

All these approaches contribute to the ultimate goal of building collaborative dialogue agents for situated interaction, so that the next generation of intelligent machines/devices can better serve human users in daily work and life.

1.4 Organization of the Dissertation

The remainder of this dissertation is organized as follows. Chapter 2 first discusses some related linguistic studies on referential communication and collaborative dialogue, and then reviews related work on computational approaches to the language grounding problem. Chapter 3 presents the graph-matching based approach to situated referential grounding. The contents of Chapter 3 come from our papers that appeared in the 13th (2012) and 14th (2013) annual meetings of the Special Interest Group on Discourse and Dialogue (SIGDIAL) [26, 27]. Chapter 4 presents the weight learning approach to mediate the perceptual differences between humans and agents. This is based on the work published in the 29th conference of the Association for the Advancement of Artificial Intelligence (AAAI 2015) [28]. Chapter 5 introduces the probabilistic labeling based framework. The material follows the paper published in the 52nd annual meeting of the Association for Computational Linguistics (ACL 2014) [29]. Lastly, Chapter 6 concludes the thesis and discusses some future directions to be explored.

Chapter 2 Related Work

SHRDLU [30], the seminal work on situated language understanding by Terry Winograd in the 1970s, still remains one of the most sophisticated systems today. With a relatively static and symbolic representation of the "physical" situation (a virtual block world), it used a systemic grammar to analyze the user's language input and identify the key units (e.g., noun, verb, or preposition phrases). The connection between the language units and the physical world was implicitly encoded in the form of computational procedures attached to the grammar, and a higher-level parsing procedure was used to combine the sentence structural analysis with the semantics (i.e., the meaning interpreting procedures attached to grammar units) to select the referent.

Many of today's situated language understanding systems have followed the same general idea established by SHRDLU (e.g., [31, 32, 33, 34, 35, 36]). These systems often consist of two key components: the first component addresses formalisms and methods that connect linguistic terms to the agent's representations of the shared environment and task; given these connection models, the second component analyzes the syntactic structure over the linguistic units in a sentence, and combines their models accordingly to generate a situational-context based interpretation. Compared to pioneering systems like SHRDLU, the two most important advancements that modern situated language understanding systems have made are arguably: (1) relying more on statistical models and machine-learning approaches to build more robust and adaptive systems, whereas the earlier systems were mostly based on symbolic and hand-crafted models and procedures; and (2) paying more attention to human behavior studies and incorporating more linguistic models/theories into the computational approaches.
In the rest of this chapter, we will first review some linguistic studies that provide insights on how humans behave in referential communication tasks and shed light on what a situated dialogue system should take into account when it interacts with humans. Then we will review previous work on computational approaches to situated language grounding. Some of them focus on the first component, some on the second, and some on both.

2.1 Linguistic Studies on Referential Communication

Referential communication is a conversation focusing on objects and their identities [11]. It arises very often in our daily life. For example, when two people build something together, or one teaches the other how to build, the two of them need to refer to the construction pieces again and again. Referential communication also arises in tasks in which people need to arrange objects, transport objects, or do other things with objects in the surrounding environment.

In the linguistic community, referential communication has been studied for a long time, tracing back to the 1960s [37, 38]. In situated settings, the typical way of establishing a reference is to compare the properties of the referent with the properties of other surrounding items, and the expression used generally conforms to the Gricean maxims [39] [40]. For example, the study of [41] found that people often use object-specific properties such as object-class, color, shape, and size to construct their referring expressions. Spatial relations between objects are another way of referring to objects, but they tend to be less preferred because of the extra cognitive effort they impose on the speaker.

However, the patterns in a monologue setting and in a two-person dialogue setting are profoundly different [38, 10]. Conversation between two participants is a joint activity [10]. In conversation, participants coordinate their mental states based on their mutual understanding of their intentions, goals, and current tasks [42]. An important notion, and a key to the success of communication, is grounding, a process to establish mutual understanding between conversation participants. Specifically for referential communication in conversation, Clark and Wilkes-Gibbs developed the collaborative model of referring to explain the referential behaviors of participants [43]. This work states that grounding references is a collaborative process following the principle of least collaborative effort [43]:

. . . speakers and addressees try to minimize collaborative effort, the work both speakers and addressees do from the initiation of the referential process to its completion.

The collaborative model indicates that speakers tend to use different types of noun phrases other than elementary noun phrases during communication. The addressee attends to what has been said almost at the same time that the utterance is produced by the speaker. The speaker often adapts language production in the middle of planning based on the feedback from the addressee. Similarly, addressees make efforts to accept or reject references using alternative descriptions and indicative gestures (e.g., pointing, looking, or touching) [44]. Different types of evidence can be used to indicate grounding of references, such as back-channel responses, relevant next turns, and continued attention [44]. When an initial noun phrase is not acceptable, it must be refashioned. The collaborative model also identified three types of refashioning: repair, self-expansion, and replacement.
Through these mechanisms, the speaker and the addressee strive to minimize the collaborative effort in grounding the reference [43].

The collaborative model and the concept of grounding have motivated previous work on spoken dialogue systems [45], embodied conversational agents [46], and recent work on human-robot interaction [47, 48, 49]. However, the implications of the collaborative model are not clear in situated dialogue where conversation partners have significantly mismatched capabilities in perceiving the environment (e.g., speech recognition and object recognition). It is not clear in this setting how participants strive to collaborate and minimize the collaborative effort in grounding references. Understanding these implications is important to enable the collaborative referential process between a human and an artificial agent such as a robot. The experiment presented in Section 1.2 is our first step towards understanding how participants with mismatched capabilities coordinate the collaborative referring process through dialogue.

2.2 Computational Approaches to Situated Language Grounding

Similar to SHRDLU [30], some early situated language understanding work such as SAM [50] and Ubiquitous Talker [51] was also based on the syntactic-analysis driven framework. These systems used a phrase parser to analyze the syntactic structure of language inputs, based on which sequences of operations (i.e., plans) were constructed. These plans operated on a symbolic knowledge base that was constructed by a vision system to represent the perceived environment, such as the properties of perceived objects. The key limitation of these systems is that plans can only operate on symbolic representations of visual scenes, which cannot represent the subtle visual features that are always available to human speakers.

To retain the subtle visual information sensed by vision systems and achieve robust language grounding performance, later work often relies on sophisticated "language grounding models" that associate rich visual representations (i.e., with low-level visual features and complex internal structures) to linguistic terms. These models are often based on psychology/linguistics studies or machine-learning approaches. For example, Mojsilovic developed computational models for color categorization and composition [52]. Through subjective experiments, a vocabulary and syntax for describing colors was collected first. Then a perceptually based color-naming metric, which followed neurophysiological findings on human color categorization, was designed to measure the compatibility between a color name from the vocabulary and an arbitrary numeric color value. Based on this metric, their algorithm could accurately identify regions of known colors in different color spaces (RGB and HSV), and assign names to arbitrary colors that were compatible with human judgments.

Regier and Carlson's work focused on building grounding models to map spatial language onto aspects of visual perception [53]. The grounded meaning of projective spatial terms, such as above, below, left, or right, was computationally modeled by the combination of an attention process (i.e., finding a focused location and direction) and a vector-sum coding of overall direction (i.e., a set of constituent directions). Their model is thus called the Attentional Vector-Sum (AVS) model for projective spatial terms.
In a series of evaluation experiments, their computational model accurately predicted linguistic acceptability judgments for spatial terms compared to some other relatively simple models. The AVS model has thus provided a good formalism to connect spatial language to the perceived environment, which can be especially useful in situated referential grounding.

The work by Moratz and Tenbrink also focused on developing grounding models of projective spatial terms [24]. They developed an empirically supported model of projective spatial relations through iterative experiments involving uninformed users. Their grounding models were mapping functions between projective spatial terms and geometrical configurations of objects in the environment, which was represented by a set of characteristic points on a 2D plane. To make the grounding models capable of interpreting projective relations under different reference systems, they iteratively developed and tuned their models based on formal linguistic analysis of a series of experiment data, which provided them with a systematic categorization of the variability in speakers' spontaneous strategies. Their work demonstrated a practical methodology for developing flexible language grounding models through an iterative development process.

Dhande's thesis work presented a computational model that connects gestalt visual perception and language [54]. This work focused on a particular class of words or phrases called "aggregates", such as pair and group, which inherently refer to groups of objects. The model grounded the meaning of these words and phrases onto the perceptual properties of visually salient groups, and provided an explanation for how the semantics of these terms interact with gestalt processes. The work by Funakoshi et al. also focused on understanding referring expressions involving perceptual grouping [55]. Similar to the iterative and empirical approach used in [24], they conducted experiments with spontaneous speakers to collect group-based referring expressions, and developed a set of models based on analyzing the collected data. Their models showed good coverage in identifying referents by linking group-based descriptions with perceptual grouping features extracted from the visual scenes.

Since advanced sensing technologies can provide very accurate spatial representations of the physical environment [56], a body of research has focused on spatial language understanding to interpret referents, especially for situated human-robot dialogue [57, 13, 15]. Spatial information can be very helpful to overcome the limitation of object recognition in robotic systems; however, spatial language understanding is a challenging problem in itself. The use of spatial expressions presupposes underlying conceptual reference systems, the so-called frames-of-reference [58], and different frames-of-reference can lead to different interpretations of the same expression [59]. As shown in studies of spatial language use in situated interactions [60, 24, 61], automated prediction of frames-of-reference can be difficult because "speakers vary considerably in their individual solutions to the problem at hand, but they often do not account for potential underlying ambiguities (of frames-of-reference) sufficiently" [61]. A potential solution to this problem should perhaps be based on the interaction aspect, e.g., developing more interactive agents that can resolve such ambiguities through collaborative dialogue.
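As a concrete, deliberately simplified illustration of how a projective spatial term can be grounded, and of how the frame-of-reference choice changes the interpretation, the sketch below scores "left of" by the angular deviation of the landmark-to-target direction from a reference direction. The cosine-based scoring function and the two hard-coded frames of reference are assumptions made for this sketch; it is far simpler than the AVS model [53] or the empirically tuned models of [24].

import math

def left_of_score(target, landmark, frame="speaker"):
    """Score in [0, 1] for how well `target` lies to the left of `landmark`
    under the given frame of reference."""
    dx, dy = target[0] - landmark[0], target[1] - landmark[1]
    # The direction that counts as "left" depends on the frame of reference.
    ref = {"speaker": (-1.0, 0.0),          # speaker faces the scene: left = negative x
           "addressee": (1.0, 0.0)}[frame]  # addressee faces the speaker: mirrored
    norm = math.hypot(dx, dy)
    if norm == 0:
        return 0.0
    cos = (dx * ref[0] + dy * ref[1]) / norm
    return max(0.0, cos)  # 1.0 when perfectly "left", 0.0 at or beyond 90 degrees

target, landmark = (2.0, 5.0), (6.0, 5.0)
print(left_of_score(target, landmark, frame="speaker"))    # 1.0
print(left_of_score(target, landmark, frame="addressee"))  # 0.0: the same expression flips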
Ripley, a conversational robotic system developed by Roy's group [62, 63], provided a computational framework for understanding situated interaction and addressing the frame-of-reference problem as well. To ground language to the physical world, a key component of their system was a "grounded lexicon", which defined the meaning of words in terms of structured representations of the robot's rich sensorimotor information. To resolve the referents of natural language expressions, it applied a linguistic parsing procedure that is compatible with the sensor-grounded lexical meanings [31]. Furthermore, to cope with the ambiguous frames-of-reference problem, they developed a perceptually-coupled physical simulator to maintain a "mental model" of the physical environment. Using the simulator, the system can synthesize imagined views of the environment from arbitrary perspectives and generate different interpretations based on the synthesized views. This mental imagery mechanism provides a nice basis for spatial language understanding, and also generation, in situated dialogue.

Although it operates only in a virtual world, Gorniak and Roy's work [31] on building a visually grounded language understanding system has been an influential piece of work in recent years. To build their system, they also first conducted an experiment to collect language data and investigate how people verbally describe objects in synthesized virtual scenes. Based on the collected data, they built a visually-grounded language understanding system that can handle the most commonly used descriptive strategies in their experiments. Their system again consisted of some "standard" components that commonly exist in both later and earlier referential grounding systems: a set of visual feature extraction algorithms, a lexicon that is grounded in terms of these visual features, a robust parser to analyze the syntactic structure of spoken utterances, and a compositional engine driven by the parser that combines the visual groundings of lexical units to select the most proper referent(s). For a large portion of the test cases, their implemented system demonstrated impressive performance in selecting the correct referents in response to natural language expressions.

In later work [8], Gorniak and Roy further extended their computational approach to support a theory of situated language understanding, which depends not only on the physical environment but also on the goals and plans of communication partners to interpret the meaning of language. Their work was based on the concept of "perceived affordances", which were structured representations used to encode not only the perception of the world, but also past and possible future interactions between the perceiver and the world. Given such representations, they used a probabilistic hierarchical plan recognition mechanism to predict possible interactions between the speaker and the environment, and treated situated language understanding as filtering/ranking the predicted interactions based on the parsing outcome of the utterance. This affordance-based approach demonstrated an important aspect of situated language understanding, namely taking the communication partner's intention into consideration.

Although not in a situated setting, the probabilistic approach to reference resolution in multimodal user interfaces by Chai et al. [6, 64] also inspired our work.
In their approach, a graph-based representation was used to capture a rich set of constraints from different perspectives, including temporal alignments between speech and pointing gestures, coherence of the ongoing conversation, and domain information. It then applied a probabilistic graph-matching algorithm [65] to identify the most probable referents by optimizing the satisfaction of all the constraints captured in the graphs simultaneously. As demonstrated by their evaluation results, this graph-matching based approach was very effective, especially for complex user inputs that involve multiple referring expressions and multiple gestures at the same time. Inspired by this approach, our work on situated referential grounding also employs a similar graph representation to capture various information from the linguistic discourse and the visual perception. Referential grounding is then formulated as a graph-matching problem, and different algorithms have been developed to allow robust and adaptive matching.

The works we have discussed so far all aimed to build hand-crafted models and/or systems to address the language grounding problem. They rely completely on the developers' manual efforts to analyze the patterns in data, design models and tune parameters, write grammars, or implement grounding procedures. Another line of work tackles the problem using machine-learning based approaches, in which various machine-learning techniques have been used to automatically build some or all of the components of a situated language understanding system.

For example, to learn the perceptually grounded semantics of spatial terms (such as above, below, left, right, in and on), Regier used artificial neural networks to learn a set of classification models from line drawing animations labeled by spatial terms [66]. Based on a simple set of features, such as the shapes, sizes, and relative positions of objects, the learned model could reliably classify a variety of spatial configurations and events in accordance with spatial terms across several languages. This work gives an excellent example of how the perceptually grounded semantics of natural language can be automatically acquired using machine learning techniques.

Another good example of learning perceptually grounded meanings of linguistic terms is the computational model of word acquisition developed by Roy and Pentland [67]. They proposed an information theoretic framework which could learn models for speech segmentation, word discovery and visual categorization simultaneously from spontaneous infant-directed speech data paired with video images of objects. The learning of sensor-grounded word semantics in this work focused only on the shape of objects, but other attributes such as color, size, and so forth could be handled in the same way. The visual inputs to their learning algorithms were histograms of local shape features of objects, and such raw inputs were categorized into prototypes using statistical learning mechanisms. To associate the identified spoken words with these prototypes, they used mutual information to evaluate the strength of the association between word and shape representations based on their co-occurrences across multiple observations. This work demonstrated the possibility of learning grounded word-to-meaning models from raw sensor data without the need for human transcription and labeling.
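The association step just described can be illustrated with a small sketch: given a set of observations in which a word may or may not be uttered while a shape prototype may or may not be present, the mutual information of the two indicators measures how strongly they are associated. The binary co-occurrence representation is a simplifying assumption of this sketch; the original work operated on much richer audio-visual representations.

import math

def mutual_information(pairs):
    """pairs: list of (word_present, shape_present) booleans, one pair per observation."""
    n = len(pairs)
    def p(pred):
        return sum(1 for x in pairs if pred(x)) / n
    mi = 0.0
    for w in (True, False):
        for s in (True, False):
            p_ws = p(lambda x: x[0] == w and x[1] == s)
            p_w, p_s = p(lambda x: x[0] == w), p(lambda x: x[1] == s)
            if p_ws > 0:
                mi += p_ws * math.log2(p_ws / (p_w * p_s))
    return mi

# A word that reliably co-occurs with a shape prototype scores high ...
print(mutual_information([(True, True)] * 8 + [(False, False)] * 8))   # 1.0
# ... while an unrelated word scores zero.
print(mutual_information([(True, True), (True, False),
                          (False, True), (False, False)] * 4))         # 0.0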
While the two examples just mentioned focused only on learning the grounded semantics of one particular category of natural language descriptors (i.e., spatial terms and shape terms, respectively), another work by Roy developed a unified scheme for learning perceptually grounded word semantics across different object attributes, including shape, color and size, as well as spatial terms [68]. The grounded meanings of the concepts in different categories were all modeled as multivariate Gaussian distributions in the relevant feature spaces. A set of learning algorithms was developed to extract the syntactic structures, cluster words into word classes, select relevant visual features, and establish the grounding models of individual words. This approach successfully acquired the visual semantics of individual words, word classes, and phrase structures in terms of probabilistic models over the visual feature spaces. All the models were learned from a dataset of visual scenes paired with transcribed natural language descriptions (collected in a "show-and-tell" experiment), without any manual annotation.

Yu and Ballard developed a multimodal learning interface that learns to associate spoken language with perceptual features in an unsupervised manner [69]. To facilitate learning, their interface collected a rich set of multimodal sensory information simultaneously with users' speech inputs, while the users were performing everyday tasks and providing natural language descriptions of their behaviors at the same time. The collected multimodal sensory information included users' perspective videos, hand movements, gaze positions, and head directions. This multisensory information was used to extract visual feature representations of objects and actions, and to estimate the user's attention. An Expectation-Maximization (EM) algorithm was then applied to find reliable associations between spoken words and temporally co-occurring attended objects and actions. As demonstrated by their work, utilizing user-centric multisensory information has a key advantage: it reduces the large amount of irrelevant computation that purely co-occurrence based statistical learning can hardly avoid.

Tellex et al. proposed a probabilistic graphical model based approach for learning and understanding natural language commands in terms of a robot's sensorimotor representations [70]. Given a natural language command, they first built a hierarchical representation of its semantic structure based on linguistic parsing. Then a Conditional Random Field (CRF) [71] model was constructed from the semantic structure, which captured the interdependencies between the language phrases in the command and their groundings. With such graphical models, they further developed efficient inference and learning algorithms to find proper mappings from natural language to the robot's internal representations of its actions and the perceived environment. Their model was very successful in learning and inferring the grounded meaning of navigation and manipulation verbs and spatial phrases. One limitation of this approach is that it requires the annotated grounding of each individual phrase in a command in order to train the model. To address this limitation, their later work proposed a weakly supervised learning algorithm to train the model with unaligned parallel data [72]. This new learning algorithm only needs to observe the top-level action corresponding to each command, and no detailed annotations are required at training time.
The works we have just reviewed demonstrate how the meaning of language, mapped to perceptual features of the external world, can be acquired using machine learning techniques. In addition, work on situated language understanding often leverages linguistic parsing to map from natural language to the perceptual context, and the parsing component can also be constructed automatically through machine learning. For example, Kate and Mooney developed a weakly supervised method for learning a semantic parser from ambiguous training data [73]. In their method, semantic parsers were automatically learned using Support Vector Machines (SVMs). To handle ambiguous training data, in which each natural language sentence was paired with a set of potential meaning representations of which only one was correct, their method employed an EM-like procedure to iteratively score the potential sentence-meaning pairs and retrain the parser. This method was used in later work by Chen and Mooney [33] to build a system that could learn to transform natural language instructions into executable navigation plans in complex virtual environments. The system was able to automatically learn a grounded lexicon and a semantic parser by simply observing how humans follow navigation instructions, without any prior knowledge or manual annotation.

Matuszek et al. presented an approach for jointly learning language grounding and parsing models [34] from weakly supervised data. In their work, the grounded meaning of a word was associated with a set of learned visual attribute classifiers, and semantic parsing was based on a learned Combinatory Categorial Grammar (CCG) lexicon [74]. Their training data was collected using Amazon Mechanical Turk, and contained spontaneous referring expressions paired with raw visual perceptions of scenes containing multiple objects. To jointly train the parameters of the different models, they developed an online, EM-like learning algorithm to optimize the log-likelihood of the training data. Their approach was able to learn accurate language and perception models given only the annotations of target objects, with no explicit labels for the linguistic parses or the visual attribute classifier outputs. Based on a closely related idea, Krishnamurthy and Kollar also developed a grounded language acquisition model for jointly learning to parse and perceive [35]. The advances of their work were the ability to learn models not only for one-argument categories (object attributes) but also for two-argument relations, and the ability to train in an entirely weakly supervised manner without a bootstrapping phase.

All these related works we have discussed in this section provide valuable insights on how to manually or automatically build the key components of a situated referential grounding system (e.g., linguistic parsing, grounding models that connect words to visual perceptions, and computational procedures to combine the visual groundings of lexical units). However, most of them have only dealt with the interpretation of single referring expressions/utterances, not interrelated utterances as in a dialogue context. Some earlier works [17, 18, 75] developed symbolic reasoning (e.g., planning) based approaches to incorporate dialogue dynamics into the interpretation of referring expressions. These works provided good examples of computationally modeling the collaborative referring theory [12].
However, they have not addressed situated referential grounding, for which purely symbolic reasoning based approaches may not be sufficient, and new approaches that are more robust to uncertainties need to be pursued. Some recent works have tackled this problem using "hybrid" models. For example, DeVault and Stone proposed an approach that combined symbolic reasoning and machine learning for interpreting situated referential grounding dialogue [19]. Kennington and Schlangen used Markov Logic Networks (MLN) [76] to incorporate the discourse and situational context for situated language understanding in an incremental manner [77]. However, their "environments" were simplistic block worlds, and the issue of mismatched perception between humans and agents was not addressed.

Chapter 3 A Graph-matching Approach

In this chapter, we present a graph-matching based approach for situated referential grounding.1 This approach uses a graph-based representation to model the referential communication dialogue. The graph representation models different types of referring expressions, including object-specific properties, spatial relations, and group-based descriptions. It also captures the discourse relations between different referring expressions. The environment perceived via computer vision is also represented using the same graph model. Referential grounding is then formulated as a graph-matching problem to find feasible matchings between the graph representation of the linguistic discourse and the graph representation of the visual perception. This chapter begins with an overview of our situated referential communication system. It then presents the graph representation and the graph-matching algorithm in detail, followed by the evaluation of this approach using the data collected from the experiment described in Section 1.2. Lastly, we conclude this chapter with some discussion.

1 This chapter is based on the following publications: C. Liu, R. Fang, and J. Chai, "Towards mediating shared perceptual basis in situated dialogue," in Proceedings of the 13th SIGDIAL Conference, pp. 140–149, July 2012. C. Liu, R. Fang, L. She, and J. Chai, "Modeling collaborative referring for situated referential grounding," in Proceedings of the 14th SIGDIAL Conference, pp. 78–86, August 2013.

Figure 3.1 An overview of our graph-matching based referential grounding system.

3.1 System Overview

Figure 3.1 illustrates the key elements and the process of our graph-based method. The key elements of our method are two graph representations, one of which is called the discourse graph and the other the vision graph. The discourse graph captures the information extracted from the linguistic discourse. To create the discourse graph, the linguistic discourse first needs to be processed by Natural Language Processing (NLP) components, such as the semantic parsing and coreference resolution components. The output of the NLP components is usually in the form of formal semantic representations, e.g., First-Order Logic (FOL) representations. The discourse graph is then created based on the formal semantics and the discourse relations between different utterances. The vision graph, on the other hand, is a representation of the visual features extracted from the perceived environment. It is built based on the information output by the Computer Vision (CV) component.
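To make these two representations concrete, the following is a minimal sketch of how a discourse graph and a vision graph could be encoded as attributed graphs. The attribute names and values are illustrative stand-ins rather than the system's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    attrs: dict                      # e.g., {"type": "apple", "color": "red"}

@dataclass
class Edge:
    tail: int                        # index of the tail node
    head: int                        # index of the head node
    attrs: dict                      # e.g., {"spatial_rel": "right_of"}

@dataclass
class Graph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

# Discourse graph for "the red apple to the right of the cup"
discourse_graph = Graph(
    nodes=[Node({"type": "apple", "color": "red"}), Node({"type": "cup"})],
    edges=[Edge(tail=0, head=1, attrs={"spatial_rel": "right_of"})],
)

# Vision graph built from (hypothetical) CV output: symbolic labels plus
# numeric features such as RGB color vectors and image coordinates
vision_graph = Graph(
    nodes=[Node({"type": "ball", "color": (210, 12, 90), "pos": (700, 450)}),
           Node({"type": "cup", "color": (30, 30, 200), "pos": (300, 600)})],
    edges=[Edge(tail=0, head=1, attrs={"offset": (400, -150)})],
)
```

Note how the discourse-side attributes are symbolic while the vision-side attributes are mostly numeric; this heterogeneity is exactly what the matching formulation below has to bridge.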
Given the discourse graph and the vision graph, we can then formulate referential grounding as constructing a node-to-node mapping from the discourse graph to the vision graph, or in other words, a matching between the two graphs. Note that the matching between the discourse graph and the vision graph we encounter here differs from the original graph matching problem often studied in the CV field [78, 79]. The original problem only considers matching between two graphs that have the same type of values for each attribute. In the case of situated referential grounding, however, all the attributes in the discourse graph possess symbolic values, since they come from formal semantic representations, whereas the attributes in the vision graph are often numeric values produced by CV algorithms. Our solution is to introduce a set of semantic grounding functions, which bridge the heterogeneous attributes of the two graphs and make general graph matching algorithms applicable to referential grounding. The details will be presented later in this chapter.

3.2 Graph-matching for Situated Referential Grounding

In the field of Computer Vision and Pattern Recognition, the Attributed Relational Graph (ARG) is a very useful data structure for representing an image [80, 81]. In an ARG, the underlying unlabeled graph represents the topological structure of the scene. Each node and edge is then labeled with a set of attributes that represent the local features of a single node or the relational features between two nodes. Based on ARG representations, inexact graph matching aims to find a graph or subgraph whose error-transformation cost with respect to a given graph is minimal [82, 83]. Motivated by the representational power of ARGs and the error-correcting capability of inexact graph matching, we developed a graph-based approach to address the referential grounding problem. In the following, we first demonstrate how the ARG representations are created, and then present the formal formulation of our graph-based method for referential grounding.

3.2.1 Graph Representation

Figure 3.2 illustrates the graph representations in our method. The discourse graph is created based on the formal semantic representations (e.g., FOL representations) from parsing and the coreference resolution results. For instance, each linguistic entity is represented by a node in the graph, and one-place (unary) predicates are translated into node attributes. Two-place (binary) predicates that correspond to the mentioned relations between entities are represented by edges and their attributes in the graph. Furthermore, multiple nodes can be merged together based on coreference, since coreferential discourse entities should refer to the same object in the environment (i.e., correspond to the same node in the vision graph). The vision graph, on the other hand, is a representation of the visual features extracted from the scene. It is built based on the information output by the CV component. For instance, each object detected by CV is represented as a node in the vision graph, and the attributes of the node correspond to visual features, such as the color, size and position of the object. The edges between nodes represent their relations in the physical space. Based on the graph representations of the linguistic discourse and the visual perception, we formulate referential grounding as a graph matching problem, which extends the original graph matching approach used in the CV and PR field.
Figure 3.2 An illustration of the graph representations in our method. The discourse graph is created from formal semantic representations of the linguistic discourse; the vision graph is created by applying CV algorithms to the visual scene. Given the two graphs, referential grounding is to find a proper node-to-node mapping from the discourse graph to the vision graph.

Formally, an attributed relational graph (ARG) is a labeled graph

$G = (X, E)$
$X = \{x_m \mid m = 1, \ldots, M\}$    (3.1)
$E = \{e_i = (e_{i1}, e_{i2}) \mid i = 1, \ldots, I;\ e_{i1} \in X,\ e_{i2} \in X\}$

So it is a directed graph with M nodes and I (directed) edges. Furthermore, each node $x_m$ is assigned a set of A attributes $U_m$ and each edge $e_i$ is assigned a set of B attributes $V_i$, i.e.,

$U_m = \{u_{ma} \mid a = 1, \ldots, A\}$    (3.2)
$V_i = \{v_{ib} \mid b = 1, \ldots, B\}$

For example, a node can be assigned a set of attributes such as $\{u_1 = Apple,\ u_2 = Red,\ u_3 = Small,\ u_4 = Front\}$, which specify the type, color, size and location of the object represented by the node. The vision graph is defined in the same way and denoted as $G'$, i.e.,

$G' = (X', E')$
$X' = \{x'_n \mid n = 1, \ldots, N\}$    (3.3)
$E' = \{e'_j = (e'_{j1}, e'_{j2}) \mid j = 1, \ldots, J;\ e'_{j1} \in X',\ e'_{j2} \in X'\}$

In an ARG, the value of a node/edge attribute can be symbolic, numeric, or a vector of numeric values. The attributes in a discourse graph always contain symbolic values, since they come from linguistic input. In contrast, the attribute values in a vision graph are mostly numeric, because they are produced by computer vision algorithms. For example, if in the vision graph the attribute $u'_1$ is used to represent the color feature of an object, then a possible assignment could be $u'_1 = (255, 0, 0)$, an RGB color vector. Some commonly used attributes and their typical values from our data are shown in Table 3.1.

                      Discourse graph     Vision graph
  type                "apple"             "ball"
  color               "red"               (r: 210, g: 12, b: 90)
  size                "large"             (w: 45, h: 45)
  spatial relation    "right of"          (x1: 700, y1: 450), (x2: 300, y2: 600)
  ...                 ...                 ...

Table 3.1 Commonly used attributes and relations and their typical values in our collected data.

An edge in the graph represents a 2-tuple of nodes, i.e., $e_i = (e_{i1}, e_{i2})$ in which $e_{i1} \in X$ and $e_{i2} \in X$. So we use a labeled edge to represent a binary relation between one object and another, such as "the apple to the right of the cup". Besides binary relations, group-based relations can also be used to distinguish an object or a set of objects from others [25, 55]. A group-based relation involves more than two objects, for example, an intra-group relation (such as "the middle one of the three balls"), an inter-group relation (such as "the two objects in front of those three balls"), or a geometric formation relation (such as "a row of three balls")2.

To account for group-based relations, we can further extend the regular graph representation to a hypergraph [84]. A hypergraph is a more general representation than a regular graph: it can represent not only binary relations between two nodes, but also group-based relations among multiple nodes. Formally, a hypergraph is defined as

$G = (X, E)$
$X = \{x_m \mid m = 1, \ldots, M\}$    (3.4)
$E = \{e_i = (e_{i1}, e_{i2}) \mid i = 1, \ldots, I;\ e_{i1} \subseteq X,\ e_{i2} \subseteq X\}$

The difference between a hypergraph and a regular graph is that E is now a set of "hyperedges". Just like an edge in a regular graph, each hyperedge $e_i$ in a hypergraph also consists of two "ends", i.e., a tail ($e_{i1}$) and a head ($e_{i2}$).
However, the tail and head of a hyperedge are both subsets of X, so they can contain any number of nodes in X (i.e., a "group").

2 See [55] for definitions of these different types of group-based relations.

Figure 3.3 An illustration of the hypergraph representation for modeling group-based descriptions.

For example, suppose the discourse includes the following utterances:

(1) There is a cluster of four objects in the upper left.
(2) The one in the corner is a blue cup.
(3) Under the blue cup is a yellow pepper.
(4) To the right of the blue cup, which is also in the upper right of the four objects, is a green apple.

The corresponding dialogue graph $G = (X, E)$ is shown in Figure 3.3, where $X = \{x_1, x_2, x_3, x_4\}$ and $E = \{e_1, e_2, e_3\}$. In E, for example, $e_1 = (\{x_1\}, \{x_3\})$ represents the relation "right of", with tail $\{x_1\}$ and head $\{x_3\}$, and $e_3 = (\{x_3\}, \{x_1, x_2, x_3, x_4\})$ represents the group-based relation "upper right" between one node and a group of nodes.

3.2.2 State-space Search Based Graph-matching

Given a discourse graph G and a vision graph $G'$, a matching between G and $G'$ is a set of node-to-node mappings between the nodes in G and the nodes in $G'$. In other words, a matching (denoted as Θ) between G and $G'$ assigns each node $x_m$ in X a "label" $\theta_m$ to indicate which node in $X'$ $x_m$ maps to. Formally,

$\Theta = \{\theta_m = x'_n \mid m = 1, \ldots, M;\ x'_n \in X'\}$    (3.5)

We then define the compatibility function of a matching Θ as

$f(\Theta) = \sum_{x_m \in X} g(x_m, x'_n) + \sum_{e_i \in E} h(e_i, e'_j)$    (3.6)

where $x'_n$ and $e'_j$ are the corresponding node and edge of $x_m$ and $e_i$ according to Θ, respectively3. To compute $f(\Theta)$, we need to further define $g(x_m, x'_n)$ and $h(e_i, e'_j)$, i.e., the compatibility functions for a matched pair of nodes and a matched pair of edges, respectively. They are based on the attributes assigned to the nodes/edges (the subscripts m, n, i and j are dropped since the definition is general for any pair of nodes/edges):

$g(x, x') = \frac{1}{A} \sum_{a=1}^{A} g_a(u_a, u'_a)$
$h(e, e') = \frac{1}{B} \sum_{b=1}^{B} h_b(v_b, v'_b)$    (3.7)

3 Note that we treat the vision graph $G'$ as a complete graph, in which there exists an edge between each pair of nodes. In the case of a hypergraph, this means there exists a hyperedge between any two subsets of nodes.

We call $g_a(u_a, u'_a)$ and $h_b(v_b, v'_b)$ the semantic grounding functions for the a-th node attribute and the b-th edge attribute, respectively. Namely, a semantic grounding function for the a-th attribute takes two input arguments $u_a$ and $u'_a$, which are the values of the a-th attribute from node x and $x'$, respectively. The output of the function is a real number p in the range [0, 1], which can be interpreted as a measurement of the compatibility between a symbolic value (or word) $u_a$ and a visual feature value $u'_a$. p can also be interpreted as the likelihood of observing the value $u'_a$ given the symbol $u_a$ provided by the linguistic input. In essence, what we have defined as symbol grounding functions here is very similar to the "visual word semantics" in previous work [68, 85, 32].

Let $Y = \{y_1, y_2, \cdots, y_K\}$ be the set of all possible symbolic values of $u_a$ (i.e., Y is the lexicon of the a-th node attribute in the dialogue graph); then $g_a(u_a, u'_a)$ can be further decomposed as:

$g_a(u_a, u'_a) = \begin{cases} g_{a_1}(u'_a) & \text{if } u_a = y_1; \\ g_{a_2}(u'_a) & \text{if } u_a = y_2; \\ \quad\vdots \\ g_{a_K}(u'_a) & \text{if } u_a = y_K \end{cases}$    (3.8)

So for each value (e.g., $y_k$) of $u_a$, there is a stand-alone grounding function/model $g_{a_k}(u'_a)$.
Its output measures how well $u'_a$ (e.g., a numeric value from the vision graph) matches $u_a$ (e.g., the word that the human speaker used to describe this attribute of the object). In fact, it can also be interpreted as the (conditional) probability of observing $u'_a$ given $u_a = y_k$, i.e., $g_{a_k}(u'_a) = p(u'_a \mid u_a = y_k)$. Similarly, the grounding function for edge attributes has the same form as for the node attributes. Let $Z = \{z_1, z_2, \cdots, z_L\}$ be the lexicon of the b-th edge attribute; then we have:

$h_b(v_b, v'_b) = \begin{cases} h_{b_1}(v'_b) & \text{if } v_b = z_1; \\ h_{b_2}(v'_b) & \text{if } v_b = z_2; \\ \quad\vdots \\ h_{b_L}(v'_b) & \text{if } v_b = z_L \end{cases}$    (3.9)

where each $h_{b_l}(v'_b)$ is a stand-alone grounding function for the value $z_l$ of $v_b$. These semantic grounding functions can be either manually defined (e.g., [54, 31]) or automatically learned (e.g., [68, 32]). In our current work, we use a set of manually defined grounding functions similar to those used in [31].

Based on our definition of the compatibility function of a matching, the optimal matching between G and $G'$ is the one with the highest compatibility score:

$\Theta^* = \arg\max_{\Theta} f(\Theta)$    (3.10)

which gives us the most feasible result of grounding the entities in the discourse graph to the objects in the vision graph.

Given our formulation of referential grounding as a graph matching problem, the next question is how to find the optimal matching between two graphs. Unfortunately, this problem is NP-complete [78]. In practice, techniques such as A* search are commonly used to improve efficiency (e.g., in [80, 86]), but the memory requirement can still be considerably large if the heuristic does not provide a close estimate of the future matching cost [78]. As a practical solution, we use a state-space search method [87] and apply a simple beam search algorithm [88, 89] to retain computational tractability.

In this state-space search algorithm, a state s in the search space represents the construction of a subgraph from G and a subgraph from $G'$, as well as the matching between the two subgraphs. A state s can be expanded to a successive state by adding one more matched pair of nodes, namely letting $\theta_m = x'_n$, where m is the index of an unprocessed node in G and n is the index of an unmatched node in $G'$. A state-space search tree is constructed from the root $\Theta = \emptyset$ in a breadth-first manner. At each level of the search tree, all the states are ranked based on their compatibility scores and those that fall out of a predetermined beam width are pruned. Following the assumption in [82], we set the beam size to $d\delta^2$, where d is the depth of the current level of the search tree and $\delta = \max(M, N)$ (i.e., the size of the bigger graph).

Algorithm 1 State-space Search Algorithm for Graph Matching
INPUT: two graphs G and $G'$
OUTPUT: a sub-optimal matching Θ between G and $G'$
METHOD:
 1: root ← {}, level_0 ← {root}
 2: M ← number of nodes in G, N ← number of nodes in $G'$, δ ← max(M, N)
 3: for m ← 1 to M do
 4:   for every state s in level_{m−1} do
 5:     for every unmatched node index n in $G'$ do
 6:       s_new ← add (θ_m = x'_n) to s
 7:       add s_new to level_m
 8:     end for
 9:   end for
10:   sort all the states in level_m by their compatibility scores
11:   keep only the first mδ^2 states
12: end for
13: return the first state in level_M

Figure 3.4 An example of the state-space search tree.

Figure 3.4 illustrates the procedure of constructing the state-space search tree.
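To make the search procedure concrete, the following is a minimal Python sketch of the beam-search matcher outlined in Algorithm 1. The attribute dictionaries and the grounding functions g and h passed in are illustrative stand-ins, not the system's actual implementation, and the sketch assumes the vision graph has at least as many nodes as the discourse graph.

```python
def beam_search_match(disc_nodes, disc_edges, vis_nodes, g, h):
    """Sketch of the state-space beam search in Algorithm 1.

    disc_nodes : list of discourse-node attribute dicts
    disc_edges : {(m1, m2): relation_attrs} for the mentioned relations
    vis_nodes  : list of vision-node attribute dicts
    g(u, u')   : node grounding score; h(v, a, b): edge grounding score
    A search state is (assignment, score), where assignment maps
    discourse node index m -> vision node index n.
    """
    M, N = len(disc_nodes), len(vis_nodes)
    delta = max(M, N)
    level = [({}, 0.0)]                       # root state
    for m in range(M):                        # one discourse node per level
        expanded = []
        for assign, score in level:
            for n in set(range(N)) - set(assign.values()):
                s = score + g(disc_nodes[m], vis_nodes[n])
                # add edge compatibility once both ends of a relation are matched
                for (m1, m2), rel in disc_edges.items():
                    if m in (m1, m2) and {m1, m2} <= set(assign) | {m}:
                        n1 = n if m1 == m else assign[m1]
                        n2 = n if m2 == m else assign[m2]
                        s += h(rel, vis_nodes[n1], vis_nodes[n2])
                expanded.append(({**assign, m: n}, s))
        expanded.sort(key=lambda state: state[1], reverse=True)
        level = expanded[:(m + 1) * delta ** 2]   # beam width m * delta^2
    return level[0]                               # best full matching found
```

A call such as beam_search_match(disc, {(0, 1): "right_of"}, vis, g, h) would return the highest-scoring full assignment together with its compatibility score.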
3.2.3 Modeling Collaborative Referring

As we have already discussed in Chapter 1, one of our goals is to explicitly incorporate collaborative referring into the graph-matching algorithm for referential grounding. As the conversation unfolds, our approach intends to incrementally build a discourse graph by keeping track of the contributions (i.e., presentation and acceptance) from both the human and the agent. This discourse graph is then matched against the perceived environment (i.e., the vision graph) in order to resolve referring expressions from the human. Here we focus on a particularly prevalent type of collaborative referring pattern observed in our collected data4, namely what we call the agent-present-human-accept collaboration pattern. More specifically, our data showed that when mediating their joint perceptual basis, the director often took into consideration what the matcher saw and used that to gradually lead to the intended referents. This is demonstrated in the following example5, where the director accepted (Turn 3) the matcher's presentation (Turn 2) through a relevant next turn:

(Turn 1) D: There is a kiwi fruit.
(Turn 2) M: I don't see any kiwi fruit. I see an apple.
(Turn 3) D: Do you see a mug to the right of that apple?
(Turn 4) M: Yes.
(Turn 5) D: OK, then the kiwi fruit is to the left of that apple.

We use this example to show how the agent-present-human-accept pattern can be incorporated to potentially improve reference resolution. Figure 3.5 gives a graphical illustration of this idea. In this example, the human and the agent (i.e., the robot) face a shared physical environment. The robot perceives the environment through computer vision algorithms and generates a graph representation (i.e., a vision graph), which captures the perceived objects and their spatial relations6. As shown in Figure 3.5(a), the kiwi is represented as an unknown object in the vision graph due to insufficient object recognition. Besides the vision graph, the robot also maintains a discourse graph that captures the linguistic discourse between the human and the robot.

4 See Section 1.2 for the details of our data collection experiment.
5 This is a cleaned-up version of the original example to demonstrate the key ideas.
6 The spatial relations between objects are represented as their relative coordinates in the vision graph.

Figure 3.5 An example of incorporating collaborative efforts in an unfolding dialogue into graph representations.

At Turn 1 in Figure 3.5(b), the human says "there is a kiwi fruit". Upon receiving this utterance, through semantic processing, a node representing "a kiwi" is generated (i.e., x1). The discourse graph at this point contains only this single node. Identifying the referent of the expression "a kiwi fruit" is essentially a process that matches the discourse graph to the vision graph. Because the vision graph does not have a node representing a kiwi object, no high-confidence match is returned at this point. Therefore, the robot responds with a rejection as in Turn 2 (Figure 3.5(c)): "I don't see any kiwi fruit"7. In addition, the robot takes an extra effort to proactively describe what is being confidently perceived (i.e., "I see an apple"). Now an additional node y1 is added to the discourse graph to represent the term "an apple"8. Note that when the robot generates the term "an apple", it knows precisely which object in the vision graph this term refers to. Therefore, as shown in Figure 3.5(c), y1 is mapped to v2 in the vision graph.
In Turn 3 (Figure 3.5(d)), through semantic processing on the human's utterance "a mug to the right of that apple", two new nodes (x2 and x3) and their relation (RightOf) are added to the discourse graph. In addition, since "that apple" (i.e., x2) corefers with "an apple" (i.e., y1) presented by the robot in the previous turn, a coreference link is created from x2 to y1. Importantly, in this turn the human displays his acceptance of the robot's previous presentation ("an apple") by coreferring to it and building further reference upon it. This is exactly the agent-present-human-accept strategy described earlier. Since y1 maps to object v2 and x2 now links to y1, it is equivalent to consider that x2 also maps to v2. We call a node such as x2 a grounded node, since from the robot's point of view this node has been "grounded" to a perceived object (i.e., a vision graph node) via the agent-present-human-accept pattern.

At this point, the robot matches the updated discourse graph with the vision graph again and can successfully match x3 to v3. Note that the matching that occurs here is a constrained graph-matching, in the sense that some nodes in the discourse graph (i.e., x2) are already grounded, and the only node that needs to be matched against the vision graph is x3. The constrained matching utilizes additional constraints from the collaboration patterns in a dialogue and thus can improve both the efficiency and accuracy of the matching algorithm. Based on such a matching result, the robot responds with a confirmation as in Turn 4 (Figure 3.5(e)). The human further elaborates in Turn 5: "the kiwi is to the left of the apple". Again, semantic processing and linguistic coreference resolution allow the robot to update the discourse graph as shown in Figure 3.5(f). Given this discourse graph, based on the context of the larger discourse graph and through constrained matching, it is possible to match x1 to v1 even though the object class of v1 is unknown.

7 Note that, since in this chapter we are using a dataset of human-human (i.e., director and matcher) dialogues, decisions from the matcher are assumed known. We leave the agent's decision making (i.e., response generation) to future work.
8 We use xi to denote nodes that represent expressions from the human's utterances and yi to denote nodes from the robot's utterances.

This example demonstrates how the discourse graph can be created to incorporate the collaborative referring behaviors as the conversation unfolds, and how such an accumulated discourse graph can help reference resolution through constrained matching. When the "agent-present-human-accept" pattern is identified, the associated nodes (e.g., x2 in the example) are marked as grounded nodes and the mappings to their grounded visual entities (i.e., vision graph nodes) are added into the discourse graph. Such information can then be incorporated into the state-space search algorithm straightforwardly: the search procedure can now start from the state that already represents the known matching of the grounded nodes, instead of starting from the root. The search is thus constrained to a smaller and more promising subspace, improving both efficiency and accuracy.

Figure 3.6 An illustration of the constrained state-space search through utilizing the grounded nodes.

See Figure 3.6 for an illustration of the constrained state-space search through utilizing the grounded nodes.
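The constrained matching described above amounts to seeding the search with the already-known mappings instead of the empty root. Below is a hedged sketch of this idea, reusing the conventions of the beam-search sketch in Section 3.2.2; the names and signatures are again illustrative, not the actual implementation.

```python
def constrained_beam_search(disc_nodes, disc_edges, vis_nodes, g, h, grounded):
    """Constrained variant of the beam search: discourse nodes already
    grounded via the agent-present-human-accept pattern are fixed up front,
    and the search only fills in the remaining ungrounded nodes.

    grounded: {discourse node index -> vision node index}, e.g. the mapping
    of x2 to v2 in the example above (using zero-based indices).
    """
    M, N = len(disc_nodes), len(vis_nodes)
    delta = max(M, N)
    level = [(dict(grounded), 0.0)]               # seed state: known mappings
    remaining = [m for m in range(M) if m not in grounded]
    for depth, m in enumerate(remaining, start=1):
        expanded = []
        for assign, score in level:
            for n in set(range(N)) - set(assign.values()):
                s = score + g(disc_nodes[m], vis_nodes[n])
                for (m1, m2), rel in disc_edges.items():
                    if m in (m1, m2) and {m1, m2} <= set(assign) | {m}:
                        n1 = n if m1 == m else assign[m1]
                        n2 = n if m2 == m else assign[m2]
                        s += h(rel, vis_nodes[n1], vis_nodes[n2])
                expanded.append(({**assign, m: n}, s))
        expanded.sort(key=lambda state: state[1], reverse=True)
        level = expanded[:depth * delta ** 2]
    return level[0]
```

Because the grounded nodes never have to be re-matched, the search tree is both shallower and narrower, which is where the efficiency and accuracy gains discussed above come from.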
3.3 Evaluation

3.3.1 Data

A total of 32 dialogues collected from our experiments (as described in Section 1.2) are used in the evaluation. For each of these dialogues, we have manually annotated (turn-by-turn) the formal semantics, discourse coreferences and grounded nodes as described earlier in this chapter. Since the focus here is on graph building and matching for referential grounding, we use these annotations to build the discourse graphs in our evaluation (evaluation results based on fully automatic language processing will be discussed in Chapter 4). Vision graphs are automatically generated by CV algorithms from the original images used in the experiments. In our settings, the CV algorithm's object recognition performance is rather low: only 5% of the objects in those images are correctly recognized, so reference resolution needs to rely on relations and collaborative strategies9.

The 32 dialogue graphs have a total of 384 nodes10 generated from directors' utterances (12 per dialogue on average), and a total of 307 nodes generated from matchers' utterances (10 per dialogue on average). Among the 307 matcher-generated nodes, 187 (61%) are initially presented by the matcher and then coreferred to by directors' subsequent utterances (i.e., relevant next turns). This indicates that the agent-present-human-accept strategy is a prevalent way to collaborate in our experiment. As mentioned earlier, those director-generated nodes which corefer to nodes initiated by matchers are marked as grounded nodes. In total, 187 out of the 384 director-generated nodes are in fact grounded nodes.

9 Although CV's object recognition can be improved by employing more sophisticated algorithms and more training, it can still become unreliable in noisy or unfamiliar environments. Thus our focus is on how to enable robust referential grounding when the agent's visual perception suffers from such limitations.
10 As mentioned earlier in Section 3.2.1, multiple expressions that are coreferential with each other and describe the same entity are merged into a single node.

3.3.2 Results

To evaluate our approach, we apply the graph-matching algorithm to each pair of discourse graph and vision graph. The matching results are compared with the annotated ground truth to calculate the accuracy of our approach in grounding directors' referring descriptions to visual objects. For each dialogue, we have produced matching results under four different settings: with/without modeling collaborative referring (i.e., the agent-present-human-accept collaboration) and with/without using hypergraphs. When collaborative referring is modeled, the graph-matching algorithm uses the grounded nodes to constrain its search space to match the remaining ungrounded nodes11. When collaborative referring is not modeled, all the director-generated nodes are treated as ungrounded and need to be grounded by the algorithm.

11 Our referential grounding algorithm actually plays the role of the "matcher" in the experiment. From the matcher's/algorithm's point of view, those grounded nodes have already been grounded onto known objects via the agent-present-human-accept collaboration pattern.

                                         Object-properties only   Regular graph       Hypergraph
Not modeling collaborative referring     31.4%                    44.1%               47.9%
Modeling collaborative referring         N/A                      55.7%               66.2%
Improvement                              N/A                      11.6% (p < 0.05)    18.3% (p < 0.02)

Table 3.2 Averaged referential grounding accuracies under different settings.
In addition, in order to investigate the advantages of the graph-based approach, we have also applied a baseline approach which only takes object-specific properties into consideration and ignores all the spatial and discourse relations between linguistic entities [90, 91]. In this case, the discourse representations contain only "isolated nodes" rather than connected graphs. The results of these different settings (averaged accuracies over the 32 dialogues) are shown in Table 3.2.

As we can see from the first row of Table 3.2, using our graph-based approach significantly improves the referential grounding accuracy by 12.7% (p < 0.05 based on the Wilcoxon signed-rank test [92] on the 32 dialogues). The results thus demonstrate the importance of representing and reasoning about relations between entities in referential grounding, and the graph-based approach provides an effective way to capture and utilize relations.

Modeling collaborative referring improves the matching accuracies for both regular graphs and hypergraphs. When regular graphs are used, it improves overall matching accuracy by 11.6% (p = 0.05). The improvement is even higher, 18.3%, when hypergraphs are used (p < 0.02). The results indicate that proactively describing what the agent sees to the human to facilitate communication is an important collaborative strategy in referential grounding dialogues. Humans can often ground the agent-presented object via the agent-present-human-accept strategy and use the grounded object as a reference point to further describe other intended object(s), and our graph-matching approach is able to capture and utilize this collaboration pattern to improve referential grounding accuracy. The improvement is more significant when hypergraphs are used. A potential explanation is that the group-based relations captured by hypergraphs always involve multiple (more than two) objects (nodes). If one node in a group-based relation is grounded, all other involved nodes have a better chance of being correctly matched. In contrast, in regular graphs one grounded node can only improve the chance of one other node, since regular graphs capture only one-to-one (binary) relations.

To further investigate the effect of modeling collaborative referring, we divide the 32 dialogues into three groups according to how often the agent-present-human-accept collaboration pattern happens (measured by the percentage of grounded nodes among all the director-generated nodes in a dialogue). Table 3.3 shows the statistics and the matching accuracies for each of the three groups.

                                           Group 1     Group 2      Group 3
Number of dialogues                        9           11           12
% of grounded nodes                        <30%        30%~60%      >60%
Average number of object properties a      20          21           12
Average number of relations b              11          13           8
Not modeling collaborative referring       49.7%       49.4%        45.3%
Modeling collaborative referring           57.0%       76.6%        63.6%
Improvement                                7.3%        27.2%        18.3%

a Specified by directors.
b Specified by directors. The number includes both binary and group-based relations.

Table 3.3 Matching accuracies of the three groups of dialogues (all the matching results here are produced using hypergraphs).

As shown in the top part of Table 3.3, the agent-present-human-accept pattern happened less often in the dialogues in group 1 (i.e., less than 30% of director-generated nodes are grounded nodes). In the dialogues in group 2, matchers more frequently provided proactive descriptions, which led to more grounded nodes.
Matchers were the most proactive in the dialogues in group 3; thus this group contains the highest percentage of grounded nodes. Note that, although the dialogues in group 3 contain more proactive contributions from matchers, directors tended to specify fewer properties and relations when describing intended objects (as shown in the middle part of Table 3.3).

The matching accuracies for each of the three groups are shown in the bottom part of Table 3.3. Since the agent-present-human-accept pattern appears less often in group 1, modeling collaborative referring only improves matching accuracy by 7.3%. The improvements for group 2 and group 3 are more significant compared to group 1. However, group 3's improvement is smaller than group 2's, although the dialogues in group 3 contain more proactive contributions from matchers. This indicates that in some cases, even with modeling collaborative referring, underspecified information from human speakers (directors in our case) may still be insufficient to identify the intended referents. Therefore, incorporating a broader range of dialogue strategies to elicit adequate information from humans is also important for successful human-agent communication.

3.4 Conclusion and Discussion

In situated dialogue, conversation partners make extra collaborative efforts to mediate a joint perceptual basis for referential grounding. It is important to model and interpret such collaborative dialogue in order to build situated conversational agents. As a first step, we have developed a graph-based representation to capture the linguistic discourse and the visual perception. Referential grounding is then formulated as a graph-matching problem, and a state-space search algorithm is applied to ground linguistic references onto perceived objects. In addition, hypergraph representations have been used to account for group-based descriptions, and one prevalent pattern of collaborative referring (i.e., agent-present-human-accept) has been incorporated into the search algorithm. Our empirical results have shown that, even when the perception of the environment by computer vision algorithms has a high error rate (95% of the objects are mis-recognized), our approach can still correctly ground those mis-recognized referents with 66% accuracy. As demonstrated by the results, our graph-matching approach provides a potential solution for reliable referential grounding through modeling and utilizing spatial relations, group-based descriptions and collaborative referring behaviors.

The results reported in this chapter are all based on manually annotated semantics and coreference of the linguistic discourse. When the discourse is automatically processed, its semantics and coreference often will not be obtained as correctly or completely as in the manual annotations. Therefore, an important next step is to explore how to efficiently match hypothesized discourse graphs (from automatic language processing) with vision graphs. This issue will be addressed in the later chapters.

Our current symbol grounding functions are manually defined and limited to the specific environment used in the data collection experiment. In a real-world scenario, grounding a linguistic term to a visual feature will be influenced by many contextual factors, such as the surroundings of the environment, the discourse history, and the speaker's individual preferences. Thus, it is important to explore context-based semantic grounding functions and the automatic acquisition of such functions (e.g.,
[68, 32, 34, 35]) in future work.

The discourse graph presented in this chapter represents all the mentioned entities and their relations that are currently available at any given dialogue state. Due to its flat structure, however, it is difficult to model dialogue dynamics at the illocutionary level (for example, as in [93, 94]). Using richer representations such as hierarchical (hyper)graphs [95, 96] would allow us to better represent and utilize the dialogue context for referential grounding and for response generation. Nevertheless, our graph-based approach provides a well-formed basis for modeling various types of referring expressions, capturing the collaborative dynamics, and utilizing the dialogue context to constrain referential grounding. All these aspects are important for enabling collaborative dialogue agents for situated interaction in the physical world.

Chapter 4 Learning to Mediate Perceptual Differences

In Chapter 3, we introduced our graph-based approach for referential grounding, which uses the ARG (Attributed Relational Graph) representation to capture the collaborative referring dialogue and solves referential grounding as a graph-matching problem. We have shown that the graph-based approach provides a principled way to model different types of referring expressions and to incorporate the "agent-present-human-accept" collaboration pattern to improve grounding accuracy. This chapter presents our work to further extend the graph-based approach to address two other important issues in situated referential grounding:1

• The graph-matching we have discussed so far has always been exact matching, i.e., every node in the discourse graph is forced to match with a node in the vision graph. However, due to the various errors that can occur in constructing both the discourse graph and the vision graph, it may not always be suitable to enforce exact matching. An inexact-matching criterion [78], i.e., obtaining proper matchings even if they are only partial, is more desirable to accommodate errors and allow the system to deal with problematic situations more flexibly.

• To enable a more adaptive dialogue system, we should not only focus on the interpretation of referring expressions, but also try to acquire useful information from the interaction with the human to mediate the perceptual differences. The weight-learning approach presented in this chapter demonstrates our effort towards this promising and challenging goal. Through online interaction with the human, this approach can learn a set of weights indicating how reliably/unreliably each dimension of the system's perception of the environment maps to the human's linguistic descriptors. The system can then adapt to the situation by applying the learned weights to ground follow-up dialogues and/or adjusting its language grounding models accordingly.

1 This chapter is based on the following publication: C. Liu and J. Chai, "Learning to mediate perceptual differences in situated human-robot dialogue", in Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015.

This chapter is organized as follows: we first discuss the motivation of our weight-learning approach to mediate the perceptual differences between agents and humans. We then present the extension to the graph-matching formulation and the optimization based weight-learning method. The empirical evaluation based on a human-robot dialogue dataset is presented next, followed by discussions on the strengths and limitations of the work.
Lastly, we conclude the current work and discuss some future directions.

4.1 Motivation

As we have discussed in Chapter 3, computational approaches to referential grounding often consist of two key components [?, 32, 26]. The first component addresses the formalisms and methods that connect linguistic terms (e.g., red, left) to the lower-level numerical features (e.g., RGB vectors) captured in the agent's representation of the perceived environment. Namely, these are what we called the semantic grounding functions in Section 3.2.2. The second component extracts all the linguistic terms from a referring expression and combines their grounding models together to identify referents. For example, given a referring expression "the big blue bottle to the left of the red box", it will recognize that the intended referent has several attributes (e.g., color is blue, size is big, type is bottle) and that it is to the left of another object. It will then combine all relevant sensor inputs and apply the corresponding semantic grounding models to identify the referent that most likely satisfies the referring expression.

Many previous works on referential grounding have focused on the first component, i.e., exploring how to learn and ground individual linguistic terms to low-level physical attributes (e.g., [68, 97, 98]). Although different algorithms have been applied for the second component [99, 100], little attention has been paid to the question of how to intelligently combine different attributes to ground references. However, this is an important question for situated dialogue, since the human and the agent have mismatched representations of the shared environment. For example, the agent may not recognize any bottle, or may see something blue but not a bottle. Furthermore, the agent's perception of blue may be very different from the human's perception of blue. How should the agent utilize these different attributes? What part of its own perception should the agent trust the most when there is a potential mismatch with the human's perception?

To address these questions, in this chapter we propose a computational approach that allows the agent to learn and mediate perceptual differences during situated dialogue. The idea is that, by interacting with its human partner (e.g., through dialogue), the agent should be able to assess its perceptual differences from its human partner. In particular, the agent should be able to learn which dimension(s) of its own perception (e.g., recognition of objects or colors) are more reliable, namely more aligned with the human's perception as reflected by the linguistic descriptions. To mediate the perceptual differences, the agent should use this self-assessment to update its internal models and further improve referential grounding in follow-up communication.

Specifically, we have developed an optimization-based approach to efficiently learn a set of weights, which indicate how reliably/unreliably each dimension of the agent's perception of the environment maps to the human's linguistic descriptions. By simulating different types of mismatched perception, our empirical results have shown that the weight-learning approach can successfully adjust the weights to reflect the agent's perceptual capabilities. In addition, the learned weights for specific linguistic terms (e.g., "red", "left") can further trigger automatic model updating for these words, which in turn leads to an improvement of referential grounding performance in future dialogues.
4.2 Weight Learning for Situated Referential Grounding

In this section, we first present a general formulation of inexact graph-matching with weighted attributes. The state-space search approach in Chapter 3 can be viewed as a special case of this general formalism. We then propose an optimization method which utilizes this formulation to learn, through interaction with the human partner, a set of weights that indicates how reliably/unreliably each dimension of the agent's perception maps to the human's linguistic descriptions.

4.2.1 Inexact Matching with Weighted Attributes

As in Section 3.2.2, we first construct two ARG representations based on the language and vision processing outcomes, namely the discourse graph G and the vision graph $G'$:

$G = (X, E)$
$X = \{x_m \mid m = 1, \ldots, M\}$    (4.1)
$E = \{e_i = (e_{i1}, e_{i2}) \mid i = 1, \ldots, I;\ e_{i1} \in X,\ e_{i2} \in X\}$

$G' = (X', E')$
$X' = \{x'_n \mid n = 1, \ldots, N\}$    (4.2)
$E' = \{e'_j = (e'_{j1}, e'_{j2}) \mid j = 1, \ldots, J;\ e'_{j1} \in X',\ e'_{j2} \in X'\}$

In an ARG, each node (e.g., $x_m$) is assigned a set of A attributes $U_m$ and each edge (e.g., $e_i$) is assigned a set of B attributes $V_i$ (A and B are two predetermined numbers), i.e.,

$U_m = \{u_{ma} \mid a = 1, \ldots, A\}$    (4.3)
$V_i = \{v_{ib} \mid b = 1, \ldots, B\}$

Note that we can also assign a special UNK value to an attribute, meaning that this attribute is currently "unknown" for this node/edge. For example, when an attribute is not mentioned or the word used to describe it is out-of-vocabulary, that attribute should be assigned the value UNK.

As defined earlier in Section 3.2.2, a matching Θ between G and $G'$ assigns each node $x_m$ in X a "label" $\theta_m$ to indicate which node in $X'$ $x_m$ maps to. The difference here is that we now allow Θ to be an inexact matching. Namely, we first define the set of node indices $\breve{M} = \{1, \ldots, M\}$ and two subsets of $\breve{M}$ (i.e., $\breve{M}_1$ and $\breve{M}_2$) such that $\breve{M}_1 \subseteq \breve{M}$ and $\breve{M}_2 = \breve{M} - \breve{M}_1$. An inexact matching Θ is then formally defined as:

$\Theta = \Theta_1 \cup \Theta_2 = \{\theta_{m_1} = x'_n \mid m_1 \in \breve{M}_1;\ x'_n \in X'\} \cup \{\theta_{m_2} = \Lambda \mid m_2 \in \breve{M}_2\}$    (4.4)

where $\Theta_1$ specifies the label/mapping of each of the nodes in X for which a proper mapping can be found in $X'$, and $\Theta_2$ contains the nodes in X for which no proper mapping can be found (i.e., Λ denotes a special "NULL" node, so $\theta_m = \Lambda$ means node $x_m$ is not matched with any node in $X'$).

Let $X_{\Theta_1}$ denote the subset of nodes that are labeled in $\Theta_1$ (i.e., $X_{\Theta_1} = \{x_m \mid m \in \breve{M}_1\}$), and let $E_{\Theta_1}$ denote the set of edges induced by $X_{\Theta_1}$ (i.e., $E_{\Theta_1} = \{e_i = (e_{i1}, e_{i2}) \mid e_i \in E,\ e_{i1} \in X_{\Theta_1},\ e_{i2} \in X_{\Theta_1}\}$). The compatibility function of an inexact matching Θ can then be defined as2

$f(\Theta_1) = \sum_{x_m \in X_{\Theta_1}} g(x_m, x'_n) + \sum_{e_i \in E_{\Theta_1}} h(e_i, e'_j)$
$f(\Theta_2) = \eta_1 (M - |X_{\Theta_1}|) + \eta_2 (M(M-1) - |E_{\Theta_1}|)$    (4.5)
$\qquad\ \ = \eta_1 (M - |X_{\Theta_1}|) + \eta_2 (M(M-1) - |X_{\Theta_1}|(|X_{\Theta_1}| - 1))$

with $f(\Theta) = f(\Theta_1) + f(\Theta_2)$, where $x'_n$ and $e'_j$ are the corresponding node and edge of $x_m$ and $e_i$ according to $\Theta_1$, respectively. $\eta_1$ and $\eta_2$ are the parameters that determine whether a node should be mapped to the "NULL" node (i.e., treated as unmatched) or not.

2 Note that both G and $G'$ are now treated as complete graphs, in which each ordered pair of nodes is connected by an edge; thus G contains M(M−1) edges. In the discourse graph G, if the relation between two nodes is not mentioned, we simply set the attributes of the corresponding edge to UNK (i.e., "unknown").
Similar to exact matching, the optimal matching between G and $G'$ is the one with the highest compatibility score:

$\Theta^* = \arg\max_{\Theta} f(\Theta)$    (4.6)

and the compatibility functions for a matched pair of nodes and a matched pair of edges are based on the attributes assigned to them:

$g(x, x') = \frac{1}{A} \sum_{a=1}^{A} g_a(u_a, u'_a)$
$h(e, e') = \frac{1}{B} \sum_{b=1}^{B} h_b(v_b, v'_b)$    (4.7)

As we have already discussed in Section 3.2.2, $g(x, x')$ and $h(e, e')$ are further decomposed into a set of stand-alone semantic grounding functions. We now also assign each of these grounding functions a weight (e.g., $\alpha_{a_k}$), representing the importance or reliability of that grounding function for finding the correct matching. For example, suppose we have a set of grounding functions for colors such as "red", "green" and "yellow". If the agent's perception and model of the color red are more accurate than those of other colors, then the grounding function for "red" should be assigned a higher weight. Also note that when the value of $u_a$ is unknown (i.e., not mentioned or out-of-vocabulary), $g_a(u_a, u'_a)$ simply returns the predetermined value $\eta_1$. Thus the decompositions of $g(x, x')$ and $h(e, e')$ now become:

$g_a(u_a, u'_a) = \begin{cases} \alpha_{a_1} g_{a_1}(u'_a) & \text{if } u_a = y_1; \\ \alpha_{a_2} g_{a_2}(u'_a) & \text{if } u_a = y_2; \\ \quad\vdots \\ \alpha_{a_K} g_{a_K}(u'_a) & \text{if } u_a = y_K; \\ \eta_1 & \text{otherwise} \end{cases}$    (4.8)

$h_b(v_b, v'_b) = \begin{cases} \beta_{b_1} h_{b_1}(v'_b) & \text{if } v_b = z_1; \\ \beta_{b_2} h_{b_2}(v'_b) & \text{if } v_b = z_2; \\ \quad\vdots \\ \beta_{b_L} h_{b_L}(v'_b) & \text{if } v_b = z_L; \\ \eta_2 & \text{otherwise} \end{cases}$    (4.9)

Given this general form of weighted inexact matching, we can recover the special case used in Section 3.2.2 by (1) setting $\eta_1 = \eta_2 = 0$, so that the matching becomes exact, and (2) setting all the weights to 1, so that each attribute value is treated equally when the compatibility score of a matching is calculated.

We close this section with a discussion of how inexact matching is controlled by the two parameters $\eta_1$ and $\eta_2$. Basically, inexact matching means that not every node in G has to be matched with a node in $G'$; some nodes can be left unmatched (i.e., placed in $\Theta_2$) if no proper matchings are found for them. Suppose there is only one node in G, and it has only one attribute mentioned (e.g., color is red). For this node to be matched with a node in $G'$, there must be at least one node in $G'$ such that $\alpha_{red}\, g_{red}(u') > \eta_1$; otherwise this node will be matched with Λ (i.e., the NULL node). Suppose we further add to G an edge with one mentioned attribute (e.g., something is to the left of the red object); then in $G'$ there must be at least one node and one relation such that $\alpha_{red}\, g_{red}(u') + \beta_{leftof}\, h_{leftof}(v') > \eta_1 + \eta_2$ in order to find a non-NULL matching. In general, for a node $x_m$ to be matched with a non-NULL node $x'_n$, we should have $g(x_m, x'_n) + \sum_{e_i \in E_{x_m}} h(e_i, e'_j) > \eta_1 + |E_{x_m}|\, \eta_2$, where $E_{x_m}$ is the set of edges associated with $x_m$.

4.2.2 Optimization for Weight Learning

The graph-matching algorithm relies on all these attributes to find proper matchings and filter out improper ones. Importantly, because the agent can have different capabilities for perceiving these different attributes, and because referring expressions associated with different attributes can have different levels of ambiguity, different attributes should not always be treated equally. We therefore develop an optimization based approach to automatically acquire a set of weights based on the matching hypotheses and the ground-truth matching. The weights (i.e., the α's and β's) are the "variables" that we aim to adjust, and our general objective is to maximize referential grounding performance.
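Before setting up the learning problem, the following toy sketch illustrates how the weighted, thresholded scoring of Equations 4.8 and 4.9 behaves for a single node (the 1/A normalization is omitted for brevity). The grounding functions, weights, and attribute values here are made-up stand-ins, not the models used in this work.

```python
def weighted_node_score(u_disc, u_vis, grounders, alpha, eta1):
    """Sum alpha[(attr, word)] * g_word(u_vis[attr]) over the node's
    attributes, falling back to eta1 for UNK/out-of-vocabulary values."""
    score = 0.0
    for attr, word in u_disc.items():
        g = grounders.get((attr, word))
        if word == "UNK" or g is None:
            score += eta1
        else:
            score += alpha.get((attr, word), 1.0) * g(u_vis[attr])
    return score

# Toy grounding functions over vision features (illustrative only)
grounders = {
    ("color", "red"):  lambda rgb: 1.0 if rgb[0] > 180 and rgb[1] < 80 else 0.1,
    ("type", "apple"): lambda label: 1.0 if label == "apple" else 0.2,
}
alpha = {("color", "red"): 0.9, ("type", "apple"): 0.3}   # learned reliabilities

u_disc = {"color": "red", "type": "apple"}                # "the red apple"
u_vis  = {"color": (210, 12, 90), "type": "pear"}         # mis-recognized object
print(weighted_node_score(u_disc, u_vis, grounders, alpha, eta1=0.25))
```

With a low weight on "apple" and a high weight on "red", the mis-recognized type barely hurts the score; discovering this kind of adjustment automatically is exactly what the learning procedure below is meant to do.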
We represent all the weights using a vector w:

$\mathbf{w} = [\,\alpha_{1_1}, \alpha_{1_2}, \ldots, \alpha_{2_1}, \alpha_{2_2}, \ldots, \alpha_{A_1}, \alpha_{A_2}, \ldots, \beta_{1_1}, \beta_{1_2}, \ldots, \beta_{2_1}, \beta_{2_2}, \ldots, \beta_{B_1}, \beta_{B_2}, \ldots, \eta_1, \eta_2\,]^T$    (4.10)

For a given matching Θ, its compatibility score f(Θ) then becomes a linear function of w:

$f(\Theta) = f_\Theta(\mathbf{w}) = \mathbf{c}_\Theta \mathbf{w}$    (4.11)

where $\mathbf{c}_\Theta$ is a vector of "coefficients" computed from the given Θ and the grounding functions, based on our previous definitions.

Given two graphs G and $G'$, suppose $\hat{\Theta}$ is the ground-truth matching, and $\Theta_1, \Theta_2, \ldots, \Theta_D$ are the top-D matching hypotheses (i.e., $\Theta_1$ is the top-1 hypothesis and so forth) generated using an initial weight vector $\mathbf{w}_0$. If $\Theta_1 \neq \hat{\Theta}$, we can try to find a new w that may lead to a better matching outcome. This can be formulated as the following optimization problem:

$\max_{\mathbf{w}, \varepsilon}\ \ \mathbf{c}_{\hat{\Theta}} \mathbf{w} - \gamma \sum_{d=1}^{D} \varepsilon_d$
$\text{s.t.}\ \ \mathbf{c}_{\Theta_1} \mathbf{w} - \mathbf{c}_{\hat{\Theta}} \mathbf{w} \le \varepsilon_1,\ \ \varepsilon_1 \ge 0$    (4.12)
$\qquad\ \ \vdots$
$\qquad \mathbf{c}_{\Theta_D} \mathbf{w} - \mathbf{c}_{\hat{\Theta}} \mathbf{w} \le \varepsilon_D,\ \ \varepsilon_D \ge 0$

where $\{\varepsilon_d\}$ is a set of "slack" variables to accommodate infeasible constraints, and γ is a penalty factor that pushes the values of all $\varepsilon_d$ to be small (i.e., the closer to 0 the better).

The essence of this optimization scheme is to find a proper w under which the ground-truth matching is ranked as high as possible among the matching hypotheses generated by the state-space search algorithm. If such optimal values of w can be found, we expect them to help the search algorithm select better matching hypotheses the next time a similar instance is encountered. For example, if the referring expression is "the red apple" but in the vision graph the recognition of apples is not reliable (e.g., a pear is always recognized as an apple), then it is likely that "red apple" will be wrongly matched with a pear. Through our learning scheme, it can be learned that the weight for "apple" needs to be low and the weight for "red" needs to be high for the whole expression to be correctly matched. Thus, the learning outcome (i.e., the learned weights) can tell us which attribute values are more or less reliable for finding the correct matching. In fact, the optimization problem we have here is exactly a linear program, which can be solved efficiently using algorithms such as interior point methods [101].
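Equation 4.12 is a standard linear program. As a rough illustration, it could be set up and solved with an off-the-shelf solver as sketched below; the coefficient vectors, the added bounds on w (used here only to keep the toy problem bounded), and the example numbers are assumptions for illustration, not the configuration used in this work.

```python
import numpy as np
from scipy.optimize import linprog

def learn_weights(c_truth, c_hyps, gamma=1.0, w_bounds=(0.0, 1.0)):
    """Sketch of the weight-learning linear program in Equation 4.12.

    c_truth : coefficient vector of the ground-truth matching
    c_hyps  : coefficient vectors of the top-D matching hypotheses
    The LP variables are [w_1, ..., w_P, eps_1, ..., eps_D].
    """
    c_truth = np.asarray(c_truth, dtype=float)
    P, D = c_truth.size, len(c_hyps)

    # Objective: minimize  -c_truth . w + gamma * sum(eps)
    obj = np.concatenate([-c_truth, gamma * np.ones(D)])

    # Constraints: (c_d - c_truth) . w - eps_d <= 0 for each hypothesis d
    A_ub = np.zeros((D, P + D))
    for d, c_d in enumerate(c_hyps):
        A_ub[d, :P] = np.asarray(c_d, dtype=float) - c_truth
        A_ub[d, P + d] = -1.0
    b_ub = np.zeros(D)

    bounds = [w_bounds] * P + [(0.0, None)] * D       # eps_d >= 0
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:P]                                   # learned weight vector

# Toy usage: a 3-dimensional weight vector and two competing hypotheses
print(learn_weights(c_truth=[0.9, 0.2, 0.5],
                    c_hyps=[[0.3, 0.8, 0.5], [0.4, 0.6, 0.2]]))
```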
The task was for the human to communicate these secret names to the robot through dialogue so that the robot would know which object had which name. Because of the nature of this naming game, these experiments led to dialogues focusing on referential communication, as shown in the example in Figure 4.1(b). One controlled condition in the experiments was to simulate different perceptual capabilities of the robot, which resulted in two levels of variation:

• High-Mismatch simulated the situation where the human and the robot had a high mismatch in their perceptions of the shared environment. The robot's object-recognition error rate was manipulated to be very high; namely, a large portion (60% or 90%) of the objects were mis-recognized. (These mis-recognized objects were randomly selected, and their object-recognition results were randomly assigned.)

• Low-Mismatch simulated the situation where the human and the robot had a low mismatch in their perceptions of the shared environment. The robot correctly recognized most of the objects, with only a small portion (10% or 30%) being mis-recognized.

Although the experiment was originally designed for a different purpose, it actually provides an ideal dataset for evaluating our weight-learning approach. Since we currently do not address dialogue management, evaluating our algorithm requires only the language inputs from the dialogue discourse and the corresponding visual environment. Furthermore, the systematic simulation of mismatched perceptions allows us to evaluate whether the weight-learning outcome is consistent with our expectations given the simulated situation. For example, we would expect the learned weight for the attribute type (i.e., the object-recognition result) to be smaller under the high-mismatch condition than under the low-mismatch condition. Therefore, we apply the weight-learning approach to the two subsets of data to see whether the expected weights can be learned.

Before we present the results, we first briefly summarize the data. There were a total of 147 dialogues collected from 24 human subjects. Among these dialogues, 73 were collected under the low-mismatch condition and 74 under the high-mismatch condition. For each dialogue, the robot's perception of the environment, such as the object-recognition results, the color of each object (represented as an RGB vector), and the position of each object (represented as x and y coordinates), was also logged. This provides a dataset with both the discourse of the humans' spontaneous referring expressions and the robot's perception of the shared environment, which is well suited for evaluating our approach.

                         Object-specific attribute            Spatial relation
                         Type      Color     Location
  Number of expressions  686       747       281              162

Table 4.1 Attributes and relations used in the referring expressions in the human-robot data

Table 4.1 summarizes the most frequently used attributes and relations in the human subjects' referring expressions (other, less frequently used descriptors, such as size, appeared fewer than 20 times and are excluded here). As shown in the table, the type (e.g., "the bottle") and color (e.g., "the red object") attributes are the most commonly used to describe the referents. This conforms to psycholinguistic findings on people's referring preferences [41]. Human subjects in our experiments almost always intend to mention these two attributes when they refer
to an object; thus the robot's ability to correctly perceive the type and color attributes can be crucial for successful referential grounding. Furthermore, the spatial location of an object (e.g., "the object in the front") is another commonly used descriptor, which can be utilized by robots equipped with advanced spatial sensors. (What we call "spatial location" here is actually one type of spatial relation within the egocentric frame-of-reference [59], i.e., using the viewer's own field-of-view as the reference frame. We treat it as an object-specific property rather than a relation because it does not involve another object.) Besides these object-specific attributes, referring based on spatial relations between two objects is also observed. When one object has already been grounded, it can serve as a reference point for referring to another object based on their relation. We mainly observed two kinds of relations: projective relations (e.g., "the bottle is to the right of the box") and relations based on an object's proximity (e.g., "it is close to the box"). Although spatial relations can be very useful in certain situations, in general they are used less often than object-based attributes. In the following section, we evaluate our weight-learning approach based on the three most frequently used object-specific attributes, i.e., type, color, and location.

4.3.2 Weight-learning Results on Object-specific Attributes

To evaluate our weight-learning method, we apply it to the dataset of 147 (i.e., low-mismatch and high-mismatch) dialogues to see what weights can be learned and how referential grounding can potentially be improved. Each dialogue in the data is represented by a discourse graph based on automatic language processing, and each discourse graph is paired with a vision graph that represents the perceived environment in which the dialogue happened (the vision graphs are generated from the robotic vision information logged during the experiment). A "training" instance from the data is thus a pair of a discourse graph and a vision graph. We then apply our weight-learning method, as described in Section 4.2, to each training instance to learn a set of weights for the type, color, and location attributes.

We evaluate the weight-learning at two levels, i.e., the attribute level and the word level (discussed later in Section 4.3.3). When learning weights at the attribute level, we simplify the weight learning by letting \alpha_{a1} = \alpha_{a2} = \cdots = \alpha_{aK} and \beta_{b1} = \beta_{b2} = \cdots = \beta_{bL}, so that we only learn one weight for each attribute. Note that, in the current evaluation, we let \eta_1 = \eta_2 = 0, so only exact-matching is performed here.

Although not exactly the same, our weight-learning procedure follows the online learning paradigm [103] of a real interaction scenario. Given an instance (i.e., a dialogue), the "learner" first generates a predicted outcome (i.e., grounding results); it then receives the true outcome for this instance, based on which its internal model (i.e., the weights) is updated to make better future predictions. Therefore, we conduct the learning and evaluation in a simulated online fashion to investigate how effectively and efficiently the approach can adapt to the situation of mismatched perceptions. In each simulation run, the weight-learner receives a sequence of 20 training dialogues that are randomly picked from the dataset.
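For concreteness, one simulation run of this online procedure could be sketched as follows; the next paragraph walks through the same steps in prose. This is only an illustrative sketch, not the system's actual implementation: the helper functions (search_hypotheses, coefficients, ground_truth_matching) are hypothetical, the LP of Eq. (4.12) is assumed to be solved with scipy.optimize.linprog, and the weights are assumed to be kept in [0, 1] to keep the LP bounded.

```python
# Sketch of one simulated online weight-learning run.  search_hypotheses,
# ground_truth_matching, and coefficients are hypothetical helpers; the last
# one is assumed to return the coefficient vector c_Theta of Eq. (4.11).
import numpy as np
from scipy.optimize import linprog

def run_weight_learning(training_dialogues, w0, lp_gamma=1.0, step=0.5, top_d=10):
    w = np.asarray(w0, dtype=float)
    for dialogue in training_dialogues:
        hyps = search_hypotheses(dialogue, w, top_d)       # top-D matchings, state-space search
        truth = ground_truth_matching(dialogue)            # assumed given after the dialogue
        if hyps and hyps[0] == truth:
            continue                                       # top-1 already correct: no update
        c_truth = coefficients(dialogue, truth)            # c_{Theta_hat}
        c_hyps = [coefficients(dialogue, h) for h in hyps]
        n_w, n_d = len(w), len(c_hyps)
        # maximize c_truth.w - gamma * sum(eps)  <=>  minimize -c_truth.w + gamma * sum(eps)
        c = np.concatenate([-c_truth, lp_gamma * np.ones(n_d)])
        A_ub = np.zeros((n_d, n_w + n_d))                  # (c_d - c_truth).w - eps_d <= 0
        for d, c_d in enumerate(c_hyps):
            A_ub[d, :n_w] = c_d - c_truth
            A_ub[d, n_w + d] = -1.0
        b_ub = np.zeros(n_d)
        bounds = [(0.0, 1.0)] * n_w + [(0.0, None)] * n_d  # weights in [0, 1] (assumption), eps >= 0
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        w = w + step * (res.x[:n_w] - w)                   # w_t = w_{t-1} + step * (w_new - w_{t-1})
    return w
```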
Upon receiving each training instance, the learner first generates a list of grounding hypotheses using the same state-space search algorithm as discussed in Section 3.2.2, based on the current weights. After the grounding hypotheses have been generated, the ground truth is considered given, and a new set of weights is learned using our optimization method. (When learning the weights of attributes, we assume that the ground-truth mapping between the linguistic entities, i.e., discourse graph nodes, and the physical objects, i.e., vision graph nodes, is already known. In the original experiment, due to various kinds of errors and the time constraint, some entities could be incorrectly grounded or not grounded at all by the end of a dialogue; here we simply assume that the ground-truth mappings can always be correctly established, e.g., through continued interaction.) With the learned new weights, we then update the current weights as

    w_t = w_{t-1} + \gamma (w_{new} - w_{t-1})

where \gamma is a step-size parameter set to 0.5, and move on to the next instance in the training sequence. In addition, to evaluate the effect of weight-learning on referential grounding, after each training dialogue we also apply the currently learned weights to ground all the other 53/54 dialogues that are not selected as training data. This procedure repeats until the entire training sequence has been gone through.

We started with uniform weights (i.e., all being 1) and repeated the weight-learning process throughout the sequence of the 20 selected training dialogues. Table 4.2 summarizes the final learned weights on the low-mismatch and high-mismatch data after going through the 20 training dialogues (we ran the simulation 100 times, and the weights shown in Table 4.2 are averages over the 100 runs).

                  Type    Color   Location
  Low-mismatch    0.87    0.97    0.97
  High-mismatch   0.45    0.90    0.71

Table 4.2 The final weights learned after 20 training dialogues (averaged over 100 runs of simulation)

As we can see in Table 4.2, the most significant change from the low-mismatch condition to the high-mismatch condition is the drop of the learned weight for the type attribute (0.87 vs. 0.45). This is consistent with the situation (i.e., low-mismatch vs. high-mismatch) from which the data was collected. To further demonstrate the weight-learning efficiency, we plot the updated weight of the type attribute after each training dialogue, as shown in Figure 4.2(a). It shows that when the robot's object recognition is significantly mismatched with the human's perception (i.e., the high-mismatch condition), the weight for the type attribute quickly descends in the first 5 training dialogues; after that it starts to become stable and gradually converges to its final value. The results thus demonstrate that our weight-learning approach can efficiently learn informative weights that indicate the unreliable attributes of the robot's perception.

Besides the type attribute, the learned weights of the other attributes also indicate how reliable they are for referential grounding. (The weights learned from the high-mismatch data can be more informative, because our weight-learning method only updates the weights when referential grounding results are not aligned with the ground truth, which happens more often under the high-mismatch condition.) The color attribute appears to be a reliable information source here, i.e., the color perception and grounding models are compatible with the human descriptions. The spatial location attribute is less reliable than the color attribute, although the robot's perception of spatial information ought to be accurate.
This is possibly due to the vagueness of spatial expressions themselves: a spatial expression such as "object in the front" can often match several objects that all conform to the description, and is thus difficult to resolve based on the spatial information alone.

These learned weights not only indicate the robot's perceptual capabilities, but can also improve referential grounding accuracy when applied to subsequent dialogues. To demonstrate this, we use all the remaining dialogues (i.e., those not selected as training dialogues) as the testing set; there are 53 and 54 testing dialogues for the low-mismatch and high-mismatch conditions, respectively. After each training dialogue, we applied the currently learned weights to generate referential grounding results on all the testing dialogues. The results (averaged referential grounding accuracies on the testing dialogues) are shown in Figure 4.2(b).

Figure 4.2 Weight-learning and referential grounding results after each training dialogue (averaged over 100 runs): (a) learned weights for the "type" attribute; (b) referential grounding accuracies on the testing set obtained by applying the learned weights. Both panels plot results over the sequence of 20 training dialogues, for the high-mismatch and low-mismatch conditions.

Under the low-mismatch situation, applying the learned weights does not significantly change the grounding accuracy. This is because the learned weights stay close to the initial value (i.e., 1.0), as all the attributes were reasonably reliable. Under the high-mismatch situation, using the learned weights improves grounding accuracy by 9.4% (from 44.4% to 53.8%) within the first 5 training dialogues. After that, the grounding accuracy stays stable, since the learned weights also become stable, as shown in Figure 4.2(a). Therefore, the robot can reduce referential grounding errors by assigning lower weights to unreliable attributes, once they have been identified by the weight-learning approach.

4.3.3 Weight-learning Results at Word Level

Besides learning weights at the attribute level as just discussed, it would be even more useful if the weights could be learned at a lower level, i.e., to learn a weight for each of the words that are used to describe an attribute. The learned weight then indicates the reliability of the robot's perception and/or grounding model for that specific word. For example, if the robot can learn that its perception of "red" is unreliable, it can then adjust the grounding model for this specific word accordingly. This would be more useful than only knowing the overall reliability of the color attribute. To enable learning weights at the word level, instead of assigning only one weight per attribute (i.e., all the words that describe one attribute sharing the same weight), we assign each word's grounding function its own weight (as described in Section 4.2).
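To make the difference in granularity concrete, the two weighting schemes amount to different shapes of the weight table; the values below are merely illustrative placeholders (initial uniform weights), not learned results.

```python
# Attribute-level weights: one weight per attribute, shared by all words
# that describe that attribute.  Values are illustrative placeholders.
attribute_weights = {"type": 1.0, "color": 1.0, "location": 1.0}

# Word-level weights: one weight per grounding function, i.e., per word.
word_weights = {
    "color":    {"red": 1.0, "orange": 1.0, "pink": 1.0, "blue": 1.0},
    "location": {"left": 1.0, "right": 1.0, "front": 1.0, "back": 1.0},
}
```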
The same weight-learning approach can then be applied to learn how well the robot's perception is aligned with the human's description for each specific word. To evaluate word-level weight-learning, we again use systematic simulation of perceptual errors, which allows us to easily assess whether the expected weights can be learned given the simulated situation. Specifically, we use the low-mismatch data (since the low-mismatch data contain few original errors, it is easier to see the effect of the simulated errors) and modify the robot's perception to simulate some common errors that can happen in a real situation. The modifications we make are the following (a short code sketch of these two modifications is given below):

• For each object's perceived color (i.e., an RGB vector), we increase the intensity of the r and g channels (by 100; the intensity range of each channel is 0 to 255) and decrease the intensity of the b channel (by 100). This simulates a color-sensing error, e.g., due to environmental lighting noise or a deficient color sensor.

• For each object's perceived position (i.e., its x and y coordinates), we decrease the x coordinate (by 300 pixels; the image produced by the robot's camera is 1280 x 960 pixels). This simulates a different viewing direction that may affect the robot's interpretation of spatial language such as "on the left", as illustrated in Figure 4.3.

Figure 4.3 An illustration of the different viewing directions. Direction 1 is the robot's original viewing direction in the experiments, which is the same as the human subject's. Direction 2 is the misaligned viewing direction simulated by decreasing the x coordinates from direction 1. Because of the narrow viewing range of the robot's camera, all the objects become closer to the left edge of its field-of-view (i.e., their x coordinates decrease) under direction 2.

With these simulated perceptual errors, we then use the same online learning scenario to evaluate the effectiveness and efficiency of weight-learning at the word level. Namely, we randomly select 20 dialogues as the training sequence and go through these dialogues one by one to learn and update the weights. Table 4.3 summarizes the final learned weights of some common words of the color and location attributes after the 20 randomly selected training dialogues. The table also shows the learned weights from the situation where no errors were simulated, for comparison with the weights learned under the simulated errors.

  Color                       "orange"  "pink"  "green"  "blue"
    Without simulated errors    0.92      0.95    0.90     0.95
    With simulated errors       0.47      0.59    0.88     0.97
  Location                    "left"    "right"  "front"  "back"
    Without simulated errors    0.95      0.81    0.97     0.95
    With simulated errors       0.19      0.73    0.77     0.73

Table 4.3 Final learned weights for some common words of the two attributes after 20 training dialogues (averaged over 100 runs of simulation)

As we can see from Table 4.3, there are clear correspondences between the learned weights and the simulated errors in the robot's perception:

• For the color attribute, the learned weights indicate that the robot's grounding of "orange" and "pink" is affected by the simulated error of inflated r and g channel intensities and deflated b channel intensity.

• For the location attribute, the very low weight learned for "left" indicates the robot's problematic interpretation of this concept, which corresponds to the simulated error of shifting all the perceived objects to the left side.
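To make the two simulated modifications concrete, here is a minimal sketch. The field layout (an (r, g, b) tuple and an (x, y) position) is an assumption, and the clipping to valid ranges is added here for safety rather than taken from the dissertation.

```python
# Sketch of the simulated perceptual errors: RGB channels range over 0-255,
# and the camera image is 1280 x 960 pixels (ranges from the text).
def inject_color_error(rgb):
    r, g, b = rgb
    # inflate r and g by 100, deflate b by 100 (clipped to the valid range)
    return (min(r + 100, 255), min(g + 100, 255), max(b - 100, 0))

def inject_viewing_shift(position):
    x, y = position
    # shift every perceived object 300 pixels toward the left image edge
    return (max(x - 300, 0), y)
```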
These correspondences between the learned weights and the underlying perceptual errors again demonstrate that our weight-learning approach is capable of learning informative weights, which indicate how reliably the robot's perception maps onto the human's linguistic descriptions. The efficiency of weight-learning at the word level is similar to that of the attribute-level weight-learning. Figure 4.4 shows the plots of the updated weights for the words "orange", "pink", and "left" after each training dialogue during the online weight-learning process.

Figure 4.4 Word-level weight-learning evaluation results after each training dialogue (averaged over 100 runs of simulation); the plot shows the learned weights for "orange", "pink", and "left" over the sequence of 20 training dialogues.

It took only 5 training dialogues for the weight of "orange" to reach its final value. The weights of "pink" and "left" took a few more training dialogues to converge because these words did not appear as often as "orange" in the data.

In a realistic setting, a more dynamic interaction and learning process would be expected. For example, the robot could adjust and improve its perceptual capabilities dynamically, based on its interaction with the human. We therefore further simulate one such dynamic process to see how our weight-learning responds to it. We still use the same data (i.e., the low-mismatch data with simulated perceptual errors) and the same online weight-learning process, but add a model-updating process after the first 5 training dialogues. This simulates the scenario in which the robot automatically starts to adjust its language grounding models for the unreliable words with low learned weights.

Initially, the grounding models for color and location terms were all defined as Gaussian distributions over the corresponding visual features [68]. To update the grounding models for the two color words (i.e., "orange" and "pink"), we follow the online word-model acquisition approach described in [104], which essentially keeps updating the mean of the Gaussian distribution by averaging the old mean with the newly observed values. In addition to updating model parameters, the underlying models can be adjusted as well. For instance, for the word "left", the robot can switch from the Gaussian model to the exponential decay function model as in [53].

Figure 4.5 Word-level weight-learning results with model updating (averaged over 100 runs of simulation): (a) learned weights for "orange", "pink", and "left" with model updating; (b) referential grounding accuracies on the testing set with model updating. Both panels plot results over the sequence of 20 training dialogues.

Figure 4.5(a) shows the plots of the learned weights for these three words with the model-updating process. After adaptation, the weights for "orange", "pink", and "left" all rise to high values, indicating that their grounding models became more consistent with the human's descriptions. We also evaluate referential grounding accuracy on the testing dialogues as we did earlier, but now using both the updated models and the updated weights after each training dialogue.
Figure 4.5(b) further shows that the model-updating process is able to improve the performance of the language grounding models that were previously affected by the simulated perceptual errors. Referential grounding accuracy is improved by 10% (from 48.7% to 58.7%) with the updated models and weights, compared to the initial state of using the original models and uniform weights. These results again demonstrate that our weight-learning approach can efficiently update the weights so that they consistently reflect the underlying changes in the robot's perceptual and language grounding capabilities.

As the simulated model-updating process also shows, the robot's perceptual errors can be accommodated by adjusting the language grounding models or by switching to a more robust model. For color, because of the simulated perceptual errors (i.e., inflated r and g channel intensities and deflated b channel intensity), the original grounding models became inaccurate and misleading. For example, a blue object was likely to be perceived as pink, which would lead to incorrect matching results. Such errors can be accommodated by re-calibrating the parameters of the color grounding models, so that the models are better aligned with the current environment or the robot's sensory condition, as demonstrated by our simulated model-updating process (see also [104]). Our weight-learning approach can provide useful guidance for this model-adjusting effort by indicating which words' models are unreliable and thus need to be adjusted.

For the grounding model of spatial descriptors like "left", we have two models (i.e., a Gaussian model and an exponential decay function model), and we switch from one to the other in the model-updating scenario. As indicated by the weight-learning outcome, the exponential model is more reliable than the Gaussian model.

Figure 4.6 Two grounding models for "left". A grounding model here is a scoring function w.r.t. the x-coordinate of an object's center of mass; it measures how well an object's location conforms to the descriptor "left". Model 1 is the Gaussian model; Model 2 is the exponential decay function model.

Figure 4.6 shows the two models for "left" (Model 1 is the Gaussian model and Model 2 is the exponential model; both are defined as functions of the x-coordinate of an object's center of mass). The Gaussian model assigns higher scores to the central area of the left region of the image, whereas the exponential model assigns higher scores to the area closer to the left edge. The exponential model is in fact more robust against the "misaligned view" error (e.g., all the objects being shifted towards the left side, as in our simulation). For example, assume that the original coordinates for object o1 and object o2 are x1 = 350 and x2 = 700, and that these change to x1 = 50 and x2 = 400 due to the perceptual error in our simulation. Although both models now incorrectly assign non-zero scores to o2 for being "on the left", Model 2 still assigns a higher score to o1 and thus prefers o1 as the better match over o2. As demonstrated by this case of switching spatial models, when the robot has different models to choose from and needs to determine which one is better under the current situation, weight-learning again provides important guidance.
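The contrast between the two "left" models can be illustrated with a small sketch. The functional forms follow the qualitative description of Figure 4.6, but the parameters are hypothetical (the dissertation does not give them), so this is an illustration rather than the actual models.

```python
# Two illustrative scoring functions for "left", following Figure 4.6:
# Model 1 peaks near the centre of the left region; Model 2 is highest at the
# left edge and decays to the right.  Parameters are made up for illustration.
import math

def left_gaussian(x, mean=320.0, sigma=200.0):     # Model 1 (image width 1280 px)
    return math.exp(-((x - mean) ** 2) / (2 * sigma ** 2))

def left_exp_decay(x, decay=400.0):                # Model 2
    return math.exp(-x / decay)

# Worked example from the text: o1 at x=350 and o2 at x=700 are shifted to
# x=50 and x=400 by the simulated viewing error.  With these parameters the
# Gaussian model prefers o2 after the shift, while the decay model still
# ranks o1 higher.
for x1, x2 in [(350, 700), (50, 400)]:
    print("Gaussian prefers o1:", left_gaussian(x1) > left_gaussian(x2),
          "| Decay prefers o1:", left_exp_decay(x1) > left_exp_decay(x2))
```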
One limitation of the current weight-learning approach is that it only learns from false-positive errors, not false-negative errors. For instance, if the robot's object recognition constantly mis-recognizes bottles as "boxes", a low weight can be learned for the word "box" but not for "bottle". That is, when the grounding model of "box" often produces false-positive errors, the approach learns a low weight to indicate the problem, but it cannot respond to the false-negative errors (i.e., failing to recognize a bottle). Similarly, low weights were learned for "pink" and "orange" but not for "blue" when blue was likely to be mis-recognized as pink or orange, and a low weight was learned for "left" but not for "right" when objects on the right were likely to be mis-detected as being on the left. How to develop a systematic approach that can learn from both false-positive and false-negative errors is an interesting topic for future work.

4.4 Conclusion and Discussion

Towards enabling robust and adaptive human-robot dialogue, we have developed an optimization-based weight-learning method that can mediate the perceptual differences between a robot and its human partner for referential grounding. Since issues related to audio perception (i.e., speech recognition) have been addressed before (e.g., [21]), we focus here on the robot's visual perception of the physical environment.

As demonstrated by our empirical evaluations, our weight-adjusting mechanism is capable of learning informative weight values that reflect the alignment or misalignment between the robot's visual perception and the human's description of the shared physical environment. Our method can efficiently and reliably adapt to new situations through just a small number of interactions, as soon as the change occurs. The learned weights can be applied to referential grounding and/or word grounding model learning algorithms to improve referential grounding performance. They can also be utilized by referring expression generation algorithms (e.g., [105, 106]) to facilitate referential communication between robots and humans.

While many previous works have mainly focused on learning semantic grounding models, the work presented in this chapter addresses language grounding from a different angle. The focus here is on assessing and adapting existing semantic grounding models for a given situation. The reliability of semantic grounding models can be affected by many situational factors, such as noisy environments, faulty sensors, and individual differences among human speakers. Thus, even previously well-performing models can become unreliable in a different environment. We address this challenge by proposing a weight-learning mechanism that quickly learns a set of weights to indicate the reliability of the semantic grounding models under the current situation.

For example, suppose a color-blind person speaks to the robot and swaps red and green. In this case, our approach can quickly learn new weights for the color words, indicating the inconsistency between the current color models and this specific speaker. This allows the robot to discount the color terms (since they are not reliable for this user) and rely on other, more reliable terms (e.g., spatial terms) to ground linguistic expressions to the environment. When the system interacts with a non-color-blind speaker again, it can also quickly update the weights back to their normal values, even if the weights have been offset by the color-blind speaker. As demonstrated by the evaluation results, our approach can learn and remember specific weights for each specific person or situation in an online manner, through just a couple of training dialogues.
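One simple way to realize the "remember specific weights for each person or situation" idea is to key the learned weight vectors by a speaker (or situation) identifier. The snippet below is only an illustrative sketch of that bookkeeping, not part of the dissertation's system; all names are assumptions.

```python
# Illustrative per-speaker bookkeeping of learned weights (not from the
# dissertation).  Unknown speakers start from uniform weights.
UNIFORM = {"type": 1.0, "color": 1.0, "location": 1.0}
speaker_profiles = {}                      # speaker id -> learned weights

def weights_for(speaker_id):
    return speaker_profiles.get(speaker_id, dict(UNIFORM))

def remember_weights(speaker_id, learned_weights):
    speaker_profiles[speaker_id] = dict(learned_weights)
```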
There are several interesting future directions that can be further explored. First, our current evaluation is based on several simplifications, including the simulation of perceptual errors and the strong assumption that the correct grounding information can be provided to the robot through dialogue with a human. Using simulated visual perceptual errors enables us to run controlled evaluations (e.g., manipulating color and spatial perception in a systematic way) without worrying about the effects of the many nuisance factors that can occur in a real-time physical setting. It allows us to focus on evaluating whether the algorithm performs as expected (e.g., whether the algorithm correctly responds when the color perception changes). Now that we have evaluated the algorithm in this simplified setting, the next step is to integrate it into our dialogue system and evaluate it with actual perceptual errors from real environments (e.g., varying lighting conditions, camera angles, etc.). Moreover, the assumption that the ground truth is provided at the end of a dialogue is certainly a strong one, which points to the importance of engaging humans in providing feedback to the robot during dialogue. This again calls for sophisticated and integrated models to support collaborative referential communication between humans and agents.

Second, we have only evaluated exact graph-matching results so far. To enable more robust situated referential grounding, learning useful values of the parameters \eta_1 and \eta_2 for controlling inexact-matching should be further explored. The role of \eta_1 and \eta_2 in the matching process is like that of two "threshold" values that determine whether a proper matching can be found for a node. Inexact-matching can signal a problematic situation (by assigning a node the "NULL" label) in which a discourse graph node's matching compatibility with every node in the vision graph is too low. Such a situation can arise from errors in both language and vision processing. For example, there can be a "missing object" that the agent cannot see at all due to a computer-vision segmentation error. In such a case, only inexact-matching can produce a proper matching, i.e., matching the human's description of the missing object to NULL. In general, allowing NULL-matching through inexact-matching makes referential grounding more flexible and thus gives the dialogue manager better choices for dealing with problematic situations.

Last but not least, it is important to explore a broader range of information and knowledge that can be dynamically acquired to further mediate perceptual differences between the human and the agent. For example, it can be very helpful to learn context-dependent models for language interpretation and grounding, or to acquire knowledge about out-of-vocabulary words. Furthermore, it is also important to investigate how to utilize different kinds of machine learning mechanisms (e.g., online learning, reinforcement learning, or unsupervised learning) to better support situated interaction in the open world and long-term relationships between agents and their human users.

Chapter 5
A Probabilistic Labeling Approach

The graph-matching approach for referential grounding that we have discussed so far relies on the state-space search algorithm to obtain proper grounding results.
Although it has made meaningful progress in addressing collaborative referential grounding under mismatched perceptions, as demonstrated in Chapters 3 and 4, the state-space search based approach has two major limitations. First, it is not flexible enough to obtain multiple grounding hypotheses, nor to incorporate different hypotheses incrementally for follow-up grounding. Second, the search algorithm tends to have a high time complexity for optimal solutions. Thus, the state-space search based approach is not ideal for collaborative and incremental dialogue systems that interact with human users in real time.

To address these limitations, this chapter presents a new approach to referential grounding based on probabilistic labeling. (This chapter is based on the following publication: C. Liu, L. She, R. Fang, and J. Y. Chai, "Probabilistic labeling for efficient referential grounding based on collaborative discourse," in Proceedings of the 52nd ACL conference, pp. 13-18, 2014.) This approach aims to integrate different types of evidence from the collaborative referential communication discourse into a unified probabilistic scheme. It is formulated under the Bayesian reasoning framework so that it can easily support the incorporation and generation of multiple grounding hypotheses for follow-up processes. Our empirical results show that the probabilistic labeling approach significantly outperforms the state-space search approach in both grounding accuracy and efficiency.

In this chapter, we first discuss some motivations for developing a probabilistic approach to situated referential grounding. We then give a detailed account of the key limitation of the state-space search based approach. The technical details of the probabilistic labeling approach are presented next, followed by the evaluation of this new approach. Lastly, we conclude the chapter with some discussion of future directions.

5.1 Motivation

As we have already discussed in Chapter 1, one key challenge for referential communication in situated settings is the mismatched capability between the human and the agent to perceive and reason about the shared environment. Because of these mismatched capabilities, the human's and the agent's knowledge and representations of the shared world can be significantly different. Object-specific properties such as object class, color, and shape then become unreliable and insufficient for communicating the intended objects. Referential communication thus becomes more difficult, and extra effort has to be made to mediate between the mismatched perceptions [107].

To overcome this challenging problem, collaboration through extensive dialogue between the human and the agent becomes an important strategy [17, 108, 26, 27]. As demonstrated by our experiments using simulated mismatched perceptions in Chapter 1, rich dynamics can be observed in human partners' referential grounding dialogues. Such dialogue always unfolds as a well-formed structure of presentation and acceptance phases contributed by both partners. For example, speakers often refer to the intended object(s) through multiple "episodes" of actions, such as an initial installment followed by further refashioning (e.g., expansion, replacement, or repair) based on the hearer's immediate feedback.
The hearer, on the other hand, can also refashion what the speaker just said in a relevant next turn (see the discussions in Sections 1.2.2 and 1.2.3 on these collaborative referring patterns). It is through such collaborative dialogue that human partners can often succeed in referential communication even under mismatched perceptions. Therefore, it is important that a situated dialogue system can also capture and utilize these dialogue dynamics and engage in the dialogue collaboratively.

Besides, to build robust situated dialogue systems, a computational framework should be able to handle the uncertainties and errors that can arise at every step of its decision-making process. For instance, these uncertainties and errors can come from:

• Speech and language processing, such as acoustic speech recognition, parsing, and coreference resolution.

• Computer vision, such as segmentation, object recognition, and feature extraction.

• Vagueness/flexibility of human language, for example, how to map language to numeric metrics.

• Individual differences among human users, for example, differences in personal experiences and in language-use habits.

• Situational/environmental changes, which can affect language and perception from time to time.

Therefore, a computational approach to situated referential grounding needs to fuse the uncertainties from different sources to generate and evaluate the most likely grounding hypotheses. It is desirable that the approach be based on a probabilistic framework, under which information from different sources can be incorporated and reasoned about in a unified and principled manner. As the information from different sources is often interrelated (e.g., in situated dialogue, the dialogue content is always related to the surrounding environment), incorporating information from one source into the processing of another can potentially reduce the uncertainties and errors in the latter (e.g., vision processing results can help with language processing, and vice versa). A unified scheme makes it much easier to share information between different components and to integrate them in a principled way to make better decisions.

5.2 Limitation of the State-Space Search Based Approach

In Chapter 3, we introduced the ARG (Attributed Relational Graph) based model and the state-space search algorithm for situated referential grounding. It showed some encouraging results as a first step towards modeling and interpreting episodic and collaborative dialogues of situated referential communication. However, there are two major limitations:

(1) The state-space search based graph-matching algorithm can only produce "hard" and "one-to-one" matching results (see the later discussion in this section for more details). We believe such an algorithm is not adequate for producing meaningful multiple grounding hypotheses, which is the key to generating collaborative responses. In this chapter, we present a probabilistic relaxation labeling algorithm [109], which permits "soft" and "many-to-one" graph-matching outcomes. It significantly outperforms the state-space search algorithm in generating multiple matching hypotheses.

(2) The evaluation in Chapter 3 was not based on automatic dialogue processing, and the graph representations were built from manually annotated information.
This makes the overall referential grounding problem much easier and less realistic, since the uncertainty arising from automatic dialogue processing is often a key challenge that a real system has to deal with. In this chapter, we present the work of building an end-to-end system using all automatic components. In particular, we tackle the issue of handling the uncertainties of automatic dialogue processing through the probabilistic approach.

Here we give a detailed account of the key limitation of the state-space search algorithm for referential grounding, i.e., that it only supports "hard" (or binary) matching between two graphs. More specifically, we can organize the matching \Theta that we defined in Section 3.2.2 in the form of a matching matrix:

            x'_1    x'_2    ...    x'_N
    x_1     p_11    p_12    ...    p_1N
    x_2     p_21    p_22    ...    p_2N
    ...     ...     ...            ...
    x_M     p_M1    p_M2    ...    p_MN

Namely, \Theta now becomes an M x N matrix, and each element p_mn in this matrix can be viewed as P(\theta_m = x'_n), i.e., the probability of matching node x_m with node x'_n. The state-space search algorithm only supports binary matching because it produces matching matrices in which each p_mn can only be either 1 or 0, as illustrated in the following example:

            x'_1    x'_2    ...    x'_N
    x_1     1       0       ...    0
    x_2     0       0       ...    1
    ...     ...     ...            ...
    x_M     0       1       ...    0

Thus, for state-space search based matching, only binary decisions can be made on whether a node x_m is matched with a node x'_n or not. This binary matching nature makes it difficult to incorporate multiple hypotheses from other components and to generate multiple matching hypotheses. For instance, machine-learning based coreference resolution often produces multiple hypotheses ranked by probability. For state-space search based graph-matching, the only way to handle multiple coreference hypotheses is to keep different "versions" of the discourse graph, each of which represents one coreference hypothesis (e.g., whether node x_4 should be merged with x_3 or with x_1 needs to be represented by two different versions of the discourse graph). Furthermore, state-space search is also inadequate for generating multiple matching hypotheses for each individual node. For example, if there are M discourse entities, state-space search would need to generate at least 3^M states to produce 3 different grounding hypotheses for each entity, which is not computationally feasible.

To address these key limitations, we have developed a probabilistic approach that handles the uncertainties of automatic dialogue processing and produces "soft" grounding results. Specifically, we extend the ARG model to capture not only the discourse of referring expressions, but also the dialogue dynamics in situated referential communication. To cope with the uncertainties that arise from mismatched perceptions and other sources (e.g., language and dialogue processing), we employ a probabilistic graph-matching algorithm to combine different sources of information and produce multiple reference grounding hypotheses. The empirical evaluation results demonstrate that the probabilistic graph-matching approach is capable of incorporating dialogue processing uncertainties and generating multiple grounding hypotheses, whereas the state-space search based approach fails to do so. It also outperforms the state-space search algorithm in both grounding accuracy and efficiency.
Thus, our probabilistic graph-matching approach provides a better framework for interpreting the unfolding dialogue in real time and for generating collaborative agent responses in situated referential communication.

5.3 Probabilistic Labeling for Referential Grounding

In this section, we first describe the automatic language processing components we use to build an end-to-end referential grounding system. We then present in detail the procedure of the probabilistic labeling algorithm and how the outcomes of the different components are integrated under its unified scheme.

5.3.1 Automatic Language Processing

One necessary step in building the discourse graph is the semantic parsing of human utterances. As shown in the examples in Section 1.2.2, these utterances are often informal, fragmented, and contain various types of disfluencies. We therefore apply a partial parsing approach based on Combinatory Categorial Grammar (CCG) [110]. We have defined a set of basic CCG lexicon rules, which covers key expressions in our domain, such as descriptions of object properties, spatial relations, and so forth. Given a human utterance, the CCG parser [111] iteratively searches for the longest word sequences (i.e., chunks) covered by the basic CCG grammar until the end of the utterance is reached. A semantic interpretation (represented in logic forms) is then generated for each chunk. For example, given the utterance "ok then to the right of the green object I have got a giant coffee cup", the parser can extract all the key chunks. It generates the semantic representation RightOf(x, y) ∧ Color(y, Green) for the first chunk, "to the right of the green object", and the semantic representation Size(z, Giant) ∧ Isa(z, Cup) for the second chunk, "a giant coffee cup". The semantic representations of these chunks are further combined to form the final representation:

    [x, y], [Size(x, Giant) ∧ Isa(x, Cup) ∧ Color(y, Green) ∧ RightOf(x, y)]

This final semantic representation is in the form of a Discourse Representation Structure (DRS) [112], which contains a list of discourse entities introduced by the utterance and a list of first-order-logic predicates specifying the properties of and relations between these entities. For each discourse entity, a node is then added to the discourse graph. Unary predicates become attributes of the corresponding nodes, and binary predicates become attributes of the corresponding edges in the graph. Group-based descriptions are analyzed in the same way and incorporated into the graph as hyperedges (see Section 3.2.1 for details).

To assess the performance of automatic semantic parsing on our data, we compared the parsing outcome with the manually annotated semantics of all the utterances. Our partial parser achieves 66.95% recall and 74.4% precision in extracting the correct first-order-logic predicates from the utterances in our data.

5.3.2 Tracking Dialogue Dynamics via Coreference

As we have already discussed in Section 1.2.2, situated referential communication is often a highly incremental and collaborative process between the two dialogue participants. Because of this incremental and collaborative nature, keeping track of the dialogue dynamics can be important for correctly grounding all the interrelated referring expressions in a dialogue. Based on the observations from our data, we simplify tracking the dynamics of situated referential communication as a coreference resolution problem.
We now use the same examples as in Section 1.2.2 to illustrate the idea (for convenience, the two examples from Section 1.2.2 are shown here again):

Example 1:
D: what I am seeing on my screen are three apples (1)
D: and there is an apple that is directly below, slightly to the right of the battery (2)
M: ok (3)
D: and then there is an apple to the right of that (4)
D: and there is an apple below that last apple (5)
M: ok (6)
D: so the apple directly below the battery is called Alexis (7)
M: ok, this is Alexis (8)
D: and then to the right of Alexis is an apple (9)
D: and then below that apple is (10)
D: I am sorry, actually that is a red pear (11)
D: but it looks like an apple (12)
... ...

Example 2:
D: there is basically a cluster of four objects in the upper left, do you see that (1)
M: yes (2)
D: ok, so the one in the corner is a blue cup (3)
M: I see there is a square, but fine, it is blue (4)
D: alright, I will just go with that, so and then right under that is a yellow pepper (5)
M: ok, I see apple but orangish yellow (6)
D: ok, so that yellow pepper is named Brittany (7)
M: uh, the bottom left of those four? Because I do see a yellow pepper in the upper right (8)
D: the upper right of the four of them? (9)
M: yes (10)
D: ok, so that is basically the one to the right of the blue cup (11)
M: yeah (12)
D: that is actually an apple (13)
D: that is a green apple and it is named Ashley (14)
... ...

In Example 1, for instance, utterance (4) expands utterance (2) by adding new information about a spatial relation ("right of") between two referents. Such a relation can be very informative for identifying the referred objects, but first we need to link this new piece of information to what we already have. Here the pronoun "that" in utterance (4) serves as the key: once we infer that this "that" corefers with the phrase "an apple" in utterance (2), we can update the snapshot of the current dialogue to something like "apple 1 is below and to the right of battery 1, apple 2 is to the right of apple 1". With such updated information, we can further search through the environment to find objects that conform to the described situation.

Example 2 demonstrates another interesting collaborative pattern. In utterance (4), the matcher proactively describes what he perceives (i.e., "a blue square"). Then in utterance (5), the director confirmatively accepts this presentation and further expands it by introducing a new object that is "under" the matcher's squarish object. An utterance like (5) is called a relevant next turn [16], which is often an effective way of moving the dialogue forward. From the matcher's point of view, utterance (5) can also be easier to ground, since it builds on what has already been jointly accepted: he now only needs to look at the area below his blue square to find the next referent (i.e., "a yellow pepper"). (Recall that, in Section 3.2.3, we already discussed this prevalent collaboration pattern and named it "agent-present-human-accept".) Notice that this again relies on correctly linking the second "that" in utterance (5) to "a blue square" in utterance (4).

As we can see from these two examples, tracking the dialogue dynamics for the purpose of referential grounding boils down to the coreference resolution problem (e.g., [113]). Namely, we want to infer whether a referring expression introduces a new referent (i.e., an object that has not been mentioned before) or (co)refers to the same referent as a previous referring expression.
With such inferences, we can link information across multiple referring expressions or overwrite old information with new information, and then search for proper referents based on all the accumulated information.

We use a similar machine-learning-based approach for coreference resolution (e.g., as in [114, 115]). To serve our referential grounding purpose, we specifically focus on coreference resolution for mentions of the physical objects in the shared environment, not mentions of people, events, and so on. Formally, let x_i be a discourse entity extracted from the current referring expression, and let x_j be a discourse entity from a previous expression. We then perform coreference resolution as binary classification on the entity pair (x_i, x_j); i.e., we train a maximum entropy classifier [116] (http://nlp.stanford.edu/downloads/classifier.shtml) to predict whether x_i and x_j should both refer to the same object (i.e., positive) or each of them should refer to a different object (i.e., negative). The features we use for the classification include the distance between x_i and x_j, the determiners associated with them, the associated pronouns, the extracted object-specific properties, the syntactic roles, the speaker identity, etc.

Note that, although our coreference tracking task is similar to the pairwise classification step in regular coreference resolution (e.g., as in [117, 118, 119]), it does have some unique characteristics:

• Since its purpose is to link separated pieces of information (e.g., due to the dialogue dynamics, fragmented utterances, or partial parsing results), coreference tracking directly deals with the extracted discourse entities rather than with linguistic mentions as regular coreference resolution does. For example, our parser can extract a discourse entity [x, red(x)] from the one-word fragment "red", which does not even count as a mention. Furthermore, it only focuses on expressions referring to physical objects in the shared environment, not people, events, etc.

• The situated context (i.e., the shared environment) plays an important role. Recall the earlier Example 1: the director was trying to communicate the identities of three objects (two apples and a pear), yet the phrase "an apple" appeared five times, plus another "a red pear". This contrasts with non-situated discourse, where, after an entity has been evoked, the definite determiner "the" should be used in subsequent referring expressions. Thus coreference tracking based on linguistic information alone may be difficult, and the situational context needs to be taken into consideration. In fact, solving coreference resolution and referential grounding as a joint problem in situated dialogue is itself an interesting direction for future work.

• Since our dialogue graph uses edges to represent only binary relations between discourse entities, entity-pairwise coreference tracking is sufficient, and it is not necessary for us to look at coreference chains.

To incorporate the coreference tracking results into the dialogue graph, we add a special type of "coreference edge" into the graph. Such an edge (denoted x_i x_j) exists between each pair of nodes in the graph.
It is also assigned an attribute to encode the coreference tracking result (i.e., positive or negative, as well as the classification probability) for this pair of discourse entities. (If the distance between a pair of discourse entities is greater than a predefined window size, we simply assign the pair to the majority class, i.e., negative, based on the class prior probability.)

5.3.3 Iterative Probabilistic Labeling Algorithm

The probabilistic relaxation labeling algorithm [109] is formulated in the Bayesian framework for contextual label assignment. It provides a unified probabilistic evidence-combining scheme to integrate unary attributes, binary relations, and prior knowledge when updating the labeling probabilities (i.e., P(\theta_m = x'_n)). In essence, the probabilistic relaxation labeling algorithm is based on several conditional independence assumptions, such as that the matching of a node x_m depends only on itself and its neighbors, not on other unrelated nodes. Given the same kind of graph representations (i.e., the discourse graph and the vision graph) as described in Section 3.2.1, probabilistic relaxation labeling finds proper labelings (i.e., matchings) in an iterative manner. It first initializes the labeling probabilities by considering only the unary attributes of each node, and then efficiently and effectively updates the labeling probability of each node based on the labelings of its neighbors and its relations with them. The detailed algorithm is as follows.

Initialization: Compute the initial labeling probabilities:

    P^{(0)}(\theta_m = x'_n) = \frac{P(x_m \mid \theta_m = x'_n)\, \hat{P}(\theta_m = x'_n)}
    {\sum_{x'_{n_2} \in X'} P(x_m \mid \theta_m = x'_{n_2})\, \hat{P}(\theta_m = x'_{n_2})}    (5.1)

in which \hat{P}(\theta_m = x'_n) is the prior probability of labeling x_m with x'_n. The prior probability can be used to encode any prior knowledge about possible labelings. Especially in incremental processing of the dialogue, the prior can encode previous grounding hypotheses (e.g., the agent-present-human-accept pattern discussed in Section 3.2.3) and other information from the collaborative dialogue, such as confirmation, rejection, or replacement (see the collaboration patterns summarized in Section 1.2.3).

P(x_m | \theta_m = x'_n) is called the "compatibility coefficient" between x_m and x'_n, which is computed based on the attributes of x_m and x'_n:

    P(x_m \mid \theta_m = x'_n) \approx \prod_{a} P(u_{ma} \mid \theta_m = x'_n)    (5.2)

and we further define

    P(u_{ma} \mid \theta_m = x'_n) = P(u_{ma} \mid u'_{na})
    = \frac{p(u'_{na} \mid u_{ma})\, p(u_{ma})}
    {\sum_{u_{ma} \in L_a} p(u'_{na} \mid u_{ma})\, p(u_{ma})}    (5.3)

where L_a is the "lexicon" for the a-th attribute of a discourse graph node (e.g., for the color attribute, L_color = {red, green, blue, ...}), and p(u'_{na} | u_{ma}) is what we earlier called the "semantic grounding function" in Section 3.2.2, i.e., the probability of observing u'_{na} given the word u_{ma} (here we use the same set of manually defined semantic grounding functions as introduced there). It judges the compatibility between the symbolic attribute values from the discourse graph and the numeric attribute values from the vision graph.

Iteration: At each iteration (the superscript (r) denotes the r-th iteration), and for each possible labeling, compute the "support function"

    Q^{(r)}(\theta_m = x'_n) = \prod_{m_2 \in \breve{M}_m} \sum_{x'_{n_2} \in X'}
    P^{(r)}(\theta_{m_2} = x'_{n_2})\, P(x_m x_{m_2} \mid \theta_m = x'_n, \theta_{m_2} = x'_{n_2})    (5.4)

in which the set of indices \breve{M}_m is defined as \breve{M}_m = {1, 2, ..., m-1, m+1, ..., M}. The support function Q^{(r)}(\theta_m = x'_n) expresses how the labeling \theta_m = x'_n at the
r-th iteration is supported by the labelings of x_m's neighbors, taking into consideration the binary relations that exist between x_m and these neighbors (here we denote an edge between node x_m and node x_{m_2} simply as x_m x_{m_2}). Then update the probability of each possible labeling as:

    P^{(r+1)}(\theta_m = x'_n) = \frac{P^{(r)}(\theta_m = x'_n)\, Q^{(r)}(\theta_m = x'_n)}
    {\sum_{x'_{n_2} \in X'} P^{(r)}(\theta_m = x'_{n_2})\, Q^{(r)}(\theta_m = x'_{n_2})}    (5.5)

Similar to the node compatibility coefficient, the edge compatibility coefficient between x_m x_{m_2} and x'_n x'_{n_2}, namely the P(x_m x_{m_2} | \theta_m = x'_n, \theta_{m_2} = x'_{n_2}) used to compute Q^{(r)}(\theta_m = x'_n), is also based on the attributes of the two edges and their corresponding semantic grounding functions. For an edge that represents a spatial relation between two entities, this is similar to the state-space search approach described in Section 3.2.2. The discourse coreference relations, however, can be handled differently under the unified scheme of the probabilistic labeling method. For the state-space search method, two coreferential nodes need to be merged for them to be grounded jointly, but in this way a discourse graph can only represent one hypothesis about the coreference relation between two nodes. To be able to utilize multiple hypotheses from the coreference resolution component, the probabilistic labeling approach can directly incorporate them as edge compatibility coefficients.

Recall that in Section 5.3.2 we treated coreference resolution as a classification problem on discourse entity pairs: for each pair of discourse entities (i.e., discourse graph nodes) (x_m, x_{m_2}), we trained a statistical classifier to predict whether x_m and x_{m_2} should both refer to the same object (i.e., positive) or each of them should refer to a different object (i.e., negative). To incorporate the coreference results into the discourse graph, we add a special type of "coreference edge" into the graph. Such an edge (denoted x_m x_{m_2}) exists between each pair of nodes in the graph. It is also assigned an attribute to encode the coreference tracking result (i.e., positive or negative, and the classification probability) for this pair of discourse entities. Suppose x_m x_{m_2} is assigned a positive attribute value by the coreference classifier; the compatibility coefficient can then be computed as:

    P(x_m x_{m_2} = + \mid \theta_m = x'_n, \theta_{m_2} = x'_{n_2})
    = \frac{P(\theta_m = x'_n, \theta_{m_2} = x'_{n_2} \mid x_m x_{m_2} = +)\, P(x_m x_{m_2} = +)}
    {P(\theta_m = x'_n, \theta_{m_2} = x'_{n_2})}    (5.6)

where

    P(\theta_m = x'_n, \theta_{m_2} = x'_{n_2})
    = P(\theta_m = x'_n, \theta_{m_2} = x'_{n_2} \mid x_m x_{m_2} = +)\, P(x_m x_{m_2} = +)
    + P(\theta_m = x'_n, \theta_{m_2} = x'_{n_2} \mid x_m x_{m_2} = -)\, P(x_m x_{m_2} = -)

in which P(x_m x_{m_2} = +) is the coreference classifier's output probability of assigning the entity pair (x_m, x_{m_2}) to the positive class, and P(x_m x_{m_2} = -) = 1 - P(x_m x_{m_2} = +), since the classification is binary. Furthermore, we use the coreference classifier's precision to estimate P(\theta_m = x'_n, \theta_{m_2} = x'_{n_2} | x_m x_{m_2} = +) and P(\theta_m = x'_n, \theta_{m_2} = x'_{n_2} | x_m x_{m_2} = -), i.e., when the classifier assigns an entity pair to the positive or negative class, how likely it is that the pair truly maps to the same object (i.e., \theta_m = \theta_{m_2}) or to different objects (i.e., \theta_m \neq \theta_{m_2}). The compatibility coefficient for a negative coreference edge can be computed in a similar way.

Termination: Terminate the algorithm if any one of the following conditions is true:

• For each m, one of the P(\theta_m = x'_n) exceeds 1 - \epsilon_1, where \epsilon_1 \ll 1;

• In the last iteration, none of the P(\theta_m = x'_n) changed by more than \epsilon_2, where \epsilon_2 \ll 1;

• The number of iterations has reached a specified limit.
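The procedure above can be summarized in a compact sketch. This is only an illustrative rendering of Eqs. (5.1), (5.4), and (5.5): the node priors, node compatibilities, and edge compatibilities are assumed to be precomputed and passed in as arrays, and no attention is paid to numerical underflow, which a real implementation would need to handle.

```python
# Illustrative sketch of iterative probabilistic labeling (Eqs. 5.1, 5.4, 5.5).
# Assumed inputs (hypothetical array layouts):
#   node_comp[m, n]          ~ P(x_m | theta_m = x'_n)              (Eq. 5.2)
#   prior[m, n]              ~ P_hat(theta_m = x'_n)
#   edge_comp[m, m2, n, n2]  ~ P(x_m x_m2 | theta_m = x'_n, theta_m2 = x'_n2)
import numpy as np

def probabilistic_labeling(node_comp, prior, edge_comp,
                           max_iter=20, eps1=1e-3, eps2=1e-3):
    M, N = node_comp.shape
    # Initialization (Eq. 5.1): unary compatibility times prior, normalized
    P = node_comp * prior
    P /= P.sum(axis=1, keepdims=True)

    for _ in range(max_iter):
        # Support function (Eq. 5.4): for each node, combine the expected
        # pairwise compatibility with every other node's current labeling
        Q = np.ones((M, N))
        for m in range(M):
            for m2 in range(M):
                if m2 != m:
                    Q[m] *= edge_comp[m, m2] @ P[m2]   # sum over labels of x_m2
        # Update (Eq. 5.5), renormalizing over the candidate labels
        new_P = P * Q
        new_P /= new_P.sum(axis=1, keepdims=True)
        # Termination tests from the bullet list above
        done = (new_P.max(axis=1) > 1 - eps1).all() or np.abs(new_P - P).max() < eps2
        P = new_P
        if done:
            break
    return P   # P[m, n]: probability of grounding discourse node m to object n
```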
In practice, we find that this algorithm often converges very quickly (within 5 iterations).

5.4 Evaluation

5.4.1 Data

We use the same data as in Section 3.3 and compare the referential grounding performance of the probabilistic labeling approach against that of the state-space search approach. As described earlier in Section 3.3, the dataset we use for evaluation contains the transcriptions of human-human (i.e., a director and a matcher) dialogues on an object-naming task, along with the images that were used to collect each dialogue. This time we build the discourse graph completely from automatic language processing and coreference resolution. This allows us to use more of the data collected from the experiment, since we do not need to manually annotate the semantics and coreferences for each dialogue. Our evaluation dataset now has 62 dialogues, each of which contains an average of 25 valid utterances from the director. We first apply our CCG-based parser to extract semantic representations (i.e., the DRS representation as described in Section 5.3.1) from these utterances. An average of 33 discourse entities per dialogue (1.3 per utterance) are extracted through automatic parsing. Since the matchers in these dialogues actually played the role of a computer-vision based agent, we treat all their decisions as known to our system8 and focus only on the processing of the directors' utterances.

8 We leave the agent's decision making (e.g., response generation) to future work.

Based on the semantic parsing outcome, we need to further infer the coreference relations between each pair of extracted discourse entities. As discussed earlier in Section 5.3.2, such coreference relations are the key to tracking the collaborative dynamics in the dialogues of this dataset; thus both building the dialogue graph representation and generating the grounding (i.e., graph-matching) results rely on coreference resolution. To infer the pairwise coreference between discourse entities, for each dialogue we create a list of entity pairs by pairing each discourse entity with each of the previous discourse entities within a predefined window. On average, over 300 entity pairs are created for each dialogue. We then use each dialogue as the testing set and all the others as the training set (i.e., in the "leave-one-out" manner) to train and evaluate the coreference classifier.

5.4.2 Results

We apply both the probabilistic labeling algorithm and the state-space search algorithm to ground each of the director's discourse entities onto an object perceived from the image. The referential grounding performances of the two algorithms are compared in Table 5.1. Note that the results in Table 5.1 are calculated and organized based on the following considerations: (1) Besides the accuracy of the top-1 grounding hypothesis, we also measure and compare the accuracies of the top-2 and top-3 grounding hypotheses. The "accuracy" of the top-2 grounding hypotheses is measured in this way: suppose $\hat{x}$ is the ground-truth mapping of $x_m$, and $T_2 = \{x_1, x_2\}$ is the set of top-2 grounding hypotheses generated by the labeling algorithm. The matching of $x_m$ is counted as a "hit" if $\hat{x} \in T_2$, and otherwise as a "miss". The overall accuracy is then the percentage of hits among all the discourse entities
being evaluated. The accuracy of the top-3 grounding hypotheses is measured similarly. (2) The referential grounding results are evaluated in an incremental manner to resemble the situation in a real dialogue: for each dialogue, we go through the director's utterances one by one from the beginning to the end. At each utterance, we first update the dialogue graph (e.g., add in new discourse entities and relations), and then (re-)evaluate the updated grounding results of all the discourse entities encountered so far.9 The final accuracy is the averaged accuracy over all the "evaluation points" (i.e., each utterance is an evaluation point).

9 Note that, when a "new" discourse entity is added into the graph, it can change the grounding results of previous discourse entities with the new information it brings in. This is why we re-evaluate the grounding results of all the encountered discourse entities each time after new information is added.

Table 5.1 Comparison of the referential grounding performances of a random guess baseline, Probabilistic Labeling (P.L.), State-Space Search (S.S.S.), and P.L. using manually annotated coreference.

                                      Top-1        Top-2        Top-3
  Random Guess a                      7.7%         15.4%        23.1%
  S.S.S.                              19.1%        19.7%        21.3%
  P.L.                                24.9%        36.1%        45.0%
  Gain b                              5.8%         16.4%        23.7%
                                      (p < 0.01)   (p < 0.001)  (p < 0.001)
  P.L. using annotated coreference    66.4%        74.8%        81.9%

a Each image contains an average of 13 objects.
b p-values are based on the Wilcoxon signed-rank test [92] over the 62 dialogues.

As shown in Table 5.1, probabilistic labeling (P.L.) significantly outperforms state-space search (S.S.S.), especially with regard to producing meaningful multiple grounding hypotheses. The state-space search algorithm actually only produces multiple hypotheses for the overall matching, and it fails to produce multiple hypotheses for many individual discourse entities. Multiple grounding hypotheses can be very useful for generating responses such as clarification questions or nonverbal feedback (e.g., pointing, gazing). For example, if there are two competing hypotheses, the dialogue manager can utilize them to generate a response like "I see two objects there, are you talking about this one (pointing to) or that one (pointing to the other)?". Such proactive feedback is often an effective strategy in referential communication, as we discussed earlier in Sections 1.2.3 and 3.2.3.

The probabilistic labeling algorithm not only produces better grounding results, it also runs much faster (with a running-time complexity of $O(M^2N^2)$,10 compared to $O(N^4)$ for the state-space search algorithm11). Figure 5.1 shows the averaged running time of the two algorithms (the red curve is the state-space search algorithm and the blue curve is the probabilistic labeling algorithm) on an Intel Core i7 1.60GHz CPU with 16GB of RAM. As we can see, when the size of the dialogue graph becomes greater than 15, state-space search takes more than 1 minute to run, whereas probabilistic labeling only takes 1 second or less even for large graphs. The efficiency of the probabilistic labeling algorithm thus makes it more appealing for real-time interaction applications.

10 M is the number of nodes in the vision graph and N is the number of nodes in the dialogue graph.

11 A beam search algorithm is applied to reduce the exponential $O(M^N)$ to $O(N^4)$.

Figure 5.1 Average running time with respect to the number of nodes to be grounded in a dialogue graph. The red curve is the state-space search algorithm and the blue curve is the probabilistic labeling algorithm.

Although probabilistic labeling significantly outperforms state-space search, the grounding performance is still rather poor (less than 50%) even for the top-3 hypotheses. Unsurprisingly, the coreference resolution performance plays an important role in the final grounding performance (see the grounding performance when using manually annotated coreference in the bottom part of Table 5.1).
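For clarity, the sketch below illustrates how the top-k "hit" accuracy reported in Table 5.1 is computed under the incremental evaluation protocol described earlier in this section. The ground callable and the data layout are hypothetical placeholders; only the metric itself follows the description above.

```python
def topk_hit(prob_row, gold_index, k):
    """A grounding of one discourse entity counts as a hit if the ground-truth
    object is among the k highest-probability hypotheses."""
    ranked = sorted(range(len(prob_row)), key=lambda j: prob_row[j], reverse=True)
    return gold_index in ranked[:k]

def incremental_topk_accuracy(utterances, ground, gold, k=2):
    """Average top-k accuracy over all evaluation points (one per utterance).

    utterances: list of utterance representations (placeholder structure);
    ground(utterances_so_far): hypothetical callable that re-grounds all discourse
    entities seen so far and returns {entity_id: probability list over objects};
    gold: dict mapping entity_id to the ground-truth object index.
    """
    point_accuracies = []
    for t in range(1, len(utterances) + 1):
        hypotheses = ground(utterances[:t])       # re-evaluate everything seen so far
        hits = [topk_hit(p, gold[e], k) for e, p in hypotheses.items() if e in gold]
        if hits:
            point_accuracies.append(sum(hits) / len(hits))
    return sum(point_accuracies) / len(point_accuracies) if point_accuracies else 0.0
```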
Due to the simplicity of our current coreference classifier and the flexibility of the human-human dialogue in the data, the pairwise coreference resolution only achieves 0.74 in precision and 0.43 in recall. The low recall of coreference resolution makes it difficult to link interrelated referring expressions and resolve them jointly. It is therefore important to develop more sophisticated coreference resolution and dialogue management components that can reliably track the discourse relations and other dynamics in the dialogue to facilitate referential grounding.

5.5 Conclusion and Discussion

In this chapter, we have presented a probabilistic labeling based approach for referential grounding in situated dialogue. This approach provides a unified scheme for incorporating different sources of information. Its probabilistic scheme allows each information source to present multiple hypotheses to better handle uncertainties. Based on the integrated information, the labeling procedure then efficiently generates probabilistic grounding hypotheses, which can serve as important guidance for the dialogue manager's decision making. As demonstrated by the empirical evaluation results, the probabilistic labeling approach has some desirable advantages compared to the state-space search based approach:

• The probabilistic labeling approach is designed to incorporate uncertainties from different sources through its unified evidence-combining scheme. Its outputs are directly multiple hypotheses with probabilities. For the state-space search approach, maintaining multiple hypotheses is very cumbersome and inefficient.

• The probabilistic labeling approach outputs an individual score/probability for each possible matching of each discourse entity. Individual matching scores and multiple hypotheses can provide important guidance for dialogue management.

• The probabilistic labeling approach is more efficient than the state-space search approach, and incremental grounding can be easily implemented. It is thus more suitable for building a system that interacts with humans in real time.

One future direction is to extend the probabilistic labeling approach to handle group-based (i.e., n-ary) relations, as we have done in Section 3.2.1 using hypergraph representations. The issue with the probabilistic labeling approach is that its iterative labeling scheme only takes binary relations into consideration. How to incorporate n-ary relations into probabilistic labeling in a theoretically sound and computationally efficient way is an interesting research question to address. One straightforward approach to try is to decompose an n-ary relation into several binary relations without loss of expressive power. Such a method may work well for "inter-group" relations (such as "the apple is behind the cluster of four"), but may not work for "intra-group" relations (such as "the apple is in the middle of the cluster of four").12 Nevertheless, a general approach that can account for any kind of binary and n-ary relations is desirable.

12 See [55] for some discussion on different types of group-based relations.
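As a toy illustration of the decomposition just mentioned (not part of this dissertation's implementation), an inter-group relation can be expanded into one binary relation per group member; the relation names and the tuple representation below are purely illustrative.

```python
def decompose_inter_group(relation, target, group_members):
    """Decompose an inter-group (n-ary) relation, e.g. "the apple is behind the
    cluster of four", into binary relations between the target and each member."""
    return [(relation, target, member) for member in group_members]

# "the apple is behind the cluster of four" -> behind(apple, o_i) for each member.
print(decompose_inter_group("behind", "apple", ["o1", "o2", "o3", "o4"]))
# Intra-group relations such as "in the middle of the cluster" cannot be split this
# way without losing meaning, since they constrain the group as a whole.
```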
In Chapter 4, we introduced a weight-learning method that is based on the state-space search algorithm. One open question is whether we can also design a similar weight-learning mechanism for the probabilistic labeling algorithm. In general, we can always assign weights to different attributes and attribute values to specify how reliable or important they are for referential grounding. Just like state-space search, probabilistic labeling relies on the combination of all the attributes to find proper matchings, so it should benefit from meaningful weights in the same way that state-space search does. However, weight learning under the probabilistic labeling framework may not be easy to design and implement, because of the non-linear calculation and iterative procedure it involves. More complex optimization techniques (such as quadratic programming) may need to be considered. One possible simpler solution is to use state-space search for weight learning and apply the learned weights in probabilistic labeling for real-time grounding. If the weights learned by our weight-learning method are generally meaningful, such a "hybrid" approach may produce better results than using either of the two approaches alone.

Finally, another important future direction is to tackle coreference resolution and referential grounding in situated dialogue as a joint problem. As we discussed earlier in Section 5.3.2, coreference is often the key to tracking the relations at the locutionary level (i.e., the relations among the formal semantic representations of different referring expressions) and to building a well-connected discourse graph for referential grounding. It may also serve as one important source of information for tracking the dynamics at the illocutionary level (i.e., identifying and tracking dialogue acts) for sophisticated dialogue management. Our empirical results have already demonstrated that, on the one hand, referential grounding relies on good coreference resolution to achieve better performance, but on the other hand coreference resolution in situated dialogue is difficult and needs to incorporate the situational context and other information from the interaction. As we mentioned at the beginning of this chapter, information from different sources is often interrelated (e.g., the dialogue content is always related to the surrounding environment in situated dialogue); incorporating one source of information into another can therefore potentially reduce the uncertainties and errors within the latter. Thus, based on the probabilistic and iterative scheme of probabilistic labeling, we should further investigate how to solve coreference resolution and referential grounding jointly and iteratively, and we expect that the performance of both can be improved together.

Chapter 6

Conclusion and Future Work

In situated dialogue, one significant challenge is that the agent (i.e., the dialogue system) needs to perceive and make sense of the shared environment while carrying on the conversation. The agent's representation of the world is often limited by its perceptual and reasoning capabilities. Therefore, although co-present, humans and agents do not share a joint perceptual basis, and the lack of such a joint basis will jeopardize referential communication between humans and agents.
It will become more difficult for the agent to identify referents in the physical world that are referred to by the human, i.e., the problem of referential grounding. The work presented in this dissertation has focused on developing computational approaches to enable robust and adaptive referential grounding in situated dialogue.

In Chapter 3, graph-based representations are utilized to model the linguistic discourse of collaborative referring dialogue and the visual perception of the physical environment. Referential grounding is then formulated as a graph-matching problem and a state-space search algorithm is applied to ground linguistic references onto perceived objects. In addition, hypergraph representations are introduced to account for group-based descriptions, and the most prevalent pattern of collaborative communication observed from dialogue data is incorporated into the search algorithm. The empirical results show that, even when the perception of the environment by computer vision algorithms has a high error rate (95% of the objects are mis-recognized), this approach can still correctly ground those mis-recognized referents with 66% accuracy, whereas an object-properties-only baseline obtains only 31% grounding accuracy. As demonstrated by the results, the graph-matching approach provides a potential solution to robust referential grounding through modeling and utilizing spatial relations, group descriptions, and collaborative referring behaviors.

In Chapter 4, an optimization-based approach is further developed to allow the agent to detect and adapt to the perceptual differences through interaction and learning. Through interaction with the human, the agent can learn a set of weights indicating how reliably/unreliably each dimension (object type, object color, etc.) of its perception of the environment maps to the human's linguistic descriptors. The agent can then adapt to the situation by applying the learned weights to the grounding algorithm, and adjust its word grounding models accordingly. The empirical evaluation shows that this weight-learning approach can effectively adjust the weights to reflect the agent's perceptual insufficiencies. When the perceptual difference is high (i.e., the agent can only correctly recognize 10-40% of objects in the environment), applying the learned weights with updated word grounding models significantly improves referential grounding performance by an absolute gain of 10%.

In Chapter 5, a probabilistic labeling based approach is developed to better support collaborative and incremental dialogue systems that interact with human users in real time. This approach provides a unified scheme for incorporating different sources of information. Its probabilistic scheme allows each information source to present multiple hypotheses to better handle uncertainties. Based on the integrated information, the labeling procedure then efficiently generates probabilistic grounding hypotheses, which can serve as important guidance for the dialogue manager's decision making. Evaluated on the same dataset, the probabilistic labeling approach significantly outperforms the state-space search approach in both grounding accuracy and efficiency.

As we have already discussed in Section 3.4, Section 4.4, and Section 5.5, there is a broad range of future directions that can be further explored. One particularly interesting and important piece of future work is to solve coreference resolution and referential grounding as a joint problem in situated dialogue.
Coreference resolution in situated settings is the task of determining whether two linguistic entities refer to the same physical object or not, and referential grounding is the task of finding which physical object a linguistic entity refers to. It should not be difficult to see that these two problems are intertwined and need to be solved jointly in situated dialogue. However, in our current work they are treated as two separate problems and solved individually. For coreference resolution, we used the same kind of classification method and feature set as in previous text-based work. Unsurprisingly, the performance is rather low because the key characteristic of situated dialogue, i.e., the situational context, was not taken into consideration. Unreliable coreference results have further impacted the performance of referential grounding, because the latter relies on coreference to combine information across the collaborative discourse. Our probabilistic labeling approach provides a unified and iterative scheme to update beliefs on the matchings between linguistic entities and physical objects, and on the relationships between linguistic entities as well.1 Thus, probabilistic labeling may provide a potential solution for jointly updating both the coreference resolution and referential grounding beliefs throughout its iterative evidence-combining procedure. If we can develop such a joint approach and demonstrate its effectiveness in improving both coreference resolution and referential grounding performance, it will be an interesting and meaningful achievement.

1 I.e., this is captured by the "support function" as described in Section 5.3.3.

To enable situated dialogue systems in the open world, sophisticated machine-learning techniques are indispensable. Recent work on joint learning (e.g., [120, 35]) has demonstrated the appeal of jointly acquiring/updating the models of different components from the system's direct inputs and outputs, without providing supervised data for each individual component. Such learning methods are desirable for our goal, i.e., building systems that can learn and adapt from interactions with human users and the environment.

While most of the previous work on situated referential grounding focused only on the locutionary level (i.e., the interpretation of single referring expressions), processing information at the illocutionary level (i.e., collaborative discourse) should also play an important role in building interactive systems. Although we have incorporated a prevalent collaborative pattern into our graph-based approach for referential grounding, it is still an ad hoc solution. The probabilistic labeling approach provides a unified scheme for incorporating different sources of information for referential grounding, and it supports the agent's response generation by providing probabilistic grounding hypotheses to the dialogue manager component. But so far we have focused only on the interpretation aspect. Can we have a more general framework, under which the interpretation of referring expressions, the incorporation of dialogue dynamics, the generation of collaborative actions, and the mediation of perceptual differences can all be handled in a unified manner? Furthermore, such a framework should be based on or integrated with advanced machine-learning techniques, so that it can continuously adapt to changing environments and situations through learning from the interaction with human users.
One important step along this direction is to utilize well-established machine-learning techniques and frameworks, such as POMDPs and reinforcement learning [121, 122], and tailor them to best address the problem we focus on here: situated referential communication.

REFERENCES

[1] D. Bohus and E. Horvitz, “Dialog in the open world: platform and applications,” in Proceedings of the 2009 international conference on Multimodal interfaces, pp. 31–38, ACM, 2009.

[2] H. I. Christensen, G. M. Kruijff, and J. Wyatt, eds., Cognitive Systems. Springer, 2010.

[3] M. F. McTear, “Spoken dialogue technology: enabling the conversational interface,” ACM Computing Surveys, vol. 34, no. 1, pp. 90–169, 2002.

[4] A. Raux, B. Langner, D. Bohus, A. W. Black, and M. Eskenazi, “Let's go public! taking a spoken dialog system to the real world,” in Proc. of Interspeech 2005, Citeseer, 2005.

[5] M. Johnston, S. Bangalore, G. Vasireddy, A. Stent, P. Ehlen, M. Walker, S. Whittaker, and P. Maloor, “MATCH: An architecture for multimodal dialogue systems,” in Proceedings of the ACL'02, pp. 376–383, 2002.

[6] J. Chai, P. Hong, M. Zhou, and Z. Prasov, “Optimization in multimodal interpretation,” in Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, p. 1, Association for Computational Linguistics, 2004.

[7] D. K. Byron, “Understanding referring expressions in situated language: some challenges for real-world agents,” in Proceedings of the First International Workshop on Language Understanding and Agents for Real World Interaction, pp. 39–47, 2003.

[8] P. Gorniak and D. Roy, “Situated language understanding as filtering perceived affordances,” Cognitive Science, vol. 31, no. 2, pp. 197–231, 2007.

[9] R. J. Ross, Situated Dialogue Systems: Agency and Spatial Meaning in Task-Oriented Dialogue. PhD thesis, University of Bremen, 2009.

[10] H. H. Clark, Using language. Cambridge University Press, 1996.

[11] H. H. Clark and S. E. Brennan, “Grounding in communication,” Perspectives on socially shared cognition, vol. 13, no. 1991, pp. 127–149, 1991.

[12] H. H. Clark and D. Wilkes-Gibbs, “Referring as a collaborative process,” Cognition, vol. 22, no. 1, pp. 1–39, 1986.

[13] M. Skubic, D. Perzanowski, S. Blisard, A. Schultz, W. Adams, M. Bugajska, and D. Brock, “Spatial language for human-robot dialogs,” Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 34, no. 2, pp. 154–167, 2004.

[14] T. Tenbrink, “Identifying objects on the basis of spatial contrast: An empirical study,” in Spatial cognition iv. reasoning, action, interaction, pp. 124–146, Springer Berlin Heidelberg, 2005.

[15] R. Moratz, “Intuitive linguistic joint object reference in human-robot interaction,” in Proceedings of the Twenty-first National Conference on Artificial Intelligence (AAAI), Boston, MA., AAAI, Menlo Park, CA, 2006.

[16] H. H. Clark and E. F. Schaefer, “Contributing to discourse,” Cognitive science, vol. 13, no. 2, pp. 259–294, 1989.

[17] P. G. Edmonds, “Collaboration on reference to objects that are not mutually known,” in Proceedings of the 15th Conference on Computational Linguistics-Volume 2, pp. 1118–1122, Association for Computational Linguistics, 1994.

[18] P. A. Heeman and G. Hirst, “Collaborating on referring expressions,” Computational Linguistics, vol. 21, no. 3, pp. 351–382, 1995.

[19] D. DeVault and M.
Stone, “Learning to interpret utterances using dialogue history,” in Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pp. 184–192, Association for Computational Linguistics, 2009.

[20] E. Levin and R. Passonneau, “A woz variant with contrastive conditions,” in Proceedings of the Interspeech Satellite Workshop, Dialogue on Dialogues: Multidisciplinary Evaluation of Advanced Speech-based Interactive Systems, 2006.

[21] R. J. Passonneau, S. L. Epstein, and J. B. Gordon, “Help me understand you: Addressing the speech recognition bottleneck,” in AAAI Spring Symposium: Agents that Learn from Human Teachers, pp. 119–126, 2009.

[22] N. Otsu, “A threshold selection method from gray-level histograms,” Automatica, vol. 11, no. 285-296, pp. 23–27, 1975.

[23] D. Zhang and G. Lu, “An integrated approach to shape based image retrieval,” in Proceedings of 5th Asian Conference on Computer Vision (ACCV), Melbourne, Australia, 2002.

[24] R. Moratz and T. Tenbrink, “Spatial reference in linguistic human-robot interaction: Iterative, empirically supported development of a model of projective relations,” Spatial cognition and computation, vol. 6, no. 1, pp. 63–107, 2006.

[25] T. Tenbrink and R. Moratz, “Group-based spatial reference in linguistic human-robot interaction,” in Proceedings of EuroCogSci, vol. 3, pp. 325–330, 2003.

[26] C. Liu, R. Fang, and J. Chai, “Towards mediating shared perceptual basis in situated dialogue,” in Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, (Seoul, South Korea), pp. 140–149, Association for Computational Linguistics, July 2012.

[27] C. Liu, R. Fang, L. She, and J. Chai, “Modeling collaborative referring for situated referential grounding,” in Proceedings of the SIGDIAL 2013 Conference, (Metz, France), pp. 78–86, Association for Computational Linguistics, August 2013.

[28] C. Liu and J. Chai, “Learning to mediate perceptual differences in situated human-robot dialogue,” in Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2015.

[29] C. Liu, L. She, R. Fang, and Y. J. Chai, “Probabilistic labeling for efficient referential grounding based on collaborative discourse,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 13–18, Association for Computational Linguistics, 2014.

[30] T. Winograd, Procedures as a representation for data in a computer program for understanding natural language. PhD thesis, Massachusetts Institute of Technology, 1970.

[31] P. Gorniak and D. Roy, “Grounded semantic composition for visual scenes,” J. Artif. Intell. Res. (JAIR), vol. 21, pp. 429–470, 2004.

[32] A. Siebert and D. Schlangen, “A simple method for resolution of definite reference in a shared visual context,” in Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, pp. 84–87, Association for Computational Linguistics, 2008.

[33] D. L. Chen and R. J. Mooney, “Learning to interpret natural language navigation instructions from observations,” in Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pp. 859–865, 2011.

[34] C. Matuszek, N. FitzGerald, L. Zettlemoyer, L. Bo, and D. Fox, “A joint model of language and perception for grounded attribute learning,” in Proceedings of the 29th International Conference on Machine Learning, 2012.

[35] J. Krishnamurthy and T.
Kollar, “Jointly learning to parse and perceive: Connecting natural language to the physical world,” Transactions of the Association for Computational Linguistics, vol. 1, no. 2, pp. 193–206, 2013.

[36] S. Guadarrama, L. Riano, D. Golland, D. Gouhring, Y. Jia, D. Klein, P. Abbeel, and T. Darrell, “Grounding spatial relations for human-robot interaction,” in Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, pp. 1640–1647, IEEE, 2013.

[37] R. M. Krauss and S. Weinheimer, “Concurrent feedback, confirmation, and the encoding of referents in verbal communication,” Journal of Personality and Social Psychology, vol. 4, no. 3, p. 343, 1966.

[38] R. M. Krauss and S. Weinheimer, “Effect of referent similarity and communication mode on verbal encoding,” Journal of Verbal Learning and Verbal Behavior, vol. 6, no. 3, pp. 359–363, 1967.

[39] H. P. Grice, Logic and conversation. Harvard Univ., 1970.

[40] M. Mitchell, K. van Deemter, and E. Reiter, “Natural reference to objects in a visual domain,” in Proceedings of the 6th international natural language generation conference, pp. 95–104, Association for Computational Linguistics, 2010.

[41] S. Kriz, J. G. Trafton, and J. M. McCurry, “The role of spatial information in referential communication: Speaker and addressee preferences for disambiguating objects,” in Proceedings of the 29th Annual Cognitive Science Society, 2007.

[42] H. H. Clark, Arenas of Language Use. University of Chicago Press, 1992.

[43] H. H. Clark and D. Wilkes-Gibbs, “Referring as a collaborative process,” in Cognition, no. 22, pp. 1–39, 1986.

[44] H. H. Clark and S. E. Brennan, “Grounding in communication,” in Perspectives on socially shared cognition (L. B. Resnick, R. M. Levine, and S. D. Teasley, eds.), pp. 127–149, 1991.

[45] D. Traum, A Computational Theory of Grounding in Natural Language Conversation. PhD thesis, University of Rochester, 1994.

[46] J. Cassell, T. Bickmore, L. Campbell, and H. Vilhjalmsson, “Human conversation as a system framework: Designing embodied conversational agents,” in Embodied conversational agents (J. Cassell, J. Sullivan, S. Prevost, and E. Churchill, eds.), MIT Press, 2000.

[47] T. Fong, I. R. Nourbakhsh, C. Kunz, L. Fluckiger, J. Schreiner, R. Ambrose, R. Burridge, R. Simmons, L. Hiatt, A. Schultz, J. Trafton, M. Bugajska, and J. Scholtz, “The peer-to-peer human robot interaction project,” Proceedings of AIAA Space 2005, 2005.

[48] K. Stubbs, P. Hinds, and D. Wettergreen, “Autonomy and common ground in human-robot interaction,” in IEEE Intelligent Systems, pp. 42–50, 2007.

[49] K. Stubbs, D. Wettergreen, and I. Nourbakhsh, “Using a robot proxy to create common ground in exploration tasks,” in Proceedings of the HRI08, pp. 375–382, 2008.

[50] M. K. Brown, B. M. Buntschuh, and J. G. Wilpon, “Sam: A perceptive spoken language-understanding robot,” Systems, Man and Cybernetics, IEEE Transactions on, vol. 22, no. 6, pp. 1390–1402, 1992.

[51] K. Nagao and J. Rekimoto, “Ubiquitous talker: Spoken language interaction with real world objects,” in Proceeding of the International Joint Conference on Artificial Intelligence, 1995.

[52] A. Mojsilovic, “A computational model for color naming and describing color composition of images,” Image Processing, IEEE Transactions on, vol. 14, no. 5, pp. 690–699, 2005.

[53] T. Regier and L. A. Carlson, “Grounding spatial language in perception: an empirical and computational investigation,” Journal of Experimental Psychology: General, vol. 130, no. 2, p. 273, 2001.

[54] S. S.
Dhande, “A computational model to connect gestalt perception and natural language,” Master's thesis, Massachusetts Institute of Technology, 2003.

[55] K. Funakoshi, S. Watanabe, T. Tokunaga, and N. Kuriyama, “Understanding referring expressions involving perceptual grouping,” IEIC Technical Report (Institute of Electronics, Information and Communication Engineers), vol. 105, no. 204, pp. 19–24, 2005.

[56] M. Skubic, P. Matsakis, G. Chronis, and J. Keller, “Generating multi-level linguistic spatial descriptions from range sensor readings using the histogram of forces,” Autonomous Robots, vol. 14, pp. 51–69, January 2003.

[57] R. Moratz and T. Tenbrink, “Instruction modes for joint spatial reference between naive users and a mobile robot,” in Proc. IEEE International Conference on Robotics, Intelligent Systems and Signal Processing, vol. 1, pp. 43–48, 2003.

[58] S. C. Levinson, Space in language and cognition: explorations in cognitive diversity. Cambridge University Press, Cambridge, 2003.

[59] C. Liu, J. Walker, and J. Chai, “Ambiguities in spatial language understanding in situated human robot dialogue,” in AAAI 2010 Fall Symposium on Dialogue with Robots, (Arlington, Virginia, USA), November 2010.

[60] J. G. Trafton, N. L. Cassimatis, M. D. Bugajska, D. P. Brock, F. E. Mintz, and A. C. Schultz, “Enabling effective human-robot interaction using perspective-taking in robots,” vol. 35, pp. 460–470, 2005.

[61] T. Tenbrink, V. Maiseyenka, and R. Moratz, “Spatial reference in simulated human-robot interaction involving intrinsically oriented objects,” in Proceedings of the Symposium Spatial Reasoning and Communication at AISB'07 Artificial and Ambient Intelligence, 2007.

[62] K.-y. Hsiao, N. Mavridis, and D. Roy, “Coupling perception and simulation: Steps towards conversational robotics,” in Intelligent Robots and Systems, 2003 (IROS 2003). Proceedings. 2003 IEEE/RSJ International Conference on, vol. 1, pp. 928–933, IEEE, 2003.

[63] D. Roy, K.-Y. Hsiao, and N. Mavridis, “Mental imagery for a conversational robot,” Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 34, no. 3, pp. 1374–1383, 2004.

[64] J. Chai, P. Hong, and M. Zhou, “A probabilistic approach to reference resolution in multimodal user interfaces,” in Proceedings of the 9th international conference on Intelligent user interfaces, pp. 70–77, ACM, 2004.

[65] S. Gold and A. Rangarajan, “A graduated assignment algorithm for graph matching,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 18, no. 4, pp. 377–388, 1996.

[66] T. Regier, “The human semantic potential,” 1996.

[67] D. K. Roy and A. P. Pentland, “Learning words from sights and sounds: A computational model,” Cognitive science, vol. 26, no. 1, pp. 113–146, 2002.

[68] D. K. Roy, “Learning visually grounded words and syntax for a scene description task,” Computer Speech & Language, vol. 16, no. 3, pp. 353–385, 2002.

[69] C. Yu and D. H. Ballard, “A multimodal learning interface for grounding spoken language in sensory perceptions,” ACM Transactions on Applied Perception (TAP), vol. 1, no. 1, pp. 57–80, 2004.

[70] S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. G. Banerjee, S. J. Teller, and N. Roy, “Understanding natural language commands for robotic navigation and mobile manipulation,” in AAAI, 2011.

[71] C. Sutton and A. McCallum, “An introduction to conditional random fields for relational learning,” Introduction to statistical relational learning, pp. 93–128, 2006.

[72] S. Tellex, P. Thaker, J. Joseph, and N.
Roy, “Learning perceptually grounded word meanings from unaligned parallel data,” Machine Learning, vol. 94, no. 2, pp. 151–167, 2014.

[73] R. J. Kate and R. J. Mooney, “Learning language semantics from ambiguous supervision,” in AAAI, vol. 7, pp. 895–900, 2007.

[74] T. Kwiatkowski, L. Zettlemoyer, S. Goldwater, and M. Steedman, “Lexical generalization in ccg grammar induction for semantic parsing,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1512–1523, Association for Computational Linguistics, 2011.

[75] D. DeVault, N. Kariaeva, A. Kothari, I. Oved, and M. Stone, “An information-state approach to collaborative reference,” in Proceedings of the ACL 2005 on Interactive poster and demonstration sessions, pp. 1–4, Association for Computational Linguistics, 2005.

[76] M. Richardson and P. Domingos, “Markov logic networks,” Machine learning, vol. 62, no. 1-2, pp. 107–136, 2006.

[77] C. Kennington and D. Schlangen, “Markov logic networks for situated incremental natural language understanding,” in Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, (Seoul, South Korea), pp. 314–323, Association for Computational Linguistics, July 2012.

[78] D. Conte, P. Foggia, C. Sansone, and M. Vento, “Thirty years of graph matching in pattern recognition,” International journal of pattern recognition and artificial intelligence, vol. 18, no. 03, pp. 265–298, 2004.

[79] D. Conte, P. Foggia, C. Sansone, and M. Vento, “How and why pattern recognition and computer vision applications use graphs,” in Applied Graph Theory in Computer Vision and Pattern Recognition, pp. 85–135, Springer, 2007.

[80] W. Tsai and K. Fu, “Error-correcting isomorphisms of attributed relational graphs for pattern analysis,” Systems, Man and Cybernetics, IEEE Transactions on, vol. 9, no. 12, pp. 757–768, 1979.

[81] A. Sanfeliu and K. S. Fu, “A distance measure between attributed relational graphs for pattern recognition,” IEEE transactions on systems, man, and cybernetics, vol. 13, no. 3, pp. 353–362, 1983.

[82] M. A. Eshera and K. S. Fu, “A graph distance measure for image analysis,” IEEE transactions on systems, man, and cybernetics, vol. 14, no. 3, pp. 398–410, 1984.

[83] M. Eshera and K. Fu, “An image understanding system using attributed symbolic representation and inexact graph-matching,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, no. 5, pp. 604–618, 1986.

[84] G. Gallo, G. Longo, S. Pallottino, and S. Nguyen, “Directed hypergraphs and applications,” Discrete applied mathematics, vol. 42, no. 2, pp. 177–201, 1993.

[85] D. Roy, Y. Ghitza, J. Bartelma, and C. Kehoe, “Visual memory augmentation: Using eye gaze as an attention filter,” in Proceedings of the IEEE International Symposium on Wearable Computers, 2004.

[86] W.-H. Tsai and K.-S. Fu, “Subgraph error-correcting isomorphisms for syntactic pattern recognition,” Systems, Man and Cybernetics, IEEE Transactions on, no. 1, pp. 48–62, 1983.

[87] W. Zhang, State-space search: Algorithms, complexity, extensions, and applications. Springer-Verlag New York Inc, 1999.

[88] R. M. Cesar Jr, E. Bengoetxea, I. Bloch, and P. Larrañaga, “Inexact graph matching for model-based recognition: Evaluation and comparison of optimization algorithms,” Pattern Recognition, vol. 38, no. 11, pp. 2099–2113, 2005.

[89] K. Sambhoos, R. Nagi, M. Sudit, and A. Stotz, “Enhancements to high level data fusion using graph matching and state space search,” Information Fusion, vol. 11, no. 4, pp.
351–364, 2010.

[90] Z. Prasov and J. Chai, “What's in a gaze?: the role of eye-gaze in reference resolution in multimodal conversational interfaces,” in Proceedings of the 13th international conference on Intelligent user interfaces, pp. 20–29, ACM, 2008.

[91] Z. Prasov and J. Chai, “Fusing eye gaze with speech recognition hypotheses to resolve exophoric references in situated dialogue,” in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 471–481, Association for Computational Linguistics, 2010.

[92] F. Wilcoxon, S. Katti, and R. A. Wilcox, “Critical values and probability levels for the wilcoxon rank sum test and the wilcoxon signed rank test,” Selected tables in mathematical statistics, vol. 1, pp. 171–259, 1970.

[93] S. Sitter and A. Stein, “Modeling information-seeking dialogues: The conversational roles (cor) model,” RIS: Review of Information Science (online journal), vol. 1, no. 1, pp. 165–180, 1996.

[94] H. Shi, R. J. Ross, T. Tenbrink, and J. Bateman, “Modelling illocutionary structure: combining empirical studies with formal model analysis,” in Computational Linguistics and Intelligent Text Processing, pp. 340–353, Springer, 2010.

[95] V. N. Kasyanov and I. A. Lisitsyn, “Hierarchical graph models and visual processing,” in Proc. Int. Conf. on Software: Theory and Practice, 16th World Comput. Congr. IFIP, pp. 119–123, 2000.

[96] W. Palacz, “Algebraic hierarchical graph transformation,” Journal of Computer and System Sciences, vol. 68, no. 3, pp. 497–520, 2004.

[97] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. Jordan, “Matching words and pictures,” Journal of Machine Learning Research, vol. 3, pp. 1107–1135, 2003.

[98] D. Roy, “Grounding words in perception and action: computational insights,” TRENDS in Cognitive Sciences, vol. 9, no. 8, pp. 389–396, 2005.

[99] C. Liu, R. Fang, and J. Chai, “Towards mediating shared perceptual basis in situated dialogue,” in Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, (Seoul, South Korea), pp. 140–149, Association for Computational Linguistics, July 2012.

[100] C. Liu, L. She, R. Fang, and Y. J. Chai, “Probabilistic labeling for efficient referential grounding based on collaborative discourse,” Proceedings of ACL'14 (Volume 2: Short Papers), pp. 13–18, 2014.

[101] C. Roos, T. Terlaky, and J.-P. Vial, Interior point methods for linear optimization. Springer, 2006.

[102] J. Y. Chai, L. She, R. Fang, S. Ottarson, C. Littley, C. Liu, and K. Hanson, “Collaborative effort towards common ground in situated human-robot dialogue,” in Proceedings of the 2014 ACM/IEEE international conference on Human-robot interaction, pp. 33–40, ACM, 2014.

[103] N. Cesa-Bianchi, Prediction, learning, and games. Cambridge University Press, 2006.

[104] R. Fang, C. Liu, and J. Y. Chai, “Integrating word acquisition and referential grounding towards physical world interaction,” in Proceedings of ICMI '12, pp. 109–116, 2012.

[105] R. Fang, C. Liu, L. She, and J. Y. Chai, “Towards situated dialogue: Revisiting referring expression generation,” in Proceedings of EMNLP'13, pp. 392–402, 2013.

[106] R. Fang, M. Doering, and J. Y. Chai, “Collaborative models for referring expression generation in situated dialogue,” in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

[107] H. H. Clark, Using language. Cambridge, UK: Cambridge University Press, 1996.

[108] P. A. Heeman and G.
Hirst, “Collaborating on referring expressions,” Computational Linguistics, vol. 21, no. 3, pp. 351–382, 1995.

[109] W. J. Christmas, J. Kittler, and M. Petrou, “Structural matching in computer vision using probabilistic relaxation,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 17, no. 8, pp. 749–764, 1995.

[110] M. Steedman and J. Baldridge, “Combinatory categorial grammar,” Non-Transformational Syntax, pp. 181–224, 2011.

[111] C. Bozşahin, G.-J. M. Kruijff, and M. White, “Specifying grammars for openccg: A rough guide,” in Included in the OpenCCG distribution, 2005.

[112] H. Kamp and U. Reyle, From discourse to logic: Introduction to modeltheoretic semantics of natural language, formal logic and discourse representation theory. No. 42, Springer, 1993.

[113] W. M. Soon, H. T. Ng, and D. C. Y. Lim, “A machine learning approach to coreference resolution of noun phrases,” Computational linguistics, vol. 27, no. 4, pp. 521–544, 2001.

[114] M. Strube and C. Müller, “A machine learning approach to pronoun resolution in spoken dialogue,” in Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pp. 168–175, Association for Computational Linguistics, 2003.

[115] Y. Versley, S. P. Ponzetto, M. Poesio, V. Eidelman, A. Jern, J. Smith, X. Yang, and A. Moschitti, “Bart: A modular toolkit for coreference resolution,” in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Demo Session, pp. 9–12, Association for Computational Linguistics, 2008.

[116] C. Manning and D. Klein, “Optimization, maxent models, and conditional estimation without magic,” in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Tutorials - Volume 5, NAACL-Tutorials '03, (Stroudsburg, PA, USA), pp. 8–8, Association for Computational Linguistics, 2003.

[117] W. M. Soon, H. T. Ng, and D. C. Y. Lim, “A machine learning approach to coreference resolution of noun phrases,” Computational linguistics, vol. 27, no. 4, pp. 521–544, 2001.

[118] Y. Versley, S. P. Ponzetto, M. Poesio, V. Eidelman, A. Jern, J. Smith, X. Yang, and A. Moschitti, “Bart: A modular toolkit for coreference resolution,” in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Demo Session, pp. 9–12, Association for Computational Linguistics, 2008.

[119] G. Durrett and D. Klein, “Easy victories and uphill battles in coreference resolution,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, (Seattle, Washington), Association for Computational Linguistics, October 2013.

[120] L. S. Zettlemoyer and M. Collins, “Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars,” CoRR, vol. abs/1207.1420, 2012.

[121] M. Gašić, C. Breslin, M. Henderson, D. Kim, M. Szummer, B. Thomson, P. Tsiakoulis, and S. Young, “Pomdp-based dialogue manager adaptation to extended domains,” in Proceedings of SIGDIAL, 2013.

[122] A. G. Barto, Reinforcement learning: An introduction. MIT press, 1998.