This is to certify that the
dissertation entitled
Incorporating Non-verbal Modalities in Spoken Language
Understanding for Multimodal Conversational Systems
presented by
Shaolin Qu
has been accepted towards fulfillment
of the requirements for the
Ph.D. degree in Computer Science
Major Professor’s Signature
Date
MSU is an Affirmative Action/Equal Opportunity Employer
INCORPORATING NON-VERBAL MODALITIES IN
SPOKEN LANGUAGE UNDERSTANDING FOR
MULTIMODAL CONVERSATIONAL SYSTEMS
By
Shaolin Qu
A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
DOCTOR OF PHILOSOPHY
Computer Science
2009
ABSTRACT
INCORPORATING NON-VERBAL MODALITIES IN
SPOKEN LANGUAGE UNDERSTANDING FOR
MULTIMODAL CONVERSATIONAL SYSTEMS
By
Shaolin Qu
Interpreting human language is a challenging problem in building human-machine
conversational systems due to the flexibility of human language behavior. This prob-
lem is further compounded by insufficient speech understanding and system knowledge
representation. When unreliable and unexpected language inputs are received, con-
versational systems tend to fail. Robust language interpretation is essential for build-
ing practical conversational systems.
To address this issue, this thesis investigates the use of non-verbal modalities for
robust language interpretation in human-machine conversation. Specifically, this the-
sis investigates the use of deictic gesture and eye gaze to address two interrelated
problems of language interpretation: unreliable speech input due to weak speech
recognition, and unexpected speech input containing words that are not in the sys-
tem’s knowledge base. The underlying assumption is that deictic gesture and eye gaze
indicate the user’s visual attention and signal the salient visual context in which the
user’s spoken language is situated. This context constrains what the user is likely to
say to the system and therefore can be used to help understand the user’s language.
To facilitate this investigation, we developed a multimodal conversational sys-
tem on 3D-based domains. The system supports speech, deictic gesture, and eye
gaze input during human-machine conversation. Using this system, we conducted
user studies to collect speech-gaze and speech-gesture data sets for the investigation.
For the first topic, using non-verbal modalities to improve speech recognition and
understanding, we built different salience driven language models to incorporate ges-
ture/gaze in different stages of speech recognition. We also experimented with different
model-based and instance-based approaches to incorporate gesture in recognizing the
intention of the user’s spoken language. Our experiments show that using gesture
and eye gaze significantly improves speech recognition and understanding. The use
of gesture has also been shown to achieve significant improvement in user intention
recognition.
For the second topic, using non-verbal modalities for automatic word acquisition,
we developed different approaches to incorporate speech-gaze temporal information
and domain knowledge with eye gaze to facilitate word acquisition during human-
machine conversation. To further improve word acquisition, we also incorporated
user interactivity to pick out the “useful” speech-gaze data for word acquisition. Our
findings indicate that word acquisition is significantly improved when speech-gaze
temporal information and domain knowledge are incorporated. Moreover, acquisition
performance is further improved when the words are acquired from the automatically
identified “useful” speech-gaze data.
The results from this thesis have important implications for building robust and
practical multimodal conversational systems. They demonstrate how non-verbal
modalities can be combined successfully at different stages of spoken language pro-
cessing to improve robustness in language interpretation.
Copyright by
SHAOLIN QU
2009
ACKNOWLEDGMENTS
I would like to thank my advisor, Dr. Joyce Chai, for her guidance and support
over the years. Dr. Chai introduced me to the world of multimodal conversation and
helped set up the direction of my research. Her devotion to research, commitment to
professionalism, and relentless pursuit of perfection have greatly inspired me through
the completion of my study. I would also like to thank my guidance committee, Dr.
John Deller, Dr. Anil Jain, and Dr. George Stockman for their insightful comments
and suggestions that have greatly enhanced this thesis.
Many fellow graduate students have helped me with the work reported in this thesis.
Special thanks to Zahar Prasov, who not only collaborated with me on the user study
designs and data collection, but also had many valuable discussions with me that
have helped shape my research. Tyler Baldwin, Matthew Gerber, and Chen Zhang
have also contributed to the data collection and shared their valuable comments and
suggestions on my work.
And finally, I want to thank my parents and my sister for their support all these
years.
TABLE OF CONTENTS
LIST OF TABLES ..............................
LIST OF FIGURES .............................
1 Introduction ................................
1.1 Overview of Multimodal Conversation .................
1.2 Problems in Multimodal Language Understanding ...........
1.2.1 Unreliable Speech Input .....................
1.2.2 Unexpected Speech Input ....................
1.3 Research Questions ............................
1.4 Road Map .................................
2 Background ................................
2.1 Why Multimodal Design? ........................
2.2 Non-Verbal Modalities in Multimodal Conversational Systems . . . .
2.2.1 Gesture ..............................
2.2.2 Eye Gaze .............................
2.3 Using Non-Linguistic Information for Language Understanding . . . .
2.3.1 Multimodal Language Processing ................
2.3.2 Context-aware Language Processing ...............
2.4 Automatic Word Acquisition .......................
3 A Multimodal Conversational System ................
3.1 System Architecture ...........................
3.2 Input Modalities .............................
3.2.1 Speech ...............................
3.2.2 Deictic Gesture ..........................
3.2.3 Eye Gaze .............................
3.3 Domains of Application ..........................
3.3.1 Interior Decoration ........................
3.3.2 Treasure Hunting .........................
4 Incorporation of Non-verbal Modalities in Language Models for Spo-
ken Language Processing ........................
4.1 A Salience Driven Framework ......................
4.1.1 Salience ..............................
4.1.2 Salience Driven Interpretation of Spoken Language in Multi-
modal Conversation .......................
4.2 Gesture-Based Salience Modeling ....................
4.3 Gaze-Based Salience Modeling ...................... 35
4.4 Salience Driven Language Modeling ................... 37
4.4.1 Language Models for Speech Recognition ............ 37
4.4.2 Salience Driven N-Gram Models ................. 38
4.4.3 Salience Driven PCFG ...................... 39
4.5 Application of Salience Driven Language Models for ASR ....... 42
4.5.1 Early Application ......................... 42
4.5.2 Late Application ......................... 43
4.6 Evaluation ................................. 44
4.6.1 Speech and Gesture Data Collection .............. 44
4.6.2 Evaluation Results on Speech and Gesture Data ........ 45
4.6.3 Speech and Eye Gaze Data Collection .............. 52
4.6.4 Evaluation Results on Speech and Eye Gaze Data ....... 54
4.6.5 Discussion ............................. 57
4.7 Summary ................................. 66
5 Incorporation of Non-verbal Modalities in Intention Recognition for
Spoken Language Understanding ................... 67
5.1 Multimodal Interpretation in a Speech-Gesture System ........ 68
5.1.1 Semantic Representation ..................... 68
5.1.2 Incorporating Context in Two Stages .............. 69
5.2 Intention Recognition ........................... 70
5.3 Feature Extraction ............................ 71
5.3.1 Semantic Features ........................ 71
5.3.2 Phoneme Features ........................ 72
5.4 Model-Based Intention Recognition ................... 73
5.5 Instance-Based Intention Recognition .................. 74
5.6 Evaluation ................................. 76
5.6.1 Experiment Settings ....................... 76
5.6.2 Results Based on Traditional Speech Recognition ....... 78
5.6.3 Results Based on Gesture-Tailored Speech Recognition . . . . 79
5.6.4 Results Based on Different Sizes of Training Data ....... 80
5.6.5 Discussion ............................. 84
5.7 Summary ................................. 86
6 Incorporation of Eye Gaze in Automatic Word Acquisition . . . . 88
6.1 Data Collection .............................. 89
6.2 Translation Models for Automatic Word Acquisition .......... 90
6.2.1 Base Model I ........................... 90
6.2.2 Base Model II ........................... 90
6.3 Using Speech-Gaze Temporal Information for Word Acquisition . . . 91
6.4 Using Domain Semantic Relatedness for Word Acquisition ...... 93
6.4.1 Domain Modeling ......................... 94
6.4.2 Semantic Relatedness of Word and Entity ........... 95
6.4.3 Word Acquisition with Word-Entity Semantic Relatedness . . 95
6.5 Grounding Words to Domain Concepts ................. 97
6.6 Evaluation ................................. 97
6.6.1 Evaluation Metrics ........................ 98
6.6.2 Evaluation Results ........................ 99
6.6.3 An Example ............................ 105
6.7 Summary ................................. 106
7 Incorporation of Interactivity with Eye Gaze for Automatic Word
Acquisition ................................ 107
7.1 Data Collection .............................. 108
7.1.1 Domain .............................. 108
7.1.2 Data Preprocessing ........................ 109
7.2 Identification of Closely Coupled Gaze-Speech Pairs .......... 110
7.2.1 Features Extraction ........................ 110
7.2.2 Logistic Regression Model .................... 113
7.3 Evaluation of Gaze-Speech Identification ................ 114
7.4 Evaluation of Word Acquisition ..................... 116
7.4.1 Evaluation Metrics ........................ 116
7.4.2 Evaluation Results ........................ 117
7.5 The Effect of Word Acquisition on Language Understanding ..... 122
7.5.1 Simulation 1: When the System Starts with No Training Data 123
7.5.2 Simulation 2: When the System Starts with Training Data . . 124
7.6 Summary ................................. 126
8 Conclusions ................................ 128
8.1 Contributions ............................... 128
8.2 Future Directions ............................. 130
APPENDICES ................................ 132
A Multimodal Data Collection ....................... 132
A.1 Speech-Gesture Data Collection in the Interior Decoration Do-
main ................................ 132
A.2 Speech-Gaze Data Collection in the Interior Decoration Domain 132
A.3 Speech-Gaze Data Collection in the Treasure Hunting Domain 135
B Parameter Estimation in Approaches to Word Acquisition ...... 137
B.1 Parameter Estimation for Base Model-1 ............ 137
B.2 Parameter Estimation for Base Model-2 ............ 138
B.3 Parameter Estimation for Model-2s ............... 139
B.4 Parameter Estimation for Model-2t ...............
B.5 Parameter Estimation for Model-2ts ..............
BIBLIOGRAPHY
LIST OF TABLES

4.1 Performances of the early application of different language models on
    speech-gesture data ............................ 48
4.2 Performance of the late application of LMs on speech-gesture data . . 51
4.3 WER of the early application of LMs on speech-gaze data ...... 55
4.4 WER of the late application of LMs on speech-gaze data ....... 55
5.1 Intentions in the 3D interior decoration domain ............ 70
5.2 Accuracies of intention prediction based on standard speech recognition 78
5.3 Accuracies of intention prediction based on gesture-tailored speech
    recognition ................................. 79
6.1 N-best candidate words acquired for the entity dresser_1 by different
    models ................................... 105
7.1 Gaze-speech prediction performances with different feature sets for the
    instances with 1-best speech recognition ................
LIST OF FIGURES
1.1 Architecture of multimodal conversation ................ 3
1.2 Semantics-based multimodal interpretation ............... 4
3.1 Multimodal conversational system architecture ............. 26
3.2 Eye gaze on a scene ............................ 28
3.3 A 3D interior decoration domain ..................... 29
3.4 A treasure hunting domain ........................ 30
4.1 Salience driven interpretation ...................... 33
4.2 Gesture-based salience modeling ..................... 34
4.3 An excerpt of speech and gaze stream data ............... 36
4.4 Context free grammar for the 3D interior decoration domain ..... 40
4.5 Trained PCFG for entity lamp in the 3D interior decoration domain . 41
4.6 Application of salience driven language model in speech recognition . 42
4.7 A* search in word lattice ......................... 43
4.8 An Excerpt of XML Data File ...................... 46
4.9 Performance of the early application of LMs on speech-gesture data of
individual users .............................. 50
4.10 Performance of the late application of LMs on speech-gesture data of
individual users .............................. 53
4.11 WERs of application of LMs on speech-gaze data of individual users . 56
4.12 N-best lists of speech recognition for utterance “show me details on
this desk” ................................. 58
4.13 Word lattice of utterance “show me details on this desk” generated by
using standard bigram model .................... 59
4.14 Word lattice of utterance “show me details on this desk” generated by
     using salience driven bigram model ................... 60
4.15 N-best lists of speech recognition for utterance “move the red chair
     over here” ................................. 61
4.16 Word lattice of utterance “move the red chair over here” generated by
     using standard bigram model ...................... 62
4.17 Word lattice of utterance “move the red chair over here” generated by
     using salience driven bigram model ................... 63
4.18 N-best lists of speech recognition for utterance “I like the picture with
     like a forest in it” ............................. 64
4.19 N-best lists of an utterance: early stage integration v.s. late stage
     integration .................................
5.1 Semantic frame of a user’s multimodal input ..............
5.2 Using context (via gesture) for language understanding ........
5.3 Phonemes of an utterance ........................
5.4 Intention prediction performance of Naive Bayes based on different
    training size ................................ 81
5.5 Intention prediction performance of Decision Tree based on different
    training size ................................ 81
5.6 Intention prediction performance of SVM based on different training
    size ..................................... 82
5.7 Intention prediction performance of S-KNN based on different train-
    ing size ................................... 83
5.8 Intention prediction performance of P-KNN based on different train-
    ing size ................................... 83
5.9 Intention prediction performance of SP-KNN based on different train-
    ing size ...................................
5.10 Using gestural information in different stages for intention recognition
6.1 Parallel speech and gaze streams .................... 89
6.2 Histogram of truly aligned word and entity pairs over temporal distance
    (bin width = 200ms) ........................... 92
6.3 Domain model with domain concepts linked to WordNet synsets . . . 94
6.4 Precision of word acquisition ....................... 101
6.5 Recall of word acquisition ........................ 102
6.6 F-measure of word acquisition ...................... 103
6.7 MRRRs achieved by different models .................. 104
7.1 A snapshot of one user’s experiment (the dot on the stereo indicates
    the user’s gaze fixation, which was not shown to the user during the
    experiment) ................................ 109
7.2 Precision of word acquisition on 1-best speech recognition with Model-
    2t-r ..................................... 118
7.3 Recall of word acquisition on 1-best speech recognition with Model-2t-r 118
7.4 F-measure of word acquisition on 1-best speech recognition with
    Model-2t-r ................................. 119
7.5 Precision of word acquisition on speech transcript with Model-2t-r . . 120
7.6 Recall of word acquisition on speech transcript with Model-2t-r . . . 120
7.7 F-measure of word acquisition on speech transcript with Model-2t-r . 121
7.8 MRRRs achieved by Model-2t-r with different data sets ........ 121
7.9 CIR of user language achieved by the system starting with no training
    data .................................... 124
7.10 CIR of user language achieved by the system starting with 10 users’
     training data ............................... 126
A.1 Instruction for scenario 1 in the interior decoration domain ...... 133
A.2 Instruction for scenario 2 in the interior decoration domain ...... 134
A.3 Questions for users in the study ..................... 135
A.4 Instruction for the user study ...................... 136
CHAPTER 1
Introduction
Speech is the most natural means for humans to communicate with each other.
Due to its naturalness, speech is also a desirable communication mode in human-
computer interaction. A great deal of research has been done on spoken dialog sys-
tems [1,17,64,65,78,110], where users communicate with the system through speech.
In recent years, the development of multimodal conversational systems has gained
more interest. Besides speech input, multimodal conversational systems also support
inputs from other modalities such as gesture and eye gaze during human-machine
conversation. Compared to the conventional speech-only interfaces in spoken dialog
systems, multimodal conversational interfaces provide users with greater expressive
power, naturalness, and flexibility. Moreover, multimodal conversational systems can
achieve better interpretation of user input due to mutual disambiguation among com-
plementary modalities [74].
Despite recent advances in multimodal conversational systems, interpreting what
a user communicates to the system is still a significant challenge due to insufficient
speech recognition and language understanding performance. Moreover, when the
user’s utterances contain unexpected words that are out of the system’s knowledge,
interpretation of the user language tends to fail even when these words are correctly
recognized, which also makes robust language interpretation a big challenge.
Towards building more practical multimodal conversational systems, this thesis
explores the use of non-verbal modalities for robust language interpretation in two
related directions. First, to improve spoken language understanding, the domain con-
textual information indicated by non-verbal modalities is incorporated in language
modeling to get better speech hypotheses. Second, this thesis explores the use of eye
gaze to acquire words automatically during human-machine conversation, in particu-
lar, by incorporating speech-gaze temporal information, domain semantic knowledge,
and interactivity in word acquisition.
1.1 Overview of Multimodal Conversation
Figure 1.1 shows the typical interaction process between a user and a multimodal
conversational system. The user talks to the system using speech and pen-based
deictic gesture. The user’s eye gaze is captured by the system. The Multimodal
Interpreter identifies semantic meaning of the user’s multimodal input. Given the
interpretation, the Conversation Manager informs the Action Manager what action
(e.g., information query, removing an object on the graphical display) to take. The
Action Manager performs the action in the application domain and provides results to
the Conversation Manager. Based on the results, the Conversation Manager decides
what responses (e.g., inquired information not found, confirmation of object deletion)
to give back to the user. The Presentation Manager presents the system’s response
to the user in one or more formats (e.g., audio, video, graphics).
To be able to provide intelligent responses to the user, the system first needs
to understand user input, which makes the Multimodal Interpreter a key component in
multimodal conversational systems. This thesis focuses on building robust spoken
language understanding in the Multimodal Interpreter.
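The interaction loop just described can be summarized in the minimal sketch below. The class and method names, and the hard-coded return values, are illustrative assumptions only; the actual system components are described in Chapter 3.

    # Minimal sketch of one turn of the interaction in Figure 1.1.
    # All names and return values are illustrative, not the system's actual API.

    class MultimodalInterpreter:
        def interpret(self, speech, gesture, gaze):
            # Fuse the three input streams into a semantic representation.
            return {"intention": {"action": "ACT-INFO_REQUEST", "aspect": "PRICE"},
                    "attention": {"object_id": "picture_lotus"}}

    class ActionManager:
        def execute(self, semantics):
            # Perform the requested action in the application domain
            # (e.g., an information query) and return the result.
            return {"status": "ok", "data": {"price": 150}}

    class ConversationManager:
        def __init__(self, action_manager):
            self.action_manager = action_manager

        def respond(self, semantics):
            # Ask the Action Manager to act, then decide what to tell the user.
            result = self.action_manager.execute(semantics)
            return {"type": "inform", "content": result["data"]}

    class PresentationManager:
        def present(self, response):
            # Render the response in one or more output formats.
            print("SYSTEM:", response["content"])

    def interaction_turn(speech, gesture, gaze):
        semantics = MultimodalInterpreter().interpret(speech, gesture, gaze)
        response = ConversationManager(ActionManager()).respond(semantics)
        PresentationManager().present(response)

    interaction_turn("what is the price of this painting", "point(320, 210)", None)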
[Figure: the user’s speech, gesture, and gaze inputs flow into the Multimodal
Interpreter; its semantic representation is passed to the Conversation Manager, which
exchanges requests and results with the Action Manager and sends responses through
the Presentation Manager as graphics, audio, or video.]
Figure 1.1. Architecture of multimodal conversation
1.2 Problems in Multimodal Language Understanding
Multimodal interpretation is the process of deriving semantic meaning from the user’s multimodal
input. The interpretation process involves recognition, understanding, and integra-
tion of the user’s multiple inputs of different modalities. In most multimodal con-
versational systems, input interpretation is based on a semantic fusion approach. In
this approach, the system first creates all possible partial meaning representations in-
dependently from individual modalities. Then these partial meaning representations
identified from each modality are fused in a multimodal integration process to form
an overall meaning representation. Previous studies have shown that multimodal
interpretation can achieve better performance than unimodal interpretation because
of the mutual disambiguation among complementary modalities during the multi-
modal integration process [74].
Figure 1.2 shows an example of the semantics-based approach to the interpretation
of speech and gesture input. In the example, the user says “what is the price of this
painting?” and at the same time points to a position on the screen. The system
first creates all possible partial meaning representations from speech and gesture
independently. The partial meaning representations from the speech input and the
gesture input are shown in (a—b) in Figure 1.2. In this case, the gesture could be
pointing to a wall or a picture. The system uses the partial meaning representations
to disambiguate one another and combines compatible partial representations
together into an overall semantic representation as shown in Figure 1.2(c).
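As a rough sketch of this fusion step, the snippet below combines the partial representations of Figure 1.2(a-b) into the overall representation of Figure 1.2(c) using a simple attribute-compatibility test; the data structures and the test are illustrative assumptions, not the system’s actual fusion algorithm.

    # Partial meaning representations from speech (a) and gesture (b).
    speech_hyps = [{"intention": {"action": "ACT-INFO_REQUEST", "aspect": "PRICE"},
                    "attention": {"semantic_type": "PICTURE"}}]
    gesture_hyps = [{"attention": {"object_id": "picture_lotus",
                                   "semantic_type": "PICTURE"}},
                    {"attention": {"object_id": "wall_room",
                                   "semantic_type": "WALL"}}]

    def compatible(a, b):
        # Two attention representations are compatible if every attribute
        # they share (here, semantic_type) has the same value.
        return all(a[k] == b[k] for k in set(a) & set(b))

    def fuse(speech_hyps, gesture_hyps):
        for s in speech_hyps:
            for g in gesture_hyps:
                if compatible(s["attention"], g["attention"]):
                    return {"intention": s["intention"],
                            "attention": {**s["attention"], **g["attention"]}}
        return None   # no compatible pair: fusion fails

    print(fuse(speech_hyps, gesture_hyps))   # the overall representation (c)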
[Figure: the user says “What is the price of this painting?” while pointing to a
position on the screen. The speech input passes through speech recognition and
language understanding, yielding partial semantic representation (a): intention
(action: ACT-INFO_REQUEST, aspect: PRICE) and attention (semantic type:
PICTURE). The gesture input passes through gesture recognition and gesture
understanding, yielding partial representation (b) with two attention candidates
(object id: picture_lotus, semantic type: PICTURE) and (object id: wall_room,
semantic type: WALL). Multimodal fusion combines them into the overall
representation (c): intention (action: ACT-INFO_REQUEST, aspect: PRICE) and
attention (object id: picture_lotus, semantic type: PICTURE).]
Figure 1.2. Semantics-based multimodal interpretation
In the semantics-based multimodal interpretation, the partial semantic represen-
tations from individual modalities are crucial for mutual disambiguation during mul-
timodal fusion. A robust recognition and understanding of the user’s speech is very
important. However, there are two main barriers to robust spoken language under-
standing: unreliable speech input and unexpected speech input. We address these
two problems of language understanding as follows.
1.2.1 Unreliable Speech Input
Unreliable speech input refers to input that cannot be correctly recognized due to
weak speech recognition. For example, in Figure 1.2, if the speech input is recognized
as “what is the prize of this painting?”, then the partial representation from the speech
input will not be correctly created in the first place. Without a correct candidate
partial representation, it is unlikely that multimodal fusion will reach a correct overall
meaning of the input.
A potential solution to the above problem is to incorporate contextual information
in recognition and understanding of speech at an earlier stage before semantic fusion in
the pipelined process of multimodal interpretation. The context of human-computer
interaction constrains how users are likely to interact with the system, and thus can
be used to help user input interpretation. In the example, the user is talking about
a picture. Suppose we already have the knowledge that the word “price” is more
likely to appear in an utterance talking about a picture than the word “prize”. By
identifying the visual context (i.e., the picture object) from deictic gesture, the system
can use the domain knowledge associated with the visual context to help recognize
the word “price” correctly and thus achieve correct language understanding.
Following this idea, this thesis presents a salience driven framework in which
gesture/gaze-based salience driven language models are built to improve recognized
speech hypotheses. During speech recognition, these salience driven language models
will guide the system to pick the speech hypothesis that is more likely describing the
currently salient object as indicated by the user’s gesture or eye gaze. Our experi-
mental results have shown the potential of gesture and eye gaze in improving spoken
language processing.
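The intuition behind these salience driven language models can be sketched as a simple interpolation of a base n-gram model with a word distribution conditioned on the salient entity. The probabilities, entity name, and interpolation weight below are invented placeholders for illustration; the actual models are defined in Chapter 4.

    # Base bigram probabilities (toy values).
    base_bigram = {("the", "price"): 0.01, ("the", "prize"): 0.01}

    # Bigram probabilities conditioned on the currently salient entity,
    # e.g., estimated from domain data about picture objects.
    salience_bigram = {"picture_lotus": {("the", "price"): 0.08,
                                         ("the", "prize"): 0.001}}

    def salience_driven_prob(bigram, salient_entity, lam=0.7):
        """Interpolate the base model with the salience-conditioned model."""
        p_base = base_bigram.get(bigram, 1e-6)
        p_sal = salience_bigram.get(salient_entity, {}).get(bigram, 1e-6)
        return (1 - lam) * p_base + lam * p_sal

    # With the picture salient, "the price" now outscores "the prize".
    print(salience_driven_prob(("the", "price"), "picture_lotus"))
    print(salience_driven_prob(("the", "prize"), "picture_lotus"))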
Besides using non-verbal modalities to obtain better speech recognition hypothe-
ses, we also apply non-verbal modalities directly in the language understanding pro-
cess to better interpret the user’s spoken language, specifically, the user’s intention
reflected in the spoken language. In conversational systems, the “meaning” of user in-
put can be generally categorized into intention and attention [33]. Intention indicates
the user’s motivation and action. Attention reflects the focus of the conversation, in
other words, what has been talked about. In the speech-gesture system where speech
is the dominant mode of communication, the user intention (such as asking for the price of
an object) is generally expressed by spoken language and attention (e.g., the specific
object) is indicated by the deictic gesture on the graphical display. Based on such
observations, many speech-gesture systems mainly identify intention from speech and
identify attention using deictic gesture [4, 27,53]. In our view, deictic gestures not
only indicate users’ attention, but also can activate the relevant domain context. This
context can constrain the type of intention associated with the attention and thus
provide useful information for intention recognition.
Based on this assumption, we experimented with model-based and instance-based
approaches to incorporate gestural information to recognize the user’s intention. We
examined the effects of using gestural information for user intention recognition in two
stages: the speech recognition stage and the language understanding stage. Our empirical
results have shown that using gestural information improves intention recognition and
the performance is further improved when gestures are incorporated in both speech
recognition and language understanding stages compared to either stage alone.
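As a rough illustration of how gesture-activated context can enter an intention classifier, the sketch below adds the semantic type of the gesture-selected object as an extra feature next to the word features of the speech hypothesis. The feature names, toy training examples, and the choice of a Naive Bayes classifier are assumptions for illustration; the actual model-based and instance-based approaches are described in Chapter 5.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.naive_bayes import MultinomialNB

    def features(words, gesture_type):
        # Word features from the speech hypothesis plus the domain context
        # (semantic type of the object selected by the deictic gesture).
        feats = {"w=" + w: 1 for w in words}
        feats["gesture_type=" + gesture_type] = 1
        return feats

    train_x = [features(["what", "is", "the", "price"], "PICTURE"),
               features(["move", "this", "chair"], "CHAIR"),
               features(["remove", "this", "lamp"], "LAMP")]
    train_y = ["ASK_PRICE", "MOVE_OBJECT", "REMOVE_OBJECT"]

    vec = DictVectorizer()
    clf = MultinomialNB().fit(vec.fit_transform(train_x), train_y)

    test = features(["what", "is", "the", "price"], "PICTURE")
    print(clf.predict(vec.transform([test])))   # -> ['ASK_PRICE']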
1.2.2 Unexpected Speech Input
Unexpected speech input happens when the user speaks some words that the sys-
tem can not recognize. When the encountered vocabulary is outside of the system’s
knowledge, conversational systems tend to fail. For example, in Figure 1.2, if the user
says “what is the cost of this painting?” and the word “cost” is not in the system’s
vocabulary, then the system would not be able to understand that the user is asking
for the price of the painting. Therefore, it is desirable that conversational systems can
learn new words automatically during human-machine conversation. While automatic
word acquisition in general is quite challenging, multimodal conversational systems
offer a unique opportunity to explore word acquisition. In a multimodal conversa-
tional system where users can talk and interact with a graphical display, users’ eye
gaze, which occurs naturally with speech production, provides a potential channel for
the system to learn new words automatically during human-machine conversation.
Psycholinguistic studies have shown that eye gaze is tightly linked to human lan-
guage processing. Eye gaze is one of the reliable indicators of what a person is “think-
ing about” [37]. The direction of eye gaze carries information about the focus of the
user’s attention [49]. The perceived visual context influences spoken word recognition
and mediates syntactic processing of spoken sentences [97,101]. In addition, directly
before speaking a word, the eyes move to the mentioned object [31,68,88].
Motivated by these psycholinguistic findings, we investigate the use of eye gaze for
automatic word acquisition in multimodal conversation. Particularly, this thesis in-
vestigates the use of temporal alignment of speech and eye gaze and domain semantic
relatedness for automatic word acquisition. The speech-gaze temporal information
and domain semantic information are incorporated in statistical translation models
for word acquisition. Our experimental results demonstrate that eye gaze provides
a potential channel for acquiring words automatically. The use of extra speech-gaze
temporal information and domain semantic knowledge can significantly improve word
acquisition.
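To give a flavor of these translation models, the sketch below estimates p(word | entity) from parallel words and gaze-fixated entities with an IBM-Model-1-style EM procedure; the data are toy stand-ins, and the actual base models and their temporal and semantic extensions are presented in Chapter 6.

    from collections import defaultdict

    # Each pair: words of an utterance and the entities fixated during it.
    data = [(["this", "red", "chair"], ["chair_1"]),
            (["the", "lamp", "there"], ["lamp_2"]),
            (["move", "the", "chair"], ["chair_1"])]

    entities = {e for _, es in data for e in es}
    words = {w for ws, _ in data for w in ws}
    # Uniform initialization of p(word | entity).
    t = {e: {w: 1.0 / len(words) for w in words} for e in entities}

    for _ in range(20):                                  # EM iterations
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for ws, es in data:                              # E-step
            for w in ws:
                z = sum(t[e][w] for e in es)
                for e in es:
                    c = t[e][w] / z
                    count[e][w] += c
                    total[e] += c
        for e in entities:                               # M-step
            for w in words:
                t[e][w] = count[e][w] / total[e]

    # Words most strongly associated with the entity chair_1.
    print(sorted(t["chair_1"].items(), key=lambda x: -x[1])[:3])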
Furthermore, since eye gaze could have different functions during human-machine
conversation, not all speech and eye gaze data are useful for word acquisition. To
further improve word acquisition, the thesis also presents approaches that automat-
ically identify potentially “useful” speech and eye gaze based on information from
multiple sources such as the user’s speech, eye gaze behavior, interaction activity,
and conversation context. Our experimental evaluation shows that using only the
identified “useful” speech and gaze significantly improves word acquisition compared
to using all speech and gaze data.
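The identification step can be pictured as a binary classifier over features of each speech-gaze instance, as in the sketch below. The features, training examples, and use of logistic regression are hypothetical here; the actual feature set and model are described in Chapter 7.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical features per speech-gaze instance:
    # [noun/adjective count, max fixation duration (s), user moving an object (0/1)]
    X = np.array([[2, 0.8, 0],
                  [0, 0.2, 1],
                  [3, 1.1, 0],
                  [1, 0.1, 1]])
    y = np.array([1, 0, 1, 0])   # 1 = closely coupled ("useful"), 0 = not

    clf = LogisticRegression().fit(X, y)

    def is_useful(instance, threshold=0.5):
        """Keep this speech-gaze pair for word acquisition?"""
        return clf.predict_proba([instance])[0][1] >= threshold

    print(is_useful([2, 0.9, 0]))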
1.3 Research Questions
Addressing the above problems, this thesis investigates the following specific questions
about language interpretation in speech and gesture/gaze systems:
• How can the non-verbal modalities be used to improve speech recognition?
• How can the non-verbal modalities be used to help language understanding,
specifically, to help recognition of the user’s intention?
• How can the non-verbal modalities be used to acquire new words automatically
during multimodal conversation?
To facilitate the investigations described above, this thesis has accomplished the
following objectives:
• Development of a multimodal system that supports inputs of speech, gesture,
and eye gaze in 3D-based domains.
• Collection of corpora of speech and gesture/gaze data from user studies.
• Design and implementation of approaches to incorporating non-verbal modal-
ities in spoken language understanding and automatic vocabulary acquisition
during multimodal conversation.
• Evaluation and analysis of these approaches that incorporate non-verbal modal-
ities.
1.4 Road Map
The remainder of the thesis is organized as follows:
• Chapter 2: background on relevant aspects of multimodal conversation and
review of previous work on multimodal language processing and language ac-
quisition.
• Chapter 3: description of a multimodal conversational system developed for this
thesis investigation. The developed system supports inputs of speech, deictic
gesture, and eye gaze in a 3D interior decoration domain and a 3D treasure
hunting game domain.
• Chapter 4: investigation of incorporating non-verbal modalities to improve rec-
ognized speech hypotheses for better language understanding. This chapter
describes different approaches in a gesture/gaze-based salience driven frame-
work. Evaluation and analysis of these approaches are also presented in this
chapter.
• Chapter 5: investigation of incorporating non-verbal modalities to improve user
intention recognition for better language understanding. This chapter describes
different model-based and instance-based approaches for intention recognition
and presents evaluation and analysis of these approaches.
• Chapter 6: investigation of incorporating eye gaze in automatic vocabulary
acquisition for robust language understanding. This chapter describes the ap-
proaches of incorporating speech-gaze temporal information and domain seman-
tic relatedness to facilitate word acquisition. Evaluation and analysis are also
presented in this chapter.
• Chapter 7: investigation of using user interactivity related information for iden-
tifying “closely-coupled” gaze and speech streams and its effect on word acqui-
sition. This chapter describes the prediction of “closely-coupled” gaze-speech
instances for word acquisition. Evaluations of gaze-speech prediction and its
effect on word acquisition are also presented in this chapter.
• Chapter 8: contributions of this thesis work.
CHAPTER 2
Background
This chapter presents a review of the topics that are relevant to this thesis. We begin
by explaining the motivation for multimodal design in conversational systems, then
introduce the non-verbal modalities that have been explored in multimodal conver-
sation, and finally review the previous work on multimodal language interpretation
and automatic word acquisition.
2.1 Why Multimodal Design?
One motivation for multimodal design is users’ strong preference to interact mul-
timodally. Unlike the traditional keyboard and mouse interface or unimodal
recognition-based interfaces, multimodal interfaces allow users to choose which modal-
ity to use depending on the types of information to convey, to use combined input
modes, and to alternate between modes at any time. This flexible choice of input
modes is preferred by users in human-computer interaction. It has been found that
more than 95% of users chose to interact multimodally when they were free
to use either speech or pen input in a map-based spatial domain [73].
Multimodal design is also motivated by the potential of multimodal systems in
expanding the accessibility of computing to a broader range of users. There are large
individual differences in ability and preference for using different modes of commu-
nication. These differences can stem from age, skill level, culture, and sensory, motor, or
intellectual impairments. For example, a user with accented speech may prefer pen
input rather than speech, whereas a visually impaired user may prefer speech input
and text-to-speech output.
Besides expanding the range of users, multimodal systems can also expand the
usage contexts. Multimodal systems allow users to switch input modes when environ-
mental conditions change or when, in mobile use, the user is temporarily unable to use
a particular input mode. For example, users can use pen input in a noisy environment and
use speech in a quiet environment, and a user of an in-vehicle multimodal application
can use speech when he or she is unable to use gestural input while driving.
Another major motivation for multimodal design is the error avoidance and recov-
ery in multimodal systems. There are user-centered and system-centered reasons why
multimodal systems facilitate error recovery [75]. The user-centered reasons include:
• Users select the input mode they judge less error prone for particular lexical
content, which usually leads to error avoidance. For example, in a speech and
pen system, the user may prefer speech input, but will switch to pen to com-
municate a foreign surname.
• Users’ language is often simplified when interacting multimodally, which leads
to better speech recognition and language understanding. For example, in a
multimodal system involving a room scene, a user wants to move one of the
chairs beside the bed to the window. Using only speech, the user might need
to say “move the left red chair beside the bed to the window”. When using
both speech and gesture, the user only needs to say “move this chair here”,
along with two pointing gestures. This observation is most relevant to the work
presented in this thesis.
• Users tend to switch modes after a system recognition error, which can prevent
repeating errors and facilitate error recovery.
The system-centered reason for error recovery in multimodal systems lies in the mul-
timodal architecture. A well designed multimodal architecture with two semantically
rich input modes can support mutual disambiguation [74] of input signals. Mutual
disambiguation involves disambiguation of signal or semantic-level information in one
input mode from partial information supplied by another input mode. It leads to re-
covery from unimodal recognition errors within a multimodal architecture, with the
net effect of suppressing errors experienced by the user. The mutual disambiguation
of speech and gestural inputs has been successfully demonstrated in [14, 20,48, 106].
2.2 Non-Verbal Modalities in Multimodal Conversational
Systems
Since the appearance of Bolt’s “Put That There” [4] demonstration system, which
supported speech and touch-pad pointing, a variety of new multimodal conversational
systems have emerged. In most of these multimodal conversational systems, the
other modality besides speech is either gesture or eye gaze. Besides speech and
gesture/gaze systems, there are also speech and lip movement systems where speech
is processed with corresponding human lip movement information during human-
computer interaction [24,94,102]. In speech and lip movement systems, the visual
features of human lip movement are fused together with the acoustic features in the
speech decoding process to perform so-called audio-visual speech recognition [79].
The use of lip movement in audio-visual speech recognition is beyond the scope
of this thesis. Moreover, speech recognition is not a focus of this thesis. This thesis
focuses on the use of gesture and eye gaze in improving language understanding
for multimodal conversation. An overview of the use of gesture and eye gaze in
multimodal systems is presented as follows.
2.2.1 Gesture
In speech and gesture systems, spoken language is processed along with its accompa-
nying gestures. The gestural input can be a simple pen-based deictic gesture (e.g.,
pointing, circling) [11,15,40,104,107,108], a complex pen-based gesture involving
symbolic interpretations [20,47,114], or a manual gesture [9,38,59,69].
This thesis focuses on the use of pen-based deictic gesture in spoken language
processing. Deictic gesture is an active input mode, which is deployed by the user
intentionally as an explicit command to the computer system. Deictic gesture has
been widely used in multimodal map-based systems to indicate the focus of the user’s
attention (objects, locations, or areas on the map) [11,25,71,92,95,99]. Beyond only
using deictic gesture as an indicator of the user’s attention focus, in this thesis, we
use deictic gesture to influence the recognition and understanding of the user’s spoken
utterances.
2.2.2 Eye Gaze
Eye gaze has been studied in various research fields such as cognitive science, psy-
cholinguistics, and human-computer interaction. In human-computer interaction, eye
gaze has long been explored for direct manipulation interfaces in which eye gaze is used
as a pointing device [43,56,112,113,120]. Eye gaze as a modality in multimodal inter-
action goes beyond the function of pointing. In different speech and eye gaze systems,
eye gaze has been explored for the purpose of mutual disambiguation [100,121], as a
complement to the speech channel for reference resolution [8,52,80] and speech recog-
nition [21], and for managing human-computer dialogue [87]. Eye gaze has also been
used as a facilitator in computer supported human-human communication [103,105].
In this thesis, we use eye gaze and the gaze perceived visual context to help spoken
language understanding in multimodal conversation.
Cognitive scientists have been studying eye movements to understand brain pro-
cesses [36,88]. In psycholinguistics, eye gaze has been shown to be tightly linked to both
language comprehension [2,23,97] and language production [3,7,30]. Psycholinguis-
tic studies have found that the gaze perceived visual context influences spoken word
recognition and mediates the syntactic processing in real-time spoken language com-
prehension. For language production, psycholinguistic studies found that the user’s
eyes move to the mentioned object directly before speaking a word. These psycholin-
guistic findings are the motivations for this thesis’s work on the use of eye gaze for
spoken language processing in human-computer interaction.
Eye gaze can be captured by eye trackers, which track the user’s eye movements
during human-computer interaction. Two main types of eye trackers have been used in
interaction studies: head mounted and display mounted. Head mounted eye trackers
can provide accurate gaze direction, but they are intrusive. It is unnatural and
inconvenient for a user to interact with the computer system with an eye tracker
mounted on the head. The state-of-the-art eye tracking technologies have enabled the
eye tracking system to be embedded in a monitor. The display mounted eye trackers
are non-intrusive and more appropriate for use in human-computer interaction.
2.3 Using Non-Linguistic Information for Language Under-
standing
This thesis’s work on using non-verbal inputs to improve spoken language under-
standing is inspired by previous research on multimodal language processing and
context-aware language processing.
2.3.1 Multimodal Language Processing
Multimodal language processing combines speech with non-verbal modalities such
as gesture, eye gaze, and lip movements for language processing. There are two
levels of multimodal language processing: 1) feature-level processing; 2) semantic-
level processing.
Feature-Level Processing
Feature-level processing fuses low-level feature information from parallel input signals
in a multimodal architecture. Feature-level processing is most appropriate for closely
synchronized modalities such as speech and lip movements. In audio-visual speech
recognition [79], features of speech and lip movements are first extracted by acoustic
signal processing and vision analysis respectively. The extracted audio features and
visual features are then fused together for speech decoding.
Feature-level multimodal integration of speech and lip movement is beyond the
scope of this thesis. This thesis investigates the use of deictic gesture and eye gaze
in multimodal language processing. These modalities are not as closely coupled
with acoustic speech as lip movement is, so feature-level processing is not
appropriate. Moreover, this thesis focuses on language understanding rather than
speech recognition. In audio-visual speech recognition, extracted acoustic and visual
features are fused for speech decoding. In this thesis, gesture/gaze is incorporated in
language modeling to tailor speech hypotheses for better semantic interpretation.
Semantic-Level Processing
Semantic-level processing integrates semantic information derived from parallel
input modes in a pipelined multimodal architecture (as seen in Figure 1.2). Semantic-
level processing is mostly used for less coupled modalities such as speech and gesture.
In semantic-level processing, the system first recognizes each modality independently
and then creates all possible partial semantic representations individually for each
modality. Then the system uses these partial semantic representations to disam-
biguate each other and form a joint semantic representation [10,44,45]. This fusion
of multimodal input at the semantic level is called late fusion [76].
Late semantic integration systems use individual recognizers for different input
modes. These individual recognizers can be trained using unimodal data, which are
easier to obtain and already publicly available for modalities such as speech [18] and
handwriting [41,61]. Multimodal systems based on semantic fusion can also take
advantage of existing, relatively mature unimodal recognition techniques and off-
the-shelf recognizers, which can be directly integrated in the late semantic integration
architecture. In this respect, multimodal systems based on semantic fusion can be
scaled up more easily in the number of input modes.
Previous work on semantic fusion of multimodal input has been more focused on
the integration of speech and gesture, especially pen-based gesture, than on integra-
tion of speech and eye gaze. In multimodal interaction, pen-based gesture is a much
more reliable input mode for object selection than eye gaze. Moreover, pen-based
gesture can contain more semantic meaning by drawing symbols or writing letters.
Due to the limitation of eye gaze, multimodal integration of speech and eye gaze
has mainly been studied for simple object selection and reference resolution. In the
experiments on object selection [100,121], the user selects an object (icon) on the
screen using speech, and the user’s speech and eye gaze are both used by the system
to decide the selected object. In [121], the user’s speech and eye gaze each generate
an n-best list of potential objects, and the system decides the selected object by
taking the one common to both n-best lists. In [100], the selected object is decided by computing the posterior
probabilities of the objects on screen being selected by the multimodal input. In the
applications of reference resolution [8,52], the object that is fixated by eye gaze prior
to the user’s mention of the object in speech is taken as the referent for simple commands
like “move it there” and “open the door”.
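One simple way to combine two such n-best lists, shown purely for illustration and not necessarily the exact scheme used in [121], is to select the highest-ranked object common to both lists:

    # n-best object hypotheses from speech and from eye gaze, best first (toy data).
    speech_nbest = ["folder_icon", "file_icon", "trash_icon"]
    gaze_nbest = ["file_icon", "folder_icon", "printer_icon"]

    def select_object(speech_nbest, gaze_nbest):
        """Pick the object common to both lists with the best combined rank."""
        gaze_rank = {obj: i for i, obj in enumerate(gaze_nbest)}
        common = [o for o in speech_nbest if o in gaze_rank]
        if not common:
            return None
        return min(common, key=lambda o: speech_nbest.index(o) + gaze_rank[o])

    print(select_object(speech_nbest, gaze_nbest))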
Integration of speech and gesture for multimodal interpretation is more mature
than integration of speech and eye gaze. Many integration approaches have been
explored for a variety of speech and pen-based gesture systems. Those integration
approaches can be categorized into the following types based on their integration
mechanisms: frame-based approaches, unification-based approaches, finite-state ap-
proaches, optimization-based approaches, and statistics-based approaches.
A frame is a data structure used for knowledge representation. A frame has a
number of slots in it. The slots represent object properties, actions, or an object’s
relation with other frames. Frame-based multimodal integration approaches use indi-
vidual frames to represent semantic meanings obtained from different modalities and
achieve multimodal integration by merging those complementary individual frames to
one unified frame. Frame-based integration approaches have been used in speech and
gesture systems for applications such as multimodal text editing [109], multimodal
drawing [93], and multimodal appointment scheduling [106]. Frame-based approaches
are simple and efficient, but they are application-specific.
Unification-based approaches are derived from computational linguistics, in which
formal logics of typed feature structures have been well developed. The primary
operation in the logics of feature structures is unification: determining the consistency
of two feature structures and combining them into a single feature structure if they
are consistent. Using feature structures for meaning representation, unification-based
approaches achieve multimodal integration by performing unification operation over
the feature structures of different modalities. Compared to frame merging, unification
of typed feature structures provides a more general, formally well-understood, and
reusable mechanism for multimodal integration. Unification-based approaches have
been used in the QuickSet system for the integration of speech and pen-based gesture
input [44,48].
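The unification operation itself can be illustrated with a small recursive sketch over nested dictionaries standing in for feature structures; typing, which a full implementation would also check, is omitted, and the example structures are invented.

    def unify(fs1, fs2):
        """Unify two feature structures; return None if they are inconsistent."""
        if isinstance(fs1, dict) and isinstance(fs2, dict):
            result = dict(fs1)
            for key, value in fs2.items():
                if key in result:
                    merged = unify(result[key], value)
                    if merged is None:
                        return None      # conflicting values: unification fails
                    result[key] = merged
                else:
                    result[key] = value
            return result
        return fs1 if fs1 == fs2 else None

    speech_fs = {"action": "create", "object": {"type": "line", "color": "red"}}
    gesture_fs = {"object": {"type": "line", "start": [0, 0], "end": [5, 3]}}
    print(unify(speech_fs, gesture_fs))                      # merged structure
    print(unify({"object": {"type": "line"}},
                {"object": {"type": "arc"}}))                # None: inconsistent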
Johnston and Bangalore [45,46] employed finite-state transducers to achieve mul-
timodal integration in a multimodal messaging application, in which users interact
with a company directory using synergistic combinations of speech and pen input.
Multimodal context-free grammar (CFG) was introduced for integrating speech and
gesture with finite-state transducers. The finite-state approach enables a tighter cou-
pling of speech and gesture by using gesture to guide speech recognition, which can
lead to improved speech recognition and understanding. However, the finite-state
approach has one major limitation in that it requires a multimodal grammar to be
created to define the language allowed in a particular application domain, which makes
it applicable only to very constrained domains that involve a small vocabulary and
simple expressions.
Optimization-based approaches use optimization methods of machine learning for
multimodal integration. Chai et al. [14] modeled integration of multimodal inputs as
graph matching and applied the graph-based approach to achieve reference resolution
in a map-based real estate domain, where users use speech and gesture to inquire
about real estate information. In [27], for the purpose of multimodal reference resolution, gestures
and spoken words are aligned by minimizing a penalty function defined to penalize
the gesture-speech bindings that violate the empirically preferred binding rules.
Wu et al. [116] proposed a statistical hierarchical framework, Members-Teams-
Committee (MTC), for the integration of speech and gesture in a simulated commu-
nity fire and flood control domain. In this framework, all possible multimodal in-
terpretations are predefined and the interpretation of a multimodal input is decided
by the posterior probabilities of unimodal speech and gesture recognition hypotheses
and the statistics of predefined multimodal interpretations. Since this statistics-based
approach requires all possible speech and gesture interpretations to be pre-defined for
a particular domain, it is only appropriate for constrained domains involving simple
speech and gesture commands.
In the above late semantic fusion approaches, information from multiple modalities
is only used at the fusion stage. Some low probability information (e.g., recognized
alternatives with low probabilities) that could turn out to be very crucial in terms
of the overall interpretation may never reach the fusion stage. Therefore, it is desir-
able to use information from multiple sources at an earlier stage, for example, using
one modality to facilitate semantic processing of another modality. Addressing this
problem in late semantic fusion, Chapter 4 of this thesis presents the use of deic-
tic gesture and eye gaze in an earlier stage to facilitate language processing before
semantic fusion.
2.3.2 Context-aware Language Processing
The context of human-computer interaction constrains how a user is likely to interact
with the system, and thus can be utilized for user language interpretation. A variety of
research work has been done on using contextual information for spoken language
processing. There are mainly two types of context used in context-aware language
processing: conversation context and domain context.
All information related to the discourse prior to an utterance constitutes the
conversation context of the utterance. Chotimongkol and Rudnicky [16] used a con-
versation contextual feature to improve speech recognition and understanding by
rescoring the n-best output of the speech recognizer with a linear regression model. The
conversation contextual feature was represented by the correlation of the current
user utterance and the previous system utterance. Solsona et al. [96] combined a con-
versation context-specific finite state grammar (FSG) and a general n-gram model to
improve speech recognition for a conversational system. The conversation context
was represented by the types of previous system prompts and questions. Lemon and
Gruenstein [62] also built conversation context-specific grammars to improve speech
recognition and understanding. The conversation context was represented by the
types of dialog moves. Gruenstein et al. [34] built a context-sensitive class-based n-gram
model to improve speech recognition for a flight reservation system. The conversation
context was represented by the current information state, which indicates whether
certain information about the flight has been collected from previous conversation.
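A pattern common to these systems is to combine a general language model with a context-specific one, for example by rescoring n-best speech hypotheses with an interpolated score. The sketch below, with invented scores and weights, shows this general pattern rather than any one of the cited systems.

    # n-best hypotheses with general language model scores (log-probabilities).
    nbest = [("what is the prize", -12.0), ("what is the price", -12.5)]

    # Scores from a context-specific model, e.g., one trained on utterances
    # produced in the same dialog state.
    context_scores = {"what is the price": -8.0, "what is the prize": -14.0}

    def rescore(nbest, context_scores, lam=0.5):
        """Interpolate general and context-specific scores, then re-rank."""
        def score(hyp, general):
            return (1 - lam) * general + lam * context_scores.get(hyp, -20.0)
        return sorted(nbest, key=lambda h: score(*h), reverse=True)

    print(rescore(nbest, context_scores)[0][0])   # -> "what is the price"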
All domain related information constitutes the domain context, which could be
the visual content of the graphical display in a domain, or the task knowledge in
a specific domain application. Roy and Mukherjee [89] incorporated visual domain
context in language model to improve spoken language comprehension in a synthetic
visual scene description domain. The visual context was represented by the visual
features (e.g., color, size, shape) of the objects in the scene. Coen et al. [19] built
visual context-specific grammars to improve speech recognition and understanding
in an Intelligent Room where a user can operate computer-controlled devices by
speaking. What is currently near the user in the room constitutes the visual con-
text. Carbini et al. [9] used domain contextual information to help interpretation
of ambiguous speech-gesture commands and enable short multimodal commands in
a chess game domain. The domain contextual constraints include the displacement
rules of the chess game and the current game position. Gorniak and Roy [29] incorporated
both physical domain context and conceptual domain task related context to resolve
spoken referring expressions in a 3D game domain. The physical context includes
information about the physical objects in the game, such as location and type of the
objects. The conceptual context consists of a set of hierarchical plan fragments to
complete the specific task of the game. Due to the constrained game setting, users
must follow certain steps to complete the task. Therefore, given the previous steps
(physical context) and the hierarchical plan segments (task conceptual context), it is
possible to predict which plan fragment the user will take, specifically which object
the user is likely to refer to in his or her spoken commands.
Motivated by context-aware language processing, Chapters 4 & 5 of this thesis
investigate the use of domain contextual information for improving speech recognition
and understanding. Different from the context in previous work, the domain context
in this thesis work is dynamically signaled by non-verbal modalities such as gesture
and eye gaze during multimodal conversation. Cooke [21] also explored the use of
eye gaze for spoken language processing in a map route description domain. In
[21], eye gaze was used to improve speech recognition by rescoring the n-best list of
speech recognition with the landmark-specific n-gram models that correspond to the
gaze fixated landmarks. Different from [21], in this thesis, we explore more ways of
integrating eye gaze in spoken language processing and present a better integration
strategy than n-best list rescoring for the use of eye gaze in speech recognition.
2.4 Automatic Word Acquisition
Word acquisition is the process of learning the semantic meanings of new words. In this thesis,
we focus on automatic word acquisition by a computer system during human-
computer interaction. The purpose of automatic word acquisition is to enlarge the
system’s knowledge base of vocabulary and therefore better interpret the user’s spoken
language.
In the conversational systems with which users interact through a visual scene,
users talk to the system based on what is being shown on the scene and the system
“understands” the user’s language by mapping the spoken words to the semantic
concepts in its domain knowledge base. These semantic concepts of words represent
the visual entities and their properties in the domain. For these systems, the specific
task of word acquisition is to ground words to the visual entities and their related
properties in the domain. Word acquisition by grounding words to visual entities has
been studied in various language acquisition systems.
Sankar and Gorin [91] acquired words by grounding words to visual properties
(color, shape) of objects in a synthetic blocks world, in which the user interacted with
the system by typing sentences. The system started with no semantic associations
of words and visual properties. The only innate knowledge of the system was the
semantic-level signals "good" and "no". During the human-computer interaction,
the user instructed the system to focus on certain objects and gave responses (e.g.,
“good”, “no”) indicating whether the system followed the instructions correctly. The
goal of the system was to learn to focus on the object that the user referred to
by building associations of words and visual properties. The mutual information
between the occurrences of words and object shape/color types was used to evaluate
the strength of the association of a word and a color/shape type.
Roy and Pentland [90] proposed a computational model that could learn words
directly from raw multimodal sensory input. In their experiments, infant caregivers
were asked to play with toys with their infants while producing infant-directed speech. Given
speech paired with video images of single objects (toys), the temporal correlation of
speech and vision was used to learn words by associating the automatically segmented
acoustic phone sequences with the visual prototypes (color, shape, size) of the objects.
Yu and Ballard [118] investigated word learning in a visual scene description do-
main in which users were asked to describe nine office objects on a desk and how
to use these office tools. Given speech and the co-occurring video images captured
by a head-mounted camera, a generative model was used to find the associations of
automatically recognized spoken words and visual objects.
Towards the goal of robust multimodal interpretation, this thesis explores the
use of eye gaze for automatic word acquisition. Eye gaze is an implicit and subcon-
scious input, which brings additional challenges into word acquisition. Eye gaze has
been explored for word acquisition in [117], in which eye gaze and other non-verbal
modalities such as the user’s perspective video image and hand movement were used
together with speech to learn words. In the experiments, users were asked to describe
what they were doing while performing three required activities: “stapling a letter”,
“pouring water”, and “unscrewing a jar”. A head-mounted eye tracker and camera
were used to capture gaze and video data. Given speech paired with gaze positions
and video images, a translation model was used to associate acoustic phone sequences
to the four objects and nine actions in the domain.
Liu et al. [66] also investigated the use of eye gaze for word acquisition. In [66],
speech and eye gaze data were collected from simplified human-computer conversation
in which users verbally answered the system’s questions about the decoration of a 3D
room. A translation model was used to acquire words from transcribed speech and
its accompanying gaze fixations.
This thesis’s work on the use of eye gaze for word acquisition is different from
previous work. Besides gaze positions, we use extra information such as speech-gaze
temporal information and domain semantic knowledge to facilitate word acquisition.
Moreover, not all co-occurring speech and gaze data are useful for word acquisition.
This was not considered in the previous work on using eye gaze for word acquisition.
In this thesis, we investigate the automatic identification of “useful” speech and gaze
fixations and its application to word acquisition.
CHAPTER 3
A Multimodal Conversational System
To explore the incorporation of non-verbal modalities in language interpretation dur-
ing multimodal conversation, we built a multimodal conversational system that sup-
ports speech, deictic gesture, and eye gaze inputs. This chapter presents the archi-
tecture of the system and the processing of different input modalities.
3.1 System Architecture
Our multimodal conversational system is built on a client/server architecture as shown
in Figure 3.1. In this architecture, the user interacts with the client, a graphical inter-
face, using speech and other modalities (e.g., deictic gesture, eye gaze). The results of
speech recognition and gesture/gaze recognition are sent to the server via a TCP/IP
network. The Multimodal Interpreter derives semantic meaning of the user’s mul-
timodal input and sends the interpretation result to a dialog manager. The Dialog
Manager controls the interaction flow and decides what the system should do based
on the interpretation of the user’s input. The Presentation Manager decides how to
present the system’s responses to the user and transmits the responses to the client
through the network. The system’s responses are presented to the user on the client
by graphics and/or speech.
Figure 3.1. Multimodal conversational system architecture
3.2 Input Modalities
Users can interact with our multimodal conversational system using speech, deictic
gesture, and eye gaze.
3.2.1 Speech
As the major input mode in multimodal conversational systems, speech enables users
to interact with the system naturally and efficiently. To be able to give intelligent
replies to the user, the system first needs to recognize the user’s speech. Speech
recognition converts acoustic speech signals to text. Automatic speech recognition
(ASR) has progressed steadily over the last three decades, resulting in commercial
ASR systems that can recognize human speech with sufficient accuracy
under optimal conditions. However, during natural conversation, environment noise
and disfluency in users’ speech can deteriorate speech recognition performance signif-
icantly. Accents in users' speech can also make speech recognition difficult. For these
reasons, speech recognition remains a major bottleneck in building robust
conversational systems.
The CMU Sphinx-4 speech recognizer [111] is used in our system for recognizing
users' spoken utterances. Sphinx-4 is an open-source speech recognizer based on
Hidden Markov Models (HMMs).
How non-verbal modalities can be incorporated to improve speech recognition is
presented in Chapter 4.
3.2.2 Deictic Gesture
Besides speech, users can use deictic gesture (e.g., pointing, circling on a graphical
display) to make interaction easier. For example, instead of saying “how much is the red
chair in the left corner?”, the user can say “how much is this chair?” while pointing
to the attended chair on the screen.
In our system, users’ deictic gestures are captured by a touch screen. Based on the
position of the gesture on the screen, we can infer which object the user is referring to.
How this gestural information can help recognize and understand the users’ speech is
presented in Chapter 4 and Chapter 5.
3.2.3 Eye Gaze
Eye gaze indicates the user’s focus of attention [26,49,101]. The published results
on eye gaze and human language production have led to the hypothesis that users
tend to look at the objects on the graphical display when they are talking about
them. Based on this hypothesis, by tracking the user’s eye gaze during human-
machine conversation, the system is likely to infer the user’s attended objects on
the screen and use this attention information to help recognize and understand the
user’s speech. Moreover, using eye gaze information, the system can potentially learn
new words from the user’s language by associating semantics of the attended objects
(indicated by eye gaze) with words in the user’s spoken utterances.
Eye gaze is captured by an eye tracker. The raw gaze data points consist of
the screen coordinates of each gaze point with a particular timestamp. As shown in
Figure 3.2(a), this raw data is not very useful for identifying fixated objects. The
raw gaze data is processed to eliminate the invalid and saccadic gaze points, leaving
only pertinent eye fixations. Invalid gaze points occur when users look off the screen.
Saccadic gaze points occur during ballistic eye movements between fixations. Vision
studies have shown that no visual processing occurs in the human mind during sac-
cades (i.e., saccadic suppression). It is well known that eyes do not stay still, but
rather make small, frequent jerky movements. In order to best determine fixation lo-
cations, nearby gaze points are averaged together to identify fixations. The processed
eye gaze fixations are shown in Figure 3.2(b).
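The exact fixation-detection procedure is not spelled out here, but the processing described above can be sketched roughly as follows; the screen bounds, dispersion threshold, and minimum-duration threshold are illustrative assumptions rather than the system's actual parameter values.

# Rough sketch of the gaze preprocessing described above: drop invalid (off-screen)
# points, treat short runs as saccades, and average nearby points into fixations.
# Assumes the gaze points are ordered by time; all thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class GazePoint:
    t: int        # timestamp in milliseconds
    x: float      # screen coordinates
    y: float

@dataclass
class Fixation:
    t_start: int
    duration: int
    x: float
    y: float

def detect_fixations(points, screen_w=1024, screen_h=768,
                     dispersion=30.0, min_duration=100):
    valid = [p for p in points if 0 <= p.x < screen_w and 0 <= p.y < screen_h]
    fixations, cluster = [], []
    for p in valid:
        if cluster and (abs(p.x - cluster[0].x) > dispersion or
                        abs(p.y - cluster[0].y) > dispersion):
            _flush(cluster, fixations, min_duration)
            cluster = []
        cluster.append(p)
    _flush(cluster, fixations, min_duration)
    return fixations

def _flush(cluster, fixations, min_duration):
    if not cluster:
        return
    duration = cluster[-1].t - cluster[0].t
    if duration >= min_duration:      # shorter runs are treated as saccadic movement
        fixations.append(Fixation(cluster[0].t, duration,
                                  sum(p.x for p in cluster) / len(cluster),
                                  sum(p.y for p in cluster) / len(cluster)))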
Figure 3.2. Eye gaze on a scene: (a) raw gaze points; (b) processed gaze fixations
How eye gaze information can be used in language models to potentially help
spoken language processing is presented in Chapter 4. How eye gaze information is
used for automatic vocabulary acquisition in multimodal conversation is presented in
Chapter 6.
3.3 Domains of Application
Two application domains were designed and implemented for our investigation. Both
domains were constructed based on 3D graphics.
3.3.1 Interior Decoration
Figure 3.3 shows the 3D interior decoration domain. In this domain, users can interact
with the system using both speech and deictic gestures to query information about the
entities or arrange the room by adding, removing, moving, and coloring the entities.
For example, the user may say “remove this lamp” or ask “what’s the power of this
lamp?” while pointing at a lamp in the scene.
Figure 3.3. A 3D interior decoration domain
There are 13 types of entities (3D objects, e.g., chair, bed, lamp) in this domain.
3.3.2 Treasure Hunting
Figure 3.4 shows the 3D treasure hunting domain. In this domain, users walk around
in a 3D castle trying to find treasures that are hidden somewhere in the rooms of the
castle. Unlike the interior decoration domain, where users give spoken commands to
the system to move around and change decoration, in the treasure hunting domain,
users walk around inside the castle and move objects by themselves, but the user has
to talk to the system to get hints about where to find the treasure. Users’ eye gaze
fixations are recorded during the human—machine conversation.
Figure 3.4. A treasure hunting domain
Compared to the interior decoration domain, the treasure hunting domain pro-
vides a richer interactive environment that involves more complex scenes and tasks,
which enables studies on automatic vocabulary acquisition during human-machine
conversation.
The underlying architecture supporting these two domains can be used to develop
similar 3D applications such as virtual tourism guides and virtual-reality personnel
training.
CHAPTER 4
Incorporation of Non-verbal Modalities in
Language Models for Spoken Language
Processing
In multimodal conversational systems, speech recognition performance is critical in
interpreting user inputs. Only after speech is correctly recognized is the system able
to further extract semantic meaning from the recognized hypothesis. Although mutual
disambiguation of multiple modalities [74] can alleviate the problem with speech
recognition, speech recognition is still a bottleneck to achieving robust multimodal
interpretation.
This chapter presents the use of non-verbal modalities to help speech recognition
in multimodal conversation. In particular, we describe a salience driven approach to
incorporate the contextual information activated by deictic gesture and eye gaze in
speech recognition. This approach combines gesture-based and gaze-based salience
modeling with language modeling. We further describe the application of the salience
driven language models in speech recognition across different stages and present eval-
uation results.
4.1 A Salience Driven Framework
In this section, we first introduce the notion of salience and its applications in language
processing, then describe a salience driven framework for interpretation of language
in multimodal conversation.
4.1.1 Salience
Salience modeling has been used in both natural language and multimodal language
processing. Linguistic salience describes the accessibility of entities in a hearer's
memory and its implications for language production and interpretation. Many
theories on linguistic salience have been developed, including how the salience of
entities affects the form of referring expression as in the Givenness Hierarchy [35] and
the local coherence of discourse as in the Centering Theory [32]. Linguistic salience
modeling has been used for both language generation [98] and language interpretation.
Most salience-based language interpretation work has focused on reference resolution [27,
42, 58].
Visual salience measures how much attention an entity attracts from a user. An
entity is more salient when it attracts a user’s attention more than other entities. The
cause of such attention depends on many factors including user intention, familiarity,
and physical characteristics of objects. For example, an object may be salient when
it has some properties that the others do not have, such as being the only one that is
highlighted, or the only one of its size, category, or color [57]. Visual salience can also
be useful in multimodal language interpretation. Studies have shown that a user’s
perceived salience of entities on the graphical interface can tailor the user’s referring
expressions and thus can be used for multimodal reference resolution [54].
4.1.2 Salience Driven Interpretation of Spoken Language in Multimodal
Conversation
During multimodal conversation, a user’s deictic gesture or eye gaze fixation on the
graphical display indicates the user's attention and therefore indicates salient entities.
The more likely an entity is to be selected by a gesture or eye gaze, the more salient
this entity is.
We developed a salience driven framework [13] for language interpretation in multi-
modal conversational systems. Figure 4.1 illustrates the salience driven interpretation
of speech in this framework. As shown in the figure, the user’s deictic gesture or eye
gaze fixation on the graphic display signals a distribution of entities that are salient at
that particular time of interaction. The contextual knowledge associated with these
salient objects constitutes the salient context. This salient context can be used to
help speech recognition and understanding by constraining speech hypotheses.
Figure 4.1. Salience driven interpretation (gesture/gaze recognition and speech recognition
feed gesture/gaze understanding and language understanding; the salient context activated
by gesture/gaze constrains language understanding, and multimodal fusion produces the
final semantic representation)
In this framework, there are two important operations involved: 1) the salience
modeling based on gesture/gaze, and 2) the incorporation of salience information in
language processing. We address these two operations in the following sections.
4.2 Gesture-Based Salience Modeling
As mentioned earlier, a deictic gesture on the graphical display can signal the under-
lying context that is salient at that particular time of communication. In other words,
the deictic gesture will activate a salience distribution over entities in the domain.
As illustrated in Figure 4.2, the salience value of an entity e at time t is calculated
based on the probabilities that e is selected by the gestures {g} occurring prior
to time t.
Figure 4.2. Gesture-based salience modeling (the selection probabilities p(e|g_1), p(e|g_2),
p(e|g_3) of the gestures occurring during the utterance are combined, with time-decaying
weights \alpha_{g_1}(t), \alpha_{g_2}(t), \alpha_{g_3}(t), into the salience p_t(e))
More specifically, for an entity e in the domain, its salience value at time t is
calculated as follows [13]:
p_t(e) = \begin{cases} \dfrac{\sum_g \alpha_g(t)\, p(e|g)}{\sum_{e'} \sum_g \alpha_g(t)\, p(e'|g)} & \text{if } \sum_g p(e|g) \neq 0 \\ 0 & \text{if } \sum_g p(e|g) = 0 \end{cases} \qquad (4.1)

where p(e|g) is the probability of entity e being selected by gesture g (calculated
based on the distance from the gesture point to the center of the entity), and \alpha_g(t)
is the weight of gesture g contributing to the salience distribution at time t.

Gesture weight \alpha_g(t) is defined as follows:

\alpha_g(t) = \begin{cases} e^{-\frac{t - t_g}{700}} & t \geq t_g \\ 0 & t < t_g \end{cases} \qquad (4.2)

where t_g stands for the beginning time (in milliseconds) of gesture g. Weight \alpha_g(t)
says that gesture g has more impact on the salience distribution at a time closer to
the gesture's occurrence. Note that at any time t, only gestures occurring before t
(i.e., t \geq t_g) can contribute to the salience distribution at time t.
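A minimal sketch of Equations (4.1) and (4.2), assuming each gesture is given as a pair of its start time and its entity selection probabilities p(e|g); the entity names in the example are hypothetical.

import math

def gesture_weight(t, t_g, decay=700.0):
    """Time-decaying gesture weight alpha_g(t) of Equation (4.2) (as reconstructed)."""
    return math.exp(-(t - t_g) / decay) if t >= t_g else 0.0

def gesture_salience(t, gestures, entities):
    """Gesture-based salience distribution p_t(e) of Equation (4.1).

    gestures: list of (t_g, {entity: p(e|g)}) pairs for gestures occurring before t.
    entities: all entities in the domain.
    """
    raw = {e: sum(gesture_weight(t, t_g) * sel.get(e, 0.0) for t_g, sel in gestures)
           for e in entities}
    total = sum(raw.values())
    if total == 0.0:
        return {e: 0.0 for e in entities}
    return {e: v / total for e, v in raw.items()}

# Example: at t = 3000 ms the more recent gesture dominates the salience.
gestures = [(1000, {"lamp_1": 0.8, "table_1": 0.2}),
            (2800, {"chair_1": 0.9, "bedroom": 0.1})]
print(gesture_salience(3000, gestures, ["lamp_1", "table_1", "chair_1", "bedroom"]))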
4.3 Gaze-Based Salience Modeling
Psycholinguistic experiments have shown that eye gaze is tightly linked to human
language processing. Eye gaze is one of the reliable indicators of what a person is
“thinking about” [37]. The direction of gaze carries information about the focus of the
user's attention [49]. The perceived visual context influences spoken word recognition
and mediates syntactic processing [89,101]. In addition, directly before speaking a
word, the eyes move to the mentioned object [31].
Motivated by these psycholinguistic findings about eye gaze’s link to speech, we
use eye gaze information in salience models to help spoken language processing.
Figure 4.3 shows an excerpt of the speech and gaze fixation stream. In the speech
stream, each word starts at a particular timestamp. In the gaze stream, each gaze
fixation f has a starting timestamp t f and a duration Tf. Gaze fixations can have
different durations. An entity 6 on the graphical display is fixated by gaze fixation f
if the area of 6 contains the fixation point of f. One gaze fixation can fall on multiple
entities or no entity.
Figure 4.3. An excerpt of speech and gaze stream data (the speech stream contains the
utterance "This room has a chandelier" with per-word timestamps; the gaze stream contains
fixations, each with a starting time t_f and duration T_f, falling on entities such as [10]
bedroom, [11] chandelier, [17] lamp_2, [19] bed frame, and [22] door)
We first define a gaze fixation set F_{t_0}^{t_0+T}(e), which contains all gaze fixations that
fall on entity e within a time window t_0 \sim (t_0 + T):

F_{t_0}^{t_0+T}(e) = \{\, f \mid f \text{ falls on } e \text{ within } t_0 \sim (t_0 + T) \,\} \qquad (4.3)

We model gaze-based salience in two ways [82]:

• Gaze Salience Model 1

  Salience model 1 is based on the assumption that when an entity has more gaze
  fixations on it than other entities, this entity is more likely attended by the user
  and thus has higher salience:

  p_{t_0,T}(e) = \frac{\#\{\text{elements in } F_{t_0}^{t_0+T}(e)\}}{\sum_{e'} \#\{\text{elements in } F_{t_0}^{t_0+T}(e')\}} \qquad (4.4)

  Here, p_{t_0,T}(e) tells how likely it is that the user is focusing on entity e within
  the time period t_0 \sim (t_0 + T) based on how many gaze fixations are on e among all
  gaze fixations that fall on entities within t_0 \sim (t_0 + T).

• Gaze Salience Model 2

  Salience model 2 is based on the assumption that when an entity has longer
  gaze fixations on it than other entities, this entity is more likely attended by
  the user and thus has higher salience:

  p_{t_0,T}(e) = \frac{D_{t_0}^{t_0+T}(e)}{\sum_{e'} D_{t_0}^{t_0+T}(e')} \qquad (4.5)

  where

  D_{t_0}^{t_0+T}(e) = \sum_{f \in F_{t_0}^{t_0+T}(e)} T_f \qquad (4.6)

  Here, p_{t_0,T}(e) tells how likely it is that the user is focusing on entity e within
  the time period t_0 \sim (t_0 + T) based on how long e has been fixated by gaze fixations
  among the overall duration of all gaze fixations that fall on entities within
  t_0 \sim (t_0 + T). (Both models are sketched below.)
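A minimal sketch of both gaze salience models (Equations (4.4) through (4.6)), assuming each fixation in the window is given with its duration and the set of entities it falls on; the entity names in the example are hypothetical.

def gaze_salience(fixations, entities, count_based=True):
    """Gaze-based salience p_{t0,T}(e) over a time window.

    fixations: list of (duration_ms, fixated_entities) pairs, already restricted
    to the window t0 ~ (t0 + T).
    count_based=True  -> Model 1 (Equation (4.4), fixation counts)
    count_based=False -> Model 2 (Equations (4.5)-(4.6), fixation durations)
    """
    score = {e: 0.0 for e in entities}
    for duration, fixated in fixations:
        for e in fixated:
            score[e] += 1.0 if count_based else duration
    total = sum(score.values())
    return {e: (v / total if total else 0.0) for e, v in score.items()}

# Example window with three fixations (durations in milliseconds).
window = [(250, ["bedroom"]), (480, ["chandelier_1", "bedroom"]), (120, ["lamp_2"])]
print(gaze_salience(window, ["bedroom", "chandelier_1", "lamp_2"], count_based=False))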
4.4 Salience Driven Language Modeling
Given salience models, the next question is how to incorporate this salient contex-
tual information in language processing. In this section, we describe the building
of salience driven language models for speech recognition. We first give a review of
the typical language models used in speech recognition, then describe how to build
salience driven language models based on those baseline models.
4.4.1 Language Models for Speech Recognition
The task of speech recognition is, given an observed spoken utterance O, to find the
word sequence W^* such that

W^* = \arg\max_W p(O|W)\, p(W) \qquad (4.7)

where p(O|W) is the acoustic model and p(W) is the language model.

In speech recognition systems, the acoustic model provides the probability of
observing the acoustic features given hypothesized word sequences, and the language
model provides the prior probability of a sequence of words. The language model is
represented as follows:

p(W) = p(w_1^n) = p(w_1)\, p(w_2|w_1)\, p(w_3|w_1^2) \cdots p(w_n|w_1^{n-1}) \qquad (4.8)

The language model can be approximated by a bigram model using a first-order
Markov assumption:

p(w_1^n) = \prod_{k=1}^{n} p(w_k|w_{k-1}) \qquad (4.9)

or by a trigram model using a second-order Markov assumption:

p(w_1^n) = \prod_{k=1}^{n} p(w_k|w_{k-1}, w_{k-2}) \qquad (4.10)

By clustering words into classes, the class-based n-gram model reduces the training
data requirement and improves the robustness of probability estimates compared to
the word n-gram model. The class-based bigram model is given by [6]:

p(w_i|w_{i-1}) = p(w_i|c_i)\, p(c_i|c_{i-1}) \qquad (4.11)

where c_i and c_{i-1} are the classes of words w_i and w_{i-1} respectively.

A probabilistic context free grammar (PCFG) can also be used as a language model
in speech recognition by constraining the speech recognizer to generate only gram-
matical sentences as defined by the grammar.
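For concreteness, the following sketch shows unsmoothed maximum-likelihood versions of the bigram probability and the class-based bigram probability of Equations (4.9) and (4.11); the smoothing that is actually needed in practice (e.g., the Katz backoff used in Section 4.4.2) is omitted here, and the toy transcripts are made up.

from collections import Counter

def bigram_mle(sentences):
    """Unsmoothed bigram probabilities p(w_k|w_{k-1}) from tokenized sentences."""
    uni, bi = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        uni.update(tokens[:-1])
        bi.update(zip(tokens[:-1], tokens[1:]))
    return lambda w, prev: (bi[(prev, w)] / uni[prev]) if uni[prev] else 0.0

def class_bigram(p_word_given_class, p_class_given_class, word2class):
    """Class-based bigram p(w_i|w_{i-1}) = p(w_i|c_i) p(c_i|c_{i-1}), Equation (4.11)."""
    def prob(w, prev):
        c, c_prev = word2class[w], word2class[prev]
        return (p_word_given_class.get((w, c), 0.0) *
                p_class_given_class.get((c, c_prev), 0.0))
    return prob

# Example: estimate bigram probabilities from two toy transcripts.
p = bigram_mle([["remove", "this", "lamp"], ["move", "this", "chair"]])
print(p("this", "remove"), p("lamp", "this"))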
4.4.2 Salience Driven N-Gram Models
Statistical n-gram models are widely used in speech recognition. We incorporate
the gesture/gaze-based salience modeling into the bigram model and the class-based
bigram model to build salience driven n-gram models [13, 81] for speech recognition.

• Salience driven bigram model

  The salience driven bigram probability p_s(w_i|w_{i-1}) is given by:

  p_s(w_i|w_{i-1}) = \frac{p(w_i|w_{i-1}) + \lambda \sum_e p(w_i|w_{i-1}, e)\, p_t(e)}{1 + \lambda} \qquad (4.12)

  where p_t(e) is the salience distribution, as modeled in Equation (4.1), and \lambda is the
  priming weight. The priming weight \lambda decides how much the original bigram
  probability will be tailored by the salient entities that are indicated by gestures.
  Currently, we set \lambda = 2 empirically. We also tried to learn the priming weight
  with an EM algorithm. However, we found that the learned \lambda performed
  worse than the empirical one in our experiments. This is partially due to in-
  sufficient development data. Bigram probabilities p(w_i|w_{i-1}) were estimated
  by maximum likelihood estimation using Katz's backoff method [51] with a
  frequency cutoff of 1. The same method was used to estimate p(w_i|w_{i-1}, e)
  from the users' speech transcripts with entity annotation of e.

• Salience driven class-based bigram model

  The salience driven class-based bigram probability p_s(w_i|w_{i-1}) is given by:

  p_s(w_i|w_{i-1}) = \begin{cases} p(c_i|c_{i-1}) \sum_e p(w_i|c_i, e)\, p_t(e) & \text{if } \sum_e p_t(e) \neq 0 \\ p(w_i|w_{i-1}) & \text{if } \sum_e p_t(e) = 0 \end{cases} \qquad (4.13)

  where p_t(e) is the salience distribution, c_i and c_{i-1} are the semantic classes of
  words w_i and w_{i-1} respectively, and p(w_i|c_i, e) is learned with maximum likelihood
  estimation from the utterances talking about entity e. (A sketch of both models
  follows.)
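A sketch of the salience driven interpolation in Equations (4.12) and (4.13); the probability estimates p_bigram, p_bigram_entity, p_class_trans, and p_word_class_entity are assumed to be given (for example, estimated with the backoff method described above).

def salience_bigram(w, prev, salience, p_bigram, p_bigram_entity, lam=2.0):
    """Salience driven bigram p_s(w_i|w_{i-1}) of Equation (4.12).

    salience: dict entity -> p_t(e); p_bigram(w, prev) and p_bigram_entity(w, prev, e)
    are smoothed probability estimates that are assumed to be given.
    """
    primed = sum(p_bigram_entity(w, prev, e) * p for e, p in salience.items())
    return (p_bigram(w, prev) + lam * primed) / (1.0 + lam)

def salience_class_bigram(w, prev, salience, p_bigram,
                          p_class_trans, p_word_class_entity, word2class):
    """Salience driven class-based bigram p_s(w_i|w_{i-1}) of Equation (4.13)."""
    if sum(salience.values()) == 0.0:
        return p_bigram(w, prev)          # no salient entity: fall back to the bigram
    c, c_prev = word2class[w], word2class[prev]
    primed = sum(p_word_class_entity(w, c, e) * p for e, p in salience.items())
    return p_class_trans(c, c_prev) * primed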
4.4.3 Salience Driven PCFG
Building a salience driven PCFG [81] as the language model involves three steps: 1) con-
struct a context free grammar (CFG) specific to the application domain; 2) for each
entity in the domain, train an entity-specific PCFG based on the utterances talking about
that particular entity; 3) create the salience driven PCFG based on the entity salience
distribution and the entity-specific PCFGs.

More specifically, we build the salience driven PCFG for the 3D interior decoration
domain (Section 3.3.1) as follows. Based on the domain knowledge, we first define
a domain-specific CFG as shown in Figure 4.4. This CFG covers all the language
that is "legal" in the interior decoration domain. An utterance is said to be "legal"
in the domain if a semantic representation specific to the domain can be built from
the utterance. The defined grammar covers "legal" commands like "this table",
"remove this chair", "move this plant on this table", and query questions like "how
much is this table?", "who is the artist of this painting?", and "what is the wattage of this
lamp?".
S   → NP | VP | WRB JJ VBZ NP | WRB JJ NN VBZ NP VB |
      WP VBZ NP PP | WRB VBZ NP VBN | VBZ NP NP
VP  → VB NP | VB NP PP | VB NP JJ | VB NP RB
NP  → NN | DT NN | PRP
PP  → IN DT NN | TO DT NN
WP  → what | who
WRB → how | where
JJ  → big | black | blue | dark | expensive | gray | green | ...
VBZ → does | is
VB  → add | align | bring | buy | change | delete | ...
RB  → back | backward | backwards | down | forward | here | ...
NN  → age | alternative | artist | artwork | back | bar | bed | ...
DT  → a | an | that | the | these | this | those
PRP → it | them
IN  → about | above | against | among | around | at | behind | ...
TO  → to
VBN → made | produced

Figure 4.4. Context free grammar for the 3D interior decoration domain
We build the entity-specific PCFGs by first using the Stanford Parser [55] to parse
users’ transcribed utterances, then for each entity e in the domain, training a PCFG
with maximum likelihood estimation based on the utterances talking about entity e.
In the trained PCFG, only the lexicon-part rules are associated with probabilities.
An example of a trained PCFG for entity lamp is shown in Figure 4.5. The PCFG
in Figure 4.5 is in the Java Speech Grammar Format (JSGF), and the numbers inside
the "/ /" delimiters are the weights of the rules. When normalized, the weights are the rule
probabilities. As we can see in Figure 4.5, the words closely related to entity lamp
such as "lamp" and "wattage" achieve higher weights in the trained PCFG. This means
that words closely related to lamp are more likely to be chosen during the speech
recognition process when the entity lamp is salient.
<DT>  = /117/ this | /59/ the | /16/ that | /3/ these | /1/ those |
        /1/ a | /1/ an;
<IN>  = /34/ of | /17/ on | /10/ about | /7/ with | /4/ in |
        /2/ behind | ...;
<JJ>  = /8/ many | /2/ much | /1/ small | /1/ left | /1/ expensive | ...;
<NN>  = /144/ lamp | /24/ wattage | /7/ place | /7/ information |
        /6/ table | ...;
<PRP> = /3/ it | /1/ them;
<RB>  = /9/ here | /2/ back | /2/ up | /2/ there;
<TO>  = to;
<VB>  = /27/ remove | /18/ move | /7/ show | /6/ put |
        /6/ change | ...;
<VBN> = /2/ made | /1/ produced;
<VBZ> = /30/ is | /3/ does;
<WP>  = /26/ what | /4/ who;
<WRB> = /9/ how | /5/ where;

Figure 4.5. Trained PCFG for entity lamp in the 3D interior decoration domain (weighted
lexical rules; nonterminal names as in Figure 4.4)
Given the entity-specific PCFGs, the salience driven PCFG is created by combining the
PCFGs associated with the salient entities. The weight of a rule r in the salience
driven PCFG is given by:

w(r) = \sum_e w_e(r)\, p(e) \qquad (4.14)

where p(e) is the salience distribution and w_e(r) is the weight of rule r in the PCFG
specific to entity e.
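Equation (4.14) amounts to a salience-weighted sum of the entity-specific rule weights; below is a minimal sketch, assuming each entity-specific grammar is represented as a mapping from rules to weights (the example rules and most numbers are illustrative).

def salience_pcfg(entity_grammars, salience):
    """Combine entity-specific rule weights into salience driven weights, Equation (4.14).

    entity_grammars: dict entity -> {rule: weight w_e(r)}
    salience:        dict entity -> p(e)
    Returns {rule: w(r)} with w(r) = sum_e w_e(r) * p(e).
    """
    combined = {}
    for e, grammar in entity_grammars.items():
        for rule, weight in grammar.items():
            combined[rule] = combined.get(rule, 0.0) + weight * salience.get(e, 0.0)
    return combined

# Illustrative example with two entities and a few lexical rules (weights are made up,
# except the lamp/wattage weights, which follow Figure 4.5).
grammars = {"lamp_1":  {("NN", "lamp"): 144, ("NN", "wattage"): 24},
            "table_1": {("NN", "table"): 90,  ("NN", "lamp"): 3}}
print(salience_pcfg(grammars, {"lamp_1": 0.7, "table_1": 0.3}))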
4.5 Application of Salience Driven Language Models for ASR
The salience driven language models can be integrated into speech recognition in two
stages: an early stage, before the word lattice (n-best list) is generated, or a later stage,
where the word lattice (n-best list) is post-processed (Figure 4.6).
Figure 4.6. Application of salience driven language model in speech recognition: (a) early
application, where the speech decoder uses the gesture/gaze-tailored language model
together with the acoustic model to generate the word lattice (n-best list); (b) late
application, where a word lattice (n-best list) generated with a basic language model is
rescored by the salience driven language model
4.5.1 Early Application
For the early application, as Figure 4.6(a) shows, the salience driven language model
is used together with the acoustic model to generate the word lattice, typically by
Viterbi search.

Compared to n-gram models, CFG-based language models place stricter con-
straints on the speech recognition process, specifically on choosing the next set of
possible words following a path during the search process. When an n-gram model
is used, the next set of possible words includes any word in the vocabulary with a non-
zero transition probability (as specified by the n-gram model) from the previous n-1
words along the path. When a CFG-based language model is used, the next set of
possible words only includes those words allowed by the grammar.
4.5.2 Late Application
For the late application, as shown in Figure 4.6(b), the salience driven n-gram lan-
guage model is used to rescore the word lattice generated by a speech recognizer with
a basic language model not involving salience modeling. A word lattice consists of
a list of nodes and edges (Figure 4.7). In the word lattice, each node represents a
word hypothesis and each edge represents a word transition. Each path going from
the start node to the end node forms a sentence recognition hypothesis.
Given a word lattice, A* search can be applied to find the n-best paths in the word
lattice.

Figure 4.7. A* search in word lattice
A* search finds in a graph the optimal path from a given initial node to a given
goal node. Specifically, in the word lattice shown in Figure 4.7, the task of A* search
is to find a path from the sentence start node "<s>" to the sentence end node "</s>" that
has the highest score. The score of a path L = (w_0, w_1, \ldots, w_n) is defined as

\mathrm{score}(L) = \sum_{i=0}^{n} \left( \log p_a(w_i) + \log p(w_i|w_{i-1}) \right) \qquad (4.15)

where p_a(w_i) is the acoustic model probability and p(w_i|w_{i-1}) is the language model
probability. The language model probabilities can be tailored by the salience driven
language models described in Section 4.4.2.

In the word lattice, each node (i.e., a word hypothesis) is associated with a score.
The score of a word w_i depends on two parts: the true score g(w_i), which measures the
actual score of the path from the start node to the current node, and the heuristic
score h(w_i), which measures the expected score of the path from the current node to
the goal node. In each step of the A* search, the next node to expand is chosen as
the one with the highest score g(w_i) + h(w_i) among the ending nodes of all previously
explored partial paths.

Before the A* search begins, the heuristic at each node w_i is first calculated:

h(w_i) = \max_k \left\{ h(w_{i+1}^k) + \log p_a(w_{i+1}^k) + \log p(w_{i+1}^k|w_i) \right\} \qquad (4.16)

where w_{i+1}^k ranges over the successors of w_i and h(</s>) = 0.

During the A* search, the score of the path up to node w_i is calculated as

g(w_i) = g(w_{i-1}) + \log p_a(w_i) + \log p(w_i|w_{i-1}) \qquad (4.17)

where g(<s>) = 0.
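A compact sketch of this search, assuming the word lattice is given as a directed acyclic graph with per-node acoustic log probabilities and per-edge language model log probabilities (the latter possibly tailored by a salience driven model); it returns the single best path, and the n-best paths can be obtained by continuing to pop complete paths from the queue.

import heapq

def astar_best_path(succ, log_pa, log_plm, start="<s>", goal="</s>"):
    """Best path through a word lattice under the path score of Equation (4.15).

    succ:    dict node -> list of successor nodes (a DAG; every node reaches the goal)
    log_pa:  dict node -> acoustic log probability log p_a(w)  (use 0.0 for "</s>")
    log_plm: dict (prev, node) -> language model log probability log p(w|prev)
    """
    # Backward pass: heuristic h(w) = best achievable score from w to the goal
    # (Equation (4.16)), computed in an order where successors come first.
    order = _post_order(succ, start)
    h = {goal: 0.0}
    for node in order:
        if node == goal:
            continue
        h[node] = max(h[nxt] + log_pa[nxt] + log_plm[(node, nxt)] for nxt in succ[node])
    # Forward A*: always expand the partial path with the highest g + h.
    frontier = [(-(h[start]), 0.0, start, [start])]
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g                    # best complete hypothesis and its score
        for nxt in succ[node]:
            g_next = g + log_pa[nxt] + log_plm[(node, nxt)]   # Equation (4.17)
            heapq.heappush(frontier, (-(g_next + h[nxt]), g_next, nxt, path + [nxt]))
    return None, float("-inf")

def _post_order(succ, start):
    """Depth-first post-order: every node appears after all of its successors."""
    order, seen = [], set()
    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for m in succ.get(n, []):
            visit(m)
        order.append(n)
    visit(start)
    return order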
A late application of a gaze-tailored language model was reported in [21], where the
language model tailored by eye gaze was used to directly reorder the n-best list of
speech recognition to get better 1-best recognition. We will show in Section 4.6.5 that
the early application works better than the late application.
4.6 Evaluation
In the 3D interior decoration domain, we empirically evaluate the different salience
driven language models when applied at the two stages for speech recognition.
4.6.1 Speech and Gesture Data Collection
We conducted a wizard-of-Oz study to collect speech and gesture data for our eval-
uation using the system described in Chapter 3. In the study, users were asked to
44
accomplish two tasks. Task 1 was to clean up and redecorate a messy room. Task
2 was to arrange and decorate the room so that it looks like the room in the pic-
tures provided to the user. Each of these tasks put the user into a specific role (e.g.,
college student, professor, etc.), and the task had to be completed under a set of con-
straints (e.g., budget of furnishings, bed size, number of domestic products, etc.). A
detailed description of the user study in the interior decoration domain is given in
Appendix A.1.
From 5 users’ interactions with the system, we collected 649 utterances with ac-
companying gestures. The vocabulary size of the collected utterances is 250 words.
Each utterance was transcribed and annotated with referred entities. For example,
an utterance like “remove this lamp” accompanied by a deictic gesture was annotated
with the true entity lamp_1 as indicated by the gesture, while an utterance like “move
this lamp to this table” accompanied by two deictic gestures was annotated with the
entities lamp_1 and table_1 as indicated by the two gestures respectively.
Each gesture results in a set of possibly selected entities. The selection probabili-
ties of the entities are calculated based on the distances from the gesture point to the
center of the entities.
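The exact distance-to-probability mapping is not given here; one plausible sketch, purely an assumption for illustration, scores each entity with a Gaussian kernel of the distance from the gesture point to the entity center and normalizes the scores.

import math

def gesture_selection_probs(gesture_xy, entity_centers, sigma=80.0):
    """Assumed distance-based selection model: closer entities get higher p(e|g).

    entity_centers: dict entity -> (x, y) screen coordinates of the entity center.
    A Gaussian kernel over the gesture-to-center distance (bandwidth sigma, in
    pixels) is one plausible choice; only the use of distance is specified above.
    """
    gx, gy = gesture_xy
    score = {e: math.exp(-((gx - x) ** 2 + (gy - y) ** 2) / (2 * sigma ** 2))
             for e, (x, y) in entity_centers.items()}
    total = sum(score.values())
    return {e: s / total for e, s in score.items()}

# Example with hypothetical entity centers; the gesture point is near picture_girl.
print(gesture_selection_probs((613, 183), {"bedroom": (400, 300),
                                           "picture_girl": (620, 190),
                                           "table_pc": (700, 420)}))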
All the collected data, together with the speech transcripts and entity annotation,
are saved in XML format. Figure 4.8 shows an excerpt from one of the XML data
files. The excerpt is the record of one turn in the conversation between the system and
one user. In this turn, the user pointed to the entity picture_girl and said “flip this
picture one hundred eighty degrees”. The pointing gesture resulted in an ambiguous
selection of three entities (bedroom, picture_girl, table_pc) with different probabilities.

Figure 4.8. An excerpt of an XML data file (this turn's record contains the gesture point
at screen coordinates (613, 183); the candidate entities with their selection probabilities,
bedroom: 0.4580, picture_girl: 0.5307, table_pc: 0.0113; the annotated referent
picture_girl; the speech transcript "flip this picture one hundred eighty degrees"; and the
name of the recorded audio file, 2005916-144311-707.wav)

4.6.2 Evaluation Results on Speech and Gesture Data

We compare the performances of the following different language models trained in
our domain:
• Standard bigram model (Bigram)

• Standard trigram model (Trigram)

• Standard class-based bigram model (C-Bigram)

• Salience driven bigram model (S-Bigram)

• Salience driven class-based bigram model (S-C-Bigram)

• Standard PCFG (PCFG)

• Salience driven PCFG (S-PCFG)
The evaluation metrics include the following aspects related to recognition results:
• Word error rate of the best hypothesis (WER)

• Word lattice WER (Lattice-WER)
  The minimal WER over all possible paths through the word lattice (the output of
  speech recognition).

Since we are building a conversational system, we are also interested in the fol-
lowing metrics related to semantic interpretation:

• Concept identification precision (CI-Precision)
  The percentage of correctly identified concepts out of the total number of con-
  cepts in the 1-best recognition hypothesis.

• Concept identification recall (CI-Recall)
  The percentage of correctly identified concepts out of the total number of con-
  cepts in a user's utterance (speech transcript).

• F-measurement (F-score)

  \text{F-score} = \frac{(\beta^2 + 1) \times \text{CI-Precision} \times \text{CI-Recall}}{\beta^2 \times \text{CI-Precision} + \text{CI-Recall}}

  where \beta = 1 in this experiment (see the sketch following this list).
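A minimal sketch of these metrics, assuming the concepts extracted from the 1-best hypothesis and from the transcript are given as sets of (concept, value) pairs; the example concepts are made up.

def concept_metrics(hyp_concepts, ref_concepts, beta=1.0):
    """CI-Precision, CI-Recall, and F-score over identified concepts.

    hyp_concepts: concepts extracted from the 1-best recognition hypothesis.
    ref_concepts: concepts in the user's utterance (speech transcript).
    """
    hyp, ref = set(hyp_concepts), set(ref_concepts)
    correct = len(hyp & ref)
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f

# Example: one concept recognized correctly, one substituted.
print(concept_metrics([("action", "remove"), ("object", "bed")],
                      [("action", "remove"), ("object", "lamp")]))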
The evaluation was done by an eight-fold cross validation. We compare the per-
formances of the salience driven language models for both early and late applications.
RESULTS OF EARLY APPLICATION
Table 4.1 shows the experimental results of the early application of different language
models on the utterances with accompanying gestures. Among the n-gram models,
the performance of the trigram model is roughly the same as the bigram model. The
salience driven bigram (S-Bigram) model improved speech recognition and under-
standing compared to the three baselines (Bigram, Trigram, and C-Bigram). Com-
pared to the best baseline of the trigram model, the S-Bigram model reduced the
WER by 7%. A t-test showed that this was a significant change: t = 3.38, p < 0.004.
Language Model   Lattice-WER   WER     CI-Precision   CI-Recall   F-Score
Bigram           0.250         0.321   0.830          0.793       0.811
Trigram          0.258         0.312   0.838          0.797       0.817
C-Bigram         0.292         0.371   0.856          0.748       0.798
S-Bigram         0.243         0.291   0.861          0.830       0.845
S-C-Bigram       0.412         0.448   0.863          0.623       0.724
PCFG             0.323         0.360   0.819          0.816       0.817
S-PCFG           0.319         0.355   0.862          0.845       0.853

Table 4.1. Performances of the early application of different language models on speech-
gesture data
The S-Bigram model increased the precision and recall of concept identification by
3% and 4% respectively. The overall F-measurement achieved by the S-Bigram model
gained an increase of 3%. A t-test showed that this was also a significant improve-
ment: t = 3.01, p < 0.002. The S-C-Bigram model achieved the best result on the
precision of concept identification, but had the worst results on all other metrics.
Comparing class-based n-gram models (C-Bigram, S-C-Bigram) to n-gram models
(Bigram, Trigram, S-Bigram), we can see that class-based n-gram models achieve
better concept identification precision but worse concept identification recall and
WER. The performances of the class-based n-gram models depend on how the classes
of words are defined. When one unique class is defined for each unique word, there
will be no difference between n-gram models and class-based n-gram models. In our
experiment, we define different classes for the words with key semantic concepts,
whereas a single class is assigned to all other words. With this class definition, the
class-based bigram models contain n-gram probability information about the words
with key semantic concepts but lose that information for the non-key words, which all
share one class. Therefore, when class-based n-gram models are used in speech recognition,
it is hard to correctly recognize the non-key words that share one class, whereas the words
with key semantic concepts are more likely to appear in the recognition result, though
many of them are incorrectly recognized. This leads to better concept identification
precision but worse concept identification recall and WER.
Compared to the standard PCFG model, the salience driven PCFG (S-PCFG)
model increased the precision and recall of concept identification by 5% and 3.5%
respectively. The overall F-measurement was increased by 4%. A t-test confirmed
that this was a significant improvement: t = 3.30, p < 0.001. The S-PCFG model
did not change the WER much compared to the standard PCFG model. A t-test
confirmed that this change in WER was not significant.
When compared to the trigram model, the S-PCFG model did not improve the
WER but improved the language understanding. The F-measurement was increased
by 4%. A t-test showed that this was a significant improvement: t = 2.77, p < 0.003.
The worse WER of the S-PCFG model is due to grammar-based language models being
less flexible than n-gram language models. Grammar-based language models
place too much constraint on what language can be recognized, which hurts the
recognition of complex utterances. On the other hand, after salience tailoring, the
stricter constraints on which words with key semantic concepts can be recognized for
the salient entity make the S-PCFG model achieve better language understanding
performance than the n-gram models.
We also show the experimental results for individual users. Figure 4.9 compares
the performances of different salience driven language models in early application for
each user. From the results for individual users, we can see that for most users, the
performances of different salience driven language models are consistent. Compared to
the best baseline of the trigram model, the S-Bigram model achieved a lower WER and
a higher F-score for each user. The S-C-Bigram model did not show improvement over
the trigram model for all users. The S-PCFG model showed its merit on improving
language understanding by achieving higher F-scores than the baseline for all users
except user 2. And for 3 of the 5 users, the S-PCFG model achieved the best language
understanding among all different language models.
Figure 4.9. Performance of the early application of LMs on speech-gesture data of
individual users: (a) word error rate; (b) F-score

Overall, the results of the early application of the gesture-based salience driven
language models show that:
• In terms of WER, the S-Bigram model performed the best. N-gram models
  performed better than class-based n-gram models, and all n-gram models except
  the S-C-Bigram model performed better than the PCFG-based models.

• In terms of language understanding metrics, the S-PCFG model performed the
  best in that it achieved the highest concept identification recall and the highest
  overall F-measurement.

• Overall, the S-Bigram model appears to be the best one for the early application
  in that it not only achieved the lowest WER but also achieved a high F-score on
  concept identification (close to the highest one, achieved by the S-PCFG model).
RESULTS OF LATE APPLICATION
We further compared the different n-gram models (C-Bigram, S-Bigram, and S-C-Bigram)
during the late application. In these experiments, the standard trigram model trained
on our domain was first used to generate word lattices, then the salience driven models
were used in A* search (Section 4.5.2) to find the best paths in the word lattices.
Language Model   Lattice-WER   WER     CI-Precision   CI-Recall   F-score
C-Bigram         0.258         0.334   0.831          0.784       0.807
S-Bigram         0.258         0.294   0.854          0.834       0.844
S-C-Bigram       0.258         0.316   0.858          0.786       0.821

Table 4.2. Performance of the late application of LMs on speech-gesture data
Table 4.2 shows the results of the three models on the utterances with accompa-
nying gestures. In the late application, the S-Bigram model performed the best with
the exception of concept identification precision. Compared to the trigram model,
the S-Bigram in late application decreased the WER by 6%. A t-test showed that
this was a significant change: t = 2.66, p < 0.005. On language understanding, the
S-Bigram model increased the F-measurement by 3% compared to the trigram model.
A t-test confirmed that this was a significant improvement: t = 2.92, p < 0.002.
Compared to Table 4.1, Table 4.2 shows that there is no difference in performance
whether the S-Bigram model is applied early or late. However, a significant difference
is observed for the S-C-Bigram model. The S-C-Bigram model performed much better
when it was applied in the later stage. However, its performance was close to the
baseline (trigram model). The WER change achieved by the S-C-Bigram model
was not statistically significant according to the t-test (t = 0.94, NS), nor was the
change in F-measurement (t = 0.22, NS).
The experimental results of the late application of the three n-gram models for
individual users are shown in Figure 4.10. The results demonstrate the consistency
of the performances of the different salience driven language models in late application
for most users. Compared to the baseline of the trigram model, the S-Bigram model
improved both speech recognition and language understanding when applied in the
late stage. The S-C-Bigram model did not improve speech recognition when applied
in the late stage either, but it improved language understanding for most of the users.
Compared to its performance on speech recognition in early application, the S-C-
Bigram model performed better in late application for all the users.
4.6.3 Speech and Eye Gaze Data Collection
We conducted user studies to collect speech and eye gaze data. In the experiments,
a static 3D bedroom scene was shown to the user. The system verbally asked the
user a list of questions one at a time about the bedroom and the user answered the
questions by speaking to the system. A detailed description of the user study is given
in Appendix A2.
The user’s speech was recorded through an open microphone and the user’s eye
gaze was captured by an EyeLink II eye tracker.

Figure 4.10. Performance of the late application of LMs on speech-gesture data of
individual users: (a) word error rate; (b) F-score

From 7 users' experiments, we collected 554 utterances with a vocabulary of 489
words. Each utterance was transcribed
lected 554 utterances with a vocabulary of 489 words. Each utterance was transcribed
and annotated with entities that were being talked about in the utterance.
4.6.4 Evaluation Results on Speech and Eye Gaze Data
Evaluation was done by a 14-fold cross validation. We compare the performances of
the early and late applications of two gaze-based salience driven language models:
• S-Bigram1 - salience driven language model based on salience modeling 1 (Equation (4.4))

• S-Bigram2 - salience driven language model based on salience modeling 2 (Equation (4.5))
Table 4.3 and Table 4.4 show the results of the early and late applications of
the salience driven language models based on eye gaze. We can see that all word
error rates (WERs) are high. In the experiments, users were instructed only to
answer the system's questions one by one. There was no flow of a real human-machine
conversation. In this setting, users were freer to express themselves than in a
situation where they believed they were conversing with a machine. Thus, we observe
much longer sentences that often contain disfluencies. Here is one example:
System: “How big is the bed?”
User: “I would to have to offer a guess that the bed, if I look the chair
that ’s beside it [pause] in a relative angle to the bed, it’s probably six feet
long, possibly, or shorter, slightly shorter.”
The high WER was mainly caused by the complexity and disfluencies of users’
speech. Poor speech recording quality is another reason for the bad recognition per-
formance. We found that the trigram model performed worse than the bigram model
in the experiment. This is probably due to the sparseness of trigrams in the corpus.
The amount of data available is too small considering the vocabulary size.
Language Model   Lattice-WER   WER
Bigram           0.613         0.707
Trigram          0.643         0.719
S-Bigram1        0.605         0.690
S-Bigram2        0.604         0.689

Table 4.3. WER of the early application of LMs on speech-gaze data
Language Model   Lattice-WER   WER
S-Bigram1        0.643         0.709
S-Bigram2        0.643         0.710

Table 4.4. WER of the late application of LMs on speech-gaze data
The S-Bigram1 and S-Bigram2 models achieved similar results in both early application
(Table 4.3) and late application (Table 4.4). In early application, the S-Bigram1
model performed better than the trigram model (t = 5.24, p < 0.001) and the bigram
model (t = 3.31, p < 0.001). The S-Bigram2 model also performed better than the
trigram model (t = 5.15, p < 0.001) and the bigram model (t = 3.33, p < 0.001) in
early application. In late application, the S-Bigram1 model performed better than
the trigram model (t = 2.11, p < 0.02), as did the S-Bigram2 model (t = 1.99,
p < 0.025). However, compared to the bigram model, the S-Bigram1 model did not
change the recognition performance significantly in late application, nor did the
S-Bigram2 model.
We also compare the performances of the salience driven language models for individ-
ual users. In early application (Figure 4.11a), both the S-Bigram1 and the S-Bigram2
models performed better than the baselines of the bigram and trigram models for all
users except user 2 and user 7. T-tests have shown that these are significant im-
provements. For user 2, the S-Bigram1 model achieved the same WER as the bigram
model. For user 7, neither of the salience driven language models improved recogni-
tion compared to the bigram model. In late application (Figure 4.11b), only for user 3
and user 4 did both salience driven language models perform better than the baselines
of the bigram and trigram models. These improvements have also been confirmed by
t-tests as significant.

Figure 4.11. WERs of application of LMs on speech-gaze data of individual users:
(a) WER of early application; (b) WER of late application
Comparing early and late application of the salience driven language models, it
is observed that early application performed better than late application for all users
except user 3 and user 4. T-tests have confirmed that these differences are significant.
It is interesting to see that the effect of gaze-based salience modeling differs
among users. For two users (user 3 and user 4), the gaze-based salience driven
language models consistently outperformed the bigram and trigram models in both
early application and late application. However, for some other users (e.g., user 7),
this is not the case. In fact, the gaze-based salience driven language models performed
worse than the bigram model. This observation indicates that during language pro-
duction, a user's eye gaze is involuntary and unconscious. This is different from deictic
gesture, which is more intentionally delivered by a user.
4.6.5 Discussion
Gesture-based salience driven language models are built on the assumption that the
entity selected by the accompanying gesture of a user’s utterance is the topic of the
user’s utterance. Similarly, gaze-based salience driven language models are built on
the assumption that when a user’s eye gaze is fixating on an entity, the user is saying
something related to the entity. With this assumption, gesture/gaze-based salience
driven language models have the potential to improve speech recognition by biasing
the speech decoder to favor the words that are consistent with the entity indicated by
the user’s gesture or eye gaze fixation, especially when the user’s utterance contains
words describing unique characteristics of the object. These particular characteristics
could be the object’s name or physical properties (e.g., color, material, size).
An example where the gesture-based salience driven language model helped speech
recognition is shown in Figure 4.12. In this example, a user pointed to the entity
table_square in the bedroom scene and said “show me details on this desk”.

Utterance: “show me details on this desk”

Gesture selection:
  p(bedroom) = 0.0050
  p(lamp_floor) = 0.1954
  p(couchnnrsofa) = 0.1409
  p(lamp_floor2) = 0.0510
  p(table_square) = 0.6077

Bigram n-best list:
  show me details on this bed
  show me details on this desk
  show me details on this back
  show me details on that's desks
  show me details on that's desk
  show me details on that's that's

S-Bigram n-best list:
  show me details on this desk
  show me details on this bed
  show me details on this back
  show me details on this desk a
  show me details on that's desk
  show me details on that's desk a

Figure 4.12. N-best lists of speech recognition for utterance “show me details on this
desk”

The
desk”
user’s gesture resulted in a set of candidate entities being selected, in which the correct
one (i.e., table_square) was assigned the highest selection probability of 0.6077. Two
n-best lists, the bigram n-best list and the S-Bigram n-best list, were generated by the
speech recognizer when the standard bigram model and the salience driven bigram
model were applied respectively. When the standard bigram model was applied, the
speech recognizer did not get the correct recognition. When the salience driven bigram
model was applied, the speech recognizer recognized the user’s utterance correctly.
Figures 4.13 and 4.14 show the word lattices of the utterance generated by the
speech recognizer using the standard bigram model and the salience driven bigram
model respectively. The n-best lists in Figure 4.12 were generated from those word
lattices. In the word lattices, each path going from the start node to the end
node forms a recognition hypothesis. The bigram probabilities along the edges
are given as base-10 logarithms. In the standard bigram case, although the probability
of bigram “this desk” (-1.3952) is slightly higher than the probability of “this bed”
(-1.4380), the speech recognizer got the wrong recognition, i.e., the correct speech
recognition hypothesis is not the first one in the n-best list (Figure 4.12). This is
because the system tries to find an overall best speech recognition hypothesis by
considering both language confidence and acoustic confidence. After tailoring the
standard bigram model with gesture selection, in the resulting salience driven bigram
model, the probability of bigram “this desk” is increased (-0.8309) while the probabil-
ity of “this bed” is decreased (-1.9182). This enlarged bigram probability difference
ensures that “this desk” is on the overall best speech hypothesis generated by the
speech recognizer with the salience driven language model.

Figure 4.13. Word lattice of utterance “show me details on this desk” generated by using
the standard bigram model

Figure 4.14. Word lattice of utterance “show me details on this desk” generated by using
the salience driven bigram model
Utterance: “move the red chair over here”

Gesture selection:
  p(bedroom) = 0.0001
  p(curtains_1) = 0.0061
  p(table_pc) = 0.2229
  p(chair_1) = 0.7196
  p(lamp_floor) = 0.0512

Bigram n-best list:
  move the rid chair over here
  move the rid chair over here a
  move the rid chair over here i
  move the rid chair over here the
  move the rid chair over here it

S-Bigram n-best list:
  move the red chair over here
  move the red chair over here a
  move the red chair over here i
  move the red chair over here the
  move the red chair over here it

Figure 4.15. N-best lists of speech recognition for utterance “move the red chair over
here”
Figure 4.15 shows another example where the salience driven language model
helped recognize an utterance that referred to visual properties of an entity. In this
example, the user pointed to a red chair and then pointed to a location while saying
“move the red chair over here”. In the resulting gesture selections, the truly selected
entity chair_1 was assigned the highest probability. As shown in the bigram n-best list
and the S-Bigram n-best list, the speech recognizer with the standard bigram model
did not get the correct recognition result while the one with the salience driven bigram
model recognized the user’s utterance correctly.
The word lattices of the utterance are shown in Figures 4.16 and 4.17. In the
standard bigram case, as shown in Figure 4.16, the probability of bigram “rid chair”
(-3.3811) is higher than the probability of “red chair” (-3.8231). This makes the
wrong speech hypothesis the top one in the n-best list (Figure 4.15). After tailoring
the bigram model with gesture selection, in the salience driven bigram model (Fig-
ure 4.17), the probability of bigram “red chair” is much higher than the probability
of “rid chair”, which makes the correct speech hypothesis the best one in the n-best
list and thus yields the correct speech recognition.

Figure 4.16. Word lattice of utterance “move the red chair over here” generated by using
the standard bigram model

Figure 4.17. Word lattice of utterance “move the red chair over here” generated by using
the salience driven bigram model
Utterance: “I like the picture with like a forest in it”

Gaze salience:
  p(bedroom) = 0.5960    p(chandelier_1) = 0.4040

Bigram n-best list:
  and i eight that picture rid like got five
  and i eight that picture rid identifiable
  and i eight that picture rid like got forest
  and i eight that picture rid like got front
  and i eight that picture rid like got forest a

S-Bigram2 n-best list:
  and i that bedroom it like upside
  and i that bedroom it like a five
  and i that bedroom it like a forest
  and i that bedroom it like a forest a
  and i that bedroom it like a forest candle

Figure 4.18. N-best lists of speech recognition for utterance “I like the picture with like a
forest in it”
Unlike the active input mode of deictic gesture, eye gaze is a passive input mode.
The salience information indicated by eye gaze is not as reliable as the one indicated
by deictic gesture. When the salient entities indicated by eye gaze are not the true
entities the user is referring to, the salience driven language model can worsen speech
recognition. Figure 4.18 shows an example where the S-Bigram2 model in early
application worsened the recognition of a user’s utterance “I like the picture with like
a forest in it” because of wrong salience information. In this example, the user was
talking about a picture entity picture_bamboo. However, this entity was not salient,
only entities bedroom and chandelier_1 were salient as indicated by the user’s eye
gaze. As a result, the recognition with the S-Bigram2 model becomes worse than
the baseline. The correct word “picture” is missing and the wrong word “bedroom”
appears in the result.
The failure to identify the actual referred entity picture_bamboo as salient in the
above example can also be caused by the visual properties of entities. Smaller entities
on the screen are harder for eye gaze to fixate than larger entities. To address
this issue, more reliable salience modeling that takes into account the visual features
is needed.
Utterance: “remove this lamp”
Gesture salience:
p(bedroom) = 0.0995
p(lamp_bank) = 0.5288
p(table_dresser) = 0.3604
p(table_pc) = 0.0114
N-best list of standard trigram model:
remove this stand
remove this them
remove this left
N-best list of S-Bigram model in early integration:
remove this lamp
remove this lamp a
N-best list of S-Bigram model in late integration:
remove this left
remove this stand
remove this them
Figure 4.19. N-best lists of an utterance: early stage integration vs. late stage integration
Early application has an advantage over late application in bringing good hypothesized
words with low acoustic probabilities into the word lattice. This is particularly
important when using the Sphinx-4 speech recognizer, because the current
release of Sphinx-4 does not provide a full word lattice. When the correct words are
not in the word lattice output, a late application of salience driven language models
will never succeed in retrieving those correct words by rescoring the word lattice.
Figure 4.19 shows one example that demonstrates the difference between the early
application and the late application. Here the correct word “lamp” did not appear in
the word lattice generated by the trigram model, and thus could not be retrieved by
the late application of the salience driven bigram model. When the salience driven
bigram model was applied in an early stage, the top one in the generated n-best list
turned out to be the correct recognition result.
4.7 Summary
This chapter presents a systematic investigation of incorporating gesture/gaze into
speech recognition and understanding via salience driven language modeling. Three
salience driven language models based on the bigram model, the class-based bigram
model, and the PCFG are compared. Our experimental results have shown that the
salience driven bigram model can improve spoken language understanding in both
early and late applications, while the salience driven class-based bigram model seems
only useful for the late application. In the early application, the salience driven
PCFG model has also shown a potential advantage in improving spoken language
understanding.
CHAPTER 5
Incorporation of Non-verbal Modalities in
Intention Recognition for Spoken
Language Understanding
In multimodal interpretation, the user’s speech is first converted to text by speech
recognition. To understand the user’s speech, the system further extracts semantic
meaning from the user’s recognized utterance. The previous chapter has addressed
speech recognition in multimodal conversation. In this chapter, we address the un-
derstanding of the recognized speech during multimodal conversation.
In speech and deictic gesture systems, deictic gestures have been mainly used
for attention identification (i.e., identifying which object the user is talking about).
Many approaches have been developed to incorporate gestural information to resolve
referring expressions (e.g., using gesture information to resolve what this refers to in
the utterance “how much does this cost?”) [12, 14, 42, 54, 72, 119]. Different from these
earlier works, our work focuses on how to take gesture beyond attention identification
to help intention recognition (i.e., inferring what the user intends to do with an
object), which is the main task of language understanding.
Traditional language understanding is solely based on the text input. In multi-
modal conversational systems, besides the user’s language, it is possible to infer the
context of the user’s language from other non-verbal modalities (e.g. gesture) and
use this context for language understanding. In speech and deictic gesture systems,
deictic gestures on the graphical display indicate the user’s attention, which consti-
tutes the context of the user’s utterance. Since the context of the identified attention
can potentially constrain the associated intention, deictic gestures can go beyond
attention identification and be applied to recognize the user’s intention.
Within the context of a speech and gesture system, this chapter systematically
investigates the role of deictic gestures in incorporating contextual information to
help language understanding, specifically, to help recognize the user’s intention. We
experiment with different model-based and instance-based approaches to incorporate
gestural information for intention recognition. We also examine the effects of us-
ing gestural information for intention recognition in two different processing stages:
speech recognition stage and language understanding stage.
5.1 Multimodal Interpretation in a Speech-Gesture System
Multimodal interpretation involves extraction of semantic meanings from multimodal
inputs. In human-machine conversation, the specific task of multimodal interpreta-
tion is to convert the user’s multimodal input into a semantic representation that is
recognizable to the system.
5.1.1 Semantic Representation
Semantic meanings from user input can be generally categorized into intention and
attention [33]. Intention indicates the user’s motivation and action. Attention reflects
the focus of the conversation. Structuring semantic meanings in this way, we represent
the semantic meaning of a user’s input by a semantic frame containing intention and
attention of the user. Figure 5.1 shows the semantic frame of a user’s multimodal
input. In the example, the user asks “who is the artist of this picture?” while pointing
to a picture object (identified as picture_lotus) on the screen. The intention indicates
that the user wants the artist information, whereas attention indicates picture_lotus
is the object that the user is interested in.
Intention
  action: ACT-INFO_REQUEST
  aspect: ARTIST
Attention
  object id: picture_lotus
Figure 5.1. Semantic frame of a user’s multimodal input
Representing semantic meaning as semantic frames, the specific task of multimodal
interpretation is to fill intention and attention units in the semantic frames based on
the user’s multimodal input.
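To make this representation concrete, the following sketch encodes such a semantic frame as a simple data structure. The class and field names mirror Figure 5.1 and are illustrative only; they are not the system’s actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Intention:
    action: str            # e.g., "ACT-INFO_REQUEST"
    aspect: Optional[str]  # e.g., "ARTIST", or None when no aspect applies

@dataclass
class Attention:
    object_id: str         # e.g., "picture_lotus"

@dataclass
class SemanticFrame:
    intention: Intention
    attention: Attention

# The frame in Figure 5.1: "who is the artist of this picture?" plus a pointing gesture.
frame = SemanticFrame(
    intention=Intention(action="ACT-INFO_REQUEST", aspect="ARTIST"),
    attention=Attention(object_id="picture_lotus"),
)
print(frame)
```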
5.1.2 Incorporating Context in Two Stages
[Figure: pipeline with Speech Input → Speech Recognition → Language Understanding and Gesture Input → Gesture Recognition → Gesture Understanding; gesture feeds into (a) language understanding or (b) speech recognition; the resulting semantic representations are combined by Multimodal Fusion into a final Semantic Representation.]
Figure 5.2. Using context (via gesture) for language understanding
Context can be incorporated in two stages to help language understanding in
multimodal interpretation [83]. Take speech and gesture systems as an example: as
illustrated by (a) in Figure 5.2, contextual information (inferred from gesture) can
be used together with recognized speech hypotheses directly in language understand-
ing (LU) stage to improve language understanding. Since speech recognition is not
perfect, and better speech recognition should lead to better language understanding,
contextual information can also be used in the speech recognition (SR) stage to improve
speech recognition hypotheses and thus improve language understanding (Figure 5.2-
(b)).
5.2 Intention Recognition
We investigate using the context identified by gesture for intention recognition in a
speech-gesture system that is built for a 3D interior decoration domain (Section 3.3.1).
In this domain, the user’s intention is represented by an action and its corresponding
aspect. All actions and corresponding aspects in the interior decoration domain are
shown in Table 5.1. Note that for action ACT-INFO_REQUEST, the aspect includes
different domain properties such as ARTIST, AGE, and PRICE.
Action                 Aspect
ACT-ADD
ACT-ALTERNATES_SHOW    <null>
ACT-INFO_REQUEST       <...> or <null>
ACT-MOVE               <...> or <...>
ACT-PAINT              <color> or <...>
ACT-REMOVE
ACT-REPLACE            <replacement> or <null>
ACT-ROTATE             <...> or <...>
Table 5.1. Intentions in the 3D interior decoration domain
Given this representation, intention recognition can be formulated as a classifi-
cation problem. Each action-aspect pair can be considered as a particular type of
intention. For action ACT-INFO_REQUEST, there are 11 possible aspect values
which result in 11 classes. For all other 7 actions, each action is treated as one type
of intention despite multiple possible aspect values. During interpretation, additional
post-processing will take place to identify different aspects. For example, for action
ACT-PAINT, the system will try to identify the <color> value (e.g., red, blue) from
the user’s utterance after ACT-PAINT is predicted as the user’s intended action.
Here, we only focus on the classification of intention without elaborating on the post-
processing. In total, there are 19 target classes for intention recognition (including
class NOT- UNDERSTOOD to represent the intention that is not supported in the
domain).
5.3 Feature Extraction
To predict user intention, we first need to extract features from the user’s multimodal
input. Two types of features are used for intention prediction: semantic features and
phoneme features.
5.3.1 Semantic Features
The semantic features of users’ multimodal input consist of two parts: lexical features
extracted from users’ spoken utterances, and contextual features extracted from users’
deictic gestures.
• Lexical features
The lexical feature is represented by a binary feature vector which indicates what
semantic concepts appear in the user’s utterance. The semantic concepts are
extracted from the recognized speech hypotheses (could be the n-best hypotheses
or the 1-best hypothesis) based on lexical rules. Currently, we have 18 semantic
concepts in the interior decoration domain with 130 lexical rules.
• Contextual features
When a deictic gesture takes place, the selected object and its properties as
defined in the domain are activated, which form the context of the user’s ut-
terance. This context constrains what the user is likely to be talking about. For
example, the user is unlikely to ask the artist of a lamp or the wattage of a
picture. Therefore, this context can be used to help predict user intention. For
each gesture that accompanies the user’s utterance, we choose the most likely
object selected by the gesture and use the semantic type of the object as the
contextual feature. There are 14 semantic types of objects in the domain. A
sketch of this feature extraction follows this list.
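As a rough illustration of the two feature types above, the sketch below builds a binary concept vector from recognized speech hypotheses using a handful of made-up lexical rules, and appends the semantic type of the gesture-selected object as the contextual feature. The concept inventory, rules, and object catalog are stand-ins for the system’s actual 18 concepts, 130 lexical rules, and 14 semantic types.

```python
# Hypothetical lexical rules: concept -> trigger words (the real system has 18 concepts, 130 rules).
LEXICAL_RULES = {
    "CONCEPT_INFO":  {"information", "details", "artist", "price"},
    "CONCEPT_COLOR": {"red", "green", "blue", "paint"},
    "CONCEPT_MOVE":  {"move", "put", "place"},
}
CONCEPTS = sorted(LEXICAL_RULES)

# Hypothetical object catalog mapping object ids to semantic types (14 types in the real domain).
OBJECT_TYPES = {"picture_lotus": "SEM_PICTURE", "chair_1": "SEM_CHAIR", "lamp_bank": "SEM_LAMP"}

def lexical_features(hypotheses):
    """Binary vector: which concepts appear in any of the n-best recognition hypotheses."""
    words = {w.lower() for hyp in hypotheses for w in hyp.split()}
    return [int(bool(LEXICAL_RULES[c] & words)) for c in CONCEPTS]

def contextual_feature(gesture_object_id):
    """Semantic type of the most likely object selected by the accompanying gesture."""
    return OBJECT_TYPES.get(gesture_object_id, "SEM_UNKNOWN")

feats = lexical_features(["who is the artist of this picture"]) + [contextual_feature("picture_lotus")]
print(feats)  # e.g., [0, 1, 0, 'SEM_PICTURE']
```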
5.3.2 Phoneme Features
Besides semantic features, we also use phoneme features of users’ spoken utterances for
intention prediction. For each speech recognition hypothesis of the user’s utterance,
we can get a phoneme sequence. Each phoneme sequence is treated as a phoneme
feature.
User utterance: “information on this”
Phonemes: [ih n f er m ey sh ax n] [ao n] [dh ih s]
Speech recognition: “and for mission on this”
Phonemes: [ax n d] [f er] [m ih sh ax n] [ao n] [dh ih s]
Figure 5.3. Phonemes of an utterance
We give an example to show the potential of using phoneme features to help
user intention prediction. As shown in Figure 5.3, the user’s utterance is not cor-
rectly recognized and as a result, the semantic feature extracted from the recog-
nized speech does not give any useful information about the user’s intention of ACT-
INFO-REQUEST. Therefore, using semantic features alone will fail to predict the
user’s intention. However, if we compare the two phoneme sequences of the true ut—
terance and the speech recognition result, we can find that the phoneme sequences of
the mis-recognized speech, [ax n d] [f er] [m ih sh ax n], is close to the true phoneme
sequence [ih n f er m ey sh ax n]. This means that using phoneme sequence sim-
ilarity can help recover the word “information”, which is the key to identifying the
user’s intention in this utterance, and therefore can help predict the user’s intention.
5.4 Model-Based Intention Recognition
Given an instance x that is represented by semantic features, we applied three clas-
sifiers to predict user intention.
• Naive Bayes
The prediction c* of instance x is given by

c^* = \arg\max_c p(c \mid \mathbf{x}) = \arg\max_c p(c \mid x_1, x_2, \ldots, x_m)    (5.1)

where x_i is the i-th feature of instance x.
Applying Bayes’ theorem and assuming the features are conditionally indepen-
dent given a class, we have

p(c \mid \mathbf{x}) = \frac{p(c)\, p(x_1, \ldots, x_m \mid c)}{p(x_1, \ldots, x_m)} \propto p(c) \prod_i p(x_i \mid c)    (5.2)

Estimating p(c) and p(x_i|c) from the training data, we can get the prediction
of a testing instance by Equation (5.1). In our evaluation, add-one smoothing
was used in the estimation of p(c) and p(x_i|c) for predicting user intention.
• Decision Tree
In a decision tree, each leaf node provides the classification of the instances,
each non-leaf node specifies a test of some attribute of the instances, and each
branch descending from that node corresponds to one of the possible values for
this attribute. Decision trees classify instances by sorting them down the tree
from the root node to some leaf node through a list of attribute tests. We used
the C4.5 algorithm [86] to construct decision trees for intention prediction based on
the semantic features of users’ multimodal input.
• Support Vector Machines (SVM)
The SVM [22] is built by mapping instances to a high dimensional space and
finding a hyperplane with the largest margin that separates the training in-
stances into two classes in the mapped space. In prediction, an instance is
classified depending on the side of the hyperplane it lies on. A kernel function
is used in SVM to achieve linear classification in the high dimensional space.
Based on the semantic features of users’ multimodal input, we used a polynomial
kernel for user intention prediction.
Since SVM can only handle binary classification, a “one-against-one” method
is applied to use SVM for multi-class classification [39]. For a classification task
of c classes, c(c − 1)/2 SVMs are built for all pairs of classes and each SVM
is trained on the data from the pair of two classes. In the testing phase, a
test instance x is classified through a majority voting strategy. For each of the
c(c − 1)/2 binary classifiers built for class pair (c_i, c_j), if the classifier decides x
belongs to class c_i, the vote for class c_i increases by one. Otherwise, the vote
for class c_j increases by one. After all binary classifiers have been used to vote
for the classes, the one which wins the most votes is picked as the prediction of
x.
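The one-against-one voting scheme described above can be sketched as follows. A library such as LIBSVM or scikit-learn performs this internally; here `binary_classifiers` is simply assumed to map each class pair to a trained binary decision function.

```python
from collections import Counter
from itertools import combinations

def one_against_one_predict(x, classes, binary_classifiers):
    """Majority voting over c(c-1)/2 pairwise classifiers.

    binary_classifiers[(ci, cj)](x) is assumed to return ci or cj for instance x.
    """
    votes = Counter()
    for ci, cj in combinations(classes, 2):
        votes[binary_classifiers[(ci, cj)](x)] += 1
    # The class with the most votes is the predicted intention.
    return votes.most_common(1)[0][0]

# Toy usage with 3 intention classes and trivial pairwise "classifiers".
classes = ["ACT-PAINT", "ACT-MOVE", "ACT-REMOVE"]
clfs = {pair: (lambda x, p=pair: p[0]) for pair in combinations(classes, 2)}
print(one_against_one_predict({"features": []}, classes, clfs))  # "ACT-PAINT" wins both of its pairs
```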
5.5 Instance-Based Intention Recognition
We also applied k-nearest neighbor (KNN), an instance-based approach, to predict
user intention. Given a set of training instances with known intention, the KNN
method (k=1) predicts the intention of a testing instance by finding the testing in-
stance’s closest match in the training instances and using the match’s intention as
the prediction.
We applied KNN to predict user intention based on semantic features and phoneme
features. The similarity between a testing instance x^t and a training instance x^r is
defined as

d_{sp}(\mathbf{x}^t, \mathbf{x}^r) = d_s(\mathbf{x}^t, \mathbf{x}^r) + d_p(\mathbf{x}^t, \mathbf{x}^r)    (5.3)

where d_s(\mathbf{x}^t, \mathbf{x}^r) is the Hamming distance between the nominal semantic features and
d_p(\mathbf{x}^t, \mathbf{x}^r) is the distance between the phoneme features.
Hamming distance d_s(\mathbf{x}^t, \mathbf{x}^r) is defined as:

d_s(\mathbf{x}^t, \mathbf{x}^r) = \sum_{k=1}^{m} \left(1 - \delta(x_k^t, x_k^r)\right)    (5.4)

where x_k is the k-th attribute in the semantic feature, and

\delta(x_k^t, x_k^r) = \begin{cases} 0 & x_k^t \neq x_k^r \\ 1 & x_k^t = x_k^r \end{cases}

Phoneme distance d_p(\mathbf{x}^t, \mathbf{x}^r) is defined as follows based on different configura-
tions:
• when n-best speech recognition hypotheses are used, and no gestural informa-
tion is used:

d_p(\mathbf{x}^t, \mathbf{x}^r) = \min_k \mathrm{MED}(P_k^t, P^r)    (5.5)

• when n-best speech recognition hypotheses are used, and gestural information
(i.e., objects indicated by deictic gestures) is used:

d_p(\mathbf{x}^t, \mathbf{x}^r) = \min_k \mathrm{MED}(P_k^t, P^r) + w_e(o^t, o^r)    (5.6)

where
MED — minimum edit distance
P_k^t — phonemes of the k-th speech recognition hypothesis of testing instance x^t
P^r — phonemes of the speech transcript of training instance x^r
w_e(o^t, o^r) — distance between the object o^t selected by the gesture accompanying
testing instance x^t and the object o^r selected by the gesture accompanying training
instance x^r (0 if o^t and o^r are of the same semantic type, otherwise a non-zero constant)
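The combined distance in Equations (5.3)-(5.6) can be sketched as below. The minimum edit distance operates on phoneme token sequences, and the gesture penalty constant is an illustrative value rather than the one used in the experiments.

```python
def edit_distance(a, b):
    """Minimum edit distance (MED) between two phoneme sequences."""
    dp = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return dp[len(a)][len(b)]

def semantic_distance(xt, xr):
    """Hamming distance over nominal semantic features (Equation 5.4)."""
    return sum(1 for a, b in zip(xt, xr) if a != b)

def phoneme_distance(test_hyp_phonemes, train_phonemes, ot=None, orr=None, penalty=5):
    """Equation (5.5)/(5.6): min MED over the n-best hypotheses, plus an optional gesture term."""
    d = min(edit_distance(p, train_phonemes) for p in test_hyp_phonemes)
    if ot is not None and orr is not None:
        d += 0 if ot == orr else penalty  # same semantic type -> no extra cost
    return d

# Toy usage: the Figure 5.3 example, misrecognized "information on this" vs. its true phonemes.
test_nbest = [["ax","n","d","f","er","m","ih","sh","ax","n","ao","n","dh","ih","s"]]
train = ["ih","n","f","er","m","ey","sh","ax","n","ao","n","dh","ih","s"]
print(semantic_distance([1, 0, 1], [1, 1, 1]) +
      phoneme_distance(test_nbest, train, "SEM_PICTURE", "SEM_PICTURE"))
```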
5.6 Evaluation
We empirically evaluated the role of contextual information in intention recognition.
We applied both model-based and instance-based approaches, and investigated the
incorporation of contextual information for intention recognition in language under-
standing and speech recognition stages.
5.6.1 Experiment Settings
The CMU Sphinx-4 speech recognizer [111] was used for speech recognition. An
open acoustic model and a domain dictionary were used in recognizing users’ spoken
utterances.
For model-based intention prediction, we evaluated the intention prediction accu-
racies with the following classifiers based on semantic features:
• NBayes — naive Bayes
• DTree — decision tree (C4.5)
• SVM — support vector machine (polynomial kernel)
For instance-based intention prediction, we evaluated the intention prediction ac-
curacies with KNN classifiers based on different instance similarity functions:
• S-KNN — instance distance defined on semantic features (Equation (5.4))
• P-KNN — instance distance defined on phoneme features (Equations (5.5) and
(5.6) depending on whether gestural information is incorporated)
• SP-KNN — instance distance defined on the combined semantic and phoneme
features (Equation (5.3))
For each approach, we compared the performances of using only the 1-best speech
recognition hypothesis and using all n-best speech recognition hypotheses for inten-
tion prediction. Also, to compare the influences of gestural information on intention
prediction, we evaluated intention prediction under three gesture configurations:
• noGest — no gestural information is used.
• recoGest — with gesture recognition results, i.e., the most likely objects selected
by the user’s gestures as recognized by the system.
• trueGest — with ground truth gesture recognition results, i.e., the objects truly
selected by the user’s gestures.
For each approach, we further evaluated intention prediction based on standard
speech recognition and gesture-tailored speech recognition. When intention prediction
is based on standard speech recognition, gestural information is incorporated only
in language understanding for intention prediction. When intention prediction is
based on gesture-tailored speech recognition, gestural information is already used in
speech recognition and can also be used in language understanding stage for intention
prediction.
The evaluations were done by a 10-fold cross validation on the speech and gesture
data set as described in Section 4.6.1.
5.6.2 Results Based on Traditional Speech Recognition
Table 5.2 shows the intention prediction accuracies based on the standard speech
recognition results that did not use gestural information. The intention prediction
accuracies based on transcripts of users’ spoken utterances are also given in the table
to show the upper-bound performance when speech is perfectly recognized.
                        NBayes   DTree    SVM     S-KNN   P-KNN   SP-KNN
transcript   noGest     0.860    0.881    0.878   0.881   0.918   0.937
             recoGest   0.878    0.888    0.884   0.888   0.921   0.934
             trueGest   0.874    0.889    0.884   0.884   0.921   0.934
n-best       noGest     0.709    0.718    0.713   0.700   0.790   0.824
hypotheses   recoGest   0.741    0.729    0.749   0.740   0.797   0.826
             trueGest   0.755    0.738    0.744   0.737   0.806   0.832
1-best       noGest     0.721    0.727    0.730   0.730   0.798   0.820
hypothesis   recoGest   0.747    0.755    0.747   0.757   0.801   0.834
             trueGest   0.763    0.769    0.760   0.758   0.804   0.844
Table 5.2. Accuracies of intention prediction based on standard speech recognition
For all model-based approaches (i.e., NBayes, DTree, SVM), the results show that
using gestural information together with recognized speech (1-best or n-best) in in-
tention prediction achieves significant improvement on prediction accuracy compared
to not using gestural information. Among instance-based approaches (i.e., S-KNN, P-
KNN, SP-KNN), only for S-KNN, which uses semantic features, are intention prediction
accuracies improved significantly when gestural information is used together with
recognized speech (1-best or n-best hypotheses). For P-KNN, where only phoneme
features are used, there is no significant change between intention prediction using
gesture and not using gesture, regardless of whether gestural information is used with
1-best or n-best speech recognition. For SP-KNN, which uses
both semantic and phoneme features, intention prediction is significantly improved
only when gestural information is used together with 1-best speech recognition.
It is found that, used together with recognized speech hypotheses in model-based
approaches, ground truth gesture selection achieves more accurate intention predic-
tion than recognized gesture selection in most configurations. This indicates that
improving gesture recognition and understanding can further enhance intention pre-
diction when speech recognition is not perfect. When SVM is applied on semantic
features extracted from all n-best speech recognition hypotheses, using the true ges-
ture selection achieves slightly worse performance than using the recognized gesture
selection. However, this is not a significant difference. In instance-based approaches,
using true gesture selection makes no significant difference than using recognized
gesture selection for user intention prediction.
5.6.3 Results Based on Gesture-Tailored Speech Recognition
Table 5.3 shows the intention prediction accuracies based on the gesture-tailored
speech recognition hypotheses. Note that in Table 5.3, gestural information (all pos-
sible gesture selections recognized by the system) has been utilized in speech recog-
nition [81], the configurations noGest, recoGest, and trueGest only apply to how
gestural information is used in language understanding stage for intention prediction.
Therefore, in Table 5.3, the results under configurations n-best hypotheses + noGest
and 1-best hypothesis + noGest are actually the intention prediction performance
when gestural information is used in only speech recognition stage.
                        NBayes   DTree    SVM     S-KNN   P-KNN   SP-KNN
transcript   noGest     0.860    0.881    0.878   0.881   0.918   0.937
             recoGest   0.878    0.888    0.884   0.888   0.921   0.934
             trueGest   0.874    0.889    0.884   0.884   0.921   0.934
n-best       noGest     0.727    0.749    0.750   0.753   0.826   0.858
hypotheses   recoGest   0.753    0.766    0.780   0.770   0.829   0.857
             trueGest   0.766    0.781    0.786   0.781   0.827   0.860
1-best       noGest     0.735    0.743    0.752   0.758   0.812   0.843
hypothesis   recoGest   0.764    0.772    0.764   0.778   0.815   0.855
             trueGest   0.783    0.795    0.777   0.795   0.817   0.860
Table 5.3. Accuracies of intention prediction based on gesture-tailored speech recognition
Compared to using gestural information only in speech recognition, the accuracies
of intention prediction are significantly improved in all model-based approaches when
gestural information is used in both speech recognition and language understand-
ing, regardless of whether it is used with 1-best or n-best speech recognition. Among
the instance-based approaches, only for S-KNN does using gestural information in both
speech recognition and language understanding (with 1-best or n-best recognition
hypotheses) significantly improve intention prediction compared to using gestural
information only in speech recognition. For P-KNN, whether or not gestural
information is used in language understanding makes no significant change in inten-
tion prediction. For SP-KNN, intention prediction is significantly improved only when
gestural information is used together with the 1-best speech recognition hypothesis
in language understanding, compared to using gestural information only in speech
recognition.
In all model-based approaches, together with recognized speech, using ground
truth gesture selection in language understanding is found to improve intention pre-
diction more than the recognized gesture selection. Again, this indicates that im-
proving gesture recognition and understanding is helpful for intention prediction. In
instance-based approaches, using true or recognized gesture selection in language un-
derstanding stage for intention prediction does not make significant differences when
phoneme features are used.
5.6.4 Results Based on Different Sizes of Training Data
The empirical results have shown that using gestural information improves user inten-
tion recognition. To examine whether this improvement by using gestural information
is dependent on the size of training data, we compare the accuracies of intention pre-
diction with different sizes of training sets. The results of the approaches are shown in
Figures 5.4-5.9. The semantic features and phoneme features extracted from the
1-best speech recognition hypotheses and the recognized gesture selections are used in intention
prediction.
[Plot: intention prediction accuracy vs. % training data; curves for gesture not used, gesture used in LU, gesture used in SR, and gesture used in SR and LU.]
Figure 5.4. Intention prediction performance of Naive Bayes based on different training
size
[Plot: intention prediction accuracy vs. % training data; curves for gesture not used, gesture used in LU, gesture used in SR, and gesture used in SR and LU.]
Figure 5.5. Intention prediction performance of Decision Tree based on different training
size
[Plot: intention prediction accuracy vs. % training data; curves for gesture not used, gesture used in LU, gesture used in SR, and gesture used in SR and LU.]
Figure 5.6. Intention prediction performance of SVM based on different training size
[Plot: intention prediction accuracy vs. % training data; curves for gesture not used, gesture used in LU, gesture used in SR, and gesture used in SR and LU.]
Figure 5.7. Intention prediction performance of S-KNN based on different training size
The intention prediction accuracy curves are generated in the following way. The
whole data set is first separated into 5 folds in a stratified way such that the class
[Plot: intention prediction accuracy vs. % training data; curves for gesture not used, gesture used in LU, gesture used in SR, and gesture used in SR and LU.]
Figure 5.8. Intention prediction performance of P-KNN based on different training size
[Plot: intention prediction accuracy vs. % training data; curves for gesture not used, gesture used in LU, gesture used in SR, and gesture used in SR and LU.]
Figure 5.9. Intention prediction performance of SP-KNN based on different training size
distributions in each fold are the same. In each round of evaluation, two different
folds are picked as the testing set and the initial training set, and instances in the other 3
folds are added to the training set incrementally by random picking to get intention
prediction accuracies based on different sizes of training sets. After each fold of data
has been used as the testing set and the initial training set, the intention prediction accuracy
curves of the 20 rounds of evaluation are averaged to get the curves in Figures 5.4-5.9.
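A sketch of this curve-generation procedure is given below, under the assumption that `train_and_score` trains a classifier on the given training indices and returns its accuracy on the test indices; the chunk sizes and random seeds are illustrative choices, not those of the actual experiments.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Split instance indices into k folds with (roughly) equal class distributions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for i, idx in enumerate(idxs):
            folds[i % k].append(idx)
    return folds

def training_size_curve(folds, train_and_score, steps=8, seed=0):
    """Average accuracy-vs-training-size curves over all ordered (test, initial-train) fold pairs."""
    rng = random.Random(seed)
    curves = []
    for t in range(len(folds)):
        for s in range(len(folds)):
            if s == t:
                continue
            rest = [i for f in range(len(folds)) if f not in (t, s) for i in folds[f]]
            rng.shuffle(rest)
            train = list(folds[s])
            curve = [train_and_score(train, folds[t])]
            chunk = max(1, len(rest) // steps)
            for start in range(0, len(rest), chunk):
                train += rest[start:start + chunk]   # add instances incrementally
                curve.append(train_and_score(train, folds[t]))
            curves.append(curve)
    n = min(len(c) for c in curves)
    return [sum(c[i] for c in curves) / len(curves) for i in range(n)]

# Toy usage with a dummy scorer (replace with actual classifier training and evaluation).
folds = stratified_folds([i % 3 for i in range(60)])
print(training_size_curve(folds, lambda tr, te: len(tr) / 60.0))
```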
We can see that, for all model-based and instance-based approaches, using gestural
information in both speech recognition stage and language understanding stage al-
ways outperforms using gestural information in only language understanding stage or
not using gestural information at all for intention prediction. Using gestural informa-
tion only in the speech recognition stage is found to always outperform not using gestural
information for intention prediction in all model-based and instance-based approaches,
regardless of the training size. When gestural information is used only in the language under-
standing stage, Naive Bayes and S-KNN always improve intention prediction regardless of
the training size. For the other approaches (Decision Tree, SVM, P-KNN, and SP-
KNN), sufficient training data is needed to make gestural information helpful for
intention prediction.
5.6.5 Discussion
The empirical results lead to several findings about the role of deictic gestures in
incorporating domain context in intention recognition.
First, deictic gesture helps intention recognition given the current speech recogni-
tion technology. The earlier deictic gesture is used in speech processing, the more
effect it brings to intention recognition. Figure 5.10 shows the performance of inten-
tion recognition by different approaches when gestural information is not used (i.e.,
only recognized speech hypotheses are used), used only in speech recognition stage,
used only in language understanding stage, and used in both speech recognition and
language understanding stages. We can easily see that using gestural information
in speech recognition stage or language understanding stage improves intention pre-
diction. Using gestural information in both the speech recognition stage and the language
[Bar chart: intention prediction accuracy for NBayes, DTree, SVM, S-KNN, P-KNN, and SP-KNN under four conditions: gesture not used, gesture used in LU, gesture used in SR, and gesture used in SR and LU.]
Figure 5.10. Using gestural information in different stages for intention recognition
understanding stage further improves intention prediction. Therefore, it is desirable
to incorporate gesture earlier in the spoken language processing.
Second, deictic gesture does not help much in intention recognition for a sim-
ple/small domain if speech is perfectly recognized. As we can see in Table 5.2, when
gestural information is used together with the transcripts of user utterances to pre-
dict intention, the effect is not as significant as when gesture information is used
with recognized speech hypotheses. This is within our expectation. Given a simple
domain with a limited number of words (the vocabulary size for our current domain
is 250), it is relatively easier to come up with sufficient semantic grammars to cover
the variations of language. In other words, once user utterances are correctly rec-
ognized, the semantics of the input can most likely be correctly identified by the
language understanding component. So the bottleneck in interpretation appears in
speech recognition (due to many possible reasons such as background noise, accent,
etc.). The better the speech recognition, the better the language understanding compo-
nent processes the hypotheses, and the less effect the gesture is likely to bring. When
speech is perfectly recognized (i.e., same as transcriptions), the addition of gesture
information will not bring extra advantage. In fact, it may hurt the performance if
gesture recognition is not adequate. However, we feel that when the domain becomes
more complex and the variations of language become more difficult to process, the
use of gesture may begin to show advantage even when speech recognition performs
reasonably well. After all, speech recognition is far from being perfect in reality, which
makes gestural information valuable in intention recognition.
Third, deictic gesture helps more significantly when combined with semantic fea-
tures than with phoneme features for intention prediction. As shown in Figure 5.10,
for NBayes, DTree, SVM, and S-KNN, where only semantic features are used, the
addition of deictic gesture in both speech recognition and language understanding
can improve the performance between 4.7% and 6.6%. For P-KNN where only the
phoneme features are used, the improvement is 2.1%. Although the addition of
phoneme features significantly improves the intention recognition performance, it is
computationally much more expensive than the use of only semantic features. Using
phoneme features may become impractical in real-time systems for complex domains.
Thus the incorporation of the gestural information could be even more important.
5.7 Summary
This chapter systematically investigates the role of deictic gesture in recognizing user
intention during interaction with a speech and gesture interface. Different model-
based and instance-based approaches using gestural information have been applied
to recognize user intention. Our empirical results have shown that using gestural
information in either the speech recognition or the language understanding stage is able to
improve user intention recognition. Moreover, when gestural information is used
in both speech recognition and language understanding, intention recognition can be
further improved. These results indicate that deictic gesture, although most indicative
of user attention, is also helpful in recognizing user intention. These results further
point out when and how deictic gesture should be effectively incorporated in building
practical speech-gesture systems.
CHAPTER 6
Incorporation of Eye Gaze in Automatic
Word Acquisition
Chapter 4 and Chapter 5 investigate the use of non-verbal modalities to improve spo-
ken language understanding in multimodal conversational systems. Another signifi-
cant problem with language understanding in multimodal conversation is the system’s
lack of knowledge to process user language. Language is flexible, different users may
use different words to express the same meaning. When the system encounters a word
that is out of its knowledge base (e.g., vocabulary), it tends to fail in interpreting the
user’s language. It is desirable that the system can learn new words automatically
during human-machine conversation.
In this chapter, we present the investigation of using eye gaze for automatic word
acquisition. The speech-gaze temporal information and domain semantic relatedness
are incorporated in statistical translation models for word acquisition. Our experi-
ments show that the use of speech-gaze temporal information and domain semantic
relatedness significantly improves word acquisition performance.
This chapter begins with a description of the speech and gaze data collection,
followed by an introduction of the basic translation models for word acquisition. Then,
we describe the enhanced models that incorporate temporal and semantic information
about speech and eye gaze for word acquisition. Finally, we present the results of
empirical evaluation.
6.1 Data Collection
We used the same set of speech and eye gaze data as described in Section 4.6.3.
[Figure: time-aligned speech stream (“This room has a chandelier”, with per-word start times in ms) and gaze stream (gaze fixations with start and end times and their fixated entities; [10] — bedroom, [11] — chandelier, [17] — lamp_2, [19] — bed frame, [22] — door).]
Figure 6.1. Parallel speech and gaze streams
Figure 6.1 shows an excerpt of the collected speech and gaze fixations in one ex-
periment. In the speech stream, each word starts at a particular timestamp. In the
gaze stream, each gaze fixation has a starting timestamp t_s and an ending timestamp
t_e. Each gaze fixation also has a list of fixated entities (3D objects). An entity e on
the graphical display is fixated by gaze fixation f if the area of e contains the fixation
point of f.
Given the collected speech and gaze fixations, we build a parallel speech-gaze data
set as follows. For each spoken utterance and its accompanying gaze fixations, we
construct a pair of word sequence and entity sequence (w, e). The word sequence w
consists of only nouns and adjectives in the utterance. Each gaze fixation results in
a fixated entity in the entity sequence e. When multiple entities are fixated by one
gaze fixation due to the overlapping of the entities, the forefront one is chosen. Also,
we merge the neighboring gaze fixations that contain the same fixated entities. For
the parallel speech and gaze streams shown in Figure 6.1, the resulting word sequence
is w = [room chandelier] and the entity sequence is e = [bed_frame lamp_2 bed_frame
door chandelier].
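A sketch of this construction is given below. The part-of-speech filter is reduced to a small stop-list for brevity, and the word and fixation timestamps are approximate readings of Figure 6.1 rather than exact values.

```python
def build_parallel_instance(words, fixations,
                            content_word=lambda w: w.lower() not in {"this", "a", "the", "has"}):
    """Build a (word sequence, entity sequence) pair from one utterance and its gaze fixations.

    words:     [(word, start_ms), ...] for the utterance
    fixations: [(start_ms, end_ms, [entities front-to-back]), ...]
    """
    # Keep only content words (nouns/adjectives in the real system; a stop-list stands in here).
    w = [word for word, _ in words if content_word(word)]
    # Take the forefront entity of each fixation, then merge neighbours fixating the same entity.
    e = []
    for _, _, entities in fixations:
        if entities and (not e or e[-1] != entities[0]):
            e.append(entities[0])
    return w, e

# Approximate reading of the Figure 6.1 excerpt: "This room has a chandelier" and its gaze stream.
words = [("This", 2571), ("room", 2871), ("has", 3117), ("a", 3528), ("chandelier", 3736)]
fixations = [(56, 968, ["bed_frame"]), (1668, 2096, ["lamp_2"]), (2096, 2692, ["bed_frame"]),
             (2692, 3222, ["door"]), (3222, 3900, ["chandelier_1", "bedroom"])]
print(build_parallel_instance(words, fixations))
# -> (['room', 'chandelier'], ['bed_frame', 'lamp_2', 'bed_frame', 'door', 'chandelier_1'])
```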
6.2 Translation Models for Automatic Word Acquisition
Since we are working on conversational systems where users interact with a visual
scene, we consider the task of word acquisition as associating words with visual en-
tities in the domain. Given the parallel speech and gaze fixated entities {(w,e)},
we formulate word acquisition as a translation problem and use translation models
to estimate word-entity association probabilities p(w|e). The words with the highest
association probabilities are chosen as acquired words for entity e.
6.2.1 Base Model I
Using the translation model I [5], where each word is equally likely to be aligned with
each entity, we have

p(\mathbf{w}|\mathbf{e}) = \frac{1}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} p(w_j|e_i)    (6.1)

where l and m are the lengths of the entity and word sequences respectively. We refer to
this model as Model-1.
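A minimal EM sketch for estimating p(w|e) under Model-1 is shown below; it follows the standard IBM Model 1 training procedure, including a null entity at position 0, and the toy data at the end are illustrative.

```python
from collections import defaultdict

def train_model1(pairs, iterations=20):
    """Estimate word-entity association probabilities p(w|e) with IBM Model 1 style EM.

    pairs: list of (word_sequence, entity_sequence) instances, e.g. from Section 6.1.
    """
    vocab_w = {w for ws, _ in pairs for w in ws}
    vocab_e = {e for _, es in pairs for e in es} | {"<null>"}
    # Uniform initialization of p(w|e).
    p = {e: {w: 1.0 / len(vocab_w) for w in vocab_w} for e in vocab_e}
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for ws, es in pairs:
            es = ["<null>"] + list(es)
            for w in ws:
                norm = sum(p[e][w] for e in es)
                for e in es:                       # E-step: expected alignment counts
                    c = p[e][w] / norm
                    count[e][w] += c
                    total[e] += c
        for e in count:                            # M-step: re-normalize
            for w in count[e]:
                p[e][w] = count[e][w] / total[e]
    return p

# Toy usage on two instances; with real data the acquired words for an entity are the
# highest-probability entries in p[entity].
pairs = [(["room", "chandelier"], ["bed_frame", "chandelier_1"]),
         (["chandelier", "lamp"], ["chandelier_1", "lamp_2"])]
p = train_model1(pairs)
print(max(p["chandelier_1"], key=p["chandelier_1"].get))  # -> "chandelier"
```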
6.2.2 Base Model II
Using the translation model II [5], where alignments are dependent on word/entity
positions and word/entity sequence lengths, we have

p(\mathbf{w}|\mathbf{e}) = \prod_{j=1}^{m} \sum_{i=0}^{l} p(a_j = i \mid j, m, l)\, p(w_j|e_i)    (6.2)

We refer to this model as Model-2.
6.3 Using Speech-Gaze Temporal Information for Word Acquisition
To incorporate the speech-gaze temporal information, the alignment of word w_j and
entity e_i is made dependent on their temporal relation:

p(\mathbf{w}|\mathbf{e}) = \prod_{j=1}^{m} \sum_{i=0}^{l} p_t(a_j = i \mid j, \mathbf{e}, \mathbf{w})\, p(w_j|e_i)    (6.3)

We refer to this model as Model-2t. The temporal distance between entity e_i and
word w_j is defined as

d(e_i, w_j) = \begin{cases} 0 & t_s(e_i) \leq t_s(w_j) \leq t_e(e_i) \\ t_e(e_i) - t_s(w_j) & t_s(w_j) > t_e(e_i) \\ t_s(e_i) - t_s(w_j) & t_s(w_j) < t_s(e_i) \end{cases}    (6.4)

where t_s(w_j) is the starting timestamp (ms) of word w_j, and t_s(e_i) and t_e(e_i) are the
starting and ending timestamps (ms) of the gaze fixation on entity e_i.
The alignment of word w_j and entity e_i is decided by their temporal distance
d(e_i, w_j). Based on the psycholinguistic finding that eye gaze happens before a spo-
ken word, w_j is not allowed to be aligned with e_i when w_j happens earlier than e_i
(i.e., d(e_i, w_j) > 0). When w_j happens no earlier than e_i (i.e., d(e_i, w_j) \leq 0), the
closer they are, the more likely they are aligned. Specifically, the temporal alignment
probability of w_j and e_i in each co-occurring instance (w, e) is computed as

p_t(a_j = i \mid j, \mathbf{e}, \mathbf{w}) = \begin{cases} 0 & d(e_i, w_j) > 0 \\ \frac{\exp[\alpha \cdot d(e_i, w_j)]}{\sum_i \exp[\alpha \cdot d(e_i, w_j)]} & d(e_i, w_j) \leq 0 \end{cases}    (6.5)

where \alpha is a constant for scaling d(e_i, w_j).
An EM algorithm is used to estimate the probabilities p(w|e) and \alpha in Model-2t.
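The temporal distance and alignment probability in Equations (6.4) and (6.5) can be sketched as follows; the scaling constant alpha is an illustrative value here, whereas in the actual model it is estimated by EM together with p(w|e).

```python
import math

def temporal_distance(ts_e, te_e, ts_w):
    """Equation (6.4): temporal distance between a gaze-fixated entity and a spoken word."""
    if ts_e <= ts_w <= te_e:
        return 0.0
    if ts_w > te_e:
        return te_e - ts_w       # word starts after the fixation ends (negative)
    return ts_e - ts_w           # word starts before the fixation starts (positive)

def temporal_alignment_probs(fixations, ts_w, alpha=0.001):
    """Equation (6.5): alignment probability of a word with each fixated entity.

    fixations: [(entity, ts_e, te_e), ...]; alpha is an illustrative scaling constant.
    """
    d = [temporal_distance(ts_e, te_e, ts_w) for _, ts_e, te_e in fixations]
    weights = [math.exp(alpha * di) if di <= 0 else 0.0 for di in d]
    z = sum(weights) or 1.0
    return {ent: w / z for (ent, _, _), w in zip(fixations, weights)}

# Toy usage: the word "chandelier" starting at 3736 ms against three fixations.
fixations = [("bed_frame", 2096, 2692), ("door", 2692, 3222), ("chandelier_1", 3222, 3900)]
print(temporal_alignment_probs(fixations, ts_w=3736))  # chandelier_1 gets the largest share
```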
[Histogram: alignment count vs. temporal distance of aligned word and entity (ms), ranging roughly from -5,000 to 1,000 ms.]
Figure 6.2. Histogram of truly aligned word and entity pairs over temporal distance (bin
width = 200ms)
For the purpose of evaluation, we manually annotated the truly aligned word and
entity pairs. Figure 6.2 shows the histogram of those truly aligned word and entity
pairs over the temporal distance of aligned word and entity. We can observe in the
figure that 1) almost no eye gaze happens after a spoken word, and 2) the number of
word-entity pairs with closer temporal distance is generally larger than the number
of those with farther temporal distance. This is consistent with our modeling of the
temporal alignment probability of word and entity (Equation (6.5)).
6.4 Using Domain Semantic Relatedness for Word Acquisi-
tion
Speech-gaze temporal alignment and occurrence statistics sometimes are not sufficient
to associate words to entities correctly. For example, suppose a user says “there is a
lamp on the dresser” while looking at a lamp object on a table object. Due to their
co—occurring with the lamp object, the words dresser and lamp are both likely to
be associated with the lamp object in the translation models. As a result, the word
dresser is likely to be incorrectly acquired for the lamp object. For the same reason,
the word lamp could be acquired incorrectly for the table object. To solve this type
of association problem, the semantic knowledge about the domain and words can be
helpful. For example, the knowledge that the word lamp is more semantically related
to the object lamp can help the system avoid associating the word dresser to the lamp
object. Therefore, we are interested in investigating the use of semantic knowledge
in word acquisition.
On one hand, each conversational system has a domain model, which is the knowl-
edge representation about its domain such as the types of objects and their properties
and relations. On the other hand, there are available resources about domain inde-
pendent lexical knowledge (e.g., WordNet [28]). The question is whether we can use
the domain model and external lexical knowledge resource to improve word acqui-
sition. To address this question, we link the domain concepts in the domain model
with WordNet concepts, and define semantic relatedness of word and entity to help
the system acquire domain semantically compatible words.
In the following sections, we first describe our domain modeling, then define the
semantic relatedness of word and entity based on domain modeling and WordNet
semantic lexicon, and finally describe different ways of using the semantic relatedness
of word and entity to help word acquisition.
6.4.1 Domain Modeling
We model the 3D room decoration domain as shown in Figure 6.3. The domain model
contains all domain related semantic concepts. These concepts are linked to the
WordNet concepts (i.e., synsets in the format of “word#part-of-speech#sense-id”).
Each of the entities in the domain has one or more properties (e.g., semantic type,
color, size) that are denoted by domain concepts. For example, the entity dresser_1
has domain concepts SEM_DRESSER and COLOR. These domain concepts are linked
to “dresser#n#4” and “color#n#1” in WordNet.
[Figure: domain model in which entities (e.g., dresser_1) have property concepts (e.g., SEM_DRESSER, COLOR, SEM_BED), and these domain concepts are linked to WordNet synsets such as “dresser#n#4” and “color#n#1”.]
Figure 6.3. Domain model with domain concepts linked to WordNet synsets
Note that in the domain model, the domain concepts are not specific to a cer-
tain entity; they are general concepts for a certain type of entity. Multiple entities
of the same type have the same properties and share the same set of domain con-
cepts. Therefore, properties such as color and size of an entity have general concepts
“color#n#1” and “size#n#1” instead of more specific concepts like “yellow#a#1”
and “big#a#1”, so their concepts can be shared by other entities of the same type,
but with different colors and sizes.
6.4.2 Semantic Relatedness of Word and Entity
We compute the semantic relatedness of a word w and an entity e based on the se-
mantic similarity between w and the properties of e. Specifically, semantic relatedness
SR(e, w) is defined as

SR(e, w) = \max_{i,j}\, sim(s(c_e^i), s_j(w))    (6.6)

where c_e^i is the i-th property of entity e, s(c_e^i) is the synset of property c_e^i as designed
in the domain model, s_j(w) is the j-th synset of word w as defined in WordNet, and
sim(\cdot, \cdot) is the similarity score of two synsets.
We computed the similarity score of two synsets based on the path length between
them. The similarity score is inversely proportional to the number of nodes along the
shortest path between the synsets as defined in WordNet. When the two synsets
are the same, they have the maximal similarity score of 1. The WordNet-Similarity
tool [77] was used for the synset similarity computation.
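A sketch of the word-entity semantic relatedness in Equation (6.6) is shown below, using NLTK’s WordNet interface in place of the WordNet-Similarity tool; the particular synsets chosen for the domain concepts are assumptions for illustration.

```python
from nltk.corpus import wordnet as wn   # requires the NLTK 'wordnet' corpus to be downloaded

def synset_similarity(s1, s2):
    """Path-length based similarity; 1.0 when the synsets are identical."""
    sim = s1.path_similarity(s2)
    return sim if sim is not None else 0.0

def semantic_relatedness(entity_concept_synsets, word):
    """Equation (6.6): max similarity between any property synset of e and any synset of w."""
    word_synsets = wn.synsets(word)
    if not word_synsets or not entity_concept_synsets:
        return 0.0
    return max(synset_similarity(s_c, s_w)
               for s_c in entity_concept_synsets for s_w in word_synsets)

# Illustrative domain-model entry for a dresser entity: its property concepts as WordNet synsets.
# (The sense choices here are assumptions, not necessarily those used in the actual domain model.)
dresser_concepts = [wn.synsets("dresser", pos=wn.NOUN)[0], wn.synsets("color", pos=wn.NOUN)[0]]

for w in ("dresser", "lamp", "red"):
    print(w, round(semantic_relatedness(dresser_concepts, w), 3))
```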
6.4.3 Word Acquisition with Word-Entity Semantic Relatedness
We can use the semantic relatedness of word and entity to help the system acquire
semantically compatible words for each entity, and therefore improve word acquisition
performance. The semantic relatedness can be applied for word acquisition in two
ways: post process learned word-entity association probabilities by rescoring them
with semantic relatedness, or directly affect the learning of word-entity associations
by constraining the alignment of word and entity in the translation models.
Rescoring with Semantic Relatedness
In the acquired word list for an entity e_i, each word w_j has an association probability
p(w_j|e_i) that is learned from a translation model. We use the semantic relatedness
SR(e_i, w_j) to redistribute the probability mass for each w_j. The new association
probability is given by:

p'(w_j|e_i) = \frac{p(w_j|e_i)\, SR(e_i, w_j)}{\sum_j p(w_j|e_i)\, SR(e_i, w_j)}    (6.7)
Semantic Alignment Constraint in Translation Model
When used to constrain the word-entity alignment in the translation model, semantic
relatedness can be used alone or used together with speech-gaze temporal information
to decide the alignment probability of word and entity [84].
• Using only semantic relatedness to constrain word-entity alignments in Model-
2s, we have

p(\mathbf{w}|\mathbf{e}) = \prod_{j=1}^{m} \sum_{i=0}^{l} p_s(a_j = i \mid j, \mathbf{e}, \mathbf{w})\, p(w_j|e_i)    (6.8)

where p_s(a_j = i \mid j, \mathbf{e}, \mathbf{w}) is the alignment probability based on semantic related-
ness,

p_s(a_j = i \mid j, \mathbf{e}, \mathbf{w}) = \frac{SR(e_i, w_j)}{\sum_i SR(e_i, w_j)}    (6.9)

• Using semantic relatedness and temporal information to constrain word-entity
alignments in Model-2ts, we have

p(\mathbf{w}|\mathbf{e}) = \prod_{j=1}^{m} \sum_{i=0}^{l} p_{ts}(a_j = i \mid j, \mathbf{e}, \mathbf{w})\, p(w_j|e_i)    (6.10)

where p_{ts}(a_j = i \mid j, \mathbf{e}, \mathbf{w}) is the alignment probability that is decided by both
the temporal relation and the semantic relatedness of e_i and w_j,

p_{ts}(a_j = i \mid j, \mathbf{e}, \mathbf{w}) = \frac{p_s(a_j = i \mid j, \mathbf{e}, \mathbf{w})\, p_t(a_j = i \mid j, \mathbf{e}, \mathbf{w})}{\sum_i p_s(a_j = i \mid j, \mathbf{e}, \mathbf{w})\, p_t(a_j = i \mid j, \mathbf{e}, \mathbf{w})}    (6.11)

where p_s(a_j = i \mid j, \mathbf{e}, \mathbf{w}) is the semantic alignment probability in Equation (6.9),
and p_t(a_j = i \mid j, \mathbf{e}, \mathbf{w}) is the temporal alignment probability given in Equa-
tion (6.5).
EM algorithms are used to estimate p(w|e) in Model-2s and Model-2ts.
6.5 Grounding Words to Domain Concepts
As discussed above, based on translation models, we can incorporate temporal and
domain semantic information to obtain p(w|e). This probability only provides a
means to ground words to entities. In conversational systems, the ultimate goal of
word acquisition is to make the system understand the semantic meaning of new
words. Word acquisition by grounding words to objects is not always sufficient for
identifying their semantic meanings. Suppose the word green is grounded to a green
chair object, so is the word chair. Although the system is aware that green is some
word describing the green chair, it does not know that the word green refers to the
chair’s color while the word chair refers to the chair’s semantic type. Thus, after
learning the word-entity associations p(w|e) by the translation models, we need to
further ground words to domain concepts of entity properties.
We further apply WordNet to ground words to domain concepts. For each entity e,
based on the association probabilities p(w|e), we can choose the n-best words as acquired
words for e. Those n-best words have the n highest association probabilities. For
each word w acquired for e, the grounded concept c_w^* for w is chosen as the one that
has the highest semantic relatedness with w:

c_w^* = \arg\max_{c_e^i} \left[ \max_j sim(s(c_e^i), s_j(w)) \right]    (6.12)

where sim(s(c_e^i), s_j(w)) is the semantic similarity score defined in Equation (6.6).
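A sketch of this concept grounding (Equation (6.12)) is shown below, again using NLTK’s WordNet interface; the concept-to-synset choices are illustrative assumptions rather than the actual domain model.

```python
from nltk.corpus import wordnet as wn   # requires the NLTK 'wordnet' corpus

def ground_word(word, entity_concepts):
    """Equation (6.12): pick the entity property concept most related to the acquired word.

    entity_concepts: {concept_name: wordnet_synset}, e.g. {"SEM_DRESSER": ..., "COLOR": ...}.
    """
    def sim(s1, s2):
        v = s1.path_similarity(s2)
        return v if v is not None else 0.0
    word_synsets = wn.synsets(word)
    if not word_synsets:
        return None
    return max(entity_concepts,
               key=lambda c: max(sim(entity_concepts[c], s) for s in word_synsets))

# Illustrative grounding for words acquired for dresser_1 (the sense choices are assumptions).
concepts = {"SEM_DRESSER": wn.synsets("dresser", pos=wn.NOUN)[0],
            "COLOR": wn.synsets("color", pos=wn.NOUN)[0]}
for w in ("dresser", "vanity", "red"):
    print(w, "->", ground_word(w, concepts))
```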
6.6 Evaluation
To evaluate the acquired words for the entities, we manually compile a set of “gold
standard” words from all users’ speech transcripts and gaze fixations. Those “gold
standard” words are the words that the users have used to refer to the entities and
their properties (e.g., color, size, shape) during the interaction with the system. The
automatically acquired words are evaluated against those “gold standard” words.
6.6.1 Evaluation Metrics
The following metrics are used to evaluate the words acquired for domain concepts
(i.e., entity properties) {c_e^i}.
• Precision

\frac{\sum_e \sum_i \#\text{ words correctly acquired for } c_e^i}{\sum_e \sum_i \#\text{ words acquired for } c_e^i}

• Recall

\frac{\sum_e \sum_i \#\text{ words correctly acquired for } c_e^i}{\sum_e \sum_i \#\text{ “gold standard” words of } c_e^i}

• F-measure

\frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}

The metrics of precision, recall, and F-measure are based on the n-best words
acquired for the entity properties. Therefore, we have different precision, recall, and
F-measure when n changes.
The metrics of precision, recall, and F-measure only provide evaluation on the top
n candidate words. To measure the acquisition performance on the entire ranked list
of candidate words, we define a new metric as follows:
• Mean Reciprocal Rank Rate (MRRR)

MRRR = \frac{\sum_e \left( \sum_{i=1}^{N_e} \frac{1}{index(w_e^i)} \Big/ \sum_{i=1}^{N_e} \frac{1}{i} \right)}{\#e}
where N_e is the number of all ground-truth words {w_e^i} for entity e, and index(w_e^i)
is the index of word w_e^i in the ranked list of candidate words for entity e.
Entities may have a different number of ground-truth words. For each entity e,
we calculate a Reciprocal Rank Rate (RRR), which measures how close the ranks
of the ground-truth words in the candidate word list are to the best scenario where
the top N_e words are the ground-truth words for e. RRR is in the range of (0, 1].
The higher the RRR, the better the word acquisition performance. The average of
the RRRs across all entities gives the Mean Reciprocal Rank Rate (MRRR).
Note that since MRRR is directly based on the learned word-entity associations p(w|e),
it is in fact a measure of grounding words to entities.
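A sketch of the RRR and MRRR computations is given below; `candidate_ranking` is assumed to map each entity to its candidate words ordered by decreasing p(w|e), and `gold` to map each entity to its ground-truth words.

```python
def reciprocal_rank_rate(ranked_words, gold_words):
    """RRR for one entity: sum of 1/rank of the gold words over the best achievable sum."""
    achieved = sum(1.0 / (ranked_words.index(w) + 1) for w in gold_words if w in ranked_words)
    best = sum(1.0 / i for i in range(1, len(gold_words) + 1))
    return achieved / best

def mean_reciprocal_rank_rate(candidate_ranking, gold):
    """MRRR: average RRR across all entities."""
    rrrs = [reciprocal_rank_rate(candidate_ranking[e], gold[e]) for e in gold]
    return sum(rrrs) / len(rrrs)

# Toy usage with two entities (the word lists are illustrative).
ranking = {"dresser_1": ["table", "dresser", "area", "vanity", "desk"],
           "lamp_2": ["lamp", "light", "table"]}
gold = {"dresser_1": ["dresser", "vanity", "desk"], "lamp_2": ["lamp"]}
print(round(mean_reciprocal_rank_rate(ranking, gold), 3))
```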
6.6.2 Evaluation Results
To compare the effects of different speech-gaze alignments on word acquisition, we
evaluate the following models:
• Model-1 — base model I without word-entity alignment (Equation (6.1)).
• Model-2 — base model II with positional alignment (Equation (6.2)).
• Model-2t — enhanced model with temporal alignment (Equation (6.3)).
• Model-2s — enhanced model with semantic alignment (Equation (6.8)).
• Model-2ts — enhanced model with both temporal and semantic alignment (Equa-
tion (6.10)).
To compare the different ways of incorporating semantic relatedness in word ac-
quisition as discussed in Section 6.4.3, we also evaluate the following models:
• Model-1-r — Model-1 with semantic relatedness rescoring of word-entity associ-
ation.
• Model-2t-r — Model-2t with semantic relatedness rescoring of word-entity asso-
ciation.
Figures 6.4, 6.5, and 6.6 compare the results of models with different speech-gaze
alignments and models with semantic relatedness rescoring. In the figures, n-best
means the top n word candidates are chosen as acquired words for each entity. The
Mean Reciprocal Rank Rates of all models are compared in Figure 6.7.
Results of Using Different Speech-Gaze Alignments
As shown in Figures 6.4(a), 6.5(a), and 6.6(a), Model-2 does not show a consistent
improvement compared to Model-1 when a different number of n-best words are
chosen as acquired words. This result shows that it is not very helpful to consider
the index-based positional alignment of word and entity for word acquisition.
Figures 6.4(a), 6.5(a), and 6.6(a) also show that models considering temporal
and/or semantic information (Model-2t, Model-2s, Model-2ts) consistently per-
form better than the models considering neither temporal nor semantic information
(Model-1, Model-2). Among Model-2t, Model-2s, and Model-2ts, it is found that they
do not make consistent differences.
As shown in Figure 6.7, the MRRRs of the different models are consistent with their
performances on F-measure. A t-test has shown that the difference between the
MRRRs of Model-1 and Model-2 is not statistically significant. Compared to Model-
1, t-tests have confirmed that MRRR is significantly improved by Model-2t (t =
2.29, p < 0.016), Model-2s (t = 3.40, p < 0.002), and Model-2ts (t = 3.12, p < 0.003).
T-tests have shown no significant differences among Model-2t, Model-2s, and Model-
2ts.
[Plot (a): precision vs. n-best for Model-1, Model-2, Model-2t, Model-2s, and Model-2ts.]
(a) Precision of word acquisition when different speech-gaze alignments
are applied
[Plot (b): precision vs. n-best for Model-1, Model-1-r, Model-2t, and Model-2t-r.]
(b) Precision of word acquisition when semantic relatedness rescoring
of word-entity association is applied
Figure 6.4. Precision of word acquisition
[Plot (a): recall vs. n-best for Model-1, Model-2, Model-2t, Model-2s, and Model-2ts.]
(a) Recall of word acquisition when different speech-gaze alignments are
applied
[Plot (b): recall vs. n-best for Model-1, Model-1-r, Model-2t, and Model-2t-r.]
(b) Recall of word acquisition when semantic relatedness rescoring of
word-entity association is applied
Figure 6.5. Recall of word acquisition
[Plot (a): F-measure vs. n-best for Model-1, Model-2, Model-2t, Model-2s, and Model-2ts.]
(a) F-measure of word acquisition when different speech-gaze alignments
are applied
[Plot (b): F-measure vs. n-best for Model-1, Model-1-r, Model-2t, and Model-2t-r.]
(b) F-measure of word acquisition when semantic relatedness rescoring
of word-entity association is applied
Figure 6.6. F-measure of word acquisition
[Bar chart: Mean Reciprocal Rank Rate for M-1, M-2, M-2t, M-2s, M-2ts, and M-2t-r.]
Figure 6.7. MRRRs achieved by different models
Results of Applying Semantic Relatedness Rescoring
Figures 6.4(b), 6.5(b), and 6.6(b) show that semantic relatedness rescoring improves
word acquisition. After semantic relatedness rescoring of the word-entity associations
learned by Model-1, Model-1-r improves the F-measure consistently when a different
number of n-best words are chosen as acquired words. Compared to Model-2t, Model-
2t-r also improves the F-measure consistently.
Comparing the two ways of using semantic relatedness for word acquisition, it is
found that rescoring word-entity association with semantic relatedness works better.
When semantic relatedness is used together with temporal information to constrain
word-entity alignments in Model-2ts, word acquisition performance is not improved
compared to Model-2t. However, using semantic relatedness to rescore the word-entity
association learned by Model-2t, Model-2t-r further improves word acquisition.
As shown in Figure 6.7, the MRRRs of Model-1-r and Model-2t-r are consistent
with their performances on F-measure. Compared to Model-2t, Model-2t-r improves
MRRR. A t-test has confirmed that this is a significant improvement (t = 1.96, p <
0.031). Compared to Model-1, Model-1-r significantly improves MRRR (t = 2.33, p <
0.015). There is no significant difference between Model-1-r and Model-2t/Model-
2s/Model-2ts.
In Figure 6.5, we notice that the recall of the acquired words is still relatively
low even when the 10 best word candidates are chosen for each entity. This is mainly
due to the scarcity of those words that are not acquired in the data. Many of the
words that are not acquired appear less than 3 times in the data, which makes them
unlikely to be associated with any entity by the translation models. When more data
is available, we expect to see higher recall.
6.6.3 An Example

Table 6.1 shows the 5-best words acquired by different models for the entity dresser_1
in the 3D room scene. In the table, each word is followed by its word-entity association
probability p(w|e). The correctly acquired words are shown in bold font.
Rank   Model-1            Model-2t           Model-2t-r
1      table (0.173)      table (0.196)      table (0.294)
2      dresser (0.067)    dresser (0.101)    dresser (0.291)
3      area (0.058)       area (0.056)       vanity (0.147)
4      picture (0.053)    vanity (0.051)     desk (0.038)
5      dressing (0.041)   dressing (0.050)   area (0.024)

Table 6.1. N-best candidate words acquired for the entity dresser_1 by different models
As shown in the example, the baseline Model-1 learned 2 correct words in the 5-best
list. Considering speech-gaze temporal information, Model-2t learned one more
correct word, vanity, in the 5-best list. With semantic relatedness rescoring, Model-2t-r
further acquired the word desk in the 5-best list because of the high semantic relatedness
between the word desk and the type of the entity dresser_1. Although neither Model-1 nor Model-2t
successfully acquired the word desk in the 5-best list, the rank (7) of the word desk
in Model-2t's n-best list is much higher than its rank (21) in Model-1's n-best list.
6.7 Summary
This chapter investigates the use of eye gaze for automatic word acquisition in multimodal
conversational systems. In particular, we investigate the use of speech-gaze
temporal information and word-entity semantic relatedness to facilitate word acquisition.
The experiments show that word acquisition is significantly improved when
temporal information is considered, which is consistent with previous psycholinguistic
findings about speech and eye gaze. Moreover, using temporal information
together with semantic relatedness rescoring further improves word acquisition.
CHAPTER 7
Incorporation of Interactivity with Eye
Gaze for Automatic Word Acquisition
In the previous chapter, we described the use of speech-gaze temporal information
and domain semantic relatedness for automatically acquiring words from the user's
speech and its accompanying gaze fixations. Successful word acquisition relies on the
tight link between what the user says and what the user sees. Although published
studies provide us with a sound empirical basis for assuming that eye movements are
predictive of speech, gaze behavior in an interactive setting can be much more
complex. There are different types of eye movements [50]. The naturally occurring eye
gaze during speech production may serve different functions, for example, to engage in
the conversation or to manage turn taking [70]. Furthermore, while interacting with
a graphic display, a user could be talking about objects that were previously seen
on the display or about something completely unrelated to any object the user is looking
at. Therefore, using all the speech-gaze pairs for word acquisition can be detrimental.
The type of gaze that is most useful for word acquisition is the kind that reflects
the underlying attention and is tightly linked to the content of the co-occurring spoken
utterances. Thus, one important question is how to identify the closely coupled speech
and gaze streams to improve word acquisition.
To address this question, in this chapter, we develop an approach that incorporates
interactivity (e.g., user activity, conversation context) with eye gaze to identify the
closely coupled speech and gaze streams. We further use the identified speech and
gaze streams for word acquisition. Our studies indicate that automatic identification
of closely coupled gaze-speech stream pairs is an important first step that leads to
performance gains in word acquisition. Our simulation studies further demonstrate
the effect of automatic online word acquisition on improving language understanding
in human-machine conversation.

In the following sections, we first describe the data collection in a new 3D interactive
domain, then present the automatic identification of the closely coupled
gaze-speech pairs and its effect on word acquisition. The last part of this chapter
presents a simulation study that exemplifies how word acquisition can be automatically
achieved and how the acquired words affect language interpretation during online
conversation.
7.1 Data Collection
We recruited 20 users to interact with our speech-gaze system to collect data.
7.1.1 Domain
We used the 3D treasure hunting domain (see Section 3.3.2) for the investigation of
automatic word acquisition in multimodal conversation. In this application, the user
needs to consult with a remote "expert" (i.e., an artificial system) to find hidden
treasures in a castle containing 115 3D objects. The expert has some knowledge about the
treasures but cannot see the castle. The user has to talk to the expert for advice
on finding the treasures. The application is developed based on a game engine and
provides an immersive environment for the user to navigate in the 3D space. A
detailed description of the user study is given in Appendix A.3.
During the experiment, the user's speech was recorded, and the user's eye gaze was
captured by a Tobii eye tracker. Figure 7.1 shows a snapshot of one user's experiment.

Figure 7.1. A snapshot of one user's experiment (the dot on the stereo indicates the user's
gaze fixation, which was not shown to the user during the experiment)
It is worthwhile to note that the collected data set is different from the data set
used for the investigation in Chapter 6. The difference lies in two aspects: 1) the data
for this investigation was collected during mixed-initiative human-machine conversation,
whereas the data in Chapter 6 was based only on question answering; 2) the
user studies for this investigation were conducted in a more complex domain, which
resulted in a richer data set with a larger vocabulary.
7.1.2 Data Preprocessing
From the 20 users' experiments, we collected 3709 utterances with accompanying gaze
fixations. We transcribed the collected speech. The vocabulary size of the speech
transcript is 1082, among which 227 words are nouns and adjectives. The user's
speech was also automatically recognized online by the Microsoft speech recognizer
with a word error rate (WER) of 48.1% for the 1-best recognition. The vocabulary size
of the 1-best speech recognition is 3041, among which 1643 are nouns and adjectives.

The collected speech and gaze streams are automatically paired together by the
system. Each time the system detects a sentence boundary in the user's speech, it
pairs the recognized speech with the gaze fixations that the system has accumulated
since the previously detected sentence boundary. Given the paired speech and
gaze streams, we build a parallel data set of word sequences and gaze-fixated entity
sequences {(w, e)} for the task of word acquisition. For the gaze stream, e contains
all the gaze-fixated entities. For the speech stream, we can build w from either the speech
transcript or the 1-best speech recognition. The resulting word sequence w contains
all the nouns and adjectives in the transcript or the 1-best recognition.
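As a rough illustration of this pairing step, the following is a minimal sketch rather than the system's actual implementation; the Fixation and Utterance structures, the POS tags, and the timestamps are assumptions made for the example.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Fixation:               # one gaze fixation (hypothetical structure)
        entity_id: str            # id of the fixated entity
        start: float              # fixation onset time (seconds)
        end: float                # fixation offset time (seconds)

    @dataclass
    class Utterance:              # one recognized/transcribed utterance (hypothetical)
        words: List[Tuple[str, str]]   # (word, POS tag) pairs
        start: float                   # utterance onset time (seconds)
        end: float                     # detected sentence-boundary time (seconds)

    CONTENT_TAGS = {"NN", "NNS", "JJ"}     # nouns and adjectives

    def build_parallel_instance(utt: Utterance, fixations: List[Fixation]):
        """Pair one utterance with the gaze fixations accumulated over its span:
        w keeps only nouns/adjectives, e keeps the fixated entity ids."""
        w = [word for word, tag in utt.words if tag in CONTENT_TAGS]
        e = [f.entity_id for f in fixations if f.end >= utt.start and f.start <= utt.end]
        return w, e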
7.2 Identification of Closely Coupled Gaze-Speech Pairs
As mentioned earlier, not all gaze-speech pairs are useful for word acquisition. In
a gaze-speech pair, if the speech does not contain any word that relates to any of the
gaze-fixated entities, the instance only adds noise to word acquisition. Therefore,
we should identify the closely coupled gaze-speech pairs and use only them for word
acquisition.

In this section, we first describe the feature extraction and then describe the use
of a logistic regression classifier to predict whether a gaze-speech pair is a closely
coupled gaze-speech instance, i.e., an instance where at least one noun or adjective in
the speech stream refers to some gaze-fixated entity in the gaze stream. To train
the classifier for gaze-speech prediction, we manually labeled whether each instance
is a closely coupled gaze-speech instance based on the speech transcript
and gaze fixations.
7.2.1 Feature Extraction
For each parallel gaze-speech instance, the following sets of features are automatically
extracted; a feature-extraction sketch is given at the end of this subsection.
SPEECH FEATURES (S-FEAT)

Let c_w be the count of nouns and adjectives in the utterance, and l_s be the temporal
length of the speech. The following features are extracted from speech:

• c_w: count of nouns and adjectives. More nouns and adjectives are expected in the user's utterances describing entities.

• c_w / l_s: normalized noun/adjective count. The effect of the speech length l_s on c_w is considered.
GAZE FEATURES (G-FEAT)

For each fixated entity e_i, let l_g^i be its fixation temporal length. Note that several
gaze fixations may have the same fixated entity; l_g^i is the total length of all the gaze
fixations on entity e_i. We extract the following features from the gaze stream:

• c_e: count of different gaze-fixated entities. Fewer fixated entities are expected when the user is describing entities while looking at them.

• c_e / l_s: normalized entity count. The effect of the speech temporal length l_s on c_e is considered.

• max_i(l_g^i): maximal fixation length. At least one fixated entity's fixation is expected to be long when the user is describing entities while looking at them.

• mean_i(l_g^i): average fixation length. The average gaze fixation length is expected to be longer when the user is describing entities while looking at them.

• var_i(l_g^i): variance of fixation lengths. The variance of the fixation lengths is expected to be smaller when the user is describing entities while looking at them.

The number of gaze-fixated entities is not only decided by the user's eye gaze; it
is also affected by the visual scene. Let c_s be the count of all the entities that have
been visible during the length of the gaze stream. We also extract the following scene-related
feature:

• c_e / c_s: scene-normalized fixated entity count. The effect of the visual scene on c_e is considered.
USER ACTIVITY FEATURES (UA-FEAT)

While interacting with the system, the user's activity can also be helpful in determining
whether the user's eye gaze is tightly linked to the content of the speech. The
following features are extracted from the user's activities:

• maximal distance of the user's movements: the maximal change of the user's position (3D coordinates) during the speech length. The user is expected to move within a smaller range while looking at entities and describing them.

• variance of the user's positions: the user is expected to move less frequently while looking at entities and describing them.
CONVERSATION CONTEXT FEATURES (CC-FEAT)

While talking to the system (i.e., the "expert"), the user's language and gaze behavior
are influenced by the state of the conversation. For each gaze-speech instance, we use
the previous system response type as a nominal feature to predict whether this is a
closely coupled gaze-speech instance.

In our treasure hunting domain, there are 8 types of system responses in 2 categories:

System Initiative Responses:

• specific-see: the system asks whether the user sees a certain entity, e.g., "Do you see another couch?".

• nonspecific-see: the system asks whether the user sees anything, e.g., "Do you see anything else?", "Tell me what you see".

• previous-see: the system asks whether the user has previously seen something, e.g., "Have you previously seen a similar object?".

• describe: the system asks the user to describe in detail what the user sees, e.g., "Describe it", "Tell me more about it".

• compare: the system asks the user to compare what the user sees, e.g., "Compare these objects".

• clarify: the system asks the user to make a clarification, e.g., "I did not understand that", "Please repeat that".

• action-request: the system asks the user to take an action, e.g., "Go back", "Try moving it".

User Initiative Responses:

• misc: the system hands the initiative back to the user without specifying further requirements, e.g., "I don't know", "Yes".
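The following is a minimal sketch of how these feature sets might be computed for one gaze-speech instance; the pre-aggregated input formats (content words, per-entity fixation lengths, sampled 3D user positions, previous response type) are assumptions made for illustration, not the thesis implementation.

    import math
    import statistics

    def extract_features(content_words, speech_len, fixation_lengths,
                         visible_entity_count, positions, prev_response_type):
        """Compute S-Feat, G-Feat, UA-Feat, and CC-Feat for one instance.
        `fixation_lengths` maps entity id -> total fixation time (seconds);
        `positions` is a list of (x, y, z) user positions sampled during the
        utterance; `speech_len` is the utterance length in seconds."""
        feats = {}

        # S-Feat: noun/adjective count and its speech-length-normalized version
        c_w = len(content_words)
        feats["c_w"] = c_w
        feats["c_w_norm"] = c_w / speech_len if speech_len > 0 else 0.0

        # G-Feat: statistics over per-entity fixation lengths
        lengths = list(fixation_lengths.values())
        c_e = len(lengths)
        feats["c_e"] = c_e
        feats["c_e_norm"] = c_e / speech_len if speech_len > 0 else 0.0
        feats["max_fix"] = max(lengths, default=0.0)
        feats["mean_fix"] = statistics.mean(lengths) if lengths else 0.0
        feats["var_fix"] = statistics.pvariance(lengths) if lengths else 0.0
        feats["c_e_scene"] = c_e / visible_entity_count if visible_entity_count else 0.0

        # UA-Feat: maximal displacement and positional variance during the utterance
        dists = [math.dist(p, q)
                 for idx, p in enumerate(positions) for q in positions[idx + 1:]]
        feats["max_move"] = max(dists, default=0.0)
        feats["pos_var"] = (sum(statistics.pvariance(axis) for axis in zip(*positions))
                            if positions else 0.0)

        # CC-Feat: previous system response type as a nominal feature
        feats["prev_response"] = prev_response_type
        return feats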
7.2.2 Logistic Regression Model

Given the extracted features x and the "closely coupled" label y of each instance in
the training set, we train a ridge logistic regression model [60] to predict whether an
instance is a closely coupled instance (y = 1) or not (y = 0).

In the logistic regression model, the probability that y_i = 1, given the feature
vector x_i = (x_1^i, x_2^i, ..., x_n^i), is modeled by

    p(y_i = 1 | x_i) = \frac{\exp(\sum_{j=1}^{n} \beta_j x_j^i)}{1 + \exp(\sum_{j=1}^{n} \beta_j x_j^i)}

where the \beta_j are the feature weights to be learned.

The log-likelihood l of the data (X, y) is

    l(\beta) = \sum_i [ y_i \log p(y_i | x_i) + (1 - y_i) \log(1 - p(y_i | x_i)) ]

In ridge logistic regression, the parameters \beta_j are estimated by maximizing a regularized
log-likelihood

    l_r(\beta) = l(\beta) - \lambda \|\beta\|^2

where \lambda is the ridge parameter, introduced to achieve more stable parameter
estimation.
We used the Weka toolkit [115] for the training of the ridge logistic regression
model.
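As a rough stand-in for that Weka setup, the sketch below fits an L2-regularized (ridge-style) logistic regression with scikit-learn over per-instance feature dictionaries such as those produced by the extract_features sketch above; the regularization strength C is an illustrative choice, not the thesis's setting.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict
    from sklearn.pipeline import make_pipeline

    def gaze_speech_classifier(C=1.0):
        """L2-penalized logistic regression over feature dicts; DictVectorizer
        one-hot encodes the nominal CC-Feat (previous response type)."""
        return make_pipeline(
            DictVectorizer(sparse=False),
            LogisticRegression(penalty="l2", C=C, max_iter=1000),
        )

    # Example: 10-fold cross-validated predictions over labeled instances,
    # mirroring the evaluation in the next section.
    # preds = cross_val_predict(gaze_speech_classifier(), feature_dicts, labels, cv=10)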
7.3 Evaluation of Gaze-Speech Identification
We evaluate gaze-speech identification on the instances with 1-best speech recognition.
Since the goal of identifying closely coupled gaze-speech instances is to improve
word acquisition and we are only interested in acquiring nouns and adjectives, only
the instances with recognized nouns/adjectives are used for training the logistic regression
classifier. Among the 2969 instances with recognized nouns/adjectives and
gaze fixations, 2002 (67.4%) are labeled as closely coupled. The gaze-speech
prediction was evaluated by 10-fold cross validation.
Table 7.1 shows the prediction precision and recall when different sets of features
are used. As seen in the table, as more features are used, the prediction precision
goes up and the recall goes down. It is important to note that prediction precision is
more critical than recall for word acquisition when a sufficient amount of data is available.
Noisy instances where the gaze does not link to the speech content will only hurt word
acquisition, since they will guide the translation models to ground words to the wrong
entities. Although higher recall can be helpful, its effect is expected to diminish
as more data becomes available.
Feature sets                                 Precision   Recall
Null (baseline)                              0.674       1
S-Feat                                       0.686       0.995
G-Feat                                       0.707       0.958
UA-Feat                                      0.704       0.942
CC-Feat                                      0.688       0.936
G-Feat + UA-Feat                             0.719       0.948
G-Feat + UA-Feat + S-Feat                    0.741       0.908
G-Feat + UA-Feat + CC-Feat                   0.731       0.918
G-Feat + UA-Feat + S-Feat + CC-Feat          0.748       0.899

Table 7.1. Gaze-speech prediction performance with different feature sets for the instances
with 1-best speech recognition
The results show that speech features (S-Feat) and conversation context features
(CC-Feat), when used alone, do not improve prediction precision much compared to
the baseline of predicting all instances as "closely coupled", which has a precision of 67.4%.
When used alone, gaze features (G-Feat) and user activity features (UA-Feat) are
the two most useful feature sets for increasing prediction precision. When they are
used together, the prediction precision is further increased. Adding either speech
features or conversation context features to the gaze and user activity features (G-Feat
+ UA-Feat + S-Feat/CC-Feat) increases the prediction precision further. Using all
four sets of features (G-Feat + UA-Feat + S-Feat + CC-Feat) achieves the highest
prediction precision, which is significantly better than the baseline (z = 5.93, p <
0.001). Therefore, we choose to use all feature sets to identify the closely coupled
gaze-speech instances for word acquisition.
To compare the effect of the identified closely coupled gaze-speech instances on
word acquisition from different speech input (1-best speech recognition, speech transcript),
we also use the logistic regression classifier with all features to predict closely
coupled gaze-speech instances for the instances with speech transcript. For the instances
with speech transcript, there are 2948 instances with nouns/adjectives and
gaze fixations, 2128 (72.2%) of which are labeled as closely coupled. The prediction
precision is 77.9% and the recall is 93.8%. The prediction precision is significantly
better than the baseline of predicting all instances as coupled (z = 4.92, p < 0.001).
7.4 Evaluation of Word Acquisition
In Chapter 6, we have shown that Model-2t-r (Section 6.4), which incorporates the temporal
alignment between speech and eye gaze as well as domain semantic relatedness,
achieves significantly better word acquisition performance. Therefore, this model is
used for word acquisition in this investigation. The words acquired by Model-2t-r
are evaluated against the "gold standard" words that we manually compiled for each
entity and its properties based on all users' speech transcripts and gaze fixations.
Those "gold standard" words are the words that the users have used to describe the
entities and their properties during the interaction with the system.
7.4.1 Evaluation Metrics

We evaluate the n-best acquired words on the following metrics (a computation sketch follows the list):

• Precision

• Recall

• F-measure

When a different n is chosen, we will have different precision, recall, and F-measure values.
We also evaluate the whole ranked candidate word list on:

• Mean Reciprocal Rank Rate (MRRR) (see Section 6.6.1)
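A minimal sketch of the n-best evaluation follows; the dictionary-based layout of acquired and gold-standard words is an assumption for illustration, and this sketch micro-averages across entities (the thesis does not spell out the averaging here). MRRR itself is defined in Section 6.6.1 and is not re-derived.

    def nbest_metrics(acquired, gold, n):
        """Precision/recall/F-measure of the n-best acquired words.
        `acquired` maps entity -> candidate words ranked by p(w|e);
        `gold` maps entity -> set of gold-standard words for that entity."""
        correct = retrieved = relevant = 0
        for entity, gold_words in gold.items():
            nbest = acquired.get(entity, [])[:n]
            correct += sum(1 for w in nbest if w in gold_words)
            retrieved += len(nbest)
            relevant += len(gold_words)
        precision = correct / retrieved if retrieved else 0.0
        recall = correct / relevant if relevant else 0.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall > 0 else 0.0)
        return precision, recall, f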
7.4.2 Evaluation Results
We evaluate the effect of the closely coupled gaze-speech instances on word acquisition
from the 1-best speech recognition. To show the influence of speech recognition quality
on word acquisition performance, we also evaluate word acquisition from the speech
transcript. The predicted closely coupled gaze-speech instances in the evaluations are
generated by 10-fold cross validation with the logistic regression classifier.

Figures 7.2-7.7 show the precision, recall, and F-measure of the n-best words
acquired by Model-2t-r using all instances (all), only predicted closely coupled instances
(predicted), and true (manually labeled) closely coupled instances (true). In
Figures 7.2-7.4, the acquired words come from the 1-best speech recognition of the
users' utterances. In Figures 7.5-7.7, the acquired words come from the transcripts
of the users' utterances.

Figure 7.8 compares the MRRRs achieved by Model-2t-r using different sets of
instances (all instances, predicted closely coupled instances, true closely coupled instances)
with different speech input (1-best speech recognition, speech transcript).
Results of Word Acquisition on 1-best Speech Recognition

As shown in Figure 7.4, using predicted instances achieves consistently better performance
than using all instances, except in the case where only the 3-best word candidates
are evaluated. These results show that the prediction of closely coupled gaze-speech
instances helps word acquisition. When the true closely coupled gaze-speech instances
are used for word acquisition, the word acquisition performance is further improved.
This means that higher gaze-speech prediction precision leads to better word acquisition
performance.

Figure 7.2. Precision of word acquisition on 1-best speech recognition with Model-2t-r

Figure 7.3. Recall of word acquisition on 1-best speech recognition with Model-2t-r

We notice that using all instances actually achieves a higher F-measure than using
predicted instances for the 3-best word candidates. This is because there are a few
"gold standard" words that do not appear in the predicted gaze-speech instances due
to the scarcity of these words in the whole data set.

Figure 7.4. F-measure of word acquisition on 1-best speech recognition with Model-2t-r

In the word acquisition with all
instances, these words will not appear in the 10-best list if word acquisition is based
only on co-occurrence statistics (as in Model-1). In Model-2t-r, with domain semantic
relatedness rescoring, these words are boosted up into the 3-best list. However, this cannot
happen in the word acquisition with the predicted gaze-speech instances because
the predicted instances do not contain these few words, and therefore it is impossible to
acquire them. Hence, for Model-2t-r, using all instances happens to outperform
using predicted instances when only the 3-best word candidates are evaluated. We
believe this will not happen when a fairly large amount of data is available for word
acquisition.
As shown in Figure 7.8, the MRRRs achieved by Model-2t-r using different sets of
instances with the 1-best speech recognition are consistent with their performances
on F-measure. Using predicted instances results in a significantly better MRRR than
using all instances (t = 1.89, p < 0.031).
119
== +all
0.55‘ +predicted .
, —£B—true
05
§ 0.45
.2
8
g 0.4
0.35-
0.3»
02512 3 4 5 6 7 8 910
n-best
Figure 7 .5. Precision of word acquisition on speech transcript with Model-2t-r
Figure 7.6. Recall of word acquisition on speech transcript with Model-2t-r
Results of Word Acquisition on Speech Transcript
For the word acquisition on speech transcript, as shown in Figure 7.7, using predicted
closely coupled instances results in a better F-measure than using all instances.

Figure 7.7. F-measure of word acquisition on speech transcript with Model-2t-r

Figure 7.8. MRRRs achieved by Model-2t-r with different sets of instances (all, predicted, true) on 1-best speech recognition and speech transcript

When
the true closely coupled instances are used for word acquisition, the F-measure is
further improved.
As shown in Figure 7.8, consistent with its F-measure performance, using predicted
instances results in a significantly better MRRR than using all instances (t = 2.66, p < 0.005).
The quality of speech recognition is critical to word acquisition performance. Figure 7.8
also compares the word acquisition performance on the 1-best speech recognition
and the speech transcript. As expected, the word acquisition performance on the speech
transcript is much better than on the 1-best speech recognition. This result shows that
better speech recognition will lead to better word acquisition.
7.5 The Effect of Word Acquisition on Language Understanding
One important goal of word acquisition is to use the acquired new words to help language
understanding in subsequent conversation. To demonstrate the effect of online
word acquisition on language understanding, we conduct simulation studies based on
our collected data. In these simulations, the system starts with an initial knowledge
base, a vocabulary of words associated with domain concepts. The system continuously
enhances its knowledge base by acquiring words from users with Model-2t-r
(Section 6.4), which incorporates both speech-gaze temporal information and domain
semantic relatedness. The enhanced knowledge base is used to understand the language
of new users.

We evaluate language understanding performance by the concept identification rate
(CIR):

    CIR = \frac{\#\,\text{correctly identified concepts in the 1-best speech recognition}}{\#\,\text{concepts in the speech transcript}}
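As a small illustration of this metric, the sketch below computes CIR over a set of utterances; the concept-identification helper is a hypothetical placeholder for the system's vocabulary-based concept identifier, and exact matching of concept multisets is an assumption.

    from collections import Counter

    def concept_identification_rate(utterances, identify_concepts):
        """CIR = correctly identified concepts in the 1-best recognition divided
        by concepts in the transcript. `utterances` is a list of
        (recognized_text, transcript_text) pairs; `identify_concepts` maps a text
        string to a list of concepts using the current vocabulary (hypothetical)."""
        correct = total = 0
        for recognized, transcript in utterances:
            gold = Counter(identify_concepts(transcript))
            hyp = Counter(identify_concepts(recognized))
            correct += sum((gold & hyp).values())   # concepts found in both
            total += sum(gold.values())
        return correct / total if total else 0.0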
We simulate the process of online word acquisition and evaluate its effect on
language understanding for two situations: 1) the system starts with no training data
but with a small initial vocabulary, and 2) the system starts with some training data.
7.5.1 Simulation 1: When the System Starts with No Training Data
To build conversational systems, one approach is for domain experts to provide a domain
vocabulary to the system at design time. Our first simulation follows this practice.
The system is provided with a default vocabulary and starts with no training data. The
default vocabulary contains one "seed" word for each domain concept.

Using the collected data of 20 users, the simulation process goes through the
following steps (sketched in code below):

• For user index i = 1, 2, ..., 20:

  - Evaluate the CIR of the i-th user's utterances (1-best speech recognition) with the current system vocabulary.

  - Acquire words from all the instances (with 1-best speech recognition) of users 1 through i.

  - Among the 10-best acquired words, add verified new words to the system vocabulary.
In the above process, the language understanding performance on each individual
user depends on the user's own language as well as the user's position in the user
sequence. To reduce the effect of user ordering on language understanding performance,
the above simulation process is repeated 500 times with randomly ordered
users. The average of the CIRs over these simulations is shown in Figure 7.9.
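A minimal sketch of this simulation loop is given below; evaluate_cir, acquire_words, and verify_new_words are hypothetical callables standing in for the CIR evaluator, the Model-2t-r word acquirer, and the word verification step, and the per-user data records are an assumed format.

    import random

    def simulate_no_training_data(users, seed_vocabulary,
                                  evaluate_cir, acquire_words, verify_new_words,
                                  runs=500):
        """Average CIR at each user position over `runs` random user orderings."""
        cir_sums = [0.0] * len(users)
        for _ in range(runs):
            order = users[:]
            random.shuffle(order)
            vocabulary = set(seed_vocabulary)   # one "seed" word per domain concept
            seen_instances = []
            for i, user in enumerate(order):
                # 1) CIR of this user's 1-best recognition with the current vocabulary
                cir_sums[i] += evaluate_cir(user, vocabulary)
                # 2) acquire words from all instances of users 1..i (1-best recognition)
                seen_instances.extend(user["instances"])
                candidates = acquire_words(seen_instances, n_best=10)
                # 3) add verified new words to the system vocabulary
                vocabulary.update(verify_new_words(candidates, vocabulary))
        return [s / runs for s in cir_sums]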
Figure 7.9 also shows the CIRs when the system has a static knowledge base
(vocabulary). That curve is drawn in the same way as the curve for the dynamic
knowledge base, except without word acquisition in the random simulation processes.
As we can see in the figure, when the system does not have word acquisition capability,
its language understanding performance does not change after more users have
communicated with the system. With the capability of automatic word acquisition, the
system's language understanding performance becomes better after more users have
talked to the system.

Figure 7.9. CIR of user language achieved by the system starting with no training data (dynamic vs. static knowledge base)
7.5.2 Simulation 2: When the System Starts with Training Data
Many conversational systems use real user data to derive the domain vocabulary. To
follow this practice, the second simulation provides the system with some training
data. The training data serves two purposes: 1) to build an initial vocabulary for the
system; 2) to train a classifier to predict the closely coupled gaze-speech instances in
new users' data.

Using the collected data of 20 users, the simulation process goes through the
following steps:

• Using the first m users' data as training data, acquire words from the training instances (with speech transcript); add the verified 10-best words to the system's vocabulary as "seed" words; and build a classifier with the training data for prediction of closely coupled gaze-speech instances.

• Evaluate the effect of incremental word acquisition on the CIR of the remaining (20 - m) users' data. For user index i = 1, 2, ..., (20 - m):

  - Evaluate the CIR of the i-th user's utterances (1-best speech recognition).

  - Predict the coupled gaze-speech instances of the i-th user's data.

  - Acquire words from the m training users' true coupled instances (with speech transcript) and the predicted coupled instances (with 1-best speech recognition) of users 1 through i.

  - Among the 10-best acquired words, add verified new words to the system vocabulary.
The above simulation process is repeated 500 times with randomly ordered users
to reduce the effect of user ordering on the language understanding performance.
Figure 7.10 shows the averaged language understanding performance of these random
simulations.
The language understanding performance of the system with a static knowledge
base is also shown in Figure 7.10. The curve is drawn by the same random simulations
without the steps of word acquisition. We can observe a general trend in the figure
that, with word acquisition, the system’s language understanding becomes better after
more users have communicated to the system. Without word acquisition capability,
the system’s language understanding performance does not increase after more users
have conversed with the system.
The simulations show that automatic vocabulary acquisition is beneficial to the
system's language understanding performance when training data is available. When
training data is not available, vocabulary acquisition could be even more important and
beneficial for robust language understanding.

Figure 7.10. CIR of user language achieved by the system starting with 10 users' training data (dynamic vs. static knowledge base)

It is worth mentioning that the results
shown here are based on the 1-best recognized speech hypotheses with a relatively
high WER (48.1%). With better speech recognition, we expect to have better concept
identification results.
7.6 Summary
This chapter investigates the automatic identification of closely coupled gaze-speech
instances and its application to automatic word acquisition in multimodal conversational
systems. In particular, this chapter explores the use of features extracted
from speech, eye gaze, user interaction activities, and conversation context for predicting
whether the user's naturally occurring eye gaze links to the content of the
user's speech.

This chapter also investigates the application of the identified closely coupled
gaze-speech instances to word acquisition. The gaze-speech prediction and its effect on
word acquisition are evaluated on the 1-best speech recognition and the speech transcript.
The experiments demonstrate that the automatic identification of closely coupled
gaze-speech instances significantly improves word acquisition, whether the words
are acquired from the 1-best speech recognition or from the speech transcript.

Moreover, this chapter demonstrates that, during the multimodal conversation process,
a system with word acquisition capability is able to understand the user's language
better after more users have communicated with the system.
CHAPTER 8
Conclusions
8.1 Contributions
In this thesis, we present our work on using non-verbal modalities for human language
interpretation in multimodal conversational systems. Particularly, we present a joint
solution to the problems of unreliable speech input and unexpected speech input in
multimodal conversational systems, which includes two aspects: 1) use deictic gesture
and eye gaze to improve speech recognition and understanding, and 2) use eye gaze to
acquire new words automatically during multimodal conversation. Our evaluations
have demonstrated the promise of incorporating non-verbal modalities to help speech
recognition and language understanding during multimodal conversation.
Specific contributions of this thesis include:
• Systematic investigation of incorporating deictic gesture and eye gaze to improve
speech recognition hypotheses for spoken language understanding. We have developed
salience driven approaches to incorporate the domain context activated
by gesture/gaze in speech recognition. The gesture/gaze-based salience driven
language models are used in different stages of speech recognition to improve
recognition hypotheses. Experimental results show that, by using non-verbal
salience driven language models, the word error rate of speech recognition is
decreased by 6.7% and the concept identification F-measure is increased by
4.2%.
• Systematic investigation of using deictic gesture to improve spoken language
understanding in multimodal interpretation. We have developed model-based
and instance-based approaches to incorporate gestural information in language
understanding. Experimental results have shown that the accuracy of intention
recognition in language understanding is increased by 6% to 6.6% by different
approaches that incorporate gestural information. We further analyze the
implications of these results for building practical conversational systems.
• Systematic investigation of using eye gaze for automatic word acquisition in
multimodal conversation. We have developed word acquisition models that
incorporate speech-gaze temporal information and domain semantic relatedness
to improve word acquisition. By using the temporal and semantic information,
the mean reciprocal rank rate (MRRR) of word acquisition is increased by
43.2% in our experiment. To further improve word acquisition performance,
we build a classifier based on user interactivity to pick out "useful" speech-gaze
instances before word acquisition, which results in a further increase of
MRRR by 3.6%. Our simulation studies have shown that automatic online
word acquisition improves the system's language understanding performance.
• A multimodal conversational system supporting speech, deictic gesture, and
eye gaze, developed for 3D domains. Integrating techniques from speech recognition,
eye tracking, and computer graphics, we have implemented a multimodal
conversational system based on 3D interior domains. The system supports
speech, deictic gesture, and eye gaze inputs from the user during multimodal
conversation. It provides a framework for developing different multimodal applications.
• Corpora of multimodal data collected through user studies. This research results
in 3 sets of data for studying multimodal conversation. These data provide user
speech and the accompanying deictic gestures and eye gaze fixations during
multimodal conversation. The data has been annotated for this thesis research.
The annotation includes the transcript of speech, the timestamps of transcribed
words, the referred entity in users' speech, and the labeling of closely coupled
gaze-speech pairs. These data will be made available to research communities.
8.2 Future Directions
Some future directions for research on using non-verbal modalities in language
processing include:

• In this thesis's work on automatic word acquisition, new words are grounded to
the domain concepts representing entities and their properties. These domain
concepts are already given to the system. It would be interesting for future work to
automatically learn these domain concepts.

• The current implementation of word acquisition by means of eye gaze learns
words referring to entities and their physical properties (color, size, material,
shape). It may be extended to learn words that describe the spatial relations
of entities and the user's actions.

• Besides word acquisition, eye gaze can also be used to help syntactic parsing
of the user's spoken language. For example, suppose the user says "there is a
book on a table with a brown cover". It is ambiguous in parsing whether the
prepositional phrase "with a brown cover" should be attached to "a book" or "a
table". However, using eye gaze fixations, the system can decide which entity
the phrase "with a brown cover" should be attached to, based on its domain
knowledge about the properties of the fixated entities (book, table).
APPENDICES
A Multimodal Data Collection
This section describes the user studies that we conducted to collect the speech-gesture
and speech-gaze multimodal data sets for the investigations in this thesis.
A.1 Speech-Gesture Data Collection in the Interior Decoration Domain
We collected speech-gesture data by conducting user studies in the interior decoration
domain (Section 3.3.1). In this study, users were asked to accomplish tasks in two
scenarios. Scenario 1 was to clean up and redecorate a messy room. Scenario 2 was to
arrange and decorate the room so that it looks like the room in the pictures provided
to the user. Each scenario put the user into a specific role (e.g., college student,
professor, merchant, etc.), and the task had to be completed with a set of constraints
(e.g., budget of furnishings, bed size, number of domestic products, etc.). Figures
A.1 & A.2 show the instructions for scenario 1 and scenario 2 that were given to the
user before the study.
We recruited 5 users for the study. During the study, the user's speech was
recorded through an open microphone and the user's deictic gestures were captured
by a touch-screen. From the user studies, we collected 649 spoken utterances with
accompanying gestures.
A.2 Speech-Gaze Data Collection in the Interior Decoration Domain
We also collected a corpus of speech-gaze data in the interior decoration domain with
a different user task. In this study, a static 3D bedroom scene was shown to the
user. The system verbally asked the user a list of questions one at a time about the
bedroom and the user answered the questions by speaking to the system. Figure A.3
lists the questions that are asked by the system.
We recruited 7 users for the study. During the study, the user's speech was
recorded through an open microphone and the user's eye gaze was captured by an
EyeLink II eye tracker. From the user studies, we collected 554 spoken utterances
with accompanying gaze streams.
Description of Scenario 1
1. You are planning to have an important meeting at your apartment. Cur-
rently, your apartment is a mess. You would like to clean it up and redecorate.
You have found a computer program that will allow you to manipulate the fur-
niture arrangement and style in the apartment. This will allow you to decorate
the virtual replica of your apartment prior to redecorating your real apartment.
This will minimize heavy lifting and save you lots of time! You have two goals.
The first goal is to clean up your messy apartment by removing, replacing, or
modifying objects that appear to be either out of place or have strange-looking
characteristics. The second goal is to redecorate your apartment. This can be
accomplished by adding, removing, or modifying objects.
2. You are not a millionaire, so you will have to stay under a specific budget during
the decoration process. You also have certain personality traits and practical
needs which will constrain the redecoration process. The budget along with
these needs will be defined by a character role card which will be given to you at
the beginning of this scenario.
3. Additionally, you will need to write down certain information about the resulting
redecorated apartment for future reference. The information that is important
to you will be determined by your character role.
Role: College Student
You are a college student. You want to have an exotic and colorful apartment,
but price is a major concern. You require a quality desk that will last for a long
time. You need cabinets with many drawers to store all your school work. You
prefer dim lighting and lots of plants and artwork. You want your apartment
to look as exotic and colorful as possible while satisfying your basic needs and
staying under a budget of $1800.
Role: Patriotic Family (with kids)
You are a former US Marine. You are very patriotic and have a family (with
kids) that shares your values. You want your apartment to contain as many
objects made in the US as possible (especially objects that have recently been made in the
US) and to be symbolic of the US, yet you also want your apartment to be practical
and safe for your children. You prefer soft, unbreakable furniture without sharp
corners that has recently been produced in the US. You need a large bed and
would prefer to have at least one reclining piece of furniture. You must satisfy
these preferences while staying under a budget of $2500.
Figure A.1. Instruction for scenario 1 in the interior decoration domain
Description of Scenario 2
1. Imagine that you are searching for a new place to live. You have found a
computer program that will allow you to manipulate the furniture arrangement
and style in a prospective apartment. When you recently visited an old friend,
you really enjoyed the layout of his/her apartment. The images of this apartment
are vividly ingrained in your mind. Your goal is to arrange your prospective
apartment in the mold of those images. To help with the story, sample images
will be provided for you.
2. While the layout of your friend’s place was aesthetically pleasing to you, certain
aspects of the apartment need to be modified to fulfill your own personality traits
and practical needs. These needs will be defined by a character role card which
will be given to you at the beginning of this scenario. Based on your chosen
character role, you will need to modify certain pieces of furniture to adhere to
your character’s needs.
3. Additionally, you will need to write down certain information about the prospective
apartment for future reference. The information that is important to
you will be determined by your character role.
Role: Collector
You are an art and antiques collector. You prefer old, expensive, and aesthetically
pleasing furniture. You sometimes take prospective customers to your apartment
and need to keep up the appearance that you know what you are talking about.
Your goal for this apartment is that it contains a lot of art (paintings), old and
expensive furniture, and objects from a wide variety of countries with a minimal
number of US-produced objects. You will need to modify the existing furniture
to adhere to your preferences.
Role: Professor
You are a college professor. The apartment's practicality is very important to
you. You require a quality desk that will last for a long time. You need cabinets
with many drawers. Light is very important to you. You prefer powerful
(high-wattage) lamps. Additionally, you require that a recliner is available when
you need to relax from your busy day. You want to efficiently balance comfort
vs. price; you generally don't want furniture made out of the cheapest or most
expensive material.
Figure A.2. Instruction for scenario 2 in the interior decoration domain
1. Describe this room.
2. What do you like/dislike about the arrangement?
3. Describe anything in the room that seems strange to you.
4. Is there a bed in this room?
5. How big is the bed?
6. Describe the area around the bed.
7. Would you make any changes to the area around the bed?
8. Describe the left wall.
9. How many paintings are there in this room?
10. Which is your favorite painting?
11. Which is your least favorite painting?
12. What is your favorite piece of furniture in the room?
13. What is your least favorite piece of furniture in the room?
14. How would you change this piece of furniture to make it better?

Figure A.3. Questions for users in the study
A.3 Speech-Gaze Data Collection in the Treasure Hunting Domain
We collected another corpus of speech-gaze data by conducting user studies in the
treasure hunting domain (Section 3.3.2). In this study, the user's task is to find
treasures that are hidden in a 3D castle. The user can walk around inside the castle
and move objects. The user needs to consult with a remote "expert" (i.e., an artificial
agent) to find the treasures. The expert has some knowledge about the treasures but
cannot see the castle. The user has to talk to the expert for advice on finding the
treasures. Figure A.4 shows the instruction that was given to the user before the study.

We recruited 20 users for the study. During the study, the user's speech was
recorded through an open microphone and the user's eye gaze was captured by a
Tobii eye tracker. From the user studies, we collected 3709 spoken utterances with
accompanying gaze streams.
Instruction
Your mission, if you choose to accept it (by signing the consent form), is to
immerse yourself into the world of treasure hunting and find Zahalin’s treasure.
With the help of an artificial conversational agent, you will navigate Zahalin’s
castle in search for the treasure. Some of the treasure will be hidden, while some
of it will be in plain sight. To communicate with your artificial assistant, speak
clearly into the microphone using your natural tone of voice.
The assistant is an old criminal who is familiar with Zahalin’s castle. He has
partial knowledge about where the treasure is and how to find it, but cannot
see what is inside the castle. You have additional knowledge about what can be
seen in the castle environment. It is your responsibility to communicate with the
artificial assistant and provide as much detail about the layout of the castle as
he requires.
You have the ability to open, move, and pick up various objects in the castle.
However, you must be careful! Some objects are booby trapped and you will be
penalized for manipulating these objects. Make sure to ask the artificial assistant
if an object is safe before manipulating it.
Together you will decipher this puzzle. Good luck!
While you navigate through the castle and converse with your artificial assistant,
we will track your speech and eye gaze. This data will be used to make further
improvements to the conversational agent’s spoken language understanding. The
system will inform you if it fails to recognize either your speech or eye gaze. If
this happens at any point during the study, please ask your proctor for assistance.
Figure A.4. Instruction for the user study
B Parameter Estimation in Approaches to Word Acquisition
Given a parallel data set (W, E), where W = {w_1, w_2, ..., w_n} and E =
{e_1, e_2, ..., e_n}, EM algorithms are used to estimate the probabilities p(w|e) that
maximize the likelihood of the data set

    p(W|E) = \prod_{k=1}^{n} p(w_k | e_k)
B.1 Parameter Estimation for Base Model-1

The Base Model-1 is

    p(w|e) = \frac{1}{(|e|+1)^{|w|}} \prod_{j=1}^{|w|} \sum_{i=0}^{|e|} p(w_j | e_i)

We use the EM algorithm to estimate the parameters \theta = (p(w|e)) that maximize p(W|E):

• E-step: compute the expected value of the log-likelihood with respect to the distribution of the alignments a_j

    Q = E[\log p(W|E, \theta^{(old)})]
      = \sum_{k=1}^{n} \log \frac{1}{(|e_k|+1)^{|w_k|}}
        + \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \log p(w_{kj} | e_{ki})

  where for each instance,

    p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) = \frac{p(w_{kj} | e_{ki})}{\sum_{i=0}^{|e_k|} p(w_{kj} | e_{ki})}    (B.1)

• M-step: find the new parameters

    \theta^{(new)} = \arg\max_\theta Q

  and we have

    p(w|e) = \frac{\sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}{\sum_{w} \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}    (B.2)

  where

    \delta(w, w_{kj}) = \begin{cases} 1 & w_{kj} = w \\ 0 & \text{otherwise} \end{cases}    (B.3)

    \delta(e, e_{ki}) = \begin{cases} 1 & e_{ki} = e \\ 0 & \text{otherwise} \end{cases}    (B.4)
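A compact sketch of this EM loop for Base Model-1 (an IBM Model 1 style estimator with a NULL entity at position 0) is given below; the (words, entities) instance format, the uniform initialization, and the fixed iteration count are illustrative choices, not the thesis's exact setup.

    from collections import defaultdict

    def train_base_model1(instances, iterations=20, null_entity="<NULL>"):
        """EM estimation of p(w|e). `instances` is a list of (words, entities)
        pairs of string lists; only co-occurring (w, e) pairs receive meaningful
        probabilities (unseen pairs keep the uniform default)."""
        vocab_w = {w for words, _ in instances for w in words}
        uniform = 1.0 / len(vocab_w)

        # p[e][w]: translation probability p(w|e), initialized uniformly
        p = defaultdict(lambda: defaultdict(lambda: uniform))

        for _ in range(iterations):
            counts = defaultdict(lambda: defaultdict(float))   # expected count c(w; e)
            totals = defaultdict(float)                        # expected count c(e)
            for words, entities in instances:
                ents = [null_entity] + list(entities)
                for w in words:
                    # E-step: alignment posteriors p(a_j = i | w_kj, e_ki) (Eq. B.1)
                    denom = sum(p[e][w] for e in ents)
                    for e in ents:
                        post = p[e][w] / denom
                        counts[e][w] += post
                        totals[e] += post
            # M-step: re-estimate p(w|e) from expected counts (Eq. B.2)
            p = defaultdict(lambda: defaultdict(lambda: uniform))
            for e, word_counts in counts.items():
                for w, c in word_counts.items():
                    p[e][w] = c / totals[e]
        return p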
B.2 Parameter Estimation for Base Model-2

The Base Model-2 is

    p(w|e) = \prod_{j=1}^{|w|} \sum_{i=0}^{|e|} p(a_j = i | j, |w|, |e|) \, p(w_j | e_i)

We use the EM algorithm to estimate the parameters \theta = (p(a_j | j, m, l), p(w|e)) that maximize p(W|E):

• E-step: compute the expected value of the log-likelihood with respect to the distribution of the alignments a_j

    Q = E[\log p(W|E, \theta^{(old)})]
      = \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, |w_k|, |e_k|, \theta^{(old)}) \log [ p(a_j = i | j, |w_k|, |e_k|) \, p(w_{kj} | e_{ki}) ]

  where for each instance k,

    p(a_j = i | w_{kj}, e_{ki}, |w_k|, |e_k|, \theta^{(old)}) = \frac{p(a_j = i | j, |w_k|, |e_k|) \, p(w_{kj} | e_{ki})}{\sum_{i=0}^{|e_k|} p(a_j = i | j, |w_k|, |e_k|) \, p(w_{kj} | e_{ki})}    (B.5)

• M-step: find the new parameters

    \theta^{(new)} = \arg\max_\theta Q

  and we have

    p(a_j = i | j, m, l) = \frac{\sum_{k : |w_k| = m, |e_k| = l} p(a_j = i | w_{kj}, e_{ki}, |w_k|, |e_k|, \theta^{(old)})}{\sum_{k : |w_k| = m, |e_k| = l} 1}    (B.6)

    p(w|e) = \frac{\sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, |w_k|, |e_k|, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}{\sum_{w} \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, |w_k|, |e_k|, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}    (B.7)

  where \delta(w, w_{kj}) and \delta(e, e_{ki}) are defined in Equations B.3 and B.4.
B.3 Parameter Estimation for Model-2s

The Model-2s is

    p(w|e) = \prod_{j=1}^{|w|} \sum_{i=0}^{|e|} p_s(a_j = i | j, e, w) \, p(w_j | e_i)

We use the EM algorithm to estimate the parameters \theta = (p(w|e)) that maximize p(W|E):

• E-step: compute the expected value of the log-likelihood with respect to the distribution of the alignments a_j

    Q = E[\log p(W|E, \theta^{(old)})]
      = \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \log [ p_s(a_j = i | j, e_k, w_k) \, p(w_{kj} | e_{ki}) ]

  where for each instance,

    p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) = \frac{p_s(a_j = i | j, e_k, w_k) \, p(w_{kj} | e_{ki})}{\sum_{i=0}^{|e_k|} p_s(a_j = i | j, e_k, w_k) \, p(w_{kj} | e_{ki})}    (B.8)

• M-step: find the new parameters

    \theta^{(new)} = \arg\max_\theta Q

  and we have

    p(w|e) = \frac{\sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}{\sum_{w} \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}    (B.9)

  where \delta(w, w_{kj}) and \delta(e, e_{ki}) are defined in Equations B.3 and B.4.
B.4 Parameter Estimation for Model-2t

The Model-2t is

    p(w|e) = \prod_{j=1}^{|w|} \sum_{i=0}^{|e|} p_t(a_j = i | j, e, w) \, p(w_j | e_i)

where

    p_t(a_j = i | j, e, w) = \begin{cases} 0 & d(e_i, w_j) > 0 \\ \dfrac{\exp[\alpha \cdot d(e_i, w_j)]}{\sum_{i'} \exp[\alpha \cdot d(e_{i'}, w_j)]} & d(e_i, w_j) \le 0 \end{cases}

We use the EM algorithm to estimate the parameters \theta = (p(w|e), \alpha) that maximize p(W|E):

• E-step: compute the expected value of the log-likelihood with respect to the distribution of the alignments a_j

    Q = E[\log p(W|E, \theta^{(old)})]
      = \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \log [ p_t(a_j = i | j, e_k, w_k) \, p(w_{kj} | e_{ki}) ]

  where for each instance,

    p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) = \frac{p_t(a_j = i | j, e_k, w_k) \, p(w_{kj} | e_{ki})}{\sum_{i=0}^{|e_k|} p_t(a_j = i | j, e_k, w_k) \, p(w_{kj} | e_{ki})}
      = \frac{\exp[\alpha \cdot d(e_{ki}, w_{kj})] \, p(w_{kj} | e_{ki})}{\sum_{i=0}^{|e_k|} \exp[\alpha \cdot d(e_{ki}, w_{kj})] \, p(w_{kj} | e_{ki})}    (B.10)

• M-step: find the new parameters

    \theta^{(new)} = \arg\max_\theta Q

  The new p(w|e) is given by

    p(w|e) = \frac{\sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}{\sum_{w} \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}    (B.11)

  where \delta(w, w_{kj}) and \delta(e, e_{ki}) are defined in Equations B.3 and B.4.

  The new \alpha is obtained by solving

    \frac{\exp[\alpha \cdot d(e_{ki}, w_{kj})]}{\sum_{i'} \exp[\alpha \cdot d(e_{ki'}, w_{kj})]} = p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)})

  in the least-squares sense:

    \alpha = \arg\min_\alpha \sum_{k} \sum_{j} \sum_{i} \left( \frac{\exp[\alpha \cdot d(e_{ki}, w_{kj})]}{\sum_{i'} \exp[\alpha \cdot d(e_{ki'}, w_{kj})]} - p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \right)^2    (B.12)

  The Levenberg-Marquardt (LM) algorithm [63, 67] is used to find the MSE estimate of \alpha.
B.5 Parameter Estimation for Model-2ts

The Model-2ts is

    p(w|e) = \prod_{j=1}^{|w|} \sum_{i=0}^{|e|} p_{ts}(a_j = i | j, e, w) \, p(w_j | e_i)

where

    p_{ts}(a_j = i | j, e, w) = \frac{SR(e_i, w_j) \exp[\alpha \cdot d(e_i, w_j)]}{\sum_{i'} SR(e_{i'}, w_j) \exp[\alpha \cdot d(e_{i'}, w_j)]}

We use the EM algorithm to estimate the parameters \theta = (p(w|e), \alpha) that maximize p(W|E):

• E-step: compute the expected value of the log-likelihood with respect to the distribution of the alignments a_j

    Q = E[\log p(W|E, \theta^{(old)})]
      = \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \log [ p_{ts}(a_j = i | j, e_k, w_k) \, p(w_{kj} | e_{ki}) ]

  where for each instance,

    p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) = \frac{p_{ts}(a_j = i | j, e_k, w_k) \, p(w_{kj} | e_{ki})}{\sum_{i=0}^{|e_k|} p_{ts}(a_j = i | j, e_k, w_k) \, p(w_{kj} | e_{ki})}
      = \frac{SR(e_{ki}, w_{kj}) \exp[\alpha \cdot d(e_{ki}, w_{kj})] \, p(w_{kj} | e_{ki})}{\sum_{i=0}^{|e_k|} SR(e_{ki}, w_{kj}) \exp[\alpha \cdot d(e_{ki}, w_{kj})] \, p(w_{kj} | e_{ki})}    (B.13)

• M-step: find the new parameters

    \theta^{(new)} = \arg\max_\theta Q

  The new p(w|e) is given by

    p(w|e) = \frac{\sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}{\sum_{w} \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}    (B.14)

  where \delta(w, w_{kj}) and \delta(e, e_{ki}) are defined in Equations B.3 and B.4.

  The new \alpha is obtained by solving

    \frac{SR(e_{ki}, w_{kj}) \exp[\alpha \cdot d(e_{ki}, w_{kj})]}{\sum_{i'} SR(e_{ki'}, w_{kj}) \exp[\alpha \cdot d(e_{ki'}, w_{kj})]} = p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)})

  in the least-squares sense:

    \alpha = \arg\min_\alpha \sum_{k} \sum_{j} \sum_{i} \left( \frac{SR(e_{ki}, w_{kj}) \exp[\alpha \cdot d(e_{ki}, w_{kj})]}{\sum_{i'} SR(e_{ki'}, w_{kj}) \exp[\alpha \cdot d(e_{ki'}, w_{kj})]} - p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \right)^2    (B.15)

  The Levenberg-Marquardt (LM) algorithm [63, 67] is used to find the MSE estimate of \alpha.
BIBLIOGRAPHY
[1] J. F. Allen, B. W. Miller, E. K. Ringger, and T. Sikorski. A robust system for
natural spoken dialogue. In Proceedings of the Annual Meeting of the Association
for Computational Linguistics (ACL), 1996.

[2] P. D. Allopenna, J. S. Magnuson, and M. K. Tanenhaus. Tracking the time
course of spoken word recognition using eye movements: Evidence for continuous
mapping models. Journal of Memory & Language, 38:419-439, 1998.

[3] K. Bock, D. E. Irwin, D. J. Davidson, and W. Levelt. Minding the clock.
Journal of Memory and Language, 48:653-685, 2003.

[4] R. A. Bolt. Put that there: Voice and gesture at the graphics interface. Computer
Graphics, 14(3):262-270, 1980.

[5] P. F. Brown, S. D. Pietra, V. J. D. Pietra, and R. L. Mercer. The mathematics
of statistical machine translation: Parameter estimation. Computational
Linguistics, 19(2):263-311, 1993.
[6] P. F. Brown, V. J. D. Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer.
Class-based n-gram models of natural language. Computational Linguistics,
18(4):467-479, 1992.

[7] S. Brown-Schmidt and M. K. Tanenhaus. Watching the eyes when talking about
size: An investigation of message formulation and utterance planning. Journal
of Memory and Language, 54:592-609, 2006.

[8] E. Campana, J. Baldridge, J. Dowding, B. Hockey, R. Remington, and L. Stone.
Using eye movements to determine referents in a spoken dialogue system. In
Proceedings of the Workshop on Perceptive User Interfaces, 2001.

[9] S. Carbini, J. E. Viallet, and L. Delphin-Poulat. Context dependent interpretation
of multimodal speech-pointing gesture interface. In Proceedings of the
International Conference on Multimodal Interfaces (ICMI), 2005.
[10] J. Chai, P. Hong, M. Zhou, and Z. Prasov. Optimization in multimodal interpretation.
In Proceedings of the 42nd Annual Meeting of the Association for Computational
Linguistics (ACL), 2004.

[11] J. Chai, S. Pan, and M. Zhou. MIND: A context-based multimodal interpretation
framework in conversational systems. In O. Bernsen, L. Dybkjaer, and
J. van Kuppevelt, editors, Natural, Intelligent and Effective Interaction in Multimodal
Dialogue Systems. Kluwer Academic Publishers, 2005.

[12] J. Chai, Z. Prasov, and S. Qu. Cognitive principles in robust multimodal
interpretation. Journal of Artificial Intelligence Research, 27:55-83, 2006.

[13] J. Chai and S. Qu. A salience driven approach to robust input interpretation
in multimodal conversational systems. In Proceedings of the Human Language
Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), 2005.

[14] J. Y. Chai, P. Hong, and M. X. Zhou. A probabilistic approach to reference
resolution in multimodal user interfaces. In Proceedings of the International
Conference on Intelligent User Interfaces (IUI), pages 70-77, 2004.

[15] A. Cheyer and L. Julia. MVIEWS: Multimodal tools for the video analyst.
In Proceedings of the International Conference on Intelligent User Interfaces
(IUI), 1998.

[16] A. Chotimongkol and A. Rudnicky. N-best speech hypotheses reordering using
linear regression. In Proceedings of 7th EUROSPEECH, pages 1829-1832, 2001.
[17] J. Chu-Carroll. MIMIC: An adaptive mixed initiative spoken dialogue system
for information queries. In Proceedings of the 6th Conference on Applied Natural
Language Processing (ANLP), 2000.

[18] CMU. The CMU audio databases. http://www.speech.cs.cmu.edu/databases/.

[19] M. Coen, L. Weisman, K. Thomas, and M. Groh. A context sensitive natural
language modality for the intelligent room. In Proceedings of the 1st International
Workshop on Managing Interactions in Smart Environments (MANSE),
pages 38-79, 1999.

[20] P. Cohen, M. Johnston, D. McGee, S. Oviatt, J. Pittman, I. Smith, L. Chen,
and J. Clow. QuickSet: Multimodal interaction for distributed applications. In
Proceedings of the Fifth ACM International Conference on Multimedia, pages
31-40, 1997.
[21] N. J. Cooke. Gaze-Contingent Automatic Speech Recognition. PhD thesis,
University of Birmingham, 2006.

[22] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995.

[23] D. Dahan and M. K. Tanenhaus. Looking at the rope when looking for the
snake: Conceptually mediated eye movements during spoken-word recognition.
Psychonomic Bulletin & Review, 12(3):453-459, 2005.

[24] S. Dupont and J. Luettin. Audio-visual speech modelling for continuous speech
recognition. IEEE Transactions on Multimedia, 2(3):141-151, 2000.

[25] S. Dusan, G. J. Gadbois, and J. Flanagan. Multimodal interaction on PDAs
integrating speech and pen inputs. In Proceedings of EUROSPEECH, 2003.

[26] K. M. Eberhard, M. J. Spivey-Knowlton, J. C. Sedivy, and M. K. Tanenhaus.
Eye movements as a window into real-time spoken language comprehension in
natural contexts. Journal of Psycholinguistic Research, 24:409-436, 1995.

[27] J. Eisenstein and C. M. Christoudias. A salience-based approach to gesture-speech
alignment. In Proceedings of HLT/NAACL'04, 2004.
[28] C. Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, 1998.

[29] P. Gorniak and D. Roy. Probabilistic grounding of situated speech using plan
recognition and reference resolution. In Proceedings of the Seventh International
Conference on Multimodal Interfaces (ICMI), 2005.

[30] Z. M. Griffin. Gaze durations during speech reflect word selection and phonological
encoding. Cognition, 82:B1-B14, 2001.

[31] Z. M. Griffin and K. Bock. What the eyes say about speaking. Psychological
Science, 11:274-279, 2000.

[32] B. J. Grosz, A. K. Joshi, and S. Weinstein. Centering: A framework for modeling
the local coherence of discourse. Computational Linguistics, 21(2):203-226, 1995.

[33] B. J. Grosz and C. Sidner. Attention, intention, and the structure of discourse.
Computational Linguistics, 12(3):175-204, 1986.

[34] A. Gruenstein, C. Wang, and S. Seneff. Context-sensitive statistical language
modeling. In Proceedings of Eurospeech, 2005.
[35] J. K. Gundel, N. Hedberg, and R. Zacharski. Cognitive status and the form of
referring expressions in discourse. Language, 69(2):274—307, 1993.
[36] J. E. Hanna and M. K. Tanenhaus. Pragmatic effects on reference resolution in
a collaborative task: evidence from eye movements. Cognitive Science, 28:105—
115, 2004.
[37] J. M. Henderson and F. Ferreira, editors. The interface of language, vision,
and action: Eye movements and the visual world. Taylor & Francis, New York,
2004.
[38] H. Holzapfel, K. Nickel, and R. Stiefelhagen. Implementation and evaluation
of a constraint-based multimodal fusion system for speech and 3D pointing ges-
tures. In Proceedings of the 6th International Conference on Multimodal Inter-
faces (ICMI), pages 175-182, 2004.
[39] C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support
vector machines. IEEE Transactions on Neural Networks, 13:415—425, 2002.
[40] P. Hui and H. Meng. Joint interpretation of input speech and pen gestures for
multimodal human computer interaction. In Proceedings of Interspeech, 2006.
[41] J. J. Hull. A database for handwritten text recognition research. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 16(5):550-554, 1994.
[42] C. Huls, E. Bos, and W. Claassen. Automatic referent resolution of deictic and
anaphoric expressions. Computational Linguistics, 21(1):59—79, 1995.
[43] R. J. K. Jacob. The use of eye movements in human-computer interaction tech-
niques: What you look at is what you get. ACM Transactions on Information
Systems, 9(3):152—169, 1991.
[44] M. Johnston. Unification-based multimodal parsing. In Proceedings of the
International Conference on Computational Linguistics and Annual Meeting of
the Association for Computational Linguistics (COLING-ACL), 1998.
[45] M. Johnston and S. Bangalore. Finite-state multimodal parsing and under-
standing. In Proceedings of the International Conference on Computational
Linguistics (COLING), 2000.
[46] M. Johnston and S. Bangalore. Finite-state methods for multimodal parsing
and integration. In ESSLLI Workshop on Finite-state Methods, 2001.
[47] M. Johnston, S. Bangalore, G. Vasireddy, A. Stent, P. Ehlen, M. Walker,
S. Whittaker, and P. Maloor. MATCH: An architecture for multimodal dia-
logue systems. In Proceedings of the Annual Meeting of the Association for
Computational Linguistics (ACL), pages 376-383, 2002.
[48] M. Johnston, P. Cohen, D. McGee, S. Oviatt, J. Pittman, and I. Smith.
Unification-based multimodal integration. In Proceedings of the Annual Meeting
of the Association for Computational Linguistics (ACL), 1997.
[49] M. A. Just and P. A. Carpenter. Eye fixations and cognitive processes. Cognitive
Psychology, 8:441-480, 1976.
[50] D. Kahneman. Attention and Effort. Prentice-Hall, Inc., Englewood Cliffs,
1973.
[51] S. Katz. Estimation of probabilities from sparse data for the language model
component of a speech recognizer. IEEE Transactions on Acoustics, Speech and
Signal Processing, 35(3):400-401, 1987.
[52] M. Kaur, M. Tremaine, N. Huang, J. Wilder, Z. Gacovski, F. Flippo, and C. S.
Mantravadi. Where is “it”? Event synchronization in gaze-speech input sys-
tems. In Proceedings of the International Conference on Multimodal Interfaces
(ICMI), 2003.
[53] Z. Kazi, S. Chen, M. Beitler, D. Chester, and R. Foulds. Multimodal HCI for
robot control: Towards an intelligent robotic assistant for people with disabili-
ties. In Proceedings of AAAI’96 Fall Symposium on Developing AI Applications
for the Disabled, 1996.
[54] A. Kehler. Cognitive status and form of reference in multimodal human-
computer interaction. In Proceedings of the National Conference on Artificial
Intelligence (AAAI), pages 685-689, 2000.
[55] D. Klein and C. D. Manning. Accurate unlexicalized parsing. In Proceedings
of the 41st Meeting of the Association for Computational Linguistics (ACL),
2003.
[56] D. B. Koons, C. J. Sparrell, and K. R. Thorisson. Integrating simultaneous
input from speech, gaze, and hand gestures. In M. Maybury, editor, Intelligent
Multimedia Interfaces, pages 257-276. MIT Press, 1993.
[57] F. Landragin, N. Bellalem, and L. Romary. Visual salience and perceptual
grouping in multimodal interactivity. In Proceedings of the First International
Workshop on Information Presentation and Natural Multimodal Dialogue, pages
151-155, 2001.
[58] S. Lappin and H. Leass. An algorithm for pronominal anaphora resolution.
Computational Linguistics, 20(4):535—561, 1994.
[59] M. E. Latoschik. A user interface framework for multimodal VR interactions.
In Proceedings of the 7th International Conference on Multimodal Interfaces
(ICMI), pages 76-83, 2005.
[60] S. le Cessie and J. van Houwelingen. Ridge estimators in logistic regression.
Applied Statistics, 41(1):191—201, 1992.
[61] Y. LeCun and C. Cortes. The MNIST database of handwritten digits.
http://yann.lecun.com/exdb/mnist.
[62] O. Lemon and A. Gruenstein. Multithreaded context for robust conversational
interfaces: Context-sensitive speech recognition and interpretation of corrective
fragments. ACM Transactions on Computer-Human Interaction, 11(3):241-267,
2004.
[63] K. Levenberg. A method for the solution of certain non-linear problems in least
squares. Quarterly of Applied Mathematics, 2(2):164—168, 1944.
[64] E. Levin, S. Narayanan, R. Pieraccini, K. Biatov, E. Bocchieri, G. D. Fabbrizio,
W. Eckert, S. Lee, A. Pokrovsky, M. Rahim, P. Ruscitti, and M. Walker. The
AT&T-DARPA Communicator mixed-initiative spoken dialog system. In Proceedings
of the International Conference on Spoken Language Processing (ICSLP), 2000.
[65] D. J. Litman and K. Forbes-Riley. Predicting student emotions in computer-
human tutoring dialogues. In Proceedings of the 42nd Annual Meeting of the
Association for Computational Linguistics (ACL), 2004.
[66] Y. Liu, J. Y. Chai, and R. Jin. Automated vocabulary acquisition and interpre-
tation in multimodal conversational systems. In Proceedings of the 45th Annual
Meeting of the Association of Computational Linguistics (ACL), 2007.
[67] D. Marquardt. An algorithm for the least-squares estimation of nonlinear pa-
rameters. SIAM Journal of Applied Mathematics, 11(2):431-441, 1963.
[68] A. S. Meyer, A. M. Sleiderink, and W. J. M. Levelt. Viewing and naming
objects: eye movements during noun phrase production. Cognition, 66(2):B25-B33, 1998.
[69] L.-P. Morency and T. Darrell. Head gesture recognition in intelligent interfaces:
The role of context in improving recognition. In Proceedings of the International
Conference on Intelligent User Interfaces (IUI), 2006.
[70] Y. Nakano, G. Reinstein, T. Stocky, and J. Cassell. Towards a model of face-
to-face grounding. In Proceedings of the Annual Meeting of the Association for
Computational Linguistics (ACL), 2003.
[71] J. G. Neal and S. C. Shapiro. Intelligent multimedia interface technology. In
J. Sullivan and S. Tyler, editors, Intelligent User Interfaces. ACM, New York,
1991.
[72] J. G. Neal, C. Y. Thielman, Z. H. Dobes, S. M. Haller, and S. C. Shapiro. Natural
language with integrated deictic and graphic gestures. In M. Maybury and
W. Wahlster, editors, Intelligent User Interfaces, pages 38—51. Morgan Kauf-
mann Press, CA, 1998.
[73] S. Oviatt. Multimodal interactive maps: Designing for human performance.
Human-Computer Interaction, 12:93—129, 1997.
[74] S. Oviatt. Mutual disambiguation of recognition errors in a multimodal ar-
chitecture. In Proceedings of the Conference on Human Factors in Computing
Systems (CHI), 1999.
[75] S. Oviatt. Breaking the robustness barrier: Recent progress on the design of
robust multimodal systems. Advances in Computers, 56:305—341, 2002.
[76] S. Oviatt. Multimodal interfaces. In J. Jacko and A. Sears, editors, The
Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies
and Emerging Applications, chapter 14, pages 286—304. Lawrence Erlbaum As-
soc., Mahwah, NJ, 2003.
[77] T. Pedersen, S. Patwardhan, and J. Michelizzi. WordNet::Similarity - mea-
suring the relatedness of concepts. In Proceedings of the Nineteenth National
Conference on Artificial Intelligence (AAAI), 2004.
[78] A. Potamianos, S. Narayanan, and G. Riccardi. Adaptive categorical under-
standing for spoken dialog systems. IEEE Transactions on Speech and Audio
Processing, 13(3):321-329, 2005.
[79] G. Potamianos, C. Neti, J. Luettin, and I. Matthews. Audio-visual automatic
speech recognition: An overview. In G. Bailly, E. Vatikiotis-Bateson, and P. Per-
rier, editors, Issues in Visual and Audio-Visual Speech Processing. MIT Press,
2004.
[80] Z. Prasov and J. Y. Chai. What’s in a gaze? The role of eye-gaze in reference
resolution in multimodal conversational interfaces. In Proceedings of ACM 12th
International Conference on Intelligent User Interfaces (IUI), 2008.
[81] S. Qu and J. Y. Chai. Salience modeling based on non-verbal modalities for
spoken language understanding. In Proceedings of the International Conference
on Multimodal Interfaces (ICMI), pages 193—200, 2006.
[82] S. Qu and J. Y. Chai. An exploration of eye gaze in spoken language processing
for multimodal conversational interfaces. In Proceedings of the Human Language
Technology Conference of the North American Chapter of the Association for
Computational Linguistics (HLT-NAACL), pages 284-291, 2007.
[83] S. Qu and J. Y. Chai. Beyond attention: The role of deictic gesture in inten-
tion recognition in multimodal conversational interfaces. In Proceedings of the
International Conference on Intelligent User Interfaces (IUI), pages 237-246,
2008.
[84] S. Qu and J. Y. Chai. Incorporating temporal and semantic information with
eye gaze for automatic word acquisition in multimodal conversational systems.
In Proceedings of the Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 244—253, 2008.
[85] S. Qu and J. Y. Chai. Speech-gaze temporal alignment for automatic word
acquisition in multimodal conversational systems. In Proceedings of the Fifth
Midwest Computational Linguistics Colloquium (MCLC), 2008.
[86] R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Pub-
lishers, San Mateo, CA, 1993.
[87] P. Qvarfordt and S. Zhai. Conversing with the user based on eye-gaze patterns.
In Proceedings of the Conference on Human Factors in Computing Systems
(CHI), 2005.
[88] K. Rayner. Eye movements in reading and information processing: 20 years of
research. Psychological Bulletin, 124(3):372-422, 1998.
[89] D. Roy and N. Mukherjee. Towards situated speech understanding: Visual
context priming of language models. Computer Speech and Language, 19(2):227-
248, 2005.
[90] D. K. Roy and A. P. Pentland. Learning words from sights and sounds: A
computational model. Cognitive Science, 26(1):113-146, 2002.
[91] A. Sankar and A. Gorin. Adaptive language acquisition in a multi-sensory
device. In R. Mammone, editor, Artificial neural networks for speech and vision,
pages 324—356. Chapman and Hall, London, 1993.
[92] S. Seneff, D. Goddeau, C. Pao, and J. Polifroni. Multimodal discourse modelling
in a multi-user multi-domain environment. In Proceedings of the International
Conference on Spoken Language Processing (ICSLP), pages 192-195, 1996.
[93] A. Shaikh, S. Juth, A. Medl, I. Marsic, C. Kulikowski, and J. Flanagan. An
architecture for multimodal information fusion. In Proceedings of the Workshop
on Perceptual User Interfaces (PUI), pages 91-93, 1997.
[94] P. Silsbee and A. Bovik. Computer lipreading for improved accuracy in auto-
matic speech recognition. IEEE Transactions on Speech and Audio Processing,
4(5):337-351, 1996.
[95] J. Siroux, M. Guyomard, F. Multon, and C. Remondeau. Modeling and process-
ing of oral and tactile activities in the GEORAL system. In Multimodal Human-
Computer Communication, Systems, Techniques, and Experiments, pages 101-
110. Springer-Verlag, London, UK, 1998.
[96] R. A. Solsona, E. Fosler-Lussier, H.-K. J. Kuo, A. Potamianos, and I. Zitouni.
Adaptive language models for spoken dialogue systems. In Proceedings of the In-
ternational Conference on Acoustics, Speech, and Signal Processing (ICASSP),
2002.
[97] M. J. Spivey, M. K. Tanenhaus, K. M. Eberhard, and J. C. Sedivy. Eye move-
ments and spoken language comprehension: Effects of visual context on syn-
tactic ambiguity resolution. Cognitive Psychology, 45:447-481, 2002.
[98] R. Stevenson. The role of salience in the production of referring expressions:
A psycholinguistic perspective. In K. van Deemter and R. Kibble, editors,
Information Sharing. CSLI Publ., 2002.
[99] Y. Sun, F. Chen, Y. Shi, and V. Chung. An input-parsing algorithm supporting
integration of deictic gesture in natural language interface. In J. A. Jacko,
editor, Human-Computer Interaction: HCI Intelligent Multimodal Interaction
Environments, pages 206-215. Springer-Verlag Berlin Heidelberg, 2007.
[100] K. Tanaka. A robust selection system using real-time multi-modal user-agent
interactions. In Proceedings of the International Conference on Intelligent User
Interfaces (IUI), 1999.
[101] M. K. Tanenhaus, M. J. Spivey-Knowlton, K. M. Eberhard, and J. C. Sedivy.
Integration of visual and linguistic information in spoken language comprehen-
sion. Science, 268:1632-1634, 1995.
[102] M. J. Tomlinson, M. J. Russell, and N. M. Brooke. Integrating audio and
visual information to provide highly robust speech recognition. In International
Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 821-
824, 1996.
[103] B. M. Velichkovsky. Communicating attention: Gaze position transfer in coop-
erative problem solving. Pragmatics and Cognition, 3:199-224, 1995.
[104] J. Vergo. A statistical approach to multimodal natural language interaction.
In Proceedings of the AAAI ’98 Workshop on Representations for Multimodal
Human-Computer Interaction, pages 81—85, 1998.
[105] R. Vertegaal. The GAZE groupware system: Mediating joint attention in mul-
tiparty communication and collaboration. In Proceedings of the Conference on
Human Factors in Computing Systems (CHI), pages 294-301, 1999.
[106] M. T. Vo and C. Wood. Building an application framework for speech and pen
input integration in multimodal learning interfaces. In International Conference
on Acoustics, Speech, and Signal Processing (ICASSP), 1996.
[107] W. Wahlster. User and discourse models for multimodal communication. In
J. W. Sullivan and S. W. Tyler, editors, Intelligent user interfaces, pages 45—67.
ACM, 1991.
[108] A. Waibel, B. Suhm, M. Vo, and J. Yang. Multimodal interfaces for multimedia
information agents. In Proceedings of the International Conference on Acoustics
Speech and Signal Processing (ICASSP), pages 167-170, 1997.
[109] A. Waibel, M. T. Vo, P. Duchnowski, and S. Manke. Multimodal interfaces.
Artificial Intelligence Review, 10(3—4):299—319, 1996.
[110] M. A. Walker. An application of reinforcement learning to dialogue strategy
selection in a spoken dialogue system for email. Journal of Artificial Intelligence
Research, 12:387-416, 2000.
[111] W. Walker, P. Lamere, P. Kwok, B. Raj, R. Singh, E. Gouvea, P. Wolf, and
J. Woelfel. Sphinx-4: A flexible open source framework for speech recognition.
Technical Report TR-2004-139, Sun Microsystems Laboratories, 2004.
[112] J. Wang. Integration of eye-gaze, voice and manual response in multimodal
user interfaces. In Proceedings of IEEE International Conference on Systems,
Man and Cybernetics, pages 3938—3942, 1995.
[113] C. Ware and H. H. Mikaelian. An evaluation of an eye tracker as a device
for computer input. In Proceedings of the SIGCHI/GI conference on Human
factors in computing systems and graphics interface, pages 183-188, 1987.
[114] Y. Watanabe, K. Iwata, R. Nakagawa, K. Shinoda, and S. Furui. Semi-
synchronous speech and pen input. In Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2007.
[115] I. H. Witten and E. Frank. Data Mining: Practical machine learning tools and
techniques. Morgan Kaufmann, San Francisco, 2005.
[116] L. Wu, S. Oviatt, and P. Cohen. From members to teams to committee - a
robust approach to gestural and multimodal recognition. IEEE Transactions
on Neural Networks, 13(4), 2002.
[117] C. Yu and D. H. Ballard. A multimodal learning interface for grounding spoken
language in sensory perceptions. ACM Transactions on Applied Perception,
1(1):57-80, 2004.
[118] C. Yu and D. H. Ballard. On the integration of grounding language and learning
objects. In Proceedings of AAAI-04, 2004.
[119] M. Zancanaro, O. Stock, and C. Strapparava. Multimodal interaction for infor-
mation access: Exploiting cohesion. Computational Intelligence, 13(7):439—464,
1997.
[120] S. Zhai, C. Morimoto, and S. Ihde. Manual and gaze input cascaded (MAGIC)
pointing. In Proceedings of the Conference on Human Factors in Computing
Systems (CHI), pages 246-253, 1999.
[121] Q. Zhang, A. Imamiya, K. Go, and X. Mao. Overriding errors in a speech and
gaze multimodal architecture. In Proceedings of the International Conference
on Intelligent User Interfaces (IUI), 2004.