This is to certify that the
dissertation entitled
Incorporating Non-verbal Modalities in Spoken Language
Understanding for Multimodal Conversational Systems
presented by
Shaolin Qu
has been accepted towards fulfillment
of the requirements for the
Ph.D. degree in Computer Science
Major Professor’s Signature
Date
MSU is an Affirmative Action/Equal Opportunity Employer
INCORPORATING NON-VERBAL MODALITIES IN
SPOKEN LANGUAGE UNDERSTANDING FOR
MULTIMODAL CONVERSATIONAL SYSTEMS
By
Shaolin Qu
A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
DOCTOR OF PHILOSOPHY
Computer Science
2009
ABSTRACT
INCORPORATING NON-VERBAL MODALITIES IN
SPOKEN LANGUAGE UNDERSTANDING FOR
MULTIMODAL CONVERSATIONAL SYSTEMS
By
Shaolin Qu
Interpreting human language is a challenging problem in building human-machine
conversational systems due to the flexibility of human language behavior. This prob-
lem is further compounded by insufficient speech understanding and system knowledge
representation. When unreliable and unexpected language inputs are received, con-
versational systems tend to fail. Robust language interpretation is essential for build-
ing practical conversational systems.
To address this issue, this thesis investigates the use of non-verbal modalities for
robust language interpretation in human-machine conversation. Specifically, this the-
sis investigates the use of deictic gesture and eye gaze to address two interrelated
problems of language interpretation: unreliable speech input due to weak speech
recognition, and unexpected speech input containing words that are not in the sys-
tem’s knowledge base. The underlying assumption is that deictic gesture and eye gaze
indicate the user’s visual attention and signal the salient visual context in which the
user’s spoken language is situated. This context constrains what the user is likely to
say to the system and therefore can be used to help understand the user’s language.
To facilitate this investigation, we developed a multimodal conversational sys-
tem on 3D-based domains. The system supports speech, deictic gesture, and eye
gaze input during human-machine conversation. Using this system, we conducted
user studies to collect speech-gaze and speech-gesture data sets for the investigation.
For the first topic, using non-verbal modalities to improve speech recognition and
understanding, we built different salience driven language models to incorporate ges-
ture/gaze in different stages of speech recognition. We also experimented with different
model-based and instance-based approaches to incorporate gesture in recognizing the
intention of the user’s spoken language. Our experiments show that using gesture
and eye gaze significantly improves speech recognition and understanding. The use
of gesture has also been shown to achieve significant improvement in user intention
recognition.
For the second topic, using non-verbal modalities for automatic word acquisition,
we developed different approaches to incorporate speech-gaze temporal information
and domain knowledge with eye gaze to facilitate word acquisition during human-
machine conversation. To further improve word acquisition, we also incorporated
user interactivity to pick out the “useful” speech-gaze data for word acquisition. Our
findings indicate that word acquisition is significantly improved when speech-gaze
temporal information and domain knowledge are incorporated. Moreover, acquisition
performance is further improved when the words are acquired from the automatically
identified “useful” speech-gaze data.
The results from this thesis have important implications for building robust and
practical multimodal conversational systems. They demonstrate how non-verbal
modalities can be combined successfully at different stages of spoken language pro-
cessing to improve robustness in language interpretation.
Copyright by
SHAOLIN QU
2009
ACKNOWLEDGMENTS
I would like to thank my advisor, Dr. Joyce Chai, for her guidance and support
over the years. Dr. Chai introduced me to the world of multimodal conversation and
helped set up the direction of my research. Her devotion to research, commitment to
professionalism, and relentless pursuit of perfection have greatly inspired me through
the completion of my study. I would also like to thank my guidance committee, Dr.
John Deller, Dr. Anil Jain, and Dr. George Stockman for their insightful comments
and suggestions that have greatly enhanced this thesis.
Many fellow graduate students have helped me with the work reported in this thesis.
Special thanks to Zahar Prasov, who not only collaborated with me on the user study
designs and data collection, but also had many valuable discussions with me that
have helped shape my research. Tyler Baldwin, Matthew Gerber, and Chen Zhang
have also contributed to the data collection and shared their valuable comments and
suggestions on my work.
And finally, I want to thank my parents and my sister for their support all these
years.
TABLE OF CONTENTS
LIST OF TABLES ..............................
LIST OF FIGURES .............................
1 Introduction ................................
1.1 Overview of Multimodal Conversation .................
1.2 Problems in Multimodal Language Understanding ...........
1.2.1 Unreliable Speech Input .....................
1.2.2 Unexpected Speech Input ....................
1.3 Research Questions ............................
1.4 Road Map .................................
2 Background ................................
2.1 Why Multimodal Design? ........................
2.2 Non-Verbal Modalities in Multimodal Conversational Systems . . . .
2.2.1 Gesture ..............................
2.2.2 Eye Gaze .............................
2.3 Using Non-Linguistic Information for Language Understanding . . . .
2.3.1 Multimodal Language Processing ................
2.3.2 Context-aware Language Processing ...............
2.4 Automatic Word Acquisition .......................
3 A Multimodal Conversational System ................
3.1 System Architecture ...........................
3.2 Input Modalities .............................
3.2.1 Speech ...............................
3.2.2 Deictic Gesture ..........................
3.2.3 Eye Gaze .............................
3.3 Domains of Application ..........................
3.3.1 Interior Decoration ........................
3.3.2 Treasure Hunting .........................
4 Incorporation of Non-verbal Modalities in Language Models for Spo-
ken Language Processing ........................
4.1 A Salience Driven Framework ......................
4.1.1 Salience ..............................
4.1.2 Salience Driven Interpretation of Spoken Language in Multi-
modal Conversation .......................
4.2 Gesture-Based Salience Modeling ....................
4.3 Gaze-Based Salience Modeling ...................... 35
4.4 Salience Driven Language Modeling ................... 37
4.4.1 Language Models for Speech Recognition ............ 37
4.4.2 Salience Driven N-Gram Models ................. 38
4.4.3 Salience Driven PCFG ...................... 39
4.5 Application of Salience Driven Language Models for ASR ....... 42
4.5.1 Early Application ......................... 42
4.5.2 Late Application ......................... 43
4.6 Evaluation ................................. 44
4.6.1 Speech and Gesture Data Collection .............. 44
4.6.2 Evaluation Results on Speech and Gesture Data ........ 45
4.6.3 Speech and Eye Gaze Data Collection .............. 52
4.6.4 Evaluation Results on Speech and Eye Gaze Data ....... 54
4.6.5 Discussion ............................. 57
4.7 Summary ................................. 66
5 Incorporation of Non-verbal Modalities in Intention Recognition for
Spoken Language Understanding ................... 67
5.1 Multimodal Interpretation in a Speech-Gesture System ........ 68
5.1.1 Semantic Representation ..................... 68
5.1.2 Incorporating Context in Two Stages .............. 69
5.2 Intention Recognition ........................... 70
5.3 Feature Extraction ............................ 71
5.3.1 Semantic Features ........................ 71
5.3.2 Phoneme Features ........................ 72
5.4 Model-Based Intention Recognition ................... 73
5.5 Instance-Based Intention Recognition .................. 74
5.6 Evaluation ................................. 76
5.6.1 Experiment Settings ....................... 76
5.6.2 Results Based on Traditional Speech Recognition ....... 78
5.6.3 Results Based on Gesture-Tailored Speech Recognition . . . . 79
5.6.4 Results Based on Different Sizes of Training Data ....... 80
5.6.5 Discussion ............................. 84
5.7 Summary ................................. 86
6 Incorporation of Eye Gaze in Automatic Word Acquisition . . . . 88
6.1 Data Collection .............................. 89
6.2 Translation Models for Automatic Word Acquisition .......... 90
6.2.1 Base Model I ........................... 90
6.2.2 Base Model II ........................... 90
6.3 Using Speech-Gaze Temporal Information for Word Acquisition . . . 91
6.4 Using Domain Semantic Relatedness for Word Acquisition ...... 93
6.4.1 Domain Modeling ......................... 94
6.4.2 Semantic Relatedness of Word and Entity ........... 95
6.4.3 Word Acquisition with Word-Entity Semantic Relatedness . . 95
6.5 Grounding Words to Domain Concepts ................. 97
6.6 Evaluation ................................. 97
6.6.1 Evaluation Metrics ........................ 98
6.6.2 Evaluation Results ........................ 99
6.6.3 An Example ............................ 105
6.7 Summary ................................. 106
7 Incorporation of Interactivity with Eye Gaze for Automatic Word
Acquisition ................................ 107
7.1 Data Collection .............................. 108
7.1.1 Domain .............................. 108
7.1.2 Data Preprocessing ........................ 109
7.2 Identification of Closely Coupled Gaze-Speech Pairs .......... 110
7.2.1 Features Extraction ........................ 110
7.2.2 Logistic Regression Model .................... 113
7.3 Evaluation of Gaze-Speech Identification ................ 114
7.4 Evaluation of Word Acquisition ..................... 116
7.4.1 Evaluation Metrics ........................ 116
7.4.2 Evaluation Results ........................ 117
7.5 The Effect of Word Acquisition on Language Understanding ..... 122
7.5.1 Simulation 1: When the System Starts with No Training Data 123
7.5.2 Simulation 2: When the System Starts with Training Data . . 124
7.6 Summary ................................. 126
8 Conclusions ................................ 128
8.1 Contributions ............................... 128
8.2 Future Directions ............................. 130
APPENDICES ................................ 132
A Multimodal Data Collection ....................... 132
A.1 Speech-Gesture Data Collection in the Interior Decoration Do-
main ................................ 132
A.2 Speech-Gaze Data Collection in the Interior Decoration Domain 132
A.3 Speech-Gaze Data Collection in the Treasure Hunting Domain 135
B Parameter Estimation in Approaches to Word Acquisition ...... 137
B.1 Parameter Estimation for Base Model-1 ............ 137
B.2 Parameter Estimation for Base Model-2 ............ 138
B.3 Parameter Estimation for Model-2s ............... 139
B.4 Parameter Estimation for Model-2t ...............
B.5 Parameter Estimation for Model-2ts ..............
BIBLIOGRAPHY
LIST OF TABLES

4.1 Performances of the early application of different language models on
    speech-gesture data ............................ 48
4.2 Performance of the late application of LMs on speech-gesture data . . 51
4.3 WER of the early application of LMs on speech-gaze data ...... 55
4.4 WER of the late application of LMs on speech-gaze data ....... 55
5.1 Intentions in the 3D interior decoration domain ............ 70
5.2 Accuracies of intention prediction based on standard speech recognition 78
5.3 Accuracies of intention prediction based on gesture-tailored speech
    recognition ................................. 79
6.1 N-best candidate words acquired for the entity dresser_1 by different
    models ................................... 105
7.1 Gaze-speech prediction performances with different feature sets for the
    instances with 1-best speech recognition ................
LIST OF FIGURES
1.1 Architecture of multimodal conversation ................ 3
1.2 Semantics-based multimodal interpretation ............... 4
3.1 Multimodal conversational system architecture ............. 26
3.2 Eye gaze on a scene ............................ 28
3.3 A 3D interior decoration domain ..................... 29
3.4 A treasure hunting domain ........................ 30
4.1 Salience driven interpretation ...................... 33
4.2 Gesture-based salience modeling ..................... 34
4.3 An excerpt of speech and gaze stream data ............... 36
4.4 Context free grammar for the 3D interior decoration domain ..... 40
4.5 Trained PCFG for entity lamp in the 3D interior decoration domain . 41
4.6 Application of salience driven language model in speech recognition . 42
4.7 A* search in word lattice ......................... 43
4.8 An Excerpt of XML Data File ...................... 46
4.9 Performance of the early application of LMs on speech-gesture data of
individual users .............................. 50
4.10 Performance of the late application of LMs on speech-gesture data of
individual users .............................. 53
4.11 WERs of application of LMs on speech-gaze data of individual users . 56
4.12 N-best lists of speech recognition for utterance “show me details on
this desk” ................................. 58
4.13 Word lattice of utterance “show me details on this desk” generated by
using standard bigram model .................... 59
4.14 Word lattice of utterance “show me details on this desk” generated by
     using salience driven bigram model ................... 60
4.15 N-best lists of speech recognition for utterance “move the red chair
     over here” ................................. 61
4.16 Word lattice of utterance “move the red chair over here” generated by
     using standard bigram model ...................... 62
4.17 Word lattice of utterance “move the red chair over here” generated by
     using salience driven bigram model ................... 63
4.18 N-best lists of speech recognition for utterance “I like the picture with
     like a forest in it” ............................. 64
4.19 N-best lists of an utterance: early stage integration v.s. late stage
     integration .................................
5.1 Semantic frame of a user’s multimodal input ..............
5.2 Using context (via gesture) for language understanding ........
5.3 Phonemes of an utterance ........................
5.4 Intention prediction performance of Naive Bayes based on different
    training size ................................ 81
5.5 Intention prediction performance of Decision Tree based on different
    training size ................................ 81
5.6 Intention prediction performance of SVM based on different training
    size ..................................... 82
5.7 Intention prediction performance of S-KNN based on different train-
    ing size ................................... 83
5.8 Intention prediction performance of P-KNN based on different train-
    ing size ................................... 83
5.9 Intention prediction performance of SP-KNN based on different train-
    ing size ...................................
5.10 Using gestural information in different stages for intention recognition
6.1 Parallel speech and gaze streams .................... 89
6.2 Histogram of truly aligned word and entity pairs over temporal distance
    (bin width = 200ms) ........................... 92
6.3 Domain model with domain concepts linked to WordNet synsets . . . 94
6.4 Precision of word acquisition ....................... 101
6.5 Recall of word acquisition ........................ 102
6.6 F-measure of word acquisition ...................... 103
6.7 MRRRs achieved by different models .................. 104
7.1 A snapshot of one user’s experiment (the dot on the stereo indicates
    the user’s gaze fixation, which was not shown to the user during the
    experiment) ................................ 109
7.2 Precision of word acquisition on 1-best speech recognition with Model-
    2t-r ..................................... 118
7.3 Recall of word acquisition on 1-best speech recognition with Model-2t-r 118
7.4 F-measure of word acquisition on 1-best speech recognition with
    Model-2t-r ................................. 119
7.5 Precision of word acquisition on speech transcript with Model-2t-r . . 120
7.6 Recall of word acquisition on speech transcript with Model-2t-r . . . 120
7.7 F-measure of word acquisition on speech transcript with Model-2t-r . 121
7.8 MRRRs achieved by Model-2t-r with different data sets ........ 121
7.9 CIR of user language achieved by the system starting with no training
    data .................................... 124
7.10 CIR of user language achieved by the system starting with 10 users’
     training data ............................... 126
A.1 Instruction for scenario 1 in the interior decoration domain ...... 133
A.2 Instruction for scenario 2 in the interior decoration domain ...... 134
A.3 Questions for users in the study ..................... 135
A.4 Instruction for the user study ...................... 136
CHAPTER 1
Introduction
Speech is the most natural means for humans to communicate with each other.
Due to its naturalness, speech is also a desirable communication mode in human-
computer interaction. A great deal of research has been done on spoken dialog sys-
tems [1,17,64,65,78,110], where users communicate with the system through speech.
In recent years, the development of multimodal conversational systems has gained
more interest. Besides speech input, multimodal conversational systems also support
inputs from other modalities such as gesture and eye gaze during human-machine
conversation. Compared to the conventional speech-only interfaces in spoken dialog
systems, multimodal conversational interfaces provide users with greater expressive
power, naturalness, and flexibility. Moreover, multimodal conversational systems can
achieve better interpretation of user input due to mutual disambiguation among com-
plementary modalities [74].
Despite recent advances in multimodal conversational systems, interpreting what
a user communicates to the system is still a significant challenge due to insufficient
speech recognition and language understanding performance. Moreover, when the
user’s utterances contain unexpected words that are out of the system’s knowledge,
interpretation of the user language tends to fail even when these words are correctly
recognized, which also makes robust language interpretation a big challenge.
Towards building more practical multimodal conversational systems, this thesis
explores the use of non-verbal modalities for robust language interpretation in two
related directions. First, to improve spoken language understanding, the domain con-
textual information indicated by non-verbal modalities is incorporated in language
modeling to get better speech hypotheses. Second, this thesis explores the use of eye
gaze to acquire words automatically during human-machine conversation, in particu-
lar, by incorporating speech-gaze temporal information, domain semantic knowledge,
and interactivity in word acquisition.
1.1 Overview of Multimodal Conversation
Figure 1.1 shows the typical interaction process between a user and a multimodal
conversational system. The user talks to the system using speech and pen-based
deictic gesture. The user’s eye gaze is captured by the system. The Multimodal
Interpreter identifies semantic meaning of the user’s multimodal input. Given the
interpretation, the Conversation Manager informs the Action Manager what action
(e.g., information query, removing an object on the graphical display) to take. The
Action Manager performs the action in the application domain and provides results to
the Conversation Manager. Based on the results, the Conversation Manager decides
what responses (e.g., inquired information not found, confirmation of object deletion)
to give back to the user. The Presentation Manager presents the system’s response
to the user in one or more formats (e.g., audio, video, graphics).
To be able to provide intelligent responses to the user, the system first needs
to understand user input, which makes the Multimodal Interpreter a key component in
multimodal conversational systems. This thesis focuses on building robust spoken
language understanding in the Multimodal Interpreter.
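The interaction loop just described can be summarized in the minimal sketch below. The class and method names, and the hard-coded return values, are illustrative assumptions only; the actual system components are described in Chapter 3.

    # Minimal sketch of one turn of the interaction in Figure 1.1.
    # All names and return values are illustrative, not the system's actual API.

    class MultimodalInterpreter:
        def interpret(self, speech, gesture, gaze):
            # Fuse the three input streams into a semantic representation.
            return {"intention": {"action": "ACT-INFO_REQUEST", "aspect": "PRICE"},
                    "attention": {"object_id": "picture_lotus"}}

    class ActionManager:
        def execute(self, semantics):
            # Perform the requested action in the application domain
            # (e.g., an information query) and return the result.
            return {"status": "ok", "data": {"price": 150}}

    class ConversationManager:
        def __init__(self, action_manager):
            self.action_manager = action_manager

        def respond(self, semantics):
            # Ask the Action Manager to act, then decide what to tell the user.
            result = self.action_manager.execute(semantics)
            return {"type": "inform", "content": result["data"]}

    class PresentationManager:
        def present(self, response):
            # Render the response in one or more output formats.
            print("SYSTEM:", response["content"])

    def interaction_turn(speech, gesture, gaze):
        semantics = MultimodalInterpreter().interpret(speech, gesture, gaze)
        response = ConversationManager(ActionManager()).respond(semantics)
        PresentationManager().present(response)

    interaction_turn("what is the price of this painting", "point(320, 210)", None)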
[Figure: the user’s speech, gesture, and gaze inputs flow into the Multimodal
Interpreter; its semantic representation is passed to the Conversation Manager, which
exchanges requests and results with the Action Manager and sends responses through
the Presentation Manager as graphics, audio, or video.]
Figure 1.1. Architecture of multimodal conversation
1.2 Problems in Multimodal Language Understanding
Multimodal interpretation is the process of deriving semantic meaning from the user’s multimodal
input. The interpretation process involves recognition, understanding, and integra-
tion of the user’s multiple inputs of different modalities. In most multimodal con-
versational systems, input interpretation is based on a semantic fusion approach. In
this approach, the system first creates all possible partial meaning representations in-
dependently from individual modalities. Then these partial meaning representations
identified from each modality are fused in a multimodal integration process to form
an overall meaning representation. Previous studies have shown that multimodal
interpretation can achieve better performance than unimodal interpretation because
of the mutual disambiguation among complementary modalities during the multi-
modal integration process [74].
Figure 1.2 shows an example of the semantics-based approach to the interpretation
of speech and gesture input. In the example, the user says “what is the price of this
painting?” and at the same time points to a position on the screen. The system
first creates all possible partial meaning representations from speech and gesture
independently. The partial meaning representations from the speech input and the
gesture input are shown in (a—b) in Figure 1.2. In this case, the gesture could be
pointing to a wall or a picture. The system uses the partial meaning representations
to disambiguate one another and combines compatible partial representations
together into an overall semantic representation as shown in Figure 1.2(c).
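As a rough sketch of this fusion step, the snippet below combines the partial representations of Figure 1.2(a-b) into the overall representation of Figure 1.2(c) using a simple attribute-compatibility test; the data structures and the test are illustrative assumptions, not the system’s actual fusion algorithm.

    # Partial meaning representations from speech (a) and gesture (b).
    speech_hyps = [{"intention": {"action": "ACT-INFO_REQUEST", "aspect": "PRICE"},
                    "attention": {"semantic_type": "PICTURE"}}]
    gesture_hyps = [{"attention": {"object_id": "picture_lotus",
                                   "semantic_type": "PICTURE"}},
                    {"attention": {"object_id": "wall_room",
                                   "semantic_type": "WALL"}}]

    def compatible(a, b):
        # Two attention representations are compatible if every attribute
        # they share (here, semantic_type) has the same value.
        return all(a[k] == b[k] for k in set(a) & set(b))

    def fuse(speech_hyps, gesture_hyps):
        for s in speech_hyps:
            for g in gesture_hyps:
                if compatible(s["attention"], g["attention"]):
                    return {"intention": s["intention"],
                            "attention": {**s["attention"], **g["attention"]}}
        return None   # no compatible pair: fusion fails

    print(fuse(speech_hyps, gesture_hyps))   # the overall representation (c)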
[Figure: the user says “What is the price of this painting?” while pointing to a
position on the screen. The speech input passes through speech recognition and
language understanding, yielding partial semantic representation (a): intention
(action: ACT-INFO_REQUEST, aspect: PRICE) and attention (semantic type:
PICTURE). The gesture input passes through gesture recognition and gesture
understanding, yielding partial representation (b) with two attention candidates
(object id: picture_lotus, semantic type: PICTURE) and (object id: wall_room,
semantic type: WALL). Multimodal fusion combines them into the overall
representation (c): intention (action: ACT-INFO_REQUEST, aspect: PRICE) and
attention (object id: picture_lotus, semantic type: PICTURE).]
Figure 1.2. Semantics-based multimodal interpretation
In the semantics-based multimodal interpretation, the partial semantic represen-
tations from individual modalities are crucial for mutual disambiguation during mul-
timodal fusion. A robust recognition and understanding of the user’s speech is very
important. However, there are two main barriers to robust spoken language under-
standing: unreliable speech input and unexpected speech input. We address these
two problems of language understanding as follows.
1.2.1 Unreliable Speech Input
Unreliable speech input refers to input that cannot be correctly recognized due to
weak speech recognition. For example, in Figure 1.2, if the speech input is recognized
as “what is the prize of this painting?”, then the partial representation from the speech
input will not be correctly created in the first place. Without a correct candidate
partial representation, it is unlikely that multimodal fusion will reach a correct overall
meaning of the input.
A potential solution to the above problem is to incorporate contextual information
in recognition and understanding of speech at an earlier stage before semantic fusion in
the pipelined process of multimodal interpretation. The context of human-computer
interaction constrains how users are likely to interact with the system, and thus can
be used to help user input interpretation. In the example, the user is talking about
a picture. Suppose we already have the knowledge that the word “price” is more
likely to appear in an utterance talking about a picture than the word “prize”. By
identifying the visual context (i.e., the picture object) from deictic gesture, the system
can use the domain knowledge associated with the visual context to help recognize
the word “price” correctly and thus achieve correct language understanding.
Following this idea, this thesis presents a salience driven framework in which
gesture/gaze-based salience driven language models are built to improve recognized
speech hypotheses. During speech recognition, these salience driven language models
will guide the system to pick the speech hypothesis that is more likely describing the
currently salient object as indicated by the user’s gesture or eye gaze. Our experi-
mental results have shown the potential of gesture and eye gaze in improving spoken
language processing.
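The intuition behind these salience driven language models can be sketched as a simple interpolation of a base n-gram model with a word distribution conditioned on the salient entity. The probabilities, entity name, and interpolation weight below are invented placeholders for illustration; the actual models are defined in Chapter 4.

    # Base bigram probabilities (toy values).
    base_bigram = {("the", "price"): 0.01, ("the", "prize"): 0.01}

    # Bigram probabilities conditioned on the currently salient entity,
    # e.g., estimated from domain data about picture objects.
    salience_bigram = {"picture_lotus": {("the", "price"): 0.08,
                                         ("the", "prize"): 0.001}}

    def salience_driven_prob(bigram, salient_entity, lam=0.7):
        """Interpolate the base model with the salience-conditioned model."""
        p_base = base_bigram.get(bigram, 1e-6)
        p_sal = salience_bigram.get(salient_entity, {}).get(bigram, 1e-6)
        return (1 - lam) * p_base + lam * p_sal

    # With the picture salient, "the price" now outscores "the prize".
    print(salience_driven_prob(("the", "price"), "picture_lotus"))
    print(salience_driven_prob(("the", "prize"), "picture_lotus"))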
Besides using non-verbal modalities to obtain better speech recognition hypothe-
ses, we also apply non-verbal modalities directly in the language understanding pro-
cess to better interpret the user’s spoken language, specifically, the user’s intention
reflected in the spoken language. In conversational systems, the “meaning” of user in-
put can be generally categorized into intention and attention [33]. Intention indicates
the user’s motivation and action. Attention reflects the focus of the conversation, in
other words, what has been talked about. In the speech-gesture system where speech
is the dominant mode of communication, the user intention (such as asking for the price of
an object) is generally expressed by spoken language and attention (e.g., the specific
object) is indicated by the deictic gesture on the graphical display. Based on such
observations, many speech-gesture systems mainly identify intention from speech and
identify attention using deictic gesture [4, 27,53]. In our view, deictic gestures not
only indicate users’ attention, but also can activate the relevant domain context. This
context can constrain the type of intention associated with the attention and thus
provide useful information for intention recognition.
Based on this assumption, we experimented with model-based and instance-based
approaches to incorporate gestural information to recognize the user’s intention. We
examined the effects of using gestural information for user intention recognition in two
stages: the speech recognition stage and the language understanding stage. Our empirical
results have shown that using gestural information improves intention recognition and
the performance is further improved when gestures are incorporated in both speech
recognition and language understanding stages compared to either stage alone.
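As a rough illustration of how gesture-activated context can enter an intention classifier, the sketch below adds the semantic type of the gesture-selected object as an extra feature next to the word features of the speech hypothesis. The feature names, toy training examples, and the choice of a Naive Bayes classifier are assumptions for illustration; the actual model-based and instance-based approaches are described in Chapter 5.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.naive_bayes import MultinomialNB

    def features(words, gesture_type):
        # Word features from the speech hypothesis plus the domain context
        # (semantic type of the object selected by the deictic gesture).
        feats = {"w=" + w: 1 for w in words}
        feats["gesture_type=" + gesture_type] = 1
        return feats

    train_x = [features(["what", "is", "the", "price"], "PICTURE"),
               features(["move", "this", "chair"], "CHAIR"),
               features(["remove", "this", "lamp"], "LAMP")]
    train_y = ["ASK_PRICE", "MOVE_OBJECT", "REMOVE_OBJECT"]

    vec = DictVectorizer()
    clf = MultinomialNB().fit(vec.fit_transform(train_x), train_y)

    test = features(["what", "is", "the", "price"], "PICTURE")
    print(clf.predict(vec.transform([test])))   # -> ['ASK_PRICE']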
1.2.2 Unexpected Speech Input
Unexpected speech input happens when the user speaks some words that the sys-
tem can not recognize. When the encountered vocabulary is outside of the system’s
knowledge, conversational systems tend to fail. For example, in Figure 1.2, if the user
says “what is the cost of this painting?” and the word “cost” is not in the system’s
vocabulary, then the system would not be able to understand that the user is asking
for the price of the painting. Therefore, it is desirable that conversational systems can
learn new words automatically during human-machine conversation. While automatic
word acquisition in general is quite challenging, multimodal conversational systems
offer a unique opportunity to explore word acquisition. In a multimodal conversa-
tional system where users can talk and interact with a graphical display, users’ eye
gaze, which occurs naturally with speech production, provides a potential channel for
the system to learn new words automatically during human-machine conversation.
Psycholinguistic studies have shown that eye gaze is tightly linked to human lan-
guage processing. Eye gaze is one of the reliable indicators of what a person is “think-
ing about” [37]. The direction of eye gaze carries information about the focus of the
user’s attention [49]. The perceived visual context influences spoken word recognition
and mediates syntactic processing of spoken sentences [97,101]. In addition, directly
before speaking a word, the eyes move to the mentioned object [31,68,88].
Motivated by these psycholinguistic findings, we investigate the use of eye gaze for
automatic word acquisition in multimodal conversation. Particularly, this thesis in-
vestigates the use of temporal alignment of speech and eye gaze and domain semantic
relatedness for automatic word acquisition. The speech-gaze temporal information
and domain semantic information are incorporated in statistical translation models
for word acquisition. Our experimental results demonstrate that eye gaze provides
a potential channel for acquiring words automatically. The use of extra speech-gaze
temporal information and domain semantic knowledge can significantly improve word
acquisition.
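To give a flavor of these translation models, the sketch below estimates p(word | entity) from parallel words and gaze-fixated entities with an IBM-Model-1-style EM procedure; the data are toy stand-ins, and the actual base models and their temporal and semantic extensions are presented in Chapter 6.

    from collections import defaultdict

    # Each pair: words of an utterance and the entities fixated during it.
    data = [(["this", "red", "chair"], ["chair_1"]),
            (["the", "lamp", "there"], ["lamp_2"]),
            (["move", "the", "chair"], ["chair_1"])]

    entities = {e for _, es in data for e in es}
    words = {w for ws, _ in data for w in ws}
    # Uniform initialization of p(word | entity).
    t = {e: {w: 1.0 / len(words) for w in words} for e in entities}

    for _ in range(20):                                  # EM iterations
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for ws, es in data:                              # E-step
            for w in ws:
                z = sum(t[e][w] for e in es)
                for e in es:
                    c = t[e][w] / z
                    count[e][w] += c
                    total[e] += c
        for e in entities:                               # M-step
            for w in words:
                t[e][w] = count[e][w] / total[e]

    # Words most strongly associated with the entity chair_1.
    print(sorted(t["chair_1"].items(), key=lambda x: -x[1])[:3])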
Furthermore, since eye gaze could have different functions during human-machine
conversation, not all speech and eye gaze data are useful for word acquisition. To
further improve word acquisition, the thesis also presents approaches that automat-
ically identify potentially “useful” speech and eye gaze based on information from
multiple sources such as the user’s speech, eye gaze behavior, interaction activity,
and conversation context. Our experimental evaluation shows that using only the
identified “useful” speech and gaze significantly improves word acquisition compared
to using all speech and gaze data.
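The identification step can be pictured as a binary classifier over features of each speech-gaze instance, as in the sketch below. The features, training examples, and use of logistic regression are hypothetical here; the actual feature set and model are described in Chapter 7.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical features per speech-gaze instance:
    # [noun/adjective count, max fixation duration (s), user moving an object (0/1)]
    X = np.array([[2, 0.8, 0],
                  [0, 0.2, 1],
                  [3, 1.1, 0],
                  [1, 0.1, 1]])
    y = np.array([1, 0, 1, 0])   # 1 = closely coupled ("useful"), 0 = not

    clf = LogisticRegression().fit(X, y)

    def is_useful(instance, threshold=0.5):
        """Keep this speech-gaze pair for word acquisition?"""
        return clf.predict_proba([instance])[0][1] >= threshold

    print(is_useful([2, 0.9, 0]))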
1.3 Research Questions
Addressing the above problems, this thesis investigates the following specific questions
about language interpretation in speech and gesture/gaze systems:
• How can the non-verbal modalities be used to improve speech recognition?
• How can the non-verbal modalities be used to help language understanding,
specifically, to help recognition of the user’s intention?
• How can the non-verbal modalities be used to acquire new words automatically
during multimodal conversation?
To facilitate the investigations described above, this thesis has accomplished the
following objectives:
• Development of a multimodal system that supports inputs of speech, gesture,
and eye gaze in 3D-based domains.
• Collection of corpora of speech and gesture/gaze data from user studies.
• Design and implementation of approaches to incorporating non-verbal modal-
ities in spoken language understanding and automatic vocabulary acquisition
during multimodal conversation.
• Evaluation and analysis of these approaches that incorporate non-verbal modal-
ities.
1.4 Road Map
The remainder of the thesis is organized as follows:
• Chapter 2: background on relevant aspects of multimodal conversation and
review of previous work on multimodal language processing and language ac-
quisition.
• Chapter 3: description of a multimodal conversational system developed for this
thesis investigation. The developed system supports inputs of speech, deictic
gesture, and eye gaze in a 3D interior decoration domain and a 3D treasure
hunting game domain.
• Chapter 4: investigation of incorporating non-verbal modalities to improve rec-
ognized speech hypotheses for better language understanding. This chapter
describes different approaches in a gesture/gaze-based salience driven frame-
work. Evaluation and analysis of these approaches are also presented in this
chapter.
• Chapter 5: investigation of incorporating non-verbal modalities to improve user
intention recognition for better language understanding. This chapter describes
different model-based and instance-based approaches for intention recognition
and presents evaluation and analysis of these approaches.
• Chapter 6: investigation of incorporating eye gaze in automatic vocabulary
acquisition for robust language understanding. This chapter describes the ap-
proaches of incorporating speech-gaze temporal information and domain seman-
tic relatedness to facilitate word acquisition. Evaluation and analysis are also
presented in this chapter.
• Chapter 7: investigation of using user interactivity related information for iden-
tifying “closely-coupled” gaze and speech streams and its effect on word acqui-
sition. This chapter describes the prediction of “closely-coupled” gaze-speech
instances for word acquisition. Evaluations of gaze-speech prediction and its
effect on word acquisition are also presented in this chapter.
• Chapter 8: contributions of this thesis work.
CHAPTER 2
Background
This chapter presents a review of the topics that are relevant to this thesis. We begin
by explaining the motivation for multimodal design in conversational systems, then
introduce the non-verbal modalities that have been explored in multimodal conver-
sation, and finally review the previous work on multimodal language interpretation
and automatic word acquisition.
2.1 Why Multimodal Design?
One motivation for multimodal design is users’ strong preference to interact mul-
timodally. Unlike the traditional keyboard and mouse interface or unimodal
recognition-based interfaces, multimodal interfaces allow users to choose which modal-
ity to use depending on the types of information to convey, to use combined input
modes, and to alternate between modes at any time. This flexible choice of input
modes is preferred by users in human-computer interaction. It has been found that
more than 95% of users chose to interact multimodally when they were free
to use either speech or pen input in a map-based spatial domain [73].
Multimodal design is also motivated by the potential of multimodal systems in
expanding the accessibility of computing to a broader range of users. There are large
individual differences in ability and preference for using different modes of commu-
nication. These differences can stem from age, skill level, culture, and sensory, motor, or
intellectual impairments. For example, a user with accented speech may prefer pen
input rather than speech, whereas a visually impaired user may prefer speech input
and text-to-speech output.
Besides expanding the range of users, multimodal systems can also expand the
usage contexts. Multimodal systems allow users to switch input modes when environ-
mental conditions change or when, in mobile use, the user is temporarily unable to use
a particular input mode. For example, users can use pen input in a noisy environment and
use speech in a quiet environment, and a user of an in-vehicle multimodal application
can use speech when he or she is unable to use gestural input while driving.
Another major motivation for multimodal design is the error avoidance and recov-
ery in multimodal systems. There are user-centered and system-centered reasons why
multimodal systems facilitate error recovery [75]. The user-centered reasons include:
• Users select the input mode they judge less error prone for particular lexical
content, which usually leads to error avoidance. For example, in a speech and
pen system, the user may prefer speech input, but will switch to pen to com-
municate a foreign surname.
• Users’ language is often simplified when interacting multimodally, which leads
to better speech recognition and language understanding. For example, in a
multimodal system involving a room scene, a user wants to move one of the
chairs beside the bed to the window. Using only speech, the user might need
to say “move the left red chair beside the bed to the window”. When using
both speech and gesture, the user only needs to say “move this chair here”,
along with two pointing gestures. This observation is most relevant to the work
presented in this thesis.
• Users tend to switch modes after a system recognition error, which can prevent
repeating errors and facilitate error recovery.
The system-centered reason for error recovery in multimodal systems lies in the mul-
timodal architecture. A well designed multimodal architecture with two semantically
rich input modes can support mutual disambiguation [74] of input signals. Mutual
disambiguation involves disambiguation of signal or semantic-level information in one
input mode from partial information supplied by another input mode. It leads to re-
covery from unimodal recognition errors within a multimodal architecture, with the
net effect of suppressing errors experienced by the user. The mutual disambiguation
of speech and gestural inputs has been successfully demonstrated in [14, 20,48, 106].
2.2 Non-Verbal Modalities in Multimodal Conversational
Systems
Since the appearance of Bolt’s “Put That There” [4] demonstration system, which
supported speech and touch-pad pointing, a variety of new multimodal conversational
systems have emerged. In most of these multimodal conversational systems, the
other modality besides speech is either gesture or eye gaze. Besides speech and
gesture/gaze systems, there are also speech and lip movement systems where speech
is processed with corresponding human lip movement information during human-
computer interaction [24,94,102]. In speech and lip movement systems, the visual
features of human lip movement are fused together with the acoustic features in the
speech decoding process to perform so-called audio-visual speech recognition [79].
The use of lip movement in audio-visual speech recognition is beyond the scope
of this thesis. Moreover, speech recognition is not a focus of this thesis. This thesis
focuses on the use of gesture and eye gaze in improving language understanding
for multimodal conversation. An overview of the use of gesture and eye gaze in
multimodal systems is presented as follows.
2.2.1 Gesture
In speech and gesture systems, spoken language is processed along with its accompa-
nying gestures. The gestural input can be a simple pen-based deictic gesture (e.g.,
pointing, circling) [11,15,40,104,107,108], a complex pen-based gesture involving
symbolic interpretations [20,47,114], or a manual gesture [9,38,59,69].
This thesis focuses on the use of pen-based deictic gesture in spoken language
processing. Deictic gesture is an active input mode, which is deployed by the user
intentionally as an explicit command to the computer system. Deictic gesture has
been widely used in multimodal map-based systems to indicate the focus of the user’s
attention (objects, locations, or areas on the map) [11,25,71,92,95,99]. Beyond only
using deictic gesture as an indicator of the user’s attention focus, in this thesis, we
use deictic gesture to influence the recognition and understanding of the user’s spoken
utterances.
2.2.2 Eye Gaze
Eye gaze has been studied in various research fields such as cognitive science, psy-
cholinguistics, and human-computer interaction. In human-computer interaction, eye
gaze has long been explored for direct manipulation interfaces in which eye gaze is used
as a pointing device [43,56,112,113,120]. Eye gaze as a modality in multimodal inter-
action goes beyond the function of pointing. In different speech and eye gaze systems,
eye gaze has been explored for the purpose of mutual disambiguation [100,121], as a
complement to the speech channel for reference resolution [8,52,80] and speech recog-
nition [21], and for managing human-computer dialogue [87]. Eye gaze has also been
used as a facilitator in computer supported human-human communication [103,105].
In this thesis, we use eye gaze and the gaze perceived visual context to help spoken
language understanding in multimodal conversation.
Cognitive scientists have been studying eye movements to understand brain pro-
cesses [36,88]. In psycholinguistics, eye gaze has been shown to be tightly linked to both
language comprehension [2,23,97] and language production [3,7,30]. Psycholinguis-
tic studies have found that the gaze perceived visual context influences spoken word
recognition and mediates the syntactic processing in real-time spoken language com-
prehension. For language production, psycholinguistic studies found that the user’s
eyes move to the mentioned object directly before speaking a word. These psycholin-
guistic findings are the motivations for this thesis’s work on the use of eye gaze for
spoken language processing in human-computer interaction.
Eye gaze can be captured by eye trackers, which track the user’s eye movements
during human-computer interaction. Two main types of eye trackers have been used in
interaction studies: head mounted and display mounted. Head mounted eye trackers
can provide accurate gaze direction, but they are intrusive. It is unnatural and
inconvenient for a user to interact with the computer system with an eye tracker
mounted on the head. The state-of-the-art eye tracking technologies have enabled the
eye tracking system to be embedded in a monitor. The display mounted eye trackers
are non-intrusive and more appropriate for use in human-computer interaction.
2.3 Using Non-Linguistic Information for Language Under-
standing
This thesis’s work on using non-verbal inputs to improve spoken language under-
standing is inspired by previous research on multimodal language processing and
context-aware language processing.
2.3.1 Multimodal Language Processing
Multimodal language processing combines speech with non-verbal modalities such
as gesture, eye gaze, and lip movements for language processing. There are two
levels of multimodal language processing: 1) feature-level processing; 2) semantic-
level processing.
Feature-Level Processing
Feature-level processing fuses low-level feature information from parallel input signals
in a multimodal architecture. Feature-level processing is most appropriate for closely
synchronized modalities such as speech and lip movements. In audio-visual speech
recognition [79], features of speech and lip movements are first extracted by acoustic
signal processing and vision analysis respectively. The extracted audio features and
visual features are then fused together for speech decoding.
Feature-level multimodal integration of speech and lip movement is beyond the
scope of this thesis. This thesis investigates the use of deictic gesture and eye gaze
in multimodal language processing. These modalities are not as closely coupled
with acoustic speech as lip movement is, so feature-level processing is not
appropriate. Moreover, this thesis focuses on language understanding rather than
speech recognition. In audio-visual speech recognition, extracted acoustic and visual
features are fused for speech decoding. In this thesis, gesture/gaze is incorporated in
language modeling to tailor speech hypotheses for better semantic interpretation.
Semantic-Level Processing
Semantic-level processing integrates semantic information derived from parallel
input modes in a pipelined multimodal architecture (as seen in Figure 1.2). Semantic-
level processing is mostly used for less coupled modalities such as speech and gesture.
In semantic-level processing, the system first recognizes each modality independently
and then creates all possible partial semantic representations individually for each
modality. Then the system uses these partial semantic representations to disam-
biguate each other and form a joint semantic representation [10,44,45]. This fusion
of multimodal input at the semantic level is called late fusion [76].
Late semantic integration systems use individual recognizers for different input
modes. These individual recognizers can be trained using unimodal data, which are
easier to obtain and already publicly available for modalities such as speech [18] and
handwriting [41,61]. Multimodal systems based on semantic fusion can also take
advantage of existing, relatively mature unimodal recognition techniques and off-
the-shelf recognizers, which can be directly integrated in the late semantic integration
architecture. In this respect, multimodal systems based on semantic fusion can be
scaled up more easily in the number of input modes.
Previous work on semantic fusion of multimodal input has been more focused on
the integration of speech and gesture, especially pen-based gesture, than on integra-
tion of speech and eye gaze. In multimodal interaction, pen-based gesture is a much
more reliable input mode for object selection than eye gaze. Moreover, pen-based
gesture can contain more semantic meaning by drawing symbols or writing letters.
Due to the limitation of eye gaze, multimodal integration of speech and eye gaze
has mainly been studied for simple object selection and reference resolution. In the
experiments on object selection [100,121], the user selects an object (icon) on the
screen using speech, and the user’s speech and eye gaze are both used by the system
to decide the selected object. In [121], the user’s speech and eye gaze each generate
an n-best list of potential objects, and the system decides the selected object by
taking the one common to both n-best lists. In [100], the selected object is decided by computing the posterior
probabilities of the objects on screen being selected by the multimodal input. In the
applications of reference resolution [8,52], the object that is fixated by eye gaze prior
to the user’s mention of the object in speech is taken as the referent for simple commands
like “move it there” and “open the door”.
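One simple way to combine two such n-best lists, shown purely for illustration and not necessarily the exact scheme used in [121], is to select the highest-ranked object common to both lists:

    # n-best object hypotheses from speech and from eye gaze, best first (toy data).
    speech_nbest = ["folder_icon", "file_icon", "trash_icon"]
    gaze_nbest = ["file_icon", "folder_icon", "printer_icon"]

    def select_object(speech_nbest, gaze_nbest):
        """Pick the object common to both lists with the best combined rank."""
        gaze_rank = {obj: i for i, obj in enumerate(gaze_nbest)}
        common = [o for o in speech_nbest if o in gaze_rank]
        if not common:
            return None
        return min(common, key=lambda o: speech_nbest.index(o) + gaze_rank[o])

    print(select_object(speech_nbest, gaze_nbest))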
Integration of speech and gesture for multimodal interpretation is more mature
than integration of speech and eye gaze. Many integration approaches have been
explored for a variety of speech and pen-based gesture systems. Those integration
approaches can be categorized into the following types based on their integration
mechanisms: frame-based approaches, unification-based approaches, finite-state ap-
proaches, optimization-based approaches, and statistics-based approaches.
A frame is a data structure used for knowledge representation. A frame has a
number of slots in it. The slots represent object properties, actions, or an object’s
relation with other frames. Frame-based multimodal integration approaches use indi-
vidual frames to represent semantic meanings obtained from different modalities and
achieve multimodal integration by merging those complementary individual frames to
one unified frame. Frame-based integration approaches have been used in speech and
gesture systems for applications such as multimodal text editing [109], multimodal
drawing [93], and multimodal appointment scheduling [106]. Frame-based approaches
are simple and efficient, but they are application-specific.
Unification-based approaches are derived from computational linguistics, in which
formal logics of typed feature structures have been well developed. The primary
operation in the logics of feature structures is unification: determining the consistency
of two feature structures and combining them into a single feature structure if they
are consistent. Using feature structures for meaning representation, unification-based
approaches achieve multimodal integration by performing unification operation over
the feature structures of different modalities. Compared to frame merging, unification
of typed feature structures provides a more general, formally well-understood, and
reusable mechanism for multimodal integration. Unification-based approaches have
been used in the QuickSet system for the integration of speech and pen-based gesture
input [44,48].
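The unification operation itself can be illustrated with a small recursive sketch over nested dictionaries standing in for feature structures; typing, which a full implementation would also check, is omitted, and the example structures are invented.

    def unify(fs1, fs2):
        """Unify two feature structures; return None if they are inconsistent."""
        if isinstance(fs1, dict) and isinstance(fs2, dict):
            result = dict(fs1)
            for key, value in fs2.items():
                if key in result:
                    merged = unify(result[key], value)
                    if merged is None:
                        return None      # conflicting values: unification fails
                    result[key] = merged
                else:
                    result[key] = value
            return result
        return fs1 if fs1 == fs2 else None

    speech_fs = {"action": "create", "object": {"type": "line", "color": "red"}}
    gesture_fs = {"object": {"type": "line", "start": [0, 0], "end": [5, 3]}}
    print(unify(speech_fs, gesture_fs))                      # merged structure
    print(unify({"object": {"type": "line"}},
                {"object": {"type": "arc"}}))                # None: inconsistent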
Johnston and Bangalore [45,46] employed finite-state transducers to achieve mul-
timodal integration in a multimodal messaging application, in which users interact
with a company directory using synergistic combinations of speech and pen input.
Multimodal context-free grammar (CFG) was introduced for integrating speech and
gesture with finite-state transducers. The finite-state approach enables a tighter cou-
pling of speech and gesture by using gesture to guide speech recognition, which can
lead to improved speech recognition and understanding. However, the finite-state
approach has one major limitation in that it requires a multimodal grammar to be
created to define the language allowed in a particular application domain, which makes
it applicable only to very constrained domains that involve a small vocabulary and
simple expressions.
Optimization-based approaches use optimization methods of machine learning for
multimodal integration. Chai et al. [14] modeled integration of multimodal inputs as
graph matching and applied the graph-based approach to achieve reference resolution
in a map-based real estate domain, where users use speech and gesture to inquire
about real estate information. In [27], for the purpose of multimodal reference resolution, gestures
and spoken words are aligned by minimizing a penalty function defined to penalize
the gesture-speech bindings that violate the empirically preferred binding rules.
Wu et al. [116] proposed a statistical hierarchical framework, Members-Teams-
Committee (MTC), for the integration of speech and gesture in a simulated commu-
nity fire and flood control domain. In this framework, all possible multimodal in-
terpretations are predefined and the interpretation of a multimodal input is decided
by the posterior probabilities of unimodal speech and gesture recognition hypotheses
and the statistics of predefined multimodal interpretations. Since this statistics-based
approach requires all possible speech and gesture interpretations to be pre-defined for
a particular domain, it is only appropriate for constrained domains involving simple
speech and gesture commands.
In the above late semantic fusion approaches, information from multiple modalities
is only used at the fusion stage. Some low probability information (e.g., recognized
alternatives with low probabilities) that could turn out to be very crucial in terms
of the overall interpretation may never reach the fusion stage. Therefore, it is desir-
able to use information from multiple sources at an earlier stage, for example, using
one modality to facilitate semantic processing of another modality. Addressing this
problem in late semantic fusion, Chapter 4 of this thesis presents the use of deic-
tic gesture and eye gaze in an earlier stage to facilitate language processing before
semantic fusion.
2.3.2 Context-aware Language Processing
The context of human-computer interaction constrains how a user is likely to interact
with the system, and thus can be utilized for user language interpretation. A variety of
research work has been done on using contextual information for spoken language
processing. There are mainly two types of context used in context-aware language
processing: conversation context and domain context.
All information related to the discourse prior to an utterance constitutes the
conversation context of the utterance. Chotimongkol and Rudnicky [16] used a con-
versation contextual feature to improve speech recognition and understanding by
rescoring the n-best output of the speech recognizer with a linear regression model. The
conversation contextual feature was represented by the correlation of the current
user utterance and the previous system utterance. Solsona et al. [96] combined a con-
versation context-specific finite state grammar (FSG) and a general n-gram model to
improve speech recognition for a conversational system. The conversation context
was represented by the types of previous system prompts and questions. Lemon and
Gruenstein [62] also built conversation context-specific grammars to improve speech
recognition and understanding. The conversation context was represented by the
types of dialog moves. Gruenstein et al. [34] built a context-sensitive class-based n-gram
model to improve speech recognition for a flight reservation system. The conversation
context was represented by the current information state, which indicates whether
certain information about the flight has been collected from previous conversation.
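A pattern common to these systems is to combine a general language model with a context-specific one, for example by rescoring n-best speech hypotheses with an interpolated score. The sketch below, with invented scores and weights, shows this general pattern rather than any one of the cited systems.

    # n-best hypotheses with general language model scores (log-probabilities).
    nbest = [("what is the prize", -12.0), ("what is the price", -12.5)]

    # Scores from a context-specific model, e.g., one trained on utterances
    # produced in the same dialog state.
    context_scores = {"what is the price": -8.0, "what is the prize": -14.0}

    def rescore(nbest, context_scores, lam=0.5):
        """Interpolate general and context-specific scores, then re-rank."""
        def score(hyp, general):
            return (1 - lam) * general + lam * context_scores.get(hyp, -20.0)
        return sorted(nbest, key=lambda h: score(*h), reverse=True)

    print(rescore(nbest, context_scores)[0][0])   # -> "what is the price"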
All domain related information constitutes the domain context, which could be
the visual content of the graphical display in a domain, or the task knowledge in
a specific domain application. Roy and Mukherjee [89] incorporated visual domain
context in language model to improve spoken language comprehension in a synthetic
visual scene description domain. The visual context was represented by the visual
features (e.g., color, size, shape) of the objects in the scene. Coen et al. [19] built
visual context-specific grammars to improve speech recognition and understanding
in an Intelligent Room where a user can operate computer-controlled devices by
speaking. What is currently near the user in the room constitutes the visual con-
text. Carbini et al. [9] used domain contextual information to help interpretation
of ambiguous speech-gesture commands and enable short multimodal commands in
a chess game domain. The domain contextual constraints include the displacement
rules of the chess game and the current game position. Gorniak and Roy [29] incorporated
both physical domain context and conceptual domain task related context to resolve
spoken referring expressions in a 3D game domain. The physical context includes
information about the physical objects in the game, such as location and type of the
objects. The conceptual context consists of a set of hierarchical plan fragments to
complete the specific task of the game. Due to the constrained game setting, users
must follow certain steps to complete the task. Therefore, given the previous steps
(physical context) and the hierarchical plan segments (task conceptual context), it is
possible to predict which plan fragment the user will take, specifically which object
the user is likely to refer to in his or her spoken commands.
Motivated by context-aware language processing, Chapters 4 & 5 of this thesis
investigate the use of domain contextual information for improving speech recognition
and understanding. Different from the context in previous work, the domain context
in this thesis work is dynamically signaled by non-verbal modalities such as gesture
and eye gaze during multimodal conversation. Cooke [21] also explored the use of
eye gaze for spoken language processing in a map route description domain. In
[21], eye gaze was used to improve speech recognition by rescoring the n-best list of
speech recognition with the landmark-specific n-gram models that correspond to the
gaze fixated landmarks. Different from [21], in this thesis, we explore more ways of
integrating eye gaze in spoken language processing and present a better integration
strategy than n-best list rescoring for the use of eye gaze in speech recognition.
2.4 Automatic Word Acquisition
Word acquisition is the process of learning the semantic meanings of new words. In this thesis,
we focus on automatic word acquisition by a computer system during human-
computer interaction. The purpose of automatic word acquisition is to enlarge the
system’s knowledge base of vocabulary and therefore better interpret the user’s spoken
language.
In the conversational systems with which users interact through a visual scene,
users talk to the system based on what is being shown on the scene and the system
“understands” the user’s language by mapping the spoken words to the semantic
concepts in its domain knowledge base. These semantic concepts of words represent
the visual entities and their properties in the domain. For these systems, the specific
task of word acquisition is to ground words to the visual entities and their related
properties in the domain. Word acquisition by grounding words to visual entities has
been studied in various language acquisition systems.
Sankar and Gorin [91] acquired words by grounding words to visual properties
(color, shape) of objects in a synthetic blocks world, in which the user interacted with
the system by typing sentences. The system started with no semantic associations
of words and visual properties. The only innate knowledge of the system was the
semantic-level signals "good" and "no". During the human-computer interaction,
the user instructed the system to focus on certain objects and gave responses (e.g.,
“good”, “no”) indicating whether the system followed the instructions correctly. The
goal of the system was to learn to focus on the object that the user referred to
by building associations of words and visual properties. The mutual information
between the occurrences of words and object shape/color types was used to evaluate
the strength of the association of a word and a color/shape type.
Roy and Pentland [90] proposed a computational model that could learn words
directly from raw multimodal sensory input. In their experiments, infant caregivers
were asked to play with toys with their infants while producing infant-directed speech. Given
speech paired with video images of single objects (toys), the temporal correlation of
speech and vision was used to learn words by associating the automatically segmented
acoustic phone sequences with the visual prototypes (color, shape, size) of the objects.
Yu and Ballard [118] investigated word learning in a visual scene description do-
main in which users were asked to describe nine office objects on a desk and how
to use these office tools. Given speech and the co-occurring video images captured
by a head-mounted camera, a generative model was used to find the associations of
automatically recognized spoken words and visual objects.
Towards the goal of robust multimodal interpretation, this thesis explores the
use of eye gaze for automatic word acquisition. Eye gaze is an implicit and subcon-
scious input, which brings additional challenges into word acquisition. Eye gaze has
been explored for word acquisition in [117], in which eye gaze and other non-verbal
modalities such as the user’s perspective video image and hand movement were used
together with speech to learn words. In the experiments, users were asked to describe
what they were doing while performing three required activities: “stapling a letter”,
“pouring water”, and “unscrewing a jar”. A head-mounted eye tracker and camera
were used to capture gaze and video data. Given speech paired with gaze positions
and video images, a translation model was used to associate acoustic phone sequences
to the four objects and nine actions in the domain.
Liu et al. [66] also investigated the use of eye gaze for word acquisition. In [66],
speech and eye gaze data were collected from simplified human-computer conversation
in which users verbally answered the system’s questions about the decoration of a 3D
room. A translation model was used to acquire words from transcribed speech and
its accompanying gaze fixations.
This thesis’s work on the use of eye gaze for word acquisition is different from
previous work. Besides gaze positions, we use extra information such as speech-gaze
temporal information and domain semantic knowledge to facilitate word acquisition.
Moreover, not all co-occurring speech and gaze data are useful for word acquisition.
This was not considered in the previous work on using eye gaze for word acquisition.
In this thesis, we investigate the automatic identification of “useful” speech and gaze
fixations and its application to word acquisition.
CHAPTER 3
A Multimodal Conversational System
To explore the incorporation of non-verbal modalities in language interpretation dur-
ing multimodal conversation, we built a multimodal conversational system that sup-
ports speech, deictic gesture, and eye gaze inputs. This chapter presents the archi-
tecture of the system and the processing of different input modalities.
3.1 System Architecture
Our multimodal conversational system is built on a client/server architecture as shown
in Figure 3.1. In this architecture, the user interacts with the client, a graphical inter-
face, using speech and other modalities (e.g., deictic gesture, eye gaze). The results of
speech recognition and gesture/gaze recognition are sent to the server via a TCP/IP
network. The Multimodal Interpreter derives semantic meaning of the user’s mul-
timodal input and sends the interpretation result to a dialog manager. The Dialog
Manager controls the interaction flow and decides what the system should do based
on the interpretation of the user’s input. The Presentation Manager decides how to
present the system’s responses to the user and transmits the responses to the client
through the network. The system’s responses are presented to the user on the client
by graphics and/or speech.
Figure 3.1. Multimodal conversational system architecture
3.2 Input Modalities
Users can interact with our multimodal conversational system using speech, deictic
gesture, and eye gaze.
3.2.1 Speech
As the major input mode in multimodal conversational systems, speech enables users
to interact with the system naturally and efficiently. To be able to give intelligent
replies to the user, the system first needs to recognize the user’s speech. Speech
recognition converts acoustic speech signals to text. Automatic speech recognition
(ASR) has progressed steadily over the last three decades, resulting in commercial
ASR systems that can recognize human speech with sufficient accuracy
under optimal conditions. However, during natural conversation, environment noise
and disfluency in users’ speech can deteriorate speech recognition performance signif-
icantly. Accents in users' speech can also make speech recognition difficult. For these
reasons, speech recognition remains a major bottleneck in building robust
conversational systems.
The CMU Sphinx-4 speech recognizer [111] is used in our system for recognizing
users' spoken utterances. Sphinx-4 is an open-source speech recognizer based on
Hidden Markov Models (HMMs).
How non-verbal modalities can be incorporated to improve speech recognition is
presented in Chapter 4.
3.2.2 Deictic Gesture
Besides speech, users can use deictic gesture (e.g., pointing, circling on a graphical
display) to make interaction easier. For example, instead of saying “how much is the red
chair in the left corner?”, the user can say “how much is this chair?” while pointing
to the attended chair on the screen.
In our system, users’ deictic gestures are captured by a touch screen. Based on the
position of the gesture on the screen, we can infer which object the user is referring to.
How this gestural information can help recognize and understand the users’ speech is
presented in Chapter 4 and Chapter 5.
3.2.3 Eye Gaze
Eye gaze indicates the user’s focus of attention [26,49,101]. The published results
on eye gaze and human language production have led to the hypothesis that users
tend to look at the objects on the graphical display when they are talking about
them. Based on this hypothesis, by tracking the user’s eye gaze during human-
machine conversation, the system is likely to infer the user’s attended objects on
the screen and use this attention information to help recognize and understand the
user’s speech. Moreover, using eye gaze information, the system can potentially learn
new words from the user’s language by associating semantics of the attended objects
(indicated by eye gaze) with words in the user’s spoken utterances.
Eye gaze is captured by an eye tracker. The raw gaze data points consist of
the screen coordinates of each gaze point with a particular timestamp. As shown in
Figure 3.2(a), this raw data is not very useful for identifying fixated objects. The
raw gaze data is processed to eliminate the invalid and saccadic gaze points, leaving
only pertinent eye fixations. Invalid gaze points occur when users look off the screen.
Saccadic gaze points occur during ballistic eye movements between fixations. Vision
studies have shown that no visual processing occurs in the human mind during sac-
cades (i.e., saccadic suppression). It is well known that eyes do not stay still, but
rather make small, frequent jerky movements. In order to best determine fixation lo-
cations, nearby gaze points are averaged together to identify fixations. The processed
eye gaze fixations are shown in Figure 3.2(b).
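The exact fixation-detection procedure is not spelled out here, but the processing described above can be sketched roughly as follows; the screen bounds, dispersion threshold, and minimum-duration threshold are illustrative assumptions rather than the system's actual parameter values.

# Rough sketch of the gaze preprocessing described above: drop invalid (off-screen)
# points, treat short runs as saccades, and average nearby points into fixations.
# Assumes the gaze points are ordered by time; all thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class GazePoint:
    t: int        # timestamp in milliseconds
    x: float      # screen coordinates
    y: float

@dataclass
class Fixation:
    t_start: int
    duration: int
    x: float
    y: float

def detect_fixations(points, screen_w=1024, screen_h=768,
                     dispersion=30.0, min_duration=100):
    valid = [p for p in points if 0 <= p.x < screen_w and 0 <= p.y < screen_h]
    fixations, cluster = [], []
    for p in valid:
        if cluster and (abs(p.x - cluster[0].x) > dispersion or
                        abs(p.y - cluster[0].y) > dispersion):
            _flush(cluster, fixations, min_duration)
            cluster = []
        cluster.append(p)
    _flush(cluster, fixations, min_duration)
    return fixations

def _flush(cluster, fixations, min_duration):
    if not cluster:
        return
    duration = cluster[-1].t - cluster[0].t
    if duration >= min_duration:      # shorter runs are treated as saccadic movement
        fixations.append(Fixation(cluster[0].t, duration,
                                  sum(p.x for p in cluster) / len(cluster),
                                  sum(p.y for p in cluster) / len(cluster)))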
Figure 3.2. Eye gaze on a scene: (a) raw gaze points; (b) processed gaze fixations
How eye gaze information can be used in language models to potentially help
spoken language processing is presented in Chapter 4. How eye gaze information is
used for automatic vocabulary acquisition in multimodal conversation is presented in
Chapter 6.
3.3 Domains of Application
Two application domains were designed and implemented for our investigation. Both
domains were constructed based on 3D graphics.
3.3.1 Interior Decoration
Figure 3.3 shows the 3D interior decoration domain. In this domain, users can interact
with the system using both speech and deictic gestures to query information about the
entities or arrange the room by adding, removing, moving, and coloring the entities.
For example, the user may say “remove this lamp” or ask “what’s the power of this
lamp?” while pointing at a lamp in the scene.
Figure 3.3. A 3D interior decoration domain
There are 13 types of entities (3D objects, e.g., chair, bed, lamp) in this domain.
3.3.2 Treasure Hunting
Figure 3.4 shows the 3D treasure hunting domain. In this domain, users walk around
in a 3D castle trying to find treasures that are hidden somewhere in the rooms of the
castle. Unlike the interior decoration domain, where users give spoken commands to
the system to move around and change decoration, in the treasure hunting domain,
users walk around inside the castle and move objects by themselves, but the user has
to talk to the system to get hints about where to find the treasure. Users’ eye gaze
fixations are recorded during the human—machine conversation.
Figure 3.4. A treasure hunting domain
Compared to the interior decoration domain, the treasure hunting domain pro-
vides a richer interactive environment that involves more complex scenes and tasks,
which enables studies on automatic vocabulary acquisition during human-machine
conversation.
The underlying architecture supporting these two domains can be used to develop
similar 3D applications such as virtual tourism guides and virtual-reality personnel
training.
CHAPTER 4
Incorporation of Non-verbal Modalities in
Language Models for Spoken Language
Processing
In multimodal conversational systems, speech recognition performance is critical in
interpreting user inputs. Only after speech is correctly recognized is the system able
to further extract semantic meaning from the recognized hypothesis. Although mutual
disambiguation of multiple modalities [74] can alleviate the problem with speech
recognition, speech recognition is still a bottleneck to achieving robust multimodal
interpretation.
This chapter presents the use of non-verbal modalities to help speech recognition
in multimodal conversation. In particular, we describe a salience driven approach to
incorporate the contextual information activated by deictic gesture and eye gaze in
speech recognition. This approach combines gesture-based and gaze-based salience
modeling with language modeling. We further describe the application of the salience
driven language models in speech recognition across different stages and present eval-
uation results.
4.1 A Salience Driven Framework
In this section, we first introduce the notion of salience and its applications in language
processing, then describe a salience driven framework for interpretation of language
in multimodal conversation.
4.1.1 Salience
Salience modeling has been used in both natural language and multimodal language
processing. Linguistic salience describes the accessibility of entities in a hearer's
memory and its implications for language production and interpretation. Many
theories on linguistic salience have been developed, including how the salience of
entities affects the form of referring expression as in the Givenness Hierarchy [35] and
the local coherence of discourse as in the Centering Theory [32]. Linguistic salience
modeling has been used for both language generation [98] and language interpretation.
Most salience-based language interpretation work has focused on reference resolution [27,
42, 58].
Visual salience measures how much attention an entity attracts from a user. An
entity is more salient when it attracts a user’s attention more than other entities. The
cause of such attention depends on many factors including user intention, familiarity,
and physical characteristics of objects. For example, an object may be salient when
it has some properties that the others do not have, such as being the only one that is
highlighted, or the only one of its size, category, or color [57]. Visual salience can also
be useful in multimodal language interpretation. Studies have shown that a user’s
perceived salience of entities on the graphical interface can tailor the user’s referring
expressions and thus can be used for multimodal reference resolution [54].
4.1.2 Salience Driven Interpretation of Spoken Language in Multimodal
Conversation
During multimodal conversation, a user’s deictic gesture or eye gaze fixation on the
graphical display indicates the user's attention and therefore indicates salient entities.
The more likely an entity is to be selected by a gesture or eye gaze, the more salient
this entity is.
We developed a salience driven framework [13] for language interpretation in multi-
modal conversational systems. Figure 4.1 illustrates the salience driven interpretation
of speech in this framework. As shown in the figure, the user’s deictic gesture or eye
gaze fixation on the graphic display signals a distribution of entities that are salient at
that particular time of interaction. The contextual knowledge associated with these
salient objects constitutes the salient context. This salient context can be used to
help speech recognition and understanding by constraining speech hypotheses.
Figure 4.1. Salience driven interpretation (gesture/gaze recognition and speech recognition
feed gesture/gaze understanding and language understanding; the salient context activated
by gesture/gaze constrains language understanding, and multimodal fusion produces the
final semantic representation)
In this framework, there are two important operations involved: 1) the salience
modeling based on gesture/gaze, and 2) the incorporation of salience information in
language processing. We address these two operations in the following sections.
4.2 Gesture-Based Salience Modeling
As mentioned earlier, a deictic gesture on the graphical display can signal the under-
lying context that is salient at that particular time of communication. In other words,
the deictic gesture will activate a salience distribution over entities in the domain.
As illustrated in Figure 4.2, the salience value of an entity e at time t is calculated
based on the probabilities that e is selected by the gestures {g} occurring prior
to time t.
Figure 4.2. Gesture-based salience modeling (the selection probabilities p(e|g_1), p(e|g_2),
p(e|g_3) of the gestures occurring during the utterance are combined, with time-decaying
weights \alpha_{g_1}(t), \alpha_{g_2}(t), \alpha_{g_3}(t), into the salience p_t(e))
More specifically, for an entity e in the domain, its salience value at time t is
calculated as follows [13]:
p_t(e) = \begin{cases} \dfrac{\sum_g \alpha_g(t)\, p(e|g)}{\sum_{e'} \sum_g \alpha_g(t)\, p(e'|g)} & \text{if } \sum_g p(e|g) \neq 0 \\ 0 & \text{if } \sum_g p(e|g) = 0 \end{cases} \qquad (4.1)

where p(e|g) is the probability of entity e being selected by gesture g (calculated
based on the distance from the gesture point to the center of the entity), and \alpha_g(t)
is the weight of gesture g contributing to the salience distribution at time t.

Gesture weight \alpha_g(t) is defined as follows:

\alpha_g(t) = \begin{cases} e^{-\frac{t - t_g}{700}} & t \geq t_g \\ 0 & t < t_g \end{cases} \qquad (4.2)

where t_g stands for the beginning time (in milliseconds) of gesture g. Weight \alpha_g(t)
says that gesture g has more impact on the salience distribution at a time closer to
the gesture's occurrence. Note that at any time t, only gestures occurring before t
(i.e., t \geq t_g) can contribute to the salience distribution at time t.
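A minimal sketch of Equations (4.1) and (4.2), assuming each gesture is given as a pair of its start time and its entity selection probabilities p(e|g); the entity names in the example are hypothetical.

import math

def gesture_weight(t, t_g, decay=700.0):
    """Time-decaying gesture weight alpha_g(t) of Equation (4.2) (as reconstructed)."""
    return math.exp(-(t - t_g) / decay) if t >= t_g else 0.0

def gesture_salience(t, gestures, entities):
    """Gesture-based salience distribution p_t(e) of Equation (4.1).

    gestures: list of (t_g, {entity: p(e|g)}) pairs for gestures occurring before t.
    entities: all entities in the domain.
    """
    raw = {e: sum(gesture_weight(t, t_g) * sel.get(e, 0.0) for t_g, sel in gestures)
           for e in entities}
    total = sum(raw.values())
    if total == 0.0:
        return {e: 0.0 for e in entities}
    return {e: v / total for e, v in raw.items()}

# Example: at t = 3000 ms the more recent gesture dominates the salience.
gestures = [(1000, {"lamp_1": 0.8, "table_1": 0.2}),
            (2800, {"chair_1": 0.9, "bedroom": 0.1})]
print(gesture_salience(3000, gestures, ["lamp_1", "table_1", "chair_1", "bedroom"]))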
4.3 Gaze-Based Salience Modeling
Psycholinguistic experiments have shown that eye gaze is tightly linked to human
language processing. Eye gaze is one of the reliable indicators of what a person is
“thinking about” [37]. The direction of gaze carries information about the focus of the
user's attention [49]. The perceived visual context influences spoken word recognition
and mediates syntactic processing [89,101]. In addition, directly before speaking a
word, the eyes move to the mentioned object [31].
Motivated by these psycholinguistic findings about eye gaze’s link to speech, we
use eye gaze information in salience models to help spoken language processing.
Figure 4.3 shows an excerpt of the speech and gaze fixation stream. In the speech
stream, each word starts at a particular timestamp. In the gaze stream, each gaze
fixation f has a starting timestamp t f and a duration Tf. Gaze fixations can have
different durations. An entity 6 on the graphical display is fixated by gaze fixation f
if the area of 6 contains the fixation point of f. One gaze fixation can fall on multiple
entities or no entity.
Figure 4.3. An excerpt of speech and gaze stream data (the speech stream contains the
utterance "This room has a chandelier" with per-word timestamps; the gaze stream contains
fixations, each with a starting time t_f and duration T_f, falling on entities such as [10]
bedroom, [11] chandelier, [17] lamp_2, [19] bed frame, and [22] door)
We first define a gaze fixation set F_{t_0}^{t_0+T}(e), which contains all gaze fixations that
fall on entity e within a time window t_0 \sim (t_0 + T):

F_{t_0}^{t_0+T}(e) = \{\, f \mid f \text{ falls on } e \text{ within } t_0 \sim (t_0 + T) \,\} \qquad (4.3)

We model gaze-based salience in two ways [82]:

• Gaze Salience Model 1

  Salience model 1 is based on the assumption that when an entity has more gaze
  fixations on it than other entities, this entity is more likely attended by the user
  and thus has higher salience:

  p_{t_0,T}(e) = \frac{\#\{\text{elements in } F_{t_0}^{t_0+T}(e)\}}{\sum_{e'} \#\{\text{elements in } F_{t_0}^{t_0+T}(e')\}} \qquad (4.4)

  Here, p_{t_0,T}(e) tells how likely it is that the user is focusing on entity e within
  the time period t_0 \sim (t_0 + T) based on how many gaze fixations are on e among all
  gaze fixations that fall on entities within t_0 \sim (t_0 + T).

• Gaze Salience Model 2

  Salience model 2 is based on the assumption that when an entity has longer
  gaze fixations on it than other entities, this entity is more likely attended by
  the user and thus has higher salience:

  p_{t_0,T}(e) = \frac{D_{t_0}^{t_0+T}(e)}{\sum_{e'} D_{t_0}^{t_0+T}(e')} \qquad (4.5)

  where

  D_{t_0}^{t_0+T}(e) = \sum_{f \in F_{t_0}^{t_0+T}(e)} T_f \qquad (4.6)

  Here, p_{t_0,T}(e) tells how likely it is that the user is focusing on entity e within
  the time period t_0 \sim (t_0 + T) based on how long e has been fixated by gaze fixations
  among the overall duration of all gaze fixations that fall on entities within
  t_0 \sim (t_0 + T). (Both models are sketched below.)
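A minimal sketch of both gaze salience models (Equations (4.4) through (4.6)), assuming each fixation in the window is given with its duration and the set of entities it falls on; the entity names in the example are hypothetical.

def gaze_salience(fixations, entities, count_based=True):
    """Gaze-based salience p_{t0,T}(e) over a time window.

    fixations: list of (duration_ms, fixated_entities) pairs, already restricted
    to the window t0 ~ (t0 + T).
    count_based=True  -> Model 1 (Equation (4.4), fixation counts)
    count_based=False -> Model 2 (Equations (4.5)-(4.6), fixation durations)
    """
    score = {e: 0.0 for e in entities}
    for duration, fixated in fixations:
        for e in fixated:
            score[e] += 1.0 if count_based else duration
    total = sum(score.values())
    return {e: (v / total if total else 0.0) for e, v in score.items()}

# Example window with three fixations (durations in milliseconds).
window = [(250, ["bedroom"]), (480, ["chandelier_1", "bedroom"]), (120, ["lamp_2"])]
print(gaze_salience(window, ["bedroom", "chandelier_1", "lamp_2"], count_based=False))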
4.4 Salience Driven Language Modeling
Given salience models, the next question is how to incorporate this salient contex-
tual information in language processing. In this section, we describe the building
of salience driven language models for speech recognition. We first give a review of
the typical language models used in speech recognition, then describe how to build
salience driven language models based on those baseline models.
4.4.1 Language Models for Speech Recognition
The task of speech recognition is, given an observed spoken utterance O, to find the
word sequence W^* such that

W^* = \arg\max_W p(O|W)\, p(W) \qquad (4.7)

where p(O|W) is the acoustic model and p(W) is the language model.

In speech recognition systems, the acoustic model provides the probability of
observing the acoustic features given hypothesized word sequences, and the language
model provides the prior probability of a sequence of words. The language model is
represented as follows:

p(W) = p(w_1^n) = p(w_1)\, p(w_2|w_1)\, p(w_3|w_1^2) \cdots p(w_n|w_1^{n-1}) \qquad (4.8)

The language model can be approximated by a bigram model using a first-order
Markov assumption:

p(w_1^n) = \prod_{k=1}^{n} p(w_k|w_{k-1}) \qquad (4.9)

or by a trigram model using a second-order Markov assumption:

p(w_1^n) = \prod_{k=1}^{n} p(w_k|w_{k-1}, w_{k-2}) \qquad (4.10)

By clustering words into classes, the class-based n-gram model reduces the training
data requirement and improves the robustness of probability estimates compared to
the word n-gram model. The class-based bigram model is given by [6]:

p(w_i|w_{i-1}) = p(w_i|c_i)\, p(c_i|c_{i-1}) \qquad (4.11)

where c_i and c_{i-1} are the classes of words w_i and w_{i-1} respectively.

A probabilistic context free grammar (PCFG) can also be used as a language model
in speech recognition by constraining the speech recognizer to generate only gram-
matical sentences as defined by the grammar.
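For concreteness, the following sketch shows unsmoothed maximum-likelihood versions of the bigram probability and the class-based bigram probability of Equations (4.9) and (4.11); the smoothing that is actually needed in practice (e.g., the Katz backoff used in Section 4.4.2) is omitted here, and the toy transcripts are made up.

from collections import Counter

def bigram_mle(sentences):
    """Unsmoothed bigram probabilities p(w_k|w_{k-1}) from tokenized sentences."""
    uni, bi = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        uni.update(tokens[:-1])
        bi.update(zip(tokens[:-1], tokens[1:]))
    return lambda w, prev: (bi[(prev, w)] / uni[prev]) if uni[prev] else 0.0

def class_bigram(p_word_given_class, p_class_given_class, word2class):
    """Class-based bigram p(w_i|w_{i-1}) = p(w_i|c_i) p(c_i|c_{i-1}), Equation (4.11)."""
    def prob(w, prev):
        c, c_prev = word2class[w], word2class[prev]
        return (p_word_given_class.get((w, c), 0.0) *
                p_class_given_class.get((c, c_prev), 0.0))
    return prob

# Example: estimate bigram probabilities from two toy transcripts.
p = bigram_mle([["remove", "this", "lamp"], ["move", "this", "chair"]])
print(p("this", "remove"), p("lamp", "this"))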
4.4.2 Salience Driven N-Gram Models
Statistical n-gram models are widely used in speech recognition. We incorporate
the gesture/gaze-based salience modeling into the bigram model and the class-based
bigram model to build salience driven n-gram models [13, 81] for speech recognition.

• Salience driven bigram model

  The salience driven bigram probability p_s(w_i|w_{i-1}) is given by:

  p_s(w_i|w_{i-1}) = \frac{p(w_i|w_{i-1}) + \lambda \sum_e p(w_i|w_{i-1}, e)\, p_t(e)}{1 + \lambda} \qquad (4.12)

  where p_t(e) is the salience distribution, as modeled in Equation (4.1), and \lambda is the
  priming weight. The priming weight \lambda decides how much the original bigram
  probability will be tailored by the salient entities that are indicated by gestures.
  Currently, we set \lambda = 2 empirically. We also tried to learn the priming weight
  with an EM algorithm. However, we found that the learned \lambda performed
  worse than the empirical one in our experiments. This is partially due to in-
  sufficient development data. Bigram probabilities p(w_i|w_{i-1}) were estimated
  by maximum likelihood estimation using Katz's backoff method [51] with a
  frequency cutoff of 1. The same method was used to estimate p(w_i|w_{i-1}, e)
  from the users' speech transcripts with entity annotation of e.

• Salience driven class-based bigram model

  The salience driven class-based bigram probability p_s(w_i|w_{i-1}) is given by:

  p_s(w_i|w_{i-1}) = \begin{cases} p(c_i|c_{i-1}) \sum_e p(w_i|c_i, e)\, p_t(e) & \text{if } \sum_e p_t(e) \neq 0 \\ p(w_i|w_{i-1}) & \text{if } \sum_e p_t(e) = 0 \end{cases} \qquad (4.13)

  where p_t(e) is the salience distribution, c_i and c_{i-1} are the semantic classes of
  words w_i and w_{i-1} respectively, and p(w_i|c_i, e) is learned with maximum likelihood
  estimation from the utterances talking about entity e. (A sketch of both models
  follows.)
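A sketch of the salience driven interpolation in Equations (4.12) and (4.13); the probability estimates p_bigram, p_bigram_entity, p_class_trans, and p_word_class_entity are assumed to be given (for example, estimated with the backoff method described above).

def salience_bigram(w, prev, salience, p_bigram, p_bigram_entity, lam=2.0):
    """Salience driven bigram p_s(w_i|w_{i-1}) of Equation (4.12).

    salience: dict entity -> p_t(e); p_bigram(w, prev) and p_bigram_entity(w, prev, e)
    are smoothed probability estimates that are assumed to be given.
    """
    primed = sum(p_bigram_entity(w, prev, e) * p for e, p in salience.items())
    return (p_bigram(w, prev) + lam * primed) / (1.0 + lam)

def salience_class_bigram(w, prev, salience, p_bigram,
                          p_class_trans, p_word_class_entity, word2class):
    """Salience driven class-based bigram p_s(w_i|w_{i-1}) of Equation (4.13)."""
    if sum(salience.values()) == 0.0:
        return p_bigram(w, prev)          # no salient entity: fall back to the bigram
    c, c_prev = word2class[w], word2class[prev]
    primed = sum(p_word_class_entity(w, c, e) * p for e, p in salience.items())
    return p_class_trans(c, c_prev) * primed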
4.4.3 Salience Driven PCFG
Building a salience driven PCFG [81] as the language model involves three steps: 1) con-
struct a context free grammar (CFG) specific to the application domain; 2) for each
entity in the domain, train an entity-specific PCFG based on the utterances talking about
that particular entity; 3) create the salience driven PCFG based on the entity salience
distribution and the entity-specific PCFGs.

More specifically, we build the salience driven PCFG for the 3D interior decoration
domain (Section 3.3.1) as follows. Based on the domain knowledge, we first define
a domain-specific CFG as shown in Figure 4.4. This CFG covers all the language
that is "legal" in the interior decoration domain. An utterance is said to be "legal"
in the domain if a semantic representation specific to the domain can be built from
the utterance. The defined grammar covers "legal" commands like "this table",
"remove this chair", "move this plant on this table", and query questions like "how
much is this table?", "who is the artist of this painting?", and "what is the wattage of this
lamp?".
S   → NP | VP | WRB JJ VBZ NP | WRB JJ NN VBZ NP VB |
      WP VBZ NP PP | WRB VBZ NP VBN | VBZ NP NP
VP  → VB NP | VB NP PP | VB NP JJ | VB NP RB
NP  → NN | DT NN | PRP
PP  → IN DT NN | TO DT NN
WP  → what | who
WRB → how | where
JJ  → big | black | blue | dark | expensive | gray | green | ...
VBZ → does | is
VB  → add | align | bring | buy | change | delete | ...
RB  → back | backward | backwards | down | forward | here | ...
NN  → age | alternative | artist | artwork | back | bar | bed | ...
DT  → a | an | that | the | these | this | those
PRP → it | them
IN  → about | above | against | among | around | at | behind | ...
TO  → to
VBN → made | produced

Figure 4.4. Context free grammar for the 3D interior decoration domain
We build the entity-specific PCFGs by first using the Stanford Parser [55] to parse
users’ transcribed utterances, then for each entity e in the domain, training a PCFG
with maximum likelihood estimation based on the utterances talking about entity e.
In the trained PCFG, only the lexicon-part rules are associated with probabilities.
An example of a trained PCFG for entity lamp is shown in Figure 4.5. The PCFG
in Figure 4.5 is in the Java Speech Grammar Format (JSGF), and the numbers inside
the "/ /" delimiters are the weights of the rules. When normalized, the weights are the rule
probabilities. As we can see in Figure 4.5, the words closely related to entity lamp
such as "lamp" and "wattage" achieve higher weights in the trained PCFG. This means
that words closely related to lamp are more likely to be chosen during the speech
recognition process when the entity lamp is salient.
<DT>  = /117/ this | /59/ the | /16/ that | /3/ these | /1/ those |
        /1/ a | /1/ an;
<IN>  = /34/ of | /17/ on | /10/ about | /7/ with | /4/ in |
        /2/ behind | ...;
<JJ>  = /8/ many | /2/ much | /1/ small | /1/ left | /1/ expensive | ...;
<NN>  = /144/ lamp | /24/ wattage | /7/ place | /7/ information |
        /6/ table | ...;
<PRP> = /3/ it | /1/ them;
<RB>  = /9/ here | /2/ back | /2/ up | /2/ there;
<TO>  = to;
<VB>  = /27/ remove | /18/ move | /7/ show | /6/ put |
        /6/ change | ...;
<VBN> = /2/ made | /1/ produced;
<VBZ> = /30/ is | /3/ does;
<WP>  = /26/ what | /4/ who;
<WRB> = /9/ how | /5/ where;

Figure 4.5. Trained PCFG for entity lamp in the 3D interior decoration domain (weighted
lexical rules; nonterminal names as in Figure 4.4)
Given the entity-specific PCFGs, the salience driven PCFG is created by combining the
PCFGs associated with the salient entities. The weight of a rule r in the salience
driven PCFG is given by:

w(r) = \sum_e w_e(r)\, p(e) \qquad (4.14)

where p(e) is the salience distribution and w_e(r) is the weight of rule r in the PCFG
specific to entity e.
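Equation (4.14) amounts to a salience-weighted sum of the entity-specific rule weights; below is a minimal sketch, assuming each entity-specific grammar is represented as a mapping from rules to weights (the example rules and most numbers are illustrative).

def salience_pcfg(entity_grammars, salience):
    """Combine entity-specific rule weights into salience driven weights, Equation (4.14).

    entity_grammars: dict entity -> {rule: weight w_e(r)}
    salience:        dict entity -> p(e)
    Returns {rule: w(r)} with w(r) = sum_e w_e(r) * p(e).
    """
    combined = {}
    for e, grammar in entity_grammars.items():
        for rule, weight in grammar.items():
            combined[rule] = combined.get(rule, 0.0) + weight * salience.get(e, 0.0)
    return combined

# Illustrative example with two entities and a few lexical rules (weights are made up,
# except the lamp/wattage weights, which follow Figure 4.5).
grammars = {"lamp_1":  {("NN", "lamp"): 144, ("NN", "wattage"): 24},
            "table_1": {("NN", "table"): 90,  ("NN", "lamp"): 3}}
print(salience_pcfg(grammars, {"lamp_1": 0.7, "table_1": 0.3}))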
4.5 Application of Salience Driven Language Models for ASR
The salience driven language models can be integrated into speech recognition in two
stages: an early stage, before the word lattice (n-best list) is generated, or a later stage,
where the word lattice (n-best list) is post-processed (Figure 4.6).
Figure 4.6. Application of salience driven language model in speech recognition: (a) early
application, where the speech decoder uses the gesture/gaze-tailored language model
together with the acoustic model to generate the word lattice (n-best list); (b) late
application, where a word lattice (n-best list) generated with a basic language model is
rescored by the salience driven language model
4.5.1 Early Application
For the early application, as Figure 4.6(a) shows, the salience driven language model
is used together with the acoustic model to generate the word lattice, typically by
Viterbi search.

Compared to n-gram models, CFG-based language models place stricter con-
straints on the speech recognition process, specifically on choosing the next set of
possible words following a path during the search process. When an n-gram model
is used, the next set of possible words includes any word in the vocabulary with a non-
zero transition probability (as specified by the n-gram model) from the previous n-1
words along the path. When a CFG-based language model is used, the next set of
possible words only includes those words allowed by the grammar.
4.5.2 Late Application
For the late application, as shown in Figure 4.6(b), the salience driven n-gram lan-
guage model is used to rescore the word lattice generated by a speech recognizer with
a basic language model not involving salience modeling. A word lattice consists of
a list of nodes and edges (Figure 4.7). In the word lattice, each node represents a
word hypothesis and each edge represents a word transition. Each path going from
the start node to the end node forms a sentence recognition hypothesis.
Given a word lattice, A* search can be applied to find the n-best paths in the word
lattice.

Figure 4.7. A* search in word lattice
A* search finds in a graph the optimal path from a given initial node to a given
goal node. Specifically, in the word lattice shown in Figure 4.7, the task of A* search
is to find a path from the sentence start node "<s>" to the sentence end node "</s>" that
has the highest score. The score of a path L = (w_0, w_1, \ldots, w_n) is defined as

\mathrm{score}(L) = \sum_{i=0}^{n} \left( \log p_a(w_i) + \log p(w_i|w_{i-1}) \right) \qquad (4.15)

where p_a(w_i) is the acoustic model probability and p(w_i|w_{i-1}) is the language model
probability. The language model probabilities can be tailored by the salience driven
language models described in Section 4.4.2.

In the word lattice, each node (i.e., a word hypothesis) is associated with a score.
The score of a word w_i depends on two parts: the true score g(w_i), which measures the
actual score of the path from the start node to the current node, and the heuristic
score h(w_i), which measures the expected score of the path from the current node to
the goal node. In each step of the A* search, the next node to expand is chosen as
the one with the highest score g(w_i) + h(w_i) among the ending nodes of all previously
explored partial paths.

Before the A* search begins, the heuristic at each node w_i is first calculated:

h(w_i) = \max_k \left\{ h(w_{i+1}^k) + \log p_a(w_{i+1}^k) + \log p(w_{i+1}^k|w_i) \right\} \qquad (4.16)

where w_{i+1}^k ranges over the successors of w_i and h(</s>) = 0.

During the A* search, the score of the path up to node w_i is calculated as

g(w_i) = g(w_{i-1}) + \log p_a(w_i) + \log p(w_i|w_{i-1}) \qquad (4.17)

where g(<s>) = 0.
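A compact sketch of this search, assuming the word lattice is given as a directed acyclic graph with per-node acoustic log probabilities and per-edge language model log probabilities (the latter possibly tailored by a salience driven model); it returns the single best path, and the n-best paths can be obtained by continuing to pop complete paths from the queue.

import heapq

def astar_best_path(succ, log_pa, log_plm, start="<s>", goal="</s>"):
    """Best path through a word lattice under the path score of Equation (4.15).

    succ:    dict node -> list of successor nodes (a DAG; every node reaches the goal)
    log_pa:  dict node -> acoustic log probability log p_a(w)  (use 0.0 for "</s>")
    log_plm: dict (prev, node) -> language model log probability log p(w|prev)
    """
    # Backward pass: heuristic h(w) = best achievable score from w to the goal
    # (Equation (4.16)), computed in an order where successors come first.
    order = _post_order(succ, start)
    h = {goal: 0.0}
    for node in order:
        if node == goal:
            continue
        h[node] = max(h[nxt] + log_pa[nxt] + log_plm[(node, nxt)] for nxt in succ[node])
    # Forward A*: always expand the partial path with the highest g + h.
    frontier = [(-(h[start]), 0.0, start, [start])]
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g                    # best complete hypothesis and its score
        for nxt in succ[node]:
            g_next = g + log_pa[nxt] + log_plm[(node, nxt)]   # Equation (4.17)
            heapq.heappush(frontier, (-(g_next + h[nxt]), g_next, nxt, path + [nxt]))
    return None, float("-inf")

def _post_order(succ, start):
    """Depth-first post-order: every node appears after all of its successors."""
    order, seen = [], set()
    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for m in succ.get(n, []):
            visit(m)
        order.append(n)
    visit(start)
    return order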
A late application of a gaze-tailored language model was reported in [21], where the
language model tailored by eye gaze was used to directly reorder the n-best list of
speech recognition to get better 1-best recognition. We will show in Section 4.6.5 that
the early application works better than the late application.
4.6 Evaluation
In the 3D interior decoration domain, we empirically evaluate the different salience
driven language models when applied at the two stages for speech recognition.
4.6.1 Speech and Gesture Data Collection
We conducted a wizard-of-Oz study to collect speech and gesture data for our eval-
uation using the system described in Chapter 3. In the study, users were asked to
44
accomplish two tasks. Task 1 was to clean up and redecorate a messy room. Task
2 was to arrange and decorate the room so that it looks like the room in the pic-
tures provided to the user. Each of these tasks put the user into a specific role (e.g.,
college student, professor, etc.), and the task had to be completed under a set of con-
straints (e.g., budget of furnishings, bed size, number of domestic products, etc.). A
detailed description of the user study in the interior decoration domain is given in
Appendix A.1.
From 5 users’ interactions with the system, we collected 649 utterances with ac-
companying gestures. The vocabulary size of the collected utterances is 250 words.
Each utterance was transcribed and annotated with referred entities. For example,
an utterance like “remove this lamp” accompanied by a deictic gesture was annotated
with the true entity lamp_1 as indicated by the gesture, while an utterance like “move
this lamp to this table” accompanied by two deictic gestures was annotated with the
entities lamp_1 and table_1 as indicated by the two gestures respectively.
Each gesture results in a set of possibly selected entities. The selection probabili-
ties of the entities are calculated based on the distances from the gesture point to the
center of the entities.
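The exact distance-to-probability mapping is not given here; one plausible sketch, purely an assumption for illustration, scores each entity with a Gaussian kernel of the distance from the gesture point to the entity center and normalizes the scores.

import math

def gesture_selection_probs(gesture_xy, entity_centers, sigma=80.0):
    """Assumed distance-based selection model: closer entities get higher p(e|g).

    entity_centers: dict entity -> (x, y) screen coordinates of the entity center.
    A Gaussian kernel over the gesture-to-center distance (bandwidth sigma, in
    pixels) is one plausible choice; only the use of distance is specified above.
    """
    gx, gy = gesture_xy
    score = {e: math.exp(-((gx - x) ** 2 + (gy - y) ** 2) / (2 * sigma ** 2))
             for e, (x, y) in entity_centers.items()}
    total = sum(score.values())
    return {e: s / total for e, s in score.items()}

# Example with hypothetical entity centers; the gesture point is near picture_girl.
print(gesture_selection_probs((613, 183), {"bedroom": (400, 300),
                                           "picture_girl": (620, 190),
                                           "table_pc": (700, 420)}))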
All the collected data, together with the speech transcripts and entity annotation,
are saved in XML format. Figure 4.8 shows an excerpt from one of the XML data
files. The excerpt is the record of one turn in the conversation between the system and
one user. In this turn, the user pointed to the entity picture_girl and said “flip this
picture one hundred eighty degrees”. The pointing gesture resulted in an ambiguous
selection of three entities (bedroom, picture_girl, table_pc) with different probabilities.

Figure 4.8. An excerpt of an XML data file (this turn's record contains the gesture point
at screen coordinates (613, 183); the candidate entities with their selection probabilities,
bedroom: 0.4580, picture_girl: 0.5307, table_pc: 0.0113; the annotated referent
picture_girl; the speech transcript "flip this picture one hundred eighty degrees"; and the
name of the recorded audio file, 2005916-144311-707.wav)

4.6.2 Evaluation Results on Speech and Gesture Data

We compare the performances of the following different language models trained in
our domain:
• Standard bigram model (Bigram)

• Standard trigram model (Trigram)

• Standard class-based bigram model (C-Bigram)

• Salience driven bigram model (S-Bigram)

• Salience driven class-based bigram model (S-C-Bigram)

• Standard PCFG (PCFG)

• Salience driven PCFG (S-PCFG)
The evaluation metrics include the following aspects related to recognition results:
• Word error rate of the best hypothesis (WER)

• Word lattice WER (Lattice-WER)
  The minimal WER over all possible paths through the word lattice (the output of
  speech recognition).

Since we are building a conversational system, we are also interested in the fol-
lowing metrics related to semantic interpretation:

• Concept identification precision (CI-Precision)
  The percentage of correctly identified concepts out of the total number of con-
  cepts in the 1-best recognition hypothesis.

• Concept identification recall (CI-Recall)
  The percentage of correctly identified concepts out of the total number of con-
  cepts in a user's utterance (speech transcript).

• F-measurement (F-score)

  \text{F-score} = \frac{(\beta^2 + 1) \times \text{CI-Precision} \times \text{CI-Recall}}{\beta^2 \times \text{CI-Precision} + \text{CI-Recall}}

  where \beta = 1 in this experiment (see the sketch following this list).
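A minimal sketch of these metrics, assuming the concepts extracted from the 1-best hypothesis and from the transcript are given as sets of (concept, value) pairs; the example concepts are made up.

def concept_metrics(hyp_concepts, ref_concepts, beta=1.0):
    """CI-Precision, CI-Recall, and F-score over identified concepts.

    hyp_concepts: concepts extracted from the 1-best recognition hypothesis.
    ref_concepts: concepts in the user's utterance (speech transcript).
    """
    hyp, ref = set(hyp_concepts), set(ref_concepts)
    correct = len(hyp & ref)
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f

# Example: one concept recognized correctly, one substituted.
print(concept_metrics([("action", "remove"), ("object", "bed")],
                      [("action", "remove"), ("object", "lamp")]))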
The evaluation was done by an eight-fold cross validation. We compare the per-
formances of the salience driven language models for both early and late applications.
RESULTS OF EARLY APPLICATION
Table 4.1 shows the experimental results of the early application of different language
models on the utterances with accompanying gestures. Among the n-gram models,
the performance of the trigram model is roughly the same as the bigram model. The
salience driven bigram (S-Bigram) model improved speech recognition and under-
standing compared to the three baselines (Bigram, Trigram, and C-Bigram). Com-
pared to the best baseline of the trigram model, the S-Bigram model reduced the
WER by 7%. A t-test showed that this was a significant change: t = 3.38, p < 0.004.
Language Model   Lattice-WER   WER     CI-Precision   CI-Recall   F-Score
Bigram           0.250         0.321   0.830          0.793       0.811
Trigram          0.258         0.312   0.838          0.797       0.817
C-Bigram         0.292         0.371   0.856          0.748       0.798
S-Bigram         0.243         0.291   0.861          0.830       0.845
S-C-Bigram       0.412         0.448   0.863          0.623       0.724
PCFG             0.323         0.360   0.819          0.816       0.817
S-PCFG           0.319         0.355   0.862          0.845       0.853

Table 4.1. Performances of the early application of different language models on speech-
gesture data
The S-Bigram model increased the precision and recall of concept identification by
3% and 4% respectively. The overall F-measurement achieved by the S-Bigram model
gained an increase of 3%. A t-test showed that this was also a significant improve-
ment: t = 3.01, p < 0.002. The S-C-Bigram model achieved the best result on the
precision of concept identification, but had the worst results on all other metrics.
Comparing class-based n-gram models (C-Bigram, S-C-Bigram) to n-gram models
(Bigram, Trigram, S-Bigram), we can see that class-based n-gram models achieve
better concept identification precision but worse concept identification recall and
WER. The performances of the class-based n-gram models depend on how the classes
of words are defined. When one unique class is defined for each unique word, there
will be no difference between n-gram models and class-based n-gram models. In our
experiment, we define different classes for the words with key semantic concepts,
whereas a single class is assigned to all other words. With this class definition, the
class-based bigram models contain n-gram probability information about the words
with key semantic concepts but lose that information for the non-key words, which all
share one class. Therefore, when class-based n-gram models are used in speech recognition,
it is hard to correctly recognize the non-key words that share one class, whereas the words
with key semantic concepts are more likely to appear in the recognition result, though
many of them are incorrectly recognized. This leads to better concept identification
precision but worse concept identification recall and WER.
Compared to the standard PCFG model, the salience driven PCFG (S-PCFG)
model increased the precision and recall of concept identification by 5% and 3.5%
respectively. The overall F-measurement was increased by 4%. A t-test confirmed
that this was a significant improvement: t = 3.30, p < 0.001. The S-PCFG model
did not change the WER much compared to the standard PCFG model. A t-test
confirmed that this change in WER was not significant.
When compared to the trigram model, the S-PCFG model did not improve the
WER but improved the language understanding. The F-measurement was increased
by 4%. A t-test showed that this was a significant improvement: t = 2.77, p < 0.003.
The worse WER of the S-PCFG model is due to grammar-based language models being
less flexible than n-gram language models. Grammar-based language models
place too much constraint on what language can be recognized, which hurts the
recognition of complex utterances. On the other hand, after salience tailoring, the
stricter constraints on which words with key semantic concepts can be recognized for
the salient entity make the S-PCFG model achieve better language understanding
performance than the n-gram models.
We also show the experimental results for individual users. Figure 4.9 compares
the performances of different salience driven language models in early application for
each user. From the results for individual users, we can see that for most users, the
performances of different salience driven language models are consistent. Compared to
the best baseline of the trigram model, the S-Bigram model achieved a lower WER and
a higher F-score for each user. The S-C-Bigram model did not show improvement over
the trigram model for all users. The S-PCFG model showed its merit on improving
language understanding by achieving higher F-scores than the baseline for all users
except user 2. And for 3 of the 5 users, the S-PCFG model achieved the best language
understanding among all different language models.
Figure 4.9. Performance of the early application of LMs on speech-gesture data of
individual users: (a) word error rate; (b) F-score

Overall, the results of the early application of the gesture-based salience driven
language models show that:
• In terms of WER, the S-Bigram model performed the best. N-gram models
  performed better than class-based n-gram models, and all n-gram models except
  the S-C-Bigram model performed better than the PCFG-based models.

• In terms of language understanding metrics, the S-PCFG model performed the
  best in that it achieved the highest concept identification recall and the highest
  overall F-measurement.

• Overall, the S-Bigram model appears to be the best one for the early application
  in that it not only achieved the lowest WER but also achieved a high F-score on
  concept identification (close to the highest one, achieved by the S-PCFG model).
RESULTS OF LATE APPLICATION
We further compared the different n-gram models (C-Bigram, S-Bigram, and S-C-Bigram)
during the late application. In these experiments, the standard trigram model trained
on our domain was first used to generate word lattices, then the salience driven models
were used in A* search (Section 4.5.2) to find the best paths in the word lattices.
Language Model   Lattice-WER   WER     CI-Precision   CI-Recall   F-score
C-Bigram         0.258         0.334   0.831          0.784       0.807
S-Bigram         0.258         0.294   0.854          0.834       0.844
S-C-Bigram       0.258         0.316   0.858          0.786       0.821

Table 4.2. Performance of the late application of LMs on speech-gesture data
Table 4.2 shows the results of the three models on the utterances with accompa-
nying gestures. In the late application, the S-Bigram model performed the best with
the exception of concept identification precision. Compared to the trigram model,
the S-Bigram in late application decreased the WER by 6%. A t-test showed that
this was a significant change: t = 2.66, p < 0.005. On language understanding, the
S-Bigram model increased the F-measurement by 3% compared to the trigram model.
A t-test confirmed that this was a significant improvement: t = 2.92, p < 0.002.
Compared to Table 4.1, Table 4.2 shows that there is no difference in performance
whether the S-Bigram model is applied early or late. However, a significant difference
is observed for the S-C-Bigram model. The S-C-Bigram model performed much better
when it was applied in the later stage. However, its performance was close to the
baseline (trigram model). The WER change achieved by the S-C-Bigram model
was not statistically significant according to the t-test (t = 0.94, NS), nor was the
change in F-measurement (t = 0.22, NS).
The experimental results of the late application of the three n-gram models for
individual users are shown in Figure 4.10. The results demonstrate the consistency
of the performances of the different salience driven language models in late application
for most users. Compared to the baseline of the trigram model, the S-Bigram model
improved both speech recognition and language understanding when applied in the
late stage. The S-C-Bigram model did not improve speech recognition when applied
in the late stage either, but it improved language understanding for most of the users.
Compared to its performance on speech recognition in early application, the S-C-
Bigram model performed better in late application for all the users.
4.6.3 Speech and Eye Gaze Data Collection
We conducted user studies to collect speech and eye gaze data. In the experiments,
a static 3D bedroom scene was shown to the user. The system verbally asked the
user a list of questions one at a time about the bedroom and the user answered the
questions by speaking to the system. A detailed description of the user study is given
in Appendix A2.
The user’s speech was recorded through an open microphone and the user’s eye
gaze was captured by an EyeLink II eye tracker.

Figure 4.10. Performance of the late application of LMs on speech-gesture data of
individual users: (a) word error rate; (b) F-score

From 7 users' experiments, we collected 554 utterances with a vocabulary of 489
words. Each utterance was transcribed
lected 554 utterances with a vocabulary of 489 words. Each utterance was transcribed
and annotated with entities that were being talked about in the utterance.
4.6.4 Evaluation Results on Speech and Eye Gaze Data
Evaluation was done by a 14-fold cross validation. We compare the performances of
the early and late applications of two gaze-based salience driven language models:
• S-Bigram1 - salience driven language model based on salience modeling 1 (Equation (4.4))

• S-Bigram2 - salience driven language model based on salience modeling 2 (Equation (4.5))
Table 4.3 and Table 4.4 show the results of the early and late applications of
the salience driven language models based on eye gaze. We can see that all word
error rates (WERs) are high. In the experiments, users were instructed only to
answer the system's questions one by one. There was no flow of a real human-machine
conversation. In this setting, users were freer to express themselves than in a
situation where they believed they were conversing with a machine. Thus, we observe
much longer sentences that often contain disfluencies. Here is one example:
System: “How big is the bed?”
User: “I would to have to offer a guess that the bed, if I look the chair
that ’s beside it [pause] in a relative angle to the bed, it’s probably six feet
long, possibly, or shorter, slightly shorter.”
The high WER was mainly caused by the complexity and disfluencies of users’
speech. Poor speech recording quality is another reason for the bad recognition per-
formance. We found that the trigram model performed worse than the bigram model
in the experiment. This is probably due to the sparseness of trigrams in the corpus.
The amount of data available is too small considering the vocabulary size.
Language Model   Lattice-WER   WER
Bigram           0.613         0.707
Trigram          0.643         0.719
S-Bigram1        0.605         0.690
S-Bigram2        0.604         0.689

Table 4.3. WER of the early application of LMs on speech-gaze data
Language Model   Lattice-WER   WER
S-Bigram1        0.643         0.709
S-Bigram2        0.643         0.710

Table 4.4. WER of the late application of LMs on speech-gaze data
The S-Bigram1 and S-Bigram2 models achieved similar results in both early application
(Table 4.3) and late application (Table 4.4). In early application, the S-Bigram1
model performed better than the trigram model (t = 5.24, p < 0.001) and the bigram
model (t = 3.31, p < 0.001). The S-Bigram2 model also performed better than the
trigram model (t = 5.15, p < 0.001) and the bigram model (t = 3.33, p < 0.001) in
early application. In late application, the S-Bigram1 model performed better than
the trigram model (t = 2.11, p < 0.02), as did the S-Bigram2 model (t = 1.99,
p < 0.025). However, compared to the bigram model, the S-Bigram1 model did not
change the recognition performance significantly in late application, nor did the
S-Bigram2 model.
We also compare the performances of the salience driven language models for individ-
ual users. In early application (Figure 4.11a), both the S-Bigram1 and the S-Bigram2
models performed better than the baselines of the bigram and trigram models for all
users except user 2 and user 7. T-tests have shown that these are significant im-
provements. For user 2, the S-Bigram1 model achieved the same WER as the bigram
model. For user 7, neither of the salience driven language models improved recogni-
tion compared to the bigram model. In late application (Figure 4.11b), only for user 3
and user 4 did both salience driven language models perform better than the baselines
of the bigram and trigram models. These improvements have also been confirmed by
t-tests as significant.

Figure 4.11. WERs of application of LMs on speech-gaze data of individual users:
(a) WER of early application; (b) WER of late application
Comparing early and late application of the salience driven language models, it
is observed that early application performed better than late application for all users
except user 3 and user 4. T-tests have confirmed that these differences are significant.
It is interesting to see that the effect of gaze-based salience modeling differs
among users. For two users (user 3 and user 4), the gaze-based salience driven
language models consistently outperformed the bigram and trigram models in both
early application and late application. However, for some other users (e.g., user 7),
this is not the case. In fact, the gaze-based salience driven language models performed
worse than the bigram model. This observation indicates that during language pro-
duction, a user's eye gaze is involuntary and unconscious. This is different from deictic
gesture, which is more intentionally delivered by a user.
4.6.5 Discussion
Gesture-based salience driven language models are built on the assumption that the
entity selected by the accompanying gesture of a user’s utterance is the topic of the
user’s utterance. Similarly, gaze-based salience driven language models are built on
the assumption that when a user’s eye gaze is fixating on an entity, the user is saying
something related to the entity. With this assumption, gesture/gaze-based salience
driven language models have the potential to improve speech recognition by biasing
the speech decoder to favor the words that are consistent with the entity indicated by
the user’s gesture or eye gaze fixation, especially when the user’s utterance contains
words describing unique characteristics of the object. These particular characteristics
could be the object’s name or physical properties (e.g., color, material, size).
An example where the gesture-based salience driven language model helped speech
recognition is shown in Figure 4.12. In this example, a user pointed to the entity
table_square in the bedroom scene and said “show me details on this desk”.

Utterance: “show me details on this desk”

Gesture selection:
  p(bedroom) = 0.0050
  p(lamp_floor) = 0.1954
  p(couchnnrsofa) = 0.1409
  p(lamp_floor2) = 0.0510
  p(table_square) = 0.6077

Bigram n-best list:
  show me details on this bed
  show me details on this desk
  show me details on this back
  show me details on that's desks
  show me details on that's desk
  show me details on that's that's

S-Bigram n-best list:
  show me details on this desk
  show me details on this bed
  show me details on this back
  show me details on this desk a
  show me details on that's desk
  show me details on that's desk a

Figure 4.12. N-best lists of speech recognition for utterance “show me details on this
desk”

The
desk”
user’s gesture resulted in a set of candidate entities being selected, in which the correct
one (i.e., table_square) was assigned the highest selection probability of 0.6077. Two
n-best lists, the bigram n-best list and the S-Bigram n-best list, were generated by the
speech recognizer when the standard bigram model and the salience driven bigram
model were applied respectively. When the standard bigram model was applied, the
speech recognizer did not get the correct recognition. When the salience driven bigram
model was applied, the speech recognizer recognized the user’s utterance correctly.
Figures 4.13 and 4.14 show the word lattices of the utterance generated by the
speech recognizer using the standard bigram model and the salience driven bigram
model respectively. The n-best lists in Figure 4.12 were generated from those word
lattices. In the word lattices, each path going from the start node to the end
node forms a recognition hypothesis. The bigram probabilities along the edges
are given as base-10 logarithms. In the standard bigram case, although the probability
of bigram “this desk” (-1.3952) is slightly higher than the probability of “this bed”
(-1.4380), the speech recognizer got the wrong recognition, i.e., the correct speech
recognition hypothesis is not the first one in the n-best list (Figure 4.12). This is
because the system tries to find an overall best speech recognition hypothesis by
considering both language confidence and acoustic confidence. After tailoring the
standard bigram model with gesture selection, in the resulting salience driven bigram
model, the probability of bigram “this desk” is increased (-0.8309) while the probabil-
ity of “this bed” is decreased (-1.9182). This enlarged bigram probability difference
ensures that “this desk” is on the overall best speech hypothesis generated by the
speech recognizer with the salience driven language model.

Figure 4.13. Word lattice of utterance “show me details on this desk” generated by using
the standard bigram model

Figure 4.14. Word lattice of utterance “show me details on this desk” generated by using
the salience driven bigram model
Utterance: “move the red chair over here”

Gesture selection:
  p(bedroom) = 0.0001
  p(curtains_1) = 0.0061
  p(table_pc) = 0.2229
  p(chair_1) = 0.7196
  p(lamp_floor) = 0.0512

Bigram n-best list:
  move the rid chair over here
  move the rid chair over here a
  move the rid chair over here i
  move the rid chair over here the
  move the rid chair over here it

S-Bigram n-best list:
  move the red chair over here
  move the red chair over here a
  move the red chair over here i
  move the red chair over here the
  move the red chair over here it

Figure 4.15. N-best lists of speech recognition for utterance “move the red chair over
here”
Figure 4.15 shows another example where the salience driven language model
helped recognize an utterance that referred to visual properties of an entity. In this
example, the user pointed to a red chair and then pointed to a location while saying
“move the red chair over here”. In the resulting gesture selections, the truly selected
entity chair_1 was assigned the highest probability. As shown in the bigram n-best list
and the S-Bigram n-best list, the speech recognizer with the standard bigram model
did not get the correct recognition result while the one with the salience driven bigram
model recognized the user’s utterance correctly.
The word lattices of the utterance are shown in Figures 4.16 and 4.17. In the
standard bigram case, as shown in Figure 4.16, the probability of bigram “rid chair”
(-3.3811) is higher than the probability of “red chair” (-3.8231). This makes the
wrong speech hypothesis the top one in the n-best list (Figure 4.15). After tailoring
the bigram model with gesture selection, in the salience driven bigram model (Fig-
ure 4.17), the probability of bigram “red chair” is much higher than the probability
of “rid chair”, which makes the correct speech hypothesis the best one in the n-best
list and thus yields the correct speech recognition.

Figure 4.16. Word lattice of utterance “move the red chair over here” generated by using
the standard bigram model

Figure 4.17. Word lattice of utterance “move the red chair over here” generated by using
the salience driven bigram model
Utterance: “I like the picture with like a forest in it”

Gaze salience:
  p(bedroom) = 0.5960    p(chandelier_1) = 0.4040

Bigram n-best list:
  and i eight that picture rid like got five
  and i eight that picture rid identifiable
  and i eight that picture rid like got forest
  and i eight that picture rid like got front
  and i eight that picture rid like got forest a

S-Bigram2 n-best list:
  and i that bedroom it like upside
  and i that bedroom it like a five
  and i that bedroom it like a forest
  and i that bedroom it like a forest a
  and i that bedroom it like a forest candle

Figure 4.18. N-best lists of speech recognition for utterance “I like the picture with like a
forest in it”
Unlike the active input mode of deictic gesture, eye gaze is a passive input mode.
The salience information indicated by eye gaze is not as reliable as the one indicated
by deictic gesture. When the salient entities indicated by eye gaze are not the true
entities the user is referring to, the salience driven language model can worsen speech
recognition. Figure 4.18 shows an example where the S-Bigram2 model in early
application worsened the recognition of a user’s utterance “I like the picture with like
a forest in it” because of wrong salience information. In this example, the user was
talking about a picture entity picture_bamboo. However, this entity was not salient,
only entities bedroom and chandelier_1 were salient as indicated by the user’s eye
gaze. As a result, the recognition with the S-Bigram2 model becomes worse than
the baseline. The correct word “picture” is missing and the wrong word “bedroom”
appears in the result.
The failure to identify the actual referred entity picture_bamboo as salient in the
above example can also be caused by the visual properties of entities. Smaller entities
on the screen are harder for eye gaze to fixate than larger entities. To address
this issue, more reliable salience modeling that takes into account the visual features
is needed.
Utterance: “remove this lamp”
Gesture salience:
p(bedroom) = 0.0995
p(lamp_bank) = 0.5288
p(table_dresser) = 0.3604
p(table_pc) = 0.0114
N-best list of standard trigram model:
remove this stand
remove this them
remove this left
N-best list of S-Bigram model in early integration:
remove this lamp
remove this lamp a
N-best list of S-Bigram model in late integration:
remove this left
remove this stand
remove this them
Figure 4.19. N-best lists of an utterance: early stage integration vs. late stage integration
Early application has an advantage over late application in bringing good hypothesized
words with low acoustic probabilities into the word lattice. This is particularly
important when using the Sphinx-4 speech recognizer, because the current
release of Sphinx-4 does not provide a full word lattice. When the correct words are
not in the word lattice output, a late application of salience driven language models
will never succeed in retrieving those correct words by rescoring the word lattice.
Figure 4.19 shows one example that demonstrates the difference between the early
application and the late application. Here the correct word “lamp” did not appear in
the word lattice generated by the trigram model, and thus could not be retrieved by
the late application of the salience driven bigram model. When the salience driven
bigram model was applied in an early stage, the top one in the generated n-best list
turned out to be the correct recognition result.
4.7 Summary
This chapter presents a systematic investigation of incorporating gesture/gaze into
speech recognition and understanding via salience driven language modeling. Three
salience driven language models based on the bigram model, the class-based bigram
model, and the PCFG are compared. Our experimental results have shown that the
salience driven bigram model can improve spoken language understanding in both
early and late applications, while the salience driven class-based bigram model seems
only useful for the late application. In the early application, the salience driven
PCFG model has also shown a potential advantage in improving spoken language
understanding.
CHAPTER 5
Incorporation of Non-verbal Modalities in
Intention Recognition for Spoken
Language Understanding
In multimodal interpretation, the user’s speech is first converted to text by speech
recognition. To understand the user’s speech, the system further extracts semantic
meaning from the user’s recognized utterance. The previous chapter has addressed
speech recognition in multimodal conversation. In this chapter, we address the un-
derstanding of the recognized speech during multimodal conversation.
In speech and deictic gesture systems, deictic gestures have been mainly used
for attention identification (i.e., identifying which object the user is talking about).
Many approaches have been developed to incorporate gestural information to resolve
referring expressions (e.g., using gesture information to resolve what this refers to in
the utterance “how much does this cost?”) [12, 14, 42, 54, 72, 119]. Different from these
earlier works, our work focuses on how to take gesture beyond attention identification
to help intention recognition (i.e., inferring what the user intends to do with an
object), which is the main task of language understanding.
Traditional language understanding is solely based on the text input. In multi-
modal conversational systems, besides the user’s language, it is possible to infer the
context of the user’s language from other non-verbal modalities (e.g. gesture) and
use this context for language understanding. In speech and deictic gesture systems,
deictic gestures on the graphical display indicate the user’s attention, which consti-
tutes the context of the user’s utterance. Since the context of the identified attention
can potentially constrain the associated intention, deictic gestures can go beyond
attention identification and be applied to recognize the user’s intention.
Within the context of a speech and gesture system, this chapter systematically
investigates the role of deictic gestures in incorporating contextual information to
help language understanding, specifically, to help recognize the user’s intention. We
experiment with different model-based and instance-based approaches to incorporate
gestural information for intention recognition. We also examine the effects of us-
ing gestural information for intention recognition in two different processing stages:
speech recognition stage and language understanding stage.
5.1 Multimodal Interpretation in a Speech-Gesture System
Multimodal interpretation involves extraction of semantic meanings from multimodal
inputs. In human-machine conversation, the specific task of multimodal interpreta-
tion is to convert the user’s multimodal input into a semantic representation that is
recognizable to the system.
5.1.1 Semantic Representation
Semantic meanings from user input can be generally categorized into intention and
attention [33]. Intention indicates the user’s motivation and action. Attention reflects
the focus of the conversation. Structuring semantic meanings in this way, we represent
the semantic meaning of a user’s input by a semantic frame containing intention and
attention of the user. Figure 5.1 shows the semantic frame of a user’s multimodal
input. In the example, the user asks “who is the artist of this picture?” while pointing
to a picture object (identified as picture_lotus) on the screen. The intention indicates
that the user wants the artist information, whereas attention indicates picture_lotus
is the object that the user is interested in.
Intention
  action: ACT-INFO_REQUEST
  aspect: ARTIST
Attention
  object id: picture_lotus
Figure 5.1. Semantic frame of a user’s multimodal input
Representing semantic meaning as semantic frames, the specific task of multimodal
interpretation is to fill intention and attention units in the semantic frames based on
the user’s multimodal input.
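To make this representation concrete, the following sketch encodes such a semantic frame as a simple data structure. The class and field names mirror Figure 5.1 and are illustrative only; they are not the system’s actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Intention:
    action: str            # e.g., "ACT-INFO_REQUEST"
    aspect: Optional[str]  # e.g., "ARTIST", or None when no aspect applies

@dataclass
class Attention:
    object_id: str         # e.g., "picture_lotus"

@dataclass
class SemanticFrame:
    intention: Intention
    attention: Attention

# The frame in Figure 5.1: "who is the artist of this picture?" plus a pointing gesture.
frame = SemanticFrame(
    intention=Intention(action="ACT-INFO_REQUEST", aspect="ARTIST"),
    attention=Attention(object_id="picture_lotus"),
)
print(frame)
```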
5.1.2 Incorporating Context in Two Stages
[Figure: pipeline with Speech Input → Speech Recognition → Language Understanding and Gesture Input → Gesture Recognition → Gesture Understanding; gesture feeds into (a) language understanding or (b) speech recognition; the resulting semantic representations are combined by Multimodal Fusion into a final Semantic Representation.]
Figure 5.2. Using context (via gesture) for language understanding
Context can be incorporated in two stages to help language understanding in
multimodal interpretation [83]. Take speech and gesture systems as an example: as
illustrated by (a) in Figure 5.2, contextual information (inferred from gesture) can
be used together with recognized speech hypotheses directly in language understand-
ing (LU) stage to improve language understanding. Since speech recognition is not
perfect, and better speech recognition should lead to better language understanding,
contextual information can also be used in the speech recognition (SR) stage to improve
speech recognition hypotheses and thus improve language understanding (Figure 5.2-
(b)).
5.2 Intention Recognition
We investigate using the context identified by gesture for intention recognition in a
speech-gesture system that is built for a 3D interior decoration domain (Section 3.3.1).
In this domain, the user’s intention is represented by an action and its corresponding
aspect. All actions and corresponding aspects in the interior decoration domain are
shown in Table 5.1. Note that for action ACT-INFO_REQUEST, the aspect includes
different domain properties such as ARTIST, AGE, and PRICE.
Action                 Aspect
ACT-ADD
ACT-ALTERNATES_SHOW    <null>
ACT-INFO_REQUEST       <...> or <null>
ACT-MOVE               <...> or <...>
ACT-PAINT              <color> or <...>
ACT-REMOVE
ACT-REPLACE            <replacement> or <null>
ACT-ROTATE             <...> or <...>
Table 5.1. Intentions in the 3D interior decoration domain
Given this representation, intention recognition can be formulated as a classifi-
cation problem. Each action-aspect pair can be considered as a particular type of
intention. For action ACT-INFO_REQUEST, there are 11 possible aspect values
which result in 11 classes. For all other 7 actions, each action is treated as one type
of intention despite multiple possible aspect values. During interpretation, additional
post-processing will take place to identify different aspects. For example, for action
ACT-PAINT, the system will try to identify the <color> value (e.g., red, blue) from
the user’s utterance after ACT-PAINT is predicted as the user’s intended action.
Here, we only focus on the classification of intention without elaborating on the post-
processing. In total, there are 19 target classes for intention recognition (including
class NOT- UNDERSTOOD to represent the intention that is not supported in the
domain).
5.3 Feature Extraction
To predict user intention, we first need to extract features from the user’s multimodal
input. Two types of features are used for intention prediction: semantic features and
phoneme features.
5.3.1 Semantic Features
The semantic features of users’ multimodal input consist of two parts: lexical features
extracted from users’ spoken utterances, and contextual features extracted from users’
deictic gestures.
• Lexical features
The lexical feature is represented by a binary feature vector which indicates what
semantic concepts appear in the user’s utterance. The semantic concepts are
extracted from the recognized speech hypotheses (could be the n-best hypotheses
or the 1-best hypothesis) based on lexical rules. Currently, we have 18 semantic
concepts in the interior decoration domain with 130 lexical rules.
• Contextual features
When a deictic gesture takes place, the selected object and its properties as
defined in the domain are activated, which form the context of the user’s ut-
terance. This context constrains what the user is likely to be talking about. For
example, the user is unlikely to ask the artist of a lamp or the wattage of a
picture. Therefore, this context can be used to help predict user intention. For
each gesture that accompanies the user’s utterance, we choose the most likely
object selected by the gesture and use the semantic type of the object as the
contextual feature. There are 14 semantic types of objects in the domain. A
sketch of this feature extraction follows this list.
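As a rough illustration of the two feature types above, the sketch below builds a binary concept vector from recognized speech hypotheses using a handful of made-up lexical rules, and appends the semantic type of the gesture-selected object as the contextual feature. The concept inventory, rules, and object catalog are stand-ins for the system’s actual 18 concepts, 130 lexical rules, and 14 semantic types.

```python
# Hypothetical lexical rules: concept -> trigger words (the real system has 18 concepts, 130 rules).
LEXICAL_RULES = {
    "CONCEPT_INFO":  {"information", "details", "artist", "price"},
    "CONCEPT_COLOR": {"red", "green", "blue", "paint"},
    "CONCEPT_MOVE":  {"move", "put", "place"},
}
CONCEPTS = sorted(LEXICAL_RULES)

# Hypothetical object catalog mapping object ids to semantic types (14 types in the real domain).
OBJECT_TYPES = {"picture_lotus": "SEM_PICTURE", "chair_1": "SEM_CHAIR", "lamp_bank": "SEM_LAMP"}

def lexical_features(hypotheses):
    """Binary vector: which concepts appear in any of the n-best recognition hypotheses."""
    words = {w.lower() for hyp in hypotheses for w in hyp.split()}
    return [int(bool(LEXICAL_RULES[c] & words)) for c in CONCEPTS]

def contextual_feature(gesture_object_id):
    """Semantic type of the most likely object selected by the accompanying gesture."""
    return OBJECT_TYPES.get(gesture_object_id, "SEM_UNKNOWN")

feats = lexical_features(["who is the artist of this picture"]) + [contextual_feature("picture_lotus")]
print(feats)  # e.g., [0, 1, 0, 'SEM_PICTURE']
```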
5.3.2 Phoneme Features
Besides semantic features, we also use phoneme features of users’ spoken utterances for
intention prediction. For each speech recognition hypothesis of the user’s utterance,
we can get a phoneme sequence. Each phoneme sequence is treated as a phoneme
feature.
User utterance: “information on this”
Phonemes: [ih n f er m ey sh ax n] [ao n] [dh ih s]
Speech recognition: “and for mission on this”
Phonemes: [ax n d] [f er] [m ih sh ax n] [ao n] [dh ih s]
Figure 5.3. Phonemes of an utterance
We give an example to show the potential of using phoneme features to help
user intention prediction. As shown in Figure 5.3, the user’s utterance is not cor-
rectly recognized and as a result, the semantic feature extracted from the recog-
nized speech does not give any useful information about the user’s intention of ACT-
INFO-REQUEST. Therefore, using semantic features alone will fail to predict the
user’s intention. However, if we compare the two phoneme sequences of the true ut—
terance and the speech recognition result, we can find that the phoneme sequences of
the mis-recognized speech, [ax n d] [f er] [m ih sh ax n], is close to the true phoneme
sequence [ih n f er m ey sh ax n]. This means that using phoneme sequence sim-
ilarity can help recover the word “information”, which is the key to identifying the
user’s intention in this utterance, and therefore can help predict the user’s intention.
5.4 Model-Based Intention Recognition
Given an instance x that is represented by semantic features, we applied three clas-
sifiers to predict user intention.
• Naive Bayes
The prediction c* of instance x is given by

c^* = \arg\max_c p(c \mid \mathbf{x}) = \arg\max_c p(c \mid x_1, x_2, \ldots, x_m)    (5.1)

where x_i is the i-th feature of instance x.
Applying Bayes’ theorem and assuming the features are conditionally indepen-
dent given a class, we have

p(c \mid \mathbf{x}) = \frac{p(c)\, p(x_1, \ldots, x_m \mid c)}{p(x_1, \ldots, x_m)} \propto p(c) \prod_i p(x_i \mid c)    (5.2)

Estimating p(c) and p(x_i|c) from the training data, we can get the prediction
of a testing instance by Equation (5.1). In our evaluation, add-one smoothing
was used in the estimation of p(c) and p(x_i|c) for predicting user intention.
• Decision Tree
In a decision tree, each leaf node provides the classification of the instances,
each non-leaf node specifies a test of some attribute of the instances, and each
branch descending from that node corresponds to one of the possible values for
this attribute. Decision trees classify instances by sorting them down the tree
from the root node to some leaf node through a list of attribute tests. We used
the C4.5 algorithm [86] to construct decision trees for intention prediction based on
the semantic features of users’ multimodal input.
• Support Vector Machines (SVM)
The SVM [22] is built by mapping instances to a high dimensional space and
finding a hyperplane with the largest margin that separates the training in-
stances into two classes in the mapped space. In prediction, an instance is
classified depending on the side of the hyperplane it lies on. A kernel function
is used in SVM to achieve linear classification in the high dimensional space.
Based on the semantic features of users’ multimodal input, we used a polynomial
kernel for user intention prediction.
Since SVM can only handle binary classification, a “one-against-one” method
is applied to use SVM for multi-class classification [39]. For a classification task
of c classes, c(c − 1)/2 SVMs are built for all pairs of classes and each SVM
is trained on the data from the pair of two classes. In the testing phase, a
test instance x is classified through a majority voting strategy. For each of the
c(c − 1)/2 binary classifiers built for class pair (c_i, c_j), if the classifier decides x
belongs to class c_i, the vote for class c_i increases by one. Otherwise, the vote
for class c_j increases by one. After all binary classifiers have been used to vote
for the classes, the one which wins the most votes is picked as the prediction of
x.
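The one-against-one voting scheme described above can be sketched as follows. A library such as LIBSVM or scikit-learn performs this internally; here `binary_classifiers` is simply assumed to map each class pair to a trained binary decision function.

```python
from collections import Counter
from itertools import combinations

def one_against_one_predict(x, classes, binary_classifiers):
    """Majority voting over c(c-1)/2 pairwise classifiers.

    binary_classifiers[(ci, cj)](x) is assumed to return ci or cj for instance x.
    """
    votes = Counter()
    for ci, cj in combinations(classes, 2):
        votes[binary_classifiers[(ci, cj)](x)] += 1
    # The class with the most votes is the predicted intention.
    return votes.most_common(1)[0][0]

# Toy usage with 3 intention classes and trivial pairwise "classifiers".
classes = ["ACT-PAINT", "ACT-MOVE", "ACT-REMOVE"]
clfs = {pair: (lambda x, p=pair: p[0]) for pair in combinations(classes, 2)}
print(one_against_one_predict({"features": []}, classes, clfs))  # "ACT-PAINT" wins both of its pairs
```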
5.5 Instance-Based Intention Recognition
We also applied k-nearest neighbor (KNN), an instance-based approach, to predict
user intention. Given a set of training instances with known intention, the KNN
method (k=1) predicts the intention of a testing instance by finding the testing in-
stance’s closest match in the training instances and using the match’s intention as
the prediction.
We applied KNN to predict user intention based on semantic features and phoneme
features. The similarity between a testing instance x^t and a training instance x^r is
defined as

d_{sp}(\mathbf{x}^t, \mathbf{x}^r) = d_s(\mathbf{x}^t, \mathbf{x}^r) + d_p(\mathbf{x}^t, \mathbf{x}^r)    (5.3)

where d_s(\mathbf{x}^t, \mathbf{x}^r) is the Hamming distance between the nominal semantic features and
d_p(\mathbf{x}^t, \mathbf{x}^r) is the distance between the phoneme features.
Hamming distance d_s(\mathbf{x}^t, \mathbf{x}^r) is defined as:

d_s(\mathbf{x}^t, \mathbf{x}^r) = \sum_{k=1}^{m} \left(1 - \delta(x_k^t, x_k^r)\right)    (5.4)

where x_k is the k-th attribute in the semantic feature, and

\delta(x_k^t, x_k^r) = \begin{cases} 0 & x_k^t \neq x_k^r \\ 1 & x_k^t = x_k^r \end{cases}

Phoneme distance d_p(\mathbf{x}^t, \mathbf{x}^r) is defined as follows based on different configura-
tions:
• when n-best speech recognition hypotheses are used, and no gestural informa-
tion is used:

d_p(\mathbf{x}^t, \mathbf{x}^r) = \min_k \mathrm{MED}(P_k^t, P^r)    (5.5)

• when n-best speech recognition hypotheses are used, and gestural information
(i.e., objects indicated by deictic gestures) is used:

d_p(\mathbf{x}^t, \mathbf{x}^r) = \min_k \mathrm{MED}(P_k^t, P^r) + w_e(o^t, o^r)    (5.6)

where
MED — minimum edit distance
P_k^t — phonemes of the k-th speech recognition hypothesis of testing instance x^t
P^r — phonemes of the speech transcript of training instance x^r
w_e(o^t, o^r) — distance between the object o^t selected by the gesture accompanying
testing instance x^t and the object o^r selected by the gesture accompanying training
instance x^r (0 if o^t and o^r are of the same semantic type, otherwise a non-zero constant)
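The combined distance in Equations (5.3)-(5.6) can be sketched as below. The minimum edit distance operates on phoneme token sequences, and the gesture penalty constant is an illustrative value rather than the one used in the experiments.

```python
def edit_distance(a, b):
    """Minimum edit distance (MED) between two phoneme sequences."""
    dp = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return dp[len(a)][len(b)]

def semantic_distance(xt, xr):
    """Hamming distance over nominal semantic features (Equation 5.4)."""
    return sum(1 for a, b in zip(xt, xr) if a != b)

def phoneme_distance(test_hyp_phonemes, train_phonemes, ot=None, orr=None, penalty=5):
    """Equation (5.5)/(5.6): min MED over the n-best hypotheses, plus an optional gesture term."""
    d = min(edit_distance(p, train_phonemes) for p in test_hyp_phonemes)
    if ot is not None and orr is not None:
        d += 0 if ot == orr else penalty  # same semantic type -> no extra cost
    return d

# Toy usage: the Figure 5.3 example, misrecognized "information on this" vs. its true phonemes.
test_nbest = [["ax","n","d","f","er","m","ih","sh","ax","n","ao","n","dh","ih","s"]]
train = ["ih","n","f","er","m","ey","sh","ax","n","ao","n","dh","ih","s"]
print(semantic_distance([1, 0, 1], [1, 1, 1]) +
      phoneme_distance(test_nbest, train, "SEM_PICTURE", "SEM_PICTURE"))
```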
5.6 Evaluation
We empirically evaluated the role of contextual information in intention recognition.
We applied both model-based and instance-based approaches, and investigated the
incorporation of contextual information for intention recognition in language under-
standing and speech recognition stages.
5.6.1 Experiment Settings
The CMU Sphinx-4 speech recognizer [111] was used for speech recognition. An
open acoustic model and a domain dictionary were used in recognizing users’ spoken
utterances.
For model-based intention prediction, we evaluated the intention prediction accu-
racies with the following classifiers based on semantic features:
• NBayes — naive Bayes
• DTree — decision tree (C4.5)
• SVM — support vector machine (polynomial kernel)
For instance-based intention prediction, we evaluated the intention prediction ac-
curacies with KNN classifiers based on different instance similarity functions:
• S-KNN — instance distance defined on semantic features (Equation (5.4))
• P-KNN — instance distance defined on phoneme features (Equations (5.5) and
(5.6) depending on whether gestural information is incorporated)
• SP-KNN — instance distance defined on the combined semantic and phoneme
features (Equation (5.3))
For each approach, we compared the performances of using only the 1-best speech
recognition hypothesis and using all n-best speech recognition hypotheses for inten-
tion prediction. Also, to compare the influences of gestural information on intention
prediction, we evaluated intention prediction under three gesture configurations:
• noGest — no gestural information is used.
• recoGest — with gesture recognition results, i.e., the most likely objects selected
by the user’s gestures as recognized by the system.
• trueGest — with ground truth gesture recognition results, i.e., the objects truly
selected by the user’s gestures.
For each approach, we further evaluated intention prediction based on standard
speech recognition and gesture-tailored speech recognition. When intention prediction
is based on standard speech recognition, gestural information is incorporated only
in language understanding for intention prediction. When intention prediction is
based on gesture-tailored speech recognition, gestural information is already used in
speech recognition and can also be used in language understanding stage for intention
prediction.
The evaluations were done by a 10-fold cross validation on the speech and gesture
data set as described in Section 4.6.1.
5.6.2 Results Based on Traditional Speech Recognition
Table 5.2 shows the intention prediction accuracies based on the standard speech
recognition results that did not use gestural information. The intention prediction
accuracies based on transcripts of users’ spoken utterances are also given in the table
to show the upper-bound performance when speech is perfectly recognized.
                        NBayes   DTree    SVM     S-KNN   P-KNN   SP-KNN
transcript   noGest     0.860    0.881    0.878   0.881   0.918   0.937
             recoGest   0.878    0.888    0.884   0.888   0.921   0.934
             trueGest   0.874    0.889    0.884   0.884   0.921   0.934
n-best       noGest     0.709    0.718    0.713   0.700   0.790   0.824
hypotheses   recoGest   0.741    0.729    0.749   0.740   0.797   0.826
             trueGest   0.755    0.738    0.744   0.737   0.806   0.832
1-best       noGest     0.721    0.727    0.730   0.730   0.798   0.820
hypothesis   recoGest   0.747    0.755    0.747   0.757   0.801   0.834
             trueGest   0.763    0.769    0.760   0.758   0.804   0.844
Table 5.2. Accuracies of intention prediction based on standard speech recognition
For all model-based approaches (i.e., NBayes, DTree, SVM), the results show that
using gestural information together with recognized speech (1-best or n-best) in in-
tention prediction achieves significant improvement on prediction accuracy compared
to not using gestural information. Among instance-based approaches (i.e., S-KNN, P-
KNN, SP-KNN), only for S-KNN, which uses semantic features, are intention prediction
accuracies improved significantly when gestural information is used together with
recognized speech (1-best or n-best hypotheses). For P-KNN, where only phoneme
features are used, there is no significant change between intention prediction using
gesture and not using gesture, regardless of whether gestural information is used with
1-best or n-best speech recognition. For SP-KNN, which uses
both semantic and phoneme features, intention prediction is significantly improved
only when gestural information is used together with 1-best speech recognition.
It is found that, used together with recognized speech hypotheses in model-based
approaches, ground truth gesture selection achieves more accurate intention predic-
tion than recognized gesture selection in most configurations. This indicates that
improving gesture recognition and understanding can further enhance intention pre-
diction when speech recognition is not perfect. When SVM is applied on semantic
features extracted from all n-best speech recognition hypotheses, using the true ges-
ture selection achieves slightly worse performance than using the recognized gesture
selection. However, this is not a significant difference. In instance-based approaches,
using true gesture selection makes no significant difference than using recognized
gesture selection for user intention prediction.
5.6.3 Results Based on Gesture-Tailored Speech Recognition
Table 5.3 shows the intention prediction accuracies based on the gesture-tailored
speech recognition hypotheses. Note that in Table 5.3, gestural information (all pos-
sible gesture selections recognized by the system) has been utilized in speech recog-
nition [81], the configurations noGest, recoGest, and trueGest only apply to how
gestural information is used in language understanding stage for intention prediction.
Therefore, in Table 5.3, the results under configurations n-best hypotheses + noGest
and 1-best hypothesis + noGest are actually the intention prediction performance
when gestural information is used in only speech recognition stage.
                        NBayes   DTree    SVM     S-KNN   P-KNN   SP-KNN
transcript   noGest     0.860    0.881    0.878   0.881   0.918   0.937
             recoGest   0.878    0.888    0.884   0.888   0.921   0.934
             trueGest   0.874    0.889    0.884   0.884   0.921   0.934
n-best       noGest     0.727    0.749    0.750   0.753   0.826   0.858
hypotheses   recoGest   0.753    0.766    0.780   0.770   0.829   0.857
             trueGest   0.766    0.781    0.786   0.781   0.827   0.860
1-best       noGest     0.735    0.743    0.752   0.758   0.812   0.843
hypothesis   recoGest   0.764    0.772    0.764   0.778   0.815   0.855
             trueGest   0.783    0.795    0.777   0.795   0.817   0.860
Table 5.3. Accuracies of intention prediction based on gesture-tailored speech recognition
Compared to using gestural information only in speech recognition, the accuracies
of intention prediction are significantly improved in all model-based approaches when
gestural information is used in both speech recognition and language understand-
ing, regardless of whether it is used with 1-best or n-best speech recognition. Among
the instance-based approaches, only for S-KNN does using gestural information in both
speech recognition and language understanding (with 1-best or n-best recognition
hypotheses) significantly improve intention prediction compared to using gestural
information only in speech recognition. For P-KNN, whether or not gestural
information is used in language understanding makes no significant change in inten-
tion prediction. For SP-KNN, intention prediction is significantly improved only when
gestural information is used together with the 1-best speech recognition hypothesis
in language understanding, compared to using gestural information only in speech
recognition.
In all model-based approaches, together with recognized speech, using ground
truth gesture selection in language understanding is found to improve intention pre-
diction more than the recognized gesture selection. Again, this indicates that im-
proving gesture recognition and understanding is helpful for intention prediction. In
instance-based approaches, using true or recognized gesture selection in language un-
derstanding stage for intention prediction does not make significant differences when
phoneme features are used.
5.6.4 Results Based on Different Sizes of Training Data
The empirical results have shown that using gestural information improves user inten-
tion recognition. To examine whether this improvement by using gestural information
is dependent on the size of training data, we compare the accuracies of intention pre-
diction with different sizes of training sets. The results of the approaches are shown in
Figures 5.4-5.9. The semantic features and phoneme features extracted from the
1-best speech recognition hypotheses and the recognized gesture selections are used in intention
prediction.
[Plot: intention prediction accuracy vs. % training data; curves for gesture not used, gesture used in LU, gesture used in SR, and gesture used in SR and LU.]
Figure 5.4. Intention prediction performance of Naive Bayes based on different training
size
[Plot: intention prediction accuracy vs. % training data; curves for gesture not used, gesture used in LU, gesture used in SR, and gesture used in SR and LU.]
Figure 5.5. Intention prediction performance of Decision Tree based on different training
size
[Plot: intention prediction accuracy vs. % training data; curves for gesture not used, gesture used in LU, gesture used in SR, and gesture used in SR and LU.]
Figure 5.6. Intention prediction performance of SVM based on different training size
[Plot: intention prediction accuracy vs. % training data; curves for gesture not used, gesture used in LU, gesture used in SR, and gesture used in SR and LU.]
Figure 5.7. Intention prediction performance of S-KNN based on different training size
The intention prediction accuracy curves are generated in the following way. The
whole data set is first separated into 5 folds in a stratified way such that the class
[Plot: intention prediction accuracy vs. % training data; curves for gesture not used, gesture used in LU, gesture used in SR, and gesture used in SR and LU.]
Figure 5.8. Intention prediction performance of P-KNN based on different training size
[Plot: intention prediction accuracy vs. % training data; curves for gesture not used, gesture used in LU, gesture used in SR, and gesture used in SR and LU.]
Figure 5.9. Intention prediction performance of SP-KNN based on different training size
distributions in each fold are the same. In each round of evaluation, two different
folds are picked as the testing set and the initial training set, and instances in the other 3
folds are added to the training set incrementally by random picking to get intention
prediction accuracies based on different sizes of training sets. After each fold of data
has been used as the testing set and the initial training set, the intention prediction accuracy
curves of the 20 rounds of evaluation are averaged to get the curves in Figures 5.4-5.9.
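A sketch of this curve-generation procedure is given below, under the assumption that `train_and_score` trains a classifier on the given training indices and returns its accuracy on the test indices; the chunk sizes and random seeds are illustrative choices, not those of the actual experiments.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Split instance indices into k folds with (roughly) equal class distributions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for i, idx in enumerate(idxs):
            folds[i % k].append(idx)
    return folds

def training_size_curve(folds, train_and_score, steps=8, seed=0):
    """Average accuracy-vs-training-size curves over all ordered (test, initial-train) fold pairs."""
    rng = random.Random(seed)
    curves = []
    for t in range(len(folds)):
        for s in range(len(folds)):
            if s == t:
                continue
            rest = [i for f in range(len(folds)) if f not in (t, s) for i in folds[f]]
            rng.shuffle(rest)
            train = list(folds[s])
            curve = [train_and_score(train, folds[t])]
            chunk = max(1, len(rest) // steps)
            for start in range(0, len(rest), chunk):
                train += rest[start:start + chunk]   # add instances incrementally
                curve.append(train_and_score(train, folds[t]))
            curves.append(curve)
    n = min(len(c) for c in curves)
    return [sum(c[i] for c in curves) / len(curves) for i in range(n)]

# Toy usage with a dummy scorer (replace with actual classifier training and evaluation).
folds = stratified_folds([i % 3 for i in range(60)])
print(training_size_curve(folds, lambda tr, te: len(tr) / 60.0))
```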
We can see that, for all model-based and instance-based approaches, using gestural
information in both speech recognition stage and language understanding stage al-
ways outperforms using gestural information in only language understanding stage or
not using gestural information at all for intention prediction. Using gestural informa-
tion only in the speech recognition stage is found to always outperform not using gestural
information for intention prediction in all model-based and instance-based approaches,
regardless of the training size. When gestural information is used only in the language under-
standing stage, Naive Bayes and S-KNN always improve intention prediction regardless of
the training size. For the other approaches (Decision Tree, SVM, P-KNN, and SP-
KNN), sufficient training data is needed to make gestural information helpful for
intention prediction.
5.6.5 Discussion
The empirical results lead to several findings about the role of deictic gestures in
incorporating domain context in intention recognition.
First, deictic gesture helps intention recognition given the current speech recogni-
tion technology. The earlier deictic gesture is used in speech processing, the more
effect it brings to intention recognition. Figure 5.10 shows the performance of inten-
tion recognition by different approaches when gestural information is not used (i.e.,
only recognized speech hypotheses are used), used only in speech recognition stage,
used only in language understanding stage, and used in both speech recognition and
language understanding stages. We can easily see that using gestural information
in speech recognition stage or language understanding stage improves intention pre-
diction. Using gestural information in both the speech recognition stage and the language
[Bar chart: intention prediction accuracy for NBayes, DTree, SVM, S-KNN, P-KNN, and SP-KNN under four conditions: gesture not used, gesture used in LU, gesture used in SR, and gesture used in SR and LU.]
Figure 5.10. Using gestural information in different stages for intention recognition
understanding stage further improves intention prediction. Therefore, it is desirable
to incorporate gesture earlier in the spoken language processing.
Second, deictic gesture does not help much in intention recognition for a sim-
ple/small domain if speech is perfectly recognized. As we can see in Table 5.2, when
gestural information is used together with the transcripts of user utterances to pre-
dict intention, the effect is not as significant as when gesture information is used
with recognized speech hypotheses. This is within our expectation. Given a simple
domain with a limited number of words (the vocabulary size for our current domain
is 250), it is relatively easier to come up with sufficient semantic grammars to cover
the variations of language. In other words, once user utterances are correctly rec-
ognized, the semantics of the input can most likely be correctly identified by the
language understanding component. So the bottleneck in interpretation appears in
speech recognition (due to many possible reasons such as background noise, accent,
etc.). The better the speech recognition, the better the language understanding compo-
nent processes the hypotheses, and the less effect the gesture is likely to bring. When
speech is perfectly recognized (i.e., same as transcriptions), the addition of gesture
information will not bring extra advantage. In fact, it may hurt the performance if
gesture recognition is not adequate. However, we feel that when the domain becomes
more complex and the variations of language become more difficult to process, the
use of gesture may begin to show advantage even when speech recognition performs
reasonably well. After all, speech recognition is far from being perfect in reality, which
makes gestural information valuable in intention recognition.
Third, deictic gesture helps more significantly when combined with semantic fea-
tures than with phoneme features for intention prediction. As shown in Figure 5.10,
for NBayes, DTree, SVM, and S-KNN, where only semantic features are used, the
addition of deictic gesture in both speech recognition and language understanding
can improve the performance between 4.7% and 6.6%. For P-KNN where only the
phoneme features are used, the improvement is 2.1%. Although the addition of
phoneme features significantly improves the intention recognition performance, it is
computationally much more expensive than the use of only semantic features. Using
phoneme features may become impractical in real-time systems for complex domains.
Thus the incorporation of the gestural information could be even more important.
5.7 Summary
This chapter systematically investigates the role of deictic gesture in recognizing user
intention during interaction with a speech and gesture interface. Different model-
based and instance-based approaches using gestural information have been applied
to recognize user intention. Our empirical results have shown that using gestural
information in either the speech recognition or the language understanding stage is able to
improve user intention recognition. Moreover, when gestural information is used
in both speech recognition and language understanding, intention recognition can be
further improved. These results indicate that deictic gesture, although most indicative
of user attention, is also helpful in recognizing user intention. These results further
point out when and how deictic gesture should be effectively incorporated in building
practical speech-gesture systems.
CHAPTER 6
Incorporation of Eye Gaze in Automatic
Word Acquisition
Chapter 4 and Chapter 5 investigate the use of non-verbal modalities to improve spo-
ken language understanding in multimodal conversational systems. Another signifi-
cant problem with language understanding in multimodal conversation is the system’s
lack of knowledge to process user language. Language is flexible, different users may
use different words to express the same meaning. When the system encounters a word
that is out of its knowledge base (e.g., vocabulary), it tends to fail in interpreting the
user’s language. It is desirable that the system can learn new words automatically
during human-machine conversation.
In this chapter, we present the investigation of using eye gaze for automatic word
acquisition. The speech-gaze temporal information and domain semantic relatedness
are incorporated in statistical translation models for word acquisition. Our experi-
ments show that the use of speech-gaze temporal information and domain semantic
relatedness significantly improves word acquisition performance.
This chapter begins with a description of the speech and gaze data collection,
followed by an introduction of the basic translation models for word acquisition. Then,
we describe the enhanced models that incorporate temporal and semantic information
about speech and eye gaze for word acquisition. Finally, we present the results of
empirical evaluation.
6.1 Data Collection
We used the same set of speech and eye gaze data as described in Section 4.6.3.
[Figure: time-aligned speech stream (“This room has a chandelier”, with per-word start times in ms) and gaze stream (gaze fixations with start and end times and their fixated entities; [10] — bedroom, [11] — chandelier, [17] — lamp_2, [19] — bed frame, [22] — door).]
Figure 6.1. Parallel speech and gaze streams
Figure 6.1 shows an excerpt of the collected speech and gaze fixations in one ex-
periment. In the speech stream, each word starts at a particular timestamp. In the
gaze stream, each gaze fixation has a starting timestamp t_s and an ending timestamp
t_e. Each gaze fixation also has a list of fixated entities (3D objects). An entity e on
the graphical display is fixated by gaze fixation f if the area of e contains the fixation
point of f.
Given the collected speech and gaze fixations, we build a parallel speech-gaze data
set as follows. For each spoken utterance and its accompanying gaze fixations, we
construct a pair of word sequence and entity sequence (w, e). The word sequence w
consists of only nouns and adjectives in the utterance. Each gaze fixation results in
a fixated entity in the entity sequence e. When multiple entities are fixated by one
gaze fixation due to the overlapping of the entities, the forefront one is chosen. Also,
we merge the neighboring gaze fixations that contain the same fixated entities. For
the parallel speech and gaze streams shown in Figure 6.1, the resulting word sequence
is w = [room chandelier] and the entity sequence is e = [bed_frame lamp_2 bed_frame
door chandelier].
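A sketch of this construction is given below. The part-of-speech filter is reduced to a small stop-list for brevity, and the word and fixation timestamps are approximate readings of Figure 6.1 rather than exact values.

```python
def build_parallel_instance(words, fixations,
                            content_word=lambda w: w.lower() not in {"this", "a", "the", "has"}):
    """Build a (word sequence, entity sequence) pair from one utterance and its gaze fixations.

    words:     [(word, start_ms), ...] for the utterance
    fixations: [(start_ms, end_ms, [entities front-to-back]), ...]
    """
    # Keep only content words (nouns/adjectives in the real system; a stop-list stands in here).
    w = [word for word, _ in words if content_word(word)]
    # Take the forefront entity of each fixation, then merge neighbours fixating the same entity.
    e = []
    for _, _, entities in fixations:
        if entities and (not e or e[-1] != entities[0]):
            e.append(entities[0])
    return w, e

# Approximate reading of the Figure 6.1 excerpt: "This room has a chandelier" and its gaze stream.
words = [("This", 2571), ("room", 2871), ("has", 3117), ("a", 3528), ("chandelier", 3736)]
fixations = [(56, 968, ["bed_frame"]), (1668, 2096, ["lamp_2"]), (2096, 2692, ["bed_frame"]),
             (2692, 3222, ["door"]), (3222, 3900, ["chandelier_1", "bedroom"])]
print(build_parallel_instance(words, fixations))
# -> (['room', 'chandelier'], ['bed_frame', 'lamp_2', 'bed_frame', 'door', 'chandelier_1'])
```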
6.2 Translation Models for Automatic Word Acquisition
Since we are working on conversational systems where users interact with a visual
scene, we consider the task of word acquisition as associating words with visual en-
tities in the domain. Given the parallel speech and gaze fixated entities {(w,e)},
we formulate word acquisition as a translation problem and use translation models
to estimate word-entity association probabilities p(w|e). The words with the highest
association probabilities are chosen as acquired words for entity e.
6.2.1 Base Model I
Using the translation model I [5], where each word is equally likely to be aligned with
each entity, we have

p(\mathbf{w}|\mathbf{e}) = \frac{1}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} p(w_j|e_i)    (6.1)

where l and m are the lengths of the entity and word sequences respectively. We refer to
this model as Model-1.
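A minimal EM sketch for estimating p(w|e) under Model-1 is shown below; it follows the standard IBM Model 1 training procedure, including a null entity at position 0, and the toy data at the end are illustrative.

```python
from collections import defaultdict

def train_model1(pairs, iterations=20):
    """Estimate word-entity association probabilities p(w|e) with IBM Model 1 style EM.

    pairs: list of (word_sequence, entity_sequence) instances, e.g. from Section 6.1.
    """
    vocab_w = {w for ws, _ in pairs for w in ws}
    vocab_e = {e for _, es in pairs for e in es} | {"<null>"}
    # Uniform initialization of p(w|e).
    p = {e: {w: 1.0 / len(vocab_w) for w in vocab_w} for e in vocab_e}
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for ws, es in pairs:
            es = ["<null>"] + list(es)
            for w in ws:
                norm = sum(p[e][w] for e in es)
                for e in es:                       # E-step: expected alignment counts
                    c = p[e][w] / norm
                    count[e][w] += c
                    total[e] += c
        for e in count:                            # M-step: re-normalize
            for w in count[e]:
                p[e][w] = count[e][w] / total[e]
    return p

# Toy usage on two instances; with real data the acquired words for an entity are the
# highest-probability entries in p[entity].
pairs = [(["room", "chandelier"], ["bed_frame", "chandelier_1"]),
         (["chandelier", "lamp"], ["chandelier_1", "lamp_2"])]
p = train_model1(pairs)
print(max(p["chandelier_1"], key=p["chandelier_1"].get))  # -> "chandelier"
```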
6.2.2 Base Model II
Using the translation model II [5], where alignments are dependent on word/entity
positions and word/entity sequence lengths, we have

p(\mathbf{w}|\mathbf{e}) = \prod_{j=1}^{m} \sum_{i=0}^{l} p(a_j = i \mid j, m, l)\, p(w_j|e_i)    (6.2)

We refer to this model as Model-2.
6.3 Using Speech-Gaze Temporal Information for Word Acquisition
To incorporate the speech-gaze temporal information, the alignment of word w_j and
entity e_i is made dependent on their temporal relation:

p(\mathbf{w}|\mathbf{e}) = \prod_{j=1}^{m} \sum_{i=0}^{l} p_t(a_j = i \mid j, \mathbf{e}, \mathbf{w})\, p(w_j|e_i)    (6.3)

We refer to this model as Model-2t. The temporal distance between entity e_i and
word w_j is defined as

d(e_i, w_j) = \begin{cases} 0 & t_s(e_i) \leq t_s(w_j) \leq t_e(e_i) \\ t_e(e_i) - t_s(w_j) & t_s(w_j) > t_e(e_i) \\ t_s(e_i) - t_s(w_j) & t_s(w_j) < t_s(e_i) \end{cases}    (6.4)

where t_s(w_j) is the starting timestamp (ms) of word w_j, and t_s(e_i) and t_e(e_i) are the
starting and ending timestamps (ms) of the gaze fixation on entity e_i.
The alignment of word w_j and entity e_i is decided by their temporal distance
d(e_i, w_j). Based on the psycholinguistic finding that eye gaze happens before a spo-
ken word, w_j is not allowed to be aligned with e_i when w_j happens earlier than e_i
(i.e., d(e_i, w_j) > 0). When w_j happens no earlier than e_i (i.e., d(e_i, w_j) \leq 0), the
closer they are, the more likely they are aligned. Specifically, the temporal alignment
probability of w_j and e_i in each co-occurring instance (w, e) is computed as

p_t(a_j = i \mid j, \mathbf{e}, \mathbf{w}) = \begin{cases} 0 & d(e_i, w_j) > 0 \\ \frac{\exp[\alpha \cdot d(e_i, w_j)]}{\sum_i \exp[\alpha \cdot d(e_i, w_j)]} & d(e_i, w_j) \leq 0 \end{cases}    (6.5)

where \alpha is a constant for scaling d(e_i, w_j).
An EM algorithm is used to estimate the probabilities p(w|e) and \alpha in Model-2t.
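The temporal distance and alignment probability in Equations (6.4) and (6.5) can be sketched as follows; the scaling constant alpha is an illustrative value here, whereas in the actual model it is estimated by EM together with p(w|e).

```python
import math

def temporal_distance(ts_e, te_e, ts_w):
    """Equation (6.4): temporal distance between a gaze-fixated entity and a spoken word."""
    if ts_e <= ts_w <= te_e:
        return 0.0
    if ts_w > te_e:
        return te_e - ts_w       # word starts after the fixation ends (negative)
    return ts_e - ts_w           # word starts before the fixation starts (positive)

def temporal_alignment_probs(fixations, ts_w, alpha=0.001):
    """Equation (6.5): alignment probability of a word with each fixated entity.

    fixations: [(entity, ts_e, te_e), ...]; alpha is an illustrative scaling constant.
    """
    d = [temporal_distance(ts_e, te_e, ts_w) for _, ts_e, te_e in fixations]
    weights = [math.exp(alpha * di) if di <= 0 else 0.0 for di in d]
    z = sum(weights) or 1.0
    return {ent: w / z for (ent, _, _), w in zip(fixations, weights)}

# Toy usage: the word "chandelier" starting at 3736 ms against three fixations.
fixations = [("bed_frame", 2096, 2692), ("door", 2692, 3222), ("chandelier_1", 3222, 3900)]
print(temporal_alignment_probs(fixations, ts_w=3736))  # chandelier_1 gets the largest share
```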
[Histogram: alignment count vs. temporal distance of aligned word and entity (ms), ranging roughly from -5,000 to 1,000 ms.]
Figure 6.2. Histogram of truly aligned word and entity pairs over temporal distance (bin
width = 200ms)
For the purpose of evaluation, we manually annotated the truly aligned word and
entity pairs. Figure 6.2 shows the histogram of those truly aligned word and entity
pairs over the temporal distance of aligned word and entity. We can observe in the
figure that 1) almost no eye gaze happens after a spoken word, and 2) the number of
word-entity pairs with closer temporal distance is generally larger than the number
of those with farther temporal distance. This is consistent with our modeling of the
temporal alignment probability of word and entity (Equation (6.5)).
6.4 Using Domain Semantic Relatedness for Word Acquisi-
tion
Speech-gaze temporal alignment and occurrence statistics sometimes are not sufficient
to associate words to entities correctly. For example, suppose a user says “there is a
lamp on the dresser” while looking at a lamp object on a table object. Due to their
co—occurring with the lamp object, the words dresser and lamp are both likely to
be associated with the lamp object in the translation models. As a result, the word
dresser is likely to be incorrectly acquired for the lamp object. For the same reason,
the word lamp could be acquired incorrectly for the table object. To solve this type
of association problem, the semantic knowledge about the domain and words can be
helpful. For example, the knowledge that the word lamp is more semantically related
to the object lamp can help the system avoid associating the word dresser to the lamp
object. Therefore, we are interested in investigating the use of semantic knowledge
in word acquisition.
On one hand, each conversational system has a domain model, which is the knowl-
edge representation about its domain such as the types of objects and their properties
and relations. On the other hand, there are available resources about domain inde-
pendent lexical knowledge (e.g., WordNet [28]). The question is whether we can use
the domain model and external lexical knowledge resource to improve word acqui-
sition. To address this question, we link the domain concepts in the domain model
with WordNet concepts, and define semantic relatedness of word and entity to help
the system acquire domain semantically compatible words.
In the following sections, we first describe our domain modeling, then define the
semantic relatedness of word and entity based on domain modeling and WordNet
semantic lexicon, and finally describe different ways of using the semantic relatedness
of word and entity to help word acquisition.
6.4.1 Domain Modeling
We model the 3D room decoration domain as shown in Figure 6.3. The domain model
contains all domain related semantic concepts. These concepts are linked to the
WordNet concepts (i.e., synsets in the format of “word#part-of-speech#sense-id”).
Each of the entities in the domain has one or more properties (e.g., semantic type,
color, size) that are denoted by domain concepts. For example, the entity dresser_1
has domain concepts SEM_DRESSER and COLOR. These domain concepts are linked
to “dresser#n#4” and “color#n#1” in WordNet.
[Figure: domain model in which entities (e.g., dresser_1) have property concepts (e.g., SEM_DRESSER, COLOR, SEM_BED), and these domain concepts are linked to WordNet synsets such as “dresser#n#4” and “color#n#1”.]
Figure 6.3. Domain model with domain concepts linked to WordNet synsets
Note that in the domain model, the domain concepts are not specific to a cer-
tain entity; they are general concepts for a certain type of entity. Multiple entities
of the same type have the same properties and share the same set of domain con-
cepts. Therefore, properties such as color and size of an entity have general concepts
“color#n#1” and “size#n#1” instead of more specific concepts like “yellow#a#1”
and “big#a#1”, so their concepts can be shared by other entities of the same type,
but with different colors and sizes.
6.4.2 Semantic Relatedness of Word and Entity
We compute the semantic relatedness of a word w and an entity e based on the se-
mantic similarity between w and the properties of e. Specifically, semantic relatedness
SR(e, w) is defined as

SR(e, w) = \max_{i,j}\, sim(s(c_e^i), s_j(w))    (6.6)

where c_e^i is the i-th property of entity e, s(c_e^i) is the synset of property c_e^i as designed
in the domain model, s_j(w) is the j-th synset of word w as defined in WordNet, and
sim(\cdot, \cdot) is the similarity score of two synsets.
We computed the similarity score of two synsets based on the path length between
them. The similarity score is inversely proportional to the number of nodes along the
shortest path between the synsets as defined in WordNet. When the two synsets
are the same, they have the maximal similarity score of 1. The WordNet-Similarity
tool [77] was used for the synset similarity computation.
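A sketch of the word-entity semantic relatedness in Equation (6.6) is shown below, using NLTK’s WordNet interface in place of the WordNet-Similarity tool; the particular synsets chosen for the domain concepts are assumptions for illustration.

```python
from nltk.corpus import wordnet as wn   # requires the NLTK 'wordnet' corpus to be downloaded

def synset_similarity(s1, s2):
    """Path-length based similarity; 1.0 when the synsets are identical."""
    sim = s1.path_similarity(s2)
    return sim if sim is not None else 0.0

def semantic_relatedness(entity_concept_synsets, word):
    """Equation (6.6): max similarity between any property synset of e and any synset of w."""
    word_synsets = wn.synsets(word)
    if not word_synsets or not entity_concept_synsets:
        return 0.0
    return max(synset_similarity(s_c, s_w)
               for s_c in entity_concept_synsets for s_w in word_synsets)

# Illustrative domain-model entry for a dresser entity: its property concepts as WordNet synsets.
# (The sense choices here are assumptions, not necessarily those used in the actual domain model.)
dresser_concepts = [wn.synsets("dresser", pos=wn.NOUN)[0], wn.synsets("color", pos=wn.NOUN)[0]]

for w in ("dresser", "lamp", "red"):
    print(w, round(semantic_relatedness(dresser_concepts, w), 3))
```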
6.4.3 Word Acquisition with Word-Entity Semantic Relatedness
We can use the semantic relatedness of word and entity to help the system acquire
semantically compatible words for each entity, and therefore improve word acquisition
performance. The semantic relatedness can be applied for word acquisition in two
ways: post process learned word-entity association probabilities by rescoring them
with semantic relatedness, or directly affect the learning of word-entity associations
by constraining the alignment of word and entity in the translation models.
Rescoring with Semantic Relatedness
In the acquired word list for an entity e_i, each word w_j has an association probability
p(w_j|e_i) that is learned from a translation model. We use the semantic relatedness
SR(e_i, w_j) to redistribute the probability mass for each w_j. The new association
probability is given by:

p'(w_j|e_i) = \frac{p(w_j|e_i)\, SR(e_i, w_j)}{\sum_j p(w_j|e_i)\, SR(e_i, w_j)}    (6.7)
Semantic Alignment Constraint in Translation Model
When used to constrain the word-entity alignment in the translation model, semantic
relatedness can be used alone or used together with speech-gaze temporal information
to decide the alignment probability of word and entity [84].
• Using only semantic relatedness to constrain word-entity alignments in Model-
2s, we have

p(\mathbf{w}|\mathbf{e}) = \prod_{j=1}^{m} \sum_{i=0}^{l} p_s(a_j = i \mid j, \mathbf{e}, \mathbf{w})\, p(w_j|e_i)    (6.8)

where p_s(a_j = i \mid j, \mathbf{e}, \mathbf{w}) is the alignment probability based on semantic related-
ness,

p_s(a_j = i \mid j, \mathbf{e}, \mathbf{w}) = \frac{SR(e_i, w_j)}{\sum_i SR(e_i, w_j)}    (6.9)

• Using semantic relatedness and temporal information to constrain word-entity
alignments in Model-2ts, we have

p(\mathbf{w}|\mathbf{e}) = \prod_{j=1}^{m} \sum_{i=0}^{l} p_{ts}(a_j = i \mid j, \mathbf{e}, \mathbf{w})\, p(w_j|e_i)    (6.10)

where p_{ts}(a_j = i \mid j, \mathbf{e}, \mathbf{w}) is the alignment probability that is decided by both
the temporal relation and the semantic relatedness of e_i and w_j,

p_{ts}(a_j = i \mid j, \mathbf{e}, \mathbf{w}) = \frac{p_s(a_j = i \mid j, \mathbf{e}, \mathbf{w})\, p_t(a_j = i \mid j, \mathbf{e}, \mathbf{w})}{\sum_i p_s(a_j = i \mid j, \mathbf{e}, \mathbf{w})\, p_t(a_j = i \mid j, \mathbf{e}, \mathbf{w})}    (6.11)

where p_s(a_j = i \mid j, \mathbf{e}, \mathbf{w}) is the semantic alignment probability in Equation (6.9),
and p_t(a_j = i \mid j, \mathbf{e}, \mathbf{w}) is the temporal alignment probability given in Equa-
tion (6.5).
EM algorithms are used to estimate p(w|e) in Model-2s and Model-2ts.
6.5 Grounding Words to Domain Concepts
As discussed above, based on translation models, we can incorporate temporal and
domain semantic information to obtain p(w|e). This probability only provides a
means to ground words to entities. In conversational systems, the ultimate goal of
word acquisition is to make the system understand the semantic meaning of new
words. Word acquisition by grounding words to objects is not always sufficient for
identifying their semantic meanings. Suppose the word green is grounded to a green
chair object, so is the word chair. Although the system is aware that green is some
word describing the green chair, it does not know that the word green refers to the
chair’s color while the word chair refers to the chair’s semantic type. Thus, after
learning the word-entity associations p(w|e) by the translation models, we need to
further ground words to domain concepts of entity properties.
We further apply WordNet to ground words to domain concepts. For each entity e,
based on the association probabilities p(w|e), we can choose the n-best words as acquired
words for e. Those n-best words have the n highest association probabilities. For
each word w acquired for e, the grounded concept c_w^* for w is chosen as the one that
has the highest semantic relatedness with w:

c_w^* = \arg\max_{c_e^i} \left[ \max_j sim(s(c_e^i), s_j(w)) \right]    (6.12)

where sim(s(c_e^i), s_j(w)) is the semantic similarity score defined in Equation (6.6).
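A sketch of this concept grounding (Equation (6.12)) is shown below, again using NLTK’s WordNet interface; the concept-to-synset choices are illustrative assumptions rather than the actual domain model.

```python
from nltk.corpus import wordnet as wn   # requires the NLTK 'wordnet' corpus

def ground_word(word, entity_concepts):
    """Equation (6.12): pick the entity property concept most related to the acquired word.

    entity_concepts: {concept_name: wordnet_synset}, e.g. {"SEM_DRESSER": ..., "COLOR": ...}.
    """
    def sim(s1, s2):
        v = s1.path_similarity(s2)
        return v if v is not None else 0.0
    word_synsets = wn.synsets(word)
    if not word_synsets:
        return None
    return max(entity_concepts,
               key=lambda c: max(sim(entity_concepts[c], s) for s in word_synsets))

# Illustrative grounding for words acquired for dresser_1 (the sense choices are assumptions).
concepts = {"SEM_DRESSER": wn.synsets("dresser", pos=wn.NOUN)[0],
            "COLOR": wn.synsets("color", pos=wn.NOUN)[0]}
for w in ("dresser", "vanity", "red"):
    print(w, "->", ground_word(w, concepts))
```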
6.6 Evaluation
To evaluate the acquired words for the entities, we manually compile a set of “gold
standard” words from all users’ speech transcripts and gaze fixations. Those “gold
standard” words are the words that the users have used to refer to the entities and
their properties (e.g., color, size, shape) during the interaction with the system. The
automatically acquired words are evaluated against those “gold standard” words.
6.6.1 Evaluation Metrics
The following metrics are used to evaluate the words acquired for domain concepts
(i.e., entity properties) {c_e^i}.
• Precision

\frac{\sum_e \sum_i \#\text{ words correctly acquired for } c_e^i}{\sum_e \sum_i \#\text{ words acquired for } c_e^i}

• Recall

\frac{\sum_e \sum_i \#\text{ words correctly acquired for } c_e^i}{\sum_e \sum_i \#\text{ “gold standard” words of } c_e^i}

• F-measure

\frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}

The metrics of precision, recall, and F-measure are based on the n-best words
acquired for the entity properties. Therefore, we have different precision, recall, and
F-measure when n changes.
The metrics of precision, recall, and F-measure only provide evaluation on the top
n candidate words. To measure the acquisition performance on the entire ranked list
of candidate words, we define a new metric as follows:
• Mean Reciprocal Rank Rate (MRRR)

MRRR = \frac{\sum_e \left( \sum_{i=1}^{N_e} \frac{1}{index(w_e^i)} \Big/ \sum_{i=1}^{N_e} \frac{1}{i} \right)}{\#e}
where N_e is the number of all ground-truth words {w_e^i} for entity e, and index(w_e^i)
is the index of word w_e^i in the ranked list of candidate words for entity e.
Entities may have a different number of ground-truth words. For each entity e,
we calculate a Reciprocal Rank Rate (RRR), which measures how close the ranks
of the ground-truth words in the candidate word list are to the best scenario where
the top N_e words are the ground-truth words for e. RRR is in the range of (0, 1].
The higher the RRR, the better the word acquisition performance. The average of
the RRRs across all entities gives the Mean Reciprocal Rank Rate (MRRR).
Note that since MRRR is directly based on the learned word-entity associations p(w|e),
it is in fact a measure of grounding words to entities.
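A sketch of the RRR and MRRR computations is given below; `candidate_ranking` is assumed to map each entity to its candidate words ordered by decreasing p(w|e), and `gold` to map each entity to its ground-truth words.

```python
def reciprocal_rank_rate(ranked_words, gold_words):
    """RRR for one entity: sum of 1/rank of the gold words over the best achievable sum."""
    achieved = sum(1.0 / (ranked_words.index(w) + 1) for w in gold_words if w in ranked_words)
    best = sum(1.0 / i for i in range(1, len(gold_words) + 1))
    return achieved / best

def mean_reciprocal_rank_rate(candidate_ranking, gold):
    """MRRR: average RRR across all entities."""
    rrrs = [reciprocal_rank_rate(candidate_ranking[e], gold[e]) for e in gold]
    return sum(rrrs) / len(rrrs)

# Toy usage with two entities (the word lists are illustrative).
ranking = {"dresser_1": ["table", "dresser", "area", "vanity", "desk"],
           "lamp_2": ["lamp", "light", "table"]}
gold = {"dresser_1": ["dresser", "vanity", "desk"], "lamp_2": ["lamp"]}
print(round(mean_reciprocal_rank_rate(ranking, gold), 3))
```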
6.6.2 Evaluation Results
To compare the effects of different speech-gaze alignments on word acquisition, we
evaluate the following models:
• Model-1 — base model I without word-entity alignment (Equation (6.1)).
• Model-2 — base model II with positional alignment (Equation (6.2)).
• Model-2t — enhanced model with temporal alignment (Equation (6.3)).
• Model-2s — enhanced model with semantic alignment (Equation (6.8)).
• Model-2ts — enhanced model with both temporal and semantic alignment (Equa-
tion (6.10)).
To compare the different ways of incorporating semantic relatedness in word ac-
quisition as discussed in Section 6.4.3, we also evaluate the following models:
• Model-1-r — Model-1 with semantic relatedness rescoring of word-entity associ-
ation.
• Model-2t-r — Model-2t with semantic relatedness rescoring of word-entity asso-
ciation.
Figures 6.4, 6.5, and 6.6 compare the results of models with different speech-gaze
alignments and models with semantic relatedness rescoring. In the figures, n-best
means the top n word candidates are chosen as acquired words for each entity. The
Mean Reciprocal Rank Rates of all models are compared in Figure 6.7.
Results of Using Different Speech-Gaze Alignments
As shown in Figures 6.4(a), 6.5(a), and 6.6(a), Model-2 does not show a consistent
improvement compared to Model-1 when a different number of n-best words are
chosen as acquired words. This result shows that it is not very helpful to consider
the index-based positional alignment of word and entity for word acquisition.
Figures 6.4(a), 6.5(a), and 6.6(a) also show that models considering temporal
and/or semantic information (Model-2t, Model-2s, Model-2ts) consistently per-
form better than the models considering neither temporal nor semantic information
(Model-1, Model-2). Among Model-2t, Model-2s, and Model-2ts, it is found that they
do not make consistent differences.
As shown in Figure 6.7, the MRRRs of the different models are consistent with their
performances on F-measure. A t-test has shown that the difference between the
MRRRs of Model-1 and Model-2 is not statistically significant. Compared to Model-
1, t-tests have confirmed that MRRR is significantly improved by Model-2t (t =
2.29, p < 0.016), Model-2s (t = 3.40, p < 0.002), and Model-2ts (t = 3.12, p < 0.003).
T-tests have shown no significant differences among Model-2t, Model-2s, and Model-
2ts.
[Plot (a): precision vs. n-best for Model-1, Model-2, Model-2t, Model-2s, and Model-2ts.]
(a) Precision of word acquisition when different speech-gaze alignments
are applied
[Plot (b): precision vs. n-best for Model-1, Model-1-r, Model-2t, and Model-2t-r.]
(b) Precision of word acquisition when semantic relatedness rescoring
of word-entity association is applied
Figure 6.4. Precision of word acquisition
[Plot (a): recall vs. n-best for Model-1, Model-2, Model-2t, Model-2s, and Model-2ts.]
(a) Recall of word acquisition when different speech-gaze alignments are
applied
[Plot (b): recall vs. n-best for Model-1, Model-1-r, Model-2t, and Model-2t-r.]
(b) Recall of word acquisition when semantic relatedness rescoring of
word-entity association is applied
Figure 6.5. Recall of word acquisition
[Plot (a): F-measure vs. n-best for Model-1, Model-2, Model-2t, Model-2s, and Model-2ts.]
(a) F-measure of word acquisition when different speech-gaze alignments
are applied
[Plot (b): F-measure vs. n-best for Model-1, Model-1-r, Model-2t, and Model-2t-r.]
(b) F-measure of word acquisition when semantic relatedness rescoring
of word-entity association is applied
Figure 6.6. F-measure of word acquisition
[Bar chart: Mean Reciprocal Rank Rate for M-1, M-2, M-2t, M-2s, M-2ts, and M-2t-r.]
Figure 6.7. MRRRs achieved by different models
Results of Applying Semantic Relatedness Rescoring
Figures 6.4(b), 6.5(b), and 6.6(b) show that semantic relatedness rescoring improves
word acquisition. After semantic relatedness rescoring of the word-entity associations
learned by Model-1, Model-1-r improves the F-measure consistently when a different
number of n-best words are chosen as acquired words. Compared to Model-2t, Model-
2t-r also improves the F-measure consistently.
Comparing the two ways of using semantic relatedness for word acquisition, it is
found that rescoring word-entity association with semantic relatedness works better.
When semantic relatedness is used together with temporal information to constrain
word-entity alignments in Model-2ts, word acquisition performance is not improved
compared to Model-2t. However, using semantic relatedness to rescore the word-entity
association learned by Model-2t, Model-2t-r further improves word acquisition.
As shown in Figure 6.7, the MRRRs of Model-1-r and Model-2t-r are consistent
with their performances on F-measure. Compared to Model-2t, Model-2t-r improves
MRRR. A t-test has confirmed that this is a significant improvement (t = 1.96, p <
0.031). Compared to Model-1, Model-1-r significantly improves MRRR (t = 2.33, p <
0.015). There is no significant difference between Model-1-r and Model-2t/Model-
2s/Model-2ts.
In Figure 6.5, we notice that the recall of the acquired words is still relatively
low even when the 10 best word candidates are chosen for each entity. This is mainly
due to the scarcity of those words that are not acquired in the data. Many of the
words that are not acquired appear less than 3 times in the data, which makes them
unlikely to be associated with any entity by the translation models. When more data
is available, we expect to see higher recall.
6.6.3 An Example

Table 6.1 shows the 5-best words acquired by different models for the entity dresser_1
in the 3D room scene. In the table, each word is followed by its word-entity association
probability p(w|e). The correctly acquired words are shown in bold font.
Rank   Model-1            Model-2t           Model-2t-r
1      table (0.173)      table (0.196)      table (0.294)
2      dresser (0.067)    dresser (0.101)    dresser (0.291)
3      area (0.058)       area (0.056)       vanity (0.147)
4      picture (0.053)    vanity (0.051)     desk (0.038)
5      dressing (0.041)   dressing (0.050)   area (0.024)

Table 6.1. N-best candidate words acquired for the entity dresser_1 by different models
As shown in the example, the baseline Model-1 learned 2 correct words in the 5-best
list. Considering speech-gaze temporal information, Model-2t learned one more
correct word, vanity, in the 5-best list. With semantic relatedness rescoring, Model-2t-r
further acquired the word desk in the 5-best list because of the high semantic relatedness
between the word desk and the type of the entity dresser_1. Although neither Model-1 nor Model-2t
successfully acquired the word desk in the 5-best list, the rank (7) of the word desk
in Model-2t's n-best list is much higher than its rank (21) in Model-1's n-best list.
6.7 Summary
This chapter investigates the use of eye gaze for automatic word acquisition in multimodal
conversational systems. In particular, we investigate the use of speech-gaze
temporal information and word-entity semantic relatedness to facilitate word acquisition.
The experiments show that word acquisition is significantly improved when
temporal information is considered, which is consistent with previous psycholinguistic
findings about speech and eye gaze. Moreover, using temporal information
together with semantic relatedness rescoring further improves word acquisition.
CHAPTER 7
Incorporation of Interactivity with Eye
Gaze for Automatic Word Acquisition
In the previous chapter, we described the use of speech-gaze temporal information
and domain semantic relatedness for automatically acquiring words from the user's
speech and its accompanying gaze fixations. Successful word acquisition relies on the
tight link between what the user says and what the user sees. Although published
studies provide us with a sound empirical basis for assuming that eye movements are
predictive of speech, gaze behavior in an interactive setting can be much more
complex. There are different types of eye movements [50]. The naturally occurring eye
gaze during speech production may serve different functions, for example, to engage in
the conversation or to manage turn taking [70]. Furthermore, while interacting with
a graphic display, a user could be talking about objects that were previously seen
on the display or about something completely unrelated to any object the user is looking
at. Therefore, using all the speech-gaze pairs for word acquisition can be detrimental.
The type of gaze that is most useful for word acquisition is the kind that reflects
the underlying attention and is tightly linked to the content of the co-occurring spoken
utterances. Thus, one important question is how to identify the closely coupled speech
and gaze streams to improve word acquisition.
To address this question, in this chapter, we develop an approach that incorporates
interactivity (e.g., user activity, conversation context) with eye gaze to identify the
closely coupled speech and gaze streams. We further use the identified speech and
gaze streams for word acquisition. Our studies indicate that automatic identification
of closely coupled gaze-speech stream pairs is an important first step that leads to
performance gains in word acquisition. Our simulation studies further demonstrate
the effect of automatic online word acquisition on improving language understanding
in human-machine conversation.

In the following sections, we first describe the data collection in a new 3D interactive
domain, then present the automatic identification of the closely coupled
gaze-speech pairs and its effect on word acquisition. The last part of this chapter
presents a simulation study that exemplifies how word acquisition can be automatically
achieved and how the acquired words affect language interpretation during online
conversation.
7.1 Data Collection
We recruited 20 users to interact with our speech-gaze system to collect data.
7.1.1 Domain
We used the 3D treasure hunting domain (see Section 3.3.2) for the investigation of
automatic word acquisition in multimodal conversation. In this application, the user
needs to consult with a remote "expert" (i.e., an artificial system) to find hidden
treasures in a castle containing 115 3D objects. The expert has some knowledge about the
treasures but cannot see the castle. The user has to talk to the expert for advice
on finding the treasures. The application is developed based on a game engine and
provides an immersive environment for the user to navigate in the 3D space. A
detailed description of the user study is given in Appendix A.3.
During the experiment, the user's speech was recorded, and the user's eye gaze was
captured by a Tobii eye tracker. Figure 7.1 shows a snapshot of one user's experiment.

Figure 7.1. A snapshot of one user's experiment (the dot on the stereo indicates the user's
gaze fixation, which was not shown to the user during the experiment)
It is worthwhile to note that the collected data set is different from the data set
used for the investigation in Chapter 6. The difference lies in two aspects: 1) the data
for this investigation was collected during mixed-initiative human-machine conversation,
whereas the data in Chapter 6 was based only on question answering; 2) the
user studies for this investigation were conducted in a more complex domain, which
resulted in a richer data set with a larger vocabulary.
7.1.2 Data Preprocessing
From the 20 users' experiments, we collected 3709 utterances with accompanying gaze
fixations. We transcribed the collected speech. The vocabulary size of the speech
transcript is 1082, among which 227 words are nouns and adjectives. The user's
speech was also automatically recognized online by the Microsoft speech recognizer
with a word error rate (WER) of 48.1% for the 1-best recognition. The vocabulary size
of the 1-best speech recognition is 3041, among which 1643 are nouns and adjectives.

The collected speech and gaze streams are automatically paired together by the
system. Each time the system detects a sentence boundary in the user's speech, it
pairs the recognized speech with the gaze fixations that the system has accumulated
since the previously detected sentence boundary. Given the paired speech and
gaze streams, we build a parallel data set of word sequences and gaze-fixated entity
sequences {(w, e)} for the task of word acquisition. For the gaze stream, e contains
all the gaze-fixated entities. For the speech stream, we can build w from either the speech
transcript or the 1-best speech recognition. The resulting word sequence w contains
all the nouns and adjectives in the transcript or the 1-best recognition.
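As a rough illustration of this pairing step, the following is a minimal sketch rather than the system's actual implementation; the Fixation and Utterance structures, the POS tags, and the timestamps are assumptions made for the example.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Fixation:               # one gaze fixation (hypothetical structure)
        entity_id: str            # id of the fixated entity
        start: float              # fixation onset time (seconds)
        end: float                # fixation offset time (seconds)

    @dataclass
    class Utterance:              # one recognized/transcribed utterance (hypothetical)
        words: List[Tuple[str, str]]   # (word, POS tag) pairs
        start: float                   # utterance onset time (seconds)
        end: float                     # detected sentence-boundary time (seconds)

    CONTENT_TAGS = {"NN", "NNS", "JJ"}     # nouns and adjectives

    def build_parallel_instance(utt: Utterance, fixations: List[Fixation]):
        """Pair one utterance with the gaze fixations accumulated over its span:
        w keeps only nouns/adjectives, e keeps the fixated entity ids."""
        w = [word for word, tag in utt.words if tag in CONTENT_TAGS]
        e = [f.entity_id for f in fixations if f.end >= utt.start and f.start <= utt.end]
        return w, e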
7.2 Identification of Closely Coupled Gaze-Speech Pairs
As mentioned earlier, not all gaze-speech pairs are useful for word acquisition. In
a gaze-speech pair, if the speech does not contain any word that relates to any of the
gaze-fixated entities, the instance only adds noise to word acquisition. Therefore,
we should identify the closely coupled gaze-speech pairs and use only them for word
acquisition.

In this section, we first describe the feature extraction and then describe the use
of a logistic regression classifier to predict whether a gaze-speech pair is a closely
coupled gaze-speech instance, i.e., an instance where at least one noun or adjective in
the speech stream refers to some gaze-fixated entity in the gaze stream. To train
the classifier for gaze-speech prediction, we manually labeled whether each instance
is a closely coupled gaze-speech instance based on the speech transcript
and gaze fixations.
7.2.1 Feature Extraction
For each parallel gaze-speech instance, the following sets of features are automatically
extracted; a feature-extraction sketch is given at the end of this subsection.
SPEECH FEATURES (S-FEAT)

Let c_w be the count of nouns and adjectives in the utterance, and l_s be the temporal
length of the speech. The following features are extracted from speech:

• c_w: count of nouns and adjectives. More nouns and adjectives are expected in the user's utterances describing entities.

• c_w / l_s: normalized noun/adjective count. The effect of the speech length l_s on c_w is considered.
GAZE FEATURES (G-FEAT)

For each fixated entity e_i, let l_g^i be its fixation temporal length. Note that several
gaze fixations may have the same fixated entity; l_g^i is the total length of all the gaze
fixations on entity e_i. We extract the following features from the gaze stream:

• c_e: count of different gaze-fixated entities. Fewer fixated entities are expected when the user is describing entities while looking at them.

• c_e / l_s: normalized entity count. The effect of the speech temporal length l_s on c_e is considered.

• max_i(l_g^i): maximal fixation length. At least one fixated entity's fixation is expected to be long when the user is describing entities while looking at them.

• mean_i(l_g^i): average fixation length. The average gaze fixation length is expected to be longer when the user is describing entities while looking at them.

• var_i(l_g^i): variance of fixation lengths. The variance of the fixation lengths is expected to be smaller when the user is describing entities while looking at them.

The number of gaze-fixated entities is not only decided by the user's eye gaze; it
is also affected by the visual scene. Let c_s be the count of all the entities that have
been visible during the length of the gaze stream. We also extract the following scene-related
feature:

• c_e / c_s: scene-normalized fixated entity count. The effect of the visual scene on c_e is considered.
USER ACTIVITY FEATURES (UA-FEAT)

While interacting with the system, the user's activity can also be helpful in determining
whether the user's eye gaze is tightly linked to the content of the speech. The
following features are extracted from the user's activities:

• maximal distance of the user's movements: the maximal change of the user's position (3D coordinates) during the speech length. The user is expected to move within a smaller range while looking at entities and describing them.

• variance of the user's positions: the user is expected to move less frequently while looking at entities and describing them.
CONVERSATION CONTEXT FEATURES (CC-FEAT)

While talking to the system (i.e., the "expert"), the user's language and gaze behavior
are influenced by the state of the conversation. For each gaze-speech instance, we use
the previous system response type as a nominal feature to predict whether this is a
closely coupled gaze-speech instance.

In our treasure hunting domain, there are 8 types of system responses in 2 categories:

System Initiative Responses:

• specific-see: the system asks whether the user sees a certain entity, e.g., "Do you see another couch?".

• nonspecific-see: the system asks whether the user sees anything, e.g., "Do you see anything else?", "Tell me what you see".

• previous-see: the system asks whether the user has previously seen something, e.g., "Have you previously seen a similar object?".

• describe: the system asks the user to describe in detail what the user sees, e.g., "Describe it", "Tell me more about it".

• compare: the system asks the user to compare what the user sees, e.g., "Compare these objects".

• clarify: the system asks the user to make a clarification, e.g., "I did not understand that", "Please repeat that".

• action-request: the system asks the user to take an action, e.g., "Go back", "Try moving it".

User Initiative Responses:

• misc: the system hands the initiative back to the user without specifying further requirements, e.g., "I don't know", "Yes".
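The following is a minimal sketch of how these feature sets might be computed for one gaze-speech instance; the pre-aggregated input formats (content words, per-entity fixation lengths, sampled 3D user positions, previous response type) are assumptions made for illustration, not the thesis implementation.

    import math
    import statistics

    def extract_features(content_words, speech_len, fixation_lengths,
                         visible_entity_count, positions, prev_response_type):
        """Compute S-Feat, G-Feat, UA-Feat, and CC-Feat for one instance.
        `fixation_lengths` maps entity id -> total fixation time (seconds);
        `positions` is a list of (x, y, z) user positions sampled during the
        utterance; `speech_len` is the utterance length in seconds."""
        feats = {}

        # S-Feat: noun/adjective count and its speech-length-normalized version
        c_w = len(content_words)
        feats["c_w"] = c_w
        feats["c_w_norm"] = c_w / speech_len if speech_len > 0 else 0.0

        # G-Feat: statistics over per-entity fixation lengths
        lengths = list(fixation_lengths.values())
        c_e = len(lengths)
        feats["c_e"] = c_e
        feats["c_e_norm"] = c_e / speech_len if speech_len > 0 else 0.0
        feats["max_fix"] = max(lengths, default=0.0)
        feats["mean_fix"] = statistics.mean(lengths) if lengths else 0.0
        feats["var_fix"] = statistics.pvariance(lengths) if lengths else 0.0
        feats["c_e_scene"] = c_e / visible_entity_count if visible_entity_count else 0.0

        # UA-Feat: maximal displacement and positional variance during the utterance
        dists = [math.dist(p, q)
                 for idx, p in enumerate(positions) for q in positions[idx + 1:]]
        feats["max_move"] = max(dists, default=0.0)
        feats["pos_var"] = (sum(statistics.pvariance(axis) for axis in zip(*positions))
                            if positions else 0.0)

        # CC-Feat: previous system response type as a nominal feature
        feats["prev_response"] = prev_response_type
        return feats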
7.2.2 Logistic Regression Model

Given the extracted features x and the "closely coupled" label y of each instance in
the training set, we train a ridge logistic regression model [60] to predict whether an
instance is a closely coupled instance (y = 1) or not (y = 0).

In the logistic regression model, the probability that y_i = 1, given the feature
vector x_i = (x_1^i, x_2^i, ..., x_n^i), is modeled by

    p(y_i = 1 | x_i) = \frac{\exp(\sum_{j=1}^{n} \beta_j x_j^i)}{1 + \exp(\sum_{j=1}^{n} \beta_j x_j^i)}

where the \beta_j are the feature weights to be learned.

The log-likelihood l of the data (X, y) is

    l(\beta) = \sum_i [ y_i \log p(y_i | x_i) + (1 - y_i) \log(1 - p(y_i | x_i)) ]

In ridge logistic regression, the parameters \beta_j are estimated by maximizing a regularized
log-likelihood

    l_r(\beta) = l(\beta) - \lambda \|\beta\|^2

where \lambda is the ridge parameter, introduced to achieve more stable parameter
estimation.
We used the Weka toolkit [115] for the training of the ridge logistic regression
model.
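As a rough stand-in for that Weka setup, the sketch below fits an L2-regularized (ridge-style) logistic regression with scikit-learn over per-instance feature dictionaries such as those produced by the extract_features sketch above; the regularization strength C is an illustrative choice, not the thesis's setting.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict
    from sklearn.pipeline import make_pipeline

    def gaze_speech_classifier(C=1.0):
        """L2-penalized logistic regression over feature dicts; DictVectorizer
        one-hot encodes the nominal CC-Feat (previous response type)."""
        return make_pipeline(
            DictVectorizer(sparse=False),
            LogisticRegression(penalty="l2", C=C, max_iter=1000),
        )

    # Example: 10-fold cross-validated predictions over labeled instances,
    # mirroring the evaluation in the next section.
    # preds = cross_val_predict(gaze_speech_classifier(), feature_dicts, labels, cv=10)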
7.3 Evaluation of Gaze-Speech Identification
We evaluate gaze-speech identification on the instances with 1-best speech recognition.
Since the goal of identifying closely coupled gaze-speech instances is to improve
word acquisition and we are only interested in acquiring nouns and adjectives, only
the instances with recognized nouns/adjectives are used for training the logistic regression
classifier. Among the 2969 instances with recognized nouns/adjectives and
gaze fixations, 2002 (67.4%) are labeled as closely coupled. The gaze-speech
prediction was evaluated by 10-fold cross validation.
Table 7.1 shows the prediction precision and recall when different sets of features
are used. As seen in the table, as more features are used, the prediction precision
goes up and the recall goes down. It is important to note that prediction precision is
more critical than recall for word acquisition when a sufficient amount of data is available.
Noisy instances where the gaze does not link to the speech content will only hurt word
acquisition, since they will guide the translation models to ground words to the wrong
entities. Although higher recall can be helpful, its effect is expected to diminish
as more data becomes available.
Feature sets                                 Precision   Recall
Null (baseline)                              0.674       1
S-Feat                                       0.686       0.995
G-Feat                                       0.707       0.958
UA-Feat                                      0.704       0.942
CC-Feat                                      0.688       0.936
G-Feat + UA-Feat                             0.719       0.948
G-Feat + UA-Feat + S-Feat                    0.741       0.908
G-Feat + UA-Feat + CC-Feat                   0.731       0.918
G-Feat + UA-Feat + S-Feat + CC-Feat          0.748       0.899

Table 7.1. Gaze-speech prediction performance with different feature sets for the instances
with 1-best speech recognition
The results show that speech features (S-Feat) and conversation context features
(CC-Feat), when used alone, do not improve prediction precision much compared to
the baseline of predicting all instances as "closely coupled", which has a precision of 67.4%.
When used alone, gaze features (G-Feat) and user activity features (UA-Feat) are
the two most useful feature sets for increasing prediction precision. When they are
used together, the prediction precision is further increased. Adding either speech
features or conversation context features to the gaze and user activity features (G-Feat
+ UA-Feat + S-Feat/CC-Feat) increases the prediction precision further. Using all
four sets of features (G-Feat + UA-Feat + S-Feat + CC-Feat) achieves the highest
prediction precision, which is significantly better than the baseline (z = 5.93, p <
0.001). Therefore, we choose to use all feature sets to identify the closely coupled
gaze-speech instances for word acquisition.
To compare the effect of the identified closely coupled gaze-speech instances on
word acquisition from different speech input (1-best speech recognition, speech transcript),
we also use the logistic regression classifier with all features to predict closely
coupled gaze-speech instances for the instances with speech transcript. For the instances
with speech transcript, there are 2948 instances with nouns/adjectives and
gaze fixations, 2128 (72.2%) of which are labeled as closely coupled. The prediction
precision is 77.9% and the recall is 93.8%. The prediction precision is significantly
better than the baseline of predicting all instances as coupled (z = 4.92, p < 0.001).
7.4 Evaluation of Word Acquisition
In Chapter 6, we have shown that Model-2t-r (Section 6.4), which incorporates the temporal
alignment between speech and eye gaze as well as domain semantic relatedness,
achieves significantly better word acquisition performance. Therefore, this model is
used for word acquisition in this investigation. The words acquired by Model-2t-r
are evaluated against the "gold standard" words that we manually compiled for each
entity and its properties based on all users' speech transcripts and gaze fixations.
Those "gold standard" words are the words that the users have used to describe the
entities and their properties during the interaction with the system.
7.4.1 Evaluation Metrics

We evaluate the n-best acquired words on the following metrics (a computation sketch follows the list):

• Precision

• Recall

• F-measure

When a different n is chosen, we will have different precision, recall, and F-measure values.
We also evaluate the whole ranked candidate word list on:

• Mean Reciprocal Rank Rate (MRRR) (see Section 6.6.1)
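A minimal sketch of the n-best evaluation follows; the dictionary-based layout of acquired and gold-standard words is an assumption for illustration, and this sketch micro-averages across entities (the thesis does not spell out the averaging here). MRRR itself is defined in Section 6.6.1 and is not re-derived.

    def nbest_metrics(acquired, gold, n):
        """Precision/recall/F-measure of the n-best acquired words.
        `acquired` maps entity -> candidate words ranked by p(w|e);
        `gold` maps entity -> set of gold-standard words for that entity."""
        correct = retrieved = relevant = 0
        for entity, gold_words in gold.items():
            nbest = acquired.get(entity, [])[:n]
            correct += sum(1 for w in nbest if w in gold_words)
            retrieved += len(nbest)
            relevant += len(gold_words)
        precision = correct / retrieved if retrieved else 0.0
        recall = correct / relevant if relevant else 0.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall > 0 else 0.0)
        return precision, recall, f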
7.4.2 Evaluation Results
We evaluate the effect of the closely coupled gaze-speech instances on word acquisition
from the 1-best speech recognition. To show the influence of speech recognition quality
on word acquisition performance, we also evaluate word acquisition from the speech
transcript. The predicted closely coupled gaze-speech instances in the evaluations are
generated by 10-fold cross validation with the logistic regression classifier.

Figures 7.2-7.7 show the precision, recall, and F-measure of the n-best words
acquired by Model-2t-r using all instances (all), only predicted closely coupled instances
(predicted), and true (manually labeled) closely coupled instances (true). In
Figures 7.2-7.4, the acquired words come from the 1-best speech recognition of the
users' utterances. In Figures 7.5-7.7, the acquired words come from the transcripts
of the users' utterances.

Figure 7.8 compares the MRRRs achieved by Model-2t-r using different sets of
instances (all instances, predicted closely coupled instances, true closely coupled instances)
with different speech input (1-best speech recognition, speech transcript).
Results of Word Acquisition on 1-best Speech Recognition

As shown in Figure 7.4, using predicted instances achieves consistently better performance
than using all instances, except in the case where only the 3-best word candidates
are evaluated. These results show that the prediction of closely coupled gaze-speech
instances helps word acquisition. When the true closely coupled gaze-speech instances
are used for word acquisition, the word acquisition performance is further improved.
This means that higher gaze-speech prediction precision leads to better word acquisition
performance.

Figure 7.2. Precision of word acquisition on 1-best speech recognition with Model-2t-r

Figure 7.3. Recall of word acquisition on 1-best speech recognition with Model-2t-r

We notice that using all instances actually achieves a higher F-measure than using
predicted instances for the 3-best word candidates. This is because there are a few
"gold standard" words that do not appear in the predicted gaze-speech instances due
to the scarcity of these words in the whole data set.

Figure 7.4. F-measure of word acquisition on 1-best speech recognition with Model-2t-r

In the word acquisition with all
instances, these words will not appear in the 10-best list if word acquisition is based
only on co-occurrence statistics (as in Model-1). In Model-2t-r, with domain semantic
relatedness rescoring, these words are boosted up into the 3-best list. However, this cannot
happen in the word acquisition with the predicted gaze-speech instances because
the predicted instances do not contain these few words, and therefore it is impossible to
acquire them. Hence, for Model-2t-r, using all instances happens to outperform
using predicted instances when only the 3-best word candidates are evaluated. We
believe this will not happen when a fairly large amount of data is available for word
acquisition.
As shown in Figure 7.8, the MRRRs achieved by Model-2t-r using different sets of
instances with the 1-best speech recognition are consistent with their performances
on F-measure. Using predicted instances results in a significantly better MRRR than
using all instances (t = 1.89, p < 0.031).
119
== +all
0.55‘ +predicted .
, —£B—true
05
§ 0.45
.2
8
g 0.4
0.35-
0.3»
02512 3 4 5 6 7 8 910
n-best
Figure 7 .5. Precision of word acquisition on speech transcript with Model-2t-r
Figure 7.6. Recall of word acquisition on speech transcript with Model-2t-r
Results of Word Acquisition on Speech Transcript
For the word acquisition on speech transcript, as shown in Figure 7.7, using predicted
closely coupled instances results in a better F-measure than using all instances.

Figure 7.7. F-measure of word acquisition on speech transcript with Model-2t-r

Figure 7.8. MRRRs achieved by Model-2t-r with different sets of instances (all, predicted, true) on 1-best speech recognition and speech transcript

When
the true closely coupled instances are used for word acquisition, the F-measure is
further improved.
As shown in Figure 7.8, consistent with its F-measure performance, using predicted
instances results in a significantly better MRRR than using all instances (t = 2.66, p < 0.005).
The quality of speech recognition is critical to word acquisition performance. Figure 7.8
also compares the word acquisition performance on the 1-best speech recognition
and the speech transcript. As expected, the word acquisition performance on the speech
transcript is much better than on the 1-best speech recognition. This result shows that
better speech recognition will lead to better word acquisition.
7.5 The Effect of Word Acquisition on Language Understanding
One important goal of word acquisition is to use the acquired new words to help language
understanding in subsequent conversation. To demonstrate the effect of online
word acquisition on language understanding, we conduct simulation studies based on
our collected data. In these simulations, the system starts with an initial knowledge
base, a vocabulary of words associated with domain concepts. The system continuously
enhances its knowledge base by acquiring words from users with Model-2t-r
(Section 6.4), which incorporates both speech-gaze temporal information and domain
semantic relatedness. The enhanced knowledge base is used to understand the language
of new users.

We evaluate language understanding performance by the concept identification rate
(CIR):

    CIR = \frac{\#\,\text{correctly identified concepts in the 1-best speech recognition}}{\#\,\text{concepts in the speech transcript}}
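As a small illustration of this metric, the sketch below computes CIR over a set of utterances; the concept-identification helper is a hypothetical placeholder for the system's vocabulary-based concept identifier, and exact matching of concept multisets is an assumption.

    from collections import Counter

    def concept_identification_rate(utterances, identify_concepts):
        """CIR = correctly identified concepts in the 1-best recognition divided
        by concepts in the transcript. `utterances` is a list of
        (recognized_text, transcript_text) pairs; `identify_concepts` maps a text
        string to a list of concepts using the current vocabulary (hypothetical)."""
        correct = total = 0
        for recognized, transcript in utterances:
            gold = Counter(identify_concepts(transcript))
            hyp = Counter(identify_concepts(recognized))
            correct += sum((gold & hyp).values())   # concepts found in both
            total += sum(gold.values())
        return correct / total if total else 0.0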
We simulate the process of online word acquisition and evaluate its effect on
language understanding for two situations: 1) the system starts with no training data
but with a small initial vocabulary, and 2) the system starts with some training data.
7.5.1 Simulation 1: When the System Starts with No Training Data
To build conversational systems, one approach is for domain experts to provide a domain
vocabulary to the system at design time. Our first simulation follows this practice.
The system is provided with a default vocabulary and starts with no training data. The
default vocabulary contains one "seed" word for each domain concept.

Using the collected data of 20 users, the simulation process goes through the
following steps (sketched in code below):

• For user index i = 1, 2, ..., 20:

  - Evaluate the CIR of the i-th user's utterances (1-best speech recognition) with the current system vocabulary.

  - Acquire words from all the instances (with 1-best speech recognition) of users 1 through i.

  - Among the 10-best acquired words, add verified new words to the system vocabulary.
In the above process, the language understanding performance on each individual
user depends on the user's own language as well as the user's position in the user
sequence. To reduce the effect of user ordering on language understanding performance,
the above simulation process is repeated 500 times with randomly ordered
users. The average of the CIRs over these simulations is shown in Figure 7.9.
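A minimal sketch of this simulation loop is given below; evaluate_cir, acquire_words, and verify_new_words are hypothetical callables standing in for the CIR evaluator, the Model-2t-r word acquirer, and the word verification step, and the per-user data records are an assumed format.

    import random

    def simulate_no_training_data(users, seed_vocabulary,
                                  evaluate_cir, acquire_words, verify_new_words,
                                  runs=500):
        """Average CIR at each user position over `runs` random user orderings."""
        cir_sums = [0.0] * len(users)
        for _ in range(runs):
            order = users[:]
            random.shuffle(order)
            vocabulary = set(seed_vocabulary)   # one "seed" word per domain concept
            seen_instances = []
            for i, user in enumerate(order):
                # 1) CIR of this user's 1-best recognition with the current vocabulary
                cir_sums[i] += evaluate_cir(user, vocabulary)
                # 2) acquire words from all instances of users 1..i (1-best recognition)
                seen_instances.extend(user["instances"])
                candidates = acquire_words(seen_instances, n_best=10)
                # 3) add verified new words to the system vocabulary
                vocabulary.update(verify_new_words(candidates, vocabulary))
        return [s / runs for s in cir_sums]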
Figure 7.9 also shows the CIRs when the system has a static knowledge base
(vocabulary). That curve is drawn in the same way as the curve for the dynamic
knowledge base, except without word acquisition in the random simulation processes.
As we can see in the figure, when the system does not have word acquisition capability,
its language understanding performance does not change after more users have
communicated with the system. With the capability of automatic word acquisition, the
system's language understanding performance becomes better after more users have
talked to the system.

Figure 7.9. CIR of user language achieved by the system starting with no training data (dynamic vs. static knowledge base)
7.5.2 Simulation 2: When the System Starts with Training Data
Many conversational systems use real user data to derive the domain vocabulary. To
follow this practice, the second simulation provides the system with some training
data. The training data serves two purposes: 1) to build an initial vocabulary for the
system; 2) to train a classifier to predict the closely coupled gaze-speech instances in
new users' data.

Using the collected data of 20 users, the simulation process goes through the
following steps:

• Using the first m users' data as training data, acquire words from the training instances (with speech transcript); add the verified 10-best words to the system's vocabulary as "seed" words; and build a classifier with the training data for prediction of closely coupled gaze-speech instances.

• Evaluate the effect of incremental word acquisition on the CIR of the remaining (20 - m) users' data. For user index i = 1, 2, ..., (20 - m):

  - Evaluate the CIR of the i-th user's utterances (1-best speech recognition).

  - Predict the coupled gaze-speech instances of the i-th user's data.

  - Acquire words from the m training users' true coupled instances (with speech transcript) and the predicted coupled instances (with 1-best speech recognition) of users 1 through i.

  - Among the 10-best acquired words, add verified new words to the system vocabulary.
The above simulation process is repeated 500 times with randomly ordered users
to reduce the effect of user ordering on the language understanding performance.
Figure 7.10 shows the averaged language understanding performance of these random
simulations.
The language understanding performance of the system with a static knowledge
base is also shown in Figure 7.10. The curve is drawn by the same random simulations
without the steps of word acquisition. We can observe a general trend in the figure
that, with word acquisition, the system’s language understanding becomes better after
more users have communicated to the system. Without word acquisition capability,
the system’s language understanding performance does not increase after more users
have conversed with the system.
The simulations show that automatic vocabulary acquisition is beneficial to the
system's language understanding performance when training data is available. When
training data is not available, vocabulary acquisition could be even more important and
beneficial for robust language understanding.

Figure 7.10. CIR of user language achieved by the system starting with 10 users' training data (dynamic vs. static knowledge base)

It is worth mentioning that the results
shown here are based on the 1-best recognized speech hypotheses with a relatively
high WER (48.1%). With better speech recognition, we expect to have better concept
identification results.
7.6 Summary
This chapter investigates the automatic identification of closely coupled gaze-speech
instances and its application to automatic word acquisition in multimodal conversational
systems. In particular, this chapter explores the use of features extracted
from speech, eye gaze, user interaction activities, and conversation context for predicting
whether the user's naturally occurring eye gaze links to the content of the
user's speech.

This chapter also investigates the application of the identified closely coupled
gaze-speech instances to word acquisition. The gaze-speech prediction and its effect on
word acquisition are evaluated on the 1-best speech recognition and the speech transcript.
The experiments demonstrate that the automatic identification of closely coupled
gaze-speech instances significantly improves word acquisition, whether the words
are acquired from the 1-best speech recognition or from the speech transcript.

Moreover, this chapter demonstrates that, during the multimodal conversation process,
a system with word acquisition capability is able to understand the user's language
better after more users have communicated with the system.
CHAPTER 8
Conclusions
8.1 Contributions
In this thesis, we present our work on using non-verbal modalities for human language
interpretation in multimodal conversational systems. Particularly, we present a joint
solution to the problems of unreliable speech input and unexpected speech input in
multimodal conversational systems, which includes two aspects: 1) use deictic gesture
and eye gaze to improve speech recognition and understanding, and 2) use eye gaze to
acquire new words automatically during multimodal conversation. Our evaluations
have demonstrated the promise of incorporating non-verbal modalities to help speech
recognition and language understanding during multimodal conversation.
Specific contributions of this thesis include:
• Systematic investigation of incorporating deictic gesture and eye gaze to improve
speech recognition hypotheses for spoken language understanding. We have developed
salience driven approaches to incorporate the domain context activated
by gesture/gaze in speech recognition. The gesture/gaze-based salience driven
language models are used in different stages of speech recognition to improve
recognition hypotheses. Experimental results show that, by using non-verbal
salience driven language models, the word error rate of speech recognition is
decreased by 6.7% and the concept identification F-measure is increased by
4.2%.
• Systematic investigation of using deictic gesture to improve spoken language
understanding in multimodal interpretation. We have developed model-based
and instance-based approaches to incorporate gestural information in language
understanding. Experimental results have shown that the accuracy of intention
recognition in language understanding is increased by 6% to 6.6% by different
approaches that incorporate gestural information. We further analyze the
implications of these results for building practical conversational systems.
• Systematic investigation of using eye gaze for automatic word acquisition in
multimodal conversation. We have developed word acquisition models that
incorporate speech-gaze temporal information and domain semantic relatedness
to improve word acquisition. By using the temporal and semantic information,
the mean reciprocal rank rate (MRRR) of word acquisition is increased by
43.2% in our experiment. To further improve word acquisition performance,
we build a classifier based on user interactivity to pick out "useful" speech-gaze
instances before word acquisition, which results in a further increase of
MRRR by 3.6%. Our simulation studies have shown that automatic online
word acquisition improves the system's language understanding performance.
• A multimodal conversational system supporting speech, deictic gesture, and
eye gaze, developed for 3D domains. Integrating techniques from speech recognition,
eye tracking, and computer graphics, we have implemented a multimodal
conversational system based on 3D interior domains. The system supports
speech, deictic gesture, and eye gaze inputs from the user during multimodal
conversation. It provides a framework for developing different multimodal applications.
• Corpora of multimodal data collected through user studies. This research results
in 3 sets of data for studying multimodal conversation. These data provide user
speech and the accompanying deictic gestures and eye gaze fixations during
multimodal conversation. The data has been annotated for this thesis research.
The annotation includes the transcript of speech, the timestamps of transcribed
words, the referred entity in users' speech, and the labeling of closely coupled
gaze-speech pairs. These data will be made available to research communities.
8.2 Future Directions
Some future directions for research on using non-verbal modalities in language
processing include:

• In this thesis's work on automatic word acquisition, new words are grounded to
the domain concepts representing entities and their properties. These domain
concepts are already given to the system. It would be interesting for future work to
automatically learn these domain concepts.

• The current implementation of word acquisition by means of eye gaze learns
words referring to entities and their physical properties (color, size, material,
shape). It may be extended to learn words that describe the spatial relations
of entities and the user's actions.

• Besides word acquisition, eye gaze can also be used to help syntactic parsing
of the user's spoken language. For example, suppose the user says "there is a
book on a table with a brown cover". It is ambiguous in parsing whether the
prepositional phrase "with a brown cover" should be attached to "a book" or "a
table". However, using eye gaze fixations, the system can decide which entity
the phrase "with a brown cover" should be attached to, based on its domain
knowledge about the properties of the fixated entities (book, table).
APPENDICES
A Multimodal Data Collection
This section describes the user studies that we conducted to collect the speech-gesture
and speech-gaze multimodal data sets for the investigations in this thesis.
A.1 Speech-Gesture Data Collection in the Interior Decoration Domain
We collected speech-gesture data by conducting user studies in the interior decoration
domain (Section 3.3.1). In this study, users were asked to accomplish tasks in two
scenarios. Scenario 1 was to clean up and redecorate a messy room. Scenario 2 was to
arrange and decorate the room so that it looks like the room in the pictures provided
to the user. Each scenario put the user into a specific role (e.g., college student,
professor, merchant, etc.), and the task had to be completed with a set of constraints
(e.g., budget of furnishings, bed size, number of domestic products, etc.). Figures
A.1 & A.2 show the instructions for scenario 1 and scenario 2 that were given to the
user before the study.
We recruited 5 users for the study. During the study, the user's speech was
recorded through an open microphone and the user's deictic gestures were captured
by a touch-screen. From the user studies, we collected 649 spoken utterances with
accompanying gestures.
A.2 Speech-Gaze Data Collection in the Interior Decoration Domain
We also collected a corpus of speech-gaze data in the interior decoration domain with
a different user task. In this study, a static 3D bedroom scene was shown to the
user. The system verbally asked the user a list of questions one at a time about the
bedroom and the user answered the questions by speaking to the system. Figure A.3
lists the questions that are asked by the system.
We recruited 7 users for the study. During the study, the user's speech was
recorded through an open microphone and the user's eye gaze was captured by an
EyeLink II eye tracker. From the user studies, we collected 554 spoken utterances
with accompanying gaze streams.
Description of Scenario 1
1. You are planning to have an important meeting at your apartment. Cur-
rently, your apartment is a mess. You would like to clean it up and redecorate.
You have found a computer program that will allow you to manipulate the fur-
niture arrangement and style in the apartment. This will allow you to decorate
the virtual replica of your apartment prior to redecorating your real apartment.
This will minimize heavy lifting and save you lots of time! You have two goals.
The first goal is to clean up your messy apartment by removing, replacing, or
modifying objects that appear to be either out of place or have strange-looking
characteristics. The second goal is to redecorate your apartment. This can be
accomplished by adding, removing, or modifying objects.
2. You are not a millionaire, so you will have to stay under a specific budget during
the decoration process. You also have certain personality traits and practical
needs which will constrain the redecoration process. The budget along with
these needs will be defined by a character role card which will be given to you at
the beginning of this scenario.
3. Additionally, you will need to write down certain information about the resulting
redecorated apartment for future reference. The information that is important
to you will be determined by your character role.
Role: College Student
You are a college student. You want to have an exotic and colorful apartment,
but price is a major concern. You require a quality desk that will last for a long
time. You need cabinets with many drawers to store all your school work. You
prefer dim lighting and lots of plants and artwork. You want your apartment
to look as exotic and colorful as possible while satisfying your basic needs and
staying under a budget of $1800.
Role: Patriotic Family (with kids)
You are a former US Marine. You are very patriotic and have a family (with
kids) that shares your values. You want your apartment to contain as many
objects made in the US as possible (especially objects that have recently been made in the
US) and to be symbolic of the US, yet you also want your apartment to be practical
and safe for your children. You prefer soft, unbreakable furniture without sharp
corners that has recently been produced in the US. You need a large bed and
would prefer to have at least one reclining piece of furniture. You must satisfy
these preferences while staying under a budget of $2500.
Figure A.1. Instruction for scenario 1 in the interior decoration domain
Description of Scenario 2
1. Imagine that you are searching for a new place to live. You have found a
computer program that will allow you to manipulate the furniture arrangement
and style in a prospective apartment. When you recently visited an old friend,
you really enjoyed the layout of his/her apartment. The images of this apartment
are vividly ingrained in your mind. Your goal is to arrange your prospective
apartment in the mold of those images. To help with the story, sample images
will be provided for you.
2. While the layout of your friend’s place was aesthetically pleasing to you, certain
aspects of the apartment need to be modified to fulfill your own personality traits
and practical needs. These needs will be defined by a character role card which
will be given to you at the beginning of this scenario. Based on your chosen
character role, you will need to modify certain pieces of furniture to adhere to
your character’s needs.
3. Additionally, you will need to write down certain information about the prospective
apartment for future reference. The information that is important to
you will be determined by your character role.
Role: Collector
You are an art and antiques collector. You prefer old, expensive, and aesthetically
pleasing furniture. You sometimes take prospective customers to your apartment
and need to keep up the appearance that you know what you are talking about.
Your goal for this apartment is that it contains a lot of art (paintings), old and
expensive furniture, and objects from a wide variety of countries with a minimal
number of US-produced objects. You will need to modify the existing furniture
to adhere to your preferences.
Role: Professor
You are a college professor. The apartment's practicality is very important to
you. You require a quality desk that will last for a long time. You need cabinets
with many drawers. Light is very important to you. You prefer powerful
(high-wattage) lamps. Additionally, you require that a recliner is available when
you need to relax from your busy day. You want to efficiently balance comfort
vs. price; you generally don't want furniture made out of the cheapest or most
expensive material.
Figure A.2. Instruction for scenario 2 in the interior decoration domain
1. Describe this room.
2. What do you like/dislike about the arrangement?
3. Describe anything in the room that seems strange to you.
4. Is there a bed in this room?
5. How big is the bed?
6. Describe the area around the bed.
7. Would you make any changes to the area around the bed?
8. Describe the left wall.
9. How many paintings are there in this room?
10. Which is your favorite painting?
11. Which is your least favorite painting?
12. What is your favorite piece of furniture in the room?
13. What is your least favorite piece of furniture in the room?
14. How would you change this piece of furniture to make it better?

Figure A.3. Questions for users in the study
A.3 Speech-Gaze Data Collection in the Treasure Hunting Domain
We collected another corpus of speech-gaze data by conducting user studies in the
treasure hunting domain (Section 3.3.2). In this study, the user's task is to find
treasures that are hidden in a 3D castle. The user can walk around inside the castle
and move objects. The user needs to consult with a remote "expert" (i.e., an artificial
agent) to find the treasures. The expert has some knowledge about the treasures but
cannot see the castle. The user has to talk to the expert for advice on finding the
treasures. Figure A.4 shows the instruction that was given to the user before the study.

We recruited 20 users for the study. During the study, the user's speech was
recorded through an open microphone and the user's eye gaze was captured by a
Tobii eye tracker. From the user studies, we collected 3709 spoken utterances with
accompanying gaze streams.
Instruction
Your mission, if you choose to accept it (by signing the consent form), is to
immerse yourself into the world of treasure hunting and find Zahalin’s treasure.
With the help of an artificial conversational agent, you will navigate Zahalin’s
castle in search for the treasure. Some of the treasure will be hidden, while some
of it will be in plain sight. To communicate with your artificial assistant, speak
clearly into the microphone using your natural tone of voice.
The assistant is an old criminal who is familiar with Zahalin’s castle. He has
partial knowledge about where the treasure is and how to find it, but cannot
see what is inside the castle. You have additional knowledge about what can be
seen in the castle environment. It is your responsibility to communicate with the
artificial assistant and provide as much detail about the layout of the castle as
he requires.
You have the ability to open, move, and pick up various objects in the castle.
However, you must be careful! Some objects are booby trapped and you will be
penalized for manipulating these objects. Make sure to ask the artificial assistant
if an object is safe before manipulating it.
Together you will decipher this puzzle. Good luck!
While you navigate through the castle and converse with your artificial assistant,
we will track your speech and eye gaze. This data will be used to make further
improvements to the conversational agent’s spoken language understanding. The
system will inform you if it fails to recognize either your speech or eye gaze. If
this happens at any point during the study, please ask your proctor for assistance.
Figure A.4. Instruction for the user study
B Parameter Estimation in Approaches to Word Acquisition
Given a parallel data set (W, E), where W = {w_1, w_2, ..., w_n} and E =
{e_1, e_2, ..., e_n}, EM algorithms are used to estimate the probabilities p(w|e) that
maximize the likelihood of the data set

    p(W|E) = \prod_{k=1}^{n} p(w_k | e_k)
B.1 Parameter Estimation for Base Model-1

The Base Model-1 is

    p(w|e) = \frac{1}{(|e|+1)^{|w|}} \prod_{j=1}^{|w|} \sum_{i=0}^{|e|} p(w_j | e_i)

We use the EM algorithm to estimate the parameters \theta = (p(w|e)) that maximize p(W|E):

• E-step: compute the expected value of the log-likelihood with respect to the distribution of the alignments a_j

    Q = E[\log p(W|E, \theta^{(old)})]
      = \sum_{k=1}^{n} \log \frac{1}{(|e_k|+1)^{|w_k|}}
        + \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \log p(w_{kj} | e_{ki})

  where for each instance,

    p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) = \frac{p(w_{kj} | e_{ki})}{\sum_{i=0}^{|e_k|} p(w_{kj} | e_{ki})}    (B.1)

• M-step: find the new parameters

    \theta^{(new)} = \arg\max_\theta Q

  and we have

    p(w|e) = \frac{\sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}{\sum_{w} \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}    (B.2)

  where

    \delta(w, w_{kj}) = \begin{cases} 1 & w_{kj} = w \\ 0 & \text{otherwise} \end{cases}    (B.3)

    \delta(e, e_{ki}) = \begin{cases} 1 & e_{ki} = e \\ 0 & \text{otherwise} \end{cases}    (B.4)
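A compact sketch of this EM loop for Base Model-1 (an IBM Model 1 style estimator with a NULL entity at position 0) is given below; the (words, entities) instance format, the uniform initialization, and the fixed iteration count are illustrative choices, not the thesis's exact setup.

    from collections import defaultdict

    def train_base_model1(instances, iterations=20, null_entity="<NULL>"):
        """EM estimation of p(w|e). `instances` is a list of (words, entities)
        pairs of string lists; only co-occurring (w, e) pairs receive meaningful
        probabilities (unseen pairs keep the uniform default)."""
        vocab_w = {w for words, _ in instances for w in words}
        uniform = 1.0 / len(vocab_w)

        # p[e][w]: translation probability p(w|e), initialized uniformly
        p = defaultdict(lambda: defaultdict(lambda: uniform))

        for _ in range(iterations):
            counts = defaultdict(lambda: defaultdict(float))   # expected count c(w; e)
            totals = defaultdict(float)                        # expected count c(e)
            for words, entities in instances:
                ents = [null_entity] + list(entities)
                for w in words:
                    # E-step: alignment posteriors p(a_j = i | w_kj, e_ki) (Eq. B.1)
                    denom = sum(p[e][w] for e in ents)
                    for e in ents:
                        post = p[e][w] / denom
                        counts[e][w] += post
                        totals[e] += post
            # M-step: re-estimate p(w|e) from expected counts (Eq. B.2)
            p = defaultdict(lambda: defaultdict(lambda: uniform))
            for e, word_counts in counts.items():
                for w, c in word_counts.items():
                    p[e][w] = c / totals[e]
        return p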
B.2 Parameter Estimation for Base Model-2

The Base Model-2 is

    p(w|e) = \prod_{j=1}^{|w|} \sum_{i=0}^{|e|} p(a_j = i | j, |w|, |e|) \, p(w_j | e_i)

We use the EM algorithm to estimate the parameters \theta = (p(a_j | j, m, l), p(w|e)) that maximize p(W|E):

• E-step: compute the expected value of the log-likelihood with respect to the distribution of the alignments a_j

    Q = E[\log p(W|E, \theta^{(old)})]
      = \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, |w_k|, |e_k|, \theta^{(old)}) \log [ p(a_j = i | j, |w_k|, |e_k|) \, p(w_{kj} | e_{ki}) ]

  where for each instance k,

    p(a_j = i | w_{kj}, e_{ki}, |w_k|, |e_k|, \theta^{(old)}) = \frac{p(a_j = i | j, |w_k|, |e_k|) \, p(w_{kj} | e_{ki})}{\sum_{i=0}^{|e_k|} p(a_j = i | j, |w_k|, |e_k|) \, p(w_{kj} | e_{ki})}    (B.5)

• M-step: find the new parameters

    \theta^{(new)} = \arg\max_\theta Q

  and we have

    p(a_j = i | j, m, l) = \frac{\sum_{k : |w_k| = m, |e_k| = l} p(a_j = i | w_{kj}, e_{ki}, |w_k|, |e_k|, \theta^{(old)})}{\sum_{k : |w_k| = m, |e_k| = l} 1}    (B.6)

    p(w|e) = \frac{\sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, |w_k|, |e_k|, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}{\sum_{w} \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, |w_k|, |e_k|, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}    (B.7)

  where \delta(w, w_{kj}) and \delta(e, e_{ki}) are defined in Equations B.3 and B.4.
B.3 Parameter Estimation for Model-2s

The Model-2s is

    p(w|e) = \prod_{j=1}^{|w|} \sum_{i=0}^{|e|} p_s(a_j = i | j, e, w) \, p(w_j | e_i)

We use the EM algorithm to estimate the parameters \theta = (p(w|e)) that maximize p(W|E):

• E-step: compute the expected value of the log-likelihood with respect to the distribution of the alignments a_j

    Q = E[\log p(W|E, \theta^{(old)})]
      = \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \log [ p_s(a_j = i | j, e_k, w_k) \, p(w_{kj} | e_{ki}) ]

  where for each instance,

    p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) = \frac{p_s(a_j = i | j, e_k, w_k) \, p(w_{kj} | e_{ki})}{\sum_{i=0}^{|e_k|} p_s(a_j = i | j, e_k, w_k) \, p(w_{kj} | e_{ki})}    (B.8)

• M-step: find the new parameters

    \theta^{(new)} = \arg\max_\theta Q

  and we have

    p(w|e) = \frac{\sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}{\sum_{w} \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}    (B.9)

  where \delta(w, w_{kj}) and \delta(e, e_{ki}) are defined in Equations B.3 and B.4.
B.4 Parameter Estimation for Model-2t

The Model-2t is

    p(w|e) = \prod_{j=1}^{|w|} \sum_{i=0}^{|e|} p_t(a_j = i | j, e, w) \, p(w_j | e_i)

where

    p_t(a_j = i | j, e, w) = \begin{cases} 0 & d(e_i, w_j) > 0 \\ \dfrac{\exp[\alpha \cdot d(e_i, w_j)]}{\sum_{i'} \exp[\alpha \cdot d(e_{i'}, w_j)]} & d(e_i, w_j) \le 0 \end{cases}

We use the EM algorithm to estimate the parameters \theta = (p(w|e), \alpha) that maximize p(W|E):

• E-step: compute the expected value of the log-likelihood with respect to the distribution of the alignments a_j

    Q = E[\log p(W|E, \theta^{(old)})]
      = \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \log [ p_t(a_j = i | j, e_k, w_k) \, p(w_{kj} | e_{ki}) ]

  where for each instance,

    p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) = \frac{p_t(a_j = i | j, e_k, w_k) \, p(w_{kj} | e_{ki})}{\sum_{i=0}^{|e_k|} p_t(a_j = i | j, e_k, w_k) \, p(w_{kj} | e_{ki})}
      = \frac{\exp[\alpha \cdot d(e_{ki}, w_{kj})] \, p(w_{kj} | e_{ki})}{\sum_{i=0}^{|e_k|} \exp[\alpha \cdot d(e_{ki}, w_{kj})] \, p(w_{kj} | e_{ki})}    (B.10)

• M-step: find the new parameters

    \theta^{(new)} = \arg\max_\theta Q

  The new p(w|e) is given by

    p(w|e) = \frac{\sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}{\sum_{w} \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}    (B.11)

  where \delta(w, w_{kj}) and \delta(e, e_{ki}) are defined in Equations B.3 and B.4.

  The new \alpha is obtained by solving

    \frac{\exp[\alpha \cdot d(e_{ki}, w_{kj})]}{\sum_{i'} \exp[\alpha \cdot d(e_{ki'}, w_{kj})]} = p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)})

  in the least-squares sense:

    \alpha = \arg\min_\alpha \sum_{k} \sum_{j} \sum_{i} \left( \frac{\exp[\alpha \cdot d(e_{ki}, w_{kj})]}{\sum_{i'} \exp[\alpha \cdot d(e_{ki'}, w_{kj})]} - p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \right)^2    (B.12)

  The Levenberg-Marquardt (LM) algorithm [63, 67] is used to find the MSE estimate of \alpha.
B.5 Parameter Estimation for Model-2ts

The Model-2ts is

    p(w|e) = \prod_{j=1}^{|w|} \sum_{i=0}^{|e|} p_{ts}(a_j = i | j, e, w) \, p(w_j | e_i)

where

    p_{ts}(a_j = i | j, e, w) = \frac{SR(e_i, w_j) \exp[\alpha \cdot d(e_i, w_j)]}{\sum_{i'} SR(e_{i'}, w_j) \exp[\alpha \cdot d(e_{i'}, w_j)]}

We use the EM algorithm to estimate the parameters \theta = (p(w|e), \alpha) that maximize p(W|E):

• E-step: compute the expected value of the log-likelihood with respect to the distribution of the alignments a_j

    Q = E[\log p(W|E, \theta^{(old)})]
      = \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \log [ p_{ts}(a_j = i | j, e_k, w_k) \, p(w_{kj} | e_{ki}) ]

  where for each instance,

    p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) = \frac{p_{ts}(a_j = i | j, e_k, w_k) \, p(w_{kj} | e_{ki})}{\sum_{i=0}^{|e_k|} p_{ts}(a_j = i | j, e_k, w_k) \, p(w_{kj} | e_{ki})}
      = \frac{SR(e_{ki}, w_{kj}) \exp[\alpha \cdot d(e_{ki}, w_{kj})] \, p(w_{kj} | e_{ki})}{\sum_{i=0}^{|e_k|} SR(e_{ki}, w_{kj}) \exp[\alpha \cdot d(e_{ki}, w_{kj})] \, p(w_{kj} | e_{ki})}    (B.13)

• M-step: find the new parameters

    \theta^{(new)} = \arg\max_\theta Q

  The new p(w|e) is given by

    p(w|e) = \frac{\sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}{\sum_{w} \sum_{k=1}^{n} \sum_{j=1}^{|w_k|} \sum_{i=0}^{|e_k|} p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \, \delta(w, w_{kj}) \, \delta(e, e_{ki})}    (B.14)

  where \delta(w, w_{kj}) and \delta(e, e_{ki}) are defined in Equations B.3 and B.4.

  The new \alpha is obtained by solving

    \frac{SR(e_{ki}, w_{kj}) \exp[\alpha \cdot d(e_{ki}, w_{kj})]}{\sum_{i'} SR(e_{ki'}, w_{kj}) \exp[\alpha \cdot d(e_{ki'}, w_{kj})]} = p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)})

  in the least-squares sense:

    \alpha = \arg\min_\alpha \sum_{k} \sum_{j} \sum_{i} \left( \frac{SR(e_{ki}, w_{kj}) \exp[\alpha \cdot d(e_{ki}, w_{kj})]}{\sum_{i'} SR(e_{ki'}, w_{kj}) \exp[\alpha \cdot d(e_{ki'}, w_{kj})]} - p(a_j = i | w_{kj}, e_{ki}, \theta^{(old)}) \right)^2    (B.15)

  The Levenberg-Marquardt (LM) algorithm [63, 67] is used to find the MSE estimate of \alpha.
BIBLIOGRAPHY
[1] J. F. Allen, B. W. Miller, E. K. Ringger, and T. Sikorski. A robust system for
natural spoken dialogue. In Proceedings of the Annual Meeting of the Association
for Computational Linguistics (ACL), 1996.

[2] P. D. Allopenna, J. S. Magnuson, and M. K. Tanenhaus. Tracking the time
course of spoken word recognition using eye movements: Evidence for continuous
mapping models. Journal of Memory & Language, 38:419-439, 1998.

[3] K. Bock, D. E. Irwin, D. J. Davidson, and W. Levelt. Minding the clock.
Journal of Memory and Language, 48:653-685, 2003.

[4] R. A. Bolt. Put that there: Voice and gesture at the graphics interface. Computer
Graphics, 14(3):262-270, 1980.

[5] P. F. Brown, S. D. Pietra, V. J. D. Pietra, and R. L. Mercer. The mathematics
of statistical machine translation: Parameter estimation. Computational
Linguistics, 19(2):263-311, 1993.
[6] P. F. Brown, V. J. D. Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer.
Class-based n-gram models of natural language. Computational Linguistics,
18(4):467-479, 1992.

[7] S. Brown-Schmidt and M. K. Tanenhaus. Watching the eyes when talking about
size: An investigation of message formulation and utterance planning. Journal
of Memory and Language, 54:592-609, 2006.

[8] E. Campana, J. Baldridge, J. Dowding, B. Hockey, R. Remington, and L. Stone.
Using eye movements to determine referents in a spoken dialogue system. In
Proceedings of the Workshop on Perceptive User Interfaces, 2001.

[9] S. Carbini, J. E. Viallet, and L. Delphin-Poulat. Context dependent interpretation
of multimodal speech-pointing gesture interface. In Proceedings of the
International Conference on Multimodal Interfaces (ICMI), 2005.
[10] J. Chai, P. Hong, M. Zhou, and Z. Prasov. Optimization in multimodal interpretation.
In Proceedings of the 42nd Annual Meeting of the Association for Computational
Linguistics (ACL), 2004.

[11] J. Chai, S. Pan, and M. Zhou. MIND: A context-based multimodal interpretation
framework in conversational systems. In O. Bernsen, L. Dybkjaer, and
J. van Kuppevelt, editors, Natural, Intelligent and Effective Interaction in Multimodal
Dialogue Systems. Kluwer Academic Publishers, 2005.

[12] J. Chai, Z. Prasov, and S. Qu. Cognitive principles in robust multimodal
interpretation. Journal of Artificial Intelligence Research, 27:55-83, 2006.

[13] J. Chai and S. Qu. A salience driven approach to robust input interpretation
in multimodal conversational systems. In Proceedings of the Human Language
Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), 2005.

[14] J. Y. Chai, P. Hong, and M. X. Zhou. A probabilistic approach to reference
resolution in multimodal user interfaces. In Proceedings of the International
Conference on Intelligent User Interfaces (IUI), pages 70-77, 2004.

[15] A. Cheyer and L. Julia. MVIEWS: Multimodal tools for the video analyst.
In Proceedings of the International Conference on Intelligent User Interfaces
(IUI), 1998.

[16] A. Chotimongkol and A. Rudnicky. N-best speech hypotheses reordering using
linear regression. In Proceedings of 7th EUROSPEECH, pages 1829-1832, 2001.
[17] J. Chu-Carroll. MIMIC: An adaptive mixed initiative spoken dialogue system
for information queries. In Proceedings of the 6th Conference on Applied Natural
Language Processing (ANLP), 2000.

[18] CMU. The CMU audio databases. http://www.speech.cs.cmu.edu/databases/.

[19] M. Coen, L. Weisman, K. Thomas, and M. Groh. A context sensitive natural
language modality for the intelligent room. In Proceedings of the 1st International
Workshop on Managing Interactions in Smart Environments (MANSE),
pages 38-79, 1999.

[20] P. Cohen, M. Johnston, D. McGee, S. Oviatt, J. Pittman, I. Smith, L. Chen,
and J. Clow. QuickSet: Multimodal interaction for distributed applications. In
Proceedings of the Fifth ACM International Conference on Multimedia, pages
31-40, 1997.
[21] N. J. Cooke. Gaze-Contingent Automatic Speech Recognition. PhD thesis,
University of Birmingham, 2006.

[22] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995.

[23] D. Dahan and M. K. Tanenhaus. Looking at the rope when looking for the
snake: Conceptually mediated eye movements during spoken-word recognition.
Psychonomic Bulletin & Review, 12(3):453-459, 2005.

[24] S. Dupont and J. Luettin. Audio-visual speech modelling for continuous speech
recognition. IEEE Transactions on Multimedia, 2(3):141-151, 2000.

[25] S. Dusan, G. J. Gadbois, and J. Flanagan. Multimodal interaction on PDAs
integrating speech and pen inputs. In Proceedings of EUROSPEECH, 2003.

[26] K. M. Eberhard, M. J. Spivey-Knowlton, J. C. Sedivy, and M. K. Tanenhaus.
Eye movements as a window into real-time spoken language comprehension in
natural contexts. Journal of Psycholinguistic Research, 24:409-436, 1995.

[27] J. Eisenstein and C. M. Christoudias. A salience-based approach to gesture-speech
alignment. In Proceedings of HLT/NAACL'04, 2004.
[28] C. Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, 1998.

[29] P. Gorniak and D. Roy. Probabilistic grounding of situated speech using plan
recognition and reference resolution. In Proceedings of the Seventh International
Conference on Multimodal Interfaces (ICMI), 2005.

[30] Z. M. Griffin. Gaze durations during speech reflect word selection and phonological
encoding. Cognition, 82:B1-B14, 2001.

[31] Z. M. Griffin and K. Bock. What the eyes say about speaking. Psychological
Science, 11:274-279, 2000.

[32] B. J. Grosz, A. K. Joshi, and S. Weinstein. Centering: A framework for modeling
the local coherence of discourse. Computational Linguistics, 21(2):203-226, 1995.

[33] B. J. Grosz and C. Sidner. Attention, intention, and the structure of discourse.
Computational Linguistics, 12(3):175-204, 1986.

[34] A. Gruenstein, C. Wang, and S. Seneff. Context-sensitive statistical language
modeling. In Proceedings of Eurospeech, 2005.
[35] J. K. Gundel, N. Hedberg, and R. Zacharski. Cognitive status and the form of
referring expressions in discourse. Language, 69(2):274—307, 1993.
[36] J. E. Hanna and M. K. Tanenhaus. Pragmatic effects on reference resolution in
a collaborative task: evidence from eye movements. Cognitive Science, 28:105—
115, 2004.
[37] J. M. Henderson and F. Ferreira, editors. The interface of language, vision,
and action: Eye movements and the visual world. Taylor & Francis, New York,
2004.
[38] H. Holzapfel, K. Nickel, and R. Stiefelhagen. Implementation and evaluation
of a constraint-based multimodal fusion system for speech and 3D pointing ges-
tures. In Proceedings of the 6th International Conference on Multimodal Inter-
faces (ICMI), pages 175-182, 2004.
[39] C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support
vector machines. IEEE Transactions on Neural Networks, 13:415—425, 2002.
[40] P. Hui and H. Meng. Joint interpretation of input speech and pen gestures for
multimodal human computer interaction. In Proceedings of Interspeech, 2006.
[41] J. J. Hull. A database for handwritten text recognition research. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 16(5):550-554, 1994.
[42] C. Huls, E. Bos, and W. Claassen. Automatic referent resolution of deictic and
anaphoric expressions. Computational Linguistics, 21(1):59—79, 1995.
[43] R. J. K. Jacob. The use of eye movements in human-computer interaction tech-
niques: What you look at is what you get. ACM Transactions on Information
Systems, 9(3):152—169, 1991.
[44] M. Johnston. Unification-based multimodal parsing. In Proceedings of the
International Conference on Computational Linguistics and Annual Meeting of
the Association for Computational Linguistics (COLING-ACL), 1998.
[45] M. Johnston and S. Bangalore. Finite-state multimodal parsing and under-
standing. In Proceedings of the International Conference on Computational
Linguistics (COLING), 2000.
[46] M. Johnston and S. Bangalore. Finite-state methods for multimodal parsing
and integration. In ESSLLI Workshop on Finite-state Methods, 2001.
[47] M. Johnston, S. Bangalore, G. Vasireddy, A. Stent, P. Ehlen, M. Walker,
S. Whittaker, and P. Maloor. MATCH: An architecture for multimodal dia-
logue systems. In Proceedings of the Annual Meeting of the Association for
Computational Linguistics (ACL), pages 376-383, 2002.
[48] M. Johnston, P. Cohen, D. McGee, S. Oviatt, J. Pittman, and I. Smith.
Unification-based multimodal integration. In Proceedings of the Annual Meeting
of the Association for Computational Linguistics (ACL), 1997.
[49] M. A. Just and P. A. Carpenter. Eye fixations and cognitive processes. Cognitive
Psychology, 8:441-480, 1976.
[50] D. Kahneman. Attention and Effort. Prentice-Hall, Inc., Englewood Cliffs,
1973.
[51] S. Katz. Estimation of probabilities from sparse data for the language model
component of a speech recognizer. IEEE Transactions on Acoustics, Speech and
Signal Processing, 35(3):400-401, 1987.
[52] M. Kaur, M. Tremaine, N. Huang, J. Wilder, Z. Gacovski, F. Flippo, and C. S.
Mantravadi. Where is “it”? Event synchronization in gaze-speech input sys-
tems. In Proceedings of the International Conference on Multimodal Interfaces
(ICMI), 2003.
[53] Z. Kazi, S. Chen, M. Beitler, D. Chester, and R. Foulds. Multimodal HCI for
robot control: Towards an intelligent robotic assistant for people with disabili-
ties. In Proceedings of AAAI’96 Fall Symposium on Developing AI Applications
for the Disabled, 1996.
[54] A. Kehler. Cognitive status and form of reference in multimodal human-
computer interaction. In Proceedings of the National Conference on Artificial
Intelligence (AAAI), pages 685-689, 2000.
[55] D. Klein and C. D. Manning. Accurate unlexicalized parsing. In Proceedings
of the 41st Meeting of the Association for Computational Linguistics (ACL),
2003.
[56] D. B. Koons, C. J. Sparrell, and K. R. Thorisson. Integrating simultaneous
input from speech, gaze, and hand gestures. In M. Maybury, editor, Intelligent
Multimedia Interfaces, pages 257-276. MIT Press, 1993.
[57] F. Landragin, N. Bellalem, and L. Romary. Visual salience and perceptual
grouping in multimodal interactivity. In Proceedings of the First International
Workshop on Information Presentation and Natural Multimodal Dialogue, pages
151-155, 2001.
[58] S. Lappin and H. Leass. An algorithm for pronominal anaphora resolution.
Computational Linguistics, 20(4):535—561, 1994.
[59] M. E. Latoschik. A user interface framework for multimodal VR interactions.
In Proceedings of the 7th International Conference on Multimodal Interfaces
(ICMI), pages 76-83, 2005.
[60] S. le Cessie and J. van Houwelingen. Ridge estimators in logistic regression.
Applied Statistics, 41(1):191—201, 1992.
[61] Y. LeCun and C. Cortes. The MNIST database of handwritten digits.
http://yann.lecun.com/exdb/mnist.
[62] O. Lemon and A. Gruenstein. Multithreaded context for robust conversational
interfaces: Context-sensitive speech recognition and interpretation of corrective
fragments. ACM Transactions on Computer-Human Interaction, 11(3):241-267,
2004.
[63] K. Levenberg. A method for the solution of certain non-linear problems in least
squares. Quarterly of Applied Mathematics, 2(2):164—168, 1944.
[64] E. Levin, S. Narayanan, R. Pieraccini, K. Biatov, E. Bocchieri, G. D. Fabbrizio,
W. Eckert, S. Lee, A. Pokrovsky, M. Rahim, P. Ruscitti, and M. Walker. The
AT&T-DARPA Communicator mixed-initiative spoken dialog system. In Proceedings
of the International Conference on Spoken Language Processing (ICSLP), 2000.
[65] D. J. Litman and K. Forbes-Riley. Predicting student emotions in computer-
human tutoring dialogues. In Proceedings of the 42nd Annual Meeting of the
Association for Computational Linguistics (ACL), 2004.
[66] Y. Liu, J. Y. Chai, and R. Jin. Automated vocabulary acquisition and interpre-
tation in multimodal conversational systems. In Proceedings of the 45th Annual
Meeting of the Association of Computational Linguistics (ACL), 2007.
[67] D. Marquardt. An algorithm for the least-squares estimation of nonlinear pa-
rameters. SIAM Journal of Applied Mathematics, 11(2):431-441, 1963.
[68] A. S. Meyer, A. M. Sleiderink, and W. J. M. Levelt. Viewing and naming
objects: eye movements during noun phrase production. Cognition, 66(2):B25-B33, 1998.
[69] L.-P. Morency and T. Darrell. Head gesture recognition in intelligent interfaces:
The role of context in improving recognition. In Proceedings of the International
Conference on Intelligent User Interfaces (IUI), 2006.
[70] Y. Nakano, G. Reinstein, T. Stocky, and J. Cassell. Towards a model of face-
to-face grounding. In Proceedings of the Annual Meeting of the Association for
Computational Linguistics (ACL), 2003.
[71] J. G. Neal and S. C. Shapiro. Intelligent multimedia interface technology. In
J. Sullivan and S. Tyler, editors, Intelligent User Interfaces. ACM, New York,
1991.
[72] J. G. Neal, C. Y. Thielman, Z. H. Dobes, S. M. Haller, and S. C. Shapiro. Natural
language with integrated deictic and graphic gestures. In M. Maybury and
W. Wahlster, editors, Intelligent User Interfaces, pages 38—51. Morgan Kauf-
mann Press, CA, 1998.
[73] S. Oviatt. Multimodal interactive maps: Designing for human performance.
Human-Computer Interaction, 12:93—129, 1997.
[74] S. Oviatt. Mutual disambiguation of recognition errors in a multimodal ar-
chitecture. In Proceedings of the Conference on Human Factors in Computing
Systems (CHI), 1999.
[75] S. Oviatt. Breaking the robustness barrier: Recent progress on the design of
robust multimodal systems. Advances in Computers, 56:305—341, 2002.
[76] S. Oviatt. Multimodal interfaces. In J. Jacko and A. Sears, editors, The
Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies
and Emerging Applications, chapter 14, pages 286—304. Lawrence Erlbaum As-
soc., Mahwah, NJ, 2003.
[77] T. Pedersen, S. Patwardhan, and J. Michelizzi. WordNet::Similarity - mea-
suring the relatedness of concepts. In Proceedings of the Nineteenth National
Conference on Artificial Intelligence (AAAI), 2004.
[78] A. Potamianos, S. Narayanan, and G. Riccardi. Adaptive categorical under-
standing for spoken dialog systems. IEEE Transactions on Speech and Audio
Processing, 13(3):321-329, 2005.
[79] G. Potamianos, C. Neti, J. Luettin, and I. Matthews. Audio-visual automatic
speech recognition: An overview. In G. Bailly, E. Vatikiotis-Bateson, and P. Per-
rier, editors, Issues in Visual and Audio-Visual Speech Processing. MIT Press,
2004.
[80] Z. Prasov and J. Y. Chai. What’s in a gaze? The role of eye-gaze in reference
resolution in multimodal conversational interfaces. In Proceedings of ACM 12th
International Conference on Intelligent User Interfaces (IUI), 2008.
[81] S. Qu and J. Y. Chai. Salience modeling based on non-verbal modalities for
spoken language understanding. In Proceedings of the International Conference
on Multimodal Interfaces (ICMI), pages 193—200, 2006.
[82] S. Qu and J. Y. Chai. An exploration of eye gaze in spoken language processing
for multimodal conversational interfaces. In Proceedings of the Human Language
Technology Conference of the North American Chapter of the Association for
Computational Linguistics (HLT-NAACL), pages 284-291, 2007.
[83] S. Qu and J. Y. Chai. Beyond attention: The role of deictic gesture in inten-
tion recognition in multimodal conversational interfaces. In Proceedings of the
International Conference on Intelligent User Interfaces (IUI), pages 237-246,
2008.
[84] S. Qu and J. Y. Chai. Incorporating temporal and semantic information with
eye gaze for automatic word acquisition in multimodal conversational systems.
In Proceedings of the Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 244—253, 2008.
[85] S. Qu and J. Y. Chai. Speech-gaze temporal alignment for automatic word
acquisition in multimodal conversational systems. In Proceedings of the Fifth
Midwest Computational Linguistics Colloquium (MCLC), 2008.
[86] R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Pub-
lishers, San Mateo, CA, 1993.
[87] P. Qvarfordt and S. Zhai. Conversing with the user based on eye-gaze patterns.
In Proceedings of the Conference on Human Factors in Computing Systems
(CHI), 2005.
[88] K. Rayner. Eye movements in reading and information processing: 20 years of
research. Psychological Bulletin, 124(3):372-422, 1998.
[89] D. Roy and N. Mukherjee. Towards situated speech understanding: Visual
context priming of language models. Computer Speech and Language, 19(2):227-
248, 2005.
[90] D. K. Roy and A. P. Pentland. Learning words from sights and sounds: A
computational model. Cognitive Science, 26(1):113-146, 2002.
[91] A. Sankar and A. Gorin. Adaptive language acquisition in a multi-sensory
device. In R. Mammone, editor, Artificial neural networks for speech and vision,
pages 324—356. Chapman and Hall, London, 1993.
[92] S. Seneff, D. Goddeau, C. Pao, and J. Polifroni. Multimodal discourse modelling
in a multi-user multi-domain environment. In Proceedings of the International
Conference on Spoken Language Processing (ICSLP), pages 192-195, 1996.
[93] A. Shaikh, S. Juth, A. Medl, I. Marsic, C. Kulikowski, and J. Flanagan. An
architecture for multimodal information fusion. In Proceedings of the Workshop
on Perceptual User Interfaces (PUI), pages 91-93, 1997.
[94] P. Silsbee and A. Bovik. Computer lipreading for improved accuracy in auto-
matic speech recognition. IEEE Transactions on Speech and Audio Processing,
4(5):337-351, 1996.
[95] J. Siroux, M. Guyomard, F. Multon, and C. Remondeau. Modeling and process-
ing of oral and tactile activities in the GEORAL system. In Multimodal Human-
Computer Communication, Systems, Techniques, and Experiments, pages 101-
110. Springer-Verlag, London, UK, 1998.
[96] R. A. Solsona, E. Fosler-Lussier, H.-K. J. Kuo, A. Potamianos, and I. Zitouni.
Adaptive language models for spoken dialogue systems. In Proceedings of the In-
ternational Conference on Acoustics, Speech, and Signal Processing (ICASSP),
2002.
[97] M. J. Spivey, M. K. Tanenhaus, K. M. Eberhard, and J. C. Sedivy. Eye move-
ments and spoken language comprehension: Effects of visual context on syn-
tactic ambiguity resolution. Cognitive Psychology, 45:447-481, 2002.
[98] R. Stevenson. The role of salience in the production of referring expressions:
A psycholinguistic perspective. In K. van Deemter and R. Kibble, editors,
Information Sharing. CSLI Publ., 2002.
[99] Y. Sun, F. Chen, Y. Shi, and V. Chung. An input-parsing algorithm supporting
integration of deictic gesture in natural language interface. In J. A. Jacko,
editor, Human-Computer Interaction: HCI Intelligent Multimodal Interaction
Environments, pages 206-215. Springer-Verlag Berlin Heidelberg, 2007.
[100] K. Tanaka. A robust selection system using real-time multi-modal user-agent
interactions. In Proceedings of the International Conference on Intelligent User
Interfaces (IUI), 1999.
[101] M. K. Tanenhaus, M. J. Spivey-Knowlton, K. M. Eberhard, and J. C. Sedivy.
Integration of visual and linguistic information in spoken language comprehen-
sion. Science, 268:1632-1634, 1995.
[102] M. J. Tomlinson, M. J. Russell, and N. M. Brooke. Integrating audio and
visual information to provide highly robust speech recognition. In International
Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 821-
824, 1996.
[103] B. M. Velichkovsky. Communicating attention: Gaze position transfer in coop-
erative problem solving. Pragmatics and Cognition, 3:199-224, 1995.
[104] J. Vergo. A statistical approach to multimodal natural language interaction.
In Proceedings of the AAAI ’98 Workshop on Representations for Multimodal
Human-Computer Interaction, pages 81—85, 1998.
[105] R. Vertegaal. The GAZE groupware system: Mediating joint attention in mul-
tiparty communication and collaboration. In Proceedings of the Conference on
Human Factors in Computing Systems (CHI), pages 294-301, 1999.
[106] M. T. Vo and C. Wood. Building an application framework for speech and pen
input integration in multimodal learning interfaces. In International Conference
on Acoustics, Speech, and Signal Processing (ICASSP), 1996.
[107] W. Wahlster. User and discourse models for multimodal communication. In
J. W. Sullivan and S. W. Tyler, editors, Intelligent user interfaces, pages 45—67.
ACM, 1991.
[108] A. Waibel, B. Suhm, M. Vo, and J. Yang. Multimodal interfaces for multimedia
information agents. In Proceedings of the International Conference on Acoustics
Speech and Signal Processing (ICASSP), pages 167-170, 1997.
[109] A. Waibel, M. T. Vo, P. Duchnowski, and S. Manke. Multimodal interfaces.
Artificial Intelligence Review, 10(3—4):299—319, 1996.
[110] M. A. Walker. An application of reinforcement learning to dialogue strategy
selection in a spoken dialogue system for email. Journal of Artificial Intelligence
Research, 12:387-416, 2000.
[111] W. Walker, P. Lamere, P. Kwok, B. Raj, R. Singh, E. Gouvea, P. Wolf, and
J. Woelfel. Sphinx-4: A flexible open source framework for speech recognition.
Technical Report TR-2004-139, Sun Microsystems Laboratories, 2004.
[112] J. Wang. Integration of eye-gaze, voice and manual response in multimodal
user interfaces. In Proceedings of IEEE International Conference on Systems,
Man and Cybernetics, pages 3938—3942, 1995.
[113] C. Ware and H. H. Mikaelian. An evaluation of an eye tracker as a device
for computer input. In Proceedings of the SIGCHI/GI conference on Human
factors in computing systems and graphics interface, pages 183-188, 1987.
[114] Y. Watanabe, K. Iwata, R. Nakagawa, K. Shinoda, and S. Furui. Semi-
synchronous speech and pen input. In Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2007.
[115] I. H. Witten and E. Frank. Data Mining: Practical machine learning tools and
techniques. Morgan Kaufmann, San Francisco, 2005.
[116] L. Wu, S. Oviatt, and P. Cohen. From members to teams to committee - a
robust approach to gestural and multimodal recognition. IEEE Transactions
on Neural Networks, 13(4), 2002.
[117] C. Yu and D. H. Ballard. A multimodal learning interface for grounding spoken
language in sensory perceptions. ACM Transactions on Applied Perception,
1(1):57-80, 2004.
[118] C. Yu and D. H. Ballard. On the integration of grounding language and learning
objects. In Proceedings of AAAI-04, 2004.
[119] M. Zancanaro, O. Stock, and C. Strapparava. Multimodal interaction for infor-
mation access: Exploiting cohesion. Computational Intelligence, 13(7):439—464,
1997.
[120] S. Zhai, C. Morimoto, and S. Ihde. Manual and gaze input cascaded (MAGIC)
pointing. In Proceedings of the Conference on Human Factors in Computing
Systems (CHI), pages 246-253, 1999.
[121] Q. Zhang, A. Imamiya, K. Go, and X. Mao. Overriding errors in a speech and
gaze multimodal architecture. In Proceedings of the International Conference
on Intelligent User Interfaces (IUI), 2004.