SEQUENCE LEARNING WITH SIDE INFORMATION: MODELING AND
APPLICATIONS
By
Zhiwei Wang
A DISSERTATION
Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of
Computer Science — Doctor of Philosophy
2020
ABSTRACT
SEQUENCE LEARNING WITH SIDE INFORMATION: MODELING AND
APPLICATIONS
By
Zhiwei Wang
Sequential data is ubiquitous, and modeling it has long been a central problem in computer science. The goal of sequence modeling is to represent a sequence with a low-dimensional dense vector that incorporates as much information as possible. A fundamental type of information contained in sequences is the sequential dependency, and a large body of research has been devoted to designing effective ways to capture it. Recently, sequence learning models such as recurrent neural networks (RNNs), temporal convolutional networks, and the Transformer have gained tremendous popularity in modeling sequential data. Equipped with effective structures such as gating mechanisms, large receptive fields, and attention mechanisms, these models have achieved great success in applications across a wide range of fields.
However, besides the sequential dependency, sequences also exhibit side information that remains under-explored. Thus, in this thesis, we study the problem of sequence learning with side information. Specifically, we present our efforts devoted to building sequence learning models that effectively and efficiently capture side information commonly seen in sequential data. In addition, we show that side information can play an important role in sequence learning tasks, as it provides rich information that is complementary to the sequential dependency. More importantly, we apply our proposed models to various real-world applications and have achieved promising results.
Copyright by
ZHIWEI WANG
2020
ACKNOWLEDGMENTS
I have enjoyed an invaluable and memorable experience in the past five years as a Ph.D. student at Michigan State University. From an undergraduate student who barely knew anything about machine learning and data mining, I have grown into a productive independent researcher in this field. I deeply understand that such an achievement would not have been possible without the help, support, and guidance of many people along the way.
First and foremost, my deepest thanks go to my advisor, Dr. Jiliang Tang. Without his support, I would not have had the opportunity to do research in machine learning. As an advisor, he always offers me help, advice, encouragement, and inspiration. As a researcher, he has greatly influenced me with his passion, perseverance, curiosity, and creativity. As a role model for life, he teaches me through his words and actions the value of kindness, optimism, ambition, and responsibility. I am truly grateful to him.
I would like to thank Dr. Arun A. Ross, Dr. Pang-Ning Tan, Dr. Yuying Xie, and Dr. Zitao Liu for serving on my thesis committee. I also want to express my gratitude to Dr. Kevin Liu, who introduced me to the research world. In addition, I feel fortunate to be a member of the Data Science and Engineering (DSE) lab at Michigan State University. It has been an unforgettable experience working with these amazing labmates: Dr. Tyler Derr, Yao Ma, Xiangyu Zhao, Xiaorui Liu, Hamid Karimi, Haochen Liu, Han Xu, Jamell Dacon, Wentao Wang, Wei Jin, Yaxin Li, Yiqi Wang, Juanhui Li, and Harry Shomer.
Moreover, I appreciate the experience of working with the extraordinary collaborators I met during internships: Dr. Dawei Yin, Dr. Yulong Gu, Dr. Zitao Liu, Wenbiao Ding, Hang Li, Tianqiao Liu, and Dr. Zhengzhang Chen.
Lastly, I would like to thank my parents and cousin for their love and support.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1  Introduction
    1.1 Motivation
    1.2 Dissertation Contribution
    1.3 Dissertation Outline

Chapter 2  Knowledge Background
    2.1 Sequence Modeling Techniques
        2.1.1 Basic Notations And Definitions
        2.1.2 Recurrent Neural Networks
        2.1.3 Transformer Model
    2.2 Sequence Modeling Problems
        2.2.1 Sequence Classification
        2.2.2 Next Event Prediction
    2.3 Summary

Chapter 3  Modeling Event Relation
    3.1 Introduction
    3.2 Problem Statement
    3.3 The Proposed Framework
        3.3.1 Learning Sequential Dependencies
            3.3.1.1 The Input Layer
            3.3.1.2 The Sequential Layer
        3.3.2 Capturing Event Relations
            3.3.2.1 Linear Project Layer
            3.3.2.2 Aggregation Layer
        3.3.3 The Objective Function and Training
    3.4 Experiment
        3.4.1 Experimental Settings
            3.4.1.1 Knowledge Tracing
            3.4.1.2 Session-based Recommendation
        3.4.2 Representative Baselines
        3.4.3 Experimental Results
        3.4.4 Component Analysis
    3.5 Discussion

Chapter 4  Modeling Sequence Relation
    4.1 Introduction
    4.2 Problem Statement
    4.3 The Proposed Framework
        4.3.1 Capturing Sequential Information
        4.3.2 Capturing Sequence Relation
        4.3.3 Linked Recurrent Neural Networks
    4.4 Experiment
        4.4.1 Datasets
        4.4.2 Representative Baselines
        4.4.3 Experimental Settings
        4.4.4 Experimental Results
        4.4.5 Component Analysis
        4.4.6 Parameter Analysis
    4.5 Discussion

Chapter 5  Modeling Temporal Information
    5.1 Introduction
    5.2 Problem Statement
    5.3 The Proposed Framework
        5.3.1 An Architecture Overview
        5.3.2 Embedding Layer
        5.3.3 RNN Layer: Learning Sequential Information
        5.3.4 Interaction Layer: Modeling Heterogeneous Influence
        5.3.5 Product Layer: Recommending the Product
        5.3.6 Time Layer: Predicting Recommendation Time
        5.3.7 Learning Model Parameters
    5.4 Experiment
        5.4.1 Experimental Settings
        5.4.2 Performance On Next Item Recommendation
        5.4.3 Performance On Next Time Prediction
    5.5 Discussion

Chapter 6  Modeling Hierarchy Information
    6.1 Introduction
    6.2 The Proposed Framework: MUDE
        6.2.1 Learning Character-level Dependencies
            6.2.1.1 Encoder
            6.2.1.2 Decoder
            6.2.1.3 Sequence-to-sequence Loss
        6.2.2 Capturing Word-level Dependencies
            6.2.2.1 Word Prediction Loss
        6.2.3 Training Procedure
            6.2.3.1 Test Stage
    6.3 Experiment
        6.3.1 Experimental Settings
            6.3.1.1 Data
            6.3.1.2 Baselines
            6.3.1.3 Implementation Details
        6.3.2 Comparison Results
        6.3.3 Parameter Analysis
        6.3.4 Generalization Analysis
        6.3.5 Case Study
    6.4 Related Work
        6.4.1 Grammatical Error Correction
        6.4.2 Denoising Text for Downstream Tasks
    6.5 Discussion

Chapter 7  Modeling Multi-scale Sequential Dependencies
    7.1 Introduction
    7.2 Problem Statement
    7.3 Preliminaries: One-Class Classifier
    7.4 The Proposed Framework
        7.4.1 Learning Embeddings for Events
        7.4.2 Anomaly Detection from Global Perspective
        7.4.3 Anomaly Detection from Local Perspective
        7.4.4 The Objective Function and Optimization Procedure
    7.5 Experiment
        7.5.1 Datasets
        7.5.2 Baselines
        7.5.3 Experimental Settings
        7.5.4 Performance Comparison
        7.5.5 Parameter Analysis
        7.5.6 Case Study
        7.5.7 Visualization of Normal and Abnormal Sequences
    7.6 Discussion

Chapter 8  Conclusions
    8.1 Dissertation Summary
    8.2 Future Works
        8.2.1 Applications
        8.2.2 Modeling

BIBLIOGRAPHY
LIST OF TABLES

Table 4.1: Statistics of the datasets.
Table 4.2: Performance comparison on the DBLP dataset.
Table 4.3: Performance comparison on the BOOHEE dataset.
Table 5.1: Performance improvement of JRec compared to the best performance of the baselines (%).
Table 5.2: Next event time prediction performance comparison.
Table 6.1: Toy examples of noised text.
Table 6.2: Performance comparison on different types of noise in terms of accuracy (%). Best results are highlighted with bold numbers.
Table 6.3: Performance comparison on different types of noise in terms of accuracy (%). Best results are highlighted with bold numbers.
Table 6.4: Generalization analysis results. The best results are highlighted. MEAN shows the average value of each row.
Table 6.5: Data augmentation results. The values that are higher than those of Table 6.4 are bold.
Table 6.6: An illustrative example of spelling correction outputs for the Cambridge sentence noised by W-PER. Words that the models fail to correct are underlined and bold.
Table 6.7: An illustrative example of spelling correction outputs for the Cambridge sentence noised by W-INS. Words that the models fail to correct are underlined and bold.
Table 7.1: The key statistics of the RUBiS, HDFS and BGL datasets.
Table 7.2: The prediction performance comparison on the RUBiS, HDFS and BGL datasets.
LIST OF FIGURES

Figure 1.1: Examples of event and sequence relation information.
Figure 1.2: Examples of temporal and hierarchy information.
Figure 1.3: Examples of event and sequence relation information.
Figure 3.1: A toy example of a sequence with and without side dependencies in the knowledge tracing task. Note that each square indicates an interaction, where the purple node denotes the question, and cross and check marks denote the correctness of the student's answer.
Figure 3.2: The overall structure of the proposed framework. It consists of two major components: SeqBlock and SideBlock.
Figure 3.3: The structure of SeqBlock is shown in the left subfigure. It consists of two sublayers: one attention sublayer and one feedforward sublayer. The right subfigure shows the structure of the attention sublayer.
Figure 3.4: The illustrative example of the multihead attention mechanism.
Figure 3.5: Performance comparison for baselines and SEE. Results on knowledge tracing and recommendation tasks are shown in the left and right subfigures, respectively. The y-axis is the AUC score; a higher score means better prediction. For each task, models are grouped by the input vectors: Gaussian indicates that models use vectors sampled from a Gaussian distribution to represent events, while LINE indicates that events are represented by vectors obtained through LINE.
Figure 3.6: Component analysis results.
Figure 4.1: An illustration of linked sequences. S1, S2, S3 and S4 denote four sequences and they are connected via four links.
Figure 4.2: An illustration of the proposed framework LinkedRNN on the toy example shown in Figure 4.1. It consists of two major layers, where the RNN layer captures sequential information and the link layer captures sequence relation.
Figure 4.3: Model analysis results.
Figure 5.1: The network architecture of the proposed framework. It consists of six key layers: 1) Input layer; 2) Embedding layer, which embeds the high-dimensional input vector into a low-dimensional space; 3) RNN layer, which captures the sequential and temporal information in sequences; 4) Interaction layer, which models the interactions between products and events in the sequence; 5) Recommendation layer, which includes a product layer and a time layer to predict products and time, respectively; and 6) Output layer, where λi denotes the time and Pi indicates the probability of recommending product i.
Figure 5.2: Performance comparison on next item prediction. The prediction performance is measured by Recall@30, Recall@15 and MRR@15. The left and right panels show the results with long and short sequences, respectively. The proposed framework JRec achieves the best performance on all metrics.
Figure 6.1: The graphical illustration of the proposed framework: MUDE.
Figure 6.2: Learning curve of MUDE in the training procedure with different β values.
Figure 7.1: An illustrative example of BGL log messages and the corresponding log key sequence. Each log message contains a predefined log key that is underscored by red lines.
Figure 7.2: The overview of the proposed framework OC4Seq. It consists of two major components that learn the sequential information from global and local perspectives, respectively.
Figure 7.3: Validation Precision-Recall curves on HDFS datasets.
Figure 7.4: The local and global anomaly scores of HDFS log key sequences.
Figure 7.5: The visualization of HDFS datasets. Abnormal and normal data are denoted by red and blue dots, respectively.
Chapter 1
Introduction
1.1 Motivation
Sequential data, which consists of sequences represented as lists of events in a particular order, is ubiquitous. For example, in genetics, a gene, which is a sequence of four different types of nucleotides, is sequential data; in linguistics, sentences formed by words from a certain vocabulary are considered sequential data; in e-commerce, the behaviors of a customer in a session are sequential data. Given its wide applications, modeling sequential data has long been a central problem in computer science.
The goal of sequence modeling is to represent a sequence with a low-dimensional dense vector that incorporates as much information as possible. A fundamental type of information contained in sequences is the sequential dependency of events, and a large body of research has been devoted to capturing it. Recently, deep neural networks have gained great attention for modeling sequential data. In particular, recurrent neural networks (RNNs) [101], temporal convolutional networks [10, 26], and the Transformer [117] are among the most popular neural sequence models. Equipped with gating mechanisms, large receptive fields, and attention mechanisms, these models are very effective in capturing long-term dependencies in sequences and have led to great success in applications across a wide range of fields, including linguistics, e-commerce, and education [26, 56, 97].
(a) Student Exercise Sequence. (b) Four papers with their citation links.
Figure 1.1: Examples of event and sequence relation information
However, besides long-term sequential dependency, sequences also exhibit side information that remains under-explored. In particular, this dissertation focuses on the following five types of side information that are commonly seen.
• Event relation. Most existing popular sequence models assume that events are independent. However, this assumption does not hold in many scenarios, where events are related. Figure 1.1a shows an example of a student exercise sequence, where each event is a problem the student solves. In this figure, problems are inherently related because of the underlying knowledge or skills they require. Such problem relations provide essential information for modeling the student's knowledge state [19].
• Sequence relation. Besides event relations, sequences can also be correlated. For instance, Figure 1.1b shows four published papers (denoted by S1, S2, S3, S4) that are connected by their citation relations. In this case, each paper (Si) is a sequence of English words. To understand the topic of one paper, both the content (sequential dependency) and the citations (sequence relation) can play an important role.
(a) User shopping behavior sequence. (b) A toy English sentence.
Figure 1.2: Examples of temporal and hierarchy information
• Temporal information. Sequences are also associated with temporal information. For example, each event of a user behavior sequence on an e-commerce website is recorded with a timestamp indicating when the behavior happens, as shown in Figure 1.2a. In this case, the events are no longer evenly distributed along the timeline. Such temporal information enables us to model not only user interest but also its dynamics.
• Hierarchy information. Events can exhibit hierarchical structures. One example is the English language, where a sentence consists of a sequence of words and a word consists of a sequence of characters. Figure 1.2b gives a concrete example sentence as well as its hierarchical structure.
• Multi-scale sequential dependency. The sequential dependency of events in a sequence often occurs at multiple scales. Consider the anomalous sequence detection problem shown in Figure 1.3, where there are three user behavior sequences and two of them are anomalous. Specifically, in the second sequence, the anomalous user tries to hack an account and fails three times before succeeding. In the third sequence, the anomalous user adds several expensive identical items and changes the shipping address before checking out. Thus, suspicious behaviors can be reflected by both local (second sequence) and global (third sequence) sequential dependencies.

Figure 1.3: Examples of event and sequence relation information
In summary, the side information mentioned above can play an essential role in sequence learning tasks, as it provides rich information about sequences that complements the sequential dependency. Thus, in this dissertation, we study the problem of incorporating side information into sequence modeling. In the next section, we give a summary of our contributions.
1.2 Dissertation Contribution
This dissertation presents our efforts to design effective models that capture the side information mentioned above for sequence learning and to apply them in various real-world tasks. In summary, the major contributions of this thesis are listed below.
• We conduct pioneering research to explore event relations for sequence modeling. We design a neural model that is able to explicitly incorporate the event relation into sequence modeling, and we empirically verify its effectiveness.
• We are among the ﬁrst to identify the importance of sequence relation for learning
better representations of sequences. We develop an advanced framework that captures
the dependencies not only within sequences but also among sequences. We evaluate the
framework with extensive experiments and demonstrate the contribution of sequence
relation to sequence learning tasks.
• We provide a principled approach to model the temporal dynamics of a sequence and
build a novel recommender system that is able to recommend the right item at the
right time.
• We tackle the word recognition task, where the computer system needs to recognize the correct form of noised words. To achieve this, we design an effective neural sequence model to capture the hierarchical structure of the English language. We believe such a model can serve as a very important component in robust natural language processing systems.
• We design a novel one-class classifier for sequences that is aware of both global and local scales of sequential dependencies. Moreover, we apply it to the sequence anomaly detection problem and show very promising results with comprehensive experiments.
Next, we give the outline of this dissertation.
1.3 Dissertation Outline
Based on the different types of side information, the rest of this dissertation is organized as follows. Chapter 2 reviews the knowledge background of sequence modeling. Chapter 3 demonstrates our work on capturing event relations for sequence modeling. In Chapter 4, we focus on incorporating relations among sequences. We discuss our work on modeling temporal information in Chapter 5. Chapter 6 introduces our work on modeling hierarchy information in sequences. We present a novel one-class sequence classifier that captures multi-scale dependencies for the anomaly detection problem in Chapter 7. Finally, Chapter 8 concludes our work and identifies promising future directions.
Chapter 2
Knowledge Background
In this chapter, we review the knowledge background of sequence modeling. Specifically, we first review sequence modeling techniques, covering the basics of widely used deep neural networks for sequence modeling. In the second part, we introduce the typical tasks of sequence modeling as well as their applications.
2.1 Sequence Modeling Techniques
2.1.1 Basic Notations And Deﬁnitions
Before we give detailed mathematical formulations of sequence models, we first describe the basic notations and definitions that will be used throughout this chapter. We denote scalars by lower-case letters such as i and j, vectors by bold lower-case letters such as x and h, and matrices by bold upper-case letters such as W and U. For a matrix A, we denote the entry at the ith row and jth column as A(i, j), the ith row as A(i, :), and the jth column as A(:, j). In addition, we use (· · · ) to represent an event sequence, with subscripts indexing the events in the sequence, such as (x1, x2, x3).
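These notational conventions map directly onto array indexing in code. The following NumPy sketch is purely illustrative; the matrix values and the 5-dimensional event vectors are arbitrary choices for this example:

```python
import numpy as np

# A 3x4 matrix A; the entries are arbitrary illustration values.
A = np.arange(12).reshape(3, 4)

entry = A[1, 2]   # A(i, j): the entry at row i, column j
row = A[1, :]     # A(i, :): the i-th row
col = A[:, 2]     # A(:, j): the j-th column

# An event sequence (x1, x2, x3), stored as an ordered list of event vectors.
x = [np.zeros(5), np.ones(5), np.full(5, 2.0)]
```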
2.1.2 Recurrent Neural Networks
Arguably the most popular deep neural networks for sequential data are recurrent neural networks (RNNs) [101]. RNNs are fundamentally different from other neural networks, such as feedforward networks and convolutional neural networks, and are specialized for modeling sequential data. Most RNNs process the events one by one according to their sequential order. During this procedure, they maintain an internal state that encodes the information processed so far. Due to this specially designed process, they are able to scale to sequences of various lengths.
Concretely, let us denote the internal state and the representation of the event at step t as ht and xt, respectively. To obtain the state, a vanilla RNN takes the previous state ht−1 and xt as input and processes them with a neural cell defined as follows:

ht = f (Uht−1 + Wxt)    (2.1)

where U and W are trainable parameters and f (·) is an activation function that introduces non-linearity. The vanilla RNN described above naturally encodes the historical information and enjoys many advantages. For example, the model size does not change with the length of the input sequence. However, there are two major drawbacks. One is that it suffers from vanishing or exploding gradients, which can derail the learning procedure because error signals cannot be reliably propagated through the back-propagation process [13]. The other is its high computation time, which is linear in the length of the input sequence. To overcome the first drawback, more advanced variants have been proposed. The most successful ones are the gated recurrent unit (GRU) [20] and long short-term memory (LSTM) [57].
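The recurrence in Equation 2.1 can be sketched in a few lines of NumPy. This is a minimal illustration, not an implementation from this thesis: tanh is one common choice for f(·), and the dimensions, random seed, and 0.1 weight scaling are arbitrary assumptions for the example.

```python
import numpy as np

def vanilla_rnn(xs, U, W):
    """Run Equation 2.1 over a sequence: h_t = f(U h_{t-1} + W x_t)."""
    h = np.zeros(U.shape[0])          # initial state h_0 = 0
    states = []
    for x in xs:                      # process events one by one, in order
        h = np.tanh(U @ h + W @ x)    # here f is tanh
        states.append(h)
    return states

rng = np.random.default_rng(0)
d_h, d_x, T = 4, 3, 5                 # hidden size, input size, sequence length
U = rng.standard_normal((d_h, d_h)) * 0.1
W = rng.standard_normal((d_h, d_x)) * 0.1
xs = [rng.standard_normal(d_x) for _ in range(T)]

states = vanilla_rnn(xs, U, W)
```

Note that U and W are reused at every step, so the parameter count is independent of the sequence length T, while the loop itself makes the running time linear in T.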
GRU. The GRU utilizes a gating mechanism to control the information flow. Specifically, in the GRU, the current state ht is a linear interpolation between the previous state ht−1 and a candidate state h̃t:

ht = zt ⊙ ht−1 + (1 − zt) ⊙ h̃t    (2.2)

where ⊙ denotes element-wise multiplication and zt is the update gate, which controls how much the current state should be updated. It is obtained through the following equation:

zt = σ(Wzxt + Uzht−1)    (2.3)

where Wz and Uz are parameters and σ(·) is the sigmoid function, i.e., σ(x) = 1/(1 + e−x). In addition, the newly introduced candidate state h̃t is computed by Equation 2.4:

h̃t = g(Wxt + U(rt ⊙ ht−1))    (2.4)

where g(·) is the tanh function, g(x) = (ex − e−x)/(ex + e−x), and W and U are model parameters. rt is the reset gate, which determines the contribution of the previous state to the candidate state and is obtained as follows:

rt = σ(Wrxt + Urht−1)    (2.5)
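Equations 2.2 to 2.5 combine into one step function, sketched below with NumPy. The shapes, random seed, and weight scaling are illustrative assumptions; a practical implementation would also include bias terms, which are omitted here to mirror the equations above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU step following Equations 2.2-2.5."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate (Eq. 2.3)
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate (Eq. 2.5)
    h_tilde = np.tanh(W @ x_t + U @ (r * h_prev))    # candidate state (Eq. 2.4)
    return z * h_prev + (1.0 - z) * h_tilde          # interpolation (Eq. 2.2)

rng = np.random.default_rng(1)
d_h, d_x = 4, 3
shapes = [(d_h, d_x), (d_h, d_h)] * 3                # Wz, Uz, Wr, Ur, W, U
params = [rng.standard_normal(s) * 0.1 for s in shapes]

h1 = gru_step(rng.standard_normal(d_x), np.zeros(d_h), *params)
```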
LSTM. Similar to the GRU, the LSTM utilizes gating mechanisms to control the information flow. However, compared to the GRU, the structure of the LSTM is more complex, with three gates (forget, input, and output) and an additional cell state. More concretely, the LSTM is defined as follows:

ft = σg(Wf xt + Uf ht−1 + bf )
it = σg(Wixt + Uiht−1 + bi)
ot = σg(Woxt + Uoht−1 + bo)    (2.6)
c̃t = σc(Wcxt + Ucht−1 + bc)
ct = ft ⊙ ct−1 + it ⊙ c̃t
ht = ot ⊙ σh(ct)

where ft, it, and ot are the forget, input, and output gates, respectively, and W∗, U∗, and b∗ (∗ ∈ {f, i, o, c}) are trainable parameters. Although both the LSTM and the GRU alleviate the gradient exploding/vanishing problems, they still suffer from slow computation due to the recurrent procedure. In the next subsection, a sequence model without any recurrent procedure is introduced.
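Equation 2.6 can likewise be sketched as a single step function. As before, the dimensions, random seed, and weight scaling are arbitrary assumptions for illustration; σg is the sigmoid and σc, σh are tanh, which are the conventional choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, Ws, Us, bs):
    """One LSTM step following Equation 2.6. Ws/Us/bs hold the
    parameters for the f, i, o gates and the candidate cell, in order."""
    f = sigmoid(Ws[0] @ x_t + Us[0] @ h_prev + bs[0])     # forget gate
    i = sigmoid(Ws[1] @ x_t + Us[1] @ h_prev + bs[1])     # input gate
    o = sigmoid(Ws[2] @ x_t + Us[2] @ h_prev + bs[2])     # output gate
    c_tilde = np.tanh(Ws[3] @ x_t + Us[3] @ h_prev + bs[3])
    c = f * c_prev + i * c_tilde                          # cell state update
    h = o * np.tanh(c)                                    # hidden state
    return h, c

rng = np.random.default_rng(2)
d_h, d_x = 4, 3
Ws = [rng.standard_normal((d_h, d_x)) * 0.1 for _ in range(4)]
Us = [rng.standard_normal((d_h, d_h)) * 0.1 for _ in range(4)]
bs = [np.zeros(d_h) for _ in range(4)]

h_t, c_t = lstm_step(rng.standard_normal(d_x), np.zeros(d_h), np.zeros(d_h),
                     Ws, Us, bs)
```

The additive form of the cell update ct = ft ⊙ ct−1 + it ⊙ c̃t is what lets gradients flow over long ranges more easily than in the vanilla RNN.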
2.1.3 Transformer Model
Since its first appearance, the Transformer model has gained great success in many fields [117]. The major building block of the Transformer is the attention mechanism, which was introduced to alleviate the bottleneck caused by a fixed-length context vector in the machine translation problem [8]. In a nutshell, the Transformer consists of an encoder and a decoder. Since only the encoder is related to our work, we mainly focus on it and encourage readers to refer to the original paper for the details of the decoder.

The input to the Transformer encoder is the sequence of event embeddings, to which positional encodings are added. After this, the input is passed through a stack of identical blocks. Each block consists of a multi-head attention sublayer and a feedforward sublayer, which are connected with residual connections and layer normalization operations [53, 7]. The multi-head attention sublayer is the most important structure of the encoder. Roughly speaking, it allows every event to attend to all events and refines the representation of each event as a weighted sum of the representations of all events. Compared with RNNs, the Transformer encoder has two major advantages. The first is its effectiveness in capturing long-term dependencies, which is attributed to the direct connections between events in the multi-head attention sublayer. The second is the ease of parallelizing the computation: the representation of each event is computed independently of the others, so the number of sequential operations does not grow with the input length.
2.2 Sequence Modeling Problems
In this section, we will introduce sequence modeling problems, which can be roughly grouped into two categories: sequence classification and next event prediction.
2.2.1 Sequence Classiﬁcation
The goal of this problem is to learn a sequence model that can map unseen sequences to their labels. Depending on the definitions of sequence and label, this problem corresponds to a variety of tasks. For example, if the sequences are sentences and the label is the sentiment, i.e., positive/negative, then it is the sentiment classification task [108]. Another example is the spam detection task, where the sequence is an email and the label is whether the email is spam or not [64].
2.2.2 Next Event Prediction
In this problem, the model aims to predict the event at the next step given a sequence containing all the events seen so far. One typical task is language modeling, where the sequence is a prefix of a word/sentence and we want to predict the immediately following character/word [106]. Compared to the sequence classification problem, next event prediction often relies more on the model's ability to preserve the historical information.
2.3 Summary
This chapter covers the background knowledge of sequence modeling, including popular neural networks and common tasks. It lays the foundation for the next chapters, which present our efforts to tackle new challenges in sequence modeling.
Chapter 3
Modeling Event Relation
3.1 Introduction
Besides sequential dependencies, real-world sequences of events often exhibit side relations that may not be fully reflected in the sequences themselves. We give a few examples. In session-based recommendation [56], items involved in sessions can be related to each other due to factors such as their brands, the categories they belong to, and their prices. In the knowledge tracing scenario [97], where we want to model the knowledge state of a student based on her historical interactions with questions, questions can be related because they are designed to examine similar knowledge or skills. In the document modeling task [133], words can share features such as their syntactic functions. These event relations can be naturally captured by a side graph, where nodes are events in the sequences and there is an edge between two events if they are related. Figure 3.1 illustrates a toy example of sequences with and without side relations in the knowledge tracing task. In this example, there are 6 questions in total ({A, B, C, D, E, F}) in the database, and a student has 3 sequential events ({e1, e2, e3}), where each event includes the question the student answered and the correctness of the answer. In contrast to the sequence without event relations in Figure 3.1 (1), in Figure 3.1 (2) the 6 questions are related by the skills required to solve them, and the relations are modeled by a question-question graph. Given that the student
Figure 3.1: A toy example of a sequence with and without side dependencies in the knowledge tracing task. Note that each square indicates an interaction, where the purple node denotes the question, and the cross and check marks denote the correctness of the student's answer.
had correctly answered question "A" in e2, it can be inferred that the student is likely to correctly answer question "C", since they require similar skills as indicated by the question-question graph. Thus, incorporating event relations in addition to sequential dependencies has great potential to advance the modeling of sequences.
While existing models are effective in capturing sequential dependencies, the majority of them cannot take advantage of the event relations that naturally exist among events of sequences. Thus, dedicated efforts to develop more advanced sequence learning models are needed. However, several obstacles make exploiting event relations for sequence learning a challenging problem. First, sequential dependencies and event relations are inherently different: sequential dependencies rely entirely on the sequential data, while event relations capture intrinsic properties of events. Thus, it can be very difficult to incorporate them simultaneously. Meanwhile, sequential dependencies and event relations can contribute differently in sequence modeling tasks. Therefore, how to prevent one from being dominated by the other during the learning process is another obstacle. One straightforward strategy to incorporate event relations into sequence learning is to apply graph embedding algorithms such as LINE [109] and Node2Vec [44] to the side graph to obtain a low-dimensional representation vector for each event, which can then be utilized as event attributes in existing sequence learning models. However, this solution can be suboptimal since the network embedding space may not align well with the sequence representation space. Thus, we investigate the problem of modeling sequences with event relations. In an attempt to overcome the aforementioned obstacles, we propose a novel sequence learning framework that captures both sequential dependencies and event relations simultaneously in an end-to-end manner.
3.2 Problem Statement
In this section, we formally define the problem we aim to solve. Let E = \{e_1, e_2, e_3, \cdots, e_n\} be an event set with n events in total, and let a sequence S be denoted as S = (x_1, x_2, \cdots, x_i, \cdots, x_t, x_{t+1}), where x_i is referred to as the ith event. In addition, each event x_i in a sequence S involves one event from E, denoted as g_i. In many real-world applications, events in E exhibit some intrinsic relations that are captured as a weighted undirected graph G. Following the standard way to represent graphs, G is denoted by an adjacency matrix A \in R^{n \times n}, where A_{i,j} indicates the strength of the relation between events e_i and e_j.

Most sequence learning tasks follow the supervised learning setting in which each sequence S_i in the training set S_L = \{(S_i, y_i)\}_1^K is associated with a label y_i. Sequence learning aims to utilize the information in S_L to predict the labels for sequences without labels. Depending on the specific tasks, the definitions of labels are different. For
example, in natural language processing, labels can be the sentiment of a sentence [30]; in computer vision, labels can be the content of images [119]; and in session-based recommendations, labels can be the sentiment of users on items [56]. In this work, we focus on user behavior modeling tasks where the labels are the user's actions on a given event in E, as shown in Figure 3.1. In particular, for each sequence S_i = [x_1, x_2, \cdots, x_t, x_{t+1}] in the training set, the event x_j = (g_j, y_j) for j \in \{1, 2, \cdots, t\}, where y_j denotes the action on g_j, and the label y_i = y_{t+1}. In other words, given the actions on the events from 1 to t, the task we are investigating is to predict the action on the event g_{t+1} in the next event x_{t+1}. This is an essential task in a wide range of real-world applications. For instance, in recommendations, this task is equivalent to predicting the actions of users on the next recommended item; and in education, it is the task of knowledge tracing, which aims to predict whether a user gives a correct or wrong answer to the next question. Note that given the generality of the sequence modeling component of the proposed framework, it is straightforward to extend the framework to other tasks, and we leave this as future work.
With the above notations and deﬁnitions, the problem we aim to solve in this work is
formally deﬁned as follows:
Given an event set E with event relations in A and a labeled sequence set S_L = \{(S_i, y_i)\}_1^K, where each sequence S_i = [x_1, x_2, \cdots, x_t, x_{t+1}] is generated by a specific user u_i and each event x_t = (g_t, y_t) involves one event g_t \in E and the action label y_t, our goal is to build a framework that is able to predict the labels of unlabeled sequences by capturing both sequential dependencies and event relations.
Figure 3.2: The overall structure of the proposed framework. It consists of two major components: SeqBlock and SideBlock.
3.3 The Proposed Framework
In real sequential data, events naturally exhibit intrinsic relations in addition to sequential dependencies. In this section, we propose a framework that is able to take advantage of event relations for sequence learning. The overall structure of the framework is shown in Figure 3.2. It consists of two major components: SeqBlock and SideBlock, which capture the sequential dependencies and the event relations, respectively. Specifically, the SeqBlock takes a sequence as input and outputs a latent representation vector. In addition, the SideBlock takes the event relation graph as input and outputs event representation vectors, which are queried by event sequences. Finally, the framework predicts the user action label based on the sequence representation and the queried event representation. In the next subsections, we describe each component in detail.
3.3.1 Learning Sequential Dependencies
Previously, sequence learning relied on Recurrent Neural Networks (RNNs) as fundamental modules. Many RNN variants have been proposed to solve a notoriously difficult problem: capturing long-term sequential dependencies. Among them, structures with gating mechanisms such as Long Short-Term Memory [57] and the Gated Recurrent Unit [20] stand out due to their widespread success [23]. However, RNN-based models suffer from an inefficiency issue because of their sequential nature, which makes it very difficult to parallelize within-sequence computation. To overcome this issue, many efforts have been devoted to developing models without recurrent structures, such as temporal convolutional networks, which can achieve performance similar to RNN-based models while being much faster to compute [10]. Recently, promising results on a wide range of sequence learning tasks have been obtained by models that rely heavily on the attention mechanism, including memory networks and the Transformer [117, 105]. Inspired by the aforementioned advances, we develop the sequential dependency component, which we refer to as SeqBlock.
3.3.1.1 The Input Layer
The input of the SeqBlock is a sequence of events (x_1, x_2, \cdots, x_t). Each event x_j is denoted by three types of information: the corresponding event g_j, the action label y_j, and the sequential order information in the sequence. We use a vector g_j \in R^{d/2} to indicate the event. To incorporate the action information, we define the event vector for x_j as follows:

x_j = [g_j \| 0] if y_j = 1; x_j = [0 \| g_j] if y_j = -1    (3.1)

where \| is the concatenation operation, g_j = e_i if g_j is the ith event in E, and 0 \in R^{d/2} is a zero vector. e_i is the embedding of the ith event in E, which can be pre-trained or randomly initialized. Note that in this work we consider binary labels; the formulation can be naturally extended to the multi-label scenario. In addition, we want to embed the order information in the event vector. As suggested by previous work [40, 117], we add the order information into x_j as:

x_j(i) = x_j(i) + \sin(j / 10000^{i/d}) if i \% 2 = 0; x_j(i) = x_j(i) + \cos(j / 10000^{(i-1)/d}) otherwise    (3.2)
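The input layer above can be sketched in a few lines of NumPy: Equation 3.1 places the event embedding in the first or second half of the vector depending on the action label, and Equation 3.2 adds the sinusoidal order information. The dimension d = 8 and the toy embedding are illustrative assumptions:

```python
import numpy as np

def event_vector(g_emb, y):
    """Eq. 3.1: concatenate the embedding with a zero vector by action label y in {1, -1}."""
    zero = np.zeros_like(g_emb)
    return np.concatenate([g_emb, zero]) if y == 1 else np.concatenate([zero, g_emb])

def add_positional_encoding(x, j):
    """Eq. 3.2: add sinusoidal order information to event vector x at position j."""
    d = x.shape[0]
    out = x.copy()
    for i in range(d):
        if i % 2 == 0:
            out[i] += np.sin(j / 10000 ** (i / d))        # even dimensions: sine
        else:
            out[i] += np.cos(j / 10000 ** ((i - 1) / d))  # odd dimensions: cosine
    return out

d = 8                                    # total event-vector dimension (d/2 per half)
rng = np.random.default_rng(0)
g = rng.normal(size=d // 2)              # a toy embedding of one event in E
x1 = add_positional_encoding(event_vector(g, 1), j=1)   # correct action at position 1
x2 = add_positional_encoding(event_vector(g, -1), j=2)  # incorrect action at position 2
```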
3.3.1.2 The Sequential Layer
Inspired by previous works [117, 105], we utilize the attention mechanism to capture the sequential dependency among events. As shown in Figure 3.3, each sequential layer consists of two sublayers, one attention sublayer and one feedforward sublayer, and multiple sequential layers can be stacked together. The input of the first sequential layer is the output of the event layer plus one special vector x_{t+1}, which is randomly initialized and whose corresponding final state will be used as the representation of the whole sequence. Next, without loss of generality, we illustrate the ith sequential layer and assume its input vectors are u^i_j, j \in \{1, 2, \cdots, t+1\}. Since the output of the event layer is the input of the first sequential layer, u^1_j = x_j. Next, we introduce the attention sublayer and the feedforward sublayer in the ith sequential layer.

Attention Sublayer: The attention sublayer is shown in the right subfigure of Figure 3.3. The input of the attention sublayer is u^i_j, j \in \{1, 2, \cdots, t+1\}. For each input vector u^i_j, the corresponding output vector of the attention sublayer is v^i_j, which is obtained through the
Figure 3.3: The structure of the SeqBlock is shown in the left subfigure. It consists of two sublayers: one attention sublayer and one feedforward sublayer. The right subfigure shows the structure of the attention sublayer.

multihead mechanism, which is designed to capture the event sequential dependencies from different aspects. Next, we use v^i_m to illustrate the multihead mechanism.
The overall structure of the multihead mechanism is demonstrated in Figure 3.4. Specifically, we assume that there are n_a attention heads, each of which corresponds to a set of query, key, and value spaces. In head_k, to obtain the attention score of each input vector with respect to u^i_m, the input vectors are projected into the key space and u^i_m is projected into the query space. Specifically, let \alpha^{ik}_j be the attention score of u^i_j with respect to u^i_m in head_k. Then \alpha^{ik}_j is calculated as follows:

\alpha^{ik}_j = \frac{\exp(u^i_m W_{Q_k} \cdot u^i_j W_{K_k} / \sqrt{d})}{\sum_{l=1}^{t+1} \exp(u^i_m W_{Q_k} \cdot u^i_l W_{K_k} / \sqrt{d})}    (3.3)

where W_{Q_k} \in R^{d \times d} and W_{K_k} \in R^{d \times d} are the query and key projection matrices in head_k, respectively, and \cdot is the dot product operator. With the obtained attention scores, the new representation of u^i_m in head_k can be calculated as the weighted sum of all the input vectors in the value space. Let h^{ik}_m be the new representation of u^i_m in head_k:

h^{ik}_m = \sum_{l=1}^{t+1} \alpha^{ik}_l u^i_l W_{V_k}    (3.4)

where W_{V_k} \in R^{d \times d_v} is the projection matrix that maps the input vectors into the value space. In this way, we obtain a new representation of u^i_m in each head. Thus, given h^{ik}_m, k \in \{1, 2, \cdots, n_a\}, v^i_m can be calculated as follows:

v^i_m = [h^{i1}_m \| h^{i2}_m \| \cdots \| h^{i n_a}_m] W_O    (3.5)

where W_O \in R^{d_v n_a \times d} is another projection matrix. The multihead attention mechanism not only allows the model to learn the sequential dependencies of events effectively, but also enables the calculations in Equation 3.4 to proceed in parallel, which considerably reduces the time cost of the training procedure. As previously mentioned, sequential layers can be stacked together to form a very deep structure. Thus, to reduce the training difficulty when the structure goes deep, following previous work [53], we add a shortcut connection to the attention sublayer followed by layer normalization [7]. Specifically, given v^i_j, j \in \{1, 2, \cdots, t+1\}, the shortcut connection and layer normalization operations output the vector \tilde{v}^i_j as follows:

\tilde{v}^i_j = LayerNorm(v^i_j + u^i_j)    (3.6)

where we define the LayerNorm(\cdot) function similarly to that in [7].
Feedforward sublayer: The second sublayer consists of a fully connected feed-forward network (FFN). Its input is \tilde{v}^i_j, j \in \{1, 2, \cdots, t+1\}. The output of the feedforward sublayer for \tilde{v}^i_j is \hat{v}^i_j, which is calculated as follows:

\hat{v}^i_j = FFN(\tilde{v}^i_j) = ReLU(\tilde{v}^i_j W_1 + b_1) W_2 + b_2    (3.7)

where W_1 \in R^{d \times d_f}, W_2 \in R^{d_f \times d}, b_1 \in R^{d_f}, and b_2 \in R^d are the learnable parameters and ReLU(\cdot) is the non-linear activation function. Similarly, the shortcut connection and layer normalization operations are also applied to the feedforward sublayer, which produces u^{i+1}_j, j \in \{1, 2, \cdots, t+1\}, corresponding to \tilde{v}^i_j. Assuming that there are n_{sl} sequential layers stacked together, the (t+1)th output vector of the last layer, u^{n_{sl}+1}_{t+1}, is used as the representation of the whole sequence.

In summary, the SeqBlock takes the event sequence S as input and outputs a representation of the sequence, u^{n_{sl}+1}_{t+1}. Although the SeqBlock effectively captures the sequential dependencies of the events, it completely ignores the side dependencies that are not reflected in the sequences. Next, we introduce the model component that incorporates the event relations.
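To make the computation concrete, a single sequential layer (Equations 3.3-3.7 with the shortcut connections and layer normalization) can be sketched in NumPy for one unbatched sequence. All shapes and initializations are toy assumptions, and the LayerNorm here omits the learnable scale and shift parameters:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def layer_norm(v, eps=1e-5):
    """Simplified LayerNorm without learnable scale/shift."""
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def sequential_layer(U, p, n_heads):
    """One SeqBlock layer on U of shape (t+1, d): multihead attention
    (Eqs. 3.3-3.5), shortcut + LayerNorm (Eq. 3.6), feedforward (Eq. 3.7)."""
    t1, d = U.shape
    heads = []
    for k in range(n_heads):
        Q = U @ p["WQ"][k]                         # queries, (t+1, d)
        K = U @ p["WK"][k]                         # keys, (t+1, d)
        V = U @ p["WV"][k]                         # values, (t+1, d_v)
        scores = Q @ K.T / np.sqrt(d)              # scaled dot products (Eq. 3.3)
        A = np.apply_along_axis(softmax, 1, scores)
        heads.append(A @ V)                        # weighted sums for all m at once (Eq. 3.4)
    V_out = np.concatenate(heads, axis=1) @ p["WO"]          # Eq. 3.5
    V_til = np.array([layer_norm(v) for v in V_out + U])     # Eq. 3.6
    F = np.maximum(V_til @ p["W1"] + p["b1"], 0) @ p["W2"] + p["b2"]  # Eq. 3.7
    return np.array([layer_norm(v) for v in F + V_til])      # output u^{i+1}

# Toy configuration (illustrative only).
d, dv, df, nh, t1 = 8, 4, 16, 2, 5
rng = np.random.default_rng(0)
p = {"WQ": rng.normal(scale=0.1, size=(nh, d, d)),
     "WK": rng.normal(scale=0.1, size=(nh, d, d)),
     "WV": rng.normal(scale=0.1, size=(nh, d, dv)),
     "WO": rng.normal(scale=0.1, size=(dv * nh, d)),
     "W1": rng.normal(scale=0.1, size=(d, df)), "b1": np.zeros(df),
     "W2": rng.normal(scale=0.1, size=(df, d)), "b2": np.zeros(d)}
U = rng.normal(size=(t1, d))   # t events plus the special x_{t+1} vector
out = sequential_layer(U, p, nh)
```

Note how the per-position attention of Equation 3.4 becomes a single matrix product `A @ V`, which is what makes the within-sequence computation parallelizable.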
3.3.2 Capturing Event Relations
In this subsection, we introduce the SideBlock component of the framework, which aims to capture the event relations.

One straightforward solution to incorporate the relations of events is to apply graph embedding algorithms to the event-event graph G to obtain, for each event, an embedding vector that contains the relation information. For instance, state-of-the-art network embedding algorithms such as LINE [109] and Node2Vec [44] can map the nodes in a graph to a subspace that preserves the maximum neighborhood information of nodes. Thus, the event embedding
Figure 3.4: An illustrative example of the multihead attention mechanism.
obtained by such algorithms naturally carries the relation information. We can then use these graph embeddings as the event representation vectors, which can be input to the SeqBlock. In fact, this solution is similar to a widely used approach in natural language processing tasks, where pre-trained word embeddings that capture word relations from large corpora are utilized to improve performance [18, 24, 68]. However, this solution may not be sufficient to fully take advantage of event relations because: 1) as the sequence learning block can go very deep and can focus entirely on modeling the sequential dependencies, it is likely that sequential dependencies will dominate event relations; and 2) the event embeddings are obtained by graph embedding algorithms that run in a separate procedure, so they may not be optimal for the specific task. Thus, we propose the SideBlock to effectively capture the event relations. Specifically, it consists of two layers, one linear projection layer and one aggregation layer. Next, we describe each layer.
3.3.2.1 Linear Projection Layer
The input of the linear projection layer is the event embedding vectors e_j, j \in \{1, 2, \cdots, n\}. There are many ways to compute e_j. For example, e_j can be sampled from a normal distribution N(0, I), in which case it only serves as an indicator of the jth event, or it can be obtained from the output of a graph embedding algorithm. Given e_j, the linear projection function in this layer is defined as:

\hat{e}_j = e_j W_P + b_P    (3.8)

where W_P \in R^{d_e \times d} and b_P are the projection matrix and the bias term, respectively, and \hat{e}_j \in R^d is the projected vector of the jth event. As stated previously, the embedding space may not align well with the sequence representation space. Therefore, the linear projection layer enables the model to learn to project the embedding vectors into a space where the aggregation layer, described next, can preserve the event relations.
3.3.2.2 Aggregation Layer
Intuitively, if an event is related to its neighbors in the graph G, it is desirable that its repre-
sentation is also close to that of its neighbors. For instance, in the knowledge tracing scenario,
questions requiring similar skills should be located in a similar spot in the representation
space. To impose this intuition, inspired by previous work [48, 71], we design 3 sublayers
in the aggregation layer – feedforward sublayer, neighbor sublayer and integration sublayer.
Note that, the aggregation layer can also be stacked to a deep structure. Without the loss of
generality, next we detail the ith aggregation layer. Let the input of the ith aggregation layer
is cij, j ∈ {1, 2,··· , n}. Thus, c1j = ˆej.
24
The input of the feedforward layer are cij, j ∈ {1, 2,··· , n}. For each cij, the feedforward
will output a vector ˆcij, which can be calculated as:
ˆcij = ReLU (cijWa1 + ba1)
(3.9)
where Wa1 ∈ Rd×d and ba1 ∈ Rdare learnable parameters. This feedforward layer aims to
learn better representation of events in the latent space. The second sublayer is to aggregate
the neighbors’ information. The input is ˆcij, j ∈ {1, 2,··· , n} and for each ˆcij, this sublayer
will also output a vector ˜cij which is deﬁned as:
(cid:88)
˜cij =
ˆA(j, l)ˆcil
(3.10)
l∈N j
where N j is the neighborhood set of jth events and ˆA is the normalized adjacent matrix for
G, which is deﬁned as:
ˆA = D−1/2AD−1/2
(3.11)
where D is the diagonal matrix and D(i, i) =(cid:80)
j A(i, j). With the latent vector ˜cij that
captures neighborhood information of the jth event, the third sublayer is to integrate ˜cij and
ˆcij to obtain the ﬁnal representation of the event, which is deﬁned as follows:
c(i+1)j = ReLU ([˜cij(cid:107)ˆcij]Wa2 + ba2)
(3.12)
where Wa2 ∈ R2d×d and ba2 ∈ Rd are the model parameters. Let nag to be the total number
of aggregation layers. The output of the SideBlock will be cnagj, j ∈ {1, 2,··· , n}.
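A minimal sketch of one aggregation layer (Equations 3.9-3.12), operating on all n events at once with the normalized adjacency of Equation 3.11; the toy graph, shapes, and initialization are assumptions:

```python
import numpy as np

def normalized_adjacency(A):
    """Eq. 3.11: A_hat = D^{-1/2} A D^{-1/2}, with D the diagonal degree matrix."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def aggregation_layer(C, A_hat, p):
    """One SideBlock aggregation layer on C of shape (n, d)."""
    C_hat = np.maximum(C @ p["Wa1"] + p["ba1"], 0)   # Eq. 3.9: feedforward sublayer
    C_til = A_hat @ C_hat                            # Eq. 3.10: neighbor aggregation
    Z = np.concatenate([C_til, C_hat], axis=1)       # concatenate for integration
    return np.maximum(Z @ p["Wa2"] + p["ba2"], 0)    # Eq. 3.12: integration sublayer

# Toy weighted undirected graph on n = 6 events (illustrative only).
n, d = 6, 4
rng = np.random.default_rng(0)
A = rng.random((n, n)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
p = {"Wa1": rng.normal(scale=0.3, size=(d, d)), "ba1": np.zeros(d),
     "Wa2": rng.normal(scale=0.3, size=(2 * d, d)), "ba2": np.zeros(d)}
C = rng.normal(size=(n, d))                          # projected embeddings e_hat
out = aggregation_layer(C, normalized_adjacency(A), p)
```

The matrix product `A_hat @ C_hat` is equivalent to the neighbor sum in Equation 3.10, since non-neighbor entries of A_hat are zero.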
3.3.3 The Objective Function and Training
With the previously described model components that capture both sequential dependencies and event relations, the proposed framework SEE is formulated as follows:

u^{n_{sl}+1}_{t+1} = SeqBlock(S)    (3.13)
\{c_{n_{ag}1}, c_{n_{ag}2}, \cdots, c_{n_{ag}n}\} = SideBlock(\{e_1, e_2, \cdots, e_n\}, A)    (3.14)
z = query(S, \{c_{n_{ag}1}, c_{n_{ag}2}, \cdots, c_{n_{ag}n}\})    (3.15)
\hat{y} = f(u^{n_{sl}+1}_{t+1}, z)    (3.16)

where S = (x_1, x_2, \cdots, x_t) is the input sequence and the SeqBlock produces u^{n_{sl}+1}_{t+1}, which embeds the sequential information of the sequence. \{e_1, e_2, \cdots, e_n\} are the initial event embedding vectors, and the SideBlock produces a new set of representation vectors for events in a latent space that aligns well with the sequence representation space. In addition, S queries one event vector from \{c_{n_{ag}1}, c_{n_{ag}2}, \cdots, c_{n_{ag}n}\} according to g_{t+1}. Specifically, z = c_{n_{ag}j} if g_{t+1} is the jth event in E. Finally, f(\cdot) is the prediction layer and \hat{y} is the predicted action on the next event g_{t+1}. There are many choices for f(\cdot); in this work, we define it as follows:

f(u^{n_{sl}+1}_{t+1}, z) = \sigma(u^{n_{sl}+1}_{t+1} \cdot z)    (3.17)

where \sigma(\cdot) is the sigmoid function. We then define the following loss function to train the framework:

L = \sum_{j=1}^{K} \left[ -y_j \log(\hat{y}_j) - (1 - y_j)\log(1 - \hat{y}_j) \right]    (3.18)

where K is the number of sequences in the training set, and \hat{y}_j and y_j are the predicted and true labels for the jth sequence, respectively.

To effectively train the proposed framework SEE, we exploit the widely used Truncated Back Propagation Through Time (TBPTT) algorithm. Specifically, we first define M to be the number of timesteps used for both the forward- and backward-pass procedures. Then, for each sequence, we take M consecutive samples (x_k, y_k), k \in \{j, \cdots, j + M\}, apply the forward pass of the framework to these samples, and obtain an error signal. Finally, according to the error, the backward pass updates the parameters of the proposed framework.
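The prediction function and loss (Equations 3.17-3.18) reduce to a dot product, a sigmoid, and binary cross-entropy. The sketch below assumes labels in {0, 1}, toy vectors, and a small epsilon for numerical stability:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(u_seq, z_event):
    """Eq. 3.17: predicted action probability from the sequence and event vectors."""
    return sigmoid(u_seq @ z_event)

def bce_loss(y_hat, y, eps=1e-12):
    """Eq. 3.18 summed over a batch of sequences; eps avoids log(0)."""
    return float(np.sum(-y * np.log(y_hat + eps) - (1 - y) * np.log(1 - y_hat + eps)))

rng = np.random.default_rng(0)
u = rng.normal(size=8)    # SeqBlock output for one sequence (toy)
z = rng.normal(size=8)    # queried SideBlock vector for event g_{t+1} (toy)
y_hat = predict(u, z)
loss = bce_loss(np.array([y_hat]), np.array([1.0]))  # true label y = 1
```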
3.4 Experiment
In this section, we evaluate the proposed framework SEE on real sequential data. In the following subsections, we first describe the two tasks and the corresponding datasets used to assess the sequence learning performance. Then we present and analyze the experimental results. Finally, we conduct a model component analysis to gain a deeper understanding of the proposed framework SEE.
3.4.1 Experimental Settings
In this work, we choose two tasks from two diﬀerent domains to evaluate the performance of
the proposed framework, i.e., knowledge tracing and session-based recommendations.
3.4.1.1 Knowledge Tracing
In the education field, knowledge tracing aims to monitor students' knowledge states and skill acquisition levels, which is highly important for personalized education [97, 19]. One of the key tasks of knowledge tracing is to predict a student's future interactions with questions in a question database based on her past interactions, where an interaction denotes the student answering one question. Thus, it is easy to formulate this task as the user behavior sequence modeling problem stated in Section 3.2. Specifically, we regard past interactions as the event sequence, where each question is an event and the correctness of the student's answer is the action. The task is then to predict whether the student gives a correct or wrong answer to the next question. To conduct experiments on this task, we collect a GMAT dataset from a GMAT preparation mobile application 1, which is one of the most popular apps of its kind in China. This dataset describes students' interactions with GMAT questions that are designed to improve students' problem-solving skills. All student identification information is removed. Specifically, it contains 90,831 students and their 16,002,324 records with 8,684 questions. Moreover, each interaction record consists of one student ID, one question ID, the correctness of the student's answer, and the timestamp of the interaction. Thus, for each student, a sequence of interactions can be formed according to the timestamps. We filter out students who have only a few interactions, since we are not focusing on the "cold start" problem. In addition, we also remove the questions that do not appear in the interactions of the remaining students. After this filtering procedure, there are 20,251 sequences and 7,530 questions. We then construct a weighted question relation graph according to the knowledge and skills required by the questions. We further randomly split the sequences into training, validation, and testing sets such that each interaction sequence is put in one and only one of the three sets. As a result, the training, validation, and test sets have 14,176, 3,038, and 3,037 sequences, respectively.

1 http://www.kmf.com/
3.4.1.2 Session-based Recommendation
Session-based recommendation is becoming increasingly popular in the recommender systems community. It aims to recommend items by modeling users' preferences from their past interactions. To achieve this goal, most previous works [56, 127, 143] have chosen RNN-based models to build the recommender systems, which tend to ignore the relations of items. In this subsection, we apply the proposed framework SEE to solve the recommendation task, which can also be formulated as our previously defined problem, with items being the events and user ratings being the actions. To evaluate the proposed framework on this task, we use the MovieLens dataset, which is publicly available 2 and has been used in many recommendation-related works [51, 141, 58, 78]. This dataset describes user preferences for movies. Specifically, it contains 20 million ratings for movies from 138,493 users. Each rating is a 10-scale score, and in this work, we convert the rating to a binary score indicating whether the user likes a movie or not: score = 1 if rating > 4, otherwise score = 0. Each rating is also associated with a timestamp, according to which all ratings given by one user can be arranged as one sequence. Moreover, the dataset also provides additional information about the movies, including their title, genre, release year,

2 https://grouplens.org/datasets/movielens/
Figure 3.5: Performance comparison of the baselines and SEE. Results on the knowledge tracing and recommendation tasks are shown in the left and right subfigures, respectively. The y-axis is the AUC score; a higher score means better prediction. For each task, models are grouped by the input vectors: Gaussian indicates that models use vectors sampled from a Gaussian distribution to represent events, while LINE indicates that events are represented by vectors obtained through LINE.
etc. We apply a similar filtering procedure to this dataset, removing users who give few ratings and movies that receive few ratings. In the end, there are 20,000 users and 8,270 movies left. We also construct an item relation graph for movies according to their shared attributes. All the sequences are split into training, validation, and testing sets. The resulting training, validation, and test sets contain 14,000, 3,000, and 3,000 sequences, respectively.

Evaluation Metric: As we want to assess the performance of predicting the action on the next event, we use the Area Under the ROC Curve (AUC) to measure the action prediction performance, since it is widely used and especially suitable for imbalanced data. A higher AUC value indicates better performance.
3.4.2 Representative Baselines
Recent years have witnessed the great success of RNN-based models in sequence learning tasks. Though they are effective in capturing sequential dependencies, they tend to ignore event relations. Thus, to demonstrate the importance of event relations and verify the effectiveness of SEE, we compare it with state-of-the-art RNN-based models, which can be divided into two groups. The first group only uses sequential information, while the second group utilizes both sequential and event relation information by first performing graph embedding and then taking the resulting event embeddings as input to sequence learning. The baselines are as follows:
• RNN + Gaussian. This is the recurrent neural network with vanilla cells. At step j, it outputs a state h_j = g(W x_j + U h_{j-1} + b), where x_j is the input vector, g(\cdot) is the activation function, h_j is the output state, and W, U, b are the model parameters. The last state h_t is regarded as the final representation of the whole sequence and is thus used to predict the user action for each event by the following prediction function:

s = \sigma(h_t W_S + b_S)    (3.19)

where W_S \in R^{d \times n} and b_S \in R^n are the parameters of the prediction layer. Thus, if the event in x_{t+1} is the ith event, the predicted label is \hat{y}_j = s(i). The initial vectors for events are sampled from a Gaussian distribution.
• RNN + LINE. This baseline is based on the RNN. However, to incorporate the event relations, the initial vectors are obtained by running the graph embedding algorithm LINE on the event-event graph.
• LSTM + Gaussian [57]. LSTM is a variant of the vanilla RNN that introduces a gating mechanism, which is essential for capturing long-term dependencies. The initial vectors for events are sampled from a Gaussian distribution.

• LSTM + LINE. It uses LSTM to capture sequential information while using embeddings produced by the graph embedding algorithm LINE as initial vectors to capture event relations.

• GRU + Gaussian [20]. GRU is another popular recurrent network cell that adopts the gating mechanism. We initialize its input from a Gaussian distribution.

• GRU + LINE. It uses LINE to learn embeddings from the event-event graph to initialize the input of the GRU. It incorporates sequential and side information in a separate manner.

For a fair comparison, we use Equation 3.1 to incorporate the action information into the event vectors for the baseline models. Moreover, as all the baseline models process event vectors in sequential order, there is no need to add the positional encoding to the event vectors.
Implementation: The Gaussian vectors for events are sampled from the Gaussian distribution
N (0, I). The embedding vectors are obtained by running the graph embedding algorithm
LINE using the code 3 provided by the authors. In particular, we choose the second-order
proximity version of this algorithm as it consistently performs better than the first-order
version [109]. Unless specified otherwise, all the models were implemented in PyTorch 4 and
the mini-batch stochastic Adam optimizer was used during training [69]. To
prevent the models from overfitting, dropout was used for all the methods [104]. In
addition, we applied an early stopping strategy: training was stopped when there was no
improvement in the performance on the validation set for 10 consecutive epochs.
3https://github.com/tangjianpku/LINE
4https://pytorch.org/
For each model, the hyperparameters, including the learning rate and dropout rate, were chosen
according to the performance on the validation set.
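The early stopping strategy above can be sketched as follows (a hypothetical helper; `train_step` and `validate` stand in for one training epoch and one validation run, and the patience of 10 epochs matches the setting described above):

```python
def train_with_early_stopping(train_step, validate, max_epochs=200, patience=10):
    """Stop when the validation score has not improved for `patience` epochs."""
    best_score, best_epoch = float("-inf"), 0
    for epoch in range(max_epochs):
        train_step()                       # one epoch of mini-batch updates
        score = validate()                 # performance on the validation set
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break                          # no improvement for `patience` epochs
    return best_score, best_epoch

# Toy run: the validation score plateaus after the third epoch
scores = iter([0.1, 0.2, 0.5] + [0.4] * 50)
best, at = train_with_early_stopping(lambda: None, lambda: next(scores))
print(best, at)   # 0.5 2
```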
3.4.3 Experimental Results
In this subsection, we present and compare the experimental results of baseline methods and
SEE on the aforementioned datasets.
The results for the knowledge tracing and recommendation tasks are shown in the left and right
subfigures of Figure 3.5, respectively. From the figure, we make the following observations:
• In most cases, RNN has the worst performance among the baseline methods. This is
because LSTM and GRU have gating mechanisms that enable them to capture
long-term dependencies. The results are consistent with the findings of previous
works [23];
• For most methods, representing events with the embedding vectors obtained from the
graph embedding algorithm consistently improves performance. This clearly shows
the contribution of event relations information to sequence modeling.
• It is very interesting that, when we compare the improvement brought by the
graph embedding vectors, the performance gain of the RNN-based baseline models on the GMAT
dataset is more significant than that on the MovieLens dataset, while the improvement
for SEE is less significant on the GMAT dataset than on the MovieLens dataset. The
reason is that the mechanism by which the graph embedding vectors improve
performance differs between the baseline RNN-based models and SEE. For the baseline models,
the graph embeddings contain the event relations. Thus, if the embedding space aligns
well with the sequential representation space, as in the case of the GMAT dataset, it is
easy for the RNN-based models to learn the event relations in the latent space. On the
contrary, when the embedding space of events does not align well with the sequence
representation space, as indicated by the MovieLens case, the improvement brought by
the dependency information is limited. On the other hand, SEE is a unified framework
that incorporates both types of information and learns a latent embedding space that aligns
well with the sequence representation space. Thus, better initialization does not necessarily
lead to better performance of SEE, as the optimization procedure can also play an
important role;
• In all cases, SEE outperforms the baseline methods by a large margin. We attribute
this superior performance to its ability to effectively capture both sequential
dependencies and event relations. A more detailed analysis of the contributions of the
major framework components is given in the following subsection.
To sum up, the results for both knowledge tracing and recommendation tasks have clearly
demonstrated the advantage of exploiting event relations and veriﬁed the eﬀectiveness of the
proposed framework SEE. Next, we will analyze the important components of SEE to gain
better understanding of its performance.
3.4.4 Component Analysis
The proposed framework SEE has shown signiﬁcant performance gain over the baseline models.
To understand its effectiveness, in this subsection, we analyze two important components of
SEE, namely the SeqBlock and the SideBlock, which capture the sequential dependencies and
event relations, respectively. To do so, we define the following variants of SEE:
• SEE-SeqBlock-LINE: this is the variant that only has the SeqBlock. Thus, it only
captures the sequential dependencies. In addition, it uses vectors sampled from a Gaussian
distribution as the event representation;
• SEE-SeqBlock: this variant is similar to SEE-SeqBlock-LINE, but uses graph embedding
vectors as the event representation;
• SEE-SideBlock: this variant only has the SideBlock. Due to the lack of a SeqBlock, we sample
a random vector to represent the sequential information to avoid further changes
to the framework. Thus, this variant only captures event relations.
The performance of the model variants on both datasets is shown in Figure 3.6. The
following can be observed:
• SEE-SeqBlock outperforms SEE-SeqBlock-LINE, which is consistent with previous
ﬁndings and suggests the contribution of event relations again;
• SEE-SideBlock-LINE has much better performance than the variant with only the SideBlock
on both datasets. This is expected because the sequential information is undoubtedly
important in sequential learning tasks.
In summary, event relations are important for sequence modeling and can significantly boost
performance if properly captured. In addition, the proposed framework achieves the
best performance because of its ability to learn both sequential dependencies and event relations effectively.
3.5 Discussion
Figure 3.6: Component analysis results

Event relations contain important relational information about events, which has great potential
to boost sequence modeling performance, but few current sequence learning models are
able to capture them. In addition, learning sequential dependencies and event relations
simultaneously is challenging because these two types of information are inherently diﬀerent
and it is difficult to prevent one from being dominated by the other. In this work, we have
exploited event relations for the user behavior sequence modeling problem and demonstrated the
importance of taking advantage of them. Further, we have proposed a sequence modeling framework,
SEE, that is able to capture both sequential dependencies and event relations effectively.
Evaluating our framework on real datasets that come from totally different domains, we find
that the proposed framework significantly outperforms traditional state-of-the-art sequence
learning models.
There are several meaningful future directions to explore. First, our work has focused
on user behavior sequence modeling. Thus, one future work could attempt to exploit event
relations for other sequential learning tasks such as language modeling. In addition, we believe
that event relations information will be even more valuable when the data sparsity problem
presents itself, e.g., when very few training sequences are available. Therefore, it would be meaningful to
utilize it to address cold-start and data sparsity problems. Moreover, our framework and most
current sequence learning models assume that the events of a sequence are uniformly distributed
in time, which is not true in many cases. For example, in the GMAT data, the time intervals between
two consecutive interactions can be very different, which could play an important role in
understanding students' knowledge states. Thus, how to incorporate such information into
our proposed framework is another meaningful future direction. Finally, in this chapter, we
assume that the given event relations do not contain any noise. However, this assumption may not
hold. In such cases, the event representation obtained by the proposed framework is no
longer accurate. Thus, in the future, it is also important to extend our proposed framework
to deal with noise in event relations.
Chapter 4
Modeling Sequence Relation
4.1 Introduction
The majority of existing sequence models such as RNNs have been designed for traditional
sequences, which are assumed to be independently and identically distributed (i.i.d.). However,
many real-world applications generate linked sequences (sequences that are related). For example,
web documents, which are sequences of words, are connected via hyperlinks; genes, which are sequences of DNA
or RNA, typically interact with each other. Figure 4.1 illustrates a toy example of linked
sequences with four sequences – S1, S2, S3 and S4. These four sequences are linked
via three links – S2 is connected with S1 and S3, and S3 is linked with S2 and S4. On the one
hand, these linked sequences are inherently related. For example, linked web documents are
likely to be similar [41] and interacted genes tend to share similar functionalities [11]. Hence,
linked sequences are not i.i.d., which presents immense challenges to traditional RNNs. On
the other hand, linked sequences oﬀer additional sequence relational information in addition
to the sequential information. It is evident that sequence relation can be exploited to boost
various analytical tasks such as social recommendations [112] , sentiment analysis [121, 59]
and feature selection [113]. Thus, the availability of sequence relation information in linked
sequences has great potential to enable the development of advanced Recurrent Neural Networks.
Now we have established that – (1) traditional RNNs are insuﬃcient and dedicated eﬀorts
Figure 4.1: An Illustration of Linked Sequences. S1, S2, S3 and S4 denote four sequences
and they are connected via four links.
are needed for linked sequences; and (2) the availability of sequence relation information
in linked sequences offers unprecedented opportunities to advance traditional RNNs. In this
chapter, we study the problem of modeling sequence relation via RNNs. In particular, we aim
to address the following challenges – (1) how to capture sequence relation mathematically
and (2) how to combine sequential and sequence relation information via Recurrent Neural
Networks. To address these two challenges, we propose a novel Linked Recurrent Neural
Network (LinkedRNN) for linked sequences.
4.2 Problem Statement
Let S = {S1, S2, · · · , SN} be the set of N sequences. For linked sequences, two types of
information are available. One is the sequential information for each sequence. We denote
the sequential information of Si as Si = (x^i_1, x^i_2, · · · , x^i_{Ni}), where Ni is the length of Si. The
other is the sequence relation. We use an adjacency matrix A ∈ R^{N×N} to denote the sequence
relation of linked sequences, where A(i, j) = 1 if there is a link between the sequences Si and
Sj and A(i, j) = 0 otherwise. In this work, we follow the transductive learning setting.
In detail, we assume that a part of the sequences, from S1 to SK, are labeled, where K < N.
We denote the labeled sequences as SL = {S1, S2, . . . , SK}. For a sequence Sj ∈ SL, we use
yj to denote its label, where yj is a continuous number for the regression problem and a
symbol for the classification problem. Note that in this work, we focus on unweighted
and undirected links among sequences. However, it is straightforward to extend the proposed
framework to weighted and directed links. We would like to leave it as one future work.
Although the proposed framework is designed for transductive learning, we can also use it for
inductive learning, which will be discussed when we introduce the proposed framework in the
following section.
With the above notations and deﬁnitions, we formally deﬁne the problem we target in
this work as follows:
Given a set of sequences S with sequential information Si = (x^i_1, x^i_2, · · · , x^i_{Ni}) and
sequence relation A, and a subset of labeled sequences {SL, (yj)^K_{j=1}}, we aim to build an RNN
model by leveraging S, A and {SL, (yj)^K_{j=1}}, which can learn representations for sequences
to predict the labels of the unlabeled sequences in S.
4.3 The Proposed Framework
In addition to sequential information, sequence relation is available for linked sequences as
shown in Figure 4.1. As aforementioned, the major challenges in modeling linked sequences are
how to capture sequence relation and how to combine sequential and relation information
coherently. To tackle these two challenges, we propose a novel Recurrent Neural Network,
Figure 4.2: An illustration of the proposed framework LinkedRNN on the toy example
shown in Figure 4.1. It consists of two major layers, where the RNN layer captures sequential
information and the link layer captures sequence relation.
LinkedRNN. An illustration of the proposed framework on the toy example of Figure 4.1 is
shown in Figure 4.2. It mainly consists of two layers. The RNN layer captures
the sequential information. The output of the RNN layer is the input of the link layer, where
sequence relation is captured. Next, we first detail each layer and then present the overall
framework of LinkedRNN.
4.3.1 Capturing Sequential Information
Given a sequence Si = (x^i_1, x^i_2, · · · , x^i_{Ni}), the RNN layer aims to learn a representation vector
that can capture its complex sequential patterns via Recurrent Neural Networks.
In this work, due to its simplicity and effectiveness, we choose GRU as our RNN unit.
The details of GRU are described in Section 2. The output of the RNN layer will be the
input of the link layer. For a sequence Si, the RNN layer will learn a sequence of latent
representations (h^i_1, h^i_2, . . . , h^i_{Ni}). There are various ways to obtain the final output ĥi of Si
from (h^i_1, h^i_2, . . . , h^i_{Ni}). In this work, we investigate two popular ways:
• As the last latent representation h^i_{Ni} is able to capture information from previous
states, we can just use it as the representation of the whole sequence. We denote this
way of aggregation as aggregation11. Specifically, we let ĥi = h^i_{Ni}.
• The attention mechanism can help the model automatically focus on relevant parts of
the sequence to better capture the long-range structure, and it has shown effectiveness
in many tasks [9, 83, 22]. Thus, we define our second way of aggregation based on the
attention mechanism as follows:

ĥi = Σ_{j=1}^{Ni} a_j h^i_j    (4.1)

where a_j is the attention score, which can be obtained as

a_j = e^{a(h^i_j)} / Σ_m e^{a(h^i_m)}    (4.2)

where a(h^i_j) is a feedforward layer:

a(h^i_j) = v_a^T tanh(W_a h^i_j)    (4.3)

Note that different attention mechanisms can be used; we will leave this as one future
work. We denote the aggregation way described above as aggregation12.
For generality, we will use RNN to denote GRU in the rest of this chapter.
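The attention-based aggregation of Equations 4.1–4.3 can be sketched as follows (a NumPy illustration with made-up dimensions; W_a and v_a are the attention parameters):

```python
import numpy as np

def attention_aggregate(H, W_a, v_a):
    """Collapse a sequence of hidden states H (T x d) into one vector."""
    scores = np.tanh(H @ W_a.T) @ v_a        # Eq. 4.3: a(h_j) = v_a^T tanh(W_a h_j)
    weights = np.exp(scores - scores.max())  # Eq. 4.2: softmax over the T steps
    weights = weights / weights.sum()
    return weights @ H                       # Eq. 4.1: sum_j a_j h_j

rng = np.random.default_rng(0)
T, d, k = 5, 8, 6                            # sequence length and layer sizes
H = rng.normal(size=(T, d))
h_hat = attention_aggregate(H, rng.normal(size=(k, d)), rng.normal(size=k))
print(h_hat.shape)   # (8,)
```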
4.3.2 Capturing Sequence Relation
The RNN layer is able to capture the sequential information. However, in linked sequences,
sequences are naturally related. The Homophily theory suggests that linked entities tend
to have similar attributes [87], which has been validated in many real-world networks
such as social networks [74], web networks [79], and biological networks [11]. As indicated
by Homophily, a node is likely to share similar attributes and properties with the nodes
it connects to. In other words, a node is similar to its neighbors. With this intuition, we
propose the link layer to capture sequence relation in linked sequences.
As shown in Figure 4.2, to capture sequence relation, the link layer not only uses a node's
own sequential representation but also aggregates information from its neighbors. The link
layer can contain multiple hidden layers. In other words, for one node,
we can aggregate information from itself and its neighbors multiple times. Let v^k_i be the
hidden representation of the sequence Si after k aggregations. Note that when k = 0, v^0_i is
the input of the link layer, i.e., v^0_i = ĥi. Then v^{k+1}_i can be updated as:

v^{k+1}_i = act( (1 / (|N(i)| + 1)) (v^k_i + Σ_{Sj∈N(i)} v^k_j) )    (4.4)

where act(·) is an element-wise activation function, N(i) is the set of neighbors that are linked
with Si, i.e., N(i) = {Sj | A(i, j) = 1}, and |N(i)| is the number of neighbors of Si. We define
V^k = [v^k_1, v^k_2, . . . , v^k_N] as the matrix form of the representations of all sequences at the k-th layer.
We modify the original adjacency matrix A by allowing A(i, i) = 1. The aggregation in
Equation 4.4 can then be written in matrix form as:

V^{k+1} = act(AD^{−1}V^k)    (4.5)

where V^{k+1} is the embedding matrix after k + 1 aggregation steps, and D is the diagonal
matrix where D(i, i) is defined as:

D(i, i) = Σ_{j=1}^{N} A(i, j)    (4.6)
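The node-wise update of Equation 4.4 averages a sequence's representation with those of its neighbors; after adding self-loops to A, the same update can be written as one multiplication with the row-normalized adjacency matrix. A small NumPy check on the toy graph of Figure 4.1 (links S1–S2, S2–S3, S3–S4; ReLU chosen as an example activation):

```python
import numpy as np

# Adjacency matrix of the toy graph (0-indexed: S1..S4 -> 0..3)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)               # allow A(i, i) = 1, as in the text

rng = np.random.default_rng(0)
V = rng.normal(size=(4, 8))         # v^k_i for each sequence (d = 8, illustrative)
act = lambda x: np.maximum(x, 0.0)  # an element-wise activation (ReLU here)

# Node-wise update (Equation 4.4)
V_next = np.empty_like(V)
for i in range(4):
    nbrs = np.flatnonzero(A[i])
    V_next[i] = act((V[i] + V[nbrs].sum(axis=0)) / (len(nbrs) + 1))

# Matrix form: row-normalize the self-looped adjacency matrix
D_inv = np.diag(1.0 / A_hat.sum(axis=1))
assert np.allclose(V_next, act(D_inv @ A_hat @ V))
```

Note that D here is the degree matrix of the self-looped graph, so D(i, i) = |N(i)| + 1.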
4.3.3 Linked Recurrent Neural Networks
With the model components to capture sequential and relation information, the procedure of
the proposed framework LinkedRNN is presented below:

(h^i_1, h^i_2, . . . , h^i_{Ni}) = RNN(Si)
ĥi = aggregation1(h^i_1, h^i_2, . . . , h^i_{Ni})
v^0_i = ĥi
v^{k+1}_i = act( (1 / (|N(i)| + 1)) (v^k_i + Σ_{Sj∈N(i)} v^k_j) )
z_i = aggregation2(v^0_i, v^1_i, . . . , v^M_i)    (4.7)

where the input of the RNN layer is the sequential information and the RNN layer produces
the sequence of latent representations (h^i_1, h^i_2, . . . , h^i_{Ni}). This sequence of latent
representations is aggregated to obtain the output of the RNN layer, which serves
as the input of the link layer. After M layers, the link layer produces a sequence of latent
representations (v^0_i, v^1_i, . . . , v^M_i), which is aggregated into the final representation.
The final representation z_i for the sequence Si is obtained by aggregating the sequence
(v^0_i, v^1_i, . . . , v^M_i) from the link layer. In this work, we investigate several ways to obtain the final representation
z_i:
• As v^M_i is the output of the last layer, we can define the final representation as z_i = v^M_i.
We denote this way as aggregation21.
• Although the new representation v^M_i incorporates all the neighbor information, the
signal from the node's own representation may be overwhelmed during the aggregation process.
This is especially likely to happen when there are a large number of neighbors. Thus, to
make the new representation focus more on the node itself, we propose to use a feed-forward
neural network to perform the combination. We concatenate the representations from the
last two layers as the input of the feed-forward network. We refer to this aggregation
method as aggregation22.
• Each representation v^j_i could contain unique information that cannot be carried
into the later layers. Thus, similarly, we use a feed-forward neural network to perform the
combination of (v^1_i; v^2_i; · · · ; v^M_i). We refer to this aggregation method as aggregation23.
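The three choices of aggregation2 can be sketched as follows (a NumPy illustration; the feed-forward combinations are shown as single hypothetical tanh layers and all dimensions are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 2, 8
v = [rng.normal(size=d) for _ in range(M + 1)]   # (v^0_i, v^1_i, ..., v^M_i)

# aggregation21: keep only the output of the last link layer
z21 = v[-1]

# aggregation22: feed the concatenation of the last two layers through a
# one-layer feed-forward network
W22 = rng.normal(size=(d, 2 * d))
z22 = np.tanh(W22 @ np.concatenate(v[-2:]))

# aggregation23: combine the outputs of all link layers (v^1_i; ...; v^M_i)
W23 = rng.normal(size=(d, M * d))
z23 = np.tanh(W23 @ np.concatenate(v[1:]))
print(z21.shape, z22.shape, z23.shape)   # (8,) (8,) (8,)
```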
To learn the parameters of the proposed framework LinkedRNN, we need to deﬁne a loss
function that depends on the speciﬁc task. In this work, we investigate LinkedRNN in two
tasks – classiﬁcation and regression.
Classification. The final output of a sequence Si is z_i. We can consider z_i as features
and build the classifier. In particular, the predicted class probabilities can be obtained through a
softmax function as:

p_i = softmax(Wc z_i + bc)    (4.8)
where Wc and bc are the coefficient and bias parameters, respectively, and p_i is the predicted
label distribution of the sequence Si. The corresponding loss function used in this chapter is the
cross-entropy loss.
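The classification head of Equation 4.8 and the arg-max prediction step can be sketched as follows (illustrative dimensions):

```python
import numpy as np

def predict_label(z, W_c, b_c):
    """Softmax over W_c z + b_c; the predicted label is the most probable class."""
    logits = W_c @ z + b_c
    p = np.exp(logits - logits.max())   # numerically stable softmax
    p = p / p.sum()
    return p, int(np.argmax(p))

rng = np.random.default_rng(0)
p, label = predict_label(rng.normal(size=8), rng.normal(size=(3, 8)), np.zeros(3))
print(round(p.sum(), 6), label in {0, 1, 2})   # 1.0 True
```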
Regression. For the regression problem, we choose linear regression in this work. In
other words, the regression label of the sequence Si is predicted as:
pi = Wrzi + br
(4.9)
where Wr and br are the regression coeﬃcients and the bias parameters, respectively. Then
the square loss is adopted as the loss function:

L = (1/K) Σ_{i=1}^{K} (y_i − p_i)^2    (4.10)
Note that there are other ways to deﬁne loss functions for classiﬁcation and regression.
We would like to leave the investigation of other formats of loss functions as one future work.
Prediction. For an unlabeled sequence Sj under the classification problem, its label
is predicted as the one corresponding to the entry with the highest probability in
softmax(Wc z_j + bc).
For an unlabeled sequence Sj under the regression problem, its label is predicted as
Wr z_j + br.
Although the framework is designed for transductive learning, it can be naturally used
for inductive learning. For a sequence Sk that is unseen in the given linked sequences
S, it is easy to obtain its representation z_k via Equation 4.7 according to its sequential
information and its neighbors N(k). Then, based on z_k, its label can be predicted with the
normal prediction step described above.
Table 4.1: Statistics of the datasets.

Description                DBLP     BOOHEE
# of sequences             47,491   18,229
Network density (‰)        0.13     0.012
Avg length of sequences    6.6      23.5
Max length of sequences    20       29
4.4 Experiment
In this section, we present experimental details to verify the effectiveness of the proposed
framework. Specifically, we validate the proposed framework on datasets from two different
domains. Next, we first describe the datasets used in the experiments and then compare
the performance of the proposed framework with representative baselines. Lastly, we analyze
the key components of LinkedRNN.
4.4.1 Datasets
In this study, we collect two types of linked sequences. One is from DBLP, where the data contains
textual sequences from papers. The other is from the weight loss website BOOHEE, where the
data includes weight sequences of users. Some statistics of the datasets are shown in
Table 4.1. Next, we introduce more details.
DBLP dataset. We constructed a paper citation network from the publicly available
DBLP dataset1 [111]. This dataset contains information for millions of papers from a variety
of research fields. Specifically, each paper contains the following relevant information:
paper id, publication venue, the ids of its references, and abstract. Following a similar
practice to [110], we only select papers from conferences in the 10 largest computer science
domains, including VCG, ACL, IP, TC, WC, CCS, CVPR, PDS, NIPS, KDD, WWW,
1https://aminer.org/citation.
ICSE, Bioinformatics, TCS. We construct a sequence for each paper from its abstract
and regard the citation relationships as the sequence relation. Specifically, we first split
the abstract into sentences and tokenize each sentence using the Python NLTK package. Then,
we use Word2Vec [89] to embed each word into Euclidean space, and for each sentence, we
treat the mean of its word vectors as the sentence embedding. Thus, the abstract of each
paper can be represented by a sequence of sentence embeddings. We conduct the
classification task of paper classification on this dataset; the label of each sequence
is the corresponding publication venue.
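The abstract-to-sequence preprocessing described above can be sketched as follows (the thesis uses NLTK sentence tokenization and trained Word2Vec vectors; this illustration substitutes naive splitting and a random toy vocabulary):

```python
import numpy as np

def abstract_to_sequence(abstract, word_vectors):
    """Split an abstract into sentences and represent each sentence by the
    mean of its word vectors, yielding a sequence of sentence embeddings."""
    sequence = []
    for sentence in abstract.split(". "):
        words = sentence.lower().rstrip(".").split()
        vecs = [word_vectors[w] for w in words if w in word_vectors]
        if vecs:
            sequence.append(np.mean(vecs, axis=0))
    return np.array(sequence)

rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=4) for w in "we study linked sequences rnns".split()}
seq = abstract_to_sequence("We study linked sequences. We study RNNs.", vocab)
print(seq.shape)   # (2, 4): two sentences, each a 4-dimensional embedding
```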
BOOHEE dataset. This dataset is collected from one of the most popular weight
management mobile applications, BOOHEE 2. It contains millions of users who self-track
their weights and interact with each other in the internal social network provided by the
application. Specifically, they can follow friends, comment on friends' posts, and mention
(@) friends in comments or posts. The weights recorded by users form sequences that contain
the weight dynamics, and the social networking behaviors result in three networks
that correspond to the following, commenting, and mentioning interactions, respectively. Previous
work [123] has shown a social correlation in users' weight loss. Thus, we use these social
networks as the sequence relation for the weight sequence data. We preprocess the dataset to
filter out sequences from suspicious spam users. Moreover, we change the time granularity
of the weight sequences from days to weeks to remove daily fluctuation noise. Specifically, we
compute the mean value of all the recorded weights in one week and use it as the weight for
that week. For the networks, we combine the three networks into one by adding them together and
filtering out weak ties. On this dataset, we conduct a regression task of weight prediction.
We choose the most recent weight in a weight sequence as the weight we aim to predict
2https://www.boohee.com
(or the ground truth of the regression problem). Note that for a user, we remove all social
interactions formed after the most recent weight, so as to avoid using
future sequence relation for weight prediction.
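The week-level smoothing described above (averaging all recorded weights within a week) can be sketched with the standard library alone:

```python
from collections import defaultdict
from datetime import date

def weekly_means(records):
    """Average self-recorded weights within each ISO week to remove the
    daily fluctuation noise (a minimal sketch of the preprocessing step)."""
    buckets = defaultdict(list)
    for day, weight in records:
        year, week, _ = day.isocalendar()
        buckets[(year, week)].append(weight)
    return [sum(ws) / len(ws) for _, ws in sorted(buckets.items())]

records = [(date(2020, 1, 6), 70.4), (date(2020, 1, 8), 70.0),  # same ISO week
           (date(2020, 1, 14), 69.6)]
print(weekly_means(records))   # ≈ [70.2, 69.6]
```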
4.4.2 Representative Baselines
To validate the effectiveness of the proposed framework, we construct three groups of
representative baselines. The first group includes the state-of-the-art network embedding
methods, i.e., node2vec [45] and GCN [71], which only capture the sequence relation. The
second group is the GRU RNN model [42], which is the basic model we use in our framework to
capture sequential information. Baselines in the third group combine models from the first
and second groups, capturing both sequential and sequence relation information. Next, we present
more details about these baselines.
• Node2vec [45]. Node2vec is a state-of-the-art network embedding method. It learns
the representations of sequences by capturing only the sequence relation from a random-walk
perspective.
• GCN [71]. It is the traditional graph convolutional network algorithm. It is trained with
both relation and label information. Hence, it is different from node2vec, which is learnt
with only the sequence relation and is totally independent of the task.
• RNN [42]. RNNs have been widely used for modeling sequential data and have achieved great
success in a variety of domains. However, they tend to ignore the correlations between
sequences and focus only on sequential information. We construct this baseline to show
the importance of the correlation information. To make the comparison fair, we employ
the same recurrent unit (GRU) in both the proposed framework and this baseline.
• RNN-node2vec. Node2vec is able to learn representations from the sequence
relation, and RNN can do so from the sequential information. Thus, to obtain
representations of sequences that contain both relational and sequential information,
we combine the two sets of embeddings obtained from node2vec and RNN via a
feed-forward neural network.
• RNN-GCN. RNN-GCN applies a strategy similar to that of RNN-node2vec to
combine RNN and GCN.
There are several notes about the baselines. First, node2vec does not use label information
and is unsupervised; RNN and RNN-node2vec utilize label information and are
supervised; and GCN and RNN-GCN use both label information and unlabeled data and are
semi-supervised. Second, some sequences may not have sequence relation, and baselines
that only capture the sequence relation cannot learn representations for these sequences; hence, in
this work, when representations from the sequence relation are unavailable, we use the
representations from the sequential information via RNN instead. Third, we do not choose
LSTM and its variants as baselines since our current model is based on GRU; we could also
choose LSTM and its variants as the base models.
4.4.3 Experimental Settings
Data split: For both datasets, we randomly select 30% of the data for testing. Then we fix the test
set and choose x% of the remaining 70% for training and (100 − x)% for validation to
select parameters for the baselines and the proposed framework. In this work, we vary x in
{10, 30, 50, 70}.
Parameter selection: In our experiments, we set the dimension of representation vectors
Table 4.2: Performance comparison on the DBLP dataset

                                       Training ratio
Measurement   Method          10%      30%      50%      70%
Micro-F1      node2vec        0.6641   0.6550   0.6688   0.6691
              GCN             0.7093   0.7005   0.7110   0.7180
              RNN             0.7980   0.7686   0.7978   0.8025
              RNN-node2vec    0.7940   0.8031   0.7933   0.8114
              RNN-GCN         0.8230   0.7912   0.8255   0.8284
              LinkedRNN       0.8399   0.8146   0.8463   0.8531
Macro-F1      node2vec        0.6514   0.6523   0.6513   0.6565
              GCN             0.6992   0.6874   0.7004   0.7095
              RNN             0.7751   0.7452   0.7754   0.7824
              RNN-node2vec    0.7734   0.7797   0.7702   0.7912
              RNN-GCN         0.8014   0.7642   0.8069   0.8104
              LinkedRNN       0.7970   0.8249   0.8331   0.8365
of sequences to 100. For node2vec, we use the validation data to select the best values for p
and q from {0.25, 0.50, 1, 2, 4}, as suggested by the authors [45], and use the default values for
the remaining parameters. In addition, the learning rate for all of the methods is selected
on the validation set.
Evaluation metrics: Since we perform classification on the DBLP data, we use the
Micro-F1 and Macro-F1 scores as the metrics for DBLP, which are widely used for classification
problems [134, 45]. A higher value means better performance. We perform the regression
problem of weight prediction on the BOOHEE data. Therefore, the performance on the BOOHEE
data is evaluated by the mean squared error (MSE). A lower MSE indicates
higher prediction performance.
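The two evaluation metrics can be computed as follows (minimal reference implementations; libraries such as scikit-learn provide equivalent functions):

```python
def mse(y_true, y_pred):
    """Mean squared error for the weight-prediction task (lower is better)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_scores(y_true, y_pred, labels):
    """Micro- and Macro-F1 for the classification task (higher is better)."""
    per_class, tp_all, fp_all, fn_all = [], 0, 0, 0
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        per_class.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    micro = 2 * tp_all / (2 * tp_all + fp_all + fn_all)
    macro = sum(per_class) / len(per_class)
    return micro, macro

print(f1_scores([0, 0, 1, 2], [0, 1, 1, 2], labels=[0, 1, 2]))  # micro = 0.75
print(mse([70.0, 68.0], [70.5, 67.0]))                          # 0.625
```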
4.4.4 Experimental Results
We first present the results on the DBLP data, which are shown in Table 4.2. For the
proposed framework, we choose M = 2 for the link layer; more discussion of the choice
of its aggregation functions is given in the following section. From
the table, we make the following observations:
• As we can see in Table 4.2, in most cases, the performance tends to improve as the
number of training samples increases.
• Random guessing obtains about 0.1 for both Micro-F1 and Macro-F1. We note that the
network embedding methods perform much better than random guessing, which clearly
shows that the sequence relation is indeed helpful for the prediction.
• GCN achieves much better performance than node2vec. As mentioned before, GCN
uses label information, so the learnt representations are optimized for the given task,
while node2vec learns representations independent of the given task, so its representations
may not be optimal.
• The RNN approach outperforms GCN. Both of them use the label
information. This observation suggests that the content and sequential information are
very helpful.
• Most of the time, RNN-node2vec and RNN-GCN outperform the individual models.
This observation indicates that both sequential and sequence relation are important
and they contain complementary information.
• The proposed framework LinkedRNN consistently outperforms the baselines. This strongly
demonstrates the effectiveness of LinkedRNN. In addition, compared to RNN-node2vec
Table 4.3: Performance comparison on the BOOHEE dataset

                         Training ratio
Method          10%      30%      50%      70%
node2vec        8.8702   8.8517   7.4744   7.0390
GCN             8.6830   8.9347   6.7949   6.7278
RNN             8.6048   8.6600   7.0466   6.8033
RNN-node2vec    8.4653   8.5944   7.0173   6.7796
RNN-GCN         8.5662   8.6286   6.9967   6.7945
LinkedRNN       7.1822   6.3882   6.8416   6.3517
and RNN-GCN, the proposed framework jointly captures the sequential and
sequence relation information coherently, which leads to significant performance gains.
We present the performance on BOOHEE in Table 4.3. Overall, we make similar observations
as those on DBLP: (1) the performance improves as the number of
training samples increases; (2) the combined models outperform the individual ones most of the time; and
(3) the proposed framework LinkedRNN obtains the best performance.
Via the comparison, we can conclude that both the sequential and relation information in
linked sequences are important and contain complementary information. Meanwhile,
the consistently impressive performance of LinkedRNN on datasets from different domains
demonstrates its effectiveness in capturing the sequential and relation information presented
in the sequences.
4.4.5 Component Analysis
In the proposed framework LinkedRNN, we have investigated several ways to define the
two aggregation functions. In this subsection, we investigate the impact of the aggregation
functions on the performance of the proposed framework LinkedRNN by defining the following
variants.
• LinkedRNN11: the variant with aggregation11 and aggregation21
• LinkedRNN12: the variant with aggregation11 and aggregation22
• LinkedRNN13: the variant with aggregation11 and aggregation23
• LinkedRNN21: the variant with aggregation12 and aggregation21
• LinkedRNN22: the variant with aggregation12 and aggregation22
• LinkedRNN23: the variant with aggregation12 and aggregation23
The results are shown in Figure 4.3a. Note that we only show results on DBLP with 50% of the data as training, since we have similar observations in other settings. It can be observed that:
• Generally, the variants of LinkedRNN with aggregation12 obtain better performance than those with aggregation11. This demonstrates that aggregating the sequence of latent representations with the help of the attention mechanism can boost the performance.
• Aggregating representations from more layers in the link layer typically results in better performance.
4.4.6 Parameter Analysis
LinkedRNN uses the link layer to capture sequence relations. The link layer can consist of multiple layers. In this subsection, we study the impact of the number of layers on the performance of LinkedRNN. How the performance changes with the number of layers is shown in Figure 4.3b.
(a) The impact of aggregation functions. (b) The performance variance with the number of link layers.

Figure 4.3: Model analysis results
Similar to the component analysis, we only report the results with one setting in DBLP since
we have similar observations. In general, the performance ﬁrst dramatically increases and
then slowly decreases. One layer is not suﬃcient to capture the sequence relation while more
layers may result in overﬁtting.
4.5 Discussion
RNNs have been proven to be powerful in modeling sequences in many domains. Most existing RNN methods have been designed for sequences that are assumed to be i.i.d. However, in many real-world applications, sequences are inherently related, and such linked sequences present both challenges and opportunities to existing RNN methods, which calls for novel RNN models. In this chapter, we study the problem of designing RNN models for linked sequences. Guided by the principle of homophily, we introduce a principled method to capture sequence relations and propose a novel RNN framework, LinkedRNN, which can jointly model
sequential and relation information. Experimental results on datasets from different domains
demonstrate that (1) the proposed framework can outperform a variety of representative
baselines; and (2) sequence relation is helpful to boost the RNN performance.
There are several interesting directions to investigate in the future. First, our current
model focuses on unweighted and undirected relation and we will study weighted and directed
links and the corresponding RNN models. Second, in the current work, we focus on classification and regression problems with certain loss functions. We will investigate other types of loss
functions to learn the parameters of the proposed framework and also investigate more
applications of the proposed framework. Third, since our model can be naturally extended
for inductive learning, we will further validate the eﬀectiveness of the proposed framework
for inductive learning. Finally, in some applications, the sequence relation may be evolving;
thus we plan to study RNN models, which can capture the dynamics of links as well.
Chapter 5
Modeling Temporal Information
5.1 Introduction
In many scenarios, sequences are also associated with temporal information. For example, in a
customer behavior sequence, each behavior (event) has a time stamp recording the exact time
that the behavior happened. Such temporal information is essential for recommender systems to make recommendations at the right time, which is vital to improving customers' shopping experience in e-commerce for the following reasons [120, 2, 67, 72, 38]. First, customers' interest in buying a product naturally fluctuates. Thus, there could be right moments when
recommending one product will benefit the customer most. On the contrary, there could be a huge opportunity cost if the system recommends it at a moment when the customer has already lost interest. Second, many products are time-sensitive. For instance, popular snow boots are unlikely to be wanted in summer. Therefore, it will greatly impair customers' trust if recommender systems overlook time information. Third, predicting when customers will need products will also enable providers to supply enough goods and avoid disappointing customers with an "out of stock" experience.
The increasing availability of time-stamped ﬁne-grained customers’ behaviors observed by
providers brings us unprecedented opportunities to build advanced recommender systems
that take both customers’ preference and its temporal dynamics into consideration. Take
a customer who plans to buy a phone as an example: she may first search for some phone brands on e-commerce websites. Then she clicks a few models for detailed descriptions and adds two of them to the shopping cart for further comparison. Finally, she deletes one from the cart and decides to purchase the other. Fine-grained behaviors such as those in this example offer us a new perspective for modeling customers' preferences. Furthermore, these behaviors are often time-stamped, which provides rich temporal information that enables us to gain a deep understanding of the temporal dynamics of customers' preferences.
Such understanding enables us to perform just-in-time recommendations where we consider
what and when to recommend simultaneously.
Although exploiting time-stamped fine-grained user behaviors can potentially bring tremendous opportunities to advance recommender systems, it also introduces new challenges. First, considering users' behaviors adds great complexity to modeling the relations among various products. The majority of traditional recommender systems merely consider coarse-grained user-item interactions. They often assume that the influence of one product on another is static. However, when considering the behaviors users perform on products, such influence can vary according to the specific behavior associated with the
product. Below is one example. If a customer clicks “iPhone", it is a strong signal that she
may be interested in buying a new phone; thus in addition to presenting iPhone models,
we can also recommend some “Samsung" phones to her. In other words, clicking “iPhone"
increases the probability of recommending “Samsung" phone. On the other hand, if she just
purchased an “iPhone", it is unlikely that she will buy another new phone in the near future.
Thus, it is more suitable to recommend “iPhone" cases than “Samsung" phones. In this case,
purchasing “iPhone" should decrease the probability of recommending a “Samsung" phone.
Therefore, “iPhone"’s inﬂuence on “Samsung" phone depends on the behaviors associated such
as purchase and click. Second, temporal dynamics of customers’ interests are heterogeneously
inﬂuenced by various products they have interacted with. For example, a customer’s interest
in buying a new "iPhone" could be related to her previous interactions with other products
such as "iPad" and "Samsung" phones, each of which can have a different influence according to the behaviors associated with them. This presents formidable challenges to accurately modeling the temporal dynamics of interests, because it requires considering all of a customer's historical interactions, which is further complicated by the varied relations among products.
In this chapter, we investigate the just-in-time recommendation problem and address
aforementioned challenges simultaneously to leverage the timestamped ﬁne-grained customer
behavior data.
5.2 Problem Statement
Let U = [u1, u2,··· , uN ] be the set of N users. We assume that P = [p1, p2,··· , pV ] is a
set of products where V is the number of products. Let B = (b1, b2,··· , bB) be the set of
B behaviors users can perform on products such as “click" and “purchase". The historical
interactions of a user ui with the recommender system have been recorded and are represented
as a sequence Si:
S_i = {(t^i_1, b^i_1, p^i_1), (t^i_2, b^i_2, p^i_2), ..., (t^i_k, b^i_k, p^i_k)}   (5.1)
where each element in S_i is a 3-tuple that is named an event in this work. Each event e_j = (t^i_j, b^i_j, p^i_j) contains one time stamp t^i_j, one behavior type b^i_j ∈ B, and one product p^i_j ∈ P. Moreover, S = [S_1, S_2, ..., S_N] is used to denote the set of event sequences of all users in
U. We further assume that the next event after t^i_k, i.e., (t^i_{k+1}, b^i_{k+1}, p^i_{k+1}), is available for all u_i ∈ U. With the above notations, the problem we study in this work can be formally defined as follows:

Given the event sequences of a set of users U, i.e., S, and the corresponding next events {(t^i_{k+1}, b^i_{k+1}, p^i_{k+1})}^N_{i=1}, we aim to learn a mapping function from S to {(t^i_{k+1}, b^i_{k+1}, p^i_{k+1})}^N_{i=1}; thus, given a user u_j ∉ U with its historical event sequence S_j = {(t^j_1, b^j_1, p^j_1), (t^j_2, b^j_2, p^j_2), ..., (t^j_k, b^j_k, p^j_k)} ∉ S, the function can simultaneously predict the next product p^j_{k+1} and its corresponding time stamp t^j_{k+1}.
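To make the notation concrete, an event sequence S_i as defined in Equation 5.1 can be represented as a simple list of (time stamp, behavior, product) tuples. The names and values below are purely illustrative, not from the actual data:

```python
from typing import List, NamedTuple

class Event(NamedTuple):
    t: int   # time stamp t_j
    b: str   # behavior type b_j in B, e.g. "click" or "purchase"
    p: str   # product p_j in P

# Historical event sequence for one user (Equation 5.1)
S_i: List[Event] = [
    Event(t=1609459200, b="click", p="iPhone"),
    Event(t=1609459500, b="click", p="Samsung Galaxy"),
    Event(t=1609460000, b="purchase", p="iPhone"),
]

# The learning problem maps S_i to the next event's product and time.
last = S_i[-1]
```

The mapping function described above would consume such a sequence and output the product and time stamp of the (k+1)th event.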
5.3 The Proposed Framework
In this section, we first give an architecture overview of the proposed framework and then detail each model component.
5.3.1 An Architecture Overview
In addition to sequential information, sequences also contain temporal information indicating when events happened, which not only enables us to make accurate recommendations but also allows us to perform recommendations at the right time. To achieve this, as mentioned before, the major challenges include capturing the sequential information, modeling the complex interactions between products and behaviors, and modeling the timing accurately. In this subsection, we introduce the key components of our proposed model, which address these challenges and simultaneously recommend the right product and predict the recommendation
time. The overview of our framework is shown in Figure 5.1. It consists of six layers from the bottom to the top: 1) Input layer; 2) Embedding layer, which embeds the high-dimensional input vector into a low-dimensional space; 3) RNN layer, which captures the sequential and temporal information in sequences; 4) Interaction layer, which models the interactions between products and events in the sequence; 5) Recommendation layer, which includes a product layer and a time layer to predict products and time, respectively; and 6) Output layer, where λi characterizes the temporal dynamics of the interest in buying the ith product and Pi indicates the probability of recommending product i. In the following subsections, we give the details of each layer.

Figure 5.1: The network architecture of the proposed framework. It consists of six key layers: 1) Input layer; 2) Embedding layer, which embeds the high-dimensional input vector into a low-dimensional space; 3) RNN layer, which captures the sequential and temporal information in sequences; 4) Interaction layer, which models the interactions between products and events in the sequence; 5) Recommendation layer, which includes a product layer and a time layer to predict products and time, respectively; and 6) Output layer, where λi denotes the time and Pi indicates the probability of recommending product i.
5.3.2 Embedding Layer
The input of the model is a sequence of events, where each event e = (t, b, p) contains timing, behavior, and product information. p ∈ {0, 1}^V is a one-hot vector with p[j] = 1 and p[i] = 0 for i ≠ j if the jth product is involved in e. Similarly, b ∈ {0, 1}^B is the one-hot
vector for behavior indication. Both the product and behavior spaces can be very large
in real-world recommender systems. For example, there could be millions of products in
e-commerce recommender systems. Hence, one-hot vectors are typically high-dimensional,
which motivates us to add an embedding layer right after the input layer to embed products
and behaviors into low-dimensional spaces. Specifically, two embedding matrices Wp ∈ R^{V×Kp} and Wb ∈ R^{B×Kb} are introduced. The representations of p and b in the embedding spaces become p · Wp and b · Wb, respectively. Note that t could also be embedded using an embedding matrix, as is done for b and p. However, in this work, to reduce the number of parameters, we use the binary representation of the timestamp t as the time embedding t_e and pad zeros to the left so that all time embedding vectors have the same length. Thus, each event e is the concatenation of the behavior, product, and temporal information in their embedding spaces:

e = [p · Wp ∥ b · Wb ∥ t_e]   (5.2)

where ∥ is the concatenation operator. As Kp ≪ V and Kb ≪ B, the resulting representation of events is dense; it serves as the input to the following RNN layer, which captures the sequential information of events.
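As a minimal sketch of the time-embedding step described above (the helper names, the padding width of 32 bits, and the toy embedding values are our own illustrative assumptions, not the dissertation's actual implementation):

```python
def time_embedding(t: int, width: int = 32) -> list:
    """Encode timestamp t as its binary representation, zero-padded
    on the left so every time embedding has the same length `width`."""
    bits = [int(c) for c in bin(t)[2:]]       # binary digits of t
    assert len(bits) <= width, "timestamp too large for chosen width"
    return [0] * (width - len(bits)) + bits

def embed_event(p_emb: list, b_emb: list, t: int) -> list:
    # e = [p.Wp || b.Wb || t_e]: concatenate product, behavior, time parts
    return p_emb + b_emb + time_embedding(t)

# Toy product/behavior embeddings (stand-ins for p.Wp and b.Wb)
e = embed_event([0.2, -0.1], [0.5], t=6)  # 6 = 0b110
```

The zero padding keeps the concatenated event vector a fixed length regardless of the timestamp's magnitude.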
5.3.3 RNN Layer: Learning Sequential Information
The order and time of user behaviors in the event sequence carry important information for understanding a user's interest. In this subsection, we design an RNN sublayer to effectively model the sequences. Specifically, at time-step t, the RNN receives an input vector e_t ∈ R^E, involving behavior b_t ∈ B and product p_t ∈ P, together with a latent representation h_{t−1} of the sequence seen previously. It emits an output vector h_t, an updated representation that incorporates the new observation from e_t. Due to its simplicity and performance, we choose the GRU as the RNN cell. The RNN sublayer outputs hidden states h_1, h_2, ..., h_k, with h_i representing the sequential information up to time stamp t_i.
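For readers unfamiliar with the GRU cell chosen here, its update can be sketched in a scalar toy form (the weights below are arbitrary illustrative values; the actual model operates on vectors with learned parameters):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x: float, h_prev: float, W: dict) -> float:
    """One GRU step on scalar input x and previous hidden state h_prev,
    following the standard GRU equations:
      z = sigmoid(Wz*x + Uz*h)          # update gate
      r = sigmoid(Wr*x + Ur*h)          # reset gate
      h_tilde = tanh(Wh*x + Uh*(r*h))   # candidate state
      h_new = (1 - z)*h_prev + z*h_tilde
    """
    z = sigmoid(W["Wz"] * x + W["Uz"] * h_prev)
    r = sigmoid(W["Wr"] * x + W["Ur"] * h_prev)
    h_tilde = math.tanh(W["Wh"] * x + W["Uh"] * (r * h_prev))
    return (1 - z) * h_prev + z * h_tilde

# Run the cell over a toy event sequence; weights are arbitrary.
W = {"Wz": 0.5, "Uz": 0.1, "Wr": 0.4, "Ur": 0.2, "Wh": 0.9, "Uh": 0.3}
h = 0.0
hidden_states = []
for x in [1.0, -0.5, 0.3]:
    h = gru_step(x, h, W)
    hidden_states.append(h)
```

Each h in `hidden_states` plays the role of h_i above: a summary of the sequence up to that event.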
5.3.4 Interaction Layer: Modeling Heterogeneous Influence
The hidden representation h_t produced by the GRU-based RNN contains the sequential information of the events in a sequence and has been successfully utilized for recommendation [56]. However, the GRU only models the sequence and fails to consider how the events in a sequence will affect later decisions on each product. In fact, later interest in a product is highly related
to past behaviors with products. For example, if a user clicks one cellphone several times
in the near past, it is very likely that she has interest in buying a new phone. Thus, it is
desirable to recommend similar products for her. In another case, if the user has already
purchased a cellphone, the probability that she will purchase it again in the near future is
very low. Thus, instead of similar products, related products such as phone cases are likely
to match her interests. However, for other types of products, such as food, the story could be totally different. For instance, if the user has purchased popular and positively rated snacks, it is highly possible that she will buy them again. In this case, previous behaviors increase the probability of the same behaviors. Thus, later interest in a product is intricately related
to both previous behaviors and products in the sequence. To capture the heterogeneous
inﬂuence of events on diﬀerent products, we learn a unique latent representation ci of the
sequences for each product. To be speciﬁc, with the sequence of hidden representation
H = {h1, h2,··· hk} produced by previously introduced RNN layer and the candidate product
, we compute the relation between the jth product’s
embedding vectors Wp
and each hidden representation hi in H through a function a(·) followed by
embedding Wp
j,:
V,:
1,:, Wp
2,:,··· , Wp
63
a Softmax function as follows:
αij =
a(hi,W
e
p
j,:
)
(cid:80)
a(hk,W
p
j,:
)
k e
(5.3)
where a(·) could be any function, such as a feedforward neural network or the dot product. As the dot product is computed much faster in practice due to highly optimized matrix multiplication code [117], in this work we have chosen it as the function a(·), such that
a(p, q) = p^T q. The output latent vector c_j, which is the unique representation of the sequence history with respect to the jth product, can then be computed as the weighted sum of the hidden representation vectors:

c_j = Σ_i α_ij h_i   (5.4)
Thus, the heterogeneous inﬂuence of the events in a sequence on a product j is captured
and contained in the unique representation vectors cj. With cj, we are able to design product
and time layers for product recommendation and time prediction.
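The attention computation in Equations 5.3 and 5.4 can be sketched in pure Python with the dot product as a(·); the hidden states and product embedding below are toy values, not learned parameters:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def product_context(H, w_j):
    """Compute c_j = sum_i alpha_ij * h_i, where alpha_ij is the
    softmax over the scores a(h_i, w_j) and a is the dot product."""
    scores = [dot(h, w_j) for h in H]            # a(h_i, Wp_j)
    m = max(scores)                              # stabilize the softmax
    exps = [math.exp(s - m) for s in scores]
    alphas = [e / sum(exps) for e in exps]       # Eq. 5.3
    dim = len(H[0])
    return [sum(a * h[d] for a, h in zip(alphas, H))  # Eq. 5.4
            for d in range(dim)]

# Toy hidden states from the RNN layer and one product embedding
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_j = [1.0, 0.0]
c_j = product_context(H, w_j)
```

Hidden states that align with the product embedding receive larger weights, so c_j emphasizes the events most relevant to product j.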
5.3.5 Product Layer: Recommending the Product

For the jth product, the previous layer outputs a unique vector c_j, which captures both the sequential information and the relations between the jth product and each event in the sequence, and thus represents the user's later interest in that product. In this subsection, we describe the product layer, which utilizes c_j and the embedding vector Wp_{j,:} of the jth product to make recommendations. Specifically, the product layer involves a function s(·), which synthesizes c_j and Wp_{j,:} to obtain a score indicating the user's future interest in the jth product. For the computational efficiency mentioned in the previous subsection, we again choose s(·) to be the dot product, such that s(p, q) = p^T q. Note that other functions, such as feed-forward neural networks, could also be used; we leave this as one future exploration direction. To obtain an interest probability distribution over all the candidate products, we apply a softmax layer on top of s(·):

P(ỹ = ith product) = exp(s(Wp_{i,:}, c_i)) / Σ_j exp(s(Wp_{j,:}, c_j))   (5.5)
Given the probability distribution, we can rank products according to their probabilities for
recommendations.
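A toy sketch of the scoring and ranking step in Equation 5.5, assuming the product-specific context vectors c_j from the interaction layer have already been computed (all values are illustrative):

```python
import math

def recommend(product_embs, contexts, top_k=2):
    """Score each product j by s(Wp_j, c_j) = dot product, apply a
    softmax over products (Eq. 5.5), and return top-k indices."""
    scores = [sum(w * c for w, c in zip(emb, ctx))
              for emb, ctx in zip(product_embs, contexts)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    probs = [e / sum(exps) for e in exps]
    ranked = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
    return ranked[:top_k], probs

# Toy embeddings Wp_{j,:} and their product-specific contexts c_j
product_embs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
contexts     = [[0.9, 0.1], [0.2, 0.3], [0.4, 0.4]]
top, probs = recommend(product_embs, contexts)
```

Sorting by the resulting probabilities yields the ranked recommendation list described in the text.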
5.3.6 Time Layer: Predicting Recommendation Time

Besides sequential information, event sequences also provide rich temporal information that can be leveraged to model the dynamics of a user's interest and make recommendations just in time. For example, a user's interest in a certain product naturally fluctuates instead of staying constant. Accurately modeling these dynamics enables us to recommend the right product at the time when the user needs it. However, as mentioned previously, it is very challenging to model the dynamics of user interest, as they vary from product to product. Fortunately, the unique representation c_j of the user's history for each product enables us to achieve this goal. In this subsection, we describe our time prediction layer, which leverages c_j to model the temporal dynamics of the user's interest in the jth product. Next, we briefly introduce the temporal point process as a tool for modeling temporal dynamics.
A temporal point process is a random process whose realization consists of a sequence of isolated events, each with a time stamp t, denoted by {t_i | i = 1, 2, ..., N}, where N is the number of events and t_i ∈ N+. Note that a temporal point process is also called a counting process in some literature and the two terminologies are interchangeable; in this work, we stick to the former. A point process is often modeled by its conditional intensity function λ(t|H_t), where H_t is the history before t, which represents the rate of events happening at time t given the time stamps of all events happening before t. The formal definition of λ(t|H_t) is the following [1]:

λ(t|H_t) = lim_{Δt→0} (1/Δt) · [S(t|H_t) − S(t + Δt|H_t)] / S(t|H_t)   (5.6)
where S(t|H_t) is the conditional probability that no new event happens before t given H_t. With Equation 5.6, S(t|H_t) can be straightforwardly written as:

S(t′|H_t) = exp( − ∫_{t′}^{t} λ(τ|H_t) dτ )   (5.7)

Thus, the conditional probability density of an event happening at time stamp t_i is f(t_i) = λ(t_i)S(t_i), and the log-likelihood of the realization {t_i | i = 1, 2, ..., N} is:

L({t_i | i = 1, 2, ..., N}) = Σ^N_{i=1} log λ(t_i) − ∫_0^{t_N} λ(τ) dτ   (5.8)
Depending on the assumptions about parametric form of λ(t|Ht), the temporal point
process can be reduced to more speciﬁc ones. For example, if λ(t|Ht) = λ(t) = λ0, it is
the Poisson process [70]. A more general point process is inhomogeneous Poisson process
where λ(t|Ht) = λ(t). Moreover, if λ(t) increases or decreases when new event comes, the
temporal point process becomes Hawkes [52] or self-correcting process [62], respectively.
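For intuition, the simplest case mentioned above, the homogeneous Poisson process with constant intensity λ(t) = λ0, can be simulated by drawing i.i.d. exponential inter-arrival times (a standard textbook construction, not part of the proposed model):

```python
import random

def simulate_poisson(lam0: float, t_end: float, seed: int = 0):
    """Homogeneous Poisson process: with constant intensity lam0,
    inter-arrival times are i.i.d. Exponential(lam0)."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(lam0)  # next inter-arrival gap
        if t > t_end:
            return times
        times.append(t)

# Expected number of events on [0, 100] is lam0 * 100 = 200.
events = simulate_poisson(lam0=2.0, t_end=100.0)
```

More general processes (inhomogeneous Poisson, Hawkes, self-correcting) differ only in how λ(t|H_t) depends on time and on past events.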
Although, these speciﬁc temporal point processes whose intensity function λ(t) takes assumed
66
parametric form has shown it eﬀectiveness in various applications [75, 50, 91], it requires
prior domain knowledge, which is not always available, to make a reasonable choice of the
form. In addition, this approach is constrained by the chosen parametric forms because
it may not align well with the actual data, which is generated by a much more complex
underlying process. To leverage the eﬀectiveness of temporal point process in modeling event
sequences and avoid issues with parametric form of intensity function, in the next, we exploit
the high-level representation that contains the temporal information of the sequences to
model the intensity function.
Modeling the intensity function: The interaction layer outputs a unique high-level representation of the temporal information of the user history for each product. Thus, it is natural to model the intensity function that characterizes the underlying dynamics of the user's interest in a product with its corresponding representation vector. Specifically, the intensity function for the future interest in the jth product can be learned as follows:

λ_j(δt) = σ(w_t · c_j) · e^{−σ(w_d · c_j) δt}   (5.9)

where δt is the elapsed time since the last event in the sequence, w_d and w_t are learnable model parameters, σ(w_t · c_j) determines how much previous events trigger the intensity for the jth product, and σ(w_d · c_j) determines how quickly the influence diminishes.
Predicting the time of the next event: With the intensity function λ_j(δt), the probability density of an event involving the jth product happening at time t is:

p_j(t) = λ_j(t) exp( − ∫_0^t λ_j(τ) dτ )
      = σ(w_t · c_j) exp( − σ(w_d · c_j)·δt + [σ(w_t · c_j)/σ(w_d · c_j)]·(e^{−σ(w_d · c_j)·δt} − 1) )   (5.10)
To predict the time of the event involving the jth product, we calculate the expected value of p_j(t) as follows:

t̂_j = ∫_0^∞ t · p_j(t) dt   (5.11)

Thus, t̂_j indicates the right time to recommend the jth product.
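Equation 5.11 can be approximated numerically. A sketch using the closed-form density from Equation 5.10, abbreviating a = σ(w_t · c_j) and b = σ(w_d · c_j), with simple trapezoidal integration over a truncated horizon (the parameter values are toy choices):

```python
import math

def density(t: float, a: float, b: float) -> float:
    """p_j(t) from Eq. 5.10 with a = sigma(w_t.c_j), b = sigma(w_d.c_j)."""
    return a * math.exp(-b * t + (a / b) * (math.exp(-b * t) - 1.0))

def expected_time(a: float, b: float, t_max: float = 60.0, n: int = 50_000) -> float:
    """Trapezoidal approximation of t_hat = integral_0^inf t * p_j(t) dt.
    The tail beyond t_max is negligible for these decay rates."""
    h = t_max / n
    total = 0.0
    for i in range(n + 1):
        t = i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * t * density(t, a, b)
    return total * h

t_hat = expected_time(a=0.6, b=0.4)  # predicted recommendation time
```

In practice this one-dimensional integral could also be evaluated with any standard quadrature routine; the point is only that t̂_j is cheap to compute once c_j is known.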
5.3.7 Learning Model Parameters

Given S and the next events {(t^i_{k+1}, b^i_{k+1}, p^i_{k+1})}^N_{i=1}, one straightforward loss function for learning the model parameters of the proposed framework is the negative sum of the log-likelihoods of the events:

L = Σ^N_{i=1} ( β·L^p_i + (1 − β)·L^t_i )   (5.12)

where L^t_i = − log p_{p^i_{k+1}}(t^i_{k+1}) (with p_j(t) as in Equation 5.10) and L^p_i = − log P(ỹ = p^i_{k+1}) are the losses of the temporal and product modeling for the ith sequence, respectively; β controls the relative emphasis on time versus item prediction accuracy; N is the total number of sequences; and p^i_{k+1} and t^i_{k+1} are the product and time in the next event of the ith sequence, respectively.
However, we empirically found that the above loss function could lead to very unstable performance, which is consistent with other works [56]. To address this, we adopt a negative sampling strategy in the training process. Specifically, for each positive product p^i_{k+1}, we randomly sample one negative product p^i_{neg} from P and enforce P(ỹ = p^i_{k+1}) > P(ỹ = p^i_{neg}) via the logistic loss for binary classification as follows:

L^p_i = − log σ(s(Wp_{pos,:}, c_pos)) − log(1 − σ(s(Wp_{neg,:}, c_neg)))   (5.13)

where s(·) is the dot product function used in Equation 5.5, and pos and neg are the product indexes of p^i_{k+1} and p^i_{neg}, respectively.
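The negative-sampling loss of Equation 5.13 can be sketched with scalar scores standing in for s(Wp, c) (toy values; in the real model the scores come from the learned embeddings and contexts):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def ns_loss(score_pos: float, score_neg: float) -> float:
    """L_p = -log sigma(s_pos) - log(1 - sigma(s_neg)): push the
    positive product's score up and the sampled negative's down."""
    return -math.log(sigmoid(score_pos)) - math.log(1.0 - sigmoid(score_neg))

# The loss is small when the positive is scored high and the negative low,
good = ns_loss(score_pos=3.0, score_neg=-3.0)
# and large when the ranking is inverted.
bad = ns_loss(score_pos=-3.0, score_neg=3.0)
```

Minimizing this loss therefore enforces the desired ordering P(ỹ = positive) > P(ỹ = negative) without normalizing over the full product set at every step.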
5.4 Experiment
In this section, we conduct extensive experiments with real-world data to evaluate the
proposed model JRec. We ﬁrst describe details of the data followed by experimental settings.
Then we compare the performance of JRec in the tasks of product recommendation and event
time prediction with that of representative baselines.
5.4.1 Experimental Settings
Data Description: The data used in the experiments is collected from JD.com, a popular e-commerce website. The data records click and purchase events from customers' interactions with the website. Thus, each event has two pieces of information: one is the user behavior type (click or purchase); the other is the product. We collected 6,166,916 sessions, and these sessions contain 169,856 products. To avoid the data sparsity problem, we filter out sequences that contain products appearing fewer than one hundred times in the data. We further filter out very short sequences in order to analyze how the length of the sequence affects the performance. We follow the procedure in [56] and split the data into training, validation, and testing sets. Each sequence belongs to one and only one of the three sets.
Figure 5.2: Performance comparison on next item prediction. The prediction performance is measured by Recall@30, Recall@15, and MRR@15. The left and right panels show the results with long and short sequences, respectively. It is easily observed that the proposed framework JRec achieves the best performance on every metric.
Experiment settings: To analyze how the length of the event sequence affects the model performance, we further construct two datasets from the original data. Specifically, we cut the last 10 and 50 events of the training sequences so that one dataset only contains sequences of length 10 and the other only contains those of length 50. This also ensures that both datasets have the same set of sessions, so that we are able to control for variables other than sequence length that could potentially affect the prediction performance.
5.4.2 Performance On Next Item Recommendation
In this subsection, we evaluate our proposed model in terms of its accuracy in predicting the next product a customer will be interested in. Our model produces a probability distribution over all the products. To reflect real-world scenarios, where a recommender system often recommends the few products that best match a customer's interest, we choose two metrics to measure the model performance. The first metric is Recall@x, which is the proportion of cases in which the desired product is ranked in the top x in terms of probability. We vary the
value of x such that x ∈ {15, 30}. The second metric used for evaluation is Mean Reciprocal
Rank (MRR), which is the average of the reciprocal ranks of the desired products. Thus, in contrast to Recall@x, which measures the general recommendation performance of the system, MRR focuses on the system's ability to recommend the user's most desired product. These two metrics are widely used in previous works to evaluate the effectiveness of a recommender system, and their mathematical expressions are as follows:

Recall@x = (1/N) Σ_i I(rank_i ≤ x),    MRR = (1/N) Σ_i 1/rank_i   (5.14)

where I(x) = 1 if x is true and I(x) = 0 otherwise. The higher Recall@x and MRR are, the better the performance is.
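The two metrics in Equation 5.14 can be computed directly from the 1-based ranks of the desired products (the ranks below are toy values):

```python
def recall_at_x(ranks, x):
    """Fraction of cases where the desired product is in the top x."""
    return sum(1 for r in ranks if r <= x) / len(ranks)

def mrr(ranks):
    """Mean of the reciprocal ranks of the desired products."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Toy 1-based ranks of the ground-truth product in four test cases
ranks = [1, 3, 20, 8]
r15 = recall_at_x(ranks, 15)   # 3 of 4 ranks are <= 15
score = mrr(ranks)             # (1 + 1/3 + 1/20 + 1/8) / 4
```

Both metrics reward placing the true next product near the top of the ranked list, with MRR penalizing low ranks more sharply.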
To show the recommendation performance of the proposed model, we compare it with
the following representative baselines:
POP-global: This baseline recommends the most popular products in the training data to
customers.
POP-user: Instead of a global view, this method considers customer-speciﬁc information
and recommends products that have been most popular in the given customer’s history.
Matrix factorization (MF): MF embeds products and customers into low-dimensional spaces according to the customer-item interactions and is a widely adopted approach in recommender systems. However, in session-based settings, MF cannot be directly applied, as the customer set is not fixed and new customers may arrive. Thus, we follow the approach in [56]: when a new session comes, we compute the average embedding of all the products in the session history and use it as the embedding of the session. To obtain the
product embeddings, we construct and factorize an interaction matrix in which the value in the ith row and jth column is the number of occurrences of the jth product in the ith session.
RNN: The Recurrent Neural Network is one of the most effective methods for modeling sequential information and has achieved great success in diverse tasks. Recently, it has also been adopted in recommender systems, achieving state-of-the-art performance. In this baseline, we use the model described in [56] and choose the GRU as the recurrent unit to make the comparison fair. Note that this baseline is also known as GRU4REC [56] and has been widely used as a baseline for session recommendation tasks.
RNN+ATT: This baseline is a variant of the RNN model. It adds one attention layer on top of the basic GRU model. As the attention mechanism is able to identify important information in a sequence, comparing this baseline to the RNN model demonstrates the effectiveness of the attention mechanism.
RMTPP: This method models event data based on a recurrent neural network and a marked temporal point process, and is able to simultaneously predict the type and time of the next event [33]. Thus, if we regard the product as the event type, it can naturally be used to jointly predict the next product and its time.
Implementation Details: All the deep methods are implemented in PyTorch 1 and optimized with the mini-batch stochastic Adam algorithm [69]. For RMTPP, we directly use the code provided by the authors 2. In addition, to make the comparison fair, we use the same network settings (i.e., number of layers, hidden size, mini-batch size, etc.) for all deep methods and select the learning rate from {1e−1, 1e−2, 1e−3, 1e−4} according to the validation performance.
1http://pytorch.org/
2https://github.com/musically-ut/tf_rmtpp
Results: The recommendation performance of all methods is shown in Figure 5.2. The following observations can be made from the figure. (1) All model-based methods achieve better performance than the heuristics, which recommend products merely based on their popularity. (2) RNN-based methods, which are able to capture the sequential information, have a substantial performance gain over the MF methods, which is consistent with previous observations [56]. (3) Compared to RNN, RNN+ATT performs better on long sequences and comparably on short ones. This can be explained by the fact that it is hard to capture long-term dependencies in long sequences, and the attention mechanism can mitigate the problem by making the model focus on related events. In contrast, it is much easier for the RNN to capture the dependencies in short sequences, so the attention mechanism may not be as useful there. (4) RMTPP obtains the worst performance. We suspect this is because RMTPP is designed for cases where there is only a limited number of event types, e.g., fewer than 10, while the number of event types in our task is much larger, as there is a huge number of items in the e-commerce scenario. (5) JRec significantly outperforms all baselines. Specifically, Table 5.1 shows the performance improvement of JRec over the best-performing baseline. We argue that JRec is able to make much better recommendations because of two key advantages. One is the RNN layer, which is able to capture the sequential information in the sequences. The other, which is more important, is the interaction layer, which can exploit the complex relations among behaviors and products. For instance, it is able to attend to highly relevant events that increase the customer's interest in a certain product.

To summarize, the proposed framework is able to outperform all the baseline methods because (1) it models sequential and temporal information; (2) it effectively captures the complex relations between products when behaviors are present; and (3) it explicitly models the varied influence of past events on different products.
Table 5.1: Performance improvement of JRec compared to the best performance of the baselines (%)

Sequence length   Recall@30   Recall@15   MRR@15
K=50              +62.07      +11.86      +21.08
K=10              +55.90      +30.28      +44.81
5.4.3 Performance On Next Time Prediction
In this subsection, we focus on evaluating the performance of our model on time prediction.
Specifically, given the product in the next event, we show how accurately our model can
capture the temporal dynamics of customers' interest by predicting when the next event will happen.
This is a key task for recommender systems that aim to make just-in-time recommendations.
Evaluation: We use the Mean Absolute Error (MAE) as the accuracy measure, which is
computed as MAE = (1/N) Σ_i |t_i − t̂_i|, where t̂_i and t_i are the predicted and actual time of
the next event of the i-th sequence, respectively. A lower MAE score means better performance.
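As a concrete illustration, the MAE defined above reduces to a few lines of code; the numbers in the example are made up purely for illustration.

```python
def mean_absolute_error(predicted, actual):
    """MAE between predicted and actual next-event times; lower is better."""
    assert len(predicted) == len(actual) and predicted
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Two sequences whose next-event times were predicted 2 and 3 time units off:
mean_absolute_error([10.0, 20.0], [12.0, 17.0])  # -> 2.5
```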
We compare the proposed model with the following representative methods for next-time
prediction:
Mean-global: This method predicts the next time based on the average time interval in
the training data.
Mean-local: Rather than relying on global information, this approach uses only local
information and predicts the next time of the session by averaging the historical
time intervals within the session.
Hawkes: The Hawkes process model is widely used to model timestamped event data. It is
based on the "self-exciting" assumption that the arrival of an event increases the rate at
which new events occur.
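To make the "self-exciting" assumption concrete, here is a minimal sketch of a Hawkes intensity function with an exponential kernel; the parameter values (mu, alpha, beta) are illustrative only, not taken from the experiments.

```python
import math

def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
    """Self-exciting Hawkes intensity: each past event at t_i raises the
    rate of new events, with an exponentially decaying contribution."""
    return mu + alpha * sum(math.exp(-beta * (t - ti)) for ti in history if ti < t)

# Right after a burst of events the rate exceeds the baseline rate mu:
hawkes_intensity(5.0, [4.0, 4.5])  # > hawkes_intensity(5.0, []) == mu
```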
RMTPP: As noted in the previous subsection, this method models the event type and time
simultaneously and can directly predict the time of the next event [33].
Results: The results for the time prediction task are shown in Table 5.2. From the table, we
observe that: 1) The Hawkes method obtains the worst performance. We argue this stems
from the incorrect assumption that the underlying process generating the click-purchase
data is self-exciting. In fact, the occurrence of one event may even inhibit the occurrence of the
next event. For instance, after purchasing a very expensive product, a customer may have used up
most of her budget and is unlikely to shop again for some time. While the Hawkes process
has achieved great success in modeling event data, its performance can drop substantially
when its assumption is not satisfied. This serves as a caution against applying parametric models,
whose performance largely depends on the correctness of the assumptions underlying the data. 2) RMTPP
significantly outperforms both heuristics and Hawkes. This is because RMTPP can utilize
RNNs to effectively capture the underlying mechanism generating the sequences. Unlike the
product prediction task, the temporal modeling challenge in our datasets is very
similar to the case for which the model is designed, so it is able to obtain state-of-the-art
performance. 3) JRec achieves the best performance among all methods for both long and short
sequences. It also outperforms RMTPP, which is designed for event time prediction. We
argue this is because JRec is able to capture the varied influence of the historical events
on individual products.
5.5 Discussion
In this chapter, we study the just-in-time recommendation problem, that is, recommending the right
product to a customer at the right time. Specifically, we propose a novel model, JRec, that
leverages the new opportunities brought by fine-grained temporal behavior data and
Table 5.2: Next event time prediction performance comparison (MAE; lower is better).

Baselines      K=50 MAE    K=10 MAE
Mean-global    1038.96     1107.04
Mean-local     1050.38     1050.38
Hawkes         1247.36     3476.80
RMTPP          890.09      892.80
JRec           887.01      889.15
simultaneously predicts the probability of customers' interest and its temporal dynamics.
Specifically, JRec not only models the sequential information of the event sequences, but also
captures the varied influence of a session's historical data on each product and obtains unique
product representations, which pave the way for product prediction and interest dynamics modeling
with a temporal point process. Extensive experiments on real-world e-commerce data have
shown that JRec outperforms representative baselines for both next-product and
next-time prediction.
In this chapter, we focus on the session-based recommendation setting and treat each session
independently. Thus, one interesting future direction is to extend the current framework to
capture the relations among customers whose event sequences are related to one another. Moreover,
we only consider customers' interest in products in general and ignore their interest in a behavior
type. For example, for music service providers, knowing whether a customer will listen to a song or buy
a song could be very important for their marketing strategies. Thus, we would also like to
extend our framework to predict the product, time, and behavior type simultaneously.
76
Chapter 6
Modeling Hierarchy Information
6.1 Introduction
Most widely used language processing systems have been built on neural networks
that are highly effective, achieving performance comparable to humans [28, 132, 135].
They are also very brittle, however, as they can easily be broken by the presence of
noise [12, 140, 34]. In contrast, the human language processing mechanism is very robust.
One representative example is the following Cambridge sentence:
Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht
oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat
ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it
wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by
istlef, but the wrod as a wlohe.
Despite the fact that a human can read the above sentence with little difficulty, it
causes a complete failure for existing natural language processing systems such as the Google
Translation Engine 1. Building robust natural language processing systems is becoming
increasingly important given the severe consequences that adversarial samples can
cause [49]: carefully misspelled spam emails that fool spam detection systems [37], deliberately
designed input sentences that force chatbots to emit offensive language [126, 29, 81],
1https://translate.google.com/
Figure 6.1: The graphical illustration of the proposed framework: MUDE.
etc. Thus, in this work, we focus on building a word recognition framework that can
denoise misspellings such as those shown in the Cambridge sentence. As suggested by
psycholinguistic studies [99, 27], humans can comprehend text that is noised by jumbling
the internal characters of words while leaving the first and last characters unchanged. Thus, an
ideal word recognition model is expected to emulate the robustness of the human language
processing mechanism.
The benefits of such a framework are twofold. The first is that its recognition ability can be
straightforwardly used to correct misspellings. The second is its contribution to the robustness
of other natural language processing systems: by serving as a denoising component, the word
recognition framework can first clean noised sentences before they are input into
other natural language processing systems [98, 144].
From the human perspective, there are two types of information that play an essential role
in recognizing noised words [96]. The first is character-level dependencies. Take
the word ‘wlohe’ in the Cambridge sentence as an example: it is extremely rare to see a ‘w’ sit
next to an ‘l’ in an English word; ‘wh’ is much more natural. Thus, it is quite easy
for humans to narrow down the possible correct forms of ‘wlohe’ to ‘whole’ or ‘whelo’. To determine
that it should be ‘whole’, we often need the second type of information: context information
such as ‘but the wrod as a wlohe.’, which is denoted as word-level dependencies in this chapter.
The character-level and word-level dependencies form hierarchical structures. Intuitively, an
effective word recognition framework should capture these hierarchical structures. However,
hierarchical structures are rarely exploited by existing works such as scRNN [102]. Hence,
we propose a framework, MUDE, that is able to fully utilize hierarchical structures for the
robust word recognition task. It integrates a character-level sequence-to-sequence model and
a word-level sequential learning model into a coherent framework. The major contributions
of our work are summarized as follows:
• We identify the importance of hierarchical structures for recognizing a noised word;
• We propose a novel framework, MUDE, that utilizes both character-level and word-level
dependencies for the robust word recognition task;
• We conduct extensive experiments on various types of noise to verify the effectiveness
of MUDE.
6.2 The Proposed Framework: MUDE
In this section, we describe MUDE, which is able to capture both character-level and word-level
dependencies. The overall architecture is illustrated in Figure 6.1. It consists of three major
components: a sequence-to-sequence model, a bi-directional recurrent neural network, and a
prediction layer. Next, we detail each component.
6.2.1 Learning Character-level Dependencies
As mentioned previously, there exist sequential patterns in the characters of a word. For
example, vocabulary roots such as cur and bio can be found in many words. In this subsection,
we propose a sequence-to-sequence model to learn a better representation of a given word by
incorporating character-level dependencies. The model consists of an encoder and a decoder,
which we will describe next.
6.2.1.1 Encoder
Let ŵ = c_1, c_2, ..., c_m be the sequence of characters of a given noised word ŵ. We first map
each character c_i to a d_c-dimensional character embedding as follows:

x_i = E o_i    (6.1)

where E ∈ R^{C×d_c} is the embedding matrix, given that the total number of unique characters
is C, and o_i ∈ R^C is the one-hot representation of c_i. Since there could be some noise in ŵ, the
sequential order of the c_i can be misleading. Thus, instead of using a sequential learning model
such as a recurrent neural network, we choose the multi-head attention mechanism [117] to
model the dependencies between characters without considering their order. To do so, we
add a special character c_0 whose final representation will be used as the representation of the
word.
Specifically, the multi-head attention mechanism obtains a refined representation for
each character in ŵ. Next, without loss of generality, we use c_i as an example to
illustrate. To obtain the refined representation of c_i, x_i will first be projected into the query
space, and x_j for all j ∈ {0, 1, ..., m} will be projected into the key and value spaces, as follows:

x_i^q = Q x_i
x_j^k = K x_j, ∀j ∈ {0, 1, ..., m}
x_j^v = V x_j, ∀j ∈ {0, 1, ..., m}    (6.2)
where Q, K, V are the projection matrices for query, key, and value spaces, respectively.
With x_i^q, x_j^k, and x_j^v, the refined representation e_i of c_i can be calculated as the weighted sum
of the x_j^v:

e_i = Σ_j α_j x_j^v    (6.3)
where αj is the attention score that is obtained by the following equation:
α_0, α_1, ..., α_m = σ_s( (x_i^q)^T x_0^k / √d, (x_i^q)^T x_1^k / √d, ..., (x_i^q)^T x_m^k / √d )
where σ_s is the softmax function. To capture the dependencies of characters from different
aspects, multiple sets of projection matrices are usually used, which results in multiple
sets of x_i^q, x_j^k, and x_j^v, and thus multiple versions of e_i. To be concrete, assume that there are h sets of projection
matrices; from Equation 6.2 and Equation 6.3, we can obtain h versions of e_i, denoted
{e_i^1, e_i^2, ..., e_i^h}. With this, the refined representation of c_i is obtained by the concatenation
operation:

z_i = concatenation(e_i^1, e_i^2, ..., e_i^h)    (6.4)

where z_i is the new representation of c_i and contains dependency information of c_i relative to the other
characters in ŵ from h aspects.
Following [117], we also add a position-wise feedforward layer on top of z_i:

p_i = W_2 ReLU(W_1 z_i)    (6.5)

where W_1 and W_2 are learnable parameters and p_i is the final representation of c_i. Note
that several of the above-mentioned layers can be stacked together to form a deep structure.
At this point, we have obtained a refined representation vector for each character, and
we use that of the special character c_0 as the representation of the given noised word, which is
denoted as w^c.
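The encoder above can be sketched with PyTorch's built-in multi-head attention; all hyperparameters here (30 characters, d = 16, 4 heads, a single layer) are hypothetical choices for illustration. Because no positional encoding is used, the representation of c_0 is invariant to the order of the remaining characters, which is exactly the order-free property motivated above.

```python
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Sketch of the order-free character encoder (Eqs. 6.1-6.5)."""
    def __init__(self, num_chars=30, d=16, heads=4):
        super().__init__()
        self.embed = nn.Embedding(num_chars, d)            # Eq. 6.1
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ff = nn.Sequential(                           # Eq. 6.5
            nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, char_ids):
        # char_ids: (batch, m+1); position 0 holds the special character c_0
        x = self.embed(char_ids)
        z, _ = self.attn(x, x, x)   # Eqs. 6.2-6.4: no positions, so order-free
        p = self.ff(z)
        return p[:, 0]              # representation of c_0 is the word vector w^c

word = torch.tensor([[0, 5, 3, 7, 2]])  # c_0 followed by four character ids
w_c = CharEncoder()(word)               # w^c, shape (1, 16)
```

In the full model several such attention-plus-feedforward layers would be stacked; a single layer is shown for brevity.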
6.2.1.2 Decoder
To capture the sequential dependency in correct words, the Gated Recurrent Unit (GRU),
which has achieved great performance in many sequence learning tasks [129, 6, 128], is used
as the decoder. To be specific, in the decoding process, the initial hidden state h_0 of the GRU
is initialized with the noised word representation w^c. Then, at each time step t, the GRU
recursively outputs a hidden state h_t given the hidden state h_{t-1} from the previous time step.
Due to the page limitation, we do not show the details of the GRU, which is well described in [20].
In addition, each hidden state emits a predicted character c_t^p. The decoding process
ends when the special character denoting the end of the word is emitted. Concretely, the whole
decoding process is formally formulated as follows:
h_0 = w^c
h_t = GRU(h_{t-1})
p_t = σ_s(W_p h_t)
c_t^p = argmax_i p_t[i]    (6.6)

where W_p ∈ R^{C×d} is a trainable parameter, p_t ∈ R^C gives the emission probability of each
character, and p_t[i] denotes the i-th entry of the vector p_t.
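A minimal sketch of the greedy decoding loop in Eq. 6.6. Since the equation conditions only on the previous hidden state, the GRU cell is driven with a zero input at each step here; the end-of-word index and the dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

EOW = 1  # hypothetical index of the end-of-word character

def greedy_decode(w_c, gru_cell, W_p, max_len=20):
    """Greedy decoding per Eq. 6.6: seed the GRU state with w^c and let
    each state emit the most likely character until end-of-word."""
    h = w_c                                    # h_0 = w^c
    chars = []
    for _ in range(max_len):
        # h_t = GRU(h_{t-1}); a zero input stands in for the absent input
        h = gru_cell(torch.zeros(h.size(0), gru_cell.input_size), h)
        p = torch.softmax(h @ W_p.T, dim=-1)   # p_t = softmax(W_p h_t)
        c = p.argmax(dim=-1).item()            # c_t^p = argmax_i p_t[i]
        if c == EOW:
            break
        chars.append(c)
    return chars

d, C = 16, 30
cell = nn.GRUCell(d, d)
W_p = torch.randn(C, d)
chars = greedy_decode(torch.randn(1, d), cell, W_p)  # list of character indices
```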
6.2.1.3 Sequence-to-sequence Loss
To train the previously described character-level sequence-to-sequence model, we deﬁne the
loss function as follows:
L_seq2seq = − Σ_{i=1}^{m} p_i[y_i]    (6.7)
where y_i is the index of the ground-truth character at position i of the correct word w. By minimizing
L_seq2seq, the designed sequence-to-sequence model can learn a meaningful representation
of the noised word that incorporates character-level sequential dependencies. Next, we
describe the framework component that captures the word-level dependencies.
6.2.2 Capturing Word-level Dependencies
From the human perspective, it is vitally important to consider the context of the whole
sentence in order to understand a noised word. For example, it would be very hard to know
that ‘frist’ means ‘first’ until the context ‘the olny iprmoetnt tihng is taht the frist and lsat ltteer
be at the rghit pclae.’ is given. Thus, to utilize the context information and word-level
dependencies, we design a recurrent neural network (RNN) to incorporate them in the
noised word representation. Specifically, the word representations obtained from the character-level
encoder are passed into a bi-directional long short-term memory (LSTM) network. Concretely,
given a sequence of word representations S = {w_1^c, w_2^c, ..., w_n^c} obtained from the character-level
encoder, we calculate a sequence of refined word representation vectors {w_1, w_2, ..., w_n}
as follows:

w_1^f, w_2^f, ..., w_n^f = LSTM_forward(w_1^c, w_2^c, ..., w_n^c)
w_1^b, w_2^b, ..., w_n^b = LSTM_backward(w_1^c, w_2^c, ..., w_n^c)
w_1, w_2, ..., w_n = w_1^f || w_1^b, w_2^f || w_2^b, ..., w_n^f || w_n^b    (6.8)

where || denotes concatenation. LSTM_forward indicates that the w^c are processed from w_1^c to
w_n^c, while LSTM_backward processes the word representations in the opposite direction, from
w_n^c to w_1^c. Compared to the original LSTM, where only a forward pass is performed, the bi-directional
LSTM can include both ‘past’ and ‘future’ information in the representation of w_i.
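Eq. 6.8 corresponds directly to a bidirectional LSTM layer in PyTorch; the dimensions below (d = 16, n = 5) are hypothetical.

```python
import torch
import torch.nn as nn

# Sketch of Eq. 6.8: a bi-directional LSTM refines the word vectors w^c
# produced by the character-level encoder.
d, n = 16, 5
bilstm = nn.LSTM(input_size=d, hidden_size=d, bidirectional=True, batch_first=True)

w_c = torch.randn(1, n, d)       # sequence of word representations w^c_1 .. w^c_n
w, _ = bilstm(w_c)               # w_i = w^f_i || w^b_i (forward/backward concatenated)
assert w.shape == (1, n, 2 * d)  # each refined vector is twice as wide
```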
With the aforementioned procedure, the representation of each word now incorporates
both character-level and word-level dependencies. Thus, the correct word is predicted as
follows:
p_i^w = σ_s(W_w w_i)
w_i^p = argmax_j p_i^w[j]    (6.9)

where W_w ∈ R^{V×d_w} is a trainable matrix and V is the size of the vocabulary that contains
all possible words. Moreover, p_i^w ∈ R^V is the probability distribution over the vocabulary for
the i-th word in a sentence.
6.2.2.1 Word Prediction Loss
To eﬀectively train MUDE for correct word prediction, similar to character-level sequence-to-
sequence model, we deﬁne the following objective function:
L_pred = − Σ_{i=1}^{n} p_i^w[y_i^w]    (6.10)

where y_i^w is the index of the i-th correct word.
6.2.3 Training Procedure
So far we have described MUDE which includes a character-level sequence-to-sequence model
and a word-level sequential learning model. To train both models simultaneously, we design
a loss function for the whole framework as follows:
L = Lpred + βLseq2seq
(6.11)
where β is a hyperparameter that controls the contribution of the character-level sequence-to-
sequence model. Since the major goal of the framework is to predict the correct word given
the noised word, we decrease the value of β gradually as training proceeds so that the
optimizer increasingly focuses on improving the word prediction performance.
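The decay schedule for β is not specified beyond "decreased gradually"; an exponential decay is one plausible choice, sketched here with made-up constants.

```python
# Sketch of the combined objective (Eq. 6.11) with a decaying beta.
# The exponential schedule and its constants are assumptions.
def combined_loss(l_pred, l_seq2seq, epoch, beta0=1.0, decay=0.9):
    beta = beta0 * decay ** epoch  # beta shrinks as training proceeds
    return l_pred + beta * l_seq2seq

# Early on, the character-level loss contributes strongly...
combined_loss(1.0, 2.0, epoch=0)   # -> 3.0
# ...later its weight has decayed and word prediction dominates.
combined_loss(1.0, 2.0, epoch=20)  # ~1.24
```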
6.2.3.1 Test Stage
As shown in Figure 6.1, in the test stage, we simply remove the decoder of the sequence-to-
sequence model and only keep the encoder in the framework.
6.3 Experiment
In this section, we conduct extensive experiments on the spell correction task to verify the
effectiveness of MUDE. We first introduce the experimental settings, followed by an
analysis of the experimental results.
6.3.1 Experimental Settings
6.3.1.1 Data
We use the publicly available Penn Treebank [86] as the dataset. Following the previous
work [102], we first experiment with 4 different types of noise: Permutation (PER), Deletion
(DEL), Insertion (INS), and Substitution (SUB), which operate only on the internal characters
of words, leaving the first and last characters unchanged. Table 6.1 shows a toy example of a
noised sentence. These 4 types of noise cover most realistic cases of misspellings
and are commonly tested in previous works [12, 98]. For each type of noise, we construct a noised
dataset from the original dataset by altering all words that have more than 3 characters
with the corresponding noise. We use the same training, validation, and test split as [102],
which contains 39,832, 1,700, and 2,416 sentences, respectively.
Table 6.1: Toy examples of noised text

Noise Type   Sentence
Correct      An example of noised text
PER          An epaxmle of nsieod txet
DEL          An examle of nosed tet
INS          An edxample of nmoised texut
SUB          An exsmple of npised test
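The noise construction described above can be sketched roughly as follows. The helper names are hypothetical, and the sampling details (e.g., which internal position is altered) are assumptions, since the thesis does not specify them.

```python
import random

def _internal(word):
    """Split a word into first char, internal chars, last char."""
    return word[0], list(word[1:-1]), word[-1]

def per(word, rng=random):
    """Permutation: shuffle internal characters (words > 3 chars only)."""
    if len(word) <= 3:
        return word
    first, mid, last = _internal(word)
    rng.shuffle(mid)
    return first + "".join(mid) + last

def dele(word, rng=random):     # named dele to avoid the `del` keyword
    """Deletion: drop one internal character."""
    if len(word) <= 3:
        return word
    first, mid, last = _internal(word)
    mid.pop(rng.randrange(len(mid)))
    return first + "".join(mid) + last

def ins(word, rng=random):
    """Insertion: insert a random letter at an internal position."""
    if len(word) <= 3:
        return word
    first, mid, last = _internal(word)
    mid.insert(rng.randrange(len(mid) + 1),
               rng.choice("abcdefghijklmnopqrstuvwxyz"))
    return first + "".join(mid) + last

def sub(word, rng=random):
    """Substitution: replace one internal character with a random letter."""
    if len(word) <= 3:
        return word
    first, mid, last = _internal(word)
    mid[rng.randrange(len(mid))] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return first + "".join(mid) + last

# Permutation keeps the endpoints and the multiset of characters.
noised = per("example")
assert noised[0] == "e" and noised[-1] == "e"
assert sorted(noised) == sorted("example")
```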
6.3.1.2 Baselines
To show the effectiveness of MUDE, we compare it with two strong and widely used baselines. The first is the Enchant 2 spell checker, which is based on dictionaries. The second is scRNN [102], a recurrent neural network based word recognition model that achieved the previous state-of-the-art results on spell correction tasks. This baseline only considers the sequential dependencies at the word level with a recurrent neural network and ignores those at the character level. Note that other baselines, including CharCNN [107], have been significantly outperformed by scRNN; thus, we do not include them in the experiments.
6.3.1.3 Implementation Details
Both scRNN and MUDE are implemented in PyTorch. The number of hidden units of word representations is set to 650, as suggested by previous work [102]. The learning rate is chosen from {0.1, 0.01, 0.001, 0.0001} and β in Equation 6.11 is chosen from {1, 0.1, 0.001} according to the model performance on the validation datasets. The parameters of MUDE are learned with a stochastic gradient descent algorithm, and we choose RMSprop [115] as the optimizer, following [102]. To make the comparison fair, scRNN is trained with the same settings as MUDE.
6.3.2 Comparison Results
The comparison results are shown in Table 6.2. Several observations can be made from the table. The first is that the model based methods (scRNN and MUDE) achieve much better performance than the dictionary based one (Enchant). This is not surprising, as model based methods can effectively utilize the sequential dependencies of words in the sentences.
2https://abiword.github.io/enchant/
Moreover, MUDE consistently outperforms scRNN in all cases, which we attribute to the effective design of MUDE to capture both character and word level dependencies. A more detailed analysis of the contribution of the character-level dependencies will be given later in this section. In addition, we observe that the difficulty brought by different types of noise varies significantly. Generally, for model based methods, permutation and insertion noises are relatively easier to deal with compared to deletion and substitution noises. We argue this is because the former do not lose any character information. In other words, the original character information is largely preserved under permutation and insertion. On the contrary, both deletion and substitution cause information loss, which makes it harder to recognize the original words. This again demonstrates how important the character-level information is. Finally, the results also show that in the more difficult situations where deletion or substitution noise is present, the advantages of MUDE become even more obvious. This clearly suggests the effectiveness of MUDE.
Table 6.2: Performance comparison on different types of noise in terms of accuracy (%). Best results are highlighted with bold numbers.

Method    PER     DEL     INS     SUB
Enchant   79.77   71.23   93.93   72.33
scRNN     98.23   91.55   95.95   87.09
MUDE      98.81   95.86   97.16   90.52
Next, we take one step further by removing the constraint that the noise does not affect the first and last characters of each word. More specifically, we define 4 new types of noise, W-PER, W-DEL, W-INS, and W-SUB, which alter a word by permuting the whole word, or by deleting, inserting, or substituting characters at any position of the word. Similarly, for each new type of noise, we construct a noised dataset. The results are shown in
Figure 6.2: Learning curve of MUDE in the training procedure with diﬀerent β values.
Table 6.3: Performance comparison on different types of noise in terms of accuracy (%). Best results are highlighted with bold numbers.

Method    W-PER   W-DEL   W-INS   W-SUB
Enchant   59.08   77.23   93.77   69.84
scRNN     97.36   81.12   95.96   89.99
MUDE      98.75   85.17   97.10   93.29
Table 6.3.
From the table, we first observe that the performance of nearly all methods decreases compared to Table 6.2. This suggests that the new types of noise are more difficult to handle, which is expected as they cause more variations of noised words. In fact, without keeping the first and last characters of each word, it also becomes difficult for humans to comprehend the noised sentences [99]. Secondly, MUDE still achieves higher accuracy than the baselines, which is consistent with the observations from Table 6.2. More importantly, as the difficulty of the task increases, the advantage of MUDE over scRNN also becomes more obvious. Taking substitution noise as an example, in Table 6.2 MUDE has an absolute accuracy gain of around 3.5% over scRNN. When the more difficult noise (W-SUB) comes, the
performance gain of MUDE becomes around 4%, as shown in Table 6.3. Such an observation is also
consistent with previous ﬁndings.
In summary, both Tables 6.2 and 6.3 clearly demonstrate the robustness of MUDE and its advantages over scRNN, which cannot utilize the character-level dependencies. Thus, in the next subsection, we analyze the contribution of character-level dependencies to gain a better understanding of MUDE.
6.3.3 Parameter Analysis
In this subsection, we analyze the contribution of character-level dependencies to better word representations by showing the model performance with different values of β, which controls the contribution of the character-level sequence-to-sequence loss. Specifically, we set β to 0 and 1. When β is 0, MUDE totally ignores the character-level dependencies; when β equals 1, MUDE achieves its best accuracy on the validation set. The prediction loss and seq2seq loss during the training stage with different β values are shown in Figure 6.2. Note that the trends in Figure 6.2 are similar for all of the different types of noise, and we only show the W-PER case due to space limitations.
As the upper sub-figure shows, when β = 1 the prediction loss converges faster and to a lower value compared with the case of β = 0. The seq2seq loss remains constant when β = 0, as the model does not learn anything about the seq2seq task. On the other hand, when β = 1, the seq2seq loss decreases steadily, suggesting that MUDE is trained to obtain better representations of each word. The obvious positive correlation between these two losses clearly demonstrates the importance of learning character-level dependencies in misspelling correction tasks.
Table 6.4: Generalization analysis results. The best results are highlighted. MEAN shows
the average value of each row.
              Train Noise
Test Noise    PER     W-PER   DEL     W-DEL   INS     W-INS   SUB     W-SUB   MEAN
PER           –       98.81   82.55   79.61   92.21   92.37   71.39   69.88   85.70
W-PER         98.75   –       81.31   78.30   91.32   91.25   69.55   67.91   84.64
DEL           90.83   90.83   –       86.02   79.96   79.97   81.99   76.02   85.18
W-DEL         86.75   86.75   94.08   –       78.83   78.87   80.35   79.07   84.74
INS           94.79   94.79   77.30   74.81   –       97.15   82.86   80.42   87.41
W-INS         95.67   95.67   78.34   75.95   97.01   –       82.96   80.78   87.91
SUB           91.71   91.71   88.34   81.49   81.19   81.21   –       85.67   86.22
W-SUB         87.05   87.05   83.42   82.42   79.27   79.17   83.65   –       83.65
Table 6.5: Data augmentation results. The values that are higher than those of Table 6.4 are bold.

              Test Noise
Train Noise   PER     W-PER   DEL     W-DEL   INS     W-INS   SUB     W-SUB   MEAN
W-ALL         96.45   95.28   96.45   95.30   94.26   93.34   91.51   90.48   94.13
6.3.4 Generalization Analysis
In this subsection, we conduct experiments to understand the generalization ability of MUDE. Concretely, we train the framework on one type of noise and test it on a dataset with another type of noise. The results are shown in Table 6.4.
From the results, we make two observations. First, MUDE generalizes quite well between datasets with similar types of noise (e.g., trained on W-PER and tested on PER), which is not surprising. However, MUDE trained on one type of noise performs much worse on other, very different types of noise. These observations suggest that it is hard for MUDE to generalize across noise types, which we argue is possibly because of the small overlap between the distributions of the different types of noise.
Thus, we next apply the commonly used adversarial training method: we augment the training data with all types of noise to train MUDE and test it on each type of noise individually. As W-* (* ∈ {PER, DEL, INS, SUB}) includes *, in this experiment we only combine the W-* noises instead of all types of noise. We denote the newly constructed training dataset as W-ALL. The results are shown in Table 6.5. It can be observed from the table that MUDE trained on W-ALL has a much better generalization ability (i.e., the mean value is much higher). In addition, it is interesting to see that the performance of MUDE decreases slightly in the relatively easy cases where permutation or insertion noise is present, while increasing a lot in the difficult cases where deletion or substitution noise is present.
6.3.5 Case Study
In this subsection, we take the Cambridge sentences, which are not in the training set, as an example to give a qualitative illustration of MUDE's misspelling correction performance. Note that due to space constraints, we only show the results for two types of noise: W-PER and W-INS. The examples are shown in Tables 6.6 and 6.7, respectively. We can see from the tables that it is quite difficult even for humans to comprehend the noised sentence when the first and last characters are also changed. However, MUDE can still recognize almost all of the words. In addition, in both cases MUDE makes far fewer errors in the corrected sentence than scRNN, which is consistent with the previous quantitative results.
6.4 Related Work
In this section, we briefly review the related literature, which is grouped into two categories. The first category includes existing works on similar tasks and the second contains
Table 6.6: An illustrative example of spelling correction outputs for the Cambridge sentence
noised by W-PER. Words that the models fail to correct are underlined and bold.
Correct: According to a researcher at Cambridge University , it does n’t matter in what order the letters in a word are , the only important thing is that the first and last letter be at the right place . The rest can be a total mess and you can still read it without problem . This is because the human mind does not read every letter by itself , but the word as a whole .

Noised: iodrcAngc ot a reeachsr at meigaCdbr srtiinUyve , it seod tn’ amrtte in wtah rerdo het tserelt in a rdwo rae , the onyl onmtiaptr ingth si tath hte itfrs dan stla treelt be ta het tgrhi place . hTe rset nca be a aotlt mess dan ouy anc lsilt drae ti tthwuoi lorbmpe . hTsi is aubeecs the huamn dmni edos nto erad evrye lteter by etfisl , but het rdwo sa a eholw .

scRNN: According to a research at Cambridge University , it does n’t matter in what order the letters in a word are , the only important thing is that the first and last letter be at the right place . The rest can be a total mess and you can still read it without problem . This is because the human mind does not read very letter by itself , but the word as a whole .

MUDE: According to a research at Cambridge University , it does n’t matter in what order the letters in a word are , the only important thing is that the first and last letter be at the right place . The rest can be a total mess and you can still read it without problem . This is because the human mind does not read every letter by itself , but the word as a whole .
previous works that have applied word recognition model to improve the robustness of other
NLP systems.
6.4.1 Grammatical Error Correction
Since the CoNLL-2014 shared task [93], Grammatical Error Correction (GEC) has gained great attention from the NLP community [138, 46, 66, 21, 65]. Currently, the most effective approaches regard GEC as a machine translation problem that translates erroneous sentences into correct sentences. Thus, many methods based on statistical or neural machine translation architectures have been proposed. However, most existing GEC systems
Table 6.7: An illustrative example of spelling correction outputs for the Cambridge sentence
noised by W-INS. Words that the models fail to correct are underlined and bold.
Correct: According to a researcher at Cambridge University , it does n’t matter in what order the letters in a word are , the only important thing is that the first and last letter be at the right place . The rest can be a total mess and you can still read it without problem . This is because the human mind does not read every letter by itself , but the word as a whole .

Noised: Acxcording to a reysearch at Cazmbridge Univversity , it doesw n’t msatter in whmat orderh the letteros in a fword are , the oynly wimportant tghing is tyhat the fircst and ldast legtter be at the rightv placeu . The resty can be a totalp mesus and you can stillb rnead it withougt promblem . Txhis is bebcause the humgan minnd doess not reabd everyb lettfer by itslelf , but the whord as a whvole .

scRNN: according to a research at Cambridge University , it does n’t matter in what order the letters in a word are , the only important thing is that the first and last better be at the right place . The rest can be a total less and you can still read it without problem . This is because the human mind does not rated every better by itself , but the word a a whole .

MUDE: According to a research at Cambridge University , it does n’t matter in what order the letters in a word are , the only important thing is that the first and last letter be at the right place . The rest can be a total uses and you can still read it without problem . This is because the human mind does not bear every letter by itself , but the word as a whole .
have focused on the correction of grammatical errors rather than noised spellings. For example, most words in an erroneous sentence of the CoNLL-2014 shared task [93] are correct, such as in ‘Nothing is absolute right or wrong’, where the only error comes from the specific form ‘absolute’. One of the existing works most similar to this chapter is scRNN [102], where each word is represented in a fixed ‘bag of characters’ way. It only consists of a word-level RNN and focuses on very easy noise. On the contrary, our proposed framework is more flexible and can obtain meaningful representations that incorporate both character and word-level dependencies. In addition, we have experimented on more difficult types of noise than those in [102] and achieved much better performance.
6.4.2 Denoising Text for Downstream Tasks
Robust NLP systems are becoming increasingly important given the severe consequences adversarial samples can cause [43, 63, 49]. Previous works have shown that neural machine translation models can be easily broken by words whose characters are permuted [12]. To solve this problem, researchers have found that misspelling correction models can play an extremely effective role [98, 144] in improving the robustness of such systems. For example, Pruthi et al. [98] first applied the pre-trained scRNN model to the source sentence to remove noise, and the denoised source sentence was then input into the neural translation model to obtain the correct translation. In addition, Zhou et al. [144] directly integrated such denoising models into a machine translation system trained in an end-to-end fashion. Either way, these works suggest that the proposed framework, which has demonstrated strong performance, has great potential for improving the robustness of other NLP systems.
6.5 Discussion
As most current NLP systems are very brittle, it is extremely important to develop robust neural models. In this chapter, we have presented a word recognition framework, MUDE, that achieves very strong and robust performance in the presence of different types of noise. The proposed framework is able to capture hierarchical structures to obtain effective word representations. Extensive experiments on datasets with various types of noise have demonstrated its superior performance over existing popular models.
There are several meaningful future research directions that are worth exploring. The first is to extend MUDE to deal with sentences where word-level noise is present. For example, in the noised sentences, some of the words might be swapped, dropped, inserted, or replaced. In addition, it is also meaningful to improve the generalization ability of MUDE such that it can achieve strong performance in the presence of types of noise not seen in the training dataset. Another possible future direction is to utilize MUDE to improve the robustness of other NLP systems, including machine translation, reading comprehension, and text classification. Moreover, as this work primarily focuses on English, it would be very meaningful to evaluate the proposed framework on other languages. Finally, since dictionaries contain very complete information about words, we believe that incorporating them into our framework can further improve the performance.
Chapter 7
Modeling Multi-scale Sequential
Dependencies
7.1 Introduction
Nowadays, Information and Communication Technology (ICT) has permeated every aspect of our daily life and plays a more crucial role in society than ever. While ICT systems have brought unprecedented convenience, when in abnormal states caused by malicious attackers, they can also lead to ramifications including severe economic and social losses [39, 124, 73, 16, 76]. Therefore, it is vital to detect abnormal states of ICT systems in a timely and accurate manner so that the loss can be mitigated. Fortunately, with the ubiquitous sensors and
networks, ICT systems have generated a large amount of monitoring data [84, 3, 142, 32].
Such data contains rich information and provides us with unprecedented opportunities to
understand complex states of ICT systems.
One of the most important types of monitoring data is discrete event sequences, which can be seen everywhere, such as control commands of machine systems, logs of a computer program, transactions of customer purchases, and DNA sequences in genetics. Due to the rich information they
provide, they have been a valuable source for anomaly detection adopted by both academic
research and industry practice [15, 14, 17, 36, 32]. For example, system logs that record the
Figure 7.1: An illustrative example of BGL log messages and the corresponding log key
sequence. Each log message contains a predeﬁned log key that is underscored by red lines.
detailed messages of run-time information in nearly all modern computer systems, are extensively used for anomaly detection [82, 32, 130, 95]. Each log message can be roughly
considered as consisting of a predeﬁned constant print statement (also known as “log key” or
“message type”) and a speciﬁc parameter (also known as “variable”). When the log keys are
arranged chronologically according to the recording time, they form a discrete event sequence
that can reﬂect the underlying system state. Figure 7.1 shows an illustrative example of logs
from a BlueGene/L supercomputer system (BGL). In this example, there are six log messages
that are generated by ﬁve corresponding predeﬁned statements (log keys). These log keys
form a discrete event sequence. When the system is in an abnormal state, the resulting discrete event sequences will deviate from normal patterns. For instance, if “k2” always comes after “k1” in a normal state, then the log key sequence shown in Figure 7.1 may indicate an abnormal state of the system, because it shows that “k2” comes directly after “k5”, which is unlikely to happen in the normal state. In this work, we refer to discrete event sequences generated by
normal and abnormal system states as normal and abnormal sequences, respectively.
Although anomaly detection for discrete event sequences has attracted much attention [85, 92, 17, 47, 94, 32], it still remains an extremely difficult task due to several intrinsic
challenges. The ﬁrst challenge is from the data imbalance issue that is commonly seen in
all kinds of anomaly detection problems. As systems are in normal states most of the time, abnormal sequences are very rare. This makes the data highly imbalanced between normal and abnormal states. Thus, binary classification models that have achieved great success in other problems become ineffective for anomaly detection. Moreover, in reality we often do not have prior knowledge about abnormal sequences, which further exacerbates the difficulty. The second obstacle comes from the discrete nature of events. Unlike continuous sequences, where each event is real-valued and has a physical meaning, a discrete event sequence consists of discrete symbols, making it hard to capture the relations of events over time. The last but not least challenge is the sequential nature of the data.
In order to determine whether a discrete event sequence is abnormal or not, it is essential to
consider each individual event, subsequences of events and the whole sequence simultaneously.
This requires dedicated efforts to design models that not only have a strong capability to capture sequential dependencies but also are flexible enough to handle sequential dependencies
at diﬀerent scales.
In this chapter, in an attempt to address the aforementioned challenges, we propose a one-class classification framework for event sequence anomaly detection (OC4Seq). It directly integrates the anomaly detection objective with specially designed deep sequence models that explicitly incorporate sequential dependencies at different scales. The
main contributions of this work are summarized as follows:
• We empirically identify the importance of multi-scale sequential dependencies in anomaly detection for discrete event sequences.
• We introduce a novel framework (OC4Seq) to explicitly capture multi-scale sequential
dependencies and directly optimize the deep sequence models with one-class classiﬁcation
objective.
• We conduct extensive experiments on three datasets to consistently demonstrate that OC4Seq outperforms state-of-the-art representative methods by a significant margin.
7.2 Problem Statement
Given an event set E that contains all possible discrete events, an event sequence S_i is defined as S_i = (e^i_1, e^i_2, ..., e^i_{N^i}), where e^i_j ∈ E and N^i is the length of the sequence S_i. Each event e^i_j is represented by a categorical value, i.e., e^i_j ∈ N^+.
With the notations above, the problem of anomaly detection for discrete event sequences under the one-class setting is formally defined as follows:
Given a set of sequences S = {S_1, S_2, ..., S_N}, where each sequence S_i is normal, we aim to design a one-class classifier that is able to identify whether a new sequence S belongs to the normal class or not by capturing the underlying multi-scale sequential dependencies in S.
7.3 Preliminaries: One-Class Classiﬁer
In this section, we introduce preliminaries that lay a foundation for our proposed framework.
A one-class classifier is a specially designed classifier that is trained with objects of a single class and predicts, at test time, whether an object belongs to this class or not. Among the most widely used one-class classifiers are kernel-based methods such as One-Class Support Vector Machines (OC-SVM) [103] and Support Vector Data Description (SVDD) [114]. Both OC-SVM and SVDD are inspired by SVM, which maximizes the margin between two classes.
Next, we use SVDD as an example to illustrate traditional one-class classiﬁers. SVDD aims at
ﬁnding a spherically shaped boundary around the given data in the kernel space. Concretely,
let the center of the hypersphere be c and the radius be R > 0. The SVDD objective is
deﬁned as follows:
min_{R,c,ε} R² + C Σ_{i=1}^{n} ε_i
s.t. ‖φ(x_i) − c‖² ≤ R² + ε_i,  ε_i ≥ 0,  ∀i    (7.1)
where x_i is the feature vector of the ith data point and ε_i ≥ 0 is a slack variable for x_i, introduced to allow the possibility of outliers in the training set; the hyperparameter C controls the trade-off between the errors ε_i and the volume of the sphere. The objective defined in Equation 7.1 is in primal form and, similarly to SVM, it is solved in the dual space using Lagrange multipliers. For more details of the optimization, please refer to the original paper [114]. Once R and c are determined, points that fall outside the sphere are classified as not belonging to the target class.
Recently, a deep neural network based one-class classiﬁer called Deep SVDD was introduced
in [100]. Inspired by SVDD, it tries to ﬁnd a minimum hypersphere in the latent space.
Unlike SVDD which relies on kernel functions for feature transformation, Deep SVDD takes
advantage of deep neural networks to learn data representations. Speciﬁcally, the simpliﬁed
objective that employs a quadratic loss for penalizing the distance of every data point
Figure 7.2: The overview of the proposed framework OC4Seq. It consists of two major components that learn sequential information from the global and local perspectives, respectively.
representation to the center is defined as:

min_W (1/n) Σ_{i=1}^{n} ‖φ(x_i) − c‖² + (λ/2) ‖W‖²_F    (7.2)

where φ(·) is the deep neural network whose parameters are W, and ‖·‖_F indicates the Frobenius norm. This objective function has been proven to have nice theoretical properties [100]. Once the neural network is trained and the center is fixed, outliers are detected in the same way as in SVDD.
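The simplified Deep SVDD objective of Equation 7.2 can be sketched numerically as follows. A single linear map stands in for the deep network φ, which is purely an illustrative assumption (the actual model is a trained neural network), and the sizes are hypothetical.

```python
import numpy as np

def deep_svdd_loss(X, W, c, lam=1e-3):
    """Simplified Deep SVDD objective (Equation 7.2): mean squared
    distance of phi(x_i) = W x_i to the center c, plus a Frobenius-norm
    weight regularizer. A linear map stands in for the deep network."""
    Z = X @ W.T                       # phi(x_i) for every row of X
    dist = ((Z - c) ** 2).sum(axis=1)
    reg = (lam / 2.0) * (W ** 2).sum()
    return dist.mean() + reg

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))         # 100 "normal" points, 8-dim features
W = rng.normal(size=(4, 8)) * 0.1     # toy linear "network" into 4 dims
c = (X @ W.T).mean(axis=0)            # a common choice: center = mean embedding
loss = deep_svdd_loss(X, W, c)
assert loss > 0
```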
7.4 The Proposed Framework
In this section, we introduce the multi-scale one-class framework for event sequence anomaly detection. An overview of the proposed framework, OC4Seq, is shown in Figure 7.2. It consists of two major components that focus on the global and local information in the sequences, respectively. The details of each component are described in the following subsections.
7.4.1 Learning Embeddings for Events
The inputs of the framework are the sequences of events, where each event e_t is a one-hot vector with e_t(j) = 1 and e_t(i) = 0 ∀ i ≠ j if e_t is the jth event type of the set E. In real-world scenarios, the event space can be very large, i.e., there can be tens of thousands of event types. This can make e_t very high-dimensional and cause notorious learning issues such as sparsity and the curse of dimensionality. In addition, the one-hot representation makes the implicit assumption that events are independent of each other, which does not hold in most cases. Therefore, we design an embedding layer that embeds events into a low-dimensional space that can preserve the relations between events. To do so, we introduce an embedding matrix E ∈ R^{d_e×|E|}, where d_e is the dimension of the embedding space and |E| is the number of event types in E. With this, the representation of e_t is obtained as follows:

x_t = E · e_t    (7.3)

where x_t ∈ R^{d_e} is the new low-dimensional dense representation vector for e_t. After the embedding layer, the input sequences are passed into the other components, which are introduced next.
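The embedding step can be illustrated as follows. Note that with E ∈ R^{d_e×|E|}, multiplying by a one-hot event vector simply selects the corresponding column of E, so the embedding layer amounts to a table lookup (this is exactly what frameworks like PyTorch's `nn.Embedding` implement). Sizes below are illustrative.

```python
import numpy as np

de, n_events = 4, 10                       # d_e and |E| (hypothetical sizes)
E = np.random.default_rng(2).normal(size=(de, n_events))

j = 3                                      # event type index
e_t = np.zeros(n_events)
e_t[j] = 1.0                               # one-hot event vector

x_t = E @ e_t                              # dense representation of e_t
assert np.allclose(x_t, E[:, j])           # one-hot product selects column j
```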
7.4.2 Anomaly Detection from Global Perspective
To detect an anomalous sequence, it is important to learn an eﬀective representation of the
whole sequence in the latent space. To this end, we propose to integrate the widely used
Gated Recurrent Neural Networks (GRU) [20] with a one-class objective function. Specifically,
given a normal sequence, i.e., S_i = (x^i_1, x^i_2, ..., x^i_{N^i}), the GRU outputs a state vector h^i_{N^i} at the final step. Since this vector summarizes all the information of the previous steps, we regard it as the
representation of the whole sequence. Please note that the GRU component can be replaced by any sequence learning model such as Long Short-Term Memory (LSTM) [57]. In fact, we empirically found that the two have similar performance. Due to its structural simplicity, we choose GRU over LSTM. More details of the component analysis can be found in Section 7.5.
Inspired by the intuition behind Deep SVDD that all the normal data should lie within a hypersphere of minimum volume in a latent space, we propose the following objective function to guide the GRU training process:
L_global = min_Θ (1/N) Σ_{i=1}^{N} ‖h^i_{N^i} − c‖² + λ ‖Θ‖²_F    (7.4)

Here c is a predefined center in the latent space and N is the total number of sequences in the training set. The first term in the objective employs a quadratic loss that penalizes the distance of every sequence representation to the center c, and the second term is a regularizer controlled by the hyperparameter λ. This objective therefore forces the GRU model to map sequences to representation vectors that, on average, have minimum distance to the center c in the latent space.
Although the GRU is effective at modeling the whole sequence, it might ignore information vital for event sequence anomaly detection, for the following reason: the abnormal property of a sequence can be caused by only a small abnormal subsequence or even a single abnormal event. However, when the sequence is long, the abnormal information can be overwhelmed by other, normal subsequences during the representation learning procedure. This can lead to a very high false negative rate. Therefore, to solve this problem, we design a subsequence learning component that detects anomalies from the local perspective. In the next subsection, we describe its details.
7.4.3 Anomaly Detection from Local Perspective
In the previous subsection, we introduced combining GRU with the Deep SVDD one-class
classification objective to embed the normal sequences in a latent space where they lie within
a small distance of a predefined center. However, local information that is vital for anomaly
detection could be overwhelmed during this process. Thus, we design a subsequence learning
component from the local perspective.
For a given event sequence, we construct subsequences of a ﬁxed size M with a sliding
window. Therefore, each subsequence contains its unique local information, which plays
an important role in determining whether the whole sequence is abnormal or not. To learn
the representation of a subsequence, we introduce the local GRU component that models
the sequential dependencies in every subsequence. To be concrete, given a subsequence
(x^i_{t−M+1}, x^i_{t−M+2}, ···, x^i_t) of length M, the local GRU processes it sequentially and outputs
M hidden states, the last of which is used as the representation of the local subsequence:

h^i_t = GRU(x^i_{t−M+1}, x^i_{t−M+2}, ···, x^i_t)        (7.5)
Thus, for all subsequences in a sequence, the GRU will obtain a sequence of hidden
representations that incorporate sequential dependencies in every local region as follows:

h^i_1, h^i_2, ···, h^i_{N^i−M} = LocalGRU(x^i_1, x^i_2, ···, x^i_{N^i})        (7.6)
where LocalGRU is the name for the second GRU model that processes each subsequence. For
a normal event sequence, it is intuitive to assume that all of its subsequences are also normal.
Thus, we further assume that all the local subsequences should be within a hypersphere in
another latent space. To impose this assumption, we design the following objective function
to guide the local sequence learning procedure:
L_local = min_{Θ_L} (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N^i} ‖h^i_j − c_L‖² + λ‖Θ_L‖²_F        (7.7)
Here, cL is a predeﬁned center of another hypersphere in the latent space and ΘL contains all
the trainable parameters of LocalGRU. Similarly, the ﬁrst term penalizes the average distance
between all normal subsequences to the center cL and the second term is a regularizer.
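The local component (Eqs. 7.5–7.7) can be sketched in PyTorch as follows. This is an illustrative sketch, assuming `embed` maps key ids to vectors and `local_gru` is a `torch.nn.GRU`; the function names are not from the original implementation.

```python
import torch

def local_representations(embed, local_gru, seqs, M):
    """One representation per length-M sliding window (Eqs. 7.5-7.6)."""
    x = embed(seqs)                          # (batch, seq_len, embed_dim)
    # All length-M sliding windows: (batch, num_windows, M, embed_dim)
    wins = x.unfold(1, M, 1).permute(0, 1, 3, 2)
    b, w, _, d = wins.shape
    _, h_n = local_gru(wins.reshape(b * w, M, d))  # run GRU over every window
    return h_n.squeeze(0).reshape(b, w, -1)        # last state per subsequence

def local_loss(h_local, center_l, params, lam=1e-3):
    """Eq. 7.7: mean squared distance of window states to c_L plus regularizer."""
    dist = ((h_local - center_l) ** 2).sum(dim=-1).mean()
    reg = sum((p ** 2).sum() for p in params)
    return dist + lam * reg
```

`Tensor.unfold` extracts the sliding windows without copying the sequence M times, which keeps the window extraction cheap.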
7.4.4 The Objective Function and Optimization Procedure
In the previous subsections, we have introduced the components of OC4Seq that detect an abnormal
event sequence from the global and local perspectives, respectively. In this subsection, we
design an objective function to combine them. Specifically, given the global and local loss
functions L_global and L_local, the overall objective function of OC4Seq is defined as follows:

min_{Θ,Θ_L} L = L_global + α · L_local        (7.8)
where α is a hyperparameter that controls the contribution from local information in the
sequence. This objective enables us to train the framework in an end-to-end manner. The
speciﬁc optimization procedure is described next.
Optimization. We use stochastic gradient descent (SGD) and its variants (e.g., Adam)
to optimize the objective function deﬁned in Equation 7.8. Following previous work [100],
to accelerate the training process, the predefined center c is computed as follows: given
Table 7.1: The key statistics of RUBiS, HDFS and BGL datasets.

Dataset   # of normal   # of abnormal   # of log keys
RUBiS     11,677        1,000           24
HDFS      558,221       16,838          28
BGL       9,543         985             1,540
the untrained GRU, we first feed the sequences in the training set into it and obtain the
sequence representation vectors. Then we obtain an average vector by computing the mean
of all representation vectors and use it as c. To obtain c_L, a similar process is applied
with untrained LocalGRU. Once c and cL are obtained, they remain as constant vectors
in the optimization process. The whole training process is done when the objective value
converges.
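The center-initialization step described above can be sketched as follows; a minimal illustration, assuming `encoder` is any module mapping a batch of sequences to representation vectors.

```python
import torch

@torch.no_grad()  # the centers are constants, so no gradients are tracked
def init_center(encoder, train_loader):
    """Average the untrained encoder's outputs over the training set to get c."""
    reps = [encoder(batch) for batch in train_loader]  # each: (batch, hidden)
    return torch.cat(reps, dim=0).mean(dim=0)          # fixed during training
```

The same routine, applied with the untrained LocalGRU over all subsequences, yields c_L.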
7.5 Experiment
In this section, we conduct extensive experiments to evaluate the proposed framework OC4Seq
on one synthetic web log dataset and two real-world system log datasets. Next, we first describe
the datasets and experimental settings. Then we present the experimental results and
observations. Finally, we conduct qualitative analysis to gain a deeper understanding of the
proposed framework.
7.5.1 Datasets
RUBiS [5]: This dataset is a synthetic web log dataset and was generated by an auction site
prototype modeled after eBay.com [5]. Speciﬁcally, each log message contains information
related to a user web behavior including user_id, date, request_info, etc. Following previous
works [136, 139], we first parse each log message into a structured representation which
consists of a log key and a variable, among others. Next, the log keys of the same user are
collected in time order to form an event sequence. Thus, each log key sequence
represents a user behavior session in a web server. In addition, we create synthetic anomalous
users and their attack cases to obtain abnormal sequences. After the generation process, we
are able to collect 11,677 normal sequences and 1,000 abnormal sequences.
Hadoop Distributed File System (HDFS) [130]: This dataset was generated by a
Hadoop-based map-reduce cloud environment using benchmark workloads. It contains
11,175,629 log messages, of which 2.9% are labeled as anomalies by Hadoop experts. Each log
message involves a block ID, a time stamp and state information. To make the comparison
fair, we used the publicly available dataset1 processed by [32]. As described in [32], the log
messages are first parsed into structured text so that a log key is extracted from each log
message. In addition, the log keys are sliced into sequences according to the associated block
IDs. As a result, there are 558,221 normal sequences and 16,838 abnormal sequences.
BlueGene/L (BGL) [95]: This dataset contains 4,747,936 log messages generated by
a BlueGene/L supercomputer system at Lawrence Livermore National Labs (LLNL) in
Livermore, California, with 131,072 processors and 32,768GB memory. Each log message
contains system information such as type, time stamp, nodes, content, etc. The log messages
can be categorized into two types, i.e., non-alert and alert. The non-alert messages are labeled
as normal and the alert messages are labeled abnormal. Similarly, the log messages are first
parsed by Drain [54], whose implementation2 is open-sourced by [145]. Following previous
work [88], the log keys are sliced using time sliding windows. A sequence is labeled abnormal
1https://www.cs.utah.edu/ mind/chapters/deeplog_misc.html
2https://github.com/logpai/logparser
Table 7.2: The prediction performance comparison on RUBiS, HDFS and BGL datasets.

Datasets              OC-SVM   PCA     Invariant Mining   DeepLog   OC4Seq
HDFS    F-1 score     0.509    0.634   0.943              0.941     0.976
        Precision     0.622    0.968   0.893              0.952     0.955
        Recall        0.431    0.471   1.000              0.930     0.998
RUBiS   F-1 score     0.351    0.784   0.912              0.935     0.985
        Precision     0.220    0.862   0.841              0.885     0.987
        Recall        0.869    0.718   0.996              0.992     0.983
BGL     F-1 score     0.336    0.423   0.428              0.326     0.747
        Precision     0.215    0.269   0.273              0.196     0.704
        Recall        0.764    0.993   1.000              0.980     0.795
if it contains at least one abnormal message. After processing, there are 9,543 normal and
985 abnormal sequences.
The statistics of the three datasets are summarized in Table 7.1. Moreover, as stated in
Section 7.2, this work focuses on the one-class setting where the training dataset does not
contain any abnormal sequence. Therefore, each dataset is split into training, validation
and test sets by the following process. First, we randomly sample the training data from
the normal sequence set. Then, we separately split the remaining normal sequences and all
the abnormal sequences into validation and test sets with a validation-to-test ratio of 3/7.
Finally, we combine the two validation/test sets into one.
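The splitting procedure above can be sketched in a few lines; the 3/7 ratio is as stated in the text, while the training-set size argument is an illustrative assumption.

```python
import random

def split_dataset(normal, abnormal, n_train, seed=0):
    """One-class split: train on normal only; val/test mix normal and abnormal."""
    rng = random.Random(seed)
    normal = normal[:]              # avoid mutating the caller's list
    rng.shuffle(normal)
    train, rest_normal = normal[:n_train], normal[n_train:]

    def val_test(seqs):             # validation-to-test ratio 3/7
        k = int(len(seqs) * 0.3)
        return seqs[:k], seqs[k:]

    nv, nt = val_test(rest_normal)
    av, at = val_test(abnormal)
    return train, nv + av, nt + at  # val and test each combine both classes
```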
7.5.2 Baselines
To evaluate the proposed framework, we compare it against the following representative
anomaly detection baselines.
• Principal Component Analysis (PCA) [125]. PCA is a classic unsupervised method that
has been extensively used for a variety of problems. More recently, it has become a popular
method for anomaly detection [130]. Specifically, it first constructs a count matrix M,
where each row represents a sequence and each column denotes a log key. Moreover, each
entry M(i, j) indicates the count of the jth log key in the ith sequence. Next, PCA learns
a transformed coordinate system where the projection lengths of normal sequences are
small while those of abnormal sequences are large. Although it has been shown that
PCA can be effective in detecting anomalies, especially in reducing false positives [32],
it totally ignores the sequential information, which could play an important role in
sequence anomaly detection. We use the open-source implementation3 [55].
• Invariant Mining (IM) [82]. IM is another popular unsupervised anomaly detection
method. It is designed to automatically mine invariants in logs and assumes that the
discovered invariants can capture the inherent linear characteristics of log flows. Similar
to PCA, it first constructs a count matrix M. Next, IM learns sparse, integer-valued
invariants with physical meanings from M. Finally, with the mined invariants, IM
makes an invariant hypothesis, and sequences that do not satisfy the hypothesis are
detected as anomalies. As IM also relies on M, it has similar drawbacks to PCA. The
IM used in this work was implemented by [55].
• One-Class SVM (OC-SVM) [103]. OC-SVM is a very effective one-class classifier that
has been extensively used for anomaly detection [77, 122, 4]. It is especially suited
for our setting, where the training set only contains normal data. Specifically, it
learns a kernel that maps the normal data into a latent space where all the normal
sequences cluster in a small region. Thus, a sequence that does not belong to the
cluster is regarded as abnormal. To apply OC-SVM, we first need to extract features
3https://github.com/logpai/loglizer
from each sequence. In this work, we tried two models to extract features: sequence
auto-encoder [25] and bag-of-words [137]. As we empirically found that the latter often has
better performance, we chose bag-of-words as the feature extractor. The OC-SVM
used in this work was implemented with the scikit-learn package4.
• DeepLog [32]. DeepLog is a state-of-the-art log anomaly detection method. This
method is based on an LSTM model that tries to capture the sequential dependencies
in sequences. Specifically, by training with normal sequences, it learns to predict the
next token given the previously seen tokens in a sequence. During the test stage, for
each time step in a sequence, DeepLog outputs a probability distribution over all
the log keys. If any of the actual tokens is not in the top k candidates, it regards the
sequence as abnormal. Compared to the other baselines, this method is able to utilize the
sequential information and has demonstrated state-of-the-art performance in previous
works.
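DeepLog's detection rule as described above can be sketched without any model details; `next_key_probs` is a hypothetical callable standing in for the trained LSTM, returning a probability per log key given the prefix.

```python
def is_abnormal(seq, next_key_probs, k):
    """Flag a sequence if any actual next key falls outside the top-k predictions."""
    for t in range(1, len(seq)):
        probs = next_key_probs(seq[:t])  # distribution over all log keys
        topk = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)[:k]
        if seq[t] not in topk:
            return True                  # one unexpected key flags the sequence
    return False
```

Note this per-step rule is exactly why DeepLog emphasizes local information: a single surprising event anywhere in the sequence triggers detection.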
7.5.3 Experimental Settings
Model Selection: For all the methods with hyper-parameters, we use the validation set to
select the best value and report the performance on the test set. For DeepLog, we follow the
original paper's suggestion [32]. Specifically, both h and g are selected from {8, 9, 10},
which denote the window size and the number of candidates, respectively. The number of
layers is set to 2 and the number of LSTM hidden units is 64. For OC4Seq, we use the same
hyper-parameters as DeepLog and select α that controls the contribution of local subsequence
from {0.01, 0.1, 1, 10}.
4https://scikit-learn.org/
Implementation Details: We implemented OC4Seq with PyTorch 1.5 5. The model is
trained using the Adam [69] optimizer with a learning rate of 0.01. The mini-batch size is
set to 64 and the model is trained for 100 epochs on a single NVIDIA GEFORCE
RTX 2080 card. To encourage reproducible results, we make the code publicly available 6.
Evaluation Metrics: To measure the model performance on anomaly detection, we choose
the widely used Precision, Recall and F1 score as the evaluation metrics. They are deﬁned as
follows:
Precision = TP / (TP + FP),    Recall = TP / (TP + FN),
F1 = (2 · Precision · Recall) / (Precision + Recall)        (7.9)
where TP is the abbreviation for True Positives that denotes the number of true abnormal
sequences that are detected by the model; FP is the abbreviation for False Positives that
measures the number of normal sequences that are regarded as anomalies; FN is False
Negatives, which denotes the number of abnormal sequences the model fails to report. Thus,
by definition, there is a well-known trade-off between precision and recall. On the
other hand, the F1 score considers the balance of the two and is often regarded as a more
comprehensive evaluation metric.
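The metric definitions of Eq. 7.9 can be written as a small reference implementation:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from raw true/false positive/negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```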
7.5.4 Performance Comparison
The results of all methods on three datasets are shown in Table 7.2. From the table, we made
the following observations: 1) On most datasets, OC-SVM has the worst performance. We
5https://pytorch.org/
6https://github.com/xxxxx
(a) Precision-Recall curves with
diﬀerent alpha.
(b) Precision-Recall curves with #
of layers.
(c) Precision-Recall curves with
diﬀerent RNN cell types.
Figure 7.3: Validation Precision-Recall curves on HDFS datasets.
argue that this is because OC-SVM is highly dependent on feature qualities. Unlike dense
data where OC-SVM generally performs very well, it is very hard to extract meaningful
features from discrete event sequences. 2) IM and DeepLog outperform PCA significantly
in most of the evaluation metrics. This is expected as both IM and DeepLog are designed
speciﬁcally for anomaly detection for log data. 3) IM and DeepLog generally have comparable
results in terms of F-1 score.
It is interesting to observe that IM always achieves very
impressive Recall while DeepLog is better in Precision. We argue that this diﬀerence could
be caused by the fact that DeepLog focuses mostly on local subsequence information while
IM concentrates on global sequence information. 4) On all the datasets, the proposed
framework OC4Seq has achieved the best F-1 scores. In addition, when compared to the second
best method, the performance gain brought by OC4Seq is signiﬁcant. In terms of Precision,
OC4Seq still achieved the highest value in most cases. For Recall, OC4Seq was only slightly
outperformed by IM which has much lower precision scores. 5) All of the methods performed
much worse on the BGL dataset than on the other two. This is because BGL involves many
more log keys, which can make the task extremely difficult. Moreover, it is interesting to note
that the DeepLog method becomes especially ineffective, as it heavily relies on next-key
prediction, which is very difficult when the log key space becomes large. In this challenging case,
the improvement from OC4Seq over the other baselines becomes even more remarkable.
[Figure 7.3 legend data: average precision is 0.953, 0.973, 0.984, 0.988, and 0.970 for α = 0, 0.01, 0.1, 1, and 10 (panel a); 0.982, 0.988, 0.957, and 0.975 for 1–4 layers (panel b); and 0.875, 0.988, and 0.987 for vanilla RNN, GRU, and LSTM cells (panel c).]
As a summary, from the experimental results on three datasets, our proposed framework
OC4Seq demonstrates its superior performance over the baselines. We argue this is because
OC4Seq can capture sequential information from both local subsequence and whole sequence
and it directly optimizes the anomaly detection objective. Next, we design further experiments
to gain a deeper understanding of OC4Seq.
7.5.5 Parameter Analysis
In this subsection, we analyze key hyper-parameters and components of OC4Seq. We only
report the performance on the HDFS validation dataset, as we have similar observations on the
others. Moreover, the performance is evaluated by the Precision-Recall curve, as it eliminates the
need to choose a specific anomaly threshold and is very suitable for datasets with an imbalanced
label distribution. We also report the area under the Precision-Recall curve (average precision),
where the higher the value, the better the performance.
We first vary the value of α over {0, 0.01, 0.1, 1, 10}, which controls the contribution
from local subsequence information. The results are shown in Figure 7.3a. As can be seen in
the figure, with the increase of α, the performance first increases and then decreases. The
initial increase demonstrates the importance of incorporating local subsequence information,
while the latter decrease suggests that the global sequential information is also very essential
and should not be overwhelmed by local information.
Next, we vary the number of RNN layers over {1, 2, 3, 4} and the results are shown
in Figure 7.3b. These results suggest that more layers do not necessarily lead to better
(a) Showcase of the importance of Local in-
formation.
(b) Showcase of the importance of Global informa-
tion.
Figure 7.4: The local and global anomaly scores of HDFS log key sequences.
performance as it may cause other issues such as overﬁtting. Thus, it is important to select a
proper value through validation process.
Finally, to investigate how different types of RNN cells affect anomaly detection performance,
we experimented with popular cells, i.e., vanilla RNN, GRU, and LSTM. The results are
shown in Figure 7.3c. From the ﬁgure, it is easily seen that the LSTM and GRU cells have
very similar performance while vanilla RNN achieves much worse results. These results are
consistent with previous works [23]. Due to its simpler structure, we choose GRU in OC4Seq.
7.5.6 Case Study
In this subsection, to further understand how the local and global information contribute
to anomaly detection, we conduct case studies involving two pairs of representative log
key sequences in the HDFS dataset. Speciﬁcally, for a sequence, we use the trained model
to calculate the anomaly scores of each subsequence and the whole sequence. The higher
the anomaly score is, the more likely the sequence is abnormal. The results are shown in
Figure 7.4.
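The anomaly score used in these case studies is the squared distance of a representation to its learned center; a dependency-free sketch, where `rep` and `center` stand in for a GRU output and c (or c_L):

```python
def anomaly_score(rep, center):
    """Squared Euclidean distance of a representation vector to its center."""
    return sum((r - c) ** 2 for r, c in zip(rep, center))
```

In OC4Seq, the global score uses the whole-sequence representation against c, while each subsequence representation is scored against c_L from the local perspective.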
In the sub-figure 7.4a, we show the first pair of normal and abnormal sequences. The left
panel uses dot lines to demonstrate the anomalous scores for local subsequences. Each dot
denotes the anomalous score (y axis) of the xth subsequence (x axis). The right panel uses a bar
to show the anomalous score for the whole sequence (global information). From the figure,
we observe that the two sequences have comparable anomalous scores for the whole sequence.
Thus it is very diﬃcult to detect the abnormal one purely from the global perspective.
However, from the local perspective, we can see that the 10th subsequence of the abnormal
sequence has a very high anomalous score, while the anomalous scores of the subsequences of
the normal sequence are all very low. Therefore, in this case, the local information plays a very
important role in detecting anomalies.
In the sub-figure 7.4b, the anomalous scores of the second pair of sequences are shown.
Unlike the previous case, the second pair of sequences has very similar anomalous scores
for local subsequences. This makes it hard to detect anomaly from the local perspective.
However, the abnormal sequence has signiﬁcantly higher anomalous score from the global
perspective than the normal one. Therefore, the global information contributes a lot to
detecting the anomaly in this case. From the two cases, we further illustrate the importance
of combining both local and global information in a sequence for anomaly detection.
7.5.7 Visualization of Normal and Abnormal Sequences
In this subsection, to gain insights into the one-class classifier objective function, we project
the global representations of both normal and abnormal sequences in the HDFS validation
set into a two-dimensional space with the Locally Linear Embedding technique [31]. The visualization
of the two-dimensional space is shown in Figure 7.5. We observe that the normal sequences
generally cluster together and lie in a very small region. On the other hand, the abnormal
Figure 7.5: The visualization of HDFS datasets. Abnormal and normal data are denoted by
red and blue dots, respectively.
sequences spread all over the place. Therefore, the visualization clearly shows that, by directly
minimizing the anomalous score, which measures the distance between a normal data point and
a center point, the one-class classifier objective function is very effective in guiding the model
to separate the abnormal and normal sequences in the latent space.
7.6 Discussion
In this chapter, we propose OC4Seq, a novel one-class recurrent neural network for discrete
event sequence anomaly detection. It can deal with multi-scale sequential dependencies and detect
anomalies from both local and global perspectives. Specifically, OC4Seq incorporates an
effective anomaly detection objective that is able to guide the learning process of sequence
models. Moreover, it explicitly maps the local subsequences and the whole sequence into different
latent spaces, where the abnormal data points can be easily detected. The proposed OC4Seq
has consistently shown superior performance over representative baselines in extensive
experiments on one synthetic and two benchmark datasets. In addition, through parameter
analysis and case studies, the importance of capturing multi-scale sequential dependencies for
discrete event sequence anomaly detection has been well demonstrated. Also, the visualization
of the sequence representations qualitatively suggests the eﬀectiveness of anomaly detection
objectives.
Chapter 8
Conclusions
8.1 Dissertation Summary
In this dissertation, we have described our efforts devoted to sequence learning with side
information. Particularly, we have discussed our proposed effective sequence models that
capture the most common types of side information and their applications in real-world tasks
of various fields.
In Chapter 3, we identiﬁed the existence of intrinsic relation of events in certain sequences.
To exploit it for sequence learning tasks, we proposed a novel framework SEE that is
able to capture both sequential dependencies and event relations effectively. In addition,
our framework has been evaluated extensively with real-world educational and e-commerce
sequences. Compared to RNNs that fail to capture the event relation, our proposed framework
achieves signiﬁcantly better performance consistently. Through a carefully designed model
component analysis, we showed the contribution of event relation for sequence modeling
performance.
In Chapter 4, we focused on capturing the relation between sequences. We ﬁrstly refuted
the assumption behind most of the existing sequence models that sequences are independently
and identically distributed. We showed that in many real-world applications, sequences
are inherently related. Given both challenges and opportunities brought by the sequence
relation, we devoted our efforts to designing a novel sequence model. Our model is based on the
Homophily theory that has been veriﬁed empirically by many researchers from various ﬁelds.
Our model can jointly capture the sequential dependencies of events and sequence relation.
Through comprehensive experiments, we demonstrated that our model is able to outperform
a variety of representative baselines. We also clearly showed that the eﬀectiveness of our
model came from its ability to capture sequence relation.
In Chapter 5, we showed that sequences are also associated with temporal information
and studied the just-in-time recommendation problem. In addition, we proposed a recom-
mender framework JRec to model the temporal dynamics of customer interest. Compared to
traditional recommender systems, JRec is not only able to recommend the right products to
customers but also to do so at the right time. This is attributed to the fact that JRec models
both sequential and temporal information of sequences simultaneously. Collaborating with a
big e-commerce company, we were able to evaluate our framework with real-world data. The
experiment results suggested that JRec outperformed representative baselines in both item
recommendation and customer interest prediction tasks.
In Chapter 6, we covered our eﬀorts in designing a sequence model to capture the
hierarchical structures in English languages. Speciﬁcally, we presented a word recognition
model MUDE to tackle the word recognition problem that plays an extremely important
role in robust NLP systems. Our model is able to leverage both character-level and
word-level dependencies for this task. We studied our model with extensive experiments. The
results showed that MUDE is very robust with different types of word noise present.
Compared to existing popular word recognition systems, our model has achieved higher
accuracy consistently. When the task became more diﬃcult, i.e., with the substitution noise,
the performance gain of our model was even more signiﬁcant. Based on the experiment
results, we identiﬁed the promise of our model in boosting the robustness of other NLP
systems.
In Chapter 7, we discussed the problem of anomaly detection for sequential
data, which is ubiquitous in information systems. We observed that the state of a sequence
depends on sequential dependencies of events of multiple scales. Therefore, we proposed
a novel one-class recurrent neural network for this problem. Our model, OC4Seq, is able
to capture multi-scale sequential dependencies and detect anomalies from both local and
global perspectives. Through extensive experiments on one synthetic and two benchmark
anomaly detection datasets, we demonstrated that OC4seq outperformed representative
baselines signiﬁcantly and consistently. To understand our model better, we also conducted a
comprehensive analysis of its parameters as well as case studies. The results clearly showed
the contribution of multi-scale sequential dependencies to improving the true-positive rate
and reducing the false-negative rate in the sequence anomaly detection problem.
In summary, we are excited about the performance boost brought by side information. As
most of the research community focuses on learning sequential dependencies, this dissertation
provides our unique and pioneering works of exploring and exploiting side information for
sequence modeling.
8.2 Future Works
Given the promising results achieved in our works, we believe more dedicated efforts should
be devoted to this area. In the following, we outline a few meaningful future directions from
two aspects: applications and modeling.
8.2.1 Applications
As stated earlier, sequential data, as well as side information, is ubiquitous. However, due to
time constraints, we could only apply our models to a limited number of applications. There
are many more tasks where the proposed models could play a role in boosting state-of-the-art
performance. Next, we give two concrete examples.
The ﬁrst example is the people search problem in social network sites such as Facebook
and LinkedIn [60, 61]. The goal of people search is to return the most relevant people for a
search query. Current approaches often consider this problem as a typical ranking problem
where user proﬁles are documents. However, as demonstrated in Chapter 4, the connections
between user proﬁles can also play an important role. Considering the fact that user social
connection information is readily available in these social network sites, it is very promising
to extend the LinkedRNN proposed in Chapter 4 to improve the current approaches to the
people search problem.
The second one is the disease prediction from Electronic Health Record [80], where we aim
to predict the disease a patient might develop in the future from his/her records of historical
clinical visits. In this case, the patient clinical visits form an event sequence and each visit is
one event that has an associated timestamp. The temporal information plays an essential role
in this problem because of two reasons. The ﬁrst comes from the fact that patients usually
visit clinics irregularly. This suggests that the historical records are not evenly distributed
along the timeline. Second, it is crucial to accurately predict not only the type of disease the
patient might have but also when the disease will be developed. Thus, it is a natural idea to
extend our work in Chapter 5 to tackle this, which is essentially a sequence modeling with
temporal information problem.
8.2.2 Modeling
In this dissertation, we have presented a number of eﬀective models to capture side information
for sequence learning. However, there are still many scenarios that require further modeling
effort.
One future direction in this line is to develop uniﬁed sequence models to capture various
types of side information simultaneously. Our proposed models often focus on one type of
side information. However, in many cases, two or more types of side information coexist.
For example, the student video-watch sequences in massive open online course platforms
involve both temporal information, i.e., time stamps for each video-watch event and event
(lecture) relation [131]. Further, if social connections among students are provided, it would
be beneﬁcial to incorporate sequence relation as well. To achieve this, a uniﬁed sequence
model is necessary. Nevertheless, building uniﬁed sequence models is very diﬃcult. The main
challenges include keeping a proper balance of information from all perspectives, limiting the
model computation complexity, etc. We hope that our proposed models can provide insights
and inspirations for future works in this direction.
In addition, our works assume an ideal situation where side information is clean and stable.
However, this does not hold in many real-world cases where the side information could contain
noise and might change over time. To give a concrete example, let us consider the sequence
relation described in Chapter 4, where the sequences are users' historical weight records and
relations are the social connections among users. It is often the case that user social interest
changes frequently. As a result, their social relations are not stable. Thus, it is important to
design new models to incorporate both the sequence relation and its dynamics. Moreover, there
might exist malicious users who can bring noise to the resulting sequence relations.
Therefore, how to make the model robust to noise is another important problem that needs
to be solved in the future.
The third direction is to improve the interpretability of the models. It is encouraging to
see that our proposed models have produced promising results on many sequence learning
tasks. Although we have conducted quantitative and qualitative analyses to study where
this effectiveness comes from, it is of paramount importance to make the models themselves
explainable, so that they not only give accurate predictions but also provide the rationale
for their behaviors. This can bring substantial benefits to both model users and designers.
From the users' perspective, knowing the rationale behind a model's behavior is essential
for them to trust the model, which is especially vital in health-related domains [118, 35].
From the designers' perspective, interpretability paves a direct way to further improve model
performance. Existing works in this direction are limited and mainly focus on the sequence
part [116, 90]. Thus, dedicated efforts are needed to make our models more interpretable
from both the sequence and the side-information perspectives.
BIBLIOGRAPHY
[1] Odd Aalen, Ornulf Borgan, and Hakon Gjessing. Survival and event history analysis: a
process point of view. Springer Science & Business Media, 2008.
[2] Deepak Agarwal, Bee-Chung Chen, and Pradheep Elango. Fast online learning through
oﬄine initialization for time-sensitive recommendation. In Proceedings of the 16th ACM
SIGKDD international conference on Knowledge discovery and data mining, pages
703–712. ACM, 2010.
[3] Subutai Ahmad, Alexander Lavin, Scott Purdy, and Zuha Agha. Unsupervised real-time
anomaly detection for streaming data. Neurocomputing, 262:134–147, 2017.
[4] Mennatallah Amer, Markus Goldstein, and Slim Abdennadher. Enhancing one-class
support vector machines for unsupervised anomaly detection. In Proceedings of the
ACM SIGKDD Workshop on Outlier Detection and Description, pages 8–15, 2013.
[5] Cristiana Amza, Emmanuel Cecchet, Anupam Chanda, Alan L Cox, Sameh Elnikety,
Romer Gil, Julie Marguerite, Karthick Rajamani, and Willy Zwaenepoel. Speciﬁcation
and implementation of dynamic web site benchmarks. In 5th Workshop on Workload
Characterization, number CONF, 2002.
[6] Simon Andermatt, Simon Pezold, and Philippe Cattin. Multi-dimensional gated
recurrent units for the segmentation of biomedical 3d-data. In Deep Learning and Data
Labeling for Medical Applications, pages 142–151. Springer, 2016.
[7] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoﬀrey E Hinton. Layer normalization. arXiv
preprint arXiv:1607.06450, 2016.
[8] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation
by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[9] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua
Bengio. End-to-end attention-based large vocabulary speech recognition. In Acoustics,
Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages
4945–4949. IEEE, 2016.
[10] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convo-
lutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271,
2018.
[11] Gurkan Bebek. Identifying gene interaction networks. In Statistical Human Genetics,
pages 483–494. Springer, 2012.
[12] Yonatan Belinkov and Yonatan Bisk. Synthetic and natural noise both break neural
machine translation. In International Conference on Learning Representations, 2018.
[13] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies
with gradient descent is diﬃcult. IEEE transactions on neural networks, 5(2):157–166,
1994.
[14] Suratna Budalakoti, Ashok N Srivastava, Ram Akella, and Eugene Turkov. Anomaly
detection in large sets of high-dimensional symbol sequences. 2006.
[15] Suratna Budalakoti, Ashok N Srivastava, and Matthew Eric Otey. Anomaly detection
and diagnosis algorithms for discrete symbol sequences with applications to airline
safety. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications
and Reviews), 39(1):101–113, 2008.
[16] Eric Byres and Justin Lowe. The myths and facts behind cyber security risks for
industrial control systems. In Proceedings of the VDE Kongress, volume 116, pages
213–218, 2004.
[17] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection for discrete
sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 24(5):823–
839, 2010.
[18] Danqi Chen and Christopher Manning. A fast and accurate dependency parser using
neural networks. In Proceedings of the 2014 conference on empirical methods in natural
language processing (EMNLP), pages 740–750, 2014.
[19] Penghe Chen, Yu Lu, Vincent W Zheng, and Yang Pian. Prerequisite-driven deep
knowledge tracing. In 2018 IEEE International Conference on Data Mining (ICDM),
pages 39–48. IEEE, 2018.
[20] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi
Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using
rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014
Conference on Empirical Methods in Natural Language Processing (EMNLP), pages
1724–1734, 2014.
[21] Shamil Chollampatt and Hwee Tou Ng. Connecting the dots: Towards human-level
grammatical error correction. In Proceedings of the 12th Workshop on Innovative Use
of NLP for Building Educational Applications, pages 327–333, 2017.
[22] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua
Bengio. Attention-based models for speech recognition. In Advances in neural informa-
tion processing systems, pages 577–585, 2015.
[23] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical
evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint
arXiv:1412.3555, 2014.
[24] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu,
and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of
Machine Learning Research, 12(Aug):2493–2537, 2011.
[25] Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In Advances in
neural information processing systems, pages 3079–3087, 2015.
[26] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling
with gated convolutional networks. In Proceedings of the 34th International Conference
on Machine Learning-Volume 70, pages 933–941. JMLR. org, 2017.
[27] Matt Davis. Psycholinguistic evidence on scrambled letters in reading, 2012.
[28] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training
of deep bidirectional transformers for language understanding. In Proceedings of the
2019 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages
4171–4186, 2019.
[29] Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. Build it break
it ﬁx it for dialogue safety: Robustness from adversarial human attack. arXiv:1908.06083,
2019.
[30] Li Dong, Furu Wei, Chuanqi Tan, Duyu Tang, Ming Zhou, and Ke Xu. Adaptive
recursive neural network for target-dependent twitter sentiment classification. In
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics
(Volume 2: Short Papers), volume 2, pages 49–54, 2014.
[31] David L Donoho and Carrie Grimes. Hessian eigenmaps: Locally linear embedding
techniques for high-dimensional data. Proceedings of the National Academy of Sciences,
100(10):5591–5596, 2003.
[32] Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. Deeplog: Anomaly detection
and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM
SIGSAC Conference on Computer and Communications Security, pages 1285–1298,
2017.
[33] Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez,
and Le Song. Recurrent marked temporal point processes: Embedding event history
to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 1555–1564. ACM, 2016.
[34] Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. Hotﬂip: White-box adver-
sarial examples for text classiﬁcation. arXiv:1712.06751, 2017.
[35] Radwa Elshawi, Mouaz H Al-Mallah, and Sherif Sakr. On the interpretability of
machine learning-based model for predicting hypertension. BMC medical informatics
and decision making, 19(1):146, 2019.
[36] German Florez-Larrahondo, Susan M Bridges, and Rayford Vaughn. Eﬃcient modeling
of discrete events for anomaly detection using hidden markov models. In International
Conference on Information Security, pages 506–514. Springer, 2005.
[37] Giorgio Fumera, Ignazio Pillai, and Fabio Roli. Spam ﬁltering based on the analysis
of text information embedded into images. Journal of Machine Learning Research,
7(Dec):2699–2720, 2006.
[38] Julien Gaillard and Jean-Michel Renders. Time-sensitive collaborative ﬁltering through
adaptive matrix completion. In European Conference on Information Retrieval, pages
327–332. Springer, 2015.
[39] Robin Gandhi, Anup Sharma, William Mahoney, William Sousan, Qiuming Zhu, and
Phillip Laplante. Dimensions of cyber-attacks: Cultural, social, economic, and political.
IEEE Technology and Society Magazine, 30(1):28–38, 2011.
[40] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin.
Convolutional sequence to sequence learning. In Proceedings of the 34th International
Conference on Machine Learning-Volume 70, pages 1243–1252, 2017.
[41] Eric J Glover, Kostas Tsioutsiouliklis, Steve Lawrence, David M Pennock, and Gary W
Flake. Using web structure for classifying and describing web pages. In Proceedings of
the 11th international conference on World Wide Web, pages 562–569. ACM, 2002.
[42] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with
deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE, 2013.
[43] Kathrin Grosse, Nicolas Papernot, Praveen Manoharan, Michael Backes, and Patrick
McDaniel. Adversarial examples for malware detection. In European Symposium on
Research in Computer Security, pages 62–79. Springer, 2017.
[44] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks.
In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 855–864. ACM, 2016.
[45] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks.
In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 855–864. ACM, 2016.
[46] Roman Grundkiewicz and Marcin Junczys-Dowmunt. Near human-level performance in
grammatical error correction with hybrid machine translation. arXiv:1804.05945, 2018.
[47] Manish Gupta, Jing Gao, Charu C Aggarwal, and Jiawei Han. Outlier detection for
temporal data: A survey. IEEE Transactions on Knowledge and Data Engineering,
26(9):2250–2267, 2013.
[48] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on
large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034,
2017.
[49] Han Xu, Yao Ma, Hao-Chen Liu, Debayan Deb, Hui Liu, Jiliang Tang, and Anil K.
Jain. Adversarial attacks and defenses in images, graphs and text: A review. International
Journal of Automation and Computing, 17(2):151–178, 2020.
[50] Stephen J Hardiman, Nicolas Bercot, and Jean-Philippe Bouchaud. Critical reﬂexivity
in ﬁnancial markets: a hawkes process analysis. The European Physical Journal B,
86(10):442, 2013.
[51] F Maxwell Harper and Joseph A Konstan. The movielens datasets: History and context.
ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.
[52] Alan G Hawkes. Spectra of some self-exciting and mutually exciting point processes.
Biometrika, 58(1):83–90, 1971.
[53] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 770–778, 2016.
[54] Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R Lyu. Drain: An online log parsing
approach with ﬁxed depth tree. In 2017 IEEE International Conference on Web Services
(ICWS), pages 33–40. IEEE, 2017.
[55] Shilin He, Jieming Zhu, Pinjia He, and Michael R Lyu. Experience report: System
log analysis for anomaly detection. In 2016 IEEE 27th International Symposium on
Software Reliability Engineering (ISSRE), pages 207–218. IEEE, 2016.
[56] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk.
Session-based recommendations with recurrent neural networks. arXiv preprint
arXiv:1511.06939, 2015.
[57] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computa-
tion, 9(8):1735–1780, 1997.
[58] Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge Belongie, and Deborah
Estrin. Collaborative metric learning. In Proceedings of the 26th International Confer-
ence on World Wide Web, pages 193–201. International World Wide Web Conferences
Steering Committee, 2017.
[59] Xia Hu, Lei Tang, Jiliang Tang, and Huan Liu. Exploiting social relations for sentiment
analysis in microblogging. In Proceedings of the sixth ACM international conference on
Web search and data mining, pages 537–546. ACM, 2013.
[60] Jui-Ting Huang, Ashish Sharma, Shuying Sun, Li Xia, David Zhang, Philip Pronin,
Janani Padmanabhan, Giuseppe Ottaviano, and Linjun Yang. Embedding-based
retrieval in facebook search. In Proceedings of the 26th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, pages 2553–2561, 2020.
[61] Shih-Wen Huang, Daniel Tunkelang, and Karrie Karahalios. The role of network
distance in linkedin people search. In Proceedings of the 37th international ACM SIGIR
conference on Research & development in information retrieval, pages 867–870, 2014.
[62] Valerie Isham and Mark Westcott. A self-correcting point process. Stochastic Processes
and Their Applications, 8(3):335–347, 1979.
[63] Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. Adversarial example
generation with syntactically controlled paraphrase networks. arXiv:1804.06059, 2018.
[64] Gauri Jain, Manisha Sharma, and Basant Agarwal. Optimizing semantic lstm for spam
detection. International Journal of Information Technology, 11(2):239–250, 2019.
[65] Jianshu Ji, Qinlong Wang, Kristina Toutanova, Yongen Gong, Steven Truong, and
Jianfeng Gao. A nested attention neural hybrid model for grammatical error correction.
arXiv:1707.02026, 2017.
[66] Marcin Junczys-Dowmunt, Roman Grundkiewicz, Shubha Guha, and Kenneth Heaﬁeld.
Approaching neural grammatical error correction as a low-resource machine translation
task. arXiv:1804.05940, 2018.
[67] Komal Kapoor, Karthik Subbian, Jaideep Srivastava, and Paul Schrater. Just in
time recommendations: Modeling the dynamics of boredom in activity streams. In
Proceedings of the Eighth ACM International Conference on Web Search and Data
Mining, pages 233–242. ACM, 2015.
[68] Yoon Kim. Convolutional neural networks for sentence classiﬁcation. In Proceedings of
the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP),
pages 1746–1751, 2014.
[69] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980, 2014.
[70] John Frank Charles Kingman. Poisson processes, volume 3. Clarendon Press, 1992.
[71] Thomas N Kipf and Max Welling. Semi-supervised classiﬁcation with graph convolu-
tional networks. arXiv preprint arXiv:1609.02907, 2016.
[72] Paul Knight, Larry Salmen, and Russell Krajec. Time sensitive inventory sales system,
September 23 2004. US Patent App. 10/769,098.
[73] Emanuel Kopp, Lincoln Kaﬀenberger, and Nigel Jenkinson. Cyber risk, market failures,
and ﬁnancial stability. International Monetary Fund, 2017.
[74] Pavel N Krivitsky, Mark S Handcock, Adrian E Raftery, and Peter D Hoﬀ. Representing
degree distributions, clustering, and homophily in social networks with latent cluster
random eﬀects models. Social networks, 31(3):204–213, 2009.
[75] Takeshi Kurashima, Tim Althoﬀ, and Jure Leskovec. Modeling interdependent and
periodic real-world action sequences. In Proceedings of the... International World-Wide
Web Conference. International WWW Conference, volume 2018, page 803. NIH Public
Access, 2018.
[76] Monica Lagazio, Nazneen Sherif, and Mike Cushman. A multi-level approach to
understanding the impact of cyber crime on the ﬁnancial sector. Computers & Security,
45:58–74, 2014.
[77] Kun-Lun Li, Hou-Kuan Huang, Sheng-Feng Tian, and Wei Xu. Improving one-class svm
for anomaly detection. In Proceedings of the 2003 International Conference on Machine
Learning and Cybernetics (IEEE Cat. No. 03EX693), volume 5, pages 3077–3081. IEEE,
2003.
[78] Dawen Liang, Jaan Altosaar, Laurent Charlin, and David M Blei. Factorization meets
the item embedding: Regularizing matrix factorization with item co-occurrence. In
Proceedings of the 10th ACM conference on recommender systems, pages 59–66. ACM,
2016.
[79] Zhenjiang Lin, Michael R Lyu, and Irwin King. Pagesim: a novel link-based measure
of web page similarity. In Proceedings of the 15th international conference on World
Wide Web, pages 1019–1020. ACM, 2006.
[80] Zachary C Lipton, David C Kale, Charles Elkan, and Randall Wetzel. Learning to
diagnose with lstm recurrent neural networks. arXiv preprint arXiv:1511.03677, 2015.
[81] Haochen Liu, Tyler Derr, Zitao Liu, and Jiliang Tang. Say what i want: Towards the
dark side of neural dialogue models. arXiv:1909.06044, 2019.
[82] Jian-Guang Lou, Qiang Fu, Shengqi Yang, Ye Xu, and Jiang Li. Mining invariants from
console logs for system problem detection. In USENIX Annual Technical Conference,
pages 1–14, 2010.
[83] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Eﬀective approaches to
attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
[84] Pankaj Malhotra, Anusha Ramakrishnan, Gaurangi Anand, Lovekesh Vig, Puneet
Agarwal, and Gautam Shroﬀ. Lstm-based encoder-decoder for multi-sensor anomaly
detection. arXiv preprint arXiv:1607.00148, 2016.
[85] Mirco Marchetti and Dario Stabili. Anomaly detection of can bus messages through
analysis of id sequences. In 2017 IEEE Intelligent Vehicles Symposium (IV), pages
1577–1583. IEEE, 2017.
[86] Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large
annotated corpus of english: The penn treebank. 1993.
[87] Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily
in social networks. Annual review of sociology, 27(1):415–444, 2001.
[88] Weibin Meng, Ying Liu, Yichen Zhu, Shenglin Zhang, Dan Pei, Yuqing Liu, Yihao
Chen, Ruizhi Zhang, Shimin Tao, Pei Sun, et al. Loganomaly: Unsupervised detection
of sequential and quantitative anomalies in unstructured logs. In Proceedings of the
Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence, IJCAI-19.
International Joint Conferences on Artiﬁcial Intelligence Organization, volume 7, pages
4739–4745, 2019.
[89] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeﬀrey Dean. Eﬃcient estimation of
word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[90] Yao Ming, Panpan Xu, Huamin Qu, and Liu Ren. Interpretable and steerable sequence
learning via prototypes. In Proceedings of the 25th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, pages 903–913, 2019.
[91] Lawrence Mitchell and Michael E Cates. Hawkes process as a model of social interactions:
a view on video dynamics. Journal of Physics A: Mathematical and Theoretical,
43(4):045101, 2009.
[92] Robert Mitchell and Ray Chen. A survey of intrusion detection in wireless network
applications. Computer Communications, 42:1–23, 2014.
[93] Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy
Susanto, and Christopher Bryant. The conll-2014 shared task on grammatical error
correction. In Proceedings of the Eighteenth Conference on Computational Natural
Language Learning: Shared Task, pages 1–14, 2014.
[94] Rajvardhan Oak, Min Du, David Yan, Harshvardhan Takawale, and Idan Amit. Malware
detection on highly imbalanced data through sequence modeling. In Proceedings of the
12th ACM Workshop on Artiﬁcial Intelligence and Security, pages 37–48, 2019.
[95] Adam Oliner and Jon Stearley. What supercomputers say: A study of ﬁve system
logs. In 37th Annual IEEE/IFIP International Conference on Dependable Systems and
Networks (DSN’07), pages 575–584. IEEE, 2007.
[96] Manuel Perea, María Jiménez, Fernanda Talero, and Soraya López-Cañada. Letter-case
information and the identiﬁcation of brand names. British Journal of Psychology,
106(1):162–173, 2015.
[97] Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami,
Leonidas J Guibas, and Jascha Sohl-Dickstein. Deep knowledge tracing. In Advances
in Neural Information Processing Systems, pages 505–513, 2015.
[98] Danish Pruthi, Bhuwan Dhingra, and Zachary C Lipton. Combating adversarial
misspellings with robust word recognition. arXiv:1905.11268, 2019.
[99] Keith Rayner, Sarah J White, and SP Liversedge. Raeding wrods with jubmled lettres:
There is a cost. 2006.
[100] Lukas Ruﬀ, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed
Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class
classiﬁcation. In International conference on machine learning, pages 4393–4402, 2018.
[101] David E Rumelhart, Geoﬀrey E Hinton, and Ronald J Williams. Learning representa-
tions by back-propagating errors. nature, 323(6088):533, 1986.
[102] Keisuke Sakaguchi, Kevin Duh, Matt Post, and Benjamin Van Durme. Robsut wrod
reocginiton via semi-character recurrent neural network. In AAAI2017, 2017.
[103] Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C
Williamson. Estimating the support of a high-dimensional distribution. Neural compu-
tation, 13(7):1443–1471, 2001.
[104] Nitish Srivastava, Geoﬀrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. Dropout: a simple way to prevent neural networks from overﬁtting.
The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[105] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks.
In Advances in neural information processing systems, pages 2440–2448, 2015.
[106] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. Lstm neural networks for
language modeling. In Thirteenth annual conference of the international speech com-
munication association, 2012.
[107] Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent
neural networks. In Proceedings of the 28th International Conference on Machine
Learning (ICML-11), pages 1017–1024, 2011.
[108] Duyu Tang, Bing Qin, and Ting Liu. Document modeling with gated recurrent neural
network for sentiment classiﬁcation. In Proceedings of the 2015 conference on empirical
methods in natural language processing, pages 1422–1432, 2015.
[109] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line:
Large-scale information network embedding. In Proceedings of the 24th International
Conference on World Wide Web, pages 1067–1077. International World Wide Web
Conferences Steering Committee, 2015.
[110] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line:
Large-scale information network embedding. In Proceedings of the 24th International
Conference on World Wide Web, pages 1067–1077. International World Wide Web
Conferences Steering Committee, 2015.
[111] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. Arnetminer:
Extraction and mining of academic social networks. In KDD’08, pages 990–998, 2008.
[112] Jiliang Tang, Xia Hu, and Huan Liu. Social recommendation: a review. Social Network
Analysis and Mining, 3(4):1113–1133, 2013.
[113] Jiliang Tang and Huan Liu. Unsupervised feature selection for linked social media
data. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 904–912. ACM, 2012.
[114] David MJ Tax and Robert PW Duin. Support vector data description. Machine
learning, 54(1):45–66, 2004.
[115] Tijmen Tieleman and Geoﬀrey Hinton. Rmsprop. COURSERA: Lecture, 7017, 2012.
[116] Shikhar Vashishth, Shyam Upadhyay, Gaurav Singh Tomar, and Manaal Faruqui.
Attention interpretability across nlp tasks. arXiv preprint arXiv:1909.11218, 2019.
[117] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in
Neural Information Processing Systems, pages 5998–6008, 2017.
[118] Alfredo Vellido. The importance of interpretability and visualization in machine learning
for applications in medicine and health care. Neural Computing and Applications, pages
1–15, 2019.
[119] Feng Wang and David MJ Tax. Survey on the attention based rnn model and its
applications in computer vision. arXiv preprint arXiv:1601.06823, 2016.
[120] Kung-Jeng Wang, YS Lin, and CP Jonas. Optimizing inventory policy for products
with time-sensitive deteriorating rates in a multi-echelon supply chain. International
Journal of Production Economics, 130(1):66–76, 2011.
[121] Xiaolong Wang, Furu Wei, Xiaohua Liu, Ming Zhou, and Ming Zhang. Topic senti-
ment analysis in twitter: a graph-based hashtag sentiment classiﬁcation approach. In
Proceedings of the 20th ACM international conference on Information and knowledge
management, pages 1031–1040. ACM, 2011.
[122] Yanxin Wang, Johnny Wong, and Andrew Miner. Anomaly intrusion detection using
one class svm. In Proceedings from the Fifth Annual IEEE SMC Information Assurance
Workshop, 2004., pages 358–364. IEEE, 2004.
[123] Zhiwei Wang, Tyler Derr, Dawei Yin, and Jiliang Tang. Understanding and predicting
weight loss with mobile social networking data. In Proceedings of the 2017 ACM on
Conference on Information and Knowledge Management, pages 1269–1278. ACM, 2017.
[124] Bryan Watkins. The impact of cyber attacks on the private sector. Brieﬁng Paper,
Association for International Aﬀair, 12, 2014.
[125] Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemo-
metrics and intelligent laboratory systems, 2(1-3):37–52, 1987.
[126] Marty J Wolf, K Miller, and Frances S Grodzinsky. Why we should have seen that
coming: comments on microsoft’s tay experiment, and wider implications. ACM
SIGCAS Computers and Society, 47(3):54–64, 2017.
[127] Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J. Smola, and How Jing. Recurrent
recommender networks. In Proceedings of the Tenth ACM International Conference on
Web Search and Data Mining, pages 495–503, 2017.
[128] Dongkuan Xu, Wei Cheng, Dongsheng Luo, Yameng Gu, Xiao Liu, Jingchao Ni, Bo Zong,
Haifeng Chen, and Xiang Zhang. Adaptive neural network for node classiﬁcation in
dynamic networks. In ICDM. IEEE, 2019.
[129] Dongkuan Xu, Wei Cheng, Dongsheng Luo, Xiao Liu, and Xiang Zhang. Spatio-
temporal attentive rnn for node classiﬁcation in temporal attributed graphs. In IJCAI,
pages 3947–3953, 2019.
[130] Wei Xu, Ling Huang, Armando Fox, David Patterson, and Michael I Jordan. Detecting
large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS
22nd symposium on Operating systems principles, pages 117–132, 2009.
[131] Tsung-Yen Yang, Christopher G Brinton, Carlee Joe-Wong, and Mung Chiang. Behavior-
based grade prediction for moocs via time series neural networks. IEEE Journal of
Selected Topics in Signal Processing, 11(5):716–728, 2017.
[132] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and
Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding.
In Advances in neural information processing systems, pages 5753–5763, 2019.
[133] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy.
Hierarchical attention networks for document classification. In Proceedings of the
2016 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, pages 1480–1489, 2016.
[134] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy.
Hierarchical attention networks for document classification. In Proceedings of the
2016 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, pages 1480–1489, 2016.
[135] Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad
Norouzi, and Quoc V Le. Qanet: Combining local convolution with global self-attention
for reading comprehension. arXiv:1804.09541, 2018.
[136] Xiao Yu, Pallavi Joshi, Jianwu Xu, Guoliang Jin, Hui Zhang, and Guofei Jiang.
Cloudseer: Workﬂow monitoring of cloud infrastructures via interleaved logs. ACM
SIGARCH Computer Architecture News, 44(2):489–502, 2016.
[137] Yin Zhang, Rong Jin, and Zhi-Hua Zhou. Understanding bag-of-words model: a
statistical framework. International Journal of Machine Learning and Cybernetics,
1(1-4):43–52, 2010.
[138] Wei Zhao, Liang Wang, Kewei Shen, Ruoyu Jia, and Jingming Liu. Improving gram-
matical error correction via pre-training a copy-augmented architecture with unlabeled
data. arXiv:1903.00138, 2019.
[139] Xu Zhao, Kirk Rodrigues, Yu Luo, Ding Yuan, and Michael Stumm. Non-intrusive
performance profiling for entire software stacks based on the flow reconstruction principle.
In 12th USENIX Symposium on Operating Systems Design and Implementation
(OSDI 16), pages 603–618, 2016.
[140] Zhengli Zhao, Dheeru Dua, and Sameer Singh. Generating natural adversarial examples.
arXiv:1710.11342, 2017.
[141] Yin Zheng, Bangsheng Tang, Wenkui Ding, and Hanning Zhou. A neural autoregressive
approach to collaborative ﬁltering. In Proceedings of the 33rd International Conference
on International Conference on Machine Learning-Volume 48, pages 764–773, 2016.
[142] Chong Zhou and Randy C Paﬀenroth. Anomaly detection with robust deep autoencoders.
In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pages 665–674, 2017.
[143] Meizi Zhou, Zhuoye Ding, Jiliang Tang, and Dawei Yin. Micro behaviors: A new
perspective in e-commerce recommender systems. In Proceedings of the Eleventh ACM
International Conference on Web Search and Data Mining, pages 727–735. ACM, 2018.
[144] Shuyan Zhou, Xiangkai Zeng, Yingqi Zhou, Antonios Anastasopoulos, and Graham
Neubig. Improving robustness of neural machine translation with multi-task learning.
In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared
Task Papers, Day 1), pages 565–571, 2019.
[145] Jieming Zhu, Shilin He, Jinyang Liu, Pinjia He, Qi Xie, Zibin Zheng, and Michael R
Lyu. Tools and benchmarks for automated log parsing. In 2019 IEEE/ACM 41st
International Conference on Software Engineering: Software Engineering in Practice
(ICSE-SEIP), pages 121–130. IEEE, 2019.