MACHINE LEARNING TOWARDS DATA WITH COMPLEX STRUCTURES
                                     By
                                  Runze Su
                           A DISSERTATION
                               Submitted to
                       Michigan State University
                in partial fulfillment of the requirements
                             for the degree of
                   Statistics — Doctor of Philosophy
   Computational Mathematics, Science and Engineering — Dual Major
                                   2022


                                         ABSTRACT
      MACHINE LEARNING TOWARDS DATA WITH COMPLEX STRUCTURES
                                               By
                                            Runze Su
The development of sequential analysis provides a deeper understanding in the exploration of
many different fields. In the application of sequential analysis, there are two main challenges:
How to extract informative features from a high-dimensional noisy domain? How to model
the interaction for the information flow from multiple domains? We explored the two core
challenges in bio-informatics, sales forecasting and multimedia services.
    In biology field, a typical problem is the to evaluate the interaction mechanism between
non-coding DNA sequences and transcription. We propose CANEE, a convolutional self-
attention architecture to analyze the function of non-coding DNA sequences. Compared to
other existing models, CANEE achieves a better performance in overall prediction of 919
regulatory functions with respect to receiver operating characteristics and has a significant
improvement on some responses in precision recall curve with shorter training time. In sales
forecasting field, we extract a unique customers’ microbehavior dependency structure from
clickstream data based on a Word-to-Vector model. Then,we build a clickstream informed
LSTM model to forecast the car sales over 30 days. Our model significantly outperforms the
classic seasonal autoregressive integrated moving average model. Besides, we demonstrate
that transfer knowledge among different car models can further improve the performance.
Other applications for multi-domain sequences happens in multimedia service field, where
we focus on the understanding of multiple domain modalities. We propose new principles
for audio visual learning and introduce a new framework as well as its training algorithm to
set sight of videos’ themes to facilitate AVC learning.


                              TABLE OF CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       v
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       vi
Chapter 1    Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      1
Chapter 2 Machine learning on DNA sequences . . . . . . .                 . . . . . . . . .  4
   2.1 Functional analysis on non-coding DNA sequences . . . . . .        . . . . . . . . .  4
       2.1.1 Task and dataset overview . . . . . . . . . . . . . . .      . . . . . . . . .  6
       2.1.2 Model formulation . . . . . . . . . . . . . . . . . . .      . . . . . . . . .  7
       2.1.3 Experimental results . . . . . . . . . . . . . . . . . .     . . . . . . . . . 10
       2.1.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . .   . . . . . . . . . 13
   2.2 The prediction of plant stress response from DNA sequences         . . . . . . . . . 14
       2.2.1 Data and problem statement . . . . . . . . . . . . . .       . . . . . . . . . 16
       2.2.2 Model Architecture . . . . . . . . . . . . . . . . . . .     . . . . . . . . . 17
       2.2.3 Experimental results . . . . . . . . . . . . . . . . . .     . . . . . . . . . 19
               2.2.3.1 Performance Analysis . . . . . . . . . . . .       . . . . . . . . . 19
               2.2.3.2 Interpretation . . . . . . . . . . . . . . . . .   . . . . . . . . . 22
       2.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . .   . . . . . . . . . 23
Chapter 3 The understanding of clickstream data . . . . . . . . . . . . . . .               26
   3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
   3.2 Problem Statement and Formulations . . . . . . . . . . . . . . . . . . . . . .       29
   3.3 Statistical Analysis and Models . . . . . . . . . . . . . . . . . . . . . . . . .    31
       3.3.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . .     32
               3.3.1.1 Historical Sales Data . . . . . . . . . . . . . . . . . . . . . .    32
               3.3.1.2 Correlation Analysis between Clickstream data and Future
                        Demand . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    33
               3.3.1.3 Word2vec Models for Browsing Behaviors . . . . . . . . . .           34
       3.3.2 Implemented Models . . . . . . . . . . . . . . . . . . . . . . . . . . .       37
               3.3.2.1 Seasonal Autoregressive Integrated Moving Average Model .            37
               3.3.2.2 Multivariate LSTM Model . . . . . . . . . . . . . . . . . . .        37
               3.3.2.3 Clickstream Informed LSTM Model . . . . . . . . . . . . . .          38
   3.4 Experiments and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . .     39
       3.4.1 Benchmark Comparison . . . . . . . . . . . . . . . . . . . . . . . . .         40
       3.4.2 Daily Demand Forecasting . . . . . . . . . . . . . . . . . . . . . . . .       41
       3.4.3 Transfer Learning on Multiple Levels . . . . . . . . . . . . . . . . . .       42
       3.4.4 Clickstream Informed Multivariate LSTM Model . . . . . . . . . . . .           45
       3.4.5 GAT-LSTM Model . . . . . . . . . . . . . . . . . . . . . . . . . . . .         47
               3.4.5.1 Experimental Results . . . . . . . . . . . . . . . . . . . . .       47
               3.4.5.2 Graphical Analysis . . . . . . . . . . . . . . . . . . . . . . .     48
                                             iii


  3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Chapter 4 Machine learning towards multimedia data . .                 . . . . . . . . . . 50
  4.1 Themes informed audio-visual correspondence
      learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
      4.1.1 Related Work . . . . . . . . . . . . . . . . . . . . .     . . . . . . . . . . 53
      4.1.2 KWAI-AD-AudVis Dataset . . . . . . . . . . . . .           . . . . . . . . . . 54
      4.1.3 Proposed Approaches . . . . . . . . . . . . . . . . .      . . . . . . . . . . 56
      4.1.4 Experiment and Analysis . . . . . . . . . . . . . . .      . . . . . . . . . . 58
      4.1.5 Discussion . . . . . . . . . . . . . . . . . . . . . . .   . . . . . . . . . . 62
  4.2 Self-organized short video advertisement evaluation system       . . . . . . . . . . 63
      4.2.1 Related Work . . . . . . . . . . . . . . . . . . . . .     . . . . . . . . . . 65
      4.2.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . .    . . . . . . . . . . 66
      4.2.3 Proposed Approaches . . . . . . . . . . . . . . . . .      . . . . . . . . . . 67
      4.2.4 Experiment and Analysis . . . . . . . . . . . . . . .      . . . . . . . . . . 70
      4.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . .   . . . . . . . . . . 73
Chapter 5    Conclusion and future direction . . . . . . . . . . . . . . . . . . .         74
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       76
                                            iv


                                    LIST OF TABLES
Table 2.1: Model performance comparison. . . . . . . . . . . . . . . . . . . . . . . .       10
Table 2.2: Top 10 regulatory factors with the highest PR-AUC improvement. . . . .            12
Table 3.1: Daily demand forecasting accuracy of model B. . . . . . . . . . . . . . . .       43
Table 3.2: Daily demand forecasting accuracy of car Model C. . . . . . . . . . . . . .       44
Table 3.3: Performance for clickstream informed multivariate LSTM models. . . . . .          46
Table 3.4: Results for state-level car sale prediction. We ran 50 times under the same
           settings and calculated the average L2 −MAPE. . . . . . . . . . . . . . . .       48
Table 3.5: Performance comparison of multivariate LSTM model. Weight initialization
           are derived from random generation, trained model of car model B and
           trained model of car model A. . . . . . . . . . . . . . . . . . . . . . . . .     49
Table 4.1: Summary of experiment results. . . . . . . . . . . . . . . . . . . . . . . .      60
Table 4.2: Modal contributions calculated from a batch of positive audio-visual pairs
           and a batch of negative audio-visual pairs. . . . . . . . . . . . . . . . . .     62
Table 4.3: Summary of experimental results of CTR prediction. . . . . . . . . . . . .        72
Table 4.4: Summary of experimental results of 3-second play rate prediction. . . . . .       72
Table 4.5: Summary of experimental results of multi-class classification. “Acc" here
           refers to accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
                                               v


                                   LIST OF FIGURES
Figure 2.1: The architecture of CANEE model. . . . . . . . . . . . . . . . . . . . . .        8
Figure 2.2: Model structure comparison. . . . . . . . . . . . . . . . . . . . . . . . . .    10
Figure 2.3: Figure A: CANEE output vs DeepSEA in AUC-ROC. Figure B: CANEE
             output vs DeepSEA in AUC-PR. Figure C: CANEE output vs DanQ in
             AUC-ROC. Figure D: CANEE output vs DanQ in AUC-PR. . . . . . . .                11
Figure 2.4: Running speed: DanQ vs CANEE. . . . . . . . . . . . . . . . . . . . . .          13
Figure 2.5: High level pipeline of DeepCAT. . . . . . . . . . . . . . . . . . . . . . . .    16
Figure 2.6: DeepCAT architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . .    18
Figure 2.7:   Performance of DeepCAT.      . . . . . . . . . . . . . . . . . . . . . . . . . 19
Figure 2.8: Performance of DeepCAT with kernels initialized from weights learned
             from the DanQ human model. . . . . . . . . . . . . . . . . . . . . . . . .      20
Figure 2.9: Performance of DeepCAT with kernels initialized from experimentally
             verified TFBMs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   21
Figure 2.10: Performance of DeepCAT with known TFBM initialized kernels, and the
             clustered response multi-task model. . . . . . . . . . . . . . . . . . . . .    21
Figure 2.11: The clustering hierarchy of the stress types from k-means clustering, with
             red highlighted stresses being heat related. . . . . . . . . . . . . . . . . .  22
Figure 2.12: Pipeline to translate kernel weights to position frequency matrices and
             aligned to known motifs. . . . . . . . . . . . . . . . . . . . . . . . . . . .  23
Figure 2.13: Motifs learned from DeepCAT aligned with a known TFBMs. . . . . . .             24
Figure 2.14: Here we plotted the motif interaction for the 4 heads of the attention
             module across all responses. . . . . . . . . . . . . . . . . . . . . . . . . .  24
Figure 3.1: A example of a user from an e-commerce site. . . . . . . . . . . . . . . .       27
Figure 3.2: An illustration of micro behavior. . . . . . . . . . . . . . . . . . . . . . .   30
Figure 3.3: Sales demonstrate strong weekly, monthly, and holiday effects.         . . . . . 32
                                              vi


Figure 3.4: Fig A: The highest correlation between each micobehavior action and
             historical sales for one Ford model. Fig B: The correlation box plot
             between each microbahavior action and historical sales for Ford SUV. . .         33
Figure 3.5: UMAP projection plot showing 27 major clusters of the 5022 micro
             behaviors using embedding learnt from word2vec model. The colors
             represents the clusters generated from spectral clustering to group them
             into 27 clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  36
Figure 3.6: The architecture of the CDF framework. . . . . . . . . . . . . . . . . . .        38
Figure 3.7: The architecture of GAT-LSTM model. . . . . . . . . . . . . . . . . . . .         39
Figure 3.8: The framework of GAT module. . . . . . . . . . . . . . . . . . . . . . . .        40
Figure 3.9: L2 -MAPE Comparison in car model B State Level Prediction.              . . . . . 43
Figure 3.10: Daily demand forecasting for car model B in the US. A: Forecasting
             performance of SARIMA model. B: Forecasting performance of Multivariate
             LSTM model. C: Forecasting performance of Multivariate LSTM model
             with weights transferred from car model A. . . . . . . . . . . . . . . . . 43
Figure 3.11: Entity-level model comparison. . . . . . . . . . . . . . . . . . . . . . . .     45
Figure 3.12: UMAP projection plot showing 10 major clusters using embedding learned
             from word2vec model. The colors represent the clusters generated from
             spectral clustering to group them into 10 clusters. The middle wheat-
             colored cluster represents the chaos part. Other surrounding clusters each
             represent a certain car model. . . . . . . . . . . . . . . . . . . . . . . .     47
Figure 4.1: Diagram of the proposed framework. . . . . . . . . . . . . . . . . . . . .        57
Figure 4.2: AUC and sample counts per ADs category. The dark grey bar represents
             the AUC, whose scale axis is on left; the light grey bar represents the
             number of samples, whose scale axis is on right. The horizontal line is the
             baseline-1 accuracy AUC. . . . . . . . . . . . . . . . . . . . . . . . . . .     61
Figure 4.3: Overview of the proposed framework. . . . . . . . . . . . . . . . . . . . .       67
Figure 4.4: Baseline model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   69
Figure 4.5: All-connected model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    69
Figure 4.6: Step (2), (3) and (4) of our self-learning approach. We simplify the
             diagrams for illustration. . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
                                               vii


Figure 4.7: CTR distribution.   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Figure 4.8: 3-second play rate distribution. . . . . . . . . . . . . . . . . . . . . . . .  72
                                            viii


Chapter 1
Overview
Data with complex structures extensively exists in lots of different fields. Thus how to extract
informative features from a noisy domain and how to build a model based on those features is
of great importance. Nowadays, with the rapid development of machine learning techniques,
deep learning shows its tremendous potential in modeling the statistical relationships across
multiple domains. In this paper, we showed the application of machine learning in bio-
informatics, business sales forecasting and multimedia fields.
    In bio-informatics field, evaluating the functional effects of non-coding DNA sequences has
been an important and challenging problem. Although experimental results have indicated
the connection between DNA sequences and gene expression, two main characteristics of
the DNA sequence data is hindering the revealing of the interaction mechanism between
non-coding DNA sequences and transcription.           Firstly, the non-coding DNA sequences
are longer than traditional sequential data. Secondly, the DNA sequences are hard to
interpret and don’t necessarily follow ordered properties.           Several methods have been
proposed using deep learning methods to capture the interaction, but the sequential learning
mechanisms are mainly based on either a fully connected or a bidirectional long short-
term memory frameworks, which requires a large memory and hence is difficult to scale
to long DNA sequences. To address those challenges, we proposed a convolutional self-
attention architecture to analyze the function of non-coding DNA sequences. Compared
                                                 1


to other existing models, our model achieves a better performance in overall prediction
of 919 regulatory functions with respect to receiver operating characteristics and has a
significant improvement on some responses in precision recall curve with shorter training
time. We also found out that there exists interactions for the functional analysis across
different species. To look into the information sharing between human and plant genes,
we proposed a model with similar structure in predicting plant stress response from DNA
sequences. We demonstrate that our model outperforms classical shallow and deep learning
approaches for predicting plant gene expression and it shows great potential with pretrained
information from experiments on human genes.
    In business sales forcasting field, we explore the clickstream data for car sales. Forecasting
car sales and demand in the market is an important but challenging task in car market
analysis. Recently, the development of deep learning and the availability of clickstream data
with rich customers’ online behaviors provide us with immense opportunities to advance the
car demand forecasting problem. However, the online clickstream data is very noise and less
informative. To solve the problem, we consider the clickstream data as a sentence consists of
words, then we applied a word-to-vector model to extract a unique customers’ micro-behavior
dependency structure from clickstream data. Then, we build a clickstream informed LSTM
model to forecast the car sales over 30 days. Our model significantly outperforms the classic
seasonal autoregressive integrated moving average model. Besides, we demonstrated that
transfer knowledge among different car models can further improve the performance.
    In the multimedia field, we put our attention on short videos. Comparing with image,
text and audio, the short videos are much more informative and more complicated. The
applications of short-term user-generated video (UGV), such as Tik Tok, Snapchat and
Youtube short-term videos, booms recently, raising lots of multimodal machine learning
                                                 2


tasks. Among those tasks, learning the correspondence between audio and visual information
from videos is a popular but challenging one. Though there exist lots of classic methods
to extract features from audio and visual domains, how to measure the correspondence
across multiple domains remains a problem.       Most previous work of the audio-visual
correspondence(AVC) learning only investigated constrained videos or simple settings, which
may not fit the application of UGV. For this problem, we proposed new principles for AVC
and introduced a new framework to set sight of videos’ themes to facilitate AVC learning.
We also released the KWAI-AD-AudVis corpus which contained 85,432 short advertisement
videos (around 913 hours) made by users. We evaluated our proposed approach on this
corpus, and it was able to outperform the baseline by 23.15% absolute difference. Based on
that, with a better understanding on the interaction for multimodality, we further explored
a novel end-to-end self-organizing framework for user behavior prediction. The new model
is able to learn the optimal topology of neural network architecture, as well as optimal
weights, through training data. We evaluate our proposed method on our in-house dataset.
The experimental results reveal that our model achieves the best performance in all our
experiments.
                                             3


Chapter 2
Machine learning on DNA sequences
The analysis of DNA sequences has been a challenging problem in both academic and
industry field. Here we proposed a convolutional self-attention network to learn the internal
effects for DNA sequences. The application of this model fells in human and plant DNA
sequences. In section 2.1, we will cover the functional analysis on non-coding DNA sequences
from human being with a convolutional self-attention network. In section 2.2, we apply
the similar architecture on the genetic motif of plants. Beside the higher performance
comparing with traditional models, we also did further analysis on the training speed and
transfer learning potention. Comparing with the state-of-the-art models, the convolutional
self attention network shows a faster training speed, reasonable interpretability and large
potential for transfer learning across species.
2.1      Functional analysis on non-coding DNA sequences
Understanding the function of human DNA sequences has been an important but challenging
problem. Experimental results have indicated that non-coding DNA sequences may act on
the constitution of numerous diseases, but concrete functional evaluation of non coding DNA
sequences remains a problem. One concerning point is to evaluate the relationship between
DNA sequences and their corresponding transcription process. The core challenge for this
problem is to evaluate the binding of chromatin proteins and histone marks from DNA
                                                4


sequence with single-nucleotide sensitivity. TF binding can be influenced by cofactor binding
sequences, chromatin accessibility and structural flexibility of binding-site DNABenveniste
et al. (2014); Whitaker et al. (2015). DNase I-hypersensitive sites (DHSs) and histone
marks are expected to have even more complex underlying mechanisms involving multiple
chromatin proteinsSlattery et al. (2014). To learn how non-coding DNA acts on those factors,
a stable and accurate sequence tagging model is of great importance to reveal the complicated
dependencies.
    Deep learning has shown its strong potential in dealing with rich-feature problems
from large data sets LeCun et al. (2015). Current genomics problems are based on DNA
sequencing, which makes them to be rich-feature and large sample problems. DeepSEA Zhou
and Troyanskaya (2015) is the first model for this problem. It proposed a deep convolutional
neural network to evaluate the sequence tags. Then, DanQ Quang and Xie (2016) combines
the bidirectional recurrent neural network architecture with the convolutional neural network
to capture sequential properties of DNA sequences. Besides, it’s hard to classify which
weather a DNA sequence is the forward one or the reverse complementary sequence. To
learn this property, DeFine Wang et al. (2018) and FactorNet Quang and Xie (2019) take
both the forward and reverse complementary sequence as the model input.
    In the recent years, the development of natural language processing has provided us with
a new sight of sequence learning. Recurrent neural network Mikolov et al. (2010) captures the
sequence properties by setting different units to scan the input sequence. Classic recurrent
neural network frameworks including gated recurrent units Chung et al. (2014) and long
short term memory Hochreiter and Schmidhuber (1997a). Bidirectional recurrent neural
networks Schuster and Paliwal (1997) strengthen the learning ability by learning both the
forward and reverse ordered sequence. A self-attention framework Vaswani et al. (2017) is
                                               5


proposed to learn a score for each element with respect to all other elements in the sequence.
However, existing methods for the non-coding DNA functional evaluation always rely on an
ordered recurrent neural network to learn the sequential properties of DNA sequences. This
is counterintuitive since non-coding DNA sequences don’t necessarily affect transcription by
the order of base pairs. Especially in long DNA sequences, there are 2 main challenges:
    • Recurrent neural network is likely to cause gradient vanishing problems in long DNA
      sequences.
    • The interaction for the base pairs in the DNA sequence is complicated.
To better model the function of non-coding DNA sequences, we propose CANEE, a
convolutional self-attention architecture to evaluate the transcription effects of non-coding
DNA. This model applies a convolutional layer to convert DNA base pairs to a sequence of
numerical values. Then we use a self attention module to learn the interaction between each
base pair and all other base pairs in the DNA sequence. Because weights in self attention layer
will be updated parallelly, it will also avoid the gradient vanishing problem. In summary,
CANEE will achieve a higher prediction accuracy and faster training speed.
2.1.1     Task and dataset overview
The dataset we are using is collected in the paper of DeepSEA Zhou and Troyanskaya (2015).
They collected the human GRCh37 reference genome and then segmented DNA into short
sequences with a length of 1000. Different DNA sequences will not overlap for more than
200 base pairs. On each short sequence, they evaluate the 919 regulatory functions including
factor binding, DNase I sensitivity and histone-mark effects. If more than 500 base pairs are
activated with such functions, then they labeled the corresponding response as 1. We take
                                               6


the DNA sequence as a 1000 × 4 matrix, where the first dimension represents the sequence
length and the second dimension represents the type of the base pairs A, T, C and G.
    The dataset contains the forward DNA sequences and their complementary sequence
pairs. The predicted probability for each sequence will be the average of the forward and
reverse complementary sequences. The training set, testing set and validation set are already
split. The training set contains 4400000 samples, the testing set contains 455024 samples an
validation set contains 8000 samples. No reverse complementary is leaked into other sets.
2.1.2     Model formulation
Figure 2.1 illustrates the framework of CANEE model. The sequences are firstly converted
to 1000 × 4 matrices, then the input matrix will go through a 1D convolutional layer along
with a max pooling layer. The output sequence from the max pooling layer will then be
fed into the next layer for positional embedding. In the end, a multihead self attention
network and a fully connected layer will learn and provide the output. We choose binary
cross entropy as the loss function. To avoid overfitting problem, we also add dropout layers
in the self attention layers and an early stopper to keep the best result if the model doesn’t
receive a lower loss in validation set for 5 epochs.
    The modules will be introduced here:
CNN Module
    We apply a convolution operation to the input matrix. It consists of a 1D convolutional
layer and a max-pooling layer. Suppose the input of the convolutional layer formula is
(N, I, L) and the output is (N, O), then the 1D convolutional layer will as following:
                                                7


                       Figure 2.1: The architecture of CANEE model.
                                                              I−1
                                                              X
                Conv1D(XNm ,Oj ) = ReLU (Bias(Oj ) +              WOj ,k ⋆ XNm ,k ),
                                                              k=0
where N is the batch size, I is the dimension for elements in the input sequence, O represents
the output element dimension, L is the input sequence length. Then the output sequence
from the convolutional layer will be fed into a max-pooling layer:
                Output(Ni , Cj , k) =          max            input(Ni , Cj , k + m),
                                      m=0,1,...,kernel size−1
where input value is of size (N, C, L).
Positional Encoding
    Self attention architecture updates the weight in parallel but will miss element location
information. To add the relative position information to the model, we feed the output
                                                  8


sequence to a positional embedding layer. Positional embedding encodes sines and cosines
of different frequencies as positions. Assume t is the location for an element in the sequence,
then its positional encoding p(t) is defined as:
                                    
                                                t
                                     sin(             ) if t = 2k
                                    
                                    
                             p(t) =        100002k/d
                                               t
                                    cos(              ) if t = 2k + 1
                                    
                                           100002k/d
Self Attention Module
    This module consists of a self attention network and a fully connected layer. The self
attention network has three factors: query Q, key K and value V . Assume the input is X,
the formulation can be expressed as:
                                           Q i = W Q Xi
                                           K i = W K Xi
                                           Vi = WK Xi
                                                  Qi · Kj
                                         Si,j =      √
                                                       d
                                                    exp(Si,j )
                                    Scorei,j = P
                                                    k exp(Si,k )
                                                   X
                                      outputi =        Si,j Vj
                                                    j
, where W represent the weight matrix. The output from self attention network will be fed
into a fully connected neural network to fit the 919 tags.
    We pick up DeepSEA and DanQ as our comparison.                   The model architecture is
summarized in Figure 2.2. The major difference for the the two models and CANEE is
                                                  9


                         Figure 2.2: Model structure comparison.
                  Models        AUC-ROC              AUC-PR
                  DeepSEA       0.9325               0.3425
                  DanQ          0.9384               0.3709
                  CANEE         0.9398               0.3732
                        Table 2.1: Model performance comparison.
the sequence learning module. CANEE applies self attention module while the DanQ and
DeepSEA applies Bidirectional RNN and simple fully connected layer.
2.1.3    Experimental results
In the experiments for CANEE, we select kernel stride as 26 and kernel size as 75. We set
4 heads in the multi-head self attention layer and stack 2 self attention layers in the self
attention module. The learning rate we set is 0.0001. Weights are initialized by Xavier
uniform. All experiments are conducted on single NVIDIA V100 GPU.
 Performance Analysis
    We compared the performance of CANEE with DeepSEA and DanQ. Two metrics
are used to evaluate model performance: Receiver Operating Characteristics(ROC) and
Precision Recall Curve(PR). The area under receiver operating characteristics curve(AUC-
ROC) will provide an estimation of how the model learn negative samples while the area
                                            10


Figure 2.3: Figure A: CANEE output vs DeepSEA in AUC-ROC. Figure B: CANEE output
vs DeepSEA in AUC-PR. Figure C: CANEE output vs DanQ in AUC-ROC. Figure D:
CANEE output vs DanQ in AUC-PR.
under the precision recall curve(AUC-PR) will focus more on the model performance on
positive samples. In the dataset, positive samples are significantly less than negative samples,
so we expect AUC-ROC to be much higher than AUC-PR.
    We analyze our model performance in two aspects. First we evaluate the overall accuracy
for all regulatory functions, then we check the distribution for each response prediction. In
Table 2.1, we calculate the average AUC-ROC and AUC-PR for all 919 responses in testing
set. For both metrics, CANEE outperforms both DeepSEA and DanQ.
    Another important point is to find out which responses have significant improvement.
                                              11


                Cell         TF,            DanQ       CANEE       AUC
                Type         DNase or AUC-             AUC-        Difference
                             HistoneMarkPR             PR
                GM12878      BRCA1          0.156832   0.592497    0.435665
                HepG2        BRCA1          0.176535   0.377814    0.201279
                GM12878      ZBTB33         0.107438   0.264291    0.156852
                GM12878      ZZZ3           0.125252   0.278512    0.15326
                H1-hESC      NRSF           0.370143   0.521568    0.151425
                HepG2        ZBTB33         0.117476   0.257853    0.140377
                K562         ZBTB33         0.117976   0.257172    0.139195
                K562         BDP1           0.144466   0.276138    0.131672
                K562         CHD2           0.290002   0.415569    0.125567
                H1-hESC      BRCA1          0.113197   0.237324    0.124127
       Table 2.2: Top 10 regulatory factors with the highest PR-AUC improvement.
Figure 2.3 shows the comparison between DeepSEA, DanQ and CANEE model by each
response. As a consequence, CANEE outperforms DeepSEA in most responses in AUC-ROC
and AUC-PR. Besides, CANEE shows a similar performance as DanQ in most of responses,
but it can also be noticed that some responses show a significant better performance in
CANEE, especially with respect to AUC-PR. Taking 5% as a threshold, more than 95% of
responses show no significant difference between DanQ and CANEE, but in 40 responses
which present significant differences, 38 of them show a better AUC-PR in CANEE model.
Table 2.2 lists the top 10 regulatory functions that have the highest improvement. The
highest improvements all happen in cell types such as GM12878, HepG2, H1-hESC and
K562, which indicates that the improvement is closely related to cell type. Besides, taking
GM12875 for example, among 91 regulatory functions of this cell type, the AUC-PR can
increase 4.2% by switching from DanQ to CANEE.
 Speed Comparison
    Comparing with recurrent neural network model, self attention is much faster in training.
Under the same environment, it takes 30 ∼ 60 epochs until convergence while CANEE takes
                                               12


                       Figure 2.4: Running speed: DanQ vs CANEE.
30 ∼ 50 epochs to converge, but CANEE is much faster for each epoch. We train the model
with different CNN kernel numbers and record the run time in Figure 2.4. It turns out that
under the same setting, CANEE is much faster in this long sequence learning.
2.1.4     Discussion
In this section, we propose a new architecture to evaluate the transcription effects of non-
coding DNA. The model shows a better performance in AUC-ROC and AUC-PR while
accelerating training procedure within less training time.
    However, there are still challenges remaining in this problem. Here we will list some
future directions to further update the model. According to the biological interpretation
of the transcription results, there exists interactions between 919 targets and this still
needs exploration; a deeper understanding on how to combine the forward and reverse
complementary sequences may further improve model performance; taking sparsity in the self
attention module into consideration can also help the model better capture the information
behind the DNA features. It is also valuable to find a lower-dimensional representation for
                                              13


the interaction between base pairs, and this may make the model even faster and cut down
noise factors in genomics field.
2.2      The prediction of plant stress response from DNA
         sequences
Advances in omics technologies have led to an abundance of biological information.
Integrating rich data sources from this omics data explosion, biologists can get a deeper
understanding of complex biological systems and answer difficult questions by employing
deep learning, which enables successful prediction by extracting high-level features from
massive data Zhou and Troyanskaya (2015); Quang and Xie (2016). A central problem
in bioinformatics towards understanding these complexities is gene function prediction, in
particular molecular functions (e.g. transcription factor binding) and biological processes
(e.g. a given gene is pertinent to the process of reproduction). However, experimentally
annotating gene function is a relatively slow process Kulmanov et al. (2018), making
attractive these computational methods on DNA sequence data.
    A subclass of this central problem is found in molecular plant biology, where much
research is done to understand how plants respond to various abiotic and biotic stressors
(e.g. heat waves, drought, and pest infestations). As the regulation of expression levels
determines how plants respond to different environmental factors, the analysis of expression
regulation is of great importance.
    A main component of gene expression regulation is through the binding of transcription
factors to specific sequences of DNA called regulatory elements (motifs). For this reason,
an avenue of research has been to identify these transcription factors and the respective
                                             14


regulatory motifs, in order to predict gene expression responses Uygun et al. (2017); Wilkins
et al. (2016). However, identifying individual regulatory motifs, such as transcription factor
binding sites (TFBS), is only a small part of the complex process of gene regulation.
Indeed, gene regulation processes also depend on the location, orientation, quantity and
co-localization of regulatory motifs. These dependencies form the structures that modulate
gene regulation, and these structures form what is called regulatory grammar Weingarten-
Gabbay and Segal (2014).
    Understanding of regulatory grammar by computational modeling of these complex
dependencies has thus become a hot area of bioinformatics research. Many advancements
towards modeling complex regulatory grammar have come from deep sequence learning
models, traditionally used in natural language processing.
    One of the early deep learning models developed to account for the sequential
dependencies was DeepSea Zhou and Troyanskaya (2015).              This was done by using
convolutional neural networks (CNN), from which motifs and local dependencies were
learned, ultimately used for functional-variant prediction.
    Building on the DeepSea model, Quang and Xie developed DanQ Quang and Xie
(2016), which couples the CNN with a recurrent neural network (RNN), namely a bi-
directional long short-term memory network (LSTM) Hochreiter and Schmidhuber (1997b).
The LSTM component helps identify long-range dependencies [9], and hence co-localization
dependencies. As the LSTM is bi-directional, it learns these features on both the forward
and reverse ordering of sequences (hence orientation). Besides, as discussed in the previous
topic, the self-attention module points out a promising direction towards the understanding
of genetic information.
    These developments of deep sequence learning models are easily tailored and applied to
                                              15


                        Figure 2.5: High level pipeline of DeepCAT.
our problem of interest: predicting plant stress response from DNA sequences. Building
on these ideas, we propose DeepCAT, a similar convolutional self-attention architecture as
CANEE to predict plant stress response from DNA sequences. DeepCAT consists of 3 layers.
The first is a convolutional layer which converts DNA base-pairs to a numerical sequence,
identifying key predictive motifs and local dependencies. The second layer is self-attention,
which captures key predictive co-localization dependencies. Lastly, a fully-connected (FC)
layer to output prediction scores of gene up-regulation under different abiotic and biotic
stresses.
2.2.1     Data and problem statement
Gene expression and sequence data of 20,799 Arabidopsis genes each consisting of 3,200-bp
(covering promoter and 5’ UTR) were downloaded from the AtGenExpress database and
processed as in Uygun et al. (2017). In brief, the preprocessed and normalized expression
data from AtGenExpress was used to calculate log2 fold change between stress and control
conditions using Limma Ritchie et al. (2015) in the R environment. Genes with a log2 fold
change ≥ 1 were considered up-regulated.
    DNA sequences were pulled for each gene from TAIR10. Particularly, the sequences were
taken from 1-kilobase (kb) upstream and 500-base pairs (bp) downstream the transcription
                                             16


start site and 500-bp upstream and 1-kb downstream the transcription stop site. These
sequences were then one-hot encoded, with each sequence converted into a 3200x4 binary
matrix. The columns correspond to A,C,G,T, and rows correspond to the position in the
DNA sequence, with a each row containing a single 1 in one column and zeros in the remaining
columns.
    Given raw DNA sequence data, the objective is to predict the gene expression responses
to 57 environmental stress conditions in arabidopsis thaliana. Specifically, we want to predict
if an arabidopsis gene was up-regulated or not in shoot tissue under each of 36 abiotic (e.g.
cold, heat, osmotic) and 21 biotic (e.g. 71 Pseudomonas syringae, bacterial flagellin) stress
conditions. Genes were randomly assigned according to a training-validation-test split of
70-10-20.
2.2.2     Model Architecture
As previously described in brief, DeepCAT consists of 3 main modules: (1) CNN, (2) Self
Attention and (3) FC & output. The descriptions are below.
CNN Module and Self Attention Module
    The CNN module and self attention module have the same structure as in Section 2.1.2.
Fully-Connected Output Module
    The output of the self-attention module is the input here. We apply a single FC layer,
giving weighted scores for each of the 57 stress types. We then apply a sigmoid output
layer, which takes these scores and converts them to probability scores, which is a predicted
probability of gene up-regulation under each of the 57 stresses.
  Training
    We trained DeepCAT by minimizing the average multi-task binary cross-entropy loss in
                                              17


                             Figure 2.6: DeepCAT architecture.
mini-batches of size 50 using the Adam optimizer Kingma and Ba (2014). All the weights
and biases were initialized with Xavier (uniform) Glorot and Bengio (2010) and zero values
respectively. For model regularization purposes, we applied dropout with rate of 0.1 in
attention layers.
    Validation data was used to determine an optimal number of training iterations. Namely,
we use an early-stopper to stop the training process if the validation loss does not decrease
for a set number of epochs (default 5), thus keeping the model that performs best on the
validation set.
    In all of our experiments we trained DeepCAT with settings:            320 convolutional
kernels/filters, kernel dimension 26, pooling dimension 13, and used 4 attention heads. Our
implementation was with PyTorch, and our experiments (training and testing) were ran on
NVIDIA K80 GPU.
                                              18


                           Figure 2.7: Performance of DeepCAT.
2.2.3     Experimental results
2.2.3.1    Performance Analysis
Using the fully trained models, performance was measured on the testing data. We used
two metrics: the Receiver Operating Characteristic-Area Under the Curve (ROC-AUC) and
the Precision Recall-Area Under the Curve (PR-AUC). For overall comparison purposes we
averaged the PR-AUC and ROC-AUC across the 57 stress types.
Experiment 1
    In the first experiment, we evaluated baseline performances of our standard DeepCAT
model and a few classic and deep learning models. The baseline models consisted of Support
Vector Machines (SVM) and Random Forest. The deep learning model we compared against
was essentially the DanQ model Quang and Xie (2016), with the modification of the output
layer to give plant response probability scores for the 57 different stress types. We chose
this deep learning model for comparison, as it has a similar structure as DeepCAT, and has
performed well on a different but similar problem with human DNA data. Figure 2.7 shows
the PR-AUC and ROC-AUC values (y-axis) for each of the 57 stress conditions (x-axis),
                                             19


Figure 2.8: Performance of DeepCAT with kernels initialized from weights learned from the
DanQ human model.
along with the respective average values in the legend. Comparing with the classic models
and DanQ, DeepCAT can achieve a higher accuracy in most targets in average.
 Experiment 2 - Transfer Learning
    The main idea of Transfer Learning Tan et al. (2018) is to leverage existing knowledge
from one problem to solve a different but similar problem. Here we injected existing existing
knowledge in two ways. One was through experimentally verified information. The other
was information learned from a model with a rich data set. As the kernels in the CNN
layer of DeepCAT act as DNA motif finders, we experimented initializing the kernels with
known A.thaliana TFBMs. Moreover, the DanQ model used a massive amount of human
gene data (> 4 million), so we also experimented initializing the kernels with the kernel
weights learned in the DanQ model. According to Figure 2.8 and Figure 2.9, we find that
implementing these Transfer Learning methods in DeepCAT lead to better performances
across nearly all 57 stresses.
 Experiment 3 - Stress Grouping
    Our previous results are based on learning all 57 stress responses simultaneously.
However, the information from DNA sequences may not be shareable for different
                                             20


Figure 2.9: Performance of DeepCAT with kernels initialized from experimentally verified
TFBMs.
Figure 2.10: Performance of DeepCAT with known TFBM initialized kernels, and the
clustered response multi-task model.
                                        21


Figure 2.11: The clustering hierarchy of the stress types from k-means clustering, with red
highlighted stresses being heat related.
responses, because these different stress types may have very different underlying regulatory
mechanisms, and finding a good shared representation may not be possible. The expectation
is that in an MTL setting, learning stresses with similar underlying regulatory mechanisms
are to mutually benefit from each other, while stresses with very different underlying
regulatory mechanisms may hinder performance.
    In Figure 2.11, we did hierarchical clustering and cluster the responses into 3 different
groups. Then, we train three models individually on those three groups. We also pair
this with what we did in experiment 2 by initializing the convolutional kernels with known
A.thaliana TFBMs. As is shown in Figure 2.10, we find that both of these experiments lead
to better performances, with the latter yielding the best performance.
2.2.3.2    Interpretation
We also explored the interpretation of the convolutional and self attention layers in the
DeepCAT model. An interesting result is that the DeepCat model can be interpreted as a
                                              22


Figure 2.12: Pipeline to translate kernel weights to position frequency matrices and aligned
to known motifs.
motif learner, by a translation of the kernels in the convolution layer to positional weight
matrices Alipanahi et al. (2015). We aligned these to known motifs from the DAP-seq
and CIS-BP databases using TOMTOM software (https://meme-suite.org/meme/tools/
tomtom). Of the 319 motifs learned by our model, 114 significantly match known motifs
(E < 0.1); a threshold of 0.05 was used for p-value to measure the similarities. Figure 2.12
shows this process. In Figure 2.13, it could be seen that the trained convolutional kernels
can be interpreted to be matching with some the existing gene motifs.
    Besides, we also analyzed the attention scores in the self attention module, we found
interactions of motifs exists at different positions. From Figure 2.14 we can see that the
attention model identifies interactions between base-pairs at long ranges, and thus identifies
long-range co-localization dependencies.
2.2.4     Discussion
While the performance of DeepCAT is good relative to well-established shallow and deep
learning methods, the accuracy is still low in absolute terms. Thus there are still some
challenges to overcome. From our results, we see that the stress grouping is a significant
matter, and more sophisticated methods to learn the best groupings (i.e. the most related
stresses) could help increase testing accuracy. Additionally, we saw leveraging transfer
                                              23


         Figure 2.13: Motifs learned from DeepCAT aligned with a known TFBMs.
Figure 2.14: Here we plotted the motif interaction for the 4 heads of the attention module
across all responses.
                                            24


learning from both the big-data human model, and the experimentally verified TFBMs
helped increase the predictive accuracy. This is another lever for increased accuracy, and
hence is a direction of great interest.
    Nonetheless, with DeepCAT we have shown how deep sequence learning and other
learning mechanisms, such as grouped learning and transfer learning, can move us towards
solving the problem of plant stress response prediction. Moreover, these methods are able
to learn and extract key motifs and long-range motif interactions, which are important
components of understanding regulatory grammar, and hence gene regulation.
                                            25


Chapter 3
The understanding of clickstream data
3.1      Introduction
In recent years, car buyers are more demanding than ever: they emphasize various factors
such as product variety, rapid delivery, etc. Meanwhile, the car-shopping behaviors change
dramatically: customers spend, on average, 108 days in the market before buying a new
vehicle and use 60% of their research time online before going to a dealership Cox (2018). This
long time gap, typically 108 days, between the first click and orders provides car companies a
unique opportunity to adjust their procurement and production. Therefore, a demand model
based on clickstream data and local dealer information can help us seize these opportunities
by making accurate predictions for the dynamic and granular automotive demand.
    The past few years have witnessed a surge of interest in developing demand forecasting
models utilizing low dimensional inputs such as dealers’ locations, gas prices, gross domestic
product, etc. Chase (2013). However, recent Internet clickstream tracking technology has
generated massive data regarding customers’ browsing behaviors. Huang and Van Mieghem
(2014) suggested that forecasting models using the clickstream information can reduce the
inventory holding by 3% and back-ordering cost by 5% in the rolling door industry. Those
methods rely on the fact that online users and buyers are identical.           However, in the
automotive industry, customers tend to browse online for information but purchase vehicles
                                               26


                  Figure 3.1: A example of a user from an e-commerce site.
in local dealers, which generates mismatches between the clicks and sales data. In summary,
in auto demand forecasting, the proposed machine learning algorithms are desired to 1)
ingest the high-volume and high-frequency clickstream data; 2) make no model assumption
on the relationship between predictors and response variables; 3) accurately predict demands
for various auto models (F150, Fiesta, Mustang, etc.) and entities (engine, drive, etc.); and
4) provide a robust long-term forecast. This study aims to develop novel machine learning
methods to address the obstacles mentioned above.
    Forecasting automobile demand has been studied for more than 30 years, pioneered by
Lewandowski (1974) in the 1970s using conventional time series forecasting techniques. Then,
Berkovec (1985) proposed a general equilibrium model that assumes that the demand equals
the supply. In Brühl et al. (2009), moving average and Support Vector Regression are
integrated with a Gaussian kernel for demand forecasting. In Kayapinar Kaya and Yildirim
(2020), an 8-layer Deep Neural Network is developed to incorporate exogenous features
such as exchange rates, gross domestic product, and consumer confidence indexes for sales
prediction. However, all the above methods are based on monthly or quarterly data and are
thus less applicable to process large data sets and provide accurate daily demand prediction.
More importantly, none of the models leverage the massive clickstream data.
                                               27


    On the other hand, utilizing clickstream data for purchase forecasting has been studied
extensively in e-commerce and has shown the ability to boost revenue significantly Xu et al.
(2015); Lu et al. (2014). Precisely, a user’s clickstream data consists of a series of visited
items, known as macro interactions, and the final purchase, as shown in Figure 3.1. The
dynamic and nonlinear temporal nature of clickstream data poses significant challenges to
the existing time series forecasting models. These methods rely on the stationary and linear
assumptions on the data, which is invalid in most cases. Therefore, numerous types of
nonlinear Recurrent Neural Network (RNN) models, such as Long Short-Term Memory
(LSTM), Gated Recurrent Units (GRU), and Transformer have been introduced Hochreiter
and Schmidhuber (1997a); Vaswani et al. (2017); Cho et al. (2014). Zhou et al. (2018)
further decompose each macro interaction into a sequence of micro behaviors and proposed
an RNN based predictive model. Despite their successes in e-commerce, those methods
require correspondence between online clicks and purchases. This critical requirement is
natural for e-commerce but mostly missing in our study. The deeper understanding of
transfer learning in Weiss et al. (2016) and graph neural network in Veličković et al. (2017);
Zhou et al. (2020) points us a direction to learn from the clickstream data.
    This study investigates automobile demand forecasting on a granular level via integrating
historical sales and clickstream data.        Although the clickstream data contains rich
information, online users are anonymous and do not directly correspond to offline purchase
data. This data nature poses tremendous challenges, including (a) how to extract useful
features from clickstream data and (b) how to incorporate those features into the forecasting
models.    To address the challenges, we propose a general Clickstream based Demand
Forcasting framework (CDF), which includes clicks and sales data into an RNN model for
demand forecasting. Specifically, in the first step, we convert the clicks data into a sequence
                                               28


of micro behaviors. Then we leverage a natural language processing framework Word2vec
in Mikolov et al. (2013), to learn the low dimensional numerical representation of the micro
behaviors and cluster them into different groups. With the clustering structure, we transform
the clickstream data into a multivariate time series, which is then fed into an RNN model to
predict the demand. Our models outperform traditional machine learning methods in both
national and state levels forecasting. Another contribution of our study is the application
of transfer learning in auto demand forecasting. We demonstrated transferring knowledge
across car models can also increase forecasting accuracy, especially for models with low sales.
Besides, we also applied graph attention network (GAT) the learn the interaction between
different car models.
       The rest of this chapter is organized as follows. In Section 3.2, we give a formal illustration
of the demand forecasting problem, and then Section 3.3 describes the data preprocessing
and the proposed forecasting model. Detailed analysis and experiments are presented in
Section 3.4. Finally, Section 3.5 discusses the potential extensions and future directions.
3.2         Problem Statement and Formulations
Let M = {m1 , m2 , . . . , mM } be the set of products, where M is the number of total models;
A = {a1 , a2 , . . . , aA } be the set of customer actions on the web-page (e.g. accessory, gallery,
Search inventory, etc.), where A is the number of possible actions; D = {d1 , d2 , . . . , d5 } be the
set of five discredited action dwell times corresponding to 0-20 percentile, 20-40 percentile,
. . . , 80-100 percentile respectively; S = {si,m } be the model m sales on the ith day. With
these definitions, we can represent the clickstream data for a user as a sequence of tuples
                                                    29


                           Figure 3.2: An illustration of micro behavior.
(mi , aj , dk ), denoted as micro behaviors. Specifically, the micro behavior is defined as:
                          Micro behavior = (Model, Action, Dwell Time)
 As Figure 3.2 shows, elements in a micro behaviors are defined as:
• M odel consists of two components: the car model and the release year. For simplicity, we
  only consider two types of release years: the New which is the current model when the
  customer visits the website and the Old is the older version.
• Action is certain short strings summarizing website content. For example, “fv:si" represents
  Ford vehicle search inventory.
• Dwell time represents the time customers spend on the website. We convert the duration
  into 5 categories including short, medium short medium, medium long and long according
  to the percentile of the duration time in all visits to reduce the dimensionality.
    The problem we want to study is: Given the historical sales for multiple car models
and the historical clickstream data of a set of users, we aim to build a demand forecasting
machine learning algorithm for different car models and entities. Specifically, given sale data
{si,m }ai=a−d and clickstream data {ci,m }ai=a−d from day a − d to day a for model m, we
want to predict the model m demand on the day a + l. Here, d is the size of the look-back
                                                 30


window, and l is the number of time steps we want to predict into the future. We then
assume the following non-parametric model.
                          Ya+l,m = f ({si,m }ai=a−d , {ci,m }ai=a−d ) + ϵ,
where Ya+l is the demand on day a + l and f (·) is a nonlinear function, and ϵ is the random
error.
Online Clickstream Data
    We utilized two different data sources: Ford in-home historical sale data and clickstream
data pulled from the Ford.com. Historical sales data contains daily updated customer
purchases reported from dealers starting from Jan 6th, 2016.              This sale data includes
locations, dates, and model type. The unit of clickstream data is a web visitor who visits
Ford.com. These data sets contain features for the visitors, including the IP address, date,
browsing behavior, dwell time for each behavior, model, etc. However, web visitors are
anonymous because they do not provide their identity. Thus, unlike e-commerce or the B2B
cases Zhou et al. (2018), the web visitors in our clickstream data do not match the offline
sales data. The original data contains 2,595,072,202 records (each record is an action for one
visitor) with 977 features.
3.3      Statistical Analysis and Models
In this section, we introduce the statistical analysis and our model architecture. Section 3.3.1
will cover the data preprocessing and analysis and Section 3.3.2 will introduce the model we
proposed.
                                                31


          Figure 3.3: Sales demonstrate strong weekly, monthly, and holiday effects.
3.3.1      Data Preprocessing
The distribution of car sales is affected by compound temporal effects. To explore the factors
in car sales, in Section 3.3.1.1 we figure out the temporal properties of the sale distribution.
Then in Section 3.3.1.2, we analyze the correspondence between the online clickstream data
and the sales history. As the clickstream data is not numerical, Section 3.3.1.3 introduces
the embedding method and shows the properties of the clickstream data embeddings.
3.3.1.1    Historical Sales Data
To simplify this project and remove unnecessary heterogeneity, we only focus on the US’s
data. Besides, the sales data contains retail, leasing, and bulk orders by enterprises. The
clickstream data track individual customers’ browsing behavior, which is most related to
retail sales. Thus, we focus on retail sales in this study.
    There exist strong weekly effects and monthly effects on sales. As Fig 3.3 shows, there is
a peak at the end of each month and a significant decrease every Sunday. National holidays
also cause a compound effect on car sales.
                                                32


Figure 3.4: Fig A: The highest correlation between each micobehavior action and historical
sales for one Ford model. Fig B: The correlation box plot between each microbahavior action
and historical sales for Ford SUV.
3.3.1.2    Correlation Analysis between Clickstream data and Future Demand
There are hidden inferences between car sales and clickstream data. Figure 3.4 shows
the correlation between Ford SUV microbehavior actions and Ford SUV sale history.
Figure 3.4.A shows the ordered correlation between Ford SUV sales and each action in
microbehaviors. Actions such as ’payment estimator’, ’private’ and ’help’ are very sparse in
the microbehaviors, and their low correlation with sales is within our expectation. Popular
and intuitively related actions such as ’bp’, ’find a dealer’ and ’vehicle’ show a relatively high
correlation with historical sales. As there exist significant week effects in car sales, we also
analyze the correlation between actions and car sales on different days of the week. Figure
3.4.B shows that those highly correlated actions with sales also show their high correlation
on all days of a week.
    Due to the mismatch between sales and clickstream data, we utilize an unsupervised
Word2vec model, i.e., Continuous Bag of Words Model (CBOW), to embed micro behaviors
into low dimensional vectors so that we can use machine learning algorithms to learn the
relationship between micro behavior and extract features for the CDF model. Thus, we utilize
the hidden layer’s output from CBOW as a good representation of the micro behaviors.
                                                33


    With the embedded data, we cluster micro behaviors via the spectral clustering
algorithmVon Luxburg (2007). Specifically, we will first calculate the similarity matrix S
between micro behaviors using their embedded vectors, {zi }n      i=1 . Then, we compute the
normalized graph Laplacian L and compute the first k eigenvectors u1 , . . . , uk of L. Let
U be the matrix concatenating the column vectors u1 , . . . , uk , where vi corresponds to the
ith row of U . Then the cluster labels for the i-th micro behavior is achieved via k-mean
clustering on the point {vi }ni=1 .
3.3.1.3    Word2vec Models for Browsing Behaviors
New technologies have reshaped people’s shopping behaviors, and new buyers prefer to collect
car information online, including visiting OEM websites such as Ford.com. This enables the
company to collect enormous data about prospective buyers. Figure 3.2 illustrates a real
example of a visitor’s browsing history on Ford.com. The visitor first visited the homepage
of Ford SUV exploring the inventory, price and gallery. Then, the customer moved to a sport
utility car model and checked the dealer information. After that, the customer browsed the
page for a mid-sized car exploring its accessories and finally went back to the homepage of
Ford SUV and ended the session.
    Despite its usefulness, the clickstream data’s summary statistics can only capture a small
fraction of the information from the clickstream data and is sensitive to other environmental
factors such as launches of new models and incentives. To extract informative and robust
features from clickstream data, we propose a novel word2vec as follows.
    We organize the clickstream data in terms of a series of macro interactions (for different
car models) and micro behavior tuples as defined above. The macro behaviors record the
interactions between visitors and the models and measure the similarity between models.
                                               34


For example, a customer wants to buy a family car. Even though there are multiple choices,
including Sedan, SUV or Van, the customer may only compare two similar SUV models,
which indicates these two models are more similar compared other car models. From a
micro perspective, each macro interaction consists of a sequence of behaviors indicating
what information the customer collects for the model, how long the customer dwells on
a page, and whether the customer checks the information for local dealers. These micro
behaviors provide additional information about the visitor. For example, the longer dwell
time on a model indicates a stronger desire for the product, checking the local inventory
suggests a stronger intent, and visiting model details pages, including accessory and gallery,
also indicates a higher interest. One major advantage of formulating the clickstream data
into sequences of micro behaviors is that it provides a framework to capture the relationship
between car models and represent browsing behaviors using numerical vectors, which can
be incorporated into our demand forecasting model. However, this formulation also poses
tremendous challenges on 1) how to utilize the sequential nature of the data and (2) how
to incorporate the click information into our forecasting model. The solution to these two
challenges leads to our novel demand forecasting model.
    Specifically, after organizing the clickstream data in this structured way above, we turn
each browsing history into a ’sentence’ with micro behaviors as ’words’. To capture the
relationships between micro behaviors and embed them into a Euclidean space, we apply
the word2vec framework using the Continuous Bag of Words (CBOW) model illustrated
in Figure 3.1. Using CBOW on 880 days of clickstream data across the US, we learned
the 20-dimensional numerical representation of each micro behavior and clustered them
into 27 different groups using spectral clustering. Figure 3.5 shows the two-dimensional
Uniform Manifold Approximation and Projection (UMAP) representation of different micro
                                                35


Figure 3.5: UMAP projection plot showing 27 major clusters of the 5022 micro behaviors
using embedding learnt from word2vec model. The colors represents the clusters generated
from spectral clustering to group them into 27 clusters.
behaviors highlighting the data’s clustering structure. The microbehavior embeddings will
keep the properties about car models and dwell time. In Figure 3.5, each cluster only
contains microbehaviors of 1 ∼ 2 car models. On the other hand, microbehaviors about a
certain car model can only be projected to a few certain clusters. Take Ford Model A for
example, more than 95% microbehaviors about Ford Model A can only be classified into 2
clusters. It should be noticed that in the middle of the plot there exists a cyan cluster. This
cluster is a chaos cluster, which represents that the microbehavior is hard to be classified.
This cluster consists of rare microbehaviors and therefore it’s difficult to build connection
with other microbehaviors. Furthermore, we can also find dwell time properties in a certain
cluster. Taking Ford Model A for example, Figure 3.5 recolored the embedding according to
their dwell time category and the cluster can be further split into 5 subclusters. Therefore
dwell time also player a minor role in microbehavior embedding. Besides, the microbehavior
embedding also provides indication for different car models. If two cluster are near with each
other, then car models in the two cluster will be considered to have some relationship with
                                              36


each other in respect of customers.
    With this cluster information, we can extract daily robust features such as the number
of incidences in cluster 2 for each zip, which can be incorporated into our CDF model.
3.3.2     Implemented Models
In our research, we improve models progressively. Firstly we select the classic statistical
model Seasonal Autoregressive Integrated Moving Average as our benchmark. Then for
comparison, we propose multivariate LSTM to capture the compound temporal properties
in sales data. In the end, we introduce our proposed approaches to fuse the clickstream
information with sales features.
3.3.2.1    Seasonal Autoregressive Integrated Moving Average Model
As a benchmark, we select the Seasonal Autoregressive Integrated Moving Average
(SARIMA) model. Autoregressive Integrated Moving Average(ARIMA) is a classic way
for time-series data forecasting. As there is a strong interaction between weekly effects and
monthly effects, adding seasonal features can help ARIMA capture the complicated feature
interactions.
3.3.2.2    Multivariate LSTM Model
To utilize the complicated temporal dependency among data, we propose a general
Clickstream based Demand Forecasting Framework (CDF), which incorporates clickstream
and historical sales data into a recurrent neural network model. Figure 3.6 illustrates the
architecture of the proposed Clickstream based Demand Forecasting Framework.
    To capture the strong temporal effects, including weekly, monthly, and holiday effects,
                                               37


                   Figure 3.6: The architecture of the CDF framework.
we propose a novel multivariate LSTM based daily demand forecasting model incorporating
those temporal features. Specifically, for each day, we include the following features
• the weekday information (Monday, Tuesday, etc.) using one-hot encoding
• dummy variable for whether it is the end of a month
• dummy variable for whether it is a holiday, e.g., Christmas, New year’s eve, etc.
In sum, the proposed model’s input features are sales and the temporal features of the past
92 days, and the output includes the sales after 30 days, as shown in Figure 3.6.
3.3.2.3   Clickstream Informed LSTM Model
Multivariate LSTM architecture provides us with the potential to tranfer information across
similar domains. However, how to choose the transfer learning direction remains a challenge.
Here, we propose that the interaction for online clickstream history provides an indication
between car models. The interaction can be captured by analyzing the distance between
microbehaviors or learned by a graph neural network. Figure 3.7 shows the graph attention
network(GAT) based architecture by taking the vectored microbehaviors as a graph and
multiple car models as input to improve the learning ability for single model prediction. The
framework contains two modules:
                                              38


                      Figure 3.7: The architecture of GAT-LSTM model.
• A masked-graph attention network for selected 9 car models.
• Multivariate LSTM as illustrated in Section 3.3.2.3.
    Figure 3.8 illustrates the GAT framework. It will learn an attention weight to each car
cluster and combine it with the cluster distance. Then according to the weighted sale history,
it will generate a sequence for prediction. The multivariate LSTM part will fit the output
sequence from the GAT module to predict future sales.
3.4      Experiments and Analysis
Our experiments will follow the order as our model improvement process. In Section 3.4.1 to
Section 3.4.3, we first summarize the SARIMA and multivariate LSTM experiments using
sales data only. Meanwhile, there also exists potential of transfer learning on Multivariate
LSTM. Then we feed the summary clickstream statistics into the multivariate models directly
and illustrate the performance of those different clickstream informed multivariate LSTM
                                              39


                        Figure 3.8: The framework of GAT module.
models in Section 3.4.4. The experiments to combine clickstream data and transfer learning
are covered in Section 3.4.5.
3.4.1     Benchmark Comparison
First we compared the performance of SARIMA model and multivariate LSTM model in
car sales forecasting. Then we run transfer learning experiments on different levels with the
multivariate LSTM framework.
Seasonal Autoregressive Integrated Moving Average Model Settings
    To decide the hyperparameters within the SARIMA model, we apply the autocorrelation
function and find where there exists a high correlation every 7 days. Within each week, there
is no significant seasonality. Therefore we decide to set p = 7. From the PACF plot, we
can see that there exists a lag of 7 days, and there is less correlation when the lag is higher
than 7. This indicates a selection of q as 7. To address if we need to apply differencing
within the series, we tried several settings for d to perform differencing for historical sales
                                              40


and checked its corresponding Akaike information criterion(AIC). Within all settings, d = 0
had the lowest AIC. Therefore, we choose the hyperparameters for the ARIMA model as
p = 7, d = 0, q = 7. Meanwhile, to capture the monthly effects, we also added the seasonal
parameter as 30 days in SARIMA.
Multivariate LSTM Model Settings
    We implement the Asynchronous Successive Halving Algorithm (ASHA), a simple and
robust algorithm for hyper-parameters tuning in multivariate LSTM model and GAT-LSTM
model. The 2-layer LSTM contains 5 and 3 kernels, and the fully connected layer after the
concatenate layer has 5 nodes. The training algorithm is adam with the learning rate as
0.0001.
3.4.2     Daily Demand Forecasting
We first test the performance of our model on the national level data.           We split the
national sales data into a training set (09/07/2016-01/20/2018), validation set (01/21/2018
- 04/30/2018), and testing set (05/01/2018 - 10/11/2018).
    Traditionally, people tend to use Mean absolute percentage error (MAPE) to measure
the model performance. The MAPE is defined as
                                                  n
                                              1 X |ŷi − yi |
                                  MAPE =                      ,
                                              n        yi
                                                 i=1
where yi is the true sale on ith day, and ŷi is the predicted sale. However, in our sale data,
there are many days with sales close to zero as shown in Figure 3.3, which makes MAPE
metric statistically unstable. Instead, we introduce a new metric, termed L2 -MAPE, to
                                               41


measure the model performances, where
                                             sP
                                                 n
                                                     (ŷ − y )2
                               L2 -MAPE =         Pn i 2 i .
                                                 i=1
                                                     i=1 yi
3.4.3      Transfer Learning on Multiple Levels
To improve the model and leverage the information from other regions, we propose to
incorporate the transfer learning framework into our framework. Here, transfer learning
is a system’s ability to recognize and apply knowledge and skills learned in previous tasks
to novel tasks or new domains, which share some commonality. The following experiments
summarize the transfer learning potential for the multivariate LSTM model.
Region-level Transfer Learning
    In our setting, the sales data in regions with a similar culture and incentive schedules
tend to inherit similar patterns, which can improve the model accuracy for a specific region.
To implement the transfer learning framework, We first train our model on national data
and record the learned weights/parameters. We then initialize our local model using the
national data’s weights and train the model with data from each state. The implementation
of the transfer learning leads to reduction in most states with respect to L2 -MAPE loss.
    Figure 3.9 compares L2 -MAPE of SARIMA, randomly initialized LSTM and LSTM
initialized from the trained national model in all 54 states and territories. The X-axis
represents all states ordered by their regional car sales. In most states, LSTM has higher
accuracy than SARIMA, and initializing weights from trained LSTM model can further
improve model performance.
   Model-level Transfer Learning
                                              42


                                                                          s
         Figure 3.9: L2 -MAPE Comparison in car model B State Level Prediction.
Figure 3.10: Daily demand forecasting for car model B in the US. A: Forecasting performance
of SARIMA model. B: Forecasting performance of Multivariate LSTM model. C: Forecasting
performance of Multivariate LSTM model with weights transferred from car model A.
         Models                                                       L2-MAPE
         SARIMA                                                       57.10%
         Randomly Initialized Multivariate LSTM                       19.45%
         Multivariate LSTM Transferred from car model A               19.00%
                 Table 3.1: Daily demand forecasting accuracy of model B.
                                             43


         Models                                                        L2-MAPE
         SARIMA                                                        48.83%
         Randomly Initialized Multivariate LSTM                        23.69%
         Multivariate LSTM Transferred from model B(less               23.65%
         popular)
         Multivariate LSTM Transferred from car model                  19.14%
         A(more popular)
               Table 3.2: Daily demand forecasting accuracy of car Model C.
    Besides transferring information within the same model, an interesting question is
whether we can transfer information between models.          For example, can we transfer
information from the best selling model, to our target one or multiple models? We pick
up car model A, B and C from Ford popular compact car models, where A sells much more
than the other two car models. Figure 3.10 and Tables 3.1, 3.2 demonstrate that the transfer
information from a more characteristic domain can improve the Multivariate LSTM model
prediction accuracy for both car model B and C. In addition, transfer learning from different
domains shows various performances. Table 3.2 shows that the weight initialization from the
trained model for car model A and B can help improve the prediction accuracy of C.
Entity-level Transfer Learning
    There are multiple entity levels within the same car model. In Ford dataset, entity
information is matched with a unique hashed VIN number for some car models. Then, by
matching the hashed VIN number with the historical car sales, we grouped one kind of Ford
SUV model sale history by entities. It should be noticed that after grouping, the distribution
of some entities will be very sparse with respect to time. Some entities are directly related
to model years, so only the time around the model year will have these kinds of entities
recorded. We picked up 5 entities recorded in at least 3 years and compared the model
performance between the random initialized multivariate LSTM model and the multivariate
                                             44


                         Figure 3.11: Entity-level model comparison.
LSTM model with the weight initialized by the trained model for all entities. Figure 3.11
shows that out of 5 entities, 4 of them have a better performance after transfer learning.
3.4.4     Clickstream Informed Multivariate LSTM Model
We cleaned the clickstream data and fed the data into our multiple-input LSTM model.
We evaluate the model performance on selected Ford SUV in 2017∼2020 and summarize
the results in Table 3.3. The following shows the input settings for the merging degree of
clickstream statistics and model. The results are shown in Table 3.3
• Baseline Inputs only includes the car sales and temporal property features.
• Model 1: Naive Clickstream Input Model. We construct the clickstream input by feeding
                                              45


                                     MSE        MAE       L2-MAPE
                      Baseline       28908      120       24.38%
                      Model 1        27303      111       23.68%
                      Model 2        26802      120       23.46%
                      Model 3        27624      108       23.82%
                      Model 4        39039      154       28.34%
        Table 3.3: Performance for clickstream informed multivariate LSTM models.
  all microbehavior tuple into the Word2vec model and do clustering. Then we pick up the
  clusters according to the car model in microbehavior tuples. We calculated the number of
  microbehaviors such clusters and fed it into the model along with the sales.
• Model 2:       Cleaned Naive Clickstream Input Model. There exist uninterpretable and
  synonymous records in the clickstream dataset. We manually removed or merged those
  records before feeding them into the model.
• Model 3: Concise Clickstream Input Model.             To make a better use of dwell time
  information, we picked up the core 13 actions and calculated the summation of the dwell
  time for each action. We also set a threshold for the dwell time to be no more than 120
  seconds.
• Model 4: Concise Clickstream Input on R-Transformer. We updated the RNN part of
  the model from LSTM to R-transformer as illustrated in the state of the art sequential
  model Wang et al. (2019d). The input is the same as the cleaned naive clickstream input
  model.
As a result, after proper cleaning, adding clickstream summary statistics into the multivariate
LSTM model can improve the model performance.
                                                46


Figure 3.12: UMAP projection plot showing 10 major clusters using embedding learned
from word2vec model. The colors represent the clusters generated from spectral clustering
to group them into 10 clusters. The middle wheat-colored cluster represents the chaos part.
Other surrounding clusters each represent a certain car model.
3.4.5     GAT-LSTM Model
3.4.5.1    Experimental Results
We take a less popular car model as our target and select 9 popular car models as our graphic
input. The distance between models is derived from the distance between the microbehavior
cluster’s midpoints corresponding to each model in the Word2Vec result. The sale history
ranges from Oct. 27, 2016 to Nov. 17, 2020. As the GAT model contains more parameters
than the LSTM model(4196 vs. 474), constructing samples from only 1482 days can’t
guarantee enough samples. Therefore, we also manually select 20 states with high sales
and run experiments on those states.
    To avoid the effect of local optimum during a single training process, we switched random
seeds and run 50 experiments. Table 3.4 summarized the average results. GAT-LSTM
outperforms Multivariate LSTM model for the 20 selected states and the summation of the
                                              47


                      L2 -MAPE              State Level         Summed
                                             Prediction      Prediction for
                                                              all 20 States
                     Multivariate              64.04%              7.84%
                         LSTM
                     GAT-LSTM                  61.73%             7.57%
                         Model
Table 3.4: Results for state-level car sale prediction. We ran 50 times under the same settings
and calculated the average L2 −MAPE.
car sales in these states.
3.4.5.2    Graphical Analysis
Graphic results from the clickstream data also provide us with a straightforward direction of
transfer learning. Figure 3.12 shows the clustered Word2Vec embedded result, the targeted
model and correlated car model A shows the highest correlation, while car model B shows
the almost lowest correlation with the target car model. It should be noticed that the UMAP
plot doesn’t necessarily represent the exact distance between different car models. According
to the Word2Vec embedding results, car model A is the nearest cluster to the target car
model, while car model B is the furthest cluster. We compared the multivariate LSTM
model performance by initializing weights from the trained model of car model A and the
trained model of car model B. Table 3.5 shows the performance’s summary under different
weight initialization. Weight initialization from car model B shows a worse performance
even than random initialization. This indicates the trained car model B model leads to an
opposite direction for the target car model forecast. Meanwhile, car model A initialized
model performs better than the randomly initialized model. Therefore, the direction of the
clickstream data provides a more reliable way for transfer learning.
                                                48


                            Weight initialization       L2 -MAPE
                                   method
                           Random initialization           24.17%
                            Weight initialization          27.61%
                              from car model B
                            Weight initialization          23.12%
                              from car model A
Table 3.5: Performance comparison of multivariate LSTM model. Weight initialization are
derived from random generation, trained model of car model B and trained model of car
model A.
3.5      Discussion
We have 1) built an LSTM based demand forecasting model with the capacity to
provide accurate daily sale prediction on both national and local levels for each model;
2) implemented the word2vec model to extract customer micro behavior feature from
clickstream data; 3) Proposed a graph neural network method to capture the relationship
across different car models. Applying the clickstream data for future car sales is still at an
early stage. There are two possible directions to improve the performance of this network. 1)
Noise reduction for clickstream data and historical sale limited sample size. 2) Methodology
to combine the clickstream data and sale history.
                                              49


Chapter 4
Machine learning towards multimedia
data
In this chapter, we will talk about the understanding of the data in multimedia field. One
important application of the multimedia data is advertising. With the development of the
multimedia technology, rich media advertisement is more and more popular in business.
Since the rich media services needs the guarantee the fitness across different modalities, we
would like to build a model to analyze the correspondence between different modalities. In
this paper, we mainly includes 2 topics:
    • How to learn the correspondence across multiple modalities?
    • How to arrange the information flow within the model to achieve the best performance?
    To answer the first question, we introduced themes, an outside modality that helps to
project the embedding vectors for different modalities into the same projection space. Then
based on the new model, we propose a self-adapted training algorithm to let the model
optimize its architecture with limited parameter size.
                                             50


4.1       Themes informed audio-visual correspondence
          learning
Recently, the applications of short-term user-generated video (UGV) boom quickly, such as
Tik Tok, Youtube short-term videos, Snapchat and Kwai. In multimodal field Baltrušaitis
et al. (2018), an important application is audio-visual correspondence (AVC) learning, which
can tell whether or how well the audio and visual information matches in the video. It can
recommend audio or visual streams to users given the other modality that contributes to the
same target, evaluate the quality of the short-term video for pushing to users, and build a
better high-level representation of videos for other uses.
     Efforts have been made on the AVC learning. A general idea is finding a shared projective
space for multiple modalities. Given an modal embedding vector space V , and let P(V ) be a
projective space through a canonical map p : V →  − P(V ). We need to find a proper function
f , and take
                               Corr = f (P(Vaudio ), P(Vvisual ))
as an estimation of the correspondence of audio and visual information. However, most
previous works have limitations, mainly lying on two shortages: the task setting was simple,
such as matching background audio to single image Aytar et al. (2016); and the approaches
relied on simple assumptions that audio and visual information should be similar in the
projective space Arandjelovic and Zisserman (2017); Li and Kumar (2019); Zhu et al. (2021b).
These shortcomings may fail the systems on UGVs, where the video can convey more than one
themes by switching modality combinations, and the confusion between those combinations
may introduce too much variance into the model. Therefore, in complex AVC learning
                                              51


problems, we should also decide whether the correspondence measurement is proper with
respect to a certain theme. One example can illustrate this: the user may combine a series
of cheerful wedding ceremony photos and low-spirited style music to present the theme
“marriage is the tomb of love” – spoken by Giacomo Casanova.
     To model the complex relationship between visual and audio information, we introduce
the concept of themes and propose a Theme Informed AVC (Ti-AVC) learning algorithm.
Ti-AVC involves themes as an important auxiliary modality to learn the projective space as
following:
                           Corr = f (Ptheme (Vaudio ), Ptheme (Vvisual ))
To establish the themes-informed projective space, we propose that the matched audio and
visual information should follow two principles:
    1. both modalities should convey the same desired theme;
    2. there exists positive interactions between them when presenting the theme.
For the first principle, we designed a a novel framework to inject the theme information into
AVC learning. Since it is not clear how to represent the theme, we adopted the video tags
to direct model the theme indirectly in this project. For the second principle, we followed
conventional ideas and adopted a state-of-the-art framework to model the relationship.
     To evaluate our proposed framework, we collected 85432 UGVs from Kwai, a popular
short-term video app in China. All the collected videos are advertisement (ads) uploaded
by commercial advertisers. We will publish the dataset as the extension of KWAI-AD Chen
et al. (2020) dataset. In this dataset, our proposed approach gained 23.15% improvement in
accuracy AUC compared to a state-of-the-art AVC learning framework.
                                               52


    We summarize the contribution of this project below: 1) We introduced new principles for
the AVC learning task. 2) We proposed the first theme informed audio-visual correspondence
(Ti-AVC) framework which is suitable for UGVs.              It outperformed the state-of-the-
art baseline by 23.15% absolute difference, and its hidden values indicate the modality
information flow in AVC. 3) We published the first audio-visual dataset grouped by contents
based on short-term ads video.
4.1.1     Related Work
Researchers cast much attention on reciprocity between audio and visual information on
various tasks Tao and Busso (2020). Although transfer learning has been proposed to convey
information across modalities, how to model the correspondence between modalities is still
an open question.
    L3 net Arandjelovic and Zisserman (2017) was proposed to explicitly model AVC. It used
several sub-networks to perform inputs processing and modalities fusion. Relying on the max-
pooling layer in the fusion sub-network, the L3 net had a flexible framework that was able to
take sequential or single input. It showed state-of-the-art performance and a new perspective
to perform sound localization taskArandjelovic and Zisserman (2018); Wu et al. (2019b). In
audio-visual cross-modal embedding designs, the pre-trained L3 net is deployed as embedding
extractor Cramer et al. (2019); Chung et al. (2019). Verma et al. applied L3 net framework to
learn AVC based on the emotion from audio and visual streamsVerma et al. (2019). This work
was evaluated on a new released dataset that contained audio and visual emotion information.
To increase the correspondence, dual attention matching Wu et al. (2019b) added attention
to both audio and visual inputs to predict their event sequential localization relevance
between modals; elastic multi-way network Wang et al. (2019b) designed a loss function
                                               53


with the distance between samples and an anchor point to encourage correspondence; Tao
and Busso (2018b) relied on a bimodal recurrent neural network to learn the temporal
correspondence information in a data-driven fashion. Unsupervised methods such as video
audio correspondence were also investigated such as audio-visual deep clustering modelLu
et al. (2019). Most of the approaches focused on modeling the similarity between modalities
and showed decent performance. However, most of the approaches were only evaluated on
constrained dataset. AVC learning on unconstrained data is still a complicated and difficult
task. Zhu et al. (2021a); Baltrušaitis et al. (2018).
    There are several public available unconstrained audio-visual datasets such as UGVs
datasets, but none of them are suitable for the short-term videos case. Specifically, Youtube-
8M Abu-El-Haija et al. (2016) only, one of the most popular UGVs dataset, covers various
themes (i.e. tags), but its video quality is not controlled intentionally. Also, the video
duration in Youtube-8M exceeds the typical length for a short-term video. On the other
side, the Moments in Time Monfort et al. (2019) contains 1,000,000 3-second videos, which
is too short. Flickr-SoundNetAytar et al. (2016) is a unconstrained dataset, however it
only has single image with background sound track. Movienet and HVUDiba et al. (2020);
Huang et al. (2020) introduce some holistic datasets, but their audio-visual properties are
not significant. The shortage of good quality short-term UGVs inspired us to collect a new
data, whose details will be introduced later.
4.1.2     KWAI-AD-AudVis Dataset
In this study, we developed our framework on KWAI-AD-AudVis dataset. It constists of
85432 ads videos (around 913 hours) from the China popular short-term video app, Kwai.
The videos were made and uploaded by commercial advertisers. The reason to use the ads
                                               54


videos lied on two folds: 1) the source guarantees videos under control to some level, such as
high-resolution pictures and intentionally designed scene; 2) ad videos simulate audio-visual
matching style as manually composited by users in Kwai app. It can be seen as a quality
controlled UGVs dataset.
    In the KWAI-AD-AudVis dataset, each UGV/ad has a label for the industry category.
We estimate the number of clicks advertisers receive every time the ads come out as a criteria
during collection. In this dataset, half of the ads have a high rate to raise customers’ interests
in the products, and the other half has a relatively low attraction. The short videos have
been classified into 19 themes by uploaders with an average length of seconds. The audio
track had 2 channels (we mixed to mono channel in the study) and was sampled at 44.1
kHz, while the visual track had a resolution of 720 × 1280 and was sampled at 25 frames
per second (FPS). This dataset is an extension of the KWAI-AD corpus Chen et al. (2020).
It is not only suitable for tasks in the multimodal learning area, but also for ones in ads
recommendation.
    The details and data of KWAI-AD-AudVid can be accessed through Zenodo1 . It shows
that the ads videos have three main characteristics: 1) The videos may have very inconsistent
information in visual or audio streams. For example, the video may play a drama-like story
at first, and then present the product introduction, whose scenes are very different. 2)
The correspondence between audio and visual streams is not clear. For instance, similar
visual objects (e.g. talking salesman) come with very different audio streams. 3) The
relationship between audio and video varies in different industries. For example, games
or E-commerce ads will have very different styles. These characteristics make the dataset
suitable yet challenging for our study on AVC learning.
   1 https://zenodo.org/record/4023390#.X12Dr5NKgUE
                                                55


4.1.3      Proposed Approaches
Data and Feature
    In this study, we used KWAI-AD-AudVis dataset to develop our AVC learning framework.
To reduce the training workload, we used our in-house key-frame extractor to extract 8 frames
from each video to represent the visual information. Audio tracks were extracted as same as
in original videos. The visual and audio information are pre-processed through Mobilenetv2
Sandler et al. (2018) and VGGish Hershey et al. (2017). Embedding from top layers of the
two pre-trained was fed to our proposed system.
Themes Informed AVC learning System
    Figure 4.1 shows the diagram of our proposed approach, theme-informed audio-visual
correspondence (Ti-AVC) learning framework. It consisted of two parts, a theme-learning
(TL) model and a correspondence-learning CL model.
    For the TL model, we were inspired by L3 net and designed a similar network as L3 net,
except its task was theme prediction (in this study, it is ads industry category prediction).
It took audio and visual embedding as input. It consisted of three sub-networks. Two sub-
network processed the input of single modality separately, and the third one processed the
fused information. The audio sub-network had a time distributed dense layer, an LSTM layer
and a self-attention layer. Its output is a 128-D vector. The visual sub-network had a fully
connected layer, whose parameters were shared across different input frames. Its output was
a sequence of 8 128-D vectors. The output from the audio sub-network was repeated and
concatenated to each vector from the visual sub-network. The concatenated embedding was
fed into the fusion sub-network. The fusion sub-network had two 1-D convolutional neural
networks (CNN), a max-pooling layer and 2 fully connected (FC) layers to predict themes.
                                              56


Figure 4.1: Diagram of the proposed framework.
                      57


    Once the TL model was trained, we fixed it as an embedding extractor to extract three
types of information for the CL model: audio embedding from the top-layer of the audio sub-
network, visual embedding from the top-layer of the visual sub-network and the predicted
theme. We concatenated the predicted theme with the theme ground-truth to form the
theme information, which was injected into the CL model with audio and visual embedding.
By adding the true theme, we expect the CL model to learn how the input audio and
visual embedding performs in predicting the theme. This was following the principle (1) we
mentioned in the introduction section. In this study, the theme ground-truth corresponded
to the visual modality. The CL model has similar architecture as the fusion sub-network in
the TL model. We intended to use both of the theme prediction and ground-truth to tell
how the two modalities represent the desired theme. The CL model was expected to capture
two points: 1) how the desired theme was presented; 2) how the modalities related to each
other. These two points corresponded to the two principles we proposed at the beginning of
this section. The correspondence result was eventually predicted based on the two points.
4.1.4     Experiment and Analysis
Experiment Setup
    We used the original videos from the KWAI-AD-AudVis dataset as positive samples,
where we assumed audio and visual information matched with each other. Negative samples
were generated by pairing audio and visual tracks from different videos. We generated the
same number of negative samples as positive ones. The dataset was partitioned to 80%, 10%
and 10% for training, validation and testing respectively. We applied Adam as the optimizer
and set the learning rate as 0.0001, batch size as 8 in all experiments. We used ads industries
categories as theme information in this study.
                                               58


    We built two baselines for comparison. The first baseline (denoted as “baseline-1")
borrows the architecture from the themes learning model. To make a fair comparison, we
made two adjustments for correspondence learning: 1) we replaced the theme prediction task
by correspondence prediction; 2) we doubled the number of all trainable layers in the fusion
sub-network to guarantee the same parameter size as our proposed approach. The second
baseline (denoted as “baseline-2") had exactly the same architecture as baseline-1, except
we input theme ground-truth to the fusion sub-network concatenated with the modalities
embedding. This made the comparison fair since the system also got theme information
like the proposed approach. For the proposed approach, we made two training strategies
and therefore had two systems. We named the system that trained TL and CL models
separately as “Ti-AVC", while we named the one jointly trained (i.e. a multitask learning
system with TL and CL tasks) as“joint Ti-AVC". We kept all systems having the same
number of parameters.
Experiment Results
    The accuracy AUC score of each system is shown in Table 4.1. The baseline-1, which
had similar architecture to L3 net, had random-guess results (we had the same amount of
positive and negative samples). The baseline-2, which was the same as baseline-1 except it
took the theme as an extra input, could outperform baseline-1 by 18.94%. This verified our
hypothesis that theme information was necessary for AVC learning on UGVs. Both of our
proposed approaches beat the baselines (by at least 3.36% absolute difference) with the Ti-
AVC achieving the best performance. Since the TL and CL models were trained separately
in Ti-AVC, it indicates that properly injecting the information on how the audio and visual
modalities presented the desired theme could improve the performance of the correspondence
learning task. We would like to emphasize that the Ti-AVC is flexible in application. The TL
                                              59


                                      Model       Match AUC
                                   Baseline-1        55.58%
                                   Baseline-2        74.52%
                                  Joint Ti-AVC       77.88%
                                     Ti-AVC         78.73%
                          Table 4.1: Summary of experiment results.
model can be fixed as embedding extractor and the theme categories provided by CL model
can be obtained from either modality (in this study, we made it follow visual modality).
    We also performed evaluations within each theme category (shown in Figure 4.2), where
all the testing candidates were from the same category in AUC computation. Since all
the testing candidates had the same theme ground-truth, the evaluation was equivalent to
eliminating the information of theme ground-truth. The CL model can only obtain help
from the difference between the theme prediction and ground-truth. This difference can
represent "how the desired themes are presented" as we proposed in principle (1), so the
results can reflect the effectiveness of the principle (1) in the AVC task. We compare the
results with the baseline-1 (the horizontal line in Figure 4.2), which did not include theme
ground-truth during inference. The result shows that the Ti-AVC framework dominates
the baseline in 15 categories out of 19. Especially, we notice all the categories with most
samples outperformed the baseline. This result indicates that the proposed framework can
help improve the correspondence learning even without theme information, and justify the
proposed principle (1).
 Contribution Analysis
    To further verify the rationality of our proposed approach, we analyzed the contribution
of each input in the CL model. We use the information flow fed into the convolutional layers
to indicate the importance of each modal. As both positive and negative values have an
                                               60


Figure 4.2: AUC and sample counts per ADs category. The dark grey bar represents the
AUC, whose scale axis is on left; the light grey bar represents the number of samples, whose
scale axis is on right. The horizontal line is the baseline-1 accuracy AUC.
effective influence on the prediction, we set the absolute value of the input values as our
estimation statistic. Define the contribution in equation 4.1, where Wi is the weight of the
first layer connecting the ith input and Xi is the ith input, I is the input type (audio, visual,
predicted theme and true theme).
                                 ContributionI =       |WiI · XiI |
                                                   X
                                                                                            (4.1)
    Table 4.2 listed the computed proportion of the inputs for the matched pairs. It showed
that audio modalities have the most portion contribution (58.78%). The theme information
took up 10.38%, where the predicted and true ones were close (4.52% and 5.86%). This
result indicated both of them could not be neglected, which verified our proposed principles
                                                61


                     Vision    Audio  Predicted Themes      True Themes
                    30.85%    58.78%         4.52%              5.86%
Table 4.2: Modal contributions calculated from a batch of positive audio-visual pairs and a
batch of negative audio-visual pairs.
for AVC and the capability of the proposed approach.
4.1.5      Discussion
In this project, we proposed new principles in audio-visual correspondence learning on users
generated videos, which introduced theme information in AVC tasks. We proposed a new
framework to perform the AVC task under unconstrained scenarios. To evaluate the proposed
approach, we also collected and released the KWAI-AD-AudVis corpus, consisting of 85432
short-term videos (around 913 hours).
    Our proposed approach was able to outperform a state-of-the-art AVC framework by
23.15% in accuracy AUC. We also showed that the proposed approach could still outperform
the baseline even without theme information. Besides, the proposed framework would be
flexible in real application as the TL model can be fixed and the theme information can
correspond to either modality. This study only focused on learning correspondence between
audio and visual modalities by concatenating the embedding of the modalities. The future
work lies on a more sophisticated fusion strategy and further analyzing how the modality
correlates with each other.
                                             62


4.2        Self-organized short video advertisement evaluation
           system
As daily active users (DAU) of video sharing apps, e.g, Youtube, Snapchat and Kwai,
have rocketed in recent years, advertisers take advantage of this trend to promote their
products or services through user-generated videos (UGV)-based advertisements. Generally,
user behavior-related metrics, e.g., click-through rate (CTR)2 , 3-second play rate3 , are
employed to assess advertisement quality and performance.                     These two metrics can be
calculated as follows: CTR = number of clicks4 ÷ impressions5 ; and 3-second play rate
= number of plays (more than 3s) ÷ impressions. Recent researches Wang et al. (2019c,a)
on recommender system require user profiles, which are extracted from user browsing
history and user basic information, as essential model input to make precise prediction
on video CTR within each user account. However, in some application scenarios, such
as cold start and automatic generation of advertisements, where user-related information
cannot be obtained, advertisement publishers have to rely on video content to estimate
advertisement performance. Accordingly, methods for making precise prediction on UGV-
based advertisement performance without user profile are of great value.                      Multi-Modal
Machine Learning (MMML), which exploits signals of different modalities jointly, is able
to help in mentioned video-related tasks.
    Application of MMML tasks has been widely studied, including emotion recognition Tao
et al. (2018); Liu et al. (2018), object localization Arandjelovic and Zisserman (2018); Zhao
   2 CTR describes how many users become interested in product based on video content.
   3 3-second play rate describes how much the users are attracted by the content of the beginning three
seconds.
   4 The event that an user click the link that comes with advertisement.
   5 Impression here refers to the count of the event that a video is fetched from dataset and recommended
to users.
                                                      63


et al. (2018), speech recognition Tao and Busso (2018a); Afouras et al. (2018); Tao and Busso
(2018c), speech separation Wu et al. (2019a), voice activity detection Tao and Busso (2019,
2020) and etc. We notice that multi-modal fusion strategy plays a decisive role in multi-modal
tasks. Previous works have proposed sophisticated fusion methods and achieved remarkable
success. However, in our case, previous solutions have two limitations: 1) they mainly focus
on signal perception-related tasks, rather than user behaviors; 2) it is unclear how modalities
interact with each other. For example, in ASR task Tao and Busso (2018a), visual content
is taken as auxiliary information for audio content and therefore, audio modality is taken
as query information in attention model. However, in our case, we have no prior knowledge
about the relationship between input modalities and use behaviors. These shortages may
lead to difficulties of applying existing methods in computational advertising.
    This study focus on building an end-to-end system for predicting CTR and 3-second
play rate of UGV-based advertisements. The contribution of our work can be summarized
as follows: 1) To the best of our knowledge, our proposed system is the first work of
predicting CTR and 3-second play rate directly from video content (combining audio and
visual modalities), which is equivalent to predicting user behavior directly from raw signals.
2) We propose a self-organizing system that is able to learn the optimal topology of neural
network architecture. More specifically, it is a data-driven framework which can adjust
information flow by changing model architecture.
    We evaluated our proposed method on a video dataset which consists of 9841
advertisements videos collected from Kwai, a trending short video app worldwide. All videos
are uploaded by advertisers and contain unconstrained information. The experimental results
of CTR and 3-second play rate prediction reveal that the proposed method outperforms all
models for comparison.
                                              64


4.2.1      Related Work
To build the first framework for predicting user behavior from videos, we borrowed ideas from
recommender system and MMML studies. Recommender system relied on designated data
pre-processing to collect descriptive features. Google’s Wide-&-Deep model Cheng et al.
(2016), which has been widely deployed in industry, combined these features at different
levels within one neural network. Recently proposed AutoCTR framework Song et al. (2020)
explored the optimal model structure in a data-driven way. These ideas inspired us to design
a model that is able to process features collected from different levels. However, AutoCTR
and Wide-&-Deep model may not be applicable in our task, as they could not handle raw
signal inputs. Densenet Huang et al. (2017) essentially had similar strategy to Wide-&-
Deep model that it used cross-layer connection in image classification task. It merged
information from different levels in dense block, where one layer was directly connected
to all its subsequent layers.
    In MMML research, one straight-forward fusion method was to combine weighted
prediction results across all modalities, where the weight of each modality was determined
by its own performance on validation set. This method may fail in handling UGVs, whose
modalities had different significance across UGV topics. Another widely applied fusion
method was to concatenate the features extracted from different modalities into a joint
representation Noroozi et al. (2017); Afouras et al. (2018); Wu et al. (2019a) and then
the concatenated feature vectors could be processed by a classification/regression model.
Such simple method has shown its effectiveness in many video-related tasks introduced in
previous section. However, merging features into one vector did not provide enough flexibility
in dealing with unconstrained videos. In many other works, attention model Vaswani et al.
                                              65


(2017) has been considered Hu et al. (2019) to assign weights dynamically, where the strategy
could be learned through end-to-end training. Also signal from one modality could be utilized
as auxiliary information on other modalities Tao and Busso (2018c); Yu et al. (2020). These
methods made prior assumptions on relationship among all available modalities and set
constrains on the architectures of fusion models, which was not the case in our study. To
tackle the shortages mentioned above, we develop a self-organizing framework, which is able
to explore the optimal topology of neural network architecture through training data. Our
proposed method bridges these two research domains so that it is able to make prediction
on user behavior with original video input.
4.2.2     Dataset
In our study, we use our in-house dataset, containing advertisement play history within one
week. The advertisements have been grouped into 19 pre-defined categories by advertisers.
This video setting is same as KWAI-AD Chen et al. (2020) dataset. All audio tracks are
sampled 44.1kHz sampling rate and each audio track has two channels. We mix each track
into mono-channel in this study. The visual tracks have the resolution of 720 × 1280, a
typical vertical setup for mobile devices. All advertisements have the same frame per second
(FPS) of 25. Also, we summarize one-week performance for each advertisement, including
impression, CTR and 3-second play rate.
    However, CTR and 3-second play rate are lack of statistical significance without enough
impressions. Therefor, based on our experience, we set 70,000 as impression threshold, under
which advertisement samples have been discarded. After this step of filtering, a total of 9841
advertisements are collected. The total length of these advertisements is about 82 hours.
                                               66


                        Visual Input                  Audio Input      Category Input
                                                 FC Layer (128 Units)
                    FC Layer (128 Units)                              One-Hot Encoder
                                                    LSTM (30 Units)
          Visual sub-network                         Self Attention
                                         Audio sub-network
                         …
                                                         Fusion
                                                     Sub-network
                                                   Predicted Target
                               Figure 4.3: Overview of the proposed framework.
4.2.3      Proposed Approaches
The overview architecture of our system is shown in Figure 4.3. It consists of two parts:
single modality sub-networks (visual and audio) and fusion sub-network. In this study, we
train all these sub-networks jointly.
 Feature extraction
    In our study, CTR is related to the entire video, while 3-second play rate only corresponds
to the content in the beginning three seconds. Therefore, we prepare the feature extraction
separately for these two tasks. For CTR prediction, we utilize our in-house key-frame
extractor to extract 8 visual frames from each video and keep entire audio track. For 3-
second play rate prediction, we extract 3 visual frames only from the first three seconds,
one frame per second and only keep audio track of the first three seconds. Extracted visual
frames and audio tracks are then processed through Mobilenetv2 Sandler et al. (2018) and
VGGish Hershey et al. (2017) respectively to generate visual and audio inputs. Our primary
research showed that advertisement category also had impact on CTR prediction result and
                                                               67


thus, we introduce category into our case as an extra modality. Category information (we
have 19 advertisement categories) are processed by an one-hot encoder to generate category
embedding.
Single Modality Sub-networks
    Inputs of visual and audio modalities are processed by two sub-networks respectively. The
visual sub-network has a fully-connected (FC) layer, whose parameters are shared across all
frames. The output of visual sub-network is a sequence of 128-D embeddings. The audio
sub-network consists of a FC layer, an uni-directional LSTM layer and a self-attention layer.
The output of audio sub-network is a 128-D embeddings. Numbers of neurons in each layer
are shown in Figure 4.3. For each video, we have several frames for visual input, while we
have only one embedding for audio input and one embedding for category. Therefore, audio
embedding and category embedding are repeated to match the frame count in each video.
These collected embeddings are then sent to our proposed fusion model.
Fusion Sub-network
    In our study, we have two fusion approaches: baseline and self-organizing. For the baseline
approach, we adopt fusion model sub-network proposed in Ti-AVC network Su et al. (2020)
(Figure 4.4). It has two 1-D convolutional neural networks (CNN), one max-pooling layer
and one FC layer to predict targets. For our proposed approach, which we name as “self-
organizing" approach, we follow the following steps to learn fusion strategy from data: (1)
Modify the fusion sub-network in the baseline approach. We connect input embeddings
to the second CNN layer and the max-pooling layer, in addition to the first CNN layer.
The output of the first CNN layer is also connected to the max-pooling layer, as shown in
dashed border of Figure 4.5. It is equivalent to connecting the output of each layer to all
of its following layers in the dashed boundary. Therefore, we name it as “all-connected"
                                               68


      …                                            …
            1-D CNN (100 Filters)                       1-D CNN (100 Filters)
            1-D CNN (100 Filters)                       1-D CNN (100 Filters)
                Max-Pooling                                  Max-Pooling
            FC Layer (128 Units)                         FC Layer (128 Units)
              Predicted Target                            Predicted Target
          Figure 4.4: Baseline model.               Figure 4.5: All-connected model.
fusion sub-network. (2) Optimize all-connected fusion sub-network until there is no more
improvement in performance on validation set (shown in Figure 4.6). (3) Select the 5% of
connections with the lowest absolute values (shown in Figure 4.6). (4) Remove connections
selected in step (3) and fine-tune the sub-network (shown in Figure 4.6). (5) Repeat step
(3) and step (4) until parameters number reaches a pre-defined threshold. We employ the
parameter number of fusion model in baseline approach as our threshold in the experiments.
    The logic behind removing connections is that connections with low absolute values
indicate that they play less important role in forward propagation than others.      With
these connection removed, our model re-organizes information flow and learns the optimal
topology.   Therefore, we name our fusion model as self-organizing model.       The entire
procedure is data-driven and does not require manually defined rules. Our self-organizing
framework is flexible and sophisticated.
                                             69


Figure 4.6: Step (2), (3) and (4) of our self-learning approach. We simplify the diagrams for
illustration.
4.2.4      Experiment and Analysis
Experiment Setup
    We evaluated our proposed method on CTR prediction and 3-second play rate tasks. For
each task, two types of experiments were conducted: regression and classification. For the
classification experiment, we uniformly binned the data into five groups (every bin had same
count of sample in training set), based on the distribution of data on training set (shown in
Figure 4.7 and 4.8). Here, the 1-D output layer in regression model was replaced by a 5-D
softmax output layer.
    Two models, which have been introduced in Section 4.2.3, were built for comparison with
our proposed work. The first one was the fusion model in the baseline approach (named
as “Baseline"). The other model for comparison was the fusion model in all-connected
approach (named as “All-Connected"). It should be noted that “All-Connected" model
has more neuron connections than “Baseline" model with the same neuron number. To
make fair comparison, we trimmed its neuron number (51% of kernels in convolutional
layers and neurons in dense layer) to ensure it has same parameters number as “Baseline"
model. We also made the final parameters number of our proposed approach (named as
                                                70


“Self-Organizing") same as the “Baseline" model. In other words, all three models had the
same number of parameters.
    80% and 10% of the samples were randomly selected as our training set and validation set,
while the rest were used as our testing set. We adopted Adam as our optimizer. 0.0001 and 8
were chosen as learning rate and batch size respectively in all experiments. In all regression
experiments, mean squared error (MSE) and mean absolute error (MAE) were employed
as our evaluation metrics. To fairly compare all models’ performance on different tasks, in
addition to the two metrics above, we utilized ratio of MAE to Average ground truth (named
as “MAR" in our study) as our main evaluation metric. Mean absolute percentage error
(MAPE) was not used in our study, as it was more likely to be effected by outlier samples,
which had low MAE but introduced extremely high MAPE. In classification experiments,
the accuracy of all models were summarized.
Experiment Results and Analysis
    The results of all regression experiments are listed in Table 4.3 and 4.4.        In CTR
regression experiment, our proposed Self-Organizing model beats Baseline model and All-
Connected model by 0.9% and 6.0% respectively (absolute difference). In 3-second play
rate regression experiment, our proposed Self-Organizing model outperforms Baseline model
and All-Connected model by 0.5% and 0.3% respectively (absolute difference). We notice
that MAE of Self-Organizing model is 2.5% lower than Baseline model and 1.8% lower than
All-Connected model. As shown in Figure 4.7 and 4.8, the CTR distribution follows an
heavy-tail distribution, while 3-second play rate follows a normal distribution. In both types
of distribution, our proposed model achieves the best performance, indicating that it has
strong flexibility and generalization ability.
    Table 4.5 summarizes classification experiment results. In CTR classification experiment,
                                               71


                              Model                                MSE                              MAE                          MAR
                              Baseline                             1.87e−05                         2.26e−03                     35.5%
                              All-Connected                        2.00e−05                         2.58e−03                     40.6%
                              Self-Organizing                      1.77×10−5                        2.20×10−3                    34.6%
                                   Table 4.3: Summary of experimental results of CTR prediction.
                                              Model                           MSE                          MAE                MAR
                                              Baseline                        0.0152                       0.0744             17.6%
                                              All-Connected                   0.0148                       0.0738             17.4%
                                              Self-Organizing                 0.0143                       0.0725             17.1%
                     Table 4.4: Summary of experimental results of 3-second play rate prediction.
                                       CTR                                                 3-second Play Rate
                                       Model                                 Acc           Model               Acc
                                       Baseline                             61.3%          Baseline           61.4%
                                       All-Connected                        63.5%          All-Connected      61.5%
                                       Self-Organizing                      66.3%          Self-Organizing 62.8%
Table 4.5: Summary of experimental results of multi-class classification. “Acc" here refers
to accuracy.
our proposed model outperforms Baseline model and All-Connected model by 5.0% and
2.8% respectively (absolute difference). In 3-second play rate classification experiment,
our proposed model outperforms Baseline model and All-Connected model by 1.4% and
1.3% respectively (absolute difference). We note that classification is a task with coarser
granularity compared with regression. It shows that our proposed Self-Organizing model
outperforms other baseline models in all granularities.
               0.1                                                                                  0.02
              0.08
                                                                                                   0.015
  Frequency                                                                            Frequency
              0.06
                                                                                                    0.01
              0.04
              0.02                                                                                 0.005
                0                                                                                     0
                0.0005     0.012     0.0235    0.035      0.0465    0.058    0.0695                    0.005   0.13   0.255   0.38     0.505     0.63    0.755   0.88
                                              CTR Value                                                                       3-second Play Rate Value
                         Figure 4.7: CTR distribution.                                 Figure 4.8: 3-second play rate distribution.
                                                                                      72


4.2.5     Discussion
In this study, we propose a self-organizing approach, which can learn the optimal topology of
neural network architecture in a data-driven way. Unlike previous approaches, our proposed
method does not require prior knowledge or assumption about relationship among modalities.
It provides more flexibility in handling tasks related to UGVs, which contain complex and
complicated information. Also, our proposed method is able to predict CTR and 3-second
play rate directly from video inputs. Our experimental results reveal that our proposed
method successfully predict user behaviors and outperforms all other models for comparison.
                                              73


Chapter 5
Conclusion and future direction
In this paper we discussed the exploration and application of data with complex structures.
In bio-informatics field, we proposed a convolutional self-attention based model to capture the
hidden information within DNA sequences and motifs. This model also shows its potential
in the for biological interpretation. In the sales forecasting field, we proposed a word-to-
vector based data processing pipeline to convert the complicated online clickstream data into
vectors, then we ran experiments on different models. Besides, we further tested the potential
improvement of the multitask learning with graph attention network. In multimedia field,
our contribution includes the learning of single modality and the learning of across multiple
modalities. For single modality, we proposed an emsemble end-to-end spoken language
model to learn the information from spoken language audio. Then for multimodality, we
applied a themes informed audio video correspondence learning model, along with a self-
adapted learning algorithm to optimize the information flow within the model. In the
future, we have three possible directions. The first direction is the quantitative analysis of
different architectures. A complicated model is more likely to get overfitting during training,
which makes it hard to converge on noisy data. A quantitative analysis for the selection
of regularization model architecture will be very significant during application. Another
direction is multitask learning. There are lots of datasets with similar structure. When
the sample size is limited, transfer knowledge from a large dataset to a smaller dataset
                                              74


will be very helpful. Besides, how to separate the shared information and the task specific
information can further improve the model performance. The last direction is to capture
the information from multiple domains. Different from multitask learning, a very popular
topic now is combining the inputs from multiple domains and make them contribute to the
same target. Our researches have demonstrated that new modality can help us achieve a
better performance on multimedia services. Based on that, we may extend the contribution
to more modalities and more fields.
                                            75


BIBLIOGRAPHY
      76


                                   BIBLIOGRAPHY
Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., and
Vijayanarasimhan, S. (2016). Youtube-8m: A large-scale video classification benchmark.
arXiv preprint arXiv:1609.08675 .
Afouras, T., Chung, J. S., Senior, A., Vinyals, O., and Zisserman, A. (2018). Deep audio-
visual speech recognition. IEEE transactions on pattern analysis and machine intelligence.
Alipanahi, B., Delong, A., Weirauch, M. T., and Frey, B. J. (2015). Predicting the sequence
specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol , 33(8),
831–838.
Arandjelovic, R. and Zisserman, A. (2017). Look, listen and learn. In Proceedings of the
IEEE International Conference on Computer Vision, pages 609–617.
Arandjelovic, R. and Zisserman, A. (2018). Objects that sound. In Proceedings of the
European Conference on Computer Vision (ECCV), pages 435–451.
Aytar, Y., Vondrick, C., and Torralba, A. (2016). Soundnet: Learning sound representations
from unlabeled video. In Advances in neural information processing systems, pages 892–900.
Baltrušaitis, T., Ahuja, C., and Morency, L.-P. (2018). Multimodal machine learning: A
survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence,
41(2), 423–443.
Benveniste, D., Sonntag, H.-J., Sanguinetti, G., and Sproul, D. (2014). Transcription factor
binding predicts histone modifications in human cell lines. Proceedings of the National
Academy of Sciences, 111(37), 13367–13372.
Berkovec, J. (1985). New car sales and used car stocks: A model of the automobile market.
The Rand Journal of Economics, pages 195–214.
Brühl, B., Hülsmann, M., Borscheid, D., Friedrich, C. M., and Reith, D. (2009). A sales
forecast model for the german automobile market based on time series analysis and data
mining methods. In Industrial Conference on Data Mining, pages 146–160. Springer.
                                            77


Chase, C. W. (2013). Demand-driven forecasting: a structured approach to forecasting.
John Wiley & Sons.
Chen, H., Ding, G., Liu, X., Lin, Z., Liu, J., and Han, J. (2020). Imram: Iterative matching
with recurrent attention memory for cross-modal image-text retrieval. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12655–12663.
Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G.,
Corrado, G., Chai, W., Ispir, M., et al. (2016). Wide & deep learning for recommender
systems. In Proceedings of the 1st workshop on deep learning for recommender systems,
pages 7–10.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and
Bengio, Y. (2014). Learning phrase representations using rnn encoder-decoder for statistical
machine translation. arXiv preprint arXiv:1406.1078 .
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated
recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 .
Chung, S.-W., Chung, J. S., and Kang, H.-G. (2019). Perfect match: Improved cross-modal
embeddings for audio-visual synchronisation. In ICASSP 2019-2019 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3965–3969. IEEE.
Cox, A. (2018). 2018 Car Buyer Journey Study. Cox Automotive. Cox, Automotive.
Cramer, J., Wu, H.-H., Salamon, J., and Bello, J. P. (2019). Look, listen, and learn more:
Design choices for deep audio embeddings. In ICASSP 2019-2019 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3852–3856. IEEE.
Diba, A., Fayyaz, M., Sharma, V., Paluri, M., Gall, J., Stiefelhagen, R., and Van Gool,
L. (2020). Large scale holistic video understanding. In European Conference on Computer
Vision, pages 593–610. Springer.
Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward
neural networks. In Proceedings of the Thirteenth International Conference on Artificial
Intelligence and Statistics, volume 9, pages 249–256. PMLR.
Hershey, S., Chaudhuri, S., Ellis, D. P., Gemmeke, J. F., Jansen, A., Moore, R. C., Plakal,
M., Platt, D., Saurous, R. A., Seybold, B., et al. (2017). Cnn architectures for large-scale
                                              78


audio classification. In 2017 ieee international conference on acoustics, speech and signal
processing (ICASSP), pages 131–135. IEEE.
Hochreiter, S. and Schmidhuber, J. (1997a). Long short-term memory. Neural computation,
9(8), 1735–1780.
Hochreiter, S. and Schmidhuber, J. (1997b). Long short-term memory. Neural computation,
9(8), 1735–1780.
Hu, D., Nie, F., and Li, X. (2019). Deep multimodal clustering for unsupervised audiovisual
learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 9248–9257.
Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected
convolutional networks. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 4700–4708.
Huang, Q., Xiong, Y., Rao, A., Wang, J., and Lin, D. (2020). Movienet: A holistic dataset
for movie understanding. In Computer Vision–ECCV 2020: 16th European Conference,
Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16 , pages 709–727. Springer.
Huang, T. and Van Mieghem, J. A. (2014). Clickstream data and inventory management:
Model and empirical analysis. Production and Operations Management, 23(3), 333–347.
Kayapinar Kaya, S. and Yildirim, Ö. (2020). A prediction model for automobile sales in
turkey using deep neural networks. Journal of Industrial Engineering (Turkish Chamber of
Mechanical Engineers), 31(1).
Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. International
Conference on Learning Representations.
Kulmanov, M., Khan, M. A., Hoehndorf, R., and Wren, J. (2018). DeepGO: predicting
protein functions from sequence and interactions using a deep ontology-aware classifier.
Bioinformatics, 34(4), 660–668.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. nature, 521(7553), 436–444.
Lewandowski, R. (1974). Prognose-und informationssysteme und ihre anwendungen bd. 1.
Berlin ua.
                                             79


Li, B. and Kumar, A. (2019). Query by video: Cross-modal music retrieval. In ISMIR,
pages 604–611.
Liu, C., Tang, T., Lv, K., and Wang, M. (2018). Multi-feature based emotion recognition
for video clips. In Proceedings of the 20th ACM International Conference on Multimodal
Interaction, pages 630–634.
Lu, R., Duan, Z., and Zhang, C. (2019). Audio–visual deep clustering for speech separation.
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(11), 1697–1712.
Lu, W., Chen, S., Li, K., and Lakshmanan, L. V. (2014). Show me the money: dynamic
recommendations for revenue maximization. Proceedings of the VLDB Endowment, 7(14),
1785–1796.
Mikolov, T., Karafiát, M., Burget, L., Černockỳ, J., and Khudanpur, S. (2010). Recurrent
neural network based language model. In Eleventh annual conference of the international
speech communication association.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013). Distributed
representations of words and phrases and their compositionality.            arXiv preprint
arXiv:1310.4546 .
Monfort, M., Andonian, A., Zhou, B., Ramakrishnan, K., Bargal, S. A., Yan, T., Brown,
L., Fan, Q., Gutfreund, D., Vondrick, C., et al. (2019). Moments in time dataset: one
million videos for event understanding. IEEE transactions on pattern analysis and machine
intelligence, 42(2), 502–508.
Noroozi, F., Marjanovic, M., Njegus, A., Escalera, S., and Anbarjafari, G. (2017). Audio-
visual emotion recognition in video clips. IEEE Transactions on Affective Computing,
10(1), 60–75.
Quang, D. and Xie, X. (2016). Danq: a hybrid convolutional and recurrent deep neural
network for quantifying the function of dna sequences. Nucleic acids research, 44(11),
e107–e107.
Quang, D. and Xie, X. (2019). Factornet: a deep learning framework for predicting cell type
specific transcription factor binding from nucleotide-resolution sequential data. Methods,
166, 40–47.
                                            80


Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., and Smyth, G. K.
(2015). limma powers differential expression analyses for RNA-sequencing and microarray
studies. Nucleic Acids Res, 43(7), e47.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. (2018). Mobilenetv2:
Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 4510–4520.
Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE
transactions on Signal Processing, 45(11), 2673–2681.
Slattery, M., Zhou, T., Yang, L., Machado, A. C. D., Gordân, R., and Rohs, R. (2014).
Absence of a simple code: how transcription factors read the genome. Trends in biochemical
sciences, 39(9), 381–399.
Song, Q., Cheng, D., Zhou, H., Yang, J., Tian, Y., and Hu, X. (2020). Towards automated
neural interaction discovery for click-through rate prediction. In Proceedings of the 26th
ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages
945–955.
Su, R., Tao, F., Liu, X., Mei, H. W. X., Duan, Z., Yuan, L., Liu, J., and Xie, Y. (2020).
Themes inferred audio-visual correspondence learning. arXiv preprint arXiv:2009.06573 .
Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., and Liu, C. (2018). A survey on deep
transfer learning. CoRR, abs/1808.01974.
Tao, F. and Busso, C. (2018a). Aligning audiovisual features for audiovisual speech
recognition. In IEEE International Conference on Multimedia and Expo (ICME 2018),
pages 1–6, San Diego, CA, USA.
Tao, F. and Busso, C. (2018b). End-to-end audiovisual speech activity detection with
bimodal recurrent neural models. ArXiv e-prints (arXiv:1809.04553), pages 1–11.
Tao, F. and Busso, C. (2018c). Gating neural network for large vocabulary audiovisual
speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing,
26(7), 1286–1298.
Tao, F. and Busso, C. (2019). End-to-end audiovisual speech activity detection with
bimodal recurrent neural models. Speech Communication, 113, 25–35.
                                            81


Tao, F. and Busso, C. (2020). End-to-end audiovisual speech recognition system with
multitask learning. IEEE Transactions on Multimedia.
Tao, F., Liu, G., and Zhao, Q. (2018). An ensemble framework of voice-based emotion
recognition system for films and tv programs. In 2018 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pages 6209–6213. IEEE.
Uygun, S., Seddon, A. E., Azodi, C. B., and Shiu, S.-H. (2017). Predictive models of spatial
transcriptional response to high salinity. Plant Physiology, 174(1), 450–464.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L.,
and Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762 .
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. (2017).
Graph attention networks. arXiv preprint arXiv:1710.10903 .
Verma, G., Dhekane, E. G., and Guha, T. (2019). Learning affective correspondence
between music and image. In ICASSP 2019-2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pages 3975–3979. IEEE.
Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and computing, 17(4),
395–416.
Wang, J., Xu, Q., Wang, Q., Lyu, Z., Chen, J., and Xu, W. (2019a). Mmctr: A multi-
task model for short video ctr prediction with multi-modal video content features. In 2019
IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pages 679–
682. IEEE.
Wang, M., Tai, C., E, W., and Wei, L. (2018). Define: deep convolutional neural networks
accurately quantify intensities of transcription factor-dna binding and facilitate evaluation
of functional non-coding variants. Nucleic acids research, 46(11), e69–e69.
Wang, R., Huang, H., Zhang, X., Ma, J., and Zheng, A. (2019b). A novel distance learning
for elastic cross-modal audio-visual matching. In 2019 IEEE International Conference on
Multimedia & Expo Workshops (ICMEW), pages 300–305. IEEE.
Wang, X., Du, Y., Zhang, L., Li, X., Zhang, M., and Dong, J. (2019c). Exploring content-
based video relevance for video click-through rate prediction. In Proceedings of the 27th
ACM International Conference on Multimedia, pages 2602–2606.
                                             82


Wang, Z., Ma, Y., Liu, Z., and Tang, J. (2019d). R-transformer: Recurrent neural network
enhanced transformer. arXiv preprint arXiv:1907.05572 .
Weingarten-Gabbay, S. and Segal, E. (2014). The grammar of transcriptional regulation.
Hum Genet, 133(6), 701–711.
Weiss, K., Khoshgoftaar, T. M., and Wang, D. (2016). A survey of transfer learning.
Journal of Big data, 3(1), 1–40.
Whitaker, J. W., Chen, Z., and Wang, W. (2015). Predicting the human epigenome from
dna motifs. Nature methods, 12(3), 265.
Wilkins, O., Hafemeister, C., Plessis, A., Holloway-Phillips, M.-M., Pham, G. M., Nicotra,
A. B., Gregorio, G. B., Jagadish, S. K., Septiningsih, E. M., Bonneau, R., and Purugganan,
M. (2016). Egrins (environmental gene regulatory influence networks) in rice that function
in the response to water deficit, high temperature, and agricultural environments. The
Plant Cell , 28(10), 2365–2384.
Wu, J., Xu, Y., Zhang, S.-X., Chen, L.-W., Yu, M., Xie, L., and Yu, D. (2019a). Time
domain audio visual speech separation. In 2019 IEEE Automatic Speech Recognition and
Understanding Workshop (ASRU), pages 667–673. IEEE.
Wu, Y., Zhu, L., Yan, Y., and Yang, Y. (2019b). Dual attention matching for audio-
visual event localization. In Proceedings of the IEEE International Conference on Computer
Vision, pages 6292–6300.
Xu, C., Peak, D., and Prybutok, V. (2015). A customer value, satisfaction, and loyalty
perspective of mobile application recommendations. Decision Support Systems, 79, 171–
183.
Yu, J., Zhang, S.-X., Wu, J., Ghorbani, S., Wu, B., Kang, S., Liu, S., Liu, X., Meng,
H., and Yu, D. (2020). Audio-visual recognition of overlapped speech for the lrs2 dataset.
In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 6984–6988. IEEE.
Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., and Torralba, A.
(2018). The sound of pixels. In Proceedings of the European conference on computer vision
(ECCV), pages 570–586.
                                              83


Zhou, J. and Troyanskaya, O. G. (2015). Predicting effects of noncoding variants with deep
learning–based sequence model. Nature methods, 12(10), 931–934.
Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., and Sun, M.
(2020). Graph neural networks: A review of methods and applications. AI Open, 1, 57–81.
Zhou, M., Ding, Z., Tang, J., and Yin, D. (2018). Micro behaviors: A new perspective
in e-commerce recommender systems. In Proceedings of the eleventh ACM international
conference on web search and data mining, pages 727–735.
Zhu, H., Luo, M.-D., Wang, R., Zheng, A.-H., and He, R. (2021a). Deep audio-visual
learning: A survey. International Journal of Automation and Computing, pages 1–26.
Zhu, Y., Wu, Y., Latapie, H., Yang, Y., and Yan, Y. (2021b). Learning audio-
visual correlations from variational cross-modal generation. In ICASSP 2021-2021 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages
4300–4304. IEEE.
                                            84