ROBUST MULTI-TASK LEARNING ALGORITHMS FOR PREDICTIVE MODELING OF SPATIAL AND TEMPORAL DATA

By

Xi Liu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy

2019

ABSTRACT

ROBUST MULTI-TASK LEARNING ALGORITHMS FOR PREDICTIVE MODELING OF SPATIAL AND TEMPORAL DATA

By Xi Liu

Recent years have witnessed the significant growth of spatial and temporal data generated from various disciplines, including geophysical sciences, neuroscience, economics, criminology, and epidemiology. Such data have been extensively used to train spatial and temporal models that can make predictions either at multiple locations simultaneously or along multiple forecasting horizons (lead times). However, training an accurate prediction model in these domains can be challenging, especially when there are significant noise and missing values or limited training examples available. The goal of this thesis is to develop novel multi-task learning frameworks that can exploit the spatial and/or temporal dependencies of the data to ensure robust predictions in spite of the data quality and scarcity problems.

The first framework developed in this dissertation is designed for multi-task classification of time series data. Specifically, the prediction task here is to continuously classify activities of a human subject based on the multi-modal sensor data collected in a smart home environment. As the classes exhibit strong spatial and temporal dependencies, this makes it an ideal setting for applying a multi-task learning approach. Nevertheless, since the types of sensors deployed often vary from one room (location) to another, this introduces a structured missing value problem, in which blocks of sensor data could be missing when a subject moves from one room to another. To address this challenge, a probabilistic multi-task classification framework is developed to jointly model the activity recognition tasks from all the rooms, taking into account the block-missing value problem. The framework also learns the transitional dependencies between classes to improve its overall prediction accuracy.

The second framework is developed for the multi-location time series forecasting problem. Although multi-task learning has been successfully applied to many time series forecasting applications such as climate prediction, conventional approaches aim to minimize only the point-wise residual error of their predictions instead of considering how well their models fit the overall distribution of the response variable. As a result, their predicted distribution may not fully capture the true distribution of the data. In this thesis, a novel distribution-preserving multi-task learning framework is proposed for the multi-location time series forecasting problem. The framework uses a non-parametric density estimation approach to fit the distribution of the response variable and employs an L2-distance function to minimize the divergence between the predicted and true distributions.

The third framework proposed in this dissertation is for the multi-step-ahead (long-range) time series prediction problem with application to ensemble forecasting of sea surface temperature. Specifically, our goal is to effectively combine the forecasts generated by various numerical models at different lead times to obtain more precise predictions.
Towards this end, a multi-task deep learning framework based on a hierarchical LSTM architecture is proposed to jointly model the ensemble forecasts of different models, taking into account the temporal dependencies between forecasts at different lead times. Experiments performed on 29-year sea surface temperature data from the North American Multi-Model Ensemble (NMME) demonstrate that the proposed architecture significantly outperforms standard LSTM and other MTL approaches.

Copyright by XI LIU 2019

ACKNOWLEDGMENTS

The past six-plus years of PhD life at Michigan State University have been a long and unforgettable experience for me. Back in the summer of 2012, I was an undergraduate student with poor English, basic computer science knowledge, and a very unclear picture of what a researcher's daily life looks like. The past few years witnessed unprecedented progress in multiple fields of artificial intelligence, and I feel very lucky that I could be a small part of this era of AI. As this long journey finally reaches its finish line, I would like to acknowledge everyone who has provided me with help, support, and valuable suggestions.

First and foremost, I offer my most sincere gratitude to my advisor, Dr. Pang-Ning Tan. He is a knowledgeable and outstanding expert in his domain, and every time I got lost in a concept or formula, he would explain it to me very patiently, no matter how trivial my question was. He is also a very good mentor in research and was always willing to spend time with me discussing my research topics. When I got stuck in my research, he would enlighten me with helpful comments, inspire me with his broad knowledge, and encourage me to approach the problem from another direction. Besides, Dr. Tan pushed me to explore new topics and techniques and, at the same time, to learn new skills, which helped me keep up with the latest progress in my field. What is more, he has always been patient, nice, and humble. I shall never forget my first two or three years, when I had limited domain knowledge and research experience, and Dr. Tan helped me build up the picture of data mining step by step. It is my honor to study and work with such a great professor and educator.

I would also like to acknowledge the help from my project collaborator Dr. Lifeng Luo, who supported me with helpful domain knowledge and useful study resources. I also thank my other PhD committee members, Dr. Jiayu Zhou and Dr. Eric Torng, who provided invaluable comments on my dissertation.

I want to thank my lab mates and friends Prakash Mandayam Comar, Zubin Abraham, Lei Liu, Jianpeng Xu, Shuai Yuan, Courtland VanDam, Ding Wang, Farzan Masrour, Tyler Wilson, Boyang Liu, Zheyun Feng, Lei Xu, Lanbo She, Beibei Liu, Shaohua Yang, Qi Wang, Kaixiang Lin, Sari Saba-Sadiya, Haoyang Li, and Pouyan Hatami. It has been interesting and inspiring to talk with them, and with them, my PhD life has been joyful and a lot of fun.

Finally, I want to offer my greatest thanks to my family. Without their support and love, I would not have accomplished this long journey to the PhD degree. I want to thank my beloved parents for all the unconditional love and support they gave me. It is their countless phone calls that warmed and encouraged me during each phase of my study. I also want to thank my dear husband for accompanying me through each obstacle that I faced and for sharing with me all the happy things in life.
This dissertation is partially supported by the National Science Foundation under grant IIS-1615612.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1 INTRODUCTION
1.1 Spatial and Temporal Data
1.2 Applications of Multi-task Learning to Spatial and Temporal Data
1.2.1 Multi-modal Time Series Classification at Multiple Locations
1.2.2 Multi-location Time Series Forecasting
1.2.3 Multi-step-ahead Time Series Prediction
1.3 Research Challenges
1.4 Thesis Contributions
1.5 Other Research Contributions
1.6 Related Publications
1.7 Thesis Outline

CHAPTER 2 LITERATURE REVIEW
2.1 Spatial and Temporal Data Mining
2.2 Multi-task Learning
2.2.1 MTL Based on Encoding Graph Structures
2.2.2 MTL with Low Dimensional Subspace
2.2.3 MTL with Incomplete Multi-source Data
2.2.4 MTL in Deep Learning
2.2.5 MTL for Multi-label Learning
2.3 Summary

CHAPTER 3 SOFT MULTI-TASK CLASSIFICATION FOR ACTIVITY RECOGNITION FROM MULTI-MODAL SENSOR DATA
3.1 Preliminaries
3.1.1 Sensory Data Used for Human Activity Recognition
3.1.2 Feature Extraction
3.1.3 Annotation Confidence Level
3.1.4 Temporal Transitional Dependency
3.2 Methodology
3.2.1 Multi-Class Learning with Softmax Regression
3.2.2 Proposed Method: STARS
3.2.3 Optimization
3.3 Experimental Evaluation
3.3.1 Baseline Algorithms
3.3.2 Experimental Results
3.4 Related Work
3.5 Summary

CHAPTER 4 DISTRIBUTION PRESERVING MULTI-TASK REGRESSION FOR SPATIO-TEMPORAL DATA
4.1 Preliminaries
4.1.1 Density Estimation
4.1.2 Divergence Measures
4.1.2.1 RMS-CDF
4.1.2.2 L2 Distance
4.2 Proposed Framework
4.2.1 Divergence Measurement
4.2.2 DPMTL: Distribution-Preserving MTL Framework
4.2.3 Optimization
4.2.4 Algorithm
4.3 Experimental Evaluation
4.3.1 Data and Preprocessing
4.3.2 Experimental Setup
4.3.3 Experimental Results
4.4 Related Work
4.5 Summary

CHAPTER 5 MULTI-TASK HIERARCHICAL LSTM FRAMEWORK FOR MULTI-STEP-AHEAD TIME SERIES FORECASTING
5.1 Ensemble Forecasting of Sea Surface Temperature
5.2 Related Work
5.2.1 Multi-task Learning in DNN
5.2.2 Hierarchical LSTM
5.3 Preliminaries
5.3.1 Problem Statement
5.3.2 Long Short-Term Memory (LSTM) Network
5.4 Proposed MSH-LSTM Architecture
5.4.1 Lead Time Encoder Layer
5.4.2 Generation Time Encoder Layer
5.4.3 Output Layer
5.4.4 Parameter Estimation
5.5 Experimental Results
5.5.1 Data Set
5.5.2 Baseline Algorithms
5.5.3 Evaluation Metric
5.5.4 Experimental Settings
5.5.5 Results and Discussion
5.6 Summary

CHAPTER 6 CONCLUSIONS & FUTURE WORK
6.1 Summary of Thesis Contributions
6.2 Future Research Directions
BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: List of human activity classes from the Sphere challenge data [122].
Table 3.2: 2D/3D camera features summary.
Table 3.3: Illustration of activities distribution.
Table 3.4: Weighted Brier scores for various competing algorithms.
Table 4.1: Summary of notations used in the chapter.
Table 4.2: Predictor variables from NCEP reanalysis.
Table 5.1: Physical models from NMME used for monthly sea surface temperature prediction.
Table 5.2: Partitioning of SST data into multiple training, validation, and test splits.
Table 5.3: Comparison of RMSE values among the competing methods for all 9 forecast lead times.
Table 5.4: A win-loss table comparing the performance of the competing methods across all 58 grid cells. Each (i, j)-th entry in the table represents the fraction of grid cells in which method i has lower RMSE than method j.

LIST OF FIGURES

Figure 1.1: Monthly maximum temperature observed at weather stations around the world in 1970. The data was obtained from the Global Historical Climatology Network (GHCN) [85].
Figure 1.2: A snapshot of the trajectories and activities of a human subject from the benchmark Sphere challenge dataset [122] (figure is best viewed in color).
Figure 1.3: Sea Surface Temperature (SST) ensemble members vs. actual SST observations, from the North American Multi-Model Ensemble (NMME) [62].
Figure 1.4: A snippet of a custom input categorization.
Figure 1.5: AUC comparison for classes with different number of apps.
Figure 2.1: Hard parameter sharing or layer transfer in multi-task DNN.
Figure 2.2: Soft parameter sharing or conservative training in multi-task DNN.
Figure 3.1: Percentage of time data from an accelerometer and RGB-D camera are available for each human activity. The list of activities is shown in Table 3.1.
Figure 3.2: A segment of ground truth activities.
Figure 3.3: Distribution of the maximum acceleration for each activity class. (The line in the middle of each box is the sample median; the tops and bottoms of each "box" are the 25th and 75th percentiles of the samples, respectively; the whiskers are lines extending above and below each box; observations beyond the whisker length are marked as outliers.)
Figure 3.4: Time domain and frequency domain of activities in acceleration data from [122]. For each subplot: time domain on the left; frequency domain on the right.
Figure 3.5: The given coordinates of an example bounding box of the RGB-D camera data from [122]. 2D bounding box on the left, 3D bounding box on the right. (tl: top left; br: bottom right; flt: front left top; brb: back right bottom.)
Figure 3.6: The illustration of multi-task learning in STARS.
Figure 3.7: The estimated transition matrices $F^1_{r::}$ (left) and $F^2_{r::}$ (right) for the living room. The ordering of the classes on the horizontal and vertical axes is the same.
Figure 4.1: Comparison between the predictions of non-distribution preserving (Model 1) and distribution preserving (Model 2) methods in terms of their root mean squared errors and cumulative distribution functions.
Figure 4.2: Performance comparison between DPMTL and baseline approaches in terms of RMSE and RMS-CDF when varying the tradeoff parameter β between 0 and 1.
Figure 4.3: Percentage of stations in which DPMTL outperforms the baseline methods (for β = 0).
Figure 4.4: Comparison between the RMSE of GSpartan and DPMTL for β = 0 (figure best viewed in color).
Figure 4.5: Comparison between the RMS-CDF of GSpartan and DPMTL for β = 0 (figure best viewed in color).
Figure 4.6: Histogram comparison of precipitation distribution for a weather station located at [38.25°N, 82.99°W].
Figure 5.1: A two-level autocorrelation structure in multi-lead time forecasting of the SST data shown in Fig. 1.3.
Figure 5.2: Proposed hierarchical LSTM architecture for multi-step-ahead time series forecasting. Blocks with different shades of colors (in the generation time and output layers) are trained independently and have different parameters, while those with the same color (lead time encoder layer) are trained jointly and have identical parameters.
Figure 5.3: Performance on each grid cell by EnS and MSH-LSTM.
Figure 5.4: Comparison between the temporal autocorrelation of the proposed MSH-LSTM framework and other baseline methods.
Figure 5.5: Correlogram plots for lead-time level autocorrelation of MSH-LSTM and other methods (including the ground truth SST time series).
Figure 5.6: Gradient distribution of each model for different lead time tasks. The x-axis represents the indices of the physical models that are listed in Table 5.1. The box plots show the gradient distribution of each model for MSH-LSTM. The red curve shows the computed ridge regression coefficients.

LIST OF ALGORITHMS

Algorithm 1: STARS Framework
Algorithm 2: DPMTL: Distribution Preserving Multi-task Learning
Algorithm 3: Training process for MSH-LSTM: Multi-task Hierarchical LSTM

CHAPTER 1

INTRODUCTION

Advances in data mining and machine learning have led to the development of sophisticated models for solving complex prediction tasks in various application domains, from healthcare to autonomous driving. A common strategy for solving such complex learning tasks is to decompose the overall prediction problem into smaller sub-tasks that can be solved in a more efficient and tractable manner.
For example, in autonomous driving, the sub-tasks include identifying obstacles in front of the vehicle, tracking the movement of other vehicles in its surroundings, recognizing street signs, and detecting lane departures. Training a global model that can be applied to all the sub-tasks will likely lead to inferior model performance due to the inherent differences among the sub-tasks. A more effective strategy would be to train a separate prediction ("local") model that can discern the relationship between the predictor and response variables for each sub-task.

Formally, let $X = \{X_1, X_2, \cdots, X_T\}$ be the set of predictor variables for each of the $T$ prediction sub-tasks, where $X_i \in \mathbb{R}^{n_i \times d}$, and $Y = \{y_1, y_2, \cdots, y_T\}$ be the corresponding set of response variables, where $y_i \in \mathbb{R}^{n_i}$. For such a multi-task prediction problem, our goal is to learn a distinct model, $f_t: \mathbb{R}^d \rightarrow \mathbb{R}$, that maps the predictor variables of each sub-task $t$ to their corresponding response value. The model for each sub-task $f_t(w_t)$ is assumed to be characterized by its model parameter $w_t$, which can be estimated during training by optimizing the following objective function:

$$\{w_1^*, w_2^*, \cdots, w_T^*\} = \arg\min_{\{w_t\}} \sum_{t=1}^{T} L_t\Big(y_t, f_t(X_t; w_t)\Big) \quad (1.1)$$

where $L_t(\cdot)$ is the loss function for sub-task $t$ and $\{w_1^*, w_2^*, \cdots, w_T^*\}$ are the learned parameters. Each sub-task can be solved separately as the loss functions are independent of each other. This approach is also known as single-task learning (STL). In principle, STL would work well if there is a sufficient amount of training data $(X_t, y_t)$ available for each sub-task. However, since acquiring labeled data can be expensive, the performance of STL can still be poor as its induced models are highly susceptible to overfitting when there are limited training data.

To overcome the limitation of STL, multi-task learning (MTL) approaches have been proposed [20, 148, 147]. The model parameters for MTL are generally solved by optimizing the following joint objective function:

$$\{w_1^*, w_2^*, \cdots, w_T^*\} = \arg\min_{\{w_t\}} \sum_{t=1}^{T} L_t\Big(y_t, f_t(X_t; w_t)\Big) + \Omega(w_1, \ldots, w_t, \ldots, w_T) \quad (1.2)$$

where $\Omega(w_1, \ldots, w_t, \ldots, w_T)$ is a regularization term that controls the dependencies among the parameters. The regularization enables MTL to leverage domain-specific knowledge from other related sub-tasks to prevent each model from overfitting its training data, thereby improving its generalization performance [20]. The success of MTL has been well-documented in many application domains including computer vision [47, 41], natural language processing [80, 114], and medical informatics [101, 15].
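To make the contrast between Eqs. (1.1) and (1.2) concrete, the following sketch evaluates both objectives for a toy regression problem with a squared-error loss. The pairwise quadratic penalty used for $\Omega$, the task-similarity matrix A, and the toy dimensions are illustrative assumptions made only for this example; they are not the specific formulations developed in later chapters.

```python
import numpy as np

def stl_objective(W, X, y):
    """Eq. (1.1): independent squared-error losses, one per sub-task."""
    return sum(np.sum((y[t] - X[t] @ W[:, t]) ** 2) for t in range(len(X)))

def mtl_objective(W, X, y, A, lam=0.1):
    """Eq. (1.2): the same losses plus a coupling term Omega(w_1, ..., w_T).

    Here Omega penalizes disagreement between the parameter vectors of tasks
    that are deemed similar (A[s, t] is an assumed task-similarity weight);
    this is only one illustrative choice of regularizer.
    """
    loss = stl_objective(W, X, y)
    T = len(X)
    omega = sum(A[s, t] * np.sum((W[:, s] - W[:, t]) ** 2)
                for s in range(T) for t in range(T))
    return loss + lam * omega

# Toy setup: T = 3 tasks, d = 4 features, 20 examples per task.
rng = np.random.default_rng(0)
T, d = 3, 4
X = [rng.normal(size=(20, d)) for _ in range(T)]
y = [x @ rng.normal(size=d) + 0.1 * rng.normal(size=20) for x in X]
W = rng.normal(size=(d, T))          # one parameter vector per task (columns)
A = np.ones((T, T)) - np.eye(T)      # assume all task pairs are equally related
print(stl_objective(W, X, y), mtl_objective(W, X, y, A))
```

Setting lam to zero recovers the independent single-task objective in Eq. (1.1).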
In this thesis, I will focus on the development of MTL approaches for spatial and temporal data. Specifically, several challenging problems from various application domains are investigated and novel frameworks are proposed to overcome these challenges. I will first present an overview of spatial and temporal data as well as its applications in the next two sections before summarizing the contributions of this thesis.

1.1 Spatial and Temporal Data

Spatial and temporal data are observations that contain measurements of geographic location and time information [10, 102]. Such data are pervasive across many application domains, including climate and environmental sciences [57, 85, 56, 62, 125, 96, 133, 77], neuroscience [34, 109, 131, 43, 29, 31], health sciences [83, 89, 106], social sciences [23, 51], transportation studies [71, 88], and criminology [129, 110].

One important characteristic of the data is their non-i.i.d. (independent and identically distributed) property. The non-independence property arises due to the inherent autocorrelation of their measurements along the space and/or time dimensions. In particular, the presence of strong spatial autocorrelation implies that observations at nearby locations should be similar to each other [66], while temporal autocorrelation refers to the non-random association between a pair of observations measured at nearby times. For example, Fig. 1.1 shows the variability of the monthly maximum temperature at weather stations around the world for a one-year period in 1970.

Figure 1.1: Monthly maximum temperature observed at weather stations around the world in 1970 (panels (a) through (l): January through December 1970; color scale: maximum temperature in Celsius). The data was obtained from the Global Historical Climatology Network (GHCN) [85].

Due to its high spatial and temporal autocorrelation, the monthly maximum temperature appears to change smoothly along its spatial and temporal dimensions, as illustrated in the figure. From a predictive modeling perspective, failure to account for the spatial and/or temporal autocorrelation of the response variable may lead to suboptimal local models, as the predicted values may not exhibit the desired autocorrelation properties when the observations are treated independently during training. As evidenced by many previous studies, incorporating such autocorrelation into the learning framework would indeed improve model performance [39, 139, 136].

Though the non-independence property of the data suggests homogeneity of the models, its non-identically distributed property suggests there should still be notable differences in the models due to the spatial heterogeneity of the data. An example illustrating the opposing forces of spatial autocorrelation and spatial heterogeneity is shown in Fig. 1.1. Though the maximum temperatures at two nearby locations are similar, there are still significant differences between the observations in the northern and southern hemispheres in a given month. From a predictive modeling perspective, the non-identically distributed property suggests that a one-size-fits-all approach using a global model to fit all the training observations is not a viable strategy, as the model fails to account for the spatial differences of the observations. Instead, the modeling approach should incorporate local features that can help explain the variability observed in the data, both along the spatial and temporal dimensions.

The non-i.i.d. property thus provides a strong motivation for using MTL for predictive modeling of spatial and temporal data.
Instead of building a single (global) model, MTL addresses the non-identically distributed (heterogeneity) problem by training a separate model for each sub-task (which could be a location or a forecast lead time). MTL also alleviates the limitation of STL in terms of handling non-independent observations by allowing the models to incorporate the spatial and/or temporal autocorrelations as regularization terms in its formulation, as given in Eq. (1.2).

1.2 Applications of Multi-task Learning to Spatial and Temporal Data

In this section, I will briefly describe several spatial and temporal prediction problems in which MTL can be used, along with their respective applications. These applications would serve as case studies for evaluating the MTL frameworks developed in this dissertation.

1.2.1 Multi-modal Time Series Classification at Multiple Locations

The first MTL problem investigated in this thesis involves classification of multi-modal time series at multiple locations. An example application of such a problem is identifying the daily activities of a human subject from the multi-modal sensor data collected in a smart home environment [122, 76, 72, 73]. We use a benchmark user activity dataset from [122] for this study.

Figure 1.2: A snapshot of the trajectories and activities of a human subject from the benchmark Sphere challenge dataset [122], for (a) the living room and (b) the hallway (figure is best viewed in color).

Fig. 1.2 shows an example of the trajectories recorded by a tri-axial accelerometer worn by a human subject who moved around the kitchen and hallway areas of a smart home. Each colored dot represents a specific activity (walking, standing, sitting, lying down, etc.) performed by the human subject. In addition to the accelerometer sensor, RGB-D cameras and environment sensors were also installed in some of the rooms in the smart home. While the trajectory data is available everywhere from the accelerometer worn by the subject, video data from RGB-D cameras are only available in a few of the rooms (e.g., in the kitchen and living room but not in the bedroom or bathroom). This introduces blocks of missing values in the video data as the subject moves from a room with an RGB-D camera to another room without such a camera. The modeling approach must therefore be able to account for the varying types of features available in different rooms. In addition, since the layout of each room is different, some activities are more likely to be performed in certain rooms than others. For example, the activities "ascending" or "descending" stairs are more likely to occur in the hallway than in the kitchen or bedroom. Due to such spatial constraints, it makes more sense to develop a local model for activity recognition in each room instead of fitting a global model for all the rooms. However, due to the noisy nature of the data and the limited training samples available in some rooms, the local models are susceptible to overfitting. Multi-task learning thus provides a promising approach to address such a multi-location time series classification problem.

1.2.2 Multi-location Time Series Forecasting

Another common prediction problem involving spatial and temporal data is multi-location time series forecasting, where each location is affiliated with a time series whose future values are to be predicted. Example applications of such a problem include climate, disease incidence, and crime rate prediction.
For example, in climate prediction, our objective is to predict future climate conditions at various locations based on the historical climate observations at each location as well as other auxiliary information such as local topology, vegetation, or simulated outputs from global and regional climate models. Fig. 1.1 shows an example of the monthly maximum temperature measurements in the year 1970 for more than 70,000 weather stations. The data was obtained from the Global Historical Climatology Network (GHCN) database [85]. As previously noted, the climate data exhibit strong spatial and temporal autocorrelation, which makes it natural to apply MTL approaches to exploit such dependencies and train the prediction models at multiple locations jointly.

1.2.3 Multi-step-ahead Time Series Prediction

The third MTL problem investigated in this thesis is multi-step-ahead (i.e., long-range) time series prediction. Unlike the previous problem, which focuses only on the prediction for the next immediate future time step, the goal here is to predict the values for multiple future time steps. Each future time step is called a lead time, while the maximum lead time is known as the forecasting horizon. The multi-step-ahead time series prediction problem will be investigated in the context of its application to ensemble forecasting of sea surface temperature (SST), which is an important task due to the strong influence of ocean temperature on global climate conditions.

While it is possible to apply single-step time series forecasting methods to a multi-step forecasting setting, such an approach typically requires using the predicted value of one time step to infer the value for the next time step. This would lead to an error propagation problem, in which the error can quickly become unacceptably high even at short-range forecasting horizons. To overcome this problem, ensemble forecasting uses a set of forecasts generated from computer-generated (physical) models to project the possible future scenarios [62]. As the computer models were developed based on the physical laws of the underlying domain, their outputs are likely to be more consistent with the true observations even for long-range forecasting horizons. For example, Fig. 1.3 shows the multi-step-ahead monthly SST forecasts generated by a set of ensemble members obtained from the North American Multi-Model Ensemble (NMME) [62]. Each blue curve represents the 8-month forecasts generated by an ensemble member, while the red curve represents the true SST values for the 8-month forecasting horizon. In the NMME dataset, a new set of multi-step-ahead forecasts is generated by the ensemble members every month. For example, Fig. 1.3(a) shows the 8-month-ahead monthly average SST forecasts generated by 80 ensemble members on June 1st, 2010 for the months of June 2010 until February 2011. The next set of multi-step-ahead ensemble member forecasts was generated on July 1st, 2010 and is shown in Fig. 1.3(b). As the red line is encapsulated within the envelope of blue curves, the plot suggests that the ensemble members are capable of capturing the range of forecast uncertainties of SST even at an 8-month forecasting horizon. Nevertheless, the ensemble member forecasts still need to be aggregated to obtain a point prediction for each lead time. This can be achieved by applying regression techniques to learn the mapping from the ensemble member forecasts into a point-wise prediction.
However, since the skills of the ensemble members may vary from one lead time to another, it may not be wise to train only a single regression model for aggregating the ensemble member forecasts at all lead times. Instead of training a separate model for each lead time independently, MTL provides a promising approach for this problem by exploiting the temporal autocorrelation of the SST values at different lead times.

Figure 1.3: Sea Surface Temperature (SST) ensemble members vs. actual SST observations, from the North American Multi-Model Ensemble (NMME) [62]. Panels (a) through (f) correspond to forecasts generated on Jun 1, Jul 1, Aug 1, Sep 1, Oct 1, and Nov 1, 2010, respectively.

1.3 Research Challenges

This section presents the research challenges associated with the problems and applications described in Section 1.2.

• MTL for Multi-modal Time Series Classification at Multiple Locations. There are several challenges that must be addressed when applying MTL to the multi-modal time series classification problem described in Section 1.2.1. First, the MTL approach must account for the temporal dependencies between the classes. For the human activity recognition application, some transitions between activities are more or less likely to occur than others. For example, a subject often "bends" before "jumping" and rarely "sits" immediately after "jumping". Incorporating such temporal dependencies into the modeling framework may potentially enhance the prediction results. Although such dependence relationships can be acquired from domain knowledge, they may not be complete nor exact enough to help recognize the activities of individual human subjects. Instead of using a pre-defined relationship, it is better to extract the temporal relationships between the classes directly from the data. In addition to the temporal dependencies, the classes may have spatial relationships as well. For example, Fig. 1.2 shows that the classes "ascend" and "descend" stairs are more prevalent in the hallway than in the living room. It would be advantageous to have a modeling framework that can account for such spatial dependencies in the data. Finally, as previously noted, due to the varying sensor data available in different rooms, addressing the block missing value problem is another challenge that should be addressed by the modeling framework.

• MTL for Multi-location Time-series Forecasting. As mentioned in Section 1.2.2, the multi-location time series forecasting problem requires building prediction models for different locations in the data. Previous research [136, 139] has mostly focused on applying MTL to jointly train the models for different locations in order to maximize their overall prediction accuracies. However, in many applications such as climate modeling, preserving the true distribution of the data is just as important as maximizing model accuracy [5], as the predicted distribution can provide useful information for planning, risk assessment, and other decision making purposes. For example, knowing the future distribution of temperature and precipitation can help climate scientists to better anticipate the severity and frequency of adverse weather events for climate impact assessment studies. In agricultural production, the predicted distribution can be used to derive statistics such as the average length of the future growing season or the persistence of wet and dry spells, which are important metrics for farmers and agricultural researchers.
However, achieving both high accuracy and preserving the distribution fit at each location is a challenge that has not been addressed by existing MTL frameworks.

• MTL for Multi-step-ahead Time Series Prediction. The multi-step-ahead time series prediction problem described in Section 1.2.3 requires building a prediction model for each forecast lead time. However, the prediction error tends to grow as the lead time increases, as illustrated by the SST ensemble forecasting task shown in Fig. 1.3. The plots show that the variance of the ensemble member forecasts generally increases with longer lead times. MTL can help alleviate this problem by leveraging information from the shorter lead time tasks to regularize the predicted values for the longer lead time tasks. However, the challenge here is how to determine the relationship between the shorter and longer lead time tasks. Previous works such as [135] assume there is a predefined graph structure in terms of the task relationship between different lead times. In fact, many of the previous MTL frameworks [148] are based on some pre-defined assumption about the task relationships, e.g., using a model correlation matrix [136], a graph Laplacian structure [135, 40], a low-rank structure [24, 139, 134], or a model sparsity structure [8]. While these assumptions are mostly designed for general learning problems, their effectiveness for the multi-step-ahead time series prediction problem remains unclear. In particular, the task relationship could be nonlinear and thus needs to be inferred from the data. Learning the appropriate task relationship for multi-step-ahead time series prediction is a challenge to be addressed in this dissertation.

1.4 Thesis Contributions

• Chapter 3: Multi-task Learning on Multi-modal Sensor Data for Time Series Classification. To address the first challenge described in Section 1.3, Chapter 3 presents a probabilistic multi-task learning framework for multi-modal time series classification. The framework learns the pair-wise temporal dependencies between the classes and incorporates such dependencies into its formulation to enhance the activity recognition performance of each classifier. It employs a softmax classifier, in which the model parameters for each class are learned jointly at multiple locations. Furthermore, to address the varying feature types at different locations, the framework decomposes its feature set into two parts: a common feature set (for all locations) and a location-specific feature set. While the parameter values for the common feature set are learned jointly across all locations, the location-specific ones are learned independently for each location. This strategy enables the proposed MTL framework to address the block-missing value problem.

• Chapter 4: Distribution Preserving Multi-Task Regression for Multi-location Time Series Forecasting. To address the challenge described in the previous section for multi-location time series forecasting, Chapter 4 presents a novel distribution preserving MTL framework for spatio-temporal data. The proposed framework is unique in that it integrates both distribution and point-wise data fitting in a unified learning formulation. A non-parametric kernel density estimation approach is employed to fit the marginal distribution of the response variable, along with an L2-distance measure used to estimate the divergence between the predicted and true distributions.
Parameter sharing between models trained at different locations is enforced through their low-rank structure along with a graph Laplacian regularizer based on the Haversine distance between locations. The effectiveness of the proposed approach is then demonstrated through extensive experiments using a 45-year precipitation dataset for more than 1500 weather stations in the United States.

• Chapter 5: Multi-task Hierarchical LSTM Structure for Multi-step-ahead Time Series Prediction. To address the third challenge described in the previous section, Chapter 5 presents a multi-task deep learning architecture for the multi-step-ahead ensemble forecasting problem. The architecture considers the prediction for each lead time as a separate learning task and employs a novel two-layer hierarchical LSTM structure to learn a nonlinear relationship between the tasks. The first LSTM layer of the hierarchy learns a latent representation for each lead time task, taking into account their temporal dependencies. This enables the proposed framework to leverage information from shorter-term lead time tasks to improve the prediction for longer-term tasks. The second layer of the hierarchy learns a feature representation for the generation times of the forecasts. Specifically, given a specific lead time, it assumes that the hidden representations of the forecast generation times are related in a sequential way using an LSTM. This allows the architecture to capture the temporal autocorrelation between forecasts generated at different times. The entire architecture is trained in an end-to-end fashion and evaluated on an ensemble of monthly sea surface temperature data for a large study region in the Pacific Ocean.

1.5 Other Research Contributions

In addition to my research contributions to the development of MTL frameworks for spatial and temporal data in this thesis, I have also developed a multi-label learning framework that incorporates taxonomy-type relations for categorizing mobile apps. With the proliferation of smart devices and app markets, there is a pressing need to develop automated techniques for app categorization. While existing app markets such as Google Play and the Apple Store do provide their own list of app categories, there are several issues with the existing market categorizations.

1. Limited granularity. The explosive growth of mobile apps has resulted in a huge number of apps per category, rendering the task of searching within a category laborious and time consuming. The granularity of the current mobile app categorization is also too coarse to effectively distinguish between apps assigned to the same category.

2. Lack of objectivity. The mobile apps in online app stores are typically classified manually based on subjective judgment and may not agree with the actual use of the apps.

3. Limited expressiveness. As pointed out in [70], the hard, exclusive labeling results in a large number of multi-category apps missing from their appropriate categories. For example, the Instagram app [2] is found only in the social networking category but not in the photography category, even though it has functionalities related to both categories.

To overcome these limitations, I propose a novel approach in [74] to automatically label apps with a richer and more flexible categorization. In order to label apps with a finer-grained ontology than the original categories the app market provides, a detailed categorization is leveraged from an application domain that closely resembles that of mobile apps (e.g., the Google Ad Preference ontology [1]).

Figure 1.4: A snippet of a custom input categorization.

Unlike the flat structure of the original market categorization, the target categorization structure is hierarchical, as illustrated in Fig. 1.4. The proposed framework uses a semi-supervised Non-negative Matrix Factorization (NMF) approach to classify the apps. The framework takes into account various factors, such as the class labels associated with the sampled training set apps, the feature vectors of unlabeled apps, as well as the side information encoding the affinity relationship between input categories. All of this information is integrated into a unified learning framework that automatically generates (i) the predicted labels for previously uncategorized apps, (ii) the important features characterizing different classes, and (iii) a modified inter-class similarity matrix that better fits the specific characteristics of mobile apps. The proposed semi-supervised NMF framework is designed to minimize the following objective function:

$$\min_{Y_u, L, B} \; \|Y_l - X_l W\|_F^2 + \|Y_u - X_u W\|_F^2 + \beta \|L\|_F^2 + \gamma \|B - P\|_F^2 \quad \text{s.t.} \; W = LB, \; B \geq 0, \; L \geq 0, \; Y_u \geq 0 \quad (1.3)$$

where the subscripts $l$ and $u$ denote the labeled and unlabeled data sets; $n_l$ and $n_u$ denote the number of labeled and unlabeled examples, respectively. $Y_l \in \mathbb{R}_+^{n_l \times k}$ and $Y_u \in \mathbb{R}_+^{n_u \times k}$ are the class indicator matrices, where $k$ is the number of classes. $X_l \in \mathbb{R}_+^{n_l \times d}$ and $X_u \in \mathbb{R}_+^{n_u \times d}$ are the feature matrices associated with the labeled and unlabeled data, respectively. To help guide the app categorization with custom ground truth categories, the NMF framework can be modified to utilize side information about the classes in the input categories. Specifically, the side information can be represented as a $k \times k$ class similarity matrix $P$. Each entry $P_{ij}$ denotes the similarity between classes $i$ and $j$, which is obtained by checking their sibling relationship (i.e., whether $i$ and $j$ share a common parent in the Google Ad Tree). Formally,

$$P_{ij} = \begin{cases} \dfrac{1}{\#\text{ of levels}} & \text{if } i \text{ and } j \text{ are siblings,} \\ 0 & \text{otherwise.} \end{cases} \quad (1.4)$$

The performance of the proposed framework was evaluated on 1,065 apps from Google Play. Each app is characterized by a tf-idf feature vector of length 7,745. The classification performance for each class is measured by the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) [119]. Using the right y-axis of Figure 1.5, a histogram of the 49 Google Ad categories representing the number of apps in each category is plotted (in black). Using the left y-axis of Figure 1.5, two curves are plotted: the AUC of the proposed approach for each class (in red) and the AUC of logistic regression (in blue). For large classes with more than 50 apps, both the proposed approach and logistic regression yield similarly good AUC performance. However, for smaller classes with fewer than 50 apps, the semi-supervised NMF clearly outperforms logistic regression.
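As an illustration of the formulation above, the sketch below builds a small sibling-based similarity matrix in the spirit of Eq. (1.4) and evaluates the objective in Eq. (1.3) for randomly initialized factors. The toy taxonomy, the matrix sizes, and the assumed shapes of the factors (L is d-by-k, B is k-by-k, with W = LB) are illustrative assumptions made for this example, not the actual data or settings used in [74].

```python
import numpy as np

def sibling_similarity(parent, n_levels):
    """Build P as in Eq. (1.4): 1/(# of levels) for sibling classes, 0 otherwise."""
    k = len(parent)
    P = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i != j and parent[i] == parent[j]:
                P[i, j] = 1.0 / n_levels
    return P

def nmf_objective(Yl, Yu, Xl, Xu, L, B, P, beta=0.1, gamma=0.1):
    """Value of the objective in Eq. (1.3) with W = LB (Frobenius norms)."""
    W = L @ B
    fit = (np.linalg.norm(Yl - Xl @ W) ** 2 +
           np.linalg.norm(Yu - Xu @ W) ** 2)
    return fit + beta * np.linalg.norm(L) ** 2 + gamma * np.linalg.norm(B - P) ** 2

rng = np.random.default_rng(1)
k, d, nl, nu = 4, 10, 30, 50
parent = [0, 0, 1, 1]                    # toy taxonomy: classes 0,1 and 2,3 are siblings
P = sibling_similarity(parent, n_levels=2)
Xl, Xu = rng.random((nl, d)), rng.random((nu, d))  # nonnegative (tf-idf-like) features
Yl = (rng.random((nl, k)) > 0.7).astype(float)     # labeled class-indicator matrix
Yu = rng.random((nu, k))                           # unlabeled soft labels (to be learned)
L, B = rng.random((d, k)), rng.random((k, k))      # nonnegative factors with W = LB
print(nmf_objective(Yl, Yu, Xl, Xu, L, B, P))
```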
Figure 1.5: AUC comparison for classes with different number of apps.

1.6 Related Publications

Some chapters in this dissertation are based partially on the following publications. Chapter 3 is based on three papers entitled "STARS: Soft Multi-Task Learning for Activity Recognition from Multi-Modal Sensor Data" [76], "Human daily activity recognition for healthcare using wearable and visual sensing data" [72], and "Location-based Hierarchical Approach for Activity Recognition with Multi-modal Sensor Data" [73]. Chapter 4 is from "Distribution Preserving Multi-Task Regression for Spatio-Temporal Data" [75], which appeared in the Proceedings of the 2018 IEEE International Conference on Data Mining. Chapter 5 is based on the materials from a paper that is currently under review for the 2019 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. My other publications include "Macro-scale mobile app market analysis using customized hierarchical categorization" [74], "MUSCAT: Multi-Scale Spatio-Temporal Learning with Application to Climate Modeling" [134], and "WISDOM: Weighted incremental spatio-temporal multi-task learning via tensor decomposition" [139].

1.7 Thesis Outline

The rest of this thesis is organized as follows. Chapter 2 presents the background and previous literature related to my thesis topic. Chapter 3 describes my research on multi-modal time series classification for human activity recognition, while Chapter 4 introduces the distribution preserving regression framework for multi-location climate forecasting. Chapter 5 presents the MTL framework for multi-step-ahead ensemble forecasting of sea surface temperature. Finally, conclusions and future research directions are discussed in Chapter 6.

CHAPTER 2

LITERATURE REVIEW

2.1 Spatial and Temporal Data Mining

Spatial and temporal data refer to observations containing variables that vary over space and time [10, 102]. Due to the non-i.i.d. property of spatial and temporal data, novel spatial and temporal data mining techniques have been developed to discover interesting patterns and models from such data [107], which include:

• Spatio-temporal outliers [116, 108], which correspond to observations whose non-spatial-temporal features are significantly different from the rest of the data set. An example application of this approach is to predict extreme climate events from precipitation data [132].

• Spatio-temporal couplings [42, 90], which correspond to multiple events that occur in close spatial or temporal proximity. For example, studying the spatio-temporal coupling of traffic helps with better road planning.

• Spatio-temporal partitioning or clustering [63, 104], which corresponds to groups of observations that are similar to each other along the spatial or temporal dimensions. Spatio-temporal hotspot detection is a special case of clustering, where a high intensity of instances is observed in a certain region or time period.

• Spatio-temporal change footprints [150], which correspond to changes in the data distribution or underlying model over space or time. For example, in an industrial process, a change in the statistical index of sensors may indicate a system fault, and a rise in reported infections may indicate an epidemic outbreak.

• Spatial and temporal prediction, which corresponds to the task of inferring the response values either at previously unknown locations or for future time steps. Example applications include predicting crime rates or future climate conditions at a given location.

This thesis will mainly focus on spatial and temporal prediction problems. Conventional predictive modeling techniques (e.g., logistic regression, ridge regression, decision trees, and so on) usually assume the data to be i.i.d., which makes them unsuitable for spatial and temporal data. Therefore, new techniques have been developed to build connections between consecutive locations or time steps. Some examples are summarized in the following categories:

• Incorporating temporal dependencies and predicting over time. Techniques like the autoregressive integrated moving average (ARIMA) model [81] are specifically designed for time series forecasts. Besides, some approaches designed for sequential data, such as conditional random fields [60] and recurrent neural networks [100], have also been demonstrated to be effective for prediction on time series.

• Incorporating spatial dependencies and predicting over space. Techniques like spatial autoregressive regression (SAR) [58] and kriging [91] are developed to make predictions at unobserved locations.

In sum, on the one hand, due to the non-independence property, it is unwise to build an independent local model for each spatial/temporal unit, as Tobler's first law of geography states: "Everything is related to everything else, but near things are more related than distant things" [121]. On the other hand, due to the non-identical property, it is unreasonable to build a single global model and expect it to fit all spatial/temporal units. Since a local model ignores generalization while a global model lacks uniqueness, it is suggested to develop new approaches that can incorporate data dependencies over space/time and at the same time preserve distinct spatial/temporal attributes.

2.2 Multi-task Learning

Multi-task learning (MTL) is a machine learning technique which jointly deals with multiple related modeling tasks by incorporating the correlations/dependencies among tasks [20, 148]. This approach is useful as many complex prediction problems can be decomposed into multiple learning tasks. Building a single global model for all tasks may not be effective because the global model is too general to fit each task well. In contrast, building a local model independently for each task would require a sufficiently large training set for each task to avoid overfitting. Multi-task learning addresses this problem by leveraging domain-specific information to enable the pooling of information across different tasks in order to improve model generalization [105].

MTL is employed under two scenarios: (1) predictive modeling for multiple homogeneous objects and (2) learning with heterogeneous auxiliary tasks [105, 69]. The latter appears in many studies in computer vision [35, 146, 38] and natural language processing [27, 84, 11], where heterogeneous tasks are processed. For example, in [146], facial landmark detection is enhanced with the help of correlated auxiliary tasks such as gender classification, head pose classification, age estimation, and facial expression recognition. In [27], a weight-sharing structure is built to jointly learn several language processing tasks. In this thesis, I mainly focus on the former case, and a few existing MTL structures are reviewed in the following.

2.2.1 MTL Based on Encoding Graph Structures

In many applications, the pair-wise relations between $N$ tasks can be pre-defined with an $N \times N$ similarity matrix. Different assumptions are made to quantify the relatedness between tasks.
For example, in multi-location geospatial predictions, it is reasonable to assume that locations that are geographically closer or demographically similar share more similarities in models than those far away from each other [99]. One of the popular way to embed the tasks similarities into an objective function is to use graph Laplacian regularizer. Graph Laplacian is commonly used in spectral clustering [127] to optimize graph partitioning. The spectral property of a graph is examined with the eigenvectors of Laplacian matrix, with the property that Laplacian matrix is positive semidefinite [119]. This property also makes Laplacian matrix a good way for smoothness. For example, in [94], the graph regularizer is used for image denoising by assuming the pixel patches are smooth with respect to a pre-defined 18 graph. The semidefinite property of Laplacian matrix is borrowed by MTL approaches to build con- nections between model parameters from different learning tasks. Treating the model of each task as a single node in the graph, and embedding the pair-wise similarities A ∈ (cid:60)N×N as weighted edges, the graph Laplacian minimizes the weighted Euclidean distance of linear coefficients wi from different models [148], in order to make a pair of individual learning tasks with higher proximity to have closer coefficients, and vice versa. min W s.t. L(W) + λTr[WTLW] N = L(W) + Ai, j||wi − wj||2 2 i, j λ 2 L = D − A, W = [w1, w2, ...wN] Laplacian matrix can be generated with both graph structure matrix and similarity matrix. In [148, 68], the Laplacian matrix is generated directly from the topology of graph. In [40], a structure matrix is defined to make each model approaching the mean model. In [149, 135], each time point in the time series is treated as a task, and models of neighboring tasks are built to be smooth. In [136], each location is treated as a task, and the inverse of a modified variogram is used to measure similarities between tasks. 2.2.2 MTL with Low Dimensional Subspace Another assumption made for relations among tasks is that individual models share a low dimen- sional subspace. Related works can be categorized in two ways. The first way uses traces norm as regularizer to minimize the common rank of linear coefficients [55, 9] from different learning tasks. Many variations of this approach emerges based on its basic formulation [24, 25]. L(W) + λ||W||∗ min W s.t. W = [w1, w2, ...wN] 19 The second way employs matrix factorization. It assumes that linear models wi, i = 1, 2, ...N from different learning tasks share a set of base models, and each individual model is a linear combination of these base models [138, 136] as the following. L(W) + λ1Ω1(U) + λ2Ω2(V) min W s.t. W = [w1, w2, ...wN] V = [v1, v2, ...vN] W = UTV or wi = UTvi where wi ∈ (cid:60)d×1 is the linear parameter vector for individual task i. U ∈ (cid:60)d×k represents the k shared base models, and vi ∈ (cid:60)k×1 describes the coefficients for linear combination of the base models. Since d > k, the matrix decomposition of W makes models from different learning tasks to share a common low rank space. 2.2.3 MTL with Incomplete Multi-source Data A special case of MTL comes when the predictors of different tasks come from both common sources and different sources. A typical example is multi-modal sensor data [122, 145] when different sensors are not available all the time, leading to block-missing problem. It is unwise to discard the incomplete data instances or interpolating missing values. 
2.2.3 MTL with Incomplete Multi-source Data

A special case of MTL arises when the predictors of different tasks come partly from common sources and partly from different sources. A typical example is multi-modal sensor data [122, 145], where different sensors are not available all the time, leading to a block-missing value problem. It is unwise to discard the incomplete data instances or to interpolate the missing values. In [145], the data are partitioned into multiple groups according to the data sources available for each data instance, and an individual model is built for each group. The model parameters corresponding to a given source are learned jointly and are assumed to share a common sparsity pattern via an L2,1 norm.

2.2.4 MTL in Deep Learning

The two most popular multi-task learning strategies in neural networks are hard parameter sharing and soft parameter sharing of hidden layers [105]. In the hard parameter sharing case, the parameters of a number of layers are shared by the different learning tasks, while the remaining layers are task-specific [21]. An illustration is shown in Fig. 2.1.

Figure 2.1: Hard parameter sharing or layer transfer in multi-task DNN.

Figure 2.2: Soft parameter sharing or conservative training in multi-task DNN.

For example, in the speech recognition problem in [53], recognition in each language is treated as a learning task, and DNN-based frameworks are developed under the assumption that the learning tasks for different languages share common hidden layers and differ only in the softmax layer. How shared parameters help a DNN avoid overfitting has been discussed in much of the previous literature. Which layers should be shared is a case-by-case decision. For example, in speech recognition, the last few layers are usually shared because they are believed to be independent of the speakers, whereas in image recognition the first few layers are shared because they capture the most basic and commonly seen patterns such as lines and curves [4]. In comparison, soft parameter sharing (or conservative training), illustrated in Fig. 2.2, is inspired by the regularization techniques of traditional MTL approaches and constrains the parameters learned for different tasks to be related [37, 143].
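As a concrete illustration of hard parameter sharing, the sketch below uses PyTorch to share a trunk of hidden layers across two tasks while keeping one output head per task. The layer sizes, the two-task setup, and the random training data are assumptions made for illustration, not the architectures of [53] or [21].

```python
import torch
import torch.nn as nn

class HardSharingMTL(nn.Module):
    """Shared hidden layers (hard parameter sharing) with task-specific heads."""

    def __init__(self, in_dim, hidden_dim, task_out_dims):
        super().__init__()
        # layers shared by every task
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # one output head per task (e.g., one softmax layer per language)
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, out_dim) for out_dim in task_out_dims]
        )

    def forward(self, x, task_id):
        return self.heads[task_id](self.shared(x))

# toy usage: two classification tasks trained jointly through the shared trunk
model = HardSharingMTL(in_dim=16, hidden_dim=32, task_out_dims=[5, 3])
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for step in range(100):
    opt.zero_grad()
    loss = 0.0
    for task_id, n_classes in enumerate([5, 3]):
        x = torch.randn(8, 16)
        y = torch.randint(0, n_classes, (8,))
        loss = loss + loss_fn(model(x, task_id), y)
    loss.backward()
    opt.step()
```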
2.2.5 MTL for Multi-label Learning

Multi-label learning is a popular topic in the data mining field. Instead of labeling each data instance with only one category, multi-label learning assigns multiple categories. Each labeling task results in an individual binary classifier that determines whether a data instance has a given label or not. One straightforward solution is to learn each labeling task independently. However, in many cases there is insufficient training data for each label due to the expensive and time-consuming manual labeling process, and label-independent learning will most likely result in a serious overfitting problem [79]. It is therefore beneficial to extend MTL to multi-label learning. Instead of treating each labeling task as an independent binary classification problem and ignoring the interdependencies between labels, the individual labeling problems are treated as related learning tasks [33]. Multi-label learning is considered to have much in common with MTL, since it is a type of structured output prediction [32] that shares the hypothesis space among models [113]. The idea of MTL has been employed to solve multi-label learning in many applications, such as multi-label image classification [79, 54] and face attribute recognition [41]. There are various types of relations among labels. For example, in multi-label activity recognition, different activities are sequentially dependent, and it is essential to learn these relations from data to improve recognition performance [76]. Another common label dependency structure is a taxonomy, such as the Google Ad ontology [1]. Single-task learning in this setting ignores the taxonomy structure and fails when the label distribution is highly imbalanced. To apply MTL to multi-label learning with a pre-defined label taxonomy, a representation of the taxonomy must be formulated and incorporated into the model. However, this approach breaks down when the pre-defined taxonomy is not designed to fit the data well. For example, in the mobile app categorization problem in [74], the existing flat-structured app market categorization is limited by its lack of granularity and expressiveness, while finer-grained hierarchical categorizations borrowed from related domains cannot guarantee coverage of all concepts and aspects of mobile apps. To address these challenges, [74] builds a customized hierarchical multi-task learning framework for multi-label categorization, using mobile app categorization as an example. It leverages a fine-grained taxonomy as a guide, labels the mobile apps with finer categories, and at the same time induces a customized taxonomy.

2.3 Summary

In this chapter, the non-i.i.d. property of spatial and temporal data was discussed, and the motivation for employing multi-task learning in spatial and temporal data mining was presented. Several state-of-the-art MTL approaches were then surveyed. However, these existing approaches are still not sufficient to address all the challenges of spatial and temporal data mentioned in Section 1.3. In this thesis, I investigate several challenging problems specific to the domain of spatial and temporal data and develop a framework for each target problem.

CHAPTER 3
SOFT MULTI-TASK CLASSIFICATION FOR ACTIVITY RECOGNITION FROM MULTI-MODAL SENSOR DATA

Rapid advances in the development of inexpensive, low-power, wireless sensing technology have enabled the ubiquitous deployment of sensors in a smart home environment to support various applications, from personal safety and security to water conservation and energy management. Real-time data generated from the myriad of sensors in the smart home provide a unique opportunity for monitoring daily living activities and alerting the residents or the authorities if any unusual activities are detected. The ability to accurately recognize human activities from the multi-modal sensor data is essential to support such applications. However, classifying human activities from smart home sensor data is not a trivial task for several reasons. First, the sensor data are often noisy and thus require substantial preprocessing to extract discriminative features for the classification task. Second, the data are heterogeneous and may vary depending on the type of sensors deployed for monitoring the user activities. For example, wearable sensors such as accelerometers generate data continuously at all times, unlike other sensors such as motion detectors and surveillance cameras, which may only be available in certain rooms. Fig. 3.1 shows the percentage of time in which data from two sensors, an accelerometer and a surveillance camera, are available for each human activity in the smart home dataset investigated in this study. The results suggest that the accelerometer data are available at all times for most of the classes (human activities), whereas the surveillance camera data have a more imbalanced and irregular distribution because they are affected by the user's location as well as by the rooms in which the cameras are deployed.
Thus, one of the key challenges is to develop a modeling approach that can handle multi-modal sensor data whose availability varies from one location to another depending on the sensor placement. Furthermore, the modeling approach must consider the imbalanced class distribution in different rooms, since some activities could be restricted to certain locations only (e.g., one is more likely to lie down in a bedroom or living room than in a kitchen).

Figure 3.1: Percentage of time data from (a) an accelerometer and (b) an RGB-D camera are available for each human activity. The list of activities is shown in Table 3.1.

The activities performed by each user can be represented by a sequence of actions, where the transition from one action to another proceeds in a continuous fashion. Since the data are collected and annotated at discrete time periods, some activities could be interleaved within the same time period (e.g., walking and turning at the same time, or going from a standing posture to a bending and eventually kneeling position). For example, Fig. 3.2 shows a 30-second segment of user activity from the labeled data used in this study. Since more than one activity may be performed within each second, each class label (human activity) is associated with a confidence score, represented by its gray scale color. One of the goals of this study is to develop a modeling approach that can leverage the soft labels to determine the probability that an activity is performed at a given time period. The temporal dependency between activities is another factor that must be taken into consideration. For example, the lie-to-sit transition activity typically occurs between the lie and sit postures, whereas we do not expect the sequences to contain transitions from lie to jump activities. How to effectively quantify these temporal dependencies and incorporate them into the modeling framework is another challenge that needs to be addressed. Although such constraints can be pre-defined from domain knowledge, they may vary depending on the dataset used. Instead of encoding them as hard constraints, my goal is to infer the temporal dependencies automatically from the data.

Figure 3.2: A segment of ground truth activities.

To address these challenges, this thesis presents a soft classification approach for activity recognition in a smart home environment. The approach employs a softmax classifier to predict user activities in a sequence based on the available multi-modal sensor data. Training a global softmax classifier is not effective since some features (e.g., surveillance camera data) are only available in certain rooms; imputing their missing values may introduce errors into the model, while discarding the data with incomplete features may lead to suboptimal models. Conversely, training a local model for each room is also not the answer, due to the limited training data available for some rooms and the large number of classes involved. To overcome this limitation, the proposed approach allows the local models for all the rooms to be jointly trained, taking into account the relationships between the local models and the varying types of features available. Specifically, the framework enables the model for predicting, say, the walk activity in one room to be related to the same activity in another room even if their features are not identical.
This is accomplished by decomposing the weight matrix associated with the prediction of each class into a set of low-rank latent factors, where the decomposition is performed only on the features common to all rooms. Using a real-world multi-modal sensor dataset [122] as the case study, I show that the proposed framework is more effective than other sequential and non-sequential classification algorithms, including multinomial logistic regression and conditional random fields.

Table 3.1: List of human activity classes from the Sphere challenge data [122]: ascend, descend, jump, loadwalk, walk, bent, kneel, lie, sit, squat, stand, stand-to-bend, kneel-to-stand, lie-to-sit, sit-to-lie, sit-to-stand, stand-to-kneel, stand-to-sit, bend-to-stand, turn.

3.1 Preliminaries

Consider a multi-modal sensor dataset, $D = \{D_1, D_2, \cdots, D_R\}$, where each $D_r = (X^r, Y^r)$ is the training set for room $r$. Each $X^r \in \mathbb{R}^{N_r \times d_r}$ is the data matrix derived for room $r$, where $N_r$ is the number of training examples available and $d_r$ is the number of features. For notational convenience, we denote $X_{i:}$ as the $i$-th row of a matrix $X$ and $X_{:j}$ as its $j$-th column. The sensor data considered in this study [122] include 3-d acceleration features generated by a portable triaxial accelerometer worn by the subject, RGB-D camera data, and location data from passive infrared (PIR) sensors and received signal strength indication (RSSI) recorded by access points located in different rooms. The raw sensor data are preprocessed to extract various features (e.g., kurtosis, frequency, and entropy of the accelerometer time series, and bounding box information about subjects from the RGB-D camera data) associated with the human activities measured at every 1-second interval. I apply the feature extraction and preprocessing methods described in one of my previous publications [72]. Let $Y^r \in [0, 1]^{N_r \times K}$ be the class membership matrix for all $N_r$ observations in room $r$, where $Y^r_{ik} \in [0, 1]$ denotes the confidence score for the $i$-th training instance in room $r$ belonging to the $k$-th class. There are altogether 20 classes in this dataset, which are divided into 3 groups: (1) active motions (a), which include activities such as ascending or descending stairs, jumping, and walking; (2) stationary postures (p), which include bending, sitting down, and standing; and (3) transition movements (t), which include stand-to-bend, lie-to-sit, sit-to-lie, and stand-to-kneel. The complete list of classes is shown in Table 3.1.

3.1.1 Sensory Data Used for Human Activity Recognition

I investigated the feasibility of using data generated from the following sources of sensory measurements. All of these sensors are inexpensive and widely embedded in current smartphones, fitness bands, and motion-sensing input devices such as the Xbox 360 and Asus Xtion Pro.

• Acceleration: A portable triaxial accelerometer worn on the wrist of a subject records the acceleration in all three spatial dimensions (X-Y-Z) in real time.

• Bounding box from an RGB-D camera: RGB-D cameras combine RGB color information with per-pixel depth information. It is easy to capture and extract the moving subject with a bounding box from the raw image frames using various libraries and SDKs, such as OpenCV [92], OpenNI [93], the Point Cloud Library (PCL) [98], and the Microsoft Kinect SDK [86].
• PIR and RSSI: Localization sensors that estimate the subject's location in real time, in either a passive or an active way. Passive sensors detect the appearance of a subject from the radiation emitted by the human body (e.g., passive infrared sensors), whereas active sensors measure the location of a subject by sending out signals (e.g., received signal strength indication).

In this chapter, I illustrate the preprocessing and the proposed method mainly using the data provided in the "SPHERE Challenge: Activity Recognition with Multimodal Sensor Data" [122] for indoor human activity recognition (denoted as "Sphere data" in the rest of the text).

3.1.2 Feature Extraction

The goal is to predict human activities within each second. Thus, I first divide the entire time series into 1-second segments and extract features from each segment to discriminate between activities.

• Features from Acceleration
  – Kurtosis [12]: Describes the tailedness of a segment of time-series acceleration data.
  – Approximate Entropy [97]: Describes the unpredictability of fluctuations over a segment of time-series acceleration data.
  – Top-10 Frequencies by FFT [14]: Distribution of the resultant time-series energy in the frequency domain.
  – FFT distribution kurtosis: Describes the tailedness of the resultant energy distribution in the frequency domain.
  – Average Jerk [130]: Average rate of change of acceleration on each axis.
  – Average Absolute Value: Average absolute acceleration on each axis.
  – Average Value [14]: Average acceleration on each axis.
  – Median: Median acceleration on each axis.
  – Standard Deviation [14]: Standard deviation of acceleration on each axis.
  – Maximum Value: Maximum acceleration on each axis.
  – Minimum Value: Minimum acceleration on each axis.
  – Maximum Absolute Value: Maximum absolute acceleration on each axis.

All of the above features are selected because of their capability to distinguish between human activities. For example, while exploring the Sphere data, it can be observed that the class "a_jump" has a much higher maximum acceleration than any other activity (Fig. 3.3). As another example, while most activities do not repeat periodically, some activities show obvious repetitive patterns (e.g., "a_ascend", "a_walk"). With the Fast Fourier Transform (FFT), it can be observed that their energy distributions in the frequency domain (e.g., Fig. 3.4a and Fig. 3.4b) are markedly different from those of activities without periodic patterns (e.g., Fig. 3.4c and Fig. 3.4d). Most of these repetitive activities have relatively higher energy in the low-frequency bands, while the others have comparable energy in all frequency bands.

Figure 3.3: Distribution of the maximum acceleration for each activity class. (The line in the middle of each box is the sample median; the tops and bottoms of each box are the 25th and 75th percentiles of the samples, respectively; the whiskers are lines extending above and below each box; observations beyond the whisker length are marked as outliers.)

Figure 3.4: Time-domain and frequency-domain views of activities in the acceleration data from [122]; for each subplot, the time domain is on the left and the frequency domain on the right. (a) "a_ascend"; (b) "a_walk"; (c) "p_stand"; (d) "p_bend".
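A minimal sketch of how per-second features of the kind listed above could be computed from a raw acceleration window is given below. The window length, the sampling rate, and the particular subset of features shown are illustrative assumptions, and approximate entropy is omitted for brevity.

```python
import numpy as np
from scipy.stats import kurtosis

def acceleration_features(window, n_freq=10):
    """Extract simple per-axis and spectral features from a 1-second window.

    window : (T, 3) array of raw tri-axial acceleration samples.
    Returns a flat feature vector (a subset of the features described above).
    """
    feats = []
    # per-axis statistics
    for axis in range(3):
        x = window[:, axis]
        feats += [x.mean(), np.abs(x).mean(), np.median(x), x.std(),
                  x.max(), x.min(), np.abs(x).max(), kurtosis(x)]
        # average jerk: mean rate of change of acceleration
        feats.append(np.mean(np.diff(x)))
    # spectral features of the resultant acceleration
    resultant = np.linalg.norm(window, axis=1)
    power = np.abs(np.fft.rfft(resultant - resultant.mean())) ** 2
    top = np.sort(power)[::-1][:n_freq]          # top-n_freq FFT magnitudes
    feats += list(top) + [kurtosis(power)]       # FFT distribution kurtosis
    return np.asarray(feats)

# toy usage: a 1-second window assumed to be sampled at 20 Hz
rng = np.random.default_rng(0)
window = rng.normal(size=(20, 3))
print(acceleration_features(window).shape)
```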
• Features from RGB-D Camera
RGB-D cameras capture not only RGB images but also the depth information of pixels. OpenNI [93] was employed to extract both 2D and 3D bounding boxes of a human subject from the raw RGB-D image frames [122]. As shown in Fig. 3.5, the coordinates of the bounding box centers reflect the detailed location of a human subject and can be further used to compute displacements and speeds. The coordinates of the corners of the bounding boxes can be used to compute the shape of a subject. It is straightforward to associate these features with human activities. For example, when the coordinate of the bounding box center moves fast, the subject is more likely to be performing a motion rather than staying in a stationary posture, and the shape of a subject is clearly different when the subject is standing compared with when he or she is sitting. Based on these observations, both 2D and 3D movement and shape features have been extracted. The detailed camera features are summarized in Table 3.2.

Table 3.2: 2D/3D camera features summary.
  Movement, 2D: center coordinate (mean, std, gradient)
  Movement, 3D: center coordinate (mean, std, gradient)
  Shape, 2D: length, width, area (mean, std)
  Shape, 3D: length, width, height, volume (mean, std)
Figure 3.5: The given coordinates of an example bounding box from the RGB-D camera data in [122]; the 2D bounding box is on the left and the 3D bounding box on the right (tl: top left; br: bottom right; flt: front left top; brb: back right bottom).

• Location Feature Extraction
The data gathered by the RSSI and PIR sensors are used for location information. Because each room in a smart house has a distinct layout and decoration, the patterns of the same activity vary from room to room. Detecting the specific room in which the subject is located reveals the patterns characteristic of that room and helps improve the classification performance. For instance, a subject is not able to perform "p_lie" on the stairs; moreover, the detailed coordinates given by RSSI introduce prior knowledge about the possible activity, e.g., whether the subject is more likely to be performing "p_lie" rather than "a_jump". The average signal values of RSSI and PIR are computed as location information, and additionally, the standard deviation of the RSSI within each second is computed as an indicator of motion speed. This information is used to predict room occupancy in my hierarchical approach.

3.1.3 Annotation Confidence Level

Human activities usually consist of a series of continuous actions, where the transition from one activity to the next occurs in a gradual manner. Fig. 3.2 illustrates an example of a 30-second human activity segment from the labeled Sphere data, where the degree of gray scale indicates the confidence score of the labeling, or multi-class probability. For instance, in Figure 3.2, "a_walk" lasts from 2s to 8s. Among these seconds, only the 5th second is labeled "a_walk" with
This means that the subject conducted a series of standing and walking actions with a gradual and smooth transition between activities. Similar observation for activity of “t_turn" exists from 12s to 15s. Table 3.3: Illustration of activities distribution. second a_walk p_stand 2 0 1 3 0.5 0.5 4 0.7 0.3 5 1 0 6 0.8 0.2 7 0.7 0.3 8 0.3 0.7 9 0 1 3.1.4 Temporal Transitional Dependency Since there exists significant temporal dependency on the transition among daily activities, it is usually helpful to learn this inherent pattern for activity recognition, especially for transition activities. For example, t_lie_sit usually happens in a period between p_lie and p_sit; t_bend usually follows p_bent. Some counter examples include that a_jump never happens with p_lie, and a_ascend never happens with p_squat, etc. 3.2 Methodology 3.2.1 Multi-Class Learning with Softmax Regression Softmax regression can be used to compute the posterior probability that the i-th instance in room r belongs to class k as follows [16]: P(Yr i = k|Xr i:) = s:) ≡ Pr ik, where Wr ∈ (cid:60)K×dr is the model parameter matrix for room r. The parameters can be estimated by minimizing the following cross entropy loss function: K exp(Xr k:) i:Wr s=1 exp(Xr i:Wr K Nr i k 33 Wr = arg min Wr −Yr ik log Pr ik (3.1) (3.2) Intuitively, the model produces probabilistic classification results by equation (3.1), and the loss function (3.2) measures the discrepancy between the estimated posterior probability and annotated confidence score of each class for the training examples. The loss function is well-suited for handling soft labels in the human activity recognition problem shown in Fig.3.2, in which multiple activities may occur in the same time period. 3.2.2 Proposed Method: STARS Although the softmax regression approach can be applied to the smart home data, it has several limitations. First, it does not account for the temporal dependencies among activities in the sequence data. Second, it is designed for learning models independently for each room. Since the amount of training examples available in each room may vary, this may lead to suboptimal local models. Furthermore, the features available to classify the human activities can be different from one room to another. It would be useful to develop a multi-task learning approach that can jointly train the models for all the rooms, taking into account the relationships among the prediction tasks and variable features of the rooms. To overcome these limitations, I propose the following soft multi-task learning framework called STARS, which is designed to optimize the following objective function: (3.3) (λW||Wdi f ,r||F + λF1||F 1 r::||F + λF2||F 2 r::||F) 34 min Θ L1 + L2 + L3 Nr K s.t. L1 = L2 = L3 = −Yr ik log Pr ik k i β||Pr − GrPr||2 F (λU||Uk,:,:||F + λV||Vk,:,:||F) r R R K R k r + r where, ]1×dr, Wcom i−1:F 1 T + Zr rk: T + Zi−1:F 1 rs: rk: = Ukr:Vk:: T) i+1:F 2 T + Zr i+1:F 2 T) T + Zr rk: rs: (3.4) (3.5) (3.6) = k: = [Wcom , Wdi f ,r Wr K rk: k: exp(Xr i:Wr k: Pr s exp(Xr ik i:Wr s: i:WrT 1 1 0 0 0 ... 0 0 0 ... 0 1 0 ... 0 i: = Xr Zr Gr = ... 0 0 0 ... 1   Nr×Nr 0 0 0 ... 0 where Θ = {U,V, Wr, F 1, F 2} corresponds to the set of model parameters for all the rooms r = 1, 2, · · · , R. The framework assumes a linear model for each room r, parameterized by the matrix Wr = [Wcom,r, Wdi f ,r], where Wcom,r is represented by the r-th slice of tensor Wcom, and denotes the weight matrix associated with the common features for all the rooms. 
And Wdi f ,r denotes the weight matrix associated with the unique features of the room. For example, the common features may include those derived from accelerometer sensors worn by the users whereas the unique features may correspond to those derived from surveillance cameras located only in certain rooms. ik in the proposed formulation depends not only on the features Xr Note that the objective function consists of three parts: (1) L1, which is the cross entropy loss function associated with the classification error, (2) L2, which captures the temporal persistence of the classes (to be explained below), and (3) model complexity control L3. The posterior probability Pr i: at time i, but also on the temporal features Zi−1: and Zi+1: at time i − 1 and i + 1, respectively. We consider i+1:WrT as temporal features because they are related to the i−1: = Xr Zr predicted probabilities in the previous and next timesteps. This model also encapsulates information about the class transitions by using the transition tensors F 1 and F 2. Specifically, F 1 r:: encodes the relationship between the activity at previous timestep i − 1 to the activity at current timestep i−1:WrT and Zr i+1 = Xr 35 Figure 3.6: The illustration of multi-task learning in STARS i in room r. Conversely, F 2 r:: encodes the relationship between the activity at the next timestep i + 1 and the activity at current timestep i in room r. These transition tensors are estimated while optimizing the loss function of STARS framework. The second term of the objective function, L2, is a regularization term to ensure the temporal persistence of the classes. As illustrated in Fig.3.2, most activities tend to last for more than several seconds. This suggests a trivial approach to predict user activity in the next timestep is by using the predicted activity for the current timestep. The temporal persistence of an activity between two adjecent time steps is reflected by the soft constraint ||Pr i−1: = (GrPr)i:. Finally, the third term in the objective function, L3, is i: − Pr used to control the model complexity to avoid overfitting. F, where Pr i−1:||2 One unique feature of the proposed STARS framework is that it uses a multi-task learning approach to train the models for all rooms simultaneously. Furthermore, instead of treating classification task for different rooms as independent learning problems, it assumes the tasks are related via the common features shared by all the rooms. Fig. 3.6 illustrates how the multi-task works with matrix decomposition. Specifically, although the weight matrix Wcom,r for all the rooms can be different, they share a pair of common low-rank factors, U and V. In Fig. 3.6, we use the notation Wcom to represent a 3-dimensional tensor, where the r-th slice of the tensor corresponds to the weight matrix Wcom,r for room r. 36 3.2.3 Optimization The accelerated gradient descent method can be applied to learn the model parameters. A pseudo- code of the algorithm for both training phase and prediction phase is shown Algorithm 1. The training phase is for inferring the model parameters, and the testing phase is for predicting the next activity in the sequence. Algorithm 1: STARS Framework TRAINING PHASE Input: training set {Xr, Yr} and set of regularizers {β, λU, λV, λW, λF1, λF2}. Output: Θ(t) = {U(t),V(t), Wdi f f ,r(t), F 1(t), F 2(t)} Set t = 0 and initialize Θ(0) = {U(0),V(0), Wdi f f ,r(0), F 1(0), F 2(0)}. 
3.2.3 Optimization

The accelerated gradient descent method can be applied to learn the model parameters. Pseudo-code for both the training phase and the prediction phase is shown in Algorithm 1. The training phase infers the model parameters, and the prediction phase predicts the next activity in the sequence.

Algorithm 1: STARS Framework
TRAINING PHASE
Input: training set $\{X^r, Y^r\}$ and the set of regularization parameters $\{\beta, \lambda_U, \lambda_V, \lambda_W, \lambda_{F_1}, \lambda_{F_2}\}$.
Output: $\Theta^{(t)} = \{U^{(t)}, V^{(t)}, W^{dif,r(t)}, F^{1(t)}, F^{2(t)}\}$
Set $t = 0$ and initialize $\Theta^{(0)} = \{U^{(0)}, V^{(0)}, W^{dif,r(0)}, F^{1(0)}, F^{2(0)}\}$.
repeat
  $t = t + 1$
  $\forall r, q:\ U_{qr:} \leftarrow U_{qr:} - \alpha^{(t)}\big(\tfrac{\partial L_1}{\partial U_{qr:}} + \tfrac{\partial L_2}{\partial U_{qr:}} + \lambda_U U_{qr:}\big)$
  $\forall q:\ V_{q::} \leftarrow V_{q::} - \alpha^{(t)}\big(\tfrac{\partial L_1}{\partial V_{q::}} + \tfrac{\partial L_2}{\partial V_{q::}} + \lambda_V V_{q::}\big)$
  $\forall r, q:\ W^{dif,r}_{q:} \leftarrow W^{dif,r}_{q:} - \alpha^{(t)}\big(\tfrac{\partial L_1}{\partial W^{dif,r}_{q:}} + \tfrac{\partial L_2}{\partial W^{dif,r}_{q:}} + \lambda_W W^{dif,r}_{q:}\big)$
  $\forall r, q:\ F^1_{rq:} \leftarrow F^1_{rq:} - \alpha^{(t)}\big(\tfrac{\partial L_1}{\partial F^1_{rq:}} + \tfrac{\partial L_2}{\partial F^1_{rq:}} + \lambda_{F_1} F^1_{rq:}\big)$
  $\forall r, q:\ F^2_{rq:} \leftarrow F^2_{rq:} - \alpha^{(t)}\big(\tfrac{\partial L_1}{\partial F^2_{rq:}} + \tfrac{\partial L_2}{\partial F^2_{rq:}} + \lambda_{F_2} F^2_{rq:}\big)$
until convergence
PREDICTION PHASE
Input: test example $X^r_{i:}$, its adjacent predictors $X^r_{i-1:}$ and $X^r_{i+1:}$, and the estimated model parameters $\Theta = \{U, V, W^{dif,r}, F^1, F^2\}$, $r = 1, 2, \ldots, R$.
Output: predicted probability $P(y = k \mid X^r_{i-1:}, X^r_{i:}, X^r_{i+1:}, \Theta) = P^r_{ik}$, $k = 1, 2, \ldots, K$, computed with Equations (3.5) and (3.6).

Since there are multiple model parameters, $\Theta = \{U, V, W^{dif,r}, F^1, F^2\}$, the parameters are updated in an alternating fashion. A backtracking line search strategy is also implemented to adaptively choose the step size of the gradient descent [17] and ensure faster convergence. In the remainder of this section, I show the gradients of $L_1$ and $L_2$ with respect to each model parameter.

• Gradient computation for $U$. Taking the partial derivative of $L_1$ with respect to $U_{qr:}$, for $q = 1, 2, \ldots, K$, yields

$$\frac{\partial L_1}{\partial U_{qr:}} = \sum_{i=1}^{N_r} (P^r_{iq} - Y^r_{iq})\, X^{com}_{ri:} V^{\top}_{q::}
+ \sum_{i=1}^{N_r}\sum_{k=1}^{K} (P^r_{ik} - Y^r_{ik})\big(F^1_{rkq}\, X^{com}_{r(i-1):} V^{\top}_{q::} + F^2_{rkq}\, X^{com}_{r(i+1):} V^{\top}_{q::}\big) \tag{3.7}$$

where $X^{com}_{ri:} V^{\top}_{q::}$ denotes the latent representation of the common features for room $r$. The update of $U_{qr:}$ thus depends on two terms. The first term on the right-hand side of Equation (3.7) measures the difference between the predicted and true classes in terms of the latent common feature vectors. The second term measures the difference in terms of the latent feature vectors of the adjacent time periods, taking into account the temporal dependencies between activities encoded by $F^1_{r::}$ and $F^2_{r::}$. Furthermore, the gradient of $L_2$ with respect to $U_{qr:}$ is

$$\frac{\partial L_2}{\partial U_{qr:}} = \sum_{i=1}^{N_r}\sum_{k=1}^{K} 2\beta\big(P^r_{ik} - (G^r P^r)_{ik}\big)
\Big(\frac{\partial P^r_{ik}}{\partial U_{qr:}} - \sum_{j} G^r_{ij}\,\frac{\partial P^r_{jk}}{\partial U_{qr:}}\Big)$$

where

$$\frac{\partial P^r_{ik}}{\partial U_{qr:}} = P^r_{ik}\Big[\big(\mathbb{1}\{q = k\} - P^r_{iq}\big) X^{com}_{ri:} V^{\top}_{q::}
+ \big(F^1_{rkq} X^{com}_{r(i-1):} + F^2_{rkq} X^{com}_{r(i+1):}\big)V^{\top}_{q::}
- \sum_{s=1}^{K} P^r_{is}\big(F^1_{rsq} X^{com}_{r(i-1):} + F^2_{rsq} X^{com}_{r(i+1):}\big)V^{\top}_{q::}\Big]$$

• Gradient computation for $V$. Similarly, the gradients with respect to $V$ are

$$\frac{\partial L_1}{\partial V_{q::}} = \sum_{r=1}^{R}\sum_{i=1}^{N_r} (P^r_{iq} - Y^r_{iq})\, U^{\top}_{qr:} X^{com}_{ri:}
+ \sum_{r=1}^{R}\sum_{i=1}^{N_r}\sum_{k=1}^{K} (P^r_{ik} - Y^r_{ik})\big(F^1_{rkq}\, U^{\top}_{qr:} X^{com}_{r(i-1):} + F^2_{rkq}\, U^{\top}_{qr:} X^{com}_{r(i+1):}\big)$$

$$\frac{\partial L_2}{\partial V_{q::}} = \sum_{r=1}^{R}\sum_{i=1}^{N_r}\sum_{k=1}^{K} 2\beta\big(P^r_{ik} - (G^r P^r)_{ik}\big)
\Big(\frac{\partial P^r_{ik}}{\partial V_{q::}} - \sum_{j} G^r_{ij}\,\frac{\partial P^r_{jk}}{\partial V_{q::}}\Big)$$

• Gradient computation for $W^{dif,r}$.

$$\frac{\partial L_1}{\partial W^{dif,r}_{q:}} = \sum_{i=1}^{N_r} (P^r_{iq} - Y^r_{iq})\, X^{dif,r}_{i:}
+ \sum_{i=1}^{N_r}\sum_{k=1}^{K} (P^r_{ik} - Y^r_{ik})\big(F^1_{rkq}\, X^{dif,r}_{i-1:} + F^2_{rkq}\, X^{dif,r}_{i+1:}\big)$$

$$\frac{\partial L_2}{\partial W^{dif,r}_{q:}} = \sum_{i=1}^{N_r}\sum_{k=1}^{K} 2\beta\big(P^r_{ik} - (G^r P^r)_{ik}\big)
\Big(\frac{\partial P^r_{ik}}{\partial W^{dif,r}_{q:}} - \sum_{j} G^r_{ij}\,\frac{\partial P^r_{jk}}{\partial W^{dif,r}_{q:}}\Big)$$

• Gradient computation for $F^1$ (or $F^2$).

$$\frac{\partial L_1}{\partial F^1_{rq:}} = \sum_{i=1}^{N_r} (P^r_{iq} - Y^r_{iq})\, Z^r_{i-1:}, \qquad
\frac{\partial L_2}{\partial F^1_{rq:}} = \sum_{i=1}^{N_r}\sum_{k=1}^{K} 2\beta\big(P^r_{ik} - (G^r P^r)_{ik}\big)
\Big(\frac{\partial P^r_{ik}}{\partial F^1_{rq:}} - \sum_{j} G^r_{ij}\,\frac{\partial P^r_{jk}}{\partial F^1_{rq:}}\Big)$$

The gradients $\frac{\partial L_1}{\partial F^2_{rq:}}$ and $\frac{\partial L_2}{\partial F^2_{rq:}}$ can be obtained in a similar way.

3.3 Experimental Evaluation

I performed the experiments on a real-world dataset from the SPHERE Challenge competition [122]. The dataset contains classes of human activities recorded in a house with 9 rooms.
The raw data contain 10 sequences, where each sequence corresponds to a series of activities performed by a subject over a time period lasting between 1,392 and 1,825 seconds. Treating the sensor observations at each second as a data instance, I ended up with a total of 16,124 instances. Each instance was labeled by a team of 12 annotators [122], whose results were aggregated to obtain a confidence score for each class label. For evaluation purposes, I apply 5-fold cross-validation and report the mean and standard deviation of the prediction performance. Following the approach described in [122], I employ the weighted Brier score to evaluate the classification performance. The metric is defined as follows:

$$BS = \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} l_k\,(Y_{ik} - P_{ik})^2 \tag{3.8}$$

where $Y_{ik}$ is the confidence score for the $i$-th instance and $k$-th class, computed based on the labels provided by the team of annotators, and $P_{ik}$ is the predicted posterior probability. The weight for each class, $l_k$, is defined in [3] and is negatively correlated with the class size.

3.3.1 Baseline Algorithms

I compare the performance of my proposed framework, STARS, against the following baseline algorithms:

• SR: Softmax regression, which trains a local softmax regression model for each room with Equations (3.1) and (3.2). Unlike my proposed framework, it is a single-task learning model and does not incorporate temporal dependencies.

• KNN: A k-nearest neighbor classifier, which is another baseline used in [3] for the SPHERE competition data. I sum up the weights associated with each neighboring instance $Y_{ik}$ for the test instance $i$ and normalize the weighted sum to obtain the predicted posterior probabilities.

• CRF: Conditional Random Field [118], which is a widely used model for sequence classification problems [45, 65] and has been applied to activity recognition problems [123, 59].
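Before turning to the results, the weighted Brier score of Equation (3.8) can be made concrete with a short numpy sketch. The inverse-frequency class weights used here are an illustrative assumption, since the actual weights are those defined in [3].

```python
import numpy as np

def weighted_brier_score(Y, P, class_weights):
    """Weighted Brier score: mean over instances of sum_k l_k (Y_ik - P_ik)^2.

    Y : (N, K) soft ground-truth confidence scores.
    P : (N, K) predicted class probabilities.
    class_weights : (K,) per-class weights l_k.
    """
    return np.mean(np.sum(class_weights * (Y - P) ** 2, axis=1))

# toy usage with normalized inverse-frequency weights (an assumption, not the weights of [3])
rng = np.random.default_rng(0)
Y = rng.dirichlet(np.ones(4), size=100)
P = rng.dirichlet(np.ones(4), size=100)
freq = Y.sum(axis=0) / Y.sum()
class_weights = (1.0 / freq) / np.sum(1.0 / freq)
print(weighted_brier_score(Y, P, class_weights))
```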
3.3.2 Experimental Results

The results comparing the weighted Brier score of the proposed framework, STARS, against the baseline methods (SR, KNN, and CRF) are shown in Table 3.4. I report the weighted Brier score for all rooms (denoted as Overall) as well as for the individual rooms. The results suggest that the overall performance of STARS is significantly better than that of the baseline methods. In terms of the performance for individual rooms, STARS achieves the best (i.e., lowest) score in 7 out of the 9 rooms. The performance of STARS is slightly worse than SR for bedroom2 and hallway due to the lack of transitional activities, which makes it harder to learn the temporal dependencies accurately from their limited training data.

Table 3.4: Weighted Brier scores for various competing algorithms.
  Room         SR              KNN             CRF             STARS
  bathroom     0.1334±0.0149   0.1315±0.0103   0.1479±0.0152   0.1269±0.0173
  bedroom1     0.0853±0.0129   0.1026±0.0085   0.0935±0.0118   0.0920±0.0156
  bedroom2     0.2817±0.0148   0.2862±0.0178   0.2886±0.0223   0.2675±0.0172
  hallway      0.1926±0.0953   0.2323±0.0675   0.2115±0.0816   0.1953±0.0849
  kitchen      0.0842±0.0122   0.0915±0.0099   0.0917±0.0133   0.0820±0.0106
  living room  0.1594±0.0181   0.1774±0.0171   0.1710±0.0201   0.1468±0.0142
  stairs       0.3827±0.0705   0.4366±0.0552   0.3834±0.0288   0.3373±0.0505
  study room   0.0441±0.0407   0.0649±0.0240   0.0506±0.0556   0.0381±0.0365
  toilet       0.1440±0.0380   0.1368±0.0304   0.1442±0.0358   0.1360±0.0417
  Overall      0.1700±0.0095   0.1815±0.0089   0.1794±0.0121   0.1598±0.0087

Figure 3.7: The estimated transition matrices $F^1_{r::}$ (left) and $F^2_{r::}$ (right) for the living room: (a) transition matrix from timestep $(i-1)$ to $i$; (b) transition matrix from timestep $i$ to $(i+1)$. The ordering of the classes on the horizontal and vertical axes is the same.

In addition to its lower Brier score, another advantage of the STARS framework is that the model learns the transitions between activities via $F^1$ and $F^2$ as a by-product. Figs. 3.7a and 3.7b depict heat maps of the two tensor slices $F^1_{r::}$ and $F^2_{r::}$ for the living room. The results shown in these figures are mostly consistent with common sense knowledge. For example, Fig. 3.7a shows that the bent posture is mostly followed by the activity bend-to-stand, whereas stand-to-bend often leads to the bent posture. Similarly, Fig. 3.7b shows that the stand-to-kneel activity leads to the kneel posture in the next time step, while lie-to-sit begins with the lie posture and ends with the sit posture.

3.4 Related Work

Numerous approaches have been developed for human activity recognition. Classic approaches include decision trees [19], support vector machines [128], logistic regression, and Bayesian networks [36]. For example, a comparison between logistic regression and non-linear SVM on human activity recognition is given in [72]. These approaches do not consider the temporal/sequential dependencies between activities. In contrast, methods such as Hidden Markov Models (HMM) and Conditional Random Fields (CRF) are better suited for handling sequential data [45, 142] and have thus been widely utilized for activity recognition tasks [67, 123]. However, these approaches are primarily designed for single-task learning, unlike the multi-task approach proposed in this study. The success of multi-task learning for activity recognition has been well documented [117, 140, 141, 7, 22]. In [117], a structured multi-task classification method was proposed in which each task corresponds to the classification of a specific person. [140] presented a multi-task clustering framework for analyzing daily living activities from visual data collected by wearable cameras. In addition, [141] focused on multi-task feature selection, whereas [7] focused on online matrix regularization. Unlike existing works, my proposed framework treats the classification in different rooms as separate tasks with possibly different types of features.

3.5 Summary

In this chapter, I presented a soft multi-task learning technique for human activity recognition from multi-modal sensor data in a smart home. The proposed technique incorporates the temporal dependencies between classes in a multi-task learning setting. Experimental results on a public human activity recognition dataset showed that the proposed technique outperforms baseline methods, including k-nearest neighbor, conditional random field, and single-task learning with multinomial softmax regression. The framework not only improves the classification performance, it also reveals the typical types of transitions between activities.
CHAPTER 4
DISTRIBUTION PRESERVING MULTI-TASK REGRESSION FOR SPATIO-TEMPORAL DATA

Regression methods play an important role in many spatio-temporal applications as they can be used to solve a wide variety of prediction problems, such as projecting future changes in the climate system, predicting the crime rate in urban cities, or forecasting traffic volume on highways. Although accuracy is an important requirement, building models that can replicate the future distribution of the data is just as important, since the predicted distribution can be used for planning, risk assessment, and other decision-making purposes. For example, in climate modeling, knowing the changes in the future distribution of climate variables such as temperature and precipitation can help scientists better estimate the severity and frequency of adverse weather events. In agricultural production, the predicted distribution can be used to derive statistics such as the average length of the future growing season or the persistence of wet and dry spells, which are important metrics for farmers and agricultural researchers. However, previous studies have shown that the distribution of the predicted values generated by traditional regression methods is not always consistent with the true distribution of the data, even when the prediction errors are relatively low [5, 6]. While distribution-preserving methods such as quantile mapping [120] have been developed to overcome this limitation, their prediction errors can still be high [5]. For example, Figure 4.1(a) shows a comparison between the predicted values of a non-distribution-preserving model (Model 1) and a distribution-preserving model (Model 2) on a set of 10 values. Although the first model has a much lower root mean square error (RMSE), it does not fit the tails of the predicted distribution well, as shown in Figure 4.1, compared to the second model, which fits the distribution almost perfectly but has a considerably higher RMSE.

Figure 4.1: Comparison between the predictions of non-distribution preserving (Model 1) and distribution preserving (Model 2) methods in terms of their root mean squared errors and cumulative distribution functions.

This has led to growing interest in developing techniques that can minimize both the prediction error and the divergence between the true and predicted distributions [5, 6]. However, current techniques are mostly designed for single-task learning problems, i.e., to build a regression model for a single location. For multi-location prediction, these models are trained independently and thus often fail to capture the inherent autocorrelations of the spatio-temporal data. In addition, their accuracy and distribution fit are likely to be suboptimal for locations with limited training data. Therefore, multi-task learning algorithms should be developed for multi-location problems (e.g., [134, 139, 136]). To account for spatial autocorrelation and the imbalanced distribution of training data, several recent studies have focused on the development of multi-task learning (MTL) methods [148, 147] for spatio-temporal data [137, 136]. MTL learns a local model for each location but leverages data from other locations to improve its performance. It accomplishes this by assuming that the local models share some common structure, which can be exploited to enhance their predictive performance.
Unfortunately, existing MTL approaches mostly focus on minimizing the residual error, paying scant attention to how realistic the overall predicted distribution is. This chapter presents a novel distribution-preserving multi-task learning framework for spatio-temporal data. Our framework assumes that the local models share a common low-rank representation, similar to the assumption used in [136]. It also employs a graph Laplacian regularizer based on the Haversine spatial distance to preserve the spatial autocorrelation in the data. A non-parametric kernel density estimation approach with an L2-distance is used to measure the divergence between the predicted and true distributions of the data. Both the distribution fitting and the multi-task learning are integrated into a unified objective function, which is optimized using a mini-batch accelerated gradient descent algorithm. Experimental results using a real-world climate dataset from the Global Historical Climatology Network (GHCN) show that our proposed framework outperforms the non-distribution-preserving approaches in more than 78% of the weather stations considered in this study, with an average reduction in distribution error between 7.5% and 17.8%. Our approach also outperforms a distribution-preserving single-task regression method called contour regression [5] in at least 78% of the weather stations.

4.1 Preliminaries

Our framework is designed not only to learn accurate local regression models, but also to generate predictions that are consistent with the true distribution of the data. This requires an approach for estimating the density function of the response variable and a divergence measure to compute the difference between two distributions. We review these approaches in this section.

4.1.1 Density Estimation

Various density estimation methods have been proposed in the literature. In general, they can be divided into two categories:

• Parametric methods, which assume that the density function follows a certain parametric distribution, such as a Gaussian, Gamma, or exponential distribution. The sampled data points are used to estimate the parameters of the distribution. Since the number of parameters tends to be small, parametric methods have the advantage that they do not require a large number of points to fit the density function. Unfortunately, many real-world datasets do not follow standard distributions, as they often comprise complex mixtures of distributions, which leads to imprecise estimates of the true distribution.

• Non-parametric methods, which avoid making an a priori assumption about the shape of the distribution. Two popular non-parametric methods are the k-nearest neighbor (KNN) approach and kernel density estimation. The KNN approach uses only the k nearest neighbors to estimate the density function, whereas the kernel density estimation (KDE) approach uses a mixture of Gaussian distributions centered at all the data points, with a smoothing kernel width parameter h.

Due to its generality for fitting unknown distributions, we employ the KDE approach to approximate the density function. Given $N$ data points sampled from an unknown distribution of a variable $Y$, $\mathbf{y} = \{y_i \mid i = 1, 2, \ldots, N\}$, the density function of $Y$ is estimated as

$$P_{\mathbf{y}}(Y = y) = \frac{1}{N}\sum_{i=1}^{N} G(y \mid y_i, h^2) \tag{4.1}$$
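A minimal numpy sketch of the Gaussian kernel density estimate in Equation (4.1) is given below; the bandwidth, the evaluation grid, and the bimodal toy sample are illustrative assumptions.

```python
import numpy as np

def gaussian_kde_estimate(samples, grid, h):
    """Evaluate the KDE  P(y) = (1/N) * sum_i G(y | y_i, h^2)  on a grid of points."""
    diff = grid[:, None] - samples[None, :]
    kernels = np.exp(-diff ** 2 / (2 * h ** 2)) / (h * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# toy usage: density of a bimodal sample, evaluated on a regular grid
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(1, 0.8, 200)])
grid = np.linspace(-4, 4, 200)
density = gaussian_kde_estimate(samples, grid, h=0.3)
print(np.trapz(density, grid))   # integrates to approximately 1
```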
4.1.2 Divergence Measures

A divergence measure can be used to estimate the difference between two density functions. This section reviews the divergence measures used in this chapter. Although other measures are available, evaluating them is beyond the scope of this chapter and is a subject for future research.

4.1.2.1 RMS-CDF

RMS-CDF is a measure defined in [5] to compare two empirical cumulative distribution functions. Let $\mathbf{y}$ and $\hat{\mathbf{y}}$ denote vectors of length $N$ sampled from two distributions. The RMS-CDF between the pair of distributions is computed as follows:

$$\text{RMS-CDF} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \big(y_{(i)} - \hat{y}_{(i)}\big)^2} \tag{4.2}$$

where $y_{(i)}$ represents the $i$-th largest value in $\mathbf{y}$ and $\hat{y}_{(i)}$ the $i$-th largest value in $\hat{\mathbf{y}}$. The measure is obtained by sorting the values in $\mathbf{y}$ and $\hat{\mathbf{y}}$ and computing the average sum of squared differences between their sorted values.

4.1.2.2 L2 Distance

Given a pair of random variables, $Y$ and $\hat{Y}$, along with their respective probability density functions, $P_Y$ and $P_{\hat{Y}}$, their L2-distance can be calculated as follows [115]:

$$L_2(P_Y, P_{\hat{Y}}) = \int \big(P_Y(y) - P_{\hat{Y}}(y)\big)^2\, dy$$

Unlike RMS-CDF, the L2-distance measures the divergence of two distributions based on their probability density functions instead of their cumulative distribution functions.

4.2 Proposed Framework

This section introduces our proposed framework for distribution-preserving multi-task regression. The framework uses kernel density estimation (KDE) to estimate the probability density function; KDE provides a flexible approach for modeling the density function of unknown distributions, unlike parametric approaches. We also employ the L2-distance to measure the difference between two density functions. The notation used in the remainder of this chapter is summarized in Table 4.1.

Table 4.1: Summary of notations used in the chapter.
  $S$ : number of stations (tasks)
  $N_s$ : number of observations in task $s$
  $d$ : number of features
  $k$ : number of latent factors, $k < d$
  $X_s \in \mathbb{R}^{N_s \times d}$ : predictor matrix for station $s$
  $y_s$ or $\hat{y}_s \in \mathbb{R}^{N_s \times 1}$ : observed or estimated response samples at station $s$
  $P_y(Y)$ or $P_{\hat{y}}(Y)$ : density function of $Y$ estimated from the observed (or estimated) samples
  $W \in \mathbb{R}^{d \times S}$ : model parameters
  $U \in \mathbb{R}^{d \times k}$ : latent factors
  $V \in \mathbb{R}^{k \times S}$ : linear coefficients for the latent factors
  $h, \hat{h}$ : Parzen window widths for $y$ and $\hat{y}$
  $G(y \mid \mu, \sigma^2)$ : Gaussian distribution
  $Q \in \mathbb{R}^{S \times S}$ : pairwise Haversine distances
  $A \in \mathbb{R}^{S \times S}$ : adjacency matrix, where $A_{ij} = \exp(-Q_{ij}/\gamma)$
  $D \in \mathbb{R}^{S \times S}$ : diagonal matrix, $D_{ii} = \sum_j A_{ij}$

4.2.1 Divergence Measurement

In this chapter, we use kernel density estimation (KDE) to estimate the probability density function and the L2-distance to measure the divergence between two density functions. Given $N$ data points sampled from an unknown distribution of a variable $Y$, $\mathbf{y} = \{y_i \mid i = 1, 2, \ldots, N\}$, the density function of $Y$ is estimated as

$$P_{\mathbf{y}}(Y = y) = \frac{1}{N}\sum_{i=1}^{N} G(y \mid y_i, h^2) \tag{4.3}$$

where $G(y \mid \mu, \sigma^2)$ denotes a Gaussian kernel with mean $\mu$ and variance $\sigma^2$, and $h$ is the Parzen window width. We now derive our approach for computing the divergence between two estimated probability distributions using the Gaussian kernel density estimator with the L2-distance. Let $Y$ be a random variable and consider two $N$-dimensional vectors $\mathbf{y} = [y_1, y_2, \ldots, y_N]^{\top}$ and $\hat{\mathbf{y}} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_N]^{\top}$, where the $y_i$'s and $\hat{y}_j$'s are randomly drawn from the sample space of $Y$. The density functions of $Y$ estimated with a Gaussian KDE from the two sample vectors, $P_{\mathbf{y}}(Y)$ and $P_{\hat{\mathbf{y}}}(Y)$, can be written as

$$P_{\mathbf{y}}(Y = y) = \frac{1}{N}\sum_{i=1}^{N} G(y \mid y_i, h^2), \qquad
P_{\hat{\mathbf{y}}}(Y = y) = \frac{1}{N}\sum_{i=1}^{N} G(y \mid \hat{y}_i, \hat{h}^2)$$

The following theorem presents a closed-form formula for computing the L2-distance between the two estimated density functions.
Theorem 1. The L2-distance between $P_{\mathbf{y}}(Y)$ and $P_{\hat{\mathbf{y}}}(Y)$ is

$$L_2(P_{\mathbf{y}}, P_{\hat{\mathbf{y}}}) = \int \big(P_{\mathbf{y}}(y) - P_{\hat{\mathbf{y}}}(y)\big)^2\, dy
= \frac{1}{N^2}\sum_{i,j=1}^{N}\Big[G(y_i \mid y_j, 2h^2) + G(\hat{y}_i \mid \hat{y}_j, 2\hat{h}^2) - 2\,G(y_i \mid \hat{y}_j, h^2 + \hat{h}^2)\Big]$$

Proof: We begin by expressing the L2-distance in terms of the KDE functions:

$$L_2(P_{\mathbf{y}}, P_{\hat{\mathbf{y}}}) = \int \Big(\frac{1}{N}\sum_{i=1}^{N} G(y \mid y_i, h^2) - \frac{1}{N}\sum_{i=1}^{N} G(y \mid \hat{y}_i, \hat{h}^2)\Big)^2 dy$$

Expanding the square yields

$$L_2(P_{\mathbf{y}}, P_{\hat{\mathbf{y}}}) = \frac{1}{N^2}\sum_{i,j}\int G(y \mid y_i, h^2)\,G(y \mid y_j, h^2)\,dy
+ \frac{1}{N^2}\sum_{i,j}\int G(y \mid \hat{y}_i, \hat{h}^2)\,G(y \mid \hat{y}_j, \hat{h}^2)\,dy
- \frac{2}{N^2}\sum_{i,j}\int G(y \mid y_i, h^2)\,G(y \mid \hat{y}_j, \hat{h}^2)\,dy \tag{4.4}$$

The product of two Gaussian densities can be written as [18]

$$G(y \mid \mu_1, \sigma_1^2)\,G(y \mid \mu_2, \sigma_2^2) = G(\mu_1 \mid \mu_2, \sigma_1^2 + \sigma_2^2)\, G(y \mid \mu_{12}, \sigma_{12}^2), \qquad
\mu_{12} = \frac{\mu_1\sigma_2^2 + \mu_2\sigma_1^2}{\sigma_1^2 + \sigma_2^2},\;\;
\sigma_{12}^2 = \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2 + \sigma_2^2}$$

Since $\int G(y \mid \mu, \sigma^2)\,dy = 1$, the integral of the product of two Gaussians is

$$\int G(y \mid \mu_1, \sigma_1^2)\,G(y \mid \mu_2, \sigma_2^2)\,dy = G(\mu_1 \mid \mu_2, \sigma_1^2 + \sigma_2^2)$$

Substituting this into Equation (4.4) gives

$$L_2(P_{\mathbf{y}}, P_{\hat{\mathbf{y}}}) = \frac{1}{N^2}\sum_{i,j}\Big(G(y_i \mid y_j, 2h^2) + G(\hat{y}_i \mid \hat{y}_j, 2\hat{h}^2) - 2\,G(y_i \mid \hat{y}_j, h^2 + \hat{h}^2)\Big),$$

which completes the proof. □
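The closed form in Theorem 1 can be checked numerically. The sketch below, with illustrative bandwidths and sample sizes, compares the closed-form value with a brute-force grid integration of $(P_{\mathbf{y}} - P_{\hat{\mathbf{y}}})^2$.

```python
import numpy as np

def gauss(a, b, var):
    """Gaussian density G(a | b, var) with variance var."""
    return np.exp(-(a - b) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def l2_closed_form(y, yhat, h, hhat):
    """Theorem 1: closed-form L2 distance between the KDEs of y and yhat."""
    n = len(y)
    return (gauss(y[:, None], y[None, :], 2 * h**2).sum()
            + gauss(yhat[:, None], yhat[None, :], 2 * hhat**2).sum()
            - 2 * gauss(y[:, None], yhat[None, :], h**2 + hhat**2).sum()) / n**2

def l2_numerical(y, yhat, h, hhat, grid):
    """Brute-force integral of (P_y(t) - P_yhat(t))^2 over a dense grid."""
    p_y = gauss(grid[:, None], y[None, :], h**2).mean(axis=1)
    p_yhat = gauss(grid[:, None], yhat[None, :], hhat**2).mean(axis=1)
    return np.trapz((p_y - p_yhat) ** 2, grid)

rng = np.random.default_rng(0)
y, yhat = rng.normal(0, 1, 100), rng.normal(0.5, 1.2, 100)
grid = np.linspace(-8, 8, 4000)
print(l2_closed_form(y, yhat, 0.4, 0.4), l2_numerical(y, yhat, 0.4, 0.4, grid))
```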
4.2.2 DPMTL: Distribution-Preserving MTL Framework

Let $D = \{X, Y\}$ be a spatio-temporal dataset, where $X = \{X_s \mid s = 1, 2, \ldots, S\}$, $Y = \{y_s \mid s = 1, 2, \ldots, S\}$, and $S$ is the number of locations. Each $X_s \in \mathbb{R}^{N_s \times d}$ is a $d$-dimensional multivariate time series of predictor variables at location $s$, and each $y_s \in \mathbb{R}^{N_s}$ is the time series of observations at location $s$. Furthermore, $N_s$ is the number of training examples available at location $s$. The motivation for our spatio-temporal distribution-preserving approach is to learn a set of local linear models $f_s(x; w_s)$ that minimize the prediction error while fitting the marginal distribution of $Y$. Learning the local models amounts to estimating the weight matrix $W = [w_1\, w_2 \cdots w_S]$ for all the stations. Due to the inherent relationships between the prediction tasks at multiple locations, our proposed framework assumes that $W \in \mathbb{R}^{d \times S}$ is not a full-rank matrix and can be decomposed into a product of two low-rank matrices, $U \in \mathbb{R}^{d \times k}$ and $V \in \mathbb{R}^{k \times S}$, where $k < d$. These low-rank matrices can be derived as follows:

$$\min_{U,V}\; L_1 + \alpha L_2 + \lambda L_3 \qquad \text{s.t.}\quad W = UV,\;\; \hat{y}_s = X_s w_s \tag{4.5}$$

where

$$L_1 = \sum_{s=1}^{S}\|y_s - \hat{y}_s\|_2^2, \qquad L_2 = \mathrm{tr}\big[W(D - A)W^{\top}\big],$$

$$L_3 = \sum_{s=1}^{S}\frac{1}{N_s^2}\sum_{i,j}\Big[G(y_{si} \mid y_{sj}, 2h_s^2) + G(\hat{y}_{si} \mid \hat{y}_{sj}, 2\hat{h}_s^2) - 2\,G(y_{si} \mid \hat{y}_{sj}, h_s^2 + \hat{h}_s^2)\Big] \tag{4.6}$$

Our objective function consists of three loss functions: (1) $L_1$, which measures the residual error on the training data between the ground truth samples in $y_s$ and the estimated samples in $\hat{y}_s$; (2) $L_2$, a regularizer that encourages the model parameters of two neighboring locations to be close to each other; and (3) $L_3$, which measures the divergence between the true and predicted distributions of $Y$. For $L_2$, an $S \times S$ similarity matrix $A$ is calculated by applying an RBF kernel to the Haversine distance [126] $Q_{ij}$ between two locations $i$ and $j$, i.e., $A_{ij} = \exp(-Q_{ij}/\gamma)$. $W \in \mathbb{R}^{d \times S}$ is the model parameter matrix, where each column corresponds to the weights of the regression model of a given station. The second loss function ensures that the spatial autocorrelation is preserved by our framework (following Tobler's first law of geography [87]); it accomplishes this through a graph Laplacian regularizer, where $D$ is a diagonal matrix with diagonal elements $D_{ii} = \sum_j A_{ij}$. The first loss function, $L_1$, is designed to minimize the prediction error by learning the conditional probability function $P_{y_s}(Y \mid X)$. In contrast, the third loss function, $L_3$, is designed to accurately fit the marginal distribution $P_{y_s}(Y)$ by minimizing the L2-distance between the true and estimated density functions of all stations using Equation (4.6). The hyperparameter $\lambda$ controls the trade-off between minimizing these two loss functions. Note that the local models $w_s$ are assumed to share a common latent matrix $U$. This assumption is especially useful for learning models at locations with limited training data. Specifically, the model parameters of each station $s$ are formed as a linear combination of the dictionary (latent factors) in $U$, with $v_s$ (the $s$-th column of $V$) specifying the coefficients of the linear combination.

4.2.3 Optimization

We employ a mini-batch accelerated gradient descent approach to solve the optimization problem in Equation (4.5). This requires the gradient of each term, $L_1$, $L_2$, and $L_3$, with respect to the model parameters. For $L_1$, the partial derivatives are

$$\frac{\partial L_1}{\partial U} = \sum_{s=1}^{S}\Big(2X_s^{\top}X_s U v_s v_s^{\top} - 2X_s^{\top}y_s v_s^{\top}\Big) \tag{4.7}$$

$$\frac{\partial L_1}{\partial v_s} = 2U^{\top}X_s^{\top}X_s U v_s - 2U^{\top}X_s^{\top}y_s \tag{4.8}$$

For $L_2$, the partial derivatives are

$$\frac{\partial L_2}{\partial U} = 2UV(D - A)V^{\top} \tag{4.9}$$

$$\frac{\partial L_2}{\partial V} = 2U^{\top}UV(D - A) \tag{4.10}$$

For $L_3$, we first derive the partial derivative of the L2-distance with respect to $\hat{y}_{sm}$. The L2-distance of each station, given in Equation (4.6), is composed of three Gaussian kernel terms. Since $G(y_{si} \mid y_{sj}, 2h^2)$ is constant with respect to $\hat{y}_{sm}$, its partial derivative is zero:

$$\sum_{i,j=1}^{N_s}\frac{\partial G(y_{si} \mid y_{sj}, 2h^2)}{\partial \hat{y}_{sm}} = 0$$

The derivative of the sum of pairwise Gaussian kernels of $\hat{y}_s$ is

$$\sum_{i,j=1}^{N_s}\frac{\partial G(\hat{y}_{si} \mid \hat{y}_{sj}, 2\hat{h}^2)}{\partial \hat{y}_{sm}}
= -\frac{1}{\hat{h}^2}\sum_{i=1}^{N_s} G(\hat{y}_{sm} \mid \hat{y}_{si}, 2\hat{h}^2)\,(\hat{y}_{sm} - \hat{y}_{si})$$

and the derivative of the cross-term Gaussian kernels is

$$\sum_{i,j=1}^{N_s}\frac{\partial G(\hat{y}_{si} \mid y_{sj}, h^2 + \hat{h}^2)}{\partial \hat{y}_{sm}}
= -\frac{1}{h^2 + \hat{h}^2}\sum_{i=1}^{N_s} G(\hat{y}_{sm} \mid y_{si}, h^2 + \hat{h}^2)\,(\hat{y}_{sm} - y_{si})$$

Putting the three derivatives together, the partial derivative of $L_3$ with respect to $\hat{y}_{sm}$ is

$$\frac{\partial L_3}{\partial \hat{y}_{sm}} = \frac{1}{N_s^2}\Big[\frac{2}{h^2 + \hat{h}^2}\sum_{i=1}^{N_s} G(\hat{y}_{sm} \mid y_{si}, h^2 + \hat{h}^2)\,(\hat{y}_{sm} - y_{si})
- \frac{1}{\hat{h}^2}\sum_{i=1}^{N_s} G(\hat{y}_{sm} \mid \hat{y}_{si}, 2\hat{h}^2)\,(\hat{y}_{sm} - \hat{y}_{si})\Big]$$

Furthermore, denoting by $x_{sm} \in \mathbb{R}^{d \times 1}$ the $m$-th row of $X_s$, the partial derivatives of $\hat{y}_{sm}$ with respect to $U$ and $v_s$ are

$$\frac{\partial \hat{y}_{sm}}{\partial U} = x_{sm} v_s^{\top} \in \mathbb{R}^{d \times k}, \qquad
\frac{\partial \hat{y}_{sm}}{\partial v_s} = U^{\top}x_{sm} \in \mathbb{R}^{k \times 1}$$

By applying the chain rule, we obtain

$$\frac{\partial L_3}{\partial U} = \sum_{s=1}^{S}\sum_{m=1}^{N_s}\frac{\partial L_3}{\partial \hat{y}_{sm}}\,\frac{\partial \hat{y}_{sm}}{\partial U} \tag{4.11}$$

$$\frac{\partial L_3}{\partial v_s} = \sum_{m=1}^{N_s}\frac{\partial L_3}{\partial \hat{y}_{sm}}\,\frac{\partial \hat{y}_{sm}}{\partial v_s} \tag{4.12}$$

4.2.4 Algorithm

Equation (4.11) requires computing both $G(\hat{y}_{sm} \mid y_{si}, h^2 + \hat{h}^2)$ and $G(\hat{y}_{sm} \mid \hat{y}_{si}, 2\hat{h}^2)$, which takes $O(N_s)$ time. Furthermore, computing the partial derivative of $L_3$ with respect to all $\hat{y}_{sm}$, $s = 1, 2, \ldots, S$, $m = 1, 2, \ldots, N_s$, requires $O(SN^2)$ time. To speed up the computation, we employ a mini-batch gradient descent (MGD) approach: at each iteration, instead of using all $N_s$ points of each station, we use only a subset of size $l < N_s$. This significantly reduces the amount of computation needed, from $O(SN^2)$ to $O(Sl^2)$. For each gradient descent update step at iteration $t$, let the mini-batch data matrix be $X^{(t)}_s \in \mathbb{R}^{l \times d}$ and the mini-batch response vector be $y^{(t)}_s \in \mathbb{R}^{l \times 1}$.
For each station s, we compute the following and the mini-batch response vector be y vector ps ∈ Rl×1 as the derivative of 1 2L3 on ˆys. (Gs ◦ Es ps = 1 h2 + ˆh2 l2 Gs ∈ (cid:60)l×l : Gs Hs ∈ (cid:60)l×l : Hs Es ∈ (cid:60)l×l : Es Fs ∈ (cid:60)l×l : Fs i, j = 1, 2, ..., l s.t − Hs ◦ Fs ) · 1 2ˆh2 (t) (t) s j , h2 + ˆh2) i j = G( ˆy si |y (t) (t) i j = G( ˆy si | ˆy s j , 2ˆh2) (t) (t) si − y i j = ˆy s j (t) (t) si − ˆy i j = ˆy s j (4.13) Thus, the gradient w.r.t U can be obtained by combining Equations (4.7), ((4.9), and (4.11) as 53 follows: ∂L ∂U = 2αUV(D − A)VT + 2 S s (t)T X s X (t) s UvsvT s S s −2 (t)T s (t) (y s − λps)vT s X (4.14) Similarly, the gradient w.r.t V is obtained by combining Equations (4.8), (4.10), and ((4.12) to obtain: (t)T s (y ∂L ∂V (t)T (t) s Uvs − 2UTX s X = 2αUTUV(D − A) + [∆a1, ∆a2, ..., ∆aS] (4.15) (t) s − λps). Observe that ps can be viewed as an where, ∆as = 2UTX adjustment weighted by λ on ys when calculating the gradients. In other words, the gradient of the distribution-preserving term L3 adjusts the role of y in calculating the gradients. Finally, the Mini- Batch Gradient Descent(MGD) is implemented using the Accelerated Gradient Descent (AGD) approach in order to speed up the search for local optimum [28]. A summary of the algorithm is given in Algorithm 3. 4.3 Experimental Evaluation We have conducted extensive experiments to evaluate the performance of our proposed frame- work. The dataset and baseline methods used in our experiments along with the results obtained are described in this section. 4.3.1 Data and Preprocessing We evaluated the proposed approach on monthly precipitation data from the Global Historical Climatology Network (GHCN) data [85]. The dataset spans a 540-month time period, from January 1970 to December 2014. For brevity, we consider only data from weather stations in the United States (located between 24.74◦N to 49.35◦N and 66.95◦W and 124.97◦W). We also omit any station that has more than 50% missing values in its time series. The resulting dataset contains precipitation data from 1,510 weather stations. 54 t = t + 1; for s = 1,2,...,S (t) s and y (t) s ; Algorithm 2: DPMTL: Distribution Preserving Multi-task Learning TRAINING PHASE Input: Training data X = {Xs}, Y = {ys}, A, hs,ˆhs, τu, τv, s = 1, 2...S. Output: U, V = [v1, v2, ...vS]. repeat (t) s Uvs; randomly choose l sample data, X (t) ˆy s = X compute ps with equation (4.13); end for U = U − τu V = V − τv until converges L U by equation (4.14); L V by equation (4.15); PREDICTION PHASE Input: Testing data X∗ = {X∗ s}, U, V. Output: Predictions Y∗ = {ˆy∗ s}. for s = 1,2,...,S ˆy∗ s = X∗ sUvs; end for We selected 13 predictor variables from the NCEP Reanalysis [56] gridded dataset with the help of our domain expert. A brief introduction of these predictors is shown in Tab. 4.2. The mapping between the GHCN station and its NCEP Reanalysis grid is established by finding the closest grid cell to each GHCN station. Variables of each station is deseasonalized by subtracting the seasonal mean of that station and dividing by the corresponding seasonal standard deviation. 4.3.2 Experimental Setup We compared the proposed framework, DPMTL, against the following baseline algorithms: • Global model: The data from all stations are combined and used to train a global, lasso regression model. • Local model: A local model is trained for each station using only data from the given station. 55 Table 4.2: Predictor variables from NCEP reanalysis. 
Variable cprat dlwrf dswrf lftx omega pr_wtr prate rhum slp thick850 thick1000 tmax Description convective precipitation rate at surface longwave radiation flux at surface solar radiation flux at surface surface lifted index omega at sigma level 0.995 precipitable water content precipitation rate relative humidity at sigma level 0.995 Sea level pressure thickness for 850-500mb thickness for 1000-500mb maximum temperature at 2 m • GSpartan [136]: A MTL framework for spatio-temporal data. This framework is equivalent to setting the hyperparameter λ in DPMTL to 0 and adding L1 regularizers to the model parameters U and V. • Contour Regression: A distribution-preserving method for time series prediction [5]. Un- like DPMTL, contour regression is designed to improve distribution fit by minimizing the discrepancy between the empirical cumulative density function of the predicted and ground truth values, whereas DPMTL applies L2-distance on probability density functions esti- mated using KDE. Furthermore, contour regression is a single-task learning method, unlike the multi-task learning method used in DPMTL. The evaluation metrics used in this study are defined below, where ˆys corresponds to the predicted values for station s and ys corresponds to their true values. • RMSE: a measure of prediction error obtained by taking the square root of the average sum-of-squared errors in the predictions. Ns (ysi − ˆysi)2 RMSE = 1 S (cid:118)(cid:117)(cid:116) 1 S s Ns i 56 • RMS-CDF: a metric defined in [5] to evaluate the fit between two cumulative distribution functions created from a finite sample of observations. The metric is equivalent to applying RMSE on the ordered values of the data. (cid:118)(cid:117)(cid:116) 1 S Ns s Ns i RMS-CDF = 1 S (ys(i) − ˆys(i))2 • L2-Distance: another metric for measuring the divergence between two density functions, computed according to the formula given in Theorem 1. S s 1 N2 s (cid:18) Ns i, j L2-Distance = 1 S G(yi|yj, 2h2) + G( ˆyi| ˆyj, 2ˆh2) − 2G(yi| ˆyj, h2 + ˆh2) (cid:19) We apply 9-fold cross validation on the 45-year data (from 1970-2014) to evaluate the perfor- mance of various algorithms. Each fold corresponds to 5 years or 60 months worth of data. In each of the 9 rounds, 8 of the folds are selected to be the training set while the remaining fold is used as test set. Since the dataset has been standardized by their corresponding month, we set the Parzen window width hs and ˆhs to be half of its variance, i.e., 0.5. The spatial autocorrelation matrix A = exp(−Q/γ) is computed using the Haversine distance Q, with γ = 100. The number of latent factors k is set to 10 while the mini-batch size l is chosen to be 64. For gradient descent, the step sizes τu and τv are initialized to 10−8 and 10−7 and gradually decreased with increasing number of iterations. The hyperparameters α and λ are tuned via nested cross-validation on the training data. Since we want to minimize both residual and distribution errors, their trade-off must be considered during hyperparameter tunning. Let RMSSUM= (1 − β)RMSE + β RMS-CDF, where β is a parameter that controls the tradeoff between minimizing RMSE and RMS-CDF. The hyperparameters for all competing algorithms are chosen in such a way to minimize the RMSSUM on the validation set. To ensure fair comparison, we report the performance of all the algorithms for each β chosen in the range between [0,1]. For example, if β = 0, the chosen hyperparameters will be biased toward minimizing RMSE. 
57 Figure 4.2: Performance comparison between DPMTL and baseline approaches in terms of RMSE and RMS-CDF when varying the tradeoff parameter β between 0 and 1. Figure 4.3: Percentage of stations in which DPMTL outperforms the baseline methods (for β = 0). 58 0.760.780.80.820.840.86RMSE0.30.350.40.450.5RMS-CDFGlobalLocalGSpartanContourDPMTLWin-Loss Ratio of DPMTL over BaselinesRMSRMS CDFL2-Dist020406080100Ratio of Stations DPMTLOutperform Baselines (%)GlobalLocalGSpartanContour 4.3.3 Experimental Results Figure 4.2 summarizes the average performance of the various algorithms after 9-fold nested cross validation. For each algorithm, their RMSE and RMS-CDF metrics are computed as β is varied from 0, 0.25, 0.5, 0.75 to 1. The results suggest that Global (lasso) and GSpartan are not sensitive to the tradeoff parameter β, which is obvious since they are both non-distribution preserving methods. For Local (lasso) models, varying β reduces RMS-CDF slightly, though the change is more erratic as the tuned hyperparameters are highly sensitive to the training set size. For distribution preserving methods such as contour regression and DPMTL, Figure 4.2 suggests that there is a trade-off between minimizing prediction error and distribution error. Increasing β monotonically reduces RMSE-CDF at the expense of increasing RMSE. When the hyperparameters are tuned to minimize RMS-CDF (β = 1), DPMTL has the lowest RMS-CDF with only a slight increase in RMSE compared to the non-distribution preserving methods. Specifically, the increase in RMSE is only between 0.75% − 1.48% compared to the significant reduction in RMS-CDF between 7.5% − 17.75%. DPMTL also outperforms contour regression, a distribution-preserving single- task learning method. In fact, the curve for DPMTL is lower than that for contour regression, which justifies our rationale for developing a distribution-preserving MTL framework. Furthermore, the improvement in RMS-CDF for DPMTL is more pronounced for larger values of β. For example, when β = 0.5, DPMTL achieves a reduction in RMS-CDF by more than 21.9% compared to global, local, and GSpartan, with an increase in RMSE by at most 5.72%. We also analyze the performance of the algorithms on a station by station basis. Figure 4.3 shows the percentage of stations in which DPMTL outperforms each baseline method according to the given metrics. The results show that DPMTL outperforms both Global and Local models in more than 91% of the stations. It also outperforms GSpartan and Contour in more than 78% of the stations (in terms of RMS-CDF) and in more than 80% of the stations (in terms of L2-Distance). Figure 4.4 compares the RMSE values of GSpartan and DPTML for all stations in the dataset. While the overall RMSE for DPMTL is slightly worse than GSpartan, the maps shown in Figure 4.4 are quite similar to each other. The poor performance of DPMTL tends to occur at locations 59 (a) RMSE results for GSpartan. (b) RMSE results for DPMTL. Figure 4.4: Comparison between the RMSE of GSpartan and DPMTL for β = 0 (figure best viewed in color). (a) RMS-CDF error for GSpartan. (b) RMS-CDF error for DPMTL. Figure 4.5: Comparison between the RMS-CDF of GSpartan and DPMTL for β = 0 (figure best viewed in color). where GSpartan also has high RMSE. However, in terms of their RMS-CDF, the maps shown in Figure 4.5 suggest that the distribution fit of DPMTL improves significantly for the majority of the stations. 
GSpartan performs poorly with high prediction and distribution errors especially in the southeastern part of the United States, where there are more variability in their precipitation time series. Although the RMSE is also high for DPMTL in this region, its RMS-CDF improves significantly. The maps also show that the distribution error is generally lower for both methods along the Pacific and Atlantic coastal areas. Finally, we also examine characteristics of the predicted distribution generated by different algorithms. Figure 4.6 shows the precipitation histograms obtained using DPMTL, contour regres- sion, and GSpartan for a station located at [38.25◦N, 82.99◦W]. The results suggest that DPMTL 60 -120-110-100-90-80-70Longitude253035404550Latitude0.50.60.70.80.9-120-110-100-90-80-70Longitude253035404550Latitude0.50.60.70.80.9-120-110-100-90-80-70Longitude253035404550Latitude0.30.40.50.6-120-110-100-90-80-70Longitude253035404550Latitude0.30.40.50.6 Figure 4.6: Histogram comparison of precipitation distribution for a weather station located at [38.25◦N, 82.99◦W]. has the best fit to the ground truth distribution compared to contour regression and GSpartan. In particular, DPMTL was able to capture the skewness and heavy tail distribution. DPMTL also fits the distribution of below average precipitation values more effectively than the other two methods. Although the plot was shown only for one station, the good fit obtained by DPTML was found in many other stations in our dataset. 4.4 Related Work Multi-task learning (MTL) is a machine learning technique for solving multiple related modeling tasks jointly by exploiting and sharing the common information among tasks [148]. MTL improves generalization performance by leveraging domain-specific information to enable the pooling of information across different tasks, which is particularly useful when there are insufficient training instances available to solve each prediction task separately. It also provides a natural way to handle multi-location prediction problems [147] [137] [136]. For example, in [136], the prediction at each location can be considered a single task, which is related to the prediction tasks at other nearby locations. However, existing MTL methods focus primarily on minimizing point-wise prediction 61 -2-10123Precipitation Bins020406080# Data PointsGround TruthGSpartanContourDPMTL errors, ignoring how well the predicted distribution fits the true distribution of the data. Alternative approaches such as quantile mapping [82] have been developed to correct the bias between the true and predicted distribution. However, such techniques tend to have poor prediction accuracy [5]. While hybrid approaches such as contour regression has been developed [5], they are mostly designed for single-task learning. 4.5 Summary This chapter presents a distribution preserving multi-task regression framework for spatio- temporal data. Our framework employs a Parzen window based kernel density estimation (KDE) approach to compute the probability density function and L2-distance to measure the difference between two distributions. We evaluated our method on a real-world climate dataset, containing more than 1500 stations in the United States, and showed that the proposed framework outperforms four other competing baselines in at least 78% of the stations. 
62 CHAPTER 5 MULTI-TASK HIERARCHICAL LSTM FRAMEWORK FOR MULTI-STEP-AHEAD TIME SERIES FORECASTING This chapter investigates the challenge of solving multi-step-ahead time series forecasting using a multi-task learning approach. There are two key differences between the framework proposed in this chapter and the one developed in the previous chapter. First, instead of making single-step prediction, the framework proposed here is designed to generate forecasts at multiple consecutive future time steps. Second, the proposed framework is nonlinear, using a hierarchical long short- term memory (LSTM) architecture to capture the temporal dependencies of the data. As proof of concept, the proposed framework will be applied to the problem of forecasting monthly sea surface temperature using a suite of ensemble member forecasts from dynamical (physical) models as its predictor variables. The remainder of this chapter is organized as follows. The motivation and challenges of predicting sea surface temperature are described in Section 5.1, followed by a review of related literature in Section 5.2. The formal problem statement and background on LSTM architecture are given in Section 5.3. The proposed hierarchical LSTM structure is then described in Section 5.4. Section 5.5 presents the experimental results followed by conclusions in Section 5.6. 5.1 Ensemble Forecasting of Sea Surface Temperature With more than two-third of the Earth covered by ocean, accurate prediction of sea surface temperature (SST) is crucial due to its significant influence on the climate patterns around the world. For example, the El Niño phenomenon, which is related to the abnormally high sea surface temperature values in the equatorial Pacific Ocean, has been shown to cause unusual droughts and extreme rainfall across many regions [49]. In addition, SST is a key parameter for weather prediction and atmospheric model simulations and contributes to the development of tropical cyclones such as hurricanes [30]. Thus, its accurate prediction is essential and remains an active 63 research area [52, 78, 48]. Various dynamic forecasting systems, such as NCEP coupled forecast system model (CFSv2) and the Canadian Seasonal to Interannual Prediction System (CanCM3), have been developed over the years to predict SST and other climate variables. These models would simulate the physical processes that govern the fundamental dynamics of the Earth system. However, predicting the future states of a such a highly nonlinear dynamic system is very challenging, even with these state-of- the-art models. In particular, the predictions made by these models will likely never be perfect due to the uncertainties associated with the models themselves, their initial conditions, and the chaotic nature of the ocean and climate system. While the errors in predictions may be acceptable at for short-term predictions, these errors can compound and accumulate as predictions are made for more distant times. To better represent the effect of initial condition uncertainties, ensemble prediction with perturbed initial conditions has been adopted in operational forecasting [26]. To represent the uncertainties associated with models such as model physics, parameterization schemes and resolution, multi-model ensemble (also called superensemble) approach was introduced for SST predictions. 
Instead of a single forecast run from a single model, ensemble forecast from a group of models, each producing an ensemble of forecast from different initial conditions, are used to better estimate forecast uncertainties. The multi-model ensemble approach has been demonstrated to provide more reliable forecast than any single model. The North American Multi-Model Ensemble (NMME) [62] is an example of a multi-model ensemble for climate prediction, including monthly SST. Fig. 1.3 shows an example of the monthly SST predictions from NMME for the time period between June 2010 and November 2010. The time at which the forecasts were generated is known as forecast generation time whereas the number of months ahead the forecast was made is called the forecast horizon or lead time. For example, the forecasts shown in Fig. 1.3 are for lead times up to 8 months. Each month, every physical model is run multiple times with slightly perturbed initial conditions to create several different instances of each model. Each subplot of Figure 1.3 contains the entire ensemble of all models and all of their instances (i.e. the superensemble) generated within a single month. 64 However, in many cases it is preferable to have a single point estimate of the expected SST rather than a superensemble of disparate SST predictions. With multi-model ensemble prediction, a common way to derive a deterministic forecast from the superensemble is to take the average or median [103], while other methods have been developed to weigh the models differently based on their performance [135]. These previous approaches suffer from two main deficiencies. First, they are unable to capture complicated non-linear relationships between different ensemble members. Second, they may not fully capture the temporal autocorrelation present in the time series. The latter problem is illustrated in Fig. 5.1. Each row in the diagram corresponds to a set of predictions generated at a given forecast generation time while each column represents the set of forecasts generated for a given lead time. The figure shows there are two types of temporal dependencies that should be modeled in multi-step-ahead time series prediction: 1. Temporal autocorrelation between different lead time forecasts for the same forecast gener- ation time. This is denoted as lead time level autocorrelation (i.e., autocorrelation between elements in each row) in Fig. 5.1. 2. Temporal autocorrelation of all forecasts for a given lead time. This is denoted as generation time level autocorrelation (i.e., autocorrelation between elements in each column) in Fig. 5.1. Incorporating both types of autocorrelation into the learning framework would be useful to ensure robustness of the models and temporal consistencies of their predictions especially for noisy data. To address these challenges, this chapter presents a novel hierarchical LSTM framework for aggregating the multi-model ensemble forecasts of SST. The proposed architecture consists of three main components. The first component is a lead time encoder LSTM for extracting a high-level representation of the predictions made at each lead time. The second component is the generation time encoder which is an LSTM that extracts a high-level representation of the physical model predictions for a particular lead time and generation time window. The third component is a fully connected network to convert the output of the generation time encoder to a final prediction. 
65 Figure 5.1: A two-level autocorrelation structure in multi-lead time forecasting of the SST data shown in Fig. 1.3. 5.2 Related Work To address the multi-step-ahead time series forecasting problem, a novel MTL framework using hierarchical LSTM is proposed in this chapter. This section reviews previous research on multi-task learning in deep neural networks (DNN) and hierarchical LSTM. 5.2.1 Multi-task Learning in DNN Deep learning has recently exploded in popularity thanks to its ability to effectively model com- plicated non-linear relationships. It has been succesfully applied to images [64], video [112], and natural language processing [124] among other important tasks. The conventional deep learning approaches to multi-task learning can be categorized into two types: hard parameter sharing and soft parameter sharing of hidden layers [105]. In terms of hard parameter sharing, multiple neural networks are built, and they are constrained to share the parameters at a certain number of layers while learning parameters on their own for the rest layers [21]. In [53], the cross-language layer 66 is shared by the model of all languages, and differentiates only at the output layer. In terms of soft parameter sharing, regularizer skills are employed to constrain the correlations between model parameters from different tasks [37, 143]. 5.2.2 Hierarchical LSTM Long short-term memory (LSTM) network is one of the most popular deep learning architecture for modeling sequential data such as time series, where the data points exhibit strong temporal autocorrelation, and document data, where the appearance of a word depends highly on its context. Hierarchical LSTM [144, 44, 100] is a variant of LSTM that has been developed in recent years to capture the myriad types of relationships and processing of sequential data. For example, in document modeling, a hierarchical LSTM architecture was proposed in [144] to represent the multi-level structure of a document. Specifically, a document contains words that are sequentially connected to form a sentence, which in turn, is connected to other sentences to form a paragraph. In [144], the first level of the LSTM hierarchy encodes the word level dependencies, while the second level encodes the sentence level dependencies. The document- sentence-word hierarchy of the data is analogous to the two-level temporal correlation shown in Fig 5.1. A hierarchical LSTM framework was also developed in [44] for the multi-modal data fusion problems, where each modality may refer to sensors, video, audio, etc. The first LSTM layer of the hierarchy learns the modality-specific temporal dynamics whereas the second layer combines the representation of each modality to generate an embedding for each time step. However, none of these hierarchical LSTM architectures [144, 44] are designed for multi-task learning, and thus, cannot be easily adapted to the multi-step-ahead time series prediction problem. In [100], a hierarchical LSTM with attention mechanism was developed for time series prediction with multiple input (driving) time series. The proposed architecture builds upon previous research on attention mechanism [13] to improve performance of RNN. Specifically, the attention mechanism acts as a selection/filtering mechanism to the input data. 
The first layer of their architecture uses an input attention mechanism to extract the relevant input variables for the analysis while the second 67 layer uses an LSTM with temporal attention mechanism to learn a hidden representation for each time step and selects the relevant hidden states for subsequent time series prediction task. The outputs of the second layer are then fed into another LSTM for modeling the response time series. Unfortunately, the original architecture was not designed for multi-task learning problems. It has to be modified to make predictions at multiple future time steps, without using the ground truth values of the shorter-term forecasts to make longer-term forecasts. A modified version of this architecture is therefore used as one of the baseline methods in the experiment section. 5.3 Preliminaries This section formalizes the multi-step time series forecasting problem and introduces the basic formulation of LSTM model. 5.3.1 Problem Statement Let x ∈ RM be an input vector of predictor variables generated at time t for the forecast at lead time l, where t ∈ [1, T] and l ∈ [1, L]. Furthermore, let yt+l ∈ R be the corresponding ground truth value of the target variable at time t + l. The multi-step-ahead time series forecasting is formally defined as follows. Definition 1 (Multi-Step-Ahead Time Series Forecasting) Given a training set, D = {Xt, yt}T t=1, where Xt ∈ RL×M and yt ∈ RL, multi-step-ahead time series forecasting seeks to learn a target function fl : RM → R that maps each input vector x ∈ RM to its corresponding output yt+l. Let o ≡ fl(x) denote the output of the target function for lead time l when applied to the predictor variables generated at time t. The effectiveness of the target function output can be measured by comparing it against the ground truth value yt+l. In the context of ensemble forecasting of SST, the predictor variables correspond to forecasts generated by a set of M physical models. Specifically, at each generation time t, a physical model m is run to generate forecasts up to L future time steps (lead times). Each physical model is run by dm 68 times, with a varying initial and boundary conditions. The output of each dm run is also known as an ensemble member forecast and dm is the number of ensemble members associated with model m. The forecasts of these dm ensemble members are often averaged together into a single point prediction for model m. The resulting averaged forecasts of the M models are then used as input to the proposed framework. 5.3.2 Long Short-Term Memory (LSTM) Network Our model is based upon the popular Long Short-Term Memory network (LSTM), a type of Recurrent Neural Network (RNN) [50, 46] that has proven to be effective in dealing with sequential data, especially time series data with long-term dependencies. At each time step, i, the hidden state of the LSTM, hi, depends on the hidden state from the previous time step, hi−1, as well as the input in the current time step, xi. We compactly represent this relationship using the following equation: ht = LSTM(xt, ht−1) (5.1) LSTM maintains the long-term history of a time series with a cell state ct. The information that flows into and out of the cell state are regulated with several gates. 
First, the input xt at current time t and the previous hidden state ht−1 are used to change the cell memory as follows: ¯ct = tanh(Wcxt + Ucht−1 + bc) The amount of change to the current cell state is controlled by an input gate it while the amount for the cell to maintain its previous state information is regulated by a forget gate ft: ct = it ◦ ct−1 + ft ◦ ¯ct, where the input and forget gates also depend on xt and ht−1. it = σ(Wixt + Uiht−1 + bi) ft = σ(W f xt + U f ht−1 + b f ) 69 The cell state provides the information needed to compute the output ht, whose value is regulated by an output gate. ht = ot ◦ tanh(ct) ot = σ(Woxt + Uoht−1 + bo) In the above equations, σ(·) denotes a sigmoid activation function, tanh(·) denotes a hyperbolic tangent function, and ◦ denotes the Hadamard product operation. 5.4 Proposed MSH-LSTM Architecture This section presents an overview of the proposed multi-step-ahead hierarchical LSTM (MSH- LSTM) framework for time series forecasting. The objective of MSH-LSTM is to jointly train a set of inter-dependent LSTM models that capture both the temporal dependencies between lead times as well as those between consecutive generation times, as illustrated in Fig. 5.1. A schematic diagram of the proposed MSH-LSTM architecture is shown in Fig.5.2. The architecture is composed of three layers—a lead time encoder layer, a generation time encoder layer, and an output layer. Details of each layer are discussed next. 5.4.1 Lead Time Encoder Layer This layer enforces the temporal dependencies between different lead time forecasts for the same forecast generation time. Specifically, let vt = {xx · · · x} be a sequence of length L corresponding to the ensemble forecasts generated at time t for each lead time l ∈ [1, L], where each element of the sequence is an M-dimensional vector, i.e., x ∈ RM. The lead time encoder layer takes this sequence as input and produces a sequence of hidden states {hh · · · h} of the same length using an LSTM network, LSTMlead, where each h ∈ Rd. The outputs of LSTMlead can be viewed as feature representation for each forecast lead time l, embedded with the temporal dependencies between them. More formally, the relationship between the input and 70 Figure 5.2: Proposed hierarchical LSTM architecture for multi-step-ahead time series forecasting. Blocks with different shades of colors (in the generation time and output layers) are trained inde- pendently and have different parameters while those with the same color (lead time encoder layer) are trained jointly and have identical parameters. output of the lead time encoder can be expressed as follows: h = LSTMlead(h, x) (5.2) The structure for LSTMlead is depicted by the orange boxes at the first (left-most) layer of Fig. 5.2. Each box is assumed to process a sequence of length L from different generation times, i.e., vt−K+1, vt−K+2, · · · , vt. Although they are depicted as separate boxes, note that the parameters of LSTMlead are shared across the boxes, i.e., the parameter values are identical when processing every sequence in the time window K. 71 xxxxxxxxxhhhhhhhhhhhhhhhhhhsssssssssoooLSTMleadLSTM1genLSTMlgenLSTMLgen. . .. . .. . .. . .. . .. . .. . .. . 
.g1g2gLLead Time Encoder(one layer shared by all generation time)Generation Time Layer(Each generation time has an independent layer)Output Layer(Each lead time taskhas an intendent layer) 5.4.2 Generation Time Encoder Layer The second layer of MSH-LSTM, which corresponds to the generation time encoder, models the temporal dependencies between forecasts generated at different times in a given time win- dow, K, for each lead time l. To illustrate this, let t be the current time step and wl = {hh · · · h} be a sequence of length K, whose elements correspond to the outputs of the hidden states generated for the same lead time l by the previous layer of MSH-LSTM. In the case when t < K, zero padding is used to ensure every sequence wl is of the same length, K, at all times. The generation time encoder will take the sequence for a specific lead time l (i.e., wl) as input and produces another sequence of hidden states ss · · · s as output using an , where each s ∈ Rp and l ∈ [l, L]. More specifically, the relationship LSTM network LSTMgen between the inputs and outputs of the generation time encoder for lead time l can be expressed as follows: l s = LSTMgen l (s, h) (5.3) l The outputs of LSTMgen can be viewed as a representation embedding by taking into account the temporal dependencies between different forecast generation times (within the time window K) for a given lead time l. Furthermore, note that s incorporates information about both the lead time level autocorrelation and generation time level autocorrelation shown in Fig. 5.1. In details, while s explicitly learns the temporal dependencies between different forecast generation times using Eq. (5.3), and the temporal dependencies between the lead times are implicitly captured through the hidden states h as its input. Unlike the lead time encoder layer, the parameters of the generation time encoders are not shared, because the temporal relationships vary for different lead times. Such varying parameter values are illustrated by the different shades of blue boxes in Fig. 5.2, where each box is assumed to process a sequence of length K for different lead times, i.e., w1, w2, · · · , wL. 72 5.4.3 Output Layer Finally, the output layer of MSH-LSTM will take each hidden state s as input and uses a fully connected network to generate its prediction for lead time l at generation time t: o = gl(s) (5.4) The fully connected networks are depicted by the green boxes in Fig. 5.2. As we assume that the models may vary across different lead times, the parameters are not shared across different lead times, and an independent gl is learned for each lead time. This is depicted by the varying shades of green boxes in the diagram. 5.4.4 Parameter Estimation Let Θ be the set of parameters associated with the proposed MSH-LSTM framework to be estimated from data. MSH-LSTM is a multi-task learning framework as the model parameters for all lead times are (1) tied via the hierarchical LSTM structure and (2) jointly estimated by optimizing the following least-square loss function: (o − yt+l)2 (5.5) T L Θ∗ = arg min Θ t l The network parameters in Θ are initialized randomly and then trained in an end-to-end fashion using Adam [61]. To avoid overfitting, a dropout strategy is employed during the training process, which improves the robustness of the network. The stopping criteria for training the network depends on its performance on a separate validation set. 
Since the error on validation set generally has a decreasing trend as the training epochs iterate, the training process is terminated when the validation error converges. A pseudocode summarizing the training procedure is shown in Algorithm 3. Hyperparameters of the framework such as the number of nodes at each layer, learning rate, batch size, and dropout rate are also tuned based on the performance of the framework on validation set. The entire architecture was implemented in PyTorch [95]. 73 Algorithm 3: Training process for MSH-LSTM: Multi-task Hierarchical LSTM t−1, K, and dropout rate p; , and gl (where l ∈ [1, L]); Input: Training set D = {Xt, yt}T Output: Parameter set Θ∗ for LSTMlead, LSTMgen i = 0; Initialize Θ randomly as Θ(0); repeat l for batch = 1,2,... i = i + 1; Randomly drop out neural network units with rate p; Update Θ(i) with back propagation; Recover the network units that were dropped out; end for Compute error on validation set by replacing Θ(i) into Equations (5.2), (5.3), (5.4), (5.5); until validation error converges Θ∗ = Θ(i) 5.5 Experimental Results 5.5.1 Data Set The performance of the proposed MSH-LSTM framework is evaluated using an ensemble of monthly sea surface temperature forecasts from the North American Multi-Model Ensemble (NMME) project [62]. Monthly SST observations are collected for a 10-year period from Jan- uary 1982 to December 2010 for a total of 384 months. The forecast lead times are set to a maximum of 9 months, starting from the end of the current month1 to the end of 8 months ahead, for a total of L = 9 prediction tasks. The data was obtained from 58 grid cells located in the tropical Pacific area. Forecasts from M = 7 physical models, which are listed in Table 5.1, are used as predictor variables. As has been mentioned before, although each physical model generates multiple ensemble member forecasts (from varying initial conditions), the average forecast value of the members was used to represent the predictions of each physical model. The window size K for creating sequences of different forecast generation times is set to 6. The effectiveness of the proposed framework was evaluated on 6 different training-validation- 1As the forecasts are generated at the beginning of the month, forecasting the current month means predicting the average sea surface temperature over the course of the month that is currently in progress. 74 Table 5.1: Physical models from NMME used for monthly sea surface temperature prediction Index Model name 1 2 3 4 5 6 7 CMC1-CanCM3 CMC2-CanCM4 COLA-RSMAS-CCSM3 COLA-RSMAS-CCSM4 GFDL-CM2p1 NCEP-CFSv2 NCAR-CESM1 # ensemble members 10 10 6 10 10 10 24 Table 5.2: Partitioning of SST data into multiple training, validation, and test splits. Data Set Training Period SST1 SST2 Validation Period Jan 2007 - Aug 2008 May 2009 - Dec 2010 Sep 2007 - Apr 2009 Testing Period Jan 1982 - Apr 2006 Jan 1982 - Aug 2004, Jan 2010 - Dec 2010 May 2005 - Dec 2006 Jan 1982 - Dec 2002, Sep 2003 - Apr 2005 May 2008 - Dec 2010 Jan 1982 - Apr 2001, Sep, 2006 - Dec 2010 Jan 1982 - Aug 1999, Jan 2005 - Dec 2010 May 2000 - Dec 2001 Jan 1982 - Dec 1997, Sep 1998 - Apr 2000 May 2003 - Dec 2010 SST3 SST4 SST5 SST6 Jan 2002 - Aug 2003 May 2004 - Dec 2005 Jan 2006 - Aug 2007 Sep 2002 - Apr 2004 Jan 2001 - Aug 2002 testing splits, as shown in Table 5.2. To avoid overlap due to the multi-step-ahead predictions, a gap of 9 months is introduced between the training-validation and validation-testing sets. 
Each split has 20 generation time steps for validation and another 20 generation time steps for testing. Since the data is collected over 58 grid cells, there are altogether 20 (generation time steps) × 58 (grid cells) × 9 (lead times) test instances to be predicted in each split. Furthermore, the sequences in the training, validation, and testing sets are centered by subtracting their monthly values with the corresponding monthly means computed from the training set. 5.5.2 Baseline Algorithms The proposed MSH-LSTM framework was compared against the following baselines: • EnS: This is a simple approach that uses the ensemble mean to form a point estimate for the ensemble of model forecasts [26]. 75 • Ridge: In this approach, an independent ridge regression model is trained for each lead time task. The input of the model is a 7-dimensional vector, corresponding to the averaged ensemble member forecasts for each of the 7 physical models. • FFN: This approach trains an independent feed-forward neural network with two hidden layers and an output layer for each lead time. Hyperparameters to be tuned on validation set include the number of nodes in each layer, dropout rate, learning rate and batch size. The input of the model is similar to that for ridge regression. • LSTM: An independent LSTM model is trained for each lead time task. The input to the LSTM is slightly different from that for ridge regression and feed-forward neural network as it requires a sequence of length 6 (K = 6), where each element of the sequence is a 7-dimensional vector. • GFFN: This global approach trains a single feed-forward neural network to predict all 9 lead time tasks. The input is similar to that for ridge regression and feed-forward neural network. • GLSTM: This is similar to the previous global approach except it uses LSTM instead of a feed-forward network. The input of the model is similar to that for independent LSTM models. • ARIMA: This corresponds to the autoregressive integrated moving average model, which is typically used for time series prediction [81]. For each lead time task, an independent ARIMA model is trained. Its input corresponds to historical SST values up to 6 previous time steps, i.e., it does not use model forecasts from NMME. • DARNN: This is a state-of-the-art hierarchical LSTM network for time series prediction with exogenous variables [100]. It is based on a dual stage attention-based recurrent neural network model. Although it is a hierarchical LSTM, its multi-level structure is designed to capture relationships between its input features (i.e., physical model forecasts) and forecast generation times, but not the dependencies between different lead times, unlike the proposed 76 MSH-LSTM framework. Since the network is not designed for multi-step-ahead prediction, an independent DARNN model is trained for each lead time task. The input to the model is a sequence of length 6, similar to that for LSTM and MSH-LSTM. • MTL-LSTM: A 3-layer LSTM network based on the multi-task learning approach described in [105, 21]. The bottom layer is a feed forward layer shared by all lead time tasks. The middle layer is a standard LSTM while the top (output) layer is a feed forward neural network. While the model parameters at the bottom layer are shared, both the middle and top layers are built independently for each lead time task. Its input corresponds to the ensemble model forecasts for all 9 lead times with a time window of 6 forecast generation times. 
In other words, the input is a length-6 sequence of 7 × 9-dimensional matrices. 5.5.3 Evaluation Metric The prediction error for each method is computed using the root mean square error (RMSE) metric. The metric can be computed independently for each lead time prediction task as well as for all the tasks: Error for a given lead time, l: RMSEl = Overall error for all lead times: RMSE = (yt+l − o)2 (yt+l − o)2 (cid:118)(cid:117)(cid:116) 1 (cid:118)(cid:117)(cid:116) 1 T T T t L T t l where o denotes the predicted value for lead time l at the forecast generation time t. 5.5.4 Experimental Settings For each method, its hyperparameters are tuned using the validation set. The hyperparameter for ridge regression corresponds to the ridge regularizer. For DNN-type approaches including FFN, LSTM, GFFN, GLSTM, DARNN, MTL-LSTM and the proposed MSH-LSTM, the number of nodes in each hidden layer is a hyperparameter that needs to be tuned. For all the methods, the 77 number of nodes is varied from 10 to 50. Other hyperparameters include batch size, initial learning rate and dropout rate are also tuned independently for each dataset based on its performance on the corresponding validation set. Furthermore, as the loss function for DNN is non-convex, different initialization of the model parameters may yield different solutions. Consequently, we test each hyperparameter setting for the FFN, LSTM, GFFN, GLSTM, MTL-LSTM and MSH-LSTM with 15 different initializations of the weights. RMSE values are reported based on their average over these 15 runs. 5.5.5 Results and Discussion A summary of the RMSE values, averaged across the 6 data splits, is shown in Table 5.3. In terms of their overall RMSE, simple baseline methods such as ensemble mean and ARIMA have the worst performance among all the methods. This shows the importance of combining the ensemble member forecasts in a weighted fashion to obtain better predictions instead of using only the mean forecasts or historical time series alone for making long-term predictions. The next worst performer is ridge regression, which suggests the importance of using non-linear approaches to aggregate the ensemble predictions. Furthermore, global models such as GFFN and GLSTM also have worse RMSE compared to their independent local model counterparts (FFN and LSTM). This suggests there is significant difference in the skills of the individual physical models for making predictions at different lead times, which explains the inferior performance of the one-size-fits-all global models. Among all the competing methods, MSH-LSTM achieves the lowest overall RMSE, which demonstrates the effectiveness of the proposed framework for the multi-step-ahead ensemble fore- casting problem. In particular, it outperforms both MTL-LSTM, which is based on a conventional approach to multi-task deep learning [105] and DARNN, which is the state-of-art DNN-typed hierarchical approach for time series prediction with exogenous variables [100]. The proposed MSH-LSTM framework also outperforms both MTL-LSTM and DARNN for the majority of the lead times except for lead times 0 and 8. The effectiveness of MSH-LSTM can be explained as it is the only framework that considers lead-time level autocorrelation, whereas other frameworks 78 such as MTL-LSTM and DARNN only account for generation-time level autocorrelation and the relationship between predictors. 
The result also suggests that the hierarchical LSTM architecture of MSH-LSTM is more suitable for the problem than the hard parameter sharing strategy employed by MTL-LSTM. In terms of lead-time RMSE, MSH-LSTM has the lowest RMSE for lead times 1 to 5 and has among the top-3 lowest RMSE for lead times 6, 7, and 8. Both MSH-LSTM and MTL-LSTM appear to perform slightly worse than conventional LSTM for lead time 0. One possible explanation is that, since both MSH-LSTM and MTL-LSTM are multi-task learning approaches, the long-term forecasts performance may be improved at the expense of a slight degradation in their accuracy for forecasting lead time 0. In addition, MSH-LSTM is outperformed by FFN for lead times 6 to 8, though by relatively small margin. Another interesting observation is that the independent LSTM models generally do a much better job at short-term forecasting but perform worse at long-term forecasting (4 months or more) compared to independent FFN models. In both approaches, the models are trained independently for each lead time task. As the longer-term predictors have higher variance in the ensemble forecasts, this suggests that LSTM may not be as effective dealing with higher variance in the predictors compared to FFN. This limitation of LSTM also seems to affect the performance of MTL-LSTM and MSH-LSTM. Nevertheless, it appears that MSH-LSTM is able to compensate for such limitation by regularizing its predictions to ensure the generation-time and lead-time level autocorrelations are modeled. Overall, its RMSE performance is comparable to FFN for longer lead time forecasts (4 months or more). The previous analysis compares the performance of different methods in terms of their overall and lead-time specific RMSE. The reported RMSE values are averaged over all 58 grid cells in the data. To determine how well each method performs on the grid cells, Table 5.4 summarizes the percentage of grid cells in which the method specified in the given row has lower RMSE than the method specified by the column. The results show that MSH-LSTM outperforms all the baseline methods in more than 70% of the grid cells. Furthermore, although the RMSE difference shown 79 Table 5.3: Comparison of RMSE values among the competing methods for all 9 forecast lead times LSTM MTL- LSTM ARIMA DARNN MSH- LSTM 0.1996 0.2090 0.2173 0.3057 0.3313 0.3230 0.3621 0.3890 0.3827 0.4032 0.4302 0.4176 0.4426 0.4648 0.4735 0.4744 0.5050 0.5029 0.5019 0.5260 0.5264 0.5192 0.5326 0.5367 0.5519 0.5445 0.5355 1.2898 1.3514 1.3432 Ridge GFFN GLSTM FFN 0.2811 0.3940 0.4437 0.4697 0.4900 0.5058 0.5205 0.5302 0.5407 1.4114 EnS 0.2652 0.4322 0.5307 0.5933 0.6402 0.6757 0.7039 0.7268 0.7458 1.8268 Lead 0 1 2 3 4 5 6 7 8 overall 0.2830 0.3652 0.4058 0.4313 0.4574 0.4794 0.4969 0.5151 0.5400 1.3443 0.2787 0.3560 0.3974 0.4286 0.4633 0.4965 0.5185 0.5344 0.5546 1.3673 0.2129 0.3435 0.4015 0.4286 0.4548 0.4775 0.4916 0.5097 0.5298 1.3136 0.2592 0.3706 0.4414 0.4901 0.5279 0.5529 0.5701 0.5827 0.5923 1.4964 0.2610 0.3475 0.4053 0.4399 0.4747 0.5132 0.5155 0.5333 0.5245 1.3641 Table 5.4: A win-loss table comparing the performance of the competing methods across all 58 grid cells. Each (i, j)-th entry in the table represents the fraction of grid cells in which method i has lower RMSE than method j. 
EnS 0 EnS 0.7414 Ridge 0.8621 GFFN 0.8276 GLSTM 0.8448 FFN LSTM 0.8103 MTL-LSTM 0.8103 ARIMA 0.6379 DARNN 0.7586 MSH-LSTM 0.8448 Ridge GFFN GLSTM FFN 0.2586 0 0.8276 0.7759 0.8103 0.7414 0.7931 0.3793 0.7069 0.8966 0.1724 0.2241 0.5862 0 0.5862 0.5862 0.4828 0.1379 0.4483 0.7759 0.1379 0.1724 0 0.4138 0.5690 0.4655 0.4483 0.1207 0.3621 0.7414 0.1552 0.1897 0.4310 0.4138 0 0.3793 0.3621 0.1552 0.2759 0.7586 LSTM MTL- LSTM ARIMA DARNN MSH- LSTM 0.1552 0.1897 0.1897 0.1034 0.2069 0.2586 0.2586 0.5517 0.5345 0.4138 0.5172 0.2241 0.2414 0.6379 0.6207 0.2931 0.5172 0 0.2241 0 0.4828 0.1207 0.1379 0.0345 0.0690 0.3276 0.3793 0.7069 0.7759 0 0.3621 0.6207 0.8793 0.8621 0.8448 0.8793 0 0 0.8276 0.9655 0.2414 0.2931 0.6379 0.5517 0.7241 0.6207 0.6724 0.1724 0 0.9310 in Table 5.3 for MSH-LSTM and FFN is not that large, MSH-LSTM actually outperforms FFN for more than 75% of the grid cells. MSH-LSTM also outperformed MTL-LSTM (by more than 77%) and DARNN (by more than 93%), which demonstrates the effectiveness of proposed MSH-LSTM framework for the multi-step-ahead ensemble SST forecasting problem. To illustrate the performance improvement achieved by MSH-LSTM in different grid cells, Fig. 5.3 shows a map of the RMSE values for ensemble mean and MSH-LSTM, where lighter (yellow) color indicates higher RMSE values and darker (blue) color indicates lower RMSE. The maps were plotted for different lead times. Figs. 5.3(a) and 5.3(b) correspond to the the overall RMSE for all 9 lead times. Figs. 5.3(c) and 5.3(d) correspond to short-term forecasts (from 0 to 2 months), 5.3(e) and 5.3(f) are for mid-term forecasts (between 3 to 5 months lead time), and 80 (a) EnS (b) MSH-LSTM (c) Short-term EnS (d) Short-term MSH-LSTM (e) Mid-term EnS (f) Mid-term MSH-LSTM (g) Long-term EnS (h) Long-term MSH-LSTM Figure 5.3: Performance on each grid cell by EnS and MSH-LSTM. 5.3(g) and 5.3(h) for long-term forecasts (more than 6 months). The maps show that MSH-LSTM outperforms the ensemble median in the majority of the grid cells, especially for mid-term and long-term predictions. Fig 5.4 shows the amount that the temporal autocorrelation of the ground truth time series is 81 0.81.01.21.41.61.82.02.2RMSE0.81.01.21.41.61.82.02.2RMSE0.200.250.300.350.400.450.500.550.60RMSE0.200.250.300.350.400.450.500.550.60RMSE0.30.40.50.60.70.8RMSE0.30.40.50.60.70.8RMSE0.30.40.50.60.70.80.9RMSE0.30.40.50.60.70.80.9RMSE preserved by each approach. The temporal autocorrelation for any given time series at a lag k, ACF(k), is obtained by shifting its sequence of values by k time steps (lags) and computing the correlation between the shifted sequence and the unshifted one. If there are p such shifts, this produces an autocorrelation vector of length p, i.e., [ACF(1), ACF(2), · · · , ACF(p)], which is then compared against the autocorrelation vector of the ground truth time series. The degree to which temporal autocorrelation is preserved can be determined by taking the Euclidean distance between the two vectors. Fig. 5.1 depicts two types of temporal autocorrelations that must be preserved by multi-step- ahead time series forecasting methods—lead-time level and generation-time level autocorrelation. For generation-time level autocorrelation, the results shown in Fig. 
5.4a suggest that LSTM and its variants, including the proposed MSH-LSTM approach, closely model the temporal autocorrelation structure of the ground truth SST time series as the Euclidean distances calculated for LSTM-based approaches are relatively smaller compared to non-LSTM methods such as FFN and GFNN. This is not surprising as the LSTM-based methods are designed to capture the temporal dependencies of the forecasts generated for different time steps, while the non-LSTM methods are not designed for modeling temporal dependencies of the data. For lead-time level autocorrelation, at first glance, the results shown in Fig. 5.4b appear to suggest that both MSH-LSTM and DARNN do not capture the lead-time level autocorrelation as effectively as other baseline methods. To further illustrate this, Fig. 5.5 shows the correlogram plots for each forecasting method as well as for the ground truth SST time series. Notice that the lead-time level autocorrelation for MSH-LSTM and DARNN are much higher than other approaches and the ground truth, as the number of lags increases. Despite over-estimating the lead-time level temporal autocorrelation, the RMSE results shown in Table 5.3 suggest that MSH-LSTM was able to exploit the higher autocorrelation at longer lags and in this way improve its long-term forecasts. Similarly, the results in Table 5.3 also show that DARNN reaches the best RMSE at lead time 8, which is consistent with it having the highest temporal autocorrelation at lag 8. However, the RMSE of DARNN is worse than MSH-LSTM and other baselines at other lead times even though its 82 (a) Generation time level temporal autocorrelation. (b) Lead time level temporal autocorrelation Figure 5.4: Comparison between the temporal autocorrelation of the proposed MSH-LSTM frame- work and other baseline methods. autocorrelation is still high. This suggests that MSH-LSTM was able to leverage its high lead-time autocorrelation to improve its prediction accuracy more effectively than DARNN. Finally, we evaluate the importance of each physical model for different lead time tasks. The models of both ridge regression and MSH-LSTM are analyzed. For ridge regression, the importance of each model can be evaluated by examining the magnitude of coefficients. For MSH-LSTM, the gradient of the loss w.r.t each physical model forecast is computed [111], and the distribution of the gradient magnitudes is investigated. In Figure 5.6, the gradients for the 7 physical models are 83 012345678Lead Time0.020.040.060.080.100.120.140.16Euclidean DistanceGFFNFFNGLSTMLSTMMTL-LSTMDARNNMSH-LSTMGFFNFFNGLSTMLSTMMTL-LSTMDARNNMSH-LSTMApproach0.0000.0250.0500.0750.1000.1250.1500.175Euclidean Distance Figure 5.5: Correlogram plots for lead-time level autocorrelation of MSH-LSTM and other methods (including the ground truth SST time series). illustrated by box plots while the magnitude of the ridge regression coefficients is illustrated with the red curve. For both ridge regression and the MSH-LSTM the most important physical model tends to be model 2 (CMC2-CanCM4). Model 4 also has a relatively large importance for both MSH-LSTM on shorter-term forecasting models. 5.6 Summary In this chapter, a novel multi-task neural network architecture is proposed to address the multi-step-ahead time series forecasting problem. The proposed framework considers each lead time forecast as a separate learning task and employs a hierarchical LSTM structure to capture both lead-time and generation-time level autocorrelation of the data. 
The effectiveness of the proposed architecture is evaluated on a 29-year monthly sea surface temperature data from the North American Multi-Model Ensemble (NMME) project. The results showed that the proposed method outperformed existing hierarchical and non-hierarchical neural network, MTL, and other conventional time series prediction methods. 84 12345678Lag (in months)0.00.20.40.60.81.0AutocorrelationGFFNFFNGLSTMLSTMMTL-LSTMDARNNMSH-LSTMGround Truth (a) Lead Time 0 (b) Lead Time 1 (c) Lead Time 2 (d) Lead Time 3 (e) Lead Time 4 (f) Lead Time 5 (g) Lead Time 6 (h) Lead Time 7 (i) Lead Time 8 Figure 5.6: Gradient distribution of each model for different lead time tasks. The x-axis represents the indices of physical models that are listed in Table. 5.1. The box plots are gradients distribution of each model for MSH-LSTM. The red curve are the computed ridge regression coefficients. 85 12345670.000000.000050.000100.000150.00020MSH-LSTM Grads−0.050.000.050.100.150.200.250.300.35Ridge Weights12345670.000000.000020.000040.000060.000080.000100.000120.000140.00016MSH-LSTM Grads0.000.050.100.150.200.250.300.35Ridge Weights12345670.000000.000020.000040.000060.000080.000100.00012MSH-LSTM Grads−0.050.000.050.100.150.200.250.30Ridge Weights12345670.000000.000020.000040.000060.00008MSH-LSTM Grads−0.050.000.050.100.150.200.25Ridge Weights12345670.000000.000020.000040.000060.000080.00010MSH-LSTM Grads−0.050.000.050.100.150.200.25Ridge Weights12345670.000000.000020.000040.000060.00008MSH-LSTM Grads−0.10−0.050.000.050.100.150.200.250.30Ridge Weights12345670.000000.000010.000020.000030.000040.000050.000060.00007MSH-LSTM Grads−0.10.00.10.20.3Ridge Weights12345670.000000.000010.000020.000030.000040.000050.00006MSH-LSTM Grads−0.10.00.10.20.3Ridge Weights12345670.000000.000010.000020.000030.00004MSH-LSTM Grads−0.10.00.10.20.3Ridge Weights CHAPTER 6 CONCLUSIONS & FUTURE WORK 6.1 Summary of Thesis Contributions With the growing complexity and prevalence of spatial and temporal data from various disci- plines, it has become an important but more challenging research topic to learn accurate predictive models from such data. For some complex prediction tasks, learning a single model to all obser- vations is often undesirable as such a model may not capture the intricate details and variabilities across different samples. In this thesis, I present several innovative frameworks of multi-task learning techniques on some real-world applications on spatial and temporal data, including activity recognition in multi-modal sensor data to large-scale climate and sea surface temperature predictions. I investigate into their related problems, examine the fundamental challenges, and propose novel MTL frameworks for solution. The contributions of this thesis are summarized as below. In Chapter 3, a probabilistic MTL framework was designed for multi-modal time series classi- fication problem at multiple locations. The framework was motivated by the need to address both temporal and spatial dependencies of the classes as well as the block missing value problem, which arises due to the varying types of features (modalities) available at different locations. As proof of concept, the proposed framework was applied to the activity recognition problem, where the task is to identify user activities (e.g., walking, sitting, jumping, or lying down) in multiple rooms using multi-modal sensor data from accelerometer, video cameras, and environmental sensors. 
The effectiveness of the proposed architecture is evaluated on 29 years of monthly sea surface temperature data from the North American Multi-Model Ensemble (NMME) project. The results showed that the proposed method outperformed existing hierarchical and non-hierarchical neural networks, MTL approaches, and other conventional time series prediction methods.

Figure 5.6: Gradient distribution of each model for the different lead time tasks; panels (a)–(i) correspond to lead times 0–8. The x-axis represents the indices of the physical models listed in Table 5.1. The box plots show the distribution of the MSH-LSTM gradients for each model, while the red curve shows the corresponding ridge regression coefficients.

CHAPTER 6

CONCLUSIONS & FUTURE WORK

6.1 Summary of Thesis Contributions

With the growing complexity and prevalence of spatial and temporal data from various disciplines, learning accurate predictive models from such data has become an important yet increasingly challenging research topic. For some complex prediction tasks, fitting a single model to all observations is often undesirable, as such a model may not capture the intricate details and variability across different samples. In this thesis, I presented several novel multi-task learning frameworks for real-world applications involving spatial and temporal data, ranging from activity recognition with multi-modal sensor data to large-scale climate and sea surface temperature prediction. I investigated the related problems, examined their fundamental challenges, and proposed novel MTL frameworks to address them. The contributions of this thesis are summarized below.

In Chapter 3, a probabilistic MTL framework was designed for the multi-modal time series classification problem at multiple locations. The framework was motivated by the need to address both the temporal and spatial dependencies of the classes as well as the block missing value problem, which arises due to the varying types of features (modalities) available at different locations. As a proof of concept, the proposed framework was applied to the activity recognition problem, where the task is to identify user activities (e.g., walking, sitting, jumping, or lying down) in multiple rooms using multi-modal sensor data from accelerometers, video cameras, and environmental sensors. Experimental results showed that the proposed framework outperformed baseline methods including K-nearest neighbor, Conditional Random Fields, and single-task learning with multinomial softmax regression. In addition to improving classification performance, the framework also produces a data-driven transition matrix between the activities as a byproduct.

In Chapter 4, a novel MTL framework was proposed to address the challenges of forecasting time series at multiple locations simultaneously. Unlike conventional MTL approaches, the proposed framework was designed to preserve the marginal distribution of the response variable in addition to achieving point-wise accuracy. Specifically, it employs a Parzen window based kernel density estimation (KDE) approach to compute the probability density function and an L2-distance measure to quantify the discrepancy between the true and predicted distributions. When applied to a real-world climate dataset as a case study, the proposed framework outperformed existing non-distribution-preserving methods in more than 78% of the weather stations considered in this study. It also outperformed a baseline distribution-preserving method (with single-task learning) in more than 78% of the weather stations.
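To illustrate the distribution-matching idea, the sketch below estimates the densities of the observed and predicted responses with a Gaussian Parzen window and computes the squared L2 distance between them on a grid. The bandwidth, grid resolution, and function names are illustrative assumptions rather than the exact formulation used in Chapter 4.

```python
import numpy as np

def parzen_density(samples, grid, bandwidth=0.5):
    """Gaussian Parzen-window density estimate evaluated at the grid points."""
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

def l2_distribution_distance(y_true, y_pred, n_grid=200, bandwidth=0.5):
    """Squared L2 distance between the KDEs of observed and predicted responses."""
    lo = min(y_true.min(), y_pred.min()) - 3 * bandwidth
    hi = max(y_true.max(), y_pred.max()) + 3 * bandwidth
    grid = np.linspace(lo, hi, n_grid)
    p = parzen_density(np.asarray(y_true), grid, bandwidth)
    q = parzen_density(np.asarray(y_pred), grid, bandwidth)
    return np.trapz((p - q) ** 2, grid)   # numerical approximation of the L2 divergence
```

In the multi-task setting, a penalty of this form can be added to the usual point-wise residual loss at each location, with a trade-off weight controlling how strongly the predicted distribution is pulled toward the observed one.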
Finally, a multi-task hierarchical LSTM architecture was designed in Chapter 5 to address the challenges of multi-step-ahead time series prediction, with the goal of improving sea surface temperature forecasts at longer lead times. The proposed architecture considers each lead time as a separate learning task and employs a two-level hierarchical LSTM structure to capture both the lead-time and forecast generation time dependencies of the data. Experimental results using 29 years of monthly sea surface temperature data from the North American Multi-Model Ensemble (NMME) project demonstrated the efficacy of the proposed method compared to 9 other baseline methods, including ridge regression, MTL, and other conventional neural network approaches.

6.2 Future Research Directions

This dissertation has demonstrated the efficacy of using MTL to improve the performance of predictive modeling for spatial and temporal data. The success of MTL in the application domains investigated in this study suggests several potential future research directions.

First, the proposed multi-modal time series classification approach is limited to a linear model using softmax regression. Extending the approach to nonlinear models via deep neural networks would be an interesting research direction. However, several key questions must be answered to facilitate the development of such an approach, including: "How can the spatial and temporal dependencies of the classes be incorporated?", "How should the varying feature types available at different locations be handled?", and "How can the models for different rooms be related without significantly increasing the number of model parameters?" One possibility is to extend the hierarchical LSTM approach described in Chapter 5 to incorporate the classifier properties of the probabilistic MTL framework introduced in Chapter 3. From an application perspective, the current activity recognition framework assumes there is only a single subject to be monitored. Extending the approach to a multiple-subject setting is another potentially interesting research direction; demographic information could be used to model relationships between subjects and thereby provide a more customized model for each subject.

Second, for the multi-location time series forecasting problem, the distribution-preserving multi-task regression framework proposed in Chapter 4 also considers only a linear model. Extending the distribution-preserving approach to deep networks is another potentially interesting research direction. In addition, the non-identically distributed property of the data exists along both the spatial and temporal dimensions. The distribution-preserving mechanism in Chapter 4 preserves only the temporal distribution at each location, without accounting for the spatial distribution at each fixed time; the spatial distribution of precipitation, for example, certainly varies from spring to winter. Extending this work to account for the distribution along the spatial dimension could improve both the interpretability and the performance of the predictions. The approach could potentially be enhanced to a 3D-distribution setting, e.g., by defining a Parzen window and L2-distance over both space and time, to ensure that the spatial distribution of the response variable is also preserved by the prediction model.

Finally, this thesis mainly considers predictive modeling in an offline setting. In practice, however, data keeps growing rapidly along both the spatial and temporal dimensions. When observations are collected from a new location or a new time step, re-training the entire model in batch mode would be very expensive. Moreover, the cold-start problem must also be considered for previously unseen locations. The framework should therefore accommodate incoming data without requiring re-training from scratch; incorporating an incremental learning mechanism into the existing models would enable such runtime updates.

BIBLIOGRAPHY

[1] Google ads preferences manager. http://google.com/ads/preferences.

[2] Instagram. http://www.instagram.com.

[3] The sphere challenge: Activity recognition with multimodal sensor data. http://blog.drivendata.org/2016/06/06/sphere-benchmark/.

[4] Transfer learning. https://www.youtube.com/watch?v=qD6iD4TFsdQ.

[5] Zubin Abraham, Pang-Ning Tan, Perdinan, Julie Winkler, Shiyuan Zhong, and Malgorzata Liszewska. Distribution regularized regression framework for climate modeling. In Proceedings of the 2013 SIAM International Conference on Data Mining, pages 333–341. SIAM, 2013.

[6] Zubin Abraham, Pang-Ning Tan, Perdinan, Julie Winkler, Shiyuan Zhong, and Malgorzata Liszewska. Position preserving multi-output prediction. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 320–335. Springer, 2013.

[7] Alekh Agarwal, Alexander Rakhlin, and Peter Bartlett. Matrix regularization techniques for online multitask learning. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2008-138, 2008.

[8] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems, pages 41–48, 2007.

[9] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

[10] Gowtham Atluri, Anuj Karpatne, and Vipin Kumar. Spatio-temporal data mining: A survey of problems and methods. arXiv preprint arXiv:1711.04710, 2017.

[11] Isabelle Augenstein, Sebastian Ruder, and Anders Søgaard.
Multi-task learning of pairwise sequence classification tasks over disparate label spaces. arXiv preprint arXiv:1802.09913, 2018. Jonghun Baek, Geehyuk Lee, Wonbae Park, and Byoung-Ju Yun. Accelerometer signal In Proceedings of International Conference on processing for user activity detection. Knowledge-Based and Intelligent Information and Engineering Systems, pages 610–617, 2004. [13] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014. [14] Ling Bao and Stephen S Intille. Activity recognition from user-annotated acceleration data. In Proceedings of International Conference on Pervasive Computing, pages 1–17. Springer, 2004. 90 [15] Steffen Bickel, Jasmina Bogojeska, Thomas Lengauer, and Tobias Scheffer. Multi-task learning for hiv therapy screening. In Proceedings of the 25th international conference on Machine learning, pages 56–63. ACM, 2008. [16] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. [17] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004. [18] Paul Bromiley. Products and convolutions of gaussian probability density functions. Tina- Vision Memo, 3(4):1, 2003. J Solai Carós, O Chételat, Patrick Celka, S Dasen, and J CmÃral. Very low complexity algorithm for ambulatory activity classification. In Proceedings of the 3rd European Medical and Biological Conference EMBEC, pages 16–20, 2005. [19] [20] Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997. [21] R Caruna. Multitask learning: A knowledge-based source of inductive bias. In Machine Learning: Proceedings of the Tenth International Conference, pages 41–48, 1993. [23] [22] Giovanni Cavallanti, Nicolo Cesa-Bianchi, and Claudio Gentile. Linear algorithms for online multitask classification. Journal of Machine Learning Research, 11(Oct):2901–2934, 2010. Junghoon Chae, Dennis Thom, Harald Bosch, Yun Jang, Ross Maciejewski, David S Ebert, and Thomas Ertl. Spatiotemporal social media analytics for abnormal event detection In Visual Analytics Science and and examination using seasonal-trend decomposition. Technology (VAST), 2012 IEEE Conference on, pages 143–152. IEEE, 2012. Jianhui Chen, Ji Liu, and Jieping Ye. Learning incoherent sparse and low-rank patterns from multiple tasks. ACM Transactions on Knowledge Discovery from Data (TKDD), 5(4):22, 2012. [24] [25] Jianhui Chen, Jiayu Zhou, and Jieping Ye. Integrating low-rank and group-sparse structures In Proceedings of the 17th ACM SIGKDD international for robust multi-task learning. conference on Knowledge discovery and data mining, pages 42–50. ACM, 2011. [26] Kevin KW Cheung. A review of ensemble forecasting techniques with a focus on tropical cyclone forecasting. Meteorological Applications, 8(3):315–332, 2001. [27] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: In Proceedings of the 25th international Deep neural networks with multitask learning. conference on Machine learning, pages 160–167. ACM, 2008. [28] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch al- gorithms via accelerated gradient methods. In Advances in neural information processing systems, pages 1647–1655, 2011. 91 [29] TM Darcey and PD Williamson. Spatio-temporal eeg measures and their application to human intracranially recorded epileptic seizures. Electroencephalography and clinical neu- rophysiology, 61(6):573–587, 1985. 
[30] Richard A Dare and John L McBride. Sea surface temperature response to tropical cyclones. Monthly Weather Review, 139(12):3798–3808, 2011. Jeffery Jonathan Davis, Robert Kozma, Chin-Teng Lin, and Walter J Freeman. Spatio- temporal eeg pattern extraction using high-density scalp arrays. In Neural Networks (IJCNN), 2016 International Joint Conference on, pages 889–896. IEEE, 2016. [31] [32] Krzysztof Dembczynski, Arkadiusz Jachnik, Wojciech Kotlowski, Willem Waegeman, and Eyke Hüllermeier. Optimizing the f-measure in multi-label classification: Plug-in rule In International Conference on Machine approach versus structured loss minimization. Learning, pages 1130–1138, 2013. [33] Krzysztof Dembczyński, Willem Waegeman, Weiwei Cheng, and Eyke Hüllermeier. On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1-2):5–45, 2012. [34] Xavier Descombes, Frithjof Kruggel, and D Yves Von Cramon. Spatio-temporal fmri analysis using markov random fields. IEEE transactions on medical imaging, 17(6):1028– 1039, 1998. [35] Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In The IEEE International Conference on Computer Vision (ICCV), 2017. James Dougherty, Ron Kohavi, and Mehran Sahami. Supervised and unsupervised dis- cretization of continuous features. In Proceedings of the 12th International Conference of Machine Learning, volume 12, pages 194–202, 1995. [36] [37] Long Duong, Trevor Cohn, Steven Bird, and Paul Cook. Low resource dependency parsing: In Proceedings of the 53rd Cross-lingual parameter sharing in a neural network parser. Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, volume 2, pages 845–850, 2015. [38] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015. [39] Anders Eklund, Thomas E Nichols, and Hans Knutsson. Cluster failure: why fmri inferences for spatial extent have inflated false-positive rates. Proceedings of the National Academy of Sciences, page 201602413, 2016. [40] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi–task learning. In Pro- ceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–117. ACM, 2004. 92 [41] Yuchun Fang, Zhengyan Ma, Zhaoxiang Zhang, Xu-Yao Zhang, and Xiang Bai. Dynamic multi-task learning with convolutional neural network. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI, pages 19–25, 2017. [42] Elizabeth A Franz, James C Eliassen, Richard B Ivry, and Michael S Gazzaniga. Dissoci- ation of spatial and temporal coupling in the bimanual movements of callosotomy patients. Psychological Science, 7(5):306–310, 1996. [43] R Friedrich, A Fuchs, and H Haken. Spatio-temporal eeg patterns. In Rhythms in physio- logical systems, pages 315–338. Springer, 1991. [44] Ankit Gandhi, Arjun Sharma, Arijit Biswas, and Om Deshmukh. Gethr-net: A generalized temporally hybrid recurrent neural network for multimodal information fusion. In European Conference on Computer Vision, pages 883–899. Springer, 2016. [45] Qiaozi Gao, Malcolm Doering, Shaohua Yang, and Joyce Chai. Physical causality of action verbs in grounded language understanding. 
In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1814–1824, 2016. [46] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with lstm. 1999. [47] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015. [48] Yuanhong Guan, Jieshun Zhu, Bohua Huang, Zeng-Zhen Hu, and James L Kinter III. South pacific ocean dipole: A predictable mode on multiseasonal time scales. Journal of Climate, 27(4):1648–1658, 2014. [49] D Herring. What is el nino? NASA’s Earth Observatory, 2009. [50] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. [51] Bo Hu, Mohsen Jamali, and Martin Ester. Spatio-temporal topic modeling in mobile social media for location recommendation. In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pages 1073–1078. IEEE, 2013. [52] Zeng-Zhen Hu, Arun Kumar, Bohua Huang, Wanqiu Wang, Jieshun Zhu, and Caihong Wen. Prediction skill of monthly sst in the north atlantic ocean in ncep climate forecast system version 2. Climate dynamics, 40(11-12):2745–2759, 2013. Jui-Ting Huang, Jinyu Li, Dong Yu, Li Deng, and Yifan Gong. Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 7304–7308. IEEE, 2013. [53] [54] Yan Huang, Wei Wang, Liang Wang, and Tieniu Tan. Multi-task deep neural network for multi-label learning. In Image Processing (ICIP), 2013 20th IEEE International Conference on, pages 2897–2900. IEEE, 2013. 93 [55] Shuiwang Ji and Jieping Ye. An accelerated gradient method for trace norm minimization. In Proceedings of the 26th annual international conference on machine learning, pages 457–464. ACM, 2009. [56] Eugenia Kalnay, Masao Kanamitsu, Robert Kistler, William Collins, Dennis Deaven, Lev Gandin, Mark Iredell, Suranjana Saha, Glenn White, John Woollen, et al. The ncep/ncar 40-year reanalysis project. Bulletin of the American meteorological Society, 77(3):437–471, 1996. [57] Anuj Karpatne, James Faghmous, Jaya Kawale, Luke Styles, Mace Blank, Varun Mithal, Xi Chen, Ankush Khandelwal, Shyam Boriah, Karsten Steinhaeuser, et al. Earth science applications of sensor data. In Managing and Mining Sensor Data, pages 505–530. Springer, 2013. [58] Harry H Kelejian and Ingmar R Prucha. A generalized spatial two-stage least squares procedure for estimating a spatial autoregressive model with autoregressive disturbances. The Journal of Real Estate Finance and Economics, 17(1):99–121, 1998. [59] Eunju Kim, Sumi Helal, and Diane Cook. Human activity recognition and pattern discovery. IEEE Pervasive Computing, 9(1), 2010. [60] Minyoung Kim. Semi-supervised learning of hidden conditional random fields for time- series classification. Neurocomputing, 119:339–349, 2013. [61] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [62] Ben P Kirtman, Dughong Min, Johnna M Infanti, James L Kinter III, Daniel A Paolino, Qin Zhang, Huug Van Den Dool, Suranjana Saha, Malaquias Pena Mendez, Emily Becker, et al. The north american multimodel ensemble: phase-1 seasonal-to-interannual prediction; phase-2 toward developing intraseasonal prediction. Bulletin of the American Meteorological Society, 95(4):585–601, 2014. 
[63] Slava Kisilevich, Florian Mansmann, Mirco Nanni, and Salvatore Rinzivillo. Spatio- temporal clustering. In Data mining and knowledge discovery handbook, pages 855–874. Springer, 2009. [64] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. John Lafferty, Andrew McCallum, and Fernando CN Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001. [65] [66] Pierre Legendre. Spatial autocorrelation: trouble or new paradigm? Ecology, 74(6):1659– 1673, 1993. Jonathan Lester, Tanzeem Choudhury, and Gaetano Borriello. A practical approach to recognizing physical activities. In Proceedings of the International Conference on Pervasive Computing, pages 1–16. Springer, 2006. [67] 94 [68] Caiyan Li and Hongzhe Li. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics, 24(9):1175–1182, 2008. [69] Sijin Li, Zhi-Qiang Liu, and Antoni B Chan. Heterogeneous multi-task learning for hu- man pose estimation with deep convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 482–489, 2014. [70] Lei Liu, Prakash Mandayam Comar, Sabyasachi Saha, Pang-Ning Tan, and Antonio Nucci. Recursive nmf: Efficient label tree learning for large multi-class problems. In IEEE Inter- national Conf on Pattern Recognition, 2012. [71] Wei Liu, Yu Zheng, Sanjay Chawla, Jing Yuan, and Xie Xing. Discovering spatio-temporal causal interactions in traffic data streams. In Proceedings of the 17th ACM SIGKDD in- ternational conference on Knowledge discovery and data mining, pages 1010–1018. ACM, 2011. [72] Xi Liu, Lei Liu, Steven J Simske, and Jerry Liu. Human daily activity recognition for healthcare using wearable and visual sensing data. In Proceedings of IEEE International Conference on Healthcare Informatics (ICHI-2016), pages 24–31. IEEE, 2016. [73] Xi Liu, Lei Liu, and Pang-Ning Tan. Location-based hierarchical approach for activity recognition with multi-modal sensor data. In ECML/PKDD 2016 Discovery Challenge, Riva del Gada, Italy (2016), pages 24–31. IEEE, 2016. [74] Xi Liu, Han Hee Song, Mario Baldi, and Pang-Ning Tan. Macro-scale mobile app market analysis using customized hierarchical categorization. In INFOCOM 2016-The 35th Annual IEEE International Conference on Computer Communications, IEEE, pages 1–9. IEEE, 2016. [75] Xi Liu, Pang-Ning Tan, Zubin Abraham, Lifeng Luo, and Pouyan Hatami. Distribution In 2018 IEEE International preserving multi-task regression for spatio-temporal data. Conference on Data Mining (ICDM), pages 1134–1139. IEEE, 2018. [76] Xi Liu, Pang-Ning Tan, and Lei Liu. Stars: Soft multi-task learning for activity recognition from multi-modal sensor data. In Proceedings of Pacific Asian Conference on Knowledge Discovery and Data Mining (PAKDD-2018), pages 571–583. Springer, 2018. [77] Ye Liu, Yu Zheng, Yuxuan Liang, Shuming Liu, and David S Rosenblum. Urban water quality prediction based on multi-task multi-view learning. 2016. [78] Jing-Jia Luo, Sebastien Masson, Swadhin Behera, and Toshio Yamagata. Experimental forecasts of the indian ocean dipole using a coupled oagcm. Journal of climate, 20(10):2178– 2190, 2007. [79] Yong Luo, Dacheng Tao, Bo Geng, Chao Xu, and Stephen J Maybank. Manifold regularized multitask learning for semi-supervised multilabel image classification. 
IEEE Transactions on Image Processing, 22(2):523–536, 2013. 95 [80] Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114, 2015. [81] Spyros Makridakis and Michele Hibon. Arma models and the box–jenkins methodology. Journal of Forecasting, 16(3):147–163, 1997. [82] Douglas Maraun. Bias correction, quantile mapping, and downscaling: Revisiting the inflation issue. Journal of Climate, 26(6):2137–2143, 2013. [83] Yasuko Matsubara, Yasushi Sakurai, Willem G Van Panhuis, and Christos Faloutsos. Funnel: In Proceedings of the 20th ACM automatic mining of spatially coevolving epidemics. SIGKDD international conference on Knowledge discovery and data mining, pages 105– 114. ACM, 2014. [84] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The nat- arXiv preprint language decathlon: Multitask learning as question answering. ural arXiv:1806.08730, 2018. [85] MJ Menne, I Durre, B Korzeniewski, S McNeal, K Thomas, X Yin, S Anthony, R Ray, RS Vose, BE Gleason, et al. Global historical climatology network-daily (ghcn-daily), version 3.22. noaa national climatic data center, 2016. [86] Microsoft Kinect SDK. kinect. https://developer.microsoft.com/en-us/windows/ [87] Harvey J Miller. Tobler’s first law and spatial analysis. Annals of the Association of American Geographers, 94(2):284–289, 2004. [88] Wanli Min and Laura Wynter. Real-time road traffic prediction with spatio-temporal corre- lations. Transportation Research Part C: Emerging Technologies, 19(4):606–616, 2011. [89] Phan Q Minh, Roger S Morris, Birgit Schauer, Mark Stevenson, Jackie Benschop, Hoang V Nam, and Ron Jackson. Spatio-temporal epidemiology of highly pathogenic avian influenza outbreaks in the two deltas of vietnam during 2003–2007. Preventive veterinary medicine, 89(1-2):16–24, 2009. [90] Pradeep Mohan, Shashi Shekhar, James A Shine, and James P Rogers. Cascading spatio- In Proceedings of the 2010 SIAM temporal pattern discovery: A summary of results. International Conference on Data Mining, pages 327–338. SIAM, 2010. [91] Margaret A Oliver and Richard Webster. Kriging: a method of interpolation for geographical information systems. International Journal of Geographical Information System, 4(3):313– 332, 1990. [92] Open Source Computer Vision Library. http://opencv.org/. [93] OpenNI. http://www.openni.org/. [94] Jiahao Pang and Gene Cheung. Graph laplacian regularization for image denoising: Analysis in the continuous domain. IEEE Transactions on Image Processing, 26(4):1770–1785, 2017. 96 [95] Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. Pytorch, 2017. [96] Jian Peng, Sha Chen, Huiling Lü, Yanxu Liu, and Jiansheng Wu. Spatiotemporal patterns of remotely sensed pm2. 5 concentration in china from 1999 to 2011. Remote Sensing of Environment, 174:109–121, 2016. [97] Steven M Pincus, Igor M Gladstone, and Richard A Ehrenkranz. A regularity statistic for medical data analysis. Journal of clinical monitoring, 7(4):335–345, 1991. [98] Point Cloud Library. http://www.pointclouds.org. [99] R Poulin and S Morand. Geographical distances and the similarity among parasite commu- nities of conspecific host populations. Parasitology, 119(4):369–374, 1999. [100] Yao Qin, Dongjin Song, Haifeng Chen, Wei Cheng, Guofei Jiang, and Garrison Cottrell. A dual-stage attention-based recurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971, 2017. 
[101] Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding, arXiv preprint and Vijay Pande. Massively multitask networks for drug discovery. arXiv:1502.02072, 2015. [102] K Venkateswara Rao, A Govardhan, and KV Chalapati Rao. Spatiotemporal data mining: Issues, tasks and applications. International Journal of Computer Science and Engineering Survey, 3(1):39, 2012. [103] Thomas Reichler and Junsu Kim. How well do coupled models simulate today’s climate? Bulletin of the American Meteorological Society, 89(3):303–312, 2008. [104] Jose Antonio MR Rocha, Valéria C Times, Gabriel Oliveira, Luis O Alvares, and Vania Bogorny. Db-smot: A direction-based spatio-temporal clustering method. In 2010 5th IEEE international conference intelligent systems, pages 114–119. IEEE, 2010. [105] Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017. [106] Patrick H Ryan, Grace K LeMasters, Pratim Biswas, Linda Levin, Shaohua Hu, Mark Lindsey, David I Bernstein, James Lockey, Manuel Villareal, Gurjit K Khurana Hershey, et al. A comparison of proximity and land use regression traffic exposure models and wheezing in infants. Environmental health perspectives, 115(2):278, 2007. [107] Shashi Shekhar, Zhe Jiang, Reem Ali, Emre Eftelioglu, Xun Tang, Venkata Gunturi, and Xun Zhou. Spatiotemporal data mining: a computational perspective. ISPRS International Journal of Geo-Information, 4(4):2306–2338, 2015. [108] Shashi Shekhar, Chang-Tien Lu, and Pusheng Zhang. A unified approach to detecting spatial outliers. GeoInformatica, 7(2):139–166, 2003. 97 [109] Amir Shmuel, Essa Yacoub, Denis Chaimow, Nikos K Logothetis, and Kamil Ugurbil. Spatio-temporal point-spread function of fmri signal in human gray matter at 7 tesla. Neu- roimage, 35(2):539–552, 2007. [110] Martin B Short, Maria R D’orsogna, Virginia B Pasour, George E Tita, Paul J Branting- ham, Andrea L Bertozzi, and Lincoln B Chayes. A statistical model of criminal behavior. Mathematical Models and Methods in Applied Sciences, 18(supp01):1249–1267, 2008. [111] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013. [112] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568– 576, 2014. [113] John R Smith, Jelena Tesic, and Rong Yan. Method and apparatus for model-shared subspace boosting for multi-label classification, June 7 2011. US Patent 7,958,068. [114] Anders Søgaard and Yoav Goldberg. Deep multi-task learning with low level tasks super- vised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 231–235, 2016. [115] Masashi Sugiyama, Song Liu, Marthinus Christoffel Du Plessis, Masao Yamanaka, Makoto Yamada, Taiji Suzuki, and Takafumi Kanamori. Direct divergence approximation between probability distributions and its applications in machine learning. Journal of Computing Science and Engineering, 7(2):99–111, 2013. [116] Pei Sun and Sanjay Chawla. On local spatial outliers. In Fourth IEEE International Confer- ence on Data Mining (ICDM’04), pages 209–216. IEEE, 2004. [117] Xu Sun, Hisashi Kashima, and Naonori Ueda. Large-scale personalized human activity recognition using online multitask learning. 
IEEE Transactions on Knowledge and Data Engineering, 25(11):2551–2563, 2013. [118] Charles Sutton and Andrew McCallum. An introduction to conditional random fields for relational learning. Introduction to Statistical Relational Learning, pages 93–128, 2006. [119] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to data mining. Addison Wesley, 2006. [120] Bridget Thrasher, Edwin P. Maurer, C. McKellar, and P. B. Duffy. Bias correcting climate model simulated daily temperature extremes with quantile mapping. Hydrology and Earth System Sciences, 16(9):3309, 2012. [121] Waldo R Tobler. A computer movie simulating urban growth in the detroit region. Economic geography, 46(sup1):234–240, 1970. 98 [122] Niall Twomey, Tom Diethe, Meelis Kull, Hao Song, Massimo Camplani, Sion Hannuna, Xenofon Fafoutis, Ni Zhu, Pete Woznowski, Peter Flach, and Ian Craddock. The SPHERE challenge: Activity recognition with multimodal sensor data. arXiv:1603.00797, 2016. [123] Douglas L Vail, Manuela M Veloso, and John D Lafferty. Conditional random fields for activity recognition. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, page 235. ACM, 2007. [124] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N In Advances in Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Neural Information Processing Systems, pages 5998–6008, 2017. [125] Aurore Voldoire, E Sanchez-Gomez, D Salas y Mélia, B Decharme, Christophe Cassou, S Sénési, Sophie Valcke, I Beau, A Alias, M Chevallier, et al. The cnrm-cm5. 1 global climate model: description and basic evaluation. Climate Dynamics, 40(9-10):2091–2121, 2013. [126] Kiel von Lindenberg. Comparative analysis of gps data. Undergraduate Journal of Mathe- matical Modeling: One+ Two, 5(2):1, 2014. [127] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395– 416, 2007. [128] Shuangquan Wang, Jie Yang, Ningjiang Chen, Xin Chen, and Qinfeng Zhang. Human activity recognition with user-free accelerometers in the sensor networks. In Proceedings of the International Conference on Neural Networks and Brain, volume 2, pages 1212–1217. IEEE, 2005. [129] Xiaofeng Wang, Donald E Brown, and Matthew S Gerber. Spatio-temporal modeling of criminal incidents using geographic, demographic, and twitter-derived information. In Intelligence and Security Informatics (ISI), 2012 IEEE International Conference on, pages 36–41. IEEE, 2012. [130] A Weiss, T Herman, M Plotnik, M Brozgol, N Giladi, and JM Hausdorff. An instrumented timed up and go: the added value of an accelerometer for identifying fall risk in idiopathic fallers. Physiological measurement, 32(12):2003, 2011. [131] Mark William Woolrich, Mark Jenkinson, J Michael Brady, and Stephen M Smith. Fully bayesian spatio-temporal modeling of fmri data. IEEE transactions on medical imaging, 23(2):213–231, 2004. [132] Elizabeth Wu, Wei Liu, and Sanjay Chawla. Spatio-temporal outlier detection in precipitation data. In International Workshop on Knowledge Discovery from Sensor Data, pages 115–133. Springer, 2008. [133] Yangyang Xie, Bin Zhao, Lin Zhang, and Rong Luo. Spatiotemporal variations of pm2. 5 and pm10 concentrations between 31 chinese cities and their relationships with so2, no2, co and o3. Particuology, 20:141–149, 2015. 99 [134] Jianpeng Xu, Xi Liu, Tyler Wilson, Pang-Ning Tan, Pouyan Hatami, and Lifeng Luo. Muscat: Multi-scale spatio-temporal learning with application to climate modeling. 
In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI’18), pages 2912–2918, 2018. [135] Jianpeng Xu, Pang-Ning Tan, and Lifeng Luo. Orion: Online regularized multi-task re- gression and its application to ensemble forecasting. In Data Mining (ICDM), 2014 IEEE International Conference on, pages 1061–1066. IEEE, 2014. [136] Jianpeng Xu, Pang-Ning Tan, Lifeng Luo, and Jiayu Zhou. Gspartan: a geospatio-temporal multi-task learning framework for multi-location prediction. In Proceedings of the 2016 SIAM International Conference on Data Mining, pages 657–665. SIAM, 2016. [137] Jianpeng Xu, Pang-Ning Tan, Jiayu Zhou, and Lifeng Luo. Online multi-task learning frame- work for ensemble forecasting. IEEE Transactions on Knowledge and Data Engineering, 29(6):1268–1280, 2017. [138] Jianpeng Xu, Jiayu Zhou, and Pang-Ning Tan. Formula: Factorized multi-task learning In Proceedings of the 2015 SIAM for task discovery in personalized medical models. International Conference on Data Mining, pages 496–504. SIAM, 2015. [139] Jianpeng Xu, Jiayu Zhou, Pang-Ning Tan, Xi Liu, and Lifeng Luo. Wisdom: Weighted incre- mental spatio-temporal multi-task learning via tensor decomposition. In IEEE International Conference on Big Data (Big Data’2016), pages 522–531. IEEE, 2016. [140] Yan Yan, Elisa Ricci, Gaowen Liu, and Nicu Sebe. Egocentric daily activity recognition via multitask clustering. IEEE Transactions on Image Processing, 24(10), 2015. [141] Haiqin Yang, Irwin King, and Michael R Lyu. Online learning for multi-task feature selection. In Proceedings of the 19th ACM international conference on Information and knowledge management, pages 1693–1696. ACM, 2010. [142] Shaohua Yang, Qiaozi Gao, Changsong Liu, Caiming Xiong, Song-Chun Zhu, and Joyce Y Chai. Grounded semantic role labeling. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 149–159, 2016. [143] Yongxin Yang and Timothy M Hospedales. Trace norm regularised deep multi-task learning. arXiv preprint arXiv:1606.04038, 2016. [144] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierar- chical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016. [145] Lei Yuan, Yalin Wang, Paul M Thompson, Vaibhav A Narayan, and Jieping Ye. Multi-source learning for joint analysis of incomplete multi-modality neuroimaging data. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1149–1157. ACM, 2012. 100 [146] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial landmark detection by deep multi-task learning. In European Conference on Computer Vision, pages 94–108. Springer, 2014. [147] Liang Zhao, Qian Sun, Jieping Ye, Feng Chen, Chang-Tien Lu, and Naren Ramakrishnan. Multi-task learning for spatio-temporal event forecasting. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1503– 1512. ACM, 2015. [148] Jiayu Zhou, Jianhui Chen, and Jieping Ye. Malsar: Multi-task learning via structural regularization. Arizona State University, 21, 2011. [149] Jiayu Zhou, Lei Yuan, Jun Liu, and Jieping Ye. A multi-task learning formulation for In Proceedings of the 17th ACM SIGKDD international predicting disease progression. 
conference on Knowledge discovery and data mining, pages 814–822. ACM, 2011. [150] Xun Zhou, Shashi Shekhar, and Reem Y Ali. Spatiotemporal change footprint pattern discovery: an inter-disciplinary survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(1):1–23, 2014.