This is to certify that the dissertation entitled "Spatial and Temporal Data Mining with Applications to Earth Science Data," presented by Haibin Cheng, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science at Michigan State University.

SPATIAL AND TEMPORAL DATA MINING WITH APPLICATIONS TO EARTH SCIENCE DATA

By Haibin Cheng

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Computer Science
2008

ABSTRACT

SPATIAL AND TEMPORAL DATA MINING WITH APPLICATIONS TO EARTH SCIENCE DATA

By Haibin Cheng

Recent progress in computer and sensor technology has generated huge amounts of spatial and temporal data in the Earth Science domain, including climate observations, land cover time series records, ground-level pollution measurements, etc. Combined with historical climate records and predictions from ecosystem models, it offers new opportunities for understanding how the earth is changing, for determining what factors cause these changes, and for forecasting future changes. Spatial and temporal data mining provides innovative solutions for mining Earth Science data by incorporating spatial and temporal dependencies into standard data mining techniques. Although there has been substantial research in spatial and temporal data mining, there are still many technical issues that need to be addressed, such as processing massive high resolution data, reducing the effect of noise, and fusing data from heterogeneous sources. In this thesis we develop a class of efficient and robust spatial and temporal data mining algorithms to overcome these challenges. First, we develop an integrated localized prediction framework based on the Support Vector Machine to incorporate spatial and temporal dependencies. Efficient algorithms are also proposed to reduce its computational overhead. Second, we study the error accumulation problem in multi-step time series prediction and develop a novel semi-supervised multivariate time series prediction algorithm for long term forecasting. A covariance alignment method is also proposed to reduce the inconsistencies between historical and future climate data when applying the algorithm to future climate projection problems. Third, we propose a graph-based framework to detect and categorize different types of anomalies in multivariate time series data. We applied the framework to the problem of detecting and characterizing ecosystem disturbances in Earth Science data. While the spatial and temporal data mining techniques proposed in this thesis have been applied to many problems in the Earth Science domain, they are also applicable to spatial and temporal data in other application domains such as traffic analysis, image processing, and network monitoring.

To My Family
For their love and support

ACKNOWLEDGMENTS

It is my pleasure to thank all the wonderful people who contributed to this thesis in many ways. I wish to express my warm and sincere thanks to my advisor, Dr. Pang-Ning Tan.
This dissertation would not have been possible without his continuous support, inspiration, and valuable suggestions. He is the one that I can always count on to discuss the details of a problem. I truly appreciate his patience and encouragement, which helped me go through my Ph.D. study step by step and develop my research skills.

I am grateful to my thesis committee members, Dr. Rong Jin, Dr. Li Xiao, and Dr. Ashton Shortridge. Dr. Rong Jin advised me and helped me greatly in my research. His extensive discussions about machine learning theory and algorithms have been very helpful for this study. Dr. Li Xiao and Dr. Ashton Shortridge provided me with many important suggestions on the application of my study in different domains.

I wish to thank all the faculty members and scientists who have helped me in numerous aspects of my research work: Dr. Christopher Potter and Dr. Steven Klooster from the NASA research center, Dr. Jon Sticklen and Dr. William F. Punch from the Department of Computer Science and Engineering at Michigan State University, Dr. Haifeng Chen and Dr. Guofei Jiang from the NEC research lab, and Dr. Jianchang Mao and Dr. Ruofei Zhang from Yahoo Labs.

I am grateful to all the staff in the Department of Computer Science and Engineering at Michigan State University for helping the department run smoothly and for assisting me in many different ways. Ms. Linda Moore and Ms. Norma Teague deserve special mention. Many thanks to Ms. Kim Thompson for proof-reading this entire document.

I would like to thank many current and previous student colleagues in the Data Mining Research Group, Jerry Scripps, Kapila Moonesinghe, Samah Fodeh, Feilong Chen, Hamed Valizadegan, Jing Gao, and Jun Wang, for creating a fun and stimulating research environment. Many thanks to all the other members of the Links Lab too. I wish to thank all my friends in the U.S. and China. Although I cannot name them all here, I appreciate the support they provided and the wonderful time we had during this period.

Finally, I would like to extend my loving thanks to my parents and my wife for their continuous care, support, and encouragement. Their love accompanies me wherever I go. My thanks also go to my grandma, my brother, my aunts, and my uncles.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
  1.1 Earth Science Applications
  1.2 Challenges
  1.3 Contributions
    1.3.1 Localized Prediction with Spatial and Temporal Dependencies
    1.3.2 Long Term Time Series Forecasting
    1.3.3 Anomaly Detection, Characterization, and Visualization
  1.4 Outline of the Thesis
2 Spatial and Temporal Data Mining
  2.1 Spatial and Temporal Data
  2.2 Spatial Data Mining
    2.2.1 Spatial Prediction
    2.2.2 Spatial Clustering
    2.2.3 Spatial Association Rule Mining
    2.2.4 Spatial Anomaly Detection
  2.3 Temporal Data Mining
    2.3.1 Time Series Forecasting
    2.3.2 Classification of Temporal Data
    2.3.3 Clustering of Temporal Data
    2.3.4 Temporal Association Rule Mining (Frequent Pattern)
    2.3.5 Time Series Anomaly Detection
  2.4 Spatio-Temporal Data Mining
    2.4.1 Spatio-Temporal Prediction
    2.4.2 Spatio-Temporal Clustering
    2.4.3 Spatio-Temporal Association Rule Mining
    2.4.4 Spatio-Temporal Anomaly Detection
    2.4.5 Exploratory Analysis and Visualization of Spatio-Temporal Data
  2.5 Scope of Thesis Research
3 Localized Prediction with Spatial and Temporal Dependencies
  3.1 Preliminaries
    3.1.1 K-nearest Neighbor Prediction
    3.1.2 Support Vector Machine
  3.2 Localized Support Vector Machine
    3.2.1 Localized Support Vector Classification (LSVC)
    3.2.2 Localized Support Vector Regression (LSVR)
    3.2.3 Discussion
  3.3 Profile Support Vector Machine (PSVM)
    3.3.1 Supervised Clustering Algorithms: MagKmeans and VarKmeans
    3.3.2 Profile Support Vector Classification (PSVC)
    3.3.3 Profile Support Vector Regression (PSVR)
    3.3.4 Summary
  3.4 LSVM and PSVM for Spatial and Temporal Prediction
  3.5 Experimental Results
    3.5.1 Comparison between LSVM/PSVM and Nonlinear SVM
    3.5.2 Performance Comparison on Real-World Data
    3.5.3 Application of LSVM and PSVM to Earth Science Data
    3.5.4 Sensitivity Analysis
  3.6 Summary
4 Long-Term Time Series Forecasting
  4.1 Preliminaries
    4.1.1 Single Step and Multi-step Time Series Forecasting
    4.1.2 Univariate and Multivariate Time Series Forecasting
    4.1.3 Regression Models for Prediction
  4.2 Problem Study on Long-Term Time Series Prediction
    4.2.1 Error Accumulation
    4.2.2 Learning Difficulty
    4.2.3 Smoothness of Prediction
    4.2.4 Discussion
  4.3 Values of Unlabeled Data on Regression
  4.4 Semi-supervised Multivariate Time Series Prediction
    4.4.1 Semi-Supervised HMMR for Long Term Time Series Forecasting
    4.4.2 Data Calibration
  4.5 Experimental Results
    4.5.1 Experimental Setup
    4.5.2 Comparative Study on Multi-step Prediction
    4.5.3 Performance Comparison on Semi-Supervised HMMR
    4.5.4 Value of Unlabeled Data for Semi-Supervised HMMR
    4.5.5 Semi-Supervised HMMR with Covariance Alignment on Climate Prediction
  4.6 Summary
5 Anomaly Detection, Characterization, and Visualization
  5.1 Preliminaries
    5.1.1 Anomaly Detection in Time Series Data
    5.1.2 Moving Average Time Series Anomaly Detection
    5.1.3 Anomaly Detection and Characterization in Multivariate Time Series Data
    5.1.4 Anomaly Exploration in Large Scale Time Series Repository
  5.2 Graph-based Time Series Anomaly Detection
    5.2.1 Kernel Matrix for Time Series Data
    5.2.2 Random Walk on Graph for Anomaly Detection
  5.3 Anomaly Detection and Characterization in Noisy Multivariate Time Series
    5.3.1 Multivariate Kernel Alignment
    5.3.2 Multivariate Time Series Anomaly Detection and Characterization Algorithms
  5.4 Extensions of the Proposed Algorithm
    5.4.1 Sparse Kernel for Local Anomaly Detection
    5.4.2 Subsequence Anomaly Detection
  5.5 Visual Exploration of Ecosystem Disturbances
    5.5.1 Clustering of Ecosystem Disturbance Events
    5.5.2 Clustering for Exploratory Data Analysis
    5.5.3 A User-Interactive System for Disturbance Event Detection
  5.6 Experimental Results
    5.6.1 Target Specific Anomaly Detection
    5.6.2 General Multivariate Noisy Time Series Anomaly Detection
    5.6.3 Subsequence Anomaly Detection
    5.6.4 Local Anomaly Detection
    5.6.5 General Performance Comparison
    5.6.6 Ecosystem Disturbance Detection
    5.6.7 Ecosystem Disturbance Exploratory Analysis
    5.6.8 Multi-Level Indexing Performance Evaluation
    5.6.9 Multi-Level Indexing System Efficiency Evaluation
  5.7 Summary
6 Conclusions and Future Work
  6.1 Contributions
  6.2 Future Research
APPENDIX
  A Proof of the Dual Form for Localized Support Vector Machine
BIBLIOGRAPHY

LIST OF TABLES

1.1 Some Examples of Earth Science Data and their Applications.
3.1 Description of the 18 UCI datasets used for classification.
3.2 Description of the 18 UCI datasets used for regression.
3.3 Classification accuracies (%) for SVC, KNNC, HLSVC (KNN-SVC), SLSVC, and PSVC on the 18 UCI datasets.
3.4 Regression R² for SVR, KNNR, HLSVR (KNN-SVR), SLSVR, and PSVR on the 8 UCI datasets.
3.5 Description of data sets used for prediction with spatial and temporal dependencies.
3.6 Classification accuracies (%) for SVC, KNNC, HLSVC (KNN-SVC), SLSVC, and PSVC (using spatial or temporal dependencies) on the 2 Earth Science data sets: CoverType and Ozone.
3.7 Regression R² for SVR, KNNR, HLSVR (KNN-SVR), SLSVR, and PSVR (using spatial or temporal dependencies) on the 2 Earth Science data sets: Elnino and Forest Fire.
4.1 Training Set D = X × Y.
4.2 Description of the UCR time series data sets for Semi-HMMR.
4.3 R² of three methods on 21 data sets using Multiple Linear Regression.
4.4 R² of three methods on 21 data sets using Recurrent Neural Network.
4.5 Win-Draw-Loss results for MLR.
4.6 Win-Draw-Loss results for RNN.
4.7 Average R² for UAR, MLR, HMMR and SemiHMMR on UCR time series data sets.
4.8 Comparing the average R² values for MLR, HMMR, SemiHMMR, and CSemiHMMR on the Canadian climate data.
4.9 Comparing the degree of alignment and loss of neighborhood structure information using calibration techniques 1 and 2.
5.1 Connectivity values obtained by applying a random walk algorithm on the graph shown in Figure 5.2. The lower the connectivity value, the more anomalous a node is.
5.2 Features for clustering disturbance events.
5.3 Comparison of sampling loss for different strategies.
5.4 Runtime (in seconds) needed to compute disturbance events at each level for North America.
5.5 Response time for drilling down from low resolution to high resolution.

LIST OF FIGURES

1.1 An illustration of increasing global mean temperature from 1850 to 2000 (relative to the mean temperature from 1961-1990) (from Global Warming Art [181]).
1.2 Forest fire distribution by month in Cuba, 1981-1996 [93].
1.3 Climate model predictions for future global temperature (°C) until 2100 under the SRES A2 emissions scenario relative to global average temperatures in 2000 (from Global Warming Art [182]).
1.4 FPAR and Precipitation time series at (latitude = 54.85N, longitude = 114.20W) with possible anomalies.
2.1 Survey on Spatial and Temporal Data Mining.
3.1 K-nearest Neighbor Prediction.
3.2 Support Vector Machine.
3.3 An illustration of the MagKmeans clustering algorithm. The cluster in the left figure contains only positive examples. After several iterations, some of the positive examples will be expelled from the cluster while some negative examples are absorbed into the cluster to achieve balance in the class distribution (the right figure).
3.4 An illustration of the VarKmeans clustering algorithm. The left figure plots the clustering results by K-means on the data set. The right figure plots the clustering results by VarKmeans on the data set. VarKmeans produces much better clusters for learning three linear regression segments.
3.5 Comparison between LSVC and nonlinear SVC. The left diagram shows the support vectors obtained by nonlinear SVC, whereas the right diagram shows the support vectors obtained for three test examples using LSVC.
3.6 Comparison between LSVR and nonlinear SVR. The left diagram shows the support vectors obtained by nonlinear SVR, whereas the right diagram shows the support vectors obtained for two test examples using LSVR.
3.7 The diagrams in the top panel show the distribution for two synthetic data sets. Each data set comprises two classes marked by c and *, respectively. The diagrams in the bottom panel show the decision boundaries generated by PSVC. Each symbol represents one of the clusters produced by the MagKmeans algorithm.
3.8 The diagrams in the top panel show the distribution for two synthetic data sets. The Y-axis represents the response and the X-axis represents the predictor. The diagrams in the bottom panel show the regression generated by PSVR. Each symbol represents one of the clusters produced by the VarKmeans algorithm.
3.9 The left panel shows the runtime comparison among SVC, KNNC, HLSVC, SLSVC, and PSVC on the classification task. The right top panel shows the amount of time spent for MagKmeans clustering by PSVC (i.e., PSVC-Clustering) as opposed to the amount of time spent for training LSVC and applying them to the test examples (i.e., PSVC-LSVC). The right bottom panel compares the time spent by each algorithm to predict the class labels of test examples.
3.10 Effect of varying the number of clusters k on the performance of PSVC.
3.11 Classification runtime comparison among SVC, KNNC, HLSVC, SLSVC, and PSVC when varying the number of clusters k in PSVC.
4.1 A sliding window is used to create the regression training set D = X × Y.
4.2 Multivariate time series prediction.
4.3 A simple example of an Elman Network to model the AR(3) model for a univariate time series.
4.4 Hidden Markov Regression Model.
4.5 Bias and variance for MLR (left) and HMMR (right).
4.6 Prediction results (p=12, u=12).
4.7 White noise WN(0,0.5) (left) and prediction results (p=12, u=100, d=6) (right).
4.8 The value of unlabeled data for regression analysis.
4.9 The performance of SemiHMMR at each iteration compared with KNNR, HMMR, and KNNR+HMMR on the data set "Foetal".
4.10 The performance of MLR, HMMR, and SemiHMMR when varying the ratio of unlabeled to labeled data.
4.11 Performance comparison between HMMR, SemiHMMR, and CSemiHMMR on Canadian climate data.
5.1 FPAR time series at two different locations and their anomaly scores learned by moving average.
5.2 A weighted graph representation of a kernel matrix.
5.3 Clusters of ecosystem disturbance events detected in the North Carolina region (each symbol corresponds to a different cluster).
5.4 FPAR time series at different locations.
5.5 A multi-level approach for visual exploration of ecosystem disturbances.
5.6 An interactive visual data mining tool for detection and characterization of ecosystem disturbances.
5.7 Target specific anomaly detection.
5.8 General multivariate noisy time series anomaly detection.
5.9 Unusual subsequences detection.
5.10 Time-based local anomaly detection.
5.11 Cycle-based local anomaly detection.
5.12 Performance comparison using 38 data sets from the UC Riverside data repository.
5.13 An example of ROC curves for all the methods on the 7th data set.
5.14 FPAR and Precipitation time series at (latitude = 54.85N, longitude = 114.20W) and corresponding anomaly score by random walk.
5.15 FPAR and Precipitation time series at (latitude = 35.31N, longitude = 111.21W) and corresponding anomaly score by random walk.
5.16 FPAR and Sea level pressure time series at (latitude = 15.68N, longitude = 87.57W) and corresponding anomaly score by random walk.
5.17 Hayman Fire Event.
5.18 Chisholm Fire Event.

CHAPTER 1
Introduction

Advances in sensor technology, online mapping services (such as Google Earth, Yahoo Maps, and Microsoft Virtual Earth), and location-aware computing have generated massive amounts of spatial and temporal data that can be potentially mined to uncover valuable knowledge about a particular domain. Spatial data refers to a set of observations characterized by a spatial location in addition to other attributes. Examples of spatial data include crime data, housing price data, and vegetation cover data collected for various locations. Temporal data corresponds to an ordered series of observations, where each observation is associated with a time stamp. Examples of temporal data include climate time series, sensor data streams, and telecommunication alarm sequences. Spatio-temporal data refers to data with both spatial and temporal components. Examples of spatio-temporal data include cyclone trajectories, mobile user tracking data, and dynamic brain images. This thesis investigates some of the outstanding issues in mining spatial and temporal data with applications to the Earth Science domain.

1.1 Earth Science Applications

With growing concerns about the impact of climate change, climate study has become an increasingly important area of research. Figure 1.1 shows a substantial increasing trend in global temperature since the mid-twentieth century [181], which is believed to be partly caused by a steady increase in the global atmospheric concentrations of CO2, CH4, and N2O due to the fast growth of industrial activities [113]. According to a recent assessment by the Intergovernmental Panel on Climate Change (IPCC), the projected temperature increase by the end of the century is expected to range between 1.1°C and 6.4°C. Due to global warming, numerous long term changes in climate have been observed at continental, regional, and ocean basin scales. Some of these include widespread changes in temperature, precipitation amounts, ocean salinity, wind patterns, and the frequency and severity of extreme weather conditions, which lead to adverse consequences such as the melting of polar ice caps, rising sea level, diminishing fresh water supplies, decreasing rainfall, an increasing number of ecosystem disturbance events, and so on. Most of these events have a direct or indirect impact on human life; e.g., cities close to the coast may disappear due to rising sea level, and the growth of crops is strongly affected by the diminishing of fresh water. Thus, developing new techniques to explore the cause and impact of climate change has become an important research topic for scientists in various disciplines.

To help better understand and project future climate changes, scientists have collected a huge amount of data by utilizing the latest sensor technology and computing systems to measure various aspects of the land surface, atmosphere, oceans, and other components of the Earth system. This provides new opportunities for answering some important questions in Earth Science, e.g., (1) how
the earth is changing, (2) what factors cause these changes, and (3) how to forecast future changes.

Figure 1.1. An illustration of increasing global mean temperature (annual and five-year averages) from 1850 to 2000, relative to the mean temperature from 1961-1990 (from Global Warming Art [181]).

Data mining and knowledge discovery techniques can aid this effort by discovering patterns that capture complex interactions among ocean temperature, air pressure, surface meteorology, and terrestrial carbon flux. Table 1.1 shows several examples of Earth Science data and their potential applications.

Table 1.1. Some Examples of Earth Science Data and their Applications.

  Example of Earth Science data                  Application
  Climate observations (e.g., precipitation)     Projection of future climate scenarios
  Remote sensing data (e.g., FPAR)               Detection of ecosystem disturbances
  Earth environment measurements (e.g., ozone)   Prediction of the ozone alarm
  Trajectory data (e.g., cyclone tracks)         Clustering of extratropical cyclones

- Projection of future climate scenarios, such as forecasting the frequency and amount of rainfall or the number of frost days, is beneficial to climate-dependent industries such as agriculture, tourism, and recreational and commercial fishing [105].
- Discovery of anomalous patterns from historical vegetation cover data can be used to detect ecosystem disturbances such as forest fires, droughts, and hurricanes [170].
- Localized prediction with spatial and temporal dependencies could be used to predict the burned area of forests for an unknown region [54], or to classify the days on which the ground-level ozone reaches a dangerous level [224].
- Clustering of the trajectories of tropical cyclone tracks is important for improving the ability to forecast the location and timing of hurricane landfall [86], which in turn may help to mitigate hurricane damage.

1.2 Challenges

Despite its wide range of applications, mining Earth Science data is not a trivial problem because of its intrinsic spatial and temporal nature, which makes the direct application of standard data mining methods and algorithms problematic. As an example, the prediction of climate for a region depends strongly on the local topographic conditions of the field. Traditional data mining algorithms fail to give special consideration to inherent spatial and temporal dependencies. This has led to the development of numerous spatial and temporal data mining algorithms for different tasks such as prediction, clustering, anomaly detection, and change detection. Here we formally define spatial and temporal data mining as:

Definition 1. (Spatial and Temporal Data Mining) Spatial and temporal data mining is the task of discovering useful knowledge in data sets with spatial, temporal, or spatio-temporal dependencies.

In addition to the spatial and temporal dependencies, there are still many other technical challenges that need to be addressed, particularly in the context of mining Earth Science data. First, the deployment of high-resolution earth imaging satellites has generated huge amounts of Earth Science data at very high spatial resolution.
For instance, satellite-derived vegetation cover data such as FPAR (Fraction of Photosynthetically Active Radiation absorbed by vegetation) are available at a spatial resolution of 500m x 500m, which makes it impractical to perform exploratory data analysis in a real-time fashion. Second, Earth Science data may involve many variables, among which strong correlations may exist. For example, a change in global vegetation cover could be caused by disturbance events in climate variables such as temperature, precipitation, sea level pressure, and so on. Third, Earth Science data is mostly noisy. For example, satellite observations are distorted because of sensor noise or cloud blocking. Fourth, Earth Science data may come from different sources. For instance, there is climate simulation data from GCM (Global Climate Modeling) or RCM (Regional Climate Modeling) in addition to the true observation data from sensors. Fifth, acquiring early historical data in Earth Science is not only difficult but often impossible due to the unavailability of sensor technology in previous years. Last but not least, Earth Science data is not evenly distributed across different time periods or spatial locations. For example, forest fires happen more frequently in regions with dry weather or in years with less rainfall. Solving these issues by inventing new spatial and temporal data mining techniques and algorithms is critical to fully utilize the Earth Science data in helping understand or predict the potential impact of climate change.

1.3 Contributions

In this thesis, we investigate these problems mainly in three aspects:

- Localized prediction with spatial and temporal dependencies.
- Long term time series forecasting.
- Anomaly detection, anomaly characterization, and visualized exploration.

1.3.1 Localized Prediction with Spatial and Temporal Dependencies

Prediction is one of the most popular tasks in data mining and machine learning. It typically consists of two aspects: classification and regression. The task of classification is concerned with the prediction of predefined discrete labels, while the task of regression aims to predict continuous target values. Prediction with spatial and temporal dependencies is an important problem in mining Earth Science data, with many applications. For example, environmental scientists are interested in predicting the burned area of a potential forest fire [54] for a given region so that prevention measures can be better organized to reduce the damage. The spatial and temporal dependencies in Earth Science data may indicate a non-even distribution of the data across different spatial locations and temporal timestamps, which poses a great challenge for building an accurate prediction function. As an example, Figure 1.2 shows the forest fire distribution by month in Cuba from 1981 to 1996 [93]. Apparently, forest fires happen more frequently from February to May in Cuba.

To address these challenges, one must first decide whether to use global learning or local learning approaches. Global learning tends to build a sophisticated single model using the global characteristics of the training data. Auto-regression and Kriging are two popular global learning techniques widely used for spatial and time series prediction [29][96][103]. However, they are both least square methods for linear prediction.
Recently, a large number of algorithms have been proposed to solve prediction problems in the data mining and machine learning literature, which pursue various objectives and are capable of modeling complex nonlinear decision surfaces [221][32][179][130]. Among them, one of the most effective algorithms is the nonlinear Support Vector Machine (SVM), which learns a prediction function by maximizing the margin in classification or enforcing function flatness in regression, and employs a kernel to model nonlinear decision functions. One of the main difficulties found in global learning methodologies is the model selection problem. More precisely, one needs to select a suitable model and its parameters in order to represent the observed data.

Figure 1.2. Forest fire distribution by month in Cuba, 1981-1996 [93].

For example, a nonlinear SVM must employ sophisticated kernel functions to predict data sets with complex decision surfaces. Determining the right parameters for such kernel functions is not only challenging [84]; the resulting models are also susceptible to overfitting due to their large VC dimensions [32]. This problem is even more severe when applying nonlinear SVM directly to Earth Science prediction problems due to its disregard of the spatial and temporal dependencies. Instead of fitting a global model to the entire training data, another strategy is to use local learning by constructing multiple models using the local characteristics of the data. Recent research has demonstrated that local learning is superior to global learning, especially on data sets that are not evenly distributed [28][223][89][132]. A well-known classification technique that employs such a strategy is the K-nearest neighbor (KNN) classifier [55]. The advantage of using KNN is that it does not require making a biased assumption about the characteristics of the data or about the decision surfaces [11]. Nevertheless, KNN does not perform very well on high dimensional data and, because of its lazy learning strategy, classifying test examples can be computationally expensive.

We first developed a framework known as the Localized Support Vector Machine (LSVM), consisting of Localized Support Vector Classification (LSVC) and Localized Support Vector Regression (LSVR), which leverages the strengths of both the Support Vector Machine (SVM) and the K-nearest neighbor (KNN) classifier. Instead of fitting a sophisticated kernel function to build a global model for the entire training data, LSVM builds multiple linear SVM models by considering the training examples located in the vicinity of each test example. We empirically show that such a strategy leads to significant performance improvement over global nonlinear SVM and avoids the thorny issue of model selection.
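For illustration, the following minimal sketch captures the localized classification strategy described above: for every test example, a linear SVM is trained only on that example's k nearest training neighbors. It is not the thesis implementation; scikit-learn's NearestNeighbors and LinearSVC are used here as stand-ins for the neighbor search and the linear SVM, and the value of k is a hypothetical parameter.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import LinearSVC

def lsvc_predict(X_train, y_train, X_test, k=50):
    """Localized SVC sketch: one linear SVM per test example, trained on
    the k training examples closest to that test example."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    y_pred = np.empty(len(X_test), dtype=y_train.dtype)
    for i, x in enumerate(X_test):
        idx = nn.kneighbors(x.reshape(1, -1), return_distance=False)[0]
        Xi, yi = X_train[idx], y_train[idx]
        if len(np.unique(yi)) == 1:           # all neighbors share one label
            y_pred[i] = yi[0]
        else:
            y_pred[i] = LinearSVC().fit(Xi, yi).predict(x.reshape(1, -1))[0]
    return y_pred
```

Because a separate SVM is solved for every test example, the prediction cost grows linearly with the number of test points, which is the practical drawback that motivates the more efficient profile-based variant discussed next.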
However, since LSVM builds a unique model for each test example, it is impractical when the number of test examples is large. To overcome this limitation, we develop an efficient implementation of the algorithm, known as the Profile Support Vector Machine (PSVM), consisting of both Profile Support Vector Classification (PSVC) and Profile Support Vector Regression (PSVR). The intuition behind this algorithm is based on the premise that many test examples with similar neighboring training examples tend to share a common set of support vectors. The PSVM algorithm is therefore designed to exploit this property by employing a supervised clustering algorithm to partition the training data into clusters and training a local SVM model for each cluster. After constructing the local models, a test example is classified by finding its nearest cluster centroid and invoking the corresponding local SVM model. Our experimental results show that this strategy enables PSVM to maintain the high performance of LSVM without its costly computational overhead.

We also extend the LSVM and PSVM framework to Earth Science data, which exhibits strong spatial and temporal dependencies. Recent research on local learning for spatial and temporal prediction has demonstrated the significantly improved performance of building multiple local machine learning models compared with global machine learning algorithms [89][132]. In LSVM and PSVM, the neighborhood information of examples is defined in terms of the temporal and spatial attributes. The experimental results on real-world Earth Science data sets demonstrate a significant improvement in performance by the proposed LSVM and PSVM algorithms compared with global nonlinear SVM as well as KNN.
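The sketch below illustrates the profile strategy just described: partition the training data, fit one local linear SVM per cluster, and dispatch each test example to the model of its nearest centroid. It is only an assumption-laden illustration; plain k-means stands in for the supervised MagKmeans clustering that the thesis actually uses, and scikit-learn components replace the thesis's own solvers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def psvc_fit(X_train, y_train, n_clusters=20):
    """Profile SVC sketch: partition the training data and fit one linear
    SVM per cluster (plain k-means stands in for supervised MagKmeans)."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X_train)
    models = {}
    for c in range(n_clusters):
        Xc, yc = X_train[km.labels_ == c], y_train[km.labels_ == c]
        # degenerate clusters containing a single class just memorize that label
        models[c] = LinearSVC().fit(Xc, yc) if len(np.unique(yc)) > 1 else yc[0]
    return km, models

def psvc_predict(km, models, X_test):
    """Dispatch each test example to the local model of its nearest centroid."""
    clusters = km.predict(X_test)
    preds = []
    for x, c in zip(X_test, clusters):
        m = models[c]
        preds.append(m.predict(x.reshape(1, -1))[0] if hasattr(m, "predict") else m)
    return np.array(preds)
```

The key design point is that the clustering and the local models are built once, so the per-test cost reduces to a nearest-centroid lookup plus one linear prediction.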
1.3.2 Long Term Time Series Forecasting

Time series prediction aims to predict the future values of a target variable of interest based on its historical observations or other related information. It has been an active area of research for a long time, with many applications, e.g., weather forecasting [75][40], stock market analysis [215], network monitoring [27], and transportation planning [107][158]. Most of these problems focus on single step or short term prediction. Recently, there has been an increasing interest in long term time series forecasting for strategic planning and decision making. For example, Earth scientists are interested in projecting the future climate to assess its potential impact on the ecosystem and society. Figure 1.3 shows such an example of projecting the global temperature for the next 100 years by simulating different climate scenarios [182], e.g., CCSR/NIES, CCCma, and NCAR PCM.

Figure 1.3. Climate model predictions for future global temperature (°C) until 2100 under the SRES A2 emissions scenario, relative to global average temperatures in 2000 (from Global Warming Art [182]).

Despite its wide applications, long term time series forecasting is a challenging problem because of several issues. First, an extensive amount of historical time series data is required in order to build a reliable predictive model for long term time series prediction. However, collecting historical data from the early days may be not only difficult but also impossible due to the unavailability of facilities at that time. Second, most existing long term time series forecasting methods are extended from single step prediction methods by repeatedly invoking a model that makes its prediction one step at a time. However, since the model uses predicted values from the past to infer future values, these approaches may lead to the error accumulation problem, where the errors made at one prediction step are propagated to the next prediction step. Alternative methods such as independent prediction also have their own problems. Third, potential concept drift in many domains makes long term time series forecasting even more difficult. For example, to predict rainfall in the next 100 years, one should certainly consider the increasing trend of global temperature around the world as illustrated in Figure 1.3. Last but not least, using data from different sources to aid long term time series forecasting is a challenging problem since they may follow different data distributions. For instance, the GCM data used in climate projection, as shown in Figure 1.3, is generated by computer systems driven by a set of emissions scenarios, which may assume greenhouse gas concentrations that are different from those observed in the historical climate record.

To address these issues, we employ a multivariate time series prediction method in this thesis to perform long term time series forecasting instead of predicting the future values of a time series based on its historical values alone. The time series for a set of predictor variables are fitted against the time series for the response variable. This strategy avoids the requirement for long periods of historical data; however, it is contingent upon the availability of future values of the predictor variables. Such values can be obtained from computer-driven simulation models in many domains. For example, in climate modeling, outputs from global climate models (GCMs) in future periods [75] are often used as predictor variables to determine the climate of a local region. Potential concept drift in the future has been incorporated into the GCM data by simulating different climate scenarios. However, current approaches utilize the simulation data mostly in a supervised learning setting, where a predictive model trained on past observations is used to estimate the future values. Such an approach fails to take advantage of information about the future data during model building.

A semi-supervised learning framework is proposed for long term time series forecasting based on Hidden Markov Model Regression (HMMR). Our approach builds an initial HMMR model from past observations and incorporates future data to iteratively refine the model. Since the initial predictions for some of the future data may not be reliable, we need to ensure that they do not adversely affect the revised model. We developed an approach to overcome this problem by assigning weights to instances of the future data based on the consistency between their global and local predictions. This approach also helps to ensure the smoothness of the target function [227]. We demonstrate the efficacy of our approach using data sets from a variety of application domains.

When applying the semi-supervised HMMR to climate modeling problems, we must consider the potential inconsistencies between the training and future data since they come from different sources. Previous work on semi-supervised classification has shown that combining labeled and unlabeled data with different distributions may degrade the performance of a classifier [199]. We encountered a similar problem when applying the semi-supervised HMMR method to climate projection. A data calibration technique is developed to transform the data sets in a way that aligns their covariance structure while preserving most of the neighborhood information. Experimental results for modeling climate at a number of locations in Canada show that semi-supervised HMMR with data calibration outperforms the conventional (supervised) HMMR method.
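To make the calibration idea more tangible, the sketch below shows one standard way of aligning the covariance structure of two data sets: whiten the future (simulation-driven) predictors and re-color them with the covariance of the historical predictors. This is an assumed illustration of what such a transformation could look like, not the specific calibration derived in Chapter 4.

```python
import numpy as np

def align_covariance(X_future, X_hist, eps=1e-8):
    """Align second-order statistics of the future predictors with the
    historical predictors (whiten, then re-color).  Illustrative only."""
    def sqrt_and_inv_sqrt(C):
        w, V = np.linalg.eigh(C)
        w = np.clip(w, eps, None)
        return (V * np.sqrt(w)) @ V.T, (V / np.sqrt(w)) @ V.T

    mu_f, mu_h = X_future.mean(axis=0), X_hist.mean(axis=0)
    C_f = np.cov(X_future, rowvar=False)
    C_h = np.cov(X_hist, rowvar=False)
    C_h_sqrt, _ = sqrt_and_inv_sqrt(C_h)
    _, C_f_inv_sqrt = sqrt_and_inv_sqrt(C_f)
    # whiten with the future covariance, re-color with the historical one
    return (X_future - mu_f) @ C_f_inv_sqrt @ C_h_sqrt + mu_h
```

After this transformation the calibrated future data has (up to sampling error) the same mean and covariance as the historical data, which is the kind of consistency the semi-supervised HMMR relies on when it folds future instances into model refinement.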
1.3.3 Anomaly Detection, Characterization, and Visualization

Anomaly detection is a valuable data mining tool for discovering patterns that are significantly different from the rest of a data set. Recently, there has been increasing interest in time series anomaly detection in many applications, e.g., to detect unusual events such as ecosystem disturbances in Earth Science data. While numerous algorithms have been developed for detecting anomalies in time series data [121, 141, 16, 122, 212, 220], most of them are applicable only to univariate time series.

The detection of anomalies in multivariate time series is more challenging for several reasons. First, it is difficult to establish a concise definition of an anomaly. Analogous to univariate time series, some anomalies may correspond to abnormally high (or low) values or unusual subsequences (discord [121]) in one or more time series. In addition, multivariate anomalies may correspond to unexpected changes in the relationships among a set of variables [42]. For example, the time series for vegetation cover at mid-latitude locations in the United States typically varies in a 12-month cycle, peaking during the warm summer months and dropping to its minimum during the cold winter. Ecosystem disturbances such as wildfire and drought can be potentially detected based on the unusually low values of vegetation cover observed in the summer. We consider such anomalies as "local" (as opposed to global anomalies) since their values are abnormally low only when compared to the average values during the summer months. Second, the performance of a multivariate anomaly detection algorithm is highly susceptible to the presence of noise in one or more time series. Therefore, a multivariate anomaly detection algorithm must be robust to noisy measurements in order to improve its detection rate and false alarm rate.

Figure 1.4. FPAR and Precipitation time series at (latitude = 54.85N, longitude = 114.20W) with possible anomalies.

Multivariate time series anomaly detection can also be used to characterize the types of anomalies found in a target time series. For example, Earth scientists are interested in detecting wildfires by monitoring the anomalies found in vegetation cover data derived from NASA's Earth observing satellites. However, analyzing the vegetation cover data alone is insufficient to distinguish man-made wildfires from those induced by extreme climate events (such as lightning strikes from severe thunderstorms). Therefore, a key challenge is to combine the time series from other related variables (e.g., temperature and precipitation) to help explain the anomalies found in a target time series (vegetation cover).

Furthermore, previous work on ecosystem disturbance detection mainly focused on low resolution FPAR data, which can only detect sustained disturbance events that alter the structure of vegetation cover significantly over a sufficiently large region and result in an overall decline in the average annual vegetation levels. Smaller scale disturbances could have been missed when using the coarse resolution AVHRR data.
With the recent availability of high resolution vegetation cover data from the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument on board the NASA Terra satellite, smaller scale disturbances can be potentially uncovered. However, because the size of the data has grown by four orders of magnitude, applying the disturbance event detection algorithm to such massive data sets is computationally expensive. The problem is further exacerbated by the fact that a disturbance event detection algorithm (or an anomaly detection algorithm, in general) requires the specification of one or more thresholds to determine whether a detected event should be flagged as a real disturbance. The thresholds are determined by users through a trial and error process during the exploratory data analysis phase. Because of the large amount of data that must be processed, performing exploratory data analysis on the high resolution MODIS data in a real-time fashion is computationally infeasible. The massive size of the data also produces more events for scientists to validate. This necessitates the development of innovative data mining approaches that can assist Earth scientists in real-time exploration of global scale eco-climatic data.

To address these challenges, we propose a robust graph-based algorithm for detecting anomalies in multivariate time series data. Our algorithm captures the dependence relationships among variables in the multivariate time series using a kernel matrix alignment approach. The alignment helps to eliminate noise and retains only the anomalies in the target time series that can be explained by anomalies observed in other time series. The time series anomalies are detected by performing a random walk traversal on the graph induced by the aligned kernel matrix. Since a kernel matrix can be constructed from each time point or subsequence in the time series, our algorithm is flexible enough to handle various types of time series anomalies, including subsequence-based and local anomalies. The effectiveness of our algorithm is demonstrated using a number of synthetic and real data sets. We have also conducted a case study to show the ability of our algorithm to detect and characterize ecosystem disturbances in Earth Science data.
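A minimal, univariate sketch of the graph/random-walk idea follows. Each time point is a node, edge weights come from a kernel on the observed values, and each node is scored by how much probability mass a random walk assigns to it; nodes with low connectivity are flagged as anomalous (compare Table 5.1). The multivariate kernel alignment step that fuses several variables is omitted here, and the RBF kernel with the gamma shown is simply an assumed choice for illustration.

```python
import numpy as np

def random_walk_anomaly_scores(x, gamma=0.5, n_iter=100):
    """Graph-based anomaly scores for a univariate series x (sketch only).
    Edge weights are an RBF kernel on the values; the stationary
    distribution of a random walk over the resulting graph serves as a
    connectivity score, where a low score means a more anomalous point."""
    x = np.asarray(x, dtype=float)
    K = np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)   # kernel / affinity matrix
    np.fill_diagonal(K, 0.0)                              # no self-loops
    P = K / K.sum(axis=1, keepdims=True)                  # row-stochastic transition matrix
    c = np.full(len(x), 1.0 / len(x))                     # start from the uniform distribution
    for _ in range(n_iter):                               # power iteration: c <- P^T c
        c = P.T @ c
    return c

# usage sketch: scores = random_walk_anomaly_scores(fpar_series)
#               most_anomalous = scores.argsort()[:10]
```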
A visualization system is further developed for detecting anomalies in Earth Science data interactively. We illustrate the use of clustering to group together regions with similar incidents of ecosystem disturbances. This approach provides an unsupervised way to categorize the events and helps reduce the number of events that need to be validated. We also present a multi-level indexing scheme based on clustering to aid the discovery and visual exploration of ecosystem disturbances. We show how the indexing scheme can be used to enable users to quickly focus on regions of interest during the exploratory data analysis phase. While clustering-based methods have been proposed for spatio-temporal indexing and for data reduction purposes [34, 19], none of them are specifically designed for detecting anomalies (such as ecosystem disturbances) in spatio-temporal data. In this work, we evaluate the effectiveness of using clustering-based methods for exploratory analysis of ecosystem disturbances. Our disturbance event detection and clustering algorithms have been integrated into an interactive system we developed. The system enables scientists to explore the data in a real-time fashion and to visually inspect clusters of locations where similar types of disturbance events were observed.

1.4 Outline of the Thesis

The rest of this thesis is organized as follows. Chapter 2 provides a literature survey of spatial and temporal data mining. Chapter 3 presents a localized prediction algorithm based on the Support Vector Machine. Chapter 4 investigates the problems in long term time series forecasting and proposes a semi-supervised method based on Hidden Markov Regression to improve the prediction accuracy. Chapter 5 discusses a graph-based time series anomaly detection algorithm, a kernel alignment based anomaly characterization technique, and a visualization system for exploratory analysis of large scale spatial and temporal data. Finally, Chapter 6 concludes with a summary of the thesis and suggestions for future research. Note that the techniques developed in this study are potentially applicable to other spatial and temporal domains such as sensor networks, intelligent transportation systems, image processing, and biogeography.

CHAPTER 2
Spatial and Temporal Data Mining

Spatial and temporal data mining is a subfield of data mining and knowledge discovery that refers to the extraction of interesting patterns from spatial and temporal data, which exhibit spatial, temporal, or spatio-temporal dependencies as described in Section 2.1. Research on spatial and temporal data mining has received significant attention in the past decade from researchers in the statistics, data mining, and machine learning communities. As illustrated in Figure 2.1, we provide a literature survey on spatial and temporal data mining in three areas: spatial data mining, introduced in Section 2.2; temporal data mining, discussed in Section 2.3; and spatio-temporal data mining, presented in Section 2.4. Each area consists of several mining tasks, including prediction, clustering, association rule mining, and anomaly detection. For spatio-temporal data mining, an additional survey on visualization is also presented, which is useful for studies on large scale Earth Science data. Finally, the scope of the research directions for this thesis is given in Section 2.5.

Figure 2.1. Survey on Spatial and Temporal Data Mining.

2.1 Spatial and Temporal Data

Here we give a simple notation for the spatial and temporal data used throughout this thesis in terms of spatial data, temporal data, and spatio-temporal data.

Definition 2. (Temporal Data) Temporal data corresponds to an ordered series of observations, where each observation is associated with a timestamp.

Note that we do not consider sequences ordered by other indexes, such as text, protein sequences,
Another category of temporal data is a sequence of events with categorical values ordered based on their timestamps, e.g. telecommunication alarms/events, web server logs, online transaction logs, etc. In this thesis, we are mostly concerned with time series. which are widely available in Earth Science data. Time series data consists of univariate time se- ries and multivariate time series according to the number of available variables. A univariate time series is an ordered sequence of real values for a single variable. When the temporal information t,- represents nothing more than an order from 1 to n as t,- = i, the univariate time series data can be simply represented as: A multivariate time series is a vector of time series involving more than one variable. To accom- modate different applications, we use two notations here to represent a multivariate time series with 1 data examples and p variables. One formulates a multivariate time series as a vector of p variables: [X1,X2l "'3 Xp] where X ,- represents it}, time series variable with length 1; another formulates a multivariate time series as a vector of l data examples: T 331. m2. 1 $1] where £13,: represents it h example with p attributes. Definition 3. (Spatial Data) Spatial data corresponds to data characterized by a spatial location and other non-spatial attributes. Here we use 0,- to denote location feature for the example sci. A spatial data set with 1 data examples is represented as: [(01,31). (02,502), (01,931) ]T A popular representation of the spatial features in Earth Science data are the latitude and longitude o,- = (lati, loni), which determine the locations of the various places on planet earth. Definition 4. (Spatio-temporal Data) Spatio-temporal data corresponds to a set ofobservations with both spatial location features and timestamps. Using similar notations introduced above for spatial data and temporal data, a spatio-temporal data is represented as: T [(151.01431), ($2.021332), 01,01,920] where (ti, 0,) constitutes a space-time index for each data example 93,. Examples of spatio- temporal data include vegetation cover or climate time series recored in different regions, objects moving across different space and time, spatial events or phenomena evolving over time, etc. 16 2.2 Spatial Data Mining Spatial data mining is the process of discovering interesting and previously unknown, but po- tentially useful patterns from large spatial datasets. Mining spatial data is very useful for many applications, ranging from remote sensing, to geographical information systems (GIS), computer cartography, environmental assessment and planning, etc. There has been a long history of spa- tial data mining in the statistics, geography, data mining, and machine learning communities [80][78][2][190][126]. We identify some representative work in the literature for a number of different tasks such as spatial prediction, spatial clustering, spatial association rule mining, and spatial anomaly detection. 2.2.1 Spatial Prediction Spatial prediction builds a predictive model from the data in observed locations and predicts the target values for unobserved locations. Previous work on spatial prediction mostly focused on regression but is applicable to classification problems. Spatial prediction has a rich literature in geostatistical analysis. Among them, one of the simplest is the spatial auto-regression model (SAR) [96], which is similar to auto-regression in time series prediction. 
It builds a global linear regression model to model the spatial dependencies from the entire observed locations and uses it to predict the unobserved location. Parallel formulation of the SAR model is also developed by Kazar et al. [117] to speed up this method. Markov random fields is another popular model that incorporates spatial autocorrelation in the Bayes framework for spatial prediction, which is believed to make less restrictive assumptions than the SAR model as discussed by Shekhar et al. [189]. As spatial data are often influenced by various local phenomena, the idea of using local models was also developed for spatial prediction problems. Geographically weighted regression (GWR) [83] is a local version of spatial regression that estimates parameters taking into account the spatial proximity between training and testing examples. Kriging is a group of geostatistical techniques to interpolate the value of a random field at an unobserved location from values at observed locations [58][68], which is also known as Gaussian process regression in the statistics community. Both the Kriging and SAR methods are linear methods based on least square es- timation. Recently, there has been increasing interest in applying more sophisticated nonlinear l7 machine learning models to spatial prediction, e.g. Neural Network [l6l][130] provides mod- els of relationships between inputs and outputs through highly interconnected “neurons”, Support Vector Machine [89] learns a nonlinear decision firnction by minimizing the prediction error while maximizing the margin, Decision Trees [222] algorithm builds a tree-like structure of predictors, etc. However, building global machine learning models is not suflicient, especially for data sets with spatial and temporal dependencies. There has been some work on building local machine learning models for such data, e. g. Gilardi et al. proposed to build a local Support Vector Regres- sion or local Neural Network for each testing example from the training examples in its spatial neighborhood. However, their method is very slow and has the difficulty in selecting a suitable number of nearest neighbors. 2.2.2 Spatial Clustering Clustering of spatial data aims to group together observations with similar spatial and non-spatial attributes to identify the underlying structure of the data. Spatial clustering has been used to discover “hot spots” in geostatististics [94], e.g. mapping of hot spots for criminal activities or diseases in a city. Spatial clustering algorithms can be classified into three categories: parti- tioning based method, hierarchical based method, and density based method. Partition based methods try to divide the data into a given number of clusters. The K—means [114], K-medoids (PAM)[116] and EM algorithms [66] are three examples of partition based clustering methods. K-means algorithm finds the clusters by estimating the centroid to represent each cluster and as- signing each data point to its closest centroid iteratively. The K-medoids algorithm is similar to K-means except that the data point closest to the centeroid is utilized to represent the cluster. The EM algorithm assumes that each cluster follows a mixture of Gaussian distribution and data points are assigned to the clusters with probability. These methods perform poorly on large scale spatial data. To overcome this problem, a sampling based clustering method CLARA (clustering large application) is proposed by Kaufman et al. 
[116] to run the K-mediod algorithm on a randomly sampled subset of the data. Although CLARA is more efficient, there is a problem when the true mediods are missed during sampling. An improved algorithm CLARANS [155] is proposed by introducing some randomness in each sampling step and taking advantage of some eflicient in- dexing technique such as R-tree [81]. These methods can only find clusters with spherical shapes 18 and they require the number of clusters to be predefined. Hierarchical clustering algorithms such as BIRCH [225] and CHAMELEON [115] build a dendrogram to split the data set into smaller subsets. Two strategies are usually used: agglomerative algorithms build the tree in the bottom-up manner, while divisive algorithms build the tree in the up-bottom manner. Hierarchical clustering is powerful in discovering arbitrarily-shaped clusters but is expensive for high dimensional data. Density based methods such as DBSCAN [79] assume a data point in high density belongs to a cluster. GDBSCAN [183] generalizes it to cluster data points with both spatial attributes and non-spatial attributes. DBSCAN and GDBSCAN are useful to find arbitrarily-shaped clusters and to distinguish outliers from clusters. However, they are not very efficient for high dimensional data and need a predefined parameter such as radius. Bethi et al. proposed another density based method called CLIQUE [20], which divides the space into grid cells whose density is computed and grid cells with high density are merged to form clusters. It is more eflicient but less effective than DBSCAN for high dimensional data. 2.2.3 Spatial Association Rule Mining The task of spatial association rule mining is to discover spatial relationships between objects or events in one location with those in other locations [127]. For example, a rule like “most big cities in US are close to the coast” is a spatial association rule. Spatial association rule min- ing has been applied to many applications such as geographic information system [127], image analysis [109], census analysis [142], etc. The earliest algorithm for mining spatial association is to apply standard associate rule mining algorithm such as Apriori [3] method to spatial data points whose spatial relationships satisfy some user specified task [127]. As a result, it separates spatial computations from the process of generating spatial association rules. Recently, there has been increasing interest on mining co-location patterns [l49][150][186][226], which are spatial associations dedicated to capturing neighboring relationships. Morimoto [149] and Shekhar et al. [186] proposed an algorithm to discover clique co-location patterns, which requires all in- volved objects to be close to each other. Finding co-location patterns are very time consuming. To solve this problem, Zhang et al. [226] implement a partition-based optimization strategy to find co-location patterns more efliciently; however, it requires all spatial objects to be data points. Xiong et al. [218] propose co-location mining for extended spatial objects such as polygons and 19 line—strings. Most of these work on spatial association rule mining consider short-range depen- dencies only. Mining spatial association rules describing long-range dependencies remain an open research problem. 
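The neighborhood-based co-location methods cited above can be illustrated with a minimal sketch of size-2 clique co-location counting. The distance threshold, feature labels, and the simple pair-count prevalence measure are illustrative assumptions, not the exact definitions used in [149][186]:

```python
import numpy as np
from itertools import combinations
from collections import Counter

def colocation_pairs(points, labels, radius):
    """Count how often each unordered pair of feature labels occurs
    within `radius` of each other (a simple size-2 co-location count)."""
    points = np.asarray(points, dtype=float)
    counts = Counter()
    for i, j in combinations(range(len(points)), 2):
        if labels[i] != labels[j] and np.linalg.norm(points[i] - points[j]) <= radius:
            counts[tuple(sorted((labels[i], labels[j])))] += 1
    return counts

# Example: locations of three feature types; points within distance 1.0 are neighbors.
pts = [(0, 0), (0.5, 0.2), (5, 5), (5.3, 4.8), (0.2, 0.4)]
lab = ["A", "B", "A", "C", "C"]
print(colocation_pairs(pts, lab, radius=1.0))
```

Pairs whose count exceeds a user-specified prevalence threshold would be reported as co-location patterns; the methods discussed above differ mainly in how neighborhoods are defined and how the search space of larger patterns is pruned.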
2.2.4 Spatial Anomaly Detection Spatial anomaly detection, also known as spatial outlier detection, identifies spatial locations whose non-spatial attributes are significantly different fi'om the values of their neighborhoods even though they may not be significantly different from the entire population [1 87][1 39]. Spatial anomaly detection approaches were first studied in geostatistics, which can be generally grouped into two categories: graphical approaches and quantitative tests. Graphical methods are based on the mapping of spatial data that highlights spatial outliers, e.g. variograrn clouds and pocket plots [98][201]. These methods require a domain expert to detect the spatial anomalies visually. Quantitative methods use statistical tests to identify spatial outliers from the remainder of data such as scatterplot proposed by Han et a1. [95] and Moran scatterplot approaches developed by Anselin et al. [9]. Data mining methods have also been applied to detect spatial outliers recently. Shekhar et al. [188] introduced a method for detecting spatial outliers in graph data sets based on the distribution of the difference between an attribute value and the average attribute value of its neighbors. Lu et al. [139] also proposed two methods, one is non-iterative algorithm that uses median as the neighborhood function, another is an iterative algorithm that identifies only one outlier in each iteration and modifies its attribute value to prevent negative impact on subsequent iterations. The false alarm rate of these algorithms are still relatively high. Reducing the false alarm rate of spatial outlier detection needs further investigation. 2.3 Temporal Data Mining Temporal data mining is concerned with extracting interesting and previously unknown, but po- tentially useful patterns from large scale temporal data sets. There has been increasing interest on mining temporal data during the past several years [135]. Most previous work can be categorized into several major tasks including classification, clustering, time series forecasting, anomaly de- tection, and temporal association rule mining. Here we present a brief review of the techniques 20 and algorithms in terms of different temporal data mining tasks. 2.3.1 Time Series Forecasting The task of time series forecasting aims to predict the future values of a time series based on its own historical values or information from other related variables. Extensive studies have been conducted for the problem of time series forecasting in the statistics, data mining, and machine learning communities. These approaches typically use a sliding window to create a training set from the historical data to build a predictive model. The auto-regression (AR) model [29] is the earliest and simplest predictive model, which is designed to predict the future values of a stationary time series [29][41][101] based on a linear combination of historical values using least square linear regression model [118]. A more complex linear model such as ARIMA [147] is also invented for the modeling of non-stationary time series, which has been widely applied to financial stock analysis. Another popular alternative model to the AR model for modeling non- stationary time series is the piece-wise linear regression model, which breaks the time series into a series of stationary pieces and learns a separate linear regression model. 
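The auto-regressive idea underlying these forecasting models can be made concrete with a short sliding-window least-squares sketch. The window length and the recursive forecasting loop are simplifying assumptions for illustration, not any of the cited models:

```python
import numpy as np

def fit_ar(series, p=3):
    """Fit an AR(p) model by least squares on sliding windows of the series."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i:i + p] for i in range(len(series) - p)])
    y = series[p:]
    coef, *_ = np.linalg.lstsq(np.column_stack([X, np.ones(len(X))]), y, rcond=None)
    return coef  # p lag weights followed by an intercept

def forecast(series, coef, steps=5):
    """Recursive multi-step forecast: each prediction is fed back as an input."""
    p = len(coef) - 1
    history = list(np.asarray(series, dtype=float)[-p:])
    preds = []
    for _ in range(steps):
        nxt = float(np.dot(coef[:-1], history[-p:]) + coef[-1])
        preds.append(nxt)
        history.append(nxt)
    return preds
```

The recursive feedback of predicted values as inputs in the second function is precisely the mechanism that causes the error accumulation discussed below for long-term forecasting.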
A similar approach is also found in the Hidden Markov Regression [85] for time series prediction, which assumes a stochastic process on a set of states, each associated with a separate linear regression models. In addition to those linear models, nonlinear regression models are also used to model the time series data. For example, Recurrent Neural Network [90], Support Vector Regression[157], Gaussian Process [17], and Decision Trees [138] have been applied to model nonlinear time series for future forecasting. However, these works mostly focused on univariate times series prediction. These works mostly utilize a recursive prediction methodology, where the outputs from the last prediction are used as inputs for the next prediction, and thus fails for long term time series prediction. Although an advanced model was proposed by Herrera et al. in [104] to improve the long term time series forecasting by using a noiseless one-step-ahead prediction for recursive training, their method did not address the error accumulation caused by model bias and variance. Chapados et al.[39] deveIOped augmented functional time series representation and a forecasting algorithm based on Gaussian processes to avoid point estimation. However, the prediction window of their algorithm is still relatively short and the error caused by the biased estimation of the posterior distribution will still be accumulated for long term time series prediction. 21 Multivariate time series forecasting [138] predicts the firture values of a time series by utilizing not only its own historical values but also information from other related time series. Methods for multivariate time series prediction are usually extended from univariate time series prediction by including related time series as predictor variables. For example, Popescu et al. [169] proposed to use independent component analysis to decompose the multivariate time series into independent components and perform univariate forecasting for each component separately whose predicted results are recombined at the final stage. 2.3.2 Classification of Temporal Data Temporal data classification assumes that each sequence or time series belongs to one of the predefined classes or labels and the objective is to automatically determine the class or label for a given sequence. Temporal data classification has been investigated widely in many applications, for example, speech recognition tries to transcribe speech signals into their corresponding textual representation [174] [92]; gesture recognition classifies the human body motions into messages people want to express [61] or some sign language [192]; handwritten words recognition tends to recognize the words from a sequence of pixel columns and segments extracted form the image of handwritten texts [43][151][211]. The methodologies used in these works can be categorized into pattern based and model based. Pattern based methods align the given sequence to a collection of prototype sequences and find the closest match using certain sequence similarity measures such as dynamic time warping (DTW) [123][50][50] and longest common subsequence (LCSS) [62]. Model based methods assign labels to a sequence according to a classification model that is learned from a set of training sequences. Examples of model based methods for temporal data classification include Hidden Markov Model [15][203], Neural Networks [216][195], Support Vector Machine [72], etc. 
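Among the pattern-based similarity measures mentioned above, dynamic time warping (DTW) is easy to state precisely. A minimal quadratic-time implementation without a warping-window constraint is sketched below (a simplified illustration of the measure used in [123][50]):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two univariate time series."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible alignment moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# The warped distance stays small even when one series is shifted in time.
print(dtw_distance([0, 1, 2, 3, 2, 1], [0, 0, 1, 2, 3, 2, 1]))
```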
2.3.3 Clustering of Temporal Data Clustering of temporal data aims to group together a collection of temporal data based on their similarity. Clustering is important for mining temporal data since it provides a way to automati- cally reveal the intrinsic structure of the data and helps people to better understand the data. For 22 example, it would be interesting to cluster gene expression time series [76], stock time series [33] or trajectories of moving objects [64], whose trends and variations behave similarly. Many differ- ent models has been proposed to cluster temporal data. For example, mixtures of ARMA model [219] or Hidden Markov Model [7] have been utilized for the clustering of time series data; Sebas- tiani et al. pr0posed the Bayes model [185] and Law et al. proposed a rival penalized competitive learning method [133] to cluster sequence data. A number of different similarity measures such as dynamic time warping (DTW) [123][50][50], longest common subsequence (LCSS) [62], etc. have also been proposed to cluster time series data using standard clustering algorithms such as K-means [156], density based [64], and hierarchical clustering algorithms [180]. 2.3.4 Temporal Association Rule Mining (Frequent Pattern) Temporal association rule mining is designed for discovering frequent sequential patterns [208][4][88], which comprise temporal order information between elements, in the collections of temporal transactions. The Apriori [3] method has been adapted for mining temporal asso- ciation rules by counting the fraction of all the sequences, in which rules are contained, in the temporal database. One extension of frequent pattern mining in temporal data is frequent episode mining, which is to extract subsequences that repeat a sufliciently number of times in the given long time series or sequence [l44][10] [l3][136][10]. Temporal episode mining has many useful applications such as analyzing alarm streams in a telecommunication network [144] or mining of data from Walmart transactions [10]. Furthermore, time dependence is also incorporated explic- itly into the frequent sequential pattern mining or frequent episodes mining. For example, Lee et a1. [38] proposed to incorporate the duration of the events into frequent sequential pattern mining and Ozden et al. [160] introduced cyclic association rules that hold with a fixed periodicity along the entire length of the sequence of time intervals. 2.3.5 Time Series Anomaly Detection Time series anomaly detection aims to discover data points or subsequences that are significantly different than the other parts of the time series. Numerous algorithms have been proposed to dis- cover anomalous patterns in the univariate time series. For example, Keogh et al. [121] proposed 23 an algorithm for mining time series that contain subsequences with largest nearest neighbor dis- tance. Mahoney and Chan [141] employed a path and box feature trajectory approach to model the time series. Anomalous subsequences are detected based on their deviation from the trajectory path. Bay et al. [16] utilized local AR models to transform the data into its corresponding param- eter space and anomalies were detected based on distances computed in the parameter space. F requency-based methods [122, 212] are designed to discover unusual subsequences who happen less frequently than other patterns. Immunology-based method [63] uses given normal pattern to detect anomaly patterns as templates and uses these templates to find unusual pattern in a new time series. 
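The nearest-neighbor discord idea of Keogh et al. [121] mentioned above reduces to a simple distance computation: the subsequence whose closest non-overlapping match is farthest away is flagged as the most anomalous. The brute-force search and the fixed window length below are simplifying assumptions:

```python
import numpy as np

def top_discord(series, w):
    """Return the start index (and distance) of the length-w subsequence whose
    nearest non-overlapping neighbor is farthest away (brute-force discord search)."""
    x = np.asarray(series, dtype=float)
    subs = np.array([x[i:i + w] for i in range(len(x) - w + 1)])
    best_idx, best_dist = -1, -np.inf
    for i, s in enumerate(subs):
        # exclude trivially overlapping subsequences from the neighbor search
        dists = [np.linalg.norm(s - t) for j, t in enumerate(subs) if abs(i - j) >= w]
        nearest = min(dists)
        if nearest > best_dist:
            best_idx, best_dist = i, nearest
    return best_idx, best_dist
```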
Probability based method [220] calculates the anomaly score for any given data as its deviation from the learned probabilistic AR (Auto-regression) model from the historical data. These methods mostly focused on anomaly detection in univariate time series. Recently, there has been considerable interest in developing anomaly detection algorithms for multivariate time series. One category of the methods utilize time series projection [87] and in- dependent component analysis [213] to convert the multivariate time series into univariate time series. One potential limitation of these methods is the loss of information incurred when pro- jecting the data into l-dimensional space. Despite the rich literature on time series anomaly detection, there are few works on characterizing the discovered anomalies. Lakhina et al. [131] proposed an approach to characterize anomalies in network traffic flows. Their work, however, simply performs a separate anomaly detection for each variable without focusing on any specific target variable. Potter et al. [171] used association analysis to relate extreme climate events with ecosystem disturbances. However, it is achieved by a separate postprocessing step. 2.4 Spatio-Temporal Data Mining Spatio-temporal data mining is an emerging research area dedicated to the development and ap- plication of novel computational techniques for the analysis of large spatio-temporal data sets, e. g. climate time series recorded for different locations in Earth Science data or moving objects tracked at different locations and time. Compared with spatial data mining or temporal data mining, stud- ies on spatio-temporal data mining are limited. Recent studies on spatio-temporal data mining mainly focus on two strategies. One is to decompose the spatio-temporal data mining into two 24 independent sub-problems: spatial data mining and temporal data mining [146][21][167][168]. Another strategy is to extend techniques in spatial or temporal data mining directly to the mining of spatio-temporal data [l62][l45][77][145]. Similar to previous sections on spatial data min- ing and temporal data mining, we present a brief review on the latest work on spatio-temporal data mining methods in terms of spatio-temporal prediction, spatio-temporal clustering, spatio- temporal association rule mining, and spatio-temporal anomaly detection. An additional survey for the indexing and visualization of large scale spatio-temporal data is also provided. 2.4.1 Spatio-Temporal Prediction Spatio—temporal prediction is to build a predictive model from the observed data with space-time information and predict the target variable at an unobserved location and time. Most spatio- temporal prediction techniques are extended from spatial prediction or time series forecasting models. One example is the spatio-temporal auto-regression model proposed by Kelley et al. [162] for real estate price prediction, which formulates the auto-regression model with both spatial and temporal neighborhood information. Markov Random Field [145] and Kriging [77] have also been extended from spatial prediction to spatio-temporal prediction as well. Another family of spatio- temporal prediction methods decomposes the problem into two separate steps. For example, a two-phase spatio-temporal auto-regression model was proposed by Pokrajac et al. [l67][168], which models the spatial data first and then adjusts the prediction by incorporating autoregressive modeling of residuals in time. 
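The two-phase strategy described last, fitting a spatial model first and then modeling the temporal structure of its residuals, can be sketched as follows. The ordinary-least-squares trend over coordinates and the AR(1) residual correction are a simplified illustration, not the exact model of [167][168]:

```python
import numpy as np

def fit_two_phase(coords, times, y):
    """Phase 1: linear regression on spatial coordinates.
       Phase 2: AR(1) model fitted to the time-ordered residuals."""
    coords, y = np.asarray(coords, dtype=float), np.asarray(y, dtype=float)
    A = np.column_stack([coords, np.ones(len(y))])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)          # spatial regression
    resid = y - A @ beta
    r = resid[np.argsort(times)]                          # residuals in temporal order
    rho = np.dot(r[:-1], r[1:]) / np.dot(r[:-1], r[:-1])  # AR(1) coefficient
    return beta, rho

def predict_two_phase(beta, rho, coord, last_resid):
    spatial_part = np.dot(beta[:-1], coord) + beta[-1]
    return spatial_part + rho * last_resid                # temporal correction
```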
2.4.2 Spatio-Temporal Clustering Spatio-temporal clustering tries to group together the spatio-temporal data points with similar behavior, which is often used to extract interesting groups of trajectories. Most spatio-temporal clustering algorithms are simply extended from the spatial clustering algorithms discussed in Sec- tion 2.2.2. For instance, the partition based method K-means is utilized to cluster the climate variables indexed by both space and time to extract time varying climate regions [106]; the den- sity based method DBSCAN is applied to cluster trajectories of moving objects [145] and the hierarchical agglomerative clustering algorithm is adapted to group together similar trajectories 25 [152]. Other spatio-temporal clustering techniques such as spatial scan has also been developed by Kulldorff et al. [129] for epidemiological data to extract spatio-temporal cylinders (e. g. spatial circular shapes that remain still for some time interval) where the rate of disease cases is higher than outside the cylinder. 2.4.3 Spatio-Temporal Association Rule Mining Spatio-temporal association rule mining aims to discover how objects move between regions and over time. The most straightforward technique for spatio-temporal association rule mining is conducted by mapping spatio-temporal data into a transaction table and to use the standard Apriori approach to extract spatio-temporal rules [146]. A more concrete definition of spatio-temporal rule is first given by Tao et al. [198], which explores the association between objects or events in different regions over time, e.g. the rule (2', Vt, p) => (j ) means if an abject appears in region i at time t, it will appear in region j at time t + Vt with probability p, which is discovered by a simple brute force method. A more advanced method to mine such spatio-temporal association rules was proposed by Verhein et al. [205][204] to deal with the spatio-temporal semantics such as area effectively throughout the mining process and prune the search space as much as possible. It also defines a series of special patterns to describe important temporal characteristics of regions, e. g. stationary regions where many objects tend to stay over time and high traffic regions where many objects tend to move in (sink), move out (source), or move through (thoroughfares). There is also some work on the discovery of other interesting spatio-temporal association rules. For example, Marnoulis et al. proposed a technique to mining periodic patterns [143], represented by objects moving between regions and following the same routes approximately over regular time intervals. Mining spatio-temporal association rules is slow for large scale data. Wang et al. proposed an eflicient disk based algorithm for mining the events changing over space and time [209]. 2.4.4 Spatio-Temporal Anomaly Detection Spatio-temporal anomaly detection is to detect data points significantly different than other data in their spatio-temporal neighborhoods. A simple strategy is provided in [49] by Cheng eta1., which defines spatial outliers that are not always present in consecutive time frames as spatio-temporal 26 outliers. The temporal dimension is not well treated, since it can only compare spatial outliers between immediate time snapshots. Lu et al. [140] introduced the time dimension by linking the center region outliers in different time frames, which can be used to detect moving outliers. Similarly, Birant et al. 
[21] proposed to detect spatio-temporal outliers from large scale data sets by clustering data first followed by procedures of checking spatial and temporal neighbors separately. Basically, all these approaches define spatio-temporal outliers by incorporating time information into the spatial outliers. Sun et al. [196] gave a more general study by defining two kinds of spatio-temporal outliers: time period outliers given a region, and region outliers given a time period; however, mining spatial and temporal data for anomalies are mostly treated as two separate sub-problems. 2.4.5 Exploratory Analysis and Visualization of Spatio-Temporal Data Spatio-temporal data sets are often very large and difficult to analyze and display. Since they are fundamental for decision support in many application contexts, recently a lot of interest has arisen toward data-mining techniques to extract relevant subsets of very large data repositories and effectively display the results by addressing many challenges such as indexing, representa- tion, querying, mining, etc. Compieta et al. [52] proposed a complementary, dual 3D visualization environment for mining spatio-temporal data sets, which is composed of a Google Earth system to display the mining outcomes and a Java3D-based tool for providing advanced interactions with the data set in a non-geo-referenced space, such as displaying association rules. Olga et al. [159] developed a system that enables users to query clusters of time series with similar geometric shapes. These systems have mainly focused on the visualization aspect and do not consider issues such as large scale or high resolution data. Techniques such as wavelet tree [137], R—tree [153], and aRB-tree [164] are some of the well-known indexing techniques for online analytical process- ing. Camossi et al. [34] presented a multi-level indexing technique based on spatial and temporal granularity. However, it does not consider the similarity between data points when constructing the index. Their technique is also especially designed for clustering tasks. Bertolotto et al. [19] developed an approach to use clustering for data reduction purposes, which is similar to our work. However, they did not consider the problem of choosing representative samples for a specific data mining task such as anomaly detection. Denny et a1. [67] presented a multi-level technique for 27 exploratory data analysis of hot spots. While there are other works on multi-level techniques [70], they are not directly applicable to our disturbance event detection problem. 2.5 Scope of Thesis Research In this chapter, we reviewed some related work on spatial and temporal data mining in terms of three research areas: spatial data mining, temporal data mining and spatio-temporal data mining. The contributions and limitations of a number of algorithms and statistical models proposed in the literature for different mining tasks as clustering, prediction, anomaly detection and association rule mining were discussed. Despite the broad scope of research areas in spatial and temporal data mining, this thesis will mainly focus on the following research areas that are motivated by the problem of mining Earth Science data: spatial and temporal prediction, anomaly detection, anomaly characterization, and visualization. 28 CHAPTER 3 Localized Prediction with Spatial and Temporal Dependencies Prediction is one of the most important tasks in data mining. Consider a training set: Pr. = {XLaYL} = 10131.91), (932.92). - - - . ($1.yi)lT. 
and test set

D_U = \{X_U\} = [x_{l+1}, x_{l+2}, \ldots, x_{l+u}]^T,

the task of prediction is to build a predictive model from the training set D_L that maps input to output,

f : X \rightarrow Y.

The constructed model is then used to predict the target y for any given test example x \in X_U as

\hat{y} = f(x).

Generally speaking, a prediction task can be divided into classification and regression tasks depending on the format of the target value y. A classification task predicts a predefined discrete class label; for example, y \in \{-1, +1\} for a two-class problem. A regression task tries to predict a continuous target y \in \mathcal{R}. In this chapter, we explore a localized method for the prediction problems that arise in mining Earth Science data with spatial and temporal dependencies. The main contributions of this chapter are as follows:

- We develop an integrated localized prediction framework based on Support Vector Machine by incorporating the neighborhood information between the test examples and the training examples.

- The proposed localized algorithm is slow because one model must be learned for each test example. A more efficient algorithm, the profile Support Vector Machine, is proposed based on supervised clustering to reduce the computational cost of learning.

- Experiments on a number of real and synthetic data sets demonstrate the efficacy and efficiency of the proposed algorithms compared to K-nearest neighbor and Support Vector Machine. The applicability of our algorithms to Earth Science data is also investigated by incorporating spatial and temporal dependencies.

The remainder of this chapter is organized as follows. Section 3.1 introduces the problem of localized prediction and some background. Section 3.2 proposes the general framework of our localized Support Vector Machine. Section 3.3 presents the profile Support Vector Machine algorithms. We extend our algorithms to incorporate spatial and temporal dependencies in Section 3.4. The experimental results on both benchmark and Earth Science data sets are shown in Section 3.5.

3.1 Preliminaries

One of the most important questions in building a prediction model for a data set is whether to use a global or a local learning framework. Global prediction learns a global predictive model f : X \rightarrow Y from the entire training set D_L. The formal definition of global prediction is as follows:

Definition 5. (Global Prediction) Global prediction learns a single classification or regression model, which is applicable to all the test data, using the global characteristics of the entire training data.

Many classification and regression algorithms have been proposed in the data mining and machine learning literature, including Support Vector Machine [32][191], Decision Trees [222][221], Neural Network [22], and Hidden Markov Model [40][85]. They typically fall into the global prediction framework. In this work, we mostly focus on the Support Vector Machine algorithm; a detailed description is given in Section 3.1.2. Localized prediction algorithms instead build multiple local prediction models

\{f_1, f_2, \ldots, f_k\},

where each local model f_i is associated with a local subset of the training set. A formal definition of local prediction is as follows:

Definition 6.
(Local Prediction) Local prediction constructs multiple classification or regression models, each of which is applicable only to the corresponding local part of the test data, using the local characteristics of the training data.

Localized prediction attempts to adjust the learning system locally to the properties of the training set in each area of the input space, and thus has the potential to outperform global prediction, especially when the data is unevenly distributed [28]. The simplest localized prediction method proposed in the data mining literature is the K-nearest neighbor algorithm [177], which will be discussed in Section 3.1.1. In this work, localized prediction is applied to Earth Science data with strong spatial or temporal dependencies. Given a training set with spatial and temporal features,

D_L = [(t_1, o_1, x_1, y_1), (t_2, o_2, x_2, y_2), \ldots, (t_l, o_l, x_l, y_l)]^T,

localized prediction with spatial and temporal dependencies builds multiple local prediction models from the data by incorporating the spatial, temporal, or spatio-temporal neighborhood information.

3.1.1 K-nearest Neighbor Prediction

The K-nearest neighbor (KNN) algorithm [55][177] predicts the target of a test example based on the distribution of training examples in its local neighborhood. Among the appealing features of KNN learning are its simplicity and flexible modeling scheme. Nevertheless, it is susceptible to the curse of dimensionality [18] and is sensitive to the choice of the neighborhood size K [97]. When K is too small, the KNN model may overfit the training data. On the other hand, a large K may also hurt the performance of the classifier by incorporating training examples that are unsuitable for the prediction of a test example.

Figure 3.1. K-nearest neighbor prediction: (a) K-nearest neighbor classification; (b) K-nearest neighbor regression.

To overcome the difficulty of choosing K, the weighted K-nearest neighbor model [102] was introduced. In this method, the influence of each training example x_i is weighted by its similarity to the test example x. Thus, it is important to define a similarity measure that determines how closely matched two given examples are. One of the most popular similarity metrics between two examples x_i and x_j is the radial basis function (RBF) [165]:

S(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma}\right)    (3.1)

where \sigma is a user-specified parameter that controls the spread of the function values. Throughout this thesis, the RBF function is used to measure the similarity between two examples. In K-nearest neighbor classification, as shown in Figure 3.1(a), the probability that a test example belongs to class y is computed as follows:

P(y \mid x) = \frac{\sum_{x_i \in \Omega_K(x)} \delta(y, y_i)\, S(x_i, x)}{\sum_{x_i \in \Omega_K(x)} S(x_i, x)}    (3.2)

where \Omega_K(x) is the subset of observations in D_L corresponding to the K nearest neighbors of the test example x, S(x_i, x) denotes the similarity between x_i and x, and

\delta(y, y_i) = \begin{cases} 1 & \text{if } y = y_i \\ 0 & \text{otherwise} \end{cases}    (3.3)

In K-nearest neighbor regression, as shown in Figure 3.1(b), the continuous target value y for a test example x is computed as follows:

y = \frac{\sum_{x_i \in \Omega_K(x)} y_i\, S(x_i, x)}{\sum_{x_i \in \Omega_K(x)} S(x_i, x)}    (3.4)

The similarity matrix S is often normalized so that its values range between 0 and 1. One popular choice for the similarity measure is the RBF kernel [165] introduced in Equation (3.1).
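A minimal sketch of weighted K-nearest neighbor prediction following Equations (3.1)-(3.4) is shown below. It is NumPy-based, and the function names and the default value of sigma are illustrative assumptions rather than part of the thesis:

```python
import numpy as np

def rbf_similarity(a, b, sigma=1.0):
    """RBF similarity between two examples, as in Eq. (3.1)."""
    return np.exp(-np.sum((a - b) ** 2) / sigma)

def weighted_knn_predict(X_train, y_train, x_test, K=5, sigma=1.0, task="regression"):
    """Weighted KNN prediction following Eqs. (3.2)-(3.4)."""
    sims = np.array([rbf_similarity(x, x_test, sigma) for x in X_train])
    nn = np.argsort(-sims)[:K]           # indices of the K most similar training examples
    w, yk = sims[nn], y_train[nn]
    if task == "regression":             # Eq. (3.4): similarity-weighted average of targets
        return np.dot(w, yk) / w.sum()
    classes = np.unique(yk)              # Eq. (3.2): similarity-weighted vote over classes
    scores = [w[yk == c].sum() for c in classes]
    return classes[int(np.argmax(scores))]
```

For classification, the score of each class is the sum of the similarities of its neighbors, which corresponds to the numerator of Equation (3.2); dividing by the total similarity would turn the scores into the probabilities of that equation.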
K-nearest neighbor methods are simple and easy to understand, but they have several limitations. They fail to produce accurate prediction results for high-dimensional data, a problem known as the "curse of dimensionality"; they are slow, especially when the number of test examples is large; and they have difficulty choosing an appropriate value for K [97]. Several variants of the nearest-neighbor classifier have been proposed in the literature to overcome some of these limitations. Hastie and Tibshirani [100] attempted to address the curse of dimensionality by estimating an effective metric for computing the neighborhoods of the test examples. A feature weighting scheme was also proposed in [69] to alleviate the curse of dimensionality problem. Vincent and Bengio [206] developed the HKNN algorithm to fix the missing-sample problem, which was shown to introduce artifacts into the decision surfaces produced by regular KNN. Both of these methods, however, are still computationally expensive when the number of test examples is large.

3.1.2 Support Vector Machine

Support Vector Machine (SVM) is a supervised learning method that can be applied to classification or regression tasks. Given a training set D_L and test set D_U, a nonlinear Support Vector Machine builds a classification or regression model f of the form

f(x) = \langle w, \varphi(x) \rangle + b,

where w = [w_1, w_2, \ldots, w_p] is the coefficient vector, b is the bias, and \varphi(x) is a function that maps x to a higher-dimensional feature space in which the decision function is linear. Support vector classification (SVC) learns a global decision boundary by maximizing the classification margin and minimizing the empirical classification error on the training set [32].

Figure 3.2. Support Vector Machine: (a) support vector classification; (b) support vector regression.

For brevity, support vector classification is discussed for a two-class problem with a discrete class label y \in \{-1, +1\}. Figure 3.2(a) shows an SVC decision boundary learned from a two-class training set, where the margin is defined as the minimum distance between the training examples and the hyperplane f(x) = 0 generated by the decision function f(x). The objective function of the nonlinear Support Vector Machine is:

\min_{w} \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{l}\xi_i    (3.5)

s.t.  y_i(\langle w, \varphi(x_i) \rangle - b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, l

where the \xi_i are slack variables introduced as an allowable error for misclassified examples, and C is a parameter that controls the tradeoff between the classification margin and the cost of misclassification on the training data. The preceding optimization problem can be transformed into its dual form using the Lagrange multiplier method:

\max_{\alpha} \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{l}\alpha_i\alpha_j y_i y_j \phi(x_i, x_j)    (3.6)

s.t.  \sum_{i=1}^{l}\alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, l

where \phi is the kernel function used to compute the dot product of two examples in the high-dimensional feature space, \phi(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle.
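For reference, a global nonlinear SVC of the kind formalized in Equations (3.5)-(3.6) can be fit with an off-the-shelf solver; a single decision surface is then used for every test example. The toy data below is purely illustrative and is not related to the experiments in this thesis:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] ** 2 + X_train[:, 1] ** 2 > 1.0).astype(int)  # toy nonlinear labels
X_test = rng.normal(size=(20, 2))

# One global model (Eqs. 3.5-3.6): the same decision boundary serves all test examples.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print(clf.predict(X_test))
print("number of support vectors:", clf.support_vectors_.shape[0])
```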
Once the weights have been determined, the response value for a test example a: is predicted in the following way: l l w = 2m,- — a:)c,9(a:,-), thus ,2 = 2(0, — ark/atria) + b i=1 i=1 As it indicates, the coefficient vector w is represented as a combination of the training data 93,. Finding the right kernel for global nonlinear SVM given a data set can be quite a challenging task. On the one hand, a simple kernel may not capture all the intricacies of a complex decision surface. On the other hand, a kernel that is too flexible is susceptible to model overfitting. Al- though there have been several studies devoted to kernel learning [12], the proposed methods are generally expensive and do not scale well to very large data sets. 3.2 Localized Support Vector Machine In this section, we present a framework called localized Support Vector Machine algorithm (LSVM), which integrates K -Nearest Neighbor with Support Vector Machine by incorporating the neighborhood information directly into SVM learning. The localized Support Vector Classifi- cation and Regression is formulated separately in Section 3.2.1 and Section 3.2.2. 3.2.1 Localized Support Vector Classification (LSVC) We first discuss the classification case when 3; e {—1, 1}. For each test example (n+8 E ’DU, 3 = 1, 2, . . . ,u, we construct a local Support Vector Machine from all the training examples whose 36 allowed misclassification errors 5,- are weighted by their similarity to the test example. This idea can be formulated as the following optimization problem: I . 1 mgn Ellwué+CZS 4921-62» Et20,i=1,2,...,l Compared with regular SVC in Equation (3.5), the constraint 5,; on misclassification error for training example ac,- is weighted by its similarity to the test example n+3. As a result, training examples close to the test example are promoted, while training examples located far away from the test example are penalized. The prediction of the test example is determined mostly by the training examples in its local neighborhood. To further appreciate the role of the weight firnction, consider the dual form of the preceding optimization problem: I 1 l ,ggl§}n—§§jmeaauawn (M0 i=1 i,j=1 l S. t. Z aw,- = 0 i=1 030i,“ SCS<$£,$1+3),7:= 1,2,...,l where a linear kernel Mari, raj) = 2:,- -a:j is employed for the LSVC. The details of the derivation can be found in the Appendix section. Unlike the optimization problem given in (3.6) for global nonlinear SVC, the upper bound for a,- has changed from C to CS (03,-, 031+ 3). 3.2.2 Localized Support Vector Regression (LSVR) For the regression case when y E R, we construct a local Support Vector Regression model for each test example 21+ 3 6 Pg by solving the following optimization problem: I . 1 , mgn 511w113+czls +19) S €+§t (< W.<.0(-'I=i) > +b> —y.- s we artzai=latus 37 This is analogous to the LSVC approach given in Section 3.2.1, where weight matrix S is used to weight the slack variables 5,- and q. The LSVR formulation can be transformed to its dual form as follows: 1 I max —- Z (a; — 0;)(aj -— a;)05(a:,-,a:j) (3.13) i,j=1 l l - 620%“ +0?) + 21%sz - 0t?) ' 1 i=1 5. I. View (at *0?) =0 1. 1 0 S (1,50,: S CS(a:,-,a:1+s), i=1,2,...,l where is d) is a linear kernel. Note again the upper bound for 012,01; has changed from C to CS(:1:,-, 1314-5) in localized Support Vector Regression compared with global nonlinear SVR in (3.9). 
3.2.3 Discussion Comparing the objective functions of LSVM (LSVC in Equation (3.11) and LSVR in Equation (3.13)) to the nonlinear global SVM (SVC in Equation (3.6) and SVR in Equation (3.9)), the only difference is that the constraint on the upper bound for a,- of LSVC and aha; of LSVR has changed from C to CS (23,-, 1111+ s). The resulting effects of such modification is two-fold: 1. Some of the support vectors for nonlinear SVM are no longer support vectors for LSVM. As previously noted, a support vector for nonlinear SVM corresponds to a training example with positive support weight (a,- > 0 for LSVC and o,- - a; > 0 for LSVR) while a non-support vector has zero support weight (a,- = 0 for LSVC and a,- — a; = 0 for LSVR). Since the upper bound CS (031”, 513,-) decreases to zero when 501+ s and :13,- are dissimilar, a support vector for nonlinear SVM that is dissimilar to the test example will have an upper bound close to zero, converting it into a non-support vector for LSVM. 2. Some of the non-support vectors for nonlinear SVM become support vectors for LSVM. As the support weight (a,- for LSVC and oz,- — a; for LSVR) for support vec- tors that are dissimilar to the test example reduces to zero, some of the training examples 38 that are similar to the test example will be assigned weights larger than zero to ensure that the equality constraint (2:121 only, = O for LSVC and 2L1 (a, — of) = 0 for LSVR) is preserved. There are two variants of LSVM considered in this study. If S is a real-valued similarity mea- sure defined on the closed interval [0, 1], the resulting algorithm is called Soft Localized Support Vector Machine (SLSVM). We employ the RBF kernel as our similarity measure for SLSVM, similar to the approach taken by weighted KNN (see Equation (3.1)). On the other hand, if S is a binary-valued function, the resulting algorithm is called Hard Localized Support Vector Ma- chine (HLSWVI). The similarity measure S (021+ 3, 3,) is equal to 1 if 11:, is part of the K nearest neighbor list for 2214.3, and 0 otherwise. For HLSVM, the upper bound constraint for support weight is equal to C. Basically, it constructs a local SVM model for each test example based on its K -nearest neighbors [166]. This approach is equivalent to the KNN-SVM algorithm proposed by Zhang et al. [223]. This algorithm, which is a straightforward adaptation of KNN to SVM, has several limitations. First, the performance of the algorithm depends on the choice of K. Second, the model is not that flexible because it uses the same neighborhood size for each test example. F i- nally, the algorithm does not consider the similarity between the training and test examples when constructing the local SVM models. Once the nearest neighbors have been found, it will identify the local support vectors by solving the optimization problem given in (3.5). The performance of this algorithm is not as good as SLSVM according to our experiments. One issue about LSVM is that it must build a local SVM model for each test example. Applying it to large data sets can be very costly. In the next section, we will present an algorithm to improve its efliciency. 3.3 Profile Support Vector Machine (PSVM) Profile Support Vector (PSVM) algorithms reduce the computation cost of Localized Support Vector Machine by reducing the number of local SVM models that need to be built. 
To under- stand the intuition behind this approach, consider the optimization problem given in Equation (3.13) and Equation (3.11), which must be solved separately for each test example n+3. No- tice that the optimization problem is nearly identical for each test example, except for the up- 39 per bound constraint for a,- of LSVC and 015,01; of LSVR, which depends on S (11:1... 3, 3,). Let S(:1:1+3) = S(:cl+s,a:1),S(a:,+3,a:2), . . .,S(:1:l+3,a:n)]T denote a column vector of simi- larity values between a test example 2:1,. 3 and the set of training examples {113,1}§=1. Since a, determines whether 2:,- is a support vector, we expect test examples with highly correlated similar- ity vector S to share many common support vectors. For such test examples, it may be sufficient to build a single LSVM model to predict them. To achieve this, a simple strategy is to group the training and test examples using standard clustering algorithm such as K -means. However, building local models for such clusters will be problematic, e. g. one cluster may contain examples from one class only for the classification problem. 3.3.1 Supervised Clustering Algorithms: MagKmeans and Vaereans We have developed supervised clustering algorithms called MagKmeans and Vaereans, for LSVC and LSVR respectively, to determine the set of training examples that can be trained together and the set of test examples that can be predicted together. Let S = [S(a:l+1) S(a:j+2) . . . S(a:l+,,)] be an l x 11 matrix, where l is the training set size, a is the test set size, and the (i, j)-th entry of the matrix denote the similarity between the training example :13, and the test example $l+j- The goal is to find K. clusters from S such that K. < < l. MagKmeans: A Supervised Clustering Algorithm for LSVC We consider the classification case when 3; 6 {—1,1} in this section. Our clustering task is somewhat different than conventional unsupervised clustering. First, the data to be clustered is S, which contains the similarity between every training-test example pair. Second, conventional clustering methods consider only the proximity between examples and ignore their class distri- butions. As a result, some clusters may not contain enough representative examples from the different classes to construct a reliable local Support Vector Classification model. The MagKmeans algorithm extends regular K-means [114] by considering the class distribu- tion of training examples within each cluster. Let Y = [y1, yg, . . . ,y,]T be the class labels of the training examples. The objective function for MagKmeans is: K. n. l 22 Zinn- — 33.11% + R 2 21.0. (3.14) j=1 i=1 j=1 i=1 R. min X,Z 40 where X,- is the i-th row of the similarity matrix S, 73- is an 1 x a row vector representing the centroid of the jth cluster, R is a non-negative scaling parameter, and Z is the cluster membership matrix, whose (i, j )-th element is equal to one if the ith training example belongs to the jth cluster, and zero otherwise. The first term in the objective function corresponds to a cluster cohesion mea- sure. Minimizing this term would ensure that training examples in the same cluster have highly correlated similarity vectors. The second term in the objective fitnction measures the uniformity of class distributions in each cluster. Minimizing this term would ensure that each cluster contains a balanced number of positive and negative examples. The cluster centroids 7 and cluster membership matrix Z are estimated iteratively as follows. 
First, we fix the cluster centroids and use them to determine the cluster membership matrix. Next, the revised cluster membership matrix is used to update the centroids. This procedure is repeated until the algorithm converges to a local minimum. To compute the cluster membership matrix Z, we transform the original optimization problem into the following form using it slack variables if 10 l it minzm. 2:2,,jnx, —X,||§+Rzi, (3.15) ' j=1 i=1 j=l n S. t. —tj S ZZz'J‘yi S tj i=1 thO, OgZngl K. ZZar=1 i=1 The preceding optimization problem can be solved using linear programming [184]. When the cluster membership matrix is fixed, the centroids are updated based on the following equation: _ 21:1 22'.in _ _l___ Here R serves as a scaling parameter to handle the tradeoff between the two terms in our objective —,- (3.16) function: cluster cohesion and class imbalance. For our experiments, R is set to Us times the diameter of the data set. Figure 3.3 illustrates how the MagKmeans algorithm works. The initial cluster (the left fig- ure) contains only positive examples. As the algorithm progresses, some positive examples are expelled from the cluster and replaced by the nearby negative examples (the right figure). By 41 Figure 3.3. An illustration of the MagKmeans clustering algorithm. The cluster in the left figure contains only positive examples. After several iterations, some of the positive examples will be expelled from the cluster while some negative example are absorbed into the cluster to achieve balance in the class distribution (the right figure). Algorithm 1 MagKmeans algorithm Input: Similarity matrix S, class vector Y, and number of clusters n. Output: Cluster membership matrix Z and the centroid matrix A? 1: Randomly initialize the centroid matrix 7. 2: repeat 3: Update the cluster membership matrix Z1, ,- by solving the linear programming problem given in Equation (3.15). 4: Update the centroid matrix K using Equation (3.16). 5: until convergence ensuring that the cluster has almost equal representation from each class, a local SVM model can be constructed from the training examples. A summary of the MagKmeans algorithm is given in Algorithm 1. Vaereans: A Supervised Clustering Algorithm for LSVR Next we consider the case for regression. Similarly to the MagKmeans algorithm proposed in Section 3.3.1, the Vaereans algorithm proposed in this thesis modifies the objective function of the K-means algorithm to consider the variance of the response values in the training examples of each cluster. The data used for clustering consists of two parts: (1) the weight matrix S and (2) 42 0.8r 0.6: 0.5: 0.4: 0.3: 0.2: Figure 3.4. An illustration of the Vaereans clustering algorithm. The left figure plots the cluster- ing results by K-means on the data set. The right figure plots the clustering results by Vaereans on the data set. Vaereans produce much better clusters for learning three linear regression segments. the response values Y = (y1, . . . ,y,)T. The objective function for Vaereans is: N 1 it I , —, 2 . , — .-= Z'.'(I’.'—l - ngn__ Exhdnxi—Xjng—RZZ‘ 1 71'] ‘ 3) (3.17) Z X” j=1 i=1 j=1 21:1th where 7(— is the centroid matrix for predictor variables, 17 is the centroid vector for response variable, and R is again set to 1 /n times the diameter of the data set. Note that the first part of the objective function groups data points together based on the similar- ity of their predictor values, while the second part maximizes the variance of the response values in each cluster. 
We employ an EM-like algorithm [66] to optimize the objective function given in Equation (3.17). More specifically, we first search for the optimal cluster membership matrix Z by fixing the centroid [K], 73-] for all j. We then search for the Optimal centroid X 3,73] by fixing the cluster memberships Z. These steps are repeated until the algorithm converges to a local minimum. When X], 73-] is fixed, the cluster membership matrix Z can be computed efficiently using linear programming. To do this, we transform the original optimization problem into the following 43 Algorithm 2 Vaereans algorithm Input: Similarity matrix S, Response vector Y, and number of clusters 1»: Output: Cluster membership matrix Z and the centroid matrix K, 7 1: Randomly initialize the centroid matrix K, 7-. 2: repeat 3: Update the cluster membership matrix Zifij by solving the linear programming problem given in Equation (3.18). 4: Update the centroid matrix 7 using Equation (3.19) and 7 using Equation (3.20). 5: until convergence form using it slack variables tj (j = 1, - - - ,1:): H. l H. minzw. :sz-(X, —X,-)2 41th (3.18) j=1 i=1 j=1 m P MN 1 Zt,j(Yi - 7j)2 2 tj Z Zr.) 1 i=1 j>aogZ~gr N. ‘9- 1,] K 22,5,- :1 K) p—d When the cluster membership matrix Z is fixed, the following equation is used to update each centroid Xj,7j]: l T. _ 22:121..th , _ 3.19) J z , ( 21:1 41.3 l _ ._ Z-.-Y- Y,=—————Z’;1 ”“1 i (3.20) By ensuring that the cluster has almost uniform distribution of response values, a local SVR model can be constructed from the training examples. A summary of the Vaereans algorithm is given in Algorithm 2. We illustrate the difference between the clustering results produced by regular K-means and. Vaereans using the synthetic data set shown in Figure 3.4. The data set contains a data set in the shape of a nonlinear regression line with three linear segments. The figure in the left shows the results of regular K-means, whereas the figure on the right shows the results of Vaereans, 44 which produces clusters that contain more widely spread response values in the Y-axis. The latter clustering is clearly more desirable for constructing local linear SVR models. 3.3.2 Profile Support Vector Classification (PSVC) After clustering the data using our MagKmeans algorithm, a local LSVC model is constructed for each cluster. The SVC model for the kth cluster is obtained by solving the following optimization problem: 1 l 1 max Ei- — — 5a -1 33:1: 3.21 5 :2 .2” 1Jy1.J_]tj ( ) 2:]. 2,321 l S. t. 25,3” = 0 i=1 USER'S/321k, i=1,2,...,l Since Zak E {0, 1}, only the training examples assigned to the cluster will be used to build the local LSVC model. Algorithm 3 Profile SVC algorithm Input: Training set ’D L = {(m,, y.,-)}§___1, test set ’DU = {wl+j 84:1, and number of clusters is. Output: Class labels for the test set {111453-2421 1: Compute the l x 11 similarity matrix S. 2: (Z, Y) «— MagKmeans(S, Y, n.) 3: fork= l tondo 4: SVC), «— Build_Local-SVC(’DL, Z, k) 5: end for 6: for each test example 001+ ,- E ‘DU do 7: k = arg mirror—Xm- 8: 91+) = SVC/c(33l+j) 9: end for To classify the test set, we need to determine the local SVC model that should be invoked for each test example. Let K be the K: x u centroid matrix produced by the MagKmeans algorithm. Since the (i, j )-th element of the centroid matrix indicates the average similarity between the test 45 example m1+j and the training examples in cluster 2', we assign 1:13”- to the cluster with highest similarity. 
We then apply the local SVM model associated with the cluster to predict the class label of 931+ j. The procedure for model building and test of the PSVC algorithm is summarized in Algorithm 3. 3.3.3 Profile Support Vector Regression (PSVR) Similar to PSVC, we construct a local SVR for each cluster after clustering the data using the Vaereans algorithm. More specifically, the LSVR model for the kth cluster is obtained by solving the following optimization problem: I 1 ~ max —5 Z (a, — 042' )(aj " ajlmia’j i,j=1 l l — 62(az+a:)+zyi(az_az) :21 £21 I st. Zen—5,) =0 For hard clUstering, since Zi,k E {0, 1}, the above optimization problem is equivalent to build— ing a linear SVR using only the training examples assigned to the cluster. In turn, the n LSVRs constructed from the clusters form a piecewise decision boundary with A: linear segments. Now comes the testing step. We need to decide which local model should be used to predict a test example. To do this, we refer to the centroid matrix 7 obtained by the Vaereans algorithm. Since the (i, j )-th element of the centroid matrix indicates the similarity between a test example :cj to cluster 1', we assign the test example to the cluster with highest similarity. Finally, the response value of the test example is determined by applying the corresponding LSVR model for its assigned cluster. The procedure for model building and testing of the PSVR algorithm is summarized in Algorithm 4. 3.3.4 Summary In short, to avoid training a separate LSVM model for each test example, the PSVM algorithm partitions the data into 10 clusters and trains a local SVM model for each cluster. it is typically 46 Algorithm 4 Profile SVR algorithm Input: Training set ’D L = {(3), y,-)}£=1, test set ’DU = {m)+j}3*:1, and number of clusters 1: Output: Predictions for the test set {y)+j }}‘:1 1: Compute the l x a similarity matrix S. 2: (Z, 7, XY) <— Vaereans(S, Y, n.) 3: fork= l tondo 4: SVR;, <— Build_Local_SVR(’DL, Z, k) 5: end for 6: for each test example :13)“: 6 By do 7: k = arg max?”- 8: 91+) = SVRklIBHj) 9: end for chosen to be considerably smaller than the number of test examples a to reduce the computational COSI. 3.4 LSVM and PSVM for Spatial and Temporal Prediction According to a statement in [28], building multiple local models performs better than one global model, especially for data that are not evenly distributed. There has been some recent research on local learning for spatial and temporal prediction, which demonstrated the superior performance of building multiple local machine learning models compared with global machine algorithms [89][132][128][l25]. For example, a similar approach to KNN-SVM has been applied to spatial prediction problem by Gilardi et al. in [89] for data sets with strong spatial dependencies. This section describes how our LSVM and PSVM formulations can be extended to prediction problems with spatial and temporal dependencies. For temporal data, each example is associated with a time stamp and the prediction of a test example is often influenced by recently arrived train- ing examples. Let 31+ 3 be a test example that arrives at time n+3 and a), be a training example that arrives at an earlier time t,. We may define a temporal similarity function S (13,-, tl+ S) to mea- sure the proximity between two examples in terms of their time stamp. Such timing information can then be incorporated into LSVM by constraining the upper bound for a,- of LSVC and aha: 47 ofLSVR from C to CS(t.,-, n+3). 
Spatial dependencies can be incorporated into the proposed framework in a similar manner. Let 371+ 3 be a test example located at 01+, and 2:,- is a training example located at 0,. The spatial prox- imity between the examples is given by the spatial similarity function S (02-, 014.3). The function can be incorporated into LSVM by changing the upper bound for 01,: of LSVC and ago: of LSVR from C to CS(o,-, 0)”). Our approach is also flexible enough to accommodate data with spatio-temporal dependencies. Let 31+, be a test example located at 0”,, with time stamp tl+s and 11:,- is a training example located at 0,; with time stamp t,. The spatio-temporal proximity between the examples is given by the spatio-temporal similarity function S ([t, 0.)], [131+S 01+ 3]). The function can be incorporated into LSVM by changing the upper bound for a,- of LSVC and aha; of LSVR from C to CS ([t, Oil, [tl-l-s 01+sl)- Specifically, the modified optimization problem for LSVC in Equation (3.11) with spatial, tem- poral, or spatio-temporal constraint becomes: max Za. — — :2 manner aw as) (3.22) i.i= —1 S. t. Zaiyi = 0 '—1 OSaiSCS,’l=1,2,...,l where: S (t,, n+3) temporal S= S(o,-,o)+3) spatial (3.23) sat.- 0.1.0.1.. 01+sl) spade-temporal The optimization problem for LSVR in Equation (3.13) is modified similarly as: max $.le I —Z( arat) a aj—agntawj) (3.24) 9:1 l — eZ(a.-+a;‘) +Zata. 1321 i=1 1 s. t. :(ai — a?) = 0 i=1 Saga: SCS, i=1,2,...,l 48 where S is defined in Equation (3.23). A PSVM formulation can also be developed to incorporate the spatial or temporal dependen- cies. Unlike its original formulation, clustering is performed on the spatial similarity matrix S = [S(o.,-, 0l+slir temporal similarity matrix S = [S(t,-, tl+s)l: or spatio-temporal similarity matrix S = [S ([t, 0,], [n+3 01+ 3])] Depending on whether a prediction task is classification or regression, one could use MagKmeans or Vaereans to partition the training and test data. A local SVM model can then be constructed for each cluster partition. The experimental results demonstrate the successful application of spatial and temporal LSVM and PSVM to the Earth Science data. 3.5 Experimental Results This section describes the experimental results obtained by applying the proposed algorithms to a variety of data sets. The proposed LSVC and LSVR algorithms are implemented by modifying the CH code for the LIBSVM tool developed by Chang and Lin [37]. For LSVC, the tool is applicable to multi-class classification problems using the one-versus—all approach as described in Section 3.1.2. To support PSVC and PSVR, we have also implemented the MagKmeans and Vaereans algorithms to cluster the similarity matrix S. Our experiments were conducted on a Windows XP machine with 3.0GHz CPU and 1.0GB RAM. The main objectives of the experi- ments are: 1. To compare the difference between the support vectors and decision surfaces obtained using LSVM and nonlinear SVM (Section 3.5.1). 2. To compare the relative performance of KNN, SVM, LSVM, and PSVM on a variety of real-world data sets (Section 3.5.2). 3. To illustrate the application of LSVM and PSVM to Earth Science data sets with spatial and temporal dependencies (Section 3.5.3). 4. To perform sensitivity analysis on parameters of the pr0posed framework (Section 3.5.4). To assess the goodness of a prediction algorithm on the test set DU, two metrics are used in this thesis. The accuracy of classification model is measured by the percentage of unlabeled data 49 Figure 3.5. Comparison between LSVC and nonlinear SVC. 
The left diagram shows the support vectors obtained by nonlinear SVC, whereas the right diagram shows the support vectors obtained for three test examples using LSVC.
whose labels are accurately predicted:

\[
\text{Accuracy} = \frac{\sum_{i=l+1}^{l+u} \delta(y_i, \hat{y}_i)}{u} \tag{3.25}
\]

where y_i is the true label, ŷ_i is the predicted label for test example x_i, and δ(y_i, ŷ_i) is defined similarly as in Equation (3.3). R-square can be used to evaluate the goodness of the learned regression model, which is defined as the ratio of the sum of squares explained by a regression model to the total sum of squares around the mean:

\[
R^2 = 1 - \frac{\sum_{i=l+1}^{l+u} (y_i - \hat{y}_i)^2}{\sum_{i=l+1}^{l+u} (y_i - \bar{y})^2} \tag{3.26}
\]

where y_i is the true value, ŷ_i is the predicted value for test example x_i, and ȳ is the expected value (usually the average of the true values in the test data). Note that a regression model with a larger value of R^2 fits the data better.

3.5.1 Comparison between LSVM/PSVM and Nonlinear SVM

For this experiment, we use synthetic data sets to demonstrate the difference between the support vectors and decision boundaries found by LSVM/PSVM in contrast to those found by nonlinear SVM. As previously mentioned in Section 3.2, LSVM may reduce the influence of support vectors (for nonlinear SVM) that are far away from a test example while converting other training examples in its vicinity into support vectors.

Figure 3.6. Comparison between LSVR and nonlinear SVR. The left diagram shows the support vectors obtained by nonlinear SVR, whereas the right diagram shows the support vectors obtained for two test examples using LSVR.

A. Comparison of Support Vectors

Consider the synthetic data set depicted in Figure 3.5 for a two-class classification problem. The left diagram shows the data distributions for the two classes, represented as · and +, respectively. The support vectors found using a nonlinear SVC with an RBF kernel function are represented by the symbols ○ and ⊕. Observe that the support vectors are located along the whole boundary between the two classes. The right diagram of Figure 3.5 shows the corresponding support vectors found by LSVC for three selected test examples. The locations of the test examples are marked by the symbol × while their support vectors are represented with three different symbols. We can see that the support vectors found by LSVC lie closer to their corresponding test examples.

A similar phenomenon is found in Figure 3.6 for a regression task. The left diagram shows a synthetic data set represented as · and the support vectors found using a nonlinear SVR algorithm, denoted as ○, surrounding the whole data set. The right diagram demonstrates two individual test examples marked by the symbol × and the support vectors found for them by the LSVR algorithm. Again, the support vectors moved closer to those test examples.

Upon comparing the left and right diagrams in both Figure 3.5 and Figure 3.6, it is clear that some
Figure 3.7. The diagrams in the top panel show the distribution for two synthetic data sets. Each data set comprises two classes, marked by two different symbols.
The diagrams in the bottom panel show the decision boundaries generated by PSVC. Each symbol represents one of the clus- ters produced by the MagKmeans algorithm. support vectors for nonlinear SVM are not chosen as support vectors for LSVM because they are far away from the test examples. In the meantime, some non-support vectors for nonlinear SVM have become support vectors for LSVM because of their proximity to the test examples. Furthermore, the test examples marked as 2 and 3 in Figure 3.5 share several common support vectors because of their close proximity to each other. Even though their support vectors are not exactly identical, they have the potential of sharing the same decision surface, which justifies our motivation for using clustering to reduce the computational cost of LSVM. 52 Figure 3.8. The diagrams in the top panel show the distribution for two synthetic data sets. Y-axis represents the response and X -axis represents the predictor. The diagrams in the bottom panel show the regression generated by PSVR. Each symbol represents one of the clusters produced by the Vaercans algorithm. B. Effect of MagKmeans and Vaereans in PSVM Two synthetic data sets are generated for classification as shown in the top panel of Figure 3.7. The bottom panel shows the corresponding decision boundaries generated by PSVC. For the first data set shown in the left diagram of Figure 3.7, the horse-shoe shaped decision boundary is approximated by 11 piecewise linear decision boundaries. For the second data set shown in the right diagram, the spiral-shaped decision boundary is also approximated by 11 piecewise linear decision boundaries. In summary, the results of this experiment demonstrate the ability of PSVC to fit a complex decision boundary using multiple piecewise linear segments. We generate another two synthetic data sets for regression tasks as shown in the top panel of Figure 3.8. The bottom panels shows the regression function learned by PSVR. For the first data 53 Table 3.1. Description of the 18 UCI datasets used for classification. Data # Instances # Attributes Breast 699 10 Glass 214 9 Iris 1 50 4 KDDCup 1015062 38 Physics 171017 10 Yeast 1484 8 Robot 173841 26 Ecoli 336 7 Arcene 200 10000 Arrhyth 452 279 Balance 625 4 Car 1 728 6 OptDigit 3477 64 Contrace 1473 9 Dermatology 366 34 Sonar 208 60 Gisette 6000 5000 Krkopt 28056 6 set in the left diagram of Figure 3.8, M-shape regression function is approximated by 5 piecewise linear regression lines. For the second data set shown in right diagram, the sin wave regression function is also approximated by 5 piecewise linear regression lines. The results of this experiment demonstrate the ability of PSVR to fit a complex regression function using multiple piecewise linear segments. 3.5.2 Performance Comparison on Real-World Data In this experiment, we conduct several experiments on real-word data to compare the performance of LSVM and PSVM against other supervised prediction methods. We use 18 data sets (described in Table 3.1) from the UCI repository [154] to compare the classification performances of HLSVC, SLSVC, and PSVC against KNN and nonlinear SVC in terms of their accuracy. Some of the data sets such as “Breast”, “Glass”, “Iris”, “KDDcup IDS”, “Physics”, “Yeast”, “Robot”, and “Ecoli” are multi-class prediction problems. 
Since the clustering part of our proposed PSVM algorithms are designed for binary classification problems, we divide the classes for these data sets into two groups and relabel one group as the positive class and 54 Table 3.2. Description of the 18 UCI datasets used for regression. Data # Instances # Attributes AutoPrice 205 25 AutoMpg 398 7 Cancer 699 9 Concrete 1030 8 Hardware 209 8 Servo 167 12 Abalone 41 77 8 Stock 950 9 9, ‘6 the others as negative class. For some of the large data sets such as “KDDCup , Physics”, and “Robot” we randomly sample 4500 examples from each class to form the data sets. Another 8 data sets are collected from UCI repository [154] to compare the regression performances of HLSVR, SLSVR, and PSVR against KNNR and nonlinear SVR in terms of their R2. The data sets used 9, ‘6 here are “AutoPrice”, “AutoMpg”, “Cancer , Concrete”, “Hardware”, “Servo”, “Abalone”, and “Stock” (described in Table 3.2). The attributes for every data set have been normalized so that their range goes from 0 to 1. The experimental results reported in this study are obtained by applying five-fold cross validation on the data sets. To make the problem more challenging, we use one-fifth of the data for training and the remaining four-fifths for testing. Each experiment is also repeated ten times and the final results reported are obtained by averaging the results over ten trials. A. Model Selection Each prediction algorithm investigated in this study has one or more parameters that must be specified by the user. For the KNN prediction, this corresponds to the number of nearest neighbors to be considered for each test example. For nonlinear SVC and SVR with RBF kernel (Equation (3.1)), the user must specify the kernel width 0 to control the model complexity. Scaling parameter C in SVC controls the tradeoff between classification margin and cost of misclassification on the training set, while in SVR it controls the tradeoff between flatness of regression function and prediction error. The same parameter C is also used in HLSVM, SLSVM, and PSVM. There is an additional particular parameter c for SVR, HLSVR, SLSVR, and PSVR, which is used to specify the width of the tube. Both SLSVM and PSVM also employ the RBF kernel to define the 55 Table 3.3. Classification accuracies (‘70) for SVC, KNNC, HLSVC (KNN-SVC), SLSVC, and PSVC on the 18 UCI datasets. Data SVC KNNC HLSVC SLSVC PSVC Breast 94.57 95.70 95.42 . 96.85 96.52 Glass 64.33 54.30 62.67 66.39 66.91 Iris 92.75 89.75 74.71 96.42 97.06 KDDCup 98.14 95.21 98.01 99.71 99.29 Physics 82.98 67.82 83.57 86.42 85.57 Yeast 92.31 93.36 94.62 96.00 95.83 Robot 85.35 78.64 85.87 87.23 86.23 Ecoli 93.41 93.30 93.33 94.89 94.44 Arcene 63.90 68.94 70.06 72.30 71.93 Arrhyth 68.78 66.74 68.45 70.22 69.86 Balance 89.10 88.70 90.62 92.02 89.94 Car 79.16 78.24 75.90 81.65 79.22 OptDigit 97.76 96.98 98.18 99.31 98.42 Contrace 69.85 69.01 71.79 75.01 72.44 Dermatology 97.65 97.37 98.41 98.68 98.72 Sonar 73.29 69.58 72.57 75.21 74.79 Gisette 85.91 78.53 85.93 88.14 86.99 Krkopt 69.04 69.09 68.16 72. 15 71.44 similarity between training and test examples. For HLSVM, the similarity measure S (921+ S, 23,-) is equal to 1 if the training example (1:,- belongs to the K-nearest neighbor list of the test example 581+,» and 0 otherwise. For PSVM, the ntunber of clusters 10 is another parameter that must be determined. Throughout our experiments, the parameter values are selected using ten-fold cross validation on the training set. 
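As a concrete illustration of this selection procedure, the sketch below performs the ten-fold cross-validated search for the KNN parameter K. The synthetic data, the candidate grid, and the use of scikit-learn are assumptions made only for this example; the thesis implementation is built on a modified LIBSVM rather than on this code.

```python
# Sketch: choosing K for the KNN baseline by ten-fold cross-validation on the
# training set.  Data, grid, and library choice are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in training set (the experiments use the UCI data sets of Table 3.1).
X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=0)

candidate_k = [1, 3, 5, 7, 9, 15, 25]            # assumed grid of nearest-neighbor counts
cv_accuracy = {}
for k in candidate_k:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=10, scoring="accuracy")
    cv_accuracy[k] = scores.mean()                # average accuracy over the 10 folds

best_k = max(cv_accuracy, key=cv_accuracy.get)    # parameter kept for the final model
print(best_k, round(cv_accuracy[best_k], 4))
```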
For example, the parameter K for the KNN classifier is chosen based on the number of nearest neighbors that yields the highest accuracy according to ten-fold cross validation on the training set. The same approach is also used to determine the parameters a, C, K, 6 and/or is for nonlinear SVM, SLSVM, HLSVM, and PSVM. For our experiments, is usually takes a value between x/l / 2 and \/l. B. Comparison of Classification Accuracy The experimental results reported in this study are obtained by applying five-fold cross validation on the data sets. To make the problem more challenging, we use one-fifth of the data for training and the remaining four-fifths for testing. Each experiment is also repeated ten times and the accu- 56 Table 3.4. Regression R2 for SVR, KNNR, HLSVR (KNN-SVR), SLSVR, and PSVR on the 8 UCI datasets. Data SVR KNNR HLSVR SLSVR PSVR AutoPrice 0.8934 0.8491 0.9166 0.9221 0.9209 AutoMpg 0.9425 0.9592 0.9580 0.9681 0.9643 Cancer 0.9126 0.8600 0.9423 0.9491 0.9584 Concrete 0.9039 0.9396 0.9473 0.9477 0.9424 Hardware 0.8274 0.8147 0.9078 0.9219 0.9262 Servo 0.6625 0.7109 0.7455 0.7986 0.7962 Abalone 0.9400 0.9367 0.9422 0.9438 0.9424 Stock 0.9757 0.9914 0.9901 0.9910 0.9963 racy reported is obtained by averaging the results over ten trials. Table 3.3 summarizes the results of our experiments. First, observe that, for most of the data sets (14 out of 18), nonlinear SVC outperforms the KNN algorithm. One example is the “Physics” data set, in which the accuracy for SVC is 82.98% whereas the accuracy for KNNC is considerably lower at 67.82%. Second, the accuracy for PESVC does not seem to show significant improvement over nonlinear SVC. In fact, the accuracy for HLSVC is worse than nonlinear SVC for the 8 of the 18 data sets: “Glass”, “Iris”, “Car”, “Sonar”, “Ecoli”, “Arrhyth” and “Krkopt”. For “Iris”, the classification accuracy drops from 92.75% (for nonlinear SVC) to 74.71% (for HLSVC). One possible explanation for HLSVC’s poor performance is the difliculty in choosing the right number of nearest neighbors when there are limited training examples available. Unlike HLSVC, the SLSVC algorithm con- sistently outperforms nonlinear SVC. With the exception of “KDD Cup”, the difference observed in their classification accuracies can be shown to be statistically significant based on Student’s t-test. This should not come as a surprise because SVC can be considered as a special case of SLSVC by setting the kernel width of similarity matrix S to 00. Finally, we observe that PSVC, which is an eflicient implementation of LSVC, achieves comparable accuracy as SLSVC but still outperforms nonlinear SVC for all the data sets. C. Comparison of Regression Performance Table 3.4 summarizes the results of regression performance on 8 data sets. It is observed that SLSVR outperforms nonlinear SVR, KNNR, and HLSVR on most data sets with significantly higher R2. The only exception is the “Stock” data set, where KNNR seems to perform slightly 57 seconds seconds 180 ...,... KNNC 40 -o— PSVC ' -v- PSVC-Cluster 160 : fixgvc g: -a— PSVC-LSVC i 140. ~e- SLSVC of- y ‘ 120 + PSVC oil-“P 20 PSVC PSVC-Cluster ,5 l 0 _ PSVC-LSVC 100" SLSVC 53’. " y'Hstc 0 5000 80- _..-°-’ 1 3 -—+- KNNC 60’ -o- svc ,«t 2. -a- PSVC-LSVC _.+~--+' . 40- + 20 1 . KNNC+...-* +..--*' svc ' War—9:44 0 0t PSVC-LSVC ‘ 0 0 5000 Number of Test Examples Figure 3.9. The left panel shows the runtime comparison among SVC, KNNC, HLSVC, SLSVC, and PSVC in terms of classification task. 
The right top panel shows the amount of time spent for MagKmeans clustering by PSVC (i.e., PSVC-Clustering) as opposed to the amount of time spent for training LSVC and applying them to the test examples (i.e., PSVC-LSVC). The right bottom panel compares the time spent by each algorithm to predict the class labels of test examples. better than SLSVR. HLSVR performs better than KN NR and SVR on 6 of the 8 data sets, however, it is worse than SLVSR on all the data sets. The performance of PSVR is comparable with SLSVR and better than other methods on most data sets except the “Concrete”. PSVR even outperforms SLSVR on the data set “Cancer”, “Hardware” and “Stock”. For example, the R2 for PSVR is 0.9584 on “Cancer”, which is higher than 0.9491 for SLSVR. D. Runtime Comparison The purpose of this experiment is to compare the efliciency of LSVM against PSVM. Recall from Section 3.2 that the main limitation of LSVM is its high computational cost since a unique LSVM model must be constructed for each test example. PSVM attempts to overcome this limitation by partitioning the training examples into a small number of clusters and building a linear SVM model for each cluster. Figure 3.9 shows the computational time (in seconds) for training and 58 testing difierent classification models using the “Physics” data set. To evaluate its performance, we choose the two largest classes of the data set and randomly sample 600 records from each class ‘ to form the training set. We then apply LSVC and PSVC separately on the data set, while varying the number of test examples from 600 to 4800. The diagram on the left panel of Figure 3.9 shows the total time spent by each algorithm for building models and testing. Notice that the time consumed by SLSVC, HLSVC, and PSVC grows linearly with the number of test examples. However, the run times for both HLSVC and SLSVC are considerably longer than PSVC. We may further decompose the runtime for PSVC into two parts: PSVC-Cluster, which is the time needed to apply the MagKmeans clustering algorithm, and PSVC-LSVC, which is the time needed to build a separate model for each cluster. The diagram on the top right panel of Figure 3.9 shows that most of the computational time for PSVC is spent on clustering. If the clustering time is excluded, the remaining time needed to build a linear SVC model for each cluster as well as the time to evaluate each test example is considerably shorter than the overall training and testing times for nonlinear SVC, as demonstrated in the bottom right panel of Figure 3.9. Although we only show the results for the classification task here, regression is expected to behave similarly since it uses a similar optimization procedure. E. Summary The SLSVM algorithm generally outperforms both SVM and KNN but at the expense of longer computational time. PSVM helps to improve its computational efficiency, while achieving com- parable accuracy as the SLSVM algorithm. 3.5.3 Application of LSVM and PSVM to Earth Science Data In this experiment, we applied our LSVM and PSVM algorithms to Earth Science data sets with spatial and temporal dependencies. Four prediction tasks are considered, two for classification and another two for regression. All these four data sets are downloaded from the UCI machine learning repository [154]. A summary of the data sets and their associated prediction tasks is provided in Table 3.5, while the details are described as follows. 59 Table 3.5. Description of data sets used for prediction with spatial and temporal dependencies. 
Data # Instances # Attributes Prediction Task Dependencies CoverType 581012 54 Classification Spatial Ozone 2536 73 Classification Temporal Elnino 178080 12 Regression Spatio-temporal ForestF ire 517 l 3 Regression Spatio—temporal o The CoverType data set is used to predict the types of forest cover [22] of an unknown re- gion. It is an important classification task to help natural resource managers obtain basic descriptive information of forested lands to support their decision-making ecosystem man- agement strategies. A set of cartographic measures are used as predictor variables such as soil type, hill shade, slope, etc. Spatial information such as the distance from a region to water surface or roadway is available to be incorporated to the LSVM models, which makes it a spatial prediction problem. 0 The Ozone data set is used to classify the days in which the ground-level ozone reaches a dangerous level [224]. Studies have shown that elevated ground-level ozone is harmful to human health, e.g. it causes asthma, chest pain, coughing, and decreases in lung function [71]. It also has negative effects on vegetation and ecosystems, leading to reductions in agricultural and commercial forest yields, and increases in plant’s susceptibility to disease, pests, and other environmental stresses. A number of climate variables such as wind speed, solar radiation, temperature, and precipitation are used as the predictor variables. The tem- poral neighborhood is defined based on the day and month in which the measurement was taken. As a result,it is a temporal prediction problem. 0 The Elnino data set is used to predict the sea surface temperature [124] in equatorial Pacific. It is a regression task that can help in the understanding and prediction of El Nino/Southem Oscillation (ENSO) cycles, which caused many problems throughout the world, e.g. de- structive flooding from increased rainfalls in Peru and the Unites States and the drought and devastating brush fires in western pacific areas. The data set contains oceanographic and surface meteorological attributes such as wind speed, humidity etc., which are recorded from a series of buoys positioned throughout the equatorial pacific. Both spatial (latitude and longitude of the locations for bouys) and temporal information (day/month/year read 60 Table 3.6. Classification accuracies (%) for SVC, KNNC, HLSVC (KNN-SVC), SLSVC, and PSVC (using spatial or temporal dependencies) on the 2 Earth Science data: CoverType and Ozone. Data SVC KNNC HSVC SLSVC PSVC CoverType 86.21 67.40 73.33 89.12 89.43 Ozone 93.35 92.30 91.56 94.80 94.38 Table 3.7. Regression R2 for SVR, KNNR, HLSVR (KNN-SVR), SLSVR, and PSVR (using spatial or temporal dependencies) on the 2 Earth Science data: Elnino and Forest Fire. Data SVR KNNR HSVR SLSVR PSVR Elnino 0.9215 0.9692 0.9437 0.9817 0.9893 ForestFire 0.0810 0.0034 0.0914 0.1 124 0.1 103 from the bouys) is available to be incorporated into LSVM models, which makes it a spatio- temporal prediction problem. a The ForeFire data set is used to predict the burned area of possible forest [54] fires for a given region. It is a regression task that can help to reduce the damages that can appear if fire occurs by improving the organization of prevention measures and storage of firefighting resources. Meteorological information such as wind speed, relative humidity, temperature etc. are available as the predictor variables. 
Both spatial (x,y spatial coordinate) and tempo- ral (day of the week, month of the year) are used for “localization” in LSVM model, which makes it a spatio-temporal prediction problem. Table 3.6 presents the classification results for the CoverType and Ozone data sets. For the CoverType data, PSVC performs the best with highest accuracy 89.43%. SLSVC performs better than nonlinear SVC, KNNC and HLSVC. For the Ozone data, SLSVC is the best with 94.80% accuracy. PSVC is slightly worse but still comparable to SLVC, and it is better than SVC, KNNC and HLSVC. Table 3.7 shows the R2 of regression on the Elnino and F orestF ire data sets. PSVR performs the best with R2 = 0.9893 on the Elnino data set. SLSVR performs the best on the forest fire data set with R2 = 0.1124, while PSVR second it with comparable R2 = 0.1103. Both results shows the advantage of LSVM and PSVM on predicting Earth Science data by incorporating spatial and temporal neighborhood information. 61 Accuracy I l l l --+‘~ KNNC _ -e— SVC q 0.88 v-r>- HLSVC --<>~ SLSVC —a— PSVC 0 86” SLSVC ->t>->-r>-r>-->--r> H SVC : .m..- 0.84 — = = 3 = = = = : : : : : : : : : : : : : ; _ 0.82- . ‘ . ‘ .. KNNC _ 0'80 110 210 310 4b 50 Number of Clusters Figure 3.10. Effect of varying the number of clusters K. on the performance of PSVC. 3.5.4 Sensitivity Analysis The number of clusters n is an important parameter in PSVM. In this experiment, we perform sensitivity analysis on this parameter using the “Robot” data set. We sampled 4500 records from each class of the data set and varied rt from 2 to 46. For each value of re, we perform 10-fold cross validation, using one-tenth of the data for training and the rest for testing. The experiment is repeated ten times and we reported their average accuracy. Figure 3.10 shows the accuracies of PSVC for each n‘ along with those of SVC, KNN, SLSVC, and HLSVC. The accuracy of PSVC increases rapidly until 5 reaches 5. PSVC outperforms HLSVC after is = 15 and becomes comparable to SLSVC when n is sufficiently large. We also recorded the computational times for SVC, KNN, HLSVC, SLSVC, and PSVC in Figure 3.11. The results confirmed our earlier observation that PSVC spends considerably less time than SLSVC and HLSVC. Furthermore, the computational time for the training and testing parts of PSVC (PSVC-LSVC) is more efficient than regular SVC when n g 30, as shown in the right panel of Figure 3.11. Although we only present experiments here for the classification task, our conclusion applied to PSVR too. 62 seconds seconds 3 _ ..+.. KNNC . ~+~ KNNC .41.. HLSVC 2 5_ —a— PSVC-LSVC 250. --<>~ SLSVC ~ . —a— PSVC ZOO-wwwoommor 2’ KNNC SLSVC 150 «>9»me « 1,5- HLSVC 100- ‘ 1* _ PSVC 50 05 SVC _ O -’. ””SVC” ‘ 0 ‘=:=:==:=:==" PSVC-LISVC o 50 0 50 Number of Clusters Figure 3.11. Classification runtime comparison among SVC, KNNC, HLSVC, SLSVC, and PSVC when varying the number of clusters n in PSVC. 3.6 Summary Prediction tasks such as classification and regression for regular data sets have been studied for decades in statistics, data mining, and machine learning. Prediction with spatial and temporal de- pendencies remains a challenging problem, especially for data sets that are not evenly distributed across different time periods and locations. In this chapter, we proposed a localized prediction algorithm to address this issue. 
The main contributions of this work are: c We proposed a localized Support Vector Machine framework for both classification and regression, which incorporates the neighborhood information between examples into SVM learning and testing such that those training examples that are close to the test example contribute more to the development of local model. This framework has several advantages compared with standard Support Vector Machine and K-nearest neighbor algorithm. First, by building local linear Support Vector Machine model for each test example, the difficulty in choosing an appropriate kernel function for learning global nonlinear Support Vector 63 Machine is avoided. Second, building multiple local SVM models performs much better than one global SVM model as demonstrated by experiments on a number of real-world data sets. However, building a separate SVM model for each test example is computationally infeasi- ble for large scale data sets. To address this limitation, we developed a more efficient im- plementation of the algorithm called profile Support Vector Machine. PSVM first extracts a small number of clusters from the data set using supervised algorithms called MagK- means and Vaereans, for classification and regression tasks respectively, which extend the K-means algorithm to maximize the variance of the labels or response values inside each cluster. After applying the supervised clustering step, a local model is trained for each cluster. Our experiments on real data sets showed that PSVM reduces the training cost significantly while maintaining a comparable accuracy to LSVM. We further extend the LSVM and PSVM frameworks to spatial and temporal prediction tasks by incorporating the spatial, temporal, or spatio-temporal neighborhood information. The extended methods are applied to Earth Science data sets and the results demonstrate the success of our proposed algorithm with better performance compared with K-nearest neighbor and Support Vector Machine algorithm. CHAPTER 4 Long-Term Time Series Forecasting Time series forecasting is a particular prediction task that is concerned with predicting the future values of the time series based on its historical observations or information from other variables. The formal definition of time series prediction is given as follows: Definition 7. (Time Series Forecasting) Given a time series X L = [2:1, 3:2, of with 1 historical values, the task of time series prediction is to predict its future ’11. values X L = T l$l+1i $l+2i ---. $l+ul - A typical time series prediction technique first constructs a training set (as shown in Table 4.1) by running a sliding window of length p + u (p is the training window and u is the prediction window) along the historical time series as illustrated in Figure 4.1. Then a regression model is built from the training set and is further applied to predict the future it values. When the prediction window it is long, we called it a long term time series forecasting problem. In this chapter, we investigate the long term time series forecasting problem with applications to climate projection. The main aspects of the work are summarized as follows: 0 We first discuss the existing issues in multi-step ahead time series prediction by investigat- ing three prediction frameworks: multi-stage prediction, independent value prediction and parameter prediction. Their advantages and disadvantages are compared with each other from both theoretical and empirical perspectives. 
o A semi-supervised multivariate time series prediction method is proposed to alleviate the error accumulation problem in long term time series forecasting. Extensive experiments 65 Table 4.1. Traning Set D = X x Y X’ = [X1, ...,X,] Y = M, ...,Yu] [$ltx27""$p] [$P+19xp+23”'7$p+ll] [$2. 373. ma xp+ll [37p+2i $p+3i ""$1)+1£+1] have been conducted on a variety of data sets to demonstrate the effectiveness of the pro- posed algorithm compared to other supervised learning methods for long term time series forecasting. o A covariance alignment method is also developed to deal with the inconsistencies between historical observation data and future simulation data of the predictor variables in climate projection. Experimental results show significant improvement by semi-supervised predic- tion with data calibration. The rest of this chapter is organized as follows. Section 4.1 introduces some notations and con- cepts for time series forecasting. Section 4.2 presents a study on the issues of the long term time series forecasting. Section 4.3 demonstrates the value of unlabeled data for regression. Section 4.4 proposes the semi-supervised multivariate time series prediction algorithm with data calibra- tion technique for long term forecasting. Section 4.5 presents the experimental results on both benchmark and Earth Science data sets. 4.1 Preliminaries This section presents some background on time series forecasting. 4.1.1 Single Step and Multi-step Time Series Forecasting Time series prediction can be categorized into single step time series prediction and multi-step time series prediction based on the number of prediction steps. The problem of single step ahead prediction is to the predict only the next value 3314.1 with u = 1 while the multi-step time series prediction problem is to predict the fiiture u > 1 values [1121“, 1:143, ..., xl+u]T. Multi-step time series prediction with long prediction window is identified as the long term time series forecasting problem. In long term time series forecasting, the length of prediction window can be as long as 66 1.5 ' T I r Afi l l 0.5 i u Prediction llllll L... -1 _ I II J I I a .l 7 DAMP.) W2<")J 1 5 ‘lSliddinngindowl 1 . I ' 0 20 40 60 80 100 120 Figure 4.1. A sliding window is used to create the regression training set D=X’+Y. the length of historical data or even much longer. The prediction error for multi-step ahead time series forecasting is evaluated using the R-square method described in Equation (3.26). 4.1.2 Univariate and Multivariate Time Series Forecasting Time series forecasting can also be categorized into univariate time series prediction and multi- variate time series prediction. Univariate time series forecasting [29] predicts the future values of a variable [12:1+1”_l+u]T by building an auto-regression model f on its own historical values ]T [Jilml such as: xii = f(xt—1’$t—2, ' ' ' amt—p) (41) As illustrated in Figure 4.2, multivariate time series prediction [138] predicts the future values of the target time series by utilizing not only its own historical values but also information from other related time series. A formal definition of multivariate time series prediction is given as: Definition 8. (Multivariate Time Series Prediction) Let D L = {X L, Y L} be a multivariate time series of length l and p predictor variables, the objective of multivariate time series prediction 67 Figure 4.2. Multivariate time series prediction. 
is to predict the future values of the response variable X U- In the above definition, X L = [m1,:z:2, ...,azl]T is a p-dimensional sequence of values for the predictor variables, which can also be represented as X L = [X 1,X 2, ..., X p]. Y L 2 JT 1311. yg, y; is the corresponding values for the response variable. Multivariate time series prediction learns a target function f from the historical data ‘D L as: 316 : f(x1t’$2t7 ' ' ' Expt7 ' ° ) (42) The model is applied to the fiiture unlabeled data X U = [w1+1, an”, ..., ml+u]T and predicts the future values of the response variable, YU = [yl+1, yl+2, yl+u]T. 4.1.3 Regression Models for Prediction Before we get into the problem of long term time series forecasting, we first introduce three widely used regression techniques: Multiple Linear Regression, Recurrent Neural Networks and Hidden Markov Regression. These methods can be used for both univariate and multivariate time series prediction. A. Multiple Linear Regression Multiple Linear Regression (MR) models the relationship between the response variable y and the predictor variables X = [X1 , X 2, ..., X 1,] using a linear equation: P y = sz‘Xi +100 +6 i=1 68 Output Layer Hidden Layer X(t-3) X(t-2) X(t:l1) Figure 4.3. A simple example of Elman Network to model AR(3) model for univariate time series. , where 6. corresponds to a random noise term with zero mean and variance 0?. The coefiicient vector w is estimated using the least square method by minimizing the sum of squared error on the training data with solution: w = (XTX)—1XTy The variance is estimated using S S E / u, where u is the size of the prediction window. For uni- . . . . . . T variate time series prediction, we use mt as the response variable and [33.4, 94-2, xt-p] as the predictor variables, which corresponds to the auto-regression model. For multivariate time series prediction, the predictor variables come fiom other related time series. B. Recurrent Neural Networks Recurrent Neural Networks (RNN) has been successfully applied to noisy and non-stationary time series prediction. In RNN, the temporal relationship of the time series is explicitly modeled using feedback connections [91] to the internal nodes (known as hidden units). An RNN model is trained by presenting the past values of the time series to the input layer of the Elman back propagation. network [73]. The weights of the network are then adjusted based on the error between the true output and the output predicted by the network until the algorithm converges. Figure 4.3 illustrates a simple Elman network of AR model with order 3 for modeling a univariate time series. Before 69 Figure 4.4. Hidden Markov Regression Model. the network is trained, the user must specify the number of hidden units in the network. and the stopping criteria of the learning algorithm. RNN have the strength of modeling complex nonlinear regression models. C. Hidden Markov Regression In Hidden Markov Model Regression, a time series is assumed to be generated by a doubly stochas- tic process, where the response variable is conditionally dependent on the values of the predictor variables as well as an underlying unobserved process. The unobserved process is characterized by the state space 1" = {31, 32, - - - ,sN} and is assumed to evolve according to a Markov chain, Q L = [q1,q2, ql]T, where q,- E P (as shown in Figure 4.4). The transition between states is governed by a N x N transition probability matrix A = [dz-j]. 
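To make the hidden process concrete, the short sketch below draws a state sequence from an initial distribution and a transition matrix; the number of states and the probability values are invented purely for illustration and are not taken from the thesis.

```python
# Sketch: the unobserved Markov chain underlying HMMR.  A state sequence
# q_1, ..., q_l is drawn from an initial distribution pi and an N x N
# transition matrix A; all numbers below are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
N = 2                                    # assumed number of hidden states
pi = np.array([0.6, 0.4])                # initial state probabilities
A = np.array([[0.9, 0.1],                # A[i, j] = P(q_t = j | q_{t-1} = i)
              [0.2, 0.8]])

l = 20                                   # length of the sampled sequence
states = np.empty(l, dtype=int)
states[0] = rng.choice(N, p=pi)
for t in range(1, l):
    states[t] = rng.choice(N, p=A[states[t - 1]])

print(states)                            # e.g. [0 0 0 1 1 ...]
```

Each sampled state then selects its own regression function to generate the corresponding response, as described next.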
Each state s_i ∈ Γ is associated with an initial probability distribution π_i and a regression function f_i(x_t). For brevity, we consider a multiple linear regression model, f_i(x_t) = w_i x_t^⊤ + σ_i ε_t, where (w_i, σ_i) are the regression parameters and ε_t ~ N(0, 1). Given a set of predictor variables x_t at time t, the response variable y_t is generated based on the following conditional probability:

\[
P_{q_t}(y_t \mid x_t) = \left(2\pi\sigma_{q_t}^2\right)^{-\frac{1}{2}} \exp\left[-\frac{\left(y_t - w_{q_t} x_t^{\top}\right)^2}{2\sigma_{q_t}^2}\right] \tag{4.3}
\]

The likelihood function for the sequence of labeled observations in D_L = {X_L, Y_L} is given by

\[
L = \sum_{Q_L} \prod_{t=1}^{l} P_{q_t}(y_t \mid x_t)\, p(q_t \mid q_{t-1})
  = \sum_{Q_L} \prod_{t=1}^{l} a_{q_{t-1} q_t} \left(2\pi\sigma_{q_t}^2\right)^{-\frac{1}{2}} \exp\left[-\frac{\left(y_t - w_{q_t} x_t^{\top}\right)^2}{2\sigma_{q_t}^2}\right]
\]

The model parameters

\[
\Lambda = (\Pi, A, \Sigma, W) = \left(\{\pi_i\}_{i=1}^{N}, \{a_{ij}\}_{i,j=1}^{N}, \{\sigma_i\}_{i=1}^{N}, \{w_i\}_{i=1}^{N}\right)
\]

are estimated by maximizing the above likelihood function using the Baum-Welch (BW) algorithm [14]. The model parameters can be estimated iteratively using the following quantities, which are known as the forward (α) and backward (β) probabilities:

\[
\alpha_t(i) = p(y_1, y_2, \ldots, y_t, q_t = i \mid X_L) =
\begin{cases}
\pi_i\, p_i(y_1 \mid x_1), & \text{if } t = 1;\\
\left[\sum_{j=1}^{N} \alpha_{t-1}(j)\, a_{ji}\right] p_i(y_t \mid x_t), & \text{otherwise.}
\end{cases} \tag{4.4}
\]

\[
\beta_t(i) = p(y_{t+1}, y_{t+2}, \ldots, y_l \mid q_t = i, X_L) =
\begin{cases}
1, & \text{if } t = l;\\
\sum_{j=1}^{N} a_{ij}\, p_j(y_{t+1} \mid x_{t+1})\, \beta_{t+1}(j), & \text{otherwise.}
\end{cases} \tag{4.5}
\]

Further details on the derivation of the parameter update formulas using these quantities can be found in [85]. Upon convergence of the BW algorithm, the HMMR model can be applied to the unlabeled data X_U in the following manner. First, the probability of each hidden state at a given future time step is estimated as follows:

\[
p(q_{l+m} = s_k \mid Y_L) =
\begin{cases}
\dfrac{\sum_i \alpha_l(i)\, a_{ik}}{\sum_i \alpha_l(i)}, & \text{if } m = 1;\\[2ex]
\sum_i a_{ik}\, p(q_{l+m-1} = s_i \mid Y_L), & 1 < m \le u.
\end{cases}
\]

Next, the future value of the response variable is estimated using the following equation:

\[
\hat{y}_t = E[y_t] = \sum_{i} w_i x_t^{\top}\, p(q_t = s_i \mid Y_L) \tag{4.6}
\]

4.2 Problem Study on Long-Term Time Series Prediction

In this section, we study the issues in long term time series prediction using traditional prediction methods. Three approaches: multi-stage prediction, independent value prediction, and parameter prediction, are used to solve the multi-step ahead time series prediction problem and are compared with each other. These methods are described as follows:

Multi-stage Prediction is a direct extension of single step time series prediction. It predicts the future values of a time series in a step by step manner. We first predict x_{l+1} using the previous p values, [x_{l+1-p}, ..., x_{l-1}, x_l]^T. We then predict x_{l+2} based on its previous p values, which include the predicted value for x_{l+1}. The procedure is repeated recursively until the last value, x_{l+u}, has been estimated. In this approach, it is sufficient to construct a single model for making the prediction.

Independent Value Prediction predicts the value at each time step using a separate model. Given the initial data set shown in Table 4.1, we first create u training sets, each of which has the same input X' but a different output Y. We use Y(1) as the output variable for the first training set, Y(2) as the output variable for the second training set, and so on. By learning each training set independently, we obtain u regression models {f_i, i = 1, 2, ..., u}. The models are then used to predict the next u values as follows: x̂_{l+i} = f_i(X'), i = 1, 2, ..., u.

Parameter Prediction transforms the problem of predicting u output values into an equivalent problem of predicting d + 1 parameters. For each record in Table 4.1, we fit a parametric function g to the output vector Y. Let [c_0, c_1, ..., c_d] denote the parameters of the function g.
We then replace the original output vector Y = [Y(1), Y(2), ..., Y(ii.)] with a modified output vector Y’ = [c0, c1, ..., CdJT. We now construct d + 1 regression models {f,-,i = 0,1,2, - - . ,d} one for each output column 1”. The models are then applied to predict the d + 1 parameters of a test sequence. The test sequence is reconstructed by substituting the predicted parameters into the parametric function 9. While this methodology is generally applicable to any family of parametric functions, we use polynomial functions in our experiments. 4.2.1 Error Accumulation Error accumulation refers to the propagation of past prediction errors into future predictions. To gain a better insight into this problem, we employ the bias-variance decomposition for squared loss functions. Consider a time series generated by the model: 113:: f(17t—1.$t—2. - " .‘It—p) + 6(0.02) 72 Let [331,1r2, . . . ,xdT denote the historical values and [yl+1, y1+2, . . . ,yl+,,]T denote the future T observed values, and [pf +1, yz“ +2, . . . man] be the values generated by the deterministic model f. In other words, 312' = y; + 6(0. 0'2). We use the notation [391+1,,t)1+2, . . . ,Ql+u]T to denote the values predicted by a regression model g. The MSE (mean square error) at each step can be decomposed into the following three components [26]: MSEU) = Elli/j - mg] = (1307.) - if)? + Elttj - Eta-n2] + Eltyj - .921 The first term represents the squared bias of the model. The second term represents the variance of the model. The third term represents the variability due to noise. The following example illustrates the error accumulation problem for the noise term. Consider AR(2) model: It+1 = a133t+ a2-73t—1 + 6. with 6 ~ N (0, 02 ). Suppose we were able to model accurately the coefficients using MLR. For the multi-stage approach, we have: yi+1 = (1.12:, + a2351-4 + C1 = 914.1 + E1 yi+2 = all/1+1 + 0231 + 62 = 3314.2 + (0161 + 62) For independent value prediction, we have yz+1 = “111+0'2$l—1+€1 311+2 = (Oi1L a2)r1+ (WWW—1 + (61161 + 62) The preceding example shows the accumulation of errors due to noise for both multi-stage pre- diction and independent value prediction as the prediction step increases. Thus it is unavoidable. To analyze the error accumulation due to the bias and variance of a model, we generate the following time series: Xt = 0.418Xt_1 + 0.634Xt_2 + 6, 6 ~ N(0, 0.1). The length of the time series is set to 1000 and the prediction window is u = 50. To ensure there is sufficient bias in the model, we set p = 1. The bias and variance of the models are estimated by generating 500 73 Bias Bias . a f 10 . . —o— Multi-stage —.— Multi-stage + Independent —0- Independent ” 5 O ...................... 0 10 20 . 30 40 go 0 1O 20 O 40 50 Variance tep Variance Step 15 % - 40 . . . D -o— Multi-stage + Independent —0— Jim-stage 10: + Independent 20 5r 0 Y ‘ , , 0 W, O 10 20 30 40 0 0 10 20 3 4O 58 tep tep Figure 4.5. Bias and variance for MLR (left) and HMMR (right). bootstrap replicates of the training set ’D and inducing a model 9 from each bootstrap replicate. The models are then applied to the test sequence to obtain 500 estimated values yjj for each pre- diction step j. The empirical bias is computed by taking the average value of the 500 predictions E (323-) and subtracting it from the value predicted using the deterministic model. The variance of the models can also be estimated as follows: var( j ) = E [(3% — E(fij))2]. 
Figures 4.5 illus- trate the bias and variance for multi-stage and independent value predictions (using MLR and a HMMR as the underlying regression methods). Both figures show that the bias and variance for multi-stage prediction grows steadily with increasing time steps, whereas the bias and variance for independent value prediction do not appear to be propagated into future predictions. 4.2.2 Learning Difficulty Multi-stage prediction builds a single model to fit the entire time series. In contrast, we need to build it models for independent value prediction and d + 1 models for parameter prediction, which means they are more expensive. For parameter prediction, the learning difficulty depends on how easy it is to find the appropriate function that fits the output vector. To compare parameter prediction against independent value prediction, we apply them to the monthly milk production data using it = 12 and w = 12. For parameter prediction, we use a polynomial function to fit the output vector and vary the degree of the polynomial from 0 to 11. We then employ MLR to predict the parameters of the polynomial. The bottom diagram of Figure 4.6 shows the results of 1 — R2 for different degree. Observe that 74 10 fl . . 8 _ _ 6 - 4 1 l I 0 50 100 150 200 1-R2 monthly milk production I I I -:— parameter 0.4. + independent , Figure 4.6. Prediction Results (p=12,u=12). the 1 — R2 for parameter prediction drops dramatically when the polynomial degree increases to 3 and decreases slowly thereafter. This results suggest that it is sufficient to fit a polynomial of degree 4 to the output vector and achieves comparable accuracy as independent value prediction. 4.2.3 Smoothness of Prediction Another factor to consider is the influence of noise on the prediction approaches. To do this, we conduct an experiment using a simple, stationary time series, i.e., white noise, as shown in the left panel of Figure 4.7. The right panel of Figure 4.7 shows that multi-stage prediction tends to smooth out the time series to its mean value after p time steps. Such a smoothing effect is not present in independent value prediction, which makes spurious predictions around the mean, because the prediction at each time step is modeled independently. This method may suffer from overfitting as it tries to capture the fluctuations of the noise time series. For parameter prediction, the best fit model of the data is found to be a polynomial of degree zero. Even though the parame- ters are predicted independently, the smoothness of the time series is guaranteed by the parametric function used to fit the output vector. 75 1.5 2 I o I ‘2’ [— WN(0,0.5) ii , o 20 4o 60 so 100 0.4 —0.25- i --------- Independent t - : —- Multi-stage [ —— WN(0,0.5)j ---- Parameter '20 200 400 600 800 1000 ’0'90 20 4o 60 so 100 Figure 4.7. White Noise WN(0,0.5) (left) and predicting Results (p=12,u=100,d=6)(right). 4.2.4 Discussion The study in this section discovered that present multi-stage prediction framework suffers from error accumulation problems when the prediction period is long and the historical data is relatively short. Independent value prediction and parameter prediction can slightly alleviate this problem; however, they have their own issues such as difficulty in leaming and sensitive to noise. Their performances are further compared in Section 4.5.2 on a number of real data sets. 
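Before turning to the semi-supervised approach, the sketch below contrasts the two simplest strategies discussed in this section on a toy autoregressive series: multi-stage prediction reuses its own outputs recursively, while independent value prediction fits one direct model per horizon. The AR(2) coefficients follow the synthetic series used above, but the window length p, the noise level, and the use of plain least squares are assumptions for illustration only.

```python
# Sketch: multi-stage vs. independent value prediction on a toy AR(2) series
# x_t = 0.418 x_{t-1} + 0.634 x_{t-2} + noise.  Window length and the use of
# ordinary least squares are illustrative choices, not the thesis setup.
import numpy as np

rng = np.random.default_rng(0)
n, p, u = 300, 4, 20                      # series length, training window, prediction window
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.418 * x[t - 1] + 0.634 * x[t - 2] + rng.normal(0.0, 0.1)

# Sliding-window training set: each row of X holds p past values, Y the next u values.
rows = n - p - u + 1
X = np.array([x[i:i + p] for i in range(rows)])
Y = np.array([x[i + p:i + p + u] for i in range(rows)])
X1 = np.hstack([X, np.ones((rows, 1))])   # add an intercept column

last_window = x[-p:]                       # most recent p observations

# Multi-stage: a single one-step model applied recursively on its own outputs.
w_single, *_ = np.linalg.lstsq(X1, Y[:, 0], rcond=None)
history = list(last_window)
multi_stage = []
for _ in range(u):
    step = np.append(history[-p:], 1.0) @ w_single
    multi_stage.append(step)
    history.append(step)                   # predicted value is fed back as an input

# Independent value prediction: a separate direct model for each horizon i.
independent = []
for i in range(u):
    w_i, *_ = np.linalg.lstsq(X1, Y[:, i], rcond=None)
    independent.append(np.append(last_window, 1.0) @ w_i)

print(np.round(multi_stage, 3))
print(np.round(independent, 3))
```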
Fortunately, there are alternative ways to improve long term time series forecasting by multi- variate prediction that utilizes information from other related time series. Future data of predictor variables can be generated by simulating potential scenarios using computer-driven models, such as global climate models. However, the data generated by these models are currently utilized in a supervised learning setting, where a predictive model trained on past observations is used to estimate the future values. In the rest of this chapter, we present a semi-supervised learning framework for long-term time series forecasting based on Hidden Markov Model Regression. A covariance alignment method is also developed to deal with the issue of inconsistencies between historical and model simulation data. We evaluated our approach on data sets from a variety of domains, including climate modeling. We start by discussing the value of unlabeled data on regression in the next Section. 76 4.3 Values of Unlabeled Data on Regression There have been extensive studies on the effect of incorporating unlabeled data to supervised clas- sification problems, including those based on generative models [56], transductive SVM [112], co-training [25], self-training [229] and graph-based methods [24] [230]. Some studies concluded that significant improvements in classification performance can be achieved when unlabeled ex- amples are used, while others have indicated otherwise [25][51][57][l99]. Blum and Mitchell [25] and Cozman et al. [51] suggested that unlabeled data can help to reduce variance of the estimator as long as the modeling assumptions match the ground truth data. Otherwise, unlabeled data may either improve or degrade the classification performance, depending on the complexity of the classifier compared to the training set size [57]. Tian et al. [199] showed the ill effects of using different distributions of labeled and unlabeled data on semi-supervised learning. Recently, there has been growing interest on applying semi-supervised learning to regression problems [228][30][53][23l]. Some of these approaches are direct extensions from their semi- supervised classification counterparts. For example, transductive support vector regression is pro- posed in [53] as an extension to transductive SVM classifier. Zhou and Li developed a co-training approach for semi-supervised regression in [228]. Their algorithm employs two KNN regressors, each using a different distance metric. Another extension of co-training to regression problems was developed by Brefeld et al. [30]. Graph based semi-supervised algorithms [24][230] utilize a label propagation process to ensure that the smoothness assumption holds for both labeled and unlabeled data. An extension of the algorithm to regression problems were proposed by Wang et al. in [210]. Zhu and Goldberg [231] developed a semi-supervised regression method that incor- porates additional domain knowledge to improve model performance. Since all of the previous approaches ignore the temporal dependencies between observations, they are not well-suited for time series prediction problems. This section provides an example to illustrate the value of unlabeled data for regression analysis. Consider the diagram shown in Figure 4.8, where the x-axis corresponds to a predictor variable and the y-axis corresponds to the response variable. The data set contains 10 training examples (labeled 1 — 10) and 3 unlabeled examples (labeled 11 — 13). 
The diagram on the left shows the results of applying supervised linear regression while the diagram on the right shows the results 77 7 ’- __@_”_. 69% (a) Supervised Ieaming (b) Semi-supervised learning Figure 4.8. The value of unlabeled data for regression analysis of applying semi-supervised linear regression. The solid line 1) indicates the true function from which the data is generated. The dashed line a in the left diagram corresponds to the target function estimated from training examples. In the right diagram, the response values for the unlabeled data are initially computed using the values of their nearest neighbors. The target function, represented by the dashed line c, is then estimated from the combined training and previously unlabeled examples. Clearly, augmenting unlabeled data in this situation helps to produce a target function that lies closer to the true function. This example also illustrates the importance of using local prediction methods (such as nearest neighbor estimation) to compute the response values of the unlabeled examples. If the unlabeled examples were estimated using the initial regression fimction instead, then adding unlabeled data will not change the initial model. It is worth noting that current research in semi-supervised learning suggests unlabeled data are valuable under two situations. First, the model assumptions should match well with the underly- ing data. Second, the labeled and unlabeled data must be generated from the same distribution. Otherwise, incorporating unlabeled data may actually degrade the performance of a predictive model [199][57]. 78 4.4 Semi-supervised Multivariate Time Series Prediction This section describes our proposed semi-supervised Ieaming algorithm for HMMR and a data calibration approach to deal with inconsistencies between the distributions of labeled and unla- beled data. 4.4.1 Semi-Supervised HMMR for Long Term Time Series Forecasting The basic idea behind our proposed approach is as follows. First, an initial HMMR model is trained from the historical data using the BW algorithm described in Section 4.1.3. The model is then used, in conjunction with a local prediction model, to estimate the response values for future observations. The local model is used to ensure that the target function is sufficiently smooth. The estimated future values, weighted by their confidence in estimation, are then combined with the historical data to re-train the model. This procedure is repeated to gradually refine the HMMR model. We now describe the details of each step of our algorithm. Let A0 be the initial set of model parameters trained from the historical observations DL. We use A0 to compute the initial estimate of each response value in YU: fit = ZPW = SilYL)Wz'$lr, t= l + 1,---,l +11, 2 which is similar to the formula given in Equation (4.6). A principled way to incorporate unlabeled data is to require that the resulting target function must be sufficiently smooth with respect to its intrinsic structure [227]. For regression problems, this requirement implies that the response values for nearby examples should be close to each other to ensure the smoothness of the target function. 
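Since the local estimate above is simply a similarity-weighted average over the K nearest labeled neighbors, it can be written compactly as below. The RBF form of S(x_t, x_j), the value of K, and the toy data are assumptions made for illustration.

```python
# Sketch of the local estimate: for an unlabeled observation x_t, average the
# responses of its K nearest labeled neighbors, weighted by a similarity
# S(x_t, x_j).  The RBF similarity and K are illustrative assumptions.
import numpy as np

def local_estimate(x_t, X_L, Y_L, K=5, sigma=1.0):
    dists = np.linalg.norm(X_L - x_t, axis=1)           # distances to labeled data
    nn = np.argsort(dists)[:K]                           # indices of the K nearest neighbors
    sims = np.exp(-dists[nn] ** 2 / (2 * sigma ** 2))    # S(x_t, x_j), assumed RBF
    return np.sum(sims * Y_L[nn]) / np.sum(sims)         # similarity-weighted average

# Toy usage with made-up labeled data (X_L, Y_L) and one future observation x_t.
rng = np.random.default_rng(1)
X_L = rng.normal(size=(50, 3))
Y_L = X_L @ np.array([0.5, -0.2, 0.1]) + 0.05 * rng.normal(size=50)
x_t = rng.normal(size=3)
print(local_estimate(x_t, X_L, Y_L))
```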
We obtain an estimate of the local prediction of the target variable y for a future time step t as follows: 2 Sort) mj)yj ijQk (mt) yt = Z S($t,mj) ‘L'jEQkCCtl ,t=l+l,...,l+u where 91((930 is a subset of observations in X L that correspond to the K -nearest neighbors of the unlabeled observation mt and S (mt, mj) is the similarity measure between two observations. 79 Based on their global and local estimations, we compute the weighted average of the response value at each future time step t as follows: Qt=tt27t+(1—u)§t,t=l+1,...,l+'u. (4.7) The parameter ,u controls the smoothness of the target function; a smaller weight means preference will given to the local estimation. The newly labeled observations from the fiiture time period will be augmented to the training set in order to rebuild the model. As some of the predicted values gt may deviate quite significantly from either the local or global estimations (depending on ,u), incorporating such examples may degrade the performance of semi-supervised HMMR. To overcome this problem, we compute the confidence value for each prediction and use it to weigh the influence of the unlabeled examples during model rebuilding. Let ct denote the weight assigned to the value of gt: 1 t=12~- l' = 3 . 1, ,’ 4.8 Ct {exp[—dt], t=l+1,l+2,---,l+u. ( ) where (it = [fit — ytl/(gjt + gt). Equation (4.8) assigns a weight of 1 to each historical obser- vation Y L- This is based on the assumption that there is no noise in the historical data. Even if such an assumption is violated, our framework may accommodate noisy observations by applying Equation (4.8) to both training and future observations. The weight formula reduces the influence of future observations whose predictions remain uncertain. After the model has been revised, it is used to re-estimate the response values for the future observations. This procedure is repeated until the changes in the model parameters become insignificant. We now describe the procedure for estimating the parameters of the semi-supervised HMMR model. Let Y", denote the estimated response values for the unlabeled data obtained from Equation (4.7). The likelihood fimction for the combined data is: L = Z PA(YL, Yu, leL, XU) q l l+u = 2(HGQt~lthQt(ytlwt) H aqt-1qtpqt(3]t§ctlmt)) Q t=1 t=l+1 l T 9 —1 (w -W CB )‘ ._ 2 (If, t — 2(Haq._1q.<2mq.> [— ] l+u . T 2 _1 c (2 —w :1: ) 2 tJt Qt t x H aqtm19t(27mqt) Zexp [— 202 ]) t=l+1 Qt 80 Algorithm 5 Semi-Supervised Hidden Markov Regression for Time Series Prediction Input: Labeled multivariate time series for the historical period, ’13 L = {X L, Y L} and unla- beled multivariate time series for the future period X U- Output: Future response YU 1: Learn the initial HMMR model A0 = (H, A, 2, W) using the training data DL. IQ : Perform local estimation of YU as in Equation (4.7) 3: repeat 4: Perform” global estimation of YU using the current parameters in A as in Equation (4.6) 5: Calculate the final estimation of YU as in Equation (4.7) 6: Calculate the confidence of the predicted values in YU as in Equation (4.8) 7: Combine (gt; ct) estimated in steps 4 and 5 with the training data to re-train the HMMR model A’ = (H’, A’, 2’, W)’. 8: until Convergence (HA’ — All << 6) where T Ct(3/t _ thl't )2 _ 2 2%: 1 c 2 _ Pqt(3/t§ Ctlict) = (ZWth) 2 exp with Ct = 1 for historical observations. Unlike the supervised learning case, the weights are used to determine the least square error (:9, — wqtm; ) for each future observation. 
To maximize the likelihood function, observations with large weights will incur higher penalty if their response values disagree with the predictions made by the current HMMR model. Such observations are therefore more influential in rebuilding the HMMR model. To determine the model parameters that maximize the likelihood function, we introduce the 81 following auxiliary function: 0(A, A’) = ZPA(YL,Yu,Q|X) IHPAI(YL)Y’UAQ|X) (I N l+u = Z Z (2122. 1: =222=2)PA(YL,Y2,QIX)ma;, q 2}sz —=1t + 22(1q1=-2)PA(YL2Y22,Q|X)11m; (49) q 2': 1 N l+u + 2220 22:2)PA(YL,Y2,QIX)1np(.,;,,w;)(2z;c21222) q 22:12=1 whereX = [XL;XU] and 1,_ 22, ift=1,---,z; Jt‘ 2),, ift=l+1,---,l+u. It can be shown that maximizing the auxiliary function will produce a sequence of model param— eters with increasing likelihood values. Taking the derivative of 0(A, A’) in Equation (4.9) with respect to each model parameter in A’, we obtain the following update formula: T930210 7'1021'1 7"-73023320 T230227 w; = . . . X . TIL‘p,.’II0 Tip. $1 ' ' ° T$p,$p 7.151),?) I 1 l 2 T 2 02- : 1+” 26% Ct ’yt WW2) Z at('l )/3t( ) 2:1 2:1 (+22 + 2 04(2) )'3t(2') ~w2mt T)2} t: [+1 1 1—1 I . . 023' = 2+22—1 ' . Zat(2)az'j1)j(02+12yt+1)fit+1(]) Z at(z)/3t(") 2:1 l+u 1 + Z at(i)az'jpj(ct+12fit+1>fit+1(j)} t=l+1 7r,- _ a1(i)fi1(z) 1 — N 2: 02(1) i=1 82 where: (.) { 7T2P2(U’12y1),ift =1 at Z . . [23:1 at—1(J)aj2'] Pi(Ct,yt),lft > 1 . 1 Vi, ift = 1 23%?) = { N .. _ 23 . ’f 1 23:1 023P](Ct+12yt+1)2 t+1(])21 t> [+22 7212222]- : that(i)fit(i)mtimtj 2:1 1 [+12 712,4, = Zetat(i)l32(i)$t2yt+ Z Ctat(i)./32(i)$t2l)2 Pj(ct,2yt) = (27ragt) 2 exp [_ 2 22 042 The overall procedure for our proposed semi-supervised algorithm is summarized in Algorithm 4.4.1. After estimating the model parameters A, the response values for future observations are predicted using Equation (4.6). 4.4.2 Data Calibration The previous section described our proposed algorithm for semi-supervised time series prediction. The underlying assumption behind the algorithm is that the predictor variables for labeled and unlabeled data have the same distribution. This may not be true in real-world applications such as climate modeling or urban growth planning, where the unlabeled data are obtained from a different source (e. g. model simulations) or their distributions may have been perturbed by changes in the modeling domain (e.g. increase of population growth or greenhouse gas concentration). Several recent studies on semi-supervised classification have suggested the negative effect of unlabeled data, especially when a classifier assumes an incorrect structure of the data [5 7] or when the labeled and unlabeled data have different distributions [199]. None of these studies, however, have been devoted to regression or time series problems. Our experimental results demonstrate that, while in most cases, semi-supervised HMMR indeed outperforms its supervised counterpart, for the climate modeling domain, where the historical and future observations come from different sources, semi-supervised Ieaming do not significantly improve the performance of HMMR. To overcome this problem, we propose a data calibration technique to deal with the inconsistencies between historical and future data. 83 A straightforward way to calibrate X L and X U is to standardize the time series of each predic- tor variable by subtracting their means and dividing by their respective standard deviations. 
The drawback of this approach is that the covariance structures of X_L and X_U are not preserved by the standardization procedure. As a result, a model trained on the historical data may still fail to accurately predict the future data, since the relationships between the predictor variables may have changed. In this work, we propose a new data calibration approach to align the covariance structures of the historical and future data. Our approach seeks a linear transformation matrix β such that the difference between the covariance matrices of the historical data and the transformed future data is minimized. Let A denote the covariance matrix of X_L and B denote the covariance matrix of X_U:

$$A = E\big[(X_L - E(X_L))^T (X_L - E(X_L))\big], \qquad B = E\big[(X_U - E(X_U))^T (X_U - E(X_U))\big].$$

The transformation β is found by minimizing the difference between the covariance matrix of the transformed data and the target covariance matrix, and the minimization is carried out by gradient descent with learning rate η > 0.

Although the preceding data calibration approach helps to align the covariance matrices of the data, our experimental results show that the transformation tends to significantly distort the neighborhood structure of the observations. Since our semi-supervised HMMR framework performs local estimation based on the nearest neighbor approach, such a transformation leads to unreliable local predictions and degrades the overall performance of the algorithm. An ideal transformation should preserve the neighborhood information while aligning the covariance structure. To accomplish this, we create a combined matrix X = [X_L; X_U] and compute the covariance matrix B from this combined matrix, i.e.,

$$B = E\big[(X - E(X))^T (X - E(X))\big].$$

After calibration using the gradient descent method described previously, both X_L and X_U are transformed as follows:

$$X'_L = X_L \beta, \qquad X'_U = X_U \beta.$$

The transformed data X'_L and X'_U will serve as the new training and test data for the semi-supervised HMMR algorithm.

4.5 Experimental Results

We conducted several experiments to evaluate the performance of the proposed algorithms for long term time series prediction. All the experiments were performed on a Windows XP machine with a 3.0 GHz CPU and 1.0 GB RAM.

4.5.1 Experimental Setup

First, we compared the relative performance of multi-stage prediction, independent value prediction, and parameter prediction on 21 real data sets. The real data sets are obtained from the UCI Machine Learning Repository [154] and the Time Series Data Library [110]. All the time series used are univariate, with their lengths recorded in the last column of Table 4.3. The parameters of the prediction approaches include the order of the regression model p and the degree of the polynomial fit d (for parameter prediction). We use Akaike's final prediction error (FPE) [5] as our criterion for determining the right order p in the MLR model. The same criterion is applicable for estimating the degree of the polynomial function used in parameter prediction. To determine the correct order for RNN, we employ the method described by Kennel in [119], choosing a value for p such that the number of false nearest neighbors is close to zero.

Table 4.2. Description of the UCR time series data sets for Semi-HMMR

Data Sets    Length   # Variables
Dryer          867        6
Foetal        2500        9
Glass         1247        9
Greatlake      984        5
Steam         9600        4
Twopat        5000      129
Logistic      1000      101
Leaf           442      151
C12full       4310      166

Second, we conducted several experiments to evaluate the performance of our proposed semi-supervised algorithm. Table 4.2 summarizes the characteristics of the time series used in this experiment. The time series are obtained from the UC Riverside time series data repository [120]. We divide each time series into 10 disjoint segments.
To simulate this as a long term forecasting problem, for the k-th run we use segment k as training data and segments k+1 to k+5 as test data. The future period is therefore five times longer than the training period in each run. The results reported for each data set are based on the average R² over 5 different runs. We also consider three competing algorithms in our experiments: (1) the univariate autoregressive (UAR) model, (2) multiple linear regression (MLR), and (3) supervised HMMR.

There are also several parameters that must be determined for our semi-supervised HMMR, such as the number of nearest neighbors K, the number of hidden states N, and the smoothness parameter μ. To determine the number of nearest neighbors, we perform 10-fold cross validation on the training data using the K-nearest neighbor regression method [99]. There are various methods available for determining the number of hidden states N. These methods can be divided into two classes: modified cross-validation and penalized likelihood methods (such as AIC, BIC, and ICL). In this work, we employ the modified 10-fold cross validation with missing values approach [36] to select the best N. For each fold, we randomly select one-tenth of the observations to be removed from the training data. The likelihood function is then estimated from the remaining nine-tenths of the training data (while treating the removed data as missing values). The number of states N that produces the highest R² is chosen. To ensure smoothness in the target function, μ should be biased more towards the local prediction. Our experience shows that this can be accomplished by setting μ somewhere between 0.1 and 0.3. We fix the smoothness parameter μ = 0.1 throughout our experiments.

Table 4.3. R² of three methods on 21 data sets using Multiple Linear Regression

Data Set   Multi-stage   Independent   Parameter   Length
Milk         0.9267        0.9295        0.9172      168
Temp.        0.7224        0.7041        0.7064      216
PET          0.9581        0.9586        0.9381      216
PREC         0.9690        0.9683        0.9466      216
Solar        0.8782        0.8764        0.7868      216
Appb         0.7026        0.6196        0.6197       78
Appd         0.7848        0.7605        0.7234       64
Appf         0.6853        0.7555        0.7009      108
Appg         0.1358        0.0657        0.0782      114
Deaths       0.2691        0.4440        0.4367       72
Lead         0.5805        0.5793        0.5794      150
Sales        0.6813        0.6363        0.6363      150
Wine         0.7262        0.7098        0.6791      142
Seriesc      0.0641        0.0155        0.0155      224
Odono        0.5774        0.5269        0.5288      114
Qbirth       0.4550        0.5207        0.4769      300
Bond2        0.4774        0.4114        0.4116      309
Daily        0.7994        0.7863        0.7863      500
Food         0.8005        0.8071        0.8050      117
Treerin      0.1071        0.1193        0.1182      948
Pork         0.0538        0.2052        0.2082       99

4.5.2 Comparative Study on Multi-step Prediction

A general comparative study of the three prediction approaches on real data sets is performed in this section. A Win-Draw-Loss table is created to compare the relative performance between two prediction approaches when applied to a number of data sets. We use a criterion of 0.01 difference in R² to determine whether one approach wins or loses against another. For a stricter evaluation, we also apply a paired t significance test to determine whether the observed difference in R² is statistically significant. To do this, we first calculate the difference ΔR in the R² obtained by the two prediction approaches on each data set. The mean and standard deviation of the observed differences, denoted ΔR̄ and s_{ΔR}, are also calculated. To determine whether the difference is significant, we compute the T-statistic:

$$t = \frac{\overline{\Delta R}}{s_{\Delta R}/\sqrt{n}},$$

which follows a t-distribution with n − 1 degrees of freedom, where n is the number of data sets. Under the null hypothesis that the two prediction approaches are comparable in performance, we expect the value of t to be close to zero. From the computed value of t, we estimate the p-value of the difference, i.e., the probability of observing a difference at least as extreme under the null hypothesis. We say the difference in performance is statistically significant if p < 0.05 and highly statistically significant if p < 0.001.

Table 4.4. R² of three methods on 21 data sets using Recurrent Neural Network

Data Set   Multi-stage   Independent   Parameter
Milk         0.9267        0.9295        0.9082
Temp.        0.7224        0.7041        0.7064
PET          0.9581        0.9586        0.9381
PREC         0.9690        0.9683        0.9466
Solar        0.8782        0.8764        0.7868
Appb         0.7026        0.6196        0.6197
Appd         0.7848        0.7605        0.7234
Appf         0.6853        0.7555        0.7009
Appg         0.1358        0.0657        0.0782
Deaths       0.2691        0.4440        0.4367
Lead         0.5805        0.5793        0.5794
Sales        0.6813        0.6363        0.6363
Wine         0.7262        0.7098        0.6791
Seriesc      0.0641        0.0155        0.0155
Odono        0.5774        0.5269        0.5288
Qbirth       0.4550        0.5207        0.4769
Bond2        0.4774        0.4114        0.4116
Daily        0.7994        0.7863        0.7863
Food         0.8005        0.8071        0.8050
Treerin      0.1071        0.1193        0.1182
Pork         0.0538        0.2052        0.2082

We apply the three prediction approaches to the 21 real data sets to compare their relative performance. The R² value for each data set is obtained by 10-fold cross validation. The size of the prediction window is set to u = 24. Table 4.3 summarizes the R² for the three prediction approaches using MLR as the underlying regression method. Their relative performance is summarized in Table 4.5 in terms of the number of wins, draws, and losses. We also test the significance of the differences using the paired t significance test.

Table 4.5. Win-Draw-Loss results for MLR

            Multi vs Indep   Multi vs Param   Indep vs Param
0.01 diff       10-6-5           14-2-5           8-12-1
t value         0.1496           0.8761           2.7299
p value         0.8826           0.3914           0.0129

Table 4.6. Win-Draw-Loss results for RNN

            Multi vs Indep   Multi vs Param   Indep vs Param
0.01 diff       7-0-14           5-2-14           13-1-7
t value         2.6396           3.3884           0.3012
p value         0.0157           0.0029           0.7664

The results show that the observed difference between the R² of multi-stage and independent value prediction is not statistically significant. However, the performance of parameter prediction is significantly worse than that of independent value prediction. This is because MLR may not be suitable for fitting the parameters of the function, which have nonlinear relationships with the time series values. Tables 4.4 and 4.6 show the results using RNN as the underlying regression method. Observe that the independent value and parameter prediction approaches perform significantly better than multi-stage prediction (p < 0.05). For multi-stage prediction, the R² for RNN is lower than the R² of MLR in 10 out of 21 data sets, which suggests the possibility of model overfitting when using a flexible regression method such as RNN. Nevertheless, we still find 17 data sets in which independent value prediction with RNN outperforms all the prediction approaches using MLR and 12 data sets in which parameter prediction with RNN outperforms all the prediction approaches using MLR. The results suggest that, with nonlinear regression methods such as RNN, the independent value and parameter prediction approaches may achieve better performance than multi-stage prediction. Moreover, for parameter prediction, most of the data sets require d < 5, which makes it more efficient to build compared to independent value prediction (which requires building u = 24 models).
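To make the significance test concrete, the following sketch (our own helper functions, using SciPy for the t-distribution; not the thesis code) computes the Win-Draw-Loss counts and the paired t-statistic with its two-sided p-value over the R² differences across data sets.

```python
import numpy as np
from scipy import stats

def win_draw_loss(r2_a, r2_b, margin=0.01):
    """Wins/draws/losses of approach A against approach B, using a 0.01
    difference in R^2 as the decision criterion."""
    diff = np.asarray(r2_a) - np.asarray(r2_b)
    return (int(np.sum(diff > margin)),
            int(np.sum(np.abs(diff) <= margin)),
            int(np.sum(diff < -margin)))

def paired_t_test(r2_a, r2_b):
    """Paired t-test on the R^2 differences across the data sets."""
    diff = np.asarray(r2_a) - np.asarray(r2_b)
    n = len(diff)
    t = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
    p = 2 * stats.t.sf(abs(t), df=n - 1)   # two-sided p-value
    return t, p
```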
4.5.3 Performance Comparison on Semi-Supervised HMMR

Table 4.7 compares the R² of our proposed framework (SemiHMMR) against the univariate AR (UAR), multiple linear regression (MLR), and supervised Hidden Markov Model Regression (HMMR).

Table 4.7. Average R² for UAR, MLR, HMMR, and SemiHMMR on UCR time series data sets

Data Sets    UAR      MLR      HMMR     SemiHMMR
Dryer       0.4347   0.6253   0.6797    0.7105
Foetal      0.1221   0.7762   0.7782    0.8014
Glass       0.0019   0.2319   0.2543    0.5170
Greatlake   0.1367   0.5840   0.6047    0.6846
Steam       0.2850   0.5027   0.5237    0.6808
Twopat      0.0877   0.1739   0.1465    0.1774
Logistic    0.0195   0.0212   0.0232    0.0246
Leaf        0.2063   0.7382   0.7421    0.7897
C12full     0.1621   0.6840   0.7031    0.8263

First, observe that the performance of univariate AR is significantly worse than that of the multivariate MLR and supervised HMMR models. This is consistent with the prevailing consensus that multivariate prediction approaches are often more effective than univariate approaches because they can utilize information in the predictor variables to improve their predictions. Second, we observe that HMMR generally performs better than MLR on most of the data sets. This result suggests the importance of learning models that take into consideration the dependencies between observations. Finally, the results also show that semi-supervised HMMR is significantly better than HMMR on the majority of the data sets. The improvements achieved using semi-supervised HMMR exceed 10% on data sets such as "Glass", "Greatlake", "Steam", and "C12full".

4.5.4 Value of Unlabeled Data for Semi-Supervised HMMR

We further show the value of incorporating unlabeled data into the model retraining phase of the semi-supervised HMMR algorithm. Figure 4.9 plots the average R² at each iteration before convergence when applying the SemiHMMR algorithm to the "Foetal" data set. The R² of the SemiHMMR algorithm increases from 0.7782, which is the same as the R² achieved by HMMR, to 0.8014 after several iterations. The R² of the local KNNR estimation algorithm is 0.6951, which is considerably worse than the HMMR and SemiHMMR algorithms. The R² obtained by a linear combination of KNNR and HMMR with μ = 0.1 is 0.7216, which is also much lower than the R² of the SemiHMMR algorithm. This experiment demonstrates the effectiveness of retraining the model using unlabeled data in the SemiHMMR algorithm.

[Figure 4.9. The performance of SemiHMMR at each iteration compared with KNNR, HMMR, and KNNR+HMMR on the data set "Foetal".]

Another experiment is conducted to show the value of unlabeled data when there is limited training data. We use the "Steam" time series for this experiment and vary the ratio of unlabeled to labeled data from 1 to 5 by changing the size of the labeled data, reporting the average R². As shown in Figure 4.10, when the ratio is 1, the R² for MLR and supervised HMMR are nearly the same as that for semi-supervised HMMR. However, when the ratio increases to 5, the performance of both MLR and supervised HMMR degrades rapidly (R² from 0.6772 to 0.5027 for MLR and from 0.6884 to 0.5229 for HMMR), whereas the R² for semi-supervised HMMR decreases only slightly, from 0.6924 to 0.6807. The results of this experiment show that semi-supervised HMMR can effectively utilize information in the unlabeled data to improve its prediction, especially when labeled data is scarce.
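Throughout these experiments, R² is the evaluation metric and the unlabeled-to-labeled ratio is varied as described above. A minimal sketch of both steps follows; the R² definition shown is the standard coefficient of determination, which we assume here (the thesis may use a different variant), and the helper names are ours.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination used as the evaluation metric."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def split_by_ratio(X, y, ratio):
    """Split a series so that the unlabeled (future) part is `ratio` times
    longer than the labeled (historical) part, as in the ratio experiment."""
    l = len(y) // (1 + ratio)
    return (X[:l], y[:l]), (X[l:], y[l:])
```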
4.5.5 Semi-Supervised HMMR with Covariance Alignment on Climate Prediction

Here we apply the proposed semi-supervised HMMR algorithm to the climate projection problem by combining historical observations with GCM data. The GCM models are developed based on the basic principles of physics, chemistry, and fluid mechanics, taking into account the coupling between land, ocean, and atmospheric processes. Although the outputs from these models sufficiently capture many of the large-scale (approximately 150-300 km spatial resolution) circulation patterns of the observed climate system, they may not accurately model the response variable at the local or regional scales (approximately 1-50 km) needed for climate impact assessment studies [214]. As a result, the coarse-scale GCM outputs need to be mapped into finer scale predictions, a process known as "downscaling" in the Earth Science literature. Regression is a popular technique for downscaling, where the GCM outputs are used as predictor variables and the response variable corresponds to the local climate variable of interest (e.g. precipitation or temperature).

[Figure 4.10. The performance of MLR, HMMR, and SemiHMMR when varying the ratio of unlabeled to labeled data.]

For this experiment, we downloaded climate data from the Canadian Climate Change Scenarios Network Web site [35]. The data consist of daily observations for 26 climate predictor variables including sea-level pressure, wind direction, vorticity, humidity, etc. The response variable corresponds to the observed mean temperature at a meteorological station. In short, there are three sources of data for this experiment: (1) mean temperature observations from 1961 to 2001, used as the response variable, (2) reanalysis data from the NCEP (National Center for Environmental Prediction) reanalysis project from 1961 to 2001, used as predictor variables for training data, and (3) simulation data from the HADCM3 global climate model from 1961 to 2099, used as predictor variables for future data. Since the mean temperature for the future time period is unavailable, we conducted our experiment using NCEP reanalysis and the observed mean temperature data from 1961 to 1965 for training and HADCM3 simulation data from 1966 to 1991 for testing.

The main purpose of this experiment is to investigate the effect of data calibration on the performance of semi-supervised HMMR when applied to climate prediction. We show that data calibration is useful when the historical observations and future data come from different sources. To compare the relative performance of supervised HMMR, semi-supervised HMMR (SemiHMMR), and semi-supervised HMMR with data calibration (CSemiHMMR), we applied the algorithms to predict the mean daily temperature for 40 randomly selected meteorological stations in Canada (another criterion for choosing a station is that its time series must be complete, i.e., it has no missing values).

[Figure 4.11. Performance comparison between HMMR, SemiHMMR, and CSemiHMMR on Canadian climate data.]

The bar chart shown in Figure 4.11 indicates the fraction of locations in which one algorithm has a higher R² than the other. Unlike the results reported in Section 4.5.3, the bar chart suggests that the performance of semi-supervised HMMR is only comparable to supervised HMMR (55% versus 45%). This is actually consistent with the conclusions drawn in [57, 199] for semi-supervised classification, namely that unlabeled data with a different distribution may not improve the performance of a semi-supervised algorithm. However, with the calibration method developed in Section 4.4.2, CSemiHMMR performs better than HMMR on 72% of the data sets. This result confirms the effectiveness of incorporating the covariance alignment technique into our semi-supervised learning framework. We also illustrate the R² values for five of the selected stations in Table 4.8. The latitude and longitude of each station are recorded in the first column of Table 4.8. Though the performance of SemiHMMR appears to be worse than supervised HMMR at (52.2°N, 113.9°W) and (48.3°N, 71.0°W), the R² values for semi-supervised HMMR with data calibration are clearly superior at both locations.

Table 4.8. Comparing the average R² values for MLR, HMMR, SemiHMMR, and CSemiHMMR on the Canadian climate data

(Lat°, Lon°)          MLR      HMMR     SemiHMMR   CSemiHMMR
(48.65N, 123.43W)    0.6038   0.6144    0.6392      0.6356
(48.95N, 54.58W)     0.3276   0.3447    0.4187      0.4350
(52.18N, 113.89W)    0.3555   0.4591    0.4538      0.4732
(48.33N, 71W)        0.4843   0.4842    0.4776      0.4982
(82.52N, 62.28W)     0.4982   0.5456    0.5875      0.6277

In Section 4.4.2, we argued that aligning X_L with X_U (we call this calibration technique 1) may not be as effective as calibrating X_L with the combined matrix X_C = [X_L; X_U] (we call this calibration technique 2). This is because the former may result in a significant loss of nearest neighbor information, thus degrading the performance of semi-supervised HMMR (note that the results shown in Figure 4.11 are based on calibration technique 2). To measure the degree of alignment and the loss of neighborhood information under the two calibration techniques, we define two measures: RCovDiff and NNLoss. Let Δ₀(X_A, X_B) denote the difference between the covariance matrices constructed from X_A and X_B before alignment, and Δ₁(X_A, X_B) denote the corresponding difference after alignment. RCovDiff measures the reduction of the covariance difference before and after alignment, i.e.:

$$\mathrm{RCovDiff}_1 = \frac{|\Delta_0(X_L, X_U) - \Delta_1(X_L, X_U)|}{\Delta_0(X_L, X_U)}, \qquad
\mathrm{RCovDiff}_2 = \frac{|\Delta_0(X_L, X_C) - \Delta_1(X_L, X_C)|}{\Delta_0(X_L, X_C)}.$$

A calibration technique with a larger RCovDiff produces covariance matrices that are better aligned with each other. We also measure the loss of neighborhood structure due to data calibration in the following way. Let M₀(X_A, X_B) denote a 0/1 matrix computed based on the 1-nearest neighbor of each example in X_B among the examples in X_A before alignment, and M₁(X_A, X_B) denote the corresponding matrix after alignment. The NNLoss measure is defined as follows:

$$\mathrm{NNLoss} = \frac{|M_0(X_L, X_U) - M_1(X_L, X_U)|}{M_0(X_L, X_U)}.$$

Unlike RCovDiff, the same equation is applied to both calibration techniques 1 and 2.

Table 4.9. Comparing the degree of alignment and loss of neighborhood structure information using calibration techniques 1 and 2

(Lat°, Lon°)        RCovDiff1   RCovDiff2   NNLoss1   NNLoss2
(48.6N, 123.4W)       0.998       0.785      0.994     0.186
(48.9N, 54.6W)        0.993       0.714      0.998     0.116
(52.2N, 113.9W)       0.989       0.728      0.995     0.108
(48.3N, 71W)          0.992       0.703      0.996     0.086
(82.5N, 62.3W)        0.993       0.702      0.991     0.096

Table 4.9 compares the results of the two calibration techniques. Although calibration technique 1 produces better-aligned covariance matrices, it loses more information about the neighborhood structure. This explains our rationale for using calibration technique 2 for semi-supervised HMMR.
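To make this comparison reproducible in spirit, the sketch below implements a simple gradient-descent covariance alignment together with the RCovDiff and NNLoss diagnostics. It is our reading of Section 4.4.2 and of the definitions above, not the thesis code: the Frobenius-norm objective, the exact choice of which matrix is transformed, and the NNLoss counting rule are assumptions made for illustration.

```python
import numpy as np

def cov(X):
    """Covariance matrix of the columns of X."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / len(X)

def align_covariance(B, A, lr=1e-3, n_iter=1000):
    """Gradient descent for a linear map beta driving the covariance of the
    transformed data, beta^T B beta, toward the target covariance A
    (assumed objective: ||beta^T B beta - A||_F^2)."""
    beta = np.eye(B.shape[0])
    for _ in range(n_iter):
        diff = beta.T @ B @ beta - A
        beta -= lr * 4 * B @ beta @ diff   # gradient of the Frobenius objective
    return beta

def rcov_diff(X_L, X_U, X_L_new, X_U_new):
    """Relative reduction of the covariance difference after calibration."""
    d0 = np.linalg.norm(cov(X_L) - cov(X_U))
    d1 = np.linalg.norm(cov(X_L_new) - cov(X_U_new))
    return abs(d0 - d1) / d0

def nn_loss(X_L, X_U, X_L_new, X_U_new):
    """Fraction of future observations whose 1-nearest labeled neighbor
    changes after calibration (our reading of the NNLoss measure)."""
    def nn_index(A, B):
        d = ((B[:, None, :] - A[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)
    return np.mean(nn_index(X_L, X_U) != nn_index(X_L_new, X_U_new))
```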
4.6 Summary

Time series forecasting is an important task that has been studied for a long time in the statistics, data mining, and machine learning literature. However, most previous work on time series forecasting has focused on single step or short term prediction problems, which are not sufficient for many applications. Although there has been some work extending single step time series forecasting methods to long term prediction, it suffers from problems such as error accumulation. In this chapter, we first investigated the issues in long term time series forecasting and then proposed a semi-supervised multivariate time series prediction algorithm to improve prediction accuracy. The main contributions of this work are:

- We investigated the issues associated with the long term time series forecasting problem by studying the advantages and disadvantages of three prediction approaches both theoretically and empirically. Using the bias-variance decomposition framework, our analysis shows that multi-stage prediction suffers from the error accumulation problem, especially when the prediction window is long. Independent value prediction alleviates this problem by making predictions independently at each time step. However, it has its own problems. First, it is more expensive since it has to learn a separate prediction model for each step in the prediction window. Second, it has difficulty learning the appropriate function when the prediction window is long, because the true function becomes more complex with increasing prediction time steps. Third, this approach does not smooth out the effect of noise, unlike multi-stage prediction. Parameter prediction smooths the effect of noise by fitting a function over the entire output sequence and avoids the error accumulation problem by making independent predictions. It is more efficient than independent value prediction when the parameter set is small. However, finding the appropriate parameter function to fit the output values can still be quite a challenging task.

- We further investigated a multivariate time series prediction method to improve the prediction accuracy of long term forecasting by utilizing information from other related time series. The future data of the predictor variables can be collected by simulation in many domains such as climate modeling; however, they are used in a supervised fashion in traditional methods. In this work, a semi-supervised multivariate time series prediction framework based on the Hidden Markov Regression model was proposed to obtain a better model by utilizing both historical and future data in training. The experimental results on benchmark time series data sets demonstrate the superior performance of the semi-supervised HMMR algorithm compared with the supervised methods. However, applying semi-supervised HMMR to the climate projection problem by combining the historical reanalysis data and GCM future data did not show much improvement, which suggests that inconsistencies between training and test data can prevent the unlabeled data from improving the model's performance. To compensate for this problem, we proposed a covariance alignment data calibration method that transforms the training and test data into a new space before applying the semi-supervised HMMR algorithm. The semi-supervised HMMR with the data calibration technique achieves considerable improvement according to our evaluation on the climate data sets.
CHAPTER 5

Anomaly Detection, Characterization, and Visualization

In this chapter, we discuss methods for the detection, characterization, and visualization of anomalies in time series data. Given time series data D with length l, the goal of anomaly detection is to discover the timestamps at which the measurement values for one or more variables have deviated significantly from their nominal behavior. The nominal behavior represents the expected value of a time series based on its own history as well as its relationship to the other time series in D. The major innovations and contributions of this chapter are as follows.

- We present a graph-based algorithm and some extensions to detect different types of anomalies in time series data. The graph-based method is applied to Earth Science data to detect ecosystem disturbances.

- A kernel alignment technique is further proposed for the detection and characterization of anomalies in noisy multivariate time series data, which is used to discriminate between ecosystem disturbances in Earth Science applications.

- A clustering-based framework is proposed to aid the visual exploration and detection of anomalies in large scale, high spatial resolution time series data, e.g. to explore ecosystem disturbances in the global FPAR time series data.

The rest of this chapter is organized as follows. Section 5.1 introduces some preliminaries on time series anomaly detection. Section 5.2 presents the graph-based time series anomaly detection algorithm. Section 5.3 proposes the kernel alignment based method for multivariate time series anomaly detection and characterization. Two extensions of the graph-based anomaly detection algorithms are discussed in Section 5.4. Section 5.5 introduces a clustering-based visualization system for the exploratory analysis of anomalies in Earth Science data. The evaluation of our algorithms and system is given in Section 5.6.

5.1 Preliminaries

Here we present some background for the problem of time series anomaly detection.

5.1.1 Anomaly Detection in Time Series Data

Anomaly detection in time series data can be categorized into point-wise anomaly detection and subsequence anomaly detection. Formal definitions of point-wise and subsequence anomalies in time series data are given as follows:

Definition 9. (Point-wise Anomaly) A point-wise anomaly refers to a timestamp at which the observed values are significantly different from the rest of the time series.

Examples of point-wise anomalies include peak or valley anomalies, whose values are significantly larger or smaller than the rest of the time series.

Definition 10. (Subsequence Anomaly) A subsequence anomaly is defined as a segment of length d < l that does not match any other segment in the time series.

An example of this type of anomaly is shown in Figure 5.9(a), where there is a subsequence of unusual shape in the time series. In both cases, the anomalies can be characterized as global or local anomalies, depending on how the nominal behavior is determined. The nominal behavior for global anomalies is computed from the entire time series, whereas for local anomalies the nominal behavior is determined with respect to some local neighborhood, which can be defined in several ways using temporal information. One way is to consider the time-based neighborhood:

Definition 11.
(Time-based Local Anomaly) A data point at time t is considered a time-based local anomaly if its value is significantly different from the values within the time interval [t − k, t + k].

An example of such an anomaly is shown in Figure 5.10(a). Another category of local anomaly, for time series with periodic patterns, is the cycle-based local anomaly, defined as:

Definition 12. (Cycle-based Local Anomaly) Let k be the periodicity of a time series. A data point at time t is considered a cycle-based local anomaly if its value is significantly different from the values measured at times t − k, t − 2k, ... and t + k, t + 2k, ....

The above discussion applies to both univariate and multivariate time series.

5.1.2 Moving Average Time Series Anomaly Detection

The moving average is the simplest anomaly detection method for time series data. In this approach, a time series is first detrended using a linear adjustment. This is necessary to ensure that, in cases where there is a gradual but marked increase in monthly FPAR over the time series, potential disturbance events occurring relatively near the end of the series are not overlooked. The detrended time series is subsequently deseasonalized by computing the moving average time series to remove the dominant seasonal oscillations.

When applied to detecting ecosystem disturbances in FPAR time series, the domain experts hypothesize that a "sustained" disturbance event can be defined as any decline in average annual FPAR levels (at an assigned significance level) that lasts for a temporal threshold of at least 12 consecutive monthly observations at any specific location [173]. The logic used here is that an actual disturbance involves a sustained decline in FPAR because the structure of the vegetation cover has been severely altered or destroyed during the disturbance event, to a magnitude that lowers FPAR significantly for at least one seasonal growing cycle, after which time regrowth and recovery of the former vegetation structure may permit FPAR to increase again. Significant declines in average annual FPAR levels are defined to be greater than 1.7 standard deviations (SD) below the 19-year average FPAR computed for any specific location. The threshold of 1.7 SD was chosen based on the 95% confidence level in rejecting the null hypothesis using a one-sided statistical t-test.

[Figure 5.1. FPAR time series at two different locations and their anomaly scores learned by moving average. (a) Detection of an ecosystem disturbance due to Hurricane Elena (September 1985) from FPAR time series data. (b) Detection failure for the ecosystem disturbance due to the Chisholm wildfire.]

Figure 5.1(a) shows an example of an ecosystem disturbance event recorded at 29.61°N and 89.78°W. The event coincides with the timing of Hurricane Elena (September 1985), while its location is near the vicinity of the landfall points of the storm. In addition to this example, the method successfully detects documented landfall points of other tropical storms including Hurricanes Alicia, Gloria, Gilbert, and Hugo.
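As a concrete illustration of the detection rule just described (before turning to its limitations), the following sketch is our simplified Python rendering, not the thesis code; it uses linear detrending, a 12-month moving average, the 1.7 SD threshold, and the 12-month persistence requirement, with parameter names of our choosing.

```python
import numpy as np

def detect_sustained_decline(fpar, sd_thresh=1.7, min_months=12, window=12):
    """Flag a sustained disturbance: deseasonalized FPAR staying more than
    sd_thresh standard deviations below its long-term mean for at least
    min_months consecutive months."""
    t = np.arange(len(fpar))
    detrended = fpar - np.polyval(np.polyfit(t, fpar, 1), t)        # linear detrending
    smoothed = np.convolve(detrended, np.ones(window) / window, mode="valid")
    low = smoothed < smoothed.mean() - sd_thresh * smoothed.std()
    run = 0
    for i, flag in enumerate(low):
        run = run + 1 if flag else 0
        if run >= min_months:
            return i - run + 1   # index in the smoothed series where the decline begins
    return None                   # no sustained disturbance detected
```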
For an extended discussion that includes coverage of droughts, heat waves, cold waves, and blizzards, see [172].

While this approach can successfully detect many large-scale disturbance events in Earth Science data, it also has several limitations. First, the timing of a detected event may vary since the approach uses a 12-month moving average to deseasonalize the time series. Second, it may not be able to detect small scale anomaly events in which the vegetation structure of the region recovers in less than one growing seasonal cycle. Third, it has problems detecting certain special kinds of anomalies such as local anomalies. Finally, it fails to exploit the relationship between different variables, e.g. the correlation between climate and FPAR time series. Figure 5.1(b) shows the failure of this approach to detect the disturbance event in Chisholm, Alberta, where a major wildfire occurred in May 2001.

5.1.3 Anomaly Detection and Characterization in Multivariate Time Series Data

A multivariate time series D = {X_i}_{i=1}^p is a collection of time series that correspond to the measurements of p real-valued variables spanning the same time interval. For some application domains, one of the variables Y ∈ D may be designated as the target variable of interest, while the remaining D − {Y} variables are known as the predictor variables. In this study, we consider two types of multivariate time series anomaly detection problems: general and target specific.

For general multivariate time series anomaly detection, all variables in D are considered equally important when detecting anomalies. For example, in network intrusion detection, anomalies can be detected based on the deviations observed in one or more time series, e.g. an unusually high number of connections originating from the same IP address (for a denial of service attack), a wide range of port numbers used (for a port scan attack), or an abnormally large number of ICMP packets sent (for a ping flood attack).

In contrast, target specific anomaly detection attempts to find anomalies in the time series of a target variable that are correlated with the deviations observed in the predictor time series. An example application can be found in the Earth Science domain, where scientists are interested in identifying ecosystem disturbances such as wildfires from satellite observations of the global vegetation cover data (i.e., the target variable). By correlating the anomalies found in the vegetation cover data with those found in climate variables such as temperature and precipitation, this may help scientists to distinguish climate-induced anomalies from human-induced anomalies. As a result, target specific anomaly detection can help characterize the anomalies in a target variable using the anomalies observed in the predictor time series.

5.1.4 Anomaly Exploration in a Large Scale Time Series Repository

Anomaly detection in a very large time series database is challenging because of the computational cost of processing all the data. For example, huge amounts of time series such as FPAR remote sensing data are available at high spatial resolution in Earth Science applications. Scanning all the time series across all locations for anomalies becomes very expensive, not to mention that most anomaly detection algorithms require the adjustment of one or more thresholds. Visual data mining is defined as the use of interactive visual representations of abstract data to amplify the pattern discovery process.
The goal of the visualization is to reduce the complexity of a given data set by presenting only a small portion of the massive time series collection at a time, while minimizing the loss of information. Interaction is a fundamental component of visualization that permits the user to modify the visualization parameters, browse the data, or query for results. Anomalies are discovered interactively between the user and the visual anomaly exploration system.

5.2 Graph-based Time Series Anomaly Detection

This section presents the graph-based anomaly detection algorithm for time series data. The method first constructs a kernel matrix from the given time series and then runs a random walk to obtain the anomaly score of each data point or subsequence.

5.2.1 Kernel Matrix for Time Series Data

Here we introduce the kernel matrix used to measure the similarity between data points in a time series. Let i and j denote a pair of timestamps in the time series. Each timestamp is associated with a set of measurements for p variables (note that p = 1 for univariate time series and p > 1 for multivariate time series). Although numerous similarity measures have been proposed in the literature (e.g. cosine and correlation), few of them are applicable to both univariate and multivariate time series. In this study, we measure the similarity between two timestamps using the RBF function:

$$K(i, j) = \exp\Big[-\frac{\sum_{k=1}^{p} (x_{k,i} - x_{k,j})^2}{\sigma^2}\Big], \qquad (5.1)$$

where x_{k,i} is the value of time series variable k at time i. The RBF function can also be generalized to measure the similarity between two subsequences of width d as follows:

$$K(i, j) = \exp\Big[-\frac{\sum_{k=1}^{p} \sum_{s=0}^{d-1} (x_{k,i+s} - x_{k,j+s})^2}{\sigma^2}\Big], \qquad (5.2)$$

where i and j are the starting timestamps of the two subsequences. Based on the similarity measure, we may construct a non-negative, symmetric matrix K to capture the pairwise similarity between every pair of timestamps (or subsequences) in a given time series. K is also known as a kernel matrix (or a Gram matrix). The next section describes a graph-based algorithm that detects anomalies by utilizing the information in a kernel matrix.

[Figure 5.2. A weighted graph representation of a kernel matrix.]

5.2.2 Random Walk on Graph for Anomaly Detection

A kernel matrix can be transformed into a weighted graph representation, G = (V, E), where V is the set of nodes and E ⊆ V × V is the set of edges. Each node in the graph corresponds to a data point (or a subsequence) in the time series, whereas each edge is weighted by the similarity value encoded in the kernel matrix K. Because an anomalous node is highly dissimilar to other nodes, the weights of its connections to the rest of the nodes in the graph are generally small. Such a node will rarely be visited when performing a random walk traversal on the graph. This is the fundamental idea behind our graph-based anomaly detection algorithm. For example, node a in Figure 5.2 is an anomalous node due to its low connectivity compared to the other nodes in the graph.

To estimate the connectivity value of a node, we view the graph as a Markov chain on the state space V with a transition matrix S, whose (i, j)-th element denotes the transition probability from node i to node j. The transition matrix is obtained by normalizing each column of the kernel matrix:

$$S(i, j) = \frac{K(i, j)}{\sum_{i'} K(i', j)}. \qquad (5.3)$$

The connectivity value of each node is computed iteratively by applying the following recursive equation:
$$c = \frac{d}{n}\,\mathbf{1} + (1 - d)\, S\, c, \qquad (5.4)$$

where d is the damping factor and c is the connectivity vector over all the nodes in the graph. This iterative procedure can be viewed as performing a random walk on the Markov chain, where given the current node there is a probability 1 − d of visiting one of its neighboring nodes according to the transition matrix S and a probability d of visiting a random node in the graph. This approach is equivalent to the formulation used by the PageRank algorithm [163]. Upon convergence, nodes with high connectivity c are considered normal, whereas those with low connectivity are declared anomalous. As an example, Table 5.1 shows the connectivity values of the nodes given in Figure 5.2. Node a is successfully detected as an anomaly because it has the lowest connectivity value.

Table 5.1. Connectivity values obtained by applying a random walk algorithm on the graph shown in Figure 5.2. The lower the connectivity value, the more anomalous a node is.

  a      b      c      d      e
0.05   0.23   0.28   0.14   0.32

5.3 Anomaly Detection and Characterization in Noisy Multivariate Time Series

This section investigates the problem of anomaly detection and characterization in noisy multivariate time series. We start with an introduction to the multivariate kernel alignment method.

5.3.1 Multivariate Kernel Alignment

As previously noted in Section 5.1.1, the nominal behavior of a multivariate time series can be determined based on the historical evolution of each time series as well as their relationships with each other. A kernel matrix may capture the similarity between measurements at different timestamps but not the relationship between the different time series. To apply the graph-based algorithm described in the previous section to multivariate time series, the kernel matrix must be adjusted to capture the dependence relationships among the time series in D. This can be accomplished by "aligning" their corresponding kernel matrices.

We begin with a discussion of how kernel alignment works when one of the time series corresponds to a target variable of interest. An extension of the framework to more general multivariate time series problems, where all the time series are equally important, is described in Section 5.3.2. Let K be the initial kernel matrix constructed from the set of predictor variables X ∈ D (using Equation (5.1)) and K_Y be the corresponding kernel matrix computed from the target time series. The objective of kernel alignment is to derive an adjusted kernel matrix K_a from K that maximizes its correlation to the target kernel matrix K_Y, i.e.:

$$\max_{K_a} \frac{\langle K_a, K_Y \rangle_F}{\sqrt{\langle K_a, K_a \rangle_F \langle K_Y, K_Y \rangle_F}}, \qquad (5.5)$$

where ⟨A, B⟩_F = Σ_{ij} A_{ij} B_{ij} is the Frobenius product between two matrices. The aligned matrix K_a is obtained by first decomposing K into a set of basis vectors {v_1, v_2, ..., v_n} and learning the weight α_i associated with each basis vector:

$$\max_{\alpha} \frac{\sum_{i=1}^{n} \alpha_i \langle v_i v_i^T, K_Y \rangle_F}{m \sqrt{\big\langle \sum_{i=1}^{n} \alpha_i v_i v_i^T,\; \sum_{i=1}^{n} \alpha_i v_i v_i^T \big\rangle_F}}, \qquad (5.6)$$

where m = ⟨K_Y, K_Y⟩_F. The resulting aligned matrix is:

$$K_a = \sum_{i=1}^{n} \alpha_i v_i v_i^T.$$

We consider two ways to decompose K into its basis vectors. The first approach extracts the normalized eigenvectors of K by solving the eigenvalue equation Kv = λv, subject to the constraint v_i^T v_j = 1 if i = j and zero otherwise.
As a result, kernel alignment reduces to the following optimization problem:

$$\max_{\alpha} \frac{\sum_{i=1}^{n} \alpha_i \langle v_i v_i^T, K_Y \rangle_F}{m \sqrt{\sum_{i=1}^{n} \alpha_i^2}}, \qquad (5.7)$$

or equivalently,

$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i \langle v_i v_i^T, K_Y \rangle_F - \rho \Big[\sum_{i=1}^{n} \alpha_i^2 - 1\Big], \qquad (5.8)$$

which has the following closed form solution [60]:

$$\alpha_i = \frac{\langle v_i v_i^T, K_Y \rangle_F}{\sqrt{\sum_{i} \langle v_i v_i^T, K_Y \rangle_F^2}}. \qquad (5.9)$$

One potential problem with this formulation is that, under the constraint Σ_i α_i² = 1, the aligned matrix may overfit K_Y and lose its similarity to the original kernel matrix K. To overcome this problem, we propose a new objective function:

$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i \langle v_i v_i^T, K_Y \rangle_F - \rho \sum_{i=1}^{n} (\alpha_i - \lambda_i)^2, \qquad (5.10)$$

where λ_i and v_i are the corresponding eigenvalues and eigenvectors of the kernel matrix K. With this modification, the aligned kernel matrix K_a preserves as much information in K as possible. The weight parameters α_i can be computed in closed form as follows:

$$\alpha_i = \lambda_i + \frac{\langle v_i v_i^T, K_Y \rangle_F}{2\rho}. \qquad (5.11)$$

The preceding equation shows that α_i increases when there is a strong positive correlation between v_i v_i^T and the target kernel K_Y. Furthermore, if ρ → ∞, then α_i → λ_i, which reduces K_a back to the original kernel matrix K.

An alternative to using eigenvectors is to simply replace v_i by the time series of each predictor variable, i.e., v_i = X_i. In this case, the objective function to be maximized is

$$\max_{\alpha} \sum_{i=1}^{p} \alpha_i \langle X_i X_i^T, K_Y \rangle_F - \frac{\rho}{2} \sum_{i=1}^{p}\sum_{j=1}^{p} \alpha_i \alpha_j \langle X_i X_i^T, X_j X_j^T \rangle_F.$$

For brevity, we choose ρ = 1 in our experiments. Taking the derivative with respect to α_i and setting it to zero, we obtain:

$$\langle X_i X_i^T, K_Y \rangle_F = \sum_{j=1}^{p} \alpha_j \langle X_i X_i^T, X_j X_j^T \rangle_F. \qquad (5.12)$$

The weight parameters α_j can therefore be estimated by solving a linear regression problem. Since α_j is associated with the original variable X_j, a large weight α_j indicates a strong correlation between X_j and the target variable Y. Once the weight parameters have been determined, the aligned kernel matrix K_a = Σ_i α_i v_i v_i^T is constructed.

Note that elements of K_a may become negative when there are negative values in the basis vectors. Although the random walk approach is still applicable in such a case, the kernel matrix K_a can no longer be interpreted as a transition matrix. Our strategy is to make all the elements of K_a non-negative by adding a sufficiently large number to each element of the matrix. This is equivalent to adding a constant weight to every edge in the graph. However, if the added weight is too large, the weights of all the edges become close to uniform after normalization. To avoid this problem, the elements of the matrix are shifted such that the minimum value of the matrix becomes zero:

$$K_a \leftarrow K_a + |\min(K_a)|, \quad \text{if } \min(K_a) < 0.$$

The graph-based algorithm described in Section 5.2.2 is then applied to the shifted aligned kernel matrix to detect anomalies.

Algorithm 6 Multivariate Time Series Anomaly Detection
Input: Multivariate time series D.
Output: Connectivity vector c.
1: K_a ← KernelAlign(D)
2: S ← Normalize(K_a)
3: c ← RandomWalk(S)

5.3.2 Multivariate Time Series Anomaly Detection and Characterization Algorithms

Here we propose our multivariate time series anomaly detection algorithms. A high-level summary is shown in Algorithm 6. There are two variations of the proposed algorithm. The first variation is designed to identify anomalies in a target time series, whereas the second is designed to identify anomalies in a general multivariate time series, where each variable is equally important. Their difference lies in the way K_a is computed.
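Before detailing the two variations, the shared machinery, namely the RBF kernel of Equation (5.1), the column-normalized transition matrix of Equation (5.3), and the random-walk connectivity of Equation (5.4), can be sketched as follows. This is an illustrative Python sketch with our own function names; the kernel alignment step itself is omitted here.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Pairwise RBF similarity between the T timestamps of a multivariate
    time series X of shape (T, p), following Equation (5.1)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / sigma ** 2)

def connectivity(K, d=0.1, n_iter=100):
    """Random-walk connectivity of Equation (5.4); low values flag anomalies."""
    S = K / K.sum(axis=0, keepdims=True)   # column-normalized transition matrix
    n = K.shape[0]
    c = np.ones(n) / n
    for _ in range(n_iter):
        c = d / n + (1.0 - d) * S @ c
    return c

# Example usage: z-score the connectivity values and flag the lowest-scoring timestamps.
# K = rbf_kernel(X); c = connectivity(K); z = (c - c.mean()) / c.std()
```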
For target specific anomaly detection, we first align the kernel matrix derived from the predictor variables with the kernel matrix of the target variable, as described in Section 5.3.1. To perform the alignment, the original kernel matrix K can be decomposed into its basis vectors before applying Equation (5.11) or (5.12) to determine the weight parameters α. We then construct a transition matrix S from the aligned kernel K_a by normalizing the columns of the matrix. Finally, a Markov chain random walk is performed on the graph associated with the aligned kernel matrix to obtain the connectivity value of each node (which may represent a timestamp or a subsequence of the time series). Since the connectivity values depend on the similarity measure, the number of nodes, and other factors, we normalize the connectivity scores by subtracting their mean and dividing by their standard deviation. Nodes whose connectivity scores fall below a certain threshold are declared anomalies.

The impact of the kernel alignment step for target specific anomaly detection in multivariate time series is two-fold:

- It tends to extract the components of the predictor time series that are most related to the anomalies in the target time series during the same time period. In this way, the anomalies found in the predictor time series after alignment can be used to characterize the corresponding anomalies in the target time series.

- The components of the predictor time series that are not related to the target time series are removed after kernel alignment, which has two effects: first, anomalies in the predictor time series that are independent of the target time series are suppressed; second, the impact of noise in the multivariate time series is reduced.

In this way, the anomalies found in the predictor time series can be used to characterize the anomalies discovered in the target time series.

For general multivariate time series anomaly detection, where D = [X_1, X_2, ..., X_p], we consider each variable X_i as a target and learn an aligned kernel matrix K_i between that target and all the remaining variables in D. After alignment, this produces p aligned kernel matrices. The overall aligned kernel matrix K_a is computed by taking the Hadamard product of the individually aligned kernel matrices K_i:

$$K_a = K_1 \circ K_2 \circ \cdots \circ K_p. \qquad (5.13)$$

Once the overall aligned kernel has been computed, the remaining steps are the same as those taken for target-specific anomaly detection. Again, the noise in the multivariate time series is removed after kernel alignment, so the false alarm rate is reduced.

5.4 Extensions of the Proposed Algorithm

This section describes two extensions of the proposed algorithms: (1) to detect local anomalies in time series data, and (2) to discover subsequence-based anomalies. Although we discuss these extensions in the framework of multivariate time series anomaly detection, they are also applicable to univariate time series.

5.4.1 Sparse Kernel for Local Anomaly Detection

The aligned kernel matrix is used to construct a graph G upon which a random walk algorithm is applied to identify anomalous nodes. G is a completely connected graph whose edge weights depend on the values of the kernel matrix K_a. As a result, the anomalies found correspond to global anomalies, whose values are significantly different from the rest of the time series. Our proposed algorithm can be extended to detect local anomalies by sparsifying the kernel matrix prior to constructing the transition matrix S.
As mentioned in Section 5.1.1, we have implemented two ways to define the neighborhood of a node for local anomaly detection. The first approach, the time-based neighborhood, is implemented by removing the edges between all pairs of nodes i and j for which |i − j| > k. Thus, a local anomaly observed at time t has significantly different values from the other observations within the time interval [t − k, t + k]. For the second approach, i.e., the cycle-based neighborhood, we sparsify the graph by removing all edges in G for which |i − j| mod k > τ. For example, setting τ = 0 would compare an observation at time t only to the observations at times t − k, t − 2k, ... and t + k, t + 2k, ..., where k is the known periodicity of the time series. The cycle-based approach is particularly useful for non-stationary time series data with periodic patterns. For example, it can be used to detect anomalies in the monthly FPAR or climate time series by comparing the values for the same month in different years.

5.4.2 Subsequence Anomaly Detection

In addition to point-wise anomalies, our algorithm can also be extended to discover subsequence anomalies (also known as discords). Let Y be the target variable and X be the set of predictor variables. Assume T is the length of all the time series. A sliding window of predefined size d is used to extract all the subsequences of length d from the multivariate time series. The corresponding predictor and target variable subsequences are represented as X̄ and Ȳ, respectively. The number of subsequence windows created from the time series data is T − d + 1. Each element of X̄ and Ȳ is a subsequence of length d, denoted as:

$$\bar{x}_j = \{x_j, x_{j+1}, \ldots, x_{j+d-1}\}, \qquad \bar{y}_j = \{y_j, y_{j+1}, \ldots, y_{j+d-1}\}.$$

The kernel matrices K and K_Y are constructed as follows:

$$K(i, j) = \exp\Big(-\frac{\|\bar{x}_i - \bar{x}_j\|^2}{\sigma^2}\Big) = \exp\Big(-\frac{\sum_{s=0}^{d-1} \|x_{i+s} - x_{j+s}\|^2}{\sigma^2}\Big)$$

and

$$K_Y(i, j) = \bar{y}_i^T \bar{y}_j = \sum_{s=0}^{d-1} y_{i+s}\, y_{j+s}.$$

The (i, j)-th element of the kernel matrix represents the similarity between two subsequences (i.e., sliding windows) of length d starting from timestamps i and j. Similar to the approach described for point-wise anomaly detection, the kernel matrices can be aligned before applying the random walk algorithm to detect anomalous subsequences. While the approach presented in this section assumes a fixed window size d, it can easily be extended to deal with variable length time windows as well as other similarity measures for subsequences.

5.5 Visual Exploration of Ecosystem Disturbances

In the previous sections, we introduced a graph-based method for anomaly detection and characterization in multivariate time series data. However, applying these methods directly for ecosystem disturbance detection in large scale Earth Science data is very expensive because of the massive data size and the requirement for parameter tuning. Visual exploratory analysis provides a way to alleviate this problem by presenting only a small portion of the data at a time and enabling user interaction during the mining process. Here we present a visualization system for the mining and exploration of ecosystem disturbances in Earth Science data. We first present a system that clusters regions with similar disturbance events. Then a multi-level indexing framework is proposed to aid the exploratory analysis of ecosystem disturbances in the global FPAR time series repository.
Table 5.2. Features for clustering disturbance events

Feature          Description
Location         Latitude and longitude of a location
Period           Number of months of consecutively low FPAR values
First month      Month in which the disturbance event was detected
FPAR (During)    Minimum and range of FPAR values during the period of the disturbance event
FPAR (After)     Minimum and range of FPAR values after the period of the disturbance event
Climate          Precipitation and temperature values at the onset of the disturbance event
Land cover       Land cover type at the location

5.5.1 Clustering of Ecosystem Disturbance Events

Clustering is the task of partitioning data into groups of similar objects. In Earth Science studies, clustering has been applied to various problems including land cover classification [202, 207], classification of synoptic-scale circulation patterns [59], and discovery of climate indices [193]. Clustering can also be used to aggregate related (and possibly nearby) data points that have similar characteristics (e.g. climate conditions, landscape variability, etc.). Since it does not require any labeled examples, clustering is considered an unsupervised learning task, as opposed to supervised learning tasks such as classification and regression.

In this work, we apply clustering to group together locations that exhibit similar types of disturbance events. The clusters may help users to automatically categorize the different types of disturbance events (e.g. wildfires, insects, deforestation, etc.) based on the characteristics of the eco-climatic time series during or after the event has occurred. Since there are no labeled examples available, a key challenge is to determine which attributes are appropriate for characterizing the different types of disturbance events. Table 5.2 shows some of the attributes available to describe the characteristics of disturbance events. The user may cluster the data points using any subset of these attributes. Since each attribute can have a different scale, we need to normalize the attribute values before applying the clustering algorithm. Otherwise, the similarity computation between data points will be dominated by attributes that have a large range of possible values. Continuous-valued attributes are normalized by subtracting the minimum possible value of the attribute and dividing by the range of its possible values. Each discrete attribute is normalized by creating a binary attribute for each attribute-value pair. Normalization allows every attribute selected for the clustering task to be treated as equally important during similarity computation.

[Figure 5.3. Clusters of ecosystem disturbance events detected in the North Carolina region (each symbol corresponds to a different cluster).]

5.5.2 Clustering for Exploratory Data Analysis

The methods described in the previous section have been successfully applied to the FPAR time series data from AVHRR. With the availability of the higher resolution MODIS data, applying the disturbance event detection algorithm to all the data points can be very expensive. For example, with the 4 km x 4 km MODIS data, there are more than 1.3 million data points for North America alone. The total number of FPAR time series for the entire world is more than 10.1 million. Although the disturbance event detection algorithm scales linearly with the size of the data, it still takes more than a few minutes to process the data for North America.
Repeating the analysis with different thresholds will take a considerable amount of time, making it infeasible for real-time exploration of the data. The runtime would increase considerably if the analysis were performed on the 1 km x 1 km MODIS data or with more sophisticated (and expensive) anomaly detection algorithms. Techniques must therefore be developed to reduce the amount of processing time.

For this experiment, we performed K-means clustering (with k = 20) on all disturbance event locations in North America using precipitation and temperature as data attributes. Figure 5.3 shows clusters of disturbance events found near the region of North Carolina. Each colored pixel corresponds to a location where a disturbance event has been previously detected. We have also labeled each colored pixel according to its cluster ID. Our preliminary results suggest the possibility of distinguishing the different types of ecosystem disturbance events in the region according to their cluster assignment. For example, Figure 5.4 shows two FPAR time series, one for a data point assigned to cluster #1 and one for a data point assigned to cluster #13. Further investigation revealed that the data points assigned to cluster #1 are associated with disturbance events due to Hurricane Hugo (which occurred in September 1989). Although the disturbance event found for the data point shown in Figure 5.4(b) also occurs in 1989, it appears earlier in the year compared to the disturbances found in cluster #1.

[Figure 5.4. FPAR time series at different locations: (a) FPAR at 32.85°N, 79.84°W (cluster 1); (b) FPAR at 33.21°N, 80.01°W (cluster 13).]

A. Proposed Clustering Framework

In this work, we propose a clustering-based framework to reduce the number of time series that need to be processed in order to facilitate interactive exploration of the large-scale data. Specifically, our strategy is to build a multi-level index on the MODIS FPAR time series by applying clustering to the time series and choosing representative samples from the clusters at each level of the index hierarchy. Therefore, instead of applying the moving average or graph-based disturbance detection algorithm to the entire set of time series, we approximate the frequency distribution of disturbance events in each region by applying the algorithm to the selected samples. This helps to reduce the amount of processing time considerably, thus allowing the user to tune the thresholds of the algorithm and to observe the changes in the results in real time. It also helps the user to focus on a particular region of interest during exploratory data analysis.

[Figure 5.5. A multi-level approach for visual exploration of ecosystem disturbances.]

Figure 5.5 shows an example of how the multi-level indexing scheme works. Specifically, we use the multi-level index to monitor the frequency distribution of ecosystem disturbance events in North America. We use North America for illustrative purposes only; multi-level indices for other continents are also created. At the top level, the entire continent is partitioned into a 7 x 10 grid, where each grid box contains 256 x 256 pixel locations.
Next, each of the 7 x 10 boxes is further divided into smaller 4 x 4 grid boxes, where each of the smaller boxes now contains 64 x 64 pixel locations. At the next level, each of these smaller boxes is partitioned into another 4 x 4 grid boxes, each of which now contains 16 x 16 pixel locations. A final partitioning of each grid box produces a square region that contains 4 x 4 pixel locations. A map displayed at this level will show the actual locations where disturbance events have been observed. At all other higher levels, the map displays the frequency distribution of disturbance events in each grid box (region). For example, in Figure 5.5, the map at level d displays the frequency distribution of disturbance events for each region in North America. As can be seen from the map, the Baja California peninsula, the midwest region, and Canada appear to have the highest concentration of disturbance event locations. The user may decide to focus only on these regions and may click on the grid box that corresponds to one of these regions (say California) for further analysis. The map for the California peninsula will now be displayed (level d - 1). The region is now divided into 4 x 4 smaller regions; this allows the user to examine the regions where there is a dense concentration of disturbance events (e.g. the dark region in the lower right hand corner of the map). This process is repeated until it reaches the lowest level of the index, where the map displays the actual locations where disturbance events have been detected for that particular region.

B. Sampling Representative Time Series from Clusters

The multi-level indexing scheme described above allows the user to explore the data interactively in a hierarchical fashion. Computation time can still be very expensive, however, because the number of time series in each grid box grows quadratically from one level (m) to the next level (m + 1). For example, there are 256 pixel locations in each grid box at level 1 and 65,536 pixel locations in each grid box at level 3. To reduce the computation time at the higher levels of the index, instead of applying the disturbance event detection algorithm to all the time series in a grid box, we apply the algorithm to samples of time series for that region. The key challenge here is to determine which time series should be selected to obtain a representative sample. A better representative sample must capture more disturbance events in the region. Another issue to consider is the tradeoff between coverage and processing time, which depends on the sample size. If the sample size is large, this will provide good coverage for the region but at the expense of increased processing time. Our experiments on the 4 km x 4 km MODIS FPAR data suggest that a sample size of 100 points per grid box produces reasonable coverage and processing time. If N time series must be sampled from a given grid box, we perform clustering on all the time series located in the grid box to obtain k clusters. We then select ⌈N/k⌉ representative time series from each cluster to form the sample for the region. We investigate several sampling approaches in this study. For example, two basic sampling techniques are considered (an illustrative sketch of both follows the list):

- Closest Sampling samples the time series that are closest to the cluster centroid.
- Farthest Sampling samples the time series that are farthest away from the cluster centroid.
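The following is a minimal sketch of the two cluster-based sampling strategies above. It assumes a standard K-means implementation (scikit-learn is used here purely for illustration; the system described later in this chapter was implemented in Matlab), and the function and parameter names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans   # assumption: any K-means implementation would do

def sample_representatives(series, n_samples, k, mode="farthest"):
    """Cluster the time series in a grid box into k groups and pick roughly
    ceil(n_samples / k) representatives per cluster, either the members closest
    to the cluster centroid or the members farthest away from it."""
    X = np.asarray(series, dtype=float)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    per_cluster = int(np.ceil(n_samples / k))

    chosen = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dist = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        order = np.argsort(dist)                  # ascending distance to centroid
        if mode == "farthest":
            order = order[::-1]                   # farthest-from-centroid first
        chosen.extend(members[order[:per_cluster]].tolist())
    return chosen                                 # indices of the sampled time series
```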
Since we are interested in the detection of anomalous events, farthest sampling may be useful to focus on time series that deviate significantly from the others in the cluster. During the construction of the multi-level index, the pixel locations of the sampled time series are stored in the index structure. The samples are retrieved when the grid box is selected by the user during exploratory data analysis. The disturbance event detection algorithm is then applied to the samples and the results are displayed on the map (as shown in Figure 5.5). To evaluate the effectiveness of the sampling strategy, we compute the fraction of disturbance events missed due to sampling:

\[
\mathrm{Loss} = \frac{1}{d}\sum_{l=1}^{d}\frac{1}{|Q_l|}\sum_{b_i \in Q_l}\frac{N_a(b_i;\theta) - N_s(b_i;\theta)}{N_a(b_i;\theta)} \qquad (5.14)
\]

where d is the number of levels, Q_l is the set of grid boxes at level l, θ is the event definition threshold, N_a(b_i; θ) is the actual number of disturbance events in grid box b_i, and N_s(b_i; θ) is the corresponding number of disturbance events found in the sampled time series.

Figure 5.6. An interactive visual data mining tool for detection and characterization of ecosystem disturbances.

5.5.3 A User-Interactive System for Disturbance Event Detection

Figure 5.6 shows a snapshot of the interactive system that integrates the disturbance event detection and clustering algorithms developed in this study. The algorithms and user interface are developed using the Matlab programming language. Initially, the map displays the frequency distribution of disturbance events for the selected continent (e.g. North America, as shown in Figure 5.6). The continent is initially divided into a 7 x 10 grid of boxes. The color of the pixels on the map represents the fraction of locations in each grid box that contain disturbance events. The user may decide to select a grid box that contains a high percentage of disturbance events for further exploration. The viewer will zoom in to the selected grid box and scan all the subgrids of that region at the next higher resolution level. The viewer also enables users to return to the previous level by right-clicking on the map and choosing a different grid box at a lower resolution level. Users can repeat this process until the highest resolution level is reached, where each grid cell represents a true location with latitude and longitude. At the highest resolution level, users can click on a single location and the viewer will plot its time series and return an anomaly score for each temporal data point.

Figure 5.7. Target specific anomaly detection: (a) three simulated time series, a target Y and two predictor variables X1 and X2, for measuring the effectiveness of target specific anomaly detection; (b) anomaly scores obtained by applying a random walk to the aligned (top) and unaligned (bottom) kernels for the time series in (a).

5.6 Experimental Results

We perform several experiments to evaluate the performance of the techniques proposed in this work.

5.6.1 Target Specific Anomaly Detection

The objective of this experiment is to illustrate the effectiveness of applying kernel alignment for target specific anomaly detection in multivariate time series. We simulated three equal-length time series as shown in Figure 5.7(a).
The top diagram represents the target variable Y whereas the bottom two time series correspond to the predictor variables X1 and X2. Both X1 and Y are generated by injecting large amplitude pulses into a periodic sinusoidal time series. The time series for X2 is generated by injecting large amplitude pulses and modifying the local shape of a random time series generated from the normal distribution N(t, 0.8). The diagram clearly shows that two of the anomalies in the target variable Y (at timestamps 20 and 40, respectively) can be explained by the anomalies found in X1. The third anomaly in the target variable Y (at timestamp 60) is unrelated to any anomalies observed in the predictor variables X1 and X2. Furthermore, the anomalies at timestamp 90 of the predictor variable X1 and at timestamps 70 and 80 of the predictor variable X2 do not correspond to any anomalies in Y.

Figure 5.7(b) shows the results of applying our target-specific anomaly detection algorithm to the aligned (K0) and unaligned (KX) kernel matrices derived from the combined predictor variables, X1 and X2. Anomalies are detected based on the timestamps at which the connectivity values are below a certain threshold (after standardization). The results clearly demonstrate the effectiveness of applying kernel alignment to learn the dependence relationship between the predictor and target variables. Without kernel alignment, the correlated anomalies at timestamps 20 and 40 are indistinguishable from other timestamps in the time series, which may lead to high false positive or false negative rates (depending on the threshold chosen). By applying kernel alignment, anomalies that are correlated with those observed in Y (at timestamps 20 and 40) are amplified, which suggests that they can be used to characterize the anomalies found in the target variable. The anomalies found at timestamps 70 and 80 in the predictor variable X2 and the anomaly found at timestamp 90 in the predictor variable X1 are degraded since they do not align with the target variable Y. Similarly, the anomaly at timestamp 60 in the target variable is also not detected because it cannot be explained by any anomalies in the predictor variables.

This experiment uses the original predictor time series as the basis vectors for the aligned kernel (see Section 5.3.1). The weights associated with the predictor variables are α1 = 0.9964 and α2 = 0.0844, respectively. These parameters suggest that X1 is more correlated with the target variable Y than X2, which is consistent with our observation.

5.6.2 General Multivariate Noisy Time Series Anomaly Detection

The purpose of this experiment is to demonstrate the effectiveness of applying kernel alignment to anomaly detection for multivariate time series without a target variable. Figure 5.8(a) shows two simulated time series X1 and X2 with a pair of anomalies at timestamps 20 and 30, respectively. A Gaussian noise (N(0, 0.3)) is also added to each time series to distort the signals.

Figure 5.8. General multivariate noisy time series anomaly detection: (a) two simulated time series X1 and X2 with noise, for measuring the effectiveness of general multivariate time series anomaly detection; (b) anomaly scores obtained by applying a random walk to the aligned (top) and unaligned (bottom) kernels for the time series in (a).
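Synthetic data of the kind used in this experiment can be generated along the following lines. This is an illustrative reconstruction under stated assumptions (a sinusoidal base signal, a pulse amplitude of 3, and a fixed random seed), not the exact data used to produce the figures.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100)

# Shared periodic signal plus N(0, 0.3) Gaussian noise on each series
base = np.sin(2 * np.pi * t / 10.0)
x1 = base + rng.normal(0.0, 0.3, t.size)
x2 = base + rng.normal(0.0, 0.3, t.size)

# Inject a pair of correlated anomaly pulses at timestamps 20 and 30
for ts in (20, 30):
    x1[ts] += 3.0
    x2[ts] += 3.0
```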
The resulting anomaly scores are plotted in Figure 5.8(b). Without kernel alignment, the random walk algorithm has difficulty discriminating the anomalies at timestamps 20 and 30 from other noisy signals in the time series. By applying kernel alignment (using Equation (5.13)), the random walk algorithm can successfully detect both anomalies in the noisy time series. This experiment suggests that kernel alignment helps to improve the kernel matrix used by the anomaly detection algorithm by reducing the effect of noise.

5.6.3 Subsequence Anomaly Detection

The purpose of this experiment is to demonstrate the effectiveness of our algorithm for finding unusual subsequences in a multivariate time series. We simulated two time series X1 and X2 as shown in Figure 5.9(a). A Gaussian noise (with mean zero and standard deviation 0.1) is used to distort both time series. Anomalous subsequences are then added to the two time series. We compare the results of applying the random walk anomaly detection algorithm on kernel matrices constructed from subsequences against those constructed from individual time points. In both cases, the kernel matrices are initially aligned before applying the random walk algorithm. The subsequence-based kernel matrices are constructed using a sliding window of length 6.

Figure 5.9. Unusual subsequence detection: (a) two simulated time series X1 and X2 with unusual subsequences; (b) anomaly scores obtained by applying a random walk to the subsequence (top) and data point (bottom) kernels for the time series in (a).

Figure 5.9(b) compares the anomaly scores obtained using the aligned subsequence-based kernel matrix (top diagram) to those obtained using the aligned point-wise kernel matrix (bottom diagram). The results obtained using the subsequence-based kernel are clearly superior.

5.6.4 Local Anomaly Detection

This experiment compares the results of applying the full versus sparse kernel matrices for detecting local anomalies. First, we illustrate the results of detecting time-based local anomalies. Figure 5.10(a) shows the two simulated time series used in our experiment. One of the time series, X1, contains both a global anomaly (at timestamp 100) and a local anomaly (at timestamp 170). We apply the random walk anomaly detection algorithm on the graph induced by the sparse kernel matrix (using a time-based neighborhood with k = 20), as described in Section 5.4.1. The resulting anomaly scores are plotted in Figure 5.10(b) and compared against those obtained by using the full (global) kernel. The results show that a full kernel may miss the local anomaly because its value at timestamp 170 is similar to other observations in X1. In contrast, the sparse kernel is capable of detecting both global and local anomalies.

Figure 5.10. Time-based local anomaly detection: (a) two simulated time series X1 and X2 with local and global anomalies; (b) anomaly scores obtained by applying a random walk to the time-based local (top) and full (bottom) kernel matrices.

Next, we illustrate the results of detecting cycle-based local anomalies. The two simulated time series used for this experiment are shown in Figure 5.11(a).
Both of the simulated time series contain a local anomaly at timestamp 20 and have a periodicity equal to 6. Clearly, the values of the variables at timestamp 20 are not unusual from a global perspective. As a result, standard distance-based and density-based anomaly detection methods fail to detect such anomalies. Our random walk algorithm using the full kernel matrix also fails, as shown in the bottom panel of Figure 5.11(b). However, by sparsifying the kernel matrix based on the cycle-based neighborhood approach, the anomaly at timestamp 20 is successfully detected, as shown in the top panel of Figure 5.11(b).

Figure 5.11. Cycle-based local anomaly detection: (a) two simulated time series X1 and X2 with cycle-based anomalies; (b) anomaly scores obtained by applying a random walk to the cycle-based local (top) and full (bottom) kernel matrices.

5.6.5 General Performance Comparison

In this experiment, we compare the performance of our general multivariate time series anomaly detection algorithm against several baseline algorithms using benchmark data from the University of California Riverside time series data mining archive [120] (the data was originally created by Dasgupta et al. [111] and obtained via a CD from Eamonn Keogh). The benchmark data, which is used to test the performance of anomaly detection algorithms, contains 38 distinct time series data sets. Each data set has separate training and test sets. The training set contains a pair of "normal" time series of length 1000. The test set also has a pair of time series of length 1000, in addition to a class label vector that indicates whether a data point in the bivariate time series is anomalous.

We compared our general multivariate time series algorithm (denoted as Align) against four other competing methods: random walk without alignment [148], probability based [220], LOF [31], and K-distance based methods [176]. Note that the probability based method trains a classifier from the training set and is thus a supervised method. All other methods, including Align, are unsupervised and do not use the training set. The performance of the algorithms is measured in terms of their area under the ROC curve (AUC) [82]. Figure 5.12 shows the AUC values for all the methods on the 38 data sets. The results suggest that Align significantly outperforms the other existing methods on the majority of the data sets. As an example, Figure 5.13 shows the ROC curves for all the methods on one of the data sets for which our random walk algorithm on the aligned kernel matrix has the largest area.

Figure 5.12. Performance comparison using 38 data sets from the UC Riverside data repository.

Figure 5.13. An example of the ROC curves for all the methods on the 7th data set.

5.6.6 Ecosystem Disturbance Detection

This section describes the results of applying our algorithm to the problem of detecting ecosystem disturbances such as wildfires in Earth Science data. Specifically, ecosystem disturbances are detected by monitoring changes in the vegetation cover data (called FPAR [200]) obtained from satellite observations. The monthly FPAR data is available at 4 km x 4 km spatial resolution covering the time period between 2000 and 2005.
The objective of this study is to detect ecosystem disturbances using our proposed algorithm and to correlate them against the anomalies observed in climate data (such as precipitation, temperature, and sea-level pressure). The FPAR time series is treated as the target variable, whereas the climate time series are used as the predictor variables. To detect anomalies, we align the kernel matrix constructed from the climate variables against the kernel matrix for FPAR. The aligned kernel is then sparsified using the cycle-based neighborhood approach with k = 12 and r = 1. The anomaly scores for the multivariate time series are then computed by performing a random walk on the graph induced by the kernel matrix.

Using this approach, we were able to detect several incidents of large-scale FPAR disturbance events that correlate with extreme climate conditions. One example is given in Figure 5.14(a), which shows the FPAR time series at a location in Alberta, Canada (latitude = 54.85N, longitude = 114.20W) along with the anomaly scores obtained by applying the random walk algorithm to the kernel matrix constructed from the FPAR time series. The timing and location of the FPAR disturbance event coincide with the Chisholm wildfire event reported in May 2001 [194]. It has also been reported that dry conditions were one of the factors contributing to the severity of the wildfire. Figure 5.14(b) shows the monthly precipitation for the location along with the anomaly scores computed from the aligned kernel. Observe that the FPAR disturbance event is well aligned with the occurrence of a low precipitation event. Our analysis also suggests the possibility of another FPAR disturbance and low precipitation event in Spring 2002; however, more analysis is needed to validate this event.

As another example, Figure 5.15(a) shows the FPAR time series and anomaly score for a location in South Dakota (latitude = 35.31N, longitude = 111.21W). The timing and location of the disturbance event coincide with the Antelope wildfire [197] reported in August 2002. The FPAR anomaly is also associated with a low precipitation event, as shown in Figure 5.15(b). Although there are other unusual precipitation events during this period, only one of them coincides with the FPAR disturbance.

A third example is given in Figure 5.16(a) for the FPAR time series at a location in Honduras (latitude = 15.68N, longitude = 87.57W). A sudden drop in FPAR was observed at the end of 2001, which correlates with a sudden drop in sea level pressure (see Figure 5.16(b)). According to a report by the National Oceanic and Atmospheric Administration (NOAA) [134], the FPAR disturbance event coincides with the timing of Hurricane Iris.

Figure 5.14. FPAR and precipitation time series at (latitude = 54.85N, longitude = 114.20W) and the corresponding anomaly scores by random walk: (a) FPAR and its anomaly score; (b) precipitation and its anomaly score.
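The scoring step used throughout these experiments runs a random walk on the graph induced by the (aligned and, where applicable, sparsified) kernel matrix and flags time points with low connectivity. The following is a minimal sketch of that scoring step only, assuming a non-negative kernel (e.g. an RBF kernel); the kernel alignment and cycle-based sparsification described in Sections 5.3 and 5.4 are not reproduced here, and the function name and convergence settings are illustrative rather than those of the thesis implementation.

```python
import numpy as np

def random_walk_scores(K, n_iter=200, tol=1e-9):
    """Row-normalize a (possibly sparsified) kernel matrix into a transition
    matrix and iterate a random walk toward its stationary distribution.
    Time points with low stationary (connectivity) values are candidate anomalies."""
    K = np.asarray(K, dtype=float)
    P = K / K.sum(axis=1, keepdims=True)          # transition probabilities between time points
    c = np.full(K.shape[0], 1.0 / K.shape[0])     # start from a uniform distribution
    for _ in range(n_iter):
        c_next = c @ P
        if np.abs(c_next - c).sum() < tol:
            break
        c = c_next
    return (c - c.mean()) / c.std()               # standardized connectivity scores
```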
5.6.7 Ecosystem Disturbance Exploratory Analysis

We verify our viewer with several examples of fire events that happened between 2000 and 2005. Figure 5.17(a) shows a subregion at the highest resolution level, which is close to the Pike-San Isabel National Forest, 30 miles southwest of Denver, Colorado. According to a report published in [1], a forest fire hit this region around June 2002. Figure 5.17(b) shows one example location picked from the dark colored locations in Figure 5.17(a). The raw time series and the anomaly score obtained by the moving average method are plotted. A run of consecutive FPAR-Low events lasting 12 months is successfully detected at this location, with latitude and longitude (39.45°N, 105.00°W). Figure 5.18(a) shows another subregion at the highest resolution level, close to Alberta, Canada. According to a report published in [194], the Chisholm Fire hit this region around May 2001. Figure 5.18(b) shows an example location picked from the dark colored locations in Figure 5.18(a). The raw time series and the anomaly score obtained by the graph-based method are plotted. A run of consecutive FPAR-Low events lasting 3 months is successfully detected at this location, with latitude and longitude (54.88°N, 114.18°W). Since all our results are computed on the fly, users can adjust the parameters whenever they want to. For example, a user can change the length of the sliding window or the threshold of the moving average method according to his/her own purposes, as well as the parameters of the random walk approach.

Figure 5.15. FPAR and precipitation time series at (latitude = 35.31N, longitude = 111.21W) and the corresponding anomaly scores by random walk: (a) FPAR and its anomaly score; (b) precipitation and its anomaly score.

5.6.8 Multi-Level Indexing Performance Evaluation

Here we evaluate how different aggregation methods affect the results during exploratory analysis of the high resolution FPAR data. The configuration of the system in this evaluation is based on 3-level indexing with 50, 100, and 100 samples for levels 1, 2, and 3, respectively. We first evaluate how the system performs when aggregating data from the high resolution level to the low resolution level using different definitions of samples. Two types of samples are used for the K-means clusters: (1) data points closest to the cluster centroid, and (2) data points farthest away from the cluster centroid. The metric in Equation (5.14) is used to calculate the sampling effect when data is aggregated from a high resolution level to a lower resolution level. We include the results for these methods as well as for random sampling and uniform sampling in Table 5.3. It shows that the loss for random and uniform sampling is more than 77%, which is considerably higher than for farthest sampling. This result suggests that, by sampling the time series that are considerably different from the cluster centroid, we are more likely to sample time series that contain anomalies (i.e., disturbance events).

Table 5.3. Comparison of sampling loss for different strategies.
            Level 1   Level 2   Level 3   Avg
  Random    0.7679    0.7319    0.8356    0.7784
  Uniform   0.7699    0.7196    0.8213    0.7703
  Closest   0.7595    0.7008    0.7994    0.7532
  Farthest  0.7527    0.6956    0.4243    0.6242

Figure 5.16. FPAR and sea level pressure time series at (latitude = 15.68N, longitude = 87.57W) and the corresponding anomaly scores by random walk: (a) FPAR and its anomaly score; (b) sea level pressure and its anomaly score.

5.6.9 Multi-Level Indexing System Efficiency Evaluation

Table 5.4 shows the total runtime for computing all the disturbance events at different levels using the moving average and random walk methods. The number of grid boxes and the number of time series to scan at each level are also recorded. Level 0, which represents the highest resolution where each grid box is a 4 km x 4 km location, requires nearly 1.5 hours to scan all the land points in North America with the moving average method. It is even more expensive for the random walk method, which requires around 5 hours. Clearly, this is infeasible for exploratory data analysis. After the data is aggregated to level 1 with 50 clusters in each grid box, the time needed to scan all the sampled time series at this level is reduced significantly to only about 10 minutes for the moving average method. This was further reduced to as few as 10 seconds at the lowest resolution level 3, which makes exploratory analysis feasible.

Table 5.4. Runtime (in seconds) needed to compute disturbance events at each level for North America.

  Level              0          1          2          3
  Total Time (MV)    5418.8     661.7      98.4       10.9
  Total Time (RW)    17710.0    1839.6     254.6      56.0
  #Grids  (x 70)     4^8        4^4        4^2        1
  #Points (x 70)     4^8        4^4 x 50   4^2 x 100  100

Figure 5.17. Hayman Fire Event: (a) disturbance locations distributed in the region of the Hayman Fire Event at the highest resolution level; (b) an example FPAR time series for the Hayman Fire Event at 39.45°N and 105.00°W and the anomaly score obtained by moving average (First month = 6/2002, Count = 12).

The preceding table shows the amount of time needed to process all the time series at each level. However, during exploratory data analysis, a user selects only one grid box to process as he/she drills down from one level to another. The runtime needed to compute the disturbance events for each grid box is shown in Table 5.5. Since all the grid points need to be scanned at the lowest resolution when the explorer first appears, the time consumed equals that of scanning all the data points at level 3, which is about 10 seconds for the moving average method. When a region at level 3 is selected, the sampled time series for that region are processed for disturbance event detection. The response time at this level is only about 1.4 seconds. In turn, it takes another 0.59 seconds to drill down from level 2 to level 1 and 0.3 seconds from level 1 to the highest resolution level 0. The same effect is also observed with the random walk method.

Table 5.5. Response time (in seconds) for drilling down from low resolution to high resolution.
  Level transition   1 → 0     2 → 1      3 → 2       3 (initial)
  Avg Time (MV)      0.3       0.59       1.4         10.9
  Avg Time (RW)      3.24      10.04      18.1        56.0
  #Points            4^4       4^2 x 50   4^2 x 100   70 x 100

Figure 5.18. Chisholm Fire Event: (a) disturbance locations distributed in the region of the Chisholm Fire Event at the highest resolution level; (b) an example FPAR time series for the Chisholm Fire Event at 54.88°N and 114.18°W and its anomaly score.

5.7 Summary

Anomaly detection in time series data has been studied extensively in the literature. However, there are still many challenging problems for existing methods, especially when they are applied to Earth Science data, which consists of a large number of multivariate time series at high spatial resolution. In this chapter, we discussed various techniques to detect and characterize different types of anomalies in multivariate time series data and developed a visualization system for exploratory analysis of ecosystem disturbances in Earth Science data. The main contributions of this work are:

- A graph-based method was first proposed for anomaly detection in multivariate time series data, which learns a connectivity score for each node by running a random walk on the graph constructed from the kernel matrix of the time series data. This method was further extended to detect different types of anomalies in time series data. First, local anomalies such as time-based or cycle-based anomalies can be detected by sparsifying the full kernel matrix into a local kernel matrix, which consists of similarity values between data points in a time-based or cycle-based local neighborhood only. Second, subsequence anomalies can also be detected by constructing the kernel matrix between subsequences of the time series. Experimental results on both real FPAR time series and synthetic data sets demonstrated the superior performance of this technique compared with other methods such as moving average, LOF, K-dist, etc.

- A kernel alignment technique was further discussed for anomaly discovery in multivariate time series involving more than one related variable. Two problems were investigated. The first aims at the detection of anomalies in general noisy multivariate time series, learning the anomaly scores for every data point or subsequence by running a random walk on the combination of the aligned kernel matrices among the different variables. The alignment procedure reduces the impact of noise significantly and thus decreases the false alarm rate. The effectiveness of this technique is demonstrated by experiments on a number of data sets. The second problem aims at the detection of target specific anomalies in multivariate time series, learning the anomaly scores for every data point or subsequence in the target and predictor time series by running a random walk on their aligned kernel matrices. Anomalies in the predictor time series that are most related to corresponding anomalies in the target time series are detected and thus used to characterize the anomalies in the target time series. This method was successfully applied to associate anomalies in climate events, such as low precipitation, with disturbance events in FPAR time series, such as forest fires.
- Despite the success of our proposed graph-based technique for anomaly detection in time series data, it is very expensive to apply it to large amounts of high spatial resolution Earth Science time series, not to mention the need to tune thresholds or parameters frequently in real applications. A clustering-based framework is therefore proposed to assist the exploratory analysis of the FPAR time series repository for ecosystem disturbance detection. The clustering approach was first applied to group regions with similar ecosystem disturbance events to help the user categorize different types of disturbance events. It was then used to build a multi-level indexing system, which is integrated into an interactive viewer that enables scientists to explore the data in near real-time fashion. The effectiveness and efficiency of our system are demonstrated by our evaluation.

CHAPTER 6

Conclusions and Future Work

The primary objective of the work presented in this dissertation is to develop new spatial and temporal data mining techniques to address the challenging problems in mining Earth Science data.

6.1 Contributions

We focused on three interesting but challenging research problems related to spatial and temporal data mining: (1) localized prediction in the presence of spatial and temporal dependencies; (2) long term time series forecasting; and (3) detection, characterization, and visualization of anomalies. Preliminary results of this thesis have appeared in several conference proceedings [46][47][45][48][42][44]. In this section, we summarize the main contributions of this thesis.

The localized prediction work described in Chapter 3 has made the following contributions:

- Integrated Localized Support Vector Machine Framework: An integrated framework for the localized Support Vector Machine (including LSVC and LSVR) is proposed, which incorporates neighborhood information between training and testing examples into the SVM (including SVC and SVR) objective function. We tested the proposed framework on a number of real-world data sets and showed its superior performance over both KNN and nonlinear global SVM.

- Supervised Clustering Algorithms and Profile Support Vector Machine: LSVM is expensive to compute since it requires training a separate SVM model for each test example. To address this limitation, we developed a more efficient implementation of the algorithm called the Profile Support Vector Machine. PSVM reduces the computational time by extracting a small number of clusters using supervised algorithms called MagKmeans and Vaereans, for classification and regression respectively. Our analysis showed that PSVM achieves accuracy comparable to LSVM while being much more computationally efficient.

- LSVM and PSVM with Spatial and Temporal Dependencies: We demonstrated the effectiveness of the proposed algorithms on data with spatial only, temporal only, or combined spatio-temporal dependencies. The spatial, temporal, or spatio-temporal neighborhood information is incorporated into the LSVM and PSVM framework. Our algorithms achieved significant improvement when applied to a number of prediction tasks on Earth Science data sets with spatial and temporal dependencies.

The long term time series forecasting work described in Chapter 4 has made the following contributions:

- Empirical Study on Multi-step Time Series Prediction: An empirical study is conducted on three prediction approaches for multi-step-ahead time series prediction.
Our theoretical and empirical analysis showed that the current multi-stage prediction approach tends to suffer from the error accumulation problem when the prediction period is long. Independent value prediction is less susceptible to this problem because its predictions are made independently at each time step. However, it has difficulty in learning the true model and filtering the effect of noise. Parameter prediction handles noisy data by fitting a function over the entire output sequence and also tends to be more efficient than independent value prediction when the number of parameters to be fitted is small. However, finding the appropriate parameter function to fit the time series is quite a challenging task.

- Semi-supervised Hidden Markov Regression Model for Multivariate Time Series Prediction: To alleviate the error accumulation problem in long term time series forecasting, an alternative strategy is adopted to perform multivariate time series prediction by utilizing information from other related time series. A combination of local prediction and global prediction methods is developed to estimate the response values for the future data. A semi-supervised time series prediction approach is developed based on Hidden Markov Model Regression (HMMR) to learn from both historical and future data. Experimental results on a number of benchmark data sets show the superior performance of our proposed semi-supervised algorithm compared with supervised algorithms.

- Covariance Alignment for Data Calibration: We showed that the inconsistencies between historical and future data may actually hurt the model's performance when we apply our semi-supervised HMMR to long term climate prediction. To compensate for this problem, we proposed a covariance alignment data calibration method that transforms the training and test data into a new space before applying the semi-supervised HMMR algorithm. Experimental results on the climate data sets clearly demonstrate the improvement of the semi-supervised HMMR algorithms when the proposed data calibration technique is used.

The anomaly detection, characterization, and visualization techniques described in Chapter 5 have made the following contributions:

- Graph-based Time Series Anomaly Detection Algorithm and Its Extensions: A robust graph-based method is proposed for anomaly detection in multivariate time series data, which performs better than other methods such as moving average, LOF, K-dist, etc. Our algorithm is also very flexible in that it can be extended to detect unusual subsequences or local anomalies. Our method was successfully applied to detect ecosystem disturbances in Earth Science data.

- Kernel Alignment for Multivariate Time Series Anomaly Detection and Characterization: A kernel alignment method is proposed to improve anomaly detection in noisy multivariate time series data. This method can also be utilized to associate the anomalies in predictor time series with anomalies in a target time series, e.g. to characterize ecosystem disturbances such as forest fires detected from FPAR time series with climate anomalies such as dry weather detected from precipitation time series. Extensive experiments have been conducted on synthetic and real Earth Science data sets to demonstrate the effectiveness of our algorithm.
- Visualization System for Exploratory Analysis of Global Ecosystem Disturbances: We also developed an interactive viewer that enables scientists to display the FPAR time series at locations where disturbance events have been detected. A clustering-based technique is adapted to group together regions with similar disturbance events to be presented to the user. The clustering method is further used to build a multi-level indexing structure over the large-scale, high resolution FPAR time series data so that the user can explore the data in near real-time fashion. The effectiveness and efficiency of the system are demonstrated by our evaluation.

In addition to the Earth Science domain, the algorithms and techniques developed in this thesis would be beneficial for many other real-world applications, including traffic analysis, network monitoring, geographical information systems, stock market analysis, etc.

6.2 Future Research

In this section, we discuss some of the future research directions on spatial and temporal data mining motivated by this thesis.

- Prediction of Extreme Events: Most time series prediction methods first predict smooth future values by constructing a regression function and then add possible noise. The forecasts will exhibit less variability than the actual values, which implies a regression to the mean problem [23]. In many applications, people are interested in predicting extreme events. For example, Earth scientists are interested in knowing the distribution of droughts in the future, which may have a substantial impact on the ecosystem and agriculture of the affected region. Prediction of such extreme events is much more difficult compared with traditional time series prediction problems. First, although there has been some work on extreme value theory [74], modeling extreme events is still a challenging task, not to mention predicting their future occurrences. Second, extreme events usually happen more rarely than normal events and thus require much more data for model training. Third, because of pattern evolution in long term time series, such as climate change, the behavior of extreme events may change substantially compared with the history. Extending current methods or developing new techniques for extreme event forecasting is a very interesting problem to be addressed in future work.

- Pattern Discovery in Spatio-temporal Databases: In this thesis, we have examined the detection and characterization of anomalous patterns in time series data. However, the methods we proposed focused on temporal patterns only. There are still many interesting pattern discovery problems remaining for future research in Earth Science data, for instance, detecting anomalies in spatio-temporal data or discovering spatio-temporal associations between anomaly patterns in spatio-temporal data. Mining other interesting patterns, such as frequent patterns or periodic patterns, and their spatio-temporal associations is also an interesting and challenging problem for future research. Developing new techniques to solve these problems for Earth Science applications would definitely help people better understand the interactions between the global climate, ecological, and economic systems underlying different observed phenomena.

- Distributed and Parallel Data Mining: In this thesis, we have proposed a multi-level indexing system built by sampling from high resolution spatio-temporal Earth Science data for exploratory analysis of ecosystem disturbances.
One limitation of our visualization system for mining ecosystem disturbances in large scale spatio-temporal data is that only parts of the data are processed at a time. It is able to facilitate the task of exploratory analysis, but it does not directly solve the scalability issue when a very large data set needs to be processed. Recently, large scale computing systems such as cloud computing have been developed, which utilize a large number of machines to solve a problem efficiently. Thus, another interesting research direction that may be worth future exploration is to develop parallelized algorithms for large scale distributed computing systems such as the map-reduce framework [65], for example, a parallelized version of the graph-based anomaly detection algorithm for ecosystem disturbance detection in large scale Earth Science data.

- Learning from Heterogeneous Data: As discussed in Chapter 4, data in Earth Science may come from different sources, such as climate observations and global climate simulations. Heterogeneous data sets are usually generated from different distributions, which may degrade the performance of most standard supervised or semi-supervised learning algorithms. We have proposed a covariance alignment method in our work to calibrate the data from different sources. However, our solution only aligns the first and second order statistics of the data. Higher order alignment of the data using kernel statistics would be a possible research direction. Another interesting research problem would be to apply transfer learning methodology [217], which learns from small amounts of training data and large amounts of low quality auxiliary data with a different distribution. However, transfer learning typically requires the auxiliary data with a different distribution to be labeled, which is not true for the future data in the time series forecasting problem. Recently, Raina et al. [175] and Ando et al. [8] proposed transfer learning frameworks that learn from unlabeled data, which may serve as the foundation for our future work.

- Multi-scale Anomaly Detection: The proposed anomaly detection techniques in Chapter 5 only consider anomalies at the same spatial and temporal scales. In the Earth Science domain, ecosystem disturbance events may happen at different spatial or temporal scales. For example, ecosystem disturbance events such as wildfires may happen at a large spatial scale, whereas other events caused by human activities, such as the construction of a housing subdivision or a golf course, may happen at a small spatial scale; some disturbance events may last several months whereas others may last only a short period of time. Detecting anomalies at different spatial and temporal scales may help reveal the intrinsic spatial and temporal structure of the anomaly patterns and can be used to characterize different types of anomalies.

APPENDIX A

Proof of the Dual Form for Localized Support Vector Machine

Let D_U = [x_{l+1}, x_{l+2}, ..., x_{l+u}]^T be a set of u test examples and S(x_i, x_{l+s}) be the similarity between the training example x_i and the test example x_{l+s}. For each x_{l+s} ∈ D_U, we construct its primal local SVM model by solving the following optimization problem:

\[
\min_{w} \; \frac{1}{2}\|w\|_2^2 + \beta \sum_{i=1}^{l} S(x_i, x_{l+s})\,\xi_i \qquad (A.1)
\]
\[
\text{s.t.} \quad y_i\big(\langle w, \varphi(x_i)\rangle - b\big) \geq 1 - \xi_i, \quad \xi_i \geq 0
\]

We introduce a Lagrangian multiplier α_i for each inequality condition y_i(⟨w, φ(x_i)⟩ - b) ≥ 1 - ξ_i, and μ_i for ξ_i ≥ 0. As a result, the Lagrangian for the optimization problem can be written as:

\[
L = \frac{1}{2}\|w\|_2^2 + \beta \sum_{i=1}^{l} S(x_i, x_{l+s})\,\xi_i - \sum_{i=1}^{l} \alpha_i \big( y_i(\langle w, \varphi(x_i)\rangle - b) - 1 + \xi_i \big) - \sum_{i=1}^{l} \mu_i \xi_i \qquad (A.2)
\]
Then, taking derivatives of L with respect to w, b, and ξ_i, we obtain the KKT conditions for the primal problem:

\[
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{l} \alpha_i y_i \varphi(x_i) \qquad (A.3)
\]
\[
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0 \qquad (A.4)
\]
\[
\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \beta S(x_i, x_{l+s}) - \alpha_i - \mu_i = 0 \qquad (A.5)
\]
\[
y_i\big(\langle w, \varphi(x_i)\rangle - b\big) - 1 + \xi_i \geq 0 \qquad (A.6)
\]
\[
\xi_i \geq 0 \qquad (A.7)
\]
\[
\alpha_i \geq 0 \qquad (A.8)
\]
\[
\mu_i \geq 0 \qquad (A.9)
\]
\[
\alpha_i \big\{ y_i(\langle w, \varphi(x_i)\rangle - b) - 1 + \xi_i \big\} = 0 \qquad (A.10)
\]
\[
\mu_i \xi_i = 0 \qquad (A.11)
\]

From the last two KKT complementarity conditions, we can determine the threshold b. Furthermore, if α_i < βS(x_i, x_{l+s}), then ξ_i = 0. So any training point for which 0 < α_i < βS(x_i, x_{l+s}) can be used to compute b. By substituting Equations (A.3)-(A.5) into the objective function in (A.2), the optimization problem can be reduced to:

\[
L = -\frac{1}{2}\sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \langle \varphi(x_i), \varphi(x_j)\rangle + \sum_{i=1}^{l}(\alpha_i + \mu_i)\xi_i - \sum_{i=1}^{l}\alpha_i(-b\,y_i - 1 + \xi_i) - \sum_{i=1}^{l}\mu_i \xi_i
  = \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \langle \varphi(x_i), \varphi(x_j)\rangle \qquad (A.12)
\]

Thus, we obtain the dual form of the LSVM optimization problem, which is given as follows:

\[
\max_{\alpha} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \langle \varphi(x_i), \varphi(x_j)\rangle
\]
\[
\text{s.t.} \quad \sum_{i=1}^{l} \alpha_i y_i = 0, \qquad 0 \leq \alpha_i \leq \beta S(x_i, x_{l+s}), \quad i = 1, 2, \ldots, l
\]

BIBLIOGRAPHY

[1] Hayman fire, BAER report. http://www.wilderness.org/Library/Documents/WildfireSummary_Hayman.cfm, 2002.

[2] J. Adhikary. Knowledge discovery in spatial databases: Progress and challenges. Technical report, University of British Columbia, 1996.

[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases, pages 487-499, Santiago de Chile, Chile, 1994.

[4] R. Agrawal and R. Srikant. Mining sequential patterns. In Proceedings of the 11th International Conference on Data Engineering, pages 3-14, Taipei, Taiwan, 1995.

[5] H. Akaike. Fitting autoregressive models for prediction. Annals of the Institute of Statistical Mathematics, 21(1):243-247, 1969.

[6] E. L. Allwein, R. E. Schapire, Y. Singer, and P. Kaelbling. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113-141, 2000.

[7] J. Alon, S. Sclaroff, G. Kollios, and V. Pavlovic. Discovering clusters in motion time-series data. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 375-381, Madison, Wisconsin, USA, 2003.

[8] R. K. Ando and T. Zhang. A high-performance semi-supervised learning method for text chunking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 1-9, Morristown, NJ, USA, 2005. Association for Computational Linguistics.

[9] L. Anselin. Local indicators of spatial association - LISA. Geographical Analysis, 27:93-115, 1995.

[10] M. Atallah, R. Gwadera, and W. Szpankowski. Detection of significant sets of episodes in event sequences. In Proceedings of the 4th IEEE International Conference on Data Mining, pages 3-10, Los Alamitos, CA, USA, 2004. IEEE Computer Society.

[11] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11:11-73, April 1997.

[12] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Fast kernel learning using sequential minimal optimization. Technical report, EECS Department, University of California, Berkeley, 2004.

[13] R. A. Baeza-Yates. Searching subsequences. Theoretical Computer Science, 78(2):363-376, 1991.

[14] P. M. Baggenstoss. A modified Baum-Welch algorithm for hidden Markov models with multiple observation spaces. IEEE Transactions on Speech and Audio Processing, 2:411-416, 2001.
[15] P. Baldi, Y. Chauvin, T. Hunkapiller, and M. A. McClure. Hidden Markov models of biological primary sequence information. Proceedings of the National Academy of Sciences, 91:1059-1063, February 1994.

[16] S. Bay, K. Saito, N. Ueda, and P. Langley. A framework for discovering anomalous regimes in multivariate time-series data with local models. In Symposium on Machine Learning for Anomaly Detection, 2004.

[17] S. Brahim-Belhouari and A. Bermak. Gaussian process for nonstationary time series prediction. Computational Statistics and Data Analysis, 47(4):705-712, 2004.

[18] R. E. Bellman. Adaptive Control Processes. Princeton University Press, 1961.

[19] M. Bertolotto, T. Kechadi, S. Di Martino, and F. Ferrucci. Scalable 2-pass data mining technique for large scale spatio-temporal datasets. In Proceedings of the 11th Knowledge-Based and Intelligent Information and Engineering Systems, pages 785-792, Vietri sul Mare, Italy, 2007.

[20] S. K. Bethi, V. V. Phoha, and Y. B. Reddy. Clique clustering approach to detect denial-of-service attacks. In Proceedings of the 5th Annual IEEE SMC Workshop on Information Assurance, pages 447-448, June 2004.

[21] D. Birant and A. Kut. Spatio-temporal outlier detection in large databases. Computing and Information Technology, 14(4):291-298, 2006.

[22] J. A. Blackard and J. D. Denis. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24:131-151, 2000.

[23] J. M. Bland and D. G. Altman. Statistics notes: Regression towards the mean. British Medical Journal, 308:1522, 1994.

[24] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the 18th International Conference on Machine Learning, pages 19-26, San Francisco, CA, USA, 2001. Morgan Kaufmann.

[25] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory, pages 92-100, Madison, Wisconsin, USA, 1998. Morgan Kaufmann.

[26] Y. L. Borgne. Bias variance trade-off characterization in a classification: What differences with regression? Technical report, Universite Libre de Bruxelles, 2005.

[27] Y. L. Borgne, S. Santini, and G. Bontempi. Adaptive model selection for time series prediction in wireless sensor networks. Signal Processing, 87(12):3010-3020, 2007.

[28] L. Bottou and V. Vapnik. Local learning algorithms. Neural Computation, 4(6):888-900, 1992.

[29] G. E. P. Box, G. M. Jenkins, and G. C. Reinsel. Time Series Analysis, Forecasting, and Control. Prentice Hall, 1970.

[30] U. Brefeld, T. Gartner, T. Scheffer, and S. Wrobel. Efficient co-regularised least squares regression. In Proceedings of the 23rd International Conference on Machine Learning, pages 137-144, Pittsburgh, Pennsylvania, 2006. ACM Press.

[31] M. M. Breunig, H. P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 93-104, Dallas, Texas, USA, 2000.

[32] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121-167, 1998.

[33] J. Caiado and N. Crato. A GARCH-based method for clustering of financial time series: International stock markets evidence. MPRA Paper 2074, University Library of Munich, Germany, 2007.
[34] E. Camossi, M. Bertolotto, and T. Kechadi. Mining spatio-temporal data at different levels of detail. In Lecture Notes in Geoinformation and Cartography, pages 225-240. Springer Berlin Heidelberg, 2008.

[35] CCCSN. Canadian Climate Change Scenarios Network, Environment Canada. http://www.ccsn.ca/.

[36] G. Celeux and J. B. Durand. Selecting hidden Markov model state number with cross-validated likelihood. Computational Statistics, 2007.

[37] C. C. Chang and C. J. Lin. LIBSVM: A library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.

[38] C. Y. Chang, M. S. Chen, and C. H. Lee. Mining general temporal association rules for items with different exhibition periods. In Proceedings of the 2002 IEEE International Conference on Data Mining, page 59, Los Alamitos, California, USA, 2002. IEEE Computer Society.

[39] N. Chapados and Y. Bengio. Augmented functional time series representation and forecasting with Gaussian processes. 2007.

[40] S. P. Charles, B. C. Bates, I. N. Smith, and J. P. Hughes. Statistical downscaling of daily precipitation from observed and modelled atmospheric fields. Hydrological Processes, 18(8):1373-1394, 2004.

[41] C. Chatfield. The Analysis of Time Series. New York, NY: Chapman and Hall, 1996.

[42] H. F. Chen, H. B. Cheng, G. F. Jiang, and K. J. Yoshihira. Exploiting local and global invariants for the management of large scale information systems. In Proceedings of the 8th IEEE International Conference on Data Mining, Pisa, Italy, 2008.

[43] M. Y. Chen, A. Kundu, and J. Zhou. Off-line handwritten word recognition using a hidden Markov model type stochastic network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):481-496, 1994.

[44] H. B. Cheng and P. N. Tan. Semi-supervised learning with data calibration for long-term time series forecasting. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133-141, Las Vegas, NV, 2008.

[45] H. B. Cheng, P. N. Tan, J. Gao, and J. Scripps. Multistep-ahead time series prediction. In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 765-774, Singapore, 2006.

[46] H. B. Cheng, P. N. Tan, and R. Jin. Localized support vector machine and its efficient algorithm. In Proceedings of the 7th SIAM Conference on Data Mining, 2007.

[47] H. B. Cheng, P. N. Tan, C. Potter, and S. Klooster. Data mining for visual exploration and detection of ecosystem disturbances. In Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Irvine, CA, 2008.

[48] H. B. Cheng, P. N. Tan, C. Potter, and S. Klooster. A robust graph-based algorithm for detection and characterization of anomalies in noisy multivariate time series. In Proceedings of the 8th IEEE International Conference on Data Mining Workshop on Spatial and Spatiotemporal Data Mining, Irvine, CA, 2008.

[49] T. Cheng and Z. Li. A hybrid approach to detect spatial-temporal outliers. Geoinformatics, pages 173-178, 2004.

[50] S. Chu, E. Keogh, D. Hart, and M. Pazzani. Iterative deepening dynamic time warping for time series. In Proceedings of the 2nd SIAM International Conference on Data Mining, 2002.

[51] I. Cohen, N. Sebe, F. G. Cozman, M. C. Cirelo, and T. S. Huang. Semi-supervised learning of classifiers: Theory and algorithms for Bayesian network classifiers and applications to human-computer interaction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(12):1553-1566, December 2004.
[52] P. Compieta, S. Di Martino, M. Bertolotto, F. Ferrucci, and T. Kechadi. Exploratory spatio-temporal data mining and visualization. Journal of Visual Languages and Computing, 18(3):255-279, 2007.

[53] C. Cortes and M. Mohri. On transductive regression. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems, British Columbia, Canada, 2006.

[54] P. Cortez and A. Morais. A data mining approach to predict forest fires using meteorological data. In Proceedings of the 13th Portuguese Conference on Artificial Intelligence, pages 512-523, Guimaraes, Portugal, 2007.

[55] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 36:21-27, 1967.

[56] F. G. Cozman, I. Cohen, and M. Cirelo. Semi-supervised learning of mixture models. In Proceedings of the 20th International Conference on Machine Learning, pages 99-106, Washington, DC, USA, 2003.

[57] F. G. Cozman and I. Cohen. Unlabeled data can degrade classification performance of generative classifiers. In Proceedings of the 15th International Florida Artificial Intelligence Society Conference, pages 327-331, Pensacola Beach, Florida, USA, 2002.

[58] N. Cressie. Statistics for Spatial Data. Wiley, New York, 1993.

[59] M. A. Crimmins. Synoptic climatology of extreme fire-weather conditions across the southwest United States. International Journal of Climatology, 26:1001-1016, 2006.

[60] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. In Proceedings of the 16th Annual Conference on Neural Information Processing Systems, pages 367-373, British Columbia, Canada, 2002.

[61] T. Darrell and A. Pentland. Space-time gestures. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 335-340, June 1993.

[62] G. Das, D. Gunopulos, and H. Mannila. Finding similar time series. In Proceedings of the 1st European Symposium on Principles of Data Mining and Knowledge Discovery, pages 88-100, London, UK, 1997. Springer-Verlag.

[63] D. Dasgupta and S. Forrest. Novelty detection in time series data using ideas from immunology. In Proceedings of the 5th International Conference on Intelligent Systems, pages 82-87, Reno, Nevada, USA, 1996.

[64] M. D'Auria, M. Nanni, and D. Pedreschi. Time-focused density-based clustering of trajectories of moving objects. In Proceedings of the Workshop on Mining Spatio-Temporal Data, 2005.

[65] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation, pages 137-150, San Francisco, CA, December 2004.

[66] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-38, 1977.

[67] Denny, G. J. Williams, and P. Christen. Exploratory hot spot profile analysis using interactive visual drill-down self-organizing maps. In Proceedings of the 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 536-543, Osaka, Japan, 2008. Springer Berlin Heidelberg.

[68] M. Dillon. Search for efficient local kriging algorithm. The Statistician, 41(3):383, 1992.

[69] C. Domeniconi, D. Gunopulos, and J. Peng. Large margin nearest neighbor classifiers. IEEE Transactions on Neural Networks, 16(4):899-909, 2005.
ERSA conference papers ersa06p284, European Regional Science Association, 2006.
[71] T. S. Dye, C. P. MacDonald, and C. B. Anderson. Guideline for developing an ozone forecasting program. Technical report, EPA-454/R-99-009, 1999.
[72] D. Eads, D. Hill, S. Davis, S. Perkins, J. Ma, R. Porter, and J. Theiler. Genetic algorithms and support vector machines for time series classification. In Proceedings of the SPIE, volume 4787, pages 74–85, 2002.
[73] J. L. Elman. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7:195–224, 1991.
[74] P. Embrechts, C. Kluppelberg, and T. Mikosch. Modelling Extremal Events for Insurance and Finance. Springer Verlag, Berlin, Germany, 1997.
[75] W. Enke and A. Spekat. Downscaling climate model outputs into local and regional weather elements by classification and regression. Climate Research, 8(3):195–207, 1997.
[76] J. Ernst, G. J. Nau, and Z. Bar-Joseph. Clustering short time series gene expression data. Bioinformatics, 21(1):159–168, 2005.
[77] A. K. Ersboll and B. K. Ersboll. On spatio-temporal kriging. In Proceedings of the 3rd Annual Conference of the International Association for Mathematical Geology, pages 617–622, Barcelona, Spain, 1997. CIMNE.
[78] M. Ester. Data mining tasks and methods: spatial analysis. Handbook of data mining and knowledge discovery, pages 409–418, 2002.
[79] M. Ester, H. P. Kriegel, J. Sander, and X. W. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226–231, Portland, Oregon, USA, 1996.
[80] M. Ester, H. P. Kriegel, and J. Sander. Algorithms and applications for spatial data mining. Geographic Data Mining and Knowledge Discovery, Research Monographs in GIS, 2001.
[81] M. Ester, H. P. Kriegel, and X. W. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. In Proceedings of the 4th International Symposium on Large Spatial Databases, pages 67–82, Portland, Maine, 1995. Springer.
[82] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, June 2006.
[83] A. S. Fotheringham, C. Brunsdon, and M. E. Charlton. Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. Chichester: Wiley, 2002.
[84] H. Frohlich and A. Zell. Efficient parameter selection for support vector machines in classification and regression via model-based global optimization. Proceedings of the IEEE International Joint Conference on Neural Networks, 3:1431–1436, July 2005.
[85] K. Fujinaga, M. Nakai, H. Shimodaira, and S. Sagayama. Multiple-regression hidden markov model. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 513–516, Salt Lake City, Utah, USA, 2001.
[86] S. J. Gaffney, A. W. Robertson, P. Smyth, S. J. Camargo, and M. Ghil. Probabilistic clustering of extratropical cyclones using regression mixture models. Climate Dynamics, 29:423–440, 2007.
[87] P. Galeano, D. Pena, and R. S. Tsay. Outlier detection in multivariate time series via projection pursuit. Statistics and Econometrics Working Papers ws044211, Universidad Carlos III, Departamento de Estadística y Econometría, 2004.
[88] M. Garofalakis, R. Rastogi, and K. Shim. Mining sequential patterns with regular expression constraints. IEEE Transactions on Knowledge and Data Engineering, 14(3):530–552, 2002.
[89] N. Gilardi and S. Bengio. Local machine learning models for spatial data analysis. Geographic Information and Decision Analysis, 4(1):11–28, 2000.
[90] C. L. Giles, S. Lawrence, and A. C. Tsoi. Noisy time series prediction using a recurrent neural network and grammatical inference. Machine Learning, 44(1-2):161–183, 2001.
[91] C. L. Giles, S. Lawrence, and A. C. Tsoi. Noisy time series prediction using a recurrent neural network and grammatical inference. Machine Learning, 44(1/2):161–183, July/August 2001.
[92] B. Gold and N. Morgan. Speech and Audio Signal Processing: Processing and Perception of Speech and Music. John Wiley & Sons, Inc., New York, NY, USA, 1999.
[93] J. G. Goldammer. Global fire monitoring center, an overview on forest fires in cuba. http://www.fire.uni-freiburg.de/iffn/country/cu/cu_1.htm#top.
[94] T. H. Grubesic and A. T. Murray. Detecting hot spots using cluster analysis and gis. In Proceedings of the 5th Annual Conference of the Crime Mapping Research Center, Dallas, TX, 2001.
[95] R. Haining. Spatial Data Analysis in the Social and Environmental Sciences. Cambridge University Press, 1993.
[96] R. Haining. Spatial Data Analysis: Theory and Practice. Cambridge University Press, 2003.
[97] D. J. Hand and V. Vinciotti. Choosing k for two-class nearest neighbor classifiers with unbalanced classes. Pattern Recognition Letters, 24(9-10):1555–1562, 2003.
[98] J. Haslett, R. Bradley, P. Craig, A. Unwin, and G. Wills. Dynamic graphics for exploring spatial data with application to locating global and local anomalies. The American Statistician, 45(3):234–242, 1991.
[99] T. Hastie and C. Loader. Local regression: Automatic kernel carpentry. Statistical Science, 8(2):120–143, 1993.
[100] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):607–616, 1996.
[101] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: Data mining, inference and prediction. New York: Springer-Verlag, 2001.
[102] K. Hechenbichler and K. P. Schliep. Weighted k-nearest-neighbor techniques and ordinal classification. Technical report, Ludwig-Maximilians University, Munich, 2006.
[103] T. Hengl, G. B. M. Heuvelink, and D. G. Rossiter. About regression-kriging: From equations to case studies. Computers and Geosciences, 33:1301–1315, Oct 2007.
[104] L. J. Herrera, H. Pomares, I. Rojas, A. Guillén, A. Prieto, and O. Valenzuela. Recursive prediction for long term time series forecasting using advanced models. Neurocomputing, 70(16-18):2870–2880, 2007.
[105] B. C. Hewitson and R. G. Crane. Climate downscaling techniques and applications. Climate Research, 7:85–95, 1996.
[106] F. M. Hoffman, W. W. Hargrove, and A. D. Del Genio. Multivariate spatio-temporal clustering of time-series data: An approach for diagnosing cloud processes and understanding arm site representativeness. In Proceedings of the 13th Atmospheric Radiation Measurement (ARM) Science Team Meeting, Extended Abstracts, Broomfield, Colorado, USA, 2003.
[107] W. C. Hong, P. F. Pai, S. L. Yang, and R. Theng. Highway traffic forecasting by support vector regression model with tabu search algorithms. In Proceedings of the International Joint Conference on Neural Networks, pages 1617–1621, Vancouver, British Columbia, Canada, 2006.
[108] C. Hsu and C. Lin. A comparison of methods for multi-class support vector machines.
Technical report, Department of Computer Science and Information Engineering, National Taiwan University, 2001.
[109] W. Hsu, J. Dai, and M. L. Lee. Mining viewpoint patterns in image databases. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 553–558, Washington, DC, USA, 2003.
[110] R. Hyndman. Time series data library. http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/.
[111] Z. Ji and D. Dasgupta. Applicability issues of the real-valued negative selection algorithms. In Proceedings of the 8th annual conference on Genetic and evolutionary computation, pages 111–118, New York, NY, USA, 2006. ACM.
[112] T. Joachims. Transductive inference for text classification using support vector machines. In Ivan Bratko and Saso Dzeroski, editors, Proceedings of the 16th International Conference on Machine Learning, pages 200–209, Bled, SL, 1999. Morgan Kaufmann Publishers, San Francisco, US.
[113] P. D. Jones and A. Moberg. Hemispheric and large-scale surface air temperature variations: An extensive revision and an update to 2001. Journal of Climate, 16:206–223, 2003.
[114] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. An efficient k-means clustering algorithm: analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):881–892, 2002.
[115] G. Karypis, E. H. Han, and V. Kumar. Chameleon: hierarchical clustering using dynamic modeling. Computer, 32(8):68–75, Aug 1999.
[116] L. Kaufman and P. J. Rousseeuw. Finding groups in data: an introduction to cluster analysis. New York: Wiley, 1990.
[117] B. Kazar, S. Shekhar, and D. J. Lilja. Parallel formulation of spatial auto-regression. Technical report, Army High-Performance Computing Research Center, 2003.
[118] B. Kedem and K. Fokianos. Regression Models for Time Series Analysis. Wiley-Interscience, 2002.
[119] M. B. Kennel, R. Brown, and H. D. I. Abarbanel. Determining embedding dimension for phase-space reconstruction using a geometrical construction. Physical Review A (Atomic, Molecular, and Optical Physics), 45(6):3403–3411, 1992.
[120] E. Keogh and T. Folias. Ucr time series data mining archive. http://www.cs.ucr.edu/~eamonn/TSDMA/index.html, 2002.
[121] E. Keogh, J. Lin, and A. Fu. Hot sax: Efficiently finding the most unusual time series subsequence. In Proceedings of the 5th IEEE International Conference on Data Mining, pages 226–233, Houston, Texas, USA, 2005.
[122] E. Keogh, S. Lonardi, and B. Chiu. Finding surprising patterns in a time series database in linear time and space. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 550–556, Edmonton, Alberta, Canada, 2002.
[123] E. J. Keogh and M. J. Pazzani. Scaling up dynamic time warping to massive datasets. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1–11, San Diego, California, USA, 1999. Springer.
[124] R. A. Kerr. Climate prediction: Signs of success in forecasting el nino. Science, 262:497, 2002.
[125] S. Kiritchenko, S. Matwin, and S. Abu-Hakima. Email classification with temporal features. Proceedings of Intelligent Information Systems, New Trends in Intelligent Information Processing and Web Mining, pages 523–534, 2004.
[126] K. Koperski, J. Adhikary, and J. W. Han. Spatial data mining: Progress and challenges.
In Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pages 1–10, 1996.
[127] K. Koperski and J. W. Han. Discovery of spatial association rules in geographic information databases. In Proceedings of the 4th International Symposium on Advances in Spatial Databases, pages 47–66, Portland, Maine, 1995. Springer-Verlag.
[128] K. Koperski, J. Han, and N. Stefanovic. An efficient two-step method for classification of spatial data. In Proceedings of the 8th International Symposium on Spatial Data Handling, pages 45–54, Vancouver, Canada, 1998.
[129] M. Kulldorff. A spatial scan statistic. Communications in Statistics: Theory and Methods, 26(6):1481–1496, 1997.
[130] J. K. Kumar. An application of spatial prediction using a fuzzy-neural network. Proceedings of the International Joint Conference on Neural Networks, 6:4241–4246, Jul 1999.
[131] A. Lakhina, M. Crovella, and C. Diot. Characterization of network-wide anomalies in traffic flows. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, pages 201–206, Taormina, Sicily, Italy, 2004. ACM.
[132] K. W. Lau and Q. H. Wu. Local prediction of non-linear time series using support vector regression. Pattern Recognition, 41(5):1556–1564, 2008.
[133] M. H. Law and J. T. Kwok. Rival penalized competitive learning for model-based sequence clustering. In Proceedings of the 15th International Conference on Pattern Recognition, volume 02, page 2195, Los Alamitos, California, USA, 2000. IEEE Computer Society.
[134] M. B. Lawrence. Tropical cyclone report - tropical storm lorenzo, national hurricane center. http://www.nhc.noaa.gov/2001iris.html, 2001.
[135] S. Laxman and P. S. Sastry. A survey of temporal data mining. Sadhana, 31(2):173–198, 2006.
[136] S. Laxman, P. S. Sastry, and K. P. Unnikrishnan. A fast algorithm for finding frequent episodes in event streams. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 410–419, San Jose, California, USA, 2007. ACM.
[137] S. T. Li, S. W. Chou, and J. J. Pan. Multi-resolution spatio-temporal data mining for the study of air pollutant regionalization. In Proceedings of the 33rd Annual Hawaii International Conference on System Sciences, page 7, Hawaii, USA, 2000.
[138] B. Liu and J. Liu. Multivariate time series prediction via temporal classification. In Proceedings of the 18th International Conference on Data Engineering, page 268, San Jose, CA, USA, 2002.
[139] C. T. Lu, D. Chen, and Y. Kou. Algorithms for spatial outlier detection. In Proceedings of the 3rd IEEE International Conference on Data Mining, Nov. 2003.
[140] C. T. Lu and L. R. Liang. Wavelet fuzzy classification for detecting and tracking region outliers in meteorological data. In Proceedings of the 12th annual ACM international workshop on Geographic information systems, pages 258–265, New York, NY, USA, 2004. ACM.
[141] M. Mahoney and P. Chan. Trajectory boundary modeling of time series for anomaly detection. In Proceedings of the KDD Workshop on Data Mining Methods for Anomaly Detection, pages 32–40, 2005.
[142] D. Malerba, F. Esposito, and F. A. Lisi. Mining spatial association rules in census data: a relational approach. In Proceedings of the ECML/PKDD workshop on mining official data, pages 80–93. University Printing House, 2002.
[143] N. Mamoulis, H. P. Cao, G. Kollios, M. Hadjieleftheriou, Y. F. Tao, and D. W. Cheung. Mining, indexing, and querying historical spatiotemporal data.
In Proceedings of the 10th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 236–245, Seattle, WA, USA, 2004. ACM.
[144] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259–289, 1997.
[145] F. Melgani and S. B. Serpico. A markov random field approach to spatio-temporal contextual image classification. IEEE Transactions on Geoscience and Remote Sensing, 41(11):2478–2487, 2003.
[146] J. Mennis and J. W. Liu. Mining association rules in spatio-temporal data: An analysis of urban socioeconomic and land cover change. Transactions in GIS, pages 5–17, 2005.
[147] T. C. Mills. Time Series Techniques for Economists. Cambridge University Press, 1990.
[148] H. D. K. Moonesinghe and P. N. Tan. Outlier detection using random walks. In Proceedings of the 18th IEEE International Conference on Tools with Artificial Intelligence, pages 532–539, Washington, DC, USA, January 2006.
[149] Y. Morimoto. Mining frequent neighboring class sets in spatial databases. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 353–358, San Francisco, California, 2001. ACM.
[150] R. Munro, S. Chawla, and R. Sun. Complex spatial relationships. Proceedings of the 3rd IEEE International Conference on Data Mining, 3:227, 2003.
[151] V. S. Nalwa. Automatic on-line signature verification. In Proceedings of the 3rd Asian Conference on Computer Vision, volume 1, pages 10–15, London, UK, 1997. Springer-Verlag.
[152] M. Nanni. Clustering methods for spatio-temporal data. Technical report, PhD thesis, CS Department, University of Pisa, 2002.
[153] M. A. Nascimento and J. R. O. Silva. Towards historical r-trees. In Proceedings of the ACM symposium on Applied Computing, pages 235–240, Atlanta, Georgia, USA, 1998.
[154] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases. http://archive.ics.uci.edu/ml/, 1998.
[155] R. T. Ng and J. W. Han. Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th International Conference on Very Large Data Bases, pages 144–155, Santiago de Chile, Chile, 1994.
[156] V. Niennattrakul and C. A. Ratanamahatana. On clustering multimedia time series data using k-means and dynamic time warping. In Proceedings of the International Conference on Multimedia and Ubiquitous Engineering, pages 733–738, April 2007.
[157] K. R. Muller, A. J. Smola, G. Ratsch, B. Scholkopf, J. Kohlmorgen, and V. Vapnik. Predicting time series with support vector machines. In Proceedings of the 7th International Conference on Artificial Neural Networks, volume 1327, pages 999–1004, Lausanne, Switzerland, 1997. Springer LNCS.
[158] A. Ober-Sundermeier and H. Zackor. Prediction of congestion due to road works on freeways. In Proceedings of the IEEE Intelligent Transportation Systems, pages 240–244, Oakland, California, USA, 2001.
[159] S. Olga and D. Liu. Visual interactive clustering and querying of spatio-temporal data. In Proceedings of the International conference on computational science and its applications, page 1362, Perugia, Italy, 2005.
[160] S. Ozden and R. A. Silberschatz. Cyclic association rules. In Proceedings of the 14th International Conference on Data Engineering, pages 412–421, Orlando, Florida, USA, 1998. IEEE Computer Society.
[161] S. L. Ozesmi and U. Ozesmi.
An artificial neural network approach to spatial habitat modelling with interspecific interaction. Ecological Modelling, 116(1):15–31, March 1999.
[162] R. K. Pace, R. Barry, J. M. Clapp, and M. Rodriquez. Spatiotemporal autoregressive models of neighborhood effects. Journal of Real Estate Finance and Economics, 17(1):15–33, July 1998.
[163] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford University, 1998.
[164] D. Papadias, Y. F. Tao, P. Kalnis, and J. Zhang. Indexing spatio-temporal data warehouses. In Proceedings of the 18th International Conference on Data Engineering, page 166, Washington, DC, USA, 2002. IEEE Computer Society.
[165] M. Pawlak and M. F. Yat Fung Ng. On kernel and radial basis function techniques for classification and function recovering. In Proceedings of the 12th International Conference on Pattern Recognition, volume 2, pages 454–456, Jerusalem, Israel, 1994.
[166] J. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin dags for multiclass classification. Proceedings of the 14th Annual Conference on Neural Information Processing Systems, 39(1):547–553, 2000.
[167] D. Pokrajac and Z. Obradovic. Combining regressive and auto-regressive models for spatio-temporal prediction. In Proceedings of the 17th International Machine Learning Workshop on Spatial Knowledge, 2000.
[168] D. Pokrajac and Z. Obradovic. Improved spatial-temporal forecasting through mining. In Proceedings of the 1st SIAM International Conference on Data Mining (SDM), Chicago, USA, 2001.
[169] T. D. Popescu. Multivariate time series forecasting using independent component analysis. In Proceedings of the 9th IEEE Conference on Emerging Technologies and Factory Automation, pages 782–789, Lisbon, Portugal, 2003.
[170] C. Potter, P. N. Tan, M. Steinbach, S. Klooster, V. Kumar, R. Myneni, and V. Genovese. Major disturbance events in terrestrial ecosystems detected using global satellite data sets. Global Change Biology, 9(7):1005–1021, 2003.
[171] C. Potter, P. N. Tan, M. Steinbach, S. Klooster, V. Kumar, R. Myneni, and V. Genovese. Major disturbance events in terrestrial ecosystems detected using global satellite data sets. Global Change Biology, 9(7):1005–1021, 2003.
[172] C. S. Potter, P. N. Tan, V. Kumar, C. Kucharik, S. Klooster, V. Genovese, W. Cohen, and S. Healey. Recent history of large-scale ecosystem disturbances in north america derived from the avhrr satellite record. Ecosystems, 8:808–824, 2005.
[173] C. S. Potter, P. N. Tan, M. Steinbach, V. Kumar, S. Klooster, R. Myneni, and V. Genovese. Major disturbance events in terrestrial ecosystems detected using global satellite data sets. Global Change Biology, 9(7):1005–1021, 2003.
[174] L. Rabiner and B. H. Juang. Fundamentals of speech recognition. Prentice-Hall, Inc., Upper Saddle River, New Jersey, USA, 1993.
[175] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th Annual International Conference on Machine Learning, volume 227, pages 759–766. ACM, 2007.
[176] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. SIGMOD Record, 29(2):427–438, 2000.
[177] C. E. Rasmussen. K nearest neighbors for regression knn-cv-1, 1996.
[178] R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141, 2004.
[179] I. Rish. An empirical study of the naive bayes classifier.
In Proceedings of the 17th International Joint Conference on Artificial Intelligence Workshop on Empirical Methods in Artificial Intelligence, pages 41–48, Seattle, Washington, USA, 2001.
[180] P. P. Rodrigues, J. Gama, and J. P. Pedroso. Hierarchical clustering of time-series data streams. IEEE Transactions on Knowledge and Data Engineering, 20(5):615–627, May 2008.
[181] R. A. Rohde. Instrumental temperature record. http://www.globalwarmingart.com/wiki/Image:Instrumental_Temperature_Record_png.
[182] R. A. Rohde. Global warming predictions. http://www.globalwarmingart.com/wiki/Image:Global_Warming_Predictions_png, 2001.
[183] J. Sander, M. Ester, H. P. Kriegel, and X. W. Xu. Density-based clustering in spatial databases: The algorithm gdbscan and its applications. Data Mining and Knowledge Discovery, 2(2):169–194, 1998.
[184] A. Schrijver. Theory of Linear and Integer Programming. John Wiley and Sons, 1998.
[185] P. Sebastiani, M. Ramoni, P. Cohen, J. Warwick, and J. Davis. Discovering dynamics using bayesian clustering. In Proceedings of the 3rd International Symposium on Intelligent Data Analysis. Springer, 1999.
[186] S. Shekhar and Y. Huang. Discovering spatial co-location patterns: A summary of results. Advances in Spatial and Temporal Databases, pages 236–256, 2001.
[187] S. Shekhar, C. T. Lu, and P. Zhang. A unified approach to spatial outliers detection. Technical report, Department of Computer Science and Engineering, University of Minnesota, 2003.
[188] S. Shekhar, C. T. Lu, and P. S. Zhang. Detecting graph-based spatial outliers: algorithms and applications (a summary of results). In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining, pages 371–376. Press, 2001.
[189] S. Shekhar, P. R. Schrater, R. R. Vatsavai, W. Wu, and S. Chawla. Spatial contextual classification and prediction models for mining geospatial data. IEEE Transactions on Multimedia, 4:174–188, 2002.
[190] S. Shekhar, P. S. Zhang, Y. Huang, and R. Vatsavai. Trends in spatial data mining. Data Mining: Next Generation Challenges and Future Directions, 2003.
[191] A. J. Smola and B. Schoelkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199–222, 1998.
[192] T. Starner and A. Pentland. Visual recognition of american sign language using hidden markov models. In Proceedings of the IEEE International Workshop on Automatic Face and Gesture Recognition, pages 189–194, 1995.
[193] M. Steinbach, P. N. Tan, V. Kumar, S. Klooster, and C. Potter. Discovery of climate indices using clustering. In Proceedings of the 9th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 44–55, Washington, DC, USA, 2003.
[194] B. J. Stocks, J. A. Mason, J. B. Todd, E. M. Bosch, B. M. Wotton, B. D. Amiro, M. D. Flannigan, K. G. Hirsch, K. A. Logan, D. L. Martell, and W. R. Skinner. Large forest fires in canada. Journal of Geophysical Research, 108(D1, 8149):1959–1997, 2002.
[195] A. Subasi, A. Alkan, E. Koklukaya, and M. K. Kiymik. Wavelet neural network classification of eeg signals by using ar model with mle preprocessing. Neural Networks, 18(7):985–997, 2005.
[196] Y. X. Sun, K. Ma, X. Jin, X. Pu, and W. Gao. Detecting spatio-temporal outliers in climate dataset: A method study. International Geoscience And Remote Sensing Symposium, 2:760, 2005.
[197] U.S. Geological Survey. Antelope fire, Sioux falls, south dakota usa, august 2002. ftp://edcftp.cr.usgs.gov/pub/data/landcover/files/2002/wupa/ante02aJneta.pdf.
[198] Y. Tao, G. Kollios, J. Considine, F. Li, and D. Papadias. Spatio-temporal aggregation using sketches. In Proceedings of the International Conference on Data Engineering, pages 214–226, Boston, USA, 2004.
[199] Q. Tian, J. Yu, Q. Xue, and N. Sebe. A new analysis of the value of unlabeled data in semi-supervised learning for image retrieval. In Proceedings of the IEEE International Conference on Multimedia and Expo, pages 1019–1022, Taipei, Taiwan, 2004.
[200] Y. Tian, Y. Zhang, Y. Knyazikhin, R. B. Myneni, and S. W. Running. Prototyping of modis lai/fpar algorithm with lasur and landsat data. IEEE Transactions on Geoscience and Remote Sensing, 38(5):2387–2401, 2000.
[201] C. Tilke. Variowin - software for spatial data analysis in 2d. Computational Statistics and Data Analysis, 25(2):243–244, July 1997.
[202] J. Tilton. Image segmentation by region growing and spectral clustering with a natural convergence criterion. In Proceedings of the IEEE International Symposium on Geoscience and Remote Sensing, pages 1766–1768, Seattle, Washington, USA, 1998.
[203] P. Tino, C. Schittenkopf, and G. Dorffner. Volatility trading via temporal pattern recognition in quantized financial time series. Pattern Analysis and Applications, 4(4):283–299, 2001.
[204] F. Verhein. k-stars: Sequences of spatio-temporal association rules. In Proceedings of the 6th IEEE International Conference on Data Mining, pages 387–394, Los Alamitos, CA, USA, 2006. IEEE Computer Society.
[205] F. Verhein and S. Chawla. Mining spatio-temporal association rules, sources, sinks, stationary regions and thoroughfares in object mobility databases. In Proceedings of the 11th International Conference on Database Systems for Advanced Applications, Springer Lecture Notes in Computer Science, pages 187–201, 2006.
[206] P. Vincent and Y. Bengio. K-local hyperplane and convex distance nearest neighbor algorithms. Proceedings of the 16th Annual Conference on Neural Information Processing Systems, pages 985–992, 2002.
[207] N. Viovy. Automatic classification of time series (acts): A new clustering method for remote sensing time series. International Journal of Remote Sensing, 45(3):191–200, 2000.
[208] J. Wang and J. Han. Bide: efficient mining of frequent closed sequences. In Proceedings of the 20th International Conference on Data Engineering, pages 79–90, Boston, USA, 2004.
[209] J. Wang, W. Hsu, and M. L. Lee. Flowminer: finding flow patterns in spatio-temporal databases. In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence, pages 14–21, Nov. 2004.
[210] M. Wang, X. S. Hua, Y. Song, L. R. Dai, and H. J. Zhang. Semi-supervised kernel regression. In Proceedings of the 6th International Conference on Data Mining, pages 1130–1135, Washington, DC, USA, 2006.
[211] L. Wei and E. Keogh. Semi-supervised time series classification. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 748–753, Philadelphia, PA, USA, 2006. ACM.
[212] L. Wei, N. Kumar, V. N. Lolla, E. Keogh, and S. Lonardi. Assumption-free anomaly detection in time series. In Proceedings of the 17th International Scientific and Statistical Database Management Conference, pages 237–240, Santa Barbara, CA, 2005.
[213] L. Wei, N. Kumar, V. N. Lolla, E. Keogh, and S. Lonardi. Outliers detection in multivariate time series by independent component analysis. Neural Computation, pages 1962–1984, 2007.
[214] R. L. Wilby, S. P.
Charles, E. Zorita, B. Timbal, P. Whetton, and L. O. Mearns. Guidelines for use of climate scenarios developed from statistical downscaling methods. Available from the DDC of IPCC TGCIA, 2004.
[215] C. C. Wong, M. C. Chan, and C. C. Lam. Financial time series forecasting by neural network using conjugate gradient learning algorithm and multiple linear regression weight initialization. Technical Report 61, Society for Computational Economics, 2000.
[216] C. Wu, M. Berry, S. Shivakumar, and J. McLarty. Neural networks for full-scale protein sequence classification: Sequence encoding with singular value decomposition. Machine Learning, 21(1-2):177–193, 1995.
[217] P. C. Wu and T. G. Dietterich. Improving svm accuracy by training on auxiliary data sources. In Proceedings of the 21st Annual International Conference on Machine Learning, pages 871–878, Banff, Alberta, Canada, 2004.
[218] H. Xiong, S. Shekhar, Y. Huang, V. Kumar, X. Ma, and J. Yoo. A framework for discovering co-location patterns in data sets with extended spatial objects. In Proceedings of the 4th SIAM International Conference on Data Mining, pages 80–93, Lake Buena Vista, Florida, USA, 2004.
[219] Y. M. Xiong and D. Y. Yeung. Mixtures of arma models for model-based time series clustering. In Proceedings of the IEEE International Conference on Data Mining, pages 717–720, Maebashi City, Japan, 2002.
[220] K. Yamanishi and J. Takeuchi. A unifying framework for detecting outliers and change points from non-stationary time series data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 320–324, Edmonton, Alberta, Canada, 2002.
[221] Y. Yuan and M. J. Shaw. Induction of fuzzy decision trees. Fuzzy Sets and Systems, 69:125–139, 1995.
[222] K. Zeitouni and N. Chelghoum. Spatial decision tree-application to traffic risk analysis. In Proceedings of the ACS/IEEE International Conference on Computer Systems and Applications, volume 0, page 0203, Los Alamitos, CA, USA, 2001. IEEE Computer Society.
[223] H. Zhang, A. C. Berg, M. Maire, and J. Malik. Svm-knn: discriminative nearest neighbor for visual object recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 2126–2136, New York, NY, USA, 2006.
[224] K. Zhang and W. Fan. Forecasting skewed biased stochastic ozone days: analysis, solutions and beyond. Knowledge and Information Systems, 14(3):753–764, 2008.
[225] T. Zhang, R. Ramakrishnan, and M. Livny. Birch: an efficient data clustering method for very large databases. SIGMOD Record, 25(2):103–114, 1996.
[226] X. Zhang, N. Mamoulis, D. W. Cheung, and Y. Shou. Fast mining of spatial collocations. In Proceedings of the 10th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 384–393, Seattle, Washington, USA, 2004. ACM.
[227] D. Y. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf. Learning with local and global consistency. In Proceedings of the 17th Annual Conference on Neural Information Processing Systems, volume 16, pages 321–328, British Columbia, Canada, 2003. MIT Press.
[228] Z. H. Zhou and M. Li. Semi-supervised regression with co-training. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, pages 908–913, Edinburgh, Scotland, UK, 2005.
[229] X. Zhu. Semi-supervised learning literature survey. Technical report, Computer Sciences, University of Wisconsin-Madison, 2005.
[230] X. Zhu, Z. Ghahramani, and J. Lafferty.
Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning, volume 20, pages 912–919, Melbourne, Florida, USA, 2003.
[231] X. J. Zhu and A. Goldberg. Kernel regression with order preferences. In Proceedings of the 22nd Conference on Artificial Intelligence, page 681, Vancouver, British Columbia, Canada, 2007.