FUNCTIONAL DATA ANALYSIS WITH APPLICATION TO TRAFFIC FLOW DATA

By

Yi-Chen Zhang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics – Doctor of Philosophy

2018

ABSTRACT

FUNCTIONAL DATA ANALYSIS WITH APPLICATION TO TRAFFIC FLOW DATA

By Yi-Chen Zhang

Functional data have become increasingly popular in the recent statistical literature, and considerable attention has been paid to the development of functional data analysis. This thesis consists of four main chapters that address some important questions arising from implementing functional principal component analysis (FPCA) in practice and gives answers to these questions. In Chapter 2, we investigate the problem of data preprocessing for functional data; we propose and analyze a nonparametric functional data approach to missing value imputation and outlier detection. In Chapter 3, a functional naive Bayes classifier is proposed for functional data, which provides a surrogate density estimation for functional random variables and thereby makes a direct extension of density-based classical multivariate classification approaches to functional data classification possible. In Chapter 4, we merge the two ideas of functional classification and functional prediction to develop a dynamical prediction for functional data; the proposed functional mixture prediction approach combines a functional linear model with the functional naive Bayes classifier. In Chapter 5, we suggest a two-step segmentation procedure to estimate both the number and the locations of the mean change-points of a functional sequence. Finally, the thesis concludes with a brief discussion of future research directions.

Keywords: Functional data analysis; missing value imputation; outlier detection; functional classification; naive Bayes classifier; functional prediction; functional change-points

Copyright by YI-CHEN ZHANG 2018

This thesis is dedicated to my wonderful wife and loving parents, the hidden strength behind my every success.

ACKNOWLEDGEMENTS

I would like to begin by thanking my advisor, Dr. Lyudmila Sakhanenko, to whom I wish to express my deepest gratitude. Thanks to her guidance, patience, and thoughtfulness, this work has been made possible. Thank you.

I wish to extend my sincere gratitude to my thesis committee members, Drs. David Zhu, Ping-Shou Zhong, and Yuying Xie, for their time and interest. Thanks to Dr. Ping-Shou Zhong for his comments and suggestions on the research work in Chapter 3 and to Dr. David Zhu for his data and discussion on the future work in Chapter 6.

My sincere thanks go to the staff at the Department of Statistics and Probability, whose kindness provided a very enjoyable study environment. I also appreciate the help and support of the Department of Statistics and Probability over the last five years.

Most importantly, I would like to thank Yi-Ru for being a supportive, patient, and wonderful wife through this entire journey. Her positive attitude and encouragement have always been an inspiration and have given me much needed strength and confidence, especially during the last stages of my Ph.D. training.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
  1.1 Motivation
  1.2 Functional Principal Component Analysis
  1.3 Thesis Outline
CHAPTER 2 DATA PREPROCESSING OF FUNCTIONAL DATA
  2.1 Missing Values and Outliers in Functional Data
    2.1.1 Patterns of Missing Values
    2.1.2 Patterns of Outlying Curves
  2.2 Functional Principal Components Analysis
    2.2.1 Missing Value Imputation by Functional Principal Component Models
    2.2.2 Visualization Tools for Outlier Detection
  2.3 Simulation Study and Data Application
    2.3.1 Simulation Study
      2.3.1.1 Performance Comparison in Missing Value Imputation
      2.3.1.2 Performance Comparison in Outlier Detection
    2.3.2 Data Application
      2.3.2.1 Functional Principal Component Analysis
      2.3.2.2 Outlier Detection Results
  2.4 Conclusion
CHAPTER 3 THE NAIVE BAYES CLASSIFIER FOR FUNCTIONAL DATA
  3.1 The Naive Bayes Classifier for Functional Data
    3.1.1 Model Estimation
  3.2 Theoretical Results
  3.3 Simulation Study and Data Application
    3.3.1 Simulation Study
    3.3.2 Data Applications
  3.4 Conclusions
CHAPTER 4 DYNAMIC FUNCTIONAL PREDICTION AND CLASSIFICATION
  4.1 Modeling traffic flow trajectories
    4.1.1 Functional Naive Bayes and Functional Probability Naive Bayes Classifier
    4.1.2 Estimation for Functional Naive Bayes and Functional Probability Bayes Classifier
  4.2 Functional Mixture Prediction
    4.2.1 Functional Linear Prediction Model
    4.2.2 Estimation for Functional Mixture Prediction
    4.2.3 Implementation algorithm of functional mixture prediction
    4.2.4 Bootstrap prediction intervals for functional mixture prediction
  4.3 A Real Data Application
    4.3.1 Analysis of Traffic Flow Patterns and Posterior Probabilities
    4.3.2 Mixture Prediction of Traffic Flow
  4.4 Conclusion
CHAPTER 5 IDENTIFYING MULTIPLE MEAN CHANGE-POINTS OF FUNCTIONAL SEQUENCE
  5.1 Functional Mean Change-Points
    5.1.1 Multiple Mean Change-Points Model
    5.1.2 Functional Principal Components
    5.1.3 Single Mean Change-Point Test
  5.2 Backward Recursive Least Squares Segmentation
    5.2.1 Recursive Least Squares Segmentation
    5.2.2 Backward Elimination
    5.2.3 Some Remarks of BRLSS
  5.3 Simulation Study and Real Data Application
    5.3.1 Simulation Study
      5.3.1.1 Performance for Data without Change-Point
      5.3.1.2 Performance for Data with Change-Points
    5.3.2 Data Application
  5.4 Conclusions
CHAPTER 6 CONCLUSIONS AND FUTURE DIRECTIONS
  6.1 Future Work
APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Performance comparison in terms of the RMSE mean, standard error (in parentheses) and average proportion of total variance explained [in brackets (%)] for FM, BPCA, PPCA and FPCA in the PM, IM and Mixed missing patterns based on 100 simulation replications.
Table 2.2: The AUC of the RMSE for PM, IM and Mixed PM/IM.
Table 2.3: Average standardized deviation of imputed values for the FM, BPCA, PPCA and FPCA methods in terms of the mean and standard error (in parentheses) based on 100 simulation replications.
Table 2.4: The sample means and standard errors (in parentheses) in percentages of p̂c and p̂f for the modified functional bagplot and the functional bagplot with 200 replications.
Table 2.5: The sample means and standard errors (in parentheses) of the percentages p̂c and p̂f for the modified functional HDR boxplot and the functional HDR boxplot with 200 replications.
Table 3.1: The sample means, standard errors (in parentheses) and average proportions of total variance explained [in brackets (%)] for misspecification rates in Model 1, Model 2 and Model 3 with 200 replications.
Table 3.2: The sample means, standard errors (in parentheses) and average proportions of total variance explained [in brackets (%)] for misspecification rates in Model 1, Model 2 and Model 3 with 200 replications. (Continued)
Table 3.3: The sample means, standard errors (in parentheses) and average proportions of total variance explained [in brackets (%)] for misclassification rates.
Table 4.1: Performance comparisons for FP, FNP, and FMP based on TMIPE (×10³) under various values of κ.
Table 5.1: Frequencies of correctly matched change-point number in 1000 no-change-point samples.
Table 5.2: Frequencies of correctly matched change-point number in 1000 one-change-point samples.
Table 5.3: Frequencies of correctly matched change-point number in 1000 two-change-point samples.
Table 5.4: Frequencies of matched number and exactly/roughly matched locations of change-points in 1000 FBE samples.
Table 5.5: Frequencies of matched number and exactly/roughly matched locations of change-points in 1000 ARH(1) samples.
Table 5.6: The sequentially selected most unlikely points and their Box's M statistics for vehicle volumes on southbound and northbound NH5 during May 10, 2010 and May 16, 2010. The asterisks denote significant tests under the critical value χ²(0.05, 3) = 7.8147. The bold numbers are the estimated change-points.

LIST OF FIGURES

Figure 2.1: Typical missing patterns of traffic flow data. (a) Point Missing (PM): the circles are imputed missing values. (b) Interval Missing (IM): the dotted lines are imputed missing intervals. (c) Mixed PM/IM.
Figure 2.2: Typical outliers of traffic flow rate trajectories. (a) Magnitude outlier; (b) shape outlier.
Figure 2.3: RMSE sample means as a function of missing ratios for the imputation methods, FM, BPCA, PPCA (with q = 1, 2, 4), and FPCA (with L = 1, 2, 4), based on 100 simulation replicates under the three missing patterns.
Figure 2.4: Boxplots of RMSE with different missing ratios p using the four methods for the three missing patterns based on 100 simulation replications.
Figure 2.5: Three models of synthetic observations with different patterns of outliers, including (a) Model 1 with symmetric magnitude outliers, (b) Model 2 with shape outliers, and (c) Model 3 with a harmonic signal process.
Figure 2.6: The modified bivariate bagplot (column 1), the modified functional bagplot (column 2), the bivariate bagplot (column 3), and the functional bagplot (column 4) for a sample of synthetic curves for Models 1, 2, and 3.
Figure 2.7: The modified bivariate HDR boxplot (column 1), the modified functional HDR boxplot (column 2), the bivariate HDR boxplot (column 3), and the functional HDR boxplot (column 4) for a sample of synthetic curves for Models 1, 2 and 3.
Figure 2.8: (a) Daily traffic flow trajectories superimposed on the estimated mean function, (b) the estimated covariance function, (c) the cumulative fraction of total variance explained by the leading FPCs, and (d) the estimated eigenfunctions for weekends (including holidays).
Figure 2.9: (a) Daily traffic flow trajectories superimposed on the estimated mean function, (b) the estimated covariance function, (c) the cumulative fraction of total variance explained by the leading FPCs, and (d) the estimated eigenfunctions for weekdays.
Figure 2.10: Three samples of daily traffic flow rate trajectories with the observations (dots in gray), the predicted trajectories (curves in blue) and the imputed missing values (dots or curves in red).
Figure 2.11: (a) The modified bivariate bagplot, (b) the modified functional bagplot, (c) the modified bivariate HDR boxplot (with α = 0.15), and (d) the modified functional HDR boxplot for outlier detection on weekends.
Figure 2.12: The modified bivariate HDR boxplots with α = 0.01, α = 0.05 and α = 0.1. Top panels: the modified bivariate HDR boxplot. Bottom panels: the modified functional HDR boxplot.
Figure 3.1: Example dataset generated from Model 1 for the simulation study.
Figure 3.2: Example dataset generated from Model 2 for the simulation study.
Figure 3.3: Example dataset generated from Model 3 for the simulation study.
Figure 3.4: Boxplots of misclassification error rates with different numbers of training samples n and testing samples m using the four methods for Model 1 based on 200 replications.
Figure 3.5: Boxplots of misclassification error rates with different numbers of training samples n and testing samples m using the four methods for Model 2 based on 200 replications.
Figure 3.6: Boxplots of misclassification error rates with different numbers of training samples n and testing samples m using the four methods for Model 3 based on 200 replications.
Figure 3.7: The original trajectories for the four functional datasets.
Figure 3.8: Boxplot of misclassification error rate for 10-fold cross-validation based on 200 replications.
Figure 4.1: Functional mixture prediction flow chart.
Figure 4.2: Overall and group-specific mean functions of the training data of daily traffic flow rates.
Figure 4.3: Estimated mean functions (left column) superimposed on the observed trajectories, covariance functions (middle column) and the corresponding eigenfunctions (right column) of groups 1–3 (from top to bottom) based on the training data of daily traffic flow trajectories.
Figure 4.4: The predicted group membership distribution for Groups 1–3 (plotted in blue, green, and red) as a function of the "current time" τ for samples from the test data based on the trajectories observed up to τ.
Figure 4.5: Performance comparisons for FP, FNP, and FMP, based on TMIPE, displayed as a function of ω from (a) κ = 1 to (k) κ = κ*.
Figure 4.6: Performance comparisons for FP, FNP, and FMP, based on TMIPE, displayed as a function of κ from (a) ω = 1 to (g) ω = ω*.
Figure 5.1: The trajectories of the total vehicle volumes on southbound National Highway No. 5 of Taiwan during May 14–16, 2010. The data were monitored by 84 electric detectors distributed along the highway, where each curve represents the daily record of a detector. The trajectories are identified as three groups with the segmentation method of this paper, where the red, green and blue lines represent detectors No. 1–70, 71–79 and 80–84, respectively.
Figure 5.2: The mean functions used in simulation. The black solid line is µ1(t), the red dotted line is µ2(t) and the blue dashed line is µ3(t).
Figure 5.3: Box plots of the estimated number of change-points in the length-100 no-change-point samples using the P2S initial partition. The results of FBE are plotted in the left panel and ARH(1) in the right. The X-axis represents the correlation categories and the Y-axis is the estimated number of change-points.
Figure 5.4: Box plots of the estimated number of change-points in length-100 samples with change-points using the P2S initial partition.
The upper panels are plots of 1-change-point data and the lower panels of 2-change-point data. The results of FBE are plotted in the left panel and ARH(1) in the right. The X-axis represents the correlation categories and the Y-axis is the estimated number of change-points. For each correlation category, different change-point settings are drawn from left to right in black ({0.15}/{0.15,0.50}), red ({0.50}/{0.15,0.80}) and green ({0.80}/{0.50,0.80}), respectively.
Figure 5.5: The sequential plots of the first FPC scores of the vehicle volumes. The southbound scores are in the upper panel and the northbound scores in the lower panel. From left to right are May 14, 15 and 16. The dashed lines are the estimated locations of the change-points.
Figure 6.1: Kernel density estimates for the first 10 multivariate functional common principal component scores for the fMRI data.

CHAPTER 1

INTRODUCTION

With the progress of technology, data arising in a wide range of fields are often obtained in the form of functions. "Functional data" is a term that refers to data that are recorded continuously during a time interval, whose graphical representations can be curves, images, shapes, or general objects that vary over time. Although the analysis of functional data and that of multivariate data share many common principles, the infinite-dimensional nature of functional data presents many new challenges that are absent in traditional multivariate analysis.

Functional data analysis (FDA) is an emerging field in statistics that has seen rapid development over the last two decades. As a new branch of statistics, FDA extends existing methodologies and theories from the areas of multivariate data analysis, stochastic processes, generalized linear models, and many others. The book by Ramsay and Silverman (2005) gives a clear account of the basic considerations of FDA, and the book by Ferraty and Vieu (2006) provides a detailed survey of many nonparametric techniques for analyzing functional data. The term FDA can be dated to Ramsay (1982) and Ramsay and Dalzell (1991). From the 1990s onward, with the development of technology, FDA became a fast-growing area of statistics.

There are several motivations for studying functional data. In many research fields, such as traffic flow in transportation, signals in radar ranging, weather in meteorology, gene expression in genomics, growth curves in medicine, and brain images in functional magnetic resonance imaging (fMRI), the data-generating process is naturally represented in terms of functions. Thus, many practical problems are better approached with the data described as functions. In a conceptual sense, functions are intrinsically infinite-dimensional but are usually measured discretely. Different from multivariate data analysis, FDA treats each datum as a continuous curve and models the data in a functional space rather than as a set of vectors. Given the high intrinsic dimensionality of these data, a standard multivariate data analysis might not be computationally feasible due to the "curse of dimensionality". Moreover, it should be emphasized that a datum in FDA is a whole function defined on some interval, in contrast to simply focusing on the observed value at a particular point in the interval. Under the FDA framework, each sample element is considered to be a curve or a function instead of a finite set of single data points.
This offers a potential advantage over analyzing the discrete data directly, as it requires fewer assumptions than classical multivariate analysis. Functional data are assumed to be smooth functions that have been measured on a dense grid. One of the most popular tools for FDA is functional principal component analysis (FPCA), which is built on the Karhunen-Loève expansion for stochastic processes and reduces the random functions to a set of functional principal component (FPC) scores. This thesis focuses extensively on the FPC scores to carry out secondary development for analyzing traffic flow data.

1.1 Motivation

The purpose of this thesis is to address some important questions that arise from implementing FPCA in practice and to give answers to these questions. This topic was chosen after conducting an extensive literature review of functional data analysis. This thesis studies the following tasks:

• Data preprocessing for functional data, which proposes and analyzes a nonparametric functional data approach to missing value imputation and outlier detection for functional data;

• The naive Bayes classifier for functional data, which provides a surrogate density estimation for functional random variables that makes a direct extension of density-based classical multivariate classification approaches to functional data classification possible;

• Dynamical prediction for functional data, which proposes a functional mixture prediction approach that combines a functional linear model with the functional naive Bayes classifier;

• Change-points in a functional sequence, which offers a two-step segmentation procedure to estimate both the number and locations of the mean change-points of a functional sequence.

To illustrate the practical use of our techniques, all of the above topics are implemented on traffic flow data. Traffic flow data provide valuable information for highway planning, traffic surveillance, and control purposes. For example, the information on traffic flow is useful to estimate the design-hour volume and annual average daily traffic. In addition, real-time traffic flow data provide essential information for traffic surveillance and control in Intelligent Transportation Systems (ITS).

Applications of traffic monitoring require complete and reliable data. These data can be recorded automatically by various types of vehicle loop detectors, which are usually installed under a planned road at regular intervals. Since loop detectors operate in a rough environment, missing data problems are inevitable due to detector malfunctions or packet loss during transmission. Therefore, temporary detector malfunctions that result in loss of data are quite common. Besides, outlier detection is another important issue in investigating traffic data. This includes detecting temporal outliers in terms of magnitude in time and identifying unusual patterns of trajectories, both of which provide useful information for further applications to traffic management.

Classification of the traffic flow is also an important topic for management in ITS. Since the traffic flow can be viewed as a macroscopic traffic characteristic in a transportation system, a good classification rule can be very helpful to build a better traffic control strategy. Furthermore, it is hard not to mention the importance of traffic prediction for ITS. The importance of predicting traffic rates has long been recognized in many applications.
Real-time forecasting gives travelers the ability to choose better routes, while it gives authorities the ability to manage the transportation system. Moreover, the change-point problem is also an issue for traffic management. The patterns of traffic flow rates from downstream or upstream detectors should be roughly similar; an abrupt change indicates something unusual.

1.2 Functional Principal Component Analysis

In the functional data framework, we adopt the notion that each daily traffic flow trajectory is a realization of a random function sampled from a stochastic process. Let X be a random function for the daily traffic flow trajectory in the domain I = [0, T]. We note that the random function X is square integrable, that is, ∫_I E(X²) < ∞, so that X ∈ L²(I). Here, L²(I) is the class of square-integrable functions with the inner product of any two functions f and g defined as ⟨f, g⟩ = ∫_I f(t)g(t) dt and the norm ‖f‖ = ⟨f, f⟩^{1/2}. We also assume that the random function X has a smooth mean function µ such that µ(t) := E(X(t)), and the covariance function of X is defined to be the function Γ such that

Γ(s, t) := cov(X(s), X(t)) = E((X(s) − µ(s))(X(t) − µ(t)))

for s and t in I. We further assume that the covariance function is continuous and square-integrable, that is, ∫∫ Γ²(s, t) ds dt < ∞. Then the function Γ induces the kernel operator Γ : L²(I) → L²(I), defined by

(Γφ)(s) = ∫_I Γ(s, t)φ(t) dt.

As noted, FPCA relies on an expansion in terms of the eigenbasis of the covariance function Γ. The existence of an eigenbasis of Γ for L²(I) is guaranteed by Mercer's lemma, and the expansion of X in this basis is termed the Karhunen-Loève expansion.

Lemma 1. (Mercer's lemma) Assume that the covariance function Γ as defined is continuous over I². Then there exist an orthonormal sequence {φj} of continuous functions in L²(I) and a non-increasing sequence {λj} of positive numbers such that

(Γφj)(t) = λjφj(t),  t ∈ I,  j ∈ ℕ,

and moreover,

Γ(s, t) = ∑_{j=1}^∞ λjφj(s)φj(t),  s, t ∈ I,

where the series converges uniformly on I². Hence

∑_{j=1}^∞ λj = ∫_I Γ(s, s) ds < ∞.

The proof of Mercer's lemma can essentially be found in Mercer (1909).

Theorem 1. (Karhunen-Loève expansion) Under the assumptions and notation of Mercer's lemma, we have

X(t) = µ(t) + ∑_{j=1}^∞ ξjφj(t),  (1.1)

where ξj = ⟨X − µ, φj⟩ is a random variable with E(ξj) = 0 and E(ξjξk) = λjδjk, j, k ∈ ℕ, where δjk denotes the Kronecker delta. The series (1.1) converges uniformly on I with respect to the L²-norm.

In equation (1.1), the eigenfunctions φj are referred to as functional principal components (FPCs) with FPC scores ξj. The deviation of each sample function from the mean is thus represented as a sum of orthogonal curves with uncorrelated random coefficients. Although the expansion in equation (1.1) is infinite-dimensional, it is a common practical experience that the first M leading eigenfunctions can effectively span the process, for some M < ∞. In practical applications, this M must be chosen data-adaptively and will be discussed in each chapter for a different purpose. The idea of FPCA is to retain the first M terms in the Karhunen-Loève expansion as an approximation to X,

X(t) ≈ µ(t) + ∑_{j=1}^M ξjφj(t),  (1.2)

and hence to achieve dimension reduction. This can be seen as projecting X onto an M-dimensional space spanned by the first M eigenfunctions with the largest eigenvalues λj.
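To make the truncated expansion (1.2) concrete, the following is a minimal sketch of empirical FPCA for curves fully observed on a common regular grid, here taken as [0, 1]. It is an illustration only: the function name fpca is ours, and the smoothing and conditional-expectation steps developed in later chapters are deliberately omitted.

    # Minimal sketch of the truncated Karhunen-Loeve expansion (1.2),
    # assuming n curves fully observed on a common regular grid of m points.
    import numpy as np

    def fpca(X, M):
        """X: (n, m) array of curves sampled on a regular grid over [0, 1]."""
        n, m = X.shape
        dt = 1.0 / m                       # grid spacing, illustrative domain [0, 1]
        mu = X.mean(axis=0)                # pointwise estimate of the mean function
        Xc = X - mu                        # centered curves
        Gamma = Xc.T @ Xc / n              # sample covariance surface Gamma(s, t)
        evals, evecs = np.linalg.eigh(Gamma)
        order = np.argsort(evals)[::-1]    # eigh returns ascending order
        lam = evals[order][:M] * dt        # eigenvalues of the integral operator
        phi = evecs[:, order][:, :M] / np.sqrt(dt)  # L2-normalized eigenfunctions
        xi = Xc @ phi * dt                 # FPC scores xi_ij = <X_i - mu, phi_j>
        X_hat = mu + xi @ phi.T            # rank-M reconstruction as in (1.2)
        return mu, lam, phi, xi, X_hat

For an (n × m) array X, a call such as mu, lam, phi, xi, X_hat = fpca(X, M=3) returns the rank-3 approximation; the factor dt converts the matrix eigen-decomposition into the eigenvalues and L²-normalized eigenfunctions of the integral operator (Γφ)(s).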
Given the discrete observations {(tij, Xi(tij)), i = 1, . . . , n, j = 1, . . . , mi}, the mean function µ can be estimated by µ̂ via the locally weighted least squares method, while the estimates φ̂j and ξ̂ij of the components φj and ξij rely on the covariance estimate Γ̂ of Γ, obtained by smoothing the scatterplot data (Xi(tij) − µ̂(tij))(Xi(til) − µ̂(til)) with a local linear plane. Details of this estimation will be discussed in subsequent chapters. An individual traffic trajectory can then be modeled, using its FPC scores, by

X̂i(t) = µ̂(t) + ∑_{j=1}^M ξ̂ij φ̂j(t)

for t ∈ I. We note that the above FPCA method can be implemented on fully observed functional data as in Besse (1992), Besse and Ramsay (1986), and Silverman (1996); on densely observed functional data as in Castro et al. (1986), Rice and Silverman (1991), and Cardot (2000); and in the most common situation of sparsely observed functional data as in Staniswalis and Lee (1998), James et al. (2000), Rice and Wu (2001), Yao et al. (2005), Yao and Lee (2006), and Paul and Peng (2009).

1.3 Thesis Outline

This thesis considers an FDA approach to address the problems mentioned above. We treat the daily traffic flow trajectories as functional data that are sampled from random functions. FDA was introduced nearly two decades ago, and various statistical methods for FDA have been intensively developed. Overviews of the FDA methodological foundations can be found in Ramsay and Silverman (2005) and Ferraty and Vieu (2006), as well as in the review articles Rice (2004), Müller (2005), and Wang et al. (2016).

The thesis is organized as follows. In Chapter 2, applications of FPCA to the problems of missing data imputation and outlier detection are explored. In Chapter 3, we propose the naive Bayes classifier for functional data; the functional naive Bayes classifier is an extension from the multivariate setting to the functional setting. We also investigate properties of the classifier: under regularity conditions, the proposed functional naive Bayes classifier is asymptotically equivalent to the true one. In Chapter 4, we propose a functional mixture prediction approach to predict future functional observations; the proposed method combines functional linear regression with the functional naive Bayes classifier. Chapter 5 is devoted to a change-point analysis; we propose a two-step segmentation algorithm for detecting multiple mean change-points in a sequence of functional data. In particular, functional data are transformed into FPC scores, which are used in the segmentation algorithm for estimating both the number and locations of the mean change-points. In Chapter 6, the concluding chapter, a summary is provided, and further topics in FPCA are briefly mentioned, including generalizations to multivariate FPCA and nonparametric approaches.

CHAPTER 2

DATA PREPROCESSING OF FUNCTIONAL DATA

Applications of traffic monitoring require complete and reliable data. These data can be recorded automatically by various types of vehicle loop detectors, which are usually installed under a planned road at regular intervals. Since loop detectors operate in a rough environment, missing data problems are inevitable due to detector malfunctions or packet loss during transmission. Therefore, temporary detector malfunctions that result in loss of data are quite common. When a missing data problem occurs at a detector, the data at neighboring detectors located downstream or upstream are often missing as well.
One way to deal with missing values is to eliminate sample records with missing values from the original dataset, yet the reduced dataset may lead to biased analysis results. Another approach is to reconstruct missing entries based on the recorded dataset; however, distinct imputation methods have their own advantages and disadvantages, with different imputation performances depending on data availability scenarios, and each method may lead to different imputation results. Just like normal data collection or analysis procedures, missing data should be an important consideration in designing traffic data archiving or analysis systems for the purposes of highway planning and traffic surveillance and control, especially for ITS applications. Besides, outlier detection is another important issue in investigating traffic data. This includes detecting temporal outliers in terms of magnitude in time (i.e., magnitude outliers) and identifying unusual patterns of trajectories (i.e., shape outliers), which provide useful information for further applications to traffic management.

Comprehensive overviews of the issues of missing data can be found in Allison (1999) and Schafer and Graham (2002). Various imputation techniques have been developed in the past decades (see, e.g., Schafer (1999); Collins et al. (2001); King et al. (2001); Graham et al. (2003); Rubin (2004)). There is a large body of literature discussing methods of imputing missing values for multivariate data (Beale and Little (1975); Schafer (1997)) and longitudinal data (Laird (1988); Little (1995); Little and Rubin (2002); Fitzmaurice et al. (2012)).

Methods specifically designed for traffic flow data have attracted significant attention. These include the Kalman filter method (Dailey, 1993), time series modeling (Nihan, 1997), historical (neighboring) imputation (Chen and Shao, 2000), the lane distribution method (Smith and Conklin, 2002), spline regression imputation methods (Chen et al., 2003), and genetically designed modeling (Zhong et al., 2004). More recently, Qu et al. (2009) proposed Probabilistic Principal Component Analysis (PPCA) and Bayesian Principal Component Analysis (BPCA) imputation algorithms and compared their performance with some conventional methods from the literature. Although historical (or neighboring) imputation and spline (or local regression) imputation are frequently used methods for missing value imputation, they both suffer from some defects. As discussed in Qu et al. (2009), they ignore the fact that traffic flows may fluctuate significantly from day to day and contain stochastic variation within the same day. A basic historical imputation utilizes global information in the sense of closely related or neighboring historical data, while spline imputation uses local information in the sense of within-a-day flow data. Since the PPCA-based method considers an adaptive fusion of historical and within-a-day information, it outperforms the historical and spline imputation methods. Although numerous multivariate analysis methods have been developed to deal with missing values, to the best of our knowledge, functional data approaches that take advantage of functional data features have not yet been discussed in relation to imputing missing values for longitudinal or functional data.

As for outlier detection methods, for many statistical analysis procedures an essential step toward obtaining a coherent analysis is the detection of outlying observations.
While outlier detection of multivariate data has been developed over several decades, outlier detec- tion of functional data has only been discussed in recent years. Identification of abnormal or unusual patterns of trajectories that significantly deviate from other observations in a homogeneous group can improve the quality of observations and can be useful for further research. Abnormal data may adversely lead to model misspecification, biased parameter 9 estimation and incorrect results. Methods of outlier detection of functional data in the lit- erature include the use of robust principal component analysis (Hyndman and Ullah, 2007), the successive likelihood ratio test and smoothed bootstrapping (Febrero et al., 2007), sin- gular value decomposition plots (Zhang et al., 2007), rainbow plots, bagplots and boxplots for functional data (Hyndman and Shang, 2010) and functional boxplots (Sun and Genton, 2011). This study considers a functional data analysis (FDA) approach to missing value impu- tation, where daily traffic flow trajectories are treated as functional data that are sampled from random functions. We propose to use the conditional expectation approach to FPCA for incompleteness of traffic flow data. Following the missing data imputation method, we also provide two outlier detection methods based on functional principal component (FPC) scores, the modified functional bagplot and the modified functional highest density region (HDR) boxplot, both of which are graphical tools aimed at identifying outlying curves of functional data. 2.1 Missing Values and Outliers in Functional Data Data quality is an important issue encountered in analysis of traffic flow data. Although the data are automatically recorded by dual loop detectors, data corruption may happen due to short-term software or hardware malfunctions, maintenance operations and detector construction. These may lead to discontinuities or gaps and outliers in the data records, and may create severe obstacles in modeling and identification of the underlying stochastic mechanism. Therefore, it is essential to fill in missing gaps of the data and remove identified outliers before performing statistical analysis. 2.1.1 Patterns of Missing Values Missing data can be random in nature and are sometimes caused by a detector that does not deliver measurement values or a fault in the measurement tools. Depending on the 10 measurement facility, missing values can appear as a blank, zero, negative, or not a number. Therefore, missing values are often simple to detect in a recorded dataset. (a) Point Missing (b) Interval Missing (c) Mixed PM/IM Figure 2.1: Typical missing patterns of traffic flow data. (a) Point Missing (PM): The circles are imputed missing values. (b) Interval Missing (IM): The dotted lines are imputed missing intervals. (c) Mixed PM/IM. Rubin (1976) first developed a useful terminology for describing the patterns of missing data. Generally speaking, there are three kinds of missing patterns (see, e.g., Rubin (1976); Little and Rubin (2002) for multivariate data and Nakai and Ke (2011) for longitudinal data), including Missing Completely at Random (MCAR), Missing at Random (MAR) and Not Missing at Random (NMAR). When neither MCAR nor MAR holds as the missingness mechanism, it is termed as NMAR. NMAR may be caused by loop detector malfunctioning due to various reasons of machine failure, and it is rare to find an appropriate model for this missingness mechanism. 
In practice, the missing patterns of real traffic flow data may combine the patterns of MCAR and MAR, called Mixed MCAR/MAR. Since we cannot distinguish MCAR and MAR from NMAR based on the data, we simply classify the missing patterns as PM and IM, and also Mixed PM/IM, for traffic flow data as follows. We note that PM and IM correspond to MCAR and MAR, respectively.

• Point Missing (PM): The missing points are completely independent of the observed and unobserved values. The missing points are isolated, grouped or randomly scattered. Both MCAR and MAR could be special cases of PM. See Figure 2.1(a).

• Interval Missing (IM): This definition is closely related to MAR, but has a different focus. In terms of functional data, the data are curves instead of points. Hence, a missing interval means an unobserved interval rather than some unobserved points in a small group. The missing intervals often occur randomly. See Figure 2.1(b).

• Mixed PM/IM: The missing patterns can be PM or IM. See Figure 2.1(c).

Data incompleteness is a troubling feature of many datasets, and missing values can be a serious problem as they may distort the properties of the data. Although there may be different patterns of missing values, the imputation method we propose is based on the partially observed trajectories and is not affected by the missing patterns.

2.1.2 Patterns of Outlying Curves

Outlier detection is a prerequisite in many data applications. There are several methods for outlier detection, which can be distinguished as univariate versus multivariate techniques and parametric versus nonparametric procedures. For instance, the Mahalanobis distance is a well-known criterion that depends on estimated parameters of the multivariate distribution. Although there are many outlier detection methods for multivariate data, very few of them are for functional data. Defining an outlier or a contamination within a sample of curves is itself a tricky problem. Detecting outlying curves is a challenging task, and mistakes or oversights in this area can have serious effects on statistical analysis, including biasing the results.

Following Hyndman and Shang (2010), there are two types of outliers: magnitude outliers and shape outliers. In general, magnitude outliers are distant from the mean, and shape outliers have a pattern that is different from the other curves; see Figures 2.2(a) and 2.2(b), respectively. In practice, outlying curves may exhibit a combination of these features. When analyzing functional data, outliers can greatly affect estimates in many ways, including skewing the summary statistics and distorting the statistical modeling. Further research based on such models and summaries can result in potentially serious failures due to previously undetected errors. Thus, identifying outliers is important. A good methodology for identifying outlying curves should be able to cope with all types of outliers.

Figure 2.2: Typical outliers of traffic flow rate trajectories. (a) Magnitude outlier; (b) shape outlier.

In this chapter, our focus is on identifying outlying curves or trajectories that have different patterns in terms of the underlying stochastic structure.
Faulty data can be checked by logic rules, and they are distinguished from outliers caused by special traffic incidents that produce volume surges or extreme traffic conditions, in which the flow rate trajectories are beyond the extent of typical traffic variations. Outliers due to the occurrence of traffic incidents may require different incident detection techniques that are able to detect abrupt changes in traffic streams; this is beyond the scope of this chapter.

2.2 Functional Principal Components Analysis

Most functional data approaches are nonparametric due to the data features; they impose minimal assumptions on the data, which overcomes the limitations of parametric modeling. In this section, we make use of functional principal component analysis (FPCA) techniques to impute missing values of functional data.

We adopt the notion that each daily traffic flow trajectory is a realization of a random function. Let X denote the random function for the daily traffic flow trajectory. We further assume that X has an unknown smooth mean function EX(t) = µ(t) and covariance function Cov(X(s), X(t)) = Γ(s, t), s, t ∈ I, where I = [0, T], in the L² space. Here we assume that Γ has an orthogonal expansion in L², that is, Γ(s, t) = ∑_{k=1}^∞ λkφk(s)φk(t), where {λk} is a set of eigenvalues in non-ascending order and {φk} is the corresponding set of eigenfunctions that form a basis with unit norm in L². A random trajectory of the traffic flow then has the following Karhunen-Loève representation:

Xi(t) = µ(t) + ∑_{k=1}^∞ ξikφk(t),  (2.1)

where ξik = ∫ (Xi(t) − µ(t))φk(t) dt is a random coefficient, the projection of (Xi − µ) in the direction of the k-th eigenfunction φk, with mean zero and variance λk.

In practice, the random function Xi is often contaminated with measurement errors. The i-th data object, i = 1, . . . , n, with mi observations observed at tij, for all tij in I and j = 1, . . . , mi, can be represented as

Yi(tij) = Xi(tij) + εij = µ(tij) + ∑_{k=1}^∞ ξikφk(tij) + εij,  (2.2)

where Xi is the random function described in (2.1), and the random measurement errors εij are assumed to be uncorrelated with each other and independent of ξik, with E(εij) = 0 and Var(εij) = σ².

To obtain the corresponding function estimate in (2.1), we must estimate the model component functions µ and φk. We apply the locally weighted least squares smoothing method to the pooled data from all trajectories to obtain the estimated mean function µ̂, where the smoothing parameters can be chosen by various methods, including cross-validation (Rice and Silverman, 1991) or generalized cross-validation (Fan and Gijbels, 1996). To obtain the estimate of Γ, we adopt the techniques proposed in Yao et al. (2005) and smooth the empirical covariances. Then, we obtain the estimated eigenvalues λ̂k and eigenfunctions φ̂k by applying the eigen-decomposition procedure to the smoothed covariance function estimate.

The random coefficient ξik cannot easily be estimated through ξ̂ik = ∫ (Xi(t) − µ̂(t))φ̂k(t) dt. First, this integral approximation method encounters difficulties when there are many missing entries or when only a few repeated observations are available. Second, Xi(t) cannot be observed directly but only through the observations Yi, which are contaminated with measurement errors. Estimating ξik by substituting Yi for Xi may lead to biased FPC scores.
To overcome these difficulties, we adopt the approach of Yao et al. (2005) based on the conditional expectation, assuming that in (2.2) the ξik and εij are jointly Gaussian. Let Yi = (Yi(ti1), . . . , Yi(timi))ᵀ, where mi is the number of available observations for the i-th trajectory. Let φik be the vector of values of the k-th eigenfunction, φik = (φk(ti1), . . . , φk(timi))ᵀ, let ΣYi be the covariance matrix of Yi, and let µi = (µ(ti1), . . . , µ(timi))ᵀ. Under the assumption that the FPC scores ξik and the error terms εij are jointly Gaussian, the conditional FPC scores are

E(ξik | Yi) = λk φikᵀ ΣYi⁻¹ (Yi − µi).  (2.3)

The estimated conditional FPC scores in (2.3) are then obtained by substituting the corresponding estimates, giving

ξ̂ik = λ̂k φ̂ikᵀ Σ̂Yi⁻¹ (Yi − µ̂i),  (2.4)

where φ̂ik = (φ̂k(ti1), . . . , φ̂k(timi))ᵀ is the estimate of φik and Σ̂Yi = {Γ̂(tij, til) + σ̂²δjl}, for 1 ≤ j, l ≤ mi, is the estimate of ΣYi. Note that Γ̂(tij, til) and σ̂² are estimates of Γ(tij, til) and σ², respectively, and δjl is the Kronecker delta. More details about the conditional expectation approach can be found in Yao et al. (2005).

2.2.1 Missing Value Imputation by Functional Principal Component Models

In view of (2.4), this FPC score estimate is applicable to situations where there are missing values and, thus, inspires our missing value imputation method. We truncate the infinite series in (2.1) at L using the fraction of variance explained (FVE) criterion, such that the first L components explain at least τλ × 100% of the total variance, i.e.,

L = min{ L ≥ 1 : (∑_{k=1}^L λ̂k) / (∑_{k=1}^M λ̂k) ≥ τλ },  (2.5)

where M is the largest number of components with λ̂k > 0 and τλ is a predetermined threshold value, 0 ≤ τλ ≤ 1. Setting τλ equal to 0.9 or higher works reasonably well in our simulation study. Based on the estimated model components µ̂, φ̂k and {ξ̂ik}k=1,...,L for all i, the predicted function for the i-th data object is then given by

Ŷi(t) = µ̂(t) + ∑_{k=1}^L ξ̂ik φ̂k(t).  (2.6)

Since model (2.6) holds for all t in the entire time domain I, the model fits can be used to impute the missing values. That is, if the observations Yi(tij) are missing for some j, then the missing entries can be imputed by the predicted values Ŷi(tij). Note that the predicted trajectories Ŷi include the components of the smoothed mean function and a linear combination of the eigenfunctions, which recover individual trajectories from noisy measurements. Additionally, the imputation errors depend on the model complexity, while the number of FPC scores is determined by the FVE. We note that (2.5) provides a natural method to determine the number of FPC scores, and we will investigate the effect of the number of components on the imputed missing values.
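To illustrate how (2.4) and (2.6) combine into an imputation rule, here is a minimal sketch for a single discretized trajectory with missing entries coded as NaN. The names mu_hat, phi_hat, lam_hat and sigma2_hat stand for precomputed estimates on a common grid, and Σ̂Yi is approximated by the truncated eigen-expansion of the smoothed covariance plus the noise variance; this is a hedged illustration, not the exact estimation pipeline of Yao et al. (2005).

    # Sketch of conditional-expectation imputation, eqs. (2.4) and (2.6).
    # Assumes mu_hat (m,), phi_hat (m, L), lam_hat (L,), sigma2_hat are
    # precomputed estimates on a common grid; `y` is one trajectory with NaNs.
    import numpy as np

    def impute_trajectory(y, mu_hat, phi_hat, lam_hat, sigma2_hat):
        obs = ~np.isnan(y)                          # time points actually observed
        Phi_o = phi_hat[obs, :]                     # eigenfunctions at observed times
        # Sigma_Yi ~ sum_k lam_k phi_k(t_j) phi_k(t_l) + sigma^2 delta_jl
        Sigma = Phi_o @ np.diag(lam_hat) @ Phi_o.T + sigma2_hat * np.eye(obs.sum())
        resid = y[obs] - mu_hat[obs]
        # Conditional FPC scores, eq. (2.4): xi_k = lam_k phi_ik' Sigma^{-1} (Y_i - mu_i)
        xi = lam_hat * (Phi_o.T @ np.linalg.solve(Sigma, resid))
        # Predicted curve on the full grid, eq. (2.6); fill in the missing entries
        y_hat = mu_hat + phi_hat @ xi
        return np.where(obs, y, y_hat), xi

Because the prediction y_hat is defined on the whole grid, the same fitted curve supplies imputed values for both isolated missing points (PM) and missing intervals (IM).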
2.2.2 Visualization Tools for Outlier Detection

A visualization tool can be useful to detect abnormal trajectories for functional data when the data are contaminated with outlying curves. Two graphical tools were proposed in Hyndman and Shang (2010) to detect outliers: the functional bagplot and the functional highest density region (HDR) boxplot. Both are based on the first two principal component scores. For this purpose, the robust principal component estimation algorithm (Croux and Ruiz-Gazen, 2005) is applied with a form of projection pursuit, because the principal component decomposition can be sensitive to outliers. In addition, this algorithm is more resistant to outliers when the measurement matrix contains outliers. However, this method does not take missing values into account; besides, some difficulties can arise when the sample covariance matrix has very high dimensionality, as computing this sample covariance itself is very costly. Furthermore, the way of dealing with an incomplete dataset is not clear, particularly as the projection pursuit algorithm requires a complete dataset to project the data onto a lower-dimensional space such that a robust measure of variance of the projected data is maximized.

To overcome the aforementioned difficulties, instead of using the robust principal component scores, we introduce FPC scores based on the conditional expectation into the functional bagplot and the functional HDR boxplot. Functional principal component scores can capture much of the information inherent in functional data, since the covariance surface has smoothed out some outlying features and noise from the measurement errors. We call these two modified outlier detection tools the 'modified functional bagplot' and the 'modified functional HDR boxplot'. Details about the functional bagplot and the functional HDR boxplot were discussed in Hyndman and Shang (2010).

2.3 Simulation Study and Data Application

2.3.1 Simulation Study

In the simulation study, we consider various degrees of missingness, or missing ratios, and compare the conditional expectation approach to FPCA with other methods for missing value imputation. In addition, we investigate the performance of outlier detection for different patterns of outlying curves.

2.3.1.1 Performance Comparison in Missing Value Imputation

For missing value imputation, we compared our FPCA approach with the following methods:

• Functional Mean (FM): The missing values are imputed by the estimated mean function µ̂(t), which is obtained by smoothing the pooled data from all trajectories.

• Probabilistic Principal Component Analysis (PPCA): This approach was proposed by Tipping and Bishop (1999), who introduced a Gaussian latent variable into classical Principal Component Analysis (PCA) and gave an Expectation-Maximization (EM) algorithm for estimating the principal subspace. Moreover, the missing entries and the principal axes can be derived by the EM algorithm simultaneously.

• Bayesian Principal Component Analysis (BPCA): Bishop (1999) proposed a Bayesian estimation method that modifies PPCA by introducing continuous hyper-parameters to determine the optimal value of the latent space dimensionality.

Both BPCA and PPCA are PCA-based multivariate imputation methods and have been shown to outperform many conventional approaches in the literature. Brief reviews of the methods are given in the appendix.

In order to evaluate the performance of the missing value imputation methods, we use a real dataset that is archived in the Caltrans Performance Measurement System (PeMS), available through the link http://pems.dot.ca.gov/. The traffic flow rates were collected at a 5-min interval from April 1 to April 30 in 2013 (I5-N@CA PM5.32, District 11, San Diego County, City of San Diego; Detector ID: 1114734). We generated artificial missing entries from the real dataset according to the missing mechanisms of the PM, IM and Mixed PM/IM patterns as follows (a short sketch of these generators is given after Table 2.1 below).

1. PM: The missing entries are generated point by point, with locations following a uniform distribution.
2. IM: The missing entries are generated with interval lengths following a uniform distribution ranging from half-hour (6 consecutive points) to 2-hour (24 consecutive points) periods. The locations of the missing intervals in each trajectory are uniformly distributed.

3. Mixed PM/IM: The missing entries are generated by combining PM with IM for individual trajectories.

The total number of missing entries amounts to nmp for each missing pattern, where p is the missing ratio, n is the number of trajectories, and m is the number of time points.

Table 2.1: Performance comparison in terms of the RMSE mean, standard error (in parentheses) and average proportion of total variance explained [in brackets (%)] for FM, BPCA, PPCA and FPCA in the PM, IM and Mixed missing patterns based on 100 simulation replications.

Pattern  Method       p = 0.02             p = 0.18             p = 0.34             p = 0.50
PM       FM           40.15 (4.07)         40.19 (1.31)         40.38 (0.77)         40.36 (0.58)
         BPCA         29.79 (11.73)        28.39 (10.58)        30.59 (10.60)        31.74 (9.89)
         PPCA (q=1)   20.88 (1.30) [77.2]  21.13 (0.42) [75.8]  21.87 (0.53) [74.1]  23.15 (0.80) [72.0]
         PPCA (q=2)   19.05 (1.17) [82.5]  19.48 (0.40) [81.5]  20.47 (0.61) [80.4]  22.22 (0.90) [79.3]
         PPCA (q=4)   19.54 (1.21) [85.2]  20.46 (0.49) [84.8]  22.38 (0.78) [84.6]  26.16 (1.64) [85.0]
         PPCA (q=6)   20.49 (1.36) [87.3]  21.99 (0.58) [87.4]  25.09 (1.05) [88.0]  30.57 (1.70) [89.9]
         FPCA (L=1)   20.51 (1.04) [91.0]  20.49 (0.45) [90.6]  20.65 (0.28) [90.2]  20.84 (0.28) [89.6]
         FPCA (L=2)   18.52 (1.24) [97.0]  18.59 (0.38) [96.5]  18.89 (0.28) [96.3]  19.18 (0.28) [95.6]
         FPCA (L=4)   18.01 (1.26) [99.0]  18.15 (0.38) [98.2]  18.53 (0.27) [98.0]  18.92 (0.28) [97.9]
         FPCA (L=6)   17.83 (1.31) [99.0]  18.03 (0.37) [99.0]  18.42 (0.27) [99.0]  18.82 (0.27) [99.0]
         FPCA (AIC)   20.59 (1.14) [91.0]  20.51 (0.43) [91.1]  20.65 (0.26) [90.2]  20.86 (0.25) [89.6]
         FPCA (BIC)   20.59 (1.14) [91.0]  20.51 (0.43) [91.1]  20.65 (0.26) [90.2]  20.86 (0.25) [89.6]
IM       FM           36.88 (13.01)        40.56 (3.49)         40.84 (1.87)         41.02 (1.51)
         BPCA         27.17 (12.77)        25.29 (9.33)         26.56 (9.77)         30.75 (9.93)
         PPCA (q=1)   20.90 (2.39) [77.2]  21.39 (0.69) [76.0]  21.91 (0.80) [74.6]  23.33 (1.93) [72.5]
         PPCA (q=2)   20.33 (2.09) [82.5]  20.80 (0.75) [81.7]  21.59 (0.91) [81.0]  23.87 (2.61) [79.9]
         PPCA (q=4)   20.06 (2.20) [85.2]  22.01 (2.49) [84.8]  24.11 (2.31) [84.8]  28.13 (3.25) [85.0]
         PPCA (q=6)   21.63 (3.21) [87.3]  23.97 (2.17) [87.4]  26.99 (2.55) [88.2]  32.75 (3.18) [89.6]
         FPCA (L=1)   20.52 (2.16) [90.9]  20.96 (0.68) [87.9]  21.35 (0.81) [84.7]  22.36 (1.45) [81.2]
         FPCA (L=2)   18.50 (2.22) [96.8]  19.76 (1.22) [94.5]  20.63 (1.64) [91.4]  21.71 (1.75) [88.7]
         FPCA (L=4)   18.79 (3.10) [98.8]  20.33 (1.27) [97.7]  21.02 (1.24) [96.5]  22.28 (2.07) [94.7]
         FPCA (L=6)   19.33 (3.27) [99.0]  20.61 (1.41) [98.9]  21.28 (1.31) [98.5]  22.48 (2.16) [97.1]
         FPCA (AIC)   20.84 (3.16) [90.6]  21.32 (2.38) [89.3]  21.58 (2.51) [88.2]  22.35 (2.97) [88.9]
         FPCA (BIC)   21.02 (2.37) [90.8]  21.53 (1.01) [89.1]  21.80 (1.24) [88.4]  22.61 (1.95) [88.8]
Mixed    FM           40.97 (10.76)        40.82 (2.73)         40.59 (1.70)         40.76 (1.48)
         BPCA         27.46 (12.65)        27.90 (10.92)        29.12 (10.42)        28.64 (9.48)
         PPCA (q=1)   20.68 (1.68) [77.2]  21.14 (0.60) [75.8]  22.02 (0.77) [74.2]  23.03 (0.95) [72.6]
         PPCA (q=2)   19.91 (1.58) [82.4]  20.51 (0.56) [81.5]  21.68 (0.88) [80.6]  23.15 (1.13) [79.9]
         PPCA (q=4)   19.60 (1.84) [85.2]  21.13 (1.38) [84.7]  23.11 (1.70) [84.7]  26.45 (2.03) [85.4]
         PPCA (q=6)   21.03 (2.42) [87.3]  22.82 (1.43) [87.4]  26.10 (2.10) [88.0]  30.82 (2.25) [89.9]
         FPCA (L=1)   20.61 (1.93) [90.9]  20.63 (0.58) [88.9]  20.85 (0.54) [88.3]  21.31 (0.54) [83.8]
         FPCA (L=2)   18.81 (1.37) [96.9]  18.95 (0.61) [95.0]  19.27 (0.63) [94.2]  20.49 (1.06) [91.0]
         FPCA (L=4)   18.45 (1.38) [98.7]  19.20 (0.79) [97.6]  19.28 (0.73) [97.2]  20.55 (0.79) [96.4]
         FPCA (L=6)   18.77 (1.55) [99.0]  19.26 (0.81) [98.9]  19.37 (0.81) [98.7]  20.66 (0.83) [98.2]
         FPCA (AIC)   20.73 (1.77) [90.9]  21.00 (0.64) [88.9]  21.33 (0.69) [88.3]  21.62 (0.98) [88.2]
         FPCA (BIC)   20.73 (1.77) [90.9]  21.00 (0.64) [88.9]  21.33 (0.69) [88.3]  21.62 (0.98) [88.2]
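As referenced above, the following sketch shows one way the PM and IM generators just described might be implemented for an n × m data matrix; the function names and the exact budgeting of the Mixed pattern are illustrative assumptions based on the description in the text.

    # Sketch of the PM/IM missing-entry generators described above; locations
    # and interval placements are drawn uniformly, matching the text.
    import numpy as np

    def pm_mask(n, m, p, rng):
        """Point Missing: about n*m*p isolated entries, uniformly located."""
        mask = np.zeros((n, m), dtype=bool)
        idx = rng.choice(n * m, size=int(n * m * p), replace=False)
        mask.ravel()[idx] = True
        return mask

    def im_mask(n, m, p, rng):
        """Interval Missing: random intervals of 6-24 consecutive 5-min points."""
        mask = np.zeros((n, m), dtype=bool)
        while mask.sum() < n * m * p:                # may slightly overshoot the budget
            i = rng.integers(n)                      # pick a trajectory
            length = rng.integers(6, 25)             # half-hour to two hours
            start = rng.integers(0, m - length + 1)  # uniform interval location
            mask[i, start:start + length] = True
        return mask

    # Mixed PM/IM: e.g., half the missing budget from each generator
    # rng = np.random.default_rng(0)
    # mask = pm_mask(30, 288, 0.05, rng) | im_mask(30, 288, 0.05, rng)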
To compare the imputation performance, we use the root mean square error (RMSE) as a criterion, defined by

RMSE = [ (1/n₀) ∑_{i=1}^{n₀} (1/mᵢ) ∑_{j=1}^{mᵢ} {yᵢ(tij) − ŷᵢ(tij)}² ]^{1/2},

where mᵢ is the number of missing points and n₀ is the number of trajectories with mᵢ > 0. The RMSE measures the discrepancy between the imputed data values and the underlying observations, reflecting the performance of missing value imputation; a smaller RMSE indicates a better imputation performance.

The sample means with the associated standard errors of the RMSE results for the four methods are presented in Table 2.1 for different missing ratios p, ranging from 2% to 50%, based on 100 simulation replicates for the three missing types. It is clear that FM performs worse than the other methods, since it does not take individual fluctuations into account. The RMSE of PPCA increases with the missing ratio p for a fixed number of components. The performance of BPCA looks more robust to missing ratios compared to PPCA. While this could be the benefit of using a continuous type of hyper-parameter to determine the optimal dimension of the latent space, BPCA also has the largest standard errors, reflecting relatively unstable imputation results.

Figure 2.3 displays the RMSE values as a function of the missing ratios under the three missing patterns. The three cases consistently indicate that while the proposed FPCA outperforms the others, PPCA performs much better than BPCA, and FM is the worst. Further, the 95% FVE threshold value generally indicates slightly better performance than that of 90%. In general, the proposed FPCA imputation method performs the best in terms of RMSE, including the sample means and standard errors, indicating its capability of capturing the information inherent in the functional data. Similar conclusions can be reached for the IM and Mixed PM/IM cases as well.

In addition, we use box plots to display the variation of the RMSE between the imputed values and the observations. Figure 2.4 shows that for the PM missing pattern, FM has the largest bias, and BPCA has the largest standard errors, reflecting relatively unstable imputation results. The performance of PPCA depends on the dimension of the latent space (q), and its RMSE increases with the missing ratio. The proposed FPCA approach is robust to missing ratios and outperforms the other methods. Similar conclusions can be reached for the IM and Mixed PM/IM missing patterns.

Here we compare the relative marginal improvement over FM, BPCA, and PPCA for different dimensions of the latent space q, using FPCA with different threshold values τλ. The imputation performance based on the area under the curve (AUC) measure that integrates over the missing ratio p is defined as AUC = ∫_{pL}^{pU} RMSE(p) dp, where we set pL = 0.02 and pU = 0.50 in this study. The AUC results are summarized in Table 2.2. The imputation errors using FPCA are substantially smaller when compared with FM and BPCA, and slightly smaller than those based on PPCA. Choosing the best performance of FPCA in terms of τλ and of PPCA in terms of q, in the Mixed PM/IM case FPCA improves the AUC by 7.4% over PPCA, by 44% over BPCA and by 106% over FM.
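A minimal sketch of the two criteria above, assuming Y_true and Y_imp are n × m arrays and mask flags the artificially deleted entries (all names are ours); the AUC integral is approximated by the trapezoidal rule over the evaluated missing ratios.

    # Sketch of the RMSE and AUC criteria defined above (illustrative names).
    import numpy as np

    def rmse(Y_true, Y_imp, mask):
        """RMSE over imputed entries, averaged per trajectory as in the text."""
        per_curve = []
        for yt, yi, mk in zip(Y_true, Y_imp, mask):
            if mk.any():                             # only trajectories with m_i > 0
                per_curve.append(np.mean((yt[mk] - yi[mk]) ** 2))
        return np.sqrt(np.mean(per_curve))

    def auc(ratios, rmse_values):
        """AUC = integral of RMSE(p) over p, via the trapezoidal rule."""
        return np.trapz(rmse_values, ratios)

    # e.g., auc([0.02, 0.18, 0.34, 0.50], [18.5, 18.6, 18.9, 19.2])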
Although FPCA and PPCA are both PCA-based methods, FPCA takes advantage of the features of functional data and can therefore perform better than PPCA. In estimating the mean and covariance functions in model (2.2), FPCA takes noise and measurement errors into account and recovers individual trajectories from noisy measurements. Moreover, the eigen-decomposition is performed on the smoothed covariance, which yields a relatively higher proportion of total variance explained by a smaller number of components than the unsmoothed covariance does. In contrast, PPCA derives the unknown parameters from the raw observations via the EM algorithm, and these observations contain substantial noise and measurement errors. In addition, the EM algorithm is sensitive to initial values; even when a good initial value is given, it may converge to a local rather than the global maximum of the observed-data likelihood, and this problem can be serious when the noise level in the recorded data is high.

Figure 2.3: RMSE sample means as a function of the missing ratio for the imputation methods FM, BPCA, PPCA (with q = 1, 2, 4), and FPCA (with L = 1, 2, 4) based on 100 simulation replicates under the three missing patterns: (a) PM, (b) IM, (c) Mixed PM/IM.

Figure 2.4: Boxplots of RMSE with different missing ratios p (0.02, 0.18, 0.34, 0.5) using the four methods for the three missing patterns based on 100 simulation replications.

Table 2.2: The AUC of the RMSE for PM, IM and Mixed PM/IM.

Method         PM     IM     Mixed
FM             19.34  19.58  19.54
BPCA           14.19  13.64  13.68
PPCA (q = 1)   10.38  10.49  10.40
PPCA (q = 2)    9.67  10.32  10.20
PPCA (q = 4)   10.46  11.26  10.74
PPCA (q = 6)   11.52  12.56  11.90
PPCA (q = 8)   12.82  13.89  13.24
FPCA (L = 1)    9.65  10.21   9.96
FPCA (L = 2)    8.78   9.70   9.20
FPCA (L = 4)    8.59   9.95   9.26
FPCA (L = 6)    8.53  10.09   9.31
FPCA (AIC)      9.89  10.33  10.16
FPCA (BIC)      9.89  10.44  10.16

To assess the accuracy of the imputed values, we calculate the averaged standardized deviation (ASD) of the imputed missing values in the simulation study, where

ASD = { (1/n0) Σ_{i=1}^{n0} (1/n_i) Σ_{j=1}^{n_i} (y_i(t_ij) − ŷ_i(t_ij))² / s²(y(t_ij)) }^{1/2},

where n_i is the number of missing points and n0 is the number of trajectories with n_i > 0. Here, y_i(t_ij) is the observed flow rate of a missing entry, ŷ_i(t_ij) is the estimated missing value, and s²(y(t_ij)) is the sample variance at time t_ij. The ASD results are summarized in Table 2.3. The results using FPCA indicate that the averaged deviations of the imputed values range from about 0.40 to 0.60 times the sample standard deviation, which appears to be acceptable.
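The proportions of total variance explained reported in brackets in Table 2.1 come from the eigen-decomposition of the covariance estimate. A minimal sketch of the discretized decomposition and the FVE-based choice of L, assuming a covariance surface G on an equally spaced grid has already been estimated by smoothing (the helper name is ours):

```python
import numpy as np

def fve_components(G, grid, threshold=0.90):
    """Eigen-decompose a (smoothed) covariance surface on a grid; return
    eigenvalues, eigenfunctions, and the smallest L reaching the FVE threshold."""
    w = grid[1] - grid[0]                        # quadrature weight
    evals, evecs = np.linalg.eigh(G * w)         # discretized integral operator
    evals, evecs = evals[::-1], evecs[:, ::-1]   # descending order
    evals = np.clip(evals, 0.0, None)            # drop negative noise eigenvalues
    phi = evecs / np.sqrt(w)                     # normalize so that ∫ φ_j² = 1
    fve = np.cumsum(evals) / np.sum(evals)
    L = int(np.searchsorted(fve, threshold) + 1)
    return evals, phi, L
```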
Table 2.3: Averaged standardized deviation of imputed values for the FM, BPCA, PPCA and FPCA methods in terms of the mean and standard error (in parentheses) based on 100 simulation replications.

Pattern  Method         p = 0.02     p = 0.18     p = 0.34     p = 0.50
PM       FM             0.80 (0.35)  0.63 (0.06)  0.66 (0.05)  0.66 (0.03)
         BPCA           0.69 (0.41)  0.55 (0.09)  0.58 (0.07)  0.60 (0.05)
         PPCA (q = 1)   0.73 (0.30)  0.54 (0.07)  0.59 (0.05)  0.60 (0.03)
         PPCA (q = 2)   0.61 (0.30)  0.50 (0.04)  0.56 (0.06)  0.57 (0.03)
         PPCA (q = 4)   0.66 (0.30)  0.54 (0.05)  0.60 (0.04)  0.61 (0.04)
         PPCA (q = 6)   0.72 (0.31)  0.57 (0.06)  0.65 (0.05)  0.69 (0.08)
         PPCA (q = 8)   0.71 (0.36)  0.65 (0.09)  0.71 (0.07)  0.87 (0.13)
         FPCA (L = 1)   0.68 (0.28)  0.52 (0.07)  0.56 (0.05)  0.57 (0.03)
         FPCA (L = 2)   0.60 (0.23)  0.48 (0.06)  0.52 (0.06)  0.54 (0.03)
         FPCA (L = 4)   0.59 (0.21)  0.46 (0.06)  0.52 (0.06)  0.53 (0.04)
         FPCA (L = 6)   0.60 (0.24)  0.47 (0.05)  0.51 (0.06)  0.52 (0.04)
         FPCA (AIC)     0.57 (0.30)  0.47 (0.06)  0.50 (0.05)  0.50 (0.03)
         FPCA (BIC)     0.66 (0.30)  0.52 (0.06)  0.53 (0.05)  0.53 (0.03)
IM       FM             0.88 (0.51)  0.70 (0.10)  0.64 (0.04)  0.66 (0.05)
         BPCA           0.75 (0.56)  0.56 (0.11)  0.58 (0.05)  0.59 (0.08)
         PPCA (q = 1)   0.64 (0.27)  0.54 (0.09)  0.58 (0.07)  0.58 (0.06)
         PPCA (q = 2)   0.58 (0.22)  0.52 (0.06)  0.57 (0.04)  0.58 (0.08)
         PPCA (q = 4)   0.56 (0.21)  0.55 (0.07)  0.60 (0.06)  0.66 (0.10)
         PPCA (q = 6)   0.56 (0.19)  0.59 (0.07)  0.66 (0.10)  0.74 (0.04)
         PPCA (q = 8)   0.72 (0.26)  0.64 (0.09)  0.68 (0.12)  0.81 (0.14)
         FPCA (L = 1)   0.60 (0.25)  0.53 (0.09)  0.53 (0.06)  0.55 (0.04)
         FPCA (L = 2)   0.53 (0.24)  0.48 (0.06)  0.52 (0.04)  0.52 (0.04)
         FPCA (L = 4)   0.53 (0.25)  0.48 (0.05)  0.52 (0.04)  0.52 (0.05)
         FPCA (L = 6)   0.52 (0.24)  0.49 (0.05)  0.52 (0.05)  0.53 (0.05)
         FPCA (AIC)     0.56 (0.30)  0.49 (0.09)  0.53 (0.07)  0.55 (0.05)
         FPCA (BIC)     0.56 (0.30)  0.49 (0.09)  0.53 (0.07)  0.55 (0.05)
Mixed    FM             0.66 (0.26)  0.64 (0.08)  0.65 (0.05)  0.65 (0.03)
         BPCA           0.50 (0.28)  0.52 (0.11)  0.61 (0.07)  0.60 (0.07)
         PPCA (q = 1)   0.46 (0.36)  0.54 (0.10)  0.58 (0.06)  0.58 (0.03)
         PPCA (q = 2)   0.46 (0.34)  0.50 (0.10)  0.56 (0.04)  0.56 (0.03)
         PPCA (q = 4)   0.48 (0.30)  0.53 (0.08)  0.61 (0.06)  0.62 (0.12)
         PPCA (q = 6)   0.54 (0.36)  0.57 (0.07)  0.65 (0.06)  0.72 (0.17)
         PPCA (q = 8)   0.62 (0.35)  0.63 (0.06)  0.73 (0.07)  0.89 (0.23)
         FPCA (L = 1)   0.45 (0.40)  0.53 (0.08)  0.56 (0.06)  0.56 (0.05)
         FPCA (L = 2)   0.45 (0.36)  0.49 (0.08)  0.52 (0.05)  0.53 (0.05)
         FPCA (L = 4)   0.44 (0.36)  0.49 (0.08)  0.53 (0.05)  0.54 (0.04)
         FPCA (L = 6)   0.46 (0.36)  0.50 (0.08)  0.51 (0.05)  0.53 (0.03)
         FPCA (AIC)     0.45 (0.30)  0.50 (0.10)  0.52 (0.06)  0.55 (0.04)
         FPCA (BIC)     0.45 (0.30)  0.50 (0.10)  0.52 (0.06)  0.55 (0.04)

2.3.1.2 Performance Comparison in Outlier Detection

For outlier detection, we compare the modified functional bagplot and the modified functional HDR boxplot, both coupled with the conditional expectation approach, with the robust-principal-component-based functional bagplot and HDR boxplot of Hyndman and Shang (2010). We generate synthetic curves from models that include magnitude outliers and shape outliers, as follows.

• Model 1: The synthetic data trajectories are generated from model (2.2) with L = 2, the mean function µ(t) = t + sin(t), the eigenfunctions φ1(t) = −cos(πt/10)/√5 and φ2(t) = sin(πt/10)/√5, 0 ≤ t ≤ 10, the corresponding eigenvalues λ1 = 4, λ2 = 1, and the variance of the measurement error σ² = 0.25. The FPC scores ξik are generated from N(0, λk) and the measurement errors εi are generated from N(0, σ²). We generate n = 190 synthetic curves with 10 additional magnitude outlying curves generated from the same model but with the shifted mean function µ(t) = (−1)^r K + t + sin(t), where 1 ≤ r ≤ 10 and K = 10.

• Model 2: We generate n = 190 curves with the same structure as Model 1, with 10 additional shape outlying curves generated using the different eigenfunctions φ1(t) = −sin(πt/10)/√5 and φ2(t) = cos(πt/10)/√5.
• Model 3: We simulate 990 curves of the form Y_i(t) = a_i sin(t) + b_i cos(t) without measurement errors, where 0 < t < 2π. The coefficients a_i and b_i follow independent uniform distributions on [0.0, 0.1]. Ten additional curves are randomly simulated with the same functional form, but with a_i and b_i following a uniform distribution on [0.1, 0.12].

We note that Model 1 includes symmetric magnitude outliers as shown in Figure 2.5(a), Model 2 contains shape outliers as shown in Figure 2.5(b), and Model 3 is a harmonic signal process with outlying curves that are not far from the median curve as shown in Figure 2.5(c). Model 3 was used in Hyndman and Shang (2010).

Figure 2.5: Three models of synthetic observations with different patterns of outliers, including (a) Model 1 with symmetric magnitude outliers, (b) Model 2 with shape outliers, and (c) Model 3 with a harmonic signal process.

For Models 1 and 2, the first two functional principal components (FPCs) explain 99.07% and 99.70% of the total variation, whereas the first two robust principal components (RPCs) explain only 87.81% and 85.59% of the total variation. The difference occurs because FPCA takes advantage of the smoothed covariance, which captures most of the information inherent in the functional data structure within the first few FPCs; in contrast, the robust PCA is based on projection pursuit with a robust scale estimator applied to the raw covariance, which results in a relatively low proportion of variance explained by the first few RPCs. For Model 3, the first two FPCs explain 99.40% of the total variation and the first two RPCs explain 99.95%. Since the data generated from Model 3 are quite smooth and free of measurement errors, the robust PCA can take full advantage of the robust scale estimator and performs well.

We first compare the performance of the modified functional bagplots and the functional bagplots for Models 1–3. Figure 2.6 shows the outlier detection results of the two methods. The modified functional bagplots (column 2) work slightly better than the functional bagplots (column 4) in Models 1 and 2. For Model 3, the outlier detection results are identical, but both methods fail to identify many outliers in our simulation study, because the outlying curves are not sufficiently distant from the median, as was shown by Hyndman and Shang (2010).

To further compare the performance of outlier detection, we introduce two performance measures as defined in Sun and Genton (2011); a sketch of their computation is given below.

• pc: the percentage of correctly detected outliers, defined as the number of correctly detected outliers divided by the total number of outlying curves.

• pf: the percentage of falsely detected outliers, defined as the number of falsely detected outliers divided by the total number of non-outlying curves.

A good outlier detection performance requires a high correct detection percentage pc and a low false detection percentage pf. The sample means and standard errors of ˆpc and ˆpf, in percentages, based on 200 simulation replicates are shown in Table 2.4.
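A minimal sketch of these two measures, assuming boolean arrays marking the true outliers and the flagged curves (the function and variable names are ours):

```python
import numpy as np

def detection_rates(is_outlier, flagged):
    """pc: share of true outliers that were flagged; pf: share of
    non-outlying curves falsely flagged (both in percent)."""
    is_outlier = np.asarray(is_outlier, dtype=bool)
    flagged = np.asarray(flagged, dtype=bool)
    pc = 100.0 * np.mean(flagged[is_outlier])
    pf = 100.0 * np.mean(flagged[~is_outlier])
    return pc, pf
```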
Figure 2.6: The modified bivariate bagplot (column 1), the modified functional bagplot (column 2), the bivariate bagplot (column 3), and the functional bagplot (column 4) for a sample of synthetic curves from Models 1, 2, and 3.

Table 2.4: The sample means and standard errors (in parentheses), in percentages, of ˆpc and ˆpf for the modified functional bagplot and the functional bagplot with 200 replications.

Model        Modified functional bagplot   Functional bagplot
1      ˆpc   100.00 (0.00)                 100.00 (0.00)
       ˆpf   0.82 (0.72)                   1.31 (4.69)
2      ˆpc   59.55 (15.99)                 59.00 (16.23)
       ˆpf   1.41 (4.76)                   1.48 (4.81)
3      ˆpc   0.20 (1.40)                   0.20 (1.40)
       ˆpf   0.00 (0.00)                   0.00 (0.00)

Figure 2.7: The modified bivariate HDR boxplot (column 1), the modified functional HDR boxplot (column 2), the bivariate HDR boxplot (column 3), and the functional HDR boxplot (column 4) for a sample of synthetic curves from Models 1, 2 and 3.

For Model 1, both methods have a 100% correct outlier detection rate, yet the modified functional bagplot has a lower false detection rate with a smaller standard error. For Model 2, the modified functional bagplot has a slightly higher correct and a slightly lower false outlier detection rate, both with smaller standard errors. For Model 3, the two methods work similarly and the results are identical. Focusing on Models 1 and 2, even without the help of robust PCA, FPCA can still deliver a satisfactory result.

The functional HDR boxplot requires a pre-specified coverage probability, say α, which can be viewed as the percentage of potential outliers. Since we simulate 200 curves, including 10 outliers, for Models 1 and 2, and 1000 curves, including 10 outliers, for Model 3, it is reasonable to set α = 0.05 for Models 1 and 2, and α = 0.01 for Model 3. A sample of the modified functional HDR boxplots and functional HDR boxplots for the three models is shown in Figure 2.7. The modified functional HDR boxplots (column 2) work better than the functional HDR boxplots (column 4) for Models 1 and 2, especially for Model 1, where the modified functional HDR boxplot correctly identifies all of the outliers without any false detection. Moreover, both methods work equally well on Model 3. We consider different coverage probabilities of the outlying region based on α = 0.01, 0.05 and 0.1 for each model, with the sample means and standard errors reported in Table 2.5. Overall, the modified functional HDR boxplot appears to work better than the functional HDR boxplot, except for Model 3 with α = 0.01; however, the results of the two methods are not significantly different.
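The HDR-based flagging rule itself is short: estimate a bivariate density of the first two score coordinates and flag curves whose density falls below the empirical α-quantile of the fitted density values. A minimal sketch, assuming a matrix of first two FPC scores; this illustrates the rule, not the authors' exact implementation:

```python
import numpy as np
from scipy.stats import gaussian_kde

def hdr_outliers(scores, alpha=0.05):
    """Flag curves outside the (1 - alpha) highest density region
    of the first two FPC scores. scores: (n, 2) array."""
    kde = gaussian_kde(scores.T)           # bivariate kernel density
    dens = kde(scores.T)                   # density at each observation
    cutoff = np.quantile(dens, alpha)      # HDR boundary level
    return dens < cutoff                   # True = flagged as outlier
```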
In addition, the distributions of ˆpc and ˆpf are identical for α = 0.05 and 0.1 in Model 3. For the functional HDR boxplot, ˆpc increases with α so that more outliers are detected, but ˆpf also increases, meaning that more non-outlying curves are flagged as potential outliers. When applying the functional HDR boxplot, one should therefore be careful with the pre-specified α, since the outlier detection performance depends on its choice.

Table 2.5: The sample means and standard errors (in parentheses) of the percentages ˆpc and ˆpf for the modified functional HDR boxplot and the functional HDR boxplot with 200 replications.

                Modified functional HDR boxplot                Functional HDR boxplot
Model      α = 0.01       α = 0.05        α = 0.1         α = 0.01       α = 0.05        α = 0.1
1    ˆpc   18.10 (4.18)   88.05 (9.49)    100.00 (0.00)   17.55 (4.54)   85.70 (10.49)   100.00 (0.00)
     ˆpf   0.10 (0.22)    0.63 (0.50)     5.26 (0.00)     0.13 (0.24)    0.75 (0.55)     5.26 (0.00)
2    ˆpc   18.30 (3.90)   59.00 (13.71)   70.00 (14.39)   18.20 (4.22)   58.55 (13.83)   69.25 (14.63)
     ˆpf   0.09 (0.21)    2.16 (0.72)     6.84 (0.76)     0.09 (0.22)    2.18 (0.73)     6.88 (0.77)
3    ˆpc   97.25 (6.09)   100.00 (0.00)   100.00 (0.00)   97.50 (5.47)   100.00 (0.00)   100.00 (0.00)
     ˆpf   0.03 (0.06)    4.04 (0.00)     9.09 (0.00)     0.03 (0.06)    4.04 (0.00)     9.09 (0.00)

2.3.2 Data Application

We implement the proposed missing value imputation and outlier detection methods on traffic flow data collected by a dual-loop vehicle detector located at 28.45K northbound on National Highway No. 5 in Taiwan, near the entrance of the Shea-San Tunnel at 28.11K northbound. The traffic flow rates were collected at 5-min intervals from April 1 to April 30 in 2009. National Highway No. 5 is the major road northbound from Yilan County to Taipei. Yilan County is located near Taipei, and many people living in Taipei like to go to Yilan County during weekends and holidays for recreational trips. Therefore, numerous trips northbound on National Highway No. 5 are recreational trips coming back from Yilan County to Taipei, especially from the afternoon until the evening on weekends and holidays. As shown in Figure 2.8(a) for the observed traffic flow rate trajectories, the peak hours occur between 14:00 and 21:00. The data set consists of a sample of 30 functional observations, among which 22 were weekdays and 8 were weekends (including the Chinese Tomb-Sweeping holiday on April 5), and a complete daily trajectory contains 288 observations. There were some missing entries caused by detector malfunction, packets lost during transmission, and other reasons. We define the missing ratio of the dataset by p = Σ_{i=1}^{n} m_i/(mn), where n is the number of observed days, m is the maximal number of observed time points, and m_i is the number of missing points of subject i. The overall missing ratio for this data set is 2.28%.

2.3.2.1 Functional Principal Component Analysis

Observing that the traffic flow patterns are distinct on weekends (including holidays) and weekdays, we perform the FPC analysis separately for these two groups. The observed trajectories and the estimated mean functions are displayed in Figures 2.8(a) and 2.9(a). The estimated mean functions indicate that the peak hours on weekends occur from 14:00 to 20:00, while there are two peaks on weekdays, one around 8:00 with lower flow rates and smaller variability and the other around 17:00 with relatively high flow rates and variability. The estimated auto-covariance functions are shown in Figures 2.8(b) and 2.9(b).
On weekdays the first peak of the covariance occurs from about 7:00 to 9:00, corresponding to the commute to work, and the second peak occurs from 16:00 to 18:00, corresponding to the commute home; the period from 9:00 to 16:00 shows a regular state. The variability is more complicated on weekends, with high variability during peak hours. The smoothed covariance surface reveals the structure of the underlying process, which would be difficult to model using traditional parametric approaches. In addition, the eigenfunctions from the decomposition of the estimated covariance are shown in Figures 2.8(d) and 2.9(d), and the cumulative fractions of total variance explained by the leading components are displayed in Figures 2.8(c) and 2.9(c). The results indicate that setting πλ to 0.9 results in 3 components for weekends and 2 components for weekdays, while setting πλ to 0.95 leads to 4 components for both weekends and weekdays. In addition to the FVE criterion, the AIC- and BIC-like criteria proposed by Yao et al. (2005), which are derived under a Gaussian process assumption, can be used as well. For this dataset, the results using these criteria coincide with those based on the FVE criterion with threshold 0.90. The threshold values 0.90 and 0.95 both appear to be reasonable choices; however, the threshold 0.9 renders slightly smaller prediction errors in our simulation study, and thus we choose 0.9 as the threshold value for the real dataset. The three leading FPCs account for 65.81%, 15.20%, and 11.03% of the total variation on weekends, where the first eigenfunction reflects the overall variability in the peak-hour period. In contrast, the two leading components explain 55.01% and 36.42% of the total variation on weekdays, where the first and the second eigenfunctions contrast the variability between early and late times.

Figure 2.10 displays samples of observed daily traffic flow trajectories, along with the predicted functions and the imputed missing values for different missing patterns; a sketch of the conditional-expectation imputation follows the figure captions. The imputed results appear reasonable. In particular, the method can capture the curvature pattern in interval missing data, as shown in Figure 2.10(c).

Figure 2.8: (a) Daily traffic flow trajectories superimposed on the estimated mean function, (b) the estimated covariance function, (c) the cumulative fraction of total variance explained by the leading FPCs, and (d) the estimated eigenfunctions for weekends (including holidays).

Figure 2.9: (a) Daily traffic flow trajectories superimposed on the estimated mean function, (b) the estimated covariance function, (c) the cumulative fraction of total variance explained by the leading FPCs, and (d) the estimated eigenfunctions for weekdays.

Figure 2.10: Three samples of daily traffic flow rate trajectories, (a) April 5 (holiday weekend), (b) April 12 (weekend), and (c) April 22 (weekday), with the observations (dots in gray), the predicted trajectories (curves in blue) and the imputed missing values (dots or curves in red).
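The conditional-expectation imputation can be sketched in a few lines. Given estimates of the mean µ, the eigenfunctions φ_k, the eigenvalues λ_k, and the noise variance σ² on a common grid, the FPC scores of a partially observed trajectory are predicted from its observed points, and the missing values are filled in from the truncated Karhunen-Loève expansion. This is an illustrative, PACE-type implementation under a Gaussian working assumption, not the authors' exact code:

```python
import numpy as np

def impute_pace(y, observed, mu, phi, lam, sigma2):
    """Conditional-expectation imputation of one trajectory.
    y, mu: (m,) values on the grid; observed: (m,) boolean mask;
    phi: (m, L) eigenfunctions; lam: (L,) eigenvalues."""
    phi_o = phi[observed]                        # eigenfunctions at observed points
    resid = y[observed] - mu[observed]
    # Covariance of the observed values: Phi Lambda Phi' + sigma2 I
    cov_o = phi_o @ np.diag(lam) @ phi_o.T + sigma2 * np.eye(phi_o.shape[0])
    # Best linear predictor of the scores given the observed data
    xi = np.diag(lam) @ phi_o.T @ np.linalg.solve(cov_o, resid)
    y_hat = mu + phi @ xi                        # fitted trajectory on the full grid
    out = y.copy()
    out[~observed] = y_hat[~observed]            # fill in the missing entries
    return out
```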
2.3.2.2 Outlier Detection Results

Detecting extreme traffic flow trajectories through visual inspection can be difficult due to large noise and missing values. We perform outlier detection by applying the modified functional bagplot and the modified functional HDR boxplot, both of which use the functional principal component scores based on the conditional expectation approach.

The outlier detection results based on the modified functional bagplot are shown in Figure 2.11. Figure 2.11(a) displays the modified bivariate bagplot, where the red star marks the Tukey median of the bivariate FPC scores, the dark gray region displays the 50% bag, and the light gray region shows the 95% fence. The points for April 4 and April 5, which fall outside the fence, are identified as outliers. The modified functional bagplot is shown in Figure 2.11(b), where the solid black curve (the median curve) corresponds to the median point (red star), and the shaded dark and light gray regions correspond to the bag and the fence, respectively. The outlying curves of April 4 and April 5 are highlighted in green and red. The suspected functional outlier dated April 5 was the Chinese Tomb-Sweeping holiday. Figure 2.11(c) illustrates the modified bivariate HDR boxplot with the setting α = 0.15; we note that with settings of α from 0.01 to 0.1, April 4 and April 5 are identified as outliers. Figure 2.11(d) displays the corresponding modified functional HDR boxplot.

Figure 2.11: (a) The modified bivariate bagplot, (b) the modified functional bagplot, (c) the modified bivariate HDR boxplot (with α = 0.15), and (d) the modified functional HDR boxplot for outlier detection on weekends.

More details on the modified bivariate and functional HDR boxplots follow. April 5 was a Sunday and also the Chinese Tomb-Sweeping holiday, a special holiday on which people return to their hometowns to get together with their families. In addition to recreational trips, there are therefore many back-to-hometown trips, and it is understandable that the traffic flow patterns on April 4 and April 5 were quite different from those of other weekends, as illustrated in Figure 2.11.

Figure 2.12: The modified bivariate HDR boxplots with α = 0.01, α = 0.05 and α = 0.1. Top panels: the modified bivariate HDR boxplots. Bottom panels: the modified functional HDR boxplots.

For weekdays, no outliers are detected based on the modified functional bagplot approach. To use the functional HDR boxplots, a prespecified coverage probability of the outlying region is needed.
We use three coverage probabilities, 99%, 95%, and 90%, corresponding to the settings α = 0.01, 0.05, and 0.1, to perform the outlier detection procedure on the traffic flow data. The top panels of Figure 2.12 illustrate the modified bivariate HDR boxplots, in which the red star marks the mode of the bivariate FPC scores, the darker gray regions display the 50% HDR, and the lighter gray regions display the 99%, 95% and 90% HDRs, respectively. The points outside the light gray regions are identified as outliers. The bottom panels of Figure 2.12 display the corresponding modified functional HDR boxplots, where the black curves correspond to the mode of the bivariate FPC scores, and the shaded dark and light regions correspond to the regions in the modified bivariate HDR boxplots. The modified functional HDR boxplots with α = 0.01, 0.05, and 0.1 detect one, two and three outliers, in the order April 6, April 3, and April 27, highlighted in red, blue, and green, respectively.

The capability of identifying outliers using the functional HDR boxplots depends strongly on the pre-specified α. The results show that more outliers are detected with larger values of α, and the strength of potentially flagged outliers can be assessed by varying the value of α. While the curve corresponding to April 27 is very close to the boundary of the 90% region, the identified outliers on April 3 (Friday) and April 6 (Monday) both surround the April 5 Chinese Tomb-Sweeping holiday, which gives a reasonable interpretation. Based on the outlier detection results, we find that the traffic flow patterns on the special holiday of April 5, and on the days before and after the holiday, differ from the general weekend or weekday patterns. Thus traffic control strategies for such holidays require considerations separate from those for ordinary weekends and weekdays.

2.4 Conclusion

In this study, we proposed a nonparametric functional data approach to missing value imputation and outlier detection for functional data. Our method takes advantage of the functional data features, which can be expanded by the FPC models consisting of the mean function and a stochastic component that captures individual variation. We investigated the numerical performance through comparison with imputation methods popularly used in the literature for handling missing traffic flow data. The proposed FPCA method based on the conditional expectation approach outperforms PPCA and BPCA in addressing traffic flow data incompleteness. Moreover, we proposed modified versions of the functional bagplot and the functional HDR boxplot that apply to the FPC scores. One of the advantages of the proposed approach is that it can be used even for incomplete or irregularly collected functional data. The simulation study shows that the proposed FPCA approach for missing value imputation and the modified outlier detection methods work reasonably well. Although motivated by traffic flow data, the proposed methodology is widely applicable to data that are repeatedly measured over a period of time.

CHAPTER 3

THE NAIVE BAYES CLASSIFIER FOR FUNCTIONAL DATA

The problem of classifying objects is a popular topic in machine learning and statistics.
Classification is the problem of identifying the category to which a new observation belongs, based on a training set of data containing observations whose category memberships are known. The idea is to produce a so-called classifier, which can be viewed as a function, induced by a classification algorithm, that maps input data to a category. Conventional heuristic methods such as the naive Bayes classifiers (Langley et al., 1992; Mitchell, 1997) have been widely used in many applications. The naive Bayes classification approach is based on the assumption of independence between the features, so that the output label relies only on the estimation of univariate conditional probabilities. Experiments demonstrate that, despite the naive design and apparently oversimplified assumptions, naive Bayes classifiers deliver competitive performance compared to other state-of-the-art classifiers in many complex real-world situations (Baesens et al., 2003).

In the finite-dimensional setting, the multivariate probability density function, provided that it exists, is the main tool for constructing the naive Bayes classifier. However, in the infinite-dimensional setting, where the data belong to a functional space, the problem of "the curse of dimensionality" occurs immediately, and as a result the naive Bayes approach is not directly applicable. The difficulty comes from the fact that the notion of a probability density is not well-defined due to the infinite dimensionality of the data; often a probability density function for functional data does not even exist (Delaigle and Hall, 2010). Therefore, a direct extension of density-based classical multivariate classification approaches to functional data cannot be utilized.

On the other hand, functional classification has attracted a lot of attention recently due to a large demand from real-world applications, including real-time signal analysis (Hall et al., 2001), temporal gene expression curves (Leng and Müller, 2006), traffic flow patterns (Chiou, 2012) and many other fields. Methods for the classification of functional data vary, and there has been a rapidly expanding amount of research within the statistical field on the development of functional classification methods. These include k-nearest neighbor (k-NN) methods (Biau et al., 2005; Cérou and Guyader, 2006; Biau et al., 2010), linear discriminant analysis (LDA) (James and Hastie, 2001; Shin, 2008), partial least squares (PLS) (Preda and Saporta, 2005; Preda et al., 2007), logistic regression (Leng and Müller, 2006; Araki et al., 2009), support vector machines (SVM) (Rossi and Villa, 2006; Wu and Liu, 2013; Martin-Barragan et al., 2014), distance-based approaches (Alonso et al., 2012; Galeano et al., 2015), depth-based notions (Cuevas et al., 2007; Sguera et al., 2014), shape descriptors (Epifanio, 2008), nonparametric kernels (Hall et al., 2001; Ferraty and Vieu, 2003), Bayesian methods (Wang et al., 2007), and the centroid method (Delaigle and Hall, 2012). More recently, Bongiorno and Goia (2016) employed the idea of a pseudo-density to build a classifier based on the Bayes rule. However, their method is based on a multivariate kernel density estimator, which is often restricted to low dimensions unless the sample size is extremely large, since the accuracy of nonparametric density estimators decreases rapidly as the dimension increases. Dai et al. (2017) propose to use density ratios of projections on a sequence of common eigenfunctions of the two populations.
Our approach is different from theirs, since we work directly with an extension of the density-based naive Bayes classifier to functional data. Despite differences in how the classifiers are built, functional principal component analysis is used as the main tool for dimension reduction of functional data. Overviews of the methodological foundations of functional data analysis can be found in Ramsay and Silverman (2005) and Ferraty and Vieu (2006), as well as in the review article by Wang et al. (2016).

In this chapter, we work toward the construction of a naive Bayes classifier for functional data. To do so, a density function for a random function is constructed from the functional common principal component (FCPC) scores with the aid of an independence assumption. The FCPC scores are the coefficients of the random functions projected onto an orthogonal basis derived from the decomposition of the common covariance function. The idea of common principal components dates back to Flury (1984) in multivariate analysis, and has been extended by Boente et al. (2010) to functional data analysis. This construction of a density function from the FCPC scores makes the naive Bayes classifier possible for functional data. We then prove that, under some general assumptions, the proposed functional naive Bayes (FNB) classifier is asymptotically equivalent to the true one. We also compare the numerical performance of FNB with logistic regression, k-NN, LDA, and SVM. Several simulation experiments suggest that the proposed FNB method outperforms the other methods.

3.1 The Naive Bayes Classifier for Functional Data

This section describes a model for functional classification: the functional naive Bayes classifier. The setting is as follows. Let X be a random function in L²(I), the Hilbert space of square integrable functions on a compact interval I, equipped with the inner product ⟨f, g⟩ = ∫_I f(t)g(t)dt of two functions f and g, and the associated norm ||·|| = ⟨·,·⟩^{1/2}. The mean function of X is E(X(t)) = µ(t) and the covariance function of X is cov(X(s), X(t)) = Γ(s, t), s, t ∈ I. The covariance Γ is positive definite and has an orthogonal expansion in L², Γ(s, t) = Σ_{j=1}^∞ λ_j φ_j(s) φ_j(t), where {λ_j}_{j=1}^∞ is the set of eigenvalues and {φ_j}_{j=1}^∞ is the corresponding set of eigenfunctions.

Assume that there are populations Π_c, c ∈ {1, . . . , C}, where C is an integer indicating the total number of populations. Let the random variable Y be the group label of X. The prior probability that X is from Π_c is denoted by π_c = P(Y = c). We denote by X^(c) the X given Y = c, and we define µ^(c) and Γ^(c) as the mean and the covariance of X^(c). The covariance Γ^(c) also has an orthogonal expansion in L², Γ^(c)(s, t) = Σ_{j=1}^∞ λ_j^(c) φ_j^(c)(s) φ_j^(c)(t). Our task will be to deduce the group label Y for a new random function X.

The Bayes classifier is a simple probabilistic classifier that assigns each object to the class with the highest conditional probability. Given an observed random function z ∈ L²(I), one predicts the most probable class δ(z) ∈ {1, . . . , C} of z by the classification rule

δ(z) = argmax_{c∈{1,...,C}} P(Y = c) P(X = z | Y = c).   (3.1)

If the conditional probability densities of the random function X exist, we denote them by f^(c), conditioning the group label Y on c, with f^(c) > 0. Then the classification rule would become

δ(z) = argmax_{c∈{1,...,C}} π_c f^(c)(z).   (3.2)
However, this argument does not work directly, as a probability density does not exist for functional data (Delaigle and Hall, 2010). To overcome this difficulty, instead of modeling the joint distribution of Y and X to derive the Bayes classifier, we propose to model the joint distribution of Y and the L² distance between X and z at scales ε > 0, and then use the classifier in (3.2). The small-ball probability P(||X − z|| ≤ ε) measures the concentration of X for different values of z, so it can be viewed as a "surrogate density" for X.

Now, consider the Karhunen-Loève expansion of a random function X:

X(t) = µ(t) + Σ_{j=1}^∞ ξ_j φ_j(t),   (3.3)

where ξ_j = ∫_I (X(t) − µ(t)) φ_j(t) dt, j ≥ 1, are random variables with mean zero and variance λ_j. The quantities ξ_j are termed the functional principal component (FPC) scores corresponding to the random function X; the FPC scores are always uncorrelated due to the orthogonality of the φ_j's. Our goal is to construct the surrogate density for X, that is, to approximate it by the small-ball probability P(||X − z|| ≤ ε). A key idea in the naive Bayes classifier for functional data is the strong and naive independence assumption on the ξ_j's. This is analogous to the finite-dimensional setting, where the features are typically assumed independent in the classical naive Bayes classifier; we also note that the assumption holds exactly if X is a Gaussian process. With the aid of the independence assumption on the ξ_j's, Delaigle and Hall (2010) showed that the probability of X belonging to a ball of radius ε > 0 centered at z can be written as

p(z|ε) = P(||X − z|| ≤ ε) = [(ε π^{1/2})^J / Γ(J/2 + 1)] { Π_{j=1}^J f_j(z_j) } exp{o(J)},   (3.4)

where J = J(ε) diverges to infinity as ε decreases to zero, f_j is the probability density function of ξ_j, and z_j is the projection of z − µ in the direction of the j-th eigenfunction φ_j, i.e., z_j = ∫_I (z(t) − µ(t)) φ_j(t) dt, j ≥ 1, for any square-integrable function z on I. We denote the surrogate density function of X by f(·|ε), indicating that it is related to the choice of ε. Although the density function of X does not exist, it is desirable that the surrogate density function of X satisfy

P(||X − z|| ≤ ε) = ∫_{D_z} f(u|ε) du,   (3.5)

where D_z = {u ∈ L²(I) : ||u − z|| ≤ ε}. For a small ε > 0, the relation between (3.4) and (3.5) suggests that

f(z|ε) ∝ Π_{j=1}^J f_j(z_j).   (3.6)

Equation (3.6) implies that the product of the probability density functions f_j, for j = 1, . . . , J, can serve as an approximation of the surrogate density f(·|ε) of the random function X (a short code sketch follows this discussion). In practice the densities f_j are estimated nonparametrically via kernel density estimation. The effective dimension J for a given value of the scale ε will be discussed in Section 3.2.

To estimate the conditional probability densities f^(c), we need to extract ξ_j^(c) from X^(c). However, if we simply calculate ξ_j^(c) from ∫_I (X^(c)(t) − µ^(c)(t)) φ_j^(c)(t) dt, this will lead to different scales of ξ_j^(c), for c = 1, . . . , C, due to the different means and eigenfunctions of each group. It is more sensible to subtract the overall mean from X^(c) and then project the data onto the same basis. Here we adopt the idea of common principal components from multivariate analysis.
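To make (3.6) concrete, the following sketch evaluates the log surrogate density of a new projection vector from kernel density estimates fitted to training scores (an illustration with our own variable names; summing logs avoids underflow of the product):

```python
import numpy as np
from scipy.stats import gaussian_kde

def log_surrogate_density(z_proj, train_scores):
    """z_proj: (J,) projections of a new curve; train_scores: (n, J)
    scores of one group. Returns sum_j log f_j(z_j)."""
    logf = 0.0
    for j in range(train_scores.shape[1]):
        kde = gaussian_kde(train_scores[:, j])   # univariate KDE of score j
        logf += np.log(kde(z_proj[j])[0])        # independence: densities multiply
    return logf
```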
We write the overall mean function µ(t) = Σ_{c=1}^C π_c µ^(c)(t) and impose the common eigenfunction assumption on all populations, that is, φ_j^(1) = φ_j^(2) = ··· = φ_j^(C) := φ_j for j ≥ 1. This assumption avoids comparing modes of variation expressed in different bases. Under this assumption we can write Γ^(c)(s, t) = Σ_{j=1}^∞ λ_j^(c) φ_j(s) φ_j(t), where the φ_j's are the common eigenfunctions. We then define the common covariance function Γ(s, t) = Σ_{c=1}^C π_c Γ^(c)(s, t). Then φ_j is also the j-th eigenfunction of Γ, with the pooled eigenvalue λ_j = Σ_{c=1}^C π_c λ_j^(c). The functional common principal component (FCPC) score under group c is calculated by ξ_j^(c) = ∫_I (X^(c)(t) − µ(t)) φ_j(t) dt, and the conditional probability density f_j^(c) is then estimated from the FCPC scores ξ_j^(c). We note that the FCPC score is not the FPC score in general. The criterion function for the functional naive Bayes classifier can be rewritten as

δ_J(z) = argmax_{c∈{1,...,C}} π_c Π_{j=1}^J f_j^(c)(z_j),   (3.7)

where f_j^(c) is the density of the j-th FCPC score under group c. This criterion function is also known as the maximum a posteriori (MAP) rule in Bayesian statistics. Although the independence assumption is usually false and the probability estimates of naive Bayes are of low quality, the classification decisions are still quite good, given that we use the estimates only to make the classification and not to predict the actual probabilities accurately. Delaigle and Hall (2010) also pointed out that the small-ball probability in (3.4) can capture the variation with z. Thus, we classify z into population Π_c if and only if the criterion function δ_J(z) is maximized at group c. The theoretical properties of the criterion (3.7) will be discussed in Section 3.2.

3.1.1 Model Estimation

We now consider how the prior probabilities π_c and the conditional probability density functions f_j^(c) can be estimated from data. Suppose we have training samples that consist of examples (y_i, x_i) for i = 1, . . . , n. The prior probability π_c can be simply estimated by ˆπ_c = n_c/n, where n_c is the number of times that the label c is seen in the training set, i.e., n_c = Σ_{i=1}^n 1{y_i = c}, where 1{·} is the indicator function, equal to 1 if y_i = c and 0 otherwise. As noted in Dai et al. (2017), when constructing a classifier for functional data it is often unrealistic to have fully observed trajectories; in practice, a random function is observed at discretized time points rather than as an entire trajectory. We denote the discrete observations by x_il = x_i(t_il), where t_il is the l-th recording time of the i-th curve in ascending order, l = 1, . . . , m_i, and m_i is the number of observations for the i-th curve. Moreover, the sampled random function is often contaminated with measurement errors. In this case, a presmoothing step is performed on the discrete observations, and the smoothed trajectory is then treated as a fully observed trajectory.

Since the overall mean function µ(t) and the common eigenfunctions φ_j(t) play a central role in the calculation of the FCPC scores ξ_j^(c) for c = 1, . . . , C, we wish to construct estimators ˆµ and ˆφ_j of µ and φ_j, respectively. We briefly summarize the procedure as follows. The overall mean function µ(t) is estimated by ˆµ(t) = Σ_{c=1}^C ˆπ_c ˆµ^(c)(t), where ˆµ^(c)(t) is the estimated mean function under group c.
We apply locally weighted polynomial regression (Fan and Gijbels, 1996) to the curves in a given group c:

min_{(α0,α1)} Σ_{i=1}^{n_c} Σ_{l=1}^{m_i} K_{h_µ^c}(t_il − t) { x_il^(c) − α0 − α1(t_il − t) }²,

where K_{h_µ^c}(·) is a known kernel with bandwidth h_µ^c, and take ˆµ^(c)(t) = ˆα0. Similarly, the common covariance Γ(s, t) is estimated by ˆΓ(s, t) = Σ_{c=1}^C ˆπ_c ˆΓ^(c)(s, t), where ˆΓ^(c)(s, t) is the estimated covariance function under group c. We apply two-dimensional scatterplot smoothing to the raw covariances Γ_i^(c)(t_ij, t_il) = (x_ij^(c) − ˆµ(t_ij))(x_il^(c) − ˆµ(t_il)). To address the problem of measurement errors, only the off-diagonal elements Γ_i^(c)(t_ij, t_il), for j ≠ l, are included in the smoothing step. The covariance estimate is then obtained by fitting a local linear plane,

min_{(α0,α1,α2)} Σ_{i=1}^{n_c} Σ_{1≤j≠l≤m_i} K_{h_Γ^c}(s − t_ij, t − t_il) { Γ_i^(c)(t_ij, t_il) − α0 − α1(t_ij − s) − α2(t_il − t) }²,

where K_{h_Γ^c}(·,·) is a two-dimensional non-negative kernel function with bandwidth h_Γ^c, and we set ˆΓ^(c)(s, t) = ˆα0. The estimates of the common eigenfunctions and the pooled eigenvalues correspond to the solutions ˆφ_j and ˆλ_j of the eigenequations

∫_I ˆΓ(s, t) ˆφ_j(s) ds = ˆλ_j ˆφ_j(t),

subject to ∫_I ˆφ_j²(t) dt = 1 for j = 1, 2, . . . , and ∫_I ˆφ_j(t) ˆφ_k(t) dt = 0 for j < k. We note that the consistency of ˆµ^(c) and ˆΓ^(c) will be discussed in Section 3.2, and the consistency of ˆµ and ˆΓ follows from the convergence of ˆµ^(c) and ˆΓ^(c) together with the convergence of ˆπ_c to π_c. The rate of convergence for ˆφ_j is the same as the rate for ˆΓ.

Denoting by ξ_ij^(c) the j-th FCPC score of the i-th observation under group c, the score ξ_ij^(c) = ∫_I (x_i^(c)(t) − ˆµ(t)) ˆφ_j(t) dt is estimated via numerical approximation by ˆξ_ij^(c) = Σ_{l=1}^{m_i} (x_il^(c) − ˆµ(t_il)) ˆφ_j(t_il) Δt_il, for j = 1, . . . , J. Note that the ˆξ_ij^(c) are the empirical versions of the ξ_ij^(c). Nonparametric estimates of the densities of the ξ_ij^(c) are then obtained by kernel density estimation. The kernel density estimate for the j-th FCPC score in group c is given by

ˆf_j^(c)(u) = (1/(n_c h_j^c)) Σ_{i=1}^{n_c} K( (u − ˆξ_ij^(c)) / h_j^c ),   (3.8)

where u ∈ R, K is a kernel function, and h_j^c is a bandwidth. For feasibility, the bandwidth h_j^c is chosen by a cross-validation procedure; that is, h_j^c is taken to be the optimal bandwidth minimizing the asymptotic mean integrated squared error (see Ferraty and Vieu, 2006). Now, write ˆz_j = ∫_I (z(t) − ˆµ(t)) ˆφ_j(t) dt for any function z ∈ L²(I). The kernel-density-based criterion function for the functional naive Bayes classifier (3.7) is thus

ˆδ_J(z) = argmax_{c∈{1,...,C}} ˆπ_c Π_{j=1}^J ˆf_j^(c)(ˆz_j).   (3.9)

We classify a newly observed random function X* with unknown label Y* by setting Y* = ˆδ_J(X*); a sketch of the full procedure is given below.
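Putting the pieces of this subsection together, a compact end-to-end sketch of the FNB classifier on a common, presmoothed grid might look as follows. This is an illustration under our own function names: the pooled covariance, its eigen-decomposition, the per-group KDEs, and the MAP rule (3.9) follow the text, but the bandwidths are left at the default of gaussian_kde rather than cross-validated, and the covariance is not smoothed.

```python
import numpy as np
from scipy.stats import gaussian_kde

def fnb_fit(X, y, grid, J):
    """X: (n, m) presmoothed curves on a common grid; y: (n,) labels."""
    w = grid[1] - grid[0]                        # quadrature weight
    labels, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)
    mu = np.zeros(len(grid))
    Gamma = np.zeros((len(grid), len(grid)))
    for lab, pi in zip(labels, priors):
        Xc = X[y == lab]
        mu += pi * Xc.mean(axis=0)               # overall mean
        Gamma += pi * np.cov(Xc, rowvar=False)   # pooled (common) covariance
    evals, evecs = np.linalg.eigh(Gamma * w)
    phi = evecs[:, ::-1][:, :J] / np.sqrt(w)     # leading common eigenfunctions
    kdes = {}
    for lab in labels:
        scores = (X[y == lab] - mu) @ phi * w    # FCPC scores via quadrature
        kdes[lab] = [gaussian_kde(scores[:, j]) for j in range(J)]
    return labels, priors, mu, phi, w, kdes

def fnb_predict(z, model):
    labels, priors, mu, phi, w, kdes = model
    zj = (z - mu) @ phi * w                      # projections of the new curve
    post = [np.log(pi) + sum(np.log(kdes[lab][j](zj[j])[0]) for j in range(len(zj)))
            for lab, pi in zip(labels, priors)]
    return labels[int(np.argmax(post))]          # MAP rule (3.9)
```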
In the next section, we study the asymptotic properties of the classifier ˆδ_J and show that it is a consistent estimator of δ_J as n goes to infinity.

3.2 Theoretical Results

In this section we examine theoretical properties of the functional naive Bayes classifier. We begin with the consistency of the estimated overall mean function and the estimated common covariance function. Under the regularity conditions stated in Yao et al. (2005), we obtain uniform convergence rates of ˆµ^(c)(t) to µ^(c)(t) and of ˆΓ^(c)(s, t) to Γ^(c)(s, t). The consistency of ˆµ(t) for µ(t) and that of ˆΓ(s, t) for Γ(s, t) are then immediate corollaries, given the consistency of ˆπ_c for π_c. Without loss of generality, we make the following simplifications in the theoretical analysis: we assume that there is an equal prior probability that an observation is from Π_c, and that there is an equal number of observations in each population, i.e., π_1 = π_2 = ··· = π_C and n_1 = n_2 = ··· = n_C.

Proposition 1. Assume that the regularity conditions on the design points, the kernel functions, the bandwidths of the kernel functions, and the moments of X listed as (A1.1)-(A4) and (B1.1)-(B2.2) in Yao et al. (2005) hold. Given the group memberships of the observed curves, for each group c, c = 1, . . . , C, the estimated mean function and the estimated covariance function satisfy the uniform consistency properties

sup_{t∈I} |ˆµ^(c)(t) − µ^(c)(t)| = O_p(τ_n^c)   (3.10)

and

sup_{s,t∈I} |ˆΓ^(c)(s, t) − Γ^(c)(s, t)| = O_p(γ_n^c),   (3.11)

with the sequences τ_n^c = (n_c^{1/2} h_µ^c)^{−1} → 0 and γ_n^c = (n_c^{1/2} (h_Γ^c)²)^{−1} → 0 as n_c → ∞, where h_µ^c and h_Γ^c are the bandwidths for estimating the mean and covariance functions, respectively.

Proof. The uniform consistency of ˆµ^(c) and ˆΓ^(c) in Proposition 1 is obtained by applying Theorem 1 in Yao et al. (2005) to each group c.

Proposition 2. Under the assumptions stated in Proposition 1, the estimated overall mean function and the estimated common covariance function satisfy the uniform consistency properties

sup_{t∈I} |ˆµ(t) − µ(t)| = O_p(α_n)   (3.12)

and

sup_{s,t∈I} |ˆΓ(s, t) − Γ(s, t)| = O_p(η_n).   (3.13)

Furthermore, the common eigenfunctions satisfy the uniform consistency property

sup_{t∈I} | ˆφ_j(t) − φ_j(t)| = O_p(η_n)   (3.14)

for fixed j, j = 1, 2, . . . , with α_n = Σ_{c=1}^C τ_n^c → 0 and η_n = Σ_{c=1}^C γ_n^c → 0 as n → ∞.

Proof. We first establish the consistency of ˆπ_c for π_c. The rate of convergence for ˆπ_c can be derived from Hoeffding's inequality: for ε > 0,

P(|ˆπ_c − π_c| > ε) ≤ 2e^{−2nε²}.

Letting δ = 2e^{−2nε²}, so that ε = { log(2/δ)/(2n) }^{1/2}, the above inequality yields

P( |ˆπ_c − π_c| ≤ { (1/(2n)) log(2/δ) }^{1/2} ) > 1 − δ,

which implies that

|ˆπ_c − π_c| = O_p(n^{−1/2}).   (3.15)

The rate of convergence for the overall mean function ˆµ(t) can then be derived as follows:

sup_{t∈I} |ˆµ(t) − µ(t)| = sup_{t∈I} | Σ_{c=1}^C ( ˆπ_c ˆµ^(c)(t) − π_c µ^(c)(t) ) |
  ≤ Σ_{c=1}^C ˆπ_c sup_{t∈I} | ˆµ^(c)(t) − µ^(c)(t) | + Σ_{c=1}^C |ˆπ_c − π_c| sup_{t∈I} |µ^(c)(t)|
  = O_p( Σ_{c=1}^C τ_n^c + n^{−1/2} ) = O_p(α_n),

where the second equality follows from (3.10) and (3.15), with sup_{t∈I} |µ^(c)(t)| bounded for all c = 1, . . . , C.
The rate of convergence for ≥ C(cid:88) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) C(cid:88) (cid:0)ˆπc(cid:98)Γ(c)(s, t) − πcΓ(c)(s, t)(cid:1)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) C(cid:88) C(cid:88) (cid:12)(cid:12)ˆπc (cid:0)(cid:98)Γ(c)(s, t) − Γ(c)(s, t)(cid:1)(cid:12)(cid:12) + sup (cid:12)(cid:12)(ˆπc − πc)Γ(c)(s, t)(cid:12)(cid:12) C(cid:88) (cid:12)(cid:12)Γ(c)(s, t)(cid:12)(cid:12) (cid:12)(cid:12) sup (cid:12)(cid:12)ˆπc − πc (cid:12)(cid:12)(cid:98)Γ(c)(t) − Γ(c)(s, t)(cid:12)(cid:12) +  + Op(n−1/2) = Op(ηn), ≤ C(cid:88)  C(cid:88) ≤ sup s,t∈I t∈I c=1 s,t∈I c=1 sup s,t∈I c=1 = Op γc n c=1 s,t∈I sup s,t∈I c=1 and the rate (3.14) is direct consequences of the rate (3.13). c=1 To investigate the properties of surrogate densities f (c)(·|ε) for each group c = 1, . . . , C, we make the following assumptions (A1)-(A4), which are parallel to assumptions (3.6)-(3.9) in Delaigle and Hall (2010). These assumptions are made analogous to conditions A1.-A4. in Dai et al. (2017) for two populations. We extend these assumptions to more general case of C populations. (A1) For all large D > 0 and some δ > 0, supt∈I E{|X(c)(t)|D} < ∞ and sups,t∈I:s(cid:54)=t E[{|s− t|−δ|X(c)(s) − X(c)(t)|}D] < ∞; (A2) For each integer r ≥ 1, (λ j )−rE{(cid:82)I (X(c)(t) − µ(t))φj(t)dt}2r is bounded uniformly (c) in j; (A3) The eigenvalues in each of the sequences {λ (c) j }∞ j=1 are all different, and so are the pooled eigenvalues {λj}∞ j=1; 47 f (c)(x|ε) ∝ J(cid:89) (A4) The density of the j-th standardized FCPC score is bounded and has a bounded deriva- tive; the kernel K is a symmetric, compactly supported probability density function j)3 j = hj(nc) = O(n−δ c ) and n1−δ with two bounded derivatives; for some δ > 0, hc is bounded away from zero as nc → ∞. (hc c Proposition 3. Assuming (A1)-(A4) hold and FCPC scores are independent, the surrogate density of each group f (c)(·|ε) can be approximated by the product of probability density functions f (c) j , for j = 1, . . . , J with J = J(ε) → ∞ as ε → 0. f (c) j (xj). (3.16) j=1 Proof. Under assumptions (A1)-(A4) for each group c this is an immediately result from Delaigle and Hall (2010) and (3.4) to (3.6). As noted in Delaigle and Hall (2010), the choice of J depends on the rate of convergence of the sequence of eigenvalues λj to zero. We list below the effective dimension J for a given value of scale ε associated with the different eigenvalues decays: • if {λj}∞ j=1 decays exponentially (cid:88) k≥j+1 λ−1 j λk is bounded as j → ∞, then we take J to be the unique integer for which λJ+1 < ε2 < λJ . • if {λj}∞ j=1 decays super-exponentially λj+1 λj → 0 or, equivalently, λ−1 j λk → 0, (cid:88) k≥j+1 then we take J such that the value of ε2/λJ is sufficiently close to 1. From a practical point of view, the eigenvalue sequence λj decreases to zero at an exponential rate is essential in determining the effectiveness of J. In general, the integer J acts as the dimension of the approximated small-ball probability p(x|ε) at scale ε. In other word, the 48 value of J can be interpreted as the dimension of the scale space when the unit of scale is ε. Moreover, for a given scale ε the case of eigenvalues that decay exponentially suggests that the effective dimension J can be chosen such that ε2 ≈ λJ (3.17) In practice, for a predetermined ε, we choose the value of J such that the value of ε2 is approximately equal to ˆλJ . 
Alternatively, one can simply increase the value of J, assess how the classification performance changes as J increases, and determine the optimal value of J empirically.

Next we investigate the asymptotic properties of the FNB classifier ˆδ_J defined in (3.9). We first introduce a useful lemma that states the asymptotic equivalence of the estimated kernel density function to the true one. Let S(α) = {x : ||x|| ≤ α} be a bounded set of square integrable functions for α > 0.

Lemma 2. Assume (A1)-(A4) hold. Then for any j ≥ 1 and c = 1, . . . , C,

sup_{x∈S(α)} | ˆf_j^(c)(ˆx_j) − f_j^(c)(x_j) | = O_p( h_j^c + ( n_c h_j^c / log n_c )^{−1/2} ).   (3.18)

This lemma may essentially be found in Dai et al. (2017). Because it plays a central role in this work and the argument is brief, we give the proof here.

Proof. Let g_j^(c) be the density function of the standardized FCPC scores and ˆg_j^(c) be the kernel density estimate of g_j^(c) based on the estimated standardized FCPC scores. Without loss of generality, we assume the standardized FCPC scores have mean zero; this assumption can easily be relaxed by centering the standardized FCPC scores at their mean. Since f_j^(c)(x_j) = (λ_j^(c))^{−1/2} g_j^(c)( x_j / (λ_j^(c))^{1/2} ), we have

sup_{x∈S(α)} | ˆf_j^(c)(ˆx_j) − f_j^(c)(x_j) |
  = sup_{x∈S(α)} | (ˆλ_j^(c))^{−1/2} ˆg_j^(c)( ˆx_j / (ˆλ_j^(c))^{1/2} ) − (λ_j^(c))^{−1/2} g_j^(c)( x_j / (λ_j^(c))^{1/2} ) |
  ≤ sup_{x∈S(α)} (ˆλ_j^(c))^{−1/2} | ˆg_j^(c)( ˆx_j / (ˆλ_j^(c))^{1/2} ) − g_j^(c)( x_j / (λ_j^(c))^{1/2} ) |
    + sup_{x∈S(α)} | (ˆλ_j^(c))^{−1/2} − (λ_j^(c))^{−1/2} | g_j^(c)( x_j / (λ_j^(c))^{1/2} )
  = O_p( h_j^c + ( n_c h_j^c / log n_c )^{−1/2} ),

where the last equality uses the consistency of ˆλ_j^(c) and the standard uniform rate for kernel density estimators.

Theorem 2. Assume that all populations have the common eigenfunctions, that the FCPC scores are independent, that (A1)-(A4) hold, and that the value of J is chosen by (3.17). Then the criterion function for the FNB classifier ˆδ_J is an M-estimator and

P( ˆδ_J(X) ≠ δ_J(X) ) → 0 as n → ∞.

Proof. We define

M_n(δ(X)) = ˆπ_δ Π_{j=1}^J ˆf_j^(δ)(ˆx_j) and M(δ(X)) = π_δ Π_{j=1}^J f_j^(δ)(x_j).   (3.19)

The criterion for the FNB classifier is to find an estimator δ(X) such that M_n(δ(X)) is maximized over the parameter space ∆ = {1, . . . , C}; hence ˆδ_J(X) is an M-estimator.
We note that

sup_{δ_J∈∆} sup_{x∈S(α)} | M_n(δ_J(x)) − M(δ_J(x)) |
  ≤ sup_{δ_J∈∆} sup_{x∈S(α)} ˆπ_{δ_J} | Π_{j=1}^J ˆf_j^(δ_J)(ˆx_j) − Π_{j=1}^J f_j^(δ_J)(x_j) |
    + sup_{δ_J∈∆} sup_{x∈S(α)} | ˆπ_{δ_J} − π_{δ_J} | Π_{j=1}^J f_j^(δ_J)(x_j)
  = O_p( Σ_{j=1}^J { h_j^{δ_J} + ( n_{δ_J} h_j^{δ_J} / log n_{δ_J} )^{−1/2} } ) + O_p(n^{−1/2}),

by Lemma 2, (3.15), and the boundedness of the densities. The above bound implies that

sup_{δ_J∈∆} | M_n(δ_J(X)) − M(δ_J(X)) | →p 0 as n → ∞;

that is, the relevant class is a Glivenko-Cantelli class. By the properties of δ_J(X), it attains the maximum of M, so for any δ'_J(X) ≠ δ_J(X) we have

M(δ'_J(X)) < M(δ_J(X)).

Then, for any estimator ˆδ_J(X) attaining the maximum of M_n, we have

M_n(ˆδ_J(X)) ≥ M_n(δ_J(X)) − o_p(1).

By the argmax continuous mapping theorem (van der Vaart, 1998), we conclude that ˆδ_J(X) converges to δ_J(X) in probability; in other words, P(ˆδ_J(X) ≠ δ_J(X)) → 0 as n → ∞.

Theorem 2 provides the asymptotic equivalence of the estimated version of the functional naive Bayes classifier based on the kernel density estimates to the true one.

3.3 Simulation Study and Data Application

3.3.1 Simulation Study

To examine the performance of the proposed method, a simulation study was conducted. We consider several simulation settings with various training and testing datasets, and we compare the presented method with other algorithms by computing classification error rates. All computations are done with the software MATLAB. We compared the functional naive Bayes (FNB) approach with the following methods:

• Multinomial logistic regression (logistic): The subroutine "mnrfit" is used to fit the multinomial logistic regression. The estimated FCPC scores are treated as predictor variables and the group labels are treated as response variables.

• k-nearest neighbors (k-NN): The subroutine "fitcknn" is used to train the k-NN model with the classic L² metric; see Ferraty and Vieu (2006), for example. The number of neighbors k was determined by 10-fold cross-validation.

• Linear discriminant analysis (LDA): The subroutine "fitcdiscr" is used to train the LDA model, which determines a linear discriminant that classifies the FCPC scores.

• Support vector machines (SVM): The package "libsvm" is used for the SVM with the RBF kernel, for which the cost and gamma parameters were left at their defaults. The package is available at https://www.csie.ntu.edu.tw/~cjlin/libsvm/.

We note that, from the functional data point of view, each datum in functional data is a realization of a sample path from an underlying stochastic process.
In order to evaluate the performance of the FNB, we generate synthetic data trajectories from the model

$$X_{il}^{(c)} = \mu^{(c)}(t_{il}) + \sum_{j=1}^{J} \xi_{ij}^{(c)} \phi_j(t_{il}) + \epsilon_{il}^{(c)}, \quad i = 1, \dots, n_c,$$

where $n_c$ is the number of samples for each group $c = 1, \dots, C$. We set $C = 6$ and $J = 20$ for all simulations. The variates $\xi_{ij}^{(c)}$ are generated as independent and identically distributed $N(0, \lambda_j^{(c)})$, where the $\lambda_j^{(c)}$ are the eigenvalues corresponding to $\phi_j$ for class $c$, and the measurement errors $\epsilon_{il}^{(c)}$ are independent and identically distributed $N(0, \sigma^2)$. The basis for the linear combination of functions is the Fourier basis, where the $\phi_j$ take the form

$$\phi_1(t) = 1, \quad \phi_{2r}(t) = \sqrt{2}\cos(2r\pi t), \quad \phi_{2r+1}(t) = \sqrt{2}\sin(2r\pi t), \quad r = 1, 2, \dots, \ t \in [0, 1].$$

The time points are generated from a regular design of 51 equally spaced time points from 0 to 1. The combinations of the mean functions $\mu^{(c)}(t)$ and eigenvalues $\lambda_j^{(c)}$ across groups are as follows:

• Model 1: the $\mu^{(c)}(t)$ are different and the $\lambda_j^{(c)}$ are the same: $\mu^{(1)}(t) = 0$, $\mu^{(2)}(t) = \sqrt{2}\sin(4\pi t)$, $\mu^{(3)}(t) = (t + 0.75)^{-3}$, $\mu^{(4)}(t) = 2\cos(2\pi t)$, $\mu^{(5)}(t) = 2 - 4\exp(-6t)$, $\mu^{(6)}(t) = -3(t - 0.5)$, and $\lambda_j^{(c)} = \exp(-j/3)$ for $c = 1, \dots, C$.

• Model 2: the $\mu^{(c)}(t)$ are the same and the $\lambda_j^{(c)}$ are different: $\mu^{(c)}(t) = 0$ for $c = 1, \dots, C$, and $\lambda_j^{(1)} = \exp(-j)$, $\lambda_j^{(2)} = \exp(-\tfrac{2}{3}j)$, $\lambda_j^{(3)} = \exp(-\tfrac{1}{3}j)$, $\lambda_j^{(4)} = \exp(-\tfrac{1}{2}j)$, $\lambda_j^{(5)} = \exp(-\tfrac{1}{4}j)$, $\lambda_j^{(6)} = \exp(-\tfrac{3}{4}j)$.

• Model 3: the $\mu^{(c)}(t)$ and $\lambda_j^{(c)}$ are different for all groups: the mean functions $\mu^{(c)}(t)$ are those of Model 1 and the eigenvalues $\lambda_j^{(c)}$ are those of Model 2.

We remark that the mean functions $\mu^{(c)}(t)$ in Model 1 were used in Coffey et al. (2014) and were chosen to reflect real-life time-course gene expression data. The variance of the measurement error is set to $\sigma^2 = 0.01$ for each observation in all models. In each setting we generated sample sizes of $n = 60$, 120, and 300 for the training set and $m = 120$, 300, and 600 for the testing set. Each simulated trajectory has probability $1/C$ of belonging to group $c$. Examples for Models 1, 2, and 3 are plotted in Figures 3.1, 3.2, and 3.3, respectively. The simulation was repeated 200 times.

[Figure 3.1: Example dataset generated from Model 1 for the simulation study; panels (a)-(f) show Groups 1-6 and panel (g) shows all groups.]
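A minimal MATLAB sketch of this data-generating mechanism for a single group under Model 1 follows; the chosen group mean ($\mu^{(2)}$), the seed, and the even/odd ordering of the cosine and sine terms reflect the reconstruction above and are illustrative assumptions.

% Generating synthetic trajectories for one group under Model 1 (a sketch).
rng(1);
t = linspace(0,1,51); J = 20; nc = 60; sigma2 = 0.01;
Phi = zeros(J, numel(t)); Phi(1,:) = 1;      % Fourier basis functions
for j = 2:J
    r = floor(j/2);
    if mod(j,2) == 0
        Phi(j,:) = sqrt(2)*cos(2*r*pi*t);    % phi_{2r}
    else
        Phi(j,:) = sqrt(2)*sin(2*r*pi*t);    % phi_{2r+1}
    end
end
lambda = exp(-(1:J)/3);                      % common eigenvalues under Model 1
mu = sqrt(2)*sin(4*pi*t);                    % group-2 mean function, for example
xi = randn(nc,J) .* sqrt(lambda);            % FCPC scores xi_ij ~ N(0, lambda_j)
X = mu + xi*Phi + sqrt(sigma2)*randn(nc,numel(t));  % nc noisy trajectories
plot(t, X');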
The sample means and associated standard errors of the misclassification error rate are presented in Tables 3.1 and 3.2 for $n = 60, 120, 300$ and $m = 120, 300, 600$, based on 200 simulation replicates for the three models. We explore the behavior of the FNB with $J = 1, 5, 10$ and 20. For Model 1, it is clear that LDA has the best performance and FNB is second best when $J \ge 10$. This result is not surprising: LDA maximizes the differences between groups, so when the mean functions differ, LDA separates the groups very well. For Models 2 and 3, FNB consistently outperforms all other methods for $J \ge 10$.

When the mean functions are different (Models 1 and 3), the misclassification error rate decreases as the training sample size $n$ increases for all methods. On the other hand, when the mean functions are the same (Model 2), FNB is the only method whose misclassification error decreases as the training sample size $n$ increases. We observe that the misclassification rate of FNB decreases as $n$ increases for a fixed number of components $J$, which illustrates well the consistency result stated in Theorem 2.

[Figure 3.2: Example dataset generated from Model 2 for the simulation study; panels (a)-(f) show Groups 1-6 and panel (g) shows all groups.]

[Figure 3.3: Example dataset generated from Model 3 for the simulation study; panels (a)-(f) show Groups 1-6 and panel (g) shows all groups.]

The boxplots in Figure 3.4 display the variation of the misclassification error rate for Model 1. FNB with $J = 1$ has the largest bias, and this bias is dramatically reduced as $J$ increases. In general, when $J \ge 10$, FNB performs second best for Model 1 among the methods considered. We note that the performance of SVM relies heavily on the choice of the kernel as well as the tuning parameters; in some situations more than 80% of the testing data are misclassified when the kernel is misspecified. For Models 2 and 3, where the eigenvalues differ between groups, FNB performs the best. The boxplot results are shown in Figures 3.5 and 3.6 for Models 2 and 3, respectively.

3.3.2 Data Applications

In this section we illustrate the performance of the proposed FNB approach on four real datasets from very different research areas: traffic flow, fish species, leaf images, and growth curves. To evaluate the classification performance, the error rate from 10-fold cross-validation is used: for each available sample, 9/10 of the data are used to train the methods and the misclassification error is then estimated on the remaining data. We repeat the process 200 times and report the mean misclassification rates and the standard deviations of the mean estimates.
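A minimal MATLAB sketch of this 10-fold cross-validation loop follows; the data, the classifier used inside the loop, and its tuning are placeholder assumptions, with any of the classifiers above substitutable in the marked line.

% 10-fold cross-validated misclassification rate (a sketch).
S = randn(113,10); y = randi(5,113,1);       % placeholder scores and labels
cvp = cvpartition(y, 'KFold', 10);           % stratified 10-fold partition
err = zeros(cvp.NumTestSets,1);
for k = 1:cvp.NumTestSets
    tr = training(cvp,k); te = test(cvp,k);
    mdl = fitcknn(S(tr,:), y(tr), 'NumNeighbors', 5);  % any of the classifiers
    err(k) = mean(predict(mdl, S(te,:)) ~= y(te));     % fold error rate
end
cvError = mean(err);                         % estimated misclassification rate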
The first data example concerns the classification of daily traffic volume trends from highway traffic flow data. We use a dataset archived on the Caltrans Performance Measurement System (PeMS), available at http://pems.dot.ca.gov/. The traffic flow rates were collected at 5-min intervals from February 8 to July 31, 2013 (I15-S@CA PM14.3, District 8, Riverside County, City of Murrieta; Detector ID: 817371). Since the traffic flow patterns are distinct on weekdays and weekends (including holidays) and are similar across weekdays, we use only the weekday data in this analysis. Each daily curve in the dataset was collected by a vehicle detector at 288 equispaced instants of time in the interval [0, 24]. Following the procedure described in Chiou et al. (2014b), the missing values were imputed and the outliers were identified and removed from the dataset. Overall, 113 daily traffic flow patterns are analyzed: 23 from Mondays, 23 from Tuesdays, 23 from Wednesdays, 21 from Thursdays and 23 from Fridays. The goal of the analysis is to classify the traffic flow patterns by day of the week.

Table 3.1: The sample means, standard errors (in parentheses) and average proportions of total variance explained [in brackets, %] of the misclassification rates for FNB with different J in Models 1-3, based on 200 replications.

Model | n | m | J = 1 | J = 5 | J = 10 | J = 20
1 | 60 | 120 | 0.717 (0.047) [64.30] | 0.149 (0.052) [99.26] | 0.081 (0.051) [99.90] | 0.075 (0.048) [99.99]
1 | 60 | 300 | 0.717 (0.032) [64.01] | 0.150 (0.043) [99.26] | 0.079 (0.046) [99.90] | 0.074 (0.043) [99.99]
1 | 60 | 600 | 0.715 (0.028) [64.07] | 0.149 (0.042) [99.25] | 0.079 (0.044) [99.90] | 0.074 (0.042) [99.99]
1 | 120 | 120 | 0.712 (0.046) [63.83] | 0.096 (0.029) [99.22] | 0.033 (0.021) [99.88] | 0.032 (0.020) [99.99]
1 | 120 | 300 | 0.710 (0.031) [64.02] | 0.097 (0.021) [99.24] | 0.034 (0.017) [99.89] | 0.032 (0.014) [99.99]
1 | 120 | 600 | 0.710 (0.025) [64.29] | 0.096 (0.018) [99.23] | 0.033 (0.013) [99.89] | 0.032 (0.012) [99.99]
1 | 300 | 120 | 0.702 (0.038) [63.93] | 0.077 (0.025) [99.20] | 0.028 (0.016) [99.86] | 0.022 (0.014) [99.98]
1 | 300 | 300 | 0.699 (0.028) [63.56] | 0.077 (0.016) [99.18] | 0.028 (0.011) [99.86] | 0.023 (0.009) [99.98]
1 | 300 | 600 | 0.699 (0.021) [63.37] | 0.075 (0.011) [99.19] | 0.028 (0.008) [99.86] | 0.022 (0.007) [99.98]
2 | 60 | 120 | 0.824 (0.037) [37.81] | 0.703 (0.044) [88.44] | 0.589 (0.049) [98.45] | 0.563 (0.053) [99.98]
2 | 60 | 300 | 0.824 (0.024) [37.71] | 0.703 (0.031) [88.51] | 0.589 (0.033) [98.48] | 0.562 (0.041) [99.98]
2 | 60 | 600 | 0.823 (0.019) [37.94] | 0.704 (0.026) [88.67] | 0.588 (0.030) [98.51] | 0.560 (0.038) [99.98]
2 | 120 | 120 | 0.827 (0.036) [36.23] | 0.682 (0.043) [86.69] | 0.521 (0.049) [97.81] | 0.422 (0.052) [99.97]
2 | 120 | 300 | 0.822 (0.023) [36.61] | 0.679 (0.031) [86.95] | 0.518 (0.033) [97.88] | 0.424 (0.036) [99.97]
2 | 120 | 600 | 0.823 (0.019) [36.74] | 0.679 (0.025) [86.82] | 0.517 (0.028) [97.84] | 0.424 (0.030) [99.97]
2 | 300 | 120 | 0.816 (0.034) [35.72] | 0.647 (0.047) [85.92] | 0.457 (0.046) [97.44] | 0.331 (0.045) [99.96]
2 | 300 | 300 | 0.816 (0.024) [35.46] | 0.649 (0.029) [85.71] | 0.456 (0.030) [97.38] | 0.330 (0.029) [99.96]
2 | 300 | 600 | 0.817 (0.018) [32.52] | 0.649 (0.020) [85.86] | 0.455 (0.024) [97.44] | 0.329 (0.024) [99.96]
3 | 60 | 120 | 0.741 (0.049) [37.81] | 0.353 (0.065) [88.47] | 0.237 (0.061) [98.47] | 0.191 (0.074) [99.98]
3 | 60 | 300 | 0.741 (0.037) [37.72] | 0.353 (0.056) [88.55] | 0.241 (0.055) [98.49] | 0.192 (0.068) [99.98]
3 | 60 | 600 | 0.740 (0.034) [37.95] | 0.349 (0.052) [88.71] | 0.239 (0.052) [98.53] | 0.189 (0.066) [99.98]
3 | 120 | 120 | 0.739 (0.046) [36.23] | 0.262 (0.050) [86.72] | 0.130 (0.035) [97.83] | 0.063 (0.025) [99.97]
3 | 120 | 300 | 0.735 (0.035) [36.61] | 0.264 (0.038) [86.98] | 0.127 (0.027) [97.89] | 0.063 (0.020) [99.97]
3 | 120 | 600 | 0.735 (0.030) [36.74] | 0.258 (0.033) [86.85] | 0.124 (0.023) [97.85] | 0.062 (0.017) [99.97]
3 | 300 | 120 | 0.725 (0.041) [35.72] | 0.207 (0.040) [85.94] | 0.080 (0.025) [97.45] | 0.031 (0.017) [99.96]
3 | 300 | 300 | 0.723 (0.031) [35.46] | 0.204 (0.027) [85.73] | 0.079 (0.015) [97.39] | 0.031 (0.011) [99.96]
3 | 300 | 600 | 0.726 (0.024) [35.52] | 0.206 (0.022) [85.88] | 0.079 (0.012) [97.45] | 0.031 (0.008) [99.96]

The second data example is based on the study of fish species recognition and migration monitoring by Lee et al. (2008).
In their study, they suggest that the shape of the fish is the most reliable general characteristic for determining its species. They developed a contour matching algorithm to represent the shape of a fish as a modified curve. Each curve in the dataset contains 463 equally spaced points mapped from the outline of the fish. There are seven fish species with similar shape characteristics: salmon, winter coho, brown trout, Bonneville cutthroat, Colorado River cutthroat trout, Yellowstone cutthroat and mountain whitefish. Each species has a sample size of 50, for a total of 350 fish. This dataset is available at the UCR Time Series Classification and Clustering website http://www.cs.ucr.edu/~eamonn/time_series_data/. The aim is to discriminate fish species via the shape of the fish.

Table 3.2: The sample means and standard errors (in parentheses) of the misclassification rates for the comparison methods in Models 1-3, based on 200 replications. (Continued from Table 3.1.)

Model | n | m | Logistic | k-NN | LDA | SVM
1 | 60 | 120 | 0.191 (0.051) | 0.147 (0.040) | 0.029 (0.020) | 0.154 (0.052)
1 | 60 | 300 | 0.190 (0.044) | 0.148 (0.031) | 0.029 (0.015) | 0.152 (0.047)
1 | 60 | 600 | 0.190 (0.041) | 0.146 (0.027) | 0.029 (0.013) | 0.151 (0.045)
1 | 120 | 120 | 0.140 (0.041) | 0.115 (0.033) | 0.016 (0.012) | 0.102 (0.033)
1 | 120 | 300 | 0.139 (0.032) | 0.115 (0.023) | 0.016 (0.008) | 0.102 (0.024)
1 | 120 | 600 | 0.139 (0.029) | 0.113 (0.018) | 0.017 (0.006) | 0.101 (0.019)
1 | 300 | 120 | 0.081 (0.031) | 0.088 (0.028) | 0.012 (0.011) | 0.068 (0.025)
1 | 300 | 300 | 0.079 (0.022) | 0.087 (0.017) | 0.013 (0.007) | 0.067 (0.016)
1 | 300 | 600 | 0.078 (0.020) | 0.087 (0.014) | 0.012 (0.005) | 0.066 (0.012)
2 | 60 | 120 | 0.772 (0.040) | 0.777 (0.037) | 0.755 (0.045) | 0.759 (0.060)
2 | 60 | 300 | 0.771 (0.029) | 0.778 (0.024) | 0.756 (0.032) | 0.757 (0.051)
2 | 60 | 600 | 0.771 (0.024) | 0.777 (0.018) | 0.755 (0.026) | 0.756 (0.048)
2 | 120 | 120 | 0.769 (0.043) | 0.766 (0.042) | 0.765 (0.044) | 0.715 (0.051)
2 | 120 | 300 | 0.767 (0.031) | 0.766 (0.026) | 0.763 (0.032) | 0.711 (0.036)
2 | 120 | 600 | 0.764 (0.026) | 0.765 (0.018) | 0.761 (0.027) | 0.712 (0.032)
2 | 300 | 120 | 0.771 (0.043) | 0.744 (0.039) | 0.763 (0.044) | 0.670 (0.048)
2 | 300 | 300 | 0.772 (0.032) | 0.745 (0.028) | 0.765 (0.032) | 0.671 (0.040)
2 | 300 | 600 | 0.771 (0.027) | 0.745 (0.019) | 0.765 (0.027) | 0.670 (0.032)
3 | 60 | 120 | 0.350 (0.060) | 0.352 (0.048) | 0.249 (0.048) | 0.281 (0.068)
3 | 60 | 300 | 0.349 (0.053) | 0.353 (0.036) | 0.247 (0.038) | 0.280 (0.063)
3 | 60 | 600 | 0.348 (0.049) | 0.352 (0.032) | 0.245 (0.035) | 0.278 (0.061)
3 | 120 | 120 | 0.287 (0.052) | 0.310 (0.045) | 0.197 (0.038) | 0.203 (0.042)
3 | 120 | 300 | 0.288 (0.041) | 0.310 (0.031) | 0.195 (0.028) | 0.205 (0.028)
3 | 120 | 600 | 0.286 (0.037) | 0.308 (0.023) | 0.193 (0.022) | 0.200 (0.022)
3 | 300 | 120 | 0.196 (0.038) | 0.271 (0.039) | 0.162 (0.034) | 0.152 (0.036)
3 | 300 | 300 | 0.195 (0.026) | 0.268 (0.027) | 0.159 (0.022) | 0.149 (0.021)
3 | 300 | 600 | 0.196 (0.020) | 0.268 (0.022) | 0.160 (0.017) | 0.150 (0.017)

[Figure 3.4: Boxplots of misclassification error rates for FNB (J = 1, 5, 10, 20) and the four comparison methods for Model 1, based on 200 replications; panels (a)-(i) cover the combinations of training size n = 60, 120, 300 and testing size m = 120, 300, 600.]

The third data example, described in detail in Grandhi (2002), comes from Oregon State University. According to the original data source, the growth of digitized data has created a huge demand for automatic management and retrieval of various images. The OSULeaf dataset consists of curves obtained by color image segmentation and boundary extraction, in the counter-clockwise direction, at 427 equispaced instants from digitized leaf images of six classes.
In total, 442 leaf images are analyzed: 66 of them are Acer Circinatum, 84 Acer Glabrum, 75 Acer Macrophyllum, 97 Acer Negundo, 82 Quercus Garryana and 38 Quercus Kelloggii. This dataset is also available at the UCR Time Series Classification and Clustering website. The main objective is the classification of the leaf boundary curves.

[Figure 3.5: Boxplots of misclassification error rates for FNB (J = 1, 5, 10, 20) and the four comparison methods for Model 2, based on 200 replications; panels (a)-(i) cover the combinations of training size n = 60, 120, 300 and testing size m = 120, 300, 600.]

The last data example is the well-known Berkeley growth study (Tuddenham and Snyder, 1954), which has been used as an example in various functional clustering studies (Chiou and Li (2007), Jacques and Preda (2014) and Bongiorno and Goia (2016)). In this dataset, the heights of 54 girls and 39 boys are measured at 31 unequally spaced time points, from 1 to 18 years, for a total of 93 subjects. The goal is to discriminate the growth curves by gender.

[Figure 3.6: Boxplots of misclassification error rates for FNB (J = 1, 5, 10, 20) and the four comparison methods for Model 3, based on 200 replications; panels (a)-(i) cover the combinations of training size n = 60, 120, 300 and testing size m = 120, 300, 600.]

The original trajectories for the four functional datasets are plotted in Figure 3.7. The number of groups for each dataset is reported in parentheses, and the group membership of each trajectory is highlighted in different colors.
We presmooth the OSULeaf data, since the original observations are quite noisy, as clearly indicated in Figure 3.7(c). In Table 3.3, we report the average misclassification rates along with the standard errors and the average proportion of total variance explained for the four datasets. As can be seen in Table 3.3, the performance of the proposed FNB method improves as the number of FCPCs increases. For the PeMS, OSULeaf and Growth data, FNB performs best when J = 20. For the Fish data, FNB performs third best, but is still comparable with LDA and SVM. We note that the SVM results reported here were obtained under the best choice of kernel and tuning parameters; as discussed in the simulations (Section 3.3.1), there are situations where more than 80% of the test data are misclassified when a wrong kernel or wrong tuning parameters are specified. In general, the results reveal how FNB behaves with J: the misclassification errors decrease with increasing J, consistently with the proportion of total variance explained.

[Figure 3.7: The original trajectories for the four functional datasets; panels (a) PeMS (5 groups, flow rate vs. time), (b) Fish (7 groups, outline vs. location), (c) OSULeaf (6 groups, outline vs. location), (d) Growth (2 groups, height vs. age).]

Table 3.3: The sample means, standard errors (in parentheses) and average proportions of total variance explained [in brackets, %] of the misclassification rates for the four datasets.

Method | J | PeMS | Fish | OSULeaf | Growth
FNB | 1 | 0.624 (0.019) [49.61] | 0.703 (0.007) [70.16] | 0.637 (0.009) [29.18] | 0.346 (0.012) [88.77]
FNB | 5 | 0.229 (0.017) [91.69] | 0.454 (0.010) [91.83] | 0.460 (0.012) [68.10] | 0.077 (0.012) [99.30]
FNB | 10 | 0.084 (0.015) [97.71] | 0.353 (0.013) [98.05] | 0.338 (0.013) [89.62] | 0.061 (0.008) [99.97]
FNB | 15 | 0.058 (0.014) [99.29] | 0.190 (0.010) [99.54] | 0.225 (0.011) [97.35] | 0.045 (0.008) [100.00]
FNB | 20 | 0.025 (0.011) [99.81] | 0.165 (0.012) [99.90] | 0.139 (0.009) [99.65] | 0.040 (0.010) [100.00]
Logistic | - | 0.095 (0.024) | 0.196 (0.014) | 0.349 (0.010) | 0.094 (0.021)
k-NN | - | 0.132 (0.014) | 0.172 (0.007) | 0.175 (0.009) | 0.075 (0.010)
LDA | - | 0.060 (0.011) | 0.150 (0.008) | 0.428 (0.010) | 0.055 (0.010)
SVM | - | 0.046 (0.008) | 0.129 (0.009) | 0.194 (0.010) | 0.071 (0.011)

[Figure 3.8: Boxplots of misclassification error rates from 10-fold cross-validation based on 200 replications; panels (a) PeMS, (b) Fish, (c) OSULeaf, (d) Growth.]

3.4 Conclusions

We have developed a classification method for functional data based on surrogate densities constructed from FCPC scores. The novelty here is to make the naive assumption of independence of the FCPC scores and to use the theoretical result of Delaigle and Hall (2010) to construct surrogate densities. The surrogate densities make density-based naive Bayes classifiers possible for functional data. It is shown that the FCPC scores rely on the consistently estimated overall mean and common eigenfunctions that are used for the stochastic curve expansion. In addition, the nonparametrically estimated FNB classifier is asymptotically equivalent to the true one. In the present chapter, for ease of presentation, we have assumed equal prior probabilities and an equal number of observations in each population, to better demonstrate the consistency of the FNB classifier; this assumption is not essential and can easily be relaxed. Our simulation studies and real data applications show that the proposed FNB classifier performs well. Overall, we conclude that the proposed FNB classifier is conceptually simple and analogous to the classical naive Bayes classifier, and the procedure for constructing it is easily implementable and practically useful.
CHAPTER 4

DYNAMIC FUNCTIONAL PREDICTION AND CLASSIFICATION

The idea of dynamic functional prediction follows Chiou (2012) as the main reference. We adopt that functional mixture prediction approach and combine it with the functional naive Bayes classifier proposed in Chapter 3. This study extends the idea of the functional naive Bayes classifier to identify distinct patterns of traffic flow from past data.

4.1 Modeling Traffic Flow Trajectories

As mentioned in Section 1.2, a daily traffic flow trajectory can be viewed as functional data. We use the notation $X$ for the random function representing the daily traffic flow trajectory on the domain $\mathcal{I} = [0, T]$. The random function $X$ has a smooth mean function $E(X(t)) = \mu(t)$ and covariance function $\mathrm{cov}(X(s), X(t)) = \Gamma(s, t)$ for $s, t \in \mathcal{I}$.

4.1.1 Functional Naive Bayes and Functional Probability Naive Bayes Classifiers

We assume the random function $X$ consists of $C$ sub-functions, each corresponding to a population. We also assume that there are populations $\Pi_c$, $c \in \{1, \dots, C\}$, where $C$ is an integer indicating the total number of populations. Let the random variable $Y$ be the group label of $X$. For each sub-function associated with group $c$, we define the conditional mean function $E(X(t)\,|\,Y = c) = \mu^{(c)}(t)$ and the covariance function $\mathrm{Cov}(X(s), X(t)\,|\,Y = c) = \Gamma^{(c)}(s, t)$. Let $(\lambda_j^{(c)}, \phi_j^{(c)})$ be the corresponding eigenvalue-eigenfunction pairs of the covariance kernel $\Gamma^{(c)}$.

Following the conventional approach, the best group membership $c$ given $X$ is determined by maximizing the posterior probability $P_{Y|X}(\cdot|\cdot)$, namely

$$c^*(X) = \mathop{\mathrm{argmax}}_{c \in \{1,\dots,C\}} P_{Y|X}(c\,|\,X).$$

However, this posterior probability is conditioned on the probability density function of the random function $X$, which does not exist, as discussed in Delaigle and Hall (2010). Without attempting to model the underlying probability distribution, we propose estimating the posterior membership probability $P(Y = c\,|\,X)$ in the following two ways:

• Functional naive Bayes: the posterior probability is either 1 or 0, i.e.,

$$P_{Y|X}(c\,|\,X) = \begin{cases} 1, & \text{if } c = \delta_J(X), \\ 0, & \text{otherwise}, \end{cases} \quad (4.1)$$

where $\delta_J(X)$ is the criterion function (3.7) discussed in Chapter 3,

$$\delta_J(X) = \mathop{\mathrm{argmax}}_{c \in \{1,\dots,C\}} \pi_c f^{(c)}(X|\varepsilon), \quad \text{with} \quad f^{(c)}(X|\varepsilon) = \prod_{j=1}^{J} f_j^{(c)}(x_j).$$

• Functional probability naive Bayes: the posterior group membership probability is estimated by the proportion of the posterior probability of each group,

$$P(Y = c\,|\,X) = \frac{\pi_c f^{(c)}(X|\varepsilon)}{\sum_{c'=1}^{C} \pi_{c'} f^{(c')}(X|\varepsilon)}, \quad c = 1, \dots, C. \quad (4.2)$$

In the machine learning literature, (4.1) and (4.2) are called hard and soft classification, respectively.
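Given the priors and the surrogate density values, both rules in (4.1) and (4.2) are elementary to compute; the following MATLAB sketch shows them with placeholder inputs (the numeric values are assumptions for illustration only).

% Hard (4.1) and soft (4.2) classification from surrogate densities (a sketch).
piC = [0.25; 0.50; 0.25];          % placeholder prior probabilities pi_c
fX  = [0.80; 0.10; 0.40];          % placeholder values f^(c)(X|eps), c = 1,...,C
post = piC.*fX / sum(piC.*fX);     % soft: posterior membership probabilities (4.2)
[~, cHard] = max(piC.*fX);         % hard: the FNB label delta_J(X) in (4.1)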
For the purpose of prediction, the time domain $\mathcal{I}$ of the function $X$ is partitioned into two exclusive time domains $\mathcal{S}(\tau) = [0, \tau]$ and $\mathcal{T}(\tau) = [\tau, T]$. Now, let $X^*$ be a newly observed trajectory of the function $X$, observed up to time $\tau$ and denoted by $X^*_{\mathcal{S}(\tau)}$. We predict the group membership probability of the trajectory $X^*$ based on the known trajectory $X^*_{\mathcal{S}(\tau)}$ observed until time $\tau$, which is then used to predict the unobserved trajectory $X^*_{\mathcal{T}(\tau)}$.

We define the surrogate density of a random function based on the partially observed $X_{\mathcal{S}(\tau)}$ rather than the entire $X$, since the part $X_{\mathcal{T}(\tau)}$ is not yet observed. Suppose we have the group mean functions $\mu^{(c)}(t)$ and covariance functions $\Gamma^{(c)}(s, t)$, $c = 1, \dots, C$; the partially observed mean and covariance functions are denoted by $\mu^{(c)}_{\mathcal{S}(\tau)}$ and $\Gamma^{(c)}_{\mathcal{S}(\tau)}(s, t)$. Then the partially observed overall mean function and common covariance function are defined as

$$\mu_{\mathcal{S}(\tau)}(t) = \sum_{c=1}^{C} \pi_c\, \mu^{(c)}_{\mathcal{S}(\tau)}(t) \quad \text{and} \quad \Gamma_{\mathcal{S}(\tau)}(s, t) = \sum_{c=1}^{C} \pi_c\, \Gamma^{(c)}_{\mathcal{S}(\tau)}(s, t),$$

respectively. The partially observed common eigenfunctions, denoted $\phi_{\mathcal{S}(\tau),j}(t)$ for $j \ge 1$, are derived from the eigendecomposition of the partially observed common covariance function. Taking $\xi^{(c)}_{\mathcal{S}(\tau),j} = \langle X^{(c)}_{\mathcal{S}(\tau)} - \mu_{\mathcal{S}(\tau)}, \phi_{\mathcal{S}(\tau),j} \rangle$, the surrogate density can be approximated by the product of the probability density functions of the $\xi_{\mathcal{S}(\tau),j}$, i.e.,

$$f^{(c)}\big(X^{(c)}_{\mathcal{S}(\tau)}|\varepsilon\big) \approx \prod_{j=1}^{J} f_j^{(c)}\big(\xi^{(c)}_{\mathcal{S}(\tau),j}\big).$$

We can predict the group membership based on the newly observed $X^*_{\mathcal{S}(\tau)}$ using the functional naive Bayes,

$$\delta_J\big(X^*_{\mathcal{S}(\tau)}\big) = \mathop{\mathrm{argmax}}_{c \in \{1,\dots,C\}} \pi_c f^{(c)}\big(X^*_{\mathcal{S}(\tau)}|\varepsilon\big),$$

and the group membership probability using the functional probability naive Bayes,

$$P\big(Y = c\,|\,X^*_{\mathcal{S}(\tau)}\big) = \frac{\pi_c f^{(c)}\big(X^*_{\mathcal{S}(\tau)}|\varepsilon\big)}{\sum_{c'=1}^{C} \pi_{c'} f^{(c')}\big(X^*_{\mathcal{S}(\tau)}|\varepsilon\big)}, \quad c = 1, \dots, C.$$

We note that the prior probabilities $\pi_c$ remain the same as in the training data.

4.1.2 Estimation for the Functional Naive Bayes and Functional Probability Naive Bayes Classifiers

In practice, the model components of the functional naive Bayes classifier can be estimated from the training data. Given the observations $\{t_{ij}, Y_i, X_i(t_{ij})\}$, $i = 1, \dots, n$, $j = 1, \dots, m_i$, we follow the same procedure as in Section 3.1.1 to obtain the estimates $\{\hat\pi_c, \hat\mu^{(c)}, \widehat\Gamma^{(c)}\}$, $c = 1, \dots, C$. The estimated overall mean function is $\hat\mu(t) = \sum_{c=1}^{C} \hat\pi_c \hat\mu^{(c)}(t)$ and the estimated common covariance function is $\widehat\Gamma(s, t) = \sum_{c=1}^{C} \hat\pi_c \widehat\Gamma^{(c)}(s, t)$. Denoting the $j$-th eigenvalue-eigenfunction pair of $\widehat\Gamma$ by $(\hat\lambda_j, \hat\phi_j)$, we obtain the FCPC score for a generic functional observation $X$ from group $c$ as $\hat\xi_j^{(c)} = \langle X^{(c)} - \hat\mu, \hat\phi_j \rangle$. The kernel density estimate for the $j$-th FCPC score in group $c$ is

$$\hat f_j^{(c)}(u) = \frac{1}{n_c h_j^c} \sum_{i=1}^{n_c} K\!\left( \frac{u - \hat\xi_{ij}^{(c)}}{h_j^c} \right).$$

Now, for any function $z \in L^2(\mathcal{I})$, writing $\hat z_j = \langle z - \hat\mu, \hat\phi_j \rangle$, the surrogate density function is

$$\hat f^{(c)}(z|\varepsilon) = \prod_{j=1}^{J} \hat f_j^{(c)}(\hat z_j).$$

For details of the procedure we refer to Section 3.1.1.

Now, given a newly observed trajectory $X^*$ up to time $\tau$, denoted $X^*_{\mathcal{S}(\tau)}$, we obtain the estimated surrogate density

$$\hat f^{(c)}\big(X^*_{\mathcal{S}(\tau)}|\varepsilon\big) = \prod_{j=1}^{J} \hat f_j^{(c)}\big(\hat\xi_{\mathcal{S}(\tau),j}\big), \quad c = 1, \dots, C,$$

where the $j$-th FCPC score is $\hat\xi_{\mathcal{S}(\tau),j} = \langle X^*_{\mathcal{S}(\tau)} - \hat\mu_{\mathcal{S}(\tau)}, \hat\phi_{\mathcal{S}(\tau),j} \rangle$, $j \ge 1$. To obtain $\{\hat\phi_{\mathcal{S}(\tau),j}\}$, we simply decompose the covariance estimate $\widehat\Gamma$ into blocks corresponding to the time domains $\mathcal{S}(\tau)$ and $\mathcal{T}(\tau)$, without re-estimating the covariance function. We predict the group membership for the newly observed $X^*_{\mathcal{S}(\tau)}$ by the functional naive Bayes,

$$\hat\delta_J\big(X^*_{\mathcal{S}(\tau)}\big) = \mathop{\mathrm{argmax}}_{c \in \{1,\dots,C\}} \hat\pi_c \hat f^{(c)}\big(X^*_{\mathcal{S}(\tau)}|\varepsilon\big), \quad (4.3)$$

and the group membership probability by the functional probability naive Bayes,

$$\widehat P\big(Y = c\,|\,X^*_{\mathcal{S}(\tau)}\big) = \frac{\hat\pi_c \hat f^{(c)}\big(X^*_{\mathcal{S}(\tau)}|\varepsilon\big)}{\sum_{c'=1}^{C} \hat\pi_{c'} \hat f^{(c')}\big(X^*_{\mathcal{S}(\tau)}|\varepsilon\big)}, \quad c = 1, \dots, C. \quad (4.4)$$
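The inner products $\langle \cdot, \cdot \rangle$ above are computed numerically on the observation grid. A minimal MATLAB sketch of the score computation for a partially observed curve follows; all inputs are placeholder assumptions standing in for the estimates produced by the training step.

% FCPC scores of a partially observed curve by numerical integration (a sketch).
m = 40; J = 5;
t = linspace(0, 10, m);                       % grid on S(tau) (placeholder)
Xs = sin(t) + 0.1*randn(1,m);                 % partially observed curve (placeholder)
muS = sin(t);                                 % estimated mean on S(tau) (placeholder)
PhiS = orth(randn(m,J));                      % placeholder eigenfunction columns
xi = zeros(1,J);
for j = 1:J
    xi(j) = trapz(t, (Xs - muS).*PhiS(:,j)'); % score <X* - mu_S, phi_S,j>
end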
4.2 Functional Mixture Prediction

In order to accurately predict traffic flow trajectories under various traffic conditions, we propose to combine the functional linear model with the functional naive Bayes methods. Given a newly observed trajectory $X^*_{\mathcal{S}(\tau)}$ of the process $X$ observed up to time $\tau$, the law of iterated expectation (Durrett, 2010) gives

$$E\big(X^*_{\mathcal{T}(\tau)}(t)\,|\,X^*_{\mathcal{S}(\tau)}\big) = \sum_{c=1}^{C} E\big(X^*_{\mathcal{T}(\tau)}(t)\,|\,X^*_{\mathcal{S}(\tau)}, Y = c\big)\, P\big(Y = c\,|\,X^*_{\mathcal{S}(\tau)}\big), \quad (4.5)$$

where $E(X_{\mathcal{T}(\tau)}(t)\,|\,X^*_{\mathcal{S}(\tau)}, Y = c)$ is the predictive function conditional on group $Y = c$ and $P(Y = c\,|\,X^*_{\mathcal{S}(\tau)})$ is the posterior probability of group membership given the newly observed trajectory $X^*_{\mathcal{S}(\tau)}$ up to time $\tau$.

4.2.1 Functional Linear Prediction Model

Given $X^*_{\mathcal{S}(\tau)}$, we aim to predict the values of $X^*_{\mathcal{T}(\tau)}$. Here we consider a functional linear regression model (Ramsay and Silverman, 2005). The process $X(s)$, $s \in \mathcal{S}(\tau)$, denoted $X_{\mathcal{S}(\tau)}$, serves as the predictor function, and the process $X(t)$, $t \in \mathcal{T}(\tau)$, denoted $X_{\mathcal{T}(\tau)}$, is the response function. For each group, the process $X^{(c)}$ is decomposed into $X^{(c)}_{\mathcal{S}(\tau)}$ and $X^{(c)}_{\mathcal{T}(\tau)}$, whose Karhunen-Loève expansions are

$$X^{(c)}_{\mathcal{S}(\tau)}(s) = \mu^{(c)}_{\mathcal{S}(\tau)}(s) + \sum_{j=1}^{\infty} \xi^{(c)}_{\mathcal{S}(\tau),j}\, \phi^{(c)}_{\mathcal{S}(\tau),j}(s) \quad \text{and} \quad X^{(c)}_{\mathcal{T}(\tau)}(t) = \mu^{(c)}_{\mathcal{T}(\tau)}(t) + \sum_{k=1}^{\infty} \xi^{(c)}_{\mathcal{T}(\tau),k}\, \phi^{(c)}_{\mathcal{T}(\tau),k}(t),$$

where $\xi^{(c)}_{\mathcal{S}(\tau),j} = \langle X^{(c)}_{\mathcal{S}(\tau)} - \mu^{(c)}_{\mathcal{S}(\tau)}, \phi^{(c)}_{\mathcal{S}(\tau),j} \rangle$ and $\xi^{(c)}_{\mathcal{T}(\tau),k} = \langle X^{(c)}_{\mathcal{T}(\tau)} - \mu^{(c)}_{\mathcal{T}(\tau)}, \phi^{(c)}_{\mathcal{T}(\tau),k} \rangle$. Conditioning on the group membership, the functional linear regression model becomes

$$E\big(X_{\mathcal{T}(\tau)}(t)\,|\,X_{\mathcal{S}(\tau)}, Y = c\big) = \mu^{(c)}_{\mathcal{T}(\tau)}(t) + \int_{\mathcal{S}(\tau)} \beta^{(c)}_{\tau}(s, t)\big(X_{\mathcal{S}(\tau)}(s) - \mu^{(c)}_{\mathcal{S}(\tau)}(s)\big)\, ds \quad (4.6)$$

for all $t \in \mathcal{T}(\tau)$. Here, for a fixed value of $\tau$, the bivariate regression function $\beta^{(c)}_{\tau}(s, t)$ is smooth and square integrable, that is, $\int_{\mathcal{T}(\tau)} \int_{\mathcal{S}(\tau)} \beta^{(c)}_{\tau}(s, t)^2\, ds\, dt < \infty$. Under certain regularity conditions outlined in He et al. (2000), the regression function $\beta^{(c)}_{\tau}$ has the basis representation

$$\beta^{(c)}_{\tau}(s, t) = \sum_{k=1}^{\infty} \sum_{j=1}^{\infty} \beta^{(c)}_{\tau,jk}\, \phi^{(c)}_{\mathcal{S}(\tau),j}(s)\, \phi^{(c)}_{\mathcal{T}(\tau),k}(t), \quad \text{where} \quad \beta^{(c)}_{\tau,jk} = \frac{E\big(\xi^{(c)}_{\mathcal{S}(\tau),j}\, \xi^{(c)}_{\mathcal{T}(\tau),k}\big)}{E\big((\xi^{(c)}_{\mathcal{S}(\tau),j})^2\big)}.$$

Model (4.6) can thus be rewritten as

$$E\big(X_{\mathcal{T}(\tau)}(t)\,|\,X_{\mathcal{S}(\tau)}, Y = c\big) = \mu^{(c)}_{\mathcal{T}(\tau)}(t) + \sum_{k=1}^{\infty} \sum_{j=1}^{\infty} \beta^{(c)}_{\tau,jk}\, \xi^{(c)}_{\mathcal{S}(\tau),j}\, \phi^{(c)}_{\mathcal{T}(\tau),k}(t). \quad (4.7)$$

Suppose the group structure $\mu^{(c)}_{\mathcal{S}(\tau)}$, $\mu^{(c)}_{\mathcal{T}(\tau)}$, $\{\phi^{(c)}_{\mathcal{S}(\tau),j}\}$, $\{\phi^{(c)}_{\mathcal{T}(\tau),k}\}$ and the regression coefficients $\beta^{(c)}_{\tau,jk}$ are given; one can then predict the unobserved trajectory $X^*_{\mathcal{T}(\tau)}$ based on the partially observed trajectory $X^*_{\mathcal{S}(\tau)}$ for a specific cluster $c$ by (4.7), that is,

$$E\big(X^*_{\mathcal{T}(\tau)}(t)\,|\,X^*_{\mathcal{S}(\tau)}, Y = c\big) = \mu^{(c)}_{\mathcal{T}(\tau)}(t) + \sum_{k=1}^{\infty} \sum_{j=1}^{\infty} \beta^{(c)}_{\tau,jk}\, \xi^{*(c)}_{\mathcal{S}(\tau),j}\, \phi^{(c)}_{\mathcal{T}(\tau),k}(t) \quad (4.8)$$

for all $t \in \mathcal{T}(\tau)$, where $\xi^{*(c)}_{\mathcal{S}(\tau),j} = \langle X^{*(c)}_{\mathcal{S}(\tau)} - \mu^{(c)}_{\mathcal{S}(\tau)}, \phi^{(c)}_{\mathcal{S}(\tau),j} \rangle$. In practice, the group structure $\mu^{(c)}_{\mathcal{S}(\tau)}$, $\mu^{(c)}_{\mathcal{T}(\tau)}$, $\{\phi^{(c)}_{\mathcal{S}(\tau),j}\}$, $\{\phi^{(c)}_{\mathcal{T}(\tau),k}\}$ can be estimated from the training data as described in Section 4.1.2; the regression coefficients $\beta^{(c)}_{\tau,jk}$ are more involved, and we summarize their estimation in the next section.

4.2.2 Estimation for Functional Mixture Prediction

Since the bivariate regression function $\beta^{(c)}_{\tau}(s, t)$ is a smooth function of $\tau$ for all $s$ and $t$, it follows that $\beta^{(c)}_{\tau,jk}$ is also smooth in $\tau$ for all $j$ and $k$. To estimate the regression parameters $\beta^{(c)}_{\tau,jk}$ for each group, one needs the estimated principal component scores $\hat\xi^{(c)}_{\mathcal{S}(\tau),j}$ and $\hat\xi^{(c)}_{\mathcal{T}(\tau),k}$ and the eigenfunctions $\hat\phi^{(c)}_{\mathcal{S}(\tau),j}(t)$ and $\hat\phi^{(c)}_{\mathcal{T}(\tau),k}(t)$ for each group. The raw estimate of $\beta^{(c)}_{\tau,jk}$ is then

$$\tilde\beta^{(c)}_{\tau,jk} = \Big\{ (n_c - 1)\, \hat\lambda^{(c)}_{\mathcal{S}(\tau),j} \Big\}^{-1} \sum_{i=1}^{n_c} \Big( \hat\xi^{(c)}_{\mathcal{S}(\tau),i,j} - \bar{\hat\xi}^{(c)}_{\mathcal{S}(\tau),j} \Big) \Big( \hat\xi^{(c)}_{\mathcal{T}(\tau),i,k} - \bar{\hat\xi}^{(c)}_{\mathcal{T}(\tau),k} \Big), \quad (4.9)$$

where $\bar{\hat\xi}^{(c)}_{\mathcal{S}(\tau),j}$ and $\bar{\hat\xi}^{(c)}_{\mathcal{T}(\tau),k}$ are the sample averages of the $\hat\xi^{(c)}_{\mathcal{S}(\tau),i,j}$ and $\hat\xi^{(c)}_{\mathcal{T}(\tau),i,k}$, respectively. This estimate is motivated by $\beta^{(c)}_{\tau,jk} = E\big(\xi^{(c)}_{\mathcal{S}(\tau),j}\, \xi^{(c)}_{\mathcal{T}(\tau),k}\big) \big/ E\big((\xi^{(c)}_{\mathcal{S}(\tau),j})^2\big)$.
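In matrix form, (4.9) is a ratio of a sample cross-covariance of scores to a score variance. The following MATLAB sketch computes the raw estimates for one group; the score matrices are placeholders, and estimating the eigenvalues by the score variances is an assumption made for this illustration.

% Raw regression coefficient estimates (4.9) for one group (a sketch).
nc = 50; J = 4; K = 3;
xiS = randn(nc,J); xiT = randn(nc,K);         % placeholder scores on S(tau), T(tau)
lamS = var(xiS, 0, 1);                        % eigenvalue estimates (score variances)
xiSc = xiS - mean(xiS,1);                     % centered scores
xiTc = xiT - mean(xiT,1);
betaRaw = (xiSc'*xiTc) ./ ((nc-1)*lamS');     % J x K matrix of beta-tilde_{tau,jk}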
The local linear smoothing method (Fan and Gijbels, 1996) is then applied to the estimates $\{\tilde\beta^{(c)}_{\tau,jk} : \tau = \tau_1, \dots, \tau_Q\}$ over $\tau$ to obtain the smoothed estimates $\hat\beta^{(c)}_{\tau,jk}$, where $Q$ is the number of time points at which predicting the future trajectory is of interest. We then obtain the predicted trajectory conditional on group $c$ by

$$\widehat E\big(X^*_{\mathcal{T}(\tau)}(t)\,|\,X^*_{\mathcal{S}(\tau)}, Y = c\big) = \hat\mu^{(c)}_{\mathcal{T}(\tau)}(t) + \sum_{k=1}^{L_c} \sum_{j=1}^{L_c} \hat\beta^{(c)}_{\tau,jk}\, \hat\xi^{*(c)}_{\mathcal{S}(\tau),j}\, \hat\phi^{(c)}_{\mathcal{T}(\tau),k}(t) \quad (4.10)$$

for all $t \in \mathcal{T}(\tau)$. Here, $L_c$ is determined by (2.5). Finally, the functional mixture prediction model combines the results of (4.10) with (4.4): we obtain the predicted unobserved traffic flow trajectory

$$\widehat E\big(X^*_{\mathcal{T}(\tau)}(t)\,|\,X^*_{\mathcal{S}(\tau)}\big) = \sum_{c=1}^{C} \widehat E\big(X^*_{\mathcal{T}(\tau)}(t)\,|\,X^*_{\mathcal{S}(\tau)}, Y = c\big)\, \widehat P\big(Y = c\,|\,X^*_{\mathcal{S}(\tau)}\big). \quad (4.11)$$

The implementation algorithm of the functional mixture prediction is summarized in the next section.

4.2.3 Implementation Algorithm of Functional Mixture Prediction

The algorithm of the functional mixture prediction is summarized in this section. Suppose we have a newly, partially observed trajectory $\{t_j, X^*(t_j)\}$ for $t_j < \tau$, denoted by $X^*_{\mathcal{S}(\tau)}$. The functional mixture prediction that combines the functional linear model and the functional probability naive Bayes classifier proceeds as follows:

Step 1. Train the functional linear regression model and the functional naive Bayes classifier for each group $c = 1, \dots, C$:
(i) fit the functional linear regression model based on the group-specific training data and derive the smoothed regression coefficient estimates $\hat\beta^{(c)}_{\tau,jk}$;
(ii) train the functional naive Bayes classifier and derive the estimated prior probabilities $\hat\pi_c$ and the estimated surrogate densities $\hat f^{(c)}(\cdot|\varepsilon)$.

Step 2. For a newly, partially observed trajectory $X^*_{\mathcal{S}(\tau)}$, predict the unobserved trajectory $X^*_{\mathcal{T}(\tau)}$ and the posterior membership probability:
(i) the unobserved trajectory is predicted by the group-specific functional linear model, i.e., $\widehat E(X_{\mathcal{T}(\tau)}(t)\,|\,X_{\mathcal{S}(\tau)}, Y = c)$ in (4.10);
(ii) the unknown group and posterior probability are predicted by the functional probability naive Bayes, i.e., $\widehat P(Y = c\,|\,X^*_{\mathcal{S}(\tau)})$ in (4.4).

Step 3. The unobserved trajectory is predicted by the functional mixture prediction model that combines the functional linear model with the functional probability naive Bayes, i.e., $\widehat E(X^*_{\mathcal{T}(\tau)}\,|\,X^*_{\mathcal{S}(\tau)})$ as derived in (4.11); a numerical sketch of this combination step is given at the end of this section.

[Figure 4.1: Functional mixture prediction flow chart. The training data $\{t_{ij}, (Y_i, X_i(t_{ij}))\}_{i=1,\dots,n}$ feed (i) the group-specific FLM, yielding $\hat\beta^{(c)}_{\tau,jk}$, and (ii) the FNB, yielding $\hat\pi_c$ and $\hat f^{(c)}(\cdot|\varepsilon)$. For test data $\{t, X^*, t < \tau\}$, the group-specific predictions $\widehat E(X^*_{\mathcal{T}(\tau)}\,|\,X^*_{\mathcal{S}(\tau)}, C = c)$ and membership probabilities $\widehat P(C = c\,|\,X^*_{\mathcal{S}(\tau)})$ are combined into the functional mixture prediction $\widehat E(X^*_{\mathcal{T}(\tau)}\,|\,X^*_{\mathcal{S}(\tau)})$.]

We further note that, in order to obtain a prediction interval for $\widehat E(X^*_{\mathcal{T}(\tau)}\,|\,X^*_{\mathcal{S}(\tau)})$, the bootstrap resampling method is applied, as summarized in the next section.
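The combination step (4.11) amounts to a posterior-weighted average of the group-specific predicted curves. The following MATLAB sketch shows it with placeholder inputs; the values stand in for the outputs of Steps 1 and 2 and are assumptions for illustration.

% Combining the group-specific predictions into the mixture prediction (4.11).
C = 3; nT = 48;
predC = cumsum(ones(C,nT), 2);     % placeholder Ehat(X*_T | X*_S, Y = c), row c
post  = [0.2; 0.7; 0.1];           % placeholder Phat(Y = c | X*_S)
predMix = post' * predC;           % 1 x nT mixture-predicted trajectory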
4.2.4 Bootstrap Prediction Intervals for Functional Mixture Prediction

In order to obtain a prediction interval for $\hat X^*_{\mathcal{T}(\tau)}(t) = \widehat E(X^*_{\mathcal{T}(\tau)}\,|\,X^*_{\mathcal{S}(\tau)})$, a commonly used method is bootstrap resampling. Using the training data, we resample the trajectory $x^{(c)}(t) = \mu^{(c)}(t) + \sum_{j=1}^{L_c} \xi_j^{(c)} \phi_j^{(c)}(t)$ by the following procedure (a sketch of the resampling loop is given after the steps):

Step 1. The mean function $\mu^{(c)}$ and the eigenfunctions $\phi_j^{(c)}$ are treated as fixed components and are replaced with their estimates $\hat\mu^{(c)}$ and $\hat\phi_j^{(c)}$.

Step 2. Derive the estimated FPC scores $\hat\xi_{ij}^{(c)} = \langle x_i^{(c)} - \hat\mu^{(c)}, \hat\phi_j^{(c)} \rangle$. Then obtain the estimated residuals of each $X_i^{(c)}$ by $\hat\epsilon_{il}^{(c)} = x_i^{(c)}(t_{il}) - \hat\mu^{(c)}(t_{il}) - \sum_{j=1}^{L_c} \hat\xi_{ij}^{(c)} \hat\phi_j^{(c)}(t_{il})$.

Step 3. Obtain the $b$-th bootstrap sample of the FPC scores $\{\hat\xi^{(c)}_{1j,b}, \dots, \hat\xi^{(c)}_{n_c j,b}\}$ and of the residuals $\{\hat\epsilon^{(c)}_{1l,b}, \dots, \hat\epsilon^{(c)}_{n_c l,b}\}$ by sampling with replacement from $\{\hat\xi_{ij}^{(c)} : 1 \le i \le n_c, 1 \le j \le L_c\}$ and $\{\hat\epsilon_{il}^{(c)} : 1 \le i \le n_c, 1 \le l \le m\}$, respectively.

Step 4. The $b$-th bootstrap sample $\{X^{(c)}_{1,b}, \dots, X^{(c)}_{n_c,b}\}$ is constructed by $X^{(c)}_{i,b}(t_{il}) = \hat\mu^{(c)}(t_{il}) + \sum_{j=1}^{L_c} \hat\xi^{(c)}_{ij,b} \hat\phi_j^{(c)}(t_{il}) + \hat\epsilon^{(c)}_{il,b}$.

Step 5. Use the $b$-th bootstrap sample to derive the smoothed regression coefficient estimates $\hat\beta^{(c)}_{\tau,b,jk}$ of the functional linear regression model and the surrogate densities $\hat f_b^{(c)}(\cdot|\varepsilon)$ of the functional naive Bayes classifier.

Step 6. Based on the estimates obtained from each bootstrap sample, estimate the posterior probability $\widehat P(Y = c\,|\,X^*_{\mathcal{S}(\tau)})$ as in (4.4) and the predicted trajectory conditional on group $c$, $\widehat E(X^*_{\mathcal{T}(\tau)}(t)\,|\,X^*_{\mathcal{S}(\tau)}, Y = c)$, as in (4.10).

Step 7. The mixture-predicted trajectory $\widehat E(X^*_{\mathcal{T}(\tau)}(t)\,|\,X^*_{\mathcal{S}(\tau)})$ is then obtained according to (4.5).

Step 8. The 95% prediction interval is constructed from the 2.5% and 97.5% percentiles of the bootstrap predicted trajectories.
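A minimal MATLAB sketch of the resampling core (Steps 3-4) for one group follows. All inputs are placeholders, and resampling whole score vectors and whole residual curves is one reading of Step 3 adopted here for the illustration.

% Bootstrap resampling of one group's curves (Steps 3-4, a sketch).
nc = 30; m = 288; L = 3; B = 200;
Phi = orth(randn(m,L));                      % placeholder estimated eigenfunctions
xi = randn(nc,L); res = 0.1*randn(nc,m);     % placeholder FPC scores and residuals
muc = zeros(1,m);                            % placeholder estimated group mean
Xboot = zeros(nc,m,B);
for b = 1:B
    xiB  = xi(randi(nc,nc,1), :);            % resample score vectors with replacement
    resB = res(randi(nc,nc,1), :);           % resample residual curves
    Xboot(:,:,b) = muc + xiB*Phi' + resB;    % Step 4: rebuild bootstrap curves
end
% Each Xboot(:,:,b) is then used to refit the FLM and FNB (Steps 5-7); the 2.5%
% and 97.5% percentiles of the B predicted trajectories give the 95% interval.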
4.3 A Real Data Application

4.3.1 Analysis of Traffic Flow Patterns and Posterior Probabilities

In this section we use one data example to illustrate the proposed functional mixture prediction. The traffic flow data were collected by a dual-loop vehicle detector over 15-min time intervals near the Shea-San Tunnel on National Highway 5 in Taiwan in 2009. We sample 70 days of trajectories as the training data, and the remaining 14 days are used as the test data to validate the prediction performance. The goal is to predict the unobserved traffic flow trajectory for a partial trajectory with flow information updated up to the "current time" $\tau$.

Based on our prior knowledge, this traffic flow data can be divided into three groups: Group 1 contains all holidays, Group 2 comprises weekdays from Mondays through Thursdays, and Group 3 comprises Fridays. The mean functions of the three groups and the overall trajectories are displayed in Figure 4.2. It is clear that Group 1 has a higher mean traffic flow rate than the other two groups; Groups 2 and 3 have relatively close mean flow rates, in both shape and magnitude, until 11:00 and diverge thereafter, with a higher mean flow rate in Group 3. In Figure 4.3, we plot the mean functions along with the observed trajectories, the covariance functions, and the leading eigenfunctions.

[Figure 4.2: Overall and group-specific mean functions of the training data of daily traffic flow rates (flow rate vs. time, with curves for Group 1, Group 2, Group 3 and Overall).]

[Figure 4.3: Estimated mean functions (left column) superimposed on the observed trajectories, covariance functions (middle column) and the corresponding eigenfunctions (right column) of Groups 1-3 (top to bottom), based on the training data of daily traffic flow trajectories; the leading eigenfunctions explain 76.05%, 12.27% and 5.50% (Group 1), 71.42% and 18.65% (Group 2), and 79.23% and 16.21% (Group 3) of the variance.]

We first use the training data to train the functional linear model and the functional naive Bayes classifier as described in the previous sections. Given a newly observed trajectory from the test data up to the current time $\tau$, we predict the posterior probabilities by (4.4). The posterior probabilities for the test samples are illustrated in Figure 4.4, with values of $\tau$ from 8:00 to 20:00 at 15-min intervals. In Figure 4.4, Test samples 2, 3 and 9 are classified as Group 1 (holidays) for all $\tau$; the other test samples are classified in either Group 2 (weekdays) or Group 3 (Fridays). In this case, Groups 2 and 3 have almost the same pattern prior to $\tau = 10{:}00$, which may lead to misclassification and result in poor prediction accuracy; this issue is resolved as $\tau$ moves onward.

4.3.2 Mixture Prediction of Traffic Flow

To examine the performance of the mixture prediction, we use partially observed trajectories before time $\tau$ to predict an interval of given length after time $\tau$. Here we introduce some notation and criteria as used in Chiou et al. (2014b). We define $\mathcal{S}(\tau; \omega) = [\max(0, \tau - \omega), \tau]$ and $\mathcal{T}(\tau; \kappa) = [\tau, \min(\tau + \kappa, T)]$, where $\omega$ is the length of the known interval prior to time $\tau$ used in the prediction calculations and $\kappa$ is the length of the unknown interval to be predicted from time $\tau$ onward. Given a sample $X^*_i$ observed up to time $\tau$, denoted $X^*_{i,\mathcal{S}(\tau)}$, the mean integrated prediction error (MIPE) can be calculated by

$$\mathrm{MIPE}(\tau, \omega, \kappa) = \frac{1}{m_p} \sum_{i=1}^{m_p} \frac{1}{\kappa} \int_0^{\kappa} \Big\{ \hat X^*_{i,\mathcal{T}(\tau)}(t) - X^*_{i,\mathcal{T}(\tau)}(t) \Big\}^2 dt,$$

where $\hat X^*_{i,\mathcal{T}(\tau)}(t)$ is the predicted trajectory obtained by (4.11), $X^*_{i,\mathcal{T}(\tau)}$ is $X^*_i$ observed after time $\tau$, and $m_p$ is the number of trajectories in the test data. The total mean integrated prediction error (TMIPE) across different values of $\tau$ can be calculated by

$$\mathrm{TMIPE}(\omega, \kappa) = \int_{\tau_s}^{\tau_e} \mathrm{MIPE}(s, \omega, \kappa)\, ds,$$

where $\tau_s = \max(0, \tau - \omega)$ and $\tau_e = \min(\tau + \kappa, T)$, for $\omega, \kappa > 0$. In this study, $T = 24$ (hours) and we set $\tau_s = 8$ and $\tau_e = 20$. For notational convenience, we further set $\omega^* = \tau$ to denote the maximal length of past trajectory information available for prediction and $\kappa^* = 24 - \tau$ to denote the interval length from the current time to the end of the day.
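A minimal MATLAB sketch of the MIPE computation follows; the predicted and observed test trajectories are placeholder assumptions standing in for the outputs of the mixture prediction.

% MIPE(tau, omega, kappa) over the test curves (a sketch).
mp = 14; nT = 49; kappa = 4;
tg = linspace(0, kappa, nT);                     % grid on [0, kappa]
Xhat = randn(mp,nT); Xobs = randn(mp,nT);        % placeholder trajectories
ipe = trapz(tg, (Xhat - Xobs).^2, 2) / kappa;    % integrated squared error per curve
MIPE = mean(ipe);                                % average over the mp test curves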
[Figure 4.4: The predicted group membership probabilities for Groups 1-3 (plotted in blue, green, and red) as functions of the "current time" $\tau$ for the 14 test samples (panels (a)-(n)), based on the trajectories observed up to $\tau$.]

The prediction performance is then compared with the following methods:

• FP: functional prediction based on the functional linear regression without considering groups; that is, all the training data are used to train the FLM without group structure.

• FNP: functional (naive) prediction based on the group-specific functional linear regression, with the group membership determined by the functional naive Bayes classifier in (4.3).

• FMP: functional mixture prediction based on the group-specific functional linear regression, with the posterior probability calculated by (4.4).

To examine how the interval lengths affect the prediction performance, we consider values of $\omega$ from 1 to 6 together with $\omega = \omega^*$, and values of $\kappa$ from 1 to 10 together with $\kappa = \kappa^*$. Table 4.1 indicates that the proposed FMP has the smallest TMIPE compared with FP and FNP. It is not surprising that the prediction performance worsens as $\kappa$ increases for a fixed $\omega$. In Figure 4.5, we plot the TMIPE as a function of $\omega$ for fixed values of $\kappa$. The best prediction strategy is not clear-cut: the best choice of $\omega$ varies with $\kappa$. For example, in Figure 4.5(a), for $\kappa = 1$ the best prediction uses $\omega = \omega^*$, whereas in Figure 4.5(k), for $\kappa = \kappa^*$, the best prediction uses $\omega = 1$. The performance depends not only on the information available prior to time $\tau$, but also on the interval length to be predicted after it. In Figure 4.6, we plot the TMIPE as a function of $\kappa$ for fixed values of $\omega$. The panels of Figures 4.5 and 4.6 all suggest that the proposed functional mixture prediction outperforms the other methods.

4.4 Conclusion

This study proposed a prediction method that incorporates the functional linear model and the functional naive Bayes model introduced in Chapter 3. Although it is motivated by the subject of traffic flow prediction, the proposed method can be applied generally to other subjects suitable for functional data analysis.
Our real data analysis demonstrates that group-specific prediction combined with posterior membership probabilities works reasonably well for predicting traffic flow. We thus conclude that taking the traffic flow patterns into account can greatly improve prediction performance.

Table 4.1: Performance comparisons for FP, FNP, and FMP based on TMIPE ($\times 10^3$) under various values of $\omega$ and $\kappa$.

Method | kappa | w=1 | w=2 | w=3 | w=4 | w=5 | w=6 | w*
FP | 1 | 4.21 | 4.99 | 4.89 | 4.64 | 4.44 | 5.54 | 5.42
FP | 2 | 6.16 | 6.47 | 6.02 | 5.72 | 5.59 | 6.70 | 7.05
FP | 3 | 6.99 | 7.12 | 6.52 | 6.26 | 6.21 | 7.16 | 7.86
FP | 4 | 7.44 | 7.69 | 7.05 | 6.81 | 6.89 | 7.63 | 8.56
FP | 5 | 8.07 | 8.33 | 7.60 | 7.45 | 7.63 | 8.18 | 9.21
FP | 6 | 8.71 | 8.95 | 8.19 | 8.12 | 8.36 | 8.67 | 9.90
FP | 7 | 9.32 | 9.59 | 8.83 | 8.82 | 9.07 | 9.21 | 10.65
FP | 8 | 9.92 | 10.24 | 9.48 | 9.49 | 9.74 | 9.79 | 11.38
FP | 9 | 10.52 | 10.87 | 10.08 | 10.10 | 10.39 | 10.34 | 12.06
FP | 10 | 11.07 | 11.43 | 10.64 | 10.68 | 10.99 | 10.85 | 12.69
FP | k* | 12.34 | 12.77 | 11.97 | 12.06 | 12.40 | 12.12 | 14.11
FNP | 1 | 3.51 | 3.28 | 3.33 | 3.58 | 3.58 | 3.22 | 3.15
FNP | 2 | 4.50 | 3.90 | 4.02 | 4.25 | 4.25 | 3.79 | 3.84
FNP | 3 | 4.67 | 4.11 | 4.38 | 4.64 | 4.64 | 4.15 | 4.20
FNP | 4 | 5.00 | 4.62 | 4.87 | 5.18 | 5.24 | 4.69 | 4.74
FNP | 5 | 5.86 | 5.48 | 5.66 | 6.01 | 6.10 | 5.49 | 5.66
FNP | 6 | 6.85 | 6.42 | 6.59 | 6.92 | 7.01 | 6.41 | 6.65
FNP | 7 | 7.77 | 7.33 | 7.51 | 7.82 | 7.90 | 7.32 | 7.59
FNP | 8 | 8.56 | 8.14 | 8.31 | 8.63 | 8.71 | 8.14 | 8.44
FNP | 9 | 9.31 | 8.89 | 9.05 | 9.37 | 9.46 | 8.88 | 9.25
FNP | 10 | 10.05 | 9.62 | 9.78 | 10.10 | 10.20 | 9.61 | 10.08
FNP | k* | 11.84 | 11.49 | 11.64 | 11.94 | 12.04 | 11.48 | 12.54
FMP | 1 | 3.31 | 3.32 | 3.33 | 3.49 | 3.50 | 3.21 | 2.95
FMP | 2 | 4.25 | 3.85 | 3.99 | 4.13 | 4.13 | 3.75 | 3.57
FMP | 3 | 4.38 | 4.05 | 4.42 | 4.61 | 4.61 | 4.13 | 4.09
FMP | 4 | 4.48 | 4.36 | 4.77 | 5.00 | 5.06 | 4.51 | 4.62
FMP | 5 | 4.75 | 4.68 | 5.07 | 5.30 | 5.40 | 4.81 | 5.22
FMP | 6 | 5.04 | 4.95 | 5.37 | 5.60 | 5.70 | 5.10 | 5.75
FMP | 7 | 5.39 | 5.32 | 5.79 | 5.99 | 6.06 | 5.51 | 6.31
FMP | 8 | 5.80 | 5.77 | 6.24 | 6.45 | 6.54 | 5.96 | 6.99
FMP | 9 | 6.29 | 6.27 | 6.73 | 6.95 | 7.06 | 6.45 | 7.75
FMP | 10 | 6.75 | 6.72 | 7.19 | 7.42 | 7.54 | 6.91 | 8.48
FMP | k* | 8.00 | 8.07 | 8.53 | 8.75 | 8.87 | 8.26 | 10.87

[Figure 4.5: Performance comparisons for FP, FNP, and FMP, based on TMIPE, displayed as a function of $\omega$; panels (a)-(k) correspond to $\kappa = 1$ through $\kappa = \kappa^*$.]

[Figure 4.6: Performance comparisons for FP, FNP, and FMP, based on TMIPE, displayed as a function of $\kappa$; panels (a)-(g) correspond to $\omega = 1$ through $\omega = \omega^*$.]

CHAPTER 5

IDENTIFYING MULTIPLE MEAN CHANGE-POINTS OF FUNCTIONAL SEQUENCE

The change-point problem is an important topic in the analysis of sequence data. Numerous methods have been developed for detecting change-points in scalar and vector observations, but for functional observations the related works are few. In this chapter, we are interested in detecting abrupt changes in the mean function of the data; in particular, we assume that there are multiple change-points. Multiple change-points occur in many applications, especially when the sequence is long. A typical example is the vehicle volume flow on a highway with several interchanges. Figure 5.1 displays the pre-smoothed trajectories of the daily vehicle volumes on southbound National Highway No. 5 (NH5) in Taiwan from May 14, 2010 to May 16, 2010.
The raw data are monitored by 84 electronic detectors at 5-minute intervals, which results in a total of 288 records per day. In each panel, the trajectories are classified into three groups by different colors, consisting of detectors No. 1-70 (red), 71-79 (green) and 80-84 (blue), respectively. The partition boundaries, which are located at the Toucheng interchange (between No. 70 and 71) and the Luodong interchange (between No. 79 and 80), are identified by the algorithm in this chapter. It is easy to recognize that the mean trajectories of these three groups are different, which suggests that the total vehicle volumes changed at least twice across these segments.

[Figure 5.1: The trajectories of the total vehicle volumes on southbound National Highway No. 5 of Taiwan during May 14-16, 2010 (one panel per day). The data were monitored by 84 electronic detectors distributed along the highway, where each curve represents the daily record of a detector. The trajectories are identified as three groups with the segmentation method of this chapter, where the red, green and blue lines represent detectors No. 1-70, 71-79 and 80-84, respectively.]

A conventional approach to detecting and estimating a change-point is through hypothesis testing. Berkes et al. (2009) and Aue et al. (2009) were the first to discuss the mean change-point of a sequence of independent functional data. They introduce the partial-sum-based test, a common approach for univariate and multivariate scalar data, to detect a single mean change-point. The functional observations are transformed into lower-dimensional score vectors by functional principal component (FPC) analysis, and the test statistic is constructed as the normalized quadratic form of the partial sum process of this score sequence. Later, the test was extended to serially correlated functional data by Hörmann and Kokoszka (2010) and Aston and Kirch (2012) under their weak-dependence frameworks. They use the long-run covariance of the score vectors as the normalizer in the test statistic, standardizing the variance of the partial sum process.

These partial-sum-based tests are designed to detect a single mean change-point. For multiple mean change-points, the methods must be applied with binary segmentation to iteratively find all the change-points, that is, by repeatedly performing the tests on the sub-sequences divided by the change-points estimated in previous tests. There are two potential issues with this approach. First, these tests rely on an asymptotic distribution whose validity is doubtful as the sub-sequences get shorter. Furthermore, if the FPC score sequence is dependent, the tests exhibit the non-monotonic power phenomenon due to an oversized bandwidth when estimating the long-run covariance (Perron (1991); Vogelsang (1999)). To avoid this phenomenon, Zhang et al. (2011) adopt the adapted self-normalizer statistic as the normalizer, which avoids estimating the bandwidth; this test is powerful, but in our experience its empirical size is often greater than expected. We are therefore motivated to develop a procedure that estimates the multiple change-points simultaneously, avoiding the aforementioned drawbacks.
The multiple change-points problem can be viewed as partitioning the sequence into non-overlapping segments, where the partition boundaries represent the change-points and the statistical property of interest is unchanged within each segment. In the literature, there are quite a few procedures in data mining and signal processing that deal with scalar sequence segmentation. One of the early partition algorithms is the dynamic programming proposed by Bellman (1961). Given the number of partition boundaries $M$, it starts with the sub-problem of finding the optimal partition with one boundary. The optimization process scans each possible candidate for this boundary and stores the results. The algorithm then proceeds to the second sub-problem of finding the optimal partition with two boundaries, where the first boundary can be determined from the stored information of the first sub-problem. Similarly, the optimization continues with further sub-problems until the main problem is solved; at each iteration, the information from previous sub-problems is recycled to reduce the computational cost. Extensions of dynamic programming include the divide-and-segment algorithm (Terzi and Tsaparas, 2006), the optimal segmentation algorithm (Auger and Lawrence, 1989), and the pruned exact linear time algorithm (Killick et al., 2012), whose authors also implemented the R package 'changepoint' (Killick and Eckley, 2014).

The greedy algorithm is another popular approach. It searches for the change-points iteratively, one at a time, and stops when a certain criterion is met. The search path can be either forward (top-down) or backward (bottom-up). The former assumes no change-point at first and looks for a new partition boundary at each step; binary segmentation also belongs to this type of approach. The backward search assumes that all or many of the data points are partition boundaries and, at each iteration, tries to merge two segments by eliminating their common boundary. The stopping criterion can be that the number of boundaries reaches a given $M$, or that the result of the verification procedure is no longer significant. Notable references include Shatkay and Zdonik (1996), Haiminen and Gionis (2004), Palpanas et al. (2004) and Terzi (2006).

Besides the above two types of methods, Himberg et al. (2001) propose local/global iterative segmentation procedures that recursively adjust the partition boundaries with a fixed number of segments. The local approach adjusts each boundary within its neighboring segments, and the global approach adjusts it to the optimal position within the whole sequence. It is noted that most of the segmentation methods above focus on finding the positions of the boundaries rather than their number, since they are executed under the assumption of a fixed number of segments.

In this chapter, we propose a two-step segmentation procedure to estimate both the number and the locations of the mean change-points of a functional sequence. The first step lists the possible locations of a given number of mean change-points with a recursive least squares segmentation algorithm, and the second step removes the redundant points from the list with a backward elimination procedure. The raw functional observations, which are often recorded as high-dimensional data with regular or irregular sampling frequencies, are presmoothed by basis functions such as B-splines and projected onto the low-dimensional sub-eigenspace spanned by the leading FPCs.
As long as the differences in the mean functions are not orthogonal to this sub-eigenspace, there will also be mean changes at the same locations in the sequence of projected score vectors. Thus, we can apply the proposed segmentation procedure to the score sequence to detect the mean change-points.

5.1 Functional Mean Change-Points

5.1.1 Multiple Mean Change-Points Model

Let $H = L^2([0,1])$ be the Hilbert space of square-integrable functions defined on $[0,1]$, and let $L^p_H$ be the space of $H$-valued random variables $X$. Assume that a sequence of functional data $\{X_i\} \in H$ is observed:

$$X_i(t) = Y_i(t) + \mu_0(t), \quad t \in [0,1], \quad i = 1, \dots, N, \quad (5.1)$$

where $\mu_0 \in H$ is a deterministic function and $\{Y_i\} \in H$ is a sequence of mean-zero random functions with $E\|Y_i\|^2 = \int E Y_1^2(t)\, dt < \infty$. We assume that the $\{Y_i\}$ have the same covariance function and are serially correlated under some weakly dependent framework. In what follows, we omit the functional argument $t$ for simplicity when there is no ambiguity.

A multiple change-points model with $M$ change-points is

$$X_i(t) = Y_i(t) + \sum_{m=1}^{M+1} \mu_m(t) \cdot \mathbf{1}_{(\theta_{m-1}, \theta_m]}(i/N), \quad t \in [0,1], \quad i = 1, \dots, N, \quad (5.2)$$

where

$$\mathbf{1}_{(\theta_{m-1}, \theta_m]}(x) = \begin{cases} 1, & \theta_{m-1} < x \le \theta_m; \\ 0, & \text{otherwise}, \end{cases}$$

and $0 < \theta_1 < \theta_2 < \cdots < \theta_M < 1$ are the $M$ positions of change. For notational convenience, we also denote $\theta_0 = 0$ and $\theta_{M+1} = 1$. The function $\mu_m \in H$ is the deterministic mean function of the segment between the $(m-1)$-th and $m$-th changes. We also assume that $\mu_m \ne \mu_{m+1}$ for all $m = 1, \dots, M$, for model identifiability. In this setting, the number of change-points $M$, their positions $\{\theta_m : m = 1, \dots, M\}$ and the segment-wise mean functions $\{\mu_m(\cdot) : m = 1, \dots, M+1\}$ are unknown parameters. Because the mean functions $\{\mu_m\}$ can be consistently estimated by segment-wise sample means once the positions of the change-points are estimated, we focus on the estimation of the change-points $\{M, \theta_1, \dots, \theta_M\}$.
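For concreteness, the following MATLAB sketch generates a functional sequence obeying model (5.2); the change locations, the segment mean functions, and the simple noise structure are all illustrative assumptions, not quantities from the theory above.

% A functional sequence with two mean change-points, as in model (5.2) (a sketch).
N = 150; t = linspace(0,1,101);
theta = [0.4 0.7];                               % assumed change locations
mu = [zeros(1,numel(t)); sin(2*pi*t); 0.5*ones(1,numel(t))];  % segment means
X = zeros(N, numel(t));
for i = 1:N
    seg = 1 + sum(i/N > theta);                  % segment containing i/N
    X(i,:) = mu(seg,:) + 0.3*randn*sin(pi*t) + 0.05*randn(1,numel(t));
end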
Assume that c(t, s) satisfies the following decomposition:

$$c(t, s) = \sum_{l=1}^{\infty} \lambda_l v_l(t) v_l(s),$$

where λ_1 > λ_2 > ··· ≥ 0 are the eigenvalues and {v_l(·) : l ≥ 1} are the corresponding orthonormal eigenfunctions of c(t, s), satisfying

$$\int c(t, s) v_l(s)\,ds = \lambda_l v_l(t), \quad l = 1, 2, \dots,\ t \in [0, 1].$$

Then we have the following Karhunen-Loève decomposition:

$$Y_i(t) = \sum_{l=1}^{\infty} \eta_{i,l} v_l(t), \quad i = 1, \dots, N, \qquad (5.3)$$

where $\eta_{i,l} = \int_0^1 Y_i(t) v_l(t)\,dt$ is the projected score of Y_i on component v_l. It is noted that for any fixed i, {η_{i,l} : l = 1, 2, …} is an uncorrelated, mean-zero random sequence with variances λ_l. The dimension reduction is done by representing the observed data X_i with its projected scores on the first d eigenfunctions of c(t, s), where d is a positive integer such that λ_1 > ··· > λ_d > 0.

In practice, c(t, s) is unknown and needs to be estimated. A consistent estimator of the covariance kernel c(t, s) under the no change-point model (5.1) is the empirical covariance function

$$c_N(t, s) = \frac{1}{N} \sum_{i=1}^{N} \left(X_i(t) - \bar X_{1,N}(t)\right)\left(X_i(s) - \bar X_{1,N}(s)\right),$$

where $\bar X_{1,N} = N^{-1} \sum_{i=1}^{N} X_i$ is the overall sample mean. Let the eigenfunctions of c_N(t, s) be {ν_l(·)}, and

$$\xi_{i,l} = \int X_i(t)\,\nu_l(t)\,dt, \quad l = 1, \dots, d.$$

These scores ξ_i = (ξ_{i,1},…,ξ_{i,d})^T, for i = 1,…,N, will be used in the procedure for estimating the mean change-points.

5.1.3 Single Mean Change-Point Test

As we mentioned above, all currently available works concerning mean change-point detection in a functional sequence consider the hypothesis of a single change-point, i.e., M = 1 in (5.2). Aston and Kirch (2012) also introduce an extra change type called the epidemic change, in which the sequence changes its mean function at some time and changes back to the original mean at a later time. All these works adopt the partial sum based test, whose steps we outline in the following paragraphs.

Recall that η_{i,l} is the projected score of Y_i on the l-th FPC. Denote η_i = (η_{i,1},…,η_{i,d})^T. We note that if {Y_i} are independent, then {η_i} are independent. Also, if {Y_i} are weakly dependent, then {η_i} are weakly dependent in the same framework. From the functional central limit theorem, if {η_i} are independent, then for the partial sum process of {η_i},

$$\left\{ N^{-1/2} \sum_{i=1}^{\lfloor Nx \rfloor} \eta_i : x \in [0, 1] \right\} \xrightarrow{d} \left\{ \Sigma^{1/2} W_d(x) : x \in [0, 1] \right\}, \qquad (5.4)$$

where the covariance matrix Σ is diagonal with elements {λ_1,…,λ_d} and W_d is a d-dimensional Wiener process with independent components. If {η_i} are weakly dependent, then equation (5.4) still holds, but the matrix Σ is no longer diagonal. It becomes the long-run covariance matrix of {η_i}, which is positive definite and defined by

$$\Sigma = \sum_{h=0}^{\infty} E\,\eta_i \eta_{i+h}^T + \sum_{h=-\infty}^{-1} E\,\eta_{i-h} \eta_i^T. \qquad (5.5)$$

This indicates that for independent and weakly dependent functional data, the cumulative scores converge to the same Wiener process with different covariances. Based on this property, we can derive a statistic featuring this asymptotic result in the following steps:

1. Project the centered functions {X_i − X̄_{1,N}} on the estimated d-dimensional sub-eigenspace through FPC and get the scores {ξ_i − ξ̄_{1,N}};

2. Compute the estimator of the covariance matrix Σ in (5.4), denoted by Σ_N;

3. Let $Q(x) = \sum_{i=1}^{\lfloor Nx \rfloor} (\xi_i - \bar\xi_{1,N})$ be the partial sums of the centered scores; then

$$T_{N,d}(x) = N^{-1} Q(x)^T \Sigma_N^{-1} Q(x) \xrightarrow{D_d[0,1]} B_d(x)^T B_d(x), \qquad (5.6)$$

where B_d(x) is the d-dimensional Brownian bridge with independent components.
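As an illustration of this preprocessing and the statistic (5.6), the following hedged sketch computes FPC scores from discretized curves and evaluates T_{N,d} on the grid x = i/N, assuming independent curves so that Σ_N can be taken diagonal with the estimated eigenvalues. The function names are ours, and the grid-based discretization of the covariance operator is a standard approximation rather than the programs used in this thesis.

import numpy as np

def fpc_scores(X, d):
    # FPC scores of discretized curves X (an N x T grid on [0,1]):
    # empirical covariance -> eigendecomposition -> projection.
    N, T = X.shape
    dt = 1.0 / T
    Xc = X - X.mean(axis=0)                    # center at the overall mean
    cN = Xc.T @ Xc / N                         # c_N(t_j, t_k) on the grid
    lam, V = np.linalg.eigh(cN * dt)           # discretized integral operator
    lam, V = lam[::-1][:d], V[:, ::-1][:, :d]  # leading d eigenpairs
    V /= np.sqrt(dt)                           # L2([0,1])-orthonormal eigenfunctions
    xi = Xc @ V * dt                           # scores xi_{i,l} ~ integral of Xc_i v_l
    return xi, lam

def cusum_stat(xi, lam):
    # T_{N,d}(x) on the grid, with Sigma_N diagonal (independent-curve case).
    N = xi.shape[0]
    Q = np.cumsum(xi, axis=0)                  # partial sums of centered scores
    return np.sum(Q**2 / lam, axis=1) / N

rng = np.random.default_rng(1)
grid = np.linspace(0, 1, 101)
X = rng.normal(size=(200, 101)) + np.sin(2 * np.pi * grid)
X[100:] += 1.0                                 # mean shift at i = 100
xi, lam = fpc_scores(X, d=3)
print(int(np.argmax(cusum_stat(xi, lam))))     # near the true change-point

In the weakly dependent case one would instead plug a kernel-weighted long-run covariance estimator, as in (5.7) below, into the quadratic form.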
Various test statistics can be constructed from T_{N,d}(x); for example, its integral or its maximum are the most commonly used. The matrix Σ_N serves as a normalizer in (5.6). When {Y_i} are independent, it is the diagonal matrix with the d estimated eigenvalues of c_N(t, s) as its diagonal elements. When {Y_i} are weakly dependent, Σ_N is a consistent symmetric positive-definite estimator of Σ, which is often computed as

$$\Sigma_N = N^{-1} \sum_{h=0}^{q} w_q(h) \sum_{i=1}^{N-h} (\xi_i - \bar\xi_{1,N})(\xi_{i+h} - \bar\xi_{1,N})^T + N^{-1} \sum_{h=-q}^{-1} w_q(h) \sum_{i=1}^{N-|h|} (\xi_{i-h} - \bar\xi_{1,N})(\xi_i - \bar\xi_{1,N})^T, \qquad (5.7)$$

where w_q is a kernel weight function with bandwidth q, and the infinite sums in (5.5) are truncated here to sums of 2q + 1 terms.

The parameter q can cause serious problems for detecting mean change-points. It controls the number of autocorrelation lags counted in the long-run covariance estimator. Intuitively, if the autocorrelation decays slowly, q should be large. Several data-driven methods have been proposed to select a proper q for a stationary time series, e.g., Andrews (1991) and Newey and West (1994), the former of which was used in Hörmann and Kokoszka (2010). But when there is a mean change among the observations, the mean difference enlarges the sample covariance and autocovariance functions, and these data-driven methods then tend to provide an oversized bandwidth. This oversized q causes over-accumulation of the sums in (5.7) and results in an over-normalized test statistic. The greater the mean difference, the more the test statistic is over-normalized and the more its power shrinks. This is the so-called non-monotonic power phenomenon observed by Perron (1991) and Vogelsang (1999) in univariate and multivariate time series.

Another inconvenience of the partial sum based tests is the way they handle multiple change-points: they must be applied with binary segmentation to find the change-points one by one. Therefore, at the first few iterations, since there are still undetected change-points in the data, it is possible to produce insignificant tests due to the non-monotonic power phenomenon. Moreover, the sub-sequence length gets shorter in the later iterations, and the validity of the asymptotic distribution becomes doubtful. In the next section, we propose a procedure to estimate the number and locations of multiple change-points in the mean of a sequence of functional observations, which avoids the estimation of long-run covariances and takes a bottom-up approach so as not to make decisions with contaminated statistics.

5.2 Backward Recursive Least Squares Segmentation

To assess the multiple change-points model (5.2), we need two assumptions.

1. The number of true change-points M < N^α, for some α ∈ (0, 0.5).

This assumption is rarely mentioned in the literature concerning multiple change-points. Without this restriction, M could be as large as N, and then (5.2) may not be a proper model since the sequence would change its mean all the time. The upper bound 0.5 for α can be adjusted according to the application. Because we adopt a bottom-up approach in our proposed procedure, we need this assumption to ensure model adequacy and to constrain M to a moderate size, which shortens the search path and reduces the computational cost.

2. The minimum segment length is greater than some positive number β, that is, $\min_{1 \le m \le M} (\theta_m - \theta_{m-1}) > \beta$.
This assumption also poses the implicit constraint M < β^{-1}, but more importantly, it guarantees that there are at least ⌊Nβ⌋ observations between any adjacent change-points, which provides sufficient samples for estimating the mean functions in each segment.

The choice of (α, β) is either subjective or data-dependent. For our algorithm, it is more convenient to assign an integer upper bound for M rather than to choose α. Moreover, since a greater β would limit both the possible number of change-points and the solution space during the recursive optimization, we prefer to use simple candidates such as 2/N or 3/N for β.

We propose a new method to recursively search for the positions of the possible change-points with least squares, and via backward elimination to sequentially test and remove the redundant ones, until all the remaining change-points are significant. This procedure, named the Backward Recursive Least Squares Segmentation (BRLSS), is applied to the FPC score vectors {ξ_i}. It contains two major stages:

1. Partition the score sequence into K non-overlapping segments with recursive least squares segmentation (RLSS), where K > N^α is a positive integer;

2. Backward-eliminate the estimated partition boundaries from the previous stage.

We detail the steps in the subsections below.

5.2.1 Recursive Least Squares Segmentation

The idea of RLSS originates from the K-means algorithm for clustering data, because both methods look for subsets or partitions of the data such that, within each subset or partition, the data share the same mean. We also adopt the devices of a fixed number of segments, initial partitions and a recursive updating approach in our algorithm. The main difference between RLSS and K-means is that, rather than updating the membership of every observation, we only need to adjust the positions of the partition boundaries until they no longer vary. This updating algorithm coincides with the local iterative replacement of Himberg et al. (2001), who state the generic algorithm; we give the detailed steps here.

Let 0 ≤ a < b < c ≤ 1. Denote by

$$S(a, b) = \sum_{i=\lfloor Na \rfloor + 1}^{\lfloor Nb \rfloor} \left(\xi_i - \bar\xi_{\lfloor Na \rfloor + 1, \lfloor Nb \rfloor}\right)^T \left(\xi_i - \bar\xi_{\lfloor Na \rfloor + 1, \lfloor Nb \rfloor}\right) \qquad (5.8)$$

the sum of squared centered distances of the segment $\{\xi_{\lfloor Na \rfloor + 1}, \dots, \xi_{\lfloor Nb \rfloor}\}$, where

$$\bar\xi_{\lfloor Na \rfloor + 1, \lfloor Nb \rfloor} = \frac{1}{\lfloor Nb \rfloor - \lfloor Na \rfloor} \sum_{i=\lfloor Na \rfloor + 1}^{\lfloor Nb \rfloor} \xi_i.$$

If there is a mean change among $\{\xi_{\lfloor Na \rfloor + 1}, \dots, \xi_{\lfloor Nc \rfloor}\}$, a natural estimate of its position would be argmin_b {S(a, b) + S(b, c)}. To avoid the ambiguity of multiple solutions and to embed the interval length constraint β, we set the change-point estimator as

$$b^* = \inf\left\{ \operatorname*{argmin}_{a + \beta < b \le c - \beta} S(a, b) + S(b, c) \right\}. \qquad (5.9)$$

In RLSS, starting from an initial partition, each partition boundary is sequentially updated together with its two adjacent segments by this 2-segmentation in every iteration, and the iterations continue until the boundaries no longer vary.

5.2.2 Backward Elimination

In the backward elimination stage, we iteratively select the most unlikely change-point j* in the current partition, namely the boundary whose removal increases the total sum of squared centered distances the least, and verify it by comparing the segmentwise centered sequences {ζ_i^{[k]}} (centered at the segment means of the current partition θ^{[k]}) and {ζ_i^{[k+1]}} (centered after removing j*) with Box's M-test for the equality of their covariance matrices (Box, 1949). If the test statistic M exceeds the critical value χ²_{α,3}, where α is the significance level, we reject the null hypothesis that the covariances are equal and conclude that j* is a change-point; then BRLSS terminates and θ^{[k]} are the estimates of the real change-points. If the test is not rejected, the elimination procedure is repeated with θ^{[k+1]}. The select and test steps are conducted iteratively for k = 1,…,K−1 until the test is significant at some k. If none of the K−1 tests is significant, we conclude that there is no change-point in the sequence. The algorithm of backward elimination is listed in Algorithm 2.
Algorithm 2 The Backward Elimination Algorithm

input: the FPC scores {ξ_i}, the number of segments K, and the partition {θ_1^{[0]}, …, θ_{K−1}^{[0]}}
set the final number of change-points M = K, and set θ^{[1]} = {θ_1^{[0]}, …, θ_{K−1}^{[0]}}
for k = 1 to K − 1 do
    choose the most unlikely point
        j* = argmin_{θ ∈ θ^{[k]}} { S_{K−k−1}(θ^{[k]} \ {θ}) − S_{K−k}(θ^{[k]}) }
    set θ^{[k+1]} = θ^{[k]} \ {j*}; compute {ζ_i^{[k]}}, {ζ_i^{[k+1]}} and the test statistic M
    if M > χ²_{α,3} then
        M = K − k, and the M points of θ^{[k]} are the estimated change-points
        break
    end if
end for
if M = K then M = 0 end if
output: the number of change-points M and the location estimates θ^{[k]}

5.2.3 Some Remarks on BRLSS

We discuss the BRLSS algorithm in the following three aspects: the initial partition, the backward elimination procedure, and the data dependency.

1. Initial partition: In the RLSS stage, we need an initial partition, and a simple choice is the equally spaced partition. However, like the K-means type algorithms, our algorithm sometimes converges to a local minimum. If we use an informative initial partition, we can possibly avoid this unfavorable situation. A good initial partition should carry the information of the possible segmentation. We provide a small but useful algorithm based on the 2-segmentation in (5.9), named the proceeding 2-segmentation (P2S), to generate a suggested list of initial partition boundaries; it is listed in Algorithm 3.

The P2S algorithm applies the 2-segmentation to expanding sub-sequences that start at the first observation, with the end point proceeding from a given position to the last observation. The estimated positions ω_F from these expanding sub-sequences help depict the profile of the scattering and importance of the changes. The number of times a position is duplicated reflects the significance of the change occurring at it, so we can sort the distinct values in ω_F by their frequencies and select the initial partition boundaries from the leading positions with the highest frequencies. There is a possibility that the most significant change occurs early, so that the remaining changes after it are masked in ω_F. Therefore, we suggest performing P2S on the reversed sequence {ξ_{N−i} : i = 0,…,N−1} as well, and summarizing the partition boundaries from both ω_F and ω_B. Note that the selected partition boundaries must be at least ⌊Nβ⌋ points away from each other.

Algorithm 3 The P2S Algorithm

input: the FPC scores {ξ_i}, the minimum segment length β
for j = ⌊Nβ⌋ + 1 to N do
    perform 2-segmentation on the subsequence {ξ_1,…,ξ_j}:
        ω_{F,j} = inf { argmin_{1 ≤ ω ≤ j} S(0, ω/N) + S(ω/N, j/N) }
    perform 2-segmentation on the reversed subsequence {ξ_N,…,ξ_{N−j+1}}:
        ω_{B,j} = (N + 1) − inf { argmin_{1 ≤ ω ≤ j} S(0, ω/N) + S(ω/N, j/N) }
end for
output: estimated partition point lists ω_F = {ω_{F,j}} and ω_B = {ω_{B,j}}

The lists ω_F and ω_B can also reveal the change patterns in the sequence. When there are many distinct ω_j's in ω_F and ω_B, and the common elements in both lists are few, it could imply that there is no mean change-point. But if there are some common points in both lists and their frequencies are high, they are very likely to be true change-points. We suggest selecting the points with relatively higher frequencies, which are ⌊Nβ⌋ points apart from each other, to form the initial partition. It is always safe to select K greater than the number of selected points by 2 or 3. The vacancy of these 2 or 3 extra boundaries can be filled with the middle points of the first 2 or 3 longest segments in the selected partition.
2. Test procedure in backward elimination: The backward elimination procedure is not limited to the M-test in this paper; it could be any other procedure that can correctly identify whether the removed boundary is a true change-point. The reasoning of backward elimination relies on the fact that the segmentwise centered sequences {ζ_i^{[k]}} and {ζ_i^{[k+1]}} will have different means in the segments A_{j*}^{[k]} and A_{j*+1}^{[k]} adjacent to the removed boundary when a true change-point is removed. In fact, we have tried many test procedures, for example, two-sample t-test type procedures that verify whether the means of {ζ_i^{[k+1]}} in these two segments are the same. We found that if we only consider the samples in these two segments, the test performance is usually not good due to the limited sample size. On the other hand, if we use the full sequence, the mean change effect in these two segments is diluted by the other, unchanged segments, and the powers of these methods are also not satisfactory.

Moreover, when the data actually contain no change-point, we also require this procedure to be insignificant in all iterations; in an analogous sense, we need to control the test size. Out of the many procedures we have tried, Box's M-test for the equality of the covariance matrices of {ζ_i^{[k]}} and {ζ_i^{[k+1]}} generates the most impressive results, though there are some criticisms about its sensitivity to departures from Gaussianity. But we would still emphasize that it is not the only possible choice in our algorithm.

3. Dependency: There is no special treatment for serial correlation in BRLSS. When the sequence {Y_i} possesses high positive autocorrelation, there can be some pseudo mean effects due to fewer fluctuations among the observations. A nonparametric procedure that relies on the data magnitudes will often be confused by these pseudo mean effects and make wrong inferences. Our RLSS algorithm also fails in such cases (see Section 5.3.1). We believe that the highly positively correlated cases are hard to solve for many procedures, even parametric methods. But there is no way to cancel or remove the correlation from the observations without impacting their mean.

Recall from Section 5.2.2 that the long-run covariance is used by Hörmann and Kokoszka (2010) in the single change-point test as the normalizer. Their purpose is to derive an asymptotic distribution for the standardized test statistic. In our algorithm, the long-run covariance is irrelevant since we adopt the least squares approach. We cannot use standardized distances, normalized by some statistic such as the long-run covariance, in RLSS, because the total sums of squares of these standardized distances over the updating iterations are not necessarily a decreasing sequence, so the convergence of RLSS could not be guaranteed. Even though we cannot handle the extreme case of highly positively correlated data, the simulations in Section 5.3.1 show that BRLSS does well with mildly positively correlated data. BRLSS can also work with negatively correlated, even highly negatively correlated, cases, since there are more fluctuations and fewer pseudo mean effects in the data.

5.3 Simulation Study and Real Data Application

5.3.1 Simulation Study

In this section, we use simulated functional data to verify the capability of the BRLSS algorithm.
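Before turning to the simulation settings, here is a small sketch of the 2-segmentation step (5.8)-(5.9) on which both RLSS and P2S are built. The function names and the brute-force cost evaluation are illustrative choices of ours; RLSS would sweep this update over each boundary together with its two adjacent segments until the boundaries stop moving.

import numpy as np

def seg_cost(xi, i, j):
    # S over scores xi[i:j] (rows are d-dimensional FPC score vectors):
    # sum of squared distances to the segment mean, cf. (5.8).
    seg = xi[i:j]
    return float(np.sum((seg - seg.mean(axis=0)) ** 2)) if j > i else 0.0

def two_segmentation(xi, lo, hi, min_len):
    # Least-squares 2-segmentation of xi[lo:hi], cf. (5.9): the boundary b
    # minimizing S(lo, b) + S(b, hi), each side at least min_len long;
    # ties are broken by the smallest b (the infimum in (5.9)).
    costs = [(seg_cost(xi, lo, b) + seg_cost(xi, b, hi), b)
             for b in range(lo + min_len, hi - min_len + 1)]
    return min(costs)[1]

rng = np.random.default_rng(2)
xi = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(1.5, 1, (40, 2))])
print(two_segmentation(xi, 0, 100, min_len=3))  # boundary near 60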
Consider the model (5.2) with M = 0, 1, 2. The three mean functions to be used,

µ_1(t) = 0.5 − 100(t − 0.1)(t − 0.3)(t − 0.5)(t − 0.9),
µ_2(t) = µ_1(t) + 5t² − exp(1 − 20t),
µ_3(t) = µ_1(t) + sin(1 + 6πt),

are drawn in Figure 5.2. Two data generating models for {Y_i(t)} are employed, namely the functional basis expansion model and the autoregressive Hilbertian process of order 1, which we denote by FBE and ARH(1) respectively.

Figure 5.2: The mean functions used in the simulation. The black solid line is µ_1(t), the red dotted line is µ_2(t) and the blue dashed line is µ_3(t).

The FBE model is designed to imitate the Karhunen-Loève decomposition (5.3); the data are generated by

$$Y_i(t) = \sum_{l=0}^{L} \sqrt{\lambda_l}\,\tau_{i,l}\,\phi_l(t), \quad i = 1, \dots, N,$$

where {τ_{i,l} : i = 1,…,N} are random scores with zero mean and unit variance, {φ_l(t)} are orthonormal basis functions that play the role of eigenfunctions, and {λ_l} are the corresponding eigenvalues. We set L = 150 and adopt the Fourier basis

$$\phi_l(t) = \begin{cases} \sqrt{2} \sin(2\pi k t - \pi), & l = 2k - 1; \\ \sqrt{2} \cos(2\pi k t - \pi), & l = 2k, \end{cases}$$

along with the strictly decreasing λ_l = 0.7 × 2^{−l}. To generate data with different levels of dependence, the random scores {τ_{i,l}} are simulated by an AR(1) model with different lag-1 coefficients, that is, for each l,

$$\tau_{i,l} = \rho\,\tau_{i-1,l} + \varepsilon_{i,l}, \quad i = 1, \dots, N,$$

and the noise {ε_{i,l}} are i.i.d. N(0, 1 − ρ²). Four different ρ's are used, 0, 0.2, 0.5 and 0.8, corresponding to the independent, weakly, mildly, and strongly correlated cases.

The ARH(1) model is defined as

$$Y_i(t) = \Psi(Y_{i-1})(t) + \epsilon_i(t), \quad i = 1, \dots, N,$$

with a sequence of independent standard Wiener processes {ε_i(t)} and the AR operator Ψ(·). The integral kernel of Ψ is ψ(s, t) = c{2 − (2s − 1)² − (2t − 1)²}, for s, t ∈ [0, 1]. The constant c is used to adjust the Hilbert-Schmidt norm $\|\Psi\|_S = \{\iint |\psi(s, t)|^2\, ds\, dt\}^{1/2}$ so that it equals some predefined values. Four values of c are selected so that ‖Ψ‖_S equals 0, 0.2, 0.5, and 0.8. The larger ‖Ψ‖_S is, the more correlated {Y_i(t)} are, and for c = 0 (i.e., ‖Ψ‖_S = 0), {Y_i(t)} are independent.

We test for 0, 1 and 2 change-points in this simulation. For the one change-point data, we design three settings, θ_1 = 0.15, 0.5, and 0.8, to represent an early, midterm, and late change respectively. Their combinations are used for the two change-points cases. These seven change-point settings combine with the four dependence levels to generate a total of 28 sub-models. For each sub-model, 1000 sequences of length 100 and of length 500 are generated, where all the functions are simulated on the discrete time points {0.01j : j = 1,…,100}. Every realization is pre-smoothed and projected on the FPCs with the R package 'fda'. The minimum segment length for BRLSS is 3, which corresponds to β = 0.03 for the sequences of length 100 and β = 0.006 for length 500. Moreover, two kinds of initial partitions with K = 10 are used: the equally spaced partition (ESP) and the P2S partition with the given β. The nominal size used by Box's M-test in the backward elimination is 0.05.

To analyze the simulation results, two quantities are evaluated: the number of change-points and their estimated locations. The precision of the locations depends on the capability of RLSS, while the correctness of the change-point number relies on the backward elimination. We examine these quantities by looking at the cases with and without changes separately.
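A sketch of the FBE data generator just described might look as follows. The helper names are ours, we index the basis from l = 1 for simplicity (omitting a constant term), and the seeding is illustrative.

import numpy as np

def simulate_fbe(N=100, T=100, rho=0.5, L=150, theta=(0.5,), means=None):
    # Draw N curves on the grid {0.01, ..., 1.00} from the FBE model with
    # AR(1) scores, then add segmentwise mean functions at the change-points
    # in `theta`; `means` is a list of len(theta)+1 mean functions of t.
    t = np.arange(1, T + 1) / T
    ks = np.arange(1, L // 2 + 1)
    phi = np.empty((L, T))
    phi[0::2] = np.sqrt(2) * np.sin(2 * np.pi * np.outer(ks, t) - np.pi)
    phi[1::2] = np.sqrt(2) * np.cos(2 * np.pi * np.outer(ks, t) - np.pi)
    lam = 0.7 * 2.0 ** -np.arange(1, L + 1)
    rng = np.random.default_rng()
    tau = np.zeros((N, L))
    tau[0] = rng.normal(size=L)                      # stationary start, unit variance
    eps = rng.normal(0.0, np.sqrt(1 - rho**2), size=(N, L))
    for i in range(1, N):
        tau[i] = rho * tau[i - 1] + eps[i]
    Y = (tau * np.sqrt(lam)) @ phi                   # truncated Karhunen-Loeve expansion
    if means is not None:
        bounds = [0] + [int(N * th) for th in theta] + [N]
        for m, mu in enumerate(means):
            Y[bounds[m]:bounds[m + 1]] += mu(t)      # segmentwise mean shift
    return Y

mu1 = lambda t: 0.5 - 100 * (t - 0.1) * (t - 0.3) * (t - 0.5) * (t - 0.9)
mu2 = lambda t: mu1(t) + 5 * t**2 - np.exp(1 - 20 * t)
X = simulate_fbe(theta=(0.5,), means=[mu1, mu2])
print(X.shape)  # (100, 100)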
5.3.1.1 Performance for Data without Change-Points

For data without a mean change-point, we only need to check the number of estimated change-points. Table 5.1 lists the counts of correctly matched numbers of change-points for the sub-models that contain no change-point, that is, the number of samples with estimated M = 0 out of the 1000 realizations.

Table 5.1: Frequencies of correctly matched change-point number in 1000 no-change-point samples.

                        FBE (ρ)                    ARH(1) (‖Ψ‖_S)
Partition  Size     0    0.2    0.5    0.8       0    0.2    0.5    0.8
ESP        100   1000   1000    977     68     991    994    983    904
ESP        500   1000   1000   1000    703    1000   1000   1000    994
P2S        100   1000   1000    980     88    1000   1000    999    889
P2S        500   1000   1000   1000    736    1000   1000   1000    994

Figure 5.3: Box plots of the estimated number of change-points in the length 100 no-change-point samples using the P2S initial partition. The results of FBE are plotted in the left panel and ARH(1) in the right. The X-axis represents the correlation categories and the Y-axis is the estimated number of change-points.

The frequencies of correctly matched change-point numbers are all high except for the FBE model with ρ = 0.8. Furthermore, almost all the ARH(1) samples with ‖Ψ‖_S ≤ 0.5 perfectly identify M, apart from the length 100 samples using the equally spaced initial partition. This shows that the backward elimination with Box's M-test can effectively identify the data with no change-point, even when they are weakly dependent. The influence of the initial partition can also be seen from these results, since there are as many or more matched cases with P2S than with the equally spaced partition in most of the sub-models. We also draw box plots of the estimated M in Figure 5.3, using the length 100 samples with the P2S initial partition. It is clear that the number of change-points is overestimated in the FBE model with ρ = 0.8, indicating the existence of one or more pseudo mean effects in the data, which are too strong for our method to handle.

5.3.1.2 Performance for Data with Change-Points

For the sub-models with mean change-points, we also look at the estimated number of change-points first. The frequencies of exactly matched numbers for the one and two change-points data are listed in Table 5.2 and Table 5.3 respectively.

Table 5.2: Frequencies of correctly matched change-point number in 1000 one change-point samples.

                               FBE (ρ)                    ARH(1) (‖Ψ‖_S)
Partition  Size   θ        0    0.2    0.5    0.8       0    0.2    0.5    0.8
ESP        100   0.15    925    947    930     92     997    995    992    838
ESP        100   0.50    931    963    932    111     977    982    991    876
ESP        100   0.80    892    931    919     87     967    979    985    854
ESP        500   0.15    988    990    997    650    1000   1000   1000    997
ESP        500   0.50    989    993    993    602    1000   1000   1000    968
ESP        500   0.80    944    978    990    660    1000    999   1000    956
P2S        100   0.15    998    996    944    124    1000   1000    997    834
P2S        100   0.50    999    999    960    148    1000   1000    998    890
P2S        100   0.80   1000    997    939    121    1000    999    998    859
P2S        500   0.15   1000   1000   1000    722    1000   1000   1000    989
P2S        500   0.50   1000   1000   1000    698    1000   1000   1000    983
P2S        500   0.80   1000   1000   1000    725    1000   1000   1000    984

The patterns in Table 5.2 and Table 5.3 are quite similar to those in Table 5.1: the frequencies are high in most cases except for the FBE model with ρ = 0.8.
The results in these two tables demonstrate the capability of our backward elimination procedure, which stops at exactly the iteration at which the number of remaining change-points is correct. The box plots of the estimated M in the length 100 samples using the P2S initial partition are drawn in Figure 5.4 for the one and two change-points data. Under each correlation category, the boxes of the different change-point settings are drawn sequentially in different colors. Similarly, the estimated M in the FBE model with ρ = 0.8 is overestimated and varies widely, but in the other sub-models the estimated M's all concentrate on the estimands. This agrees with the results in Table 5.2 and Table 5.3. The frequencies of correctly matched numbers of change-points increase when the sample length becomes 500, implying that the change-point number estimator is more consistent as the sample size increases. Further, the influence of the initial partition is still significant.

Figure 5.4: Box plots of the estimated number of change-points in length 100 samples with change-points using the P2S initial partition. The upper panels show the one change-point data and the lower panels the two change-points data. The results of FBE are plotted in the left panels and ARH(1) in the right. The X-axis represents the correlation categories and the Y-axis is the estimated number of change-points. For each correlation category, the different change-point settings are drawn from left to right in black ({0.15}/{0.15, 0.50}), red ({0.50}/{0.15, 0.80}) and green ({0.80}/{0.50, 0.80}) respectively.

Now we turn to the location estimation of BRLSS.

Table 5.3: Frequencies of correctly matched change-point number in 1000 two change-point samples.
                                    FBE (ρ)                    ARH(1) (‖Ψ‖_S)
Partition  Size   θ             0    0.2    0.5    0.8       0    0.2    0.5    0.8
ESP        100   (0.15, 0.50)  763    832    897    109     724    800    906    810
ESP        100   (0.15, 0.80)  733    864    885    104     849    887    939    812
ESP        100   (0.50, 0.80)  830    898    886    115     758    821    891    819
ESP        500   (0.15, 0.50)  903    955    977    550     959    970    996    952
ESP        500   (0.15, 0.80)  913    962    983    565     961    976    996    954
ESP        500   (0.50, 0.80)  942    983    985    537     950    964    992    949
P2S        100   (0.15, 0.50)  998    996    944    179    1000   1000    989    859
P2S        100   (0.15, 0.80)  998    998    935    195    1000   1000    994    852
P2S        100   (0.50, 0.80) 1000    998    952    188    1000   1000    997    879
P2S        500   (0.15, 0.50) 1000   1000   1000    719    1000   1000   1000    989
P2S        500   (0.15, 0.80) 1000   1000    998    747    1000   1000   1000    986
P2S        500   (0.50, 0.80) 1000   1000    999    697    1000   1000   1000    990

We are interested in the frequencies of both exactly estimated numbers and correctly matched locations of the change-points. In this simulation the real change-points are known in advance, so we can count the frequencies of exactly matched locations. In many applications where the real change-points are unknown, we wish our location estimates to be as close to the real ones as possible. Therefore, we also check the frequencies of rough matches, where the estimated locations fall in the neighborhood of the true change-points with ±2 offsets. The results for the FBE and ARH(1) models are listed in Table 5.4 and Table 5.5 respectively.

From Table 5.4, we note that BRLSS correctly estimates both the number and the positions of the change-points in more than 70% of the FBE samples with ρ = 0, 0.2 and 0.5 if we adopt the P2S initial partition. When we relax the precision of the location estimates and allow rough matches within ±2 offsets, the match percentages jump up to about 90%. This is useful in many applications: our method can precisely point out the locations where mean changes occur, and even when these are not exactly the true change-points, they are only a very few positions away. A similar pattern is found in Table 5.5, and the percentages of correct matches (both exact and rough) are even higher.

Table 5.4: Frequencies of matched number and exactly/roughly matched locations of change-points in 1000 FBE samples.

                               Exact (ρ)                   Rough (ρ)
Partition  Size  θ             0    0.2    0.5    0.8      0    0.2    0.5    0.8
ESP   100  (0.15)            840    855    784     56    923    939    898     73
ESP   100  (0.50)            843    868    779     77    929    956    898     97
ESP   100  (0.80)            808    825    755     55    890    924    884     72
ESP   100  (0.15, 0.50)      659    698    689     67    758    823    860     95
ESP   100  (0.15, 0.80)      679    740    682     56    768    856    848     77
ESP   100  (0.50, 0.80)      725    776    685     70    829    894    849     95
ESP   500  (0.15)            907    886    797    324    987    984    934    426
ESP   500  (0.50)            905    886    835    340    985    989    953    419
ESP   500  (0.80)            858    879    825    341    942    975    945    435
ESP   500  (0.15, 0.50)      785    805    714    189    902    948    900    312
ESP   500  (0.15, 0.80)      799    810    714    204    912    954    895    328
ESP   500  (0.50, 0.80)      829    830    748    201    938    976    918    299
P2S   100  (0.15)            912    900    793     74    996    992    916     96
P2S   100  (0.50)            915    889    806     97    999    991    932    126
P2S   100  (0.80)            915    895    787     70   1000    989    908     94
P2S   100  (0.15, 0.50)      858    832    697     92    995    982    897    136
P2S   100  (0.15, 0.80)      874    845    705     98    996    986    892    144
P2S   100  (0.50, 0.80)      849    817    695     81    993    977    893    116
P2S   500  (0.15)            925    903    816    381    998    995    954    492
P2S   500  (0.50)            919    891    841    395    998    998    954    504
P2S   500  (0.80)            906    901    832    409    999    993    955    506
P2S   500  (0.15, 0.50)      872    836    737    267    997    991    928    401
P2S   500  (0.15, 0.80)      880    849    745    271    998    995    933    426
P2S   500  (0.50, 0.80)      850    810    745    258    987    984    913    393

In general, a change-point that occurs in the middle of a sequence is often easier to detect than those near either end.
But the results in Table 5.4 and Table 5.5 show that there is no significant distinction among the different change-point settings. Hence, BRLSS is robust to the change-point locations.

Table 5.5: Frequencies of matched number and exactly/roughly matched locations of change-points in 1000 ARH(1) samples.

                               Exact (‖Ψ‖_S)               Rough (‖Ψ‖_S)
Partition  Size  θ             0    0.2    0.5    0.8      0    0.2    0.5    0.8
ESP   100  (0.15)            942    925    909    771    996    992    984    824
ESP   100  (0.50)            918    933    938    816    975    980    988    861
ESP   100  (0.80)            885    926    931    806    963    976    979    845
ESP   100  (0.15, 0.50)      665    722    798    727    724    798    898    795
ESP   100  (0.15, 0.80)      779    795    829    715    849    885    930    795
ESP   100  (0.50, 0.80)      706    748    823    732    758    818    886    801
ESP   500  (0.15)            944    941    917    824    999   1000    981    912
ESP   500  (0.50)            937    941    943    850    997    997    995    928
ESP   500  (0.80)            929    929    939    832    999    996    994    909
ESP   500  (0.15, 0.50)      877    893    878    727    958    970    977    856
ESP   500  (0.15, 0.80)      882    896    874    735    960    976    974    873
ESP   500  (0.50, 0.80)      875    886    900    773    947    961    987    891
P2S   100  (0.15)            954    933    910    761    998    992    984    824
P2S   100  (0.50)            946    943    924    780   1000    980    988    861
P2S   100  (0.80)            930    948    929    777    999    976    979    845
P2S   100  (0.15, 0.50)      913    899    852    714    999    798    898    795
P2S   100  (0.15, 0.80)      933    899    875    702    998    885    930    795
P2S   100  (0.50, 0.80)      898    891    874    691    997    818    886    801
P2S   500  (0.15)            946    943    913    806    999   1000    985    912
P2S   500  (0.50)            947    940    930    817   1000    999    993    914
P2S   500  (0.80)            929    934    921    824   1000    999    995    913
P2S   500  (0.15, 0.50)      907    909    865    712    999   1000    984    853
P2S   500  (0.15, 0.80)      920    918    867    707    999   1000    979    866
P2S   500  (0.50, 0.80)      898    906    866    742    992    996    978    859

5.3.2 Data Application

In this section, we demonstrate BRLSS on the NH5 vehicle volume data described at the beginning of Chapter 5. NH5 is the first freeway connecting eastern and western Taiwan. It is famous for the world's fifth longest road tunnel, the Hsuehshan Tunnel. Naturally, traffic flow control is critical for the government and the road users. Along both directions of NH5, a compact monitoring system, including 84 electric detectors on the southbound direction and 86 on the northbound, is installed, and several traffic indices are monitored. The indices are collected every 5 minutes, and the daily profile, which consists of 288 observations, can be viewed as a functional sample. Here we use the total volumes of all lanes in each direction during May 10 - 16, 2010. From Figure 5.1, the volume trajectories seem to shift downward several times; we would like to know the actual number and positions of these changes.

All the 288-point curves are pre-smoothed and transformed to FPC scores with the same programs as in the simulation. Here the FPC dimension is determined so that the sum of the variance proportions of the leading components exceeds 80%. We apply BRLSS to these data and use the P2S initial partition with K = 6 (i.e., it begins with 5 change-points). The results are listed in Table 5.6.

Table 5.6: The sequentially selected most unlikely points and their Box's M statistics for vehicle volumes on southbound and northbound NH5 during May 10 - 16, 2010. Each entry gives the selected detector number followed by its M statistic in parentheses. The asterisks denote significant tests under the critical value χ²_{0.05,3} = 7.8147. The bold numbers are the estimated change-points.
Direction  Date     Iteration 1    Iteration 2    Iteration 3    Iteration 4      Iteration 5
South      May 10   10 (0.0124)    29 (0.1270)    74 (1.2213)    79 (*13.0161)    70 (32.9286)
South      May 11   11 (0.2690)     8 (0.1642)    74 (0.7938)    79 (*12.8263)    70 (28.8395)
South      May 12    8 (0.0301)    29 (0.5750)    74 (0.6480)    79 (*11.1850)    70 (30.5788)
South      May 13    8 (0.0132)    74 (1.2952)    29 (0.9215)    79 (*13.5420)    70 (33.9575)
South      May 14    7 (0.0036)    29 (0.5652)    74 (3.3232)    79 (*20.5770)    70 (40.9360)
South      May 15   15 (0.0960)    29 (2.2935)    74 (2.8535)    79 (*20.5580)    70 (53.2198)
South      May 16    8 (0.1313)    74 (2.3620)    30 (6.5459)    79 (*9.9222)     70 (47.3120)
North      May 10    3 (0.1890)     9 (0.0698)    63 (0.7259)    81 (1.2408)      76 (*38.5139)
North      May 11   33 (0.0081)     9 (0.2376)    63 (0.5702)    81 (0.5674)      76 (*24.2468)
North      May 12   66 (0.0367)    63 (0.2376)    81 (0.7212)    33 (0.9632)      76 (*25.9946)
North      May 13   63 (0.0210)    72 (0.4319)    81 (0.7458)    33 (1.0611)      76 (*30.8025)
North      May 14    9 (0.0508)    72 (0.3813)    33 (0.6515)    81 (0.8733)      76 (*32.4544)
North      May 15   63 (0.0371)    76 (1.3343)    33 (1.2541)    81 (*8.2028)     71 (30.4828)
North      May 16   63 (0.0410)    76 (1.5727)    33 (7.5704)    81 (3.0006)      72 (*39.5000)

The southbound results are consistent, with two critical nodes, No. 70 and No. 79. Both detectors are at interchange exits; in particular, detector No. 70 is also at the end of the Hsuehshan Tunnel. The northbound results are interesting. During May 10 - 14, 2010, which are weekdays, the volume flows change at detector No. 76, but the critical nodes switch to No. 71 and No. 81 on Saturday and then to No. 72 on Sunday. The northbound detectors No. 72, 76, and 81 are all located at the entrances of different interchanges, while No. 71 is at the beginning of the Hsuehshan Tunnel. In fact, the estimated northbound detectors match the estimated southbound detectors geographically: the southbound detector No. 70 is close to the northbound detectors No. 71 and 72, and southbound No. 79 is at about the same place as northbound No. 81. Therefore, it is natural to link the volume mean changes with the vehicle input/output and the different driving policies in the different road sections divided by the identified interchanges. Furthermore, the weekday/weekend effect is also worth studying.

Figure 5.5: The sequential plots of the first FPC scores of the vehicle volumes. The southbound scores are in the upper panel and the northbound scores are in the lower panel. From left to right are May 14, 15 and 16. The dashed lines are the estimated locations of the change-points.

The sequential plots of the first FPC scores in May 14 - 16 are plotted in Figure 5.5, each with the estimated change-points marked (the dashed lines). The mean shifts can be clearly identified at the estimated locations of the change-points. As argued in Aston and Kirch (2012), if the mean difference is significant, it impacts the FPC such that it is included in the sub-space spanned by the leading components.
Hence, the mean difference is not orthogonal to that sub-space and is detectable by the change-point test procedures. For this data set, the changes are so obvious that they can be seen in the first component.

Here the projected FPC dimensions in the seven days are all one, regardless of the direction. That is, the first component explains more than 80% of the total variation of the raw functions. In this situation, the other components may carry very little information about the mean differences. Including them in the change-point detection procedure, either our BRLSS or the single change-point test of Berkes et al. (2009), will not help the test power but may sometimes hamper the detectability. Because the scores of the selected components are treated equally in the procedure, the importance of the informative component is diluted as more non-informative components are included, and the detection results would become unstable due to the unnecessary information in the remaining components. To this end, we suggest using weighted scores, that is, multiplying each component's projected scores by its variance proportion, to reduce the effects of non-informative components.

5.4 Conclusions

In this study, we proposed a two-step segmentation algorithm for detecting multiple mean change-points in a sequence of functional data. Functional data are transformed into functional principal component scores and used in the segmentation algorithm for estimating both the number and locations of the mean change-points. The first step of the algorithm is to estimate a given large number of change-point candidates with recursive least squares updating, and the second step removes the redundant candidates with a backward elimination procedure. Simulation shows that this algorithm can accurately estimate both the number and locations of the change-points in a functional sequence, even when the sequence is weakly dependent with mild autocorrelation. An application is illustrated with a sequence of highway vehicle volume flows in Taiwan; the real data analysis demonstrates that the mean shifts can be captured at the estimated locations of the change-points. Overall, we conclude that the proposed BRLSS algorithm successfully identifies the change-points in a functional sequence.

CHAPTER 6

CONCLUSIONS AND FUTURE DIRECTIONS

This thesis is concerned with the functional data analysis of daily curves in traffic flow data. The main obstacle in modeling daily traffic flow trajectories is the "curse of dimensionality". Since the daily traffic flow data can be viewed as a stochastic process, this leads to the application of the functional data analysis framework. We have first reviewed the Karhunen-Loève expansion theorem and further developed a functional principal component model for traffic flow trajectories, where the mean function describes the trend of the overall traffic flow pattern and the eigenfunctions take the variations of each daily traffic flow into account. In addition, the infinite-dimensional functions can be reduced to a set of finite-dimensional functional principal component scores, which serve well as a proxy for each daily traffic trajectory. This thesis has focused on the FPC scores to develop further analysis of traffic flow data.
In Chapter 2, we have introduced the functional data approach to deal with missing values in traffic flow data and compared the imputation performance of the proposed functional principal component model with those of the probabilistic principal component and Bayesian principal component models. Moreover, based on the FPCA approach, the functional principal component scores can be applied to the functional bagplot and functional HDR boxplot. Using FPC scores derived from conditional expectation makes outlier detection possible for incomplete functional data. Experiments with simulated and real traffic flow data have shown the effectiveness of the functional data approach.

In Chapter 3, the naive Bayes classifier for functional data has been proposed. The novelty here is to make a naive assumption of common functional principal component scores to construct surrogate densities. Simulation studies have shown that the proposed functional naive Bayes classifier has a superior performance compared with other state-of-the-art classifiers. This result supports a density point of view for the classification of functional data.

In Chapter 4, we have introduced the functional naive prediction and functional mixture prediction for predicting the unobserved daily traffic flow trajectories. The idea is that we adapt the functional naive Bayes classifier to classify partially observed traffic flow trajectories and consider the group-specific functional linear model to predict the future unobserved traffic flow trajectory for a partially observed flow trajectory. Moreover, the posterior group membership probability can be calculated by the surrogate density introduced in Chapter 3. The proposed functional data approaches, including classification and prediction, facilitate accurate prediction of daily traffic flow.

In Chapter 5, a two-step segmentation algorithm for detecting multiple mean change-points in a sequence of functional data has been proposed. Functional data are transformed into functional principal component scores, which are then used in the segmentation algorithm for estimating both the number and locations of the mean change-points. The first step of the algorithm is to estimate a given large number of change-point candidates with recursive least squares updating, while the second step removes the redundant candidates with a backward elimination procedure. Simulation shows that this algorithm can accurately estimate both the number and locations of the change-points in a functional sequence, even when the sequence is weakly dependent with mild autocorrelation.

In each chapter we have demonstrated that the proposed functional principal component methods have their own merits and can outperform the other investigated methods in traffic data applications. However, in this thesis we only worked with univariate functional data. In many situations, data can be collected on several variables simultaneously. This encourages the study of multivariate functional data and raises a natural motivation to extend the theory of univariate FPCA to the multivariate case. We briefly summarize the future research direction in the following section.

6.1 Future Work

In recent years there has been explosive growth in the number of neuroimaging studies performed using functional Magnetic Resonance Imaging (fMRI). A standard fMRI study gives rise to massive amounts of noisy data with a complicated spatio-temporal correlation structure (Haxby et al., 2001).
The spatio-temporal correlation structures of the voxels in fMRI data pose a challenge for univariate functional principal component analysis, since the voxels (in the spatial sense) should be considered jointly, and the correlation between them must be taken into account in the kernel covariance operator. Simply ignoring the correlation structure between voxels and performing a separate FPCA for each voxel makes the interpretation of the FPCA results difficult. Multivariate functional data analysis (Chiou et al., 2014a) is potentially useful in analyzing this kind of spatio-temporal data because the covariation between voxels can be directly addressed by a single set of multivariate functional principal component scores, which serve well as a proxy for multivariate functional data.

In the study of Henderson et al. (2007), two cortical areas (posterior parahippocampal cortex and retrosplenial cortex) were investigated using fMRI to measure patterns of response while subjects viewed indoor, outdoor, and face pictures. Their findings suggest differences in function in these two areas. Here, we take a different approach: we aim to classify these fMRI brain images based on the pictures that the subjects viewed. We extend the functional naive Bayes classifier presented in Chapter 3 from univariate FPCA to multivariate FPCA. The relationship between univariate and multivariate FPCA for the Karhunen-Loève representation is discussed in Happ and Greven (2017). The densities of the first 10 multivariate functional common principal component scores are shown in Figure 6.1. It is clear that the densities for each picture are different and thus can be used to classify fMRI images. This result may lead to an approach to functional classification for fMRI data. Although the technique is still under development, the densities shown in Figure 6.1 look promising. More work is required to develop a multivariate functional naive Bayes classifier.

Figure 6.1: Kernel density estimates for the first 10 multivariate functional common principal component scores for the fMRI data.

APPENDIX

PPCA Imputation Method

Probabilistic PCA (PPCA) is a probabilistic formulation of PCA based on a Gaussian latent variable model, first introduced by Tipping and Bishop (1999). The goal of such a model is to capture the covariance structure of an observed d-dimensional variable y using a corresponding q-dimensional latent variable x through a linear transformation function, where q < d. Let y_i = (y_{i1}, …, y_{id})^T be a set of observed variables for observation i and x_i = (x_{i1}, …, x_{iq})^T be the corresponding latent variable, and assume that y_i is produced by a linear transformation of x_i plus additive Gaussian noise. Denoting the transformation by the d × q matrix W and the d-dimensional noise vector by ε_i, the PPCA model can be expressed as

$$y_i = W x_i + \mu + \varepsilon_i,$$

where µ permits the model to have a nonzero mean. Conventionally, the latent variable x_i is assumed to follow N_q(0, I), where I is the identity matrix.
Note that x_i can be viewed as the principal component score of observation i in terms of traditional PCA. By additionally specifying the noise to be Gaussian, ε_i ∼ N_d(0, σ²I), and given the latent variable x_i, the conditional distribution of the observed data can be expressed as

$$p(y_i|x_i) \sim N_d(W x_i + \mu, \sigma^2 I).$$

With the Gaussian prior on x_i, we can obtain the marginal distribution of the observed data y_i,

$$p(y_i) \sim N_d(\mu, C), \quad \text{where } C = W W^T + \sigma^2 I.$$

Using Bayes' rule, the posterior distribution of the latent variable x_i given the observed y_i may be calculated as

$$p(x_i|y_i) \sim N_q\!\left(M^{-1} W^T (y_i - \mu),\ \sigma^2 M^{-1}\right), \qquad (1)$$

where M = W^T W + σ²I. Note that M is of dimension q × q and C is of dimension d × d. Given a set of observed data recorded in the vector y = {y_i}, i = 1,…,N, the log-likelihood of the observed data under this model with unknown parameters θ = (W, µ, σ²) can be written as

$$L(\theta|y) = \sum_{i=1}^{N} \ln p(y_i|\theta).$$

The unknown parameters θ for this model can be estimated by maximizing the log-likelihood L(θ|y). One can simply apply the EM algorithm to achieve its maximum. In the Expectation-step, we take the expectation of the log-likelihood L, evaluated using the current estimates of the parameters. In the Maximization-step, we compute the parameters W, µ and σ² that maximize the expected log-likelihood found in the Expectation-step. The EM iteration is repeated until the algorithm converges. As shown by Tipping and Bishop (1999), the columns of W span the principal subspace of conventional PCA when the log-likelihood of the PPCA model is maximized. Thus the maximum likelihood estimate of the loadings matrix Ŵ in PPCA corresponds exactly to the loadings matrix of conventional PCA. Now, if y_i contains missing values, we can use ŷ_i = Ŵ x̃_i + µ̂ as an estimate for y_{ij} if y_{ij} is missing, where x̃_i is the posterior mean in (1). Further details of PPCA can be found in Tipping and Bishop (1999). The PPCA program is implemented in the package 'pcaMethods' in R.

BPCA Imputation Method

BPCA, a Bayesian estimation method for PPCA, was proposed by Bishop (1999). It introduces some continuous hyper-parameters to determine the optimal value of the latent space dimensionality q, which PPCA provides no mechanism for determining. In a fully Bayesian framework, both the number of principal components q and the model parameters θ are considered to be drawn from appropriate prior distributions. According to Bayesian theory, the posterior distribution of θ and x is computed as

$$p(\theta, x|y) \propto p(y, x|\theta)\,p(\theta),$$

where p(θ) is the prior distribution determined before estimation. Following Oba et al. (2003), we assume that there are conjugate priors for µ and σ², and further assume a hierarchical prior for W, which is used to determine the best dimension of the latent space. We then have the joint prior

$$p(\theta|\alpha) \equiv p(\mu, W, \tau|\alpha) = p(\mu|\tau)\,p(\tau) \prod_{j=1}^{q} p(w_j|\tau, \alpha_j), \qquad (2)$$

where p(µ|τ) ∼ N_d(µ_0, (ν_{µ0}τ)^{-1}I), p(w_j|τ, α_j) ∼ N_q(0, (α_j τ)^{-1}I) and p(τ) ∼ G(τ_0, ν_{τ0}). Here α is a q-dimensional vector of hierarchical parameters used to control the columns of W and diminish the overfitting problem. The settings of all hyper-parameters are the same as in Oba et al. (2003), corresponding to an almost non-informative prior.

Bayesian inference is achieved by evaluating the posterior distribution of the unknown variables given the observations. In Oba et al. (2003), a variational Bayes algorithm is used to execute the Bayesian estimation for both the model parameter θ and the missing values of y. The posterior distributions for θ and the missing values of y are obtained via an iterative algorithm, and the missing values are imputed using the expectation of their estimated posterior distribution under the optimal q.
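For illustration, a simplified PPCA-style imputation loop in the spirit of the above could be sketched as follows. It alternates a rank-q fit (using the Tipping-Bishop closed form for W) with re-imputation of the missing cells via the posterior mean in (1), rather than running the full EM over missing data or the variational BPCA scheme; all names are our own. In R, the packaged route is pcaMethods, e.g., pca(y, method = "ppca", nPcs = q).

import numpy as np

def ppca_impute(Y, q, n_iter=50):
    # Simplified sketch: initialize missing entries with column means, then
    # alternate (i) fitting a rank-q PCA model with an isotropic noise estimate
    # and (ii) re-imputing the missing entries with W x_tilde + mu.
    Y = np.array(Y, dtype=float)
    miss = np.isnan(Y)
    mu = np.nanmean(Y, axis=0)
    Yf = np.where(miss, mu, Y)                 # mean-initialized data
    for _ in range(n_iter):
        mu = Yf.mean(axis=0)
        Yc = Yf - mu
        U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
        var = s**2 / len(Yf)                   # eigenvalues of sample covariance
        sigma2 = var[q:].mean() if len(var) > q else 0.0
        # ML loadings: W = V_q (Lambda_q - sigma2 I)^{1/2} (Tipping & Bishop)
        W = Vt[:q].T * np.sqrt(np.maximum(var[:q] - sigma2, 0.0))
        M = W.T @ W + sigma2 * np.eye(q)
        Xt = Yc @ W @ np.linalg.inv(M)         # posterior means M^{-1} W^T (y - mu)
        recon = Xt @ W.T + mu
        Yf[miss] = recon[miss]                 # update only the missing cells
    return Yf

rng = np.random.default_rng(3)
Z = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 6)) + rng.normal(0, .1, (100, 6))
Zm = Z.copy(); Zm[rng.random(Z.shape) < 0.1] = np.nan
print(round(float(np.abs(ppca_impute(Zm, q=2) - Z)[np.isnan(Zm)].mean()), 3))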
Proof of Theorem 3

The proof of this theorem is carried out in two parts.

1. For a fixed K, the sequence {S_K(θ^{(r)}) : r = 1, 2, …} converges in finite r:

Recall that in RLSS, we sequentially update each partition boundary together with its adjacent segments by 2-segmentation in every iteration. Let $\{\theta^{(r)}\} = \{\theta_0^{(r)}, \theta_1^{(r)}, \dots, \theta_K^{(r)}\}$ be the final boundaries in the r-th iteration.
(r0+h−1) k (r0+h) and results in θ k must not fall below θ (cid:9) (r) k , r ≥ r0 120 BIBLIOGRAPHY 121 BIBLIOGRAPHY Allison, P. D. (1999). Missing data. Sage Thousand Oaks, CA. Alonso, A. M., Casado, D., and Romo, J. (2012). Supervised classification for functional data: A weighted distance approach. Computational Statistics & Data Analysis, 56(7):2334– 2346. Andrews, D. W. K. (1991). Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica, 59(3):817–858. Araki, Y., Konishi, S., Kawano, S., and Matsui, H. (2009). Functional logistic discrimination via regularized basis expansions. Communications in Statistics - Theory and Methods, 38(16-17):2944–2957. Aston, J. A. D. and Kirch, C. (2012). Detecting and estimating changes in dependent functional data. Journal of Multivariate Analysis, 109:204–220. Aue, A., Gabrys, R., Horv´ath, L., and Kokoszka, P. (2009). Estimation of a change-point in the mean function of functional data. Journal of Multivariate Analysis, 100(10):2254–2269. Auger, I. E. and Lawrence, C. E. (1989). Algorithms for the optimal identification of segment neighborhoods. Bulletin of Mathematical Biology, 51(1):39–54. Baesens, B., Gestel, T. V., Viaene, S., Stepanova, M., Suykens, J., and Vanthienen, J. (2003). Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society, 54(6):627–635. Beale, E. M. L. and Little, R. J. A. (1975). Missing values in multivariate analysis. Journal of the Royal Statistical Society. Series B (Methodological), 37(1):129–145. Bellman, R. (1961). On the approximation of curves by line segments using dynamic pro- gramming. Commun. ACM, 4(6):284. Berkes, I., Gabrys, R., Horv´ath, L., and Kokoszka, P. (2009). Detecting changes in the mean of functional observations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):927–946. Besse, P. (1992). PCA stability and choice of dimensionality. Statistics & Probability Letters, 13(5):405–410. Besse, P. and Ramsay, J. O. (1986). Principal components analysis of sampled functions. Psychometrika, 51(2):285–311. Biau, G., Bunea, F., and Wegkamp, M. H. (2005). Functional classification in Hilbert spaces. IEEE Transactions on Information Theory, 51(6):2163–2172. 122 Biau, G., Cerou, F., and Guyader, A. (2010). Rates of convergence of the functional k-nearest neighbor estimate. IEEE Transactions on Information Theory, 56(4):2034–2040. Bishop, C. (1999). Bayesian PCA. Microsoft Research, 11. Boente, G., Rodriguez, D., and Sued, M. (2010). Inference under functional proportional and common principal component models. Journal of Multivariate Analysis, 101(2):464–475. Bongiorno, E. G. and Goia, A. (2016). Classification methods for Hilbert data based on surrogate density. Computational Statistics & Data Analysis, 99:204–222. Box, G. E. P. (1949). A general distribution theory for a class of likelihood criteria. Biometrika, 36(3/4):317–346. Cardot, H. (2000). Nonparametric estimation of smoothed principal components analysis of sampled noisy functions. Journal of Nonparametric Statistics, 12(4):503–538. Castro, P. E., Lawton, W. H., and Sylvestre, E. A. (1986). Principal modes of variation for processes with continuous sample curves. Technometrics, 28(4):329–337. C´erou, F. and Guyader, A. (2006). Nearest neighbor classification in infinite dimension. ESAIM: Probability and Statistics, 10:340–355. Chen, C., Kwon, J., Rice, J., Skabardonis, A., and Varaiya, P. (2003). 