MULTI-OBJECTIVE REGRESSION WITH APPLICATION TO THE CLIMATE DOMAIN

By

Zubin Abraham

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Doctor of Philosophy

2013

ABSTRACT

MULTI-OBJECTIVE REGRESSION WITH APPLICATION TO THE CLIMATE DOMAIN

By Zubin Abraham

Regression-based approaches are widely used in climate research to derive the statistical, spatial, and temporal relationships among climate variables. Despite the extensive literature, existing approaches are insufficient to address the unique challenges arising from the data characteristics and requirements of this domain. For example, climate variables such as precipitation have zero-inflated distributions, which render ineffective any linear regression models constructed from the data. In addition, whereas traditional regression-based approaches emphasize minimizing the discrepancy between observed and predicted values, there is a growing demand for regression outputs that satisfy other domain-specific criteria. To address these challenges, this thesis presents multi-objective regression frameworks designed to extend current regression-based approaches to meet the needs of climate researchers. First, a framework called Integrated Classification and Regression (ICR) is developed to accurately capture the timing of rain events and the magnitude of rain amounts in zero-inflated precipitation data. The second multi-objective regression framework focuses on modeling the extreme values of a distribution without degrading its overall accuracy in predicting non-extreme values. The third framework emphasizes both minimizing the divergence between the regression output and observed data and maximizing the fit of their cumulative distribution functions. The fourth contribution extends this framework to a multi-output setting, to ensure that the joint distribution of the multiple regression outputs is realistic and consistent with true observations.

ACKNOWLEDGMENTS

This work is partially supported by NSF grant III-0712987 and a subcontract for NASA award NNX09AL60G. This work is also partly supported by the National Science Foundation Dynamics of Coupled Natural and Human Systems (CNH) Program (Award No. 0909378). The author would like to thank Dr. Julie Winkler and Dr. Sharon Zhong for valuable discussions on statistical downscaling for climate change projection.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1 Introduction
   1.1 Regression in Climate Research
   1.2 Challenges
   1.3 Thesis Contributions
   1.4 Summary

Chapter 2 Related Work
   2.1 Forecasting in the Climate Science Domain
      2.1.1 Downscaling in the Climate Science Domain
      2.1.2 Bias Correction in the Climate Science Domain
      2.1.3 Regression-Based Approaches for Bias Correction and Downscaling
   2.2 Multiple-Objective Prediction
   2.3 Modeling Extremes
   2.4 Distribution Preserving Modeling

Chapter 3 Modeling Zero-Inflated Data
   3.1 Introduction
   3.2 Preliminaries
   3.3 Framework for Simultaneous Classification and Regression
      3.3.1 Objective Function
      3.3.2 Parameter Estimation
      3.3.3 Proof of Convergence
      3.3.4 Classification of Test Data
   3.4 Integrated Classification and Regression Algorithm
   3.5 Experimental Evaluation
      3.5.1 Experimental Setup
      3.5.2 Experimental Results
         3.5.2.1 Effect of Zero-Inflated Time Series Data
         3.5.2.2 Impact of Coupling the Classifier and Regression Model Creation
         3.5.2.3 Performance Comparison
   3.6 Conclusions

Chapter 4 Semi-Supervised Modeling of Zero-Inflated Data
   4.1 Preliminaries
   4.2 Semi-Supervised Framework for Simultaneous Classification and Regression
   4.3 Experimental Evaluation
      4.3.1 Experimental Setup
      4.3.2 Experimental Results
         4.3.2.1 Rationale for Incorporating Unlabeled Data
         4.3.2.2 Performance Comparison
   4.4 Conclusions

Chapter 5 Modeling Conditional Quantiles
   5.1 Introduction
   5.2 Preliminaries
      5.2.1 Multiple Linear Regression (MLR) and Ridge Regression
      5.2.2 Quantile Linear Regression (QR)
   5.3 Framework for Smoothed Quantile Regression
      5.3.1 Smoothed Quantile Regression (LSQR)
      5.3.2 Linear Semi-Supervised Quantile Regression (LSSQR)
   5.4 Experimental Results
      5.4.1 Data
      5.4.2 Experimental Setup
      5.4.3 Evaluation Criteria
      5.4.4 Baseline
      5.4.5 Results
   5.5 Conclusions

Chapter 6 Modeling Extremes in Zero-Inflated Data
   6.1 Introduction
   6.2 Preliminaries
      6.2.1 Generalized Linear Model and 2-Step GLM (GLM-C)
      6.2.2 Zero Inflated Poisson Regression (ZIP)
      6.2.3 Quantile Linear Regression (QR) and 2-Step QR (QR-C)
   6.3 Framework for Integrated Classification and Regression
      6.3.1 Integrated Classifier and Regression for Extreme Values (ICRE)
   6.4 Experimental Evaluation
      6.4.1 Data
      6.4.2 Experimental Setup
      6.4.3 Baseline Algorithm
         6.4.3.1 General Linear Model (GLM)
         6.4.3.2 General Linear Model with Classification (GLM-C)
         6.4.3.3 Quantile Regression (QR)
         6.4.3.4 Quantile Regression with Classification (QR-C)
         6.4.3.5 Zero Inflated Poisson (ZIP)
      6.4.4 Evaluation Criteria
      6.4.5 Experimental Results
         6.4.5.1 Impact of Zero-Inflated Data on Extreme Value Prediction
         6.4.5.2 Comparison of ICRE to Baseline Methods
   6.5 Conclusions

Chapter 7 Contour Regression
   7.1 Introduction
   7.2 Preliminaries
      7.2.1 Multiple Linear Regression (MLR)
      7.2.2 Quantile Mapping (QM)
   7.3 Framework for Multivariate Contour Regression (CR)
      7.3.1 Multiple Linear Contour Regression (MLCR)
         7.3.1.1 Proof of Convergence
      7.3.2 Kernel Contour Regression (KCR)
      7.3.3 Quantile Contour Regression (QCR)
         7.3.3.1 Proof of Convergence
   7.4 Experimental Results
      7.4.1 Data
      7.4.2 Experimental Setup
      7.4.3 Results
         7.4.3.1 QCR Results
   7.5 MCR Using Heterogeneous Data
      7.5.1 Geometric Quantile Mapping
      7.5.2 MCR Using Geometric Quantile Mapped Heterogeneous Data
      7.5.3 Projections For The Years 2040-2049
   7.6 Conclusions

Chapter 8 Multivariate Contour Regression
   8.1 Introduction
   8.2 Preliminaries
      8.2.1 Properties of the Geometric Quantiles
      8.2.2 Quantile Mapping-Based Approaches
      8.2.3 Rank Correlation and Residual Errors of Quantile Mapping
   8.3 Multi-Output Contour Regression Framework (MCR)
      8.3.1 Estimating Inverse Geometric Quantile Position
      8.3.2 Alternative Approximation-Based Approach for MCR
   8.4 Variations of MCR
      8.4.1 Quantile Multi-Output Contour Regression (MCR_Q)
      8.4.2 Non-linear Multi-Output Contour Regression (MCR_NL)
   8.5 Experimental Results
      8.5.1 Experimental Setup
      8.5.2 Results
         8.5.2.1 Univariate MCR
         8.5.2.2 Bivariate MCR
         8.5.2.3 Trivariate MCR
   8.6 Conclusions

Chapter 9 Conclusions and Future Work

BIBLIOGRAPHY

LIST OF TABLES

Table 3.1 List of predictor variables for precipitation prediction.
Table 5.1 List of predictor variables for temperature prediction.
Table 5.2 The relative performance of LSSQR compared with LSQR with regard to the extreme data points.
Table 5.3 The percentage of stations LSQR outperformed the respective baselines, with regard to the extreme data points.
Table 5.4 The percentage of stations LSSQR outperformed the respective baselines, with regard to the extreme data points.
Table 6.1 Percentage of stations won.
Table 6.2 Percentage of stations ICRE outperformed the baseline.
Table 7.1 List of predictor variables from each RCM.
Table 7.2 Relative performance gain of MLCR over baseline approaches.
Table 7.3 Percentage of stations that MLCR outperformed baseline in terms of σ and ρ-CDF.
Table 7.4 Percentage of stations that QCR outperformed baseline approaches in terms of RMSE, F-measure, k statistic and correlation for data points considered extreme values.
Table 8.1 Quantile Mapping Example.
Table 8.2 Performance (RMSE) of bivariate MCR over baseline approaches.
Table 8.3 Performance (Kendall τ) of bivariate MCR over baseline approaches.

LIST OF FIGURES

Figure 1.1 Histogram comparing the distribution of predicted daily maximum temperature at a weather station in Michigan to its respective observed values, 1990-1999. [For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation]
Figure 1.2 Histogram comparing the distribution of extreme values of the predicted daily maximum temperature at a weather station in Michigan, obtained using multiple linear regression, to its respective observed values, 1990-1999.
Figure 1.3 Scatter plot comparing the joint distribution between the quantile mapping (QM) predicted output of two response variables and its true values.
Figure 1.4 Histogram of daily precipitation recorded at a weather station in Canada.
Figure 3.1 A zero-inflated frequency distribution of daily precipitation at a weather station in Canada.
Figure 3.2 Comparison between independent modeling approach and proposed framework for predicting zero-inflated data.
Figure 3.3 Effect of increasing the number of NoRain days on performance of regression model (best viewed in color).
Figure 3.4 The cumulative distribution function of multiple linear regression output for a zero-inflated precipitation response variable.
Figure 3.5 Comparison of RMSE values (tested on Rain days only) for MLR models trained on all days compared with models trained only on Rain days.
Figure 3.6 Comparison of RMSE values (tested on All days) for MLR models trained on all days compared with models trained only on Rain days.
Figure 3.7 Comparison of RMSE values (for all days) among MLR, SVM-MLR and ICR.
Figure 3.8 Comparison of RMSE values (for all days) among MLR, SVM-MLR and ICR.
Figure 3.9 Comparison of RMSE values (for Rain days) among MLR, SVM-MLR and ICR.
Figure 3.10 Comparison of RMSE values (for Rain days) among MLR, SVM-MLR and ICR.
Figure 3.11 Comparison of classification accuracy (for all days) between SVM-MLR and ICR.
Figure 3.12 Comparison of classification accuracy (for all days) between SVM-MLR and ICR.
Figure 3.13 Comparison of F-measure (for Rain days) between SVM-MLR and ICR.
Figure 3.14 Comparison of F-Measure (for Rain days) between SVM-MLR and ICR.
Figure 3.15 Comparison of F-Measure (for NoRain days) between SVM-MLR and ICR.
Figure 3.16 Comparison of F-Measure (for NoRain days) between SVM-MLR and ICR.
Figure 3.17 Comparing the cumulative distribution function of the predicted values of precipitation according to the various models.
Figure 4.1 Similarity of predictor variables for all pairs of time periods of a given width.
Figure 4.2 Similarity of response variable for all pairs of time periods of a given width.
Figure 4.3 Similarity of the relation between Predictors and Predicants over time.
Figure 4.4 Comparison of RMSE values (for all days) among MLR, ZICR-S, and ZICR-SS.
Figure 4.5 Comparison of RMSE values (for Rain days) among MLR, ZICR-S, and ZICR-SS.
Figure 4.6 Comparison of classification accuracy (for all days) among MLR, ZICR-S, and ZICR-SS.
Figure 4.7 Comparison of F-Measure (for Rain days) among MLR, ZICR-S, and ZICR-SS.
Figure 4.8 Comparison of F-Measure (for NoRain days) among MLR, ZICR-S, and ZICR-SS.
Figure 5.1 Histogram of observed temperature.
Figure 5.2 Tail of the histogram.
Figure 5.3 Influence of parameter λ on the regression coefficients β in LSQR.
Figure 5.4 Influence of λ on the probability distribution of the predicted values obtained from LSSQR.
Figure 5.5 Influence of λ on the probability distribution of the predicted extreme values obtained from LSSQR.
Figure 5.6 Ratio of stations LSSQR outperforming baseline in terms of RMSE of extreme data points.
Figure 5.7 Ratio of stations LSSQR outperforming baseline in terms of recall of extreme data points.
Figure 5.8 Ratio of stations LSSQR outperforming baseline in terms of F-measure of extreme data points.
Figure 5.9 Prediction performance of extreme data points using MLR, Ridge, QR, LSSQR.
Figure 7.1 Area between the CDF of y and y_MLR.
Figure 7.2 Histogram of predicted daily maximum temperature at a weather station in Michigan, 1990-1999.
Figure 7.3 CDF of predicted daily precipitation at a weather station in Michigan, 1990-1999.
Figure 7.4 Influence of gamma parameter on fidelity of the response variable's cumulative distribution function.
Figure 7.5 CDF of predicted daily precipitation at a weather station in Michigan over the years 1990-99.
Figure 7.6 CDF of predicted daily maximum temperature at a weather station in Michigan, 1990-99.
Figure 7.7 CDF of predicted daily precipitation at a weather station in Michigan, 1990-99.
Figure 7.8 CDF of predicted daily minimum temperature at a weather station in Michigan, 1990-99.
Figure 7.9 Cumulative distribution function of predicted daily minimum temperature at a weather station in Michigan, 1990-99, using asynchronous regional climate model data.
Figure 7.10 Cumulative distribution function of predicted daily precipitation at a weather station in Michigan, 1990-99, using asynchronous regional climate model data.
Figure 7.11 Comparing the cumulative distribution function of MCR_HET and MCR_GQ output of predicted daily minimum temperature at a weather station in Michigan, 1990-99, using asynchronous regional climate model data.
Figure 7.12 Comparing the cumulative distribution function of MCR_HET and MCR_GQ output of predicted daily precipitation at a weather station in Michigan, 1990-99, using asynchronous regional climate model data.
Figure 7.13 Comparing the cumulative distribution function of projected daily minimum temperature at a weather station in Michigan, for the period 2040-2049 to that of QM model output for the years 1990-99.
Figure 7.14 Comparing the cumulative distribution function of projected daily precipitation at a weather station in Michigan, for the period 2040-2049 to that of QM model output for the years 1990-99.
Figure 8.1 Scatter plot of observed daily maximum and minimum temperature at a climate station in Michigan, USA.
Figure 8.2 Relative computation time of the various approximation-based approaches for estimating inverse geometric quantile positions.
Figure 8.3 Box plot of the percentage of stations where MCR showed improvement over single output baselines, in terms of Kendall τ rank correlation and RMSE, across all RCMs and variables.
Figure 8.4 Box plot of MCR's improvement over baseline approaches in terms of Kendall τ rank correlation and RMSE, across all RCMs and variables.
Figure 8.5 Scatter plot portraying the fidelity of forecast values of various approaches replicating the observed associations among the bivariate temperature response variables.
Figure 8.6 Three-dimensional scatter plot of the observed associations among maximum temperature, minimum temperature and precipitation as well as the respective forecasts made by the various single output and multiple output approaches.
Figure 9.1 Relative likelihood of identifying large precipitation residuals.
Figure 9.2 Relative likelihood of identifying days with large residuals.

Chapter 1

Introduction

Regression is a statistical method for deriving the relationship between a continuous-valued response variable y ∈ ℝ and its predictor variables x ∈ ℝ^d. The relationship is typically expressed as a mathematical function f : ℝ^d → ℝ. Numerous regression-based methods have been developed in the past, including discriminative models (such as multiple linear regression [58] and support vector regression [105]) and generative models (such as hidden Markov regression [57]). These methods are primarily designed to minimize a loss function, L(f(x), y), that measures the difference between the observed and predicted values.
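For concreteness, the following minimal sketch (an illustration, not code from this thesis) fits such a linear function f by minimizing the squared loss on synthetic data:

```python
import numpy as np

# Synthetic data: d = 3 predictor variables with a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# Fit f(x) = w.x + b by minimizing the squared loss sum_i (f(x_i) - y_i)^2.
A = np.hstack([X, np.ones((len(X), 1))])   # append an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:-1], coef[-1]

print("mean squared loss:", np.mean((X @ w + b - y) ** 2))
```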
Although such a loss function is sufficient to ensure a good fit between the predicted and observed values, there are other requirements that must be met when applying such methods to real-world applications. This thesis focuses on developing new regression-based methods that can handle the unique challenges arising from the climate research domain.

1.1 Regression in Climate Research

Given the growing concerns about global warming and its potential influence on human and natural systems [107, 28, 49, 93, 75, 60], there is a pressing need to generate accurate and robust projections of future climate scenarios for researchers, policy-makers and other stakeholders. For example, to aid crop management decision making, the projections can be incorporated into crop models to assess the crop yield response to future climate change. Towards this end, recent advances in global and regional climate modeling have produced vast amounts of simulation data that can be harnessed to improve our understanding of the climate system and its evolution under different greenhouse gas emission scenarios [31, 46, 116, 85]. These general circulation models (GCMs) and regional climate models (RCMs), as they are called, are physical-based models, developed based on the fundamental laws of physics, chemistry, and fluid dynamics to simulate the response of the Earth system to various external forcings on a three-dimensional spatial grid mesh. However, the scale of these computer-simulated model outputs is often too coarse to be effectively used in climate change impacts, adaptation, and vulnerability (CCIAV) assessment studies. Furthermore, due to the complexity of the climate system and the inadequacy of the models in capturing all of its underlying processes, there are inherent biases in the model outputs that must be corrected to enable reproduction of historical climate conditions.

Regression is a popular method to empirically downscale the coarse-resolution model outputs to a finer resolution. It can also debias the model outputs to fit the distribution of historical climate data. In addition to generating climate projections, regression can also be applied for the purpose of spatial interpolation, to fill in missing values at locations where observed values are unavailable [64]. However, in spite of the extensive number of regression-based approaches that have been proposed to generate downscaled and bias-corrected climate projections, there are still a number of challenges current regression methods have not adequately addressed.

1.2 Challenges

The projections generated by climate models are not an end in themselves but are often integrated into other numerical models (such as crop and hydrological models) to enable assessments of future climate change impacts. These downstream models not only differ in terms of the climate variables needed as input, they may also have distinct expectations regarding the desired characteristics of the climate projections. As an example, the skill of the climate projections in simulating the length of wet and dry runs, i.e., the number of consecutive rain and non-rain days, is an important requirement to estimate drought duration and intensity [9].
The least-squares loss function employed by multiple linear regression (MLR) and the first-order Markov assumption employed by weather generators are inadequate to simulate the higher-order temporal autocorrelation of a precipitation time series. Thus, one of the main challenges of applying regression methods to generate climate projections is that there could be more than one distinct expectation of the projections, and these are not always easy to achieve simultaneously.

Furthermore, the projections should be unbiased across all quantiles. An unbiased projection is one whose distributional characteristics are consistent with those of the true values of the response variable, across all the quantiles of its distribution. To illustrate this, consider the histogram of observed daily maximum temperature at a weather station in Michigan, represented by the gray area in Figure 1.1. The red dotted line represents the histogram of daily maximum temperature generated by multiple linear regression. Observe that the MLR outputs have a warm bias for the cooler days and a cold bias during the warmer days. As a result, MLR is not a suitable approach if extreme values are of paramount importance to users of the climate projections.

Figure 1.1: Histogram comparing the distribution of predicted daily maximum temperature at a weather station in Michigan to its respective observed values, 1990-1999. [For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation]

To address the requirement for an unbiased distribution, there is a class of bias correction approaches that can be used. Although these approaches can generate climate projections with low bias, their residual errors can be high, which implies a lack of agreement between the observed and predicted values at each time step. Hence, it is imperative to develop regression-based approaches that provide outputs satisfying both requirements of minimal error and unbiased predictions.

In addition, when considering the distribution characteristics of the regression outputs, the end user may be interested in certain quantiles over others. For instance, farmers are often interested in the frequency and magnitude of extreme values in the climate projections, due to the larger economic implications associated with them. However, as most approaches prioritize the conditional mean of the distribution, they tend to underestimate the frequency of extreme-valued data points, as shown in Figure 1.2.

Figure 1.2: Histogram comparing the distribution of extreme values of the predicted daily maximum temperature at a weather station in Michigan, obtained using multiple linear regression, to its respective observed values, 1990-1999.

Thus, another challenge for regression-based approaches is to provide a framework that is flexible enough to prioritize the accuracy at the quantiles of the end user's choice, without significantly degrading the performance at other quantiles.

End users may also require projections of more than one climate variable. For instance, daily minimum and maximum temperature as well as total precipitation are among the common variables needed for CCIAV assessments.
Although regression-based approaches can be trained for each variable independently, the resulting projections may not be consistent with each other. Hence, there is a growing demand for multiple-output prediction methods capable of capturing the joint distribution of the response variables in a realistic and consistent fashion, while minimizing their residual errors. Unfortunately, current methods are designed to optimize one of the two criteria, but not both, as shown in Figure 1.3. Generating projections for multiple variables while preserving their joint relationships and minimizing the residual errors is another challenge that has not been sufficiently addressed.

Figure 1.3: Scatter plot comparing the joint distribution between the quantile mapping (QM) predicted output of two response variables and its true values.

Some climate variables are also harder to project due to their unique data characteristics. For instance, daily precipitation is notoriously challenging to model because of its zero-inflated distribution, a challenge that conventional regression methods are not well suited to handle. A zero-inflated distribution is a distribution with an abundance of zero values, as shown in Figure 1.4. Such distributions can also be found in many other applications that are related to long-term projections, such as ecological modeling, disease monitoring, and traffic monitoring. As conventional regression methods typically prioritize modeling the conditional mean of the distribution, they tend to underestimate the frequency of zero-valued data points as well as the magnitude of the extreme values of a zero-inflated variable. Thus, there is a need to develop models that are geared toward dealing with some of the more uncharacteristic distributions observed among the climate variables.

Figure 1.4: Histogram of daily precipitation recorded at a weather station in Canada.

The source of climate data available for building the regression models may also introduce additional complications. For example, the GCM- or RCM-simulated precipitation data may not be exactly synchronized with the observed precipitation, since each simulated output is only one possible realization of the time series. This affects both the training of regression models as well as model evaluation, which typically assumes there is a one-to-one mapping between the input and output variables for each data point used in the regression model. Fortunately, there are alternative approaches, such as quantile mapping, that cater to modeling data points with asynchronous predictor and response variables. Unfortunately, this flexibility also results in predictions with relatively higher residual errors, as the models do not utilize the existing mapping information between the predictor and response variables. The challenge here is to develop models that can handle asynchronous data without compromising model accuracy.

For RCM models, the simulations can be driven either by reanalysis data (akin to true observations) or by GCM models as their initial boundary conditions. Reanalysis-driven RCM runs are typically used to generate hindcast values of the predictor variables, whereas GCM-driven RCM runs are needed to generate forecast values of the predictor variables.
Most regression-based approaches are trained using reanalysis-driven RCM runs and tested on future data generated by GCM-driven RCM runs. Unfortunately, there are inherent biases in the GCM-driven RCM runs that are not fully accounted for by the regression model. This presents an additional challenge that must be addressed by regression-based approaches, which typically assume that the training and test data have similar distributions.

1.3 Thesis Contributions

This thesis presents multivariate regression-based frameworks that simultaneously address multiple objectives pertaining to the individual requirements of an accurate climate projection. By simultaneously optimizing multiple objectives, the frameworks generate a projection that satisfies multiple requirements with minimal degradation of any one objective, unlike existing regression-based approaches that address a single objective at the expense of compromising other requirements. The multiple objectives addressed by each of the frameworks pertain to best replicating the unique distribution characteristics of a response variable, while also ensuring an accurate projection in terms of minimal residual errors.

As mentioned earlier, pragmatic approaches to modeling predictive systems need to take into account any unique distribution characteristics of the response variable that are critical to generating an accurate projection. Chapter 3 presents an integrated multi-objective framework that simultaneously performs classification and regression to accurately predict values of a zero-inflated time series [4]. The regression and classification models are trained to optimize a joint objective function that minimizes both the classification errors (zero and non-zero values) and the regression errors for data points that have non-zero values. The framework compensates for the uncertainty in the data by using a smoothing function that, during the learning of the regression function, prioritizes non-zero valued data points whose response value is consistent with other data points having similar values for their respective predictor variables. The effectiveness of the framework is demonstrated in the context of its application to downscaling a precipitation climate variable. The semi-supervised extension of the framework in Chapter 3 is elaborated in Chapter 4 and compared with its supervised counterpart [3].

Given that studies and applications that utilize long-term projections for analysis may be interested in certain quantiles of the distribution of the projection over others, Chapter 5 presents a multivariate framework that focuses on the accurate projection of a specific quantile of the distribution (such as those pertaining to extreme values) with minimum deterioration of the accuracy of the projection at the other quantiles of the response variable [8]. Chapter 5 also elaborates on the framework in a semi-supervised setting. As an extension of the above-mentioned framework, Chapter 6 presents a multi-objective framework called ICRE that focuses on reliable prediction of extreme-value events for a zero-inflated response variable [7]. This multi-objective framework incorporates the multiple objectives of classification, regression and conditional quantiles. The frameworks in Chapters 5 and 6 were evaluated on climate data and demonstrated their ability to accurately detect the frequency, timing and magnitude of extreme temperature and precipitation events compared to several baseline methods.
Chapter 7 shows the limitations of popular regression-based approaches in terms of preserving the distribution characteristics of the true response variable across the various quantiles, in spite of their minimizing residual errors. Chapter 7 goes on to present a multi-objective regression framework that simultaneously replicates the cumulative distribution properties of the response variable while minimizing the residual errors. The framework is highly flexible and can be applied to linear, nonlinear, and conditional quantile models [5]. The effectiveness of the framework in modeling the daily minimum and maximum temperature as well as precipitation for climate stations in the Great Lakes region is demonstrated, along with marked improvement over traditional regression-based approaches, for all climate stations evaluated.

There is a growing demand for multiple-output predictions that are accurate not only in terms of minimal residual errors but also in terms of capturing the joint distributional characteristics of the multiple output variables, so that they are realistic and consistent with each other. Unfortunately, the preservation of these associations is not guaranteed by regular single or multiple output regression approaches. Chapter 8 presents a framework for multiple output regression that preserves the general association patterns among the response variables (including non-linear associations) while minimizing the overall errors of the individual predictions, by coupling regression and geometric quantile mapping. The effectiveness of the framework in modeling temperature and daily precipitation for climate stations in the Great Lakes region is demonstrated [6]. The framework showed significant improvement in reducing residual errors while preserving the joint distribution of the multi-output variables, over the baseline approaches, in all climate stations evaluated.

1.4 Summary

To summarize, the presented frameworks address the challenges pertaining to the application of regression-based approaches to the climate research domain. However, even though the multi-objective frameworks presented in this thesis are originally motivated by the need to generate accurate and unbiased projections of climate variables, they are generic enough to be used for other applications. For example, the ICR framework is applicable to any domain with a zero-inflated data distribution, whereas the MCR framework is designed for domains that require consistent predictions across multiple response variables. Additionally, the proposed frameworks leverage ideas from semi-supervised learning, statistical asynchronous regression, and geometric quantiles to address the challenges introduced by the climate research domain. All the experimental results reported in this thesis pertaining to demonstrating the effectiveness of the frameworks are conducted on climate data for a study region involving parts of Canada and the Great Lakes region around Michigan. The effectiveness of the frameworks is demonstrated in the context of their applications to bias correcting and downscaling precipitation and temperature data.

Chapter 2

Related Work

Regression is a commonly employed statistical approach for estimating the relationship between a response variable and its respective predictor variables. Popular approaches to regression, such as multiple linear regression (MLR), learn the regression coefficients that estimate the conditional expectation of the response variable, given the predictor variables. Unlike MLR, quantile regression estimates the conditional quantile of the response variable [77].
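The distinction between the two objectives can be made concrete through their point-wise losses. The sketch below (an illustration, not code from the thesis) implements the squared loss, whose constant minimizer is the mean, and the pinball loss of quantile regression, whose constant minimizer is the τ-th quantile:

```python
import numpy as np

def squared_loss(y_true, y_pred):
    # Its minimizer over a constant prediction is the mean of y_true.
    return np.mean((y_true - y_pred) ** 2)

def pinball_loss(y_true, y_pred, tau):
    # Residuals above the prediction are weighted by tau, those below
    # by (1 - tau); the constant minimizer is the tau-th quantile.
    r = y_true - y_pred
    return np.mean(np.where(r >= 0, tau * r, (tau - 1.0) * r))

y = np.random.default_rng(1).gamma(shape=2.0, scale=3.0, size=5000)
grid = np.linspace(y.min(), y.max(), 2001)
best = grid[np.argmin([pinball_loss(y, c, tau=0.9) for c in grid])]
print(best, np.quantile(y, 0.9))   # the two values nearly coincide
```

Minimizing the pinball loss over a constant prediction recovers the empirical quantile, which is why quantile regression targets conditional quantiles rather than the conditional mean.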
Similarly, there are numerous other popular variants of regression-based approaches [58, 15, 73], such as ridge regression [62], lasso regression [110], recurrent neural networks [55], Hidden Markov Model Regression [52], and support vector regression [105]. The various regression approaches primarily differ based on the number of predictor variables used, the number of response variables, the type of response variable, the estimation method used to identify the regression coefficients, the bias in the estimated regression coefficients, whether a linear or non-linear regression function is used, etc.

Time series prediction [80] has long been an active area of research with applications in finance [30], network monitoring [81], transportation planning [63, 91], weather forecasting [46, 31], etc. Regression approaches can be applied to time series data for forecasting purposes. When it comes to using regression for time series analysis, the two most common approaches employed are autoregressive and multivariate regression approaches. While multivariate regression-based approaches are constrained by the requirement of the availability of future values of the predictor variables to make a forecast, autoregressive approaches, such as autoregressive moving average (ARMA) and autoregressive integrated moving average (ARIMA) [23], do not have such constraints. Autoregressive approaches make forecasts by repeatedly invoking a model that makes its prediction one unit at a time, using the predicted values from the previous iteration to infer future values. However, autoregressive approaches are plagued by the error accumulation problem, on account of the propagation of errors from one prediction step to the next.

Given the influence of climate on agriculture [28, 107], natural ecosystems [49, 93], human health and natural calamities [60, 75], economic impact [104], etc., considerable effort has been dedicated to generating projections of climate variables to aid strategic decision making. This effort has also been precipitated by the growth in the number of climate models in the climate science domain [88].

2.1 Forecasting in the Climate Science Domain

In the climate science domain, there has been extensive research on applying time series regression models to climate data obtained from global climate models (GCMs) [31, 46, 116]. GCMs are extensively used for understanding how the global climate may change in the future. They are computer-generated models for simulating future climate conditions under different greenhouse gas emission scenarios. However, the spatial resolution of GCM outputs is often too coarse to reliably project the future climate scenarios of a local region, and they do not provide reliable information on scales below about 200 km [Meehl et al., 2007].

Two of the more widely employed approaches to improving the projection of climate variables obtained from climate models such as GCMs and regional climate models (RCMs) are downscaling and bias correction. While bias correction is employed to correct the biases in the distributional characteristics of the climate projection, downscaling is employed to refine the granularity of the projection to better represent the climate variability associated with the location of interest of the impact assessment study.
In spite of the two distinct objectives, regression is a popular approach employed to generate bias-corrected projections as well as downscaled climate projections.

2.1.1 Downscaling in the Climate Science Domain

Downscaling techniques are used to relate the coarse-scale GCM outputs to local climate variables such as daily precipitation and temperature [116]. There are two common approaches to downscaling climate variables. The first approach is dynamical downscaling, which nests a regional climate model (RCM) into the GCM to represent the atmospheric physics with a higher grid-box resolution within a limited area of interest. The second approach is statistical downscaling, which statistically links coarse-scale weather with relatively finer-resolution observed local-scale weather. Wilby and Wigley [117] classified statistical downscaling into regression methods, weather type approaches, and stochastic weather generators.

Multiple linear regression (MLR) is probably the most common regression-based statistical downscaling approach, whose objective is to minimize the sum of squared errors. Multiple linear regression with randomization (MLRR) is another regression-based approach that adds a randomization term to compensate for the reduced variability in the prediction of the regression function, which is fitted to pass through the centroid of the data. Themeßl et al. [108] apply MLRR to bias correct precipitation data. The analog method (AM) and its variants, such as nearest neighbor analog methods (NNAM), are other common downscaling techniques that are based on the intuition of prototyping. The advantage of approaches such as AM is the minimal training time required. The disadvantages are the need for a sufficiently large amount of historical data for accurate prediction, as well as the limitation of being unable to predict extreme events beyond the magnitude of the extreme events available among the historical data sets. Rummukainen [100] differentiated various statistical downscaling approaches based on the nature of the chosen predictors as either being perfect prog(nosis) (PP) or model output statistics (MOS) (Glahn and Lowry [56]).

2.1.2 Bias Correction in the Climate Science Domain

Even the results of downscaling, such as RCM simulation data, often need to be bias corrected to accurately reflect the observed distribution of the respective climate variables before being fed to climate change impact models and/or crop growth and yield models, so that biases in the simulated data are not propagated. RCM variables such as temperature may require bias correction of the mean and standard deviation of the distribution, while variables such as precipitation may additionally require the frequency and intensity of the distribution to be calibrated.

Quantile-based bias correction approaches focus on matching the distribution of the downscaled output as closely as possible to that of the observations' distribution. Quantile mapping, modified quantile mapping (EDCDFm) and the transfer functions defined by Piani et al. [96, 97] are a few examples. Quantile mapping has been extensively used to downscale climate variables across regions, ranging from smaller regions such as Japan (Iizumi et al. [67]) to larger areas such as the European continent, as seen in Piani et al. [96]. Samuels et al. [102] and Ines et al. [68] use quantile mapping for downscaling and bias correcting climate variables like precipitation. Ceglar et al. [29] have used it to downscale even variables like solar radiation.
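As a rough illustration of the idea (not a specific implementation from the literature cited above), empirical quantile mapping passes each model value through the model CDF and then through the inverse of the observed CDF, so that the corrected series inherits the observed distribution:

```python
import numpy as np

def quantile_map(model_train, obs_train, model_new):
    """Bias-correct model_new via x -> F_obs^{-1}(F_model(x)), with both
    CDFs estimated empirically over a common training period."""
    model_sorted = np.sort(model_train)
    # Empirical CDF position of each new model value among the training values.
    p = np.searchsorted(model_sorted, model_new, side="right") / model_sorted.size
    # Invert the observed empirical CDF at those positions.
    return np.quantile(obs_train, np.clip(p, 0.0, 1.0))

rng = np.random.default_rng(2)
obs = rng.gamma(2.0, 3.0, size=3000)                  # "observed" series
model = 0.5 * rng.gamma(2.0, 3.0, size=3000) + 1.0    # biased model output
corrected = quantile_map(model, obs, model)
print(np.quantile(obs, [0.1, 0.5, 0.9]))
print(np.quantile(corrected, [0.1, 0.5, 0.9]))        # close to observed quantiles
```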
Quantile mapping can use both as a means to downscale the climate variables as well as a bias correction technique. When used as separate bias correction technique, QM corrects the errors in the shapes of the two distributions either prior or after downscaling the data, using any of the downscaling approaches. Approaches such as linear regression used by Rivington et al. [98] and copula-based approaches used by Favre et al. [47] are also occasionally used for bias correction. 2.1.3 Regression-Based Approaches for Bias Correction and Downscaling Regression-based approaches are among the most commonly employed approaches for bias correction and downscaling climate projections. In spite of the nuance between the objectives of bias correction and downscaling climate variables for generating projection, there is considerable overlap in terms of the approaches employed by both applications. The various regression-based approaches used for downscaling and bias correcting climate projections can be broadly categorized as either being distribution-driven or accuracy-driven, based on whether their primary objective is minimizing residual errors or ensuring that the cumulative distribution of the projection of the response variable matches, as closely as possible, the distribution of the corresponding observations data at each quantile. Multiple linear regression and QM are respectively, the most popular accuracy-driven and distribution-driven approaches. Among the various accuracy-driven regression-based approaches, change factor (delta method), linear regression (LR), its multiple variable counterpart and multiple linear regression (MLR) are the most popular regression-based approaches used for downscaling and bias 16 correction [108][98]. Multiple linear regression with randomization used by Themel et al. and copula based approaches used by Favre et al. [47] have also occasionally been used for bias correction and downscaling climate variables. Unlike accuracy-driven approaches, distribution-based approaches primarily focuses on ensuring that the cumulative distribution of the projection of the response variable matches as closely as possible, the distribution of the corresponding observations data at each quantile. Quantile mapping (QM), modified quantile mapping (EDCDFm), transfer functions defined by Piani et al. and local intensity scaling [108][61][96][97] are few examples of distributionbased regression approaches commonly used for bias correction and downscaling climate projections. 2.2 Multiple-Objective Prediction Based on the distribution characteristics of the response variable, modeling certain response variables, is more challenging that others. This is often due to the distinct or uncommon distribution characteristics of the response variable. For instance, conventional single-objective regression models that prioritize the conditional mean of the distribution tend to underestimate the number of zero-valued data points while also under-estimating the values of the extreme values of a zero-inflated response variable. Similarly, the single-objective regression approaches such as MLR, that are commonly used when the emphasis is on minimizing SSR, fare poorly in terms of capturing the shape of the distribution and hence is not well suited in preserving the distribution characteristics of the projection. Thus there is the need to have a framework that caters to simultaneously addressing multiple objectives. 
As mentioned earlier, precipitation, which is an important driver in many models such as fresh water modeling [78], is considerably more difficult to model than temperature, mostly due to its high spatial and temporal variability and its nonlinear nature. The motivation behind using a multi-objective approach that combines the use of classification and regression models is to address the challenges of the zero-inflated distribution observed in precipitation. Previous studies have shown that additional precautions must be taken to ensure that the excess zeros do not lead to poor fits of the regression models [11, 13, 42, 20, 114]. A typical approach to modeling a zero-inflated data set is to use a mixture distribution of the form

P(y|x) = α π0(x) + (1 − α) π(x)

where π0 and π are functions of the predictor variables x and α is a mixing coefficient that governs the probability that an observation is a zero or non-zero value. This approach assumes that the underlying data are generated from known parametric distributions. For example, π may be a Poisson or negative binomial distribution (for discrete data) or a lognormal or Gamma distribution (for continuous data). Piani et al. [97] proposed multi-objective transfer functions specific to response variables having zero-inflated distributions. Similarly, local intensity scaling (LOCI) has been specialized for bias correcting response variables that have a zero-inflated distribution, by accounting for the zero-inflated characteristics of precipitation data [108].
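As a concrete instance of the mixture above, the following sketch is illustrative only: it fixes α and a Poisson rate λ rather than modeling them as functions of x, as a full zero-inflated regression would, and takes the zero component to be a point mass at zero:

```python
import numpy as np
from scipy.stats import poisson

def zip_pmf(y, alpha, lam):
    # P(y) = alpha * [y == 0] + (1 - alpha) * Poisson(y; lam): the zero
    # component is a point mass at zero, pi is a Poisson distribution.
    return alpha * (y == 0) + (1.0 - alpha) * poisson.pmf(y, lam)

y = np.arange(6)
print(zip_pmf(y, alpha=0.5, lam=2.0))
# P(0) combines structural zeros (alpha) with Poisson zeros (1 - alpha) * e^{-lam}.
```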
Hence, unlike the common regression techniques mentioned earlier that focus on predicting the conditional mean, the motivation behind the presented model focuses on the conditional quantile, using an approach similar to quantile regression [77]. 19 Variations of quantile regression, such as non-parametric quantile regression and quantile regression forests, have been used to infer the conditional distribution of the response variable, which may be used to build prediction intervals [106, 87]. Also, variants of quantile regression that estimate the median are used due to their robustness to outliers when compared to traditional mean estimate [82]. Friederichs and Hense [51] presented a statistical downscaling approach to estimate censored conditional quantiles of precipitation that uses QR. The conditional probability of the censored variable is estimated using a generalized linear model (GLM) with a logit function to model the nature of the distribution of precipitation and hence cannot be directly applied to model temperature. Mannshardt-Shamseldin et al. [83] demonstrate another approach to downscaling extremes through the development of a family of regression relationships between the 100 year return value (extremes) of climate modeled precipitation (NCEP and CCSM) and station-observed precipitation values. Generalized extreme value theory based approaches have also be applied to model extreme events like hydrologic and water quality extremes, precipitation, etc [111, 16]. The Pareto distribution [43, 66], Gumbel [22, 14] and Weibull [35] are the more common variants of general extreme value distribution used. But these techniques are probability based that emphasize trends pertaining to the distribution of future extreme events and not the deterministic timing of the occurrence of the extreme event. The drawback of building a model that primarily focuses on only a particular section of the conditional distribution of the response variable is the limited amount of available data. Hence, the motivation for incorporating unlabeled data during model building. When it comes to accurately predicting extreme values in the presence of zero-inflated data, studies have shown that additional precautions must be taken to ensure that the excess zeros do not lead to poor fits [11, 13, 42, 20, 114] of the regression models. Generally, simple 20 modeling of zero values may not be sufficient, especially in the case of zero-inflated climate data such as precipitation, where extreme value observations need to be accurately modeled. Due to the significance of extreme values in climatology and the increasing trend in extreme precipitation events over past few decades, a lot of work has be done in analyzing the trends in precipitation, temperature, etc., for regions in the United States and Canada among others [79, 99, 44, 37, 101]. Katz [72] introduces the common approaches used in climate change research, especially with regard to extreme values. The common approaches to modeling extreme events are based on general extreme value theory [53, 89, 50], Pareto distribution [43, 66, 70, 90], generalized linear modeling [35, 34], hierarchical Bayesian approaches [54, 65, 103], etc. Gumbel [22, 14] and Weibull [35] are the more common variants of General extreme value distribution used. There are also Bayesian models such as the model of Cooley et al. [38] that augment the model with spatial information. Watterson et. al. 
Watterson et al. [113] proposed a model that also deals with the skewness and intermittency of non-zero precipitation, using a Gamma distribution to interpret changes in precipitation extremes. In contrast, the framework presented in this chapter handles the intermittency of the data by coupling a logistic regression classifier to the quantile regression part of the model.

2.4 Distribution Preserving Modeling

Courtesy of projects such as NARCCAP (North American Regional Climate Change Assessment Program), extensive studies have been done to utilize the long-term future climate projections made available [86, 88]. Many of these studies focus on the impact assessment of climate change on domains ranging from natural ecosystems [93, 49] to human systems [94]. Since many climate change impact assessment studies are interested in long-term climate projections, the accuracy of the distribution of the projection is often critical. As mentioned earlier, efficient utilization of these projections requires them to be unbiased across all the quantiles of the distribution [31, 4, 116].

Bias correction approaches such as quantile mapping (QM) [108], equidistant CDF matching (EDCDFm), statistical asynchronous regression (SAR) [92], and the transfer functions proposed by Piani et al. (2010b) have been applied to address these biases in the projection of climate data. However, these approaches are best suited to settings where no day-to-day mapping is available between the predictor and the response variable, as is the case when downscaling from GCMs or from RCM outputs driven by GCMs. QM is very well equipped to generate a projection of the response variable with an unbiased distribution. However, such bias correction approaches under-perform in terms of the accuracy of prediction of individual data points, because they do not leverage the original mapping between the response and predictor variables during training. This drawback is all the more limiting because data obtained from RCMs driven by reanalysis data do have a day-to-day mapping and may be used for building a regression model for downscaling and bias correction. Thus, common distribution-driven single output regression approaches are best suited when the predictor and output variables are asynchronous and there is less emphasis on a low sum of squared residuals (SSR). Accuracy-driven regression approaches, such as MLR, ridge regression, lasso, and analog methods [108], are commonly used when the emphasis is on minimizing SSR, but they fare poorly in terms of capturing the shape of the distribution (Figure 1.1). Thus, commonly used regression approaches are not well suited to preserving the distribution characteristics of the projection. Given the drawbacks of regression-based and quantile-based approaches, methods such as Contour Regression (CR) [5] have been proposed that try to simultaneously minimize error and preserve the shape of the forecast distribution. CR uses a regression function that regularizes the area between the CDF of the target response variable and the regression result.
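As a point of reference for the quantile mapping (QM) baseline discussed above, the following is a minimal empirical-quantile sketch in Python; the interpolation-free rank-based mapping shown here is one common way QM is realized, not necessarily the exact variant used in [108].

    import numpy as np

    def quantile_map(model_values, obs_train, model_train):
        """Map model outputs onto the observed distribution by matching
        empirical quantiles estimated over a common training period."""
        # Rank each model value within the model training distribution...
        q = np.searchsorted(np.sort(model_train), model_values) / len(model_train)
        q = np.clip(q, 0.0, 1.0)
        # ...and read off the observed value at the same quantile.
        return np.quantile(obs_train, q)

    # Illustrative use with synthetic data: a model that is biased high.
    rng = np.random.default_rng(0)
    obs = rng.gamma(0.8, 5.0, size=1000)
    model = obs * 1.3 + 1.0
    corrected = quantile_map(model, obs, model)

Note that the mapping uses only the two marginal distributions, which is why QM remains applicable when there is no day-to-day pairing between predictor and response, and also why it cannot exploit such a pairing when one exists.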
To address the limitations of single output regression (SOR), numerous multiple output regression (MOR) models have been proposed, including the commonly used multi-output regression [59] and structured output regression [17]. A number of regression-based multiple output models focus on penalizing the regression matrix using low-rank penalization methods such as reduced rank regression [69]. However, these approaches do not model the correlation among output dimensions. Another common approach to multiple output prediction is to penalize the shared input space for collinearity, as in partial least squares regression discriminant analysis (PLSDA) [95]; these models, too, do not capture the association among the response variables. "Curds and whey" is an example of a regression-based approach that models output correlation [25]; however, it assumes that the relation among the response variables is linear. Multiple output SVR is another approach that takes advantage of the correlation among response variables and extends SVR to multi-output systems by considering cokriging (a multi-variable version of kriging) [112]. Cokriging models multiple output variables by computing cross covariances between the different outputs. Group lasso [74], LL-MIMO [21], and Gaussian process MOR [10] are other examples of MOR. However, none of the above-mentioned approaches preserve the full range of variability of the joint distribution of the response variables. He et al. [61] proposed bivariate quantile mapping to address the limitations of QM in bivariate space, building on the intuition of Buja et al. regarding geometric quantiles [26]. However, although the approach fares very well in terms of capturing the requisite marginal distribution characteristics of the two response variables, it suffers from poor SSR due to its asynchronous nature.

In the following chapters, approaches are proposed that address the challenges of modeling zero-inflated response variables, prioritizing extremes in the distribution of the response variable, and preserving the overall distribution characteristics of the response variable, in both single output and multi-output settings.

Chapter 3
Modeling Zero-Inflated Data

This chapter presents a multi-objective approach for predicting future values of time series data that are inherently zero-inflated. The proposed framework decouples the prediction task into two objectives: a classification step to predict whether the value of the time series is zero and a regression step to estimate the magnitude of the non-zero time series value.

3.1 Introduction

Predictive models for time series data are commonly employed in the fields of economics, finance, epidemiology, ecology, and meteorology, among others. The prediction accuracy is subject to the choice of model used, which in turn may be limited by characteristics of the time series observations. For example, studies have shown that the performance of classical regression models is degraded when applied to data sets with excess zero values [11, 13, 42, 20, 114]. Such data are typically encountered in applications such as climate and ecological modeling, disease monitoring, manufacturing defect detection, and traffic monitoring. Figure 3.1 shows the histogram of daily precipitation (in log scale) at a weather station in Canada for the period between January 1, 1961 and December 31, 2000. Nearly half of the observations have precipitation values equal to zero.

[Figure 3.1: A zero-inflated frequency distribution of daily precipitation at a weather station in Canada.]

Such zero-inflated data, as they are commonly known, often lead to poor model fitting with standard regression methods, as these tend to underestimate the frequency of zeros and the magnitude of non-zero values of the data.
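This tendency of a least-squares fit to split the difference between the zero and non-zero regimes is easy to reproduce on synthetic data; the following small Python illustration is an assumption-laden sketch, not an experiment from this thesis.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 2000
    x = rng.normal(size=n)
    wet = rng.random(n) < 0.5           # roughly half the days are "dry" (zero)
    amount = np.exp(0.8 * x + rng.normal(scale=0.3, size=n))
    y = np.where(wet, amount, 0.0)      # zero-inflated response

    # Ordinary least squares on all days versus on non-zero days only.
    X = np.column_stack([np.ones(n), x])
    beta_all = np.linalg.lstsq(X, y, rcond=None)[0]
    beta_wet = np.linalg.lstsq(X[wet], y[wet], rcond=None)[0]

    # The all-days fit is dragged toward zero, underestimating wet-day magnitudes.
    print(beta_all, beta_wet)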
A typical strategy for handling such data is to first invoke a classification model to predict whether the output value is zero. A regression model, which has been trained on the non-zero data points, is then applied to estimate the magnitude of the output only if the classifier predicts a non-zero value. Such an approach is commonly used for statistical downscaling of precipitation [115], in which the occurrence of rain or wet days is first predicted before a regression model is applied to estimate the amount of rainfall for the predicted wet days. The limitation of this approach is that the classification and regression models are often built independently of each other. As a result, neither model can glean information from the other to potentially improve its prediction accuracy.

[Figure 3.2: Comparison between (a) the two-step independent prediction approach and (b) the proposed joint regression and classification framework for predicting zero-inflated data.]

The objective of this chapter is to develop an integrated framework that accurately estimates the future values of a zero-inflated time series by simultaneously training the classification and regression models. Specifically, the models are trained to optimize a joint objective function that penalizes errors in classifying a data point and errors in predicting the magnitude of non-zero data points. Given a test point, the regression model is applied to estimate the magnitude of the predicted value. The output from the regression model, along with the values of the other predictor variables of the test point, is then fed into a classification model to determine whether the predicted value should be adjusted to zero. The distinction between the traditional two-step independent modeling approach and the proposed framework is illustrated in Figure 3.2.

The effectiveness of the learning framework is demonstrated in the context of precipitation prediction, using climate data from the Canadian Climate Change Scenarios Network Web site [1]. Specifically, the performance of the integrated framework was compared against two baseline methods. The first baseline corresponds to applying the standard multiple linear regression (MLR) method to the entire training data, which includes both dry and rain days. The second baseline method (SVM-MLR) uses a combination of a support vector machine classifier to predict dry/wet days and multiple linear regression to predict the rainfall amount on wet days; the two models are trained independently (a minimal sketch of this baseline is given after the list below). Empirical results showed that the proposed framework outperforms both MLR and SVM-MLR on the majority of the weather stations investigated in this study. In summary, the main contributions of this chapter are as follows:

• An integrated framework for simultaneously learning classification and regression models is presented.
• The proposed framework was found to be more effective at predicting zero-inflated time series than building a single regression model or building independent classification and regression models to fit the time series data.
• The framework was successfully applied to the real-world problem of downscaling precipitation time series for climate impact assessment studies.
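For reference, a minimal sketch of the independently trained SVM-MLR baseline described above, using scikit-learn; the feature and target arrays are placeholders and the hyperparameters are illustrative assumptions.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.linear_model import LinearRegression

    def fit_two_step(X, y):
        """Two-step baseline: an SVM wet/dry classifier and an MLR model
        for wet-day amounts, trained independently of each other."""
        wet = y > 0
        clf = SVC(kernel="linear").fit(X, wet)
        reg = LinearRegression().fit(X[wet], y[wet])
        return clf, reg

    def predict_two_step(clf, reg, X):
        # Regression output is used only on days classified as wet; the
        # clipping at zero is an assumption of this sketch, for plausibility.
        return np.where(clf.predict(X), np.maximum(reg.predict(X), 0.0), 0.0)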
3.2 Preliminaries

Consider a multivariate time series L = (x_t, c_t), where t ∈ {1, 2, ..., n} denotes the elapsed time, x_t is a d-dimensional vector of predictor variables at time t, and c_t is the corresponding value of the response (target) variable. Given an unlabeled sequence of multivariate observations x_τ, where τ ∈ {n + 1, ..., n + m}, the goal is to learn a target function f(x_τ, w) that best estimates the future values of the response variable at each time τ. The set of weights w = [w_1, w_2, ..., w_d]^T are the regression coefficients to be estimated from the training data L. For applications such as statistical downscaling, the predictor variables x_τ correspond to climate variables at large spatial scales generated from computer-driven general circulation models (GCMs). For zero-inflated data, the frequency of zero values in the time series is considerably larger than the frequency of each non-zero value, as shown in Figure 3.1. The response variable c_t can be mapped into a binary class $\bar{c}_t$, where

$$\bar{c}_t = \begin{cases} 1, & \text{if } c_t > 0; \\ 0, & \text{otherwise.} \end{cases} \qquad (3.1)$$

For brevity, the notation y ≡ f(x, w) is used for the predicted value of the response variable and $\bar{y}$ for its corresponding predicted class.

3.3 Framework for Simultaneous Classification and Regression

In this chapter, a framework is presented for predicting future values of a time series with the following unique characteristics:

1. The framework simultaneously performs classification and regression to improve the accuracy of predicting the magnitude of non-zero values in a zero-inflated time series.
2. The framework can be easily extended to a semi-supervised learning setting via graph regularization.

This chapter considers a framework for modeling zero-inflated variables using a combination of classification and regression models. The models are trained to optimize a joint objective function that considers both the classification errors on the time series and the regression errors for the non-zero values. The framework presented in this chapter trains an SVM classifier only once, after the parameters of the regression model have been determined. Proofs of convergence of the algorithm are also presented in this section. Multiple linear regression (MLR) was considered as the underlying regression model in this study, in which f(x, w) = w^T x. Extending the approach to nonlinear models will be a subject for future research.

3.3.1 Objective Function

The classification and regression models developed in this study are designed to minimize the following objective function:

$$\arg\min_{\mathbf{w}, \bar{\mathbf{y}}} L(\mathbf{w}, \bar{\mathbf{y}}) = \sum_{i=1}^{n} \bar{c}_i (c_i - y_i \bar{y}_i)^2 + T_1 \sum_{i=1}^{n} (\bar{y}_i - \bar{c}_i)^2 + T_2 \sum_{i,j=1}^{n} s_{i,j} \left[ \bar{c}_i y_i - \bar{c}_j y_j \right]^2 + T_3 \|\mathbf{w}\|^2$$

where $y_i = \sum_d w_d x_{i,d}$, $\bar{y}_i \in \{0, 1\}$, and $s_{i,j}$ is the similarity between the values of the predictor variables at $t_i$ and $t_j$.
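A direct transcription of this objective into Python may help fix the notation; it is a sketch only, with the similarity matrix S and the penalty weights T1, T2, T3 assumed given.

    import numpy as np

    def icr_objective(w, ybar, X, c, S, T1, T2, T3):
        """Evaluate L(w, ybar); cbar is the binary rain/no-rain label from c."""
        cbar = (c > 0).astype(float)
        y = X @ w                                   # regression outputs y_i
        term1 = np.sum(cbar * (c - y * ybar) ** 2)  # regression error on non-zero points
        term2 = T1 * np.sum((ybar - cbar) ** 2)     # misclassification error
        g = cbar * y                                # graph-smoothness argument
        term3 = T2 * np.sum(S * (g[:, None] - g[None, :]) ** 2)
        term4 = T3 * np.dot(w, w)                   # L2 shrinkage on w
        return term1 + term2 + term3 + term4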
The rationale for the design of the objective function is as follows. The first term is similar to the standard least-squares formulation of multiple linear regression, except that the estimation of w is based on the non-zero values in the time series. The regression model is therefore biased towards estimating the non-zero values more accurately, instead of being influenced by the over-abundance of zeros in the time series. The product $y_i \bar{y}_i$ in the first term corresponds to the predicted output of the joint classification and regression models. The second term in the objective function is equivalent to the misclassification error on the training data. The third term corresponds to a graph regularization constraint to ensure smoothness and consistency in the model predictions. Specifically, for two highly similar data points $x_p$ and $x_q$, i.e., when $s_{pq}$ is large, the model is penalized if the predicted values of the response variables are inconsistent. Finally, the last term in the objective function is equivalent to the L2 norm used in ridge regression models to shrink the coefficients in w.

Each data point corresponds to a given elapsed time t ∈ {1, 2, ..., n} in the time series. An n × n similarity matrix S = [s_{ij}] is computed between every pair of data points based on the similarities of their predictor variables. Prior to computing the similarity matrix, each variable is standardized by subtracting its mean value and then dividing by its corresponding standard deviation; this standardization is needed to account for the varying scales of the variables. The Pearson correlation coefficient was used to compute the similarity between each pair of data points, and the value was then transformed to a range between 0 and 1 to ensure that all the terms in the objective function are non-negative. The choice of Pearson correlation as the similarity measure is due to the popularity of the measure in the Earth science domain.

3.3.2 Parameter Estimation

The objective function can be further expanded as follows:

$$L(\mathbf{w}, \bar{\mathbf{y}}) = \sum_{i=1}^{n} \bar{c}_i \Big(c_i - \bar{y}_i \sum_d w_d x_{i,d}\Big)^2 + T_1 \sum_{i=1}^{n} (\bar{y}_i - \bar{c}_i)^2 + T_2 \sum_{i,j=1}^{n} s_{i,j} \Big(\bar{c}_i \sum_d w_d x_{i,d} - \bar{c}_j \sum_d w_d x_{j,d}\Big)^2 + T_3 \|\mathbf{w}\|^2$$

or equivalently,

$$L(\mathbf{w}, \bar{\mathbf{y}}) = \sum_{i=1}^{n} \bar{c}_i \Big(c_i - \bar{y}_i \sum_d w_d x_{i,d}\Big)^2 + T_1 \sum_{i=1}^{n} (\bar{y}_i - \bar{c}_i)^2 + T_3 \|\mathbf{w}\|^2 + T_2 \sum_{i,j=1}^{n} s_{i,j} \Big[ \Big(\bar{c}_i \sum_d w_d x_{i,d}\Big)^2 + \Big(\bar{c}_j \sum_d w_d x_{j,d}\Big)^2 - 2\, \bar{c}_i \bar{c}_j \sum_{d,d'} w_d w_{d'}\, x_{i,d}\, x_{j,d'} \Big]$$

To estimate the regression parameters w and class labels $\bar{\mathbf{y}}$, the following iterative procedure was employed. First, the partial derivative of $L(\mathbf{w}, \bar{\mathbf{y}})$ with respect to each $w_k$ is computed and set to zero (assuming $\bar{\mathbf{y}}$ is fixed):

$$\frac{\partial L}{\partial w_k} = -2 \sum_{i=1}^{n} \bar{c}_i \Big(c_i - \bar{y}_i \sum_d w_d x_{i,d}\Big)\, \bar{y}_i\, x_{i,k} + 2T_2 \sum_{i,j=1}^{n} s_{i,j}\, \bar{c}_i \Big(\sum_d w_d x_{i,d}\Big) x_{i,k} + 2T_2 \sum_{i,j=1}^{n} s_{i,j}\, \bar{c}_j \Big(\sum_d w_d x_{j,d}\Big) x_{j,k} - 2T_2 \sum_{i,j=1}^{n} s_{i,j}\, \bar{c}_i \bar{c}_j \sum_d w_d (x_{i,d}\, x_{j,k} + x_{i,k}\, x_{j,d}) + 2T_3\, w_k = 0$$

This reduces to a system of linear equations of the form A w = b, where

$$b_k = \sum_{i=1}^{n} \bar{c}_i\, \bar{y}_i\, c_i\, x_{i,k}$$

and A is a square matrix of dimension d × d whose off-diagonal elements are given by

$$A_{k,l} = 2T_2 \sum_{i,j=1}^{n} s_{i,j}\, \bar{c}_i\, x_{i,l}\, x_{i,k} - 2T_2 \sum_{i,j=1}^{n} s_{i,j}\, \bar{c}_i \bar{c}_j\, x_{i,l}\, x_{j,k} + \sum_{i=1}^{n} \bar{c}_i\, \bar{y}_i\, x_{i,l}\, x_{i,k}$$

and whose diagonal elements are

$$A_{k,k} = 2T_2 \sum_{i,j=1}^{n} s_{i,j}\, \bar{c}_i\, x_{i,k}^2 - 2T_2 \sum_{i,j=1}^{n} s_{i,j}\, \bar{c}_i \bar{c}_j\, x_{i,k}\, x_{j,k} + \sum_{i=1}^{n} \bar{c}_i\, \bar{y}_i\, x_{i,k}^2 + T_3$$

To estimate $\bar{\mathbf{y}}$, the part of the objective function that depends on $\bar{\mathbf{y}}$ is minimized:

$$L_c(\bar{\mathbf{y}}) = \sum_{i=1}^{n} \bar{c}_i (c_i - y_i \bar{y}_i)^2 + T_1 \sum_{i=1}^{n} (\bar{y}_i - \bar{c}_i)^2$$

subject to the constraint $\bar{y}_i \in \{0, 1\}$. It is straightforward to show that $L_c$ is minimized by the following rule:

$$\bar{y}_i = \begin{cases} 1, & \text{if } \bar{c}_i = 1 \text{ and } (c_i - y_i)^2 < c_i^2 + T_1; \\ 0, & \text{otherwise} \end{cases} \qquad (3.2)$$

since, when $\bar{c}_i = 1$, setting $\bar{y}_i = 1$ incurs a cost of $(c_i - y_i)^2$ while setting $\bar{y}_i = 0$ incurs a cost of $c_i^2 + T_1$. The predicted class labels $\bar{\mathbf{y}}$ are then used to re-estimate the regression coefficients w. This procedure is repeated until the regression coefficients and class labels converge.
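The alternating update just described can be sketched compactly in Python; the vectorized assembly of A and b below is intended to be algebraically equivalent to the elementwise formulas above, with S the n × n similarity matrix from Section 3.3.1 and convergence checking elided for brevity.

    import numpy as np

    def icr_fit(X, c, S, T1, T2, T3, max_iter=10):
        """Alternating updates for ICR: solve A w = b with ybar fixed,
        then update ybar with the closed-form rule (3.2)."""
        n, d = X.shape
        cbar = (c > 0).astype(float)
        ybar = cbar.copy()                  # initialize class labels to cbar
        for _ in range(max_iter):
            r = S.sum(axis=1)               # r_i = sum_j s_ij (graph row sums)
            G = 2 * T2 * (X.T * (cbar * r)) @ X \
                - 2 * T2 * (X.T * cbar) @ S @ (X * cbar[:, None])
            A = G + (X.T * (cbar * ybar)) @ X + T3 * np.eye(d)
            b = X.T @ (cbar * ybar * c)
            w = np.linalg.solve(A, b)
            y = X @ w
            # Rule (3.2): keep a predicted rain day only if it lowers the cost.
            ybar = ((cbar == 1) & ((c - y) ** 2 < c ** 2 + T1)).astype(float)
        return w, ybar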
3.3.3 Proof of Convergence

This section presents the proof of convergence of the iterative update algorithm. Let $(\mathbf{w}^t, \bar{\mathbf{y}}^t)$ be the regression coefficients and class labels estimated after the t-th iteration and $(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^{t+1})$ those estimated after the (t+1)-th iteration.

Proposition 3.3.1. Assuming that the class labels $\bar{\mathbf{y}}^t$ are fixed, $L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^t) \le L(\mathbf{w}^t, \bar{\mathbf{y}}^t)$.

Proof. For a fixed $\bar{\mathbf{y}}^t$, let $L_r(\mathbf{w})$ be the part of the objective function that depends on the regression coefficients w:

$$L_r(\mathbf{w}) = \sum_{i=1}^{n} \bar{c}_i \Big(c_i - \bar{y}_i \sum_d w_d x_{i,d}\Big)^2 + T_3 \|\mathbf{w}\|^2 + T_2 \sum_{i,j=1}^{n} s_{i,j} \Big(\bar{c}_i \sum_d w_d x_{i,d} - \bar{c}_j \sum_d w_d x_{j,d}\Big)^2$$

The Hessian matrix H of $L_r(\mathbf{w})$ is given by

$$\frac{\partial^2 L_r}{\partial w_k\, \partial w_l} = 2 \sum_{i=1}^{n} \bar{c}_i\, \bar{y}_i^2\, x_{i,k}\, x_{i,l} + 2T_3\, \delta_{kl} + 2T_2 \sum_{i,j=1}^{n} s_{i,j}\, (\bar{c}_i x_{i,k} - \bar{c}_j x_{j,k})(\bar{c}_i x_{i,l} - \bar{c}_j x_{j,l})$$

where $\delta_{kl} = 1$ if k = l and zero otherwise. Since the parameters $T_2$ and $T_3$ are non-negative, it can be shown that $\mathbf{z}^T H \mathbf{z} \ge 0$ for any non-zero real-valued vector z, i.e., the Hessian matrix is positive semi-definite. Thus, the stationary point $\mathbf{w}^{t+1}$ minimizes $L_r$, and $L_r(\mathbf{w}^{t+1}) \le L_r(\mathbf{w}^t)$.

Proposition 3.3.2. Assuming that the regression coefficients are fixed, $L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^{t+1}) \le L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^t)$.

Proof. For a fixed $\mathbf{w}^{t+1}$, $L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^t) = L_c(\bar{\mathbf{y}}^t) + T_2 \sum_{i,j=1}^{n} s_{i,j} [\bar{c}_i y_i - \bar{c}_j y_j]^2 + T_3 \|\mathbf{w}\|^2$. Note that the last two terms are independent of $\bar{\mathbf{y}}^t$. Since the update formula for $\bar{\mathbf{y}}$ minimizes $L_c(\bar{\mathbf{y}})$, it follows that $L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^{t+1}) \le L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^t)$.

Theorem 3.3.1. The objective function L is monotonically non-increasing under the update formulas for w and $\bar{\mathbf{y}}$.

Proof. The update formulas iteratively modify the objective function as follows: $L(\mathbf{w}^t, \bar{\mathbf{y}}^t) \Rightarrow L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^t) \Rightarrow L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^{t+1})$. Using the above propositions, $L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^t) \le L(\mathbf{w}^t, \bar{\mathbf{y}}^t)$ and $L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^{t+1}) \le L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^t)$. Therefore, $L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^{t+1}) \le L(\mathbf{w}^t, \bar{\mathbf{y}}^t)$.

Lemma 3.3.1. The objective function eventually converges, since the value of the loss function is always non-negative and L is monotonically non-increasing.

3.3.4 Classification of Test Data

The update formulas presented in the previous subsections compute the regression coefficients w and class labels $\bar{\mathbf{y}}$ of the training examples in such a way as to minimize the objective function. For a given test example $x_\tau$, where $\tau \in \{n+1, ..., n+m\}$, the predicted value of the regression model can be computed as $y_\tau = \mathbf{w}^T x_\tau$. However, the classification output cannot be determined directly, since the update formula for $\bar{y}$ depends on the true class label $\bar{c}$, as shown in Equation (3.2). Therefore, to predict the class label $\bar{y}$, an SVM classifier is trained with $(x_t, y_t)$ as the (d+1)-dimensional feature vector and the estimated $\bar{y}_t$ as the class label, using only examples from the training data. Once the classifier has been constructed, it can be applied to predict the class label of a test example. The final output of the joint classification and regression model is the product $y_\tau \bar{y}_\tau$ (see Figure 3.2).

Empirically, it was found that an SVM may be used as an alternative classifier to predict $\bar{y}$ at each iteration, instead of the update formula described above. However, since the objective function of a generic classifier does not necessarily minimize the first and second terms of $L_c(\bar{\mathbf{y}})$ simultaneously, convergence cannot be guaranteed in that case.
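A sketch of this test-time classification step, continuing the Python fragments above; the augmented feature [x, y] mirrors the (d+1)-dimensional vector described, and the linear kernel matches the choice reported later in the experimental setup.

    import numpy as np
    from sklearn.svm import SVC

    def icr_predict(w, ybar_train, X_train, X_test):
        """Train the SVM on (x_t, y_t) -> ybar_t, then form z = y * ybar."""
        y_train = X_train @ w
        feats_train = np.column_stack([X_train, y_train])  # (d+1)-dim features
        clf = SVC(kernel="linear").fit(feats_train, ybar_train)
        y_test = X_test @ w
        ybar_test = clf.predict(np.column_stack([X_test, y_test]))
        return y_test * ybar_test                          # final ICR output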
3.4 Integrated Classification and Regression Algorithm

The Integrated Classification and Regression (ICR) framework takes as input $(x_t, c_t)$, a multivariate time series with d-dimensional predictor variables $x_t$ and response variable $c_t$, together with a sequence of unlabeled observations $x_\tau$. The output returned by the framework consists of the regression coefficients w and the predicted values $z_\tau$ of the unlabeled sequence. For the training phase, set $\bar{c} = (c > 0)$ and initialize $\bar{y} = \bar{c}$. Then, until convergence, update w by solving A w = b, followed by updating $\bar{y}$ using Equation (3.2). Finally, train an SVM classifier $g : (x_t, y_t) \rightarrow \bar{y}_t$. During the testing phase, set $\forall \tau: y_\tau = \mathbf{w}^T x_\tau$, $\forall \tau: \bar{y}_\tau = g(x_\tau, y_\tau)$, and $\forall \tau: z_\tau = y_\tau \bar{y}_\tau$.

It is assumed that the time series data has been partitioned into a training set, a validation set (for model selection), and a test set. Model selection is needed to estimate the parameters $T_1$, $T_2$, and $T_3$ of the objective function $L(\mathbf{w}, \bar{\mathbf{y}})$. The class labels $\bar{c}$ of the training examples are obtained from the response variable c. The training phase of the algorithm starts by setting $\bar{y} = \bar{c}$ for all n training examples. It then iteratively updates the regression coefficients w and class labels $\bar{y}$ according to the methodology presented in the previous section. At each stage, the value of the objective function is computed and saved for testing the convergence of the objective function. Upon convergence, an SVM classifier g is constructed to learn the mapping between the input features $(x, y)$ and the output class $\bar{y}$. Once the training phase is completed, the testing phase begins. Testing is performed by first applying the multiple linear regression model to the predictor variables $x_\tau$. This is followed by invoking the SVM classifier to predict the class labels $\bar{y}_\tau$ for the m test examples; the classifier takes $x_\tau$ and $y_\tau$ as input and returns the class label $\bar{y}_\tau$. Finally, the prediction output is obtained by setting $z_\tau = y_\tau \bar{y}_\tau$.

The time complexity of the training phase of the algorithm is $O(k(n^2 d + d^3))$, where n is the number of training examples, d the number of predictor variables, and k the maximum number of iterations required for convergence. The computational cost of the training phase has two major parts: computing the similarity matrix and iteratively solving for w and $\bar{y}$. The time needed to compute the similarity matrix is $O(n^2 d)$. The time complexity of each iteration comprises the time needed to compute w, $O(n^2 d^2 + d^3)$, plus the time needed to compute $\bar{y}$, $O(n)$. Hence, with the maximum number of iterations set to k, the time complexity of the training phase is $O(k(n^2 d + d^3))$, where $d \ll n$.

3.5 Experimental Evaluation

This section presents experimental results that demonstrate the effectiveness of the proposed framework.

3.5.1 Experimental Setup

The ICR algorithm was run on climate data obtained for 37 weather stations in Canada from the Canadian Climate Change Scenarios Network Web site [1]. The response variable to be regressed corresponds to the daily precipitation values measured at each weather station. The predictor variables correspond to 26 coarse-scale climate variables derived from the NCEP Reanalysis data set, which include measurements of airflow strength, sea-level pressure, wind direction, vorticity, and humidity, as shown in Table 3.1. The data span a 40-year period, 1961 to 2000. The time series was truncated for each weather station to exclude days for which the precipitation values are missing.
Table 3.1: List of predictor variables for precipitation prediction.

    Predictor Variables
    Mean sea level pressure          Surface zonal velocity
    Surface airflow strength         Surface meridional velocity
    Surface vorticity                Surface wind direction
    Surface divergence               Mean temp at 2m
    500 hPa airflow strength         850 hPa airflow strength
    500 hPa zonal velocity           850 hPa zonal velocity
    500 hPa meridional velocity      850 hPa meridional velocity
    500 hPa vorticity                850 hPa vorticity
    500 hPa geopotential height      850 hPa geopotential height
    500 hPa wind direction           850 hPa wind direction
    500 hPa divergence               850 hPa divergence
    Relative humidity at 500 hPa     Relative humidity at 850 hPa
    Near surface relative humidity   Surface specific humidity

The performance of the ICR algorithm was compared against the multiple linear regression (MLR) model and an approach that combines SVM and MLR (SVM-MLR). MLR uses the least-squares criterion to estimate the weight vector w of the model. In SVM-MLR, an SVM is used to learn a classifier that differentiates between Rain and NoRain days, and an MLR model is learned on rain days only; for a given test set, the MLR model is applied only to those days classified as Rain days. As far as the choice of SVM is concerned, the kernel (linear or RBF) and its respective parameter are chosen during the evaluation phase. The choice of the SVM kernel for ICR was limited to a linear kernel; future experiments will include a wider selection during the evaluation phase.

The following criteria were used to evaluate the performance of the models (a small sketch of their computation is given after the list):

• Root mean squared error (RMSE), which measures the difference between the actual and predicted values of the response variable, i.e.,
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (c_i - y_i)^2}{n}}$$
• Accuracy, which measures the number of Rain and NoRain days predicted correctly by the model.
• F-measure, which is the harmonic mean between the recall and precision values for Rain days.
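A minimal Python sketch of these three criteria, as referenced above; the rain/no-rain threshold of exactly zero follows the definition of $\bar{c}$ in Section 3.2.

    import numpy as np

    def evaluate(c_true, z_pred):
        """RMSE on the response values, plus accuracy and F-measure on
        the derived Rain/NoRain labels."""
        rmse = np.sqrt(np.mean((c_true - z_pred) ** 2))
        rain_true, rain_pred = c_true > 0, z_pred > 0
        accuracy = np.mean(rain_true == rain_pred)
        tp = np.sum(rain_true & rain_pred)
        precision = tp / max(np.sum(rain_pred), 1)
        recall = tp / max(np.sum(rain_true), 1)
        f_measure = 2 * precision * recall / max(precision + recall, 1e-12)
        return rmse, accuracy, f_measure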
3.5.2 Experimental Results

The purpose of the experiments is to demonstrate the following:

1. The limitations of classical regression models in terms of handling zero-inflated time series data.
2. A performance comparison between classical regression models and the proposed framework.

3.5.2.1 Effect of Zero-Inflated Time Series Data

The objective of this experiment is to demonstrate the effect of an increasing number of zeros in a time series on the performance of a regression model. Specifically, given the precipitation time series of a randomly selected weather station, each day was classified as Rain or NoRain, depending on whether the amount of precipitation received is greater than zero or equal to zero. Several training sets of different sizes and with varying percentages of NoRain and Rain days were created by randomly sampling the original time series. A disjoint test set of size ten years is used for all the experiments in this subsection. The performance of two multiple linear regression (MLR) models was evaluated: (1) MLR1, which is trained on both Rain and NoRain days, and (2) MLR2, which is trained on Rain days only. Figure 3.3 compares the RMSE values of both models for Rain days in the test set. The horizontal axis corresponds to the ratio of NoRain to Rain days in the training set; the larger the ratio, the more inflated the number of zeros in the training data. The vertical axis corresponds to the training set size, where each unit on the scale represents a period of three months. The value of each cell indicates the performance improvement when using MLR2 to predict the Rain days:

$$\%\text{Improvement} = \frac{\mathrm{RMSE}(\mathrm{MLR}_1) - \mathrm{RMSE}(\mathrm{MLR}_2)}{\mathrm{RMSE}(\mathrm{MLR}_1)} \qquad (3.3)$$

[Figure 3.3: Effect of increasing the number of NoRain days on the performance of the regression model (best viewed in color).]

[Figure 3.4: The cumulative distribution function of multiple linear regression output for a zero-inflated precipitation response variable.]

Since the % improvement is greater than or equal to zero, MLR2 consistently outperforms MLR1 in terms of predicting future Rain days, irrespective of the training set size. The amount of improvement becomes even more pronounced as the percentage of NoRain days in the training data increases. A similar improvement pattern is observed for all the weather stations investigated in this study, as shown in Figure 3.5. In contrast, MLR1, which is trained on both Rain and NoRain days, has a lower RMSE compared to MLR2 when applied to all the days in the test set, as shown in Figure 3.6. This is because MLR2 tends to overestimate the amount of precipitation for the NoRain days. In summary, the experiment in this section clearly justifies the rationale for applying a combination of classification and regression models to better estimate the precipitation amount of Rain days.

[Figure 3.5: Comparison of RMSE values (tested on Rain days only) for MLR models trained on all days compared with models trained only on Rain days.]

[Figure 3.6: Comparison of RMSE values (tested on all days) for MLR models trained on all days compared with models trained only on Rain days.]

3.5.2.2 Impact of Coupling the Classifier and Regression Model Creation

The objective of this experiment is to demonstrate the advantage of building the classifier and regression model in conjunction with each other, as opposed to building them independently, for zero-inflated time series data. Specifically, empirical results are provided demonstrating improvements in classification accuracy, the F-measure of classification, and the RMSE of the predictions. The performance of two multiple linear regression models was evaluated and compared. In the first model, MLR is trained on all days and a quadratic discriminant analysis (QDA) classifier is trained on the ground-truth response variable. In the second model, MLR is again trained on all days, but the QDA classifier is trained on the predicted response values y = w^T x. The results of the experiment show that the model trained on the predicted response values outperformed the model trained on the ground-truth response variable for all 37 stations, in terms of RMSE, classification accuracy, and F-measure. In particular, the average improvements in RMSE and classification accuracy were 13.4% and 19.3%, respectively. In summary, these empirical results provide motivation to integrate the classifier and regression models so that each takes into consideration the accuracy of the other's prediction for each individual data point.
3.5.2.3 Performance Comparison

This section compares the RMSE, accuracy, and F-measure values of the predicted response variable (precipitation) for the proposed supervised framework (ICR) against multiple linear regression (MLR), SVM-MLR (a model that combines MLR and SVM), and classification and regression trees (CART). All the experiments were performed using a training size (n) of 3 years, starting from the first observation in the time series. The test set size (m) was fixed at 1 year. After calculating the RMSE on the test set, the training set was shifted by 3 years, such that it now occupied the data used for testing in the previous iteration. The experiment was repeated 5 times for each station, and the RMSE values reported in this section are the mean values over all 5 iterations. The same approach was used to compute the RMSE values for Rain days, accuracy (for all days), F-measure for Rain days only, and F-measure for NoRain days only. Results for the 37 weather stations comparing ICR with both MLR and SVM-MLR are presented. The classification accuracy of MLR, and the F-measures related to it, are not plotted, on account of MLR not having an explicit classifier. CART fared comparatively poorly in terms of residual errors, classification accuracy, and F-measure; however, CART fared well in replicating the cumulative distribution function of the response variable, as shown later in this section.

As shown in Figures 3.7 and 3.8, the supervised model ICR significantly outperformed the MLR model (trained on all days) and the SVM-MLR model in terms of RMSE values for predicting both Rain and NoRain days. ICR outperformed MLR in 36 out of 37 stations and outperformed SVM-MLR in 30 out of 37 stations. In terms of percentage improvement in RMSE for all days, ICR showed an average 8% improvement over MLR and a 5.8% improvement over SVM-MLR. In terms of RMSE values for Rain days only, as shown in Figures 3.9 and 3.10, ICR consistently outperformed both the MLR and SVM-MLR models, outperforming MLR in 35 stations and SVM-MLR in 33 stations. In terms of the average RMSE value for Rain days only, ICR showed an improvement of 5.3% over MLR and 8.6% over SVM-MLR.

[Figure 3.7: Comparison of RMSE values (for all days) among MLR, SVM-MLR, and ICR (stations 1-19).]

[Figure 3.8: Comparison of RMSE values (for all days) among MLR, SVM-MLR, and ICR (stations 20-37).]

[Figure 3.9: Comparison of RMSE values (for Rain days) among MLR, SVM-MLR, and ICR (stations 1-19).]

[Figure 3.10: Comparison of RMSE values (for Rain days) among MLR, SVM-MLR, and ICR (stations 20-37).]

[Figure 3.11: Comparison of classification accuracy (for all days) between SVM-MLR and ICR (stations 1-19).]

[Figure 3.12: Comparison of classification accuracy (for all days) between SVM-MLR and ICR (stations 20-37).]
MLR does not inherently classify any day as Rain or NoRain; hence, a comparison between ICR and MLR with regard to classification accuracy and F-measure is not plotted. As shown in Figures 3.11 and 3.12, ICR outperformed SVM-MLR in 36 of the 37 stations and showed a 9.1% improvement in classification accuracy. At the same time, in terms of F-measure for Rain days, ICR outperformed SVM-MLR in 35 out of the 37 stations, as shown in Figures 3.13 and 3.14. Although MLR does not inherently classify any day as Rain or NoRain, the quadratic discriminant analysis (QDA) classifier mentioned earlier was trained on the MLR output; relative to this combination, ICR showed a 21.2% improvement in overall classification accuracy.

[Figure 3.13: Comparison of F-measure (for Rain days) between SVM-MLR and ICR (stations 1-19).]

[Figure 3.14: Comparison of F-measure (for Rain days) between SVM-MLR and ICR (stations 20-37).]

With regard to F-measure for NoRain days, ICR outperformed SVM-MLR in 36 stations. As shown in Figures 3.15 and 3.16, which compare the F-measure for NoRain days between SVM-MLR and ICR, ICR outperformed SVM-MLR in all but one station and showed an 8.1% improvement in F-measure results.

[Figure 3.15: Comparison of F-measure (for NoRain days) between SVM-MLR and ICR (stations 1-19).]

[Figure 3.16: Comparison of F-measure (for NoRain days) between SVM-MLR and ICR (stations 20-37).]

As shown in Figure 3.17, both ICR and SVM-MLR were able to capture the frequency of zero-valued data points in the distribution, as well as improve the shape of the cumulative distribution function compared to that of multiple linear regression. CART fared the best in terms of replicating the distribution of the zero-inflated precipitation observations, especially for the higher quantiles; however, it fared less favorably in terms of capturing the frequency of zero-valued data points, as shown in Figure 3.17. Moreover, the ICR predictions showed 28.9% and 24.2% improvements over the CART output in terms of RMSE for all days and Rain days, respectively. Similarly, the classification accuracy of ICR was 12.9% better than that of CART, and the F-measure of ICR for Rain days and NoRain days was 5.3% and 25.2% better, respectively.

[Figure 3.17: Comparing the cumulative distribution functions of the predicted values of precipitation according to the various models.]

3.6 Conclusions

This chapter presented a novel approach for predicting future values of time series data that are inherently zero-inflated. The proposed framework decouples the prediction task into two steps: a classification step to predict whether the value of the time series is zero and a regression step to estimate the magnitude of the non-zero time series value. The effectiveness of the model was demonstrated on climate data by predicting the amount of precipitation at a given station. The framework presented in this chapter assumes a linear relationship between the predictor and response variables.
The framework can also be extended to a semi-supervised learning setting, as shown in Chapter 4.

Chapter 4
Semi-Supervised Modeling of Zero-Inflated Data

This chapter presents a semi-supervised extension of the ICR framework introduced in Chapter 3. The purpose of the framework is to utilize both labeled and unlabeled data to accurately estimate the future values of a zero-inflated variable by simultaneously performing classification and regression. The regression and classification models are learned simultaneously by optimizing a unified objective function that includes a graph regularization term to ensure smoothness of the target functions and consistency between the labeled and unlabeled examples. The effectiveness of the semi-supervised learning framework is demonstrated in the context of precipitation prediction using climate data obtained from the Canadian Climate Change Scenarios Network website [1]. The proposed framework significantly outperforms regression models trained on both the zero and non-zero parts of the time series for the majority of the weather stations investigated in this study.

4.1 Preliminaries

Let $L = (X_l, \mathbf{c}_l)$ be a multivariate time series of length n, where $X_l = [x_{l_1}, x_{l_2}, ..., x_{l_n}]^T$ is a sequence of d-dimensional predictor vectors and $\mathbf{c}_l = [c_{l_1}, c_{l_2}, ..., c_{l_n}]^T$ contains the corresponding ground-truth values of the response variable. The objective of time series prediction is to learn a target function f(x, w) that best estimates the future values of the response variable, $\mathbf{c}_u = [c_{u_1}, c_{u_2}, ..., c_{u_m}]^T$, given the historical data L and the unlabeled data $X_u = [x_{u_1}, x_{u_2}, ..., x_{u_m}]^T$, where $\mathbf{w} = [w_1, w_2, ..., w_d]^T$ is the set of weights associated with the target function. $X_u$ may be obtained, for example, from computer-driven simulation models. In the semi-supervised framework proposed in this study, n represents the number of labeled training points and m the number of unlabeled training points; in the supervised framework, m represents the number of unlabeled testing points. In this study, the relative frequency of zero values in $\mathbf{c}_l$ and $\mathbf{c}_u$ is assumed to be larger than the frequency of non-zero values. Furthermore, the response variable c can be mapped into a binary class $\bar{c}$, where $\bar{c} = 1$ if c > 0 and $\bar{c} = 0$ otherwise. For brevity, the notation y ≡ f(x, w) is used for the predicted value of the response variable and $\bar{y}$ for the predicted class. Let $\mathbf{y}_u = [y_{u_1}, y_{u_2}, ..., y_{u_m}]^T$ and $\bar{\mathbf{y}}_u = [\bar{y}_{u_1}, \bar{y}_{u_2}, ..., \bar{y}_{u_m}]^T$. In the semi-supervised framework, let $\tilde{\mathbf{y}}$ be a vector of length n + m whose first n elements are initialized with the vector $\bar{\mathbf{c}}_l$ and whose remaining m elements are initialized with the vector $\bar{\mathbf{y}}_u$. In the supervised framework, as there are no unlabeled training points, $\tilde{\mathbf{y}}$ is a vector of length n and is initialized with $\bar{\mathbf{c}}_l$ only.

4.2 Semi-Supervised Framework for Simultaneous Classification and Regression

Unlike ICR, the semi-supervised extension modifies the graph regularization term so that it is summed over all n + m data points, where the m additional points are the unlabeled data. In the remainder of this chapter, the supervised and semi-supervised versions of the algorithm are denoted ZICR-S and ZICR-SS, respectively (where ZICR stands for Zero-Inflated Classification-Regression method). For brevity, only linear regression models were considered, where f(x, w) = w^T x; extending the approach to nonlinear models will be a subject for future research.
The goal is to simultaneously estimate the values of the weight parameters w and the class labels $\bar{\mathbf{y}}$ that minimize the following objective function:

$$\arg\min_{\mathbf{w}, \bar{\mathbf{y}}} f(\mathbf{w}) = \sum_{i=1}^{n} \bar{c}_i (c_i - y_i \bar{y}_i)^2 + T_1 \sum_{i=1}^{n} (\bar{y}_i - \bar{c}_i)^2 + T_2 \sum_{i=1}^{n+m} \sum_{j=1}^{n+m} s_{i,j} \left[ \tilde{y}_i\, y_i - \tilde{y}_j\, y_j \right]^2 + T_3 \|\mathbf{w}\|^2$$

where $y_i = \sum_d x_{i,d}\, w_d$.

Intuitively, the first term of the objective function is equivalent to the least-squares formulation of multiple linear regression, except that the estimation of w is performed based on the rain days only. The second term of the objective function measures the classification accuracy on the training data. The third term computes the sum of squared differences in the predicted response values for every pair of data points, weighted by the similarity of their predictor variables; this represents a graph regularization constraint that ensures smoothness of the objective function and can be used to extend the framework to a semi-supervised learning setting. Unlike in ICR, the observed class label in the graph regularizer is replaced in ZICR by its expected value $\tilde{y}$, so that the term can be extended to include unlabeled data points. Finally, the last term of the objective function is equivalent to the L2 norm used in ridge regression models to penalize models that have many large non-zero weights.

Note that each data point corresponds to a given time period in the time series. The similarity matrix S is computed using the Pearson correlation coefficient between every pair of data points in X. Prior to computing the similarity matrix, each attribute of the data set is standardized by subtracting the mean value of the attribute and then dividing by its corresponding standard deviation; this standardization is done to account for differences in the variances of the various attributes. The Pearson correlation value is then transformed to range between 0 and 1. The choice of Pearson correlation as the similarity measure is due to the popularity of the measure in the Earth science domain. The purpose of the similarity function is to identify how closely related two data points are to one another, and to use this information when creating the regression model, which gives more credence to closeness in the predicted amount of precipitation for data points that are similar than for data points that are dissimilar. As the similarity function ranges between 0 and 1, dissimilar data points have limited impact on the error function, while similar data points that differ significantly in the amount of predicted precipitation have the largest impact. The model further emphasizes data points that are categorized as rain events by using '0' and '1' as class labels, such that '0' is assigned to days categorized as NoRain days and '1' to Rain days. The supervised version of the framework is obtained by considering only the labeled training examples in the third term of the objective function.
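A Python sketch of the modified graph term over the n + m labeled and unlabeled points, with $\tilde{y}$ as described above; it is a transcription under the same hedges as the earlier objective sketch, with the extended similarity matrix assumed given.

    import numpy as np

    def ssl_graph_term(w, ytilde, X_all, S_all, T2):
        """Graph regularizer summed over labeled and unlabeled points.
        ytilde holds cbar for the n labeled points and the current
        estimated class labels for the m unlabeled points."""
        y = X_all @ w                    # predictions for all n + m points
        g = ytilde * y                   # expected-label weighting
        return T2 * np.sum(S_all * (g[:, None] - g[None, :]) ** 2)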
An iterative procedure was employed to solve the objective function. First, the partial derivative of f(w) with respect to each $w_k$ is computed and set to zero:

$$\frac{\partial f}{\partial w_k} = 2T_2 \sum_{i,j=1}^{n+m} s_{i,j}\, \tilde{y}_i \Big(\sum_d w_d x_{i,d}\Big) \tilde{y}_i\, x_{i,k} + 2T_2 \sum_{i,j=1}^{n+m} s_{i,j}\, \tilde{y}_j \Big(\sum_d w_d x_{j,d}\Big) \tilde{y}_j\, x_{j,k} - 2T_2 \sum_{i,j=1}^{n+m} s_{i,j}\, \tilde{y}_i \tilde{y}_j \sum_d w_d (x_{i,d}\, x_{j,k} + x_{i,k}\, x_{j,d}) - 2 \sum_{i=1}^{n} \bar{c}_i \Big(c_i - \bar{y}_i \sum_d w_d x_{i,d}\Big) x_{i,k} + 2T_3\, w_k = 0$$

This reduces to a system of linear equations of the form A x = b, where $\mathbf{x} = [w_1, w_2, ..., w_d]^T$,

$$b_k = \sum_{i=1}^{n} \bar{c}_i\, c_i\, x_{i,k}$$

and A is a square matrix of dimension d × d whose off-diagonal elements are

$$A_{k,l} = 2T_2 \sum_{i,j=1}^{n+m} s_{i,j}\, \tilde{y}_i\, x_{i,l}\, x_{i,k} - 2T_2 \sum_{i,j=1}^{n+m} s_{i,j}\, \tilde{y}_i \tilde{y}_j\, x_{i,l}\, x_{j,k} + \sum_{i=1}^{n} \bar{c}_i\, \bar{y}_i\, x_{i,l}\, x_{i,k}$$

and whose diagonal elements are

$$A_{k,k} = 2T_2 \sum_{i,j=1}^{n+m} s_{i,j}\, \tilde{y}_i\, x_{i,k}^2 - 2T_2 \sum_{i,j=1}^{n+m} s_{i,j}\, \tilde{y}_i \tilde{y}_j\, x_{i,k}\, x_{j,k} + \sum_{i=1}^{n} \bar{c}_i\, \bar{y}_i\, x_{i,k}^2 + T_3$$

Next, quadratic discriminant analysis (QDA) is applied to the predicted response values y = w^T x to estimate the class labels of the unlabeled data points. The updated class labels $\bar{\mathbf{y}}$ are then used to re-estimate the regression weights w. This procedure is repeated until convergence. A summary of the framework is presented in Algorithm 1.

Algorithm 1: Concurrent semi-supervised regression and classification using simultaneous equations solved iteratively.
Input: X, an (n + m) × d matrix of NCEP weather data; $\bar{\mathbf{c}}$, an n-dimensional vector of class labels (1 = Rain, 0 = NoRain); c, an n-dimensional vector of precipitation values for each day.
Output: w, a d-dimensional vector of weights; $\bar{\mathbf{y}}$, an (n + m)-dimensional vector of class labels; y, an (n + m)-dimensional vector of regressed precipitation amounts for each day.
Method: Partition the data three ways (training, evaluation, and test).
1) Perform MLR on the training set (size n) to get w.
2) Use w on the testing set (size m) to get $y_i$.
3) Calculate the objective function error using the present w and save the value.
4) Perform quadratic discriminant analysis (QDA) on $y_i$ to get $\bar{y}_i$.
5) In the semi-supervised approach (ZICR-SS), initialize $\tilde{\mathbf{y}}$ to $\bar{\mathbf{c}}$ for the first n data points and initialize the remaining m points of $\tilde{\mathbf{y}}$ with $\bar{\mathbf{y}}$ from step 4. In the supervised approach (ZICR-S), $\tilde{\mathbf{y}}$ is initialized to $\bar{\mathbf{c}}$ only.
6) Solve for w using the d equations obtained after differentiating the objective function f(w).
7) Having solved for w, solve for y using the linear equation y = Xw.
8) Apply QDA to find the class labels $\bar{\mathbf{y}}$ for the training data points.
9) Calculate the objective function error using the present w.
10) For a fixed number of iterations (e.g., 10), or based on the convergence of the objective function, repeat steps 4 to 9.
11) Evaluate the model by computing the RMSE on the test data set.

4.3 Experimental Evaluation

This section presents experimental results that demonstrate the effectiveness of the proposed framework.

4.3.1 Experimental Setup

The setup of the experiments discussed in this chapter is similar to the experimental setup described in Chapter 3. The performance of the algorithm was compared against the multiple linear regression (MLR) model, which uses the least-squares criterion to estimate the weight vector w. The following criteria were used to evaluate the performance of the models:

• Root mean squared error (RMSE), which measures the difference between the actual and predicted values of the response variable, i.e.,
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (c_i - y_i)^2}{n}}$$
• Accuracy, which measures the number of Rain and NoRain days predicted correctly by the model.
• F-measure, which is the harmonic mean between the recall and precision values for Rain days.

4.3.2 Experimental Results

The purpose of the experiments was to demonstrate the following:

1. The limitations of classical regression models in terms of handling zero-inflated time series data.
2. The rationale for incorporating unlabeled data for precipitation prediction.
3. A performance comparison between classical regression models and the proposed framework.

4.3.2.1 Rationale for Incorporating Unlabeled Data

The objective of this section is to demonstrate the utility of incorporating unlabeled data for semi-supervised learning in precipitation prediction. Previous studies have shown that unlabeled data are helpful as long as their distribution is similar to that of the labeled training data [19, 36]. For climate data, this study showed that the natural periodic behavior of the predictor variables and the response variable (precipitation) provides an opportunity to leverage unlabeled data to improve precipitation prediction. Figure 4.1 shows the average similarity values of the predictor variables over time. The horizontal axis corresponds to the width between two time periods in the time series, while the vertical axis corresponds to the average similarity over all pairs of time periods separated by that width. For example, consider a time series of length 10,000 days: to compute the average similarity at a width of 3 months, the similarity of the predictor variables on days 1 and 91, days 2 and 92, and so on, is computed, using Pearson correlation as the similarity measure (a small sketch of this computation is given below). The plot shows clear cycles in the average similarity values demarcated by years; i.e., the predictor variables for a given day are more similar to observations that are 1, 2, or 3 years apart than to observations that are 1.5, 2.5, or 3.5 years apart. Figure 4.1 shows that this trend in the similarity of observations holds even for separations as large as 30 years. More subtle cycles of about a decade and a half were also observed. One encouraging observation is that the similarity of the predictors decays very slowly with time, which supports the notion that predictor variables separated by large time differences still contain useful information that can be exploited for predicting future precipitation events. One caveat is that, although the similarity of the predictor variables may not change much over time, the similarity of the relationship between the predictor and response variables tends to decrease at a much faster rate, as shown in Figure 4.3, in which the vertical axis represents the product of the similarity between predictor variables and the similarity between response variables for two time periods of a given width. This is because, as shown in Figure 4.2, although the similarity of precipitation is periodic in nature, it fluctuates more rapidly over time than the similarity of the predictor variables.

[Figure 4.1: Similarity of predictor variables for all pairs of time periods of a given width (quarterly intervals).]
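The averaged lag-similarity computation referenced above can be sketched as follows; the quarterly lag grid and the Pearson-based similarity come from the description in the text, while the array shapes and the 91-day quarter are assumptions of this sketch.

    import numpy as np

    def mean_similarity_at_lag(X, lag):
        """Average Pearson correlation between rows X[t] and X[t + lag],
        i.e., the mean similarity of predictor vectors separated by `lag` days."""
        a, b = X[:-lag], X[lag:]
        a = (a - a.mean(axis=1, keepdims=True)) / (a.std(axis=1, keepdims=True) + 1e-12)
        b = (b - b.mean(axis=1, keepdims=True)) / (b.std(axis=1, keepdims=True) + 1e-12)
        return np.mean(np.sum(a * b, axis=1) / X.shape[1])

    # Quarterly lags out to roughly 30 years, as in Figure 4.1, where X is an
    # (n_days, n_variables) array of standardized NCEP predictor values:
    # curve = [mean_similarity_at_lag(X, q * 91) for q in range(1, 121)]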
[Figure 4.2: Similarity of the response variable for all pairs of time periods of a given width.]

[Figure 4.3: Similarity of the relationship between predictors and predictands over time.]

4.3.2.2 Performance Comparison

This section compares the RMSE, accuracy, and F-measure values of the proposed supervised (ZICR-S) and semi-supervised (ZICR-SS) frameworks against the precipitation prediction results of multiple linear regression (MLR). All the experiments were performed using a training size (n) of 3 years, starting from the first observation in the time series. The test set size (m) was also fixed at 3 years. After calculating the RMSE on the test set, the training set was shifted by 3 years, such that it now occupied the data used for testing in the previous iteration. The experiment was repeated 7 times for each station, and the RMSE values reported in this section are the mean values over all 7 iterations. The same approach was used to compute the RMSE values for Rain days, accuracy (for all days), F-measure for Rain days only, and F-measure for NoRain days only. Due to space restrictions, results are shown for 20 weather stations.

As shown in Figure 4.4, both models, ZICR-S and ZICR-SS, significantly outperformed the MLR model (trained on all days) in terms of RMSE values for predicting both Rain and NoRain days.

[Figure 4.4: Comparison of RMSE values (for all days) among MLR, ZICR-S, and ZICR-SS.]

The supervised version of the approach outperformed MLR for all 37 stations, while the semi-supervised approach outperformed MLR in 34 out of the 37 stations. In terms of percentage improvement, the RMSE of MLR was on average 8.8% and 8.4% worse than that of ZICR-S and ZICR-SS, respectively. ZICR-S outperformed ZICR-SS in 22 out of the 37 stations. In terms of RMSE values for Rain days only, as shown in Figure 4.5, MLR had an average RMSE that was 4.9% and 5.2% higher than that of ZICR-S and ZICR-SS, respectively. Both ZICR-S and ZICR-SS consistently outperformed the MLR model, with ZICR-S outperforming it in 34 stations and ZICR-SS in 32 stations; ZICR-S outperformed ZICR-SS in 21 out of the 37 stations.

[Figure 4.5: Comparison of RMSE values (for Rain days) among MLR, ZICR-S, and ZICR-SS.]

[Figure 4.6: Comparison of classification accuracy (for all days) among MLR, ZICR-S, and ZICR-SS.]

Although MLR does not inherently classify any day as Rain or NoRain, the quadratic discriminant analysis (QDA) classifier used in the framework was trained on the MLR outputs to compare its classification accuracy and F-measure against those of ZICR-S and ZICR-SS. As shown in Figure 4.6, all three models (ZICR-S, ZICR-SS, and MLR) were comparable in terms of classification accuracy, with ZICR-SS outperforming MLR in approximately 60% of the stations.
Nevertheless, in terms of F-measure for Rain days, both models consistently outperformed MLR, as shown in Figure 4.7, with ZICR-S outperforming MLR in 32 stations and ZICR-SS in 33 stations.

[Figure 4.7: Comparison of F-measure (for Rain days) among MLR, ZICR-S, and ZICR-SS.]

Figure 4.8 shows the comparison of F-measure for NoRain days among MLR, ZICR-S, and ZICR-SS.

[Figure 4.8: Comparison of F-measure (for NoRain days) among MLR, ZICR-S, and ZICR-SS.]

4.4 Conclusions

This chapter elaborated on extending the ICR framework detailed in Chapter 3 to a semi-supervised learning setting, and compared the performance of the framework when it utilizes both unlabeled and labeled data during training against using only labeled data.

Chapter 5
Modeling Conditional Quantiles

This chapter, as well as the following chapter, elaborates on the importance of accurately predicting the frequency, timing, and magnitude of extreme values in the distribution of the response variable. Specifically, a semi-supervised framework for smoothed quantile regression (LSSQR) is presented that focuses on the accurate prediction of extreme values without significantly degrading the sum of squared residual errors.

5.1 Introduction

An integral part of climate modeling is downscaling, which seeks to project future scenarios of the local climate based on the coarse-resolution outputs produced by global climate models (GCMs). Two of the more common approaches to downscaling are dynamic downscaling and statistical downscaling. Dynamic downscaling uses a numerical meteorological model to simulate the physical dynamics of the local climate while utilizing the climate projections from GCMs as initial boundary conditions. Though it captures the geographic details of a region unresolved by GCMs, the simulation is computationally demanding, and its spatial resolution remains too coarse for many climate impact assessment studies. Statistical downscaling establishes a mathematical relationship between the coarse-scale GCM outputs and the fine-scale local climate variables based on observation data. Unlike dynamic downscaling, it is flexible enough to incorporate any predictor variable and is relatively inexpensive.

Most statistical downscaling approaches employ regression methods such as multiple linear regression, ridge regression, and neural networks to estimate the conditional mean of the future climate conditions. These methods are ill-suited for predicting extreme values of the climate variables. An alternative approach is to use techniques such as quantile regression, which aims to minimize an asymmetrically weighted sum of absolute errors, to estimate the particular quantile that corresponds to extreme values [77]. Unfortunately, quantile regression tends to overestimate the response variable, resulting in a large number of data points being falsely predicted to be extreme. Figure 5.1 shows the histogram of the distribution of observed temperature at a weather station in Canada.
The lines represent the distribution of the predicted values of temperature obtained using multiple linear regression (MLR) and quantile regression. An observation is considered an extreme data point if its response variable is in the top 5th percentile of observations. The shape of the tail of the distribution that represents extreme data points (observed and projected) is shown in Figure 5.2. It is clear from the figures that methods such as multiple linear regression that estimate the conditional mean tend to underestimate the tail of the observed probability distribution, while quantile regression overestimates the tail of the distribution. As elaborated in Section 5.4, it was found that for the 37 stations evaluated, quantile regression predicted a data point to be extreme, on average, more than twice as frequently as the actual frequency of observed extreme data points.

To address this overestimation, a method known as smoothed quantile regression (LSQR) is proposed, which reduces the absolute error of extreme data points by introducing a smoothing term that brings the predicted response value of extreme points closer to the value corresponding to the percentile of extreme data points. This smoothing term also provides a means to easily extend the objective function to a semi-supervised learning setting (LSSQR). Semi-supervised learning, in addition to using the training data, can also use the distribution characteristics of the predictor variables of the test set to glean a better estimate of the distribution of the data upon which the model will be applied.

[Figure 5.1: Histogram of observed temperature, overlaid with the distributions of predictions from MLR, quantile regression, and LSSQR.]

[Figure 5.2: Tail of the histogram of observed temperature and the corresponding predicted distributions.]

In summary, the main contributions of this chapter are as follows:

• Demonstrating the limitation of MLR, ridge regression, and quantile regression in predicting extreme values.

• Presenting a smoothed quantile regression framework for extreme value prediction.

• Extending the framework to a semi-supervised setting.

• Demonstrating the efficacy of the learning framework on climate data (temperature) obtained from the Canadian Climate Change Scenarios Network website [1]. Both the supervised and semi-supervised frameworks outperformed the baseline methods in 85% of the 37 stations evaluated, in terms of the magnitude, frequency, and timing of extreme events.

5.2 Preliminaries

Let $D_l = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ be a labeled dataset of size $n$, where each $\mathbf{x}_i \in \mathbb{R}^d$ is a vector of predictor variables and $y_i \in \mathbb{R}$ is the corresponding response variable. Similarly, $D_u = \{\mathbf{x}_i\}_{i=n+1}^{n+m}$ corresponds to the unlabeled dataset. The objective of regression is to learn a target function $f(\mathbf{x}, \beta)$ that best estimates the response variable $y$, where $\beta$ is the parameter vector of the target function. $n$ represents the number of labeled training points and $m$ the number of unlabeled test points.

5.2.1 Multiple Linear Regression (MLR) and Ridge Regression

One of the most widely used forms of regression is multiple linear regression. It solves a linear model of the form
$$y = \mathbf{x}^T\beta + \epsilon$$
where $\epsilon \sim N(0, \sigma^2)$ is an i.i.d. Gaussian error term with variance $\sigma^2$ and $\beta \in \mathbb{R}^d$ is the parameter vector.
MLR minimizes the sum of squared residuals
$$(\mathbf{y} - X\beta)^T(\mathbf{y} - X\beta)$$
which leads to the closed-form solution
$$\hat{\beta} = (X^TX)^{-1}X^T\mathbf{y}$$
A variant of MLR, called ridge regression or Tikhonov regularization, is often used to mitigate overfitting. Ridge regression also provides a formulation that overcomes the hurdle of a singular covariance matrix $X^TX$ that MLR may face during optimization. Unlike the loss function of MLR, the loss function for ridge regression is
$$(\mathbf{y} - X\beta)^T(\mathbf{y} - X\beta) + \lambda\beta^T\beta$$
and its corresponding closed-form solution is
$$\hat{\beta} = (X^TX + \lambda I)^{-1}X^T\mathbf{y}$$
where the ridge coefficient $\lambda > 0$ ensures that the matrix $X^TX + \lambda I$ is non-singular and therefore always invertible. The problem with both MLR and ridge regression is that they model the conditional mean, which is not well suited for predicting extremes.

5.2.2 Quantile Linear Regression (QR)

The $\tau$th quantile of a random variable $Y$ is given by
$$Q_Y(\tau) = F_Y^{-1}(\tau) = \inf\{y : F_Y(y) \geq \tau\}$$
where $F_Y(y) = P(Y \leq y)$ is the distribution function of the real-valued random variable $Y$ and $\tau \in [0, 1]$. Unlike MLR, which estimates the conditional mean, quantile regression estimates a conditional quantile (e.g., the median) of $Y$. To estimate the $\tau$th conditional quantile $Q_{Y|X}(\tau)$, quantile regression minimizes an asymmetrically weighted sum of absolute errors. More specifically, the loss function of linear quantile regression is
$$\sum_{i=1}^{n} \rho_\tau(y_i - \mathbf{x}_i^T\beta)$$
where
$$\rho_\tau(u) = \begin{cases} \tau u & u > 0 \\ (\tau - 1)u & u \leq 0 \end{cases}$$
Unlike MLR and ridge regression, which have closed-form solutions, quantile regression is often solved using optimization methods such as linear programming, by converting the problem to the following form:
$$\min_{\mathbf{u},\mathbf{v}} \; \tau\mathbf{1}_n^T\mathbf{u} + (1 - \tau)\mathbf{1}_n^T\mathbf{v} \quad \text{s.t.} \quad \mathbf{y} - X\beta = \mathbf{u} - \mathbf{v}$$
where $u_i \geq 0$ and $v_i \geq 0$. But as shown in Figures 5.1 and 5.2, quantile regression often overestimates data points, resulting in too many falsely predicted extreme events.
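For concreteness, the linear program above can be handed to any generic LP solver. The following minimal sketch is an illustration only (not the interior-point implementation of [76] used in the experiments later in this chapter); it stacks the decision vector as $[\beta, \mathbf{u}, \mathbf{v}]$ and solves the problem with SciPy:

    import numpy as np
    from scipy.optimize import linprog

    def quantile_regression(X, y, tau):
        """Linear quantile regression via the LP formulation above.

        Decision vector z = [beta (free), u (>= 0), v (>= 0)], with the
        equality constraint X beta + u - v = y and objective
        tau * 1'u + (1 - tau) * 1'v.
        """
        n, d = X.shape
        c = np.concatenate([np.zeros(d), tau * np.ones(n), (1 - tau) * np.ones(n)])
        A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
        bounds = [(None, None)] * d + [(0, None)] * (2 * n)
        res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
        return res.x[:d]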
5.3 Framework for Smoothed Quantile Regression

Given that the primary objective of the model is to accurately regress extreme-valued data points, and that quantile regression has been shown to perform relatively better than its least-squares counterparts, which tend to underestimate the frequency and magnitude of extreme data points, the proposed frameworks are modeled around linear quantile regression. Section 5.3.1 describes smoothed quantile regression (LSQR) and its objective function. Section 5.3.2 proposes a semi-supervised extension of LSQR, which is then followed by mathematical properties of the behavior of the objective function.

5.3.1 Smoothed Quantile Regression (LSQR)

A quantile-based linear regression model is proposed, based on the assumption of smoothness, i.e., data points whose predictor variables are similar should have similar responses. The notion of smoothness is an integral part of the framework, as the experiments in Section 5.4 demonstrate this characteristic in the dataset used. The smoothness assumption can be described as the constraint
$$\sum_{i,j}^{n} w_{ij}(f_i - f_j)^2 < c$$
where $w_{ij}$ is a measure of similarity between data points $i$ and $j$, $f$ is the predicted value of the response variable, and $c$ is a constant.

Also, since the framework does not restrict the training set to only extreme data points, the smoothing component of the objective function tends to implicitly cluster data points, resulting in a better distinction between the response variables of extreme-valued and non-extreme-valued data points. Empirical results comparing supervised quantile regression to the proposed semi-supervised model illustrate this point, as shown in Section 5.4. The term
$$w_{ij} = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{\sigma}\right), \quad i, j \in [1, 2, \ldots, n]$$
is equivalent to the radial basis function and is used to capture the similarity between the predictor variables of data points $i$ and $j$. $\sigma$ is a scale parameter used to control the distance above which two data points are no longer considered highly coupled. Assuming a linear model, $f(\mathbf{x}_i, \beta) = \mathbf{x}_i^T\beta$, the smoothing term can be reformulated as
$$\sum_{i,j}^{n} w_{ij}\big(f(\mathbf{x}_i, \beta) - f(\mathbf{x}_j, \beta)\big)^2 = \mathbf{f}^T\Delta\mathbf{f} = \beta^T\Sigma\beta$$
where $\Sigma = X^T\Delta X$, $\Delta = D - W$, $W = \{w_{ij}\}_{i,j=1}^{n}$, and $D$ is a diagonal matrix such that $D_{ii} = \sum_{j=1}^{n} w_{ij}$. Coupling the smoothing term with the objective function of linear quantile regression, we end up with the following optimization problem (a minimal solver sketch is given at the end of this subsection):
$$\min_\beta \sum_{i=1}^{n} \rho_\tau(y_i - \mathbf{x}_i^T\beta) + \lambda\beta^T\Sigma\beta$$
As can be observed from the objective function of LSQR, $\lambda \to 0$ results in an estimate similar to linear quantile regression, while $\lambda \to \infty$ results in the estimate of the response variable converging towards the target quantile of the data. This is because a large $\lambda$ penalizes any non-zero difference between $f_i$ and $f_j$ very harshly, thereby minimizing the error by setting $f_i = \alpha, \forall i \in [1, 2, \ldots, n]$, which reduces the error from the second component of the objective to 0. This reduces the loss function to
$$f(\beta) = \sum_{i=1}^{n} \rho_\tau(y_i - \alpha), \quad \beta = (\alpha, 0, 0, \ldots, 0)^T$$
A formal proof is provided in the following theorem.

Theorem 1: $f(\mathbf{x}_i, \beta) \to y_{(n\tau)}$ as $\lambda \to \infty$, $\forall i \in [1, 2, \ldots, n]$.

Proof: Let $y_{(i)}$ be the $i$th smallest element among $\{y_k\}_{k=1}^{n}$, and let $y_{(i)} < \alpha_i \leq y_{(i+1)}$. When $\lambda \to \infty$, all predictions equal $\alpha_i$, so the smoothing term vanishes and the loss function can be rewritten in terms of $\alpha_i$ as
$$\sum_{k=1}^{i} (1 - \tau)(\alpha_i - y_{(k)}) + \sum_{k=i+1}^{n} \tau(y_{(k)} - \alpha_i)$$
which is equivalent to minimizing
$$\tau\sum_{k=1}^{n} y_{(k)} - \sum_{k=1}^{i} y_{(k)} - (n\tau - i)\alpha_i$$
or maximizing
$$l_i = \sum_{k=1}^{i} y_{(k)} + (n\tau - i)\alpha_i$$
Therefore,
$$l_j - l_{j-1} = y_{(j)} - \alpha_{j-1} + (n\tau - j)(\alpha_j - \alpha_{j-1})$$
Hence, $\forall j : j \leq n\tau$, $l_j - l_{j-1} \geq 0$, since $(y_{(j)} - \alpha_{j-1})$, $(n\tau - j)$, and $(\alpha_j - \alpha_{j-1})$ are all $\geq 0$. Similarly, $\forall j : j \geq n\tau$,
$$l_j - l_{j+1} = \alpha_{j+1} - y_{(j+1)} + (n\tau - j)(\alpha_j - \alpha_{j+1}) \geq 0$$
Hence, if $\exists i : i = n\tau$, then $\alpha = y_{(n\tau)}$. But if $i < n\tau < i + 1$, then $\alpha$ lies in the interval $[y_{(i)}, y_{(i+1)}]$.

Figure 5.3 is a plot that tracks the values of $\beta$ for different $\lambda$ values. The figure shows that the regression parameter vector $\beta$ converges to $(\alpha, 0, 0, \ldots, 0)^T$ as $\lambda$ increases, where $\beta_0$ is the regression parameter that corresponds to the column of 1's in the design matrix. Figures 5.4 and 5.5 plot the influence of $\lambda$ on the predicted values returned by LSSQR; i.e., as the value of $\lambda$ increases, LSSQR shrinks the prediction range towards the quantile $\tau$. Figure 5.5 is a zoomed-in image, capturing the tail of Figure 5.4.

[Figure 5.3: Influence of parameter λ on the regression coefficients β in LSQR.]

[Figure 5.4: Influence of λ on the probability distribution of the predicted values obtained from LSSQR.]

[Figure 5.5: Influence of λ on the probability distribution of the predicted extreme values obtained from LSSQR.]
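The LSQR optimization problem above is non-smooth but low-dimensional; Section 5.4.2 notes that BFGS was used to solve it. The following is a minimal sketch under that setup. The function name, the use of SciPy's BFGS with numerical gradients, and the default parameter values are illustrative assumptions, not the dissertation's implementation:

    import numpy as np
    from scipy.optimize import minimize

    def fit_lsqr(X, y, tau=0.95, lam=1.0, sigma=1.0):
        """Sketch of LSQR: pinball loss plus the Laplacian smoothing
        penalty beta' Sigma beta derived above."""
        # RBF similarity w_ij between predictor vectors.
        sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        W = np.exp(-sq_dists / sigma)
        D = np.diag(W.sum(axis=1))
        Delta = D - W                      # graph Laplacian
        Sigma = X.T @ Delta @ X            # so that f' Delta f = beta' Sigma beta

        def loss(beta):
            r = y - X @ beta
            pinball = np.where(r > 0, tau * r, (tau - 1) * r).sum()
            return pinball + lam * beta @ Sigma @ beta

        beta0 = np.zeros(X.shape[1])
        return minimize(loss, beta0, method="BFGS").x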
5.3.2 Linear Semi-Supervised Quantile Regression (LSSQR)

The objective function of LSQR can be easily extended to a semi-supervised learning setting, since the smoothing factor (the second term in the equation) is independent of $y$. Therefore, by extending the range of the indices $i$ and $j$ of the smoothing term to span 1 to $n + m$, the predictor variables of the unlabeled data $X_u = [\mathbf{x}_{u1}, \ldots, \mathbf{x}_{um}]^T$ can be harvested. The objective function of LSSQR is
$$\arg\min_\beta \sum_{i=1}^{n} \rho_\tau(y_i - \mathbf{x}_i^T\beta) + \lambda\sum_{i,j}^{n+m} w_{ij}(\mathbf{x}_i^T\beta - \mathbf{x}_j^T\beta)^2$$
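The only change from LSQR is that the similarity matrix is built over the stacked labeled and unlabeled predictors, while the pinball loss still runs over labeled points only. A sketch of this extension, under the same assumptions as the LSQR sketch above (note the $O((n+m)^2)$ memory cost of the dense similarity matrix):

    import numpy as np
    from scipy.optimize import minimize

    def fit_lssqr(X_l, y_l, X_u, tau=0.95, lam=1.0, sigma=1.0):
        """Semi-supervised LSSQR sketch: the smoothing term spans indices
        1..n+m, the pinball loss only the n labeled points."""
        X_all = np.vstack([X_l, X_u])
        sq = ((X_all[:, None, :] - X_all[None, :, :]) ** 2).sum(axis=2)
        W = np.exp(-sq / sigma)

        def loss(beta):
            r = y_l - X_l @ beta
            pinball = np.where(r > 0, tau * r, (tau - 1) * r).sum()
            f = X_all @ beta
            smooth = (W * (f[:, None] - f[None, :]) ** 2).sum()
            return pinball + lam * smooth

        return minimize(loss, np.zeros(X_l.shape[1]), method="BFGS").x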
5.4 Experimental Results

In this section, the climate dataset used for statistical downscaling is described, followed by the experimental setup, which addresses inherent properties of the dataset such as its periodic nature. Once the dataset is introduced, we analyze the behavior of baseline models developed using MLR, ridge regression, and quantile regression, and contrast them with LSQR and LSSQR. The efficacy of the models in accurately capturing the magnitude, relative frequency, and timing of extreme events is measured.

5.4.1 Data

All the algorithms were run on climate data obtained at 37 weather stations in Canada, from the Canadian Climate Change Scenarios Network website [1]. The response variable to be regressed (downscaled) corresponds to daily temperature values measured at each weather station. The predictor variables for each of the 37 stations correspond to 26 coarse-scale climate variables derived from the NCEP re-analysis data set, which include measurements of airflow strength, sea-level pressure, wind direction, vorticity, and humidity, as shown in Table 5.1. The predictor variables used for training were obtained from the NCEP re-analysis data set, which spans the 40-year period 1961 to 2001. The time series was truncated for each weather station to exclude days for which temperature or any of the predictor values are missing.

Table 5.1: List of predictor variables for temperature prediction.

    500 hPa airflow strength          850 hPa airflow strength
    500 hPa zonal velocity            850 hPa zonal velocity
    500 hPa meridional velocity       850 hPa meridional velocity
    500 hPa vorticity                 850 hPa vorticity
    500 hPa geopotential height       850 hPa geopotential height
    500 hPa wind direction            850 hPa wind direction
    500 hPa divergence                850 hPa divergence
    Relative humidity at 500 hPa      Relative humidity at 850 hPa
    Near surface relative humidity    Surface specific humidity
    Mean sea level pressure           Surface zonal velocity
    Surface airflow strength          Surface meridional velocity
    Surface vorticity                 Surface wind direction
    Surface divergence                Mean temp at 2 m

5.4.2 Experimental Setup

As is well known, temperature, the response variable in our experiments, has seasonal cycles. To efficiently capture these cycles, de-seasonalization is performed prior to running the experiments. As is common practice in climatology, the data are split into 4 seasons (DJF, MAM, JJA, SON), where 'DJF' refers to the months December-January-February in the temperature time series, 'MAM' to March-April-May, 'JJA' to June-July-August, and 'SON' to September-October-November. In effect, 4 different models, corresponding to the 4 seasons, were built for each station. The training set spanned 6 years of data and the test set 12 years. During validation, the parameter $\lambda$ was selected using the RMSE score computed on extreme data points. A data point is considered extreme if its response variable is greater than the 95th percentile (Threshold-1) of the whole dataset corresponding to the station. QR was implemented using the interior-point algorithm detailed in [76]. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method was used to solve the LSQR and LSSQR optimization problems.

5.4.3 Evaluation Criteria

The evaluation metrics were selected with the intent of evaluating the different algorithms in terms of the accuracy of the predicted extreme values, the timing of the extreme events, and the frequency with which a data point is predicted to be extreme. The following metrics are used to capture these criteria for the various models:

• Root Mean Square Error (RMSE), which measures the difference in magnitude between the actual and predicted values of the response variable, i.e.:
$$\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - f_i)^2}{n}}$$
RMSE was computed on those days that were observed to be extreme data points.

• Precision and recall of extreme events are computed to measure the timing accuracy of the prediction. F-measure, the harmonic mean of recall and precision, is used as a score that summarizes the precision and recall results:
$$\text{F-measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$

• The frequency of predicting extreme data points was measured by computing the ratio of the number of data points predicted to be extreme to the number of observed extreme data points.

To summarize, RMSE is used for measuring the accuracy of the predicted magnitude of the response variable, whereas F-measure can be thought of as measuring the correctness of the timing of the extreme events, as illustrated in the sketch below.
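For clarity, the sketch below computes all three criteria; the function name and the convention that the extremeness threshold is supplied by the caller are illustrative assumptions:

    import numpy as np

    def extreme_event_scores(y_true, y_pred, threshold):
        """RMSE over observed extreme days, F-measure for the timing of
        extreme events, and the predicted-to-observed frequency ratio."""
        obs_ext = y_true > threshold
        pred_ext = y_pred > threshold

        rmse_ext = np.sqrt(np.mean((y_true[obs_ext] - y_pred[obs_ext]) ** 2))

        tp = np.sum(obs_ext & pred_ext)
        precision = tp / max(pred_ext.sum(), 1)
        recall = tp / max(obs_ext.sum(), 1)
        f_measure = 2 * recall * precision / max(recall + precision, 1e-12)
        freq_ratio = pred_ext.sum() / max(obs_ext.sum(), 1)
        return rmse_ext, f_measure, freq_ratio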
5.4.4 Baseline

We compared the performance of LSQR and LSSQR with baseline models created using multiple linear regression (MLR), ridge regression (Ridge), and quantile regression (QR). All the baselines were run for the same 37 stations and for all 4 seasons. A comparison of the proposed supervised framework (LSQR) with its semi-supervised counterpart (LSSQR) was also made; LSSQR demonstrated improved performance over LSQR for the 37 stations evaluated, as shown in Table 5.2. Table 5.2 summarizes the percentage of times LSSQR outperformed LSQR over the 4 seasons for the given 37 stations. As seen in the table, LSSQR showed improved performance in terms of both RMSE and F-measure.

Table 5.2: The relative performance of LSSQR compared with LSQR with regard to the extreme data points.

                 Win       Loss      Tie
    RMSE         68.25%    31.75%    0%
    F-measure    60.14%    37.16%    2.7%

5.4.5 Results

As mentioned earlier, experiments were run separately using each of the baseline approaches, LSQR, and LSSQR for the 4 seasons (DJF, MAM, JJA, SON) for each of the 37 stations. The results over all seasons and stations are summarized in Tables 5.3 and 5.4, while the individual results for each season are shown in Figures 5.6 and 5.8. Table 5.3 summarizes the relative performance of LSQR with respect to the baseline methods in terms of the RMSE of extreme data points and the F-measure of identifying extreme data points. During testing, a data point is considered extreme if its response variable is greater than the 95th percentile (Threshold-1) of the whole dataset corresponding to the station. For the purpose of analysis, results using the 95th percentile of the response variable in the training set (Threshold-2) to identify extreme data points are also summarized. The fact that the results obtained using the two thresholds are similar is an indicator that the training data captured the distribution of the response variable reasonably well.

LSQR consistently outperformed the baselines both in terms of RMSE and F-measure. It must also be noted that LSQR outperformed MLR and Ridge in terms of recall of extreme events comprehensively across each of the 37 stations and seasons.

Table 5.3: The percentage of stations for which LSQR outperformed the respective baselines, with regard to the extreme data points.

             RMSE                        F-measure
             Threshold-1   Threshold-2   Threshold-1   Threshold-2
    MLR      88.51%        89.19%        59.45%        56.08%
    Ridge    87.84%        87.84%        60.13%        58.10%
    QR       80.40%        79.05%        72.97%        79.05%

Similarly, Table 5.4 summarizes the relative performance of LSSQR with respect to the baseline methods in terms of the RMSE of extreme data points and the F-measure of identifying extreme data points. Like LSQR, LSSQR consistently outperformed the baselines both in terms of RMSE and F-measure. It must be noted that LSSQR outperformed MLR and Ridge in terms of recall of extreme events comprehensively across each of the 37 stations and seasons.

Table 5.4: The percentage of stations for which LSSQR outperformed the respective baselines, with regard to the extreme data points.

             RMSE                        F-measure
             Threshold-1   Threshold-2   Threshold-1   Threshold-2
    MLR      87.16%        87.84%        60.13%        56.75%
    Ridge    85.14%        86.49%        58.78%        59.45%
    QR       85.13%        81.76%        75.67%        81.75%

Figure 5.6 gives a breakdown of the performance of LSSQR over each of the 4 seasons at the 37 stations, using Threshold-1 to mark a data point as extreme. The figure is a bar chart of the percentage of stations at which LSSQR outperformed MLR, ridge regression, and QR in prediction accuracy for only the extreme data points in the test set. RMSE was used to compute the accuracy of each model in predicting extreme-valued data points at the 37 stations. As seen in the plot, LSSQR outperforms MLR, ridge regression, and QR in each of the four seasons across the 37 stations.

Figure 5.7 depicts the percentage of stations at which LSSQR outperformed MLR, ridge regression, and QR in terms of recall of extreme data points over the 37 stations. Again, LSSQR comprehensively outperforms MLR and ridge regression over all 37 stations and 4 seasons. But as expected, QR outperforms LSSQR in terms of recall for each of the 4 seasons, due to the overestimating nature of QR, which consequently results in poor precision, as reflected in its F-measure score. On average, quantile regression predicted a data point to be extreme more than twice as frequently as the actual frequency of observed extreme data points. In fact, QR lost out to LSSQR in 91% of the 37 stations across the 4 seasons in terms of precision of identifying extreme data points.

[Figure 5.6: Ratio of stations where LSSQR outperformed the baselines in terms of RMSE of extreme data points.]

[Figure 5.7: Ratio of stations where LSSQR outperformed the baselines in terms of recall of extreme data points.]

Figure 5.8 depicts the percentage of stations at which LSSQR outperformed MLR, ridge regression, and QR in terms of the F-measure of identifying extreme data points over the 37 stations. Again, LSSQR outperforms MLR, ridge regression, and QR for all 4 seasons.
[Figure 5.8: Ratio of stations where LSSQR outperformed the baselines in terms of F-measure of extreme data points.]

The performance improvement obtained by LSSQR in terms of predicting extreme values can be easily visualized in Figure 5.9, a plot comparing the predicted response variable of the various methods, restricted to only the extreme data points for a station. As expected, the predicted value of the response variable using multiple linear regression often underestimates the observed temperature, while quantile regression regularly overestimates it; LSSQR lies in between MLR and QR, and closer to the observed temperature.

[Figure 5.9: Prediction performance on extreme data points using MLR, Ridge, QR, and LSSQR.]

5.5 Conclusions

This chapter presents a semi-supervised framework (LSSQR) for accurately predicting the values of extreme data points. The proposed approach was applied to real-world climate data spanning 37 stations and was compared against MLR, ridge regression, and quantile regression in terms of the effectiveness the model demonstrated in identifying and predicting extreme temperatures for the given stations. The next chapter merges the intuition of the framework presented in this chapter, related to extreme values, with the integrated classification and regression framework presented in Chapter 3.

Chapter 6

Modeling Extremes in Zero-Inflated Data

This chapter extends the LSQR framework presented in Chapter 5, which emphasizes the accurate prediction of the frequency, timing, and magnitude of extreme values in the distribution of the response variable, to handle zero-inflated response variables such as daily precipitation.

6.1 Introduction

The notion of being able to foretell the occurrence of an extreme event in a time series is very appealing, especially in domains where significant ramifications are associated with the occurrence of an extreme event. Predicting pandemics in an epidemiological domain or forecasting natural disasters in a geological and climatic environment are examples of applications that give importance to the detection of extreme events. Unfortunately, the accurate prediction of the timing and magnitude of such events is a challenge, given their low occurrence rate. Moreover, the prediction accuracy depends on the regression method used as well as on the characteristics of the data. On the one hand, standard regression methods such as the generalized linear model (GLM) emphasize estimating the conditional expected value, and thus are not best suited for inferring extremal values. On the other hand, methods such as quantile regression are focused on estimating the confidence limits of the prediction, and thus may overestimate the frequency and magnitude of the extreme events. Though methods for inferring extreme value distributions do exist, combining them with other predictor variables for prediction purposes remains a challenging research problem.

Standard regression methods typically assume that the data conform to certain parametric distributions (e.g., from an exponential family). Such methods are ineffective if the assumed distribution does not adequately model the characteristics of the real data. For example, a common problem encountered, especially in modeling climate and ecological data, is the excess probability mass at zero.
Such zero-inflated data, as they are commonly known, often lead to poor model fits with standard regression methods, as these tend to underestimate the frequency of zeros and the magnitude of extreme values in the data. One way of handling such data is to identify and remove the excess zeros and then fit a regression model to the non-zero values. Such an approach can be used, for example, to predict future values of a precipitation time series [115], in which the occurrence of wet or dry days is initially predicted using a classification model prior to applying a regression model to estimate the amount of rainfall for the predicted wet days. A potential drawback of this approach is that the classification and regression models are often built independently of each other, preventing the models from gleaning information from each other to potentially improve their predictive accuracy. Furthermore, the regression methods used in modeling zero-inflated data do not emphasize accurate prediction of extreme values.

This chapter presents an integrated framework that simultaneously classifies data points as zero-valued or not, and applies quantile regression to accurately predict extreme values in the tail of the non-zero part of the distribution by focusing on particular quantiles. We demonstrate the effectiveness of the proposed approach on modeling climate data (precipitation) obtained from the Canadian Climate Change Scenarios Network website [1]. The performance of the approach is compared with four baseline methods. The first baseline is the generalized linear model (GLM) with a Poisson distribution. The second baseline is the generalized linear model using an exponential distribution coupled with a binomial-distribution classifier (GLM-C). A zero-inflated Poisson model was used as the third baseline (ZIP). The fourth baseline was quantile regression. Empirical results show that the proposed framework outperforms the baselines for the majority of the weather stations investigated in this study.

In summary, the main contributions of this chapter are as follows:

• Comparison and analysis of the performance of models created using variants of GLM, quantile regression, and ZIP approaches to accurately predict values of extreme data points that belong to a zero-inflated distribution.

• Presenting an approach optimized for modeling zero-inflated data that outperforms the baseline methods in predicting the value of extreme data points.

• Successfully applying the proposed approach to the real-world problem of downscaling precipitation climate data, with application to climate impact assessment studies.

6.2 Preliminaries

Consider a multivariate time series $L = (\mathbf{x}_t, y_t)$, where $t \in \{1, 2, \cdots, n\}$ is a discrete-valued index for time, $\mathbf{x}_t$ is a $d$-dimensional vector of predictor variables at time $t$, and $y_t$ is the corresponding value of the response (target) variable. Given an unlabeled sequence of multivariate observations $\mathbf{x}_\tau$, where $\tau \in \{n + 1, \cdots, n + m\}$, the goal is to learn a target function $f(\mathbf{x}, \beta)$ that best estimates the values of the response variable by minimizing the expected loss $E_{\mathbf{x},y}[L(y, f(\mathbf{x}, \beta))]$. The weight vector $\beta$ denotes the regression coefficients to be estimated from the training data $L$.

Multiple linear regression (MLR) is one of the most widely used regression methods due to its simplicity.
It assumes $f(\mathbf{x}, \beta) = \beta^T\mathbf{x}$ (where $\mathbf{x}$ is a $(d + 1)$-dimensional vector whose first element $x_0 = 1$ and $\beta \in \mathbb{R}^{d+1}$ is the weight vector) and that the response variable $y$ is related to $f(\mathbf{x}, \beta)$ via the following equation:
$$y = \beta^T\mathbf{x} + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$$
As a result, $P(y|\mathbf{x}) \sim N(\beta^T\mathbf{x}, \sigma^2)$ and $E_{y|\mathbf{x}}[y] = \int yP(y|\mathbf{x})\,dy = \beta^T\mathbf{x}$. Since the predicted value of the response variable for a test data point $\mathbf{x}_\tau$ is $\beta^T\mathbf{x}_\tau$, the predictions made by MLR focus primarily on the average value of $y$ given $\mathbf{x}_\tau$. This explains the limitation of MLR in terms of inferring extreme values in a given time series. The parameter vector $\beta$ in MLR can be estimated using the maximum likelihood (ML) approach to obtain
$$\hat{\beta} = (X^TX)^{-1}X^T\mathbf{y}$$
where $X$ is the $n \times (d + 1)$ design matrix and $\mathbf{y}$ is an $n \times 1$ column vector of the observed values of the response variable.

The drawback of simple linear regression is that it is built on a strong assumption, namely normality. Unfortunately, real-world data may not always follow a normal distribution: they may be skewed to one side, may not cover the whole range of real numbers, or may have a heavier tail than the normal distribution. Hence, alternative approaches that are not constrained by such assumptions, such as GLM, may be used.

6.2.1 Generalized Linear Model and 2-Step GLM (GLM-C)

The generalized linear model is one of the most widely used regression methods due to its simplicity. Generally, a GLM consists of three elements:

1. The response variable $\mathbf{Y}$, which has a probability distribution from the exponential family.

2. A linear predictor $\eta = X\beta$.

3. A link function $g(\cdot)$ such that $E(\mathbf{Y}|X) = \mu = g^{-1}(\eta)$.

Here, $\mathbf{Y} \in \mathbb{R}^{n\times 1}$ is the response variable vector, $X \in \mathbb{R}^{n\times d}$ is the design matrix with all 1's in the last column, and $\beta \in \mathbb{R}^{p\times 1}$ is the parameter vector. Since the link function encodes the relationship between the linear predictor and the mean of the distribution, it is very important to understand the data before arbitrarily using the canonical link function. In this case, since the precipitation data are always non-negative and the values are represented on a millimeter scale, the non-zero data may be treated as count data, allowing the use of a Poisson or an exponential distribution to describe the data. Hence, in these experiments, $\log(\cdot)$ is chosen as the link function and the Poisson distribution is used. The $\mathbf{Y}$ used in the regression model is scaled to $10 \times \mathbf{Y}$:
$$(10 \times Y_i)|X_i \sim \text{Poi}(\lambda_i), \quad E((10 \times Y_i)|X_i) = \lambda_i = g^{-1}(\eta_i) = g^{-1}(X_i\beta)$$
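The text below notes that the GLMs in this chapter were fit with iteratively reweighted least squares (IRLS); direct minimization of the Poisson negative log-likelihood yields the same maximum likelihood estimate. A minimal sketch of the latter, with an assumed function name:

    import numpy as np
    from scipy.optimize import minimize

    def fit_poisson_glm(X, y):
        """Poisson GLM with log link: E[y|x] = exp(x' beta)."""
        def nll(beta):
            eta = X @ beta
            # Poisson negative log-likelihood, up to a constant in y
            return np.sum(np.exp(eta) - y * eta)
        return minimize(nll, np.zeros(X.shape[1]), method="BFGS").x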
Considering the large number of zeros, one is motivated to perform classification first to eliminate the zero values before any regression. There are many classification methods available, but for the purpose of these experiments, logistic regression (which is also a variant of GLM) was chosen to perform the classification. The response variable $Y^*$ of the logistic regression is a binary variable defined as:
$$Y^* = \begin{cases} 1 & Y > 0 \\ 0 & Y = 0 \end{cases}$$
The details of the model are as follows. The link function is the logit link $g(p) = \log\big(\frac{p}{1-p}\big)$, such that
$$Y^*_i|X_i \sim \text{Bin}(p_i), \quad E(Y^*_i|X_i) = p_i = g^{-1}(\eta_i) = g^{-1}(X_i\beta)$$
Once the fitted values are derived, they are converted to binary values:
$$f^* = \begin{cases} 1 & 1 \geq \hat{Y}^* > 0.5 \\ 0 & 0.5 \geq \hat{Y}^* \geq 0 \end{cases}$$
The second part is a GLM with an exponential distribution, where the response variable $Y$ consists of only the non-zero data and the link function is $g(\cdot) = \log(\cdot)$:
$$Y_i|X_i \sim \text{Exp}(\lambda_i), \quad E(Y_i|X_i) = \lambda_i = g^{-1}(\eta_i) = g^{-1}(X_i\beta)$$
The fitted value $f$ is then computed for all $X_i$, and finally the product of the two fitted values, $\hat{Y} = f^* \times f$, is reported. To fit the GLM models, the iteratively reweighted least squares (IRLS) method was used for maximum likelihood estimation of the model parameters.

6.2.2 Zero-Inflated Poisson Regression (ZIP)

Differing from the methods above, zero-inflated Poisson regression treats the zeros as arising from a mixture of two distributions: a Bernoulli distribution with probability $\pi_i$ of producing 0, and a Poisson distribution with parameter $\mu_i$ (let $Pr(\cdot; \mu)$ denote its probability mass function). The ZIP regression model is defined as:
$$Pr(Y = y_i|\mathbf{x}_i) = \begin{cases} \pi_i + (1 - \pi_i)Pr(Y = 0; \lambda_i) & y_i = 0 \\ (1 - \pi_i)Pr(Y = y_i; \lambda_i) & y_i > 0 \end{cases}$$
where $0 < \pi_i < 1$, and
$$\text{logit}(\pi_i) = \log\Big(\frac{\pi_i}{1 - \pi_i}\Big) = \mathbf{x}_i\beta_1, \quad \log(\mu_i) = \mathbf{x}_i\beta_2$$
where $\beta_1$ and $\beta_2$ are both regression parameters; both can be found by maximizing the likelihood function. For the purpose of the experiments, the R package 'pscl' was used to fit the model.

6.2.3 Quantile Linear Regression (QR) and 2-Step QR (QR-C)

Quantile regression is used to estimate a specified quantile of a population. Hence, if the objective of the regression is to estimate a conditional quantile (e.g., the median) of $Y$ instead of the conditional mean, as in MLR and ridge regression, one may use quantile regression. Its loss function for the linear regression model is:
$$f(\mathbf{b}) = \sum_{i=1}^{N} \rho_\tau(Y_i - X_i^T\mathbf{b}), \quad \hat{\beta} = \arg\min_{\mathbf{b}} f(\mathbf{b})$$
where
$$\rho_\tau(u) = \begin{cases} \tau u & u > 0 \\ (\tau - 1)u & u \leq 0 \end{cases}$$
Let $F_Y(y) = P(Y \leq y)$ be the distribution function of a real-valued random variable $Y$. The $\tau$th quantile of $Y$ is given by:
$$Q_Y(\tau) = F^{-1}(\tau) = \inf\{y : F_Y(y) \geq \tau\}$$
It can be shown that the $\hat{y}$ that minimizes $E\rho_\tau(y - \hat{y})$ satisfies $F_Y(\hat{y}) = \tau$. Thus, quantile regression finds the $\tau$th quantile of a random variable; for example:
$$\text{Median}(Y|X) = X\hat{\beta}^{qr}, \quad \hat{\beta}^{qr} = \arg\min_{\mathbf{b}} \sum_i \rho_{0.5}(y_i - X_i^T\mathbf{b})$$
For the experiments conducted, $\tau = 0.95$ was used to represent extreme high values. Unlike the least-squares methods mentioned above, which can be solved by numerical linear algebra, the solution to quantile regression is relatively non-trivial. Linear programming is used to minimize the loss function by converting the problem to the following form:
$$\min_{\mathbf{u},\mathbf{v},\mathbf{b}}\; \{\tau\mathbf{e}_N^T\mathbf{u} + (1 - \tau)\mathbf{e}_N^T\mathbf{v} \mid \mathbf{Y} - X\mathbf{b} = \mathbf{u} - \mathbf{v};\; \mathbf{b} \in \mathbb{R}^p;\; \mathbf{u}, \mathbf{v} \in \mathbb{R}_+^N\}$$
For the same reason as mentioned in Section 6.2.1, a classification method should be incorporated along with the regression model: logistic regression is used for classification, and quantile regression is fit on the non-zero $Y$. The product of the two fitted values is reported. Quantile regression may return a negative value, which we force to 0, because precipitation is always non-negative. A sketch of this two-step baseline follows.
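The following is a minimal sketch of the two-step QR-C baseline just described, reusing the quantile_regression LP helper sketched in Chapter 5; the function names and the logistic fit via BFGS are illustrative assumptions:

    import numpy as np
    from scipy.optimize import minimize

    def fit_logistic(X, y01):
        """Maximum-likelihood logistic regression (the classification step)."""
        def nll(b):
            z = X @ b
            return np.sum(np.logaddexp(0, z) - y01 * z)  # stable neg. log-lik.
        return minimize(nll, np.zeros(X.shape[1]), method="BFGS").x

    def predict_qr_c(X_tr, y_tr, X_te, tau=0.95):
        """Two-step QR-C: classify wet/dry days, quantile-regress on wet
        days only, and report the product of the two fitted values."""
        wet = y_tr > 0
        b_cls = fit_logistic(X_tr, wet.astype(float))
        b_qr = quantile_regression(X_tr[wet], y_tr[wet], tau)  # LP sketch, Sec. 5.2.2
        wet_hat = (X_te @ b_cls > 0).astype(float)   # p > 0.5  <=>  x'b > 0
        return np.maximum(wet_hat * (X_te @ b_qr), 0.0)  # precipitation >= 0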
6.3 Framework for Integrated Classification and Regression

With the introduction of quantile regression, which is an integral part of the objective function, the motivation behind the various components of the proposed objective function needs to be elaborated. Since zero-inflated data are best described with the help of a classifier that identifies non-zero values and a regression component that models the non-zero values, this framework consists of both components. For the classifier component, a least-squares support vector machine is used, and for the regression component, the intuition of quantile regression is used to focus the regression on extreme values. Since the final prediction for a data point under this framework is the product of the regression and classification components, the quantile regression component operates on the eventual predicted value, thereby integrating the classifier and regression components.

6.3.1 Integrated Classifier and Regression for Extreme Values (ICRE)

The classification and regression models developed in this study are designed to minimize the following objective function:
$$\arg\min_{\omega_1, \omega_2} L(\omega_1, \omega_2) = \frac{1}{n}\sum_{i=1}^{n}\big(1 - (2y'_i - 1)f^c_i\big)^2 + \frac{1}{n^*}\sum_{i=1}^{n} y'_i\,\rho_\tau\big(y_i - f^r_i \times (f^c_i + 1)/2\big) + \lambda(\|\omega_1\|^2 + \|\omega_2\|^2) \tag{6.1}$$
where $y'_i \in \{0, 1\}$ indicates whether $y_i$ is non-zero, $f^r_i$ and $f^c_i$ are the regression and classification outputs, and $n^*$ is the number of non-zero $y_i$. It can be expanded as follows:
$$\arg\min_{\omega_1, \omega_2} L(\omega_1, \omega_2) = \frac{1}{n}\sum_{i=1}^{n}\big(1 - (2y'_i - 1)(\mathbf{x}_i^T\omega_2)\big)^2 + \frac{1}{n^*}\sum_{i=1}^{n} y'_i\,\rho_\tau\Big(y_i - (\mathbf{x}_i^T\omega_1) \times \frac{\text{sign}(\mathbf{x}_i^T\omega_2) + 1}{2}\Big) + \lambda(\|\omega_1\|^2 + \|\omega_2\|^2) \tag{6.2}$$
The rationale for the design of the objective function is as follows. The quantile regression term represents quantile regression performed only on the observed non-zero values in the time series. The regression model is therefore biased towards estimating the non-zero extreme values more accurately, and is not adversely influenced by the over-abundance of zeros in the time series. The product $f^r_i \times (f^c_i + 1)/2$ in this term corresponds to the predicted output of the joint classification and regression model. The classification term of the objective function is equivalent to a least-squares support vector machine. The last two terms of the objective function are equivalent to the L2 norm used in ridge regression models to shrink the coefficients in $\omega_1$ and $\omega_2$.

Each data point is considered to be a representative reading at an instance of time $t \in \{1, 2, \cdots, n\}$ in the time series. Each predictor variable is standardized by subtracting its mean value and then dividing by its corresponding standard deviation. The standardization of the variables is needed to account for their varying scales.

The optimization method used in the experiments is L-BFGS-B, described by Byrd et al. [27]. It is a limited-memory version of the BFGS method: it does not store a full Hessian matrix, just a limited number of update steps, and it uses derivative information. Since this model includes a quantile regression component, which is not differentiable, this method of optimization is well suited to the objective function. To solve the objective function, the inverse logistic (sigmoid) function of $\mathbf{x}_i^T\omega_2$ was used in place of $(\text{sign}(\mathbf{x}_i^T\omega_2) + 1)/2$. The decision was motivated by the fact that the optimizer performs a line search along the steepest-descent direction and looks for a change in the derivative along this line, and the binary component would otherwise present a nearly flat surface. Hence, the binary output was replaced by the inverse logistic function of $\mathbf{x}_i^T\omega_2$ to address this issue. During the prediction stage, the binary fitted values from the SVM component are used. A sketch of this setup follows.
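Putting the pieces together, the following is a minimal sketch of the ICRE objective (Eq. 6.2) with the sigmoid surrogate and L-BFGS-B, as described above; the function name, default parameters, and zero initialization are illustrative assumptions:

    import numpy as np
    from scipy.optimize import minimize

    def fit_icre(X, y, tau=0.95, lam=0.1):
        """Sketch of ICRE: least-squares SVM classifier plus quantile
        regression on the gated prediction, jointly minimized."""
        n, d = X.shape
        y_bin = (y > 0).astype(float)          # y'_i: non-zero indicator
        n_star = max(y_bin.sum(), 1.0)

        def loss(w):
            w1, w2 = w[:d], w[d:]
            f_reg = X @ w1                     # regression component
            f_cls = X @ w2                     # classification component
            # least-squares SVM term, labels mapped to {-1, +1}
            cls_term = np.mean((1 - (2 * y_bin - 1) * f_cls) ** 2)
            # overflow-safe sigmoid surrogate for (sign(f_cls) + 1) / 2
            gate = 0.5 * (1 + np.tanh(f_cls / 2))
            r = y - f_reg * gate
            pinball = np.where(r > 0, tau * r, (tau - 1) * r)
            reg_term = (y_bin * pinball).sum() / n_star
            return cls_term + reg_term + lam * (w1 @ w1 + w2 @ w2)

        w_hat = minimize(loss, np.zeros(2 * d), method="L-BFGS-B").x
        return w_hat[:d], w_hat[d:]

    # Prediction gates the regression output by the hard classifier decision:
    # y_hat = (X_test @ w1) * ((np.sign(X_test @ w2) + 1) / 2)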
6.4 Experimental Evaluation

In this section, the climate data used to downscale precipitation are described, followed by the experimental setup. Once the dataset is introduced, the behavior of the baseline models is analyzed and contrasted with ICRE, in terms of the relative performance of the various models when applied to this real-world dataset to forecast future values of precipitation.

6.4.1 Data

All the algorithms were run on climate data obtained for 29 weather stations in Canada, from the Canadian Climate Change Scenarios Network website [1]. The response variable to be regressed (downscaled) corresponds to daily precipitation values measured at each weather station. The predictor variables correspond to 26 coarse-scale climate variables derived from the NCEP Reanalysis data set and the H3A2a data set (computer-generated simulations), which include measurements of airflow strength, sea-level pressure, wind direction, vorticity, and humidity. The predictor variables used for training were obtained from the NCEP Reanalysis data set, while the predictor variables used for testing were obtained from the H3A2a data set. The data span the 40-year period 1961 to 2001. The time series was truncated for each weather station to exclude days for which precipitation or any of the predictor values are missing.

6.4.2 Experimental Setup

The first step was to standardize the predictor variables by subtracting their mean values and then dividing by their corresponding standard deviations, to account for their varying scales. The training set consisted of 10 years of data and the test set of 25 years. During the validation process, the parameter $\lambda$ was selected using the score returned by RMSE-95. Also, to ensure that the experiments replicated the real-world scenario, where predictions for a future time series must be made using simulated values of the predictor variables, simulated values of the corresponding predictor variables obtained from the H3A2a climate scenario were used as $X_U$, while $X_L$ consists of values obtained from NCEP. All the experiments were run for 37 stations.

6.4.3 Baseline Algorithms

We compare the performance of ICRE with baseline models created using the generalized linear model (GLM), the generalized linear model with classification (GLM-C), quantile regression (QR), quantile regression with classification (QR-C), and the zero-inflated Poisson model (ZIP). Further details about the baselines are provided below.

6.4.3.1 Generalized Linear Model (GLM)

The baseline GLM refers to the generalized linear model that uses a Poisson distribution with a log link, resulting in the regression function $\log(\lambda) = X\beta$, where $E(Y|X) = \lambda$.

6.4.3.2 Generalized Linear Model with Classification (GLM-C)

Unlike the previous baseline (GLM), GLM-C refers to a two-step generalized linear model. The first step uses a binomial distribution for the classifier, with the model described by $\text{logit}(p) = X\beta$ and $E(Y^* = 1|X) = p$, where $Y^* = 1$ when $Y > 0$ and $Y^* = 0$ when $Y = 0$. The second step uses a generalized linear model with an exponential distribution that is built only on the non-zero response data points; its regression function is $\log(\lambda) = X\beta$, where $E(Y|X) = \lambda$. The eventual predicted value for each data point is the product of the two respective fitted values.
6.4.3.3 Quantile Regression (QR)

The baseline QR refers to the regular quantile regression described earlier in Section 6.2.

6.4.3.4 Quantile Regression with Classification (QR-C)

The baseline QR-C refers to a two-step model consisting of a GLM with a binomial distribution that acts as a classifier, and a regular quantile regression model built on the non-zero-valued data points, as described earlier in the preliminaries. These two component models are built independently of each other, and the eventual predicted value for each data point is the product of the two respective fitted values.

6.4.3.5 Zero-Inflated Poisson (ZIP)

The zero-inflated Poisson model used as a baseline is the ZIP model described in Section 6.2.

6.4.4 Evaluation Criteria

The evaluation metrics were selected to evaluate the different algorithms in terms of predicting the magnitude and the timing of extreme events. The following criteria are used to evaluate the performance of the models:

• Root Mean Square Error (RMSE), which measures the difference between the actual and predicted values of the response variable, i.e.:
$$\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - f_i)^2}{n}}$$

• RMSE-95, which measures the difference between the actual and predicted values of the response variable for only the extreme data points. Extreme data points refer to the points whose actual values are at the 95th percentile and above (throughout this chapter, data points at the 95th percentile and above are treated as extreme values), i.e.:
$$\text{RMSE-95} = \sqrt{\frac{\sum_{j=1}^{n/20}(y_j - f_j)^2}{n/20}}$$
where $j$ indexes the extreme data points.

• Confusion matrices are computed to capture the precision and recall of extreme and non-extreme events. F-measure, the harmonic mean of recall and precision, is used as a score that summarizes the precision and recall results:
$$\text{F-measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$

To summarize, RMSE-95 is used for measuring magnitude, and F-measure measures the correctness of the timing of the extreme events.

6.4.5 Experimental Results

This section consists of two main sets of experiments. The first set evaluates the impact of zero-inflated data on modeling extreme values. The second compares the performance of ICRE with the baseline methods.

6.4.5.1 Impact of Zero-Inflated Data on Extreme Value Prediction

Unlike regular data, which may be modeled using regression alone, modeling zero-inflated data usually involves a classifier and a regression component: the classifier is used to identify zero and non-zero values, and regression is then applied to the non-zero values. But since the focus of this chapter is on extreme data points within zero-inflated data, the impact of the classifier is unclear. In this section, the impact of including the classifier when modeling extreme values of zero-inflated data is evaluated: QR is compared with QR-C, and GLM with GLM-C, with the results shown in Table 6.1. Note that the percentages of wins for F-measure, recall, and precision may not total 100 in the case of a tie.

Table 6.1: Percentage of stations won.

                 QR       QR-C     GLM      GLM-C
    RMSE-95      100      0        32.43    67.57
    F-measure    18.92    81.08    35.13    18.92

As shown in Table 6.1, it is not clear that using an independent classifier along with regression is preferred for modeling extreme values among zero-inflated data.
But the results do indicate that including or excluding a classifier when the regression model is built independently of it may compromise either RMSE-95 (by overestimating the magnitude) or F-measure (by mistiming the prediction of an extreme value), though not necessarily both together.

6.4.5.2 Comparison of ICRE to Baseline Methods

Table 6.2 shows the relative performance of ICRE against all the baseline methods, in terms of the percentage of stations where ICRE outperformed the respective baseline on RMSE-95 values calculated on extreme rain days. In terms of RMSE on extreme rain days, as shown in Table 6.2, ICRE outperformed the baselines (except QR) in almost every one of the 37 stations, though QR was the best across all methods in terms of RMSE-95 on extreme days.

Table 6.2: Percentage of stations where ICRE outperformed the baseline.

                 QR-C     QR       GLM-C    GLM      ZIP
    RMSE-95      91.89    0        97.3     97.3     97.3
    F-measure    43.24    62.16    89.19    89.19    91.9

In terms of F-measure, which was computed based on the recall and precision of identifying extreme events, ICRE again outperformed the baselines (except QR-C) in the majority of the 37 stations; ICRE was only able to outperform QR-C in 16 of the 37 stations in terms of F-measure. Although QR performed best in terms of estimating the magnitude of extreme events, it overestimated the timing of the events, as seen in its relatively lower F-measure score. QR-C did the reverse: it did reasonably well in terms of modeling the timing, but performed very poorly in terms of the magnitude of the events, by overestimating.

6.5 Conclusions

This chapter compares and analyzes the performance of models created using variants of GLM, quantile regression, and ZIP approaches to accurately predict values of extreme data points that belong to a zero-inflated distribution. An alternative framework (ICRE) was presented that outperforms the baseline methods, and the effectiveness of the model was demonstrated on climate data by predicting the amount of precipitation at a given station.

Chapter 7

Contour Regression

The LSSQR and ICRE frameworks presented in Chapter 5 and Chapter 6 prioritize the accuracy of the prediction of the response variable at a user-specified quantile. However, there are climate modeling applications that are interested in accurately capturing the distribution characteristics of the response variable across all quantiles. In this chapter, the limitations of current regression-based approaches in terms of preserving the distribution of observed climate data are shown, and a multi-objective regression framework that simultaneously fits the distribution properties and minimizes the prediction error is presented. The framework is highly flexible and can be applied to linear, nonlinear, and conditional quantile models. The chapter demonstrates the effectiveness of the framework in modeling the daily minimum and maximum temperature as well as precipitation for climate stations in the Great Lakes region. The framework showed marked improvement over traditional regression-based approaches in all 14 climate stations evaluated.

7.1 Introduction

There are numerous climate modeling applications that can be cast as a regression problem, from projecting future climate scenarios to downscaling the coarse-scale outputs of global/regional climate models for climate change impact assessment and adaptation studies [107, 60, 104].
In addition to minimizing the residuals of the predicted outputs, some of these applications emphasize preserving specific characteristics of the predicted distribution. However, as most regression-based approaches are designed to optimize the former, they tend to perform poorly on the latter criterion.

[Figure 7.1: Area between the CDF of $y$ and $y_{MLR}$.]

As an illustration, consider a two-dimensional regression problem, where the response variable $y$ is related to the predictor variables $\mathbf{x}$ according to the following equation:
$$y = \omega^T\mathbf{x} + \omega_0 + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$$
where $\Omega = [\omega_2\;\; \omega_1\;\; \omega_0] = [1, 2, 5]$. Using the least-squares (maximum likelihood) estimation approach, multiple linear regression (MLR) was able to estimate $\Omega$ fairly accurately as $[0.99, 1.96, 5.05]$. Yet it fared poorly in terms of replicating the shape of the original distribution of $y$, as seen from the cumulative distribution function (CDF) plots given in Figure 7.1. Even though the regression model was trained using ten thousand data points, it is clear from Figure 7.1 that MLR fails to replicate the shape of the cumulative distribution of $y$, particularly the tails of the distribution.
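This under-dispersion is easy to reproduce. The snippet below is an illustrative reconstruction of the experiment; the actual predictor distribution and noise variance used for Figure 7.1 are not specified in the text, so uniform predictors and unit variance are assumed here:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    # Design matrix [x2, x1, 1] matching Omega = [1, 2, 5]
    X = np.column_stack([rng.uniform(0, 3, n), rng.uniform(0, 3, n), np.ones(n)])
    y = X @ np.array([1.0, 2.0, 5.0]) + rng.normal(0.0, 1.0, n)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat

    print(beta_hat)              # coefficients recovered almost exactly
    print(y.std(), y_hat.std())  # but predictions are under-dispersed,
                                 # since Var(y_hat) omits the noise variance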
As another example, Figure 7.2 compares the histograms of the daily maximum temperature observed at a climate station in Michigan and the predicted outputs of MLR. In this case, the standard deviation of MLR's predicted outputs differs quite substantially from that of the observation data. In spite of minimizing the sum of squared prediction errors, regression-based approaches such as MLR fare poorly in preserving the overall shape of the distribution compared to non-regression-based approaches such as quantile mapping (QM), which had an RMSE value 25% worse than that of MLR but gives a better fit to the distribution of maximum temperature.

[Figure 7.2: Histogram of predicted daily maximum temperature at a weather station in Michigan, 1990-1999.]

As a consequence, distribution-driven approaches [108, 97] have been used to correct the distribution characteristics of the data to better match the observed climate variable. However, their prediction accuracy is typically worse than that of regression-based approaches.

[Figure 7.3: CDF of predicted daily precipitation at a weather station in Michigan, 1990-1999.]

Raw projections of climate variables are often obtained from General Circulation Models (GCMs) and, more recently, from Regional Climate Models (RCMs) that incorporate complex topography, land cover, and other regional forcings into the physical models. These raw climate projections need to be further post-processed to meet the requirements of impact assessment studies. In addition to the previously mentioned requirements on the climate variables, empirical downscaling of the output of the climate models to a finer resolution is often needed to bridge the mismatch in spatial or temporal scale between the model output and the scale desired, since the resolution of the output of the climate models may remain too coarse for many applications where local-scale information is needed. Similarly, bias correction is often needed to reduce the inherent uncertainties in the RCM outputs, which may be afflicted by systematic errors introduced by the driving GCM runs, imperfections in the RCM representation, and sampling biases due to the finite-length time series used to parameterize and validate the models [45].

Since the fidelity of both the distribution characteristics and the accuracy of the projections are important, a framework for multivariate regression is proposed that regularizes the distribution of the response variable, simultaneously improving the accuracy of the projection and the shape of the distribution by jointly solving both objectives. Due to its generic nature, the framework may be applied to various types of marginal distributions as well as different objective function criteria, including least squares error, kernel regression, and quantile regression (QR). In this chapter, the effectiveness of the proposed framework is demonstrated by downscaling and bias-correcting daily temperature and precipitation to match their corresponding observations.

In summary, the main contributions of this study are:

• Identification of the limitations of existing least-squares-error regression techniques.

• Presentation of a regression-based framework (Contour Regression) for multivariate empirical downscaling and bias correction that addresses the limitations of existing approaches by simultaneously improving the accuracy of the projection for individual data points as well as the overall shape of the distribution.

• Demonstration of the feasibility of adapting the framework to various objective functions, such as multivariate ordinary least squares, QR, and non-linear kernel ridge regression.

• Evaluation of the framework on real-world climate data, where it consistently outperformed or was at least on par with the baseline approaches and showed robustness to response variables having different shapes of distribution.

7.2 Preliminaries

Let $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ be a labeled dataset of size $n$, where each $\mathbf{x}_i \in \mathbb{R}^d$ is a vector of predictor variables and $y_i \in \mathbb{R}$ is the corresponding response variable. The objective of regression is to learn a target function $f(\mathbf{x}, \beta)$ that best estimates the response variable $y$, where $\beta$ is the parameter vector used by the target function and $n$ represents the number of training points.

7.2.1 Multiple Linear Regression (MLR)

MLR is the most common regression approach used for empirical downscaling of climate data. MLR uses ordinary least squares to solve a linear model of the form
$$y = \mathbf{x}^T\beta + \epsilon$$
where $\epsilon \sim N(0, \sigma^2)$ is an i.i.d. Gaussian error term with variance $\sigma^2$ and $\beta \in \mathbb{R}^d$ is the parameter vector. MLR minimizes the sum of squared residuals
$$(\mathbf{y} - X\beta)^T(\mathbf{y} - X\beta)$$
which leads to the closed-form solution
$$\hat{\beta} = (X^TX)^{-1}X^T\mathbf{y}$$

7.2.2 Quantile Mapping (QM)

Quantile mapping is the most commonly used approach for correcting the shape of the distribution of a climate variable to match observations. It adjusts all the moments of the distribution while maintaining the rank correlation. The following equation is an example of the QM approach:
$$QM: \quad y_i = F_Y^{-1}(F_X(x_i))$$
where $F_X(x)$ is the CDF of the predictor variable $X$, defined by $F_X(x) = P(X \leq x)$. The above equation may be rewritten as follows to help identify the correction made by QM:
$$QM: \quad y_i = x_i + F_Y^{-1}(F_X(x_i)) - F_X^{-1}(F_X(x_i))$$
One of the main assumptions made by QM is that the data points to which the bias correction function is applied come from the same distribution that describes the training set, and that the relationship between predictor and response is constant. Also, a sufficiently large training size is required by QM to capture the true shape of the distribution of the model outputs and the observations. A distinct advantage of QM is that no day-to-day mapped data are required. It can be shown that a QM function that accurately replicates the distribution characteristics of the response variable guarantees neither $RMSE = 0$ nor a relatively small $RMSE$.

Proposition 7.2.1. A QM function that accurately replicates the distribution of the observations may have $RMSE > 0$.

Proof. Given $QM: y = F_Y^{-1}(F_X(x))$, where $F_X \in [0, 1]$ and $F_Y \in [0, 1]$ are the empirical cumulative distribution functions of $x$ and $y$ respectively, let $R$ and $O$ be the multisets containing the quantile values of $x$ under $F_X$ and of $y$ under $F_Y$ respectively, i.e., $F_X(\mathbf{x}) = R$ and $F_Y(\mathbf{y}) = O$. Let $\varepsilon(i) = |F_Y^{-1}(O(i)) - F_Y^{-1}(R(i))|$ be the QM prediction error of data point $x_i$, so that $RMSE = \sqrt{\sum_i \varepsilon^2(i)/n}$. A necessary and sufficient condition for the quantile function to replicate the distribution of the observations is that the cardinality, members, and multiplicities of the multiset $O$ equal those of the multiset $R$, i.e., $|O \cap R| = |O| = |R|$. This requirement does not eliminate the possibility that $\exists i$ s.t. $O(i) \neq R(i)$, where $R(i)$ corresponds to the quantile value of data point $x_i$. If $\exists i$ s.t. $O(i) \neq R(i)$ and $F_Y^{-1}(O(i)) \neq F_Y^{-1}(R(i))$, then $RMSE > 0$. Hence, a quantile mapping function that accurately replicates the distribution characteristics of the response variable guarantees neither $RMSE = 0$ nor a relatively small $RMSE$. ♦

Proposition 7.2.2. A QM function accurately returns the distribution characteristics of the response variable with $RMSE = 0$ when the rank correlation between the predictor and response variables is $\Gamma = 1$.

Proof. Let $R$ and $O$ be the multiset quantiles $F_X(\mathbf{x})$ and $F_Y(\mathbf{y})$ of $x$ and $y$ respectively, and let $\varepsilon(i) = |F_Y^{-1}(O(i)) - F_Y^{-1}(R(i))|$ be the $i$th error of the predicted values from QM, so that $RMSE = \sqrt{\sum_i \varepsilon^2(i)/n}$. Given $(\Gamma = 1) \equiv (\forall i,\, R(i) = O(i))$, we have $\varepsilon(i) = 0$ for all $i$, and therefore
$$RMSE = \sqrt{\sum_i \varepsilon^2(i)/n} = 0 \quad \diamond$$
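An empirical version of the QM map is straightforward to sketch; the interpolation-based inverse CDF below is one common implementation choice, not necessarily the one used in this study:

    import numpy as np

    def quantile_mapping(x_new, x_train, y_train):
        """Empirical quantile mapping y = F_Y^{-1}(F_X(x))."""
        x_sorted = np.sort(x_train)
        y_sorted = np.sort(y_train)
        # F_X: empirical CDF value of each new point, in (0, 1].
        q = np.searchsorted(x_sorted, x_new, side="right") / len(x_sorted)
        # F_Y^{-1}: piecewise-linear empirical inverse CDF of y.
        probs = (np.arange(len(y_sorted)) + 1) / len(y_sorted)
        return np.interp(q, probs, y_sorted)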
Section 7.3.1 elaborates on CR and describes multivariate linear contour regression (MLCR), whose objective function is based on ordinary least squares. Section 7.3.2 proposes kernel contour regression (KCR), a kernel-based interpretation of the CR framework. Section 7.3.3 proposes a quantile regression-based interpretation that emphasizes the conditional quantile of the user's preference; in this study, the chosen conditional quantile corresponds to the extreme fifth percentile of the distribution.

7.3.1 Multiple Linear Contour Regression (MLCR)

This section describes an approach for CR that is based on ordinary least squares (OLS), simultaneously regressing on the response variable and on the ordered values of the response variable by minimizing the sum squared error shown below:

$$\sum_{i=1}^{n}\left(\gamma(f(\mathbf{x}_i,\beta) - y_i)^2 + (1-\gamma)(f(\mathbf{x}_i,\beta) - z_i)^2\right)$$

where $f(X, \beta) = X\beta$ and $z_i = y_{(i)}$. This equates to minimizing

$$\gamma(y - X\beta)^T(y - X\beta) + (1-\gamma)(z - X\beta)^T(z - X\beta)$$

where the predictor matrix $X$ is pre-sorted such that $f(\mathbf{x}_i, \beta) \le f(\mathbf{x}_j, \beta)$ for all $i < j$, and $\gamma \in [0, 1]$ is a user-defined parameter that may be used to prioritize either fidelity in regression accuracy or the shape of the distribution. It is evident from the equation that as $\gamma \to 1$, MLCR converges to the solution of MLR, as seen in Figure 7.4, which depicts the influence of the $\gamma$ parameter on the shape of the CDF of the response variable. The closed-form solution of MLCR is

$$\hat{\beta} = (X^TX)^{-1}\left(\gamma X^Ty + (1-\gamma)X^Tz\right)$$

Since it is often not possible to guarantee that $X$ is pre-sorted correctly according to $f(\mathbf{x}_i, \beta)$, one may need to solve the objective function iteratively: after each iteration, the data points $X$ and corresponding $y$ are reordered so that the new ordering conforms to $f(\mathbf{x}_i, \beta) \le f(\mathbf{x}_j, \beta)$ for all $i < j$ under the $\beta$ obtained from the previous iteration, until convergence. Convergence is reached when $f(\mathbf{x}_i, \beta) \le f(\mathbf{x}_j, \beta)$ holds for all $i < j$. As shown below, the objective function converges with each iteration.

[Figure 7.4: Influence of the γ parameter on the fidelity of the response variable's cumulative distribution function, showing the CDFs of y and of the MLCR output for γ = 1, 1/2, and 1/10.]

7.3.1.1 Proof of Convergence

This section presents the proof of convergence of the iterative update algorithm. Let $\beta_t$, $f_t$, $X_t$ be the regression coefficients, the predicted values of the response variable, and the predictor variables at the $t$-th iteration, while $\beta_{t+1}$, $f_{t+1}$, $X_{t+1}$ represent the same quantities after the $(t+1)$-th iteration.

Proposition 7.3.1. Assuming that the indices of the predictor variables are fixed, $L(\beta_t, f_t, X_t) \ge L(\beta_{t+1}, f_{t+1}, X_t)$.

Proof. For a fixed $X_t$, $L(\beta_{t+1}, f_{t+1}, X_t) \le L(\beta_t, f_t, X_t)$, since $\beta_{t+1}$ is obtained from a closed-form solution of ordinary least squares and by definition is the solution that minimizes the objective function. In the worst case, $L(\beta_{t+1}, f_{t+1}, X_t) = L(\beta_t, f_t, X_t)$. ♦

Proposition 7.3.2. Assuming that the regression coefficients $\beta$ are fixed, $L(\beta_{t+1}, f_{t+1}, X_t) \ge L(\beta_{t+1}, f_{t+1}, X_{t+1})$.

Proof. Let $L(\beta_{t+1}, f_{t+1}, X_t) = L_{t+1}^y + L_t^z$, where $L_{t+1}^y$ refers to the first half of the loss function, which regresses on $y$, and $L_t^z$ refers to the second half, which regresses on $z$. Since the change in ordering of $X$ from the $t$-th to the $(t+1)$-th iteration does not impact the $L^y$ component of the loss function, and $L(\beta_{t+1}, f_{t+1}, X_{t+1}) = L_{t+1}^y + L_{t+1}^z$, we concentrate on $L^z$. We have $L_t^z = (1-\gamma)\sum_{i=1}^n (f(\mathbf{x}_i,\beta) - z_i)^2$, which can be rewritten as

$$L_t^z = \sum_{i=1}^{n}\left(f_i^2 + z_i^2 - 2f_iz_i\right)$$

where the constant $(1-\gamma)$ is dropped for simplicity. Given that $\beta$ and the values of $f$ are fixed, $L_{t+1}^z = \sum_{i=1}^{n}(f_{(i)}^2 + z_i^2 - 2f_{(i)}z_i)$.

$$\Rightarrow L_t^z - L_{t+1}^z = 2\sum_{i=1}^{n}\left(f_{(i)}z_i - f_iz_i\right)$$

Since $\sum_{i=1}^n a_{(i)}b_{(i)} \ge \sum_{i=1}^n a_ib_i$ for all $a, b \in \mathbb{R}^n$, we have $\sum_i f_{(i)}z_i \ge \sum_i f_iz_i$, since by definition $z_i = z_{(i)}$. $\Rightarrow L_t^z - L_{t+1}^z \ge 0$ $\Rightarrow L(\beta_{t+1}, f_{t+1}, X_t) \ge L(\beta_{t+1}, f_{t+1}, X_{t+1})$. ♦

Theorem 7.3.1. The objective function $L(\beta)$ is monotonically non-increasing given the update formulas for $\beta$, $f$, and $X$.

Proof. The update formula iteratively modifies the objective function as follows: $L(\beta_t, f_t, X_t) \Rightarrow L(\beta_{t+1}, f_{t+1}, X_t) \Rightarrow L(\beta_{t+1}, f_{t+1}, X_{t+1})$. Using the above propositions, we have $L(\beta_t, f_t, X_t) \ge L(\beta_{t+1}, f_{t+1}, X_t)$ and $L(\beta_{t+1}, f_{t+1}, X_t) \ge L(\beta_{t+1}, f_{t+1}, X_{t+1})$. $\Rightarrow L(\beta_{t+1}, f_{t+1}, X_{t+1}) \le L(\beta_t, f_t, X_t)$. ♦

Lemma 7.3.1. The objective function eventually converges, since the value of the loss function is always non-negative and $L(\beta)$ is monotonically non-increasing.
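The iterative update just described can be sketched in a few lines of NumPy. The helper name `mlcr_fit`, the initial ordering heuristic, and the stopping rule are illustrative choices, not prescribed by the framework.

```python
import numpy as np

def mlcr_fit(X, y, gamma=0.5, max_iter=10):
    """Iterative MLCR: alternate between the closed-form solution
    beta = (X'X)^{-1} (gamma X'y + (1-gamma) X'z)
    and re-sorting the data so fitted values are non-decreasing."""
    z = np.sort(y)                       # z_i = y_(i), the order statistics
    order = np.argsort(y)                # heuristic starting order
    for _ in range(max_iter):
        Xs, ys = X[order], y[order]
        beta = np.linalg.solve(
            Xs.T @ Xs, gamma * Xs.T @ ys + (1 - gamma) * Xs.T @ z)
        f = Xs @ beta
        # reorder so that the new fitted values are non-decreasing
        new_order = order[np.argsort(f, kind="stable")]
        if np.array_equal(new_order, order):   # ordering is stable: converged
            break
        order = new_order
    return beta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.5, 200)
print(mlcr_fit(X, y, gamma=0.8))
```

The monotone convergence argument above guarantees that each re-sort can only decrease (or leave unchanged) the objective, so capping the iterations at ten, as done in the experiments later, is a practical safeguard rather than a necessity.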
7.3.2 Kernel Contour Regression (KCR)

A variant of MLR with ridge regularization is commonly used to mitigate over-fitting in regression. Ridge regression also provides a formulation that overcomes the hurdle of a singular covariance matrix $X^TX$, which MLR might face during optimization. Unlike the loss function of MLR, the loss function of ridge regression is $(y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$, and its corresponding closed-form solution is

$$\hat{\beta} = (X^TX + \lambda I)^{-1}X^Ty$$

where the ridge coefficient $\lambda > 0$ ensures that the matrix $X^TX + \lambda I$ is always invertible. The dual form of ridge regression is given by

$$\hat{\alpha} = (G + \lambda I)^{-1}y$$

where $G = XX^T$. By mapping the predictor variables $X$ to a higher-dimensional feature space $F$ via $\phi: X \in \mathbb{R}^d \to F \subseteq \mathbb{R}^N$, where $N \gg d$, one can transform the regularized least squares regression to the feature space $F$ using the kernel $K$. Similarly, the predictor variables of CR can be mapped to a higher-dimensional feature space $F$ by using the ridge counterpart of MLCR:

$$\beta = (\phi(X)^T\phi(X) + \lambda I)^{-1}\left(\gamma\phi(X)^Ty + (1-\gamma)\phi(X)^Tz\right)$$
$$\Rightarrow \beta = \lambda^{-1}\phi(X)^T\left(\gamma y + (1-\gamma)z - \phi(X)\beta\right) = \phi(X)^T\alpha$$
$$\Rightarrow \alpha = (G + \lambda I)^{-1}\left(\gamma y + (1-\gamma)z\right)$$

where $G = \phi(X)\phi(X)^T$ and $G_{ij} = \langle\phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\rangle = K(\mathbf{x}_i, \mathbf{x}_j)$.
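A minimal sketch of the KCR dual solution above, assuming an RBF kernel (the kernel used in the experiments later in this chapter). The function names are illustrative, and the iterative re-sorting step, which proceeds exactly as in MLCR, is omitted for brevity.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Gram matrix of the RBF kernel K(a, b) = exp(-||a - b||^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * length_scale ** 2))

def kcr_fit(X, y, gamma=0.5, lam=1.0, length_scale=1.0):
    """Dual KCR coefficients alpha = (G + lam I)^{-1} (gamma y + (1-gamma) z),
    with z the sorted response; X is assumed already ordered by fitted value
    (in practice this is enforced iteratively, as in MLCR)."""
    z = np.sort(y)
    G = rbf_kernel(X, X, length_scale)
    return np.linalg.solve(G + lam * np.eye(len(y)),
                           gamma * y + (1 - gamma) * z)

def kcr_predict(X_new, X_train, alpha, length_scale=1.0):
    # Prediction is a kernel expansion over the training points
    return rbf_kernel(X_new, X_train, length_scale) @ alpha
```

A usage sketch would be `alpha = kcr_fit(X, y)` followed by `y_hat = kcr_predict(X_new, X, alpha)`; setting `gamma=1.0` recovers plain kernel ridge regression, mirroring how MLCR collapses to MLR.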
7.3.3 Quantile Contour Regression (QCR)

Most regression approaches used for downscaling focus on predicting the conditional mean of the response variable. Predicting the conditional mean is not well suited to predicting extreme values, which are better identified by the conditional quantiles that correspond to the extremes. Hence, unlike the common regression techniques mentioned earlier, approaches similar to quantile regression (QR) [77] are better suited to estimating the extremes of $Y$. To estimate the $\tau$th conditional quantile $Q_{Y|X}(\tau)$, QR minimizes an asymmetrically weighted sum of absolute errors using the loss function

$$\sum_{i=1}^{n}\rho_\tau(y_i - \mathbf{x}_i^T\beta), \qquad \rho_\tau(u) = \begin{cases}\tau u & u > 0\\ (\tau - 1)u & u \le 0\end{cases}$$

and the $\tau$th quantile of a random variable $Y$ is given by

$$Q_Y(\tau) = F^{-1}(\tau) = \inf\{y : F_Y(y) \ge \tau\}$$

where $F_Y(y) = P(Y \le y)$ is the distribution function of the real-valued random variable $Y$ and $\tau \in [0, 1]$.

Linear programming is used to minimize the loss function by converting the problem to the following form:

$$\min_{u,v}\ \tau\mathbf{1}_n^Tu + (1-\tau)\mathbf{1}_n^Tv \quad \text{s.t.}\quad y - X\beta = u - v$$

where $u_i \ge 0$, $v_i \ge 0$, and $\beta \in \mathbb{R}^d$.

The objective function of QR can be adopted by CR to obtain the following loss function:

$$\sum_{i=1}^{n}\left(\rho_{\tau_1}(y_i - \mathbf{x}_i^T\beta) + \rho_{\tau_2}(z_i - \mathbf{x}_i^T\beta)\right)$$

with $\rho_\tau(u)$ as defined above, which equates to

$$\min_{u,v,u',v'}\ \tau_1\mathbf{1}_n^Tu + (1-\tau_1)\mathbf{1}_n^Tv + \tau_2\mathbf{1}_n^Tu' + (1-\tau_2)\mathbf{1}_n^Tv'$$
$$\text{s.t.}\quad y - X\beta = u - v, \qquad z - X\beta = u' - v'$$

where $\tau_2 = 0.5$, $u_i \ge 0$, $u'_i \ge 0$, $v_i \ge 0$, $v'_i \ge 0$, and $\beta \in \mathbb{R}^d$.

7.3.3.1 Proof of Convergence

Let $\beta_t$, $f_t$, $X_t$ be the regression coefficients, the predicted values of the response variable, and the predictor variables at the $t$-th iteration, and $\beta_{t+1}$, $f_{t+1}$, $X_{t+1}$ the same quantities after the $(t+1)$-th iteration.

Proposition 7.3.3. Assuming that the indices of the predictor variables are fixed, $L(\beta_t, f_t, X_t) \ge L(\beta_{t+1}, f_{t+1}, X_t)$.

Proof. For a fixed $X_t$, $L(\beta_{t+1}, f_{t+1}, X_t) \le L(\beta_t, f_t, X_t)$, since $\beta_{t+1}$ is the solution that minimizes the objective function. In the worst case, $L(\beta_{t+1}, f_{t+1}, X_t) = L(\beta_t, f_t, X_t)$. ♦

Proposition 7.3.4. Assuming that the regression coefficients $\beta$ are fixed, $L(\beta_{t+1}, f_{t+1}, X_t) = L(\beta_{t+1}, f_{t+1}, X_{t+1})$.

Proof. Let $L(\beta_{t+1}, f_{t+1}, X_t) = L_{t+1}^y + L_t^z$, where $L_{t+1}^y$ refers to the first half of the loss function, which performs QR on $y$, and $L_t^z$ refers to the second half, which performs QR on $z$. Since the change in ordering of $X$ does not impact $L^y$, we concentrate on $L^z$. Given $L_t^z = 0.5\sum_{i=1}^n (f_i - z_i)$ and $L_{t+1}^z = 0.5\sum_{i=1}^n (f_{(i)} - z_i)$, and since reordering leaves the multiset of predicted values unchanged, $\sum_i f_i = \sum_i f_{(i)}$, $\Rightarrow L_t^z = L_{t+1}^z$. Hence, $L(\beta_{t+1}, f_{t+1}, X_t) = L(\beta_{t+1}, f_{t+1}, X_{t+1})$. ♦

Theorem 7.3.2. The objective function $L(\beta)$ is monotonically non-increasing given the update formulas for $\beta$, $f$, and $X$.

Proof. The update formula iteratively modifies the objective function as follows: $L(\beta_t, f_t, X_t) \Rightarrow L(\beta_{t+1}, f_{t+1}, X_t) \Rightarrow L(\beta_{t+1}, f_{t+1}, X_{t+1})$. Using the above propositions, we have $L(\beta_t, f_t, X_t) \ge L(\beta_{t+1}, f_{t+1}, X_t)$ and $L(\beta_{t+1}, f_{t+1}, X_t) = L(\beta_{t+1}, f_{t+1}, X_{t+1})$. $\Rightarrow L(\beta_{t+1}, f_{t+1}, X_{t+1}) \le L(\beta_t, f_t, X_t)$. ♦
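The linear program above can be sketched with SciPy's `linprog`. Shown here is the plain QR building block, with $\beta$ free and $u$, $v$ the positive and negative residual parts; the QCR variant adds the analogous $(u', v', \tau_2)$ block for $z$. Names and toy data are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression(X, y, tau=0.95):
    """Solve the QR linear program
         min  tau * 1'u + (1 - tau) * 1'v
         s.t. X beta + u - v = y,  u >= 0, v >= 0,
    where u and v split the residual y - X beta into its parts."""
    n, d = X.shape
    c = np.concatenate([np.zeros(d), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])    # variables: [beta, u, v]
    bounds = [(None, None)] * d + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:d]

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(300), rng.uniform(0, 10, 300)])
y = X @ np.array([1.0, 0.5]) + rng.gumbel(0, 1, 300)
print(quantile_regression(X, y, tau=0.95))  # upper-tail regression line
```

With `tau=0.95` the fit tracks the extreme upper fifth percentile used in the experiments below; `tau=0.05` would target the lower tail, as done for minimum temperature.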
7.4 Experimental Results

The objective of the experiments was to evaluate the effectiveness of CR on observed climate data.

7.4.1 Data

All the algorithms were run using climate data obtained at fourteen weather stations in Michigan, USA. Daily maximum temperature (T), minimum temperature (t), and precipitation (P) were the three climate target variables evaluated. The predictor variables used in this study were obtained from the North American Regional Climate Change Assessment Program (NARCCAP) [2] (Table 7.1). Nine data sets are used, corresponding to the combinations of three RCMs and three target variables. The three RCMs used are the Canadian Regional Climate Model (CRCM), the Weather Research and Forecasting Model (WRFG), and the Regional Climate Model Version 3 (RCM3). The models were each driven by the NCEP/DOE AMIP-II Reanalysis (NCEP) for a domain covering the United States and Canada. The data for the RCMs span the period 1980-1999. The gridded RCM data have a spatial resolution of 50 km. Unlike observation data, which relate to a point location, RCM data are available at grid resolution, with each value representing a grid-cell average. Since the observation data correspond to daily values, preprocessing was done to convert the three-hour reanalysis-driven RCM data to daily values.

Table 7.1: List of predictor variables from each RCM.

  Predictor variable                  Frequency
  Meridional Surface Wind Speed       3 hourly
  Zonal Surface Wind Speed            3 hourly
  Minimum Surface Air Temperature     Daily
  Maximum Surface Air Temperature     Daily
  Surface Air Temperature             3 hourly
  Surface Pressure                    3 hourly
  Precipitation                       3 hourly
  Surface Specific Humidity           3 hourly
  500 hPa Geopotential Height         3 hourly

Preprocessing was also needed to convert the observation data and the data from the various RCM runs to the same units. For instance, precipitation in the observation data was in millimeters, while precipitation data obtained from the various RCM runs was recorded in MKS units of kg/m²/s and needed to be converted to millimeters. In the event of missing values in the reanalysis/GCM-driven RCM simulated data, the whole day corresponding to the data point was removed during the training phase of the various BCED approaches evaluated in this study, even if the missing value corresponded to only a single three-hour time stamp for that day.

7.4.2 Experimental Setup

Twenty years (1980-1999) of model data from the various RCMs, along with the corresponding observation data, were split into two parts of ten contiguous years used for training and testing. The results provided in this section are those observed during out-of-sample evaluation only. A 10-fold cross-validation comparison of the various BCED models was also evaluated; but since the climate models' ability to reproduce climate variability is typically averaged over the order of ten years for the purpose of analysis, as noted by Ehret et al. [45], and since the relative performances were consistent across the two set-ups, the 10-fold cross-validation results are not included in this chapter. To evaluate the relative skill of the proposed approach in bias correction and downscaling, popular BCED approaches such as MLR, Lasso, QM, PHC, LOCI, QR, and kernel regression were used as baselines. For simplicity, the parameter γ was fixed across every station. Throughout this chapter, the extreme fifth percentile of a distribution defines the extreme values. Consequently, 0.95 is used as τ for QR-based experiments that model extreme precipitation and extreme maximum temperature, while 0.05 is used as τ for modeling extreme minimum temperature. The radial basis function (RBF) kernel was the kernel of choice in this study. For the CR-based experiments, the maximum number of iterations was set to ten.

7.4.3 Results

The motivation behind the experiments was to evaluate the different algorithms in terms of the accuracy of the prediction, the fidelity of the shape of the distribution to the observations, the timing of extreme events, and the frequency with which a data point is predicted to be extreme. The performance of MLCR was compared against MLR, ridge regression (Ridge), lasso regression (Lasso), QM, LOCI, and fitted histogram equalization. Similarly, QCR was compared to baseline approaches such as MLR, QM, and QR. Autoregressive models were not used as baselines, as they are not well suited to long-term climate projections (40-100 years into the future). Since regression emphasizes minimizing the residuals, MLCR was first compared with its baselines for potential loss in root mean square error (RMSE) performance, which is then put in perspective against the improvement over the baseline CDFs.
Barring possible over-fitting, MLR should, by definition of its objective function, have the minimum SSE among the linear regression approaches. Hence, MLR was used as a baseline to evaluate the possible deterioration in RMSE incurred by MLCR on account of its distribution regularization. MLCR showed an average deterioration in RMSE of less than 3% across the first six data sets (target variables maximum and minimum temperature) (Table 7.2), while improving the average error of the empirical cumulative distribution function (RMSE-CDF) by around 40% (Figure 7.6). Given that RMSE-CDF $= \sqrt{\sum_{i=1}^{n}(y_{(i)} - f_{(i)})^2 / n}$, whose values are on the same scale as RMSE, it is clear that MLCR was able to considerably improve the shape of the distribution to better match the observations at the expense of a marginal deterioration in RMSE. This improvement was observed across all climate stations within each data set, as shown by the 100% win-loss percentage (Table 7.2). Ridge and Lasso fared comparably to MLR, while QM had the worst RMSE, as expected.

Table 7.2: Relative performance gain of MLCR over baseline approaches.

            RMSE % loss      RMSE-CDF % gain    RMSE-CDF win-loss %
  Dataset   MLR     Lasso    MLR      Lasso     MLR      Lasso
  WRFG-T    1.9     1.7      39.0     41.7      100      100
  CRCM-T    2.8     2.6      25.8     28.0      100      100
  RCM3-T    2.0     1.8      35.3     39.2      100      100
  WRFG-t    1.0     0.6      51.4     53.7      100      100
  CRCM-t    1.9     1.6      38.2     40.1      100      100
  RCM3-t    1.8     1.6      53.2     56.1      100      100
  WRFG-P    28.8    28.3     74.3     75.8      100      100
  CRCM-P    25.8    25.0     71.1     73.2      100      100
  RCM3-P    29.9    29.5     75.6     76.7      100      100

MLR fared considerably worse in terms of its CDF when modeling precipitation (Gamma distribution) (Figure 7.5). Since MLR struggled to capture the shape of the precipitation distribution, a smaller value of the γ parameter was chosen for MLCR than was used for the previous (normally distributed) data sets, to better fit the observations' CDF. Consequently, a larger deterioration in RMSE performance was accepted in exchange for an impressive average RMSE-CDF improvement of over 70%.

[Figure 7.5: CDF of predicted daily precipitation at a weather station in Michigan over the years 1990-99 (Observation, MLR, QM, MLCR).]

For evaluating the similarity of distributions, the following metrics are used: the Kolmogorov-Smirnov statistic (K), which for a given pair of cumulative distribution functions $F_1(x)$ and $F_2(x)$ is $\max_x |F_1(x) - F_2(x)|$; the standard deviation σ; the correlation ρ; and the correlation-CDF (ρ−CDF), which measures the correlation between two CDFs. MLCR regularly outperformed the baseline regression approaches at every station (Table 7.3), while QM produced the most accurate standard deviation. However, MLCR was able to catch up with QM in terms of ρ−CDF, especially for precipitation, due to the emphasis given to the distribution-driven term in the experiments.

Table 7.3: Percentage of stations where MLCR outperformed the baselines in terms of σ and ρ−CDF.

            σ win-loss %            ρ−CDF win-loss %
  Dataset   MLR    Lasso   QM      MLR     Lasso   QM
  WRFG-T    100    100     0       100     100     0
  CRCM-T    100    100     0       100     100     0
  RCM3-T    100    100     0       100     100     0
  WRFG-t    100    100     0       78.6    85.8    64.3
  CRCM-t    100    100     0       92.9    100     35.8
  RCM3-t    100    100     0       92.9    85.8    85.7
  WRFG-P    100    100     7.1     100     100     28.6
  CRCM-P    100    100     0.0     100     100     50.0
  RCM3-P    100    100     7.1     100     100     64.3

[Figure 7.6: CDF of predicted daily maximum temperature at a weather station in Michigan, 1990-99 (Observation, MLR, QM, MLCR).]
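For reference, a small sketch of the distribution-fit metrics used in this evaluation, assuming equal-length prediction and observation vectors; the function names are illustrative.

```python
import numpy as np

def rmse(y, f):
    """Root mean square error between observations y and predictions f."""
    return np.sqrt(np.mean((y - f) ** 2))

def rmse_cdf(y, f):
    """RMSE between the order statistics of observation and prediction;
    measures how well the predicted CDF matches the observed CDF."""
    return np.sqrt(np.mean((np.sort(y) - np.sort(f)) ** 2))

def ks_statistic(y, f):
    """Two-sample Kolmogorov-Smirnov statistic max |F1 - F2|,
    evaluated over the pooled sample values."""
    grid = np.sort(np.concatenate([y, f]))
    F1 = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    F2 = np.searchsorted(np.sort(f), grid, side="right") / len(f)
    return np.max(np.abs(F1 - F2))
```

Note that `rmse_cdf` is invariant to the pairing of data points, which is why a method can score well on it while still having a large `rmse`; this is precisely the tension that the γ parameter trades off.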
7.4.3.1 QCR Results

In addition to the above-mentioned metrics, QCR was compared with baseline approaches such as MLR, QM, and QR in terms of its performance at the extremes of the distributions. In terms of RMSE for the extreme-valued data points alone, QCR was able to outperform MLR, since MLR tended to underestimate the extremes. QCR also fared very well against QR (Figure 7.8), where the regression models emphasized the lowest τ quantile, corresponding to extreme values of the target variable (minimum temperature). It is clear that QCR emphasized accuracy in the lower quantiles of the distribution over the higher quantiles, as expected.

Precision and recall of extreme events were computed to measure the timing accuracy of the prediction of extreme-valued data points. The F-measure, the harmonic mean of recall and precision, is used as a score summarizing the precision and recall results. It was also found that QCR had the best F-measure among the regression-based approaches in terms of correctly identifying extreme values across all the stations.

[Figure 7.7: CDF of predicted daily precipitation at a weather station in Michigan, 1990-99 (Observation, QR, QCR).]

Figure 7.7 shows the performance of QCR on precipitation. In spite of a larger value of the γ parameter for QCR compared with that used for MLCR, QCR performed better than MLCR in terms of correcting the overall shape of the distribution. This is because of the zero-inflated nature of precipitation, which results in very few large-valued data points, and these have a larger influence on the appearance of the CDF plot. As seen in Table 7.4, QCR also regularly outperformed QR in terms of the other metrics, such as correlation.

Table 7.4: Percentage of stations where QCR outperformed the baseline approaches in terms of RMSE, F-measure, K statistic, and correlation for data points considered extreme values.

  Dataset   RMSE   F-measure   K      ρ
  WRFG-T    100    100         100    100
  CRCM-T    100    100         100    92.9
  RCM3-T    100    100         100    100
  WRFG-t    100    100         100    64.3
  CRCM-t    100    100         100    58.7
  RCM3-t    100    100         100    78.6
  WRFG-P    100    100         100    35.8
  CRCM-P    100    100         100    28.6
  RCM3-P    100    100         100    21.4

[Figure 7.8: CDF of predicted daily minimum temperature at a weather station in Michigan, 1990-99 (Observation, QR, QCR).]

7.5 MCR Using Heterogeneous Data

The MCR framework can also incorporate heterogeneous data sources of predictor variables. This extension is referred to as $MCR_{HET}$. An example of incorporating heterogeneous data sources is using reanalysis values for the predictor variables together with asynchronous data obtained from GCM-driven runs of an RCM. The asynchronous predictor variables obtained from the GCM-driven runs of the RCM are incorporated into the second term of MCR. Figure 7.9 shows the CDF of predicted daily minimum temperature at a weather station (Eau Claire) in Michigan, 1990-99, using asynchronous regional climate model data. Similarly, Figure 7.10 shows the CDF of predicted daily precipitation at the same weather station in Michigan, 1990-99, using asynchronous regional climate model data.

7.5.1 Geometric Quantile Mapping

The multi-dimensional equivalent of the quantile function is the geometric quantile [32]. For a univariate random variable $X \in \mathbb{R}$, let $F_X(x)$ be its cumulative distribution function (CDF), i.e., $F_X(x) = P(X \le x)$. The corresponding α-quantile of $X$ is given by $\inf\{x \in \mathbb{R} : F_X(x) \ge \alpha\}$. More generally, the position [84] of a data point $z$ relative to a set of points $Z = (z_1, \ldots, z_m)^T$ is given by

$$p_Z(z) = \frac{1}{m}\sum_{i=1}^{m}\eta(z - z_i), \qquad \eta(w) = \begin{cases}\dfrac{w}{\|w\|} & \text{if } w \neq 0\\ 0 & \text{if } w = 0\end{cases}$$
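A minimal NumPy sketch of the position function just defined; the name `position` and the univariate sanity check are illustrative.

```python
import numpy as np

def position(z, Z):
    """Geometric position of point z relative to the rows of Z:
    p_Z(z) = (1/m) * sum_i eta(z - z_i), with eta(w) = w / ||w|| (0 at w = 0)."""
    diffs = z - Z                                  # shape (m, q)
    norms = np.linalg.norm(diffs, axis=1)
    eta = np.zeros_like(diffs)
    nz = norms > 0
    eta[nz] = diffs[nz] / norms[nz, None]          # unit direction vectors
    return eta.mean(axis=0)

# Univariate sanity check: near the median the position is close to 0
Z = np.arange(1.0, 10.0).reshape(-1, 1)
print(position(np.array([5.0]), Z))                # ~0 at the sample median
```

In one dimension the unit directions reduce to signs, so the position is simply the fraction of points below minus the fraction above, consistent with the $2F_Z(z) - 1$ identity discussed next.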
For univariate data, the position $p_Z(z)$ is equal to $2F_Z(z) - 1$, where $F_Z(z)$ is the cumulative distribution function of $Z$. Distribution correction methods such as quantile mapping are only applicable if one can match the position of a data point in one univariate distribution (say, for $x$) to its corresponding position in another univariate distribution (say, for $y$). This is possible using the preceding definition of position for univariate data, since the values of $p_Z$ are always fixed in the range $[-1, +1]$ irrespective of the values in $Z$. Unfortunately, when extended to multivariate positions, the range of values for $p_Z$ may vary depending on the values in $Z$. To overcome this problem, He et al. [61] introduce the notion of a stationary position, obtained by iteratively applying the following position transformation function until convergence:

$$p_Y^k(z) = \frac{1}{\kappa n}\sum_{i=1}^{n}\frac{p_Y^{k-1}(z) - p_Y^{k-1}(y_i)}{\|p_Y^{k-1}(z) - p_Y^{k-1}(y_i)\|}, \qquad p_Y^1(z) = \frac{1}{\kappa n}\sum_{i=1}^{n}\frac{z - y_i}{\|z - y_i\|} \tag{7.1}$$

Here each component of $y_i$ must first be converted to its marginal rank before applying the position transformation function. The marginal rank is the rank of the data point divided by the largest rank, normalized to the range $[-1, 1]$. The normalization negates the effect of variables whose values span different ranges. Data points with normalized marginal rank close to ±1 correspond to extreme values of the particular variable, while those close to 0 are located near the median of the distribution. In practice, the number of iterations needed to reach a stationary distribution is quite small, typically $k \le 5$ [61]. For univariate data, it can be shown that $P^k$ reaches a stationary distribution at $k = 1$. The term κ in Equation (7.1) is a normalization factor ensuring that the distribution of the geometric positions is supported on a $q$-dimensional unit hypersphere. In the case of a bivariate response variable $Y$, the stationary geometric quantile distribution is circularly symmetric around the origin, with radial density $r/\sqrt{1-r^2}$ for $r \in (0,1)$ [61].

[Figure 7.9: Cumulative distribution function of predicted daily minimum temperature at a weather station in Michigan, 1990-99, using asynchronous regional climate model data (Observation, MLR, QM, MCR_HET).]

[Figure 7.10: Cumulative distribution function of predicted daily precipitation at a weather station in Michigan, 1990-99, using asynchronous regional climate model data (Observation, MLR, QM, MCR_HET).]

7.5.2 MCR Using Geometric Quantile Mapped Heterogeneous Data

The predictor variables from the asynchronous data can be transformed to have the same geometric distribution characteristics as the given synchronous predictor variables using methods such as geometric quantile mapping (GQM) and covariance alignment. The MCR approach that applies geometric quantile mapping to the asynchronous predictor variables is referred to as $MCR_{GQ}$.
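The stationary-position iteration of Equation (7.1) can be sketched as follows, assuming the marginal-rank preprocessing described above. The O(n²)-per-iteration pairwise computation and the function names are illustrative simplifications.

```python
import numpy as np

def marginal_rank(Y):
    """Column-wise ranks of the (n x q) sample, rescaled to (-1, 1]."""
    n = Y.shape[0]
    ranks = np.argsort(np.argsort(Y, axis=0), axis=0) + 1
    return 2.0 * ranks / n - 1.0

def stationary_positions(Y, k=5, kappa=np.pi / 4):
    """Apply the position transformation of Equation (7.1) k times to every
    point of the sample, returning the P^k positions of all rows."""
    P = marginal_rank(Y)
    n = len(P)
    for _ in range(k):
        diffs = P[:, None, :] - P[None, :, :]      # pairwise differences
        norms = np.linalg.norm(diffs, axis=2)
        np.copyto(norms, 1.0, where=(norms == 0))  # zero diffs contribute 0
        P = (diffs / norms[:, :, None]).sum(axis=1) / (kappa * n)
    return P

rng = np.random.default_rng(3)
Y = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=500)
P = stationary_positions(Y, k=5)
print(np.linalg.norm(P, axis=1).max())  # radii should fall roughly in the unit disk
```

With the bivariate κ = π/4 normalization, the transformed positions of a well-behaved sample concentrate within the unit disk, which is what makes position matching between two distributions meaningful.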
Figure 7.11 compares the CDF of predicted daily minimum temperature at a weather station (Eau Claire) in Michigan, 1990-99, with and without geometric quantile mapping of the asynchronous predictor variables to match the synchronous predictor variables.

[Figure 7.11: Comparing the cumulative distribution functions of the MCR_HET and MCR_GQ outputs for predicted daily minimum temperature at a weather station in Michigan, 1990-99, using asynchronous regional climate model data.]

Mapping the asynchronous predictor variables to match the synchronous predictor variables via geometric quantile mapping, prior to applying MCR, showed a marginal improvement. Similar results were also seen for the CDF of predicted daily precipitation at the same weather station in Michigan, 1990-99, with and without geometric quantile mapping of the asynchronous predictor variables (Figure 7.12).

7.5.3 Projections For The Years 2040-2049

Figures 7.13 and 7.14 show the projected distributions of the climate variables for the years 2040-2049 for the same weather station in Michigan.

[Figure 7.12: Comparing the cumulative distribution functions of the MCR_HET and MCR_GQ outputs for predicted daily precipitation at a weather station in Michigan, 1990-99, using asynchronous regional climate model data.]

7.6 Conclusions

This chapter presented a framework that regularizes the distribution characteristics of a variable to simultaneously improve the accuracy of individual data points and the shape of the distribution of the projections. The effectiveness of the framework in capturing both the shape and the accuracy of observed climate data at various climate stations located in Michigan, USA, was demonstrated using a multivariate linear interpretation, a non-linear (RBF kernel) interpretation, and a quantile-driven interpretation. In addition to consistently reducing the day-to-day error of the projections, the framework was also shown to be flexible enough to capture different shapes of distribution, as shown for the Gaussian and Gamma distributions.

[Figure 7.13: Comparing the cumulative distribution function of projected daily minimum temperature at a weather station in Michigan for the period 2040-2049 to that of the QM model output for the years 1990-99 (QM output on hindcast data; MLR, QM, and MCR_HET output on projected data).]

[Figure 7.14: Comparing the cumulative distribution function of projected daily precipitation at a weather station in Michigan for the period 2040-2049 to that of the QM model output for the years 1990-99 (QM output on hindcast data; MLR, QM, and MCR_HET output on projected data).]

Chapter 8

Multivariate Contour Regression

This chapter presents a framework for multiple output regression that extends the single output regression framework presented in Chapter 7. The framework is motivated by the growing demand for multiple output prediction methods capable of both minimizing residual errors and capturing the joint distribution of the response variables in a realistic and consistent fashion.
The multiple output regression presented in this chapter preserves the relationships among the response variables (including possible non-linear associations) while minimizing the residual errors of prediction, by coupling regression methods with geometric quantile mapping.

8.1 Introduction

Multiple output regression (MOR) is the task of inferring the joint values of multiple response variables from a set of common predictor variables. The response variables are often related, though their true relationships are generally unknown a priori. An example application of multiple output regression is to simultaneously estimate the projected future values of temperature, precipitation, and other climate variables needed for climate change impact, adaptation and vulnerability (CCIAV) assessments. The projected values are used as the driving input variables for phenological and hydrological models to simulate the responses of the ecological system to future climate change scenarios. To ensure that the projected values are realistic, there are certain constraints on the relationships among the response variables that must be preserved; e.g., minimum temperature must not exceed maximum temperature, and liquid precipitation should be zero when the temperature is below freezing.

[Figure 8.1: Scatter plots of observed daily maximum and minimum temperature at a climate station in Michigan, USA, overlaid with the outputs of (a) MOR and (b) BQM.]

While numerous multiple output regression methods have been developed in recent years [25, 112, 12, 95, 69], most of them focus on fitting the conditional mean or preserving the covariance structure of the outputs. Such methods do not adequately capture the full range of variability in the joint output distribution, as illustrated in Figure 8.1(a). The inability of standard regression-based approaches to reproduce the shape of the true distribution of output variables, even for univariate response variables, is well documented [5]. Univariate distribution-driven approaches such as quantile mapping (QM) [108] and statistical asynchronous regression (SAR) [92] have been developed to address this limitation, but the accuracy of these approaches is generally poor, since they are not designed to minimize residual errors. Quantile mapping approaches map a univariate predictor variable x to its corresponding response variable y by transforming the cumulative distribution function (CDF) of x to match that of y. More recently, a bivariate quantile mapping approach (BQM) (see Figure 8.1(b)) has been developed to generate bivariate response values that mimic the joint distribution of the observed response data [61]. However, as will be shown in this chapter, its residual error is significantly worse than that of regression-based methods, because the position and rank correlation between the predictor and response variables remain invariant under QM-based transformation, which in turn hinders its ability to minimize residual errors. Thus, unless the predictor variable has a high rank correlation with the response variable, the residual error upon applying QM-based approaches is likely to be large.
This suggests a possible hybrid approach to improve both the residual errors and the distribution fit: first apply a regression-based method to transform the predictor variables so that their rank correlation with respect to the response variable is high, and then apply quantile mapping to adjust the fit in distribution. However, maximizing the rank correlation of the data points is a necessary but not sufficient condition for improvement in the residuals of QM, unless the response values of the data points are uniformly spaced. Hence the need for position regularization, which prioritizes the prediction accuracy of data points whose positions, when incorrectly estimated, result in high residuals. The term 'position' here refers to the geometric quantile of a data point with respect to a multivariate distribution, which is analogous to the quantile of a data point in the case of a univariate distribution. In this chapter, a position-regularized, multi-output prediction framework called Multi-Output Contour Regression (MCR) is presented. MCR addresses the dual objective of preserving the associations among the multiple output variables as well as minimizing residuals. MCR achieves this dual objective by applying a novel, position-regularized regression method, followed by geometric quantile mapping (GQM) to improve the fit in distribution. The position-regularized regression helps alleviate the limitation associated with the rank-invariant property of QM, which contributes to the high residuals of QM-based approaches. MCR additionally addresses the challenge of ensuring that its prediction of the response variables always abides by the constraints of the actual response data. MCR is also not limited in the number of predictor variables that may be used, nor does it require them to have high correlation with the response variables, unlike quantile mapping. The flexible nature of the framework allows for the incorporation of other loss functions, such as the L1 loss used in quantile regression.

8.2 Preliminaries

Let $X = [\mathbf{x}_1, \ldots, \mathbf{x}_n]^T$ be an $(n \times d)$ data matrix and $Y = [\mathbf{y}_1, \ldots, \mathbf{y}_n]^T$ the corresponding $(n \times q)$ response matrix, such that $\mathbf{x}_i \in \mathbb{R}^d$ and $\mathbf{y}_i \in \mathbb{R}^q$ are column vectors representing the respective values of the predictor and response variables for the $i$th data point. The objective of multi-output regression (MOR) is to learn a target function $h(\mathbf{x}, \Omega)$ that best estimates the multi-output response $\mathbf{y}$, where $\Omega = (\omega_1, \ldots, \omega_q)$ is the parameter set of the target function.

For a univariate random variable $X \in \mathbb{R}$, let $F_X(x)$ be its cumulative distribution function (CDF), i.e., $F_X(x) = P(X \le x)$. The corresponding α-quantile of $X$ is given by $\inf\{x \in \mathbb{R} : F_X(x) \ge \alpha\}$. Intuitively, each quantile indicates the value below which a certain fraction of the data points fall, and thus provides a measure of its position in the data. For example, the median, which is equivalent to the 0.5-quantile, is the central location of the distribution. More generally, the position [84] of a data point $z$ relative to a set of points $Z = (z_1, \ldots, z_m)^T$ is given by

$$p_Z(z) = \frac{1}{m}\sum_{i=1}^{m}\eta(z - z_i), \qquad \eta(w) = \begin{cases}\dfrac{w}{\|w\|} & \text{if } w \neq 0\\ 0 & \text{if } w = 0\end{cases}$$

For univariate data, the position $p_Z(z)$ is equal to $2F_Z(z) - 1$, where $F_Z(z)$ is the cumulative distribution function of $Z$. The multi-dimensional equivalent of the quantile function is the geometric quantile [32].
Distribution correction methods such as quantile mapping are only applicable if one can match the position of a data point in one univariate distribution (say, for $x$) to its corresponding position in another univariate distribution (say, for $y$). This is possible using the preceding definition of position for univariate data, since the values of $p_Z$ are always fixed in the range $[-1, +1]$ irrespective of the values in $Z$. Unfortunately, when extended to multivariate positions, the range of values for $p_Z$ may vary depending on the values in $Z$. To overcome this problem, He et al. [61] introduce the notion of a stationary position, obtained by iteratively applying the following position transformation function until convergence:

$$p_Y^k(z) = \frac{1}{\kappa n}\sum_{i=1}^{n}\frac{p_Y^{k-1}(z) - p_Y^{k-1}(y_i)}{\|p_Y^{k-1}(z) - p_Y^{k-1}(y_i)\|}, \qquad p_Y^1(z) = \frac{1}{\kappa n}\sum_{i=1}^{n}\frac{z - y_i}{\|z - y_i\|} \tag{8.1}$$

Here each component of $y_i$ must first be converted to its marginal rank before applying the position transformation function. The marginal rank is the rank of the data point divided by the largest rank, normalized to the range $[-1, 1]$. The normalization negates the effect of variables whose values span different ranges. Data points with normalized marginal rank close to ±1 correspond to extreme values of the particular variable, while those close to 0 are located near the median of the distribution. In practice, the number of iterations needed to reach a stationary distribution is quite small, typically $k \le 5$ [61]. For univariate data, it can be shown that $P^k$ reaches a stationary distribution at $k = 1$.

The term κ in Equation (8.1) is a normalization factor ensuring that the distribution of the geometric positions is supported on a $q$-dimensional unit hypersphere. In the case of a bivariate response variable $Y$, the stationary geometric quantile distribution is circularly symmetric around the origin, with radial density $r/\sqrt{1-r^2}$ for $r \in (0,1)$ [61]. Therefore,

$$\kappa = \int_0^1 \frac{r^2}{\sqrt{1-r^2}}\,dr = \frac{\pi}{4}$$

In this chapter, the positions of the multivariate data points in $Y$ are denoted $P_Y = [p_Y(y_1), \ldots, p_Y(y_n)]^T$, where $p_Y(y_i) \in [-1,1]^q$. The notation $z_{XY}(y) = p_X^{-1}(p_Y(y))$ is used to represent the point in the domain of $X$ that has the same geometric quantile position as the data point $y$ in $Y$, i.e., $p_X(z_{XY}(y)) = p_Y(y)$. Consequently, $z_{YY}(y_i) = y_i$. Finally, let $Z_{XY} = [z_{XY}(y_1)^T, \ldots, z_{XY}(y_n)^T]^T$ be the geometric quantiles in $X$ that correspond to the data points in $Y$.

8.2.1 Properties of the Geometric Quantiles

Proposition 8.2.1. For $q = 1$, $P^k$ reaches a stationary distribution at $k = 1$.

Proof. Based on Equation (8.1), for $q = 1$, $P_X^k(x)$ computes the rank of the univariate variable $x$, scaled to the range $(-1, 1)$. Since each iteration of $P_X^k(x)$ rescales the marginal rank to the range $(-1, 1)$, the rank is preserved. Hence, for $q = 1$, $P^k$ reaches a stationary distribution at $k = 1$. ♦

Proposition 8.2.2. Multivariate distributions that are movement (translation) transformations of each other have the same $P^k$ distribution.

Proof. The geometric quantile distribution is invariant under movement transformation if, given $X = [x_1, \ldots, x_n]$ and $Y = [y_1, \ldots, y_n]$ with geometric quantile distributions $P_X$ and $P_Y$, we have $P_X = P_Y$. Since $X$ and $Y$ are movement transformations of each other, by definition $\exists \Delta \in \mathbb{R}^q$ s.t. $X = Y + \Delta$. Given $z_Y \in Y$ such that

$$p_Y^1(z_Y) = \frac{1}{\kappa n}\sum_{i=1}^{n}\frac{z_Y - y_i}{\|z_Y - y_i\|}$$

substituting $z_Y = z_X - \Delta$ and $y_i = x_i - \Delta$ gives

$$p_Y^1(z_Y) = \frac{1}{\kappa n}\sum_{i=1}^{n}\frac{(z_X - \Delta) - (x_i - \Delta)}{\|(z_X - \Delta) - (x_i - \Delta)\|} = p_X^1(z_X)$$

$\Rightarrow P_X^k = P_Y^k\ \forall k$. Hence $P^k$ is invariant under movement transformation. ♦
Proposition 8.2.3. Multivariate distributions that are scale transformations of each other have the same $P^k$ distribution.

Proof. The geometric quantile distribution is invariant under scale transformation if, given $X = [x_1, \ldots, x_n]$ and $Y = [y_1, \ldots, y_n]$ with geometric quantile distributions $P_X$ and $P_Y$, we have $P_X = P_Y$. Since $X$ and $Y$ are scale transformations of each other, by definition $\exists \alpha > 0$ s.t. $X = \alpha Y$. Given

$$P_X^1(z_X) = \frac{1}{\kappa n}\sum_{t}\frac{z_X - X(t)}{\|z_X - X(t)\|}$$

substituting $z_X = \alpha z_Y$ and $X(t) = \alpha Y(t)$ gives

$$P_X^1(z_X) = \frac{1}{\kappa n}\sum_{t}\frac{\alpha z_Y - \alpha Y(t)}{\|\alpha z_Y - \alpha Y(t)\|} = \frac{1}{\kappa n}\sum_{t}\frac{\alpha (z_Y - Y(t))}{\alpha\|z_Y - Y(t)\|}$$

$\Rightarrow P_X = P_Y$, which means that $P^k$ is invariant under scale transformation. ♦

8.2.2 Quantile Mapping-Based Approaches

Quantile mapping transforms a univariate predictor variable $X$ to its corresponding response variable $Y$ by adjusting the cumulative distribution function $F_X$ to match that of $F_Y$:

$$QM: \hat{y} = F_Y^{-1}(F_X(x)) \tag{8.2}$$

It can be shown that QM preserves the rank correlation¹ between the variables. For instance, consider the example in Table 8.1, where $y$ is the response variable and $x_1$, $x_2$ are two independent predictor variables. Let QM($x_1$) and QM($x_2$) be the corresponding QM outputs for $x_1$ and $x_2$, respectively. If we sort the vectors in ascending order, it is easy to see that the resulting rank vectors are invariant under QM transformation. As a result, the rank correlation between $x_1$ (or $x_2$) and $y$ is identical to the rank correlation between QM($x_1$) (or QM($x_2$)) and $y$. Furthermore, the empirical CDFs of QM($x_1$) and QM($x_2$) are identical to that of $y$, i.e., $F_Y = F_{QM(x_1)} = F_{QM(x_2)}$.

¹Examples of rank correlation measures include the Kendall τ and Spearman ρ coefficients.

Table 8.1: Quantile Mapping Example

  x1    x2    y     QM(x1)  QM(x2)       x3    x4    y     QM(x3)  QM(x4)
  0.6   0.7   0.2   0.1     0.2          0.7   0.6   0.2   0.2     0.1
  0.8   0.6   0.1   0.3     0.1          0.6   0.7   0.1   0.1     0.2
  0.7   0.9   0.3   0.2     0.4          0.9   0.8   0.3   0.7     0.3
  0.9   0.8   0.4   0.4     0.3          0.8   0.9   0.7   0.3     0.7
              SSR = 0.06    0.02                     SSR = 0.32    0.02

Even though quantile mapping replicates the empirical distribution of $y$ perfectly, QM($x_1$) has a higher residual error than QM($x_2$). This can be explained by the lower rank correlation between $x_1$ and $y$ compared to the rank correlation between $x_2$ and $y$. Note that the inverse relationship between rank correlation and residual error holds only if the values of the response variable are uniformly spaced. For example, if the response value $y$ of the fourth data point changes from 0.4 to 0.7, the residual error of QM($x_2$) increases from 0.02 to 0.32 and becomes larger than the residual error of QM($x_1$), which remains at 0.06. In this case, the high rank correlation of $x_2$ does not translate to a lower residual error when applying quantile mapping. A formal proof of the relationship between rank correlation and residual error for uniformly spaced data is given in the next section.

Since most data sets are non-uniform, maximizing rank correlation is not a sufficient condition to ensure a low residual error. Nevertheless, data points associated with quantiles located in sparse regions (i.e., far from their next closest quantiles) were observed to contribute more to the residual error when incorrectly ranked than data points associated with quantiles located in dense regions. This is demonstrated by the example in Table 8.1, where both $x_3$ and $x_4$ have the same rank correlation with respect to the response variable $y$, yet different SSR. The response values of the first three data points (0.2, 0.1, and 0.3) are closer to each other than that of the last data point (0.7). An incorrect ranking of the fourth data point will lead to a much higher residual error than for the first three data points. Since $x_3$ ranked the fourth data point incorrectly, its residual error is larger than that of $x_4$, even though both have the same rank correlation. This suggests a possible heuristic for improving both rank correlation and residual error: emphasize the data points that would contribute high residual errors to the prediction if ranked incorrectly.
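The Table 8.1 computation can be reproduced with a few lines of NumPy; the rank-based `qm` helper is an illustrative implementation of empirical QM for equal-length samples.

```python
import numpy as np

def qm(x, y):
    """Empirical QM for equal-length samples: the i-th ranked value of x
    is mapped to the i-th ranked value of y (rank-preserving)."""
    out = np.empty_like(y)
    out[np.argsort(x)] = np.sort(y)
    return out

x1 = np.array([0.6, 0.8, 0.7, 0.9])
x2 = np.array([0.7, 0.6, 0.9, 0.8])
y  = np.array([0.2, 0.1, 0.3, 0.4])

for name, x in [("x1", x1), ("x2", x2)]:
    ssr = ((qm(x, y) - y) ** 2).sum()
    print(name, qm(x, y), round(ssr, 2))   # SSR: 0.06 for x1, 0.02 for x2
```

Running the same two lines with the $x_3$, $x_4$ columns and the modified response (0.2, 0.1, 0.3, 0.7) reproduces the SSR values 0.32 and 0.02 from the right half of the table.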
8.2.3 Rank Correlation and Residual Errors of Quantile Mapping

This section presents several properties of the QM approach with respect to the rank correlation and residual error of its output. First, quantile mapping is shown to preserve the rank correlation between the predictor and response variables.

Proposition 8.2.4. Rank correlation is invariant under QM transformation if the values of the predictor and response variables in a data set are unique.

Proof. Consider a data set $D = \{(x_i, y_i)\}_{i=1}^{n}$ containing $n$ points. Let $\hat{y}_i$ be the quantile-mapped value for the data point with predictor variable $x_i$. To prove that rank correlation is invariant under QM transformation, it is sufficient to show that the rank of $x_i$ is identical to the rank of $\hat{y}_i$ after quantile mapping. Without loss of generality, assume the data points in $D$ are sorted in increasing order of their $x$ values. Thus, the rank of data point $x_i$ is $i$ (since the $x$ values are unique). Equation (8.2) can be rewritten as $F_Y(\hat{y}_i) = F_X(x_i)$. Since $F_X(x_i) = i/n$, it follows that $F_Y(\hat{y}_i) = i/n$. Given that the response values are distinct, the rank of $\hat{y}_i$ is also $i$. ♦

Next, the relationship between rank correlation and the residual error of QM is illustrated for data sets with uniformly spaced response values.

Proposition 8.2.5. The residual error of QM is negatively proportional to the rank correlation, given a data set with a uniformly spaced response variable.

Proof. Consider a data set $D = \{(x_i, y_i)\}_{i=1}^{n}$ containing $n$ points. Let $r_i$ be the rank of $x_i$ and $s_i$ the rank of the response value $y_i$. To simplify the discussion, the ranks in $r$ and $s$ are assumed to be unique. Since $y$ is uniformly spaced, it can easily be shown that $y_i = s_i c_1 + c_0$, where $c_0$ and $c_1$ depend only on the minimum and maximum values in $y$. Following Proposition 8.2.4, since QM preserves the rank of the data point $x_i$, $\hat{y}_i = r_i c_1 + c_0$. The Spearman rank correlation between $x$ and $y$ can be written as

$$\rho = \frac{\sum_i (r_i - \bar{r})(s_i - \bar{s})}{\sqrt{\sum_i (r_i - \bar{r})^2 \sum_i (s_i - \bar{s})^2}}$$

which can be further simplified to $\rho = (\sum_i r_i s_i + c_3)/c_2$, where $c_2$ and $c_3$ are constants for a fixed $n$. Since the QM output $\hat{y}$ is a reordering of $y$, its residual error can be computed as $SSR = \sum_i (y_i - \hat{y}_i)^2 = 2(\sum_i y_i^2 - \sum_i y_i \hat{y}_i)$. Furthermore, $\sum_i y_i \hat{y}_i = c_1^2 \sum_i r_i s_i + c_4$, where $c_4$ depends on $c_0$, $c_1$, and $n$. Therefore, $SSR = 2\left(\sum_i y_i^2 - c_1^2(c_2\rho - c_3) - c_4\right)$. Since $c_2$ is a non-negative constant, $c_1^2 c_2$ will always be non-negative. Hence, SSR is negatively proportional to ρ for a given data set with uniformly spaced $y$. ♦
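A small numerical check of Proposition 8.2.5, assuming a uniformly spaced response and a rank-based empirical QM; the noise levels and helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
y = np.linspace(0.0, 1.0, n)            # uniformly spaced response

def qm(x, y):
    """Rank-preserving empirical QM for equal-length samples."""
    out = np.empty_like(y)
    out[np.argsort(x)] = np.sort(y)
    return out

def spearman(a, b):
    """Spearman rho as the Pearson correlation of the ranks."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Predictors with progressively weaker rank correlation to y
for noise in [0.0, 0.5, 2.0]:
    x = y + rng.normal(0, noise, n)
    ssr = ((qm(x, y) - y) ** 2).sum()
    print(f"rho = {spearman(x, y):+.3f}   SSR = {ssr:.4f}")
```

As the rank correlation ρ drops, the SSR of the QM output rises linearly, as the proposition predicts for uniformly spaced responses.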
Although Proposition 8.2.5 applies only to uniformly spaced response variables, there are other situations where the residual error of the QM output can be reduced if its rank correlation increases, as shown in the next proposition.

Proposition 8.2.6. Correcting the rank of $x_i$ to match the rank of its corresponding response variable $y_i$ maintains, if not improves, the residual error of the QM output, as long as it does not deteriorate the ranks of the other data points in $x$.

Proof. Given the response variable $y$, let $\hat{y}$ be the QM output of the predictor variable $x$. The sum squared residual of QM($x$) is $SSR_x = \sum_i (y_i - \hat{y}_i)^2$, where the residual error of the $i$th data point is $\varepsilon_{x_i} = y_i - \hat{y}_i = F_Y^{-1}(F_Y(y_i)) - F_Y^{-1}(F_Y(\hat{y}_i))$. Following Equation (8.2), we have $F_Y(\hat{y}_i) = F_X(x_i)$. Consequently, the residual error can be rewritten as $\varepsilon_{x_i} = F_Y^{-1}(F_Y(y_i)) - F_Y^{-1}(F_X(x_i))$. Assuming the residual error of the QM output for $x$ is non-zero, there must exist a data point $x_j$ such that $F_Y(y_j) \neq F_X(x_j)$. Next, consider an improved vector $x'$, which is a reordering of the values in $x$ subject to the following two conditions: (1) $x'_i = x_i$ if the $i$th data point of $x$ is ranked correctly, i.e., $x'_i = x_i$ if $F_X(x_i) = F_Y(y_i)$; (2) if the $i$th data point of $x$ is not ranked correctly, then either $x'_i = x_i$ or $F_X(x'_i) = F_Y(y_i)$. Note that there must be at least one data point ranked incorrectly in $x$ but correctly in $x'$. The second condition also implies that any data point that has been reordered must be ranked correctly. Thus, for all data points in condition (1), $\varepsilon^2_{x'_i} = \varepsilon^2_{x_i}$, since their ranks remain unchanged. On the other hand, for all data points in condition (2), $\varepsilon^2_{x'_i} \le \varepsilon^2_{x_i}$, since $F_X(x'_i)$ is either the same as $F_X(x_i)$ or equal to $F_Y(y_i)$. Therefore, $\forall i$, $\varepsilon^2_{x'_i} \le \varepsilon^2_{x_i}$, and hence $SSR_{x'} \le SSR_x$. Thus, by correcting the ranks of the data points that do not have the same rank as their corresponding response values, while ensuring the ranks of all other data points remain the same, the SSR of the QM output can be improved. ♦

Even though the above proposition suggests that one can maximize rank correlation and improve residuals simultaneously, this requires a flexible target function that allows all possible orderings of $x$. For linear functions, it might not be possible to produce a reordering of the values in $x$ without affecting the ranks of other data points. Thus, an alternate scheme is proposed that focuses on correcting the ranks of data points associated with high residuals, as explained in the next section.

8.3 Multi-Output Contour Regression Framework (MCR)

Since QM and regression-based approaches have their own distinct advantages, which have been successfully exploited in a hybrid manner by approaches such as CR, we propose a framework that extends the intuition behind hybrid approaches, exploiting the unique advantages of both QM and regression, to the multi-output setting. The approach uses a position-regularized regression function $\hat{h}(\mathbf{x}, \Omega)$ that prioritizes matching the positions of the output to the positions of the observed response data. This step is followed by correcting the geometric quantiles of the output from the previous step to match the observed response data, using the intuition of QM. This hybrid approach removes QM's limitation on the number of predictor variables that may be used, as well as the requirement that the predictor variables be highly correlated with the response variable. The hybrid approach is further shown to be flexible enough to work in a multi-output setting, capturing the multi-output associations that are often ignored.

To prioritize improving the positions of the output, the proposed multi-output contour regression (MCR) framework learns the regression function $\hat{h}(\mathbf{x}, \Omega)$, which consists of two components. The first component is similar to a conventional regression loss function, where the data matrix is regressed on the observed response variable; this component emphasizes minimizing the residual error of the regression function. The second component of $\hat{h}(\mathbf{x}, \Omega)$ is the position regularizer, which helps improve the rank correlation of $h(\mathbf{x}, \Omega)$ and $\mathbf{y}$. At first glance, one would expect the second term to regress on the positions of the data points. Instead of regressing on the positions directly, we regress on the geometric quantiles of the data points, obtained by inverse-mapping their positions to the output response space. This is done so that the position regularizer assigns a larger penalty to those data points whose positions, when incorrectly estimated, result in larger minimum residual errors.
The first component is similar to conventional regression loss function where the data matrix is made to regress with respect to the observed response variable. This component emphasizes minimizing residual error of the regression function. ˆ The second component of h(x, Ω) is the position regularizer that helps improve rank correlation of h(x, Ω) and y. At a first glance, one would expect the second term to be regressing on the position of the data points. Instead of regressing on the position of the data points, we regress on the geometric quantiles of the data points obtained by inverse mapping their positions to the output response space. This is done so that the position regularizer assigns a larger penalty to those data points whose position when incorrectly estimated, results in a larger minimum residual errors. To accomplish this, the data matrix 159 is made to regress on zY Y , where, ˆ ˆY Y (y) = p−k (pk (y)) zˆ Y ˆ Y (8.3) ˆ is the geometric quantile value in the h(x, Ω) regression output space that corresponds to the position of the observed response variable y. The regression function of MCR is shown in Equation (8.4), n min Ω i=1 (γL(h(xi , Ω), yi ) + (1 − γ)L(h(xi , Ω), zY Y )) ˆ (8.4) where 0 ≤ γ ≤ 1 is a user defined parameter that may be used for either prioritizing fidelity of regression accuracy or its position correlation. L can be any generic loss function such as ordinary least square (that multiple linear regression adopts), or quantile mapping (if certain quantiles are to be prioritized overs others, such as in the case of a heavy tail distribution).For instance, when the loss function L is ordinary least square, Equation 8.4 takes the form q n min Ω j=1 i=1 (γ(xT Ωj − yi )2 + (1 − γ)(xT Ωj − zY Y )2 ) ˆ i i which corresponds to the following matrix form ˆ Ω = arg min tr(γ(XΩ − Y)T (XΩ − Y) + (XΩ − ZY Y )T (XΩ − ZY Y )) ˆ ˆ Ω ˆ The regression parameters Ω is learnt in an iterative manner. At each iteration, the regression output space from the previous iteration is used to compute zY Y in the second ˆ 160 ˆ component of the regression function h(x, Ω). For the very first iteration, the regression output space is that of regular multiple linear regression. ˆ Once h(x, Ω) is learnt, the MCR prediction for a given data point x having corresponding ˆ ˆ observed multi-output response y and a regression estimation of y = h(x, Ω) is obtained by inverse geometrically quantile mapping pk (ˆ ) to its corresponding value in the observed ˆ y Y response variable space, to give the MCR prediction ˆY Y , z ˆ ˆ M CR : ˆY Y = p−k (pk (h(x, Ω))) z ˆ ˆ Y Y (8.5) ˆ where, p−k (pk (ˆ)) maps the stationary geometric quantile position of h(x, Ω) to its ˆ y Y Y corresponding data point in Y. To summarize, multi-output contour regression (MCR) performs multi-output regression of the predictor variables such that the position of its output is highly correlated with respect to position of the observed response variable, thereby reducing position errors of the multi-output regression results. This multivariate regression output is then mapped to its corresponding geometric quantile counterpart in the observed multi-output response space using geometric quantiles. The rationale behind using the regularized regression results, prior to performing multi-output geometric quantile mapping in MCR, is to improve on SSR by increasing the correlation among the multivariate ranks of the predictors and response variable. 
8.3.1 Estimating Inverse Geometric Quantile Position

The value $\hat{z}(p)$ that corresponds to a given geometric quantile position $p$ in a multivariate distribution $F_Y$ is empirically computed by minimizing the generalized multivariate quantile loss function [32]:

$$\hat{z}(p) = \arg\min_{z \in \mathbb{R}^q} \sum_{i=1}^{n}\left(\|y_i - z\| + \langle p, y_i - z\rangle\right) \tag{8.6}$$

where $p \in \mathbb{R}^q$ and $\langle\cdot,\cdot\rangle$ denotes the Euclidean inner product. As long as the values of $y_i$ do not all fall on the same line, $\hat{z}(p)$ is unique for a given $p$ when $q \ge 2$ [32]. Algorithms such as the Newton-Raphson method can be used to minimize the loss function for the geometric quantile $\hat{z}(p)$ using the update $\hat{z} \leftarrow \hat{z} - \delta^{-1}\bar{\delta}$, where

$$\bar{\delta} = (n\kappa)p - \sum_{i=1}^{n}\frac{z - y_i}{\|z - y_i\|}, \qquad \delta = \sum_{i=1}^{n}\|z - y_i\|^{-1}\left(I_q - \|z - y_i\|^{-2}(z - y_i)(z - y_i)^T\right)$$

For a univariate distribution $F_Y$, it can easily be shown that Equation (8.6) reduces to the same loss function used to identify the αth regression quantile in a linear regression setup for quantile regression [77], where $0 < \alpha < 1$ and $p = 2\alpha - 1$; i.e., $\sum_{i=1}^n (|y_i - z| + p(y_i - z))$ is minimized by the $z$ that corresponds to the αth quantile of $Y$.

8.3.2 Alternative Approximation-Based Approach for MCR

If one can assume that, given the position $p$ of a test data point $y_{test}$ belonging to the distribution $F_Y$, there exists $y_i \in Y$ such that $y_{test} \approx y_i$, then the search space for $\hat{z} = y_{test}$ can be limited to the data points in $Y$. Given that the search space for $\hat{z}$ is finite, it will not always be possible to find the exact minimizer using the update direction $\bar{\delta}$, as it returns a vector. Alternatively, the following range-bound approximation, equivalent to Equation (8.6), can be used to find the best solution [61, 32]:

$$\arg\min_{z} \sum_{i=1}^{n}\left\{\|y_i - z\| + \frac{1}{\kappa}(y_i - z)^T p\right\} \tag{8.7}$$

where κ is the normalization factor from Equation (8.1). As shown in the experiment section, there was only a marginal deterioration in the solution obtained from the above approximation, owing to the sufficient number of training data points. Another approximation approach, with even looser bounds than Equation (8.7) but O(n) time complexity, is the following Euclidean approximation:

$$\hat{z} = \arg\min_{y_i}\left((p - p_Y(y_i))(p - p_Y(y_i))^T\right) \tag{8.8}$$

[Figure 8.2: Relative computation time of the various approximation-based approaches (Newton-Raphson, R-limited approximation, Euclidean approximation) for estimating inverse geometric quantile positions, as a function of (a) training size and (b) test size.]

The R-limited approximation approach (Equation 8.7) as well as the Euclidean approximation approach (Equation 8.8) show considerable improvement in computation time across varying training sizes (Figure 8.2a) and test sizes (Figure 8.2b), with minimal deterioration in the accuracy of the estimated inverse geometric quantile positions.
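Sketches of the two approximations follow, assuming stationary positions have been precomputed for the training responses; the function names are illustrative.

```python
import numpy as np

def inverse_position_euclidean(p, Y, positions):
    """Euclidean approximation of Equation (8.8): return the training
    response whose (precomputed) stationary position is closest to p.
    `positions` holds p_Y(y_i) for each row of Y; O(n) per query."""
    d2 = ((positions - p) ** 2).sum(axis=1)
    return Y[np.argmin(d2)]

def inverse_position_r_limited(p, Y, kappa=np.pi / 4):
    """Approximation of Equation (8.7) with the search space limited to the
    training responses: score each candidate z = Y[j] by the generalized
    quantile loss over all y_i, and keep the minimizer; O(n^2) per query."""
    best, best_val = None, np.inf
    for z in Y:
        diffs = Y - z
        val = np.linalg.norm(diffs, axis=1).sum() + (diffs @ p).sum() / kappa
        if val < best_val:
            best, best_val = z, val
    return best
```

The Euclidean variant is the one whose O(n) cost makes it attractive when the inverse mapping must be applied once per test point at every MCR iteration.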
8.4 Variations of MCR

As mentioned above, when the loss function $L$ is ordinary least squares, Equation (8.4) takes the form

$$\min_\Omega \sum_{j=1}^{q}\sum_{i=1}^{n}\left(\gamma(\mathbf{x}_i^T\omega_j - y_{ij})^2 + (1 - \gamma)(\mathbf{x}_i^T\omega_j - z_{ij})^2\right)$$

which corresponds to the matrix form

$$\hat{\Omega} = \arg\min_\Omega \mathrm{tr}\left(\gamma(X\Omega - Y)^T(X\Omega - Y) + (1 - \gamma)(X\Omega - Z_{\hat{Y}Y})^T(X\Omega - Z_{\hat{Y}Y})\right)$$

The following subsections demonstrate the use of alternative loss functions in the MCR framework.

8.4.1 Quantile Multi-Output Contour Regression ($MCR_Q$)

An alternative to the ordinary least squares loss function, well suited when the response variable has a non-uniform distribution such as a heavy tail, is to use the quantile regression (QR) loss function for $L$, so as to prioritize the rank correlation of the ranks that have the highest variance and, correspondingly, the highest impact on residual errors during quantile mapping if incorrectly ranked [77]. Additionally, $MCR_Q$ may also be suited to estimating the extremes of $Y$ by prioritizing the correct ranking of extreme values. The objective function of QR can be adopted by MCR to obtain the following loss function:

$$\sum_{i=1}^{n}\left(\rho_{\tau_1}(y_i - \mathbf{x}_i^T\beta) + \rho_{\tau_2}(z_i - \mathbf{x}_i^T\beta)\right), \qquad \rho_\tau(u) = \begin{cases}\tau u & u > 0\\ (\tau - 1)u & u \le 0\end{cases}$$

8.4.2 Non-linear Multi-Output Contour Regression ($MCR_{NL}$)

Unlike the linear interpretations of MCR above, $MCR_{NL}$ takes a non-linear approach. By mapping the predictor variables $X$ to a higher-dimensional feature space $F$ via $\phi: X \in \mathbb{R}^d \to F \subseteq \mathbb{R}^N$, where $N \gg d$, one can transform the regularized least squares regression to the feature space $F$ using the kernel $K$. Similarly, the predictor variables of MCR can be mapped to a higher-dimensional feature space $F$ by using the ridge counterpart of the MCR loss function:

$$\beta = (\phi(X)^T\phi(X) + \lambda I)^{-1}\left(\gamma\phi(X)^Ty + (1-\gamma)\phi(X)^Tz\right)$$
$$\Rightarrow \beta = \lambda^{-1}\phi(X)^T\left(\gamma y + (1-\gamma)z - \phi(X)\beta\right) = \phi(X)^T\alpha$$
$$\Rightarrow \alpha = (G + \lambda I)^{-1}\left(\gamma y + (1-\gamma)z\right)$$

where $G = \phi(X)\phi(X)^T$ and $G_{ij} = \langle\phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\rangle = K(\mathbf{x}_i, \mathbf{x}_j)$.

8.5 Experimental Results

The objective of the experiments was to evaluate the ability of MCR to replicate the associations among multiple climate response variables while minimizing the sum squared residuals. All the algorithms were run using climate data obtained at fourteen weather stations in Michigan, USA. The response variables used were the maximum temperature, minimum temperature, and total precipitation for each day, spanning twenty years. The predictor variables used in this study are simulated climate data obtained from regional climate models (RCM) that best correspond to the observed response variables at each of the fourteen weather stations. Three different RCM data sets for each of the climate stations were obtained from the North American Regional Climate Change Assessment Program (NARCCAP) [2]. The three RCMs used are the Canadian Regional Climate Model (CRCM), the Weather Research and Forecasting Model (WRFG), and the Regional Climate Model Version 3 (RCM3). For the purpose of the experiments, there were a total of 126 data sets with univariate response variables, 126 data sets with bivariate responses, and 42 data sets with trivariate responses.

8.5.1 Experimental Setup

Twenty years of predictor and response data, spanning 1980-1999, were split into two parts for training and testing. To evaluate the relative skill in preserving associations among the multi-output responses, popular regression and quantile mapping approaches such as MLR, ridge regression (Ridge), QM, EDCDFm, MOR, CR, and BQM, as well as ad hoc approaches that sequentially combine regression and quantile mapping, were used as baselines. Examples of the ad hoc baselines are MOR combined with BQM (RBQM), and MLR combined with QM (RQM). γ was set to 0.5 for all experiments. For the CR- and MCR-based experiments, the maximum number of iterations was set to ten.
After discarding missing values, each experiment run, for every station and across all the data sets, had a minimum of one thousand training and test data points. All the results reported in the following section are on test data (out-of-sample results). Kendall $\tau$ and Spearman $\rho$ were the two rank correlation metrics used to evaluate univariate rank correlation; where their results were very similar, only one of the two is reported. Root mean square error (RMSE) was used to compare the performance of the various approaches in terms of residual errors. Two-dimensional and three-dimensional scatter plots were used to visualize the relative skill of the various approaches in preserving the associations among the multi-output responses. A sketch of the univariate metric computations follows.
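The three univariate metrics can be computed with standard routines, as in the minimal sketch below (SciPy's kendalltau and spearmanr; the function name and arrays are placeholders, not code from the experiments):

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def evaluate(y_obs, y_pred):
    """Kendall tau, Spearman rho, and RMSE for a single response variable."""
    tau, _ = kendalltau(y_obs, y_pred)
    rho, _ = spearmanr(y_obs, y_pred)
    rmse = float(np.sqrt(np.mean((np.asarray(y_obs) - np.asarray(y_pred)) ** 2)))
    return {"kendall_tau": tau, "spearman_rho": rho, "rmse": rmse}
```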
8.5.2 Results

8.5.2.1 Univariate MCR

The rank correlations of the various response variables were computed in a single-output MCR setting, and it was found that across all the data sets, stations, and response variables (i.e., 126 data sets), MCR consistently improved the rank correlation under both rank correlation metrics. The 126 individual data sets with univariate response data were grouped into nine larger data sets, each grouping the data sets that shared the same response variable as well as the same RCM source for the predictor variables. Figure 8.3 is a box plot representing the percentage of stations in each of the nine data sets where the rank correlation regularizer used in Equation (8.4) improved rank correlation and reduced residuals when compared to the baseline approaches.

[Figure 8.3: Box plots of the percentage of stations where MCR showed improvement over the single-output baselines (MLR, Ridge, QM, EDCDFm, CR), in terms of (a) Kendall τ rank correlation and (b) RMSE, across all RCMs and variables.]

The box plot in Figure 8.4 shows that, in spite of MCR's improvement across the majority of stations in terms of both τ and RMSE against both the regression-based and the quantile-mapping-based approaches, the magnitude of the improvement over the regression-based approaches was not significant. However, the rank correlation regularizer showed a significant improvement in RMSE at each station when compared to the corresponding quantile-mapping-based approaches.

[Figure 8.4: Box plots of the magnitude of MCR's improvement over the baseline approaches, in terms of (a) Kendall τ rank correlation and (b) RMSE, across all RCMs and variables.]

8.5.2.2 Bivariate MCR

Bivariate models for all combinations of bivariate response variables were evaluated. As shown in Figure 8.5, MCR performed best at replicating the bivariate associations while minimizing SSR, although BQM performed as well in terms of replicating the bivariate associations alone.

[Figure 8.5: Scatter plots portraying the fidelity with which the forecasts of the various approaches (MLR, Ridge, QM, CR, MOR, BQM, RBQM, MCR) replicate the observed associations among the bivariate temperature response variables.]

Regression-based approaches (both SOR and MOR) fared poorly in preserving the associations in the two-dimensional space, while single-output quantile-mapping-based approaches were very sensitive to the correlation of the predictor variables with the response, resulting in poor bivariate associations in spite of replicating the marginal distributions of the individual responses very well. In terms of residuals, MCR had considerably lower residuals than the various quantile-mapping baselines, as shown in Table 8.2. As expected, however, MCR showed a marginal increase in residuals when compared to the respective SOR- and MOR-based approaches. The corresponding rank correlation results are given in Table 8.3.

Table 8.2: Performance (RMSE) of bivariate MCR over baseline approaches

            % of stations where MCR        Avg. improvement across
            outperformed the baseline      stations over the baseline
Data set    MOR     QM      BQM            MOR      QM      BQM
WRFG-1      29      100     100            -0.06    0.18    0.17
WRFG-2      7       100     100            -0.08    0.16    0.16
WRFG-3      0       100     100            -0.07    0.31    0.30
CRCM-1      93      100     100             0.06    0.25    0.25
CRCM-2      71      100     100             0.03    0.23    0.23
CRCM-3      7       100     100            -0.02    0.35    0.34
RCM3-1      43      100     100            -0.02    0.20    0.20
RCM3-2      36      100     100            -0.03    0.19    0.18
RCM3-3      0       100     100            -0.07    0.31    0.30

Table 8.3: Performance (Kendall τ) of bivariate MCR over baseline approaches

            % of stations where MCR        Avg. improvement across
            outperformed the baseline      stations over the baseline
Data set    MOR     QM      BQM            MOR      QM      BQM
WRFG-1      64      100     100             0.03    0.40    0.41
WRFG-2      79      100     100             0.04    0.38    0.39
WRFG-3      0       100     100            -0.01    0.75    0.67
CRCM-1      100     100     100             0.13    0.52    0.53
CRCM-2      100     100     100             0.12    0.49    0.52
CRCM-3      14      100     100            -0.01    0.78    0.73
RCM3-1      79      100     100             0.06    0.46    0.46
RCM3-2      79      100     100             0.06    0.47    0.45
RCM3-3      0       100     100            -0.01    0.81    0.78

8.5.2.3 Trivariate MCR

The performance in modeling the associations among three response variables was also evaluated and is shown in Figure 8.6. The performance was compared against single-output and multiple-output models; a trivariate extension of the bivariate BQM approach was used as an additional baseline.

[Figure 8.6: Three-dimensional scatter plots of the observed associations among maximum temperature, minimum temperature, and precipitation, together with the respective forecasts made by the various single-output and multiple-output approaches (MLR, QM, CR, MOR, BQM, RBQM, MCR).]

Along with MCR, the trivariate extension of BQM fared best in replicating the observed associations among the three variables when compared to the baseline approaches. Additionally, MCR improved upon its BQM counterpart in terms of residual reduction: MCR produced a lower RMSE for all stations across all the trivariate data sets, with an average RMSE reduction in excess of 10%. The average improvement across the three variables in terms of rank correlation τ was found to be 0.41.

8.6 Conclusions

This chapter presented a multi-output regression framework that extends the single-output regression framework of Chapter 7. The multi-output framework preserves the general association patterns among multiple response variables while minimizing the overall residual errors, by coupling regression with geometric quantile mapping.
The chapter demonstrated the effectiveness of the framework in significantly reducing residuals while preserving the joint distribution of the multi-output variables, relative to the baseline approaches, at all the climate stations evaluated.

Chapter 9 Conclusions and Future Work

In this thesis, a number of multivariate frameworks were presented that improve upon existing regression-based approaches used for generating projections, by integrating multiple objectives pertaining to the unique characteristics of the response variable as well as the expectations of a long-term projection. The four primary multivariate frameworks, along with their logical extensions, address the following four challenges. The first framework addresses the challenge of modeling response variables with irregular distribution characteristics, in particular zero-inflated response variables. The second framework addresses the challenge of modeling the extremes of a distribution, by prioritizing the conditional quantiles associated with extreme values. The third framework addresses the challenge of building a regression model that preserves the distribution characteristics of the response variable, so as to provide an unbiased projection across all quantiles of the distribution. The fourth framework extends the intuition behind this single-output distribution-preserving framework to a multi-output setting, such that not only is each projection unbiased, but the relationships among the multiple outputs are also maintained.

Given that most of the evaluation of the frameworks presented in this thesis assumed a linear relationship between the predictor and response variables, a detailed evaluation of their performance under non-linear relationships needs to be explored as well. In the frameworks presented in Chapter 3 and Chapter 4, the Pearson correlation coefficient was chosen as the default similarity measure; as future work, the impact of the choice of kernel or similarity measure needs to be evaluated. Exploration of non-linear approaches to further improve the precision and recall of zero- and non-zero-valued data points would provide new insights. Given the numerous available data sources, there is also extensive scope for further exploiting heterogeneous data sets in modeling zero-inflated data.

[Figure 9.1: Relative likelihood of identifying days with large precipitation residuals, for 5%, 10%, and 20% anomaly thresholds, as a function of the top X% of predictor data points.]

The main challenge in identifying erroneous values within the training data set is differentiating them from valid anomalies. Moreover, a model that cleans spurious data during model building should be able to do the same for the out-of-sample data to which it is later applied. Discarding data points due to a faulty value in one of the predictor variables may result in a very small training data set. This is all the more the case when the occurrences of errors across predictor variables are independent of one another and there is a large number of predictor variables, even if the percentage of erroneous values for any individual predictor variable is relatively small. In many practical applications, given the scarcity of data available for training, discarding large amounts of data may be unfeasible.
The other drawback of dropping values that may be erroneous is that the model's response for similar data points in the test set may be extremely poor, on account of the model not having been trained on such data points.

[Figure 9.2: Relative likelihood of a high residual for each response variable (precipitation, maximum temperature, minimum temperature) as a function of the number of erroneous predictor variables (top 5%).]

A possible approach for future work is therefore to build regression models in the presence of uncertain data: identify data points that have a high likelihood of being erroneous, based on their relationships with the corresponding variables, and assign each data point a weight that is a function of this uncertainty, similar to weighted regression, where the weights are the inverse of the variance observed for the respective response variable. Unlike weighted regression, however, the weights chosen here would be a function of the fidelity of the data point's values (a sketch follows below).

As shown in Figure 9.1, it was observed that data points whose predictor values appeared anomalous with respect to the rest of their corresponding predictor variables were far more likely to have large residuals for the response variable. In the case of modeling precipitation using multiple linear regression, data points in the top 5% of those likely to have an erroneous value for the precipitation predictor variable were three times as likely to have large residuals for the response variable. Figure 9.2 shows that the relative likelihood of a data point having a high residual increases as the number of erroneous predictor variables for that data point increases.
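Building on the idea above, the following is a minimal sketch of fidelity-weighted regression. It assumes a fidelity score per data point (e.g., an estimate of the probability that its predictor values are not erroneous, derived from an anomaly analysis such as the one above) is already available; estimating that score is the open problem, and the function shown is hypothetical.

```python
import numpy as np

def fidelity_weighted_fit(X, y, fidelity, lam=1e-6):
    """Weighted least squares in which each weight encodes estimated data
    fidelity rather than the inverse response variance used in classical
    weighted regression: beta = (X'WX + lam*I)^{-1} X'Wy."""
    W = np.diag(np.asarray(fidelity, dtype=float))
    d = X.shape[1]
    return np.linalg.solve(X.T @ W @ X + lam * np.eye(d), X.T @ W @ y)
```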
BIBLIOGRAPHY

[1] Canadian Climate Change Scenarios Network, Environment Canada. http://www.ccsn.ca/.
[2] North American Regional Climate Change Assessment Program. http://www.narccap.ucar.edu/.
[3] Z. Abraham and P.-N. Tan. A semi-supervised framework for simultaneous classification and regression of zero-inflated time series data with application to precipitation prediction. In Data Mining Workshops (ICDMW '09), IEEE International Conference on, pages 644–649. IEEE, 2009.
[4] Z. Abraham and P.-N. Tan. An integrated framework for simultaneous classification and regression of time-series data. In SIAM International Conference on Data Mining (SDM), pages 653–664, 2010.
[5] Z. Abraham, P.-N. Tan, P. Perdinan, J. Winkler, S. Zhong, and M. Liszewska. Distribution regularized regression framework for climate modeling. In SIAM International Conference on Data Mining (SDM), 2013.
[6] Z. Abraham, P.-N. Tan, P. Perdinan, J. Winkler, S. Zhong, and M. Liszewska. Position preserving multi-output prediction. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2013.
[7] Z. Abraham and F. Xin. Extreme value prediction for zero-inflated data. In Advances in Knowledge Discovery and Data Mining, pages 318–329. Springer, 2012.
[8] Z. Abraham, F. Xin, and P.-N. Tan. Smoothed quantile regression for statistical downscaling of extreme events in climate modeling. pages 92–106, 2011.
[9] D. E. Akyuz, M. Bayazit, and B. Onoz. Markov chain models for hydrological drought characteristics. Journal of Hydrometeorology, 13:298–309, 2012.
[10] M. A. Alvarez and N. D. Lawrence. Sparse convolved multiple output Gaussian processes. arXiv preprint arXiv:0911.5107, 2009.
[11] S. Ancelet, M.-P. Etienne, H. Benoît, and E. Parent. Modelling spatial zero-inflated continuous data with an exponentially compound Poisson process. Environmental and Ecological Statistics, 17(3):347–376, 2010.
[12] K. Balasubramanian and G. Lebanon. The landmark selection method for multiple output prediction. arXiv preprint arXiv:1206.6479, 2012.
[13] S. C. Barry and A. H. Welsh. Generalized additive modelling and zero inflated count data. Ecological Modelling, 157(2):179–188, 2002.
[14] N. Bernier, K. Thompson, J. Ou, and H. Ritchie. Mapping the return periods of extreme sea levels: allowing for short sea level records, seasonality, and climate change. Global and Planetary Change, 57(1):139–150, 2007.
[15] C. M. Bishop and N. M. Nasrabadi. Pattern Recognition and Machine Learning, volume 1. Springer, New York, 2006.
[16] J. Bjørnar Bremnes. Probabilistic forecasts of precipitation in terms of quantiles using NWP model output. Monthly Weather Review, 132(1):338–347, 2004.
[17] M. B. Blaschko and C. H. Lampert. Learning to localize objects with structured output regression. In Computer Vision - ECCV 2008, pages 2–15. Springer, 2008.
[18] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. 2001.
[19] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM, 1998.
[20] D. Böhning, E. Dietz, and P. Schlattmann. Zero-inflated count models and their applications in public health and social science. Applications of Latent Trait and Latent Class Models in the Social Sciences. Waxman Publishing Co, 1997.
[21] G. Bontempi. Long term time series prediction with multi-input multi-output local learning. In Proceedings of the 2nd European Symposium on Time Series Prediction (TSP), ESTSP08, pages 145–154, 2008.
[22] M. Booij. Extreme daily precipitation in western Europe with climate change at appropriate spatial scales. International Journal of Climatology, 22(1):69–85, 2002.
[23] G. E. Box and D. A. Pierce. Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. Journal of the American Statistical Association, 65(332):1509–1526, 1970.
[24] U. Brefeld, T. Gärtner, T. Scheffer, and S. Wrobel. Efficient co-regularised least squares regression. In Proceedings of the 23rd International Conference on Machine Learning, pages 137–144. ACM, 2006.
[25] L. Breiman and J. H. Friedman. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(1):3–54, 1997.
[26] A. Buja, B. Logan, J. Reeds, and L. Shepp. Inequalities and positive-definite functions arising from a problem in multidimensional scaling. The Annals of Statistics, pages 406–438, 1994.
[27] R. H. Byrd, J. Nocedal, and R. B. Schnabel. Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming, 63(1-3):129–156, 1994.
[28] K. N. Cahill, D. B. Lobell, C. B. Field, C. Bonfils, and K. Hayhoe. Modeling climate and climate change impacts on winegrape yields in California. American Journal of Enology and Viticulture, 58(3):414A–414A, 2007.
[29] A. Ceglar and L. Kajfež-Bogataj. Simulation of maize yield in current and changed climatic conditions: addressing modelling uncertainties and the importance of bias correction in climate model simulations. European Journal of Agronomy, 37(1):83–95, 2012.
[30] M.-C. Chan, C.-C. Wong, and C.-C. Lam. Financial time series forecasting by neural network using conjugate gradient learning algorithm and multiple linear regression weight initialization. In Computing in Economics and Finance, volume 61, 2000.
[31] S. P. Charles, B. C. Bates, I. N. Smith, and J. P. Hughes. Statistical downscaling of daily precipitation from observed and modelled atmospheric fields. Hydrological Processes, 18(8):1373–1394, 2004.
[32] P. Chaudhuri. On a geometric notion of quantiles for multivariate data. Journal of the American Statistical Association, 91(434):862–872, 1996.
[33] H. Cheng and P.-N. Tan. Semi-supervised learning with data calibration for long-term time series forecasting. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–141. ACM, 2008.
[34] R. T. Clarke. Estimating time trends in Gumbel-distributed data by means of generalized linear models. Water Resources Research, 38(7):1111, 2002.
[35] R. T. Clarke. Estimating trends in data from the Weibull and a generalized extreme value distribution. Water Resources Research, 38(6):25-1, 2002.
[36] I. Cohen, F. G. Cozman, N. Sebe, M. C. Cirelo, and T. S. Huang. Semisupervised learning of classifiers: Theory, algorithms, and their application to human-computer interaction. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(12):1553–1566, 2004.
[37] D. Cooley. Extreme value analysis and the study of climate change. Climatic Change, 97(1-2):77–83, 2009.
[38] D. Cooley, D. Nychka, and P. Naveau. Bayesian spatial modeling of extreme precipitation return levels. Journal of the American Statistical Association, 102(479):824–840, 2007.
[39] C. Cortes and M. Mohri. On transductive regression. Advances in Neural Information Processing Systems, 19:305, 2007.
[40] F. G. Cozman, I. Cohen, and M. Cirelo. Unlabeled data can degrade classification performance of generative classifiers. In FLAIRS Conference, pages 327–331, 2002.
[41] F. G. Cozman, I. Cohen, M. C. Cirelo, et al. Semi-supervised learning of mixture models. In ICML, pages 99–106, 2003.
[42] L. J. D. Erdman and A. Sinko. Zero-inflated Poisson and zero-inflated negative binomial models using the COUNTREG procedure. In SAS Global Forum, pages 1–11, 2008.
[43] C. Dorland, R. S. Tol, and J. P. Palutikof. Vulnerability of the Netherlands and northwest Europe to storm damage under climate change. Climatic Change, 43(3):513–535, 1999.
[44] C. A. Dos Santos, C. M. Neale, T. V. Rao, and B. B. da Silva. Trends in indices for extremes in daily temperature and precipitation over Utah, USA. International Journal of Climatology, 31(12):1813–1822, 2011.
[45] U. Ehret, E. Zehe, V. Wulfmeyer, K. Warrach-Sagi, and J. Liebert. HESS Opinions: "Should we apply bias correction to global and regional climate model data?". Hydrology and Earth System Sciences Discussions, 9(4):5355–5387, 2012.
[46] W. Enke and A. Spekat. Downscaling climate model outputs into local and regional weather elements by classification and regression. Climate Research, 8(3):195–207, 1997.
[47] A.-C. Favre, S. El Adlouni, L. Perreault, N. Thiémonge, and B. Bobée. Multivariate hydrological frequency analysis using copulas. Water Resources Research, 40(1), 2004.
[48] L. Feudale. Large scale extreme events in surface temperature during 1950–2003: an observational and modeling study. PhD thesis, George Mason University.
[49] M. Fogarty, L. Incze, K. Hayhoe, D. Mountain, and J. Manning. Potential climate change impacts on Atlantic cod (Gadus morhua) off the northeastern USA. Mitigation and Adaptation Strategies for Global Change, 13(5-6):453–466, 2008.
[50] C. Frei, R. Schöll, S. Fukutome, J. Schmidli, and P. L. Vidale. Future change of precipitation extremes in Europe: Intercomparison of scenarios from regional climate models. Journal of Geophysical Research: Atmospheres (1984–2012), 111(D6), 2006.
[51] P. Friederichs and A. Hense. Statistical downscaling of extreme precipitation events using censored quantile regression. Monthly Weather Review, 135(6):2365–2378, 2007.
[52] K. Fujinaga, M. Nakai, H. Shimodaira, and S. Sagayama. Multiple-regression hidden Markov model. In Acoustics, Speech, and Signal Processing (ICASSP '01), 2001 IEEE International Conference on, volume 1, pages 513–516. IEEE, 2001.
[53] C. Gaetan and M. Grigoletto. A hierarchical model for the analysis of spatial rainfall extremes. Journal of Agricultural, Biological, and Environmental Statistics, 12(4):434–449, 2007.
[54] S. Ghosh and B. K. Mallick. A hierarchical Bayesian spatio-temporal model for extreme precipitation events. Environmetrics, 22(2):192–204, 2011.
[55] C. L. Giles, S. Lawrence, and A. C. Tsoi. Noisy time series prediction using recurrent neural networks and grammatical inference. Machine Learning, 44(1-2):161–183, 2001.
[56] H. R. Glahn and D. A. Lowry. The use of model output statistics (MOS) in objective weather forecasting. Journal of Applied Meteorology, 11(8):1203–1211, 1972.
[57] A. M. Greene, A. W. Robertson, P. Smyth, and S. Triglia. Downscaling projections of Indian monsoon rainfall using a non-homogeneous hidden Markov model. Quarterly Journal of the Royal Meteorological Society, 137(655):347–359, 2011.
[58] T. Hastie, R. Tibshirani, and J. Friedman. Linear Methods for Regression. Springer, 2009.
[59] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning, volume 1. Springer, New York, 2001.
[60] K. Hayhoe, S. Sheridan, L. Kalkstein, and S. Greene. Climate change, heat waves, and mortality projections for Chicago. Journal of Great Lakes Research, 36:65–73, 2010.
[61] X. He, Y. Yang, and J. Zhang. Bivariate downscaling with asynchronous measurements. Journal of Agricultural, Biological, and Environmental Statistics, 17(3):476–489, 2012.
[62] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
[63] W.-C. Hong, P.-F. Pai, S.-L. Yang, and R. Theng. Highway traffic forecasting by support vector regression model with tabu search algorithms. In Neural Networks (IJCNN '06), International Joint Conference on, pages 1617–1621. IEEE, 2006.
[64] G. Hudson and H. Wackernagel. Mapping temperature using kriging with external drift: theory and an example from Scotland. International Journal of Climatology, 14(1):77–91, 1994.
[65] G. Huerta and B. Sansó. Time-varying models for extreme values. Environmental and Ecological Statistics, 14(3):285–299, 2007.
[66] Y. Hundecha, A. St-Hilaire, T. Ouarda, S. El Adlouni, and P. Gachon. A nonstationary extreme value analysis for the assessment of changes in extreme annual wind speed over the Gulf of St. Lawrence, Canada. Journal of Applied Meteorology and Climatology, 47(11):2745–2759, 2008.
[67] T. Iizumi, M. Nishimori, K. Dairaku, S. A. Adachi, and M. Yokozawa. Evaluation and intercomparison of downscaled daily precipitation indices over Japan in present-day climate: Strengths and weaknesses of dynamical and bias correction-type statistical downscaling methods. Journal of Geophysical Research: Atmospheres (1984–2012), 116(D1), 2011.
[68] A. V. Ines and J. W. Hansen. Bias correction of daily GCM rainfall for crop simulation studies. Agricultural and Forest Meteorology, 138(1):44–53, 2006.
[69] A. J. Izenman. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2):248–264, 1975.
[70] T. H. Jagger, J. B. Elsner, and M. A. Saunders. Forecasting US insured hurricane losses. Climate Extremes and Society, pages 189–208, 2008.
[71] T. Joachims. Transductive inference for text classification using support vector machines. In ICML, volume 99, pages 200–209, 1999.
[72] R. W. Katz. Statistics of extremes in climate change. Climatic Change, 100(1):71–76, 2010.
[73] B. Kedem and K. Fokianos. Regression Models for Time Series Analysis, volume 488. John Wiley & Sons, 2005.
[74] S. Kim and E. P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. 2010.
[75] P. Kirshen, K. Knee, and M. Ruth. Climate change and coastal flooding in metro Boston: impacts and adaptation strategies. Climatic Change, 90(4):453–473, 2008.
[76] R. Koenker. Quantile regression software. http://www.econ.uiuc.edu/~roger/research/rq/rq.html.
[77] R. Koenker. Quantile Regression. Wiley Online Library, 2005.
[78] Z. W. Kundzewicz, L. J. Mata, N. Arnell, P. Doll, P. Kabat, B. Jimenez, K. Miller, T. Oki, S. Zekai, I. Shiklomanov, et al. Freshwater resources and their management. 2007.
[79] K. E. Kunkel, K. Andsager, and D. R. Easterling. Long-term trends in extreme precipitation events over the conterminous United States and Canada. Journal of Climate, 12(8):2515–2527, 1999.
[80] S. Laxman and P. Sastry. A survey of temporal data mining. Sadhana, 31(2):173–198, 2006.
[81] Y.-A. Le Borgne, S. Santini, and G. Bontempi. Adaptive model selection for time series prediction in wireless sensor networks. Signal Processing, 87(12):3010–3020, 2007.
[82] Y. Li, Y. Liu, and J. Zhu. Quantile regression in reproducing kernel Hilbert spaces. Journal of the American Statistical Association, 102(477):255–268, 2007.
[83] E. C. Mannshardt-Shamseldin, R. L. Smith, S. R. Sain, L. O. Mearns, and D. Cooley. Downscaling extremes: A comparison of extreme value distributions in point-source and gridded precipitation data. The Annals of Applied Statistics, 4(1):484–502, 2010.
[84] J. I. Marden. Positions and QQ plots. Statistical Science, 19(4):606–614, 2004.
[85] L. Mearns, W. Gutowski, R. Jones, L. Leung, S. McGinnis, A. Nunes, and Y. Qian. The North American Regional Climate Change Assessment Program dataset. National Center for Atmospheric Research Earth System Grid Data Portal, Boulder, CO, 2007.
[86] L. O. Mearns, W. Gutowski, R. Jones, R. Leung, S. McGinnis, A. Nunes, and Y. Qian. A regional climate change assessment program for North America. Eos, Transactions American Geophysical Union, 90(36):311, 2009.
[87] N. Meinshausen. Quantile regression forests. The Journal of Machine Learning Research, 7:983–999, 2006.
[88] C. Monteleoni, G. A. Schmidt, S. Saroha, and E. Asplund. Tracking climate models. Statistical Analysis and Data Mining, 4(4):372–392, 2011.
[89] S. Nadarajah. Extremes of daily rainfall in west central Florida. Climatic Change, 69(2-3):325–342, 2005.
[90] M. Nogaj, P. Yiou, S. Parey, F. Malek, and P. Naveau. Amplitude and frequency of temperature extremes over the North Atlantic region. Geophysical Research Letters, 33(10), 2006.
[91] A. Ober-Sundermeier and H. Zackor. Prediction of congestion due to road works on freeways. In Intelligent Transportation Systems, 2001 IEEE Proceedings, pages 240–244. IEEE, 2001.
[92] T. O'Brien, D. Sornette, and R. McPherron. Statistical asynchronous regression: Determining the relationship between two quantities that are not measured simultaneously. Journal of Geophysical Research, 106(A7):13247–13, 2001.
[93] S. Ollinger, C. Goodale, K. Hayhoe, and J. Jenkins. Potential effects of climate change and rising CO2 on ecosystem processes in northeastern US forests. Mitigation and Adaptation Strategies for Global Change, 13(5-6):467–485, 2008.
[94] J. Ospina Noreña, C. Gay García, A. Conde, V. Magaña, and G. Sánchez Torres Esqueda. Vulnerability of water resources in the face of potential climate change: generation of hydroelectric power in Colombia. Atmósfera, 22(3):229–252, 2009.
[95] M. Pérez-Enciso and M. Tenenhaus. Prediction of clinical outcome with microarray data: a partial least squares discriminant analysis (PLS-DA) approach. Human Genetics, 112(5-6):581–592, 2003.
[96] C. Piani, J. Haerter, and E. Coppola. Statistical bias correction for daily precipitation in regional climate models over Europe. Theoretical and Applied Climatology, 99(1-2):187–192, 2010.
[97] C. Piani, G. Weedon, M. Best, S. Gomes, P. Viterbo, S. Hagemann, and J. Haerter. Statistical bias correction of global simulated daily precipitation and temperature for the application of hydrological models. Journal of Hydrology, 395(3):199–215, 2010.
[98] M. Rivington, D. Miller, K. Matthews, G. Russell, G. Bellocchi, and K. Buchan. Downscaling regional climate model estimates of daily precipitation, temperature and solar radiation data. Climate Research, 35(3):181, 2007.
[99] E. A. Rosenberg, P. W. Keys, D. B. Booth, D. Hartley, J. Burkey, A. C. Steinemann, and D. P. Lettenmaier. Precipitation extremes and the impacts of climate change on stormwater infrastructure in Washington State. Climatic Change, 102(1-2):319–349, 2010.
[100] M. Rummukainen and Sveriges Meteorologiska och Hydrologiska Institut. Methods for Statistical Downscaling of GCM Simulations. SMHI RMK. Swedish Meteorological and Hydrological Institute, 1997.
[101] M. Rusticucci and B. Tencer. Observed changes in return values of annual temperature extremes over Argentina. Journal of Climate, 21(21):5455–5467, 2008.
[102] R. Samuels, A. Rimmer, A. Hartmann, S. Krichak, and P. Alpert. Climate change impacts on Jordan River flow: downscaling application from a regional climate model. Journal of Hydrometeorology, 11(4):860–879, 2010.
[103] H. Sang and A. E. Gelfand. Hierarchical modeling for extreme values observed over space and time. Environmental and Ecological Statistics, 16(3):407–426, 2009.
[104] D. Scott and G. McBoyle. Climate change adaptation in the ski industry. Mitigation and Adaptation Strategies for Global Change, 12(8):1411–1431, 2007.
[105] A. J. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199–222, 2004.
[106] I. Takeuchi, Q. V. Le, T. D. Sears, and A. J. Smola. Nonparametric quantile estimation. The Journal of Machine Learning Research, 7:1231–1264, 2006.
[107] C. Tebaldi and D. Lobell. Towards probabilistic projections of climate change impacts on global crop yields. Geophysical Research Letters, 35(8), 2008.
[108] J. M. Themeßl, A. Gobiet, and A. Leuprecht. Empirical-statistical downscaling and error correction of daily precipitation from regional climate models. International Journal of Climatology, 31(10):1530–1544, 2011.
[109] Q. Tian, J. Yu, Q. Xue, and N. Sebe. A new analysis of the value of unlabeled data in semi-supervised learning for image retrieval. In Multimedia and Expo (ICME '04), 2004 IEEE International Conference on, volume 2, pages 1019–1022. IEEE, 2004.
[110] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
[111] E. Towler, B. Rajagopalan, E. Gilleland, R. S. Summers, D. Yates, and R. W. Katz. Modeling hydrologic and water quality extremes in a changing climate: A statistical approach based on extreme value theory. Water Resources Research, 46(11), 2010.
[112] E. Vazquez and E. Walter. Multi-output support vector regression. In 13th IFAC Symposium on System Identification, pages 1820–1825. Citeseer, 2003.
[113] I. Watterson and M. Dix. Simulated changes due to global warming in daily precipitation means and extremes and their interpretation using the gamma distribution. Journal of Geophysical Research, 108(D13):4379, 2003.
[114] A. H. Welsh, R. B. Cunningham, C. Donnelly, and D. B. Lindenmayer. Modelling the abundance of rare species: statistical models for counts with extra zeros. Ecological Modelling, 88(1):297–308, 1996.
[115] R. Wilby. Statistical downscaling of daily precipitation using daily airflow and seasonal teleconnection indices. Climate Research, 10(3):163–178, 1998.
[116] R. Wilby, S. Charles, E. Zorita, B. Timbal, P. Whetton, and L. Mearns. Guidelines for use of climate scenarios developed from statistical downscaling methods. 2004.
[117] R. L. Wilby and T. Wigley. Downscaling general circulation model output: a review of methods and limitations. Progress in Physical Geography, 21(4):530–548, 1997.
[118] T. Zhang and F. Oles. The value of unlabeled data for classification problems. In Proceedings of the Seventeenth International Conference on Machine Learning (P. Langley, ed.), pages 1191–1198. Citeseer, 2000.
[119] Z.-H. Zhou and M. Li. Semi-supervised regression with co-training. In IJCAI, pages 908–916, 2005.
[120] X. Zhu. Semi-supervised learning literature survey. Computer Science, University of Wisconsin-Madison, 2:3, 2006.
[121] X. Zhu, Z. Ghahramani, J. Lafferty, et al. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, volume 3, pages 912–919, 2003.
[122] X. Zhu and A. B. Goldberg. Kernel regression with order preferences. In Proceedings of the National Conference on Artificial Intelligence, volume 22, page 681. AAAI Press, 2007.