MULTI-OBJECTIVE REGRESSION WITH APPLICATION TO THE CLIMATE DOMAIN

By

Zubin Abraham

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Doctor of Philosophy

2013

ABSTRACT

MULTI-OBJECTIVE REGRESSION WITH APPLICATION TO THE CLIMATE DOMAIN

By Zubin Abraham

Regression-based approaches are widely used in climate research to derive the statistical, spatial, and temporal relationships among climate variables. Despite the extensive literature, existing approaches are insufficient to address the unique challenges arising from the data characteristics and requirements of this domain. For example, climate variables such as precipitation have zero-inflated distributions, which render ineffective any linear regression models constructed from the data. In addition, whereas traditional regression-based approaches emphasize minimizing the discrepancy between observed and predicted values, there is a growing demand for regression outputs that satisfy other domain-specific criteria. To address these challenges, this thesis presents multi-objective regression frameworks designed to extend current regression-based approaches to meet the needs of climate researchers. First, a framework called Integrated Classification and Regression (ICR) is developed to accurately capture the timing of rain events and the magnitude of rain amounts in zero-inflated precipitation data. The second multi-objective regression framework focuses on modeling the extreme values of a distribution without degrading its overall accuracy in predicting non-extreme values. The third framework emphasizes both minimizing the divergence between the regression output and observed data and maximizing the fit of their cumulative distribution functions. The fourth contribution extends this framework to a multi-output setting, to ensure that the joint distribution of the multiple regression outputs is realistic and consistent with true observations.

ACKNOWLEDGMENTS

This work is partially supported by NSF grant III-0712987 and a subcontract for NASA award NNX09AL60G. This work is also partly supported by the National Science Foundation Dynamics of Coupled Natural and Human Systems (CNH) Program (Award No. 0909378). The author would like to thank Dr. Julie Winkler and Dr. Sharon Zhong for valuable discussions on statistical downscaling for climate change projection.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1 Introduction
   1.1 Regression in Climate Research
   1.2 Challenges
   1.3 Thesis Contributions
   1.4 Summary

Chapter 2 Related Work
   2.1 Forecasting in the Climate Science Domain
      2.1.1 Downscaling in the Climate Science Domain
      2.1.2 Bias Correction in the Climate Science Domain
      2.1.3 Regression-Based Approaches for Bias Correction and Downscaling
   2.2 Multiple-Objective Prediction
   2.3 Modeling Extremes
   2.4 Distribution Preserving Modeling

Chapter 3 Modeling Zero-Inflated Data
   3.1 Introduction
   3.2 Preliminaries
   3.3 Framework for Simultaneous Classification and Regression
      3.3.1 Objective Function
      3.3.2 Parameter Estimation
      3.3.3 Proof of Convergence
      3.3.4 Classification of Test Data
   3.4 Integrated Classification and Regression Algorithm
   3.5 Experimental Evaluation
      3.5.1 Experimental Setup
      3.5.2 Experimental Results
         3.5.2.1 Effect of Zero-Inflated Time Series Data
         3.5.2.2 Impact of Coupling the Classifier and Regression Model Creation
         3.5.2.3 Performance Comparison
   3.6 Conclusions

Chapter 4 Semi-Supervised Modeling of Zero-Inflated Data
   4.1 Preliminaries
   4.2 Semi-Supervised Framework for Simultaneous Classification and Regression
   4.3 Experimental Evaluation
      4.3.1 Experimental Setup
      4.3.2 Experimental Results
         4.3.2.1 Rationale for Incorporating Unlabeled Data
         4.3.2.2 Performance Comparison
   4.4 Conclusions

Chapter 5 Modeling Conditional Quantiles
   5.1 Introduction
   5.2 Preliminaries
      5.2.1 Multiple Linear Regression (MLR) and Ridge Regression
      5.2.2 Quantile Linear Regression (QR)
   5.3 Framework for Smoothed Quantile Regression
      5.3.1 Smoothed Quantile Regression (LSQR)
      5.3.2 Linear Semi-Supervised Quantile Regression (LSSQR)
   5.4 Experimental Results
      5.4.1 Data
      5.4.2 Experimental Setup
      5.4.3 Evaluation Criteria
      5.4.4 Baseline
      5.4.5 Results
   5.5 Conclusions

Chapter 6 Modeling Extremes in Zero-Inflated Data
   6.1 Introduction
   6.2 Preliminaries
      6.2.1 Generalized Linear Model and 2-Step GLM (GLM-C)
      6.2.2 Zero Inflated Poisson Regression (ZIP)
      6.2.3 Quantile Linear Regression (QR) and 2-Step QR (QR-C)
   6.3 Framework for Integrated Classification and Regression
      6.3.1 Integrated Classifier and Regression for Extreme Values (ICRE)
   6.4 Experimental Evaluation
      6.4.1 Data
      6.4.2 Experimental Setup
      6.4.3 Baseline Algorithm
         6.4.3.1 General Linear Model (GLM)
         6.4.3.2 General Linear Model with Classification (GLM-C)
         6.4.3.3 Quantile Regression (QR)
         6.4.3.4 Quantile Regression with Classification (QR-C)
         6.4.3.5 Zero Inflated Poisson (ZIP)
      6.4.4 Evaluation Criteria
      6.4.5 Experimental Results
         6.4.5.1 Impact of Zero-Inflated Data on Extreme Value Prediction
         6.4.5.2 Comparison of ICRE to Baseline Methods
   6.5 Conclusions

Chapter 7 Contour Regression
   7.1 Introduction
   7.2 Preliminaries
      7.2.1 Multiple Linear Regression (MLR)
      7.2.2 Quantile Mapping (QM)
   7.3 Framework for Multivariate Contour Regression (CR)
      7.3.1 Multiple Linear Contour Regression (MLCR)
         7.3.1.1 Proof of Convergence
      7.3.2 Kernel Contour Regression (KCR)
      7.3.3 Quantile Contour Regression (QCR)
         7.3.3.1 Proof of Convergence
   7.4 Experimental Results
      7.4.1 Data
      7.4.2 Experimental Setup
      7.4.3 Results
         7.4.3.1 QCR Results
   7.5 MCR Using Heterogeneous Data
      7.5.1 Geometric Quantile Mapping
      7.5.2 MCR Using Geometric Quantile Mapped Heterogeneous Data
      7.5.3 Projections For The Years 2040-2049
   7.6 Conclusions

Chapter 8 Multivariate Contour Regression
   8.1 Introduction
   8.2 Preliminaries
      8.2.1 Properties of the Geometric Quantiles
      8.2.2 Quantile Mapping-Based Approaches
      8.2.3 Rank Correlation and Residual Errors of Quantile Mapping
   8.3 Multi-Output Contour Regression Framework (MCR)
      8.3.1 Estimating Inverse Geometric Quantile Position
      8.3.2 Alternative Approximation-Based Approach for MCR
   8.4 Variations of MCR
      8.4.1 Quantile Multi-Output Contour Regression (MCR_Q)
      8.4.2 Non-linear Multi-Output Contour Regression (MCR_NL)
   8.5 Experimental Results
      8.5.1 Experimental Setup
      8.5.2 Results
         8.5.2.1 Univariate MCR
         8.5.2.2 Bivariate MCR
         8.5.2.3 Trivariate MCR
   8.6 Conclusions

Chapter 9 Conclusions and Future Work

BIBLIOGRAPHY

LIST OF TABLES

Table 3.1 List of predictor variables for precipitation prediction.
Table 5.1 List of predictor variables for temperature prediction.
Table 5.2 The relative performance of LSSQR compared with LSQR with regard to the extreme data points.
Table 5.3 The percentage of stations LSQR outperformed the respective baselines, with regard to the extreme data points.
Table 5.4 The percentage of stations LSSQR outperformed the respective baselines, with regard to the extreme data points.
Table 6.1 Percentage of stations won.
Table 6.2 Percentage of stations ICRE outperformed the baseline.
Table 7.1 List of predictor variables from each RCM.
Table 7.2 Relative performance gain of MLCR over baseline approaches.
Table 7.3 Percentage of stations that MLCR outperformed baseline in terms of σ and ρ-CDF.
Table 7.4 Percentage of stations that QCR outperformed baseline approaches in terms of RMSE, F-measure, k statistic and correlation for data points considered extreme values.
Table 8.1 Quantile Mapping Example.
Table 8.2 Performance (RMSE) of bivariate MCR over baseline approaches.
Table 8.3 Performance (Kendall τ) of bivariate MCR over baseline approaches.

LIST OF FIGURES

Figure 1.1 Histogram comparing the distribution of predicted daily maximum temperature at a weather station in Michigan to its respective observed values, 1990-1999. [For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation]
Figure 1.2 Histogram comparing the distribution of extreme values of the predicted daily maximum temperature at a weather station in Michigan, obtained using multiple linear regression, to its respective observed values, 1990-1999.
Figure 1.3 Scatter plot comparing the joint distribution between the quantile mapping (QM) predicted output of two response variables and its true values.
Figure 1.4 Histogram of daily precipitation recorded at a weather station in Canada.
Figure 3.1 A zero-inflated frequency distribution of daily precipitation at a weather station in Canada.
Figure 3.2 Comparison between independent modeling approach and proposed framework for predicting zero-inflated data.
Figure 3.3 Effect of increasing the number of NoRain days on performance of regression model (best viewed in color).
Figure 3.4 The cumulative distribution function of multiple linear regression output for a zero-inflated precipitation response variable.
Figure 3.5 Comparison of RMSE values (tested on Rain days only) for MLR models trained on all days compared with models trained only on Rain days.
Figure 3.6 Comparison of RMSE values (tested on All days) for MLR models trained on all days compared with models trained only on Rain days.
Figure 3.7 Comparison of RMSE values (for all days) among MLR, SVM-MLR and ICR.
Figure 3.8 Comparison of RMSE values (for all days) among MLR, SVM-MLR and ICR.
Figure 3.9 Comparison of RMSE values (for Rain days) among MLR, SVM-MLR and ICR.
Figure 3.10 Comparison of RMSE values (for Rain days) among MLR, SVM-MLR and ICR.
Figure 3.11 Comparison of classification accuracy (for all days) between SVM-MLR and ICR.
Figure 3.12 Comparison of classification accuracy (for all days) between SVM-MLR and ICR.
Figure 3.13 Comparison of F-measure (for Rain days) between SVM-MLR and ICR.
Figure 3.14 Comparison of F-Measure (for Rain days) between SVM-MLR and ICR.
Figure 3.15 Comparison of F-Measure (for NoRain days) between SVM-MLR and ICR.
Figure 3.16 Comparison of F-Measure (for NoRain days) between SVM-MLR and ICR.
Figure 3.17 Comparing the cumulative distribution function of the predicted values of precipitation according to the various models.
Figure 4.1 Similarity of predictor variables for all pairs of time periods of a given width.
Figure 4.2 Similarity of response variable for all pairs of time periods of a given width.
Figure 4.3 Similarity of the relation between Predictors and Predicants over time.
Figure 4.4 Comparison of RMSE values (for all days) among MLR, ZICR-S, and ZICR-SS.
Figure 4.5 Comparison of RMSE values (for Rain days) among MLR, ZICR-S, and ZICR-SS.
Figure 4.6 Comparison of classification accuracy (for all days) among MLR, ZICR-S, and ZICR-SS.
Figure 4.7 Comparison of F-Measure (for Rain days) among MLR, ZICR-S, and ZICR-SS.
Figure 4.8 Comparison of F-Measure (for NoRain days) among MLR, ZICR-S, and ZICR-SS.
Figure 5.1 Histogram of observed temperature.
Figure 5.2 Tail of the histogram.
Figure 5.3 Influence of parameter λ on the regression coefficients β in LSQR.
Figure 5.4 Influence of λ on the probability distribution of the predicted values obtained from LSSQR.
Figure 5.5 Influence of λ on the probability distribution of the predicted extreme values obtained from LSSQR.
Figure 5.6 Ratio of stations LSSQR outperforming baseline in terms of RMSE of extreme data points.
Figure 5.7 Ratio of stations LSSQR outperforming baseline in terms of recall of extreme data points.
Figure 5.8 Ratio of stations LSSQR outperforming baseline in terms of F-measure of extreme data points.
Figure 5.9 Prediction performance of extreme data points using MLR, Ridge, QR, LSSQR.
Figure 7.1 Area between the CDF of y and y_MLR.
Figure 7.2 Histogram of predicted daily maximum temperature at a weather station in Michigan, 1990-1999.
Figure 7.3 CDF of predicted daily precipitation at a weather station in Michigan, 1990-1999.
Figure 7.4 Influence of gamma parameter on fidelity of the response variable's cumulative distribution function.
Figure 7.5 CDF of predicted daily precipitation at a weather station in Michigan over the years 1990-99.
Figure 7.6 CDF of predicted daily maximum temperature at a weather station in Michigan, 1990-99.
Figure 7.7 CDF of predicted daily precipitation at a weather station in Michigan, 1990-99.
Figure 7.8 CDF of predicted daily minimum temperature at a weather station in Michigan, 1990-99.
Figure 7.9 Cumulative distribution function of predicted daily minimum temperature at a weather station in Michigan, 1990-99, using asynchronous regional climate model data.
Figure 7.10 Cumulative distribution function of predicted daily precipitation at a weather station in Michigan, 1990-99, using asynchronous regional climate model data.
Figure 7.11 Comparing the cumulative distribution function of MCR_HET and MCR_GQ output of predicted daily minimum temperature at a weather station in Michigan, 1990-99, using asynchronous regional climate model data.
Figure 7.12 Comparing the cumulative distribution function of MCR_HET and MCR_GQ output of predicted daily precipitation at a weather station in Michigan, 1990-99, using asynchronous regional climate model data.
Figure 7.13 Comparing the cumulative distribution function of projected daily minimum temperature at a weather station in Michigan, for the period 2040-2049 to that of QM model output for the years 1990-99.
Figure 7.14 Comparing the cumulative distribution function of projected daily precipitation at a weather station in Michigan, for the period 2040-2049 to that of QM model output for the years 1990-99.
Figure 8.1 Scatter plot of observed daily maximum and minimum temperature at a climate station in Michigan, USA.
Figure 8.2 Relative computation time of the various approximation-based approaches for estimating inverse geometric quantile positions.
Figure 8.3 Box plot of the percentage of stations where MCR showed improvement over single output baselines, in terms of Kendall τ rank correlation and RMSE, across all RCMs and variables.
Figure 8.4 Box plot of MCR's improvement over baseline approaches in terms of Kendall τ rank correlation and RMSE, across all RCMs and variables.
Figure 8.5 Scatter plot portraying the fidelity of forecast values of various approaches replicating the observed associations among the bivariate temperature response variables.
Figure 8.6 Three-dimensional scatter plot of the observed associations among maximum temperature, minimum temperature and precipitation as well as the respective forecasts made by the various single output and multiple output approaches.
Figure 9.1 Relative likelihood of identifying large precipitation residuals.
Figure 9.2 Relative likelihood of identifying days with large residuals.

Chapter 1

Introduction

Regression is a statistical method for deriving the relationship between a continuous-valued response variable y ∈ ℝ and its predictor variables x ∈ ℝ^d. The relationship is typically expressed as a mathematical function f : ℝ^d → ℝ. Numerous regression-based methods have been developed in the past, including discriminative models (such as multiple linear regression [58] and support vector regression [105]) and generative models (such as hidden Markov regression [57]). These methods are primarily designed to minimize a loss function, L(f(x), y), that measures the difference between the observed and predicted values.
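For concreteness, the following minimal sketch (an illustration, not code from this thesis) fits such a linear function f by minimizing the squared loss on synthetic data:

```python
import numpy as np

# Synthetic data: d = 3 predictor variables with a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# Fit f(x) = w.x + b by minimizing the squared loss sum_i (f(x_i) - y_i)^2.
A = np.hstack([X, np.ones((len(X), 1))])   # append an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:-1], coef[-1]

print("mean squared loss:", np.mean((X @ w + b - y) ** 2))
```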
Although such a loss function is sufficient to ensure a good fit between the predicted and observed values, there are other requirements that must be met when applying such methods to real-world applications. This thesis focuses on developing new regression-based methods that can handle the unique challenges arising from the climate research domain.

1.1 Regression in Climate Research

Given the growing concerns about global warming and its potential influence on human and natural systems [107, 28, 49, 93, 75, 60], there is a pressing need to generate accurate and robust projections of future climate scenarios for researchers, policy-makers and other stakeholders. For example, to aid crop management decision making, the projections can be incorporated into crop models to assess the crop yield response to future climate change. Towards this end, recent advances in global and regional climate modeling have produced vast amounts of simulation data that can be harnessed to improve our understanding of the climate system and its evolution under different greenhouse gas emission scenarios [31, 46, 116, 85]. These general circulation models (GCMs) and regional climate models (RCMs), as they are called, are physical-based models, developed based on the fundamental laws of physics, chemistry, and fluid dynamics to simulate the response of the Earth system to various external forcings on a three-dimensional spatial grid mesh. However, the scale of these computer-simulated model outputs is often too coarse to be effectively used in climate change impacts, adaptation, and vulnerability (CCIAV) assessment studies. Furthermore, due to the complexity of the climate system and the inadequacy of the models in capturing all of its underlying processes, there are inherent biases in the model outputs that must be corrected to enable reproduction of historical climate conditions.

Regression is a popular method to empirically downscale the coarse-resolution model outputs to a finer resolution. It can also debias the model outputs to fit the distribution of historical climate data. In addition to generating climate projections, regression can also be applied for the purpose of spatial interpolation, to fill in missing values at locations where observed values are unavailable [64]. However, in spite of the extensive number of regression-based approaches that have been proposed to generate downscaled and bias-corrected climate projections, there are still a number of challenges current regression methods have not adequately addressed.

1.2 Challenges

The projections generated by climate models are not an end in themselves but are often integrated into other numerical models (such as crop and hydrological models) to enable assessments of future climate change impacts. These downstream models not only differ in terms of the climate variables needed as input, they may also have distinct expectations regarding the desired characteristics of the climate projections. As an example, the skill of the climate projections in simulating the length of wet and dry runs, i.e., the number of consecutive rain and non-rain days, is an important requirement to estimate drought duration and intensity [9].
The least-squares loss function employed by multiple linear regression (MLR) and the first-order Markov assumption employed by weather generators are inadequate to simulate the higher-order temporal autocorrelation of a precipitation time series. Thus, one of the main challenges of applying regression methods to generate climate projections is that there could be more than one distinct expectation of the projections, and these are not always easy to achieve simultaneously.

Furthermore, the projections should be unbiased across all quantiles. An unbiased projection is one whose distributional characteristics are consistent with those of the true values of the response variable, across all the quantiles of its distribution. To illustrate this, consider the histogram of observed daily maximum temperature at a weather station in Michigan, represented by the gray area in Figure 1.1. The red dotted line represents the histogram of daily maximum temperature generated by multiple linear regression. Observe that the MLR outputs have a warm bias for the cooler days and a cold bias during the warmer days. As a result, MLR is not a suitable approach if extreme values are of paramount importance to users of the climate projections.

Figure 1.1: Histogram comparing the distribution of predicted daily maximum temperature at a weather station in Michigan to its respective observed values, 1990-1999. [For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation]

To address the requirement for an unbiased distribution, there is a class of bias correction approaches that can be used. Although these approaches can generate climate projections with low bias, their residual errors can be high, which implies a lack of agreement between the observed and predicted values at each time step. Hence, it is imperative to develop regression-based approaches that provide outputs satisfying both requirements of minimal error and unbiased predictions.

In addition, when considering the distribution characteristics of the regression outputs, the end user may be interested in certain quantiles over others. For instance, farmers are often interested in the frequency and magnitude of extreme values in the climate projections, due to the larger economic implications associated with them. However, as most approaches prioritize the conditional mean of the distribution, they tend to underestimate the frequency of extreme-valued data points, as shown in Figure 1.2.

Figure 1.2: Histogram comparing the distribution of extreme values of the predicted daily maximum temperature at a weather station in Michigan, obtained using multiple linear regression, to its respective observed values, 1990-1999.

Thus, another challenge for regression-based approaches is to provide a framework that is flexible enough to prioritize the accuracy at the quantiles of the end user's choice, without significantly degrading the performance at other quantiles.

End users may also require projections of more than one climate variable. For instance, daily minimum and maximum temperature as well as total precipitation are among the common variables needed for CCIAV assessments.
Although regression-based approaches can be trained for each variable independently, the resulting projections may not be consistent with each other. Hence, there is a growing demand for multiple-output prediction methods capable of capturing the joint distribution of the response variables in a realistic and consistent fashion, while minimizing their residual errors. Unfortunately, current methods are designed to optimize one of the two criteria, but not both, as shown in Figure 1.3. Generating projections for multiple variables while preserving their joint relationships and minimizing the residual errors is another challenge that has not been sufficiently addressed.

Figure 1.3: Scatter plot comparing the joint distribution between the quantile mapping (QM) predicted output of two response variables and its true values.

Some climate variables are also harder to project due to their unique data characteristics. For instance, daily precipitation is notoriously challenging to model because of its zero-inflated distribution, a challenge that conventional regression methods are not well suited to handle. A zero-inflated distribution is a distribution with an abundance of zero values, as shown in Figure 1.4. Such distributions can also be found in many other applications that are related to long-term projections, such as ecological modeling, disease monitoring, and traffic monitoring. As conventional regression methods typically prioritize modeling the conditional mean of the distribution, they tend to underestimate the frequency of zero-valued data points as well as the magnitude of the extreme values of a zero-inflated variable. Thus, there is a need to develop models that are geared toward dealing with some of the more uncharacteristic distributions observed among the climate variables.

Figure 1.4: Histogram of daily precipitation recorded at a weather station in Canada.

The source of climate data available for building the regression models may also introduce additional complications. For example, the GCM- or RCM-simulated precipitation data may not be exactly synchronized with the observed precipitation, since each simulated output is only one possible realization of the time series. This affects both the training of regression models as well as model evaluation, which typically assumes there is a one-to-one mapping between the input and output variables for each data point used in the regression model. Fortunately, there are alternative approaches, such as quantile mapping, that cater to modeling data points with asynchronous predictor and response variables. Unfortunately, this flexibility also results in predictions with relatively higher residual errors, as the models do not utilize the existing mapping information between the predictor and response variables. The challenge here is to develop models that can handle asynchronous data without compromising model accuracy.

For RCM models, the simulations can be driven either by reanalysis data (akin to true observations) or by GCM models as their initial boundary conditions. Reanalysis-driven RCM runs are typically used to generate hindcast values of the predictor variables, whereas GCM-driven RCM runs are needed to generate forecast values of the predictor variables.
Most regression-based approaches are trained using reanalysis-driven RCM runs and tested on future data generated by GCM-driven RCM runs. Unfortunately, there are inherent biases in the GCM-driven RCM runs that are not fully accounted for by the regression model. This presents an additional challenge that must be addressed by regression-based approaches, which typically assume that the training and test data have similar distributions.

1.3 Thesis Contributions

This thesis presents multivariate regression-based frameworks that simultaneously address multiple objectives pertaining to the individual requirements of an accurate climate projection. By simultaneously optimizing multiple objectives, the frameworks generate a projection that satisfies multiple requirements with minimal degradation of any one objective, unlike existing regression-based approaches that address a single objective at the expense of compromising other requirements. The multiple objectives addressed by each of the frameworks pertain to best replicating the unique distribution characteristics of a response variable, while also ensuring an accurate projection in terms of minimal residual errors.

As mentioned earlier, pragmatic approaches to modeling predictive systems need to take into account any unique distribution characteristics of the response variable that are critical to generating an accurate projection. Chapter 3 presents an integrated multi-objective framework that simultaneously performs classification and regression to accurately predict values of a zero-inflated time series [4]. The regression and classification models are trained to optimize a joint objective function that minimizes both the classification errors (zero and non-zero values) and the regression errors for data points that have non-zero values. The framework compensates for the uncertainty in the data by using a smoothing function that, during the learning of the regression function, prioritizes non-zero valued data points whose response value is consistent with other data points having similar values for their respective predictor variables. The effectiveness of the framework is demonstrated in the context of its application to downscaling a precipitation climate variable. The semi-supervised extension of the framework in Chapter 3 is elaborated in Chapter 4 and compared with its supervised counterpart [3].

Given that studies and applications that utilize long-term projections for analysis may be interested in certain quantiles of the distribution of the projection over others, Chapter 5 presents a multivariate framework that focuses on the accurate projection of a specific quantile of the distribution (such as those pertaining to extreme values) with minimum deterioration of the accuracy of the projection at the other quantiles of the response variable [8]. Chapter 5 also elaborates on the framework in a semi-supervised setting. As an extension of the above-mentioned framework, Chapter 6 presents a multi-objective framework called ICRE that focuses on reliable prediction of extreme-value events for a zero-inflated response variable [7]. This multi-objective framework incorporates the multiple objectives of classification, regression and conditional quantiles. The frameworks in Chapters 5 and 6 were evaluated on climate data and demonstrated their ability to accurately detect the frequency, timing and magnitude of extreme temperature and precipitation events compared to several baseline methods.
Chapter 7 shows the limitations of popular regression-based approaches in terms of preserving the distribution characteristics of the true response variable across the various quantiles, in spite of their minimizing residual errors. Chapter 7 goes on to present a multi-objective regression framework that simultaneously replicates the cumulative distribution properties of the response variable while minimizing the residual errors. The framework is highly flexible and can be applied to linear, nonlinear, and conditional quantile models [5]. The effectiveness of the framework in modeling the daily minimum and maximum temperature as well as precipitation for climate stations in the Great Lakes region is demonstrated, along with marked improvement over traditional regression-based approaches, for all climate stations evaluated.

There is a growing demand for multiple-output predictions that are accurate not only in terms of minimal residual errors but also in terms of capturing the joint distributional characteristics of the multiple output variables, so that they are realistic and consistent with each other. Unfortunately, the preservation of these associations is not guaranteed by regular single or multiple output regression approaches. Chapter 8 presents a framework for multiple output regression that preserves the general association patterns among the response variables (including non-linear associations) while minimizing the overall errors of the individual predictions, by coupling regression and geometric quantile mapping. The effectiveness of the framework in modeling temperature and daily precipitation for climate stations in the Great Lakes region is demonstrated [6]. The framework showed significant improvement in reducing residual errors while preserving the joint distribution of the multi-output variables, over the baseline approaches, in all climate stations evaluated.

1.4 Summary

To summarize, the presented frameworks address the challenges pertaining to the application of regression-based approaches to the climate research domain. However, even though the multi-objective frameworks presented in this thesis are originally motivated by the need to generate accurate and unbiased projections of climate variables, they are generic enough to be used for other applications. For example, the ICR framework is applicable to any domain with a zero-inflated data distribution, whereas the MCR framework is designed for domains that require consistent predictions across multiple response variables. Additionally, the proposed frameworks leverage ideas from semi-supervised learning, statistical asynchronous regression, and geometric quantiles to address the challenges introduced by the climate research domain. All the experimental results reported in this thesis pertaining to demonstrating the effectiveness of the frameworks are conducted on climate data for a study region involving parts of Canada and the Great Lakes region around Michigan. The effectiveness of the frameworks is demonstrated in the context of their applications to bias correcting and downscaling precipitation and temperature data.

Chapter 2

Related Work

Regression is a commonly employed statistical approach for estimating the relationship between a response variable and its respective predictor variables. Popular approaches to regression, such as multiple linear regression (MLR), learn the regression coefficients that estimate the conditional expectation of the response variable, given the predictor variables. Unlike MLR, quantile regression estimates the conditional quantile of the response variable [77].
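The distinction between the two objectives can be made concrete through their point-wise losses. The sketch below (an illustration, not code from the thesis) implements the squared loss, whose constant minimizer is the mean, and the pinball loss of quantile regression, whose constant minimizer is the τ-th quantile:

```python
import numpy as np

def squared_loss(y_true, y_pred):
    # Its minimizer over a constant prediction is the mean of y_true.
    return np.mean((y_true - y_pred) ** 2)

def pinball_loss(y_true, y_pred, tau):
    # Residuals above the prediction are weighted by tau, those below
    # by (1 - tau); the constant minimizer is the tau-th quantile.
    r = y_true - y_pred
    return np.mean(np.where(r >= 0, tau * r, (tau - 1.0) * r))

y = np.random.default_rng(1).gamma(shape=2.0, scale=3.0, size=5000)
grid = np.linspace(y.min(), y.max(), 2001)
best = grid[np.argmin([pinball_loss(y, c, tau=0.9) for c in grid])]
print(best, np.quantile(y, 0.9))   # the two values nearly coincide
```

Minimizing the pinball loss over a constant prediction recovers the empirical quantile, which is why quantile regression targets conditional quantiles rather than the conditional mean.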
Similarly, there are numerous other popular variants of regression-based approaches [58, 15, 73], such as ridge regression [62], lasso regression [110], recurrent neural networks [55], Hidden Markov Model Regression [52], and support vector regression [105]. The various regression approaches primarily differ based on the number of predictor variables used, the number of response variables, the type of response variable, the estimation method used to identify the regression coefficients, the bias in the estimated regression coefficients, whether a linear or non-linear regression function is used, etc.

Time series prediction [80] has long been an active area of research with applications in finance [30], network monitoring [81], transportation planning [63, 91], weather forecasting [46, 31], etc. Regression approaches can be applied to time series data for forecasting purposes. When it comes to using regression for time series analysis, the two most common approaches employed are autoregressive and multivariate regression approaches. While multivariate regression-based approaches are constrained by the requirement of the availability of future values of the predictor variables to make a forecast, autoregressive approaches, such as autoregressive moving average (ARMA) and autoregressive integrated moving average (ARIMA) [23], do not have such constraints. Autoregressive approaches make forecasts by repeatedly invoking a model that makes its prediction one unit at a time, using the predicted values from the previous iteration to infer future values. However, autoregressive approaches are plagued by the error accumulation problem, on account of the propagation of errors from one prediction step to the next.

Given the influence of climate on agriculture [28, 107], natural ecosystems [49, 93], human health and natural calamities [60, 75], economic impact [104], etc., considerable effort has been dedicated to generating projections of climate variables to aid strategic decision making. This effort has also been precipitated by the growth in the number of climate models in the climate science domain [88].

2.1 Forecasting in the Climate Science Domain

In the climate science domain, there has been extensive research on applying time series regression models to climate data obtained from global climate models (GCMs) [31, 46, 116]. GCMs are extensively used for understanding how the global climate may change in the future. They are computer-generated models for simulating future climate conditions under different greenhouse gas emission scenarios. However, the spatial resolution of GCM outputs is often too coarse to reliably project the future climate scenarios of a local region, and they do not provide reliable information on scales below about 200 km [Meehl et al., 2007].

Two of the more widely employed approaches to improving the projection of climate variables obtained from climate models such as GCMs and regional climate models (RCMs) are downscaling and bias correction. While bias correction is employed to correct the biases in the distributional characteristics of the climate projection, downscaling is employed to refine the granularity of the projection to better represent the climate variability associated with the location of interest of the impact assessment study.
In spite of the two distinct objectives, regression is a popular approach employed to generate bias-corrected projections as well as downscaled climate projections.

2.1.1 Downscaling in the Climate Science Domain

Downscaling techniques are used to relate the coarse-scale GCM outputs to local climate variables such as daily precipitation and temperature [116]. There are two common approaches to downscaling climate variables. The first approach is dynamical downscaling, which nests a regional climate model (RCM) into the GCM to represent the atmospheric physics with a higher grid-box resolution within a limited area of interest. The second approach is statistical downscaling, which statistically links coarse-scale weather with relatively finer-resolution observed local-scale weather. Wilby and Wigley [117] classified statistical downscaling into regression methods, weather type approaches, and stochastic weather generators.

Multiple linear regression (MLR) is probably the most common regression-based statistical downscaling approach, whose objective is to minimize the sum of squared errors. Multiple linear regression with randomization (MLRR) is another regression-based approach that adds a randomization term to compensate for the reduced variability in the prediction of the regression function, which is fitted to pass through the centroid of the data. Themeßl et al. [108] apply MLRR to bias correct precipitation data. The analog method (AM) and its variants, such as nearest neighbor analog methods (NNAM), are other common downscaling techniques that are based on the intuition of prototyping. The advantage of approaches such as AM is the minimal training time required. The disadvantages are the need for a sufficiently large amount of historical data for accurate prediction, as well as the limitation of being unable to predict extreme events beyond the magnitude of the extreme events available among the historical data sets. Rummukainen [100] differentiated various statistical downscaling approaches based on the nature of the chosen predictors as either being perfect prog(nosis) (PP) or model output statistics (MOS) (Glahn and Lowry [56]).

2.1.2 Bias Correction in the Climate Science Domain

Even the results of downscaling, such as RCM simulation data, often need to be bias corrected to accurately reflect the observed distribution of the respective climate variables before being fed to climate change impact models and/or crop growth and yield models, so that biases in the simulated data are not propagated. RCM variables such as temperature may require bias correction of the mean and standard deviation of the distribution, while variables such as precipitation may additionally require the frequency and intensity of the distribution to be calibrated.

Quantile-based bias correction approaches focus on matching the distribution of the downscaled output as closely as possible to that of the observations' distribution. Quantile mapping, modified quantile mapping (EDCDFm) and the transfer functions defined by Piani et al. [96, 97] are a few examples. Quantile mapping has been extensively used to downscale climate variables across regions, ranging from smaller regions such as Japan (Iizumi et al. [67]) to larger areas such as the European continent, as seen in Piani et al. [96]. Samuels et al. [102] and Ines et al. [68] use quantile mapping for downscaling and bias correcting climate variables like precipitation. Ceglar et al. [29] have used it to downscale even variables like solar radiation.
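As a rough illustration of the idea (not a specific implementation from the literature cited above), empirical quantile mapping passes each model value through the model CDF and then through the inverse of the observed CDF, so that the corrected series inherits the observed distribution:

```python
import numpy as np

def quantile_map(model_train, obs_train, model_new):
    """Bias-correct model_new via x -> F_obs^{-1}(F_model(x)), with both
    CDFs estimated empirically over a common training period."""
    model_sorted = np.sort(model_train)
    # Empirical CDF position of each new model value among the training values.
    p = np.searchsorted(model_sorted, model_new, side="right") / model_sorted.size
    # Invert the observed empirical CDF at those positions.
    return np.quantile(obs_train, np.clip(p, 0.0, 1.0))

rng = np.random.default_rng(2)
obs = rng.gamma(2.0, 3.0, size=3000)                  # "observed" series
model = 0.5 * rng.gamma(2.0, 3.0, size=3000) + 1.0    # biased model output
corrected = quantile_map(model, obs, model)
print(np.quantile(obs, [0.1, 0.5, 0.9]))
print(np.quantile(corrected, [0.1, 0.5, 0.9]))        # close to observed quantiles
```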
Quantile mapping can use both as a means to downscale the climate variables as well as a bias correction technique. When used as separate bias correction technique, QM corrects the errors in the shapes of the two distributions either prior or after downscaling the data, using any of the downscaling approaches. Approaches such as linear regression used by Rivington et al. [98] and copula-based approaches used by Favre et al. [47] are also occasionally used for bias correction. 2.1.3 Regression-Based Approaches for Bias Correction and Downscaling Regression-based approaches are among the most commonly employed approaches for bias correction and downscaling climate projections. In spite of the nuance between the objectives of bias correction and downscaling climate variables for generating projection, there is considerable overlap in terms of the approaches employed by both applications. The various regression-based approaches used for downscaling and bias correcting climate projections can be broadly categorized as either being distribution-driven or accuracy-driven, based on whether their primary objective is minimizing residual errors or ensuring that the cumulative distribution of the projection of the response variable matches, as closely as possible, the distribution of the corresponding observations data at each quantile. Multiple linear regression and QM are respectively, the most popular accuracy-driven and distribution-driven approaches. Among the various accuracy-driven regression-based approaches, change factor (delta method), linear regression (LR), its multiple variable counterpart and multiple linear regression (MLR) are the most popular regression-based approaches used for downscaling and bias 16 correction [108][98]. Multiple linear regression with randomization used by Themel et al. and copula based approaches used by Favre et al. [47] have also occasionally been used for bias correction and downscaling climate variables. Unlike accuracy-driven approaches, distribution-based approaches primarily focuses on ensuring that the cumulative distribution of the projection of the response variable matches as closely as possible, the distribution of the corresponding observations data at each quantile. Quantile mapping (QM), modified quantile mapping (EDCDFm), transfer functions defined by Piani et al. and local intensity scaling [108][61][96][97] are few examples of distributionbased regression approaches commonly used for bias correction and downscaling climate projections. 2.2 Multiple-Objective Prediction Based on the distribution characteristics of the response variable, modeling certain response variables, is more challenging that others. This is often due to the distinct or uncommon distribution characteristics of the response variable. For instance, conventional single-objective regression models that prioritize the conditional mean of the distribution tend to underestimate the number of zero-valued data points while also under-estimating the values of the extreme values of a zero-inflated response variable. Similarly, the single-objective regression approaches such as MLR, that are commonly used when the emphasis is on minimizing SSR, fare poorly in terms of capturing the shape of the distribution and hence is not well suited in preserving the distribution characteristics of the projection. Thus there is the need to have a framework that caters to simultaneously addressing multiple objectives. 
As mentioned earlier, precipitation, which is an important driver in many models such as fresh water modeling [78], is considerably more difficult to model than temperature, mostly due to its high spatial and temporal variability and its nonlinear nature. The motivation behind using a multi-objective approach that combines the use of classification and regression models is to address the challenges of the zero-inflated distribution observed in precipitation. Previous studies have shown that additional precautions must be taken to ensure that the excess zeros do not lead to poor fits of the regression models [11, 13, 42, 20, 114]. A typical approach to modeling a zero-inflated data set is to use a mixture distribution of the form

P(y|x) = α π0(x) + (1 − α) π(x)

where π0 and π are functions of the predictor variables x and α is a mixing coefficient that governs the probability that an observation is a zero or non-zero value. This approach assumes that the underlying data are generated from known parametric distributions. For example, π may be a Poisson or negative binomial distribution (for discrete data) or a lognormal or Gamma distribution (for continuous data). Piani et al. [97] proposed multi-objective transfer functions specific to response variables having zero-inflated distributions. Similarly, local intensity scaling (LOCI) has been specialized for bias correcting response variables that have a zero-inflated distribution, by accounting for the zero-inflated characteristics of precipitation data [108].
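As a concrete instance of the mixture above, the following sketch is illustrative only: it fixes α and a Poisson rate λ rather than modeling them as functions of x, as a full zero-inflated regression would, and takes the zero component to be a point mass at zero:

```python
import numpy as np
from scipy.stats import poisson

def zip_pmf(y, alpha, lam):
    # P(y) = alpha * [y == 0] + (1 - alpha) * Poisson(y; lam): the zero
    # component is a point mass at zero, pi is a Poisson distribution.
    return alpha * (y == 0) + (1.0 - alpha) * poisson.pmf(y, lam)

y = np.arange(6)
print(zip_pmf(y, alpha=0.5, lam=2.0))
# P(0) combines structural zeros (alpha) with Poisson zeros (1 - alpha) * e^{-lam}.
```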
Hence, unlike the common regression techniques mentioned earlier that focus on predicting the conditional mean, the motivation behind the presented model focuses on the conditional quantile, using an approach similar to quantile regression [77]. 19 Variations of quantile regression, such as non-parametric quantile regression and quantile regression forests, have been used to infer the conditional distribution of the response variable, which may be used to build prediction intervals [106, 87]. Also, variants of quantile regression that estimate the median are used due to their robustness to outliers when compared to traditional mean estimate [82]. Friederichs and Hense [51] presented a statistical downscaling approach to estimate censored conditional quantiles of precipitation that uses QR. The conditional probability of the censored variable is estimated using a generalized linear model (GLM) with a logit function to model the nature of the distribution of precipitation and hence cannot be directly applied to model temperature. Mannshardt-Shamseldin et al. [83] demonstrate another approach to downscaling extremes through the development of a family of regression relationships between the 100 year return value (extremes) of climate modeled precipitation (NCEP and CCSM) and station-observed precipitation values. Generalized extreme value theory based approaches have also be applied to model extreme events like hydrologic and water quality extremes, precipitation, etc [111, 16]. The Pareto distribution [43, 66], Gumbel [22, 14] and Weibull [35] are the more common variants of general extreme value distribution used. But these techniques are probability based that emphasize trends pertaining to the distribution of future extreme events and not the deterministic timing of the occurrence of the extreme event. The drawback of building a model that primarily focuses on only a particular section of the conditional distribution of the response variable is the limited amount of available data. Hence, the motivation for incorporating unlabeled data during model building. When it comes to accurately predicting extreme values in the presence of zero-inflated data, studies have shown that additional precautions must be taken to ensure that the excess zeros do not lead to poor fits [11, 13, 42, 20, 114] of the regression models. Generally, simple 20 modeling of zero values may not be sufficient, especially in the case of zero-inflated climate data such as precipitation, where extreme value observations need to be accurately modeled. Due to the significance of extreme values in climatology and the increasing trend in extreme precipitation events over past few decades, a lot of work has be done in analyzing the trends in precipitation, temperature, etc., for regions in the United States and Canada among others [79, 99, 44, 37, 101]. Katz [72] introduces the common approaches used in climate change research, especially with regard to extreme values. The common approaches to modeling extreme events are based on general extreme value theory [53, 89, 50], Pareto distribution [43, 66, 70, 90], generalized linear modeling [35, 34], hierarchical Bayesian approaches [54, 65, 103], etc. Gumbel [22, 14] and Weibull [35] are the more common variants of General extreme value distribution used. There are also Bayesian models such as the model of Cooley et al. [38] that augment the model with spatial information. Watterson et. al. 
Watterson et al. [113] proposed a model that also deals with the skewness and intermittency of non-zero precipitation, using a Gamma distribution to interpret changes in precipitation extremes. In contrast, the framework presented in this chapter handles the intermittency of the data by coupling a logistic regression classifier to the quantile regression part of the model.

2.4 Distribution Preserving Modeling

Courtesy of projects such as NARCCAP (North American Regional Climate Change Assessment Program), extensive studies have been done to utilize the long-term future climate projections made available [86, 88]. Many of these studies focus on the impact assessment of climate change on domains ranging from natural ecosystems [93, 49] to human systems [94]. Since many climate change impact assessment studies are interested in long-term climate projections, the accuracy of the distribution of the projection is often critical. As mentioned earlier, efficient utilization of these projections requires them to be unbiased across all the quantiles of the distribution [31, 4, 116].

Bias correction approaches such as quantile mapping (QM) [108], equidistant CDF matching (EDCDFm), statistical asynchronous regression (SAR) [92], and the transfer functions proposed by Piani et al. (2010b) have been applied to address these biases in the projection of climate data. However, these approaches are best suited to settings where no day-to-day mapping is available between the predictor and the response variable, as is the case when downscaling from GCMs or from RCM outputs driven by GCMs. QM is very well equipped to generate a projection of the response variable with an unbiased distribution. However, such bias correction approaches under-perform in terms of the accuracy of prediction of individual data points, because they do not leverage the original mapping between the response and predictor variables during training. This drawback is all the more limiting because data obtained from RCMs driven by reanalysis data do have a day-to-day mapping and may be used for building a regression model for downscaling and bias correction. Thus, common distribution-driven single output regression approaches are best suited when the predictor and output variables are asynchronous and there is less emphasis on a low sum of squared residuals (SSR). Accuracy-driven regression approaches, such as MLR, ridge regression, lasso, and analog methods [108], are commonly used when the emphasis is on minimizing SSR, but they fare poorly in terms of capturing the shape of the distribution (Figure 1.1). Thus, commonly used regression approaches are not well suited to preserving the distribution characteristics of the projection. Given the drawbacks of regression-based and quantile-based approaches, methods such as Contour Regression (CR) [5] have been proposed that try to simultaneously minimize error and preserve the shape of the forecast distribution. CR uses a regression function that regularizes the area between the CDF of the target response variable and the regression result.
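As a point of reference for the quantile mapping (QM) baseline discussed above, the following is a minimal empirical-quantile sketch in Python; the interpolation-free rank-based mapping shown here is one common way QM is realized, not necessarily the exact variant used in [108].

    import numpy as np

    def quantile_map(model_values, obs_train, model_train):
        """Map model outputs onto the observed distribution by matching
        empirical quantiles estimated over a common training period."""
        # Rank each model value within the model training distribution...
        q = np.searchsorted(np.sort(model_train), model_values) / len(model_train)
        q = np.clip(q, 0.0, 1.0)
        # ...and read off the observed value at the same quantile.
        return np.quantile(obs_train, q)

    # Illustrative use with synthetic data: a model that is biased high.
    rng = np.random.default_rng(0)
    obs = rng.gamma(0.8, 5.0, size=1000)
    model = obs * 1.3 + 1.0
    corrected = quantile_map(model, obs, model)

Note that the mapping uses only the two marginal distributions, which is why QM remains applicable when there is no day-to-day pairing between predictor and response, and also why it cannot exploit such a pairing when one exists.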
To address the limitations of single output regression (SOR), numerous multiple output regression (MOR) models have been proposed, including the commonly used multi-output regression [59] and structured output regression [17]. A number of regression-based multiple output models focus on penalizing the regression matrix using low-rank penalization methods such as reduced rank regression [69]. However, these approaches do not model the correlation among output dimensions. Another common approach to multiple output prediction is to penalize the shared input space for collinearity, as in partial least squares regression discriminant analysis (PLSDA) [95]; these models, too, do not capture the association among the response variables. "Curds and whey" is an example of a regression-based approach that models output correlation [25]; however, it assumes that the relation among the response variables is linear. Multiple output SVR is another approach that takes advantage of the correlation among response variables and extends SVR to multi-output systems by considering cokriging (a multi-variable version of kriging) [112]. Cokriging models multiple output variables by computing cross covariances between the different outputs. Group lasso [74], LL-MIMO [21], and Gaussian process MOR [10] are other examples of MOR. However, none of the above-mentioned approaches preserve the full range of variability of the joint distribution of the response variables. He et al. [61] proposed bivariate quantile mapping to address the limitations of QM in bivariate space, building on the intuition of Buja et al. regarding geometric quantiles [26]. However, although the approach fares very well in terms of capturing the requisite marginal distribution characteristics of the two response variables, it suffers from poor SSR due to its asynchronous nature.

In the following chapters, approaches are proposed that address the challenges of modeling zero-inflated response variables, prioritizing extremes in the distribution of the response variable, and preserving the overall distribution characteristics of the response variable, in both single output and multi-output settings.

Chapter 3
Modeling Zero-Inflated Data

This chapter presents a multi-objective approach for predicting future values of time series data that are inherently zero-inflated. The proposed framework decouples the prediction task into two objectives: a classification step to predict whether the value of the time series is zero and a regression step to estimate the magnitude of the non-zero time series value.

3.1 Introduction

Predictive models for time series data are commonly employed in the fields of economics, finance, epidemiology, ecology, and meteorology, among others. The prediction accuracy is subject to the choice of model used, which in turn may be limited by characteristics of the time series observations. For example, studies have shown that the performance of classical regression models is degraded when applied to data sets with excess zero values [11, 13, 42, 20, 114]. Such data are typically encountered in applications such as climate and ecological modeling, disease monitoring, manufacturing defect detection, and traffic monitoring. Figure 3.1 shows the histogram of daily precipitation (in log scale) at a weather station in Canada for the period between January 1, 1961 and December 31, 2000. Nearly half of the observations have precipitation values equal to zero.

[Figure 3.1: A zero-inflated frequency distribution of daily precipitation at a weather station in Canada.]

Such zero-inflated data, as they are commonly known, often lead to poor model fitting with standard regression methods, as these tend to underestimate the frequency of zeros and the magnitude of non-zero values of the data.
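This tendency of a least-squares fit to split the difference between the zero and non-zero regimes is easy to reproduce on synthetic data; the following small Python illustration is an assumption-laden sketch, not an experiment from this thesis.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 2000
    x = rng.normal(size=n)
    wet = rng.random(n) < 0.5           # roughly half the days are "dry" (zero)
    amount = np.exp(0.8 * x + rng.normal(scale=0.3, size=n))
    y = np.where(wet, amount, 0.0)      # zero-inflated response

    # Ordinary least squares on all days versus on non-zero days only.
    X = np.column_stack([np.ones(n), x])
    beta_all = np.linalg.lstsq(X, y, rcond=None)[0]
    beta_wet = np.linalg.lstsq(X[wet], y[wet], rcond=None)[0]

    # The all-days fit is dragged toward zero, underestimating wet-day magnitudes.
    print(beta_all, beta_wet)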
A typical strategy for handling such data is to first invoke a classification model to predict whether the output value is zero. A regression model, which has been trained on the non-zero data points, is then applied to estimate the magnitude of the output only if the classifier predicts a non-zero value. Such an approach is commonly used for statistical downscaling of precipitation [115], in which the occurrence of rain or wet days is first predicted before a regression model is applied to estimate the amount of rainfall for the predicted wet days. The limitation of this approach is that the classification and regression models are often built independently of each other. As a result, neither model can glean information from the other to potentially improve its prediction accuracy.

[Figure 3.2: Comparison between (a) the two-step independent prediction approach and (b) the proposed joint regression and classification framework for predicting zero-inflated data.]

The objective of this chapter is to develop an integrated framework that accurately estimates the future values of a zero-inflated time series by simultaneously training the classification and regression models. Specifically, the models are trained to optimize a joint objective function that penalizes errors in classifying a data point and errors in predicting the magnitude of non-zero data points. Given a test point, the regression model is applied to estimate the magnitude of the predicted value. The output from the regression model, along with the values of the other predictor variables of the test point, is then fed into a classification model to determine whether the predicted value should be adjusted to zero. The distinction between the traditional two-step independent modeling approach and the proposed framework is illustrated in Figure 3.2.

The effectiveness of the learning framework is demonstrated in the context of precipitation prediction, using climate data from the Canadian Climate Change Scenarios Network Web site [1]. Specifically, the performance of the integrated framework was compared against two baseline methods. The first baseline corresponds to applying the standard multiple linear regression (MLR) method to the entire training data, which includes both dry and rain days. The second baseline method (SVM-MLR) uses a combination of a support vector machine classifier to predict dry/wet days and multiple linear regression to predict the rainfall amount on wet days; the two models are trained independently (a minimal sketch of this baseline is given after the list below). Empirical results showed that the proposed framework outperforms both MLR and SVM-MLR on the majority of the weather stations investigated in this study. In summary, the main contributions of this chapter are as follows:

• An integrated framework for simultaneously learning classification and regression models is presented.
• The proposed framework was found to be more effective at predicting zero-inflated time series than building a single regression model or building independent classification and regression models to fit the time series data.
• The framework was successfully applied to the real-world problem of downscaling precipitation time series for climate impact assessment studies.
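For reference, a minimal sketch of the independently trained SVM-MLR baseline described above, using scikit-learn; the feature and target arrays are placeholders and the hyperparameters are illustrative assumptions.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.linear_model import LinearRegression

    def fit_two_step(X, y):
        """Two-step baseline: an SVM wet/dry classifier and an MLR model
        for wet-day amounts, trained independently of each other."""
        wet = y > 0
        clf = SVC(kernel="linear").fit(X, wet)
        reg = LinearRegression().fit(X[wet], y[wet])
        return clf, reg

    def predict_two_step(clf, reg, X):
        # Regression output is used only on days classified as wet; the
        # clipping at zero is an assumption of this sketch, for plausibility.
        return np.where(clf.predict(X), np.maximum(reg.predict(X), 0.0), 0.0)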
3.2 Preliminaries

Consider a multivariate time series L = (x_t, c_t), where t ∈ {1, 2, ..., n} denotes the elapsed time, x_t is a d-dimensional vector of predictor variables at time t, and c_t is the corresponding value of the response (target) variable. Given an unlabeled sequence of multivariate observations x_τ, where τ ∈ {n + 1, ..., n + m}, the goal is to learn a target function f(x_τ, w) that best estimates the future values of the response variable at each time τ. The set of weights w = [w_1, w_2, ..., w_d]^T are the regression coefficients to be estimated from the training data L. For applications such as statistical downscaling, the predictor variables x_τ correspond to climate variables at large spatial scales generated from computer-driven general circulation models (GCMs). For zero-inflated data, the frequency of zero values in the time series is considerably larger than the frequency of each non-zero value, as shown in Figure 3.1. The response variable c_t can be mapped into a binary class $\bar{c}_t$, where

$$\bar{c}_t = \begin{cases} 1, & \text{if } c_t > 0; \\ 0, & \text{otherwise.} \end{cases} \qquad (3.1)$$

For brevity, the notation y ≡ f(x, w) is used for the predicted value of the response variable and $\bar{y}$ for its corresponding predicted class.

3.3 Framework for Simultaneous Classification and Regression

In this chapter, a framework is presented for predicting future values of a time series with the following unique characteristics:

1. The framework simultaneously performs classification and regression to improve the accuracy of predicting the magnitude of non-zero values in a zero-inflated time series.
2. The framework can be easily extended to a semi-supervised learning setting via graph regularization.

This chapter considers a framework for modeling zero-inflated variables using a combination of classification and regression models. The models are trained to optimize a joint objective function that considers both the classification errors on the time series and the regression errors for the non-zero values. The framework presented in this chapter trains an SVM classifier only once, after the parameters of the regression model have been determined. Proofs of convergence of the algorithm are also presented in this section. Multiple linear regression (MLR) was considered as the underlying regression model in this study, in which f(x, w) = w^T x. Extending the approach to nonlinear models will be a subject for future research.

3.3.1 Objective Function

The classification and regression models developed in this study are designed to minimize the following objective function:

$$\arg\min_{\mathbf{w}, \bar{\mathbf{y}}} L(\mathbf{w}, \bar{\mathbf{y}}) = \sum_{i=1}^{n} \bar{c}_i (c_i - y_i \bar{y}_i)^2 + T_1 \sum_{i=1}^{n} (\bar{y}_i - \bar{c}_i)^2 + T_2 \sum_{i,j=1}^{n} s_{i,j} \left[ \bar{c}_i y_i - \bar{c}_j y_j \right]^2 + T_3 \|\mathbf{w}\|^2$$

where $y_i = \sum_d w_d x_{i,d}$, $\bar{y}_i \in \{0, 1\}$, and $s_{i,j}$ is the similarity between the values of the predictor variables at $t_i$ and $t_j$.
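A direct transcription of this objective into Python may help fix the notation; it is a sketch only, with the similarity matrix S and the penalty weights T1, T2, T3 assumed given.

    import numpy as np

    def icr_objective(w, ybar, X, c, S, T1, T2, T3):
        """Evaluate L(w, ybar); cbar is the binary rain/no-rain label from c."""
        cbar = (c > 0).astype(float)
        y = X @ w                                   # regression outputs y_i
        term1 = np.sum(cbar * (c - y * ybar) ** 2)  # regression error on non-zero points
        term2 = T1 * np.sum((ybar - cbar) ** 2)     # misclassification error
        g = cbar * y                                # graph-smoothness argument
        term3 = T2 * np.sum(S * (g[:, None] - g[None, :]) ** 2)
        term4 = T3 * np.dot(w, w)                   # L2 shrinkage on w
        return term1 + term2 + term3 + term4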
The rationale for the design of the objective function is as follows. The first term is similar to the standard least-squares formulation of multiple linear regression, except that the estimation of w is based on the non-zero values in the time series. The regression model is therefore biased towards estimating the non-zero values more accurately, instead of being influenced by the over-abundance of zeros in the time series. The product $y_i \bar{y}_i$ in the first term corresponds to the predicted output of the joint classification and regression models. The second term in the objective function is equivalent to the misclassification error on the training data. The third term corresponds to a graph regularization constraint to ensure smoothness and consistency in the model predictions. Specifically, for two highly similar data points $x_p$ and $x_q$, i.e., when $s_{pq}$ is large, the model is penalized if the predicted values of the response variables are inconsistent. Finally, the last term in the objective function is equivalent to the L2 norm used in ridge regression models to shrink the coefficients in w.

Each data point corresponds to a given elapsed time t ∈ {1, 2, ..., n} in the time series. An n × n similarity matrix S = [s_{ij}] is computed between every pair of data points based on the similarities of their predictor variables. Prior to computing the similarity matrix, each variable is standardized by subtracting its mean value and then dividing by its corresponding standard deviation; this standardization is needed to account for the varying scales of the variables. The Pearson correlation coefficient was used to compute the similarity between each pair of data points, and the value was then transformed to a range between 0 and 1 to ensure that all the terms in the objective function are non-negative. The choice of Pearson correlation as the similarity measure is due to the popularity of the measure in the Earth science domain.

3.3.2 Parameter Estimation

The objective function can be further expanded as follows:

$$L(\mathbf{w}, \bar{\mathbf{y}}) = \sum_{i=1}^{n} \bar{c}_i \Big(c_i - \bar{y}_i \sum_d w_d x_{i,d}\Big)^2 + T_1 \sum_{i=1}^{n} (\bar{y}_i - \bar{c}_i)^2 + T_2 \sum_{i,j=1}^{n} s_{i,j} \Big(\bar{c}_i \sum_d w_d x_{i,d} - \bar{c}_j \sum_d w_d x_{j,d}\Big)^2 + T_3 \|\mathbf{w}\|^2$$

or equivalently,

$$L(\mathbf{w}, \bar{\mathbf{y}}) = \sum_{i=1}^{n} \bar{c}_i \Big(c_i - \bar{y}_i \sum_d w_d x_{i,d}\Big)^2 + T_1 \sum_{i=1}^{n} (\bar{y}_i - \bar{c}_i)^2 + T_3 \|\mathbf{w}\|^2 + T_2 \sum_{i,j=1}^{n} s_{i,j} \Big[ \Big(\bar{c}_i \sum_d w_d x_{i,d}\Big)^2 + \Big(\bar{c}_j \sum_d w_d x_{j,d}\Big)^2 - 2\, \bar{c}_i \bar{c}_j \sum_{d,d'} w_d w_{d'}\, x_{i,d}\, x_{j,d'} \Big]$$

To estimate the regression parameters w and class labels $\bar{\mathbf{y}}$, the following iterative procedure was employed. First, the partial derivative of $L(\mathbf{w}, \bar{\mathbf{y}})$ with respect to each $w_k$ is computed and set to zero (assuming $\bar{\mathbf{y}}$ is fixed):

$$\frac{\partial L}{\partial w_k} = -2 \sum_{i=1}^{n} \bar{c}_i \Big(c_i - \bar{y}_i \sum_d w_d x_{i,d}\Big)\, \bar{y}_i\, x_{i,k} + 2T_2 \sum_{i,j=1}^{n} s_{i,j}\, \bar{c}_i \Big(\sum_d w_d x_{i,d}\Big) x_{i,k} + 2T_2 \sum_{i,j=1}^{n} s_{i,j}\, \bar{c}_j \Big(\sum_d w_d x_{j,d}\Big) x_{j,k} - 2T_2 \sum_{i,j=1}^{n} s_{i,j}\, \bar{c}_i \bar{c}_j \sum_d w_d (x_{i,d}\, x_{j,k} + x_{i,k}\, x_{j,d}) + 2T_3\, w_k = 0$$

This reduces to a system of linear equations of the form A w = b, where

$$b_k = \sum_{i=1}^{n} \bar{c}_i\, \bar{y}_i\, c_i\, x_{i,k}$$

and A is a square matrix of dimension d × d whose off-diagonal elements are given by

$$A_{k,l} = 2T_2 \sum_{i,j=1}^{n} s_{i,j}\, \bar{c}_i\, x_{i,l}\, x_{i,k} - 2T_2 \sum_{i,j=1}^{n} s_{i,j}\, \bar{c}_i \bar{c}_j\, x_{i,l}\, x_{j,k} + \sum_{i=1}^{n} \bar{c}_i\, \bar{y}_i\, x_{i,l}\, x_{i,k}$$

and whose diagonal elements are

$$A_{k,k} = 2T_2 \sum_{i,j=1}^{n} s_{i,j}\, \bar{c}_i\, x_{i,k}^2 - 2T_2 \sum_{i,j=1}^{n} s_{i,j}\, \bar{c}_i \bar{c}_j\, x_{i,k}\, x_{j,k} + \sum_{i=1}^{n} \bar{c}_i\, \bar{y}_i\, x_{i,k}^2 + T_3$$

To estimate $\bar{\mathbf{y}}$, the part of the objective function that depends on $\bar{\mathbf{y}}$ is minimized:

$$L_c(\bar{\mathbf{y}}) = \sum_{i=1}^{n} \bar{c}_i (c_i - y_i \bar{y}_i)^2 + T_1 \sum_{i=1}^{n} (\bar{y}_i - \bar{c}_i)^2$$

subject to the constraint $\bar{y}_i \in \{0, 1\}$. It is straightforward to show that $L_c$ is minimized by the following rule:

$$\bar{y}_i = \begin{cases} 1, & \text{if } \bar{c}_i = 1 \text{ and } (c_i - y_i)^2 < c_i^2 + T_1; \\ 0, & \text{otherwise} \end{cases} \qquad (3.2)$$

since, when $\bar{c}_i = 1$, setting $\bar{y}_i = 1$ incurs a cost of $(c_i - y_i)^2$ while setting $\bar{y}_i = 0$ incurs a cost of $c_i^2 + T_1$. The predicted class labels $\bar{\mathbf{y}}$ are then used to re-estimate the regression coefficients w. This procedure is repeated until the regression coefficients and class labels converge.
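The alternating update just described can be sketched compactly in Python; the vectorized assembly of A and b below is intended to be algebraically equivalent to the elementwise formulas above, with S the n × n similarity matrix from Section 3.3.1 and convergence checking elided for brevity.

    import numpy as np

    def icr_fit(X, c, S, T1, T2, T3, max_iter=10):
        """Alternating updates for ICR: solve A w = b with ybar fixed,
        then update ybar with the closed-form rule (3.2)."""
        n, d = X.shape
        cbar = (c > 0).astype(float)
        ybar = cbar.copy()                  # initialize class labels to cbar
        for _ in range(max_iter):
            r = S.sum(axis=1)               # r_i = sum_j s_ij (graph row sums)
            G = 2 * T2 * (X.T * (cbar * r)) @ X \
                - 2 * T2 * (X.T * cbar) @ S @ (X * cbar[:, None])
            A = G + (X.T * (cbar * ybar)) @ X + T3 * np.eye(d)
            b = X.T @ (cbar * ybar * c)
            w = np.linalg.solve(A, b)
            y = X @ w
            # Rule (3.2): keep a predicted rain day only if it lowers the cost.
            ybar = ((cbar == 1) & ((c - y) ** 2 < c ** 2 + T1)).astype(float)
        return w, ybar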
3.3.3 Proof of Convergence

This section presents the proof of convergence of the iterative update algorithm. Let $(\mathbf{w}^t, \bar{\mathbf{y}}^t)$ be the regression coefficients and class labels estimated after the t-th iteration and $(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^{t+1})$ those estimated after the (t+1)-th iteration.

Proposition 3.3.1. Assuming that the class labels $\bar{\mathbf{y}}^t$ are fixed, $L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^t) \le L(\mathbf{w}^t, \bar{\mathbf{y}}^t)$.

Proof. For a fixed $\bar{\mathbf{y}}^t$, let $L_r(\mathbf{w})$ be the part of the objective function that depends on the regression coefficients w:

$$L_r(\mathbf{w}) = \sum_{i=1}^{n} \bar{c}_i \Big(c_i - \bar{y}_i \sum_d w_d x_{i,d}\Big)^2 + T_3 \|\mathbf{w}\|^2 + T_2 \sum_{i,j=1}^{n} s_{i,j} \Big(\bar{c}_i \sum_d w_d x_{i,d} - \bar{c}_j \sum_d w_d x_{j,d}\Big)^2$$

The Hessian matrix H of $L_r(\mathbf{w})$ is given by

$$\frac{\partial^2 L_r}{\partial w_k\, \partial w_l} = 2 \sum_{i=1}^{n} \bar{c}_i\, \bar{y}_i^2\, x_{i,k}\, x_{i,l} + 2T_3\, \delta_{kl} + 2T_2 \sum_{i,j=1}^{n} s_{i,j}\, (\bar{c}_i x_{i,k} - \bar{c}_j x_{j,k})(\bar{c}_i x_{i,l} - \bar{c}_j x_{j,l})$$

where $\delta_{kl} = 1$ if k = l and zero otherwise. Since the parameters $T_2$ and $T_3$ are non-negative, it can be shown that $\mathbf{z}^T H \mathbf{z} \ge 0$ for any non-zero real-valued vector z, i.e., the Hessian matrix is positive semi-definite. Thus, the stationary point $\mathbf{w}^{t+1}$ minimizes $L_r$, and $L_r(\mathbf{w}^{t+1}) \le L_r(\mathbf{w}^t)$.

Proposition 3.3.2. Assuming that the regression coefficients are fixed, $L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^{t+1}) \le L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^t)$.

Proof. For a fixed $\mathbf{w}^{t+1}$, $L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^t) = L_c(\bar{\mathbf{y}}^t) + T_2 \sum_{i,j=1}^{n} s_{i,j} [\bar{c}_i y_i - \bar{c}_j y_j]^2 + T_3 \|\mathbf{w}\|^2$. Note that the last two terms are independent of $\bar{\mathbf{y}}^t$. Since the update formula for $\bar{\mathbf{y}}$ minimizes $L_c(\bar{\mathbf{y}})$, it follows that $L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^{t+1}) \le L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^t)$.

Theorem 3.3.1. The objective function L is monotonically non-increasing under the update formulas for w and $\bar{\mathbf{y}}$.

Proof. The update formulas iteratively modify the objective function as follows: $L(\mathbf{w}^t, \bar{\mathbf{y}}^t) \Rightarrow L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^t) \Rightarrow L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^{t+1})$. Using the above propositions, $L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^t) \le L(\mathbf{w}^t, \bar{\mathbf{y}}^t)$ and $L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^{t+1}) \le L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^t)$. Therefore, $L(\mathbf{w}^{t+1}, \bar{\mathbf{y}}^{t+1}) \le L(\mathbf{w}^t, \bar{\mathbf{y}}^t)$.

Lemma 3.3.1. The objective function eventually converges, since the value of the loss function is always non-negative and L is monotonically non-increasing.

3.3.4 Classification of Test Data

The update formulas presented in the previous subsections compute the regression coefficients w and class labels $\bar{\mathbf{y}}$ of the training examples in such a way as to minimize the objective function. For a given test example $x_\tau$, where $\tau \in \{n+1, ..., n+m\}$, the predicted value of the regression model can be computed as $y_\tau = \mathbf{w}^T x_\tau$. However, the classification output cannot be determined directly, since the update formula for $\bar{y}$ depends on the true class label $\bar{c}$, as shown in Equation (3.2). Therefore, to predict the class label $\bar{y}$, an SVM classifier is trained with $(x_t, y_t)$ as the (d+1)-dimensional feature vector and the estimated $\bar{y}_t$ as the class label, using only examples from the training data. Once the classifier has been constructed, it can be applied to predict the class label of a test example. The final output of the joint classification and regression model is the product $y_\tau \bar{y}_\tau$ (see Figure 3.2).

Empirically, it was found that an SVM may be used as an alternative classifier to predict $\bar{y}$ at each iteration, instead of the update formula described above. However, since the objective function of a generic classifier does not necessarily minimize the first and second terms of $L_c(\bar{\mathbf{y}})$ simultaneously, convergence cannot be guaranteed in that case.
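A sketch of this test-time classification step, continuing the Python fragments above; the augmented feature [x, y] mirrors the (d+1)-dimensional vector described, and the linear kernel matches the choice reported later in the experimental setup.

    import numpy as np
    from sklearn.svm import SVC

    def icr_predict(w, ybar_train, X_train, X_test):
        """Train the SVM on (x_t, y_t) -> ybar_t, then form z = y * ybar."""
        y_train = X_train @ w
        feats_train = np.column_stack([X_train, y_train])  # (d+1)-dim features
        clf = SVC(kernel="linear").fit(feats_train, ybar_train)
        y_test = X_test @ w
        ybar_test = clf.predict(np.column_stack([X_test, y_test]))
        return y_test * ybar_test                          # final ICR output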
3.4 Integrated Classification and Regression Algorithm

The Integrated Classification and Regression (ICR) framework takes as input $(x_t, c_t)$, a multivariate time series with d-dimensional predictor variables $x_t$ and response variable $c_t$, together with a sequence of unlabeled observations $x_\tau$. The output returned by the framework consists of the regression coefficients w and the predicted values $z_\tau$ of the unlabeled sequence. For the training phase, set $\bar{c} = (c > 0)$ and initialize $\bar{y} = \bar{c}$. Then, until convergence, update w by solving A w = b, followed by updating $\bar{y}$ using Equation (3.2). Finally, train an SVM classifier $g : (x_t, y_t) \rightarrow \bar{y}_t$. During the testing phase, set $\forall \tau: y_\tau = \mathbf{w}^T x_\tau$, $\forall \tau: \bar{y}_\tau = g(x_\tau, y_\tau)$, and $\forall \tau: z_\tau = y_\tau \bar{y}_\tau$.

It is assumed that the time series data has been partitioned into a training set, a validation set (for model selection), and a test set. Model selection is needed to estimate the parameters $T_1$, $T_2$, and $T_3$ of the objective function $L(\mathbf{w}, \bar{\mathbf{y}})$. The class labels $\bar{c}$ of the training examples are obtained from the response variable c. The training phase of the algorithm starts by setting $\bar{y} = \bar{c}$ for all n training examples. It then iteratively updates the regression coefficients w and class labels $\bar{y}$ according to the methodology presented in the previous section. At each stage, the value of the objective function is computed and saved for testing the convergence of the objective function. Upon convergence, an SVM classifier g is constructed to learn the mapping between the input features $(x, y)$ and the output class $\bar{y}$. Once the training phase is completed, the testing phase begins. Testing is performed by first applying the multiple linear regression model to the predictor variables $x_\tau$. This is followed by invoking the SVM classifier to predict the class labels $\bar{y}_\tau$ for the m test examples; the classifier takes $x_\tau$ and $y_\tau$ as input and returns the class label $\bar{y}_\tau$. Finally, the prediction output is obtained by setting $z_\tau = y_\tau \bar{y}_\tau$.

The time complexity of the training phase of the algorithm is $O(k(n^2 d + d^3))$, where n is the number of training examples, d the number of predictor variables, and k the maximum number of iterations required for convergence. The computational cost of the training phase has two major parts: computing the similarity matrix and iteratively solving for w and $\bar{y}$. The time needed to compute the similarity matrix is $O(n^2 d)$. The time complexity of each iteration comprises the time needed to compute w, $O(n^2 d^2 + d^3)$, plus the time needed to compute $\bar{y}$, $O(n)$. Hence, with the maximum number of iterations set to k, the time complexity of the training phase is $O(k(n^2 d + d^3))$, where $d \ll n$.

3.5 Experimental Evaluation

This section presents experimental results that demonstrate the effectiveness of the proposed framework.

3.5.1 Experimental Setup

The ICR algorithm was run on climate data obtained for 37 weather stations in Canada from the Canadian Climate Change Scenarios Network Web site [1]. The response variable to be regressed corresponds to the daily precipitation values measured at each weather station. The predictor variables correspond to 26 coarse-scale climate variables derived from the NCEP Reanalysis data set, which include measurements of airflow strength, sea-level pressure, wind direction, vorticity, and humidity, as shown in Table 3.1. The data span a 40-year period, 1961 to 2000. The time series was truncated for each weather station to exclude days for which the precipitation values are missing.
Table 3.1: List of predictor variables for precipitation prediction.

    Predictor Variables
    Mean sea level pressure          Surface zonal velocity
    Surface airflow strength         Surface meridional velocity
    Surface vorticity                Surface wind direction
    Surface divergence               Mean temp at 2m
    500 hPa airflow strength         850 hPa airflow strength
    500 hPa zonal velocity           850 hPa zonal velocity
    500 hPa meridional velocity      850 hPa meridional velocity
    500 hPa vorticity                850 hPa vorticity
    500 hPa geopotential height      850 hPa geopotential height
    500 hPa wind direction           850 hPa wind direction
    500 hPa divergence               850 hPa divergence
    Relative humidity at 500 hPa     Relative humidity at 850 hPa
    Near surface relative humidity   Surface specific humidity

The performance of the ICR algorithm was compared against the multiple linear regression (MLR) model and an approach that combines SVM and MLR (SVM-MLR). MLR uses the least-squares criterion to estimate the weight vector w of the model. In SVM-MLR, an SVM is used to learn a classifier that differentiates between Rain and NoRain days, and an MLR model is learned on rain days only; for a given test set, the MLR model is applied only to those days classified as Rain days. As far as the choice of SVM is concerned, the kernel (linear or RBF) and its respective parameter are chosen during the evaluation phase. The choice of the SVM kernel for ICR was limited to a linear kernel; future experiments will include a wider selection during the evaluation phase.

The following criteria were used to evaluate the performance of the models (a small sketch of their computation is given after the list):

• Root mean squared error (RMSE), which measures the difference between the actual and predicted values of the response variable, i.e.,
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (c_i - y_i)^2}{n}}$$
• Accuracy, which measures the number of Rain and NoRain days predicted correctly by the model.
• F-measure, which is the harmonic mean between the recall and precision values for Rain days.
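A minimal Python sketch of these three criteria, as referenced above; the rain/no-rain threshold of exactly zero follows the definition of $\bar{c}$ in Section 3.2.

    import numpy as np

    def evaluate(c_true, z_pred):
        """RMSE on the response values, plus accuracy and F-measure on
        the derived Rain/NoRain labels."""
        rmse = np.sqrt(np.mean((c_true - z_pred) ** 2))
        rain_true, rain_pred = c_true > 0, z_pred > 0
        accuracy = np.mean(rain_true == rain_pred)
        tp = np.sum(rain_true & rain_pred)
        precision = tp / max(np.sum(rain_pred), 1)
        recall = tp / max(np.sum(rain_true), 1)
        f_measure = 2 * precision * recall / max(precision + recall, 1e-12)
        return rmse, accuracy, f_measure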
3.5.2 Experimental Results

The purpose of the experiments is to demonstrate the following:

1. The limitations of classical regression models in terms of handling zero-inflated time series data.
2. A performance comparison between classical regression models and the proposed framework.

3.5.2.1 Effect of Zero-Inflated Time Series Data

The objective of this experiment is to demonstrate the effect of an increasing number of zeros in a time series on the performance of a regression model. Specifically, given the precipitation time series of a randomly selected weather station, each day was classified as Rain or NoRain, depending on whether the amount of precipitation received is greater than zero or equal to zero. Several training sets of different sizes and with varying percentages of NoRain and Rain days were created by randomly sampling the original time series. A disjoint test set of size ten years is used for all the experiments in this subsection. The performance of two multiple linear regression (MLR) models was evaluated: (1) MLR1, which is trained on both Rain and NoRain days, and (2) MLR2, which is trained on Rain days only. Figure 3.3 compares the RMSE values of both models for Rain days in the test set. The horizontal axis corresponds to the ratio of NoRain to Rain days in the training set; the larger the ratio, the more inflated the number of zeros in the training data. The vertical axis corresponds to the training set size, where each unit on the scale represents a period of three months. The value of each cell indicates the performance improvement when using MLR2 to predict the Rain days:

$$\%\text{Improvement} = \frac{\mathrm{RMSE}(\mathrm{MLR}_1) - \mathrm{RMSE}(\mathrm{MLR}_2)}{\mathrm{RMSE}(\mathrm{MLR}_1)} \qquad (3.3)$$

[Figure 3.3: Effect of increasing the number of NoRain days on the performance of the regression model (best viewed in color).]

[Figure 3.4: The cumulative distribution function of multiple linear regression output for a zero-inflated precipitation response variable.]

Since the % improvement is greater than or equal to zero, MLR2 consistently outperforms MLR1 in terms of predicting future Rain days, irrespective of the training set size. The amount of improvement becomes even more pronounced as the percentage of NoRain days in the training data increases. A similar improvement pattern is observed for all the weather stations investigated in this study, as shown in Figure 3.5. In contrast, MLR1, which is trained on both Rain and NoRain days, has a lower RMSE compared to MLR2 when applied to all the days in the test set, as shown in Figure 3.6. This is because MLR2 tends to overestimate the amount of precipitation for the NoRain days. In summary, the experiment in this section clearly justifies the rationale for applying a combination of classification and regression models to better estimate the precipitation amount of Rain days.

[Figure 3.5: Comparison of RMSE values (tested on Rain days only) for MLR models trained on all days compared with models trained only on Rain days.]

[Figure 3.6: Comparison of RMSE values (tested on all days) for MLR models trained on all days compared with models trained only on Rain days.]

3.5.2.2 Impact of Coupling the Classifier and Regression Model Creation

The objective of this experiment is to demonstrate the advantage of building the classifier and regression model in conjunction with each other, as opposed to building them independently, for zero-inflated time series data. Specifically, empirical results are provided demonstrating improvements in classification accuracy, the F-measure of classification, and the RMSE of the predictions. The performance of two multiple linear regression models was evaluated and compared. In the first model, MLR is trained on all days and a quadratic discriminant analysis (QDA) classifier is trained on the ground-truth response variable. In the second model, MLR is again trained on all days, but the QDA classifier is trained on the predicted response values y = w^T x. The results of the experiment show that the model trained on the predicted response values outperformed the model trained on the ground-truth response variable for all 37 stations, in terms of RMSE, classification accuracy, and F-measure. In particular, the average improvements in RMSE and classification accuracy were 13.4% and 19.3%, respectively. In summary, these empirical results provide motivation to integrate the classifier and regression models so that each takes into consideration the accuracy of the other's prediction for each individual data point.
3.5.2.3 Performance Comparison

This section compares the RMSE, accuracy, and F-measure values of the predicted response variable (precipitation) for the proposed supervised framework (ICR) against multiple linear regression (MLR), SVM-MLR (a model that combines MLR and SVM), and classification and regression trees (CART). All the experiments were performed using a training size (n) of 3 years, starting from the first observation in the time series. The test set size (m) was fixed at 1 year. After calculating the RMSE on the test set, the training set was shifted by 3 years, such that it now occupied the data used for testing in the previous iteration. The experiment was repeated 5 times for each station, and the RMSE values reported in this section are the mean values over all 5 iterations. The same approach was used to compute the RMSE values for Rain days, accuracy (for all days), F-measure for Rain days only, and F-measure for NoRain days only. Results for the 37 weather stations comparing ICR with both MLR and SVM-MLR are presented. The classification accuracy of MLR, and the F-measures related to it, are not plotted, on account of MLR not having an explicit classifier. CART fared comparatively poorly in terms of residual errors, classification accuracy, and F-measure; however, CART fared well in replicating the cumulative distribution function of the response variable, as shown later in this section.

As shown in Figures 3.7 and 3.8, the supervised model ICR significantly outperformed the MLR model (trained on all days) and the SVM-MLR model in terms of RMSE values for predicting both Rain and NoRain days. ICR outperformed MLR in 36 out of 37 stations and outperformed SVM-MLR in 30 out of 37 stations. In terms of percentage improvement in RMSE for all days, ICR showed an average 8% improvement over MLR and a 5.8% improvement over SVM-MLR. In terms of RMSE values for Rain days only, as shown in Figures 3.9 and 3.10, ICR consistently outperformed both the MLR and SVM-MLR models, outperforming MLR in 35 stations and SVM-MLR in 33 stations. In terms of the average RMSE value for Rain days only, ICR showed an improvement of 5.3% over MLR and 8.6% over SVM-MLR.

[Figure 3.7: Comparison of RMSE values (for all days) among MLR, SVM-MLR, and ICR (stations 1-19).]

[Figure 3.8: Comparison of RMSE values (for all days) among MLR, SVM-MLR, and ICR (stations 20-37).]

[Figure 3.9: Comparison of RMSE values (for Rain days) among MLR, SVM-MLR, and ICR (stations 1-19).]

[Figure 3.10: Comparison of RMSE values (for Rain days) among MLR, SVM-MLR, and ICR (stations 20-37).]

[Figure 3.11: Comparison of classification accuracy (for all days) between SVM-MLR and ICR (stations 1-19).]

[Figure 3.12: Comparison of classification accuracy (for all days) between SVM-MLR and ICR (stations 20-37).]
MLR does not inherently classify any day as Rain or NoRain; hence, a comparison between ICR and MLR with regard to classification accuracy and F-measure is not plotted. As shown in Figures 3.11 and 3.12, ICR outperformed SVM-MLR in 36 of the 37 stations and showed a 9.1% improvement in classification accuracy. At the same time, in terms of F-measure for Rain days, ICR outperformed SVM-MLR in 35 out of the 37 stations, as shown in Figures 3.13 and 3.14. Although MLR does not inherently classify any day as Rain or NoRain, the quadratic discriminant analysis (QDA) classifier mentioned earlier was trained on the MLR output; relative to this combination, ICR showed a 21.2% improvement in overall classification accuracy.

[Figure 3.13: Comparison of F-measure (for Rain days) between SVM-MLR and ICR (stations 1-19).]

[Figure 3.14: Comparison of F-measure (for Rain days) between SVM-MLR and ICR (stations 20-37).]

With regard to F-measure for NoRain days, ICR outperformed SVM-MLR in 36 stations. As shown in Figures 3.15 and 3.16, which compare the F-measure for NoRain days between SVM-MLR and ICR, ICR outperformed SVM-MLR in all but one station and showed an 8.1% improvement in F-measure results.

[Figure 3.15: Comparison of F-measure (for NoRain days) between SVM-MLR and ICR (stations 1-19).]

[Figure 3.16: Comparison of F-measure (for NoRain days) between SVM-MLR and ICR (stations 20-37).]

As shown in Figure 3.17, both ICR and SVM-MLR were able to capture the frequency of zero-valued data points in the distribution, as well as improve the shape of the cumulative distribution function compared to that of multiple linear regression. CART fared the best in terms of replicating the distribution of the zero-inflated precipitation observations, especially for the higher quantiles; however, it fared less favorably in terms of capturing the frequency of zero-valued data points, as shown in Figure 3.17. Moreover, the ICR predictions showed 28.9% and 24.2% improvements over the CART output in terms of RMSE for all days and Rain days, respectively. Similarly, the classification accuracy of ICR was 12.9% better than that of CART, and the F-measure of ICR for Rain days and NoRain days was 5.3% and 25.2% better, respectively.

[Figure 3.17: Comparing the cumulative distribution functions of the predicted values of precipitation according to the various models.]

3.6 Conclusions

This chapter presented a novel approach for predicting future values of time series data that are inherently zero-inflated. The proposed framework decouples the prediction task into two steps: a classification step to predict whether the value of the time series is zero and a regression step to estimate the magnitude of the non-zero time series value. The effectiveness of the model was demonstrated on climate data by predicting the amount of precipitation at a given station. The framework presented in this chapter assumes a linear relationship between the predictor and response variables.
The framework can also be extended to a semi-supervised learning setting, as shown in Chapter 4.

Chapter 4
Semi-Supervised Modeling of Zero-Inflated Data

This chapter presents a semi-supervised extension of the ICR framework introduced in Chapter 3. The purpose of the framework is to utilize both labeled and unlabeled data to accurately estimate the future values of a zero-inflated variable by simultaneously performing classification and regression. The regression and classification models are learned simultaneously by optimizing a unified objective function that includes a graph regularization term to ensure smoothness of the target functions and consistency between the labeled and unlabeled examples. The effectiveness of the semi-supervised learning framework is demonstrated in the context of precipitation prediction using climate data obtained from the Canadian Climate Change Scenarios Network website [1]. The proposed framework significantly outperforms regression models trained on both the zero and non-zero parts of the time series for the majority of the weather stations investigated in this study.

4.1 Preliminaries

Let $L = (X_l, \mathbf{c}_l)$ be a multivariate time series of length n, where $X_l = [x_{l_1}, x_{l_2}, ..., x_{l_n}]^T$ is a sequence of d-dimensional predictor vectors and $\mathbf{c}_l = [c_{l_1}, c_{l_2}, ..., c_{l_n}]^T$ contains the corresponding ground-truth values of the response variable. The objective of time series prediction is to learn a target function f(x, w) that best estimates the future values of the response variable, $\mathbf{c}_u = [c_{u_1}, c_{u_2}, ..., c_{u_m}]^T$, given the historical data L and the unlabeled data $X_u = [x_{u_1}, x_{u_2}, ..., x_{u_m}]^T$, where $\mathbf{w} = [w_1, w_2, ..., w_d]^T$ is the set of weights associated with the target function. $X_u$ may be obtained, for example, from computer-driven simulation models. In the semi-supervised framework proposed in this study, n represents the number of labeled training points and m the number of unlabeled training points; in the supervised framework, m represents the number of unlabeled testing points. In this study, the relative frequency of zero values in $\mathbf{c}_l$ and $\mathbf{c}_u$ is assumed to be larger than the frequency of non-zero values. Furthermore, the response variable c can be mapped into a binary class $\bar{c}$, where $\bar{c} = 1$ if c > 0 and $\bar{c} = 0$ otherwise. For brevity, the notation y ≡ f(x, w) is used for the predicted value of the response variable and $\bar{y}$ for the predicted class. Let $\mathbf{y}_u = [y_{u_1}, y_{u_2}, ..., y_{u_m}]^T$ and $\bar{\mathbf{y}}_u = [\bar{y}_{u_1}, \bar{y}_{u_2}, ..., \bar{y}_{u_m}]^T$. In the semi-supervised framework, let $\tilde{\mathbf{y}}$ be a vector of length n + m whose first n elements are initialized with the vector $\bar{\mathbf{c}}_l$ and whose remaining m elements are initialized with the vector $\bar{\mathbf{y}}_u$. In the supervised framework, as there are no unlabeled training points, $\tilde{\mathbf{y}}$ is a vector of length n and is initialized with $\bar{\mathbf{c}}_l$ only.

4.2 Semi-Supervised Framework for Simultaneous Classification and Regression

Unlike ICR, the semi-supervised extension modifies the graph regularization term so that it is summed over all n + m data points, where the m additional points are the unlabeled data. In the remainder of this chapter, the supervised and semi-supervised versions of the algorithm are denoted ZICR-S and ZICR-SS, respectively (where ZICR stands for Zero-Inflated Classification-Regression method). For brevity, only linear regression models were considered, where f(x, w) = w^T x; extending the approach to nonlinear models will be a subject for future research.
The goal is to simultaneously estimate the values of the weight parameters w and the class labels $\bar{\mathbf{y}}$ that minimize the following objective function:

$$\arg\min_{\mathbf{w}, \bar{\mathbf{y}}} f(\mathbf{w}) = \sum_{i=1}^{n} \bar{c}_i (c_i - y_i \bar{y}_i)^2 + T_1 \sum_{i=1}^{n} (\bar{y}_i - \bar{c}_i)^2 + T_2 \sum_{i=1}^{n+m} \sum_{j=1}^{n+m} s_{i,j} \left[ \tilde{y}_i\, y_i - \tilde{y}_j\, y_j \right]^2 + T_3 \|\mathbf{w}\|^2$$

where $y_i = \sum_d x_{i,d}\, w_d$.

Intuitively, the first term of the objective function is equivalent to the least-squares formulation of multiple linear regression, except that the estimation of w is performed based on the rain days only. The second term of the objective function measures the classification accuracy on the training data. The third term computes the sum of squared differences in the predicted response values for every pair of data points, weighted by the similarity of their predictor variables; this represents a graph regularization constraint that ensures smoothness of the objective function and can be used to extend the framework to a semi-supervised learning setting. Unlike in ICR, the observed class label in the graph regularizer is replaced in ZICR by its expected value $\tilde{y}$, so that the term can be extended to include unlabeled data points. Finally, the last term of the objective function is equivalent to the L2 norm used in ridge regression models to penalize models that have many large non-zero weights.

Note that each data point corresponds to a given time period in the time series. The similarity matrix S is computed using the Pearson correlation coefficient between every pair of data points in X. Prior to computing the similarity matrix, each attribute of the data set is standardized by subtracting the mean value of the attribute and then dividing by its corresponding standard deviation; this standardization is done to account for differences in the variances of the various attributes. The Pearson correlation value is then transformed to range between 0 and 1. The choice of Pearson correlation as the similarity measure is due to the popularity of the measure in the Earth science domain. The purpose of the similarity function is to identify how closely related two data points are to one another, and to use this information when creating the regression model, which gives more credence to closeness in the predicted amount of precipitation for data points that are similar than for data points that are dissimilar. As the similarity function ranges between 0 and 1, dissimilar data points have limited impact on the error function, while similar data points that differ significantly in the amount of predicted precipitation have the largest impact. The model further emphasizes data points that are categorized as rain events by using '0' and '1' as class labels, such that '0' is assigned to days categorized as NoRain days and '1' to Rain days. The supervised version of the framework is obtained by considering only the labeled training examples in the third term of the objective function.
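A Python sketch of the modified graph term over the n + m labeled and unlabeled points, with $\tilde{y}$ as described above; it is a transcription under the same hedges as the earlier objective sketch, with the extended similarity matrix assumed given.

    import numpy as np

    def ssl_graph_term(w, ytilde, X_all, S_all, T2):
        """Graph regularizer summed over labeled and unlabeled points.
        ytilde holds cbar for the n labeled points and the current
        estimated class labels for the m unlabeled points."""
        y = X_all @ w                    # predictions for all n + m points
        g = ytilde * y                   # expected-label weighting
        return T2 * np.sum(S_all * (g[:, None] - g[None, :]) ** 2)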
An iterative procedure was employed to solve the objective function. First, the partial derivative of f(w) with respect to each $w_k$ is computed and set to zero:

$$\frac{\partial f}{\partial w_k} = 2T_2 \sum_{i,j=1}^{n+m} s_{i,j}\, \tilde{y}_i \Big(\sum_d w_d x_{i,d}\Big) \tilde{y}_i\, x_{i,k} + 2T_2 \sum_{i,j=1}^{n+m} s_{i,j}\, \tilde{y}_j \Big(\sum_d w_d x_{j,d}\Big) \tilde{y}_j\, x_{j,k} - 2T_2 \sum_{i,j=1}^{n+m} s_{i,j}\, \tilde{y}_i \tilde{y}_j \sum_d w_d (x_{i,d}\, x_{j,k} + x_{i,k}\, x_{j,d}) - 2 \sum_{i=1}^{n} \bar{c}_i \Big(c_i - \bar{y}_i \sum_d w_d x_{i,d}\Big) x_{i,k} + 2T_3\, w_k = 0$$

This reduces to a system of linear equations of the form A x = b, where $\mathbf{x} = [w_1, w_2, ..., w_d]^T$,

$$b_k = \sum_{i=1}^{n} \bar{c}_i\, c_i\, x_{i,k}$$

and A is a square matrix of dimension d × d whose off-diagonal elements are

$$A_{k,l} = 2T_2 \sum_{i,j=1}^{n+m} s_{i,j}\, \tilde{y}_i\, x_{i,l}\, x_{i,k} - 2T_2 \sum_{i,j=1}^{n+m} s_{i,j}\, \tilde{y}_i \tilde{y}_j\, x_{i,l}\, x_{j,k} + \sum_{i=1}^{n} \bar{c}_i\, \bar{y}_i\, x_{i,l}\, x_{i,k}$$

and whose diagonal elements are

$$A_{k,k} = 2T_2 \sum_{i,j=1}^{n+m} s_{i,j}\, \tilde{y}_i\, x_{i,k}^2 - 2T_2 \sum_{i,j=1}^{n+m} s_{i,j}\, \tilde{y}_i \tilde{y}_j\, x_{i,k}\, x_{j,k} + \sum_{i=1}^{n} \bar{c}_i\, \bar{y}_i\, x_{i,k}^2 + T_3$$

Next, quadratic discriminant analysis (QDA) is applied to the predicted response values y = w^T x to estimate the class labels of the unlabeled data points. The updated class labels $\bar{\mathbf{y}}$ are then used to re-estimate the regression weights w. This procedure is repeated until convergence. A summary of the framework is presented in Algorithm 1.

Algorithm 1: Concurrent semi-supervised regression and classification using simultaneous equations solved iteratively.
Input: X, an (n + m) × d matrix of NCEP weather data; $\bar{\mathbf{c}}$, an n-dimensional vector of class labels (1 = Rain, 0 = NoRain); c, an n-dimensional vector of precipitation values for each day.
Output: w, a d-dimensional vector of weights; $\bar{\mathbf{y}}$, an (n + m)-dimensional vector of class labels; y, an (n + m)-dimensional vector of regressed precipitation amounts for each day.
Method: Partition the data three ways (training, evaluation, and test).
1) Perform MLR on the training set (size n) to get w.
2) Use w on the testing set (size m) to get $y_i$.
3) Calculate the objective function error using the present w and save the value.
4) Perform quadratic discriminant analysis (QDA) on $y_i$ to get $\bar{y}_i$.
5) In the semi-supervised approach (ZICR-SS), initialize $\tilde{\mathbf{y}}$ to $\bar{\mathbf{c}}$ for the first n data points and initialize the remaining m points of $\tilde{\mathbf{y}}$ with $\bar{\mathbf{y}}$ from step 4. In the supervised approach (ZICR-S), $\tilde{\mathbf{y}}$ is initialized to $\bar{\mathbf{c}}$ only.
6) Solve for w using the d equations obtained after differentiating the objective function f(w).
7) Having solved for w, solve for y using the linear equation y = Xw.
8) Apply QDA to find the class labels $\bar{\mathbf{y}}$ for the training data points.
9) Calculate the objective function error using the present w.
10) For a fixed number of iterations (e.g., 10), or based on the convergence of the objective function, repeat steps 4 to 9.
11) Evaluate the model by computing the RMSE on the test data set.

4.3 Experimental Evaluation

This section presents experimental results that demonstrate the effectiveness of the proposed framework.

4.3.1 Experimental Setup

The setup of the experiments discussed in this chapter is similar to the experimental setup described in Chapter 3. The performance of the algorithm was compared against the multiple linear regression (MLR) model, which uses the least-squares criterion to estimate the weight vector w. The following criteria were used to evaluate the performance of the models:

• Root mean squared error (RMSE), which measures the difference between the actual and predicted values of the response variable, i.e.,
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (c_i - y_i)^2}{n}}$$
• Accuracy, which measures the number of Rain and NoRain days predicted correctly by the model.
• F-measure, which is the harmonic mean between the recall and precision values for Rain days.

4.3.2 Experimental Results

The purpose of the experiments was to demonstrate the following:

1. The limitations of classical regression models in terms of handling zero-inflated time series data.
2. The rationale for incorporating unlabeled data for precipitation prediction.
3. A performance comparison between classical regression models and the proposed framework.

4.3.2.1 Rationale for Incorporating Unlabeled Data

The objective of this section is to demonstrate the utility of incorporating unlabeled data for semi-supervised learning in precipitation prediction. Previous studies have shown that unlabeled data are helpful as long as their distribution is similar to that of the labeled training data [19, 36]. For climate data, this study showed that the natural periodic behavior of the predictor variables and the response variable (precipitation) provides an opportunity to leverage unlabeled data to improve precipitation prediction. Figure 4.1 shows the average similarity values of the predictor variables over time. The horizontal axis corresponds to the width between two time periods in the time series, while the vertical axis corresponds to the average similarity over all pairs of time periods separated by that width. For example, consider a time series of length 10,000 days: to compute the average similarity at a width of 3 months, the similarity of the predictor variables on days 1 and 91, days 2 and 92, and so on, is computed, using Pearson correlation as the similarity measure (a small sketch of this computation is given below). The plot shows clear cycles in the average similarity values demarcated by years; i.e., the predictor variables for a given day are more similar to observations that are 1, 2, or 3 years apart than to observations that are 1.5, 2.5, or 3.5 years apart. Figure 4.1 shows that this trend in the similarity of observations holds even for separations as large as 30 years. More subtle cycles of about a decade and a half were also observed. One encouraging observation is that the similarity of the predictors decays very slowly with time, which supports the notion that predictor variables separated by large time differences still contain useful information that can be exploited for predicting future precipitation events. One caveat is that, although the similarity of the predictor variables may not change much over time, the similarity of the relationship between the predictor and response variables tends to decrease at a much faster rate, as shown in Figure 4.3, in which the vertical axis represents the product of the similarity between predictor variables and the similarity between response variables for two time periods of a given width. This is because, as shown in Figure 4.2, although the similarity of precipitation is periodic in nature, it fluctuates more rapidly over time than the similarity of the predictor variables.

[Figure 4.1: Similarity of predictor variables for all pairs of time periods of a given width (quarterly intervals).]
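The averaged lag-similarity computation referenced above can be sketched as follows; the quarterly lag grid and the Pearson-based similarity come from the description in the text, while the array shapes and the 91-day quarter are assumptions of this sketch.

    import numpy as np

    def mean_similarity_at_lag(X, lag):
        """Average Pearson correlation between rows X[t] and X[t + lag],
        i.e., the mean similarity of predictor vectors separated by `lag` days."""
        a, b = X[:-lag], X[lag:]
        a = (a - a.mean(axis=1, keepdims=True)) / (a.std(axis=1, keepdims=True) + 1e-12)
        b = (b - b.mean(axis=1, keepdims=True)) / (b.std(axis=1, keepdims=True) + 1e-12)
        return np.mean(np.sum(a * b, axis=1) / X.shape[1])

    # Quarterly lags out to roughly 30 years, as in Figure 4.1, where X is an
    # (n_days, n_variables) array of standardized NCEP predictor values:
    # curve = [mean_similarity_at_lag(X, q * 91) for q in range(1, 121)]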
[Figure 4.2: Similarity of the response variable for all pairs of time periods of a given width.]

[Figure 4.3: Similarity of the relationship between predictors and predictands over time.]

4.3.2.2 Performance Comparison

This section compares the RMSE, accuracy, and F-measure values of the proposed supervised (ZICR-S) and semi-supervised (ZICR-SS) frameworks against the precipitation prediction results of multiple linear regression (MLR). All the experiments were performed using a training size (n) of 3 years, starting from the first observation in the time series. The test set size (m) was also fixed at 3 years. After calculating the RMSE on the test set, the training set was shifted by 3 years, such that it now occupied the data used for testing in the previous iteration. The experiment was repeated 7 times for each station, and the RMSE values reported in this section are the mean values over all 7 iterations. The same approach was used to compute the RMSE values for Rain days, accuracy (for all days), F-measure for Rain days only, and F-measure for NoRain days only. Due to space restrictions, results are shown for 20 weather stations.

As shown in Figure 4.4, both models, ZICR-S and ZICR-SS, significantly outperformed the MLR model (trained on all days) in terms of RMSE values for predicting both Rain and NoRain days.

[Figure 4.4: Comparison of RMSE values (for all days) among MLR, ZICR-S, and ZICR-SS.]

The supervised version of the approach outperformed MLR for all 37 stations, while the semi-supervised approach outperformed MLR in 34 out of the 37 stations. In terms of percentage improvement, the RMSE of MLR was on average 8.8% and 8.4% worse than that of ZICR-S and ZICR-SS, respectively. ZICR-S outperformed ZICR-SS in 22 out of the 37 stations. In terms of RMSE values for Rain days only, as shown in Figure 4.5, MLR had an average RMSE that was 4.9% and 5.2% higher than that of ZICR-S and ZICR-SS, respectively. Both ZICR-S and ZICR-SS consistently outperformed the MLR model, with ZICR-S outperforming it in 34 stations and ZICR-SS in 32 stations; ZICR-S outperformed ZICR-SS in 21 out of the 37 stations.

[Figure 4.5: Comparison of RMSE values (for Rain days) among MLR, ZICR-S, and ZICR-SS.]

[Figure 4.6: Comparison of classification accuracy (for all days) among MLR, ZICR-S, and ZICR-SS.]

Although MLR does not inherently classify any day as Rain or NoRain, the quadratic discriminant analysis (QDA) classifier used in the framework was trained on the MLR outputs to compare its classification accuracy and F-measure against those of ZICR-S and ZICR-SS. As shown in Figure 4.6, all three models (ZICR-S, ZICR-SS, and MLR) were comparable in terms of classification accuracy, with ZICR-SS outperforming MLR in approximately 60% of the stations.
Nevertheless, in terms of F-measure for Rain days, both models consistently outperformed MLR, as shown in Figure 4.7, with ZICR-S outperforming MLR in 32 stations and ZICR-SS in 33 stations.

[Figure 4.7: Comparison of F-measure (for Rain days) among MLR, ZICR-S, and ZICR-SS.]

Figure 4.8 shows the comparison of F-measure for NoRain days among MLR, ZICR-S, and ZICR-SS.

[Figure 4.8: Comparison of F-measure (for NoRain days) among MLR, ZICR-S, and ZICR-SS.]

4.4 Conclusions

This chapter elaborated on extending the ICR framework detailed in Chapter 3 to a semi-supervised learning setting, and compared the performance of the framework when it utilizes both unlabeled and labeled data during training against using only labeled data.

Chapter 5
Modeling Conditional Quantiles

This chapter, as well as the following chapter, elaborates on the importance of accurately predicting the frequency, timing, and magnitude of extreme values in the distribution of the response variable. Specifically, a semi-supervised framework for smoothed quantile regression (LSSQR) is presented that focuses on the accurate prediction of extreme values without significantly degrading the sum of squared residual errors.

5.1 Introduction

An integral part of climate modeling is downscaling, which seeks to project future scenarios of the local climate based on the coarse-resolution outputs produced by global climate models (GCMs). Two of the more common approaches to downscaling are dynamic downscaling and statistical downscaling. Dynamic downscaling uses a numerical meteorological model to simulate the physical dynamics of the local climate while utilizing the climate projections from GCMs as initial boundary conditions. Though it captures the geographic details of a region unresolved by GCMs, the simulation is computationally demanding, and its spatial resolution remains too coarse for many climate impact assessment studies. Statistical downscaling establishes a mathematical relationship between the coarse-scale GCM outputs and the fine-scale local climate variables based on observation data. Unlike dynamic downscaling, it is flexible enough to incorporate any predictor variable and is relatively inexpensive.

Most statistical downscaling approaches employ regression methods such as multiple linear regression, ridge regression, and neural networks to estimate the conditional mean of the future climate conditions. These methods are ill-suited for predicting extreme values of the climate variables. An alternative approach is to use techniques such as quantile regression, which aims to minimize an asymmetrically weighted sum of absolute errors, to estimate the particular quantile that corresponds to extreme values [77]. Unfortunately, quantile regression tends to overestimate the response variable, resulting in a large number of data points being falsely predicted to be extreme. Figure 5.1 shows the histogram of the distribution of observed temperature at a weather station in Canada.
The lines represent the distribution of the predicted values of temperature obtained using multiple linear regression (MLR) and quantile regression. An observation is considered an extreme data point if its response variable is in the top 5th percentile of observations. The shape of the tail of the distribution that represents extreme data points (observed and projected) is shown in Figure 5.2. It is clear from the figures that methods such as multiple linear regression that estimate the conditional mean tend to underestimate the tail of the observed probability distribution, while quantile regression overestimates the tail of the distribution. As elaborated in Section 5.4, it was found that for the 37 stations evaluated, quantile regression predicted a data point to be extreme, on average, more than twice as frequently as the actual frequency of observed extreme data points.

To address this overestimation, a method known as smoothed quantile regression (LSQR) is proposed, which reduces the absolute error of extreme data points by introducing a smoothing term that brings the predicted response value of extreme points closer to the value corresponding to the percentile of extreme data points. This smoothing term also provides a means to easily extend the objective function to a semi-supervised learning setting (LSSQR). Semi-supervised learning, in addition to using the training data, can also use the distribution characteristics of the predictor variables of the test set to glean a better estimate of the distribution of the data upon which the model will be applied.

[Figure 5.1: Histogram of observed temperature, overlaid with the distributions of predictions from MLR, quantile regression, and LSSQR.]

[Figure 5.2: Tail of the histogram of observed temperature and the corresponding predicted distributions.]

In summary, the main contributions of this chapter are as follows:

• Demonstrating the limitation of MLR, ridge regression, and quantile regression in predicting extreme values.

• Presenting a smoothed quantile regression framework for extreme value prediction.

• Extending the framework to a semi-supervised setting.

• Demonstrating the efficacy of the learning framework on climate data (temperature) obtained from the Canadian Climate Change Scenarios Network website [1]. Both the supervised and semi-supervised frameworks outperformed the baseline methods in 85% of the 37 stations evaluated, in terms of the magnitude, frequency, and timing of extreme events.

5.2 Preliminaries

Let $D_l = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ be a labeled dataset of size $n$, where each $\mathbf{x}_i \in \mathbb{R}^d$ is a vector of predictor variables and $y_i \in \mathbb{R}$ is the corresponding response variable. Similarly, $D_u = \{\mathbf{x}_i\}_{i=n+1}^{n+m}$ corresponds to the unlabeled dataset. The objective of regression is to learn a target function $f(\mathbf{x}, \beta)$ that best estimates the response variable $y$, where $\beta$ is the parameter vector of the target function. $n$ represents the number of labeled training points and $m$ the number of unlabeled test points.

5.2.1 Multiple Linear Regression (MLR) and Ridge Regression

One of the most widely used forms of regression is multiple linear regression. It solves a linear model of the form
$$y = \mathbf{x}^T\beta + \epsilon$$
where $\epsilon \sim N(0, \sigma^2)$ is an i.i.d. Gaussian error term with variance $\sigma^2$ and $\beta \in \mathbb{R}^d$ is the parameter vector.
MLR minimizes the sum of squared residuals
$$(\mathbf{y} - X\beta)^T(\mathbf{y} - X\beta)$$
which leads to the closed-form solution
$$\hat{\beta} = (X^TX)^{-1}X^T\mathbf{y}$$
A variant of MLR, called ridge regression or Tikhonov regularization, is often used to mitigate overfitting. Ridge regression also provides a formulation that overcomes the hurdle of a singular covariance matrix $X^TX$ that MLR may face during optimization. Unlike the loss function of MLR, the loss function for ridge regression is
$$(\mathbf{y} - X\beta)^T(\mathbf{y} - X\beta) + \lambda\beta^T\beta$$
and its corresponding closed-form solution is
$$\hat{\beta} = (X^TX + \lambda I)^{-1}X^T\mathbf{y}$$
where the ridge coefficient $\lambda > 0$ ensures that the matrix $X^TX + \lambda I$ is non-singular and therefore always invertible. The problem with both MLR and ridge regression is that they model the conditional mean, which is not well suited for predicting extremes.

5.2.2 Quantile Linear Regression (QR)

The $\tau$th quantile of a random variable $Y$ is given by
$$Q_Y(\tau) = F_Y^{-1}(\tau) = \inf\{y : F_Y(y) \geq \tau\}$$
where $F_Y(y) = P(Y \leq y)$ is the distribution function of the real-valued random variable $Y$ and $\tau \in [0, 1]$. Unlike MLR, which estimates the conditional mean, quantile regression estimates a conditional quantile (e.g., the median) of $Y$. To estimate the $\tau$th conditional quantile $Q_{Y|X}(\tau)$, quantile regression minimizes an asymmetrically weighted sum of absolute errors. More specifically, the loss function of linear quantile regression is
$$\sum_{i=1}^{n} \rho_\tau(y_i - \mathbf{x}_i^T\beta)$$
where
$$\rho_\tau(u) = \begin{cases} \tau u & u > 0 \\ (\tau - 1)u & u \leq 0 \end{cases}$$
Unlike MLR and ridge regression, which have closed-form solutions, quantile regression is often solved using optimization methods such as linear programming, by converting the problem to the following form:
$$\min_{\mathbf{u},\mathbf{v}} \; \tau\mathbf{1}_n^T\mathbf{u} + (1 - \tau)\mathbf{1}_n^T\mathbf{v} \quad \text{s.t.} \quad \mathbf{y} - X\beta = \mathbf{u} - \mathbf{v}$$
where $u_i \geq 0$ and $v_i \geq 0$. But as shown in Figures 5.1 and 5.2, quantile regression often overestimates data points, resulting in too many falsely predicted extreme events.
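For concreteness, the linear program above can be handed to any generic LP solver. The following minimal sketch is an illustration only (not the interior-point implementation of [76] used in the experiments later in this chapter); it stacks the decision vector as $[\beta, \mathbf{u}, \mathbf{v}]$ and solves the problem with SciPy:

    import numpy as np
    from scipy.optimize import linprog

    def quantile_regression(X, y, tau):
        """Linear quantile regression via the LP formulation above.

        Decision vector z = [beta (free), u (>= 0), v (>= 0)], with the
        equality constraint X beta + u - v = y and objective
        tau * 1'u + (1 - tau) * 1'v.
        """
        n, d = X.shape
        c = np.concatenate([np.zeros(d), tau * np.ones(n), (1 - tau) * np.ones(n)])
        A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
        bounds = [(None, None)] * d + [(0, None)] * (2 * n)
        res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
        return res.x[:d]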
5.3 Framework for Smoothed Quantile Regression

Given that the primary objective of the model is to accurately regress extreme-valued data points, and that quantile regression has been shown to perform relatively better than its least-squares counterparts, which tend to underestimate the frequency and magnitude of extreme data points, the proposed frameworks are modeled around linear quantile regression. Section 5.3.1 describes smoothed quantile regression (LSQR) and its objective function. Section 5.3.2 proposes a semi-supervised extension of LSQR, which is then followed by mathematical properties of the behavior of the objective function.

5.3.1 Smoothed Quantile Regression (LSQR)

A quantile-based linear regression model is proposed, based on the assumption of smoothness, i.e., data points whose predictor variables are similar should have similar responses. The notion of smoothness is an integral part of the framework, as the experiments in Section 5.4 demonstrate this characteristic in the dataset used. The smoothness assumption can be described as the constraint
$$\sum_{i,j}^{n} w_{ij}(f_i - f_j)^2 < c$$
where $w_{ij}$ is a measure of similarity between data points $i$ and $j$, $f$ is the predicted value of the response variable, and $c$ is a constant.

Also, since the framework does not restrict the training set to only extreme data points, the smoothing component of the objective function tends to implicitly cluster data points, resulting in a better distinction between the response variables of extreme-valued and non-extreme-valued data points. Empirical results comparing supervised quantile regression to the proposed semi-supervised model illustrate this point, as shown in Section 5.4. The term
$$w_{ij} = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{\sigma}\right), \quad i, j \in [1, 2, \ldots, n]$$
is equivalent to the radial basis function and is used to capture the similarity between the predictor variables of data points $i$ and $j$. $\sigma$ is a scale parameter used to control the distance above which two data points are no longer considered highly coupled. Assuming a linear model, $f(\mathbf{x}_i, \beta) = \mathbf{x}_i^T\beta$, the smoothing term can be reformulated as
$$\sum_{i,j}^{n} w_{ij}\big(f(\mathbf{x}_i, \beta) - f(\mathbf{x}_j, \beta)\big)^2 = \mathbf{f}^T\Delta\mathbf{f} = \beta^T\Sigma\beta$$
where $\Sigma = X^T\Delta X$, $\Delta = D - W$, $W = \{w_{ij}\}_{i,j=1}^{n}$, and $D$ is a diagonal matrix such that $D_{ii} = \sum_{j=1}^{n} w_{ij}$. Coupling the smoothing term with the objective function of linear quantile regression, we end up with the following optimization problem (a minimal solver sketch is given at the end of this subsection):
$$\min_\beta \sum_{i=1}^{n} \rho_\tau(y_i - \mathbf{x}_i^T\beta) + \lambda\beta^T\Sigma\beta$$
As can be observed from the objective function of LSQR, $\lambda \to 0$ results in an estimate similar to linear quantile regression, while $\lambda \to \infty$ results in the estimate of the response variable converging towards the target quantile of the data. This is because a large $\lambda$ penalizes any non-zero difference between $f_i$ and $f_j$ very harshly, thereby minimizing the error by setting $f_i = \alpha, \forall i \in [1, 2, \ldots, n]$, which reduces the error from the second component of the objective to 0. This reduces the loss function to
$$f(\beta) = \sum_{i=1}^{n} \rho_\tau(y_i - \alpha), \quad \beta = (\alpha, 0, 0, \ldots, 0)^T$$
A formal proof is provided in the following theorem.

Theorem 1: $f(\mathbf{x}_i, \beta) \to y_{(n\tau)}$ as $\lambda \to \infty$, $\forall i \in [1, 2, \ldots, n]$.

Proof: Let $y_{(i)}$ be the $i$th smallest element among $\{y_k\}_{k=1}^{n}$, and let $y_{(i)} < \alpha_i \leq y_{(i+1)}$. When $\lambda \to \infty$, all predictions equal $\alpha_i$, so the smoothing term vanishes and the loss function can be rewritten in terms of $\alpha_i$ as
$$\sum_{k=1}^{i} (1 - \tau)(\alpha_i - y_{(k)}) + \sum_{k=i+1}^{n} \tau(y_{(k)} - \alpha_i)$$
which is equivalent to minimizing
$$\tau\sum_{k=1}^{n} y_{(k)} - \sum_{k=1}^{i} y_{(k)} - (n\tau - i)\alpha_i$$
or maximizing
$$l_i = \sum_{k=1}^{i} y_{(k)} + (n\tau - i)\alpha_i$$
Therefore,
$$l_j - l_{j-1} = y_{(j)} - \alpha_{j-1} + (n\tau - j)(\alpha_j - \alpha_{j-1})$$
Hence, $\forall j : j \leq n\tau$, $l_j - l_{j-1} \geq 0$, since $(y_{(j)} - \alpha_{j-1})$, $(n\tau - j)$, and $(\alpha_j - \alpha_{j-1})$ are all $\geq 0$. Similarly, $\forall j : j \geq n\tau$,
$$l_j - l_{j+1} = \alpha_{j+1} - y_{(j+1)} + (n\tau - j)(\alpha_j - \alpha_{j+1}) \geq 0$$
Hence, if $\exists i : i = n\tau$, then $\alpha = y_{(n\tau)}$. But if $i < n\tau < i + 1$, then $\alpha$ lies in the interval $[y_{(i)}, y_{(i+1)}]$.

Figure 5.3 is a plot that tracks the values of $\beta$ for different $\lambda$ values. The figure shows that the regression parameter vector $\beta$ converges to $(\alpha, 0, 0, \ldots, 0)^T$ as $\lambda$ increases, where $\beta_0$ is the regression parameter that corresponds to the column of 1's in the design matrix. Figures 5.4 and 5.5 plot the influence of $\lambda$ on the predicted values returned by LSSQR; i.e., as the value of $\lambda$ increases, LSSQR shrinks the prediction range towards the quantile $\tau$. Figure 5.5 is a zoomed-in image, capturing the tail of Figure 5.4.

[Figure 5.3: Influence of parameter λ on the regression coefficients β in LSQR.]

[Figure 5.4: Influence of λ on the probability distribution of the predicted values obtained from LSSQR.]

[Figure 5.5: Influence of λ on the probability distribution of the predicted extreme values obtained from LSSQR.]
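The LSQR optimization problem above is non-smooth but low-dimensional; Section 5.4.2 notes that BFGS was used to solve it. The following is a minimal sketch under that setup. The function name, the use of SciPy's BFGS with numerical gradients, and the default parameter values are illustrative assumptions, not the dissertation's implementation:

    import numpy as np
    from scipy.optimize import minimize

    def fit_lsqr(X, y, tau=0.95, lam=1.0, sigma=1.0):
        """Sketch of LSQR: pinball loss plus the Laplacian smoothing
        penalty beta' Sigma beta derived above."""
        # RBF similarity w_ij between predictor vectors.
        sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        W = np.exp(-sq_dists / sigma)
        D = np.diag(W.sum(axis=1))
        Delta = D - W                      # graph Laplacian
        Sigma = X.T @ Delta @ X            # so that f' Delta f = beta' Sigma beta

        def loss(beta):
            r = y - X @ beta
            pinball = np.where(r > 0, tau * r, (tau - 1) * r).sum()
            return pinball + lam * beta @ Sigma @ beta

        beta0 = np.zeros(X.shape[1])
        return minimize(loss, beta0, method="BFGS").x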
5.3.2 Linear Semi-Supervised Quantile Regression (LSSQR)

The objective function of LSQR can be easily extended to a semi-supervised learning setting, since the smoothing factor (the second term in the equation) is independent of $y$. Therefore, by extending the range of the indices $i$ and $j$ of the smoothing term to span 1 to $n + m$, the predictor variables of the unlabeled data $X_u = [\mathbf{x}_{u1}, \ldots, \mathbf{x}_{um}]^T$ can be harvested. The objective function of LSSQR is
$$\arg\min_\beta \sum_{i=1}^{n} \rho_\tau(y_i - \mathbf{x}_i^T\beta) + \lambda\sum_{i,j}^{n+m} w_{ij}(\mathbf{x}_i^T\beta - \mathbf{x}_j^T\beta)^2$$
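The only change from LSQR is that the similarity matrix is built over the stacked labeled and unlabeled predictors, while the pinball loss still runs over labeled points only. A sketch of this extension, under the same assumptions as the LSQR sketch above (note the $O((n+m)^2)$ memory cost of the dense similarity matrix):

    import numpy as np
    from scipy.optimize import minimize

    def fit_lssqr(X_l, y_l, X_u, tau=0.95, lam=1.0, sigma=1.0):
        """Semi-supervised LSSQR sketch: the smoothing term spans indices
        1..n+m, the pinball loss only the n labeled points."""
        X_all = np.vstack([X_l, X_u])
        sq = ((X_all[:, None, :] - X_all[None, :, :]) ** 2).sum(axis=2)
        W = np.exp(-sq / sigma)

        def loss(beta):
            r = y_l - X_l @ beta
            pinball = np.where(r > 0, tau * r, (tau - 1) * r).sum()
            f = X_all @ beta
            smooth = (W * (f[:, None] - f[None, :]) ** 2).sum()
            return pinball + lam * smooth

        return minimize(loss, np.zeros(X_l.shape[1]), method="BFGS").x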
5.4 Experimental Results

In this section, the climate dataset used for statistical downscaling is described, followed by the experimental setup, which addresses inherent properties of the dataset such as its periodic nature. Once the dataset is introduced, we analyze the behavior of baseline models developed using MLR, ridge regression, and quantile regression, and contrast them with LSQR and LSSQR. The efficacy of the models in accurately capturing the magnitude, relative frequency, and timing of extreme events is measured.

5.4.1 Data

All the algorithms were run on climate data obtained at 37 weather stations in Canada, from the Canadian Climate Change Scenarios Network website [1]. The response variable to be regressed (downscaled) corresponds to daily temperature values measured at each weather station. The predictor variables for each of the 37 stations correspond to 26 coarse-scale climate variables derived from the NCEP re-analysis data set, which include measurements of airflow strength, sea-level pressure, wind direction, vorticity, and humidity, as shown in Table 5.1. The predictor variables used for training were obtained from the NCEP re-analysis data set, which spans the 40-year period 1961 to 2001. The time series was truncated for each weather station to exclude days for which temperature or any of the predictor values are missing.

Table 5.1: List of predictor variables for temperature prediction.

    500 hPa airflow strength          850 hPa airflow strength
    500 hPa zonal velocity            850 hPa zonal velocity
    500 hPa meridional velocity       850 hPa meridional velocity
    500 hPa vorticity                 850 hPa vorticity
    500 hPa geopotential height       850 hPa geopotential height
    500 hPa wind direction            850 hPa wind direction
    500 hPa divergence                850 hPa divergence
    Relative humidity at 500 hPa      Relative humidity at 850 hPa
    Near surface relative humidity    Surface specific humidity
    Mean sea level pressure           Surface zonal velocity
    Surface airflow strength          Surface meridional velocity
    Surface vorticity                 Surface wind direction
    Surface divergence                Mean temp at 2 m

5.4.2 Experimental Setup

As is well known, temperature, the response variable in our experiments, has seasonal cycles. To efficiently capture these cycles, de-seasonalization is performed prior to running the experiments. As is common practice in climatology, the data are split into 4 seasons (DJF, MAM, JJA, SON), where 'DJF' refers to the months December-January-February in the temperature time series, 'MAM' to March-April-May, 'JJA' to June-July-August, and 'SON' to September-October-November. In effect, 4 different models, corresponding to the 4 seasons, were built for each station. The training set spanned 6 years of data and the test set 12 years. During validation, the parameter $\lambda$ was selected using the RMSE score computed on extreme data points. A data point is considered extreme if its response variable is greater than the 95th percentile (Threshold-1) of the whole dataset corresponding to the station. QR was implemented using the interior-point algorithm detailed in [76]. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method was used to solve the LSQR and LSSQR optimization problems.

5.4.3 Evaluation Criteria

The evaluation metrics were selected with the intent of evaluating the different algorithms in terms of the accuracy of the predicted extreme values, the timing of the extreme events, and the frequency with which a data point is predicted to be extreme. The following metrics are used to capture these criteria for the various models:

• Root Mean Square Error (RMSE), which measures the difference in magnitude between the actual and predicted values of the response variable, i.e.:
$$\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - f_i)^2}{n}}$$
RMSE was computed on those days that were observed to be extreme data points.

• Precision and recall of extreme events are computed to measure the timing accuracy of the prediction. F-measure, the harmonic mean of recall and precision, is used as a score that summarizes the precision and recall results:
$$\text{F-measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$

• The frequency of predicting extreme data points was measured by computing the ratio of the number of data points predicted to be extreme to the number of observed extreme data points.

To summarize, RMSE is used for measuring the accuracy of the predicted magnitude of the response variable, whereas F-measure can be thought of as measuring the correctness of the timing of the extreme events, as illustrated in the sketch below.
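For clarity, the sketch below computes all three criteria; the function name and the convention that the extremeness threshold is supplied by the caller are illustrative assumptions:

    import numpy as np

    def extreme_event_scores(y_true, y_pred, threshold):
        """RMSE over observed extreme days, F-measure for the timing of
        extreme events, and the predicted-to-observed frequency ratio."""
        obs_ext = y_true > threshold
        pred_ext = y_pred > threshold

        rmse_ext = np.sqrt(np.mean((y_true[obs_ext] - y_pred[obs_ext]) ** 2))

        tp = np.sum(obs_ext & pred_ext)
        precision = tp / max(pred_ext.sum(), 1)
        recall = tp / max(obs_ext.sum(), 1)
        f_measure = 2 * recall * precision / max(recall + precision, 1e-12)
        freq_ratio = pred_ext.sum() / max(obs_ext.sum(), 1)
        return rmse_ext, f_measure, freq_ratio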
5.4.4 Baseline

We compared the performance of LSQR and LSSQR with baseline models created using multiple linear regression (MLR), ridge regression (Ridge), and quantile regression (QR). All the baselines were run for the same 37 stations and for all 4 seasons. A comparison of the proposed supervised framework (LSQR) with its semi-supervised counterpart (LSSQR) was also made; LSSQR demonstrated improved performance over LSQR for the 37 stations evaluated, as shown in Table 5.2. Table 5.2 summarizes the percentage of times LSSQR outperformed LSQR over the 4 seasons for the given 37 stations. As seen in the table, LSSQR showed improved performance in terms of both RMSE and F-measure.

Table 5.2: The relative performance of LSSQR compared with LSQR with regard to the extreme data points.

                 Win       Loss      Tie
    RMSE         68.25%    31.75%    0%
    F-measure    60.14%    37.16%    2.7%

5.4.5 Results

As mentioned earlier, experiments were run separately using each of the baseline approaches, LSQR, and LSSQR for the 4 seasons (DJF, MAM, JJA, SON) for each of the 37 stations. The results over all seasons and stations are summarized in Tables 5.3 and 5.4, while the individual results for each season are shown in Figures 5.6 and 5.8. Table 5.3 summarizes the relative performance of LSQR with respect to the baseline methods in terms of the RMSE of extreme data points and the F-measure of identifying extreme data points. During testing, a data point is considered extreme if its response variable is greater than the 95th percentile (Threshold-1) of the whole dataset corresponding to the station. For the purpose of analysis, results using the 95th percentile of the response variable in the training set (Threshold-2) to identify extreme data points are also summarized. The fact that the results obtained using the two thresholds are similar is an indicator that the training data captured the distribution of the response variable reasonably well.

LSQR consistently outperformed the baselines both in terms of RMSE and F-measure. It must also be noted that LSQR outperformed MLR and Ridge in terms of recall of extreme events comprehensively across each of the 37 stations and seasons.

Table 5.3: The percentage of stations for which LSQR outperformed the respective baselines, with regard to the extreme data points.

             RMSE                        F-measure
             Threshold-1   Threshold-2   Threshold-1   Threshold-2
    MLR      88.51%        89.19%        59.45%        56.08%
    Ridge    87.84%        87.84%        60.13%        58.10%
    QR       80.40%        79.05%        72.97%        79.05%

Similarly, Table 5.4 summarizes the relative performance of LSSQR with respect to the baseline methods in terms of the RMSE of extreme data points and the F-measure of identifying extreme data points. Like LSQR, LSSQR consistently outperformed the baselines both in terms of RMSE and F-measure. It must be noted that LSSQR outperformed MLR and Ridge in terms of recall of extreme events comprehensively across each of the 37 stations and seasons.

Table 5.4: The percentage of stations for which LSSQR outperformed the respective baselines, with regard to the extreme data points.

             RMSE                        F-measure
             Threshold-1   Threshold-2   Threshold-1   Threshold-2
    MLR      87.16%        87.84%        60.13%        56.75%
    Ridge    85.14%        86.49%        58.78%        59.45%
    QR       85.13%        81.76%        75.67%        81.75%

Figure 5.6 gives a breakdown of the performance of LSSQR over each of the 4 seasons at the 37 stations, using Threshold-1 to mark a data point as extreme. The figure is a bar chart of the percentage of stations at which LSSQR outperformed MLR, ridge regression, and QR in prediction accuracy for only the extreme data points in the test set. RMSE was used to compute the accuracy of each model in predicting extreme-valued data points at the 37 stations. As seen in the plot, LSSQR outperforms MLR, ridge regression, and QR in each of the four seasons across the 37 stations.

Figure 5.7 depicts the percentage of stations at which LSSQR outperformed MLR, ridge regression, and QR in terms of recall of extreme data points over the 37 stations. Again, LSSQR comprehensively outperforms MLR and ridge regression over all 37 stations and 4 seasons. But as expected, QR outperforms LSSQR in terms of recall for each of the 4 seasons, due to the overestimating nature of QR, which consequently results in poor precision, as reflected in its F-measure score. On average, quantile regression predicted a data point to be extreme more than twice as frequently as the actual frequency of observed extreme data points. In fact, QR lost out to LSSQR in 91% of the 37 stations across the 4 seasons in terms of precision of identifying extreme data points.

[Figure 5.6: Ratio of stations where LSSQR outperformed the baselines in terms of RMSE of extreme data points.]

[Figure 5.7: Ratio of stations where LSSQR outperformed the baselines in terms of recall of extreme data points.]

Figure 5.8 depicts the percentage of stations at which LSSQR outperformed MLR, ridge regression, and QR in terms of the F-measure of identifying extreme data points over the 37 stations. Again, LSSQR outperforms MLR, ridge regression, and QR for all 4 seasons.
[Figure 5.8: Ratio of stations where LSSQR outperformed the baselines in terms of F-measure of extreme data points.]

The performance improvement obtained by LSSQR in terms of predicting extreme values can be easily visualized in Figure 5.9, a plot comparing the predicted response variable of the various methods, restricted to only the extreme data points for a station. As expected, the predicted value of the response variable using multiple linear regression often underestimates the observed temperature, while quantile regression regularly overestimates it; LSSQR lies in between MLR and QR, and closer to the observed temperature.

[Figure 5.9: Prediction performance on extreme data points using MLR, Ridge, QR, and LSSQR.]

5.5 Conclusions

This chapter presents a semi-supervised framework (LSSQR) for accurately predicting the values of extreme data points. The proposed approach was applied to real-world climate data spanning 37 stations and was compared against MLR, ridge regression, and quantile regression in terms of the effectiveness the model demonstrated in identifying and predicting extreme temperatures for the given stations. The next chapter merges the intuition of the framework presented in this chapter, related to extreme values, with the integrated classification and regression framework presented in Chapter 3.

Chapter 6

Modeling Extremes in Zero-Inflated Data

This chapter extends the LSQR framework presented in Chapter 5, which emphasizes the accurate prediction of the frequency, timing, and magnitude of extreme values in the distribution of the response variable, to handle zero-inflated response variables such as daily precipitation.

6.1 Introduction

The notion of being able to foretell the occurrence of an extreme event in a time series is very appealing, especially in domains where significant ramifications are associated with the occurrence of an extreme event. Predicting pandemics in an epidemiological domain or forecasting natural disasters in a geological and climatic environment are examples of applications that give importance to the detection of extreme events. Unfortunately, the accurate prediction of the timing and magnitude of such events is a challenge, given their low occurrence rate. Moreover, the prediction accuracy depends on the regression method used as well as on the characteristics of the data. On the one hand, standard regression methods such as the generalized linear model (GLM) emphasize estimating the conditional expected value, and thus are not best suited for inferring extremal values. On the other hand, methods such as quantile regression are focused on estimating the confidence limits of the prediction, and thus may overestimate the frequency and magnitude of the extreme events. Though methods for inferring extreme value distributions do exist, combining them with other predictor variables for prediction purposes remains a challenging research problem.

Standard regression methods typically assume that the data conform to certain parametric distributions (e.g., from an exponential family). Such methods are ineffective if the assumed distribution does not adequately model the characteristics of the real data. For example, a common problem encountered, especially in modeling climate and ecological data, is the excess probability mass at zero.
Such zero-inflated data, as they are commonly known, often lead to poor model fits with standard regression methods, as these tend to underestimate the frequency of zeros and the magnitude of extreme values in the data. One way of handling such data is to identify and remove the excess zeros and then fit a regression model to the non-zero values. Such an approach can be used, for example, to predict future values of a precipitation time series [115], in which the occurrence of wet or dry days is initially predicted using a classification model prior to applying a regression model to estimate the amount of rainfall for the predicted wet days. A potential drawback of this approach is that the classification and regression models are often built independently of each other, preventing the models from gleaning information from each other to potentially improve their predictive accuracy. Furthermore, the regression methods used in modeling zero-inflated data do not emphasize accurate prediction of extreme values.

This chapter presents an integrated framework that simultaneously classifies data points as zero-valued or not, and applies quantile regression to accurately predict extreme values in the tail of the non-zero part of the distribution by focusing on particular quantiles. We demonstrate the effectiveness of the proposed approach on modeling climate data (precipitation) obtained from the Canadian Climate Change Scenarios Network website [1]. The performance of the approach is compared with four baseline methods. The first baseline is the generalized linear model (GLM) with a Poisson distribution. The second baseline is the generalized linear model using an exponential distribution coupled with a binomial-distribution classifier (GLM-C). A zero-inflated Poisson model was used as the third baseline (ZIP). The fourth baseline was quantile regression. Empirical results show that the proposed framework outperforms the baselines for the majority of the weather stations investigated in this study.

In summary, the main contributions of this chapter are as follows:

• Comparison and analysis of the performance of models created using variants of GLM, quantile regression, and ZIP approaches to accurately predict values of extreme data points that belong to a zero-inflated distribution.

• Presenting an approach optimized for modeling zero-inflated data that outperforms the baseline methods in predicting the value of extreme data points.

• Successfully applying the proposed approach to the real-world problem of downscaling precipitation climate data, with application to climate impact assessment studies.

6.2 Preliminaries

Consider a multivariate time series $L = (\mathbf{x}_t, y_t)$, where $t \in \{1, 2, \cdots, n\}$ is a discrete-valued index for time, $\mathbf{x}_t$ is a $d$-dimensional vector of predictor variables at time $t$, and $y_t$ is the corresponding value of the response (target) variable. Given an unlabeled sequence of multivariate observations $\mathbf{x}_\tau$, where $\tau \in \{n + 1, \cdots, n + m\}$, the goal is to learn a target function $f(\mathbf{x}, \beta)$ that best estimates the values of the response variable by minimizing the expected loss $E_{\mathbf{x},y}[L(y, f(\mathbf{x}, \beta))]$. The weight vector $\beta$ denotes the regression coefficients to be estimated from the training data $L$.

Multiple linear regression (MLR) is one of the most widely used regression methods due to its simplicity.
It assumes $f(\mathbf{x}, \beta) = \beta^T\mathbf{x}$ (where $\mathbf{x}$ is a $(d + 1)$-dimensional vector whose first element $x_0 = 1$ and $\beta \in \mathbb{R}^{d+1}$ is the weight vector) and that the response variable $y$ is related to $f(\mathbf{x}, \beta)$ via the following equation:
$$y = \beta^T\mathbf{x} + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$$
As a result, $P(y|\mathbf{x}) \sim N(\beta^T\mathbf{x}, \sigma^2)$ and $E_{y|\mathbf{x}}[y] = \int yP(y|\mathbf{x})\,dy = \beta^T\mathbf{x}$. Since the predicted value of the response variable for a test data point $\mathbf{x}_\tau$ is $\beta^T\mathbf{x}_\tau$, the predictions made by MLR focus primarily on the average value of $y$ given $\mathbf{x}_\tau$. This explains the limitation of MLR in terms of inferring extreme values in a given time series. The parameter vector $\beta$ in MLR can be estimated using the maximum likelihood (ML) approach to obtain
$$\hat{\beta} = (X^TX)^{-1}X^T\mathbf{y}$$
where $X$ is the $n \times (d + 1)$ design matrix and $\mathbf{y}$ is an $n \times 1$ column vector of the observed values of the response variable.

The drawback of simple linear regression is that it is built on a strong assumption, namely normality. Unfortunately, real-world data may not always follow a normal distribution: they may be skewed to one side, may not cover the whole range of real numbers, or may have a heavier tail than the normal distribution. Hence, alternative approaches that are not constrained by such assumptions, such as GLM, may be used.

6.2.1 Generalized Linear Model and 2-Step GLM (GLM-C)

The generalized linear model is one of the most widely used regression methods due to its simplicity. Generally, a GLM consists of three elements:

1. The response variable $\mathbf{Y}$, which has a probability distribution from the exponential family.

2. A linear predictor $\eta = X\beta$.

3. A link function $g(\cdot)$ such that $E(\mathbf{Y}|X) = \mu = g^{-1}(\eta)$.

Here, $\mathbf{Y} \in \mathbb{R}^{n\times 1}$ is the response variable vector, $X \in \mathbb{R}^{n\times d}$ is the design matrix with all 1's in the last column, and $\beta \in \mathbb{R}^{p\times 1}$ is the parameter vector. Since the link function encodes the relationship between the linear predictor and the mean of the distribution, it is very important to understand the data before arbitrarily using the canonical link function. In this case, since the precipitation data are always non-negative and the values are represented on a millimeter scale, the non-zero data may be treated as count data, allowing the use of a Poisson or an exponential distribution to describe the data. Hence, in these experiments, $\log(\cdot)$ is chosen as the link function and the Poisson distribution is used. The $\mathbf{Y}$ used in the regression model is scaled to $10 \times \mathbf{Y}$:
$$(10 \times Y_i)|X_i \sim \text{Poi}(\lambda_i), \quad E((10 \times Y_i)|X_i) = \lambda_i = g^{-1}(\eta_i) = g^{-1}(X_i\beta)$$
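The text below notes that the GLMs in this chapter were fit with iteratively reweighted least squares (IRLS); direct minimization of the Poisson negative log-likelihood yields the same maximum likelihood estimate. A minimal sketch of the latter, with an assumed function name:

    import numpy as np
    from scipy.optimize import minimize

    def fit_poisson_glm(X, y):
        """Poisson GLM with log link: E[y|x] = exp(x' beta)."""
        def nll(beta):
            eta = X @ beta
            # Poisson negative log-likelihood, up to a constant in y
            return np.sum(np.exp(eta) - y * eta)
        return minimize(nll, np.zeros(X.shape[1]), method="BFGS").x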
Considering the large number of zeros, one is motivated to perform classification first to eliminate the zero values before any regression. There are many classification methods available, but for the purpose of these experiments, logistic regression (which is also a variant of GLM) was chosen to perform the classification. The response variable $Y^*$ of the logistic regression is a binary variable defined as:
$$Y^* = \begin{cases} 1 & Y > 0 \\ 0 & Y = 0 \end{cases}$$
The details of the model are as follows. The link function is the logit link $g(p) = \log\big(\frac{p}{1-p}\big)$, such that
$$Y^*_i|X_i \sim \text{Bin}(p_i), \quad E(Y^*_i|X_i) = p_i = g^{-1}(\eta_i) = g^{-1}(X_i\beta)$$
Once the fitted values are derived, they are converted to binary values:
$$f^* = \begin{cases} 1 & 1 \geq \hat{Y}^* > 0.5 \\ 0 & 0.5 \geq \hat{Y}^* \geq 0 \end{cases}$$
The second part is a GLM with an exponential distribution, where the response variable $Y$ consists of only the non-zero data and the link function is $g(\cdot) = \log(\cdot)$:
$$Y_i|X_i \sim \text{Exp}(\lambda_i), \quad E(Y_i|X_i) = \lambda_i = g^{-1}(\eta_i) = g^{-1}(X_i\beta)$$
The fitted value $f$ is then computed for all $X_i$, and finally the product of the two fitted values, $\hat{Y} = f^* \times f$, is reported. To fit the GLM models, the iteratively reweighted least squares (IRLS) method was used for maximum likelihood estimation of the model parameters.

6.2.2 Zero-Inflated Poisson Regression (ZIP)

Differing from the methods above, zero-inflated Poisson regression treats the zeros as arising from a mixture of two distributions: a Bernoulli distribution with probability $\pi_i$ of producing 0, and a Poisson distribution with parameter $\mu_i$ (let $Pr(\cdot; \mu)$ denote its probability mass function). The ZIP regression model is defined as:
$$Pr(Y = y_i|\mathbf{x}_i) = \begin{cases} \pi_i + (1 - \pi_i)Pr(Y = 0; \lambda_i) & y_i = 0 \\ (1 - \pi_i)Pr(Y = y_i; \lambda_i) & y_i > 0 \end{cases}$$
where $0 < \pi_i < 1$, and
$$\text{logit}(\pi_i) = \log\Big(\frac{\pi_i}{1 - \pi_i}\Big) = \mathbf{x}_i\beta_1, \quad \log(\mu_i) = \mathbf{x}_i\beta_2$$
where $\beta_1$ and $\beta_2$ are both regression parameters; both can be found by maximizing the likelihood function. For the purpose of the experiments, the R package 'pscl' was used to fit the model.

6.2.3 Quantile Linear Regression (QR) and 2-Step QR (QR-C)

Quantile regression is used to estimate a specified quantile of a population. Hence, if the objective of the regression is to estimate a conditional quantile (e.g., the median) of $Y$ instead of the conditional mean, as in MLR and ridge regression, one may use quantile regression. Its loss function for the linear regression model is:
$$f(\mathbf{b}) = \sum_{i=1}^{N} \rho_\tau(Y_i - X_i^T\mathbf{b}), \quad \hat{\beta} = \arg\min_{\mathbf{b}} f(\mathbf{b})$$
where
$$\rho_\tau(u) = \begin{cases} \tau u & u > 0 \\ (\tau - 1)u & u \leq 0 \end{cases}$$
Let $F_Y(y) = P(Y \leq y)$ be the distribution function of a real-valued random variable $Y$. The $\tau$th quantile of $Y$ is given by:
$$Q_Y(\tau) = F^{-1}(\tau) = \inf\{y : F_Y(y) \geq \tau\}$$
It can be shown that the $\hat{y}$ that minimizes $E\rho_\tau(y - \hat{y})$ satisfies $F_Y(\hat{y}) = \tau$. Thus, quantile regression finds the $\tau$th quantile of a random variable; for example:
$$\text{Median}(Y|X) = X\hat{\beta}^{qr}, \quad \hat{\beta}^{qr} = \arg\min_{\mathbf{b}} \sum_i \rho_{0.5}(y_i - X_i^T\mathbf{b})$$
For the experiments conducted, $\tau = 0.95$ was used to represent extreme high values. Unlike the least-squares methods mentioned above, which can be solved by numerical linear algebra, the solution to quantile regression is relatively non-trivial. Linear programming is used to minimize the loss function by converting the problem to the following form:
$$\min_{\mathbf{u},\mathbf{v},\mathbf{b}}\; \{\tau\mathbf{e}_N^T\mathbf{u} + (1 - \tau)\mathbf{e}_N^T\mathbf{v} \mid \mathbf{Y} - X\mathbf{b} = \mathbf{u} - \mathbf{v};\; \mathbf{b} \in \mathbb{R}^p;\; \mathbf{u}, \mathbf{v} \in \mathbb{R}_+^N\}$$
For the same reason as mentioned in Section 6.2.1, a classification method should be incorporated along with the regression model: logistic regression is used for classification, and quantile regression is fit on the non-zero $Y$. The product of the two fitted values is reported. Quantile regression may return a negative value, which we force to 0, because precipitation is always non-negative. A sketch of this two-step baseline follows.
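The following is a minimal sketch of the two-step QR-C baseline just described, reusing the quantile_regression LP helper sketched in Chapter 5; the function names and the logistic fit via BFGS are illustrative assumptions:

    import numpy as np
    from scipy.optimize import minimize

    def fit_logistic(X, y01):
        """Maximum-likelihood logistic regression (the classification step)."""
        def nll(b):
            z = X @ b
            return np.sum(np.logaddexp(0, z) - y01 * z)  # stable neg. log-lik.
        return minimize(nll, np.zeros(X.shape[1]), method="BFGS").x

    def predict_qr_c(X_tr, y_tr, X_te, tau=0.95):
        """Two-step QR-C: classify wet/dry days, quantile-regress on wet
        days only, and report the product of the two fitted values."""
        wet = y_tr > 0
        b_cls = fit_logistic(X_tr, wet.astype(float))
        b_qr = quantile_regression(X_tr[wet], y_tr[wet], tau)  # LP sketch, Sec. 5.2.2
        wet_hat = (X_te @ b_cls > 0).astype(float)   # p > 0.5  <=>  x'b > 0
        return np.maximum(wet_hat * (X_te @ b_qr), 0.0)  # precipitation >= 0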
6.3 Framework for Integrated Classification and Regression

With the introduction of quantile regression, which is an integral part of the objective function, the motivation behind the various components of the proposed objective function needs to be elaborated. Since zero-inflated data are best described with the help of a classifier that identifies non-zero values and a regression component that models the non-zero values, this framework consists of both components. For the classifier component, a least-squares support vector machine is used, and for the regression component, the intuition of quantile regression is used to focus the regression on extreme values. Since the final prediction for a data point under this framework is the product of the regression and classification components, the quantile regression component operates on the eventual predicted value, thereby integrating the classifier and regression components.

6.3.1 Integrated Classifier and Regression for Extreme Values (ICRE)

The classification and regression models developed in this study are designed to minimize the following objective function:
$$\arg\min_{\omega_1, \omega_2} L(\omega_1, \omega_2) = \frac{1}{n}\sum_{i=1}^{n}\big(1 - (2y'_i - 1)f^c_i\big)^2 + \frac{1}{n^*}\sum_{i=1}^{n} y'_i\,\rho_\tau\big(y_i - f^r_i \times (f^c_i + 1)/2\big) + \lambda(\|\omega_1\|^2 + \|\omega_2\|^2) \tag{6.1}$$
where $y'_i \in \{0, 1\}$ indicates whether $y_i$ is non-zero, $f^r_i$ and $f^c_i$ are the regression and classification outputs, and $n^*$ is the number of non-zero $y_i$. It can be expanded as follows:
$$\arg\min_{\omega_1, \omega_2} L(\omega_1, \omega_2) = \frac{1}{n}\sum_{i=1}^{n}\big(1 - (2y'_i - 1)(\mathbf{x}_i^T\omega_2)\big)^2 + \frac{1}{n^*}\sum_{i=1}^{n} y'_i\,\rho_\tau\Big(y_i - (\mathbf{x}_i^T\omega_1) \times \frac{\text{sign}(\mathbf{x}_i^T\omega_2) + 1}{2}\Big) + \lambda(\|\omega_1\|^2 + \|\omega_2\|^2) \tag{6.2}$$
The rationale for the design of the objective function is as follows. The quantile regression term represents quantile regression performed only on the observed non-zero values in the time series. The regression model is therefore biased towards estimating the non-zero extreme values more accurately, and is not adversely influenced by the over-abundance of zeros in the time series. The product $f^r_i \times (f^c_i + 1)/2$ in this term corresponds to the predicted output of the joint classification and regression model. The classification term of the objective function is equivalent to a least-squares support vector machine. The last two terms of the objective function are equivalent to the L2 norm used in ridge regression models to shrink the coefficients in $\omega_1$ and $\omega_2$.

Each data point is considered to be a representative reading at an instance of time $t \in \{1, 2, \cdots, n\}$ in the time series. Each predictor variable is standardized by subtracting its mean value and then dividing by its corresponding standard deviation. The standardization of the variables is needed to account for their varying scales.

The optimization method used in the experiments is L-BFGS-B, described by Byrd et al. [27]. It is a limited-memory version of the BFGS method: it does not store a full Hessian matrix, just a limited number of update steps, and it uses derivative information. Since this model includes a quantile regression component, which is not differentiable, this method of optimization is well suited to the objective function. To solve the objective function, the inverse logistic (sigmoid) function of $\mathbf{x}_i^T\omega_2$ was used in place of $(\text{sign}(\mathbf{x}_i^T\omega_2) + 1)/2$. The decision was motivated by the fact that the optimizer performs a line search along the steepest-descent direction and looks for a change in the derivative along this line, and the binary component would otherwise present a nearly flat surface. Hence, the binary output was replaced by the inverse logistic function of $\mathbf{x}_i^T\omega_2$ to address this issue. During the prediction stage, the binary fitted values from the SVM component are used. A sketch of this setup follows.
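Putting the pieces together, the following is a minimal sketch of the ICRE objective (Eq. 6.2) with the sigmoid surrogate and L-BFGS-B, as described above; the function name, default parameters, and zero initialization are illustrative assumptions:

    import numpy as np
    from scipy.optimize import minimize

    def fit_icre(X, y, tau=0.95, lam=0.1):
        """Sketch of ICRE: least-squares SVM classifier plus quantile
        regression on the gated prediction, jointly minimized."""
        n, d = X.shape
        y_bin = (y > 0).astype(float)          # y'_i: non-zero indicator
        n_star = max(y_bin.sum(), 1.0)

        def loss(w):
            w1, w2 = w[:d], w[d:]
            f_reg = X @ w1                     # regression component
            f_cls = X @ w2                     # classification component
            # least-squares SVM term, labels mapped to {-1, +1}
            cls_term = np.mean((1 - (2 * y_bin - 1) * f_cls) ** 2)
            # overflow-safe sigmoid surrogate for (sign(f_cls) + 1) / 2
            gate = 0.5 * (1 + np.tanh(f_cls / 2))
            r = y - f_reg * gate
            pinball = np.where(r > 0, tau * r, (tau - 1) * r)
            reg_term = (y_bin * pinball).sum() / n_star
            return cls_term + reg_term + lam * (w1 @ w1 + w2 @ w2)

        w_hat = minimize(loss, np.zeros(2 * d), method="L-BFGS-B").x
        return w_hat[:d], w_hat[d:]

    # Prediction gates the regression output by the hard classifier decision:
    # y_hat = (X_test @ w1) * ((np.sign(X_test @ w2) + 1) / 2)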
6.4 Experimental Evaluation

In this section, the climate data used to downscale precipitation are described, followed by the experimental setup. Once the dataset is introduced, the behavior of the baseline models is analyzed and contrasted with ICRE, in terms of the relative performance of the various models when applied to this real-world dataset to forecast future values of precipitation.

6.4.1 Data

All the algorithms were run on climate data obtained for 29 weather stations in Canada, from the Canadian Climate Change Scenarios Network website [1]. The response variable to be regressed (downscaled) corresponds to daily precipitation values measured at each weather station. The predictor variables correspond to 26 coarse-scale climate variables derived from the NCEP Reanalysis data set and the H3A2a data set (computer-generated simulations), which include measurements of airflow strength, sea-level pressure, wind direction, vorticity, and humidity. The predictor variables used for training were obtained from the NCEP Reanalysis data set, while the predictor variables used for testing were obtained from the H3A2a data set. The data span the 40-year period 1961 to 2001. The time series was truncated for each weather station to exclude days for which precipitation or any of the predictor values are missing.

6.4.2 Experimental Setup

The first step was to standardize the predictor variables by subtracting their mean values and then dividing by their corresponding standard deviations, to account for their varying scales. The training set consisted of 10 years of data and the test set of 25 years. During the validation process, the parameter $\lambda$ was selected using the score returned by RMSE-95. Also, to ensure that the experiments replicated the real-world scenario, where predictions for a future time series must be made using simulated values of the predictor variables, simulated values of the corresponding predictor variables obtained from the H3A2a climate scenario were used as $X_U$, while $X_L$ consists of values obtained from NCEP. All the experiments were run for 37 stations.

6.4.3 Baseline Algorithms

We compare the performance of ICRE with baseline models created using the generalized linear model (GLM), the generalized linear model with classification (GLM-C), quantile regression (QR), quantile regression with classification (QR-C), and the zero-inflated Poisson model (ZIP). Further details about the baselines are provided below.

6.4.3.1 Generalized Linear Model (GLM)

The baseline GLM refers to the generalized linear model that uses a Poisson distribution with a log link, resulting in the regression function $\log(\lambda) = X\beta$, where $E(Y|X) = \lambda$.

6.4.3.2 Generalized Linear Model with Classification (GLM-C)

Unlike the previous baseline (GLM), GLM-C refers to a two-step generalized linear model. The first step uses a binomial distribution for the classifier, with the model described by $\text{logit}(p) = X\beta$ and $E(Y^* = 1|X) = p$, where $Y^* = 1$ when $Y > 0$ and $Y^* = 0$ when $Y = 0$. The second step uses a generalized linear model with an exponential distribution that is built only on the non-zero response data points; its regression function is $\log(\lambda) = X\beta$, where $E(Y|X) = \lambda$. The eventual predicted value for each data point is the product of the two respective fitted values.
6.4.3.3 Quantile Regression (QR)

The baseline QR refers to the regular quantile regression described earlier in Section 6.2.

6.4.3.4 Quantile Regression with Classification (QR-C)

The baseline QR-C refers to a two-step model consisting of a GLM with a binomial distribution that acts as a classifier, and a regular quantile regression model built on the non-zero-valued data points, as described earlier in the preliminaries. These two component models are built independently of each other, and the eventual predicted value for each data point is the product of the two respective fitted values.

6.4.3.5 Zero-Inflated Poisson (ZIP)

The zero-inflated Poisson model used as a baseline is the ZIP model described in Section 6.2.

6.4.4 Evaluation Criteria

The evaluation metrics were selected to evaluate the different algorithms in terms of predicting the magnitude and the timing of extreme events. The following criteria are used to evaluate the performance of the models:

• Root Mean Square Error (RMSE), which measures the difference between the actual and predicted values of the response variable, i.e.:
$$\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - f_i)^2}{n}}$$

• RMSE-95, which measures the difference between the actual and predicted values of the response variable for only the extreme data points. Extreme data points refer to the points whose actual values are at the 95th percentile and above (throughout this chapter, data points at the 95th percentile and above are treated as extreme values), i.e.:
$$\text{RMSE-95} = \sqrt{\frac{\sum_{j=1}^{n/20}(y_j - f_j)^2}{n/20}}$$
where $j$ indexes the extreme data points.

• Confusion matrices are computed to capture the precision and recall of extreme and non-extreme events. F-measure, the harmonic mean of recall and precision, is used as a score that summarizes the precision and recall results:
$$\text{F-measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$

To summarize, RMSE-95 is used for measuring magnitude, and F-measure measures the correctness of the timing of the extreme events.

6.4.5 Experimental Results

This section consists of two main sets of experiments. The first set evaluates the impact of zero-inflated data on modeling extreme values. The second compares the performance of ICRE with the baseline methods.

6.4.5.1 Impact of Zero-Inflated Data on Extreme Value Prediction

Unlike regular data, which may be modeled using regression alone, modeling zero-inflated data usually involves a classifier and a regression component: the classifier is used to identify zero and non-zero values, and regression is then applied to the non-zero values. But since the focus of this chapter is on extreme data points within zero-inflated data, the impact of the classifier is unclear. In this section, the impact of including the classifier when modeling extreme values of zero-inflated data is evaluated: QR is compared with QR-C, and GLM with GLM-C, with the results shown in Table 6.1. Note that the percentages of wins for F-measure, recall, and precision may not total 100 in the case of a tie.

Table 6.1: Percentage of stations won.

                 QR       QR-C     GLM      GLM-C
    RMSE-95      100      0        32.43    67.57
    F-measure    18.92    81.08    35.13    18.92

As shown in Table 6.1, it is not clear that using an independent classifier along with regression is preferred for modeling extreme values among zero-inflated data.
But the results do indicate that including or excluding a classifier when the regression model is built independently of it may compromise either RMSE-95 (by overestimating the magnitude) or F-measure (by mistiming the prediction of an extreme value), though not necessarily both together.

6.4.5.2 Comparison of ICRE to Baseline Methods

Table 6.2 shows the relative performance of ICRE against all the baseline methods, in terms of the percentage of stations where ICRE outperformed the respective baseline on RMSE-95 values calculated on extreme rain days. In terms of RMSE on extreme rain days, as shown in Table 6.2, ICRE outperformed the baselines (except QR) in almost every one of the 37 stations, though QR was the best across all methods in terms of RMSE-95 on extreme days.

Table 6.2: Percentage of stations where ICRE outperformed the baseline.

                 QR-C     QR       GLM-C    GLM      ZIP
    RMSE-95      91.89    0        97.3     97.3     97.3
    F-measure    43.24    62.16    89.19    89.19    91.9

In terms of F-measure, which was computed based on the recall and precision of identifying extreme events, ICRE again outperformed the baselines (except QR-C) in the majority of the 37 stations; ICRE was only able to outperform QR-C in 16 of the 37 stations in terms of F-measure. Although QR performed best in terms of estimating the magnitude of extreme events, it overestimated the timing of the events, as seen in its relatively lower F-measure score. QR-C did the reverse: it did reasonably well in terms of modeling the timing, but performed very poorly in terms of the magnitude of the events, by overestimating.

6.5 Conclusions

This chapter compares and analyzes the performance of models created using variants of GLM, quantile regression, and ZIP approaches to accurately predict values of extreme data points that belong to a zero-inflated distribution. An alternative framework (ICRE) was presented that outperforms the baseline methods, and the effectiveness of the model was demonstrated on climate data by predicting the amount of precipitation at a given station.

Chapter 7

Contour Regression

The LSSQR and ICRE frameworks presented in Chapter 5 and Chapter 6 prioritize the accuracy of the prediction of the response variable at a user-specified quantile. However, there are climate modeling applications that are interested in accurately capturing the distribution characteristics of the response variable across all quantiles. In this chapter, the limitations of current regression-based approaches in terms of preserving the distribution of observed climate data are shown, and a multi-objective regression framework that simultaneously fits the distribution properties and minimizes the prediction error is presented. The framework is highly flexible and can be applied to linear, nonlinear, and conditional quantile models. The chapter demonstrates the effectiveness of the framework in modeling the daily minimum and maximum temperature as well as precipitation for climate stations in the Great Lakes region. The framework showed marked improvement over traditional regression-based approaches in all 14 climate stations evaluated.

7.1 Introduction

There are numerous climate modeling applications that can be cast as a regression problem, from projecting future climate scenarios to downscaling the coarse-scale outputs of global/regional climate models for climate change impact assessment and adaptation studies [107, 60, 104].
In addition to minimizing the residuals of the predicted outputs, some of these applications emphasize preserving specific characteristics of the predicted distribution. However, as most regression-based approaches are designed to optimize the former, they tend to perform poorly on the latter criterion.

[Figure 7.1: Area between the CDF of $y$ and $y_{MLR}$.]

As an illustration, consider a two-dimensional regression problem, where the response variable $y$ is related to the predictor variables $\mathbf{x}$ according to the following equation:
$$y = \omega^T\mathbf{x} + \omega_0 + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$$
where $\Omega = [\omega_2\;\; \omega_1\;\; \omega_0] = [1, 2, 5]$. Using the least-squares (maximum likelihood) estimation approach, multiple linear regression (MLR) was able to estimate $\Omega$ fairly accurately as $[0.99, 1.96, 5.05]$. Yet it fared poorly in terms of replicating the shape of the original distribution of $y$, as seen from the cumulative distribution function (CDF) plots given in Figure 7.1. Even though the regression model was trained using ten thousand data points, it is clear from Figure 7.1 that MLR fails to replicate the shape of the cumulative distribution of $y$, particularly the tails of the distribution.
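This under-dispersion is easy to reproduce. The snippet below is an illustrative reconstruction of the experiment; the actual predictor distribution and noise variance used for Figure 7.1 are not specified in the text, so uniform predictors and unit variance are assumed here:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    # Design matrix [x2, x1, 1] matching Omega = [1, 2, 5]
    X = np.column_stack([rng.uniform(0, 3, n), rng.uniform(0, 3, n), np.ones(n)])
    y = X @ np.array([1.0, 2.0, 5.0]) + rng.normal(0.0, 1.0, n)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta_hat

    print(beta_hat)              # coefficients recovered almost exactly
    print(y.std(), y_hat.std())  # but predictions are under-dispersed,
                                 # since Var(y_hat) omits the noise variance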
As another example, Figure 7.2 compares the histograms of the daily maximum temperature observed at a climate station in Michigan and the predicted outputs of MLR. In this case, the standard deviation of MLR's predicted outputs differs quite substantially from that of the observation data. In spite of minimizing the sum of squared prediction errors, regression-based approaches such as MLR fare poorly in preserving the overall shape of the distribution compared to non-regression-based approaches such as quantile mapping (QM), which had an RMSE value 25% worse than that of MLR but gives a better fit to the distribution of maximum temperature.

[Figure 7.2: Histogram of predicted daily maximum temperature at a weather station in Michigan, 1990-1999.]

As a consequence, distribution-driven approaches [108, 97] have been used to correct the distribution characteristics of the data to better match the observed climate variable. However, their prediction accuracy is typically worse than that of regression-based approaches.

[Figure 7.3: CDF of predicted daily precipitation at a weather station in Michigan, 1990-1999.]

Raw projections of climate variables are often obtained from General Circulation Models (GCMs) and, more recently, from Regional Climate Models (RCMs) that incorporate complex topography, land cover, and other regional forcings into the physical models. These raw climate projections need to be further post-processed to meet the requirements of impact assessment studies. In addition to the previously mentioned requirements on the climate variables, empirical downscaling of the output of the climate models to a finer resolution is often needed to bridge the mismatch in spatial or temporal scale between the model output and the scale desired, since the resolution of the output of the climate models may remain too coarse for many applications where local-scale information is needed. Similarly, bias correction is often needed to reduce the inherent uncertainties in the RCM outputs, which may be afflicted by systematic errors introduced by the driving GCM runs, imperfections in the RCM representation, and sampling biases due to the finite-length time series used to parameterize and validate the models [45].

Since the fidelity of both the distribution characteristics and the accuracy of the projections are important, a framework for multivariate regression is proposed that regularizes the distribution of the response variable, simultaneously improving the accuracy of the projection and the shape of the distribution by jointly solving both objectives. Due to its generic nature, the framework may be applied to various types of marginal distributions as well as different objective function criteria, including least squares error, kernel regression, and quantile regression (QR). In this chapter, the effectiveness of the proposed framework is demonstrated by downscaling and bias-correcting daily temperature and precipitation to match their corresponding observations.

In summary, the main contributions of this study are:

• Identification of the limitations of existing least-squares-error regression techniques.

• Presentation of a regression-based framework (Contour Regression) for multivariate empirical downscaling and bias correction that addresses the limitations of existing approaches by simultaneously improving the accuracy of the projection for individual data points as well as the overall shape of the distribution.

• Demonstration of the feasibility of adapting the framework to various objective functions, such as multivariate ordinary least squares, QR, and non-linear kernel ridge regression.

• Evaluation of the framework on real-world climate data, where it consistently outperformed or was at least on par with the baseline approaches and showed robustness to response variables having different shapes of distribution.

7.2 Preliminaries

Let $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ be a labeled dataset of size $n$, where each $\mathbf{x}_i \in \mathbb{R}^d$ is a vector of predictor variables and $y_i \in \mathbb{R}$ is the corresponding response variable. The objective of regression is to learn a target function $f(\mathbf{x}, \beta)$ that best estimates the response variable $y$, where $\beta$ is the parameter vector used by the target function and $n$ represents the number of training points.

7.2.1 Multiple Linear Regression (MLR)

MLR is the most common regression approach used for empirical downscaling of climate data. MLR uses ordinary least squares to solve a linear model of the form
$$y = \mathbf{x}^T\beta + \epsilon$$
where $\epsilon \sim N(0, \sigma^2)$ is an i.i.d. Gaussian error term with variance $\sigma^2$ and $\beta \in \mathbb{R}^d$ is the parameter vector. MLR minimizes the sum of squared residuals
$$(\mathbf{y} - X\beta)^T(\mathbf{y} - X\beta)$$
which leads to the closed-form solution
$$\hat{\beta} = (X^TX)^{-1}X^T\mathbf{y}$$

7.2.2 Quantile Mapping (QM)

Quantile mapping is the most commonly used approach for correcting the shape of the distribution of a climate variable to match observations. It adjusts all the moments of the distribution while maintaining the rank correlation. The following equation is an example of the QM approach:
$$QM: \quad y_i = F_Y^{-1}(F_X(x_i))$$
where $F_X(x)$ is the CDF of the predictor variable $X$, defined by $F_X(x) = P(X \leq x)$. The above equation may be rewritten as follows to help identify the correction made by QM:
$$QM: \quad y_i = x_i + F_Y^{-1}(F_X(x_i)) - F_X^{-1}(F_X(x_i))$$
One of the main assumptions made by QM is that the data points to which the bias correction function is applied come from the same distribution that describes the training set, and that the relationship between predictor and response is constant. Also, a sufficiently large training size is required by QM to capture the true shape of the distribution of the model outputs and the observations. A distinct advantage of QM is that no day-to-day mapped data are required. It can be shown that a QM function that accurately replicates the distribution characteristics of the response variable guarantees neither $RMSE = 0$ nor a relatively small $RMSE$.

Proposition 7.2.1. A QM function that accurately replicates the distribution of the observations may have $RMSE > 0$.

Proof. Given $QM: y = F_Y^{-1}(F_X(x))$, where $F_X \in [0, 1]$ and $F_Y \in [0, 1]$ are the empirical cumulative distribution functions of $x$ and $y$ respectively, let $R$ and $O$ be the multisets containing the quantile values of $x$ under $F_X$ and of $y$ under $F_Y$ respectively, i.e., $F_X(\mathbf{x}) = R$ and $F_Y(\mathbf{y}) = O$. Let $\varepsilon(i) = |F_Y^{-1}(O(i)) - F_Y^{-1}(R(i))|$ be the QM prediction error of data point $x_i$, so that $RMSE = \sqrt{\sum_i \varepsilon^2(i)/n}$. A necessary and sufficient condition for the quantile function to replicate the distribution of the observations is that the cardinality, members, and multiplicities of the multiset $O$ equal those of the multiset $R$, i.e., $|O \cap R| = |O| = |R|$. This requirement does not eliminate the possibility that $\exists i$ s.t. $O(i) \neq R(i)$, where $R(i)$ corresponds to the quantile value of data point $x_i$. If $\exists i$ s.t. $O(i) \neq R(i)$ and $F_Y^{-1}(O(i)) \neq F_Y^{-1}(R(i))$, then $RMSE > 0$. Hence, a quantile mapping function that accurately replicates the distribution characteristics of the response variable guarantees neither $RMSE = 0$ nor a relatively small $RMSE$. ♦

Proposition 7.2.2. A QM function accurately returns the distribution characteristics of the response variable with $RMSE = 0$ when the rank correlation between the predictor and response variables is $\Gamma = 1$.

Proof. Let $R$ and $O$ be the multiset quantiles $F_X(\mathbf{x})$ and $F_Y(\mathbf{y})$ of $x$ and $y$ respectively, and let $\varepsilon(i) = |F_Y^{-1}(O(i)) - F_Y^{-1}(R(i))|$ be the $i$th error of the predicted values from QM, so that $RMSE = \sqrt{\sum_i \varepsilon^2(i)/n}$. Given $(\Gamma = 1) \equiv (\forall i,\, R(i) = O(i))$, we have $\varepsilon(i) = 0$ for all $i$, and therefore
$$RMSE = \sqrt{\sum_i \varepsilon^2(i)/n} = 0 \quad \diamond$$
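An empirical version of the QM map is straightforward to sketch; the interpolation-based inverse CDF below is one common implementation choice, not necessarily the one used in this study:

    import numpy as np

    def quantile_mapping(x_new, x_train, y_train):
        """Empirical quantile mapping y = F_Y^{-1}(F_X(x))."""
        x_sorted = np.sort(x_train)
        y_sorted = np.sort(y_train)
        # F_X: empirical CDF value of each new point, in (0, 1].
        q = np.searchsorted(x_sorted, x_new, side="right") / len(x_sorted)
        # F_Y^{-1}: piecewise-linear empirical inverse CDF of y.
        probs = (np.arange(len(y_sorted)) + 1) / len(y_sorted)
        return np.interp(q, probs, y_sorted)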
Section 7.3.1 elaborates on CR and describes multivariate linear contour regression (MLCR), whose objective function is based on ordinary least squares. Section 7.3.2 proposes kernel contour regression (KCR), a kernel-based interpretation of the CR framework. Section 7.3.3 proposes a quantile regression-based interpretation that emphasizes the conditional quantile of the user's preference; in this study, the chosen conditional quantile corresponds to the extreme fifth percentile of the distribution.

7.3.1 Multiple Linear Contour Regression (MLCR)

This section describes an approach for CR that is based on ordinary least squares (OLS), simultaneously regressing on the response variable and on the ordered values of the response variable by minimizing the sum squared error shown below:

$$\sum_{i=1}^{n}\left(\gamma(f(\mathbf{x}_i,\beta) - y_i)^2 + (1-\gamma)(f(\mathbf{x}_i,\beta) - z_i)^2\right)$$

where $f(X, \beta) = X\beta$ and $z_i = y_{(i)}$. This equates to minimizing

$$\gamma(y - X\beta)^T(y - X\beta) + (1-\gamma)(z - X\beta)^T(z - X\beta)$$

where the predictor matrix $X$ is pre-sorted such that $f(\mathbf{x}_i, \beta) \le f(\mathbf{x}_j, \beta)$ for all $i < j$, and $\gamma \in [0, 1]$ is a user-defined parameter that may be used to prioritize either fidelity in regression accuracy or the shape of the distribution. It is evident from the equation that as $\gamma \to 1$, MLCR converges to the solution of MLR, as seen in Figure 7.4, which depicts the influence of the $\gamma$ parameter on the shape of the CDF of the response variable. The closed-form solution of MLCR is

$$\hat{\beta} = (X^TX)^{-1}\left(\gamma X^Ty + (1-\gamma)X^Tz\right)$$

Since it is often not possible to guarantee that $X$ is pre-sorted correctly according to $f(\mathbf{x}_i, \beta)$, one may need to solve the objective function iteratively: after each iteration, the data points $X$ and corresponding $y$ are reordered so that the new ordering conforms to $f(\mathbf{x}_i, \beta) \le f(\mathbf{x}_j, \beta)$ for all $i < j$ under the $\beta$ obtained from the previous iteration, until convergence. Convergence is reached when $f(\mathbf{x}_i, \beta) \le f(\mathbf{x}_j, \beta)$ holds for all $i < j$. As shown below, the objective function converges with each iteration.

[Figure 7.4: Influence of the γ parameter on the fidelity of the response variable's cumulative distribution function, showing the CDFs of y and of the MLCR output for γ = 1, 1/2, and 1/10.]

7.3.1.1 Proof of Convergence

This section presents the proof of convergence of the iterative update algorithm. Let $\beta_t$, $f_t$, $X_t$ be the regression coefficients, the predicted values of the response variable, and the predictor variables at the $t$-th iteration, while $\beta_{t+1}$, $f_{t+1}$, $X_{t+1}$ represent the same quantities after the $(t+1)$-th iteration.

Proposition 7.3.1. Assuming that the indices of the predictor variables are fixed, $L(\beta_t, f_t, X_t) \ge L(\beta_{t+1}, f_{t+1}, X_t)$.

Proof. For a fixed $X_t$, $L(\beta_{t+1}, f_{t+1}, X_t) \le L(\beta_t, f_t, X_t)$, since $\beta_{t+1}$ is obtained from a closed-form solution of ordinary least squares and by definition is the solution that minimizes the objective function. In the worst case, $L(\beta_{t+1}, f_{t+1}, X_t) = L(\beta_t, f_t, X_t)$. ♦

Proposition 7.3.2. Assuming that the regression coefficients $\beta$ are fixed, $L(\beta_{t+1}, f_{t+1}, X_t) \ge L(\beta_{t+1}, f_{t+1}, X_{t+1})$.

Proof. Let $L(\beta_{t+1}, f_{t+1}, X_t) = L_{t+1}^y + L_t^z$, where $L_{t+1}^y$ refers to the first half of the loss function, which regresses on $y$, and $L_t^z$ refers to the second half, which regresses on $z$. Since the change in ordering of $X$ from the $t$-th to the $(t+1)$-th iteration does not impact the $L^y$ component of the loss function, and $L(\beta_{t+1}, f_{t+1}, X_{t+1}) = L_{t+1}^y + L_{t+1}^z$, we concentrate on $L^z$. We have $L_t^z = (1-\gamma)\sum_{i=1}^n (f(\mathbf{x}_i,\beta) - z_i)^2$, which can be rewritten as

$$L_t^z = \sum_{i=1}^{n}\left(f_i^2 + z_i^2 - 2f_iz_i\right)$$

where the constant $(1-\gamma)$ is dropped for simplicity. Given that $\beta$ and the values of $f$ are fixed, $L_{t+1}^z = \sum_{i=1}^{n}(f_{(i)}^2 + z_i^2 - 2f_{(i)}z_i)$.

$$\Rightarrow L_t^z - L_{t+1}^z = 2\sum_{i=1}^{n}\left(f_{(i)}z_i - f_iz_i\right)$$

Since $\sum_{i=1}^n a_{(i)}b_{(i)} \ge \sum_{i=1}^n a_ib_i$ for all $a, b \in \mathbb{R}^n$, we have $\sum_i f_{(i)}z_i \ge \sum_i f_iz_i$, since by definition $z_i = z_{(i)}$. $\Rightarrow L_t^z - L_{t+1}^z \ge 0$ $\Rightarrow L(\beta_{t+1}, f_{t+1}, X_t) \ge L(\beta_{t+1}, f_{t+1}, X_{t+1})$. ♦

Theorem 7.3.1. The objective function $L(\beta)$ is monotonically non-increasing given the update formulas for $\beta$, $f$, and $X$.

Proof. The update formula iteratively modifies the objective function as follows: $L(\beta_t, f_t, X_t) \Rightarrow L(\beta_{t+1}, f_{t+1}, X_t) \Rightarrow L(\beta_{t+1}, f_{t+1}, X_{t+1})$. Using the above propositions, we have $L(\beta_t, f_t, X_t) \ge L(\beta_{t+1}, f_{t+1}, X_t)$ and $L(\beta_{t+1}, f_{t+1}, X_t) \ge L(\beta_{t+1}, f_{t+1}, X_{t+1})$. $\Rightarrow L(\beta_{t+1}, f_{t+1}, X_{t+1}) \le L(\beta_t, f_t, X_t)$. ♦

Lemma 7.3.1. The objective function eventually converges, since the value of the loss function is always non-negative and $L(\beta)$ is monotonically non-increasing.
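The iterative update just described can be sketched in a few lines of NumPy. The helper name `mlcr_fit`, the initial ordering heuristic, and the stopping rule are illustrative choices, not prescribed by the framework.

```python
import numpy as np

def mlcr_fit(X, y, gamma=0.5, max_iter=10):
    """Iterative MLCR: alternate between the closed-form solution
    beta = (X'X)^{-1} (gamma X'y + (1-gamma) X'z)
    and re-sorting the data so fitted values are non-decreasing."""
    z = np.sort(y)                       # z_i = y_(i), the order statistics
    order = np.argsort(y)                # heuristic starting order
    for _ in range(max_iter):
        Xs, ys = X[order], y[order]
        beta = np.linalg.solve(
            Xs.T @ Xs, gamma * Xs.T @ ys + (1 - gamma) * Xs.T @ z)
        f = Xs @ beta
        # reorder so that the new fitted values are non-decreasing
        new_order = order[np.argsort(f, kind="stable")]
        if np.array_equal(new_order, order):   # ordering is stable: converged
            break
        order = new_order
    return beta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.5, 200)
print(mlcr_fit(X, y, gamma=0.8))
```

The monotone convergence argument above guarantees that each re-sort can only decrease (or leave unchanged) the objective, so capping the iterations at ten, as done in the experiments later, is a practical safeguard rather than a necessity.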
7.3.2 Kernel Contour Regression (KCR)

A variant of MLR with ridge regularization is commonly used to mitigate over-fitting in regression. Ridge regression also provides a formulation that overcomes the hurdle of a singular covariance matrix $X^TX$, which MLR might face during optimization. Unlike the loss function of MLR, the loss function of ridge regression is $(y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$, and its corresponding closed-form solution is

$$\hat{\beta} = (X^TX + \lambda I)^{-1}X^Ty$$

where the ridge coefficient $\lambda > 0$ ensures that the matrix $X^TX + \lambda I$ is always invertible. The dual form of ridge regression is given by

$$\hat{\alpha} = (G + \lambda I)^{-1}y$$

where $G = XX^T$. By mapping the predictor variables $X$ to a higher-dimensional feature space $F$ via $\phi: X \in \mathbb{R}^d \to F \subseteq \mathbb{R}^N$, where $N \gg d$, one can transform the regularized least squares regression to the feature space $F$ using the kernel $K$. Similarly, the predictor variables of CR can be mapped to a higher-dimensional feature space $F$ by using the ridge counterpart of MLCR:

$$\beta = (\phi(X)^T\phi(X) + \lambda I)^{-1}\left(\gamma\phi(X)^Ty + (1-\gamma)\phi(X)^Tz\right)$$
$$\Rightarrow \beta = \lambda^{-1}\phi(X)^T\left(\gamma y + (1-\gamma)z - \phi(X)\beta\right) = \phi(X)^T\alpha$$
$$\Rightarrow \alpha = (G + \lambda I)^{-1}\left(\gamma y + (1-\gamma)z\right)$$

where $G = \phi(X)\phi(X)^T$ and $G_{ij} = \langle\phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\rangle = K(\mathbf{x}_i, \mathbf{x}_j)$.
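A minimal sketch of the KCR dual solution above, assuming an RBF kernel (the kernel used in the experiments later in this chapter). The function names are illustrative, and the iterative re-sorting step, which proceeds exactly as in MLCR, is omitted for brevity.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Gram matrix of the RBF kernel K(a, b) = exp(-||a - b||^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * length_scale ** 2))

def kcr_fit(X, y, gamma=0.5, lam=1.0, length_scale=1.0):
    """Dual KCR coefficients alpha = (G + lam I)^{-1} (gamma y + (1-gamma) z),
    with z the sorted response; X is assumed already ordered by fitted value
    (in practice this is enforced iteratively, as in MLCR)."""
    z = np.sort(y)
    G = rbf_kernel(X, X, length_scale)
    return np.linalg.solve(G + lam * np.eye(len(y)),
                           gamma * y + (1 - gamma) * z)

def kcr_predict(X_new, X_train, alpha, length_scale=1.0):
    # Prediction is a kernel expansion over the training points
    return rbf_kernel(X_new, X_train, length_scale) @ alpha
```

A usage sketch would be `alpha = kcr_fit(X, y)` followed by `y_hat = kcr_predict(X_new, X, alpha)`; setting `gamma=1.0` recovers plain kernel ridge regression, mirroring how MLCR collapses to MLR.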
7.3.3 Quantile Contour Regression (QCR)

Most regression approaches used for downscaling focus on predicting the conditional mean of the response variable. Predicting the conditional mean is not well suited to predicting extreme values, which are better identified by the conditional quantiles that correspond to the extremes. Hence, unlike the common regression techniques mentioned earlier, approaches similar to quantile regression (QR) [77] are better suited to estimating the extremes of $Y$. To estimate the $\tau$th conditional quantile $Q_{Y|X}(\tau)$, QR minimizes an asymmetrically weighted sum of absolute errors using the loss function

$$\sum_{i=1}^{n}\rho_\tau(y_i - \mathbf{x}_i^T\beta), \qquad \rho_\tau(u) = \begin{cases}\tau u & u > 0\\ (\tau - 1)u & u \le 0\end{cases}$$

and the $\tau$th quantile of a random variable $Y$ is given by

$$Q_Y(\tau) = F^{-1}(\tau) = \inf\{y : F_Y(y) \ge \tau\}$$

where $F_Y(y) = P(Y \le y)$ is the distribution function of the real-valued random variable $Y$ and $\tau \in [0, 1]$.

Linear programming is used to minimize the loss function by converting the problem to the following form:

$$\min_{u,v}\ \tau\mathbf{1}_n^Tu + (1-\tau)\mathbf{1}_n^Tv \quad \text{s.t.}\quad y - X\beta = u - v$$

where $u_i \ge 0$, $v_i \ge 0$, and $\beta \in \mathbb{R}^d$.

The objective function of QR can be adopted by CR to obtain the following loss function:

$$\sum_{i=1}^{n}\left(\rho_{\tau_1}(y_i - \mathbf{x}_i^T\beta) + \rho_{\tau_2}(z_i - \mathbf{x}_i^T\beta)\right)$$

with $\rho_\tau(u)$ as defined above, which equates to

$$\min_{u,v,u',v'}\ \tau_1\mathbf{1}_n^Tu + (1-\tau_1)\mathbf{1}_n^Tv + \tau_2\mathbf{1}_n^Tu' + (1-\tau_2)\mathbf{1}_n^Tv'$$
$$\text{s.t.}\quad y - X\beta = u - v, \qquad z - X\beta = u' - v'$$

where $\tau_2 = 0.5$, $u_i \ge 0$, $u'_i \ge 0$, $v_i \ge 0$, $v'_i \ge 0$, and $\beta \in \mathbb{R}^d$.

7.3.3.1 Proof of Convergence

Let $\beta_t$, $f_t$, $X_t$ be the regression coefficients, the predicted values of the response variable, and the predictor variables at the $t$-th iteration, and $\beta_{t+1}$, $f_{t+1}$, $X_{t+1}$ the same quantities after the $(t+1)$-th iteration.

Proposition 7.3.3. Assuming that the indices of the predictor variables are fixed, $L(\beta_t, f_t, X_t) \ge L(\beta_{t+1}, f_{t+1}, X_t)$.

Proof. For a fixed $X_t$, $L(\beta_{t+1}, f_{t+1}, X_t) \le L(\beta_t, f_t, X_t)$, since $\beta_{t+1}$ is the solution that minimizes the objective function. In the worst case, $L(\beta_{t+1}, f_{t+1}, X_t) = L(\beta_t, f_t, X_t)$. ♦

Proposition 7.3.4. Assuming that the regression coefficients $\beta$ are fixed, $L(\beta_{t+1}, f_{t+1}, X_t) = L(\beta_{t+1}, f_{t+1}, X_{t+1})$.

Proof. Let $L(\beta_{t+1}, f_{t+1}, X_t) = L_{t+1}^y + L_t^z$, where $L_{t+1}^y$ refers to the first half of the loss function, which performs QR on $y$, and $L_t^z$ refers to the second half, which performs QR on $z$. Since the change in ordering of $X$ does not impact $L^y$, we concentrate on $L^z$. Given $L_t^z = 0.5\sum_{i=1}^n (f_i - z_i)$ and $L_{t+1}^z = 0.5\sum_{i=1}^n (f_{(i)} - z_i)$, and since reordering leaves the multiset of predicted values unchanged, $\sum_i f_i = \sum_i f_{(i)}$, $\Rightarrow L_t^z = L_{t+1}^z$. Hence, $L(\beta_{t+1}, f_{t+1}, X_t) = L(\beta_{t+1}, f_{t+1}, X_{t+1})$. ♦

Theorem 7.3.2. The objective function $L(\beta)$ is monotonically non-increasing given the update formulas for $\beta$, $f$, and $X$.

Proof. The update formula iteratively modifies the objective function as follows: $L(\beta_t, f_t, X_t) \Rightarrow L(\beta_{t+1}, f_{t+1}, X_t) \Rightarrow L(\beta_{t+1}, f_{t+1}, X_{t+1})$. Using the above propositions, we have $L(\beta_t, f_t, X_t) \ge L(\beta_{t+1}, f_{t+1}, X_t)$ and $L(\beta_{t+1}, f_{t+1}, X_t) = L(\beta_{t+1}, f_{t+1}, X_{t+1})$. $\Rightarrow L(\beta_{t+1}, f_{t+1}, X_{t+1}) \le L(\beta_t, f_t, X_t)$. ♦
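The linear program above can be sketched with SciPy's `linprog`. Shown here is the plain QR building block, with $\beta$ free and $u$, $v$ the positive and negative residual parts; the QCR variant adds the analogous $(u', v', \tau_2)$ block for $z$. Names and toy data are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression(X, y, tau=0.95):
    """Solve the QR linear program
         min  tau * 1'u + (1 - tau) * 1'v
         s.t. X beta + u - v = y,  u >= 0, v >= 0,
    where u and v split the residual y - X beta into its parts."""
    n, d = X.shape
    c = np.concatenate([np.zeros(d), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])    # variables: [beta, u, v]
    bounds = [(None, None)] * d + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:d]

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(300), rng.uniform(0, 10, 300)])
y = X @ np.array([1.0, 0.5]) + rng.gumbel(0, 1, 300)
print(quantile_regression(X, y, tau=0.95))  # upper-tail regression line
```

With `tau=0.95` the fit tracks the extreme upper fifth percentile used in the experiments below; `tau=0.05` would target the lower tail, as done for minimum temperature.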
7.4 Experimental Results

The objective of the experiments was to evaluate the effectiveness of CR on observed climate data.

7.4.1 Data

All the algorithms were run using climate data obtained at fourteen weather stations in Michigan, USA. Daily maximum temperature (T), minimum temperature (t), and precipitation (P) were the three climate target variables evaluated. The predictor variables used in this study were obtained from the North American Regional Climate Change Assessment Program (NARCCAP) [2] (Table 7.1). Nine data sets are used, corresponding to the combinations of three RCMs and three target variables. The three RCMs used are the Canadian Regional Climate Model (CRCM), the Weather Research and Forecasting Model (WRFG), and the Regional Climate Model Version 3 (RCM3). The models were each driven by the NCEP/DOE AMIP-II Reanalysis (NCEP) for a domain covering the United States and Canada. The data for the RCMs span the period 1980-1999. The gridded RCM data have a spatial resolution of 50 km. Unlike observation data, which relate to a point location, RCM data are available at grid resolution, with each value representing a grid-cell average. Since the observation data correspond to daily values, preprocessing was done to convert the three-hour reanalysis-driven RCM data to daily values.

Table 7.1: List of predictor variables from each RCM.

  Predictor variable                  Frequency
  Meridional Surface Wind Speed       3 hourly
  Zonal Surface Wind Speed            3 hourly
  Minimum Surface Air Temperature     Daily
  Maximum Surface Air Temperature     Daily
  Surface Air Temperature             3 hourly
  Surface Pressure                    3 hourly
  Precipitation                       3 hourly
  Surface Specific Humidity           3 hourly
  500 hPa Geopotential Height         3 hourly

Preprocessing was also needed to convert the observation data and the data from the various RCM runs to the same units. For instance, precipitation in the observation data was in millimeters, while precipitation data obtained from the various RCM runs was recorded in MKS units of kg/m²/s and needed to be converted to millimeters. In the event of missing values in the reanalysis/GCM-driven RCM simulated data, the whole day corresponding to the data point was removed during the training phase of the various BCED approaches evaluated in this study, even if the missing value corresponded to only a single three-hour time stamp for that day.

7.4.2 Experimental Setup

Twenty years (1980-1999) of model data from the various RCMs, along with the corresponding observation data, were split into two parts of ten contiguous years used for training and testing. The results provided in this section are those observed during out-of-sample evaluation only. A 10-fold cross-validation comparison of the various BCED models was also evaluated; but since the climate models' ability to reproduce climate variability is typically averaged over the order of ten years for the purpose of analysis, as noted by Ehret et al. [45], and since the relative performances were consistent across the two set-ups, the 10-fold cross-validation results are not included in this chapter. To evaluate the relative skill of the proposed approach in bias correction and downscaling, popular BCED approaches such as MLR, Lasso, QM, PHC, LOCI, QR, and kernel regression were used as baselines. For simplicity, the parameter γ was fixed across every station. Throughout this chapter, the extreme fifth percentile of a distribution defines the extreme values. Consequently, 0.95 is used as τ for QR-based experiments that model extreme precipitation and extreme maximum temperature, while 0.05 is used as τ for modeling extreme minimum temperature. The radial basis function (RBF) kernel was the kernel of choice in this study. For the CR-based experiments, the maximum number of iterations was set to ten.

7.4.3 Results

The motivation behind the experiments was to evaluate the different algorithms in terms of the accuracy of the prediction, the fidelity of the shape of the distribution to the observations, the timing of extreme events, and the frequency with which a data point is predicted to be extreme. The performance of MLCR was compared against MLR, ridge regression (Ridge), lasso regression (Lasso), QM, LOCI, and fitted histogram equalization. Similarly, QCR was compared to baseline approaches such as MLR, QM, and QR. Autoregressive models were not used as baselines, as they are not well suited to long-term climate projections (40-100 years into the future). Since regression emphasizes minimizing the residuals, MLCR was first compared with its baselines for potential loss in root mean square error (RMSE) performance, which is then put in perspective against the improvement over the baseline CDFs.
Barring possible over-fitting, MLR should, by definition of its objective function, have the minimum SSE among the linear regression approaches. Hence, MLR was used as a baseline to evaluate the possible deterioration in RMSE incurred by MLCR on account of its distribution regularization. MLCR showed an average deterioration in RMSE of less than 3% across the first six data sets (target variables maximum and minimum temperature) (Table 7.2), while improving the average error of the empirical cumulative distribution function (RMSE-CDF) by around 40% (Figure 7.6). Given that RMSE-CDF $= \sqrt{\sum_{i=1}^{n}(y_{(i)} - f_{(i)})^2 / n}$, whose values are on the same scale as RMSE, it is clear that MLCR was able to considerably improve the shape of the distribution to better match the observations at the expense of a marginal deterioration in RMSE. This improvement was observed across all climate stations within each data set, as shown by the 100% win-loss percentage (Table 7.2). Ridge and Lasso fared comparably to MLR, while QM had the worst RMSE, as expected.

Table 7.2: Relative performance gain of MLCR over baseline approaches.

            RMSE % loss      RMSE-CDF % gain    RMSE-CDF win-loss %
  Dataset   MLR     Lasso    MLR      Lasso     MLR      Lasso
  WRFG-T    1.9     1.7      39.0     41.7      100      100
  CRCM-T    2.8     2.6      25.8     28.0      100      100
  RCM3-T    2.0     1.8      35.3     39.2      100      100
  WRFG-t    1.0     0.6      51.4     53.7      100      100
  CRCM-t    1.9     1.6      38.2     40.1      100      100
  RCM3-t    1.8     1.6      53.2     56.1      100      100
  WRFG-P    28.8    28.3     74.3     75.8      100      100
  CRCM-P    25.8    25.0     71.1     73.2      100      100
  RCM3-P    29.9    29.5     75.6     76.7      100      100

MLR fared considerably worse in terms of its CDF when modeling precipitation (Gamma distribution) (Figure 7.5). Since MLR struggled to capture the shape of the precipitation distribution, a smaller value of the γ parameter was chosen for MLCR than was used for the previous (normally distributed) data sets, to better fit the observations' CDF. Consequently, a larger deterioration in RMSE performance was accepted in exchange for an impressive average RMSE-CDF improvement of over 70%.

[Figure 7.5: CDF of predicted daily precipitation at a weather station in Michigan over the years 1990-99 (Observation, MLR, QM, MLCR).]

For evaluating the similarity of distributions, the following metrics are used: the Kolmogorov-Smirnov statistic (K), which for a given pair of cumulative distribution functions $F_1(x)$ and $F_2(x)$ is $\max_x |F_1(x) - F_2(x)|$; the standard deviation σ; the correlation ρ; and the correlation-CDF (ρ−CDF), which measures the correlation between two CDFs. MLCR regularly outperformed the baseline regression approaches at every station (Table 7.3), while QM produced the most accurate standard deviation. However, MLCR was able to catch up with QM in terms of ρ−CDF, especially for precipitation, due to the emphasis given to the distribution-driven term in the experiments.

Table 7.3: Percentage of stations where MLCR outperformed the baselines in terms of σ and ρ−CDF.

            σ win-loss %            ρ−CDF win-loss %
  Dataset   MLR    Lasso   QM      MLR     Lasso   QM
  WRFG-T    100    100     0       100     100     0
  CRCM-T    100    100     0       100     100     0
  RCM3-T    100    100     0       100     100     0
  WRFG-t    100    100     0       78.6    85.8    64.3
  CRCM-t    100    100     0       92.9    100     35.8
  RCM3-t    100    100     0       92.9    85.8    85.7
  WRFG-P    100    100     7.1     100     100     28.6
  CRCM-P    100    100     0.0     100     100     50.0
  RCM3-P    100    100     7.1     100     100     64.3

[Figure 7.6: CDF of predicted daily maximum temperature at a weather station in Michigan, 1990-99 (Observation, MLR, QM, MLCR).]
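For reference, a small sketch of the distribution-fit metrics used in this evaluation, assuming equal-length prediction and observation vectors; the function names are illustrative.

```python
import numpy as np

def rmse(y, f):
    """Root mean square error between observations y and predictions f."""
    return np.sqrt(np.mean((y - f) ** 2))

def rmse_cdf(y, f):
    """RMSE between the order statistics of observation and prediction;
    measures how well the predicted CDF matches the observed CDF."""
    return np.sqrt(np.mean((np.sort(y) - np.sort(f)) ** 2))

def ks_statistic(y, f):
    """Two-sample Kolmogorov-Smirnov statistic max |F1 - F2|,
    evaluated over the pooled sample values."""
    grid = np.sort(np.concatenate([y, f]))
    F1 = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    F2 = np.searchsorted(np.sort(f), grid, side="right") / len(f)
    return np.max(np.abs(F1 - F2))
```

Note that `rmse_cdf` is invariant to the pairing of data points, which is why a method can score well on it while still having a large `rmse`; this is precisely the tension that the γ parameter trades off.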
7.4.3.1 QCR Results

In addition to the above-mentioned metrics, QCR was compared with baseline approaches such as MLR, QM, and QR in terms of its performance at the extremes of the distributions. In terms of RMSE for the extreme-valued data points alone, QCR was able to outperform MLR, since MLR tended to underestimate the extremes. QCR also fared very well against QR (Figure 7.8), where the regression models emphasized the lowest τ quantile, corresponding to extreme values of the target variable (minimum temperature). It is clear that QCR emphasized accuracy in the lower quantiles of the distribution over the higher quantiles, as expected.

Precision and recall of extreme events were computed to measure the timing accuracy of the prediction of extreme-valued data points. The F-measure, the harmonic mean of recall and precision, is used as a score summarizing the precision and recall results. It was also found that QCR had the best F-measure among the regression-based approaches in terms of correctly identifying extreme values across all the stations.

[Figure 7.7: CDF of predicted daily precipitation at a weather station in Michigan, 1990-99 (Observation, QR, QCR).]

Figure 7.7 shows the performance of QCR on precipitation. In spite of a larger value of the γ parameter for QCR compared with that used for MLCR, QCR performed better than MLCR in terms of correcting the overall shape of the distribution. This is because of the zero-inflated nature of precipitation, which results in very few large-valued data points, and these have a larger influence on the appearance of the CDF plot. As seen in Table 7.4, QCR also regularly outperformed QR in terms of the other metrics, such as correlation.

Table 7.4: Percentage of stations where QCR outperformed the baseline approaches in terms of RMSE, F-measure, K statistic, and correlation for data points considered extreme values.

  Dataset   RMSE   F-measure   K      ρ
  WRFG-T    100    100         100    100
  CRCM-T    100    100         100    92.9
  RCM3-T    100    100         100    100
  WRFG-t    100    100         100    64.3
  CRCM-t    100    100         100    58.7
  RCM3-t    100    100         100    78.6
  WRFG-P    100    100         100    35.8
  CRCM-P    100    100         100    28.6
  RCM3-P    100    100         100    21.4

[Figure 7.8: CDF of predicted daily minimum temperature at a weather station in Michigan, 1990-99 (Observation, QR, QCR).]

7.5 MCR Using Heterogeneous Data

The MCR framework can also incorporate heterogeneous data sources of predictor variables. This extension is referred to as $MCR_{HET}$. An example of incorporating heterogeneous data sources is using reanalysis values for the predictor variables together with asynchronous data obtained from GCM-driven runs of an RCM. The asynchronous predictor variables obtained from the GCM-driven runs of the RCM are incorporated into the second term of MCR. Figure 7.9 shows the CDF of predicted daily minimum temperature at a weather station (Eau Claire) in Michigan, 1990-99, using asynchronous regional climate model data. Similarly, Figure 7.10 shows the CDF of predicted daily precipitation at the same weather station in Michigan, 1990-99, using asynchronous regional climate model data.

7.5.1 Geometric Quantile Mapping

The multi-dimensional equivalent of the quantile function is the geometric quantile [32]. For a univariate random variable $X \in \mathbb{R}$, let $F_X(x)$ be its cumulative distribution function (CDF), i.e., $F_X(x) = P(X \le x)$. The corresponding α-quantile of $X$ is given by $\inf\{x \in \mathbb{R} : F_X(x) \ge \alpha\}$. More generally, the position [84] of a data point $z$ relative to a set of points $Z = (z_1, \ldots, z_m)^T$ is given by

$$p_Z(z) = \frac{1}{m}\sum_{i=1}^{m}\eta(z - z_i), \qquad \eta(w) = \begin{cases}\dfrac{w}{\|w\|} & \text{if } w \neq 0\\ 0 & \text{if } w = 0\end{cases}$$
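A minimal NumPy sketch of the position function just defined; the name `position` and the univariate sanity check are illustrative.

```python
import numpy as np

def position(z, Z):
    """Geometric position of point z relative to the rows of Z:
    p_Z(z) = (1/m) * sum_i eta(z - z_i), with eta(w) = w / ||w|| (0 at w = 0)."""
    diffs = z - Z                                  # shape (m, q)
    norms = np.linalg.norm(diffs, axis=1)
    eta = np.zeros_like(diffs)
    nz = norms > 0
    eta[nz] = diffs[nz] / norms[nz, None]          # unit direction vectors
    return eta.mean(axis=0)

# Univariate sanity check: near the median the position is close to 0
Z = np.arange(1.0, 10.0).reshape(-1, 1)
print(position(np.array([5.0]), Z))                # ~0 at the sample median
```

In one dimension the unit directions reduce to signs, so the position is simply the fraction of points below minus the fraction above, consistent with the $2F_Z(z) - 1$ identity discussed next.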
For univariate data, the position $p_Z(z)$ is equal to $2F_Z(z) - 1$, where $F_Z(z)$ is the cumulative distribution function of $Z$. Distribution correction methods such as quantile mapping are only applicable if one can match the position of a data point in one univariate distribution (say, for $x$) to its corresponding position in another univariate distribution (say, for $y$). This is possible using the preceding definition of position for univariate data, since the values of $p_Z$ are always fixed in the range $[-1, +1]$ irrespective of the values in $Z$. Unfortunately, when extended to multivariate positions, the range of values for $p_Z$ may vary depending on the values in $Z$. To overcome this problem, He et al. [61] introduce the notion of a stationary position, obtained by iteratively applying the following position transformation function until convergence:

$$p_Y^k(z) = \frac{1}{\kappa n}\sum_{i=1}^{n}\frac{p_Y^{k-1}(z) - p_Y^{k-1}(y_i)}{\|p_Y^{k-1}(z) - p_Y^{k-1}(y_i)\|}, \qquad p_Y^1(z) = \frac{1}{\kappa n}\sum_{i=1}^{n}\frac{z - y_i}{\|z - y_i\|} \tag{7.1}$$

Here each component of $y_i$ must first be converted to its marginal rank before applying the position transformation function. The marginal rank is the rank of the data point divided by the largest rank, normalized to the range $[-1, 1]$. The normalization negates the effect of variables whose values span different ranges. Data points with normalized marginal rank close to ±1 correspond to extreme values of the particular variable, while those close to 0 are located near the median of the distribution. In practice, the number of iterations needed to reach a stationary distribution is quite small, typically $k \le 5$ [61]. For univariate data, it can be shown that $P^k$ reaches a stationary distribution at $k = 1$. The term κ in Equation (7.1) is a normalization factor ensuring that the distribution of the geometric positions is supported on a $q$-dimensional unit hypersphere. In the case of a bivariate response variable $Y$, the stationary geometric quantile distribution is circularly symmetric around the origin, with radial density $r/\sqrt{1-r^2}$ for $r \in (0,1)$ [61].

[Figure 7.9: Cumulative distribution function of predicted daily minimum temperature at a weather station in Michigan, 1990-99, using asynchronous regional climate model data (Observation, MLR, QM, MCR_HET).]

[Figure 7.10: Cumulative distribution function of predicted daily precipitation at a weather station in Michigan, 1990-99, using asynchronous regional climate model data (Observation, MLR, QM, MCR_HET).]

7.5.2 MCR Using Geometric Quantile Mapped Heterogeneous Data

The predictor variables from the asynchronous data can be transformed to have the same geometric distribution characteristics as the given synchronous predictor variables using methods such as geometric quantile mapping (GQM) and covariance alignment. The MCR approach that applies geometric quantile mapping to the asynchronous predictor variables is referred to as $MCR_{GQ}$.
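The stationary-position iteration of Equation (7.1) can be sketched as follows, assuming the marginal-rank preprocessing described above. The O(n²)-per-iteration pairwise computation and the function names are illustrative simplifications.

```python
import numpy as np

def marginal_rank(Y):
    """Column-wise ranks of the (n x q) sample, rescaled to (-1, 1]."""
    n = Y.shape[0]
    ranks = np.argsort(np.argsort(Y, axis=0), axis=0) + 1
    return 2.0 * ranks / n - 1.0

def stationary_positions(Y, k=5, kappa=np.pi / 4):
    """Apply the position transformation of Equation (7.1) k times to every
    point of the sample, returning the P^k positions of all rows."""
    P = marginal_rank(Y)
    n = len(P)
    for _ in range(k):
        diffs = P[:, None, :] - P[None, :, :]      # pairwise differences
        norms = np.linalg.norm(diffs, axis=2)
        np.copyto(norms, 1.0, where=(norms == 0))  # zero diffs contribute 0
        P = (diffs / norms[:, :, None]).sum(axis=1) / (kappa * n)
    return P

rng = np.random.default_rng(3)
Y = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=500)
P = stationary_positions(Y, k=5)
print(np.linalg.norm(P, axis=1).max())  # radii should fall roughly in the unit disk
```

With the bivariate κ = π/4 normalization, the transformed positions of a well-behaved sample concentrate within the unit disk, which is what makes position matching between two distributions meaningful.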
Figure 7.11 compares the CDF of predicted daily minimum temperature at a weather station (Eau Claire) in Michigan, 1990-99, with and without geometric quantile mapping of the asynchronous predictor variables to match the synchronous predictor variables.

[Figure 7.11: Comparing the cumulative distribution functions of the MCR_HET and MCR_GQ outputs for predicted daily minimum temperature at a weather station in Michigan, 1990-99, using asynchronous regional climate model data.]

Mapping the asynchronous predictor variables to match the synchronous predictor variables via geometric quantile mapping, prior to applying MCR, showed a marginal improvement. Similar results were also seen for the CDF of predicted daily precipitation at the same weather station in Michigan, 1990-99, with and without geometric quantile mapping of the asynchronous predictor variables (Figure 7.12).

7.5.3 Projections For The Years 2040-2049

Figures 7.13 and 7.14 show the projected distributions of the climate variables for the years 2040-2049 for the same weather station in Michigan.

[Figure 7.12: Comparing the cumulative distribution functions of the MCR_HET and MCR_GQ outputs for predicted daily precipitation at a weather station in Michigan, 1990-99, using asynchronous regional climate model data.]

7.6 Conclusions

This chapter presented a framework that regularizes the distribution characteristics of a variable to simultaneously improve the accuracy of individual data points and the shape of the distribution of the projections. The effectiveness of the framework in capturing both the shape and the accuracy of observed climate data at various climate stations located in Michigan, USA, was demonstrated using a multivariate linear interpretation, a non-linear (RBF kernel) interpretation, and a quantile-driven interpretation. In addition to consistently reducing the day-to-day error of the projections, the framework was also shown to be flexible enough to capture different shapes of distribution, as shown for the Gaussian and Gamma distributions.

[Figure 7.13: Comparing the cumulative distribution function of projected daily minimum temperature at a weather station in Michigan for the period 2040-2049 to that of the QM model output for the years 1990-99 (QM output on hindcast data; MLR, QM, and MCR_HET output on projected data).]

[Figure 7.14: Comparing the cumulative distribution function of projected daily precipitation at a weather station in Michigan for the period 2040-2049 to that of the QM model output for the years 1990-99 (QM output on hindcast data; MLR, QM, and MCR_HET output on projected data).]

Chapter 8

Multivariate Contour Regression

This chapter presents a framework for multiple output regression that extends the single output regression framework presented in Chapter 7. The framework is motivated by the growing demand for multiple output prediction methods capable of both minimizing residual errors and capturing the joint distribution of the response variables in a realistic and consistent fashion.
The multiple output regression presented in this chapter preserves the relationships among the response variables (including possible non-linear associations) while minimizing the residual errors of prediction, by coupling regression methods with geometric quantile mapping.

8.1 Introduction

Multiple output regression (MOR) is the task of inferring the joint values of multiple response variables from a set of common predictor variables. The response variables are often related, though their true relationships are generally unknown a priori. An example application of multiple output regression is to simultaneously estimate the projected future values of temperature, precipitation, and other climate variables needed for climate change impact, adaptation and vulnerability (CCIAV) assessments. The projected values are used as the driving input variables for phenological and hydrological models to simulate the responses of the ecological system to future climate change scenarios. To ensure that the projected values are realistic, there are certain constraints on the relationships among the response variables that must be preserved; e.g., minimum temperature must not exceed maximum temperature, and liquid precipitation should be zero when the temperature is below freezing.

[Figure 8.1: Scatter plots of observed daily maximum and minimum temperature at a climate station in Michigan, USA, overlaid with the outputs of (a) MOR and (b) BQM.]

While numerous multiple output regression methods have been developed in recent years [25, 112, 12, 95, 69], most of them focus on fitting the conditional mean or preserving the covariance structure of the outputs. Such methods do not adequately capture the full range of variability in the joint output distribution, as illustrated in Figure 8.1(a). The inability of standard regression-based approaches to reproduce the shape of the true distribution of output variables, even for univariate response variables, is well documented [5]. Univariate distribution-driven approaches such as quantile mapping (QM) [108] and statistical asynchronous regression (SAR) [92] have been developed to address this limitation, but the accuracy of these approaches is generally poor, since they are not designed to minimize residual errors. Quantile mapping approaches map a univariate predictor variable x to its corresponding response variable y by transforming the cumulative distribution function (CDF) of x to match that of y. More recently, a bivariate quantile mapping approach (BQM) (see Figure 8.1(b)) has been developed to generate bivariate response values that mimic the joint distribution of the observed response data [61]. However, as will be shown in this chapter, its residual error is significantly worse than that of regression-based methods, because the position and rank correlation between the predictor and response variables remain invariant under QM-based transformation, which in turn hinders its ability to minimize residual errors. Thus, unless the predictor variable has a high rank correlation with the response variable, the residual error upon applying QM-based approaches is likely to be large.
This suggests a possible hybrid approach to improve both the residual errors and the distribution fit: first apply a regression-based method to transform the predictor variables so that their rank correlation with respect to the response variable is high, and then apply quantile mapping to adjust the fit in distribution. However, maximizing the rank correlation of the data points is a necessary but not sufficient condition for improvement in the residuals of QM, unless the response values of the data points are uniformly spaced. Hence the need for position regularization, which prioritizes the prediction accuracy of data points whose positions, when incorrectly estimated, result in high residuals. The term 'position' here refers to the geometric quantile of a data point with respect to a multivariate distribution, which is analogous to the quantile of a data point in the case of a univariate distribution. In this chapter, a position-regularized, multi-output prediction framework called Multi-Output Contour Regression (MCR) is presented. MCR addresses the dual objective of preserving the associations among the multiple output variables as well as minimizing residuals. MCR achieves this dual objective by applying a novel, position-regularized regression method, followed by geometric quantile mapping (GQM) to improve the fit in distribution. The position-regularized regression helps alleviate the limitation associated with the rank-invariant property of QM, which contributes to the high residuals of QM-based approaches. MCR additionally addresses the challenge of ensuring that its prediction of the response variables always abides by the constraints of the actual response data. MCR is also not limited in the number of predictor variables that may be used, nor does it require them to have high correlation with the response variables, unlike quantile mapping. The flexible nature of the framework allows for the incorporation of other loss functions, such as the L1 loss used in quantile regression.

8.2 Preliminaries

Let $X = [\mathbf{x}_1, \ldots, \mathbf{x}_n]^T$ be an $(n \times d)$ data matrix and $Y = [\mathbf{y}_1, \ldots, \mathbf{y}_n]^T$ the corresponding $(n \times q)$ response matrix, such that $\mathbf{x}_i \in \mathbb{R}^d$ and $\mathbf{y}_i \in \mathbb{R}^q$ are column vectors representing the respective values of the predictor and response variables for the $i$th data point. The objective of multi-output regression (MOR) is to learn a target function $h(\mathbf{x}, \Omega)$ that best estimates the multi-output response $\mathbf{y}$, where $\Omega = (\omega_1, \ldots, \omega_q)$ is the parameter set of the target function.

For a univariate random variable $X \in \mathbb{R}$, let $F_X(x)$ be its cumulative distribution function (CDF), i.e., $F_X(x) = P(X \le x)$. The corresponding α-quantile of $X$ is given by $\inf\{x \in \mathbb{R} : F_X(x) \ge \alpha\}$. Intuitively, each quantile indicates the value below which a certain fraction of the data points fall, and thus provides a measure of its position in the data. For example, the median, which is equivalent to the 0.5-quantile, is the central location of the distribution. More generally, the position [84] of a data point $z$ relative to a set of points $Z = (z_1, \ldots, z_m)^T$ is given by

$$p_Z(z) = \frac{1}{m}\sum_{i=1}^{m}\eta(z - z_i), \qquad \eta(w) = \begin{cases}\dfrac{w}{\|w\|} & \text{if } w \neq 0\\ 0 & \text{if } w = 0\end{cases}$$

For univariate data, the position $p_Z(z)$ is equal to $2F_Z(z) - 1$, where $F_Z(z)$ is the cumulative distribution function of $Z$. The multi-dimensional equivalent of the quantile function is the geometric quantile [32].
Distribution correction methods such as quantile mapping are only applicable if one can match the position of a data point in one univariate distribution (say, for $x$) to its corresponding position in another univariate distribution (say, for $y$). This is possible using the preceding definition of position for univariate data, since the values of $p_Z$ are always fixed in the range $[-1, +1]$ irrespective of the values in $Z$. Unfortunately, when extended to multivariate positions, the range of values for $p_Z$ may vary depending on the values in $Z$. To overcome this problem, He et al. [61] introduce the notion of a stationary position, obtained by iteratively applying the following position transformation function until convergence:

$$p_Y^k(z) = \frac{1}{\kappa n}\sum_{i=1}^{n}\frac{p_Y^{k-1}(z) - p_Y^{k-1}(y_i)}{\|p_Y^{k-1}(z) - p_Y^{k-1}(y_i)\|}, \qquad p_Y^1(z) = \frac{1}{\kappa n}\sum_{i=1}^{n}\frac{z - y_i}{\|z - y_i\|} \tag{8.1}$$

Here each component of $y_i$ must first be converted to its marginal rank before applying the position transformation function. The marginal rank is the rank of the data point divided by the largest rank, normalized to the range $[-1, 1]$. The normalization negates the effect of variables whose values span different ranges. Data points with normalized marginal rank close to ±1 correspond to extreme values of the particular variable, while those close to 0 are located near the median of the distribution. In practice, the number of iterations needed to reach a stationary distribution is quite small, typically $k \le 5$ [61]. For univariate data, it can be shown that $P^k$ reaches a stationary distribution at $k = 1$.

The term κ in Equation (8.1) is a normalization factor ensuring that the distribution of the geometric positions is supported on a $q$-dimensional unit hypersphere. In the case of a bivariate response variable $Y$, the stationary geometric quantile distribution is circularly symmetric around the origin, with radial density $r/\sqrt{1-r^2}$ for $r \in (0,1)$ [61]. Therefore,

$$\kappa = \int_0^1 \frac{r^2}{\sqrt{1-r^2}}\,dr = \frac{\pi}{4}$$

In this chapter, the positions of the multivariate data points in $Y$ are denoted $P_Y = [p_Y(y_1), \ldots, p_Y(y_n)]^T$, where $p_Y(y_i) \in [-1,1]^q$. The notation $z_{XY}(y) = p_X^{-1}(p_Y(y))$ is used to represent the point in the domain of $X$ that has the same geometric quantile position as the data point $y$ in $Y$, i.e., $p_X(z_{XY}(y)) = p_Y(y)$. Consequently, $z_{YY}(y_i) = y_i$. Finally, let $Z_{XY} = [z_{XY}(y_1)^T, \ldots, z_{XY}(y_n)^T]^T$ be the geometric quantiles in $X$ that correspond to the data points in $Y$.

8.2.1 Properties of the Geometric Quantiles

Proposition 8.2.1. For $q = 1$, $P^k$ reaches a stationary distribution at $k = 1$.

Proof. Based on Equation (8.1), for $q = 1$, $P_X^k(x)$ computes the rank of the univariate variable $x$, scaled to the range $(-1, 1)$. Since each iteration of $P_X^k(x)$ rescales the marginal rank to the range $(-1, 1)$, the rank is preserved. Hence, for $q = 1$, $P^k$ reaches a stationary distribution at $k = 1$. ♦

Proposition 8.2.2. Multivariate distributions that are movement (translation) transformations of each other have the same $P^k$ distribution.

Proof. The geometric quantile distribution is invariant under movement transformation if, given $X = [x_1, \ldots, x_n]$ and $Y = [y_1, \ldots, y_n]$ with geometric quantile distributions $P_X$ and $P_Y$, we have $P_X = P_Y$. Since $X$ and $Y$ are movement transformations of each other, by definition $\exists \Delta \in \mathbb{R}^q$ s.t. $X = Y + \Delta$. Given $z_Y \in Y$ such that

$$p_Y^1(z_Y) = \frac{1}{\kappa n}\sum_{i=1}^{n}\frac{z_Y - y_i}{\|z_Y - y_i\|}$$

substituting $z_Y = z_X - \Delta$ and $y_i = x_i - \Delta$ gives

$$p_Y^1(z_Y) = \frac{1}{\kappa n}\sum_{i=1}^{n}\frac{(z_X - \Delta) - (x_i - \Delta)}{\|(z_X - \Delta) - (x_i - \Delta)\|} = p_X^1(z_X)$$

$\Rightarrow P_X^k = P_Y^k\ \forall k$. Hence $P^k$ is invariant under movement transformation. ♦
Proposition 8.2.3. Multivariate distributions that are scale transformations of each other have the same $P^k$ distribution.

Proof. The geometric quantile distribution is invariant under scale transformation if, given $X = [x_1, \ldots, x_n]$ and $Y = [y_1, \ldots, y_n]$ with geometric quantile distributions $P_X$ and $P_Y$, we have $P_X = P_Y$. Since $X$ and $Y$ are scale transformations of each other, by definition $\exists \alpha > 0$ s.t. $X = \alpha Y$. Given

$$P_X^1(z_X) = \frac{1}{\kappa n}\sum_{t}\frac{z_X - X(t)}{\|z_X - X(t)\|}$$

substituting $z_X = \alpha z_Y$ and $X(t) = \alpha Y(t)$ gives

$$P_X^1(z_X) = \frac{1}{\kappa n}\sum_{t}\frac{\alpha z_Y - \alpha Y(t)}{\|\alpha z_Y - \alpha Y(t)\|} = \frac{1}{\kappa n}\sum_{t}\frac{\alpha (z_Y - Y(t))}{\alpha\|z_Y - Y(t)\|}$$

$\Rightarrow P_X = P_Y$, which means that $P^k$ is invariant under scale transformation. ♦

8.2.2 Quantile Mapping-Based Approaches

Quantile mapping transforms a univariate predictor variable $X$ to its corresponding response variable $Y$ by adjusting the cumulative distribution function $F_X$ to match that of $F_Y$:

$$QM: \hat{y} = F_Y^{-1}(F_X(x)) \tag{8.2}$$

It can be shown that QM preserves the rank correlation¹ between the variables. For instance, consider the example in Table 8.1, where $y$ is the response variable and $x_1$, $x_2$ are two independent predictor variables. Let QM($x_1$) and QM($x_2$) be the corresponding QM outputs for $x_1$ and $x_2$, respectively. If we sort the vectors in ascending order, it is easy to see that the resulting rank vectors are invariant under QM transformation. As a result, the rank correlation between $x_1$ (or $x_2$) and $y$ is identical to the rank correlation between QM($x_1$) (or QM($x_2$)) and $y$. Furthermore, the empirical CDFs of QM($x_1$) and QM($x_2$) are identical to that of $y$, i.e., $F_Y = F_{QM(x_1)} = F_{QM(x_2)}$.

¹Examples of rank correlation measures include the Kendall τ and Spearman ρ coefficients.

Table 8.1: Quantile Mapping Example

  x1    x2    y     QM(x1)  QM(x2)       x3    x4    y     QM(x3)  QM(x4)
  0.6   0.7   0.2   0.1     0.2          0.7   0.6   0.2   0.2     0.1
  0.8   0.6   0.1   0.3     0.1          0.6   0.7   0.1   0.1     0.2
  0.7   0.9   0.3   0.2     0.4          0.9   0.8   0.3   0.7     0.3
  0.9   0.8   0.4   0.4     0.3          0.8   0.9   0.7   0.3     0.7
              SSR = 0.06    0.02                     SSR = 0.32    0.02

Even though quantile mapping replicates the empirical distribution of $y$ perfectly, QM($x_1$) has a higher residual error than QM($x_2$). This can be explained by the lower rank correlation between $x_1$ and $y$ compared to the rank correlation between $x_2$ and $y$. Note that the inverse relationship between rank correlation and residual error holds only if the values of the response variable are uniformly spaced. For example, if the response value $y$ of the fourth data point changes from 0.4 to 0.7, the residual error of QM($x_2$) increases from 0.02 to 0.32 and becomes larger than the residual error of QM($x_1$), which remains at 0.06. In this case, the high rank correlation of $x_2$ does not translate to a lower residual error when applying quantile mapping. A formal proof of the relationship between rank correlation and residual error for uniformly spaced data is given in the next section.

Since most data sets are non-uniform, maximizing rank correlation is not a sufficient condition to ensure a low residual error. Nevertheless, data points associated with quantiles located in sparse regions (i.e., far from their next closest quantiles) were observed to contribute more to the residual error when incorrectly ranked than data points associated with quantiles located in dense regions. This is demonstrated by the example in Table 8.1, where both $x_3$ and $x_4$ have the same rank correlation with respect to the response variable $y$, yet different SSR. The response values of the first three data points (0.2, 0.1, and 0.3) are closer to each other than that of the last data point (0.7). An incorrect ranking of the fourth data point will lead to a much higher residual error than for the first three data points. Since $x_3$ ranked the fourth data point incorrectly, its residual error is larger than that of $x_4$, even though both have the same rank correlation. This suggests a possible heuristic for improving both rank correlation and residual error: emphasize the data points that would contribute high residual errors to the prediction if ranked incorrectly.
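The Table 8.1 computation can be reproduced with a few lines of NumPy; the rank-based `qm` helper is an illustrative implementation of empirical QM for equal-length samples.

```python
import numpy as np

def qm(x, y):
    """Empirical QM for equal-length samples: the i-th ranked value of x
    is mapped to the i-th ranked value of y (rank-preserving)."""
    out = np.empty_like(y)
    out[np.argsort(x)] = np.sort(y)
    return out

x1 = np.array([0.6, 0.8, 0.7, 0.9])
x2 = np.array([0.7, 0.6, 0.9, 0.8])
y  = np.array([0.2, 0.1, 0.3, 0.4])

for name, x in [("x1", x1), ("x2", x2)]:
    ssr = ((qm(x, y) - y) ** 2).sum()
    print(name, qm(x, y), round(ssr, 2))   # SSR: 0.06 for x1, 0.02 for x2
```

Running the same two lines with the $x_3$, $x_4$ columns and the modified response (0.2, 0.1, 0.3, 0.7) reproduces the SSR values 0.32 and 0.02 from the right half of the table.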
8.2.3 Rank Correlation and Residual Errors of Quantile Mapping

This section presents several properties of the QM approach with respect to the rank correlation and residual error of its output. First, quantile mapping is shown to preserve the rank correlation between the predictor and response variables.

Proposition 8.2.4. Rank correlation is invariant under QM transformation if the values of the predictor and response variables in a data set are unique.

Proof. Consider a data set $D = \{(x_i, y_i)\}_{i=1}^{n}$ containing $n$ points. Let $\hat{y}_i$ be the quantile-mapped value for the data point with predictor variable $x_i$. To prove that rank correlation is invariant under QM transformation, it is sufficient to show that the rank of $x_i$ is identical to the rank of $\hat{y}_i$ after quantile mapping. Without loss of generality, assume the data points in $D$ are sorted in increasing order of their $x$ values. Thus, the rank of data point $x_i$ is $i$ (since the $x$ values are unique). Equation (8.2) can be rewritten as $F_Y(\hat{y}_i) = F_X(x_i)$. Since $F_X(x_i) = i/n$, it follows that $F_Y(\hat{y}_i) = i/n$. Given that the response values are distinct, the rank of $\hat{y}_i$ is also $i$. ♦

Next, the relationship between rank correlation and the residual error of QM is illustrated for data sets with uniformly spaced response values.

Proposition 8.2.5. The residual error of QM is negatively proportional to the rank correlation, given a data set with a uniformly spaced response variable.

Proof. Consider a data set $D = \{(x_i, y_i)\}_{i=1}^{n}$ containing $n$ points. Let $r_i$ be the rank of $x_i$ and $s_i$ the rank of the response value $y_i$. To simplify the discussion, the ranks in $r$ and $s$ are assumed to be unique. Since $y$ is uniformly spaced, it can easily be shown that $y_i = s_i c_1 + c_0$, where $c_0$ and $c_1$ depend only on the minimum and maximum values in $y$. Following Proposition 8.2.4, since QM preserves the rank of the data point $x_i$, $\hat{y}_i = r_i c_1 + c_0$. The Spearman rank correlation between $x$ and $y$ can be written as

$$\rho = \frac{\sum_i (r_i - \bar{r})(s_i - \bar{s})}{\sqrt{\sum_i (r_i - \bar{r})^2 \sum_i (s_i - \bar{s})^2}}$$

which can be further simplified to $\rho = (\sum_i r_i s_i + c_3)/c_2$, where $c_2$ and $c_3$ are constants for a fixed $n$. Since the QM output $\hat{y}$ is a reordering of $y$, its residual error can be computed as $SSR = \sum_i (y_i - \hat{y}_i)^2 = 2(\sum_i y_i^2 - \sum_i y_i \hat{y}_i)$. Furthermore, $\sum_i y_i \hat{y}_i = c_1^2 \sum_i r_i s_i + c_4$, where $c_4$ depends on $c_0$, $c_1$, and $n$. Therefore, $SSR = 2\left(\sum_i y_i^2 - c_1^2(c_2\rho - c_3) - c_4\right)$. Since $c_2$ is a non-negative constant, $c_1^2 c_2$ will always be non-negative. Hence, SSR is negatively proportional to ρ for a given data set with uniformly spaced $y$. ♦
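A small numerical check of Proposition 8.2.5, assuming a uniformly spaced response and a rank-based empirical QM; the noise levels and helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
y = np.linspace(0.0, 1.0, n)            # uniformly spaced response

def qm(x, y):
    """Rank-preserving empirical QM for equal-length samples."""
    out = np.empty_like(y)
    out[np.argsort(x)] = np.sort(y)
    return out

def spearman(a, b):
    """Spearman rho as the Pearson correlation of the ranks."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Predictors with progressively weaker rank correlation to y
for noise in [0.0, 0.5, 2.0]:
    x = y + rng.normal(0, noise, n)
    ssr = ((qm(x, y) - y) ** 2).sum()
    print(f"rho = {spearman(x, y):+.3f}   SSR = {ssr:.4f}")
```

As the rank correlation ρ drops, the SSR of the QM output rises linearly, as the proposition predicts for uniformly spaced responses.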
Although Proposition 8.2.5 applies only to uniformly spaced response variables, there are other situations where the residual error of the QM output can be reduced if its rank correlation increases, as shown in the next proposition.

Proposition 8.2.6. Correcting the rank of $x_i$ to match the rank of its corresponding response variable $y_i$ maintains, if not improves, the residual error of the QM output, as long as it does not deteriorate the ranks of the other data points in $x$.

Proof. Given the response variable $y$, let $\hat{y}$ be the QM output of the predictor variable $x$. The sum squared residual of QM($x$) is $SSR_x = \sum_i (y_i - \hat{y}_i)^2$, where the residual error of the $i$th data point is $\varepsilon_{x_i} = y_i - \hat{y}_i = F_Y^{-1}(F_Y(y_i)) - F_Y^{-1}(F_Y(\hat{y}_i))$. Following Equation (8.2), we have $F_Y(\hat{y}_i) = F_X(x_i)$. Consequently, the residual error can be rewritten as $\varepsilon_{x_i} = F_Y^{-1}(F_Y(y_i)) - F_Y^{-1}(F_X(x_i))$. Assuming the residual error of the QM output for $x$ is non-zero, there must exist a data point $x_j$ such that $F_Y(y_j) \neq F_X(x_j)$. Next, consider an improved vector $x'$, which is a reordering of the values in $x$ subject to the following two conditions: (1) $x'_i = x_i$ if the $i$th data point of $x$ is ranked correctly, i.e., $x'_i = x_i$ if $F_X(x_i) = F_Y(y_i)$; (2) if the $i$th data point of $x$ is not ranked correctly, then either $x'_i = x_i$ or $F_X(x'_i) = F_Y(y_i)$. Note that there must be at least one data point ranked incorrectly in $x$ but correctly in $x'$. The second condition also implies that any data point that has been reordered must be ranked correctly. Thus, for all data points in condition (1), $\varepsilon^2_{x'_i} = \varepsilon^2_{x_i}$, since their ranks remain unchanged. On the other hand, for all data points in condition (2), $\varepsilon^2_{x'_i} \le \varepsilon^2_{x_i}$, since $F_X(x'_i)$ is either the same as $F_X(x_i)$ or equal to $F_Y(y_i)$. Therefore, $\forall i$, $\varepsilon^2_{x'_i} \le \varepsilon^2_{x_i}$, and hence $SSR_{x'} \le SSR_x$. Thus, by correcting the ranks of the data points that do not have the same rank as their corresponding response values, while ensuring the ranks of all other data points remain the same, the SSR of the QM output can be improved. ♦

Even though the above proposition suggests that one can maximize rank correlation and improve residuals simultaneously, this requires a flexible target function that allows all possible orderings of $x$. For linear functions, it might not be possible to produce a reordering of the values in $x$ without affecting the ranks of other data points. Thus, an alternate scheme is proposed that focuses on correcting the ranks of data points associated with high residuals, as explained in the next section.

8.3 Multi-Output Contour Regression Framework (MCR)

Since QM and regression-based approaches have their own distinct advantages, which have been successfully exploited in a hybrid manner by approaches such as CR, we propose a framework that extends the intuition behind hybrid approaches, exploiting the unique advantages of both QM and regression, to the multi-output setting. The approach uses a position-regularized regression function $\hat{h}(\mathbf{x}, \Omega)$ that prioritizes matching the positions of the output to the positions of the observed response data. This step is followed by correcting the geometric quantiles of the output from the previous step to match the observed response data, using the intuition of QM. This hybrid approach removes QM's limitation on the number of predictor variables that may be used, as well as the requirement that the predictor variables be highly correlated with the response variable. The hybrid approach is further shown to be flexible enough to work in a multi-output setting, capturing the multi-output associations that are often ignored.

To prioritize improving the positions of the output, the proposed multi-output contour regression (MCR) framework learns the regression function $\hat{h}(\mathbf{x}, \Omega)$, which consists of two components. The first component is similar to a conventional regression loss function, where the data matrix is regressed on the observed response variable; this component emphasizes minimizing the residual error of the regression function. The second component of $\hat{h}(\mathbf{x}, \Omega)$ is the position regularizer, which helps improve the rank correlation of $h(\mathbf{x}, \Omega)$ and $\mathbf{y}$. At first glance, one would expect the second term to regress on the positions of the data points. Instead of regressing on the positions directly, we regress on the geometric quantiles of the data points, obtained by inverse-mapping their positions to the output response space. This is done so that the position regularizer assigns a larger penalty to those data points whose positions, when incorrectly estimated, result in larger minimum residual errors.
The first component is similar to conventional regression loss function where the data matrix is made to regress with respect to the observed response variable. This component emphasizes minimizing residual error of the regression function. ˆ The second component of h(x, Ω) is the position regularizer that helps improve rank correlation of h(x, Ω) and y. At a first glance, one would expect the second term to be regressing on the position of the data points. Instead of regressing on the position of the data points, we regress on the geometric quantiles of the data points obtained by inverse mapping their positions to the output response space. This is done so that the position regularizer assigns a larger penalty to those data points whose position when incorrectly estimated, results in a larger minimum residual errors. To accomplish this, the data matrix 159 is made to regress on zY Y , where, ˆ ˆY Y (y) = p−k (pk (y)) zˆ Y ˆ Y (8.3) ˆ is the geometric quantile value in the h(x, Ω) regression output space that corresponds to the position of the observed response variable y. The regression function of MCR is shown in Equation (8.4), n min Ω i=1 (γL(h(xi , Ω), yi ) + (1 − γ)L(h(xi , Ω), zY Y )) ˆ (8.4) where 0 ≤ γ ≤ 1 is a user defined parameter that may be used for either prioritizing fidelity of regression accuracy or its position correlation. L can be any generic loss function such as ordinary least square (that multiple linear regression adopts), or quantile mapping (if certain quantiles are to be prioritized overs others, such as in the case of a heavy tail distribution).For instance, when the loss function L is ordinary least square, Equation 8.4 takes the form q n min Ω j=1 i=1 (γ(xT Ωj − yi )2 + (1 − γ)(xT Ωj − zY Y )2 ) ˆ i i which corresponds to the following matrix form ˆ Ω = arg min tr(γ(XΩ − Y)T (XΩ − Y) + (XΩ − ZY Y )T (XΩ − ZY Y )) ˆ ˆ Ω ˆ The regression parameters Ω is learnt in an iterative manner. At each iteration, the regression output space from the previous iteration is used to compute zY Y in the second ˆ 160 ˆ component of the regression function h(x, Ω). For the very first iteration, the regression output space is that of regular multiple linear regression. ˆ Once h(x, Ω) is learnt, the MCR prediction for a given data point x having corresponding ˆ ˆ observed multi-output response y and a regression estimation of y = h(x, Ω) is obtained by inverse geometrically quantile mapping pk (ˆ ) to its corresponding value in the observed ˆ y Y response variable space, to give the MCR prediction ˆY Y , z ˆ ˆ M CR : ˆY Y = p−k (pk (h(x, Ω))) z ˆ ˆ Y Y (8.5) ˆ where, p−k (pk (ˆ)) maps the stationary geometric quantile position of h(x, Ω) to its ˆ y Y Y corresponding data point in Y. To summarize, multi-output contour regression (MCR) performs multi-output regression of the predictor variables such that the position of its output is highly correlated with respect to position of the observed response variable, thereby reducing position errors of the multi-output regression results. This multivariate regression output is then mapped to its corresponding geometric quantile counterpart in the observed multi-output response space using geometric quantiles. The rationale behind using the regularized regression results, prior to performing multi-output geometric quantile mapping in MCR, is to improve on SSR by increasing the correlation among the multivariate ranks of the predictors and response variable. 
8.3.1 Estimating Inverse Geometric Quantile Position

The value $\hat{z}(p)$ that corresponds to a given geometric quantile position $p$ in a multivariate distribution $F_Y$ is empirically computed by minimizing the generalized multivariate quantile loss function [32]:

$$\hat{z}(p) = \arg\min_{z \in \mathbb{R}^q} \sum_{i=1}^{n}\left(\|y_i - z\| + \langle p, y_i - z\rangle\right) \tag{8.6}$$

where $p \in \mathbb{R}^q$ and $\langle\cdot,\cdot\rangle$ denotes the Euclidean inner product. As long as the values of $y_i$ do not all fall on the same line, $\hat{z}(p)$ is unique for a given $p$ when $q \ge 2$ [32]. Algorithms such as the Newton-Raphson method can be used to minimize the loss function for the geometric quantile $\hat{z}(p)$ using the update $\hat{z} \leftarrow \hat{z} - \delta^{-1}\bar{\delta}$, where

$$\bar{\delta} = (n\kappa)p - \sum_{i=1}^{n}\frac{z - y_i}{\|z - y_i\|}, \qquad \delta = \sum_{i=1}^{n}\|z - y_i\|^{-1}\left(I_q - \|z - y_i\|^{-2}(z - y_i)(z - y_i)^T\right)$$

For a univariate distribution $F_Y$, it can easily be shown that Equation (8.6) reduces to the same loss function used to identify the αth regression quantile in a linear regression setup for quantile regression [77], where $0 < \alpha < 1$ and $p = 2\alpha - 1$; i.e., $\sum_{i=1}^n (|y_i - z| + p(y_i - z))$ is minimized by the $z$ that corresponds to the αth quantile of $Y$.

8.3.2 Alternative Approximation-Based Approach for MCR

If one can assume that, given the position $p$ of a test data point $y_{test}$ belonging to the distribution $F_Y$, there exists $y_i \in Y$ such that $y_{test} \approx y_i$, then the search space for $\hat{z} = y_{test}$ can be limited to the data points in $Y$. Given that the search space for $\hat{z}$ is finite, it will not always be possible to find the exact minimizer using the update direction $\bar{\delta}$, as it returns a vector. Alternatively, the following range-bound approximation, equivalent to Equation (8.6), can be used to find the best solution [61, 32]:

$$\arg\min_{z} \sum_{i=1}^{n}\left\{\|y_i - z\| + \frac{1}{\kappa}(y_i - z)^T p\right\} \tag{8.7}$$

where κ is the normalization factor from Equation (8.1). As shown in the experiment section, there was only a marginal deterioration in the solution obtained from the above approximation, owing to the sufficient number of training data points. Another approximation approach, with even looser bounds than Equation (8.7) but O(n) time complexity, is the following Euclidean approximation:

$$\hat{z} = \arg\min_{y_i}\left((p - p_Y(y_i))(p - p_Y(y_i))^T\right) \tag{8.8}$$

[Figure 8.2: Relative computation time of the various approximation-based approaches (Newton-Raphson, R-limited approximation, Euclidean approximation) for estimating inverse geometric quantile positions, as a function of (a) training size and (b) test size.]

The R-limited approximation approach (Equation 8.7) as well as the Euclidean approximation approach (Equation 8.8) show considerable improvement in computation time across varying training sizes (Figure 8.2a) and test sizes (Figure 8.2b), with minimal deterioration in the accuracy of the estimated inverse geometric quantile positions.
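Sketches of the two approximations follow, assuming stationary positions have been precomputed for the training responses; the function names are illustrative.

```python
import numpy as np

def inverse_position_euclidean(p, Y, positions):
    """Euclidean approximation of Equation (8.8): return the training
    response whose (precomputed) stationary position is closest to p.
    `positions` holds p_Y(y_i) for each row of Y; O(n) per query."""
    d2 = ((positions - p) ** 2).sum(axis=1)
    return Y[np.argmin(d2)]

def inverse_position_r_limited(p, Y, kappa=np.pi / 4):
    """Approximation of Equation (8.7) with the search space limited to the
    training responses: score each candidate z = Y[j] by the generalized
    quantile loss over all y_i, and keep the minimizer; O(n^2) per query."""
    best, best_val = None, np.inf
    for z in Y:
        diffs = Y - z
        val = np.linalg.norm(diffs, axis=1).sum() + (diffs @ p).sum() / kappa
        if val < best_val:
            best, best_val = z, val
    return best
```

The Euclidean variant is the one whose O(n) cost makes it attractive when the inverse mapping must be applied once per test point at every MCR iteration.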
8.4 Variations of MCR

As mentioned above, when the loss function $L$ is ordinary least squares, Equation (8.4) takes the form

$$\min_\Omega \sum_{j=1}^{q}\sum_{i=1}^{n}\left(\gamma(\mathbf{x}_i^T\omega_j - y_{ij})^2 + (1 - \gamma)(\mathbf{x}_i^T\omega_j - z_{ij})^2\right)$$

which corresponds to the matrix form

$$\hat{\Omega} = \arg\min_\Omega \mathrm{tr}\left(\gamma(X\Omega - Y)^T(X\Omega - Y) + (1 - \gamma)(X\Omega - Z_{\hat{Y}Y})^T(X\Omega - Z_{\hat{Y}Y})\right)$$

The following subsections demonstrate the use of alternative loss functions in the MCR framework.

8.4.1 Quantile Multi-Output Contour Regression ($MCR_Q$)

An alternative to the ordinary least squares loss function, well suited when the response variable has a non-uniform distribution such as a heavy tail, is to use the quantile regression (QR) loss function for $L$, so as to prioritize the rank correlation of the ranks that have the highest variance and, correspondingly, the highest impact on residual errors during quantile mapping if incorrectly ranked [77]. Additionally, $MCR_Q$ may also be suited to estimating the extremes of $Y$ by prioritizing the correct ranking of extreme values. The objective function of QR can be adopted by MCR to obtain the following loss function:

$$\sum_{i=1}^{n}\left(\rho_{\tau_1}(y_i - \mathbf{x}_i^T\beta) + \rho_{\tau_2}(z_i - \mathbf{x}_i^T\beta)\right), \qquad \rho_\tau(u) = \begin{cases}\tau u & u > 0\\ (\tau - 1)u & u \le 0\end{cases}$$

8.4.2 Non-linear Multi-Output Contour Regression ($MCR_{NL}$)

Unlike the linear interpretations of MCR above, $MCR_{NL}$ takes a non-linear approach. By mapping the predictor variables $X$ to a higher-dimensional feature space $F$ via $\phi: X \in \mathbb{R}^d \to F \subseteq \mathbb{R}^N$, where $N \gg d$, one can transform the regularized least squares regression to the feature space $F$ using the kernel $K$. Similarly, the predictor variables of MCR can be mapped to a higher-dimensional feature space $F$ by using the ridge counterpart of the MCR loss function:

$$\beta = (\phi(X)^T\phi(X) + \lambda I)^{-1}\left(\gamma\phi(X)^Ty + (1-\gamma)\phi(X)^Tz\right)$$
$$\Rightarrow \beta = \lambda^{-1}\phi(X)^T\left(\gamma y + (1-\gamma)z - \phi(X)\beta\right) = \phi(X)^T\alpha$$
$$\Rightarrow \alpha = (G + \lambda I)^{-1}\left(\gamma y + (1-\gamma)z\right)$$

where $G = \phi(X)\phi(X)^T$ and $G_{ij} = \langle\phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\rangle = K(\mathbf{x}_i, \mathbf{x}_j)$.

8.5 Experimental Results

The objective of the experiments was to evaluate the ability of MCR to replicate the associations among multiple climate response variables while minimizing the sum squared residuals. All the algorithms were run using climate data obtained at fourteen weather stations in Michigan, USA. The response variables used were the maximum temperature, minimum temperature, and total precipitation for each day, spanning twenty years. The predictor variables used in this study are simulated climate data obtained from regional climate models (RCM) that best correspond to the observed response variables at each of the fourteen weather stations. Three different RCM data sets for each of the climate stations were obtained from the North American Regional Climate Change Assessment Program (NARCCAP) [2]. The three RCMs used are the Canadian Regional Climate Model (CRCM), the Weather Research and Forecasting Model (WRFG), and the Regional Climate Model Version 3 (RCM3). For the purpose of the experiments, there were a total of 126 data sets with univariate response variables, 126 data sets with bivariate responses, and 42 data sets with trivariate responses.

8.5.1 Experimental Setup

Twenty years of predictor and response data, spanning 1980-1999, were split into two parts for training and testing. To evaluate the relative skill in preserving associations among the multi-output responses, popular regression and quantile mapping approaches such as MLR, ridge regression (Ridge), QM, EDCDFm, MOR, CR, and BQM, as well as ad hoc approaches that sequentially combine regression and quantile mapping, were used as baselines. Examples of the ad hoc baselines are MOR combined with BQM (RBQM), and MLR combined with QM (RQM). γ was set to 0.5 for all experiments. For the CR- and MCR-based experiments, the maximum number of iterations was set to ten.
After discarding missing values, each experiment run, for every station and across all the data sets, had a minimum of one thousand training and test data points. All the results reported in the following section are on test data (out-of-sample results). Kendall $\tau$ and Spearman $\rho$ were the two rank correlation metrics used to evaluate univariate rank correlation; where their results were very similar, only one of the two is reported. Root mean square error (RMSE) was used to compare the performance of the various approaches in terms of residual errors. Two-dimensional and three-dimensional scatter plots were used to visualize the relative skill of the various approaches in preserving the associations among the multi-output responses. A sketch of the univariate metric computations follows.
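The three univariate metrics can be computed with standard routines, as in the minimal sketch below (SciPy's kendalltau and spearmanr; the function name and arrays are placeholders, not code from the experiments):

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def evaluate(y_obs, y_pred):
    """Kendall tau, Spearman rho, and RMSE for a single response variable."""
    tau, _ = kendalltau(y_obs, y_pred)
    rho, _ = spearmanr(y_obs, y_pred)
    rmse = float(np.sqrt(np.mean((np.asarray(y_obs) - np.asarray(y_pred)) ** 2)))
    return {"kendall_tau": tau, "spearman_rho": rho, "rmse": rmse}
```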
8.5.2 Results

8.5.2.1 Univariate MCR

The rank correlations of the various response variables were computed in a single-output MCR setting, and it was found that across all the data sets, stations, and response variables (i.e., 126 data sets), MCR consistently improved the rank correlation under both rank correlation metrics. The 126 individual data sets with univariate response data were grouped into nine larger data sets, each grouping the data sets that shared the same response variable as well as the same RCM source for the predictor variables. Figure 8.3 is a box plot representing the percentage of stations in each of the nine data sets where the rank correlation regularizer used in Equation (8.4) improved rank correlation and reduced residuals when compared to the baseline approaches.

[Figure 8.3: Box plots of the percentage of stations where MCR showed improvement over the single-output baselines (MLR, Ridge, QM, EDCDFm, CR), in terms of (a) Kendall τ rank correlation and (b) RMSE, across all RCMs and variables.]

The box plot in Figure 8.4 shows that, in spite of MCR's improvement across the majority of stations in terms of both τ and RMSE against both the regression-based and the quantile-mapping-based approaches, the magnitude of the improvement over the regression-based approaches was not significant. However, the rank correlation regularizer showed a significant improvement in RMSE at each station when compared to the corresponding quantile-mapping-based approaches.

[Figure 8.4: Box plots of the magnitude of MCR's improvement over the baseline approaches, in terms of (a) Kendall τ rank correlation and (b) RMSE, across all RCMs and variables.]

8.5.2.2 Bivariate MCR

Bivariate models for all combinations of bivariate response variables were evaluated. As shown in Figure 8.5, MCR performed best at replicating the bivariate associations while minimizing SSR, although BQM performed as well in terms of replicating the bivariate associations alone.

[Figure 8.5: Scatter plots portraying the fidelity with which the forecasts of the various approaches (MLR, Ridge, QM, CR, MOR, BQM, RBQM, MCR) replicate the observed associations among the bivariate temperature response variables.]

Regression-based approaches (both SOR and MOR) fared poorly in preserving the associations in the two-dimensional space, while single-output quantile-mapping-based approaches were very sensitive to the correlation of the predictor variables with the response, resulting in poor bivariate associations in spite of replicating the marginal distributions of the individual responses very well. In terms of residuals, MCR had considerably lower residuals than the various quantile-mapping baselines, as shown in Table 8.2. As expected, however, MCR showed a marginal increase in residuals when compared to the respective SOR- and MOR-based approaches. The corresponding rank correlation results are given in Table 8.3.

Table 8.2: Performance (RMSE) of bivariate MCR over baseline approaches

            % of stations where MCR        Avg. improvement across
            outperformed the baseline      stations over the baseline
Data set    MOR     QM      BQM            MOR      QM      BQM
WRFG-1      29      100     100            -0.06    0.18    0.17
WRFG-2      7       100     100            -0.08    0.16    0.16
WRFG-3      0       100     100            -0.07    0.31    0.30
CRCM-1      93      100     100             0.06    0.25    0.25
CRCM-2      71      100     100             0.03    0.23    0.23
CRCM-3      7       100     100            -0.02    0.35    0.34
RCM3-1      43      100     100            -0.02    0.20    0.20
RCM3-2      36      100     100            -0.03    0.19    0.18
RCM3-3      0       100     100            -0.07    0.31    0.30

Table 8.3: Performance (Kendall τ) of bivariate MCR over baseline approaches

            % of stations where MCR        Avg. improvement across
            outperformed the baseline      stations over the baseline
Data set    MOR     QM      BQM            MOR      QM      BQM
WRFG-1      64      100     100             0.03    0.40    0.41
WRFG-2      79      100     100             0.04    0.38    0.39
WRFG-3      0       100     100            -0.01    0.75    0.67
CRCM-1      100     100     100             0.13    0.52    0.53
CRCM-2      100     100     100             0.12    0.49    0.52
CRCM-3      14      100     100            -0.01    0.78    0.73
RCM3-1      79      100     100             0.06    0.46    0.46
RCM3-2      79      100     100             0.06    0.47    0.45
RCM3-3      0       100     100            -0.01    0.81    0.78

8.5.2.3 Trivariate MCR

The performance in modeling the associations among three response variables was also evaluated and is shown in Figure 8.6. The performance was compared against single-output and multiple-output models; a trivariate extension of the bivariate BQM approach was used as an additional baseline.

[Figure 8.6: Three-dimensional scatter plots of the observed associations among maximum temperature, minimum temperature, and precipitation, together with the respective forecasts made by the various single-output and multiple-output approaches (MLR, QM, CR, MOR, BQM, RBQM, MCR).]

Along with MCR, the trivariate extension of BQM fared best in replicating the observed associations among the three variables when compared to the baseline approaches. Additionally, MCR improved upon its BQM counterpart in terms of residual reduction: MCR produced a lower RMSE for all stations across all the trivariate data sets, with an average RMSE reduction in excess of 10%. The average improvement across the three variables in terms of rank correlation τ was found to be 0.41.

8.6 Conclusions

This chapter presented a multi-output regression framework that extends the single-output regression framework of Chapter 7. The multi-output framework preserves the general association patterns among multiple response variables while minimizing the overall residual errors, by coupling regression with geometric quantile mapping.
The chapter demonstrated the effectiveness of the framework in significantly reducing residuals while preserving the joint distribution of the multi-output variables, relative to the baseline approaches, at all the climate stations evaluated.

Chapter 9 Conclusions and Future Work

In this thesis, a number of multivariate frameworks were presented that improve upon existing regression-based approaches used for generating projections, by integrating multiple objectives pertaining to the unique characteristics of the response variable as well as the expectations of a long-term projection. The four primary multivariate frameworks, along with their logical extensions, address the following four challenges. The first framework addresses the challenge of modeling response variables with irregular distribution characteristics, in particular zero-inflated response variables. The second framework addresses the challenge of modeling the extremes of a distribution, by prioritizing the conditional quantiles associated with extreme values. The third framework addresses the challenge of building a regression model that preserves the distribution characteristics of the response variable, so as to provide an unbiased projection across all quantiles of the distribution. The fourth framework extends the intuition behind this single-output distribution-preserving framework to a multi-output setting, such that not only is each projection unbiased, but the relationships among the multiple outputs are also maintained.

Given that most of the evaluation of the frameworks presented in this thesis assumed a linear relationship between the predictor and response variables, a detailed evaluation of their performance under non-linear relationships needs to be explored as well. In the frameworks presented in Chapter 3 and Chapter 4, the Pearson correlation coefficient was chosen as the default similarity measure; as future work, the impact of the choice of kernel or similarity measure needs to be evaluated. Exploration of non-linear approaches to further improve the precision and recall of zero- and non-zero-valued data points would provide new insights. Given the numerous available data sources, there is also extensive scope for further exploiting heterogeneous data sets in modeling zero-inflated data.

[Figure 9.1: Relative likelihood of identifying days with large precipitation residuals, for 5%, 10%, and 20% anomaly thresholds, as a function of the top X% of predictor data points.]

The main challenge in identifying erroneous values within the training data set is differentiating them from valid anomalies. Moreover, a model that cleans spurious data during model building should be able to do the same for the out-of-sample data to which it is later applied. Discarding data points due to a faulty value in one of the predictor variables may result in a very small training data set. This is all the more the case when the occurrences of errors across predictor variables are independent of one another and there is a large number of predictor variables, even if the percentage of erroneous values for any individual predictor variable is relatively small. In many practical applications, given the scarcity of data available for training, discarding large amounts of data may be unfeasible.
The other drawback of dropping values that may be erroneous is that the model's response for similar data points in the test set may be extremely poor, on account of the model not having been trained on such data points.

[Figure 9.2: Relative likelihood of a high residual for each response variable (precipitation, maximum temperature, minimum temperature) as a function of the number of erroneous predictor variables (top 5%).]

A possible approach for future work is therefore to build regression models in the presence of uncertain data: identify data points that have a high likelihood of being erroneous, based on their relationships with the corresponding variables, and assign each data point a weight that is a function of this uncertainty, similar to weighted regression, where the weights are the inverse of the variance observed for the respective response variable. Unlike weighted regression, however, the weights chosen here would be a function of the fidelity of the data point's values (a sketch follows below).

As shown in Figure 9.1, it was observed that data points whose predictor values appeared anomalous with respect to the rest of their corresponding predictor variables were far more likely to have large residuals for the response variable. In the case of modeling precipitation using multiple linear regression, data points in the top 5% of those likely to have an erroneous value for the precipitation predictor variable were three times as likely to have large residuals for the response variable. Figure 9.2 shows that the relative likelihood of a data point having a high residual increases as the number of erroneous predictor variables for that data point increases.
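Building on the idea above, the following is a minimal sketch of fidelity-weighted regression. It assumes a fidelity score per data point (e.g., an estimate of the probability that its predictor values are not erroneous, derived from an anomaly analysis such as the one above) is already available; estimating that score is the open problem, and the function shown is hypothetical.

```python
import numpy as np

def fidelity_weighted_fit(X, y, fidelity, lam=1e-6):
    """Weighted least squares in which each weight encodes estimated data
    fidelity rather than the inverse response variance used in classical
    weighted regression: beta = (X'WX + lam*I)^{-1} X'Wy."""
    W = np.diag(np.asarray(fidelity, dtype=float))
    d = X.shape[1]
    return np.linalg.solve(X.T @ W @ X + lam * np.eye(d), X.T @ W @ y)
```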
BIBLIOGRAPHY

[1] Canadian Climate Change Scenarios Network, Environment Canada. http://www.ccsn.ca/.
[2] North American Regional Climate Change Assessment Program. http://www.narccap.ucar.edu/.
[3] Z. Abraham and P.-N. Tan. A semi-supervised framework for simultaneous classification and regression of zero-inflated time series data with application to precipitation prediction. In Data Mining Workshops (ICDMW '09), IEEE International Conference on, pages 644–649. IEEE, 2009.
[4] Z. Abraham and P.-N. Tan. An integrated framework for simultaneous classification and regression of time-series data. In SIAM International Conference on Data Mining (SDM), pages 653–664, 2010.
[5] Z. Abraham, P.-N. Tan, P. Perdinan, J. Winkler, S. Zhong, and M. Liszewska. Distribution regularized regression framework for climate modeling. In SIAM International Conference on Data Mining (SDM), 2013.
[6] Z. Abraham, P.-N. Tan, P. Perdinan, J. Winkler, S. Zhong, and M. Liszewska. Position preserving multi-output prediction. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2013.
[7] Z. Abraham and F. Xin. Extreme value prediction for zero-inflated data. In Advances in Knowledge Discovery and Data Mining, pages 318–329. Springer, 2012.
[8] Z. Abraham, F. Xin, and P.-N. Tan. Smoothed quantile regression for statistical downscaling of extreme events in climate modeling. pages 92–106, 2011.
[9] D. E. Akyuz, M. Bayazit, and B. Onoz. Markov chain models for hydrological drought characteristics. Journal of Hydrometeorology, 13:298–309, 2012.
[10] M. A. Alvarez and N. D. Lawrence. Sparse convolved multiple output Gaussian processes. arXiv preprint arXiv:0911.5107, 2009.
[11] S. Ancelet, M.-P. Etienne, H. Benoît, and E. Parent. Modelling spatial zero-inflated continuous data with an exponentially compound Poisson process. Environmental and Ecological Statistics, 17(3):347–376, 2010.
[12] K. Balasubramanian and G. Lebanon. The landmark selection method for multiple output prediction. arXiv preprint arXiv:1206.6479, 2012.
[13] S. C. Barry and A. H. Welsh. Generalized additive modelling and zero inflated count data. Ecological Modelling, 157(2):179–188, 2002.
[14] N. Bernier, K. Thompson, J. Ou, and H. Ritchie. Mapping the return periods of extreme sea levels: allowing for short sea level records, seasonality, and climate change. Global and Planetary Change, 57(1):139–150, 2007.
[15] C. M. Bishop and N. M. Nasrabadi. Pattern Recognition and Machine Learning, volume 1. Springer, New York, 2006.
[16] J. Bjørnar Bremnes. Probabilistic forecasts of precipitation in terms of quantiles using NWP model output. Monthly Weather Review, 132(1):338–347, 2004.
[17] M. B. Blaschko and C. H. Lampert. Learning to localize objects with structured output regression. In Computer Vision - ECCV 2008, pages 2–15. Springer, 2008.
[18] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. 2001.
[19] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM, 1998.
[20] D. Böhning, E. Dietz, and P. Schlattmann. Zero-inflated count models and their applications in public health and social science. Applications of Latent Trait and Latent Class Models in the Social Sciences. Waxman Publishing Co, 1997.
[21] G. Bontempi. Long term time series prediction with multi-input multi-output local learning. In Proceedings of the 2nd European Symposium on Time Series Prediction (TSP), ESTSP08, pages 145–154, 2008.
[22] M. Booij. Extreme daily precipitation in western Europe with climate change at appropriate spatial scales. International Journal of Climatology, 22(1):69–85, 2002.
[23] G. E. Box and D. A. Pierce. Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. Journal of the American Statistical Association, 65(332):1509–1526, 1970.
[24] U. Brefeld, T. Gärtner, T. Scheffer, and S. Wrobel. Efficient co-regularised least squares regression. In Proceedings of the 23rd International Conference on Machine Learning, pages 137–144. ACM, 2006.
[25] L. Breiman and J. H. Friedman. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(1):3–54, 1997.
[26] A. Buja, B. Logan, J. Reeds, and L. Shepp. Inequalities and positive-definite functions arising from a problem in multidimensional scaling. The Annals of Statistics, pages 406–438, 1994.
[27] R. H. Byrd, J. Nocedal, and R. B. Schnabel. Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming, 63(1-3):129–156, 1994.
[28] K. N. Cahill, D. B. Lobell, C. B. Field, C. Bonfils, and K. Hayhoe. Modeling climate and climate change impacts on winegrape yields in California. American Journal of Enology and Viticulture, 58(3):414A–414A, 2007.
[29] A. Ceglar and L. Kajfež-Bogataj. Simulation of maize yield in current and changed climatic conditions: addressing modelling uncertainties and the importance of bias correction in climate model simulations. European Journal of Agronomy, 37(1):83–95, 2012.
[30] M.-C. Chan, C.-C. Wong, and C.-C. Lam. Financial time series forecasting by neural network using conjugate gradient learning algorithm and multiple linear regression weight initialization. In Computing in Economics and Finance, volume 61, 2000.
[31] S. P. Charles, B. C. Bates, I. N. Smith, and J. P. Hughes. Statistical downscaling of daily precipitation from observed and modelled atmospheric fields. Hydrological Processes, 18(8):1373–1394, 2004.
[32] P. Chaudhuri. On a geometric notion of quantiles for multivariate data. Journal of the American Statistical Association, 91(434):862–872, 1996.
[33] H. Cheng and P.-N. Tan. Semi-supervised learning with data calibration for long-term time series forecasting. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–141. ACM, 2008.
[34] R. T. Clarke. Estimating time trends in Gumbel-distributed data by means of generalized linear models. Water Resources Research, 38(7):1111, 2002.
[35] R. T. Clarke. Estimating trends in data from the Weibull and a generalized extreme value distribution. Water Resources Research, 38(6):25-1, 2002.
[36] I. Cohen, F. G. Cozman, N. Sebe, M. C. Cirelo, and T. S. Huang. Semisupervised learning of classifiers: Theory, algorithms, and their application to human-computer interaction. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(12):1553–1566, 2004.
[37] D. Cooley. Extreme value analysis and the study of climate change. Climatic Change, 97(1-2):77–83, 2009.
[38] D. Cooley, D. Nychka, and P. Naveau. Bayesian spatial modeling of extreme precipitation return levels. Journal of the American Statistical Association, 102(479):824–840, 2007.
[39] C. Cortes and M. Mohri. On transductive regression. Advances in Neural Information Processing Systems, 19:305, 2007.
[40] F. G. Cozman, I. Cohen, and M. Cirelo. Unlabeled data can degrade classification performance of generative classifiers. In FLAIRS Conference, pages 327–331, 2002.
[41] F. G. Cozman, I. Cohen, M. C. Cirelo, et al. Semi-supervised learning of mixture models. In ICML, pages 99–106, 2003.
[42] L. J. D. Erdman and A. Sinko. Zero-inflated Poisson and zero-inflated negative binomial models using the COUNTREG procedure. In SAS Global Forum, pages 1–11, 2008.
[43] C. Dorland, R. S. Tol, and J. P. Palutikof. Vulnerability of the Netherlands and northwest Europe to storm damage under climate change. Climatic Change, 43(3):513–535, 1999.
[44] C. A. Dos Santos, C. M. Neale, T. V. Rao, and B. B. da Silva. Trends in indices for extremes in daily temperature and precipitation over Utah, USA. International Journal of Climatology, 31(12):1813–1822, 2011.
[45] U. Ehret, E. Zehe, V. Wulfmeyer, K. Warrach-Sagi, and J. Liebert. HESS Opinions: "Should we apply bias correction to global and regional climate model data?". Hydrology and Earth System Sciences Discussions, 9(4):5355–5387, 2012.
[46] W. Enke and A. Spekat. Downscaling climate model outputs into local and regional weather elements by classification and regression. Climate Research, 8(3):195–207, 1997.
[47] A.-C. Favre, S. El Adlouni, L. Perreault, N. Thiémonge, and B. Bobée. Multivariate hydrological frequency analysis using copulas. Water Resources Research, 40(1), 2004.
[48] L. Feudale. Large scale extreme events in surface temperature during 1950–2003: an observational and modeling study. PhD thesis, George Mason University.
[49] M. Fogarty, L. Incze, K. Hayhoe, D. Mountain, and J. Manning. Potential climate change impacts on Atlantic cod (Gadus morhua) off the northeastern USA. Mitigation and Adaptation Strategies for Global Change, 13(5-6):453–466, 2008.
[50] C. Frei, R. Schöll, S. Fukutome, J. Schmidli, and P. L. Vidale. Future change of precipitation extremes in Europe: Intercomparison of scenarios from regional climate models. Journal of Geophysical Research: Atmospheres (1984–2012), 111(D6), 2006.
[51] P. Friederichs and A. Hense. Statistical downscaling of extreme precipitation events using censored quantile regression. Monthly Weather Review, 135(6):2365–2378, 2007.
[52] K. Fujinaga, M. Nakai, H. Shimodaira, and S. Sagayama. Multiple-regression hidden Markov model. In Acoustics, Speech, and Signal Processing (ICASSP '01), 2001 IEEE International Conference on, volume 1, pages 513–516. IEEE, 2001.
[53] C. Gaetan and M. Grigoletto. A hierarchical model for the analysis of spatial rainfall extremes. Journal of Agricultural, Biological, and Environmental Statistics, 12(4):434–449, 2007.
[54] S. Ghosh and B. K. Mallick. A hierarchical Bayesian spatio-temporal model for extreme precipitation events. Environmetrics, 22(2):192–204, 2011.
[55] C. L. Giles, S. Lawrence, and A. C. Tsoi. Noisy time series prediction using recurrent neural networks and grammatical inference. Machine Learning, 44(1-2):161–183, 2001.
[56] H. R. Glahn and D. A. Lowry. The use of model output statistics (MOS) in objective weather forecasting. Journal of Applied Meteorology, 11(8):1203–1211, 1972.
[57] A. M. Greene, A. W. Robertson, P. Smyth, and S. Triglia. Downscaling projections of Indian monsoon rainfall using a non-homogeneous hidden Markov model. Quarterly Journal of the Royal Meteorological Society, 137(655):347–359, 2011.
[58] T. Hastie, R. Tibshirani, and J. Friedman. Linear Methods for Regression. Springer, 2009.
[59] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning, volume 1. Springer, New York, 2001.
[60] K. Hayhoe, S. Sheridan, L. Kalkstein, and S. Greene. Climate change, heat waves, and mortality projections for Chicago. Journal of Great Lakes Research, 36:65–73, 2010.
[61] X. He, Y. Yang, and J. Zhang. Bivariate downscaling with asynchronous measurements. Journal of Agricultural, Biological, and Environmental Statistics, 17(3):476–489, 2012.
[62] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
[63] W.-C. Hong, P.-F. Pai, S.-L. Yang, and R. Theng. Highway traffic forecasting by support vector regression model with tabu search algorithms. In Neural Networks (IJCNN '06), International Joint Conference on, pages 1617–1621. IEEE, 2006.
[64] G. Hudson and H. Wackernagel. Mapping temperature using kriging with external drift: theory and an example from Scotland. International Journal of Climatology, 14(1):77–91, 1994.
[65] G. Huerta and B. Sansó. Time-varying models for extreme values. Environmental and Ecological Statistics, 14(3):285–299, 2007.
[66] Y. Hundecha, A. St-Hilaire, T. Ouarda, S. El Adlouni, and P. Gachon. A nonstationary extreme value analysis for the assessment of changes in extreme annual wind speed over the Gulf of St. Lawrence, Canada. Journal of Applied Meteorology and Climatology, 47(11):2745–2759, 2008.
[67] T. Iizumi, M. Nishimori, K. Dairaku, S. A. Adachi, and M. Yokozawa. Evaluation and intercomparison of downscaled daily precipitation indices over Japan in present-day climate: Strengths and weaknesses of dynamical and bias correction-type statistical downscaling methods. Journal of Geophysical Research: Atmospheres (1984–2012), 116(D1), 2011.
[68] A. V. Ines and J. W. Hansen. Bias correction of daily GCM rainfall for crop simulation studies. Agricultural and Forest Meteorology, 138(1):44–53, 2006.
[69] A. J. Izenman. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2):248–264, 1975.
[70] T. H. Jagger, J. B. Elsner, and M. A. Saunders. Forecasting US insured hurricane losses. Climate Extremes and Society, pages 189–208, 2008.
[71] T. Joachims. Transductive inference for text classification using support vector machines. In ICML, volume 99, pages 200–209, 1999.
[72] R. W. Katz. Statistics of extremes in climate change. Climatic Change, 100(1):71–76, 2010.
[73] B. Kedem and K. Fokianos. Regression Models for Time Series Analysis, volume 488. John Wiley & Sons, 2005.
[74] S. Kim and E. P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. 2010.
[75] P. Kirshen, K. Knee, and M. Ruth. Climate change and coastal flooding in metro Boston: impacts and adaptation strategies. Climatic Change, 90(4):453–473, 2008.
[76] R. Koenker. Quantile regression software. http://www.econ.uiuc.edu/~roger/research/rq/rq.html.
[77] R. Koenker. Quantile Regression. Wiley Online Library, 2005.
[78] Z. W. Kundzewicz, L. J. Mata, N. Arnell, P. Doll, P. Kabat, B. Jimenez, K. Miller, T. Oki, S. Zekai, I. Shiklomanov, et al. Freshwater resources and their management. 2007.
[79] K. E. Kunkel, K. Andsager, and D. R. Easterling. Long-term trends in extreme precipitation events over the conterminous United States and Canada. Journal of Climate, 12(8):2515–2527, 1999.
[80] S. Laxman and P. Sastry. A survey of temporal data mining. Sadhana, 31(2):173–198, 2006.
[81] Y.-A. Le Borgne, S. Santini, and G. Bontempi. Adaptive model selection for time series prediction in wireless sensor networks. Signal Processing, 87(12):3010–3020, 2007.
[82] Y. Li, Y. Liu, and J. Zhu. Quantile regression in reproducing kernel Hilbert spaces. Journal of the American Statistical Association, 102(477):255–268, 2007.
[83] E. C. Mannshardt-Shamseldin, R. L. Smith, S. R. Sain, L. O. Mearns, and D. Cooley. Downscaling extremes: A comparison of extreme value distributions in point-source and gridded precipitation data. The Annals of Applied Statistics, 4(1):484–502, 2010.
[84] J. I. Marden. Positions and QQ plots. Statistical Science, 19(4):606–614, 2004.
[85] L. Mearns, W. Gutowski, R. Jones, L. Leung, S. McGinnis, A. Nunes, and Y. Qian. The North American Regional Climate Change Assessment Program dataset. National Center for Atmospheric Research Earth System Grid Data Portal, Boulder, CO, 2007.
[86] L. O. Mearns, W. Gutowski, R. Jones, R. Leung, S. McGinnis, A. Nunes, and Y. Qian. A regional climate change assessment program for North America. Eos, Transactions American Geophysical Union, 90(36):311, 2009.
[87] N. Meinshausen. Quantile regression forests. The Journal of Machine Learning Research, 7:983–999, 2006.
[88] C. Monteleoni, G. A. Schmidt, S. Saroha, and E. Asplund. Tracking climate models. Statistical Analysis and Data Mining, 4(4):372–392, 2011.
[89] S. Nadarajah. Extremes of daily rainfall in west central Florida. Climatic Change, 69(2-3):325–342, 2005.
[90] M. Nogaj, P. Yiou, S. Parey, F. Malek, and P. Naveau. Amplitude and frequency of temperature extremes over the North Atlantic region. Geophysical Research Letters, 33(10), 2006.
[91] A. Ober-Sundermeier and H. Zackor. Prediction of congestion due to road works on freeways. In Intelligent Transportation Systems, 2001 IEEE Proceedings, pages 240–244. IEEE, 2001.
[92] T. O'Brien, D. Sornette, and R. McPherron. Statistical asynchronous regression: Determining the relationship between two quantities that are not measured simultaneously. Journal of Geophysical Research, 106(A7):13247–13, 2001.
[93] S. Ollinger, C. Goodale, K. Hayhoe, and J. Jenkins. Potential effects of climate change and rising CO2 on ecosystem processes in northeastern US forests. Mitigation and Adaptation Strategies for Global Change, 13(5-6):467–485, 2008.
[94] J. Ospina Noreña, C. Gay García, A. Conde, V. Magaña, and G. Sánchez Torres Esqueda. Vulnerability of water resources in the face of potential climate change: generation of hydroelectric power in Colombia. Atmósfera, 22(3):229–252, 2009.
[95] M. Pérez-Enciso and M. Tenenhaus. Prediction of clinical outcome with microarray data: a partial least squares discriminant analysis (PLS-DA) approach. Human Genetics, 112(5-6):581–592, 2003.
[96] C. Piani, J. Haerter, and E. Coppola. Statistical bias correction for daily precipitation in regional climate models over Europe. Theoretical and Applied Climatology, 99(1-2):187–192, 2010.
[97] C. Piani, G. Weedon, M. Best, S. Gomes, P. Viterbo, S. Hagemann, and J. Haerter. Statistical bias correction of global simulated daily precipitation and temperature for the application of hydrological models. Journal of Hydrology, 395(3):199–215, 2010.
[98] M. Rivington, D. Miller, K. Matthews, G. Russell, G. Bellocchi, and K. Buchan. Downscaling regional climate model estimates of daily precipitation, temperature and solar radiation data. Climate Research, 35(3):181, 2007.
[99] E. A. Rosenberg, P. W. Keys, D. B. Booth, D. Hartley, J. Burkey, A. C. Steinemann, and D. P. Lettenmaier. Precipitation extremes and the impacts of climate change on stormwater infrastructure in Washington State. Climatic Change, 102(1-2):319–349, 2010.
[100] M. Rummukainen and Sveriges Meteorologiska och Hydrologiska Institut. Methods for Statistical Downscaling of GCM Simulations. SMHI RMK. Swedish Meteorological and Hydrological Institute, 1997.
[101] M. Rusticucci and B. Tencer. Observed changes in return values of annual temperature extremes over Argentina. Journal of Climate, 21(21):5455–5467, 2008.
[102] R. Samuels, A. Rimmer, A. Hartmann, S. Krichak, and P. Alpert. Climate change impacts on Jordan River flow: downscaling application from a regional climate model. Journal of Hydrometeorology, 11(4):860–879, 2010.
[103] H. Sang and A. E. Gelfand. Hierarchical modeling for extreme values observed over space and time. Environmental and Ecological Statistics, 16(3):407–426, 2009.
[104] D. Scott and G. McBoyle. Climate change adaptation in the ski industry. Mitigation and Adaptation Strategies for Global Change, 12(8):1411–1431, 2007.
[105] A. J. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199–222, 2004.
[106] I. Takeuchi, Q. V. Le, T. D. Sears, and A. J. Smola. Nonparametric quantile estimation. The Journal of Machine Learning Research, 7:1231–1264, 2006.
[107] C. Tebaldi and D. Lobell. Towards probabilistic projections of climate change impacts on global crop yields. Geophysical Research Letters, 35(8), 2008.
[108] J. M. Themeßl, A. Gobiet, and A. Leuprecht. Empirical-statistical downscaling and error correction of daily precipitation from regional climate models. International Journal of Climatology, 31(10):1530–1544, 2011.
[109] Q. Tian, J. Yu, Q. Xue, and N. Sebe. A new analysis of the value of unlabeled data in semi-supervised learning for image retrieval. In Multimedia and Expo (ICME '04), 2004 IEEE International Conference on, volume 2, pages 1019–1022. IEEE, 2004.
[110] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
[111] E. Towler, B. Rajagopalan, E. Gilleland, R. S. Summers, D. Yates, and R. W. Katz. Modeling hydrologic and water quality extremes in a changing climate: A statistical approach based on extreme value theory. Water Resources Research, 46(11), 2010.
[112] E. Vazquez and E. Walter. Multi-output support vector regression. In 13th IFAC Symposium on System Identification, pages 1820–1825. Citeseer, 2003.
[113] I. Watterson and M. Dix. Simulated changes due to global warming in daily precipitation means and extremes and their interpretation using the gamma distribution. Journal of Geophysical Research, 108(D13):4379, 2003.
[114] A. H. Welsh, R. B. Cunningham, C. Donnelly, and D. B. Lindenmayer. Modelling the abundance of rare species: statistical models for counts with extra zeros. Ecological Modelling, 88(1):297–308, 1996.
[115] R. Wilby. Statistical downscaling of daily precipitation using daily airflow and seasonal teleconnection indices. Climate Research, 10(3):163–178, 1998.
[116] R. Wilby, S. Charles, E. Zorita, B. Timbal, P. Whetton, and L. Mearns. Guidelines for use of climate scenarios developed from statistical downscaling methods. 2004.
[117] R. L. Wilby and T. Wigley. Downscaling general circulation model output: a review of methods and limitations. Progress in Physical Geography, 21(4):530–548, 1997.
[118] T. Zhang and F. Oles. The value of unlabeled data for classification problems. In Proceedings of the Seventeenth International Conference on Machine Learning (P. Langley, ed.), pages 1191–1198. Citeseer, 2000.
[119] Z.-H. Zhou and M. Li. Semi-supervised regression with co-training. In IJCAI, pages 908–916, 2005.
[120] X. Zhu. Semi-supervised learning literature survey. Computer Science, University of Wisconsin-Madison, 2:3, 2006.
[121] X. Zhu, Z. Ghahramani, J. Lafferty, et al. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, volume 3, pages 912–919, 2003.
[122] X. Zhu and A. B. Goldberg. Kernel regression with order preferences. In Proceedings of the National Conference on Artificial Intelligence, volume 22, page 681. AAAI Press, 2007.