MULTI-TASK LEARNING AND ITS APPLICATION TO GEOSPATIO-TEMPORAL DATA

By

Jianpeng Xu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science and Engineering – Doctor of Philosophy

2017

ABSTRACT

MULTI-TASK LEARNING AND ITS APPLICATION TO GEOSPATIO-TEMPORAL DATA

By

Jianpeng Xu

Multi-task learning (MTL) is a data mining and machine learning approach for modeling multiple prediction tasks simultaneously by exploiting the relatedness among the tasks. MTL has been successfully applied to various domains, including computer vision, healthcare, genomics, recommender systems, and natural language processing. The goals of this thesis are: (1) to investigate the feasibility of applying MTL to geospatio-temporal prediction problems, particularly those encountered in the climate and environmental science domains and (2) to develop novel MTL frameworks that address the challenges of building effective predictive models from geospatio-temporal data.

The first contribution of this thesis is to develop an online temporal MTL framework called ORION for ensemble forecasting problems. Ensemble forecasting uses a numerical method to simulate the evolution of nonlinear dynamic systems, such as climate and hydrological systems. ORION aims to effectively aggregate the forecasts generated by different ensemble members for a future time window, where each forecast is obtained by perturbing the starting condition of the computer model or using a different model representation. ORION considers the prediction for each time point in the forecast window as a distinct prediction task, where the task relatedness is achieved by imposing temporal smoothness and mean regularization constraints. A novel, online update with restart strategy is proposed to handle missing observations in the training data. ORION can also be optimized for different objectives, such as ε-insensitive and quantile loss functions.

The second contribution of this thesis is to propose a MTL framework named GSpartan that can perform inferences at multiple locations simultaneously while allowing the local models for different locations to be jointly trained. GSpartan assumes that the local models share a common, low-rank representation and employs a graph Laplacian regularization to enforce constraints due to the inherent spatial autocorrelation of the data. Sparsity and non-negativity constraints are also incorporated into the formulation to ensure interpretability of the models.

GSpartan is a MTL framework that considers only the spatial autocorrelation of the data. It is also a batch learning algorithm, which makes it difficult to scale up to global-scale data. To address these limitations, a new framework called WISDOM is proposed, which can incorporate the task relatedness across both space and time. WISDOM encodes the geospatio-temporal data as a tensor and performs supervised tensor decomposition to identify the latent factors that capture the inherent spatial and temporal variabilities of the data as well as the relationship between the predictor and target variables. The framework is unique in that it trains distinct spatial and temporal prediction models from the latent factors of the decomposed tensor and aggregates the outputs of these models to obtain the final prediction. WISDOM also employs an incremental learning algorithm that can systematically update the models when training examples are available for a new time period or for a new location.
Finally, the geospatio-temporal data for many scientific applications are often available at varying spatial scales. For example, they can be generated by computer models simulated at different grid resolutions (e.g., the global and regional models used in climate modeling). A simple way to handle the predictor variables generated from the multi-scale data is to concatenate them into a single feature vector and train WISDOM using the concatenated vectors. However, this strategy may not be effective as it ignores the inherent dependencies between variables at different scales. To overcome this limitation, this thesis presents an extension of WISDOM called MUSCAT for handling multi-scale geospatio-temporal data. MUSCAT considers the consistency of the latent factors extracted from the spatio-temporal tensors at different scales while inheriting the benefits of WISDOM. Given the massive size of the multi-scale spatio-temporal tensors, a novel, supervised, incremental multi-tensor decomposition algorithm is developed to efficiently learn the model parameters.

Copyright by JIANPENG XU 2017

ACKNOWLEDGEMENTS

Spending five or more years of one's lifetime on a Ph.D. degree is a great commitment for any human being, myself included. During this fantastic journey, what I have experienced has influenced, and will continue to influence, the paths of my personal and professional life. There are always moments and fragments in my memory connected to the wonderful people I met, lived, and worked with. I would like to sincerely acknowledge each of them for being supportive, informative, and resourceful during my adventurous Ph.D. study.

First and foremost, I would like to thank my advisor, Dr. Pang-Ning Tan, for his continuous support of my Ph.D. study and research. He is a very patient, motivated, knowledgeable and humble mentor. He is willing to spend time discussing not only the big picture of my research, but also the technical details of the work. His guidance on how to do research correctly, how to identify research problems, and how to dive deep into the analysis of results has been invaluable and has greatly benefited my research. I am extremely grateful for the opportunities he provided for pursuing problems of my interest, for attending conferences to make my work visible to other researchers, and for interning at industrial and research organizations to expose myself to different real-world problems. Because of him, my Ph.D. study at Michigan State University was memorable, enlightening, and less stressful. It is my honor to have been able to work with Dr. Tan, and I will cherish everything he dedicated during my stay.

Next, I would like to thank the rest of my Ph.D. committee members, Dr. Jiayu Zhou, Dr. Lifeng Luo and Dr. Abdol-Hossein Esfahanian, for their support and guidance throughout my Ph.D. program. From Dr. Zhou, I gained a considerable amount of practical machine learning techniques for solving optimization problems efficiently, and every discussion with him was fruitful. His expertise in machine learning had an influential impact on my research. I would also especially like to thank Dr. Luo for his continuous support and for discussions on the domain knowledge in my research work. Without his domain insight, my research would not be as meaningful. Finally, I would like to thank Dr. Esfahanian for always being supportive of my thesis. It was a great pleasure to work with each of them, and I look forward to continuing my collaborations in my future career.
Further, I would like to extend my acknowledgements to many current and former colleagues, especially Prakash Mandayam Comar, Zubin Abraham, Lei Liu, Shuai Yuan, Courtland VanDam, Xi Liu, Ding Wang, Kaixiang Lin and Qi Wang from the CSE Department, as well as Pouyan Hatami from the Geography Department. They not only kept my lab hours interesting and informative through discussions on different research topics, but also helped make my life at MSU happier and easier. I also want to express my special thanks to the graduate director, Eric Torng, for his extensive help and support throughout my Ph.D. years, as well as to the numerous department secretaries for all of their administrative and travel assistance.

Last but not least, I gratefully thank my parents and my wife for their continuous support, encouragement, and unconditional love. This thesis would not have been accomplished without their support. I also thank my daughter, who came into my life in the last year of my Ph.D. and has brought me the tremendous happiness of being a father.

This thesis was partially supported by the National Science Foundation through grant NSF III-1615612, the NOAA Climate Program Office through grant NA12OAR4310081, and the NASA Terrestrial Hydrology Program through grant NNX13AI44G.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1 INTRODUCTION
1.1 Geospatio-temporal Data
1.2 Application of MTL to Geospatio-Temporal Data
1.3 Research Challenges
1.3.1 Growth and Other Characteristics of Data
1.3.2 Spatio-temporal Autocorrelation
1.3.3 Capturing Broader-Scale Phenomena
1.3.4 Multi-Scale Data
1.4 Thesis Contributions
1.4.1 ORION
1.4.2 GSpartan
1.4.3 WISDOM
1.4.4 MUSCAT
1.5 Related Publications
1.6 Thesis Outline

CHAPTER 2 BACKGROUND
2.1 Spatio-temporal Data Mining
2.2 Multi-scale Data Mining
2.3 Multi-task learning
2.4 Tensor Decomposition

CHAPTER 3 ONLINE REGULARIZED MULTI-TASK REGRESSION
3.1 Problem Formulation
3.2 Online Regularized multI-task regressiON (ORION)
3.2.1 ORION for ε-insensitive Loss Function
3.2.2 Optimization
3.2.3 Algorithm
3.2.3.1 Time complexity
3.2.4 Theoretical Analysis of ORION-ε
3.3 Online Regularized Multi-Task Quantile Regression (ORION-QR)
3.4 Experimental Evaluation
3.4.1 Performance Comparison for ORION-ε
3.4.2 ORION with Quantile Regression (ORION-QR)
3.4.3 Sensitivity Analysis
3.4.4 Variations of ORION-ε Framework
3.5 Conclusion

CHAPTER 4 MULTI-TASK LEARNING FRAMEWORK FOR MULTI-LOCATION PREDICTION
4.1 Preliminaries
4.2 GSpartan
4.3 Experimental Evaluation
4.3.1 Dataset Description
4.3.2 Baseline Methods
4.3.3 Task relationship matrix
4.3.4 Experimental Results
4.3.5 Sensitivity Analysis
4.4 Conclusion

CHAPTER 5 WEIGHTED INCREMENTAL SPATIO-TEMPORAL MULTI-TASK LEARNING VIA TENSOR DECOMPOSITION
5.1 Preliminaries
5.2 WISDOM: An Incremental Spatio-Temporal Multi-Task Learning Framework
5.2.1 Spatio-temporal Predictive Models
5.2.2 Supervised Tensor Decomposition
5.2.3 WISDOM Algorithm
5.2.3.1 Incremental Learning over Space
5.2.3.2 Incremental Learning over Time
5.2.3.3 Incremental Learning over Space-Time
5.3 Experimental Evaluation
5.3.1 Dataset Description
5.3.2 Experimental Setup
5.3.3 Comparison against Baseline Methods
5.3.4 Convergence Analysis of WISDOM
5.3.5 Analysis of Spatial Latent Factors
5.3.6 Analysis of Temporal Latent Factors
5.4 Conclusion
CHAPTER 6 MULTI-SCALE SPATIO-TEMPORAL MULTI-TASK LEARNING VIA INCREMENTAL MULTI-TENSOR DECOMPOSITION
6.1 Background
6.1.1 Notations
6.1.2 Tensor Factorization
6.1.3 Spatio-Temporal Prediction
6.2 MUSCAT
6.2.1 Multi-tensor decomposition
6.2.2 Supervised Multi-tensor Decomposition
6.2.3 MUSCAT
6.2.3.1 Incremental Learning over Space
6.2.3.2 Incremental Learning over Time
6.3 Experimental Evaluation
6.3.1 Dataset Description
6.3.2 Baseline Algorithm
6.3.3 Experimental Results
6.3.3.1 Comparison against Baselines
6.3.3.2 Comparison of Multi-scale analysis against single-scale analysis
6.3.3.3 Influence of Data from Multiple Scales
6.3.4 Analysis of Spatial Latent Factors
6.3.5 Analysis of Temporal Latent Factors
6.4 Conclusion

CHAPTER 7 CONCLUSIONS AND FUTURE WORK
7.1 Summary of Contributions
7.2 Future Directions

BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: Notations used in Equation (3.3)
Table 3.2: Comparison of mean absolute error (MAE) for ORION-ε against baseline methods on soil moisture data
Table 3.3: Comparison of runtime (in seconds) on the Northeast data set.
Table 3.4: Comparison of F1 measure for predicting occurrence of extreme events.
Table 3.5: Comparison of mean absolute error (MAE) for different variations of ORION-ε framework
Table 4.1: Win-loss table comparing performance of various methods when applied to the data set with limited training examples (1 year of training data and 39 years of test data).
Table 5.1: List of predictor variables selected from NCEP reanalysis data.
Table 5.2: MAE and its standard deviation for WISDOM and other baseline methods
Table 5.3: MAE and its standard deviation for WISDOM and its variants
Table 5.4: Comparison between the number of locations (out of 1,100) in which one method outperforms another
Table 5.5: List of the climate indices used to correlate with the temporal factors learned from WISDOM.
Table 5.6: Comparing MAE of WISDOM and WISDOM-KP
Table 6.1: Spatial resolutions for various climate datasets
Table 6.2: Glossary of symbols used in the chapter.
Table 6.3: Glossary of tensor/matrix operations.
Table 6.4: List of variables from NARR data as predictors.
Table 6.5: List of surface variables selected from NCEP reanalysis data as predictors.
Table 6.6: Number of weather stations and grid cells for each response variable.
Table 6.7: Mean and standard deviation of MAE for MUSCAT and other baseline methods for climate datasets over 10 trials.
Table 6.8: Mean and standard deviation of MAE for MUSCAT against different variations of MUSCAT
Table 6.9: Mean of α1 and α2 for the climate datasets over 10 trials.
Table 6.10: List of the climate indices used to correlate with the temporal factors learned from WISDOM.

LIST OF FIGURES

Figure 1.1: A geospatio-temporal data set, where each location has a pair of time series.
Figure 1.2: Example of trajectory data for hurricane Sandy in 2012
Figure 2.1: Multiplicative low-rank representation
Figure 2.2: Illustration of MTL based on explicit task relationship
Figure 2.3: Illustration of MTL based on sharing common parameters
Figure 3.1: A schematic illustration of ensemble forecasting task (diagram is best viewed in color). Assuming the latest forecast was generated on September 12, 2011 for the time period between September 17 and October 22, there is one observation value available to verify Forecast (N − 1), two observations available to verify Forecast (N − 2), and so on.
Figure 3.2: Seasonable soil moisture forecasts and the observed time series at a major river basin in North America.
Figure 3.3: Forecasts on Dataset Northeast for ORION-ε. Fig. 3.3a - 3.3c are results from the training set and Fig. 3.3d - 3.3l are results from the test set. Note that in the early stage of the online learning process, ORION-ε performs similar to Ensemble Median (see Fig. 3.3a - 3.3b), and ORION-ε starts to follow the observation from Fig. 3.3c.
Figure 3.4: Mean absolute error for ORION-ε and Ensemble Median on the Northeast and Midatlantic data.
Figure 3.5: Sensitivity Analysis of ORION-ε. Fig. 3.5a shows that ORION-ε tends to choose a large value of µ; Fig. 3.5b shows that ORION-ε is not that sensitive to β; Fig. 3.5c shows that λ is the parameter to be tuned in practice.
Figure 4.1: Comparison of GSpartan against three baseline methods
Figure 4.2: Comparison of GSpartan against its variants
Figure 4.3: Performance comparison between GSpartan and baseline methods as training set size increases.
Figure 4.4: Results of sensitivity analysis on λ1, λ2, λ3, and k for GSpartan. The horizontal axis represents the index of a station and the vertical axis corresponds to the RMSE value.
Figure 5.1: The standardized monthly maximum temperature of a weather station in French Polynesia, which correlates strongly with the deseasonalized El-Nino Southern Oscillation Index.
Figure 5.2: Overview of the proposed WISDOM framework.
Figure 5.3: Changes in MAE over time for WISDOM
Figure 5.4: Spatial distribution of the spatial factors learned by WISDOM for prcp. (Figure is best viewed in color).
Figure 5.5: Spatial distribution of the spatial factors learned by WISDOM for prcp, continued. (Figure is best viewed in color).
Figure 5.6: Average annual MAE comparison between WISDOM with incremental learning over space and WISDOM without incremental learning over space for the 100 initially chosen locations
Figure 5.7: Correlations between the climate indices and the temporal factors learned from WISDOM for tmean and prcp
Figure 5.8: Percentage of locations whose response variables have a correlation above 0.3 with the temporal factors and climate indices learned from WISDOM for tmean and prcp
Figure 5.9: Stations where WISDOM-KP outperforms WISDOM by more than 0.05 in MAE evaluation for tmean, and vice versa.
Figure 6.1: Spatial distribution of the spatial latent factors learned by MUSCAT for precipitation data (Figure is best viewed in color)
Figure 6.2: Correlations between the known climate indices and the temporal latent factors derived from MUSCAT

CHAPTER 1 INTRODUCTION

Multi-task learning (MTL) is a data mining and machine learning approach for learning multiple related tasks jointly, by incorporating the relatedness between tasks [20]. It was first introduced in the context of neural networks, inspired by models of learning in the human brain. To date, MTL has been successfully applied to many applications, including computer vision [119, 103], healthcare and medical applications [126, 14, 15], recommender systems [79, 101], natural language processing [4], genomics [81] and social media applications [121]. MTL is also applicable to various learning tasks such as classification [113], regression [126], and clustering [52]. The rationale for applying MTL to multiple predictive modeling problems [113] is as follows: (i) the prediction tasks are not identical, so fitting a single model for all tasks using their combined data is often not as effective; and (ii) the prediction tasks are related, so the model for each task could benefit by sharing information across the tasks.
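As a minimal illustration of points (i) and (ii), the following numpy sketch contrasts fitting each task independently with a simple jointly trained alternative in which every task's weight vector is pulled toward the mean of all task weights. The toy data, the ridge penalties, and the mean-regularization scheme are illustrative assumptions only (they are not the specific MTL formulations developed in this thesis), but they show how sharing information across related tasks can help when each task has few training examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 3 related regression tasks (e.g., 3 nearby locations), each with few
# training examples; the true task weights are noisy copies of a common vector.
n_tasks, n_features, n_samples = 3, 5, 10
w_common = rng.normal(size=n_features)
true_w = [w_common + 0.1 * rng.normal(size=n_features) for _ in range(n_tasks)]
X = [rng.normal(size=(n_samples, n_features)) for _ in range(n_tasks)]
y = [X[t] @ true_w[t] + 0.1 * rng.normal(size=n_samples) for t in range(n_tasks)]

def fit_independent(X, y, lam=1.0):
    """Single-task learning: one ridge regression per task, no information sharing."""
    d = X[0].shape[1]
    return [np.linalg.solve(X[t].T @ X[t] + lam * np.eye(d), X[t].T @ y[t])
            for t in range(len(X))]

def fit_mean_regularized(X, y, lam=1.0, mu=10.0, n_iter=50):
    """A simple MTL scheme: each task is a ridge regression whose weights are also
    pulled toward the mean of all task weights; mu controls how much is shared."""
    d = X[0].shape[1]
    W = [np.zeros(d) for _ in X]
    for _ in range(n_iter):  # alternate: fix the mean, re-solve each task
        w_bar = np.mean(W, axis=0)
        W = [np.linalg.solve(X[t].T @ X[t] + (lam + mu) * np.eye(d),
                             X[t].T @ y[t] + mu * w_bar) for t in range(len(X))]
    return W

for name, W in [("independent", fit_independent(X, y)),
                ("mean-regularized MTL", fit_mean_regularized(X, y))]:
    err = np.mean([np.linalg.norm(W[t] - true_w[t]) for t in range(n_tasks)])
    print(f"{name:>22s}: mean parameter error = {err:.3f}")
```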
However, identifying what type of information should be shared between the different tasks and how to integrate it into a unified learning formulation are among the key challenges that must be addressed. Based on the different assumptions made for particular applications, numerous methods for representing and modeling the common task structures have been developed in recent years [107, 38, 7, 56, 63, 25, 49, 126, 120, 87, 31].

1.1 Geospatio-temporal Data

Geospatio-temporal data sets are prevalent across many disciplines, including geophysical and environmental sciences [40, 8], medical informatics [32], computational molecular and fluid dynamics [115], and real estate economics [92]. The availability of such data sets has grown in recent years due to advances in sensing technology and large-scale scientific simulation studies. Predictive modeling of such massive data sets is important as it supports various scientific and engineering applications.

A geospatio-temporal data set consists of time series observations of various fields associated with each geo-referenced object, e.g., measurements of meteorological variables at a weather station, frequency of infectious disease incidents in a county, or average housing price, annual income, and crime rate in a given neighborhood. Figure 1.1 shows an example of a geospatio-temporal data set, where each location is associated with a pair of time series (e.g., temperature and precipitation). A standard prediction problem for this type of data is to forecast the values of the time series at each location based on their historical observations as well as the spatial relationships among the locations. In addition to temporal forecasting, one might also be interested in inferring the values at locations with limited or no historical data.

Figure 1.1: A geospatio-temporal data set, where each location has a pair of time series.

The geospatio-temporal data considered in this study (as depicted in Figure 1.1) is different from another type of geospatio-temporal data known as trajectory data [43, 74, 95, 68]. An example of the latter is shown in Figure 1.2, which shows the trajectory of a hurricane. Unlike Figure 1.1, where the data are generated by stationary geo-referenced objects, trajectory data are generated from the movements of the geo-referenced objects. In terms of the prediction task for trajectory data, one is often interested in predicting the locations along the future path of the moving object.

Figure 1.2: Example of trajectory data for hurricane Sandy in 2012 (image courtesy of http://www.weather.gov/images/okx/Sandy/Track_NHCreport.png).

1.2 Application of MTL to Geospatio-Temporal Data

There are numerous applications involving geospatio-temporal data that can be naturally mapped into MTL problems. In the following, we describe several examples of such applications from the climate and environmental science domains.

• Ensemble Forecasting: Ensemble forecasting is a well-known numerical prediction technique for modeling the evolution of nonlinear dynamic systems. The ensemble member forecasts are generated from multiple runs of a computer model, where each run is obtained by perturbing the starting condition or using a different model representation of the dynamic system. The set of forecasts produced by the ensemble members is used to quantify the range of uncertainties in the future state of the system. However, for many applications, it is also desirable to obtain a point estimate of the forecasts for decision making purposes.
To do so, the ensemble mean or median is typically chosen as the consensus estimate of the aggregated forecasts. These approaches are limited in that they assume each member is equally skillful and ignore the potential correlations between the ensemble member forecasts. Furthermore, each forecast typically represents the predictions for an extended future time period, e.g., a 6-month forecast duration. The ensemble mean or median approaches do not guarantee that the temporal predictions would be smooth throughout the forecast window since the mean or median aggregation is performed at each time step independently. This presents an opportunity for developing a temporal MTL framework to learn the optimal weights for combining the ensemble member forecasts, where the prediction at each time step is considered a separate learning task and task relatedness constraints can be imposed to ensure temporal smoothness of the predictions.

• Multi-Location Prediction: Another obvious application of MTL is to predict the values of one or more response variables simultaneously at multiple locations. For example, climate scientists are interested in obtaining projections of the future climate (e.g., temperature and precipitation) at multiple locations in a study region. Similarly, in hydrology, soil moisture forecasts are needed for various river basins for drought monitoring applications. One way to do this is to train a local model at each location using only its historical observations. This single-task learning (STL) approach may not be effective, especially if there are limited training examples available at the locations. This is because the STL models fail to exploit the spatial relationships of the predictions, which are important for many geospatio-temporal prediction problems following Tobler’s first law of geography, which states that “Everything is related to everything else, but near things are more related than distant things.” [98]. Spatio-temporal MTL approaches can be developed that allow the local models at multiple locations to be jointly trained, taking into account the shared information between locations (e.g., their spatial proximity or autocorrelation).

• Multi-Scale Prediction: The geospatio-temporal data generated for many scientific applications are often available at multiple spatial scales. For example, the climate projections generated by global and regional climate models are available at a wide range of spatial resolutions, from tens to several hundred kilometers. As another example, the soil moisture estimates provided by the Climate Forecast System Reanalysis (CFSR) and the North American Land Data Assimilation System (NLDAS-2) for hydrology applications are available at 35 km and 12 km resolutions, respectively, over the conterminous United States. It has been shown that integrating data from multiple scales for predictive modeling could improve the performance, as different analysis scales have the potential to produce different prediction skills [75]. Many existing methods [82, 35, 75] would concatenate the variables from multiple scales into a single feature vector, enabling us to apply existing approaches for building predictive models from the multi-scale data. However, this strategy may not be effective as it fails to consider the relationships among the multi-scale variables.
A MTL framework can be developed to exploit such relationships in order to build more robust and effective predictive models. For example, the predictive modeling using the predictor variables at each scale for each location can be treated as a separate learning task, and the task relatedness can be defined based on the structure of the models (e.g., the shared latent factors of the predictor variables across multiple scales).

1.3 Research Challenges

This section presents the challenges of applying MTL to geospatio-temporal data.

1.3.1 Growth and Other Characteristics of Data

Geospatio-temporal data sets often grow linearly over time. For example, climate observations from weather stations are typically available on an hourly, 3-hourly, or daily basis. Due to the continuous growth of the data, building predictive models for a large number of locations simultaneously using MTL in a batch mode is computationally demanding, as the models have to be trained from scratch each time new observation data become available. An online or incremental MTL approach is more suitable as the models need to be revised using only the new observations instead of being re-trained on the entire data set. Furthermore, such an approach is more adaptive to changes due to concept drifts in the data. Although online versions of MTL have been developed in recent years [21, 34, 114, 87, 101], these methods are not specifically designed to exploit properties of the geospatio-temporal data.

In particular, the characteristics of the data could make it difficult to apply existing MTL methods to certain applications. For example, the ensemble forecasting task described in the previous section can be viewed as a sliding window prediction problem, which requires making predictions for multiple consecutive time steps into the future. As time progresses, the sliding window will be shifted from its current forecast window to the next forecast window. In addition, the number of observed values in its previous forecast windows varies depending on the current time period, making it difficult to apply existing online MTL methods. To illustrate this, consider an ensemble of weather forecasting models. Every day, the ensemble members would generate their predictions for the next 7 days. For example, on a Sunday, the ensemble members would generate their predictions for a 7-day forecast window starting from the next day (Monday) to the following Sunday. On the next day, which is a Monday, the ensemble members would generate a new set of forecasts for a new 7-day time window (from Tuesday to the following Monday). At this time, the previous forecast window generated on Sunday would have 1 observed value out of its 7 predicted days, whereas the forecast window generated on the previous Saturday would have 2 observed values instead of 1. The online MTL approach will have to revise its models using the partially observed (incomplete) data from not just 1 but several of its previously forecasted time windows. This problem will be further elaborated in Chapter 3.
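The varying number of verifiable observations per window in the weekday example above can be made concrete with a short sketch. The dates below are illustrative; only the counting logic matters.

```python
from datetime import date, timedelta

# Each day a new 7-day forecast window is issued. By "today", every past window
# has a different number of days that can already be verified against observations.
horizon = 7
issue_dates = [date(2011, 9, 4) + timedelta(days=k) for k in range(6)]  # Sun .. Fri
today = issue_dates[-1]

for issued in issue_dates:
    window = [issued + timedelta(days=k) for k in range(1, horizon + 1)]
    n_observed = sum(day <= today for day in window)
    print(f"window issued {issued}: {n_observed} of {horizon} predicted days observed")
```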
1.3.2 Spatio-temporal Autocorrelation

Geospatio-temporal data often contain spatial and temporal autocorrelations that can be utilized to improve the performance of the predictive models. For example, temporal autocorrelation can be used to ensure that the predicted values vary smoothly over time. The spatial autocorrelation enables the models to leverage data from other locations based on the similarity between the locations. How to effectively incorporate both the spatial and temporal autocorrelations into the MTL framework is a challenge that needs to be addressed.

1.3.3 Capturing Broader-Scale Phenomena

The geospatio-temporal variabilities of the data are potentially influenced by broader-scale phenomena, whose impact may vary from one location to another. For example, the well-known El Niño phenomenon has different effects on weather patterns in North America and the rest of the world. To generate realistic predictions, the predictive models for different locations should be able to capture such broader-scale phenomena. Designing a geospatio-temporal MTL framework that can effectively perform multi-location predictions while capturing such broader-scale phenomena is a challenge in this research.

1.3.4 Multi-Scale Data

As previously noted, the geospatio-temporal data are often available at multiple spatial resolutions. Such multi-scale data provide useful information that can be harnessed for building robust prediction models. Constructing a MTL framework that can utilize the multi-scale data, while accounting for the relationships between variables at different scales, is another challenge that needs to be addressed.

1.4 Thesis Contributions

This section summarizes the key contributions of this thesis in terms of addressing the challenges described in the previous section.

1.4.1 ORION

In Chapter 3, an online temporal MTL framework called ORION (which stands for Online Regularized multI-task regressiON) is proposed to estimate the weights of the ensemble members for the ensemble forecasting application. The framework uses an online learning with restart strategy to deal with the partially observed data for different forecast windows. It regards the prediction at each time point as a single learning task and employs graph regularization constraints to ensure smoothness in the model parameters while taking into account the task relatedness within each time window. The framework is applicable to different types of loss functions, including the ε-insensitive and quantile losses. The quantile loss function is particularly appealing for forecasting extreme events in a time series. Experimental results suggest that ORION reduces the forecast error of the ensemble median on all major river basin datasets, and performs better than other baseline algorithms in most cases.

1.4.2 GSpartan

In order to address the multi-location prediction problem, Chapter 4 proposes a novel MTL framework called GSpartan, which stands for GeoSPAtio-tempoRal mulTi-tAsk learNing. GSpartan considers the prediction at each location as a single learning task and employs a multi-modal MTL approach to fit the training data simultaneously at multiple locations, where each modality corresponds to a latent factor derived from the data. The multi-modal approach is based on the assumption that the prediction models for all locations share a common set of low-rank base models. The local model at each location is then composed of a linear combination of these base models. The coefficients of the linear combination along with the base models are jointly estimated from the data using a matrix factorization approach. A Laplacian regularization is also added to the MTL formulation to constrain the relatedness between different task models under the assumption of spatial autocorrelation. To ensure interpretability of the models, sparsity and non-negativity constraints are also added into the proposed MTL formulation.
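To make this structure concrete, an objective of the general form described above can be sketched as follows. This is only a schematic illustration of the listed ingredients (shared base models, a spatial Laplacian penalty, sparsity, and non-negativity); the exact GSpartan formulation and notation are given in Chapter 4.

```latex
% Schematic only; not the exact GSpartan objective of Chapter 4.
% Local model at location s: w_s = B a_s, a combination of k shared base models
% (columns of B) with non-negative, sparse combination weights a_s.
\min_{B,\; a_1,\dots,a_S \ge 0} \;
  \sum_{s=1}^{S} \big\| X_s B a_s - y_s \big\|_2^2
  \;+\; \lambda_1 \sum_{s,s'} A_{ss'} \, \| a_s - a_{s'} \|_2^2
  \;+\; \lambda_2 \sum_{s=1}^{S} \| a_s \|_1
  \;+\; \lambda_3 \| B \|_F^2
```

Here A_{ss'} encodes the spatial proximity between locations s and s', so the second term is the graph Laplacian penalty that keeps the models at nearby locations similar.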
The performance of GSpartan is evaluated against several baseline methods on climate data collected from 37 randomly chosen weather stations in Canada. The experimental results show that GSpartan outperforms single-task learning and other existing MTL methods in more than 75% of the stations.

1.4.3 WISDOM

GSpartan considers only the spatial autocorrelation between locations, while ignoring the temporal consistency of its predictions. In addition, it is designed for batch learning, which makes it difficult to scale up the algorithm for the growing geospatio-temporal data sets collected by many applications. To overcome these limitations, an incremental, geospatio-temporal MTL framework named WISDOM (Weighted Incremental Spatio-temporal Multi-task Learning via Tensor Decomposition) is proposed in Chapter 5. The framework performs a supervised tensor factorization of the geospatio-temporal data and constructs distinct spatial and temporal predictive models from the derived latent factors. The framework can also incorporate known patterns from the domain by adding constraints on the latent factors. WISDOM can be applied to incremental learning over space, time, or both when new observation data become available. The effectiveness of the proposed framework for multi-location time series prediction is demonstrated on large-scale global climate data.

1.4.4 MUSCAT

Finally, Chapter 6 presents a multi-scale MTL framework named MUSCAT (Multi-scale Spatio-Temporal Multi-task Learning via Incremental Multi-tensor Decomposition), which not only learns the prediction models at multiple locations simultaneously, but can also take advantage of the task relatedness between predictor variables at different scales. Unlike WISDOM, MUSCAT represents the data at each scale as a separate tensor and performs a supervised, joint multiple tensor decomposition across all tensors. The consistency of the learned latent spatial and temporal factors is enforced to model the relatedness of the predictor variables from different scales. A comparison of MUSCAT against other baseline methods, including WISDOM, demonstrates the effectiveness of the proposed framework in handling multi-scale geospatio-temporal data.

1.5 Related Publications

This thesis is based on materials taken from several conference and journal papers that had been published or submitted at the time this thesis was written. The materials for Chapter 3 are adapted from two papers, entitled “ORION: Online Regularized multI-task regressiON and its application to ensemble forecasting” [107] and “Online Multi-task Learning Framework for Ensemble Forecasting” [111], respectively. Chapter 4 is based on the results published in the paper entitled “GSpartan: a Geospatio-Temporal Multi-task Learning Framework for Multi-location Prediction” [109]. Chapter 5 is adapted from the paper “WISDOM: Weighted Incremental Spatio-Temporal Multi-Task Learning via Tensor Decomposition” [110]. This paper also won the best paper award at the IEEE BigData Conference in 2017. Finally, the materials in Chapter 6 are based on a paper that has been submitted to a conference. I am the first author and main contributor for all of these papers.

1.6 Thesis Outline

The remainder of this thesis is organized as follows. Chapter 2 presents the background and related works of this thesis. Chapter 3 presents the temporal MTL framework named ORION for ensemble forecasting [107, 111].
Chapter 4 introduces the spatio-temporal MTL framework for multi-location prediction (GSpartan) [109]. Chapter 5 presents an incremental, spatio-temporal MTL framework named WISDOM [110], while Chapter 6 describes the extension of the framework to multi-scale geospatio-temporal data. Finally, Chapter 7 summarizes the dissertation and discusses possible extensions and future directions of this research.

CHAPTER 2 BACKGROUND

The objective of this thesis is to develop MTL frameworks for the predictive modeling of geospatio-temporal data. In this chapter, previous works related to the key topics investigated in this thesis, including spatio-temporal data mining, multi-task learning, multi-scale modeling, and tensor decomposition, are reviewed.

2.1 Spatio-temporal Data Mining

Spatio-temporal data mining can be related either to stationary geo-referenced objects, each associated with a time series [110, 48, 46, 47], such as soil moisture measurements at weather stations over time, or to moving objects, whose data capture the movements of the objects over time [9, 27, 45], such as trajectory data. This thesis focuses on methods belonging to the first type of spatio-temporal data mining.

Spatio-temporal data mining is important to many application domains, such as climatology, medicine, and crop sciences. It can usually be categorized into three classes: spatial data mining, temporal data mining, and spatio-temporal data mining. Spatial data mining learns, models, and discovers interesting unknown knowledge or patterns with spatial dependencies such as spatial autocorrelation, while temporal data mining models or discovers unknown temporal knowledge or patterns from time series data by exploring temporal dependencies such as temporal autocorrelation. Spatio-temporal data mining usually uses both spatial and temporal information in the modeling process. Other information or properties are often considered in the modeling process as well, such as spatial heterogeneity, non-stationarity, and multi-scale effects. Existing tasks for spatial, temporal, and spatio-temporal data mining include prediction (classification and regression) [58, 17, 83], clustering [51, 18, 53], association rule mining [62, 73, 96] and anomaly detection [71, 59, 70]. This thesis will focus on the prediction tasks in spatial and spatio-temporal data mining.

Spatial predictive modeling predicts the target variable at unobserved locations by modeling the data at observed locations. One of the most common approaches for modeling spatial data is spatial statistics [5, 44], which comprises three sets of methods for different types of spatial data: (a) geostatistics for point-referenced data, such as kriging [10]; (b) lattice statistics for areal data, such as the spatial autoregressive model (SAR) [58], the conditional autoregressive model (CAR) [33], and Markov random fields (MRF) [54]; and (c) point processes for spatial point patterns, such as complete spatial randomness (CSR) [44]. Spatio-temporal prediction refers to methods that predict the response variable by learning models from the predictor variables of spatio-temporal data [88]. Common spatio-temporal prediction approaches include spatio-temporal autoregressive regression [42, 99], spatio-temporal kriging [123, 89], and Bayesian hierarchical spatio-temporal models [104, 85, 13]. A comprehensive survey on spatio-temporal data mining is provided by Shekhar et al. [88].
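The methods above share one underlying idea: observations at nearby locations should influence a prediction more than distant ones. As a minimal, runnable illustration of that idea, the following sketch uses inverse-distance-weighted interpolation. IDW is a much simpler heuristic than the kriging and autoregressive models cited above and is shown only to make the use of spatial autocorrelation tangible; the station coordinates and values are synthetic.

```python
import numpy as np

def idw_predict(obs_coords, obs_values, query_coords, power=2.0, eps=1e-12):
    """Inverse-distance-weighted interpolation: predictions at unobserved sites are
    weighted averages of observed values, with closer stations receiving larger weights.
    A heuristic stand-in for the geostatistical models cited above (e.g., kriging)."""
    preds = []
    for q in query_coords:
        dist = np.linalg.norm(obs_coords - q, axis=1)
        w = 1.0 / (dist ** power + eps)
        preds.append(np.sum(w * obs_values) / np.sum(w))
    return np.array(preds)

# Toy example: a variable observed at 4 stations, predicted at 2 unobserved sites.
obs_coords = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
obs_values = np.array([10.0, 12.0, 11.0, 14.0])
query_coords = np.array([[0.5, 0.5], [0.1, 0.1]])
print(idw_predict(obs_coords, obs_values, query_coords))
```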
Recent work has shown that MTL can potentially achieve better performance on spatio-temporal prediction by taking into account the correlation between tasks across time, space, or both time and space [107, 109, 110, 121, 117].

2.2 Multi-scale Data Mining

The concept of multi-scale learning has been used in the literature to describe different classes of methods. For example, in traditional machine learning, multi-scale learning is related to the use of techniques based on multi-kernel [11], multi-covariance matrices [100] or multi-basis functions [80]. These methods are not designed to handle data containing multi-scale variables. Instead, they are focused on extracting and combining different models of the same raw data, which is available at a single resolution.

For geospatio-temporal data, the goal of multi-scale learning is to build predictive models from data whose predictor variables are collected at multiple resolutions. A straightforward approach for handling multi-scale data is to concatenate the predictor variables together into one single feature vector and then apply existing prediction methods [35, 75]. These approaches are limited in that they ignore the relationships among the multi-scale variables.

With the explosive growth of deep learning techniques, there have been several recent works on multi-scale modeling in the context of computer vision. The concept of multi-scale data in this case typically refers to different image resolutions. For example, the multi-scale concept used in deep convolutional networks refers to the extraction of features from an image using windows or patches of different sizes [41, 12]. Deep learning can also be used to learn multi-scale spatial features from images that have multiple resolutions [122]. However, these methods assume that the spatial data has a gridded structure, and thus, are inapplicable to geospatio-temporal data where the spatial data may not be gridded. Nevertheless, there have been some recent works using deep learning to model data with multiple resolutions. For example, a multi-scale deep learning approach was proposed in [78] to build models and to perform predictions independently at each scale of the data, followed by majority voting to combine the predictions. Another approach performs global learning on coarser scale data and then refines the model locally on finer scale data [37]. However, unlike our proposed approach, none of these methods consider the relationships between data at different scales.

2.3 Multi-task learning

MTL is a well-known machine learning approach for solving multiple, related prediction tasks simultaneously by considering the shared information among the tasks [20]. The rationale of MTL is that the information propagated between related tasks may enhance the overall predictive accuracy if the models are trained jointly instead of independently. The MTL scheme has been successfully adopted in various learning problems including classification [113], regression [126], and clustering [52]. Online learning versions of MTL have also been developed [21, 34, 114, 87, 101], which allow the models of multiple tasks to be updated incrementally. In this section, the existing MTL approaches are reviewed.
Based on the information sharing scheme between multiple tasks, MTL methods can be broadly classified into the following categories:

MTL based on Low-rank Representation: This category of methods assumes that the underlying models for different tasks share a low-rank representation, either additively or multiplicatively. For example, in the multiplicative low-rank representation, the task models can be factorized into two terms, one representing the base models shared across tasks and the other representing the combination coefficients (i.e., the matrix of task models is factorized as W = SB, where the rows of B are the shared base models and the rows of S hold the per-task combination coefficients). The task models are thus linear combinations of those base models, weighted by the combination coefficients. An illustration of the multiplicative low-rank representation is shown in Figure 2.1. The additive low-rank representation simply replaces the multiplication with a summation, with additional constraints on the additive terms.

Figure 2.1: Multiplicative low-rank representation

Previous works that belong to this category include [23, 7, 56, 63, 24, 25, 49]. For example, Chen et al. [23, 24] assumed the task models are composed additively of a low-rank structure capturing the relatedness among the tasks and a group-sparse structure that identifies the outlying tasks. A similar idea was proposed in Gong et al. [49], which uses a combination of a task model that enforces group sparsity on features and another that enforces group sparsity for detecting outlying tasks. Argyriou et al. [7] and Kang et al. [56] projected the features into a low dimensional space in which the predictive models for different tasks are inferred and the discriminant features are selected using an ℓ2,1 norm. Kumar et al. [63] assumed that each task model is a linear combination of a set of underlying base models, where the coefficients of the linear combination along with the base models are coupled in a multiplicative way. Sparsity constraints are also enforced to ensure the models are interpretable. Chen et al. [25] incorporated both additive and multiplicative representations into their formulation with a low-dimensional feature map shared across the different tasks. Tensor decomposition methods have also been adopted in MTL to extract low-rank representations [118]. These existing MTL methods may not be suitable for leveraging some of the task relations, especially in geospatio-temporal prediction, or may fail to capture geospatio-temporal patterns from the data.

MTL based on Explicit Task Relationship: This category of methods explicitly incorporates the task relationship inferred from the application domain into the MTL formulation. The task relationships are typically encoded in a similarity matrix, whose elements represent the closeness between two tasks [39]. Constraints are then enforced on the objective function based on the relations defined by the similarity matrix, in such a way that the models will be similar if the tasks are close to each other. Figure 2.2 illustrates the general idea of using explicit task relationships in MTL. Matrix A represents the similarity matrix of the tasks, where Aij is the similarity score or closeness between task i and task j.

Figure 2.2: Illustration of MTL based on explicit task relationship

Several works can be categorized under this framework. For example, Zhou et al. [126] employed a task relationship to ensure smoothness in time series prediction, where each time point corresponds to a separate prediction task. Zhang et al. [120] proposed a MTL regularization formulation to simultaneously learn the task relationship and task models.
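The penalty that this category of methods adds to the objective can be written down generically. The numpy sketch below evaluates the graph-regularization term induced by a task similarity matrix A and checks it against its equivalent pairwise form; it illustrates the general regularizer rather than the specific formulation of any of the papers cited here.

```python
import numpy as np

def task_relation_penalty(W, A):
    """Graph-regularization penalty sum_{i,j} A[i,j] * ||w_i - w_j||^2, which for a
    symmetric similarity matrix A equals 2 * trace(W^T L W) with Laplacian L = D - A.
    W: (n_tasks, n_features) stacked task models; A: (n_tasks, n_tasks) similarities."""
    L = np.diag(A.sum(axis=1)) - A
    return 2.0 * np.trace(W.T @ L @ W)

# Tasks 0 and 1 are declared similar, task 2 is unrelated; the penalty only charges
# for the difference between the two related task models.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
W = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [5.0, -3.0]])
pairwise = sum(A[i, j] * np.sum((W[i] - W[j]) ** 2) for i in range(3) for j in range(3))
print(task_relation_penalty(W, A), pairwise)   # identical values
```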
Saha et al. [87] also proposed an online MTL method to learn both the task models and their relationships. This category of methods is useful when the task relatedness can be clearly defined. They are inapplicable if the task relationship is not available or if the implicit task relationship needs to be inferred from data.

MTL based on Sharing Common Parameters: The assumption in this category is that different tasks are related by sharing a set of common parameters. One example is proposed by Evgeniou et al. [38] and is shown in Figure 2.3. In this example, the models w1 to wn for tasks 1 to n are composed of a common term w0 and n task-specific terms v1 to vn (i.e., wt = w0 + vt), where the common term w0 is shared across the different tasks.

Figure 2.3: Illustration of MTL based on sharing common parameters

There are other ways of sharing model parameters for multi-task learning. For example, Lawrence et al. [64] and Yu et al. [116] proposed a multi-task Gaussian process where the prior parameters are shared across different generative processes. Lee et al. [65] gave a formulation to share hyperparameters between the distributions of different task models, while Daume et al. [31] learned hierarchical Bayesian models that share a common structure parameterized by a covariance matrix. Note that sharing common terms additively across tasks is equivalent to using a specific task relation where all the task models are close to the (weighted) average model [38].

MTL based on Hybrid Information Sharing Methods: More recently, there have been some efforts to combine two or more of the MTL information sharing methods described above. For example, Xu et al. [108] combined both low-rank representation and explicit task relations: it models the heterogeneity between patients using low-rank modeling and constrains the task models to be close to the weighted average of the models from neighboring tasks.

Besides the taxonomy induced by the way information is shared between tasks, MTL algorithms can also be divided into batch [38, 126, 120, 87, 56, 63, 24, 25, 49, 116, 31, 65, 20] versus online multi-task learning [34, 21, 87, 3], according to data availability and the way the models are trained.

Batch Multi-task Learning Methods: For batch multi-task learning approaches, all the data points from all the tasks have to be available when the learning procedure starts. Similar to single-task learning approaches, the advantage of batch learning methods is that all the information can be taken into account at the same time, and hence the learned models can be considered an optimal solution. Most of the aforementioned MTL methods are batch methods.

Online Multi-task Learning Methods: Online learning is a family of learning algorithms in which the data are observed in a sequential manner and the models must be updated incrementally based on the new observations. The merit of applying online learning is that it does not consume as much memory as batch learning, which is advantageous for handling large, streaming data sets. Some of the popular online learning methods include the weighted majority algorithm [69], the online Passive-Aggressive (PA) algorithm [29], and the confidence-weighted algorithm [36]. Since the amount of data to be processed in multi-task learning is generally much larger than in single-task learning, a straightforward solution is to extend MTL to an online learning setting.
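Several of the online MTL methods reviewed next build on perceptron- or PA-style updates, so a minimal single-task sketch of the passive-aggressive update for regression is given here for reference. It follows the standard PA-I rule with an ε-insensitive loss from Crammer et al. [29]; the data stream is synthetic, and this is not the multi-task update of any method discussed in this chapter.

```python
import numpy as np

def pa_regression_update(w, x, y, epsilon=0.1, C=1.0):
    """One PA-I update for regression with an epsilon-insensitive loss: the model stays
    unchanged (passive) if the error is within epsilon, otherwise it is corrected just
    enough (aggressive) to bring the new example inside the epsilon tube."""
    y_hat = float(w @ x)
    loss = max(0.0, abs(y_hat - y) - epsilon)
    if loss == 0.0:
        return w
    tau = min(C, loss / float(x @ x))      # capped step size (PA-I)
    return w + np.sign(y - y_hat) * tau * x

# Synthetic stream: examples arrive one at a time and the model is revised online.
rng = np.random.default_rng(0)
w_true = np.array([0.5, -0.2, 0.3])
w = np.zeros(3)
for _ in range(500):
    x = rng.normal(size=3)
    w = pa_regression_update(w, x, float(w_true @ x) + 0.01 * rng.normal())
print(np.round(w, 2))   # close to w_true
```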
Dekel et al. [34] presented an approach for a multi-task perceptron that uses a global loss function to define the relationships among the different tasks. This method assumes the true values for all the learning tasks are available when updating the model. Relationships among the tasks are also not explicitly defined in their objective function. As a result, only those tasks that have committed a prediction error will have their weights updated. Cavallanti et al. [21] proposed another perceptron-based online MTL method, which uses an empirical loss function with co-regularization among the tasks. Since the method performs an update one task at a time, the results are order-dependent. Saha et al. [87] extended the work in [21] by proposing a method that simultaneously learns the model and the task relation matrix using an alternating optimization scheme. However, this approach has the same limitation as [21] because it relies on similar assumptions. A collaborative online MTL method was proposed in [67] by extending [38] to an online setting. However, it does not explicitly consider the possible task relationships. Another recent work on online multi-task learning considers a regret-minimization approach under the expert advice model [3]. However, none of these existing works were designed for geospatio-temporal data and its challenges.

2.4 Tensor Decomposition

Existing tensor decomposition approaches can be categorized into unsupervised [57, 22, 28] and supervised methods [105, 86, 106, 118]. The former is designed to minimize reconstruction error, whereas the latter considers the relationship between the predictor and response variables, and thus, is more suitable for predictive modeling problems. Several implementations of supervised tensor decomposition approaches have been developed in recent years. For example, Wu et al. [106] proposed the SNTFM framework to map the representation of each sample from a tensor into a vector and build a predictive model on the new representation of the data. Bernardino et al. [86] and Kishan et al. [105] presented a MTL framework for data sets with multi-modal structures using a supervised tensor decomposition approach. Similarly, Yu et al. [118] proposed a low-rank tensor learning approach for multivariate spatio-temporal data. These approaches encode the parameters of their predictive models as a tensor, which is assumed to have a low rank. The tensor decomposition is performed on the model parameters, unlike our proposed framework, which performs the decomposition on the spatio-temporal data tensor. This strategy allows us to provide a meaningful interpretation of the latent factors in terms of their spatial, temporal, and feature dimensions.

Incremental/online tensor decomposition methods have been developed for streaming data as well as for data sets that grow dynamically over time [93]. Current incremental methods can be divided into two categories: one is based on Tucker decomposition, while the other is based on CP decomposition. Existing incremental Tucker decomposition methods are mostly based on incremental SVD, applied to the matricization of the tensor for different modes [94, 93, 55]. The incremental SVD-based methods assume orthogonality of the latent factors, which is somewhat restrictive for interpretability reasons. Zhou et al. [127] developed an online CP decomposition approach, where the latent factors are updated when new data arrive. However, these methods are inapplicable to our problem setting since they were developed for unsupervised learning.
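For reference, the CP model mentioned above approximates a three-way tensor by a sum of R rank-one components. The numpy sketch below reconstructs a tensor from CP factor matrices and fits the factors with a few plain alternating least squares sweeps. It is an unsupervised, unregularized, batch illustration only; the supervised and incremental decompositions used by the cited methods and by WISDOM and MUSCAT differ in their objectives and update rules.

```python
import numpy as np

def cp_reconstruct(A, B, C):
    """CP model for a 3-way tensor: X[i,j,k] = sum_r A[i,r] * B[j,r] * C[k,r]."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def cp_als(X, rank, n_iter=100, seed=0):
    """Plain CP decomposition fitted by alternating least squares (ALS)."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.normal(size=(I, rank))
    B = rng.normal(size=(J, rank))
    C = rng.normal(size=(K, rank))
    for _ in range(n_iter):
        # Each step solves the least-squares problem for one factor with the others fixed.
        A = np.einsum('ijk,jr,kr->ir', X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum('ijk,ir,kr->jr', X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Toy location x time x feature tensor of exact rank 2, then recovered by ALS.
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.normal(size=(s, 2)) for s in (6, 5, 4))
X = cp_reconstruct(A0, B0, C0)
A, B, C = cp_als(X, rank=2)
err = np.linalg.norm(X - cp_reconstruct(A, B, C)) / np.linalg.norm(X)
print(f"relative reconstruction error: {err:.2e}")
```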
Although there is a recent work on online supervised tensor decomposition [118], it is restricted to new observation data along the time dimension, whereas our WISDOM and MUSCAT frameworks consider new observations in both space and time.

CHAPTER 3
ONLINE REGULARIZED MULTI-TASK REGRESSION

1 This chapter is based on the previous publication [111].

Ensemble forecasting is a popular numerical prediction method for modeling nonlinear dynamic systems, such as climate [97], agricultural [19], ecological [6], and hydrological [72] systems. Specifically, the future states of the systems are predicted using computer models that simulate the physical processes governing the behavior of such systems. Since the models may not fully capture all the underlying processes or parameterize them accurately, their forecast errors tend to amplify with increasing lead time. Ensemble forecasting [66] aims at obtaining more robust prediction results by combining outputs from multiple runs of the computer models. Each run is obtained by perturbing the starting condition or using a different model representation of the dynamic system. The forecast generated from each run corresponds to a series of predictions for a future time window, T, known as the forecast duration.

As an example, consider the ensemble forecasting task shown in Fig. 3.1. There are altogether N sets of forecasts generated every 5 days, from April 5, 2011 to September 12, 2011. Each set of forecasts contains time series predictions generated by d ensemble members (x1, ..., xd) for a forecast duration T. The goal is to combine the d ensemble member forecasts in such a way that produces an aggregated prediction that is consistent with the true observation data, y.

The ensemble mean or median is often used as a point estimate of the aggregated forecasts. These estimates assume that every ensemble member prediction is equally plausible, and thus, their predictions should be weighted equally. However, as some ensemble members may fit the observed data less accurately than others, this may degrade the overall predictive performance. To illustrate this, consider the example given in Fig. 3.2, which shows the basin-averaged soil moisture percentile forecasts of a hydrological model ensemble, consisting of 33 members (shown as thin green lines), along with the ensemble median (shown as a dashed line) and the true observed data (shown as a solid red line). The ensemble members were initially calibrated using soil moisture data from September 2, 2011. Their outputs for that day are therefore the same. Each ensemble member would then generate a set of forecasts for a 40-day time window, from September 7, 2011 to October 12, 2011. Though the forecasts were quite similar at the beginning, they began to diverge with increasing lead time. Some ensemble member forecasts no longer fit the observed data well, thus affecting the accuracy of the ensemble median approach. This example illustrates the need to learn an optimal set of weights for combining the ensemble member forecasts.

To address this need, this chapter presents an online learning model that can update the weights of the ensemble members according to their predictive skills. Unlike conventional online learning, the ensemble forecasting task requires making multiple predictions for a time window of length T.
As the predictions within the window are not independent due to the temporal autocorrelation of the time series, the ensemble forecasting task can be naturally cast as an online multi-task regression problem. Another difference between conventional online learning and the requirements of ensemble forecasting is that not all observation data are available when the model is updated. For example, in Fig. 3.1, suppose the ensemble members generate a new set of forecasts every 5 days. Let forecast N − 2 be the set of forecasts generated on September 2 for a 40-day period from September 7 to October 12, forecast N − 1 be the corresponding forecast set generated on September 7 for the time window September 12 to October 17, and forecast N be the forecast set generated on September 12 for September 17 to October 22.

Figure 3.1: A schematic illustration of the ensemble forecasting task (diagram is best viewed in color). Each forecast set contains d ensemble member forecasts x1, x2, ..., xd and the observations y over the forecast duration t1, ..., tT; forecast sets are issued on April 5, 2011 (Forecast 1), April 10, 2011 (Forecast 2), ..., September 2, 2011 (Forecast N−2), September 7, 2011 (Forecast N−1), and September 12, 2011 (Forecast N). Assuming the latest forecast was generated on September 12, 2011 for the time period between September 17 and October 22, there is one observation value available to verify Forecast (N − 1), two observations available to verify Forecast (N − 2), and so on.

We assume forecast N to be the most current forecast. When the online learning model is updated on September 12, forecast N − 2 has two observed values in its time window, for September 7 and September 12, while forecast N − 1 has a new observed value for September 12. This means the observation data are not only incomplete in each time window, but the number of observations also varies from one window to another. We call this problem online multi-task learning with partially observed data. Due to this property of the data, instead of updating the model only from its most recent state, we need to backtrack and revise some of the older models when new observation data become available.

This chapter develops a framework called ORION (which stands for Online Regularized multI-task regressiON) that uses an online learning with restart strategy to deal with the partially observed data. The framework also employs graph regularization constraints to ensure smoothness in the model parameters while taking into account the temporal autocorrelation of the predictions within each time window. It can incorporate different loss functions, such as the ε-insensitive loss and the quantile loss, the latter of which is useful for predicting extreme values of a time series. Theoretical proofs are also provided to demonstrate the convergence of the proposed online learning algorithm.

Figure 3.2: Seasonal soil moisture forecasts and the observed time series at a major river basin in North America.

The main contributions of this chapter are summarized as follows:

• The problem of online regularized multi-task regression with partially observed data is introduced and its relevance to the ensemble forecasting task is demonstrated.
• A novel framework called ORION is presented, which uses an online learning with restart strategy to solve the problem. It also uses a graph Laplacian to capture relationships among the learning tasks along with a passive-aggressive update scheme to optimize the ε-insensitive loss function.

• The framework is extended to incorporate the quantile loss function for predicting extreme events. To the best of our knowledge, ORION is the first multi-task regression framework that has been tailored for extreme value prediction.

• Extensive experiments are performed using a real-world soil moisture data set and the results show that ORION outperforms several baseline algorithms, including the ensemble median, for the majority of the river basins in our data set.

3.1 Problem Formulation

We consider a variation of the online multi-task learning process described in [34], in which the learning proceeds in a sequence of rounds. At the start of round n, where n ∈ {1, 2, ..., N}, the algorithm observes T unlabeled instances, x^(n) = {x_1^(n), x_2^(n), ..., x_T^(n)}, where each instance x_i^(n) ∈ ℜ^d is a d-dimensional vector of predictor variables, and T is the number of instances to be predicted in each round. The algorithm then predicts the target value f(x_i^(n)) for each unlabeled instance. We consider the prediction of each instance as a separate learning task. Our goal is to learn a set of prediction functions for the T tasks such that their cumulative loss over the N rounds is minimized. Similar to previous works [34, 21], we consider only linear prediction functions of the form f(x) = w^T x, where w ∈ ℜ^d is the parameter vector. Extending the formulation to non-linear models is beyond the scope of this work.

From an ensemble forecasting perspective, each online round corresponds to the scenario when a new set of forecasts is generated, as shown in Fig. 3.1. After N rounds, there are N sets of forecasts generated. Each set of forecasts is generated by d ensemble members, which collectively form the set of predictor variables for our online learning task. Note that each ensemble member produces a time series of length T, which is equivalent to the number of instances (tasks) that must be predicted in each online round. Since the prediction tasks are related due to the inherent temporal autocorrelation of the time series, we cast the ensemble forecasting task into a multi-task learning problem. The number of prediction tasks in each round is equal to the forecast duration, T, and the number of online rounds is equal to the number of forecast sets, N.

As can be seen from Fig. 3.1, the number of observed values available to update the predictions at the end of each round varies from one forecast set to another. Let forecast N be the most recent set of forecasts generated for the time window [t_{N+1}, t_{N+2}, ..., t_{N+T}]. There are no observed values available for the given time window. However, the forecast set N − 1, which was previously generated for the time window [t_N, t_{N+1}, ..., t_{N+T−1}], now has a new observed value for the time t_N. Similarly, the number of observed values for the forecast set N − 2 increases from 1 to 2. More generally, let y_m^(n) = {y_1^(n), y_2^(n), ..., y_{m_n}^(n)} denote the set of observed values available for the forecast set m in round n, where m ≤ n. If T is the forecast duration, then the number of observed values available to update the forecast set m in round n is given by m_n = min(n − m, T).
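To make the partially observed label structure concrete, the following minimal Python sketch (illustrative only; the function name and example values are ours) enumerates how many observed values are available for each forecast set in a given round, using m_n = min(n − m, T).

```python
def num_observed(m, n, T):
    """Number of observed values available in round n for forecast set m (m <= n),
    following m_n = min(n - m, T) from the problem formulation."""
    return min(n - m, T)

# Example: forecast duration T = 8, current round n = 10.
T, n = 8, 10
for m in range(1, n + 1):
    print(f"forecast set {m}: {num_observed(m, n, T)} observed values")
```

Forecast sets issued more than T rounds ago are fully labeled, while the most recent set has no labels at all, which is exactly the setting that motivates the restart strategy described next.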
This partially observed data scenario distinguishes our work from other online multi-task learning formulations.

Example 1. Consider the scenario shown in Fig. 3.1. Assume the most recent forecast set N was generated on September 12, 2011. The forecast set N − 1, which was generated on September 7, 2011 for predicting a 40-day time window from September 12 to October 17, will have a new observed value for September 12. The forecast set N − 2, which was generated on September 2, 2011 for predicting the time series from September 7 to October 12, now has two observed values, for September 7 and 12, respectively. If the forecast duration is T, then all previous forecast sets from Forecast 1 to Forecast N − T will have complete observation values for their entire data sets.

Let f^(n−1) be the model obtained at the end of round n − 1. After round n, a new labeled data set D^(n−1) = (x^(n−1), y_{n−1}^(n)) is available to update f^(n−1), where y_{n−1}^(n) includes the latest observed value for the time t_n. The number of labeled examples in D^(n−2), D^(n−3), ..., D^(n−T+1) also increases by 1 in round n, since all of them involve a prediction for the time step t_n. It is therefore insufficient to generate f^(n) directly from f^(n−1), since the models f^(n−1), f^(n−2), ..., f^(n−T) should also be revised given their expanded labeled data sets. Thus, we employ the following online learning with restart strategy to update our models. In round n, we first backtrack to round n − T and revise f^(n−T) with the expanded labeled data D^(n−T) to obtain a new model f^(n−T+1). We then update f^(n−T+1) with the expanded labeled data D^(n−T+1) to obtain f^(n−T+2) and repeat the procedure until f^(n) is obtained. To implement this strategy, the algorithm only needs to maintain two sets of weights, w^(n−1) and w^(n−T−1). At the start of round n, the algorithm makes its prediction using w^(n−1). When D^(n) is available, the algorithm will backtrack and progressively update the models starting with w^(n−T−1), which was the last set of weights trained from a completely labeled data set, until w^(n) is obtained.

3.2 Online Regularized multI-task regressiON (ORION)

This section presents the ORION framework for the ε-insensitive loss function. An extension of the framework to the quantile loss function is given in Section 3.3.

3.2.1 ORION for ε-insensitive Loss Function

Although our framework requires restarting the online learning process at round n − T to deal with the partially observed data problem, the update formula and optimization step in each round are identical. Specifically, in round n, the ORION framework assumes that the weights are co-regularized as follows:

w_t^(n) = w_0^(n) + v_t^(n), ∀t ∈ {1, 2, ..., T}.

In other words, the prediction functions for all T tasks share a common term w_0 and a task-specific weight v_t, which is expected to be small when the predictions are correlated. To estimate the weights, we employ the following objective function, which extends the Passive-Aggressive online algorithm given in [29] to a multi-task learning setting with an ε-insensitive loss function:

\[
\begin{aligned}
\underset{w_0,\{v_t\}}{\arg\min}\quad & \frac{1}{2}\sum_{t=2}^{T}\|w_t - w_{t-1}\|_2^2 + \frac{\mu}{2}\sum_{t=1}^{T}\|v_t\|_2^2 + \frac{\lambda}{2}\|w_0 - w_0^{(n-1)}\|_2^2 + \frac{\beta}{2}\sum_{t=1}^{T}\|v_t - v_t^{(n-1)}\|_2^2 \\
\text{s.t.}\quad & \forall t \le m_n:\ |w_t^T x_t^{(n)} - y_t^{(n)}| \le \varepsilon \\
& \forall t \in \{1, 2, \cdots, T\}:\ w_t = w_0 + v_t \\
& \mu, \lambda, \beta, \varepsilon \ge 0
\end{aligned}
\tag{3.1}
\]

where m_n is the number of labeled observations, x_t^(n) is the vector of predictor variables for task t in the n-th round, and y_t^(n) is the target value.
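For reference, the ε-insensitive loss implied by the constraints in (3.1) can be written out directly. The following minimal Python sketch (ours; the toy data are illustrative) evaluates this loss for one task using the co-regularized weights w_t = w_0 + v_t.

```python
import numpy as np

def eps_insensitive_loss(w0, v_t, x_t, y_t, eps=1e-3):
    """eps-insensitive loss max(0, |w_t^T x_t - y_t| - eps) for a single task,
    with the co-regularized weights w_t = w0 + v_t used in the ORION formulation."""
    w_t = w0 + v_t
    return max(0.0, abs(w_t @ x_t - y_t) - eps)

# Toy usage: d = 4 ensemble members for a single task.
rng = np.random.default_rng(0)
w0, v_t, x_t = rng.standard_normal(4), rng.standard_normal(4), rng.standard_normal(4)
print(eps_insensitive_loss(w0, v_t, x_t, y_t=0.2))
```

A prediction within ε of the observed value incurs zero loss and, as in the Passive-Aggressive setting, triggers no constraint violation in (3.1).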
For brevity, we have omitted the superscript n in our notations for v_t, w_t, and w_0. Since w_t − w_{t−1} = v_t − v_{t−1}, Equation (3.1) can be simplified as follows:

\[
\begin{aligned}
\underset{w_0, V}{\arg\min}\quad & \frac{1}{2}\mathrm{Tr}\big[V^T (L + \mu I_T) V\big] + \frac{\lambda}{2}\|w_0 - w_0^{(n-1)}\|_2^2 + \frac{\beta}{2}\|V - V^{(n-1)}\|_F^2 \\
\text{s.t.}\quad & \forall t \le m_n,\ |w_t^T x_t^{(n)} - y_t^{(n)}| \le \varepsilon \\
& \forall t \in \{1, 2, \cdots, T\},\ w_t = w_0 + v_t
\end{aligned}
\tag{3.2}
\]

where V = [v_1^T; v_2^T; ...; v_T^T] is a T × d-dimensional matrix, I_T is a T × T identity matrix, Tr[·] is the trace operator, and

\[
L_{i,j} =
\begin{cases}
1 & \text{if } i = j = 1 \text{ or } i = j = T \\
2 & \text{if } i = j,\ i \ne 1,\ i \ne T \\
-1 & \text{if } i = j + 1 \text{ or } i = j - 1 \\
0 & \text{otherwise}
\end{cases}
\]

is a graph Laplacian capturing the relationships among the T tasks. The Lagrange formulation is given by

\[
\begin{aligned}
\mathcal{L}(w_0, V, \tau) = {} & \frac{1}{2}\mathrm{Tr}\big(V^T (L + \mu I_T) V\big) + \frac{\lambda}{2}\|w_0 - w_0^{(n-1)}\|_2^2 + \frac{\beta}{2}\|V - V^{(n-1)}\|_F^2 \\
& + \sum_{t \in \mathcal{O}^{(n)}} \tau_t \big(|w_t^T x_t^{(n)} - y_t^{(n)}| - \varepsilon\big)
\end{aligned}
\]

where O^(n) = {t | t ≤ m_n and |w_t^T x_t^(n) − y_t^(n)| > ε} is the set of tasks whose ε-insensitive constraints are violated and τ = {τ_t} is the set of Lagrange multipliers such that τ_t ≥ 0 for all t ∈ O^(n) and τ_t = 0 for all t ∉ O^(n). In the next subsection, we present the solution for this optimization problem.

3.2.2 Optimization

To simplify the notation, we first vectorize the matrix V and concatenate it with w_0. Let z = [w_0; v_1; ...; v_T] denote the resulting weight vector to be solved. The Lagrangian can now be written in the following form:

\[
\mathcal{L}(z, \tau) = \frac{1}{2}(z - z^{(n-1)})^T R (z - z^{(n-1)}) + \frac{1}{2} z^T Q z + \Big[\big(z^T \tilde{X}^{(n)} - y^{(n)T}\big) S - \varepsilon \mathbf{1}^T\Big] P \tau
\tag{3.3}
\]

where X̃^(n), R, Q, P, and S are defined in Table 3.1.

Table 3.1: Notations used in Equation (3.3)

  0_d       : a d-dimensional column vector of zeros
  0_{d×T}   : a d × T matrix of zeros
  I_d       : a d × d identity matrix
  A ⊗ B     : Kronecker product between matrices A and B
  X̃^(n)    : the (T+1)d × T matrix whose first block row is [x_1^(n), x_2^(n), ..., x_T^(n)] and whose remaining T block rows form a block-diagonal matrix with x_t^(n) in the t-th diagonal block
  y^(n)     : [y_1^(n); y_2^(n); ...; y_{m_n}^(n); 0_{(T−m_n)}]
  P         : diagonal matrix with P_{i,i} = 1 if i ∈ O^(n) and 0 otherwise
  S         : diagonal matrix with S_{i,i} = sign(w_i^T x_i^(n) − y_i^(n))
  τ         : [τ_1; ...; τ_T]
  R         : [[λ I_d, 0_{d×Td}]; [0_{Td×d}, β I_{Td}]]
  Q         : [[0_{d×d}, 0_{d×Td}]; [0_{Td×d}, (L + µ I_T) ⊗ I_d]]

Taking the partial derivative of L with respect to z and setting it to zero yields the following:

\[
\frac{\partial \mathcal{L}(z, \tau)}{\partial z} = R(z - z^{(n-1)}) + Qz + \tilde{X}^{(n)} S P \tau = 0
\;\;\Longrightarrow\;\;
z = M\big(R z^{(n-1)} - \tilde{X}^{(n)} S P \tau\big)
\tag{3.4}
\]

where M = (R + Q)^{-1}. It can be easily shown that R + Q is a positive definite matrix, which means it is invertible and its inverse is also positive definite. Plugging the expression for z in Equation (3.4) back into Equation (3.3) leads to the following equation after simplification:

\[
\mathcal{L}(\tau) = -\frac{1}{2}\tau^T \tilde{X}_{PS}^{(n)T} M^T \tilde{X}_{PS}^{(n)} \tau + \ell_n^T(\hat{z}^{(n-1)})\,\tau + \text{constant}
\tag{3.5}
\]

where

\[
\tilde{X}_{PS}^{(n)} = \tilde{X}^{(n)} S P, \qquad
\ell_n(\hat{z}^{(n-1)})^T = \Big[\big(\hat{z}^{(n-1)T} \tilde{X}^{(n)} - y^{(n)T}\big) S - \varepsilon \mathbf{1}^T\Big] P, \qquad
\hat{z}^{(n-1)} = M^T R^T z^{(n-1)}
\tag{3.6}
\]

Note that P is a diagonal matrix, whose diagonal element P_{t,t} is zero if t ∉ O^(n). In other words, if the target value for task t is either unavailable or predicted correctly (within the ε-insensitive bound), all the elements in the t-th column of X̃_PS^(n) become 0, and the corresponding t-th element in ℓ_n^T(ẑ^(n−1)) is also 0. Thus, τ_t for t ∉ O^(n) has no impact on Equation (3.5) and can be set to zero. In the following derivation, we assume that the rows and columns corresponding to all the tasks t ∉ O^(n) in τ, X̃_PS^(n), and ℓ_n^T(ẑ^(n−1)) have been removed.
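The chain-graph Laplacian L and the matrices R, Q, and M from Table 3.1 can be assembled directly. The following NumPy sketch (ours; parameter values are illustrative, not tuned) constructs them for small T and d; note that M is computed only once for the whole online learning process.

```python
import numpy as np

def build_orion_matrices(T, d, lam, beta, mu):
    """Construct L, R, Q and M = (R + Q)^{-1} following Eq. (3.2) and Table 3.1."""
    # Chain-graph Laplacian over the T tasks: 1 at the two endpoints of the
    # diagonal, 2 elsewhere on the diagonal, -1 on the first off-diagonals.
    L = 2 * np.eye(T) - np.eye(T, k=1) - np.eye(T, k=-1)
    L[0, 0] = L[-1, -1] = 1.0

    R = np.block([[lam * np.eye(d), np.zeros((d, T * d))],
                  [np.zeros((T * d, d)), beta * np.eye(T * d)]])
    Q = np.block([[np.zeros((d, d)), np.zeros((d, T * d))],
                  [np.zeros((T * d, d)), np.kron(L + mu * np.eye(T), np.eye(d))]])
    M = np.linalg.inv(R + Q)   # computed once for the entire online process
    return L, R, Q, M

# Example with a short forecast window and a handful of ensemble members.
L, R, Q, M = build_orion_matrices(T=8, d=4, lam=1.0, beta=1.0, mu=10.0)
```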
Taking the partial derivative of the "reduced" Lagrangian with respect to τ and setting it to zero yields

\[
\frac{\partial \mathcal{L}}{\partial \tau} = -\tilde{X}_{PS}^{(n)T} M^T \tilde{X}_{PS}^{(n)} \tau + \ell_n(\hat{z}^{(n-1)}) = 0
\;\;\Longrightarrow\;\;
\tau = \Big[\tilde{X}_{PS}^{(n)T} M^T \tilde{X}_{PS}^{(n)}\Big]^{-1} \ell_n(\hat{z}^{(n-1)})
\tag{3.7}
\]

There are several points worth noting regarding the update formula for z and its learning rate τ. First, note that Equation (3.7) is only applicable to tasks that belong to O^(n). The columns in X̃_PS^(n) for t ∉ O^(n) must be removed before calculating τ; otherwise, the matrix X̃_PS^(n)T M^T X̃_PS^(n) is not invertible. For t ∉ O^(n), we set τ_t = 0 before calculating z. Second, even when τ_t = 0, the corresponding weight for v_t may still change due to the first term of Equation (3.4). This distinguishes our approach from other online algorithms, where a zero learning rate implies the weights will not change in the next round. Finally, our formula for τ has a similar form to the learning rate for single-task learning given in [29], τ_n = ℓ_n / ∥x_n∥². The main difference is that the τ for multi-task learning must take into account the task relatedness in both ℓ_n and the inverse of X̃_PS^(n)T M^T X̃_PS^(n).

3.2.3 Algorithm

A summary of the ORION framework for the ε-insensitive loss function is given in Algorithm 1. The algorithm proceeds in a sequence of rounds. During round n, the algorithm receives the instances x_t^(n) for each task t ∈ {1, ..., T}. Using the online learning with restart strategy, it backtracks to round n − T and updates the set of labeled observations to include the most recent target value. After computing the loss for each task, it identifies the set of tasks for which the loss exceeds the ε bound. The weights associated with these tasks are updated using the formula given in Equation (3.4). Note that τ_t is set to zero for tasks that do not belong to O^(n). In each round, the algorithm only needs to maintain two sets of weights, z^(n−T) and z^(n), along with the predictor variables {x^(n−T), x^(n−T+1), ..., x^(n)} and the observed target values {y^(n−T), y^(n−T+1), ..., y^(n)}. Its storage complexity is therefore T times the complexity of single-task learning.

Algorithm 1: Pseudocode for the ORION-ε algorithm
  Input: µ, λ, β, ε = 0.001
  Initialize: w_0 = 0_d; ∀t ∈ {1, ..., T}, v_t = 0_d
  Compute R and Q using the formulas in Table 3.1
  for n = 2, ..., N do
      Receive x_1^(n), x_2^(n), ..., x_T^(n)
      for m = n − T, ..., n do
          Set m_n = n − m
          for t = 1, 2, ..., T do
              Predict ŷ_t^(m) = (w_0^(m−1) + v_t^(m−1))^T x_t^(m)
          end for
          Update y_m^(n) = y_m^(n−1) ∪ {y^(n)}
          Set O^(n) = {t | t ≤ m_n; |w_t^(m)T x_t^(m) − y_{m,t}^(n)| > ε}
          Compute τ using Equation (3.7) and set τ_t = 0 when t ∉ O^(n)
          Update z^(m) using Equation (3.4)
      end for
  end for

3.2.3.1 Time complexity

In this section, we analyze the time complexity of the ORION framework for the ε-insensitive loss function in terms of the number of online rounds N, the number of tasks T, and the number of features d. For ensemble forecasting, T refers to the forecast duration and d is the number of ensemble members (see Fig. 3.1). Each round requires the calculation of Equations (3.4) and (3.7). For Equation (3.7), X̃_PS^(n) needs to be computed first, which requires O(T^3 d) floating point operations (flops). Calculating X̃_PS^(n)T M^T X̃_PS^(n) requires O(T^3 d^2) flops and its inverse takes O(T^3) flops. According to Equation (3.6), calculating both ẑ^(n−1) and ℓ_n(ẑ^(n−1)) requires O(T^3 d) flops. The time complexity for calculating Equation (3.7) is therefore O(T^3 d^2), whereas the time complexity for Equation (3.4) is O(T^3 d + T^2 d^2). Therefore, the model update for each task requires O(T^3 d^2) flops. Since there are T tasks, the time complexity for each online round is O(T^4 d^2). There are other computations that need to be performed only once throughout the entire online learning process, namely the calculation of M = (R + Q)^{-1}, whose complexity is O(T^3 d^3). Thus, after N rounds, the overall time complexity is O(N(T^4 d^2 + T^3 d^3)), which is linear in the number of rounds (similar to other online learning algorithms). The number of tasks T and the number of features d are domain-dependent, though they are both generally much smaller than N.
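For illustration, the closed-form update in Equations (3.4) and (3.7) can be written as a dense NumPy sketch. The version below is ours and is not an optimized implementation; it assumes the matrices R and M built in the earlier sketch and forms X̃^(n), S, and P explicitly for one forecast set.

```python
import numpy as np

def orion_eps_update(z, X, y, m_n, R, M, eps=1e-3):
    """One ORION-eps weight update (Eqs. 3.4 and 3.7) for a single forecast set.

    z   : current weights [w0; v1; ...; vT], shape ((T+1)*d,)
    X   : ensemble member forecasts for the T tasks, shape (T, d)
    y   : observed targets for the first m_n tasks, shape (m_n,)
    R, M: matrices from the previous sketch, with M = (R + Q)^{-1}
    """
    T, d = X.shape
    # X_tilde from Table 3.1: a shared block row of x_t's on top of a
    # block-diagonal part, so that z^T X_tilde[:, t] = (w0 + v_t)^T x_t.
    Xt = np.zeros(((T + 1) * d, T))
    for t in range(T):
        Xt[:d, t] = X[t]
        Xt[(t + 1) * d:(t + 2) * d, t] = X[t]

    y_full = np.concatenate([y, np.zeros(T - m_n)])
    resid = Xt.T @ z - y_full
    # Active set O^{(n)}: labeled tasks whose eps-insensitive bound is violated.
    active = np.where((np.arange(T) < m_n) & (np.abs(resid) > eps))[0]

    S = np.diag(np.sign(resid))
    P = np.zeros((T, T))
    P[active, active] = 1.0

    tau = np.zeros(T)
    if active.size:
        Xps = Xt @ S @ P
        z_hat = M.T @ R.T @ z
        loss = ((z_hat @ Xt - y_full) @ S - eps) @ P
        G = Xps[:, active].T @ M.T @ Xps[:, active]
        tau[active] = np.linalg.solve(G, loss[active])   # Eq. (3.7)

    return M @ (R @ z - Xt @ S @ P @ tau)                # Eq. (3.4)
```

In the restart strategy of Algorithm 1, this update would be applied repeatedly, starting from the last fully labeled forecast set and rolling forward to the current round.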
3.2.4 Theoretical Analysis of ORION-ε

This section presents a theoretical analysis of the average loss bound of the ORION-ε algorithm.

Lemma 1. Let UΛU^T be an eigendecomposition of the real symmetric matrix R^{-1}Q, where U is an orthogonal matrix and Λ is a diagonal matrix containing the eigenvalues of R^{-1}Q. The eigendecomposition of MR is given by U(I + Λ)^{-1}U^T.

Proof. First, we can write

\[
MR = (R + Q)^{-1}R = (I + R^{-1}Q)^{-1} = (I + U\Lambda U^T)^{-1}
\]

Since U is an orthogonal matrix, UU^T = I. Hence

\[
MR = (UU^T + U\Lambda U^T)^{-1} = U(I + \Lambda)^{-1}U^T
\]

Lemma 2. ∥MR∥₂ = 1, where ∥·∥₂ denotes the induced 2-norm of a matrix.

Proof. The induced 2-norm of a matrix A is defined as ∥A∥₂ = max_{∥x∥₂=1} ∥Ax∥₂ = √λ_max, where λ_max is the largest eigenvalue of the matrix A^T A. Since

\[
R^{-1}Q =
\begin{bmatrix}
0_{d\times d} & 0_{d\times Td} \\
0_{Td\times d} & \frac{1}{\beta}(L + \mu I_T) \otimes I_d
\end{bmatrix}
\]

the determinant |R^{-1}Q| = 0 because the matrix contains rows and columns of all zeros. In addition, it can be shown that R^{-1}Q is a positive semi-definite matrix since it is diagonally dominant, which means all of its eigenvalues must be non-negative. Since |R^{-1}Q| = ∏_k λ_k = 0, this implies that the smallest eigenvalue of R^{-1}Q is λ_min = 0. Following Lemma 1, the largest eigenvalue of MR is (1 + λ_min)^{-1} = 1. Finally, given that (MR)^T MR = U(I + Λ)^{-2}U^T, the largest eigenvalue of (MR)^T MR must also be equal to 1. Thus, ∥MR∥₂ = 1.

Lemma 3. ∥I − MR∥₂ = 1 − 1/(1 + λ_max) ≤ 1, where ∥·∥₂ denotes the induced 2-norm of a matrix and λ_max ≥ 0 is the largest eigenvalue of R^{-1}Q.

Proof. Following Lemma 1, it is easy to see that I − MR = U[I − (I + Λ)^{-1}]U^T. Thus, the largest eigenvalue of I − MR is 1 − 1/(1 + λ_max), which is the induced 2-norm of the matrix.

Theorem 1. Let z^(1), z^(2), ..., z^(n) be a sequence of weights learned using the ORION-ε algorithm. Using the notations given in Table 3.1, the following bound holds for any u ∈ ℜ^{(T+1)d}:

\[
\frac{1}{N}\sum_{n}\big\|\ell_n(M^T M R z^{(n)})\big\| \;\le\; \frac{1}{N}\sum_{n}\big\|\ell_n(M^T u)\big\| + \frac{1}{2C}\frac{\|u\|^2}{N} + \|u\|\Psi + C\rho^2,
\]

where ∥τ∥ ≤ C, ∥z^(n)∥ ≤ Ψ, and ∥M X̃^(n) S P∥ ≤ ρ.

Proof. Define ∆_n = ∥z^(n) − u∥² − ∥z^(n+1) − u∥². We will derive the relative loss bound by finding the upper and lower bounds of ∑_{n=1}^N ∆_n. For the upper bound,

\[
\sum_{n=1}^{N}\Delta_n = \sum_{n=1}^{N}\big(\|z^{(n)} - u\|^2 - \|z^{(n+1)} - u\|^2\big) = \|z^{(1)} - u\|^2 - \|z^{(N+1)} - u\|^2 = \|u\|^2 - \|z^{(N+1)} - u\|^2 \le \|u\|^2
\tag{3.8}
\]

where z^(1) = 0. Next, we derive the lower bound of ∆_n:

\[
\begin{aligned}
\Delta_n &= \|z^{(n)} - u\|^2 - \|z^{(n+1)} - u\|^2 \\
&= \|z^{(n)} - u\|^2 - \|MRz^{(n)} - M\tilde{X}^{(n)}SP\tau - u\|^2 \\
&= \|z^{(n)} - u\|^2 - \|MRz^{(n)} - u\|^2 - \|M\tilde{X}^{(n)}SP\tau\|^2 + 2(MRz^{(n)} - u)^T M\tilde{X}^{(n)}SP\tau
\end{aligned}
\]

A lower bound on the first two terms is given as follows:

\[
\begin{aligned}
\|z^{(n)} - u\|^2 - \|MRz^{(n)} - u\|^2 &= \|z^{(n)}\|^2 - \|MRz^{(n)}\|^2 - 2u^T(I - MR)z^{(n)} \\
&\ge \|z^{(n)}\|^2 - \|MR\|_2^2\|z^{(n)}\|^2 - 2\|u\|\|I - MR\|\|z^{(n)}\| \\
&\ge -2\|u\|\|z^{(n)}\|
\end{aligned}
\]

where we have applied Lemmas 2 and 3 and used the facts that ∥Ax∥₂ ≤ ∥A∥₂∥x∥₂ and u^T v ≤ ∥u∥∥v∥.
Furthermore,

\[
\begin{aligned}
& z^{(n)T} R^T M^T M\tilde{X}^{(n)}SP\tau - u^T M\tilde{X}^{(n)}SP\tau \\
&= \big[(z^{(n)T} R^T M^T M\tilde{X}^{(n)} - y^{(n)T})S - \varepsilon\mathbf{1}^T\big]P\tau - \big[(u^T M\tilde{X}^{(n)} - y^{(n)T})S - \varepsilon\mathbf{1}^T\big]P\tau \\
&\ge \ell_n^T(M^T M R z^{(n)})\tau - \ell_n^T(M^T u)\tau
\end{aligned}
\]

Putting them together, we have

\[
\|u\|^2 \ge \sum_{n=1}^{N}\Delta_n \ge -2\|u\|\sum_{n}\|z^{(n)}\| - \sum_{n}\|M\tilde{X}^{(n)}SP\|^2\|\tau\|^2 + 2\sum_{n}\ell_n^T(M^T M R z^{(n)})\tau - 2\sum_{n}\ell_n^T(M^T u)\tau
\]

Assuming ∥τ∥ ≤ C, ∥z^(n)∥ ≤ Ψ, and ∥M X̃^(n) S P∥ ≤ ρ, and after re-arranging the inequality, we obtain

\[
\frac{1}{N}\sum_{n}\big\|\ell_n(M^T M R z^{(n)})\big\| \le \frac{1}{N}\sum_{n}\big\|\ell_n(M^T u)\big\| + \frac{1}{2C}\frac{\|u\|^2}{N} + \|u\|\Psi + C\rho^2
\]

3.3 Online Regularized Multi-Task Quantile Regression (ORION-QR)

Predicting extreme value events is important for applications such as weather and hydrological forecasting due to their adverse impacts on both human and natural systems. Unfortunately, most of the existing work on multi-task learning has considered only squared or hinge loss functions, and thus, is not suitable for extreme value prediction. In this section, we describe an extension of the ORION framework to predict extreme values in a time series by incorporating a quantile loss function.

To describe the approach, we first present the quantile regression (QR) method [60] and introduce the quantile loss function. QR is a statistical method for estimating the conditional quantiles of a target variable as a function of its predictor variables. By focusing on the upper or lower quantiles of the distribution, this may help bias the algorithm towards learning the extreme values of the target distribution. Specifically, QR is designed to improve the estimate of the τ-th conditional quantile of the prediction by minimizing the following sum of asymmetrically weighted absolute residuals:

\[
\sum_{i=1}^{n}\rho_\tau(y_i - x_i^T\beta), \quad \text{where } \rho_\tau(u) =
\begin{cases}
\tau u & u > 0 \\
(\tau - 1)u & u \le 0
\end{cases}
\]

The τ-th quantile of a random variable Y is defined as

\[
Q_Y(\tau) = F^{-1}(\tau) = \inf\{y : F_Y(y) \ge \tau\}
\]

The quantile loss function is asymmetric around τ, i.e., it incurs a higher penalty if the predicted function underestimates the true value of the target variable and a lower penalty if it overestimates the true value. By choosing τ close to 1, quantile regression is biased towards predicting higher values of the time series. In the case when τ = 0.5, the solution reduces to the conditional median of the target distribution. The preceding objective function is equivalent to solving the following linear programming problem:

\[
\begin{aligned}
\min_{p, q}\quad & \tau\mathbf{1}_T^T p + (1-\tau)\mathbf{1}_T^T q \\
\text{s.t.}\quad & y - X\beta = p - q, \quad p \ge 0,\ q \ge 0
\end{aligned}
\]

The ORION framework for the quantile loss function is designed to solve the following optimization problem:

\[
\begin{aligned}
\min_{p, q, w_0, V}\quad & \tau\mathbf{1}_T^T p + (1-\tau)\mathbf{1}_T^T q + \frac{1}{2}\mathrm{Tr}\big(V^T(L + \mu I_T)V\big) + \frac{\lambda}{2}\|w_0 - w_0^{(n-1)}\|_2^2 + \frac{\beta}{2}\|V - V^{(n-1)}\|_F^2 \\
\text{s.t.}\quad & \forall t \le m_n,\ y_t^{(n)} - w_t^T x_t^{(n)} = p_t - q_t \\
& \forall t,\ w_t = w_0 + v_t, \quad p \ge 0,\ q \ge 0
\end{aligned}
\]

Compared to ORION-ε, there are two additional parameters, p and q, that must be estimated from the data. With this formulation, the prediction function is trained to fit the conditional quantiles and to ensure smoothness of the model parameters across the different tasks. The latter is attained by using a graph Laplacian to encode the task relationships. Unlike the original QR formulation, ORION-QR requires solving a quadratic programming problem. We employed the CVX software [50] to estimate the parameters, using τ = 0.95 to detect extreme (high) value events.
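The asymmetry of the quantile (pinball) loss ρ_τ(u) is easy to see numerically. The following minimal Python sketch (ours; the example residuals are illustrative) evaluates ρ_τ element-wise on residuals u = y − ŷ.

```python
import numpy as np

def quantile_loss(u, tau=0.95):
    """Asymmetric quantile (pinball) loss rho_tau(u): tau * u if u > 0,
    (tau - 1) * u otherwise, applied element-wise to residuals u = y - y_hat."""
    u = np.asarray(u, dtype=float)
    return np.where(u > 0, tau * u, (tau - 1.0) * u)

# Underestimating an extreme value (positive residual) is penalized far more
# heavily than overestimating it when tau is close to 1.
print(quantile_loss([2.0, -2.0], tau=0.95))   # [1.9, 0.1]
```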
3.4 Experimental Evaluation

The proposed framework was applied to the soil moisture ensemble forecasting problem. The soil moisture forecasts were obtained from a seasonal hydrological prediction system for 12 major river basins in North America. 33 ensemble member forecasts were generated by running the model multiple times with different initial conditions. The data correspond to 40-day forecasts generated every 5 days for the time period between April 2011 and September 2011. The number of learning tasks to be predicted in each forecast set is T = 8. The number of forecast sets in the data set is N = 33, which is equivalent to the number of online rounds. Since the model parameters were initialized to zero, the initial forecasts were poor until the model had been sufficiently trained. We use the first 23 forecast sets as "training data" and report the performance based on the predictions generated for the last 10 forecast sets ("test data"). We evaluated the performance of the different methods in terms of their mean absolute error (MAE) on the test data:

\[
\mathrm{MAE} = \frac{1}{80}\sum_{n=24}^{33}\sum_{t=1}^{8}\big|y_t^{(n)} - \hat{y}_t^{(n)}\big|,
\]

where ŷ is the vector of predicted values.

3.4.1 Performance Comparison for ORION-ε

We compared the performance of ORION-ε against the following baseline methods.

1. Ensemble Median (EM): This is an unbiased aggregation of the ensemble member forecasts.

2. Passive-Aggressive (PA) Algorithm [29]: This single-task learning method assumes there is only one set of weights w to be estimated. Starting from the initial weights w = 0, we update the weights in each round as follows (from the first to the last task):

\[
w^{(n)} = w^{(n-1)} + \mathrm{sign}\big(y^{(n)} - \hat{y}^{(n)}\big)\,\tau\, x^{(n)}
\]

where τ = ℓ_n / ∥x^(n)∥²₂ and ℓ is the ε-insensitive loss.

3. Tracking Climate Models (TCM) [77]: This method uses a Hidden Markov Model to track the current best ensemble member. Unlike ORION-ε, the weights estimated by TCM range between 0 and 1, which means the aggregated forecast always falls within the range of the ensemble member forecasts. Since the method is designed for single-task learning, we modify its implementation to handle T instances in each round. Instead of using the squared error between the observed and predicted target values, the loss is computed based on the average squared error over m_n instances.

4. Online Multi-task Learning with a Shared Loss (OMTLSL) [34]: This is a modified implementation of the approach given in [34], which was originally proposed for the hinge loss function. Here, we use the ε-insensitive loss function for our regression problem and ignore the use of slack variables. The modified objective function is given by:

\[
\underset{w_t}{\arg\min}\ \frac{1}{2}\sum_{t=1}^{T}\|w_t - w_t^{(n-1)}\|^2 \quad
\text{s.t. } \forall t \le m_n,\ |w_t^T x_t^{(n)} - y_t^{(n)}| \le \varepsilon
\]

The optimization problem can be solved using a similar technique as that for ORION-ε.

5. Linear Algorithms for Online Multi-task Learning (LAOM) [21]: The original method was proposed for multi-task classification. We modify the loss function to the squared loss for our regression problem. The default approach given in [21] assumes the data is only available one task at a time, and thus, the models are updated one task at a time according to the task relationship matrix. As a consequence, the learning process depends on the ordering of the tasks. ORION-ε does not require such an assumption.

For a fair comparison, all the baseline methods adopt the same online learning with restart strategy (similar to ORION-ε) to deal with the partially observed data. Table 3.2 compares the results of the different methods. As can be seen from the table, ORION-ε works better than the baseline methods on all 12 data sets. In particular, it outperforms OMTLSL, which is a state-of-the-art online multi-task learning algorithm, on all the data sets.
Table 3.2: Comparison of mean absolute error (MAE) for ORION-ε against baseline methods on soil moisture data

                ORION-ε    EM       PA       TCM      OMTLSL   LAOM
arkansusred      2.740     4.189    3.788    3.659    5.423    2.767
calinevada       3.398     4.919    4.281    4.265    4.422    4.257
colorado         4.362     5.934    5.741    5.634    6.068    5.674
columbia         4.411     6.000    6.439    6.475    6.225    5.370
lowermiss        9.891    12.023   11.639   10.671   14.975   11.951
midatlantic     13.473    24.381   25.140   20.961   23.143   27.507
missouri         3.699     6.029    5.470    6.575    6.913    5.269
northcentral     6.292     8.789    8.700    9.157   10.838    7.298
northeast        7.422    22.040   20.490   19.471   24.877   23.824
ohio            14.535    17.023   15.107   15.021   19.064   16.436
southeast        8.229     8.951    8.778    9.136   10.966    9.158
westgulf         3.790     4.697    4.490    5.689    6.150    4.369

Unlike OMTLSL, ORION-ε enforces the constraint w_t = w_0 + v_t, which helps to improve the performance of the ensemble forecasting task. As will be shown in Table 3.5, the improvement is still observed even when the task relationship is removed (i.e., comparing OMTLSL against ORION-ε-NR). Comparing ORION-ε against LAOM, the results suggest the benefit of updating the multiple tasks simultaneously instead of updating them one task at a time.

As further evidence, Fig. 3.3 shows the predicted time series for ORION-ε and the Ensemble Median (EM) method on the northeast data set. The first three panels are from the training set and the remaining nine panels are from the test set. Initially, the time series predicted by ORION-ε is similar to EM (see Fig. 3.3a to 3.3c). As more data becomes available, the predictions by ORION-ε are closer to the observation data than EM (Fig. 3.3d to 3.3j). In Fig. 3.3k, there appears to be a sudden shift that causes the performance of ORION-ε to degrade significantly. However, after one update, ORION-ε recovers from the mistake and its prediction follows the observation data closely again (Fig. 3.3l). Fig. 3.4 shows the absolute error of ORION-ε and EM during the last 15 rounds of the training data and the 10 rounds of the test data, for both the Northeast and Mid-Atlantic data.
Although the performance of ORION-ε is slightly worse than EM at the beginning, after sufficient training, ORION-ε appears to perform significantly better than EM.

Figure 3.3: Forecasts on the Northeast data set for ORION-ε. Panels (a)–(l) show forecasts for 05/15/2011, 06/09/2011, 07/14/2011, 08/03/2011, 08/08/2011, 08/13/2011, 08/18/2011, 08/23/2011, 08/28/2011, 09/02/2011, 09/07/2011, and 09/12/2011 (ensemble members, ORION-ε, observations, and ensemble median). Fig. 3.3a–3.3c are results from the training set and Fig. 3.3d–3.3l are results from the test set. Note that in the early stage of the online learning process, ORION-ε performs similarly to the Ensemble Median (see Fig. 3.3a–3.3b), and ORION-ε starts to follow the observations from Fig. 3.3c onward.

Figure 3.4: Mean absolute error for ORION-ε and Ensemble Median on the Northeast and Mid-Atlantic data (panels (a) Northeast Data and (b) Mid-Atlantic Data; absolute error versus forecast date).

We also compared the runtime of ORION against the other baseline methods. The total runtime as well as the average runtime per round for the Northeast data set is shown in Table 3.3.

Table 3.3: Comparison of runtime (in seconds) on the Northeast data set.

                      ORION     PA       TCM      OMTLSL   LAOM
Runtime per Round     0.0430    0.0005   0.0070   0.0013   0.0014
Total Runtime         1.0909    0.0208   0.1252   0.0647   0.0606

ORION is relatively slower than the other baselines, which is not surprising since it has to backtrack and revise some of the older models in each round, unlike the other methods. Nevertheless, the additional overhead, which is less than 50 ms for each update round, is reasonable for many ensemble forecasting problems that require only a one-time daily update of their models. It is therefore an acceptable tradeoff for the accuracy improvement shown in Table 3.2.

3.4.2 ORION with Quantile Regression (ORION-QR)

To evaluate the performance of the ORION framework with the quantile loss function, the observation data were initially preprocessed to identify the forecast periods when the soil moisture value is extremely high, i.e., more than 1.64 standard deviations away from the mean. Non-extremes are defined as those values located within the 90% confidence interval of the mean. Only 5 of the 12 data sets contain extreme events in the test period. The numbers of extreme events during the test period for the five data sets are as follows: arkansusred (14), colorado (14), midatlantic (44), southeast (6), and westgulf (41). We report our experimental results for these data sets only, comparing ORION-QR against EM, ORION-ε, and the following two baselines:

1. Quantile Regression (QR): This is the original QR method used in a batch mode. It assumes all the tasks share the same set of weights w.

2. Online Quantile Regression (OQR): This is a variant of the PA [29] algorithm using a quantile loss function. The objective function is modified as follows:

\[
\begin{aligned}
\min_{p, q, w}\quad & \tau\mathbf{1}_T^T p + (1-\tau)\mathbf{1}_T^T q + \frac{\lambda}{2}\|w - w^{(n-1)}\|_2^2 \\
\text{s.t.}\quad & \forall t,\ y_t^{(n)} - w^T x_t^{(n)} = p_t - q_t, \quad p \ge 0,\ q \ge 0
\end{aligned}
\tag{3.9}
\]

which can be solved using standard quadratic programming solvers such as CVX.

For this experiment, we are interested in the prediction of extreme events. A predicted event is a true positive (TP) if it corresponds to a real event and a false positive (FP) if it does not. A false negative (FN) corresponds to a real event that was not detected by the algorithm. We use the following F1-measure as our evaluation metric:

\[
F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}
\]

Table 3.4: Comparison of F1 measure for predicting the occurrence of extreme events.

            arkansusred   colorado   midatlantic   southeast   westgulf
ORION-QR    0.465         0.593      0.500         0.444       0.598
EM          0.4167        0.148      0.3019        0.3077      0.6122
ORION-ε     0.200         0.500      0.500         0.250       0.451
QR          0.500         0.406      0.387         0           0.568
OQR         0.500         0.3030     0.275         0.286       0.625

Table 3.4 shows that ORION-QR outperforms both EM and ORION-ε in 4 out of 5 data sets. The latter suggests that the quantile loss is more effective than the ε-insensitive loss when dealing with extreme value prediction.
Compared to the single-task learning methods, ORION-QR outperforms QR in 4 out of 5 data sets and OQR in 3 out of 5 data sets. Furthermore, there is no significant difference when comparing the number of data sets in which the batch version of QR outperforms its online version, OQR. For the southeast dataset, the F1-measure for QR is zero, which suggests that the method fails to correctly predict extreme events in the test data. This is because there are only 6 extreme events in the test period of the southeast dataset, which makes it a hard prediction problem. In contrast, the frequency of extreme events for the other datasets is at least 14. The presence of concept drift in the time series data also makes QR less effective compared to OQR. While OQR performs better on the southeast dataset, it is still significantly worse than our proposed ORION-QR algorithm.

3.4.3 Sensitivity Analysis

This section analyzes the sensitivity of the three input parameters of the ORION framework.2 Although the experimental results shown here are for the northeast data set, a similar behavior was observed in the other data sets. For each experiment, we vary the value of one parameter and fix the values of the remaining two.

The parameter µ controls the sparsity of the weights for v_t. Note that µ must be non-negative to ensure the matrix L + µI_T is positive semi-definite. Fig. 3.5(a) shows that the MAE improves as µ increases. As long as µ is sufficiently large (> 50), the MAE becomes stable. This result is not surprising since the prediction tasks are highly correlated. Therefore, the weights for v_t are expected to be small.

The parameters β and λ affect how much information should be retained from previous rounds. Since the weights for v_t are small, the results are not that sensitive to changes in β (see Fig. 3.5(b)). The results are more sensitive to the choice of λ. If λ is too small, w_0 deviates significantly from its previous value. Conversely, if λ is set too large, the variation in w_0 becomes too slow and the algorithm requires more training examples in order to converge. In practice, λ can be set based on its performance on the training data.

2 Similar to other works, ε is typically fixed to a small number. So we set ε = 10^{-3} in all our experiments.

Figure 3.5: Sensitivity analysis of ORION-ε: (a) effect of varying µ on MAE; (b) effect of varying β on MAE; (c) effect of varying λ on MAE. Fig. 3.5a shows that ORION-ε tends to choose a large value of µ; Fig. 3.5b shows that ORION-ε is not that sensitive to β; Fig. 3.5c shows that λ is the parameter to be tuned in practice.

3.4.4 Variations of ORION-ε Framework

The ORION framework uses two types of information to update its model parameters. First, it uses the λ and β regularizers to control the amount of information retained from previous rounds. Second, it relies on the Q matrix to control the amount of information shared among the tasks. We investigate two variations of the ORION-ε framework.

We first consider the case when β = 0, which implies that the weight vectors v_t are independent of their values in the previous round.3 We denote this approach as ORION-ε-β. Experimental results given in Table 3.5 show that ORION-ε outperforms ORION-ε-β in 9 out of 12 data sets. However, the difference is not that significant except for 3 of the 12 data sets. The second variation of our framework removes the task relationship by setting Q = 0. This approach is denoted as ORION-ε-NR.

3 Setting λ = 0 makes R + Q a singular matrix. This situation is not considered in this study.

Table 3.5: Comparison of mean absolute error (MAE) for different variations of the ORION-ε framework

                ORION-ε    ORION-ε-NR   ORION-ε-β
arkansusred      2.740      3.937        2.740
calinevada       3.398      4.781        3.390
colorado         4.362      4.599        4.410
columbia         4.411      4.278        5.156
lowermiss        9.891     10.038       12.047
midatlantic     13.473     13.809       13.527
missouri         3.699      3.370        5.049
northcentral     6.292      6.163        6.475
northeast        7.422      7.814        7.427
ohio            14.535     14.463       14.987
southeast        8.229      9.583        8.232
westgulf         3.790      5.002        3.780

Based on the results given in Table 3.5, ORION-ε outperforms ORION-ε-NR in 8 out of 12 datasets, with substantial improvements in at least 4 of them. This shows the value of incorporating the task relationship into the ORION-ε framework.

3.5 Conclusion

This chapter presented an online regularized multi-task regression framework for ensemble forecasting tasks. The framework is unique in that it uses an online learning with restart strategy to update its models. It is also flexible in that it can accommodate both ε-insensitive and quantile loss functions. Experimental results confirm the superiority of the proposed framework compared to several baseline methods.

CHAPTER 4
MULTI-TASK LEARNING FRAMEWORK FOR MULTI-LOCATION PREDICTION

1 This chapter is based on the previous publication [109].

The geospatio-temporal prediction task typically requires making predictions for a response variable at multiple locations [121]. For example, climate scientists are interested in obtaining future climate projections of temperature or precipitation for multiple locations in a geographical region of interest. Similarly, in hydrology, soil moisture forecasts are periodically generated at multiple river and lake basins for drought monitoring applications. To perform such predictions, the simplest approach would be to fit a model at each location using only its local observations. This approach may not be effective because it fails to exploit the shared information among the prediction tasks. This is especially true for geospatio-temporal prediction problems following Tobler's first law of geography, which states that "Everything is related to everything else, but near things are more related than distant things." [98].

To overcome this challenge, this thesis considers the application of multi-task learning (MTL) [20] to geospatio-temporal prediction problems. MTL is a widely-used approach for solving multiple, related learning tasks by exploiting the common structure of the problem. For geospatio-temporal data, the task relatedness could be represented by the spatial proximity or landscape similarity between different locations (e.g., elevation, slope, and other topographic variables). However, incorporating such information alone into an existing MTL framework may not be sufficient since the spatio-temporal variabilities of the data are potentially influenced by several macroscale phenomena, whose impact varies from one location to another. For example, the well-known El Niño phenomenon differentially affects weather patterns in North America and the rest of the world. A flexible MTL framework that can capture the spatial autocorrelation of the data as well as the influence of broader scale effects is therefore needed.

To address this problem, we present a novel multi-location prediction framework called GSpartan, which stands for GeoSPAtio-tempoRal mulTi-tAsk learNing. GSpartan is developed based
on the assumption that the prediction models for all locations share a common set of low-rank base models, where each base model may represent a macroscale phenomenon that potentially explains the variability of the data. The local model at each location is constructed as a linear combination of these base models. A graph Laplacian regularization is introduced to capture the spatial autocorrelation of the data, thus providing a natural way to specify the relationships among the prediction tasks. To ensure interpretability of the models, we also add sparsity and non-negativity constraints into the GSpartan formulation.

We evaluated the performance of GSpartan against several baseline methods on climate data collected from 37 randomly chosen weather stations in Canada. Our experimental results for predicting monthly precipitation showed that GSpartan outperformed single-task learning (STL) and two other existing multi-task learning (MTL) methods in at least 75% of the stations. The ability of GSpartan to outperform the other baseline methods increases to more than 84% of the stations if the training data available at each station is limited to only 1 year.

4.1 Preliminaries

Let S ⊂ ℜ² be a set of geo-referenced locations, where each location s ∈ S is associated with a set of temporal fields. One of the fields is designated as the response variable we are interested in predicting, while the rest are considered predictor variables. For instance, in climate modeling, the response variable may correspond to monthly precipitation values recorded at a weather station whereas the predictor variables correspond to outputs generated from a global or regional climate model [26]. An example of the multi-location prediction task here is to infer future monthly values of precipitation for all the locations based on their historical observations and outputs from the climate models.

Formally, consider a geospatio-temporal data set D = {(X_1, y_1), (X_2, y_2), ..., (X_{|S|}, y_{|S|})}, where each tuple (X_s, y_s) denotes the temporal fields at location s. Let X_s = [x_{s,1}^T; ...; x_{s,n_s}^T] ∈ R^{n_s × d} be the matrix of predictor variables and y_s ∈ R^{n_s} be the time series of the response variable observed at the discrete time points 1, 2, ..., n_s.

For single-task learning (STL), each location s is treated as a separate learning task. Let n_s be the number of training examples available for task s and d be the number of predictor variables. STL seeks to learn a (local) task model f_s(x; w_s) for each location in such a way that the following loss function is minimized:

\[
\min_{W} \sum_{s=1}^{|S|}\sum_{i=1}^{n_s} \ell_s\big[f_s(x_{s,i}; w_s),\ y_{s,i}\big]
\]

where W = [w_1, ..., w_{|S|}] ∈ R^{d×|S|} denotes the model parameters and ℓ_s(·) represents the loss function for task s. For brevity, we consider only task models of the form f_s(x_s; w_s) = x_s^T w_s with a squared loss function.

4.2 GSpartan

This section presents our proposed GSpartan framework for multi-location prediction. The framework was designed to satisfy the following three requirements:

1. It should learn a low-rank representation of the task models. The low-rank representation, defined by a set of base models, represents the possible macroscale phenomena that could help explain the temporal variability of the data.

2. It should incorporate domain knowledge about the spatial autocorrelation of the response variable among the various locations.

3. To ensure interpretability, each base model should depend only on a small subset of the predictor variables. In addition, each local model should be composed of a small number of base models.
The following objective function is used in GSpartan to train the local models jointly:

\[
\min_{W} \sum_{s=1}^{|S|}\sum_{i=1}^{n_s} \ell_s\big[f_s(x_{s,i}; w_s),\ y_{s,i}\big] + \Omega(W)
\]

where Ω(W) is a regularization term. In the following, we discuss how to formulate the objective function to meet the requirements stated above.

Low-rank Representation: We assume that the local models can be expressed as a product of two low-rank matrices, i.e., W = UV, where U ∈ R^{d×k} and V ∈ R^{k×|S|}. The matrix U is a feature representation of the base models while V expresses the weighted combination of the base models that form the local model at each location. Specifically, u_i is a column vector in U that represents the feature vector for the i-th base model, while v_j is a column vector in V that represents the weights of the base models defining the j-th local model.

Task Relation Matrix: We employ a graph Laplacian regularization to incorporate information about the spatial autocorrelation between locations. Let A be the task relation matrix, where A_{i,j} measures the spatial autocorrelation between locations i and j. The graph Laplacian regularizer can be written as follows:

\[
\Omega_r(W) = \sum_{i,j=1}^{|S|} A_{i,j}\|w_i - w_j\|_2^2 = \mathrm{Tr}\big[W(D - A)W^T\big]
\]

where D is a diagonal matrix with D_{i,i} = ∑_j A_{i,j}. Intuitively, if A_{i,j} is large, the graph Laplacian regularizer term will also be large unless w_i is similar to w_j. Thus, the graph Laplacian is simply a re-statement of Tobler's first law of geography.

Model Interpretability: Sparsity constraints can be imposed to ensure that each base model depends only on a small subset of the predictor variables and each task model is a linear combination of a few base models. To improve interpretability, the coefficients of the weighted linear combination should also be non-negative. To satisfy these requirements, the following L1 regularization penalty is added to the objective function:

\[
\Omega_s(W) = \lambda_1\|V\|_1 + \lambda_2\|U\|_1 \quad \text{s.t. } W = UV,\ V \succeq 0
\]

where λ1 and λ2 are the regularization parameters. The notation V ≽ 0 implies that all elements of V must be non-negative.

Putting everything together, we can now express our objective function for GSpartan (assuming a squared loss function) as follows:

\[
\begin{aligned}
\min_{W, U, V}\quad & \frac{1}{2}\sum_{s=1}^{|S|}\|X_s w_s - y_s\|_2^2 + \lambda_1\|V\|_1 + \lambda_2\|U\|_1 + \frac{\lambda_3}{2}\mathrm{Tr}\big(W(D - A)W^T\big) \\
\text{s.t.}\quad & V \succeq 0,\ W = UV
\end{aligned}
\]

Since W = UV, the objective function reduces to the following simplified expression:

\[
\begin{aligned}
\min_{U, V}\quad & \frac{1}{2}\sum_{s=1}^{|S|}\|X_s U v_s - y_s\|_2^2 + \lambda_1\|V\|_1 + \lambda_2\|U\|_1 + \frac{\lambda_3}{2}\mathrm{Tr}\big(UV(D - A)V^T U^T\big) \\
\text{s.t.}\quad & V \succeq 0
\end{aligned}
\tag{4.1}
\]

The preceding constrained optimization problem can be solved using a block coordinate descent approach, by alternately solving for U and V. Details for solving each step efficiently are given below.

Solve U, given V: When V is fixed, the objective function can be simplified as follows:

\[
\min_{U} \frac{1}{2}\sum_{s=1}^{|S|}\|X_s U v_s - y_s\|_2^2 + \lambda_2\|U\|_1 + \frac{\lambda_3}{2}\mathrm{Tr}\big(UV(D - A)V^T U^T\big)
\tag{4.2}
\]

This optimization problem can be efficiently solved using the proximal gradient descent method. Proximal gradient descent is commonly used to solve optimization problems containing non-differentiable components. The algorithm also has faster convergence compared to other methods such as subgradient descent. The basic idea is to minimize a corresponding upper bound function of the original objective function [84].
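Before presenting the specific updates, the proximal step that handles the non-differentiable ℓ1 penalty can be illustrated with a short NumPy sketch (ours; shapes and values are illustrative). It applies the soft-thresholding operator S_α(t) = sign(t) max(|t| − α, 0), which is the proximal operator of the ℓ1 norm, after a gradient step on the smooth part of the objective.

```python
import numpy as np

def soft_threshold(T, alpha):
    """Element-wise soft-thresholding S_alpha(t) = sign(t) * max(|t| - alpha, 0)."""
    return np.sign(T) * np.maximum(np.abs(T) - alpha, 0.0)

def prox_gradient_step(U, grad_U, step, lam):
    """One proximal gradient step on an L1-regularized objective:
    a gradient step on the smooth part, followed by the prox of lam * ||.||_1."""
    return soft_threshold(U - step * grad_U, step * lam)

# Toy usage with random matrices (shapes only for illustration).
rng = np.random.default_rng(0)
U, grad_U = rng.standard_normal((5, 3)), rng.standard_normal((5, 3))
U_new = prox_gradient_step(U, grad_U, step=0.1, lam=0.5)
```

The closed-form updates (4.4) and (4.7) below are exactly this operation, with the step size set by the Lipschitz constants derived in Theorems 2 and 3.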
Based on standard assumptions such as Lipschitz continuity of the partial gradient of the differentiable part of the objective function, we can use a tighter upper bound to approximate the original objective function. Here, we use Prox-linear [112] to solve our optimization problem, with the following update formula:

\[
U^k = \underset{U}{\arg\min}\ \big\langle U - \hat{U}^{k-1},\ \hat{g}_U^k \big\rangle + \frac{\tau_U^{k-1}}{2}\|U - \hat{U}^{k-1}\|_F^2 + \lambda_2\|U\|_1,
\tag{4.3}
\]

where

\[
\hat{g}_U^k = \sum_{s=1}^{|S|}\big(-X_s^T y_s v_s^T + X_s^T X_s U v_s v_s^T\big) + \lambda_3\, U V (D - A) V^T
\]

and

\[
\hat{U}^{k-1} = U^{k-1} + \omega_U^{k-1}\big(U^{k-1} - U^{k-2}\big)
\]

The solution for problem (4.3) is given by

\[
U^k = \mathcal{S}_{\lambda_2/\tau_U^{k-1}}\Big(\hat{U}^{k-1} - \frac{\hat{g}_U^k}{\tau_U^{k-1}}\Big)
\tag{4.4}
\]

where S_α(t) = sign(t) max(|t| − α, 0) is a component-wise soft-thresholding function.

Solve V, given U: Similarly, when U is fixed, the objective function becomes:

\[
\min_{V} \frac{1}{2}\sum_{s=1}^{|S|}\|X_s U v_s - y_s\|_2^2 + \lambda_1\|V\|_1 + \frac{\lambda_3}{2}\mathrm{Tr}\big(UV(D - A)V^T U^T\big)
\tag{4.5}
\]

The update formula for V using the proximal gradient descent approach is:

\[
V^k = \underset{V}{\arg\min}\ \big\langle V - \hat{V}^{k-1},\ \hat{g}_V^k \big\rangle + \frac{\tau_V^{k-1}}{2}\|V - \hat{V}^{k-1}\|_F^2 + \lambda_1\|V\|_1
\tag{4.6}
\]

where

\[
\hat{g}_V^k = P + \lambda_3\, U^T U V (D - A)
\]

The s-th column of the matrix P is given by p_s = −U^T X_s^T y_s + U^T X_s^T X_s U v_s, and

\[
\hat{V}^{k-1} = V^{k-1} + \omega_V^{k-1}\big(V^{k-1} - V^{k-2}\big)
\]

The solution for problem (4.6) is given by

\[
V^k = \mathcal{S}_{\lambda_1/\tau_V^{k-1}}\Big(\hat{V}^{k-1} - \frac{\hat{g}_V^k}{\tau_V^{k-1}}\Big)
\tag{4.7}
\]

A further projection step is needed to ensure that the elements of V are non-negative. Note that the learning rates τ_U^{k−1} and τ_V^{k−1} are given by the Lipschitz constants of the partial gradients ĝ_U^k and ĝ_V^k, respectively. These constants are given in Theorems 2 and 3 below. The extrapolation term ω is selected so that 0 ≤ ω^k ≤ δ_ω √(τ^{k−2}/τ^{k−1}) for δ_ω < 1.

Theorem 2. The partial gradient ĝ_U is Lipschitz continuous with the constant

\[
\tau_U = \sum_{s=1}^{|S|}\|X_s^T X_s\|\|v_s v_s^T\| + \lambda_3\|V(D - A)V^T\|
\]

Proof. For any U and U* ∈ R^{d×k},

\[
\begin{aligned}
\|\hat{g}_U - \hat{g}_{U^*}\| &= \Big\|\sum_{s=1}^{|S|}\big(X_s^T X_s U v_s v_s^T - X_s^T X_s U^* v_s v_s^T\big) + \lambda_3(U - U^*)V(D - A)V^T\Big\| \\
&\le \sum_{s=1}^{|S|}\|X_s^T X_s U v_s v_s^T - X_s^T X_s U^* v_s v_s^T\| + \lambda_3\|U - U^*\|\|V(D - A)V^T\| \\
&\le \sum_{s=1}^{|S|}\|X_s^T X_s\|\|v_s v_s^T\|\|U - U^*\| + \lambda_3\|U - U^*\|\|V(D - A)V^T\| \\
&= \|U - U^*\|\Big(\sum_{s=1}^{|S|}\|X_s^T X_s\|\|v_s v_s^T\| + \lambda_3\|V(D - A)V^T\|\Big)
\end{aligned}
\]

Theorem 3. Assuming ∥X_s∥ ≤ R, the partial gradient ĝ_V is Lipschitz continuous with the constant

\[
\tau_V = \|U\|^2 R^2 + \lambda_3\|U^T U\|\|D - A\|
\]

Proof. For any V and V* ∈ R^{k×S},

\[
\begin{aligned}
\|\hat{g}_V - \hat{g}_{V^*}\| &= \|P + \lambda_3 U^T U V(D - A) - P^* - \lambda_3 U^T U V^*(D - A)\| \\
&\le \|P - P^*\| + \lambda_3\|U^T U\|\|D - A\|\|V - V^*\| \\
&= \sum_{s=1}^{|S|}\|U^T X_s^T X_s U(v_s - v_s^*)\| + \lambda_3\|U^T U\|\|D - A\|\|V - V^*\| \\
&\le \sum_{s=1}^{|S|}\|U^T X_s^T X_s U\|\|v_s - v_s^*\| + \lambda_3\|U^T U\|\|D - A\|\|V - V^*\| \\
&\le \|U\|^2 R^2\sum_{s=1}^{|S|}\|v_s - v_s^*\| + \lambda_3\|U^T U\|\|D - A\|\|V - V^*\| \\
&= \|V - V^*\|\big(\|U\|^2 R^2 + \lambda_3\|U^T U\|\|D - A\|\big)
\end{aligned}
\]

A summary of the GSpartan framework is shown in Algorithm 2 below.

Algorithm 2: Pseudocode for the GSpartan framework
  Input: Dataset D = {(X_1, y_1), ..., (X_{|S|}, y_{|S|})}, task relation matrix A, parameters λ1, λ2, λ3
  Initialize: Randomly generate U and V and set k = 1
  Block coordinate descent:
  while not converged do
      Solve U given V: compute τ_U^k using Theorem 2; update U^k using Equation (4.4)
      Solve V given U: compute τ_V^k using Theorem 3; update V^k using Equation (4.7)
      k = k + 1
  end while
  return {U^k, V^k}

Theorem 4. Let {U^k, V^k} be the sequence generated by Algorithm 2 with 0 ≤ ω^k ≤ δ_ω √(τ^{k−2}/τ^{k−1}) for δ_ω < 1. Then the sequence {U^k, V^k} will converge.

The proof of convergence given by Theorem 4 can be found in Lemma 2.2 of [112].

4.3 Experimental Evaluation

This section presents experimental results to evaluate the effectiveness of the proposed GSpartan framework.
4.3 Experimental Evaluation

This section presents the experimental results used to evaluate the effectiveness of the proposed GSpartan framework.

4.3.1 Dataset Description

We evaluated the performance of GSpartan on climate data from 37 randomly chosen weather stations in Canada (http://climate.weather.gc.ca/). We use the monthly precipitation data from the weather stations as the response variable. The precipitation data spans a 40-year period from January 1961 to December 2000. The predictor variables for building the local models were obtained from NCEP reanalysis data (http://www.cccsn.ec.gc.ca/?page=pred-hadcm3), a coarse-scale global environmental data set that integrates observations with the output of a numerical weather prediction model. There are 26 predictor variables, including mean temperature at 2 meters, mean sea level pressure, 500 hPa geopotential height, and near-surface relative humidity (a complete list of the features and their descriptions is available at http://www.cccsn.ec.gc.ca/?page=pred-help). We deseasonalized the precipitation time series by subtracting from each monthly value the average value for that month over the entire 40-year period. We then created multiple versions of the training set for each location by varying the length of the training period from 1 to 30 years. For example, the first version uses the monthly precipitation data from 1961 for training and the remaining 39 years for testing, while the last version uses the first 30 years of monthly precipitation for training and the remaining 10 years for testing.

4.3.2 Baseline Methods

We compared the performance of GSpartan against the following baselines:

• LASSO: We applied LASSO regression to the data set at each location independently. The LASSO results serve as a baseline for single-task learning.

• MRMTL: Mean-regularized MTL (MRMTL) is an algorithm developed in [38] based on the assumption that the task models share a common set of parameters. Specifically, the objective function of MRMTL is

min_W ∑_{s=1}^{S} ∥X_s w_s − y_s∥_2^2 + ρ_1 ∥W∥_1 + ρ_2 ∑_{s=1}^{S} ∥ w_s − (1/S) ∑_{i=1}^{S} w_i ∥_2^2

We use the MRMTL implementation provided in the MALSAR software package [125]. Note that instead of using the ℓ2 norm on W as in the original work [38], we use the ℓ1 norm on W for a fair comparison.

• SLMTL: This is an MTL algorithm proposed in [24], which assumes that the tasks are related through an incoherent rank-sparsity structure. Unlike GSpartan, SLMTL does not explicitly consider the relationships among tasks (e.g., spatial autocorrelation between locations). The objective function for SLMTL is given by [24]

min_W ∑_{s=1}^{S} ∥X_s w_s − y_s∥_2^2 + γ ∥P∥_1   s.t.  W = P + Q,  ∥Q∥_* ≤ τ

where W ∈ ℜ^{d×S} and W = [w_1, ..., w_S]. We use the SLMTL implementation provided by the MALSAR software package [125] in our experiments.

In addition to the three baseline algorithms, we also investigate the following two variants of GSpartan:

• GSpartan-NTR: In this variant, we remove the graph Laplacian regularizer from the objective function given in Equation (4.1). This allows us to evaluate the importance of incorporating spatial autocorrelation into the framework:

min_{W,U,V} (1/2) ∑_{s=1}^{S} ∥X_s w_s − y_s∥_2^2 + λ_1 ∥V∥_1 + λ_2 ∥U∥_1   s.t.  V ⪰ 0,  W = UV

• GSpartan-norm: In [124], a normalized graph Laplacian regularizer was used to facilitate the transfer of information:

(1/2) ∑_{i,j=1}^{S} A_{i,j} ∥ w_i / sqrt(D_{i,i}) − w_j / sqrt(D_{j,j}) ∥^2

We compare this normalized graph Laplacian against the unnormalized one used in GSpartan.

4.3.3 Task Relationship Matrix

We use the inverse of a modified variogram measure to estimate the spatial autocorrelation between locations. The variogram is a measure developed in spatial statistics to determine the spatial dependence between a pair of locations [30]. The measure is computed from the variance of the difference in field values at two locations:

A_{i,j} = 1                       if i = j
A_{i,j} = 1 / var(y_i − y_j)      otherwise

where var(z) denotes the variance of z. Since we have time series data at each location, we compute var(y_i − y_j) using the monthly precipitation from the first year (1961).
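As a concrete illustration, the task relation matrix defined above can be computed with the following short NumPy sketch (illustrative only; the function name and input layout are assumptions, not the original implementation):

```python
import numpy as np

def task_relation_matrix(Y_first_year):
    """Inverse-variogram task relation matrix.

    Y_first_year: (S x 12) array of monthly precipitation for the first year,
    one row per station. Returns the (S x S) matrix A with A[i, j] = 1 / var(y_i - y_j)
    for i != j and A[i, i] = 1.
    """
    S = Y_first_year.shape[0]
    A = np.ones((S, S))
    for i in range(S):
        for j in range(S):
            if i != j:
                A[i, j] = 1.0 / np.var(Y_first_year[i] - Y_first_year[j])
    return A

# Example with synthetic data for 5 stations.
rng = np.random.default_rng(0)
A = task_relation_matrix(rng.normal(size=(5, 12)))
```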
4.3.4 Experimental Results

We evaluated the performance of the various methods on the test set in terms of their root mean square error (RMSE):

RMSE = sqrt( (1/n) ∑_{i=1}^{n} (y_i − ŷ_i)^2 )

where n is the number of test points.

Figure 4.1 compares the RMSE of the different methods when applied to the first version of the data set (which has 1 year of data for training and 39 years of data for testing). The horizontal axis of the plot corresponds to the indices of the 37 weather stations. The results suggest that GSpartan outperforms the other baselines for most of the stations. In fact, looking at Table 4.1, which summarizes the number of wins achieved by each method compared to the others, GSpartan outperforms the other baselines in at least 32 (86.5%) of the 37 stations. Furthermore, by comparing LASSO against the other methods, we observe that MTL is generally better than STL, especially when limited training data are available.

Figure 4.1: Comparison of GSpartan against the three baseline methods (RMSE versus station index).

To investigate the strength of GSpartan, Figure 4.2 compares its RMSE against the other variants of GSpartan. Observe that GSpartan performs no worse than its variants for most of the stations and is significantly better in at least 8 of the stations. Furthermore, the results in Table 4.1 suggest that GSpartan outperforms GSpartan-NTR in all 37 stations, which demonstrates the importance of incorporating spatial autocorrelation into the geospatio-temporal MTL framework. In addition, since GSpartan-NTR outperforms the other MTL methods in at least 27 stations, this shows the importance of using a low-rank representation for modeling the data.

Table 4.1: Win-loss table comparing the performance of the various methods when applied to the data set with limited training examples (1 year of training data and 39 years of test data). Each entry gives the number of stations in which the column method outperforms the row method.

                 GSpartan  GSpartan-norm  GSpartan-NTR  MRMTL  SLMTL  LASSO
GSpartan             -          5              0          4      5      1
GSpartan-norm       32          -             17          6     10      3
GSpartan-NTR        37         20              -          6     10      3
MRMTL               33         31             31          -     31     24
SLMTL               32         27             27          6      -      7
LASSO               36         34             34         13     34      -

The previous results were obtained using a data set with limited training examples (1 year for training and 39 years for testing). We next investigate the relative performance of GSpartan against the other methods as the training set size increases from 1 to 10 years. Specifically, we compare the percentage of stations in which the RMSE for GSpartan is lower than that of the other methods. For example, if GSpartan outperforms another method in 32 out of 37 stations, the ratio is 0.865. Figure 4.3 shows the results as the training set size increases. The horizontal axis of the plot corresponds to the index of the data set (which is equivalent to the number of training years), while the vertical axis corresponds to the ratio of stations in which GSpartan outperforms the other method. The results show that, on average, GSpartan outperforms the baseline methods for more than 75% of the stations, and the ratio is even higher when there are fewer training examples.
This confirms our hypothesis that GSpartan can effectively train local models when there are limited training examples. In addition, by comparing GSpartan with GSpartan-NTR, we can see that incorporating spatial autocorrelation enhances the performance of GSpartan irrespective of the training set size. By comparing GSpartan against GSpartan-norm, we also see that the normalized graph Laplacian regularizer actually degrades the performance of GSpartan. This is because the normalization attenuates the spatial autocorrelation among the tasks, which makes the task relationship less effective. Finally, comparing the results for GSpartan against the two baseline MTL algorithms, it appears that when the training set is small, GSpartan outperforms both MRMTL and SLMTL. However, as the training set size increases, both MRMTL and SLMTL perform better than GSpartan. One possible explanation is that, when there are enough labeled examples available at each station, incorporating the spatial autocorrelation information (which was computed using the first year of training data only) might adversely affect the local models.

Figure 4.2: Comparison of GSpartan against its variants (RMSE versus station index).

4.3.5 Sensitivity Analysis

Since there are four parameters (λ_1, λ_2, λ_3, and k) that must be tuned in GSpartan, this subsection analyzes the performance of the framework as each parameter is varied. For this experiment, we use the data set with 30 years of training and 10 years of testing. The results on the other data sets are quite similar, so we omit them due to lack of space. Figure 4.4 shows the results of our experiment. The horizontal axis corresponds to the location index, while the vertical axis represents the RMSE values. The results in the figure show that GSpartan is not sensitive to changes in λ_1 and λ_2 for all 37 locations (see Figures 4.4a and 4.4b). Furthermore, Figure 4.4c shows that smaller values of λ_3 should be preferred. Figure 4.4d shows that GSpartan is not particularly sensitive to k for many locations. However, for those locations that are sensitive to k, a small value of k tends to produce lower RMSE.

Figure 4.3: Performance comparison between GSpartan and the baseline methods as the training set size increases (ratio of stations in which GSpartan outperforms each method, per data set index).

4.4 Conclusion

This chapter presented a novel geospatio-temporal multi-task learning framework called GSpartan for multi-location prediction. GSpartan assumes that the local models share a common low-rank representation. The framework also enables domain-specific constraints, such as spatial autocorrelation, to be integrated into its formulation. Experimental results on a real-world climate data set showed that the proposed framework outperformed the other baseline algorithms, especially when there are limited training examples available at each location.
Figure 4.4: Results of the sensitivity analysis on λ_1, λ_2, λ_3, and k for GSpartan. Panels (a)-(d) show the sensitivity to λ_1, λ_2, λ_3, and k, respectively. The horizontal axis represents the index of a station and the vertical axis corresponds to the RMSE value.

CHAPTER 5

WEIGHTED INCREMENTAL SPATIO-TEMPORAL MULTI-TASK LEARNING VIA TENSOR DECOMPOSITION

(This chapter is based on the previous publication [110].)

Predictive modeling of geospatio-temporal data is an important task for many application domains, such as climatology [1, 90, 76], medicine [32], and crop sciences [16]. Such a task typically requires making robust predictions of a target variable at multiple geo-referenced locations based on their historical observation data and other predictor variables. For example, climate scientists are interested in obtaining projections of the future climate for multiple locations by downscaling the coarse-scale outputs from regional or global climate models as predictor variables. The multi-location prediction problem can be naturally cast into a multi-task learning (MTL) framework, in which the time series prediction at each location is regarded as a single learning task. Recent studies [109] have demonstrated the merits of jointly learning the models for multi-location prediction using MTL instead of learning the model at each location independently.

While there have been several previous studies on modeling the predictions at various locations by taking into account the spatial autocorrelation [109] or spatial smoothness of the predictions [91], these methods are often developed for batch learning, which hinders their applicability to large-scale spatio-temporal data. In addition, as many of the previous works have focused primarily on improving prediction accuracy, their resulting models are often too complicated to be interpreted by domain experts. Incorporating known patterns that drive the variability of the spatio-temporal data into a predictive modeling framework is also non-trivial. For example, it is well known that the climate variability at a location can be influenced by broad-scale teleconnection patterns such as El Niño (see Fig. 5.1). How to seamlessly integrate such patterns into the predictive modeling framework, and how to derive new, previously unknown patterns that capture other variability in the spatio-temporal data, are challenges that have not been sufficiently addressed in the literature [90, 76].

To overcome these challenges, this chapter presents a novel, incremental spatio-temporal learning algorithm called WISDOM (Weighted Incremental Spatio-temporal Multi-task Learning via Tensor Decomposition) for multi-location prediction. The algorithm represents the spatio-temporal data as a third-order tensor, where the dimensions (modes) of the tensor correspond to the temporal, spatial, and predictor variables of the data. By performing tensor decomposition, the latent factors that characterize the variability of the data along each of the three dimensions can be identified.
For climate data, known temporal patterns such as El Niño can be directly integrated as constraints on one of the temporal latent factors of the spatio-temporal tensor. Sparsity-inducing norms can also be added as additional constraints to avoid model overfitting and to enhance the interpretability of the model by domain experts.

Figure 5.1: The standardized monthly maximum temperature of a weather station in French Polynesia (station ID 1031), which correlates strongly with the deseasonalized El Niño Southern Oscillation Index (SOI).

Our proposed tensor decomposition approach is supervised in that the latent factors of the tensor are estimated jointly with the parameters of the prediction models in a unified learning framework. A unique aspect of our formulation is that it constructs two types of prediction models, spatial and temporal, by regressing on the spatial and temporal latent factors inferred from the data. This is a significant departure from conventional spatio-temporal prediction approaches [118, 91, 109], which typically learn a temporal model for each location from the historical observations and add spatial constraints to guide the learning algorithm. An alternative approach is to develop a spatial prediction model, such as a Gaussian Markov random field or kriging, that considers the spatial distribution of the target variable, where temporal dependencies are used to compute the covariance function of the model. In both types of approaches, the model is trained on one of the dimensions (space or time) while the other dimension is used as side information that constrains the modeling process. Instead, our formulation enables both space and time to be treated equally, as it explicitly trains prediction models from both the spatial and temporal latent factors. To make a prediction for location s at time t, we first apply the spatial model to the spatial latent features for s and the temporal model to the temporal latent features for t. We then compute their weighted average to determine the final prediction.

Another challenge is that the spatio-temporal data for many applications often grow rapidly over space and time. The prediction models have to be re-trained whenever new observation data become available, either for a new location (e.g., data from a newly deployed sensor) or as time progresses (when labeled data from the most recent time period become available to verify earlier predictions). Instead of performing the joint tensor factorization and model building steps repeatedly from scratch, which is computationally prohibitive due to time and memory constraints, it is desirable to develop a framework that can gradually update its previous latent factors and model parameters based on the newly observed data. Thus, we develop an incremental learning algorithm called WISDOM to solve the optimization problem associated with our proposed formulation.

In short, the main contributions of this chapter are as follows:

1. A supervised tensor factorization framework is presented for spatio-temporal predictive modeling. The framework is unique in that it constructs both spatial and temporal prediction models from the data and can incorporate known patterns from the domain.

2. A scalable algorithm called WISDOM is developed to effectively solve the optimization problem of the proposed framework.
The algorithm can be applied to incremental learning over space, time, or both, when new observation data become available.

3. The effectiveness of the proposed framework for multi-location time series prediction is demonstrated on a large-scale global climate data set.

5.1 Preliminaries

We begin with the notation used in this chapter. A scalar is denoted by a lowercase letter such as c, whereas a vector is denoted by a boldface lowercase letter such as x. We denote a matrix by a boldface capital letter, such as A, and a tensor by a boldface Euler script letter, such as X. The symbol ":" is used to denote sub-arrays within a matrix or tensor; e.g., A_{:,i} denotes the i-th column of matrix A.

A tensor is a multidimensional array whose order refers to its number of dimensions (or modes). Each dimension of the tensor can be referred to by its index. A fiber of a tensor is a vector obtained by fixing all the indices of the tensor except one. For example, in a 3-dimensional tensor X, X_{i,:,j} is a mode-2 fiber, obtained by fixing the mode-1 index to i and the mode-3 index to j. A slice of a tensor is a matrix obtained by fixing all but two of the indices of the tensor. For example, X_{:,:,i} is the i-th mode-3 slice of the tensor X, obtained by setting its mode-3 index to i.

Mode-n matricization is the process of reordering the elements of a tensor X ∈ ℜ^{p_1×...×p_N} into a matrix X_(n) ∈ ℜ^{p_n×q_n}, where q_n = Π_{k≠n} p_k. The mode-n matricization is obtained by arranging the mode-n fibers of the tensor so that each of them is a column of X_(n). This process is also known as mode-n unfolding.

The mode-n product of a tensor X ∈ ℜ^{p_1×...×p_N} with a matrix A ∈ ℜ^{q×p_n} is defined as

(X ×_n A)_{i_1,...,i_{n−1}, j, i_{n+1},...,i_N} = ∑_{i_n=1}^{p_n} X_{i_1,...,i_N} A_{j,i_n}

which results in a new tensor of dimensions p_1 × ... × p_{n−1} × q × p_{n+1} × ... × p_N. Furthermore, if Y = X ×_1 A^(1) ×_2 A^(2) ... ×_N A^(N) is an N-th order tensor, then Y_(n) = A^(n) X_(n) ( A^(N) ⊗ ... ⊗ A^(n+1) ⊗ A^(n−1) ⊗ ... ⊗ A^(1) )^T [61], where ⊗ denotes the Kronecker product.

The Khatri-Rao product of two matrices is equivalent to applying the Kronecker product column-wise to the matrices. For example, given matrices A ∈ ℜ^{N×K} and B ∈ ℜ^{M×K}, their Khatri-Rao product is given by

A ⊙ B = [a_1 ⊗ b_1, a_2 ⊗ b_2, ..., a_K ⊗ b_K].
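The following NumPy sketch (illustrative, not from the original text) makes these operations concrete. It implements the Khatri-Rao product and a mode-1 unfolding, and checks the identity X_(1) = A (C ⊙ B)^T for a CP-structured tensor under a column ordering in which the mode-2 index varies fastest:

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: column k equals kron(A[:, k], B[:, k])."""
    N, K = A.shape
    M, _ = B.shape
    return np.einsum('ik,jk->ijk', A, B).reshape(N * M, K)

def unfold_mode1(X):
    """Mode-1 unfolding X_(1), with the mode-2 index varying fastest along the columns."""
    I, J, K = X.shape
    return X.reshape(I, J * K, order='F')

# Check X_(1) = A (C ⊙ B)^T on a small random CP-structured tensor.
rng = np.random.default_rng(0)
I, J, K, R = 4, 3, 5, 2
A, B, C = rng.normal(size=(I, R)), rng.normal(size=(J, R)), rng.normal(size=(K, R))
X = np.einsum('ir,jr,kr->ijk', A, B, C)          # X = sum_r a_r o b_r o c_r
assert np.allclose(unfold_mode1(X), A @ khatri_rao(C, B).T)
```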
Figure 5.2: Overview of the proposed WISDOM framework.

5.2 WISDOM: An Incremental Spatio-Temporal Multi-Task Learning Framework

Let D = (X, Y) be a spatio-temporal data set, where X ∈ ℜ^{S×T×d} denotes the spatio-temporal tensor of predictor variables, Y ∈ ℜ^{S×T} denotes the response variable at all locations, S is the number of locations, T is the length of the time series, and d is the number of predictor variables. For incremental learning, the data are assumed to be periodically augmented with a new data chunk (X_new, Y_new), where X_new ∈ ℜ^{S×1×d} and Y_new ∈ ℜ^{S×1} if the data are for a new time period, or X_new ∈ ℜ^{1×T×d} and Y_new ∈ ℜ^{1×T} if the data are from a new location.

5.2.1 Spatio-temporal Predictive Models

A standard approach to the spatio-temporal prediction problem is to train a temporal prediction model for each location, f_t(x_{st}; w_s), where w_s ∈ ℜ^{d×1} is the model parameter for location s. Note that the temporal models can be trained independently or jointly, using a multi-task learning approach such as [109], to predict values of the response variable at a future time t. In the latter case, the spatial information is typically used as a constraint [91, 109] to guide the training of the temporal prediction model. Alternatively, one could train a spatial prediction model for each time t, f_s(x_{st}; v_t), where v_t ∈ ℜ^{d×1} is the model parameter at time t, and apply the model to predict the values of the response variable at a previously unobserved location s.

The framework proposed in this study is novel in that it simultaneously learns the temporal and spatial prediction models using the latent factors derived from the spatio-temporal tensor, as shown in Fig. 5.2. Unlike previous approaches, it builds separate models from the spatial and temporal latent factors and combines the outputs of both models to obtain the final prediction. Assuming a linear model, the framework predicts the value for location s at time t as a weighted linear combination of its spatial and temporal models, i.e.,

ŷ_{s,t} = x_{s,t}^T ( ∑_{k=1}^{K} A_{s,k} w_k + ∑_{k=1}^{K} B_{t,k} v_k )    (5.1)

where A_{s,k} denotes the weight of the k-th spatial latent feature for location s, B_{t,k} denotes the weight of the k-th temporal latent feature for time t, and w_k and v_k are the parameters of the corresponding spatial and temporal prediction models for the k-th latent feature. The model parameters can be represented in matrix form as W = [w_1^T; w_2^T; ...; w_K^T] ∈ ℜ^{K×d} and V = [v_1^T; v_2^T; ...; v_K^T] ∈ ℜ^{K×d}, and are estimated by optimizing the following joint objective function:

min_{W,V} ∑_{s=1}^{S} ∑_{t=1}^{T} L(x_{s,t}, W, V, y_{s,t}) + Ω_m(W, V)    (5.2)

where L(x_{s,t}, W, V, y_{s,t}) is the loss function and Ω_m(W, V) is the regularizer for the model parameters. In this work, we consider a squared loss function, L(x_{s,t}, W, V, y_{s,t}) = (ŷ_{s,t} − y_{s,t})^2, and define Ω_m(W, V) = ∥W∥_1 + ∥V∥_1 to ensure sparsity of the models.
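As a concrete illustration of Eqs. (5.1) and (5.2), the following NumPy sketch (illustrative; the array names are ours, not the original code) evaluates the WISDOM prediction for every location and time point given the latent factors and model parameters:

```python
import numpy as np

def wisdom_predict(X, A, B, W, V):
    """Prediction of Eq. (5.1) for every (location, time) pair.

    X: (S, T, d) tensor of predictor variables; A: (S, K) spatial latent factors;
    B: (T, K) temporal latent factors; W, V: (K, d) spatial/temporal model parameters.
    Returns an (S, T) matrix of predictions.
    """
    coef_s = A @ W                                   # (S, d) per-location coefficients
    coef_t = B @ V                                   # (T, d) per-time coefficients
    spatial_part = np.einsum('std,sd->st', X, coef_s)
    temporal_part = np.einsum('std,td->st', X, coef_t)
    return spatial_part + temporal_part

# Toy usage with random inputs.
rng = np.random.default_rng(0)
S, T, d, K = 6, 8, 4, 3
Y_hat = wisdom_predict(rng.normal(size=(S, T, d)), rng.normal(size=(S, K)),
                       rng.normal(size=(T, K)), rng.normal(size=(K, d)),
                       rng.normal(size=(K, d)))
```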
5.2.2 Supervised Tensor Decomposition

The formulation in Eq. (5.2) requires knowledge of the latent factors of the spatio-temporal tensor. These latent factors are represented by the loading matrices A and B, which can be derived from the data using tensor decomposition techniques. There are two standard approaches for decomposing a tensor, namely Tucker and CANDECOMP/PARAFAC (CP) decompositions [61]. Tucker decomposition factorizes a tensor into a core tensor and a product of its loading matrices along each mode. Though it provides a more general representation, its latent factors are harder to interpret, as the number of latent factors along each mode does not have to be identical. In contrast, CP decomposition factorizes a tensor into a sum of rank-1 tensors, i.e., X = ⟦A, B, C⟧ = ∑_{k=1}^{K} a_k ◦ b_k ◦ c_k, where ◦ denotes the outer product between two vectors and a_k, b_k, and c_k correspond to the vectors associated with the k-th latent factor. The vectors a_k, b_k, and c_k also denote the k-th columns of the loading matrices A, B, and C, respectively. In this work, we apply CP decomposition to the spatio-temporal tensor.

Let X ∈ ℜ^{S×T×d} be a spatio-temporal tensor whose (s,t)-th mode-3 fiber corresponds to the feature vector for location s at time t, i.e., X_{s,t,:} = x_{s,t}. To estimate the latent factors, the objective function for CP decomposition can be written as follows:

min_{A,B,C} (1/2) ∥X − ⟦A, B, C⟧∥_F^2 + Ω_d(A, B, C)

where ∥X∥_F = sqrt( ∑_{ijk} X_{ijk}^2 ) is the Frobenius norm of the tensor X and Ω_d(A, B, C) is a regularization term for the loading matrices A, B, and C. To ensure model sparsity, the following regularization penalty can be used:

Ω_d(A, B, C) = ∥A∥_1 + ∥B∥_1 + ∥C∥_1

Putting everything together, the objective function for our spatio-temporal MTL framework can be stated as follows:

min_{W,V,A,B,C} F(W, V, A, B, C)
= (1/2) ∑_{s=1}^{S} ∑_{t=1}^{T} ( x_{s,t}^T (W^T A_s + V^T B_t) − y_{s,t} )^2
+ (λ/2) ∥X − ⟦A, B, C⟧∥_F^2 + β ∥[W, V, A, B, C]∥_1    (5.3)

where we use ∥[W, V, A, B, C]∥_1 to denote the ℓ1 norm penalties on W, V, A, B, and C, respectively.

5.2.3 WISDOM Algorithm

As the size of many spatio-temporal data sets can be very large, efficient algorithms are needed to learn the spatio-temporal predictive models of the data. To optimize the objective function in Eq. (5.3), we develop an incremental learning algorithm called WISDOM that learns the model parameters as well as the latent factors of the tensor. While most incremental learning algorithms consider only updating the parameters over time, for spatio-temporal data there is also a need to update the parameters over space. For example, new observation data may become available from sensors deployed at a new location, or when a scientific study is expanded to include a new study region. To support this, we present two implementations of WISDOM: one for incremental learning over space and the other for incremental learning over time. A hybrid approach that combines both strategies can easily be developed when new observations are generated over both space and time.

For incremental learning, our goal is to adapt the existing models without rebuilding them from scratch as new data become available. To ensure that the model parameters and latent factors do not vary significantly from their previous values, a smoothness criterion can be added to the objective function. We reformulate the optimization problem for incremental learning as follows:

min_{W,V,A,B,C} Q(W, V, A, B, C, W̃, Ṽ, Ã, B̃, C̃) = F(W, V, A, B, C) + Γ(W, V, A, B, C, W̃, Ṽ, Ã, B̃, C̃)

where W̃, Ṽ, Ã, B̃, and C̃ are the previous values before the update, F(W, V, A, B, C) is given by Eq. (5.3), and

Γ(W, V, A, B, C, W̃, Ṽ, Ã, B̃, C̃) = ∥W − W̃∥_F^2 + ∥V − Ṽ∥_F^2 + ∥A − Ã∥_F^2 + ∥B − B̃∥_F^2 + ∥C − C̃∥_F^2.    (5.4)

5.2.3.1 Incremental Learning over Space

First, we discuss WISDOM's approach for incremental learning over space, i.e., when data from a new location become available. Let T be the current time and S be the current number of locations. We assume that the new location has historical observation data from time t_0 to T. If the location has only one observation, then t_0 = T. We further assume that the spatial latent features of the other locations are unaffected by the addition of the new location, i.e., ∀s: Ã_s = A_s. However, the latent features for the other modes (B and C), as well as the parameters of the prediction models (W and V), can be affected by the addition of the new data, {x_{S+1,t_0}, x_{S+1,t_0+1}, ..., x_{S+1,T}}. For brevity, we denote the feature vectors for the new location as X_{S+1}, which is a tensor of size 1 × (T − t_0 + 1) × d.
The objective function for incremental learning over space can be expressed as follows:

min_{W,V,A_{S+1},B,C} Q(W, V, A_{S+1}, B, C, W̃, Ṽ, B̃, C̃)
= (1/2) ∑_{t=t_0}^{T} [ x_{S+1,t}^T (W^T A_{S+1} + V^T B_t) − y_{S+1,t} ]^2
+ (λ_1/2) ∥X_{S+1} − ⟦A_{S+1}^T, B, C⟧∥_F^2
+ (η_1/2) [ ∥W − W̃∥_F^2 + ∥V − Ṽ∥_F^2 + ∥B − B̃∥_F^2 + ∥C − C̃∥_F^2 ]
+ β_1 ∥[W, V, A_{S+1}, B, C]∥_1    (5.5)

Note that A_{S+1} is a column vector that represents the spatial latent features of the new location, and x_{S+1,t} denotes the feature vector of the new location at time t. The smoothness parameter η_1 determines the extent to which the previous model parameters should be retained.

We employ an alternating minimization strategy to solve the optimization problem. Since not all terms in the objective function are differentiable, we use the proximal gradient descent method [84] to solve each subproblem. The method is applicable to non-differentiable objective functions that can be decomposed into a smooth part and a non-smooth part. Let f(x) = g(x) + h(x), where g(x) is a differentiable function and h(x) is a non-differentiable function. For example, the loss function involving the Frobenius norm terms in our objective function is differentiable, whereas the sparsity-inducing ℓ1-norm terms are non-differentiable. The proximal gradient descent method updates its parameter as follows:

x^(k) = prox_{t_k, h}( x^(k−1) − t_k ∇g(x^(k−1)) )

where x^(k) is the parameter estimate at step k, prox_{t_k, h} is the proximal operator for the non-differentiable function h, ∇g(x^(k−1)) is the gradient of the smooth function g with respect to x^(k−1), and t_k is the step size of the gradient descent update. The proximal operator for the ℓ1-norm function is the soft-thresholding operator: prox_{λ,h}(v) = (v − λ)_+ − (−v − λ)_+. The parameters are updated iteratively by calculating the gradient of the smooth part of the objective function and then applying the soft-thresholding operator (the proximal mapping for the ℓ1 norm) to determine the next value. The step size can be found using a line search algorithm. In the following, we provide the gradient of the objective function for each alternating minimization step.

I. Solving for A_{S+1} by fixing W, V, B, C: The objective function can be simplified to retain only the terms involving A_{S+1}:

min_{A_{S+1}} (1/2) ∑_{t=t_0}^{T} ( x_{S+1,t}^T (W^T A_{S+1} + V^T B_t) − y_{S+1,t} )^2 + (λ_1/2) ∥X_{S+1(1)} − A_{S+1}^T (C ⊙ B)^T∥_F^2 + β_1 ∥A_{S+1}∥_1

where X_{S+1(1)} is the mode-1 unfolding of the tensor X_{S+1}. The gradient of the smooth part of the objective function with respect to A_{S+1} is

∑_{t=t_0}^{T} ( x_{S+1,t}^T (W^T A_{S+1} + V^T B_t) − y_{S+1,t} ) W x_{S+1,t} − λ_1 [ ( X_{S+1(1)} − A_{S+1}^T (C ⊙ B)^T ) (C ⊙ B) ]^T

II. Solving for B by fixing A_{S+1}, W, V, C: The terms in the objective function involving the matrix B are

min_B (1/2) ∑_{t=t_0}^{T} ( x_{S+1,t}^T (W^T A_{S+1} + V^T B_t) − y_{S+1,t} )^2 + (λ_1/2) ∥X_{S+1(2)} − B (C ⊙ A_{S+1}^T)^T∥_F^2 + (η_1/2) ∥B − B̃∥_F^2 + β_1 ∥B∥_1    (5.6)

The gradient of the smooth part of the objective function with respect to B_t is

( x_{S+1,t}^T (W^T A_{S+1} + V^T B_t) − y_{S+1,t} ) V x_{S+1,t} − λ_1 [ ( x_{S+1,t}^T − B_t^T (C ⊙ A_{S+1}^T)^T ) (C ⊙ A_{S+1}^T) ]^T + η_1 (B_t − B̃_t)

III. Solving for C by fixing A_{S+1}, B, W, V: Similarly, we can simplify the objective function to include only the terms involving the matrix C:

min_C (λ_1/2) ∥X_{S+1(3)} − C (B ⊙ A_{S+1}^T)^T∥_F^2 + (η_1/2) ∥C − C̃∥_F^2 + β_1 ∥C∥_1    (5.7)

The gradient of the smooth part of the objective function with respect to C is

−λ_1 ( X_{S+1(3)} − C (B ⊙ A_{S+1}^T)^T ) (B ⊙ A_{S+1}^T) + η_1 (C − C̃)
IV. Solving for W by fixing V, A_{S+1}, B, C: The terms in the objective function involving the model parameter W are

min_W (1/2) ∑_{t=t_0}^{T} ( x_{S+1,t}^T (W^T A_{S+1} + V^T B_t) − y_{S+1,t} )^2 + (η_1/2) ∥W − W̃∥_F^2 + β_1 ∥W∥_1    (5.8)

The gradient of the smooth part of the objective function with respect to W is

∑_{t=t_0}^{T} x_{S+1,t} ( x_{S+1,t}^T (W^T A_{S+1} + V^T B_t) − y_{S+1,t} ) A_{S+1}^T + η_1 (W^T − W̃^T)

V. Solving for V by fixing W, A_{S+1}, B, C: Finally, the objective function can be simplified to the terms involving V:

min_V (1/2) ∑_{t=t_0}^{T} ( x_{S+1,t}^T (W^T A_{S+1} + V^T B_t) − y_{S+1,t} )^2 + (η_1/2) ∥V − Ṽ∥_F^2 + β_1 ∥V∥_1    (5.9)

The gradient of the smooth part of the objective function with respect to V is

∑_{t=t_0}^{T} x_{S+1,t} ( x_{S+1,t}^T (W^T A_{S+1} + V^T B_t) − y_{S+1,t} ) B_t^T + η_1 (V^T − Ṽ^T)

5.2.3.2 Incremental Learning over Time

Next, we examine WISDOM's strategy for incremental learning over time. Let S be the number of locations and T be the current time. As in other online learning schemes, we assume the availability of the feature vectors of the predictor variables for all S locations at time T + 1. This information is used to determine the temporal latent factor B_{T+1} for the new time period. Similar to the strategy used for incremental learning over space, we assume the new data for time T + 1 do not affect the previous temporal latent factors B_1, B_2, ..., B_T. Our strategy for incremental learning over time consists of two steps: first, we learn the temporal latent factor B_{T+1} based on the values of the predictor variables for the new time period; next, the model parameters and the latent factors for the other modes are updated once the target variable for the new time period has been observed at all locations.

Step 1: Updating the temporal latent factor B_{T+1} before observing the target variable. The objective function for updating the temporal latent factor is

min_{B_{T+1}} Q(B_{T+1}) = (λ_2/2) ∥X_{T+1} − ⟦A, B_{T+1}^T, C⟧∥_F^2    (5.10)

Note that the loading matrices A and C correspond to the values obtained from the previous update.

Step 2: Updating the model parameters and the other latent factors after observing the target variable. Once the target variable at time T + 1 has been observed, we derive the update formulas by optimizing the following objective function:

min_{W,V,A,B_{T+1},C} Q(W, V, A, B_{T+1}, C, W̃, Ṽ, Ã, C̃)
= (1/2) ∑_{s=1}^{S} ( x_{s,T+1}^T (W^T A_s + V^T B_{T+1}) − y_{s,T+1} )^2
+ (λ_2/2) ∥X_{T+1} − ⟦A, B_{T+1}^T, C⟧∥_F^2
+ (η_2/2) ( ∥W − W̃∥_F^2 + ∥V − Ṽ∥_F^2 + ∥A − Ã∥_F^2 + ∥C − C̃∥_F^2 )
+ β_1 ∥[W, V, A, B_{T+1}, C]∥_1    (5.11)

Solving Eq. (5.11) to obtain the update formulas for W, V, A, B_{T+1}, and C follows the same approach as described for incremental learning over space. We omit the details due to lack of space.

5.2.3.3 Incremental Learning over Space and Time

The approaches described in the previous subsections can be combined to create a hybrid approach for incremental learning over both space and time. Specifically, the WISDOM algorithm is initially applied to a subset of the locations at a given starting time. As time progresses, it applies the model update approach for incremental learning over time to the newly acquired observation data. Similarly, when data from a new location become available, it invokes the update strategy for incremental learning over space.
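As an illustration of Step 1 of the incremental update over time (Eq. (5.10)), the following NumPy sketch (an illustrative, unregularized reading; the function and array names are ours) solves for the new temporal factor B_{T+1} by least squares, given the current loading matrices A and C and the predictor values for the new time period:

```python
import numpy as np

def khatri_rao(A, C):
    # Column-wise Kronecker product (see Section 5.1).
    return np.einsum('sk,dk->sdk', A, C).reshape(A.shape[0] * C.shape[0], -1)

def update_temporal_factor(X_new, A, C):
    """Step 1 of incremental learning over time (Eq. 5.10), without regularization.

    X_new: (S, d) predictor values for all locations at the new time step;
    A: (S, K) spatial factors; C: (d, K) feature factors.
    Returns B_{T+1}, the (K,) temporal factor for the new time step.
    """
    M = khatri_rao(A, C)                             # (S*d, K) design matrix
    b, *_ = np.linalg.lstsq(M, X_new.reshape(-1), rcond=None)
    return b

# Toy usage.
rng = np.random.default_rng(0)
S, d, K = 10, 13, 5
A, C = rng.normal(size=(S, K)), rng.normal(size=(d, K))
b_new = update_temporal_factor(rng.normal(size=(S, d)), A, C)
```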
5.3 Experimental Evaluation

We applied WISDOM to a global-scale climate data set and compared its performance against several baseline algorithms.

5.3.1 Dataset Description

The climate data were obtained from two sources. First, we downloaded the monthly climate observation data from the Global Surface Summary of the Day (GSOD) website (https://data.noaa.gov/dataset/global-surface-summary-of-the-day-gsod). The monthly values of total precipitation (prcp), maximum (tmax), minimum (tmin), and average (tmean) temperature are used to define the target/response variables for our prediction tasks. We created four data sets, one for each response variable, to evaluate the performance of WISDOM. Though the four response variables could be jointly modeled in a multi-task learning framework, this is beyond the scope of the current work. The second source is a coarse-scale gridded climate data set from the NCEP reanalysis (http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.derived.html). We use these data to define the predictor variables for our climate prediction task. Although there are hundreds of variables available in the NCEP reanalysis data, we selected 13 of them as our predictor variables with the help of our domain expert. A detailed description of the selected features is given in Table 5.1.

Table 5.1: List of predictor variables selected from the NCEP reanalysis data.

Variable        Description
cprat.sfc       Monthly mean convective precipitation rate at the surface
dlwrf.sfc       Monthly mean longwave radiation flux at the surface
dswrf.sfc       Monthly mean solar radiation flux at the surface
prate.sfc       Monthly mean precipitation rate
tmax.2m         Monthly mean maximum temperature at 2 m
tmin.2m         Monthly mean minimum temperature at 2 m
lftx.sfc        Monthly mean surface lifted index
omega.sig995    Monthly mean omega at sigma level 0.995
pr_wrt.eatm     Monthly mean precipitable water content
rhum.sig995     Monthly mean relative humidity at sigma level 0.995
slp             Sea level pressure
thick_1000500   Monthly mean thickness for 1000-500 mb
thick_850500    Monthly mean thickness for 850-500 mb

GSOD provides climate data from more than 30,000 monitoring sites worldwide, spanning the period from 1942 to the present. We use the monthly data from January 1985 to November 2015 (a total of 371 months) in our experiments. During preprocessing, we remove the sites that have missing values as well as sites that are co-located in the same grid cell as another previously chosen site, i.e., we restrict each grid cell to contain only one GSOD site. This reduces the number of sites in our data set to 1,110. The variables are deseasonalized by subtracting from each monthly value the average value of the given month and then standardized to obtain their Z-scores. The dimensionality of the resulting spatio-temporal tensor after preprocessing is 1110 × 371 × 13.

5.3.2 Experimental Setup

We use the data from the first 20 years (1985-2004) for training and the rest (2005-2015) for testing the efficacy of the prediction models. The last 10 years of the training data are used as a validation set to determine the model parameters. We set the number of latent factors in WISDOM to K = 5 and randomly select 100 sites as our initial starting locations. For incremental learning, a new observation may come from a new location or from a new time point. In the former case, we perform incremental learning over space for the new location and update the prediction models for the other locations. In the latter case, we perform incremental learning over time to update the models for all the locations. The mean absolute error (MAE) over all locations is used as our evaluation metric, and we also report the standard deviation of the MAE over time.
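The deseasonalization and standardization step described in Section 5.3.1 can be summarized by the following short NumPy sketch (illustrative only; it assumes a single monthly series that starts in January):

```python
import numpy as np

def deseasonalize_and_standardize(series):
    """Remove the monthly climatology from a monthly series, then z-score it.

    series: (T,) array of monthly values, assumed to start in January.
    """
    months = np.arange(series.size) % 12
    anomalies = series.astype(float).copy()
    for m in range(12):
        anomalies[months == m] -= series[months == m].mean()
    return (anomalies - anomalies.mean()) / anomalies.std()

# Example: 371 months of synthetic data (January 1985 to November 2015).
rng = np.random.default_rng(0)
z = deseasonalize_and_standardize(rng.normal(size=371))
```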
We compared WISDOM against the following two baseline algorithms. The order in which new observations are introduced over space and time is the same for all the algorithms:

1. STL (Single Task Learning): Each location has its own local (linear) model that is incrementally updated using a gradient descent approach whenever new observation data become available for that location. When a new location is introduced, its parameters are randomly initialized and updated only when new observation data for the location become available.

2. ALTO: This is an adaptation of the method in [118], which assumes that the model parameters for multiple response variables form a tensor. To extend ALTO to our problem setting, we make the following changes. First, the tensor is reduced to a matrix W, since each data set has only one response variable. Second, we replace tensor decomposition with singular value decomposition and apply it to the noise-perturbed weight matrices to obtain the updated model parameters. We have also extended ALTO to perform incremental learning over space: when data from a new location become available, we compute the model for the new location using linear regression and add the estimated parameters as a new row in W. The updated W is then projected onto its low-rank matrix representation.

In addition to the two baseline methods, we also consider the following two variations of WISDOM:

3. WISDOM-S: This baseline considers only the spatial component of the framework. Specifically, we remove all terms related to V in Eqs. (5.5) and (5.11).

4. WISDOM-T: This baseline considers only the temporal component of the framework. We remove all terms related to W in Eqs. (5.5) and (5.11).

5.3.3 Comparison against Baseline Methods

We first present the results comparing WISDOM against the two baseline algorithms, STL and ALTO. Table 5.2 shows the average MAE for the various methods, whereas Table 5.4 shows the number of locations in which each method outperforms another. The results suggest that WISDOM outperforms STL and ALTO in more than 75% of the locations for all four data sets evaluated. The percentage is even higher (above 90%) when compared against ALTO on the three temperature data sets. By outperforming STL, this suggests the importance of incorporating spatial autocorrelation into the learning framework. WISDOM also outperforms ALTO, which is another online tensor learning approach for spatio-temporal data. There are two possible reasons for this. First, ALTO performs the following simple update to its weight matrix each time new observation data become available (we use an incremental update of the weight matrix rather than the exact update, since the latter requires the entire data set to be available in memory): W^(k) = (1 − α) W^(k−1) + α X Z† [118]. This single-step update may not be sufficient to learn the right weights of the prediction model. In contrast, WISDOM learns the optimal weights that minimize an incrementally updated objective function. Second, ALTO performs a low-rank decomposition on a perturbed weight matrix, whereas WISDOM decomposes the data tensor itself. The results suggest that the latter strategy is more effective, as the observation data are potentially noisy.

Next, we compare WISDOM against its variants in Tables 5.3 and 5.4. Observe that WISDOM and WISDOM-S outperform WISDOM-T on all four data sets, which suggests the importance of incorporating a predictive model built from the spatial latent factors. Furthermore, WISDOM performs better than WISDOM-S, especially for precipitation prediction.
This makes sense, as precipitation has less spatial autocorrelation than temperature, which is why temporal autocorrelation plays a more significant role in improving its prediction.

Table 5.2: MAE and its standard deviation for WISDOM and the other baseline methods.

         WISDOM             STL                ALTO
tmax     0.4751 ± 0.0739    0.5580 ± 0.0599    0.6824 ± 0.1448
tmin     0.5016 ± 0.0777    0.5670 ± 0.0627    0.6598 ± 0.1660
tmean    0.4438 ± 0.0798    0.5233 ± 0.0606    0.6570 ± 0.1691
prcp     0.5700 ± 0.1157    0.6930 ± 0.0897    0.6087 ± 0.1197

Table 5.3: MAE and its standard deviation for WISDOM and its variants.

         WISDOM             WISDOM-S           WISDOM-T
tmax     0.4751 ± 0.0739    0.4685 ± 0.1031    0.4832 ± 0.1233
tmin     0.5016 ± 0.0777    0.5030 ± 0.0952    0.5285 ± 0.1033
tmean    0.4438 ± 0.0798    0.4380 ± 0.1065    0.4607 ± 0.1036
prcp     0.5700 ± 0.1157    0.5824 ± 0.1206    0.6075 ± 0.1221

Table 5.4: Number of locations (out of 1,110) in which one method outperforms another, shown separately for each response variable. Each entry gives the number of locations in which the column method outperforms the row method.

tmax:
            WISDOM  WISDOM-S  WISDOM-T  ALTO   STL
WISDOM         -      489       268      79    206
WISDOM-S     621        -       209      74    142
WISDOM-T     842      901         -     134    262
ALTO        1031     1036       976       -    947
STL          904      968       848     163      -

tmin:
            WISDOM  WISDOM-S  WISDOM-T  ALTO   STL
WISDOM         -      468       287     103    227
WISDOM-S     642        -       241      94    209
WISDOM-T     823      869         -     154    318
ALTO        1007     1016       956       -    918
STL          883      901       792     192      -

tmean:
            WISDOM  WISDOM-S  WISDOM-T  ALTO   STL
WISDOM         -      489       343      77    212
WISDOM-S     621        -       331      66    159
WISDOM-T     767      779         -     107    241
ALTO        1033     1044      1003       -    979
STL          898      951       869     131      -

prcp:
            WISDOM  WISDOM-S  WISDOM-T  ALTO   STL
WISDOM         -      354       164     272    120
WISDOM-S     756        -       321     425    113
WISDOM-T     946      789         -     495    200
ALTO         838      685       651       -    258
STL          990      997       910     852      -

5.3.4 Convergence Analysis of WISDOM

To demonstrate its convergence, Fig. 5.3 shows the average MAE of WISDOM across all locations over time. A location is included in the average MAE calculation only after the data for that location become available. Although there are some instabilities in its performance during the first 8 years, WISDOM begins to converge after the first 10 years, which is our initial training period, on all four data sets.

Figure 5.3: Changes in MAE over time for WISDOM (average MAE per year for tmax, tmin, tmean, and prcp).

5.3.5 Analysis of Spatial Latent Factors

Next, we investigate the spatial latent factors derived by WISDOM. Each spatial latent factor is a vector whose elements represent the membership of each location in the given latent factor. Figures 5.4 and 5.5 show the spatial distribution of the latent factors for prcp. The figures show that the latent factors have varying spatial distributions, which suggests that they capture different aspects of the spatial variability in the data. For example, the first latent factor is dominant in Europe and the north of China, whereas the second latent factor emphasizes the US and the east of China.

WISDOM utilizes the spatial latent factors to perform incremental learning over space. To further demonstrate the benefit of incremental learning over space, we compare the average annual MAE for the first 100 randomly chosen locations when the model is updated with and without incremental learning over space. Specifically, in the latter case, no new locations are added to the data set as time progresses. Indeed, as shown in Fig. 5.6, adding data from new locations helps to improve the MAE of the first 100 randomly chosen locations.
5.3.6 Analysis of Temporal Latent Factors

Each temporal latent factor derived by WISDOM can be represented as a time series. To understand their significance, we correlate the temporal latent factors against the known climate indices given in Table 5.5. Fig. 5.7 shows the resulting correlations for the tmean and prcp data sets. Though the temporal latent factors of the two data sets are different, we found that some of the factors correlate highly (over 0.6) with the existing climate indices. This result suggests that the temporal latent factors may capture some of the previously known climate phenomena represented by climate indices such as AOI and NAO. For each temporal latent factor and climate index, we also calculate the percentage of locations whose temperature or precipitation has a correlation above 0.3. The results in Fig. 5.8 suggest that (1) not all climate indices have a significant number of locations highly correlated with them, and (2) some latent factors have significant correlation with a relatively large number of locations, comparable to the known indices. More importantly, as some of the latent factors do not correlate highly with the known indices, this suggests that our framework can potentially discover new indices that capture the climate variability for many locations.

Figure 5.4: Spatial distribution of the spatial factors learned by WISDOM for prcp; panels (a)-(c) show Factors 1-3. (Figure is best viewed in color.)

Figure 5.5: Spatial distribution of the spatial factors learned by WISDOM for prcp, continued; panels (a)-(b) show Factors 4-5. (Figure is best viewed in color.)

Table 5.5: List of the climate indices used to correlate with the temporal factors learned from WISDOM.

Climate Index   Description
AOI             Arctic Oscillation Index
NAO             North Atlantic Oscillation
WPI             West Pacific Pattern
QBO             Quasi-Biennial Oscillation
PDO             Pacific Decadal Oscillation
SOI             Southern Oscillation Index

Surprisingly, none of the temporal latent factors were found to correlate highly with SOI, which is a surrogate time series for El Niño. One of the strengths of WISDOM is its ability to incorporate known domain patterns as additional constraints in its formulation. In order to incorporate known
patterns such as SOI, we simply fix one of the columns of the temporal latent factor matrix B to be the time series of SOI and learn the remaining spatial and temporal latent factors using WISDOM. We denote this approach as WISDOM-KP. The MAE results comparing WISDOM against WISDOM-KP are shown in Table 5.6. The results suggest that WISDOM-KP achieves results comparable to WISDOM in terms of average MAE. In addition, we also compared the number of locations where WISDOM-KP outperforms WISDOM. For tmax, tmean, and prcp, the MAE of WISDOM-KP is lower than that of WISDOM in at least 49% of the locations, whereas for tmin the percentage is around 43%. This is not surprising, as we do not expect SOI to accurately capture the climate variability at all locations. Instead, there are locations that are expected to benefit from using SOI as one of the temporal latent factors. To identify such locations, we plot a map of the locations in which WISDOM-KP is better than WISDOM, and vice versa, for predicting tmean in Fig. 5.9. The results suggest that, by incorporating SOI, improved predictive performance is observed in areas such as Australia, part of South America, the northeast of North America, and locations around the Arctic Ocean. Some of these locations are consistent with the results of previous studies [90].

Figure 5.6: Average annual MAE comparison between WISDOM with incremental learning over space and WISDOM without incremental learning over space for the 100 initially chosen locations; panels (a)-(d) show tmax, tmin, tmean, and prcp.

Figure 5.7: Correlations between the climate indices and the temporal factors learned from WISDOM for (a) tmean and (b) prcp.

Figure 5.8: Percentage of locations whose response variables have a correlation above 0.3 with the temporal factors and climate indices learned from WISDOM for (a) tmean and (b) prcp.

Table 5.6: Comparison of the MAE of WISDOM and WISDOM-KP.

         WISDOM    WISDOM-KP
tmax     0.4751    0.4678
tmin     0.5016    0.5037
tmean    0.4438    0.4343
prcp     0.5700    0.5725

Figure 5.9: Stations where WISDOM-KP outperforms WISDOM by more than 0.05 in MAE for tmean, and vice versa.
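The WISDOM-KP idea of clamping one temporal latent factor to a known index can be sketched as follows (illustrative NumPy only, not the original implementation; the update loop uses a placeholder gradient purely to show where the clamping happens):

```python
import numpy as np

def clamp_known_pattern(B, known_index, col=0):
    """Fix one column of the temporal factor matrix B to a known climate index."""
    B = B.copy()
    B[:, col] = known_index
    return B

# Schematic usage: keep column 0 equal to the (standardized) SOI series while the
# remaining temporal factors are updated.
rng = np.random.default_rng(0)
T, K = 371, 5
soi_series = rng.normal(size=T)              # stand-in for the standardized SOI index
B = clamp_known_pattern(rng.normal(size=(T, K)), soi_series)
for _ in range(10):
    grad_B = rng.normal(size=(T, K))         # placeholder for the smooth-loss gradient
    grad_B[:, 0] = 0.0                       # never update the clamped column
    B = clamp_known_pattern(B - 0.01 * grad_B, soi_series)
```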
5.4 Conclusion

This chapter presented a spatio-temporal learning framework for multi-location prediction based on supervised tensor decomposition. Unlike conventional methods, the proposed framework constructs both spatial and temporal models of the data and aggregates their outputs to obtain the final prediction. A novel incremental learning algorithm called WISDOM is developed to simultaneously extract the latent factors of the spatio-temporal data and learn the spatial and temporal prediction models. We show that WISDOM outperforms several baseline algorithms and can easily accommodate known patterns from the spatio-temporal domain.

CHAPTER 6

MULTI-SCALE SPATIO-TEMPORAL MULTI-TASK LEARNING VIA INCREMENTAL MULTI-TENSOR DECOMPOSITION

The spatio-temporal data generated in many scientific domains are often available at multiple spatial scales. For example, the climate projections generated by global and regional climate models are available at a wide range of spatial resolutions, from tens to several hundred kilometers (see Table 6.1). Such multi-scale data can potentially be used to train models for predicting and understanding our future climate. However, since the data at different scales may be correlated with one another, concatenating their variables into a single feature vector for building predictive models may not be an effective strategy. Similarly, considering the data from a single resolution only, e.g., from the finest scale, may not be sufficient, as the models that generate the data at different resolutions may have varying degrees of predictive skill at different locations. How to effectively integrate the data from multiple spatial scales into a unified formulation for building predictive models is therefore a challenge that must be addressed [35, 82, 75].

Note that the problem investigated in this study is different from some of the previous multi-scale modeling tasks investigated by the machine learning, computer vision, and signal processing communities. For example, some of the previous multi-scale modeling techniques extract a multi-resolution representation of a single dataset (e.g., using the wavelet transform [80]), whereas our focus here is on modeling multiple datasets with inherently different spatial scales. Another example is multi-resolution image data, which are often created by aggregating the pixels within a grid to obtain a low-resolution image representation. Unlike such representations, the geospatio-temporal data investigated in this study are obtained from multiple sources (e.g., climate models from various research groups), each based on a different model parameterization and with a potentially different spatial scale. Furthermore, a multi-resolution image dataset is static, whereas the geospatio-temporal dataset is dynamic in nature.

Table 6.1: Spatial resolutions of various climate datasets.

Climate Dataset                                   Spatial Resolution
NCEP North American Regional Reanalysis           32 km
Canadian Regional Climate Model (CRCM)            45 km
Weather Research & Forecasting model (WRFG)       50 km
NCEP Reanalysis                                   250 km
HadCM3 Global Climate Model                       300 km

In addition to the challenges posed by the multi-scale data, the complexity of the problem is further exacerbated by the need to build predictive models for multiple locations instead of a single location. For example, in climate modeling, robust predictions are often needed for multiple locations within a study region for some future time period.
Although a prediction model can be trained for each location separately, the resulting predicted distribution may not be spatially consistent, i.e., it may not preserve the inherent spatial autocorrelation of the data. Recent works [109, 110, 48, 46, 47, 118] have demonstrated the effectiveness of using multi-task learning to produce predictions that are more consistent across space in the study region. These approaches require making assumptions about the relationships between the modeling tasks at different locations and utilizing such relationships to jointly train the multiple models. However, none of these previous studies have dealt with the effect of multi-scale data. Furthermore, with the exception of [110], most of these works are primarily batch learning algorithms, which makes them harder to scale up to very large spatio-temporal datasets. How to efficiently handle large-scale data without degrading model accuracy is another important challenge that must be addressed in the predictive modeling of geospatio-temporal data.

In addition to accuracy, interpretability of the model is often a desirable feature. For example, climate scientists are interested in understanding the driving factors that influence the variability observed in the climate data. Some of the broad-scale driving factors, such as El Niño, are well known to the scientists. Thus, having a geospatio-temporal predictive model that can reproduce the known patterns of the domain as well as produce accurate predictions would help boost our confidence in the validity of the model. While there have been some previous works based on tensor [110] and matrix [109] factorization for deriving such patterns from geospatio-temporal data, it is unclear whether such patterns would be preserved when applied to multi-scale data.

To address these challenges, this chapter presents a multi-task learning framework named MUSCAT (MUlti-SCAle Spatio-Temporal Learning via Incremental Multi-Tensor Decomposition) for building predictive models from multi-scale spatio-temporal data. The multiple tasks in the proposed framework are not only defined over different scales, but are also defined over space and time. In the proposed framework, the multi-scale data are represented using a set of 3-dimensional tensors, which are jointly decomposed into their latent factors via a multi-tensor decomposition approach. The task relationships among the predictor variables at different scales are captured by assuming that the latent spatial and temporal factors are invariant across the scales. The framework also employs a supervised learning approach to incorporate the latent factors into a loss function for predictive modeling. In short, MUSCAT provides a unified formulation for simultaneously decomposing multiple tensors and fitting the latent factors to the response variable of interest. In addition, to handle large-scale spatio-temporal data, an incremental learning algorithm is proposed, enabling the model to be iteratively constructed over space and time and thus avoiding the need to build the model from scratch each time new data become available.

The contributions of this work are summarized as follows:

1. A geospatio-temporal multi-task learning framework called MUSCAT is proposed to build prediction models for multi-scale data at multiple locations.

2. The proposed framework allows the latent spatial and temporal patterns shared by the multiple scales to be derived via a multi-tensor decomposition approach.
An incremental learning algorithm over space and over time is proposed to scale the MUSCAT framework to large datasets. 4. Experimental evaluation performed using climate data from the United States Historical Climatology Network (USHCN) shows the superiority of the framework compared to several baseline multi-scale learning methods.

6.1 Background

In this section, we first introduce the notation used in this chapter, followed by our problem definition. We then describe the tensor factorization approach employed by our proposed framework.

6.1.1 Notations

To facilitate the discussion, we adopt the following notation throughout the chapter. Scalars are denoted by lowercase italic letters such as v, while vectors are denoted by lowercase boldface letters such as x. Note that x is assumed to be a column vector unless stated otherwise. Its corresponding row vector is denoted as x^T, where the superscript T denotes the transpose operation. Matrices are denoted by boldface capital letters such as A, while tensors are represented by boldface calligraphic letters such as X. Furthermore, a_k denotes the vector corresponding to the k-th column of the matrix A. A summary of the notation used is shown in Table 6.2.

Table 6.2: Glossary of symbols used in the chapter.

Symbol              Notation
v                   A scalar
x                   A column vector
A                   A matrix
A_{:,k} or a_k      The k-th column of matrix A
X                   A tensor
X_{i,:,j}           The mode-2 fiber of X, obtained by fixing the mode-1 index to i and the mode-3 index to j
X_{:,:,i}           The i-th mode-3 slice of the tensor X, obtained by setting its mode-3 index to i

We use $x^T y$ to denote the dot product between two vectors and $x \circ y$ to denote their outer product. Furthermore, let $\|A\|_F = \sqrt{\sum_{ij} a_{ij}^2}$ be the Frobenius norm of matrix A and $\|\mathcal{X}\|_F = \sqrt{\sum_{ijk} x_{ijk}^2}$ be the Frobenius norm of the tensor X. For other tensor operations, the notations used are given in Table 6.3.

Table 6.3: Glossary of tensor/matrix operations.

mode-n matricization:  reordering the elements of a tensor $\mathcal{X} \in \Re^{p_1 \times \cdots \times p_N}$ into a matrix $X_{(n)} \in \Re^{p_n \times q_n}$, where $q_n = \prod_{k \neq n} p_k$ and each mode-n fiber of $\mathcal{X}$ is a column of $X_{(n)}$
mode-n product:        $(\mathcal{X} \times_n A)_{i_1, \ldots, i_{n-1}, j, i_{n+1}, \ldots, i_N} = \sum_{i_n=1}^{p_n} \mathcal{X}_{i_1, \ldots, i_N} A_{j, i_n}$
Khatri-Rao product:    $A \odot B = [a_1 \otimes b_1, a_2 \otimes b_2, \ldots, a_K \otimes b_K]$, where $\otimes$ is the Kronecker product

6.1.2 Tensor Factorization

We employ the following CANDECOMP/PARAFAC (CP) decomposition [61] to factorize a 3rd-order tensor X into its corresponding latent factors A, B, and C:

\[
\mathcal{X} = \llbracket A, B, C \rrbracket = \sum_{k=1}^{K} a_k \circ b_k \circ c_k. \tag{6.1}
\]

Specifically, the tensor is factorized into the sum of K rank-1 tensors, each of which is given by the outer product $a_k \circ b_k \circ c_k$. The vectors $a_k$, $b_k$, and $c_k$ denote the k-th columns of the matrices A, B, and C, respectively.

6.1.3 Spatio-Temporal Prediction

Let $\mathcal{D} = (\mathcal{X}^{(1)}, \ldots, \mathcal{X}^{(L)}, Y)$ be a multi-scale spatio-temporal data set, where $\mathcal{X}^{(l)} \in \Re^{S \times T \times d_l}$ denotes the spatio-temporal tensor of predictor variables from the l-th resolution, $Y \in \Re^{S \times T}$ denotes the response variable for all the locations, S is the number of locations, T is the length of the time series, and $d_l$ is the number of predictor variables at scale l. The goal of spatio-temporal predictive modeling is to learn a target function $f(\{x^{(l)}\}; \Pi)$ that maps each set of multi-scale predictors $\{x^{(l)}\}$ to its corresponding response variable $y \in \Re$ in such a way that the difference between the predicted and true values is minimized. Note that $\Pi$ denotes the set of parameters associated with the target function.
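To make the CP factorization and the Khatri-Rao notation above concrete, the following minimal NumPy sketch (the dimensions, rank, and variable names are assumptions for illustration, not the dissertation's code) reconstructs a third-order tensor from its factor matrices and checks the mode-1 matricization identity $X_{(1)} = A (C \odot B)^T$.

import numpy as np

S, T, d, K = 4, 5, 6, 3                          # tensor dimensions and CP rank (illustrative values)
rng = np.random.default_rng(0)
A, B, C = rng.normal(size=(S, K)), rng.normal(size=(T, K)), rng.normal(size=(d, K))

# CP reconstruction of Eq. (6.1): X[s, t, j] = sum_k A[s, k] * B[t, k] * C[j, k]
X = np.einsum('sk,tk,jk->stj', A, B, C)

def khatri_rao(U, V):
    # Column-wise Kronecker product: [u_1 kron v_1, ..., u_K kron v_K]
    return np.einsum('ik,jk->ijk', U, V).reshape(U.shape[0] * V.shape[0], U.shape[1])

# Mode-1 matricization identity X_(1) = A (C kr B)^T (mode-3 index varies slowest)
X1 = X.transpose(0, 2, 1).reshape(S, d * T)
print(np.allclose(X1, A @ khatri_rao(C, B).T))   # True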
6.2 MUSCAT

This section presents our proposed framework for multi-scale multi-task learning. We begin with a discussion of multi-tensor decomposition.

6.2.1 Multi-tensor Decomposition

Inspired by previous work on discovering climate indices using SVD [90], tensor decomposition, as a generalization of SVD to higher dimensions, is employed to learn the spatial and temporal latent factors of the data. Since the data are available at multiple scales, the tensor decomposition can be performed at each scale:

\[
\min_{A, B, C^{(l)}} \; \frac{1}{2(S \times T \times d_l)} \left\| \mathcal{X}^{(l)} - \llbracket A, B, C^{(l)} \rrbracket \right\|_F^2 + \Omega_d(A, B, C^{(l)})
\]

where A is the spatial latent factor of size S x K and B is the temporal latent factor of size T x K. We assume that the underlying climate variability of the data from different scales should be consistent; thus, the latent factors A and B learned from the data should be shared between scales. The factor $\frac{1}{S \times T \times d_l}$ normalizes the norm so that it does not depend on the size of the tensor. $\Omega_d(A, B, C)$ is a regularization term for the matrices A, B, and C, and we enforce sparsity of the latent factors using the $\ell_1$-norm: $\Omega_d(A, B, C) = \|A\|_1 + \|B\|_1 + \|C\|_1$.

6.2.2 Supervised Multi-tensor Decomposition

The latent factors derived from the multi-tensor decomposition may capture the spatial and temporal climate variabilities of the data and could therefore be beneficial to spatio-temporal prediction modeling. In this section, we discuss how to incorporate the latent factors into the supervised learning framework. We assume the prediction function is a weighted average of the predictions from the multiple scales:

\[
\hat{y}_{s,t} = \sum_{l=1}^{L} \alpha_l \, \hat{y}^{(l)}_{s,t}
\]

where $\alpha_l$ is the weight for the prediction from the l-th scale, and the weights are assumed to lie on the probability simplex ($\alpha_l > 0$ and $\sum_{l=1}^{L} \alpha_l = 1$). For each scale l, we assume the prediction function is composed of a temporal model for multiple locations and a spatial model for multiple time points, defined as follows:

\[
\hat{y}^{(l)}_{s,t} = x^{(l)T}_{s,t} \Big( \sum_{k=1}^{K} A_{s,k} w^{(l)}_k + \sum_{k=1}^{K} B_{t,k} v^{(l)}_k \Big)
\]

where $w^{(l)}_k$ and $v^{(l)}_k$ denote the parameters of the spatial and temporal prediction models for the k-th latent factor at scale l, while $A_{s,k}$ and $B_{t,k}$ are the weights relating location s and time point t to the k-th latent factor. Note that $x^{(l)T}_{s,t} \sum_{k=1}^{K} A_{s,k} w^{(l)}_k$ is a temporal model prediction for location s, and $x^{(l)T}_{s,t} \sum_{k=1}^{K} B_{t,k} v^{(l)}_k$ is a spatial model prediction for time t.

The relatedness between the multi-scale data is defined by the consistency of the latent spatial and temporal factors between scales. In other words, the L prediction components share the same spatial factor A and temporal factor B. Putting the multi-tensor decomposition and the supervised learning together with a squared loss function, the objective function can be written as:

\[
\min_{\{\alpha_l, W^{(l)}, V^{(l)}, C^{(l)}\}, A, B} F\big(\{\alpha_l, W^{(l)}, V^{(l)}, C^{(l)}\}, A, B\big)
= \frac{1}{2} \sum_{s=1}^{S} \sum_{t=1}^{T} \Big[ \sum_{l=1}^{L} \alpha_l x^{(l)T}_{s,t} \Big( \sum_{k=1}^{K} A_{s,k} w^{(l)}_k + \sum_{k=1}^{K} B_{t,k} v^{(l)}_k \Big) - y_{s,t} \Big]^2
\]
\[
+ \frac{\lambda}{2} \sum_{l=1}^{L} \frac{1}{S \times T \times d_l} \left\| \mathcal{X}^{(l)} - \llbracket A, B, C^{(l)} \rrbracket \right\|_F^2
+ \beta \, \big\| [\{W^{(l)}, V^{(l)}, C^{(l)}\}, A, B] \big\|_1
\tag{6.2}
\]
\[
\text{s.t. } \alpha_l > 0 \text{ for all } l \text{ and } \sum_{l=1}^{L} \alpha_l = 1
\]

where we use $\|[\{W^{(l)}, V^{(l)}, C^{(l)}\}, A, B]\|_1$ to denote the $\ell_1$-norm regularization terms for $W^{(l)}$, $V^{(l)}$, $C^{(l)}$, A, and B, respectively.

6.2.3 MUSCAT

Spatio-temporal data usually grow over space and time, as new spatial objects become available or new measurements are observed at later timestamps. It is expensive to relearn the models over the whole dataset whenever new data become available.
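To illustrate the combined multi-scale prediction function above, a minimal sketch is given below; it assumes the latent factors and model parameters have already been learned, and all variable names and dimensions are hypothetical.

import numpy as np

def predict(x_scales, A_s, B_t, W, V, alpha):
    # yhat_{s,t} = sum_l alpha_l * x^{(l)T} (W^{(l)T} A_s + V^{(l)T} B_t)
    # x_scales[l]: d_l-dimensional predictor vector at scale l for location s and time t
    # A_s, B_t:    K-dimensional spatial and temporal latent factors (rows of A and B)
    # W[l], V[l]:  K x d_l parameter matrices of the temporal and spatial models at scale l
    # alpha:       length-L scale weights on the probability simplex
    yhat = 0.0
    for l, x in enumerate(x_scales):
        yhat += alpha[l] * (x @ (W[l].T @ A_s + V[l].T @ B_t))
    return yhat

# Toy usage with two scales (e.g., a fine and a coarse set of predictors)
K, d1, d2 = 5, 9, 7
rng = np.random.default_rng(1)
x_scales = [rng.normal(size=d1), rng.normal(size=d2)]
A_s, B_t = rng.normal(size=K), rng.normal(size=K)
W = [rng.normal(size=(K, d1)), rng.normal(size=(K, d2))]
V = [rng.normal(size=(K, d1)), rng.normal(size=(K, d2))]
alpha = np.array([0.3, 0.7])
print(predict(x_scales, A_s, B_t, W, V, alpha))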
In this section, we propose an efficient algorithm called MUSCAT to optimize the objective function in Eq. (6.2), which incrementally learns the model parameters and latent factors over space or time. MUSCAT performs updates only on the newly available data, and the models are constrained to be close to the previously learned models so that the newly learned model still retains information from the old data. Moreover, MUSCAT does not require the old data to perform learning on the new data; hence, no old data need to be stored in memory. The optimization problem for incremental learning of the objective function in Eq. (6.2) can be formulated as follows:

\[
\min_{\{\alpha_l, W^{(l)}, V^{(l)}, C^{(l)}\}, A, B} Q\big(\{\alpha_l, \tilde{\alpha}_l, W^{(l)}, \tilde{W}^{(l)}, V^{(l)}, \tilde{V}^{(l)}, C^{(l)}, \tilde{C}^{(l)}\}, A, \tilde{A}, B, \tilde{B}\big)
= F\big(\{\alpha_l, W^{(l)}, V^{(l)}, C^{(l)}\}, A, B\big) + \Gamma\big(\{\alpha_l, \tilde{\alpha}_l, W^{(l)}, \tilde{W}^{(l)}, V^{(l)}, \tilde{V}^{(l)}, C^{(l)}, \tilde{C}^{(l)}\}, A, \tilde{A}, B, \tilde{B}\big)
\]

where $\{\tilde{\alpha}_l\}$, $\{\tilde{W}^{(l)}\}$, $\{\tilde{V}^{(l)}\}$, $\{\tilde{C}^{(l)}\}$, $\tilde{A}$, and $\tilde{B}$ denote the previous model parameters before the incremental update, $F(\{\alpha_l, W^{(l)}, V^{(l)}, C^{(l)}\}, A, B)$ is given by Eq. (6.2), and

\[
\Gamma\big(\{\alpha_l, \tilde{\alpha}_l, W^{(l)}, \tilde{W}^{(l)}, V^{(l)}, \tilde{V}^{(l)}, C^{(l)}, \tilde{C}^{(l)}\}, A, \tilde{A}, B, \tilde{B}\big)
= \sum_{l=1}^{L} \Big( (\alpha_l - \tilde{\alpha}_l)^2 + \|W^{(l)} - \tilde{W}^{(l)}\|_F^2 + \|V^{(l)} - \tilde{V}^{(l)}\|_F^2 + \|C^{(l)} - \tilde{C}^{(l)}\|_F^2 \Big) + \|A - \tilde{A}\|_F^2 + \|B - \tilde{B}\|_F^2,
\tag{6.3}
\]

which controls how much information is retained from the previous model.

6.2.3.1 Incremental Learning over Space

In this section, we describe the MUSCAT framework for incremental learning over space. Let T be the current time and S be the current number of locations. We assume that the new location has historical observation data from time $t_0$ to T; if the location has only one observation, then $t_0 = T$. We further assume that the spatial latent factors for the other locations are unaffected by the addition of the new location, i.e., $\forall s: \tilde{A}_s = A_s$. However, the latent factors for the other modes (B and $C^{(l)}$), as well as the parameters of the prediction models ($\alpha_l$, $W^{(l)}$, and $V^{(l)}$), can be affected by the addition of the new data $\{x_{S+1,t_0}, x_{S+1,t_0+1}, \cdots, x_{S+1,T}\}$. For brevity, we denote the feature vectors for the new location as $\mathcal{X}_{S+1}$, a tensor of size $1 \times (T - t_0 + 1) \times d$ composed of $\mathcal{X}^{(l)}_{S+1} \in \Re^{1 \times (T - t_0 + 1) \times d_l}$ for $l = 1, \ldots, L$. The objective function for incremental learning over space can be expressed as follows:

\[
\min_{\{\alpha_l, W^{(l)}, V^{(l)}, C^{(l)}\}, A_{S+1}, B} Q\big(\{\alpha_l, \tilde{\alpha}_l, W^{(l)}, \tilde{W}^{(l)}, V^{(l)}, \tilde{V}^{(l)}, C^{(l)}, \tilde{C}^{(l)}\}, A_{S+1}, B, \tilde{B}\big)
\tag{6.4}
\]
\[
= \frac{1}{2} \sum_{t=t_0}^{T} \Big[ \sum_{l=1}^{L} \alpha_l x^{(l)T}_{S+1,t} \big( W^{(l)T} A_{S+1} + V^{(l)T} B_t \big) - y_{S+1,t} \Big]^2
+ \frac{\lambda_1}{2} \sum_{l=1}^{L} \frac{1}{T \times d_l} \left\| \mathcal{X}^{(l)}_{S+1} - \llbracket A^T_{S+1}, B, C^{(l)} \rrbracket \right\|_F^2
\]
\[
+ \frac{\eta_1}{2} \Big[ \sum_{l=1}^{L} \Big( (\alpha_l - \tilde{\alpha}_l)^2 + \|W^{(l)} - \tilde{W}^{(l)}\|_F^2 + \|V^{(l)} - \tilde{V}^{(l)}\|_F^2 + \|C^{(l)} - \tilde{C}^{(l)}\|_F^2 \Big) + \|B - \tilde{B}\|_F^2 \Big]
+ \beta_1 \big\| [\{W^{(l)}, V^{(l)}, C^{(l)}\}, A_{S+1}, B] \big\|_1
\]
\[
\text{s.t. } \alpha_l > 0 \text{ for all } l \text{ and } \sum_{l=1}^{L} \alpha_l = 1
\]

Note that $A_{S+1}$ is a column vector that represents the spatial latent factors for the new location and $x_{S+1,t}$ denotes the feature vector of the location at time t. The smoothness parameter $\eta_1$ determines the extent to which the previous model parameters should be retained. An alternating minimization strategy is used to solve the optimization problem, and each subproblem is solved by the proximal gradient descent method [84]. The parameters are updated iteratively by calculating the gradient of the smooth part of the objective function and then applying the soft-thresholding operator to determine their next values. The step size can be found using a line search algorithm.
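As a concrete illustration of this update scheme, the following minimal sketch shows a single proximal gradient step (a gradient step on the smooth part followed by soft-thresholding), applied here to a generic l1-regularized least-squares subproblem; the function and variable names, the toy problem, and the fixed step size are assumptions for illustration rather than the dissertation's implementation.

import numpy as np

def soft_threshold(Z, tau):
    # Proximal operator of the l1 norm: shrink every entry toward zero by tau.
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def proximal_gradient_step(theta, grad_smooth, step, beta):
    # One update: gradient step on the smooth part, then soft-thresholding
    # (the proximal mapping of step * beta * ||.||_1).
    return soft_threshold(theta - step * grad_smooth(theta), step * beta)

# Toy usage: an l1-regularized least-squares subproblem min_x 0.5*||M x - b||^2 + beta*||x||_1
rng = np.random.default_rng(0)
M, b = rng.normal(size=(20, 8)), rng.normal(size=20)
grad = lambda x: M.T @ (M @ x - b)                  # gradient of the smooth part
step = 1.0 / np.linalg.norm(M.T @ M, 2)             # 1/L step size (a line search could be used instead)
x, beta = np.zeros(8), 0.1
for _ in range(200):
    x = proximal_gradient_step(x, grad, step, beta)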
In the following, we provide the gradient of the objective function for each alternating minimization step.

I. Solving for $A_{S+1}$ by fixing $\{\alpha_l, W^{(l)}, V^{(l)}, C^{(l)}\}$ and B: The objective function with respect to $A_{S+1}$ can be written as follows:

\[
\min_{A_{S+1}} \frac{1}{2} \sum_{t=t_0}^{T} \Big[ \sum_{l=1}^{L} \alpha_l x^{(l)T}_{S+1,t} \big( W^{(l)T} A_{S+1} + V^{(l)T} B_t \big) - y_{S+1,t} \Big]^2
+ \frac{\lambda_1}{2} \sum_{l=1}^{L} \frac{1}{T \times d_l} \left\| X^{(l)}_{S+1(1)} - A^T_{S+1} (C^{(l)} \odot B)^T \right\|_F^2
+ \beta_1 \|A_{S+1}\|_1
\tag{6.5}
\]

where $X^{(l)}_{S+1(1)}$ is the mode-1 unfolding of the tensor $\mathcal{X}^{(l)}_{S+1}$. The gradient of the smooth part of the objective function w.r.t. $A_{S+1}$ is

\[
\sum_{t=t_0}^{T} \Big[ \sum_{l=1}^{L} \alpha_l x^{(l)T}_{S+1,t} \big( W^{(l)T} A_{S+1} + V^{(l)T} B_t \big) - y_{S+1,t} \Big] \sum_{l=1}^{L} \alpha_l W^{(l)} x^{(l)}_{S+1,t}
- \sum_{l=1}^{L} \frac{\lambda_1}{T \times d_l} \Big( \big[ X^{(l)}_{S+1(1)} - A^T_{S+1} (C^{(l)} \odot B)^T \big] (C^{(l)} \odot B) \Big)^T
\]

II. Solving for B by fixing $A_{S+1}$ and $\{\alpha_l, W^{(l)}, V^{(l)}, C^{(l)}\}$: The terms in the objective function involving the matrix B are

\[
\min_{B} \frac{1}{2} \sum_{t=t_0}^{T} \Big[ \sum_{l=1}^{L} \alpha_l x^{(l)T}_{S+1,t} \big( W^{(l)T} A_{S+1} + V^{(l)T} B_t \big) - y_{S+1,t} \Big]^2
+ \frac{\lambda_1}{2} \sum_{l=1}^{L} \frac{1}{T \times d_l} \left\| X^{(l)}_{S+1(2)} - B (C^{(l)} \odot A^T_{S+1})^T \right\|_F^2
+ \frac{\eta_1}{2} \|B - \tilde{B}\|_F^2 + \beta_1 \|B\|_1
\tag{6.6}
\]

The gradient of the smooth part of the objective function w.r.t. $B_t$ is given by

\[
\Big[ \sum_{l=1}^{L} \alpha_l x^{(l)T}_{S+1,t} \big( W^{(l)T} A_{S+1} + V^{(l)T} B_t \big) - y_{S+1,t} \Big] \sum_{l=1}^{L} \alpha_l V^{(l)} x^{(l)}_{S+1,t}
- \sum_{l=1}^{L} \frac{\lambda_1}{T \times d_l} \Big[ \big( x^{(l)T}_{S+1,t} - B_t^T (C^{(l)} \odot A^T_{S+1})^T \big) (C^{(l)} \odot A^T_{S+1}) \Big]^T
+ \eta_1 (B_t - \tilde{B}_t)
\]

III. Solving for $\{C^{(l)}\}$ by fixing $A_{S+1}$, B, and $\{\alpha_l, W^{(l)}, V^{(l)}\}$: We solve $C^{(l)}$ for $l = 1, \ldots, L$ individually. The objective function can be simplified to include only the terms involving the matrix $C^{(l)}$:

\[
\min_{C^{(l)}} \frac{\lambda_1}{2 T \times d_l} \left\| X^{(l)}_{S+1(3)} - C^{(l)} (B \odot A^T_{S+1})^T \right\|_F^2
+ \frac{\eta_1}{2} \|C^{(l)} - \tilde{C}^{(l)}\|_F^2 + \beta_1 \|C^{(l)}\|_1
\tag{6.7}
\]

The gradient of the smooth part of the objective function w.r.t. $C^{(l)}$ is given by

\[
- \frac{\lambda_1}{T \times d_l} \Big[ X^{(l)}_{S+1(3)} - C^{(l)} (B \odot A^T_{S+1})^T \Big] (B \odot A^T_{S+1}) + \eta_1 (C^{(l)} - \tilde{C}^{(l)})
\]

IV. Solving for $\{W^{(l)}\}$ by fixing $A_{S+1}$, B, and $\{\alpha_l, V^{(l)}, C^{(l)}\}$: The terms in the objective function involving the model parameter $W^{(l)}$ are

\[
\min_{W^{(l)}} \frac{1}{2} \sum_{t=t_0}^{T} \Big[ \sum_{l=1}^{L} \alpha_l x^{(l)T}_{S+1,t} \big( W^{(l)T} A_{S+1} + V^{(l)T} B_t \big) - y_{S+1,t} \Big]^2
+ \frac{\eta_1}{2} \|W^{(l)} - \tilde{W}^{(l)}\|_F^2 + \beta_1 \|W^{(l)}\|_1
\tag{6.8}
\]

The gradient of the smooth part of the objective function w.r.t. $W^{(l)}$ is given by

\[
\sum_{t=t_0}^{T} \alpha_l x^{(l)}_{S+1,t} \Big[ \sum_{i=1}^{L} \alpha_i x^{(i)T}_{S+1,t} \big( W^{(i)T} A_{S+1} + V^{(i)T} B_t \big) - y_{S+1,t} \Big] A^T_{S+1}
+ \eta_1 (W^{(l)} - \tilde{W}^{(l)})
\]

V. Solving for $\{V^{(l)}\}$ by fixing $\{\alpha_l, W^{(l)}, C^{(l)}\}$, $A_{S+1}$, and B: The objective function can be simplified to the terms involving $V^{(l)}$ as follows:

\[
\min_{V^{(l)}} \frac{1}{2} \sum_{t=t_0}^{T} \Big[ \sum_{l=1}^{L} \alpha_l x^{(l)T}_{S+1,t} \big( W^{(l)T} A_{S+1} + V^{(l)T} B_t \big) - y_{S+1,t} \Big]^2
+ \frac{\eta_1}{2} \|V^{(l)} - \tilde{V}^{(l)}\|_F^2 + \beta_1 \|V^{(l)}\|_1
\tag{6.9}
\]

The gradient of the smooth part of the objective function w.r.t. $V^{(l)}$ is given by

\[
\sum_{t=t_0}^{T} \alpha_l x^{(l)}_{S+1,t} \Big[ \sum_{i=1}^{L} \alpha_i x^{(i)T}_{S+1,t} \big( W^{(i)T} A_{S+1} + V^{(i)T} B_t \big) - y_{S+1,t} \Big] B_t^T
+ \eta_1 (V^{(l)} - \tilde{V}^{(l)})
\]

VI. Solving for $\alpha_l$ by fixing $\{W^{(l)}, V^{(l)}, C^{(l)}\}$, $A_{S+1}$, and B: Finally, the parameters for reweighting the models learned at different scales can be solved from the following objective function:

\[
\min_{\alpha_l} \frac{1}{2} \sum_{t=t_0}^{T} \Big[ \sum_{l=1}^{L} \alpha_l x^{(l)T}_{S+1,t} \big( W^{(l)T} A_{S+1} + V^{(l)T} B_t \big) - y_{S+1,t} \Big]^2
+ \frac{\eta_1}{2} (\alpha_l - \tilde{\alpha}_l)^2
\quad \text{s.t. } \alpha_l > 0 \text{ for all } l \text{ and } \sum_{l=1}^{L} \alpha_l = 1
\]

The gradient of the objective function w.r.t. $\alpha_l$ is given by

\[
\sum_{t=t_0}^{T} \Big[ \sum_{i=1}^{L} \alpha_i x^{(i)T}_{S+1,t} \big( W^{(i)T} A_{S+1} + V^{(i)T} B_t \big) - y_{S+1,t} \Big] x^{(l)T}_{S+1,t} \big( W^{(l)T} A_{S+1} + V^{(l)T} B_t \big)
+ \eta_1 (\alpha_l - \tilde{\alpha}_l)
\]

The probability simplex constraint is handled by performing a Euclidean projection onto the probability simplex in the proximal mapping step of the proximal gradient descent algorithm [102].
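The simplex projection referenced above has a simple sort-and-threshold form [102]; a minimal sketch is given below, with an illustrative function name, applied to an unconstrained update of the scale weights.

import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto {w : w_i >= 0, sum_i w_i = 1},
    # following the sort-and-threshold construction in [102].
    u = np.sort(v)[::-1]                                   # sort in descending order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)                   # shift that makes the result sum to one
    return np.maximum(v + theta, 0.0)

# Usage: project the unconstrained proximal update of the scale weights back onto the simplex.
alpha = project_simplex(np.array([0.9, -0.2, 0.6]))
print(alpha, alpha.sum())                                  # non-negative entries summing to 1.0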
6.2.3.2 Incremental Learning over Time

Similar to incremental learning over space, we also develop incremental learning over time for MUSCAT, for the case when new observations become available at new timestamps. Let S be the number of locations and T be the current time, and assume that the feature vectors of predictor variables are available for all S stations at time T + 1. This information is used to calculate the temporal latent factor $B_{T+1}$ for the new time period. Similar to the strategy for incremental learning over space, we assume that the new data for time T + 1 do not affect the previous temporal latent factors $B_1, B_2, \cdots, B_T$. Different from incremental learning over space, there are two steps for incremental learning over time. First, we learn the temporal latent factor $B_{T+1}$ based on the values of the predictor variables for the new time period before the target variable is observed. Note that the predictions for the next timestamp are made based on the parameters at the current stage. Next, the model parameters and the latent factors for the other modes are updated once the target variable for the new time period is observed for all the locations.

Step 1: Updating the temporal latent factor $B_{T+1}$ before observing the target variable. The objective function for updating the temporal latent factor is given below:

\[
\min_{B_{T+1}} Q(B_{T+1}) = \frac{\lambda_2}{2} \sum_{l=1}^{L} \frac{1}{S \times d_l} \left\| \mathcal{X}^{(l)}_{T+1} - \llbracket A, B^T_{T+1}, C^{(l)} \rrbracket \right\|_F^2
\]

Note that the loading matrices A and $\{C^{(l)}\}$ correspond to the values obtained from the previous update. The predictions for the target variable are made using $B_{T+1}$ and the other model parameters from the previous update.

Step 2: Updating the model parameters and latent factors after observing the target variable. After obtaining the new observations of the target variable at time T + 1, we can derive the update formulas by optimizing the following objective function:

\[
\min_{\{\alpha_l, W^{(l)}, V^{(l)}, C^{(l)}\}, A, B_{T+1}} Q\big(\{\alpha_l, \tilde{\alpha}_l, W^{(l)}, \tilde{W}^{(l)}, V^{(l)}, \tilde{V}^{(l)}, C^{(l)}, \tilde{C}^{(l)}\}, A, \tilde{A}, B_{T+1}\big)
\tag{6.10}
\]
\[
= \frac{1}{2} \sum_{s=1}^{S} \Big[ \sum_{l=1}^{L} \alpha_l x^{(l)T}_{s,T+1} \big( W^{(l)T} A_s + V^{(l)T} B_{T+1} \big) - y_{s,T+1} \Big]^2
+ \frac{\lambda_2}{2} \sum_{l=1}^{L} \frac{1}{S \times d_l} \left\| \mathcal{X}^{(l)}_{T+1} - \llbracket A, B^T_{T+1}, C^{(l)} \rrbracket \right\|_F^2
\]
\[
+ \frac{\eta_2}{2} \Big[ \sum_{l=1}^{L} \Big( (\alpha_l - \tilde{\alpha}_l)^2 + \|W^{(l)} - \tilde{W}^{(l)}\|_F^2 + \|V^{(l)} - \tilde{V}^{(l)}\|_F^2 + \|C^{(l)} - \tilde{C}^{(l)}\|_F^2 \Big) + \|A - \tilde{A}\|_F^2 \Big]
+ \beta_2 \big\| [\{W^{(l)}, V^{(l)}, C^{(l)}\}, A, B_{T+1}] \big\|_1
\]
\[
\text{s.t. } \alpha_l > 0 \text{ for all } l \text{ and } \sum_{l=1}^{L} \alpha_l = 1
\]

The optimization approach for solving Eq. (6.10) to obtain the update formulas for $\{\alpha_l, W^{(l)}, V^{(l)}, C^{(l)}\}$, A, and $B_{T+1}$ is similar to the approach for incremental learning over space.

6.3 Experimental Evaluation

This section describes the experiments performed to evaluate the performance of MUSCAT using a multi-scale climate dataset.

6.3.1 Dataset Description

We use a multi-scale climate dataset that has three spatial resolutions. The data at the finest scale correspond to monthly observations obtained from various weather stations. The dataset was obtained from the United States Historical Climatology Network (USHCN, http://cdiac.ornl.gov/epubs/ndp/ushcn/ushcn.html). Four variables from this dataset, namely the maximum (tmax), minimum (tmin), and mean (tmean) temperature as well as precipitation (prcp),
are selected as the response variables for our multi-scale modeling task. The prediction for each response variable is modeled independently (even though the four variables could be modeled jointly using a multi-task learning framework, this is currently beyond the scope of this work).

For the predictor variables, we use two other climate datasets, NARR and NCEP reanalysis. NARR (https://www.esrl.noaa.gov/psd/data/gridded/data.narr.html) is the North American Regional Reanalysis dataset and has a spatial resolution of 0.3° (32 km). Nine predictor variables are selected from this data source with the help of our domain expert. The variables are shown in Table 6.4.

Table 6.4: List of variables from NARR data as predictors.

Variable    Description
acpcp       Monthly mean convective precipitation accumulation at surface
air.2m      Monthly mean air temperature at 2 m
dlwrf       Monthly mean downward longwave radiation flux at surface
dswrf       Monthly mean downward shortwave radiation flux at surface
lftx4       Monthly mean (4-layer) lifted index
prate       Monthly mean precipitation rate at surface
prmsl       Monthly mean pressure at mean sea level
pr_wtr      Monthly mean precipitable water for entire atmosphere
rhum        Monthly mean relative humidity at 2 m

NCEP reanalysis (http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.derived.html) is a gridded climate dataset with a coarser spatial resolution of 2.5°. Seven surface variables are selected as predictor variables based on the suggestions of our domain experts. A detailed description of the selected variables is shown in Table 6.5.

Table 6.5: List of surface variables selected from NCEP reanalysis data as predictors.

Variable     Description
cprat.sfc    Monthly mean convective precipitation rate at surface
dlwrf.sfc    Monthly mean downward longwave radiation flux at surface
dswrf.sfc    Monthly mean downward shortwave radiation flux at surface
prate.sfc    Monthly mean precipitation rate at surface
tmax.2m      Monthly mean maximum temperature at 2 m
tmin.2m      Monthly mean minimum temperature at 2 m
lftx.sfc     Monthly mean surface lifted index

The collected datasets cover the entire United States, spanning January 1985 to November 2015. However, since there are missing values in the response variables for some weather stations over the 30-year period, the number of data points (weather stations) varies from one response variable to another. We also remove the grid cells in the NARR and NCEP reanalysis datasets that do not have any corresponding weather station data. Table 6.6 summarizes the number of weather stations and grid cells for each response variable.

Table 6.6: Number of weather stations and grid cells for each response variable.

         tmax   tmin   tmean   prcp
USHCN    357    341    333     790
NARR     350    336    328     635
NCEP     146    144    143     159

After removing the stations with missing values, we apply the following preprocessing steps to all the time series of predictor and response variables. For each variable at a given scale, we deseasonalize the time series by subtracting from each monthly value the mean of its corresponding calendar month. The time series is then standardized to obtain a sequence of Z-scores. Next, we divide the dataset into three roughly equal-sized partitions for training, validation, and testing. We use the data from the first 10 years (1985-1994) for training, the next 10 years (1995-2004) for validation, and the rest (2005-2015) for testing. The validation set is used to perform parameter tuning. We adopt an evaluation strategy similar to that of WISDOM [110] by setting the number of latent factors to k = 5.
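The deseasonalization and standardization described above can be sketched as follows; this is a minimal illustration using pandas, with hypothetical variable names, rather than the preprocessing code actually used in this study.

import numpy as np
import pandas as pd

def deseasonalize_and_standardize(series):
    # Subtract from each monthly value the mean of its calendar month (deseasonalization),
    # then convert the resulting anomalies to Z-scores.
    monthly_mean = series.groupby(series.index.month).transform('mean')
    anomalies = series - monthly_mean
    return (anomalies - anomalies.mean()) / anomalies.std()

# Toy usage on a synthetic monthly series spanning January 1985 to November 2015.
idx = pd.date_range('1985-01-01', '2015-11-01', freq='MS')
raw = pd.Series(np.sin(2 * np.pi * idx.month / 12)
                + np.random.default_rng(0).normal(scale=0.5, size=len(idx)), index=idx)
z = deseasonalize_and_standardize(raw)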
Note that the number of weather stations included in the training, validation, and testing sets may vary depending on when a station was first introduced into our incremental learning formulation. Specifically, we first randomly choose 100 weather stations as our starting locations. At each step during the incremental learning process, we randomly introduce either a new station (along with its historical data) or a new time period (where the data for the new time period become available to all the previously selected stations) for updating the latent factors as well as the spatial and temporal multi-scale models. With this incremental learning strategy, a station could be introduced during the training period, the validation period, or the testing period. We use the following mean absolute error (MAE) metric to evaluate the performance of the various algorithms:

\[
\mathrm{MAE}\big(\{(y_t, \hat{y}_t)\}\big) = \frac{\sum_{s=1}^{S} \sum_{n=1}^{N} |y_{s,n} - \hat{y}_{s,n}|}{S \times N}
\tag{6.11}
\]

Note that the evaluation metric is calculated for each station starting from the test period in which incremental learning is performed for that station. For example, if a station was first introduced into the incremental learning formulation during the training or validation period, then its MAE is calculated over the entire test period. However, if the station was first introduced during the testing period, its MAE is computed starting from the test period in which it was first introduced. The incremental learning process is repeated 10 times, each time with a different random initialization. Results are reported based on the average performance over the 10 trials.

6.3.2 Baseline Algorithms

We compare the performance of MUSCAT against single-task learning (STL) and several state-of-the-art multi-task learning algorithms. For these baseline algorithms, the predictor variables from NARR and NCEP reanalysis are concatenated into a single feature vector to be fitted against the response variable.

1. STL (Single Task Learning): In this approach, an independent linear model is incrementally trained for each location using a stochastic gradient descent algorithm whenever a new observation becomes available. When a weather station is first introduced during the incremental learning process, the parameters of its model are initialized randomly. The parameters are then updated as new observations for the station become available.

2. ALTO: This is a variation of the spatio-temporal multi-task learning algorithm proposed in [118], which was designed to build models for multiple response variables simultaneously.

3. WISDOM: This is the spatio-temporal multi-task learning approach proposed in Chapter 5. It applies tensor decomposition to the spatio-temporal data but does not distinguish between features from different scales.

6.3.3 Experimental Results

6.3.3.1 Comparison against Baselines

As previously noted, we repeated the incremental learning process 10 times for each method by randomly introducing a new station or a new time period into the learning process. We then report the mean and standard deviation of the MAE over the 10 runs for each method in Table 6.7. The results suggest that WISDOM and MUSCAT outperform STL on all datasets, which shows the effectiveness of using a multi-task learning approach. Although ALTO also employs a multi-task learning strategy, it performs only a simple update of its weight matrix, which may not be sufficient to learn the optimal weights for the prediction models.
Table 6.7: Mean and standard deviation of MAE for MUSCAT and the baseline methods on the climate datasets over 10 trials.

          tmax              tmin              tmean             prcp
STL       0.4422 ± 0.0016   0.4412 ± 0.0020   0.4141 ± 0.0018   0.5446 ± 0.0012
ALTO      0.5854 ± 0.0064   0.5687 ± 0.0031   0.5656 ± 0.0053   0.5806 ± 0.0051
WISDOM    0.3543 ± 0.0155   0.4001 ± 0.0075   0.3850 ± 0.0236   0.4212 ± 0.0054
MUSCAT    0.3212 ± 0.0074   0.3454 ± 0.0065   0.2844 ± 0.0112   0.4115 ± 0.0023

Comparing WISDOM and MUSCAT, the results show that MUSCAT significantly outperforms WISDOM on all four datasets evaluated in this study. This shows the effectiveness of our proposed framework, which uses a multi-tensor decomposition approach to jointly factorize the multi-scale tensors, instead of factorizing a single tensor with concatenated features from multiple scales, which is the approach used in WISDOM.

6.3.3.2 Comparison of Multi-scale against Single-scale Analysis

In order to determine the value of using multi-scale data, we develop the following two variations of MUSCAT and compare their performance against MUSCAT and WISDOM:

1. MUSCAT-S1: In this variation, MUSCAT uses only the predictor variables from NCEP. This variation is equivalent to applying WISDOM to the NCEP data.

2. MUSCAT-S2: In this variation, MUSCAT uses only the predictor variables from NARR. This variation is equivalent to applying WISDOM to the NARR data.

The mean and standard deviation of the MAE over 10 runs are reported in Table 6.8. The results show that MUSCAT outperforms WISDOM, MUSCAT-S1, and MUSCAT-S2 on all four datasets. Furthermore, WISDOM performs better than both MUSCAT-S1 and MUSCAT-S2 on the temperature datasets, which suggests the importance of incorporating data from multiple scales into the modeling framework. For the precipitation dataset, MUSCAT-S2 performs better than WISDOM. This result suggests that, for precipitation, the predictor variables from the coarsest scale (NCEP reanalysis) do not help to predict the precipitation values at the stations. Nonetheless, MUSCAT still achieves the lowest MAE because it can learn the appropriate weight combination (α) for NARR and NCEP reanalysis when making its predictions.

Table 6.8: Mean and standard deviation of MAE for MUSCAT against different variations of MUSCAT.

            tmax              tmin              tmean             prcp
WISDOM      0.3543 ± 0.0155   0.4001 ± 0.0075   0.3850 ± 0.0236   0.4212 ± 0.0054
MUSCAT-S1   0.5492 ± 0.0700   0.4328 ± 0.0115   0.4543 ± 0.0393   0.6194 ± 0.0026
MUSCAT-S2   0.3910 ± 0.0183   0.4094 ± 0.0186   0.4350 ± 0.0532   0.4208 ± 0.0051
MUSCAT      0.3212 ± 0.0074   0.3454 ± 0.0065   0.2844 ± 0.0112   0.4115 ± 0.0023

6.3.3.3 Influence of Data from Multiple Scales

Next, we examine the impact of the data at different scales on the performance of MUSCAT. Our hypothesis is that the predictor variables from the coarsest scale are less influential than those from the finer scale. To test this, we examine the mean values of the parameters α_1 and α_2 over the 10 runs. The results are shown in Table 6.9. For all four datasets, α_2 (NARR) has a consistently higher weight than α_1 (NCEP reanalysis). In other words, the NARR predictors are more influential in the prediction of the response variables than the NCEP reanalysis predictors. This supports our hypothesis that the finer-level predictors have a higher impact than the coarser-level predictors.

Table 6.9: Mean of α_1 and α_2 for the climate datasets over 10 trials.

       tmax     tmin     tmean    prcp
α_1    0.2876   0.3651   0.3360   0.0933
α_2    0.7124   0.6349   0.6640   0.9067
6.3.4 Analysis of Spatial Latent Factors

One of the advantages of MUSCAT is that its latent factors can be used to identify the spatial patterns of the data. In this section, we examine the spatial latent factors A generated by MUSCAT. Figure 6.1 shows the spatial distribution of the latent factors for the precipitation data. For each of the 5 spatial latent factors, we plot its corresponding top 20% most influential stations on the map. As can be seen from Figure 6.1, the distributions of the spatial latent factors exhibit certain grouping phenomena. For example, the first latent factor is more dominant in the eastern part of the United States, while the third and fifth latent factors are more influential in the northwestern part of the country.

[Figure 6.1: Spatial distribution of the spatial latent factors (Factors 1 through 5) learned by MUSCAT for the precipitation data. Figure is best viewed in color.]

6.3.5 Analysis of Temporal Latent Factors

Finally, we examine the temporal latent factors B derived by MUSCAT. Since each column of B is a time series, our goal is to determine whether the latent factors capture some of the known climate patterns previously described in the literature. To do this, we computed the correlation of each latent factor with known climate indices such as the El Niño Southern Oscillation Index (SOI), the North Atlantic Oscillation (NAO) index, and the Arctic Oscillation Index (AOI). Descriptions of the climate indices are shown in Table 6.10.

Table 6.10: List of the climate indices used to correlate with the temporal factors learned by MUSCAT.

Climate Index    Description
AOI              Arctic Oscillation Index
NAO              North Atlantic Oscillation
WPI              West Pacific Pattern
QBO              Quasi-Biennial Oscillation
PDO              Pacific Decadal Oscillation
SOI              Southern Oscillation Index

Figure 6.2 shows the correlations between the selected climate indices and the temporal latent factors generated by MUSCAT for all four datasets. It is expected that the overall correlation may not be very high, since the datasets used in this study are limited to weather stations in the United States, whereas most climate indices are defined over other regions of the world. Nevertheless, we still observe relatively high correlations between some indices and the temporal latent factors found. For example, in the tmax dataset, the fourth latent factor has a relatively high correlation with NAO. In the tmin and tmean datasets, the second latent factor has a high correlation with PDO, while for the prcp dataset, the fifth latent factor has a relatively high correlation with AOI. These results suggest that the temporal latent factors generated by MUSCAT were able to capture some of the previously known climate phenomena represented by the climate indices.

[Figure 6.2: Correlations between the known climate indices (AOI, NAO, WPI, PDO, QBO, SOI) and the five temporal latent factors derived from MUSCAT for (a) tmax, (b) tmin, (c) tmean, and (d) prcp.]

6.4 Conclusion

This chapter presents a multi-scale multi-task learning framework for geospatio-temporal data by employing a supervised multi-tensor decomposition approach. The framework enables the multi-scale relationships to be harnessed by enforcing a constraint on the consistency between
the spatial and temporal latent factors derived from the multi-scale geospatio-temporal data. An incremental learning algorithm over space and time is then proposed to efficiently learn the weights of the model. Experiments performed on a real-world multi-scale climate dataset demonstrate the effectiveness of the proposed method compared to several baseline algorithms.

CHAPTER 7
CONCLUSIONS AND FUTURE WORK

In this chapter, the major contributions of this thesis are summarized and possible future research directions are discussed.

7.1 Summary of Contributions

Geospatio-temporal data mining tasks can be naturally cast as MTL problems in the spatial and temporal dimensions. This thesis proposed four novel MTL frameworks to investigate the possibility of applying MTL approaches to the geospatio-temporal domain by addressing the challenges of building effective predictive models for geospatio-temporal data. It also evaluated the performance of the proposed MTL frameworks on different geospatio-temporal applications. The contributions of this thesis are summarized in the following.

In Chapter 3, an online temporal MTL framework named ORION is proposed to learn the optimal weights for ensemble members in ensemble forecasting applications. The framework treats the prediction for each time point in the forecast window as one task, and the task relatedness is defined through a mean regularizer and a Laplacian regularizer that constrain the temporal smoothness between tasks. ORION adopts a restart strategy to handle missing values of the target variable in the forecast window. Various loss functions, such as the ε-insensitive and quantile loss functions, can be adopted in the framework for different objectives. Experimental results on real-world soil moisture data from 12 US river basins demonstrate the superior performance of ORION over other baselines.

In Chapter 4, in order to perform multi-location prediction, an MTL framework named GSpartan is developed to build models jointly at multiple geo-locations. The framework assumes that the local models share a common set of hidden models and that each local model is a linear combination of these hidden models. Spatial autocorrelation is also employed to explicitly define the relationship between tasks. Sparsity and non-negativity constraints are enforced in the objective to ensure interpretability of the learned models. Experimental results on real-world climate data show that GSpartan outperforms other baselines, especially when there are limited training examples at each location. GSpartan is limited in that it considers only the spatial autocorrelation of the data and is difficult to scale up to global-scale data.

In Chapter 5, a spatio-temporal MTL framework is proposed for multi-location prediction based on supervised tensor decomposition. This framework constructs multiple tasks across both space and time, and the task relations in the spatial and temporal dimensions are implicitly determined by the tensor decomposition. A novel incremental learning algorithm called WISDOM is developed to simultaneously learn the latent factors of the spatio-temporal data via tensor decomposition as well as the prediction models. WISDOM outperforms several baseline algorithms and can easily accommodate existing patterns from the geospatio-temporal domain.

In Chapter 6, in order to handle data from multiple scales, a multi-scale spatio-temporal MTL framework is developed for multi-scale modeling at multiple locations simultaneously.
In addition to inheriting the merits of WISDOM, this framework models the relatedness of the predictors from different resolutions or scales by considering the consistency of the spatial and temporal latent factors learned from the predictor tensors of the different scales. A novel incremental multi-tensor decomposition algorithm called MUSCAT is developed to solve the objective efficiently. Experiments on real-world climate data show the effectiveness of MUSCAT over other baselines, including WISDOM.

7.2 Future Directions

Although the results for the proposed MTL frameworks are promising, they are still limited in a few aspects. In this section, future research directions inspired by the work of this thesis are discussed.

The proposed MTL frameworks in this thesis assume a linear prediction function, which might not hold in some geospatio-temporal applications, such as climate and Earth science. As future work, the choice of non-linear prediction functions or kernel functions needs to be studied.

In addition, the features used in the proposed frameworks for multi-location prediction are only a small portion of the whole feature set and are mainly determined by domain experts. However, removing the features outside of the hand-picked feature set might result in information loss. While it might not be effective or efficient to directly use all possible features, it is a challenge to perform feature selection and feature learning such that a limited number of learned features can effectively represent the original feature set. Since deep learning has been shown in the literature to learn informative features, a future direction is to develop an MTL framework with feature learning that incorporates the merits of deep learning approaches.

For many spatio-temporal applications, preserving the fidelity of the distribution is as important as minimizing the prediction error [2]. Although the proposed MTL frameworks do improve the regression prediction performance over the state-of-the-art methods, they still suffer from the fact that the regression will not always preserve the true distribution of the data. As future work, in order to address this problem, multi-task contour regression can be explored to simultaneously minimize the prediction error and maximize the alignment between the distributions of the predictions and the observations.

BIBLIOGRAPHY

[1] Zubin Abraham, Malgorzata Liszewska, Perdinan, Pang-Ning Tan, Julie Winkler, and Shiyuan Zhong. Distribution regularized regression framework for climate modeling. In SDM, pages 333–341. SIAM, 2013.

[2] Zubin Abraham, Pang-Ning Tan, Perdinan, Julie A. Winkler, Shiyuan Zhong, and Malgorzata Liszewska. Contour regression: A distribution-regularized regression framework for climate modeling. Stat. Anal. Data Min., 7(4):272–281, August 2014. ISSN 1932-1864.

[3] Alekh Agarwal, Alexander Rakhlin, and Peter Bartlett. Matrix regularization techniques for online multitask learning. Technical Report UCB/EECS-2008-138, EECS Department, University of California, Berkeley, Oct 2008. URL http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-138.html.

[4] Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, December 2005. ISSN 1532-4435.

[5] Luc Anselin and Arthur Getis. Spatial statistical analysis and geographic information systems.
The Annals of Regional Science, 26(1):19–33, 1992. ISSN 1432-0592. [6] Miguel B. Araujo ´ and Mark New. Ensemble forecasting of species distributions. Trends in Ecology & Evolution, 22(1):42–47, 2007. [7] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, December 2008. ISSN 0885-6125. [8] N. H. Augustin, M. Musio, K. von Wilpert, E. Kublin, S. N. Wood, and M. Schumacher. Modeling spatiotemporal forest health monitoring data. 104:899–911, 2009. [9] Prithu Banerjee, Pranali Yawalkar, and Sayan Ranu. Mantra: A scalable approach to mining temporally anomalous sub-trajectories. In KDD, pages 1415–1424, 2016. [10] Sudipto Banerjee, Bradley P. Carlin, and Alan E. Gelfand. Hierarchical Modeling and Analysis for Spatial Data. Monographs on Statistics and Applied Probability. Chapman and Hall/CRC, 1 edition, 2004. ISBN 158488410X. [11] F. Bellocchio, S. Ferrari, V. Piuri, and N.A. Borghese. Hierarchical approach for multiscale support vector regression. IEEE Transactions on Neural Networks and Learning Systems, 23(9):1448–1460, September 2012. 1045-9227. [12] Gedas Bertasius, Jianbo Shi, and Lorenzo Torresani. Deepedge: A multi-scale bifurcated deep network for top-down contour detection. In CVPR, pages 4380–4389, 2015. [13] Martin Andrew Bezener. Bayesian spatiotemporal modeling using spatial hierarchical priors with applications to functional magnetic resonance imaging. Ph.D dissertation, University of Minnesota, 2015. 111 [14] Jinbo Bi, Tao Xiong, Shipeng Yu, Murat Dundar, and R.Bharat Rao. An improved multitask learning approach with applications in medical diagnosis. In Machine Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer Science, pages 117–132. Springer Berlin Heidelberg, 2008. ISBN 978-3-540-87478-2. [15] Steffen Bickel, Jasmina Bogojeska, Thomas Lengauer, and Tobias Scheffer. Multi-task learning for hiv therapy screening. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 56–63, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-205-4. [16] Shyam Boriah, Vipin Kumar, Michael Steinbach, Christopher Potter, and Steven Klooster. Land cover change detection: A case study. In KDD, pages 857–865, 2008. [17] George Edward Pelham Box and Gwilym Jenkins. Time Series Analysis, Forecasting and Control. Holden-Day, Incorporated, 1990. ISBN 0816211043. [18] Jorge Caiado and Nuno Crato. A garch-based method for clustering of financial time series: International stock markets evidence. Mpra paper, University Library of Munich, Germany, 2007. URL http://EconPapers.repec.org/RePEc:pra:mprapa:2074. [19] Pierre Cantelaube and Jean-Michel Terres. Seasonal weather forecasts for crop yield modelling in europe. Tellus A, 57(3):476–487, 2005. [20] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, July 1997. ISSN 08856125. [21] Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile. Linear algorithms for online multitask classification. Journal of Machine Learning Research, 11:2901–2934, December 2010. ISSN 1532-4435. [22] Kai-Wei Chang, Wen-Tau Yih, Bishan Yang, and Christopher Meek. Typed tensor decomposition of knowledge bases for relation extraction. In EMNLP, 2014. [23] Jianhui Chen, Jiayu Zhou, and Jieping Ye. Integrating low-rank and group-sparse structures for robust multi-task learning. In Proceedings of the 17th ACM SIGKDD Int’l Conf on Knowledge discovery and data mining, pages 42–50. ACM, 2011. 
[24] Jianhui Chen, Ji Liu, and Jieping Ye. Learning incoherent sparse and low-rank patterns from multiple tasks. ACM Transactions on Knowledge Discovery from Data, 5(4):22:1– 22:31, 2012. ISSN 1556-4681. [25] Jianhui Chen, Lei Tang, Jun Liu, and Jieping Ye. A convex formulation for learning a shared predictive structure from multiple tasks. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(5):1025–1038, May 2013. ISSN 0162-8828. [26] Haibin Cheng and Pang-Ning Tan. Semi-supervised learning with data calibration for longterm time series forecasting. In Proc of ACM SIGKDD Int’l Conf on Knowledge Discovery and Data Mining, pages 133–141, 2008. 112 [27] Chi-Yin Chow and Mohamed F. Mokbel. Trajectory privacy in location-based services and data publication. SIGKDD Explor. Newsl., 13(1):19–29, August 2011. [28] A. Cichocki, D. Mandic, L. De Lathauwer, Guoxu Zhou, Qibin Zhao, C. Caiafa, and H.A. Phan. Tensor decompositions for signal processing applications: From two-way to multiway component analysis. Signal Processing Magazine, 32(2):145–163, 2015. [29] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, December 2006. ISSN 1532-4435. [30] N. Cressie. Statistics for spatial data. Wiley, New York, 1993. [31] Hal Daumé, III. Bayesian multitask learning with latent hierarchies. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, UAI ’09, pages 135–142, Arlington, Virginia, United States, 2009. AUAI Press. ISBN 978-0-9749039-5-8. [32] G.S. Davis, N Sevdalis, and L.N Drumright. Spatial and temporal analyses to investigate infectious disease transmission within healthcare settings. pages 227–243, 2014. [33] Victor De Oliveira. Bayesian analysis of conditional autoregressive models. Annals of the Institute of Statistical Mathematics, 64(1):107–133, 2012. [34] Ofer Dekel, Philip M. Long, and Yoram Singer. Online learning of multiple tasks with a shared loss. Journal of Machine Learning Research, 8:2233–2264, December 2007. ISSN 1532-4435. [35] N. Diodato, G. Bellocchi, C. Bertolin, and D. Camuffo. Multiscale regression model to infer historical temperatures in a central mediterranean sub-regional area. Climate of the Past Discussions, 6:2625–3649, 2010. [36] Mark Dredze, Koby Crammer, and Fernando Pereira. Confidence-weighted linear classification. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 264–271, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-205-4. [37] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, pages 2366–2374, 2014. [38] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, pages 109–117, New York, NY, USA, 2004. ACM. ISBN 1-58113888-1. [39] Theodoros Evgeniou, Charles A. Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. J. Mach. Learn. Res., 6:615–637, December 2005. ISSN 15324435. [40] JamesH. Faghmous and Vipin Kumar. Spatio-temporal data mining for climate data: Advances, challenges, and opportunities. In Data Mining and Knowledge Discovery for Big Data, volume 1 of Studies in Big Data, pages 83–116. Springer Berlin Heidelberg, 2014. 113 [41] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. 
Learning hierarchical features for scene labeling. PAMI, 35(8):1915–1929, Aug 2013. [42] Robert J. Franzese, Jude C. Hays, and Scott J. Cook. Spatial- and spatiotemporalautoregressive probit models of interdependent binary outcomes. Political Science Research ˘ S173, and Methods, 4(1):151âA ¸ 2016. [43] Scott John Gaffney. Probabilistic Curve-Aligned Clustering and Prediction with Regression Mixture Models. PhD thesis, 2004. [44] Alan E Gelfand, Peter J Diggle, Montserrat Fuentes, and Peter Guttorp. Handbook of spatial statistics. CRC press, 2010. [45] Fosca Giannotti, Mirco Nanni, Fabio Pinelli, and Dino Pedreschi. Trajectory pattern mining. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’07, pages 330–339. ACM, 2007. [46] A. R. GonÃ˘galves, F. J. Von Zuben, and A. Banerjee. A multitask learning view on the earth system model ensemble. Computing in Science Engineering, 17(6):35–42, Nov 2015. [47] A. R. GonÃ˘galves, A. Banerjee, and F. J. Von Zuben. Spatial projection of multiple climate variables using hierarchical multitask learning. In AAAI, 2017. [48] André R. Gonçalves, Fernando J. Von Zuben, and Arindam Banerjee. Multi-task sparse structure learning with gaussian copula models. JMLR, 17(1):1205–1234, January 2016. [49] Pinghua Gong, Jieping Ye, and Changshui Zhang. Robust multi-task feature learning. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pages 895–903, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1462-6. [50] Michael Grant and Stephen Boyd. CVX: Matlab software for disciplined convex programming, version 2.0 beta. http://cvxr.com/cvx, September 2013. [51] Tony H Grubesic and Alan T Murray. Detecting hot spots using cluster analysis and gis. In Proceedings from the fifth annual international crime mapping research conference, volume 26, 2001. [52] Quanquan Gu, Zhenhui Li, and Jiawei Han. Learning a kernel for multi-task clustering. In Proceedings of the 25th Conference on Artificial Intelligence (AAAI), 2011. [53] Forrest M Hoffman, William W Hargrove, and Anthony D Del Genio. Multivariate spatiotemporal clustering of time-series data: an approach for diagnosing cloud properties and understanding arm site representativeness. [54] Sahyun Hong and W. M. Moon. Application of gaussian markov random field model to unsupervised classification in polarimetric sar image. In 2003 IEEE International Geoscience and Remote Sensing Symposium. Proceedings, volume 2, pages 929–931 vol.2, 2003. 114 [55] Weiming Hu, Xi Li, Xiaoqin Zhang, Xinchu Shi, Stephen Maybank, and Zhongfei Zhang. Incremental tensor subspace learning and its applications to foreground segmentation and tracking. IJCV, 91(3):303–327, 2011. [56] Zhuoliang Kang, Kristen Grauman, and Fei Sha. Learning with whom to share in multi-task feature learning. In Proceedings of the 28st International Conference on Machine Learning, pages 521–528. Omnipress, 2011. [57] Alexandros Karatzoglou, Xavier Amatriain, Linas Baltrunas, and Nuria Oliver. Multiverse recommendation: N-dimensional tensor factorization for context-aware collaborative filtering. In RecSys 2010, pages 79–86, 2010. [58] Baris M. Kazar and Mete Celik. Spatial AutoRegression (SAR) Model: Parameter Estimation Techniques. Springer Publishing Company, Incorporated, 2012. [59] Eamonn Keogh, Jessica Lin, and Ada Fu. Hot sax: Efficiently finding the most unusual time series subsequence. 
In Proceedings of the Fifth IEEE International Conference on Data Mining, ICDM ’05, pages 226–233, 2005. ISBN 0-7695-2278-5. [60] Roger Koenker. Quantile Regression. Econometric Society Monographs. Cambridge University Press, 2005. ISBN 0521608279. [61] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, August 2009. [62] Krzysztof Koperski and Jiawei Han. Discovery of spatial association rules in geographic information databases. In Proceedings of the 4th International Symposium on Advances in Spatial Databases, SSD ’95, pages 47–66, 1995. [63] Abhishek Kumar and Hal Daum´l²e III. Learning task grouping and overlap in multi-task learning. In Proceedings of the 29st International Conference on Machine Learning. icml.cc / Omnipress, 2012. [64] Neil D. Lawrence and John C. Platt. Learning to learn with the informative vector machine. In Proceedings of the 21st International Conference on Machine Learning, ICML ’04. [65] Su-In Lee, Vassil Chatalbashev, David Vickrey, and Daphne Koller. Learning a meta-level prior for feature relevance from multiple related tasks. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pages 489–496, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-793-3. [66] M. Leutbecher and T.N. Palmer. Ensemble forecasting. Journal of Computational Physics, 227:3515–3539, 2008. [67] Guangxia Li, S.C.H. Hoi, Kuiyu Chang, Wenting Liu, and R. Jain. Collaborative online multitask learning. Knowledge and Data Engineering, IEEE Transactions on, 26(8):1866– 1876, Aug 2014. ISSN 1041-4347. doi: 10.1109/TKDE.2013.139. 115 [68] Mu Li, Amr Ahmed, and Alexander J. Smola. Inferring movement trajectories from gps snippets. In Proceedings of the 8th ACM International Conference on Web Search and Data Mining, WSDM ’15, pages 325–334, 2015. [69] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, February 1994. ISSN 0890-5401. [70] Chang-Tien Lu and Lily R Liang. Wavelet fuzzy classification for detecting and tracking region outliers in meteorological data. In Proceedings of the 12th annual ACM international workshop on Geographic information systems, pages 258–265. ACM, 2004. [71] Chang-Tien Lu, Dechang Chen, and Yufeng Kou. Algorithms for spatial outlier detection. In Proceedings of the Third IEEE International Conference on Data Mining, ICDM ’03, pages 597–, 2003. ISBN 0-7695-1978-4. [72] Lifeng Luo and Eric F. Wood. Use of bayesian merging techniques in a multimodel seasonal hydrologic ensemble prediction system for the eastern united states. Journal of Hydrometeorology, 9:866–884, 2008. [73] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Discovery of frequent episodes in event sequences. Data Min. Knowl. Discov., 1(3):259–289, January 1997. ISSN 1384-5810. [74] Wesley Mathew, Ruben Raposo, and Bruno Martins. Predicting future locations with hidden markov models. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing, UbiComp ’12, pages 911–918, 2012. [75] Bradley A. Miller, Sylvia Koszinski, Marc Wehrhan, and Michael Sommer. Impact of multi˘ S240:97 scale predictor selection for modeling soil properties. Geoderma, 239âA ¸ – 106, 2015. ISSN 0016-7061. [76] C. Monteleoni, G.A. Schmidt, and S. McQuade. Climate informatics: Accelerating discovering in climate science with machine learning. Computing in Science Engineering, 15(5): 32–40, Sept 2013. [77] Claire Monteleoni, Gavin A. Schmidt, and Shailesh Saroha. 
Tracking climate models. In NASA Conference on Intelligent Data Understanding, pages 1–15. NASA Ames Research Center, 2010. [78] Natalia Neverova, Christian Wolf, Graham W. Taylor, and Florian Nebout. Multi-scale Deep Learning for Gesture Detection and Localization, pages 474–490. 2014. [79] Xia Ning and George Karypis. Multi-task learning for recommender system. In Proceedings of the 2nd Asian Conference on Machine Learning, volume 13 of JMLR Proceedings, pages 269–284. JMLR.org, 2010. [80] Mohamed N. Nounou and Hazem N. Nounou. Multiscale latent variable regression. International Journal of Chemical Engineering, 2010:8, 2010. 116 [81] Guillaume Obozinski, Ben Taskar, and MichaelI. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2): 231–252, 2010. ISSN 0960-3174. [82] K.P. Overmars, G.H.J. de Koning, and A. Veldkamp. Spatial autocorrelation in multi-scale ˘ S3):257 land use models. Ecological Modelling, 164(2âA ¸ – 270, 2003. ISSN 0304-3800. [83] R. Kelley Pace, Ronald Barry, John M. Clapp, and Mauricio Rodriquez. Spatiotemporal autoregressive models of neighborhood effects. The Journal of Real Estate Finance and Economics, 17(1):15–33, 1998. [84] Neal Parikh and Stephen Boyd. Proximal algorithms. Found. Trends Optim., 1(3):127–239, January 2014. ISSN 2167-3888. [85] A. Riccio, G. Barone, E. Chianese, and G. Giunta. A hierarchical bayesian approach to the spatio-temporal modeling of air quality data. Atmospheric Environment, 40(3):554 – 566, 2006. [86] Bernardino Romera-Paredes, Hane Aung, Nadia Bianchi-Berthouze, and Massimiliano Pontil. Multilinear multitask learning. In ICML, pages 1444–1452. [87] Avishek Saha, Piyush Rai, Hal Daum´l²e III, and Suresh Venkatasubramanian. Online learning of multiple tasks and their relationships. In AISTATS, volume 15 of JMLR Proceedings, pages 643–651. JMLR.org, 2011. [88] Shashi Shekhar, Zhe Jiang, Reem Y. Ali, Emre Eftelioglu, Xun Tang, Venkata M. V. Gunturi, and Xun Zhou. Spatiotemporal data mining: A computational perspective. ISPRS International Journal of Geo-Information, 4(4):2306–2338, 2015. [89] Balaji Vasan Srinivasan, Ramani Duraiswami, and Raghu Murtugudde. Efficient kriging for real-time spatio-temporal interpolation. In PSAS, pages 228–235, 2010. [90] Michael Steinbach, Pang-Ning Tan, Vipin Kumar, Steven Klooster, and Christopher Potter. Discovery of climate indices using clustering. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’03, pages 446– 455, 2003. [91] Karthik Subbian and Arindam Banerjee. Climate multi-model regression using spatial smoothing. In SDM, pages 324–332, 2013. [92] Hua Sun, Yong Tu, and Shi-Ming Yu. A spatio-temporal autoregressive model for multiunit residential market analysis. The Journal of Real Estate Finance and Economics, 31(2): 155–187, 2005. [93] Jimeng Sun, Dacheng Tao, Spiros Papadimitriou, Philip S. Yu, and Christos Faloutsos. Incremental tensor analysis: Theory and applications. TKDD, 2(3):11:1–11:37. [94] Jimeng Sun, Dacheng Tao, and Christos Faloutsos. Beyond streams and graphs: Dynamic tensor analysis. In KDD, pages 374–383, 2006. 117 [95] C. Sung, D. Feldman, and D. Rus. Trajectory clustering for motion prediction. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1547–1552, Oct 2012. [96] Yufei Tao, George Kollios, Jeffrey Considine, Feifei Li, and Dimitris Papadias. 
Spatiotemporal aggregation using sketches. In Proceedings of the 20th International Conference on Data Engineering, ICDE ’04, pages 214–, 2004. ISBN 0-7695-2065-0. [97] Claudia Tebaldi and Reto Knutti. The use of the multi-model ensemble in probabilistic climate projections. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 365(1857):2053–2075, 2007. [98] W. Tobler. A computer movie simulating urban growth in the detroit region. Economic Geography, 46(2):234–240, 1970. [99] Pedro A. Valdes-Sosa. Spatio-temporal autoregressive models defined over brain manifolds. Neuroinformatics, 2(2):239–250, 2004. [100] Christian Walder, Kwang In Kim, and Bernhard Schölkopf. Sparse multiscale gaussian process regression. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 1112–1119, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-205-4. [101] Jialei Wang, Steven C.H. Hoi, Peilin Zhao, and Zhi-Yong Liu. Online multi-task collaborative filtering for on-the-fly recommender systems. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys ’13, pages 237–244. ACM, 2013. [102] Weiran Wang and Miguel Á. Carreira-Perpiñán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. CoRR, abs/1309.1541, 2013. URL http://arxiv.org/abs/1309.1541. [103] Xiaogang Wang, Cha Zhang, and Zhengyou Zhang. In Proceedings of the 22nd IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’09. [104] Christopher K. Wikle, Ralph F. Milliff, Doug Nychka, and L. Mark Berliner. Spatiotemporal hierarchical bayesian modeling: Tropical ocean surface winds. Journal of the American Statistical Association, 96(454):382–397, 2001. [105] Kishan Wimalawarne, Masashi Sugiyama, and Ryota Tomioka. Multitask learning meets tensor factorization: task imputation via convex optimization. In NIPS, pages 2825–2833. 2014. [106] Fei Wu, Xu Tan, Yi Yang, Dacheng Tao, Siliang Tang, and Yueting Zhuang. Supervised nonnegative tensor factorization with maximum-margin constraint. In AAAI 2013, 2013. [107] Jianpeng Xu, Pang-Ning Tan, and Lifeng Luo. ORION: Online Regularized multI-task regressiON and its application to ensemble forecasting. In Proceedings of the 14th IEEE International Conference on Data Mining, ICDM ’14. 118 [108] Jianpeng Xu, Jiayu Zhou, and Pang-Ning Tan. Formula: Factorized multi-task learning for task discovery in personalized medical models. In Proceedings of the 15th SIAM International Conference on Data Mining, SDM ’15, pages 496–504, 2015. [109] Jianpeng Xu, Pang-Ning Tan, Lifeng Luo, and Jiayu Zhou. Gspartan: a geospatio-temporal multi-task learning framework for multi-location prediction. In SDM 2016, 2016. [110] Jianpeng Xu, Jiayu Zhou, Pang-Ning Tan, Xi Liu, and Lifeng Luo. Wisdom: Weighted incremental spatio-temporal multi-task learning via tensor decomposition. In IEEE Big Data, pages 522–531, Dec 2016. [111] Jianpeng Xu, Pang-Ning Tan, Jiayu Zhou, and Lifeng Luo. Online multi-task learning framework for ensemble forecasting. IEEE Transactions on Knowledge and Data Engineering, 29(6):1268–1280, 2017. [112] Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal of Imaging Sciences, 6:1758–1789, 2013. [113] Ya Xue, Xuejun Liao, Lawrence Carin, and Balaji Krishnapuram. Multi-task learning for classification with dirichlet process priors. 
Journal of Machine Learning Research, 8:35–63, May 2007. ISSN 1532-4435. [114] Haiqin Yang, Irwin King, and Michael R. Lyu. Online learning for multi-task feature selection. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM ’10, pages 1693–1696. ACM, 2010. ISBN 978-1-45030099-5. [115] Hui Yang, Srinivasan Parthasarathy, and Sameep Mehta. A generalized framework for mining spatio-temporal patterns in scientific data. In Proc of ACM SIGKDD Int’l Conf on Knowledge Discovery and Data Mining, pages 716–721, 2005. [116] Kai Yu, Volker Tresp, and Anton Schwaighofer. Learning gaussian processes from multiple tasks. In Proceedings of the 22Nd International Conference on Machine Learning, ICML ’05, pages 1012–1019, New York, NY, USA, 2005. ACM. ISBN 1-59593-180-5. [117] Rose Yu, Mohammad Taha Bahadori, and Yan Liu. Fast multivariate spatio-temporal analysis via low rank tensor learning. In NIPS, pages 3491–3499, 2014. [118] Rose Yu, Dehua Cheng, and Yan Liu. Accelerated online low rank tensor learning for multivariate spatiotemporal streams. In ICML 2015, volume 37, pages 238–247, 2015. [119] Xiao-Tong Yuan and Shuicheng Yan. Visual classification with multi-task joint sparse representation. In Proceedings of the 23rd IEEE Conference on Computer Vision and Pattern Recognition, pages 3493–3500, June 2010. [120] Yu Zhang and Dit-Yan Yeung. A convex formulation for learning task relationships in multi-task learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pages 733–442. AUAI Press, 2010. ISBN 978-0-9749039-6-5. 119 [121] Liang Zhao, Qian Sun, Jieping Ye, Feng Chen, Chang-Tien Lu, and Naren Ramakrishnan. Multi-task learning for spatio-temporal event forecasting. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pages 1503–1512, 2015. [122] Wenzhi Zhao and Shihong Du. Learning multiscale and deep representations for classifying remotely sensed imagery. Journal of Photogrammetry and Remote Sensing, 113:155 – 165, 2016. [123] Xu Zhong, Allison Kealy, and Matt Duckham. Stream kriging. Computers & Geosciences, 90(PA):134–143, May 2016. [124] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard SchÃ˝ulkopf. Learning with local and global consistency. In The 16th Annual Conference on Neural Information Processing Systems, pages 321–328. MIT Press, 2004. [125] J. Zhou, J. Chen, and J. Ye. MALSAR: Multi-tAsk Learning via StructurAl Regularization. Arizona State University, 2011. URL http://www.public.asu.edu/~jye02/Software/ MALSAR. [126] Jiayu Zhou, Lei Yuan, Jun Liu, and Jieping Ye. A multi-task learning formulation for predicting disease progression. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’11, pages 814–822, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0813-7. [127] Shuo Zhou, Xuan Vinh Nguyen, James Bailey, Yunzhe Jia, and Ian Davidson. Accelerating Online CP Decompositions for Higher Order Tensors. In KDD, 2016. 120