NOVEL LEARNING ALGORITHMS FOR MINING GEOSPATIAL DATA

By

Shuai Yuan

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science – Doctor of Philosophy

2017

ABSTRACT

NOVEL LEARNING ALGORITHMS FOR MINING GEOSPATIAL DATA

By

Shuai Yuan

Geospatial data have a wide range of applicability in many disciplines, including environmental science, urban planning, healthcare, and public administration. The proliferation of such data in recent years has presented opportunities to develop novel data mining algorithms for modeling and extracting useful patterns from the data. However, many practical issues remain that must be addressed before the algorithms can be successfully applied to real-world problems. First, the algorithms must be able to incorporate spatial relationships and other domain constraints defined by the problem. Second, the algorithms must be able to handle missing values, which are common in many geospatial data sets. In particular, the models constructed by the algorithms may need to be extrapolated to locations with no observation data. Another challenge is to adequately capture the nonlinear relationship between the predictor and response variables of the geospatial data. Accurately modeling such relationships is not only difficult but also computationally expensive. Finally, the variables may interact at different spatial scales, making it necessary to develop models that can handle multi-scale relationships present in the geospatial data.

This thesis presents the novel algorithms I have developed to overcome the practical challenges of applying data mining to geospatial datasets. Specifically, the algorithms are applied to both supervised and unsupervised learning problems such as cluster analysis and spatial prediction.
While the algorithms are mostly evaluated on datasets from the ecology domain, they are generally applicable to other geospatial datasets with similar characteristics.

First, a spatially constrained spectral clustering algorithm is developed for geospatial data. The algorithm provides a flexible way to incorporate spatial constraints into the spectral clustering formulation in order to create regions that are spatially contiguous and homogeneous. It can also be extended to a hierarchical clustering setting, enabling the creation of fine-scale regions that are nested wholly within broader-scale regions. Experimental results suggest that the nested regions created using the proposed approach are more balanced in terms of their sizes compared to the regions found using traditional hierarchical clustering methods.

Second, a supervised hash-based feature learning algorithm is proposed for modeling nonlinear relationships in incomplete geospatial data. The proposed algorithm can simultaneously infer missing values while learning a small set of discriminative, nonlinear features of the geospatial data. The efficacy of the algorithm is demonstrated using synthetic and real-world datasets. Empirical results show that the algorithm is more effective than the standard approach of imputing the missing values before applying nonlinear feature learning in more than 75% of the datasets evaluated in the study.

Third, a multi-task learning framework is developed for modeling multiple response variables in geospatial data. Instead of training the local models independently for each response variable at each location, the framework simultaneously fits the local models for all response variables by optimizing a joint objective function with trace-norm regularization. The framework also leverages the spatial autocorrelation between locations as well as the inherent correlation between response variables to improve prediction accuracy.
Finally, a multi-level, multi-task learning framework is proposed to effectively train predictive models from nested geospatial data containing predictor variables measured at multiple spatial scales. The framework enables distinct models to be developed for each coarse-scale region using both its fine-level and coarse-level features. It also allows information to be shared among the models through a common set of latent features. Empirical results show that such information sharing helps to create more robust models, especially for regions with limited or no training data. Another advantage of using the multi-level, multi-task learning framework is that it can automatically identify potential cross-scale interactions between the regional and local variables.

Copyright by
SHUAI YUAN
2017

ACKNOWLEDGEMENTS

I spent five years in the Department of Computer Science and Engineering. I would not say time flew, as they were five long and tough years. Many things happened, not only exciting ones but also awful ones. Some things pushed me down and made me want to quit, but many others lifted me up and encouraged me. During these years of study, I received a great deal of help from my advisor, collaborators, family, and friends.

First, I would like to express my sincere gratitude to my advisor, Professor Pang-Ning Tan, who has been a great advisor and mentor over the past years and without whom I would not have been able to finish my PhD research. I would like to thank him for his patience, encouragement, and immense knowledge. I could not have imagined having a better mentor for my PhD study. Five years ago I was first admitted as a master's student in the program. As a master's student, I was lucky enough to be exposed to many research opportunities. Professor Tan encouraged me to join the weekly group meeting where we discussed the newest research papers from top conferences.
I was also involved in an interdisciplinary project later that semester, where I got a chance to learn how data mining can be applied to solve real scientific questions. I enjoyed doing research in this group, and all these exciting experiences led me to pursue a PhD. Professor Tan is a knowledgeable and prominent expert in his domain, and he is also a great educator. I still remember the many times he patiently explained the finest details to me when I was confused about a concept. I admire him as an inspiring researcher and a great educator.

I would also like to thank everyone in the CSI-Limnology group, an interdisciplinary group of researchers working on a multi-scaled geospatial and temporal lake database. Thanks to the great leadership of the group, we had well-organized monthly meetings, annual workshops, and manuscript discussions. Everyone in the group was patient and helpful when I had questions, and passionate about what we did and what we could do. Special thanks to Patricia Soranno, Kendra Spence Cheruvelil, Sarah M. Collins, Emi Fergus, Nicholas K. Skaff, and Tyler Wagner, without whom it would have been impossible to conduct the research. A very special gratitude also goes to the National Science Foundation for providing the funding for my research under grant #EF-1065786.

My sincere thanks also go to my PhD committee members, Professor Xiaoming Liu, Professor Jiayu Zhou, and Professor Patricia Soranno, for their insightful comments and guidance regarding the dissertation. I would also like to thank my labmates and friends for all the great discussions, the help, and all the fun we have had over the past five years. I would like to thank my adorable puppy Elsa for her undivided attention and company.

Last but not least, I would like to thank my family. During each phase of my life, my parents have always given me their unconditional love and support. My PhD study is five years of my life that I will never forget.
I'm ready to start a new chapter of my life and wish all the best to the people I met here. Go Spartans!

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

CHAPTER 1 INTRODUCTION
  1.1 Challenges and Motivation
  1.2 Lake Ecology Database
  1.3 Thesis Contributions
  1.4 Roadmap

CHAPTER 2 LITERATURE REVIEW
  2.1 Mining Geospatial Data
  2.2 Feature Learning
  2.3 Incomplete Geospatial Data
  2.4 Multi-task Learning
  2.5 Constrained Spectral Clustering

CHAPTER 3 SPATIALLY CONSTRAINED SPECTRAL CLUSTERING FOR REGIONALIZATION
  3.1 Introduction
  3.2 Related Work
  3.3 Preliminaries
    3.3.1 Region Delineation as Constrained Clustering Problem
    3.3.2 Spectral Clustering
    3.3.3 Constrained Spectral Clustering
  3.4 Spatially Constrained Spectral Clustering
    3.4.1 Kernel Representation of Spatial Contiguity Constraints
    3.4.2 Hadamard Product Graph Laplacian
    3.4.3 Partitional Spatially-Constrained Spectral Clustering Algorithm
    3.4.4 Hierarchical Spatially-Constrained Spectral Clustering Algorithm
  3.5 Application to Region Delineation
    3.5.1 Data Set
    3.5.2 Baseline Methods
    3.5.3 Evaluation Metrics
    3.5.4 Results and Discussion
      3.5.4.1 Tradeoff between Homogeneity and Spatial Contiguity
      3.5.4.2 Performance Comparison for Partitional-based Constrained Clustering
      3.5.4.3 Performance Comparison for Hierarchical-based Constrained Clustering
  3.6 Conclusions

CHAPTER 4 LEARNING HASH-BASED FEATURES FOR INCOMPLETE CONTINUOUS-VALUED DATA
  4.1 Introduction
  4.2 Related Work
  4.3 Preliminaries
    4.3.1 Support Vector Regression (SVR)
    4.3.2 Random Fourier Features (RFF)
    4.3.3 Matrix Completion
  4.4 Proposed Framework
    4.4.1 Supervised Hash-based Feature Learning Using RFF
    4.4.2 Hash-based Feature Learning for Incomplete Data (H-FLIP)
      4.4.2.1 Updating X
      4.4.2.2 Updating W
  4.5 Experimental Evaluation
    4.5.1 Data Sets
      4.5.1.1 Synthetic Data
      4.5.1.2 Lake Water Quality Data
      4.5.1.3 Benchmark Data from UCI Machine Learning Repository
    4.5.2 Experimental Setup
      4.5.2.1 Baseline Methods
      4.5.2.2 Evaluation Metrics
    4.5.3 Experimental Results
      4.5.3.1 Results for Synthetic Data
      4.5.3.2 Results for Lake Water Quality Data
      4.5.3.3 Results for UCI Benchmark Data
    4.5.4 Sensitivity Analysis
  4.6 Conclusion

CHAPTER 5 MULTI-TASK LEARNING FOR MULTIPLE RESPONSES AT DIFFERENT LOCATIONS
  5.1 Introduction
  5.2 Multi-Response, Multi-Location Prediction
  5.3 Proposed Framework
    5.3.1 Objective Function
    5.3.2 Parameter Estimation
  5.4 Experimental Evaluation
    5.4.1 Datasets
    5.4.2 Experimental Setup
      5.4.2.1 Baseline Methods
      5.4.2.2 Evaluation Metric
    5.4.3 Experimental Results
      5.4.3.1 Performance Comparison
      5.4.3.2 Model Coefficients
  5.5 Conclusion

CHAPTER 6 MULTI-LEVEL MULTI-TASK LEARNING FOR NESTED GEOSPATIAL DATA
  6.1 Introduction
  6.2 Related Work
  6.3 Preliminaries
  6.4 Multi-Level Multi-Task Learning (MLMT) Framework
    6.4.1 Objective Function
    6.4.2 Parameter Estimation
    6.4.3 Proof of Convergence
    6.4.4 Cross-scale Interactions (CSIs)
    6.4.5 Generalization to N-level Modeling
  6.5 Experimental Evaluation
    6.5.1 Datasets
    6.5.2 Experimental Setup
      6.5.2.1 Baseline Methods
      6.5.2.2 Evaluation Metric
    6.5.3 Experimental Results
      6.5.3.1 Performance Comparison for All Regions
      6.5.3.2 Performance Comparison for Data-Poor Regions
      6.5.3.3 Cross-scale Interactions
      6.5.3.4 Comparison Between the New and Original Regions
      6.5.3.5 Sensitivity Analysis
  6.6 Conclusions

CHAPTER 7 CONCLUSION AND FUTURE WORK
  7.1 Conclusion
  7.2 Future Work

BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: Summary statistics of the data set.
Table 3.2: Performance comparison among various partitional spatially-constrained clustering algorithms with the number of clusters set to 10.
Table 3.3: A toy example illustrating the advantage of using the Hadamard product for combining constraints.
Table 3.4: Performance comparison among various hierarchical spatially-constrained clustering algorithms with δ = 1 and the number of clusters set to 10.
Table 3.5: Stability of the regions generated by different hierarchical clustering methods for the state of Michigan. The mean Adjusted Rand Index is computed for each method by comparing the similarity between the regions found with δ = 1/41 to the regions found with δ = 4/41.
Table 4.1: Comparison among the various regression methods for modeling lake water quality in terms of their mean square prediction error (MSE).
Table 4.2: Summary of the data sets.
Table 4.3: Imputation error for synthetic data.
Table 4.4: MSE of linear SVR on synthetic data.
Table 4.5: MSE of linear SVR for lake water quality data.
Table 4.6: MSE of linear SVR for UCI data with 20% missing values.
Table 5.1: Statistics of lake water quality data.
Table 5.2: RMSE on lake water quality data.
Table 5.3: Predictive R2 on lake water quality data.
Table 6.1: Summary statistics for the 4 lake water quality datasets.
Table 6.2: Results for the 4 lake water quality datasets.
Table 6.3: RMSE comparison for original and new regions.

LIST OF FIGURES

Figure 1.1: K-means clustering applied on geospatial features.
Figure 1.2: The study region of the LAGOS-NE database. The blue dots represent lakes with area greater than 4 ha. The study region covers 17 states in the Northeastern and upper Midwest parts of the US.
Figure 3.1: An illustration of the spatial contiguity constraint.
Figure 3.2: Comparison between various constrained spectral clustering algorithms in terms of their landscape homogeneity (SSW) and spatial contiguity (PctML and c). The horizontal axis in the plots corresponds to the parameter value δ.
Figure 3.3: A toy example illustrating the advantage of using the Hadamard product for combining constraints with feature similarity. Each labeled node is a data point, with a solid line representing a must-link constraint between two points and a dashed line representing the absence of such a constraint. The weight of each edge denotes the feature similarity between two data points.
Figure 3.4: Regions for Iowa created by the SCM algorithm using the weighted sum approach (with δ = 0.95).
Figure 3.5: Regions created by the SCM, CSP, MB, SSC and BSSC algorithms for the state of Michigan.
Figure 3.6: The relationship between the α and β parameter values for the CSP algorithm when applied to a synthetic graph data set.
Figure 3.7: Regions developed by 5 hierarchical spatially constrained clustering algorithms for 3 study regions.
Figure 3.8: Comparison between BSSC, HSSC and Ward's method for number of clusters K = 4, 6, 8, 10. The five metrics evaluated are listed on top of each figure, namely: unnormalized delta, contiguity metric c, PctML preserved, SSW and cluster balance. Results for δ = 1/41 are shown in (a) and results of the unnormalized δ = 4/41 are shown in (b).
Figure 4.1: Effect of mean value imputation on regression.
Figure 4.2: Average runtime and accuracy comparison for linear SVR with raw features, nonlinear SVR with raw features, and linear SVR with RFF.
Figure 4.3: Comparing the MSE for linear and nonlinear SVR with mean imputation and matrix completion.
Figure 4.4: Average MSE for linear SVR with raw features, nonlinear SVR with raw features, linear SVR with RFF features, and linear SVR with supervised hash-based features on complete synthetic data.
Figure 4.5: Comparing average MSE of linear SVR on the complete Secchi data.
Figure 4.6: MSE of H-FLIP when varying the number of basis functions and supervised hash-based features.
Figure 5.1: Conceptual figure of data with multiple response variables at different locations.
Figure 5.2: An example of predicting lake nutrient data from climate indexes.
Figure 5.3: Model weights visualization.
Figure 6.1: Example of nested lake ecology data with cross-scale interactions between the local and regional predictor variables.
Figure 6.2: Example of multi-level nested data. The finest level represents the cities. Each city belongs to a county, which in turn, is located within a state in a given country.
Figure 6.3: Percentage of regions in which MLMT performs better than baseline methods.
Figure 6.4: Performance comparison for regions with a limited number of training data.
Figure 6.5: Cross-scale interactions between local and regional predictors for the prediction of total phosphorus (a)-(c) and Secchi depth (d)-(f). For each plot, the horizontal axis denotes the local predictors while the vertical axis denotes the regional predictors.
Figure 6.6: Sensitivity analysis for ρ1, ρ2, ρ3 and ρ4.

LIST OF ALGORITHMS

Algorithm 1: Partitional Spatially-Constrained Spectral Clustering
Algorithm 2: Hierarchical Spatially-Constrained Spectral Clustering (HSSC)
Algorithm 3: H-FLIP Framework
Algorithm 4: Multi-Response Multi-Task Learning (MRMT) framework
Algorithm 5: Multi-Level Multi-Task Learning (MLMT) framework

CHAPTER 1 INTRODUCTION

Over the past decade, due to advances in remote sensing and GPS technology and the proliferation of web-based geographical data sharing services, more and more geospatial data have become available. Geospatial data can be defined as data with geographical information, such as latitude, longitude, address, and zip code [117]. Traditional geospatial data include maps of cities, lakes, and watersheds. Modern geospatial data are more ubiquitous and heterogeneous. Examples include photos captured by smartphones with their location service enabled, tweets posted by users tagged with their address information, and the real-time latitude and longitude information recorded by ride-sharing apps.

Geospatial data contain valuable information that can be used in various application areas, such as environmental science, healthcare, urban planning, and business management. For example, real-time traffic data can be used to predict traffic congestion, historical crime data can be used to predict the locations of future crime events, and satellite images of an area can be used to monitor land cover changes. This wide range of applicability has led to growing interest in developing data mining and machine learning techniques to analyze geospatial data [113, 25, 50, 47, 82, 45, 103]. Such techniques include regression, classification, and clustering.

Regression analysis is the process of estimating the statistical relationship among variables, specifically how a change in a continuous-valued dependent variable is associated with a change in the independent variables. For example, Bertazzon et al. [11] applied a land-use regression (LUR) model to estimate the risks of air pollution. Classification techniques can be used to assign each spatial object to its predefined category [112].
For example, Thil and Wheeler [113] applied a decision tree classifier to predict the shopping destinations of participants living in a metropolitan area, while in [25], Cleve et al. employed a combination of fuzzy and nearest neighbor classification techniques to recognize objects in satellite imagery. Clustering is another widely used data mining technique for analyzing geospatial data. The goal of clustering is to partition the spatial objects into groups in such a way that objects belonging to the same group are more similar (i.e., "closer") to each other compared to those belonging to other groups. Clustering techniques have been applied to various geospatial datasets for region delineation [50, 47] as well as to detect threats to the forest ecosystem from insects, diseases, and other agents [82]. Other techniques such as association rule mining are also applicable to geospatial data. For example, Han et al. [45] applied association rule mining to discover spatial relationships between large towns and nearby lakes or highways.

1.1 Challenges and Motivation

While the large amount of geospatial data and their wide range of applicability have led to the development of numerous data mining algorithms, many practical issues remain that need to be addressed. First, the algorithms must be able to take into account spatial autocorrelation and other spatial constraints defined by the domain. For example, it is known that nearby geospatial objects tend to be more similar to each other compared to objects located far away from each other. This suggests that spatial autocorrelation must be incorporated when analyzing geospatial data in order to produce more realistic and useful results. For example, in cluster analysis for region delineation, spatial contiguity of the regions is a desirable criterion, as contiguous areas of land are useful for research, policy, and management purposes [67].
However, many traditional clustering methods are not designed to produce spatially contiguous clusters. For example, Figure 1.1 shows the result of applying k-means clustering to lake ecology data, where each region (cluster) is represented with a different color on the map. Although some regions appear to be spatially contiguous, many of them are geographically disconnected. Alternative clustering algorithms are therefore needed that can effectively incorporate spatial autocorrelation for applications that require spatially contiguous regions.

Figure 1.1: K-means clustering applied on geospatial features.

Second, the algorithms should be able to handle missing values, which are commonly found in many real-world geospatial datasets. For example, real-time data from GPS devices have been increasingly used to monitor the physical activities of human subjects. However, the signal lapses that are inherent in GPS often lead to traces that are too incomplete for applying more advanced analytical techniques [79]. Another example where missing values are often present is in lake ecology data (see the LAGOS-NE database described in Section 1.2). Measurements of lake water quality variables would be missing if they were not sampled at a given lake on a particular day. Although there are several standard techniques for handling missing values, they are mostly applied during the preprocessing step. For example, a simple strategy is to employ the complete case approach, where all incomplete data points are discarded. This strategy is not effective if there are very few data points without missing values. Another common strategy is to impute each missing value with the average value of the corresponding attribute. However, this strategy fails when the data are skewed or not centered at the mean value.
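To see why mean imputation can be misleading for skewed data, consider the following small sketch. The simulated variable and all numbers here are illustrative only (log-normally distributed values stand in for a right-skewed water quality measurement; they are not taken from any dataset in this thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a right-skewed variable, e.g., a nutrient concentration;
# log-normal distributions are a common model for such quantities.
values = rng.lognormal(mean=1.0, sigma=1.0, size=1000)

# Hide 20% of the measurements at random.
missing = rng.random(values.size) < 0.2
observed = values.copy()
observed[missing] = np.nan

# Mean imputation fills every gap with the observed average, which for
# a right-skewed distribution sits well above the bulk of the data.
fill = np.nanmean(observed)
imputed = np.where(np.isnan(observed), fill, observed)

print(f"median of observed values: {np.nanmedian(observed):.2f}")
print(f"mean used for imputation:  {fill:.2f}")
# Every imputed point lands in the sparse upper tail rather than near
# the typical (median) value, distorting downstream analyses.
```

Because the mean of a right-skewed distribution exceeds its median, all imputed points are pulled toward the tail, which is one reason this thesis argues for handling missing values jointly with the learning task rather than in a separate preprocessing step.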
More importantly, since these strategies are implemented as a data cleaning step, they are oblivious to the needs and characteristics of the subsequent data mining algorithms to be applied to the preprocessed data.

The third challenge is to accurately capture the nonlinear relationship between the geospatial predictors and response variables. Previous studies have shown that many geospatial relationships are nonlinear [86]. For example, Shaker and McCauley [77] showed that the relationship between nutrients and chlorophyll among lakes follows a sigmoidal function. Ehlinger [102] showed that the relationship between aquatic ecological conditions and landscape features in Southern Wisconsin is nonlinear. However, modeling such nonlinear relationships is not only difficult but also computationally expensive.

Finally, geospatial variables may exist at multiple spatial scales, and variables from different scales may interact with each other. For instance, previous work in lake ecology found evidence of cross-scale interactions between geospatial driver variables quantified at local and regional spatial scales for predicting lake nutrients [110]. Designing algorithms that can handle multi-scale relationships in geospatial data sets is another challenge that must be addressed in order to construct more robust predictive models.

In this thesis, I present four novel learning algorithms for mining geospatial data. The algorithms are applied mostly to the ecology domain. This domain was chosen as the testbed for my research as it has many applications, such as regionalization and spatial prediction, that may benefit from the deployment of the algorithms developed in this study. As proof of concept, the proposed algorithms are applied to a newly integrated lake ecology database named LAGOS-NE (LAke multi-scaled GeOSpatial and temporal database) [109]. LAGOS-NE consists of lake ecology data gathered from multiple sources, covering a wide range of spatial scales.
Although the proposed algorithms are mostly evaluated on LAGOS-NE, they are generally applicable to other geospatial datasets with similar characteristics.

1.2 Lake Ecology Database

LAGOS-NE is a multi-scaled geospatial and temporal database integrated from disparate data sources. The database covers a study region spanning 17 states in the Upper Midwestern and Northeastern parts of the United States, as shown in Figure 1.2. The database is divided into two modules: one contains measurements of lake water quality, such as total phosphorus, total nitrogen, and water clarity, while the other includes climate, landscape, and other characteristics of the lakes (including lake location, lake area, and freshwater connectivity).

Figure 1.2: The study region of the LAGOS-NE database. The blue dots represent lakes with area greater than 4 ha. The study region covers 17 states in the Northeastern and upper Midwest parts of the US.

LAGOS-NE was chosen for this study for several reasons. First, it is a rich geospatial dataset, containing variables measured at different spatial scales. Both the volume and variety of the data are large, making it a challenging dataset for applying conventional data mining algorithms. The variables in the database also exhibit many of the properties described in Section 1.1, including spatial autocorrelation, missing values, and nonlinearity, thus providing opportunities for developing novel data mining algorithms. Finally, even though the database was created by integrating data from multiple sources, rigorous quality assurance/quality control (QA/QC) procedures were performed to produce a standardized database with a consistent format and conventions [109].

1.3 Thesis Contributions

This section summarizes the technical contributions of this thesis in addressing the challenges of applying data mining to geospatial datasets.
The proposed data mining algorithms will be applied to various unsupervised and supervised learning problems such as cluster analysis and regression. The specific contributions of this thesis are as follows: • A spatially constrained spectral clustering algorithm is developed for the region delineation problem. Spectral clustering is a clustering algorithm that partitions a collection of data objects into smaller groups by performing an eigen-decomposition of their feature similarity matrix. For geospatial data, the approach can be used to split the geographical landscape units into smaller regions or zones. However, similar to k-means, the existing spectral clustering formulation is not designed to produce regions that are spatially contiguous. The proposed algorithm introduces a flexible way to incorporate spatial relationships into the spectral clustering formulation. The resulting regions created by the proposed algorithm were found to be spatially contiguous and relatively more homogeneous compared to those of existing constrained clustering methods. The formulation can be further extended to create regions that are nested wholly within broader-scale regions. • A supervised hash-based feature learning algorithm is proposed for modeling nonlinear relationships in incomplete geospatial data. The proposed algorithm simultaneously infers the missing feature values and learns a set of nonlinear hash-based features from the incomplete data. The learned hash-based features (i) are complete (i.e., have no missing values), (ii) have lower dimensionality than the original data, (iii) incorporate supervised information, and (iv) enable a linear model to be trained to capture the nonlinear relationship between the predictor and response variables. • A multi-task learning framework is proposed for the joint modeling of geospatial data with multiple response variables.
Instead of training the local model for each response variable at each location independently, the framework simultaneously fits the local models for every response variable at all locations by optimizing a joint objective function with trace-norm regularization. The proposed framework enhances the prediction performance of the local models by incorporating spatial autocorrelation between the different locations as well as correlations between the response variables into its unified formulation. • A multi-level multi-task learning framework is developed for spatial prediction from nested geospatial data. The framework trains an independent model for each region using both its fine-level and coarse-level features. It also allows information to be shared among the various models through their common set of latent vectors. In addition, the framework can automatically identify potential cross-scale interactions between the fine-level and coarse-level variables of the geospatial data. 1.4 Roadmap The remainder of the thesis is organized as follows: Chapter 2 presents a literature review of existing techniques related to this research. Chapter 3 describes the proposed spatially constrained spectral clustering approach for region delineation. Chapter 4 introduces the hash-based feature learning algorithm for incomplete geospatial data. Chapter 5 presents a multi-task learning approach for modeling geospatial data with multiple response variables. Chapter 6 discusses the proposed multi-level multi-task learning approach for handling nested geospatial data. Finally, Chapter 7 concludes the work. The findings from three of the chapters in the thesis have been accepted for publication at various data mining conferences. Chapter 3 is based on materials taken from Yuan et al. [146], which was accepted for oral presentation at the 2015 IEEE International Conference on Data Science and Advanced Analytics Special Session on Environmental and Geo-spatial Data Analytics.
Chapter 4 is taken from a conference paper presented at the 2017 SIAM International Conference on Data Mining [145]. Chapter 6 is adapted from a manuscript that was recently accepted for the 2017 IEEE International Conference on Data Mining. CHAPTER 2 LITERATURE REVIEW This chapter presents an overview of previous work related to this study. Section 2.1 introduces the general problem of mining geospatial data. Section 2.2 provides an overview of feature learning and hash-based methods. Section 2.3 presents previous work on handling incomplete data. Section 2.4 presents existing work on multi-task learning, while Section 2.5 describes previous work on constrained spectral clustering. 2.1 Mining Geospatial Data Geospatial data abound, collected from various sources including field sampling, satellite imagery, GPS, and geotagging services. This type of data has unique properties that distinguish it from others. First, it is geo-referenced, i.e., it contains location information such as latitude, longitude, height/elevation, address, or zip code. Second, in addition to their location information, geospatial data often contain non-spatial features characterizing properties associated with the locations. These non-spatial features typically exhibit spatial patterns, which leads to the spatial autocorrelation associated with geospatial data. Third, geospatial data may contain variables defined at multiple spatial scales. These variables may interact at different characteristic spatial scales to produce the nonlinear relationships observed in the data [110]. Due to these unique characteristics, special considerations and techniques are needed for mining geospatial data. Classification is a data mining technique that seeks to categorize data objects into their corresponding class labels [112].
For geospatial data, the classification technique must consider not only the relationship between the predictor and response variables but also their spatial dependencies, as previous studies have shown that classification performance can be improved by incorporating such dependencies into the models [46, 100, 40, 81]. There are many applications where classification can be applied to geospatial data. For example, Thil and Wheeler [113] used a top-down spatial decision tree based on information gain to predict travel destination choices. Cleve et al. [25] used a classification algorithm to recognize land-use and land-cover categories from 3-band aerial imagery. Hollister et al. [49] applied random forests to land use, land cover, and lake morphometry data to predict lake trophic state in order to monitor ecosystem condition. Clustering is another popular data mining technique for geospatial data. For example, it can be applied to create ecological regions of the landscape based on land cover and land use features. Early works such as Openshaw [89] proposed a two-step approach, in which a conventional clustering method was applied followed by a cluster refinement step to separate clusters that were not geographically connected. Host et al. [50] performed hierarchical k-means clustering on monthly temperature and precipitation data across northwestern Wisconsin to identify clusters (regions) with similar seasonal climatic patterns. Hargrove et al. [47] applied k-means clustering on elevation, climatic, and edaphic factors to generate regions. Besides regionalization, clustering techniques can also be used for threat identification. For example, Mills et al. [82] applied k-means clustering to satellite images and analyzed the transition distance between cluster assignments in any two years in order to identify threats to the forest ecosystem.
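The k-means-based regionalization described above can be illustrated with a minimal sketch. The data here are synthetic and the feature construction is hypothetical; note also that clustering on features alone, as discussed later in this thesis, does not by itself guarantee spatially contiguous regions.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Hypothetical spatial units on a 10x10 grid, each described by two
# climate-like features (e.g., mean temperature and precipitation).
rng = np.random.default_rng(0)
coords = np.array([(i, j) for i in range(10) for j in range(10)], dtype=float)

# Features follow a smooth west-to-east gradient plus noise, so nearby
# units tend to have similar values (a crude stand-in for spatial
# autocorrelation in real climate data).
features = np.column_stack([coords[:, 1] * 0.5, 10.0 - coords[:, 1] * 0.3])
features += rng.normal(0.0, 0.2, features.shape)

# Partition the 100 spatial units into k = 3 candidate regions
# purely by feature similarity.
_, labels = kmeans2(features, 3, minit='++', seed=0)
print(sorted(np.bincount(labels)))  # sizes of the three regions
```

Because the cluster assignment ignores the units' coordinates, nothing in this procedure prevents a region from consisting of disconnected patches, which motivates the refinement and constraint-based strategies surveyed above.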
2.2 Feature Learning Feature learning [10] is the task of extracting an alternative feature representation of a given data set. Feature learning methods can help improve the performance of predictive models in many ways, such as reducing the dimensionality of large-scale data to improve training efficiency and removing noisy, redundant, or irrelevant features present in the data. Classical methods such as principal component analysis [90] were developed to create features that preserve the variability in the original data. These methods are mostly unsupervised and thus provide no guarantee about the usefulness of their extracted features for predictive modeling tasks. More recently, methods such as stacked autoencoders [14, 147] and deep learning [61, 149] have been proposed to create hierarchical features from the data. Although such methods have been successfully applied to applications such as image classification [61], they are expensive to train and require considerable human effort to design the right architecture for a given prediction problem. Hashing is another feature learning technique that has attracted considerable attention in recent years. The goal of hash-based feature learning is to transform the data into easily computable features that preserve the underlying properties of the data. For example, the Min-hash method [15] was designed to create features that preserve the Jaccard similarity between instances. Random hyperplane-based hashing was introduced in [20] to preserve cosine similarity. These features can help improve the efficiency of processing similarity search queries in large data sets. Hash-based methods have also been developed for other similarity measures [131, 85]. In general, a hash function is defined as y = h(x), where y is called the hash code or signature and h(·) is the hash function. Wang et al.
[126] divided hashing methods into two categories: those that do not consider the data distribution and those that learn the hash function from the distribution of the data. The former category includes locality sensitive hashing (LSH), a family of hashing techniques that map instances with similar features into the same hash code with high probability [52]. LSH can be viewed as a probabilistic similarity-preserving dimension reduction method. Many variants of LSH have been designed to provide an unbiased set of features that preserve certain distance metrics [15, 20, 131, 85]. This includes the Min-hash [15] and random hyperplane [20] methods described in the previous paragraph. Unlike the data-independent hashing methods described in the previous paragraph, the second category encompasses techniques that learn hash functions from a given input dataset. Three elements must be considered in this learning-to-hash scheme [126]: the distance metric, the hash function, and the optimization criterion. For example, Weiss et al. [131] proposed an algorithm known as spectral hashing to learn a binary code representation of the data such that the Hamming distance between data instances in terms of their hash codes correlates with the similarity of their features. The hash function used is given by h(x) = sgn(sin(π/2 + γ w^T x)). Let y_i ∈ R^K be the hash code of length K for the i-th data instance and Y = [y_1, ..., y_n]. The spectral hashing algorithm is designed to solve the following optimization problem: min_Y Trace(Y(D − W)Y^T) s.t. Y1 = 0, YY^T = I, y_im ∈ {−1, 1}, where W is a similarity matrix with W(i, j) = exp(−‖x_i − x_j‖²/ε²) and D is a diagonal matrix whose diagonal elements are the corresponding row sums of W. Kulis and Darrell [62] proposed an optimization scheme that minimizes the difference between the Euclidean distance in the original feature space and the Hamming distance in the hash code feature space.
Specifically, for a given input data set X = [x_1, x_2, ..., x_n], the objective function of their optimization is given as follows: min_W Σ_{(i,j) ∈ N} [d_M(x_i, x_j) − d_R(x_i, x_j)]², where d_M(x_i, x_j) = ½ ‖x_i − x_j‖², d_R(x_i, x_j) = (1/K) Σ_{k=1}^{K} (h_k(x_i) − h_k(x_j))², and h_p(x) = sgn(Σ_q W_pq κ(x_pq, x)). In this formulation, K is the number of hash functions, κ(x_i, x_j) is the kernel function over the data, and W is the weight matrix to be learned. 2.3 Incomplete Geospatial Data Missing values are a common problem encountered when dealing with real-world datasets. A missing value means there is no data value stored for a given feature. There are different types of missing values: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) [70]. Some of the popular techniques for dealing with missing values include: (1) using only the complete cases by discarding instances with missing values; and (2) imputing the missing values of each feature with the feature's average value. In addition, model-based approaches have been developed as well, such as those based on maximum likelihood, Bayesian, and multiple imputation methods [69]. However, most of these techniques are parametric methods, which assume that the data follow standard probability distributions such as the Gaussian. More recently, matrix completion has emerged as an increasingly popular technique for handling missing values in an input data matrix. Matrix completion employs an optimization scheme for completing a partially observed matrix under the assumption that the original matrix has low rank. Specifically, the matrix completion approach can be cast into the following constrained rank minimization problem: min_X rank(X) s.t. X_ij = M_ij, (i, j) ∈ Ω, (2.1) where M is the data matrix we wish to recover, X is a low-rank approximation of M, and Ω is the set of indices of the non-missing elements of M.
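Solving the rank minimization problem exactly is intractable, so practical methods work with nuclear-norm surrogates. The following is a minimal sketch of matrix completion by iterative singular-value soft-thresholding, in the spirit of the singular value thresholding algorithm discussed next in this section; the step-size and threshold choices here are illustrative heuristics, not prescribed values.

```python
import numpy as np

def svt_complete(M_obs, mask, tau, n_iter=500):
    """Approximate matrix completion by singular value thresholding:
    soft-threshold the singular values (the proximal step for the
    nuclear norm), then correct the fit on the observed entries."""
    delta = min(1.2 * mask.size / mask.sum(), 1.9)  # heuristic step size
    Y = np.zeros_like(M_obs)
    X = Y
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt   # shrink singular values
        Y = Y + delta * mask * (M_obs - X)        # dual ascent on Omega
    return X

# Recover a 40x40 rank-2 matrix from roughly 60% of its entries.
rng = np.random.default_rng(0)
M = rng.standard_normal((40, 2)) @ rng.standard_normal((2, 40))
mask = rng.random(M.shape) < 0.6                  # observed index set Omega
X_hat = svt_complete(M * mask, mask, tau=5 * 40)  # tau ~ 5*sqrt(n1*n2)
rel_err = np.linalg.norm(X_hat - M) / np.linalg.norm(M)
```

With enough observed entries relative to the rank, the relative recovery error becomes small, which is what the theory discussed below formalizes.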
Candès and Recht [18] proved that most low-rank matrices can be effectively recovered when the number of observed entries m satisfies the inequality m > C n^1.2 r log(n), where n is the dimension of the n × n matrix M, r is the desired rank of the matrix X, and C is a positive constant. Since solving the rank minimization problem is computationally NP-hard, an alternative is to approximate the rank with the nuclear norm and solve the convex relaxation shown in Equation (2.2) using semidefinite programming: min_X ‖X‖_* s.t. X_ij = M_ij, (i, j) ∈ Ω. (2.2) Cai et al. [17] proposed a singular value thresholding (SVT) algorithm for solving the optimization problem shown in Equation (2.3), which reformulates the objective function as follows: min_X τ ‖X‖_* + ½ ‖X‖_F² s.t. X_ij = M_ij, (i, j) ∈ Ω. (2.3) They showed that the solution of the constrained nuclear norm minimization problem in Equation (2.3) converges to the solution of Equation (2.2) as τ approaches infinity. Faster algorithms based on the accelerated proximal gradient method have been proposed to solve least squares minimization problems with nuclear norm regularization [115, 56]. For matrix completion, instead of adding the equality constraint X_ij = M_ij as in Equations (2.2) and (2.3), a least squares loss term can be added to the objective function. The matrix completion formulation can then be cast into the following objective function: min_X ½ ‖A(X) − b‖₂² + µ ‖X‖_*, (2.4) where A : R^{m×n} → R^p is a linear map that extracts the non-missing entries of its input matrix, b ∈ R^p is the vector of observed values, and µ is a predefined parameter. 2.4 Multi-task Learning Multi-task learning (MTL) is a machine learning approach for solving multiple related prediction problems at the same time by capturing their shared information [19]. The rationale for using MTL is that, by incorporating the relationships between different tasks, one can improve the learning ability.
Over the past decades, MTL has been successfully used in many learning schemes, including regression [137, 151], clustering [32], and classification [142, 140]. MTL can be applied to a wide range of applications, such as disease progression prediction [151], Web image and video search [129], as well as web page categorization [22]. The simplest way to integrate different kinds of task relationships is to introduce a regularization term into the learning formulation. Since the assumption of how the tasks are related differs across applications, this has led to the design of different regularization terms. For example, one simple assumption is that the task parameters are close to each other. This intuitive assumption leads to a regularization term based on the mean value of all the task parameters [33]. Another common assumption is that the model parameters share a low-rank representation. Since the rank of the model parameter matrix is hard to optimize, it can be approximated by trace norm regularization instead of minimizing the matrix rank directly. For example, Chen et al. [23] proposed a robust MTL algorithm that can learn multiple tasks simultaneously while identifying irrelevant tasks. Argyriou et al. [2] generalized the single-task L1 norm and proposed a method to learn a sparse representation shared across different models. Kumar et al. [63] assumed that each model is a linear combination of a finite set of base models. Since many geospatial datasets involve predictions at multiple locations, they can be naturally cast into a multi-task learning framework. For example, Xu et al. [138] presented a multi-task learning formulation for predicting monthly precipitation at 37 weather stations in Canada based on geospatial data. Xu et al. [139] proposed a weighted incremental multi-task learning algorithm that simultaneously identifies the latent factors through supervised tensor decomposition and learns spatio-temporal models.
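To make the trace-norm regularization concrete, the sketch below fits least-squares models for several tasks by proximal gradient descent, soft-thresholding the singular values of the stacked weight matrix W = [w_1, ..., w_T] (the proximal operator of the trace norm). This is a generic illustration, not the specific formulation of any of the cited papers; all parameter values are arbitrary.

```python
import numpy as np

def mtl_trace_norm(Xs, ys, lam=0.1, lr=0.1, n_iter=500):
    """Multi-task least squares with trace-norm regularization:
        min_W  sum_t ||X_t w_t - y_t||^2 / (2 n_t) + lam * ||W||_*
    solved by proximal gradient descent, where W = [w_1, ..., w_T]."""
    d, T = Xs[0].shape[1], len(Xs)
    W = np.zeros((d, T))
    for _ in range(n_iter):
        # Per-task gradients of the least-squares losses.
        G = np.column_stack([Xs[t].T @ (Xs[t] @ W[:, t] - ys[t]) / len(ys[t])
                             for t in range(T)])
        # Proximal step: soft-threshold the singular values of W,
        # which encourages a shared low-rank structure across tasks.
        U, s, Vt = np.linalg.svd(W - lr * G, full_matrices=False)
        W = (U * np.maximum(s - lr * lam, 0.0)) @ Vt
    return W

# Two related tasks whose true weight vectors span a common 1-d subspace.
rng = np.random.default_rng(0)
w_true = rng.standard_normal(10)
Xs = [rng.standard_normal((100, 10)) for _ in range(2)]
ys = [Xs[0] @ w_true, Xs[1] @ (2.0 * w_true)]
W = mtl_trace_norm(Xs, ys)
rel_err = np.linalg.norm(W[:, 0] - w_true) / np.linalg.norm(w_true)
```

Because the two tasks' true weights are linearly dependent, the trace-norm penalty is essentially free here, and each recovered column stays close to its task's true weights.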
Zhou et al. [151] proposed an MTL framework for predicting disease progression. The framework considers the prediction at each time stamp as a task, and task relatedness is captured by a temporal group lasso regularizer. Zhao et al. [148] built predictive models for spatio-temporal event forecasting. Their MTL model improved prediction performance by utilizing both static features and dynamic features generated from a multi-task feature learning framework, and by using the shared information between different locations. 2.5 Constrained Spectral Clustering Constrained clustering is a semi-supervised learning approach that utilizes side information from the domain to improve clustering performance. Based on domain knowledge, experts can explicitly specify which pairs of data instances must be in the same cluster (i.e., must-link (ML) constraints) and which pairs must not be in the same cluster (i.e., cannot-link (CL) constraints). For example, a constrained k-means clustering approach was developed in [125]. Shi et al. [105] proposed a constrained co-clustering algorithm that takes into account both feature similarity and the ML and CL constraints. De Bie et al. [12] proposed a method that constrains the eigenspace onto which the cluster membership vector is projected. Coleman et al. [26] extended the approach in [12] to handle the situation in which some of the constraints are potentially inconsistent with each other. In this thesis, we focus on constrained spectral clustering. Spectral clustering [104, 75] is a well-known clustering method that uses the eigenvector spectrum of a feature similarity matrix to find the underlying clusters of a given data set.
Advantages of using spectral clustering include its flexibility in incorporating diverse types of similarity functions, the superiority of its clustering solutions compared to the traditional k-means algorithm [84], and its well-established theoretical properties (including the consistency [120] and convergence [121] guarantees of the algorithm). There are two popular ways to incorporate constraints into the spectral clustering framework. The first category encompasses methods that directly alter the graph Laplacian matrix. For example, Kamvar et al. [57] employed a Gaussian kernel as their similarity matrix and considered a binary constraint matrix in which ML constraints are assigned a value of 1 and CL constraints a value of 0. Kawale and Boley [13] proposed adding an ℓ1-regularizer that penalizes constraint violations to the graph cut objective function. However, their method was designed for constrained spectral clustering with only two clusters. Another way to alter the matrix is to take a weighted sum of the feature similarity matrix and the adjacency matrix of the constraint graph. The modified graph Laplacian is then a convex combination of the original graph Laplacian and the Laplacian induced by the constraint matrix. The second category of approaches incorporates domain constraints by restricting the feasible solution set of the spectral clustering algorithm. For example, Wang and Davidson [128] proposed an algorithm that optimizes the objective function of spectral clustering while requiring that the number of satisfied ML and CL constraints exceed a predefined threshold. CHAPTER 3 SPATIALLY CONSTRAINED SPECTRAL CLUSTERING FOR REGIONALIZATION 3.1 Introduction A regionalization framework delineates the geographical landscape into spatially contiguous, homogeneous units known as regions or zones.
Regionalizations are important because they provide the spatial framework used in many disciplines, including landscape ecology, environmental science, and economics, as well as for applications such as public policy and natural resources management [24, 73, 39, 76]. For example, the hierarchical system of hydrologic units described in [101] provides a standardized regionalization framework that has been widely used in water resource and land use studies [31]. Abell et al. [1] have also developed a global biogeographic regionalization framework that serves as a useful tool for studying biodiversity in freshwater systems and for conservation planning efforts. McMahon et al. [78] divide existing multivariate regionalization methods into two categories: qualitative and quantitative. In qualitative methods, regions with similar landscape characteristics are delineated by experts from multiple maps of different geographic features using manual visual interpretation [4, 88]. In quantitative methods, clustering approaches such as k-means and hierarchical clustering [50, 47, 55] are used to partition the geographical area into smaller regions. Although quantitative clustering approaches provide a more systematic and reproducible way to identify regions compared to qualitative approaches, one potential limitation of existing clustering methods is that the regions created may not be spatially contiguous. Region contiguity is a desirable criterion for many applications that treat regions as individual entities for purposes including research, policy, and management (e.g., site-specific management in precision agriculture [67]). Therefore, alternative methods are needed that can effectively cluster similar areas based on multiple mapped variables but have the added constraint of being spatially contiguous.
In the preliminary version of this work [146], we presented a spatially constrained spectral clustering framework that uses a truncated exponential kernel [58] to produce spatially contiguous and homogeneous regions [57, 13, 105, 128]. In this chapter, we extend the formulation to create hierarchical regions, where fine-scaled regions can be nested within broad-scale regions. Creating such nested regions is extremely useful for many applications because hierarchical structure is often held up as a fundamental feature of both the natural world and complex systems (as reviewed in [135]). In fact, the world's biomes and ecological regions have often been delineated in a nested hierarchical structure [5, 1]. Constrained versions of hierarchical clustering techniques [83] such as single-link [66], complete-link [95], UPGMA [60, 98], and Ward's method [53, 134] have often been used to create such nested regions. However, as will be shown in this study, the regions generated by such methods tend to be highly imbalanced in terms of their sizes and thus are not as suitable for many applications, including resource planning and management. We use a recursive bisection approach to extend our formulation in [146] to hierarchical clustering. Our top-down approach for creating nested regions is different from the bottom-up approach commonly used by existing methods [53, 95, 60, 98, 134]. Using three criteria for region evaluation (landscape homogeneity, region contiguity, and region size), our experimental results suggest that the proposed framework outperforms three other constrained hierarchical clustering methods in two out of the three criteria. For example, it consistently produces regions that are more homogeneous and balanced in region size compared to the spatially constrained complete-link [95] and UPGMA [60, 98] algorithms.
Our proposed algorithm also outperforms the constrained version of Ward's method [134] in terms of producing regions that are spatially contiguous and approximately uniform in size. Finally, although the spatially constrained single-link method [66] is also capable of producing regions that are homogeneous and contiguous, it tends to create one or two very large regions that cover the majority of the landscape area. An ad hoc parameter for maximum region size is needed by the spatially constrained single-link method to prevent the formation of such large regions [96]. Tuning this parameter is cumbersome, as it must be done at every level of the hierarchy since the maximum region size depends on the number of regions. Our proposed hierarchical method does not have this problem because its objective function, which is based on the normalized cut criterion [104] used in spectral clustering, is inherently biased towards producing more uniformly sized clusters. The remainder of this chapter is organized as follows. Section 3.2 reviews previous work on the development of regionalization frameworks, constrained clustering, and hierarchical clustering methods. Section 3.3 formalizes the region delineation problem and presents an overview of spectral clustering. Section 3.4 describes the different ways in which spatial constraints can be incorporated into the spectral clustering framework. It also presents the partitional and hierarchical implementations of our proposed spatially constrained spectral clustering framework. Section 3.5 describes the application of spatially constrained spectral clustering algorithms to the region delineation problem. Section 3.6 concludes with a summary of the results of this study. 3.2 Related Work Region delineation has traditionally been studied as a spatial clustering [44] problem. Duque et al. [31] classified the existing data-driven approaches into two categories.
The first category does not require explicit representation and incorporation of spatial constraints into the clustering procedure. Instead, the constraints are satisfied by post-processing the clusters or by optimizing other related criteria. For example, Openshaw [89] applied a conventional clustering method followed by a cluster refinement step to split clusters that contained geographically disconnected patches. The second category of methods explicitly incorporates spatial constraints into the clustering algorithm [31]. Examples of such methods include adapted hierarchical clustering, exact optimization methods, and graph-theory-based methods. This second category also encompasses the constrained clustering methods [3, 125, 29] developed in the fields of data mining and machine learning. Constrained clustering [7, 125] is a semi-supervised learning approach that uses domain information provided by users to improve clustering results. The domain information is typically provided as must-link (ML) and cannot-link (CL) constraints to be satisfied by the clustering solution. ML constraints restrict the pairs of data points that must be assigned to the same cluster, whereas CL constraints specify the pairs of points that must be assigned to different clusters. For example, Kamvar et al. [57] used the ML and CL constraints to define the affinity matrix of the data. Shi et al. [105] proposed a constrained co-clustering method that considers both the similarity of features as well as the ML and CL constraints. All of these methods were designed to manipulate the graph Laplacian matrix using the available domain constraints. There has also been growing interest in developing constraint-based approaches for spectral clustering [57, 13, 105, 128, 28]. For example, De Bie et al. [12] developed an approach that restricts the eigenspace onto which the cluster membership vector is projected.
Wang and Davidson [128] proposed a constrained spectral clustering method that considers real-valued constraints and imposes a threshold on the minimum number of constraints that must be satisfied by a feasible solution. However, none of these constrained spectral clustering methods were designed for the region delineation problem. The framework presented in our previous work [146] employs a Hadamard product to combine the feature similarity matrix with spatial contiguity constraints, which is similar to the approach used by Craddock et al. [28] for generating an ROI atlas of the human brain using fMRI data. However, unlike the approach used in [28], we consider a truncated exponential kernel to relax the spatial neighborhood constraints and perform extensive experiments comparing the framework to various constrained spectral clustering algorithms. Current constrained spectral clustering algorithms have also focused primarily on partitional clustering. They require the number of clusters to be specified a priori. In contrast, hierarchical methods generate a nested set of clusterings for every possible number of clusters. The hierarchy of clusters, also known as a dendrogram, can be created either in a top-down (i.e., divisive hierarchical clustering) or bottom-up (i.e., agglomerative hierarchical clustering) fashion [112, 55, 54]. Some of the widely used agglomerative hierarchical clustering algorithms include single link [53], complete link [95], group average (UPGMA) [60, 98], and Ward's method [55, 134], whereas examples of divisive hierarchical clustering algorithms include minimum spanning tree [42] and bisecting k-means [99]. Current approaches for creating nested regions are mostly based on different variations of agglomerative hierarchical clustering. Each of these variations has its own strengths and limitations [83]. For example, the single-link method can identify irregularly shaped clusters but is highly sensitive to noise [112].
In contrast, Ward's method can minimize the cluster variance but is susceptible to the inversion problem [83]. Unfortunately, many of these agglomerative methods can produce regions with highly imbalanced sizes, which is not desirable for many applications [6]. 3.3 Preliminaries This section formalizes region delineation as a constrained clustering problem and presents a brief overview of spectral clustering and its constraint-based methods. 3.3.1 Region Delineation as a Constrained Clustering Problem Consider a data set D = {(x_i, s_i)}_{i=1}^{N}, where x_i ∈ R^d is a d-dimensional vector of landscape features associated with the geo-referenced spatial unit s_i ∈ R². Let R = {1, 2, ..., k} denote the set of region identifiers, where k is the number of regions, and C = {(s_i, s_j, C_ij)} denote the set of spatial constraints. For region delineation, we consider only ML constraints and represent them using a constraint matrix C defined as follows: C_ij = 1 if s_i and s_j are spatially adjacent, and C_ij = 0 otherwise. (3.1) The goal of region delineation is to learn a partition function V that maps each spatial unit s_i to its corresponding region identifier r_i ∈ R in such a way that (1) maximizes the similarity between the spatial units in each region and (2) minimizes the constraint violations in the set C. 3.3.2 Spectral Clustering Spectral clustering is a class of partitional clustering algorithms that relies on the eigendecomposition of an input affinity (similarity) matrix S to determine the underlying clusters of the data set. Let {x_1, x_2, ..., x_N} be the set of points to be clustered. To apply spectral clustering, we first compute the affinity matrix S between every pair of data points. The affinity matrix is used to construct an undirected weighted graph G = (V, E), where V is the set of vertices (one for each data point) and E is the set of edges between pairs of vertices.
The weight of each edge is given by the affinity between the corresponding pair of data points. The Laplacian matrix of the graph is defined as $L = D - S$, where D is a diagonal matrix whose diagonal elements correspond to $D_{ii} = \sum_j S_{ij}$. The goal of spectral clustering is to create a set of partitions on the graph G in such a way that minimizes the graph cut while maintaining a balanced size of the cluster partitions [75]. The spectral clustering solution can be found by solving the following optimization problem [75]:

$$\arg\min_r r^T L r \quad \text{s.t.} \quad r^T D r = \sum_i D_{ii}, \quad \mathbf{1}^T D r = 0 \quad (3.2)$$

where $\mathbf{1}$ and $\mathbf{0}$ are vectors whose elements are all 1s and 0s, respectively. The solution for r is obtained by solving the following generalized eigenvalue problem: $L r = \lambda D r$. To obtain k clusters, we first extract the top k generalized eigenvectors and apply a standard clustering algorithm such as k-means to the data matrix generated from the eigenvectors.

3.3.3 Constrained Spectral Clustering

Current methods for incorporating constraints into spectral clustering algorithms can be divided into two categories. The first category encompasses methods that directly alter the graph Laplacian matrix, e.g., by applying a weighted sum between the feature similarity matrix S and the constraint matrix C given in Equation (3.1):

$$\text{Weighted sum:} \quad S^{total}(\delta) = (1 - \delta) S + \delta C, \quad (3.3)$$

where $\delta \in [0, 1]$ is a parameter that controls the trade-off between maximizing cluster homogeneity and preserving the constraints of the data. When $\delta$ approaches zero, the clustering solution is more biased towards maximizing the feature similarity, whereas when $\delta$ approaches one, it is more biased towards preserving the constraints. Let D and $D^{(c)}$ be the diagonal matrices constructed from the feature similarity matrix (S) and constraint matrix (C) in the following way:

$$D_{ii} = \sum_j S_{ij}, \qquad D^{(c)}_{ii} = \sum_j C_{ij}.$$
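As a rough illustration of how the generalized eigenvalue problem $Lr = \lambda Dr$ behind Equation (3.2) is solved in practice, consider the sketch below. The data and kernel bandwidth are arbitrary, and for the special case of k = 2 we read cluster labels off the sign of the second generalized eigenvector instead of running k-means, purely to keep the example deterministic.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs (3 points each).
X = np.vstack([rng.normal(0.0, 0.1, (3, 2)),
               rng.normal(5.0, 0.1, (3, 2))])

# RBF affinity matrix S and degree matrix D (Section 3.3.2).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
S = np.exp(-sq / (2 * 1.0 ** 2))
np.fill_diagonal(S, 0.0)
D = np.diag(S.sum(axis=1))
L = D - S

# Generalized eigenproblem L r = lambda D r; the eigenvectors with the
# smallest eigenvalues embed the points.  For k = 2, the sign of the
# second eigenvector already separates the two blobs.
vals, vecs = eigh(L, D)
labels = (vecs[:, 1] > 0).astype(int)
```

With well-separated blobs the affinity graph is nearly disconnected, so the second generalized eigenvector is approximately piecewise constant on the two components, which is what makes the sign test work.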
Using Equation (3.3), it can be shown that the modified graph Laplacian is given by a convex combination of the graph Laplacian for the feature similarity matrix and the graph Laplacian for the constraint matrix, i.e.,

$$L^{total} = D^{total} - S^{total} = (1 - \delta)(D - S) + \delta (D^{(c)} - C) \quad (3.4)$$

The weighted sum approach described above is a special case of the spectral constraint modeling (SCM) algorithm proposed by Shi et al. [105]. The altered graph Laplacian can be substituted into Equation (3.2), which, in turn, allows us to apply existing spectral clustering algorithms to identify the regions.

$$\text{SCM:} \quad \arg\min_{r \in \mathbb{R}^N} r^T L^{total} r \quad \text{s.t.} \quad r^T D^{total} r = \sum_i D^{total}_{ii}, \quad \mathbf{1}^T D^{total} r = 0. \quad (3.5)$$

The second category of approaches for incorporating domain constraints is to alter the feasible solution set of the spectral clustering algorithm. For example, Wang and Davidson [128] proposed the CSP algorithm, which optimizes the following objective function:

$$\text{CSP:} \quad \arg\min_{r \in \mathbb{R}^N} r^T \bar{L} r \quad \text{s.t.} \quad r^T \bar{C} r \geq \alpha, \quad r^T r = vol(G), \quad r \neq D^{1/2} \mathbf{1}, \quad (3.6)$$

where $\bar{L} = D^{-1/2} L D^{-1/2}$ and $\bar{C} = D_c^{-1/2} C D_c^{-1/2}$ are the normalized graph Laplacian and normalized constraint matrix, respectively. The threshold $\alpha$ gives a lower bound on the amount of constraints in C that must be satisfied by the clustering solution. Instead of setting the parameter $\alpha$ directly, Wang and Davidson [128] require users to specify a related parameter $\beta$, which was shown to be a lower bound for $\alpha$.

3.4 Spatially Constrained Spectral Clustering

In this section, we describe the various ways to represent spatial contiguity constraints and to incorporate them into the spectral clustering framework.

3.4.1 Kernel Representation of Spatial Contiguity Constraints

For constrained spectral clustering, we can define a corresponding constraint graph $G_C = (V, E_C)$, where V is the set of data points and $E_C$ is the set of edges whose weights are defined as follows:

$$E_{ij} = \begin{cases} 1 & \text{if } (v_i, v_j) \text{ is an ML edge;} \\ -1 & \text{if } (v_i, v_j) \text{ is a CL edge;} \\ 0 & \text{otherwise.} \end{cases} \quad (3.7)$$

For region delineation, the vertices of the constraint graph correspond to the set of spatial units to be clustered, while the ML edges correspond to pairs of spatial units that are adjacent to each other. It is also possible to define a CL edge between every pair of spatial units that are either located too far away from each other or are obstructed by certain barriers (e.g., large bodies of water) that make them unreasonable to assign to the same region. However, since the number of CL edges tends to grow almost quadratically with the number of points, this severely affects the runtime of the spectral clustering algorithm. Furthermore, the ML edges are often sufficient to provide guidance on how to form spatially contiguous regions. For these reasons, we consider constraint graphs that have ML edges only in this chapter.

Let C denote the adjacency matrix representation of the edge set $E_C$. A constrained spectral clustering algorithm is designed to produce solutions that are consistent with the constraints imposed by $G_C$. Unfortunately, for region delineation, it may not be sufficient to use the adjacency information between neighboring spatial units to control the trade-off between spatial contiguity and landscape homogeneity of the regions. To improve its flexibility, we introduce a spatially constrained kernel matrix, $S_c$. The simplest form of the kernel is a linear kernel, which is defined as follows:

$$\text{Linear Kernel:} \quad S_c^{linear} = C \quad (3.8)$$

More generally, we can define an exponential kernel [58] on the adjacency matrix C as follows:

$$\text{Exponential Kernel:} \quad S_c^{exp} = e^C = \sum_{k=0}^{\infty} \frac{C^k}{k!} = I + C + \frac{1}{2!} C^2 + \frac{1}{3!} C^3 + \cdots \quad (3.9)$$

where I is the identity matrix. Since we consider only ML constraints, the k-th power of the adjacency matrix C represents the number of ML paths of length k that exist between every pair of vertices.
An ML path between vertices $(v_i, v_j)$ refers to a sequence of ML edges $e_1, e_2, \cdots, e_m$ such that the initial vertex of $e_1$ is $v_i$ and the terminal vertex of $e_m$ is $v_j$. It can be shown that $S_c^{exp}$ is a symmetric, positive semi-definite matrix, and thus, is a valid kernel [58]. Furthermore, as the diameter of the constraint graph is finite, we also consider a truncated version of the exponential kernel:

$$\text{Truncated Exponential Kernel:} \quad S_c^{trunc}(\delta) \equiv \sum_{k=0}^{\delta} \frac{C^k}{k!} \quad (3.10)$$

where the parameter $\delta$ controls the ML neighborhood size of a vertex. The ML neighborhood specifies the set of vertices that should be in the same region as the vertex under consideration.

Figure 3.1: An illustration of spatial contiguity constraint.

As an example, consider the graph shown in Figure 3.1. When $\delta = 1$, the ML neighborhood for vertex A corresponds to its immediate neighbors, B, C, D, and E. When $\delta = 2$, the ML neighborhood of vertex A is expanded to include vertices that are located within a path of length 2 or less from A, i.e., B, C, D, E, F, G, H, and I. When $\delta = 3$, the ML neighborhood for vertex A includes all of the vertices in the graph. Note that each term in the summation given in Equation (3.9) is normalized by the path length; therefore, a vertex that is located farther away from a given vertex has less influence than a nearer vertex. Finally, the truncated exponential kernel matrix can be binarized so that it can be interpreted as an adjacency matrix for an expanded constraint graph, whose ML neighborhood size is given by the parameter $\delta$:

$$\text{Binarized Truncated Exponential Kernel:} \quad S_c^{bin}(\delta) \equiv I\left[\sum_{k=0}^{\delta} C^k > 0\right] \quad (3.11)$$

where $I[\cdot]$ is an indicator function whose value is equal to 1 if its argument is true and 0 otherwise. Both the truncated and binarized truncated exponential kernels allow us to vary the degree to which the original constraint graph should be satisfied. As $\delta$ increases, the constraint satisfaction becomes more relaxed.
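The two kernels in Equations (3.10) and (3.11) can be sketched directly from the adjacency matrix; the implementation below is a minimal illustration on a three-vertex path graph, not the Matlab code used in the experiments.

```python
import numpy as np
from math import factorial

def truncated_exp_kernel(C, delta):
    """S_c^trunc(delta) = sum_{k=0}^{delta} C^k / k!  (Equation (3.10))."""
    n = C.shape[0]
    S, Ck = np.zeros((n, n)), np.eye(n)
    for k in range(delta + 1):
        S += Ck / factorial(k)
        Ck = Ck @ C          # next power of the adjacency matrix
    return S

def binarized_kernel(C, delta):
    """S_c^bin(delta) = I[sum_{k=0}^{delta} C^k > 0]  (Equation (3.11))."""
    n = C.shape[0]
    total, Ck = np.zeros((n, n)), np.eye(n)
    for k in range(delta + 1):
        total += Ck
        Ck = Ck @ C
    return (total > 0).astype(float)

# Path graph A-B-C: with delta = 1 the ML neighborhood of A is {A, B};
# with delta = 2 it expands to include C via the length-2 path A-B-C.
C = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
```

On this path graph, the (A, C) entry of the truncated exponential kernel at $\delta = 2$ is $1/2! = 0.5$, reflecting the single length-2 ML path discounted by the factorial weight.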
Ultimately, when $\delta$ is greater than or equal to the diameter of the graph, $S_c^{bin}$ reduces to a matrix of all 1s, which is equivalent to ignoring the spatial contiguity constraints.

3.4.2 Hadamard Product Graph Laplacian

We now describe our approach for incorporating the spatially constrained kernel matrix $S_c$ into the spectral clustering formulation. Instead of using the weighted sum approach given in Equation (3.3), we consider a Hadamard product approach to combine $S_c$ with the feature similarity matrix S:

$$\text{Hadamard Product:} \quad S^{total}(\delta) = S \circ S_c(\delta), \quad (3.12)$$

where $S_c(\delta)$ corresponds to either the truncated exponential kernel (Equation (3.10)) or the binarized truncated exponential kernel (Equation (3.11)). There are several advantages to using a Hadamard product approach to combine the matrices. First, unlike the weighted sum approach, it discourages spatial units that are located far away from each other from being assigned to the same cluster even when their feature similarity is high. Second, it produces a sparser kernel matrix, which is advantageous for large-scale graph analysis. Finally, it gives users more flexibility to specify the level of constraints that must be preserved by tuning the parameter $\delta$, which controls the ML neighborhood size of the constraint graph.

Let $D^{total}_{ii} = \sum_j [S \circ S_c(\delta)]_{ij}$ be the elements of a diagonal matrix computed from $S^{total}$. The Hadamard product graph Laplacian is given by $L^{total} = D^{total} - S \circ S_c(\delta)$. The modified graph Laplacian can be substituted into Equation (3.2) and solved using the generalized eigenvalue approach to identify the regions.

3.4.3 Partitional Spatially-Constrained Spectral Clustering Algorithm

Algorithm 1 presents a high-level overview of our partitional clustering approach. First, a feature similarity matrix is created by applying the Gaussian radial basis function kernel, $k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$, to the feature set of the spatial units.
The spatially constrained kernel matrix $S_c$ is then computed from the constraint matrix C, where $C_{ij} = 1$ if $(s_i, s_j)$ is an ML edge and 0 otherwise.

Algorithm 1 Partitional Spatially-Constrained Spectral Clustering.
Input: $D = \{(x_1, s_1), (x_2, s_2), ..., (x_N, s_N)\}$;
  $C \in \mathbb{R}^{N \times N}$: spatial constraint matrix;
  k: number of clusters;
  $\delta$: neighborhood size.
Output: $R = \{R_1, R_2, ..., R_k\}$ (set of regions).
1. Create similarity matrix S from $\{x_1, x_2, \cdots, x_N\}$.
2. Compute the spatially constrained kernel matrix, $S_c(\delta)$.
3. Compute the combined kernel $S^{total}$ based on S and $S_c$.
4. Compute $D^{total}$ and $L^{total}$.
5. Solve the generalized eigenvalue problem $L^{total} r = \lambda D^{total} r$. Create the matrix $X_r = [r_1\ r_2\ \cdots\ r_k]$ from the top-k eigenvectors.
6. $R \leftarrow$ k-means($X_r$, k)

Note that if the truncated exponential kernel is used to represent the spatially constrained kernel matrix, we term the approach the spatially-constrained spectral clustering (SSC) algorithm. If the binarized truncated exponential kernel is used instead, the approach is known as binarized spatially-constrained spectral clustering (BSSC). Once the combined graph Laplacian $L^{total}$ is found, we extract the first k eigenvectors as a low-rank approximation of the combined kernel matrix. We then apply k-means clustering to partition the data into its respective regions. Note that the partitional clustering framework shown in Algorithm 1 is also applicable to the SCM and CSP algorithms, by setting their corresponding graph Laplacian $L^{total}$ and diagonal matrix $D^{total}$. The computational complexity of spatially constrained spectral clustering is equivalent to that of the standard spectral clustering algorithm, which is $O(N^3)$.

3.4.4 Hierarchical Spatially-Constrained Spectral Clustering Algorithm

The formulation described in the previous section can be extended to hierarchical clustering by using a recursive bisection approach.
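The steps of Algorithm 1 can be sketched end-to-end as below. This is a minimal illustration, not the Matlab implementation used in the experiments: the toy features, chain adjacency, and bandwidth are invented for the example, and the binarized truncated exponential kernel is used as $S_c(\delta)$.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def ssc_partition(X, C, k, delta=1, sigma=1.0):
    """Sketch of Algorithm 1 (BSSC variant): combine the RBF feature
    similarity with the spatial constraint kernel via a Hadamard
    product, then spectrally cluster the combined graph."""
    n = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-sq / (2 * sigma ** 2))              # step 1: similarity

    total, Ck = np.zeros((n, n)), np.eye(n)         # step 2: S_c(delta)
    for _ in range(delta + 1):
        total += Ck
        Ck = Ck @ C
    Sc = (total > 0).astype(float)                  # Equation (3.11)

    Stotal = S * Sc                                  # step 3: Hadamard product
    D = np.diag(Stotal.sum(axis=1))                  # step 4: D and L
    L = D - Stotal
    _, vecs = eigh(L, D)                             # step 5: L r = lambda D r
    np.random.seed(0)                                # make k-means repeatable
    _, labels = kmeans2(vecs[:, :k], k, minit='++')  # step 6: k-means
    return labels

# Four units in a spatial chain; features split them into two groups.
X = np.array([[0.0], [0.1], [5.0], [5.1]])
C = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
labels = ssc_partition(X, C, k=2, delta=1)
```

With $\delta = 1$ only adjacent units keep nonzero combined similarity, so the spectral embedding separates the two feature groups while respecting the chain adjacency.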
Specifically, the algorithm iteratively identifies the least homogeneous region to be split into two smaller subregions until every subregion contains only a single spatial unit.

Algorithm 2 Hierarchical Spatially-Constrained Spectral Clustering (HSSC).
Input: $D = \{(x_1, s_1), (x_2, s_2), ..., (x_N, s_N)\}$;
  $C \in \mathbb{R}^{N \times N}$: spatial constraint matrix;
  $\delta$: neighborhood size.
Output: $R = \{R_1, R_2, ..., R_k\}$ (set of regions).
1. Create similarity matrix S from $\{x_1, x_2, \cdots, x_N\}$.
2. Compute the spatially constrained kernel matrix, $S_c(\delta)$.
3. Compute the combined kernel $S^{total}$ based on S and $S_c$.
4. Compute $D^{total}$ and $L^{total}$.
5. Initialize $R_1$ as the cluster containing all N spatial units.
6. for k = 2 to N do
  6a. $C^* \leftarrow$ choose($R_{k-1}$)
  6b. $R_k \leftarrow R_{k-1} - C^*$
  6c. $(C_1, C_2) \leftarrow$ SSC($D_{C^*}$, 2)
  6d. $R_k \leftarrow R_k \cup \{C_1, C_2\}$

The pseudocode of the proposed algorithm is shown in Algorithm 2. First, the feature similarity matrix S is computed using the Gaussian RBF kernel function. Next, the spatial constraint matrix $S_c(\delta)$ is created using Equation (3.11). The algorithm then computes the combined kernel $S^{total}$ and its corresponding graph Laplacian matrix $L^{total}$, similar to the approach described in Section 3.3.3. The algorithm initially assigns all the data points to a single cluster. It then recursively partitions the data until k clusters are obtained, as shown in lines 6a-6d of Algorithm 2. Let $R_{k-1}$ be the set of clusters found after $k - 1$ iterations. On line 6a, the algorithm chooses the cluster $C^* \in R_{k-1}$ with the worst within-cluster sum of squared errors (SSW) to be split into two smaller clusters, $C_1$ and $C_2$ (line 6c). One advantage of our top-down recursive partitioning approach is that neither the feature similarity matrix nor the spatial constraint matrix has to be updated at each iteration, unlike bottom-up hierarchical clustering, which requires us to re-compute the modified feature similarity and constraint matrices each time a pair of clusters is merged.
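The selection step on line 6a of Algorithm 2 can be sketched as follows; the tiny one-dimensional data set and the current partition are invented for illustration.

```python
import numpy as np

def ssw(X, idx):
    """Within-cluster sum of squared errors of the points in idx."""
    pts = X[idx]
    return float(((pts - pts.mean(axis=0)) ** 2).sum())

def choose_worst(X, clusters):
    """Line 6a of Algorithm 2: pick the cluster with the largest SSW
    (i.e., the least homogeneous region) to bisect next."""
    scores = [ssw(X, idx) for idx in clusters]
    return int(np.argmax(scores))

X = np.array([[0.0], [0.2], [10.0], [10.1], [5.0]])
clusters = [[0, 1], [2, 3, 4]]       # hypothetical current partition
worst = choose_worst(X, clusters)    # the spread-out cluster [2, 3, 4]
```

The chosen cluster is then bisected with the partitional SSC routine, so the worst region at each step is the one refined next.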
3.5 Application to Region Delineation

To evaluate the effectiveness of constrained spectral clustering for region delineation, we conducted a case study on a large-scale terrestrial ecology data set. The results of the case study are presented in this section.

3.5.1 Data set

The constrained spectral clustering methods were assessed using geospatial data from the LAGOS-NEGEO [109] database. The database contains landscape characterization features measured at multiple spatial scales, with a spatial extent covering a land area that spans 17 U.S. states. The land area was divided into smaller hydrologic units (HUs), identified by their 12-digit Hydrologic Unit Code [101]. Our goal was to develop a regionalization system for the landscape by aggregating the 20,257 HUs into coarser regions. We selected 28 terrestrial landscape variables and performed experiments on three study areas: Michigan, Iowa, and Minnesota. When the values for a landscape variable were always zero, we removed that variable before applying the clustering methods. The number of HUs to be clustered in each study area, as well as the number of landscape variables for each, are summarized in Table 3.1.

Table 3.1: Summary statistics of the data set.

Study Area | # HUs | # landscape variables | # PCA components | Diameter of constraint graph
Michigan | 1,796 | 17 | 10 | 41
Iowa | 1,605 | 19 | 12 | 43
Minnesota | 2,306 | 19 | 11 | 57

The data set was further preprocessed before applying the constrained clustering algorithms. First, each variable was standardized to have a mean of zero and variance of one. Since some of the landscape variables were highly correlated, we applied principal component analysis to reduce the number of features, keeping only the principal components that collectively explained at least 85% of the total variance. The principal component scores were then used to calculate a feature similarity matrix for all pairs of HUs in each study area.
The ML edges for the constraint graph were determined based on whether the polygons for two HUs were adjacent to each other.

3.5.2 Baseline Methods

For partitional-based constrained clustering, we compared our algorithms, SSC and BSSC, against three competing baseline methods. The first baseline, called SCM [105], uses a weighted sum approach (Equation (3.3)) to combine the feature similarity matrix S with the adjacency matrix C of the constraint graph. The algorithm has a parameter $\delta \in [0, 1]$ that controls whether the clustering should favor homogeneity or spatial contiguity of the regions. When $\delta$ approaches 0, the algorithm is biased towards maximizing the similarity of features within the regions, whereas when $\delta$ approaches 1, it is biased towards producing more contiguous regions.

The second baseline method, called CSP [128], uses the spatial constraints to restrict the feasible set of the clustering solution (Equation (3.6)). As noted in Section 3.3.3, the algorithm has a parameter $\beta$ that gives a lower bound on the proportion of constraints that must be satisfied by the clustering solution. Furthermore, $\beta < \lambda_{max} vol(G)$ to ensure the existence of a feasible solution [128]. Instead of using $\beta$, we define an equivalent tuning parameter $\delta = \beta / [\lambda_{max} vol(G)]$ so that its upper bound, which is equal to 1, is consistent with the upper bound for the other algorithms evaluated in this study.

The third baseline is a spatially constrained clustering method proposed recently in the ecology literature by Miele et al. [80]. It uses a stochastic model to represent nodes and links in a spatial ecological network. The cluster membership of each node is assumed to follow a multinomial distribution. Spatial constraints are introduced as a regularization penalty in the maximum likelihood estimation of the model parameters. The algorithm is implemented as part of the Geoclust R package. We denote this model-based method as MB in the remainder of this chapter.
For hierarchical clustering, we compare our proposed HSSC algorithm against the space-constrained clustering method described in [66]. The method is similar to traditional agglomerative hierarchical clustering, except that it applies a Hadamard product between the feature similarity matrix S and the spatial constraint matrix $S_c$ to generate a combined similarity matrix $S^{total}$. This is identical to the approach used in HSSC. The agglomerative clustering algorithm initially assigns each spatial unit to its own cluster (region). It then merges the two clusters with the highest similarity value in $S^{total}$. Both the feature similarity matrix S and the spatial constraint matrix $S_c$ are then updated accordingly. The update for S depends on how the similarity between two clusters is computed. Popular approaches for updating S include single link [106], complete link [111], group average (UPGMA) [108], and Ward's method [130]. The adjacency matrix C is updated based on whether there is a path from any point in one cluster to any point in the other cluster, and the constrained similarity matrix $S_c$ is updated based on Equation (3.10) with a predefined $\delta$.

We implemented SCM, SSC, BSSC, HSSC, and the spatially constrained agglomerative hierarchical clustering methods (single link, complete link, UPGMA, Ward's method) in Matlab. For CSP and MB, we downloaded their software from the links provided by the authors.¹

3.5.3 Evaluation Metrics

We evaluated the performance of the algorithms based on three criteria: homogeneity, spatial contiguity, and region size. To determine whether the regions were ecologically homogeneous, we computed their within-cluster sum-of-square error (SSW) [112]:

$$SSW = \sum_{i=1}^{k} \sum_{x \in C_i} dist(\mu_i, x)^2 \quad (3.13)$$

where $\mu_i$ is the centroid of the cluster $C_i$.

¹ CSP was obtained from https://github.com/gnaixgnaw/CSP whereas MB was downloaded from http://lbbe.univ-lyon1.fr/Download-5012.html?lang=fr.
The lower the SSW, the more homogeneous the spatial units within the regions are. The second criterion assesses the spatial contiguity of the resulting regions. We consider two metrics for this evaluation. The first metric computes the percentage of ML constraints preserved within the regions:

$$PctML = \frac{\text{\# ML edges within discovered regions}}{\text{Total \# of ML edges}} \quad (3.14)$$

The second metric corresponds to a relative contiguity metric proposed in the ecology literature by Wu and Murray [136]. The metric takes into consideration both the within-patch contiguity ($\rho$) and between-patch contiguity ($\nu$):

$$c = \frac{\rho + \nu}{\Omega} \quad (3.15)$$

where

$$\rho = \sum_{i=1}^{k} \frac{N_i (N_i - 1)}{2}, \qquad \nu = \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1, j \neq i}^{k} \frac{N_i N_j}{l_{ij}^{\gamma}}, \qquad \Omega = \frac{\left(\sum_{i=1}^{k} N_i\right)\left(\sum_{i=1}^{k} N_i - 1\right)}{2} \quad (3.16)$$

In the preceding formula, k is the number of regions and $N_i$ is the number of spatial units assigned to the i-th region. $l_{ij}$ denotes the minimum spanning tree path length between regions i and j, while $\gamma$ is a distance decay parameter. Since the metric is normalized by the total number of possible edges in a complete graph ($\Omega$), it ranges between 0 and 1.

Although spatial contiguity is a desirable criterion, it may lead to highly imbalanced regions [83]. For example, an algorithm that creates one very large region along with many smaller but contiguous regions will likely have a high contiguity value. Previous studies [30, 6] have shown the importance of maintaining more balanced cluster sizes to ensure good clustering performance. Thus, given a set of k clusters with their corresponding cluster sizes $n_1, n_2, ..., n_k$, we define a metric, Cbalance, based on the normalized geometric mean of the cluster sizes:

$$Cbalance = \frac{k}{N} \left( n_1 \times n_2 \times \cdots \times n_k \right)^{1/k}, \quad (3.17)$$

where N is the total number of data points and k is the number of clusters. The metric ranges from 0 to 1, and the larger the value, the more balanced the cluster sizes are.
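Two of the metrics above can be sketched compactly. Note that the k/N normalization in the `cbalance` function reflects the reading of Equation (3.17) used here, under which a perfectly balanced partition scores exactly 1; the toy constraint matrix and labels are invented for illustration.

```python
import numpy as np

def pct_ml(C, labels):
    """Equation (3.14): fraction of ML edges whose endpoints fall in
    the same region."""
    iu = np.triu_indices_from(C, k=1)
    ml = C[iu] > 0
    same = labels[iu[0]] == labels[iu[1]]
    return (ml & same).sum() / ml.sum()

def cbalance(labels):
    """Equation (3.17): normalized geometric mean of cluster sizes,
    assuming the k/N scaling so a balanced partition scores 1."""
    sizes = np.bincount(labels)
    sizes = sizes[sizes > 0]
    k, N = len(sizes), sizes.sum()
    return k / N * float(np.prod(sizes.astype(float))) ** (1.0 / k)

# Chain of 4 spatial units with 3 ML edges; labels split it in half,
# preserving 2 of the 3 ML edges.
C = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
labels = np.array([0, 0, 1, 1])
```

On this example `pct_ml` returns 2/3 (the middle edge is cut), while `cbalance` returns 1 for the perfectly balanced 2+2 split and a smaller value for a 3+1 split.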
3.5.4 Results and Discussion

This section presents the results of applying the various clustering algorithms to the terrestrial ecology data.

3.5.4.1 Tradeoff between Homogeneity and Spatial Contiguity

We first analyze the trade-off between landscape homogeneity and spatial contiguity of the regions by comparing the results for four partitional constrained spectral clustering algorithms: SCM, CSP, SSC, and BSSC. The number of clusters was set to 10. As each algorithm has a parameter $\delta$ that determines whether the clustering should be more biased towards increasing the within-cluster similarity or preserving the ML constraints, we varied the parameter and assessed their performance using the metrics described in Section 3.5.3. The $\delta$ parameter for SSC and BSSC has been re-scaled to a range between 0 and 1 by dividing the ML neighborhood size by the diameter of the constraint graph.

The results are shown in Figure 3.2. Observe that the contiguity scores (c and PctML) for SCM increase rapidly as $\delta$ approaches 1. This is because increasing $\delta$ biases the algorithm towards preserving the spatial constraints. A similar increasing trend was also observed for CSP, especially in Iowa and Michigan, though the increase is not as sharp as for SCM. In contrast, the contiguity scores decrease for BSSC as $\delta$ increases because larger $\delta$ values create more new ML edges involving spatial units that are not adjacent to each other. For SSC, the contiguity scores do not appear to change much as $\delta$ increases.
Figure 3.2: Comparison between various constrained spectral clustering algorithms in terms of their landscape homogeneity (SSW) and spatial contiguity (PctML and c) for (a) Iowa, (b) Michigan, and (c) Minnesota. The horizontal axis in each plot corresponds to the parameter value $\delta$.

This is because the weight 1/k! associated with each path of length k decreases rapidly to zero as k increases. As a consequence, the ML neighborhood size for SSC grows until it reaches a maximum size, beyond which increasing $\delta$ will not significantly alter the constraint graph. Thus, SSC is less sensitive to parameter tuning than BSSC. Figure 3.2 also shows that there is generally an increasing trend in SSW for SCM and CSP as $\delta$ increases. For SSC, the SSW values do not appear to change significantly with increasing $\delta$, whereas for BSSC, the SSW curve decreases monotonically as the neighborhood size increases.

The results of this study show that the trade-off between landscape homogeneity and spatial contiguity varies among the constrained spectral clustering algorithms. For CSP and SSC, the parameters provided by the algorithms do not allow us to achieve the full range of SSW and contiguity scores. Although these algorithms can produce regions with high contiguity scores, their SSW values were also very high. In contrast, with careful parameter tuning, SCM and BSSC can produce regions with significantly lower SSW compared to CSP and SSC. Observe that the slopes of the curves are steeper near $\delta = 1$ for SCM, which suggests that decreasing $\delta$ below 1 would lead to a dramatic reduction in the contiguity score and SSW of the regions.
This makes it harder for SCM to produce regions that are both spatially contiguous and homogeneous. In contrast, the curves for the contiguity scores of BSSC are flatter near $\delta = 0$. This enables the BSSC algorithm to produce regions with homogeneous landscape features that are still spatially contiguous.

3.5.4.2 Performance Comparison for Partitional-based Constrained Clustering

In this experiment, we set the number of clusters to 10 and selected the $\delta$ parameter that gives the highest contiguity score for each constrained spectral clustering method. If more than one parameter value achieved the highest contiguity score, we chose the one with the lowest SSW. For MB, since the Geoclust R package does not support parameter tuning by users, we applied the algorithm using its default settings.

Table 3.2 summarizes the results of our analysis.

Table 3.2: Performance comparison among various partitional spatially-constrained clustering algorithms with the number of clusters set to 10.

States | Method | PctML | c | SSW | Cbalance
IA | SCM | 93.26% | 1.00 | 15104 | 0.95
IA | CSP | 87.37% | 0.91 | 13628 | 0.19
IA | MB | 89.95% | 0.69 | 18997 | 0.34
IA | SSC | 92.83% | 1.00 | 13993 | 0.95
IA | BSSC | 92.40% | 1.00 | 14001 | 0.94
MI | SCM | 96.08% | 1.00 | 18200 | 0.85
MI | CSP | 87.81% | 0.92 | 18307 | 0.44
MI | MB | 88.76% | 0.65 | 16091 | 0.91
MI | SSC | 95.69% | 1.00 | 17534 | 0.92
MI | BSSC | 94.92% | 1.00 | 17485 | 0.93
MN | SCM | 94.78% | 1.00 | 20506 | 0.91
MN | CSP | 86.62% | 0.96 | 23755 | 0.69
MN | MB | 88.96% | 0.64 | 20400 | 0.67
MN | SSC | 94.57% | 1.00 | 19998 | 0.93
MN | BSSC | 94.12% | 1.00 | 19594 | 0.91

SCM, SSC, and BSSC can be tuned to produce regions that are fully contiguous (c = 1). The SSW values for BSSC and SSC are consistently better than those for SCM. These results clearly show the advantage of using a Hadamard product approach instead of a weighted sum approach to integrate spatial constraints into the feature similarity matrix. The limitation of the weighted sum approach can be explained as follows.
Since the highest contiguity score is achieved by setting $\delta = 1$, the clustering solution of SCM is equivalent to applying spectral clustering on the constraint graph only, without considering the feature similarity. If we reduce the parameter value to, say, $\delta = 0.95$, its contiguity score decreases sharply (see Figure 3.2) while its SSW value is still worse than that of BSSC. The weighted sum approach has poor SSW because it significantly alters the feature similarity matrix.

To illustrate the limitation of the weighted sum approach, consider the toy example shown in Figure 3.3.

Figure 3.3: A toy example illustrating the advantage of using the Hadamard product for combining constraints with feature similarity. Each labeled node is a data point, with a solid line representing a must-link constraint between two points and a dashed line representing the absence of such a constraint. The weight of each edge denotes the feature similarity between two data points.

Assume there are 4 data points, A, B, C, and D, that need to be clustered. A sample of their pairwise similarity values is shown in Table 3.3.

Table 3.3: A toy example illustrating the advantage of using the Hadamard product for combining constraints.

Pairs | Feature Similarity | ML Constraint | Weighted Sum | Hadamard Product
A-B | 0.1 | 1 | 0.955 | 0.1
B-C | 0.5 | 0 | 0.025 | 0
C-D | 0.8 | 1 | 0.990 | 0.8

Although the A-B pair has a significantly lower similarity value than C-D, the weighted sum approach inflates its similarity significantly (assuming $\delta = 0.95$), which makes its overall similarity comparable to that of C-D. In contrast, the Hadamard product approach simply zeros out the similarity of pairs that do not have ML edges, and thus will not artificially inflate the similarities of pairs with ML edges. Furthermore, since the feature similarity is computed using the Gaussian radial basis function (see Section 3.4.3), the resulting matrix for the weighted sum approach is still dense after incorporating the spatial constraints.
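The numbers in Table 3.3 can be reproduced with a few lines, applying Equation (3.3) with $\delta = 0.95$ and the Hadamard product of Equation (3.12) to the three pairwise similarities:

```python
import numpy as np

# Pairwise feature similarities and ML constraints from Table 3.3,
# ordered as the pairs A-B, B-C, C-D.
S = np.array([0.1, 0.5, 0.8])
C = np.array([1.0, 0.0, 1.0])
delta = 0.95

weighted = (1 - delta) * S + delta * C   # Equation (3.3)
hadamard = S * C                          # Equation (3.12)
```

The weighted sum lifts the weak A-B pair to 0.955, nearly matching the strong C-D pair at 0.990, while the Hadamard product leaves the ML pairs at their original similarities and zeros out B-C.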
Unless $\delta = 1$, the weighted sum approach will not prevent spatial units that are located far from each other from being placed in the same region. For example, consider the regions found by the weighted sum approach for Iowa, as shown in Figure 3.4. Although the regions appear to be spatially contiguous, they are not compact and have varying sizes. In fact, most of the spatial units were assigned to the same region when $\delta = 0.95$. Even at this lower $\delta$ threshold, its SSW (14805) is still worse than the SSW for our framework and CSP.

Figure 3.4: Regions for Iowa created by the SCM algorithm using the weighted sum approach (with $\delta = 0.95$).

The contiguity scores for MB are worse than those of the other constrained clustering methods. Nevertheless, it preserves at least 88% of the ML edges within the regions. Except for Michigan, its SSW values are also worse than those of the other methods. In contrast, CSP has the lowest contiguity score among all the constrained spectral clustering methods. Except for Iowa, its SSW values are also among the worst. The limitation of CSP [128] is a consequence of the parameter used to control its spatial contiguity. As shown in Equation (3.6), the level of spatial constraints satisfied by the clustering solution depends on the parameter $\alpha$. However, instead of directly tuning $\alpha$, the authors suggested varying another parameter, $\beta$, which was shown to be a lower bound of $\alpha$. The results of our case study showed that increasing the value of $\beta$ does not necessarily imply an increase in $\alpha$. To illustrate this point, we randomly generated a constraint graph that has nine vertices with a randomly generated feature similarity matrix. Assuming the number of clusters is equal to 2, we ran the CSP algorithm with different parameter settings and plotted the values of $\alpha$ and $\beta$ in Figure 3.6.
Although this figure shows that the value of $\beta$ (blue diamond) is a lower bound of $\alpha$ (red circle), the bound is so loose that it cannot guarantee that increasing $\beta$ will increase $\alpha$. In fact, the figure on the right shows that $\alpha$ is not a monotonically increasing function of $\beta$. This is why controlling the parameter value will not always guarantee that the regions will be contiguous even when $\delta = 1$ (unlike SCM and the Hadamard product approaches).

In terms of the Cbalance measure, our results suggest that SCM, SSC, and BSSC achieve the highest cluster balance for all three states. This can be verified by examining the regions generated by all the competing algorithms for the state of Michigan, as shown in Figure 3.5.²

Figure 3.5: Regions created by the (a) SCM, (b) CSP, (c) MB, (d) SSC, and (e) BSSC algorithms for the state of Michigan.

As can be seen from the figure, the regions produced by SCM, SSC, and BSSC are more compact and uniform in size compared to those of CSP and MB. However, the SSW for SCM is worse than the SSW for our proposed SSC and BSSC algorithms. This is not surprising, as SCM cannot produce contiguous clusters unless $\delta$ is very close to 1. If $\delta$ is lowered slightly to 0.95, the regions change significantly, as shown in Figure 3.4. By setting $\delta$ close to 1, SCM focuses only on preserving the spatial constraints, and thus has worse cluster homogeneity compared to our algorithms. Thus, our results clearly show the benefits of applying BSSC to develop homogeneous and spatially contiguous regions compared to the other baseline algorithms. These results hold true even when the number of regions is varied. A comparison of the results for different numbers of clusters can be found in our earlier work [146].

² The corresponding maps of regions for other states can be found in [146].
Figure 3.6: The relationship between α and β parameter values for the CSP algorithm when applied to a synthetic graph data set.

3.5.4.3 Performance Comparison for Hierarchical-based Constrained Clustering

In this section, we compared our proposed HSSC algorithm against the spatially-constrained agglomerative clustering methods for constructing nested regions. Note that all of the algorithms apply a Hadamard product to combine the constraint matrix Sc (for a given δ) with the feature similarity matrix S to generate the combined matrix Stotal before applying hierarchical clustering. For the spatially-constrained agglomerative hierarchical clustering methods, the regions are iteratively merged starting from the initial Stotal. For a fair comparison, we set δ = 1 for all the methods. The results for k = 10 are summarized in Table 3.4. In terms of region contiguity, observe that all the methods can achieve c = 1. However, the spatially constrained complete link and UPGMA algorithms produce the highest PctML values, while Ward's method produces the lowest value. The PctML for our proposed HSSC algorithm is still relatively high and comparable to its non-hierarchical counterparts, SSC and BSSC (see Table 3.2). Despite their high spatial contiguity, both the spatially-constrained complete link and UPGMA methods have the worst SSW compared to the other methods. Worse still, their Cbalance values are close to 0, suggesting that the sizes of their regions are highly imbalanced. This can be seen from the maps shown

Table 3.4: Performance comparison among various hierarchical spatially-constrained clustering algorithms with δ = 1 and the number of clusters set to 10.
  State  Method         PctML     c   SSW     Cbalance
  IA     HSSC           92.53 %   1   15080   0.92
  IA     Single link    96.02 %   1   14191   0.04
  IA     Complete link  98.75 %   1   18309   0.02
  IA     UPGMA          98.45 %   1   18227   0.02
  IA     Ward's         84.96 %   1    9281   0.48
  MI     HSSC           95.41 %   1   17420   0.86
  MI     Single link    95.31 %   1   16575   0.08
  MI     Complete link  98.72 %   1   20083   0.03
  MI     UPGMA          98.33 %   1   19441   0.04
  MI     Ward's         86.28 %   1   14047   0.79
  MN     HSSC           95.02 %   1   20075   0.82
  MN     Single link    90.70 %   1   19660   0.20
  MN     Complete link  99.17 %   1   33431   0.01
  MN     UPGMA          98.81 %   1   28681   0.02
  MN     Ward's         87.69 %   1   14183   0.63

in Figure 3.7, where there is a large region covering the majority of the landscape in each state. In contrast, our HSSC algorithm has the highest Cbalance, consistently producing regions that are compact and approximately similar in size. The spatially-constrained single link method has comparable PctML but slightly lower SSW compared to HSSC. It also suffers from the same imbalanced region problem as the complete link and UPGMA methods. Meanwhile, the spatially-constrained Ward's method achieves the lowest SSW among all the competing methods, which is not surprising since the algorithm is designed to minimize the SSW in each iteration. However, this comes at the expense of its poor PctML values, which are the worst among all the competing methods. In addition, Ward's method is known to suffer from the cluster inversion problem [83], in which its objective function is not monotonically non-decreasing as the number of clusters increases. In short, our HSSC algorithm outperforms the complete link, UPGMA, and Ward's methods in 2 of the 3 evaluation criteria. Its PctML and SSW are also quite similar to those of single link, which suffers from the region imbalance problem. Figures 3.8a and 3.8b show a comparison between the regions produced by BSSC, HSSC,

(a) Regions in IA developed by the 5 hierarchical algorithms with 10 clusters.
(b) Regions in MI developed by the 5 hierarchical algorithms with 10 clusters. (c) Regions in MN developed by the 5 hierarchical algorithms with 10 clusters.

Figure 3.7: Regions developed by the 5 hierarchical spatially-constrained clustering algorithms for the 3 study regions.

and Ward's method for the state of Michigan as we vary the number of regions from 4 to 10.

Figure 3.8: Comparison between BSSC, HSSC, and Ward's method for number of clusters K = 4, 6, 8, and 10. The five metrics evaluated are listed on top of each figure, namely: unnormalized δ, contiguity metric c, PctML preserved, SSW, and cluster balance. Results for δ = 1/41 are shown in (a) and results for the unnormalized δ = 4/41 are shown in (b).

We show the value of the unnormalized δ along with four metrics (c, PctML, SSW, and Cbalance) at the top of each diagram. Recall that the normalized δ is the ratio between the original δ given in Equation (3.7) and the diameter of the spatial constraint graph. As we increase δ from 1/41 to 4/41, Ward's method no longer produces regions that are contiguous, unlike the BSSC and HSSC methods. The Cbalance for Ward's method is also worse than that of our algorithms, except when the number of clusters is small. The results for BSSC are quite comparable to those of HSSC, since both are based on the same spatially constrained spectral clustering framework. The Cbalance and SSW for HSSC are slightly better than those of BSSC, but its PctML is slightly worse. While this may seem counter-intuitive, since BSSC directly optimizes the objective function for spatially-constrained spectral clustering whereas HSSC uses a greedy recursive bisection strategy, it is worth noting that the objective function does not depend solely on the feature similarities within the regions.
Instead, it takes into account the spatial constraint matrix C as well. In fact, if we compare the values of the objective functions for BSSC and HSSC at k = 10 for the state of Michigan, BSSC has a noticeably lower value (13458.13) than HSSC (14284.21). Finally, we also compare the stability of the regions as we increase the ML neighborhood size (δ). For this experiment, we show the results for the state of Michigan (in which the diameter of the constraint graph is 41) and vary the normalized δ from 1/41 to 4/41. We use the adjusted Rand index [94] to compare the similarity between two clustering results. A high adjusted Rand index would suggest that the regions found are stable, i.e., they do not vary significantly with different values of δ. Table 3.5 shows the mean adjusted Rand index (averaged over the number of clusters, which varies from 1 to 10) for HSSC, BSSC, and Ward's method. The results suggest that the proposed BSSC and HSSC methods are less sensitive to the change in δ compared to Ward's method, which is another advantage of using our frameworks.

Table 3.5: Stability of the regions generated by different hierarchical clustering methods for the state of Michigan. The mean adjusted Rand index is computed for each method by comparing the similarity between the regions found with δ = 1/41 and the regions found with δ = 4/41.

  Method   Mean Adjusted Rand Index
  BSSC     0.92
  HSSC     0.86
  Ward's   0.66

3.6 Conclusions

This chapter investigated the feasibility of applying constrained spectral clustering to the regionalization task. We compared several constrained spectral clustering methods and showed the trade-off between landscape homogeneity and spatial contiguity of their resulting regions. We presented two algorithms, SSC and BSSC, that use a Hadamard product approach to combine the similarity matrix of landscape features with spatial contiguity constraints.
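The Hadamard product approach can be sketched in code. The following is a minimal illustration only, not the exact SSC/BSSC formulation: it multiplies a feature similarity matrix element-wise with a binary spatial constraint matrix, embeds the units using a normalized graph Laplacian, and clusters the embedding with k-means. All data and function names here are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def constrained_spectral_clustering(S, Sc, k):
    # Hadamard product: similarities between non-neighboring units are zeroed
    S_total = S * Sc
    d = S_total.sum(axis=1)
    # symmetric normalized Laplacian L = I - D^{-1/2} S_total D^{-1/2}
    L = np.eye(len(d)) - S_total / np.sqrt(np.outer(d, d))
    _, eigvecs = np.linalg.eigh(L)
    # the k eigenvectors with smallest eigenvalues form the spectral embedding
    return KMeans(n_clusters=k, n_init=10).fit_predict(eigvecs[:, :k])

# toy example: 6 spatial units on a line; only adjacent units are neighbors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (3, 2)), rng.normal(5, 0.1, (3, 2))])
S = np.exp(-((X[:, None] - X[None]) ** 2).sum(axis=2))
Sc = (np.abs(np.arange(6)[:, None] - np.arange(6)[None]) <= 1).astype(float)
labels = constrained_spectral_clustering(S, Sc, k=2)
```

On this toy input, the two well-separated groups of adjacent units are recovered as two contiguous regions.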
The results of our case study showed that the proposed BSSC algorithm is the most effective in terms of producing spatially contiguous regions that are homogeneous. The extension of this algorithm to a hierarchical clustering setting also shows its advantages in producing regions that are more balanced in size compared to other hierarchical spatially-constrained algorithms. It also achieves high spatial contiguity and moderate SSW, comparable to the results of its non-hierarchical counterpart (BSSC).

CHAPTER 4
LEARNING HASH-BASED FEATURES FOR INCOMPLETE CONTINUOUS-VALUED DATA

4.1 Introduction

Real-world data sets are often noisy, making it difficult to develop accurate prediction models from the data. The data are often high-dimensional and may contain redundant or correlated features, as well as missing values, which makes it crucial to derive a good set of features to represent the data. This has led to considerable interest in developing feature learning methods to overcome the limitations of using the original features of the data. Numerous methods are available, from classical methods such as principal component analysis [90] to more recent ones such as hash-based feature learning [64]. Hashing is a popular feature learning technique for transforming high-dimensional data into an alternative representation that preserves the similarities between instances in the original data [126]. There are two main advantages of using hashing for feature learning. First, it provides an effective dimensionality reduction technique, especially for applications such as large-scale image and multimedia retrieval problems [59]. Second, the similarity-preserving property of the hash functions enables the hash-based features to be used as an approximation to the features defined in the Reproducing Kernel Hilbert Space (RKHS).
This allows us to construct linear models on the hash-based features with accuracy comparable to their nonlinear counterparts, but with a substantial improvement in computational complexity. Hash-based feature learning methods can be generally classified into two categories, depending on how the features are created [126]. The first category corresponds to data-independent methods, which create the features by applying randomly generated hash functions to the data. This includes the min-hash [15], random hyperplane-based hashing [20], and shift-invariant kernel hashing [92] methods. One potential limitation of using data-independent methods is that the number of hash-based features needed to provide a good representation of the data can be very large, since the hash functions are generated randomly. Thus, data-driven methods have been developed as an alternative to such methods, as they can fit the salient properties of the data using a small set of hash functions. Methods that belong to this second category include spectral hashing [131], semi-supervised hashing [127], and minimal loss hashing [85]. Existing hash-based feature learning methods assume that the input data are complete. Any missing values present in the data must be imputed before the hash functions can be derived. Because the imputation is typically performed during preprocessing, the hash-based features created after the preprocessing may not be optimal for the subsequent modeling task. As an illustration, consider the example shown in Figure 4.1. Suppose P denotes a data point that has a missing value for its predictor variable (the x-axis), but the value of its response variable (the y-axis) is known. After applying mean imputation, the data point P is shifted to its estimated point P′, which is much larger than its true value.
This process introduces a bias into the data set, resulting in a new regression model (represented by a solid line in the diagram) that deviates from the true model (represented by a dashed line). At first glance, the bias may seem quite small. However, it may become an issue when there are more missing values in the data. If P is imputed in a way that takes into account the effect of the imputation on the response variable, the regression model would be less affected by imputation bias. To overcome this challenge, we present a novel framework called H-FLIP that combines missing value imputation with hash-based feature learning. Specifically, the missing values are imputed in a way that preserves the relationship between the predictor and response variables of the data.

Figure 4.1: Effect of mean value imputation on regression.

Many existing hash-based feature learning methods are designed for discrete-valued data, such as those encountered in image classification problems [59]. In this chapter, we investigate the application of hash-based feature learning to regression problems. Specifically, the hash-based features created with our approach are continuous rather than binary. In addition, many existing methods create their hash-based features using a linear combination of the original features. Since the relationship between the response and predictor variables is often nonlinear, such hash-based features are ineffective for more complex prediction tasks. Previous studies have shown that the inner product of random Fourier features provides a good approximation to the shift-invariant kernels that are often used for building nonlinear models [93]. However, since the Fourier features are generated randomly, a relatively large number of such features is needed to represent the data well. We address this problem by training a set of hash functions to fit the response variable using random Fourier features.
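The bias illustrated in Figure 4.1 can be reproduced on toy data. In the sketch below (all data simulated; the setup is an assumption for illustration, not taken from the chapter), the predictors of the smallest instances are missing not at random, so filling them with the observed mean shifts the fitted line away from the true model and degrades downstream predictions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 10, n)
y = 2.0 * x + rng.normal(0, 0.5, n)        # true model: y = 2x

# predictors below 4 go missing; impute them with the mean of observed values
mask = x < 4
x_imp = x.copy()
x_imp[mask] = x[~mask].mean()              # roughly 7, far from the true values

coef_full = np.polyfit(x, y, 1)            # fit on the complete data
coef_imp = np.polyfit(x_imp, y, 1)         # fit on the mean-imputed data

x_test = rng.uniform(0, 10, 200)
mse_full = np.mean((np.polyval(coef_full, x_test) - 2.0 * x_test) ** 2)
mse_imp = np.mean((np.polyval(coef_imp, x_test) - 2.0 * x_test) ** 2)
```

Here the mean-imputed fit carries a large intercept bias, so mse_imp ends up orders of magnitude larger than mse_full.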
This enables our framework to model nonlinear relationships in the data using a small set of hash functions. Our framework can thus be regarded as a hybrid method that embeds a data-independent approach, i.e., random Fourier features, into a supervised learning setting. In summary, the main contributions of this chapter are:

• We developed a supervised feature learning framework for regression problems using nonlinear, random Fourier features as its basis functions.
• We extended this framework to handle incomplete data by enabling the missing value imputation and hash-based feature learning to be performed jointly.
• We demonstrated the efficacy of our approach through extensive experiments using both synthetic and real-world data sets. Our empirical results showed that the framework is more effective than the standard approach of imputing the missing values before applying feature learning in 7 out of 9 real-world data sets evaluated in this study.

The remainder of this chapter is organized as follows. Section 4.2 reviews previous work related to this research, while Section 4.3 presents the preliminary background of this work. The proposed framework, along with its optimization algorithm, is described in Section 4.4. Experimental results validating the effectiveness of the framework are presented in Section 4.5. Finally, we present the conclusions of this study in Section 4.6.

4.2 Related Work

High dimensionality is one of the characteristics of real-world data sets. To avoid the curse of dimensionality, we need a powerful set of features in a low-dimensional space. Feature learning [10] is the task of learning a better feature representation of a given data set. There are different types of models for feature learning, such as probabilistic models [97, 87], auto-encoders [14, 147], manifold learning [143], and deep networks [61, 149]. Hashing is another feature learning technique that has been well studied.
The goal of hash-based feature learning is to transform the data into easily computable features that preserve the underlying properties of the data; examples include min-hash and random hyperplane mapping. However, these methods were mostly designed to represent complete data with binary hash codes, unlike the framework proposed in this chapter. Missing values can be generally classified into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) [70]. A common strategy to deal with missing values is to impute them with the mean or median value of their respective features. Model-based approaches based on maximum likelihood and Bayesian estimation methods have also been developed [69]. These methods assume that the data follow a certain probability distribution, and the missing values are imputed in a way that preserves the properties of the distribution. More recently, matrix completion [18] has emerged as a popular method for dealing with missing values. The method assumes that the partially observed data matrix has low rank, and thus can be effectively recovered using only a small number of observations. Candès and Recht [18] formulated matrix completion as a trace-norm minimization problem and solved a convex relaxation of the problem using semi-definite programming. The approach was extended by Cai et al. [17], who proposed a singular value thresholding (SVT) algorithm for solving the optimization problem. In this chapter, we integrate the matrix completion formulation into our framework to deal with incomplete data.

4.3 Preliminaries

This section reviews some of the fundamental concepts underlying the framework proposed in this chapter.

4.3.1 Support Vector Regression (SVR)

Support vector regression is a widely-used method for solving large-scale prediction problems. The method is grounded in statistical learning theory and based on the structural risk minimization (SRM) principle [118].
The goal of SVR is to learn a linear function f that fits each instance in the training set D = {x_i, y_i}, i = 1, ..., N, with an absolute error bounded by ε. Formally, this can be expressed by the following constrained optimization problem:

$$\min_{\mathbf{w},\, b}\;\; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i (\xi_i + \xi_i^*) \qquad \text{s.t.}\;\; -\epsilon - \xi_i^* \;\le\; y_i - \mathbf{w}^T\mathbf{x}_i - b \;\le\; \epsilon + \xi_i,$$

where ξ_i and ξ_i^* are the slack variables that can be used to relax the error bounds on the training instances and C controls the trade-off between minimizing the training error and the magnitude of w.

The linear SVR formulation can be extended to a nonlinear setting by projecting the original features x to a higher-dimensional feature space, Φ(x), such that the inner product between instances in the new, projected space is equivalent to computing their similarity in the original space using a nonlinear kernel function. For example, the Gaussian radial basis function (RBF),

$$k(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\Big[-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\Big],$$

is a popular choice for nonlinear SVR. The implicit mapping to a high-dimensional feature space facilitated by the kernel function enables SVR to capture nonlinear dependencies in the data. To illustrate the advantages of using SVR, we compared its performance to other regression methods on four lake water quality data sets obtained from the LAGOS-NE database [109]. Details about the data sets can be found in Section 4.5. Table 4.1 shows the mean square error (MSE) obtained using 10-fold cross-validation. Due to the noise present in the data, methods such as multiple linear regression may overfit the training data, resulting in a high error rate.

Table 4.1: Comparison among the various regression methods for modeling lake water quality in terms of their mean square prediction error (MSE).

  Method                      TP     TN     Chla   Secchi
  Multiple linear regression  1.39   2.39   1.31   0.85
  Ridge regression            0.78   0.72   0.71   0.59
  Lasso regression            0.79   0.74   0.72   0.58
  Linear SVR                  0.78   0.71   0.71   0.58
  Nonlinear SVR (RBF)         0.76   0.69   0.71   0.56
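The advantage of a nonlinear kernel can be reproduced with scikit-learn on simulated data (the data set below is made up for illustration; the LAGOS-NE results in Table 4.1 come from the study itself):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (400, 2))
y = X[:, 0] * X[:, 1] + rng.normal(0, 0.1, 400)   # nonlinear response

# epsilon defines the insensitive tube; C trades off flatness vs. training error
models = {
    "linear": SVR(kernel="linear", C=1.0, epsilon=0.1),
    "rbf": SVR(kernel="rbf", C=1.0, epsilon=0.1),
}
mse = {name: -cross_val_score(m, X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
       for name, m in models.items()}
```

The RBF model captures the x1*x2 interaction that a linear model cannot, so its cross-validated MSE is much lower.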
The performance of linear SVR is either comparable to or better than the results obtained using lasso and ridge regression. However, the best results are obtained using nonlinear SVR with the Gaussian RBF kernel.

4.3.2 Random Fourier Features (RFF)

Despite its superior performance in terms of modeling complex relationships in data, the nonlinear SVR method scales poorly with increasing data set size. This is because the method requires considerable computational resources to compute and store the kernel matrices. To overcome this limitation, Rahimi and Recht [93] proposed a mapping function known as the random Fourier feature:

$$\phi(\mathbf{x}) = \sqrt{2/p}\,\cos(\mathbf{R}\mathbf{x} + \mathbf{t}) \tag{4.1}$$

Figure 4.2: Average runtime and accuracy comparison for linear SVR with raw features, nonlinear SVR with raw features, and linear SVR with RFF. (a) Accuracy; (b) Run time.

where R ∈
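Equation (4.1) can be sketched directly in code. The version below is an illustrative implementation (with frequencies drawn for the Gaussian RBF kernel); it checks that inner products of the random features approximate the exact kernel:

```python
import numpy as np

def random_fourier_features(X, p, sigma, rng):
    """phi(x) = sqrt(2/p) cos(R x + t), with R drawn from N(0, 1/sigma^2)
    and t uniform on [0, 2*pi], approximating the Gaussian RBF kernel."""
    d = X.shape[1]
    R = rng.normal(0.0, 1.0 / sigma, (d, p))
    t = rng.uniform(0.0, 2.0 * np.pi, p)
    return np.sqrt(2.0 / p) * np.cos(X @ R + t)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
sigma = 2.0
Z = random_fourier_features(X, p=10000, sigma=sigma, rng=rng)

K_approx = Z @ Z.T                                    # feature inner products
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K_exact = np.exp(-sq / (2 * sigma ** 2))              # exact RBF kernel
max_err = np.abs(K_approx - K_exact).max()
```

max_err shrinks as p grows, which is exactly why an unsupervised RFF representation needs many features and why fitting a small set of supervised hash functions on top of RFF is attractive.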
$$Q(\mathbf{X}, \mathbf{Y}) = f(\mathbf{Y}) + \langle \nabla f(\mathbf{Y}),\, \mathbf{X} - \mathbf{Y}\rangle + \frac{\tau}{2}\|\mathbf{X} - \mathbf{Y}\|_F^2 + g(\mathbf{X}) = \frac{\tau}{2}\Big\|\mathbf{X} - \mathbf{Y} - \frac{1}{\tau}\nabla f(\mathbf{Y})\Big\|_F^2 + g(\mathbf{X}) + K(\mathbf{Y}),$$

where $K(\mathbf{Y}) = f(\mathbf{Y}) - \frac{1}{2\tau}\|\nabla f(\mathbf{Y})\|_F^2$ and

$$\nabla f(\mathbf{Y}) = P(\mathbf{Y}) - P(\mathbf{A}) + \alpha \sum_{k=1}^{K} \Big[\Big(\sqrt{2/p}\,\cos(\mathbf{Y}\mathbf{R}_k + \mathbf{T})\mathbf{w}_k - \mathbf{y}\Big)\mathbf{1}_d^T\Big] \odot \Big[-\sqrt{2/p}\,\sin(\mathbf{Y}\mathbf{R}_k + \mathbf{T})\,\mathrm{diag}(\mathbf{w}_k)\,\mathbf{R}_k^T\Big] \tag{4.8}$$

We use ⊙ in Equation (4.8) to denote the Hadamard product between two matrices and $\mathbf{1}_d$ to denote a d-dimensional vector of all ones. Let diag(w_k) be a p × p diagonal matrix whose diagonal elements correspond to the elements of w_k. In addition, $\mathbf{T} = \mathbf{1}_N \mathbf{t}_k^T$ is an N × p matrix whose rows correspond to the vector $\mathbf{t}^T$ defined in Equation (4.1). Since the last term K(Y) does not depend on X, Q(X, Y) can be minimized by solving:

$$\min_{\mathbf{X}}\;\; \frac{\tau}{2}\Big\|\mathbf{X} - \mathbf{Y} - \frac{1}{\tau}\nabla f(\mathbf{Y})\Big\|_F^2 + \mu\|\mathbf{X}\|_* \tag{4.9}$$

Let $\mathbf{X}^{(k)}$ denote the most recent estimate after k iterations. The following two steps are performed to update the estimate.

1. Apply the accelerated gradient descent method to the smooth part of the objective function:

$$\mathbf{Y}^{(k)} = \mathbf{X}^{(k)} + \frac{\gamma^{(k-1)} - 1}{\gamma^{(k)}}\big(\mathbf{X}^{(k)} - \mathbf{X}^{(k-1)}\big), \qquad \mathbf{Z}^{(k)} = \mathbf{Y}^{(k)} - \frac{1}{\tau^{(k)}}\nabla f(\mathbf{Y}^{(k)})$$

The above equation reduces to the update formula for standard gradient descent when $\gamma^{(k)} = 1$ for all k.

2. Apply the singular value shrinkage operator to $\mathbf{Z}^{(k)}$. Let $\mathbf{Z}^{(k)} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$, where $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_i)$ is a diagonal matrix containing the singular values of $\mathbf{Z}^{(k)}$ while U and V are matrices containing its left and right singular vectors. We update X as follows:

$$\mathbf{X}^{(k+1)} = \mathbf{U}\, D_{\mu/\tau^{(k)}}(\boldsymbol{\Sigma})\, \mathbf{V}^T \tag{4.10}$$

where $D_\nu(\boldsymbol{\Sigma})$ is a diagonal matrix whose i-th element is max(0, σ_i − ν). This is equivalent to applying a threshold μ/τ^{(k)} to each singular value of $\mathbf{Z}^{(k)}$.
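The singular value shrinkage operator of Equation (4.10) is small enough to write out. The sketch below (illustrative only, using a random low-rank test matrix) soft-thresholds the singular values, which is the proximal operator of the trace norm:

```python
import numpy as np

def singular_value_shrinkage(Z, nu):
    """X = U D_nu(Sigma) V^T with D_nu(Sigma) = diag(max(0, sigma_i - nu))."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    # scale the columns of U by the shrunken singular values
    return (U * np.maximum(s - nu, 0.0)) @ Vt

rng = np.random.default_rng(0)
# rank-3 matrix plus small noise; shrinkage suppresses the noise directions
A = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 50))
Z = A + 0.01 * rng.normal(size=(50, 50))
X = singular_value_shrinkage(Z, nu=1.0)
```

The noise singular values of Z fall below the threshold nu, so X comes out exactly rank 3 while the dominant structure of A is preserved.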
Table 4.2: Summary of the data sets.

(a) Lake water quality data

  Response variable   # instances (lakes)   # features   Mean    Std deviation
  TP                  3694                  356          37.5    68.2
  TN                  1961                  356          821.2   729.6
  Chla                4834                  356          20.9    36.9
  Secchi              4684                  356          2.6     1.86

(b) UCI machine learning data

  Data       # instances   # features
  Housing    506           14
  Wine       4898          12
  Parkinson  5875          26
  News       5000          61
  Concrete   1030          9
The step size of the gradient descent is determined dynamically using a line search algorithm [115]. The matrix X is updated until one of the following stopping conditions is satisfied: (1) the maximum number of iterations is reached; (2) $\|\mathbf{X}^{(k)} - \mathbf{X}^{(k-1)}\|_F \,/\, \|\mathbf{X}^{(k)}\|_F < \epsilon$; or (3) the objective function given in Equation (4.7) no longer decreases significantly.
4.4.2.2 Updating W

When X is fixed, we update W by minimizing the following objective function:

$$\min_{\mathbf{W}}\;\; \frac{1}{2}\sum_{k=1}^{K} \|\Phi_k(\mathbf{X}_l)\mathbf{w}_k - \mathbf{y}\|_2^2 + \frac{\lambda}{2}\|\mathbf{W}\|_F^2$$

This is equivalent to solving the objective function for supervised hash-based feature learning as presented in Section 4.4.1. W can be updated using only the instances that belong to the training set in the following way:

$$\mathbf{w}_k = \big(\Phi_k(\mathbf{X}_l)^T \Phi_k(\mathbf{X}_l) + \lambda \mathbf{I}\big)^{-1} \Phi_k(\mathbf{X}_l)^T \mathbf{y} \tag{4.11}$$
4.5 Experimental Evaluation

This section describes the experiments conducted to evaluate the performance of our proposed framework.

4.5.1 Data Sets

We performed experiments using both synthetic and real-world data sets.

4.5.1.1 Synthetic data

We created a rank-20 data matrix X containing 5000 instances (rows) and 100 predictor variables (columns) in the following way:
$$\mathbf{X} = \mathbf{P}\mathbf{Q} + 0.1\mathbf{E},$$

where $\mathbf{P} \in \mathbb{R}^{5000\times 20}$, $\mathbf{Q} \in \mathbb{R}^{20\times 100}$, and $\mathbf{E} \in \mathbb{R}^{5000\times 100}$. The entries of the matrices P, Q, and E are generated randomly from a standard normal distribution. Let $x_i$ denote the i-th predictor variable. The value of the response variable y is computed as follows:

$$y = x_1 x_2 + x_{10} x_{11} + x_{12} + \mathcal{N}(0, 0.1).$$

All the columns in X and the vector y are standardized to have zero mean and unit variance.
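The generation process above can be sketched with NumPy (column index 0 corresponds to x1):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.standard_normal((5000, 20))
Q = rng.standard_normal((20, 100))
E = rng.standard_normal((5000, 100))
X = P @ Q + 0.1 * E                      # approximately rank-20 data matrix

# nonlinear response: y = x1*x2 + x10*x11 + x12 + noise
y = X[:, 0] * X[:, 1] + X[:, 9] * X[:, 10] + X[:, 11] \
    + rng.normal(0.0, 0.1, 5000)

# standardize the columns of X and the vector y
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = (y - y.mean()) / y.std()
```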
4.5.1.2 Lake water quality data

We used several lake water quality data sets from LAGOS-NE [109], a geospatial database that contains landscape characterization features and lake water quality data measured at multiple scales, covering 17 states in the United States. We used four lake water quality variables (total phosphorus (TP), total nitrogen (TN), total chlorophyll-a (Chla), and Secchi depth (Secchi)) as response variables, creating 4 distinct data sets for our experiments. Our goal was to predict the response variables for each lake based on a set of predictor variables (features) that included land cover, land use, and climate variables. Since there are multiple lake water quality measurements taken at different times for each lake, we computed a single value for each lake by taking the mean of all measurements during the summer months after 2010. The statistics for each data set are shown in Table 4.2.

4.5.1.3 Benchmark data from the UCI Machine Learning Repository

We also conducted our experiments on five benchmark data sets from the UCI machine learning repository [68], including housing [51], wine [27], Parkinson [116], online news [36], and concrete strength [141] (see Table 4.2).
4.5.2 Experimental Setup

We performed our experiments on a Dell PowerEdge R620 server with a 2.7GHz dual Intel Xeon processor. The proposed framework and the baseline methods were written in Matlab.

4.5.2.1 Baseline Methods

First, we compare our proposed supervised hash-based feature learning method for complete data against the following three baseline algorithms: (1) linear SVR trained on the raw features, (2) nonlinear SVR trained on the raw features, and (3) linear SVR trained on the random Fourier features.

Second, we compare H-FLIP, which is an extension of our supervised hash-based framework to deal with incomplete data, against the following three methods:

• MC+Raw: Missing values are imputed during preprocessing using matrix completion. A linear SVR is then trained on the imputed data.
• MC+PCA: Missing values are imputed during preprocessing using matrix completion. We then apply principal component analysis (PCA) to extract features from the imputed data. A linear SVR is then trained to fit the PCA-reduced data.
• MC+RFF: This is similar to the previous two approaches, except we use random Fourier features for feature learning on the imputed data. A linear SVR is then trained on the unsupervised RFF.

For a fair comparison, we extract an equal number of features for all the methods, unless specified otherwise. For example, the number of PCA components, unsupervised RFF, and supervised hash-based features are all set to 50. For H-FLIP, we first project the data onto 200 sparse RFF basis functions before reducing it to the 50-dimensional hash-based features using our supervised learning framework. The regularization parameter λ in H-FLIP is determined using cross-validation, while the parameter α in Equation (4.6) is set to 0.01. We apply SVR to the features generated by the baseline and proposed methods. As noted in Section 4.3.1, there are two hyper-parameters, ε and C, that must be determined when applying SVR. These hyper-parameters are chosen using nested cross-validation [119], in which an inner 3-fold cross-validation is performed for hyper-parameter tuning and an outer 5-fold cross-validation is performed for model assessment.
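The nested cross-validation procedure can be sketched with scikit-learn (simulated data for illustration; the actual experiments were run in Matlab):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=300)

# inner 3-fold CV tunes epsilon and C; outer 5-fold CV assesses the model
param_grid = {"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1]}
inner = GridSearchCV(SVR(kernel="linear"), param_grid, cv=3,
                     scoring="neg_mean_squared_error")
outer_scores = cross_val_score(inner, X, y, cv=5,
                               scoring="neg_mean_squared_error")
mse = -outer_scores.mean()
```

Because the outer folds never influence hyper-parameter selection, mse estimates generalization error without tuning bias.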
4.5.2.2 Evaluation Metrics

We consider both the imputation error as well as the prediction accuracy of the induced SVR models. To assess the error in missing value imputation, let $\mathbf{A}_c$ be the true complete data matrix and X be the estimated (imputed) matrix. The imputation error is computed as follows:

$$\text{Imputation error} = \|P(\mathbf{X}) - P(\mathbf{A}_c)\|_2^2 \,/\, \|P(\mathbf{A}_c)\|_2^2.$$

We also evaluate the performance of the SVR models in terms of their mean square prediction error,

$$\text{MSE} = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2,$$

where $\hat{y}_i$ is the predicted response value for the i-th data instance and $y_i$ is its true value.
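Both metrics are easy to state in code. In the sketch below, P(·) is taken to select the originally missing entries via a boolean mask (an assumption about the projection operator made for this illustration; the toy numbers are made up):

```python
import numpy as np

def imputation_error(X_hat, A_true, mask):
    """Relative squared error over the entries selected by mask."""
    diff = (X_hat - A_true)[mask]
    return np.sum(diff ** 2) / np.sum(A_true[mask] ** 2)

def mse(y_hat, y):
    """Mean square prediction error."""
    return np.mean((y_hat - y) ** 2)

A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])        # true complete matrix
mask = np.array([[True, False, False], [False, True, False]])
X_hat = A.copy()
X_hat[mask] += 1.0                       # imputed entries are each off by 1
err = imputation_error(X_hat, A, mask)   # = 2 / (1^2 + 5^2) = 2/26
```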
Figure 4.4: Average MSE for linear SVR with raw features, nonlinear SVR with raw features, linear SVR with RFF features, and linear SVR with supervised hash-based features on complete synthetic data.

4.5.3 Experimental Results

This section presents the results of our experiments on both the synthetic and real-world data.

4.5.3.1 Results for Synthetic Data

We begin with the results of our experiments on the complete synthetic data. The proposed framework is compared against linear SVR on raw features, nonlinear SVR (with RBF kernel) on raw features, and linear SVR on unsupervised RFF features. As expected, Figure 4.4 shows that linear SVR on the raw features is worse than the other methods, as it fails to capture the nonlinear relationships in the data. In addition, the accuracy of linear SVR on both RFF and our supervised hash-based features improves as the number of features increases. More importantly, they are comparable in accuracy to nonlinear SVR with raw features. This supports the rationale for using RFF to capture nonlinear dependencies in the data. Finally, comparing RFF against the proposed supervised hash-based features, we observe that the supervised approach does not require as many features to achieve high accuracy as unsupervised RFF. This justifies the case for using supervised hash-based feature learning
Table 4.3: Imputation error for synthetic data.

  Percent of missing values   10%      20%      30%
  Mean imputation (MI)        1.0002   1.0002   1.0003
  Matrix completion (MC)      0.0271   0.0276   0.0280
  H-FLIP                      0.0273   0.0280   0.0290

Table 4.4: MSE of linear SVR on synthetic data.

  Method    10%      20%      30%      #features
  MC+Raw    0.9680   0.9683   0.9679   100
  MC+PCA    0.9681   0.9683   0.9683   50
  MC+RFF    0.8960   0.8887   0.8997   50
  H-FLIP    0.4596   0.4610   0.4626   50
for nonlinear regression problems.

Next, we add missing values randomly to the synthetic data and compare the imputation error of H-FLIP against methods that apply mean imputation and matrix completion during preprocessing. The results in Table 4.3 show that mean imputation has the highest imputation error while matrix completion has the lowest. The imputation error of H-FLIP is very close to that of matrix completion, which is not surprising, as H-FLIP is designed not only to recover the incomplete data but also to fit the response variable as accurately as possible. The imputation errors of both matrix completion and H-FLIP also do not change significantly as we vary the percentage of missing values from 10% to 30%, which shows the robustness of our proposed framework.

In addition to the imputation errors, we also compare the MSE values of the corresponding regression models. The results in Table 4.4 show that the raw features and PCA-induced features are worse than unsupervised RFF. H-FLIP outperforms all the baseline methods because it learns the appropriate nonlinear features and imputes the missing values without adding significant bias that could degrade the performance of the regression model.
Figure 4.5: Comparing the average MSE of linear SVR on the complete Secchi data.

Table 4.5: MSE of linear SVR for lake water quality data.

  Method    TP     TN     Chla   Secchi   #features
  MC+Raw    0.78   0.73   0.72   0.60     356
  MC+PCA    0.81   0.75   0.74   0.63     50
  MC+RFF    0.84   0.82   0.81   0.70     50
  MC+RFF    0.79   0.74   0.72   0.60     300
  H-FLIP    0.79   0.70   0.71   0.58     50

4.5.3.2
Results for Lake Water Quality Data

First, we report the results of applying the various methods to the complete lake water quality data with no missing values, using Secchi as the response variable. Figure 4.5 shows that the MSE for linear SVR on the supervised hash-based features is slightly better than that for linear SVR on the raw features. More importantly, the supervised hash-based features can achieve a low MSE with fewer features than unsupervised RFF.

We repeat the experiments by adding 20% missing values to the predictor variables and compare the MSE for H-FLIP against the baseline methods. The results shown in Table 4.5 suggest that MC+RFF with only 50 features performs the worst, which is consistent with our previous observation that a large number of RFF is needed to effectively represent the data. As we increase the number of RFF from 50 to 300, the MSE improves significantly,
Table 4.6: MSE of linear SVR for UCI data with 20% missing values.

  Method    Housing   Wine   Parkinson   News   Concrete
  MC+Raw    0.35      0.76   0.80        0.92   0.50
  MC+PCA    0.36      0.83   0.97        0.94   0.67
  MC+RFF    0.38      0.76   0.85        0.94   0.42
  H-FLIP    0.32      0.69   0.74        0.92   0.36
comparable to the results for MC+Raw. Nevertheless, H-FLIP with 50 hash-based features outperforms all other methods in 3 of the 4 data sets. In fact, the MSE of linear SVR with H-FLIP on the incomplete data is comparable to the results for nonlinear SVR on the complete data (see Table 4.1).

4.5.3.3 Results for UCI Benchmark Data

The results in Table 4.6 show that H-FLIP outperforms the baseline methods in 4 of the 5 data sets. The improvement of H-FLIP is more significant here than on the lake data, as there are more nonlinear relationships in these data sets. Nonlinear SVR has a lower MSE than linear SVR by more than 0.10 in 3 of the 5 UCI benchmark data sets, but in none of the lake data sets (see Table 4.1).
4.5.4 Sensitivity Analysis
We perform experiments using total nitrogen (TN) as the response variable to evaluate how
sensitive the H-FLIP results are when varying the number of hash functions (K) and the
length of each sparse RFF (p). We first vary the number of hash functions from 5 to 300 while fixing the length of each sparse RFF at 200. The results shown in Figure 4.6a suggest that the test MSE changes much more slowly than the training MSE as the number of hash functions increases.
Next, we vary the length of the sparse RFF from 10 to 300 while fixing the number of
hash functions at 50. The results given in Figure 4.6b suggest that the test error of H-FLIP is much less sensitive to the increasing RFF length than its training error.

(a) RFF basis functions  (b) Hash-based features
Figure 4.6: MSE of H-FLIP when varying the number of basis functions and supervised hash-based features.
4.6 Conclusion
This chapter presents H-FLIP, a hash-based feature learning framework for incomplete
data. Our experimental results show that the framework is particularly effective for data
sets that have nonlinear relationships between their predictor and response variables. For
future work, we plan to extend the framework to support hash functions beyond RFF, such as spectral hashing [131].
CHAPTER 5
MULTI-TASK LEARNING FOR MULTIPLE RESPONSES AT DIFFERENT
LOCATIONS
5.1 Introduction
Predictive modeling of geospatial data has attracted considerable interest due to its wide
range of applicability. For example, Brown et al. [16] employed a fusion of two spatial choice
models based on logistic regression to predict criminal behaviors. Wimberly et al. [133] used
enhanced spatial models to infer the geographical distributions of two tick-borne pathogens.
Felicísimo [34] used multiple logistic regression models for territorial planning of forested areas.
One of the challenges in geospatial predictive modeling is dealing with the inherent spatial
relationships of the data. Most data mining algorithms implicitly assume that the underlying data are independent and identically distributed. Such an assumption contradicts the first law of geography [114], which states that everything is related to everything else, but nearby things are more related than distant things. Thus, it would be useful to construct
predictive modeling techniques that can explicitly incorporate the spatial relationships of
the data as previous studies have suggested that spatial analysis would perform poorly if
such relationships are ignored [21].
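One standard way to quantify the spatial autocorrelation expressed by the first law is Moran's I, which is positive when nearby observations take similar values. The sketch below uses Gaussian distance weights, one common choice among many; the function name and the bandwidth parameter are illustrative assumptions, not part of the proposed framework.

```python
import numpy as np

def morans_i(values, coords, bandwidth=0.2):
    """Global Moran's I for `values` observed at 2-D `coords`,
    using Gaussian distance-decay spatial weights."""
    n = len(values)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * bandwidth ** 2))
    np.fill_diagonal(W, 0.0)          # a location is not its own neighbor
    z = values - values.mean()
    return (n / W.sum()) * (z @ W @ z) / (z @ z)
```

A spatially smooth field yields a clearly positive I, while spatially random values yield I near zero, which is the behavior a spatially aware predictive model should exploit.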
Another challenge in predictive modeling of geospatial data is that it not only requires performing inference at multiple spatial locations simultaneously, it may also involve
predicting multiple, possibly related, response variables. For example, climate scientists are
interested in predicting the minimum and maximum temperature as well as precipitation at
various locations. These response variables are often correlated with each other. In the field
of limnology, researchers are interested in modeling lake nutrients such as total phosphorus
and nitrogen in order to monitor the quality of water in freshwater lakes. Since the nutrient
variables are often correlated with one another, the challenge is to build models for the response variables that take into account their joint dependencies.
Figure 5.1: Conceptual figure of data with multiple response variables at different locations.
One way to predict the multiple response variables at different locations is to fit a local
model for each response variable at each location independently. While this strategy is easy
to implement, it has several limitations:
⢠Some response variables are not always available at certain locations. This makes it
difficult to train an accurate local model when there is insufficient training data for a
response variable at a given location.
⢠Such a strategy fails to account for spatial relationships of the data.
⢠Such a strategy fails to take advantage of the correlation between different response
variables.
To overcome these limitations, a multi-task learning framework is proposed for modeling
multiple response variables at different locations. As reviewed in Chapter 2.4, multi-task
learning (MTL) learns multiple related tasks at the same time so as to improve the model
performance. Rather than training the local models independently for each response variable
at each location, the proposed formulation simultaneously learns the local models for all
response variables by optimizing a joint objective function with trace-norm regularization.
The proposed framework also incorporates the spatial relationships between locations as well
as the inherent correlation between response variables to improve the model performance.
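The trace-norm regularized joint objective described above can be sketched with a proximal gradient loop, where each task is one (response variable, location) pair and the trace-norm (nuclear norm) penalty couples the tasks through a low-rank weight matrix. This is a generic illustration under simplifying assumptions, not the chapter's full formulation (which additionally incorporates spatial autocorrelation constraints); all names are hypothetical.

```python
import numpy as np

def svt(W, tau):
    """Prox of tau * ||W||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def mtl_trace_norm(Xs, ys, lam=0.1, lr=0.1, n_iter=1000):
    """Jointly fit one linear model per task (columns of W) by
    proximal gradient descent on squared loss + lam * ||W||_*."""
    d, T = Xs[0].shape[1], len(Xs)
    W = np.zeros((d, T))
    for _ in range(n_iter):
        G = np.zeros_like(W)
        for t in range(T):
            # Gradient of the mean squared loss for task t
            G[:, t] = Xs[t].T @ (Xs[t] @ W[:, t] - ys[t]) / len(ys[t])
        W = svt(W - lr * G, lr * lam)  # proximal step on the trace norm
    return W
```

Because the penalty acts on the singular values of the stacked weight matrix, tasks with little training data are pulled toward the shared low-rank structure learned from the better-observed tasks.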
The main contributions of this chapter are summarized as follows:
⢠We developed a novel multi-task learning framework for joint prediction of geospatial
data that contains multiple response variables at different locations. The proposed
framework incorporates spatial autocorrelation among different locations by adding a
constraint into the formulation. The proposed framework also takes into account of
the inherent correlation between response variables
⢠With the additional knowledge of the spatial autocorrelation and the correlation among
response variables, the proposed framework can be extrapolated to locations with no
available training data.
⢠We demonstrated the effectiveness of the proposed framework on predicting lake water
quality data. Our experimental results showed that the proposed framework outperformed other baseline algorithms.
The remainder of the chapter is organized as follows. Section 5.2 defines the multi-response, multi-location prediction problem and Section 5.3 introduces the proposed multi-task learning framework for multiple response variables at different locations. Experimental results are presented in Section 5.4. Finally, Section 5.5 summarizes the findings of this study.
5.2 Multi-Response, Multi-Location Prediction
Consider a geospatial data set D = {X_i, Y_i}_{i=1}^{N}, where each X_i ∈ ℝ^{n_i × d} denotes the set of predictors and Y_i ∈