A MINIMALISTIC DATA DISTRIBUTION SYSTEM TO SUPPORT UNCERTAINTY-AWARE GIS

By

Nicholas Oren Ronnei

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Geography – Master of Science

2017

ABSTRACT

A MINIMALISTIC DATA DISTRIBUTION SYSTEM TO SUPPORT UNCERTAINTY-AWARE GIS

By Nicholas Oren Ronnei

Error and uncertainty are inherent in all digital elevation models (DEMs) – representations of the Earth's terrain. It is essential to account for this uncertainty in any GIS operation that relies on these data, because uncertainty propagates through any derived products. This can have serious consequences, such as the potential invalidation of model results. Geostatistical methods like conditional stochastic simulation have been developed to mitigate this problem, but they require expert knowledge to apply to a project. Although uncertainty propagation has been discussed in the geographic literature for nearly three decades, there has been very little progress in making such analysis accessible to those who are not geostatistics experts — the majority of GIS users. This research uses open source software to build a system that makes the results of complex error models accessible to researchers worldwide without the need for expert knowledge. Then, I use this system to acquire data and perform a basic analysis, demonstrating how the average researcher might incorporate uncertainty propagation in their own work. In doing so, I hope to elucidate the ways in which conditional stochastic simulation changes the traditional spatial data model and set an example for others to follow.

Copyright by NICHOLAS OREN RONNEI 2017

For the people and puppies who have supported me on this trying and rewarding journey – I would not have made it without you.

ACKNOWLEDGEMENTS

I would like to thank my committee for their tremendous patience with me throughout the development of this manuscript. I would also like to thank Dr. Ashton Shortridge specifically for his support and guidance during my time at Michigan State University. Thank you to the many friends and family members who provided much needed moral support. Thank you as well to the innumerable discussion board contributors who helped me find sources and solve tough technical problems by sharing their expertise in a public forum. Finally, I would like to thank the administrative staff of the Department of Geography for keeping me on track through my absent-mindedness. This work was funded by the National Geospatial-Intelligence Agency, for which I am truly grateful.
TABLE OF CONTENTS

LIST OF FIGURES
KEY TO ABBREVIATIONS
1 Introduction
  1.1 The Problem
  1.2 The Solution
  1.3 Research Objectives
2 Literature Review
  2.1 Error and Uncertainty
    2.1.1 The Spatial Structure of Error
    2.1.2 Error Propagation
    2.1.3 Mitigating Error Impacts with Geostatistics
  2.2 GIS and Uncertainty
    2.2.1 Spatial Data Quality
    2.2.2 A Brief History of Error-aware GIS
    2.2.3 Uncertainty and GIS in Distributed Environments
  2.3 GIS in Distributed Environments
    2.3.1 Early Examples of Distributed GIS
    2.3.2 The Importance of Web Standards
    2.3.3 Web Standards and Geographic Information
    2.3.4 The Limitations of Web Standards
    2.3.5 The Importance of Systems Architecture
    2.3.6 Service Oriented Architectures
    2.3.7 Resource Oriented Architectures
    2.3.8 Modern Examples of Distributed GIS
3 Building a New Web-based Distribution System
  3.1 Design Philosophy
    3.1.1 Make it Open
    3.1.2 Make it Easy
    3.1.3 Make it Magic
    3.1.4 Standards Compliance
  3.2 Server Architecture
    3.2.1 Choosing the Right Architecture
    3.2.2 Node.js and Express
    3.2.3 PostGIS and R
  3.3 User Interface
    3.3.1 Front End Framework
    3.3.2 Web Mapping Library
4 From Planning to Practice
  4.1 The Back End
    4.1.1 PostGIS Lessons Learned
    4.1.2 Redesigning the Architecture
  4.2 The Front End
    4.2.1 User Interface
    4.2.2 User Experience
  4.3 Connecting the Two
    4.3.1 Feeding User Input to R
    4.3.2 Security concerns
  4.4 System Performance
5 Using the System
  5.1 Monte Carlo Viewshed Analysis
    5.1.1 Changing the Data Model
    5.1.2 An Example Analysis with R and GRASS
6 Conclusions
  6.1 Limitations
  6.2 Future Directions
REFERENCES

LIST OF FIGURES

Figure 1. A basic diagram of the simple kriging process. Phase A shows the sample data in red (boreholes from the example above). Phase B shows the regular grid to which values are interpolated in black. Phase C shows the calculation for a single point (green) in the grid. Phase D shows the final interpolated surface and its associated contours. The smooth surface generated is typical of interpolation kriging.
Figure 2. The standard method for calculating a viewshed.
Figure 3. A single run of the Monte Carlo Simulation viewshed model.
Figure 4. The entire Monte Carlo Simulation viewshed model.
Figure 5. Duckham's (2002) conceptual model of the error-sensitive GIS developer's perspective.
Figure 6. A timeline of some of the major contributions to the concept of error-aware GIS.
Figure 7. A family tree of Service and Resource Oriented Architectures and their constituent systems as discussed in the preceding paragraphs of this section.
Figure 8. A conceptual diagram of our system's architecture. The database both stores the data and performs the analysis using PL/R. After processing, the server sends the user a download link by email.
Figure 9. Left: Calculating slope on a tiled DEM – notice the dark colored lines laid across the image in a perfect grid. Right: Calculating slope with ST_Union to avoid edge effects.
Figure 10. The Data panel of the user interface allows users to submit data requests to our system.
Figure 11. This image shows the user interface with a color blindness filter applied and form validation feedback showing.
Figure 12. This image shows the user interface with a color blindness filter applied and form validation feedback showing.
Figure 13. Run times in hours across all 50 jobs for 720 arcsecond patches.
Figure 14. Run times in hours across all 50 jobs for 540 arcsecond patches.
Figure 15. Run times in hours across all 50 jobs for 360 arcsecond patches.
Figure 16. Average time to complete a job by size of requested area.
Figure 17. A comparison between the final result of the Monte Carlo Viewshed Analysis script and one of its intermediate outputs.

KEY TO ABBREVIATIONS

AESC American Engineering Standards Committee
AJAX Asynchronous JavaScript and XML
API Application Programming Interface
ANSI American National Standards Institute
AML ARC/INFO Macro Language
ASTER Advanced Spaceborne Thermal Emission and Reflection Radiometer
AWARE Available Water Resource in Mountain Environments
CRUD Create, Read, Update, Delete
CSS Conditional Stochastic Simulation
DEM Digital Elevation Model
DDoS Distributed Denial of Service
DUE Data Uncertainty Engine
ESRI Environmental Systems Research Institute
FEMA Federal Emergency Management Agency
FGDC Federal Geographic Data Committee
FOSS Free and Open Source Software
FTP File Transfer Protocol
GDAL Geospatial Data Abstraction Library
GDEM Global Digital Elevation Model
GeOnAS Geographic Online Analysis System
GEO Group on Earth Observation
GEOSS Global Earth Observation System of Systems
GIS Geographic Information Systems
GiST Generalized Search Tree
GNU GNU's Not Unix
GPL General Public License
GPS Global Positioning System
GRASS Geographic Resources Analysis Support System
GWASS GRASS Web Application Software System
GUI Graphical User Interface
HTML Hypertext Markup Language
HTTP Hypertext Transfer Protocol
INSPIRE Infrastructure for Spatial Information in Europe
IP Internet Protocol Address
ISO International Organization for Standardization
IT Information Technology
JPL Jet Propulsion Laboratory
JSON JavaScript Object Notation
JSON-RPC JavaScript Object Notation Remote Procedure Call
LAMP Linux, Apache, MySQL, and PHP
LAN Local Area Network
LP DAAC Land Processes Distributed Active Archive Center
MATCH Multidisciplinary Assessment of Technology Centre for Healthcare
MCA Monte Carlo Analysis
MEAN MongoDB, Express, Angular, and Node.js
MODIS Moderate Resolution Imaging Spectroradiometer
MPGC Multiple Protocol Geospatial Client
NASA National Aeronautics and Space Administration
NGA National Geospatial-Intelligence Agency (formerly NIMA)
NIMA National Imagery and Mapping Agency
NSDI National Spatial Data Infrastructure
OGC Open Geospatial Consortium
OGIS Open Geodata Interoperability Specification
OS Operating System
OSM OpenStreetMap
PC Personal Computer
QGIS Quantum GIS
PHP PHP Hypertext Preprocessor
REST Representational State Transfer
ROA Resource Oriented Architecture
RPC Remote Procedure Call
SDI Spatial Data Infrastructure
SDSS Spatial Decision Support Systems
SOA Service Oriented Architecture
SOAP Simple Object Access Protocol
SPA Single Page Application
SQL Structured Query Language
SRTM Shuttle Radar Topography Mission
TRI Topographic Roughness Index
UI User Interface
UMN University of Minnesota
URI Uniform Resource Identifier
URL Uniform Resource Locator
UX User Experience
W3C World Wide Web Consortium
WCS Web Coverage Service
WFS Web Feature Service
WMS Web Map Service
WMTS Web Map Tile Service
WIRM Web-based Interactive River Model
WPS Web Processing Service
WSDL Web Service Description Language
XML Extensible Markup Language

1 Introduction

In the information age, scientists are rarely short of data.
The incredible speed with which sensor technologies continue to grow, evolve, and miniaturize suggests that this trend will continue. Indeed, in the geosciences, we know more about our planet than we ever have. This data revolution occurred due to improved data collection techniques generally, and satellite data acquisition in particular. Since the dawn of flight, aerial imagery has played a critical role in improving our knowledge of our world. Now, with so many satellite constellations orbiting Earth and providing a constant flow of data in nearly real time, it is easy to think that we have answered all the data-related questions of remote sensing. Thanks to MODIS, we can assess the health of the forest canopy in a small patch of Madagascar that has never been surveyed. We have elevation data coverage for 99% of the Earth's land mass thanks to GDEM (https://asterweb.jpl.nasa.gov/GDEM.ASP). As a researcher, it can feel as though one has all the pieces and need only put them together to find the answer to a difficult problem.

But is this really the case? Of course not! Even though the sheer quantity of data in the public domain and the massive areas for which those data are available can create that perception, these data are attempts to model the natural world. By definition, models are simplified versions of the complex phenomena they are intended to represent. Therefore, there will always be some disagreement between the model and the reality. While researchers no longer have to worry about a lack of the essential data that sensor suites like MODIS, ASTER, and Landsat provide, they do have to carefully consider the quality of those data. Remotely sensed datasets can appear infallible to the end user. They cover the globe and can provide a discrete "measurement" of a phenomenon anywhere in their coverage zones, be it temperature, land classification, or soil moisture. However, those experienced in remote sensing know that those datasets contain a great deal of uncertainty and error.

This work focuses on the error and uncertainty inherent to elevation data. It discusses the existence of uncertainty in the data, the characteristics of that uncertainty, and its impact on the researcher. It briefly covers the ways in which such uncertainty has been handled in the past, and advances a new method of dealing with it in the future. Finally, it provides concrete examples for other researchers to follow so that they can avoid the negative impacts of uncertainty on their work.

1.1 The Problem

Elevation data is a fundamental requirement for all sorts of geospatial research. From modeling precipitation and climate to calculating viewsheds, the applications of digital elevation models (DEMs) are ubiquitous. Some of these applications have very significant impacts on the lives of everyday people, such as FEMA Floodplain Mapping in the United States. These maps designate how much people must pay for flood insurance and even delineate where people can and cannot be covered for flood insurance based on risk of inundation. But what if these maps were wrong altogether? People might be paying more than they should for flood insurance. Worse, their homes may be in danger of frequent flooding and they would not even know it. When uncertainty and error exist in the input data, they propagate through any operation on the data and remain embedded in the result (Heuvelink, 1999).
Error and uncertainty do exist in DEMs, as they exist in all geographic data, and their propagation can lead to a variety of negative outcomes like that of the floodplain example above (Fisher & Tate, 2006). No matter how careful researchers are, if their underlying assumptions are incorrect then their results will be flawed.

But what are these errors, this uncertainty? Succinctly, "Our lack of knowledge about the reliability of a measurement in its representation of the truth is referred to as uncertainty," whereas, "Error is defined as the departure of a measurement from its true value" (Wechsler, 1999). So, uncertainty in the context of DEMs refers to our lack of knowledge about the error in the dataset. Global datasets like ASTER GDEM promise 1 arcsecond horizontal resolution (approximately 30m) and 1m vertical precision with an overall accuracy of approximately 25m. So while GDEM can detect an elevation change as small as 1 meter, a vertical measurement could be 25 meters higher or lower than reported and occur 30 meters in any direction from where it is reported (Toutin & Cheng, 1999). In reality, the uncertainty is greater than reported. GDEM has a practical horizontal resolution of ~72m (Tachikawa et al., 2011). Furthermore, error in vertical measurements varies widely depending on landscape characteristics (Tachikawa, Kaku, & Iwasaki, 2011).

Fortunately, extensive research has addressed the problem of error and uncertainty in DEMs more broadly and in GDEM in particular. The research proves the existence of error in GDEM (Miliaresis & Paraschou, 2011; Gesch et al., 2012). It proves that the error in DEMs is non-random (Oksanen & Sarjakoski, 2006; Erdoğan, 2010). Researchers have even discovered effective ways to model and predict error in DEMs (Kyriakidis, Shortridge, & Goodchild, 1999; Fisher, 1998). Specifically, conditional stochastic simulation (CSS) has proven immensely useful when coupled with Monte Carlo analysis (MCA) for ameliorating the impact of error in DEMs (Aerts, Goodchild, & Heuvelink, 2003; Castrignanò et al., 2006). There remains only one real problem regarding uncertainty that has not been thoroughly researched: how best to put information about error and uncertainty in the hands of the average GIS user.

That is problematic because, whether the user is aware of it or not, the uncertainty and error in a dataset become embedded in the result of any GIS operation performed on that dataset or any derivative of it. This can lead to serious problems if, for example, those operations involve mapping a flood plain. While the analysis might be correct, the uncertainty in the data obscures the truth and may cause a person to make poor decisions about where to build their new home. There are ways to handle uncertainty in geographic data that help minimize its impact, such as conditional stochastic simulation (explained in later sections). Unfortunately, today's GIS user must be a geostatistical expert to apply conditional stochastic simulation to a given project. This effectively prevents the vast majority of GIS users from considering the impact of error and uncertainty in their research any further than writing a couple of sentences about it in the Limitations section of their papers. The major challenge facing these users is the task of developing the error model for their data. This research seeks to change that by bringing error modeling to them.

1.2 The Solution

This work is part of a larger project sponsored by the National Geospatial-Intelligence Agency (NGA) and led by Dr.
Ashton Shortridge and Dr. Joseph Messina at Michigan State University. The first stage of the project involved substantial research on modeling error and uncertainty in SRTM v4.1 and ASTER GDEM v2, the primary datasets on which this paper will focus. The second stage of the project – addressed in this thesis – involves distributing the results of these models so that researchers around the globe can take advantage of their work. The major difficulty in accounting for uncertainty in SRTM and GDEM with CSS is the development of the original error model. The second most difficult step is developing code to efficiently generate error surfaces from that error model. Removing those two steps from the chain of operations effectively removes the barrier to entry for the average GIS user, as MCA with pre-generated datasets is conceptually straightforward.

1.3 Research Objectives

The objectives of this research are as follows:

1. Synthesize the technical approaches of web-based spatial data distribution systems.
2. Create a system to distribute DEM error realizations on the Internet.
3. Create an example analysis to demonstrate how researchers might incorporate stochastic simulation in their own work.
4. Examine how the stochastic paradigm influences the traditional geographic data model.

Objective 1 is a precondition for achieving Objective 2. Without this foundation, it would be nearly impossible to build an effective system. Objective 2 will ensure that researchers around the globe can employ complex error modeling in their work without the need to be experts on the subject. However, simply accessing the results of error models will not be enough – researchers need to know what to do with these data. Objective 3 will provide examples of how to use the stochastic paradigm on three common and simple GIS operations: slope, aspect, and viewshed analysis. Objective 3 is also crucial to accomplishing stage two of the larger NGA project because education is just as important as access. Objective 4 addresses the fundamental changes the stochastic paradigm makes to the definition of geographic data. Goodchild, Shortridge, and Fohl discussed this issue as early as 1999. Rather than including measures of uncertainty in the metadata, our system effectively includes uncertainty information within the data themselves. Reconceptualizing the data model represents one of the first big steps towards the creation of the "error-aware GIS" alluded to throughout the literature (Duckham, 2002; Fisher & Tate, 2006).

We begin by exploring the literature surrounding the existence and characteristics of error and uncertainty in DEMs and the problems they cause. Then, we discuss methods for analyzing and ameliorating the impacts of error and uncertainty on GIS operations. Next, we turn to the literature surrounding web-based geographic data distribution systems: their origins, their construction, and modern examples. After that, we discuss the design philosophy and plan for building and implementing our own system, followed by an examination of why the original plan failed and how we adapted to the challenges we faced. Then, we provide an example of how to use the data distribution system we created over the course of our research. Finally, we conclude with a summary of what we accomplished and what remains to be done.

2 Literature Review

2.1 Error and Uncertainty

Error exists in all geographic datasets.
DEMs are no exception, regardless of what technologies are used to collect the data (Zandbergen, 2010; Oksanen & Sarjakoski, 2006; Holmes, Chadwick, & Kyriakidis, 2000). Fisher and Tate (2006) present four varieties of error in geographic data: error with bias, systematic error, random error, and spatially autocorrelated error. Many researchers have studied the accuracy of SRTM and GDEM using other DEMs as references (Bolten & Waldhoff, 2010; Rexer & Hirt, 2014). Others have done the same using highly accurate laser altimeters (Hengl & Reuter, 2011; Zhao, Xue, & Ling, 2010; Ni, Sun, & Ranson, 2015) or GPS benchmarking (Li et al., 2013). Together, these studies reveal that error in DEMs generally, and in SRTM and GDEM in particular, is spatially autocorrelated – the closer two positions are, the more likely they are to have similar levels of error.

2.1.1 The Spatial Structure of Error

While ubiquitous, error and uncertainty have characteristics that vary from dataset to dataset. If researchers can find the characteristics shared by spatially autocorrelated clusters of error, then they can model error based on the presence of those characteristics (Carlisle, 2005). Error in SRTM varies spatially depending on land cover, aspect, and slope (Shortridge & Messina, 2011). Error in GDEM is also affected by the same factors (Jing et al., 2014). Simply put, areas with forest cover, mountainous regions, and rugged areas whose slopes face a certain way are more error prone than areas which do not have those characteristics. In addition to those commonalities, GDEM uncertainty is heavily affected by the number of "scenes" (individual satellite images) used in the derivation of a particular value, known as its "stack number" (Miliaresis & Paraschou, 2011).

2.1.2 Error Propagation

All data have inherent error and uncertainty. When a GIS user performs an operation on a dataset, the uncertainty from the input dataset will be present in the output as well. This process is known as error propagation, and it is problematic because "the output may not be sufficiently reliable for correct conclusions to be drawn from it" (Heuvelink, 1999). Understanding how error and uncertainty propagate can help reduce their impact on the output. CSS is a statistical technique which researchers use to do just that. Petroleum exploration is one field which employs CSS quite extensively, and in a way very similar to how it is used in this research (Hohn, 2013).

To take an example from the oil and gas industry, imagine an oil reservoir deep underground. The goal is to make the most accurate map possible of that reservoir given the limited data available. There are many points at which field workers have collected bore samples, but because the boundaries of geological features are not quite as discrete as we typically imagine, we can only find them by taking measurements of continuous data (in this case, the concentration of oil). Naturally, nearby boreholes are likely to have similar sample values. Kriging, a linear estimation technique used by the mining industry since at least the 1960s, takes advantage of this (Cressie, 1990; Hohn, 2013). There are several types of kriging, but just two main reasons to use them: interpolation and simulation. In the case of the former, the modeler is simply trying to get a smooth, reliable prediction surface. The mechanics of interpolation kriging work like inverse distance weighting in that sample points close to a grid point have more weight in the prediction than those
farther away. Predictions in simple kriging (the only type discussed in this section) tend toward the global mean of the sample values, ensuring a nice, smooth surface. While this is great for making prediction maps, it totally ignores the frequently noisy structure of empirical data. It is like using the average of all historical maximum temperatures for a given day to estimate its high temperature: while it is not a bad guess, the actual value is likely to vary from the estimate. For a brief history and a good explanation of the origins and development of kriging for interpolation, see Cressie (1990).

Figure 1. A basic diagram of the simple kriging process. Phase A shows the sample data in red (boreholes from the example above). Phase B shows the regular grid to which values are interpolated in black. Phase C shows the calculation for a single point (green) in the grid. Phase D shows the final interpolated surface and its associated contours. The smooth surface generated is typical of interpolation kriging.

In the second case, and the one relevant to this project, the modeler does not predict grid values using the actual sample values, but values drawn from a normal distribution. This process is known as stochastic simulation, and it produces a realization of a random field per the Gauss-Markov theory. The modeler can go a step further and "condition" the distribution to make it reflect the sample data. In this case, known as conditional Gaussian simulation (also referred to by the broader term conditional stochastic simulation over the course of this paper), the sample values are converted to Z-scores, and these are used to predict Z-scores for each grid point. Then, a value for the grid cell is drawn from a normal distribution based on that Z-score. See Figure 1 for a graphical demonstration of the kriging process.

Let us revisit our imaginary oil reservoir to see the practical implication of CSS. By using simulation rather than interpolation, our survey team produces one possible realization of a conditional random field. Put another way, it is one equally probable representation of reality based on the characteristics of the available data. If we perform the simulation process a large number of times, we can create a probability surface that reveals how likely it is to find oil at a particular place, rather than where oil is. The former is preferable to the latter, because it captures the impact of uncertainty on the results of our oil discovery model. This is possible because the realizations have the same properties as the simulation model, preserving the texture of the data. The interpolation surface, on the other hand, is smooth and oversimplifies the complexity of reality. For geospatial analyses which depend on the texture of the landscape, such as slope, this is a crucial benefit.
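To make the distinction between interpolation and simulation concrete, the sketch below uses R with the gstat package – one of several tools that implement these methods. The package choice, the variogram parameters, and the sample values are illustrative assumptions, not the error model developed for this project; the same krige() call performs simple kriging when asked for a single prediction and conditional Gaussian simulation when asked for multiple realizations.

```r
library(sp)
library(gstat)

set.seed(42)
# Hypothetical "borehole" samples; locations and values are invented for illustration.
samples <- data.frame(x = runif(50, 0, 1000),
                      y = runif(50, 0, 1000),
                      z = rnorm(50, mean = 250, sd = 10))
coordinates(samples) <- ~ x + y

# Regular grid to which values are interpolated or simulated (Phase B in Figure 1).
grid <- expand.grid(x = seq(0, 1000, by = 25), y = seq(0, 1000, by = 25))
coordinates(grid) <- ~ x + y
gridded(grid) <- TRUE

# An assumed variogram model describing the spatial autocorrelation of the samples;
# in practice this would be fit to the data with variogram() and fit.variogram().
vgm_model <- vgm(psill = 80, model = "Exp", range = 300, nugget = 5)

# Interpolation: simple kriging (known global mean) yields one smooth surface.
smooth <- krige(z ~ 1, samples, grid, model = vgm_model, beta = mean(samples$z))

# Conditional Gaussian simulation: each of the nsim layers is one equally probable
# realization that honours the samples but preserves realistic texture.
realizations <- krige(z ~ 1, samples, grid, model = vgm_model,
                      beta = mean(samples$z), nsim = 8, nmax = 30)
spplot(realizations)  # compare the noisy realizations with the smooth kriged surface
```

Summarizing the realizations cell by cell – for example, the proportion of runs in which some threshold is exceeded – yields the kind of probability surface described above.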
2.1.3 Mitigating Error Impacts with Geostatistics

Many methods have been used to model uncertainty and error in DEMs (Erdoğan, 2010; Cuartero et al., 2014). The most thoroughly researched method for modeling and understanding error and uncertainty is conditional stochastic simulation. In their 2006 paper, Wechsler and Kroll discuss how CSS can be used "for evaluation of the effects of uncertainty on elevation and derived topographic parameters." Indeed, Heuvelink (1999) has long used CSS to model uncertainty in geographic data, including DEMs. CSS is often coupled with another technique called Monte Carlo simulation to study uncertainty propagation. Uncertainty and error propagate in all geographic operations (de Moel & Aerts, 2011; Heuvelink, 1999; Arbia, Griffith, & Haining, 1998), but CSS and MCS can be used to measure and mitigate their impact.

Combining CSS and MCS to study uncertainty propagation is by no means a technique unique to elevation data, nor even to the field of GIScience. The literatures of spatial decision support systems, geostatistics, and geophysics (to name just a few) contain many examples of studies on the impact of uncertainty on model outputs. More relevant to this work, Fisher (1998) and others such as Carlisle (2005) and Castrignanò et al. (2006) have shown that CSS and MCS greatly improve the quality of elevation error modeling, thus enhancing the accuracy of outputs. While these effects can be measured and accounted for, few people do so because they lack the experience necessary to apply CSS to their own work. The researchers mentioned above have done work with CSS and MCS, but they are domain experts. Furthermore, they are not necessarily interested in the data or the output of the model they are working with; they are interested in measuring uncertainty propagation. While this is scientifically worthy in its own right, for these techniques to be meaningful to the geosciences community they must have a practical application. Experts also need to be able to communicate effectively about uncertainty in data to end users who are not used to working with it – a challenge no one to date has been able to overcome sufficiently, as discussed in Section 2.2.1.

The process of repeating CSS to create a probability surface described in the previous subsection is known as Monte Carlo simulation. To better understand it, let us drop our oil reservoir example and consider a simple, widely known GIS operation: the viewshed calculation. In a standard viewshed calculation, our inputs are a digital elevation model, a point of origin, and other parameters depending on the specific implementation of the algorithm, such as observer height and atmospheric refraction. In a Monte Carlo simulation, the model looks very similar. In fact, one will notice that the only difference between the standard (Figure 2) and the MCS (Figure 3) versions of the model is that the latter accepts an error realization (one version of reality) instead of a DEM for its elevation input parameter. This is, however, a little deceptive, because Figure 3 only represents a single run of the full MCS model. The complete model contains numerous runs, as one can see from Figure 4. Each realization is, in theory, equally likely to represent reality, so it makes almost intuitive sense that more runs produce a better result. The more possibilities one considers, the clearer the picture of reality becomes. If a particular grid cell falls within the viewshed in over 90% of simulations, we can be quite certain that it would really be in the viewshed if we stood at our selected point of origin and looked for it.

This is, of course, a very simple example. Bearing in mind the practical and computational challenges involved in creating robust, effective models for generating error realizations, it is easy to see why most spatial research has been so slow to include them. That said, there are a variety of example applications that, while still conducted by domain experts, represent quintessential real-world use cases.
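The sketch below outlines how the full Monte Carlo loop of Figure 4 might look in R driving GRASS GIS, the toolchain used for the example analysis later in this thesis (Section 5.1.2). It assumes an already initialised GRASS session and a set of pre-generated error realizations imported as rasters named realization_1 through realization_50; the map names, viewpoint coordinates, and number of runs are placeholders rather than values from the actual study.

```r
library(rgrass7)  # assumes initGRASS() has already set up a GRASS location with the data

n_runs   <- 50
observer <- c(637500, 4765200)   # hypothetical viewpoint (easting, northing)

for (i in seq_len(n_runs)) {
  # One viewshed per error realization; the -b flag returns a boolean visible / not visible raster
  execGRASS("r.viewshed",
            input              = sprintf("realization_%d", i),
            output             = sprintf("viewshed_%d", i),
            coordinates        = observer,
            observer_elevation = 1.75,
            flags              = c("b", "overwrite"))
}

# Cell-by-cell average of the 50 boolean viewsheds: the probability that a cell is visible
execGRASS("r.series",
          input  = paste(sprintf("viewshed_%d", seq_len(n_runs)), collapse = ","),
          output = "viewshed_probability",
          method = "average",
          flags  = "overwrite")
```

Cells with values near 1 are visible in nearly every realization, while intermediate values flag locations whose visibility is genuinely uncertain given the error in the elevation data.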
Figure 2. The standard method for calculating a viewshed.

Figure 3. A single run of the Monte Carlo Simulation viewshed model.

Figure 4. The entire Monte Carlo Simulation viewshed model.

The CSS/MCS technique discussed above offers a way for users to understand how error and uncertainty propagate and to account for them in their analyses. One such example comes from Oksanen and Sarjakoski (2005), who detail the effects of error propagation in drainage basin delineation. Another example of using CSS to quantify error can be found in the work of Hengl, Heuvelink, and Van Loon (2010), who use CSS to generate stream networks from elevation data. One of the more influential examples of CSS/MCS and its benefits comes from Aerts, Goodchild, and Heuvelink (2003). Their case study provided ski resort owners with far more accurate estimates of cost and schedule to build a new ski run than traditional route optimization, potentially saving them millions of dollars in unforeseen costs. The author highly recommends this paper to the curious reader, as it offers a clear explanation of the processes involved and is relatively accessible even to those new to CSS/MCS.

2.2 GIS and Uncertainty

2.2.1 Spatial Data Quality

When beginning a new project, the quality of the data used should be of the utmost concern to the researcher. But what defines the quality of a spatial dataset? There are two core communities concerned with error in geographic data: spatial data quality and accuracy. The former focuses mostly on standards, while the latter is concerned with positional and attribute accuracy. There is some consensus within the discipline that this distinction is arbitrary, and that efforts should be made to draw the two research groups back together, as both of them are fundamentally concerned with the same thing – data quality (Devillers et al., 2010).

Questions of data quality arose long before GIS, but are particularly relevant since its rise to popularity. Questions of accuracy arise from the simple fact that GIS models the real world, and models are always imperfect: "These forcible deviations between a representation and actual circumstances constitute error" (Chrisman, 1991). The accuracy community's research has focused largely on the quantification of error and uncertainty (examples include Heuvelink, 1999; Goodchild, 1993; Heuvelink & Burrough, 1993; Kyriakidis, Shortridge, & Goodchild, 1999; Goodchild, Shortridge, & Fohl, 1999; Wechsler, 1999). The spatial data quality community concerns itself with how to communicate those imperfections in a way that is meaningful. Regardless of their differences and despite many advances in both areas, "a large body of scientific knowledge is still only in the hands of researchers and embedded in scientific publications" (Devillers et al., 2010).

The boon that GIS offers to researchers, planners, and analysts of all kinds is hard to overstate. Members of those communities have been able to accomplish things previously thought impossible, and in no time at all. It is easy to get caught up in the smooth flow of things and forget to ask hard questions. "Errors and uncertainties in data can lead to serious problems, not only in the form of inaccurate results but also in the consequences of decisions made on the basis of peer data," leading to repercussions ranging from getting directions to the wrong place to legal action (Goodchild, 1993). By paying attention to spatial data quality, GIS users can avoid those problems. Accuracy of the data is an important characteristic, but communicating information about accuracy in an actionable format is essential.
From the user's perspective, useful metadata answers one simple question: are the data fit for use in a particular application? Good metadata should help users with that "fitness for use" assessment (Chrisman, 1991; Kim, 1999; Veregin, 1999; Guptill, 1999; Devillers et al., 2002; Van Oort & Bregt, 2005). In the spatial community, as in others, there are metadata standards which require that metrics be included to help researchers answer that question. Unfortunately, metadata frequently fall short of that goal. Research shows that, contrary to popular belief, spatial data users often do attempt to assess fitness for use (Van Oort & Bregt, 2005). However, even experienced users can have trouble doing so (Devillers et al., 2010), and others feel that the cost in time and other resources to fully assess data quality outweighs the benefits of doing so (Van Oort & Bregt, 2005). One reason commonly suggested for this failure is that standards often document uncertainty using confusing terminology and obscure methodologies (Devillers et al., 2010). Early metadata standards work from the Federal Geographic Data Committee (FGDC) and the International Hydrographic Organization was essential for reasons discussed in the Web Standards and Geographic Information section of this paper, but failed to appropriately address fitness for use issues (Kim, 1999; Veregin, 1999; Guptill, 1999; Salgé, 1999). Since then, little practical progress has been made towards easing the assessment of fitness for use (Devillers et al., 2010). However, the literature reveals a wide variety of approaches to the problem, mostly from the accuracy community, including visualization (MacEachren, Robinson, & Harper, 2005) and web-based uncertainty simulation (Bowling & Shortridge, 2010; Byrne et al., 2010; Walker & Chapra, 2014). These developments are promising and reflect the beginnings of practical applications of research regarding fitness for use (Devillers et al., 2010).

This research constitutes another contribution to those developments. By removing the complicated statistical models which describe the error inherent to GDEM and SRTM from the user's workflow, it improves the accessibility of spatial data quality information. Instead of attempting to infer the accuracy of information at a particular spot, using error realizations from the web distribution system allows the GIS user to work with data in a format and manner he or she is familiar with. The only major changes are in the number of times the user must run the analysis and the addition of a final map algebra step. After that, fitness for use can be visualized using the resulting probability surface, unlike traditional analyses. See the Using the System section of this paper for a detailed example.

2.2.2 A Brief History of Error-aware GIS

The concept of an "error-aware GIS" capable of handling spatial data quality information is not a new one (Unwin, 1995). In fact, the impact of uncertainty on GIS and its uses has long been a research priority (Heuvelink, 1999; Goodchild & Gopal, 1989). Previous research has even led to the development of a few systems which aspired to the title of "error-aware", though they have all had drawbacks that tempered their usefulness.

Logsden, Bell, and Westerlund (1996) developed an uncertainty visualization tool for land transition probabilities using C, UNIX shell scripts, and ARC/INFO Macro Language (AML).
While undeniably useful for planners, this tool falls far short of the "error-aware GIS" described in the literature. Its primary contribution to that goal was the development of a visualization tool, and it may be better described as a "probability-aware GIS". While it did employ stochastic modeling, those models were used to determine the probability of land transition based on Landsat data, and data quality was not considered. Still, the ideas behind the visualization techniques they employed serve as a useful guide for future developers.

Another, much more robust example of error-aware GIS development comes from the work of Goodchild, Shortridge, and Fohl (1999). They proposed a method for encapsulating uncertainty within a geographic dataset. This is important for two reasons. First, "Choosing a data model that artificially separates quality and spatial data entails additional conceptual and implementational structures to maintain the connections between spatial data and its quality" (Duckham, 2002). The second is the fundamental change to the geographic data model which encapsulation creates. It prevents the geographic data and at least some of their associated metadata from being separated, a point which Beard (1997) says is increasingly important in a world driven by data sharing. It also precludes data creators from slapping unhelpful, generic spatial data quality information on their dataset, because the simulation models are specific to each dataset (Duckham, 2002). The approach taken in that paper is revolutionary and doubtless a good example to follow, but the key drawback is that data creators still need to be geostatistical experts, making it difficult for the average user to create his or her own data. It would also likely increase the cost of data production for data vendors.

Duckham (2002) attempts to address some of those issues in his own work on developing an error-sensitive GIS for Kingston Communications. Duckham employs an object-oriented database design to store data and their associated metadata (data quality information included) and then employs a user interface which assists in both the collection and analysis of those data. This concept is very similar to the encapsulation work of Goodchild, Shortridge, and Fohl (1999), but improves on it by providing a user interface to work with the data and their quality information. The artificial-intelligence-assisted data collection user interface also overcomes the data production cost problems of that work (Duckham, 2002; Goodchild, Shortridge, & Fohl, 1999).

Figure 5. Duckham's (2002) conceptual model of the error-sensitive GIS developer's perspective.

Another informative example in the timeline of error-aware GIS development is the work of Aerts, Clarke, and Keuper (2003), who qualitatively tested uncertainty visualization techniques. This work is essential to the development of an error-aware GIS because one of its core functions would be to aid the user in the conceptualization and visualization of error. Indeed, all of the research discussed so far in this subsection stresses the essential nature of an intuitive user interface in the development of an error-sensitive system.

Karssenberg and De Jong (2005a, 2005b) also worked on creating an error-aware GIS by extending the PCRaster environmental modeling package. PCRaster was created and is now maintained by the University of Utrecht (Karssenberg et al., 2010).
They focused on environmental modeling using the existing modeling language built into PCRaster, but extended the framework from two- to three-dimensional models and added tools for MCS. The flavor of error propagation analysis this enabled was unusual compared to the other variations discussed in this paper because its primary focus was continuous phenomena, such as temperature or precipitation, rather than discrete features (Karssenberg & De Jong, 2005a). Because elevation is also a continuous phenomenon, PCRaster is particularly relevant to this research.

On the other hand, PCRaster had its share of problems. For one, it still requires expert knowledge of geostatistics, as the functions it employs require the user to specify the semivariogram and other parameters used in the conditional stochastic simulation and Monte Carlo analysis process. Additionally, while the rather rigid structure of the program makes its use straightforward for the user, it is not computationally optimized or particularly extensible. Finally, Karssenberg and De Jong (2005b) point out that, at the time of writing, PCRaster was not truly a GIS. That is, it was not capable of visualizing or organizing data in its own right, and relied on other software to do so. Since then, the University of Utrecht has released the software under the GNU General Public License and added visualization support – two very important conditions for making these types of analysis accessible to the general public.

In 2007, Heuvelink and Brown introduced perhaps the most fully featured attempt at an error-sensitive GIS to date: a prototype known as the Data Uncertainty Engine (DUE). Briefly put, "The Data Uncertainty Engine (DUE) is a prototype software tool for assessing uncertainties in environmental data, storing them within a database, and for generating realisations of data to include in Monte Carlo uncertainty propagation studies" (Heuvelink, 2007). The DUE incorporates the ideas of Goodchild, Shortridge, and Fohl (1999) in that it encapsulates an uncertainty model within the data model, but it does so following the same object-oriented principles which Duckham (2002) espouses. Like PCRaster, it is GNU GPL licensed and can handle continuous data (Brown & Heuvelink, 2008). Unlike PCRaster, it uses a database to organize and store the models and data. The database can also be used to apply common models to standard geographic data which has no encapsulated model, similar to Duckham's (2002) system. Additionally, DUE has the capacity to assist not only positional uncertainty analysis, but also attribute and temporal uncertainty analyses, or any combination thereof (Brown & Heuvelink, 2007; Heuvelink, 2007).

Unfortunately, like the original PCRaster, DUE cannot rightly be called a GIS and must be used alongside another program which can perform GIS operations (Heuvelink, 2007). On the surface, this may seem an odd design choice. Why build an error-aware "GIS" when it cannot perform GIS operations? By separating the GIS from the assistance that the DUE offers, the end user can use whichever software they feel comfortable with – a major user experience bonus. Additionally, it avoids platform lock-in and ensures that if the development of a particular GIS falters, the DUE can live on; this point is not trivial, as it seems Duckham's platform died with the spatial software division of one of the companies with which he collaborated (according to http://www.laser-scan.com/demo/laser-scan-history/, accessed March 01, 2017).
2.2.3 Uncertainty and GIS in Distributed Environments

Recently, uncertainty analysis, like so many other computational challenges, has moved to the cloud. The Model Web, an idea created and undertaken by the Group on Earth Observation (GEO), "is a generic concept for increasing access to models and their outputs and to facilitate greater model-model interaction, resulting in webs of interacting models, databases, and websites" (Nativi, Mazzetti, & Geller, 2013). The Model Web (Bastin et al., 2013) would combine disparate models from various disciplines to make them accessible from a single interface, allowing the user to find, access, chain together, and run models entirely on the web. The UncertWeb would do something similar; while the Model Web offers access to traditional geospatial models, UncertWeb offers uncertainty-aware models.

As Nativi, Mazzetti, and Geller (2013) point out, "The long term vision of a consultative infrastructure is clearly an ambitious goal." Unfortunately, it seems to have proven too ambitious. Work on both the Model Web and UncertWeb appears to have ended at or before the summer of 2017. For the Model Web, there is no official website to check, as the plan was to include Model Web work within the Global Earth Observation System of Systems (GEOSS) framework, but the GEOSS interface does not include Web Processing Services (WPS) – the main Open Geospatial Consortium standard on which the Model Web was to rely – as a filter option. This suggests that the work was not actually completed within the timeline mandated by the project sponsors, and was discontinued thereafter (Nativi, Mazzetti, & Geller, 2013). As of summer 2017, the official UncertWeb website returns no response (http://www.uncertweb.org, accessed July 10, 2017), despite the fact that the domain is still registered to one of the project's main developers (https://who.is/whois/uncertweb.org, accessed July 10, 2017) and that the official project page (maintained by the leading university on the grant: Aston University, Birmingham, UK) lists the project as "completed". Additionally, development on the official UncertWeb GitHub repository ceased in 2013, with only minor modifications thereafter.

Despite their early end, the work that went into these platforms offers many valuable insights on system design philosophy. I believe that the Model Web and UncertWeb designers are visionaries in much the same way Duckham (2002) and Aerts, Clarke, and Keuper (2003) were. The discoverability and interoperability principles they espoused are important to consider when developing a sustainable, useful system, even if their plans were never fully realized. Additionally, the truly immense body of literature related to the Model Web and UncertWeb touches numerous disciplines from information theory to ecology – evidence itself of the value of abstract specifications. Distributed systems literature and the number of disciplines which contribute to it will only expand in the future. This research aims to continue down the path the Model Web and UncertWeb research created while avoiding some of the problems that brought them down. To accomplish that aim, a careful assessment of implementation decisions made then and the technological advances since is required.

Figure 6. A timeline of some of the major contributions to the concept of error-aware GIS.
2.3 GIS in Distributed Environments

In order to discuss distributed GIS specifically, one must first understand what a "distributed system" actually is. According to Coleman (1999), the term distributed computing was coined by Champine, Coop, and Heinselman (1980) "to describe a situation where processing tasks and data are distributed among separate hardware components connected by a network." In simple terms, distributed computing allows for the separation of data management, analysis, and data visualization across multiple machines connected by a network such as a Local Area Network (LAN) or the Internet.

The benefits of distributed computing are numerous, but especially so in the scientific community. Sometimes referred to as cloud computing, distributed computing frees researchers from the limitations of their own personal hardware. In a distributed system, researchers may remotely control intensive analyses that require very expensive hardware, allowing the centralization of computing resources and thereby reducing costs. Distributed systems can also allow researchers to select and analyze data without having the data itself or the analysis software installed on their local machines.

2.3.1 Early Examples of Distributed GIS

Until the mid-1980s, distributed GIS followed the model of most computer systems and consisted of a mainframe or mini-mainframe central host to which users would connect via a terminal with very minimal computation power of its own. This could be considered "distributed" computing in that each user shared central resources and accessed those resources from a separate machine. The advent of the PC changed that, and by 1986 PCs were rapidly becoming the preferred machine for GIS users due to their low cost and the fact that users no longer had to share scarce computation resources. This shift away from distributed computing had its own problems, particularly because data became difficult to share and lacked the quality control implicit to a centrally hosted system. By the late 1980s, central spatial database servers entered the network, supercharging distributed spatial analysis by reintroducing an authoritative data source and providing mass storage devices which could be accessed from low-capacity personal computers (Coleman, 1999).

As network technology continued to improve in the early 1990s, networks were no longer constrained to office buildings or universities. Instead, the numerous previously isolated networks spread across the U.S. (and later, the world) were connected to form the Internet (Coleman, 1999). It became possible for organizations to share information across vast distances at incredible speeds. For GIS users, it changed the way geographic information was discovered and accessed forever. The same ideas as the intra-organizational distributed systems described in the preceding paragraph, expanded to Internet scale, led to the FGDC's National Spatial Data Infrastructure (NSDI) initiative (FGDC, 1994). Though governments had been working on Spatial Data Infrastructures (SDIs) since at least the mid 1980s, the advent of the Internet changed the scale of those projects immensely (Coleman, 1999; Guptill, 1999; Maguire & Longley, 2005). Instead of a central database server on a closed network, these large scale SDIs had web-based data distribution systems commonly called (meta)data catalogs or clearinghouses.
Clearinghouses were specialized websites meant to be "the means by which data users can more economically find, share, and use geospatial data" (Guptill, 1999). The goal of all such systems, early or contemporary, is to help users save time and effort when acquiring new data and to lower costs for producers by reducing the amount of duplicated data. Clearinghouses followed a design known as Resource Oriented Architecture (ROA). They were typically filesystem-based and allowed clients to select datasets piece by piece with many options to choose from (Han et al., 2008). Often they relied on File Transfer Protocol (FTP), and the only graphical interface one could expect was from either a dedicated FTP client or a web browser capable of working as one. As technology improved and the Internet grew, it became common to access the same sort of systems via Hypertext Transfer Protocol (HTTP) rather than FTP – the ubiquitous download link. Even as databases and their dynamic query capabilities became more popular, ROA dominated the SDI environment.

The problem with a resource-oriented approach to portal development is that data access is often only one aspect of a user's problem. What if a user needs to process the data in a particular way, but does not have the software available on their local machine? What if a user has a postal address, but needs a point defined by latitude and longitude instead? At this point, the user needs a service, which, at the time, was unavailable in a resource oriented architecture. This would later change, as discussed in the Resource Oriented Architectures section of this paper, but by 2002 the spatial community was already frustrated with the limited capabilities of clearinghouses. To meet user needs, "The functional capabilities of clearinghouses should likely be changed from a data-oriented to a user and application-oriented focus" (Crompvoets et al., 2004). This shift reflects not only lessons learned during the development and use of first generation systems, but also the availability of new technologies such as Web Services.

Thanks to the standardization of Web Services, portal developers could design abstract systems to interact with third-party services, even without knowledge of what those services do or whether they exist yet. This design paradigm is much better at fulfilling a user's needs than first generation systems because it is more flexible. For example, portals can provide services that allow the user to view the data in his or her web browser before downloading. Services may also directly expose data, allowing a user to download only the data subset they require as opposed to an entire dataset. In 2017, when cell phones are as powerful as the desktop PCs of 10 years ago, these miraculous capabilities are easy to take for granted simply because they are so ubiquitous.

Obviously the Internet has been incredibly important to the sharing, analysis, and display of geographic data (Rinner, 2003). It is worth pointing out that the growth of the Internet in general is what has enabled the growth of distributed GIS in particular. Internet protocols were not invented with a geographic context in mind. Distributed GIS is built on the extensive set of existing standards which power the Internet, so understanding some of them, at least at a basic level, is important to understanding our proposed system.

2.3.2 The Importance of Web Standards

Standards are the key to interoperability in all technology.
The American National Standards Institute (ANSI) was originally founded as the American Engineering Standards Committee (AESC) in 1919 to promote and maintain standards in the engineering world (ANSI history: https://www.ansi.org/about_ansi/introduction/history). Since then, ANSI has developed specifications for everything from paper sizes (ANSI/ASME Y14.1) to programming languages such as "ANSI C" (ANSI X3.159-1989) and "American Standard Fortran" or, informally, "FORTRAN 66" (ASA X3.9-1966). In the case of safety equipment such as eye protection, without such standards employers would need to trust the word of private companies when it came to the safety of their employees' vision. In the case of software, it would mean that a package purchased from one vendor may not work with software from another. While this scenario is great for the software company, which would have a very captive user base, it is not so great for the software user. Imagine if ArcGIS was still the only program that could open a shapefile – ESRI would effectively have a monopoly on spatial analysis. Truly, standards are critical to the sharing of information.

The necessity of standards becomes even clearer in networked environments. What if web browsers such as Mozilla Firefox and Google Chrome used different methods of connecting to web servers? As Guptill (1999) points out, "The number of interfaces required (and the effort required) increases as the square of the number of communicating systems." With the need to design so many interfaces, the Internet could not possibly have grown at the rate it has. Despite the existence of numerous vendors who supply web browsers, all of them adhere absolutely to low level standards like HTTP. If they did not, they would not be able to access the Internet. However, precisely because HTTP is low level and focused, it gives developers a lot of flexibility in how to handle connections to a website after they have been established with HTTP.

The implications of this dynamic are best explained with an example. Imagine making a telephone call to a customer support line. You pick up the phone, dial a number, and you are connected to an automated system on the other end. Whatever happens next is up to you, the user. You can listen to the prerecorded messages and respond using predetermined prompts as explained by the system. You might, for example, "Press 2 for accounts and billing," after which you will be given more information and another set of ways to respond. Communicating over the Internet is very similar, in that while the communication between your phone and the automated system is standard – the tone produced when you depress a key on the keypad is the same every time – the response from the system is not. Pressing 2 when calling a different automated system, or even a different menu within the same system, is likely to have a much different outcome.

HTTP works in a very similar way. Without going into unnecessary detail, the HTTP specification contains several HTTP methods. These methods are often referred to as "HTTP verbs" because they have names like GET, POST, PATCH, and DELETE. In addition to these four there are several others (see https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods), each with their own purpose and semantics. HTTP verbs, conceptually, work in a way similar to pressing the buttons in the example above. Using certain methods on certain web addresses produces one set of results, while using those same methods at a different web address may produce different results. How the system responds is entirely up to the system designer.
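To ground the verb analogy, the sketch below issues two of those methods from R with the httr package against a hypothetical data-distribution endpoint; the URL, parameters, and response behaviour are invented purely for illustration and do not describe any real service.

```r
library(httr)

endpoint <- "https://data.example.com/realizations"   # hypothetical endpoint

# GET: ask the server for a representation of a resource, here a listing of
# available error realizations filtered by a query string
listing <- GET(endpoint, query = list(dataset = "gdem", limit = 10))
status_code(listing)            # e.g. 200 if the server chose to fulfil the request
content(listing, as = "parsed") # the body, however the designer chose to structure it

# POST: send data along with the request, here a body describing an area of
# interest and the address to which a download link should be emailed
job <- POST(endpoint,
            body = list(email = "researcher@example.edu",
                        bbox  = list(xmin = -84.6, ymin = 42.6,
                                     xmax = -84.3, ymax = 42.8)),
            encode = "json")
status_code(job)                # how the server responds is entirely up to its designer
```

The same two requests sent to a different endpoint, or handled by a different developer, could legitimately produce entirely different results – exactly the flexibility, and the ambiguity, described above.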
When a developer writes code to handle HTTP requests, he or she must decide explicitly how the server will respond to each individual request method at each endpoint (web page). Similarly, phone system designers have to decide how to respond to each individual tone for each individual menu. While good systems follow predictable patterns such as the REST style discussed in the “Resource Oriented Architectures” section of this paper, there is no standard procedure for how a system must respond to a particular input. What that means in practical terms is that when you type “http://www.example.com” into a web browser, there is no law of computing that says what has to happen next. You can be redirected to a different page, a download could begin, or anything else the developer wanted could happen so long as the basic rules of HTTP are followed. It all depends on how the server responds to the request your browser issued to the website. While different HTTP methods have slightly different properties (POST, for example, allows the requester to send some data along with the request), it is this flexibility that allows developers to create the interfaces which enable distributed systems that perform arbitrarily complex processes in response to requests.

In order to interact with data or have a server perform geoprocessing remotely, a developer must create a Web Service. IBM broadly describes a web service as “a generic term for a software function that is hosted at a network addressable location.”7 Confusingly, there is also a formal web service standard. The W3C formally defines Web Services as using SOAP and the Web Service Description Language (WSDL)8. In this paper, we retain the informal definition because the difference between the definitions lies in implementation details, not the concept. Furthermore, whether or not we follow a particular standard is generally not important to the users – they just want to have a working, reliable service.

In the terminology of web services, the web address to which the request is sent is referred to as an endpoint. Using a server-side processing architecture, a GIS developer can, for example, create an endpoint that accepts POST requests containing a user’s email address and an arbitrary geometry, runs an analysis on data stored locally using that geometry, and emails the result to the user after the analysis finishes. By employing a hybrid architecture, the server may do the processing and return the data to the user’s web browser for visualization. Under a client-side architecture, the endpoint could instead be designed to return data directly to the user, allowing the browser to perform analysis and visualize the result (Walker & Chapra, 2014; Bryne et al., 2010). The dynamic nature of such a web processing service makes the flexibility which HTTP offers a must-have, but there are downsides to that flexibility as well. To continue with the previous example, POST is a logical verb to use. However, there’s nothing to prevent that developer from only responding to the PATCH or PUT methods instead. Because there are no set rules for how a server responds to a request, it can be difficult to chain systems together.

6 List of HTTP Methods: https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods
7 JSON Web Services: https://www.ibm.com/support/knowledgecenter/en/SSGMCP_5.3.0/com.ibm.cics.ts.webservices.doc/concepts/concepts_json.html
8 W3C Glossary, Web Service: https://www.w3.org/TR/2004/NOTE-ws-gloss-20040211/#webservice
It can also be difficult to discover and understand what systems do. To combat these problems, programmers have come up with standards and specifications that operate at higher levels than HTTP, such as the Simple Object Access Protocol (SOAP), which defines a series of rules for machine-to-machine communication and is commonly used in distributed GIS solutions. For distributed computation to work, all the machines must have a previously agreed upon object model, and interoperability in GIS has long been a topic of research (Sondheim, Gardels, & Buehler, 1999). Whether sharing geographic information amongst researchers, processing it on a remote server, or downloading it from a data producer, the geographic data case is fundamentally concerned with interoperability and therefore standardization. Furthermore, standardization has many benefits beyond interoperability. The process of discovering geographic data and geoprocessing services is much easier when a standardized metadata format is in use. It is important to note, however, that standardization is challenging and only matters if the entire community participates. Because of this, standardization is almost always an organic, user-community driven process triggered in response to the need to communicate across various systems (Salgé, 1999).

2.3.3 Web Standards and Geographic Information

Geographers around the world have understood the necessity of standardization for a long time. In 1919, President Woodrow Wilson attempted to standardize the production of federal geographic information by creating the Board of Surveys and Maps, which would later go on to become the Federal Geographic Data Committee (FGDC)9. Since then, numerous national, regional, and international organizations have attempted to standardize geographic data (Salgé, 1999). Standardization in GIS is the key to interoperability, and interoperability is the path to distributed GIS systems (Sondheim, Gardels, & Buehler, 1999; Coleman, 1999). With this in mind, the Open Geospatial Consortium (OGC) is an international non-profit organization made up of representatives from the geospatial community that creates the standards necessary for sharing geographic data, creating web services, and more. Founded as the Open GRASS Foundation in 1992 to support private sector adoption of GRASS GIS, the OGC refocused its priorities in 1994 and became a standards organization. Their goal was to build a foundation for a rich ecosystem comprised of “diverse geoprocessing systems communicating directly over networks by means of a set of open interfaces based on the ‘Open Geodata Interoperability Specification’ (OGIS)”10. OGC went on to create several very important standards for geographic data on the web. A few of the particularly well-known ones are the Web Feature Service (WFS), Web Map Service (WMS), and Web Map Tile Service (WMTS), though there are many others.11 The OGC standard most relevant to this research is the Web Processing Service (WPS) specification (OGC, 2015).

9 FGDC History: https://www.fgdc.gov/who-we-are/history
10 OGC History: http://www.opengeospatial.org/ogc/history
11 List of OGC Standards: http://www.opengeospatial.org/standards

The WPS specification outlines a standard set of methods which a system must provide via SOAP in order to be considered WPS-compliant. The specification also outlines standards for service metadata and the specific ways in which that metadata must be made available.
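To give a flavor of what that standardized interface looks like in practice, the snippet below builds the kind of key–value GetCapabilities and DescribeProcess requests that WPS implementations typically answer over plain HTTP alongside their SOAP/XML bindings. The host name and the process identifier are hypothetical, and a real deployment may expose a different subset of bindings.

```javascript
// Hypothetical WPS endpoint: only the query parameters follow the WPS
// key-value convention; the host and process identifier are invented here.
const endpoint = 'https://example.org/wps';

// Ask the service to describe itself: which processes it offers, contact info, etc.
const getCapabilities = `${endpoint}?service=WPS&request=GetCapabilities`;

// Ask for the declared inputs and outputs of one advertised process.
const describeProcess =
  `${endpoint}?service=WPS&version=1.0.0&request=DescribeProcess&identifier=viewshed`;

console.log('DescribeProcess URL:', describeProcess);

// Node 18+ provides a global fetch; both responses are XML documents whose
// structure is dictated by the WPS specification.
fetch(getCapabilities)
  .then((res) => res.text())
  .then((xml) => console.log(xml.slice(0, 300)));
```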
These strict rules and codified principles can be overly constricting for small projects, but are necessary for large-scale systems. Following standards like the WPS allows users to access a service in a predictable, easy-to-understand way. It also allows a developer to chain several web services together, realizing the OGC’s original vision of an interoperable network of geographic data and analysis.

Standards may be considered either de facto or de jure depending on their origins, but are essential regardless of those origins (Salgé, 1999). De facto standards are created ad hoc by user communities in a direct response to their needs. They are informal, but ubiquitous. For example, the ESRI Shapefile (ESRI, 1998) is a de facto standard format for geospatial vector data simply because of its past popularity, despite serious design flaws. De jure standards often begin as de facto standards which evolve and become formalized under the guidance of some overseeing committee that represents the needs of a community, such as the OGC, ISO, FGDC, or the FGDC’s European counterpart INSPIRE. OGC services in particular followed this path, while older standards such as the metadata and file transfer standards developed by national and regional organizations as early as the 1980s were created in direct response to a need for formal standards (Salgé, 1999).

Traditionally, commercial software development has implemented its own de facto standards as needed to encourage interoperability amongst a company’s own software products. This is why ESRI’s ArcGIS Online works very well with all of ESRI’s software. Because the standards are de facto and not available to the public, those software products suffer from limited interoperability with systems outside a company’s software ecosystem. As early as 1999, the rise of the Internet began to put pressure on many geographic software companies to adopt open standards and improve interoperability (Goodchild & Longley, 1999). This trend continues today, especially as the toolchains researchers rely on become more complex, as is evident from the ever-increasing number of OGC standards. Further evidence of this pressure shows in some of ESRI’s choices over recent years, including the decision to switch from Visual Basic to a free and open source language called Python12 for scripting in the ArcGIS suite. Other good signs of interoperability improvements include the ESRI JavaScript API (similar to the Google Maps API), the esri-leaflet plugin for the popular Leaflet library, and ESRI’s decision to adopt the Mapbox Vector Tile Specification.

Despite these promising developments, some things do not seem to change. In December of 2016, ESRI introduced their new, proprietary “expression language” called Arcade (Barker, 2016). According to its dedicated page on ESRI’s website, Arcade can be used for anything from “writing simple scripts to control how features are rendered, or expressions to control label text”.13 The main benefit of the language, however, is that it provides a single expression-based interface for styling and even analyzing data which can be used in any ESRI product, from mobile to browser to desktop. While highly interoperable within the ESRI ecosystem, the proprietary nature of the language prevents it from being used in any other context and steers ESRI users away from learning other languages which may be more broadly useful.

12 https://www.python.org/
13 https://developers.arcgis.com/arcade/
ESRI is by no means unique in this behavior, and is only referenced here as an example due to the organization’s massive influence in the geographic community. While ESRI and other companies producing proprietary software do offer solutions for interoperability that meet open standards in addition to their own private standards, there are myriad open source options capable of serving geographic data on the web. Boundless14, for example, uses exclusively open source technology to offer a commercially supported enterprise GIS solution similar to that of the ESRI suite. For technically advanced users looking to build a system of their own, Steiniger and Hunter (2012) provide an outline of selected existing open source software that can be used to build up a Spatial Data Infrastructure (SDI). They point out that for every category of system that composes an SDI, open source alternatives to the proprietary components exist. The simple existence of such software and the standards supporting it is important. As Dalle and Jullien (2001) point out, “Most software technologies are indeed subject to network effects and thus tend to give rise to de facto standards and to monopolies.” Without open standards provided by the OGC, ISO, FGDC, and others, those monopolies would be held in private hands (Salgé, 1999). Without open standards, reliable and interoperable systems like those offered by Boundless or discussed by Steiniger and Hunter (2012), Pebesma et al. (2010), and Yue et al. (2015) would be highly difficult, if not outright impossible, to build. Furthermore, the existence of some standards, such as metadata standards, allows other standards to be built on top of them, providing further benefits in the process; the Web Catalog Service specification, for example, enables users to discover data services.

2.3.4 The Limitations of Web Standards

There are, of course, limits to the benefits standards can provide. The biggest obstacle to implementing standards like the OGC’s WPS specification is the complexity. The WPS specification dictates what data formats the service can accept, what data can be accepted in those formats, the names of methods used to describe and call the service, and many more details. At 133 pages of dense technical language, it is also hard to read (OGC, 2015). Because of these barriers to entry, a geographer looking to develop a new geoprocessing service may have a hard time designing it from the beginning to be OGC compliant (Yue et al., 2015). Researchers are typically not IT experts, and this is a major barrier to the widespread adoption and implementation of OGC compliant services, which require a service-driven approach and a SOAP-based interface (Mazzetti, Nativi, & Caron, 2009). Without some experience in implementing standards-compliant systems, the only way to do so with any ease is to use an existing piece of software that implements the standards by default.

Another reason one might find to not implement a particular standard when creating a new web service is simply that the goals of the specification do not align with the goals of the service, the organization creating the service, or the community who will use the service. One great example of a large project that made the decision to avoid a standard in favor of its goals and usefulness for users is OpenStreetMap. At the time of writing, OpenStreetMap uses a REST API15 that is not OGC compliant.

14 https://boundlessgeo.com/
15 According to the wiki page: http://wiki.openstreetmap.org/wiki/API_v0.6
While all of the operations the OSM API supports – reading, creating, editing, and deleting map data – could be handled in an OGC compliant fashion, doing so would add a lot of extra work to the project for very minimal benefit to the community that supports it16. Another instance in which standards may hinder more than help a project is if that project is doing something new and revolutionary. The best standards anticipate change, but innovation can quickly make them obsolete (Salgé, 1999). For instance, the GeoBrain system implements many OGC standards, but chooses to modify the WPS because it does not fit the system’s intended purpose well enough. GeoBrain is discussed in detail in the Modern Examples of Distributed GIS section of this paper.

Finally, not all standards apply to all situations. It does not necessarily make sense for someone building a car in the United States to follow European standards. Still, standards like the minimum strength of the materials used in that car should be followed regardless of where it is built. In software, machines that will be part of a network need to follow the basic standards which enable the Internet. Beyond that, the question of which standards to follow is determined largely by the purpose of the system and, thereby, the architectural choices of its designer.

2.3.5 The Importance of Systems Architecture

For those of us who gained our computing and system design knowledge outside of computer science departments, the idea of a “system architecture” may be abstract. As it turns out, that’s not our fault. In his 1987 paper on the subject, John Zachman points out that the term is rather ambiguous even amongst systems architects. In fact, the purpose of the paper is to use concepts from engineering, construction, and building architecture to draw analogies with software, because it was impossible for professionals to agree on a definition. In a follow-on piece a decade later, Zachman (1997) still cannot provide a concise definition, but the ISO Architecture Working Group (2011) defines systems architecture as the “fundamental concepts or properties of a system in its environment embodied in its elements, relationships, and in the principles of its design and evolution”. The concept, more or less, is thus: systems are made up of many small pieces. Those pieces need to fit together perfectly when assembled to create a system. The only way to achieve that goal is to plan it out from the start, just like an architect draws floor plans for a house before construction begins. A system’s architecture is the conceptual model to which it must adhere, or risk imperiling the rest of the system.

Just as contractors rely on architects to give them direction, so too do developers rely on system architects. Given time and training, contractors can make changes to the design on the fly. A contractor could also have an architect develop several modular floor plans, allowing the contractor to swap parts of the plan in or out as needed. The same holds true for software architects and developers. The home architect, in this case, is analogous to a very talented developer or team of developers who create some cornerstone piece of software, such as a web server. The architects may design that server so that pieces of it can be swapped out or added as needed – a user authentication strategy, a template engine, or other common subsystems which run on web servers.

16 An OSM help question on the topic: https://help.openstreetmap.org/questions/981/ogc-and-interoperability
Other developers can take that piece of software and work with it, so long as they follow the basic principles which allow the various pieces of the system to interact. They can modify the existing system directly, or extend it to add additional functionality, as long as they consult the “blueprint” first to make sure the changes will not break anything else. At the end of the day, it is up to the developer to put the architect’s plan into action.

As in geography, there is a question of scale involved in system architecture. The example above covers the scale of a single web server, but in many cases a web server is just one part of an application. In the application built during the course of this research, for example, users interact (indirectly) with a database server and system processes in addition to the web server. Each subsystem has its own architectural requirements. Additionally, the system as a whole needs its own architecture. At the highest level, architecture must consider how the client and server interact. Today there are two main types of architecture: Resource Oriented Architecture (ROA) and Service Oriented Architecture (SOA). Where ROAs are concerned with providing access to resources, SOAs are concerned with getting things done – with the execution of some operation over a network.

2.3.6 Service Oriented Architectures

SOA is “an organizing and delivery paradigm” which can provide remote access to data, processing tools, and even visualization tools across various domains (Mazzetti, Nativi, & Caron, 2009). SOAs exist to support a model called service-driven (Nativi, Mazzetti, & Geller, 2013) or service-oriented computing, “A concept in which a larger software system is decomposed into independent distributed components ... These components are deployed on remote servers and are designed to respond to individual requests from client applications“ (Castronova, Goodall, & Elag, 2013). Basically, SOA seeks to provide a full suite of computing capability by aggregating access to services (as opposed to the services themselves) of various types from various domains via a single interface. SOA is all about machine-to-machine communication, and service-oriented applications work by chaining services together via a standard interface to complete an operation. For example, a user may request that a particular operation be performed on a particular piece of data. The user only sends a single request, but the server translates that request into a chain of service calls which may, for example, include a service to provide the dataset, a service to process that dataset, and a service to visualize the output, as sketched below. Users may also choose to chain several processing steps together (Di, 2004b).

The modular nature of SOA is perfect for large groups like the geospatial community, which have diverse interests but rely on many of the same tools. It allows subgroups to use their expert knowledge to maintain their own datasets and tools, exposing them for use as necessary by the broader community while maintaining total control over those tools (Mazzetti, Nativi, & Caron, 2009). Separating the code base out into small, separately developed pieces also improves maintainability. The benefits provided by the loose coupling between the system and the service are only possible because of interoperability standards like the Web Processing Service specification, the importance of which is discussed in the Web Standards and Geographic Information section of this paper.
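The sketch below illustrates that chaining from the orchestrating server’s point of view. The service URLs, payload shapes, and the choice of JSON are all hypothetical; a standards-compliant SOA would exchange SOAP/XML messages instead, but the control flow – one user request fanned out into data, processing, and visualization calls – is the same.

```javascript
// Hypothetical illustration of SOA-style service chaining: one user request is
// translated into calls to a data service, a processing service, and a
// visualization service. Every URL and payload here is invented for the sketch.
// Requires Node 18+ (global fetch) or a browser environment.
async function handleUserRequest(bbox, operation) {
  // 1. Ask a data service for the dataset covering the requested area.
  const dataRes = await fetch(`https://data.example.org/dem?bbox=${bbox}`);
  const dem = await dataRes.json();

  // 2. Hand that dataset to a separate processing service.
  const procRes = await fetch('https://processing.example.org/run', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ operation, input: dem }),
  });
  const result = await procRes.json();

  // 3. Pass the result to a visualization service and return its rendering.
  const vizRes = await fetch('https://viz.example.org/render', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(result),
  });
  return vizRes.arrayBuffer();
}
```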
The primary benefit of SOA is that it allows disparate entities to combine their services under a single application without relinquishing ownership of the code or data, which is essential for the development of a truly distributed GIS (Granell, Díaz, & Gould, 2010). The distributed, modular nature of SOA makes for easier code maintenance, as developers are only responsible for their small part of the system. Furthermore, the loose coupling between systems makes it easy to add additional services as they are developed and prevents the entire application from failing should something happen to a single service. These benefits make SOA an ideal candidate for systems like UncertWeb, GEO Model Web, and many of the other tools discussed in the Modern Examples of Distributed GIS section of this paper (Bastin et al., 2012; Nativi, Mazzetti, & Geller, 2013).

SOA is well suited to the design of portals and has enjoyed great success in the geographic community, but it has several problems as well. By far the most serious of those problems is the immense complexity of service-oriented systems (Bastin et al., 2012). Because they are based entirely on machine-to-machine communication, they require advanced user interfaces to make them accessible to their human users (Nativi et al., 2011). Even more complex, services may require other services to mediate between them for the purposes of, amongst other things, describing and converting data (Nativi et al., 2011; Nativi, Mazzetti, & Geller, 2013). It is almost impossible to implement a small-scale SOA because SOAs are, at their core, systems of systems. A single system cannot accomplish the goals of SOA because each service must be its own system, which may be accessed through a service broker (another system), which a lay user must access via a user interface – yet another system. To cap it off, all of those systems must follow very specific rules to be able to transfer requests and data between themselves. These and other difficulties relating to the complexity of SOA are limitations which cannot be ignored. Should a developer decide these barriers to entry are too great, there is another family of architectures to explore.

2.3.7 Resource Oriented Architectures

A ROA is concerned only with providing access to a particular resource in response to a user request. This is a direct contrast to SOAs, which are concerned only with accomplishing a given task. In a ROA every resource has a Uniform Resource Identifier (URI), the most common type of which is the Uniform Resource Locator (URL). The URL, commonly known as a web address, is actually a standardized URI that contains not only the name of the resource, but a method for retrieving it.17 But what is a resource? According to Granell et al. (2013), “Any informational entity may be regarded as a resource in the target RESTful application.” Essentially, this means that anything, be it a document, model, raw data, or any other entity one can put on the Internet, can be a resource.

17 URL Spec: https://www.w3.org/html/wg/href/draft#url

One extremely common ROA is known as Representational State Transfer (REST) and was first described by Roy Fielding (2000) in his dissertation. Under a REST architecture, a user interacts with a server via an Internet protocol (typically HTTP, but not necessarily – see the Constrained Application Protocol, RFC 7252). REST usually employs HTTP’s GET, PUT, POST, and DELETE methods to create a uniform interface via which a user may interact with resources on a server.
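The sketch below shows what that uniform interface can look like for a single, hypothetical resource type. Again the code uses Node.js and Express purely for illustration: the point is that each resource is named by a URI and the standard verbs carry fixed meanings against it (create, retrieve, replace, remove), rather than designer-invented ones.

```javascript
// A minimal, hypothetical sketch of REST's uniform interface: the "dem"
// resource type and the in-memory Map standing in for storage are invented.
const express = require('express');
const app = express();
app.use(express.json());

const dems = new Map(); // stand-in for real storage

app.post('/dems', (req, res) => {            // POST creates a new resource
  const id = String(dems.size + 1);
  dems.set(id, req.body);
  res.status(201).location(`/dems/${id}`).end();
});

app.get('/dems/:id', (req, res) => {         // GET retrieves a representation
  const dem = dems.get(req.params.id);
  dem ? res.json(dem) : res.sendStatus(404);
});

app.put('/dems/:id', (req, res) => {         // PUT creates or replaces it
  dems.set(req.params.id, req.body);
  res.sendStatus(204);
});

app.delete('/dems/:id', (req, res) => {      // DELETE removes it
  dems.delete(req.params.id);
  res.sendStatus(204);
});

app.listen(3000);
```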
By mapping these verbs onto resources, REST elevates HTTP from a simple transport protocol to an application protocol capable of performing arbitrary operations on a remote machine (Granell et al., 2013). For an accessible yet detailed description of REST principles, see Mazzetti, Nativi, and Caron (2009). Because it relies only on HTTP and not on the heavy, complex requirements of SOAP or other SOA structures, REST is simple to deploy, reliable, and scalable, making it a very common architecture for building APIs, but it can also be limiting.

As explained above, early geographic information clearinghouses were almost entirely resource-oriented. Early clearinghouse development revealed the shortfalls of ROA in geographic data distribution; specifically, their static nature could not accommodate changing user demands. Today many geographic data distribution systems use REST APIs. These APIs differ significantly from traditional ROAs. Often referred to as RESTful web services, they stretch the boundaries of ROA by allowing users access to services (just like SOA) by mapping functions – conceptually modeled as resources – to specific URLs and HTTP methods. For example, ESRI offers RESTful web services to access geographic data, as discussed later in this paper. Rather than sending request parameters as part of a complex object, the parameters are encoded directly in the URL. Much of the same information is transferred, but using a different information model.

If this seems convoluted and confusing, that’s because, conceptually, it is. Several examples of this confusion are visible in the research on RESTful geoservices. Furthermore, for a service to be truly RESTful, requests and responses must be exchanged in a format which supports hypermedia (interactive media). JavaScript Object Notation (JSON) has no such capability even though it is commonly used in “RESTful” systems (Walker & Chapra, 2014; Yue et al., 2015). Even top engineers at large Internet companies like LinkedIn have trouble with this.18 To further complicate the matter, RESTful services blur the line between SOA and ROA. As discussed above, the term “resource” can refer to just about anything. When using a ROA to provide services as resources, does it inherently become a SOA? REST is all about transferring representations of the state of resources on a remote server, but Castronova, Goodall, and Elag (2013) attempt to use it to extend PyWPS – an OGC compliant SOA. This is a bothersome question, but fortunately it does not inhibit the creation of RESTful services in any way. REST is merely a style of architecture, not a standard or protocol fundamental to the proper operation of the Internet. Finally, some researchers have found that REST offers few unique practical benefits for geoportals, and the cost far outweighs the benefit if an SOA is already in place (Lucchi, Millot, & Elfers, 2008).

Despite the drawbacks listed above, many geospatial service providers do provide RESTful interfaces. Walker and Chapra (2014) present the Web-based Interactive River Model (WIRM), “An interactive web application for a simulation model of biochemical oxygen demand and dissolved oxygen in rivers,” which provided great insight into the inner workings of REST-based services and clients in the geosciences field.

18 JSON and REST: https://www.linkedin.com/pulse/rest-vs-rpc-soa-showdown-joshua-hartman
Mazzetti, Nativi, and Caron (2009) provide a detailed and thoughtful analysis of the benefits of REST over the traditional SOA/SOAP model for environmental systems modeling. Yue et al. (2015) include it in their vision for the future of distributed and “intelligent” GIS. Finally, Granell et al. (2013) offer the most extensive treatment of REST for geospatial web services I have discovered, including a summary of other architectures historically used by the online modelling community, definitions of REST and ROA, and practical implementation suggestions for deploying RESTful services for web-based modeling. As demonstrated in the following section, REST is also a common design pattern in many existing geospatial data distribution systems.

2.3.8 Modern Examples of Distributed GIS

GIS on the web is rapidly growing and changing, but its importance has been increasing ever since its inception (Rinner, 2003). Having discussed the fundamental architectures and standards which enable the development of modern distributed GIS, we are now in a much better position to compare them and gain useful insights in the process. As revolutionary as they were at the time, clearinghouses quickly became outdated. Succinctly, “For the first generation, data were the key driver for SDI development and the focus of initiative development. However, for the second generation, the use of that data (and data applications) and the need of users are the driving force for SDI development” (Crompvoets et al., 2004). While the core principles of improving discoverability and reducing cost remain, the manner in which they are addressed has changed and new functionality has been added.

Instead of clearinghouses, second generation web distribution systems are usually referred to as “portals”. According to Maguire and Longley (2005), “Portals are web sites that act as a door or gateway to a collection of information resources, including data sets, services, cookbooks, news, tutorials, tools and an organized collection of links to many other sites usually through catalogs.” Though there are many types of portal across the web (Maguire & Longley, 2005), this research is only concerned with portals for geographic information and will use the term portal or the more explicit “geoportal” interchangeably. Tait (2005) defines a geoportal as “a web site that presents an entry point to geographic content on the web or, more simply, a web site where geographic content can be discovered.”

Lessons learned from user dissatisfaction with clearinghouses encouraged more extensible designs in portals. The generic interfaces employed to achieve that extensibility allowed the evolution of the portal concept to better align with the Maguire and Longley (2005) definition of a portal. Today’s portals feature improved, much more interactive user interfaces which can accomplish the same tasks clearinghouses were built for in addition to connecting users to many additional resources such as processing and visualization services. Technologies like Asynchronous JavaScript and XML (AJAX) allow users to interact with portals as though they were desktop software (Han et al., 2009; Qiu et al., 2012; Han et al., 2012). There exist today several proprietary and open source solutions, commercially supported or otherwise, that can act as web-based data distribution systems. ESRI’s ArcGIS Server, Autodesk’s MapGuide, Boundless’ GeoServer, and UMN MapServer are just a few of the many options.
Combining any of these technologies with an online interface (e.g. ArcGIS Online and ArcGIS Server) creates a portal. Even constrained to Free and Open Source Software (FOSS) alone, there are numerous options for building a web-based GIS for data distribution. Steiniger and Hunter (2012) describe various recently developed FOSS tools useful for building a geodata portal. Lee (2009) offers another method, combining a FOSS back end with Google’s Web Map API front end for a private-public hybrid19 that’s easy to create and maintain.

While today’s online geographic data distribution systems are usually much more complicated than older systems, they are conceptually identical to the early second generation portals first described in the early to mid 2000s. Anyone wishing to design such a system must look to existing systems for guidance, or suffer the consequences of eschewing many years of trial and error. Unfortunately, diligent searches revealed very few journal articles describing the technical implementation of existing systems. On the other hand, there is a wealth of conceptual descriptions; all of the work relating to the Model Web and UncertWeb discussed in the “A Brief History of Error-aware GIS” subsection of this thesis falls under the “conceptual description” and “second generation” categories.

I postulate that there are three fairly simple explanations for the limited number of white papers regarding the construction of distributed GIS: the difficulty of translating the mundane activities of system design into formal literature, the prerogative of proprietary software companies to protect trade secrets, and the ad hoc nature of open source projects. First, understanding the implementation of a system design is, in my experience, best gained through trial and error, with a lot more error than success. Formal literature is mostly concerned with communicating what worked, not what did not work the first several hundred times. As to the second point, it does not make economic sense for SuperGeo, Autodesk, or ESRI to share detailed guides about how they build their software. Their distributed GIS systems are their highest value products, and giving competitors such valuable information would be a poor business decision. Finally, FOSS projects are built piece by piece, as needed, to serve whatever purpose the community requires at the moment they are conceived. These are typically not good conditions for detailed and accurate recording of a development process. According to research on open source innovation, “Open source software is typically created within open source software projects, often initiated by an individual or group that wants to develop a software product to meet their own needs” (von Krogh & von Hippel, 2006). Other research on open source projects points out that communications and organization are often very loose; they often lack “explicit system-level design, or even detailed design,” or a “project plan, schedule, or list of deliverables” (Mockus et al., 2002). There are no hard and fast rules for how or why to do things. Instead, issues are dealt with as they arise and are therefore typically not documented outside of a user group listserv or forum. Ironically, the loose organization and informality that make open source projects so flexible and agile are the same things that make them difficult to study.

19 With “private” being a reference to the proprietary data on which Google Maps rely.
Though the Mozilla and Apache case study proves it can be done (Mockus et al., 2002), a similar research effort on geospatial software would probably be worthy of its own thesis. As a result, while it might be fairly easy to download and use FOSS, it can be difficult to model a new system off of an existing one without the deep knowledge of that system’s developer community. One of the secondary goals of this research is to address this gap in the literature.

Details or no, there is ample experience for a developer to draw from when considering the design of a geoportal. The earliest example I could unearth of a second generation portal was called “G-portal”, and described a ROA-based system for geodata creation, dissemination, discovery, and display in educational settings (Lim et al., 2002). Liping Di (2004a) introduces G-portal’s SOA-based counterpart GeoBrain: “A three-tier standard-based open geospatial web service system which fully automates data discovery, access, and integration steps of the geospatial knowledge discovery process under the interoperable service framework.” System users could access GeoBrain’s functionality through a user interface called the Integrated Multiple-protocol Geoinformation Client (MPGC). In a separate paper, Di (2004b) also offers a detailed explanation of the fundamental concepts which make GeoBrain possible. SOA in general and GeoBrain in particular would go on to become major players in the modern distributed GIS field.

Essentially, GeoBrain is an interconnected web of distributed data providers, catalog services, geoprocessing services, and visualization services (Di, 2004b). It is the quintessential service-oriented geoportal. As discussed above, it is built for machine-to-machine communication, and can be very difficult for people to use (Nativi et al., 2011). As web technologies improved, the GeoBrain team introduced the GeoBrain Online Analysis System (GeOnAS), a web-browser-based Graphical User Interface (GUI) for interacting with GeoBrain services which replaced the MPGC (Han et al., 2008). GeOnAS is a distributed GIS designed for data discovery and visualization with limited processing capabilities (Zhao et al., 2012).

Improving on the GeOnAS concept, Han et al. (2012) authored a paper outlining the purpose, goals, and even some implementation details of a project called DEM Explorer. As the name suggests, DEM Explorer served as a portal where users could view and download digital elevation data. Powered by GeoBrain, DEM Explorer could access the Geospatial Data Abstraction Library (GDAL)’s DEM processing capabilities and limited GRASS GIS functions via web processing services to produce vector data such as watershed basins or raster data like the Topographic Roughness Index (TRI) in response to user queries. The project was very successful and NASA, its sponsor, adopted DEM Explorer as the interface for the Land Processes Distributed Active Archive Center (LP DAAC) portal, renaming it Earth Explorer. At LP DAAC, the connections to GeoBrain’s processing functions were severed, leaving users with access to only data retrieval and viewing services and making it the quintessence of a modern data distribution service. It still has all the download functionality, despite the fact that as of June 2017 the interface is slated for deprecation in favor of the newer Earthdata platform. DEM Explorer is a web-based GUI for GeoBrain, which relies on Apache’s Java-based Axis2 server (Han et al., 2012). GeoBrain offers OGC compliant WMS, WFS, and WCS.
Additionally, it offers Web Standards compliant (but not OGC compliant) services for geoprocessing (Di, 2004a). DEM Explorer helps a user consume these services by translating user input into a machine-readable format and sending requests to the server. On the server side, programmers use a servlet called WSClient to connect to the GeoBrain system; it converts standard HTTP GET requests and their associated parameters into SOAP messages and passes them along to the appropriate service. While DEM Explorer uses HTTP for its interactions, GeoBrain’s design also lets it interact over other transport protocols.

DEM Explorer is not the only web-based GUI created to interact with GeoBrain. Its cousin, the GRASS Web Application Software System (GWASS), also offers a fully-featured GUI which accesses GeOnAS’ GRASS GIS web services. Instead of using the WSClient servlet, GWASS’ application stack – the group of front end, back end, and database technologies which comprise the system – uses its own web server to translate between the GUI client’s requests and GeoBrain services. Unlike DEM Explorer and GeOnAS, which were built to demonstrate the capabilities of distributed GIS, GWASS was built to solve the specific problem of operating system incompatibility for GRASS GIS (Qiu et al., 2012). Apparently, though, it did not completely solve that problem, as it is no longer available online20.

Yet another example of a distributed GIS based on SOA comes from Granell, Díaz, and Gould (2010), who discuss the development of AWARE: “A tool for monitoring and forecasting Available WAter REsource in mountain environments.” AWARE implements OGC-compliant services including WMS, WCS, WFS, and WPS for environmental modeling using Java as the application language and Apache as the server. AWARE is built on the same concepts as its more broadly focused cousin INSPIRE, another European SOA-based portal similar to GeoBrain (INSPIRE, 2007; Lucchi, Millot, & Elfers, 2008). Like other SOAs, AWARE is extremely complicated, well featured, and a great example of a modern geoportal, offering everything from discovery to advanced processing (Granell, Díaz, & Gould, 2010). Unfortunately, like its siblings the GEO Model Web, UncertWeb, and GWASS, AWARE appears to have been abandoned after completion21.

The tendency of these large SOA-based systems to be left unfinished or abandoned after completion is notable. With the exception of extremely large systems like INSPIRE, GEOSS, Earth Explorer, and Earthdata, service-driven portals have not lived up to the hopes of their designers. I believe that the main reason for this is that these systems do not run themselves. Their immense complexity and interdependence, even distributed across so many organizations, is too much for developers to maintain. Reducing complexity by removing processing functions seems to be a good way to keep systems like Earth Explorer (Han et al., 2012) alive. Private companies like NextGIS have adopted this model as well, using an AJAX-enabled web-based GUI to consume OGC compliant services from a MapServer instance connected to PostgreSQL/PostGIS22. As a counter to the overly ambitious goals of previous geoportals, there has been a resurgence of ROA-based systems in distributed GIS.

20 GWASS Home Page: http://wastetoenergy.utdallas.edu/gwass
21 Official AWARE project page: http://www.copernicus.eu/projects/aware
Mazzetti, Nativi, and Caron (2009) offer one explanation for why this may be taking place: “Scientists’ main objective is not to develop and maintain the complex infrastructures required for SOA implementation, but to access and publish information in the easiest possible way.” RESTful services are simple, easy to deploy, and very lightweight, making them an attractive option. They are also widely accepted in the community. The OGC Standards Working Group is currently developing a standard for RESTful services23, but there are many RESTful geospatial services in the wild.

One of the best known of those services is OpenStreetMap (OSM). OSM is, first and foremost, a database. To interact with the database, OSM developers created a GUI through which clients may interact with the data stored in that database – querying existing data, creating new data, deleting records, or simply plotting them on the map interface. That GUI is very similar in function to those of GWASS, GeOnAS, or AWARE, except that it relies on the REST API which underlies the entire system, rather than on a SOA. While OSM’s architecture is well suited to REST, their system is also meant to function in a manner similar to the websites for which REST was designed. Performing standard Create, Read, Update, and Delete (CRUD) operations is the most common application of the REST architecture. In their 2013 paper, Granell et al. discuss how services may be exposed with REST, rather than just the resources exposed by OSM. Naturally, adding additional functionality adds additional complexity to the application as well. However, some big names in the geospatial community offer REST APIs. ESRI developed its ArcGIS Server API using the REST style, and exposes a full range of capabilities similar to that of GeoBrain via a ROA24. GIS Cloud is another GIS company which offers a REST API, though it is not as fully featured as ArcGIS Server. GIS Cloud does not allow any geoprocessing more complex than geolocation over their API, but they do allow users to dynamically create maps and services to share those maps.

Figure 7 offers a “family tree” of the systems discussed throughout this section, and the graphical representation of the connections between systems within each architectural type reveals several interesting things about each. First, the systems of the SOA branch have a lot more connections to one another than the systems of the ROA branch. This is because SOAs are designed to function within a system of systems and require numerous related systems to function properly, whereas ROAs are designed to operate either alone or within a system of systems. Put another way, a ROA functions the same way regardless of how the system is accessed. SOAs require one system for actual geoprocessing and an additional system to access that functionality. To see the difference, compare the GeoBrain system with OSM. GeoBrain has four connected systems, all of which serve to provide access to some subset of GeoBrain itself. Without these systems, the protocols used by GeoBrain are extremely difficult for a human user to access manually, due to the fact that SOAs are designed for machine-to-machine communication. OSM, on the other hand, has no connections to other systems in the ROA branch.

22 NextGIS: http://nextgis.com/
23 OGC SWG RESTful Services: http://www.opengeospatial.org/projects/groups/restfulswg
24 ESRI ArcGIS Server REST API: http://resources.arcgis.com/en/help/arcgis-rest-api/
While one can design an additional system to access OSM, that system would use the exact same methods as a human user who sought to access the data manually. Another interesting point illuminated by Figure 7 is the difference in attribution between the two branches. While the SOA branch is filled with academic citations, the ROA branch has far more web addresses than journal citations. ROA is far more popular than SOA on the Internet today, and this discrepancy seems to show that the same holds true within the geospatial industry. This idea is further supported by the fact that very few of the systems in the SOA branch of the tree are still operating – hence the lack of web addresses. This is a simple consideration, but it should not be overlooked. While the SOA paradigm is clearly preferred in the literature for the design of geospatial data distribution systems, there are some real benefits to choosing a ROA.

Figure 7. A family tree of Service and Resource Oriented Architectures and their constituent systems as discussed in the preceding paragraphs of this section.

3 Building a New Web-based Distribution System

This section of the paper discusses the technologies we use to accomplish our first research objective: create an easy way to distribute DEM error realizations on the Internet. Having reviewed the long and varied history of both uncertainty-aware GIS and distributed GIS, we are in a better position to discuss the motivations for the development of a new system. We are also in a better position to understand exactly how our system differs from past approaches to the same problem and what those differences mean for users. This section starts by discussing the design philosophy of the system, explaining which principles are most important to us and why. Then, it presents a discussion of the selected technologies and the reasons for their selection, beginning with the back end and concluding with the front end.

3.1 Design Philosophy

One of the stated objectives of this research is to put advanced error modeling in the hands of the average GIS user. The literature offers very clear advice on the selection of software for such a system to ensure that the average GIS user can actually use it. To promote widespread use, a system should “minimise user requirements and maximise simplicity” while ensuring that any software used is “accessible to all DEM users” (Darnell, Tate, & Brunsdon, 2008). The second requirement makes using FOSS a necessity – a recommendation this work adheres to strictly. While Darnell, Tate, and Brunsdon (2008) offer their own example of what this might look like, their decision to run the error realization process on the user’s machine has considerable drawbacks. Specifically, their model requires that the user be able to interpret a variogram and fit the variogram model – something most people lack the training for. They do, however, have a very robust visualization component that provides a great example for anyone to follow when building such a system. Instead, our system proposes a solution similar to that of the MATCH Uncertainty Elicitation Tool25 described by Morris, Oakley, and Crowe (2014), in that it relies on an uncertainty expert to do the actual modeling and allows end users to access the results of those models.

25 MATCH Uncertainty Elicitation Tool: http://optics.eee.nottingham.ac.uk/match/uncertainty.php
Uncertainty visualization techniques are a perfect example of why, when building a system to distribute error realizations and models to operate on them, there are usability considerations beyond the software alone. It is also very important to look at research on how users perceive and understand uncertainty. People have been working on visualizing uncertainty ever since computer hardware existed to do so (Logsdon, Bell, & Westerlund, 1996). In one particularly useful study, Aerts, Clarke, and Keuper (2003) performed research on various methods for clearly visualizing uncertainty in land use change scenarios and discovered that people are most comfortable with variations in the hue of a single color to represent uncertainty. The work of MacEachren, Robinson, and Harper (2005) offers more useful contributions to research regarding uncertainty visualization as it relates to user perception.

The considerations above, coupled with the desire for this work to be helpful to a broad community, will be addressed by adhering to three core design principles: Make it Open, Make it Easy, and Make it Magic. This subsection discusses each of them in turn, explaining what they mean and why they are important. It also tackles the question of whether or not to implement open standards.

3.1.1 Make it Open

Like all of the software used in its creation, this web distribution system must be free and open source software. While FOSS is not categorically free of cost, all of the projects this system relies on are available for download at no charge. This helps keep our development costs down, improving the likelihood that the project will live longer than the grant that supported it, unlike many of the systems discussed in the “Modern Examples of Distributed GIS” section. It also ensures that developers around the world will have access to the same software and can contribute to the project should they find it useful. Finally, both the author and his advisor have strong personal commitments to the open source community and would like to contribute to it in a meaningful way. Of course, none of these outcomes are likely unless the system achieves the next principle.

3.1.2 Make it Easy

Perhaps the most important message gleaned from the literature on both uncertainty and distributed GIS is that systems need to be easy if they are to be widely adopted. If a potential user finds a system difficult to navigate, they are not likely to use it again. The prototypical “user” we refer to throughout this paper – the “average GIS user” – is someone who may or may not have had formal GIS training, but uses GIS on a near daily basis. The “user” we target is someone who knows enough about GIS to know that the data they use have flaws, but not enough to know how to do something about it. The overarching goal of this project is not to give those users a deep understanding of error modeling or propagation, but to make it easy to access realizations and thereby consider uncertainty in their work. If the system is too complicated for a user to even download a set of error realizations, we have already failed. Even if the system is mildly inconvenient it is likely to turn some people away, so making the system easy to use is absolutely critical to the success of the project. While an obvious consideration for the user, ease is equally important for developers.
Choosing a platform with a lot of pre-existing software packages created to ease development is a must, as it significantly reduces the work required to produce the minimum viable product for serving error realizations on the web – the primary goal of this research. But the benefits go beyond initial development cost. Ensuring the code uses a common platform and is easy to maintain greatly improves the odds that other developers in the geospatial community will extend this system to suit their own needs. When groups of people contribute to a software project for their own reasons, it inevitably makes the software better in the long run (von Krogh & von Hippel, 2006). Building the system with tools that make it easy to extend encourages others to contribute to our open source project, thus improving the system and strengthening our ability to distribute DEM error realizations in the future.

3.1.3 Make it Magic

This final principle goes hand in hand with “Make it Easy”, but from a slightly different point of view. “Make it Magic” demands that the system “just works” for the researcher who simply wants to share his or her work without building a web server. The system design should promote a “set and forget” type of deployment. After the initial download, anyone should be able to get the system up and running on their local machine with minor configuration changes and the fewest possible interactions with the command line. After it is running, the server should take care of itself, only informing the person who deployed it when something goes wrong. The system needs to be capable of handling long-running processes, allowing users to hook it up to arbitrarily complex geoprocessing scripts. Neither system users nor the researchers who deploy it should have to concern themselves with how the system handles that. The system should accept incoming requests and respond immediately, though processing may not even begin until well after the response is issued. Whenever the process finally completes, the system should notify the requestor that the data are ready for download. This is called an asynchronous operation, and it ensures the best possible experience for the end user while still achieving the goals of the researcher who deployed the system.
Unfortunately, mainstream GIS has not developed to suit that data model. Our method is something of a hack: we distribute many uncertainty-adjusted versions of the same dataset in an open and widely supported format that mainstream GIS software is designed to handle – the GeoTIFF. This changes the data model because in one sense, “the dataset” refers to all of those versions collectively. In another, equally valid interpretation, we are distributing numerous related datasets. While the WPS specification can handle either, using it does not make the problem any easier.

There are existing Python libraries and frameworks which, when combined, can perform processes very similar to the ones we are using now. Specifically, the PyWPS26 library and the Flask27 microframework can be combined to create an easy, off-the-shelf solution for serving OGC-compliant processing services. The WPS specification does not require a specific way of returning data, which makes it flexible enough to handle our revised data model, but it also means an additional layer of complexity on top of the same storage and processing logic we would have to write anyway. Furthermore, our simulation scripts are all written in the R language (R Core Team, 2016) and require advanced tools unavailable in Python. While a port from R to Python is possible thanks to libraries like rpy228, it would be a massive undertaking.

On the other hand, not complying with standards makes it very easy to adhere to the main goal of this project: build a platform to distribute DEM error realizations on the web, and show researchers how they can use those realizations with their existing tools. While the realizations cannot be pulled into a desktop GIS over a service like WFS, they can be used just like any other GeoTIFF. This is a huge ease-of-use bonus, as virtually every GIS user has worked with a DEM in GeoTIFF format. If they need another format for some reason, any GIS can easily convert the data to the desired format because GeoTIFF is an OGC standard format. As the OSM development community discovered, it sometimes makes sense to sacrifice standards compliance in favor of meeting the needs of your users.

Based on lessons learned from the wide range of literature discussed in the Literature Review section of the paper, we needed to create a system that:
1. Uses exclusively Free and Open Source Software
2. Is easy to use
3. Is easy to maintain and extend
4. Requires little or no experience to deploy
5. Does not require scientists to learn about web technologies
6. Asynchronously generates and distributes data

Point number 1 obviously addresses the “Make it Open” principle. Points 2, 3, and 4 correspond to the “Make it Easy” principle, and “Make it Magic” to points 5 and 6. The sum of these requirements is a reliable system that empowers domain experts to provide access to their work regardless of their web development skills. While ideally the system should implement a standard architecture, that goal is unrealistic for prototype development.

26 http://pywps.org/
27 http://flask.pocoo.org/
28 https://rpy2.bitbucket.io/

3.2 Server Architecture

3.2.1 Choosing the Right Architecture

The extensive discussion of systems architecture in the Literature Review, rather than clarifying the issue, has somewhat muddied the waters. The concepts of SOA and ROA are easy enough to understand on their own, but in the context of implementation the lines between them blur. Some describe RESTful web services as SOAs, despite the fact that REST is an ROA style.
On the other hand, SOAs imply a tight contract between client and server that follows a strict set of rules defined by the W3C. To avoid all the difficulties that standards-compliant services involve, we could adopt a REST style and not worry about how it is categorized. Unfortunately, the REST model is hard to apply to distributed GIS services. For instance, consider the difficulty of modeling a raster as a web resource. Obviously, the entire raster can be a resource. But then how do you process that in a web architecture? The size of the data can preclude processing or even transmission of a large area. Also, this prevents dynamic area selection – a global raster would need to be cut up with some sort of tiling system. Another option would be to model each individual pixel as a resource, but that quickly becomes absurd! Each HTTP call has a cost in time, and how many calls might it take to perform a complex analysis? Even though the ease and simplicity of REST make it a tempting option, it does not appear to be the right tool for the job. To simplify the discussion, it is best to remember that these questions arise from trying to apply a standard pattern to a process it simply was not designed for. While standard patterns represent best practices, they are not the only practices. As discussed in the previous subsection, standards compliance can sometimes hinder a project's development rather than support it. Of course, web standards are important and a robust system should follow a standard architecture to facilitate community development and reduce failure points. However, when it comes to choosing between implementing a standard and achieving the project's goals, accomplishing the goals should always be more important. The main goal of this research is not to produce a robust, production-ready system. It is to create a prototype to test the feasibility of generating DEM error realizations and determine some of the obstacles to building a robust, production-ready system. Coming into the project, the author had no back end development experience whatsoever. The Make it Easy principle applies to both users and developers, so the easiest path to the minimum viable product is the best one. Ultimately the system should adopt a standard architecture, but without constructing a prototype first, especially given the lack of systems architecture experience, we may choose the wrong one. Based on the need to create a minimum viable product as quickly and easily as possible, the best approach is to take what we need from the principles of various architectural styles. From the SOA-based Remote Procedure Call (RPC) style, we will borrow the concept of an endpoint accepting a single call to invoke a single function. We are only exposing a single function, and we do not need to do anything else with the endpoint because users will be notified of success or failure by email, as the operations may take a very long time to complete. We do not want to deal with any of the standards that robust RPC APIs use, however, as that adds a layer of difficulty unnecessary for prototype development. Instead, we can use the "weakly" RESTful style described by Lucchi and Millot (2008). A weakly or accidentally RESTful style implements some of the principles of REST, but not all of them. Like REST, we can use HTTP for our standard interface. Instead of implementing all four of the standard methods, we will use a single method (POST) to perform an RPC-like call. This architecture is simple, very easy, and, most importantly, it works.
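To make that concrete, a minimal sketch of such an endpoint using Node's built-in http module is shown below. The route name, the field names, and the enqueueJob() helper (which hands the request off to the job queue described in Section 4.3.1) are illustrative assumptions rather than the system's actual code.

const http = require('http');

// A single RPC-like endpoint: accept a POST request, acknowledge it
// immediately, and queue the long-running geoprocessing work for later.
const server = http.createServer((req, res) => {
  if (req.method !== 'POST' || req.url !== '/requestData') {
    res.writeHead(404);
    return res.end();
  }
  let body = '';
  req.on('data', chunk => { body += chunk; });
  req.on('end', () => {
    // Validation and error handling are omitted for brevity.
    enqueueJob(JSON.parse(body));
    res.writeHead(202, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ message: 'Request accepted; a download link will be emailed.' }));
  });
});

server.listen(3000);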
Having settled on this hybrid architectural style, we need to determine what that function call actually does. There are two possible approaches for distributing DEM error realizations on the web: pre-generating and storing them in a database, or creating them on-the-fly in response to a user request. The former has the benefits of ease and durability, but raises potential storage problems. One thousand SRTM realizations would require approximately 4.25 terabytes of disk space based on the average storage size of an SRTM tile. The same number of GDEM realizations would require 272 terabytes (see further discussion in Section 4). While possible, such a storage solution is very expensive and would likely raise other computational issues for the database. The PostgreSQL website brags, "There are active PostgreSQL systems in production environments that manage in excess of 4 terabytes of data" (https://www.postgresql.org/about/), so while it could probably handle GDEM alone, managing both datasets would be uncharted territory. The latter approach is more difficult to implement, but relieves the burden of storing several thousand global raster datasets. In addition, on-the-fly processing seems a better fit for the random nature of the stochastic paradigm. Figure 8 and the following paragraph describe how that system works.
Figure 8. A conceptual diagram of our system's architecture. The database both stores the data and performs the analysis using PL/R. After processing, the server sends the user a download link by email.
The web server receives and immediately responds to incoming requests. Then, the data request is queued for execution as soon as the computing resources become available. When the database is ready, we issue it a new query. That query activates a custom function which produces error realizations for the area specified in the system user's request. The database runs that query, looping as necessary to produce the requested number of realizations. When the database finishes processing, it alerts the web server, which zips the freshly generated DEM error realizations and sends the requester an email informing them that their data are ready for download at the supplied link. The following subsections discuss the technologies chosen to implement this architecture and the reasons they were chosen.
3.2.2 Node.js and Express
Node.js (https://nodejs.org/), usually shortened to "Node", is an open source server platform written in JavaScript (https://www.javascript.com/). This is fairly revolutionary because until the creation of Node, JavaScript was a technology constrained solely to the browser. For front end developers, this is a dream come true. Previously, a move from client-side (front end) development to server-side (back end) development required learning a new language such as Java, C++, or Python. Now, a developer can use a single, widely-used language across the whole stack without compromising on functionality or flexibility – characteristics which definitely fit the Make it Easy and Make it Open principles of our system design. Node is a mature platform capable of performing any operation its more traditional cousins can. Node is used by some big names, including PayPal (https://www.paypal-engineering.com/2013/11/22/node-js-at-paypal/), Netflix (https://www.dev-metal.com/going-node-js-netflix-slides-micah-r-netflix/), and Uber (https://nodejs.org/static/documents/casestudies/Nodejs-at-Uber.pdf), because it is easy to use, extremely fast, and capable of doing more with the same computing resources. This subsection will explain some of those benefits as they relate to this project. Because Node uses JavaScript, there is a very small learning curve for most developers.
That's because JavaScript is one of the three main languages used in front end web development, the other two being Hypertext Markup Language (HTML) and Cascading Style Sheets (CSS). After a turbulent decade in the hands of private software companies, an open source revolution began around JavaScript in 2005 thanks to the benefits of AJAX (for a short history, see https://www.w3.org/community/webed/wiki/A_Short_History_of_JavaScript) – the same benefits discussed by Han et al. (2008) in relation to GeOnAS. As a result, JavaScript has very well developed documentation and plenty of grey literature (i.e. blog posts, forums) on how to accomplish an incredible array of tasks using the language, making it easy to learn by example. Also, the fact that it has been so widely utilized for so long increases the odds that a given developer has some previous experience with JavaScript. Node is powered by Google's V8 JavaScript engine (https://developers.google.com/v8/), an open source project that compiles JavaScript to machine code (Node.js, 2016). What that means in practice is that JavaScript can run even faster than C++ with the proper optimization (http://v8-io12.appspot.com/#93). While this might not be an important consideration for a prototype system, there are undeniable benefits to starting out on a scalable platform. In an ideal world, the data distribution system would be used by everyone who needs to perform spatial analysis using GDEM or SRTM. The term "optimistic" falls a mile short of describing that outcome, but choosing scalable technologies from the beginning makes growth later on much easier. In addition to speed, Node has other characteristics that make it scalable. The most important of those characteristics is the "event-driven, non-blocking I/O model that makes it lightweight and efficient" (https://nodejs.org/en/about/), which allows a Node server to handle many more connections than other servers using the same available resources. Node accomplishes this with the Event Loop, which is best understood with an example. Imagine going to a doctor's appointment. You walk in the door, head to the receptionist's desk to check in, and take a seat to await your appointment. When a doctor becomes available, a nurse notifies you, brings you back, and the appointment gets underway. This is a typical system for a doctor's office to follow, but it could be done another way. Imagine instead that your doctor handles every piece of the visit from check in to examination. You avoid the initial wait, but this quickly becomes inefficient. The intake phase costs the clinic a lot more money, as a doctor is paid much more per hour than a receptionist would be to perform the same tasks. But what if it is a busy day at the clinic, or the patient before you took longer than expected? If there are no doctors to greet new patients, sick people will be turned away. To prevent that, the clinic may have to hire additional doctors to handle high-demand situations. Node, on the other hand, follows the form of the first system.
By using a single core (the receptionist) to accept all incoming requests (patients) and only connecting them to processing cores (doctors) as they become available, the system (clinic) uses its resources more efficiently. When a new patient comes in, the receptionist processes all the system's information, puts the patient "in line" to see the doctor (asynchronicity), and is immediately ready (not blocked) to handle the next patient regardless of how complicated the first patient's appointment might be. Only after the doctor finishes with the current patient does he or she send a nurse (event) to begin the next examination, ensuring all patients are accommodated in a timely and responsive manner. Node's asynchronous, event-driven style is not the best solution for every problem. There are two specific cases for which Node is a bad option: performing heavy computation and interacting with a relational database (https://www.toptal.com/nodejs/why-the-hell-would-i-use-node-js). That said, it was designed specifically for use in distributed networks – an important consideration for research on distributed GIS (Node.js, 2016). This Node.js-based data distribution system does both of those things, but in ways that do not negatively impact Node's efficiency. Given the constraints that the system must be easy to use and should meet researchers where they are comfortable, these two weaknesses can even become strengths, as discussed in Section 4. In addition to the fact that Node's design principles fit our goals well, there are several practical reasons for selecting Node as the server platform. The first of those is Node's fantastic package management system, the Node Package Manager (npm). Built to manage project dependencies, npm allows developers to quickly and easily install any software published on the npm website. After installing the software, npm tracks it without storing the entire package so that code can be easily shared amongst developers. This benefit is especially powerful when combined with version control software like git, as it ensures developers will use the exact same software every time they update their local copy of the project. Node programs adopt a modular design philosophy similar to the Unix design philosophy (https://en.wikipedia.org/wiki/Unix_philosophy), which makes npm indispensable. Instead of writing large programs that offer complete functionality, Node developers usually write software that accomplishes one very specific task, no more. These "modules" are designed to work with others and inside of a larger application. Because Node is a mature and very popular open source project with a very large and diverse community, modules have already been written to accomplish nearly all of the common tasks a server might face. This project makes use of many of those modules, relying on original code only when necessary, improving system reliability and lowering the barrier to entry for future developers who may want to adapt the system to suit their own needs. The project uses the Express web server framework to handle and respond to incoming requests. Express has long been a favorite in the Node community because of its pluggable middleware system, which makes it easy to extend the server's functionality. With minimal configuration, a developer can add support for user authentication, data validation, static file serving, logging, and innumerable other essential functions. Strong documentation and continued community support are two more reasons to choose Express.
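As a small illustration of that pluggable style, the sketch below wires a few common middleware functions into an Express application. The specific modules shown (the community-maintained morgan logger and Express's own JSON body parser and static file helper) and the directory path are examples chosen for illustration, not the system's actual configuration.

const express = require('express');
const morgan = require('morgan'); // HTTP request logging middleware

const app = express();
app.use(morgan('combined'));                          // log every incoming request
app.use(express.json());                              // parse JSON request bodies
app.use('/downloads', express.static('data/zips'));   // serve packaged results as static files

app.listen(3000);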
In reality, however, the server framework is interchangeable. Express was the right choice for this project because it was well documented and easy to learn, but another developer could accomplish the same tasks just as effectively using different modules without much added difficulty. The goal of this project is not to create an innovative new web server, but to give users access to a geoprocessing service and the data it produces. The web server itself only supports this goal; it is not the entire solution – the actual geoprocessing happens outside the web server subsystem.
3.2.3 PostGIS and R
PostGIS is the spatial objects extension for the open source PostgreSQL database server. PostgreSQL is widely used and loved; at the time of writing, DB Engines ranks it as the 5th most popular database in the world, with its popularity growing steadily since 2014 (https://db-engines.com/en/ranking). PostGIS has enjoyed a similar meteoric rise in the geospatial industry as a comprehensive vector data storage and analysis system. More recently, it gained capabilities to work with raster data. The PostGIS Raster Module (https://postgis.net/docs/manual-2.2/RT_reference.html) promises numerous benefits for raster storage and analysis. There are many small but useful features which databases offer that make them preferable to file systems for spatial data storage. One of those is built-in, easy to use data backup. Another is the transaction log, which records all operations performed in the database. This log can be analyzed if something goes wrong with the system and can help bring it back to life. PostGIS also offers easy ways to restrict which database users can perform which operations, allowing developers to create separate "roles" (user accounts) for applications, researchers, and database administrators. These varying permission levels promote system security and protect data from accidents while also improving the usefulness of logs, which record information on who conducted a transaction. All of these useful details can be accomplished with simple file system storage as well, but they are much easier with a database and require a narrower range of skills to implement. PostGIS also offers several big-picture benefits, one of which is data organization. The models which generate the error realization data provided by the data distribution system rely on several spatial datasets. Using PostGIS allows us to keep all of that data in one central place, accessing the data through a common interface. This helps keep code simple and readable. While the same thing could be accomplished using just the file system, PostGIS offers other benefits that make it a good choice for storing raster data. One of the biggest benefits PostGIS offers is discussed in the "Early Examples of Distributed GIS" subsection of this paper: a centralized, authoritative source for geospatial data which can be accessed from anywhere in an enterprise system. While useful, that benefit does not directly impact this project because the data distribution system itself is the mechanism for providing users with data. However, this system is designed to help data producers quickly move from installation to production with the same tools they are already using. It is likely that those tools include a PostGIS database. The final benefit PostGIS offers this project is the numerous procedure languages available for writing database functions.
Procedure languages like PL/R and PL/Python allow users to create stored procedures in the database using languages other than SQL. That capability is very useful for this project because the scripts originally created to produce error realizations are written using the R language. The ability to wrap those scripts in database functions streamlines calls from the web server to the database server – the web server need only call the function using tools already available for interacting with the database. In addition, it offloads the intensive processing from the web server to the database server, freeing Node’s Event Loop and allowing it work efficiently. PostgreSQL is built to handle these operations, so there’s no concern about computation there, and PL/R allows us to work with a language we are already familiar with. 67 R offers a whole lot more than familiarity and simplicity. It is a very powerful open source tool for statistical analysis and is widely used in the spatial community. While not quite as popular as Python, R has recently seen a meteoric rise in popularity amongst geospatial developers43. This is likely because R offers extremely advanced statistical analysis capabilities which are unavailable in Python44. The original research which created the error models used in our system used R for that very reason. Additionally, R has long been popular in scientific disciplines which touch Geography such as Biology. This makes it a natural choice when trying to work along interdisciplinary lines and increases the likelihood that our project will be beneficial to other researchers who want to distribute their own research results. For these reasons, R is the best language choice to accomplish our goals. However, we will not be using it in isolation. The error model requires a few basic raster-derived inputs: slope, aspect, and MODIS ecoregion, all of which are stored in the database. These and other variables are used in a linear regression to calculate a global mean error surface. This global mean error surface, henceforth called the “Mean Layer”, is the main input to the actual error simulation calculation. The technique used in error simulation is called regression kriging, which uses a linear regression model to calculate the Mean Layer – the average expected error – and then simple kriging to calculate the spatially structured residuals of the linear regression (Hengl, Heuvelink, & Rossiter, 2007). Using native database functions, we can easily calculate the Mean Layer and store it for later use. This makes it much easier to generate error realizations as they are requested, because we can call a single function on a single dataset, significantly reducing database calls and processing time. 43 44 https://blogs.esri.com/esri/esri-insider/2015/07/20/building-a-bridge-to-the-r-community/ https://www.r-bloggers.com/r-an-integrated-statistical-programming-environment-and-gis/ 68 3.3 User Interface So far we have discussed the back end of the data distribution system, but not the part system users will actually interact with. In computer programming, people often use a house as a metaphor to understand the various parts of an web application. The database is the foundation, the back end (web server) is the house itself, and the front end (user interface) is interior design. While people might appreciate the layout of a house and its sturdy construction, they rarely think about these things because they are layers beneath the surface. 
The importance of a good user interface was stressed in both the "A Brief History of Error-aware GIS" and the "Modern Examples of Distributed GIS" subsections of this paper. Developers may build a fantastic system, but if it lacks a good user interface it will not attract many users. Building on the experiences and recommendations of previous researchers, our system implements a simple, intuitive user interface. It employs AJAX technology to ensure the page remains responsive for users. It uses a colorblind-friendly color palette to improve accessibility. Like the database and back end, the user interface is composed solely of open source technologies to ensure that the entire stack can be used by any researcher who would like to implement the system to distribute his or her own work. The following paragraphs describe those technologies and explain why they were selected.
3.3.1 Front End Framework
AngularJS (https://angularjs.org/) is the front end framework we selected for building the user interface. Angular is an extremely popular open source project developed by Google for creating beautiful web applications. For a surprisingly small size, AngularJS offers a lot of additional functionality over other libraries and frameworks commonly used in front end development (https://www.airpair.com/angularjs/posts/jquery-angularjs-comparison-migration-walkthrough). Additionally, instead of following traditional Model-View-Controller or Model-View-Viewmodel patterns, AngularJS is a Model-View-Whatever (as in, "whatever works for you") framework (https://plus.google.com/+AngularJS/posts/aZNVhj355G2), allowing for more flexible, task-oriented development. These benefits impact both developers and system end users in the following ways. Angular is built to create web applications, so its core design principles mandate that the page remain responsive to users at all times. This is a huge user experience benefit, as users are never waiting for a loading bar to fill up. For developers, AngularJS exposes straightforward APIs to implement the AJAX calls which allow that responsiveness. In addition, AngularJS offers two-way data binding that allows a developer to easily manage user input and send it to the server. These are just two of the numerous useful features AngularJS offers developers, but they are the most relevant to this project. Another aspect of AngularJS that is very beneficial to this project in particular and community-driven development in general is that it was designed for testability. In addition to AngularJS itself, Google has led the effort to create and maintain open source testing tools such as Karma and Protractor. They have also supported development on existing test suites such as Jasmine and Sinon. Tests are important because, in practice, they work as de facto standards for your project. Sometimes called specifications, tests are used to determine whether or not code runs as expected. After adding some new feature, developers can run tests to ensure that their changes have not broken some important part of the software. Though it offers far more functionality than the project currently needs, the flexibility, community support, and testability AngularJS offers make it a good platform for growth. This research yielded only a prototype of a data distribution system that, while functional, is a long way off from being used in high capacity production environments. That will change. Currently, our system supports a single processing service.
The system’s architecture does not mandate this in any way, so the system could, in theory, support an entire geoportal like those discussed in the “Modern Examples of Distributed GIS” subsection and offer services for everything from data discovery to data visualization. As the project continues to grow, AngularJS will be able to accommodate that growth. 3.3.2 Web Mapping Library Our application uses the Leaflet48 open source web mapping library. Leaflet is a good choice for this project for several reasons. First, it is a popular, mature open source project that is based on another, even more mature open source web mapping library called OpenLayers. Both are well featured and maintained by very active communities. For our purposes, Leaflet is a better choice than OpenLayers because it is smaller and faster. Leaflet also has a much simpler API than OpenLayers. Since the web map in our application is really only a convenient way to select data we do not need any of the advanced functionality OpenLayers offers, but the speed and simplicity of Leaflet are useful. Of additional benefit to our application is the Leaflet Angular Directive – an open source project that lets a developer easily integrate Leaflet with an AngularJS project. A similar project exists for Open Layers that is also well supported and widely used, but for the reasons described in the previous paragraph, Leaflet remains our library of choice. Pre-existing Directive code is a nice bonus because complicated directives are tough to write. While it is not strictly necessary to include Leaflet in a Directive, it does have some nice features. When using AngularJS, developers write JavaScript following a particular object model to produce what are known as Directives. After writing the JavaScript, directives can be included 48 http://leafletjs.com/ 71 on a web page simply by adding an HTML element. So, instead of writing all of the boilerplate code needed to display a simple web map, a developer can simply include the Leaflet Directive and write “” to easily render the map on the page. Using the Leaflet Directive also helps the map fit in better with the rest of the app’s components. AngularJS also includes the concepts of Controllers and Services. Controllers control how a Directive responds to user interactions, while Services are used to pass data back and forth between various Controllers. In our application, we use the Leaflet Directive and a map Controller to display and control the map respectively. We use a form Directive and Controller to accept and to respond to user input. Finally, the map Service communicates user interaction with the map to the form, and interaction with the form to the map. All of this happens in near real time thanks to the two-way data binding which Angular offers. What this means for the user is that when he or she resizes the bounding box delimiting the data request, the numbers on the form change as the corner of the box is dragged. 72 4 From Planning to Practice The previous section discussed our plan for building the system, the various components that make up our “stack”, and the reasons for choosing them. Developers use the term “stack” to refer to the combination of technologies which power a particular system. For example, the LAMP stack (Linux, Apache, MySQL, and PHP) is a very common setup used to power many software projects. It is the most common stack used for the popular Wordpress blogging platform. 
In the Node.js world, the MEAN stack (MongoDB, Express, AngularJS, and Node) is the most common. These labels are convenient, as they describe the architecture of a system just well enough for people to talk about such systems categorically. In reality, though, there are infinite variations on these common stacks. For instance, it is very common to swap out Apache for another server called nginx (pronounced "Engine Ex") in a WordPress stack. For Node web applications, it is not uncommon to run nginx as the main web server for serving static files, then connect to the local Node application server through some network magic. These implementation details are invisible when discussing technologies at acronym scale. Reducing an application to LAMP or MEAN makes it convenient to talk about, but it also hides the complexity of a system. In other words, it hides many of the things that can go wrong during the development process. Unfortunately, that is the scale at which all of the traditional literature on the topic of distributed GIS discusses the subject. More detailed treatments of system implementation do exist, but only in the grey literature of the web, such as blog posts from system developers or questions asked on help forums like the Stack Exchange family of websites. This section of the paper addresses that gap, and will directly discuss the challenges encountered during implementation and offer some critiques of the technologies used. Finally, it will address how these considerations relate to the research objectives put forth at the beginning of this paper.
4.1 The Back End
4.1.1 PostGIS Lessons Learned
PostGIS is undoubtedly a fantastic tool for geospatial research. It can certainly be a powerful analysis tool, particularly when it comes to vector data. It is a masterful storage solution for raster data, too. It is not, however, a good choice for large scale raster analysis. The PostGIS Raster Module needs to be developed further before it can rival other GIS solutions for the analysis of continuous spatial phenomena. This section discusses some of those shortcomings as they relate to the project. The first of many problems with the PostGIS Raster Module is the result of a frustrating Catch-22. Whether in a file system or a database, the best way to quickly access stored raster data is to cut it up into manageable pieces. A raster with coverage for the entire globe can be extremely large. ASTER GDEM version 2 has 22,702 1° by 1° tiles averaging about 12 megabytes each in GeoTIFF format, translating to about 272 gigabytes of data total (NASA JPL, 2009). SRTM version 4.1 at 3-arcsecond resolution has just under 15,000 1° by 1° tiles averaging about 3 megabytes each in GeoTIFF format, for a total of 45 gigabytes (NIMA, 2000). Obviously, that is too much data to handle at once, so the datasets are broken into smaller, more manageable pieces for storage and transfer. Typically, it is the responsibility of the user to merge these pieces back together if they need more than one of them to perform their research. Common raster data models like the GeoTIFF spatially reference pixels by their location relative to neighboring pixels. The only spatial information in the raster is attached to the upper left pixel, and the locations of all other pixels are defined by the rotation of that first pixel (i.e. degrees from North) and how many pixels down or to the right they are from the upper left corner.
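As a minimal sketch of that referencing scheme for the common north-up (zero rotation) case – with illustrative variable names rather than any particular library's API – the coordinates of a cell are derived from the upper left corner and the pixel size alone:

// Convert a raster row/column position to a map coordinate for a
// north-up raster. originX/originY locate the upper left corner and
// pixelWidth/pixelHeight give the cell size in map units.
function pixelToCoordinate(row, col, originX, originY, pixelWidth, pixelHeight) {
  const x = originX + col * pixelWidth;  // x increases to the right
  const y = originY - row * pixelHeight; // y decreases downward
  return { x, y };
}

A GIS must invert this arithmetic to find which row and column hold the value for an arbitrary coordinate.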
So, in order to spatially locate any point within a raster, a GIS must do the math to figure out how far down the column and across the row to look for the point. This operation becomes slow when it must be repeated for every point of interest in a large raster. Databases can improve query speed by creating spatial indexes, which partition space so that the computer can quickly find the features that fall within an area of interest, using a common tree data structure such as the GiST (Generalized Search Tree), B-tree, R-tree, or KD-tree, to name a few. Because a raster carries only that one explicitly georeferenced point, such indexes are not very useful on large rasters. One of the main goals of a database is to provide access to those data as though they were all together in one file. In order to maintain fast query times, however, the opposite must be true. Instead of combining the tiles, databases cut them up into even smaller pieces. Spatial indexes become much more effective for the minimum bounding rectangle of each small piece. Research on optimal tile size for SRTM and GDEM access in a PostGIS database shows that 100 by 100 pixels is the best for query times (Langley & Shortridge, 2015). These mini-rasters can usually be accessed just as if they were part of a larger whole, with one major exception: any operation which relies on a neighborhood of cells will encounter the edge of a raster much sooner than it would otherwise. In our case, we discovered this while trying to calculate slope on multi-tile areas. Slope and its cousin aspect are important because they are inputs to the models used for generating DEM error realizations. Slope is the first derivative of the elevation surface. To calculate slope, a neighborhood of at least 4 cells (bishop's or rook's case) is required (https://www.usna.edu/Users/oceano/pguth/md_help/html/demb1f3n.htm), though a 3x3 kernel (queen's case) is more common (Verdin et al., 2007). What this means in practice is that we cannot calculate slope at the edge of a raster, because there are not enough neighboring cells. This quickly becomes a major problem for a database composed of 100x100 pixel tiles. Figure 9 shows the results of a slope calculation on a tiled DEM. Fortunately, PostGIS has a tool to counter this problem.
Figure 9. Left: Calculating slope on a tiled DEM – notice the dark colored lines laid across the image in a perfect grid. Right: Calculating slope with ST_Union to avoid edge effects.
By using the ST_Union function, a user can combine those small tiles into a cohesive whole to perform operations that require neighboring cells. This method is also useful for retrieving data for analysis in an external software package such as QGIS or ArcGIS. Unfortunately, there are limits to the size of the area ST_Union can handle. In my experiments with GDEM v2, areas larger than 60 km² caused PostGIS to throw a memory exception, cancelling the process. I attempted to work around this problem by creating custom database functions to iteratively calculate slope for small areas around the globe using the PL/pgSQL language, because it has access to native PostGIS functions. To my dismay, this approach did not solve the problem. Cursors and For Loops – the iterator patterns required to implement the functions I just described – are notoriously slow in databases (https://stackoverflow.com/questions/287445/why-do-people-hate-sql-cursors-so-much). My custom solutions were taking more than 40 hours to run on the State of Colorado alone, so pre-generating a global slope layer was out of the question.
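For reference, the ST_Union approach shown in Figure 9 can be expressed as a single query issued from the Node server. This is only a sketch of the technique: the table name dem, the raster column rast, and the output pixel type are assumptions rather than the project's actual schema, and, as noted above, the query fails on large areas.

const { Pool } = require('pg');
const pool = new Pool(); // connection settings are read from environment variables

// Merge the tiles intersecting an area of interest before computing slope,
// so the calculation does not run into artificial tile edges.
async function slopeForArea(wktPolygon) {
  const sql = `
    SELECT ST_Slope(ST_Union(rast), 1, '32BF', 'DEGREES') AS slope
    FROM dem
    WHERE ST_Intersects(rast, ST_GeomFromText($1, 4326));
  `;
  const { rows } = await pool.query(sql, [wktPolygon]);
  return rows[0].slope; // a single merged slope raster for the requested area
}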
These limitations caused problems for our original plan for generating error realizations on-the-fly because they prevented us from creating and storing the Mean Layer. Without the benefits of a pre-generated Mean Layer, there was little benefit to using the database for any analysis. It became more practical to take an alternative architectural approach which would allow us to continue using R, but without the added difficulty of developing database functions. Instead, we used Node's ability to spawn child processes – operating system processes which Node manages. These can be any process which can be started from the command line, including geoprocessing scripts written in R or any other language.
4.1.2 Redesigning the Architecture
When this project began, one of the main goals was to shift a lot of the processing load to PostGIS to ease integration with the web server and streamline the system. The previous subsection explained that the analytical and data management tools available in PostGIS ultimately were not up to the task. Heavy computation on a Node server is bad. It blocks the event loop and prevents the server from processing incoming requests. When it became clear that wrapping our error realization generation functions in the database was not an option, we were unable to offload that heavy processing. Fortunately, we were able to work around that problem by using Node's ability to spawn child processes which, like any other OS process, can run on their own cores. Node is a single-threaded process, so it only requires one core for itself. Any remaining cores can be used however the server needs, be that for handling database connections or geoprocessing. Under the original architecture, those cores would be consumed with database operations. Under the new architecture, they will be used to run the R script we use to generate error realizations. Each running instance of the script pauses its own execution to pull in data from the database on its own thread. Those processes run completely independently of the web server, allowing it to achieve maximum efficiency under either architecture, so the redesign should not negatively impact performance. What will degrade performance under the new architecture is the increased amount of geoprocessing and database calls required to produce the data, but because we were unable to produce a global Mean Layer, that was a reality we faced anyway. Frustrating though it was, I believe this roadblock ultimately proved beneficial to the project because the new architecture further separates the distribution mechanism from the geoprocessing mechanism. The modular nature of the new architecture allows the model code to continue evolving separately from the distribution system. It also provides the system with the flexibility to let developers plug-and-play with different geoprocessing scripts. One consequence of that separation is that a script must be able to access the data it needs on its own, but unless written to run within a desktop GIS environment, most geoprocessing scripts already do that. As a result, researchers probably need to change very little of the business logic in their scripts before attaching them to the distribution system. The nature of those changes is further discussed in the "Connecting the Two" subsection.
4.2 The Front End
Unlike the back end, front end development proceeded exactly as planned. The AngularJS framework proved very easy to work with and readily handled our use case.
The reasons for choosing AngularJS were described in the "Front End Framework" subsection of this paper, so this subsection will instead focus exclusively on implementation details. It begins with a description of the user interface, moves on to discuss the UI Router module and its role in the project, and concludes with a discussion of user experience considerations.
4.2.1 User Interface
The system's user interface is designed as a Single Page Application (SPA). SPAs are built to look and feel like a desktop app with the intent of improving overall user experience. Users do not have to follow a series of links to navigate through the website. The same page they land on is the page they stay on throughout their time using the system. The layout of the application is similar to that of other mapping applications like Google Maps or Esri's Story Maps, also examples of SPAs, with a sidebar on the left taking up approximately 30% of the screen and a map on the right taking up the rest. This layout emphasizes the map as the primary tool of discovery and leaves the sidebar to provide information only when the user seeks it. Most importantly, it is a layout very familiar to users because it is commonly used in web mapping applications. Figure 10 shows a screenshot of the interface. Across the top in green are the three panel views: About, Help, and Data. The About panel just provides information about the project for the curious explorer. The Help panel answers some questions we have anticipated users might have, such as "How do I use this website?" and "How do I draw a shape?". The Data panel is open when a user lands on the page, and provides the data request form as well as a brief description of how to use the page and a link to the Help panel should the user require more information. On the Data panel, just below the description and above the form, are blue buttons which the user may click to either "Draw a Shape" or "Use View" to fill in the bounding box information required by the form. When a user clicks on either, the grey message at the top changes to a help message about how to use that method to complete the form. In addition, a red box appears on the map to show the selected area. Users may also fill out the bounding box manually if they know the coordinates and then verify their selection using the "Plot Area on Map" button at the bottom of the form. After filling out the form, a user may either submit their data request with the "Submit" button, or start over by pressing the "Reset" button. There's nothing complicated about our UI, and that's exactly the point.
Figure 10. The Data panel of the user interface allows users to submit data requests to our system.
4.2.2 User Experience
A good user experience is about more than intuitive, good looking design. While both of those qualities are important, good design both anticipates and enforces intuition. It is also inclusive. A user interface cannot be well designed if it fails to consider accessibility for vision impaired users. This subsection describes how our user interface addresses those concerns. We adhere closely to HTML standards for form design and use label elements to describe each input element. This ensures that visually impaired users who rely on screen readers will still be able to access and use the application.
We also considered colorblind users in our design, ensuring that the colors selected for warnings and success messages were distinct enough to catch the eye regardless of which type of colorblindness a user may have. These seem like small concerns, but they are not for the millions of people who deal with them every day. It is these subtle choices that make good interfaces. Another subtle but simple way to improve user experience is by reinforcing user interaction. This is especially important for SPAs which, by design, do not reload when a user clicks a button or provide other signals that user input has been received. Instead, a developer must conceive of ways to explicitly signal that the application is registering the user's actions. In our user interface, every interaction with the system is acknowledged and reinforced. For example, we use client-side validation to inform the user if the values they have entered are incorrect. If a user supplies impossible coordinates for the bounding box, the form will request that they enter a value between -180 and 180. All fields of the form are type and content validated in a similar manner to improve user experience. Additionally, the "Submit" button at the bottom of the form remains disabled until all the fields are filled and validated. After the user submits the request, the button is disabled again to prevent the user from accidentally submitting the request more than once. Figure 11 demonstrates what these features look like for a color blind user – note that while different from the colors in Figure 10, they are still very distinct. In addition, the "Submit" and "Plot Area on Map" buttons are disabled because the form is incomplete.
Figure 11. This image shows the user interface with a color blindness filter applied and form validation feedback showing.
The Data panel is not the only part of the interface for which reinforcement is important. Users should also be guided between panels. UI Router allows our interface to avoid a common pitfall of SPAs, which is poor navigation capabilities. Because SPAs are just a single web page, the URL in the address bar typically does not change when you navigate through the application. For example, when switching between the panels of our sidebar, the URL would not change to reflect which panel the user was currently viewing. UI Router allows us to do exactly that, using the address bar to signal when the panel has changed. In addition, it allows the user to share a link to that specific panel; visiting the URL for the Help panel overrides the interface's default behavior of showing the Data panel first. Deep linking also provides great opportunities for extending the user interface in the future. As this project matures, it is likely that we will continue adding new services. At some point, we may need to redesign the interface to allow for a catalog of possible services. UI Router allows us to deeply link each of those services within the application, allowing a user to bookmark a particular service so that they can quickly and easily find it in the future.
4.3 Connecting the Two
4.3.1 Feeding User Input to R
After the failure of the original system architecture plan, we had to rethink how we would manage user input in our new environment. Some digging online revealed a way to use Node to manage a batch of long-running geoprocessing tasks (https://contourline.wordpress.com/2013/10/08/700/). The API for creating child processes in Node allows a developer to pass environment options, including global environment variables.
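As a rough sketch of that mechanism – with the script name generate_realizations.R, the environment variable names, and the parameter object serving only as placeholders – launching the R script from Node with request-specific environment variables might look like this:

const { spawn } = require('child_process');

// Run the R geoprocessing script as an operating system child process,
// handing it the user's request parameters through environment variables.
function runRealizationJob(params, done) {
  const child = spawn('Rscript', ['generate_realizations.R'], {
    env: Object.assign({}, process.env, {
      XMIN: String(params.xmin),
      YMIN: String(params.ymin),
      XMAX: String(params.xmax),
      YMAX: String(params.ymax),
      N_REALIZATIONS: String(params.realizations)
    })
  });
  child.on('close', code => done(code === 0 ? null : new Error('geoprocessing job failed')));
}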
Our new system architecture works the same way, except that it parses user input to set the values of the variables rather than pulling them from a pre-created list. We rely on an asynchronous task queue to execute our geoprocessing script with its unique environment variables. The following paragraphs describe how that works. When a user requests data, the server may take anywhere from a few minutes to several hours to complete the request. In either case, that's far too long to wait for a response from the server. To avoid that problem, we implement what is commonly called a worker function. Worker functions just set up background processes. In JavaScript, functions are first class objects. This means that they, like any other object, can be put into an array, assigned to a variable, or even passed to another function as an argument. This includes worker functions. When a data request comes in, the server receives it, validates it, supplies the request data to the worker function, then inserts the worker function into a process queue. When system resources are available to execute the process, the queue calls the worker function, which begins processing the data based on the request parameters. After the task is complete, the worker process emits an event to tell the system it is done processing and ready for the next task. The "done" event in turn triggers the auto-mailer, which sends an email to inform the user that the data are ready for download.
Figure 12. This image shows the user interface with a color blindness filter applied and form validation feedback showing.
One important thing to note about the process described above is that the worker function can call processes in other languages. That is hugely beneficial to the goal of meeting researchers where they are comfortable. Not only can end users work with the resulting data in whatever workflow they are already using, but any researchers who want to use the system to distribute their own work can use their existing code with minimal modifications. While most other server platforms offer the same capability, none of them can match Node's ability to handle so many connections efficiently. Though we only tested the server with single-threaded operations, we anticipate that, so long as a developer using a multithreaded script accounts for it when configuring the server, the system should handle it just fine. Thanks to Node and child processes, the system is easy to use both as a data producer and a data consumer. In addition, it certainly meets people where they are comfortable, allowing them to work in languages they are already familiar with. These characteristics match the design philosophy of this project well and are a big part of the reason we retained Node as the server platform after the original architectural plan proved to be a failure. Another implication of the new architecture is that the system could easily be extended to run multiple scripts, allowing the developer to expose many geoprocessing services via the same web server. This would move the system much closer conceptually to service-oriented portals like GeOnAS. As discussed in the previous subsection, the user interface is also ready for extension after the system becomes production ready. Had the original plan for the system's architecture prevailed, the system would have none of the flexibility worker functions offer by loosely coupling data processing and the web server.
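Before turning to security, a minimal sketch of the queueing and notification flow described above is shown below. It assumes the async package's queue, the runRealizationJob() function from the previous sketch, and a hypothetical sendDownloadLink() mailer; the concurrency limit of eight matches the limit discussed in Section 4.4.

const async = require('async');

// Process at most eight geoprocessing jobs at a time; each task carries the
// validated request parameters supplied by the web server.
const jobQueue = async.queue((params, callback) => {
  runRealizationJob(params, callback);
}, 8);

// The web server simply pushes each accepted request onto the queue.
function enqueueJob(params) {
  jobQueue.push(params, err => {
    // Once the child process finishes, notify the requester by email.
    if (!err) sendDownloadLink(params.email);
  });
}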
4.3.2 Security concerns
In today's world, security must be a primary concern for any network-enabled system. Recent high-profile attacks on companies like Sony (https://www.washingtonpost.com/news/the-switch/wp/2014/12/18/the-sony-pictures-hack-explained/) and organizations like the Office of Personnel Management (http://nbcnews.to/1GlS9jm) have proven that no one is safe. Computer security researchers have even proven that attackers can take over a car as it is driving down the road (https://www.wired.com/2016/03/fbi-warns-car-hacking-real-risk/). Clearly, an analysis of potential security risks is of critical importance. The system described over the course of this paper has a very small "attack surface". The term attack surface refers to "all of the different points where an attacker could get into a system, and where they could get data out" (https://www.owasp.org/index.php?title=Attack_Surface_Analysis_Cheat_Sheet). Our system has exactly one entry point – the data request endpoint to which users send POST requests – and exactly one exit point – a static file server which only distributes data pre-packaged by the geoprocessing script. All other data manipulation takes place in the geoprocessing script or the server code, which cannot be accessed from the outside. While that makes an attack unlikely, it most certainly does not make it impossible. Any time a server accepts raw user input, security must be a concern (https://www.owasp.org/index.php/Input_Validation_Cheat_Sheet). That concern is especially valid when that user input is used to make database calls. Attackers can provide malformed input which can give them access to the database or even the entire system. The best way to avoid those problems is by rigorously validating any user-provided data on both the client and server side of the application. Another attack to which our system is vulnerable is the Distributed Denial of Service (DDoS) attack. This type of attack involves flooding a system with incoming requests and is a favorite tactic of hacktivist groups like Anonymous because such attacks are virtually impossible to defend against (https://www.wired.com/2016/01/hacker-lexicon-what-are-dos-and-ddos-attacks/). The only way to tell the difference between valid and malicious requests is to examine how quickly an IP address is making the requests. A common practice is to block requests occurring in rapid succession, e.g. less than one second apart, because humans do not typically make requests that quickly. Even then, the server has to do the work of analyzing and rejecting those requests. Fortunately, DDoS attacks are very unlikely to cause actual damage to a system. They do not help attackers gain control of a system; they only bring it to a crawling halt. The system would experience the same results should a large number of people suddenly decide to start using the system at once. Because of the obscurity of its purpose (security through obscurity, as they say) and the fact that it is a prototype, our system architecture does not have capabilities for defending against a DDoS attack. The final security consideration for any system must be ensuring user privacy, and ours is no exception. For the ease of development, the system currently uses HTTP to transfer data. In a production environment, the system should use HTTPS so that traffic between the server and client is encrypted.
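As a sketch of what that switch might involve with Node's built-in https module – the certificate and key paths shown are placeholders for files issued by a certificate authority – the change to the server is small:

const fs = require('fs');
const https = require('https');
const express = require('express');

const app = express(); // the existing Express application and its routes

// Read the certificate and private key provided by the certificate authority.
const options = {
  key: fs.readFileSync('/etc/ssl/private/server.key'),
  cert: fs.readFileSync('/etc/ssl/certs/server.crt')
};

// Serve the same application over an encrypted channel.
https.createServer(options, app).listen(443);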
We may not be requesting credit card information over an unsecured channel, but without SSL (Secure Sockets Layer) encryption attackers could easily read the email addresses sent in data requests. The switch from HTTP to HTTPS is trivial and simply requires registration with a certificate authority such as Let's Encrypt (https://www.letsencrypt.org) and making the provided certificate available to the server. Node and Express have built-in methods for handling HTTPS, so only minor adjustments to the server code will be required.
4.4 System Performance
In any computer system performance is important, but that holds especially true for web services. On a desktop computer, the user is the only one that controls the load on the system. If he or she is running a computationally intensive analysis which makes the computer unusable, the only person inconvenienced is himself or herself. For systems like ours which open themselves to the Internet, anyone can place a heavy load on the server. This is the entire point of the DDoS attacks discussed in the previous subsection. If our system goes down, it could affect anyone who has unprocessed data requests waiting for computation resources. But understanding our system's performance is not only about protecting users; it is also about affording them the best possible experience. By testing our system's strengths and weaknesses, we can learn how to optimize its performance. For example, if we know that our system processes realizations faster when there are fewer jobs processing at once, we can find the sweet spot between concurrency and speed which allows us to serve the most users as quickly as possible. Both to protect ourselves from the possibility of a crash due to overloading and to tune our system for maximum efficiency, we need to know exactly what our system is capable of. This section is about discovering its limits and describing its operational characteristics. There are three main factors which determine our processing time: the number of realizations generated, the size of the requested area, and the number of jobs running at once (concurrency). Based on the performance of our R scripts before using them in the context of the data distribution system, we have a few hypotheses as to how each will affect our system. First, we believe that the number of realizations has the least impact on overall performance, as Dr. Shortridge's experience with the code showed him that the most computationally expensive part of the process is variogram estimation. Since the variogram model only needs to be estimated once for a given area, it should be fairly trivial to produce large numbers of realizations for that area. The next consideration is the size of the requested area, which we believe greatly impacts performance. Our script has a problem with large areas because of the way we retrieve data from our PostGIS database. Some of the same limitations that prevented us from calculating the global Mean Layer inside the database using PL/pgSQL and PL/R prevent us from efficiently pulling those data into our R script. Tiling the database is necessary to maintain fast query times, but as discussed in Section 4.1.1 it causes problems when recombining the data. In PostGIS, we needed to use the ST_Union function, which could handle only small areas. Our R script avoids ST_Union by manually stitching together the tiles returned by our database query.
Unfortunately, this has its own performance problems which, in the end, are very comparable to ST_Union's. As a result, we expect that the size of the requested area will have a large negative impact on system performance. The final factor in our system's performance is concurrency – how many geoprocessing jobs are running at any given time. We expect to see that, regardless of the fact that each geoprocess the server spawns has its own dedicated CPU, the greater the number of running jobs, the poorer the system's performance will be. We attribute this to the fact that memory is still shared amongst all the processes. For that reason, we currently limit the number of concurrent processes to 8, preventing our server from ever coming close to its full capacity. The physical machine we run our application on has 64 cores and 64 gigabytes of memory. Based on what we observed in development, our processes usually use about 2 GB, but as much as 4 GB, of memory per process. By limiting the number of jobs to 8, we ensure that our application never consumes more than 50% of the machine's memory. This is important, as the server is a shared resource on which other researchers rely. Testing our system's capacity should help us discover if 8 jobs is really the best balance of capacity and speed. To test our system's capacity, we randomly generated 50 data requests for each of three different sized study areas and logged the processing time of each to a file. We used JavaScript's Math.random() function (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Math/random) to pull from a uniform distribution to get both random locations within the Continental United States and a random number of realizations between 50 and 500. The idea is that these conditions simulate the actual load a system might face. Not everyone is requesting the same number of realizations for the same area, so our method allows us to vary the request parameters in a way similar to what we expect to see from users. We tested square patches that were 360, 540, and 720 arcseconds on a side and compared run times both within their respective groups and on average across all groups.
Figure 13. Run times in hours across all 50 jobs for 720 arcsecond patches.
Figure 14. Run times in hours across all 50 jobs for 540 arcsecond patches.
Figure 15. Run times in hours across all 50 jobs for 360 arcsecond patches.
Figure 16. Average time to complete a job by size of requested area.
Based on the results of our tests, we feel confident that our hypotheses about size and concurrent processing are correct. Figures 13, 14, and 15 show the individual task run times for each group. Regardless of the area's size, processing times increased as more jobs were completed, suggesting that the number of concurrent processes impacts processing time. Ideally we would confirm that hypothesis with tests which slowly incremented concurrency over time. Unfortunately, such tests proved too difficult to create in the limited time between system development and the publication of this thesis, but they would be very interesting to see and may be worth the effort in the future. At first, it seemed that the numerous peaks and valleys across all of the sizes might be related to concurrency as well, but the way the test scripts are written ensures that 8 jobs will always be running until fewer than 8 jobs remain in the process queue.
Instead, they may suggest that the number of realizations requested matters more than we initially thought. Further testing should be done to see how incrementing the requested number of realizations affects processing time.

Finally, Figure 16 compares average processing times across request sizes. As expected, size severely impacts performance. There appears to be an exponential increase in processing time for linear increases in the size of the requested area. While it would have been nice to test this with larger sizes, the already long run times threatened to become prohibitive. The long run times are due to the limitations of the database. As described in Section 4.1, the tiling process required for storing large datasets is problematic when it is necessary to recombine the tiles for analysis, as it is in our case. Because of the memory limitations of the function PostGIS uses to recombine tiles natively, our error modeling script recombines them manually. Unfortunately, due to the way PostGIS provides access to the underlying data, our script suffers the same limitations. As with the database, our script will attempt to perform the requested operation regardless of whether or not it can be successfully completed. Until the data retrieval logic is improved, we should certainly limit the size of the area which users can request, both to avoid excessive stress on the system and to prevent users from entering data requests which cannot be fulfilled.

5 Using the System

Building a system to distribute error realizations only completes the first research objective. To truly empower researchers to get more accurate results from their work, the next step is to show them how to use the error realizations in their own work. The literature surrounding Spatial Decision Support Systems (SDSS) frequently considers uncertainty. Indeed, literature related to handling uncertainty within SDSS goes all the way back to the 1990s (Crossland, Wynne, & Perkins, 1995). One of the seminal papers on error propagation and uncertainty comes from the SDSS literature (Aerts, Goodchild, & Heuvelink, 2003). As such, this literature will play a critical role in the development of our own error-aware analysis.

In developing our analysis scripts we will continue to use only FOSS, as this ensures that they will be helpful to the greatest number of users. Hengl, Heuvelink, and Van Loon (2010) provide an example of an analysis tool created using FOSS that includes stochastic simulation, using the R Statistical Package (R Core Team, 2016) for stream network derivation. Darnell, Tate, and Brunsdon (2008) also use R to implement a FOSS model for stochastic simulation and error visualization. However, both of these examples leave something to be desired because they rely too heavily on the user having certain statistical knowledge. In reality, most of "these users undoubtedly lack expert knowledge about both the data collection methods employed by the data producer and the spatial simulation model theory and implementation in vogue with spatial information scientists" (Goodchild, Shortridge, & Fohl, 1999). Our system avoids that problem, as the analysis below reveals.

5.1 Monte Carlo Viewshed Analysis

In order to demonstrate how researchers might use our data distribution system, we looked to basic but common analyses which GIS users perform using DEMs. One of those, viewshed analysis, is an ideal operation for demonstrating the necessity and simplicity of performing uncertainty-aware analyses.
Given the location of an observer and a DEM, a basic viewshed analysis determines whether or not the land represented by a given cell in the DEM is visible to the observer. Uncertainty in the DEM can easily lead to bad results. Our data distribution system promotes a type of uncertainty-aware analysis known as Monte Carlo analysis, so named after the well-known casinos of that municipality. Monte Carlo analysis seeks to reduce uncertainty by anticipating all possible outcomes of a particular operation and the probability that those outcomes will occur (http://www.investopedia.com/articles/financial-theory/08/monte-carlo-multivariate-model.asp). In order to do that, a researcher needs to vary the inputs to their analysis based on a set of conditions that reflect possible variations in the data – a process known as conditional stochastic simulation, which we discussed in the Error and Uncertainty section of this thesis. Our system removes the difficulties associated with gathering those simulated inputs, letting researchers skip straight to the analysis phase.

Monte Carlo viewshed analysis is not very different from a regular viewshed calculation. In fact, a Monte Carlo viewshed analysis is composed of many regular viewshed analyses. In the context of our system, a Monte Carlo viewshed analysis performs a regular viewshed analysis for every DEM Error Realization the user downloads. After all realizations have been processed, the researcher takes the mean of all the outputs to create a viewshed probability surface. Instead of values of either 1 or 0 (visible or not visible), the viewshed probability surface contains values between 1 and 0 reflecting the probability that a particular DEM cell is visible from the observer's location. For a graphical representation of the process, refer to Figures 3 and 4 in Section 2.1.3.

5.1.1 Changing the Data Model

Monte Carlo analysis requires a fundamental rethinking of the traditional data model. In addition to the traditional combination of attribute data, geometric data, projection data, and metadata, the Monte Carlo paradigm requires a fifth component: an uncertainty model. Thanks to our system, researchers no longer need advanced geostatistical knowledge to perform Monte Carlo analysis. Unfortunately, it does not free them from the constraints imposed by the inability of mainstream GIS software to handle a new data model. Ideally, an error-aware GIS would support the ideas of Goodchild, Shortridge, and Fohl (1999) or Heuvelink, Brown, and van Loon (2007), who discuss the possibility and logistics of actually including uncertainty information in the data model. As this paper has thoroughly explained, while such software does exist, it is not accessible to the majority of users and is far from mainstream. As discussed in Section 3.1.4, our system works around this problem by changing the data model at a conceptual level rather than at the implementation level, enabling the use of mainstream tools for Monte Carlo analysis. Instead of encapsulating a simulation model within the data and requiring that users know how to produce realizations, we distribute only the model outputs. Conceptually, both methods deliver numerous versions of a single dataset. However, the latter only requires the user to adjust his or her thinking, while the former requires a software redesign.
Therefore, the latter is more likely to actually be used by researchers because it allows them to use tools they are already familiar with and requires minimal adjustment to existing analysis scripts. To demonstrate this, we wrote a simple Monte Carlo Viewshed Analysis script using R and GRASS GIS.

5.1.2 An Example Analysis with R and GRASS

We selected R because it is FOSS and available around the globe in many languages, maximizing the likelihood that researchers can apply this example to their own work. In addition, it is available for download free of charge and has a very active development community. As discussed previously, it is also widely used in the geospatial community. Finally, R can be used as a geoprocessing engine in QGIS – the most popular open source GIS in the world. Eventually this example could be fully integrated with the QGIS GUI, making it easy for anyone to use an uncertainty-aware GIS. These reasons, combined with the fact that we are already using R for our DEM Error Realization scripts, make R a natural choice for our application example.

The script we developed follows the method described at the start of this section. The code is free, open, and available online via the Help section of the distribution system. It relies on the very useful and greatly appreciated work of the GRASS Development Team (2016) for GRASS GIS; the R Core Team (2016) for the R language; Roger Bivand (2016) for the rgrass7 package, which connects the two; Pebesma and Bivand (2005) for the sp package; Hijmans (2016) for the raster package; Bivand, Keitt, and Rowlingson (2016) for the rgdal package; and Shum and Akimov (2015) for the hashids package.

Monte Carlo viewshed analysis is possible using these packages and some simple file manipulation logic. After downloading and extracting DEM error realizations for the desired study area, they should be projected, because the realizations are distributed in the WGS84 geographic coordinate system. The r.viewshed algorithm used in the analysis relies on map units to calculate slope, and degrees can be a problematic unit in that calculation (GRASS Development Team, 2016). The script provides a function to easily convert all realizations from EPSG:4326 to the desired coordinate reference system. After that, a user simply needs to supply one or more observation points in a SpatialPointsDataFrame object, the name of a column in that object to use for naming the output viewshed(s), the path of the DEM error realizations directory, and the path to the desired output directory. Behind the scenes, the code performs a viewshed calculation for each point using each DEM Error Realization and returns a single raster that is the mean of all viewshed calculations. A simplified sketch of this workflow appears below.

The analysis itself is only important as a proof of concept. The "study area" selected was one of the randomly selected test patches created during the system tests described in Section 4.4 and covers an approximately 11 square kilometer area near Modesto, California, USA. What is meaningful is how simple it becomes to conduct Monte Carlo analysis when the user no longer needs to create, tune, and run error simulation models to create DEM error realizations. The script also reveals the implications of the changes Monte Carlo analysis requires, accepting a directory of inputs rather than a single file. For comparison purposes, Figure 17 shows the final result of the entire analysis alongside one of the intermediate viewshed outputs used in its calculation.
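The actual script is distributed through the system's Help section rather than reproduced in this thesis, so the sketch below is only an illustration of the workflow just described, not the author's code. It assumes the realizations have already been reprojected out of EPSG:4326, and the GRASS installation path, file locations, observer coordinates, and map names are all hypothetical.

```r
# Illustrative sketch of a Monte Carlo viewshed: one binary viewshed per
# DEM error realization, averaged into a viewshed probability surface.
library(rgrass7)
library(raster)

# Hypothetical directory of realizations already projected to map units.
dem_files <- list.files("realizations_projected", pattern = "\\.tif$",
                        full.names = TRUE)

# Start a throwaway GRASS session; gisBase is system-specific.
initGRASS(gisBase = "/usr/lib/grass72", home = tempdir(),
          gisDbase = tempdir(), location = "mc_viewshed",
          mapset = "PERMANENT", override = TRUE)

# Match the location's projection to the first realization.
execGRASS("g.proj", flags = "c", georef = dem_files[1])

observer <- c(700000, 4170000)  # hypothetical projected coordinates (x, y)

viewsheds <- vector("list", length(dem_files))
for (i in seq_along(dem_files)) {
  # Import realization i and set the computational region to it.
  execGRASS("r.in.gdal", flags = c("overwrite", "o"),
            input = dem_files[i], output = "dem")
  execGRASS("g.region", raster = "dem")

  # Binary viewshed (-b flag): 1 = visible, 0 = not visible.
  execGRASS("r.viewshed", flags = c("b", "overwrite"),
            input = "dem", output = "vs", coordinates = observer)

  viewsheds[[i]] <- raster(readRAST("vs"))
}

# The cell-wise mean of the binary viewsheds is the probability surface.
probability <- mean(stack(viewsheds))
writeRaster(probability, "viewshed_probability.tif", overwrite = TRUE)
```

The real script wraps this loop in functions that also handle the reprojection from EPSG:4326, accept multiple observer points from a SpatialPointsDataFrame, and name the outputs from a user-specified column; exact behavior will also vary with the installed GRASS and rgrass7 versions.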
Returning to Figure 17, note the differences in the values displayed in the legends and in the maps' appearance – the intermediate result implies authority and truth by declaring cells either visible or not visible, while the final result reflects the fuzziness of the actual data. The intermediate viewshed in Figure 17 is just one of the 107 viewshed analyses (one for each DEM Error Realization) conducted on this test dataset. While it looks very similar to the Monte Carlo viewshed at first, the two have very different interpretations. In the intermediate result, cells may be either visible (value of 1) or invisible (value of 0) to the observer. The Monte Carlo result instead shows the percentage of intermediate viewsheds in which a particular cell is visible – the higher a cell's value (the lighter its color), the more frequently it is visible. Even the cells with the highest probabilities of being visible to the theoretical observer are invisible in about 10% of the intermediate analyses, clearly demonstrating the uncertainty inherent in SRTM v4.1. Without our system to provide easy access to DEM error realizations, this is the level of uncertainty researchers are absorbing. Rather than relying on abstract measures of accuracy stored in an associated metadata file to help them decide whether or not SRTM is fit for their use, they can visually interpret the uncertainty – a much more intuitive approach. This is just one of the numerous benefits Monte Carlo analysis has to offer.

Figure 17. A comparison between the final result of the Monte Carlo Viewshed Analysis script and one of its intermediate outputs.

6 Conclusions

Though many obstacles remain on the path to bringing uncertainty-aware spatial analysis to mainstream GIS users, the web-based data distribution system described in this thesis represents a step in the right direction. First and foremost, it accomplishes the primary research objective of developing a system for distributing DEM error realizations over the Internet. Second, we succeeded in creating a relatively simple example of how to incorporate Monte Carlo analysis into basic GIS operations to help researchers understand how they might apply it in their own work. We did both of these things using exclusively open source technologies, ensuring that anyone can benefit from them. Finally, we explored the implications of the stochastic paradigm for the geographic data model in a wide variety of contexts, from systems architecture to applied analysis. Though our system changes the data model at a conceptual level rather than an implementation level, it still presents some challenges for both users and developers. Nevertheless, it is the best way to address the data model problem, given that mainstream GIS software does not handle the stochastic paradigm well and that other software, though it exists, is both hard to acquire and difficult to use.

The key driver behind our primary research objectives – the reason for doing this research in the first place – is making it easier for geospatial researchers to consider the impact of uncertainty on their work. Our new system is the first uncertainty-aware analysis tool that truly offloads the burden of creating, tuning, and running the complicated geostatistical models which produce our DEM error realizations from end users and onto geostatistical experts. In the spirit of our core design principles, we have definitely made it easy and made it magic. We feel that the simple analysis scripts discussed in Section 5 in general, and Section 5.1.2 in particular, demonstrate that.
Despite all these successes, there were most definitely some disappointments as well. Without question, the biggest of those disappointments was the inability of PostgreSQL and PostGIS to meet our analysis needs. It proved extraordinarily difficult to work with the data the way we needed to, likely because SQL is a declarative language and lacks the functional control of languages typically used for analysis, such as Python and R (https://neo4j.com/blog/imperative-vs-declarative-query-languages/). Once we finally had code that worked for generating the Mean Layer, we quickly discovered that it would not scale to the size we needed. Far from being able to generate a Mean Layer for the entire globe, our queries could not even handle all of Colorado. Even after abandoning our hopes of storing the Mean Layer in the database, PostGIS still caused us performance problems, as discussed in Section 4.4.

Admittedly, being forced to change our architectural plans ended up being a fortuitous development, even though the process of discovering PostGIS's limitations was frustrating. The major advantage of the new approach is that it allows a researcher to, with minor adjustments, use the same code they developed in their research to create the final product. Python, R, or anything else that can be run from the command line and access system environment variables will probably work just fine. While this problem has a silver lining, others do not. Initially, we had hoped to develop more than an example for researchers to follow. At the beginning, we set out to create script tools capable of integrating with QGIS' user interface, making it even easier for researchers to use our system. Had we succeeded, a researcher would not even need to be familiar with R to use our code to perform his or her own Monte Carlo viewshed analysis. We also wanted to add at least two more Monte Carlo analysis tools (Slope and Aspect) to the portfolio, but the time spent trying to work around the database ultimately prevented us from achieving those goals. While disappointing, this failure was not total, as we still provided at least one example. In addition to these minor issues, this project also suffered from some major limitations.

6.1 Limitations

Without question, the biggest limitation to this research was our extremely limited back end development experience. Researchers in Geography are usually not concerned with matters of web server architecture, so online resources were the best we had. While the documentation for all of our selected modules and libraries is very good, there is no substitute for practical experience in software development. While not catastrophic, the implications of our inexperience certainly affected the project and cost us more than just time. For example, despite knowing the importance of testing and selecting frameworks known for their testability, our code has no test coverage whatsoever. This was not for lack of consideration or effort, but simply because we did not have enough experience to know how things were supposed to work, learning as we went along. Automated tests formally codify an application's logic, running a set of functions which determine whether the application is actually doing what the developer thinks it is – whether it is doing what it was designed to do. Automated software testing is very important, particularly in open source projects in which contributors may rarely meet face to face, because it defines a set of rules for the application to follow.
If a developer makes a change to the codebase, they can simply run the tests to ensure that their modifications have not negatively affected the system before sharing their changes with others. Not having these tests puts limits on the future of our project, and they need to be developed before any of the items discussed in the "Future Directions" subsection can be pursued.

Another limitation was our stubborn insistence on using PostGIS for our analysis, which cost us time on every other aspect of the project, including testing. The author's inexperience with web architectures led him to believe that it would be the only easy way to achieve the system objectives discussed in Section 3, and the ever-elusive solution to our query problems always seemed to be just around the corner. Even when we finally found a solution, it was not good enough to accomplish our goal of generating the global Mean Layer within the database. We would have seen far more return on our invested time and effort had we turned to other solutions, like Node's child processes, earlier. Had we settled on a different architecture sooner, we might have had time to write our tests, complete all of the QGIS tools we had originally hoped to create, or implement a standards-compliant service interface like JSON-RPC. PostGIS is a proven tool for data storage and vector-based geospatial analysis, but we strongly recommend against using it for advanced raster analysis.

The final limitation of this project which we will discuss is also a testament to its uniqueness and importance: no comparable system exists against which we can measure our system's performance. While Section 4.4 discusses system performance, there is no industry benchmark against which we may compare our results. To our knowledge, there are not even similar systems written in other languages which might truly compare to ours. Certainly Heuvelink and Brown's Data Uncertainty Engine is in the same class, but it has very different functionality and, according to a 2010 paper, the only way to obtain a copy is to contact the authors directly and request it. Without building a second system in a different language, we have no way to objectively assess the benefits one stack may provide over another. Obviously, we did not have the time to develop a second system purely for comparison purposes. While very understandable, this limitation remains: because we only created one server, we cannot really test it against an implementation in another language or one built with different technologies.

6.2 Future Directions

Though we succeeded in developing a prototype, the results do not constitute a production-ready system. The software was written by amateurs, for amateurs, and it needs a lot of work from people more experienced with web application design. Having recently accepted a position as a software developer with a company that uses a stack very similar to the system described in this thesis, the author is confident he will soon have the experience to make good decisions about the system's architecture and continue its development. One of the first goals for that improved architecture must be the adoption of a standardized connection interface such as OGC's WPS or JSON-RPC. Which one will be best depends largely on decisions about how the architecture should grow to interface with new clients. The main purpose of employing a Web Services standard is to allow third parties to confidently develop clients for the service we provide.
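As noted in the next paragraph, one such third-party client could be written in R. The sketch below is purely illustrative: the endpoint URL, method name, and parameters are hypothetical and do not correspond to the system's current API; it simply shows the shape of a JSON-RPC 2.0 call made with the httr package.

```r
# Illustrative only: a hypothetical JSON-RPC 2.0 request from an R client.
library(httr)

rpc_body <- list(
  jsonrpc = "2.0",
  method  = "requestRealizations",              # hypothetical method name
  params  = list(xmin = -121.1, ymin = 37.5,    # hypothetical request extent
                 xmax = -121.0, ymax = 37.6,
                 realizations = 100,
                 email = "researcher@example.org"),
  id = 1
)

# POST the request as JSON and parse the JSON response.
response <- POST("https://example.org/rpc", body = rpc_body, encode = "json")
stop_for_status(response)
content(response, as = "parsed")
```

A WPS client would look different – XML requests against the standard's DescribeProcess and Execute operations – but the benefit is the same: programmatic access that any third party can build against with confidence.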
Of course, a standards-compliant interface also offers other benefits, like chainability and discoverability, depending on which standard is chosen. Currently, our online GUI is the only client designed to access our API. Adopting a standard like WPS would allow interested parties to use existing clients designed to work with those standards; for example, QGIS and ArcGIS are both designed to work with the WPS standard. It would also be possible to connect to a WPS or JSON-RPC service from R. Should we find a good way to generate and store the global Mean Layer, we could transition the service from an entirely server-side architecture to a hybrid client-server architecture like the one discussed by Walker and Chapra (2014). By creating an R package designed around our API, we could transmit the global Mean Layer to the client using the WPS specification and let the package generate realizations on the client side. This would dramatically reduce the processing load on our application servers and improve the user experience, because researchers would no longer have to wait for extended periods to get their data, instead generating it themselves from the much smaller (and therefore faster to transfer) Mean Layer dataset.

After architecture improvements, the biggest goal for future research relating to our system must be spreading knowledge about it, and the best way to do that is by example. Future work must focus not only on applying our system in various analyses, but also on publishing tools that may be reused by others for the same purpose. Ideally, those tools would not be just code, though that would certainly be acceptable. Instead, those works should seek to fill the gap left by this thesis and create tools which leverage QGIS' GUI to improve user experience and reach a wider audience. This would be the best sort of evangelism our system could hope for, and it would certainly drive traffic. The authors of such tools could even provide access to them directly through our system's web interface, making them extremely accessible.

REFERENCES

Aerts, J. C. J. H., Goodchild, M. F., & Heuvelink, G. B. M. (2003). Accounting for Spatial Uncertainty in Optimization with Spatial Decision Support Systems. Transactions in GIS, 7(2), 211–230. http://doi.org/10.1111/1467-9671.00141

Aerts, J. C. J. H., Lin, N., Botzen, W., Emanuel, K., & de Moel, H. (2013). Low-Probability Flood Risk Modeling for New York City. Risk Analysis, 33(5), 772–788. http://doi.org/10.1111/risa.12008

Aerts, J. C. J. H., Clarke, K. C., & Keuper, A. D. (2003). Testing popular visualization techniques for representing model uncertainty. Cartography and Geographic Information Science, 30(3), 249–261. http://doi.org/10.1559/152304003100011180

Arbia, G., Griffith, D., & Haining, R. (1998). Error propagation modelling in raster GIS: overlay operations. International Journal of Geographical Information Science, 12(2), 145–167. http://doi.org/10.1080/136588198241932

Atkinson, P. M., & Foody, G. M. (2002). Uncertainty in Remote Sensing and GIS. http://doi.org/10.1002/0470035269

AWG, I. (2011). Systems and software engineering — Architecture description (ISO/IEC/IEEE 42010:2011).

Bastin, L., Cornford, D., Jones, R., Heuvelink, G. B. M., Stasch, C., Pebesma, E., … Williams, M. (2012). Managing Uncertainty in Integrated Environmental Modelling Frameworks: The UncertWeb framework. Environmental Modelling & Software, (in press), 116–134. http://doi.org/10.1016/j.envsoft.2012.02.008

Beard, K. (1997). Representations of data quality. Taylor and Francis.
Bédard, Y., Gervais, M., Devillers, R., Levesque, M.-A., & Bernier, E. (2009). Data Quality Issues and Geographic Knowledge Discovery. Geographic Data Mining and Knowledge Discovery, (May), 99–115. http://doi.org/10.1201/9781420073980.ch5 Bivand, R. (2016). rgrass7: Interface Between GRASS 7 Geographical Information System and R. Retrieved from https://cran.r-project.org/package=rgrass7 Bivand, R., Keitt, T., & Rowlingson, B. (2016). rgdal: Bindings for the Geospatial Data Abstraction Library. Retrieved from https://cran.r-project.org/package=rgdal Bolten, A., & Waldhoff, G. (2010). Error Estimation of Aster Gdem for Regional Applications Comparison To Aster Dem and Als Elevation Models. In 3rd ISDE Digital Earth Summit. Nessebar, Bulgaria. 107 Bowling, E., & Shortridge, A. (2010). A Dynamic Web-based Data Model for Representing Geographic Points with Uncertain Locations. In N. J. Tate & P. F. Fisher (Eds.), International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences (pp. 1–4). Leicester. Brown, J. D., & Heuvelink, G. B. M. (2007). The Data Uncertainty Engine ( DUE ): A software tool for assessing and simulating uncertain environmental variables. Computers & Geosciences, 33, 172–190. http://doi.org/10.1016/j.cageo.2006.06.015 Brown, J. D., & Heuvelink, G. B. M. (2008). Data Uncertainty Engine ( DUE ) User’s Manual. Byrne, J., Heavey, C., & Byrne, P. J. (2010). A review of Web-based simulation and supporting tools. Simulation Modelling Practice and Theory, 18(3), 253–276. http://doi.org/10.1016/j.simpat.2009.09.013 Carlisle, B. H. (2005). Modelling the Spatial Distribution of DEM Error. Transactions in GIS, 9(4), 521–540. Retrieved from http://dx.doi.org/10.1111/j.1467-9671.2005.00233.x Castrignanò, A., Buttafuoco, G., Comolli, R., & Ballabio, C. (2006). Accuracy assessment of digital elevation model using stochastic simulation. In M. Caetano & M. Painho (Eds.), 7th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences (pp. 490–498). Lisbon, Portugal. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.133.5561&rep=rep1&type=pdf Castronova, A. M., Goodall, J. L., & Elag, M. M. (2013). Models as web services using the Open Geospatial Consortium (OGC) Web Processing Service (WPS) standard. Environmental Modelling and Software, 41, 72–83. http://doi.org/10.1016/j.envsoft.2012.11.010 Champine, G. A., Coop, R. D., & Heinselman, R. C. (1980). Distributed Computer Systems: Impact on Management Design and Analysis. Amsterdam: Elsevier Science Inc. Chrisman, N. (1983). The Role of Quality Information in the Long-Term Functioning of a Geographic Information System. Proceedings of the International Symposium on Automated Cartography, 79–88. http://doi.org/10.3138/7146-4332-6J78-0671 Chrisman, N. R. (1991). The error component in spatial data. Geographical Information Systems: Principles and Applications, 165–174. Claessens, L., Heuvelink, G. B. M., Schoorl, J. M., & Veldkamp, A. (2005). DEM resolution effects on shallow landslide hazard and soil redistribution modelling. Earth Surface Processes and Landforms, 30(4), 461–477. http://doi.org/10.1002/esp.1155 Coleman, D. J. (1999). Geographical information systems in networked environments. In P. A. Longley, M. F. Goodchild, D. J. Maguire, & D. W. Rhind (Eds.), Geographical Information Systems: Principles and Applications (2nd ed., Vol. 1, pp. 317–329). New York: John Wiley and Sons. 108 Cressie, N. (1990). The origins of kriging. 
Mathematical Geology, 22(3), 239–252. http://doi.org/10.1007/BF00889887 Crompvoets, J., Bregt, A., Rajabifard, A., & Williamson, I. (2004). Assessing the worldwide developments of national spatial data clearinghouses. International Journal of Geographical Information Science, 18(7), 665–689. http://doi.org/10.1080/13658810410001702030 Crossland, M. D., Wynne, B. E., & Perkins, W. C. (1995). Spatial decision support systems: An overview of technology and a test of efficacy. Decision Support Systems, 14(995), 219–235. http://doi.org/10.1016/0167-9236(94)00018-N Cuartero, a., Polo, M. E., Rodriguez, P. G., Felicisimo, a. M., & Ruiz-Cuetos, J. C. (2014). The Use of Spherical Statistics to Analyze Digital Elevation Models: An Example From LIDAR and ASTER GDEM. IEEE Geoscience and Remote Sensing Letters, 11(7), 1200–1204. http://doi.org/10.1109/LGRS.2013.2288924 Dalle, J., & Jullien, N. (2001). OPEN-SOURCE vs . PROPRIETARY SOFTWARE, (33), 1–16. Retrieved from http://flosshub.org/sites/flosshub.org/files/dalle2.pdf Darnell, A. R., Tate, N. J., & Brunsdon, C. (2008). Improving user assessment of error implications in digital elevation models. Computers, Environment and Urban Systems, 32(4), 268–277. http://doi.org/10.1016/j.compenvurbsys.2008.02.003 de Moel, H., & Aerts, J. C. J. H. (2011). Effect of uncertainty in land use, damage models and inundation depth on flood damage estimates. Natural Hazards, 58(1), 407–425. http://doi.org/10.1007/s11069-010-9675-6 De Moel, H., Asselman, N. E. M., & H. Aerts, J. C. J. (2012). Uncertainty and sensitivity analysis of coastal flood damage estimates in the west of the Netherlands. Natural Hazards and Earth System Science, 12(4), 1045–1058. http://doi.org/10.5194/nhess-12-1045-2012 Devillers, R., Gervais, M., Bédard, Y., & Jeansoulin, R. (2002). 45 Spatial Data Quality: From Metadata To Quality Indicators and Contextual End - User Manual, (March 2002), 21–22. Devillers, R., Stein, A., Bédard, Y., Chrisman, N., Fisher, P., & Shi, W. (2010). Thirty Years of Research on Spatial Data Quality: Achievements, Failures, and Opportunities. Transactions in GIS, 14(4), 387–400. http://doi.org/10.1111/j.1467-9671.2010.01212.x Di, L. (2004a). GeoBrain-A Web Services based Geospatial Knowledge Building System. Proceeding of NASA Earth Science Technology Conference, 2004 June 22 - 24, Palo Alto CA USA, 8. Di, L. (2004b). Distributed Geospatial Information Services-architectures, Standards, and Research Issues. XXth ISPRS Congress Technical Commission II, 187–193. Duckham, M. (2002). A User-Oriented Perspective of Error-sensitive GIS Development. Transactions in GIS, 6(2), 179–193. http://doi.org/10.1111/1467-9671.00104 109 Erdoğan, S. (2010). Modelling the spatial distribution of DEM error with geographically weighted regression: An experimental study. Computers & Geosciences, 36(1), 34–43. http://doi.org/10.1016/j.cageo.2009.06.005 ESRI. (1998). ESRI Shapefile Technical Description. Computational Statistics, 16(July), 370–371. http://doi.org/10.1016/0167-9473(93)90138-J FGDC (1994). The 1994 plan for the national spatial data infrastructure––building the foundation of an information based society. Washington: FGDC. Fisher, P. E., & Tate, N. J. (2006). Causes and consequences of error in digital elevation models. Progress in Physical Geography, 30(4), 467–489. http://doi.org/Doi 10.1191/0309133306pp492ra Fisher, P. (1998). Improved Modeling of Elevation Error with Geostatistics. GeoInformatica, 2(3), 215–233. 
http://doi.org/10.1023/A:1009717704255 Fujisada, H., Urai, M., & Iwasaki, A. (2012). Technical Methodology for ASTER Global DEM. IEEE Transactions on Geoscience and Remote Sensing, 50(10), 3725–3736. http://doi.org/10.1109/TGRS.2012.2187300 Gesch, D., Oimoen, M., Zhang, Z., Meyer, D., & Danielson, J. (2012). Validation of the Aster Global Digital Elevation Model Version 2 Over the Conterminous United States. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XXXIX(September), 281–286. http://doi.org/10.5194/isprsarchives-XXXIX-B4-281-2012 Goodchild, M. F. (1993). Data models and data quality: problems and prospects. Environmental Modeling with GIS. Retrieved from http://www.geog.ucsb.edu/~good/papers/192.pdf Goodchild, M. F., & Gopal, S. (1989). The accuracy of spatial databases. CRC Press. Goodchild, M. F., Shortridge, A. M., & Fohl, P. (1999). Encapsulating Simulation Models With Geospatial Data Sets. Spatial Accuracy Assessment: Land Information Uncertainty in Natural Resources, 1–16. Goodchild, M. F., & Longley, P. A. (1999). The future of GIS and spatial analysis. In P. A. Longley, M. F. Goodchild, D. J. Maguire, & D. W. Rhind (Eds.), Geographical information systems (2nd ed., Vol. 1, pp. 567–580). New York. Retrieved from http://www.geos.ed.ac.uk/~gisteac/gis_book_abridged/files/ch40.pdf Granell, C., Díaz, L., & Gould, M. (2010). Service-oriented applications for environmental models: Reusable geospatial services. Environmental Modelling and Software, 25(2), 182–198. http://doi.org/10.1016/j.envsoft.2009.08.005 110 Granell, C., Díaz, L., Schade, S., Ostländer, N., & Huerta, J. (2013). Enhancing integrated environmental modelling by designing resource-oriented interfaces. Environmental Modelling and Software, 39, 229–246. http://doi.org/10.1016/j.envsoft.2012.04.013 GRASS Development Team. (2016). Geographic Resources Analysis Support System (GRASS GIS) Software, Version 7.0. Retrieved from http://grass.osgeo.org Guptill, S. C. (1999). Metadata and data catalogues. In P. A. Longley, M. F. Goodchild, D. J. Maguire, & D. W. Rhind (Eds.), Geographical Information Systems: Principles and Technical Issues (2nd ed., Vol. 2, pp. 677–692). New York: John Wiley and Sons. Han, W., Di, L., Zhao, P., & Li, X. (2009). Using Ajax for Desktop-like Geospatial Web Application Development. In 2009 17th International Conference on Geoinformatics. Fairfax, VA, USA: IEEE. http://doi.org/https://doi.org/10.1109/GEOINFORMATICS.2009.5293475 Han, W., Di, L., Zhao, P., & Shao, Y. (2012). DEM Explorer: An online interoperable DEM data sharing and analysis system. Environmental Modelling & Software, 38(JUNE 2012), 101–107. http://doi.org/10.1016/j.envsoft.2012.05.015 Han, W., Di, L., Zhao, P., Wei, Y., & Li, X. (2008). Design and implementation of geobrain online analysis system (GeOnAS). Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 5373 LNCS, 27–36. http://doi.org/10.1007/978-3-540-89903-7_4 Hengl, T., Heuvelink, G. B. M., & Van Loon, E. E. (2010). On the uncertainty of stream networks derived from elevation data: The error propagation approach. Hydrology and Earth System Sciences, 14(7), 1153–1165. http://doi.org/10.5194/hess-14-1153-2010 Hengl, T., Heuvelink, G. B. M., & Rossiter, D. G. (2007). About regression-kriging: From equations to case studies. Computers and Geosciences, 33(10), 1301–1315. http://doi.org/10.1016/j.cageo.2007.05.001 Hengl, T., & Reuter, H. (2011). 
How accurate and usable is GDEM? A statistical assessment of GDEM using LiDAR data. Handbook of Quantitative and Theoretical Geography or Advances in Quantitative and Theoretical Geography, (July), 000–046. Retrieved from http://www.geomorphometry.org/HenglReuter2011 Heuvelink, G. B. M., Brown, J. D., & van Loon, E. E. (2007). A probabilistic framework for representing and simulating uncertain environmental variables. International Journal of Geographical Information Science, 21(5), 497–513. http://doi.org/10.1080/13658810601063951 Heuvelink, G. B. M. (2007). ERROR − AWARE GIS AT WORK : REAL − WORLD APPLICATIONS OF THE DATA UNCERTAINTY ENGINE. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 34. 111 Heuvelink, G. B. M. (1996). Identification of field attribute error under different models of spatial variation. International Journal of Geographical Information Systems, 10(8), 921–935. Heuvelink, G. B. M. (1999). Propagation of errors in spatial modelling with GIS. Geographical Information Systems: Principles and Applications, Vol. 1., 3(4), 303–322. http://doi.org/10.1080/02693798908941518 Heuvelink, G. B. M., & Burrough, P. A. (1993). Error propagation in cartographic modelling using Boolean logic and continuous classification. International Journal of Geographical Information Systems, 7(3), 231–246. Heuvelink, G. B. M., & Burrough, P. a. (2002). Developments in statistical approaches to spatial uncertainty and its propagation. International Journal of Geographical Information Science, 16(2), 111–113. http://doi.org/10.1080/13658810110099071 Hijmans, R. J. (2016). raster: Geographic Data Analysis and Modeling. Retrieved from https://cran.r-project.org/package=raster Holmes, K. W., Chadwick, O. A., & Kyriakidis, P. C. (2000). Error in a USGS 30-meter digital elevation model and its impact on terrain modeling. Journal of Hydrology, 233(1–4), 154–173. http://doi.org/10.1016/S0022-1694(00)00229-8 I. N. S. P. I. R. E. Directive (2007). Directive 2007/2/EC of the European Parliament and of the Council of 14 March 2007 establishing an Infrastructure for Spatial Information in the European Community (INSPIRE). Published in the official Journal on the 25th April. Jing, C., Shortridge, A., Lin, S., & Wu, J. (2014). Comparison and validation of SRTM and ASTER GDEM for a subtropical landscape in Southeastern China. International Journal of Digital Earth, 7(12), 969–992. http://doi.org/10.1080/17538947.2013.807307 Karssenberg, D., & De Jong, K. (2005a). Dynamic environmental modelling in GIS: 1. Modelling in three spatial dimensions. International Journal of Geographical Information Science, 19(5), 559–579. http://doi.org/10.1080/13658810500032362 Karssenberg, D., & De Jong, K. (2005b). Dynamic environmental modelling in GIS: 2. Modelling error propagation. International Journal of Geographical Information Science, 19(6), 623–637. http://doi.org/10.1080/13658810500032362 Karssenberg, D., Schmitz, O., Salamon, P., de Jong, K., & Bierkens, M. F. P. (2010). A software framework for construction of process-based stochastic spatio-temporal models and data assimilation. Environmental Modelling and Software, 25(4), 489–502. http://doi.org/10.1016/j.envsoft.2009.10.004 Kim, T. J. (1999). Metadata for geo-spatial data sharing: A comparative analysis. Annals of Regional Science, 33, 171–181. http://doi.org/10.1007/s001680050099 112 Kyriakidis, P. C., Shortridge, A. M., & Goodchild, M. F. (1999). 
Geostatistics for conflation and accuracy assessment of digital elevation models. International Journal of Geographical Information Science, 13(7), 677–707. http://doi.org/10.1080/136588199241067 Li, P., Shi, C., Li, Z., Muller, J.-P., Drummond, J., Li, X., … Liu, J. (2013). Evaluation of ASTER GDEM using GPS benchmarks and SRTM in China. International Journal of Remote Sensing, 34(5), 1744–1771. http://doi.org/10.1080/01431161.2012.726752 Li, X., Di, L., Han, W., Zhao, P., & Dadi, U. (2009). Sharing and reuse of service-based geospatial processing through a Web Processing Service. 2009 17th International Conference on Geoinformatics, Geoinformatics 2009. http://doi.org/10.1109/GEOINFORMATICS.2009.5293431 Li, X., Di, L., Han, W., Zhao, P., & Dadi, U. (2010). Sharing geoscience algorithms in a Web service-oriented environment (GRASS GIS example). Computers and Geosciences, 36(8), 1060–1068. http://doi.org/10.1016/j.cageo.2010.03.004 Li, Z., Yang, C., Huang, Q., Liu, K., Sun, M., & Xia, J. (2014). Building Model as a Service to support geosciences. Computers, Environment and Urban Systems. http://doi.org/10.1016/j.compenvurbsys.2014.06.004 Lim, E.-P., Goh, D. H.-L., Liu, Z., Ng, W.-K., Khoo, C. S.-G., & Higgins, S. E. (2002). G-Portal: a map-based digital library for distributed geospatial and georeferenced resources. … on Digital Libraries, 351–358. http://doi.org/citeulike-article-id:329152 Logsdon, M. G., Bell, E. J., & Westerlund, F. V. (1996). Probability mapping of land use change: A GIS interface for visualizing transition probabilities. Computers, Environment and Urban Systems, 20(6), 389–398. http://doi.org/10.1016/S0198-9715(97)00004-5 Lucchi, R., Millot, M., & Elfers, C. (2008). Resource Oriented Architecture and REST. Assessment of Impact and Advantages on INSPIRE, Ispra: European Communities, 5–13. http://doi.org/10.2788/80035 MacEachren, A. M., Robinson, A., & Hopper, S. (2005). Visualizing geospatial information uncertainty: What we know and what we need to know. Cartography and Geographic Information Science, 32(3), 139–160. http://doi.org/10.1559/1523040054738936 Maguire, D. J., & Longley, P. A. (2005). The emergence of geoportals and their role in spatial data infrastructures. Computers, Environment and Urban Systems, 29(1 SPEC.ISS.), 3–14. http://doi.org/10.1016/j.compenvurbsys.2004.05.012 Mazzetti, P., Nativi, S., & Caron, J. (2009). RESTful implementation of geospatial services for Earth and Space Science applications. International Journal of Digital Earth, 2, 40–61. http://doi.org/10.1080/17538940902866153 Miliaresis, G. C., & Paraschou, C. V. E. (2011). An evaluation of the accuracy of the ASTER GDEM and the role of stack number: a case study of Nisiros Island, Greece. Remote Sensing Letters, 2(2), 127–135. http://doi.org/10.1080/01431161.2010.503667 113 Mockus, A. (2002). Two Case Studies of Open Source Software Development : Apache and Mozilla. Engineering, 11(3), 309–346. Morris, D. E., Oakley, J. E., & Crowe, J. A. (2014). A web-based tool for eliciting probability distributions from experts. Environmental Modelling and Software, 52, 1–4. http://doi.org/10.1016/j.envsoft.2013.10.010 NASA JPL. (2009). ASTER Global Digital Elevation Model [Data set]. NASA JPL. https://doi.org/10.5067/aster/astgtm.002 Nativi, S., Khalsa, S., Domenico, B., Craglia, M., Pearlman, J., Mazzetti, P., & Rew, R. (2011). The brokering approach for earth science cyberinfrastructure. EarthCube White Paper. US NSF [Online]. Nativi, S., Mazzetti, P., & Geller, G. N. (2013). 
Environmental model access and interoperability: The GEO Model Web initiative. Environmental Modelling and Software, 39, 214–228. http://doi.org/10.1016/j.envsoft.2012.03.007 Ni, W., Sun, G., & Ranson, K. J. (2015). Characterization of ASTER GDEM elevation data over vegetated area compared with lidar data. International Journal of Digital Earth, 8(3), 198–211. http://doi.org/10.1080/17538947.2013.861025 Node.js. (2016). How Uber Uses Node.js to Scale Their Business. Retrieved from https://nodejs.org/static/documents/casestudies/Nodejs-at-Uber.pdf OGC. (2015). OGC WPS 2.0 Interface Standard. Open Geospatial Consortium. Retrieved from https://portal.opengeospatial.org/files/14-065 Oksanen, J., & Sarjakoski, T. (2005). Error propagation analysis of DEM‐based drainage basin delineation. International Journal of Remote Sensing, 26(14), 3085–3102. http://doi.org/10.1080/01431160500057947 Oksanen, J., & Sarjakoski, T. (2006). Uncovering the statistical and spatial characteristics of fine toposcale DEM error. International Journal of Geographical Information Science, 20(4), 345–369. http://doi.org/10.1080/13658810500433891 Pebesma, E. J., & Bivand, R. S. (2005). Classes and methods for spatial data in {R}. R News, 5(2), 9–13. Retrieved from https://cran.r-project.org/doc/Rnews/ Pebesma, E., Cornford, D., Nativi, S., & Stasch, C. (2010). The uncertainty enabled model web (UncertWeb). CEUR Workshop Proceedings, 679. Qiu, F., Ni, F., Chastain, B., Huang, H., Zhao, P., Han, W., & Di, L. (2012). GWASS: GRASS web application software system based on the GeoBrain web service. Computers and Geosciences, 47, 143–150. http://doi.org/10.1016/j.cageo.2012.01.023 114 R Core Team. (2016). R: A Language and Environment for Statistical Computing. Vienna, Austria. Retrieved from https://www.r-project.org/ Reuter, H. I., Nelson, a., & Jarvis, a. (2007). An evaluation of void‐filling interpolation methods for SRTM data. International Journal of Geographical Information Science, 21(9), 983–1008. http://doi.org/10.1080/13658810601169899 Reuter, H. I., Nelson, A., Strobl, P., Mehl, W., & Jarvis, A. (2009). A first assessment of Aster GDEM tiles for absolute accuracy, relative accuracy and terrain parameters. IEEE Transactions on Geoscience and Remote Sensing, 240–243. http://doi.org/10.1109/IGARSS.2009.5417688 Rexer, M., & Hirt, C. (2014). Comparison of free high resolution digital elevation data sets (ASTER GDEM2, SRTM v2.1/v4.1) and validation against accurate heights from the Australian National Gravity Database. Australian Journal of Earth Sciences, 61(2), 213–226. http://doi.org/10.1080/08120099.2014.884983 Rinner, C. (2003). Web-based Spatial Decision Support : Status and Research Directions. Journal of Geographic Information and Decision Analysis, 7(1), 14–31. Roman, D., Schade, S., Berre, A. J., Bodsberg, N. R., & Langlois, J. (2009). Model as a Service (MaaS). In AGILE Workshop - Grid Technologies for Geospatial Application. Salgé, F. (1999). National and international data standards. In P. A. Longley, M. F. Goodchild, D. J. Maguire, & D. W. Rhind (Eds.), Geographical Information Systems: Principles and Applications (2nd ed., Vol. 2, pp. 693–706). New York: John Wiley and Sons. San, B. T., & Suzen, M. L. (2005). Digital elevation model (DEM) generation and accuracy assessment from ASTER stereo data. International Journal of Remote Sensing, 26(22), 5013–5027. http://doi.org/10.1080/01431160500177620 Seffino, L. A., Medeiros, C. B., Rocha, J. V., & Yi, B. (1999). 
WOODSS - a spatial decision support system based on workflows. Decision Support Systems, 27(1), 105–123. http://doi.org/10.1016/S0167-9236(99)00039-1 Shekhar, S., Coyle, M., Goyal, B., Liu, D., and Sarkar, S. (1997). Data models in geographic information systems. Communications of the ACM, 40(4), 103–111. Shortridge, A., & Messina, J. (2011). Spatial structure and landscape associations of SRTM error. Remote Sensing of Environment, 115(6), 1576–1587. http://doi.org/10.1016/j.rse.2011.02.017 Shum, A., & Akimov, I. (2015). hashids: Generate Short Unique YouTube-Like IDs (Hashes) from Integers. Retrieved from https://cran.r-project.org/package=hashids Sondheim, M., Gardels, K., & Buehler, K. (1999). GIS interoperability. In P. A. Longley, M. F. Goodchild, D. J. Maguire, & D. W. Rhind (Eds.), Geographical Information Systems (2nd 115 ed., Vol. 1, pp. 347–358). New York: John Wiley and Sons. Retrieved from http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:GIS+interoperability#1 Steiniger, S., & Hunter, A. J. S. (2012). Free and Open Source GIS Software for Building a Spatial Data Infrastructure. Geospatial free and open source software in the 21st century. http://doi.org/10.1007/978-3-642-10595-1_15 Tachikawa, T., Hato, M., Kaku, M., & Iwasaki, A. (2011). CHARACTERISTICS OF ASTER GDEM VERSION 2. Tachikawa, T., Kaku, M., & Iwasaki, A. (2011). ASTER GDEM Version 2 Validation Report. International Geoscience and Remote Sensing Symposium (IGARSS), 1–24. Tait, M. G. (2005). Implementing geoportals: Applications of distributed GIS. Computers, Environment and Urban Systems, 29(1 SPEC.ISS.), 33–47. http://doi.org/10.1016/j.compenvurbsys.2004.05.011 Toutin, T., & Cheng, P. (1999). DEM Generation with ASTER Stereo. EOM Current Issues, 48–51. Retrieved from http://www.eomonline.com/Common/currentissues/June01/thierry.htm Unwin, D. J. (1995). Geographical information systems and the problem of “error and uncertainty.” Progress in Human Geography, 19(4), 549–558. http://doi.org/10.1177/030913259501900408 Urai, M., Tachikawa, T., & Fujisada, H. (2012). Data Acquisition Strategies for Aster Global Dem Generation. ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, I-4(September), 199–202. http://doi.org/10.5194/isprsannals-I-4-199-2012 Uran, O., & Janssen, R. (2003). Why are spatial decision support systems not used? Some experiences from the Netherlands. Computers, Environment and Urban Systems, 27(5), 511–526. http://doi.org/10.1016/S0198-9715(02)00064-9 Van Oort, P. A. J., & Bregt, A. K. (2005). Do users ignore spatial data quality? A decision-theoretic perspective. Risk Analysis, 25(6), 1599–1610. http://doi.org/10.1111/j.1539-6924.2005.00678.x Van Oort, P. (2006). Spatial data quality: from description to application. Publications on Geodesy 60. Retrieved from http://library.wur.nl/WebQuery/wdab/1788022 Verdin, K. L., Godt, J. W., Funk, C., Pedreros, D., Worstell, B., & Verdin, J. (2007). Development of a Global Slope Dataset for Estimation of Landslide Occurrence Resulting from Earthquakes. Colorado: U.S. Geological Survey, Open-File Report, 1188, 25. Veregin, H. (1999). Data quality parameters. Geographical Information Systems: Principles and Applications, Vol. 1., 177–190. Retrieved from http://www.geos.ed.ac.uk/~gisteac/gis_book_abridged/files/ch12.pdf 116 von Krogh, G., & von Hippel, E. (2006). The promise of research on open source software. Management Science, 52(7), 975–983. http://doi.org/10.1287/mnsc.1060.0560 Walker, J. D., & Chapra, S. C. (2014). 
A client-side web application for interactive environmental simulation modeling. Environmental Modelling & Software, 55, 49–60. Wang, R. Y., & Strong, D. M. (1996). Beyond Accuracy : What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4), 5–33. Wechsler, S. P. (1999). A Methodology for Digital Elevation Model (DEM) Uncertainty Evaluation: The Effect DEM Uncertainty on Topographic Parameters. ESRI Proceedings 1999. Retrieved from http://downloads2.esri.com/campus/uploads/library/pdfs/6283.pdf Wechsler, S. P., & Kroll, C. N. (2006). Quantifying DEM Uncertainty and its Effect on Topographic Parameters. Photogrammetric Engineering & Remote Sensing, 72(9), 1081–1090. http://doi.org/10.14358/PERS.72.9.1081 Welch, R., Jordan, T., Lang, H., & Murakami, H. (1998). ASTER as a source for topographic data in the late 1990’s. IEEE Transactions on Geoscience and Remote Sensing, 36(4), 1282–1289. http://doi.org/10.1109/36.701078 Wu, S., Li, J., & Huang, G. H. (2008). Characterization and evaluation of elevation data uncertainty in water resources modeling with GIS. Water Resources Management, 22(8), 959–972. http://doi.org/10.1007/s11269-007-9204-x Yamaguchi, Y., Kahle, A. B., Tsu, H., Kawakami, T., & Pniel, M. (1998). Overview of advanced spaceborne thermal emission and reflection radiometer (ASTER). IEEE Transactions on Geoscience and Remote Sensing, 36(4), 1062–1071. http://doi.org/10.1109/36.700991 Yue, P., Baumann, P., Bugbee, K., & Jiang, L. (2015). Towards intelligent GIServices. Earth Science Informatics, 8(3), 463–481. http://doi.org/10.1007/s12145-015-0229-z Zachman, J. A. (1987). A framework for information systems architecture. IBM Systems Journal. http://doi.org/10.1147/sj.263.0276 Zachman, J. A. (1997). Enterprise architecture: The issue of the century. Database Programming & Design, 44+. Zhao, G., Xue, H., & Ling, F. (2010). Assessment of ASTER GDEM performance by comparing with SRTM and ICESat/GLAS data in Central China. 2010 18th International Conference on Geoinformatics, (40801186), 1–5. http://doi.org/10.1109/GEOINFORMATICS.2010.5567970 Zhao, P., Di, L., Han, W., & Li, X. (2012). Building a web-services based geospatial online analysis system. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(6), 1780–1792. http://doi.org/10.1109/JSTARS.2012.2197372 117