ACCESS TO HOSPITALS IN A REGULATED HEALTH CARE SYSTEM:
IMPLICATIONS FOR UTILIZATION
By
Paul L. Delamater

A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
DOCTOR OF PHILOSOPHY
Geography
2012

ABSTRACT
Hospital use varies among populations due to access, socio-demographic characteristics, and
overall health care needs. Further, the interaction between populations and health care
providers is often mediated by governing bodies, such as Certificate of Need programs, that
regulate the supply of health care resources. The intersection of the spatial and aspatial
components of access and utilization within a regulated health care market provides the
backdrop for this research. The outcomes provide insights that inform future health services
research and offer guidance for public policy initiatives. The research approach adopted in
this dissertation addresses both methodological and theoretical issues related to the study
of access and utilization and the nature of the relationship between them. The project
comprises three sequential studies tied together within the framework of assessing access
and utilization in a regulated health system.
The first study examines methods used to measure distance among locations. Specifically,
the work addresses the theoretical and applied implications of using raster and network
data models for identifying areas with limited geographic accessibility. The findings suggest
that the network data model provides a more accurate framework for estimating vehicular
travel time along roadways, while the raster data model offers advantages in scenarios where
roadways are not the primary route of travel. The second study offers a methodology for
clustering spatial observations having multiple attribute values. The specific focus of the
work is the formation of Hospital Groups, the allocation units used in a state-level
methodology for predicting future hospital bed demand. The main outcome of the research is the
methodology itself, which provides a substantial advance over the previous methodologies
used in health services research by way of its ability to cluster observations based on overall
patterns of health care utilization and geographic location, simultaneously. Using knowledge
gained from the first two studies, the final portion of the dissertation explores the
relationship between the availability of hospital beds and the utilization of hospital services. The
focus of the study is Roemer’s Law, which states that a hospital bed built is a bed filled. The
findings of this study provide strong support for the concept that greater levels of hospital
bed availability lead to higher hospital utilization rates. This relationship is confirmed at
various levels of data aggregation, demonstrating that the observed impact of availability on
utilization is stable across geographic scales of analysis.
The main outcomes of this research can be separated into those relating to advancement in
health services research and those relating to public policy. From a public policy perspective,
this dissertation offers updated methodologies for identifying areas with limited geographic
accessibility and grouping health-based observations. In addition, the final study finds strong
evidence of the effects of Roemer’s Law, thus providing support for the continued regulation
of hospital bed availability. This dissertation also contributes significant new knowledge
to the field of health services research. The specific salient outcomes include: detailing
both the theoretical and applied differences between the raster and network data models for
estimating travel time among locations, offering a methodology that simultaneously clusters
observations based on comprehensive patterns of utilization and geographic location, and
producing compelling, robust evidence that hospital availability has a positive, significant
relationship with hospital utilization rates.

Copyright by
PAUL LARRY DELAMATER
2012

ACKNOWLEDGEMENTS
A number of people provided support that was invaluable
in completing this dissertation. Most importantly, I would like to thank Dr. Joseph Messina
for his willingness to continue as my academic advisor after my time away from the program
and for his seemingly unwavering belief that I would finish this work. A special amount of
gratitude goes to my committee members for helping me to expand my understanding of
health and medical geography and pushing me to fully realize the goals of my research. I
commend them for the amount of time they spent fielding hastily written emails, reading
and commenting on rough manuscript drafts, and listening intently while I fired numerous
questions and ideas toward them. I would like to thank my parents for their encouragement
and support. Finally, I want to thank Jeni Lee for her strength and understanding throughout
the process of completing this research. I definitely could not have finished without her
support, and I may not have even attempted it without her encouragement.


TABLE OF CONTENTS

List of Tables . . . ix
List of Figures . . . x
Introduction . . . 1

Study #1
Measuring geographic access to health care: raster and network-based methods . . . 12
  2.1 Abstract . . . 12
  2.2 Background . . . 13
    2.2.1 Access and geographic accessibility . . . 15
    2.2.2 Data models . . . 17
  2.3 Case study . . . 21
  2.4 Data and methods . . . 22
    2.4.1 Roads data . . . 22
      2.4.1.1 Speed limit classification . . . 23
      2.4.1.2 Road hierarchy . . . 24
      2.4.1.3 Network comparison . . . 26
    2.4.2 Population and hospital data . . . 28
    2.4.3 Raster-based method . . . 30
    2.4.4 Network-based method . . . 31
    2.4.5 Sensitivity . . . 32
  2.5 Results . . . 33
    2.5.1 Underserved areas . . . 33
    2.5.2 Limited Access Areas . . . 34
    2.5.3 Sensitivity . . . 37
      2.5.3.1 Speed limits . . . 37
      2.5.3.2 Population representation . . . 37
  2.6 Discussion . . . 38
  2.7 Conclusions . . . 47

Study #2
Regional health care planning: a methodology to cluster facilities using community utilization patterns . . . 50
  3.1 Abstract . . . 50
  3.2 Background . . . 51
    3.2.1 Clustering . . . 54
    3.2.2 Case study . . . 55
      3.2.2.1 Unreliable measure of hospital similarity . . . 57
      3.2.2.2 Poorly defined home areal units . . . 59
      3.2.2.3 Subjective modification by expert panel . . . 61
      3.2.2.4 New methodology to cluster hospitals . . . 62
  3.3 Methods . . . 63
    3.3.1 Overview . . . 63
    3.3.2 Input data . . . 63
    3.3.3 Clustering algorithm . . . 65
    3.3.4 Determining the number of Hospital Groups . . . 66
    3.3.5 New hospital assignment . . . 69
  3.4 Results . . . 70
  3.5 Discussion . . . 75
  3.6 Conclusions . . . 78

Study #3
Do more hospital beds lead to higher hospitalization rates? A spatial examination of Roemer’s Law . . . 80
  4.1 Abstract . . . 80
  4.2 Introduction . . . 81
  4.3 Materials and Methods . . . 83
    4.3.1 Research design . . . 83
    4.3.2 Case study . . . 86
    4.3.3 Population data . . . 88
    4.3.4 Travel time . . . 88
    4.3.5 Ethics statement . . . 89
    4.3.6 Hospital utilization . . . 89
    4.3.7 Spatial accessibility . . . 91
    4.3.8 Clustering methodology . . . 94
    4.3.9 Methods to remove multicollinearity . . . 97
      4.3.9.1 Principal components analysis . . . 97
      4.3.9.2 Bivariate regressions . . . 101
      4.3.9.3 Test variance inflation factor . . . 101
    4.3.10 Regression models . . . 103
  4.4 Results . . . 104
  4.5 Discussion . . . 107
  4.6 Limitations . . . 110
  4.7 Conclusions . . . 111

Conclusions . . . 113
  5.1 Overall contributions . . . 113
  5.2 Future research . . . 116
    5.2.1 Geographic accessibility . . . 116
    5.2.2 Clustering health care observations . . . 118
    5.2.3 Roemer’s Law . . . 120
    5.2.4 Spatial structure . . . 123
    5.2.5 Health insurance . . . 124
    5.2.6 Spatial accessibility . . . 124
    5.2.7 Access and utilization in a regulated health care system . . . 125

Appendices . . . 127
  A: R code to implement the Thomas Methodology . . . 128
  B: R code to implement the new clustering methodology . . . 141
  C: Testimony - Blue Cross Blue Shield of Michigan/Blue Care Network . . . 154
  D: Additional Figures and Tables . . . 157
  E: Additional R Code . . . 174

References . . . 236

LIST OF TABLES

1 Travel speeds (miles per hour, mph) used in custom-built network datasets . . . 27
2 Turn delays (seconds) used in custom-built network datasets . . . 27
3 Mean difference in travel time and road distance between Google Maps and custom-built networks in shortest path analysis . . . 28
4 Comparison of underserved areas . . . 34
5 Comparison of Limited Access Areas . . . 35
6 Comparison of underserved areas and LAAs identified with speed limits assigned to roads . . . 38
7 Comparison of results from block centroid population assignment method with original travel speed settings . . . 38
8 Michigan roads by travel speed . . . 41
9 Initial candidate solutions . . . 73
10 Attribute variable set . . . 102
11 Coefficient statistics . . . 106
D.1 Cluster solutions and incF scores . . . 162
D.2 Number of components and % of the total variance explained for each functional set of variables . . . 165

LIST OF FIGURES

1 The triangle of human ecology . . . 6
2 Research design . . . 8
3 A) Network data model and B) Cost example . . . 18
4 A) Raster data model and B) Cost example . . . 20
5 Conversion of vector road data to raster cells . . . 21
6 Hierarchical classification system for speed limits . . . 25
7 Travel time estimates from custom-built networks compared with travel time estimated from Google Maps . . . 29
8 Example of raster filter . . . 31
9 Service areas (and resulting underserved areas) produced by network-based method . . . 32
10 Underserved areas . . . 35
11 Example of the similarities and differences between network and raster-based underserved areas . . . 36
12 Limited Access Areas . . . 37
13 Limited Access Areas with block population assignment method . . . 39
14 Conversion of vector roads data to raster data format with slowest route rule . . . 42
15 Service area delineation in areas where no roads are present . . . 44
16 Michigan’s current Subareas . . . 53
17 Subareas produced by the Thomas Methodology using current data . . . 58
18 RI and CI values for two hospitals of different sizes . . . 60
19 Local minima and random starting locations with the K-means algorithm . . . 67
20 Initial candidate solutions for Hospital Groups . . . 71
21 Hospital Groups created using new clustering methodology . . . 72
22 Population distribution and hospital locations in Michigan . . . 87
23 Age adjusted hospital utilization (U_STD) and bed distribution (Av, E2SFCA) in Michigan . . . 89
24 Distance decay of hospital utilization in Michigan . . . 93
25 Zip Code clusters . . . 96
26 Standardized coefficients for weighted SAR models . . . 105
27 Example of rezoned region . . . 111
D.1 incF scores for cluster solutions in set S . . . 157
D.2 Moran’s I of regression residuals for weighted OLS regression model . . . 158
D.3 Moran’s I of regression residuals for weighted SAR and CAR models . . . 159
D.4 Levene Test of regression residuals for SAR and CAR models . . . 160
D.5 Levene Test of regression residuals for weighted SAR and CAR models . . . 161

Introduction
The United States health care system is decentralized and fragmented, and it has grown
increasingly expensive over the last 40 years (Kaiser Family Foundation, 2009). Although
some states mediate the availability of health care services through Certificate of Need (CON)
programs, the US health care system has generally followed “market” forces in its evolution,
resulting in an inequitable distribution of resources (Angell, 2008; Kuttner, 2008) and
disparities in access to health care services. Concurrently, the costs of health care services have
risen dramatically in recent years, including an increased burden of out-of-pocket costs
placed on consumers (Cunningham, 2010; Wennberg, 2005). Despite spending more money
on health care than any other country in the world, the US lags far behind the leaders in numerous
measures of public health outcomes (Murray and Frenk, 2010). Furthermore, the increased
commercialization and profit-maximizing behavior of health care providers have resulted in
a distorted allocation of services and escalating costs (Kuttner, 2008).
Access to health care services can be defined as the ability to secure appropriate and
effective health care services in a timely manner. It is well understood that access arises from
a combination of both spatial and aspatial factors. In addition, utilization of health care
services varies among populations and is dependent upon both access and factors unrelated
to access such as overall health needs, socio-demographic characteristics, and perceptions
of the health care system (Andersen and Newman, 1973). Understanding how access to
services affects service utilization and health outcomes has been identified as being of great
importance in health services research (Higgs, 2009).
In the US, both researchers and the popular media have largely emphasized
the financial barriers in access to health care due to the prohibitive costs of health
care services and significant uninsured or underinsured populations. Past research has shown
that a lack of health insurance is associated with less service utilization and worse health
outcomes (Freeman et al., 2008). Other studies have explored the spatial aspects of access
to health care services, indicating that a large number of people in the US have limited
geographic access to services such as emergency departments (Carr et al., 2009), specialty
physicians (Rosenthal et al., 2005), and cancer care (Onega et al., 2008). Another large
body of research, most notably by Wennberg and colleagues in the Dartmouth Atlas of
Health Care, has explored small area variations in health care spending (Fisher et al., 2003),
utilization (Wennberg, 2005), and outcomes (Welch et al., 2011), exposing disparities that
exist throughout the US.
Although health care delivery has shifted increasingly towards profitability (Kuttner,
2008), health care planning and regulation in the US have generally attempted to achieve two
broad goals: 1) promote public health by ensuring that the supply of services meets the
population’s needs and 2) contain health care costs by regulating the supply of services to
a level congruent with the need of the population. Regulation is often enforced through
state-based Certificate of Need (CON) programs. The primary goals of CON programs are
to contain health care costs by limiting the supply of health care services to only those needed
by the population and to achieve equal access to health care (McGinley, 1995). Passage of
the National Health Planning Act of 1974 required states to implement CON programs to
receive federal funding for certain programs such as Medicare and Medicaid. However, this
act was repealed in 1986 amid concerns that it had failed to achieve its goal of reducing
overall health care costs (Finn, 2007). Although the merits of CON programs have been questioned over the
past 40 years (see US Federal Trade Commission, 2004; Rivers et al., 2007; Ferrier et al.,
2010) and the programs are no longer federally mandated, 35 states currently employ some form of
CON program (National Conference of State Legislatures, 2011).
A number of states continue to regulate the supply of acute care hospitals, inpatient
hospital beds, and hospital services through CON programs (Langley et al., 2010). Given
that the plurality of overall health care expenditure in the US is for inpatient hospital care
(Kaiser Family Foundation, 2009), hospitalizations, and thus hospitals, are logical candidates
for cost control measures. The high costs of inpatient hospitalizations, in conjunction with
the generally accepted implications of Roemer’s Law (Shain and Roemer, 1959; Roemer,
1961), that “a bed built is a bed filled,” serve as the current justification for continued regulation of
hospital-based resources through CON programs.
Theoretically, access and utilization should have a direct relationship with each other
considering that access measures the “potential” to utilize services (Aday and Andersen,
1974). The study of this relationship has a long history in health and medical geography
and health services research. As noted by Hunter et al. (1986), Jarvis’ study from the
mid-19th century considered the effects of distance on admissions to mental health hospitals.
Jarvis noted that the number of people from a given area admitted to a mental hospital
declined with increasing distance from the hospital, postulating that this effect was not due
to an abundance of people with mental health problems near the facilities. These ideas gave
rise to Jarvis’ Law, which states that health care utilization decreases with increasing distance from the
location of the service. Additionally, the previously mentioned Roemer’s Law was proposed
in the late 1950s, defining the relationship between hospital bed availability and hospital
utilization. Although only these two are mentioned here, both demonstrate historical attempts by
researchers to understand how access-related factors affect health services use.
More recent research has provided contrary or inconclusive findings regarding the
direct relationship between access and utilization (e.g., Goodman et al., 1997; Wright and
Ricketts III, 2010). As the understanding of spatial structure and spatial processes in health
services research has progressed, shortcomings of previous research have been exposed. However,
in spite of improved knowledge and methodological capabilities, the intertwined spatial and
aspatial components of access and utilization make characterizations of this relationship
extremely difficult. In addition, factors such as clinical practice variation among areas
(Wennberg, 2005) and supply-induced demand complicate research efforts. Hence, few
studies have linked access and utilization together in a comprehensive and coherent framework
acknowledging the spatial and aspatial components of each. As a result, simply stated,
current health services research lacks a clear understanding of how access affects the volume of
utilization or where people seek care.

The primary goal of this research is to provide a more complete understanding of how
access to hospitals impacts hospital utilization. However, given the complexity of this issue,
the research approach adopted in this dissertation addresses not only the nature of the
relationship between access and utilization, but also methodological and theoretical issues
related to the study of this relationship. I explore access and utilization of hospitals in
Michigan, a health care system that has been under CON regulation for 40 years. Michigan
serves as an excellent case study for this work due to 1) a physical landscape with two
separate peninsulas that complicate traditional distance measurements, 2) a large variation
in regional population density (both urban and rural areas) and hospital availability, allowing
for access and utilization to be examined over a wide range of settings, and 3) an overall
system of hospitals that has been, historically, relatively stable due to CON regulation.
Michigan implemented a CON program in 1972, making it one of the longest-tenured,
currently active programs in the US (Finn, 2007). As part of its overall CON program,
Michigan regulates the availability of hospital beds such that any hospital wanting to add
licensed beds to its facility, relocate its existing facility (more than 2 miles from the
existing facility), or construct a new facility must file a CON application and demonstrate a
population need for the additional beds (Michigan Department of Community Health, 2009).
The state implements a bed need methodology to predict future population demand for acute
care hospital beds (Langley et al., 2010), thus providing hospitals the necessary information
for CON applications. In addition, Michigan identifies regions in the state with limited
geographic access to acute care hospitals (Messina et al., 2006), offering little resistance
to hospitals or hospital systems expanding into these regions (Michigan Department of
Community Health, 2009).
Many of the ideas that ultimately led to this research were formed, in part, while I
attended various meetings with members of Michigan’s Department of Community Health,
the Michigan CON Commission, and academics formerly involved with Michigan’s CON
program. However, the experience of working in a scientific advisory role for the most recent
hospital bed CON Committee proved to be the most influential. This committee included
various stakeholders from Michigan’s hospitals, hospital systems, and insurance companies
who were assembled to review the state’s hospital bed standards in the spring of 2011.
Throughout six months of meetings, the need for information that would inform not only
health services researchers, but also health care providers and policy makers became
apparent. The most obvious need was for a better understanding of the spatial aspects of health
care access and utilization, especially as they relate to health care policy. Paul-Shaheen
and Carpenter (1982) noted that there are no purely technical answers in health policy; hence,
this dissertation not only explores issues related to health services, but also acknowledges
the intertwined nature of health research, policy, and regulation, and aims to provide
original, robust, and usable findings.
The triangle of human ecology provides a useful conceptual framework for the study of
overall population health (see Figure 1(A); Meade and Emch, 2010). The state of health,
found in the center of the triangle, is influenced by population, habitat, and behavior and
interactions among these characteristics. Each of these broad characteristics comprises three
sub-characteristics. In Figure 1(B), the portion of the triangle explored in this research is
illustrated. In this framework, hospital utilization is considered a behavior, not a health
outcome or description of the state of health. Because hospitalization is used in an attempt
to restore health in cases of illness or injury, a state of compromised health can be assumed.
However, because the health outcomes associated with hospitalization are not assessed within
this work, considering hospital utilization as a state of health is not justified. To explore the
spatial aspects of the meta-relationship between access and hospital utilization, characteristics
of the population and their built habitat are considered, most notably the interaction among
population location, the transportation infrastructure, and hospital location.
The overall research project includes three sequential studies in which I explore spatial
accessibility characterization, health care utilization patterns, and the relationship between
access and utilization. The specific outcomes of these studies are 1) the development
of a system for measuring distance and travel time among locations, 2) a methodology for
comparing and clustering observations based on utilization patterns and location, and 3) an
assessment of the relationship between access and utilization of hospital services. Because the
measurement of both access and utilization patterns is heavily dependent upon the ability
to accurately characterize and measure distance and spatial relationships, studies 1 and 2
focus on methodological problems regarding the characterization of location, distance, and
spatial patterns in health services research. The final study uses this knowledge to explore
the relationship between access to hospitals and hospital utilization in Michigan. Figure 2
shows the interconnected nature of these studies as they pertain to the overall research goal.

Figure 1: The triangle of human ecology. A) the original (Meade and Emch, 2010,
redrafted by Paul L. Delamater) and B) the portion of the overall triangle that is explored
in this research. For interpretation of the references to color in this and all other figures, the
reader is referred to the electronic version of this dissertation.
[Diagram labels: Population (gender, age, genetics); Habitat (natural, social, built);
Behavior (beliefs, social organization, technology); State of Health at the center; panel B
places hospitalization under Behavior.]

Because each study was submitted independently for publication, each contains the
sections (e.g., Abstract, Introduction, Conclusions, etc.) required for submission. In this
dissertation, each study is presented as an individual chapter with its original formatting
intact. These chapters are written in the first-person plural point of view. While I conceived
and performed the research and drafted the resulting manuscripts, my committee provided
helpful guidance throughout the research process and manuscript preparation, and thus they
were included as co-authors of the submitted versions.
In the following paragraphs, I provide a short summary of the background and aims
of each individual study. Then, each study is presented as a stand-alone chapter. In the
Conclusions, the work is synthesized within the framework of assessing access and utilization
within a health care system regulated by a CON program. The outcomes provide insights
that inform future health services research, while also offering guidance for future public
policy initiatives.
Study 1: Measuring geographic access to health care: raster and networkbased methods. Traditional measures of geographic accessibility to services have been
replaced with more elaborate gravity-based metrics that incorporate the distance, supply,
and potential demand (Ngui and Apparicio, 2011), thus integrating accessibility and availability into one comprehensive measure (e.g., the 2 Step Floating Catchment Area (2SFCA),

7

Study 1
Distance
measurement

Spatial access

Utilization
patterns

Aspatial access
Clustering
method

Utilization
Volume

Study 2

Other
Determinants
of utilization

U = β1Av + β2Ins + β3X

Study 3

Inference from
coefficients

Figure 2: Research design

8

Stability of
Coefficients
(MAUP)

Luo and Wang, 2003). Because these metrics are calculated with limited or no actual utilization data, they rely heavily on accurate measures of distance among locations. Past
studies have regularly measured distance using a straight line (Euclidean), not accounting
for true connectivity or travel impedances between locations (Jones et al., 2010; Martin
et al., 2002). More recently, advances in data availability and processing capability have
provided researchers the ability to incorporate both road connectivity (road distance) and
travel speed (travel time) in their characterizations of distance. Although these measures are
generally accepted to be more accurate representations of the friction of distance or travel
burden between locations (see Pedigo and Odoi, 2010; Shahid et al., 2009), there remains
uncertainty regarding their implementation. Most importantly, because speed limits are
often not included in roads data, travel speed must be estimated based on available road
attribute data. Concurrently, both raster and vector-based (network) methods are available
to researchers for calculating travel time and distance along a road network.
This study compares raster and network-based methods of calculating travel time between locations. The speciﬁc case study focuses on the identiﬁcation of regions with limited
geographic accessibility to Michigan’s hospitals. I develop a speed limit classiﬁcation system
based on road attribute data and explore its robustness by comparing network travel time
estimates with those from Google Maps. Thirty minute travel time buﬀers (service areas)
are constructed around each Michigan hospital using both raster and network-based methods. Underserved areas are identiﬁed as those falling outside of the travel time buﬀers. To
understand each method’s sensitivity to speed limit settings, the speed limit classiﬁcations
are modiﬁed and changes in the resulting underserved areas are compared.
Study 2: Regional health care planning: a methodology to cluster facilities
using community utilization patterns. In health services research, the methods used
to create small (geographic) areas have been given little attention (Shwartz et al., 2001).
Small areas are constructed by combining disaggregated population units based on some
level of shared similarity among units. Wennberg and Gittelsohn (1973) offered a method
to create hospital service areas by aggregating population units to a single facility (or two
near-by facilities) based on a simple plurality rule of utilization, noting that over 85% of care
was delivered by hospitals within the service area boundaries. However, this method often
requires manual adjustment to ensure contiguity (e.g., Klauss et al., 2005) and is problematic
in urban areas where service use is distributed similarly to many near-by facilities (Thomas
et al., 1981). Another set of methods, based on distance from each facility (e.g., Garnick et al.,
1987; Schuurman et al., 2006), rely heavily upon the assumption that bypass of the nearest
facility will be minimal. Although these approaches have scientiﬁc merit, they assume that
each population unit should be tied to a speciﬁc hospital (or set of hospitals). Additionally,
by ignoring where the rest of the population seeks care, they do not account for the overall
patient utilization patterns and are, thus, incomplete in their comparison.
I provide a methodology to cluster observations based on their overall utilization patterns and geographic location. Speciﬁcally, this methodology is used to group Michigan’s
acute care hospitals into “Hospital Groups.” The limitations in Michigan’s current method
to group hospitals (Thomas et al., 1981) are identified, underscoring the need for a
methodology that emphasizes overall patterns of utilization, not hospital-based competition.
Along with the clustering methodology, I supply a heuristic that assists in determining the
appropriate number of clusters in the data, a common diﬃculty in clustering applications
(Jain, 2010). Although the speciﬁc case study proposed focuses on grouping hospitals, the
theoretical underpinnings are such that the methodology can be used to group any set of
spatial observations with multiple attributes. Importantly, it can be used to create health
service areas while addressing limitations found in the plurality and distance-based methods.
Study 3: Do more hospital beds lead to higher hospitalization rates? A spatial examination of Roemer’s Law. Roemer’s Law defines a positive relationship
between the availability of hospital beds and the use of hospital services. Past research has
provided support for Roemer’s Law (e.g., Ginsburg and Koretz, 1983; Harris, 1975; Kroneman
and Siegers, 2004; Pasley et al., 1995; Shwartz et al., 2011; Wennberg, 2005); however,
other studies have found conﬂicting results (e.g., Alexander et al., 1999; Rohrer, 1990; van
Doorslaer and van Vliet, 1989) or inconclusive results (e.g., Clark, 1990). The intertwined
relationships among population health, access, use of health care services, and outcomes
provide a number of research dilemmas, both theoretically and methodologically. Perhaps
the most difficult is defining and characterizing the availability of hospital beds. Although
counting the number of beds in a hospital is trivial, measuring the overall availability of those
beds to a population is a much more complex task related to distance, demand, and access-related factors. Unsophisticated measures of hospital bed availability, such as container-based methods or simple distance measures (Joseph and Phillips, 1984; Guagliardo, 2004),
ignore the multifaceted nature of access and the spatial and geographic nature of health care
service use. Others have noted that the observed eﬀects of Roemer’s Law may be due to
oversimpliﬁed methods used to assign hospital beds to regions (Folland and Stano, 1990).
In addition, statistical methods that do not incorporate spatial structure in the relationship
between availability and utilization are at risk of being biased due to the eﬀects of spatial
autocorrelation.
This study explores the relationship between access and utilization of hospital services
using an ecological research design that integrates individual behavioral models of health
care utilization in an explicitly spatial context. I characterize both the spatial and aspatial components of access while also controlling for other determinants of hospitalization
throughout my theoretical and applied models. As a result, the eﬀects of Roemer’s Law can
be identiﬁed and quantiﬁed. The ecological study design implemented in this study necessitates that the relationship between access and utilization be explored at varying scales of
analysis to examine the eﬀects of the Modiﬁable Areal Unit Problem (MAUP) (Openshaw,
1984; Fotheringham and Wong, 1991). Therefore, I explore the stability of the relationship
by performing the analysis at numerous levels of data aggregation.

Measuring geographic access to health care: raster and
network-based methods
2.1 Abstract

Background: Inequalities in geographic access to health care result from the conﬁguration of facilities, population distribution, and the transportation infrastructure. In recent
accessibility studies, the traditional distance measure (Euclidean) has been replaced with
more plausible measures such as travel distance or time. Both network and raster-based
methods are often utilized for estimating travel time in a Geographic Information System.
Therefore, exploring the diﬀerences in the underlying data models and associated methods
and their impact on geographic accessibility estimates is warranted. Methods: We examine
the assumptions present in population-based travel time models. Conceptual and practical
diﬀerences between raster and network data models are reviewed, along with methodological implications for service area estimates. Our case study investigates Limited Access
Areas deﬁned by Michigan’s Certiﬁcate of Need (CON) Program. Geographic accessibility
is calculated by identifying the number of people residing more than 30 minutes from an
acute care hospital. Both network and raster-based methods are implemented and their
results are compared. We also examine sensitivity to changes in travel speed settings and
population assignment. Results: In both methods, the areas identiﬁed as having limited
accessibility were similar in their location, conﬁguration, and shape. However, the number
of people identiﬁed as having limited accessibility varied substantially between methods.
Over all permutations, the raster-based method identiﬁed more area and people with limited accessibility. The raster-based method was more sensitive to travel speed settings, while
the network-based method was more sensitive to the speciﬁc population assignment method
employed in Michigan. Conclusions: Differences between the underlying data models help
to explain the variation in results between raster and network-based methods. Considering
that the choice of data model/method may substantially alter the outcomes of a geographic
accessibility analysis, we advise researchers to use caution in model selection. For policy, we
recommend that Michigan adopt the network-based method or reevaluate the travel speed
assignment rule in the raster-based method. Additionally, we recommend that the state
revisit the population assignment method. Keywords: Health care access, geographic accessibility, limited access areas, underserved populations, health services.

Submission information: Submitted to the International Journal of Health Geographics on January 13, 2012. Accepted on April 10, 2012. Published on May 15, 2012:
Volume 11, Issue 15. Authors on manuscript: Paul L. Delamater, Joseph P. Messina, Ashton M. Shortridge, Sue C Grady.

2.2 Background

Disparities in the geographic accessibility of health care services arise due to the manner
in which people and facilities are arranged spatially. Speciﬁcally, health care services are
provided at a ﬁnite number of ﬁxed locations, yet they serve populations that are continuously and unevenly distributed throughout a region (Joseph and Phillips, 1984). Although
inequalities in accessibility are inevitable due to this conﬁguration, the extent to which they
manifest is a product of the unique spatial arrangement of the health care delivery system,
the location and distribution of the population within a region, and the characteristics of
the transportation infrastructure. Of particular concern are scenarios that result in large
distances between people and health care facilities. These populations experience greater
diﬃculty in gaining access due to increased travel times, often coupled with poor transportation infrastructure and a lack of public transportation options (Arcury et al., 2005).
The spatial or geographic dimensions of access have received considerable attention from
planners and researchers for many years (Cromley and McLaﬀerty, 2002). Referred to as
spatial accessibility (Guagliardo, 2004), the spatial dimensions of access include accessibility
and availability of services. Accessibility (or geographic accessibility) is a measure of the
“friction of distance” or “burden of travel” between locations, whereas availability generally
measures the number of services in comparison to the number of potential users of the service.
Identifying areas with limited spatial accessibility of health care services allows planners to
understand the eﬀects of opening, closing, or relocating health care facilities or modifying
the services oﬀered by existing facilities (McGuirk and Porell, 1984). Thus, accurate and
detailed representations of spatial accessibility are imperative to describe and understand
the overall access picture.
Changing technology and the availability of detailed spatial data have allowed for the
representation of geographic accessibility in a GIS to more closely resemble the real-world
phenomena of travel. Early studies acknowledged that the travel costs among locations were
more complex than those provided by straight-line (Euclidean) distance measures (see Shannon et al., 1973), yet this particular representation of geographic accessibility has been the
most widely used in past health services research (McLaﬀerty, 2003). Although Euclidean
distance has been shown to be correlated with travel time (Apparicio et al., 2008; Haynes et al.,
2006; Phibbs and Luft, 1995), it does not incorporate topological structures or the transportation infrastructure (Jones et al., 2010), both of which are likely to influence travel
time. As computational power and data collection/storage capabilities have improved, more
detailed representations of geographic accessibility have emerged, incorporating the transportation infrastructure (e.g., roads → travel distance), travel impedance (e.g., speed limits
→ travel time), and various modes of travel (public transportation → travel time).
The ﬂexibility provided by GIS allows for multiple data representations of the same realworld phenomena. Speciﬁcally, travel costs can be represented using a ﬁeld-based model
(raster) or an object-based model (vector). The vector data model can also be extended to
incorporate network or graph features and is referred to as a “network” data model. Whereas
a raster vs. vector debate in regards to spatial data representation and analysis in GIS has
been present for many years in the GIS and Geography literature (see Couclelis, 1992; van
Bemmelen et al., 1993; Goodchild et al., 2007), the issues have not been fully explored in
health services research. Considering the importance placed on the role of distance and travel
in health care accessibility studies, we believe that an examination of the data models and
methods is warranted. Thus, the purpose of this paper is to compare geographic accessibility
measured as travel time using both raster and network (vector) based models of spatial
data representation. We aim to illuminate both the conceptual and practical diﬀerences
between models and their methodological implications in measuring geographic accessibility.
Speciﬁcally, we address the following questions over the course of this manuscript:
• What are the basic assumptions when constructing a conceptual model of travel?
• What are the speciﬁc abstractions in the raster and network representational models
of travel in a GIS?
• What are the similarities and diﬀerences in results between data models?
• How do the underlying diﬀerences in data models aﬀect the results?
The manuscript is organized as follows. First, we oﬀer a short review of access and
geographic accessibility. Next, the spatial data models and methods used to calculate travel
costs are summarized. In the following section, we describe our case study and report on
the speciﬁc data and methods used in analysis. Next, we report our results and discuss
the similarities and diﬀerences between methods. Lastly, we discuss the implications of our
ﬁndings for measuring geographic accessibility.
2.2.1 Access and geographic accessibility

Access to health care is a multifaceted and complex concept, dependent upon the characteristics of both the population in need of services and the health care delivery system (Aday
and Andersen, 1974). Penchansky and Thomas (1981) identiﬁed ﬁve distinct dimensions of
access which were classiﬁed by Khan (1992) into spatial components (accessibility and availability) and aspatial components (aﬀordability, accommodation, and acceptability). Access
to health care can also be classiﬁed into potential and realized delivery of services (Aday
and Andersen, 1974; Joseph and Phillips, 1984) based on whether actual utilization data of
the services is incorporated (realized) or based solely on the characteristics of the services
oﬀered (potential).

In recent health service research, distance is commonly measured as vehicular travel
time over a road network calculated in GIS (Higgs, 2004). However, other measures such
as travel distance or Euclidean distance are also regularly used (Higgs, 2009; McLaﬀerty,
2003). By incorporating real-world connectivity provided by the road infrastructure, travel
distance oﬀers a more accurate characterization of the distance among locations compared to
Euclidean distance. Yet, travel distance does not recognize the variations in travel impedance
(speed limits or travel speeds) often found between rural and urban environments. Although
Euclidean and travel distance are computationally less expensive and require fewer inputs,
respectively, recent improvements in spatial data processing capabilities and drive distance
analysis allow for vehicular travel time to be modeled more easily in a GIS (Jones et al.,
2010). We acknowledge that travel time estimates oﬀer the most accurate representation of
the cost of travel for measuring geographic accessibility based on a number of recent studies
in health services research discussing the subject (see Apparicio et al., 2008; Martin et al.,
2002; Pedigo and Odoi, 2010; Shahid et al., 2009).
A number of assumptions regarding real world phenomena are required prior to spatial
representation and modeling. In the case of forming a conceptual model for travel time,
the initial assumption is that the unique and personal experience of travel among locations
can be suﬃciently characterized and estimated using spatial data and models. Rather than
attempting to isolate and discuss all the factors inﬂuencing travel time, we instead point
out the general assumptions present in many geographic accessibility models constructed
for population-based studies. First, the models assume that each person in the population
has similar driving characteristics and comparable vehicles. Another assumption is that
each person experiences the same travel conditions; therefore, variation in factors influencing
travel time such as the day, time of day, local traﬃc patterns, and weather are held constant.
The models also assume that all people possess knowledge of and choose to travel along the
shortest path between locations. Increased availability of desktop and internet-based trip
planners has likely diminished the overall impact of this assumption, yet it remains salient in
travel time models. Finally, due to limitations in data availability and data processing capabilities, the location of a population is often assigned to a single point location. Therefore,
the travel time estimates originating from this location are assumed to be a reliable proxy for
the travel time experienced by each member of the population. Although these assumptions
hide signiﬁcant variability, they are necessary when conducting population-based studies due
to the unpredictability of potential factors inﬂuencing travel (Witlox, 2007) and the lack of
individually georeferenced data. Hence, GIS-based travel time estimates should aim only to
capture the average situation encountered, a suitable metric for most accessibility studies
(Haynes et al., 2006).
2.2.2 Data models

The diﬀerences between raster and network data models have been extensively documented
in many GIS textbooks and research papers (e.g., Longley et al., 2010). Although the conceptual models of space, input data formats, and computational algorithms employed in
processing these data diﬀer, the basic premise behind the calculation of travel time is quite
similar for both. Travel time is modeled as a function of distance and travel speed and can
be conceptualized as the cost of movement. A number of data products based on cost of
movement can be calculated using a GIS. However, due to their importance in assessing
geographic access, we focus our discussion on a minimum cost path between locations and
a catchment or service area corresponding to a point location. In the following paragraphs,
the data formats and corresponding cost of movement concepts are summarized for both the
network and raster models.
The basic network data model comprises a series of nodes (points) that are connected
by edges (lines). Because the nodes and edges are the sole geometric features deﬁned in the
data model, any place not falling on the network is essentially “undeﬁned” or empty space.
Therefore, location and movement within the network data model are conﬁned solely to the
edges and nodes (see Figure 3(A)).
Figure 3: A) Network data model and B) Cost example (panel B shows nodes A to E, edge speeds of 45 mph and 25 mph, and the turn penalty PR)

In the representational model of travel time, the cost to traverse an edge is defined by
the edge length and its associated travel speed. Additionally, the network data model can
be augmented to include a penalty for a directional change at a node (i.e., a time penalty
or turn delay when making a turn at an intersection). In this case, movement through a
node is assigned an angular direction, relative to the original direction of travel, and the
corresponding delay for that directional change is applied. An example of travel within a
network model is detailed in Figure 3(B), showing travel from Node A to Node D in a simple
network. The travel time ($T_{AD}$) for the trip can be calculated such that

$$T_{AD} = \frac{d_{AE}}{S_{AE}} + \frac{d_{ED}}{S_{ED}} + P_R \tag{1}$$

using edge distance A-E ($d_{AE}$), edge distance E-D ($d_{ED}$), travel speed of edge A-E ($S_{AE}$),
travel speed of edge E-D ($S_{ED}$), and the turn delay for making a 90° right hand turn at
Node E ($P_R$).
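As a minimal sketch of Equation (1), the network travel time can be computed by summing edge length over edge speed for each edge on the path and then adding any turn delays. The function name and all numeric values below (edge lengths, speeds, and the 0.25 minute turn delay) are hypothetical and chosen only for illustration; they are not taken from the case study.

```python
def network_travel_time(edges, turn_delays_min=()):
    """Travel time in minutes along a path.

    edges: sequence of (length_miles, speed_mph) pairs, one per edge.
    turn_delays_min: delays in minutes for turns made at nodes on the path.
    """
    drive_minutes = sum(length / speed * 60.0 for length, speed in edges)
    return drive_minutes + sum(turn_delays_min)


# Hypothetical trip A -> E -> D: 3 miles at 45 mph, then 1 mile at 25 mph,
# with an assumed 0.25 minute delay for the right turn at Node E.
t_ad = network_travel_time([(3.0, 45.0), (1.0, 25.0)], turn_delays_min=[0.25])
```

With these assumed inputs, the two edges contribute 4.0 and 2.4 minutes, so the turn delay is a small share of the total; on short urban trips with many turns, that share would be larger.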
Many recent studies of health service accessibility have utilized the network data model
for calculating travel time estimates (Dai, 2010; Pedigo and Odoi, 2010; Schuurman et al.,
2010; Wan et al., 2011). The network data model is appealing for representing vehicular travel
time or distance considering that road segments (edges) are connected at road intersections
(nodes), upholding real-world connectivity among locations. Results of path calculations are
likely to be very similar to those experienced in the real world due to the similarities between
the data model structure and the true travel environment (Kwan and Hong, 1998). Because
areal features are not defined in the network data model, service area calculation requires
that edges (lines) be converted to a polygon representation. The polygon represents the
areal extent of the edges within the service area, but requires an approximation of undeﬁned
space in the original data model.
The raster data model is composed of a series of regularly sized and spaced cells (or
pixels). Cells are arranged in a lattice with explicit spatial boundaries, thus all locations
within the boundaries of the lattice are represented by their two-dimensional coordinate location. In this data model, travel occurs through cell-to-cell movement wherein a specific cost
is designated for each cell, representing the time required to traverse the cell.
In most GIS software packages, movement occurs in only cardinal directions (Rook’s case)
or in both cardinal and diagonal directions (Queen’s case, see Figure 4(A)). However, other
software packages oﬀer more ﬂexible options such as Knight’s case movement (Lopez-Quilez
and Munoz, 2009). Travel time is calculated using the cell dimensions and travel speed
assigned to the cell. Unlike the network model, the length of individual steps in a route is
based on the cell resolution of the data and thus, constant throughout the entire raster grid.
Figure 4(B) contains a graphic representation of possible travel routes between cell A and
cell D in the raster model. In this case, the journey can be accomplished by taking a route
similar to that shown in Figure 3(B), in which the route goes from cell A to cell E to cell D. Travel
time ($T_{AD}$) for this route would be calculated such that

$$T_{AD} = \left(\frac{d/2}{S_A} + \frac{d/2}{S_E}\right) + \left(\frac{d/2}{S_E} + \frac{d/2}{S_D}\right) \tag{2}$$

where $d$ is the distance between cell centers, which is equal to cell resolution, and travel speed
($S_i$) is defined for each cell. Division by 2 occurs for each step in the movement because half
of each cell is traversed with each step. In this case, to travel from Point A to Point E, half
of $d$ is traversed at 45 mph and half is at 25 mph. The journey can also be completed by
taking the diagonal, direct route between the two points such that

$$T_{AD} = \frac{\frac{\sqrt{2}}{2} d}{S_A} + \frac{\frac{\sqrt{2}}{2} d}{S_D} \tag{3}$$

where the increase in distance traveled for the step is accounted for by using the Pythagorean
theorem to adjust the distance term.

Figure 4: A) Raster data model and B) Cost example
The raster data model has been used to calculate travel time in health service accessibility
studies (see Martin et al., 2002; Messina et al., 2006; Ray and Ebener, 2008; Tanser et al.,
2006). Because all locations are explicitly deﬁned in the raster data model, it is attractive
for creating service areas, especially in regions without an all-encompassing transportation
network (Tanser et al., 2006).
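The raster cost calculations described above can be sketched as follows. The helper `step_time` applies the per-step cost used in Equations (2) and (3): half of the step is traversed at the speed of each cell, and a diagonal step is lengthened by a factor of √2. The function `travel_time_surface` accumulates these steps with Dijkstra's algorithm under Queen's case movement. The choice of Dijkstra's algorithm, the function names, and the 3 x 3 grid of 30 mph cells are assumptions made for illustration and do not describe the routine used by any particular GIS package.

```python
import heapq
import math


def step_time(d, s_from, s_to, diagonal=False):
    """Hours to move between adjacent cell centers; d is the cell size in miles."""
    length = d * math.sqrt(2.0) if diagonal else d
    return (length / 2.0) / s_from + (length / 2.0) / s_to


def travel_time_surface(speed, source, d):
    """Least accumulated travel time (hours) from source to every cell, Queen's case."""
    rows, cols = len(speed), len(speed[0])
    best = [[math.inf] * cols for _ in range(rows)]
    best[source[0]][source[1]] = 0.0
    heap = [(0.0, source)]
    while heap:
        t, (r, c) = heapq.heappop(heap)
        if t > best[r][c]:
            continue  # stale queue entry, a shorter time was already found
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr, dc) == (0, 0) or not (0 <= nr < rows and 0 <= nc < cols):
                    continue
                is_diagonal = dr != 0 and dc != 0
                cand = t + step_time(d, speed[r][c], speed[nr][nc], is_diagonal)
                if cand < best[nr][nc]:
                    best[nr][nc] = cand
                    heapq.heappush(heap, (cand, (nr, nc)))
    return best


# Hypothetical 3 x 3 grid of 1-mile cells, all at 30 mph, facility in cell (0, 0).
speeds = [[30.0] * 3 for _ in range(3)]
hours = travel_time_surface(speeds, (0, 0), d=1.0)
# A service area is the set of cells reachable within a time threshold.
within_3_min = sum(t * 60.0 <= 3.0 for row in hours for t in row)
```

Because every cell receives a value, a service area follows directly from thresholding the surface, which is the property that makes the raster model attractive where the road network is incomplete.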
Figure 5: Conversion of vector road data to raster cells. The original roads (black lines
on left) are converted to a cell-based representation with large cell sizes (middle), resulting
in an overconnected travel grid. Smaller cells (right) improve the topological structure of
the travel grid. However, the two roads are still erroneously connected in this scenario.

Roads data are generally available as vector features and must be converted to a raster
representation. This process requires specification of a cell resolution. The abstraction
process necessitates decision rules for assigning a travel speed to cells in which multiple
roads (with varying speed limits) fall inside the cell bounds and/or cells in which no roads
are present. When the vector roads data are converted to cells, the roads cease to exist as
unique and individual entities (e.g., highways, surface streets, ramps, etc.) and become a
surface of travel speeds (see Figure 5). In the raster data model, the strict topology that
governs real world travel along roads is replaced by predeﬁned directional movement among
cells. Thus, in routing applications, the raster data model has the potential to produce
unexpected results (Sander et al., 2010; Upchurch et al., 2004). Furthermore, travel time
estimates may be either overestimated or underestimated depending upon the geometric
complexity of the road network and the cell resolution.
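The role of the decision rule can be illustrated with a small sketch. Here each cell receives one speed chosen from the roads that cross it, and cells without roads receive a fallback speed. The function name, the choice between `max` and `min` as candidate rules, and the 3 mph off-road speed are hypothetical; they are not the rules used in the case study.

```python
def rasterize_speeds(road_speeds_by_cell, shape, rule=max, off_road_mph=3.0):
    """Build a grid of travel speeds (mph) from the roads found in each cell.

    road_speeds_by_cell maps (row, col) to the speeds of the roads crossing
    that cell; rule resolves cells holding several roads, and off_road_mph
    fills cells that contain no road.
    """
    rows, cols = shape
    grid = [[off_road_mph] * cols for _ in range(rows)]
    for (row, col), road_speeds in road_speeds_by_cell.items():
        grid[row][col] = rule(road_speeds)
    return grid


# Hypothetical 2 x 2 grid: an expressway and a local street share cell (0, 0).
roads = {(0, 0): [70.0, 25.0], (0, 1): [25.0]}
fastest = rasterize_speeds(roads, (2, 2), rule=max)
slowest = rasterize_speeds(roads, (2, 2), rule=min)
```

Under the `max` rule, the shared cell is crossed at expressway speed even by a trip that in reality stays on the local street, which is one way the loss of road identity can lead to underestimated travel times.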

2.3 Case study

Our case study explores the geographic accessibility of hospitals in Michigan. The Michigan
Department of Community Health (MDCH) identiﬁes Limited Access Areas (LAA) as a
part of the state’s Certificate of Need (CON) program, thus offering a formal definition of
areas with limited geographic accessibility with which to compare methods. The state also
serves as an excellent study area to conduct a travel time analysis due to a unique physical
geography (two separate peninsulas with irregular shorelines) and a highly variable mix of
urban and rural regions (Martin et al., 2002).
As deﬁned by statute (Michigan Department of Community Health, 2009), an LAA is
any geographic area containing a population of 50,000 that is more than a 30 minute drive
time (utilizing the slowest route available) to the nearest acute care hospital oﬀering 24
hours/day 7 days/week emergency room services. LAA maps are used by the MDCH and
Michigan’s CON Commission to evaluate applications to construct new hospitals or branch
locations and requests to add or modify existing hospital services.
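The population test in this definition can be sketched as a simple tally. The example below sums the population of units whose travel time to the nearest hospital exceeds 30 minutes and compares the total with the 50,000 threshold. It treats all underserved units as a single area, which is a simplifying assumption, and the unit populations and travel times are hypothetical.

```python
def underserved_population(units, threshold_min=30.0):
    """Total population of units more than threshold_min minutes from a hospital.

    units: sequence of (population, travel_minutes) pairs.
    """
    return sum(pop for pop, minutes in units if minutes > threshold_min)


# Hypothetical population units: (population, minutes to nearest hospital).
units = [(12000, 18.5), (30000, 34.0), (25000, 41.2), (8000, 29.9)]
total = underserved_population(units)
meets_population_threshold = total >= 50000
```

Because the test depends on which units fall just inside or outside the 30 minute cutoff, small changes in estimated travel time can change whether an area reaches the population threshold.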
In Messina et al. (2006), the authors presented a raster-based GIS methodology used
to measure travel time to hospitals and identify underserved areas and LAAs in Michigan.
This methodology is re-implemented using updated population and health service facility
data from 2010. Underserved areas and LAAs are also identiﬁed using a network-based
travel time analysis. Both methods are tested for sensitivity to travel speed settings and
changes in the population assignment method. The results of the raster and network-based
methods are compared and implications for measuring geographic accessibility are explored.

2.4 Data and methods

2.4.1 Roads data

Both the network and raster-based methods of calculating travel time among locations
are heavily dependent upon a detailed and accurate representation of both road location
(length) and travel speed (impedance). The 2009 road network database (Michigan Geographic Framework Version 10a) was acquired from the Michigan Center for Geographic
Information (MCGI, http://www.michigan.gov/cgi). The location of each road segment is
provided along with attributes including, but not limited to: length, road name, data source,
National Functional Classification (NFC) code, Framework Classification Code (FCC), and
legal ownership.

2.4.1.1 Speed limit classification

The estimation of travel speed for each road segment, in the absence of measured travel speed data, can be accomplished most accurately
using the posted speed limit and surface material of the road segment. Speed limits deﬁne
the maximum legal travel speed, whereas surface material helps to determine realistic travel
speeds (n.b., reasonably lowered speeds on unpaved roads in rural areas). Because neither
speed limit nor road surface type is included as an attribute in the MCGI roads database, we
developed a hierarchical classiﬁcation system to assign estimated travel speed to each road
segment. Traditional methods of assigning travel speeds or speed limits are generally simple
classiﬁcations using only the FCC or the NFC of each road segment (see Birkmeyer et al.,
2003; Nallamothu et al., 2006; Berke and Shi, 2009). Our classiﬁcation system for assigning
travel speed oﬀers a signiﬁcant advantage over traditional methods by incorporating NFC,
FCC, and road ownership into a hierarchical decision tree, rather than relying on a single
road attribute class.
The actual speed limits of Michigan roads are based upon road classification, land use of
surrounding areas, or average travel speed. Statutory speed limits are those set throughout
the state for a certain set of roads (i.e., 70 mph for expressways, 55 mph for state and
county roadways, and 25 mph for roads in business or residential areas), whereas modiﬁed
speed limits are assigned when roads require a speed limit below 55 mph, but above 25 mph.
National guidelines state that modiﬁed speed limits be based upon the 85th percentile speed
of all travelers during free ﬂowing traﬃc and ideal weather conditions. The length of a speed
zone should be at least one half of a mile and the number of speed limit changes along a
given route should be kept minimal (Michigan Oﬃce of Highway Safety Planning).
In preliminary investigations, we found that the NFC system provided valuable information for speed limit assignment, but should be superseded or supplemented with FCC
or road ownership. For instance, in small rural communities, road ownership better characterized observed speed limits than the NFC system, where the cutoff value for an urban
population is 5,000 people. Using only the NFC attribute, the speed limits for streets in
many small communities (rural villages and towns with populations less than 5,000) would
be mis-assigned as they are not distinguished from other rural roads. Each of the many
scenarios encountered will not be discussed in detail; however, a graphic depiction of the
complete hierarchical classiﬁcation system is found in Figure 6. Development and preliminary evaluation of the classiﬁcation system included personally traveling road networks in
southeast and mid-Michigan, documenting the actual speed limits.
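The idea of letting ownership supersede the NFC code can be sketched as an ordered set of checks. The example below is a deliberately reduced illustration and not the classification in Figure 6: it uses only the three statutory speeds named in the text (70, 55, and 25 mph), and the function name and the specific rule order are assumptions.

```python
def estimate_speed_mph(nfc, legal=None):
    """Illustrative hierarchy in which road ownership is checked before NFC.

    nfc: National Functional Classification code of the segment.
    legal: ownership code; 4 and 5 denote city major and city minor streets.
    """
    if legal in (4, 5):
        return 25  # city-owned street: statutory business/residential limit
    if nfc in (1, 2):
        return 70  # expressway: statutory limit
    return 55  # other roads: statutory limit for state and county roadways


# A local street (NFC 7) in a small village: NFC alone would give the
# default speed, while the ownership check assigns the lower limit.
by_nfc_only = estimate_speed_mph(7)
with_ownership = estimate_speed_mph(7, legal=5)
```

The ordering matters more than the particular values: any attribute placed earlier in the sequence overrides those that follow, which is how ownership can correct the speeds of streets in communities below the NFC urban population cutoff.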

2.4.1.2 Road hierarchy

Each road was assigned a “hierarchy” value in an effort to

control traﬃc ﬂow within the network data model. The MCGI roads data did not contain
attribute information describing real-world connectivity at road intersections (e.g., overpasses
and underpasses). All intersections are presumed traversable if no connectivity rules are
established, leading to an over-connected network and likely underestimation of travel times
if not accounted for. True connectivity could not be established for all roads in the state
due to the large number of intersections in the roads dataset (n > 500,000) along with a
lack of reference data. Therefore, our eﬀorts were directed towards establishing realistic
connectivity between expressways and surface streets.
Figure 6: Hierarchical classification system for speed limits

We utilized the hierarchy attribute in conjunction with a turn delay to account for
the absence of connectivity information at expressway intersections in the MCGI data. In
ArcGIS™, turn delays in a network dataset can be assigned not only by the direction of the
turn, but also by the hierarchy values of the intersecting roads. Using the FCC attribute in
the roads data, all expressways were assigned a hierarchy value of 1, all ramps (leading onto
and off of expressways) were assigned a value of 2, and all remaining roads (surface streets)
were assigned a value of 3. Considering that real-world traffic flow between expressways
and surface streets is restricted to only entrance and exit ramps connecting the two road
types, we assigned an artificially high turn delay (20 minutes) to any direct turn between
expressways and surface roads (hierarchy values 1 and 3). This prevented the network solver
from choosing to make a “non-existent” turn between surface streets and expressways due
to the unrealistically high turn delay between road hierarchy values. Essentially, expressway connectivity within the network was restricted to match actual driving conditions, thus
improving the accuracy of travel time estimates.
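
The hierarchy-based turn restriction can be illustrated with a small lookup. The following is only a minimal sketch, not the ArcGIS Network Analyst implementation: it assumes the hierarchy values described above (1 = expressway, 2 = ramp, 3 = surface street) and borrows the Network 5 turn delays from Table 2. The function and dictionary names are hypothetical.

```python
# Hypothetical sketch of a hierarchy-aware turn penalty (not the ArcGIS API).
EXPRESSWAY, RAMP, SURFACE = 1, 2, 3

# Ordinary turn delays in seconds (Network 5 column of Table 2).
TURN_DELAY_SECONDS = {
    "reverse": 20,
    "left": 8,
    "right": 5,
    "straight_crossroad": 1,
    "straight_no_crossroad": 0,
}
NON_EXISTENT_TURN_SECONDS = 1200  # 20 minutes


def turn_delay(from_hierarchy: int, to_hierarchy: int, turn_type: str) -> int:
    """Return the delay (seconds) for moving between two road segments."""
    # A direct move between an expressway and a surface street bypasses the
    # ramps, so it receives a delay large enough that a shortest-path solver
    # would not prefer it over a route that uses a ramp.
    if {from_hierarchy, to_hierarchy} == {EXPRESSWAY, SURFACE}:
        return NON_EXISTENT_TURN_SECONDS
    return TURN_DELAY_SECONDS[turn_type]
```

Because the penalty depends only on the pair of hierarchy values, ramps (value 2) remain the only inexpensive way to move between expressways and surface streets.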

2.4.1.3

Network comparison

Five network datasets were created and explored to better understand how changes to the speed limit classification system (see Table 1) and the
penalties assigned for turn delays (see Table 2) aﬀected the estimated travel times. Although
the Michigan Oﬃce of Highway Safety Planning oﬀers guidelines for assigning road speed
limits (Michigan Oﬃce of Highway Safety Planning), we were unable to locate reference data
for comparative purposes. Furthermore, collecting enough actual travel time data to allow
for formal statistical testing was not feasible. Given these limitations, we compared travel
time estimates to results obtained from Google Maps™. The results from Google Maps were
not considered true travel times due to the lack of methodological documentation available
and a substantial number of speed limit errors that were manually identified in their roads
data. However, because the Google Maps travel time estimates are derived from independent
source data, the comparison allowed us to assess whether the travel speeds and turn delays
of our custom-built networks provided reasonable travel time estimates¹ (see Wang and Xu,
2011).
A “shortest path” analysis was completed for 1618 routes covering a broad range of travel
distances (range = 0.5 - 647 miles, mean = 185.41 miles) and route types (e.g., rural, urban,
suburban)². All networks provided reasonable travel time estimates compared to Google
Maps (see Figure 7 and Table 3). Network 5 was considered the most suitable for estimating
travel time in this application. The travel speeds specified in Network 5 are a simple 5
mph reduction of the initial speed limit values from our hierarchical classification system,
offering an objective method to account for sub-optimal driving and traffic conditions and
the presence of stop signs, traffic lights, and other mechanisms for traffic control not present
in the roads database. Additionally, the turn delays (outside of the expressway turn delay)
in Network 5 are conservative, but conventional, estimates for normal surface street turns
(Price, 2008, 2009).

¹ The dominance of Google Maps in web-based mapping applications (BuiltWith Trends,
2012) does not guarantee that their roads data, travel speed data, or travel time estimates
are, in fact, accurate. However, given the large and growing number of users, we believe that
there is a low likelihood that the Google Maps source data contain a substantial amount of
significant errors.

Table 1: Travel speeds (miles per hour, mph) used in custom-built network datasets

Road Type                                             N1   N2   N3   N4   N5
Expressways                                           70   60   60   62   65
Ramps                                                 25   25   25   25   20
City owned, major                                     35   30   30   35   30
City owned, minor                                     25   20   20   25   20
Private                                               25   25   25   25   20
Minor collectors                                      55   55   55   45   50
Rural arterials and major collectors                  55   55   55   45   50
Rural local                                           45   45   45   45   40
Urban, state owned arterials and major collectors     35   35   35   35   30
Urban, county owned arterials and major collectors    45   45   45   45   40
Urban, state owned local                              35   35   35   35   30
Urban, county primary local                           55   55   55   45   50
Urban, county local                                   25   25   25   25   20

Table 2: Turn delays (seconds) used in custom-built network datasets

Turn Type                         N1      N2      N3      N4      N5
Non-existent expressway turn   1,200   1,200   1,200   1,200   1,200
Reverse (non U-turn)               8       8      10      45      20
Left                               4       5       8      30       8
Right                              2       3       5      15       5
Straight (with crossroad)          1       0       2       1       1
Straight (no crossroad)            0       0       0       0       0

Table 3: Mean difference in travel time and road distance between Google Maps
and custom-built networks in shortest path analysis

             Time (minutes)   Distance (miles)
Network 1             18.39               2.84
Network 2              8.29               6.41
Network 3              1.54               4.42
Network 4              2.33               3.04
Network 5              0.87               2.55

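
As a quick check of the 5 mph reduction described above, the snippet below subtracts 5 mph from the Network 1 speeds in Table 1 and compares the result with the Network 5 column. It assumes that the Network 1 column holds the initial speed limit values from the hierarchical classification system; the variable names are illustrative only.

```python
# Travel speeds in mph, listed in the row order of Table 1.
ROAD_TYPES = [
    "Expressways",
    "Ramps",
    "City owned, major",
    "City owned, minor",
    "Private",
    "Minor collectors",
    "Rural arterials and major collectors",
    "Rural local",
    "Urban, state owned arterials and major collectors",
    "Urban, county owned arterials and major collectors",
    "Urban, state owned local",
    "Urban, county primary local",
    "Urban, county local",
]
N1 = [70, 25, 35, 25, 25, 55, 55, 45, 35, 45, 35, 55, 25]  # assumed initial limits
N5 = [65, 20, 30, 20, 20, 50, 50, 40, 30, 40, 30, 50, 20]  # Network 5 column

REDUCTION_MPH = 5
derived_n5 = [speed - REDUCTION_MPH for speed in N1]

# True when every Network 5 speed equals the Network 1 speed minus 5 mph.
matches = derived_n5 == N5
```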
2.4.2

Population and hospital data

2010 block population data and boundary files were acquired from the US Census Bureau
(http://www2.census.gov/census_2010/, http://www.census.gov/geo/www/tiger/). Michigan statute requires that LAAs be identified using zip code population data; therefore, the
block population data were aggregated to their corresponding Zip Code Tabulation Area
(ZCTA) boundaries (n = 978), herein referred to as zip codes. Because the census blocks
nest perfectly inside the zip code boundaries, the block population polygons were converted
to geographic centroids and spatially joined to the zip code boundary file. The population
of each zip code was calculated by summing the population of all the block centroids falling
within its boundaries. Michigan’s total population was 9,883,640 in 2010.

² A custom-written automated query function was implemented in R™. The function
sent origin and destination locations to the Google Maps API and returned the resulting
travel times and distances.

Figure 7: Travel time estimates from custom-built networks compared with travel
time estimated from Google Maps
[Plot of Network Time (minutes) against Google Maps Time (minutes), 0 to 700 minutes, for Networks 1 to 5]

Location and attribute data for 169 hospitals in Michigan were acquired from the MDCH.
The hospital addresses were geocoded in ArcGIS and converted to point features. Hospital
attribute data were used to identify and subset those hospitals oﬀering acute care and 24/7
emergency room services, resulting in 137 hospitals.
2.4.3

Raster-based method

The raster-based method used to identify LAAs is documented extensively by Messina et
al. (2006) and MDCH (Michigan Department of Community Health, 2009). Thus, it will
only be summarized here. First, roads data were converted to a raster grid of 1 km cells
wherein the travel speed for each cell was defined as the speed of the slowest road falling inside
the bounds of the cell. Because each cell required a speciﬁc travel speed, cells containing no
roads were assigned 3 mph as an estimate of non-vehicular travel speed. Travel time or cost
for traversing each cell was calculated using the cell length and speciﬁc travel speed. An
accumulated cost surface was created wherein cell values represented the total travel time
from the cell to the nearest hospital location (i.e., least cost path for each cell). To identify
underserved areas, the accumulated travel time surface was reclassiﬁed into a Boolean surface
based on whether the cell was greater than 30 minutes from a hospital location. The grid
representing underserved areas was then ﬁltered to remove any groups of less than three
contiguous cells (using Queen’s case connectivity). The ﬁltering process was conducted in
an eﬀort to remove single cells and very small areas where no roads were present, but were
generally “inside” the 30 minute travel bounds. Using a connectivity ﬁlter in lieu of a
“count-only” ﬁlter ensured that areas near the edges of the actual underserved areas were
not trimmed. Figure 8 shows an example of the ﬁltering process near an underserved area in
southern Michigan. After the ﬁltering process, the underserved areas were converted from
a raster grid to a vector data format (polygons) wherein a unique ID was assigned to each
contiguous underserved area.
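
The raster steps summarized above (accumulated cost surface, Boolean reclassification, and connectivity filter) can be sketched in a few lines. This is a simplified illustration and not the MDCH or ArcGIS implementation. It assumes that the cost of moving between two neighboring cells is the average of their per-kilometer costs multiplied by the step length (with diagonal steps √2 times longer); the source does not state this rule. All function names are hypothetical.

```python
import heapq
from collections import deque

CELL_KM = 1.0
KM_PER_MILE = 1.609344
# Queen's case: all eight surrounding cells are neighbors.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def minutes_per_km(speed_mph):
    return 60.0 / (speed_mph * KM_PER_MILE)


def accumulated_cost(speed_grid, hospitals):
    """Travel time (minutes) from every cell to its nearest hospital cell."""
    rows, cols = len(speed_grid), len(speed_grid[0])
    cost = [[float("inf")] * cols for _ in range(rows)]
    heap = []
    for r, c in hospitals:
        cost[r][c] = 0.0
        heapq.heappush(heap, (0.0, r, c))
    while heap:
        t, r, c = heapq.heappop(heap)
        if t > cost[r][c]:
            continue  # stale queue entry
        for dr, dc in NEIGHBORS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                dist = CELL_KM * (2 ** 0.5 if dr and dc else 1.0)
                # Assumption: half of the step is travelled at each cell's speed.
                step = dist * 0.5 * (
                    minutes_per_km(speed_grid[r][c]) + minutes_per_km(speed_grid[nr][nc])
                )
                if t + step < cost[nr][nc]:
                    cost[nr][nc] = t + step
                    heapq.heappush(heap, (t + step, nr, nc))
    return cost


def underserved_mask(cost, threshold_min=30.0, min_cells=3):
    """Boolean grid of cells beyond the threshold, with small groups removed."""
    rows, cols = len(cost), len(cost[0])
    mask = [[cost[r][c] > threshold_min for c in range(cols)] for r in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Collect one contiguous group using Queen's case connectivity.
            group, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                cr, cc = queue.popleft()
                group.append((cr, cc))
                for dr, dc in NEIGHBORS:
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and mask[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(group) < min_cells:
                for gr, gc in group:
                    mask[gr][gc] = False
    return mask
```

Filtering on connected groups rather than on a simple neighborhood count is what keeps cells at the edge of a genuinely underserved area from being trimmed.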

Figure 8: Example of raster filter
[Panels: Accumulated Cost Surface; Boolean Surface; Boolean Surface (filtered)]

The population assignment method, according to Michigan’s guidelines for identifying
LAAs, requires that the entire population of a zip code be assigned to the underserved
area if any portion of the zip code polygon falls inside of the underserved area. Thus, the
underserved area polygons and zip code polygons were spatially joined in the GIS such that
each underserved area polygon was assigned the summed population of all intersecting zip
code polygons. Underserved areas with a total population of 50,000 or greater were then
classiﬁed as Limited Access Areas.
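
The population assignment rule can be sketched as follows. This is a minimal illustration that assumes the spatial intersection has already been computed, so that each underserved area comes with a list of the zip codes it touches. The identifiers and population values are hypothetical, not data from the study.

```python
LAA_POPULATION_THRESHOLD = 50_000


def classify_laas(intersections, zip_population, threshold=LAA_POPULATION_THRESHOLD):
    """Map each underserved area ID to (assigned population, is_LAA)."""
    results = {}
    for area_id, zip_codes in intersections.items():
        # The entire population of every intersecting zip code is assigned,
        # regardless of how much of the zip code overlaps the area.
        total = sum(zip_population[z] for z in set(zip_codes))
        results[area_id] = (total, total >= threshold)
    return results


# Hypothetical inputs: zip codes intersecting each underserved area.
example_intersections = {"area_1": ["zip_A", "zip_B"], "area_2": ["zip_C"]}
example_population = {"zip_A": 30_000, "zip_B": 25_000, "zip_C": 12_000}
example_result = classify_laas(example_intersections, example_population)
```

In this toy case, area_1 reaches the threshold only because it touches two zip codes, which mirrors how larger contiguous areas are more likely to qualify as LAAs.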
2.4.4

Network-based method

ArcGIS Network Analyst was employed for all network-based analysis. Prior to converting
the vector roads database to a network data format, each line segment was assigned a travel
time value calculated using the line segment’s length and estimated travel speed. Upon the
conversion to the network data format, travel time was specified as the cost value for edges.
Turn delays were defined to both control traffic flow and to model expected slowdowns in
travel speed accompanying directional changes as detailed previously.

Figure 9: Service areas (and resulting underserved areas) produced by network-based method

After the network was built, we created 30 minute travel time polygons for each of the
hospital locations using the “Service Area” function. Underserved areas were identiﬁed by
clipping the service area polygons from a state base map, essentially ﬁnding the inverse of the
30 minute travel areas throughout the state (see Figure 9). Population data were assigned
to each underserved polygon and the LAAs were subset using the methods detailed in the
previous section.
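
The core of the network-based calculation can be illustrated without a GIS. The sketch below is a simplified stand-in for the Network Analyst workflow, not a reproduction of it. It computes each segment's travel time from its length and speed, then runs a multi-source shortest-path search from the hospital nodes. It assumes two-way streets, ignores turn delays, and reports nodes rather than generating service area polygons; the node names and edge values are hypothetical.

```python
import heapq


def segment_minutes(length_miles, speed_mph):
    """Travel time for one road segment, in minutes."""
    return 60.0 * length_miles / speed_mph


def travel_times(edges, hospital_nodes):
    """Multi-source Dijkstra: minutes from each node to its nearest hospital."""
    graph = {}
    for a, b, length_miles, speed_mph in edges:
        cost = segment_minutes(length_miles, speed_mph)
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))  # assumes two-way travel
    best = {node: 0.0 for node in hospital_nodes}
    heap = [(0.0, node) for node in hospital_nodes]
    heapq.heapify(heap)
    while heap:
        t, node = heapq.heappop(heap)
        if t > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, cost in graph.get(node, []):
            if t + cost < best.get(nxt, float("inf")):
                best[nxt] = t + cost
                heapq.heappush(heap, (t + cost, nxt))
    return best


def underserved_nodes(edges, hospital_nodes, threshold_min=30.0):
    """Nodes farther than the threshold from every hospital."""
    times = travel_times(edges, hospital_nodes)
    nodes = {n for a, b, _, _ in edges for n in (a, b)}
    return {n for n in nodes if times.get(n, float("inf")) > threshold_min}


# Hypothetical network: (from_node, to_node, length in miles, speed in mph).
example_edges = [("H", "A", 10.0, 50), ("A", "B", 10.0, 40), ("B", "C", 5.0, 20)]
example_underserved = underserved_nodes(example_edges, ["H"])
```

Here node B is 27 minutes from the hospital and node C is 42 minutes away, so only C falls outside the 30 minute service area.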
2.4.5

Sensitivity

To assess each method’s sensitivity to the input roads data, the preceding steps for the raster
and network methods were carried out a second time using the original speed limits of the
roads as opposed to the travel speeds in Network 5. In the raster-based analysis, the speed
limit of cells with no roads present was raised to 10 mph. This test was conducted in an
eﬀort to uncover the variability in the results associated with small changes in the travel
speed settings. Although this was not a comprehensive sensitivity analysis, exploring the
diﬀerence in results due to the changes in the travel speed settings allowed us to estimate
the relative importance of the settings for each method and the overall robustness of each
data model.
We also evaluated each method for sensitivity to the scale of the data used to assign
population to underserved areas. Instead of assigning the population using the zip code
polygons, we assigned population using the US Census block centroids. In this method, a
block’s population was assigned to an underserved area only when the centroid fell within
the bounds of the underserved area polygon. Then, the populations of these block centroids were
summed and new LAAs were identified using the updated population totals within the
underserved areas. The results of the population assignment by census block were compared
to the original results for both the raster and network-based methods. Considering that the
block estimates of population are closer to the “true” number of people within the underserved areas (Apparicio et al., 2008), this comparison allowed us to evaluate which method
is more sensitive to the population assignment method speciﬁed in Michigan’s statute.
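
The block centroid assignment can be sketched with a standard ray-casting point-in-polygon test. This is an illustrative stand-in for the GIS spatial join, assuming simple (non-self-intersecting) polygons given as vertex lists. The coordinates and populations below are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def centroid_population(area_polygon, block_centroids):
    """Sum the population of block centroids falling inside the polygon."""
    return sum(
        pop for x, y, pop in block_centroids if point_in_polygon(x, y, area_polygon)
    )


# Hypothetical underserved area and block centroids (x, y, population).
example_area = [(0, 0), (4, 0), (4, 4), (0, 4)]
example_blocks = [(1, 1, 120), (3, 2, 80), (5, 5, 300)]
example_total = centroid_population(example_area, example_blocks)
```

Unlike the zip code intersection rule, a block contributes its population only when its centroid lies inside the underserved area, so the third block in this example is excluded.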

2.5
2.5.1

Results
Underserved areas

The underserved areas identiﬁed using both the raster and network-based methods are found
in Figure 10 and Table 4. Overall, the raster-based method identiﬁed more total area, zip
codes, and population as being underserved than the network method. The raster method
produced fewer unique contiguous areas than the network method. Examination of Figure
10 reveals that this result was due to larger and more contiguous areas in the raster output.
The most notable difference between methods is the total population identified as being
underserved. Whereas the raster method reports that 23% of Michigan’s population (≈2.26
million) lives in underserved areas, the network method identified only 13% (≈1.28 million),
a difference of nearly one million people.

Table 4: Comparison of underserved areas (Percent figures reflect proportion of
state totals)

Underserved Areas                Raster    %     Network    %
Area (km²)                       52,971   35      40,043   26
Number of unique areas              223              386
Number of zip codes                 410   42         316   32
Total population (zip code)   2,258,452   23   1,280,257   13

As Figure 10 illustrates, the underserved areas identiﬁed by both methods share similar
shapes resulting in a general agreement in the overall conﬁguration of underserved places
throughout the state. We compared the spatial conﬁguration of the underserved areas by
conducting an overlay analysis. The total overlapping area (the areas identiﬁed by both
methods) was 38,667 km², comprising 71% of the total area identified by either method
(54,347 km²). The network-based results are a nearly perfect subset of the raster-based
results; only 1,376 km² were identified uniquely by the network method. Figure 11 shows a
detailed example where each method produced both overlapping and unique results.
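
The overlay figures reported above follow directly from the areas in Table 4 and the overlapping area. The short check below reproduces that arithmetic; the variable names are illustrative.

```python
# Areas in square kilometers, taken from Table 4 and the overlay analysis.
raster_km2 = 52_971
network_km2 = 40_043
overlap_km2 = 38_667

network_only_km2 = network_km2 - overlap_km2  # identified only by the network method
raster_only_km2 = raster_km2 - overlap_km2    # identified only by the raster method
union_km2 = overlap_km2 + network_only_km2 + raster_only_km2

# Share of the area identified by either method that both methods agree on.
agreement_pct = round(100 * overlap_km2 / union_km2)
```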
2.5.2

Limited Access Areas

The results of the LAA identiﬁcation are found in Figure 12 and Table 5. Again, the raster
method produced more total area, zip codes, and total population identiﬁed in LAAs. Similar
to the results of the underserved areas, the most notable diﬀerence between methods is the
total population identiﬁed. The raster-based method identiﬁed over 1.8 million people in
LAAs, whereas the network-based method identiﬁed just over 650,000, a diﬀerence of over
one million residents. Because the LAAs are a subset of the underserved areas, the spatial
configurations produced by the two methods are similar.

Figure 10: Underserved areas
[Panels: Network-based underserved areas; Raster-based underserved areas]

Table 5: Comparison of Limited Access Areas. % figures reflect proportion of state
totals

Limited Access Areas             Raster    %   Network    %
Area (km²)                       49,080   32    34,634   23
Number of unique areas               15              6
Number of zip codes                 328   33       199   20
Total population (zip code)   1,830,028   19   654,755    7

Figure 11: Example of the similarities and differences between network and raster-based underserved areas

Figure 12: Limited Access Areas
[Panels: Network-based Limited Access Areas; Raster-based Limited Access Areas]

2.5.3

Sensitivity

2.5.3.1

Speed limits

The results for underserved areas and LAAs, using both the network and raster-based methods, are presented in Table 6. The table contains the initial
areas identified and the areas identified using the actual speed limit values of the input
roads data (+5 mph). Interestingly, the network-based method identified more people as
being underserved, whereas the raster-based method identified more once the LAA criterion
of 50,000 people was applied to the underserved areas.

2.5.3.2

Population representation

Table 7 displays the number of people in underserved areas and LAAs when the population is assigned using the US Census block centroids.
In both the raster and network-based methods, the use of a less aggregated population data
source identifies far fewer people as being underserved within the state. A new set of LAAs
was identified using the original 50,000 population criterion, but with population assigned
using the block population in lieu of the zip code populations. Figure 13 shows the resulting
LAAs. Only three LAAs were identified using the raster-based method and no underserved
area met the population criterion using the network-based method, although two areas nearly
met the criterion with populations of 45,786 and 47,849.

Table 6: Comparison of underserved areas and LAAs identified with speed limits
assigned to roads. % change reflects change compared to initial travel speed settings.

Underserved Areas              Raster   % change     Network   % change
Area (km²)                     37,945        -28      31,815        -21
Number of unique areas             61        -73         390          1
Number of zip codes               238        -42         255        -19
Total population (zip code)   856,150        -62   1,000,612        -22

Limited Access Areas           Raster   % change     Network   % change
Area (km²)                     35,404        -28      19,343        -44
Number of unique areas              6        -60           3        -50
Number of zip codes               194        -41         117        -41
Total population (zip code)   694,562        -62     333,290        -49

Table 7: Comparison of results from block centroid population assignment method
with original travel speed settings. % change reflects change compared to zip code
intersection method.

Block centroid               Raster   % change   Network   % change
Underserved population      489,588        -78   191,420        -85
Limited access population   288,118        -84         0       -100

2.6

Discussion

The results of the analysis show that large areas in Michigan are outside of a 30 minute travel
time from an acute care hospital and thus have limited geographic accessibility, regardless
of which data model is employed. Using the state’s current methods, we found that over 2.2
million residents would be considered underserved and over 1.8 million residents would be
classified as having limited access. The network-based method identifies fewer total residents
as underserved (≈1.28 million) and as having limited access (≈650,000). The results are less
dramatic after “raising” the speed limits of the input roads data by 5 mph. However, both
the raster and network-based methods identified large numbers of underserved and limited
access populations in this scenario. Modifying the population assignment method resulted
in far fewer people identified as underserved and as having limited access under both methods.
Notably, the network-based method in conjunction with the block population assignment
did not identify any official LAAs, although nearly 200,000 people would be considered underserved
in this scenario and two underserved areas nearly meet the 50,000 person LAA threshold.

Figure 13: Limited Access Areas with block population assignment method
[Panels: Network-based Limited Access Areas (block population); Raster-based Limited Access Areas (block population)]

The general locations of the underserved areas and LAAs are similar between raster and
network-based methods. Much of the underserved area is found in sparsely populated regions
in Michigan’s Upper Peninsula and northern Lower Peninsula. However, both methods
identiﬁed small areas in the more populated central and southern Lower Peninsula. These
smaller underserved areas are located in rural regions between urban centers. The raster-based
method identified larger, more contiguous underserved areas; thus, more were classified
as being LAAs.
In both the network and raster data models, the cost to travel among locations is based
on the distance separating places and travel speed. Given these meta-parameters, the 71%
agreement in total area identiﬁed as underserved is not completely surprising. However,
in all of the tests performed in this analysis, the raster-based method identiﬁed more total
area as underserved and as LAAs in comparison to the network-based method, warranting
further examination. Figures 10 and 12 show that both methods identiﬁed similar patterns
of underserved areas and LAAs throughout the state; however, the raster method’s results
are universally larger. These results appear to be due to the underlying difference in the
data models and the abstraction process occurring when converting the vector road data
to a raster representation. The differences in the data models’ characterization of space
are worth reinforcing because they directly influence geographic accessibility measurement.
The raster data model deﬁnes space as a continuous surface where each cell within the
data extent has a speciﬁc location and attribute value. The network data model deﬁnes
space as an empty container that is populated only by features having speciﬁc locations and
attributes. In the following paragraphs, we explore these diﬀerences and their implications
for conducting geographic accessibility studies.
Given the structural constraints of the raster data model, accessibility calculation necessitates converting the vector road data to a cell-based representation. The conversion process
requires a decision rule for assigning the speed limit to a cell when multiple roads are present
within the cell bounds. Although a number of decision rules exist (e.g., the highest travel
speed or the mean travel speed of roads within the cell), each increases the uncertainty of
travel time estimates in the raster method. In the case study, because Michigan statute
requires that the speed limit of the cell be determined by the slowest route available, only
a small percentage of cells are assigned to the higher speed categories (i.e., highways and
expressways) due to the presence of nearby slower roads. This results in a general overestimation
of the time required to travel among locations. Figure 14 contains an example that
illustrates the dilemma produced by the abstraction process. In the example, an expressway
traversing a medium-sized town nearly disappears after the conversion to the raster data
format. Although Figure 14 shows a very specific example, the impact of this decision rule
in the conversion process is not trivial when summed over the entire state. Table 8 contains
the proportions of the roads in each travel speed class in the original vector format (based on
road length) and after conversion to the raster format (based on cell counts). Notably, the
raster format contains a higher proportion of roads in the 20 and 40 mph classes and a lower
proportion in the remaining travel speed classes. As Figure 14 illustrates, this clearly inhibits high-speed
travel. The result of slower travel speeds is an overestimation of travel time among locations
and an increased amount of area identified as being underserved. As Table 4 shows, the
raster-based method identified nearly 13,000 km² more total area as being underserved than
the network-based method. In addition, the raster-based underserved areas were larger on
average than the network-based areas (237.54 km² vs. 103.74 km²). Larger contiguous underserved areas increase the probability that the 50,000 population threshold will be reached
for LAA classification. Hence, the raster-based method identified nearly 1.2 million more
people in LAAs than the network-based method.

Table 8: Michigan roads by travel speed

Travel Speed (mph)   Network %   Raster %   Difference
20                       30.78      38.92         8.14
30                        5.99       0.36        -5.63
40                       40.75      49.33         8.58
50                       19.73      11.20        -8.53
65                        2.76       0.19        -2.57

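
The effect of the decision rule on a single multi-road cell can be shown with a small example. The function below is a hypothetical sketch of the alternatives mentioned above (slowest, fastest, or mean speed), with the 3 mph non-road default taken from the raster method description. It is not the conversion routine used in the study.

```python
def cell_speed(road_speeds_mph, rule="slowest", no_road_mph=3):
    """Pick one travel speed for a raster cell from the roads crossing it."""
    if not road_speeds_mph:
        return no_road_mph  # cells without roads get the non-vehicular speed
    if rule == "slowest":
        return min(road_speeds_mph)
    if rule == "fastest":
        return max(road_speeds_mph)
    if rule == "mean":
        return sum(road_speeds_mph) / len(road_speeds_mph)
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical cell crossed by an expressway and two city streets.
example_roads = [65, 30, 20]
slowest = cell_speed(example_roads, "slowest")
fastest = cell_speed(example_roads, "fastest")
```

Under the slowest route rule the expressway in this cell is represented at 20 mph, which is the mechanism by which high-speed roads drop out of the raster surface.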
All areas of the state should be accounted for in the LAA identiﬁcation process (Messina
et al., 2006). This creates a conundrum: LAAs are conceptually based upon vehicular travel
time, yet some places in the state do not have any roads present. In the raster data model, all
locations within the data extent are explicitly defined and measurable. Hence, to be included
in the service area estimation, each cell must be assigned a specific travel speed even if no
roads are present within the cell. The network model does not define “space” outside of the
network features (i.e., places not located on a node or edge feature). Therefore, non-road
areas are undefined and not directly measured in service area calculation. Because the two
data models diverge greatly in their characterization of space without roads, each method
requires specific techniques to account for the presence of non-road areas when identifying
geographic service areas based on vehicular travel time estimates.

Figure 14: Conversion of vector roads data to raster data format with slowest
route rule
[Legend: Travel Speed (mph)]

In the raster method, non-road cells are not distinguished from cells with roads. Therefore, by assigning an artiﬁcially low travel speed value to non-road cells (e.g., walking speed),
vehicular-based travel time estimates originating at these cells will be artiﬁcially high. Regions near the origin of the service area will be less aﬀected than those located towards the
periphery of the service area extent. For example, the travel time to exit a 1 km non-road cell
with a travel speed of 3 mph is 6.21 minutes. When a specific threshold value for a service
area is implemented, the higher travel time estimates for non-road cells result in regions or
cells identiﬁed as “non-served” areas even though they fall within the extent of the larger
service area (see Figure 8). When combined with the conservative population assignment
method employed by Michigan, the non-road cells have the potential to signiﬁcantly bias
the results of the analysis. Therefore, we implemented the ﬁlter process to limit the number
of non-road cells identiﬁed as underserved. As observed in the results of the speed limit
sensitivity analysis, the raster-based method is much more sensitive to changes in the input
speed limits. The 5 mph increase in travel speeds led to a 28% reduction in the total area
(15,000 km²) and a 62% reduction in the population (1.4 million) identified as underserved,
far outpacing the changes observed in the network-based method. Whereas some of the
raster-based method’s sensitivity can be attributed to the cell-based representation of roads
and the predefined directional movement (considering that travel occurs in large 1 km steps
between cells), we believe that much of it is due to the change in speed for the non-road cells
(from 3 mph to 10 mph).
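
The 6.21 minute figure quoted above is consistent with travel over half of the cell length, that is, from the cell center to its edge. That interpretation is an assumption made here, since the text does not spell out the calculation. The snippet below reproduces the value under that assumption and shows the corresponding time at the 10 mph setting used in the sensitivity test.

```python
KM_PER_MILE = 1.609344


def minutes_to_exit_cell(cell_km, speed_mph):
    """Minutes to travel from a cell's center to its edge (half the cell length)."""
    speed_kmh = speed_mph * KM_PER_MILE
    return 60.0 * (cell_km / 2.0) / speed_kmh


exit_at_3_mph = round(minutes_to_exit_cell(1.0, 3), 2)
exit_at_10_mph = round(minutes_to_exit_cell(1.0, 10), 2)
```

At 3 mph a single non-road cell consumes roughly a fifth of the 30 minute budget, whereas at 10 mph it consumes under two minutes, which helps explain the raster method's sensitivity to this setting.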
“Non-road” areas are also accounted for in the network-based method; however, this
process is not as apparent due to the output format of the data produced using ArcGIS
Network Analyst. The “Service Area” function produces polygon features which are in turn
used to clip a state base map to ﬁnd non-served areas. Albeit indirectly, all areas in the
state are measured when implementing the network-based method to identify service areas.
Although this technique appears straightforward, it is not without uncertainty. Service area
polygons constructed from the network-based data model are actually areal approximations
of the network edges (roads) within a speciﬁed travel time from the origin location. In
Network Analyst, the network edges are converted to a triangulated irregular network (TIN)
data structure with travel time estimates along the edges as the “height” value. Service
area polygons are then formed by subsetting the TIN to only those areas falling within the
specified travel time (ESRI, 2010). Figure 15 shows a service area where large regions, both
inside and near the bounds, have no roads. The figure includes two detailed examples of
non-road areas to help illustrate the abstraction process of generating a polygon from a set of
lines. In the upper right example, the non-road area is nearly completely enclosed by roads
within 30 minutes, thus the entirety of the non-road area is considered “served”. In the lower
right example, the non-road area is bisected by the boundary of the service area. Specifically,
the “cut out” region in the service area appears to be a remnant of the TIN conversion and
subsetting technique. In theory, this particular boundary could be located anywhere within
the non-road area; therefore, its true location is uncertain. The uncertainty associated with
the polygon generation process raises questions regarding the validity of the service area
boundaries produced by Network Analyst. However, we did not find any evidence that this
led to a large amount of over- or under-representation of underserved areas (and hence, LAAs)
in our case study.

Figure 15: Service area delineation in areas where no roads are present

Because the conceptual models of space differ significantly between data models, topological
relationships governing movement among locations are also highly dissimilar. In the
raster model, connectivity is defined solely by cell proximity: movement only occurs in single
step increments in predeﬁned directions from the cell. The network data model, on the other
hand, enforces strict connectivity rules within the data structure itself; travel only occurs
along the edges of the network and directional changes can only be accomplished at nodes.
Because the actual cost of travel between locations is highly dependent upon the connectivity provided by the transportation network linking the locations, the models’ diﬀerences in
deﬁning connectivity lead to dissimilar travel time estimates. Speciﬁcally, real-world connectivity is not accounted for in the raster data model. Therefore, travel routes among locations
may be geographically warped, resulting in inaccurate travel time estimates. For example,
in Figure 14, all cells surrounding the 65 mph cell (on the right side of the map) have the
potential to “route” through this cell. However, in the original vector road data, no ramp
connects the surface streets to the expressway within this cell. Only the cells to the left of and
below the 65 mph cell are actually connected to this cell. Therefore, movement is less restricted in the raster model than in the real world and travel time estimates will generally be
underestimated. In our case study, we believe that the underestimation of travel time was
oﬀset by the previously discussed overestimation of travel time due to the “slowest route”
assignment rule.
Reducing the cell size of the input data used in the raster-based method would result in
improved travel time estimates. Speciﬁcally, smaller cells will increase the probability of a
single road falling within each cell, negating the impact of the decision rule to assign travel
speeds to multi-road cells. In addition, as cell size is reduced, the topological similarity
between the raster travel speed surface and the original roads data increases (see Figure
5). As a result, travel time estimates would be more accurate for cells falling on or near
the road network, providing improved results in simple distance measurements and routing
applications. However, for service area identification, reducing the cell size would also lead to
an increase in the number of non-road cells in the raster data. This would likely require a more
sophisticated method to create the travel speed surface, a more elaborate filtering process
to remove these cells, or a polygon generating algorithm similar to the one employed in the
network-based method. Additionally, reducing cell size may lead to substantial increases in
processing time and data storage requirements (Upchurch et al., 2004; Schuurman et al.,
2006).
By design, the zip code population assignment rule used in Michigan is conservative
(Messina et al., 2006) in that it attempts to minimize the likelihood of source A errors (Current and Schilling, 1990). Hence, by assigning the entire zip code population regardless of the
amount of area overlapping an underserved area, the true population with limited geographic
accessibility is almost certainly overestimated. The results from the block population assignment method illustrate the magnitude of the overestimation. The percent change values in
Table 7 show that the network-based method was more sensitive to the block population
assignment method, overall. This is likely a result of the diﬀerences in the size and shape
of the underserved areas produced by each method. On average, the raster-based method
produced larger contiguous underserved areas. Due to the abstraction and ﬁltering processes
(see Figure 8) in the raster-based method, the minimum size of an underserved area is 3
cells (3 km²). The network-based method has no such size restriction. This difference has
three main implications in relation to population assignment. First, larger areas increase the
likelihood that an individual area will intersect multiple zip codes when assigning population
using the zip code intersection method, resulting in more underserved areas meeting the LAA
population criteria (See Tables 4, 5, 6, and 7). Second, unequally sized underserved areas
can be assigned the same population. For example, using the intersection method, a very
small area that falls on the border of two zip codes would be assigned the same population
as a larger area completely covering the two zip codes. Third, however, larger areas increase
the likelihood that an underserved area will contain a block centroid when the population
assignment method is modified. Because the raster-based underserved areas were generally larger than their network counterparts, the raster-based method
was less affected by the change in the population assignment method.
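
The contrast between the two population assignment rules can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the code used in the study. All geometry and population values are hypothetical, rectangles stand in for zip code and underserved-area polygons, and the function names are our own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle standing in for a polygon (hypothetical geometry)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "Rect") -> bool:
        # Positive-area overlap only; shared edges alone do not count.
        return (self.xmin < other.xmax and other.xmin < self.xmax
                and self.ymin < other.ymax and other.ymin < self.ymax)

    def contains(self, x: float, y: float) -> bool:
        return self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax


def zip_intersection_population(area, zips):
    """Conservative rule: count a zip code's entire population on any overlap."""
    return sum(pop for rect, pop in zips if area.intersects(rect))


def block_centroid_population(area, blocks):
    """Alternative rule: count only blocks whose centroid falls inside the area."""
    return sum(pop for (x, y), pop in blocks if area.contains(x, y))


# Two hypothetical zip codes side by side, each 10 x 10 units.
zips = [(Rect(0, 0, 10, 10), 30000), (Rect(10, 0, 20, 10), 25000)]
# Hypothetical census blocks as (centroid, population) pairs.
blocks = [((2, 5), 12000), ((8, 5), 18000), ((12, 5), 15000), ((18, 5), 10000)]

small_area = Rect(9.5, 4, 10.5, 6)  # sliver straddling the shared zip code border
large_area = Rect(0, 0, 20, 10)     # covers both zip codes completely

for name, area in [("small", small_area), ("large", large_area)]:
    print(name,
          zip_intersection_population(area, zips),
          block_centroid_population(area, blocks))
```

In this toy configuration, the zip code intersection rule assigns the same population to the sliver and to the area covering both zip codes, whereas the block centroid rule assigns the sliver no population because it contains no block centroid.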

2.7

Conclusions

We have presented a comparison of raster and network-based methods for measuring geographic access to health care facilities. Speciﬁcally, we have explored how both conceptual
and practical diﬀerences in the underlying data models have the potential to inﬂuence travel
time estimates. In Michigan, each data model and method produced underserved areas and
LAAs with similar conﬁguration and shape, but of varying size. Speciﬁcally, the raster-based
method identiﬁed 132% more land area as underserved than the network-based method. After assigning population to the underserved areas, the results clearly indicate that these
spatial diﬀerences resulted in substantial variation in the number of people with limited geographic accessibility to acute care hospitals. In fact, the raster-based method identiﬁed 176%
more people than the network-based method, a diﬀerence of nearly one million state-wide.
Using the 50,000 population minimum for an underserved area to be deemed an LAA, the
diﬀerences were even greater with the raster-based method identifying 142% more land area
and 279% more people in LAAs.
Because speed limit data were not available for Michigan roads, travel speeds were estimated using the available road attribute data. Although we presented a detailed hierarchical
speed limit classiﬁcation system, the unavailability of the true speed limits, the variability in
road surface types, and the large number of roads throughout the state make a perfect characterization of travel speeds impossible. Therefore, we tested each data model for sensitivity
to changes in the travel speed settings. The method using the raster data model was more
sensitive to the input speed limits of the roads data. Speciﬁcally, a small increase in travel
speed settings produced greater changes in the resulting underserved areas and population
identiﬁed when compared to the network-based method.
Messina et al. (2006) selected the raster-based method to fulfill the requirement that all areas of
the state be measured directly while assessing geographic access in Michigan.
However, we have illustrated that converting the roads data to a 1 km cell resolution
leads to a substantial loss of topological relationships due to the abstraction process. In
addition, the coarse resolution requires a decision rule to assign travel speeds to cells with
multiple roads present, resulting in a lower precision travel speed dataset. A reduction in cell
size would provide a travel speed surface more similar to the original roads data along with
better travel time estimates and more accurate routing results. Uncertainty associated with
travel speed classiﬁcation systems is always present in these kinds of large, unconstrained
travel models. Future applications of raster-based models of geographic access should explore
alternatives to the methods described here for assigning travel speeds to cells with multiple
roads and cells where no roads are present. An examination of the effects of
cell size is also warranted in future research, as it was not considered here.
As noted earlier, the conservative population assignment method currently employed in
Michigan likely overestimates the number of people in underserved areas (and thus in LAAs).
We implemented an alternative population assignment method using higher spatial resolution
data. Our ﬁndings suggest that the network-based method was more sensitive to the block
population data assignment method. This sensitivity is likely due to the overall smaller
underserved areas produced by the network-based method and its lack of a minimum size
filter as was employed in the raster-based method. However, this finding speaks more to the
population assignment method used by Michigan than to the results of the travel time
analysis. Thus, we believe that the overestimation of the population with limited geographic
accessibility, regardless of whether the network or raster-based method is employed, warrants
further evaluation.
Both the network and raster data models provide a valid structure for constructing travel
time models. A definitive conclusion regarding the superiority of one or the other is unjustified,
however, due to the lack of true reference data against which to compare each. Therefore, we recommend that, when measuring geographic access for health-related applications, researchers
consider how the data models and associated methods employed may influence
their results. Because the raster data model defines all areas as traversable, the raster-based
method appears more suitable when estimating travel time service areas for non-vehicular
travel modes or in regions where travel is not restricted to roads. For estimating vehicular-based travel time, we contend that the network data model provides a more accurate characterization of the topology governing vehicular travel. Therefore, for this travel mode,
we believe that the network-based method is the appropriate choice to identify areas with
limited geographic access to health care services.

Acknowledgements: This work was funded by the Michigan Department of Community Health, Certiﬁcate of Need Program. The authors would like to thank the two
anonymous reviewers for their helpful comments and Larry Horvath for his assistance in
determining the ﬁnal set of hospitals used in the analysis.

Regional health care planning: a methodology to cluster
facilities using community utilization patterns
3.1

Abstract

Background: Community-based health care planning and regulation necessitate grouping
facilities and areal units into regions of similar health care use. Limited research has explored the methodologies used in creating these regions. We oﬀer a new methodology that
clusters facilities based on community utilization patterns and geographic location. Case
study: Our case study focuses on Hospital Groups in Michigan, the allocation units used
for predicting future inpatient hospital bed demand in the state’s Bed Need Methodology.
We detail the scientiﬁc, practical, and political concerns that were considered throughout
the formulation and development of the methodology. Methods: The clustering methodology employs a 2-step K-means + Ward’s clustering algorithm to group hospitals. The
ﬁnal number of clusters is selected using a heuristic that integrates both a statistical-based
measure of cluster ﬁt and characteristics of the resulting Hospital Groups. Results: Using
recent hospital utilization data, the clustering methodology identiﬁes 35 Hospital Groups in
Michigan. After extensive research, review, and discussion, the new clustering methodology
was approved by Michigan’s Certiﬁcate of Need Commission to replace the state’s previous
methodology. Conclusions: Although it was developed within the politically charged climate of Certificate of Need regulation, the methodology we provide for creating Hospital Groups is objective, replicable, and sustainable.
Because the methodology is built upon theoretically sound principles of clustering analysis and health care service utilization, it is suitable
for grouping either facilities or areal units. Keywords: Health care utilization, hospital
planning, certiﬁcate of need, clustering, K-means, Ward’s.
Submission information: Submitted to Source Code for Biology and Medicine on
February 22, 2012. Submitted to BMC Health Services Research on April 1, 2012. Authors
on manuscript: Paul L. Delamater, Ashton M. Shortridge, Joseph P. Messina.

3.2

Background

Health care planning and regulation in the United States have generally attempted to achieve
two broad goals: 1) promote public health by ensuring that the supply of services meets
the population’s need and 2) contain health care costs by regulating the supply of services
to a level congruent with the need of the population. Regulation is often enforced through
state-level Certiﬁcate of Need (CON) programs, which attempt to enable a suﬃcient supply
of service to meet the population’s health care needs without providing a large oversupply
or duplication of services (Ferrier et al., 2010). CON programs require that proposals for
additional health care services or facilities demonstrate an unmet need prior to approval.
Although their merits have been questioned over the past 40 years (see US Federal Trade
Commission, 2004; Rivers et al., 2007; Ferrier et al., 2010) and they are no longer federally
mandated, CON programs persist throughout the US.
A number of states implement CON programs to regulate the supply of acute care hospitals, inpatient hospital beds, and hospital services (Langley et al., 2010). Considering that
the costs of hospital-based care make up a plurality of overall health care spending (Kaiser
Family Foundation, 2009), hospitals are a logical target for cost containment measures. Additionally, Roemer’s Law (Roemer, 1961) states that a bed built is a bed ﬁlled, implying that
an oversupply of hospital beds results in more and possibly unnecessary hospitalizations and
costs.
Health care services are used by people, but are supplied by health care professionals
who deliver these services at hospitals, clinics, and other facilities. Although the demand
for hospital services can be considered an attribute of people or populations, the supply
only exists at hospitals. In addition, the areal units used to aggregate populations rarely, if
ever, contain residents who use a single health care facility (Bay and Nestman, 1984). To
enable community-based planning of health care resources, communities and/or hospitals are
grouped to form regions of similar health care use. Thus, planning occurs at a regional level
wherein the supply of health care resources available to the community is measured against
community need. CON programs predict or evaluate the relationship between supply and
demand of hospital beds, necessitating methods or techniques for grouping both population
units and hospitals (e.g., Illinois General Assembly, 2012; New York State Department of
Health, 2012; North Carolina Department of Health and Human Services, 2012).
Very limited research emphasis has been placed on grouping or clustering hospitals based
on similarity in community utilization. Methods for clustering hospitals using multivariate
data received attention from health services researchers in the 1970s and 1980s. These
studies, however, were more focused on identifying hierarchical structure in the overall system
of hospitals or identifying similar hospitals for determining reimbursement levels (Berry Jr.,
1973; Elayat et al., 1978; Klastorin and Watts, 1981, 1982; Vertrees and Manton, 1986).
More recently, this research topic has been revived in response to changes in health care
delivery and organization (Dubbs et al., 2004; Luke, 2006; Zwanziger and Khan, 2008).
John Griﬃth, J. William Thomas, and colleagues explored the subject of service communities over 30 years ago (Griﬃth, 1972; Thomas, 1979; Thomas et al., 1981; Griﬃth et al.,
1981), providing a clustering methodology that groups communities and hospitals simultaneously. The State of Michigan adopted the Thomas methodology (Thomas et al., 1981)
for the creation of the state’s Subareas (see Figure 16), the allocation units used in Michigan’s Bed Need Methodology. In 2011, Michigan’s CON Commission recommended a review
of the Thomas Methodology. When implemented with current data, the methodology did
not produce an acceptable Subarea conﬁguration. Additionally, a number of theoretical and
practical issues were identiﬁed, raising concerns that the methodology was no longer suitable
to identify Subareas in light of the changes in hospital use and utilization patterns since its
adoption over 30 years ago.
A review of the literature provides little guidance toward alternative or improved methods
to group health care facilities. The branch of research most related to this particular problem
is the creation of small areas. Yet, the methods used to create small areas have received
little attention (Shwartz et al., 2001). Although multiple methods have been proposed
more recently to group communities into health service regions (e.g., Goodman et al., 2003;
Shortt et al., 2005; Klauss et al., 2005), they are extensions of the straightforward, yet
unsophisticated, plurality method employed by Wennberg and the Dartmouth Atlas group
(Wennberg and Gittelsohn, 1973).

[Figure 16 map, titled “Current Subareas”: hospital locations labeled by Subarea (e.g., 8A, 7B, 1H, 4G) over Health Service Areas HSA 1 through HSA 8.]

Figure 16: Michigan’s current Subareas. Labels indicate hospital location and Subarea
membership. Underlying colors represent Michigan’s Health Service Areas (HSA). The data
have been thinned for display purposes.

Here, we present a new clustering methodology that groups hospitals based on overall community utilization patterns and geographic location. The methodology is objective,
replicable, and sustainable, oﬀering a substantial improvement over the previous methodology. Furthermore, the methodology uses generally accepted clustering techniques and can be
easily transferred to create small areas for health service studies. The source code necessary
to replicate our clustering methodology is provided to ensure that the speciﬁc techniques we
employ are unambiguous (See Appendix B).
Our manuscript is organized as follows: First, we oﬀer a brief overview of clustering
analysis and methods. The overview provides an introduction to a number of topics that were
considered during the development of the clustering methodology. Next, we detail our case
study and discuss the scientiﬁc, practical, and political concerns that were encountered while
reviewing the Thomas Methodology and developing the new methodology. The clustering
methodology is then provided in detail. We present the resulting hospital clusters and discuss
the implications for adopting the methodology. Finally, we explore pathways in which our
methodology can be extended for use in other health service applications.
3.2.1

Clustering

The overall objective in most clustering analyses is to assign individual observations into
natural groups or clusters. Jain (2010, p. 652) states that the operational deﬁnition of
clustering is:
Given a representation of n objects, ﬁnd K groups based on a measure of similarity such that the similarities between objects in the same group are high while
the similarities between objects in diﬀerent groups are low.

A large majority of clustering algorithms can be described as either hierarchical or partitional
in nature. Hierarchical algorithms use an n x n similarity matrix to recursively form nested
clusters over all possible values of K. Partitional algorithms divide observations into a user-defined number of clusters and utilize an n x n similarity matrix or an n x m matrix of
observations, where n observations have m attributes or data dimensions.
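
The two input structures can be made concrete with a small sketch. It is written in Python for illustration only, the observation values are hypothetical, and Euclidean distance is assumed as the (dis)similarity measure; other measures could be substituted.

```python
import math

# Hypothetical n x m observation matrix: n = 4 observations, m = 2 attributes.
observations = [
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
    [0.0, 1.0],
]


def distance_matrix(rows):
    """Return the n x n matrix of pairwise Euclidean distances between rows."""
    return [[math.dist(a, b) for b in rows] for a in rows]


d = distance_matrix(observations)
n, m = len(observations), len(observations[0])
print(f"observation matrix: {n} x {m}; distance matrix: {len(d)} x {len(d[0])}")
```

A partitional algorithm such as K-means can operate directly on the n x m matrix, whereas a hierarchical algorithm would typically start from the derived n x n matrix.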
Applied cluster analysis requires the analyst to make a number of subjective decisions.
Prior to clustering, the attributes (or variables) used to describe similarity among observations must be determined, a potentially subjective process (Klastorin and Watts, 1981).
Additionally, a large number of clustering techniques exist, creating a “user’s dilemma” in
the technique selection process (Dubes and Jain, 1976). Finally, determining the number of
clusters or groups, K, is one of the most diﬃcult problems in cluster analysis (Steinley, 2006;
Jain, 2010). Milligan and Cooper (1987) provide a comprehensive review of clustering and
cluster analysis, oﬀering a seven-step structure to guide the clustering process.
3.2.2

Case study

In 2011, the State of Michigan CON Commission formed a Hospital Bed Standard Advisory
Committee (HBSAC) to investigate issues related to the state’s Hospital Bed Standards
(see Michigan Department of Community Health, 2009). One charge of particular concern
was to explore the methodology used to calculate the necessary supply of hospital beds
needed to meet the state’s future population demand (Bed Need Methodology). As part of
this charge, the HBSAC formed a working group that focused on the speciﬁc methodology
employed to create Subareas, the allocation units used in the state’s Bed Need Methodology.
The HBSAC working group was composed of various stakeholders in Michigan’s health care
industry including representatives from hospitals, hospital systems, and health insurance
providers. The authors were commissioned by Michigan’s Department of Community Health
(MDCH) to provide the HBSAC with technical and scientiﬁc support throughout the process
of reviewing the Thomas Methodology and to oﬀer alternative approaches to create Subareas
or other modiﬁcations of the Thomas methodology.
The HBSAC working group’s initial concerns revolved around: 1) the legitimacy of the
current Subarea conﬁguration, 2) the high frequency of single hospital Subareas in the current
conﬁguration (32 out of 64), 3) the plausibility of re-implementing the Thomas methodology,
which includes an initial automated clustering method and a secondary step where the
results are reviewed and modiﬁed by an expert panel, and 4) the suitability of the Thomas
methodology itself, given changes in health care delivery in the 30 years since its adoption.
Despite eﬀorts to trace the history of Michigan’s Subareas, we were unable to locate detailed
records or accounts of previous conﬁgurations. Outside of minor changes in 2002, we believe
that the Subarea conﬁguration had not undergone signiﬁcant modiﬁcation since the original
formulation in the late-1970s.
Although a detailed description of the methodology is oﬀered in Thomas et al. (1981)
and the Hospital Bed Standards (Michigan Department of Community Health, 2009), portions of the methodology remain cryptic. A similar problem was experienced by researchers
at Michigan State University when tasked with implementing Michigan’s Bed Need Methodology, as detailed by Langley et al. (2010). Therefore, the initial action required explicitly
deﬁning the Thomas Methodology (see Appendix A) and running the methodology with
up-to-date population and hospital utilization data. We used the R programming language
and environment (R Development Core Team, 2011) to complete this task. We selected R
because of the statistical, graphical, and data processing capabilities it provides. Other beneﬁts of using R are that it is a multi-platform open source language, is highly customizable,
and can be augmented with a number of additional packages.
As Figure 17 illustrates, the Thomas Methodology does not provide a solution resembling
the current Subarea configuration when implemented with recent hospitalization data.³ Most
notably, only 21 Subareas were identified. The dissimilarity can most likely be attributed to
changes in hospital utilization patterns that have occurred since the last time the methodology was run. However, because the original Thomas Methodology results have been modiﬁed
by an expert panel, we cannot state this with complete certainty. In addition, we identiﬁed
theoretical and methodological issues in the current methodology that raised concern.
These included an unreliable measure of hospital similarity; poorly deﬁned home areal units;
and subjective modiﬁcation by an expert panel.

3.2.2.1

Unreliable measure of hospital similarity

The Thomas Methodology clusters hospitals based on overlapping home areal units, defined by patient utilization patterns
expressed using Relevance Index (RI) values. For a hospital, $h_i$, RI values are defined at
each population unit (set of $j$ units) such that

$$RI_{i,j} = \frac{Pd_{i,j}}{Pd_{j}} \qquad (4)$$

where $Pd_{i,j}$ is the number of patient days used by residents of areal unit $j$ at hospital $i$ and
$Pd_{j}$ is the total number of patient days used by residents of areal unit $j$. Although RI is
calculated for each hospital, the measure actually provides more information about communities than about hospitals. Because the patient days are summed for the areal unit in
the denominator, the RI value describes the importance of the hospital to the community.
Thus, using RI values to compare hospitals provides little information about the similarity
of the overall hospital utilization patterns.

³ Interpreting the definition of the “home areal unit” of each hospital or cluster of hospitals
in the Thomas Methodology was especially problematic. The original manuscript is quite
vague in its discussion of home areal units. Unfortunately, the definition in the Hospital Bed
Standards does not offer clarification. Therefore, we implemented multiple versions of the
Thomas Methodology, each with a slightly different interpretation of the home areal unit.
Although each produced unique results, none provided Subareas that were similar to the
current configuration. The results presented in Figure 17 are based on defining the home areal unit as the
zip code in which the hospital is located. This implementation also allowed the algorithm
to run until clustering was completed.

[Figure 17 map, titled “Thomas Subareas”: hospital locations labeled by cluster number (1 through 21) or NG over Health Service Areas HSA 1 through HSA 8.]

Figure 17: Subareas produced by the Thomas Methodology using current data.
Thirty-four hospitals did not possess the required minimum home area to be included in the Thomas
Methodology. Because no details are provided by the methodology with regard to handling
these cases, they were removed from the clustering process. They have been assigned NG
(non-groupable) for display purposes. The data have been thinned for display purposes.

An alternative measure of utilization patterns, the Commitment Index (CI), is a hospital-based representation of patient utilization patterns. CI is defined for a hospital, $h_i$, at each
population unit (set of $j$ units) such that

$$CI_{i,j} = \frac{Pd_{i,j}}{Pd_{i}} \qquad (5)$$

where $Pd_{i}$ is the total number of patient days at hospital $h_i$. CI values measure the
importance of each population unit to the hospital. Unlike RI, CI values are not directly
influenced by the size of the hospital (as measured by number of inpatient beds). For
example, two hospitals located near each other, one small and one large, may have very
similar patterns of utilization when expressed as CI values (e.g., in Figure 18).
Although Griﬃth contends that RI is “more useful” than CI (Griﬃth, 1972), we ﬁnd
that only to be true for deﬁning service populations or exploring market penetration for
a single hospital. It has little utility for comparative purposes. Conversely, the CI values
provide a suitable measure of similarity among community utilization patterns. As Figure 18
illustrates, the two hospitals have very similar patterns of patient utilization, drawing comparable percentages of their total patients from the surrounding areal units. The Pearson’s
correlation coeﬃcient (r ) of the two hospitals’ CI values conﬁrms the similarity with near
perfect correlation (r = 0.975). Although correlation is also high between the hospitals’ RI
values (r = 0.855), the similarity in community utilization patterns is not nearly as apparent
due to the diﬀerences in magnitude of the RI values. Furthermore, the Euclidean distance
between the hospitals’ CI values is 0.034, whereas the distance between RI values is 0.437.
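
The difference between the two indices can be sketched with a toy example. The code below is illustrative Python, not the authors' implementation, and the patient day values are hypothetical: a large and a small hospital draw from the same three areal units in similar proportions, and a third row stands in for all other hospital use by those units.

```python
import math

# Hypothetical patient days: rows are hospitals, columns are areal units.
patient_days = [
    [5000, 3000, 2000],    # hospital 0: large
    [520, 290, 190],       # hospital 1: small, similar catchment
    [4480, 16710, 2810],   # hospital 2: stands in for all other hospital use
]


def relevance_index(pd):
    """RI[i][j]: patient days at hospital i divided by the total for areal unit j (Eq. 4)."""
    col_totals = [sum(col) for col in zip(*pd)]
    return [[v / t for v, t in zip(row, col_totals)] for row in pd]


def commitment_index(pd):
    """CI[i][j]: patient days from areal unit j divided by the total at hospital i (Eq. 5)."""
    return [[v / sum(row) for v in row] for row in pd]


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


ri = relevance_index(patient_days)
ci = commitment_index(patient_days)

# Compare the large and small hospital under each index.
for name, index in [("RI", ri), ("CI", ci)]:
    r = pearson(index[0], index[1])
    d = math.dist(index[0], index[1])
    print(f"{name}: r = {r:.3f}, Euclidean distance = {d:.3f}")
```

In this toy case the two hospitals' CI vectors nearly coincide, while their RI vectors differ in magnitude roughly in proportion to hospital size, so the Euclidean distance between RI vectors is much larger.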

3.2.2.2

Poorly deﬁned home areal units

In the Thomas Methodology, hospitals are clustered iteratively based on RI values in home areal units. However, these home areal
units are poorly deﬁned once hospitals have been clustered. Speciﬁcally, the home areal unit
of the entire cluster is assigned as the home areal unit of a single cluster member hospital.

[Figure 18 maps, four panels: Relevance Index, Hospital #1; Relevance Index, Hospital #2; Commitment Index, Hospital #1; Commitment Index, Hospital #2. Relevance Index legends range from 0% to 22%; Commitment Index legends range from 0% to 6.5%.]

Figure 18: RI and CI values for two hospitals of different sizes. Hospital #1 has
470 licensed inpatient beds and Hospital #2 has 94.

Because the methodology further clusters these groups based on overlap within the single
home areal unit, it does not acknowledge that multiple hospitals compose the cluster. This
results in scenarios where hospitals grouped into the same Subarea may share little to no
similarity. For example:
• Hospital A is clustered with Hospital B based on Hospital B’s RI in Hospital A’s home
areal unit.
• Once the hospitals are clustered to form Cluster AB, the home areal unit is assigned
as Hospital B’s home areal unit.
• When Cluster AB is further clustered with Hospital C, the criterion for clustering is
based on Hospital C’s RI value in Cluster AB’s home areal unit. Because Cluster AB’s
home areal unit was deﬁned as Hospital B’s home areal unit alone, overlap between
Hospital C and Hospital A’s home areal units is not considered.
In this scenario, Hospital C and Hospital A may share little or no similarity in the newly
formed Cluster ABC. Because the Thomas Methodology iterates until there is little overlap
among home areal units, this can lead to very large clusters (see Clusters #5 and #18 in
Figure 17) or geographically distorted clusters (see Cluster #10 in Figure 17).

3.2.2.3

Subjective modiﬁcation by expert panel

In the Thomas Methodology, the Subarea results provided by the clustering algorithm are passed along to an expert panel
for modiﬁcation. Thomas et al. (1981, p. 46) state:
Based on members’ knowledge of hospital relationships and other factors inﬂuencing the reasonableness of proposed groupings, the committee is asked to decide
whether the objectively determined clusters are in fact appropriate. ... Thus the
committee makes the ﬁnal determination, using the patient origin data analysis
as one important source of information.

Although this step oﬀers the potential to incorporate useful qualitative or local knowledge
into Subarea formulation, it also raises practical concerns with regard to implementation.
The Hospital Bed Standards do not provide guidance regarding the composition of the expert
panel or the scope of their charge. Additionally, by modifying the Subareas post hoc, the
original results of the Thomas Methodology are lost, leaving no record that would allow for
the utility of the automated method itself to be examined.

3.2.2.4

New methodology to cluster hospitals

After discussing the theoretical concerns and application-oriented limitations present in the Thomas Methodology, the HBSAC
working group opted to explore alternate approaches to creating Subareas, rather than
choosing to modify the parameters of the Thomas Methodology in such a way that the
methodology would provide reasonable results. In addition, the group decided to replace
the term, Subarea, with Hospital Group to better reﬂect the nature and speciﬁc use of these
units within the context of the overall Bed Need Methodology. For the remainder of this
manuscript Subareas will be referred to as Hospital Groups.
Our overall goal in creating the new Hospital Group methodology was for the method
itself to be as objective, replicable, and sustainable as possible. Considering the subjectivity present in clustering applications and the vast number of possible clustering methods,
we placed emphasis on the higher-level theoretical issues, rather than specific application-oriented concerns. Preliminary discussions with the HBSAC working group focused on the
identification of measurable hospital characteristics that could be used to compare and cluster similar hospitals. From this discussion, two characteristics were deemed the most
important: 1) that hospitals drew their patients from similar communities and 2) that hospitals were geographically proximate. Given these meta-parameters, we presented the HBSAC
working group with a variety of suitable clustering methods. Because CON-related proceedings have the potential to become highly political affairs, we initially presented only the
clustering methods themselves, rather than offering “results” of the methods. This left the
HBSAC to form their opinions based on the merits and appropriateness of the clustering
methods themselves, not the Hospital Groups they produced.

3.3

Methods

3.3.1

Overview

The new clustering methodology employs a 2-step K-means + Ward’s algorithm to create
Michigan’s Hospital Groups. This algorithm compares observations across multiple attribute
values, allowing for both community utilization patterns and hospital location to be evaluated
simultaneously in cluster formation. This requires specific patient hospitalization data and travel
distance measurements among hospitals. The methodology includes a heuristic
to determine the number of Hospital Groups, K, based on statistical measures of cluster
ﬁt and characteristics of the Hospital Group solution. We also include a set of techniques
to assign a new or proposed hospital to the existing Hospital Group solution in case this
scenario arises.
The source code used to implement the overall methodology4 can be found in Appendix
B. We utilize the R programming language using only base package functions to allow for
portability across operating systems. The code in Appendix B has also been modiﬁed slightly
from the actual code presented to the HBSAC in an eﬀort to make it more generalizable.
In the following sections, we provide a detailed description of each step in the clustering
methodology.
4 Although not discussed in the manuscript, the CON approved source code contains
additional steps to assign a numeric identifier to the resulting Hospital Groups based on
their geographic location and bed inventory.

3.3.2 Input data

The methodology requires georeferenced hospital utilization data. We employ data from the
Michigan Inpatient Database (MIDB), a nearly exhaustive record of the state’s inpatient
hospitalizations. Each patient record includes the discharging hospital, the zip code of the
patient’s residence, patient demographic information, and diagnostic codes. Using the most
recent three years of MIDB data, the number of patient days used at each of the n hospitals by
residents of each of the z Michigan zip codes is arranged in an n x z origin-destination (OD) matrix.
Three years of data are included to ensure that recent patterns of state-wide hospital utilization
are captured without the fluctuations possible in a single year. All existing hospitals
that reported their inpatient data to the MIDB for any portion of the three-year period are
included; reporting is essentially universal among the state’s hospitals. The n
x z matrix of patient days is converted to a CI matrix (for each hospital in n) using Eq. 5.
The geographic location of each hospital is represented as a 1 x n vector of the travel
distances to the other hospitals in the state. When consolidated, these vectors form an n x n
OD distance matrix. The use of an n-dimensional representation of location, in lieu of traditional
2-D locational attributes such as x,y geographic coordinates, is necessary to account
for Michigan’s particular physical characteristics and transportation infrastructure. Most
notably, Euclidean distance measurements may misrepresent true distances
among locations near shorelines. For example, using only x,y coordinates to define location,
hospitals in Michigan’s “thumb” region (HSA 6) in Figure 16 would be considered near hospitals
to their northwest, without accounting for the true magnitude of their separation due to
the Saginaw Bay. Distances among hospitals are calculated as travel distances on Michigan
roads using a custom-built network model (Delamater et al., 2012). After the n x n matrix
is assembled, the distance entries are rescaled from 0 to 1 by dividing each by the maximum
distance between any two hospitals. The rescaling process ensures that the ranges of values
in the hospital utilization matrix and the distance matrix are similar (Milligan and Cooper,
1988; Steinley et al., 2004).
The utilization matrix and distance matrix are joined to form a final data matrix containing
n rows (observations) with m = z + n attribute values per observation.
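
The assembly of the data matrix can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the Appendix B source code, which is in R. The function names and the toy values are hypothetical; the CI matrix is assumed to have been computed already.

```python
def rescale_distances(dist):
    """Divide every entry by the largest hospital-to-hospital distance."""
    d_max = max(max(row) for row in dist)
    return [[d / d_max for d in row] for row in dist]


def build_data_matrix(ci, dist):
    """Join the n x z utilization (CI) matrix and the rescaled n x n distance matrix."""
    if len(ci) != len(dist):
        raise ValueError("both matrices need one row per hospital")
    scaled = rescale_distances(dist)
    # Each observation gets z utilization values followed by n distance values.
    return [list(u) + s for u, s in zip(ci, scaled)]


# Toy inputs (hypothetical values): n = 3 hospitals, z = 2 zip codes.
ci = [[0.8, 0.1], [0.7, 0.2], [0.0, 0.9]]
dist = [[0.0, 10.0, 40.0], [10.0, 0.0, 50.0], [40.0, 50.0, 0.0]]
data = build_data_matrix(ci, dist)
print(len(data), len(data[0]))  # n rows, m = z + n columns
```

Dividing by a single global maximum keeps the relative spacing of hospitals intact while placing the distance columns on the same 0 to 1 range as the utilization columns.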


3.3.3 Clustering algorithm

The K-means clustering algorithm is employed as the primary method to create Hospital
Groups. The specific algorithm employed is that of Hartigan and Wong (1979), the default
option in the kmeans() function included with base R. Given a set of n observations with m
associated attribute values to be partitioned into K clusters, K-means attempts to find the cluster
solution (C) that minimizes the sum of the squared errors (J(C)) between cluster members
(x_i) and the center (\mu_k) of their associated cluster (c_k) over all clusters,

J(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \lVert x_i - \mu_k \rVert^2        (6)


Although their origins are closely related, two distinct characteristics of the basic K-means
algorithm raised concerns for identifying Hospital Groups. First, solving Eq. 6 exactly is
an NP-hard problem (Drineas et al., 2004), so an exact solution cannot, in practice, be computed
in any acceptable amount of time. Thus, K-means relies on a search algorithm to approximate
the solution and likely finds a locally optimal solution, rather than the globally optimal
solution (Steinley, 2003; Jain, 2010). Second, the basic K-means method employs a random
initialization procedure for the search algorithm. Given that the input data were of high
dimensionality, the resulting Hospital Group solution identified by the randomly initiated
K-means algorithm would likely vary between model runs. Therefore, the results would not
be reproducible.
To examine the variability associated with the random initialization of K-means and to check for
the presence of local optima, we initially grouped the hospitals into 50 clusters using 5,000
random starting locations. Although there were roughly 9 x 10^203 possible solutions (see footnote 5), the
observed variability in the output cluster solutions was much higher than initially expected;
each random start provided a unique 50 cluster solution.
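
The size of the solution space quoted above can be checked directly from the K^N / K! approximation with K = 50 and N = 158. The short Python sketch below is illustrative only; the function name is hypothetical.

```python
from math import factorial


def approx_partition_count(n_obs, k):
    """Approximate number of ways to split n_obs observations into k clusters (K^N / K!)."""
    return k ** n_obs // factorial(k)


count = approx_partition_count(158, 50)
exponent = len(str(count)) - 1  # power of ten of the leading digit
print(f"roughly {str(count)[0]}.{str(count)[1:3]} x 10^{exponent}")
```

Python integers have arbitrary precision, so the count is evaluated exactly rather than through floating-point logarithms.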
5 Based on K^N / K! (Kaufman and Rousseeuw, 2005), where K = 50 and N = 158.

To stabilize the clusters provided by the K-means algorithm, we “seed” it with rational
starting locations in lieu of using the random start method (Milligan, 1980). Ward’s hierarchical
clustering algorithm (Ward, 1963) was employed to initially cluster the hospitals and
provide the seed locations. The cluster centers produced by Ward’s algorithm are a K x m
set of locations that define the central location of each cluster in m-dimensional space. They
are used as initial locations in the K-means search algorithm, creating a 2-step K-means +
Ward’s clustering algorithm. Because Ward’s algorithm provides deterministic results, this
effectively and efficiently removed the stochastic element present in K-means initialization.
In addition, for K = 50, the cluster solution identified by K-means + Ward’s provided a
superior fit to solutions from all 5,000 model runs using K-means with random starts (see
Figure 19). Although we cannot confirm that the K-means + Ward’s algorithm provided
the globally optimal solution, we are encouraged that a single model run produced such a
large improvement in the fit of the cluster solution.
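
The two-step idea can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the Appendix B code: Ward's merging is implemented naively, and the K-means step uses plain Lloyd iterations rather than the Hartigan and Wong (1979) algorithm used in the methodology. All function names and the toy data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(rows):
    """Column-wise mean of a list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def ward_centers(data, k):
    """Merge clusters by Ward's criterion until k remain; return their centers."""
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid([data[i] for i in clusters[a]])
                cb = centroid([data[i] for i in clusters[b]])
                na, nb = len(clusters[a]), len(clusters[b])
                # Increase in within-cluster sum of squares if a and b merge.
                cost = na * nb / (na + nb) * sq_dist(ca, cb)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [centroid([data[i] for i in c]) for c in clusters]


def kmeans_from_seeds(data, seeds, max_iter=100):
    """Lloyd-style K-means started from fixed seed centers (no random start)."""
    centers = [list(c) for c in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
            for x in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = centroid(members)
    return labels, centers


# Toy data (hypothetical): six observations, two attributes.
data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]]
labels, centers = kmeans_from_seeds(data, ward_centers(data, 3))
print(labels)
```

Because neither step draws random numbers, repeated runs on the same data return the same labels, which is the property the seeding is meant to provide.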
3.3.4 Determining the number of Hospital Groups

As was discussed earlier, one of the more diﬃcult problems facing any applied cluster analysis
is determining the number of clusters in which to group the data. Researchers have noted that
the selection of K is largely subjective (Elayat et al., 1978), may be politically inﬂuenced
(Klastorin and Watts, 1981), or completed by an analyst with expert domain knowledge
(Jain, 2010). The members of the HBSAC working group were steadfast that the number
of Hospital Groups (K ) should be derived from the data itself, not explicitly predetermined
prior to the clustering process nor modified after clustering is completed. However, no
method or measure exists to definitively answer the question, “how many clusters should
the data be grouped into?” Therefore, in conjunction with the HBSAC working group, we
developed a heuristic to determine the number of Hospital Groups, incorporating a statistical
approach along with a set of decision rules.
Figure 19: Local minima and random starting locations with the K-means algorithm. 5,000 K-means
model runs for K = 50 produced 5,000 unique cluster solutions (black
line). A single model run of the K-means + Ward’s algorithm provided another unique cluster
solution with a better fit (red point) than any of the 5,000 stand-alone K-means solutions.
[Plot axes: cluster fit (R^2), shown from 0.965 to 0.980, against the 5,000 K-means random starts sorted by R^2.]

We define k as the set of integer values from 2 to n-1. A Hospital Group solution is
created for each value in k using the K-means + Ward’s clustering algorithm, allowing all
possible values of K to be evaluated. The first step in the heuristic to determine the final
value of K is to calculate the incremental F statistic (incF) (Gujarati, 1988) for each solution
in k,

incF_i = \frac{(R_i^2 - R_{i-1}^2) / (k_i - k_{i-1})}{(1 - R_i^2) / (n - (k_i - 1))}        (7)

where

R^2 = 1 - (RSS / TSS).        (8)

RSS and TSS are the residual sum of squared errors and total sum of squared errors, respectively,
calculated for each cluster solution in k (J(C) from Eq. 6 is equal to RSS). R^2 is an
overall measure of the “fit” of the cluster solution to the original data. The incremental F
statistic measures only the amount of fit gained from allowing an additional cluster in the
solution, while also penalizing for adding this additional cluster. Because increasing K will
almost certainly improve the R^2 of the cluster solution, incF offers a measure that incorporates
both fit and K. Initial candidate solutions are selected by identifying those with local
maxima in incF (all solutions where incF_k > incF_{k-1} and incF_k > incF_{k+1}).
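
The candidate selection step can be sketched as below. This Python illustration follows Eq. 7 and the local-maximum rule as stated; the function names, the toy R^2 values, and the choice to start incF at the second solution in the list are assumptions made for the example, not details taken from the Appendix B code.

```python
def incremental_f(r2, ks, n):
    """incF for each solution after the first one listed, following Eq. 7."""
    inc_f = {}
    for i in range(1, len(ks)):
        gain = (r2[i] - r2[i - 1]) / (ks[i] - ks[i - 1])
        resid = (1.0 - r2[i]) / (n - (ks[i] - 1))
        inc_f[ks[i]] = gain / resid
    return inc_f


def local_maxima(inc_f):
    """K values whose incF exceeds the incF of both neighbouring K values."""
    ks = sorted(inc_f)
    return [
        ks[i]
        for i in range(1, len(ks) - 1)
        if inc_f[ks[i]] > inc_f[ks[i - 1]] and inc_f[ks[i]] > inc_f[ks[i + 1]]
    ]


# Toy fit values (hypothetical): R^2 for K = 2..7 with n = 20 observations.
ks = [2, 3, 4, 5, 6, 7]
r2 = [0.50, 0.60, 0.75, 0.80, 0.88, 0.90]
inc_f = incremental_f(r2, ks, n=20)
candidates = local_maxima(inc_f)
print(candidates)
```

In the toy data, R^2 rises with every added cluster, yet only the solutions where the gain outpaces that of both neighbours are kept as candidates.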
After the initial candidate solutions are identiﬁed, a set of decision rules is employed
to select the final value of K. The HBSAC working group offered two qualifications for a
suitable Hospital Group configuration: 1) that no individual Hospital Group contains more
than 20 hospitals and 2) that the number of “single hospital” Hospital Groups is minimized.
First, all initial candidate solutions where any single Hospital Group contains more than
20 hospitals are removed. Next, for each of the remaining solutions, the number of single
hospital Hospital Groups is noted. The solution(s) having the minimum number of single
hospital Hospital Groups is/are retained. If multiple solutions meet these criteria, the ﬁnal
solution is selected by choosing the candidate with the maximum K from the remaining
solutions.
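
The decision rules can be expressed compactly. The Python sketch below is illustrative; the function name is hypothetical, and the (K, SH, Max) values are a subset of the candidate rows reported in Table 9.

```python
def select_k(candidates, max_size=20):
    """Apply the decision rules to (K, SH, Max) tuples and return the chosen K."""
    # Rule 1: drop solutions where any Hospital Group exceeds max_size hospitals.
    allowed = [c for c in candidates if c[2] <= max_size]
    if not allowed:
        raise ValueError("no candidate satisfies the size limit")
    # Rule 2: keep the solutions with the fewest single-hospital groups.
    fewest = min(c[1] for c in allowed)
    # Tie-break: choose the largest K among those remaining.
    return max(c[0] for c in allowed if c[1] == fewest)


# (K, SH, Max) rows taken from Table 9.
candidates = [
    (21, 0, 36), (26, 1, 24), (29, 1, 17), (33, 1, 17),
    (35, 1, 17), (38, 2, 17), (40, 4, 17),
]
chosen = select_k(candidates)
print(chosen)
```

The order of the rules matters: the 21 cluster solution has no single-hospital groups but is removed first because its largest group exceeds 20 hospitals.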


3.3.5 New hospital assignment

The HBSAC working group requested that the new clustering methodology include steps to
assign a new or prospective hospital to the existing set of Hospital Groups. In the Thomas
Methodology, this task was accomplished by re-running the entire methodology with market survey
data (projected RI values for the new hospital) added as a new observation. The HBSAC
working group members doubted the veracity of these survey data and requested a simplified
approach that did not require a market survey or re-running the clustering methodology. We
designed a method wherein a new hospital is assigned to an existing Hospital Group using
only geographic location.
A geocoded location of the new hospital is required to calculate the travel distance from
the new hospital to each existing hospital. These distances are placed in a 1 x n vector,
which is rescaled using the maximum distance between any two hospitals in Michigan (see
Input data) and arranged such that the entries are in the same order as the entries in the
original travel distance matrix.
Like Ward’s algorithm, the 2-step K-means + Ward’s algorithm produces a K x m
matrix of cluster centers. The cluster centers from the Hospital Group solution are subset
to only those columns corresponding to the travel distance attributes (column numbers z+1
to m), resulting in a K x n matrix. This subset represents the geographic location of the
existing Hospital Group centers in n-dimensional space.
The Euclidean distance (d) from the new hospital to an existing Hospital Group center
is calculated as

d = \sqrt{\sum_{i=1}^{n} (c_i - h_i)^2}        (9)

where c_i is the i-th entry of the subset cluster center for the Hospital Group and h_i is the
i-th entry of the rescaled distance vector for the new hospital. A d value is calculated from
the new location to each existing Hospital Group. The new hospital is assigned to the Hospital
Group having the minimum d value.
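
A minimal Python sketch of the assignment step follows. It is an illustration under stated assumptions, not the Appendix B code: the function name, argument names, and toy values are hypothetical, and the cluster centers are assumed to store the z utilization columns first, followed by the n distance columns.

```python
from math import sqrt


def assign_new_hospital(new_dist, max_dist, centers, z):
    """Return (group index, d) for the Hospital Group center nearest the new hospital."""
    # Rescale with the same maximum distance used for the existing hospitals.
    h = [d / max_dist for d in new_dist]
    best_group, best_d = None, float("inf")
    for group, center in enumerate(centers):
        c = center[z:]  # keep only the travel distance columns (z+1 to m)
        if len(c) != len(h):
            raise ValueError("distance vector and cluster center differ in length")
        d = sqrt(sum((ci - hi) ** 2 for ci, hi in zip(c, h)))  # Eq. 9
        if d < best_d:
            best_group, best_d = group, d
    return best_group, best_d


# Toy example (hypothetical): z = 2 zip codes, n = 3 existing hospitals, K = 2 groups.
centers = [
    [0.8, 0.1, 0.05, 0.05, 0.90],
    [0.0, 0.9, 0.90, 0.90, 0.00],
]
group, d = assign_new_hospital([5.0, 8.0, 45.0], max_dist=50.0, centers=centers, z=2)
print(group)
```

The utilization columns are ignored on purpose, so no projected patient data are needed for the new hospital.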


3.4 Results

We implemented the new Hospital Groups clustering methodology using inpatient hospitalization
data from 2007 to 2009, which included 169 acute care hospitals. A small number of
hospitals reported their inpatient data to the MIDB in tandem with another hospital or set
of hospitals. The hospitals reporting together are owned by the same health care system and
are located very near each other geographically. Therefore, these were treated as a single
observation for the purposes of clustering (see footnote 6). Two hospitals did not report any patient records
to the MIDB and were removed prior to clustering. The final data matrix consisted of 158
observations with 1065 attributes (CI values for 905 zip codes and rescaled travel distance
to 160 hospital locations).
A Hospital Group solution was created using the 2-step K-means + Ward’s algorithm
for each value of K from 2 to 157. We implemented the heuristic to select the number of
Hospital Groups for the final solution. Forty-nine initial candidate solutions were identified using
incF values (see Figure 20 and Table 9). Next, candidate solutions with fewer than 29 clusters
were removed because at least one of their Hospital Groups contained more than 20 hospitals. From
the remaining candidate solutions, the minimum number of single hospital Hospital Groups
was 1. Therefore, all solutions with more than 35 clusters, each of which contained more than one
single hospital Hospital Group, were removed from consideration.
From the remaining candidate solutions, 35 was the maximum value of K and was selected as
the final Hospital Group solution (see Figure 21).

6 Because these hospitals were each associated with a unique geographic location, their
travel distance measurements were slightly dissimilar. To calculate the travel distances
for the grouped set, we took the mean of the travel distances for the hospitals comprising the group. However,
when calculating the number of “single hospital” Hospital Groups during the clustering
methodology, the grouped set was not considered a single facility.

Figure 20: Initial candidate solutions for Hospital Groups. Data are truncated for
display purposes. Red points represent local maxima in incF values.
[Plot axes: incremental F statistic against number of clusters.]

Figure 21: Hospital Groups created using new clustering methodology. The data
have been thinned for display purposes.
[Map labels: Hospital Group numbers and HSA regions.]

Table 9: Initial candidate solutions. SH are the number of single hospital clusters in the
overall solution and Max is the maximum number of hospitals in any cluster in the Hospital
Group solution. Solutions with less than 29 clusters have Max > 20 and were removed
from consideration. From the remaining solutions, the minimum SH value was 1. Therefore,
solutions with SH > 1 were removed from consideration. From the remaining 3 solutions
(29, 33, 35), the 35 cluster solution was the maximum K and selected as the final Hospital
Group solution.

Clusters    incF     SH    Max
       3   94.78      0     91
       8   25.39      0     58
      11   14.58      0     48
      16    7.43      0     48
      18    7.10      0     45
      21    9.49      0     36
      26    6.30      1     24
      29    6.10      1     17
      33    3.73      1     17
      35    4.28      1     17
      38    4.16      2     17
      40    4.34      4     17
      42    4.01      7     17
      45    4.03      8     16
      47    3.61      8     16
      50    3.93     10     16
      52    3.53     12     16
      54    3.88     14     13
      57    3.55     17     12
      59    3.99     20     12
      63    4.09     25     12
      65    4.03     28     12
      68    4.14     32     12
      70    4.27     35     12
      73    4.56     38     12
      77    4.46     43      9
      80    4.06     45      8
      82    4.32     47      7
      87    4.24     54      7
      92    3.98     63      7
      96    4.18     70      7
     100    5.03     75      7
     104    5.43     80      7
     107    5.06     83      7
     109    3.94     86      7
     112    3.61     89      7
     114    3.76     91      7
     117    2.93     96      7
     121    3.06    101      7
     124    2.83    103      5
     126    2.62    107      5
     128    2.46    109      5
     130    2.41    112      5
     135    2.17    119      5
     137    1.95    120      5
     141    2.12    126      5
     149    2.10    136      3
     154    2.38    145      3
     156    2.64    149      3

To explore the stability of the Hospital Groups provided by the methodology, we
re-created a 35 Hospital Group solution using data from 2004 to 2006. This allowed us to
test the resulting Hospital Groups with data from an independent time period with no
overlapping years. Because a small number of hospitals closed and opened during this time
frame, after clustering, the hospitals were normalized such that only hospitals open during
both time periods were compared. The normalization step was completed post-clustering so as
not to influence the results of the 2004-2006 Hospital Group solution. Overall, the two 35
Hospital Group solutions were in agreement on 93.37% of hospitals (155 of 166 hospitals).
Thirty of the 35 Hospital Groups produced using the 2004-2006 data were an exact match (both
group size and hospital membership) with their counterparts from the 2007-2009 data.
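
One way to count exact-match groups between two solutions is sketched below in Python. This is an assumption-laden illustration: the function names and toy hospital labels are hypothetical, and the sketch covers only the exact-match count after normalization, since the procedure behind the hospital-level agreement figure is not described here.

```python
def as_groups(solution, keep):
    """Turn {hospital: label} into a set of frozensets, restricted to hospitals in keep."""
    groups = {}
    for hospital, label in solution.items():
        if hospital in keep:
            groups.setdefault(label, set()).add(hospital)
    return {frozenset(members) for members in groups.values()}


def exact_group_matches(solution_a, solution_b):
    """Count groups of solution_a whose membership is identical in solution_b."""
    # Normalization: compare only hospitals present in both time periods.
    common = set(solution_a) & set(solution_b)
    groups_a = as_groups(solution_a, common)
    groups_b = as_groups(solution_b, common)
    return len(groups_a & groups_b), len(groups_a)


# Toy solutions (hypothetical): hospital X closed and hospital Y opened between periods.
early = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 3, "X": 3}
late = {"A": 7, "B": 7, "C": 8, "D": 9, "E": 9, "Y": 8}
matched, total = exact_group_matches(early, late)
print(matched, total)
```

Comparing membership sets rather than labels avoids any dependence on how the groups happen to be numbered in each run.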

3.5 Discussion

Following extensive review, the HBSAC voted unanimously to recommend that the new clustering
methodology for Hospital Groups be adopted into Michigan’s Hospital Bed Standards.
The recommendation was presented to the state’s CON Commission, which voted
unanimously to move the methodology forward to the public comment stage. After allowing
for public comments, the new methodology was again approved by the CON Commission to
be adopted into Michigan’s Hospital Bed Standards (see footnote 7).
Using the 2007-2009 utilization data, the new clustering methodology reduced the number
of Hospital Groups in Michigan from 64 to 35. During development of the methodology,
the HBSAC strongly believed that the number of single hospital Hospital Groups in the
state should be decreased. Therefore, an emphasis was placed on this characteristic in the
heuristic to select the ﬁnal number of Hospital Groups. In the 35 Hospital Group solution,
only one Hospital Group contained a single hospital (2.86% of the groups). This result was
substantially diﬀerent than the current conﬁguration wherein 50% of the 64 Subareas contain
a single hospital.
7 The CON Commission is scheduled to meet and approve the methodology in June, 2012.

The overall fit of the original 64 Subarea configuration is slightly better than that of the
35 Hospital Group solution (R^2 = 0.984 vs. R^2 = 0.973). However, this is not entirely
surprising, given that model fit is influenced heavily by the number of clusters. Using the
F statistic, which incorporates both R^2 and the number of clusters, we found that the
35 Hospital Group solution (F = 130.21) outperformed the original configuration (F =
92.86). It is important to note that a direct comparison of the 64 Subarea and the 35
Hospital Group solutions can be somewhat misleading given that they were created with very
disparate methods and do not have a similar number of clusters. Methods and procedures to
evaluate clustering methods or algorithms generally compare cluster solutions with the same
number of clusters or compare the cluster solutions to a random clustering of observations.
Therefore, a more appropriate statistical test of the methods would require that the number
of output clusters be similar. For example, the fit of a K-means + Ward’s 64 cluster solution
(R^2 = 0.989 and F = 135.71) is better than that of the current 64 Subarea configuration. Yet,
caution is also warranted in interpreting these results considering the uncertainty surrounding
the modification performed on the original output of the Thomas Methodology in the 64
Subarea configuration.
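
The F statistic is not defined in this section. As a hedged illustration, the Python sketch below assumes the conventional ratio F = (R^2 / (K - 1)) / ((1 - R^2) / (n - K)) with n = 158 observations; both the formula and the value of n are assumptions. With the rounded R^2 values quoted above, it gives values close to, but not identical to, the reported statistics, which would be consistent with the published R^2 values being rounded to three decimals.

```python
def f_statistic(r2, k, n):
    """Conventional F ratio for a k-cluster solution fitted to n observations (assumed form)."""
    return (r2 / (k - 1)) / ((1.0 - r2) / (n - k))


N_OBS = 158  # assumed: number of observations in the final data matrix

solutions = {
    "35 Hospital Groups": (0.973, 35),
    "64 Subareas": (0.984, 64),
    "64 cluster K-means + Ward's": (0.989, 64),
}
f_values = {name: f_statistic(r2, k, N_OBS) for name, (r2, k) in solutions.items()}
for name, value in f_values.items():
    print(f"{name}: F is approximately {value:.1f}")
```

The ordering of the three solutions under this assumed formula matches the ordering reported in the text.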
Comparing the statistical fit of the 64 Subareas to the 35 Hospital Groups (or a 64 cluster
K-means + Ward’s solution) does not provide a suitable evaluation of the two methods
in light of the final purpose for allocating beds in the Bed Need Methodology. Rather, the
small number of clusters produced by the Thomas Methodology, when implemented with
recent hospitalization data, speaks more to the overall utility of the Thomas Methodology
itself. Therefore, the most basic advantage provided by the new clustering methodology is
that it produces a usable and actionable number of Hospital Groups.
While the new methodology was generally lauded by members of the HBSAC and CON
Commission, there are potential implications for hospital bed distribution within the state.
The reduction of the number of single hospital Hospital Groups assumes a more regional
view of community-based need than the previous conﬁguration. While the initial move
toward more regional-level planning and regulation units is consistent with other states’
CON programs, the actual consequences for inpatient hospital bed distribution and access
in Michigan remain to be seen. We are encouraged, however, by our preliminary tests
showing that the 35 Hospital Group conﬁguration did not substantially alter predictions of
the state’s future bed demand.
Another issue to consider is the use of alternative data for clustering hospitals. Because
the focus of this application is to deﬁne Hospital Groups for inpatient hospital bed planning,
we chose only to include inpatient hospitalization data. However, other measures such as the
American Hospital Association’s case-mix adjusted discharges may be explored in the future.
Adjusted discharges incorporate both inpatient and outpatient hospital visits, possibly offering a more complete characterization of community health care utilization. Additionally,
raw inpatient days do not provide insight into the eﬃcacy of the hospitalizations or their
overall contribution to public health (Thomas et al., 1983). For our speciﬁc application, we
do not consider the use of inpatient hospitalization data as a limitation. However, we do
acknowledge the limitations of these data and future research would beneﬁt from exploring
alternative data sources for clustering hospitals.
While the clustering methodology was designed speciﬁcally to create groups of hospitals,
the concepts are transferable to the creation of health service areas or small areas. One
of the most notable topics in health services research over the past 30 years has been the
exploration of small area variation in health care utilization (Wennberg and Gittelsohn, 1973;
Wennberg, 2005), spending (Fisher et al., 2003), and outcomes (Welch et al., 2011) in the
US. These studies often rely on an aggregation method wherein small areas are formed by
grouping disaggregated population units into larger regions based on similarity in health
services use. The method implemented by Wennberg and colleagues at Dartmouth employs
a simple plurality rule, grouping areal units based on a single CI value, not their overall
patterns of utilization (Wennberg and Gittelsohn, 1973). In rural communities, this process
is generally straightforward considering that much of the population’s health care needs are
provided by a single facility. Because urban areas often contain a greater number of facilities,
service use by any given community is often distributed similarly among facilities (Thomas
et al., 1981), complicating small area creation and/or service area deﬁnition. Using our
clustering methodology, community utilization patterns can be expressed as the CI values
from areal units to hospitals. The areal units could then be clustered into regions of similar
hospital use, where the overall utilization patterns and location are considered. However, we
note that an additional step would be required to link the clustered areas to speciﬁc hospitals
or groups of hospitals using this methodology.

3.6 Conclusions

The goal of our new clustering methodology to create Hospital Groups was for it to be as
objective, replicable, and sustainable as possible. Given the politically and economically
charged climate surrounding CON regulation in Michigan, a full recasting of the theoretical
approach to cluster hospitals was no small undertaking. A number of possible clustering
methods were presented to the HBSAC working group and each could be considered “objective” given that they are data-driven. However, we believe that placing our focus on the
concepts of hospital similarity and the theoretical underpinnings of the methods, rather than
results, allowed for a politically objective overall methodology to emerge. In addition, we
implement a heuristic that selects the ﬁnal number of Hospital Groups based on desirable
characteristics of the solutions instead of relying on a predeﬁned number. The use of a
heuristic does not completely remove all subjectivity from our methodology; the HBSAC
working group members determined which characteristics were acceptable for selecting the
ﬁnal number of clusters. However, by including the decision rules in the methodology, the
new clustering methodology provides a level of transparency that was not present in the
post-clustering modiﬁcation step of the previous methodology.
Two distinct interpretations of “replicable” are fulﬁlled by the clustering methodology.
First, by integrating the K-means and Ward’s clustering algorithms, we have eﬀectively
removed the unconstrained stochastic element associated with random starting locations in
K-means. Each time the methodology is run with the same data, it will produce the same
final Hospital Group solution (both the configuration of the Hospital Groups and the number
of Hospital Groups). Second, by supplying the source code necessary to implement the methodology,
we have provided it in an unambiguous format. Additionally, the methodology is built upon
well-known clustering algorithms, allowing it to be implemented in other statistical packages.
We examined the sustainability of the clustering methodology by creating a 35 Hospital
Group solution using hospitalization data from 2004-2006. The high level of agreement in
the composition and size of the resulting Hospital Groups suggests that the methodology
captures long-term community hospital utilization patterns in Michigan. Therefore, when
the clustering methodology is run in the future, Hospital Group conﬁguration will not change
dramatically unless community utilization patterns have signiﬁcantly changed.
We believe that the appropriate levels of consideration were given to the scientiﬁc, practical, and political concerns encountered during the developmental process. The new clustering methodology oﬀers substantial improvement over the previous methodology, as it is
unambiguously actionable and produces superior results. Furthermore, the methodology is
generalizable such that it is suitable for clustering either facilities or areal units within a
variety of health care service applications.

Acknowledgements: This research was supported by the Michigan Department of
Community Health, Certiﬁcate of Need Program. Additionally, the authors would like to
thank Dr. Bruce Pigozzi for his insightful suggestions regarding clustering methodologies
and evaluation procedures.
See Appendix C for Blue Cross Blue Shield’s testimony regarding the proposed clustering
methodology.

Do more hospital beds lead to higher hospitalization
rates? A spatial examination of Roemer’s Law
4.1 Abstract

Background: Roemer’s Law, a widely cited principle in health care policy, states that
hospital beds that are built tend to be used. This simple but powerful expression has been
invoked to justify Certificate of Need regulation of hospital beds in an effort to contain health
care costs. Despite its influence, a surprisingly small body of empirical evidence supports
its content. Furthermore, known geographic factors influencing health services use and the
spatial structure of the relationship between hospital bed availability and hospitalization
rates have not been sufficiently explored in past examinations of Roemer’s Law. We pose
the question, “Accounting for space in health care access and use, is there an observable
association between the availability of hospital beds and hospital utilization?”

Methods: We employ an ecological research design based upon the Andersen behavioral model of health
care utilization. This conceptual model is implemented in an explicitly spatial context. The
effect of hospital bed availability on the utilization of hospital services is evaluated, accounting
for spatial structure and controlling for other known determinants of hospital utilization.
The stability of this relationship is explored by testing across numerous geographic scales
of analysis. The case study comprises an entire state system of hospitals and population,
evaluating over one million inpatient admissions.

Results: We find compelling evidence
that a positive, statistically significant relationship exists between hospital bed availability
and inpatient hospitalization rates. Additionally, the observed relationship is invariant with
changes in the geographic scale of analysis.

Conclusions: This study provides evidence
for the effects of Roemer’s Law, thus suggesting that variations in hospitalization rates have
origins in the availability of hospital beds. This relationship is found to be robust across
geographic scales of analysis. These findings suggest continued regulation of hospital bed supply
to assist in controlling hospital utilization is justified.

Keywords: Roemer’s Law; hospital
utilization; supplier-induced demand; access; spatial accessibility; Certificate of Need

Submission information: Submitted to PLoS Medicine on May 3, 2012. Submitted
to PLoS ONE on June 12, 2012. Authors on manuscript: Paul L. Delamater, Joseph P.
Messina, Sue C. Grady, Vince WinklerPrins, and Ashton M. Shortridge.

4.2 Introduction

Roemer’s Law famously and simply states, hospital beds that are built tend to be used (Shain
and Roemer, 1959, p.71). Although the authors’ original intent behind the statement is
debatable, the most common interpretation is that as the supply of hospital beds increases,
the use of hospital services also increases. Roemer’s Law has fostered the belief that excess
hospital beds lead to an overutilization of hospital services, in which the observed demand
outpaces the population’s actual need for services (Mulley, 2009). Hospital utilization rates
rise, therefore, due to higher levels of inpatient admissions, which may or may not lead to
longer stays, contributing to higher costs. Wennberg (2005) suggests that Roemer’s Law may
be due to physicians being influenced by a subliminal knowledge regarding the availability
of hospital beds.
In the USA, the high costs of inpatient hospitalizations, in conjunction with the generally
accepted implications of Roemer’s Law, serve as the justiﬁcation for state-based Certiﬁcate
Of Need (CON) programs. CON programs are independent entities that are responsible for
regulation of the supply of health care services such that the supply meets the population’s
health care needs without an oversupply or duplication of services. Given that the plurality
of overall health care expenditure in the USA is for inpatient hospital care (Kaiser Family
Foundation, 2009), hospitalizations, and thus hospitals, are logical candidates for cost control
measures. Supply is regulated by CON programs (Ferrier et al., 2010) wherein an unmet
demand for services must be demonstrated prior to CON approval of new expenditures for
hospital construction or expansion. Currently in the USA, 35 states have some form of
CON program with 28 states speciﬁcally regulating the supply of acute care hospital beds
(National Conference of State Legislatures, 2011).


Roemer’s Law deﬁnes a positive relationship between the availability of hospital beds and
the use of hospital services. Past research has provided support for the eﬀects of Roemer’s
Law (e.g., Ginsburg and Koretz, 1983; Harris, 1975; Kroneman and Siegers, 2004; Pasley
et al., 1995; Shwartz et al., 2011; Wennberg, 2005), while other research has found conﬂicting
(e.g., Alexander et al., 1999; Rohrer, 1990; van Doorslaer and van Vliet, 1989) or inconclusive
results (e.g., Clark, 1990). The intertwined relationships among population health, access,
use of health care services, and outcomes provide a number of research dilemmas, both
theoretically and methodologically. Perhaps the most difficult dilemma is defining and
characterizing the availability of hospital beds. Although counting the number of beds in a
hospital is trivial, measuring the overall availability of those beds to a population is much
more complex and is influenced by distance, demand, and access-related factors. Measures
of hospital bed availability such as container-based metrics or simple distance (Joseph and
Phillips, 1984; Guagliardo, 2004) ignore the multifaceted nature of access and the spatial and
geographic nature of health care service use. Others have noted that the observed eﬀects of
Roemer’s Law may be due to oversimpliﬁed methods used to assign hospital beds to regions
(Folland and Stano, 1990). In addition, statistical methods that do not incorporate spatial
structure in the relationship between access and utilization risk producing misestimated
results due to the effects of spatial autocorrelation.
As Wennberg and colleagues (1999, p.2) have noted, in American health care, geography
is destiny. The important role of spatial factors in health care services use has not been
given full consideration when exploring Roemer’s Law. Hence, we believe a substantive
re-examination is warranted.
So, the critical question remains, “does the availability of hospital beds aﬀect hospital
utilization?”. Whereas Roemer’s natural experiment (Roemer, 1961) was based on a regional
study when a single hospital added a substantial number of inpatient beds, we approach this
issue by examining an entire hospital system, comprising the hospitals, populations, and
transportation infrastructure that connects populations to hospitals. We employ an ecological
research design that integrates individual behavioral models of health care utilization
in an explicitly spatial context. Thus, the research question is reframed to ask, “Accounting for space in health care access and use, is there an observable association between the
availability of hospital beds and hospital utilization?”.
We characterize both the spatial and aspatial components of access such that their individual and combined contributions can be subsequently identiﬁed. Furthermore, by controlling for other determinants of hospital utilization, we isolate the eﬀects of hospital bed
availability on the utilization of hospital services, thus allowing us to statistically examine
the eﬀects of Roemer’s Law on hospitalization rates. In addition, we explore the stability
of the relationship between hospital bed availability and hospital utilization by constructing
models at varying scales of geographic analysis.

4.3 Materials and Methods

4.3.1 Research design

The Andersen model of health service utilization serves as the underlying theoretical framework in our research: utilization of health services results from a predisposing component,
an enabling component, and illness level or “need” (Andersen and Newman, 1973). This
framework is appealing because characteristics of both the population and the health care
delivery system are integrated into a single model:
$$U = f(n, P, E, N) \tag{10}$$

where n is the number of people, P is the predisposing component, E is the enabling component, and N is the need for services. The enabling component in the Andersen model roughly
equates to access, but does not provide a detailed characterization. We extend the Andersen model using the theoretical framework oﬀered by Penchansky and Thomas (1981) that
deﬁnes access as the “ﬁt” between the population in need of services and services oﬀered.
In this framework, access results from a combination of five separate dimensions. Khan
(1992) classified the dimensions into spatial components, accessibility (Ac) and availability
(Av), and aspatial components, affordability (Af), acceptability (Ap), and accommodation
(Am). In addition to the five access components proposed, we add a mobility component
(M ) to capture diﬀerences in the ability to overcome distance (Paez et al., 2010). Portions
of the extended access framework cross over through P and E from the Andersen model.
Therefore, we deﬁne:
$$P = f(Ag, G) \tag{11}$$
$$A = f(Ac, Av, Af, Ap, Am, M) \tag{12}$$
$$N = f(H, \varepsilon_h) \tag{13}$$
$$H = f(In, Ed, Et) \tag{14}$$
where

Ag and G are the age and gender structure of the population, A is access, and H is the
health status of the population. It is important to highlight the distinction between need
(N ) and demand (U ) for services in this framework. Although a certain amount of U is
predictable based on known demographic characteristics of the population, N arises from
the general health status of the population and, for hospitalizations, includes a stochastic
element triggered by unpredictable instances of ill-health (Feldstein, 1966). Measuring N
is problematic in health services research given that patients and health professionals often
evaluate the need for services diﬀerently (Donabedian, 1972), resulting in cases of both
unmet need and unnecessary utilization. Therefore, in Eq. 13, H is a measure of the health
of the population and εh is a random variable representing occurrence of ill-health. Oleske
(2009) reports six approaches to measuring health care need, yet all are essentially proxies
for estimation of H. Thus, we employ socio-economic status (SES) measures, income (In),
education (Ed), and ethnicity/race (Et), as proxy measures of population health (see Young
(2005), pp.153-154 for a discussion of inclusion of ethnicity/race in health models). Although
there may be questions regarding causality between SES and health, SES has been shown to be
significantly correlated with both morbidity and self-assessed health status (Norris et al.,
2003) in the US and internationally (Young, 2005).
Our theoretical model is supplemented by accounting for variations in hospital utilization
among populations that may not be fully captured in Eq. 14. We use the number of Low
Variation (LV) hospitalizations (U_LV) to help capture this variability. LV hospitalizations are
those with little clinical-based doubt regarding the need for hospitalization (Wennberg, 2005);
therefore, variations in LV hospitalization rates can be considered as arising from the actual
health care needs of the population. We also consider hospitalizations for Ambulatory Care
Sensitive (ACS) conditions (U_ACS) in our theoretical model. This class of hospitalizations
(also known as preventable hospitalizations) comprises those where inpatient hospitalization may
be avoided if primary care is available (Bindman et al., 1995) and accessible (Ricketts et al.,
2001). Hence, we control for variation in hospital utilization due to inadequate access to
primary care. In combination, we label these variables as the case mix of a population’s
hospital usage, oﬀering proxy measures of health variation not captured in P or H.
Given dissimilar population sizes among areal units or zones, we normalize all variables
by population size producing rate-based (e.g., beds / person) or proportional (e.g., % of
population with insurance) measures where applicable. Therefore, we remove n from the
theoretical model when moving to an applied model. In addition, due to the diﬀerences in
age structure among populations, we age-standardize the hospitalization rates. Hence, we
remove Ag from the theoretical model and specify a full model of hospital utilization,
$$U_{std} = f(G, Ac, Av, Af, M, In, Ed, Et, LV_{std}, ACS_{std}, \varepsilon_h) \tag{15}$$

which allows for examination of the relationship between U and hospital bed availability
while controlling for differences in demographic characteristics and health status among
populations.⁸
The proposed framework is implemented in an explicitly spatial context, acknowledging
the role of geography in interactions among populations and hospitals. First, because all
populations do not have equivalent geographic access to the same hospital services, we incorporate the spatial character of hospital utilization by limiting our analysis to only those
hospitalizations where services were demanded locally. Second, we overcome container-based
measures of hospital bed availability by calculating a metric that captures the interaction
between distance, hospital bed supply, and demand. Third, we employ spatial regression
models which incorporate the spatial structure of the proposed framework, thus counteracting the problems associated with spatial autocorrelation.
The ecological study design requires that we address issues stemming from the Modiﬁable
Areal Unit Problem (MAUP, Openshaw, 1984; Fotheringham and Wong, 1991). The MAUP
arises when correlation or regression-based analysis is inﬂuenced by the particular resolution
or zoning scheme of the data. In extreme cases, regression coeﬃcients may ﬂip from positive
to negative or statistical signiﬁcance may be greatly altered when an alternate scale of
analysis or zoning methodology is implemented (Chi and Zhu, 2008; Mobley et al., 2008;
Wright and Ricketts III, 2010). Therefore, we explore the stability of Roemer’s Law by
evaluating the relationship between hospital bed availability and hospital utilization over
varying levels of data aggregation.
⁸ The other access-related variables, Ap and Am, have been removed from the theoretical
model for the following reasons: 1) Acceptability was defined by Penchansky and Thomas
(1981) as capturing the religious or racial/ethnic fit between a person and the health care
facilities, thus is very likely outdated. 2) Accommodation attempts to account for waiting
times, hours of operation, telephone appointment systems, and other non-supply related
factors of the health care facility. These factors should be quite constant among modern
hospitals.

4.3.2 Case study

Our case study explores the relationship between hospital bed availability and utilization
for the state of Michigan. As of 2010, Michigan had a population of 9,883,640 residents
served by 169 acute care hospitals with 26,180 total licensed inpatient beds. In 2010, there
were 1,127,576 hospital admissions of Michigan residents to Michigan hospitals and a total
of 5,313,149 days spent in hospitals, resulting in an overall patient day usage rate of 0.537
patient days per person. For every 1000 people, there were 9.51 hospital admissions per
month, which is slightly higher than the national averages of 8/1000 found by Green et al.
(2001) and 9/1000 as reported by White et al. (1961).

Figure 22: Population distribution and hospital locations in Michigan. [Map; legend: block group population density (people per sq. km) and hospital locations.]
Michigan employs a CON program to regulate the availability of inpatient hospital beds
(Messina et al., 2006). To assess the needs of the population, a bed need methodology is
implemented to predict the future demand for hospital beds, which is compared with current
levels of supply (Langley et al., 2010). Michigan serves as a satisfactory study area due to the
large number of hospitalizations and population, the state’s relatively stable system of acute
care hospitals, and a diverse collection of rural and urban areas with varying population
densities, health care services distributions, and demographic characteristics (see Figure 22)
by which to examine Roemer’s Law.
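The utilization rates quoted for Michigan in this section can be re-derived from the reported totals. The short check below is only an arithmetic sketch; the variable names are illustrative and the inputs are the 2010 figures stated in the text.

```python
# Totals reported in the text for Michigan, 2010.
population = 9_883_640
admissions = 1_127_576
patient_days = 5_313_149

# Patient days per person per year.
patient_day_rate = patient_days / population

# Hospital admissions per 1000 people per month.
admissions_per_1000_month = admissions / population * 1000 / 12

print(f"{patient_day_rate:.4f} patient days per person")
print(f"{admissions_per_1000_month:.2f} admissions per 1000 people per month")
```

Both values agree with the quoted rates to within rounding.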
4.3.3 Population data

The Zip Code boundary data used for Michigan were acquired from the ESRI ArcGIS™
v10 data CD.⁹ The 2010 population and demographic attribute data were acquired from the
US Census Bureau (http://2010.census.gov). Block-level data for age, gender, and race/ethnicity
were aggregated to their respective Zip Code boundaries. The age-speciﬁc data were aggregated into 5 year categories for 0 to 84 years of age with an additional category for 85 and
older. Income, education, and mobility attributes were culled from the 2006-2010 American Community Survey 5-year estimates (http://www.census.gov/acs/www/). These data
are available at the block group level and were aggregated to the Zip Code boundaries. A
small number of block groups were not reported (48 block groups with a population of 52,593,
roughly 0.5% of the total state population). Values for the missing block group data were
estimated using a weighted average of ﬁrst-order (queen’s case) neighboring values (Bivand
et al., 2008). First-order neighbors are deﬁned as areas sharing a common boundary. 2009
Small Area Health Insurance Estimates (SAHIE, http://www.census.gov/did/www/sahie/)
data were used for health insurance rates. For this analysis, we only considered the health
insurance status of people under 65 years of age. Because SAHIE data are only available
at the county level, Zip Code-level data were estimated using the age-speciﬁc rates found in
the SAHIE data and age-speciﬁc population distribution of the Zip Codes.
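The county-to-Zip Code estimation of insurance rates described above can be sketched as a population-weighted average. This is a minimal illustration, assuming each Zip Code is assigned to a single county; the function name, the age-group labels, and all numbers are hypothetical and are not taken from the SAHIE data.

```python
def zip_insured_share(county_rate_by_age, zip_pop_by_age):
    """Population-weighted insured share for one Zip Code.

    county_rate_by_age: insured proportion per age group in the parent county
    zip_pop_by_age: Zip Code population (under 65) in the same age groups
    """
    total_pop = sum(zip_pop_by_age.values())
    if total_pop == 0:
        return None  # no under-65 residents, so the share is undefined
    insured = sum(county_rate_by_age[age] * n for age, n in zip_pop_by_age.items())
    return insured / total_pop


# Hypothetical county rates and Zip Code age structure.
county_rates = {"0-18": 0.95, "19-39": 0.80, "40-64": 0.88}
zip_pop = {"0-18": 1000, "19-39": 2000, "40-64": 1000}
share = zip_insured_share(county_rates, zip_pop)
```

Two Zip Codes in the same county therefore receive different estimates only when their age structures differ.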
⁹ Prior to the analysis, the 908 unique Zip Codes were aggregated into 895 Zip Codes due
to mismatches between the spatial data and the hospital utilization data.

4.3.4 Travel time

Travel time data were derived using a custom-built network model. The most recently
available roads database (2009 version 10a, http://www.michigan.gov/cgi) was downloaded
from the Michigan Center for Geographic Information and used to construct the network
travel model. Travel speeds for each road were assigned using the road attribute data and a
hierarchical speed limit classification system (Delamater et al., 2012).

Figure 23: Age adjusted hospital utilization (U_std) and bed distribution (Av,
E2SFCA) in Michigan. [Two map panels: age adjusted patient day rates (utilization, patient days / 1000 people) and hospital bed availability (E2SFCA, hospital beds / 1000 people).]
4.3.5 Ethics statement

The Michigan Hospital Inpatient Database (MIDB) consists of routinely collected information on patients’ hospital discharges for billing purposes. The patients provided consent for
their information to be stored in the hospital database, but that information is protected
under HIPAA rules. All identifiable patient information was removed from the MIDB prior
to use in this research. The Michigan State University Institutional Review Board determined
the use of these de-identified data exempt (IRB #07-362).
4.3.6 Hospital utilization

Inpatient hospitalization data were gathered from the 2010 Michigan Inpatient Database
(MIDB), a comprehensive record of the state’s inpatient hospitalizations. For each nonpsychiatric hospital admission excluding normal newborns, the age, principal discharge
diagnosis (ICD-9-CM), length of stay in days (LOS), Zip Code of residence, and admitting
hospital were collected. Travel time was attached to each discharge, calculated between the
population-weighted centroid of the Zip Code of residence and the location of the admitting
hospital (Berke and Shi, 2009). Hospitalizations occurring more than 60 minutes from the
patient’s residence were removed from the analysis. This geographic constraint accounts for
two scenarios in which hospitalizations would not be aﬀected by the hospital bed availability
of nearby hospitals, thus confounding the analysis. First, it removes hospitalizations where
patients traveled a long distance due to the availability of hospital-specific services, not hospital bed availability. Second, the constraint removes hospitalizations that occurred when
the patient was a significant distance away from their residence (e.g., while on vacation) and
not affected by local hospital bed availability. Although the 60 minute cutoff value is somewhat arbitrary,
it is based on previous research exploring spatial accessibility in regions having highly rural
populations (McGrail and Humphreys, 2009). Of the total patient days in 2010, 93.2% were
served by a hospital within 60 minutes of the patient’s residence.
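The 60 minute constraint amounts to a simple filter on the discharge records. The sketch below uses a handful of made-up records (travel time in minutes, length of stay in days) purely to illustrate how the retained share of patient days would be computed; it does not use MIDB data.

```python
MAX_MINUTES = 60  # cutoff stated in the text

# Hypothetical discharge records: (travel time in minutes, length of stay in days).
discharges = [(12.0, 3), (48.5, 5), (75.0, 2), (140.0, 10)]

# Keep only hospitalizations that occurred within the cutoff.
local = [(minutes, los) for minutes, los in discharges if minutes <= MAX_MINUTES]

# Share of all patient days that remain after filtering.
share_days_kept = sum(los for _, los in local) / sum(los for _, los in discharges)
```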
The LV hospitalization data used in this analysis included discharges for Myocardial
Infarction, Ischemic Stroke, and Hip Fracture (Fisher et al., 1994).¹⁰ ICD-9-CM codes for
the ACS hospitalizations were culled from the Dartmouth Atlas of Healthcare (Wennberg
et al., 1999). In 2010, there were 659,997 patient days for ACS conditions and 229,834 for
LV conditions.
¹⁰ ICD-9-CM codes: Myocardial Infarction (410), Ischemic Stroke (431, 434-438), and Hip
Fracture (808)

Because the age distribution of populations is not homogeneous among areal units, the
hospitalization data were standardized via the direct method of standardization (Meade
and Emch, 2010). Michigan’s 2010 population was used as the standard population. Age
standardization was accomplished in a two step process. Some of the state’s Zip Codes
contain small populations in each age-specific category and thus violate the 20/50 rule for
calculating health-related incidence rates (Klein et al., 2002). In addition, as previously
mentioned, inpatient hospitalizations are also subject to random fluctuations of ill-health
events. Therefore, the first step in the age standardization process was to calculate each areal
unit’s age-specific patient day usage rates using a local Empirical Bayes (EB) smoothing
method (Marshall, 1991). This smoothing method assumes that the patient day count
data follow a Poisson distribution, while also borrowing strength from the patient days and
populations of neighboring regions (Bivand et al., 2008; Odoi et al., 2003). The neighborhood
structure for the EB smoothing process was defined via first-order neighbors. In the second step, each areal unit’s smoothed age-specific patient day rates were multiplied
by the age-specific distribution of Michigan’s population. To calculate the overall patient
day rate, the age-specific data were summed and divided by the total state population (see
Figure 23).
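The two-step standardization described above can be sketched as follows. The smoothing function is a method-of-moments local Empirical Bayes estimator in the spirit of Marshall (1991); the exact variant, the inclusion of the focal unit in its own neighborhood, and all function names and numbers are assumptions made for illustration, not a record of the implementation used in this research.

```python
def eb_local(counts, pops, neighbors):
    """Local Empirical Bayes smoothing of rates (method of moments).

    counts[i], pops[i]: events and population at risk in unit i (pops > 0)
    neighbors[i]: indices of the first-order neighbors of unit i
    """
    smoothed = []
    for i in range(len(counts)):
        hood = [i] + list(neighbors[i])            # unit i plus its neighbors
        n_sum = sum(pops[j] for j in hood)
        m = sum(counts[j] for j in hood) / n_sum   # neighborhood mean rate
        n_bar = n_sum / len(hood)
        # Population-weighted variance of the raw rates around the local mean.
        s2 = sum(pops[j] * (counts[j] / pops[j] - m) ** 2 for j in hood) / n_sum
        a = max(s2 - m / n_bar, 0.0)               # between-unit variance estimate
        raw = counts[i] / pops[i]
        shrink = a / (a + m / pops[i]) if m > 0 else 0.0
        smoothed.append(m + shrink * (raw - m))    # pull raw rate toward local mean
    return smoothed


def direct_standardize(age_rates, standard_pop):
    """Directly standardized rate: age-specific rates weighted by a standard population."""
    expected = sum(rate * n for rate, n in zip(age_rates, standard_pop))
    return expected / sum(standard_pop)


# Step 1: smooth one age group's rates across three hypothetical units in a row.
smoothed = eb_local([10, 20, 30], [100, 200, 300], [[1], [0, 2], [1]])

# Step 2: weight one unit's (smoothed) age-specific rates by a standard population.
std_rate = direct_standardize([0.1, 0.5, 2.0], [500, 300, 200])
```

Units with small populations are shrunk more strongly toward their neighborhood mean, which is the behavior that motivates smoothing before standardization.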
Following the age-standardization process, the hospital utilization rate data were converted to a Standardized Rate Diﬀerence (SRD) by subtracting the average utilization rate
of the entire state from the age-adjusted utilization rate of each observation. This simple
shift by a constant did not affect the differences among observations; however, it did allow for easier interpretation of the results such that observations with rates greater than 0 are higher
than the state average and those less than 0 are lower.
4.3.7 Spatial accessibility

Recently, a set of gravity-based GIS measures of spatial accessibility have been proposed that
allow both availability and accessibility to be integrated by including measures of supply,
demand, and distance simultaneously (Ngui and Apparicio, 2011). The general form of the
gravity-based models can be represented as
$$A_i^G = \sum_{j=1}^{n} \frac{S_j \, f(d_{ij})}{\sum_{k=1}^{m} P_k \, f(d_{jk})} \tag{16}$$
where A_i^G is the spatial accessibility for population zone i, S_j is the attractiveness of a
facility at location j, f(d_ij) is an impedance (decay) function based on the distance (d) from
zone i to location j, f(d_jk) is an impedance function based on the distance from location j
to zone k, and P_k is the population in zone k. The total numbers of facilities and zones are
n and m, respectively.
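A minimal sketch of the general gravity-based form in Eq. 16 follows. The function name, the inverse power impedance with an exponent of 2, and the toy inputs are all hypothetical choices made only to show how supply, demand, and distance enter the calculation.

```python
def gravity_accessibility(supply, population, dist, decay):
    """Gravity-based accessibility for each zone (general form of Eq. 16).

    supply[j]: attractiveness of facility j (e.g., beds)
    population[k]: population of zone k
    dist[i][j]: distance or travel time between zone i and facility j
    decay: impedance function f(d)
    """
    n_zones, n_fac = len(population), len(supply)
    # Denominator: demand potential competing for each facility j.
    demand = [sum(population[k] * decay(dist[k][j]) for k in range(n_zones))
              for j in range(n_fac)]
    # Numerator summed over facilities for each zone i.
    return [sum(supply[j] * decay(dist[i][j]) / demand[j] for j in range(n_fac))
            for i in range(n_zones)]


def inverse_power(d):
    """Hypothetical impedance function with an arbitrary exponent of 2."""
    return d ** -2.0


access = gravity_accessibility([100], [1000, 1000], [[10], [20]], inverse_power)
```

In this toy case the zone closer to the single facility receives four times the accessibility of the farther zone, and the population-weighted sum of accessibility equals the total supply.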
We employed the enhanced two-step ﬂoating catchment area (E2SFCA), a gravity-based
metric proposed by Luo and Qi (2009), to measure the availability of hospital beds. One
drawback in using gravity-based measures is that the unit (AG ) is often diﬃcult to interpret.
The E2SFCA overcomes this limitation by providing availability values in easy-to-understand,
container-like units (hospital beds per person). The E2SFCA improves on its predecessor,
the two-step ﬂoating catchment area (2SFCA, Radke and Mu, 2000; Luo and Wang, 2003),
by replacing a dichotomous distance characterization with distance or service area “bands”
radiating from each service location. The FCA measures overcome the theoretical limitations
of container-based measures by allowing the catchment areas for supply and demand locations
to “ﬂoat” based on travel distance or travel time in lieu of adherence to administrative
boundaries. To accomplish this, the potential demand is calculated for each facility, which
is in turn used to calculate the supply available at each areal unit.
The E2SFCA requires weight values to allocate demand and supply to the distance bands
using the theory of distance decay. The three functions most often used to model distance
decay in gravity-based measures are the Inverse power, Exponential, and Gaussian (Kwan
and Hong, 1998). Gravity-based models are generally limited by the arbitrary selection of
a distance decay function and the associated β parameter that describes the magnitude of
decay (Schuurman et al., 2010). However, because the actual travel patterns of Michigan
residents are known, our study is not limited by this arbitrary selection process. Using
the actual utilization patterns of state residents, the distance decay function and associated
parameter values were empirically estimated using a non-linear regression model.
Initial investigations showed that the oft-used distance decay functions did not adequately
fit the utilization patterns. However, the downward log-logistic decay function (de Vries et al.,
2009),
$$W = \frac{\gamma}{1 + (d / \beta_0)^{\beta_1}} \tag{17}$$
provided a superior characterization of the observed decay pattern and thus was employed
to estimate the weights (W) for the E2SFCA calculation (see Figure 24). In Eq. 17, the
γ parameter controls W at d = 0. Therefore, because W must equal 1 at d = 0, we were
able to simplify the parameter estimation process by setting γ equal to 1. We estimated the
two remaining decay parameters (β0 and β1) using the non-linear least squares estimator
available in R (R Development Core Team, 2011). The resulting parameter values were β0
= 13.89 and β1 = 1.82. Both parameters were statistically significant (p < 2 × 10⁻¹⁶) and
the model produced a low residual standard error (RSE = 0.003) with an excellent curve fit
(see Figure 24).

Figure 24: Distance decay of hospital utilization in Michigan. Left) over the entire
range of the inpatient travel data. Right) a subset of the travel data. The circles are the
data points (thinned for display purposes) and the line is the downward log-logistic function
fit to the data. [Axes: travel time (horizontal) and decay weight (vertical).]
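For reference, the fitted decay function in Eq. 17 can be evaluated directly with the reported parameter values. The snippet is a sketch in Python rather than the R code used for estimation; the function name and the chosen travel times are illustrative.

```python
BETA0, BETA1 = 13.89, 1.82   # parameter estimates reported in the text


def decay_weight(minutes, beta0=BETA0, beta1=BETA1, gamma=1.0):
    """Downward log-logistic decay weight (Eq. 17) for a travel time in minutes."""
    return gamma / (1.0 + (minutes / beta0) ** beta1)


for t in (0, 5, 13.89, 30, 60):
    print(f"{t:>6} min -> W = {decay_weight(t):.3f}")
```

With γ = 1 the weight equals 1 at zero travel time and falls to exactly 0.5 when travel time equals β0, so β0 can be read as the travel time at which the weight is halved.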
In the first step of the E2SFCA, the supply is calculated at each facility. Using the
network dataset, travel time rings were created for each hospital at 5 minute intervals to a
maximum of 45 minutes and a final ring was created from 45 to 60 minutes to incorporate
travel in the rural regions in the state (McGrail and Humphreys, 2009). A W value was
assigned to each travel ring using the downward log-logistic function of each travel time
value comprising the ring (e.g., the 5-10 minute ring W value is the mean of the W values
for 5-10 minutes). The population data were spatially joined to the travel time rings. The
supply (R_j, beds / person) is calculated at each facility as follows:
$$R_j = \frac{S_j}{\sum_{k \in [D_r < 60]} P_k W_r} \tag{18}$$
where S_j is the number of licensed hospital beds at hospital j, P_k is the set of populations of
units falling within the set of travel time rings (D_r), and W_r is the set of associated weight
values for the travel time rings. Census block centroid points were used in this step as they
offered the most accurate representation of population location.
The second step of the E2SFCA calculates the availability of hospital beds (Av) as
moderated by distance as follows:
$$Av_i = \sum_{k \in [d_{ij} < 60]} R_k W_k \tag{19}$$
where Av_i is the availability of hospital beds at population unit i, R_k is the set of supply values (Eq. 18) for the hospitals
within 60 minutes of population unit i, and W_k is the set of weights based on the travel
time from unit i to hospital j for all hospitals in k using Eq. 17. We completed this step
using the travel time from the population weighted Zip Code centroids to the hospitals, thus
calculating the availability of hospital beds at the Zip Code level (see Figure 23).
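The two E2SFCA steps can be sketched together as follows. Several simplifications are assumed here and are not part of the method as described: the same population units are used in both steps (the text uses census block centroids in step one and Zip Code centroids in step two), ring weights are averaged over samples taken every half minute, and all inputs are made up.

```python
BETA0, BETA1 = 13.89, 1.82            # decay parameters reported in the text
RING_EDGES = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 60]  # ring limits in minutes


def decay_weight(minutes):
    """Downward log-logistic decay (Eq. 17) with gamma = 1."""
    return 1.0 / (1.0 + (minutes / BETA0) ** BETA1)


def ring_weight(lower, upper, step=0.5):
    """Mean decay weight over a ring; the sampling step is an assumption."""
    n = int(round((upper - lower) / step))
    samples = [decay_weight(lower + s * step) for s in range(n + 1)]
    return sum(samples) / len(samples)


RINGS = list(zip(RING_EDGES, RING_EDGES[1:]))
RING_WEIGHTS = [ring_weight(lower, upper) for lower, upper in RINGS]


def weight_for(minutes):
    """Weight of the ring containing a travel time; 0 at or beyond 60 minutes."""
    for (lower, upper), w in zip(RINGS, RING_WEIGHTS):
        if lower <= minutes < upper:
            return w
    return 0.0


def e2sfca(beds, population, travel):
    """Bed availability per population unit; travel[i][j] is minutes from unit i to hospital j."""
    n_units, n_hosp = len(population), len(beds)
    # Step 1 (Eq. 18): beds per distance-weighted person in each hospital's catchment.
    ratio = []
    for j in range(n_hosp):
        demand = sum(population[i] * weight_for(travel[i][j]) for i in range(n_units))
        ratio.append(beds[j] / demand if demand > 0 else 0.0)
    # Step 2 (Eq. 19): sum the weighted ratios of the hospitals within reach of each unit.
    return [sum(ratio[j] * weight_for(travel[i][j]) for j in range(n_hosp))
            for i in range(n_units)]


beds = [100, 50]
population = [1000, 2000, 500]
travel = [[3, 50], [12, 20], [70, 8]]   # hypothetical travel times in minutes
availability = e2sfca(beds, population, travel)
```

Because the same weights are used in both steps of this sketch, the population-weighted sum of the availability values returns the total number of beds, which is a convenient sanity check.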
4.3.8 Clustering methodology

Much of the available literature regarding data aggregation in health services research pertains to the creation of small-areas for investigating health disparities among regions (e.g.,
Wennberg and Gittelsohn, 1973). Generally speaking, these methods use geodemographic
characteristics of the initial areas to create clusters of homogeneous, contiguous regions (Rey
et al., 2011). Although a number of methods have been proposed for creating small-areas,
these were deemed inappropriate for our study. Speciﬁcally, we believe that implementing a
method that clusters the areal units by the same attributes that were being used to explore
Roemer’s Law would essentially be optimizing the aggregation process to achieve a stronger
statistical outcome (Openshaw, 1984). Hence, the level of objectivity in our test of the
MAUP would be diminished (Swift et al., 2008).
Given this problem, we implemented a clustering methodology that incorporates hospital
utilization patterns and geographic location, identifying geographically proximal areal units
whose populations use a similar set of hospitals (Delamater et al., Under review). The
resulting clusters are based on similarities in hospital use; however, they are not explicitly
optimized based on the same geodemographic attributes used to construct the regression
models. Essentially, the clustering methodology is based on principles garnered from small-area studies, but does not produce the statistical bias likely present when using the same set
of attributes for the purpose of grouping the data and constructing the regression models.
The initial observation units (Zip Codes) were grouped into clusters using the K-means
clustering algorithm with rational starting locations provided by Ward’s Hierarchical clustering (Milligan, 1980). We clustered the original Zip Code data based on their hospital
utilization patterns and geographic location simultaneously. The utilization pattern data
were an n x m matrix containing the proportion of each Zip Code’s total inpatient hospital
days (1:n) spent at each hospital (1:m), otherwise known as the Commitment Index (CI,
Griﬃth, 1972). The location of each observation is deﬁned by the travel time from each Zip
Code (population weighted centroid) to each hospital, thus comprising another n x m matrix.
Representing location as a set of travel distances, rather than coordinates from a traditional
planar coordinate system (e.g., latitude and longitude), allows for factors inﬂuencing the true
separation among places (i.e., road infrastructure, travel speeds, or the physical landscape)
to be more accurately characterized (Jones et al., 2010). The travel time data were rescaled
to match the range of the CI data (0-1) by dividing by the maximum travel time between any
Zip Code and hospital pair. The two n x m matrices were appended to create the final data
matrix input to the clustering methodology.

Figure 25: Zip Code clusters. [Three map panels: 598 clusters, 299 clusters, and 50 clusters.]
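Construction of the clustering input described above can be sketched as follows. The function name and the toy values are hypothetical, and the clustering step itself (K-means seeded by Ward's hierarchical clustering) is not shown.

```python
def clustering_matrix(patient_days, travel_time):
    """Build the n x 2m input: Commitment Index columns plus rescaled travel times.

    patient_days[i][j]: inpatient days of Zip Code i spent at hospital j
    travel_time[i][j]: travel time from Zip Code i to hospital j
    """
    max_time = max(max(row) for row in travel_time)
    rows = []
    for days, times in zip(patient_days, travel_time):
        total = sum(days)
        ci = [d / total if total else 0.0 for d in days]   # proportions sum to 1
        scaled = [t / max_time for t in times]              # rescaled to 0-1
        rows.append(ci + scaled)                            # append the two blocks
    return rows


# Two hypothetical Zip Codes and two hospitals.
matrix = clustering_matrix([[30, 10], [0, 50]], [[10, 40], [80, 20]])
```

Rescaling the travel times to the 0-1 range keeps either block of columns from dominating the distance calculations used by the clustering algorithm.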
The clustering methodology was run iteratively such that it provided a cluster solution
for every possible number of clusters from 2 to 894 (the set, S). We subset the resulting
set S by implementing a selection method based on the incremental F score (incF) of
each cluster solution (Delamater et al., Under review; Gujarati, 1988). IncF measures only
the amount of “ﬁt” gained from allowing an additional cluster within the solution, while
also penalizing for adding this additional cluster. Therefore, local maxima in incF scores
represent cluster solutions that provide a substantial improvement in the fit when compared
with their immediate neighbors. From the initial set S, 276 cluster solutions had local maxima
in incF scores, thus they were selected as the levels of aggregation for the regression analysis
(see Appendix Figure D.1 and Table D.1). Figure 25 provides three example maps from the
ﬁnal set of cluster solutions. The attribute data for each Zip Code were aggregated based on
cluster membership. In addition, we added the non-clustered data (with the 895 Zip Code
observations) for a ﬁnal set of 277 levels of aggregation.
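Selecting the cluster solutions at local maxima of the incF score can be sketched as below. The incF values are invented for illustration, the calculation of incF itself is not shown, and excluding the two endpoints of the series is an assumption rather than a documented rule.

```python
def local_maxima(scores):
    """Return the cluster counts whose score exceeds both immediate neighbors.

    scores: dict mapping number of clusters -> incF score
    """
    ks = sorted(scores)
    selected = []
    # Slide a window of three consecutive solutions over the ordered series.
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        if scores[k] > scores[prev] and scores[k] > scores[nxt]:
            selected.append(k)
    return selected


# Hypothetical incF scores for solutions with 2 to 8 clusters.
inc_f = {2: 5.0, 3: 7.5, 4: 6.0, 5: 6.2, 6: 9.1, 7: 4.0, 8: 4.5}
levels = local_maxima(inc_f)
```

Here the 3- and 6-cluster solutions are retained because each scores higher than the solutions on either side of it.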

4.3.9 Methods to remove multicollinearity

Considering that the ultimate goal of the analysis was inference on the coeﬃcient values
from a regression analysis, multicollinearity in the independent variable set would invalidate
the observed coeﬃcient values. Substantial correlation (Pearson’s Correlation Coeﬃcient,
r > 0.5) was observed among the independent variables. We addressed the multicollinearity
using a suite of methods as described in the following sections.
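A simple screen for the pairwise correlation described above might look like the following sketch. The variable names and values are made up, and flagging on the absolute value of r (so that strong negative correlations are also caught) is an assumption; the text reports only r > 0.5.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlated_pairs(variables, threshold=0.5):
    """Variable pairs whose |r| exceeds the threshold (absolute value is an assumption)."""
    flagged = []
    for (name_a, xa), (name_b, xb) in combinations(variables.items(), 2):
        r = pearson(xa, xb)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical independent variables observed on five areal units.
data = {
    "income": [1, 2, 3, 4, 5],
    "education": [2, 4, 5, 4, 6],
    "noise": [3, 1, 4, 1, 2],
}
flagged = correlated_pairs(data)
```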

4.3.9.1 Principal components analysis

We performed a Principal Components Analysis (PCA) on functional “sets” of variables: income/education, ethnicity/race, transportation, mobility, and case mix. By producing uncorrelated component variables, PCA reduces
the number of independent variables without a large reduction in the explanatory power of
the independent variable set (Jolliﬀe, 2002). For example, at most scales of data aggregation,
the seven variables within the income/education variable set yielded only a single component. Rather than attempting to identify which of the seven variables would be included in
the regression analysis, we were able to include a single income/education component that
suﬃciently described the entire suite of variables (Graham, 2003; Vyas and Kumaranayake,
2006). Because the data were not standardized, we used the correlation matrix for the PCA
(Jolliﬀe, 2002). We also employed a varimax rotation of the results to assist in interpretation
of the component structure (Luginaah et al., 2001).
General methods to determine the number of components to extract include manual interpretation of the results or “rules of thumb” (Rogerson, 2006), and thus were not applicable
for our study given the large number of PCA runs that were necessary to complete the
multi-scale analysis. Therefore, we implemented a heuristic that allowed for automation of
the process to select the number of components extracted. We added a randomly generated
variable to each of the variable sets included in the PCA and generated components. Because PCA provides the loadings on each component for each input variable, the
component most heavily influenced by randomness was identified. The PCA was then reimplemented without the random variable, extracting only those components describing more
variation in the data than randomness.
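The selection rule in this heuristic can be sketched as below, assuming a PCA routine has already returned the variance explained by each component and the loading of the random variable on each component. The function name and the example values are hypothetical, and the rule is one reading of the description above rather than the code used in this research.

```python
def components_to_keep(eigenvalues, random_loadings):
    """Count components that explain more variance than the 'random' component.

    eigenvalues: variance explained by each component of a PCA that included
        one randomly generated variable
    random_loadings: loading of that random variable on each component
    """
    # Component most heavily influenced by the random variable.
    noise_idx = max(range(len(random_loadings)),
                    key=lambda c: abs(random_loadings[c]))
    noise_var = eigenvalues[noise_idx]
    # Keep only components describing more variation than that component.
    return sum(1 for ev in eigenvalues if ev > noise_var)


# Hypothetical PCA output for a variable set plus one random variable.
eigenvalues = [4.1, 1.6, 1.0, 0.8, 0.5]
random_loadings = [0.03, 0.10, 0.95, 0.20, 0.05]
n_keep = components_to_keep(eigenvalues, random_loadings)
```

The PCA would then be rerun on the original variables only, extracting `n_keep` components.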
The functional sets of variables, the input data, and the interpreted output of the PCA
are as follows (see Appendix Table D.2 for detailed information including the number of
components extracted and the amount of variation captured by the extracted components
for each functional variable set at each level of data aggregation):
• Income/education
– Input variables
1. Median household income
2. Median earnings (16+)
3. % less than high school education (25+)
4. % with high school education (25+)
5. % with associate's degree (25+)
6. % with bachelor's degree (25+)
7. % with graduate degree (25+)
– Components
1. Income and education (SES): High scores reflect populations with higher education, income, and earnings¹¹
¹¹ In 19 of the 277 levels of aggregation, 2 components were identified: one with high scores
on education and another with high scores on income and earnings. The impacts of this split
are noted in the Results section.
• Ethnicity/race
– Input variables
1. % White
2. % African American

3. % Hispanic
4. % Asian
5. % American Indian or Alaskan Native (AIAN)
6. % Hawaiian or Paciﬁc Islander (HWPI)
– Components
1. Race (BLACK): High scores reﬂect populations with higher proportions of
African Americans and lower proportions of Whites
2–5. Minority population components: High scores reﬂect observations with higher
proportions of Hispanic (HISP), Asian (ASIAN), AIAN, and HWPI populations12
• Means of Transportation to Work (Transportation)
– Input variables
1. % Automobile (16+)
2. % Car pool (16+)
3. % Public transportation (16+)
4. % Motorcycle (16+)
5. % Walk, Bicycle, other (16+)
– Components
1. Transportation (TRAN1): High scores reﬂect populations that are less reliant
on automobiles as the means for their journey to work
12 The

number of components for ethnicity/race were highly variable across the levels of
aggregation. The breakdown was as follows: 1 component (32), 2 components (127), 3
components (64), 4 components (52), 5 components (2). The component interpretations are
noted in the Results section.

99

2. Shared transportation (TRAN2): High scores reﬂect populations with a larger
number of people using car pools for their journey to work13
• Average Travel Time to Work (Mobility)
– Input variables
1. % 0-9 minutes (16+)
2. % 10-19 minutes (16+)
3. % 20-29 minutes (16+)
4. % 30-39 minutes (16+)
5. % 40-59 minutes (16+)
6. % 60-89 minutes (16+)
– Components
1. High mobility (MOB1): High scores reﬂect populations that have a higher
proportion of long distance (greater than 40 minute) commuters
2. Medium mobility (MOB2): High scores reﬂect population that have a higher
proportion of medium distance (20-40 minute) commuters and a lower proportion of short distance (less than 10 minutes) commuters
• Hospitalizations (Case Mix)
– Input variables
1. Age-adjusted rate of LV hospitalizations
2. Age-adjusted rate of ACS hospitalizations
13 In

37 of the 277 levels of aggregation, only a single component was identiﬁed: one with
high scores on non-automobile means of transportation. The component TRAN2 was not
included in the ﬁnal regression analysis as we did not believe that a suﬃcient theoretical
relationship existed between populations with a higher proportion of carpoolers and hospital
utilization.

100

– Components
1. Case mix (CASE): High scores reﬂect populations that have higher rates of
both LV and ACS hospitalizations

4.3.9.2 Bivariate regressions

Because we were interested in the individual impacts of Av and Af on hospital utilization, these variables (E2SFCA and INS) were held out
of the PCA analysis. However, we found that E2SFCA was moderately correlated with the
African American population component (BLACK) and INS was moderately correlated with
the SES component. In addition, the case mix component (CASE) was also correlated with
the African American population component (BLACK). Although the moderate correlation
would not invalidate the regression results, we wanted to identify the isolated eﬀects of
these variables. Therefore, we adopted the strategy of regressing the variable of interest
on its associated correlated variable and using the residuals for further analysis (Graham,
2003). In this, the residuals function as the “unexplained” portion of the variable of interest,
allowing both variables to be included in the ﬁnal model. For example, the variable E2SFCA
becomes the availability of hospital beds not associated with BLACK and is thus recast as
E2SFCA-resid. This process was completed independently at all levels of aggregation when
r was greater than 0.4. The F scores of the overall model and coeﬃcients were tested to
ensure the linear models provided signiﬁcant (p value < 0.05) results.
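The residual-based strategy described above can be sketched in a few lines. This is a minimal Python illustration, not the code used in the study: the data values are made up, the function names are hypothetical, and the threshold is applied to the absolute value of r as an assumption. The significance checks on the F scores and coefficients are omitted.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residualize(target, predictor, threshold=0.4):
    """Return OLS residuals of target on predictor when |r| exceeds threshold.

    Otherwise the target is returned unchanged.
    """
    if abs(pearson_r(target, predictor)) <= threshold:
        return list(target)
    mx, my = mean(predictor), mean(target)
    slope = sum((p - mx) * (t - my) for p, t in zip(predictor, target)) / sum(
        (p - mx) ** 2 for p in predictor
    )
    intercept = my - slope * mx
    # The residual is the part of the target not explained by the predictor.
    return [t - (intercept + slope * p) for p, t in zip(predictor, target)]


# Made-up component scores and availability values for five areal units.
black = [-1.2, -0.5, 0.0, 0.4, 1.3]
e2sfca = [1.9, 2.4, 2.2, 3.1, 3.6]
e2sfca_resid = residualize(e2sfca, black)
print(e2sfca_resid)
```

By construction, the residuals are uncorrelated with the predictor, which is what allows both variables to enter the final model together.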

4.3.9.3 Test variance inflation factor

We calculated the variance inflation factor (VIF) for the set of independent variables (see Table 10), removing those with a VIF > 2
(Graham, 2003). The variables were removed in a reverse step-wise fashion starting with
those considered the least established predictors of hospital utilization toward the most (from
bottom to top in Table 10). For example, if TRAN1 and SES both had a VIF > 2, then
TRAN1 would be removed ﬁrst in the stepwise process. As the level of aggregation increased
and the number of observations became smaller, correlation among the independent variables
increased substantially. As a result, we did not perform any subsequent analysis at scales of aggregation with fewer than 37 clusters/observations.

Table 10: Attribute variable set. TM1 is the variable label in the modified Andersen model from Eqs. 11-13 and TM2 is the label from the full model specified in Eq. 15.

TM1  TM2            Abbr.         Name                       Description
Dependent variable:
U    Ustd           SRD           St. Rate Difference        Difference between the age standardized hospitalization rate and the state’s age standardized rate
Independent variables:
N    ACSstd, LVstd  CASE-resid    Case mix                   ACS and LV component not explained by BLACK
A    Ac, Av         E2SFCA-resid  Hospital Bed Availability  E2SFCA not explained by BLACK
A    Af             INS-resid     Health Insurance           INS not explained by SES
N    In, Ed         SES           Income/education           High income and education component
P    G              FEMALE        Gender                     Female population
N    Et             BLACK         Ethnicity/race             African American component
N    Et             HISP          Ethnicity/race             Hispanic component
N    Et             ASIAN         Ethnicity/race             Asian component
N    Et             AIAN          Ethnicity/race             American Indian or Alaskan Native component
N    Et             HWPI          Ethnicity/race             Hawaiian or Pacific Islander component
A    M              TRAN1         Transportation             Non-automobile reliant component
A    M              MOB1          High mobility              Long commutes to work component
A    M              MOB2          Medium mobility            Medium commutes to work component
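The VIF screening procedure can be sketched as follows. This is a minimal Python illustration under stated assumptions: it relies on the standard identity that the VIF of each variable equals the corresponding diagonal element of the inverse correlation matrix, the correlation values are made up, and the function names are hypothetical. Variables are listed from most to least established predictor, so the last flagged variable is removed first.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def prune_by_vif(names, corr, cutoff=2.0):
    """Drop variables with VIF > cutoff, least established first.

    names: ordered from most to least established predictor.
    corr: correlation matrix of the predictors, in the same order.
    """
    keep = list(range(len(names)))
    while len(keep) > 1:
        sub = [[corr[i][j] for j in keep] for i in keep]
        inverse = invert(sub)
        vifs = [inverse[k][k] for k in range(len(keep))]
        flagged = [k for k, v in enumerate(vifs) if v > cutoff]
        if not flagged:
            break
        # Remove the flagged variable that sits lowest in the ordering.
        del keep[flagged[-1]]
    return [names[i] for i in keep]


# Made-up correlations: SES and TRAN1 are strongly correlated.
names = ["SES", "TRAN1", "MOB1"]
corr = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(prune_by_vif(names, corr))
```

In this toy case SES and TRAN1 both exceed the cutoff, and TRAN1 is dropped first because it is ranked below SES, which mirrors the example given in the text.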
4.3.10 Regression models

As noted earlier, previous studies of the eﬀects of Roemer’s Law have not incorporated spatial
structure. The main implication of this particular model misspeciﬁcation is that regression
coeﬃcients may have contained artiﬁcially low standard errors, leading to the rejection of the
null hypothesis when it should have been accepted. Initial tests of non-spatial linear models
showed high spatial autocorrelation in the residuals with ﬁrst-order neighboring values (see
Appendix Figure D.2). To account for this phenomenon, we used two sets of spatial error
models (Anselin, 1988), Simultaneous and Conditional Autoregressive Regression models
(SAR and CAR, respectively). Both models use the general form,
$$Y = \beta X + \mu \tag{20}$$

where

$$\mu = \lambda W \mu + \epsilon. \tag{21}$$

In the spatial error model, Y is a vector of SRD observations; X and β are matrices of independent variables and coefficients, respectively; µ is a vector of autocorrelated residuals; λ is the autoregressive coefficient; W is a neighborhood weight matrix; and ε is a vector of non-autocorrelated residuals.
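As a reading aid, Eq. 21 can be solved for µ and substituted into Eq. 20. The steps below write ε for the vector of non-autocorrelated residuals and assume that the matrix (I − λW) is invertible, where I is the identity matrix; this is the standard reduced form of a spatial error model rather than a result specific to this study.

```latex
\begin{aligned}
\mu - \lambda W \mu &= \epsilon \\
(I - \lambda W)\,\mu &= \epsilon \\
\mu &= (I - \lambda W)^{-1}\,\epsilon \\
Y &= \beta X + (I - \lambda W)^{-1}\,\epsilon
\end{aligned}
```

When λ = 0 the last line reduces to the non-spatial linear model, which is why an insignificant λ indicates that the spatial error structure adds little.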
SAR and CAR models diﬀer in their treatment of the spatial pattern in the dependent
variable (Anselin, 2003; Chi and Zhu, 2008). In the SAR model, the spatial pattern is
explained only by the independent variables, simultaneously over all observations. The CAR
model uses the independent variables to explain the spatial pattern of the dependent variable,
but also conditions the value of the dependent variable on its neighboring values (Anselin,
2003). For all regression models, we deﬁned W as ﬁrst-order neighbors. No prior information
in our data suggested whether the SAR or the CAR model was more appropriate for this
analysis. Additionally, we were unable to locate past research that provided compelling
justification for the use of one over the other.
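A first-order neighbor matrix W of the kind defined above can be built from an adjacency list. The sketch below is a minimal Python illustration that assumes row-standardized weights (the weighting style is not stated in the text) and that every unit has at least one neighbor. It uses Moran's I as one common way to check residuals for spatial autocorrelation, which is also an assumption; the four-unit layout and residual values are made up.

```python
def row_standardized_weights(neighbors):
    """Build W from first-order neighbor lists; each row sums to 1."""
    n = len(neighbors)
    w = [[0.0] * n for _ in range(n)]
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            w[i][j] = 1.0 / len(nbrs)
    return w


def morans_i(values, w):
    """Moran's I of values under the spatial weight matrix w."""
    n = len(values)
    m = sum(values) / n
    dev = [v - m for v in values]
    s0 = sum(sum(row) for row in w)  # sum of all weights
    num = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * num / den


# Four areal units arranged in a line: 0 - 1 - 2 - 3
neighbors = [[1], [0, 2], [1, 3], [2]]
W = row_standardized_weights(neighbors)
residuals = [2.0, 1.0, -1.0, -2.0]
print(morans_i(residuals, W))
```

A clearly positive value indicates that neighboring units have similar residuals, which is the pattern that motivates a spatial error specification.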
A Levene test conﬁrmed heteroscedasticity in the models’ residuals due to diﬀering population sizes among areal units (Rogerson, 2006, see Appendix Figure D.4). Therefore, we
implemented weighted SAR and CAR models (Bivand et al., 2008; Sparks and Sparks, 2010)
using the inverse of the square root of the population size as the weights. This speciﬁcation
led to a substantial alleviation of the heteroscedasticity in the residuals (see Appendix Figure
D.5).
We constructed the SAR and CAR regression models at each level of data aggregation
produced by the clustering methodology. An automated stepwise-like process was employed
to remove independent variables that were insigniﬁcant predictors of hospital utilization rate.
The initial regression model was constructed and the independent variables were tested for
signiﬁcance (p value < 0.05). If all variables were signiﬁcant, the process terminated. If
any were insigniﬁcant, the variable having the highest p value in the model was removed
and a new model was constructed. This process continued iteratively until only statistically
signiﬁcant independent variables remained in the ﬁnal model.
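The iterative removal process can be sketched as follows. This is a minimal Python illustration in which `fit_pvalues` is a hypothetical stand-in for fitting the weighted SAR or CAR model and returning a p value for each variable; the toy p values are made up.

```python
def backward_eliminate(variables, fit_pvalues, alpha=0.05):
    """Refit repeatedly, dropping the least significant variable each time.

    fit_pvalues: callable that takes a list of variable names and returns
        a dict mapping each name to its p value in the fitted model.
    """
    remaining = list(variables)
    while remaining:
        pvals = fit_pvalues(remaining)
        worst = max(remaining, key=lambda v: pvals[v])
        if pvals[worst] < alpha:
            break  # every remaining variable is significant
        remaining.remove(worst)
    return remaining


def toy_fit(variables):
    """Stand-in for a model fit that returns made-up p values."""
    base = {"E2SFCA-resid": 0.001, "SES": 0.010, "FEMALE": 0.640, "TRAN1": 0.090}
    pvals = {v: base[v] for v in variables}
    # Pretend TRAN1 becomes significant once FEMALE is dropped.
    if "FEMALE" not in variables and "TRAN1" in pvals:
        pvals["TRAN1"] = 0.030
    return pvals


final = backward_eliminate(["E2SFCA-resid", "SES", "FEMALE", "TRAN1"], toy_fit)
print(final)
```

Removing one variable at a time and refitting matters because p values can change after each removal, as the toy example shows for TRAN1.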

4.4 Results

In total, the SAR and CAR models were constructed at 268 levels of aggregation. In 12 of the weighted SAR models and
31 of the weighted CAR models, the spatial parameter (λ) was insignificant
and the model was considered invalid. The overall coefficient values of the independent variables
were very similar among the SAR and CAR models over all levels of aggregation; however, the
results of the CAR model contained latent spatial autocorrelation in the residuals at higher
levels of aggregation (see Appendix Figure D.3). Considering these ﬁndings, we believe the
CAR model was misspeciﬁed at these scales of analysis and report only the results of the
SAR model. Selected standardized coeﬃcient values for the SAR model are found in Figure
26 and Table 11 contains an overview of all coeﬃcient values.
Figure 26: Standardized coefficients for weighted SAR models. Both panels plot β (standardized, weighted SAR model) against the number of clusters. LEFT: E2SFCA-resid (red), CASE-resid (black), INS-resid (green), BLACK (blue). RIGHT: SES (black), TRAN1 (brown), MOB1 (green), MOB2 (blue), λ (red). All coefficients are significant at a p value < 0.05.

Table 11: Coefficient statistics. Total is the number of times the variable is present; Model is the number of times that the variable was included in the initial model (VIF < 2); positive is the number of times the variable’s coefficient was significant (p value < 0.05) and positive in the final model; negative is the number of times the variable’s coefficient was significant (p value < 0.05) and negative in the final model; and insig is the number of times the variable was insignificant and removed from the model.

weighted SAR model
Variable      Total  Model  positive  negative  insig
CASE-resid      268    268       268         0      0
E2SFCA-resid    268    268       254         0     14
INS-resid       268    268       252         0     16
FEMALE          268    254        19         0    235
SES1            268    256       256         0      0
SES2             17     17         0        13      4
BLACK           268    268       268         0      0
ASIAN            66     63        44         4     15
AIAN             99     98         0        95      3
HISP            106    103       102         0      1
HWPI            137    137        41        34     62
TRAN1           268    248       219         0     29
TRAN2           238      0         0         0      0
MOB1            268    260       254         0      6
MOB2            268    252       237         0     15
λ               268    268       252         4     12

In general, the magnitude of the statistical relationship between the independent variables
and hospital utilization was quite stable across levels of aggregation. In particular, hospital
bed availability (E2SFCA), LV and ACS hospitalization rates (CASE), health insurance
coverage (INS), proportion of African Americans (BLACK), high income and education
(SES), and higher mobility (MOB1 and MOB2) had consistent, positive relationships with
hospital utilization rates across levels of aggregation.

4.5 Discussion

Although Roemer initially seemed somewhat surprised that his statement had been bestowed
the status of a law (Roemer, 1961), our ﬁndings provide compelling evidence to support
this claim. We found that a positive, signiﬁcant relationship exists between hospital bed
availability and hospital utilization rates while controlling for numerous other determinants
of hospital utilization. Additionally, this relationship was consistent across levels of data
aggregation, indicating that the observed effect is not a product of the
scale of analysis.
In previous studies, Alexander et al. (1999) and Clark (1990) found that hospital beds
per capita was not a signiﬁcant predictor of hospital use rates in Michigan. In Alexander
et al., SES variables were the most signiﬁcant predictors of hospital utilization, whereas
board certiﬁed physicians and registered nurses per hospital bed were signiﬁcant predictors
in Clark’s study. In contrast, our results illustrate that both SES and bed availability have
signiﬁcant impacts on hospital utilization rates; however, we did not consider measures
of physicians or nursing as variables in our models. A number of factors raise concerns about
the results of these previous studies. First, although Alexander et al. controlled for
temporal autocorrelation in their regression models, neither study acknowledged the spatial
structure of their observations, thus likely misspecifying their regression models. Second, in
both studies, hospital bed availability was calculated using a summation of the beds and
population within the administrative unit boundaries, not incorporating the travel behavior
of patients. Third, both studies were limited to regional-level observation units (58 over
Michigan’s lower peninsula for Alexander et al. and 53 over Michigan’s lower peninsula
excluding Detroit for Clark) and a single scale of analysis.
As Figure 26 illustrates, in the weighted SAR model, the coeﬃcient for E2SFCA decreases slightly as the data are aggregated to a regional-level scale. The most similar level
of aggregation used in our analysis to those employed by Clark and Alexander et al. is 70
clusters (58 observations in the lower peninsula). At this level of aggregation, the weighted
SAR model provides a positive, signiﬁcant coeﬃcient for hospital bed availability; however,
the λ parameter is insigniﬁcant in this model. In a non-spatial weighted OLS regression
with 70 clusters, we ﬁnd that hospital bed availability is again not a signiﬁcant predictor of
utilization rates. These results likely stem from the homogenization of the data that occurs
as the level of aggregation moves towards this regional scale of analysis. Interestingly, the
level of aggregation used by Alexander et al. and Clark is very near an observed threshold
where λ and E2SFCA become insigniﬁcant in the set of SAR models. In fact, at 88 clusters,
E2SFCA is a positive and signiﬁcant predictor and the λ parameter is also signiﬁcant, suggesting that both Alexander et al. and Clark’s studies may have produced diﬀerent ﬁndings
had they used less aggregated data. As a result, the eﬀects of hospital bed availability on
utilization rates may go undetected at regional-level scales. More speciﬁcally, our results provide empirical evidence of a threshold level in the ability to observe the eﬀects of Roemer’s
Law in small area studies.
Recent research has shown the danger in statistical inference garnered from ecological-based relationships at a single geographic scale of analysis. Wright and Ricketts III (2010),
in a review of Kravet et al. (2008), showed that coeﬃcient values related to the supply of
health care resources may change in signiﬁcance and even direction as the scale of analysis
changes by way of data aggregation. Their work highlights the problems associated with the
MAUP in health-based research. In our study, the stability of the coefficients across levels
of aggregation suggests that the observed relationships are not highly susceptible to variation
due to the scale at which the data are aggregated. Although levels of aggregation smaller
than Zip Codes could not be tested (due to the privatization of the hospitalization data),
the overall statistical strength and invariant nature of the relationship between hospital
bed availability and hospital utilization provide strong evidence that our ﬁndings are not a
product of the MAUP.
With support of Roemer’s Law demonstrated, we turn our attention toward the implications of our research with regards to CON programs. Past research has suggested that
over the past 40 years CON programs have not been successful in controlling health care
costs (Ferrier et al., 2010; Rivers et al., 2007; US Federal Trade Commission, 2004). A recent study by Conover and Sloan (2003) reported that Michigan’s CON program had not
eﬀectively contained hospital costs and recommended that the state abandon regulation of
acute care hospital beds. Although the effects of hospital bed availability on health care costs
were not considered in our analysis, our findings do suggest that efforts to control hospital bed availability
will aﬀect hospital utilization rates. Furthermore, the signiﬁcant, stable, and positive nature
of the observed relationship indicates that CON-based regulation of hospital bed supply to
levels consistent with the needs of the population is justiﬁed.
Although it was not the focus of the analysis, our results also showed a strong, positive
association between a higher proportion of Black and Hispanic populations and higher rates
of hospital utilization. Because our models controlled for other possible determinants of hospital utilization
that are often associated with poorer health in disadvantaged populations, namely SES
and access to primary care (ACS hospitalizations), these findings
are troubling from a social justice perspective. Although the cause behind this statistical
association was not further explored in the present analysis, recent work by Grady (2006;
2010) and Grady et al. (2008) has demonstrated that neighborhood segregation is associated
with health disparities in New York and Michigan. In the present context, higher hospitalization rates for areas having a higher percentage of Black residents might point to underlying
health issues that stem from neighborhood eﬀects (Darden et al., 2010; Diez Roux et al.,
2001; Oakes, 2004). Considering that metropolitan Detroit is one of the most segregated
cities in the USA (Darden et al., 2007) and a large proportion of Michigan’s African American population resides in this region, our findings suggest that a more detailed analysis
exploring the eﬀects of race, segregation, and neighborhoods on hospital utilization rates in
southeast Michigan is warranted.

4.6 Limitations

Our analysis did not consider alternative neighborhood structures in the EB smoothing
process or the spatial regression models. Other neighborhood structures, such as those based
on distance or k -nearest neighbors, require a deﬁned threshold value for determining neighbor
status. Given the large range of data conﬁgurations evaluated and their dissimilar geographic
scales (for reference, see Figure 25), specifying a single distance or k threshold would not
provide a consistent spatial structure throughout scales of analysis. Hence, the decision
to employ a ﬁrst-order neighborhood structure was considered necessary due to the multiscalar nature of the research design. For example, if the neighborhood structure was deﬁned
using the 10 nearest neighbors, the neighborhood organization would vary considerably as
the data were aggregated to more regional scales14. The same difficulty would manifest if
a minimum distance threshold was implemented, augmented by the limitations associated
with measuring distances among highly aggregated areal units (Hewko et al., 2002). For the
purposes of our analysis, the ﬁrst-order neighborhood structure provided a characterization
of spatial structure supported by theory (Tobler, 1970) and ﬂexible enough to accommodate
the multi-scale nature of the research design.
Although the scale eﬀect of the MAUP was explored in our analysis, the zoning eﬀect was
not explicitly examined. However, the effects of zone modification were implicitly addressed
through the use of a non-agglomerative clustering methodology. Speciﬁcally, for each iteration in the clustering method, the Zip Code data were clustered, not the clusters from the
previous step in the iteration. Hence, in many cases, regions were essentially “rezoned”, thus
providing an implied examination of the zoning effect of the MAUP. To illustrate this point,
Figure 27 contains an example of a small region that was rezoned rather than agglomerated
as the level of aggregation changed. Given this limitation, we recommend that further consideration of the zoning effect of the MAUP be included in future research on Roemer’s
Law.

Figure 27: Example of rezoned region. In the 84 cluster solution (left), the region contains 6 clusters. In the 79 cluster solution (right), the same region contains 5 non-agglomerative clusters.

14 Specifically, 10 neighbors may approximate first-order neighbors at low levels of aggregation, but 2nd or 3rd order neighbors at higher levels of aggregation.

4.7 Conclusions

This research found a positive, significant association between hospital bed availability
and hospital utilization rates while controlling for other determinants of hospitalization. The
research design was implemented in an explicitly spatial context, incorporating the spatial and
aspatial aspects of health care access and utilization along with the spatial structure of their
relationship. Thus, we have provided compelling empirical evidence to support Roemer’s
Law.

Recent hospital construction and expansion (bypassing the CON program through legislative action) and a proposed transfer of beds into areas of the state without a demonstrated
need for additional hospital beds highlight the importance of our ﬁndings in Michigan. Nationally, as health care systems and hospitals adapt to increasing health care costs, a changing
economic climate, and provisions contained within the Aﬀordable Care Act, gaining a clearer
understanding of the eﬀects of hospital bed availability on hospital utilization is paramount.
Whereas the ﬁndings of this study address the research question originally posed, they
also elicit a number of new questions regarding health care policy and health services research.
Perhaps the most important question is: what are the causal mechanisms that lead to
higher hospitalization rates in areas with higher hospital bed availability? Some
have suggested that the answer lies in the clinical decision-making process of physicians
(Mulley, 2009), whereas others have suggested that it may be the hospitals themselves (Shwartz
et al., 2011); the question remains unanswered.

Acknowledgements: This research was funded by the Michigan Department of Health,
Certiﬁcate of Need Program. The funders had no role in study design, data collection and
analysis, decision to publish, or preparation of the manuscript.
The R code used in this work can be found in Appendix E.

Conclusions

5.1 Overall contributions

This dissertation contributes signiﬁcant new knowledge to the ﬁeld of health services research.
The speciﬁc salient outcomes include: detailing both the theoretical and applied diﬀerences
between the raster and network data models for estimating travel time among locations,
oﬀering a methodology that simultaneously clusters observations based on comprehensive
patterns of utilization and geographic location, and producing compelling, robust evidence
that hospital availability has a positive, signiﬁcant relationship with hospital utilization rates.
Study #1 explores the impacts of data model choice on travel time estimates. A hierarchical classiﬁcation system is presented for assigning speed limits to roads. To explore the
suitability of the assigned speed limits, travel time estimates are compared to those from
Google Maps, an independent data source. Subsequently, populations and areas with limited geographic access to hospitals are identiﬁed using the raster and network data models.
The regions with limited access were generally similar in shape and conﬁguration. However,
the analysis showed that the raster-based method produced larger overall regions than the
network-based method, leading to a greater number of people identified as having limited
access. The major theoretical differences between the underlying data models were linked to
the observed diﬀerences in an applied case study. Consequently, the network data model is
suggested as preferable for estimating vehicular travel time if the topological relationships
governing real-world travel are a priority in study design; these relationships are upheld
within the data model itself. When estimating travel time for non-vehicular travel modes,
in scenarios where travel is not restricted to roadways, and in cases where each location
must be explicitly measured, the raster data model is more suitable given the unconstrained
nature of movement in the data model.
Study #2 presents a clustering methodology for grouping geographically proximal hospitals with similar community utilization patterns. The methodology was specifically developed to create Hospital Groups for Michigan’s CON Program. Therefore, the scientific
and political concerns encountered during the methodology’s development are detailed in
the study. The clustering methodology employs a K-means + Ward’s clustering algorithm,
simultaneously grouping hospitals based on their overall patient utilization patterns and geographic location. All possible values for the number of Hospital Groups (k) are evaluated
and a heuristic is provided to select the final configuration. The methodology was designed
to be repeatable, sustainable, and actionable. However, the clustering methodology can
be employed to group any type of spatial observations having multiple attributes as it was
built on ﬁrst principles of clustering analysis. The methodology can be easily integrated for
use with areal units to create small areas (or regions). In this, the clustering methodology
provides a substantial theoretical advantage over the methods most often employed to create
small areas in that it integrates overall patterns of health care utilization and geographic
location, rather than relying on simple characterizations of utilization or relying solely on
geographic location.
The ﬁnal study (Study #3) in this dissertation examines the eﬀects of Roemer’s Law,
a simple but powerful statement that proposes that hospital utilization will increase if the
supply of hospital beds is increased. This study provides several key innovative and improved
approaches to the study of health care access and utilization. The research design improves
upon previous examinations of Roemer’s Law by incorporating spatial factors in the analysis;
the spatial nature of both utilization and access are considered, while also accounting for the
spatial structure of their relationship. Second, the conceptual model of access is extended
beyond the traditional barriers-only model, in which access is characterized only by the presence
of factors limiting service utilization. The conceptual model is not in itself unique. CON
programs are built around this theory. Yet, most applied access-related studies fail to account
for this phenomenon. Third, the research design incorporates a novel, multi-scalar approach,
exploring the stability of the statistical association between hospital bed availability and
utilization rates. The multi-scalar approach allows for a richer understanding of the eﬀects
of Roemer’s Law, while also providing a general framework for spatial regression analysis
with areal data.
The ﬁndings from Study #3 showed strong empirical evidence of the eﬀects of Roemer’s
Law in Michigan. In a state-level study including the entire system of hospitals and population (including over 160 hospitals, over 1 million patient admissions, and nearly 10 million
residents), the availability of hospital beds was found to have a signiﬁcant, positive eﬀect
on hospital utilization rates while controlling for other determinants of variation in utilization rates. Additionally, this relationship proved to be highly stable across geographic scales.
These results suggest that the eﬀects of Roemer’s Law are robust and due to health processes
unrelated to the scale of analysis.
The main outcomes of the individual studies can be separated into those relating to
advancement in health services research and those relating to matters of public policy. From
a public policy perspective, this work oﬀers updated methodologies to assist CON programs
in their assessment and regulation of health care access. Study #1 provides a step-by-step
guide for implementing the network-based method of identifying Limited Access Areas in
Michigan. In addition, this study oﬀers detailed descriptions of the theoretical diﬀerences
and applied implications of the data models.
The clustering methodology detailed in Study #2 is very near ﬁnal approval into Michigan’s Hospital Bed Standards. The new methodology improves upon the previous methodology, which had become unusable given changes in hospital utilization patterns over time.
The most important policy-related implication of the new methodology is that it provides
an acceptable Hospital Group solution, thus removing the need for an expert committee to
modify the automated results. Thus, from a public policy perspective, the new clustering
methodology provides a greater level of objectivity.
Study #3 ﬁnds strong evidence of the eﬀects of Roemer’s Law, thus providing empirical
support that areas with greater hospital bed availability have greater hospital utilization
rates. Hence, this work provides empirical support for continued CON-based regulation of
Michigan’s hospital bed supply. Study #3 also showed that regions with a higher proportion
of Black residents had higher hospital utilization rates, even while controlling for other
determinants of increased rates. This speciﬁc ﬁnding raises larger concerns about public
health for disadvantaged populations in Michigan.
At a macro-level, this dissertation has provided a broadly-ranging exploration of access
to hospitals and hospital utilization within a regulated health care system. The topic was
approached from an explicitly spatial perspective, exposing the importance of location, geography, and distance-related factors in health services research. The research has delivered
tangible research outcomes while also providing methodological advancements with the potential to improve the eﬀectiveness of CON-based assessment and regulation of health care
services. Thus, when viewed in its entirety, this dissertation provides key insights into the
relationship between access and utilization, the study of access and utilization, and the
methods used by CON programs in their mediation of health care resources.

5.2 Future research

5.2.1 Geographic accessibility

One of the most overlooked and under-reported aspects of research of geographic accessibility
is the uncertainty present in population-level travel time estimates. As was discussed in
Study #1, the actual travel time among locations is governed by a large number of factors
including, but not limited to: individual driving characteristics, traﬃc volume, and the
speciﬁc route chosen. Given these sources of variation, population-level models of travel time
can only aim to provide generalized estimates of travel time among locations. However, the
accuracy of these estimates has been largely ignored in previous health services accessibility
research. The most pertinent sources of uncertainty uncovered in Study #1 are 1) the
accuracy of travel speeds assigned to the roads data and 2) the completeness and/or accuracy
of the roads data. The research approach in Study #1 provides an initial step toward
addressing these issues. By comparing network-derived travel time estimates with estimates
from Google Maps, the custom-built network dataset is evaluated against an independently
derived dataset. Given Google’s lack of transparency regarding the input data and methods
used for their travel time estimates, this approach cannot provide quantitative estimates of
the accuracy of the network or the uncertainty present in the model.
The second source of uncertainty in population travel time estimates arises from the
completeness and/or the positional accuracy of the roads data and how they aﬀect travel
time estimates. In the speciﬁc state-level case study, there were over 750,000 line segments
(network edges) resulting in over 500,000 intersections (network nodes). Although these data
are the most up-to-date available and are provided with metadata that include both a short
description of the methods used to gather the data and the sources of the roads data, no
method currently exists to evaluate the completeness or accuracy of the data themselves.
This issue may be especially salient in Michigan, where a large number of private roads¹⁵ are
found. MDOT could not provide a quantitative estimate of the accuracy or completeness
of the dataset when contacted directly. Given recent research illustrating the importance of
roads data in health-based access studies (see Frizzelle et al., 2009), further exploration into
methods that would provide quantitative estimates of the uncertainty present in large roads
databases or methods to improve their accuracy is warranted.
Perhaps the most important questions raised in the geographic access study are those
regarding the appropriateness and utility of Michigan’s definition of limited access areas.
First, the 30 minute cut-off value employed in identifying limited access areas is likely
outdated. Although a number of studies have invoked 30 minutes as an appropriate travel time
to discern those with geographic access from those without, as with Roemer’s Law,
surprisingly little empirical research exists to justify this particular choice. The most
cited work, that of Bosanac et al. (1976), is over 30 years old. Both the ability to travel
and the expectation of reasonable travel to obtain health care have likely changed since
that research. As reported, for Michigan in 2010, roughly 20% of inpatient days were spent
in hospitals more than 30 miles from the patients’ residences. Second, as was discussed in
Study #3, a more comprehensive framework of spatial accessibility includes not only
distance, but also supply and demand. Thus, although travel time provides a simple, easy to
understand measure of geographic access, this metric alone does not incorporate the other
factors known to influence spatial accessibility and is insufficient in identifying those having
limited access. These findings call into question both the appropriateness of 30 minutes as
a cut-off value and the use of distance alone to determine access status. Importantly, I have
exposed the need for future research that incorporates current patient travel patterns and
expectations of health service accessibility to provide a more complete characterization of
what constitutes “limited” geographic access.

¹⁵ Private roads are those not maintained by local, regional, or state government agencies.
Thus, they are not official roads.
5.2.2 Clustering health care observations

John Griﬃth, one of the architects of the previous clustering methodology employed by
Michigan’s CON program, delivered a positive review of the new clustering methodology
presented in Study #2, stating that it was an important new solution and an advance over
prior work. However, the selection of the ﬁnal number of hospital groups in the clustering
methodology remains essentially heuristic. Important research questions remain unanswered,
speciﬁcally 1) “What is the right number of clusters?” and 2) “What is the proper balance
between objective, scientiﬁc analysis and political considerations in health policy?” On the
surface, the two questions may appear highly dissimilar; however, throughout the development process of the clustering methodology, they were revealed to be unequivocally linked.
Despite the large amount of literature on clustering theory, methods, and uses, a recent
review from the clustering literature states (Jain, 2010, p. 654)¹⁶:

    The most critical choice is K. While no perfect mathematical criterion exists,
    a number of heuristics are available for choosing K. Typically, K-means is run
    independently for different values of K and the partition that appears the most
    meaningful to the domain expert is selected.

¹⁶ In Jain (2010), K is the number of clusters.

Therefore, given the current status of clustering methodologies, the first question above
cannot be authoritatively answered. As a result, the ﬁrst question becomes “What is the
right number of clusters for the particular application?” which directly corresponds to the
second. Professor Griﬃth, given his longtime research focus on policy-based issues in health
care (e.g., Griﬃth, 1972; Griﬃth et al., 1981), seems to have understood this limitation in the
process of identifying the number of Hospital Groups, thus viewing the heuristic employed in
the clustering methodology as a step toward objectivity in a highly politicized process, not
a limitation of the research. Yet, although a politically acceptable solution to this problem
was delivered, further research eﬀorts towards the statistical evaluation of cluster solutions
would likely provide valuable insights toward identifying the right number of clusters in a
dataset, thus marginalizing the subjectivity introduced by employing a domain expert (or
set of experts) for this task.
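The heuristic described by Jain (2010) can be illustrated with a short R sketch. It is not part of the Study #2 methodology: it assumes synthetic two-attribute observations and uses base R's kmeans() to show how the total within-cluster sum of squares changes as K increases. The object names (obs, candidate.k, within.ss, reduction) are hypothetical and chosen only for illustration.

```r
# Illustrative only: synthetic observations in three well-separated groups,
# not hospital or patient data.
set.seed(42)
obs <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)

# Run K-means independently for each candidate K and record the total
# within-cluster sum of squares (smaller values mean tighter clusters).
candidate.k <- 1:6
within.ss <- sapply(candidate.k, function(k) {
  kmeans(obs, centers = k, nstart = 20)$tot.withinss
})

# Proportional reduction gained by moving from K-1 to K clusters.
# The "elbow" heuristic looks for the K after which this gain becomes small;
# the final choice still rests on the analyst's judgment.
reduction <- -diff(within.ss) / within.ss[-length(within.ss)]
print(data.frame(K = candidate.k[-1], prop.reduction = round(reduction, 3)))
```

In an application such as the Hospital Groups, the same loop would be run on the real attribute table, and the resulting table of reductions would inform, rather than replace, the expert decision on the number of groups.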
In the case of Michigan’s Hospital Groups, further evaluation of the state’s proposed
Hospital Group conﬁguration may also beneﬁt policy makers by providing a quantitative
estimate of how the number of Hospital Groups aﬀects predictions of future bed demand.
As the results of Study #3 illustrated, the relationship between hospital utilization rates
and hospital bed availability became undetectable at regional-level scales. Speciﬁcally, a
threshold level was discovered near 90 observation units; further aggregation into fewer
observation units yielded regression models with insignificant predictor variables. Because
the hospitals in the proposed Hospital Groups are more highly aggregated than in the previous
configuration (35 vs. 64 groups) and both fall under the identified threshold, the most
apparent concern is that localized unmet hospital bed demand and/or excess capacity will
be masked by variation within the spatially larger groups. As a result, regions with unmet
demand may go unnoticed. Furthermore, the threshold uncovered in Study #3 casts doubt
upon the use of the Hospital Groups as a unit of analysis for exploring other health-related
relationships in Michigan.
Study #2 also exposed limitations in the current methods used to create small areas in
health services research. Small areas are created by aggregating individual areal observations
into groups or regions. Studies that explore variations in health care spending, health
outcomes, and health care utilization have been at the forefront of the health services
literature since Wennberg and Gittlesohn’s (1973) work exploring variations in utilization
rates in Maine. However, despite the nearly 40 years that have passed since this seminal
publication, their relatively simple method to create the small areas remains oft-used in current
research. This method assigns the initial observations to groups (small areas) using only a
single measure of hospital utilization, the plurality of visits (e.g., Unit A would be assigned
to Hospital 1’s group if more residents visited that facility than any other, regardless of the
actual proportion). While this method performs well in regions where facilities are well distributed, it often requires manual adjustment of the small area membership in regions with
multiple facilities. Hence, the clustering methodology developed in Study #2 provides an
important advance in improving the creation of small areas. However, given that many small
area studies focus on exploring diﬀerences among facilities, this methodology only delivers
the ﬁrst step in this process. To link the clustered regions to speciﬁc hospitals or groups of
hospitals would require an additional step not explored in Study #2. One possible approach
for this task is to consider attributes of both the facilities and population units in the cluster
formation method, as suggested by Gilmour (2010).
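The plurality rule described above can be sketched in a few lines of R. The visit counts below are hypothetical and the object names (visits, assigned, plurality.share) are assumptions made for illustration; the sketch shows only the assignment step, not the manual adjustments that the method often requires.

```r
# Hypothetical visit counts: rows are areal units, columns are hospitals.
visits <- matrix(
  c(40, 35, 25,
    10, 60, 30,
     5,  5, 90,
    34, 33, 33),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("Unit", LETTERS[1:4]), paste0("Hosp", 1:3))
)

# Plurality rule: assign each unit to the hospital receiving the most visits.
# which.max() returns the first maximum, so an exact tie would go to the
# lowest-numbered hospital column.
assigned <- colnames(visits)[apply(visits, 1, which.max)]

# Share of each unit's visits captured by its assigned hospital. The rule
# ignores this value, so a unit can be assigned on a very narrow margin.
plurality.share <- apply(visits, 1, max) / rowSums(visits)

small.areas <- data.frame(unit = rownames(visits), hospital = assigned,
                          share = round(plurality.share, 2))
print(small.areas)
```

The last hypothetical unit illustrates the weakness noted in the text: it is assigned to a hospital that receives barely a third of its visits, which is the kind of case where a single-measure rule is least informative.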
5.2.3 Roemer’s Law

The ﬁnal study found evidence of a positive association between hospital bed availability
and hospital utilization rates while controlling for other determinants of hospital utilization.
Although this outcome does provide support for Roemer’s Law, it raises the question, “What
causes Roemer’s Law?” Given the ecological nature of this research, an attempt to assign
causation based on the findings is not justified. Hence, the logical next step to better
understand the implications of Roemer’s Law is to explore the causal factors that produce
higher utilization rates in areas with greater hospital bed availability. Although Wennberg
(2005) points to clinical decision-making as a possible cause of Roemer’s Law, the complex
interaction of the actors involved in hospitalization (i.e., doctors, hospitals, and patients),
in combination with the socio-demographic, environmental, and stochastic nature of ill-health
events that lead to inpatient hospitalization, suggests that this explanation may be
inadequate. The complexity of the process poses significant challenges to future research
endeavors. Given the high costs of hospitalizations and strong evidence of Roemer’s Law
demonstrated by this study, eﬀorts to understand the underlying mechanisms are clearly
warranted.
The demonstrated eﬀect of Roemer’s Law in Study #3 suggests that eﬀorts to mediate
the availability of hospital beds will likely impact hospital utilization rates. These results
lend support for continued CON regulation of hospital bed availability under the assumptions that 1) overutilization is present in areas with high hospital bed availability and 2)
curbing overutilization of hospital services will assist in lowering overall health care costs.
This dissertation provides the initial step toward evaluating the ﬁrst assumption. A logical continuation of the work would be to isolate the eﬀects of hospital bed availability on
utilization rates and identify areas where high availability leads to higher than expected
utilization. Additionally, exploring temporal changes in utilization rates as hospitals have
opened, closed, and/or expanded their capacity would also likely provide insights towards
identifying areas in which overutilization may be present, along with a more detailed understanding of the relationship between availability and utilization in these areas. Furthermore,
similar inpatient hospitalization data from a state without CON regulation would provide
the opportunity to build a natural experiment exploring whether the presence of a CON
program had an impact on this relationship.
There has been little recent research examining the eﬀectiveness of CON programs in
controlling health care costs. The lack of research likely stems from the variation in scope
and size among the state-based CON programs and the limited availability of information
regarding the specifics of CON laws and oversight (Rivers et al., 2007)¹⁷. This dissertation
corroborates Rivers et al.’s assertion as searches for detailed information regarding nearby
states’ CON programs or recent research exploring speciﬁc CON programs were largely
unsuccessful.
The small number of recent CON-related studies provided conﬂicting results regarding
the programs’ eﬀectiveness. A general review determined that the programs have not reached
their goal of health care cost-containment (Banks et al., 1999). Conover and Sloan (1998)
found that states that had repealed their CON laws did not experience an increase in health
care costs per capita. Further research found that CON regulation may actually lead to
higher health care costs (Rivers et al., 2007) and lower levels of hospital eﬃciency (Granderson, 2011) by obstructing the potential for competition among hospitals. In an example
of conﬂicting research, Ferrier et al. (2010) showed that states with CON programs have
higher levels of hospital eﬃciency, thus improving resource allocation and lowering social
costs. Additionally, recent work by Hellinger (2009) demonstrated that states with CON
programs have fewer hospital beds per capita, which is associated with lower overall health
care costs (however, the CON variable was not itself a signiﬁcant predictor of health care
costs in the model).
CON-based research papers appear to have declined in number over the last decade.
Given that the general findings show that CON programs have failed to reach their aims of
health care cost containment, the question arises of why a majority of states continue
to employ them. Although efforts towards easing CON regulation or deregulation altogether
have increased in recent years (see Romano, 2003; Robeznieks, 2008), CON programs persist.
Perhaps this is a reflection of the duality in the overall goals of many CON programs.
Outright removal of CON laws would not only remove the program tasked with regulating
health services expansion, but also the program that attempts to provide equitable access
to health services for the population. As Rivers et al. (2007) suggest, tasking both goals to
the same program is not practical. In Michigan, this sentiment was offered prior to Rivers et
al.’s research by Conover and Sloan (2003) in their assessment of the state’s CON program.

¹⁷ Finn (2007) provides a historical overview of CON at a national level and a detailed
examination of Michigan’s CON program.

This dissertation found that hospital bed availability is positively related with hospital
utilization rates, thus providing support for the continued regulation of hospital bed supply. However, further identifying areas with signiﬁcant over- or underutilization of hospital
services in the state would provide a better understanding of the eﬀectiveness of Michigan’s
CON program in mediating hospital bed supply to meet population need. This work would
provide the necessary foundation to a larger exploration of the eﬀects of the state’s CON
program on health care costs.
5.2.4 Spatial structure

Study #3 examined the relationship between access and utilization over a broad geographic
area containing large variations in demographic structure, socio-economic status, and urban/rural settings. Given the variability among areas, further explorations of speciﬁc regions
of the state may oﬀer more detailed information regarding the signiﬁcance and magnitude
of this relationship. The CAR and SAR regression models, although accounting for spatial
structure, consider the magnitude of the observed relationships among variables to be stationary (i.e., a single coeﬃcient describes the relationship over the entire study area). Techniques such as Geographically Weighted Regression (GWR) provide the ability to identify
nonstationarity in regression-based coeﬃcients, allowing spatial structure in the observed relationships themselves to be identiﬁed (Brunsdon et al., 1996). GWR could be implemented
to identify regional characteristics of the drivers of hospital utilization, thus providing an
initial step toward a more detailed examination of the determinants of hospital utilization
within speciﬁc regions.
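The idea behind GWR can be illustrated with a minimal base-R sketch. This is not the implementation of Brunsdon et al. (1996) or of any GWR package: it assumes synthetic data, a fixed Gaussian kernel, and a hypothetical bandwidth, and it simply fits one weighted regression per location so that the local slopes can be compared with the single global slope. All object names are illustrative.

```r
# Illustrative synthetic data: the slope of y on x drifts from west to east,
# which a single global coefficient cannot represent.
set.seed(1)
n <- 200
coords <- cbind(east = runif(n, 0, 10), north = runif(n, 0, 10))
x <- rnorm(n)
true.slope <- 0.5 + 0.2 * coords[, "east"]
y <- 1 + true.slope * x + rnorm(n, sd = 0.2)

# Global (stationary) model: one slope for the whole study area
global.slope <- coef(lm(y ~ x))[["x"]]

# Local models: at each observation, fit a weighted regression in which
# nearby observations receive larger Gaussian kernel weights.
bandwidth <- 1.5   # hypothetical fixed bandwidth, in coordinate units
dist.mat <- as.matrix(dist(coords))
local.slope <- sapply(seq_len(n), function(i) {
  w <- exp(-0.5 * (dist.mat[i, ] / bandwidth)^2)
  coef(lm(y ~ x, weights = w))[["x"]]
})

# Compare the single global slope with the spatial trend in local slopes
print(round(global.slope, 2))
print(round(cor(local.slope, coords[, "east"]), 2))
```

A full analysis would also need a principled bandwidth choice and local significance testing, which this sketch deliberately omits.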

5.2.5 Health insurance

In Study #3, health insurance coverage was considered an “aspatial” measure of access to
health care and only contained the percent of the population having coverage. No distinction
was made by source (public or private), type of insurance (e.g., health maintenance
organization (HMO), preferred provider organization (PPO)), or level of coverage. Although
more detailed data regarding the nature of health insurance coverage would be useful, a
more interesting avenue to explore is the potential for the “type” of insurance to impact
utilization patterns. Because HMOs and PPOs offer incentives for staying within a specific
network of providers, the resultant utilization patterns of the population may reflect the
geographical distribution of providers within their specific network, not those expected based
on travel time or hospital bed availability. Additionally, differences in utilization patterns
may manifest between those with public insurance (i.e., Medicare and Medicaid) and those
having private insurance, as well as differences in utilization rates among public and private
insurance holders. Examining the effects of health insurance coverage on spatial patterns
of utilization may provide important insights into understanding the travel behavior
of patients, while also potentially offering a better understanding of how redistribution of
hospital bed availability may impact hospital utilization rates.
5.2.6 Spatial accessibility

The Enhanced 2-Step Floating Catchment Area method was employed to describe the availability and accessibility of hospital beds in Study #3. The creation of improved metrics to
describe spatial accessibility is an active area of health services research; FCA metrics are a
relatively recent development with improvements and modiﬁcations being oﬀered regularly.
An updated method, the 3 Step Floating Catchment Area (3SFCA), was recently proposed
by Wan et al. (2012), introducing potential competition among facilities into the calculation of spatial accessibility. However, preliminary experiments conducted for Study #3 (not
presented in manuscript) suggest that the 3SFCA underestimates potential demand, thus
providing artificially inflated levels of spatial accessibility in areas with multiple nearby facilities. From a theoretical perspective, the inclusion of facility competition into an FCA-based
metric has the potential to provide a more accurate and comprehensive characterization of
availability and accessibility. Yet, the initial experiments suggest that the 3SFCA has not
accomplished this goal. A better understanding of the relationship among supply, demand,
and potential competition is necessary to provide an applied FCA metric that incorporates
these elements simultaneously. Further attention is warranted given the widespread use of
these metrics in current health services research.
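For readers unfamiliar with FCA metrics, the two steps of an enhanced two-step calculation can be sketched in base R. The sketch is a simplified illustration, not the Study #3 implementation: the travel times, populations, bed counts, zone breaks, and decay weights are all hypothetical, and the 3SFCA competition term is not included.

```r
# Hypothetical inputs: travel times (minutes) from 4 population units
# (rows) to 2 hospitals (columns), unit populations, and hospital beds.
tt <- matrix(c( 5, 40,
               15, 25,
               35, 10,
               50, 20),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("pop", 1:4), paste0("hosp", 1:2)))
pop  <- c(1000, 2000, 1500, 500)
beds <- c(100, 50)

# Distance-decay weights by travel-time zone within a 30 minute catchment.
# The zone breaks and weights are illustrative assumptions.
zone.weight <- function(minutes) {
  ifelse(minutes <= 10, 1.0,
  ifelse(minutes <= 20, 0.68,
  ifelse(minutes <= 30, 0.22, 0)))
}
W <- zone.weight(tt)

# Step 1: bed-to-population ratio for each hospital, using the weighted
# population that falls within its catchment
ratio <- beds / colSums(W * pop)

# Step 2: accessibility of each population unit is the weighted sum of
# the ratios of the hospitals within its catchment
access <- as.vector(W %*% ratio)
names(access) <- rownames(tt)
print(round(access * 1000, 2))   # beds per 1,000 weighted population
```

Because the same weights are used in both steps, the population-weighted sum of the accessibility scores equals the total bed supply, which offers a convenient check on any implementation of this form.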
5.2.7 Access and utilization in a regulated health care system

Within the last ﬁve years in Michigan, two hospitals were opened in areas without a demonstrated need for additional hospital beds. Approval for these facilities did not come from the
state CON program, but through specially-drafted legislation subverting the CON process
(Greene, 2012). More recently, another Michigan conglomerate hospital system ﬁled a CON
application to transfer a large number of licensed beds into an area of the state without a
demonstrated need for additional hospital beds (Hopkins, 2012). If the CON application
for the current request is denied, special legislative action to approve the transfer appears
highly likely (Greene, 2012). Thus, control over the distribution of hospital beds in Michigan
will again be removed from the state’s CON program. In addition, the redistribution of the
state’s hospital bed supply will not follow demonstrated patterns of population need. Although these are speciﬁc examples from the particular study area explored in this research,
they are a microcosm of the larger changes occurring in the US health care system. As a
result of rising health care costs and a shift toward proﬁt-maximizing behavior (Kuttner,
2008), the US health care system continues to undergo signiﬁcant changes, many of which
impact health care delivery and population access.
The Patient Protection and Affordable Care Act of 2010 attempts to address “affordability” by increasing
access to health insurance for the currently uninsured and underinsured populations (Schoen
et al., 2011), limiting insurance companies’ abilities to deny coverage, and redistributing
the burden of public health (and costs) onto all stakeholders involved (i.e., health care
practitioners, health insurance companies, and health care consumers). Increases in health
insurance coverage, along with a greying US population due to increased life expectancies and
the aging of the baby-boom generation, have the potential to signiﬁcantly raise the future
demand for health care services (Hofer et al., 2011; Strunk et al., 2006). The potential burden
placed on the US health care system due to increased health insurance access and population
demand for services, in conjunction with the changes resulting from a shifting economic
environment, highlights one piece of the uncertainty existing for the future public health
of US citizens. The poor current performance of the US health care system in numerous
measures of public health outcomes (Murray and Frenk, 2010) only oﬀers further concern.
By examining the meta-relationship between health care access and utilization, this dissertation has provided important ﬁndings while also supplying a number of research pathways
for future studies in health geography. This work did not consider the relationship among
access, utilization, and public health outcomes. Thus, at a macro level, the most important
question raised by this research, but yet to be answered, is “How do access and utilization
aﬀect public health outcomes?” Given the recent changes within the US health care system
and those likely forthcoming, answering this question is paramount to understanding how
changes in health care access and utilization will aﬀect population health in the US.

Appendices

Appendix A
R code to implement the Thomas Methodology
#########################################################################
####
#### Citation information (code)
#### Delamater PL, Shortridge AM, and JP Messina, Regional health
#### care planning: a methodology to cluster facilities using
#### community utilization patterns. BMC Health Services Research
####
#### Citation information (original methodology)
#### Thomas JW, Griffith JR, and P Durance, Defining hospital
#### clusters and associated service communities in metropolitan
#### areas, Socio-Economic Planning Sciences 1981, 15(2):45-51
####
#### Max Relevance Algorithm Clustering Algorithm
####
#### Requires: Patient visits table (zip -> hospital)
####           Zip population data
####           Hospital info table
####           Hospital zip code table
####
#### Interpreted and converted to R code by Paul Delamater and
#### Ashton Shortridge during summer, 2011 for the Michigan
#### Hospital Bed Standard Advisory Committee working group.
#### Funding for this research was provided by the Michigan
#### Department of Community Health.
####
#########################################################################

#########################################################################
##
##
## Get input data
##
##
##
#########################################################################

###########################
## Read patient visits data
###########################
#### Note: pv is a table with hospitals in rows and zip codes
####       in columns. Hospital identifier column should be
####       labeled "HOSP_ID". Zip code column labels should
####       be the five digit zip code (e.g., "48823"). Table
####       entries are the number of hospitalizations from
####       residents of each zip code at each hospital.

# Load data
pv <- read.csv("inputdata/hosp.zip.visits.mtx.csv")
# Ensure HOSP_ID in character format
pv$HOSP_ID <- as.character(pv$HOSP_ID)
# Remove characters from column names
# R adds an "X" to the zip code number
# Assumes all zip codes are 5 digits
names(pv)[2:ncol(pv)] <- substr(as.character(names(pv)[2:ncol(pv)]), 2, 6)
############################################################
## Convert number of visits to proportions (Relevance Index)
############################################################
# Define variable for last zip code column
n.zip <- ncol(pv)
# Sum hospital visits for each zip code
# Assumes HOSP_ID is first column
zip.visits <- colSums(pv[,2:n.zip])
# Divide each entry by summed visits to create Rij values
pv[,2:n.zip] <- pv[,2:n.zip] / rep(zip.visits, each = nrow(pv))
################################
## Read hospital attributes data
################################
#### Note: hosp.info is a table with hospitals in rows and
####       attributes in columns. In this case, the column
####       that corresponds to the patient records is "MIDB".

# Load data
hosp.info <- read.csv("inputdata/hospitals.csv")
#### Note: hosp.HAU is a table with hospitals in rows and
####       attributes in columns. In this case, the column
####       that corresponds to the patient records is "MIDB".
####       Home Areal Unit is "ZIP".

# Load home areal unit (zip code) of each hospital
hosp.HAU <- read.csv("inputdata/hospital.zipcodes.csv")
# Attach home areal unit to patient records
pv <- merge(pv, hosp.HAU, by.x="HOSP_ID", by.y="MIDB", all.x=TRUE)
# Change column name
names(pv)[ncol(pv)] <- "HAU"
# Add column with Rij (Relevance Index) of each hospital in
# its own home areal unit
for (h in 1:nrow(pv)) pv$RiHAU[h] <- pv[h,which(names(pv)==pv$HAU[h])]
###############################
## Get zip code population data
###############################
#### Note: zip.pop is a table with the zip code name in a
####       column, "ZIP" and the population of the zip code
####       in a column, "POP"

# Load data
zip.pop <- read.csv("inputdata/zipcode.population.csv")

#########################################################################
##
##
## Code to implement Thomas Methodology
##
##
##
#########################################################################

#######################################
## Prepare data and create data holders
#######################################
# Add column with initial alpha values (all are set at 0.02)
# In an update of this code, initial alpha values are set at 0.05
pv$alpha <- 0.02
# Define initial values for alpha variables
alpha.1 <- 0.02
alpha.2 <- 0.125
# Add binary holder column for individuals / groups
pv$Group <- 0
# Add column to hold hospital names after clustering
pv$GrNames <- pv$HOSP_ID
####################################################
## Calculate population weighted relevance index, Rj
####################################################
#### Note: Pi = population of areal unit i
####       Rij = relevance index values for areal unit i
####       to hospital j

# Calculate PiRij values (Pi * Rij)
PiRij.matrix <- pv
PiRij.matrix[,2:n.zip] <- PiRij.matrix[,2:n.zip] *
  rep(zip.pop$POP, each = nrow(PiRij.matrix))
# Create holder for Rj values
Rj.all <- NULL
# Create holder for Ij zip codes
Ij.matrix <- pv
Ij.matrix[,2:n.zip] <- 0
# Calculate Rj for each hospital
for (j in 1:nrow(pv)) {
  # Get hospital j's Ri values
  hosp.j <- pv[j,2:n.zip]

  #### Note: From Thomas et al., Ij = set of areal units
  ####       for which individual relevance values of
  ####       hospital j exceeds or equals alpha

  # Find zip codes with Rij greater than or equal to alpha
  Ij.list <- which(hosp.j >= alpha.1)+1
  # Write zip codes meeting the alpha cutoff to Ij holder
  Ij.matrix[j,c(Ij.list)] <- 1

  # If no areal units in Ij, Rj value is zero
  if (length(Ij.list) == 0) {
    # Write hospital ID and 0 to Rj holder
    Rj.all <- rbind(Rj.all, cbind(as.character(pv$HOSP_ID[j]), 0))
  } else {
    # Get list of zip code names
    Ij.zips <- names(pv)[Ij.list]
    # Get numerator value for Rj
    PiRij <- sum(PiRij.matrix[j,Ij.list])
    # Get denominator value for Rj (total zip code population)
    Pi <- sum(zip.pop$POP[c(Ij.list-1)])

    #### Note: Rj = sum(Pi(dij/Di)) / sum(Pi)
    ####       where dij/Di is Relevance Index

    # Calculate Rj (population weighted relevance index)
    Rj <- PiRij / Pi
    # Put in holder
    Rj.all <- rbind(Rj.all, cbind(as.character(pv$HOSP_ID[j]), Rj))
  }
}
# Make Rj.all into dataframe
Rj.all <- as.data.frame(Rj.all)
# Rename columns in Rj.all
names(Rj.all) <- c("HOSP_ID", "Rj")
# Convert Rj to numeric and HOSP_ID to character
# (as.character() first so this works whether or not the columns are factors)
Rj.all$Rj <- as.numeric(as.character(Rj.all$Rj))
Rj.all$HOSP_ID <- as.character(Rj.all$HOSP_ID)
###################################################
## Remove hospitals with Rj of 0 from analysis
## These hospitals are ungroupable using the method
###################################################
# Locate hospitals with Rj = 0
zeros <- which(Rj.all$Rj == 0)
# Get hospital ID
Ungroupable.hospitals <- Rj.all[c(zeros),1]
# Remove hospitals from matrices
# (guard needed: indexing with -integer(0) would drop every row)
if (length(zeros) > 0) {
  pv <- pv[-c(zeros),]
  Rj.all <- Rj.all[-c(zeros),]
  Ij.matrix <- Ij.matrix[-c(zeros),]
  PiRij.matrix <- PiRij.matrix[-c(zeros),]
}
# Write ungroupable hospitals info to table
Ungroupable.hospitals.info <- hosp.info[hosp.info$MIDB %in%
                                        Ungroupable.hospitals, ]
###########################################################
## Start iterative process part of the code and explicitly
## state which method will be used to STOP the process
###########################################################
# Create holder for grouped hospitals
Grouped.Hospitals <- NULL
# Create holder for temporary Rj.min values
Rj.temp <- NULL
#### Note: From Thomas et al., The procedure terminates
####       when one of three conditions occurs: (1) all
####       hospitals have been aggregated into a single
####       large cluster; (2) a user-specified number of
####       iterations has been completed; or (3) all
####       identified clusters are stable, i.e., no
####       cluster serves more than alpha of the patients
####       in the home areal unit of any other cluster.
####
####       These lines will make the iterative process stop
####       at a specified number of Subareas, similar to
####       option number (2) above. To choose this option
####       uncomment the following lines and comment out,
####       "run <- 1" and "while (run == 1) {"

# # Select desired number of Subareas
# n.subareas <- 64
# # Start grouping hospitals
# while (nrow(pv) > n.subareas) {

#### These lines will make the iterative process stop
#### when no hospital/group has greater than alpha
#### of any other hospital's/group's home area (option
#### number (3) above). These lines also stop the code if
#### all hospitals are aggregated into one large
#### group (option number (1) above).

# Create variable used in the iterative process for stopping
run <- 1
# Start grouping hospitals
while (run == 1) {
################################
## Find hospital with minimum Rj
################################
#### Note: Checks for hospitals in a temporary holder. This
####       holder is defined below. It is used in case any
####       hospital is the min Rj, but does not have another
####       hospital to group with yet

if (length(Rj.temp) == 0) {
  # Locate hospital with minimum Rj value
  which.hosps <- which(Rj.all$Rj == min(Rj.all$Rj))
  # Get number of "minimum" Rj hosps
  n.min.hosps <- length(which.hosps)
} else {
  # Locate hospital with minimum Rj value (minus temp)
  which.hosps <- which(Rj.all$Rj == min(Rj.all$Rj[-Rj.temp]))
  # Get number of "minimum" Rj hosps to determine if ties exist
  n.min.hosps <- length(which.hosps)
}

# If a tie exists, randomly select which of the hospitals
# is selected for aggregation. Otherwise min.hosp is used
if (n.min.hosps > 1) {
  # Draw one position uniformly at random from the tied hospitals
  # (round(runif()) would favor the middle positions)
  min.hosp <- sample(n.min.hosps, 1)
  # Select hospital using random variable
  min.hosp <- which.hosps[min.hosp]
} else {
  # Use the single hospital
  min.hosp <- which.hosps
}
# Subset minimum hospital from Rj.all
Rj.min <- Rj.all[min.hosp,]
# Print to screen to display which hospital is selected
print(paste("Rj.min = ", Rj.min[2], ", HOSP_ID = ", Rj.min[1],
sep=""))
# Get Rj.min’s home areal unit (column number!)
Rj.min.Ij <- which(names(pv) == pv$HAU[min.hosp])
# Print Rj.min’s home areal unit and RI
print(paste("Rj.min HAU = ", pv$HAU[min.hosp], ", RI = ",
pv[min.hosp, Rj.min.Ij], sep=""))
################################################################
## Find hospital/cluster with max RI in Rj min’s home areal unit
################################################################
#### Note: From Thomas et al., the hospital with the smallest
####       Rj is identified and grouped to form a cluster with
####       the hospital having the greatest individual
####       relevance in hospital j’s home areal unit.

# Find max RI in Rj min’s home areal unit
Rj.max.Rj.min <- which(pv[,Rj.min.Ij] == max(pv[,Rj.min.Ij]))
# If statement in case it selects itself
# e.g., no hospital or cluster has higher Ri in minimum’s
# home area
if (Rj.max.Rj.min == min.hosp) {
# Pick the next highest after removing min hospital
next.Rj.max <- max(pv[-min.hosp,Rj.min.Ij])
Rj.max.Rj.min <- which(pv[,Rj.min.Ij] == next.Rj.max)
}
# In case of ties for Rj.max select randomly from tied hospitals
if (length(Rj.max.Rj.min) > 1) {
# Generate random number
rand <- round((length(Rj.max.Rj.min)-1)*runif(1))+1
# Use random number to select
Rj.max.Rj.min <- Rj.max.Rj.min[rand]
}
# Get RI of Rj.max
alpha.Rj.max <- pv[Rj.max.Rj.min, Rj.min.Ij]
# Print alpha value and HOSP ID to screen
print(paste("alpha.Rj.max = ", alpha.Rj.max, ", HOSP_ID = ",
pv$HOSP_ID[Rj.max.Rj.min], sep=""))
################################################################
## Big logic part of code. Determines whether to group hospitals
## or move to next minimum hospital in list
################################################################
#### Note: From Thomas et al., ...the hospital with the smallest
####       Rj is identified and grouped to form a cluster with
####       the hospital having the greatest individual relevance
####       in hospital j’s home areal unit.
####       We assume that there is a ’cut-off’ value in this
####       step based on the text in termination option number
####       (3), i.e., no cluster serves more than alpha of the
####       patients in the home areal unit of any other cluster.

# If the Rj value in Rj.min’s home area is larger than the
# alpha cutoff of the hospital or cluster, then cluster
if (alpha.Rj.max >= pv$alpha[Rj.max.Rj.min]) {
#########################################
## Update RI values to reflect clustering
#########################################
# Sum RI values for clustered hospitals
pv[Rj.max.Rj.min,2:n.zip] <- pv[Rj.max.Rj.min,2:n.zip] +
pv[min.hosp,2:n.zip]
# Update alpha score and group columns
pv$alpha[Rj.max.Rj.min] <- alpha.2
pv$Group[Rj.max.Rj.min] <- 1
pv$GrNames[Rj.max.Rj.min] <- paste(pv$GrNames[Rj.max.Rj.min],
pv$GrNames[min.hosp], sep=",")
#####################################
## Update home areal unit for cluster
#####################################
#### Note: From Thomas et al., When a previously formed cluster
####       j* is identified for further clustering, its home
####       areal unit is assumed to be the home areal unit of
####       the hospital (member of j*) having the highest Rij
####       among the cluster hospitals’ home areas

# If Rj min’s relevance in its home area is larger than Rj max
# assign new home areal unit to newly formed cluster
if (pv$RiHAU[Rj.max.Rj.min] < pv$RiHAU[min.hosp]) {
# Assign Rij to cluster entry
pv$RiHAU[Rj.max.Rj.min] <- pv$RiHAU[min.hosp]
# Assign new home areal unit to cluster entry
pv$HAU[Rj.max.Rj.min] <- pv$HAU[min.hosp]
}
#########################################################
## Update Ij.matrix to reflect new alpha value of cluster
#########################################################
# Find zip codes above new alpha value
Ij.new <- which(pv[Rj.max.Rj.min,2:n.zip] >= alpha.2)+1
# Clear old Ij row, then write new zip codes to Ij holder
Ij.matrix[Rj.max.Rj.min, 2:n.zip] <- 0
Ij.matrix[Rj.max.Rj.min,c(Ij.new)] <- 1
######################
## Update PiRij.matrix
######################
# Sum PiRij entries
PiRij.matrix[Rj.max.Rj.min,2:n.zip] <-
PiRij.matrix[Rj.max.Rj.min,2:n.zip] +
PiRij.matrix[min.hosp,2:n.zip]
##############################################
## Update Rj.all with new list of Ij zip codes
##############################################
# Get numerator value for Rj
n.PiRij <- sum(PiRij.matrix[Rj.max.Rj.min,Ij.new])
# Get denominator value (total zip code population)
n.Pi <- sum(zip.pop$POP[c(Ij.new-1)])
# Calculate Rj (population weighted relevance index)
Rj <- n.PiRij / n.Pi
# Put in holder
Rj.all$Rj[Rj.max.Rj.min] <- Rj
#########################################################
## Remove Rj.min from pv, Ij.matrix, PiRij.matrix, Rj.all
## because it has now been grouped
#########################################################
pv <- pv[-c(min.hosp),]
Ij.matrix <- Ij.matrix[-c(min.hosp),]
PiRij.matrix <- PiRij.matrix[-c(min.hosp),]
Rj.all <- Rj.all[-c(min.hosp),]
# Write Rj.min hosp to holder
Grouped.Hospitals <- c(Grouped.Hospitals, Rj.min$HOSP_ID)
# Print to screen which hospitals have been grouped
print(Grouped.Hospitals)
# Reset Rj.temp because current Rj.min has been grouped
Rj.temp <- NULL
} else {
###############################################################
## Re-run steps with a different Rj.min because aggregation may
## produce clusters with home areas > alpha) in former Rj min’s
## home area. So we hold onto this Rj.min and re-check later
## List this Rj.min in holder
###############################################################
Rj.temp <- c(Rj.temp, min.hosp)
}
# Print to screen which hospitals are in Rj.temp and
# the length of both Rj.temp and Rj.all
print(paste("Rj temp has: ", length(Rj.temp), " hospitals/cluster", sep=""))
print(paste("Rj all has: ", nrow(Rj.all)-1, " hospitals/clusters remaining",
sep = ""))
##################################################
## Determine whether to keep attempting to cluster
## or to terminate the iterative process
##################################################
# If all the hospitals (-1) are in Rj.temp, then no
# hospital has more than alpha of another’s home area
if (length(Rj.temp) == nrow(Rj.all)-1) {
run <- 0
}
# If all hospitals are grouped Rj.all has one row
if (nrow(Rj.all) == 1) {
run <- 0
}
}
###################################################
## Attach Subarea designation to hospital info file
###################################################
# Get number of Subareas
n.subareas <- dim(pv)[1]
# Make empty holder
subarea.table <- NULL
# Break apart output table from Thomas method
# and insert into holder
for (p in 1:n.subareas) {
names <- unlist(strsplit(pv$GrNames[p], ","))
subarea.table <- rbind(subarea.table, cbind(p, names))
}
# Rename column names
colnames(subarea.table) <- c("Thomas", "MIDB")
# Attach Subarea names to hospital info file
hosp.info <- merge(hosp.info, subarea.table, by="MIDB", all.x=TRUE)
# Name ungroupable hospitals "NG"
hosp.info$Thomas <- as.character(hosp.info$Thomas)
hosp.info$Thomas[is.na(hosp.info$Thomas)] <- "NG"
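
The population-weighted relevance index Rj that the listing above recomputes after each merge can be illustrated on its own. The following is only a minimal sketch with invented values: the helper name `weighted.relevance` and its arguments are hypothetical and not part of the original script, and it assumes (as the name `PiRij.matrix` suggests) that the numerator is the sum of population times relevance index over the areal units in the set Ij, where Ij holds the units whose relevance index is at least alpha.

```r
# Hypothetical helper (not part of the original script): population-weighted
# relevance index for one hospital or cluster.
#   ri    - relevance index of the hospital in each areal unit (0 to 1)
#   pop   - population of each areal unit
#   alpha - cutoff; areal units with ri >= alpha form the set Ij
weighted.relevance <- function(ri, pop, alpha) {
  in.Ij <- ri >= alpha
  # No areal unit reaches the cutoff, so Rj is undefined
  if (!any(in.Ij)) return(NA_real_)
  sum(pop[in.Ij] * ri[in.Ij]) / sum(pop[in.Ij])
}

# Toy values, invented for illustration only
ri <- c(0.6, 0.1, 0.4)
pop <- c(100, 200, 300)
Rj.toy <- weighted.relevance(ri, pop, alpha = 0.3)
print(Rj.toy)
```

With these toy values only the first and third areal units enter Ij, so the result is (100 x 0.6 + 300 x 0.4) / (100 + 300).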

Appendix B
R code to implement the new clustering methodology
#########################################################################
####
#### Citation information
#### Delamater PL, Shortridge AM, and JP Messina, Regional health
#### care planning: a methodology to cluster facilities using
#### community utilization patterns. BMC Health Services Research
####
#### 2-step K-means + Ward’s Algorithm
####
#### Requires: Patient visits table (zip -> hospital)
####           Hospital travel distance table (hosp -> hosp)
####           Hospital info table
####
#### Methodology developed by Paul Delamater, Ashton Shortridge,
#### and Joe Messina during summer, 2011 for the Michigan Hospital
#### Bed Standard Advisory Committee working group. Funding for
#### this research was provided by the Michigan Department of
#### Community Health.
####
#########################################################################

#########################################################################
##
## Get input data
##
#########################################################################

###########################
## Read patient visits data
###########################
#### Note: pd.1, pd.2, pd.3 are tables with hospitals in rows
####       and zip codes in columns. Hospital identifier column
####       should be labeled "HOSP_ID". Zip code column labels
####       should be the five digit zip code (e.g., "48823").
####       Table entries are the number of patient days from
####       residents of each zip code at each hospital.
####       Assumes patient day matrices have the same dimensions!

# Load data
pd.1 <- read.csv("inputdata/hosp.zip.patdays.mtx.y1.csv")
pd.2 <- read.csv("inputdata/hosp.zip.patdays.mtx.y2.csv")
pd.3 <- read.csv("inputdata/hosp.zip.patdays.mtx.y3.csv")
# Create 3 year sum matrix
p.sum.3yr <- pd.1[,2:ncol(pd.1)] + pd.2[,2:ncol(pd.2)] + pd.3[,2:ncol(pd.3)]
# Re-attach hospital names column
p.sum.3yr <- cbind(pd.1[,1], p.sum.3yr)
# Rename hospital names column
names(p.sum.3yr)[1] <- "HOSP_ID"

# Ensure HOSP_ID in character format
p.sum.3yr$HOSP_ID <- as.character(p.sum.3yr$HOSP_ID)
# Remove characters from column names
# R adds an "X" to the zip code number
# Assumes all zip codes are 5 digits
names(p.sum.3yr)[2:ncol(p.sum.3yr)] <-
substr(as.character(names(p.sum.3yr)[2:ncol(p.sum.3yr)]), 2, 6)
#############################################################
## Convert raw patient days to proportions (Commitment Index)
#############################################################
# Define variable for last zip code column
n.zip <- ncol(p.sum.3yr)
# Sum patient days for each hospital
# Assumes HOSP_ID is first column
hosp.pat <- rowSums(p.sum.3yr[,2:n.zip])
# Divide each column by total patient days
p.sum.3yr[,2:n.zip] <- p.sum.3yr[,2:n.zip] / hosp.pat
# Rename table
p.CI.3yr <- p.sum.3yr
rm(p.sum.3yr)
##########################################
## Remove hospitals with no patient visits
##########################################
# Locate hospitals with zero visits
zero.pv <- which(hosp.pat == 0)
# Get names of zero hospitals (will need this later!)
zero.names <- as.character(p.CI.3yr$HOSP_ID[zero.pv])
#### Note: p.CI.3yr is now an n x z+1 matrix of CI values.
####       The "+1" includes the identifier column (HOSP_ID).

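# Remove hospitals from CI matrix. The length check matters because
# indexing with an empty negative vector would drop every row
if (length(zero.pv) > 0) {
p.CI.3yr <- p.CI.3yr[-c(zero.pv),]
}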
################################
## Read hospital attributes data
################################
#### Note: hosp.info is a table with hospitals in rows and
####       attributes in columns. In this case, the column
####       that corresponds to the patient records is "MIDB".

# Load data
hosp.info <- read.csv("inputdata/hospitals.csv")
############################
## Read travel distance data
############################
#### Note: od is a table with "TO", "FROM", and "DISTANCE"
####       as columns (format from ArcGIS Network Analyst).
####       This table must be re-arranged such that it is
####       an actual OD matrix (n x n dimensions). If data
####       is already arranged in an OD matrix, skip to
####       "Scale table" section.

# Load data
od <- read.csv("inputdata/travel-distance.csv")
#############################
## Convert table to OD matrix
#############################
# Create empty holder
dist.mat <- NULL
# Get unique FROM hospitals (sorted, so that the row order matches
# the TO-sorted columns built in the loop below)
f.hosp <- sort(unique(od$FROM))
# Loop through hospitals
for (fr in 1:length(f.hosp)) {
# Subset
od.sm <- od[od$FROM == f.hosp[fr],]
# Sort matrix, shouldn’t be necessary... but safer
od.sm <- od.sm[order(od.sm$TO),]
# Append distance to holder as a ROW
dist.mat <- rbind(dist.mat, od.sm$DISTANCE)
}
# Make into dataframe
dist.mat <- as.data.frame(dist.mat)
# Assign column names
names(dist.mat) <- f.hosp
# Assign row names
row.names(dist.mat) <- f.hosp
###########################################
## Scale table to match CI data range (0-1)
###########################################
# Get maximum distance between hospitals
max <- max(dist.mat)
# Rescale data
dist.mat <- dist.mat/max
##############################################
## Join distance data matrix to CI data matrix
##############################################
#### Note: To create the final n x m data matrix used for
####       clustering, the n x z and n x n matrices are joined.

# Add column for table join
dist.mat$HOSP_ID <- row.names(dist.mat)
## Join tables, add distance matrix to CI data
p.CI.3yr <- merge(p.CI.3yr, dist.mat, by="HOSP_ID")

#########################################################################
##
## Custom 2-step K-means + Ward’s clustering function
##
#########################################################################
#### Note: The inputs for the function are the n x m data
####       matrix (x) and the desired number of clusters (clusters).

kmeans.ward <- function(x, clusters) {
# Create distance matrix
d <- dist(x, "euclidean")
# Perform Ward’s clustering
hc <- hclust(d, method="ward")
# Get cluster members at "K" clusters
memb <- cutree(hc, k = clusters)
# Make empty holder for cluster center locations
cent <- NULL
# Get cluster centers
for (k in 1:clusters) {
cent <- rbind(cent, colMeans(x[memb == k,]))
}
# Use cluster centers from Ward’s to seed K-means clustering
k.m <- kmeans(x, cent, iter.max = 10000)
# Return the K-means object
return(k.m)
}

#########################################################################
##
## Create initial cluster solutions for Hospital Groups
##
#########################################################################
#######################################
## Prepare data and create data holders
#######################################
#### Note: All possible numbers of clusters are considered
####       from 2 to n-1.

# Define the range of cluster solutions to evaluate (the set k)
cl.max <- nrow(p.CI.3yr)-1
clusters <- c(2:cl.max)
# Create an empty holder for cluster statistics
wss <- bss <- r2 <- incF <- SingHosp <- MaxSize <- rep(0, length(clusters))
k.data.pat <- cbind(clusters, wss, bss, r2, incF, SingHosp, MaxSize)
# Get number of data attributes in table (columns)
col.max <- ncol(p.CI.3yr)

#####################################################
## Conduct K-means + Ward’s for all cluster solutions
#####################################################
for (K in 1:length(clusters)) {
# Use K-means + Wards method to create clusters
Kclust <- kmeans.ward(p.CI.3yr[,2:col.max], clusters[K])
# Write cluster statistics to data holder
# Within sum of squares
k.data.pat[K,2] <- Kclust$tot.withinss
# Between sum of squares
k.data.pat[K,3] <- Kclust$betweenss
# R^2
k.data.pat[K,4] <- 1-(Kclust$tot.withinss/Kclust$totss)
# Number of single hosp clusters
table.c <- table(Kclust$cluster)
k.data.pat[K,6] <- sum(table.c == 1)
# Maximum size of any single cluster
k.data.pat[K,7] <- max(table.c)
}
# Convert data holder to dataframe
k.data.pat <- as.data.frame(k.data.pat)
#################################
## Calculate incremental F scores
#################################
n.obs <- nrow(p.CI.3yr)
for (i in 2:length(clusters)) {
k.data.pat$incF[i] <- ((k.data.pat$r2[i]-k.data.pat$r2[i-1])/
(k.data.pat$clusters[i]-k.data.pat$clusters[i-1])) /
((1-k.data.pat$r2[i])/((n.obs)-(k.data.pat$clusters[i]-1)))
}

#########################################################################
##
## Select the number of Hospital Groups using the heuristic
##
#########################################################################

############################################
## Find local maxima in incremental F scores
############################################
# Make variable of the last candidate solution to evaluate for maxima
i <- cl.max-1

# Find the local maxima
incF.peaks <- which(k.data.pat$incF[3:i] > k.data.pat$incF[2:(i-1)] &
k.data.pat$incF[3:i] > k.data.pat$incF[4:(i+1)])+2
# Subset initial candidate solutions
candidates <- k.data.pat[incF.peaks,]
#######################################################################
## Remove solutions wherein a single cluster has more than 20 hospitals
#######################################################################
candidates <- candidates[candidates$MaxSize < 20,]
###########################################################
## Subset solutions to those with the "minimum" number of
## single hospital Hospital Groups from remaining solutions
###########################################################
candidates <- candidates[candidates$SingHosp == min(candidates$SingHosp), ]
######################################
## From the remaining solutions select
## the solution with most clusters, K
######################################
solution <- candidates[candidates$clusters == max(candidates$clusters), ]
# Get number of clusters
n.clusters <- solution$clusters
###################################################
## Use K-means + Wards method to re-create clusters
###################################################
#### Note: Only the cluster statistics were kept in the
####       initial clustering process. The final cluster
####       solution is recreated to extract Hospital Group
####       membership and cluster center information.
####       Because the clustering algorithm provides
####       deterministic results, this clustering
####       configuration will be identical to the one
####       formed in the initial clustering process.

HG.solution <- kmeans.ward(p.CI.3yr[,2:col.max], n.clusters)
# Attach Hospital Group number to MIDB name
HG.names <- as.data.frame(cbind(p.CI.3yr$HOSP_ID, HG.solution$cluster))
names(HG.names) <- c("MIDB", "HG")

#########################################################################
##
## Rename the Hospital Groups
##
#########################################################################

#### Note: The K-means + Ward’s method names clusters using
####       arbitrary numbers. This section will re-enumerate the
####       Hospital Groups based on an existing larger regional
####       group (HSA - Health Service Area) and the sum of the
####       beds in the Hospital Groups. This section can be
####       omitted if re-enumerating is not necessary.
####
####       This section also requires that the hospital
####       information file (hosp.info) has columns named
####       "HSA" and "BEDS".

# Attach initial cluster number to hospital information table
hosp.info <- merge(hosp.info, HG.names, by="MIDB", all.x=TRUE)
# Convert cluster number column to character format
hosp.info$HG <- as.character(hosp.info$HG)
# If hospitals were removed because they did not have patient
# records, assign them to "NG"
hosp.info$HG[is.na(hosp.info$HG)] <- "NG"
# For each Hospital Group, find the HSA where the max number of
# hospitals falls inside. These lines of code assume that there is
# a column named "HSA" in the hospital information table (hosp.info)
HG.HSA <- NULL
for (hg in 1:n.clusters) {
sub <- hosp.info$HSA[hosp.info$HG == hg]
t.sub <- table(sub)
# which.max() keeps one HSA per Hospital Group, even when HSAs tie
HG.HSA <- c(HG.HSA, names(t.sub)[which.max(t.sub)])
}
##############################################
## Rename Hospital Groups by HSA and bed count
##############################################
# Make holder
HG.NEW <- NULL
# Make counter variable. Will hold the "last" Hospital Group
# name assigned
max.hg <- 0
# Get number of "regions" (HSAs)
hsa.list <- as.numeric(sort(unique(HG.HSA)))
# Start looping through the regions
for (hsa in hsa.list) {
# Get Hospital Groups in region
hsa.hgs <- which(HG.HSA == hsa)
# Subset hospital information file to only hospitals
# in these HSAs
sub.hosp.info <- hosp.info[hosp.info$HG %in% hsa.hgs,]
# Aggregate the number of hospital beds in each
# Hospital Group. Assumes there is a column in hosp.info
# named "BEDS"
bed.totals <- aggregate(sub.hosp.info$BEDS, by=list(HG = sub.hosp.info$HG), sum)
# Reorder aggregated table by total number of beds
bed.totals <- bed.totals[order(bed.totals$x, decreasing=TRUE), ]
# Change column type to character
bed.totals$HG <- as.character(bed.totals$HG)
# Make numbers for first and last Hospital Group in
# this subset. Uses counter.
f.hg <- max.hg+1
l.hg <- nrow(bed.totals)+max.hg
# Make join table
j.HG.names <- as.data.frame(cbind(bed.totals$HG, f.hg:l.hg))
# Name columns
names(j.HG.names) <- c("HG_O", "HG_N")
# Append the holder table
HG.NEW <- rbind(HG.NEW, j.HG.names)
# Advance counter
max.hg <- l.hg
}
################################################################
## Attach new Hospital Group names to hospital information table
################################################################
hosp.info <- merge(hosp.info, HG.NEW, by.x="HG", by.y="HG_O", all.x=TRUE)
# Remove old cluster numbers
hosp.info$HG <- NULL
# Rename Hospital Group column
col <- which(names(hosp.info) == "HG_N")
names(hosp.info)[col] <- "HG"

#########################################################################
##
## Assign a new hospital to existing Hospital Groups
##
#########################################################################

#### Note: This code requires a 1 x n vector of hospital distances
####       to assign a hospital to the existing Hospital Groups
####       based on location. Uses a Euclidean distance measure from
####       the new hospital to the existing cluster centers.

###################################################
## Get original cluster centers from K-means object
###################################################
HG.centers <- as.data.frame(HG.solution$centers)
# Subset to only "travel distance" attributes
HG.centers <- HG.centers[,n.zip:ncol(HG.centers)]
# Attach new cluster names to cluster centers
HG.centers$HG_O <- rownames(HG.centers)
HG.centers <- merge(HG.centers, HG.NEW, by="HG_O")
# Remove old names and re-sort data
HG.centers$HG_O <- NULL
rownames(HG.centers) <- HG.centers$HG_N
HG.centers$HG_N <- NULL
HG.centers <- HG.centers[order(as.numeric(rownames(HG.centers))),]
##################################
## Get new hospital or observation
##################################
# Get travel distance for new observation
new.hosp.loc <- read.csv("inputdata/new.hospital.location.csv")
# Remove characters from column names
# R adds an "X" to the column names that
# are only numeric values.
# Assumes hospital name is 4 characters long.
fix.names <- which(nchar(names(new.hosp.loc)) > 4)
names(new.hosp.loc)[fix.names] <-
substr(as.character(names(new.hosp.loc)[fix.names]), 2, 5)
# Test that columns match in new hospital and Hospital
# Group cluster centers
if (sum(names(new.hosp.loc) != names(HG.centers)) > 0)
print("Columns do not match")
# Divide travel distances by the maximum travel distance
# between any hospitals in Michigan
new.hosp.loc <- new.hosp.loc / max
###################################################
## Create function to measure Euclidean distance in
## n-dimensional space
###################################################
euc.dist <- function(x1, x2) {
dist <- sqrt(sum((x1-x2)^2))
return(dist)
}
#####################################################
## Measure distance from new location to all existing
## Hospital Group centers
#####################################################
new.dists <- apply(HG.centers, 1, euc.dist, x2=new.hosp.loc)
# Get closest Hospital Group
HG.new.hosp.loc <- names(new.dists)[new.dists == min(new.dists)]
#### Note: This is the Hospital Group that the new
####       hospital is assigned to

print(HG.new.hosp.loc)
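
The two-step idea behind the clustering function in this appendix (Ward's hierarchical clustering supplies the starting centers, K-means then refines them) can be tried on a small synthetic data set. The sketch below is an illustration under stated assumptions rather than the procedure used for the Hospital Groups: the toy data and the name `ward.seeded.kmeans` are invented, and `method = "ward.D"` is used because recent R releases renamed the older `"ward"` option of `hclust` to `"ward.D"`.

```r
# Hypothetical toy data: 12 observations, 3 attributes, three loose groups
set.seed(42)
toy <- rbind(matrix(rnorm(12, mean = 0, sd = 0.1), ncol = 3),
             matrix(rnorm(12, mean = 1, sd = 0.1), ncol = 3),
             matrix(rnorm(12, mean = 2, sd = 0.1), ncol = 3))

# Hypothetical re-statement of the two-step approach for illustration
ward.seeded.kmeans <- function(x, k) {
  # Step 1: Ward's hierarchical clustering on Euclidean distances
  hc <- hclust(dist(x, method = "euclidean"), method = "ward.D")
  memb <- cutree(hc, k = k)
  # Cluster centers from the Ward's solution (drop = FALSE keeps
  # single-member clusters as one-row matrices)
  cent <- do.call(rbind, lapply(1:k, function(g) {
    colMeans(x[memb == g, , drop = FALSE])
  }))
  # Step 2: K-means seeded with the Ward's centers
  kmeans(x, centers = cent, iter.max = 10000)
}

fit.a <- ward.seeded.kmeans(toy, 3)
fit.b <- ward.seeded.kmeans(toy, 3)
# Seeding with fixed centers removes the random start, so repeated
# runs give the same membership
print(identical(fit.a$cluster, fit.b$cluster))
print(table(fit.a$cluster))
```

Because the starting centers are fixed by the hierarchical step, re-running the function reproduces the same configuration, which is the property the listing relies on when it re-creates the final solution.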

Appendix C
Testimony– Blue Cross Blue Shield of Michigan/Blue Care Network
Testimony
Blue Cross Blue Shield of Michigan/Blue Care Network
CON Commission Meeting: Proposed Hospital Bed Standards
December 15, 2011
Thank you for the opportunity to provide testimony on behalf of Blue Cross Blue Shield of
Michigan (BCBSM) and Blue Care Network (BCN). BCBSM/BCN supports the proposed
hospital bed standards which have been submitted for Commission consideration by the
Hospital Bed SAC. The proposed standards reflect months of deliberative discussions and
ensure that the needs and realities of the health care marketplace in Michigan are the central
tenet of the standards.
Hospital Group and Bed Need Methodology
The proposed methodologies developed by the workgroup and approved by the SAC were
developed over a period of five months with the participation of multiple stakeholders and
the assistance of the MSU Department of Geography. The workgroup focused on the goal of
developing objective, replicable, and sustainable standards which could be utilized now and
into the future.
The standards developed through the workgroup process accomplish these goals in the following manner:
• The proposed hospital group methodology groups hospitals based on location and
utilization patterns. This methodology will more logically group hospitals
than the groupings provided by the existing methodology.
• The demand for bed need will be based on modeling of trends based on the previous five
years of county-wide patient day data. The previous methodology relied on zip-code
level data and often inaccurate population projections. The proposed methodology
will capture trends in patient day rates more effectively than the current
methodology, will avoid the errors that are encountered when using small
data sets, and will require the collection of dramatically less data.
• According to MSU Geography, which has been contracted to run this data for the
Department in previous years, the methodologies “can be executed within a
short time frame, using open-source code, and produces replicable results.”
[Footnote: The testimony’s text formatting has been slightly modified to meet the required format
for this dissertation. However, instances of boldface text are unchanged from the original
document.]
When considering the tenets of cost, quality, and access, the proposed methodologies show
that the current number of hospitals and hospital beds in the state are more than adequately
serving the demands of Michigan’s population. When run illustratively for the workgroup
using 2009 MIDB data, the proposed methodologies found no areas of hospital bed need in
the state and an overall excess of 6,747 hospital beds state-wide. Should patient population
and utilization trends change in the future, the methodologies are equipped to reflect such
changes.
Hospital Bed Reduction
BCBSM supports the proposals that emerged from the hospital bed reduction work group
as a valuable first step in addressing the excess bed capacity in Michigan’s hospitals. The
proposals adopted by the SAC will limit the financial incentive for hospitals to use large
amounts of excess beds as a bargaining tool for their purchase. Additionally, the proposals
will promote the development of capital projects that will be more reflective of a hospital’s
average occupancy, which could provide cost savings in the future. While BCBSM believes
that the proposal is a step in the right direction, continued efforts must address excess
hospital capacity on a larger scale in order to truly make a more significant impact on excess
costs within the health care system.
Conclusion
BCBSM/BCN supports the Hospital Bed Standards recommended by the Hospital Bed
SAC to the CON Commission. The thorough review of these standards over the past six
months has resulted in significant improvements to the standards that will ensure appropriate
hospital access and reflect the health care needs of the state’s population for years to come.
12/15/11

Appendix D
Additional Figures and Tables
[Figure D.1 plot not reproduced: incF score (y-axis) against number of clusters (x-axis, 200 to 400).]

Figure D.1: incF scores for cluster solutions in set S. Black points represent peak
values in incF scores. The data has been truncated for display purposes.
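
The peak values referred to in this caption are local maxima of the incremental F score. The following is a minimal sketch of both steps, written to mirror the incF expression in the Appendix B listing; the function names `inc.f` and `local.peaks` and all input values are invented for illustration, and the first incF entry is left as NA here, whereas the original script initializes its holder with zeros.

```r
# Hypothetical helper: incremental F score between consecutive cluster
# solutions, following the expression used in the Appendix B listing.
#   r2    - R^2 of each cluster solution
#   k     - number of clusters in each solution
#   n.obs - number of observations
inc.f <- function(r2, k, n.obs) {
  out <- rep(NA_real_, length(k))
  for (i in seq_along(k)[-1]) {
    gain <- (r2[i] - r2[i - 1]) / (k[i] - k[i - 1])
    resid <- (1 - r2[i]) / (n.obs - (k[i] - 1))
    out[i] <- gain / resid
  }
  out
}

# Hypothetical helper: indices of values strictly larger than both neighbours
local.peaks <- function(v) {
  n <- length(v)
  if (n < 3) return(integer(0))
  inner <- 2:(n - 1)
  # which() drops comparisons that involve NA
  inner[which(v[inner] > v[inner - 1] & v[inner] > v[inner + 1])]
}

# Toy values, invented for illustration only
incF.toy <- inc.f(r2 = c(0.5, 0.6, 0.65), k = 2:4, n.obs = 10)
print(incF.toy)
peaks.toy <- local.peaks(c(1, 3, 2, 5, 4, 4, 6, 1))
print(peaks.toy)
```

Ties are not counted as peaks in this sketch because the comparison is strict, which matches the strict inequalities used in the listing.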

[Figure D.2 plot not reproduced: p value of Moran's I for the weighted OLS model (y-axis, 0.0 to 1.0) against number of clusters (x-axis, 0 to 800).]

Figure D.2: Moran’s I of regression residuals for weighted OLS regression model.
All values less than 0.05 (dotted line) have significant spatial autocorrelation in the model
residuals.
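
For readers unfamiliar with the statistic plotted here, Moran's I and a p value for it can be sketched in base R. This is an illustration only and not the test used to produce the figure: the function names, the binary neighbour matrix, and the one-sided permutation p value are all assumptions made for the example, and a test designed for regression residuals would normally be preferred in practice.

```r
# Hypothetical helper: Moran's I for values x and a spatial weights matrix w
morans.i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Hypothetical helper: one-sided permutation p value (positive autocorrelation)
moran.perm.p <- function(x, w, n.sim = 999) {
  obs <- morans.i(x, w)
  sims <- replicate(n.sim, morans.i(sample(x), w))
  (sum(sims >= obs) + 1) / (n.sim + 1)
}

# Toy layout, invented for illustration: four units along a line,
# each unit neighbouring the next one
w <- matrix(0, 4, 4)
w[cbind(1:3, 2:4)] <- 1
w <- w + t(w)

I.toy <- morans.i(c(1, 2, 3, 4), w)
print(I.toy)

set.seed(1)
p.toy <- moran.perm.p(c(1, 2, 3, 4), w)
print(p.toy)
```

A p value below 0.05 from such a test would be read, as in the caption, as evidence of spatial autocorrelation in the values supplied.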

[Figure D.3 plots not reproduced: two panels showing the p value of Moran's I for the weighted CAR model and for the weighted SAR model (y-axes) against number of clusters (x-axes, 0 to 800).]

Figure D.3: Moran’s I of regression residuals for weighted SAR and CAR models.
All values less than 0.05 (dotted line) have significant spatial autocorrelation in the model
residuals.

[Figure D.4 plots not reproduced: two panels showing the p value of the Levene Test for the CAR model and for the SAR model (y-axes, 0.0 to 1.0) against number of clusters (x-axes, 0 to 800).]

Figure D.4: Levene Test of regression residuals for SAR and CAR models. All
values less than 0.05 (dotted line) have significant heteroscedasticity in the model residuals
due to population size.
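
A Levene-type test of this kind can be sketched in base R as a one-way ANOVA on absolute deviations from group centers. The sketch below is an assumption-laden illustration, not the code behind the figure: the function name `levene.p` is invented, deviations are taken from group medians (a common variant of the test), and grouping the residuals into population size classes is assumed here purely for the example.

```r
# Hypothetical helper: p value of a median-centered Levene-type test
#   res   - model residuals
#   group - grouping variable (assumed here: population size class)
levene.p <- function(res, group) {
  group <- factor(group)
  # Absolute deviation of each residual from its group median
  dev <- abs(res - ave(res, group, FUN = median))
  # One-way ANOVA on the deviations; first row holds the group effect
  anova(lm(dev ~ group))[["Pr(>F)"]][1]
}

# Toy residuals, invented for illustration: small spread in the
# "large" population class, wide spread in the "small" class
res.toy <- c(-0.1, 0.1, -0.1, 0.1, -0.1, 0.1,
             -5, 5, -4, 4, -6, 6)
grp.toy <- rep(c("large", "small"), each = 6)
p.lev <- levene.p(res.toy, grp.toy)
print(p.lev < 0.05)
```

In this toy case the spreads differ sharply, so the p value falls below the 0.05 line used in the figure.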

[Figure D.5 plots not reproduced: two panels showing the p value of the Levene Test for the CAR model and for the SAR model (y-axes) against number of clusters (x-axes, 0 to 800).]

Figure D.5: Levene Test of regression residuals for weighted SAR and CAR models. All values less than 0.05 (dotted line) have significant heteroscedasticity in the model
residuals due to population size.

Table D.1: Cluster solutions and incF scores.
CL
8
13
17
19
21
23
27
29
32
34
37
40
44
47
50
54
56
59
61
64
66
70
73
77
79
84
88
90
93
96
101
103
105
108

incF
173.588
117.188
57.137
46.520
41.105
35.267
28.022
27.952
21.648
32.543
22.762
24.243
24.131
22.427
17.827
17.783
23.635
15.538
17.436
10.838
18.027
11.505
20.162
13.521
14.299
8.937
9.763
12.296
8.184
8.975
10.237
8.573
8.618
9.590

CL incF
219 4.478
223 4.785
227 4.155
229 4.191
231 3.352
234 3.802
237 3.709
239 4.119
243 3.621
245 3.812
247 3.220
250 3.472
253 3.558
255 3.436
257 3.541
259 3.644
263 2.805
265 3.905
268 3.972
272 3.450
274 3.630
279 3.544
281 3.508
284 3.550
290 4.243
294 3.461
297 3.419
299 3.022
301 3.446
303 3.384
306 3.340
308 3.149
310 3.277
312 3.186

162

CL incF
420 2.574
422 2.525
424 2.514
426 2.470
428 2.446
430 2.422
432 2.421
437 2.399
441 2.409
445 2.388
447 2.343
449 2.370
453 2.367
455 2.367
457 2.362
461 2.584
466 2.379
471 2.305
474 2.347
477 2.350
480 2.330
484 2.287
488 2.295
494 2.287
497 2.284
500 2.287
506 2.408
509 2.042
511 2.206
516 2.270
520 2.112
523 2.193
525 2.203
528 2.317
Cont.

CL incF
634 1.950
637 1.936
641 1.968
645 1.957
647 1.928
651 1.915
653 1.896
656 1.890
659 1.876
662 1.869
664 1.901
669 1.887
672 1.882
674 1.865
676 1.838
679 1.839
681 1.850
688 1.787
690 1.787
693 1.779
695 1.771
698 1.769
701 1.773
709 1.799
711 1.784
714 1.789
718 1.806
722 1.803
725 1.765
728 1.770
731 1.789
735 1.734
737 1.699
746 1.752
on next page

Table D.1 – Cont. from previous page
CL     incF   CL    incF   CL    incF   CL    incF
111   6.621   316  2.903   531  2.176   749  1.683
114  10.148   319  3.137   533  2.164   753  1.681
116   9.488   322  3.003   536  2.156   756  1.697
121   7.840   325  3.333   541  2.139   760  1.679
124   6.302   329  2.908   544  2.148   763  1.680
129  10.391   331  2.958   546  2.636   766  1.682
131   7.376   334  2.929   550  2.133   770  1.680
133   8.357   337  3.159   554  2.146   775  1.670
136   7.806   339  2.895   556  2.150   782  1.672
139   6.168   341  2.760   558  2.153   785  1.735
142   5.989   343  2.942   560  1.958   789  1.658
145   6.427   345  2.834   562  2.149   792  1.654
147   6.223   349  2.725   564  2.182   796  1.597
150   5.386   354  2.978   566  2.177   798  1.593
152   6.834   356  2.891   568  2.151   803  1.562
154   6.671   359  3.673   570  2.101   805  1.564
157   6.621   362  2.597   572  2.152   807  1.533
159   5.148   364  2.648   575  2.150   810  1.528
161   6.301   367  2.936   578  2.143   812  1.589
164   6.079   369  2.762   581  2.191   821  1.544
166   5.838   373  2.758   586  2.096   824  1.521
172   7.055   376  2.884   588  2.098   827  1.532
176   6.552   381  3.257   591  2.097   830  1.480
180   6.088   383  2.725   593  2.170   837  1.515
184   5.825   386  2.659   596  2.063   843  1.484
187   4.415   389  2.697   598  2.110   849  1.463
189   4.086   393  2.673   601  2.078   855  1.415
192   6.036   396  2.651   607  2.080   858  1.361
194   5.112   399  2.638   610  2.054   860  1.373
196   5.035   402  2.658   612  1.916   868  1.418
199   5.050   407  2.551   615  2.038   876  1.546
204   5.255   409  2.736   619  2.029   881  1.600
208   4.161   411  2.895   621  2.037   886  1.751
210   4.120   413  3.031   623  2.098   888  1.764
Cont. on next page

Table D.1 – Cont. from previous page
CL     incF   CL    incF   CL    incF   CL    incF
216   4.818   416  2.443   625  2.057   890  1.665

Table D.2: Number of components and % of the total variance explained for each
functional set of variables.

CL    SES      ETH      TRAN     MOB      CASE
      n s      n s      n s      n s      n s
34    1 0.85   1 0.43   1 0.64   2 0.89   1 0.93
37    1 0.86   1 0.40   1 0.65   2 0.89   1 0.93
40    1 0.85   2 0.61   1 0.64   2 0.89   1 0.93
44    1 0.83   5 1.00   2 0.88   2 0.88   1 0.93
47    1 0.82   3 0.78   1 0.62   2 0.88   1 0.93
50    1 0.81   3 0.77   1 0.64   2 0.88   1 0.93
54    1 0.81   3 0.76   2 0.85   2 0.88   1 0.93
56    1 0.81   2 0.59   1 0.61   2 0.88   1 0.93
59    1 0.80   1 0.38   1 0.60   2 0.88   1 0.93
61    1 0.79   2 0.59   1 0.60   2 0.88   1 0.93
64    1 0.78   2 0.58   2 0.84   2 0.88   1 0.93
66    1 0.77   3 0.75   1 0.59   2 0.87   1 0.93
70    1 0.77   3 0.75   1 0.60   2 0.86   1 0.93
73    1 0.79   1 0.36   1 0.60   2 0.87   1 0.92
77    1 0.82   2 0.57   1 0.60   2 0.87   1 0.92
79    1 0.82   3 0.74   1 0.60   2 0.86   1 0.92
84    1 0.81   3 0.74   1 0.59   2 0.87   1 0.91
88    1 0.80   2 0.57   1 0.59   2 0.86   1 0.91
90    1 0.80   1 0.37   1 0.59   2 0.86   1 0.91
93    1 0.80   1 0.37   1 0.59   2 0.86   1 0.91
96    1 0.79   5 1.00   1 0.59   2 0.86   1 0.93
101   1 0.79   2 0.57   1 0.60   2 0.85   1 0.94
103   1 0.79   2 0.56   1 0.60   2 0.85   1 0.94
105   1 0.79   2 0.56   1 0.59   2 0.85   1 0.93
108   1 0.79   3 0.73   1 0.59   2 0.85   1 0.93
111   1 0.79   3 0.73   1 0.60   2 0.85   1 0.93
114   1 0.79   4 0.87   1 0.60   2 0.85   1 0.93
116   1 0.79   2 0.56   1 0.60   2 0.85   1 0.93
121   1 0.79   2 0.56   1 0.61   2 0.84   1 0.93
124   1 0.78   3 0.73   1 0.61   2 0.85   1 0.93
129   1 0.78   3 0.73   1 0.60   2 0.84   1 0.94
131   1 0.78   1 0.36   1 0.59   2 0.84   1 0.93
Cont. on next page

Table D.2 – Cont. from previous page
CL    SES      ETH      TRAN     MOB      CASE
      n s      n s      n s      n s      n s
133   2 0.89   3 0.74   2 0.91   2 0.85   1 0.93
136   2 0.89   3 0.74   2 0.91   2 0.85   1 0.93
139   2 0.89   3 0.75   2 0.91   2 0.85   1 0.93
142   2 0.89   3 0.74   2 0.91   2 0.84   1 0.93
145   2 0.89   3 0.74   2 0.91   2 0.84   1 0.93
147   2 0.89   3 0.74   2 0.91   2 0.84   1 0.93
150   1 0.70   3 0.74   2 0.91   2 0.84   1 0.93
152   2 0.89   3 0.74   2 0.91   2 0.84   1 0.91
154   1 0.70   3 0.74   2 0.91   2 0.84   1 0.91
157   2 0.89   3 0.74   2 0.91   2 0.84   1 0.91
159   2 0.89   3 0.74   2 0.91   2 0.84   1 0.91
161   2 0.89   3 0.74   2 0.91   2 0.84   1 0.91
164   2 0.89   3 0.73   2 0.91   2 0.83   1 0.90
166   1 0.70   3 0.73   2 0.91   2 0.83   1 0.90
172   2 0.89   3 0.73   2 0.90   2 0.83   1 0.90
176   2 0.89   3 0.73   2 0.89   2 0.83   1 0.90
180   2 0.89   3 0.73   2 0.89   2 0.83   1 0.90
184   1 0.71   2 0.56   2 0.89   2 0.83   1 0.90
187   2 0.89   3 0.73   2 0.89   2 0.82   1 0.90
189   1 0.72   2 0.56   2 0.89   2 0.82   1 0.89
192   1 0.71   3 0.73   2 0.89   2 0.83   1 0.89
194   1 0.71   3 0.73   2 0.89   2 0.83   1 0.89
196   1 0.71   2 0.56   2 0.89   2 0.82   1 0.89
199   1 0.71   1 0.36   2 0.88   2 0.82   1 0.90
204   1 0.71   2 0.56   2 0.88   2 0.83   1 0.90
208   1 0.71   2 0.56   2 0.88   2 0.82   1 0.90
210   2 0.88   3 0.73   2 0.88   2 0.82   1 0.90
216   2 0.88   2 0.56   2 0.88   2 0.82   1 0.89
219   1 0.72   3 0.72   2 0.88   2 0.82   1 0.89
223   1 0.72   2 0.55   2 0.88   2 0.82   1 0.89
227   1 0.71   2 0.55   2 0.88   2 0.82   1 0.89
229   1 0.72   2 0.55   2 0.88   2 0.82   1 0.89
231   1 0.72   2 0.55   2 0.88   2 0.81   1 0.88
Cont. on next page

Table D.2 – Cont. from previous page
CL    SES      ETH      TRAN     MOB      CASE
      n s      n s      n s      n s      n s
234   1 0.72   2 0.55   2 0.88   2 0.81   1 0.88
237   1 0.72   2 0.55   2 0.88   2 0.81   1 0.88
239   1 0.72   2 0.55   2 0.88   2 0.81   1 0.88
243   1 0.72   2 0.55   2 0.88   2 0.80   1 0.88
245   1 0.72   2 0.55   2 0.88   2 0.81   1 0.88
247   1 0.72   4 0.87   2 0.88   2 0.81   1 0.88
250   1 0.71   2 0.55   2 0.88   2 0.81   1 0.88
253   1 0.72   2 0.55   2 0.88   2 0.81   1 0.88
255   1 0.72   2 0.55   2 0.88   2 0.80   1 0.88
257   1 0.72   2 0.55   2 0.88   2 0.80   1 0.88
259   1 0.72   2 0.55   2 0.88   2 0.80   1 0.88
263   1 0.72   2 0.55   2 0.88   2 0.80   1 0.88
265   1 0.72   2 0.55   2 0.88   2 0.80   1 0.88
268   1 0.72   2 0.55   2 0.88   2 0.80   1 0.88
272   1 0.72   2 0.55   2 0.88   2 0.80   1 0.88
274   1 0.72   2 0.55   2 0.88   2 0.80   1 0.88
279   1 0.72   3 0.72   2 0.88   2 0.79   1 0.88
281   1 0.72   2 0.55   2 0.88   2 0.79   1 0.87
284   1 0.73   2 0.55   2 0.88   2 0.79   1 0.87
290   1 0.72   2 0.55   2 0.87   2 0.79   1 0.87
294   1 0.72   2 0.55   2 0.87   2 0.79   1 0.87
297   1 0.72   1 0.35   2 0.87   2 0.79   1 0.87
299   1 0.72   2 0.55   2 0.87   2 0.79   1 0.87
301   1 0.72   2 0.55   2 0.87   2 0.79   1 0.87
303   1 0.72   2 0.54   2 0.87   2 0.78   1 0.87
306   1 0.72   2 0.54   2 0.87   2 0.78   1 0.87
308   1 0.72   3 0.71   2 0.87   2 0.78   1 0.87
310   1 0.72   3 0.71   2 0.87   2 0.78   1 0.87
312   1 0.72   1 0.35   2 0.87   2 0.78   1 0.87
316   1 0.72   2 0.54   2 0.87   2 0.78   1 0.86
319   1 0.72   2 0.54   2 0.87   2 0.78   1 0.87
322   1 0.71   4 0.87   2 0.87   2 0.78   1 0.87
325   1 0.71   2 0.54   2 0.87   2 0.78   1 0.87
Cont. on next page

Table D.2 – Cont. from previous page
CL    SES      ETH      TRAN     MOB      CASE
      n s      n s      n s      n s      n s
329   1 0.71   1 0.36   2 0.87   2 0.78   1 0.87
331   1 0.71   4 0.87   2 0.87   2 0.78   1 0.86
334   1 0.71   3 0.71   2 0.87   2 0.78   1 0.87
337   1 0.71   3 0.72   2 0.87   2 0.78   1 0.86
339   1 0.71   3 0.72   2 0.87   2 0.78   1 0.86
341   1 0.71   4 0.87   2 0.87   2 0.78   1 0.86
343   1 0.71   2 0.54   2 0.87   2 0.78   1 0.86
345   1 0.71   3 0.71   2 0.87   2 0.78   1 0.86
349   1 0.71   2 0.54   2 0.87   2 0.78   1 0.86
354   1 0.72   2 0.54   2 0.87   2 0.78   1 0.86
356   1 0.72   3 0.71   2 0.87   2 0.78   1 0.85
359   1 0.72   1 0.36   2 0.87   2 0.78   1 0.85
362   1 0.72   3 0.72   2 0.87   2 0.78   1 0.85
364   1 0.72   2 0.55   2 0.87   2 0.78   1 0.85
367   1 0.72   3 0.72   2 0.87   2 0.78   1 0.85
369   1 0.72   1 0.36   2 0.87   2 0.78   1 0.85
373   1 0.72   2 0.54   2 0.87   2 0.78   1 0.84
376   1 0.72   2 0.54   2 0.87   2 0.78   1 0.84
381   1 0.72   2 0.54   2 0.87   2 0.77   1 0.85
383   1 0.72   2 0.54   2 0.87   2 0.77   1 0.85
386   1 0.72   2 0.54   2 0.87   2 0.77   1 0.85
389   1 0.72   1 0.36   2 0.87   2 0.77   1 0.85
393   1 0.71   2 0.54   2 0.86   2 0.77   1 0.85
396   1 0.71   4 0.86   2 0.86   2 0.77   1 0.85
399   1 0.71   3 0.70   2 0.86   2 0.77   1 0.84
402   1 0.71   2 0.54   2 0.86   2 0.77   1 0.84
407   1 0.71   2 0.54   2 0.86   2 0.77   1 0.84
409   1 0.71   1 0.35   2 0.86   2 0.77   1 0.84
411   1 0.71   4 0.86   2 0.86   2 0.77   1 0.84
413   1 0.71   4 0.86   2 0.86   2 0.77   1 0.84
416   1 0.71   1 0.35   2 0.86   2 0.77   1 0.84
420   1 0.71   2 0.54   2 0.86   2 0.77   1 0.82
422   1 0.71   4 0.86   2 0.86   2 0.77   1 0.82
Cont. on next page

Table D.2 – Cont. from previous page
CL    SES      ETH      TRAN     MOB      CASE
      n s      n s      n s      n s      n s
424   1 0.71   3 0.70   2 0.86   2 0.76   1 0.82
426   1 0.71   4 0.86   2 0.86   2 0.76   1 0.82
428   1 0.71   2 0.54   2 0.86   2 0.76   1 0.82
430   1 0.71   2 0.54   2 0.86   2 0.76   1 0.82
432   1 0.71   2 0.54   2 0.86   2 0.76   1 0.81
437   1 0.71   4 0.86   2 0.86   2 0.76   1 0.82
441   1 0.71   4 0.86   2 0.86   2 0.76   1 0.80
445   1 0.72   4 0.86   2 0.86   2 0.76   1 0.80
447   1 0.72   2 0.54   2 0.86   2 0.76   1 0.80
449   1 0.72   2 0.53   2 0.86   2 0.76   1 0.81
453   1 0.71   1 0.35   2 0.86   2 0.76   1 0.81
455   1 0.71   2 0.53   2 0.86   2 0.76   1 0.81
457   1 0.72   2 0.53   2 0.86   2 0.76   1 0.81
461   1 0.73   2 0.53   2 0.86   2 0.76   1 0.80
466   1 0.72   4 0.86   2 0.86   2 0.75   1 0.80
471   1 0.72   2 0.53   2 0.86   2 0.75   1 0.80
474   1 0.72   4 0.86   2 0.86   2 0.75   1 0.80
477   1 0.72   2 0.53   2 0.86   2 0.75   1 0.80
480   1 0.72   2 0.53   2 0.86   2 0.75   1 0.80
484   1 0.72   2 0.54   2 0.86   2 0.75   1 0.80
488   1 0.73   2 0.53   2 0.86   2 0.74   1 0.80
494   1 0.73   1 0.35   2 0.86   2 0.75   1 0.80
497   1 0.73   1 0.35   2 0.86   2 0.74   1 0.80
500   1 0.72   2 0.53   2 0.85   2 0.74   1 0.80
506   1 0.73   4 0.86   2 0.85   2 0.74   1 0.80
509   1 0.72   4 0.86   2 0.85   2 0.74   1 0.80
511   1 0.72   2 0.53   2 0.85   2 0.74   1 0.79
516   1 0.73   4 0.86   2 0.85   2 0.75   1 0.79
520   1 0.73   4 0.86   2 0.85   2 0.75   1 0.79
523   1 0.73   4 0.86   2 0.85   2 0.75   1 0.79
525   1 0.73   4 0.86   2 0.85   2 0.75   1 0.79
528   1 0.73   2 0.53   2 0.85   2 0.75   1 0.79
531   1 0.72   1 0.35   2 0.85   2 0.74   1 0.79
Cont. on next page

Table D.2 – Cont. from previous page
CL    SES      ETH      TRAN     MOB      CASE
      n s      n s      n s      n s      n s
533   1 0.72   3 0.70   2 0.85   2 0.74   1 0.79
536   1 0.72   1 0.35   2 0.85   2 0.74   1 0.79
541   1 0.72   4 0.86   2 0.85   2 0.74   1 0.79
544   1 0.72   3 0.70   2 0.85   2 0.74   1 0.79
546   1 0.73   1 0.35   2 0.85   2 0.74   1 0.80
550   1 0.72   4 0.86   2 0.85   2 0.74   1 0.79
554   1 0.72   3 0.70   2 0.85   2 0.74   1 0.79
556   1 0.72   4 0.86   2 0.85   2 0.74   1 0.79
558   1 0.73   2 0.53   2 0.85   2 0.74   1 0.79
560   1 0.73   1 0.35   2 0.85   2 0.74   1 0.79
562   1 0.73   3 0.70   2 0.85   2 0.74   1 0.79
564   1 0.73   1 0.35   2 0.85   2 0.74   1 0.79
566   1 0.73   3 0.70   2 0.85   2 0.74   1 0.79
568   1 0.73   1 0.35   2 0.85   2 0.74   1 0.79
570   1 0.73   3 0.70   2 0.85   2 0.74   1 0.79
572   1 0.73   1 0.35   2 0.85   2 0.74   1 0.79
575   1 0.73   2 0.53   2 0.85   2 0.74   1 0.79
578   1 0.73   2 0.53   2 0.85   2 0.74   1 0.79
581   1 0.73   2 0.53   2 0.85   2 0.74   1 0.79
586   1 0.73   4 0.86   2 0.85   2 0.74   1 0.79
588   1 0.73   2 0.53   2 0.85   2 0.74   1 0.79
591   1 0.73   2 0.53   2 0.85   2 0.74   1 0.79
593   1 0.73   4 0.86   2 0.85   2 0.74   1 0.79
596   1 0.72   4 0.86   2 0.85   2 0.74   1 0.79
598   1 0.72   2 0.53   2 0.85   2 0.74   1 0.79
601   1 0.72   2 0.54   2 0.85   2 0.74   1 0.79
607   1 0.72   4 0.86   2 0.85   2 0.74   1 0.79
610   1 0.72   4 0.86   2 0.85   2 0.74   1 0.79
612   1 0.72   2 0.54   2 0.85   2 0.74   1 0.78
615   1 0.72   2 0.54   2 0.85   2 0.74   1 0.78
619   1 0.72   4 0.86   2 0.85   2 0.74   1 0.78
621   1 0.72   1 0.36   2 0.85   2 0.74   1 0.78
623   1 0.72   2 0.54   2 0.85   2 0.74   1 0.78
Cont. on next page

Table D.2 – Cont. from previous page
CL    SES      ETH      TRAN     MOB      CASE
      n s      n s      n s      n s      n s
625   1 0.72   2 0.54   2 0.85   2 0.74   1 0.78
634   1 0.71   4 0.86   2 0.85   2 0.74   1 0.78
637   1 0.71   2 0.53   2 0.85   2 0.74   1 0.78
641   1 0.71   2 0.53   2 0.85   2 0.74   1 0.79
645   1 0.71   4 0.86   2 0.85   2 0.74   1 0.79
647   1 0.71   4 0.86   2 0.85   2 0.74   1 0.79
651   1 0.72   2 0.53   2 0.85   2 0.74   1 0.79
653   1 0.72   2 0.53   2 0.85   2 0.74   1 0.79
656   1 0.72   2 0.53   2 0.85   2 0.74   1 0.79
659   1 0.72   2 0.53   2 0.85   2 0.73   1 0.79
662   1 0.72   4 0.86   2 0.85   2 0.74   1 0.79
664   1 0.72   2 0.53   2 0.85   2 0.74   1 0.79
669   1 0.71   2 0.53   2 0.85   2 0.73   1 0.79
672   1 0.72   2 0.53   2 0.85   2 0.73   1 0.79
674   1 0.72   2 0.53   1 0.59   2 0.73   1 0.79
676   1 0.72   3 0.70   2 0.85   2 0.73   1 0.79
679   1 0.72   4 0.86   2 0.85   2 0.73   1 0.79
681   1 0.72   1 0.35   2 0.85   2 0.73   1 0.79
688   1 0.72   2 0.53   2 0.85   2 0.73   1 0.79
690   1 0.72   3 0.70   2 0.85   2 0.73   1 0.79
693   1 0.72   2 0.53   2 0.85   2 0.73   1 0.79
695   1 0.72   2 0.53   2 0.85   2 0.73   1 0.79
698   1 0.72   2 0.53   2 0.85   2 0.73   1 0.79
701   1 0.72   2 0.53   2 0.85   2 0.73   1 0.79
709   1 0.72   4 0.86   2 0.85   2 0.73   1 0.79
711   1 0.72   2 0.53   2 0.85   2 0.73   1 0.79
714   1 0.72   2 0.53   2 0.85   2 0.73   1 0.79
718   1 0.72   2 0.53   2 0.85   2 0.73   1 0.79
722   1 0.72   2 0.53   2 0.85   2 0.73   1 0.79
725   1 0.72   2 0.53   2 0.85   2 0.73   1 0.79
728   1 0.72   2 0.53   2 0.85   2 0.73   1 0.79
731   1 0.72   2 0.53   2 0.85   2 0.73   1 0.79
735   1 0.72   2 0.53   2 0.85   2 0.73   1 0.79
Cont. on next page

Table D.2 – Cont. from previous page
CL    SES      ETH      TRAN     MOB      CASE
      n s      n s      n s      n s      n s
737   1 0.72   2 0.53   2 0.85   2 0.73   1 0.79
746   1 0.73   2 0.53   2 0.85   2 0.73   1 0.79
749   1 0.73   4 0.86   2 0.85   2 0.73   1 0.79
753   1 0.73   4 0.86   2 0.85   2 0.73   1 0.79
756   1 0.73   4 0.86   2 0.85   2 0.73   1 0.79
760   1 0.73   4 0.86   2 0.84   2 0.73   1 0.80
763   1 0.73   2 0.53   2 0.84   2 0.73   1 0.80
766   1 0.73   1 0.35   2 0.84   2 0.73   1 0.80
770   1 0.73   1 0.35   2 0.84   2 0.73   1 0.80
775   1 0.73   2 0.53   2 0.84   2 0.73   1 0.80
782   1 0.73   4 0.86   2 0.84   2 0.73   1 0.80
785   1 0.73   2 0.53   2 0.84   2 0.73   1 0.80
789   1 0.73   2 0.53   2 0.84   2 0.73   1 0.80
792   1 0.73   4 0.86   2 0.84   2 0.73   1 0.80
796   1 0.73   3 0.70   2 0.84   2 0.73   1 0.80
798   1 0.73   3 0.70   2 0.84   2 0.73   1 0.80
803   1 0.73   2 0.53   2 0.84   2 0.73   1 0.80
805   1 0.73   2 0.53   2 0.84   2 0.73   1 0.80
807   1 0.73   4 0.86   2 0.84   2 0.73   1 0.80
810   1 0.74   3 0.70   2 0.84   2 0.73   1 0.80
812   1 0.74   3 0.70   2 0.84   2 0.73   1 0.80
821   1 0.74   3 0.70   2 0.84   2 0.73   1 0.80
824   1 0.74   4 0.86   2 0.84   2 0.73   1 0.80
827   1 0.74   4 0.86   2 0.84   2 0.73   1 0.80
830   1 0.74   4 0.86   2 0.84   2 0.73   1 0.80
837   1 0.74   2 0.53   2 0.84   2 0.73   1 0.80
843   1 0.74   4 0.86   2 0.85   2 0.73   1 0.80
849   1 0.74   2 0.53   2 0.85   2 0.73   1 0.80
855   1 0.74   4 0.86   2 0.85   2 0.73   1 0.81
858   1 0.74   3 0.70   2 0.85   2 0.73   1 0.81
860   1 0.74   4 0.86   2 0.85   2 0.73   1 0.81
868   1 0.74   2 0.53   2 0.85   2 0.73   1 0.81
876   1 0.74   3 0.70   2 0.85   2 0.73   1 0.81
Cont. on next page

Table D.2 – Cont. from previous page
CL    SES      ETH      TRAN     MOB      CASE
      n s      n s      n s      n s      n s
881   1 0.74   3 0.70   2 0.85   2 0.73   1 0.81
886   1 0.74   4 0.86   2 0.85   2 0.73   1 0.81
888   1 0.74   3 0.70   2 0.85   2 0.73   1 0.81
890   1 0.74   3 0.70   2 0.85   2 0.73   1 0.81
895   1 0.74   3 0.70   2 0.85   2 0.73   1 0.81

Appendix E
Additional R Code
##################################################################
## R code to import ACS 5yr data and transfer/modify/figure out ##
## allocation issues to get data into ZIP code format           ##
##################################################################
library(rgdal)
library(sp)
library(maptools)
library(shapefiles)
library(spdep)
### Import Block Group data
acs.bg <- read.csv("/media/data/GISdata/acs2010/5yr/tables/ACS.5yr.IncEdMob.BlockGroup.csv")
### Remove leading characters from GEOID
acs.bg$GEOID <- substr(acs.bg$GEOID, 8, 19)
### First, import the Block Group shapefile and remove any BGs
### that do not have population (water!)
BG.proj.clip <- read.dbf("/media/data/GISdata/census_2010_data/block_groups/MI_2010_blockgroups_proj_clip.dbf")
BG.proj.clip <- BG.proj.clip$dbf
# Get unique GEOID10 for "good" Block Groups
BGs <- unique(BG.proj.clip$GEOID10)

### Subset ACS data to these Block Groups
which.bgs <- which(acs.bg$GEOID %in% BGs)
acs.bg <- acs.bg[which.bgs,]

### Read in Block Pop data
### Will use this to weight and allocate to
### block groups, tracts, and counties
bp.cent <- read.dbf("/media/data/GISdata/census_2010_data/blocks/MI_2010_blocks_proj_cent_pop/MI_2010_blocks_proj_cent_pop_gt0.dbf")
bp.cent <- bp.cent$dbf
bp.cent <- bp.cent[,c(1,17)]
bp.cent$GEOIDBG <- substr(bp.cent$GEOID, 1, 12)
bp.cent$GEOIDT <- substr(bp.cent$GEOIDBG, 1, 11)
bp.cent$GEOIDC <- substr(bp.cent$GEOIDBG, 1, 5)
### Aggregate by block group, tract, and county
BG.pop <- aggregate(bp.cent$POP100, by=list("GEOIDBG" = bp.cent$GEOIDBG),
sum)
T.pop <- aggregate(bp.cent$POP100, by=list("GEOIDT" = bp.cent$GEOIDT),
sum)
C.pop <- aggregate(bp.cent$POP100, by=list("GEOIDC" = bp.cent$GEOIDC),
sum)
names(BG.pop)[2] <- names(T.pop)[2] <- names(C.pop)[2] <- "POP"
### Attach population to block groups
acs.bg <- merge(acs.bg, BG.pop, by.x="GEOID", by.y="GEOIDBG", all.x=TRUE)
sum(BG.pop$POP)
sum(acs.bg$POP, na.rm=TRUE)
### Remove any Block Groups with no pop
acs.bg <- acs.bg[!is.na(acs.bg$POP),]
sum(acs.bg$POP)

#########
### Convert counts to percentages
#########
acs.bg.pct <- acs.bg
### Education
acs.bg.pct[,11:15] <- acs.bg.pct[,11:15] / acs.bg.pct[,10]
#sum(acs.bg.pct[2,11:15])
### Mobility (Trav to work, trav time)
acs.bg.pct[,17:28] <- acs.bg.pct[,17:28] / acs.bg.pct[,16]
#sum(acs.bg.pct[1,17:21])
#sum(acs.bg.pct[1,22:28])

##########
### Import polygon file and get neighbors
##########
# Read in block group polygon file
bgs.poly <- readOGR("/media/data/GISdata/census_2010_data/block_groups/MI_2010_blockgroups_proj_clip.shp", layer="MI_2010_blockgroups_proj_clip")
# Make neighbors list
## Use queen contiguity
bgs.nb <- poly2nb(bgs.poly, queen = TRUE)
na.MHI <- which(is.na(acs.bg.pct$MedHouInc))
na.ME <- which(is.na(acs.bg.pct$MedEarn16p))
na.E <- which(is.na(acs.bg.pct$EdPop25p) | acs.bg.pct$EdPop25p == 0)
na.T <- which(is.na(acs.bg.pct$TrvWorkPop16p) | acs.bg.pct$TrvWorkPop16p
== 0)
### Get interpolation stats
#u.na.bgs <- unique(c(na.MHI, na.ME, na.E, na.T))
#length(u.na.bgs)
#nrow(acs.bg.pct)
#bg.na.pop <- sum(acs.bg.pct[u.na.bgs,ncol(acs.bg.pct)])
#bg.na.pop / sum(acs.bg$POP)
#length(u.na.bgs) / nrow(acs.bg.pct)
### Start looping through each Block Group that is missing data
### Median Household Income
for (i in seq_along(na.MHI)) {
  ## Get the Block Group ID
  na.id <- acs.bg.pct$GEOID[na.MHI[i]]
  ## Find which entry in the shapefile
  which.poly <- which(bgs.poly$GEOID10 == na.id)
  ## Get list of neighbors
  nbs <- bgs.poly$GEOID10[bgs.nb[which.poly][[1]]]
  ## Get the Median Household Income values of the neighbors
  ## Get the Populations of the neighbors
  nb.vals <- acs.bg.pct[acs.bg.pct$GEOID %in% nbs, c(1,8,29)]
  ## Get weighted average of neighbors
  mhi.interp <- sum(nb.vals$MedHouInc * nb.vals$POP, na.rm=TRUE) /
    sum(nb.vals$POP, na.rm=TRUE)
  ## Put interpolated value into table
  acs.bg.pct$MedHouInc[na.MHI[i]] <- mhi.interp
}
sum(is.na(acs.bg.pct$MedHouInc))

### Start looping through each Block Group that is missing data
### Median Earnings
for (i in seq_along(na.ME)) {
  ## Get the Block Group ID
  na.id <- acs.bg.pct$GEOID[na.ME[i]]
  ## Find which entry in the shapefile
  which.poly <- which(bgs.poly$GEOID10 == na.id)
  ## Get list of neighbors
  nbs <- bgs.poly$GEOID10[bgs.nb[which.poly][[1]]]
  ## Get the Median Earnings values of the neighbors
  ## Get the Populations of the neighbors
  nb.vals <- acs.bg.pct[acs.bg.pct$GEOID %in% nbs, c(1,9,29)]
  ## Get weighted average of neighbors
  mhi.interp <- sum(nb.vals$MedEarn16p * nb.vals$POP, na.rm=TRUE) /
    sum(nb.vals$POP, na.rm=TRUE)
  ## Put interpolated value into table
  acs.bg.pct$MedEarn16p[na.ME[i]] <- mhi.interp
}
which(is.na(acs.bg.pct$MedEarn16p))

### Start looping through each Block Group that is missing data
### Median Household Income
### (repeats the loop above over the same Block Groups; neighbors
### filled in during the first pass now contribute to the average)
for (i in seq_along(na.MHI)) {
  ## Get the Block Group ID
  na.id <- acs.bg.pct$GEOID[na.MHI[i]]
  ## Find which entry in the shapefile
  which.poly <- which(bgs.poly$GEOID10 == na.id)
  ## Get list of neighbors
  nbs <- bgs.poly$GEOID10[bgs.nb[which.poly][[1]]]
  ## Get the Median Household Income values of the neighbors
  ## Get the Populations of the neighbors
  nb.vals <- acs.bg.pct[acs.bg.pct$GEOID %in% nbs, c(1,8,29)]
  ## Get weighted average of neighbors
  mhi.interp <- sum(nb.vals$MedHouInc * nb.vals$POP, na.rm=TRUE) /
    sum(nb.vals$POP, na.rm=TRUE)
  ## Put interpolated value into table
  acs.bg.pct$MedHouInc[na.MHI[i]] <- mhi.interp
}
sum(is.na(acs.bg.pct$MedHouInc))
### Start looping through each Block Group that is missing data
### Education
for (i in seq_along(na.E)) {
  ## Get the Block Group ID
  na.id <- acs.bg.pct$GEOID[na.E[i]]
  ## Find which entry in the shapefile
  which.poly <- which(bgs.poly$GEOID10 == na.id)
  ## Get list of neighbors
  nbs <- bgs.poly$GEOID10[bgs.nb[which.poly][[1]]]
  ## Get the Education proportions of the neighbors
  ## Get the age 25+ populations (EdPop25p) of the neighbors
  nb.vals <- acs.bg.pct[acs.bg.pct$GEOID %in% nbs, c(1,10:15)]
  ## Get weighted average of neighbors
  mhi.interp <- colSums(nb.vals[,3:7] * nb.vals$EdPop25p, na.rm=TRUE) /
    sum(nb.vals$EdPop25p, na.rm=TRUE)
  ## Put interpolated value into table
  acs.bg.pct[na.E[i],11:15] <- mhi.interp
}
which(is.na(acs.bg.pct$EdltHS))
### Start looping through each Block Group that is missing data
### Mobility
for (i in seq_along(na.T)) {
  ## Get the Block Group ID
  na.id <- acs.bg.pct$GEOID[na.T[i]]
  ## Find which entry in the shapefile
  which.poly <- which(bgs.poly$GEOID10 == na.id)
  ## Get list of neighbors
  nbs <- bgs.poly$GEOID10[bgs.nb[which.poly][[1]]]
  ## Get the Mobility proportions of the neighbors
  ## Get the working populations (TrvWorkPop16p) of the neighbors
  nb.vals <- acs.bg.pct[acs.bg.pct$GEOID %in% nbs, c(1,16:28)]
  ## Get weighted average of neighbors
  mhi.interp <- colSums(nb.vals[,3:14] * nb.vals$TrvWorkPop16p, na.rm=TRUE) /
    sum(nb.vals$TrvWorkPop16p, na.rm=TRUE)
  ## Put interpolated value into table
  acs.bg.pct[na.T[i],17:28] <- mhi.interp
}
which(is.na(acs.bg.pct$TrvCar))

### Write out table
write.csv(acs.bg.pct, file="/media/data/GISdata/acs2010/5yr/tables/ACS.5yr.IncEdMob.BlockGroup.Interpolated.csv", row.names=FALSE)
##################################################################
##################################################################
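
The interpolation loops above fill a missing Block Group value with the population-weighted mean of its contiguous neighbors. A minimal, self-contained sketch of that idea on toy data follows. The data frame `vals`, the neighbor list `nb` (standing in for the `poly2nb()` output), and the helper `interp.nb()` are all hypothetical names made up for illustration. One deliberate difference from the listing: the denominator here only counts neighbors that actually have a value, so the result can differ when a neighbor is itself missing.

```r
## Toy data (hypothetical): four areas, area "B" has no median income
vals <- data.frame(
  GEOID     = c("A", "B", "C", "D"),
  MedHouInc = c(40000, NA, 60000, 50000),
  POP       = c(100, 50, 300, 200),
  stringsAsFactors = FALSE
)
## Neighbor list keyed by GEOID (hypothetical stand-in for poly2nb output)
nb <- list(A = c("B", "C"), B = c("A", "C"), C = c("A", "B", "D"), D = "C")

## Population-weighted mean of the neighbors' non-missing values
interp.nb <- function(id, vals, nb, var) {
  nbs <- vals[vals$GEOID %in% nb[[id]], ]
  ok <- !is.na(nbs[[var]])
  ## No usable neighbor: leave the value missing
  if (!any(ok)) return(NA_real_)
  sum(nbs[[var]][ok] * nbs$POP[ok]) / sum(nbs$POP[ok])
}

## Fill every area that is missing a value
na.ids <- vals$GEOID[is.na(vals$MedHouInc)]
for (id in na.ids) {
  vals$MedHouInc[vals$GEOID == id] <- interp.nb(id, vals, nb, "MedHouInc")
}
vals
```

In this toy case area "B" takes its value from "A" and "C" only, weighted 100 to 300 by population.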

############################################################
## R Code to aggregate county data from SAHIE by Zip Code ##
## Use block / zip percentages because of nesting         ##
############################################################

options(scipen=500)

## Get joined block / zip code : age proportions table
zip.blk.age <- read.csv("/home/delamate/MDCH/data/dissertation/zipcodes/tables/zipcode_blocks_age_breakdown.csv")
sum(zip.blk.age$SumAgePop)
## Make County ID column
zip.blk.age$GEOIDC <- substr(zip.blk.age$GEOID10, 1, 5)
## Aggregate to Counties
zip.c <- aggregate(zip.blk.age[,c(5:22,24)], by=list("GEOIDC" =
zip.blk.age$GEOIDC, "ZIP" = zip.blk.age$ZIP), sum)
sum(zip.c$SumAgePop)
## Create similar age brackets
zip.c$P0_19 <- rowSums(zip.c[,3:6])
zip.c$P20_64 <- rowSums(zip.c[,7:15])
zip.c$P0_64 <- rowSums(zip.c[,3:15])
zip.c$P65p <- rowSums(zip.c[,16:20])
## Subset
zip.c <- zip.c[,c(1:2,21:25)]
sum(zip.c$SumAgePop)
##
## First, sum Zip Code 0 - 64 populations
zip.pop <- aggregate(zip.c[,6], by=list("ZIP" = zip.c$ZIP), sum)
names(zip.pop)[2] <- "POP064ZIP"

## Merge
zip.c <- merge(zip.c, zip.pop, by="ZIP", all.x=TRUE)
## Create County :: Zip percentage
zip.c$CPCT064 <- zip.c$P0_64 / zip.c$POP064ZIP

## Get SAHIE data table
sahie.c <- read.csv("/media/data/GISdata/census_SAHIE/2009/tables/sahie.county.health.insurance.estimates.csv")
## Remove error columns
sahie.c <- sahie.c[,c(1:4,6,8:9,11,13:14,16)]
## Fix 0-18, 18-64 :: Make into 0-18, 19-64
## (POP_0_18 + POP_18_64) - POP is the group counted in both brackets
sahie.c$POP_19_64 <- sahie.c$POP_18_64 -
  ((sahie.c$POP_0_18 + sahie.c$POP_18_64) - sahie.c$POP)
sahie.c$INSPOP_19_64 <- sahie.c$INSPOP_18_64 -
  ((sahie.c$INSPOP_0_18 + sahie.c$INSPOP_18_64) - sahie.c$INSPOP)
sahie.c$UNINSPOP_19_64 <- sahie.c$POP_19_64 - sahie.c$INSPOP_19_64

## Convert SAHIEs to Percentages
sahie.c.o <- sahie.c
sahie.c[,4:5] <- sahie.c[,4:5] / sahie.c$POP
sahie.c[,7:8] <- sahie.c[,7:8] / sahie.c$POP_0_18
sahie.c[,10:11] <- sahie.c[,10:11] / sahie.c$POP_18_64
sahie.c[,13:14] <- sahie.c[,13:14] / sahie.c$POP_19_64

## Get COUNTY pop data
c.age <- read.csv("/home/delamate/MDCH/data/dissertation/zipcodes/tables/county_AGE_blockpop_adj.csv")
## Aggregate to similar age brackets
c.age$P0_19 <- rowSums(c.age[,2:5])
c.age$P20_64 <- rowSums(c.age[,6:14])
c.age$P0_64 <- rowSums(c.age[,2:14])
c.age$P65p <- rowSums(c.age[,15:19])
c.age <- c.age[,c(1,20:24)]

######################
### Use the weights to allocate COUNTY values to ZIP CODES!!
######################
zip.sahie <- NULL
u.zips <- unique(zip.c$ZIP)
for (i in seq_along(u.zips)) {
  ## Get weights
  zip.c.weights <- zip.c[zip.c$ZIP == u.zips[i], c(1,2,9)]
  ## Get SAHIE County values
  sahie.c.sub <- sahie.c[sahie.c$COUNTY %in%
    as.numeric(substr(zip.c.weights$GEOIDC, 3, 5)), c(1,2,4,7,13)]
  ## Sort each
  zip.c.weights <- zip.c.weights[order(zip.c.weights$GEOIDC),]
  sahie.c.sub <- sahie.c.sub[order(sahie.c.sub$COUNTY),]
  # if(sum(zip.bg.weights$GEOIDBG != acs.bg.sub$GEOID) > 0)
  #   print(paste("Something went wrong at i = ", i, sep=""))
  ## Multiply by weights and SUM
  w.sahie <- sum(sahie.c.sub$INSPOP * zip.c.weights[,3])
  ## Now, figure out overall insured rate... assuming all 65+ are insured
  zip.o <- zip.c[zip.c$ZIP == u.zips[i], c(1,2,6,7)]
  zip.64u <- sum(zip.o$P0_64 * sahie.c.sub$INSPOP)
  zip.65p <- sum(zip.o$P65p * 1)
  ov.ins.rate <- (zip.64u + zip.65p) / (sum(zip.o$P0_64) + sum(zip.o$P65p))
  ## Now, get overall insured rate... and 0-64 rate using the separate rates
  zip.o <- zip.c[zip.c$ZIP == u.zips[i], c(1,2,4,5,6,7)]
  zip.0.19 <- sum(zip.o$P0_19 * sahie.c.sub$INSPOP_0_18)
  zip.20.64 <- sum(zip.o$P20_64 * sahie.c.sub$INSPOP_19_64)
  zip.65p <- sum(zip.o$P65p * 1)
  interp.ins.rate <- (zip.0.19 + zip.20.64 + zip.65p) / (sum(zip.o$P0_19)
    + sum(zip.o$P20_64) + sum(zip.o$P65p))
  interp.ins.rate.64 <- (zip.0.19 + zip.20.64) / (sum(zip.o$P0_19)
    + sum(zip.o$P20_64))
  ## Put in holder
  zip.sahie <- rbind(zip.sahie, c(u.zips[i], w.sahie, ov.ins.rate,
    interp.ins.rate, interp.ins.rate.64))
}
class(zip.sahie)
zip.sahie <- as.data.frame(zip.sahie)
names(zip.sahie) <- c("ZIP", "HeaInsRate0_64", "HeaInsRateTot",
"HeaInsRateIntTot", "HeaInsRateInt0_64")

### Write out file
write.csv(zip.sahie, file="/home/delamate/MDCH/data/dissertation/zipcodes/tables/zipcode_SAHIE_HealthInsurance.csv", row.names=FALSE)
############################################################
############################################################
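
The allocation step above turns county insured rates into Zip Code rates by weighting each county by its share of the Zip Code's under-65 population, then treats everyone aged 65+ as insured for the overall rate. A compact sketch of the same arithmetic on toy numbers follows. The inputs (`zip.c` as built here, `county.rate`, the `CNTY` and `RATE` columns, and `zip.tot`) are hypothetical and only meant to make the weighting easy to check by hand. It is not a drop-in replacement for the loop.

```r
## Hypothetical inputs: Zip "A" lies in county 1 only; Zip "B" is split
## between counties 1 and 2. P0_64 / P65p are populations of each piece.
zip.c <- data.frame(
  ZIP   = c("A", "B", "B"),
  CNTY  = c(1, 1, 2),
  P0_64 = c(800, 300, 100),
  P65p  = c(200, 60, 40),
  stringsAsFactors = FALSE
)
## Hypothetical county insured rates for the under-65 population
county.rate <- c("1" = 0.90, "2" = 0.80)
zip.c$RATE <- unname(county.rate[as.character(zip.c$CNTY)])

## Expected insured persons under 65 in each Zip / county piece
zip.c$INS064 <- zip.c$RATE * zip.c$P0_64

## Sum the pieces by Zip Code
zip.tot <- aggregate(zip.c[, c("INS064", "P0_64", "P65p")],
                     by = list(ZIP = zip.c$ZIP), FUN = sum)

## Under-65 rate: population-weighted mean of the county rates
zip.tot$HeaInsRate0_64 <- zip.tot$INS064 / zip.tot$P0_64
## Overall rate, assuming all persons aged 65+ are insured
zip.tot$HeaInsRateTot <- (zip.tot$INS064 + zip.tot$P65p) /
  (zip.tot$P0_64 + zip.tot$P65p)
zip.tot
```

For Zip "B" the under-65 rate works out to (0.90 * 300 + 0.80 * 100) / 400, which is the same quantity the loop computes with its county share weights.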

###########################################################
## R Code to subset patient days to only those spent     ##
## in a hospital < 60 minutes travel time from residence ##
## This matches specification in E2SFCA calculation      ##
###########################################################
library(rgdal)
library(sp)
library(maptools)
library(shapefiles)
## Get records
records <- read.csv(file="2010/2010MIDBrecords.csv")
## Get Zip / Hosp distances
## Get OD matrix for Zip Codes
## Read in origin-destination shapefile table
od <- read.dbf("/media/data/Project Files/Delamater/Dissertation/Utilization/OD/hosps_esri2010pwcent_traveltime_od.dbf")
## Remove header info
od <- od$dbf
## Subset
od <- od[,c(7,8,5)]
# Remove weird point
# Weird point is Block centroid that didn’t fall in any Zip Code!
badID <- which(od$FAC_ID %in% unique(od$FAC_ID)[170:338])
od <- od[-c(badID),]
## Get hospital info file and attach FAC_ID
hosp.info <- read.csv("/media/data/GISdata/hospitals/csv/2011-hosps-beds.csv")
od <- merge(od, hosp.info[,c(1,4)], by="FAC_ID", all.x=TRUE)
## Aggregate distances by ZIPCODE / MIDB
od <- aggregate(od$Total_MinT, by=list("ZIPCODE" = od$ZIPCODE,
"MIDB" = od$MIDB), mean)
names(od)[3] <- "MIN"
length(unique(od$ZIPCODE))
length(unique(od$MIDB))

### Attach travel time to patient records
names(records)[c(2,4)] <- c("MIDB", "ZIPCODE")
records <- merge(records, od, by=c("ZIPCODE", "MIDB"), all.x=TRUE)

### Subset
records.60 <- records[records$MIN <= 60, ]
no.dist <- which(is.na(records.60$LOS))
records.60 <- records.60[-no.dist,]
records <- records.60

#write.csv(records, file="2010/2010MIDBrecords.60min.csv", row.names=FALSE)
### Aggregate by age!
### Bin into AGE categories
age <- as.numeric(as.character(records$AGE))
records0004 <- records[age < 5, ]
records0509 <- records[age >= 5 & age < 10, ]
records1014 <- records[age >= 10 & age < 15, ]
records1519 <- records[age >= 15 & age < 20, ]
records2024 <- records[age >= 20 & age < 25, ]
records2529 <- records[age >= 25 & age < 30, ]
records3034 <- records[age >= 30 & age < 35, ]
records3539 <- records[age >= 35 & age < 40, ]
records4044 <- records[age >= 40 & age < 45, ]
records4549 <- records[age >= 45 & age < 50, ]
records5054 <- records[age >= 50 & age < 55, ]
records5559 <- records[age >= 55 & age < 60, ]
records6064 <- records[age >= 60 & age < 65, ]
records6569 <- records[age >= 65 & age < 70, ]
records7074 <- records[age >= 70 & age < 75, ]
records7579 <- records[age >= 75 & age < 80, ]
records8084 <- records[age >= 80 & age < 85, ]
records85p <- records[age >= 85, ]
### Sum column 4 of each age subset by Zip Code
records0004 <- aggregate(records0004[,4], by=list(ZIP =
  records0004$ZIPCODE), FUN=sum)
records0509 <- aggregate(records0509[,4], by=list(ZIP =
  records0509$ZIPCODE), FUN=sum)
records1014 <- aggregate(records1014[,4], by=list(ZIP =
  records1014$ZIPCODE), FUN=sum)
records1519 <- aggregate(records1519[,4], by=list(ZIP =
  records1519$ZIPCODE), FUN=sum)
records2024 <- aggregate(records2024[,4], by=list(ZIP =
  records2024$ZIPCODE), FUN=sum)
records2529 <- aggregate(records2529[,4], by=list(ZIP =
  records2529$ZIPCODE), FUN=sum)
records3034 <- aggregate(records3034[,4], by=list(ZIP =
  records3034$ZIPCODE), FUN=sum)
records3539 <- aggregate(records3539[,4], by=list(ZIP =
  records3539$ZIPCODE), FUN=sum)
records4044 <- aggregate(records4044[,4], by=list(ZIP =
  records4044$ZIPCODE), FUN=sum)
records4549 <- aggregate(records4549[,4], by=list(ZIP =
  records4549$ZIPCODE), FUN=sum)
records5054 <- aggregate(records5054[,4], by=list(ZIP =
  records5054$ZIPCODE), FUN=sum)
records5559 <- aggregate(records5559[,4], by=list(ZIP =
  records5559$ZIPCODE), FUN=sum)
records6064 <- aggregate(records6064[,4], by=list(ZIP =
  records6064$ZIPCODE), FUN=sum)
records6569 <- aggregate(records6569[,4], by=list(ZIP =
  records6569$ZIPCODE), FUN=sum)
records7074 <- aggregate(records7074[,4], by=list(ZIP =
  records7074$ZIPCODE), FUN=sum)
records7579 <- aggregate(records7579[,4], by=list(ZIP =
  records7579$ZIPCODE), FUN=sum)
records8084 <- aggregate(records8084[,4], by=list(ZIP =
  records8084$ZIPCODE), FUN=sum)
records85p <- aggregate(records85p[,4], by=list(ZIP =
  records85p$ZIPCODE), FUN=sum)
## Get Zip Code age breakdown
zip.age <- read.csv(file="/home/delamate/MDCH/data/dissertation/zipcodes/tables/zipcode_AGE_blockpop_adj.csv")
zip.days <- merge(records0004, records0509, by="ZIP", all.x=TRUE,
all.y=TRUE)
zip.days <- merge(zip.days, records1014, by="ZIP", all.x=TRUE, all.y=TRUE)
names(zip.days)[2:ncol(zip.days)] <- names(zip.age)[2:ncol(zip.days)]
zip.days <- merge(zip.days, records1519, by="ZIP", all.x=TRUE, all.y=TRUE)
zip.days <- merge(zip.days, records2024, by="ZIP", all.x=TRUE, all.y=TRUE)
zip.days <- merge(zip.days, records2529, by="ZIP", all.x=TRUE, all.y=TRUE)
names(zip.days)[2:ncol(zip.days)] <- names(zip.age)[2:ncol(zip.days)]
zip.days <- merge(zip.days, records3034, by="ZIP", all.x=TRUE, all.y=TRUE)
zip.days <- merge(zip.days, records3539, by="ZIP", all.x=TRUE, all.y=TRUE)
zip.days <- merge(zip.days, records4044, by="ZIP", all.x=TRUE, all.y=TRUE)
names(zip.days)[2:ncol(zip.days)] <- names(zip.age)[2:ncol(zip.days)]
zip.days <- merge(zip.days, records4549, by="ZIP", all.x=TRUE, all.y=TRUE)
zip.days <- merge(zip.days, records5054, by="ZIP", all.x=TRUE, all.y=TRUE)
zip.days <- merge(zip.days, records5559, by="ZIP", all.x=TRUE, all.y=TRUE)
names(zip.days)[2:ncol(zip.days)] <- names(zip.age)[2:ncol(zip.days)]
zip.days <- merge(zip.days, records6064, by="ZIP", all.x=TRUE, all.y=TRUE)
zip.days <- merge(zip.days, records6569, by="ZIP", all.x=TRUE, all.y=TRUE)
zip.days <- merge(zip.days, records7074, by="ZIP", all.x=TRUE, all.y=TRUE)
names(zip.days)[2:ncol(zip.days)] <- names(zip.age)[2:ncol(zip.days)]
zip.days <- merge(zip.days, records7579, by="ZIP", all.x=TRUE, all.y=TRUE)
zip.days <- merge(zip.days, records8084, by="ZIP", all.x=TRUE, all.y=TRUE)
zip.days <- merge(zip.days, records85p, by="ZIP", all.x=TRUE, all.y=TRUE)
names(zip.days)[2:ncol(zip.days)] <- names(zip.age)[2:ncol(zip.days)]
zip.days[is.na(zip.days)] <- 0
names(zip.days)
### Clean up file to account for funky Zip Codes ##################
### See dissertation data notes ###################################
### Remove islands
#islands <- which(zip.days$ZIP %in% c(48028, 49726, 49757, 49775, 49782))
#airport <- which(zip.days$ZIP == 48242)
#zip.days <- zip.days[-c(islands, airport), ]
### Reassign zips
zip.days$ZIP[zip.days$ZIP == 48710] <- 48706
zip.days$ZIP[zip.days$ZIP == 48743] <- 48739
zip.days$ZIP[zip.days$ZIP == 48824 | zip.days$ZIP == 48825] <- 48823
zip.days$ZIP[zip.days$ZIP == 49104] <- 49103
zip.days$ZIP[zip.days$ZIP == 49792] <- 49765
zip.days$ZIP[zip.days$ZIP == 49873] <- 49807

zip.days <- aggregate(zip.days[,2:ncol(zip.days)], by=list("ZIP" =
zip.days$ZIP), sum)
dim(zip.days)
### 3 Zip Codes have NO patient days within 60 minutes!!
### 892 records... add zero entries for these 49725, 49858, 49893
zip.days <- rbind(zip.days, c(49725, rep(0,18)))
zip.days <- rbind(zip.days, c(49858, rep(0,18)))
zip.days <- rbind(zip.days, c(49893, rep(0,18)))
zip.days <- zip.days[order(zip.days$ZIP), ]
dim(zip.days)
zip.days$TotDays <- rowSums(zip.days[2:19])
## Write records to file
write.csv(zip.days, file="/home/delamate/MDCH/data/dissertation/zipcodes/tables/zipcode_2010_UTILIZATION_age_breakdown.60min.csv",
row.names=FALSE)
###########################################################
###########################################################
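
## The Zip Code reassignment above is repeated in several scripts of this
## appendix. A minimal sketch of the same recode as a single lookup table
## follows. The names `reassign` and `recode.zip` are hypothetical and do
## not appear in the original scripts; the Zip Code pairs are copied from
## the reassignment lines in the listing.

```r
## Old Zip Code (name) -> Zip Code it is merged into (value)
reassign <- c("48710" = 48706, "48743" = 48739,
              "48824" = 48823, "48825" = 48823,
              "49104" = 49103, "49792" = 49765,
              "49873" = 49807)

## Recode a vector of numeric Zip Codes (hypothetical helper)
recode.zip <- function(zip) {
  hit <- match(as.character(zip), names(reassign))
  ## Keep the original code where no reassignment applies
  unname(ifelse(is.na(hit), zip, reassign[hit]))
}

recode.zip(c(48823, 48824, 48825, 49873, 48910))
## 48823 48823 48823 49807 48910
```

## With such a helper, each script could apply one call, for example
## zip.days$ZIP <- recode.zip(zip.days$ZIP), before aggregating by ZIP.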

###########################################################
## R Code to estimate distance decay in a gravity model ##
## Uses MIDB patient days and network GIS travel data
##
###########################################################
library(rgdal)
library(sp)
library(maptools)
library(shapefiles)
library(classInt)
## Get table of patient days
pd <- read.csv("/home/delamate/MDCH/data/dissertation/utilization/2010/
2010MIDBrecords.csv")
## Subset to needed data
pd <- pd[,c(2,4,3)]
## Get travel distance data
## Get OD matrix for Zip Codes
## Read in origin-destination shapefile table
od <- read.dbf("/media/data/Project Files/Delamater/Dissertation/
Utilization/OD/hosps_esri2010pwcent_traveltime_od.dbf")
## Remove header info
od <- od$dbf
## Subset
od <- od[,c(7,8,5)]
# Get hosp info and attach MIDB number
hosp.info <- read.csv("/media/data/GISdata/hospitals/csv/
2011-hosps-beds.csv")
# Attach to OD matrix
od <- merge(od, hosp.info[,c(1,4)], by="FAC_ID", all.x=TRUE)
## Get mean of distances to grouped hospitals
od <- aggregate(od$Total_MinT, by=list("ZIPCODE" = od$ZIPCODE, "MIDB" =
od$MIDB), mean)
names(od)[c(1,3)] <- c("ZIP", "Min")

## Get total number of beds per MIDB hospital (SUPPLY)
n.beds <- aggregate(hosp.info$BEDS2010, by=list(MIDB = hosp.info$MIDB),
sum)
names(n.beds)[2] <- "BEDS"
## Get total number of patient days per MIDB hospital (DEMAND)
n.patient.days <- aggregate(pd$LOS, by=list(MIDB = pd$HOSP_ID), sum)
names(n.patient.days)[2] <- "PDj"
## Get total number of patient days per ZIP
n.pat.days.zip <- aggregate(pd$LOS, by=list(ZIP = pd$MIDB_ZIP), sum)
names(n.pat.days.zip)[2] <- "PDi"
## Get patient flows from each Zip to each hospital
zip.pd <- aggregate(pd$LOS, by=list(ZIP = pd$MIDB_ZIP, MIDB = pd$HOSP_ID),
sum)
names(zip.pd)[3] <- "PDij"

## Aggregate data into single table
data <- zip.pd
data <- merge(data, n.beds, by="MIDB", all.x=TRUE)
data <- merge(data, n.patient.days, by="MIDB", all.x=TRUE)
data <- merge(data, n.pat.days.zip, by="ZIP", all.x=TRUE)
data <- merge(data, od, by=c("ZIP", "MIDB"), all.x=TRUE)
####
#### Distance decay function is cumulative by distance traveled!!
#### It's the share of patient days with a travel time of X minutes or more
####
#### Get total days
## Remove "NAs" in distance
nas <- which(is.na(data$Min))
data <- data[-nas,]
t.pat.days <- sum(data$PDij)
#### Bin probability by Distance
max.d <- max(data$Min)
dist.bin <- 0:floor(max.d)
#### Make holder
Prob <- as.data.frame(cbind(dist.bin, rep(0, length(dist.bin))))
names(Prob) <- c("Min", "Prob")
#### Loop through and get probabilities
for (d in dist.bin) {
## Get records greater than or equal to distance bin
sub.data <- data[data$Min >= d,]
## Get patient days
d.pat.days <- sum(sub.data$PDij)
## Get probability
Prob$Prob[d+1] <- d.pat.days / t.pat.days
}

nlc <- nls.control(maxiter = 1000000)
nlsfit.d.log.logistic <- nls(Prob ~ 1 / (1 + (Min/a)^b), data=Prob,
start=list(a=1, b=1), control=nlc, trace=TRUE)
summary(nlsfit.d.log.logistic)
a <- summary(nlsfit.d.log.logistic)$coefficients[1,1]
b <- summary(nlsfit.d.log.logistic)$coefficients[2,1]

## Write out data for making figures
fig.dat <- as.data.frame(cbind(Prob, fitted(nlsfit.d.log.logistic)))
names(fig.dat) <- c("Min", "Util", "dll")
write.csv(fig.dat, file="distance.decay.utilization.csv", row.names=FALSE)
## Write out weights table
weight.table <- as.data.frame(cbind(0:700, 1 / (1 + ((0:700)/a)^b)))
names(weight.table) <- c("Min", "dllWgt")
write.csv(weight.table, file="/home/delamate/MDCH/data/dissertation/
Distance.decay/decreasing.log.likelihood.weights.empirical.2010.csv",
row.names=FALSE)
###########################################################
###########################################################
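
## The curve fitted above has the form 1 / (1 + (Min/a)^b), so the
## parameter a is the travel time at which the weight falls to 0.5.
## A minimal sketch of the weight function follows. The name `dll.weight`
## is hypothetical; the parameter values are the 2010 estimates that are
## hard-coded in the E2SFCA script of this appendix.

```r
## Decreasing log-logistic distance decay weight (hypothetical helper)
dll.weight <- function(minutes, a, b) {
  1 / (1 + (minutes / a)^b)
}

a.2010 <- 13.885928
b.2010 <- 1.817622

## Weights at 0 minutes, at a minutes, at 30 minutes, and at 60 minutes
w <- dll.weight(c(0, a.2010, 30, 60), a.2010, b.2010)
## w[1] is 1 (no travel) and w[2] is 0.5 (travel time equal to a)
print(round(w, 3))
```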

###################################################
## Code to calculate demand for E2SFCA
##
## Get population and allocate based on weights ##
###################################################
options(scipen=999)
## Get block / ring data
blcks <- read.csv("/home/delamate/MDCH/data/dissertation/E2SFCA/tables/
block.centroid.rings.csv")
## Get 2010 weights
weights <- read.csv("/home/delamate/MDCH/data/dissertation/
Distance.decay/decreasing.log.likelihood.weights.empirical.2010.csv")
## Convert aggregate weights to "ring" structure
w.5 <- mean(weights$dllWgt[1:6])
w.10 <- mean(weights$dllWgt[6:11])
w.15 <- mean(weights$dllWgt[11:16])
w.20 <- mean(weights$dllWgt[16:21])
w.25 <- mean(weights$dllWgt[21:26])
w.30 <- mean(weights$dllWgt[26:31])
w.35 <- mean(weights$dllWgt[31:36])
w.40 <- mean(weights$dllWgt[36:41])
w.45 <- mean(weights$dllWgt[41:46])
w.60 <- mean(weights$dllWgt[46:51])

weights.mean <- c(w.5, w.10, w.15, w.20, w.25, w.30, w.35, w.40, w.45,
w.60)
nms <- seq(5,60,5)
nms <- nms[c(1:9,12)]
names(weights.mean) <- c(paste("W", nms, sep=""))

weights.mat <- as.data.frame(matrix(rep(weights.mean, nrow(blcks)),
nrow=nrow(blcks), ncol=length(weights.mean), byrow=TRUE))
names(weights.mat) <- names(weights.mean)
## Make presence / absence table
blcks.p <- blcks
blcks.p[,3:12] <- as.numeric(blcks.p[,3:12] > 0)

## Multiply p/a by weights
wght.blck <- blcks.p
wght.blck[,3:12] <- wght.blck[,3:12]*weights.mat

## Multiply pop by weights
pop.wght <- wght.blck
wght.blck$Min5 <- wght.blck$Min5 * wght.blck$POP100
wght.blck$Min10 <- wght.blck$Min10 * wght.blck$POP100
wght.blck$Min15 <- wght.blck$Min15 * wght.blck$POP100
wght.blck$Min20 <- wght.blck$Min20 * wght.blck$POP100
wght.blck$Min25 <- wght.blck$Min25 * wght.blck$POP100
wght.blck$Min30 <- wght.blck$Min30 * wght.blck$POP100
wght.blck$Min35 <- wght.blck$Min35 * wght.blck$POP100
wght.blck$Min40 <- wght.blck$Min40 * wght.blck$POP100
wght.blck$Min45 <- wght.blck$Min45 * wght.blck$POP100
wght.blck$Min60 <- wght.blck$Min60 * wght.blck$POP100

library(rgdal)
library(sp)
library(maptools)
library(shapefiles)
## Read in 5 minute file
dbf5 <- read.dbf("/media/data/Project Files/Delamater/Dissertation/
Utilization/Service-areas/join/MI_2010_blocks_proj_cent_pop_gt0_join_5.dbf")
dbf5 <- dbf5$dbf
## Subset to needed columns
dbf5 <- dbf5[,c(names(dbf5) == "GEOID10" | names(dbf5) == "FAC_ID")]
## Attach demand per hospital!
dbf5 <- merge(dbf5, wght.blck[,c(1,3)], by="GEOID10", all.x=TRUE)
## Aggregate by hospital
hosp.demand <- aggregate(dbf5$Min5, by=list(FAC_ID = dbf5$FAC_ID), sum)
names(hosp.demand)[2] <- "Min5"
rm(dbf5)
## Read in 10 minute file
dbf10 <- read.dbf("/media/data/Project Files/Delamater/Dissertation/
Utilization/Service-areas/join/MI_2010_blocks_proj_cent_pop_gt0_join_10.dbf")
dbf10 <- dbf10$dbf
## Subset to needed columns
dbf10 <- dbf10[,c(names(dbf10) == "GEOID10" | names(dbf10) == "FAC_ID")]
## Attach demand per hospital!
dbf10 <- merge(dbf10, wght.blck[,c(1,4)], by="GEOID10", all.x=TRUE)
## Aggregate by hospital
hosp.demand.t <- aggregate(dbf10$Min10, by=list(FAC_ID = dbf10$FAC_ID), sum)
names(hosp.demand.t)[2] <- "Min10"
## Merge
hosp.demand <- merge(hosp.demand, hosp.demand.t, by="FAC_ID", all.x=TRUE)
rm(dbf10)

## Read in 15 minute file
dbf15 <- read.dbf("/media/data/Project Files/Delamater/Dissertation/
Utilization/Service-areas/join/MI_2010_blocks_proj_cent_pop_gt0_join_15.dbf")
dbf15 <- dbf15$dbf
## Subset to needed columns
dbf15 <- dbf15[,c(names(dbf15) == "GEOID10" | names(dbf15) == "FAC_ID")]
## Attach demand per hospital!
dbf15 <- merge(dbf15, wght.blck[,c(1,5)], by="GEOID10", all.x=TRUE)
## Aggregate by hospital
hosp.demand.t <- aggregate(dbf15$Min15, by=list(FAC_ID = dbf15$FAC_ID), sum)
names(hosp.demand.t)[2] <- "Min15"
## Merge
hosp.demand <- merge(hosp.demand, hosp.demand.t, by="FAC_ID", all.x=TRUE)
rm(dbf15)

## Read in 20 minute file
dbf20 <- read.dbf("/media/data/Project Files/Delamater/Dissertation/
Utilization/Service-areas/join/MI_2010_blocks_proj_cent_pop_gt0_join_20.dbf")
dbf20 <- dbf20$dbf
## Subset to needed columns
dbf20 <- dbf20[,c(names(dbf20) == "GEOID10" | names(dbf20) == "FAC_ID")]
## Attach demand per hospital!
dbf20 <- merge(dbf20, wght.blck[,c(1,6)], by="GEOID10", all.x=TRUE)
## Aggregate by hospital
hosp.demand.t <- aggregate(dbf20$Min20, by=list(FAC_ID = dbf20$FAC_ID), sum)
names(hosp.demand.t)[2] <- "Min20"
## Merge
hosp.demand <- merge(hosp.demand, hosp.demand.t, by="FAC_ID", all.x=TRUE)
rm(dbf20)

## Read in 25 minute file
dbf25 <- read.dbf("/media/data/Project Files/Delamater/Dissertation/
Utilization/Service-areas/join/MI_2010_blocks_proj_cent_pop_gt0_join_25.dbf")
dbf25 <- dbf25$dbf
## Subset to needed columns
dbf25 <- dbf25[,c(names(dbf25) == "GEOID10" | names(dbf25) == "FAC_ID")]
## Attach demand per hospital!
dbf25 <- merge(dbf25, wght.blck[,c(1,7)], by="GEOID10", all.x=TRUE)
## Aggregate by hospital
hosp.demand.t <- aggregate(dbf25$Min25, by=list(FAC_ID = dbf25$FAC_ID), sum)
names(hosp.demand.t)[2] <- "Min25"
## Merge
hosp.demand <- merge(hosp.demand, hosp.demand.t, by="FAC_ID", all.x=TRUE)
rm(dbf25)

## Read in 30 minute file
dbf30 <- read.dbf("/media/data/Project Files/Delamater/Dissertation/
Utilization/Service-areas/join/MI_2010_blocks_proj_cent_pop_gt0_join_30.dbf")
dbf30 <- dbf30$dbf
## Subset to needed columns
dbf30 <- dbf30[,c(names(dbf30) == "GEOID10" | names(dbf30) == "FAC_ID")]
## Attach demand per hospital!
dbf30 <- merge(dbf30, wght.blck[,c(1,8)], by="GEOID10", all.x=TRUE)
## Aggregate by hospital
hosp.demand.t <- aggregate(dbf30$Min30, by=list(FAC_ID = dbf30$FAC_ID), sum)
names(hosp.demand.t)[2] <- "Min30"
## Merge
hosp.demand <- merge(hosp.demand, hosp.demand.t, by="FAC_ID", all.x=TRUE)
rm(dbf30)

## Read in 35 minute file
dbf35 <- read.dbf("/media/data/Project Files/Delamater/Dissertation/
Utilization/Service-areas/join/MI_2010_blocks_proj_cent_pop_gt0_join_35.dbf")
dbf35 <- dbf35$dbf
## Subset to needed columns
dbf35 <- dbf35[,c(names(dbf35) == "GEOID10" | names(dbf35) == "FAC_ID")]
## Attach demand per hospital!
dbf35 <- merge(dbf35, wght.blck[,c(1,9)], by="GEOID10", all.x=TRUE)
## Aggregate by hospital
hosp.demand.t <- aggregate(dbf35$Min35, by=list(FAC_ID = dbf35$FAC_ID), sum)
names(hosp.demand.t)[2] <- "Min35"
## Merge
hosp.demand <- merge(hosp.demand, hosp.demand.t, by="FAC_ID", all.x=TRUE)
rm(dbf35)

### Read in 40 minute files
dbf40a <- read.dbf("/media/data/Project Files/Delamater/Dissertation/
Utilization/Service-areas/join/MI_2010_blocks_proj_cent_pop_gt0_join_40a.dbf")
dbf40b <- read.dbf("/media/data/Project Files/Delamater/Dissertation/
Utilization/Service-areas/join/MI_2010_blocks_proj_cent_pop_gt0_join_40b.dbf")
dbf40a <- dbf40a$dbf
dbf40b <- dbf40b$dbf
## Subset to needed columns
dbf40a <- dbf40a[,c(names(dbf40a) == "GEOID10" | names(dbf40a) == "FAC_ID")]
dbf40b <- dbf40b[,c(names(dbf40b) == "GEOID10" | names(dbf40b) == "FAC_ID")]
## Combine tables
dbf40 <- rbind(dbf40a, dbf40b)
## Attach demand per hospital!
dbf40 <- merge(dbf40, wght.blck[,c(1,10)], by="GEOID10", all.x=TRUE)

## Aggregate by hospital
hosp.demand.t <- aggregate(dbf40$Min40, by=list(FAC_ID = dbf40$FAC_ID), sum)
names(hosp.demand.t)[2] <- "Min40"
## Merge
hosp.demand <- merge(hosp.demand, hosp.demand.t, by="FAC_ID", all.x=TRUE)
rm(dbf40a)
rm(dbf40b)
rm(dbf40)

### Read in 45 minute files
dbf45a <- read.dbf("/media/data/Project Files/Delamater/Dissertation/
Utilization/Service-areas/join/MI_2010_blocks_proj_cent_pop_gt0_join_45a.dbf")
dbf45b <- read.dbf("/media/data/Project Files/Delamater/Dissertation/
Utilization/Service-areas/join/MI_2010_blocks_proj_cent_pop_gt0_join_45b.dbf")
dbf45a <- dbf45a$dbf
dbf45b <- dbf45b$dbf
## Subset to needed columns
dbf45a <- dbf45a[,c(names(dbf45a) == "GEOID10" | names(dbf45a) == "FAC_ID")]
dbf45b <- dbf45b[,c(names(dbf45b) == "GEOID10" | names(dbf45b) == "FAC_ID")]
## Combine tables
dbf45 <- rbind(dbf45a, dbf45b)
## Attach demand per hospital!
dbf45 <- merge(dbf45, wght.blck[,c(1,11)], by="GEOID10", all.x=TRUE)
## Aggregate by hospital
hosp.demand.t <- aggregate(dbf45$Min45, by=list(FAC_ID = dbf45$FAC_ID), sum)
names(hosp.demand.t)[2] <- "Min45"
## Merge
hosp.demand <- merge(hosp.demand, hosp.demand.t, by="FAC_ID", all.x=TRUE)
rm(dbf45a)
rm(dbf45b)
rm(dbf45)

### Read in 60 minute files
dbf60a <- read.dbf("/media/data/Project Files/Delamater/Dissertation/
Utilization/Service-areas/join/MI_2010_blocks_proj_cent_pop_gt0_join_60a.dbf")
dbf60b <- read.dbf("/media/data/Project Files/Delamater/Dissertation/
Utilization/Service-areas/join/MI_2010_blocks_proj_cent_pop_gt0_join_60b.dbf")
dbf60a <- dbf60a$dbf
dbf60b <- dbf60b$dbf
## Subset to needed columns
dbf60a <- dbf60a[,c(names(dbf60a) == "GEOID10" | names(dbf60a) == "FAC_ID")]
dbf60b <- dbf60b[,c(names(dbf60b) == "GEOID10" | names(dbf60b) == "FAC_ID")]
## Combine tables
dbf60 <- rbind(dbf60a, dbf60b)
## Attach demand per hospital!
dbf60 <- merge(dbf60, wght.blck[,c(1,12)], by="GEOID10", all.x=TRUE)
## Aggregate by hospital
hosp.demand.t <- aggregate(dbf60$Min60, by=list(FAC_ID = dbf60$FAC_ID), sum)
names(hosp.demand.t)[2] <- "Min60"
## Merge
hosp.demand <- merge(hosp.demand, hosp.demand.t, by="FAC_ID", all.x=TRUE)
rm(dbf60a)
rm(dbf60b)
rm(dbf60)

### Sum
hosp.demand$DemandSum <- rowSums(hosp.demand[,2:11])

### Get bed info
## Import hospital bed numbers
hosp.info <- read.csv("/media/data/GISdata/hospitals/csv/
2011-hosps-beds.csv")
## Attach beds to demand table
hosp.demand <- merge(hosp.demand, hosp.info[,c(4,8)], by="FAC_ID",
all.x=TRUE)
## Reorder
hosp.demand <- hosp.demand[,c(1,13,2:12)]

####
#### Calculate Hospital Supply!
####
hosp.demand$HospSupply <- hosp.demand$BEDS2010 / hosp.demand$DemandSum
# hosp.demand[,c(1,2,14)]

####
#### Calculate E2SFCA
####
## Get OD matrix for Zip Codes
## Read in origin-destination shapefile table
od <- read.dbf("/media/data/Project Files/Delamater/Dissertation/
Utilization/OD/hosps_esri2010pwcent_traveltime_od.dbf")
## Remove header info
od <- od$dbf
## Subset
od <- od[,c(7,8,5)]
# Remove weird point
# Weird point is Block centroid that didn’t fall in any Zip Code!
badID <- which(od$FAC_ID %in% unique(od$FAC_ID)[170:338])
od <- od[-c(badID),]
# Calculate weights
### Because we have the actual distance, we can calculate this using
### the formula from estimate.decay.parameter.R
#2009 data# weight = (1 / (1 + (Min/13.77)^1.83127))
od$dllWgt <- (1 / (1 + (od$Total_MinT/13.885928)^1.817622))
########################
############
############ Important!
############
############ Because we only considered demand out to 60 minutes,
############ change any weight above 60 minutes to 0 (zero)
############
########################
od$dllWgt[od$Total_MinT > 60] <- 0

## Attach supply at each hospital
od <- merge(od, hosp.demand[,c(1,14)], by="FAC_ID", all.x=TRUE)
## Multiply supply by weight
od$Supply <- od$dllWgt*od$HospSupply

## Aggregate for each ZIP CODE
E2SFCA <- aggregate(od$Supply, by=list(ZIP = od$ZIPCODE), sum)
names(E2SFCA)[2] <- "E2SFCA"

### Clean up file to account for funky Zip Codes ##################
### See dissertation data notes ###################################
### Get zip pop data :: WE NEED ORIGINAL DATA HERE! ::
### Not "adjusted for bad zips" data!
zip.pop <- read.csv("/home/delamate/MDCH/data/dissertation/zipcodes/
tables/zipcode_blockpop_adj_ALLZIPS.csv")
### Attach to E2SFCA
E2SFCA <- merge(E2SFCA, zip.pop, by="ZIP", all.x=TRUE)
### Remove islands
#islands <- which(od$ZIPCODE %in% c(48028, 49726, 49757, 49775, 49782))
#od <- od[-islands, ]
### Reassign zips
E2SFCA$ZIP[E2SFCA$ZIP == 48710] <- 48706
E2SFCA$ZIP[E2SFCA$ZIP == 48743] <- 48739
E2SFCA$ZIP[E2SFCA$ZIP == 48824 | E2SFCA$ZIP == 48825] <- 48823
E2SFCA$ZIP[E2SFCA$ZIP == 49104] <- 49103
E2SFCA$ZIP[E2SFCA$ZIP == 49792] <- 49765
E2SFCA$ZIP[E2SFCA$ZIP == 49873] <- 49807

###############
### Do a weighted mean by zip population
###############

E2SFCA$WgtE2SFCA <- E2SFCA$E2SFCA*E2SFCA$BkPopAdj2010
E2SFCA <- aggregate(E2SFCA[,3:4], by=list("ZIP" = E2SFCA$ZIP), sum)
E2SFCA$E2SFCA <- E2SFCA$WgtE2SFCA / E2SFCA$BkPopAdj2010
### Clean up file to account for funky Zip Codes ###################
### See dissertation data notes ####################################

E2SFCA <- E2SFCA[,-3]

### Test to see if math worked
sum(hosp.info$BEDS2010)
sum(E2SFCA$E2SFCA * E2SFCA$BkPopAdj2010)

## Write out to table
write.csv(E2SFCA, file="/home/delamate/MDCH/data/dissertation/zipcodes/
tables/zipcode_E2SFCA_2010_new.csv", row.names=FALSE)

## Attach to Zip Polys and map (updated clusters!)
## Import Zip Code file and attach
zip.poly <- readOGR("/media/data/Project Files/Delamater/Dissertation/
Clustering/Cluster_shapefiles/d/895.shp", layer="895")
## Join
zip.poly@data <- cbind(zip.poly@data, E2SFCA)
zip.poly$E31000 <- zip.poly$E2SFCA * 1000

writeOGR(zip.poly, dsn="/media/data/Project Files/Delamater/Dissertation/
Utilization/E2SFCA", layer="MI_CL895_E2SFCA_new", driver="ESRI Shapefile")
###################################################
###################################################
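
## The two steps of the script above (a bed-to-demand ratio per hospital,
## then a weighted sum of those ratios per origin) can be sketched on toy
## data. All input values below are hypothetical. As a simplification, the
## sketch uses the same continuous weights in both steps, whereas the
## script above uses ring-averaged weights at the block level for demand.
## Under that simplification, population-weighted accessibility sums
## exactly to the total number of beds, which is the check the script runs
## under "Test to see if math worked".

```r
## Toy supply (beds) and population (all values hypothetical)
beds <- c(H1 = 100, H2 = 50)
pop  <- c(Z1 = 1000, Z2 = 2000, Z3 = 500)

## Travel time in minutes: rows are origins, columns are hospitals
tt <- matrix(c(10, 50,
               30, 20,
               70, 15), nrow = 3, byrow = TRUE,
             dimnames = list(names(pop), names(beds)))

## Decay weights with the 2010 parameters from the script above
decay <- function(m, a = 13.885928, b = 1.817622) 1 / (1 + (m / a)^b)
W <- decay(tt)
## No interaction beyond 60 minutes
W[tt > 60] <- 0

## Step 1: weighted demand at each hospital, then beds per unit of demand
demand <- colSums(W * pop)
supply <- beds / demand

## Step 2: sum the weighted supply ratios reachable from each origin
access <- drop(W %*% supply)

## Check: both values equal the total number of beds
sum(access * pop)
sum(beds)
```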

###################################################
## R Code to cluster Zip Codes into regions
##
## Built upon code from Michigan Hospital Groups ##
###################################################
### Read in 2010 patient records
pd <- read.csv("/home/delamate/MDCH/data/dissertation/utilization/2010/
2010MIDBrecords.csv")
### Drop age column
pd <- pd[,-1]
### Clean up file to account for funky Zip Codes ##################
### See dissertation data notes ###################################
### Remove islands
islands <- which(pd$MIDB_ZIP %in% c(48028, 49726, 49757, 49775, 49782))
pd <- pd[-islands, ]
### Reassign zips
pd$MIDB_ZIP[pd$MIDB_ZIP == 48710] <- 48706
pd$MIDB_ZIP[pd$MIDB_ZIP == 48743] <- 48739
pd$MIDB_ZIP[pd$MIDB_ZIP == 48824 | pd$MIDB_ZIP == 48825] <- 48823
pd$MIDB_ZIP[pd$MIDB_ZIP == 49104] <- 49103
pd$MIDB_ZIP[pd$MIDB_ZIP == 49792] <- 49765
pd$MIDB_ZIP[pd$MIDB_ZIP == 49873] <- 49807

### Clean up file to account for funky Zip Codes ###################
### See dissertation data notes ####################################
### Convert to OD matrix
out <- aggregate(pd$LOS, by=list("ZIP" = pd$MIDB_ZIP, "MIDB" =
pd$HOSP_ID), sum)
names(out)[3] <- "PD"
pd <- reshape(out, direction='wide', idvar='ZIP', timevar='MIDB')
rm(out)
pd[is.na(pd)] <- 0
### Sort... rename columns
pd <- pd[order(pd$ZIP), ]
names(pd)[2:ncol(pd)] <- paste("PD", substr(names(pd)[2:ncol(pd)], 4, 7),
sep="")
### Convert patient days to Commitment Index values
z.days <- rowSums(pd[,2:ncol(pd)])
ci.pd <- pd[,2:ncol(pd)] / z.days
ci.pd$ZIP <- pd$ZIP
ci.pd <- ci.pd[,c(159,1:158)]
# write.csv(pd, file="datatables/2010.patient.days.csv",
#   row.names=FALSE)
# write.csv(ci.pd, file="datatables/2010.patient.days.CI.csv",
#   row.names=FALSE)
# pd <- read.csv(file="datatables/2010.patient.days.csv")
# ci.pd <- read.csv(file="datatables/2010.patient.days.CI.csv")
#sum(ci.pd[3,2:ncol(ci.pd)])

### Read in Distance data from Zip Codes to hospitals
library(rgdal)
library(sp)
library(maptools)
library(shapefiles)
## Get OD matrix for Zip Codes
## Read in origin-destination shapefile table
od <- read.dbf("/media/data/Project Files/Delamater/Dissertation/
Utilization/OD/hosps_esri2010pwcent_traveltime_od.dbf")
## Remove header info
od <- od$dbf
## Subset
od <- od[,c(7,8,5)]
### Have to attach MIDB number to OD matrix
## Import hospital info
hosp.info <- read.csv("/media/data/GISdata/hospitals/csv/
2011-hosps-beds.csv")
## Attach MIDB number to OD matrix
od <- merge(od, hosp.info[,c(4,1)], by="FAC_ID", all.x=TRUE)
sum(is.na(od$MIDB))
### Clean up file to account for funky Zip Codes ##################
### See dissertation data notes ###################################
### Remove islands
#islands <- which(od$ZIPCODE %in% c(48028, 49726, 49757, 49775, 49782))
#od <- od[-islands, ]
### Reassign zips
od$ZIPCODE[od$ZIPCODE == 48710] <- 48706
od$ZIPCODE[od$ZIPCODE == 48743] <- 48739
od$ZIPCODE[od$ZIPCODE == 48824 | od$ZIPCODE == 48825] <- 48823
od$ZIPCODE[od$ZIPCODE == 49104] <- 49103
od$ZIPCODE[od$ZIPCODE == 49792] <- 49765
od$ZIPCODE[od$ZIPCODE == 49873] <- 49807

### Clean up file to account for funky Zip Codes ###################
### See dissertation data notes ####################################
## Because some hospitals have the same MIDB, but are in different
## locations, aggregate records (mean) by BOTH : FROM and TO
## this gives the "mean" distance for these locations
od <- aggregate(od[,3], by=list("ZIP" = od$ZIPCODE, "MIDB" = od$MIDB),
mean)
names(od)[3] <- "Min"
### Convert to OD matrix
od <- reshape(od, direction='wide', idvar='ZIP', timevar='MIDB')
od[is.na(od)] <- 0
names(od)[2:ncol(od)] <- paste("D", substr(names(od)[2:ncol(od)], 5, 8),
sep="")
# write.csv(od, file="datatables/traveltime.zipcodes.hospitals.csv",
#   row.names=FALSE)
### Scale distance matrix from 0-1
max.dist <- max(od[,2:ncol(od)])
# Divide by maximum travel time between any Zip Code and any hospital
od[,2:ncol(od)] <- od[,2:ncol(od)] / max.dist
# write.csv(od, file="datatables/traveltime.zipcodes.hospitals.scaled.csv",
#   row.names=FALSE)
#####################
## Attach utilization matrix to distance matrix
#####################
data <- merge(ci.pd, od, by="ZIP")
dim(data)
sum(is.na(data))
# write.csv(data, file="datatables/cluster.data.2010.csv", row.names=FALSE)
# data <- read.csv("datatables/cluster.data.2010.csv")

################################################################
## Function to seed Kmeans cluster algorithm with centers     ##
## provided by a Ward's cluster output. Stabilizes results    ##
## and produces better results                                ##
################################################################
kmeans.ward <- function(x, clusters) {
  d <- dist(x, "euclidean")                 # create distance matrix
  hc <- hclust(d, method="ward")            # initial clusters
  memb <- cutree(hc, k = clusters)          # get 'n' clusters
  cent <- NULL                              # make holder
  for (k in 1:clusters) {
    # get cluster centers
    cent <- rbind(cent, colMeans(x[memb == k,]))
  }
  k.m <- kmeans(x, cent, iter.max = 10000)  # seed kmeans with Ward's
  return(k.m)
}
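
## A minimal usage sketch of the Ward-seeded K-means idea on toy data.
## The function `kmeans.ward.sketch` and the toy matrix are hypothetical.
## The sketch passes method = "ward.D" to hclust(), which is the name that
## current R versions give to the option formerly called "ward", and it
## adds drop = FALSE so that single-member clusters in a matrix still
## yield a row of column means.

```r
kmeans.ward.sketch <- function(x, clusters) {
  d <- dist(x, "euclidean")              # pairwise distances
  hc <- hclust(d, method = "ward.D")     # hierarchical (Ward) clustering
  memb <- cutree(hc, k = clusters)       # cut the tree into k groups
  ## Mean of each Ward group becomes an initial K-means center
  cent <- do.call(rbind, lapply(seq_len(clusters), function(k) {
    colMeans(x[memb == k, , drop = FALSE])
  }))
  kmeans(x, cent, iter.max = 10000)      # K-means seeded with Ward centers
}

## Two well-separated toy groups of 20 points each
set.seed(42)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))

fit <- kmeans.ward.sketch(toy, 2)
table(fit$cluster)

## Share of total variation explained, as stored in the r2 column
r2 <- 1 - fit$tot.withinss / fit$totss
```

## Because the starting centers are fixed by the Ward solution rather than
## drawn at random, repeated runs on the same data give the same clusters.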

## Define the range of solutions to evaluate
cl.max <- nrow(data)-1
clusters <- c(2:cl.max)
## Create a holder for cluster statistics
wss <- bss <- r2 <- incF <- rep(0, length(clusters))
k.data.pat <- cbind(clusters, wss, bss, r2, incF)
## Create a holder for cluster membership
membership <- data.frame(data$ZIP)
names(membership) <- "ZIP"
# Get number of columns in data
col.max <- ncol(data)
start <- Sys.time()
count <- seq(0, cl.max, 25)
# Loop through clusters
for (z in 1:length(clusters)) {
## Use K-means + Wards method to create clusters
kmeans <- kmeans.ward(data[,2:col.max], clusters[z])
## Write cluster stats to data holder
k.data.pat[z,2] <- kmeans$tot.withinss
k.data.pat[z,3] <- kmeans$betweenss
k.data.pat[z,4] <- 1-(kmeans$tot.withinss/kmeans$totss)
## Write cluster membership to data holder
membership$clusters <- kmeans$cluster
names(membership)[z+1] <- paste("CL", z+1, sep="")
if (z %in% count) print(paste("Cluster: ", z, " at ",
Sys.time()-start, sep=""))
}
print(Sys.time() - start)
## Convert data holder to data frame
k.data.pat <- as.data.frame(k.data.pat)
## Calculate incremental F score
for (i in 2:length(clusters)) {
k.data.pat$incF[i] <- ((k.data.pat$r2[i]-k.data.pat$r2[i-1])/
(k.data.pat$clusters[i]-k.data.pat$clusters[i-1])) /
((1-k.data.pat$r2[i])/((nrow(data))-(k.data.pat$clusters[i]-1)))
}
## Write data to file
write.csv(k.data.pat, file="datatables/cluster.stats.csv", row.names=FALSE)

## Find peaks in incremental F score
incF.peaks <- which(k.data.pat$incF[3:(cl.max-1)] >
k.data.pat$incF[2:(cl.max-2)] & k.data.pat$incF[3:(cl.max-1)] >
k.data.pat$incF[4:cl.max])+2
## Subset results
cluster.groups <- k.data.pat[incF.peaks,]
membership <- membership[,c(1,cluster.groups$clusters)]
## Write out cluster membership to file
write.csv(membership, file="datatables/cluster.membership.incF.peaks.csv",
row.names=FALSE)
###################################################
###################################################
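
## The incremental F score and the peak search used above can be sketched
## on a short, hypothetical series of R-squared values. The helper names
## `inc.f` and `inc.f.peaks` and all toy numbers are assumptions made for
## illustration. As in the script, the first entry is a placeholder of
## zero and the peak search starts at the third entry.

```r
## Incremental F score for successive cluster solutions
inc.f <- function(r2, k, n) {
  out <- rep(0, length(k))
  for (i in 2:length(k)) {
    out[i] <- ((r2[i] - r2[i - 1]) / (k[i] - k[i - 1])) /
      ((1 - r2[i]) / (n - (k[i] - 1)))
  }
  out
}

## Positions that exceed both neighbors, skipping the first two entries
inc.f.peaks <- function(x) {
  i <- 3:(length(x) - 1)
  i[x[i] > x[i - 1] & x[i] > x[i + 1]]
}

k  <- 2:7                                      # number of clusters
r2 <- c(0.30, 0.50, 0.55, 0.75, 0.78, 0.80)    # hypothetical R-squared
f  <- inc.f(r2, k, n = 100)

## Cluster counts at which the incremental F score peaks
k[inc.f.peaks(f)]
## 5
```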

###########################################
## R Code to create data for regressions ##
###########################################
start <- Sys.time()
library(rgdal)
library(sp)
library(maptools)
library(shapefiles)
library(spdep)
library(gpclib)
library(plotrix)
# This line allows maptools to use gpc lib
gpclibPermit()
library(car)
library(MASS)
library(psych)
### Read in cluster information
cl <- read.csv("/home/delamate/MDCH/data/dissertation/clustering/
datatables/cluster.membership.incF.peaks.csv")
### Add "non-clustered" column
cl[,ncol(cl)+1] <- seq(1:895)
names(cl)[ncol(cl)] <- "CL895"
### Get total number of clusterings to be evaluated
e <- ncol(cl)-1
# Get list of dissolved shapefiles
cs <- list.files("/media/data/Project Files/Delamater/Dissertation/
Clustering/Cluster_shapefiles/d", pattern='.shp')
cs <- cs[seq(1,length(cs),2)]
# Get hospitalization information
pd <- read.csv("/home/delamate/MDCH/data/dissertation/zipcodes/tables/
zipcode_2010_UTILIZATION_age_breakdown.csv")
# Get zip age breakdown
age <- read.csv("/home/delamate/MDCH/data/dissertation/zipcodes/tables/
zipcode_AGE_blockpop_adj.csv")
# Get large data table
data <- read.csv("/home/delamate/MDCH/data/dissertation/zipcodes/tables/
zipcode_all_variables.csv")
## Add age categories for income, insurance variables
data$Pop0_64 <- rowSums(age[,2:14])
data$Pop16p <- rowSums(age[,5:19])
data$Pop25p <- rowSums(age[,7:19])
## Make STANDARD population (state totals)
std.pop <- colSums(age[,-1])

## For PCA random variable
set.seed(1)
tolerance <- function (x) {
1/vif(x)
}

###################
### START ITERATION
###################
for (i in e:1) {
######################
### Attach cluster membership to files
######################
pd.i <- merge(pd, cl[,c(1,i+1)], by="ZIP", all.x=TRUE)
age.i <- merge(age, cl[,c(1,i+1)], by="ZIP", all.x=TRUE)
data.i <- merge(data, cl[,c(1,i+1)], by="ZIP", all.x=TRUE)
######################
### Get dissolved shapefile, make neighbors
######################
shp <- readOGR(paste("/media/data/Project Files/Delamater/Dissertation/
Clustering/Cluster_shapefiles/d/", cs[i], sep=""), layer=substr(cs[i],
1, nchar(cs[i])-4), verbose=FALSE)
### Get neighbors
nb <- poly2nb(shp)
### One island poly doesn’t have neighbors in the original file...
### It is connected by a bridge to a single zip code
### Assign neighbors to it manually
nb.mat <- nb2mat(nb,style="B",zero.policy=TRUE)
### Find regions with zero neighbors
w.zero <- as.numeric(which(rowSums(nb.mat) == 0))
### If island poly is not grouped, assign neighbor as 48193
if (length(w.zero) > 0) {
near.clust <- data.i[data.i$ZIP == 48193, 42]
nb[[w.zero]] <- near.clust
nb[[near.clust]] <- as.integer(c(w.zero, nb[[near.clust]]))
}

######################
### Aggregate attributes
######################
### First, those columns that use FULL population
data.i[,c(6,8:9,11,30:38)] <- data.i[,c(6,8:9,11,30:38)] *
data.i$BkPopAdj2010
### Next, those columns that use 0-64
data.i[,c(7,10)] <- data.i[,c(7,10)] * data.i$Pop0_64
### Next, those columns that use 16+
data.i[,c(12,18:29)] <- data.i[,c(12,18:29)] * data.i$Pop16p
### Finally, those columns that use 25+
data.i[,13:17] <- data.i[,13:17] * data.i$Pop25p
### Now, sum by CLUSTER
data.cl <- aggregate(data.i[,2:41], by=list("CL" = data.i[,42]), sum)
### Now, divide by appropriate "summed" population
### First, those columns that use FULL population
data.cl[,c(6,8:9,11,30:38)] <- data.cl[,c(6,8:9,11,30:38)] /
data.cl$BkPopAdj2010
### Next, those columns that use 0-64
data.cl[,c(7,10)] <- data.cl[,c(7,10)] / data.cl$Pop0_64
### Next, those columns that use 16+
data.cl[,c(12,18:29)] <- data.cl[,c(12,18:29)] / data.cl$Pop16p
### Finally, those columns that use 25+
data.cl[,13:17] <- data.cl[,13:17] / data.cl$Pop25p
names(data.cl)[1] <- names(cl)[i+1]
######################
### Aggregate hospitalization rates
### Calculate Empirical Bayes estimates of patient day rates
######################
## First, aggregate patient days and age population by CLUSTER
pd.i <- aggregate(pd.i[,2:19], by=list("CL" = pd.i[,21]), sum)
age.i <- aggregate(age.i[,2:19], by=list("CL" = age.i[,21]), sum)
# Make holders
EB <- data.frame(CL = pd.i$CL)
CR <- data.frame(CL = pd.i$CL)
EB.phi <- data.frame(CL = pd.i$CL)
EB.gamma <- data.frame(CL = pd.i$CL)
## Loop through each age group
for (z in 2:19) {
if (sum(age[,z] == 0) > 0) {
age.zero <- age.i[,z]
age.zero[which(age.zero == 0)] <- 1
eb <- EBlocal(pd.i[,z], age.zero, nb, zero.policy=TRUE)
} else {
eb <- EBlocal(pd.i[,z], age.i[,z], nb, zero.policy=TRUE)
}
eb[is.na(eb)] <- 0
EB[,z] <- eb$est
CR[,z] <- eb$raw
EB.phi[,z] <- attributes(eb)$parameters$a
EB.gamma[,z] <- attributes(eb)$parameters$m
names(EB)[z] <- names(CR)[z] <- names(EB.phi)[z] <- names(EB.gamma)[z] <-
names(pd.i)[z]
}
### Write out age-specific rates
write.csv(CR, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/patdayrates/Crude/", names(cl)[i+1],
".csv", sep=""), row.names=FALSE)
write.csv(EB, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/patdayrates/EBsmooth/", names(cl)[i+1],
".csv", sep=""), row.names=FALSE)
write.csv(EB.phi, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/patdayrates/EBphi/", names(cl)[i+1],
".csv", sep=""), row.names=FALSE)
write.csv(EB.gamma, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/patdayrates/EBgamma/", names(cl)[i+1],
".csv", sep=""), row.names=FALSE)
### Get crude population rate
pop.days <- sum(pd.i[,2:19])
pop.c.r <- pop.days / std.pop[19]
### Calculate overall EB adj hospitalization rates
eb.adj <- NULL
for (z in 1:nrow(pd.i)) {
## Multiply zip-specific rates by std pop
z.r.eb.adj <- EB[z,2:19] * std.pop[1:18]
## Sum and divide by total population
z.r.eb.adj <- sum(z.r.eb.adj) / std.pop[19]
## Attach to holder
eb.adj <- rbind(eb.adj, c(as.numeric(pd.i$CL[z]),
as.numeric(z.r.eb.adj)))
}
### Insert into data table
data.cl$AgeAdjPatDayRateEBadj <- eb.adj[,2]
### Calculate Standardized Rate Ratio and Standardized Rate Difference
data.cl$StRateRatio <- data.cl$AgeAdjPatDayRateEBadj / pop.c.r
data.cl$StRateDif <- data.cl$AgeAdjPatDayRateEBadj - pop.c.r
### Write out the Aggregated data table
write.csv(data.cl, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/all_data/", names(cl)[i+1], ".csv", sep=""),
row.names=FALSE)

######################
### Principal Components
######################
data.cl$RANDOM <- runif(nrow(data.cl))
### SES: Education, Income
cor.ses <- cor(data.cl[,c(11:14,16:17,42)])
pca.ses <- principal(cor.ses, nfactors=7, rotate="none", scores=FALSE)
pca.ses.loadings <- unclass(pca.ses$loadings)
m <- apply(abs(pca.ses.loadings), 2, FUN=max)
n.ses.pc <- as.numeric(which(abs(pca.ses.loadings[7,]) == m))
if (length(n.ses.pc) == 0) n.ses.pc <- which(abs(pca.ses.loadings[7,])
== max(abs(pca.ses.loadings[7,])))
# Reconduct PCA without random variable to get "pct of variance explained"
cor.ses <- cor(data.cl[,c(11:14,16:17)])
pca.ses <- principal(cor.ses, nfactors=6, rotate="none", scores=FALSE)
pca.ses.loadings <- unclass(pca.ses$loadings)
pca.var <- apply(pca.ses.loadings^2, 2, sum) / 6
write.csv(pca.var[1:(n.ses.pc-1)], file=paste("/home/delamate/MDCH/data/
dissertation/regressions/output_tables/pca_variance/SES/",names(cl)[i+1],
".csv", sep=""))
# Do actual PCA with rotation and # of components
pca.ses <- principal(data.cl[,c(11:14,16:17)], nfactors=n.ses.pc[1]-1,
rotate="varimax", scores=TRUE)
pca.ses.loadings <- unclass(pca.ses$loadings)
# round(pca.ses.loadings, 2)
### Write loadings to file
write.csv(pca.ses.loadings, file=paste("/home/delamate/MDCH/data/
dissertation/regressions/output_tables/pca_loadings/SES/",names(cl)[i+1],
".csv", sep=""))
### ETH: Ethnicity
pca.eth <- principal(data.cl[,c(32:36,38,42)], nfactors=7, rotate="none",
scores=FALSE)
pca.eth.loadings <- unclass(pca.eth$loadings)
m <- apply(abs(pca.eth.loadings), 2, FUN=max)
n.eth.pc <- as.numeric(which(abs(pca.eth.loadings[7,]) == m))
if (length(n.eth.pc) == 0) n.eth.pc <- which(abs(pca.eth.loadings[7,]) ==
max(abs(pca.eth.loadings[7,])))
# Reconduct PCA without random variable to get "pct of variance explained"
pca.eth <- principal(data.cl[,c(32:36,38)], nfactors=6, rotate="none",
scores=FALSE)
pca.eth.loadings <- unclass(pca.eth$loadings)
pca.var <- apply(pca.eth.loadings^2, 2, sum) / 6
write.csv(pca.var[1:(n.eth.pc-1)], file=paste("/home/delamate/MDCH/data/
dissertation/regressions/output_tables/pca_variance/ETH/",names(cl)[i+1],
".csv", sep=""))
pca.eth <- principal(data.cl[,c(32:36,38)], nfactors=n.eth.pc[1]-1,
rotate="varimax", scores=TRUE)
pca.eth.loadings <- unclass(pca.eth$loadings)
# round(pca.eth.loadings, 2)
### Write loadings to file
write.csv(pca.eth.loadings, file=paste("/home/delamate/MDCH/data/
dissertation/regressions/output_tables/pca_loadings/ETH/",names(cl)[i+1],
".csv", sep=""))

### MOBILITY 1
pca.mob <- principal(data.cl[,c(18:20,22,42)], nfactors=5, rotate="none",
scores=FALSE)
pca.mob.loadings <- unclass(pca.mob$loadings)
m <- apply(abs(pca.mob.loadings), 2, FUN=max)
n.mob.pc <- as.numeric(which(abs(pca.mob.loadings[5,]) == m))
if (length(n.mob.pc) == 0) n.mob.pc <- which(abs(pca.mob.loadings[5,]) ==
max(abs(pca.mob.loadings[5,])))
# Reconduct PCA without random variable to get "pct of variance explained"
pca.mob <- principal(data.cl[,c(18:20,22)], nfactors=4, rotate="none",
scores=FALSE)
pca.mob.loadings <- unclass(pca.mob$loadings)
pca.var <- apply(pca.mob.loadings^2, 2, sum) / 4
write.csv(pca.var[1:(n.mob.pc-1)], file=paste("/home/delamate/MDCH/data/
dissertation/regressions/output_tables/pca_variance/MOB/",names(cl)[i+1],
".csv", sep=""))
pca.mob <- principal(data.cl[,c(18:20,22)], nfactors=n.mob.pc[1]-1,
rotate="varimax", scores=TRUE)
pca.mob.loadings <- unclass(pca.mob$loadings)
#round(pca.mob.loadings, 2)
### Write loadings to file
write.csv(pca.mob.loadings, file=paste("/home/delamate/MDCH/data/
dissertation/regressions/output_tables/pca_loadings/MOB/",names(cl)[i+1],
".csv", sep=""))
### MOBILITY 2
pca.mob2 <- principal(data.cl[,c(23:28,42)], nfactors=7, rotate="none",
scores=FALSE)
pca.mob2.loadings <- unclass(pca.mob2$loadings)
m <- apply(abs(pca.mob2.loadings), 2, FUN=max)
n.mob2.pc <- as.numeric(which(abs(pca.mob2.loadings[7,]) == m))
if (length(n.mob2.pc) == 0) n.mob2.pc <- which(abs(pca.mob2.loadings[7,]) ==
max(abs(pca.mob2.loadings[7,])))
# Reconduct PCA without random variable to get "pct of variance explained"
pca.mob2 <- principal(data.cl[,c(23:28)], nfactors=6, rotate="none",
scores=FALSE)
pca.mob2.loadings <- unclass(pca.mob2$loadings)
pca.var <- apply(pca.mob2.loadings^2, 2, sum) / 6
write.csv(pca.var[1:(n.mob2.pc-1)], file=paste("/home/delamate/MDCH/data/
dissertation/regressions/output_tables/pca_variance/MOB2/",names(cl)[i+1],
".csv", sep=""))
pca.mob2 <- principal(data.cl[,c(23:28)], nfactors=n.mob2.pc[1]-1,
rotate="varimax", scores=TRUE)
pca.mob2.loadings <- unclass(pca.mob2$loadings)
# round(pca.mob2.loadings, 2)
### Write loadings to file
write.csv(pca.mob2.loadings, file=paste("/home/delamate/MDCH/data/
dissertation/regressions/output_tables/pca_loadings/MOB2/",names(cl)[i+1],
".csv", sep=""))
### Aggregate PCA scores into table
pca.scores <- as.data.frame(cbind(pca.ses$scores, pca.eth$scores,
pca.mob$scores, pca.mob2$scores))
### Rename columns
names(pca.scores) <- c(paste("SESPC", 1:(n.ses.pc[1]-1), sep=""),
paste("ETHPC", 1:(n.eth.pc[1]-1), sep=""), paste("MOBPC", 1:(n.mob.pc[1]-1),
sep=""), paste("MOB2PC", 1:(n.mob2.pc[1]-1), sep=""))
write.csv(pca.scores, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/pca_data/",names(cl)[i+1],".csv", sep=""))
######
### Remove MOBPC2 if it is in data (no theory!)
######
if ("MOBPC2" %in% names(pca.scores)) pca.scores <- pca.scores[,-(which(names(pca.scores) == "MOBPC2"))]
### Make new data
data.pc <- as.data.frame(cbind(data.cl[,c(1:4,6,10,31)], pca.scores))
listw <- nb2listw(nb, style = "B", zero.policy=TRUE)
### Regress E2SFCA and Ethnicity here
if (cor(data.pc$E2SFCA, data.pc$ETHPC1) >= 0.4) {
lm.E2SFCA <- lm(data.pc$E2SFCA ~ data.pc$ETHPC1)
resid.E2SFCA <- residuals(lm.E2SFCA)
data.pc[,5] <- resid.E2SFCA
names(data.pc)[5] <- "E2SFCAresid"
# write out regression statistics
write.csv(summary(lm.E2SFCA)$coefficients, file=paste
("/home/delamate/MDCH/data/dissertation/regressions/output_tables/
pre.regression.stats/E2SFCA/", names(cl)[i+1],".csv", sep=""))
}
### Regress Health Insurance and SES here
if (cor(data.pc$HeaInsRateInt0_64, data.pc$SESPC1) >= 0.4) {
lm.hi <- lm(data.pc$HeaInsRateInt0_64 ~ data.pc$SESPC1)
resid.HI <- residuals(lm.hi)
data.pc[,6] <- resid.HI
names(data.pc)[6] <- "HeaInsRateInt0_64resid"
write.csv(summary(lm.hi)$coefficients, file=paste("/home/delamate/
MDCH/data/dissertation/regressions/output_tables/pre.regression.stats/
HealIns/", names(cl)[i+1],".csv", sep=""))
}
cor.table <- abs(cor(data.pc[,5:ncol(data.pc)]))
write.csv(round(cor.table, 6), file=paste("/home/delamate/MDCH/data/
dissertation/regressions/output_tables/correlation_all/",names(cl)[i+1],
".csv", sep=""))
names.lm <- paste(names(data.pc)[5:ncol(data.pc)], collapse=" + ")
vif.lm <- vif(lm(formula(paste("StRateDif ~ ", names.lm, sep="")),
data=data.pc))
write.csv(vif.lm, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/vif_all/",names(cl)[i+1],".csv", sep=""))
### Remove correlated variables here
### If there’s too much correlation... don’t proceed!!
while (max(vif.lm) > 2) {
### Find the last-listed variable whose VIF exceeds 2
which.vif <- names(vif.lm)[(max(which(vif.lm > 2)))]
which.max <- which(names(data.pc) == which.vif)
### Remove correlated variable
data.pc <- data.pc[,-which.max]
names.lm <- paste(names(data.pc)[5:ncol(data.pc)], collapse=" + ")
vif.lm <- vif(lm(formula(paste("StRateDif ~ ", names.lm, sep="")), data=data.pc))
}
write.csv(vif.lm, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/vif/",names(cl)[i+1],".csv", sep=""))
g.vars <- which(colnames(cor.table) %in% names(data.pc))
write.csv(round(cor.table[g.vars,g.vars], 6), file=paste("/home/delamate/
MDCH/data/dissertation/regressions/output_tables/correlation/",
names(cl)[i+1],".csv", sep=""))
lm.dat <- as.matrix(data.pc[,5:ncol(data.pc)])

write.csv(lm.dat, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/lm_data/", names(cl)[i+1],".csv", sep=""),
row.names=FALSE)
}
print(Sys.time() - start)
###########################################
###########################################
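
The component-retention rule used in the script above (append a uniform random variable, run an unrotated PCA, and keep only the components that precede the one on which the random variable loads most heavily) can be illustrated with a minimal base-R sketch. The function name `n.components.before.noise`, the `seed` argument, and the use of `eigen()` in place of `psych::principal()` are assumptions made for illustration. The sketch applies only the row-maximum rule, which the original code uses as a fallback, so it is a simplified variant rather than the exact procedure.

```r
# Hypothetical helper, not part of the original scripts.
# Returns the number of components that precede the component on which
# an appended random variable has its largest absolute loading.
n.components.before.noise <- function(x, seed = 1) {
  set.seed(seed)                      # assumed: fixed seed for repeatability
  x <- as.matrix(x)
  x.aug <- cbind(x, RANDOM = runif(nrow(x)))
  # Unrotated principal components of the correlation matrix
  e <- eigen(cor(x.aug), symmetric = TRUE)
  # Loadings = eigenvectors scaled by the square root of the eigenvalues
  loadings <- e$vectors %*% diag(sqrt(pmax(e$values, 0)))
  # Component on which the random variable (last row) loads most heavily
  noise.pc <- which.max(abs(loadings[ncol(x.aug), ]))
  noise.pc - 1
}

# Usage with simulated data: three indicators of one common factor
set.seed(42)
f <- rnorm(200)
demo <- cbind(a = f + rnorm(200, sd = 0.3),
              b = f + rnorm(200, sd = 0.3),
              c = f + rnorm(200, sd = 0.3))
k <- n.components.before.noise(demo)
print(k)
```

The returned count plays the role of `n.ses.pc[1]-1` in the listing, that is, the value passed as `nfactors` to the rotated PCA.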

#####################################################################
## R Code to conduct state-level simple linear regressions         ##
## at many scales of analysis to show AUTOCORRELATION of residuals ##
#####################################################################
start <- Sys.time()
library(rgdal)
library(sp)
library(maptools)
library(shapefiles)
library(spdep)
library(gpclib)
library(plotrix)
# This line allows maptools to use gpc lib
gpclibPermit()
library(car)
library(MASS)
library(psych)

### Read in cluster information
cl <- read.csv("/home/delamate/MDCH/data/dissertation/clustering/
datatables/cluster.membership.incF.peaks.csv")
### Add "non-clustered" column
cl[,ncol(cl)+1] <- seq_len(895)
names(cl)[ncol(cl)] <- "CL895"
# Get list of population files
pop.files <- list.files("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/all_data")
order <- unlist(strsplit(pop.files, ".csv"))
order <- order(as.numeric(substr(order, 3, 5)))
pop.files <- pop.files[order]
# Get list of ACS files
acs.files <- list.files("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/patdayrates.ACS/data")
order <- unlist(strsplit(acs.files, ".csv"))
order <- order(as.numeric(substr(order, 3, 5)))
acs.files <- acs.files[order]
# Get list of Low Variation files
lv.files <- list.files("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/patdayrates.LowV/data")
order <- unlist(strsplit(lv.files, ".csv"))
order <- order(as.numeric(substr(order, 3, 5)))
lv.files <- lv.files[order]
# Get list of dissolved shapefiles
cs <- list.files("/media/data/Project Files/Delamater/Dissertation/
Clustering/Cluster_shapefiles/d", pattern='.shp')
cs <- cs[seq(1,length(cs),2)]
# Get list of utilization rates and variables
u.files <- list.files("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/patdayrates.60min/data")
order <- unlist(strsplit(u.files, ".csv"))
order <- order(as.numeric(substr(order, 3, 5)))
u.files <- u.files[order]
i.files <- list.files("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/lm_data")
order <- unlist(strsplit(i.files, ".csv"))
order <- order(as.numeric(substr(order, 3, 5)))
i.files <- i.files[order]
matches <- which(u.files %in% i.files)
u.files <- u.files[matches]
pop.files <- pop.files[matches]
acs.files <- acs.files[matches]
lv.files <- lv.files[matches]
shp.match <- which(as.numeric(unlist(strsplit(cs, ".shp"))) %in%
as.numeric(substr(unlist(strsplit(i.files, ".csv")), 3, 5)))
cs <- cs[shp.match]

cl.match <- which(as.numeric(substr(names(cl)[2:ncol(cl)], 3, 5))
%in% as.numeric(substr(unlist(strsplit(i.files, ".csv")), 3, 5)))
cl <- cl[,c(1,cl.match+1)]
## Choose significance for regression models
sig <- 0.05
# sig <- 0.1
### Get total number of clusterings to be evaluated
e <- length(u.files)
###################
### START ITERATION
###################
for (i in e:1) {
######################
### Get data
######################
ut <- read.csv(paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/patdayrates.60min/data/", u.files[i],
sep=""))
lm <- read.csv(paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/lm_data/", i.files[i], sep=""))
acs <- read.csv(paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/patdayrates.ACS/data/", i.files[i],
sep=""))
lv <- read.csv(paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/patdayrates.LowV/data/", i.files[i],
sep=""))
pop <- read.csv(paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/all_data/", pop.files[i], sep=""))
pop <- pop$BkPopAdj2010
i.pop <- 1/sqrt(pop)
pop.med <- median(pop)
fits.groups <- pop <= pop.med
qx <- quantile(pop, probs=seq(0,1,0.2))
q.fits.groups <- cut(pop, qx, include.lowest = TRUE)
######
### Remove MOBPC2 if it is in data (no theory!)
######
if ("MOBPC2" %in% names(lm)) lm <- lm[,-(which(names(lm) == "MOBPC2"))]
######################
### Get dissolved shapefile, make neighbors
######################
shp <- readOGR(paste("/media/data/Project Files/Delamater/Dissertation/
Clustering/Cluster_shapefiles/d/", cs[i], sep=""), layer=substr(cs[i],
1, nchar(cs[i])-4), verbose=FALSE)
### Get neighbors
nb <- poly2nb(shp)
### One island poly doesn’t have neighbors in the original file...
### It is connected by a bridge to a single zip code
### Assign neighbors to it manually
nb.mat <- nb2mat(nb,style="B",zero.policy=TRUE)
### Find regions with zero neighbors
w.zero <- as.numeric(which(rowSums(nb.mat) == 0))
### If island poly is not grouped, assign neighbors as 48193 (only for one)
if (length(w.zero) > 0) {
near.clust <- cl[cl$ZIP == 48193, i+1]
nb[[w.zero]] <- near.clust
nb[[near.clust]] <- as.integer(c(w.zero, nb[[near.clust]]))
}
listw <- nb2listw(nb, style = "B", zero.policy=TRUE)
### Do PCA on ACS and LV
pca.h <- principal(cbind(acs$AgeAdjPatDayRateEBadj,
lv$AgeAdjPatDayRateEBadj), nfactors=1, rotate="varimax", scores=TRUE)
acs.lv.scores <- pca.h$scores
pca.h2 <- principal(cbind(acs$AgeAdjPatDayRateEBadj,
lv$AgeAdjPatDayRateEBadj), nfactors=2, rotate="none", scores=FALSE)
pca.h2.loadings <- unclass(pca.h2$loadings)
pca.var <- apply(pca.h2.loadings^2, 2, sum) / 2
# Regress on ETHPC1
acs.lv.lm <- lm(acs.lv.scores ~ lm$ETHPC1)

######
###### LM weighted, normal
######
lm.dat <- as.matrix(cbind(residuals(acs.lv.lm), lm))
colnames(lm.dat)[1] <- "ACSLowVPCresid"
## Regression
milm <- lm(ut$StRateDif ~ lm.dat, weights=i.pop)
## If there are non-significant terms in the model...
while (sum(summary(milm)$coefficients[2:(ncol(lm.dat)+1),4] > sig) > 0) {
## Remove and remodel
bad.t <- as.numeric(which(summary(milm)$coefficients[2:(ncol(lm.dat)+1),
4] == max(summary(milm)$coefficients[2:(ncol(lm.dat)+1),4])))
if (ncol(lm.dat) == 2) name <- paste("lm.dat", colnames(lm.dat)[-bad.t],
sep="")
lm.dat <- lm.dat[,-bad.t]
milm <- lm(ut$StRateDif ~ lm.dat, weights=i.pop)
if (is.vector(lm.dat) == TRUE) break
}
lm.sum <- summary(milm)$coefficients
if (is.vector(lm.dat) == TRUE) rownames(lm.sum)[2] <- name
write.csv(lm.sum, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/regression.stats.ACS.LowV.dep.lm/weighted/
n.scale/betas/",names(cl)[i+1],".csv", sep=""))
mt <- moran.test(residuals(milm), listw, randomisation=FALSE)
resid.med <- median(residuals(milm))
r.fits.groups <- residuals(milm) <= resid.med
stats <- c(summary(milm)$adj.r.squared, summary(milm)$fstatistic[1],
pf(summary(milm)$fstatistic[1], summary(milm)$fstatistic[2],
summary(milm)$fstatistic[3], lower.tail = FALSE),
leveneTest(residuals(milm),factor(fits.groups))[1,3],
leveneTest(residuals(milm),factor(r.fits.groups))[1,3], mt$statistic,
mt$p.value)
names(stats) <- c("adjR2", "F", "F.p", "LevenePOP2", "LeveneRESID2",
"Moran", "Moran.p")
write.csv(stats, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/regression.stats.ACS.LowV.dep.lm/weighted/
n.scale/sig/",names(cl)[i+1],".csv", sep=""))

######
###### LM weighted, scaled
######
lm.dat <- as.matrix(cbind(residuals(acs.lv.lm), lm))
colnames(lm.dat)[1] <- "ACSLowVPCresid"
lm.dat <- scale(lm.dat)
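## Note: ut is overwritten here, so the "non-weighted, normal" block below
## also uses the scaled StRateDif
ut <- as.data.frame(scale(ut))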
## Regression
milm <- lm(ut$StRateDif ~ lm.dat, weights=i.pop)
## If there are non-significant terms in the model...
while (sum(summary(milm)$coefficients[2:(ncol(lm.dat)+1),4] > sig) > 0) {
## Remove and remodel
bad.t <- as.numeric(which(summary(milm)$coefficients[2:(ncol(lm.dat)+1),
4] == max(summary(milm)$coefficients[2:(ncol(lm.dat)+1),4])))
if (ncol(lm.dat) == 2) name <- paste("lm.dat", colnames(lm.dat)[-bad.t],
sep="")
lm.dat <- lm.dat[,-bad.t]
milm <- lm(ut$StRateDif ~ lm.dat, weights=i.pop)
if (is.vector(lm.dat) == TRUE) break
}
lm.sum <- summary(milm)$coefficients
if (is.vector(lm.dat) == TRUE) rownames(lm.sum)[2] <- name
write.csv(lm.sum, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/regression.stats.ACS.LowV.dep.lm/weighted/
scale/betas/",names(cl)[i+1],".csv", sep=""))
mt <- moran.test(residuals(milm), listw, randomisation=FALSE)
resid.med <- median(residuals(milm))
r.fits.groups <- residuals(milm) <= resid.med
stats <- c(summary(milm)$adj.r.squared, summary(milm)$fstatistic[1],
pf(summary(milm)$fstatistic[1], summary(milm)$fstatistic[2],
summary(milm)$fstatistic[3], lower.tail = FALSE),
leveneTest(residuals(milm),factor(fits.groups))[1,3],
leveneTest(residuals(milm),factor(r.fits.groups))[1,3], mt$statistic,
mt$p.value)
names(stats) <- c("adjR2", "F", "F.p", "LevenePOP2", "LeveneRESID2",
"Moran", "Moran.p")
write.csv(stats, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/regression.stats.ACS.LowV.dep.lm/weighted/
scale/sig/",names(cl)[i+1],".csv", sep=""))
######
###### LM non-weighted, normal
######
lm.dat <- as.matrix(cbind(residuals(acs.lv.lm), lm))
colnames(lm.dat)[1] <- "ACSLowVPCresid"
## Regression
milm <- lm(ut$StRateDif ~ lm.dat)
## If there are non-significant terms in the model...
while (sum(summary(milm)$coefficients[2:(ncol(lm.dat)+1),4] > sig) > 0) {
## Remove and remodel
bad.t <- as.numeric(which(summary(milm)$coefficients[2:(ncol(lm.dat)+1),
4] == max(summary(milm)$coefficients[2:(ncol(lm.dat)+1),4])))
if (ncol(lm.dat) == 2) name <- paste("lm.dat", colnames(lm.dat)[-bad.t],
sep="")
lm.dat <- lm.dat[,-bad.t]
milm <- lm(ut$StRateDif ~ lm.dat)
if (is.vector(lm.dat) == TRUE) break
}
lm.sum <- summary(milm)$coefficients
if (is.vector(lm.dat) == TRUE) rownames(lm.sum)[2] <- name

write.csv(lm.sum, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/regression.stats.ACS.LowV.dep.lm/n.weighted/
n.scale/betas/",names(cl)[i+1],".csv", sep=""))
mt <- moran.test(residuals(milm), listw, randomisation=FALSE)
resid.med <- median(residuals(milm))
r.fits.groups <- residuals(milm) <= resid.med
stats <- c(summary(milm)$adj.r.squared, summary(milm)$fstatistic[1],
pf(summary(milm)$fstatistic[1], summary(milm)$fstatistic[2],
summary(milm)$fstatistic[3], lower.tail = FALSE),
leveneTest(residuals(milm),factor(fits.groups))[1,3],
leveneTest(residuals(milm),factor(r.fits.groups))[1,3], mt$statistic,
mt$p.value)
names(stats) <- c("adjR2", "F", "F.p", "LevenePOP2", "LeveneRESID2",
"Moran", "Moran.p")
write.csv(stats, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/regression.stats.ACS.LowV.dep.lm/n.weighted/
n.scale/sig/",names(cl)[i+1],".csv", sep=""))

######
###### LM non-weighted, scaled
######
lm.dat <- as.matrix(cbind(residuals(acs.lv.lm), lm))
colnames(lm.dat)[1] <- "ACSLowVPCresid"
lm.dat <- scale(lm.dat)
ut <- as.data.frame(scale(ut))
## Regression
milm <- lm(ut$StRateDif ~ lm.dat)
## If there are non-significant terms in the model...
while (sum(summary(milm)$coefficients[2:(ncol(lm.dat)+1),4] > sig) > 0) {
## Remove and remodel
bad.t <- as.numeric(which(summary(milm)$coefficients[2:(ncol(lm.dat)+1),
4] == max(summary(milm)$coefficients[2:(ncol(lm.dat)+1),4])))
if (ncol(lm.dat) == 2) name <- paste("lm.dat", colnames(lm.dat)[-bad.t],
sep="")

lm.dat <- lm.dat[,-bad.t]
milm <- lm(ut$StRateDif ~ lm.dat)
if (is.vector(lm.dat) == TRUE) break
}
lm.sum <- summary(milm)$coefficients
if (is.vector(lm.dat) == TRUE) rownames(lm.sum)[2] <- name
write.csv(lm.sum, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/regression.stats.ACS.LowV.dep.lm/n.weighted/
scale/betas/",names(cl)[i+1],".csv", sep=""))
mt <- moran.test(residuals(milm), listw, randomisation=FALSE)
resid.med <- median(residuals(milm))
r.fits.groups <- residuals(milm) <= resid.med
stats <- c(summary(milm)$adj.r.squared, summary(milm)$fstatistic[1],
pf(summary(milm)$fstatistic[1], summary(milm)$fstatistic[2],
summary(milm)$fstatistic[3], lower.tail = FALSE),
leveneTest(residuals(milm),factor(fits.groups))[1,3],
leveneTest(residuals(milm),factor(r.fits.groups))[1,3], mt$statistic,
mt$p.value)
names(stats) <- c("adjR2", "F", "F.p", "LevenePOP2", "LeveneRESID2",
"Moran", "Moran.p")
write.csv(stats, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/regression.stats.ACS.LowV.dep.lm/n.weighted/
scale/sig/",names(cl)[i+1],".csv", sep=""))
}
print(Sys.time() - start)
#####################################################################
#####################################################################
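
The backward-elimination loop repeated in each regression block above (drop the term with the largest p-value and refit until every remaining term is significant at `sig`, or only one predictor is left) can be sketched as a single helper. The name `backward.eliminate` and its data-frame interface are assumptions for illustration, not the author's code. The sketch omits the population weights and assumes syntactically valid column names.

```r
# Hypothetical helper, not part of the original scripts.
# y: numeric response; x: data frame (or matrix) of predictors.
backward.eliminate <- function(y, x, sig = 0.05) {
  x <- as.data.frame(x)
  repeat {
    fit <- lm(y ~ ., data = cbind(data.frame(y = y), x))
    # p-values of all terms except the intercept, kept as a one-column matrix
    p <- summary(fit)$coefficients[-1, 4, drop = FALSE]
    # Stop when every term is significant or a single predictor remains
    if (all(p <= sig) || ncol(x) == 1) break
    # Remove the least significant predictor and refit
    worst <- rownames(p)[which.max(p)]
    x <- x[, setdiff(names(x), worst), drop = FALSE]
  }
  fit
}

# Usage with simulated data: only x1 is related to the response
set.seed(1)
n <- 200
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * x$x1 + rnorm(n, sd = 0.5)
fit <- backward.eliminate(y, x)
print(summary(fit)$coefficients)
```

Working with a data frame and `drop = FALSE` avoids the special handling that the listing needs when `lm.dat` collapses to a vector after the second-to-last predictor is removed.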

###############################################
## R Code to conduct state-level regressions ##
## at many scales of analysis                ##
###############################################
start <- Sys.time()
library(rgdal)
library(sp)
library(maptools)
library(shapefiles)
library(spdep)
library(gpclib)
library(plotrix)
# This line allows maptools to use gpc lib
gpclibPermit()
library(car)
library(MASS)
library(psych)
### Read in cluster information
cl <- read.csv("/home/delamate/MDCH/data/dissertation/clustering/
datatables/cluster.membership.incF.peaks.csv")
### Add "non-clustered" column
cl[,ncol(cl)+1] <- seq_len(895)
names(cl)[ncol(cl)] <- "CL895"
# Get list of population files
pop.files <- list.files("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/all_data")
order <- unlist(strsplit(pop.files, ".csv"))
order <- order(as.numeric(substr(order, 3, 5)))
pop.files <- pop.files[order]
# Get list of ACS files
acs.files <- list.files("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/patdayrates.ACS/data")
order <- unlist(strsplit(acs.files, ".csv"))
order <- order(as.numeric(substr(order, 3, 5)))
acs.files <- acs.files[order]
# Get list of Low Variation files
lv.files <- list.files("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/patdayrates.LowV/data")
order <- unlist(strsplit(lv.files, ".csv"))
order <- order(as.numeric(substr(order, 3, 5)))
lv.files <- lv.files[order]

# Get list of dissolved shapefiles
cs <- list.files("/media/data/Project Files/Delamater/Dissertation/
Clustering/Cluster_shapefiles/d", pattern='.shp')
cs <- cs[seq(1,length(cs),2)]
# Get list of utilization rates and variables
u.files <- list.files("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/patdayrates.60min/data")
order <- unlist(strsplit(u.files, ".csv"))
order <- order(as.numeric(substr(order, 3, 5)))
u.files <- u.files[order]
i.files <- list.files("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/lm_data")
order <- unlist(strsplit(i.files, ".csv"))
order <- order(as.numeric(substr(order, 3, 5)))
i.files <- i.files[order]
matches <- which(u.files %in% i.files)
u.files <- u.files[matches]
pop.files <- pop.files[matches]
acs.files <- acs.files[matches]
lv.files <- lv.files[matches]
shp.match <- which(as.numeric(unlist(strsplit(cs, ".shp"))) %in%
as.numeric(substr(unlist(strsplit(i.files, ".csv")), 3, 5)))
cs <- cs[shp.match]
cl.match <- which(as.numeric(substr(names(cl)[2:ncol(cl)], 3, 5))
%in% as.numeric(substr(unlist(strsplit(i.files, ".csv")), 3, 5)))
cl <- cl[,c(1,cl.match+1)]
## Choose significance for regression models
sig <- 0.05
# sig <- 0.1
### Get total number of clusterings to be evaluated
e <- length(u.files)
###################
### START ITERATION
###################
for (i in e:1) {

######################
### Get data
######################
ut <- read.csv(paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/patdayrates.60min/data/", u.files[i],
sep=""))
lm <- read.csv(paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/lm_data/", i.files[i], sep=""))
acs <- read.csv(paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/patdayrates.ACS/data/", i.files[i],
sep=""))
lv <- read.csv(paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/patdayrates.LowV/data/", i.files[i],
sep=""))
pop <- read.csv(paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/all_data/", pop.files[i], sep=""))
pop <- pop$BkPopAdj2010
i.pop <- 1/sqrt(pop)
pop.med <- median(pop)
fits.groups <- pop <= pop.med
qx <- quantile(pop, probs=seq(0,1,0.2))
q.fits.groups <- cut(pop, qx, include.lowest = TRUE)
######
### Remove MOBPC2 if it is in data (no theory!)
######
if ("MOBPC2" %in% names(lm)) lm <- lm[,-(which(names(lm) == "MOBPC2"))]
######################
### Get dissolved shapefile, make neighbors
######################
shp <- readOGR(paste("/media/data/Project Files/Delamater/Dissertation/
Clustering/Cluster_shapefiles/d/", cs[i], sep=""), layer=substr(cs[i],
1, nchar(cs[i])-4), verbose=FALSE)
### Get neighbors
nb <- poly2nb(shp)
### One island poly doesn’t have neighbors in the original file...
### It is connected by a bridge to a single zip code
### Assign neighbors to it manually
nb.mat <- nb2mat(nb,style="B",zero.policy=TRUE)
### Find regions with zero neighbors
w.zero <- as.numeric(which(rowSums(nb.mat) == 0))
### If island poly is not grouped, assign neighbors as 48193 (only for one)
if (length(w.zero) > 0) {
near.clust <- cl[cl$ZIP == 48193, i+1]
nb[[w.zero]] <- near.clust
nb[[near.clust]] <- as.integer(c(w.zero, nb[[near.clust]]))
}
listw <- nb2listw(nb, style = "B", zero.policy=TRUE)
### Do PCA on ACS and LV
pca.h <- principal(cbind(acs$AgeAdjPatDayRateEBadj,
lv$AgeAdjPatDayRateEBadj), nfactors=1, rotate="varimax",
scores=TRUE)
acs.lv.scores <- pca.h$scores
# Regress on ETHPC1
acs.lv.lm <- lm(acs.lv.scores ~ lm$ETHPC1)
write.csv(summary(acs.lv.lm)$coefficients, file=
paste("/home/delamate/MDCH/data/dissertation/regressions/output_tables/
pre.regression.stats/ACS.LowV/", names(cl)[i+1],".csv", sep=""))

######
###### SAR
######
lm.dat <- as.matrix(cbind(residuals(acs.lv.lm), lm))
colnames(lm.dat)[1] <- "ACSLowVPCresid"
lm.dat <- scale(lm.dat)
ut <- as.data.frame(scale(ut))
write.csv(cor(lm.dat), file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/scale.regression.stats.ACS.LowV.dep/
correlation/",names(cl)[i+1],".csv", sep=""))
if (sum(abs(cor(lm.dat)) > 0.5) > ncol(lm.dat)) write.table(
as.character(names(cl)[i+1]), file=paste("/home/delamate/MDCH/data/",
"dissertation/regressions/output_tables/",
"scale.regression.stats.ACS.LowV.dep/bad.correlation.regressions.txt",
sep=""), col.names = FALSE, row.names=FALSE, append=TRUE)
## Spatial regression
misar <- spautolm(ut$StRateDif ~ lm.dat, listw = listw)
## If there are non-significant terms in the model...
while (sum(summary(misar)$Coef[2:(ncol(lm.dat)+1),4] > sig) > 0) {
## Remove and remodel
bad.t <- as.numeric(which(summary(misar)$Coef[2:(ncol(lm.dat)+1),
4] == max(summary(misar)$Coef[2:(ncol(lm.dat)+1),4])))
if (ncol(lm.dat) == 2) name <- paste("lm.dat", colnames(lm.dat)[-bad.t],
sep="")
lm.dat <- lm.dat[,-bad.t]
misar <- spautolm(ut$StRateDif ~ lm.dat, listw = listw)
if (is.vector(lm.dat) == TRUE) break
}
lm.sum <- summary(misar)$Coef
if (is.vector(lm.dat) == TRUE) rownames(lm.sum)[2] <- name
write.csv(lm.sum, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/scale.regression.stats.ACS.LowV.dep/SAR/
betas/",names(cl)[i+1],".csv", sep=""))
write.csv(residuals(misar), file=paste("/home/delamate/MDCH/data/
dissertation/regressions/output_tables/
scale.regression.stats.ACS.LowV.dep/SAR/residuals/",names(cl)[i+1],
".csv", sep=""))
write.csv(misar$fit$signal_trend, file=paste("/home/delamate/MDCH/
data/dissertation/regressions/output_tables/
scale.regression.stats.ACS.LowV.dep/SAR/fitted/effect/",
names(cl)[i+1],".csv", sep=""))
write.csv(misar$fit$signal_stochastic, file=paste("/home/delamate/
MDCH/data/dissertation/regressions/output_tables/
scale.regression.stats.ACS.LowV.dep/SAR/fitted/spatial/",
names(cl)[i+1],".csv", sep=""))
mt <- moran.test(residuals(misar), listw, randomisation=FALSE)
stats <- c(misar$lambda, as.numeric(summary(misar)$LR1$p.value),
misar$LL, misar$LL0, misar$fit$s2, AIC(misar),
as.numeric(summary(misar, Nagel=TRUE)$NK), mt$statistic, mt$p.value)
names(stats) <- c("lambda", "lambda.p", "LL", "LL0", "s2", "AIC",
"NagelR2", "Moran", "Moran.p")
write.csv(stats, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/scale.regression.stats.ACS.LowV.dep/SAR/
sig/",names(cl)[i+1],".csv", sep=""))
## Test SAR for heteroskedasticity
write.csv(leveneTest(residuals(misar),factor(fits.groups))[1,],
file=paste("/home/delamate/MDCH/data/dissertation/regressions/
output_tables/scale.regression.stats.ACS.LowV.dep/levene/SAR/2/",
names(cl)[i+1],".csv", sep=""))
write.csv(leveneTest(residuals(misar),factor(q.fits.groups))[1,],
file=paste("/home/delamate/MDCH/data/dissertation/regressions/
output_tables/scale.regression.stats.ACS.LowV.dep/levene/SAR/5/",
names(cl)[i+1],".csv", sep=""))
resid.med <- median(residuals(misar))
r.fits.groups <- residuals(misar) <= resid.med
write.csv(leveneTest(residuals(misar),factor(r.fits.groups))[1,],
file=paste("/home/delamate/MDCH/data/dissertation/regressions/
output_tables/scale.regression.stats.ACS.LowV.dep/levene/SAR/
resid2/",names(cl)[i+1],".csv", sep=""))
r.qx <- quantile(residuals(misar), probs=seq(0,1,0.2))
rq.fits.groups <- cut(residuals(misar), r.qx, include.lowest = TRUE)
write.csv(leveneTest(residuals(misar),factor(rq.fits.groups))[1,],
file=paste("/home/delamate/MDCH/data/dissertation/regressions/
output_tables/scale.regression.stats.ACS.LowV.dep/levene/SAR/
resid5/",names(cl)[i+1],".csv", sep=""))
######
###### weighted SAR
######
lm.dat <- as.matrix(cbind(residuals(acs.lv.lm), lm))
colnames(lm.dat)[1] <- "ACSLowVPCresid"
lm.dat <- scale(lm.dat)
## Spatial regression
w.misar <- spautolm(ut$StRateDif ~ lm.dat, listw = listw, weights=i.pop)
## If there are non-significant terms in the model...
while (sum(summary(w.misar)$Coef[2:(ncol(lm.dat)+1),4] > sig) > 0) {
## Remove and remodel
bad.t <- as.numeric(which(summary(w.misar)$Coef[2:(ncol(lm.dat)+1),
4] == max(summary(w.misar)$Coef[2:(ncol(lm.dat)+1),4])))

if (ncol(lm.dat) == 2) name <- paste("lm.dat", colnames(lm.dat)[-bad.t],
sep="")
lm.dat <- lm.dat[,-bad.t]
w.misar <- spautolm(ut$StRateDif ~ lm.dat, listw = listw, weights=i.pop)
if (is.vector(lm.dat) == TRUE) break
}
lm.sum <- summary(w.misar)$Coef
if (is.vector(lm.dat) == TRUE) rownames(lm.sum)[2] <- name
write.csv(lm.sum, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/scale.regression.stats.ACS.LowV.dep/wSAR/
betas/",names(cl)[i+1],".csv", sep=""))
write.csv(residuals(w.misar), file=paste("/home/delamate/MDCH/data/
dissertation/regressions/output_tables/
scale.regression.stats.ACS.LowV.dep/wSAR/residuals/",names(cl)[i+1],
".csv", sep=""))
write.csv(w.misar$fit$signal_trend, file=paste("/home/delamate/MDCH/
data/dissertation/regressions/output_tables/
scale.regression.stats.ACS.LowV.dep/wSAR/fitted/effect/",names(cl)[i+1],
".csv", sep=""))
write.csv(w.misar$fit$signal_stochastic, file=paste("/home/delamate/
MDCH/data/dissertation/regressions/output_tables/
scale.regression.stats.ACS.LowV.dep/wSAR/fitted/spatial/",names(cl)[i+1],
".csv", sep=""))
mt <- moran.test(residuals(w.misar), listw, randomisation=FALSE)
stats <- c(w.misar$lambda, as.numeric(summary(w.misar)$LR1$p.value),
w.misar$LL, w.misar$LL0, w.misar$fit$s2, AIC(w.misar),
as.numeric(summary(w.misar, Nagel=TRUE)$NK), mt$statistic,
mt$p.value)
names(stats) <- c("lambda", "lambda.p", "LL", "LL0", "s2", "AIC",
"NagelR2", "Moran", "Moran.p")
write.csv(stats, file=paste("/home/delamate/MDCH/data/dissertation/
regressions/output_tables/scale.regression.stats.ACS.LowV.dep/wSAR/
sig/",names(cl)[i+1],".csv", sep=""))
## Test wSAR for heteroskedasticity
write.csv(leveneTest(residuals(w.misar),factor(fits.groups))[1,],
file=paste("/home/delamate/MDCH/data/dissertation/regressions/
output_tables/scale.regression.stats.ACS.LowV.dep/levene/wSAR/2/",
names(cl)[i+1],".csv", sep=""))
write.csv(leveneTest(residuals(w.misar),factor(q.fits.groups))[1,],
file=paste("/home/delamate/MDCH/data/dissertation/regressions/
output_tables/scale.regression.stats.ACS.LowV.dep/levene/wSAR/5/",
names(cl)[i+1],".csv", sep=""))
resid.med <- median(residuals(w.misar))
r.fits.groups <- residuals(w.misar) <= resid.med
write.csv(leveneTest(residuals(w.misar),factor(r.fits.groups))[1,],
file=paste("/home/delamate/MDCH/data/dissertation/regressions/
output_tables/scale.regression.stats.ACS.LowV.dep/levene/wSAR/
resid2/",names(cl)[i+1],".csv", sep=""))
r.qx <- quantile(residuals(w.misar), probs=seq(0,1,0.2))
rq.fits.groups <- cut(residuals(w.misar), r.qx, include.lowest = TRUE)
write.csv(leveneTest(residuals(w.misar),factor(rq.fits.groups))[1,],
file=paste("/home/delamate/MDCH/data/dissertation/regressions/
output_tables/scale.regression.stats.ACS.LowV.dep/levene/wSAR/resid5/",
names(cl)[i+1],".csv", sep=""))

######
###### CAR
######
lm.dat <- as.matrix(cbind(residuals(acs.lv.lm), lm))
colnames(lm.dat)[1] <- "ACSLowVPCresid"
lm.dat <- scale(lm.dat)
## Spatial regression
micar <- spautolm(ut$StRateDif ~ lm.dat, listw = listw, family="CAR")
## If there are non-significant terms in the model...
while (sum(summary(micar)$Coef[2:(ncol(lm.dat)+1),4] > sig) > 0) {
## Remove and remodel
bad.t <- as.numeric(which(summary(micar)$Coef[2:(ncol(lm.dat)+1),
4] == max(summary(micar)$Coef[2:(ncol(lm.dat)+1),4])))
if (ncol(lm.dat) == 2) name <- paste("lm.dat", colnames(lm.dat)[-bad.t],
sep="")

lm.dat <- lm.dat[,-bad.t]
micar <- spautolm(ut$StRateDif ~ lm.dat, listw = listw, family="CAR")
if (is.vector(lm.dat) == TRUE) break
}
lm.sum <- summary(micar)$Coef
if (is.vector(lm.dat) == TRUE) rownames(lm.sum)[2] <- name
write.csv(lm.sum, file=paste("/home/delamate/MDCH/data/dissertation/",
"regressions/output_tables/scale.regression.stats.ACS.LowV.dep/CAR/",
"betas/",names(cl)[i+1],".csv", sep=""))
write.csv(residuals(micar), file=paste("/home/delamate/MDCH/data/",
"dissertation/regressions/output_tables/",
"scale.regression.stats.ACS.LowV.dep/CAR/residuals/",
names(cl)[i+1],".csv", sep=""))
write.csv(micar$fit$signal_trend, file=paste("/home/delamate/MDCH/",
"data/dissertation/regressions/output_tables/",
"scale.regression.stats.ACS.LowV.dep/CAR/fitted/effect/",names(cl)[i+1],
".csv", sep=""))
write.csv(micar$fit$signal_stochastic, file=paste("/home/delamate/MDCH/",
"data/dissertation/regressions/output_tables/",
"scale.regression.stats.ACS.LowV.dep/CAR/fitted/spatial/",
names(cl)[i+1],".csv", sep=""))
mt <- moran.test(residuals(micar), listw, randomisation=FALSE)
stats <- c(micar$lambda, as.numeric(summary(micar)$LR1$p.value),
micar$LL, micar$LL0, micar$fit$s2, AIC(micar),
as.numeric(summary(micar, Nagel=TRUE)$NK), mt$statistic, mt$p.value)
names(stats) <- c("lambda", "lambda.p", "LL", "LL0", "s2", "AIC",
"NagelR2", "Moran", "Moran.p")
write.csv(stats, file=paste("/home/delamate/MDCH/data/dissertation/",
"regressions/output_tables/scale.regression.stats.ACS.LowV.dep/CAR/",
"sig/",names(cl)[i+1],".csv", sep=""))
## Test CAR for heteroskedasticity
write.csv(leveneTest(residuals(micar),factor(fits.groups))[1,],
file=paste("/home/delamate/MDCH/data/dissertation/regressions/",
"output_tables/scale.regression.stats.ACS.LowV.dep/levene/CAR/2/",
names(cl)[i+1],".csv", sep=""))
write.csv(leveneTest(residuals(micar),factor(q.fits.groups))[1,],
file=paste("/home/delamate/MDCH/data/dissertation/regressions/",
"output_tables/scale.regression.stats.ACS.LowV.dep/levene/CAR/5/",
names(cl)[i+1],".csv", sep=""))
resid.med <- median(residuals(micar))
r.fits.groups <- residuals(micar) <= resid.med
write.csv(leveneTest(residuals(micar),factor(r.fits.groups))[1,],
file=paste("/home/delamate/MDCH/data/dissertation/regressions/",
"output_tables/scale.regression.stats.ACS.LowV.dep/levene/CAR/resid2/",
names(cl)[i+1],".csv", sep=""))
r.qx <- quantile(residuals(micar), probs=seq(0,1,0.2))
rq.fits.groups <- cut(residuals(micar), r.qx, include.lowest = TRUE)
write.csv(leveneTest(residuals(micar),factor(rq.fits.groups))[1,],
file=paste("/home/delamate/MDCH/data/dissertation/regressions/",
"output_tables/scale.regression.stats.ACS.LowV.dep/levene/CAR/resid5/",
names(cl)[i+1],".csv", sep=""))

######
###### weighted CAR
######
lm.dat <- as.matrix(cbind(residuals(acs.lv.lm), lm))
colnames(lm.dat)[1] <- "ACSLowVPCresid"
lm.dat <- scale(lm.dat)
## Spatial regression
w.micar <- spautolm(ut$StRateDif ~ lm.dat, listw = listw, family="CAR",
weights=i.pop)
## If there are non-significant terms in the model...
while (sum(summary(w.micar)$Coef[2:(ncol(lm.dat)+1),4] > sig) > 0) {
## Remove and remodel
bad.t <- as.numeric(which(summary(w.micar)$Coef[2:(ncol(lm.dat)+1),
4] == max(summary(w.micar)$Coef[2:(ncol(lm.dat)+1),4])))
if (ncol(lm.dat) == 2) name <- paste("lm.dat", colnames(lm.dat)[-bad.t],
sep="")
lm.dat <- lm.dat[,-bad.t]
w.micar <- spautolm(ut$StRateDif ~ lm.dat, listw = listw, family="CAR",
weights=i.pop)
if (is.vector(lm.dat) == TRUE) break
}
lm.sum <- summary(w.micar)$Coef
if (is.vector(lm.dat) == TRUE) rownames(lm.sum)[2] <- name
write.csv(lm.sum, file=paste("/home/delamate/MDCH/data/dissertation/",
"regressions/output_tables/scale.regression.stats.ACS.LowV.dep/wCAR/",
"betas/",names(cl)[i+1],".csv", sep=""))
write.csv(residuals(w.micar), file=paste("/home/delamate/MDCH/data/",
"dissertation/regressions/output_tables/",
"scale.regression.stats.ACS.LowV.dep/wCAR/residuals/",names(cl)[i+1],
".csv", sep=""))
write.csv(w.micar$fit$signal_trend, file=paste("/home/delamate/MDCH/",
"data/dissertation/regressions/output_tables/",
"scale.regression.stats.ACS.LowV.dep/wCAR/fitted/effect/",
names(cl)[i+1],".csv", sep=""))
write.csv(w.micar$fit$signal_stochastic, file=paste("/home/delamate/",
"MDCH/data/dissertation/regressions/output_tables/",
"scale.regression.stats.ACS.LowV.dep/wCAR/fitted/spatial/",
names(cl)[i+1],".csv", sep=""))
mt <- moran.test(residuals(w.micar), listw, randomisation=FALSE)
stats <- c(w.micar$lambda, as.numeric(summary(w.micar)$LR1$p.value),
w.micar$LL, w.micar$LL0, w.micar$fit$s2, AIC(w.micar),
as.numeric(summary(w.micar, Nagel=TRUE)$NK), mt$statistic, mt$p.value)
names(stats) <- c("lambda", "lambda.p", "LL", "LL0", "s2", "AIC",
"NagelR2", "Moran", "Moran.p")
write.csv(stats, file=paste("/home/delamate/MDCH/data/dissertation/",
"regressions/output_tables/scale.regression.stats.ACS.LowV.dep/wCAR/sig/",
names(cl)[i+1],".csv", sep=""))
## Test wCAR for heteroskedasticity
write.csv(leveneTest(residuals(w.micar),factor(fits.groups))[1,],
file=paste("/home/delamate/MDCH/data/dissertation/regressions/",
"output_tables/scale.regression.stats.ACS.LowV.dep/levene/wCAR/2/",
names(cl)[i+1],".csv", sep=""))
write.csv(leveneTest(residuals(w.micar),factor(q.fits.groups))[1,],
file=paste("/home/delamate/MDCH/data/dissertation/regressions/",
"output_tables/scale.regression.stats.ACS.LowV.dep/levene/wCAR/5/",
names(cl)[i+1],".csv", sep=""))
resid.med <- median(residuals(w.micar))
r.fits.groups <- residuals(w.micar) <= resid.med
write.csv(leveneTest(residuals(w.micar),factor(r.fits.groups))[1,],
file=paste("/home/delamate/MDCH/data/dissertation/regressions/",
"output_tables/scale.regression.stats.ACS.LowV.dep/levene/wCAR/resid2/",
names(cl)[i+1],".csv", sep=""))
r.qx <- quantile(residuals(w.micar), probs=seq(0,1,0.2))
rq.fits.groups <- cut(residuals(w.micar), r.qx, include.lowest = TRUE)
write.csv(leveneTest(residuals(w.micar),factor(rq.fits.groups))[1,],
file=paste("/home/delamate/MDCH/data/dissertation/regressions/",
"output_tables/scale.regression.stats.ACS.LowV.dep/levene/wCAR/resid5/",
names(cl)[i+1],".csv", sep=""))

}
print(Sys.time() - start)
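
The term-removal loops in the listing above repeatedly drop the least significant predictor and refit the model until every remaining term is significant or only one predictor is left. A minimal sketch of that pattern follows. It assumes ordinary least squares (lm) as a stand-in for spautolm so that it runs in base R; the function name backward_eliminate and the simulated data are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: backward elimination by p-value.
# lm() is used here only as a stand-in for spautolm().
backward_eliminate <- function(y, X, sig = 0.05) {
  X <- as.matrix(X)
  repeat {
    fit <- lm(y ~ X)
    # p-values of the predictors (row 1 is the intercept)
    p <- summary(fit)$coefficients[-1, 4]
    # Stop when all terms are significant or one predictor remains
    if (all(p <= sig) || ncol(X) == 1) break
    # drop = FALSE keeps X a matrix (and its column names)
    # even when a single column is left
    X <- X[, -which.max(p), drop = FALSE]
  }
  list(fit = fit, kept = colnames(X))
}

# Simulated example: only x1 is related to y
set.seed(1)
n <- 200
X <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * X[, "x1"] + rnorm(n)
res <- backward_eliminate(y, X, sig = 0.05)
print(res$kept)
```

Subsetting with drop = FALSE avoids the collapse of the predictor matrix to a vector, which is the case the listing handles with is.vector() and a saved coefficient name.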

REFERENCES

Aday, L.A., Andersen, R., 1974. A framework for the study of access to medical care. Health
Services Research 9, 208–220.
Alexander, J.A., Lee, S.D., Griﬃth, J.R., Mick, S.S., Lin, X., Banaszak-Holl, J., 1999. Do
Market-Level hospital and physician resources aﬀect small area variation in hospital use?
Medical Care Research and Review 56, 94–117.
Andersen, R., Newman, J.F., 1973. Societal and individual determinants of medical care
utilization in the united states. The Milbank Memorial Fund Quarterly: Health and
Society 51, 95–124.
Angell, M., 2008. Privatizing health care is not the answer: lessons from the united states.
Canadian Medical Association Journal 179, 916–919.
Anselin, L., 1988. Spatial Econometrics: Methods and Models. Kluwer Academic Publishers,
Dordrecht.
Anselin, L., 2003. Spatial externalities, spatial multipliers, and spatial econometrics. International Regional Science Review 26, 153–166.
Apparicio, P., Abdelmajid, M., Riva, M., Shearmur, R., 2008. Comparing alternative approaches to measuring the geographical accessibility of urban health services: Distance
types and aggregation-error issues. International Journal of Health Geographics 7, 1–14.
Arcury, T.A., Gesler, W.M., Preisser, J.S., Sherman, J., Spencer, J., Perin, J., 2005. The
eﬀects of geography and spatial behavior on health care utilization among the residents of
a rural region. Health Services Research 40, 135–156.
Banks, D.A., Foreman, S.E., Keeler, T.E., 1999. Cross-subsidization in hospital care: some
lessons from the law and economics of regulation. Health Matrix 9, 1–35.
Bay, K.S., Nestman, L.J., 1984. The use of bed distribution and service population indexes
for hospital bed allocation. Health Services Research 19, 141–160.
van Bemmelen, J., Quak, W., van Hekken, M., van Oosterom, P., 1993. Vector vs. raster-based
algorithms for cross country movement planning, in: Proceedings Auto-Carto, pp. 304–317.
Berke, E., Shi, X., 2009. Computing travel time when the exact address is unknown: a
comparison of point and polygon ZIP code approximation methods. International Journal
of Health Geographics 8, 1–9.

Berry Jr., R.E., 1973. On grouping hospitals for economic analysis. Inquiry 10, 5–12.
Bindman, A.B., Grumbach, K., Osmond, D., Komaromy, M., Vranizan, K., Lurie, N.,
Billings, J., Stewart, A., 1995. Preventable hospitalizations and access to health care.
JAMA: The Journal of the American Medical Association 274, 305–311.
Birkmeyer, J.D., Siewers, A.E., Marth, N.J., Goodman, D.C., 2003. Regionalization of
high-risk surgery and implications for patient travel times. The Journal of the American
Medical Association 290, 2703–2708.
Bivand, R.S., Pebesma, E.J., Gómez-Rubio, V., 2008. Applied Spatial Data Analysis with
R. Use R!, Springer, New York, NY.
Bosanac, E.M., Parkinson, R.C., Hall, D.S., 1976. Geographic access to hospital care: A
30-Minute travel time standard. Medical Care 14, 616–624.
Brunsdon, C., Fotheringham, A.S., Charlton, M.E., 1996. Geographically weighted regression: A method for exploring spatial nonstationarity. Geographical Analysis 28, 281–298.
BuiltWith Trends, 2012. Top in mapping. http://trends.builtwith.com/mapping/top, last
accessed: 2012.
Carr, B.G., Branas, C.C., Metlay, J.P., Sullivan, A.F., Camargo Jr., C.A., 2009. Access to
emergency care in the united states. Annals of Emergency Medicine 54, 261–269.
Chi, G., Zhu, J., 2008. Spatial regression models for demographic analysis. Population
Research and Policy Review 27, 17–42.
Clark, J.D., 1990. Variation in michigan hospital use rates: Do physician and hospital
characteristics provide the explanation? Social Science & Medicine 30, 67–82.
Conover, C.J., Sloan, F.A., 1998. Does removing Certiﬁcate-of-Need regulations lead to a
surge in health care spending? Journal of Health Politics, Policy and Law 23, 455–481.
Conover, C.J., Sloan, F.A., 2003. Evaluation of Certiﬁcate of Need in Michigan. Technical
Report. Center for Health Policy, Law and Management, Duke University.
Couclelis, H., 1992. People manipulate objects (but cultivate ﬁelds): Beyond the raster-vector
debate in gis, in: Frank, A.U., Campari, I., Formentini, U. (Eds.), Theories and
Methods of Spatio-Temporal Reasoning in Geographic Space. Springer Berlin / Heidelberg.
volume 639 of Lecture Notes in Computer Science, pp. 65–77.
Cromley, E.K., McLaﬀerty, S., 2002. GIS and Public Health. Guilford Press, New York.
Cunningham, P.J., 2010. The growing ﬁnancial burden of health care: National and state
trends, 2001–2006. Health Aﬀairs 29, 1037–1044.
Current, J.R., Schilling, D.A., 1990. Analysis of errors due to demand data aggregation
in the set covering and maximal covering location problems. Geographical Analysis 22,
116–126.
Dai, D., 2010. Black residential segregation, disparities in spatial access to health care
facilities, and late-stage breast cancer diagnosis in metropolitan detroit. Health & Place
16, 1038–1052.
Darden, J., Rahbar, M., Jezierski, L., Li, M., Velie, E., 2010. The measurement of neighborhood socioeconomic characteristics and black and white residential segregation in
metropolitan detroit: Implications for the study of social disparities in health. Annals
of the Association of American Geographers 100, 137–158.
Darden, J.T., Stokes, C., Thomas, R.W., 2007. The state of Black Michigan, 1967-2007.
Michigan State University Press, East Lansing.
Delamater, P.L., Messina, J.P., Shortridge, A.M., Grady, S.C., 2012. Measuring geographic
access to health care: raster and network-based methods. International Journal of Health
Geographics (In press).
Delamater, P.L., Shortridge, A.M., Messina, J.P., Under review. Regional health care planning: a methodology to cluster facilities using community utilization patterns. BMC
Health Services Research.
Diez Roux, A.V., Merkin, S.S., Arnett, D., Chambless, L., Massing, M., Nieto, F.J., Sorlie,
P., Szklo, M., Tyroler, H.A., Watson, R.L., 2001. Neighborhood of residence and incidence
of coronary heart disease. New England Journal of Medicine 345, 99–106.
Donabedian, A., 1972. Models for organizing the delivery of personal health services and
criteria for evaluating them. The Milbank Memorial Fund Quarterly 50, 103–154.
van Doorslaer, E., van Vliet, R., 1989. “A built bed is a ﬁlled bed?” an empirical reexamination. Social Science & Medicine 28, 155–164.
Drineas, P., Frieze, A., Kannan, R., Vempala, S., Vinay, V., 2004. Clustering large graphs
via the singular value decomposition. Machine Learning 56, 9–33.
Dubbs, N.L., Bazzoli, G.J., Shortell, S.M., Kralovec, P.D., 2004. Reexamining organizational
conﬁgurations: An update, validation, and expansion of the taxonomy of health networks
and systems. Health Services Research 39, 207–220.
Dubes, R., Jain, A.K., 1976. Clustering techniques: The user’s dilemma. Pattern Recognition
8, 247–260.
Elayat, H., Murphy, B., Prabhakar, N., 1978. Entropy in the hierarchical cluster analysis of
hospitals. Health Services Research 13, 395–403.
ESRI, 2010. Algorithms used by network analyst.
http://webhelp.esri.com/arcgisdesktop/9.3/index.cfm?TopicName=Algorithms used by Network Analy
Feldstein, P.J., 1966. Research on the demand for health services. The Milbank Memorial
Fund Quarterly 44, 128–165.
Ferrier, G., Leleu, H., Valdmanis, V., 2010. The impact of CON regulation on hospital
eﬃciency. Health Care Management Science 13, 84–100.
Finn, M., 2007. Health care demand in Michigan: An examination of the Michigan Certiﬁcate
of Need acute care bed need methodology. Ph.D. thesis. Michigan State University.
Fisher, E.S., Wennberg, D.E., Stukel, T.A., Gottlieb, D.J., Lucas, F.L., Pinder, E.L., 2003.
The implications of regional variations in medicare spending. part 1: The content, quality,
and accessibility of care. Annals of Internal Medicine 138, 273–287.
Fisher, E.S., Wennberg, J.E., Stukel, T.A., Sharp, S.M., 1994. Hospital readmission rates
for cohorts of medicare beneﬁciaries in boston and new haven. New England Journal of
Medicine 331, 989–995.
Folland, S., Stano, M., 1990. Small area variations: A critical review of propositions, methods, and evidence. Medical Care Research and Review 47, 419–465.
Fotheringham, A.S., Wong, D.W.S., 1991. The modiﬁable areal unit problem in multivariate
statistical analysis. Environment and Planning A 23, 1025–1044.
Freeman, J.D., Kadiyala, S., Bell, J.F., Martin, D.P., 2008. The causal eﬀect of health insurance on utilization and outcomes in adults: A systematic review of US studies. Medical
Care 46, 1023–1032.
Frizzelle, B., Evenson, K., Rodriguez, D., Laraia, B., 2009. The importance of accurate road
data for spatial applications in public health: customizing a road network. International
Journal of Health Geographics 8, 1–24.
Garnick, D., Luft, H., Robinson, J., Tetreault, J., 1987. Appropriate measures of hospital
market areas. Health Services Research 22, 69–89.
Gilmour, S.J., 2010. Identiﬁcation of hospital catchment areas using clustering: an example
from the NHS. Health Services Research 45, 497–513.
Ginsburg, P.B., Koretz, D.M., 1983. Bed availability and hospital utilization: estimates of
the “Roemer eﬀect”. Health Care Financing Review 5, 87–92.
Goodchild, M.F., Yuan, M., Cova, T.J., 2007. Towards a general theory of geographic
representation in GIS. International Journal of Geographical Information Science 21, 239–260.
Goodman, D.C., Fisher, E., Stukel, T.A., Chang, C., 1997. The distance to community
medical care and the likelihood of hospitalization: is closer always better? American
Journal of Public Health 87, 1144–1150.
Goodman, D.C., Mick, S.S., Bott, D., Stukel, T., Chang, C.h., Marth, N., Poage, J., Carretta, H.J., 2003. Primary care service areas: A new tool for the evaluation of primary
care services. Health Services Research 38, 287–309.
Grady, S.C., 2006. Racial disparities in low birthweight and the contribution of residential
segregation: A multilevel analysis. Social Science & Medicine 63, 3013–3029.
Grady, S.C., 2010. Racial residential segregation impacts on low birth weight using improved
neighborhood boundary deﬁnitions. Spatial and Spatio-temporal Epidemiology 1, 239–249.
Grady, S.C., Ramírez, I.J., 2008. Mediating medical risk factors in the residential segregation
and low birthweight relationship by race in new york city. Health & Place 14, 661–677.
Graham, M.H., 2003. Confronting multicollinearity in ecological multiple regression. Ecology
84, 2809–2815.
Granderson, G., 2011. The impacts of hospital alliance membership, alliance size, and repealing certiﬁcate of need regulation, on the cost eﬃciency of non-proﬁt hospitals. Managerial
and Decision Economics 32, 159–173.
Green, L.A., Fryer, G.E., Yawn, B.P., Lanier, D., Dovey, S.M., 2001. The ecology of medical
care revisited. New England Journal of Medicine 344, 2021–2025.
Greene, J., 2012. McLaren’s plan b if CON is rejected: Legislators.
http://www.crainsdetroit.com/article/20120212/HEALTH/302129950/mclaren-s-plan-bif-con-is-rejected-legislators.
Griﬃth, J.R., 1972. Quantitative Techniques for Hospital Planning and Control. Lexington
Books, Lexington, MA.
Griﬃth, J.R., Restuccia, J.D., Tedeschi, P.J., Wilson, P.A., Zuckerman, H.S., 1981. Measuring community hospital service in michigan. Health Services Research 16, 135–160.
Guagliardo, M., 2004. Spatial accessibility of primary care: concepts, methods and challenges. International Journal of Health Geographics 3, 1–13.
Gujarati, D.N., 1988. Basic Econometrics. McGraw-Hill.
Harris, D.M., 1975. An elaboration of the relationship between general hospital bed supply
and general hospital utilization. Journal of Health and Social Behavior 16, 163–172.
Hartigan, J.A., Wong, M.A., 1979. Algorithm AS 136: A K-Means clustering algorithm.
Journal of the Royal Statistical Society. Series C (Applied Statistics) 28, 100–108.
Haynes, R., Jones, A., Sauerzapf, V., Zhao, H., 2006. Validation of travel times to hospital
estimated by gis. International Journal of Health Geographics 5, 1–8.
Hellinger, F., 2009. The eﬀect of certiﬁcate-of-need laws on hospital beds and healthcare
expenditures: an empirical analysis. American Journal of Managed Care 15, 737–744.
Hewko, J., Smoyer-Tomic, K.E., Hodgson, M.J., 2002. Measuring neighbourhood spatial accessibility to urban amenities: does aggregation error matter? Environment and Planning
A 34, 1185–1206.
Higgs, G., 2004. A literature review of the use of gis-based measures of access to health care
services. Health Services and Outcomes Research Methodology 5, 119–139.
Higgs, G., 2009. The role of GIS for health utilization studies: literature review. Health
Services and Outcomes Research Methodology 9, 84–99.
Hofer, A.N., Abraham, J.M., Moscovice, I., 2011. Expansion of coverage under the patient
protection and aﬀordable care act and primary care utilization. Milbank Quarterly 89,
69–89.
Hopkins, C., 2012. McLaren’s request to transfer hospital beds from pontiac causes divide.
The Oakland Press.
Hunter, J.M., Shannon, G.W., Sambrook, S.L., 1986. Rings of madness: Service areas of
19th century asylums in north america. Social Science & Medicine 23, 1033–1050.
Illinois General Assembly, 2012. Illinois health care facilities plan: Narrative and planning
policies: Introduction formula components, planning area development policies, and normal
travel time determinations.
http://www.ilga.gov/commission/jcar/admincode/077/077011000D05100R.html, last accessed: 2012.
Jain, A.K., 2010. Data clustering: 50 years beyond k-means. Pattern Recognition Letters
31, 651–666.
Jolliﬀe, I.T., 2002. Principal Component Analysis. Springer Series in Statistics, Springer.
2nd edition.
Jones, S.G., Ashby, A.J., Momin, S.R., Naidoo, A., 2010. Spatial implications associated
with using euclidean distance measurements and geographic centroid imputation in health
care research. Health Services Research 45, 316–327.
Joseph, A.E., Phillips, D.R., 1984. Accessibility and utilization: geographical perspectives
on health care delivery. Harper & Row Ltd, London.
Kaiser Family Foundation, 2009. Health Care Costs: A Primer. Technical Report 7670-02.
Kaufman, L., Rousseeuw, P.J., 2005. Finding Groups in Data: An Introduction to Cluster
Analysis. Wiley-Interscience.
Khan, A.A., 1992. An integrated approach to measuring potential spatial access to health
care services. Socio-Economic Planning Sciences 26, 275–287.
Klastorin, T., Watts, C., 1981. The determination of alternative hospital classiﬁcations.
Health Services Research 16, 205–220.
Klastorin, T.D., Watts, C.A., 1982. A current reappraisal of berry’s hospital typology.
Medical Care 20, 441–449.
Klauss, G., Staub, L., Widmer, M., Busato, A., 2005. Hospital service areas - a new tool for
health care planning in switzerland. BMC Health Services Research 5, 1–15.
Klein, R.J., Proctor, S.E., Boudreault, M.A., Turczyn, K.M., 2002. Healthy People 2010
Criteria for Data Suppression. Technical Report 24. National Center for Health Statistics.
Hyattsville, Maryland.
Kravet, S.J., Shore, A.D., Miller, R., Green, G.B., Kolodner, K., Wright, S.M., 2008. Health
care utilization and the proportion of primary care physicians. The American Journal of
Medicine 121, 142–148.
Kroneman, M., Siegers, J.J., 2004. The eﬀect of hospital bed reduction on the use of beds:
A comparative study of 10 european countries. Social Science & Medicine 59, 1731–1740.
Kuttner, R., 2008. Market-Based failure — a second opinion on U.S. health care costs. New
England Journal of Medicine 358, 549–551.
Kwan, M.P., Hong, X.D., 1998. Network-based constraints-oriented choice set formation
using gis. Geographical Systems, 139–162.
Langley, S., Fuller, S., Messina, J., Shortridge, A., Grady, S., 2010. A methodology for
projecting hospital bed need: a michigan case study. Source Code for Biology and Medicine
5, 1–10.
Longley, P.A., Goodchild, M., Maguire, D.J., Rhind, D.W., 2010. Geographic Information
Systems and Science. Wiley. 3rd edition.
Lopez-Quilez, A., Munoz, F., 2009. Geostatistical computing of acoustic maps in the presence
of barriers. Mathematical and Computer Modelling 50, 929–938.
Luginaah, I., Jerrett, M., Elliott, S., Eyles, J., Parizeau, K., Birch, S., Abernathy, T.,
Veenstra, G., Hutchinson, B., Giovis, C., 2001. Health proﬁles of hamilton: Spatial
characterisation of neighbourhoods for health investigations. GeoJournal 53, 135–147.

Luke, R.D., 2006. Taxonomy of health networks and systems: A reassessment. Health
Services Research 41, 618–628.
Luo, W., Qi, Y., 2009. An enhanced two-step ﬂoating catchment area (E2SFCA) method for
measuring spatial accessibility to primary care physicians. Health & Place 15, 1100–1107.
Luo, W., Wang, F., 2003. Measures of spatial accessibility to health care in a GIS environment: synthesis and a case study in the chicago region. Environment and Planning B:
Planning and Design 30, 865–884.
Marshall, R.J., 1991. Mapping disease and mortality rates using empirical bayes estimators.
Journal of the Royal Statistical Society. Series C (Applied Statistics) 40, 283–294.
Martin, D., Wrigley, H., Barnett, S., Roderick, P., 2002. Increasing the sophistication of
access measurement in a rural healthcare study. Health & Place 8, 3–13.
McGinley, P.J., 1995. Beyond health care reform: Reconsidering certiﬁcate of need laws in
a “Managed competition” system. Florida State University Law Review 23, 141–188.
McGrail, M.R., Humphreys, J.S., 2009. Measuring spatial accessibility to primary care in
rural areas: Improving the eﬀectiveness of the two-step ﬂoating catchment area method.
Applied Geography 29, 533–541.
McGuirk, M.A., Porell, F.W., 1984. Spatial patterns of hospital utilization: the impact of
distance and time. Inquiry 21, 84–95.
McLaﬀerty, S.L., 2003. Gis and health care. Annual Reviews in Public Health 24, 25–42.
Meade, M., Emch, M., 2010. Medical Geography. Guilford Publications, New York. third
edition.
Messina, J.P., Shortridge, A.M., Groop, R.E., Varnakovida, P., Finn, M.J., 2006. Evaluating
michigan’s community hospital access: spatial methods for decision support. International
Journal of Health Geographics 5, 1–18.
Michigan Department of Community Health, 2009. Certiﬁcate of need review standards for
hospital beds.
Michigan Oﬃce of Highway Safety Planning, . Establishing Realistic Speed Limits. Technical
Report OHSP 894. Lansing, MI.
Milligan, G., 1980. An examination of the eﬀect of six types of error perturbation on ﬁfteen
clustering algorithms. Psychometrika 45, 325–342.
Milligan, G.W., Cooper, M.C., 1987. Methodology review: Clustering methods. Applied
Psychological Measurement 11, 329–354.
Milligan, G.W., Cooper, M.C., 1988. A study of standardization of variables in cluster
analysis. Journal of Classiﬁcation 5, 181–204.
Mobley, L.R., Kuo, T.M., Andrews, L., 2008. How sensitive are multilevel regression ﬁndings
to deﬁned area of context? Medical Care Research and Review 65, 315–337.
Mulley, A.G., 2009. Inconvenient truths about supplier induced demand and unwarranted
variation in medical practice. BMJ 339, 1007–1009.
Murray, C.J.L., Frenk, J., 2010. Ranking 37th — measuring the performance of the U.S.
health care system. New England Journal of Medicine 362, 98–99.
Nallamothu, B.K., Bates, E.R., Wang, Y., Bradley, E.H., Krumholz, H.M., 2006. Driving
times and distances to hospitals with percutaneous coronary intervention in the united
states: Implications for prehospital triage of patients with st-elevation myocardial infarction. Circulation 113, 1189–1195.
National Conference of State Legislatures, 2011. CON-Certiﬁcate of need state laws.
http://www.ncsl.org/issues-research/health/con-certiﬁcate-of-need-state-laws.aspx.
New York State Department of Health, 2012. Title 10 NYCRR: part 709 - determination of
public need for medical facility construction.
http://www.health.state.ny.us/nysdoh/rules/709.htm, last accessed: 2012.
Ngui, A., Apparicio, P., 2011. Optimizing the two-step ﬂoating catchment area method
for measuring spatial accessibility to medical clinics in montreal. BMC Health Services
Research 11, 1–12.
Norris, J.C., van der Laan, M.J., Lane, S., Anderson, J.N., Block, G., 2003. Nonlinearity
in demographic and behavioral determinants of morbidity. Health Services Research 38,
1791–1818.
North Carolina Department of Health and Human Services, 2012. North carolina 2012
state medical facilities plan. http://www.ncdhhs.gov/dhsr/ncsmfp/2012/2012smfp.pdf,
last accessed: 2012.
Oakes, J.M., 2004. The (mis)estimation of neighborhood eﬀects: causal inference for a
practicable social epidemiology. Social Science & Medicine 58, 1929–1952.
Odoi, A., Martin, S.W., Michel, P., Holt, J., Middleton, D., Wilson, J., 2003. Geographical
and temporal distribution of human giardiasis in ontario, canada. International Journal
of Health Geographics 2, 1–13.
Oleske, D.M., 2009. An epidemiologic framework for the delivery of health care services, in:
Epidemiology and the Delivery of Health Care Services. Springer US, pp. 3–30.
Onega, T., Duell, E.J., Shi, X., Wang, D., Demidenko, E., Goodman, D., 2008. Geographic
access to cancer care in the U.S. Cancer 112, 909–918.
Openshaw, S., 1984. The modiﬁable areal unit problem. GeoBooks, Norwich, UK.
Paez, A., Mercado, R., Farber, S., Morency, C., Roorda, M., 2010. Accessibility to health
care facilities in montreal island: an application of relative accessibility indicators from the
perspective of senior and non-senior residents. International Journal of Health Geographics
9, 1–15.
Pasley, B.H., Lagoe, R.J., Marshall, N.O., 1995. Excess acute care bed capacity and its
causes: the experience of new york state. Health Services Research 30, 115–131.
Paul-Shaheen, P., Carpenter, E.S., 1982. Legislating hospital bed reduction: The michigan
experience. Journal of Health Politics, Policy and Law 6, 653–675.
Pedigo, A.S., Odoi, A., 2010. Investigation of disparities in geographic accessibility to emergency stroke and myocardial infarction care in east tennessee using geographic information
systems and network analysis. Annals of Epidemiology 20, 924–930.
Penchansky, R., Thomas, J.W., 1981. The concept of access: Deﬁnition and relationship to
consumer satisfaction. Medical Care 19, 127–140.
Phibbs, C.S., Luft, H.S., 1995. Correlation of travel time on roads versus straight line
distance. Medical Care Research and Review 52, 532–542.
Price, M., 2008. Slopes, sharp turns, and speed. ArcUser, 50–55.
Price, M., 2009. Convincing the chief. ArcUser, 50–54.
R Development Core Team, 2011. R: A Language and Environment for Statistical Computing. Technical Report. Vienna, Austria.
Radke, J., Mu, L., 2000. Spatial decompositions, modeling and mapping service regions to
predict access to social programs. Annals of GIS 6, 105–112.
Ray, N., Ebener, S., 2008. Accessmod 3.0: computing geographic coverage and accessibility
to health care services using anisotropic movement of patients. International Journal of
Health Geographics 7, 1–17.
Rey, S.J., Anselin, L., Folch, D.C., Arribas-Bel, D., Sastré Gutiérrez, M.L., Interlante, L.,
2011. Measuring spatial dynamics in metropolitan areas. Economic Development Quarterly 25, 54–64.
Ricketts, T.C., Randolph, R., Howard, H.A., Pathman, D., Carey, T., 2001. Hospitalization
rates as indicators of access to primary care. Health & Place 7, 27–38.
Rivers, P.A., Fottler, M.D., Younis, M.Z., 2007. Does certiﬁcate of need really contain
hospital costs in the united states? Health Education Journal 66, 229–244.
Robeznieks, A., 2008. Site under construction. Modern Healthcare 38, 6–7, 16.
Roemer, M.I., 1961. Bed supply and hospital utilization: a natural experiment. Hospitals
35, 36–42.
Rogerson, P.A., 2006. Statistical Methods for Geography: A Student’s Guide. SAGE Publications. 2nd edition.
Rohrer, J., 1990. Supply-induced demand for hospital care. Health Services Management
Research 3, 41–48.
Romano, M., 2003. Pros and cons of certiﬁcates. Modern Healthcare 33, 4.
Rosenthal, M.B., Zaslavsky, A., Newhouse, J.P., 2005. The geographic distribution of physicians revisited. Health Services Research 40, 1931–1952.
Sander, H.A., Ghosh, D., van Riper, D., Manson, S.M., 2010. How do you measure distance
in spatial models? an example using open-space valuation. Environment and Planning B:
Planning and Design 37, 874–894.
Schoen, C., Doty, M.M., Robertson, R.H., Collins, S.R., 2011. Aﬀordable care act reforms
could reduce the number of underinsured US adults by 70 percent. Health Aﬀairs 30,
1762–1771.
Schuurman, N., Berube, M., Crooks, V.A., 2010. Measuring potential spatial access to
primary health care physicians using a modiﬁed gravity model. Canadian Geographer 54,
29–45.
Schuurman, N., Fiedler, R., Grzybowski, S., Grund, D., 2006. Deﬁning rational hospital
catchments for non-urban areas based on travel-time. International Journal of Health
Geographics 5, 1–8.
Shahid, R., Bertazzon, S., Knudtson, M., Ghali, W., 2009. Comparison of distance measures
in spatial analytical modeling for health service planning. BMC Health Services Research
9, 1–14.
Shain, M., Roemer, M.I., 1959. Hospital costs relate to the supply of beds. Modern Hospital
92, 71–73.
Shannon, G.W., Skinner, J.L., Bashshur, R.L., 1973. Time and distance: the journey for
medical care. International Journal of Health Services 3, 237–244.
Shortt, N.K., Moore, A., Coombes, M., Wymer, C., 2005. Deﬁning regions for locality health
care planning: a multidimensional approach. Social Science & Medicine 60, 2715–2727.
Shwartz, M., Payne, S.M., Restuccia, J.D., Ash, A.S., 2001. Does it matter how small
geographic areas are constructed? ward’s algorithm versus the plurality rule. Health
Services and Outcomes Research Methodology 2, 5–18.
Shwartz, M., Peköz, E.A., Labonte, A., Heineke, J., Restuccia, J.D., 2011. Bringing
responsibility for small area variations in hospitalization rates back to the hospital: The propensity
to hospitalize index and a test of the roemer’s law. Medical Care 49, 1062–1067.
Sparks, P.J., Sparks, C.S., 2010. An application of spatially autoregressive models to the
study of US county mortality rates. Population, Space and Place 16, 465–481.
Steinley, D., 2003. Local optima in K-Means clustering: What you don’t know may hurt
you. Psychological Methods 8, 294–304.
Steinley, D., 2006. K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology 59, 1–34.
Steinley, D., Banks, D., House, L., McMorris, F.R., Arabie, P., Gaul, W., 2004. Standardizing variables in k-means clustering, in: Classiﬁcation, Clustering, and Data Mining
Applications. Springer Berlin Heidelberg. Studies in Classiﬁcation, Data Analysis, and
Knowledge Organization, pp. 53–60.
Strunk, B.C., Ginsburg, P.B., Banker, M.I., 2006. The eﬀect of population aging on future
hospital demand. Health Aﬀairs 25, w141–w149.
Swift, A., Liu, L., Uber, J., 2008. Reducing MAUP bias of correlation statistics between
water quality and GI illness. Computers, Environment and Urban Systems 32, 134–148.
Tanser, F., Gijsbertsen, B., Herbst, K., 2006. Modelling and understanding primary health
care accessibility and utilization in rural South Africa: An exploration using a geographical
information system. Social Science & Medicine 63, 691–705.
Thomas, J.W., 1979. Techniques for deﬁning geographic boundaries for health regions.
Socio-Economic Planning Sciences 13, 321–326.
Thomas, J.W., Berki, S.E., Wyszewianski, L., Ashcraft, M.L.E., 1983. Classiﬁcation of
hospitals based on measured output: The VA system. Medical Care 21, 715–733.
Thomas, J.W., Griﬃth, J.R., Durance, P., 1981. Deﬁning hospital clusters and associated
service communities in metropolitan areas. Socio-Economic Planning Sciences 15, 45–51.
Tobler, W.R., 1970. A computer movie simulating urban growth in the Detroit region.
Economic Geography 46, 234–240.

Upchurch, C., Kuby, M., Zoldak, M., Barranda, A., 2004. Using GIS to generate mutually
exclusive service areas linking travel on and oﬀ a network. Journal of Transport Geography
12, 23–33.
US Federal Trade Commission, 2004. Improving Health Care: A Dose of Competition.
Technical Report. US Government Printing Oﬃce. Washington, D.C.
Vertrees, J.C., Manton, K.G., 1986. A multivariate approach for classifying hospitals and
computing blended payment rates. Medical Care 24, 283–300.
de Vries, J.J., Nijkamp, P., Rietveld, P., 2009. Exponential or power distance-decay for
commuting? An alternative speciﬁcation. Environment and Planning A 41, 461–480.
Vyas, S., Kumaranayake, L., 2006. Constructing socio-economic status indices: how to use
principal components analysis. Health Policy and Planning 21, 459–468.
Wan, N., Zhan, F.B., Zou, B., Chow, E., 2011. A relative spatial access assessment approach
for analyzing potential spatial access to colorectal cancer services in Texas. Applied Geography 32, 291–299.
Wan, N., Zou, B., Sternberg, T., 2012. A three-step ﬂoating catchment area method for analyzing spatial access to health services. International Journal of Geographical Information
Science.
Wang, F., Xu, Y., 2011. Estimating O–D travel time matrix by Google Maps API: implementation, advantages, and implications. Annals of GIS 17, 199–209.
Ward, J.H., 1963. Hierarchical grouping to optimize an objective function. Journal of the
American Statistical Association 58, 236–244.
Welch, H.G., Sharp, S.M., Gottlieb, D.J., Skinner, J.S., Wennberg, J.E., 2011. Geographic
variation in diagnosis frequency and risk of death among Medicare beneﬁciaries. JAMA:
The Journal of the American Medical Association 305, 1113–1118.
Wennberg, J., 2005. Variation in Use of Medicare Services Among Regions and Selected
Academic Medical Centers: Is More Better? Technical Report 874. The Commonwealth
Fund.
Wennberg, J., Cooper, M., Dartmouth Atlas of Health Care Working Group, 1999. The
Quality of Medical Care in the United States: A Report on the Medicare Program. Technical Report. American Hospital Association. Chicago, IL.
Wennberg, J., Gittelsohn, A., 1973. Small area variations in health care delivery. Science
182, 1102–1108.

White, K.L., Williams, T.F., Greenberg, B.G., 1961. The ecology of medical care. New
England Journal of Medicine 265, 885–892.
Witlox, F., 2007. Evaluating the reliability of reported distance data in urban travel behaviour analysis. Journal of Transport Geography 15, 172–183.
Wright, D.B., Ricketts III, T.C., 2010. The road to eﬃciency? Re-examining the impact of
the primary care physician workforce on health care utilization rates. Social Science &
Medicine 70, 2006–2010.
Young, T.K., 2005. Population Health: Concepts and Methods. Oxford University Press,
New York.
Zwanziger, J., Khan, N., 2008. Safety-Net hospitals. Medical Care Research and Review 65,
478–495.
