SOME NEW MODELS FOR SMALL AREA ESTIMATION

By

Hao Ren

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Statistics

2011

ABSTRACT

SOME NEW MODELS FOR SMALL AREA ESTIMATION

By Hao Ren

This dissertation includes some new models for small area estimation. There are four parts in total. The first part studied the selection of fixed effects covariates in linear mixed models. A modified bootstrap selection procedure for the linear model from the literature was extended to linear mixed effects models. Both theoretical work and simulations showed the effectiveness of this procedure for linear mixed effects models.

In the second part, a new approach that shrinks both the means and the variances of small areas was introduced. This method modeled the small area means and variances in a unified framework. The smoothed variance estimators use information from both the direct point estimators and their sampling variances, and this information is carried over to the smoothed small area estimators. The conditional mean squared error of prediction was also studied in this part to evaluate the performance of the predictors.

The third part studied confidence intervals for the small area estimators introduced in the second part. The literature on small area estimation is dominated by point estimation and the associated standard errors, and standard normal or Student-t confidence intervals do not produce accurate intervals. The confidence intervals produced in this part are derived from a decision theory perspective.

The fourth part estimated the small area means with clustering of the small areas. In realistic applications it may not be appropriate to "borrow strength" from all other small areas universally if cluster effects exist between clusters of small areas. A model based on clustering was studied in this part, which adds a cluster effect to the basic area level model. Since the partition of clusters was not known, a stochastic search procedure from the literature was first adapted to find the clustering partition.

ACKNOWLEDGMENT

I would like to express my sincere appreciation to my major dissertation advisor, Professor Tapabrata Maiti, for his invaluable guidance, support, and encouragement throughout my dissertation study. This dissertation could not have been completed without his guidance and support. I would also like to thank Professors Sarat Dass, Lifeng Wang, and William Schmidt for serving on my guidance committee. My special thanks go to Professor Sarat Dass for his kind help and great advice for my dissertation. I also want to thank Professors Lifeng Wang and William Schmidt for their valuable help and support. I also want to thank Adele Brandstrom and Jessalyn Smith for their kind help with style editing. I would like to thank my wife Qi Diao for her support, encouragement and love. Finally, I would like to thank my beloved parents and sister for their consistent support and love. Thank you!

TABLE OF CONTENTS

List of Tables
List of Figures

1 Introduction
1.1 Small Area Estimation
1.2 Classical Approach for Sample Survey
1.3 Model Based Estimation
1.3.1 Area Level Model
1.3.2 Unit Level Model
1.4 Mixed Model
1.5 Study Topics
2 Bootstrap Model Selection for Linear Mixed Effects Models: Application to Small Area Estimation
2.1 Introduction
2.2 Linear Model
2.3 Linear Mixed Model
2.3.1 Fay-Herriot Model
2.3.2 Nested-Error Regression Model
2.4 The Theoretical Framework
2.5 Simulation Study
2.5.1 Algorithms and Settings for Different Models
2.5.2 Results

3 Prediction Error of Small Area Predictors Shrinking both Mean and Variances
3.1 Introduction
3.2 Model and Estimation Method
3.2.1 Model with Assumption
3.2.2 Estimation of the Small Area Parameters
3.2.3 Estimation of the Structural Parameters
3.3 Prediction Error Calculation
3.3.1 Mean Squared Error of Prediction
3.3.2 Bias Correction for ν_i(B̂; X_i, S_i²)
3.3.3 Approximation of c_i(B; X_i, S_i²)
3.4 Simulation Study
3.5 Appendix: Matrix Calculation Results
3.5.1 Computation of B̂ − B
3.5.2 Second Order Correction of ν_i

4 Confidence Interval Estimation of Small Area Parameters Shrinking both Mean and Variances
4.1 Introduction
4.2 Proposed Model
4.3 Confidence Interval
4.3.1 Definition
4.3.2 Choice of the Tuning Parameter
4.4 Theoretical Justification of Tuning Parameter
4.5 Alternative Model for Confidence Interval
4.6 Simulation Study
4.7 A Real Data Analysis

5 Clustering Based Small Area Estimation: An Application to MEAP Data
5.1 Introduction
5.2 Data Set
5.3 Proposed Model
5.3.1 Prior Information
5.3.2 Model-Based Objective Functions
5.3.3 Estimation of Mean Scores
5.3.4 MSE of Estimator
5.4 Stochastic Search
5.4.1 Biased Random Walk
5.4.2 Split-Merge Moves
5.5 Data Analysis

6 Discussion

Bibliography

LIST OF TABLES

2.1 Selection probabilities for linear model
2.2 The coverage probabilities and lengths of confidence intervals for linear model
2.3 Selection probabilities for Fay-Herriot model, loss function without random effect
2.4 The coverage probabilities and lengths of confidence intervals for the Fay-Herriot model, loss function without random effect
2.5 Selection probabilities for Fay-Herriot model, loss function with random effect
2.6 The coverage probabilities and lengths of confidence intervals for the Fay-Herriot model, loss function with random effect
2.7 Selection probabilities for Nested-Error regression model with group level
2.8 The coverage probabilities and lengths of confidence intervals for Nested-Error regression model with group level
2.9 Selection probabilities for Nested-Error regression model with element level
2.10 The coverage probabilities and lengths of confidence intervals for Nested-Error regression model with element level
2.11 Comparison of loss functions without and with random effect for Nested-Error model, data size n = 160
2.12 The coverage probabilities and lengths of confidence intervals for Nested-Error regression model with different loss functions, data size n = 160
3.1 Results of the simulation study; we present the estimate (Est.) and empirical standard deviation (SD) for β and τ². We set β = 10.
3.2 Results of MSE estimator, n = 36, m = 9
3.3 Results of MSE estimator, n = 180, m = 18
4.1 Estimation of model parameters. The left panel is for β and the right panel is for τ².
4.2 Simulation results for prediction when τ² = 0.25
4.3 Simulation results for prediction when τ² = 1
4.4 Simulation results for prediction when τ² = 4
4.5 Crop data from You and Chapman (2006)
4.6 Estimation results of corn
5.1 Number of public school 4th graders in some school districts in Michigan state

LIST OF FIGURES

4.1 Corn hectares estimation.
The vertical line for each county displays the confidence interval of θ̂_i, with θ̂_i marked by the circle, for (I) Proposed method, (II) Wang and Fuller (2003), (III) Hwang et al. (2009) and (IV) Qiu and Hwang (2007).
4.2 Boxplot of estimates of corn hectares for each county. (I) to (IV) are the 4 methods corresponding to Figure 4.1.
5.1 Standard deviation of public school students' assessment scores in available school districts in Michigan state.
5.2 Poverty level of available school districts in Michigan state.
5.3 Math scores of 4th graders in public schools in available school districts in Michigan state.
5.4 Trace plot of different m values.
5.5 Map of clusters.
5.6 Mean scores vs. poverty levels for school districts in the 7 clusters and all the districts.
5.7 Coefficient of variation for the model-based estimates and observed mean scores.
5.8 MSE of estimators for models with clustering (5.2) or no clustering (5.1).

Chapter 1
Introduction

1.1 Small Area Estimation

The term small area is commonly used to denote a small geographical area, such as a district. It can also be used to denote a small demographic group, such as a group with a certain socio-economic status or a sex/race/ethnicity group. Usually, a small area arises when the domain-specific sample is not large enough to support direct estimates with an adequate level of statistical precision.

The history of small area statistics is long and can be traced back at least to eleventh-century England and seventeenth-century Canada. However, those statistics were based on either censuses or administrative records targeting a complete enumeration (Brackstone, 1987), and sampling was usually not involved. In recent years, sample surveys have become more and more popular because they are cost-effective and can cope with dynamic populations whose individuals are constantly moving. Basic sample survey methods include simple random sampling and systematic or stratified sampling. From the collected sample, direct measures, such as the mean and its standard error, can be calculated as estimates for the population or for each domain. However, when a small area is present, that is, when the sample for some domain is not large enough, estimates based only on direct survey estimators and only on data from that area can have unacceptably large standard errors. Thus, research on small area statistics is important for developing more reliable measures.

Small area estimation has gained more attention in recent years because of greatly increasing demand from both the public and private sectors. Public agencies, in countries like the U.S. and Canada, have a growing demand for small area statistics for formulating policies and programs, in cases like government funding allocation and local regional planning. In order to identify areas that are in need of funding, such as certain school districts or some subpopulation, reliable estimates for the small areas are required.
The increasing demand for small area statistics also comes from the private sector, for example small businesses. Often, business decisions rely on the local socio-economic environment and other regional conditions, so estimates with adequate precision are needed for small areas. Since direct estimates from sample surveys sometimes cannot meet such requirements for small areas, research on small area statistics has become more and more important. What is more, when central and eastern European countries and the former Soviet Union countries moved away from a centralized decision making system, they also began to demand survey results not only for large areas but also for small areas. Again, this created an increasing demand for small area statistics.

With high powered computers, the processing of large and complex data has become feasible, which helps with the development of small area statistics. Powerful statistical methods with a sound theoretical foundation have been developed for the analysis of small area data. Such methods "borrow strength" from related or similar small areas through implicit or explicit models that provide a link between related small areas. These types of "borrowing strength" models will be the main focus of this dissertation. For reviews of small area estimation, see Ghosh and Rao (1994), Rao (1999), Marker (1999), Rao (2001) and Pfeffermann (2002), among others.

1.2 Classical Approach for Sample Survey

Direct Estimators

The variables of interest in a sample survey are usually totals or means for an area or domain. Direct estimators are commonly used to provide estimates of such variables in a domain, using the sample data from that domain only. Typical direct estimators are design-based. A more extensive reference on direct estimation in sampling theory is Lohr (1999). Direct estimators will sometimes suffice, for example in domains with sufficiently large sample sizes and particularly after survey design issues have been addressed. But it is well known that direct estimators for small areas are usually unreliable, with unacceptably large standard errors caused by the unduly small sample sizes of the small areas.

Model-based methods for direct estimators have also been developed. They provide valid conditional inferences about the particular sample drawn, regardless of the sampling design. However, model-based methods depend heavily on the correct specification of the model and can perform poorly under misspecification even if the sample size is large.

Demographic Methods

The most powerful demographic method is the census. Censuses are usually conducted at 10-year or 5-year intervals to provide population counts for specific geographical areas or subpopulations defined by age, sex, marital status and other demographic variables. But the information from a census becomes outdated due to changes in the size and composition of the resident population over time. Therefore, various demographic methods other than the census have been developed to provide population estimates in the years between censuses.

Changes in demographic variables are strongly related to changes in the local population. Administrative registers contain current data on various demographic variables for the local population. Such variables, for example the numbers of births and deaths and the net emigration during the period since the last census, are called symptomatic indicators. Traditional demographic methods employ indirect estimators based on implicit linking models, which relate the population estimates to the symptomatic variables.
These methods may be categorized as either symptomatic accounting techniques or regression symptomatic procedures. Symptomatic accounting techniques provide indirect estimators under implicit linking models involving the symptomatic variables. Regression symptomatic procedures use multiple linear regression to estimate local area populations, with the symptomatic variables used as independent variables. For a detailed description of these methods, see Rao (2003). Typically, demographic indirect estimators use only administrative and census data, and sampling is not involved in these methods.

Indirect Estimators

As introduced previously, the unacceptably large standard errors of direct estimators are due to the unduly small sample sizes of the small areas. Therefore, it is necessary to find indirect estimators that increase the effective sample size and thus decrease the standard error. Traditional indirect estimation methods are based on implicit models that provide a link to related small areas through supplementary data. Such estimators include synthetic estimators, composite estimators, and James-Stein (or shrinkage) estimators.

If a large area covers several small areas and the small areas are assumed to have the same characteristics as the large area, a reliable direct estimator for the large area can be used to derive an indirect estimator for a small area. Such an estimator is called a synthetic estimator (Gonzalez, 1973). Global measures (averaged over small areas) are often used with synthetic estimates. If an estimator is a weighted average of a synthetic estimator and a direct estimator, it is called a composite estimator. In fact, any estimator of this composite form, whether design-based or model-based, can be called a composite estimator. For a composite estimator, a suitable weight needs to be chosen. The common weight approach uses the same weight for the composite estimators of all small areas; the total MSE is then minimized with respect to this common weight. The James-Stein estimator (James and Stein, 1961) is similar to the common weight estimator and has attracted a lot of attention in the mainstream statistics literature. It achieves large gains in efficiency in the traditional design-based framework without assuming a model for the small area parameters. A detailed introduction to these estimators can be found in Rao (2003).

1.3 Model Based Estimation

The traditional indirect estimators were briefly introduced in the previous section. Reduction in MSE is the main reason for using indirect estimators. Indirect estimators are largely based on sample survey data in conjunction with auxiliary population data. However, traditional indirect estimators only provide an implicit link between the small areas. Model based estimators, which provide an explicit link between the small areas, are introduced in this section.

Model based estimators are indirect estimators based on small area models. Small area models take random area-specific effects into account and include additional auxiliary variables to explain these effects, which makes specific allowance for between-area variation. Such models define how the related data are incorporated in the estimation process. The use of explicit models makes model diagnostics possible, which can be used to find suitable models that fit the data well. For example, a selection of auxiliary variables is conducted later in this dissertation.
Area-specific measures of precision can also be associated with each small area estimate. For complex data structures, more complex model structures can be adopted, such as mixed models and generalized linear models, and the existing methodology for these models can be utilized directly to achieve accurate small area estimation. In general, small area models can be classified into two broad types: (i) area level models, which model the small area direct estimators with area-specific covariates and are necessary when unit level data are not available; and (ii) unit level models, which model the individual unit values with unit-specific covariates.

1.3.1 Area Level Model

Let Ȳ_i be the small area means and g(·) be a specified function that links the model-based estimator θ_i and Ȳ_i, θ_i = g(Ȳ_i). The use of g(·) makes the model more robust, but in our study we choose g(·) to be the identity function, i.e. θ_i = Ȳ_i. The area level model relates θ_i and the area-specific auxiliary data z_i = (z_1i, ..., z_pi)^T in the following way:

θ_i = z_i^T β + b_i v_i,   i = 1, ..., m,    (1.1)

where β = (β_1, ..., β_p)^T is the p × 1 vector of regression parameters, the b_i are known positive constants, and the v_i are area-specific random effects that are iid with E(v_i) = 0 and Var(v_i) = σ_v². In this dissertation a normal distribution is always adopted for v_i; however, it is possible to choose other distributions, which makes the model more robust. Here m is the total number of areas.

For making inferences about the θ_i under model (1.1), we assume that direct estimators θ̂_i are available and

θ̂_i = θ_i + e_i,   i = 1, ..., m,    (1.2)

where the e_i are sampling errors with E(e_i) = 0 and Var(e_i) = σ_{e_i}². That is, the estimators θ̂_i are design-unbiased. We always assume that the sampling variances σ_{e_i}² are known.

The area level model has various extensions. A famous example is the Fay-Herriot model, developed by Fay and Herriot in 1979 to estimate per-capita income for small areas (places with population less than 1,000). The Fay-Herriot model is adopted in this dissertation and the details will be introduced later.

1.3.2 Unit Level Model

When unit-specific auxiliary data are available for each population element in each small area, a unit level model is adopted. Let y_ij be the variable of interest and X_ij = (x_ij1, ..., x_ijp)^T be the available element-specific auxiliary data. Then a one-fold nested error linear regression model is given by

y_ij = X_ij^T β + ν_i + e_ij,   j = 1, ..., n_i;  i = 1, ..., m.    (1.3)

Here the ν_i are area-specific effects assumed to be iid random variables with E(ν_i) = 0 and Var(ν_i) = σ_v². The errors are e_ij = k_ij ẽ_ij, where the ẽ_ij are iid random variables, independent of the ν_i, with E(ẽ_ij) = 0 and Var(ẽ_ij) = σ_e², the k_ij are known constants, and n_i is the number of elements in the ith area. In addition, normality of the ν_i and ẽ_ij is often assumed. The parameters of inferential interest are the small area totals Y_i = Σ_{j=1}^{n_i} y_ij or the means Ȳ_i = Y_i / n_i.

The unit level model does not account for sample selection bias; that is, the sample values are assumed to obey the model. Simple random sampling satisfies this condition, but for more complex sampling designs the model may not be appropriate. For example, in stratified multistage sampling, the design features are not incorporated in the model. However, there are various extensions to account for such features.
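To make the notation concrete, the following minimal R sketch simulates data from the unit level model (1.3) with k_ij = 1 and then forms the direct area sample means. The number of areas, the sample sizes, and the parameter values are illustrative choices only, not settings used elsewhere in this dissertation.

```r
# Simulate from the one-fold nested error model (1.3) with k_ij = 1.
# All settings below are illustrative only.
set.seed(1)
m      <- 10                                 # number of small areas
ni     <- sample(3:6, m, replace = TRUE)     # elements sampled per area
beta   <- c(1, 2)                            # fixed effects
sig2_v <- 1                                  # Var(nu_i)
sig2_e <- 1                                  # Var(e_ij)
area   <- rep(1:m, ni)                       # area index for each element
X      <- cbind(1, runif(sum(ni)))           # element-specific auxiliary data x_ij
v      <- rnorm(m, sd = sqrt(sig2_v))        # area-specific random effects
y      <- as.vector(X %*% beta) + v[area] + rnorm(sum(ni), sd = sqrt(sig2_e))
ybar   <- tapply(y, area, mean)              # direct (sample mean) estimators of the area means
```

An analogous two-stage simulation for the area level model (1.1)-(1.2) would first draw θ_i = z_i^T β + b_i v_i and then add the sampling errors e_i with the known variances.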
1.4 Mixed Model

The small area models introduced in the previous section may be regarded as special cases of mixed models. Research on mixed models has gained much attention in recent years, and they appear under many names in a wide variety of disciplines in the physical, biological and social sciences. The mixed model is particularly useful in settings where repeated measurements are made on the same statistical units (longitudinal data), or where measurements are made on clusters of related statistical units (clustered or panel data). Therefore, it is also called a model for repeated measurements, or a hierarchical model.

The most important difference between the mixed model and classical statistics is that the observations are not necessarily from the same population, whereas independent and identically distributed observations are a typical classical assumption. Mixed model data may have a more complex, multilevel, hierarchical structure. The general assumption for mixed model data is that observations from one cluster can be correlated but are independent between different clusters, and that the subpopulations of the clusters can be different. Therefore, a random effect at the cluster level is introduced into the model. The form of a mixed model can be linear, generalized linear (such as logistic or Poisson), or nonlinear. In this dissertation, we focus on linear mixed models only.

A general linear mixed model can be written as

y = Xβ + Zv + e,    (1.4)

where y is the vector of sample observations, X and Z are known matrices, and v and e are distributed independently with means 0 and covariance matrices G and R, respectively, depending on some variance component parameters. One approach to this model can be obtained by using the general theory of Henderson (1975) for a mixed linear model. If the variance component parameters are known, the best linear unbiased estimator of θ = l^T β + m^T v is given by

θ̂ = l^T β̂ + m^T G Z^T V^{-1} (y − X β̂),    (1.5)

where V = R + Z G Z^T is the variance-covariance matrix of y and

β̂ = (X^T V^{-1} X)^{-1} X^T V^{-1} y

is the generalized least squares estimator of β. Many other approaches are available, such as empirical best linear unbiased prediction, the empirical Bayes method, and the hierarchical Bayes method. Some of these methods will be involved in our studies later. The details of these methods can be found in the literature, such as Ghosh and Rao (1994) and Rao (2003).

1.5 Study Topics

As introduced in the previous sections, small area estimation and the statistical techniques therein have become a topic of growing importance in recent years, and this is the reason why I chose small area estimation as the topic of my dissertation. To be more specific, the work in this dissertation includes the following aspects.

Model Selection for Linear Mixed Effects Models

The need for reliable small area estimates is felt by many agencies, both public and private, for making useful policy decisions. For example, small area statistics are used to monitor socio-economic and health conditions for different strata defined by age, sex and racial group over small geographical areas. It is now widely recognized that direct survey estimates for small areas are usually unreliable, being accompanied by large standard errors and coefficients of variation. This makes it necessary to use models, either explicit or implicit, to connect the small areas and obtain estimates of improved precision by borrowing strength across areas.
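As a small illustration of how the predictor (1.5) "borrows strength", the R sketch below computes the generalized least squares estimator and the predictor (1.5) for the area level special case, where Z = I, G = A I and R = diag(D_1, ..., D_m). The simulated inputs (m, β, A and the D_i) are hypothetical and for illustration only.

```r
# BLUP (1.5) specialized to the area level model: Z = I, G = A*I, R = diag(D).
# The inputs below are simulated and purely illustrative.
set.seed(2)
m    <- 30
X    <- cbind(1, runif(m))                     # area level covariates
beta <- c(1, 2); A <- 1                        # fixed effects and Var(v_i)
D    <- runif(m, 0.3, 0.7)                     # known sampling variances
y    <- as.vector(X %*% beta) + rnorm(m, sd = sqrt(A)) + rnorm(m, sd = sqrt(D))

Vinv       <- diag(1 / (A + D))                # V^{-1}, with V = R + Z G Z^T = diag(A + D)
beta_gls   <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)  # GLS estimator of beta
fit        <- as.vector(X %*% beta_gls)
theta_blup <- fit + (A / (A + D)) * (y - fit)  # shrinks each direct estimate toward x_i' beta_hat
```

The factor A/(A + D_i) is exactly the shrinkage weight that reappears in the Fay-Herriot estimator (2.6) of Chapter 2.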
In the previous sections, some basic models for small area estimation were presented. Before we start working on more complex methods, it is important to note the influence of the choice of model, particularly of the choice of auxiliary variables. The success of any model-based method depends on the availability of good auxiliary data. Therefore, more attention should be given to the selection of auxiliary variables that are good predictors of the study variables. The model selection study here arises from the potential choices of the fixed effects covariates in linear mixed effects models.

A bootstrap model selection method is adapted in this dissertation. The procedure based on the bootstrap has some important features: first, the bootstrap method can provide more accurate inference (Adkins and Hill, 1990; Hall, 1989) at the same time as the model selection is carried out. In other words, the bootstrap based inference on the regression parameters takes the model selection procedure into account. Second, and most importantly, the bootstrap selection procedure does not depend on a specific probability distribution. However, a straightforward application of the bootstrap does not yield a consistent model selection procedure (Shao, 1996). A simple modification was applied to the straightforward bootstrap procedure: generate m (instead of n, with m < n, where n is the size of all available data) iid bootstrap observations from the empirical distribution which puts equal mass on each pair of available data. With this modification, the procedure is consistent if and only if m/n → 0 and m → ∞.

However, only the linear model is studied in the literature. In this dissertation, the modified bootstrap selection procedure is extended to linear mixed effects models, including the Fay-Herriot model and the Nested-Error regression model, that are commonly used in small area estimation.

Small Area Predictors by Shrinking both Mean and Variances

The basic area level model is introduced in Section 1.3.1. The survey based direct small area estimates and their variance estimates are the main ingredients used to build area level small area models. Typical modeling strategies assume that the sampling variances are known while a suitable linear regression model is assumed for the means. For detailed developments, one can see Ghosh and Rao (1994), Pfeffermann (2002) and Rao (2003). The typical area level models are subject to two main criticisms: (i) in practice, the sampling variances are estimated quantities, and these estimates are subject to substantial error because they are often based on sample sizes equivalent to those used to calculate the direct estimates; and (ii) the assumption of known and fixed sampling variances does not take the uncertainty of this estimation into account in the overall small area estimation strategy.

Previous attempts have been made to model only the sampling variances by Maples et al. (2009), Gershunskaya and Lahiri (2005), Huff et al. (2002), Cho et al. (2002), Valliant (1987), and Otto and Bell (1995). Wang and Fuller (2003) and Rivest and Vandal (2003) extended the asymptotic mean squared error (MSE) estimation of Prasad and Rao (1990) to the case where the sampling variances are modeled with a few additional parameters. You and Chapman (2006) also considered sampling variance modeling; however, they adopted full Bayesian estimation techniques.

These issues have recently been raised by practitioners. For example, Herrador et al.
(2008) investigated estimates of design variances of model based and model-assisted small area estimators. The latest developments are nicely summarized in a recent article by William Bell of the United States Census Bureau (Bell, 2008). He carefully examined the effect of the above two issues in the context of MSE estimation for model based small area estimators. He also provided numerical evidence of the effect of assuming known sampling variances in the estimation of MSE in the context of the Fay-Herriot model.

The developments made on this issue so far can be loosely viewed as (i) smoothing the direct sampling error variances to obtain stable variance estimates with low bias and (ii) (partially) accounting for sampling variance uncertainty by extending the Fay-Herriot model. Much less (or no) attention has been given to accounting for the sampling variances effectively while modeling the mean, compared to the volume of research that has been devoted to modeling the means and their inferential issues. Thus, there is a lack of systematic development in small area estimation that includes shrinking both means and variances. In other words, we would like to exploit the technique of "borrowing strength" from other small areas to improve the variance estimates, just as we do to improve the small area mean estimates.

The methodology introduced here develops a dual "shrinkage" estimation for both the small area means and variances in a unified framework. In this process, the smoothed variance estimators use information from the direct point estimators and their sampling variances, and this information is carried over to the smoothed small area estimators. The modeling perspective is closely related to Wang and Fuller (2003), Rivest and Vandal (2003), You and Chapman (2006) and Hwang et al. (2009). An EM-based estimation approach is developed. Numerical evidence is provided to show the effectiveness of the dual shrinkage estimation.

One statistic reported is the conditional mean squared error of prediction (CMSEP), which is in the spirit of Booth and Hobert (1998). Booth and Hobert argued strongly for CMSEP as opposed to the unconditional mean squared error of prediction (UMSEP). Recently, this technique has again been emphasized by Lohr and Rao (2009) in the context of nonlinear mixed effect models. These authors favor CMSEP particularly for non-normal models, when the posterior variance of the small area parameters depends on the area specific responses. Although they were interested only in generalized linear mixed models, where the posterior variance depends on the area specific responses, this property of the posterior variance is perhaps true for other situations with posterior non-linearity.

Another inference adopted is confidence intervals for the small area means. The small area estimation literature is dominated by point estimation and the associated standard errors. It is well known that the standard practice of (pt. est. ± q s.e.), where q is a Z (standard normal) or a t cutoff point, does not produce accurate intervals; see Hall and Maiti (2006) and Chatterjee et al. (2008) for more details. These previous works are based on bootstrap procedures and have limited use due to the repeated estimation of model parameters. The confidence intervals produced in this dissertation are derived from a decision theory perspective.

Clustering Based Small Area Estimation

For the small area models introduced above, both the area level model and the unit level model assume iid random area effects.
That is, when we make inferences for each small area, we "borrow strength" from all other small areas universally. But in reality the geographical (spatial) and socio-economic status of small areas in a large region may be quite different from one another. For example, the demographic composition of a small area might be close to that of some adjacent small areas, but not of all the adjacent areas; similar patterns exist for household incomes, poverty levels and many other types of information. When we "borrow strength" from other small areas under such conditions, the inference may even be misleading. Therefore, it is more appropriate to divide all small areas into groups (clusters) if diversity exists between groups of small areas.

In this dissertation, a data set from the Michigan Educational Assessment Program (MEAP) is analyzed. The numbers of students in the school districts are diverse, and the standard deviations of students' math scores for the school districts are also quite different from one another. A detailed description of the data set is given in later chapters. This information shows that small area models are appropriate for analyzing the data set, and that comparisons of mean scores based on model-based estimates are more meaningful than direct comparisons of the mean scores. In addition, the poverty levels of the school districts suggest that a small area model with clustering of the school districts is more appropriate.

There are many clustering methods available. However, it often happens that we have no partition information for the clustering, or even the number of clusters. A stochastic search procedure from Booth et al. (2008) is adopted to solve this problem. The clustering partition is included in the model as a parameter. A cluster-specific random effect is also included in the model, which allows the cluster means to depart from the assumed base model. An objective function is constructed based on the posterior distribution of the partition under consideration, and the partition that maximizes the objective function is chosen as the "optimal" clustering partition.

In Booth et al. (2008), the variances of the objects were assumed unknown but iid within a cluster. The main difference between our model and the model in Booth et al. (2008) is that a different variance is assigned to each school district and assumed known, since these variances are reported in the MEAP data set. The posterior distribution based on different variances for each school district does not have a closed form and can only be calculated by numerical methods.

Chapter 2
Bootstrap Model Selection for Linear Mixed Effects Models: Application to Small Area Estimation

2.1 Introduction

Let {(x_i, y_i), i = 1, ..., n} be the available data set. Some model needs to be fitted if the objective is to discover the relationship between x and y. Often, not all the components of x are related or important to y. An optimal model contains only the necessary components of x. The procedure of selecting the variables is a model selection problem in which each model corresponds to a particular set of the components of x.

Among the many existing variable/model selection procedures, such as AIC, BIC, Mallows' C_p, R², etc., the procedure based on the bootstrap has some important features: first, the bootstrap method can provide more accurate inference (Adkins and Hill, 1990; Hall, 1989) at the same time as the model selection is carried out.
In other words, the bootstrap based inference on the regression parameters takes the model selection procedure into account. Second, and most importantly, the bootstrap selection procedure does not depend on a specific probability distribution.

A very important theoretical property of a bootstrap selection procedure is its consistency. Shao (1996) discovered that a straightforward application of the bootstrap does not yield a consistent model selection procedure. A simple modification was applied to the straightforward bootstrap procedure: generate m (instead of n, with m < n) independent and identically distributed (iid) bootstrap observations from the empirical distribution F̂, which puts mass n^{-1} on each pair (x_i, y_i), i = 1, ..., n. With this modification, the procedure is consistent if and only if m/n → 0 and m → ∞.

However, only the linear model is studied in the literature. In this chapter, the modified bootstrap selection procedure is extended to different aspects of the mixed model, including the Fay-Herriot model and the Nested-Error regression model, that are commonly used in small area estimation. The model selection study here arises from the potential choices of the fixed effects covariates. In the following sections, we first start with a replication of Shao's algorithm for the linear model. Then the modified algorithms of the model selection procedures for linear mixed models are provided. After that, simulation studies are carried out for the previous algorithms. The results are listed in the final section.

2.2 Linear Model

The model is given by

y_i = x_i^T β + e_i,   i = 1, ..., n,    (2.1)

where x_i is the ith value of a p × 1 vector of explanatory variables, y_i is the response at x_i, and β is a p × 1 vector of unknown parameters. In our study, p is fixed and does not increase as n increases. We assume that the e_i, i = 1, ..., n, are iid Normal(0, σ_e²) and

µ_i = E(y_i | x_i) = x_i^T β,   var(y_i | x_i) = σ_e²,   i = 1, ..., n.

The unknown parameters are estimated by the least squares estimator (LSE)

β̂ = (X^T X)^{-1} X^T Y.    (2.2)

For the variable/model selection procedure, let α be a subset of {1, ..., p} of size p_α and let x_{iα} (or β_α) be the subvector of x_i (or β) containing the components of x_i (or β) indexed by the integers in α. Then a candidate model, say model α, is

y_i = x_{iα}^T β_α + e_i,   i = 1, ..., n,
µ_{iα} = E(y_i | x_i) = x_{iα}^T β_α,   var(y_i | x_i) = σ_e²,

and the parameter is estimated as in equation (2.2),

β̂_α = (X_α^T X_α)^{-1} X_α^T Y.

Then an average conditional expected loss is defined to measure the efficiency of model α:

Γ_n(α) = E[ (1/n) Σ_{i=1}^n (y_i − x_{iα}^T β̂_α)² | Y, X ].    (2.3)

The model α ∈ A which minimizes Γ_n(α) will be chosen as the estimate of the optimal model, where A is a collection of subsets of {1, ..., p}. The largest possible A is the one containing all nonempty subsets of {1, ..., p}; in practice, we will consider a smaller collection of subsets. The optimal model, called α_0, in this sense is the correct model with the smallest size and the smallest Γ_n(α). As Γ_n(α) involves the unknown parameter β, α_0 must be estimated by α̂. The model selection procedure is consistent if

lim_{n→∞} P(α̂ = α_0) = 1.

With the modified bootstrapping pairs method, generate m (< n) iid pairs (x*_i, y*_i) from the empirical distribution F̂ putting mass n^{-1} on (x_i, y_i), i = 1, ..., n. Then a bootstrap estimator of E[Γ_m(α)], for m < n, is
Then a bootstrap estimator of E[Γm (α)], for m < n, is ˆ Γn,m (α) = E∗ where ∥a∥ = √ ˜∗ ∥Y − Xα βα,m ∥2 n (2.4) ˜∗ ˆ aT a for any vector a, βα,m is the bootstrap analog of βα based on the generated m iid bootstrapping observations,  −1 m ∑ ∗ x∗′  ∗ ˜∗  βα,m = xiα iα x ∗ yi iα i=1 i=1 m ∑ and E∗ represents expectation approximated by Monte Carlo with size B, B ˜∗b ∑ ∥Y − Xα βα,m ∥2 ˆ (B) (α) = 1 Γn,m B n b=1 20 The objective of the selection procedure is to select a model αn,m ∈ A that minimizes ˆ ˆ Γn,m (α). From the literature (Shao, 1996), this bootstrap selection procedure is consistent; that is lim P (ˆ n,m = α0 ) = 1 α n→∞ provided that m satisfies m/n → 0 and m → ∞. 2.3 Linear Mixed Model In this section, the modified bootstrapping selection procedure will be applied to the Linear Mixed Model (LMM). The general mixed linear model is y = Xβ + Zν + e Where y is the vector of sample observations, X and Z are known matrices, β is the unknown parameters, and ν and e are distributed independently with mean 0 and covariance matrices G and R. Here two special cases of LMM, the Fay-Herriot model and Nested-Error regression model, will be used to illustrate the modified bootstrap selection procedure. 2.3.1 Fay-Herriot Model If the character of interest is the sample mean y of each group, y = (¯1 , . . . , yn ), Fay and ¯ ¯ y ¯ Herriot (1979) developed this model to estimate per-capita income for small areas (population 21 less than 1,000). The model can be stated as yi = µi + ei ¯ (2.5) µi = xT β + νi i i = 1, . . . , n where xi = (xi1 , . . . , xip )T is available for each area i, β is the unknown parameters, and e = (e1 , . . . , en )T and ν = (ν1 , . . . , νn )T are distributed independently as Normal(0, D) and Normal(0, AI) respectively, where D = diag(D1 , . . . , Dn ) and Di is known. Therefore, µi is Normal(xT β, A) and y is Normal(µ, D) if µ is given. ¯ i For the Fay-Herriot model, the best linear unbiased estimator of µi is A y ˆ ˆ µi = xT β + A+D (¯i − xT β), i = 1, . . . , n ˆ i i i ˆ β = (X T V −1 X)−1 X T V −1 y ¯ (2.6) where V = diag(A + D1 , . . . , A + Dn ) and X = col1≤i≤n (xT ). i The model selection procedure in the Fay-Herriot model is similar to that in the linear ∗ model. Consider bootstrapping pairs, let (x∗ , yi ), i = 1, . . . , m, be iid sample from the i ¯ ∗ empirical distribution putting mass n−1 to each (xi , yi ), i = 1, . . . , n. And Di , i = 1, . . . , m, ¯ are the corresponding known variance components. Then the bootstrap analogs of β and µi ˆ under model α with bootstrap sample size m are  −1 m m ∑ x∗ x∗′ ∑ x∗ y ∗ iα iα  iα ¯i x∗ y ∗ ˜∗  βα,m = ¯ ∗ ∗ ˆ ˆ A m + Di Am + Di iα i i=1 i=1 ˆm A ˜∗ ˜∗ µiα = xT βα,m + ˜ (¯ − xT βα,m ) y iα iα ˆm + D∗ i A i 22 ˆ ˆ ˆ where Am is the analog of A with bootstrap size m and A is an unbiased quadratic estimator of A, where  m ∑ ( m ∑ )  1  µ2 − ˆi Di 1 − xT (X T X)−1 xi  i n−p i=1 i=1 ˆ ˜ A = max(A, 1/n) ˜ A= (2.7) ˆ ˆ where µi = yi − xT β and β = (X T X)−1 X T y . ˆ ¯ ¯ i Define 1 ˆ Γn,m (α) = E∗ n m ∑ ˜∗ (¯i − xT βα,m )2 y iα (2.8) i=1 Or define the loss function with the random effect estimated by the bootstrap sample data  (  1 ∑ )2 ˆ Am ˜∗ ˜∗ ˆ Γn,m (α) = E∗ yi − xT βα,m − ¯ (¯ − xT βα,m ) y + iα iα ˆm + D i m A  i∈I i B    ∑ 1 T β ∗ )2 ˜ (¯i − xiα α,m y  n−m  i∈IB / (2.9) where IB is the set of indexes of which groups are chosen as bootstrap sample. The model ˆ selected by this bootstrap procedure is αn,m ∈ A that minimizes Γn,m (α). 
2.3.2 Nested-Error Regression Model

This model was proposed by Battese et al. (1988) to estimate mean acreage under a crop for counties (small areas) in Iowa. Their model is given by

y_ij = x_ij^T β + ν_i + e_ij,   i = 1, ..., t,  j = 1, ..., n_i,   Σ_{i=1}^t n_i = n,    (2.10)
ν_i ~ N(0, σ_ν²),   e_ij ~ N(0, σ_e²),

where y_ij is the characteristic of interest for the jth sampled unit in the ith sample area, x_ij = (x_ij1, ..., x_ijp)^T, β is the vector of unknown parameters, and n_i is the number of sampled units in the ith small area. The random errors ν_i and e_ij are independent of each other.

The variance-covariance matrix of y is V = diag(V_1, ..., V_t) with V_i = σ_e² I_{n_i} + σ_ν² 1_{n_i} 1_{n_i}^T. The generalized least squares estimator of β is the same as (2.6),

β̂ = (X^T V^{-1} X)^{-1} X^T V^{-1} Y.    (2.11)

The corresponding estimator of ν_i is

ν̂_i = γ_i (ȳ_i − x̄_i^T β̂),

where γ_i = σ_ν² (σ_ν² + σ_e² n_i^{-1})^{-1}, and ȳ_i and x̄_i are the sample means of y_ij and x_ij in the ith group.

If the variance components in the LMM are unknown, they are estimated by Henderson's method 3 (for the nested error model):

σ̂_e² = (n − t − k + λ)^{-1} Σ_i Σ_j ê_ij²,
σ̃_ν² = n_*^{-1} [ Σ_i Σ_j û_ij² − (n − k) σ̂_e² ],    (2.12)

where n_* = n − tr[ (X^T X)^{-1} Σ_{j=1}^t n_j² x̄_j x̄_j^T ], and λ = 0 if the model has no intercept term and λ = 1 otherwise. The {ê_ij} are the residuals from the ordinary least squares regression of y_ij − ȳ_i on {x_ij1 − x̄_i·1, ..., x_ijk − x̄_i·k}, and the {û_ij} are the residuals from the ordinary least squares regression of y_ij on {x_ij1, ..., x_ijk}. To avoid a negative σ̃_ν², we define σ̂_ν² = max(σ̃_ν², 1/n).

The model selection procedure for the nested-error regression model is a little different from the previous models. First, the bootstrap is done at the group level, instead of over the elements in each group. Let (x*_i, y*_i), i = 1, ..., m, be iid sample groups from the empirical distribution putting mass t^{-1} on each group (x_i, y_i), i = 1, ..., t, where x_i = (x_i1^T, ..., x_in_i^T)^T and y_i = (y_i1, ..., y_in_i)^T. Then the bootstrap analog of β̂ under model α with bootstrap sample size m is

β̃*_{α,m} = (X̂*_α^T V̂*^{-1} X̂*_α)^{-1} X̂*_α^T V̂*^{-1} Y*,    (2.13)

where X̂*_α and Y* are the bootstrap sample data with the corresponding estimated variance components. Define

Γ̂_{n,m}(α) = E_* [ (1/n) Σ_{i=1}^t Σ_{j=1}^{n_i} (y_ij − x_{ij,α}^T β̃*_{α,m})² ],    (2.14)

or define the loss function by including the random effect estimated from the bootstrap sample data:

Γ̂_{n,m}(α) = E_* { (1/m) Σ_{i∈I_B} Σ_j ( y_ij − x_{ij,α}^T β̃*_{α,m} − γ̃_i (ȳ_i − x̄_{iα}^T β̃*_{α,m}) )² + [1/(n − m)] Σ_{i∉I_B} Σ_j (y_ij − x_{ij,α}^T β̃*_{α,m})² },    (2.15)

where γ̃_i = σ̃_ν² (σ̃_ν² + σ̃_e² n_i^{-1})^{-1}, σ̃_ν² and σ̃_e² are the estimated variance components under model α with bootstrap sample size m, and I_B is the set of indices of the groups chosen in the bootstrap sample.

An alternative algorithm is to bootstrap at the element level. Since the data are correlated within a group, a transformation is needed first to remove the correlation:

y'_ij = y_ij − α_i ȳ_i•,   x'_ij = x_ij − α_i x̄_i•,
ν'_i = ν_i − α_i ν_i,   e'_ij = e_ij − α_i ē_i•,    (2.16)

where

α_i = 1 − [ σ_e² / (σ_e² + n_i σ_ν²) ]^{1/2}.

After the transformation, the new model is a "linear model":

y'_ij = x'^T_ij β + u_ij,   where u_ij ~ Normal(0, σ_e²).    (2.17)

Then β is estimated by

β̂ = (X^T X)^{-1} X^T Y,

where X = col_{1≤i≤t} col_{1≤j≤n_i}(x'^T_ij) and Y = col_{1≤i≤t} col_{1≤j≤n_i}(y'_ij). After the transformation of the data, the bootstrap procedure of the linear model can be used for the nested-error regression model.
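The element level transformation is easy to code. The R sketch below applies (2.16) to data organized as a response vector, a design matrix and an integer area index, with the variance components taken as given (in practice they come from Henderson's method 3, (2.12)). The function name and arguments are hypothetical.

```r
# Transformation (2.16): after it, model (2.17) can be treated as an ordinary linear model,
# so the element level pair bootstrap of Section 2.2 applies directly.
# area must be coded 1..t; variance components are passed in (estimated via (2.12) in practice).
transform_nested <- function(y, X, area, sig2_v, sig2_e) {
  ni    <- tabulate(area)                                 # group sizes n_i
  alpha <- 1 - sqrt(sig2_e / (sig2_e + ni * sig2_v))      # alpha_i in (2.16)
  ybar  <- tapply(y, area, mean)                          # group means of y
  Xbar  <- apply(X, 2, function(x) tapply(x, area, mean)) # group means of each column of X
  y_t   <- y - alpha[area] * ybar[area]                   # y'_ij = y_ij - alpha_i * ybar_i
  X_t   <- X - alpha[area] * Xbar[area, , drop = FALSE]   # x'_ij = x_ij - alpha_i * xbar_i
  list(y = y_t, X = X_t)                                  # fit (2.17) to these by least squares
}
```

Ordinary least squares on the returned pair then gives the fit of (2.17), and the pair bootstrap of the linear model can be run on the transformed data.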
2.4 The Theoretical Framework

Recently, Das et al. (2004) presented a general linear mixed model

Y = X_0 β_0 + Z v + e_n,    (2.18)

where Y ∈ R^n is a vector of observed responses, X_0 (n × p_0) and Z (n × q) are known matrices, and v and e_n are independent random variables with dispersion matrices D(ψ) and R_n(ψ) respectively. Here β_0 ∈ R^{p_0} and ψ ∈ R^k are fixed parameters. The mixed ANOVA model and longitudinal models, including the Fay-Herriot model and the nested error regression model, are special cases of this framework.

The model selection issue here arises from the potential choices of the fixed effects covariates, which are reported in the columns of the matrix X_0. In particular, the true value of p_0 is not known. We have a collection of vectors X_i ∈ R^n, i = 1, ..., p, and any candidate X has a sub-collection of these vectors as columns. Let S = {1, ..., p} denote the set of indices of the columns of covariates. A typical subset of S is given by α = {j_1, ..., j_{p_α}} ⊆ S, and has p_α ∈ {0, 1, ..., p} elements. Let X_α = [X_{j_1} : ... : X_{j_{p_α}}] ∈ R^{n × p_α}; that is, the columns of X_α are those vectors whose indices match those of α. In a slight abuse of notation, we write that the model α is given by

Y = X_α β_α + Z v + e_n.    (2.19)

The number and choice of candidate models may be restricted by imposing conditions on the subsets α = {j_1, ..., j_{p_α}} ⊆ S. For example, nested models may be considered as a restrictive choice of candidate models. It can be seen that the true data generating process, given by (2.18), is one of the candidate models. The goal of model selection is to consistently identify the true model.

We start with a random vector w = (w_1, ..., w_n), which is a realization from an R^n-valued random variable, sometimes referred to as bootstrap weights. We impose the restriction that w_i ≥ 0, which also reinforces the notion that these are weights. For convenience, we define the diagonal matrix W whose diagonal entries are given by w, that is, the ith entry is w_i. When w has the Multinomial(m; 1/n, ..., 1/n) distribution for some m ≤ n, this method is identical to the m-out-of-n bootstrap. In particular, when m = n, we get the classical bootstrap of Efron (1979). Other weights may be used as well, for example to obtain the Bayesian bootstrap.

We use these bootstrap weights in a most interesting way. We define the following inner product between two vectors x and y in R^n:

<x, y>_W = x^T W y = Σ_{i=1}^n w_i x_i y_i.

Thus, this inner product is a weighted Euclidean inner product, where the weights are the bootstrap weights. Notice that we recover the usual Euclidean inner product when all the weights are equal to 1. In general, however, <·, ·>_W is a semi-inner product, as some of the weights can be zero.

Suppose P_{W,α} is the projection matrix for a projection onto the column space of X_α in terms of this randomly weighted semi-inner product <·, ·>_W. Thus, for any vector v ∈ R^n, its projection onto the column space of X_α is given by v_{W,α} = P_{W,α} v. Let I_n denote the identity matrix in R^n. For each model α, we obtain the following number:

T_α = E_B ||(I_n − P_{W,α}) Y||²,

where E_B stands for the bootstrap expectation, that is, expectation conditional on the observed data. Thus, in terms of our framework, E_B stands for integration with respect to the random matrix W conditional on all other random terms. We now develop some conditions for the bootstrap weights {w_1, ..., w_n}.
First, we assume that all the weights are semi-exchangeable up to order 4; that is, all marginal distributions involving subsets of size 4 or below from {w_1, ..., w_n} have an exchangeability property. In particular, we assume that the univariate marginals of w_i for i = 1, ..., n are all identical, and that the two-dimensional marginals of any pair (w_i, w_j) for i ≠ j = 1, ..., n also have the same distribution.

We assume several lower order moments of w. In particular, we reserve the notation E w_1 = µ_w and V w_1 = σ_w², and define the centered and scaled weights W_i = σ_w^{-1}(w_i − µ_w) for i = 1, ..., n. Our technical assumptions will involve conditions on various cross-moments, which we state later. In terms of the centered and scaled bootstrap weights W = (W_1, ..., W_n), we now write the random matrix of bootstrap weights as W = µ_w I_n + σ_w W̄, where W̄ = diag(W_1, ..., W_n) is the diagonal matrix of centered and scaled weights. For use in our technical proofs, we define the matrix U = σ_w µ_w^{-1} W̄. In the m-out-of-n bootstrap (moon-bootstrap) case, we have σ_w² = n^{-1} m (1 − n^{-1}), µ_w = n^{-1} m and c_{w11} = −n^{-2} m.

We also define the notation n^{-1} X_α^T X_α = D_{n,α}. The inverse of D_{n,α} exists if and only if X_α has full column rank, and we assume that this is the case. Our technical methodology applies to cases where D_{n,α} does not have an inverse; however, this scenario results in a problem of identifiability of models, and we assume that X_α has full column rank to avoid any issue of unique identification of models. We also define

A_α = n^{-1} D_α^{-1/2} X_α^T U X_α D_α^{-1/2},
P_α = X_α (X_α^T X_α)^{-1} X_α^T = n^{-1} X_α D_α^{-1} X_α^T,
B_α = U X_α D_α^{-1/2} (I + A_α)^{-1} D_α^{-1/2} X_α^T U,
P̃_α = P_α [I_n − U + n^{-1} B_α] P_α,
E_α = P_α − P̃_α [I_n + U].

In terms of these, we have

P_{W,α} = X_α (X_α^T W X_α)^{-1} X_α^T W
= X_α (X_α^T [µ_w I_n + σ_w W̄] X_α)^{-1} X_α^T [µ_w I_n + σ_w W̄]
= n^{-1} µ_w X_α (n^{-1} µ_w X_α^T [I_n + U] X_α)^{-1} X_α^T [I_n + U]
= n^{-1} X_α (n^{-1} X_α^T [I_n + U] X_α)^{-1} X_α^T [I_n + U]
= n^{-1} X_α (D_α + n^{-1} X_α^T U X_α)^{-1} X_α^T [I_n + U]
= n^{-1} X_α D_α^{-1/2} { I + n^{-1} D_α^{-1/2} X_α^T U X_α D_α^{-1/2} }^{-1} D_α^{-1/2} X_α^T [I_n + U]
= n^{-1} X_α D_α^{-1/2} [I + A_α]^{-1} D_α^{-1/2} X_α^T [I_n + U]
= n^{-1} X_α D_α^{-1/2} ( I − A_α + A_α [I + A_α]^{-1} A_α ) D_α^{-1/2} X_α^T [I_n + U]
= n^{-1} X_α [ D_α^{-1} − D_α^{-1/2} A_α D_α^{-1/2} + D_α^{-1/2} A_α (I + A_α)^{-1} A_α D_α^{-1/2} ] X_α^T [I_n + U]
= [ P_α − P_α U P_α + n^{-1} P_α U X_α D_α^{-1/2} (I + A_α)^{-1} D_α^{-1/2} X_α^T U P_α ] [I_n + U]
= P_α [I_n − U + n^{-1} B_α] P_α [I_n + U]
= P̃_α [I_n + U].

Using this, we have

I_n − P_{W,α} = I_n − P_α + P_α − P_{W,α} = I_n − P_α + P_α − P̃_α [I_n + U] = I_n − P_α + E_α.

We now analyze the last term in greater detail. We have

E_α = P_α − P̃_α [I_n + U]
= P_α − P_α [I_n − U + n^{-1} B_α] P_α [I_n + U]
= P_α − P_α [I_n − U + n^{-1} B_α] P_α − P̃_α U
= P_α U P_α − n^{-1} P_α B_α P_α − P̃_α U
= P_α U P_α − n^{-1} P_α B_α P_α − P_α [I_n − U + n^{-1} B_α] P_α U
= P_α U P_α − P_α U + P_α U P_α U − n^{-1} P_α B_α P_α − n^{-1} P_α B_α P_α U
= P_α U P_α − P_α (I_n − U) P_α U − n^{-1} P_α B_α P_α (I_n + U)
= −P_α U (I_n − P_α) + P_α U P_α U − n^{-1} P_α B_α P_α (I_n + U).

Since E_α is P_α times another matrix, we have (I_n − P_α) E_α = 0. Using this, we have

T_α = E_B ||(I_n − P_{W,α}) Y||²
= E_B Y^T (I_n − P_α + E_α)^T (I_n − P_α + E_α) Y
= Y^T (I_n − P_α) Y + E_B Y^T E_α^T E_α Y.

Thus, we have a very neat decomposition of T_α, with the first term capturing the squared residuals in model α, and the other term containing all the bootstrap related quantities. Our next task is to compute E_B(E_α^T E_α). Note that all the bootstrap weights are in the matrix U.
We have

E_α^T E_α = (I_n − P_α) U P_α U (I_n − P_α) + E_{R,α},

where E_{R,α} is a symmetric matrix which is entirely negligible. The algebra for showing the different components of E_{R,α} is long and tedious, and it is also the source of much of our technical assumptions, so we skip the details here. Consequently, we now need to evaluate

V_α = E_B U P_α U = (( E_B U_i U_j P_{α,ij} )),

with entries

σ_w² µ_w^{-2} P_{α,ii}   if i = j,
c_{w11} σ_w² µ_w^{-2} P_{α,ij}   if i ≠ j.

For use later on, we define the diagonal matrix

G_α = (1 − c_{w11}) σ_w² µ_w^{-2} diag(P_{α,ii}).

In terms of the subsequent analysis, the difference between V_α and G_α is negligible. Thus, the properties of T_α are governed by

Z_α = Y^T (I_n − P_α) [I_n + G_α] (I_n − P_α) Y.

The difference between T_α and Z_α is negligible, although showing this involves a lot of algebra. There is a Z_α value corresponding to each model α. Also note that, at least in large samples, Z_α is positive, which is consistent with the fact that T_α is always positive. We compare the various T_α's and establish that, with probability tending to one, the true model has the smallest value. This again involves a lot of algebra, but the main scheme of the proof uses the moment-generating function of T_α. The results follow from a calculation based on derivatives of determinants of matrices of the form A + tB.

2.5 Simulation Study

To examine the finite-sample performance of the selection procedures, a simulation study was carried out based on bootstrapping pairs with different bootstrap sample sizes m and various model settings. First, the results for the linear model from Shao's paper were replicated. The same data set used by Shao, from the solid waste data example of Gunst and Mason (1980), was used again, and the same value of β, (2, 9, 6, 4, 8), was chosen. Then the Fay-Herriot model was tested with the same data set. After that, two different selection procedures for the nested-error model based on the same data set were carried out.

2.5.1 Algorithms and Settings for Different Models

For the linear model, the size of the data set, n, is 40 and the e_i, i = 1, ..., n, are iid standard normal errors. The algorithm is:

1. Generate data, Y, according to the defined linear model (2.1).
2. Bootstrap one sample with size m = 15 from the original data.
3. For each candidate model: (1) estimate the unknown parameters by (2.2); (2) calculate the loss function in (2.3).
4. Repeat steps 2 and 3 B = 100 times. Choose the candidate model with the minimized mean of the loss functions as the optimal model. Get the 95% confidence interval of the fixed effect parameters from the 2.5% and 97.5% quantiles of the B bootstrap samples.
Get the 95% confidence interval of fixed effect parameters from the 2.5% and 97.5% quantiles of B bootstrap samples. 5. Repeat steps 2, 3 & 4 with different bootstrap samples of size m = 20, 25, 30, 40. 6. Record the optimal model and corresponding confidence intervals. Repeat steps 1-5 1000 times. 7. Repeat the whole process with different true model settings. 35 For the nested-error regression model, we use the same data set with size n = 40, divided into t = 10 groups with ni = (5, 6, 4, 5, 3, 2, 5, 3, 3, 4); choose the variance components as 2 2 σν = 1 and σe = 1. The algorithm of the bootstrap with group level is: 1. Generate data, Y, based on model (2.10). 2. Bootstrap one sample of m = 4 groups from the original 10 groups of the data. 3. For each candidate model: (1) estimate variance components by Henderson’s method 3 in (2.12); (2) estimate the fixed effect parameters based on σν and σe using (2.11) ˆ2 ˆ2 or (2.13); (3) calculate the loss function using (2.14) and (2.15). 4. Repeat steps 2&3 B = 100 times. Choose the candidate model with the minimized mean of loss functions as the optimal model. Get the 95% confidence interval of fixed effect parameters from the 2.5% and 97.5% quantiles of B bootstrap samples. 5. Repeat steps 2, 3 & 4 with different bootstrap samples of size m = 5, 6, 8, 10. 6. Record the optimal model and corresponding confidence intervals. Repeat steps 1-5 1000 times. 7. Repeat the whole process with different true model settings. The algorithm of the bootstrap with element level of Nested-Error Regression Model: 1. Generate data, Y, based on model (2.10). 2. Estimate variance components with the whole data by Henderson’s method 3 in (2.12). 3. Transfer the original data to linearized data which can be fitted by a linear model using the tranformation in (2.16). 36 4. Bootstrap one sample with size m = 15 from the new data. 5. For each candidate model: estimate the fixed effect parameters like a linear model and calculate the corresponding loss function. 6. Repeat steps 4&5 B = 100 times. Choose the candidate model with the minimized mean of loss functions as the optimal model. Get the 95% confidence interval of fixed effect parameters from the 2.5% and 97.5% quantiles of B bootstrap samples. 7. Repeat steps 4, 5 & 6 with different bootstrap samples of size m = 20, 25, 30, 40. 8. Record the optimal model and corresponding confidence intervals. Repeat steps 1-7 1000 times. 9. Repeat the whole process with different true model settings. ˆ The bootstrap estimators Γn,m (α) were computed with size B = 100. The empirical probabilities of selecting each model and the coverage probabilities of the true parameter value were based on 1,000 simulations. The lengths of the confidence interval were means of the 1,000 replications. For comparison, the results of empirical selection methods using AIC, BIC and adjusted R2 were also reported. The whole procedure was programmed in R. 2.5.2 Results At first the results of the Linear Model are reported. Table 2.1 is the selection probabilities of the optimal model; Table 2.2 is the coverage probabilities of each parameter and the lengths of the corresponding confidence intervals. From the results, the similar characters of this bootstrap selection procedure as Shao’s results are supported. The consistency of the 37 modified bootstrap selection procedure is supported clearly. Also, the modified bootstrap selection procedure can be substantially better than other selection methods in most cases. 
The results for the Fay-Herriot model are reported next, in Tables 2.3-2.6. Tables 2.3 and 2.4 support the application of the bootstrap selection procedure to the Fay-Herriot model. Tables 2.5 and 2.6 compare the two loss functions, with and without the random effects. The selection probabilities under the loss function that includes the random effects are higher than those under the loss function without the random effects in some cases with larger bootstrap sample sizes; the coverage probabilities show no appreciable difference.

Tables 2.7 and 2.8 report the empirical selection probabilities for the nested-error regression model when bootstrapping at the group level, and Tables 2.9 and 2.10 report the results for bootstrapping at the element level. The selection probabilities at the element level are higher in most cases, since bootstrapping at the element level after the transformation is applied closely mimics the linear-model procedure. The data size is only 40 in these cases, and in particular the number of groups is only 10, which may partly explain why group-level bootstrapping does not perform as well as element-level bootstrapping. The two definitions of the loss function are then compared in Tables 2.11 and 2.12 with a larger data size, n = 160, to examine the effect of increasing the sample size. As with the Fay-Herriot model, the selection probabilities of the loss function with the random effects are higher when the bootstrap sample size is larger.

In addition, except when the full model is the optimal model, the modified bootstrap selection procedure with a smaller bootstrap size m works better than the unmodified procedure that bootstraps with size 40. The trend in the selection probabilities shows that the optimal choice of m clearly depends on the parameter β. When the optimal model contains few parameters, the modified procedure with a smaller bootstrap size is better than the procedure with a larger bootstrap size; as the number of parameters in the optimal model increases, the optimal bootstrap size m also increases. This trend was examined in more detail with additional bootstrap sizes m = 11, 12, 13, 14, 15, 20, 25, 30, 40; those results are not listed here.

The empirical coverage probabilities are the probabilities that the bootstrap confidence interval covers the true parameter values. The 95% confidence intervals from the modified bootstrap selection procedure cover the true parameter values with probability greater than 0.9 in most cases. The coverage probabilities are higher for smaller bootstrap sample sizes, while the confidence intervals are shorter for larger bootstrap sample sizes.

In conclusion, the modified bootstrap model selection procedure can be applied to more complex models, such as linear mixed models. Bootstrapping pairs with size m works well in these models provided m/n → 0 and m → ∞.
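For reference, the bootstrap-pairs selection loop of Section 2.5.1 for the linear model can be sketched as follows. This is a minimal R illustration, not the simulation code itself: the design matrix, the candidate column sets and the plain squared-error loss are stand-ins for the solid waste covariates and the loss function (2.3).

# Minimal sketch of m-out-of-n bootstrap-pairs model selection for a linear model.
set.seed(2)
n <- 40; p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # hypothetical design matrix
beta <- c(2, 9, 6, 4, 8)
y <- drop(X %*% beta) + rnorm(n)

candidates <- list(c(1, 4), c(1, 4, 5), c(1, 2, 4, 5), 1:5)  # candidate column sets
m <- 15; B <- 100

avg_loss <- sapply(candidates, function(cols) {
  losses <- replicate(B, {
    idx <- sample(n, m, replace = TRUE)               # draw m pairs out of n
    fit <- lm.fit(X[idx, cols, drop = FALSE], y[idx]) # fit the candidate model
    pred <- X[, cols, drop = FALSE] %*% fit$coefficients
    mean((y - pred)^2)                                # squared-error loss on the full data
  })
  mean(losses)                                        # average loss over B bootstraps
})
candidates[[which.min(avg_loss)]]                     # selected column set

The 2.5% and 97.5% quantiles of the B bootstrap coefficient estimates for the selected model give the reported 95% intervals.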
39 Table 2.1: Selection probabilities for linear model m=15 0.957 0.016 0.018 0.008 0.001 0.000 0.000 0.000 Bootstrap m=20 m=25 m=30 0.851 0.747 0.654 0.046 0.079 0.101 0.066 0.097 0.128 0.027 0.055 0.072 0.007 0.006 0.018 0.001 0.005 0.008 0.002 0.011 0.013 0.000 0.000 0.006 m=40 0.491 0.134 0.192 0.091 0.034 0.019 0.023 0.016 AIC 0.576 0.094 0.086 0.122 0.051 0.027 0.024 0.020 (2,0,0,4,8) 1,4,5* 1,2,4,5 1,3,4,5 1,2,3,4,5 0.958 0.021 0.020 0.001 0.897 0.047 0.054 0.002 0.828 0.067 0.096 0.009 0.770 0.100 0.103 0.027 0.575 0.153 0.201 0.071 0.732 0.893 0.515 0.098 0.051 0.163 0.098 0.042 0.161 0.072 0.014 0.161 (2,9,0,4,8) 1,4,5 1,2,4,5* 1,3,4,5 1,2,3,4,5 0.012 0.973 0.002 0.013 0.003 0.961 0.002 0.034 0.000 0.904 0.001 0.095 0.001 0.863 0.000 0.136 0.000 0.741 0.000 0.259 0.000 0.000 0.000 0.811 0.931 0.673 0.001 0.001 0.001 0.188 0.068 0.326 (2,9,6,4,8) 1,2,3,5 1,2,4,5 1,3,4,5 1,2,3,4,5* 0.012 0.005 0.069 0.914 0.000 0.001 0.023 0.976 0.000 0.000 0.009 0.991 0.000 0.000 0.001 0.999 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.003 0.000 1.000 0.997 1.000 True β T Model (2,0,0,4,0) 1,4* 1,2,4 1,3,4 1,4,5 1,2,3,4 1,2,4,5 1,3,4,5 1,2,3,4,5 40 BIC 0.818 0.050 0.040 0.062 0.017 0.005 0.004 0.004 R2 0.315 0.116 0.105 0.163 0.114 0.076 0.053 0.058 Table 2.2: The coverage probabilities and lengths of confidence intervals for linear model T Para True β m=15 m=20 m=25 m=30 m=40 Non Bootstrap 2 (2,0,0,4,0) σe 0.974(1.320) 0.935(1.135) 0.922(0.989) 0.883(0.905) 0.810(0.759) 0.920(0.902) β1 0.995(1.427) 0.983(1.201) 0.968(1.060) 0.957(0.958) 0.905(0.819) 0.948(0.855) β4 0.999(2.106) 0.974(1.751) 0.945(1.579) 0.924(1.481) 0.841(1.397) 0.948(0.966) 2 (2,0,0,4,8) σe β1 β4 β5 0.980(1.372) 0.996(1.500) 0.996(4.348) 0.998(4.799) 0.950(1.167) 0.990(1.252) 0.985(3.431) 0.994(3.769) 0.918(1.012) 0.970(1.089) 0.980(2.909) 0.978(3.164) 0.884(0.912) 0.952(0.983) 0.944(2.556) 0.962(2.768) 0.840(0.779) 0.927(0.838) 0.912(2.108) 0.915(2.269) 0.933(0.918) 0.955(0.864) 0.944(1.700) 0.964(2.119) 2 (2,9,0,4,8) σe β1 β2 β4 β5 0.972(1.403) 0.994(1.576) 0.996(14.015) 0.999(6.292) 0.999(5.416) 0.940(1.178) 0.989(1.300) 0.990(10.255) 0.993(4.799) 0.993(4.072) 0.914(1.023) 0.971(1.116) 0.974(8.075) 0.986(3.890) 0.973(3.351) 0.885(0.922) 0.959(1.005) 0.958(6.867) 0.975(3.384) 0.971(2.945) 0.831(0.786) 0.932(0.852) 0.914(5.353) 0.937(2.632) 0.919(2.367) 0.927(0.927) 0.937(0.863) 0.951(4.108) 0.936(2.153) 0.946(2.151) 2 (2,9,6,4,8) σe β1 β2 β3 β4 β5 0.971(1.555) 0.999(1.738) 0.909(17.204) 0.982(10.175) 0.978(6.962) 0.997(5.958) 0.920(1.187) 0.988(1.365) 0.986(13.43) 0.988(7.186) 0.997(5.174) 0.995(4.365) 0.907(1.050) 0.967(1.175) 0.973(10.983) 0.976(5.656) 0.983(4.196) 0.982(3.588) 0.873(0.916) 0.959(1.029) 0.962(9.297) 0.971(4.735) 0.980(3.531) 0.970(3.061) 0.814(0.791) 0.935(0.865) 0.931(7.400) 0.940(3.661) 0.939(2.750) 0.931(2.448) 0.927(0.939) 0.940(0.873) 0.944(7.022) 0.957(3.277) 0.944(2.286) 0.945(2.203) 41 Table 2.3: Selection probabilities for Fay-Herriot model, loss function without random effect Model 1,4* 1,2,4 1,3,4 1,4,5 1,2,3,4 1,2,4,5 1,3,4,5 1,2,3,4,5 m=15 0.966 0.006 0.003 0.025 0.000 0.000 0.000 0.000 Bootstrap m=20 m=25 m=30 0.902 0.812 0.673 0.033 0.057 0.088 0.013 0.025 0.057 0.049 0.090 0.123 0.001 0.009 0.025 0.001 0.006 0.020 0.001 0.001 0.010 0.000 0.000 0.004 (2,0,0,4,8) 1,4,5* 1,2,4,5 1,3,4,5 1,2,3,4,5 0.995 0.003 0.002 0.000 0.966 0.026 0.008 0.000 0.901 0.060 0.032 0.007 0.822 0.090 0.063 0.025 0.660 0.147 0.117 0.076 0.688 0.857 0.498 0.138 0.073 0.188 0.111 0.060 0.174 0.063 
0.010 0.140 (2,9,0,4,8) 1,4,5 1,2,4,5* 1,3,4,5 1,2,3,4,5 0.002 0.981 0.015 0.002 0.000 0.968 0.008 0.024 0.000 0.949 0.007 0.044 0.000 0.888 0.003 0.109 0.000 0.802 0.002 0.196 0.000 0.000 0.000 0.848 0.931 0.710 0.004 0.009 0.003 0.148 0.060 0.287 (2,9,6,4,8) 1,2,3,5 1,2,4,5 1,3,4,5 1,2,3,4,5* 0.065 0.042 0.203 0.690 0.017 0.007 0.069 0.907 0.003 0.001 0.029 0.967 0.000 0.000 0.013 0.987 0.000 0.000 0.002 0.998 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.009 0.000 1.000 0.991 1.000 True β T (2,0,0,4,0) 42 m=40 0.487 0.126 0.084 0.145 0.072 0.038 0.031 0.017 AIC 0.558 0.107 0.092 0.118 0.063 0.022 0.019 0.021 BIC 0.806 0.061 0.057 0.051 0.012 0.006 0.005 0.002 R2 0.326 0.141 0.108 0.148 0.108 0.048 0.059 0.062 Table 2.4: The coverage random effect True β T Para 2 (2,0,0,4,0) σe β1 β4 probabilities and lengths of confidence intervals for the Fay-Herriot model, loss function without m=15 0.967(1.927) 0.997(1.748) 0.992(2.570) m=20 0.934(1.666) 0.988(1.473) 0.973(2.089) m=25 0.919(1.476) 0.970(1.293) 0.956(1.878) m=30 0.876(1.333) 0.960(1.160) 0.926(1.838) m=40 Non Bootstrap 0.820(1.145) 0.641(0.656) 0.927(0.994) 0.944(1.042) 0.853(1.719) 0.939(1.199) 2 (2,0,0,4,8) σe β1 β4 β5 0.972(1.960) 0.996(1.837) 0.997(5.288) 0.998(5.861) 0.940(1.696) 0.990(1.526) 0.991(4.232) 0.991(4.607) 0.917(1.490) 0.976(1.332) 0.973(3.622) 0.979(3.890) 0.876(1.349) 0.963(1.192) 0.959(3.197) 0.970(3.411) 0.816(1.149) 0.926(1.014) 0.920(2.634) 0.934(2.788) 0.625(0.658) 0.939(1.048) 0.946(2.140) 0.946(2.582) 2 (2,9,0,4,8) σe β1 β2 β4 β5 0.965(1.990) 0.997(1.936) 0.983(16.619) 0.998(7.626) 0.998(6.525) 0.930(1.711) 0.991(1.583) 0.984(12.204) 0.998(5.829) 0.991(4.982) 0.909(1.506) 0.980(1.364) 0.967(9.712) 0.983(4.755) 0.981(4.136) 0.871(1.356) 0.963(1.220) 0.954(8.265) 0.979(4.095) 0.967(3.574) 0.817(1.162) 0.927(1.028) 0.895(6.491) 0.948(3.208) 0.932(2.882) 0.656(0.667) 0.947(1.057) 0.950(5.209) 0.954(2.738) 0.950(2.639) 2 (2,9,6,4,8) σe β1 β2 β3 β4 β5 0.968(2.219) 0.997(2.084) 0.787(17.268) 0.935(11.206) 0.924(7.675) 0.999(7.121) 0.927(1.757) 0.992(1.657) 0.919(15.283) 0.961(8.563) 0.968(6.094) 0.995(5.327) 0.891(1.519) 0.985(1.416) 0.957(12.815) 0.968(6.800) 0.974(5.038) 0.982(4.348) 0.858(1.367) 0.969(1.251) 0.963(11.158) 0.959(5.696) 0.978(4.275) 0.974(3.719) 0.788(1.164) 0.936(1.047) 0.926(8.877) 0.924(4.379) 0.947(3.332) 0.938(2.949) 0.615(0.652) 0.938(1.054) 0.938(8.329) 0.927(3.847) 0.936(2.819) 0.942(2.650) 43 Table 2.5: Selection probabilities for Fay-Herriot model, loss function with random effect True β T Model (2,0,0,4,0) 1,4* 1,2,4 1,3,4 1,4,5 1,2,3,4 1,2,4,5 1,3,4,5 1,2,3,4,5 m=15 0.976 0.002 0.002 0.020 0.000 0.000 0.000 0.000 m=20 0.951 0.009 0.007 0.033 0.000 0.000 0.000 0.000 m=25 0.910 0.022 0.018 0.048 0.001 0.001 0.000 0.000 m=30 0.872 0.038 0.021 0.066 0.002 0.001 0.000 0.000 m=40 0.806 0.054 0.039 0.087 0.005 0.007 0.001 0.001 (2,0,0,4,8) 1,4,5* 1,2,4,5 1,3,4,5 1,2,3,4,5 0.993 0.005 0.002 0.000 0.988 0.008 0.003 0.001 0.964 0.025 0.011 0.000 0.940 0.032 0.025 0.003 0.898 0.051 0.045 0.006 (2,9,0,4,8) 1,4,5 1,2,4,5* 1,3,4,5 1,2,3,4,5 0.010 0.972 0.017 0.001 0.001 0.972 0.011 0.016 0.001 0.954 0.017 0.028 0.000 0.940 0.012 0.048 0.000 0.912 0.008 0.080 (2,9,6,4,8) 1,2,3,5 1,2,4,5 1,3,4,5 1,2,3,4,5* 0.080 0.051 0.246 0.623 0.031 0.016 0.120 0.833 0.009 0.007 0.064 0.920 0.005 0.002 0.057 0.936 0.001 0.003 0.035 0.961 44 Table 2.6: The coverage probabilities and lengths of confidence intervals for the Fay-Herriot model, loss function with random effect True β T Para m=15 m=20 m=25 m=30 m=40 2 
(2,0,0,4,0) σe 0.966(1.891) 0.940(1.637) 0.917(1.458) 0.888(1.331) 0.834(1.134) β1 0.996(1.738) 0.983(1.463) 0.969(1.284) 0.960(1.162) 0.910(0.991) β4 0.993(2.517) 0.978(1.979) 0.952(1.716) 0.919(1.570) 0.862(1.331) 2 (2,0,0,4,8) σe β1 β4 β5 0.976(1.926) 0.994(1.835) 0.998(5.279) 0.998(5.775) 0.943(1.665) 0.990(1.518) 0.983(4.182) 0.985(4.586) 0.914(1.476) 0.978(1.326) 0.965(3.511) 0.972(3.858) 0.884(1.337) 0.963(1.199) 0.948(3.105) 0.957(3.385) 0.817(1.141) 0.934(1.015) 0.900(2.515) 0.924(2.777) 2 (2,9,0,4,8) σe β1 β2 β4 β5 0.970(2.021) 0.998(1.938) 0.973(16.438) 0.995(7.678) 0.998(6.587) 0.944(1.725) 0.983(1.567) 0.983(12.087) 0.989(5.802) 0.995(5.021) 0.910(1.529) 0.972(1.369) 0.965(9.504) 0.976(4.762) 0.985(4.143) 0.890(1.378) 0.948(1.215) 0.941(8.000) 0.950(4.070) 0.962(3.586) 0.829(1.171) 0.910(1.032) 0.884(6.120) 0.913(3.189) 0.921(2.886) 2 (2,9,6,4,8) σe β1 β2 β3 β4 β5 0.963(2.267) 0.994(2.099) 0.729(16.369) 0.929(11.030) 0.908(7.533) 0.994(7.162) 0.939(1.802) 0.986(1.676) 0.863(14.225) 0.927(8.344) 0.943(5.982) 0.983(5.330) 0.901(1.546) 0.974(1.422) 0.918(12.238) 0.942(6.633) 0.968(4.967) 0.979(4.347) 0.862(1.384) 0.968(1.262) 0.911(10.559) 0.924(5.612) 0.950(4.235) 0.960(3.731) 0.790(1.172) 0.928(1.054) 0.892(8.551) 0.903(4.316) 0.916(3.312) 0.934(2.976) 45 Table 2.7: Selection probabilities for Nested-Error regression model with group level Bootstrap m=5 m=6 m=8 m=10 0.870 0.788 0.621 0.498 0.032 0.049 0.090 0.103 0.038 0.052 0.079 0.089 0.048 0.072 0.122 0.146 0.006 0.021 0.040 0.074 0.001 0.008 0.021 0.029 0.005 0.008 0.021 0.036 0.000 0.002 0.006 0.025 True β T Model (2,0,0,4,0) 1,4* 1,2,4 1,3,4 1,4,5 1,2,3,4 1,2,4,5 1,3,4,5 1,2,3,4,5 m=4 0.933 0.017 0.022 0.026 0.002 0.000 0.000 0.000 (2,0,0,4,8) 1,4,5* 1,2,4,5 1,3,4,5 1,2,3,4,5 0.895 0.881 0.838 0.745 0.054 0.053 0.071 0.096 0.043 0.057 0.070 0.110 0.009 0.010 0.022 0.049 0.627 0.141 0.141 0.090 0.634 0.825 0.468 0.126 0.073 0.164 0.114 0.066 0.164 0.126 0.036 0.204 (2,9,0,4,8) 1,4,5 1,2,4,5* 1,3,4,5 1,2,3,4,5 0.021 0.782 0.152 0.044 0.001 0.000 0.000 0.811 0.822 0.800 0.104 0.062 0.027 0.084 0.116 0.173 0.000 0.730 0.014 0.256 0.000 0.000 0.000 0.746 0.888 0.655 0.005 0.010 0.001 0.249 0.102 0.344 (2,9,6,4,8) 1,2,3,5 1,2,4,5 1,3,4,5 1,2,3,4,5* 0.102 0.022 0.006 0.000 0.066 0.031 0.011 0.001 0.509 0.291 0.131 0.029 0.323 0.656 0.852 0.970 0.000 0.000 0.005 0.995 0.000 0.000 0.000 0.000 0.002 0.000 0.005 0.015 0.003 0.995 0.983 0.997 46 AIC 0.471 0.124 0.096 0.142 0.074 0.022 0.037 0.034 BIC 0.724 0.072 0.055 0.090 0.033 0.006 0.010 0.010 R2 0.315 0.129 0.102 0.152 0.117 0.046 0.061 0.078 Table 2.8: The coverage probabilities and lengths of confidence intervals for for Nested-Error regression model with group level True β T Para m=4 m=5 m=6 m=8 m=10 Non Bootstrap 2 (2,0,0,4,0) σv 0.914(2.673) 0.883(2.326) 0.855(2.105) 0.804(1.810) 0.768(1.597) 0.943(5.833) 2 σe 0.964(1.408) 0.935(1.228) 0.917(1.116) 0.871(0.942) 0.826(0.841) 0.919(1.035) 0.983(2.338) 0.971(2.044) 0.952(1.842) 0.928(1.570) 0.890(1.398) 0.918(1.515) β1 β4 0.982(3.145) 0.958(2.473) 0.930(2.112) 0.885(1.795) 0.824(1.661) 0.938(1.070) 2 (2,0,0,4,8) σv 2 σe β1 β4 β5 0.926(2.757) 0.946(1.476) 0.981(2.535) 0.983(5.079) 0.992(6.799) 0.881(2.345) 0.917(1.279) 0.968(2.172) 0.973(4.255) 0.985(5.003) 0.851(2.100) 0.886(1.144) 0.951(1.918) 0.961(3.742) 0.974(4.080) 0.794(1.790) 0.838(0.957) 0.913(1.618) 0.919(3.037) 0.939(3.197) 0.760(1.570) 0.797(0.844) 0.879(1.426) 0.891(2.593) 0.921(2.722) 0.934(5.781) 0.908(1.044) 0.916(1.522) 0.934(1.814) 0.949(2.421) 2 (2,9,0,4,8) 
σv 2 σe β1 β2 β4 β5 0.953(2.896) 0.952(1.743) 0.989(2.822) 0.820(13.957) 0.950(6.866) 0.993(7.922) 0.903(2.406) 0.924(1.402) 0.979(2.277) 0.885(11.292) 0.971(5.705) 0.984(5.554) 0.863(2.118) 0.896(1.210) 0.953(2.012) 0.912(9.585) 0.962(4.929) 0.976(4.452) 0.814(1.794) 0.843(1.002) 0.923(1.664) 0.897(7.139) 0.945(3.850) 0.937(3.304) 0.767(1.566) 0.794(0.875) 0.895(1.454) 0.852(5.980) 0.911(3.242) 0.894(2.830) 0.929(5.598) 0.921(1.081) 0.920(1.511) 0.943(4.417) 0.943(2.336) 0.928(2.474) 2 (2,9,6,4,8) σv 2 σe β1 β2 β3 β4 β5 0.968(3.220) 0.964(2.110) 0.994(2.964) 0.438(11.579) 0.813(9.159) 0.794(6.134) 0.982(9.033) 0.917(2.499) 0.907(1.580) 0.972(2.378) 0.680(12.512) 0.835(7.749) 0.889(5.724) 0.987(6.423) 0.862(2.123) 0.879(1.283) 0.963(2.051) 0.847(12.343) 0.889(6.808) 0.918(5.057) 0.972(4.956) 0.785(1.728) 0.824(1.012) 0.924(1.678) 0.926(10.322) 0.928(5.367) 0.945(4.041) 0.944(3.642) 0.736(1.514) 0.793(0.874) 0.889(1.457) 0.915(8.705) 0.919(4.514) 0.922(3.367) 0.915(3.025) 0.936(5.835) 0.914(1.086) 0.928(1.527) 0.932(8.088) 0.929(3.951) 0.940(2.530) 0.937(2.660) 47 Table 2.9: Selection probabilities for Nested-Error regression model with element level Model 1,4* 1,2,4 1,3,4 1,4,5 1,2,3,4 1,2,4,5 1,3,4,5 1,2,3,4,5 m=15 0.964 0.010 0.006 0.020 0.000 0.000 0.000 0.000 Bootstrap m=20 m=25 m=30 0.860 0.741 0.630 0.023 0.049 0.066 0.029 0.048 0.069 0.077 0.118 0.148 0.007 0.027 0.048 0.002 0.007 0.011 0.001 0.008 0.020 0.001 0.002 0.008 (2,0,0,4,8) 1,4,5* 1,2,4,5 1,3,4,5 1,2,3,4,5 0.990 0.010 0.000 0.000 0.961 0.022 0.013 0.004 0.891 0.044 0.052 0.013 0.814 0.072 0.079 0.035 0.631 0.131 0.137 0.101 0.690 0.879 0.478 0.104 0.043 0.160 0.104 0.050 0.153 0.102 0.028 0.209 (2,9,0,4,8) 1,4,5 1,2,4,5* 1,3,4,5 1,2,3,4,5 0.000 0.968 0.024 0.008 0.000 0.943 0.014 0.043 0.000 0.909 0.011 0.080 0.000 0.852 0.007 0.141 0.000 0.750 0.002 0.248 0.000 0.000 0.000 0.813 0.908 0.688 0.001 0.006 0.000 0.186 0.086 0.312 (2,9,6,4,8) 1,2,3,5 1,2,4,5 1,3,4,5 1,2,3,4,5* 0.018 0.010 0.140 0.832 0.001 0.004 0.036 0.959 0.000 0.000 0.017 0.983 0.000 0.001 0.009 0.990 0.000 0.000 0.002 0.998 0.000 0.000 0.000 0.000 0.001 0.000 0.002 0.011 0.000 0.998 0.988 1.000 True β T (2,0,0,4,0) 48 m=40 0.443 0.090 0.097 0.183 0.093 0.024 0.045 0.025 AIC 0.552 0.100 0.080 0.125 0.069 0.023 0.029 0.022 BIC 0.825 0.051 0.036 0.056 0.016 0.007 0.007 0.002 R2 0.306 0.114 0.097 0.170 0.118 0.050 0.071 0.074 Table 2.10: The coverage probabilities and lengths of confidence intervals for Nested-Error regression model with element level True β T Para m=15 m=20 m=25 m=30 m=40 Non Bootstrap (2,0,0,4,0) β1 β4 0.990(2.390) 0.989(2.175) 0.974(2.036) 0.966(1.852) 0.953(1.815) 0.927(1.709) 0.933(1.652) 0.891(1.635) 0.898(1.417) 0.823(1.528) 0.931(1.519) 0.941(1.065) (2,0,0,4,8) β1 β4 β5 0.989(2.479) 0.999(4.363) 0.997(5.526) 0.977(2.090) 0.989(3.467) 0.983(4.264) 0.954(1.854) 0.961(2.943) 0.966(3.577) 0.932(1.681) 0.931(2.637) 0.947(3.129) 0.891(1.440) 0.876(2.203) 0.894(2.565) 0.928(1.519) 0.943(1.807) 0.931(2.412) (2,9,0,4,8) β1 β2 β4 β5 0.992(2.628) 0.975(13.728) 0.998(6.272) 0.999(6.294) 0.983(2.164) 0.976(10.271) 0.993(4.752) 0.986(4.687) 0.957(1.897) 0.959(8.354) 0.976(3.968) 0.974(3.837) 0.935(1.709) 0.943(7.216) 0.960(3.428) 0.955(3.329) 0.898(1.453) 0.885(5.945) 0.914(2.750) 0.904(2.663) 0.918(1.527) 0.941(4.401) 0.939(2.328) 0.939(2.467) (2,9,6,4,8) β1 β2 β3 β4 β5 0.997(2.814) 0.849(17.305) 0.944(10.208) 0.968(6.498) 1.000(7.140) 0.985(2.260) 0.955(14.590) 0.969(7.838) 0.990(5.052) 0.988(5.209) 0.960(1.946) 0.966(12.200) 0.970(6.359) 
0.976(4.200) 0.976(4.223) 0.947(1.744) 0.905(1.475) 0.951(10.542) 0.919(8.559) 0.958(5.478) 0.922(4.368) 0.965(3.614) 0.930(2.888) 0.960(3.618) 0.916(2.868) 0.922(1.516) 0.918(8.039) 0.925(3.925) 0.929(2.516) 0.928(2.643) 49 Table 2.11: Compare of loss functions without and with random effect for Nested-Error model, data size n = 160. True β T Model (2,0,0,4,0) 1,4* 1,2,4 1,3,4 1,4,5 1,2,3,4 1,2,4,5 1,3,4,5 1,2,3,4,5 m=15 w/o w 0.735 0.654 0.065 0.083 0.058 0.079 0.081 0.097 0.029 0.052 0.002 0.011 0.013 0.009 0.016 0.016 m=20 w/o w 0.593 0.562 0.094 0.092 0.081 0.106 0.112 0.106 0.063 0.076 0.013 0.020 0.016 0.020 0.027 0.018 m=25 w/o w 0.494 0.535 0.079 0.081 0.108 0.106 0.148 0.119 0.076 0.083 0.020 0.025 0.034 0.027 0.040 0.025 m=30 w/o w 0.447 0.485 0.094 0.097 0.094 0.112 0.146 0.121 0.094 0.088 0.034 0.031 0.043 0.040 0.047 0.025 m=40 w/o w 0.339 0.461 0.099 0.092 0.130 0.119 0.160 0.128 0.097 0.094 0.047 0.036 0.056 0.040 0.072 0.029 (2,0,0,4,8) 1,4,5* 1,2,4,5 1,3,4,5 1,2,3,4,5 0.820 0.781 0.724 0.719 0.642 0.654 0.583 0.637 0.489 0.596 0.066 0.101 0.115 0.122 0.138 0.146 0.152 0.150 0.178 0.168 0.074 0.078 0.090 0.094 0.109 0.120 0.127 0.128 0.153 0.134 0.041 0.040 0.071 0.066 0.111 0.079 0.138 0.085 0.180 0.102 (2,9,0,4,8) 1,4,5 1,2,4,5* 1,3,4,5 1,2,3,4,5 0.000 0.870 0.000 0.130 (2,9,6,4,8) 1,2,3,5 1,2,4,5 1,3,4,5 1,2,3,4,5* 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.881 0.807 0.842 0.752 0.798 0.717 0.789 0.664 0.769 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.119 0.193 0.158 0.248 0.202 0.283 0.211 0.336 0.231 50 Table 2.12: The coverage probabilities and lengths of confidence intervals for Nested-Error regression model with different loss functions, data size n = 160. 
m=15 m=20 m=25 T Para True β w/o w w/o w w/o w 2 (2,0,0,4,0) σv 0.982(1.635) 0.982(1.638) 0.951(1.406) 0.951(1.409) 0.942(1.251) 0.946(1.255) 2 σe 0.996(0.775) 0.996(0.773) 0.980(0.659) 0.975(0.658) 0.957(0.597) 0.957(0.595) β1 0.996(1.200) 0.996(1.200) 0.987(1.030) 0.987(1.031) 0.969(0.922) 0.971(0.922) β4 0.984(1.155) 0.982(1.215) 0.957(1.038) 0.955(1.051) 0.930(0.948) 0.928(0.922) 2 (2,0,0,4,8) σv 2 σe β1 β4 β5 0.977(1.658) 0.984(0.776) 0.993(1.207) 0.992(1.894) 0.996(2.174) 0.977(1.659) 0.954(1.424) 0.952(1.426) 0.939(1.262) 0.941(1.264) 0.984(0.775) 0.977(0.668) 0.977(0.667) 0.959(0.595) 0.959(0.594) 0.993(1.208) 0.986(1.047) 0.986(1.046) 0.974(0.932) 0.974(0.933) 0.993(1.903) 0.974(1.539) 0.973(1.536) 0.952(1.331) 0.951(1.321) 0.996(2.175) 0.990(1.797) 0.989(1.794) 0.978(1.575) 0.978(1.568) 2 (2,9,0,4,8) σv 2 σe β1 β2 β4 β5 0.979(1.644) 0.981(0.787) 0.994(1.216) 0.992(4.316) 0.996(2.325) 0.997(2.205) 0.979(1.646) 0.953(1.407) 0.953(1.410) 0.936(1.245) 0.936(1.247) 0.981(0.785) 0.966(0.673) 0.966(0.672) 0.949(0.596) 0.949(0.595) 0.994(1.217) 0.988(1.042) 0.987(1.042) 0.970(0.933) 0.969(0.934) 0.990(4.286) 0.981(3.725) 0.981(3.634) 0.958(3.339) 0.958(3.231) 0.997(2.323) 0.986(1.891) 0.988(1.886) 0.973(1.601) 0.971(1.595) 0.997(2.203) 0.991(1.853) 0.991(1.845) 0.976(1.602) 0.977(1.597) 2 (2,9,6,4,8) σv 2 σe β1 β2 β3 β4 β5 0.972(1.685) 0.978(0.800) 0.994(1.229) 0.999(7.032) 0.997(3.531) 0.997(2.470) 0.991(2.376) 0.972(1.685) 0.940(1.438) 0.940(1.438) 0.915(1.273) 0.915(1.273) 0.978(0.800) 0.958(0.687) 0.958(0.687) 0.946(0.604) 0.946(0.604) 0.994(1.229) 0.981(1.050) 0.981(1.050) 0.968(0.932) 0.968(0.932) 0.999(7.032) 0.996(5.862) 0.996(5.862) 0.978(5.130) 0.978(5.130) 0.997(3.531) 0.987(2.926) 0.987(2.926) 0.965(2.542) 0.965(2.542) 0.997(2.470) 0.990(1.989) 0.990(1.989) 0.988(1.705) 0.988(1.705) 0.991(2.376) 0.984(1.950) 0.984(1.950) 0.977(1.718) 0.977(1.718) 51 Continued from Table 2.12 m=30 m=40 T Para True β w/o w w/o w 2 (2,0,0,4,0) σv 0.933(1.133) 0.935(1.138) 0.876(0.982) 0.883(0.986) 2 σe 0.948(0.542) 0.944(0.540) 0.924(0.466) 0.924(0.465) β1 0.962(0.833) 0.962(0.831) 0.924(0.726) 0.919(0.726) β4 0.903(0.876) 0.892(0.845) 0.840(0.790) 0.834(0.728) 2 (2,0,0,4,8) σv 2 σe β1 β4 β5 0.907(1.142) 0.933(0.540) 0.956(0.848) 0.936(1.178) 0.959(1.420) 0.913(1.144) 0.857(0.982) 0.866(0.986) 0.934(0.539) 0.891(0.464) 0.895(0.464) 0.956(0.847) 0.928(0.728) 0.929(0.727) 0.934(1.153) 0.893(1.002) 0.888(0.971) 0.956(1.412) 0.929(1.207) 0.922(1.194) 2 (2,9,0,4,8) σv 2 σe β1 β2 β4 β5 0.913(1.133) 0.927(0.539) 0.960(0.847) 0.942(3.076) 0.959(1.415) 0.961(1.450) 0.920(1.135) 0.866(0.978) 0.869(0.980) 0.927(0.538) 0.882(0.469) 0.882(0.468) 0.960(0.847) 0.916(0.730) 0.916(0.730) 0.941(2.918) 0.898(2.693) 0.890(2.513) 0.958(1.405) 0.922(1.176) 0.923(1.167) 0.961(1.441) 0.927(1.229) 0.927(1.219) 2 (2,9,6,4,8) σv 2 σe β1 β2 β3 β4 β5 0.914(1.152) 0.934(0.547) 0.947(0.852) 0.965(4.684) 0.950(2.314) 0.955(1.508) 0.947(1.536) 0.914(1.152) 0.860(0.994) 0.860(0.994) 0.934(0.547) 0.902(0.471) 0.902(0.471) 0.947(0.852) 0.918(0.732) 0.918(0.732) 0.965(4.684) 0.924(3.961) 0.924(3.961) 0.950(2.314) 0.926(1.947) 0.926(1.947) 0.955(1.508) 0.914(1.254) 0.914(1.254) 0.947(1.536) 0.896(1.305) 0.896(1.305) 52 Chapter 3 Prediction Error of Small Area Predictors Shrinking both Mean and Variances 3.1 Introduction The goal of this chapter is to introduce a methodology which develops a dual “shrinkage” estimation for both the small area means and variances in a unified framework. 
In this process, the smoothed variance estimators use information from both the direct point estimators and their sampling variances, and this information consequently carries over to the smoothed small area estimators. The modeling perspective is closely related to Wang and Fuller (2003), Rivest et al. (2003), You and Chapman (2006) and Hwang and Zhao (2009). The conditional mean squared error of prediction (CMSEP) is used to evaluate the prediction error, in the spirit of Booth and Hobert (1998), who argued strongly for CMSEP as opposed to the unconditional mean squared error of prediction (UMSEP). Recently, this technique has again been emphasized by Lohr and Rao (2009) in the context of nonlinear mixed effects models. These authors favor CMSEP particularly for nonnormal models in which the posterior variance of the small area parameters depends on the area-specific responses. Although they were interested only in generalized linear mixed models, where the posterior variance depends on the area-specific responses, this property of the posterior variance is perhaps true more generally whenever the posterior is nonlinear.

A brief outline of the rest of the chapter is as follows. Section 3.2 contains the proposed model and the estimation method for the small area parameters and the structural parameters. In Section 3.3, we provide an estimate of the prediction error in terms of the conditional mean squared error of prediction. A simulation study is presented in Section 3.4, and the detailed forms of some matrix calculations are given in Appendix 3.5 of this chapter.

3.2 Model and Estimation Method

3.2.1 Model with Assumptions

As noted by Bell (2008), the earlier work of Wang and Fuller (2003) and Rivest and Vandal (2003) essentially used the direct survey estimates to estimate the parameters related to the sampling variance model and did not use the direct survey variance estimates even though they were available. We propose a hierarchical model that uses both the direct survey estimates and the sampling variance estimates to estimate all the parameters that determine the stochastic system. In this process we exploit mean-variance joint modeling via a hierarchical model, so that the final estimator is based on shrinking both the mean and the variance. We note that Hwang et al. (2009) considered shrinking means and variances in the context of microarray data analysis and prescribed an important solution in which they plugged a shrinkage estimator of the variance into the mean estimator. As a result, their inference about the mean does not fully take the variance estimation into account. Furthermore, their model does not include any covariate information.

Let (Xi, Si²) be the pair of direct survey estimates and their sampling variances for the ith area, i = 1, ..., n. Let Zi = (Zi1, ..., Zip)^T be the set of covariates available at the estimation stage and β = (β1, ..., βp)^T be the associated regression coefficients. We propose the following hierarchical model:

Xi | θi, σi² ∼ Normal(θi, σi²),
θi ∼ Normal(Zi^T β, τ²),  i = 1, ..., n,    (3.1)

(ni − 1) Si² / σi² ∼ χ²_{ni−1},
σi^{-2} ∼ Gamma{α, γ},  i = 1, ..., n,    (3.2)

where B = (α, γ, β^T, τ²)^T, referred to as the structural parameters, are unknown and ni represents the sample size for a simple random sample (SRS) from the ith area. For a complex survey design the degrees of freedom of the chi-square distribution need to be determined carefully (e.g., Maples and Huang 2009). A small simulation sketch of this data-generating mechanism is given below.
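To make the data-generating mechanism in (3.1)-(3.2) concrete, the following R sketch simulates one data set from the model. The values of n, ni, α, γ, β, τ² and the single covariate are illustrative choices only; Gamma{α, γ} is read as shape α and scale γ, matching the density used later in (3.3).

# Minimal sketch: simulate (X_i, S_i^2) from the hierarchical model (3.1)-(3.2).
set.seed(3)
n <- 30; ni <- 9
alpha <- 3; gam <- 0.5; beta <- c(1, 2); tau2 <- 1
Z <- cbind(1, rnorm(n))                                        # covariates Z_i (with intercept)

sigma2 <- 1 / rgamma(n, shape = alpha, scale = gam)            # sigma_i^{-2} ~ Gamma{alpha, gamma}
theta  <- rnorm(n, mean = drop(Z %*% beta), sd = sqrt(tau2))   # theta_i ~ N(Z_i' beta, tau^2)
X      <- rnorm(n, mean = theta, sd = sqrt(sigma2))            # X_i | theta_i, sigma_i^2
S2     <- sigma2 * rchisq(n, df = ni - 1) / (ni - 1)           # (n_i - 1) S_i^2 / sigma_i^2 ~ chi^2_{n_i-1}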
Note that the chi-square distribution for the sample 2 variance is valid for only a random sample. Note that σi are the sampling variances of the 2 Xi ’s and are usually estimated by Si ’s. You and Chapman (2006) did not consider the second level of sampling variance modelling. Thus their model can be treated as the Bayesian version 55 of the models considered in Rivest and Vandal (2003) and Wang and Fuller (2003). The second level of (3.2) might be further extended as Gamma{γ, exp(Z T β 2 )/γ} to accommodate i covaraite information in the variance modeling. The inference can be made from the condi2 tional distribution of θi (the parameter of interest) given the data (Xi , Si , Z i ), i = 1, · · · , n. However, this does not have a closed form expression. We adopted rejection sampling to overcome the situation. 3.2.2 Estimation of the Small Area Parameters 2 2 If the parameters B are known, the joint distribution of {Xi , Si , θi , σi } is (X − θi )2 1 1 2 2 exp{− i } π(Xi , Si , θi , σi |B) = √ 2 2 2σi Γ{(ni − 1)/2}2(ni −1)/2 2πσi ] ) [ ( 2 2 (ni −1)/2−1 (ni − 1)Si Si ni − 1 1 √ exp{− (ni − 1) } 2 2 2 σi 2σi σi 2πτ 2 ( )α+1 T (θi − Zi β)2 1 1 1 exp{− } exp{− } α σ2 2 Γ(α)γ 2τ 2 γσi i { } 2 (θ − Z T β)2 (Xi − θi )2 (ni − 1)Si 1 i i ≈ exp − − − − 2 2 2 2τ 2 2σi 2σi γσi ( ) ni +α+1 ( ) 1 2 1 1 2 1 × 2 Γ(α)γ α τ2 σi [ { } ] T 2 (θi − Zi β)2 (Xi − θi )2 (ni − 1)Si 1 1 = exp − − + + 2 2 2 γ σ2 2τ i ( ) ni +α+1 2 1 1 √ . (3.3) × 2 σi τ 2 Γ(α)γ α 56 2 2 Therefore the conditional distribution of σi and θi given the data (Xi , Si ), i = 1, · · · , n and B are [ { } ] T (Xi −Zi β)2 (ni −1) 2 1 1 exp − − 2 Si + γ σ 2 2 2(σi +τ 2 ) 2 |X , S 2 , B) ∝ i , π(σi i i 2 )(ni −1)/2+α+1 (σ 2 + τ 2 )1/2 (σi i { } T β)2 (θi − Zi −(n /2+α) 2 π(θi |Xi , Si , B) ∝ exp − ψi i , 2 2τ (3.4) { } 2 + 0.5(n − 1)S 2 + 1 . ψi ≡ 0.5(Xi − θi ) i i b (3.6) (3.5) where 2 Note that the above conditionals are obtained by integrating out θi and σi respectively from the joint distribution (3.3). From now on we will borrow the notations from Booth and Hobert (1998) for determining all related stochastic distributions. In this context, a meaningful point estimator for θi is its conditional mean, 2 2 θi (B; Xi , Si ) = EB (θi |Xi , Si ), (3.7) where EB represents the expection with respect to the conditional distribution of θi with known B. A sensible prediction error can be measured by the posterior variance 2 2 νi (B; Xi , Si ) = varB (θi |Xi , Si ). 57 (3.8) Neither (3.7) nor (3.8) has any closed form expression. Therefore, we have numerically computed using the following approach. 2 1. Generate R random numbers from π(θi |Xi , Si , B) with rejection sampling method: (c) T 1a. Generate a candidate random sample θi from a Normal(Zi β, τ 2 ); 1b. Generate a uniform random number U ; 1c. If     θi −Xi 2 −(ni /2+α) ( G )  x U < 1+  ni + 2α − 1    2 where G2 = {(ni − 1)Si + 2/γ}/(ni + 2α + 1), then accept the candiate random x (r) sample as θi ; otherwise go back to 1a; 1d. Repeat 1a to 1c R times. 2. Approximate θi and νi by R 1 ∑ (r) ˜ θi = θi R r=1 and R 1 ∑ (r) ˜ 2 νi = ˜ (θi − θi ) R r=1 In pratice B is unknown. We estimate them by maximizing the marginal likelihood and the ˆ details are given in the next section. Let B denote the corresponding estimator. Substituting ˆ B by B in formulas (3.7) and (3.8) will produce the estimates of θi and νi : 2 2 ˆ ˆ θi = θi (B; Xi , Si ) = E ˆ (θi |Xi , Si ) B 2 2 ˆ νi = νi (B; Xi , Si ) = var ˆ (θi |Xi , Si ). 
ˆ B 58 (3.9) (3.10) The estimator (3.9) is popularly known as the empirical Bayes estimator and (3.10) is the estimated Bayes risks. It is well known that the νi only consider the variability in the ˆ prediction procedure, but not the variability due to the parameter estimation (B). We accounted this additional variability by adapting the technique of Booth and Hobert (1998) in this set up. The details are discussed in Section 3.3. 3.2.3 Estimation of the Stuctural Parameters In pracice, B is unknown. We obtain the maximum likelihood estimate of B by maximizing the marginal likelihood LM = ∏n M 2 n i=1 Li of {(Xi , Si , Z i )i=1 ; B}, where ) (n { } ni ∫ T β)2 − 2i +α Γ( + α) (θi − Z i exp − LM ∝ Constanti · √ 2 ψi dθi . i 2τ 2 τ 2 Γ(α)γ α Therefore, the log-likelihood is log(LM ) = n ∑[ n 1 logConstanti + log{Γ( i + α)} − log{Γ(α)} − logτ 2 − 2 2 i=1 (n )   } { ∫ T − 2i +α (θi − Zi β)2 ψi αlog(γ) + log  exp − dθi  2τ 2 The maximizing marginal likelihood equation is ∂log(LM ) =0 ∂B where LM is the complete data likelihood and LM = 59 ∏n M i=1 Li . (3.11) The detailed expression of the partial derivative corresponding to β is: (n ) { } Tβ T β)2 − 2i +α ∫ θi −Zi (θ −Z Z exp − i i ψi dθi n ∑ τ2 2τ 2 ∂log(LM ) (n ) = { } ∂β T β)2 − 2i +α ∫ (θ −Z i=1 exp − i i ψi dθi 2τ 2 ( ) n T ∑ θi − Zi β = E Z τ2 i=1 (3.12) 2 where the expectation corresponds to conditional distribution of θi , π(θi |Xi , Si , B). The estimate of β is obtained by solving ∂log(LM )/∂β = 0, and we obtain ˆ β=( n ∑ Z i Z T )−1 { i i=1 n ∑ Z i E(θi )}. (3.13) i=1 The expression of the partial derivative corresponding to τ 2 is: ∂log(LM ) n =− 2 ∂τ 2τ 2 =− n 2τ 2 ) (n { } T T β)2 − 2i +α ∫ (θi −Zi β)2 (θi −Zi ψi dθi n 2 )2 exp − ∑ 2(τ 2τ 2 ) (n + { } T β)2 − 2i +α ∫ (θ −Z i=1 exp − i i ψi dθi 2τ 2 ( ) n T ∑ (θi − Zi β)2 + E (3.14) 2(τ 2 )2 i=1 Then the estimate of τ 2 is obtained by solving ∂log(LM )/∂τ 2 = 0, and we obtain τ2 = ˆ n ∑ E(θi − Z T β)2 /n. i i=1 60 (3.15) Similarly, we estimate α and γ by solving Sα = 0 and Sγ = 0 where ∂Elog(LM ) ∂α n n ∑ ∑ n = log ′ {Γ( i + α)} − nlog ′ {Γ(α)} − nlog(γ) − E{log(ψi )} 2 i=1 i=1 ∂Elog(LM ) Sγ = ∂γ ( ) n nα 1 ∑ ni 1 =− + ( + α)E γ 2 ψi γ2 i=1 Sα = (3.16) (3.17) To solve equations (3.13), (3.15), (3.16) and (3.17), we use the EM algorithm. At the E-step of the tth iteration, we calculate the expectation with respect to the conditional 2 density of θi , π(θi |Xi , Si , B (t−1) ), i = 1, . . . , n, with B (t−1) denoting the estimate of B at the (t − 1)th iteration. The rejection sampling method is used to generate samples from the conditional distribution of θi ’s. At the M-step of the tth iteration, we maximize LM by solving equations (3.13), (3.15), (3.16) and (3.17) conditional on the values of expectations obtained in the E-step. 
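The rejection sampler of Section 3.2.2 (steps 1a-1d), which also supplies the E-step draws used above, can be sketched in R as follows. The acceptance rule is the one stated in step 1c; the function works on a single area's data with the structural parameters treated as given, and the trailing example values are placeholders.

# Minimal sketch of the rejection sampler for theta_i (steps 1a-1d of Section 3.2.2).
draw_theta <- function(R, Xi, S2i, Zi, ni, alpha, gam, beta, tau2) {
  Gx2 <- ((ni - 1) * S2i + 2 / gam) / (ni + 2 * alpha + 1)     # G_x^2 from step 1c
  out <- numeric(R)
  for (r in seq_len(R)) {
    repeat {
      cand <- rnorm(1, mean = sum(Zi * beta), sd = sqrt(tau2)) # step 1a: N(Z_i' beta, tau^2) proposal
      u <- runif(1)                                            # step 1b: uniform draw
      acc <- (1 + ((cand - Xi)^2 / Gx2) / (ni + 2 * alpha - 1))^(-(ni / 2 + alpha))
      if (u < acc) { out[r] <- cand; break }                   # step 1c: accept or retry
    }
  }
  out
}
# Monte Carlo approximations of (3.7) and (3.8) for one area (placeholder inputs):
# draws <- draw_theta(1000, Xi = 2.3, S2i = 1.5, Zi = c(1, 0.4), ni = 9,
#                     alpha = 3, gam = 0.5, beta = c(1, 2), tau2 = 1)
# theta_hat <- mean(draws); nu_hat <- mean((draws - theta_hat)^2)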
To solve these equations, we also need ] ∑n [ ′′ n log {Γ( 2i + α)} − log ′′ {Γ(α)} + V ar{log(ψi )} i=1 [ ( ) { }] ∑n 1 + 1 E 1 − ( ni + α) 1 Cov 1 , log(ψ ) Sαγ = i=1 − γ i 2 ψi ψi γ2 γ2 Sαα = Sγα = Sαγ ∑n i=1 { α − (n + 2α) 1 E i γ2 γ3 ( )} n +( 2i + α)2 1 V ar 1 ψi γ4 Sγγ = 61 ( 1 ψi ) ) ( n + ( 2i + α) 1 E γ4 1 2 ψi (3.18) Then α and γ can be estimated by iterative Newton-Raphson method:  l  (l−1)  −1   α α  Sαα Sαγ   Sα  −   =     γ γ Sγα Sγγ Sγ 3.3 Prediction Error Calculation 3.3.1 Mean Squared Error of Prediction Following the definition of conditional mean squared prediction error (CMSEP) of Booth and Hobert (1998), the prediction error variance is, 2 2 2 ˆ CM SEP (B; Xi , Si ) = EB [{θi − θi (B; Xi , Si )}2 |Xi , Si ] 2 2 ˆ where θi and θi (B; Xi , Si ) are as defined in (3.9) and (3.7). Since θi (B; Xi , Si ) − θi and 2 2 ˆ θi − θi (B; Xi , Si ) are conditionally independent given Xi and Si , 2 2 2 2 ˆ CM SEP (B; Xi , Si ) = varB (θi |Xi , Si ) + EB [{θi − θi (B; Xi , Si )}2 |Xi , Si ] (3.19) 2 2 = νi (B; Xi , Si ) + ci (B; Xi , Si ) 2 where ci (B; Xi , Si ) is the correction term due to the estimation of unknown parameters B. The correction contribution is of order Op (n−1 ). Note that the above measure is still not usable because it involved the unknown structural parameters B. It’s natural to plug-in the 62 ˆ estimate B of B and get a usable measure of mean squared prediction error 2 2 2 ˆ ˆ ˆ CM SEP i = M SE(B; Xi , Si ) = νi (B; Xi , Si ) + ci (B; Xi , Si ) = νi + ci ˆ ˆ As it will be clear in the next section (as well as from small area estimation literature) the estimator (3.19) has considerable bias. Typically the order of bias is Op (n−1 ) due to estimation of νi by νi . ˆ ˆ Bias Correction for νi (B; Xi , Si2 ) 3.3.2 2 ˆ For the bias correction of the estimated conditional variance νi (B; Xi , Si ) we expand this about B: 2 ∂νi (B; Xi , Si ) 2 2 ˆ ˆ νi = νi (B; Xi , Si ) = νi (B; Xi , Si ) + (B − B)T ˆ ∂B 2) ∂νi (B; Xi , Si 1 ˆ ˆ + (B − B)T (B − B) + Op (n−2 ). T 2 ∂B∂B (3.20) 2 ˆ Then the approximated bias involved in νi (B; Xi , Si ) is ˆ E(B − B) 2 ∂νi (B; Xi , Si ) ∂B 2 1 ∂νi (B; Xi , Si ) −1 + tr{ I (B)}, 2 ∂B∂B T where I(B) is the Fisher’s information matrix obtained from the marginal likelihood LM , [ ] ∂2 2 Isr = −E log − likelihood(Xi , Si ) ∂B s ∂B r 63 The second order derivative of the log-likelihood is given in section 3.2.3 (3.18) and appendix 3.5.1 and can be used to compute the information matrix. Handling the bias analytically is difficult. Booth and Hobert (1998) adopted bootstrap bias correction instead. This is due to the fact that there is no closed form expressions available for this bias terms. The bootstrap method requires repeated estimation of model parameters based on re-sampled data. This often pauses practical and computational difficulties in this hierarchical model. As we will see in the next subsection handling the second term in (3.19) is also difficult due to the same difficulty of not having any closed form expression. Thus, if we have to do the bootstrap for the bias correction of νi , the estimation of ci can also be done in the same run. It is not necessary to obtain any analytical approximations. The resampling techniques has been used in Jiang et al. (2002), Hall and Maiti (2006). We will be applying the rejection sampling method instead where there is no need of estimating model parameters repeatedly. ˆ Let B be the maximum likelihood estimator of B as proposed in the previous section. 
( ) ˆ Following Cox & Snell (1968, Equation 20), we approximate E B − B up to O(n−1 ). (·) (i) Define I −1 = (I rs ) with as the inverse of I = (Irs ), where Irs = E(−Vrs ) and Vrs = ˆ ∂ 2 logLM /∂B r ∂B s . The bias in the sth element of B is i 1 ∑ ∑ ∑ rs tu I I (Krtu + 2Jt,ru ) 2 r t u ∂ 3 logLM (·) (i) i Krst = E(Wrst ), Wrst = , ∂B r ∂B s ∂B t M ∑ (i) (i) (i) ∂logLi Ur = Jr,st = E{ Ur Vst }, ∂B r ˆ E(B s − B s ) ≈ (3.21) 2 where the LM is the marginal likelihood of (Xi , Si ) defined in the previous section. The i 64 detailed formulas are given in appendix. With ( ) { } ∂νi T ∂νi ∂νi ∂νi ∂νi = , , , ∂B ∂α ∂γ ∂β ∂τ 2 (3.22) 2 2 and νi (B; Xi , Si ) = varB (θi |Xi , Si ), we have ∂νi ∂α ∂νi ∂γ ∂νi ∂β ∂νi ∂τ 2 2 = −Cov ∗ {θi , log(ψi )} + 2E ∗ θi Cov ∗ {θi , log(ψi )} { } ni 1 ∗ (θ2 , 1 ) − 2E ∗ θ Cov ∗ (θ , 1 ) = ( + α) Cov i i i ψ 2 ψi γ2 i { } 1 2 = Cov ∗ (θi , θi ) − 2E ∗ θi Cov ∗ (θi , θi ) τ2 } 1 { 2 = Cov ∗ {θi , (θi − β)2 } − 2E ∗ θi Cov ∗ {θi , (θi − β)2 } 2(τ 2 )2 where the ∗ means that the expection, varirance and covariance that are calculated with respect to the conditional distribution of θi at the estimated parameters’ value. The ap2 proximated expression of ∂νi (B; Xi , Si )/∂B∂B T is given in the appendix. The expection, variance and covariance are computed based on the Monte Carlo approximation. For exam∑ (r) (1) (R) ˆ ples, E ∗ (θi ) = R θi /R, where (θi , · · · , θi ) are random numbers generated from r=1 the conditional distribution of θi . Therefore, 2 ∂ν (B; Xi , Si ) −1 ˆ X , S 2 ) − (B − B) ∂νi − 1 tr{ i ˆ νi ≈ νi (B; i i ˆ I (B)}. ∂B 2 ∂B∂B T This expression is second order correct meaning the bias is of order op (n−1 ). 65 3.3.3 Approximation of ci (B; Xi , Si2 ) 2 2 ˆ The definition of ci (B; Xi , Si ) is given in the previous section by ci (B; Xi , Si ) = EB [{θi − 2 2 2 2 2 ˆ ˆ θi (B; Xi , Si )}2 |Xi , Si ] where θi − θi (B; Xi , Si ) = θi (B; Xi , Si ) − θi (B; Xi , Si ). Using ˆ the Taylor series expansion and ignoring the term Op (|B − B|2 ) we write 2 2 2 ˆ ˆ θi (B; Xi , Si ) − θi (B; Xi , Si ) = AT (B; Xi , Si )(B − B) i (3.23) where 2 ∂θi (B; Xi , Si ) 2 AT (B; Xi , Si ) = i ∂B ( ) 2 2 2 2 ∂θi (B; Xi , Si ) ∂θi (B; Xi , Si ) ∂θi (B; Xi , Si ) ∂θi (B; Xi , Si ) = , , , ∂α ∂γ ∂β ∂τ 2 2 2 Since θi (B; Xi , Si ) = EB (θi |Xi , Si ), the components of Ai are 2 ∂θi (B; Xi , Si ) = E ∗ (θi )E ∗ {log(ψi )} − E ∗ {θi log(ψi )} = −Cov ∗ {θi , log(ψi )} ∂α { ( ) ( )} 2 ∂θi (B; Xi , Si ) ni n 1 1 1 ∗ θi − E ∗ (θ )E ∗ 1 = ( + α) E = ( i + α) Cov ∗ (θi , ) i 2 2 ∂γ 2 ψi ψi 2 ψi γ γ 2 ] ∂θi (B; Xi , Si ) 1 [ ∗ 2 1 E (θi ) − {E ∗ (θi )}2 = = var∗ (θi ) 2 2 ∂β τ τ 2) ] [ ∂θi (B; Xi , Si 1 ∗ {θ (θ − β)2 } − E ∗ (θ )E ∗ (θ − β)2 = 1 Cov ∗ {θ , (θ − β)2 } = E i i i i i i ∂τ 2 2(τ 2 )2 2τ 2 As a consequence of (3.23) 2 2 2 ci (B; Xi , Si ) ≈ AT (B; Xi , Si )I −1 (B)Ai (B; Xi , Si ) i 66 (3.24) ˆ and the approximation is correct up to Op (n−1 ). Substituting B into formula (3.24) will yield an estimate of the correction term, 2 2 2 ˆ ˆ ˆ ˆ ci (B; Xi , Si ) ≈ AT (B; Xi , Si )I −1 (B)Ai (B; Xi , Si ). i ˆ As the estimated information matrix is {I(B)}−1 = Op (n−1 ), the error in the approxima2 ˆ tion is op (n−1 ). Similar to the case of νi , the items in Ai (B; Xi , Si ) do not have closed forms. They can be approximated by the samples generated in the estimation procedure. Summing up all the derivations and approximations we obtain the following result. ˆ Theorem. 
The estimated conditional mean squared error of prediction for θi is

CMSEP_i = νi(B̂; Xi, Si²) − (B̂ − B)^T ∂νi(B; Xi, Si²)/∂B − (1/2) tr{ [∂²νi(B; Xi, Si²)/∂B∂B^T] I^{-1}(B̂) } + Ai^T(B̂; Xi, Si²) I(B̂)^{-1} Ai(B̂; Xi, Si²).    (3.25)

The formula is second order correct in the sense that it has a bias of Op(n^{-2}).

3.4 Simulation Study

Simulation design. To check the finite sample performance of the proposed estimators, a simulation set-up closely related to Wang and Fuller (2003) was considered. To simplify the simulation, we do not use any covariate Z; only (Xi, Si²) are generated. First, generate observations for each small area using the model

Xij = β + ui + eij,  j = 1, ..., ni,  i = 1, ..., n,

where ui ∼ Normal(0, τ²) and eij ∼ Normal(0, ni σi²). Then the random effects model for the small area mean is

Xi = β + ui + ēi,  i = 1, ..., n,    (3.26)

where Xi = X̄i· = ni^{-1} Σ_{j=1}^{ni} Xij and ēi = ēi· = ni^{-1} Σ_{j=1}^{ni} eij. Therefore Xi ∼ Normal(θi, σi²), where θi = β + ui, θi ∼ Normal(β, τ²), and ēi ∼ Normal(0, σi²). We estimated the mean θi for each small area, i = 1, ..., n. We estimated σi² with the unbiased estimator

Si² = (1/ni) · [1/(ni − 1)] Σ_{j=1}^{ni} (Xij − X̄i·)².    (3.27)

It is to be noted that (ni − 1) Si² / σi² ∼ χ²_{ni−1}. Like Wang and Fuller (2003), we set all ni equal to m, which eased our programming efforts. The sampling variances were nevertheless unequal: one-third of the σi² were set equal to 1, one-third equal to 4, and one-third equal to 16. In the simulation, we set β = 10 and took three different values of τ², namely 0.5, 1 and 4. For each value of τ², we generated 1,000 samples for each of the combinations (m, n) = (9, 36), (18, 180). In Table 3.1 we present the mean (Est.) and the empirical standard deviation (SD) of the estimates of β and τ². The results are consistent with the large sample theory of maximum likelihood estimation.

Method of analysis. We also compute the estimates based on the Wang and Fuller approach. Table 3.2 and Table 3.3 provide the numerical results for comparing the two methods: (I) the proposed method and (II) Wang and Fuller (2003). All the results are averaged over areas within the group having the same sampling variance. We calculated the bias and the prediction mean squared error of θ̂i based on 1,000 replications. We used empirical measures of relative bias and coefficient of variation to quantify the performance of the different MSE estimators.

Table 3.1: Results of the simulation study: estimate (Est.) and empirical standard deviation (SD) for β and τ². We set β = 10.

For β:
                Proposed Method        Wang and Fuller (2003)
τ²              Est.       SD          Est.       SD
n = 36, m = 9
0.5             9.9989     0.3715      10.0031    0.3355
1               10.0053    0.3772      10.0079    0.3684
4               9.9978     0.4950      9.9992     0.4967
n = 180, m = 18
0.5             9.9978     0.1676      10.0015    0.1387
1               9.9987     0.1660      10.0013    0.1532
4               9.9993     0.2134      9.9996     0.2127

For τ²:
                Proposed Method        Wang and Fuller (2003)
τ²              Est.       SD          Est.       SD
n = 36, m = 9
0.5             0.5978     0.3156      1.2227     0.9973
1               1.1190     0.4983      1.6468     1.1591
4               4.0449     1.5113      4.5154     2.0121
n = 180, m = 18
0.5             0.5211     0.0381      0.7767     0.3783
1               1.0451     0.1190      1.2757     0.4618
4               4.1092     0.5812      4.2707     0.8718

The relative bias of an MSE estimator was defined by

RBi = [ E{M̂SEi} − MSEi ] / MSEi    (3.28)

for i = 1, ..., 3, where E{M̂SEi} was estimated empirically as the average of the values of M̂SEi over replications, and MSEi was defined as the average value of (θ̂i − θi)².
The coefficient of variation of the MSE estimator was taken to be 1 [E{M SE i − M SEi }2 ] 2 CVi = M SEi for i = 1, ..., 3. 69 (3.29) Table 3.2: Results of MSE estimator, n = 36, m = 9 τ 2 = 0.5 τ2 = 1 τ2 = 4 2 σi I II I II I II Bias 1 0.0018 -0.0024 0.0098 0.0157 0.0032 0.0070 4 -0.0061 -0.0001 0.0001 -0.0011 0.0066 0.0056 16 -0.0208 -0.0235 0.0173 0.0198 0.0365 0.0361 MSE 1 0.4281 0.6532 0.5588 0.6352 0.8786 0.8583 4 0.6658 1.0389 0.9718 1.1982 2.2183 2.3032 16 0.6785 1.0215 1.1576 1.4069 3.6549 3.8209 RB 1 0.2561 0.6044 0.2364 0.2885 0.2922 0.0009 4 0.0016 1.1227 -0.0544 0.4850 -0.0557 0.0363 16 0.0508 1.7802 -0.1246 0.6538 -0.1120 0.0450 CV 1 0.6327 0.9124 0.5246 0.5963 0.4925 0.4293 4 0.6908 1.4339 0.4382 0.7895 0.3309 0.3697 16 0.9411 2.2498 0.5212 1.0594 0.3272 0.3721 Table 3.3: Results of MSE estimator, n = 180, m = 18 τ 2 = 0.5 τ2 = 1 τ2 = 4 2 σi I II I II I II Bias 1 0.0009 0.0001 -0.0017 -0.0000 -0.0007 -0.0005 4 -0.0034 -0.0053 0.0016 0.0040 0.0049 0.0053 16 0.0004 -0.0014 0.0034 0.0053 0.0048 0.0045 MSE 1 0.3524 0.3931 0.5146 0.5383 0.8121 0.8119 4 0.4742 0.5226 0.8434 0.8897 2.0940 2.1194 16 0.5110 0.5292 0.9612 0.9866 3.2698 3.3052 RB 1 0.0696 0.3926 0.1237 0.1529 0.1847 0.0245 4 -0.0216 0.5509 0.0044 0.2046 -0.0190 0.0231 16 -0.0324 0.6202 0.0295 0.2717 -0.0043 0.0639 CV 1 0.1791 0.5482 0.2228 0.3309 0.3128 0.2939 4 0.1142 0.7653 0.1323 0.3987 0.1833 0.2150 16 0.1045 0.8875 0.1263 0.5028 0.1389 0.2006 70 Simulation results The proposed method has considerable lower prediction mean squared errors. The gain is a maximum when the ratio of sampling variance to model variance is the largest. The reduction is almost half when the number of small areas is 36. The relative bias is always less than Wang and Fuller (2003) except the case τ 2 = 4, σi = 1 (high model variance, low sampling variance). The proposed method might have low underestimation (about 5%) in case of n = 36. On the other hand, the Wang Fuller method can have very large overestimation. In terms of coefficient of variation the proposed method outperforms Wang and Fuller (2003). 3.5 3.5.1 Appendix: Matrix Calculation Results ˆ Computation of B − B In this section of the appendix, the detailed expression of equations (3.21) are given. The (i) log-likelihood is already given in the section 3.2.3. The first order derivative terms, Ur , are given in the formulas (3.16), (3.17), (3.12) and (3.14) in section 3.2.3. (i) (i) (i) For the second order derivative, Vαα , Vγγ , and Vαγ are given in equation (3.18) as Sαα , Sγγ , and Sαγ . 
The other terms are 1 1 (i) =− + V ar(θi − β) ββ 2 (τ 2 )2 τ 1 1 1 (i) V 2 2 = − E(θi − β)2 + V ar((θi − β)2 ) τ τ 2(τ 2 )2 (τ 2 )3 4(τ 2 )4 V 71 The other cross terms of the second order derivative are 1 (i) V = − Cov(log(ψi ), θi − β) αβ τ2 1 (i) V 2 =− Cov(log(ψi ), (θi − β)2 ) 2 )2 ατ 2(τ n 1 1 (i) V = ( i + α) Cov( , θi − β) γβ 2τ 2 2 ψi γ n 1 1 (i) Cov( , (θi − β)2 ) V 2 = ( i + α) 2 (τ 2 )2 2 ψi γτ 2γ 1 1 (i) V 2 =− E(θi − β) + Cov(θi − β, (θi − β)2 ) 2 )2 2 )3 βτ (τ 2(τ The third derivative terms are: n (i) Wααα = logΓ′′′ ( i + α) − logΓ′′′ (α) − Cov{log(ψi ), log2 (ψi )} 2 −2Elog(ψi )V ar{log(ψi )} ) ( ni 6 6 2 2α 1 1 1 (i) + Wγγγ = − + ( + α) E − E E 2 γ3 γ 4 ψi γ 5 ψ 2 γ 6 ψ 3 i { }i n 6 1 1 1 1 −( i + α)2 ) V ar( ) + Cov( , 2 ψi ψi ψ 2 γ5 γ6 i { } ni 1 1 1 1 1 −( + α)3 Cov( , ) − 2E V ar( ) 2 ψi ψ 2 ψi ψi γ6 i ] 1 [ (i) Cov{θi − β, (θi − β)2 } − 2E(θi − β)V ar(θi − β) W = βββ (τ 2 )3 1 3 3 1 (i) W 2 2 2 =− + E(θi − β)2 − V ar{(θi − β)2 } + · τ τ τ (τ 2 )3 (τ 2 )4 2(τ 2 )5 8(τ 2 )6 [ ] Cov{(θi − β)4 , (θi − β)2 } − 2E(θi − β)2 V ar{(θi − β)2 } 72 The other cross terms of the third order derivative are ] [ ni 1 (i) 2 (ψ ), 1 } − 2Elog(ψ )Cov{log(ψ ), 1 } Wααγ = ( + α) Cov{log i i ψ i ψ 2 γ2 i i 2 1 − Cov{log(ψi ), } ψi γ2 [ ] 1 (i) W = Cov{log2 (ψi ), θi − β} − 2Elog(ψi )Cov{log(ψi ), θi − β} ααβ τ2 ] 1 [ (i) 2 (ψ ), (θ − β)2 } − 2Elog(ψ )Cov{log(ψ ), (θ − β)2 } W = Cov{log i i i i i αατ 2 2(τ 2 )2 [ ] ni 1 1 2 1 1 1 1 1 (i) Wαγγ = − E − E + ( + α) 2V ar( ) − Cov{log(ψi ), } 2 2 ψi γ 2 γ 3 ψi γ 4 ψ 2 γ4 ψi i [ ni 1 1 1 1 1 Cov{log(ψi ) · , } − Cov{log(ψi ), }E −( + α)2 2 ψi ψi ψi ψi γ4 ] 1 −Elog(ψi )V ar( ) ψi [ ni 1 1 1 1 (i) = W Cov( , θi − β) − ( + α) Cov{log(ψi ) · , θi − β}− αγβ ψi 2 ψ γ 2τ 2 γ 2τ 2 ] i 1 1 Cov{log(ψi ), θi − β}E − Elog(ψi )Cov( , θi − β) ψi ψi ni [ ( 2 + α) 1 1 1 (i) 2 2 W 2 = 2γ 2 (τ 2 )2 Cov{ ψ , (θi − β) } − 2γ 2 (τ 2 )2 Cov{log(ψi ) · ψ , (θi − β) } αγτ i i ] 1 1 −Cov{log(ψi ), (θi − β)2 }E − Elog(ψi )Cov{ , (θi − β)2 } ψi ψi [ ] 1 (i) 2 } − 2E(θ − β)Cov{log(ψ ), θ − β} W =− Cov{log(ψi ), (θi − β) i i i αββ (τ 2 )2 1 1 [ (i) W = Cov{log(ψi ), θi − β} − Cov{log(ψi ) · (θi − β), (θi − β)2 }− αβτ 2 (τ 2 )2 2(τ 2 )3 ] Cov{log(ψi ), (θi − β)2 }E(θi − β) − Elog(ψi )Cov{θi − β, (θi − β)2 } 1 1 [ (i) W 2 2 = Cov{log(ψi ), (θi − β)2 } − Cov{log(ψi ) · (θi − β)2 , (θi − β)2 } 2 )3 2 )4 ατ τ 4(τ 4(τ ] 2 }E(θ − β)2 − Elog(ψ )V ar{(θ − β)2 } −Cov{log(ψi ), (θi − β) i i i 73 n n 2 1 1 1 (i) W = −( i + α) Cov(θi − β, ) + ( i + α) Cov(θi − β, ) γγβ 2 2 ψi 2 γ3τ 2 γ 4τ 2 ψi { } ni 1 1 1 2 1 ) − 2E Cov(θi − β, ) +( + α) Cov(θi − β, 2 2 ψi ψi γ 4τ 2 ψi ni n ( 2i + α) (i) 2 , 1 } + ( 2 + α) Cov{(θ − β)2 , 1 } W =− Cov{(θi − β) i 2 ψi γγτ 2 γ 3 (τ 2 )2 2γ 4 (τ 2 )2 ψi ] [ n ( 2i + α)2 1 1 1 + } − 2E Cov{(θi − β)2 , } Cov{(θi − β)2 , 2 4 (τ 2 )2 ψi ψi 2γ ψi [ n 1 1 1 (i) W = ( i + α) Cov{ · (θi − β), θi − β} − Cov( , θi − β)E(θi − β) γββ 2 ψi ψi γ 2 (τ 2 )2 ] 1 −E V ar(θi − β) ψi ni n [ ( 2 + α) ( 2i + α) 1 1 (i) 2 W 2 = − γ 2 (τ 2 )2 Cov( ψ , θi − β) + 2γ 2 (τ 2 )3 Cov{ ψ · (θi − β), (θi − β) } γβτ i i ] 1 1 2 }E(θ − β) − E Cov{θ − β, (θ − β)2 } −Cov{ , (θi − β) i i i ψi ψi n n [ ( 2i + α) ( 2i + α) 1 1 (i) 2} + W 2 2 =− Cov{ , (θi − β) Cov{ · (θi − β)2 , (θi − β)2 } 2 (τ 2 )3 2 (τ 2 )4 ψi ψi γτ τ γ 4γ ] 1 2 }E(θ − β)2 − E 1 V ar{(θ − β)2 } −Cov{ , (θi − β) i i ψi ψi 1 2 1 [ (i) W = V ar{(θi − β)2 } − V ar(θi − β) + 2 2 )2 (τ 2 )3 2 )4 ββτ (τ 2(τ ] −2E(θi − β)Cov{(θi − β)2 , θi − β} 2 2 (i) W 2 2 = E(θi − β) − Cov{(θi − β)2 , θi − β} 2 )3 2 )4 βτ τ (τ (τ [ ] 1 3 , (θ − β)2 
} − V ar{(θ − β)2 }E(θ − β) + Cov{(θi − β) i i i 4(τ 2 )5 All the other terms are equal to the above values according to symmetry. 74 3.5.2 Second Order Correction of νi In the equation (3.20), the second order derivative of νi is included. The detailed formula is given in the following: ∂ 2 νi 2 = − ∂ Cov(θi , logψi ) + 2 ∂ Eθi · Cov(θi , logψi ) + 2Eθi ∂ Cov(θi , logψi ) ∂α ∂α ∂α ∂α2 ∂ Cov(θ2 , logψ ) = −Cov(θ2 logψ , logψ ) + Cov(θ2 , logψ )Elogψ + Eθ2 Cov(logψ , logψ ) i i i i i i i i i i i ∂α ∂ Eθ = −Cov(θ , logψ ) i i ∂α i ∂ Cov(θ , logψ ) = −Cov(θ logψ , logψ ) + Cov(θ , logψ )Elogψ + Eθ Cov(logψ , logψ ) i i i i i i i i i i i ∂α ∂ 2 νi 2 = − ∂ Cov(θi , logψi ) + 2 ∂ Eθi · Cov(θi , logψi ) + 2Eθi ∂ Cov(θi , logψi ) ∂α∂γ ∂γ ∂γ{ ∂γ ∂ Cov(θ2 , logψ ) = (n /2 + α) 1 Cov(θ2 logψ , 1 ) − Cov(θ2 , 1 )Elogψ i ψ i i i i i i ψi ∂γ γ2 i } 2 2 −Eθi Cov(logψi , 1 ) − 1 Cov(θi , 1 ) ψi ψi γ2 ∂ Eθ = (n /2 + α) 1 Cov(θ , 1 ) i i ψ ∂γ i γ2 i{ ∂ Cov(θ , logψ ) = (n /2 + α) 1 Cov(θ logψ , 1 ) − Cov(θ , 1 )Elogψ i i ψ i i i i ψ i ∂γ γ2 i i } −Eθi Cov(logψi , 1 ) − 1 Cov(θi , 1 ) ψi ψi γ2 ∂ 2 νi 2 = − ∂ Cov(θi , logψi ) + 2 ∂ Eθi · Cov(θi , logψi ) + 2Eθi ∂ Cov(θi , logψi ) ∂α∂β ∂β ∂β ∂β { ∂ Cov(θ2 , logψ ) = 1 Cov(θ2 logψ , θ − β) − Cov(θ2 , θ − β)Elogψ i i i i i i i i ∂β τ2 } 2 −Eθi Cov(logψi , θi − β) ∂ Eθ = 1 V ar(θ ) i ∂β i τ 2 ∂ Cov(θ , logψ ) = 1 {Cov(θ logψ , θ − β) − Cov(θ , θ − β)Elogψ i i i i i i i i ∂β τ2 } −Eθi Cov(logψi , θi − β) 75 ∂ 2 νi 2 = − ∂ 2 Cov(θi , logψi ) + 2 ∂ 2 Eθi · Cov(θi , logψi ) + 2Eθi ∂ 2 Cov(θi , logψi ) ∂α∂τ 2 ∂τ ∂τ ∂τ [ ∂ Cov(θ2 , logψ ) = 1 2 2 Cov{θi logψi , (θi − β)2 } − Cov{θi , (θi − β)2 }Elogψi i i ∂τ 2 2(τ 2 )2 ] 2 −Eθi Cov{logψi , (θi − β)2 } ∂ Eθ = 1 Cov{θ , (θ − β)2 } i i ∂τ 2 i 2(τ 2 )2 [ ∂ Cov(θ , logψ ) = 1 Cov{θi logψi , (θi − β)2 } − Cov{θi , (θi − β)2 }Elogψi i i ∂τ 2 2(τ 2 )2 ] −Eθi Cov{logψi , (θi − β)2 } { } ∂ 2 νi ni 2 Cov(θ2 , 1 ) − 2Eθ Cov(θ , 1 ) + ( ni + α) 1 · = −( 2 + α) 3 i i ψ i ψi 2 ∂γ 2 γ γ2 } i { ∂ Cov(θ2 , 1 ) − 2 ∂ Eθ · Cov(θ , 1 ) − 2Eθ ∂ Cov(θ , 1 ) i ψ i ∂γ i ψ i ψi ∂γ ∂γ i i i { } ∂ Cov(θ2 , 1 ) = ( ni + α) 1 Cov(θ2 1 , 1 ) − Cov(θ2 , 1 )E 1 − Eθ2 V ar( 1 ) i ψi i ψi ψi i ψi ψi i 2 ∂γ ψi γ2 2 + 1 Cov(θi , 12 ) γ2 ψi ∂ Eθ = (n /2 + α) 1 Cov(θ , 1 ) i i ψ ∂γ i γ2 { i } ni ∂ Cov(θ , 1 ) = ( + α) 1 Cov(θ 1 , 1 ) − Cov(θ , 1 )E 1 − Eθ V ar( 1 ) i ψ iψ ψ i ψ i 2 ∂γ ψi γ2 i i i i ψi + 1 Cov(θi , 12 ) γ2 ψi } ∂ Cov(θ2 , 1 ) − 2 ∂ Eθ · Cov(θ , 1 ) − 2Eθ ∂ Cov(θ , 1 ) i ψ i ∂β i ψ i ψi ∂β ∂β i i i { } ∂ Cov(θ2 , 1 ) = 1 Cov(θ2 1 , θ − β) − Cov(θ2 , θ − β)E 1 − Eθ2 Cov( 1 , θ − β) i ψi i ψi i i i i ∂β ψi ψi i τ2 ∂ Eθ = 1 V ar(θ ) i ∂β i τ 2 } { ∂ Cov(θ , 1 ) = 1 Cov(θ 1 , θ − β) − Cov(θ , θ − β)E 1 − Eθ Cov( 1 , θ − β) i ψ i i i iψ i ∂β ψi ψi i τ2 i i ∂ 2 νi n = ( 2i + α) 1 ∂γ∂β γ2 { 76 { } ∂ 2 νi ni 1 ∂ Cov(θ2 , 1 ) − 2 ∂ Eθ · Cov(θ , 1 ) − 2Eθ ∂ Cov(θ , 1 ) = ( 2 + α) 2 i ψ i ∂τ 2 i ψ i ψi ∂γ∂τ 2 γ ∂τ 2 ∂τ 2 i i i [ ∂ Cov(θ2 , 1 ) = 1 2 2 Cov{θi 1 , (θi − β)2 } − Cov{θi , (θi − β)2 }E 1 − i ψi ψi ψi ∂τ 2 2(τ 2 )2 ] 2 Eθi Cov{ 1 , (θi − β)2 } ψi 1 ∂ Eθ = 2 2 i 2(τ 2 )2 Cov{θi , (θi − β) } ∂τ [ ∂ Cov(θ , 1 ) = 1 Cov{θi 1 , (θi − β)2 } − Cov{θi , (θi − β)2 }E 1 − i ψ ψi ψi ∂τ 2 2(τ 2 )2 i ] Eθi Cov{ 1 , (θi − β)2 } ψi { } ∂ 2 νi ∂ Cov(θ2 , θ − β) − 2 ∂ Eθ · Cov(θ , θ − β) − 2Eθ · ∂ Cov(θ , θ − β) = 1 i i i ∂β i i i i ∂β i ∂β 2 τ 2 ∂β [ ∂ Cov(θ2 , θ − β) = 1 Cov{θ2 (θ − β), θ − β} − Cov(θ2 , θ − β)E(θ − β)− i i i i i i i i ∂β τ2 ] 2 Eθi V ar(θi − β) ∂ Eθ = Cov(θ , θ − β) i i ∂β i ∂ Cov(θ , θ − β) = 1 [Cov{θ (θ − β), θ − β} − Cov(θ , θ − β)E(θ − 
β)− i i i i i i i i ∂β τ2 ] Eθi V ar(θi − β) { } { ∂ 2 νi 1 ∂ 2 , θ − β) − 2Eθ · Cov(θ , θ − β) + 1 2 i i i 2 = − (τ 2 )2 Cov(θi i 2 ∂τ 2 Cov(θi , θi − β)− ∂β∂τ τ } 2 ∂ 2 Eθi · Cov(θi , θi − β) − 2Eθi · ∂ 2 Cov(θi , θi − β) ∂τ ∂τ [ 1 ∂ Cov(θ2 , θ − β) = 2 (θ − β), (θ − β)2 } − Cov{θ2 , (θ − β)2 }E(θ − β) Cov{θi i i i i i i i ∂τ 2 2(τ 2 )2 ] 2 −Eθi Cov{θi − β, (θi − β)2 } ∂ Eθ = 1 Cov{θ , (θ − β)2 } i i ∂τ 2 i 2(τ 2 )2 [ 1 ∂ Cov(θ , θ − β) = Cov{θi (θi − β), (θi − β)2 } − Cov{θi , (θi − β)2 }E(θi − β) i i ∂τ 2 2(τ 2 )2 ] 2} −Eθi Cov{θi − β, (θi − β) 77 [ ] ∂ 2 νi 1 2 = − 2 3 Cov{θi , (θi − β)2 } − 2Eθi Cov{θi , (θi − β)2 } + 2 (τ ) ∂τ 2 [ 1 2 · ∂ 2 Cov{θi , (θi − β)2 } − 2 ∂ 2 Eθi · Cov{θi , (θi − β)2 } 2(τ 2 )2 ∂τ ∂τ ] −2Eθi · ∂ 2 Cov{θi , (θi − β)2 } ∂τ [ ∂ Cov{θ2 , (θ − β)2 } = 1 2 2 Cov{θi (θi − β)2 , (θi − β)2 } − Eθi V ar{(θi − β)2 } i i ∂τ 2 2(τ 2 )2 ] 2 , (θ − β)2 }E(θ − β)2 −Cov{θi i i 1 ∂ Eθ = 2 2 i 2(τ 2 )2 Cov{θi , (θi − β) } ∂τ [ ∂ Cov{θ , (θ − β)2 } = 1 Cov{θi (θi − β)2 , (θi − β)2 } − Eθi V ar{(θi − β)2 } i i ∂τ 2 2(τ 2 )2 ] −Cov{θi , (θi − β)2 }E(θi − β)2 The other terms are equal to the symetric terms. 78 Chapter 4 Confidence Interval Estimation of Small Area Parameters Shrinking both Mean and Variances 4.1 Introduction The new approach to small area estimation based on joint modeling of mean and variances is given in the previous chapter. The conditional mean squared error of prediction is also estimated to evaluate the prediction error. In this chapter, we will obtain confidence intervals of small area means. The small area estimation literature is dominated by point estimation and their standard errors. It is well known that the standard practice of (pt. est. ± qs.e.), q is Z (standard normal) or t cut-off point, does not produce accurate intervals. See, Hall and Maiti (2006) and Chatterjee et al. (2008) for more details. The previous works are based on the bootsrap procedure and has limited use due to repeated estimation of model parameters. 79 The confidence intervals produced in this chapter are from a decision theory perspective. The rest of the chapter is organized as follows. In section 4.2, the proposed model is repeated. Section 4.3 gives the definition of the confidence intervals. Theoretical justification and an alternative model are given in section 4.4 and 4.5. Section 4.6 contains a simulation study. A real data analysis is included in Section 4.7. 4.2 Proposed Model The hierarchical model adopted here is already introduced in section 3.2. We want to point out again that this model used both the direct survey estimates and sampling variance estimates to estimate all the parameters that determines the stochastic system. From the simulation study, this mean-variance joint modelling performed better than Hwang et al. (2009), which does not take into account the variance estimation completely. 2 Let Xi be the direct survey estimates and Si be their sampling variances for the ith area, i = 1, · · · , n. Let Z i = (Zi1 , · · · , Zip )T be the set of covariates available at the estimation stage and β = (β1 , · · · , βp )T be the associated regression coefficients. We propose the following hierarchical model:   2 ∼ Normal(θ , σ 2 )  Xi |θi , σi i i  θi ∼ Normal(Z T β, τ 2 ); i = 1, · · · , n,  i (4.1)     2 2 (ni − 1)Si /σi ∼ χ2 −1 ni  −2  σi ∼ Gamma{α, γ}; i = 1, · · · , n,  80 (4.2) where B = (α, γ, β T , τ 2 )T , referred to as the structural parameters, are unknown and ni represents the sample size for a simple random sample (SRS) from the ith area. 
Note that σi² are the sampling variances of the Xi's and are usually estimated by the Si²'s. Note also that the chi-square distribution for the sample variance is valid only for a random sample. For a complex survey design the degrees of freedom of the chi-square distribution need to be determined carefully (e.g., Maples et al., 2009). The second level of (3.2) might be further extended as Gamma{γ, exp(Zi^T β₂)/γ} to accommodate covariate information in the variance modeling. The inference can be made from the conditional distribution of θi (the parameter of interest) given the data (Xi, Si², Zi), i = 1, · · · , n. Under our model set up the conditional distribution of θi does not have a closed form, and for handling a non-standard distribution we use Monte Carlo methods.

4.3 Confidence Interval

4.3.1 Definition

Following Joshi (1969), Casella and Hwang (1991), and Hwang et al. (2009), consider the loss function
\[
(k/\sigma)\,L(C)-I_C(\theta),
\]
where k is a tuning parameter, independent of the model parameters, L(C) is the length of a confidence interval C, and I_C(θ) is 1 or 0 depending on whether θ ∈ C or not. Then the decision Bayes confidence interval for θi is obtained by minimizing E{[(k/σ)L(C) − I_C(θ)] | Xi, Si²} and is given by
\[
C_i(B)=\{\theta:\ k\,E(\sigma_i^{-1}\,|\,X_i,S_i^2,B)<\pi(\theta_i\,|\,X_i,S_i^2,B)\}. \tag{4.3}
\]
When k and B are known we follow the steps below to calculate the above CI. Note that the conditional distributions of σi² and θi given the data and B in equations (3.4) and (3.5) are
\[
\pi(\sigma_i^2\,|\,X_i,S_i^2,B)\ \propto\
\frac{\exp\!\big[-0.5(X_i-Z_i^T\beta)^2/(\sigma_i^2+\tau^2)-\{0.5(n_i-1)S_i^2+1/b\}(1/\sigma_i^2)\big]}
{(\sigma_i^2)^{(n_i-1)/2+a+1}\,(\sigma_i^2+\tau^2)^{1/2}},
\]
\[
\pi(\theta_i\,|\,X_i,S_i^2,B)\ \propto\ \exp\{-0.5(\theta_i-Z_i^T\beta)^2/\tau^2\}\,\psi_i^{-(n_i/2+a)},
\]
where ψi = 0.5(Xi − θi)² + 0.5(ni − 1)Si² + 1/γ.

Therefore, we calculate E(σi^{-1}|Xi, Si², B) using the Monte Carlo method. Suppose that σik², k = 1, · · · , N, are N observations from π(σi²|Xi, Si², B). Then E(σi^{-1}|Xi, Si², B) is calculated by
\[
\frac{1}{N}\sum_{k=1}^{N}\frac{1}{\sigma_{ik}}.
\]
For drawing random numbers from π(σi²|Xi, Si², B) we use the rejection sampling method (Robert and Casella, 2004). Next we determine the values of θi by solving kE(σi^{-1}|Xi, Si², B) − π(θi|Xi, Si², B) = 0. Note that for solving this equation we require the normalizing constant of the conditional density π(θi|Xi, Si², B), and that is calculated by a numerical method. In practice, B is unknown. We propose to estimate it by maximizing the marginal likelihood L^M = ∏_{i=1}^n Li^M of {(Xi, Si², Zi)_{i=1}^n; B}. The detailed procedure is described in Section 3.2.3.

4.3.2 Choice of the Tuning Parameter

We discuss in this section the choice of the tuning parameter k in (4.3). As a first step, when a is known, consider
\[
k=k(B)=u_0\,\phi\!\left(t_{\alpha/2}\sqrt{\frac{n_i+2a+2}{n_i-1}}\right), \tag{4.4}
\]
where φ is the standard normal density, t_{α/2} is the (1 − α/2)-th percentile of the t distribution with (ni − 1) degrees of freedom, and
\[
u_0=\sqrt{1+\frac{\sigma_i^2}{\tau^2}}. \tag{4.5}
\]
This choice of k involves components of B, which will be assumed known for the moment. In the case when B is unknown, as in most practical situations, B will be replaced by its corresponding estimate B̂ derived using the estimation method described in Section 3.2.3. The definition of u0 in (4.5) further involves σi², which is not a component of B. Thus, u0 is replaced by û0, obtained by replacing σi² with its maximum a posteriori estimate
\[
\hat\sigma_i^2=\hat\sigma_i^2(B)=\arg\max_{\sigma_i^2}\ \pi(\sigma_i^2\,|\,X_i,S_i^2,B), \tag{4.6}
\]
the value of σi² that maximizes the posterior density of σi² given Xi, Si² and B.
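As an illustration of the Monte Carlo step just described, the following is a minimal R sketch (not the dissertation's R/FORTRAN implementation). It draws σi² from the unnormalized conditional density above by rejection sampling with a crude uniform envelope on a truncated support, and then averages 1/σi to approximate E(σi^{-1}|Xi, Si², B). All numerical values, the truncation bounds, and the envelope are illustrative assumptions.

```r
## Minimal sketch: rejection sampling from pi(sigma_i^2 | X_i, S_i^2, B) and
## Monte Carlo estimation of E(sigma_i^{-1} | X_i, S_i^2, B).
## All values below are illustrative assumptions for one area.
X.i <- 10.2; S2.i <- 4.1; n.i <- 9          # data for one area
Zb.i <- 10; tau2 <- 1; a <- 3; b <- 0.5     # structural parameters, assumed known

post.sig2 <- function(s2) {                  # unnormalized conditional density
  exp(-0.5 * (X.i - Zb.i)^2 / (s2 + tau2) -
        (0.5 * (n.i - 1) * S2.i + 1 / b) / s2) /
    (s2^((n.i - 1) / 2 + a + 1) * sqrt(s2 + tau2))
}

lo <- 1e-3; hi <- 50                         # truncated support (an approximation)
M  <- max(post.sig2(seq(lo, hi, length.out = 2000)))   # rough envelope constant

draw.sig2 <- function(N) {                   # rejection sampler, uniform envelope
  out <- numeric(0)
  while (length(out) < N) {
    cand <- runif(N, lo, hi)
    keep <- runif(N) < post.sig2(cand) / M
    out  <- c(out, cand[keep])
  }
  out[1:N]
}

sig2.draws  <- draw.sig2(5000)
E.inv.sigma <- mean(1 / sqrt(sig2.draws))    # approximates E(sigma_i^{-1} | data)
```

With E(σi^{-1}|Xi, Si², B) approximated this way, the cut-off k E(σi^{-1}|Xi, Si², B) in (4.3) can be compared against the numerically normalized density π(θi|Xi, Si², B) over a grid of θi values to locate the interval endpoints.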
We demonstrate that the coverage probability of Ci(B) with this choice of k is close to 1 − α. Theoretical justifications are provided in Section 4.4. In Hwang et al. (2009) the choice of k is ad hoc: they equate the Bayes interval to the t-interval and solve for k.

Note that without any hierarchical model, Si² and Xi are independent, as Si² and Xi are ancillary and complete sufficient statistics for θi, respectively. However, under models (3.1) and (3.2) the conditional distributions of σi² and θi involve both Xi and Si², as is seen from (3.4) and (3.5).

We would like to mention that in Hwang et al. (2009) the shrinkage estimator for σi² was based only on the information in Si², not on both Xi and Si². They then plugged the Bayes estimator of σi² into the Bayes estimators of the small area parameters. Thus, if σ̂²_{B,i} is the Bayes estimator of σi², then Hwang et al.'s small area estimator can be written as E(θi|Xi, σi²)|_{σi² = σ̂²_{B,i}}.

Remark 1. As mentioned previously, the d.f. associated with the χ² distribution for sampling variance modeling need not be simply ni − 1, ni being the sample size for the ith area. There is no sound theoretical result for determining the d.f. in the case of a complex survey design. Wang and Fuller (2003) approximated the χ² by the normal based on the Wilson-Hilferty approximation. If one knows the exact sampling design, then the simulation based guideline of Maples et al. (2009) could be useful. In the case of county level estimation from the American Community Survey, they suggested the estimated d.f. = 0.36 × √ni.

4.4 Theoretical Justification of Tuning Parameter

We present some theoretical justification for the choice of k according to equations (4.4), (4.5) and (4.6). Assume B is fixed and known for the moment. The conditional distribution of θi can be approximated as
\[
\pi(\theta_i\,|\,X_i,S_i^2,B)=\int_0^\infty \pi(\theta_i\,|\,X_i,S_i^2,B,\sigma_i^2)\,\pi(\sigma_i^2\,|\,X_i,S_i^2,B)\,d\sigma_i^2
\ \approx\ \pi(\theta_i\,|\,X_i,S_i^2,B,\hat\sigma_i^2), \tag{4.7}
\]
where σ̂i² is as defined in (4.6). In a similar way, approximate E(σi^{-1}|Xi, Si², B) by
\[
E(\sigma_i^{-1}\,|\,X_i,S_i^2,B)\ \approx\ \hat\sigma_i^{-1}. \tag{4.8}
\]
Based on (4.7) and (4.8), we have Ci(B) ≈ C̃i(B), where C̃i(B) is the confidence interval for θi given by
\[
\tilde C_i(B)=\big\{\theta_i:\ \pi(\theta_i\,|\,X_i,S_i^2,B,\hat\sigma_i^2)\ \ge\ k\,\hat\sigma_i^{-1}\big\}, \tag{4.9}
\]
with σi² replaced by σ̂i². From (3.1), it follows that the conditional density π(θi|Xi, Si², B, σi²) is normal with mean μi and variance vi, where μi and vi are given by
\[
\mu_i=w_iX_i+(1-w_i)Z_i^T\beta,\qquad
v_i=\left(\frac{1}{\sigma_i^2}+\frac{1}{\tau^2}\right)^{-1}=\sigma_i^2\left(1+\frac{\sigma_i^2}{\tau^2}\right)^{-1},\qquad
w_i=\frac{1/\sigma_i^2}{1/\sigma_i^2+1/\tau^2}. \tag{4.10}
\]
Now, choosing k = û0 φ(t_{α/2}√((ni + 2a + 2)/(ni − 1))) as discussed, the confidence interval C̃i(B) becomes
\[
\tilde C_i(B)=\left\{\theta_i:\ \hat u_0\,\frac{|\theta_i-\hat\mu_i|}{\hat\sigma_i}\ \le\ t_{\alpha/2}\sqrt{\frac{n_i+2a+2}{n_i-1}}\right\}, \tag{4.11}
\]
where μ̂i is the expression for μi in (4.10) with σi² replaced by σ̂i². Now consider the behavior of σ̂i² ≡ σ̂i²(B) as τ² ranges between 0 and ∞. When τ² → ∞, σ̂i² converges to
\[
\hat\sigma^2(\infty)\equiv\hat\sigma_i^2(a,b,\beta,\infty)
=\frac{\frac{n_i-1}{2}S_i^2+\frac{1}{b}}{\frac{n_i-1}{2}+a+1}
=\frac{(n_i-1)S_i^2+2/b}{n_i+2a+1}.
\]
Similarly, when τ² → 0, σ̂i² converges to
\[
\hat\sigma^2(0)\equiv\hat\sigma_i^2(a,b,\beta,0)
=\frac{(X_i-Z_i^T\beta)^2+(n_i-1)S_i^2+2/b}{n_i+2a+2}.
\]
For all intermediate values of τ², we have min{σ̂²(0), σ̂²(∞)} ≤ σ̂i² ≤ max{σ̂²(0), σ̂²(∞)}.
Therefore, it is sufficient to consider the following two cases: (i) σ̂i² ≥ σ̂²(∞), where it follows that
\[
(n_i+2a+2)\hat\sigma_i^2=(n_i+2a+1)\hat\sigma_i^2+\hat\sigma_i^2\ \ge\ (n_i-1)S_i^2+\frac{2}{b}+\hat\sigma_i^2\ \ge\ (n_i-1)S_i^2,
\]
and (ii) σ̂i² ≥ σ̂²(0), where it follows that
\[
(n_i+2a+2)\hat\sigma_i^2\ \ge\ (X_i-Z_i^T\beta)^2+(n_i-1)S_i^2+\frac{2}{b}\ \ge\ (n_i-1)S_i^2.
\]
So, in both cases (i) and (ii),
\[
(n_i+2a+2)\,\hat\sigma_i^2\ \ge\ (n_i-1)S_i^2. \tag{4.12}
\]
Since θi − μi ~ N(0, σi²τ²/(σi² + τ²)) and (ni − 1)Si²/σi² ~ χ²_{ni−1}, the confidence interval
\[
D_i=\left\{\theta_i:\ u_0\,\frac{|\theta_i-\mu_i|}{S_i}\ \le\ t_{\alpha/2}\right\} \tag{4.13}
\]
has coverage probability 1 − α. Thus, if u0 and μi are replaced by û0 and μ̂i, it is expected that the resulting confidence interval, D̃i say, will have coverage probability of approximately 1 − α. From (4.12), we have
\[
P(\tilde C(B))\ \ge\ P(\tilde D_i)\ \approx\ 1-\alpha, \tag{4.14}
\]
establishing an approximate lower bound of 1 − α for the confidence level of C̃(B).

In (4.14), B was assumed fixed and known. In the case when B is unknown, we replace B by its marginal maximum likelihood estimate B̂. Since (4.14) holds regardless of the true value of B, substituting B̂ for B in (4.14) will involve an error of order O(1/√N), where N = Σ_{i=1}^n ni. Compared to each single ni, this pooling of the ni's is expected to reduce the error significantly, so that C̃(B̂) is sufficiently close to C̃(B) to satisfy the lower bound of 1 − α in (4.14).

4.5 Alternative Model for Confidence Interval

It is possible to reduce the width of the confidence interval C̃(B) based on an alternative hierarchical model for small area estimation. The constant term ni + 2a + 2 in (4.12) becomes ni + 2a in the alternative model. The model is
\[
X_i\,|\,\theta_i,\sigma_i^2 \sim N(\theta_i,\sigma_i^2), \tag{4.15}
\]
\[
\theta_i\,|\,\sigma_i^2 \sim N(Z_i^T\beta,\lambda\sigma_i^2), \tag{4.16}
\]
\[
(n_i-1)(S_i^2/\sigma_i^2) \sim \chi^2_{n_i-1}, \tag{4.17}
\]
\[
\sigma_i^2 \sim \text{Inverse-Gamma}(a,b), \tag{4.18}
\]
for i = 1, 2, · · · , n. Note that in the above alternative formulation it is assumed that the variability of θi is proportional to σi², while in the previous model the variance of θi was τ², independent of σi²; in other words, the model variance was constant. The set of all unknown parameters in the current hierarchical model is B = (a, b, β, λ). By re-parametrizing the variance in (4.16), some simplifications can be obtained in the derivation of the posteriors of θi and σi² given Xi, Si² and B. We have
\[
\pi(\sigma_i^2\,|\,X_i,S_i^2,B)=IG\left(\frac{n_i}{2}+a,\ \left[\frac{(n_i-1)S_i^2}{2}+\frac{(X_i-Z_i^T\beta)^2}{2(1+\lambda)}+\frac{1}{b}\right]^{-1}\right).
\]
Given B and σi², the conditional distribution of θi is normal with mean μi and variance vi as in (4.10) with τ² replaced by λσi²: π(θi|Xi, σi², B) = N(μi, λσi²/(1 + λ)). By integrating out σi², it follows that the conditional distribution of θi given Xi, Si² and B is
\[
\pi(\theta_i\,|\,X_i,S_i^2,B)=\int_0^\infty \pi(\theta_i\,|\,X_i,\sigma_i^2,B)\,\pi(\sigma_i^2\,|\,X_i,S_i^2,B)\,d\sigma_i^2
\ \propto\ \left[\frac{(1+\lambda)}{2\lambda}(\theta_i-\mu_i)^2+\frac{\delta^2}{2}\right]^{-(n_i+2a+1)/2}, \tag{4.19}
\]
where δ² = (ni − 1)Si² + (Xi − Zi^Tβ)²/(1 + λ) + 2/b. We can rewrite (4.19) as
\[
\pi(\theta_i\,|\,X_i,S_i^2,B)=\frac{\Gamma((n_i+1)/2+a)}{\Gamma(n_i/2+a)}\,
\sqrt{\frac{1+\lambda}{(n_i+2a)\lambda\pi}}\,\frac{1}{\delta^*}
\left\{1+\frac{(\theta_i-\mu_i)^2}{(n_i+2a)\,\delta^{*2}\lambda/(1+\lambda)}\right\}^{-\frac{n_i+2a+1}{2}},
\]
which can be seen to be a scaled t-distribution with ni + 2a degrees of freedom and scale parameter δ*√(λ/(1 + λ)), where δ*² = δ²/(ni + 2a). Also,
Γ(ni /2 + a) δ i −1 E(σi |Xi , Si , B) = In this context, choosing k = k(B) as  −(n +2a+1)/2 √ i t2   1+λ 1 α/2 k = 1+ · ·√ ,  ni − 1  λ 2π the confidence interval in (4.3) simplifies to Ci (B) ≡             | θi − µi | θi : √ ≤ tα/2 .     ∗2   λ (ni +2a)δ     ni −1 1+λ (4.20) 2 Using similar arguments as before and noting that (ni + 2a)δ ∗2 ≥ (ni − 1)Si , we have P (Ci (B) ≥ Di ) = 1 − α where Di is the confidence interval in (4.13). Again here, B was assumed fixed and known. In the case when B is unknown, we replace B by its marginal ˆ maximum likelihood estimate B. It is expected that the pooling technique will result in an ˆ error small enough so that P (Ci (B)) ≈ P (Ci (B)), and thus, enable the confidence level of ˆ Ci (B) to be greater than 1 − α. 89 4.6 Simulation Study Simulation design We considered a simulation setting similar to Wang and Fuller (2003), which is the same as the set up in section 3.4. Each sample in the simulation study was generated through the same steps as the section 3.4. To simplify the simulation, we still do 2 not choose any covariate Z, only (Xi , Si ) are generated. The observations for each small area is first generated as Xij = β + ui + eij , j = 1, . . . , ni , i = 1, . . . , n, 2 2 where ui ∼ Normal(0, τ 2 ) and eij ∼ Normal(0, ni σi ). Then, the model for (Xi , Si ) is the same as equations (3.26) and (3.27): Xi = β + ui + ei , i = 1, . . . , n, ni 1 ∑ 2= 1 ¯ Si (Xij − Xi· )2 ni − 1 ni j=1 ∑n i ∑n i 2 ¯ where Xi = Xi· = n−1 j=1 Xij , ei = e¯ = n−1 j=1 eij . Therefore, Xi ∼ Normal(θi , σi ), i· i i 2 where θi = β + ui and θi ∼ Normal(β, τ 2 ), and ei ∼ Normal(0, σi ). It is to be noted that 2 2 (ni − 1)Si /σi ∼ χ2 . Then the quantities to be predicted are the mean for each small (ni −1) area, θi , i = 1, . . . , n. Like Wang and Fuller (2003), we set all ni equal to m which eased our programming 2 efforts. However, the sampling variances were still unequal by choosing one-third of the σi equal to 1, one-third equal to 4, and one-third equal to 16. In the simulation, we set β = 10 and took three different values of τ 2 , 0.25, 1, 4. For each τ 2 , we generated 200 samples for 90 each of the combinations (m, n) = (9, 36), (18, 180). In this simulation study we compare the proposed method with the method of Wang and Fuller (2003), Hwang et al. (2009) and Qiu and Hwang (2007) which are referred to as I, II, III, and IV, respectively. Note that the estimator proposed in Qiu and Hwang (2007) was adjusted for the Fay-Herriot model (1979). The methods are judged based on bias, mean squared error (MSE), asymptotic coverage probability (ACP) of the confidence intervals and the length of the confidence intervals (ALCI). Simulation results In Table 4.1 we present the mean and empirical standard deviation of estimates of β and τ 2 . The numerical results indicate good performance of the EM algorithm based maximum likelihood estimate of the model parameters. Table 4.1: Estimation of model parameter. 
Table 4.1: Estimation of the model parameters. The left panel is for β and the right panel is for τ².

  β = 10      n = 36, m = 9         n = 180, m = 18
  τ²          Mean       SD         Mean       SD
  0.25        10.0071    0.3618      9.9951    0.1853
  1           10.0142    0.3311      9.9970    0.1743
  4           10.0282    0.4639     10.0048    0.2254

  τ² = 0.25, 1, 4   n = 36, m = 9     n = 180, m = 18
  τ²          Mean       SD         Mean       SD
  0.25         0.2558    0.0605      0.2575    0.0097
  1            0.9418    0.3333      1.0426    0.1264
  4            3.5592    1.3316      4.0817    0.5551

  SD: standard deviation over 200 replicates.

Table 4.2: Simulation results for prediction when τ² = 0.25

                     n = 36, m = 9                       n = 180, m = 18
       σi²     I        II       III      IV        I        II       III      IV
  Bias  1    0.0048   0.0198   0.0272   0.0018   -0.0051  -0.0086  -0.0112  -0.0111
        4   -0.0033  -0.0061  -0.0145  -0.0158   -0.0130  -0.0109  -0.0065  -0.0116
       16    0.0126   0.0370   0.0369   0.0096   -0.0046  -0.0045  -0.0080  -0.0061
  MSE   1    0.3066   0.3955   0.6861   0.3805    0.2258   0.2757   0.4470   0.2922
        4    0.3281   0.5119   1.3778   0.7285    0.2595   0.3010   0.5805   0.3748
       16    0.3715   0.4623   1.6749   1.9316    0.2815   0.2856   0.4856   0.6383
  ALCI  1    2.1393   2.5485   4.4906   3.0528    1.9220   1.6006   3.6466   2.4811
        4    2.2632   3.9574   6.8887   5.6842    2.0557   2.1524   5.2472   4.2160
       16    2.3221   4.5619   9.3335  11.1363    2.1046   2.3308   6.5273   7.8492
  ACP   1    0.9468   0.8958   0.9771   0.9708    0.9564   0.8160   0.9851   0.9631
        4    0.9468   0.9433   0.9829   0.9917    0.9555   0.8478   0.9967   0.9967
       16    0.9365   0.9375   0.9933   0.9975    0.9529   0.8472   0.9998   0.9999

Table 4.3: Simulation results for prediction when τ² = 1

                     n = 36, m = 9                       n = 180, m = 18
       σi²     I        II       III      IV        I        II       III      IV
  Bias  1   -0.0152   0.0205   0.0255   0.0051   -0.0064  -0.0085  -0.0111  -0.0101
        4   -0.0167  -0.0164  -0.0151  -0.0219   -0.0151  -0.0121  -0.0133  -0.0164
       16   -0.0323   0.0508   0.0515   0.0216   -0.0028  -0.0017  -0.0073  -0.0039
  MSE   1    0.5645   0.6300   0.7238   0.6260    0.5288   0.5555   0.5673   0.6336
        4    0.8566   1.0746   1.5396   1.0992    0.8159   0.8707   0.9415   0.8948
       16    1.0482   1.2406   2.1059   2.3156    0.9786   1.0043   1.1024   1.1878
  ALCI  1    3.4550   3.1822   4.4938   3.2117    3.1088   2.5094   3.6763   2.8676
        4    4.0321   5.8733   6.8984   5.7909    3.7844   4.2908   5.3323   4.5543
       16    4.4082   7.4286   9.3555  11.1555    4.1187   5.1590   6.6785   7.8937
  ACP   1    0.9704   0.8800   0.9762   0.9275    0.9660   0.8771   0.9786   0.8879
        4    0.9633   0.9308   0.9812   0.9808    0.9627   0.9464   0.9918   0.9740
       16    0.9533   0.9325   0.9912   0.9938    0.9613   0.9560   0.9974   0.9979

Table 4.4: Simulation results for prediction when τ² = 4

                     n = 36, m = 9                       n = 180, m = 18
       σi²     I        II       III      IV        I        II       III      IV
  Bias  1   -0.0024   0.0248   0.0229   0.0180   -0.0084  -0.0098  -0.0122  -0.0106
        4   -0.0343  -0.0310  -0.0210  -0.0340   -0.0110  -0.0092  -0.0174  -0.0132
       16   -0.0147   0.0702   0.0767   0.0467    0.0016   0.0024  -0.0059   0.0012
  MSE   1    0.8822   0.8786   0.8579   1.0559    0.8359   0.8334   0.8541   0.8605
        4    2.0577   2.2315   2.1818   2.2422    2.0424   2.0735   2.0935   2.1130
       16    3.4516   3.7401   3.9267   3.8981    3.3153   3.3516   3.3939   3.3631
  ALCI  1    4.6318   4.1936   4.5369   3.7677    4.0256   3.5346   3.9626   3.7499
        4    6.2015  10.9093   7.0376   6.4314    5.9000   9.0913   6.2217   6.1540
       16    7.7221  18.0039   9.6718  11.3341    7.4430  14.6665   8.3908   8.7537
  ACP   1    0.9791   0.9067   0.9733   0.9029    0.9674   0.9135   0.9600   0.9468
        4    0.9556   0.9850   0.9725   0.9496    0.9592   0.9919   0.9633   0.9573
       16    0.9510   0.9958   0.9796   0.9858    0.9573   0.9990   0.9718   0.9776

Tables 4.2, 4.3 and 4.4 provide the numerical results averaged over areas within each group (having the same sampling variance). We calculated the relative bias, mean squared error, coverage rate and average lengths of the confidence intervals of the small area estimators based on 200 replications. In most cases, the biases of the four methods are comparable. In the case of high sampling variance, method IV outperformed the other methods. High sampling variance gives more weight to the population mean by a construction that makes the estimator closer to the mean at the second
level. On the other hand, methods I-III use shrinkage estimators of the sampling variances, which would be less than the maximum of all sampling variances. This would tend to produce a little more bias. However, due to the shrinkage in the sampling variance, method I would be expected to gain in the variance of the estimators, meaning the mean squared error (MSE) would tend to be smaller. Among methods I-III, method I performed better than methods II and III; methods II and III are much closer to each other. The maximum gain of method I compared to method II is 99%.

In terms of the mean squared error (MSE), method I performed consistently better than the other three methods in most cases, except the case where the ratio of sampling variance to model variance is lowest: (σi² = 1)/(τ² = 4) = 0.25. In this case, the variance between small areas (model variance) is much higher than the variance within each small area (sampling variance). When using our method to estimate the mean of each small area, the information "borrowed" from other areas will misdirect the estimation. The maximum and minimum gains compared to methods II and III are 30%, -9% and 77%, -11%, respectively.

ACP: We calculated 95% confidence intervals. Methods I and III do not have any under coverage. This is expected from their optimal interval construction. Method I meets the nominal coverage rate more frequently than any other method. Method II has some under coverage; this could go as low as 82%.

ALCI: Method I produced considerably shorter confidence intervals in general. Method IV produced lengths comparable to the other methods, but the length was considerably higher in the case of high sampling variance. The reason is that method IV truncates M = τ²/(τ² + σi²) with a positive number M1 = 1 − Qα/(N − 2), where Qα is the α-quantile of a chi-squared distribution with N degrees of freedom. When the ratio of sampling variance to model variance, σi²/τ², is high, M1 is much greater than M. For example, in the case of (σi², τ²) = (16, 0.25), the average length is 11.13 for method IV whereas it is only 2.78 for method I and 4.56 for method II.

4.7 A Real Data Analysis

We illustrate our methodology with a widely studied example. The data set is from the U.S. Department of Agriculture and was first analyzed by Battese, Harter and Fuller (1988). The data set is about corn and soybeans in 12 Iowa counties. The sample sizes for these areas are small, ranging from 1 to 5. We shall consider corn only to save space. The proposed model requires a sample size ni > 1 for each area. Therefore the modified data from You and Chapman (2006) are used, which include only the areas with sample size 2 or greater. The mean reported crop hectares for corn (xi) comprise the direct survey estimates.
Table 4.5: Crop data from You and Chapman (2006)

                          Corn
  County       ni     xi        z1       z2      √Si²
  Franklin      3   158.623   318.21   188.06    5.704
  Pocahontas    3   102.523   257.17   247.13   43.406
  Winnebago     3   112.773   291.77   185.37   30.547
  Wright        3   144.297   301.26   221.36   53.999
  Webster       4   117.595   262.17   247.09   21.298
  Hancock       5   109.382   314.28   198.66   15.661
  Kossuth       5   110.252   298.65   204.61   12.112
  Hardin        5   120.054   325.99   177.05   36.807

Table 4.6: Estimation results for corn

  I: Proposed method                              II: Wang and Fuller (2003)
  County       θ̂i        Confidence Interval             θ̂i        Confidence Interval
  Franklin     131.8106   104.085, 159.372 ( 55.287)      155.4338   124.151, 193.094 ( 68.943)
  Pocahontas   108.7305    80.900, 136.436 ( 55.536)      102.3682   -38.973, 244.019 (282.993)
  Winnebago    109.0559    81.430, 136.646 ( 55.216)      115.9093   -53.768, 279.314 (333.083)
  Wright       131.6113   103.736, 159.564 ( 55.828)      131.0674     8.330, 280.263 (271.932)
  Webster      113.1484    92.805, 133.348 ( 40.543)      109.4795    32.514, 202.675 (170.161)
  Hancock      129.4279   111.781, 147.193 ( 35.412)      124.1028    56.750, 162.013 (105.262)
  Kossuth      121.0071   103.451, 138.626 ( 35.175)      116.7147    68.049, 152.454 ( 84.405)
  Hardin       130.2520   112.373, 148.114 ( 35.741)      137.7983    51.734, 188.373 (136.638)

  III: Hwang et al. (2009)                        IV: Qiu and Hwang (2007)
  County       θ̂i        Confidence Interval             θ̂i        Confidence Interval
  Franklin     158.4677   128.564, 188.370 ( 59.805)      157.7383   146.999, 168.477 ( 21.478)
  Pocahontas   100.1276   -44.039, 244.295 (288.334)      101.1661    19.444, 182.887 (163.442)
  Winnebago    114.1473     0.065, 228.228 (228.163)      113.7746    56.263, 171.286 (115.022)
  Wright       140.3717   -24.119, 304.862 (328.982)      143.2244    41.559, 244.889 (203.330)
  Webster      115.7865    50.297, 181.275 (130.978)      115.2224    75.124, 155.320 ( 80.196)
  Hancock      111.3087    66.213, 156.403 ( 90.189)      113.1766    83.691, 142.661 ( 58.970)
  Kossuth      110.9585    74.366, 147.550 ( 73.184)      112.3239    89.520, 135.127 ( 45.607)
  Hardin       126.6093    40.040, 213.178 (173.137)      123.9049    54.607, 193.202 (138.594)

Figure 4.1: Corn hectares estimation. The vertical line for each county displays the confidence interval of θ̂i, with θ̂i marked by the circle, for (I) the proposed method, (II) Wang and Fuller (2003), (III) Hwang et al. (2009) and (IV) Qiu and Hwang (2007).

The sample variances are calculated from the original data assuming simple random sampling. The sample s.d. varies widely, from 5.704 to 53.999 (the coefficient of variation varies from 0.036 to 0.423). The mean numbers of pixels from LANDSAT satellite data (zi) are the covariates in the estimation procedure: z1 is the mean number of pixels of corn and z2 is the mean number of pixels of soybeans. These covariates are used to fit the model (4.1) and (4.2). The details of the modified data can be found in You and Chapman (2006); the data are tabulated in Table 4.5.

The small area estimates and their confidence intervals are summarized in Fig 4.1. The numerical figures are also provided in Table 4.6. The point estimates of all 4 methods are comparable. The summary measures (mean, median, and range) of the parameter estimates for methods (I, II, III, IV) are respectively (121.9, 124.1, 122.2, 122.6), (125.2, 120.4, 115.0, 114.5) and (23.1, 53.0, 58.4, 56.6). The distributions are summarized in Fig 4.2. The central locations are similar under all the methods but there is a significant difference in their variability.

Figure 4.2: Boxplot of estimates of corn hectares for each county. (I) to (IV) are the 4 methods corresponding to Figure 4.1.
If the assumption of constant model variance is true and the sample size varies only from 3 to 5, then under simple random sampling one would expect relatively low variation among the small area estimates. Method I is superior in this sense. Further, smoothing the sampling variances has strong implications for measuring uncertainty and hence for the interval estimation. The proposed method has the shortest confidence intervals on average compared to all other methods. Methods II and III provide intervals with negative lower limits. This seems unrealistic because the direct average of area under corn is a large positive number for all the counties. Moreover, the crude confidence intervals (xi ± t0.025 Si), the widest, do not contain zero for any of the areas. Note that method II does not have any theoretical support for its confidence intervals.

Methods II and III produce wider confidence intervals when the sampling variance is high. For example, the sample size for both Franklin county and Pocahontas county is three, but the sampling standard deviations are 5.704 and 43.406. Although the confidence intervals under method I are comparable, they are wide apart for methods II and III. This is because, although these methods consider the uncertainty in the sampling variance estimates, the smoothing does not use the information from the direct survey estimates; as a result, the underlying sampling variance estimates remain highly variable (due to the small sample sizes). In effect, the variance of the variance estimator (of the point estimates) is bigger compared to that in method I. This is further confirmed by the fact that the intuitive standard deviations of the "smoothed" small area estimates (one fourth of the interval) are smaller and less variable under method I compared to the others.

In addition to the confidence intervals, we also calculated the BIC for the proposed method and for method III (Hwang et al. (2009)), as these two methods have the same number of parameters and their model structures are very similar to each other; the difference is in the level-2 model for the variance part. The BIC of the proposed method is 210.025 and that of method III is 227.372. The BIC of the proposed method is lower. We could not compare with Wang and Fuller (2003) since an explicit likelihood is not used there. The BIC criterion supports our data analysis.

Chapter 5

Clustering Based Small Area Estimation: An Application to MEAP Data

5.1 Introduction

As mentioned in the previous chapters, small area estimation problems exist in many application fields. In this chapter, we examine small area estimation in educational assessment. In particular, we analyze data from the Michigan Educational Assessment Program (MEAP). For educational accountability purposes, the results of high stakes tests are used as an indicator of school district performance. Oftentimes the average test scores for a grade are reported for each school district. However, the numbers of students in the school districts are quite different from each other. Table 5.1 summarizes the number of school districts with a given size for Grade 4 in Michigan.

Table 5.1: Number of public school 4th graders in some school districts in Michigan state

  Range of Numbers of Students    Number of School Districts
  1 ∼ 100                         210
  100 ∼ 200                       124
  200 ∼ 300                        62
  300 ∼ 500                        39
  500 ∼ 1000                       30
  1000 ∼                           11

Figure 5.1: Standard deviation of public school students' assessment scores in available school districts in Michigan state.

We can see that a large school district in a large city might have thousands of students in a grade, while a small school district in a rural area might have fewer than 100 students in a grade. Figure 5.1 gives the standard deviation of the assessment scores of the students in each school district. From the map, it can be seen that there is a large amount of variation in the standard deviations across the school districts. Therefore, the direct comparison of average scores between large and small school districts may not be appropriate. A comparison of model based "smoothed" results is more appropriate.

The general approach for this type of analysis is to use the Fay-Herriot model. Let Yi be the average score reported, Xi be the available covariate information for the regression stage, and β be the regression parameter, i = 1, . . . , n, for the total number of districts n. Then the model is given by (1.1) or (2.5),
\[
Y_i = X_i^T\beta + v_i + e_i, \tag{5.1}
\]
where vi is the random effect for each school district, which follows an independent and identical distribution, usually a normal distribution; ei is the sampling error; and the variances of the ei are known.

When we use the model based approach in equation (5.1), we "borrow strength" from all other school districts universally, since all school districts share the same regression parameter β and the random effects vi are iid. However, the actual geographical and socioeconomic characteristics of school districts in a large region are quite diverse. Poverty levels for the school districts in Figure 5.1 are given in Figure 5.2. From the map, it can be seen that the school districts in upper Michigan have higher poverty levels. Although the school districts in lower Michigan have lower poverty levels in general, some school districts have higher poverty levels than others. Therefore, it is more appropriate to divide the school districts into several groups (clusters). In each group, the school districts are similar in some meaningful ways, namely they are from the same cluster. Then the estimation for a school district only borrows strength from similar school districts and avoids misleading information from other school districts.

Figure 5.2: Poverty level of available school districts in Michigan state.

Since we do not have any restriction on the spatial characteristics of school districts in our study, we focus only on clustering based small area estimation. However, the partition of clusters is not known. Cluster analysis has to be conducted before we can estimate the small areas.

There are many methods for cluster analysis (clustering). In one widely used class of methods, two clusters are chosen to be merged based on the optimization of some criterion. Popular criteria include the sum of within-group sums of squares (Ward, 1963) and the shortest distance between groups. Iterative relocation (partitioning) is another common class of methods. In each iteration, data points are moved from one cluster to another depending on whether improvement is achieved with respect to some criterion. K-means clustering (MacQueen, 1967) is a method of iterative relocation with the sum of squares criterion. K-means clustering is used to generate an initial partition in this dissertation.

Clustering algorithms can also be based on probability models. The literature shows that some of the heuristic methods, such as K-means clustering, can also be treated as approximate estimation methods of certain probability models.
In the context of model-based clustering, finite mixture models are often proposed. Each component probability distribution in a finite mixture model corresponds to a cluster. It has been shown that finite mixture models can be used to solve the practical questions that arise when applying clustering methods. A review of model-based clustering can be found in Fraley and Raftery (2002). However, the finite mixture model approach to clustering does not explicitly include the partition as a parameter and involves independent and identically distributed structures. In addition, there is usually no restriction on the mean structure in this class of models. Information from covariates in the mean profile is often necessary in many applications. Therefore, the idea of modeling the mean via regression while keeping the ability to detect clusters has gained more attention recently.

Booth et al. (2008) proposed a new clustering methodology based on a multilevel linear mixed model. A cluster-specific random effect is included in the model, which allows the cluster means to depart from the assumed base model. An objective function is constructed based on the posterior distribution of the underlying partition. The partition that maximizes the objective function is chosen as the "optimal" clustering partition. A stochastic search algorithm is also proposed to find the partition with the highest posterior probability.

A model similar to Booth, Casella and Hobert (2008) is used in this study to describe the relationship between the assessment performance and other covariate information. The actual underlying partition of districts is used as a parameter in the model. An objective function is defined as the posterior distribution of the partition parameter given the data. The posterior distribution is known up to a normalizing constant. The partitions with the highest posterior probability will be the final result. The stochastic search procedure is an algorithm constructed from a mixture of two Metropolis-Hastings algorithms: one makes small scale changes to individual objects and the other performs large scale moves involving entire clusters. The details are given in the following sections. The data set is introduced in Section 5.2, and the proposed model is given in Section 5.3. The stochastic search algorithm is described in Section 5.4. In Section 5.5, the results of the data analysis are presented.

Figure 5.3: Math scores of 4th graders in public schools in available school districts in Michigan state.

5.2 Data Set

The data used are from the Michigan Educational Assessment Program (MEAP). MEAP is a standardized test that was started by the State Board of Education. The test is taken by all public school students in the state of Michigan from elementary school to middle/junior high school. The results at the district level are publicly accessible and can be downloaded at http://www.michigan.gov/mde/. Our study considered the 4th grade math scores in Fall 2009. There are about 700 districts in the MEAP data set. We wanted to analyze the cluster partition of the districts with respect to the poverty level in each district, as reported by the Small Area Income and Poverty Estimates (SAIPE) program. SAIPE is run by the U.S. Census Bureau and reports model-based estimation results from sample surveys in non-census years. A more detailed description of SAIPE can be found at its website http://www.census.gov//did/www/saipe/. There are about 550 school districts in SAIPE's data set.
Since MEAP and SAIPE use different codes to represent the same school districts, we only had 476 districts left after merging the two data sets by matching the names of the school districts; that is, some school districts with data were not included in our merged data set due to non-uniform data coding. No statistical methods were used in this study to manipulate the data sets. In the future, if other information can be accessed so that all the school districts' data can be matched, there would be no missing data and a more complete analysis could be conducted. There is a lot of information reported in the original data set, but we kept only the average math score, standard deviation, number of students, and poverty level. The map in Figure 5.3 shows the distribution of math scores of public school 4th graders in the available districts in Michigan state. Comparing the maps in Figures 5.1, 5.2 and 5.3, we can see that there are some patterns between the math score and the poverty level, but there is not a simple linear relationship.

5.3 Proposed Model

We use w to represent a partition of the districts, with c = c(w) clusters in this partition, which are denoted by C1, . . . , Cc. Let (Yi, Si²) be the observed score and its variance for the ith district, i = 1, · · · , n, where n is the total number of available districts. Let Xi = (Xi1, · · · , Xip)^T be the set of available covariates and βk = (βk1, · · · , βkp)^T be the associated regression coefficients if district i belongs to cluster k. For a fixed partition w, we consider the following hierarchical model:
\[
Y_i = X_i^T\beta_k + u_k + v_i + \varepsilon_i,\quad i\in C_k;\qquad
u_k \sim N(0,\lambda\sigma_k^2),\ \lambda>0;\qquad
v_i \sim N(0,\sigma_k^2);\qquad
\varepsilon_i \sim N(0,S_i^2), \tag{5.2}
\]
where Xi^T βk is the linear part of the mean specification for each cluster, uk are the random effects at the cluster level, vi are the random effects of the districts within each cluster, and Si², i = 1, . . . , n, are treated as known and reported by MEAP. The data-driven tuning parameter λ can be determined by analysis of variance.

5.3.1 Prior Information

The priors of the model are set as
\[
\beta_k \sim \text{uniform on } R^p, \tag{5.3}
\]
\[
\sigma_k^2 \sim IG(a,b), \tag{5.4}
\]
where R^p is the p-dimensional real space. We choose an improper prior for βk, so that no specific prior information is given to βk. IG(a, b) is the inverse gamma distribution with parameters a and b. The prior for w is chosen as
\[
\pi_n(w)=\frac{\Gamma(m)\,m^{c(w)}}{\Gamma(n+m)}\prod_{k=1}^{c(w)}\Gamma(n_k), \tag{5.5}
\]
where nk = #(Ck) and m is a parameter, m > 0. This prior distribution was used in Crowley (1997) and Booth et al. (2008). Under this prior, more weight is put on partitions with smaller numbers of clusters as m is decreased. If w follows the density function in (5.5), then
\[
Pr\{c(w)=k\}=\frac{\Gamma(m)\,m^{k}}{\Gamma(n+m)}\sum_{w:\,c(w)=k}\ \prod_{j=1}^{k}\Gamma(n_j) \tag{5.6}
\]
and
\[
E\{c(w)\}=m\sum_{i=0}^{n-1}\frac{1}{m+i}. \tag{5.7}
\]
If m → 0, E{c(w)} → 1. If m → ∞, E{c(w)} → n.

5.3.2 Model-Based Objective Functions

Based on the model and prior information described previously, we can determine the model-based objective function, which mimics the procedure in Booth et al. (2008). For a fixed w, let θk denote the parameters of each cluster and θ = (θk)_{k=1}^{c(w)}. Then the joint density function of Y = (Y1, . . . , Yn) is given by
\[
f(Y\,|\,\theta,w)=\prod_{k=1}^{c(w)}\int\left\{\prod_{i\in C_k}\int f(Y_i\,|\,u_k,v_i,\theta_k)\,f(v_i\,|\,\theta_k)\,dv_i\right\}f(u_k\,|\,\theta_k)\,du_k. \tag{5.8}
\]
If we use Yk*, an nk × 1 vector, to represent the Yi's in cluster k, then Yk* follows a normal distribution with mean Xk* βk and variance Vk, where Xk* = {Xi^T, i ∈ Ck} and
\[
V_k=\begin{pmatrix}
(1+\lambda)\sigma_k^2+S_{i_1}^2 & \lambda\sigma_k^2 & \cdots & \lambda\sigma_k^2\\
\lambda\sigma_k^2 & (1+\lambda)\sigma_k^2+S_{i_2}^2 & \cdots & \lambda\sigma_k^2\\
\vdots & & \ddots & \vdots\\
\lambda\sigma_k^2 & \lambda\sigma_k^2 & \cdots & (1+\lambda)\sigma_k^2+S_{i_{n_k}}^2
\end{pmatrix}. \tag{5.9}
\]
Then the density function of Yk* is
\[
f(Y_k^*\,|\,\theta,w)=(2\pi)^{-\frac{n_k}{2}}|V_k|^{-\frac12}\exp\Big\{-\tfrac12(Y_k^*-X_k^*\beta_k)^T V_k^{-1}(Y_k^*-X_k^*\beta_k)\Big\} \tag{5.10}
\]
and
\[
f(Y\,|\,\theta,w)=\prod_{k=1}^{c(w)} f(Y_k^*\,|\,\theta,w). \tag{5.11}
\]
Integrating the product of the joint density function of Y and the priors of βk and σk², and multiplying by the prior πn(w), yields the objective function π(w|y),
\[
\pi(w\,|\,y)\ \propto\ \pi_n(w)\cdot\prod_{k=1}^{c(w)}\int f(Y_k^*\,|\,\theta,w)\,\pi(\sigma_k^2)\,\pi(\beta_k)\,d\sigma_k^2\,d\beta_k, \tag{5.12}
\]
where π(·) represents the prior distributions of σk² and βk. Since we include an individual district variance Si² for each district, the objective function does not have a closed form for the final integral, and we compute the integral numerically.

5.3.3 Estimation of Mean Scores

For a fixed partition w, the model (5.2) for cluster k, k = 1, . . . , c(w), can be rewritten in the form of a general linear mixed model as in (1.4),
\[
Y_k^*=X_k^*\beta_k+Z_k v_k+e_k,
\]
where Yk* = (Yi, i ∈ Ck)^T, Xk* = (Xi^T, i ∈ Ck)^T, ek = (εi, i ∈ Ck)^T, and
\[
Z_k=\begin{pmatrix}
1 & 1 & 0 & \cdots & 0\\
1 & 0 & 1 & \cdots & 0\\
\vdots & \vdots & & \ddots & \vdots\\
1 & 0 & \cdots & 0 & 1
\end{pmatrix}_{n_k\times(n_k+1)},\qquad
v_k=\begin{pmatrix}u_k\\ v_{i_1}\\ \vdots\\ v_{i_{n_k}}\end{pmatrix}_{(n_k+1)\times 1}. \tag{5.13}
\]
The variance matrices of vk and ek are Gk = diag(λσk², σk², . . . , σk²) and Rk = diag(S²_{i_1}, . . . , S²_{i_{n_k}}). Therefore, the variance matrix is Vk = Rk + Zk Gk Zk^T.

Using the general theory of Henderson (1975) for a mixed linear model and equation (1.5), the best linear unbiased estimator (BLUE) of Yi = Xi^T βk + uk + vi is given by
\[
\hat Y_i=X_i^T\hat\beta_k+m_i^T G_k Z_k^T \hat V_k^{-1}(Y_k^*-X_k^*\hat\beta_k), \tag{5.14}
\]
where β̂k is the generalized least squares estimator,
\[
\hat\beta_k=(X_k^{*T}\hat V_k^{-1}X_k^*)^{-1}(X_k^{*T}\hat V_k^{-1}Y_k^*), \tag{5.15}
\]
and V̂k is the variance matrix with σk² replaced by its estimator σ̂k². Here mi = (1, 0, . . . , 1, . . . , 0)^T is a (nk + 1) × 1 vector with 1 at the first and (i + 1)-th positions for the ith district and 0 at the other positions, and σ̂k² is obtained by maximizing the log of the marginal likelihood of σk²,
\[
\log f(\sigma_k^2\,|\,Y_k^*,X_k^*,\theta)=\log\int f(y_k^*\,|\,\theta,w)\,d\beta
=\Big(1-\frac{n_k}{2}\Big)\log(2\pi)-\frac12\log|V_k|+\frac12\log|(X_k^{*T}V_k^{-1}X_k^*)^{-1}|
-\frac12 Y_k^{*T}V_k^{-1}Y_k^*
+\frac12 Y_k^{*T}V_k^{-1}X_k^*(X_k^{*T}V_k^{-1}X_k^*)^{-1}X_k^{*T}V_k^{-1}Y_k^*. \tag{5.16}
\]
The maximization is carried out by numerical methods.

5.3.4 MSE of Estimator

An MSE estimator from (6.2.37) in Rao (2003) for the BLUE given in the previous section is
\[
MSE(\hat Y_i)=g_1(\hat\sigma_k^2)-b^T_{\hat\sigma_k^2}\nabla g_1(\hat\sigma_k^2)+g_2(\hat\sigma_k^2)+2g_3(\hat\sigma_k^2),
\]
\[
g_1(\hat\sigma_k^2)=m_i^T(G_k-G_kZ_k^T V_k^{-1}Z_kG_k)\,m_i,
\]
\[
g_2(\hat\sigma_k^2)=d_i^T(X_k^{*T}\hat V_k^{-1}X_k^*)^{-1}d_i,
\]
\[
g_3(\hat\sigma_k^2)=\mathrm{tr}\left\{\left(\frac{\partial b^T}{\partial\sigma_k^2}\right)V\left(\frac{\partial b^T}{\partial\sigma_k^2}\right)^{T}\bar V(\hat\sigma_k^2)\right\}, \tag{5.17}
\]
where di^T = xi^T − mi^T Gk Zk^T V̂k^{-1} Xk*, and mi^T, Gk, Zk^T, Vk and Xk* are as given in the previous section. The term b^T_{σ̂k²}∇g1(σ̂k²) is the bias correction of g1(σ̂k²) and is calculated by bootstrap.

5.4 Stochastic Search

The estimator of the mean score and the corresponding MSE can be calculated if the cluster partition is given, but the partition of the school districts is not known. The optimal cluster partition can be found by maximizing the objective function through the following stochastic search procedure.
Similar to Booth et al. (2008), the stochastic search procedure includes two different kinds of moves: a small scale change, in the form of a biased random walk of one district, and a large scale change, in the form of a split-merge move on clusters. The two kinds of moves are combined with different probabilities. Let w′ be the current partition and c be its number of clusters. At each iteration, if c = 1, a split move is proposed automatically. If c > 1, a biased random walk move is proposed with probability pb, and a split-merge move is proposed with probability 1 − pb.

5.4.1 Biased Random Walk

As with the nearest neighbor random walk, one district is moved at a time, and the move w′ → w has positive probability if and only if w′ and w differ only in the cluster membership of a single district. In Booth et al. (2008), the algorithm was applied to genetic data, where a cluster may contain a single gene because the observed data are replicated measurements of genes. However, we only have one piece of observed data for each school district. If a cluster contains only one school district, the model (5.2) cannot be estimated or determined for that cluster. Therefore, each cluster must contain at least two school districts.

The detailed moving rules are as follows. First, one district is chosen uniformly at random from the districts inside the clusters with nk ≥ 2. Then it is moved to one of the other c − 1 clusters with probability 1/(c − 1). The acceptance probability of w′ → w is min{1, π(w)/π(w′)}.

5.4.2 Split-Merge Moves

Splitting or merging clusters are large scale moves. In the biased random walk, only one district is moved at a time, so a large scale change would require many steps of the biased random walk and is very unlikely to occur. Therefore, it is necessary to add large scale changes to the algorithm. At each iteration, a merge move or a split move is proposed randomly. A merge move is proposed with probability pm ∈ (0, 1): two clusters are chosen uniformly at random from the current partition and merged into a new cluster. A split move is proposed with probability 1 − pm: since each cluster must include at least two school districts, one cluster with at least four school districts in the current partition is chosen and then split into two clusters, each containing at least two school districts.

Suppose the current partition is w′, and w is the partition after merging two clusters in w′. Then
\[
P(w'\to w)=\frac{p_m}{c(w')(c(w')-1)/2},\qquad
P(w\to w')=\frac{1-p_m}{\sum_{k=1}^{c(w)}(2^{n^*-1}-n^*-1)\,I[\#\{C_k(w)\}\ge 4]}, \tag{5.18}
\]
where n* is the number of districts in the cluster of w formed by merging the two clusters of w′. The acceptance probability of the move w′ → w is min{1, R}, where
\[
R=\frac{\pi(w)\,P(w\to w')}{\pi(w')\,P(w'\to w)}, \tag{5.19}
\]
and the acceptance probability of the move w → w′ is min{1, 1/R}.

5.5 Data Analysis

In the analysis of our data set, we chose settings similar to those of Booth et al. (2008) for the proposal probabilities of the different moves: pb = 0.9 and pm = 0.5. For the prior, we chose a = 3 and b = 40, which keeps the cluster-level variation in a smaller range than the between-district variance in model (5.1). First, we applied the K-means clustering algorithm in R to the data without any covariates. We chose the initial number of clusters as 10 and used the numbers 1, . . . , 10 to denote the clusters. Then we started the stochastic search procedure with this initial partition.
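To make the search concrete, here is a minimal R sketch of the K-means initialization and of a single biased random walk move; it is not the code used for the dissertation's analysis. The function log.post is only assumed to return the logarithm of the objective function π(w|y) in (5.12) (its numerical integration is not shown), and scores stands for a hypothetical vector of district mean scores.

```r
## Minimal sketch: initial partition and one biased random walk move.
## `log.post` and `scores` are assumed inputs; cluster labels are 1, ..., c(w).

init.partition <- function(scores, k = 10) kmeans(scores, centers = k)$cluster

random.walk.step <- function(w, log.post) {
  c.w  <- length(unique(w))                    # current number of clusters
  elig <- which(tabulate(w)[w] >= 2)           # districts whose cluster has >= 2 members
  i    <- if (length(elig) == 1) elig else sample(elig, 1)
  dest <- setdiff(seq_len(c.w), w[i])          # the other c - 1 clusters
  w.new    <- w
  w.new[i] <- if (length(dest) == 1) dest else sample(dest, 1)
  # accept the move with probability min{1, pi(w.new)/pi(w)}
  if (log(runif(1)) < log.post(w.new) - log.post(w)) w.new else w
}
```

A split-merge move would be handled in the same way, with the acceptance ratio R of (5.19) in place of the simple ratio used here.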
In each iteration, a biased random walk or split-merge move is proposed and accepted according to the acceptance probability. The numbers 1, . . . , c(w) are used to denote the clusters in the new partition. The total number of iterations was 10^5. Figure 5.4 gives the trace plot of the number of clusters vs. iterations. Four lines are drawn for different m values: 0.5, 0.1, 0.0001, 0.00001. From Figure 5.4, we can see that the final number of clusters decreases as the value of m decreases. This follows from the property of the prior on the partition parameter and resembles the results of Booth et al. (2008).

Figure 5.4: Trace plot for different m values.

We chose the m value as 0.00001 so that the expected number of clusters would be close to 1. We ran the whole procedure again with double the number of iterations, 2 × 10^5, and used the first 10^5 iterations as a burn-in period. The cluster index of each district was recorded. The average number of clusters in the last 10^5 iterations was 9.4. When the stochastic search procedure converged, all the clustering results had close posterior probabilities. One clustering result with 7 clusters is shown in Figure 5.5. Scatter plots of mean scores vs. poverty levels of the school districts in the 7 clusters are given in Figure 5.6. The reason for choosing this partition is that all the cluster sizes are greater than 10 in this clustering result, which is more realistic since there are a total of 476 school districts.

Figure 5.5: Map of clusters.

From the map of clusters in Figure 5.5, we can see that all the available school districts are divided into 7 clusters, and each cluster is formed from several geographically separated small groups. However, most districts in each small group are neighbors of each other. This agrees with our intuitive picture of school districts: adjacent school districts with similar poverty levels may perform similarly on the state assessment, but the school districts in a county/city may not all be placed in one performance level category.

The scatter plots in Figure 5.6 indicate that the relationship between the mean scores of the school districts and their poverty levels is close to linear. However, the R² is only 0.3 if we fit all the data with a simple linear regression. The R²s are (0.26, 0.37, 0.26, 0.89, 0.58, 0.42, 0.74) if we fit the data within each cluster with a simple linear regression. But if we fit the data with model (5.2), the R²s are greater than 0.9 for all the clusters. This indicates the necessity of fitting a more complex model that accounts for clustering. The mixed model is appropriate since it allows the mean structure to depart for each school district.

Figure 5.6: Mean scores vs. poverty levels for school districts in the 7 clusters and all the districts.

We also analyzed the data set with model (5.1). The BIC of model (5.1) is 3197, and that of model (5.2) is 2857. The BIC of the model with clustering is lower. Based on models (5.1) and (5.2), the estimates of the mean scores can be obtained. The results show that 90% of the model based (with clustering) estimates are within 2 points of the observed scores. However, there are 10 school districts that have more than 4-point differences. The mean scores for all school districts range from 403.6 to 456.8, and 4 points is 7.5% of this score range. Therefore, it is necessary to make adjustments if we want to compare the mean scores of these school districts directly, and we need to include more information to make the comparison more comprehensive.

Boxplots of the coefficient of variation for the proposed model, the Fay-Herriot model, and the observed data are given in Figure 5.7. Boxplots of the MSEs of the estimators based on models (5.1) and (5.2) are given in Figure 5.8. From the comparisons of the coefficient of variation and the MSE, the model with clustering is more stable and can provide more accurate estimates.

Figure 5.7: Coefficient of variation for the model-based estimates and observed mean scores.

Figure 5.8: MSE of estimators for models with clustering (5.2) or no clustering (5.1).

From the results, we can see that cluster based small area estimation works very well for our study purpose and reaches a reasonable and stable clustering partition of the school districts. In this study, a common prior was used and no specific background information was emphasized. Only the poverty level was included as a covariate. If more information is available, then meaningful inferences for policymaking can be obtained easily using small area estimation methods similar to those used in this study. Additional information that could be used includes meaningful prior information such as spatial characteristics of school districts, or socioeconomic information such as population percentages of specific races.

Chapter 6

Discussion

In this dissertation, we studied several topics in small area estimation. Small area estimation research has become popular in recent years because of the increasing demand for small area statistics from both the public and private sectors. Small area estimation uses small area models to link the direct estimators and covariates explicitly. It is important to note that the quality of a model based estimator for a small area depends on the availability of good covariates. That is the reason we conducted the study of model selection in Chapter 2 of this dissertation. The bootstrap method was adopted since it has several advantages compared to other competitive methods. As indicated in Shao (1996), a modification with a smaller bootstrap sample size is needed to obtain consistent model selection. This modified bootstrap selection procedure was extended to linear mixed effects models, which are the backbone of small area models. In this dissertation, we have established that the modified bootstrap selection procedure works very well for linear mixed effects models.

For small area models, the typical modeling strategy in the area level model is to treat the sampling variances as known. However, the assumption of known and fixed sampling variances does not carry the uncertainty of their estimation into the overall small area estimation strategy. In Chapter 3, a new approach to small area estimation was introduced: a dual "shrinkage" estimation of both small area means and variances in a unified framework. The conditional mean squared error of prediction (CMSEP) was used to evaluate the performance of the proposed estimator. Since the model parameters are not known in practice and have to be estimated, an estimator of the CMSEP was derived which is second order correct. It is well known that the standard practice for confidence intervals does not produce accurate intervals in the case of small area estimation. Therefore, a confidence interval from a decision theory perspective was derived in Chapter 4. Both the theoretical justification and a simulation study were presented. In conclusion, the proposed estimator of the small area parameters and the corresponding estimators of the CMSEP and confidence intervals outperform other methods from the literature in most cases.

In Chapter 5, we applied small area estimation to 4th grade math scores from the Michigan Educational Assessment Program (MEAP). We wanted to conduct comparisons between different school districts. Because of the diverse standard deviations and varying sample sizes, school districts were identified as small areas in this study. When we used small area models to "borrow strength" from other school districts, we first conducted a cluster analysis to decide which school districts to "borrow strength" from. The final estimation results were calculated based on the "optimal" partition of clusters, and the results outperformed those of the Fay-Herriot model.

All the computing in this dissertation was carried out in R and FORTRAN. The structure of the programming was in R, and the computationally heavy parts were executed by calling FORTRAN from R. The data analyzed in Chapters 4 and 5 are publicly accessible. The code for the simulation illustrations and data analysis is available from the author upon request.

Issues to be Addressed in Practical Applications

Small area estimation can play a very important role in many applications, such as public health and education. In practice, there are many issues that need attention when we apply small area estimation to real applications. The first issue is the availability of covariates. It may happen that good covariates are available only for some small areas and not for others. For example, in the MEAP study of Chapter 5, the MEAP data set contained more than 700 school districts, but the data set from the Small Area Income & Poverty Estimates (SAIPE) program contained only around 550 school districts and used a different code system to represent the same school districts. After combining the two data sets by the names of the school districts, only 476 school districts were left. In our study, we did not adopt any imputation method to recover the missing data, but in practice it is probably necessary to do so. There are many methods available for the imputation of missing data, such as nearest-neighbor imputation, hot deck imputation, regression imputation, or using experts' judgments. Sometimes another model can also be fitted for the missing values. A more detailed description of missing data and small area estimation can be found in Longford (2005).

Another issue is the estimation of the sampling variances. Our study showed the advantages of modeling small area means and variances simultaneously. In practice, however, one might still choose to fit only the basic area level model, which is simpler and assumes the sampling variances to be known. In that case, the estimation methods for the sampling variances are very important for the accuracy of the final estimates. The sampling variances are usually estimated quantities, and they are subject to substantial errors because they are often based on the same small sample sizes from which the direct estimates are calculated. There is a large literature on different estimation methods for sampling variances; a detailed account of variance estimation can be found in Wolter (1985). There might be other issues in the application of small area estimation.

However, by using an explicit model to link the direct estimators and covariates, diagnostics for small area models are possible. Important results from other fields can also be applied to small area models directly. Therefore, small area estimation can be very useful for providing more stable and accurate estimation results in many applications. Extension of small area estimation to other related areas is of further interest.

Bibliography

Adkins, L. and Hill, R. (1990). An improved confidence ellipsoid for the linear regression models. Journal of Statistical Computation and Simulation, 36:9–18.

Battese, G., Harter, R., and Fuller, W. (1988). An error-components model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association, 83:28–36.

Bell, W. (2008). Examining sensitivity of small area inferences to uncertainty about sampling error variances. Tech. Report, U.S. Census Bureau.

Booth, J., Casella, G., and Hobert, J. P. (2008). Clustering using objective functions and stochastic search. Journal of the Royal Statistical Society, B, 70:119–139.

Booth, J. and Hobert, J. (1998). Standard errors of prediction in generalized linear mixed models. Journal of the American Statistical Association, 93:262–272.

Brackstone, G. (1987). Small area data: policy issues and technical challenges. In Small Area Statistics, R. Platek, J.N.K. Rao, C.E. Sarndal and M.P. Singh, eds., Wiley, New York: 3–20.

Casella, G. and Hwang, J. (1991). Evaluating confidence sets using loss functions. Statistica Sinica, 1:159–173.

Chatterjee, S., Lahiri, P., and Li, H. (2008). Parametric bootstrap approximation to the distribution of EBLUP and related prediction intervals in linear mixed models. Annals of Statistics, 36:1221–1245.

Cho, M., Eltinge, J., Gershunskaya, J., and Huff, L. (2002). Evaluation of generalized variance function estimators for the U.S. Current Employment Survey. In Proc. Amer. Statist. Assoc. Surv. Res. Meth. Sec., pages 534–539.

Cox, D. and Snell, E. (1968). A general definition of residuals. Journal of the Royal Statistical Society, Series B (Methodological), 30:248–275.

Crowley, E. (1997). Product partition models for normal means. Journal of the American Statistical Association, 92:192–198.

Das, K., Jiang, J., and Rao, J. (2004). Mean squared error of empirical predictor. The Annals of Statistics, 32:818–840.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7:1–26.

Fay, R. and Herriot, R. (1979). Estimates of income for small places: an application of James-Stein procedures to census data. Journal of the American Statistical Association, 74:269–277.

Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97:611–631.

Gershunskaya, J. and Lahiri, P. (2005). Variance estimation for domains in the U.S. Current Employment Statistics program. Proc. Amer. Statist. Assoc. Surv. Res. Meth. Sec., pages 3044–3051.

Ghosh, M. and Rao, J. (1994). Small area estimation: An appraisal. Statistical Science, 9:54–76.

Gonzalez, M. (1973). Use and evaluation of synthetic estimates. In Proceedings of the Social Statistics Section, pages 33–36. American Statistical Association.

Gunst, G. and Mason, R. (1980). Regression Analysis and Its Applications. New York: Marcel Dekker.

Hall, P. (1989). Unusual properties of bootstrap confidence intervals in regression problems.
Hall, P. (1989). Unusual properties of bootstrap confidence intervals in regression problems. Probability Theory and Related Fields, 81:247–273.
Hall, P. and Maiti, T. (2006). Nonparametric estimation of mean squared prediction error in nested-error regression models. Annals of Statistics, 34:1733–1750.
Henderson, C. (1975). Best linear unbiased estimation and prediction under a selection model. Biometrics, 31:423–447.
Herrador, M., Morales, D., Esteban, M. D., Sánchez, Á., Santamaría Arana, L., Marhuenda, Y., and Pérez, A. (2008). Sampling design variance estimation of small area estimators in the Spanish Labour Force Survey. Statistics and Operations Research Transactions, 32:177–198.
Huff, L., Eltinge, J., and Gershunskaya, J. (2002). Exploratory analysis of generalized variance function models for the U.S. Current Employment Survey. In Proc. Amer. Statist. Assoc. Surv. Res. Meth. Sec., pages 1519–1524.
Hwang, J., Qiu, J., and Zhao, Z. (2009). Empirical Bayes confidence intervals shrinking both means and variances. Journal of the Royal Statistical Society, B, 71:265–285.
James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, pages 361–379. University of California Press, Berkeley.
Jiang, J., Lahiri, P., and Wan, S. (2002). A unified jackknife theory for empirical best prediction with M-estimation. The Annals of Statistics, 30:1782–1810.
Joshi, V. (1969). Admissibility of the usual confidence sets for the mean of a univariate or bivariate normal population. The Annals of Mathematical Statistics, 40:1042–1067.
Lohr, S. (1999). Sampling: Design and Analysis. Duxbury, Pacific Grove, CA.
Lohr, S. and Rao, J. (2009). Jackknife estimation of mean squared error of small area predictors in nonlinear mixed models. Biometrika, 96:457–468.
Longford, N. T. (2005). Missing Data and Small-Area Estimation: Modern Analytical Equipment for the Survey Statistician. Springer, New York, 1st edition.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, L. M. Le Cam and J. Neyman, eds., pages 281–297. University of California Press, Berkeley, CA.
Maples, J., Bell, W., and Huang, E. (2009). Small area variance modeling with application to county poverty estimates from the American Community Survey. In Proceedings of the Survey Research Methods Section, ASA, pages 5056–5067.
Marker, D. (1999). Organization of small area estimators using a generalized linear regression framework. Journal of Official Statistics, 15:1–24.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd ed. Chapman and Hall, London.
Otto, M. and Bell, W. (1995). Sampling error modelling of poverty and income statistics for states. In Proceedings of the Section on Government Statistics, American Statistical Association, pages 160–165.
Pfeffermann, D. (2002). Small area estimation - new developments and directions. International Statistical Review, 70:125–143.
Prasad, N. and Rao, J. (1990). The estimation of the mean squared error of small-area estimators. Journal of the American Statistical Association, 85:163–171.
Qiu, J. and Hwang, J. (2007). Sharp simultaneous intervals for the means of selected populations with application to microarray data analysis. Biometrics, 63:767–776.
Rao, J. (1999). Some recent advances in model-based small area estimation. Survey Methodology, 25:175–186.
Rao, J. (2001). Small area estimation with applications to agriculture. In Proceedings of the Second Conference on Agricultural and Environmental Statistical Applications, ISTAT, Rome, Italy.
Rao, J. (2003a). Small Area Estimation. Wiley, New York.
Rao, J. (2003b). Some new developments in small area estimation. Journal of the Iranian Statistical Society, 2:145–169.
Rao, J. and Shao, J. (1992). Jackknife variance estimation with survey data under hot deck imputation. Biometrika, 79(4):811–822.
Rivest, L.-P. and Vandal, N. (2003). Mean squared error estimation for small areas when the small area variances are estimated. In Proceedings of the International Conference on Recent Advances in Survey Sampling.
Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods, second edition. Springer, New York.
Schall, R. (1991). Estimation in generalized linear models with random effects. Biometrika, 78:719–727.
Shao, J. (1996). Bootstrap model selection. Journal of the American Statistical Association, 91:655–665.
Valliant, R. (1987). Generalized variance functions in stratified two-stage sampling. Journal of the American Statistical Association, 82:499–508.
Wang, J. and Fuller, W. (2003). The mean squared error of small area predictors constructed with estimated error variances. Journal of the American Statistical Association, 98:716–723.
Ward, J. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58:234–244.
Wolter, K. (1985). Introduction to Variance Estimation. Springer, New York.
You, Y. and Chapman, B. (2006). Small area estimation using area level models and estimated sampling variances. Survey Methodology, 32:97–103.