LHBRARY
Michigan State
Unlverslty

 

 

 

PLACE N RETURN BOX to roman this checkout from you: "cord.
TO AVOID FINES Mum on or before data duo.

DATE DUE DATE DUE DATE DUE

    

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

MSU I: An Aﬂimm ActtoNEwnl Opportunity Initiation
mm.

TOWARD A MULTILEVEL GENERALIZED LINEAR MODEL:
THE CASE FOR POISSON DISTRIBUTED DATA

By

Wing Shing Chan

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology
and Special Education

1995

ABSTRACT

TOWARD A MULTILEVEL GENERALIZED LINEAR MODEL:
THE CASE FOR POISSON DISTRIBUTED DATA

by
Wing Shing Chan

This study presents a maximum likelihood approach for
estimating a random coefficient generalized linear model via
the Monte Carlo EM algorithm. Monte Carlo integration
through simulating multivariate normal variates is used to
integrate out the random effects in the E-step. The M-step
is equivalent to maximizing a weighted sum of log likelihood
for an exponential family and a multivariate normal
distribution. The former weighted likelihood is maximized
by one step of the Iteratively Reweighted Least Square
algorithm while the latter is maximized in closed form by
choosing appropriate weights. Standard errors for the fixed
effect parameters and variance components are obtained
through the tractable observed complete data information
matrix. The Poisson distribution is used for illustration. A
simulation study with a non-diagonal variance-covariance
matrix shows better accuracy than an earlier comparable
Gibbs sampling approach. Computing efficiency is

demonstrated as the FORTRAN program converges in about

ii

40 iterations within 30 minutes for a two-variance
component model with 700 Poisson observations (100 groups
each with 7 within-group units) on an IBM compatible 486

DX/33MHz personal computer.

iii

Copyright by
WING SHING CHAN

1995

iv

Dedicated to

the land and people of Michigan

ACKNOWLEDGMENTS

I wish to express my thanks to Prof. Betsy J. Becker
(Measurement and Quantitative Methods), Prof. Ralph L. Levine
(Psychology), Prof. Stephen W. Raudenbush (Measurement and
Quantitative Methods) and Prof. James H. Stapleton (Statistics
and Probability) for serving on my dissertation committee and
giving me many useful suggestions. I am grateful to Prof.
Raudenbush for introducing me the concepts and principles of
the Hierarchical Linear Models and to Prof. Stapleton for
teaching me five graduate courses in mathematical statistics.
Moreover, the large and beautiful campus of the Michigan State
University has provided a favorable environment for building the
foundation of my graduate education.

I also wish to express my gratitude for the hospitality from
the Department of Maternal and Child Health at Harvard
University where I visited as a research specialist in 1991-1992.
I wrote my dissertation proposal while residing in Somerville of

Massachusetts during this period.

vi

I wish to thank my wife, Li Ming Wei, for her patience,
encouragement and support as well as for her assistance in
computer programming. She accompanied me to live in Forest
Hills, New York where I conducted my independent dissertation
research in 1992-1994. In 1993 the Queens College of CUNY
had offered me some helpful part-time research and teaching
opportunities and an Internet access to the rest of the world. I
wish to thank my family members in Hong Kong for their needful
financial support during the last phase of the dissertation
research and writing.

Last but not least, I wish to thank my Shih-Fu, Ch’an
Master Dr. Sheng-Yen in Elmhurst, New York. From 1991 to
1995, his teaching in Buddhist meditation is of great help to my
personal life and academic research as I attain a calmer and

clearer mind.

W. S. CHAN
Queens, NY

March 1995

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

Chapter
I. INTRODUCTION .

An illustration for the application of a multilevel
generalized linear model to educational research

11. BACKGROUND .

Implications for the present research .
Exponential family of distributions .
Generalized linear models .

III. STATISTICAL MODEL AND ESTIMATION
STRATEGY

A multilevel generalized linear model
Maximum likelihood estimation via
the Monte Carlo EM algorithm .

IV. PARAMETER ESTIMATION

The expectation step of the EM algorithm .

The maximization step of the EM algorithm
Estimation for the fixed effect parameters
Estimation for the variance component
parameters .

viii

Page

xi

12
18
20
22
24
24
27
3o
30
32
32

35

VI.

VII.

VIII.

VARIANCE OF PARAMETER ESTIMATES

Estimation for the variance of the fixed effect
estimates

Estimation for the variance of the variance
component estimates

EMPIRICAL PROPERTIES OF THE STATISTICAL
MODEL

Maximization of the likelihood function .
Convergence of estimated parameters
Convergence from different starting values
Influence of the size of Monte Carlo samples

SIMULATION STUDY

Simulation test for a random intercept model
Simulation test for a random coefficient model .

CONCLUSIONS

Further research

APPENDIX A: Mathematical notes

A1 Expectation of the function of a random
variable by a joint distribution .
A2 Expected information matrix with weights
A3 Conditional expectation by a missing data
distribution via the Monte Carlo method .
A4 Conditional expectation of the
score function

APPENDIX B: Estimation of starting values

LIST OF REFERENCES:

ix

4o

41

44

48
49
52
54
54
58

58
61

67

69

74

74

76

77

79

80

83

Table

LIST OF TABLES

Poisson, binomial and normal distributions as
members of the exponential family .

Comparing parameter and likelihood values
obtained by two starting values (Parameter values
for y and T are respectively -O.7 and 0.16)

Results of 2 simulation studies. Case 1: Random
intercept model. Case 2: Random coefficient
model

Log expected mean and expected mean
(in parenthesis) by time and sex .

Page

21

54

61

62

LIST OF FIGURES

Page
Figure

1. Log likelihood function as a function of the number

of iterations for two sets of starting values . . . 51
2. Convergent paths of model parameters from two

initialpositions............... 53
3. Influence of the size of Monte Carlo samples on

the estimated log likelihood . . . . . . . . . 56

xi

CHAPTER I

INTRODUCTION

Educational data concerning students often arise within
classrooms or schools. Since different classrooms have teachers
of various experiences and qualifications, and different schools
have various school climates, policies and facilities, educational
results vary according to the quality of the educational context.
One of the most important questions in educational research is
what kind of teachers or schools will enhance learning most. To
answer this question empirically, we must collect data about
students' academic performance and characteristics of the
teachers, classrooms and schools. For research results to be
readily used for academic theory construction or public policy
making, large samples of classrooms or schools must be studied
to make broad generalizations.

However, traditional statistical research tools such as
linear regression analysis fail to account for the variability due
to the varying effects of each classroom or school. Ordinary
random effects analysis of variance models also fail to allow for
unbalanced data and flexible covariate adjustments. Because the

scores of students within the same classroom or school are

correlated, the standard statistical independence assumption is
violated when data from large numbers of classrooms or schools
are analyzed. As a remedy to these deficiencies in traditional
statistical tools, multilevel or hierarchical linear statistical
models have been developed for educational and social research
applications (Aitkin, Anderson, and Hinde, 1981, Goldstein,
1986, Longford, 1987 and Raudenbush and Bryk, 1986).
Multilevel linear models have two or more levels of
regression equations. For example, in two-level models, a level-
one equation relates students' outcomes (e.g., math achievement
or verbal ability) to a set of independent variables (e.g.,
previous GPA, IQ, sex or socioeconomic status). However, each
school is allowed to have its own regression equation. The
slopes or coefficients of the level-one regression equation for
each school are considered randomly distributed. The level-one
coefficients are predicted by another set of independent
variables related to the level-two units, that is, the school
variables (e.g., private or public school, rural or urban school
and racial composition of the school). The explicit modeling of
level-one coefficients by level-two variables helps educational
researchers study the intriguing relationships among schools,

teachers and students. For example, Raudenbush and Bryk

(1986) were able to demonstrate through a multilevel model that
the type of school (Catholic or public) has a differentiating
effect on the influence of students' socioeconomic status (SES)
on Math achievement. Specifically, they found that Catholic
schools have a flatter slope in a regression of Math achievement
on SES, which supports the contention that Catholic schools are
more egalitarian than public schools.

Moreover, educational researchers making use of multilevel
models need not be restricted by either choosing the student
level or the aggregated school level for regression analysis. The
estimated residual variances of the regression equations from
both levels not only account for the components of variances at
the student and school levels, they also give information about
the relative amount of unexplained variation for the two levels.

When repeated measurements on students over time are
taken as the level-one units and the students are correspondingly
taken as level-two units, the multilevel model can be used as an
appropriate model to study academic growth over time. Student
variables such as sex, race or income can be used to predict the
effect of time on academic achievement (i.e., the rate of growth
or change). Therefore multilevel models are a means to solving

two of the most persistent methodological problems in

educational research, namely the assessment of multilevel effects
and the measurement of change or growth (Bryk and
Raudenbush, 1992). Growing numbers of educational studies, for
example, about vocabulary growth, school effectiveness and
teaching styles have been conducted in a multilevel perspective
(e.g., Aitkin, Anderson, and Hinde, 1981, Raudenbush and Bryk,
1986 and Huttenlocher et al., 1991). A review of educational
applications of multilevel models is included in Raudenbush
(1988)

Notwithstanding the growing popularity of multilevel
models, most of the research applications are confined to
continuous outcome variables with normal error distributions.
Details about the methodology and applications of normal
multilevel models in social and educational research can be
found in the books written by Goldstein (1987), Bryk and
Raudenbush (1992) and Longford (1993). However, a wide range
of educational outcomes are not normally distributed, for
example, pass or failure (dichotomous variable), absence from
school (discrete counts) and time before graduation or dropout
of school (censored survival time data). These non-normal
outcomes can be dealt with, respectively, by statistical models

based on the binomial, Poisson and exponential distributions.

These common distributions are subsets of the exponential
family of distributions. A unified approach for maximum
likelihood estimation of single-level regression models based on
the exponential family of distributions is the generalized linear
model ( Nelder and Wedderburn, 1972). Computer software for
such statistical models is now also available (Aitkin et al.,
1989). Many attempts have been made to develop multilevel non-
normal models and they are described in the next chapter.
However, a unified approach for a multilevel generalized linear
model with full maximum likelihood estimation has not been
developed.

This dissertation is an attempt to extend the multilevel
normal models to include other non-normal outcome variables so
that a wider range of educational dependent variables can be
investigated within the multilevel framework. In other words,
the goal of the dissertation is to develop a unified maximum
likelihood estimation approach for the multilevel generalized
linear model. Since the generalized linear model covers a wide
range of distributions, the dissertation will demonstrate
statistical estimation of one member of the exponential family:
the Poisson distribution. The same approach can be followed for

the other distributions of the exponential family.

Though the Poisson distribution is not commonly used in
educational research, its potential for application should not be
underestimated. The Poisson distribution is a standard model for
independent count data. Much educational data exists in terms
of counts, for example, school absence (Aitkin et al., 1989,
p.223), classroom behaviors such as speaking in class, altruistic
behaviors, antisocial behaviors, vocabularies of children’s
utterances, number of peer-reviewed publications, teenage
pregnancies, frequencies of using school facilities such as the
library, gymnasium and counseling service, and numbers of times
repeating a difficult required course or a certifying exam of a
professional institution. The Poisson model is especially useful
if the mean occurrence of counts per unit time is low. In these
instances many people will have zero observed counts.
Approximation of the empirical data by the normal distribution
often fails to account for the positive skewness of the data.

Since the educational outcome variables in the above
paragraph also arise within educational settings, for example,
classrooms, high schools and tutorial schools, the multilevel
approach will often need to be applied. The multilevel Poisson
model has an additional advantage of being able to help resolve

the over-dispersion problem (Cox, 1983) due to more than

expected variance. In theory, data of the Poisson distribution

should have its mean approximately equal to its variance.

However data arising from groups (e.g., educational

institutions) are statistically dependent within a group, so the

observed variance of the whole set of data can be much larger
than the corresponding mean. A multilevel Poisson model
accounts for the variation due to grouping by including a second
or third level of variation. Thus it helps remove the over-
dispersion due to the natural grouping of the data. Similar
problems also exist for the binomial distribution and the
multilevel approach can offer the same benefit.

In essence, this dissertation will undertake research that
fulfills the objectives below:

1) To develop a multilevel statistical model for the
exponential family of distributions.

2) To provide maximum likelihood estimation of the
parameters and standard errors for the parameters of the
model.

3) To use the Poisson distribution to demonstrate the

details ofthe estimation method.

4) To write an iterative computer program to obtain
statistical estimation for the multilevel Poisson model by
iteration.

5) To use a small simulation study to demonstrate the
validity of the computer program.

Before describing the state of previous research or the
technical details for the formulation and estimation of the
multilevel generalized linear model, an example on how the

model can be applied could be illuminating.

An illustration for the application of multileveQneraleed
linear model to educaLional research

To understand the formulation of the multilevel generalized
linear model, it might be best to study a simple hypothetical
example. A similar data analytic scenario from a real national
school survey, with a normal outcome variable analyzed through
the Hierarchical Linear Model, can be found in Raudenbush and
Bryk (1986).

Suppose a survey on the number of altruistic behaviors of
the recent month is made on students from a large number of
schools. Some of the schools have an ethics education program.

The socioeconomic status (SES) of each student was also

recorded. The research question is to study whether schools
with ethics education programs affect the differentiation effect
of SES on the exhibition of altruistic behavior in school. The
analysis requires that we model the random effects due to each
individual school, thus a multilevel model is formulated. Since
the dependent variable takes on a random natural number greater
or equal to zero, a Poisson model can be used to fit the data. As
the Poisson model is a member of the exponential family, this is
an example of the multilevel generalized linear model.

In level 1, the 1"" student's altruistic behavior ya. in school

j is modeled by his/her SES level:

1
log/1,). =60]. +,BUSES,.J..
Here 1,] refers to the mean of a student's counts of altruistic
behavior. The response variable yijllij is assumed to be Poisson

distributed with parameter iv. The logarithmic function is the

'link' function (Aitkin, Anderson, Francis and Hinde, 1989, p.76)
that maps the natural numeric count onto the set of real numbers

being predicted linearly by the student's SES variable. ,6,” is the
random intercept for schoolj. ,6”. is the random slope for the

effect of SES on the logarithm of the mean altruistic behavior

for students in schoolj. A positive ,3”. will mean that the higher

10

the SES of a student is, the more altruistic behavior the student

will elicit. For illustrations, we now assume that empirically ,6”.

is positive. However, the implementation of an ethics education
program can either suppress or elevate the influence of SES on
altruistic behavior. To study the effect of higher level variables
on the relationships between the outcome and the lower level

variables, we need the level-2 equations:

ﬂoj = 700 +701(ETHICS PRGM)j +uoj,
:61} = y“, +yn(ETHICS PRGM)J. +ulj.

The random intercepts and slopes of the schools are being
predicted by the dummy variable for the ethics education
program which is zero for absence and one for presence. The

residual errors or random effects uj's at the school level are

assumed to form a multivariate normal distribution with zero
expectations.
As an example, if schools with ethics education are found

to have higher ,Boj's on average (i.e., yo] >0), it means that

ethics education can help students behave more altruistically. If

schools with an ethics education program have smaller ﬂlj's on

average (i.e., 7/“ <0), then we might infer that ethics education

can suppress the influence of one's social class background on

ll

one's exhibition of altruistic behaviors in schools. In other
words, ethics education enhances the egalitarian tendency of
students toward acting altruistically.

This example has demonstrated the potential of applying
multilevel generalized linear models for real world educational
research involving student-school interaction effects with a

discrete count outcome variable.

 

1To facilitate understanding for readers who are previously acquainted
with the Hierarchical Linear Model (Raudenbush and Bryk, 1986), the
mathematical symbols are chosen to be similar to those of HLM.

CHAPTER II

BACKGROUND

Much work has already been done on the multilevel normal
models. A review of the methodology and applications of such
models in educational research is given by Raudenbush (1988).
Representative approaches that have been widely applied in
educational research include the models and computer programs
formulated by Goldstein (1986, 1987), Longford (1987), and
Raudenbush and Bryk (1986). Since the present dissertation
focuses on the multilevel extension for non-normal models, the
review on multilevel normal models will not be repeated here.

Multilevel extensions of various particular non-normal
models have also been published. For example, Anderson and
Aitkin (1985), Conaway (1990), Stiratelli, Laird and Ware
(1984) and Wong and Mason (1985) have published papers on
the binomial models; Goldstein (1991) on the log-linear model;
and Albert (1985, 1988, 1992), Morton (1987), Tsutakawa
(1985, 1988) on the Poisson models. Albert's (1985, 1988, 1992)
models are Bayesian models with a gamma prior distribution.
Morton (1987) uses the quasi-likelihood approach in an analysis

of variance framework. Tsutakawa (1985, 1988) employs

12

13

approximating techniques for both Bayesian and empirical Bayes
estimation. All of these past Poisson models do not have the
flexibility to allow for a full scale two-level model with multiple
explanatory variables and variance-covariance components.

In terms of the whole exponential family of distributions,
sometimes labeled as the generalized linear model with random
effects, there are models developed by Anderson and Hinde
(1988), Longford (1988), Zeger, Liang and Albert (1988), Schall
(1991), Zeger and Karim (1991) and Breslow & Clayton (1993).

Various estimation approaches and algorithms have been
adopted in the above past research. Statistical estimation
approaches include the Bayesian (e.g., Zeger and Karim, 1991),
maximum likelihood (e.g., Anderson and Hinde, 1988) and the
maximum quasi-likelihood method (e.g., Longford, 1988). The
Bayesian approach is especially relevant when the level-2l
sample size is small and the asymptotic normal approximation of
the posterior distribution using the maximum likelihood
approach becomes inadequate. Contrarily, the maximum
likelihood approach generally involves less computation and
simpler analytic methods. It also gives results similar to those
of the Bayesian approach when the sample size is large. The

maximum quasi-likelihood approach is similar to the maximum

14

likelihood approach except that the quasi-likelihood function
only specifies the relationship between the mean and the
variance and does not contain the full parametric likelihood (For
details, see Wedderburn, 1974 and McCullagh and Nelder, 1989,
p.325). Although the quasi-likelihood approach requires fewer
assumptions than the ordinary likelihood approach, it can fail to
give reasonable results in some cases (Crowder, 1987). There is
also some loss of efficiency when the data depart from a natural
exponential family (Firth, 1987).

The algorithms relevant for the present research are: the
EM algorithm (Dempster, Laird and Rubin, 1977), Fisher-scoring
(Longford, 1987), iterative generalized least squares (Goldstein,
1986, 1989), data augmentation (Tanner and Wong, 1987) and
Gibbs sampling (Gelfand, Hills, Racine-Poon and Smith, 1990).
Tanner (1991) provides an excellent introduction to many data
augmentation methods, broadly defined. It should be noted that
a given algorithm might not be restricted to be used only for a
particular estimation approach. For example, the EM algorithm
can be used for maximizing a general likelihood (Dempster et
al., 1977), for empirical Bayes estimation (Dempster, Rubin and
Tsutakawa, 1981) or for Bayesian estimation (Racine-Poon,

1985)

15

Although several models already exist for formulating a
multilevel generalized linear model, there has not been a full
maximum likelihood (ML) model available for applications in
educational and social studies. For example, although Anderson
and Hinde (1988) formulate a maximum likelihood approach via
the EM algorithm, their estimation method can only allow for a
single random intercept model. The potential extension to high
dimensions of random coefficients is also hindered by their use
of Gaussian quadrature integration technique that works only for
small dimensions (Rubinstein, 1981). Longford's (1988) non-
normal extension to the multilevel normal model uses an
approximation of the quasi-likelihood for estimation.

A recent simulation study by Rodriguez and Goldman
(1993) concludes that the approximate quasi-likelihood method
of Longford (1988) is equivalent to Goldstein’s (1991)
approximate generalized least square approach with regard to
the multilevel logit model. The simulation results reveal
substantial biases in the estimates of the fixed effects and/or the
variance components whenever the random effects are large
enough to be interesting. Moreover, the multilevel estimates of
the fixed effects from the approximate quasi-likelihood method

are virtually the same as those obtained using standard logit

16

models that ignore the hierarchical structure of the data
(Rodriguez and Goldman, 1993, p.15).

Zeger, Liang and Albert (1988) use a generalized
estimating equation approach (Liang and Zeger, 1986) which
combines quasi-likelihood and robust variance estimation for the
marginal or so called ‘population average’ model. The
estimating equation (Zeger et al., 1988, p.1053) is essentially a
variant of the quasi-likelihood estimation (c.f. McCullagh and
Nelder, 1989, p.327). Schall’s (1991) approach is an
approximate maximum likelihood and quasi-likelihood estimation
by means of a first order Taylor's expansion of the linked data.
Zeger and Karim (1991) adopt the Bayesian approach via the
Gibbs sampling method which is heavily computing intensive.
Karim and Zeger (1992) have reported that a disadvantage of the
Gibbs sampler is the computational burden. In their analysis of
an ecological data set on Salamander mating with 360 binary
responses, it takes about 5 hours of computer time on a 14 MIP
DEC station 3100 microcomputer to generate 2000 simulated
values from the posterior distribution, each obtained after 80
iterations of the Gibbs algorithm. They consider that the time
required is sufficiently long as to possibly discourage the fitting

of several different models (Karim et al., 1992, p.643). This

17

would not be feasible for educational applications with large
data sets. In their paper on approximate inference for
generalized linear mixed models, Breslow and Clayton (1993)
use Laplace’s method for integral approximation in conjunction
with a quasi-likelihood approach.

Even for a particular distribution of the generalized linear
model, there has not been a true maximum likelihood model. For
example, Stiratelli, Laird and Ware (1984) estimate a multilevel
binomial model via the EM algorithm. Because the joint
posterior distribution of the fixed effect parameters and
variance components is intractable, it is approximated by a
multivariate normal approximation. The properties of their
estimates will depart from those of ML estimates. Goldstein
(1991) uses a first order Taylor's expansion for the nonlinear
part of the likelihood and the resultant Iterative Reweighted
Least Square estimates do not remain maximum likelihood
estimates (See Rodriguez and Goldman, 1993).

Overall, Zeger and Karim (1991) have provided the most
promising simulation results so far for a multilevel Bayesian
logit model with two variance components. They have
demonstrated a simulation study using a diagonal variance-

covariance matirx. A random intercept model is used to analyze

18

a clinical data set for illustration. Breslow and Clayton (1993)
present a comparative simulation study against Zeger and Karim
(1991). Their results approach that of Zeger and Karim (1991)

when the sample size is increased.

Impligtions for the present research

Vis-a-vis the present status of research in the field of
multilevel generalized linear models, this dissertation will
attempt to provide an alternative of a full maximum likelihood
estimation approach with multiple random regression
coefficients.

In order to reduce the expected intensive computation, I
choose to adopt the maximum likelihood approach instead of the
Bayesian approach because the latter generally requires solving
high dimensional multiple integrals (Smith et al., 1985) or many
repeated rounds of simulations (e.g., Zeger and Karim, 1991).
The maximum likelihood approach is used in preference to the
quasi-likelihood approach because of the former's well-behaved
asymptotic properties of consistency, unbiasedness and
efficiency (Rice, 1987, p.234-254).

As for the algorithm that maximizes the likelihood

function, the EM algorithm (Dempster et al., 1977) is used for

19

the following reasons. Since much educational data involves
national surveys with huge numbers of schools, the number of
unobserved random effects due to schools could be large (e.g.,
in the thousands). A naive and direct maximization of the
likelihood involves simultaneously estimating the fixed effects
and all the random effects. These huge number of parameters are
unstable to estimate for most common maximization routine.
However, in maximizing the marginal likelihood, the EM
algorithm treats the random effects as missing data and
essentially estimates just the fixed effect and variance
component parameters. The EM algorithm is preferable to the
other Newton-type algorithms because it does not require
computing the inverse of the information matrix. Generally,
simple closed form solutions are attainable for the iterating
steps of the EM algorithm. Iterates are also confined within the
parameter space during the execution of the EM algorithm. It is
also proved that the EM algorithm increases the marginal
likelihood of the observations for every iteration (Dempster et
al., 1977). A disadvantage of the EM algorithm is its relatively
slow linear convergent rate. Because I have adopted Monte
Carlo integration to use with the EM algorithm, a slower

convergent rate could prove to be an advantage because the

20

iterates will not so easily jump out of the parameter space due
to the randomness of the Monte Carlo simulations. More details
about the EM algorithm will be provided in the next chapter.
Before giving the mathematical details in formulating and
estimating the multilevel generalized linear model, the
definitions and properties of the exponential family of
distributions and the generalized linear model are described as

follows.

Exponential family of distributions
The distribution of a random variable Y belongs to the
exponential family if it can be expressed in the form

(McCullagh and Nelder, 1989, p.28-29.):

f (yl 49. ¢) = exvﬂyé' - 11(6)) / a(¢) + c(y. ¢)] (2. 1)

where a(.),b(.)and c(.) are some known functions. The parameter
6 is called the canonical parameter and ¢ is the dispersion
parameter. The function a(¢)=¢/m, where m is the prior weight
for the data. This function reduces to ¢ if the data is

unweighted. It can be easily shown that the first two moments
are (McCullagh and Nelder, 1989, p.29):

E(Y) = 5(9), Var(Y) = b"(6)a(¢). (2. 2)

21

Therefore, the mean of the observations is related to the
canonical parameter 0. The variance depends on the canonical
parameter (and hence on the mean) as well as on the dispersion

parameter ¢. Other representations of the exponential family of

distributions can be found in Dempster, Laird and Rubin (1977)
and Zeger and Karim (1991). For example, the Poisson,
binomial, and normal distributions can be expressed in the form
of the exponential family of distributions as shown below

(McCullagh and Nelder, 1989, p.30):

Table l. Poisson, binomial and normal distributions as members
of the exponential family

 

Distribution Notation ¢ 19 b(0) C(y,¢)

 

Poisson [9(1) 1 log}. exp(6l) —logy!

Binomial B(n,7I)/n l/n log(——”—) log(l+e9) log(n]
l-rr ny

2
Normal N(,u,02) 02 ,u 92/2 -%[y?+108(27r¢)]

 

22

Generalized linw models

The linear model based on the exponential family of
distributions is introduced by Nelder and Wedderburn (1972) as
the generalized linear model. The model is defined by N
independent random variables Y1,Y2,...,YN sampled from the
exponential family of distribution with corresponding canonical
parameters 61,62,...,6N and a common dispersion parameter ¢. The
joint probability density function of the Y's (unweighted) is

therefore
f(yny2,-~,y~|91,92,~-,9~,¢)= 6XP{_Z[y.-9.- - b(49,-)1/ ¢+ZC(y.-,¢)l}- (2 . 3)

In actual modeling, the expected value ,u, of Y, is predicted by a

linear combination of explanatory variables xl,x2,...,xN as follows:
g(#.-)=X.'ﬂ, (2-4)

where g(.) is a monotone link function (Aitkin et al., 1989,

p.76), ,6 is the p x 1 vector of parameters and X1 is the vector

of explanatory variables. The link function relates the expected
values of Y to its linear predictors through a transformation
function. The principal usage of the link function is to map the
limited domain of the mean of the observations onto the real
line. For example, the domain of the mean number of counts

from a Poisson distribution is never negative, the logarithmic

23

function puts the mean onto the real number domain. The
commonly used link functions for the Poisson, binomial and
normal distributions are the logarithm, logit and identity
functions respectively. These link functions have the property of

6,=g(,u,.)=x,',6 and are named as the canonical links.

 

' Personal communication suggested by Dr. Stephen W. Raudenbush.

CHAPTER III

STATISTICAL MODEL AND ESTIMATION STRATEGY

A multilevel generalized linear model
In the multilevel framework, an observation is represented
by yij, i=1,2,...nj; j=1,2,...,J. The subscript i refers to the units
in level-one, e.g., pupils, andj refers to the units in level-two,
e.g., schools. P is the number of level-one predictors and Q, is
the number of level-two predictors for the pth random
coefficient in level-one. Conditional on a P+1 x 1 random effect
vector u), the observations are random samples from the
exponential family of distributions with density (McCullagh and
Nelder, 1989):
f(y.,-| “1:6y9¢) = eXPKyy-Q-j - b(011W ¢ +C(yy-,¢)l- (3- 1)
The conditional moments are
1,, s E(y.-,-|u,) = my), Var(y.-,-lu,) = we...» ’ (3.2)
Let the level-l equation be:

g(#ij)Enjj =ﬂ0j+ﬂljxlij+'“+ﬂR/’xl’gj (3-3)

24

f1)

x1y

x23

 

 

\x Pg}

The level-2 equations are

25

.30)

ﬂu

I31),-

ﬂOj = 700 +701w01j +"°+7OQOWOQ0j +u0j.

:81} = 710 +71lwllj+'”+71QleQ,j +ulj.

161’} = 7P0 +ylePlj +---+7 PQPWPQPJ +tu‘

The above can be expressed as

180) 1,wmj,...,w0Qaj
4,. _ 0
ﬂ.) 0

(700‘
701

7 OQO

710
711

719,

7P0
7P1

 

 

K7 P9,)

(3.4)

0J'

“u

1'

In matrix notation, the level-1 (3.3) and level-2 (3.4) equations

can be rewritten compactly as

ﬁlly); 771} = xijﬂp

(3.5)

26
ﬂj=w17+ur (3.6)

P
The matrix wJ has dimension equal to (P+1) x Z(Qp+l) and the
p=0

P
vector 7's dimension is equal to 2(Qp+l) x 1. If Qp=Q for all p,
p=0

then ZPXQP +1)=(P+l)(Q+ l ).

Combining the equations (3.5) and (3.6), we have
g(,u,.j)=17,.j =1r'“wj}’+x'uuJ (3.7)
=Auy+Bou (say). (3.8)
The random effect vector 111 is assumed to vary with a
multivariate normal distribution with zero expectations and
variance-covariance matrix T with dimension P+l x P+l.

The principal objective of the statistical estimation for this

model is to estimate the P+l fixed effects 7 and the symmetric

matrix T with (P+1)(P+2)/2 unique variance-covariance
components.
As noted earlier, direct maximization of the marginal

likelihood function log f(Y|}/,T,u) is not feasible because the

number of unknown vectors 11], which depends on the number of

schools, can be very large. Alternatively, one can integrate out

27

the random effects u and maximize the log likelihood by partial

differentiation. Let (p=(}/,T), we have

a—i—logﬂWhgloglﬂmuvmmdu.1 (3.9)

However, except for normal distributions, the above
integral cannot be solved analytically. High dimensional

numerical integration of the above requires the knowledge of (o

the unknown parameter. Thus we need some kind of iterative

procedure to maximize the log likelihood.

MXimum likelihood estimation via the Monte Carlo
EM ﬂgorithm

Treating the random effects 11 as missing data, the EM
algorithm (Dempster, Laird and Rubin, 1977) can be used to
maximize the analytically intractable marginal log likelihood
function f(Y|y,T) indirectly through iterations derived from a
combination of the E (expectation) step and the M
(maximization) step.

The E step computes a Q function which is the conditional

expectation of logf(Y,u|7,T), the 'complete data' log likelihood,

with respect to the distribution of the 'missing data' 11

28

conditioned on the observed data Y and the parameter values

(0") at the 1th iteration.

Q(<0;<0"’)= j108f(Y,u|¢)f(UI<0"’,Y)du (3. 10)

The Q function is a function of (a and is maximized to

obtain the parameter values gym” for the 1+1”I E step. That is,

the M step solves the following equation:
a . (I) _
, Q((0,(0 )—0. (3.11)
“P

Iteration between the E and M steps continues until the
parameter values converge. Dempster, Laird and Rubin (1977)
have shown that the EM algorithm increases the marginal

likelihood f(Y|7,T) for every iteration and the algorithm

converges to at least a local maxima. Wu (1983) discovered an
error in the proof for the convergence of the EM algorithm in
Dempster et al. (1977). However, Wu (1983) shows that under
mild regularity conditions the EM sequence converges to the
maximum likelihood estimate. Wu (1983, p.95) contends that
Dempster et al.'s (1977) results on the monotonicity of
likelihood sequence and the convergence rate of the EM
sequence remain valid. In other words, employment of the EM

algorithm by itself guarantees maximum likelihood estimation if

29

the starting values are close to the maxima and convergence is
achieved.

In practice, the integral for the expectation step may be
hard to obtain. One can use the so called Monte Carlo EM
algorithm (Wei and Tanner, 1990) to bypass the high
dimensional multiple integral. The Monte Carlo EM algorithm is
an EM algorithm with the expectation in the E-step
approximated by a Monte Carlo integration.

The conditional expectation of the complete data log
likelihood is now estimated by the arithmetic mean of M
realizations of the complete data log likelihood. with the

unobserved random effects substituted by their simulated values:

Q(¢;¢“’)-=- ﬁzlogﬂvmnw), (3.12)

m=l
where um is generated from the distribution f(u|(o‘”,Y). It should
be noted that the accuracy of the above approximation can be
improved to any degree by increasing the number of random

samples in the calculation.

 

1For simplicity of notation, the domain of the multiple integral in this
dissertation is not printed. Interested readers can refer to the notation
used in Appendix A1 for a more rigorous presentation.

CHAPTER IV
PARAMETER ESTIMATION

Convergence to maximum likelihood estimates is achieved
by alternative implementations of the E-step and M-step of the
EM algorithm. The E-step consists of computing the conditional

expectation of the 'complete data' log likelihood function below:

The expeat_ation step of the EM algorithm

Q(¢;¢"’) = I logf(Y,ul¢)f(ulr/)"’.Y)du <4. 1)

= i[Zlogf(Y,,u,l¢)1f(u|go“’,Y)du

j=l
= zjlogm.u,1¢>/(m¢“>.v>du. (4.2)
1:1

By applying the results from Appendix A1, Corollary 1,

expression (4.2) becomes

2j108f(Y,,u,l¢)f(u,l¢"’,Y)du,
pi

= ZjlogﬂYpu,l¢)f(u,|¢"’,Y,)du,, (4 . 3)
1:1

due to the independence between level-2 units.

Applying Bayes' theorem,

30

31

f (Y,|u,,¢"’)f(u,lco‘”)
f (Y1l (PM) i ’

 

Q(<0;¢‘”)= Zjlogmpum
1:1

= Zzlrlk’gf(Yp“1|¢)f(Y1l “1,¢"’)f(u,|¢“’)du,. (4. 4)
Fl 1

The constant C,:f(Y,|(p“>)=jf(Y,|u,,¢“>)f(u,|¢<”)du, is

computed by the Monte Carlo integration method (Rubinstein,

1981).

l M
C1zﬁ2f(leunp¢m)a (4-5)

-=l

where “11a“21w-sum ~f(uj|(0“’), which are M samples of simulations

from the multivariate normal distribution.

Applying the Monte Carlo integration technique again to

(4.4):

Wm

Q(<0;co"’)~ Z—h:?zlogf(v,,u..l¢)f(v,law“) (4.6)
1:1 1 111:1

 

logf(Y,,u...l¢)f(Y,lawe”)

°<= Z icing/(Y1 “w(0)+logf(u_u|¢)]f(yll W11.)

111:1 1:1 1
M .1

ocZZUogf(Y,Iu.,,7)+logf(u...lT)]w...., where

-:1 1:1

(4.7)

 

 

f(Y,|u...,,¢"’) "‘ ’ f(Y,lu.,,¢“’)
C 22 C .
1 n=11=1 1

32

The M-step maximizes Q(.) to give the new parameter

(1+1) (1+1)

values (p . Parameters y are obtained by solving the

equations

M J 1 Y ,
2.11213 Dwain" ”We” (4.8)

 

and T“”’ by solving

M ’ logf(u...,|T) _
2:3 Wee” y/mj—O. (4.9)

 

The maximization step of the EM algorithm

The solution for maximizing an unweighted single level
analogue of the likelihood function in equation (4.8) can be
found in Aitkin et a1. (1989, p.322-325) for the generalized
linear models. Anderson and Hinde (1988, p.3851-3852) have
also derived the solution to a random intercept multilevel model
with weighted likelihood using the numerical Gaussian
quadrature method. In this dissertation, the solution to a random
coefficient multilevel model with weighted likelihood function

involving Monte Carlo integrations is derived below.

Estimation for the fixed-effect parameters

33

The maximization of Q(.) with respect to y is equivalent

to maximizing a weighted sum of log likelihoods of the
conditional distribution of the outcome variable given the

random effect variables “-41- The inclusion of the outer

summation sign in subscript m is equivalent to expanding the
original data set by M times with the simulated random effects

um, appropriately substituted (c.f. Anderson and Hinde, 1988).

We have to maximize the following weighted log likelihood

function of the exponential family with respect to y:

L<7>=222{u.6.. —b<9..)1/¢+c(y..¢)w.j. (4.10)

m=lj=l i=1
with g(,um,.j)=9m,.j =77my =x11w17+xi1un1§Au7+Buunu using the canonical
link function. The derivative of (4.10) w.r.t. 7 is

M J "j 56 ..
ﬂ=ZZZIy.-b'wm.>1jlwmj/¢. <4“)

697 nr=l j=l 1:1

For exponential families, p=E(y)=b'(t9) and Var(y)=¢b"(6), thus

 

59mg? _ 86mg alum!) . ﬁlm”

 

 

ay "44.1%.... 0’7
= ”1 . 1 .Au. (4.12)
b (any) g'(/ijj)
M J "r - A .
Therefore 1:2220” pm”) "W”. (4.13)
5’7 m=l 1:1 1:1 meg'UImy)

34

To form a Fisher's scoring algorithm (c.f. Seber, 1989, p.685) as
in the GLIM program (Aitkin et al., 1989), we need the expected
value of the second derivative of the weighted log likelihood:

By results proved in Appendix A2,

52L) M J "I 1 [01,“, 01mg.)
E——— :— —E —— 4.14
t... 2.2,: .. .. < >

 

z 9142': 1 EU. w... 121114.113..-
"1:1 1:1 1:1 Wm} mez [g'U‘my )]2

111:1 1:1 1:1 mgig'(#m1j )]2

 

MJ";

=‘ZZwajAuA1 (4'15)

"1:1 1:1 1:1
: —A'QA (4.16)

where A={[A,,...,AJ]m=l,...,[A1,...,AJ]m=M}' with A11={A11a---,A..,1}' and!) is

. = WM]
'1 Vmglg’ (11....) )]2

 

a diagonal weight matrix with elements (am (4.17).

For the 1th iteration of the Fisher's scoring algorithm, the

(1+1)

new estimate 7 is given by

d“)
y"+”:y‘”+(A'Q‘”A)"— . (4.18)
37
3L M J "1
NOW 5=ZZZ(yy—#mﬁ)g(ﬂmy)Aﬂwmij (4-19)
m=l j=l 1:1

=A'Q"’6 (4.20)

35

where 5:[(51,]...6'1,“)...(6'“J...6'1,,JJ)],...,[(§',m...6'M,,ll)...(5:,m...6'mlj)]}' with
6..=<y.—1z,..)g'(p,..,) (4.21).
Thus y“) : y”) +(A'Q"’A)"A'Q‘”§‘”
: (A'Q‘”A)"(A'Q'”Ay"’ + A'Q"’§“’)
=(A'Q‘”A)“A'Q"’Z"’ (4.22)
where Z'”=Ay‘”+5"’ (4.23).
The scoring algorithm can now be viewed as an iterative

reweighted least square algorithm (IRLS). The 'adjusted

dependent variable' Z is regressed on explanatory variables A

with diagonal weight matrix Q. The new estimate of 7“") can be

found by solving the following system of equations using

ordinary Gaussian elimination (Johnson and Riess, 1988):
(A'Q<”A)y<’+” : A'Q“’Z"’. (4.2 4)

Estimation for the variance component parameters

The new variance component parameters Tm” belong to the
multivariate normal distribution and are estimated by the sum of
the weighted squares and cross-products of the simulated
random effects. This is based on an extension of the standard

maximum likelihood estimate (Seber, 1984) for the dispersion

36

matrix of the multivariate normal distribution. The solution to

equation (4.9) becomes

M J
T““’=ZZt//mjuwu'w. (4.25)

m=l j=l
The validity of the above equation is supported by the

following theorem:

Theorem 1. Let u1,u,,...,uJ be J independent p-dimensional
vectors of random variables and each u] have a non-singular

MVN distribution with zero expectation and a positive definite

variance-covariance matrix T. Let 11120 V j and r be the count

J
of wj>0. For er and subject to 211/}:1, the weighted log
1:1

J J
likelihood Zy/jlogﬂule) is maximized uniquely at T = Zy/jujuj'.
1:1 j=1

Proof: The following two lemmas are required for the proof of
this theorem.
Lemma 1: Consider the matrix function f, where
f(T)=log|T| + trfl‘"<l>]. If (I) is positive definite, subject to T
being positive definite, f(T) is minimized uniquely at T=<I>

(Watson, 1964).

37

Lemma 2: Let U'=(u1,u2,...,uJ), where the u are J

1
independent p-dimensional vectors of random variables, and let
‘I’ be a positive semidefinite JxJ matrix of rank r (rzp). Suppose
that for eachj and all h (b¢0) and c, pr(b'u,-=c)=0. Then U"I’U

is positive definite with probability 1 (Das Gupta, 1971).

Let c: ”€ij log271, then

I

1 1 , _
2V1 logf(ule)=c-:2—Zt//j longl—EZV’J‘U T 1“}
j J J

1 1 , _
=c——log|T|Zr//j——Zy/jtr[ujT'uj]

2 j 2 1

l l -l 1
=c-—log|T|-—trT ijujuj

2 2 ,

=c—%{log|T|+tr[T"Zwjujuj'D. (4.26)
J

J
Let U'=(u,,u2,...,uJ) and ‘I’=diag(1//l,y/2,...,1//J), then ijujuj'
1:1

= U"I’U. ‘1’ is positive semidefinite of rank r because r is the

count of V11>0- Since 11,- ~ Np(0,T) and T is positive definite,

then for all h (b¢0) and c, b'uj ~ Np(0, b'Tb). This distribution

38
is nondegenerate as b'Tb>0. Hence pr(b'u,-=c)=0. Therefore for

J
er, ijujuj' is positive definite with probability 1 by lemma 2.
1:1

J
Directly applying lemma 1 to (4.26), Zyleogﬂule) is

1=1

J
maximized uniquely at T =Zy/jujuj'. D
1:1

Applying the above theorem, the weighted likelihood of the

multivariate normal distribution ZZlogﬂumleyymj in equation
1» 1

m} "'1'

(4.9) is maximized by T"“’=ZZz//mju u ' because it could be
m J

easily verified that 232me =1.
m 1

In practice, because the M-step needs only one iteration of
an IRLS, this EM algorithm becomes a generalized EM (GEM)
algorithm (Dempster et al., 1977). The fixed parameters are thus
expressed as a weighted least squares expression (4.22).
Applying the theorem above, a simple closed form solution
exists for the variance component parameters. Hence, the
advantage of using the EM algorithm, which often involves

closed form solutions in the M-step, is maintained.

39

The convergence to the maximum likelihood estimate can
be monitored by tracing through the expected complete data log

likelihood function, Q(.). The reason is that every step of the

EM algorithm will increase the log likelihood function through

the increase of Q(.) (Dempster, Laird and Rubin, 1977). With

Monte Carlo EM algorithm, the complete data likelihood or any
other estimated iterates converge in a stochastic fluctuating

manner. By plotting the iterative values of Q(.), the algorithm is
stopped after the Q(.) function has risen to a maximum plateau.

In an alternative parallel way, one can also monitor convergence
by noting the increment of the model parameters by plotting
parameter values against the number of iteration. After
convergence, the algorithm can be allowed to run for more steps
with an increased number of Monte Carlo samples so as to

decrease the stochastic variation of the estimated parameters.

CHAPTER V

VARIANCE OF PARAMETER ESTIMATES

The variance for the estimates of the fixed parameters and

variance components can be estimated using the inverse of the

information matrix (see Aitkin et al., 1989, p.81):

 

flogfmwj“ . (51)

3‘02

 

W14

6’
The diagonal elements of V correspond to the large sample
variances of the maximum likelihood estimator for the

parameters (p. Louis (1982) shows that the Fisher information

matrix of the incomplete data likelihood can be expressed as a
function of the conditional expectations of the complete data
information matrix and the cross-products of the complete data
score function. In the terminology of this study, Louis’s (1982)

result can be re-expressed as:

 

 

 

_ 0"210gf(Y|(0) = _E[62 logf(Y9ul¢)]_E[31°gf(Ya“|(P) 510gf(Y, Ill¢)]

 

ﬂgoé'go' 560590. &(p ﬂw'
+§logf(Yl¢)§108f(Yl¢), (5 .2)
410 310'

with the expectation taken with respect to f(ulY,go). The

observed information matrix is thus

40

41

_ 4: log/”(11141 : -111
4041' I, 4144'

 

 

mag/”(um E15 1011101111,) 5 ”gm “"o’i (5 3)
6’ W W

for the last term in equation (5.2) above vanishes at the
maximum likelihood estimator (,7).
This method makes use of the complete data distribution

f(Y,u|(o) rather than the marginal distribution f(Ylgo) which does

not have an analytic derivation. Similar to its application in the
EM algorithm, the method of Monte Carlo can be applied to
(5.3) to ease the integration problem (Tanner, 1991, p.39).
Since maximum likelihood estimators are asymptotically
normally distributed, the estimated standard errors can be easily

used to create confidence intervals and hypothesis tests.

Estimation for the variance of the fixed effect estimates
The components of equation (5.3) required for the
evaluation the variance of the fixed effect estimates are derived

as follows:

 

 

éalogf(Y,u|¢) 6%ng ,u,I¢) Y d“
E1 (W 1=1§§W f(ul .10)
g M J 1 ﬂzlogﬂY uwlco)
EEMC. 44

 

f(Y jl unpg’b)a

(see Appendix A3)

42

 

 

 

 

 

 

 

      

___M , 1 3210.5“wa (0) 421816.114] -
§§Mqi 414' 5757 “Y '“ “”"0’
M , 1 a’lgﬂ | ) .

zgqui o Ergu'ﬂjﬂY lu..,,¢) (5.4)

0"210gf(Y’|uw,¢) ‘1 59,819,,” "J AEAE'
h = -b"6’ = . 5.5
6W Z ( a) 31' ¢ Emmet”)? ( )

Moreover,

 

[610g f(Y, ul (0) 418161.411
4) (91'

 

_E[(4Iogf(vlu 4) ++4logf<ul4))(41011161111,11) 410111111145]
4) 41' 41'

 

_E meg/”(nu 11) ﬂog/(an 12)]
a”

 

 

1‘11

_[2”:3108f“laymié’lognguPW]
1:1 1:1

 

_E[Z':5108f(Y Iu,.¢)0"logf(Y',lu,,(/1)]

1=1 37

+E[ 2": 3108f(Ylluj,(p) ﬁlogf(Yj.|uE,(p):l
2. 87 a)”

 

51 , 31 Y ,
=J‘Z 0gf(Y1|u1¢) ogféyjiuj ¢)f(ulle,¢)duj

 

 

“Li 2": 3108f“ MumwlogﬂYlu, e)

JJ'=1;1:1' 57 5)’ f(u “Y'l’ ¢)f(u1'lYJ"¢)du1dur

with the latter term equal to (c.f. Appendix Al)

43

31 Y1
0%“ 1|“: ’u'p)f(Y J1|}! .,go)du,. (5-6)

 

 

J 31 Y ,
Z J. 08“ ,luJ ¢)f(“1|Yp¢)d“1j

 

 

 

 

 

 

 

 

 

lJ'=l;J*J' ﬂy ﬂy.
1 ﬂgﬂmmﬂwﬂwm .
= . , (T , 1991) (5.7
”EH 0) 0.7 anner )
= , 610g 101,111) J 510g f(Y,|¢)_ ’ é’logﬂY |¢)0”logf(Y I¢)2 5. 8
Z3 4 2.} 4' .243 a) 01 ( )
with ié'logf(YE|(o)| ié’logf(YE|(p)| =0-0=0 at MLE of go. Hence
1 07 |¢=gb 3 a}, ¢=é1
at MLE of (p,
E[41ogf(v, ul (0) ﬂog/(Y, 11111)]
a) 41' .
J M 1 3108f(Y|u ,(owlogﬂYlu ,(0) -
NEEMC 0,; "" a); l“ f(leuNJP)

 

j=l In=1

M 1 alogﬂv In... 11) .
[.EMC) 07' f(Y l 11.19)} (5.9)

J M 61 Y , ..
{“2 MIC, og/(EEIu.l ¢)f(Y1|“-11"/’)]'

 

(using Appendix A3 and footnote 2)

 

 

M "1 1_m1'jA "1 i—miiA. A
=ZZ‘E%C‘[ZWJ 11 ”’sz 1a ))"]f(Y,lu..,,(o)

1 1:1 meg.(#m11) 1:1 Vmijg'(pm.j

 

 

.1 M 1 "1(yVE—me)Au] Y A .
HiiglMCj [21 14.1161...) ﬂ ’lu"’¢)]

44

 

 

M "I 11'— MA
[ZML‘iZO1/.;(p).)u]ﬂy Yiluwmﬂ (5-10)

Estimation for the var_i_ance of the va_r_i_ance component estimates
The variances of the covariance estimates are estimated in

an analogous way as that for the fixed effects estimates in the

preceeding section. Results follow by replacing the partial

deferential of y by that of t which stands for any arbitrary

element of the covariance matrix T.
From Longford (1987), we have the following two useful

derivative formulas:

 

3 3T
—1 T=tr(T"—) 5.11
31 ogI I 31 ( )
-1
and 5T :—T"§r—T“. (5.12)
31 31'

Let r' be any arbitrary element of the covariance matrix T other

than 7. Thus,

 

 

-l
2,log|T|=tr[éT ﬂ): tr(-T’l 3TT_, 31")

 

313731 3T 3t 3t
-1
(5.13) and 321‘ . = — T‘lﬂ(—T"lﬂ.T") +(-T'lﬂ.T'1)ir—T"l
3131 31 3t 3t 3t
=T‘lér—T‘1ZFTT'1 +T‘1ﬂT"lﬂT". (5.14)
32' 3t 3t 3t

45

 

Since log f(umjlgo)oc—%log|T|—%umj'T"umj, therefore,
310 u ,
gf( mj|¢)=-l ”(T—lﬂj-umj (T-lﬂT-ljumj 9 (5.15)
31 2 32’ 3T

 

and

32108f(1{...|(p) = __;_[E(_T-. 31‘ 31‘)

———T" _
3131 31' 3t

+u,,,'(T“ gr" 3151“ +1“ gT—lgrljumji (5.16)

As in the variance estimation of the fixed effects, we need

 

 

,0 031011111.» E[3’logf(Y,uI¢)] and E[é‘logf(Y,ul(p)3108f(Y.ul(o)].

 

 

 

3751' 3! 52.1
Now,
E flogﬂlee)
3131'
~” ' 1 52108f(Y1|u..1,¢) é’zlogﬂumle) .
~§§MC,[ 3r3r' + 5131' f(Y1'“"1’¢)
"' ’ 1 flogﬂleuwe) .
=ZZMC 3131' f(leumm). (5.17)
n=lj=l j
And,

 

E[3108f(Y, ulw) 3108f“, 11W]
at 3'

 

= E[(3logf§(tY| 11,10) +3logafjulw)xﬂlogf;tYl 1w) + 3108511195]

 

=E[3logf(UI(0) 5108f(ﬂ|¢)]
3t 31'

46

I
J=1 i=1 (9r

 

 

Z

5’7 11:11:11 57 31'

—

_%imeﬂwwimeﬂwm]

_Eiﬁlogfwjlmé’logfwjlm] E[ -' mug/(uImalogfwj-Ie)

j:

 

J alogﬂ |¢)3108f( le)
=I§ 3qu ET‘” f(u I Y,.¢)du,

, alogf(u,1¢)3logf(u 9|)

+IZ

1.1'=1;1==1' 5" 3r

 

/( uY,I ,¢)f(u,.IY..,¢)du,du

with the latter term equal to (c.f. Appendix A1)

108f(u ,le)

" 31
Z J- 0gf(uji¢)f(ujle,¢ﬂuan f(u uJIY j,¢)duj' (518)

1.1'=1-.1:1' 3’

 

 

= J 3logf(Yj|(p) 3logf(Yj.|(p)’ (see Appendix A4) (5.19)

I
JJ'=1 'Jﬂ' 52' at

 

z ’ 5108f (le 9): 3108f (YjIe) _ 2’: 3108f (lee) 3108/ (le (o)
1: a: 1-, 51' =1 31 31'

 

 

 

(5.20)

 

 

with ialog/(le ialogﬂYﬁmI

=0-0=0 at MLE of (0. Hence
1 at 1 3r

|¢=é |¢=¢

at MLE of (p,

410g f(Y, um) 610g f(Y, 11112)
E1 ET 3.1

 

... " M 1 3108f(ujlco)é'logf(u,lco) .
"Eémc, E, E f(Y Inwe)

 

 

J ht (91 A 34 (91 A
12:? 2.1M1C oggujlmf(y1|“w¢)][§Mbj 0ggifu'mﬂYlu 41(0)](5-21)

47

(using Appendex A3 and A4)

 

lTanner (1991, p.37) shows an equivalent of
alogﬂYlu ,(P) alogﬂYlgo)
I a; ’ f(uIYpmahnl = ﬂy I .

2 Result follows from a simple identity:

 

 

J J J J
2012b]. =Zajbj + Z ajbj.
J=l j=1 j=1 J.j'=lJ*J'

for any real number aj,bj. with j,j E{l,2,...,J}.

CHAPTER VI

EMPIRICAL PROPERTIES OF THE STATISTICAL MODEL

The formulas for the estimates of the parameters and the
associated standard errors of the Multilevel Generalized Linear
Model (MGLM) in the preceeding chapters were used in building
up a computer program. Microsoft FORTRAN Version 5.1 was
the software used in writing the program in FORTRAN. Sample
FORTRAN subroutines for mathematical and matrix operations
and random Poisson variate generations were obtained from
Press (1989). The program was built, tested and implemented on
a GatewayZOOO IBM compatible personal computer at clock
speed 33 MHz with an Intel 486 DX microprocessor. Random
numbers are generated by the default random function from the
above FORTRAN software. Multivariate normal variates are
generated by the LU decomposition method given by Rubinstein
(1981)

In the following sections, we take a closer look at the
performance of the estimators as programmed. The purpose is
not so much to prove absolute validity of the program as to
draw insight from several program runs. Simulation tests for

validity are left for the next chapter.

48

49

To facilitate the analysis, I limit the model to the One-Way
Poisson model with a single random component, also known as
the random intercept model. In other words, the parameter set
will contain just a fixed effect 7 (Gamma) and a univariate
variance-covariance matrix T (Tau). Because Monte Carlo
integration is used in the algorithms, the stochastic nature of
the random variate generations will be revealed in the
subsequent figures as minor oscillations along the path to
convergence. The parameter values for 7 and T are respectively
-0.7 and 0.16. The data set generated from these parameters has

level-2 units J=50 and level-l units nj =20 for all j. The number

of samples used in the Monte Carlo integrations is 200. Two

starting positions are attempted. One set of initial 7 and T pair

is (-1.5, 0.3) and the other set is (-O.l, 0.05).

mximization of the likelihood function
Under normal conditions, the EM algorithm will increase
the likelihood function to at least a local maximum through the
iterations. It is instructive though to watch how the likelihood
function approaches the maximum for some sample runs. As
shown in Figure l, the log likelihood functions from two

different sets of starting values increase and merge to a plateau

50

in around 15 iterations. After 15 iterations, the log likelihood
function rises from -1015.6 to -970.6 when the starting values
for y and T are respectively -1.5 and 0.3. For the other set of
starting values for y and T, being -0.1 and 0.05, the rise is from
-1031.9 to

-970.8. The results in Figure 1 give evidence supporting the
achievement of maximum likelihood. The shape of the two rising
likelihoods also gives support that the likelihood function is
increased with the number of EM iterations. The strict
monotonic increase of the likelihood function is masked by the
random fluctuations due to the Monte Carlo samples.

The log likelihood is calculated as follows:

10g f (YI 90‘”) = Zlogf(Y,Ico"’)
j=l

J
= 2 log CJ
j=l

z Zlog£ﬁ2f(leumj,g/)m)) (6.1)

j=l m=l

51

Figure 1. Log likelihood function as a function of the number of
iterations for two sets of starting values

Log Likelihood

-930

-950

-970

-1010

4020

-1 030

-1 040

ﬁ

 

 

5 10 15 20 25 30

Nunber of Iterations

52

Convergence of estimaﬁd pa_r_a:meters

As shown from Figure 2, the estimates of y and T both
converge rather smoothly to stable values after some 15
iterations. The slight discontinuity for the converging T
sequence with starting value 0.3 is caused by my fixing the value
of T for the first three iterations in order to improve stability.

Compared to the parameter value of 7: -0.7, the iterations
from two different initial positions obtain the values of -0.700
and -0.690 at the 15th iteration. Similarly, estimations of the
variance component T obtain the values of 0.205 and 0.199,
which are not far from the theoretical parameter value of 0.16.
These specific deviations are due to the random sampling errors

in creating the data sets.

53

Figure 2. Convergent paths of model parameters from two initial
positions

 

 

'0.4 II

 

 

Gamma

 

 

 

 

 

 

Number of Iterations

 

 

 

 

 

 

 

 

 

 

Odd-
GIN-

Hunter of Iterations

 

 

 

54

Convergence from different starting values

Given the parameter values of y and T as respectively -0.7
and 0.16, the program was tested for two sets of starting values.
The actual iterating paths are shown in the previous three
figures. It is evident that the convergent parameter values
estimated by the two iterating sequences from different starting
positions are identical. The difference in precision obtained can
be accounted by the Monte Carlo nature of the iterations.

Results are summarized in the following table:

Table 2. Comparing parameter and likelihood values obtained by
two starting values (Parameter values for y and T are

respectively -0.7 and 0.16)

 

 

 

Starting values Estimated values at 15”I iterations

7 T 7 T Log Likelihood
-l.2 0.3 -O.7OO 0.205 ~970.62
-0.1 0.05 -O.69O 0.199 -970.77

 

 

Influence of the size of Monte Carlo samples
This program employs Monte Carlo EM algorithm and
computes Monte Carlo integration through the generation of

multivariate normal variates. The higher the number of Monte

55

Carlo samples, the better the approximation to the theoretical
integral by the Monte Carlo integral. Figure 3 depicts the
influence of the number of Monte Carlo samples on the
estimated log likelihood for a random intercept Poisson model.

The same data set generated from the parameter set (7,T) =

(-0.1, 0.05) is being modeled on three different occasions with
the size of Monte Carlo samples vary over 50, 200 and 800 per
each integral. As expected, the run with the least Monte Carlo
samples (M=50) produces the largest fluctuations on the
estimated log likelihood values and vice versa. It should also be
noted that the run with 800 samples seem to achieve the largest
log likelihood on average than the other runs with fewer
samples. It is because parameter estimates are more precise and
get closer to their ML values as the Monte Carlo integration
becomes more accurate with larger generated samples. The
likelihood value computed from these parameter values which
are closer to the ML values will approach nearer to the
maximum likelihood value.

The effect of the size of Monte Carlo samples on parameter
estimation is similar. The convergent paths of the estimated
parameters fluctuate less when the number of Monte Carlo

samples increases.

56

Figure 3. Influence of the size of Monte Carlo samples on the
estimated log likelihood

 

 

-964 HHHHHHH+H
5 1 1 2
O 5 O
'966-(I-
’968-u-

 

Log Likelihood -970 -L

, .A AA\ 1'
._ [WW '“ "’ " ‘
~974 '/

-976

 

Nunbor of Iterations

 

 

 

In this chapter, illustrative program runs provide some
confirmations on the validity of the estimation procedure and
the computer program. The program does appear to be working
reliably to produce maximum likelihood estimation for at least a
random intercept Poisson model. Different starting values lead
to the same results at convergence. The size of Monte Carlo
samples is found to affect the smoothness of the convergent path

and the proximity of the statistical estimation to the true ML

57

estimate. More formal simulation tests will be presented in the

next chapter.

CHAPTER VII

SIMULATION STUDY

Simulation test for a random iaterceat model

 

Despite its costliness, the simulation study is one of the
best strategies to access the quality of one’s statistical
estimation. Using selected parameter values, random variates
can be generated from their respective density functions to form
a large number of independently simulated data sets. Parameter
values are then estimated from each of these data sets. The
means of the estimated parameters averaged over all the data
sets are compared with the respective chosen parameter values
to determine the bias, if any, of the estimators. The empirical
variance of the estimators might also be used to compare with
the variance of the other estimators to determine their relative
efficiency. The following paragraphs describe a simulation test
for the simplest multilevel generalized linear model: a random
intercept model with no predictor variables. This model is
analogous to the one-way random effect ANOVA model.

The observation yij has a Poisson distribution with mean

parameter pg. Using a logarithmic link function, the model is

58

59

logy,y =y+uj,
where 7 is the fixed effect and u] is the random effect for school
j. The random effect u} is modeled by a normal distribution:
uj ~N(0,r)

For the simulation test, the parameters y and r are set to
be -0.7 and 0.16. Thus the school mean is expected to be
e417 =0.497z0.5. The mean 0.5 is chosen to simulate the
problematic scenario in data analysis where many zero
occurrences are expected from a Poisson distribution and the
distribution will not look normal and symmetrical. Parameters
for the standard error of )7 and f are set by the empirical
standard deviation of )7 and f (For more detail, see Press
(1986), p.529-532). Each simulated data set contains 50 schools
each with 20 students. One hundred independent data sets are
created. They are fitted by the random intercept Poisson model
and the estimated parameters are recorded. To be economical in
time, the beginning iterations of the program employ fewer
Monte Carlo (MC) samples. The later iterations use larger
samples as results approach convergence so as to reduce
stochastic variation in the final estimates. Thus, 50 MC samples
are used in the first 15 iterations, followed by 200 MC samples

for the next 10 iterations and followed by 800 MC samples for

60

the last 7 iterations. Experience shows that convergence to
about two significant figures of accuracy is achieved with this
scheme. To reduce randomness, parameter estimates from the
last 4 iterations are averaged to produce the final estimate.
However, standard errors of the parameter estimates are only
calculated at the last iteration. They are not being averaged
because they usually fluctuate less and require a large additional
computation resource per iteration. Results of the simulation
test is tabulated as the case 1 study in Table 3.

The results show that the fixed effect parameter and the
variance component are both estimated without bias with error
less than 1.5%. The standard error of the fixed effect has a
slight negative bias below 3%. The standard error of the
variance component is negatively biased by about 18%. Coverage
by 95% confidence intervals range from 88% to 94%. Overall,
the simulation study shows that the statistical estimation of the

model is quite satisfactory.

Table 3. Results of 2 simulation studies. Case 1: Random

intercept model. Case 2: Random coefficient model.

 

 

 

 

(a: 700 7m 7m 79, 100 701 To
My;

(0 -0.7 —— —— — 0.16 -— ——

1?) -0.71 0.16

59 0.074 0.051

39; 0.072 0.042

Coverage 95 88

of95% CI

8&9;

(p -0.5 0.4 0.3 0.2 0.083 0.017 0.0092
50 -0.50 0.39 0.30 0.20 0.086 0.011 0.012
59 0.089 0.12 0.036 0.044 0.031 0.0068 0.0046
59; 0.084 0.11 0.037 0.050 0.015 0.0050 0.0024
Coverage 94 94 95 97 67 69 73

of 95% CI

 

Simulation test for a random coefficient model

A linear growth model predicted by the sex of students is

used in a simulation study for a random coefficient model. The

time data x, is coded as -3, -2, -l, 0, 1, 2, 3 and sex 0)}. is coded

as 0 (girls) and 1 (boys). There are 7 repeated time observations

nested within 100 students equally divided by the two sexes.

62

The level-one model is
logy” =ﬂoj+ﬂuxy.
The level-two model consists of
ﬁe) = 700 +701w) +249),
,6” =710 +ynwj+ulj.
The fixed parameters 700,701,710 and y” are respectively set

to be: -0.5, 0.4, 0.3 and 0.2. The dispersion matrix is

0.0827 0.0165 , .
Thus the correlation between ac] and u“. 18 set

0.0165 0.00919 '
to be

rm = 0.0165 = 0 6
,Iemr“ J0.0827x0.00919

 

 

 

The log expected mean and expected mean (in parenthesis) of a
student are tabulated below by sex and by the beginning and

ending time points:

Table 4. Log expected mean and expected mean (in parenthesis)
by time and sex

 

Time xv.

 

Sex 60. -3.0 3.0

J

 

Girls 0 -1.40 (0.25) 0.40 (1.50)

Boys 1 -1.60 (0.20) 1.40 (4.10)

 

63

The iteration scheme is as follows: Iteration begins with 50 MC
samples. At the 20th iteration, it increases to 200 MC samples
for 10 iterations. At the 30th iteration, 800 MC samples are used
for the last 8 iterations. Experience with some trial runs shows
that this iteration scheme is generally more than sufficient for
attaining the desired accuracy of the estimates. The final
estimates for the parameters and the parameter variances are
calculated by the same method as described in the previous
random intercept case. Again, 100 data sets are generated from
the known set of parameters.

Results of the simulation study are displayed in Table 3
along side with the results for the random intercept case.
Estimation for the fixed effects reach 2 significant figures of
accuracy in general except for ym, which has a slight negative
bias by about 3%. Coverage of the 95% confidence intervals are
very good with results ranging from 94% to 97%. Standard error
estimates for the fixed effects 700 and 701 come very close to
their 'true' estimate with only 6% and 8% negative bias. The
standard error estimates for 7,0 and 711 deviates positively from
their 'true' values by 3% and 14% respectively. The 'true' values
for the standard errors of the fixed effects and the variance

components are not pre-selected parameters. They are estimated

64

by the empirical standard deviations of the fixed effects and the
variance components obtained from the simulation runs.
Estimations of the variance components tend to be less

accurate than those for the fixed effects. Nevertheless, 700 is

estimated accurately with only about 4% positive bias. Variance

component I“ is estimated with 30% positive bias, while
component rm is esimated with a 35% negative bias. Standard

errors of the variance components tend to be larger than their
conterparts for the fixed effects, with biases equal to +52%,
-26%, -48%. Coverage probabilities by the confidence intervals
of the variance components are 67%, 69% and 73%. These larger
errors of the variance components estimates may be due to
chance differences because the ML estimators are only
asymptotically unbiased. The large sample assumption for the
normality of the MLE of T may be violated due to the small
within-group sample size of 7.

The quality of my simulation results is comparable to those
of a random effect binomial model using a similar design by
Zeger and Karim (1991, p.83). Although they use the binomial
distribution instead of the Poisson distribution, it is still
interesting to compare the degree of accuracy of their results

against mine because both studies have identical structural

65

models, variable values and number of cases. For easy
comparison, only absolute value of the biases are shown below.
In their simulation study with 2 variance components, fixed
effect estimates are biased by 10%, 14%, 10% and 14% (Mine:
0%, 3%, 0% and 0%) with their corresponding standard errors
biased by 3%, 6%, 12% and 10% (Mine: 6%, 8%, 3% and 14%).
The nominal 90% intervals by these 4 standard errors have
coverage probabilities of 80%, 89%, 87% and 87% (Mine: 94%,
94%, 95% and 97% for 95% confidence interval). Their variance
components are biased by 69% and 48% (Mine: 4%, 30%)
respectively. Their covariance component (parameter=0.0) is
only biased by 8% (Mine: 35% for a non-zero parameter=0.0165
with 0.6 correlation between the 2 random effects). Their
standard errors of the variance components are biased by 18%,
75% and 57% (Mine: 52%, 26% and 48%). However, the
coverage probabilities of their nominal 90% intervals for the
variance components provided by their standard errors achieve
high values of 88%, 95% and 100% (Mine: 67%, 69% and 72%
for 95% confidence interval). Apparently, the standard errors of
their variance components are quite over-estimated. Thus their
confidence intervals are longer and will often include the‘
parameter values for more than 90% of the samples. On the

contrary, the standard errors of the variance components of my

66

estimation method tend to be under-estimated. Because of the
incorporation of vague prior distributions, Bayesian approaches
tend to produce larger standard errors than the non-Bayesian

approaches (c.f. Tsuatakawa, 1985)

CHAPTER VIII

CONCLUSIONS

This dissertation has demonstrated a maximum likelihood
(ML) estimation approach for multilevel generalized linear
model via the Monte Carlo EM algorithm. Although only the
Poisson distribution is used for illustration, the same approach.
can be used to cover other distributions in the exponential
family such as the binomial, exponential and normal
distributions. Anderson and Hinde (1988) have published the
first paper on the random effects generalized linear model using
a true ML approach also with EM algorithm. Unfortunately,
their model is limited to a single variance component and they
have not given any method to find out the standard errors for
the fixed effects and the variance component. Their adoption of
a numerical quadrature technique for solving multiple integrals
also limited their potentials for extension to multiple variance
component model because quadrature techniques are good for
low dimensions and their required computations accelerate with
increasing dimensions (Rubinstein, 1981).

Built on Anderson and Hinde's (1988) pioneering work, I

have been able to extend their random intercept model to a fully

67

68

random coefficient model because of the successful
implementation of the Monte Carlo methods and the discovery
and derivation for the proof of Theorem 1. With this theorem,
we can easily estimate in closed form the variance components
of any dimensions by equation (4.25). Maximum likelihood
estimation is achieved because the EM algorithm is a smooth
likelihood maximizer through indirectly maximizing the
expectation of the 'complete data' likelihood. Using Monte Carlo
integrations by generating multivariate normal distribution, the
present approach is able to work at higher dimensions of
variance components. The growth rate of computer resources for
Monte Carlo integration is linear with the increase of integral
dimensions. Employing Louis's (1982) method, I derived the
formulas for computing the standard errors for the fixed effects
and the variance components using tractable complete data
information and avoided the intractable incomplete data
information matrix. Simulation studies on a random intercept
and a random coefficient model have shown promising results of
the accuracies of my program. It is in general superior to the
best existing approach (Zeger and Karim, 1991) with regard to
the degree of estimation accuracy. Besides, computation time of

my program is comparatively much shorter.

69

Besides its statistical contribution, the present model can
help bring some of the challenging data analytic scenarios such
as multilevel models, longitudinal studies (e.g. Raudenbush and
Chan, 1992, 1993) and meta-analysis (e.g. Becker, 1988) into a
new realm involving non-normal random effects models.
Research which involves discrete count, dichotomous and
survival time data is often difficult for an applied researcher.
Coupled with a complicated multilevel, longitudinal or meta-
analytic scenario, suitable statistical models that solve these
analytic problems are in great need. This approach can provide
an additional option for doing research using random effects
models. Practical educational studies include a large number of
level-one variables and the demand for high dimension random
slopes can now be solved through the Monte Carlo approach.
The present approach will also serve as a standard to evaluate
the existing approximate maximum or quasilikelihood

approaches.

Further research
Notwithstanding the above mentioned success, a number of

works that remedy some existing shortcomings or advance the

70

usefulness of the current model and program are awaiting. They

are listed as follows:

1) Application to real data analysis

Due to the problems in finding a suitable educational data
set to demonstrate the multilevel Poisson model, only simulation
studies are presented. Upon the access of suitable data sets,
applicational studies should be performed to demonstrate the
usefulness of my program to solve real data analytic challenges

in multilevel studies.

2) Extension to other members of the exponential family
Random effects binomial models may be more common in

applied educational and social research than the corresponding
Poisson models. Binomial distribution should be substituted for
Poisson distributions in this model for deriving a random effect
binomial model. Simulation studies should be performed to
access the accuracy of the resulting model. Extension to
exponential, normal or other member of the exponential family

should also be persued.

3) Improving the speed of the program

71

Due to the large computer resource requirement by
generating multivariate normal distributions for doing Monte
Carlo integrations, run time for the simulation models can reach
20-30 minutes for about 2 significant figures of accuracy. To be
applicable to larger real data sets, program speed has to be
increased. Possible choices are choosing a faster programming
language and computer system, employing more efficient random
variable generating methods, using suitable variance reduction
techniques (Rubinstein, 1981) or accelerators of the EM

algorithm (e.g. Louis, 1982, Jamshidian and Jennrich, 1993).

4) Semh for a_good criterion of convergence

Difficulties have been reported about judging when to stop
a stochastic convergent sequence (Gelfand and Smith, 1990).
Ploting the sequence of parameters against number of iterations
could tell us when the converging parameter path are stable
against the background of Monte Carlo fluctuations. Further
research should be done either to automate the ploting technique
or to compute a criterion of convergence based on some suitable

standard.

5) Improviag the estimates ofthe staadard errors

72

In my simulation study with two variance components, the
biases of the standard errors of the variance components range
from 26% to 52%. The reason could be due to the low sample
size (n): 7) within each level-2 unit. Lack of information to
estimate the variance components could give rise to a larger
error in their standard error estimates. Rodriguez and Goldman
(1993) report in their simulation study of a multilevel logit
model that parameter estimates are more biased when the number
of observations within group is small. Moreover, variance
estimates tend to have a skewed asymmetric distributions when
sample sizes are not large enough. With appropriate choice of
prior distributions for the parameters, estimation by Bayesian
techniques could improve over ML approaches especially when
the available sample size is small. The use of the mean as a
point estimate to summarize the sampling distribution of the
variance components may not be the best choice. For example,
Zeger and Karim (1991) suggest that the use of mode or
geometric mean might alleviate some of the biases.

Data augmentation techniques (Tanner, 1987) could be
used to obtain the whole posterior distributions of the standard

error sampling distributions and the use of the posterior mode in

73

place of the MLE might lead to improvement of the standard

error estimates.

APPENDIX A

athematical notes

 

A1 ExpectaLtion of the function of a random variable bygaa’oint
distribution

Let ACR‘PHW, BCRP“, f:A—>R, and ngP”—-)R and let
f(u)=f(ul)f(u2),...,f(uJ) be a joint distribution function ofJ
multivariate random variables u1,...,uJ each with dimension P+1.

Then the expectation
E[g(u9)] = Lg(u))f(u)du = I,g(u,)f(u.>du, .
Proof:
Lg(u,-)f(u)du= Lg<u.)/(z4),...,f(u9 )du19...,du_,
Evaluating the integrand for.'some uj. axis withj' ¢j,
j' e{l,...,j—l,j+l,...,.l}, it becomes

L gluj)f(“1),...,f(uj._l),f(uj.+1),...,f(uJ)dup...,duj._l'duj.+l,,,,,duJ,

74

75
Performing the above partial integration for all uj. axis,

j' ¢j, the integrand becomes Lg(uj)f(uj)duj. D

 

Corollary 1: ZLg(uj)f(u)du=ZIBg(uj)f(uj)duj .

Proof: It follows directly from summing both sides of the

results from Appendix Al. CI

Corollary 2: Additionally, let thP+‘—)R, then

I, g1u, )h(u,- )/(u)du = I,g(u, )f(u, )du, I,g(u,. mu) )du,

u

Proof: Following the proof in Appendix A1, it can be

readily shown that for CCRZU’“),

L. H. s. = [C g(uj )h(uj. )f(uj )f(uj. )dujduj.

= L g(uj )f(uj )duj Lh(uj. )f(uj. )duj. . D

Corollary 2a:

2; I, gut->110, )f(u)du = Z [38(14))f(u))du)jsg(uj- )f(u,- >424)

Menu-.9; J.J'=1:1==f
Proof: It follows directly from summing both sides of the

results from Appendix A1, Corollary 2. CI

76

A2 Expected information mairix with weights

1
Let L((0)=Zl//,logf(y,|(p) be a weighted log likelihood

i=1
function with 1 independent random samples and

L,((p)=i;/,logf(y,lgo). f(y,|gp) is a probability density function with

parameter (a and w, is some weight of given value . Then

46%») ___ {I} _1_ 401.00) (1.02)) .
ﬂaw. i=1 V11 ﬂip 580'

 

Proof:

4521472))
5959'

Eb"; .Zw.logf(y.-I¢))

 

=Zw, 455;, .logf(y9|¢))

' 3108f(y9|¢)3108f(y9|¢))
— E
E“ l a» 49'

(Seber, 1989, p.685),

= {12:45114. 108f(y9| con—5:17!) 108 mm)

H; w, 610 550'

77

A3 Conditional expectation by the missing data distribution via
Monte Carlo method

Let E[g(u)] be the expectation of a function g(.) of u,
consisted of independent variates u,,u,,...,u,, conditioned on the
dataY with independent componentsYl,Y2,...,YJ and the current

parameter value (0“) at iteration I. Then

E[g(u)]= Ig<u>f<uIY,¢"’>du

M J
1 z
z —g(ll )f(Y|u 91p")
EEMC’ l1 .1 III
where
1 M (I)
CJ~ﬁZf(Y,Iu...,¢ )
Ina!
and

uu,uzj,...,uMl ~f(u1|(p‘”)
are M samples of simulations from the multivariate distribution
f(u,I<o"’).
Proof:

E[g(u)]= Iglu)f(uIY,co‘”)du

By Appendix Al, Corollary 1, the above is equal to

78

Elana/(814998 )du,
j=1

 

, f(YI , Wm I “0
29:180.) ‘“"” “’1’

f(Y)|(PU)) duj, by Bayes' law _

.9 z-Ciigwn/(YIu,,¢“’)f(u,1¢“>)du, (A3. 1)
I

J=l

where the constant Clsf(Yj|¢7“’)=[f(YJIul,¢(”)f(u’|(0(")a‘uJ is

computed by the Monte Carlo integration method (Rubinstein,

1981):
l M ,
C] z—M—Zf(leuﬂa¢())a
111:]
where
uljauzp-Haum ~f(ujl¢(l) )3
are M samples of simulations from the multivariate distribution

f(u,I¢"’)-

Applying the Monte Carlo integration technique again,

equation (A3.1) is approximated by

E-ﬁlc—Jywovwiuww‘”)

=ZZ$3(“u)f(YlI“-m¢m)~ D (A3'2)
n=lj=l j

79

A4: Conditional expectation of the score function

Lemma (c.f. a similar argument from Tanner, 1991, p.37):

3108 f (YJI w)

 

 

 

 

5108f(uI¢)
I &’ f(u,IY,¢)du,= at
Proof:
f(Y,,u,I¢)
Y =
f( ,Ico) f(u,|Y,¢)
f(leup¢)f(ujl¢)
Y =
f( JI40) f(u,|Y,,¢)

5108f(Y,l¢) : 5108f(Y,|IIp(/J) + 3108f(u,|¢) _ 5108f(u,|Y,,(0)

 

 

 

 

ﬁr at 31 at
_ 0+ 5108f(u,|¢) _ alogﬂHIIchD)
ﬁr 37' '

Taking the expected value for both sides w.r.t. f(ujlYJ,(p):

é’logf(YJ|(o) = Ialogf(u,|(o) logf(u,IY .77)

 

a
f(u,IY .mdu, —I

 

f(ujlYpﬂdII,

 

 

 

0'11 0"! 52'
5’10 /( I71) 4101mm
= I gar“ f(u,IY,,¢)du,- I ’5, du.

 

=J‘alogf(uj1¢)

at f(u,IY .mdu, - gi— I f(u,IY,,¢)du,

 

= I mag/(um

3r f(uJIY,(o)du,. CI

APPENDIX B

Estimation of Starting Varlues

Starting values are vital to the efficient and successful
implementation of any iterative algorithm. In our case, if the
starting values are too distant from the final ML estimates, it
will take a long time for the program to converge. In the worse
case, the program will fail. It is because the starting values may
not be within a quadratic likelihood region of the maxima which
is a requirement for Newton-type maximization algorithm to
work (Seber and Wild, 1989). Precise estimation of starting
values may sometimes require iterations and becomes time
consuming. Hence a good choice of starting values are those
which are easy to compute and close to the final ML estimates.

Starting values for the fixed effects 7 are calculated by

regressing the logarithm of the dependent variable on the
independent variables related to the fixed part of the model. The

random part of the model is ignored for simplicity:

log(,u,.j)=A,j;/+B,ju szy. (B.l)

80

81

By taking the datum Y

,1. as an approximation for mean #0., we can

obtain starting values 7“” as the least square solutions of the

model:

log(Yy)=Au7(°’+eij, (B.2)
and
7(0) =(A'A)-1A'Z, (3.3)

where A=(A;,A;,...,A'J)' and Z:[(logYu,...,logYn‘1),...,(logYU,...,logYan )]'

are, respectively, the stacked matrix of the independent
variables for the fixed effects and the stacked vector of the
logarithm of the dependent variable across all the level-2 units.
An arbitrary small negative value in place of zero is used to
avoid undefined value for the logarithmic function: For example,
if Y.)- =0, Yijis set to be 10".

The above computed starting values for fixed effects are
used to calculate the starting values for the variance-covariance
matrix of the random effects:

Let

Z; =log(Yy.)—Au7‘°) zBou. (B.4)
Random effects of each level-2 units can be approximated by
regressing the difference between the logarithm of the dependent

variables and the predicted fixed part of the systematic

82

component on the independent variables related to the random
effects:
uj" =(B'JBJ)"B'JZ;. (B.5)
Starting values for the variance-covariance matrix are

estimated by

J
T"’=%Zu§°’uj”', (Seber, 1984). (B6)
j=1

The above estimation has the advantage of being positive
definite with probability 1, which is a requirement for
generating the multivariate normal random variables in

perfoming Monte Carlo integrations.

LIST OF REFERENCES

Aitkin, M., Anderson, D., and Hinde, J. (1981). Statistical
modeling of data on teaching styles. Journal of the Royﬂ
Statistigal Society. Series A. 144(4). 419-461.

Aitkin, M., Anderson, D., Francis, B. and Hinde, J. (1989).
Mistical modeling in GLIM. NY: Oxford University
Press.

Albert, J.H. (1985). Simultaneous estimation of Poisson
means under exchangeable and independence models.

Journal of Statistical Computation and Simulﬂion. 2;,
1-14.

Albert, J.H. (1988). Bayesian estimation of Poisson means using
a Hierarchical log-linear model. In J.M. Bernado, M.H.
Degroot, D.V. Lindley and A.F.M. Smith (Eds.), Bayesian
statistics 3, (519-531). NY : Clarendon Press.

Albert, J. (1992). A Bayesian analysis of a Poisson random
effects model for home run hitters. The American Statisticia_n_,
4a,246-253.

Anderson, D.A. and Hinde, JP. (1988). Random effects in

generalized linear models and the EM algorithm.

83

84

Communications in Sﬂtistics Theogy and Methods. Q(ll),
3847-3856.

Anderson, D.A. and Aitkin, M. (1985). Variance component
models with binary response: interviewer variability. Journal
of the Royal Statistiaral Society. Ser. B. 47, 203-210.

Becker, B.J. (1988). Synthesizing standardized mean-change
measures. British Journal of Mathematical_and Statisticﬂ
Psychology, 41_, 257-278.

Box, G.E.P. and Tiao, G.C. (1973). Bayesian inference in
statistical analysis. Reading, MA: Addison-Wesley.

Breslow, NE. and Clayton, D.G. (1993). Approximate inference

 

iueneralized linear mixed models. Journal of the America

Statistical Association. 88. 9-25.

 

Bryk, A.S. and Raudenbush, SW. (1992). Hierarchical lineg
models for sociaﬁmd behavioral research: Appligtions and
data analysis methods. Beverly Hills, CA: Sage.

Crowder, M. (1987). On linear and quadratic estimating
functions. Biometrika, 7_4_, 591-597.

Conaway, M.R. (1990). A random effects model for binary
data. Biometrics. 46. 317-328.

Cox, D.R. (1983). Some remarks on over-dispersion.

Biometrika, 7_0, 269-274.

85

Das Gupta, S. (1971). Nonsingularity of the sample covariance
matrix. SMhya A. 33. 475-478.

Dempster, A.P., Laird, N.M. and Rubin, DB. (1977). Maximum
likelihood from incomplete data via the EM algorithm (with
discussion). Journﬂ of the Royal Statistical Society. Ser. B.
Q, 1-38.

Firth, D. (1987). On the efficiency of quasi-likelihood
estimation. Biometrika, 14, 233-245.

Gelfand, A.E., Hills, S.I., Racine-Poon, A., and Smith,

A.F.M. (1990). Illustration of Bayesian inference in Normal
data methods using Gibbs sampling. Journﬂ of the America_n_
Mstical Association. 90, 972-985.

Goldstein, H.I. (1986). Multilevel mixed linear model analysis
using iterative generalized least squares. Biometrika, 7_3, 43-
56.

Goldstein, H.I. (1987). Multilevel models in educational and

 

social research. London: Oxford University Press.

Goldstein, H.I. (1989). Restricted unbiased iterative generalised
least squares estimation. Biometrika, _‘_7_6_, 622-623.

Goldstein, H.I. (1991). Nonlinear multilevel models, with an

application to discrete response data. Biometrika, 3, 45-51.

86

Huttenlocher, J.E., Haight, W., Bryk, A.S. and Seltzer, M.
(1991). Early vocabulary growth: Relationship to language
input and gender. Developmeﬂal Psycholgy, Ml).- 236-249.

Jamshidian, M. and Jennrich, R. (1993). Conjugate gradient
acceleration of the EM algorithm. Journal of the American
Statistical Association, 88, 221-228.

Johnson, L.W. and Riess, R.D. (1982). Numerical Angysis,
Philippines: Addison-Wesley

Karim, M.R. and Zeger, S.L. (1992) Generalized linear models
with random effects; Salamander mating revisited. Biometrics,
48,631-644.

Liang, KY. and Zeger, S.L. (1986). Longitudinal data analysis
using generalized linear models. Biometrika, 7_3, 13-22.

Longford, N.T. (1987). A fast scoring algorithm for maximum
likelihood estimation in unbalanced mixed models with nested
random effects. Biometrika, ﬂ, 817-827.

Longford, N.T. (1988). A quasi-likelihood adaption for variance
component analysis. Proc. Sect. Comp. Stgstq Am. Statist.
Assoc., 137-142.

Longford, N.T. (1993). R_andom coefficient models. NY: Oxford

university press.

87

Louis, T.A. (1982). Finding observed information using the EM
algorithm. Journal of the Royal Statistigl Society. Ser. B,
41, 98-130.

McCullagh, P. and Nelder, J.A. (1989). Generalized lineg
models. London: Chapman and Hall.

Morton, R. (1987). A generalized linear model with nested
strata of extra-Poisson variation. Biometrika, 71, 247-257.
Nelder, J.A. and Wedderburn, R.W.M. (1972). Generalized linear
models. Journal of the Royal StatistiLaI Society. Ser. A, 135,

370-384.

Press, W.H. (1986). Numerical recipes: The art of scientific
computing. NY: Cambridge University Press.

Raudenbush, S.W. and Bryk, A.S. (1986). A hierarchical model
for studying school effects. Socio-logyof Education. 59, 1-17.

Raudenbush, S.W. (1988). Educational applications of
hierarchical linear models: a review. Journal of Educationﬂ
Statistics, 13, 85-116.

Raudenbush, S.W. and Chan, W.S. (1992). Growth curve analysis
in accelerated longitudinal design with application to the
National Youth Survey. Journﬂ of Crime and Delinquency.

2__9_, 387-411.

88

Raudenbush, S.W. and Chan, W.S. (1993). Application of a
Hierarchical Linear Model to the study of adolescent deviance
in an overlapping cohort design. Journal of Consultingﬂ
Clinical Psychology, 6;, 941-951.

Racine-Poon, A. (1985). A Bayesian approach to Non-linear
random effects models.Biometrics. 41. 1015-1023.

Rice, J.A. (1987). M_athemaatical statistics an_d data analysis.
California: Wadsworth Inc.

Rodriguez, G. and Goldman, N. (1993). An assessment of
estimation procedures for multilevel models with binary
responses. Paper presented at a workshop on multilevel
analysis organized by the Rand Corporation, Santa Monica,
California, July, 1993.

Rubinstein, R.Y. (1981). Simulation and the Monte Carlo
method. NY: Wiley.

Schall, R. (1991). Estimation in generalized linear models with

random effects. Biometrika, _‘Zﬁ, 719-727.

 

Seber, G.A.F. (1984). Multivariate observations.NY: Wiley.

Seber, G.A.F. (1989). Nonlinm raggssion. NY: Wiley.

Smith, A.F.M., Skene, A.M., Shaw, J.E.H., Naylor, J.C. and
Dransdield, M. (1985). The implementation of the Bayesian

paradigm. Communications in Statistics: Theory

89

@d Methods. MU), 1079-1102.

Stiratelli, R., Laird, N. and Ware, J.H. (1984). Random effects
models for serialobservations with binary response.
Biometrics. 40. 961-971.

Tanner, M.A. and Wong, W.H. (1987) The calculation of
posterior distributions by data augmentation (with
discussion). Journal of the American Statistical Association.
8_2_, 528-550.

Tanner, M.A. (1991). Tools for stagstical inference: Observed
data and data augmentation methods. Berlin: Springer-Verlag.

Tsutakawa, R.K. (1985). Estimation of cancer mortality rates: A
Bayesian analysis of small frequencies. Biometrics. 41. 69-79.

Tsutakawa, R.K. (1988). Mixed model for analyzing geographic
variablity in mortality rates. Journgﬂ of the Americaa
§_t_21tistiaal Association. 83. 37-42.

Watson, G.S. (1964). A note on maximum likelihood. Sankhya A,
aa,303-304.

Wei, G.C.G. and Tanner, M.A. (1990). A Monte Carlo
implementation of the EM algorithm and the Poor Man's Data
Augmentation algorithms. Journal of the American Statistical

Association, 8;, 699-704.

90

Wong, G.Y. and Mason, W.M. (1985). The hierarchical logistic

regression model for multilevel analysis. Journal of the

 

American StlatisticQAssociation. 80, 513-524.

Wu, C.F.J. (1983). On the convergence properties of the EM
algorithm. Annals of Statistics. 31, 144-148.

Zeger, S.L., Liang, K.Y. and Albert, P.S. (1988). Models for
longitudinal data: A generalized estimating equation
approach. Biometrics. 44, 1049-1060.

Zeger, S.L. and Karim, M.R. (1991). Generalized linear models
with random effects: a Gibbs sampling approach. Journal of

the American Statistical Association. 86, 79-86.

"‘IIIIIIIIIIIIIIIIIS