Tilllldlulﬂlmlllllllﬂlliﬂl

-~ #279 23‘

 

 

 

 

 

 

 

 

 

 

7 \
$523333: mﬁmuuww[MM/nu

dissertation entitled

A COMPARISON OF THE EM AND DATA AUGMENTATION ALGORITHMS
ON SIMULATED SMALL SAMPLE HIERARCHICAL DATA
FROM RESEARCH ON EDUCATION

presented by

Randall Peter Fotiu

has been accepted towards fulﬁllment
of the requirements for

I I I
P I. ’ /) e\ I ~ 7! A ' ' ‘ .
I . If I'_ l ' V, ' ‘ ' L ‘ , ‘ [( ~“ In" '7
' . - degree m 1 I t ~ .
'- '- ; . i _
p" ‘- 1‘ Y ' I 1’ / ' ' 4/ '

 

ft IA 7 r t a ’ .4 / a \i
3 { . ‘- / X * v
/
7/; / ' ‘7’? /. 2" /.I -'
\L‘ /. u,’ ',""/-_V_,,--——~/ "7 1,;- " 4., ‘3' ‘x‘ '1 5;;L L
’ Major professor

.. +7“; »1-..' "'1'
Date ~" " m"7‘/

MS U is an Afﬁrmative Action/Equal Opportunity Institution 0-12771

 

PLACE IN RETURN BOX to remove this checkout from your record.
TO AVOID FINES return on or before date due.

 

DATE DUE DATE DUE DATE DUE

 

or}. 893 11:58
JDW ‘

 

uNIS
F: 26 DJ J

 

[29m

 

e04 {L

 

 

sepgem/
261

 

 

RB d (1' ”95.!

 

 

 

 

 

 

 

 

 

MSU Is An Afﬁrmative ActIorVEqueI Opportunity Institution

 

 

A COMPARISON OF THE EM AND DATA AUGMENTATION ALGORITHMS
ON SIMULATED SMALL SAMPLE HIERARCHICAL DATA
FROM RESEARCH ON EDUCATION

By

Randall Peter Fotiu

AN ABSTRACT OF A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational
Psychology and Special Education

1989

Mr:

:2
j.
.m/

94)

ABSTRACT
A COMPARISON OF THE EM AND DATA AUGMENTATION ALGORITHMS
ON SIMULATED SMALL SAMPLE HIERARCHICAL DATA
FROM RESEARCH ON EDUCATION
By

Randall Peter Fotiu

The hierarchical organization of our schools is reﬂected in the data collected
in a natural school setting by educational researchers. It is common for researchers
investigating the effects of an experimental treatment at the class level to obtain
data from a small number of classes and an unequal number of students within each
class. The statistical analysis of this type of data provides many challenges,
especially when there is an interaction between a student level characteristic and the
class level treatment. The parameters and notation of the speciﬁc hierarchical linear
model are developed in detail. Two statistical procedures are evaluated on .
simulated small sample hierarchical data. The empirical Bayes procedure using the
EM algorithm is compared to the Bayesian procedure implemented with the data
augmentation algorithm. Detailed explanations are provided that pertain to the
application of these two statistical procedures and the algorithms used in their
respective implementations.

The data used to evaluate these two statistical procedures were simulated to

represent a modest experimental design. This design included a measured student

ii

Randall Peter Fotiu

outcome and an independent variable representing a student characteristic such as
motivation, aptitude, or interest. Furthermore, students were nested within ten
classes and these classes were randomly assigned to either a treatment or control
group. The size of each class was randomly selected from a distribution of class
size. This data was generated to reﬂect parameter values of the hierarchical linear
model that might typically be obtained. The interaction effect between student and
class level factors was evaluated at three effect sizes. For each level of the
interaction effect, ﬁve hundred replications of an experiment were generated.

The Bayesian approach using the data augmentation algorithm recovered the
parameters of the hierarchical linear model as well or better than the empirical
Bayes approach using the EM algorithm. In particular, the Bayesian approach was
superior in recovering the between-class variance-covariance parameter matrix. In
addition, the Bayesian approach using the data augmentation algorithm provided
finite approximations to the posterior distributions of the parameters of the model.
The empirical Bayes approach using the EM algorithm performed better in
hypothesis testing of the between class effect parameters when a t-test was utilized.
The empirical Bayes procedure using a z—test and this implementation of the
Bayesian procedure tended to provide liberal hypothesis tests. The implementation
of the Bayesian procedure using the data augmentation algorithm required
considerable computer resources, whereas the implementation of the empirical Bayes
procedure using the EM algorithm was well within practical limits for research on

education.

“ah “I .. ‘
V'\:‘ 9-,.“ \\
I

iii

ACKNOWLEDGMENTS

I would like to thank my dissertation director Stephen Raudenbush and
dissertation committee members Andrew Porter, Richard Houang, Susan Melnick and
James Stapleton, for their advice, insight and guidance throughout the process of
writing this dissertation. I have been very fortunate to have had the opportunity to
work with this outstanding group of researchers and teachers throughout my
graduate studies. My wife, Josie Wojtowicz, deserves special thanks for her support

and understanding while I wrote this dissertation.

iv

TABLE OF CONTENTS

LIST OF TABLES ...................................... vii
LIST OF FIGURES ...................................... ix

I. AN OVERVIEW OF THE STRUCTURE AND ANALYSIS OF

HIERARCHICAL DATA IN EDUCATION ...................... 1
Introduction ........................................ 1
The Structure of Educational Data and Associated Problems ......... 3

The Hierarchical Organization of Educational Data ........... 3

The Unit of Analysis Debate and the Speciﬁcation of the
Hierarchical Model ............................... 4
The Problem of an Unbalanced Design ................. 10
Small Sample Problems ........................... 11
A Review of Estimation Procedures ........................ 12
The ANOVA Procedures .......................... 13
Maximum Likelihood Procedures ..................... 17
A Proposed Solution ................................. 21
II. NOTATION CONVENTIONS AND THE MODEL ................ 25
A Guide to the Matrix Notation .......................... 26
The Hierarchical Linear Model in Scalar Terms ................ 27
A Matrix Representation of the Hierarchical Linear Model ......... 30
Assumptions for the Two Statistical Procedures ................ 33

III. APPLICATION OF THE EM AND DATA AUGMENTATION ALGORITHMS
TO THE HIERARCHICAL LINEAR MODEL WITH NORMAL ERRORS 36

The Empirical Bayes Estimation Procedure Using the EM Algorithm . . . 36 "

The EM Algorithm and Covariance Estimation ................. 42 V

The Data Augmentation Algorithm ........................ 46 /
Initial Values ................................. 51
The Imputation Step ............................. 52
The Posterior Step .............................. 55
Some Additional Notes on the Data Augmentation Algorithm . . . 57

IV. METHOD ......................................... 59

Research Questions .................................. 59
Determining a Convergence Criterion .................. 59
Comparing the Accuracy of the Two Statistical Procedures ..... 59
The Error Rates ................................ 60
The Computational Efﬁciency of the Statistical Algorithms ..... 60

The Design and Sampling Plan for the Simulation ............... 61

Creating the Data ................................... 62
Parameters of the Model .......................... 63
The Parameter Values Selected ...................... 66
The Steps for the Data Generation Procedure ............. 68
Random Number Generation ........................ 71

Evaluating Convergence ............................... 72

Evaluating the Accuracy of the Two Statistical Procedures ......... 74

Evaluating Type I Error Rates and Power .................... 75

Evaluating the Implementation of the Data Augmentation Algorithm . . . 80

V. SIMULATION RESULTS ................................ 81
Convergence Criterion Results ........................... 81
Comparing the Results of the Two Statistical Procedures ........... 84
Type I Error Rate Results .............................. 87
Power Analysis Results ................................ 89
Computational Efﬁciency Results ......................... 93

VI. DISCUSSION ....................................... 95
Advantages and Disadvantages ........................... 95
Suggestions for Future Research .......................... 97

APPENDIX A ......................................... 101
Tables Containing Simulation Results ...................... 101

APPENDIX B ......................................... 116
Data Augmentation Algorithm FORTRAN Source Code .......... 116
Hierarchical Data Generation FORTRAN Source Code ........... 136

REFERENCES ........................................ 142

vi

4—1
4—2
4-3
4-4

5-1

A-2
A-3
A-4
A-5
A-6
A-7

A-8

LIST OF TABLES

Class Size Frequency Distribution .......................... 62
The Design of the Experimental Conditions .................... 66
Parameter Values for the Three Experimental Conditions ............ 68
Data Table for Error Analysis ............................ 76
MSE Between DA Posterior Means and Population Values for y ....... 82

MSE Between DA Posterior Means and Population Values for o2 and T . 82

DA Posterior Means Resulting from Good and Poor Initial Values ..... 84
Performance Ranking of Hypothesis Testing Procedures ............ 88
Power of the Test of y“ for Experimental Conditions 2 and 3 ........ 90
Central Processing Unit Time ............................. 93
Average Number of Iterations of the EM Algorithm .............. 94
Posterior Estimates for y from Experimental Condition 1 ........... 101
Posterior Estimates for o“ and T from Experimental Condition 1 ..... 101
Posterior Estimates for y from Experimental Condition 2 ........... 102
Posterior Estimates for O'2 and T from Experimental Condition 2 ..... 102
Posterior Estimates for y from Experimental Condition 3 ........... 103
Posterior Estimates for o2 and T from Experimental Condition 3 ..... 103
MSE’s Relative to Population Values for y from Condition 1 ........ 104
MSE’s Relative to Population Values for y from Condition 2 ........ 104

vii

A-9 MSE’S Relative to Population Values for y from Condition 3 ........ 105

A-10 MSE’s Relative to Population Values for O2 and T from Condition 1 . . . 105
A-11 MSE’s Relative to Population Values for O'2 and T from Condition 2 . . . 106
A-12 MSE’s Relative to Population Values for o" and T from Condition 3 . . . 106
A-13 Type I Error Rates for 700 from Experimental Condition 1 .......... 107
A-14 Type I Error Rates for yo, from Experimental Condition 1 .......... 107
A-15 Type I Error Rates for y” from Experimental Condition 1 .......... 108
A-l6 Type I Error Rates for 700 from Experimental Condition 2 .......... 108
A—17 Type I Error Rates for yo, from Experimental Condition 2 .......... 109
A-18 Type I Error Rates for 700 from Experimental Condition 3 .......... 109
A-19 Type I Error Rates for 70, from Experimental Condition 3 .......... 110

A-20 McNemar’s Statistics for Condition 1 Type I Errors with Nominal 0t=.01 110
A-21 McNemar’s Statistics for Condition 1 Type I Errors with Nominal 0t=.05 111
A-22 McNemar’s Statistics for Condition 1 Type I Errors with Nominal 0t=.10 111
A-23 McNemar’s Statistics for Condition 2 Type I Errors with Nominal 0t=.01 112
A-24 McNemar’s Statistics for Condition 2 Type I Errors with Nominal 0t=.05 112
A-25 McNemar’s Statistics for Condition 2 Type I Errors with Nominal 0t=.10 113
A-26 McNemar’s Statistics for Condition 3 Type I Errors with Nominal 0t=.01 113
A-27 McNemar’s Statistics for Condition 3 Type I Errors with Nominal 0t=.05 114
A-28 McNemar’s Statistics for Condition 3 Type I Errors with Nominal Ot=.10 114
A-29 Experimental Condition 2 Power Comparisons with DA for y” ....... 115

A-30 Experimental Condition 3 Power Comparisons with DA for y” ....... 115

viii

LIST OF FIGURES

5-1 Scatterplot of Condition 3 ATI Effect ....................... 86

5-2 Scatterplot of 95% HPD Intervals for Condition ATI Effect ......... 92

ix

CHAPTER I

AN OVERVIEW OF THE STRUCTURE AND ANALYSIS OF
HIERARCHICAL DATA IN EDUCATION

Introduction

Throughout history, scientiﬁc inquiry on a speciﬁc subject has often raised
interesting questions in unexpected areas. Not only has science taken unexpected
turns, it has fostered the continual improvement and advancement of its common
methods and tools. Progress in statistics and experimental design has moved
forward hand-in-hand with science. William Sealy Gosset was hired by the Arthur
Guinness Sons & Company, Limited of Dublin in 1899 to help apply the scientiﬁc
method to the brewing process. During his work at the brewery, it became evident
that formal statistical analysis needed to be extended to small samples. Frequently,
samples of eight to twelve were obtained at the brewery. This supplied Gosset with
the motivation to develop the "Student’s t-test" to provide a statistical tool for small
sample problems (Tankard, 1984). Sir Ronald Fisher’s work at the Rothamsted
Experimental Station provided a substantive context for his work in the development
of factorial designs and the analysis of variance method. In the preface of Fisher’s
classic work, Statistical Methods for Research Workers (1936, p. vii), he stated,
"Daily contact with the statistical problems which present themselves to the
laboratory worker has stimulated the purely mathematical researches upon which are

based the methods here presented." Research into the educational process as it

2
takes place in schools also provides a unique environment with many associated
methodological challenges.

Much of the research interest in education is focused on the effects of
various experimental treatments applied in a natural school setting. Not only are the
main effects directly attributable to the experimental treatments of interest, the
interaction effects these treatments have with known student characteristics and
attributes are also critical concerns of educators. This combination of experimental
control of treatments, the natural school settings in which they are applied, the
psychological characteristics of students, and the ﬁnancial constraints of most
educational research present many difﬁculties for statisticians involved in this area
of research. The issues related to the social organization of schools and the
structure of experimental data found in educational research are developed in the
following sections of this chapter. Many statistical procedures have been formulated
to cope with the problems presented by data generated in this type of research
setting. A critical review of some of the most widespread statistical procedures
applied in this research setting is also presented.

The historical advancement of statistical methods used in experimental
research in education leads naturally to a Bayesian formulation. The new data
augmentation algorithm developed by Tanner and Wong (1987) shows promise as a
practical implementation of a Bayesian solution and resolves many of the
shortcomings found in prior statistical methods used in educational research. To
illustrate its application in an educational research context, this new algorithm is
implemented in a computer program and applied to simulated data sets that mimic

the characteristics of educational data. And to evaluate the performance of this new

3

algorithm, it is compared to an alternative statistical estimation procedure, the
empirical Bayes approach using the EM Algorithm developed by Dempster, Laird
and Rubin (1977).

The Structure of Educational Data and Associated Problems

The majority of educational research takes place in a school setting. As a
consequence, the hierarchical organization of our schools is reflected in the data
collected in a natural school setting. The statistical analysis of this hierarchical data
provides many challenges. This hierarchical data structure resulting from research in
schools, the lengthy debate amongst educational researchers concerning the
appropriate unit of analysis, the implications of unbalanced designs, and small
sample problems are methodological issues encountered in research on education.

The following sections address these important and related issues.

The Hierarchical Organization of Educational Data

Typically, the prominent interest of educational research is to investigate how
characteristics of the school or classroom environment inﬂuence student learning
(Good and Brophy, 1986; and Brophy and Good, 1986). In some cases, the
characteristics of interest are an inherent part of the existing setting, and in others
they are introduced and controlled by the experimenter. In educational studies, both
naturally occurring and experimentally controlled factors are commonly found
together. Since a primary goal of educational research is to apply promising
positive results in an effort to improve student learning, it makes sense to study the

educational process in a natural setting. On one hand, an inference is the most

4
generalizable when the research and the application settings are equivalent or closely
approximate one another. However, conducting research in a natural setting limits
experimental control.

A feature of our formal education system and its social organization is its
hierarchical (nested) structure. For example, students are grouped within classrooms,
classrooms are grouped within schools, and schools are grouped within districts.
This multilevel structure provides a conceptual framework for understanding the
effects of schooling (Bidwell and Kasarda, 1980; and Barr and Dreeden, 1983).

The resources available to a teacher, such as books, materials, and class size, have
an inﬂuence on the instruction provided to students. The same instruction provided
to students may have differential effects (Cronbach and Snow, 1977) and this
differential response may cause the teacher to update and adjust their lesson plans.
Thus, we have a model for schooling that is both hierarchical in structure and the
levels of the hierarchy are interdependent. This conceptual model of schooling
should be reﬂected in the statistical models used to evaluate and understand the
effects of schooling (Burstein, 1980; Cooley, Bond and Mao, 1981; Raudenbush and
Bryk, 1986; Goldstein, 1987).

The Unit of Analysis Debat_e__ and the Speciﬁcation of the Hierarchical Model

This multilevel and interactive model of schooling has prompted a debate
among researchers on the appropriate unit of analysis. Should educational research
at the class level be evaluated strictly at the class level, the student level, or some
combination? In addition, how can this model be speciﬁed to reﬂect the

complexities of the hierarchical data collected from educational researt h .' These

questions are discussed in this section.

To exemplify the unit of analysis problem, let us construct a hypothetical
research study. Suppose a sample of ten classes were drawn randomly from the
population of interest and for illustrative purposes, each class contained 20 Students.
Five of the classes are randomly assigned to an experimental condition and the
remaining classes are assigned to a control group. Furthermore, the experimental
group’s treatment is applied at the class level and differs from the control group
only in the sequence in which the material is presented. The research question of
interest is whether the new experimental sequencing of the educational material
promotes higher levels of student achievement when compared to the existing order
of presentation. When the unit of analysis is the individual student, then an
‘ independent group t-test with 198 degrees of freedom (practically equivalent to a 2-
test) could be used to test for the effect of the experimental condition relative to the
control condition. On the other hand, if the class is taken as the unit of analysis, a
t-test with 8 degrees of freedom could be used to test for a treatment effect. Under
the assumptions that the observations are independent and drawn from the same
normally distributed population, the independent group t-test using student level data
is substantially more powerful than the alternative t-test using class level data.
Hence, for a researcher under pressure to ﬁnd signiﬁcant differences, the choice is
clearly in favor of the more powerful t-test using student level data. But, are the
assumptions for the independent t-test satisﬁed when comparing treatments that are
presented at the class level?

The assumption that individual students within a class are statistically

independent may not be appropriate. Glass and Stanley (1970, pp. 505-506) make

6
the distinction between the "units of statistical analysis" and "experimental units"
and provide the following two deﬁnitions.

The units of statistical analysis are the data (the actual numbers) that

we consider to be the outcomes of independent replications of our

experiment. If you will, the units of statistical analysis are the

numbers that we count when we count up degrees of freedom

"within" or "for replication."

The experimental units are the smallest division of the collection of

experimental subjects that have been randomly assigned to the

different conditions in the experiment and that have responded

independently of each other for the duration of the experiment.

From these two deﬁnitions, it is apparent that if the assumption of statistical
independence between students within a class is not appropriate then the unit of
statistical analysis and the experimental units are clearly different.

Most teachers have examples of classes with distinct collective personalities
and would freely agree with the Gestalt idea that a class as a whole is more than
the sum of its students. All educators are aware of situations where one or two
disruptive students cause the teacher to spend a considerable amount of class time
managing these students in lieu of instruction. Or for a positive example, a class
might have a very cooperative and supportive group of students and as a result
accomplished far more than expected. In the above two cases, it is obvious that an
individual student does not respond independently and consequently, the a priori
assumption that each student responds independently is not reasonable.

Not only can student responses be dependent within classrooms, generally
students are not randomly assigned to classes in the ﬁrst place. The assignment of
students to classes is a purposeful task performed by teachers and administrators and

is not a haphazard or random activity. This deliberate assignment of students to

classrooms fails to satisfy the deﬁnition presented earlier for students to be

7
considered the experimental units. In addition, Lumsdaine (1963) warns that even
if students were assigned randomly to classes and treatments, important sources of
error variation may not be accurately reﬂected in error estimates when the individual
is used as the experimental unit. In a classical two-stage sampling design students
are randomly assigned to classes and then classes are randomly assigned. to
treatments. Thus, at the outset of an experiment under these conditions the
intraclass correlation is equal to zero. But during the duration of the treatment the
teacher and students have the opportunity to interact, and this has the effect of
inducing the intraclass correlation to be greater than zero.

A number of researchers have examined the unit of analysis question and its
methodological implications. Lindquist (1953) advocated using the analysis of
variance technique on the group means and has stimulated much subsequent research
on the appropriate unit of analysis and associated statistical questions. Barcikowski
(1981) extended Lindquist’s work and investigated statistical power when group
means were used as the unit of analysis with the analysis of variance method. The
robustness of the t-test to violations of the assumptions regarding the unit of
analysis were investigated by Blair, Higgins, Topping and Mortimer (1983). They
found that even small amounts of between-class variation caused inﬂation in the
type I error rate of the t-test when an analysis was incorrectly carried out at the
student level. Further work on the speciﬁcation of the proper statistical model that
identiﬁes the random factor(s) explicitly in the linear model was examined by
Peckman, Glass and Hopkins (1969), and Hopkins (1982). This literature has
emphasized the fact that the effects of intact groups cannot be ignored a priori.

But, is the unit of analysis the critical question for the educational researcher?

8

Education as it takes place in our schools is distinguished by the fact that
learning, an essentially psychological activity, is cultivated in a social environment.
It is not sufﬁcient for the researcher to choose between the student and the class as
the appropriate unit of analysis. The principal consideration is the interplay between
the psychological and social elements in our educational system. Teachers are faced
with the difﬁcult task of mediating and balancing individual student concerns with
those of the class. Determining the optimal point between individual and group
instruction that produces the maximum educational beneﬁts is an important
challenge. This interplay between the individual and the group must also be
reﬂected in educational research methodology.

The differential responses of individuals to treatments is what Cronbach
(1957) called "Aptitude X Treatment Interaction" (ATI). This differential effect a
treatment has on individuals with different psychological characteristics depicts a
frequent situation found in educational research. Educational research methods must
be sensitive to the covariation between the psychological and social components of
education and learning. Good estimates for the variance-covariance terms of the
model are required to recognize important ATT effects.

The intraclass correlation coefﬁcient is another measure that reveals a type of
relationship between individual and group information. The intraclass correlation is
constructed from a partitioning of the total random or unexplained variance into two
components, one for the within-group variation and the other for the between- group
variation. The intraclass correlation coefﬁcient is then deﬁned by the ratio of the
between-group variance relative to the total variance, and as a consequence, its

value ranges from zero to one. For the special case when the intraclass correlation

9

is considered to be known, Blair and Higgins (1986) recommend a weighted least
squares solution to the unit of analysis problem. The conclusions reached using this
analysis strategy are somewhere between those obtained using the individual as the
unit of analysis and the group mean depending on the speciﬁc value of the
intraclass correlation coefﬁcient. Except for the degrees of freedom used to ﬁnd the
tabled critical value, an intraclass correlation approaching zero would produce results
comparable to those using individuals as the unit of analysis and an intraclass
correlation approaching unity would produce results comparable to using the group
as the unit of analysis. Blair and Higgins (1986) mention that the intraclass
correlations found in classroom level educational research have a rather narrow
range. However, the intraclass correlation is generally unknown in practice and this
implies that the required variance components must be estimated from the data.
Consequently, their proposed solution is incomplete.

Not only are the variance components necessary for understanding the
educational process under study, the covariance components also play an important
part. In many problems in education, there are a number of variables that are of
interest and these variables may be interdependent. Thus, a multivariate formulation
is essential to model linear dependencies between variables and for joint probability
statements about parameters that are linearly dependent. Raudenbush and Bryk
(1988) provide an example where estimated parameter variance and covariance
components enable a better understanding of school effects. In their example, they
use the variance-covariance terms to construct the correlation coefﬁcients between
"excellence" (mean level of achievement) with "equity" as measured by the

regression coefﬁcients for minority status, social class, and academic background.

10
The Problem oLan Unbalanced Desiga

Generally, the researcher has no control over the assignment of students to
classes and there is nothing that prevents classes from varying in size. Not only
does the number of students per class vary, there are other circumstances that
contribute to an unbalanced design. If a research study is longitudinal in nature
there is the problem of attrition. Students may be absent during the days designated
for treatment or testing. In other cases, students and their families may move while
a research project is in progress.

Statistical techniques are available for the balanced design, for example when
there is the same number of students per class (Kirk, 1968; Winer, 1971; and
Searle, 1971). When intact groups like classes are fundamental to educational
research, the probability of obtaining a balanced design is extremely small. The
application of many ANOVA type estimation procedures to unbalanced designs do
not produce consistent results and as a consequence, results may vary dependent on
the statistical procedure used or the particular manner it is applied (Searle, 1971).

There have been some recent advances in the application of statistical
procedures to unbalanced designs. A number of researchers have successfully
applied a variety of numerical maximum likelihood procedures to unbalanced
hierarchical data found in educational research. The Fisher scoring method
(Deleeuw and Kreft, 1986; and Longford, 1987), the iterative generalized least
squares method (Goldstein, 1987) and the EM algorithm (Mason, Wong and
Entwistle, 1984; Bryk, Raudenbush, Seltzer and Congdon, 1986) all produce
maximum likelihood estimates with their desirable large sample properties. In

particular, maximum likelihood estimators are asymptotically normal. Hence, for

11
large scale studies, maximum likelihood estimation provides a solution to the
problems of unbalanced designs. But for small sample situations, the
appropriateness of the large sample properties of maximum likelihood estimation is

unclear.

Small Sample Problems

Along with the problems associated with intact groups, there are typically
constraints placed on the resources available for educational research. Experimental
research on education is expensive and compounds the problems associated with the
hierarchical structure of educational data. For example, sampling considerations are
more problematic in a hierarchical structure. Each level up the hierarchy represents
a substantially smaller population. It is much easier to obtain the participation of
twenty students than the same number of classes or even more difﬁcult, to enlist the
participation of twenty schools. This is especially true if the study requires the
implementation of treatments. The recruitment and training of cooperating teachers,
and the careful monitoring of the treatment delivery are very labor intensive
activities.

Also, most educational treatments are implemented over a period of time and
the data collection phase requires considerable resources over the course of the
study. Multiple observations, testing and test scoring are all standard tasks in
educational research. The sum of these necessary research activities makes it often
impractical to have large-scale individual experimental research studies on education.
It is more likely that smaller scale projects will continue to be the norm.

Small sample problems create additional demands for statistical procedures

12
not found with large samples. For example, the Central Limit Theorem states that
the sampling distribution of the mean is approximately normal regardless of the
population distribution. The only restrictions of this theorem is that the sample
must be sufﬁciently large and the population variance must be ﬁnite. To evaluate
the sampling distribution of a mean ﬁ'om a small sample, however, it is necessary
to make some assumptions about the population distribution. To make inferences
about the distribution of the sample mean based on a small sample, Student’s t-test
requires that the sample be drawn from a population with a normal distribution.
Intuitively, it makes sense that the less information provided by the sample or data
the more a researcher has to rely on prior knowledge and the validity of the

assumptions.

A Review of Estimation Procedures

The hierarchical structure of educational data can be represented as a mixed-
model, a subclass of linear models. A mixed-model conceptualization is appropriate
because a combination of both ﬁxed and random factors are included in the model.
Fixed factors are factors in which all treatment levels or categories of interest are
included in the experiment. Inferences about a ﬁxed factor pertain only to the
speciﬁc levels of the factor that have been included in the experiment. In research
on education a ﬁxed factor might be method of instruction, type of materials, or
media used. Random factors have their associated treatment levels randomly
sampled from the population of interest. Inferences about random factors are
usually directed toward their population variances, and this information is often used

to construct benchmarks to evaluate the relative size of ﬁxed effects included in the

13
model. Students, classrooms, and schools are typical examples of random factors in
educational research, and estimation of the variance and covariance components of
these random factors is an important function in research on education.

In the context of mixed models, statistical estimation is difﬁcult for
unbalanced hierarchical designs where the interaction effect of measured student
characteristics by treatments is an important focus of the investigation. This
difﬁculty is compounded in small sample situations. Many estimation procedures
that have analytical solutions for balanced designs fail for the unbalanced case and
numerical approximation techniques are required. Some of the estimation
procedures that have been proposed as solutions to the estimation problems
presented by the combination of a hierarchical data structure, the interaction of
measured student level characteristics by class level treatments, an unbalanced
design, and a small sample size are reviewed next. These estimation procedures are
divided into two groups, the analysis of variance (ANOVA) type procedures and the
maximum likelihood procedures. I shall discuss the advantages and identify the

drawbacks with the estimation procedures within these two groups.

The ANOVA Procedares

The problem of estimation for unbalanced mixed-model designs has a long
history. R. A. Fisher is generally credited with systematically developing factorial
designs along with the analysis of variance method. This standard ANOVA
procedure provides the starting point for critically examining some of the alternative
procedures proposed in this framework to handle the unbalanced case.

From his ﬁeld work in animal science and the prevalence of unbalanced

14
experimental designs in this domain of research, Henderson (1953) proposed three
methods for variance and covariance component estimation for this situation. The
ﬁrst method of estimation, referred to as Method 1, was essentially the conventional
ANOVA method and is a starting point for comparisons. With this traditional
ANOVA method, point estimates for the variance and covariance components in
some situations can be calculated by equating the mean squares to their expected
values. This set of equations can be solved for the unknown variance and
covariance components in the balanced design case. In the unbalanced case,
obtaining this solution is complicated by the increased complexity of the coefﬁcients
of each variance and covariance component. More importantly for unbalanced
designs, Searle (1971, p. 41) states, "This is generally true; in mixed models,
expected values of the S’s ("analogous" sum of squares) contain functions of the
ﬁxed effects that cannot be eliminated by considering linear combinations of the
S’s." In this context, the S’s are computed with unbalanced data in an analogous
manner to the sum on squares computed with balanced data, but they do not
necessarily have the same properties. Searle (1971, p. 36) provides an example
where an S term is a quadratic form that is not positive deﬁnite. In effect this
means that in many cases the variance and covariance components cannot be
extracted using the ANOVA method.

Method 2 proposed by Henderson (1953) involves computing the least
squares estimates for the ﬁxed effects, correcting the data by subtracting the ﬁxed
effects, and then using Method 1 on the corrected data. Searle (1971) criticizes this
method when used for mixed-model esﬁmation on the grounds that it is not

uniquely deﬁned and it does not apply whenever the model includes interactions

15
between the ﬁxed effects and the random effects. These two concerns expressed by
Searle illustrate the problems with this method for mixed-models. Therefore this
strategy is inappropriate for investigating the interaction between student level and
class level variables.

The third method proposed by Henderson (1953), the ﬁtting of constants
method, also has some shortcomings. It uses reductions in the sums of squares due
to ﬁtting different submodels. These reduction terms are then set equal to their
expected value and solved for the unknown variance and covariance components.
This procedure does not produce a unique solution because the order in which the
reduction terms are ﬁtted may lead to different values for the associated mean
squares. In most situations, there is no guiding principle for ordering the
calculations of the reduction terms using this ﬁtting of constants method, making
this method unsatisfactory for the unbalanced mixed-model situation.

There is nothing inherent in these three estimation methods in the
mixed-model case that prevents negative estimates of variance components in either
the balanced or unbalanced cases (Searle, 1971). If a given application of one of
these three methods dOes produce a negative estimate for a variance component, a
researcher could set the estimate to zero. But how much conﬁdence can be placed
in an estimate that is outside the parameter space? In the balanced case, the
estimates of variance components are unbiased for the three methods (Searle, 1971)
and also Method 3 produces unbiased estimates in the unbalanced case. Since these
unbiased variance estimates may be negative, the criteria for unbiased variance
estimation may not be of primary importance.

Furthermore, Henderson (1953) states that in unbalanced design situations the

16
sampling variances of the estimates obtained by the three estimation methods are not
known. As a result, hypotheses testing of the parameters in the model can not be
evaluated with the same conﬁdence as in a balanced design. This uncertainty about
the sampling variances of the estimates also applies to statements concerning
interval estimates.

The introduction of covariates, such as student abilities, aptitudes, and
interest, create additional problems. The AN OVA procedure has been extended to
include covariates. The analysis of covariance, ANCOVA, requires the assumption
that the within-class regression coefﬁcients are homogeneous (Winer, 1971). This
assumption may not be tenable in many research situations. For example, it is quite
possible that the effect of a covariate measuring a student background characteristic
may vary across schools. In this case, the assumptions underlying the traditional
ANCOVA procedure would be violated. When the assumption of homogeneous
regression coefﬁcients is not appropriate, the obvious question is why are the effects
different. An alternative approach proposed by Cronbach and Webb (1975) was to
estimate separate regressions within each group, but they concluded that when the
sample size for each group was small the resulting estimates were not stable. Thus,
the ANCOVA and the estimation of separate regressions methods may not provide a
satisfactory solution for the heterogeneity of regressions problem. In addition, the
ANCOVA procedure does not provide a general solution for estimating the
covariance of a random student level effect with a random class level effect. A
general solution must estimate this important covariance when the variables are
measured on a discrete or continuous scale and is applicable to both balanced and

unbalanced design situations.

17
Maximum Likelihood Procedures

The problems encountered with the AN OVA and ANCOVA estimation methods
have led to the investigation of other estimation procedures. The maximum
likelihood procedure of estimation, with the availability and computational speed of
computers, has received renewed interest. This approach to estimation has many
desirable features. One feature of this method is that it constrains the variance
components to be nonnegative. Maximum likelihood estimators are functions of
every sufﬁcient statistic, consistent, asymptotically normal, and efﬁcient (Harville,
1977). These properties of maximum likelihood estimators provide a large sample
solution to the problems resulting from an unbalanced design. Generally, the
distribution of maximum likelihood estimators in small sample situations are
unknown.

Implementation of the maximum likelihood method is not always easily
accomplished. In the unbalanced mixed-model situation with the variance-covariance
components unknown, there is no general closed form or explicit analytic solution to
the maximum likelihood equations (Searle, 1971). A numerical procedure must be
used in this situation. One of the ﬁrst numerical methods used for maximum
likelihood estimation was the Newton-Raphson method. This method has the
desirable property of quadratic convergence. The disadvantage of the
Newton-Raphson method is that there is no guarantee that it will converge to the
solutions of the maximum likelihood equations. It might converge to a local rather
than the global solution (Harville, 1977). This method may depend on good initial
approximations. In some cases, if the initial approximations are not sufﬁciently

close to the actual roots of the likelihood equations, the Newton-Raphson’s method

is“
may diverge (Harville, 1977; Burden, Faires and Reynolds, 1981).

There are a number of alternative computational approaches to the Newton-
Raphson method that produce maximum likelihood estimates of the parameters of a
hierarchical linear model. The Fisher scoring method (DeLeeuw and Kreft, 1986;
Longford, 1987), iterative generalized least squares (Goldstein, 1987), and the EM
algorithm (Mason, Wong and Entwistle, 1984; Bryk, Raudenbush, Congdon and
Seltzer, 1986) are examples of different computational approaches currently available
to researchers that produce maximum likelihood estimates. The EM algorithm
deveIOped by Dempster, Laird and Rubin (1977), and implemented with the HLM
Computer Program (Bryk et al., 1986) was selected for use in this study. This
estimation procedure has been applied to a variety of research problems in education
with success. For examples of this algorithm used in educational research, see
Dempster, Rubin and Tsutakawa (1981), Laird and Ware (1982), Strenio, Weisberg
and Bryk (1983), and Raudenbush and Bryk (1985, 1986). This algorithm gets its
name from the two basic iterative steps: the expectation step and the maximization
step. Wu (1983) has shown that the EM algorithm will converge when the

I . complete data are from a curved Vexponential family, and the expected log of the

... -

1.
Generally, the EM algorithm has a slower rate of convergence than the

h-H»- z»- ' “quvr... 5‘ _ﬁ.u——pw

Newton-Raphson method. Ewever, [FEE-M algorithmwill conver el when given
well-def ' ' ' estimates of the parameters along with W

Wndiﬁons outlined above. The availability of this

software package and its su erior conver e ro )were the primary reasons for

selecting this method.

 

 

 

 

 

19

An advantage of maximum likelihood estimation is that it may be applied to
the hierarchical linear model to produce\empirical Bayes estimates. The hierarchical
model may be easily formulated to produce empirical Bayes estimates which are
equivalent to restricted maximum likelihood estimates (Laird and Ware, 1982).
Harville (1977) points out that one criticism of the standard maximum likelihood
approach to estimation is that iﬂﬁ§ﬂ¢f¥91m0.SESQPBEIhQﬁlQS‘SLin'dﬂgIEEEBE
fr‘egdoﬂmythat results from ,csﬁmaﬁpﬂggﬁhg, fgﬁfmfféqiGirlie/model. He points out

Itathat inadequacies of this u'aditional maximum likelihood (ML) approach are
overcome by restricted maximum likelihood (REML) estimation because this
approach does take into account the loss of degrees of freedom due to estimating
ﬁxed effects (Harville, 1977). Dempster, Rubin and Tsutakawa (1981) differentiate
the MA and wLapproages\ by denoting the two likelihood functions by MLF
and MLR,\where F stands for ﬁxed and R for random.\

The empirical Bayes approach capitalizes :70 the strengths of the data. For
example, in a two-level hierarchical linear model with student and class levels,
regression coefﬁcients representing the regression of an outcome variable on one or
more student level independent variables may be expressed as a precision-weighted
combination of the contributions for the entire data set and the individual class.
This method uses the strengths of the data for estimation, because the more
precisely a component of the model is measured the more weight it will receive in
constructing an estimate. Thus, a regression equation for a particular classroom
containing a small number of students may borrow information across classes to

obtain an improved estimate. This estimation approach provides a method to

resolve the problem of estimating individual regression equations from nested units

20
with small samples. The details of the precision-weighting scheme are presented in
Chapter III.

The empirical Bayes approach, in the context of the hierarchical linear
model, is available to educational researchers with the “HLM computer program
(Bryk, et al., 1986). This method of estimation is most/Suited}; EEWE.
"problems. ‘ The distribution of the estimates resulting from the EM algorithm applied
to a hierarchical model are generally not known in Small sample situations. For
example, in a hierarchical model with students nested in classes, a sufﬁciently large
sample of classes is required for trustworthy results. In addition, the standard errors
for any ﬁxed effectst at the class level fail to take into account the uncertainty
associated with the estimation of the variance-covariance components, which is an
important consideration in small sample problems. A basic strategy of the
maximum likelihood methods is to calculate an estimate of the variance-covariance
components and use this point estimate as if it were known. For small sample
problems, variance-covariance estimates may be quite unstable. The small sample
dispersion of maximum likelihood estimates that rely on variance-covariance
estimates do not reﬂect the consequences of the variation that may result from other
plausible variance-covariance component values. The empirical Bayes approach
relies on point estimates of the variance-covariance components to construct weights
and there has been some evidence (Bassiri, 1988) that small sample empirical Bayes
estimates have standard errors that are two small and do not reﬂect the uncertainty

associated with the estimation of the required variance-covariance components.

21
A Proposed Solution

An alternative methodology that may be applied to this analysis problem is a
Bayesian approach. The Bayesian method is not new and was initially brought into
focus by Reverend Thomas Bayes (1763). The application of true Bayesian
methods to hierarchical models has been impeded because the mathematics involved
has been intractable. Lindley and Smith (1972) indicate that most reasonable
distributions employed in a hierarchical model lead to integrals which cannot all be
expressed in closed form. The Bayesian approach is faced with the problem that
the required equations cannot always be expressed in closed form. The EM
algorithm supplied an iterative solution to this problem for the maximum likelihood
method. Tanner and Wong (1987) have developed a new data augmentation
algorithm for the implementation of the Bayesian method. This method provides an
innovative means for the computation of posterior distributions.

The key advantage of the Bayesian approach is that it supplies the joint
posterior distribution of the parameters of the model. The joint posterior
distribution provides more information to the researcher than joint modal estimates
for the parameters produced by the empirical Bayes approach. For example, it is
possible for a researcher to calculate the mode, mean, median, and percentile points
for any parameter of interest from the joint posterior distribution. In addition, small
sample parameter estimates may be unstable and cause problems when they are
considered known in empirical Bayes or other maximum likelihood methods. The
Bayesian approach uses distributions rather than point estimates and the joint
posterior distribution resulting from this approach reﬂects the uncertainty of our

knowledge about the parameters in the model. For example, variance-covariance

22
parameters are assumed to have a distribution and their contribution to the joint
posterior distribution is determined by including the probability of all possible
values across their distribution. Hence, if different plausible values of the variance-
covariance parameters result in different estimates from the empirical Bayes
approach, this variability is reﬂected in the Bayesian approach. The empirical
Bayes approach will have a tendency to underestimate this variability. Since many
research projects in education are conducted with small to moderate sample sizes,
especially when the data has a hierarchical structure, the Bayesian approach may
provide a solution to small sample size problems that have not been resolved by the
empirical Bayes approach.

In this study, the Bayesian approach implemented with the data augmentation
algorithm is applied to a common experimental design found in research on
education. This design includes an independent variable measured at the student
level. This independent variable could be a measured student characteristic such as
motivation, aptitude, or interest. In addition, a simple experimental design
representing an experimental group and a control group was deﬁned at the
classroom level. The speciﬁc details of this model are presented in the next
chapter.

Hierarchical data illustrating representative samples found in research on
education at the class level are simulated with known properties. A number of
distinct data sets are generated with different combinations of between-class effect
parameters. The number of students per class is drawn randomly from a realistic
distribution of class sizes that might be found at the elementary school level. The

remainder of the parameters that are necessary to deﬁne the simulated data are set

23
to values typically obtained from research on education.

Since the data augmentation algorithm is new, there are questions concerning
the most effective implementation of the algorithm and the amount of computational
resources it requires. The data augmentation method requires the generation of
latent data that is added to the observed data, as its name implies. The amount of
data generated is a variable of the algorithm and a technique for controlling this
aspect must be determined before it can be implemented. A strategy must be
developed to obtain an optimal balance between the amount of data generated and
the computational resources required. Furthermore, this algorithm is an iterative
procedure and stopping rules must be determined prior to programming it for the
computer. The practicality of this algorithm in terms of the necessary computational
resources needs to be evaluated due to the combination of data generation and
iteration. This is an important consideration because in substantive educational
research situations the data are analyzed a number of times using different models.
Any data analysis procedure should be efﬁcient enough to make this ﬂexibility
practical.

The investigation of the performance of the implementation of the Bayesian
statistical procedure is another important consideration. A practitioner must know
the properties of a statistical procedure. In addition, researchers are interested in
knowing which estimation method produces the best results relative to a set of
possible methods currently available. To make a meaningful comparison the
empirical Bayes approach using the EM algorithm was selected as an alternative
estimation procedure. The performance of these methods under small sample

conditions are investigated. A simulation study is unique because it provides

24
knowledge of the population parameters. A comparison of both methods can be
made as to how accurately they recover the population parameters.

In a hypothesis testing context, type I error rates are compared against the
nominal level for the relevant between-class level parameters. In addition, the type
I error rates produced by the two statistical methods will be compared. A related
concern is the statistical power of the test. The power of the two methods is

estimated and compared.

CHAPTER II
NOTATION CONVENTIONS AND THE MODEL

This chapter provides a brief discussion of the matrix symbols and
conventions employed in this study. Also, the speciﬁc hierarchical linear model
used is ﬁrst described in scalar terms and then extended to matrix notation. The
basic assumptions for the empirical Bayes approach using the EM algorithm and the
Bayesian approach using the data augmentation algorithm are presented.

The educational context features students nested within classes for the ﬁrst
level of the hierarchy, the within-class level. Let the dependent variable be a
measure of student achievement in a subject area and the independent variable be a
measured psychological attribute, such as an aptitude. The dependent variable is
regressed on the independent variable resulting in two regression coefﬁcients for
each class, a slope and an intercept. At the between-class level, classes are
randomly assigned to either the experimental treatment or the control group. At this
level, the within-class parameters are considered random outcomes and are modeled
as functions of the ﬁxed effects due to treatments. In this example, it is the
researcher’s task to explain the variation among the outcomes of c.“ h class as a
result of the between-class model’s experimental treatment. This hierarchical linear

model is translated into an explicit mathematical presentation.

25

26

A Guide to the Matrix Notation

Bold letters are used to represent matrices and vectors. Bold upper case
letters are used to represent matrices and bold lower case letters are used to
represent vectors. For example, X represents a matrix and x represents a vector.
Furthermore, scalar values are represented by characters or numbers not in bold face
type. The order of matrices and vectors is indicated by "(r x c)" to specify the
number of rows and columns respectively. A matrix with r rows and c columns

may be written as X (r x c) and displayed as

X11 X12 ch
X = X21 x22 X2:
X” X4 Xv:

 

 

An element of X in row i and column j is Xij. In a similar manner, the (r x 1)

column vector y may be illustrated as

Y:

 

 

yr

To help differentiate symbols, parameters are represented by Greek characters and
Roman characters are used to represent observed values or known constants.

Moreover, the superscript "-1" is used to indicate the inverse of a matrix, an

H,"

apostrophe

27

is used to indicate the transpose of a matrix. The identity matrix

is denoted by I. "Var" is the variance operator, "Cov" is the covariance operator,

"D" indicates a dispersion mauix, "Tr" is the trace operator, and "E" is the

expectation operator. The symbol "cc" indicates proportionality, and "~" can be read

as, "is distributed." The following summarizes these conventions:

A.
A.

I

Var(x)
Cov(x, y)
D(y)
Tr(2)
EU)

A cc cA

y ~ N01. 6’)

is a vector of parameters,

is a matrix of ﬁxed constants,

is the inverse of matrix A,

is the transpose of matrix A,

is the identity matrix,

is the variance of x,

is the covariance matrix of x and y,

is the dispersion of the vector y,

is the trace of the matrix 2,

is the expectation of the vector y,

indicates the matrix A is proportional to cA, and
indicates that y is distributed normally, with mean u

and variance 0‘.

The Hierarchical Linear Model in Scalar Terms

At the ﬁrst level, the within-class model speciﬁes a model for the units of

statistical analysis. For classroom level research in education, the ﬁrst level

involves students within a class. In this case, the within-class model includes two

28
parameters, an intercept and a slope. In other educational research situations this
ﬁrst level of the hierarchy might model classrooms nested within schools. Since
two unique regression coefﬁcients are determined for each class, the resulting set of
regression parameters vary randomly across classes. The between-class model is
developed to explain the random variation of these regression parameters.
More speciﬁcally, the within-class model, is presented in scalar terms as

follows:

Yij = Bjo + lexijl + rij’ (2.01)
where

yij is the outcome response of the i‘” student in the j‘” class,

[310 is the intercept (base) of the j"1 class,

B,- is the regression slope for the jm class,
m is i"I student’s predictor response, and

rij is a random error for the i"1 student in the j"I class,
for

i = 1, 2, , nj with nj students in the j" class, and

1, 2, , k and k is equal to the number of classes.

He
II

At the second level, the between-class level, we specify a model for the
random parameters of the ﬁrst level. The random parameters, the intercepts and the
slopes, are now considered to be outcomes and are explained as functions of
between-class differences. These between-class differences could be explained with

a design on the classes that includes experimental treatments, naturally occurring

29
classiﬁcation factors, measured attributes of the classes, or some combination of
these different types of independent variables. The between-class design selected for
this study differentiates classes into experimental treatment and control groups.

The between-class model is presented in scalar terms as follows:

%=m+m%+%. amm

Bit = 7m + 711W,- + ujr. (2.02b)
where

700 is the base of class intercepts,

70, is the main effect due to experimental treatment,

u,-0 is a random error associated with the j"I class,

‘ym is the base of the within-class slopes,

Y“ is the interaction effect (ATI),

ujl is a random error associated with the j"’ class, and

Wj is equal to 1 for classes in the experimental group, and

is equal to -1 for the classes in the control group.

Equations (2.01), (2.02a) and (2.02b) may be combined to form a more

traditional representation of the hierarchical model. This equivalent representation of

the model may be written as

Yij = 700 + Yorwj + “03' + Yloxijl + Yrrwjxijr + uljxijl + rij° (2-03)

This traditional representation obscures the hierarchical structure. The terms in the

30

model are the same as described previously.

In many situations the unconditional model provides some valuable

information. The unconditional model in this hierarchical context refers to the

model without the inclusion of any between-class predictor variable(s), which in this

case is the treatment indicator variable Wj. When both the unconditional and

conditional models have been estimated the proportion of additional between-class

variance accounted for by the inclusion of one or more predictors is available to the

researcher to help assess the performance of the conditioning variable(s). The

unconditional between-class model is presented in scalar terms below:

Bjo = To + 1130’ (2.04a)

m=u+nn amm
where

yo is the base of the class intercepts,

u}.0 is a random error associated with the j‘“ class,

y1 is the base of the within-class slopes, and

u is a random error associated with the j‘” class.

A Matrix Representation of the Hierarchical Linear Model

The next step is to extend the scalar representation of the model into matrix

terms. In general, the within-class model may contain random effects, ﬁxed effects,

or a combination of these two types of effects. In this speciﬁc model, the within-

class slope and intercept parameters are considered as random variables. The

within-class model is given in matrix terms as follows:

 

 

 

 

y = XB + r,
where
3’:
y = 3'2
3'1:
X1
X = 0
b
with
1
Xj = 1
'1
B;
l3 = 132
Br
B1 = l B10
jl
r1
r = r2
'r.

 

with

 

.Nxc

ljl
2jl

><><

 

 

31

(Znj x 1),

0 <an x 2k).

 

(211,- x 2) and note that Xijo = 1,

 

(2k x 1),

(2 x 1).

(in, x 1),

(2.05)

32

 

 

r1).
rj = r2j (2111- X 1),
In;
for
j = 1, 2, . . . , k with k equal to the number of classes, and
i = 1, 2, . . . , nj for the students in the j‘“ class.

The between-class model, which comprises the second level in the hierarchy,

is presented as follows:

B = WY + u, (2.06)

where B is deﬁned as above and

w
w = w2 (2k x 4),

 

 

and for a class in the experimental group

33

Too
Y = 701 (4 x l), and

710
In

u = u2 (2k x 1),

 

 

with

u: I ujo I (2x1).

Assumptions for the Two Statistical Procedures
For the _§YBP§}'1931.4.B.3YCS,PFQCCdUY¢, the following assumptions are made:

_-...,.—— .__

r ~ ”$91-77), and u ~ N(0, T). (2.07)

lln addition, we assumeé nonhfomagve_R®LQi§Wfor the between-class
,s' parameter vector 7. etirtethﬁerwwerds. BILLIEIIQWESSE-.§9°.,‘1§..@9..89F9§1 .Patametet.

1' Lalues of this distribution is__abyQSIJIEIILQE..XPHQEEEI‘I'. This noninformative prior is

-‘_ n “._‘vaM ...—.~ _. --

l

speciﬁed as

7 ~ N(0, I‘). (2.08)

Y».

Furthermore, these simplifying assumptions are made:

34
2 = 162, (2.09)
and

Cov(r, u) = 0. (2.10)

The assumption given in (2.10) implies that individual student errors are independent

of class level errors and T is block diagonal with each Tj represented as

T1. = too to, (2.11)

Furthermore, the Tj are assumed to be the same for all j, which is often referred to
as homogeneity of variance-covariance matrices. Consequently, a pooled estimate of
T is calculated for this common variance-covariance matrix.

The classes have been randomly assigned to either the experimental treatment
or the control group. This is a common method used to control for existing
differences at the class level prior to implementing treatments. Given that the
classes were randomly assigned to the two experimental conditions, the model
developed in (2.01), (2.02a) and (2.02b) is parsimonious. An alternative possibility
that may occur when classes are not randomly assigned to treatments is that a linear
relationship might exist prior to ueatment between the class mean on the
independent variable and the outcome measure. In this slightly more complicated
situation, an extra term in the between-group model would be required to account
for this effect.

The full Bayesian model for the data augmentation algorithm requires

additional distribution assumptions for the parameters (52 and T. Conjugate prior

35
distributions are speciﬁed for these parameters. The inverse chi-square prior

distribution for O'2 is given by
02 "' V0602X.2(V0) (2-12)

with the degrees of freedom parameter v0 and o",2 indicates the prior value for O".

Anglfraaﬂy.-the .prior._,clistribution_ .of T_i_s.-a.t1. inverse. Wishm..di,stdbu§9_a.gi¥en.-by .
T saw-101', u), (2.13)

391319)?“ is the parameter matrix of £11.¢,irl.verse.-Wishart distribution _.and-.p...i§.,.the
degrees of freedom parameter. Anderson (1984, p. 268) calls ‘I’ the "precision
matrix." The degree of knowledge concerning the actual values of the parameters
in (2.08), (2.12), and (2.13) and the contribution they have to the joint posterior
distribution is assumed almost null or very small so that these prior distributions are
noninformative. /This means that \before a sample is drawn the actual values of y,
0" and T are vague. There are no speciﬁc values for y, 0" or T within their
respective parameter spaces that are considered more probable than any others
before sampling. Any inference based on the posterior distribution, which is a
combination of the data and the prior distributions, will be dominated by the

information derived from the data.

CHAPTER III

APPLICATION OF THE EM AND DATA AUGMENTATION ALGORITHMS
TO THE HIERARCHICAL LINEAR MODEL WITH NORMAL ERRORS

The formal derivations of the two algorithms are presented for the speciﬁc
hierarchical linear model used in this simulation study. This model includes
different design elements so that the application of these statistical procedures to

other related designs is straightforward.

The Empirical Bayes Procedure Using the EM Algorithm

The empirical Bayes procedure presented in this section draws heavily on
prior research. The following works have been especially important: Lindley and
Smith (1972) and Smith (1973) for their work putting the hierarchical linear model
in a Bayesian framework; Dempster, Laird and Rubin (1977) for developing the EM
algorithm; Strenio, Weisberg and Bryk (1983) and Raudenbush (1984) for the
application of this approach in an educational context; and the development of the
HLM computer program by Bryk, Raudenbush, Seltzer and Congdon (1986).
Furthermore, the HLM computer program was used to produce the empirical Bayes
estimates for this study.

The two level hierarchical linear model and its assumptions for the empirical

Bayes approach are restated next:

36

37

y=XB+rr r~N(0,Z),
B=Wy+u, u~N(0,T),
T:__N(9. F).

and

2 = 162 and Cov(r, u) = 0.

Initially, Z and T are assumed known. The goal of the empirical Bayes
approach is to estimate the joint posterior modes of B and 7 given y, 2 and T.

This joint posterior distribution is given by

p6. B. vI 2,1) .. ﬂy I B. 2) £503 I M) 11(1):. (3.01)

The term empirical Bayes is used to describe this procedure because the B
parameters at the ﬁrst level use the second level of the model as a prior, which is a
Bayesian idea. / The second level does not usejanuinforrnativeprior distribution for
y} / Consequently, the second level parameters are estimated empirically from the
data. ’V/No prior distributions are speciﬁed for 2 and T. From a Bayesian
perspective, the result of this statistical procedure is the calculation of the joint
posterior modes of the two parameter vectors B and y from (3.01).

To determine the empirical Bayes estimator for B and y, the partial

derivatives of (3.01) must be taken with respect to B and y. The prior distribution

r.

h(v),_i§ considetedw aﬂyerymsreell, ..<?9.I.1.St.a.n_t pncLitcanhelgroredrwhﬂameempincaL

 

 

Bayes estimators are, calculated. The resulting equations are then set to zero and

38

solved to provide the empirical Bayes estimators. To begin with,
MB. 7 I y. 2. T) cc (3.02)
eXpl-1/20' - XB)’Z"(y - XB) - 1/2(B - WY)’T"(I3 - W10}.

After taking the natural logarithm of (3.02), expanding and collecting the terms in

the exponent, and eliminating all terms not containing B or y, the following

expression results from these operations:

Ln p03. 7' y. 2. T) °<= (3.03)
B’X’E“XB - 2B’X’Z‘1y + B’T"B - Zy’W’T"B + y’W’T‘Wy.

Next, the partial derivative of the logarithm of expression (3.03) is taken with

respect to B:

3141 P(B. Y I y, 2. T) / 313 = (3.04) V
2X’2'1XB - 2X’2“y + 2T"B - ZT‘Wy.

Setting (3.04) equal to zero and solving for B, assuming all matrices that are

inverted are non-singular, results in

l3 = (X’Z"X + T‘)" (X’Z“y + T‘WY). (3.05) V

To help simplify (3.05), the following terms are deﬁned:

39

L = (X’E“X + T"), (3.06)

v = (X’2“X)", , (3.07)

A = L“V“, (3.08)
and

I - A = L"T". (3.09)

With the above deﬁnitions, equation (3.05) can now be simpliﬁed as follows:

0 = L"(X’2“y) + L'1T‘Wy (3.10)
= L"V"V(X’£"y) + (I - A)Wy

= AB° + (I - A)Wy.

Note that B° in equation (3.10) is an ordinary least squares estimator of the within-
class parameter vector B when 2 = 10". An estimator for y is still required to
determine the empirical Bayes joint modal estimates for B and 7 given y, 2 and T.
The modal estimator for the y parameter vector is determined similarly to
the estimator for B. First, the partial derivative of the natural logarithm of (3.01) is
taken with respect to y and then the result is set equal to zero. These operations

result in

aLn p(B, y I y, 2, T) / 87 = 2W’T'1Wy - 2W’T"B, (3.11)
and

y = (W’T‘W)“W’T‘B. (3.12)

40

The results of (3.10) can be substituted into (3.12) as follows:

Y = (W’T‘W)“W"I“[AB° + (I - A)Wy]. (3.13)

Multiplying out (3.13), rearranging some terms, and then multiplying through by
(W’T“W) results in

[I - (W’T‘W)“(W’T"W) + (W’T‘W)"W’T‘AW]Y (3.14)
= (W’T"W)“W’T‘AB°,
and

W’T‘AWy = W’T‘AB°.

Two helpful identities (identities 3 and 4 from Smith, 1973) can be used to

rewrite equation (3.14) in a more informative representation. Let

A“ = (T + V)" (3.15)
= T'1 - T"(T‘l + V")"T‘l
= '1‘1 - T"[I - (T“ + V")"V"]
= T‘ - T'1 + T‘[(T" + V")"V"]
= T“[(T“ + V“)"V"].
From (3.08)
A = L"V" = (T‘1 + V“)“V“,
and hence,

A“ = TIA.

41
Substituting A‘1 for T“A in (3.14) and solving for y results in the empirical Bayes

estimator

Y. = (waA-lwylW’A-IBO, (3.16)

and it follows from (3.10) that

[3‘ = AB° + (r - A)Wy'. (3.17)

The empirical Bayes estimates of y’ and B‘ in (3.16) and (3.17) are deﬁned as the
values of y and B in their respective parameter spaces that jointly maximize
expression (3.02).

Smith (1973) presents the posterior dispersions of y’ and B‘ conditional on y,

E and T. These dispersion matrices are

D(y‘) = (W’A"W)", (3.18)
and

D(B‘) = AV + (I - A)D(Wy')(l - A), (3.19)
where

D(Wy’) = W(W’A“W)“W’.

When the variance-covariance components are unknown, there is no general
analytical solution to this estimation problem. The EM algorithm provides one

iterative, numerical solution.

42
The EM Algorithm and Covariance Estimation

The empirical Bayes estimator for B and 7 presented in the preceding section
assume that 2 and T are known, a situation rarely if ever found in practice. These
variance-covariance matrices must be estimated from the data. The empirical Bayes
approach to this problem is to substitute maximum likelihood estimates of 2 and T
into the expression (3.02). The result of this approach is to obtain the joint modal
posterior distribution of B and 7 given y and the variance-covariance matrices 2 and
T are equal to their maximum likelihood estimates. To begin our development, let
us consider the two random vectors r and u had been observed. In this case,
maximum likelihood estimates for 2‘. and T could be calculated easily.

The structures of Z and T implied by the assumptions reduce the size of the
mau'ices that are required to compute the estimates of these two variance-covariance
matrices. Since it was assumed that Cov(r,u) = 0 and Z. = Io“, a pooled within-
class estimate can be obtained for the variance-covariance matrix 2. Also, it has
been assumed that the Bj’s are independently and identically distributed and the
sufﬁcient statistic for T can be pooled across classes by summing the submatrices
on the diagonal. Let r’r and ujuj’ be the sufﬁcient statistics for the two variance-
covariance matrices. It follows that the maximum likelihood estimates for o2 and T

are

01 = r’r / (Zn), (320)
and

T = (l/k) Zujuj’. (3.21)

43
The central idea behind the EM algorithm is to estimate the sufﬁcient
statistics that are required to calculate the maximum likelihood estimates for 0" and
T as presented in equations (3.20) and (3.21). The EM algorithm has two basic
steps, the expectation and the maximization steps. The basic outline of the EM

algorithm is presented below.

1. The expectation step (E-Step) requires initial estimates for B' and 7’.
These estimates are then considered as if they were known parameters
and used to calculate the conditional expectations of the sufﬁcient
statistics given y, 2 and T necessary to construct the sum of squares
associated with Z and T.

2. The maximization step (M-Step) uses the new estimated sum of
squares derived in step 1 for 2 and T to calculate updated maximum
likelihood estimates for these two variance-covariance matrices.

3. The updated estimates for the variance-covariance matrices generated
in the maximization step are now recycled back to the E-Step to
recalculate new values for B‘ and y‘. The newest values for B‘ and
y' are used to generate another set of expected sum of squares for the
variance-covariance matrices, which are then passed on to the
maximization step. This process continues to iterate between the
expectation and maximization steps until a stopping criterion is
reached. One criterion is to stop the iterative process when the
change in a function of the joint posterior distribution between

successive iterations is less than some speciﬁed tolerance.

44

The strategy employed by the EM algorithm is to compute the conditional
expectations of r’r and Zujuj’ given y, 2 and T. These estimated sufﬁcient statistics
are then passed to the maximization step to estimate 2 and T. It has been assumed
that Z = 10‘1 and Cov(r, u) = 0. Let the superscript "p" indicate the previous
iteration’s posterior parameter estimates. The previous iteration’s posterior estimates
for the variance-covariance matrices Z and T are considered known for the purpose
of calculating the conditional expectations. The following details for determining
the required conditional expectations are from Dempster, et al. (1981). First the

conditional expectation for the estimated sufﬁcient statistic for 2 = Io2 given y, 2

and T is
E(I"r) = EU - XB)’(.Y - X3) (322)
= ELY - XB’ + XB" - XB)’(y - XB" + XB" - X6)
= (y - XB">’(y - XB") + £03 - BP)’X’X(I3 - 13")
= (y - XB‘TO/ - X13”) + Tr[X’X Var(B - [301
= (y - XB’)’(y - X13”) + TrIX’X Var(B)].
with

Var(B) = Var(W‘y) + Var(u) + Cov(Wy, u) + Cov(u, Wy)
= APV" + (r - A’)S"(I - AP)’,
where

S = W{W’[(X’2§“X)‘l + T]"W}"W’.

In a similar manner, the conditional expectation for T given y, and previous values

for 2 and T is

45
E(2ujuj’) = E[Z(Bj - ij)(Bj - ij)’] (3.23)
= Zu’ju’j’ + 2(A"J.V"j + APjSPjAPj’).

The estimated expected statistics derived from (3.22) and (3.23) are passed to
the maximization step to produce restricted maximum likelihood estimates for (3'2 and
T respectively (Harville, 1976; and Laird and Ware, 1982). Let us deﬁne 2’ = Io"
and T' to be the estimates that maximize the restricted likelihood. The empirical
Bayes approach substitutes these variance-covariance components as if they were
known constants. This strategy reduces the information from two distributions into
point estimates. Hence, empirical Bayes inferences about 7 are based on the

following joint posterior distribution:

p(v l y. 2:21 T=T' ) = lp(y I B. 2:) p(B I y, r) as. (3.24)

The empirical Bayes approach does not take into consideration the uncertainty of
our knowledge of the unknown variance-covariance matrices Z and T. The
estimates of T in a small sample situation may not be estimated with very much
precision. The empirical Bayes approach underestimates the uncertainty associated
with T and to a lesser extent 2 because the point estimates for these parameters do
not provide any information about their precision. As a result, if other probable
values for T produce substantially different values for y', then the dispersion of y'
will be underestimated. The asymptotically normal properties of maximum
likelihood estimators makes this concern less important in large sample problems or

in situations where the most probable values of T do not substantially effect the

46

estimation of y'.

The Data Augmentation Algorithm

Wovides an alternative to the empirical Bayes

M

approach / and explicitly incomeretes. .theypesttainty. oppreeisien. of the variance-
covariance components /n the model. The, essentialdifference in thesegtwow ,
approaches/Ts that the Bayesian approach speciﬁes a prior distribution for the
variance-covariance components’in the model, rather than summarizing this
‘ /

information into a point estimate/ Hence, Bayesian inferences about 7 are base on

My | y. z, T) = ﬁlm I 13.2) p(B | 730951) 991:0) @98383625)

This Bayesian approach provides more information about the precision of the
standard error of the posterior distribution (Lindley, 1972) than is available with
some form of point estimate substitution for the variance-covariance components. In
iparticular, the Bayesian approach presented in (3.25) should provide standard errors

ifor 7 that tend to be larger than those in (3.24). {This should be; especially. evident.
in small sample situations},

The problem with the Bayesian solution to the hierarchical linear model is
that even for reasonable priors it is exceptionally“ complicated io execute. The
required integrationvis multidimensionalé the densities vare mathematically complex
and [,Cannot all be expressed in closed form (Lindley, 1972). The data augmentation

algorithm jls an alternative procedure for the implementation of the Bayesian

approach I that can approximate (3.25)\vg}'_t_119_11.t~3§£§35511§tjﬂg,£h§..99mplexintegration.“

47
he data u ' a1“ ' is amethod for Calculation/of the/posterior
diiuibuuon of a model’s parameters vwithout integration“. The basic ideanehind this
agoﬁmm/gwpkmem the. observed data/{with some latent datal mltingin. an.
_augmented datasei. In many cases: given this augmented data: thwahsis
pr/oblemi which‘initially was. difﬁcult. or..intractable, becomes relatively
straightforward. Ingthe general hierarchical linear model, each successive level is
more difﬁcult /to sample /because the relevant populations become smaller. For
example, it is easier to sample n students than to sample the same number of
classes and as a result, the data at higher levels becomes sparse. For the specific
hierarchical model under consideration, if large quantities of data were available
pertinent to the between-class parameters 7 and T, the analysis would be much
easier. The data augmentation algorithm developed by Tanner and Wong (1987)
provides a strategy to generate the required latent data for the model’s between-class
parameters to facilitate the statistical analysis.
In order to develop this algorithm, further assumptions are required that were
unnecessary for the empirical Bayes procedure. These additional assumptions
/Specify prior distributions for the parameters O"2 and T./ These conjugate prior

distributions are

62 7' Vooozqu/o),
and
T ~ W“(‘I’, u).

v
\/

These assumptions were presented previously in (2.12) and (2.13). One

48

“Conseqmnce of these assumptionsvis that thevuncertainty resulting Vfrom unknown
variance-covariance components” is included (explicitly in the model.

madataaaugmentation algorithmwusgsgythese assumptions- inadditionJo the
data to generate a large set of latent or unobserved data by Monte Carlo sampling.
If this set of augmented data, is sufﬁciently large, it_ can be used as a ﬁnite
approximation to the true posterior distribution. This algorithm is iterative and each
auooessiveiteration produces a__better approximation, until convergence, toAhe true.
posteriordisuibution. The a priori knowledge assumed for the actual parameter
values of . the prior distributions/are vague or noninformative in--their contribution to
the posterior distribution.“ 7 It is the assumptions concerning the model, in conjunction
(with the distribution of the data that determines the joint posterior disu'ibution.

The joint distribution of the model is

p(y, [3. E. y. T) = (3.26)

PM ' B, 2) 13.2931 >3. 7. T) 133137;. T).

This joint distribution can be rewritten in two alternative expressions (Morris, 1987)

as

qty) MB. 2 I y) Q30. T | [3. 2). (3.27)

01‘

“(3') 1'2“}, Z I y, Y, T) 5(7) T I Y)- (328)

The above two alternative forms contain densities derived from the pi densities in

49

(3.26) and q1(y) = r,(y). The joint posterior density may be written as

p(B. 2. 7. T | y) x p(y, B. 2. 7. T) / (My). (3.29)

First, to calculate the joint posterior density presented in (3.29), the
knowledge of q3 would result in the straightforward calculation of the joint posterior
density given by r2 because the joint posterior density of y and T is known. Or, on
the other hand, if r2 were known then the joint posterior density of q3 would follow
directly because the joint posterior distribution of B and 2 is known. The
dependency between q3 and r2, in terms of the joint posterior density of the
parameters, suggests an iterative solution to this problem, since neither q3 or r2 is
generally known. The data augmentation algorithm provides an ingenious procedure
to overcome this dilemma.

The data augmentation algorithm, as outlined by Tanner and Wong (1987),
has two basic steps, the imputation and posterior steps. The qut step is the
imputation step. This step generates latent data to augment the observed data.
Initially, suppose parameters B and 2 from r2 were observed, then 7' can be
calculated, where the asterisk (*) indicates an estimated mean. Next, given 7’ just
calculated and a previously observed T, m new y’s are sampled by Monte Carlo
methods from the posterior distribution of y. This is the point in the algorithm
where latent data may be generated to augment the observed data. With the
knowledge of the m 7’s, m T"s can be calculated and subsequently, m T’s are
sampled from the posterior distribution of T using Monte Carlo methods. The result

of the imputation step is m parameters 7 and T approximating a sampit- From q3.

50

The second step is the posterior step. This step uses the augmented data
resulting from the previous imputation step to obtain a sample of m B’s and 2’s.
Given m parameters 7 and T from q3 and a previously observed 2, m values for B‘
can be calculated and used as estimated means to sample m updated B’s by Monte
Carlo methods. Subsequently, m 2”s can be calculated and then m 2’s can be
sampled by Monte Carlo methods. The results from this step are m parameters B
and Z approximating a sample from r2.

Initially, the results from the two steps of the data augmentation algorithm
produce rough approximations. To improve the approximations, the results from the
posterior step can be considered as an intermediate approximation of r2 instead of a
ﬁnal one and recycled back to the imputation step to generate a new sample to
update q,. The size of m may be increased in the imputation step during an
iteration until it reaches some maximum value. At this point, the value for m may
be held constant. This iterative process between the imputation and posterior steps
has been shown by Tanner and Wong (1987) to converge in probability to the
desired posterior distribution under mild regularity conditions. For large values of
m, for example m = 2048, the mixture of the densities produced from the
imputation and posterior steps can be considered a ﬁnite approximation to the
desired joint posterior density presented in (3.29).

One advantage of this algorithm is that not only are point estimates
generated as a mixture of densities, but the results also include ﬁnite approximations
to the true joint posterior distribution. For example, the calculated values for a
parameter of interest can be sorted in order and then the or/2% tails of the

distribution can be easily determined. This procedure results in a (l - 00% highest

51

posterior density (HPD) interval for the parameter. Thus, this approach provides an
alternative to the sampling theory approach, which is especially useful when the
analytically determined sampling distribution of an estimator is not known and the
implementation of the appropriate conditional distributions such as r2 and q3 do not
offer insurmountable problems.

The details for the implementation of the data augmentation algorithm are
presented in the following sections. This presentation will begin with a brief
discussion concerning the calculation of initial values required to begin the data

augmentation algorithm.

Initial Values .
v x V

Initial values for B, E and T are required for the start of this algorithm.
Parameter estimatesjnay be calculated by a variety of techniques. One method to
calculate these parameter estimates is to use the empirical/Bayes procedure. The
ﬁnal estimates from the empirical Bayes procedure can be used to supply initial
values Vto the Bayes approach using the data augmentatiomalgorithm. Since these
two statistical procedures are compared in this study, the ﬁnal estimates of the
empirical Bayes procedure were not used to help minimize dependencies between
methods. Least squares estimates of B \were used as initial starting values for this
parameter. The ﬁrst iteration results from the empirical Bayes procedure for Z and

T were used as starting values because this method constrains estimates of these

parameters to their respective parameter spaces.

52
The Impatation Step
Given a set of m pairs (B, 2) from r2(B, 2’. l y, y, T), m new values for 'y'
and T’ can be calculated. These posterior means are used to generate a sample of
size m’ of the parameter pairs (7, T) from q,, where m S m’. The fu'st iteration of
this step m = 1 and may be increase so that m’ = 2 at the completion of the

imputation step. We note that y is from a normal disuibution given by

7‘" ~ Ntr‘”. (W’T‘W)"). (3.30)
and q3 may be expressed in the following form:

cm. T I B. 2) = cm I T. B. 2) (1.0). (3.31)
where q,» represents the noninformative prior distribution of T. The calculation of

7‘ does not depend on T, because T is constant with respect to B. The m values

for y' can be calculated as follows:

7“" = 2(Wj’Wj)‘ 2W; 5", (3.32)
where

i = 1, 2, ..., m, and

j = 1, 2, ..., k.

Once the y‘””s are calculated, a sample is drawn by Monte Carlo sampling

from (3.30), the posterior distribution of 7. Let us denote the m T’s from the

53

previous iteration of the imputation step TS). Let

A“) = [2(Wj’Té’HWJ-H". (3.33)
for

i = 1, 2, ..., m,

where the A“”s provide the current approximation to the dispersion of y. The

A“”s are factored such that

A“) = Bti)B(i)’, (3.34)

where B“) is a lower triangular Cholesky factor of A"). The matrix equation used to

generate a new 7‘" is

7(1) = 7(1)— + Baku), (3.35)
where
i = 1, 2, ..., m,

1: l, 2, ..., m’ and m Sm’.

The vector x‘” contains elements that are independently and identically distributed
N(0, 1). The data may be augmented using (3.35). For example, let m’ = 2m.
Two x vectors can be generated for each pair y‘"’ and 8‘”. Hence, two new 7
vectors may be generated.

Once m’ new y’s have been generated, this information can be used to

54

update the posterior distribution of T. Let

"1"” = (l/k) C"), (3.36)
where
C“) = 2(Bf" - W;y‘")(BJ-"’ - iji’Y, and now

i = l, 2, ..., m’.

If the i" sample covariance mauix is W”, where T")‘ has the Wishart distribution
with parameters T and k, and T has the a priori inverse Wishart distribution W"(‘1’,
u), then the conditional distribution of T (given T“”) is W"(T“" + ‘I’, k + D)
(Anderson, 1984). Thus, if we assume the precision mauix ‘I’ approaches 0 and its
degrees of freedom parameter 1) approaches zero to indicate a noninformative prior,

then the posterior distribution of T is

T ~ W"(T‘”°, k). (3.37)

Jones (1985), and Smith and Hocking (1972) describe a method that uses Bartlett’s
(1933) decomposition along with T‘"° to obtain a random sample from the
distribution in (3.37).

To obtain a sample of T“”s from the disuibution given in (3.37) we ﬁnd the
Cholesky factor of the T“”’s such that T‘”’ = D“’D"”, where D‘” is a lower triangular

mauix. For E"), a lower triangular matrix, let

F") = DmEmE’mD’m = (D"’E"’)(D‘°E"’)’. (3.38)

55
Furthermore, if eij is an element of E, and e2jj is an independent chi-square variable _.
...‘Iith ‘(k T j + 1) degrees of freedom and the wel’eri’ients below the diagonal are
independently distributed N(0, 1), then F“) will have the required inverse Wishart
distribution with parameters T“” and k (Jones, 1985; and Anderson, 1984).
Now, for each T“” calculated in (3.36) an updated T“) is sampled from
(3.37). This is accomplished by generating F“) from (3.38) using Monte Carlo

methods and setting T“) = 1/k F“). These new values for ya) and T“) are passed to

the posterior step to generate new updated values for B and E.

The Posterior Step

The data generated from q, by the execution of the I-Step consists of the
pairs (7"), T‘”), for i = 1, 2, ..., m. These values are considered known during the
execution of the P-Step. Given these data, the posterior step calculates m B"s and
Y’s. This information is used to realize a sample of size m from r2. The desired

posterior density r2 can be rewritten as

r2(Bs Z I yr Y: T) = r2‘(B I yr 2! Yr T) r2"(z)° (339)

We note that r," is the noninformative prior assumed for 2.
Let 2“,” indicate the previous iterations sample from the posterior distribution

of 2. New values for the B"s can be calculated with the following equation:

13;” = Alli? + (I - awn/i). (3.40)

where

56
ATI) = (VEDJ + 'I‘(i)-l)-lVJ(i)-l’

and

V?“ = (xjazélHXQ-l.

A new set of B’s can be sampled from the following distribution:

Bl" ~ N(B§"’. D(B§"). (3.41)
where

D(B§~") = (V?) + Tel)“.
Samples are drawn from (3.41), an approximation of r2, in much the same manner

as the W’s were sampled in the imputation step. First, the matrix D(B§”) is factored

by the Cholesky method such that

Bail") = Ll‘lLl‘”. (3.42)

where Lg” is a lower triangular mauix. Next, the B}‘”s are sampled with the

following equation:

B?" = BE" + Li’xﬁ". (3.43)

where x?) is a vector containing independent and identically distributed elements

sampled from N(0, 1).

After generating new B§"’s from (3.43), the 2“” = 0'2“”1 can be updated to

57

reﬂect these new values. Let

0'1“). = 1/N 20') ' XjBii))’(y)' " XjBii))i (3‘44)
where

N=zn,.

It has been assumed that 62“" has a chi-square distribution and of, has an inverse
chi-square prior distribution. This prior distribution of 0'}, is noninformative and
may be considered a very small constant in its contribution to the posterior
distribution of the o"”’s when the degrees of freedom for 0'}, approach 0 (Lindley,
1965b). The updated posterior values of 0"“) can be generated by Monte Carlo

methods as follows:

out) = No.20)- / 762%"), (3.45)

where the xz‘"(v) are independent chi-square variates with the degrees of freedom
parameter v = N.
This simulated sample from r2 of m parameter pairs (B‘”, 2‘”) are passed back

to the imputation step to begin a new iteration.

Some Additional Notes on the Data Augmentation Algorithm.
To begin the ﬁrst iteration, there is only one estimate of B, y, and T
available for the ﬁrst iteration of the imputation step. The parameters 7 and T are

augmented in this step of the algorithm. Each execution of the imputation step

58
produces twice as many samples from the posterior density as the number entering
this step. This doubling of the latent data continues up to some predetermined
maximum. The rate the data is augmented is a variable of the algorithm. Doubling
the data in the imputation step was chosen as a slow linear rate so that the number
of iterations required for convergence would not be as reliant on the initial values
as they would if m was set at its maximum value right at the beginning. After this
maximum value has been reached, it is held constant until a stopping criterion for
the algorithm is reached.

Traditionally, an iterative process can be said to converge when some
function based on the difference between the previous and the current iteration
values is less than some speciﬁed tolerance. The implementation of the data
augmentation algorithm includes the inu'oduction of random elements, and as a
result the convergence criterion for this algorithm will be quite different than the
one used for the EM algorithm. The data augmentation algorithm will produce a
stochastic trend across iterations rather than a strictly increasing monotonic function
evaluated with the EM algorithm. Since this is a simulation study, the population
values of all the parameters are known. These population values can be compared
to the values obtained from the posterior distributions calculated with the data
augmentation algorithm at selected iterations. This information obtained from a
number of different iterations across experimental data sets will be examined to
provide some insight into an appropriate number of iterations to provide the desired
level of convergence. Determining convergence is a major unsolved problem of the

data augmentation algorithm.

CHAPTER IV

METHOD

Research Questions
The primary questions of interest are introduced in this section. The speciﬁc

methodological details for investigating these questions are presented in subsequent

sections.

Determining a Convergence Criterion

The iterative data augmentation algorithm lacks a strategy for determining
convergence. The convergence of this algorithm is dependent on a number of
factors, such as, the model, the assumptions, the data, the number of augmented
data sets that are generated, and the number of iterations. How do we know when

this algorithm has converged?

Comparing the Accuracv of the Two Statistical Procedures

There is little known about the performance of the empirical Bayes approach
using the EM algorithm in small sample hierarchical problems and the Bayesian
approach using the data augmentation algorithm is virtually untested as an applied
statistical technique. In many studies, the most important parameter is the 7 vector.

In addition, the variance-covariance matrices Z and T are important parameters to

59

60
accurately recover because they provide information on the dispersion of B and y.
Which of the two competing statistical procedures recover the parameters of the

model from small sample hierarchical data with the greatest accuracy?

The Error Rm

There are two possible types of errors that can be committed when making
an inference in a hypothesis testing situation. A type I error is made when the null
hypothesis is rejected when it is in fact true. Traditionally, researchers have tried to
control the type I error rate, or. The origins of the 0.05 level of statistical
signiﬁcance can be traced back to Sir Ronald Fisher in 1925 (Cowles and Davis,
1982) and a narrow range of values around the 0.05 level have persisted. Under
the conditions of this simulation, are the observed type I error rates equivalent to
the nominal error rates for the empirical Bayes and the Bayes statistical procedures?

The other error that can be made is when the null hypothesis is not rejected
when in fact the alternative is true. This is referred to as a type 11 error. The type
11 error rate, B, is used to deﬁne the statistical power of a test, (1 - B). This
translates into a statement about the probability of detecting a true alternative
hypothesis. Is the statistical power for both procedures equivalent? If the statistical

power is different, which procedure is the most powerful?

The Computational Efﬁciency of the Statistical Algorithms
The computational resources required by the data augmentation algorithm are
unknown. This procedure generates large quantitiesof data and it is also iterative.

Consequently, it has the potential to consume substantial amounts of computer

61

resources. Are the computational resources required by the data augmentation
algorithm within practical limits for the application of this method to substantive
problems in education? How do the two statistical procedures compare in terms of
the computer resources they require?

The EM algorithm is also an iterative algorithm. How many iterations are
required for convergence? Do the required number of iterations change when the
EM algorithm is applied to data from different populations? In particular, does the

number of iterations increase as the conditional variance of In decreases?

The Design and Sampling Plan for the Simulation

The primary design factor in this simulation study is the size of the between-
class parameter 7”. The empirical Bayes approach using the EM algorithm and the
Bayesian approach using the data augmentation algorithm are compared under three
experimental conditions. The 7,, parameter varies across the three experimental
conditions. The relative magnitude of this interaction (ATI) effect parameter is
essentially determined by p,, the correlation of Bjl with Wj, where Wj is the
between-class predictor.

Five hundred replications of an experiment are generated for each
experimental condition. This number was chosen as a compromise between the
number of replications necessary to make an informed inference about the statistical
results produced by the two procedures and the required computer processing time.
Computer processing time is a concern because both algorithms are iterative and
computation intensive. This is especially true for the data augmentation algorithm.

The number of students assigned to each class in this simulation is drawn

62
randomly from a distribution of class size. This distribution models a plausible
distribution of elementary school class size. The expected class size from this

distribution is 24.03. Table 4-1 displays the distribution of class size.

Table 4-1

Class Size Frequency Distribution

 

Class Size Percentage

 

30 0.005
29 0.030
28 0.060
27 0.1 10
26 0.125
25 0.135
24 0.130
23 0.125
22 0.095
21 0.070
20 0.050
19 0.030
18 0.020
17 0.010
16 0.005

 

Creating the Data

A number of issues must be addressed before the appropriate data can be
generated. The parameters necessary to generate the data must be deﬁned and then
suitable values must be selected before the actual data sets can be constructed.
Another consideration is the technique used to generate the required random
numbers and transforming them to the appropriate distributions. These issues are

covered next.

63

Pagameters of the Model

For the speciﬁc hierarchical linear model investigated, the predictor variable,
X1)“ at the within-class level and the error term, rij, are identically and independently
distributed normally with means equal to zero and variances equal to one.
Furthermore, the variables xm, r“, ujo, and uj, are mutually independent. Finally, T is
assumed to be a block diagonal mauix with homogeneous diagonal submauices.
This assumption permits generation of a common Tj matrix and as a consequence a
pooled estimate of this variance-covariance matrix is appropriate.

To generate data with the required distribution, we must deﬁne some
additional parameters. The method used to generate the necessary standardized
hierarchical data was developed by Bassiri (1988) and modiﬁed to ﬁt this speciﬁc
model. This special case based on standardized hierarchical data changes the
location and scale but it does not affect generalization to other normal distribution

parameter values. Let the pooled within-class parameters be deﬁned in the

following manner:

7, is the pooled slope,

p” is the pooled within-class and within-treatment correlation of x with y,

FL

is the unconditional pooled variance of y, and

.9.

is the pooled variance of x.

Recall that too and In are the diagonal elements of Tj corresponding to the Var(ujo)
and Var(u)-1) respectively. Also, Cov(u,-o, u“) = 0 which implies that to, = t“, = 0.

The following is a list of required deﬁnitions.

64

1. c = In / too and In = c1200. (4.01)
2. ti = Pi. 0‘? / 0i. (4.02)
and
03 = Var(xajm + xmuj. + r.) (4.03)

= yj + t11 + Var(r,j)
= 73‘ + c100 + 1.0.
The results of (4.03) are substituted into (4.02) and the expression for
7% can be rewritten as
y? = pi,“ + c100 +1.0) (4.04)
= (Pi, / (1-0 - Pi,» (0100 + 1.0).

3. d = too / (too + oi), (4.05)
where d is the intraclass correlation. Next, the expression for the
pooled within-class variance of y can be expressed in a new form by
substituting (4.04) into (4.03) and
0'; = {(pi, / (1.0 — (31)) (ct:00 + 1.0)} + ct:00 + 1.0. (4.06)
Substituting equation (4.06) for O? into equation (4.05) and solving for
Too results in
1:00 = d/ ((1.0 - d)(1.0 - p2,) - Cd). (4.07)
The values for c are constrained by 0 < c < (1 - piy)(l - d) / (1, so

that the variance component too is greater than zero.

Equation (4.07) implies that once values are selected for the parameters d,
p”, and c, the value for the variance component too is determined. From the

deﬁnition of c in (4.01), it can be seen that In is also determined.

65
Deﬁnitions for the conditional variances too and 1,, given W must be
considered in addition to the corresponding unconditional variance terms deﬁned
previously. Let p() be the correlation between Bjo and Wj and p1 is the correlation

between Bjl and W, for Wj as deﬁned in (2.02a) and (2.02b). Let

to. = pom)“. (4.08)

and

In = 910.1)“. (4.09)

The conditional variances for too and 1,, given W are deﬁned as follows,

tmlw = 1:00 - 7’00 (4.10)
= tm(1.0 - pi).
and
rm“, = 1,, - yi, (4.11)
= t,,(1.0 - pf).

The covariance terms “to, and 1,0 are set to zero. The two between-class
parameters 701 and y“ are determined by the correlation coefﬁcients p0 and p,.
Since po is set to zero, yo, will also be equal to zero. In addition, 700 is equal to
zero because this is the central value for a standardized hierarchical linear model.

And ﬁnally, 7,0 reﬂects the base for the within-class slopes.

66

The Parameter Values Selected

The number of classes was selected at ten to represent a small but realistic
number of classes that might typically arise in an experimental research project at
the class level. This small sample size of classes was selected because it is
expected that the difference between the two statistical procedures will be greatest
under this condition.

The three levels of the correlation coefﬁcient (31 determining the relative size
of y” are 0.0, 0.3 and 0.6. The value 0.0 was selected for the type I error rate
analysis. The other two values were selected to investigate the power of the test
over a range of plausible values for the between-class correlation coefﬁcient related
to y”. Also, the two algorithms may recover T differentially as the correlation p1
increases in size. Table 4-2 illustrates the three experimental data conditions and

the values selected for 1),.

Table 4-2

The Design for the Experimental Conditions

 

Experimental Conditions

 

Correlations 1 2 3

 

p, 0.0 0.3 i 0.6

 

Many student characteristics such as motivation, aptitudes, and interest have a
moderate linear relationship with achievement if the measures are chosen

judiciously. Since there is only one independent variable at the within-class level, it

67

is assumed that this single independent measure was chosen with care. To reﬂect
this situation, the actual value of the correlation coefﬁcient for p,y was set at 0.60.

The intraclass correlation, d, is generally small in educational research. Kish
(1965) reports that most intraclass correlations are less than 0.15. Raudenbush and
Bryk (1986) estimated the intraclass correlation to be 0.177 in a national high
school educational research project. Blair and Higgins (1986) picked a value of
0.20 for the intraclass correlation used in their simulation study. There is some
evidence that the inu'aclass correlation associated with educational data has a rather
narrow range of values (Blair et al., 1983). The intraclass correlation for this
simulation was selected to be 0.18.

Research ﬁndings do not routinely report the value of c, the ratio tn / too.
Thus, choosing a likely value for c is not as straightforward. There is some
indication that the value of c in an educational context is rather small. Bryk and
Raudenbush (1987) found values of c equal to 0.11 and 0.05, and Raudenbush and
Bryk (1986) found a value of 0.10. A value of 0.10 was selected for this
parameter.

The results determined by these selected parameter values are summarized in

Table 4-3.

68

Table 4-3

Parameter Values for the Three Experimental Conditions

 

Experimental Conditions

 

 

Parameters 1 2 3

pI 0.0 0.3 0.6

700 0.0 0.0 0.0

Yo: 0.0 0.0 0.0

710 0.7632 0.7632 0.7632
711 0.0000 0.0565 0.1 130
oJ 1.0 1.0 1.0
130on 0.3552 0.3552 0.3552
tll .w 0.0355 0.0323 0.0227
tom, = ‘51on 0.0 0.0 0.0

 

The Steps for the Dat_a Generation Procedure

The basic strategy for data generation is to create sufﬁcient statistics for each
class and then collect k classes of sufﬁcient statistics into a complete set of data
simulating one experiment. There are a couple of reasons for this approach over
the more traditional method of generating data speciﬁcally for each unit of statistical
analysis. The ﬁrst reason is that both the HLM computer program developed by
Bryk et al. (1986) and the data augmentation program developed for this study
Operate on a set of sufﬁcient statistics as input. The other reason is to reduce the
number of observations that must be created and manipulated. The following data
generation procedure was originally presented by Bassiri (1988) with some
modiﬁcations incorporated to ﬁt the unique requirements of this study.

Let i = 1, 2, ..., nj index students within the j"I class. The ﬁve sufﬁcient

statistics required for each class are

69

2 xi: 2 XI: 2 r19 2 Fit and 2 xiri°

The within-class data, the xijl’s and r,j’s, are independently and identically distributed
N(0, 1), which implies that the Cov(xijl, rij) = 0.

The following steps are used to simulate the data for one class and k
repetitions constitute one simulated experiment. These steps are repeated 500 times
for each experimental condition depicted in Table 4-2. The only parameter value of
the data that changed was pl. Let i = 1, 2, ..., 11,- indicate students within the j‘“

class and the required steps are listed next.

1. Generate a value for nj sampled randomly from the distribution of

class size presented in Table 4-2.

2. Generate

2n~MQ@. Mn)
3. Generate

2 (xi - it.)2 ~ ICE-1), (4.13)

and then use the results from step 2 to compute

Z x? = 2 (xi - it.)2 + (Z x,)2 / nj. (4.14)
4. Generate

2n~Nwmn (4w)
5. Similar to step 3, generate

Z (ri - I.)2 ~ X2014» (4.16)

and then using the results from step 4 to compute

70

z r: = 2 (r. - r): + (2 r.)2 / n... (4.17)
First generate 2 ~. N(0, 1) and then construct a t variable with (nj-2)
degrees of freedom as follows:

t = z / (70:11.2) / (nj — 2))1”, . (4.18)
then compute the sample correlation coefﬁcient R

R = t / (t2 + nj - 2)”, (4.19)
and

Z X.r. = R(2 (X. - X.)2)"’ (Z (r. - if)“ + (2 XXX I‘D/n). (4.20)
Half the classes are randomly assigned to the experimental condition
and the remaining classes to the control group.
Recall that 700 has been set to zero and 7,0 equals the pooled within-
class regression coefﬁcient. The correlation coefﬁcient, p,, is set
according to Table 4—2 and p0 equals zero. The conditional variance
terms, 10on and 1?qu are calculated from (4.10) and (4.11)
respectively. When yo, or y“ are set to zero, their conditional and
unconditional variance terms are equivalent. Since the Cov(ujo, ujl) =
0, the elements of uj can be generated independently from the
following distributions:

uq ~ N(0, tmlw), (4.21)
and

ulj ~ N(0, Tum). (4.22)
Recall from (2.02a) and (2.02b) that

l3)o = 700 + 701W) + ujo.

and

71
Bil = 710 + 701w) + up:

where you is set to zero, 70, and y“ are determined from equations
(4.08) and (4.09) respectively, and 7,0 is computed by taking the
square root of equation (4.04).

10. The last step in the data generation procedure calculates 2 y,,-, 2 yfj,
and 2 xmyij from the sufﬁcient statistics already available and the
values selected for the parameters. The three desired quantities are

calculated as follows:

E yij = Z (Bjo + jlxijl + r.) (4.23)
= “ij ‘*’ [3112 xi)! + 2 r”.
2 Y1),- = 2 (Bjo 'i' lexijl + 11))2 (424)
= njBio + Bil}: xi)! + 2 ti) + 2510 112 xiii + 23102 r”.
+ 23512 erij,
and
2 xmyij = B102 xijl + B112 xfj, + 2 xmrij. (4.25)

Random Nimrber Generation

The random number generator was based on the linear congruential method.
This method produces a sequence of random uniform values between 0 and 1.
These uniform random values are then transformed from a uniform distribution to
another distribution, for example a normal or chi-square distribution. The general

form of the linear congruential sequence is given by:

72
XM = (aXi + b) mod m, for i 2 0, (4.26)
where
m is the modulus,
a is the multiplier,
b is the additive increment, and

X0 is the initial value or seed.

The two subroutines DRNNOR and DRNCHI from the International Mathematical
and Statistical Library (IMSL), Version 10.0 were used to generate and transform
random uniform values on the interval (0, 1) into random standard normal and chi-
square variables respectively.

The random standard normal deviates produced by the IMSL subroutine
DRNNOR were transformed when necessary to another normal distribution with new

location and scale parameter values using the following general transformation:

Yi = u + o,x,, (4.27)
where

xi ~ N(0, 1),
and

Y. ~ N01. 0:).

Evaluating Convergence
To determine a convergence criterion, the ﬁrst 100 experiments under the no

effects condition for y“ are analyzed. The convergence of this algorithm is

73
investigated with the maximum number of augmented data sets set at 2048. The
results from the data augmentation algorithm are evaluated at selected iterations
relative to the pOpulation parameter values. The mean squared errors resulting from
the discrepancy between the results from the data augmentation algorithm and the
population parameter values are calculated at 20, 25, 30, 40, 50 and 60 iterations.
The calculation of the mean squared errors provide a measure of how effectively the
data augmentation algorithm recovers information from the data. The most
important parameters to investigate are the variance—covariance components because
they are used as weights to calculate the posterior B. In addition, these variance-
covariance components are required for inferences about B and 7.

Convergence in the context of the data augmentation algorithm is a stochastic
process, since random sampling is an integral element of the algorithm. Thus,
convergence must be evaluated over iterations for trends in addition to the usual
convergence criterion that tests some function of successive values for inclusion in
an arbitrarily small neighborhood. The necessary number of iterations is primarily
determined from the stability of the mean squared error terms of O" and T from this
ﬁrst run of 100 experiments. Of course, the mean squared errors of the 7 elements
are also examined. The required number of iterations is then used for subsequent
data augmentation runs.

This "empirical" procedure for determining convergence is used because the
properties of the data are known whereas, little is known about the convergence
properties of the data augmentation algorithm. This algorithm’s convergence is
stochastic rather than absolute and this makes it more difﬁcult to determine a

criterion for convergence because any function measuring a difference between

74
iterations will contain random sampling ﬂuctuations. A real time convergence
criterion must take into consideration information generated by the algorithm on the
parameters of the model over a number of iterations. Generalizing a real time
convergence criterion is an important and complex problem that must be addressed
in future studies before this method for implementing a Bayesian solution can be
used by researchers.

Another related concern is how dependent are the results of the data
v _.

 

 

 

 

augmentation algorithm to speciﬁc staring values. In this study, the data_

 

m‘-‘

 

componentsfrom- the EM algorithm» implmcntei-Mem computer program
(Bryk, et al., 1986). Twotrial replications are run with both good and poor

variance-covariance component starting values. The standard good starting values

 

WQDW. T/ltemsgniggmiere
dermhedixmuldpb/ERMMMWWW 2-
The results fIQKL the .data,_.augmentation.algorithm, are. displayed for each of these
conditions.) In addition, the results from the EM algorithm and the population

values for these parameters are displayed.

Evaluating the Accuracy of the Two Statistical Procedures

Descriptive information is calculated and displayed for the important
parameters. The parameters y, o’" and T are of principal interest. From the
Bayesian approach the posterior distributions of y, 0", and T are available for
analysis and their posterior means are easily determined. The empirical Bayes

approach provides posterior modal estimates for y, 0", and T. The point estimates

75 ,
)/
for these parameters are collected from each procedure and averaged over the 500
replications in each experimental condition.

In the unique circumstance of a simulation study, the value of each
parameter in the model is known. Therefore, the performance of each statistical
procedure may be evaluated with respect to the accuracy they recover the population
parameters. The mean squared errors are calculated for each parameter estimate
relative its associated population value. To evaluate the relative accuracy these
procedures recover the population parameters, the ratio of mean squared errors of
the data augmentation algorithm and the EM algorithm are compared for y, o", and
T. These ratios produce F test statistics and are compared to a tabled value with

the appropriate degrees of freedom. These tests are performed for each

experimental data condition.

Evaluating Type I Error Rates and Power

The type I errors produced from standard two-tailed z-tests and t-tests from
the empirical Bayes approach are tabulated for the no effect hypothesis tests for the
three parameters you, 701 and 7,1. The investigation of type I errors for y“ are
appropriate only under the ﬁrst experimental condition. The data augmentation
procedure produces a ﬁnite approximation to the posterior distribution of 7. An
alternative test of a null hypothesis, available with the Bayesian approach, may be
determined using the 0t/2 and (1 - 0t/2) percentiles of a parameters posterior
distribution. Since the Bj,’s have been set signiﬁcantly greater than zero, it is
inappropriate to include 710 in the type I error analysis. The three tests are

evaluated at three levels of signiﬁcance, 0.01, 0.05 and 0.10. An observed type I

76
error rate, p, is compared to the nominal rate, p0, using a standard test for a

binomial proportion. The standardized test statistic is

Z = (P ‘ Po) / (Po (1 - po) / n)”2 (428)

This is a large sample normal approximation to the binomial distribution. As a rule
of thumb, (Johnson and Bhattacharyya, 1977, p. 203) this approximation is
satisfactory when up is greater than 15. Thus, this approximation might be
marginal for tests at or = 0.01, but it should be satisfactory for tests at the other two
levels of signiﬁcance.

The type I error rates for each procedure are compared using McNemar’s
test. The following details for McNemar’s test are from Fleiss (1981). The

appropriate layout for the data is presented next in Table 4-4.

Table 4-4

Data Table for Error Analysis

 

 

 

Test 1
Test 2 Correct Decision Wrong Decision Total
Correct Decision a b a + b
Wrong Decision c d c + d
Total a + c b + d n

 

The entries in Table 4-4 represent the number of response pairs that fall into a

particular cell. This test for matched pairs with a dichotomous outcome tests the

77
hypothesis that two proportions are equal. The proportion of test 1 results that

correctly fail to reject the null hypothesis is

P1 = (a + C) / n, (4.29)

and the proportion of test 2 results that correctly fail to reject the null hypothesis is

p2 = (a + b) / n. (4.30)

The difference between the two proportions, p2 - pl, may be tested with the

following statistic with correction for continuity developed by Edwards (1948):

x’ = (lb - cl - 1)2 / (b + c). (4.31)

The value of the x2 statistic is compared to the central chi-square distribution with
one degree of freedom. A large value of the test statistic implies a difference in
error rates.

Guarding against type I errors is not the sole concern in hypothesis testing
situations. Researchers are particularly interested in the power of a statistical test
for a given set of experimental conditions. The more powerful a test for a given
level of signiﬁcance the more sensitive it is in detecting true experimental effects.
Power is often measured over a range of possible effect sizes represented by true
alternative hypotheses for a predetermined level of signiﬁcance. In particular, the

power of the tests for the true effects represented by yo, and y“ are investigated

78
with or = 0.05. The power of the test is evaluated in a manner similar to the type I
error rate. The data are collected and tabulated in the format presented in Table 4-
4. The essential difference between type I error analysis and power analysis is in
what hypothesis is me. For the type I error analysis the no effect null hypothesis
is true and the correct decision is to not reject the null hypothesis. For a power
analysis the alternative hypothesis is true and the correct decision is to reject the
null hypothesis.

The power of McNemar’s test under the conditions of this study is an
important consideration. In general, statistical power is related to sample size.
Even with the number of experiments set at 500 for each data condition the
effective sample size is substantially less than this ﬁgure, because McNemar’s test
statistic only uses the entries in the cells b and c from Table 4.4. The sum of
these off diagonal elements represent the total number of divergent responses.

An approximation of the power function for McNemar’s test can be used to
estimate the power of this test. The following approximations to McNemar’s test
without the continuity correction is from Miettinen (1968). The approximate power

of McNemar’s test depends on a discrepancy parameter deﬁned as

5 = 5(1):) - E(Pz)- (4.32)

Also, the nuisance parameter Q is deﬁned as

C = E(p1) + E(p2). (4.33)

79
Miettinen (1968) suggests setting the nuisance parameter Q at the upper bound of

the following expression:

[5| S W 5- E(Pr) 7' E(p2) ’ 213031132). (434)

For example, suppose E(p,) = 0.97 and E(p2) = 0.93, then plausible values for 5 and
w from a type I error analysis between the Bayesian and empirical Bayes

approaches with the nominal a = 0.05 are

5 = 0.97 - 0.93 = 0.04, (4.35)
and

‘l’ = 0.97 + 0.93 - 2(0.97)(0.93) = 0.0958. (4.36)

An approximation to the unconditional power function is
(l - B) z (DH-pa,” + (nw)‘”l8l] / [\y’ - (0.25)8’(3 + w)]""}, (4.37)

where um is the 100(1 - tit/2) percentile of the standard normal distribution, n is the
total sample size, and (I) denotes the distribution function of a standard normal
variate. Accordingly, the power of the test to differentiate between type I error
rates of the two statistical procedures is approximately 0.84. This should provide

satisfactory power if the results are similar to the those projected.

80

Evaluating the Implementation of the Data Augmentation Algorithm

The amount of computer resources used by the data augmentation is an
important consideration for its application to substantive problems. The amount of
central processing unit time will be collected and a mean processing time will be
determined for this statistical procedure for each of the simulation conditions. This
will provide a per job estimate for solving similar problems on an IBM 3090-180.
The EM algorithm implemented in the HLM computer program (Bryk, et al., 1986)
has been used extensively in applied research and the amount of computer resources
it requires is well within practical limits set for most applied research. The central
processing unit time required by the HLM computer program is collected to provide
a relative measure of computational efﬁciency.

The EM algorithm is an iterative statistical procedure and it is of interest to
tabulate the number of iterations required for convergence. The number of
iterations required to obtain the convergence criterion for each experimental data
condition are tabulated to investigate the question of whether of not the parameter

values of the data sets effect the number of iterations necessary.

CHAPTER V
SIMULATION RESULTS

The results from the simulation are presented in this chapter in the same
order the questions were stated in the preceding chapter. Abbreviations are used in
some instances to indicate the statistical procedures. "DA" indicates the Bayesian
approach using the data augmentation algorithm and "EM" indicates the empirical

Bayes approach using the EM algorithm.

Convergence Criterion Results

A stopping rule was necessary to effectively implement the data
augmentation algorithm. This algorithm was developed with the maximum number
of augmented data sets ﬁxed at m = 2048. It required eleven iterations to reach
this number of augmented data sets. The tests for convergence were carried out on
one hundred replications of the ﬁrst experimental condition. The only element of y

b A A'—

set different than zero was 710. Th rim used to evaluate convergen05___

ﬂ. _ _ a“---~,_.

{ was to investigate the mean squared errors-Between the results of the data

I

__-, r” __ .. . . - .. ~-““" 7 “ "I
augmentation algorithm )Lanﬂthe population values of tin; parameters. The mean
9“" - «m—urw~---MW”“M""” r N)

\
)

squared error terms were calculated (at a selected number of iteratigpsjp/Lthe.
...,...7, m...“ 7

«Egyﬂﬁl‘ 8.--.Ofinl61f95}. These results are presented next in Table 5-1 and Table 5-2.

81

82

Table 5-1

MSE Between DA Posterior Means and Population Values for y

 

 

 

 

Parameters
Number of
Iterations 700 701 710 711
20 0.035461. 0.036.138 0.909957 , 0.007949
25 0.035394 0.036193 0.009592 0.007926
30 0.035258 0.036327 0.009579 0.007950
40 0.03531 1 0.036220 0.009620 0.007976
50 0.035852 0.036180 0.009539 0.008101
60 0.035460 0.036082 0.009559 0.008017
"Based on 100 replications.

Table 5-2

MSE Between [DA Posterior Meansand Population Values for 0" and .T

I

 

 

 

Parameters

Number of

Iterations G"2 1:00 to, 1:u

20 9011246“ 0.036207 0.003193 0.001 158
25 0.0.1 1269 ' 0.036035 0.003127 0.001 143
30 ”0.011 150 0.036734 0.003128 0.001 169
40 0.01 1219 0.036201 0.003135 0.001145
50 0.011141. 0.035999 0.003099 0.001 1 13
60 0.011 17.4 0.036785 0.003166 0.001 170

 

Based on 100 replications.

It was expected that the data augmentation algorithm would. exhibit
.. .-1 "’ ‘1' ~.--.....- r..- . -~ .. . .
- conversant; We; 21.99.1123. tires! point. {rennet-349.9... 3.9,. iter. adop- If

the algorithm did not converge for a given augmented data set size there would be

. 83
V ‘9" \r"
a trendjnxhe~mean~squared~error terms in the two tables” with these terms generally

1 _‘ ...- ..-

decreasing aspire iterations increased. Inspection omese table; prvidemsmno

ﬁnenemigeneralﬂecreannwganmnd-rememmmnntespvetthe

- 39113134. -.iESfinBES' These tables provide no indication that one value for the
number of iterations was superior to another. Since the data augmentation algorithm
is very computationally intensive, a smaller number of iterations is desirable. In
addition, the initial values for the variance-covariance components for the data
augmentation algorithm were the results from the ﬁrst iteration of the EM algorithm.
These initial estimates should provide good starting values. The number of

iterations was selected at 25. This was a slightly more conservative criterion than

the minimum number of iterations tested, which was 20.

T. p examine the inﬂuence of different initial yalues, the data augmentation
I

method was [up using good and then oor initi for the “ﬁnance-covariance
1’4
K

cgmponents. [339 trials were musing ..bgth. gokoﬂpnipmuniﬁalkvahres. The

W“. ”4‘ --
w-

number of iterations for the data augmentation algorithm was set at, 69. Table 5-3

presents the results of these two trials.

84
T99l9.,.§:3_...

DA Posterior, MQaQSResulting from Good and Poor Initial Values

 

 

 

 

 

Parameters

6’ too to. tn
Trial A
Good Starters (GS) ’ 1913A 0.34972 -0.10295 0.03916
Poor Starters (PS) V 0.5.0139. 0.69944 -0.20590 0.07832
DA Using GS L936}; 0.32456 -0.10815 0.04721
DA Using PS 1.0368; 0.35223 -0.ll799 0.05420
EM gpw 0.35021 -0.13314 0.06238
Trial B
Good Starters (GS) 1.03876 ' 9.3.2.9666 9.05190... 0.10172
Poor Starters (PS) 0.51938 0.41332» 0.10380 ' 0.20344
DA Using GS v v1.10300, " 0.15031 0.03918 0.07381
DA Using PS v x- LLQQQE v 0.16144 0.04336 0.08022
EM 1.03702 v 0.20970 0.04367 ,, 0.10950
Population Values 1.00000 0.35517 0.00000 0.03552

 

DA results based on 60 iterations.

”‘ Table 5-3 provides evidence that the data augmerTTation algorithm was rpbust with ,

 

 

‘-

W. The results from good and poor initial values

were very similar.

Comparing the Results of the Two Statistical Procedures

The three experimental conditions reﬂect changes in the interaction effect of
a student level characteristic and the experimental treatment applied at the class
level. The population value of 7,, increases from condition 1 to 3 while the
conditional variance of 1,, decreases. The remaining parameters did not vary across

experimental conditions. The results pertinent for evaluating how well these

85
procedures recover the parameters of interest across these three experimental
conditions are summarized in this section.

Ratios of the mean squared errors were used for inferences about the relative
accuracy of the two statistical procedures over the 500 experimental replications.
These ratios formed F test statistics to test for the equality of the mean squared
errors versus the alternative that they were not equal. These F statistics were
compared to the tabled P value with 500 degrees of freedom associated with the
numerator and the denominator parameters of this central F distribution.

Tables A-l, A-3 and A-5 in Appendix A display the means for 7 obtained
from each experimental condition and the population values for y are provided for
reference. The mean values for 7 were calculated from the 500 replications of each
experimental condition. The values calculated for y from the empirical Bayes
approach using the EM algorithm and the Bayesian approach using the data
augmentation algorithm were very similar. The ratio of mean squared errors for 7
obtained from the two procedures relative to the population parameter values are
presented in tables A-7, A-8, and A-9 in Appendix A. All of the F statistics
formed by these ratios were not signiﬁcant at the 0.05 level. Therefore, the
estimates of the 7 vector obtained by the two statistical procedures were not
signiﬁcantly different at the 0.05 level. To illustrate how close the Bayes and
empirical Bayes estimates were, Figure 5-1 displays a scatterplot of the estimates for
the experimental condition 3 ATI effect parameter 7“ for the two procedures. The

correlation between the estimates from the two procedures was 0.992.

86

Ov.o

 

mn.o

on.o

mN.o

.ON.O

noouum H94 n coduapco

aﬂuaEduna achﬂm HQOANOQED
no.0
-

md.0 OH.O
. p . t

r. I

To meswa

00.0

NO DOHQuouumom

 

roH.o)

.mo.oa

100.0

vmo.o

10H.o

rmH.o

JON.o

rmN.C

ton.o

1mm.o

 

nov.o

mnaon mnudsnuon

87

The mean values for the variance-covariance components calculated from the
500 replications of each experimental condition are presented in Appendix A in
tables A-2, A-4 and A-6. The population values are provided in these three tables
for reference. The mean squared errors of both statistical procedures relative to the
values of the population parameters were calculated for O'2 and T over the 500
replications and displayed for the three experimental conditions in tables A-10, A-11
and A-12 in Appendix A. The F tests comparing the Bayesian and empirical Bayes
point estimates indicated that there was a statistical difference (p < 0.01) between
the two methods in their ability to recover the t“ variance component for all three
experimental conditions. The evidence suggests that the Bayesian approach using
the data augmentation algorithm obtained posterior mean estimates of 1:11 that were
closer to the population value. In addition, the Bayesian approach obtained
posterior mean estimates under condition 3 of the covariance term To) that were

closer to the population value with a signiﬁcance probability less than 0.05.

Type I Error Rate Results

The actual type I error rates were compared to the nominal rates at three
levels of signiﬁcance, 0.01, 0.05 and 0.10. The Bayesian approach consn'ucts a test
from the upper and lower 01/2 percentile points of the posterior distribution of y.
The empirical Bayes approach provided standard z-tests and t-tests. There were
three parameters, you, 70, and 71,, that had population values of zero for
experimental condition 1. The parameters 700 and yo, have population values equal
to zero for experimental conditions 2 and 3. Thus, for the three experimental

conditions there were seven parameters to be tested at three nominal levels of

88
signiﬁcance. This resulted in a total of twenty-one tests for each hypothesis testing
procedure. To provide an index for ranking the relative performance, the number of
times a test procedure obtained a type I error rate that was not signiﬁcantly
different from the nominal rate with or = 0.05 was tabulated. Thus, an index value
of 21 for a given procedure indicates that all comparisons showed no signiﬁcant
statistical difference between the obtained and the nominal type I error rates. This
index was tabulated from tables A-13 through A-19 listed in Appendix A. In Table
5-4 the performance of the hypothesis testing methods are ranked according to this

index.

Table 5-4

Performance Ranking of Hypothesis Testing Procedures

 

Procedure Rank Index

 

EM t-test 1 19
EM z-test 2 6
DA 3 4

 

According to the index rankings presented in Table 5-4, the empirical Bayes
approach using a t-test to test the 7 parameters that had a population value of zero
provides evidence that this testing method produces an observed type I error rate
that was much closer to the nominal rate than the other two testing procedures.
The Bayes procedure and the empirical Bayes procedure using the z-test appear to
be similar in terms of there index values listed in Table 5-4. Both of these tests

were liberal and they tended to reject the null hypothesis more often than expected.

89

The type I error rates may also be compared with McNemar’s test. A chi-
square statistic is produced by McNemar’s test that can be compared to the central
chi-square value with one degree of freedom and or = 0.05. The critical value for
this test is 3.84.

The results from twenty-one McNemar’s tests are presented in tables A-20
through A-28. The Bayesian approach was statistically different from the empirical
Bayes approach using the t-test with p < 0.05 in 18 out of the 21 tests. The
Bayesian approach was different from the empirical Bayes approach using the z-test
in 6 out of the 21 tests. These six tests that indicated a statistical difference were

for tests of type I error rates with the nominal error rate set at 0.10.

Power Analysis Results

There are two elements of 7 that had true effects in this simulation study.
The ﬁrst element, 7,0, represented the intercept or base of the le’s. This effect was
detected with or = 0.05 by the three testing procedures for all replications of the
three experimental conditions. As expected, this effect was detected with 100%
power by all three testing procedures.

The other parameter of interest in this power analysis was 7“. The observed
power of each test for y“ with a = 0.05 was calculated for the two experimental
conditions that had true effects. The power of the empirical Bayes approach using
both the z-test and the t-test were compared to the power of the Bayesian approach.
The correlation between Bjl and Wj was set at 0.30 for experimental condition 2 and
this produced a relatively small value for y". This correlation was increased to

0.60 for experimental condition 3 and as a result, the power to detect 7,, was

90

expected to be greater. In both cases, the power of the test was expected to be
relatively weak, since the proportion of correct decisions was expected to be small.
The following table displays the observed power of the statistical tests for

experimental conditions 2 and 3.

Table 5-5

Power of the Test of y“ for Experimental Conditions 2 and 3

 

 

 

Condition
Method 2 3
DA 0.124 0.298
EM z-test 0.120 0.298
EM t-test 0.066 0.202

 

Power based on 500 replications.

Table 5-5 illustrates the weak power these tests have for the small sample
conditions of this study. This weak power was expected. In both experimental
conditions, the Bayes approach using the data augmentation algorithm and the
empirical Bayes approach using the EM algorithm and the z-test appeared to be
slightly more powerful in detecting a true effect when compared to the empirical
Bayes approach using the t-test. As noted earlier, these two more powerful testing
procedures had observed type I error rates that were generally greater than the
nominal rate. They appear more powerful because they were liberal tests.

From McNemar’s tests presented in tables A-29 and A-30 in Appendix A,

the Bayesian approach and the empirical Bayes approach using the t-test were

91
statistically different with signiﬁcance probability p < 0.01 for both experimental
conditions. When the Bayesian approach was compared to the empirical Bayes
approach using the z-test, there was no signiﬁcant difference between these
procedures at ct = 0.05 for both experimental conditions.

To help illustrate the relationship between the Bayesian and empirical Bayes
t-test hypothesis testing procedures, the 95% HPD intervals about the ATI effect
parameter 7,, for experimental condition 3 are presented in Figure 5-2. The
correlation between the intervals produced by these two methods was 0.923, which

was based on 500 replications of this condition.

92

mh.o

 

Oh.o

nIuU now: u I aao>uoucH nahom docuhomﬁu
mw.o ow.o mn.o
— -

on.o mv.o o¢.o mn.o
- n p — > h

uOOuuH HF< n COquUCOU DOM ndm>HODCH Om: Omm H0 UOHQhOuuaUm

mlm. Sana

ram.o
tow.o
raw-o

 

moaon HCUON>DHD

93
Computational Efﬁciency Results
The amount of computer resources required by the data augmentation
algorithm is an important consideration for the application of this method to
substantive problems in education. The following table presents the central
processing unit time on an IBM 3090-180 mainframe computer utilized by the data

augmentation algorithm and the EM algorithm.

Table 5-6

Central Processing Unit Time

 

CPU Seconds

 

 

 

Experimental

Conditions Replications DA EM

1 400 9136 78

2 500 18676 95

3 500 14180 123
Totals 1400 41992 296
Average 29.99 0.21

 

The EM algorithm implemented in the HLM computer program (Bryk et al.,
1986) has been available to researchers for three years and it has been applied to
many hierarchical data sets. It can be seen clearly from Table 5-6 that the
processing time required by the EM algorithm is well within the practical limits for
everyday use by researchers, whereas the data augmentation algorithm requires
substantial computer resources and may be rather expensive to run repeatedly.

One point of interest was whether variations in the parameters affected the number

of iterations that the EM algorithm requires to converge. In particular, does

94
decreasing the conditional value of T increase the number of iterations necessary to
attain convergence? The average number of iterations required for the EM
algorithm to converge for the three experimental conditions are presented in the

following table.

Table 5-7

The Average Number of Iterations of the EM Algorithm

 

 

Experimental Average
Condition Iterations
1 41.346

2 40.790

3 53.678

 

500 replications per condition.

There appeared to be a increase in the number of iterations required for
convergence for experimental condition 3. This may have been due to the fact that
the conditional variance of 1,, was the smallest in experimental condition 3 and it is
closest to the boundary point of a variance component’s parameter space. This may
have caused some instability estimating this variance parameter and as a

consequence, the EM algorithm required more iterations to converge.

CHAPTER VI
DISCUSSION

The results presented in the previous chapter are discussed and summarized
in this chapter. The following section discusses some of the advantages and
disadvantages of the Bayesian and empirical Bayes implementations. And ﬁnally,

some unanswered questions and suggestions for future research are presented.

Advantages and Disadvantages

The fundamental question of this simulation study was whether or not the
Bayesian approach implemented with the data augmentation algorithm would recover
the parameters of the hierarchical model more accurately than the empirical Bayes
approach using the EM algorithm. The performance of these two statistical
methodologies was primarily evaluated in terms of their mean squared errors relative
to the population parameter values. In all experimental conditions investigated there
was no statistical difference between point estimates produced by the Bayesian
procedure using the data augmentation algorithm and the empirical Bayes procedure
using the EM algorithm in terms of recovering y and 0". The Bayesian approach
did produce posterior mean estimates for some elements of T that were substantially
closer to their population values than the empirical Bayes approach. The T mauix

was estimated from a small sample and the Bayesian approach was superior in

95

recovering this variance-covariance mauix. Therefore, we may conclude that the
Bayes approach implemented with the data augmentation algorithm recovered the
parameters of the model under investigation as well or better than the empirical
Bayes approach using the EM algorithm.

Another advantage of the Bayesian approach implemented with the data
augmentation algorithm is that it produces a ﬁnite approximation to the posterior
distribution. The information supplied by the complete posterior distribution
provides the researcher with a number of options. A variety of measures from a
posterior distribution may be easily determined, for example, the mean, median,
mode and percentile points. In addition, it is possible to graph an entire posterior
distribution of a parameter.

The empirical Bayes procedure using a t-test was clearly superior for
hypothesis tests of the elements of 7 that had population values of zero. The
observed type I error rates from this procedure were very close to the nominal rates.
The Bayesian procedure as implemented in this study and the empirical Bayes
approach using the z-test were liberal and tended to reject the null hypothesis more
often then the expected nominal rate. The Bayesian procedure using the data
augmentation algorithm may produce a type I error rate closer to the nominal rate if
the maximum number of augmented data sets (m) was increased. The tail sections
of a posterior distribution might be approximated more accurately with a larger
number of augmented data sets.

The power to detect y,,, the interaction effect between the student level
independent measure and the between-class treatment, was weak for both

experimental conditions that had a true effect associated with this parameter. Since

97
there were only ten classes, the power of this test was expected to be weak. The
observed power for the Bayes and empirical Bayes procedure using the z-test were
approximately equivalent and they were more powerful than the empirical Bayes
procedure using a t-test. The two procedures that appeared more powerful achieved
this power at the expense of controlling type 1 errors. Their type I error rates were
higher than the expected nominal rate.

The primary disadvantage of the data augmentation algorithm was the large
amount of computer resources it requires. Each replication took an average of
about 30 seconds to analyze on an IBM 3090-180 mainframe computer. This
computer is a relatively fast general purpose mainframe computer. The empirical
Bayes approach using the EM algorithm was much more efﬁcient in terms of its
computer resource requirements. The HLM computer program (Bryk et al., 1986)
required on the average only 0.21 seconds to converge.

One unresolved problem with the data augmentation algorithm is that there is
no general strategy for determining convergence. This makes it less appealing to
researchers working on substantive problems in education. Also, the implementation
of the data augmentation algorithm created for this study was very speciﬁc in
nature. It was designed for the one model studied in this simulation and it is not

generally available for use by researchers in education.

Suggestions for Future Research
This simulation study attempted to model conditions that might be obtained
in an education research project focused on student by class level treatment

interaction effects. Consequently, the parameters for the experimental data sets were

98
selected to reﬂect typical data that might be observed. The class size was selected
to model the class size distribution of elementary schools. This distribution of class
size was not that variable and as a result the overall sampling design was only
moderately unbalanced. It is plausible that as the sampling distribution of class size
becomes more variable, the results between the two statistical procedures will
increasingly diverge. The disuibution of class size was not a variable in this study.
Investigation of the inﬂuence different class size distributions have may provide
insight into situations that capitalize on the strengths of the Bayesian method.
Since the data augmentation algorithm provides a new tool for a true Bayesian
approach to data analysis, additional experience in a wide variety of situations is
important to fully evaluate its usefulness. It would be extremely informative to
know under what general conditions the data augmentation algorithm is clearly
superior and under what conditions other statistical procedures are adequate. This is
a concern given the large amount of computer resources required by the data
augmentation algorithm.

There are a number of different ways that the data augmentation algorithm
can be implemented. The number of augmented data sets and how they are
increased is a variable of the algorithm that may effect the rate of convergence.
The number of augmented data sets (m) was doubled until it reached a maximum of
2048. The number of data sets and the rate at which they are increased could be
varied and investigated to determine an optimal strategy for controlling the
augmentation of data sets. In addition, if the maximum number of augmented data
sets is increased, will the observed type I error rate become closer to the nominal

error rate? This algorithm may require a large number of augmented data sets to '

99
accurately reﬂect the tail areas of a posterior distribution.

Developing a general stopping rule for the data augmentation algorithm is an
important and challenging issue for future investigation. Since this was a simulation
study and the population parameters were known, a stopping rule was developed
empirically. This approach is not much help in applied situations where the
parameter values are unknown and the conditions of the data analysis project are
not similar to the conditions investigated in this simulation study. A stopping
criterion must consider convergence approaching some ﬁnite stochastic limit within
some tolerance and at the same time investigate trends across iterations. / This is a
much more difﬁcult task than traditional stopping rules that examine some function
of differences/between iterations; for a value lesslthan some predetermined criterion.

The cost for computer resources required by the data augmentation algorithm
will decrease in the future because computers are getting faster and more
inexpensive. In addition, the speciﬁc implementation of the data augmentation
procedure could be optimized to run faster by rewriting sections of the computer
code. Since this was the ﬁrst implementation of this method for the hierarchical
model under investigation, many parts of the code were developed to clearly and
accurately reflect the algorithm, and optimization was a secondary concern. Future
simulation studies investigating properties of the data augmentation algorithm could
be implemented on a parallel processing computer. Simulation studies like this one
are suitable for parallel processing computers because every replication is
independent and may be processed separately.

For the educational researcher considering methods to analyze experimental

data that are expected to be similar to the data investigated in this simulation study,

100
the empirical Bayes procedure implemented with the EM algorithm and using a t-
test for the 7 effects is recommended at this time. This method is recommended
because there are computer programs available that can analyze a large variety of
models, the cost for the required computer resources is substantially less, and it
appears to perform well for hypotheses tests. Possibly, for other analysis situations
the Bayesian procedure implemented with the data augmentation algorithm will
prove to be clearly superior. It would be interesting to evaluate the performance of
the Bayesian approach using the data augmentation algorithm in a situation with
larger variability in group (class) size. In the past, the Bayesian approach has been
held back because there were no practical means for its implementation. Now with
the data augmentation algorithm as a practical means for implementing this approach
researchers can begin to acquire some experience. And with this experience,
researchers will become more aware of the advantages and disadvantages of these

two statistical approaches.

APPENDICES

APPENDIX A

TABLES CONTAINING SIMULATION RESULTS

Table A-1

Posterior Estimates for y from Experimental Condition 1

 

 

 

 

 

Parameters

MethOd 700 701 710 711

DA Mean 0.00992 0.00183 0.75767 0.00401
EM Mean 0.00984 0.00201 0.75713 0.00389
DA Sd of Mean 0.19627 0.20402 0.08926 0.08639
EM Sd of Mean 0.19609 0.20366 0.08974 0.08670
Corr between DA and EM 0.99948 0.99951 0.99474 0.99332
Population Values 0.00000 0.00000 0.76320 0.00000
Estimates based on 500 replications.

Table A-2

Posterior Estimates for 0" and T from Experimental Condition 1

 

 

 

 

Parameters
Method (5‘2 too to, 1,,
DA 1.02767 0.34374 0.00060 0.03171
EM 0.99654 0.36984 0.00000 0.04172
Population Values 1.00000 0.35517 0.00000 0.03552

 

Estimates based on 500 replications.

101

102

Table A-3

Posterior Estimates for y from Experimental Condition 2

 

 

 

 

 

Parameters

Method 700 701 710 711

DA Mean -0.00160 -0.00653 0.75381 0.06044
EM Mean -0.00142 ~0.00646 0.75422 0.06007
DA Sd of Mean 0.19503 0.21104 0.08737 0.08861
EM Sd of Mean 0.19482 0.21058 0.08695 0.08847
Corr between DA and EM 0.99948 0.99952 0.99309 0.99124
Population Values 0.00000 0.00000 0.76320 0.05654
Estimates based on 500 replications.

Table A-4

Posterior Estimates for 0" and T from Experimental Condition 2

 

 

 

 

Parameters
Method 0’ 1:00 To, 1,,
DA 1.02972 0.32566 -0.00271 0.02828
EM 0.99875 0.35238 -0.00248 0.03731
Population Values 1.00000 0.35517 0.00000 0.03232

 

Estimates based on 500 replications.

Posterior Estimates for y from Experimental Condition 3

103

Table A-5

 

 

 

 

 

Parameters

MethOd You 701 710 711

DA Mean -0.00969 0.00138 0.75950 0.11602
EM Mean -0.00945 0.00111 0.75990 0.11616
DA Sd of Mean 0.20464 0.19215 0.08609 0.08335
EM Sd of Mean 0.20452 0.19232 0.08463 0.08256
Corr between DA and EM 0.99960 0.99957 0.99116 0.99199
Population Values 0.00000 0.00000 0.76320 0.11308
Estimates based on 500 replications.

Table A-6

Posterior Estimates for 0" and T from Experimental Condition 3

 

 

 

 

Parameters
Method 0" 1:0,, to, 11,
DA 1.01760 0.34005 0.00064 0.02353
EM 0.98989 0.36513 0.00041 0.02971
Population Values 1.00000 0.35517 0.00000 0.02273

 

Estimates based on 500 replications.

104

Table A-7

MSE’s Relative to Population Values for y from Condition 1

 

 

 

 

Parameters
Method You 7m 710 In
DA 0.03 854299 0.04154461 0.00798253 0.00746498
EM 0.03847160 0.04139983 0.00807444 0.00751721
Ratio DA/EM 1.0019 1.0035 0.9886 0.9931

 

Note. * p < 0.05; ** p < 0.01 (two-tailed test signiﬁcance levels). Based
on 500 replications.

Table A-8

MSE’s Relative to Population Values for y from Condition 2

 

 

 

 

Parameters
Method yo, 70, y,,, 7,,
DA 0.03796498 0.0444921 1 0.00770650 0.007 85097
EM 0.03788249 0.044297 31 0.007 62598 0.0078237 8
Ratio DA/EM 1.0022 1.0044 1.0106 1.0035

 

Note. * p < 0.05; ** p < 0.01 (two—tailed test signiﬁcance levels). Based
on 500 replications.

105

Table A-9

MSE’s Relative to Population Values for y from Condition 3

 

 

 

 

Parameters
Method You 701 710 In
DA 0.04188586 0.036851 1 1 0.00740986 0.00694154
EM 0.04183597 0.03691413 0.00715952 0.00681267
Ratio DA/EM 1.0012 0.9983 1.0350 1.0189

 

Note. * p < 0.05; ** p < 0.01 (two-tailed test signiﬁcance levels). Based
on 500 replications.

Table A-10

MSE’s Relative to Population Values for 0" and T from Condition 1

 

 

 

 

Parameters
Method 62 to, to, t,,
DA 0.01048149 0.04103996 0.00361215 0.00077421
EM 0.00929093 0.04051690 0.00424074 0.00147059
Ratio DA/EM 1.1281 1.0129 0.8518 0.5265 **

 

Note. * p < 0.05; ** p < 0.01 (two-tailed test signiﬁcance levels). Based
on 500 replications.

106

Table A-ll

MSE’s Relative to Population Values for o“ and T from Condition 2

 

 

 

 

Parameters
Method 0" too to, 1,,
DA 0.01034900 0.03917907 0.00304788 0.00061088
EM 0.008 85991 0.03749094 0.00350264 0.00129564
Ratio DA/EM 1.1691 1.0450 0.8702 0.4715 **

 

Note. * p < 0.05; ** p < 0.01 (two-tailed test signiﬁcance levels). Based
on 500 replications.

Table A-12

MSE’s Relative to Population Values for (52 and T from Condition 3

 

 

 

 

Parameters
Method 02 Too “rm 1 11
DA 0.00977 809 0.0395 87 72 0.00257984 0.00046431
EM 0.00881106 0.03869871 0.00314165 0.00106294
Ratio DA/EM 1.1098 1.0230 0.8212 * 0.4368 **

 

Note. * p < 0.05; ** p < 0.01 (two-tailed test signiﬁcance levels). Based
on 500 replications.

107

Table A- 13

Type I Error Rates for 70,, from Experimental Condition 1

 

 

 

 

 

or = 0.01 or = 0.05 or = 0.10
Method Rate 2 Rate 2 Rate 2
DA 0.018 1.798 0.096 4.720 ** 0.152 3.876 **
EM z-test 0.020 2.247 * 0.086 3.694 ** 0.134 2.534 *
EM t-test 0.006 -0.899 0.040 -1.026 0.096 -0.298

 

Note. * p < 0.05; ** p < 0.01 (two-tailed test signiﬁcance levels). Based
on 500 replications.

Table A- 14

Type I Error Rates for yo, from Experimental Condition 1

 

 

 

 

 

or = 0.01 or = 0.05 or = 0.10
Method Rate z Rate 2 Rate 2
DA 0.032 4.944 ** 0.082 3.283 ** 0.140 2.981 **
EM z-test 0.028 4.045 ** 0.080 3.078 ** 0.122 1.640
EM t-test 0.014 0.899 0.046 -0.410 0.090 -0.745

 

Note. * p < 0.05; ** p < 0.01 (two-tailed test signiﬁcance levels). Based
on 500 replications.

108

Table A- 15

Type I Error Rates for 7,, from Experimental Condition 1

 

 

 

 

 

or = 0.01 or = 0.05 or = 0.10
Method Rate 2 Rate 2 Rate 2
DA 0.008 -0.450 0.046 -0.410 0.098 -0.149
EM z-test 0.012 0.450 0.054 0.410 0.104 0.298
EM t-test 0.004 -1.348 0.022 -2.873 ** 0.072 -2.087 *

 

Note. * p < 0.05; ** p < 0.01 (two-tailed test signiﬁcance levels). Based
on 500 replications.

Table A- 16

Type I Error Rates for 7,0 from Experimental Condition 2

 

 

 

 

 

or = 0.01 or = 0.05 or = 0.10
Method Rate 2 Rate 2 Rate 2
DA 0.034 5.394 ** 0.088 3.899 ** 0.138 2.832 **
EM z-test 0.032 4.944 ** 0.076 2.668 ** 0.120 1.491
EM t-test 0.006 -0.899 0.048 -0.205 0.100 0.000

 

Note. * p < 0.05; ** p < 0.01 (two-tailed test signiﬁcance levels). Based
on 500 replications.

109

Table A- 17

Type I Error Rates for 70, from Experimental Condition 2

 

 

 

 

 

a = 0.01 or = 0.05 or = 0.10
Method Rate 2 Rate z Rate z
DA 0.036 5.843 ** 0.112 6.361 ** 0.180 5.963 **
EM z-test 0.038 6.293 ** 0.110 6.156 ** 0.162 4.621 **
EM t-test 0.014 0.899 0.060 1.026 0.124 1.789

 

Note. * p < 0.05; ** p < 0.01 (two-tailed test signiﬁcance levels). Based
on 500 replications.

Table A- 18

Type I Error Rates for 70,, from Experimental Condition 3

 

 

 

 

 

or = 0.01 or = 0.05 or = 0.10
Method Rate 2 Rate 2 Rate 2
DA 0.028 4.045 ** 0.086 3.694 ** 0.144 3.280 **
EM z-test 0.034 5.394 ** 0.086 3.694 ** 0.128 2.087 *
EM t-test 0.018 1.798 0.036 -1.436 0.098 -0.149

 

Note. * p < 0.05; ** p < 0.01 (two-tailed test signiﬁcance levels). Based
on 500 replications.

110

Table A-l9

Type I Error Rates for 70, from Experimental Condition 3

 

 

 

 

 

or = 0.01 or = 0.05 or = 0.10
Method Rate 2 Rate 2 Rate 2
DA 0.022 2.697 ** 0.084 3.488 ** 0.140 2.981 **
EM z-test 0.026 3.596 ** 0.080 3.078 ** 0.118 1.342
EM t-test 0.004 -1.348 0.038 -1.231 0.088 -0.894

 

Note. * p < 0.05; ** p < 0.01 (two-tailed test signiﬁcance levels). Based
on 500 replications.

Table A-20

McNemar’s Statistics for Condition 1 Type I Errors with Nominal 0t = 0.01

 

 

 

Parameters
DA vs.
MethOd 700 To: 711
EM z-test 0.000 0.500 0.500
EM t-test 4.167 * 7.111 ** 0.500

 

Note. * p < 0.05; ** p < 0.01 for one degree of
freedom test. Based on 500 replications.

111

Table A-21

McNemar’s Statistics for Condition 1 Type I Errors with Nominal or = 0.05

 

 

 

Parameters
DA vs.
Method You 701 711
EM z-test 3.200 0.000 0.643
EM t-test 26.036 ** 16.056 ** 8.643 **

 

Note. * p < 0.05; ** p < 0.01 for one degree of
freedom test. Based on 500 replications.

Table A-22

McNemar’s Statistics for Condition 1 Type I Errors with Nominal or = 0.10

 

 

 

Parameters
DA vs.
Method Yon Yo. 7“
EM z-test 5.818 * 7.111 ** 0.308
EM t-test 26.056 ** 23.040 ** 8.471

 

Note. * p < 0.05; ** p < 0.01 for one degree of
freedom test. Based on 500 replications.

112

Table A-23

McNemar’s Statistics for Condition 2 Type I Errors with Nominal or = 0.01

 

 

 

Parameters
DA vs.
MethOd 700 701
EM z-test 0.000 0.000
EM t-test 12.071 ** 9.091 **

 

Note. * p < 0.05; ** p < 0.01 for one
degree of freedom test. Based on 500
replications.

Table A-24

McNemar’s Statistics for Condition 2 Type I Errors with Nominal or = 0.05

 

 

 

Parameters
DA vs.
Method You Yo)
EM z-test 3.125 0.000
EM t-test 18.050 ** 24.039 **

 

Note. * p < 0.05; ** p < 0.01 for one
degree of freedom test. Based on 500
replications.

113

Table A-25

McNemar’s Statistics for Condition 2 Type I Errors with Nominal or = 0.10

 

 

 

Parameters
DA vs.
MCUlOd Yoo Yor
EM z-test 7.111 ** 5.818 *
EM t-test 17.053 ** 26.036 **

 

Note. * p < 0.05; ** p < 0.01 for one
degree of freedom test. Based on 500
replications.

Table A-26

McNemar’s Statistics for Condition 3 Type I Errors with Nominal or = 0.01

 

 

 

Parameters
DA vs.
Method You Yo:
EM z-test 1.333 0.167
EM t-test 3.200 7.111 **

 

Note. * p < 0.05; ** p < 0.01 for one
degree of freedom test. Based on 500
replications.

114

Table A-27

McNemar’s Statistics for Condition 3 Type I Errors with Nominal or = 0.05

 

 

 

Parameters
DA vs.
Method 700 701
EM z-test 0.250 0.500
EM t-test 23.040 ** 21.044 **

 

Note. * p < 0.05; ** p < 0.01 for one
degree of freedom test. Based on 500
replications.

Table A-28

McNemar’s Statistics for Condition 3 Type I Errors with Nominal or = 0.10

 

 

 

Parameters
DA vs.
MOII'IOd Yoo YO!
EM z-test 6.125 * 7.692 **
EM t-test 21.044 ** 24.039 **

 

Note. * p < 0.05; ** p < 0.01 for one
degree of freedom test. Based on 500
replications.

115

Table A-29

Experimental Condition 2 Power Comparisons with DA for 7,,

 

 

Method )8
EM z-test 0.071
EM t-test 25.290 **

 

Note. ** p < 0.01 for one
degree of freedom test. Based
on 500 replications.

Table A-30

Experimental Condition 3 Power Comparisons with DA for 7,,

 

 

Method )6
EM z-test 0.033
EM t-test 46.021 **

 

Note. ** p < 0.01 for one
degree of freedom test. Based
on 500 replications.

APPENDIX B
SOURCE CODE FOR COMPUTER PROGRAMS

Data Augmentation Algorithm FORTRAN Source Code

PROGRAM DAUG

***********************************************************‘k************

* This program calculates the posterior distribution of the
* between-group regression coefficients for a two level
* hierarchical model using the data augmentation algorithm.

PARAMETER (MXAUG=2048, Mxrr=25, MXK=10)

INTEGER ISEED, IT, MPT, NEXP, NGEN, NK, NT, NTOT

LOGICAL DONE

REAL B(2,MXK,MXAUG), BLS(2,MXK), DWWIN(4)

REAL G(4,MXAUG), GAMMA(4), ID2(2,2), POW, 91w, SIG(MXAUG), SIGMAZ
REAL SXY(2,MXK), srthxx), TAU(3), TIN(2,2,MXAUG), th3), WC(2,4)
REAL WE(2,4), wrthx), XX(2,2,MXK), XXIN(2,2,MXK)

COMMON /BCOM/ B /GCOM/ G /SIGCOM/ SIG /TINCOM/ TIN

OPEN(10, FILEI’SSM’, FORM='UNFORMATTED')
OPEN(20, FILEt’INIT')
OPEN(25, FILE=’VARS')
OPEN(30, FILE=’OUT1’)
OPEN(40, FILE=’LAST')
OPEN(50, FILE=’SEED’)
OPEN(60, FILE=’KNOW’)

* Prepare IMSL random number generators.
CALL RNOPT(3)
ISEED = 123457
CALL RNSET(ISEED)
NTOT - 0
NR = 10

READ(20,201) NEXP, POW, PlW
CALL DESIGN(DWWIN, IDZ, NK, WC, WE)

DO 20 N = 1, NEXP
CALL READER(BLS, NK, NT, SXY, 5Y2, TZ, WI, XX, XXIN)
CALL KNOWN(BLS, GAMMA, POW, PlW, SIGMAZ, TAU, WI, XXIN)

IT = 1

MPT = 1

DONE = .FALSE.
* Begin DO-WHILE loop.
111 CONTINUE

116

888

20

201
501

117

IF (DONE) GOTO 888

CALL GSTAR(DWWIN, IT, MPT, NK, WC, WE, WI)
IF (2 * MPT .LE. MXAUG) THEN

NGEN = 2

CALL SIMG(IT, MPT, NGEN, NK)

MPT = 2 * MPT
ELSE

NGEN = 1

CALL SIMG(IT, MPT, NGEN, NK)
END IF
CALL TSTAR(IT, MPT, NGEN, NK, WC, WE, WI)
CALL SIMT(IT, MPT, NK, TZ)
CALL BETA(BLS, IDZ, MPT, NGEN, NK, WC, WE, WI, XX)
CALL SSTARIIT, MPT, NK, NT, SXY, 3Y2, XX)
CALL SIMS(IT, MPT, NT)

Update loop control.
IF (IT .EQ. MXIT) THEN
DONE = .TRUE.
ELSE
IT = IT + 1
END IF

GOTO 111
CONTINUE
End DO-WHILE loop.

Sum the total number of experimental units for all experiments.
NTOT = NTOT + NT
CONTINUE

CALL NTAVE(NEXP, NTOT)
CALL RNGET(ISEED)
WRITE(50,501) ISEED

FORMAT(I5,2FS.2)
FORMATlIlO)

END

118

SUBROUTINE DESIGN(DWWIN, ID2, NK, WC, WE)

*1:***********************************************************************

”k
*
*

10

Construct IDZ and the diagonal matrix W’W inverse. Also,
construct the two between-group design matrices. WC indicates
the control and WE indicates the experimental conditions.

INTEGER NK
REAL DWWINM), ID2(2,2), WC(2,4), WE(2,4)

DO 10 I = 1
DWWIN(I)
CONTINUE

ll‘

4
1.0 / NK

ID2(1,1)
IDZ(2,1)
ID2(1,2)
ID2(2,2)

ﬂ
F1C>C>H

OOOO

WC(1,1)
WC(2,1)
WC(1,2)
WC(2,2)
WC(1,3)
WC(Z, 3)
WC(1,4)
WC(2,4)

lllllllllllllil
l
H<D+4c>c>kac>w
o<3c>o<3<3c>o

WE(1,1)
WE(2,1)
WE(1,2)
WE(2,2)
WE(1,3)
WE(2,3)
WE(1,4)
WE(2,4)

HOHOOHOH
OOOOOOOO

RETURN
END

119

SUBROUTINE READER(BLS, NK, NT, SXY, 8Y2, TZ, WI, XX, XXIN)

*‘k‘k*********************************************************************

*

‘k
*
k

10

20

30

40

Read sufficient sum of squares statistics from logical unit 10
and the initial estimates for Beta, Sigma_2, and Tau from logical
unit 20. Calculate Y’Y and Inv(Tau). Construct for each group
X’Y, Inv(X’X) and XX.

PARAMETER (MXAUG=2048, MXK=10)

INTEGER NJ(MXK), NK, NT

REAL B(2,MXK,MXAUG), BLS(2,MXK), DET, FS, FT(2,2), SIG(MXAUG)
REAL SX(MXK), SXY(2,MXK), SX2(MXK), SY(MXK), SY2(MXK), T(2,2)
REAL TIN(2,2,MXAUG), TZ(3), WI(MXK), XX(2,2,MXK), XXIN(2,2,MXK)
DOUBLE PRECISION DSX, DSXY, DSXZ, DSY, DSYZ, DWI

CHARACTER *12 ID

COMMON /BCOM/ B /SIGCOM/ SIG /TINCOM/ TIN

SSM data file must be converted from DOUBLE to SINGLE PRECISION.

DO 10 K = l, NK

READ(20,201) (BLS(I,K), I = l, 2)
CONTINUE
READ(20,202) SIG(1)
DO 20 I ' 1, 2

READ(20,203) (T(I,J), J = l, 2)
CONTINUE
TZ(Z) a T(2,l)

Reading final EM estimates for Sigma_2 and Tau.

READ(20,204) FS

DO 30 I a 1, 2
READ(20,203) (FT(I,J), J

CONTINUE

READ(20,205)

1, 2)

Calculate TIN = Inv(Tau).

DET = T(1,1) * T(2,2) - T(2,1) * T(1,2)
TIN(1,1,1) = T(2,2) / DET

TIN(2,1,1) - -1.o * T(2,1) / DET
TIN(1,2,1) = TIN(2,1,1)

TIN(2,2,1) a T(1,1) / DET

DO 40 M 8 1, 3
READ(10)
CONTINUE
NT = 0.0
DO 50 K 8 1, NR
READ (10) NJ(K)
NT = NT + NJ(K)
READ(10) DSY, DSX
SY(K) I SNGL(DSY) * NJ(K)
SX(K) = SNGL(DSX) * NJ(K)
SXY(1,K) = SY(K)
READ(10) DSYZ, DSXY, DSXZ
DSYZ = DSYZ + NJ(K) * DSY * DSY
DSXY = DSXY + NJ(K) * DSY * DSX
DSXZ 8 DSXZ + NJ(K) * DSX * DSX

120

SY2(K) - SNGL(DSYZ)
SXY(2,K) = SNGL(DSXY)
SX2(K) = SNGL(DSXZ)
READ(10) ID, DWI
WI(K) = SNGL(DWI)

* Construct (X’X) for each group.
XX(1,1,K) = NJ(K)
XX(2,1,K) = SX(K)
XX(1,2,K) = SX(K)
XX(2,2,K) = SX2(K)

* Calculate Inv(X’X)
DET = XX(1,1,K) * XX(2,2,K) - XX(2,1,K) * XX(1,2,K)
XXIN(1,1,K) XX(2,2,K) / DET
XXIN(2,1,K) -1.0 * XX(2,1,K) / DET
XXIN(1,2,K) XXIN(2,1,K)
XXIN(2,2,K) XX(1,1,K) / DET

* Initialize B in BCOM common block.
DO 60 I = l, 2
B(I,K,1) = BLS(I,K)
60 CONTINUE
50 CONTINUE

201 FORMAT(30X,2F12.5)

202 FORMAT(5(/),16X,E13.6/)
203 FORMAT(8X,2F12.5)

204 FORMAT(/16X,E13.6/)

205 FORMAT(7(/))

RETURN
END

121

SUBROUTINE KNOWN(BLS, GAMMA, POW, P1W, SIGMAZ, TAU, WI, XXIN)

'k**************************************‘k‘k*******************************

*
*

10

30

SO
40

Calculate Gamma and D(Gamma) from the knowledge of the
complete data.

PARAMETER(MXK=10)
REAL BLS(2,MXK),
REAL GAMMAM),
REAL WI(MXK),

D(2,2), DINB(2):
POW, PlW, SIGMAZ,
XXIN(2,2,MXK)

DET, DIN(2,2), DIS(4,4)
SWDINB(4), T(2,2), TAU(3)

READ(25,201) SIGMAZ
READ(25,202) (TAU(J),
T(1,1) = TAU(1)
T(2,1) = TAU(2)
T(1,2) = TAU(2)
T(2,2) = TAU(3)
DO 10 J - 1, 4
SWDINB(J) = 0.0
DO 10 I = 1
DIS(I,J)
CONTINUE

J = l, 3)

, 4
= 0.0

DO 20 K = 1, MXK
Calculate DIN = Inv(D)
DO 30 J = 1, 2

DO 30 I = l, 2

= Inv(T + V) for each group.

D(I,J) - T(I,J) + XXIN(I,J,K) * SIGMAZ
CONTINUE
DET a D(1,1) * D(2,2) - D(1,2) * D(2,1)
DIN(1,1) = D(2,2) / DET
DIN(2,1) - -1.0 * D(2,1) / DET
DIN(1,2) = DIN(2,1)
DIN(2,2) = D(l,l) / DET
Calculate Sum(DIS) - Sum(W'Inv(D)W) across groups.
DIS(1,1) - DIS(1,1) + DIN(1,1)
DIS(2,1) = DIS(2,1) + WI(K) * DIN(I,1)
DIS(3,1) = DIS(3,1) + DIN(2,1)
DIS(4,1) a DIS(4,1) + WI(K) * DIN(2,1)
DIS(1,2) - DIS(1,2) + WI(K) * DIN(1,1)
DIS(2,2) - DIS(2,2) + DIN(1,1)
DIS(3,2) - DIS(3,2) + WI(K) * DIN(2,1)
DIS(4,2) a DIS(4,2) + DIN(2,1)
DIS(1,3) - DIS(1,3) + DIN(1,2)
DIS(2,3) = DIS(2,3) + WI(K) * DIN(1,2)
DIS(3,3) - DIS(3,3) + DIN(2,2)
DIS(4,3) = DIS(4,3) + WI(K) * DIN(2,2)
DIS(1,4) - DIS(1,4) + WI(K) * DIN(1,2)
DIS(2,4) = DIS(2,4) + DIN(1,2)
DIS(3,4) a DIS(3,4) + WI(K) * DIN(2,2)
DIS(4,4) = DIS(4,4) + DIN(2,2)

Calculate SWDINB = Sum(W’Inv(D)BLS)
DO 40 I = l, 2

DINB(I) = 0.0
DO 50 J = 1, 2

DINB(I) = DINB(I) + DIN(I,J) * BLS(J,K)
CONTINUE

CONTINUE

20

70
60

80

201
202
601

122

SWDINB(l) = SWDINB(l) + DINB(1)

SWDINB(Z) = SWDINB(Z) + WI(K) * DINB(1)

SWDINB(3) - SWDINB(3) + DINB(2)

SWDINB(4) '3 SWDINB(4) + WI(K) * DINB(2)
CONTINUE

Calculate DIS = Inv(Sum(W’Inv(D)W))
CALL LINRG(4, DIS, 4, DIS, 4)
DO 60 I = 1, 4
GAMMA(I) = 0.0
DO 70 J = 1, 4
GAMMA(I) == GAMMA(I) + DIS(I,J) * SWDINB(J)
CONTINUE
CONTINUE

WRITE(60,601) (GAMMA(I), I = l, 4)
DO 80 I = 1, 4

WRITE(60,601) (DIS(I,J), J = 1, 4)
CONTINUE

FORMAT (F11 . 5)
FORMAT (3 (F11 . 5))
FORMAT (45'11 . 5)

RETURN
END

123

SUBROUTINE GSTAR(DWWIN, IT, MPT, NK, WC, WE, WI)

******************************************************************‘k‘k*‘kir'k

*
*
*

20

50
40

70
60

30

80
10

301

Calculating MPT vectors of Gamma_Star, where MPT is the pointer
that indicates the number of Beta vectors in the current
augmented data set.

PARAMETER (MXAUG=2048, MXK=10)

INTEGER IT, MPT, NK, UN

REAL B(2,MXK,MXAUG), DWWIN(4), GK4,MXAUG)
REAL SWB(4), WC(2,4), WE(2,4), WI (MXK)
COMMON /BCOM/ B /GCOM/ G

DO 10 N = 1, MPT
DO 20 I a 1, 4
SWB(I) = 0.0
CONTINUE
DO 30 K = 1, NK
IF (WI(K) .GT. 0.0) THEN
DO 40 J = 1, 4
DO 50 I = l, 2
SWB(J) = SWB(J) + WE(I,J) * B(I,K,N)
CONTINUE
CONTINUE
ELSE
DO 60 J = 1, 4
DO 70 I = 1, 2
SWB(J) 8 SWB(J) + WC(I,J) * B(I,K,N)
CONTINUE
CONTINUE
END IF
CONTINUE

Calculate the Nth Gamma vector and store it in the GCOM block.
DO 80 I = 1, 4
G(I,N) - DWWIN(I) * SWB(I)
CONTINUE
CONTINUE

FORMAT(4 (F11. 5) )

RETURN
END

124

SUBROUTINE SIMG(IT, MPT, NGEN, NK)

*‘k‘k‘k*************************‘k‘k'k‘k‘k‘k‘k*********‘k**************************

*
*
‘k

20
10

30

9O

Generate random Gamma vectors. If the current number of Gamma
vectors is less than MXAUG then the number of Gamma vectors will
be increased by 2.

PARAMETER (MXAUG=2048, MXK=10)

INTEGER GPT, IRANK, IT, LOOPS, MPT, NGEN, NK, NR, RPT, UN

REAL DET, G(4,MXAUG), GBAR(4), GVAR(4), GT(4,MXAUG), R(4*MXAUG)
REAL 3(3), SIN(3), TIN(2,2,MXAUG), U(3)

COMMON /GCOM/ G /TINCOM/ TIN

Get the required i.i.d. standard normals.
NR = 4 * NGEN * MPT

CALL RNNOR(NR, R)

GPT = 1

RPT = 1

DO 10 N = 1, MPT
Calculate Sum(WTINW) = S for the three values other than zero.
8(1) = NK * TIN(1,1,N)
3(2) = NK * TIN(1,2,N)
5(3) = NK * TIN(2,2,N)

Calculate Inverse(S) = SIN

DET = 8(1) * 8(3) - 5(2) * 8(2)
SIN(l) = 8(3) / DET

SIN(2) - -1.0 * 8(2) / DET
SIN(3) = 8(1) / DET

Calculate the Cholesky factor.
U(l) = SQRT(SIN(1))

U(2) = SIN(2) / U(l)

U(3) = SQRT(SIN(3) - U(2) * U(2))

Generate Gamma from Gamma_Star, the mean of the distribution.
DO 20 I - 1, NGEN

GT(1,GPT) = G(I,N) + U(l) * R(RPT)
GT(2,GPT) = G(2,N) + U(l) * R(RPT+1)
GT(3,GPT) = G(3,N) + U(2) * R(RPT) + U(3) * R(RPT+2)
GT(4,GPT) a G(4,N) + U(2) * R(RPT+1) + U(3) * R(RPT+3)
RPT - RPT + 4
GPT = GPT + 1
CONTINUE
CONTINUE

LAST = NGEN * MPT
DO 30 I = 1, 4
DO 30 N = 1, LAST
G(I,N) 3 GT(I,N)
CONTINUE

Calculate the mean and dispersion of Gamma.
IF (IT .EQ. 25) THEN
UN = 30
DO 90 I = 1, 4
GBAR(I) = 0.
GVAR(I) = 0.
CONTINUE

O
O

125

D0 100 I - 1, 4
DO 100 N a 1, MXAUG
GBAR(I) - GBAR(I) + G(I,N)
GVAR(I) = GVAR(I) + G(I,N) * G(I,N)
100 CONTINUE
DO 110 I - 1, 4
GBAR(I) = GEAR(I) / REAL(MXAUG)
GVAR(I) = GVAR(I) / REAL(MXAUG) - GBAR(I) * GBAR(I)
110 CONTINUE
WRITE(UN,301) (GBAR(I), I = 1, 4)
WRITE(UN,301) (GVAR(I), I = 1, 4)
CALL TAILS(IT, UN)
END IF

301 FORMAT(4(F11.5))

RETURN
END

126

SUBROUTINE TSTAR(IT, MPT, NGEN, NK, WC, WE, WI)
**‘k*******************************‘k*************************************
* TSTR = Sum(Tau_Star) / NK for each gamma vector in the augmented
* data set. Store these in the TINCOM block.

PARAMETER (MXAUG=2048, MXK=10)

INTEGER IT, MPT, NGEN, NK, PPT, UN

REAL B(2,MXK,MXAUG), G(4,MXAUG), ST(Z), TBAR(3), TIN(2,2,MXAUG)
REAL TSTR(2,2), WC(2,4), WCG(2), WE(2,4), WEG(2), WI(MXK)
COMMON /BCOM/ s /GCOM/ G /TINCOM/ TIN

PPT - 1
DO 10 N = 1, MPT
DO 20 J = l, 2
DO 30 I = l, 2
TSTR(I,J) = 0.0
30 CONTINUE
20 CONTINUE
DO 40 I = 1, 2
WCG(I) = 0.0
WEG(I) = 0.0
DO 50 J = 1, 4
WCG(I) 8 WCG(I) + WC(I,J) * G(J,N)
WEG(I) 8 WEG(I) + WE(I,J) * G(J,N)

50 CONTINUE
4O CONTINUE
* Calculate the vector sum of (B - WG) over groups.

DO 60 K = 1, NK
IF(WI(K) .LT. 0.0) THEN
DO 70 I = 1, 2

ST(I) a B(I,K,PPT) - WCG(I)
70 CONTINUE
ELSE
DO 80 I = 1, 2
ST(I) = B(I,K,PPT) - WEG(I)
80 CONTINUE
END IF
* Let TSTR - Sum(B - WG)(B - WG)’ over groups.

DO 90 J = 1, 2
DO 100 I = l, 2
TSTR(I,J) = TSTR(I,J) + ST(I) * ST(J)

100 CONTINUE
90 CONTINUE
60 CONTINUE
* Store TSTR / NK in TIN

DO 110 J = 1, 2
DO 110 I = 1, 2
TIN(I,J,N) = TSTR(I,J) / 10.0
110 CONTINUE
IF (NGEN .EQ. 2 .AND. MOD(N,2) .EQ. 0) THEN
PPT = PPT + 1
ELSE IF (NGEN .EQ. 1) THEN
PPT = PPT + 1
END IF
10 CONTINUE
* Calculate point estimate of TSTAR and write it out.

120

130

140

301

127

IF (IT .EQ. 25) THEN
UN = 30
DO 120 I 3 1, 3
TBAR(I) = 0.0
CONTINUE
DO 130 N = 1, MXAUG
TBAR(I) = TBAR(l) + TIN(1,1,N)
TBAR(Z) = TBAR(Z) + TIN(2,1,N)
TBAR(3) = TBAR(3) + TIN(2,2,N)
CONTINUE
DO 140 I 3 1, 3
TBAR(I) = TBAR(I) / REAL(MXAUG)
CONTINUE
WRITE(UN,301) (TBAR(J), J = 1, 3)
END IF

FORMAT (3 (F11 . 5))

RETURN
END

128

SUBROUTINE SIMT(IT, MPT, NK, TZ)

********************************************************‘k***************

*
*
*

111

222

The Tau_Star matrices currently in TINCOM are factored and
are used to generate new Tau inverse matrices. These new
matrices are stored in the TINCOM.

PARAMETER (MXAUG32048)

INTEGER IT, MPT, NK, UN

LOGICAL AGAIN

REAL 8(3), CHIA(MXAUG), CHIB(MXAUG), CHIl, CHI2, DET, DFl, DF2
REAL L(3), LB(2,2), LBL(3), R(MXAUG), RX, TBAR(3)

REAL TIN(2,2,MXAUG), TZ(3)

COMMON /TINCOM/ TIN

N 3 1

DFl 3 REAL(NK)

DF2 3 REAL(NK)

CALL RNCHI(MPT, DFl, CHIA)
CALL RNCHI(MPT, DF2, CHIB)
CALL RNNOR(MPT,R)

Begin DO WHILE loop

CONTINUE

IF (N .GT. MPT) GOTO 999

Begin REPEAT UNTIL loop

continue

Calculate L, the lower Cholesky factor, note TIN = Tau_Star.
L(1) = SQRT(TIN(1,1,N))
L(2) 3 TIN(2,1,N) / L(1)
L(3) 3 SQRT(TIN(2,2,N) - L(2) * L(2))
Set up B = AA’, for A a Bartlett decomposition.
B(1) 3 CHIA(N)
8(2) 3 R(N) * SQRT(CHIA(N))
8(3) 3 CHIB(N) + R(N) * R(N)
Calculate a new Tau - LBL / NK

L8(1,1) = L(1) * 8(1)
LB(1,2) a L(1) * 8(2)
L8(2,1) - L(2) * 8(1) + L(3) * 8(2)
LB(2,2) = L(2) * 8(2) + L(3) * 8(3)

L8L(1) = L8(1,1) * L(1) / 10.0
LBL(2) = (L8(1,1) * L(2) + LB(1,2) * L(3)) / 10.0
LBL(3) = (LB(2,1) * L(2) + LB(2,2) * L(3)) / 10.0
Calculate TIN = Inverse(Tau)
DET = LBL(l) * LBL(3) - LBL(2) * LBL(2)
IF (DET .LT. 0.00001) THEN
CALL RNCHI(1, DFl, CHIl)
CALL RNCHI(1, DF2, CHIZ)
CALL RNNOR(1,RX)
CHIA(N) = CHIl
CHIB(N) = CHIZ
R(N) - Rx
TIN(1,1,N) - TZ(l)
TIN(2,1,N) = TZ(2)
TIN(2,2,N) = TZ(3)
AGAIN - .TRUE.
ELSE
AGAIN = .FALSE.
END IF

999

10

20

30

301

IF (AGAIN)
End REPEAT
TIN(1,1,N)
TIN(2,1,N)
TIN(1,2,N)
TIN(2,2,N)
N 3 N + 1

GOTO 111

CONTINUE

End DO WHILE

Calculate po
IF (IT .EQ.
UN = 30

DO 10 I 3
TBAR(I)
CONTINUE
DO 20 N 3

129

GOTO 222

UNTIL loop

LBL(3) / DET

-1.0 * LBL(2) / DET
TIN(2,1,N)

-LBL(1) / DET

loop

int estimate of T and write it out.
25) THEN

1, 3
3 0.0

1, MXAUG

DET 3 TIN(1,1,N) * TIN(2,2,N) - TIN(1,2,N) * TIN(1,2,N)

TBAR(l)

TBAR(2)

TBAR(3)
CONTINUE
DO 30 I 3

TBAR(I)
CONTINUE
WRITE(UN,3

END IF

FORMAT(3(F11
RETURN
END

3 TBAR(l) + (TIN(2,2,N) / DET)
3 TBAR(2) + (-1.0 * TIN(1,2,N) / DET)
3 TBAR(3) + (TIN(1,1,N) / DET)

1, 3
3 TBAR(I) / REAL(MXAUG)

01) (TBAR(J), J 3 1, 3)

.5))

130

SUBROUTINE BETA(BLS, ID2, MPT, NGEN, NK, WC, WE, WI, XX)

*'k**********************************************************************

*
*

30

40

50

60

Calculate Beta_Star and then simulate the posterior distribution
of Beta.

PARAMETER (MXAUG32048, MXK=10)

INTEGER MPT, NGEN, NK, NR, Ns, RPT

REAL B(2,MXK,MXAUG), BLS(2,MXK), BSTR(2), D(2,2), DET

REAL DIN(2,2), G(4,MXAUG), ID2(2,2), ILAM(2,2), ILAMW(2,4)
REAL ILAMWG(2), LAM(2,2), LF(3), R(20), SIG(MXAUG)

REAL TIN(2,2,MXAUG), WC(2,4), WE(2,4), WI(MXK), XX(2,2,MXK)
COMMON /8COM/ 8 /GCOM/ G /SIGCOM/ SIG /TINCOM/ TIN

NR = NK * 2

NS 3 1

DO 10 N 3 1, MPT
Get a set of random numbers.
CALL RNNOR(NR, R)
RPT 3 1

DO 20 K 3 1, NK
Calculate DIN = Inv(Inv(V) + Inv(T)) for a group.
DO 30 I 3 l, 2
DO 30 J 3 1, 2
D(I,J) 3 TIN(I,J,N) + XX(I,J,K) / SIG(NS)
CONTINUE
DET 3 D(1,1) * D(2,2) - D(1,2) * D(2,1)
DIN(1,1) 3 D(2,2) / DET
DIN(1,2) 3 -D(1,2) / DET
DIN(2,1) 3 DIN(1,2)
DIN(2,2) = D(1,1) / DET

Factor DIN 3 LF * LF' for generating a new Beta.
LF(l) 3 SQRT(DIN(1,1))

LF(2) 3 DIN(1,2) / LF(l)

LF(3) 3 SQRT(DIN(2,2) - LF(2) * LF(2))

Calculate LAM = DIN * Inv(V)
DO 40 J'- 1, 2
DO 40 I = 1, 2
LAM(I,J) = 0.0
DO 40 L = 1, 2
LAM(I,J) = LAM(I,J) + DIN(I,L) * XX(L,J,K) / SIG(NS)
CONTINUE

Calculate first part of Beta_Star = BSTR = LAM * BLS
DO 50 I 3 1, 2

BSTR(I) 3 0.0
DO 50 J 3 1, 2
BSTR(I) 3 BSTR(I) + LAM(I,J) * BLS(J,K)
CONTINUE

DO 60 J 3 1, 2
DO 60 I 3 1, 2
ILAM(I,J)3 ID2(I,J) - LAM(I,J)
CONTINUE

Calculate the second part of Beta_Star = (I - LAM) * W * G.

70

80

90

100

20

10

131

IF (WI(K) .GT. 0.0) THEN
DO 70 J 3 1, 4
DO 70 I 3 l, 2
ILAMW(I,J) 3 0
DO 70 L 3 1, 2
ILA.MW(I,J) 3
CONTINUE
ELSE
DO 80 J 3 1, 4
DO 80 I 3 1, 2
ILAMW(I,J) 3
DO 80 L 3 1,
ILAMW(I,J)
CONTINUE
END IF

.0

ILAMW(I,J) + ILAM(I,L) * WE(L,J)

.0

NO

ILAMW(I,J) + ILAM(I,L) * WC(L,J)

DO 90 I 3 1, 2
ILAMWG(I) 3 0.0
DO 90 J 3 l, 4
ILAMWG(I) 3 ILAMWG(I) + ILAMW(I,J) * G(J,N)
CONTINUE

Complete Beta_Star calculations.
DO 100 I 3 1, 2

BSTR(I) 3 BSTR(I) + ILAMWG(I)
CONTINUE

Determine pointer value and generate new B’s.
B(1,K,N) 3 BSTR(I) + LF(l) * R(RPT)
B(2,K,N) 3 BSTR(2) + LF(2) * R(RPT) + LF(3) * R(RPT+1)
RPT 3 RPT + 2
CONTINUE

Update pointer for Sigma_Z.
IF (NGEN .EQ. 2 .AND. MOD(N,2) .EQ. 0) THEN
NS 3 NS + 1
ELSE IF (NGEN .EQ. 1) THEN
NS 3 N
END IF
CONTINUE

RETURN
END

132

SUBROUTINE SSTAR(IT, MPT, NK, NT, SXY, 3Y2, XX)

*‘k'k**‘k‘k'k****************************************************************

* Calculate the sum of squares for Sigma_2_Star for each data set.

PARAMETER (MXAUG32048, MXK310)

INTEGER IT, MPT, NK, NT, UN

REAL B(2,MXK,MXAUG), BXXB, SBAR, SCP, SIG(MXAUG), ss, SXY(2,MXK)
REAL SY2(MXK), XX(2,2,MXK), xxs(2)

COMMON /8COM/ 8 /SIGCOM/ SIG

DO 10 N 3 1, MPT
SS 3 0.0
DO 20 K 3 1, NK
DO 30 I 3 1, 2
XXB(I) 3 0.0
DO 40 J 3 1, 2
XXB(I) 3 XXB(I) + XX(I,J,K) * B(J,K,N)

40 CONTINUE
30 CONTINUE
BXXB 3 0.0
SCP = 0.0
DO 50 I 3 1, 2

BXXB 3 BXXB + B(I,K,N) * XXB(I)
SCP 3 SCP + B(I,K,N) * SXY(I,K)

50 CONTINUE
SS 3 SS + SY2(K) - 2.0 * SCP + BXXB
20 CONTINUE
* SIG contains the sum of squares and has not been averaged.

SIG(N) 3 SS
10 CONTINUE

IF (IT .EQ. 25) THEN
UN 3 30
SBAR 3 0.0
DO 60 N 3 1, MXAUG

SBAR 3 SBAR + SIG(N) / REAL(NT)
60 CONTINUE

SBAR 3 SBAR / REAL(MXAUG)
WRITE(UN,301) SBAR

END IF

301 FORMAT(F11.5)

RETURN
END

133

SUBROUTINE SIMS(IT, MPT, NT)
**********************************************************************‘k*
* Simulate Sigma_Z. The vector SIG contains sum of squares for the
* Sigma_2_Star's and these values are replaced with Sigma_Z’s.

PARAMETER (MXAUG32048)

INTEGER IT, MPT, NT, UN

REAL CHI(MXAUG), DF, SBAR, SIG(MXAUG)
COMMON /SIGCOM/ SIG

DF 3 REAL(NT)
CALL RNCHI(MPT, DF, CHI)
DO 10 N 3 1, MPT
SIG(N) 3 SIG(N) / CHI(N)
10 CONTINUE

IF (IT .EQ. 25) THEN
UN 3 30
SBAR 3 0.0
DO 20 N 3 l, MXAUG

SBAR 3 SBAR + SIG(N)
20 CONTINUE

SBAR 3 SBAR / REAL(MXAUG)
WRITE(UN,301) SBAR

END IF

301 FORMAT(F11.5)

RETURN
END

134

SUBROUTINE TAILS(IT, UN)

************************************************‘k‘k‘k*********************

*
'k
k

20

10

301

This subroutine calculates the median, upper and lower percentile
points of the posterior distribution of gamma for (alpha / 2) =
0.005, 0.025, 0.050. TAILS is called by GSTAR.

PARAMETER (MXAUG32048)

INTEGER IN, IT, MID, UN

REAL G(4,MXAUG), GVEC(MXAUG), P(7)
COMMON /GCOM/ G

Calculate the percentiles of the posterior distribution of Gamma.
MID 3 MXAUG / 2
DO 10 I a 1, 4

DO 20 N 3 1, MXAUG

GVEC(N) 3 G(I,N)

CONTINUE

CALL SVRGN(MXAUG, GVEC, GVEC)

P(l) (GVEC(MID) + GVEC(MID+1)) / 2.0

P(Z) 3 GVEC(10) * 0.76 + GVEC(ll) * 0.24

P(3) 3 GVEC(2037) * 0.24 + GVEC(2038) * 0.76
P(4) 3 GVEC(51) * 0.80 + GVEC(52) * 0.20
P(5) 3 GVEC(1996) * 0.20 + GVEC(1997) * 0.80
P(6) 3 GVEC(102) * 0.60 + GVEC(103) * 0.40

P(7) 3 GVEC(1945) * 0.40 + GVEC(1946) * 0.60
WRITE(UN,301) (P(J), J 3 1, 7)
CONTINUE

FORMAT (7 (F11.5))

RETURN
END

135

SUBROUTINE NTAVE(NEXP, NTOT)

***********************‘k************************************************

* Calculate the average NT per experiment.

INTEGER NEXP, NTOT
REAL AVNT

AVNT 3 REAL(NTOT) / REAL(NEXP)
WRITE(40,401) AVNT

401 FORMAT(’ Average N per experiment 3 ’,F8.3)

RETURN
END

136

Hierarchical Data Generation FORTRAN Source Code Listing

PROGRAM HLMDAT

***'k*******************************************************************

*X’l’ﬂ'l‘l’ﬂ’l’

10

101

HLMDAT generates replications of an experiment. The following
parameters must be set before each run.

NK - the number of Classes

NEXP - the number of experiments

POW - the correlation between BOj and Wj

PIW - the correlation between B1j and Wj

RNOPT(5) - initialization for IMSL number generator
RNSET(ISEED) - set the seed after the first run

PARAMETER (MAX322, NEXP3500)

INTEGER ISEED, NJ(MAX), NK

DOUBLE PRECISION G(4), POW, P1W, SX(MAX), SXY(MAX), SX2(MAX)
DOUBLE PRECISION SY(MAX), SY2(MAX),TOO, TOOW, T11W, W(MAX)

OPEN(10, FILE3’SEED’)
OPEN(ZO, FILE3’VARS')
OPEN(30, FILE3’DATA’, FORM3'UNFORMATTED’)

Seed set for data set 3 with POW - 0.0 and le = 0.6
CALL RNOPT(5)

ISEED = 1461009557

CALL RNSET(ISEED)

NK 3 10

POW 3 0.0
PlW 3 0.6
CALL TGSET(G, POW, P1W, T00, TOOW, T11W)
DO 10 K 3 1, NEXP
CALL CLSIZE (NJ, NK)
CALL GEN(G, NJ, NK, SX, SXY, SX2, SY, 8Y2, TOOW, T11W, W)
CALL RITE(NEXP, NJ, NK, SX, SXY, SX2, SY, 8Y2, W)
CONTINUE

CALL RNGET(ISEED)
WRITE(10,101) ISEED

FORMAT(I10)

END

137

SUBROUTINE TGSET(G, POW, P1W, TOO, TOOW, T11W)

*******‘k***************************************************************

1%

Set the between subject parameters Tau and Gamma.

PARAMETER (C3O.10D+0, D3O.18D+0, PXY=O.60D+0)
DOUBLE PRECISION G(4), POW, P1W, SIGZY, Sl
DOUBLE PRECISION T00, TOOW, T11, T11W

Calculate the unconditional and conditional variance terms
T00 - D / ((1.0 - D) * (1.0 - PXY**2) - C * D)

T11 3 C * T00

TOOW - T00 * (1.0 - POW**2)

T11W 3 T11 * (1.0 - P1W**2)

The pooled unconditional variance of Y
81 - C * T00 + 1.0
SIGZY 3 51 + (PXY**2 / (1.0 - PXY**2)) * (C * T00 + 1.0)

Calculate the Gamma vector G 3 {G00, G01, G10, Gll}
G(l) 0.0

6(2) 3 POW * SQRT(TOO)

G(3) 3 SQRT((PXY**2 / (1 3 PXY**2)) * (C * T00 + 1.0))
G(4) 3 P1W * SQRT(Tll)

RETURN

END

138

SUBROUTINE CLSIZE (NJ, NK)

*‘k*‘k‘k***************‘k**************************************************

* Randomly draw a vector of class sizes from a distribution.

PARAMETER (MAX 3 22)
INTEGER NJ(MAX), NK, CPT
DOUBLE PRECISION R(MAX)

CALL DRNUN(NK,
DO 10 I 3 1, NK

R)

10

CPT 3 NINT(1000 * R(I))

IF (CPT
NJ(I)
ELSE IF
NJ(I)
ELSE IF
NJ(I)
ELSE IF
NJ(I)
ELSE IF
NJ(I)
ELSE IF
NJ(I)
ELSE IF
NJ(I)
ELSE IF
NJ(I)
ELSE IF
NJ(I)
ELSE IF
NJ(I)
ELSE IF
NJ(I)
ELSE IF
NJ(I)
ELSE IF
NJ(I)
ELSE IF
NJ(I)
ELSE
NJ(I)
END IF

.GT.
3 30
(CPT
3 29
(CPT
3 28
(CPT
3 27
(CPT
3 26
(CPT
3 25
(CPT
3 24
(CPT
3 23
(CPT
3 22
(CPT
3 21
(CPT
3 20
(CPT
3 l9
(CPT
3 18
(CPT
3 17

3 16

995

.GT.

.GT.

.GT.

.GT.

.GT.

.GT.

.GT.

.GT.

.GT.

.GT.

.GT.

.GT.

.GT.

.AND.

965

905

795

670

535

405

280

185

115

065

035

015

005

CPT

.AND.

.AND.

.LE.

CPT

CPT

CPT

CPT

CPT

CPT

CPT

CPT

CPT

CPT

CPT

CPT

CPT

1000)
.LE.
.LE.
.LE.
.LE.
.LE.
.LE.
.LE.
.LE.
.LE.
.LE.
.LE.
.LE.

.LE.

THEN
995)
965)
905)
795)
670)
535)
405)
280)
185)
115)
065)
035)

015)

THEN

THEN

THEN

THEN

THEN

THEN

THEN

THEN

THEN

THEN

THEN

THEN

THEN

CONTINUE

RETURN
END

139

SUBROUTINE GEN(G, NJ, NK, SX, SXY, SX2, SY, 5Y2, TOOW, T11W, W)
***********************************************************************

* Generate data for one replication of an experiment.

PARAMETER (MAX322, MXR361, MXU344)

INTEGER NJ(MAX), NK, NR, NT

DOUBLE PRECISION BO(MAX), Bl(MAX), CHI(2), DFl, DF2, G(4)

DOUBLE PRECISION R(MXR), SIG, SR(MAX), SR2(MAX), SX(MAX)

DOUBLE PRECISION SXR(MAX), SXSR(MAX), SXY(MAX), SX2(MAX), SY(MAX)
DOUBLE PRECISION SY2(MAX), TAU(3),TCHI, TMPl, TMP2, TMP3, TOOW
DOUBLE PRECISION T11W, U(MXU), U0, U1, W(MAX)

NT 3 0
SIG 3 0.0
DO 10 I 3 1, 3
TAU(I) 3 0.0
10 CONTINUE

DO 20 J = 1, NK

NR = 2 * NJ(J)
CALL DRNNOR(NR+1, R)
NT = NT + NJ(J)
SX(J) 0.0
SR(J) 0.0
SXR(J) = 0.0
DO 30 I = 1, NR, 2

SX(J) = SX(J) + R(I)

SR(J) - SR(J) + R(I+1)

30 CONTINUE
SXSR(J) = SX(J) * SR(J)
DFl - NJ(J) - 1
CALL DRNCHI(2, DFl, CHI)
SX2(J) = CHI(l) + SX(J)**2 / NJ(J)
SR2(J) = CHI(2) + SR(J)**2 / NJ(J)
SIG = SIG + SR2(J)
DF2 = NJ(J) - 2
CALL DRNCHI(1, DF2, TCHI)
T = R(NR + 1) / SQRT(TCHI / DF2)
P = T / SQRT(T**2 + DF2)
SXR(J) = P * SQRT(CHI(1)) * SQRT(CHI(2)) + SXSR(J) / NJ(J)
20 CONTINUE

CALL DRNNOR(NK*2, U)
DO 40 K = 1, NR
IF (K .LE. NK/2) THEN

W(K) 3 1.0
ELSE

W(K) 3 -1.0
END IF
U0 3 SQRT(TOOW) * U(K)
U1 3 SQRT(T11W) * U(K+NK)
TAU(l) 3 TAU(l) + U0 * U0
TAU(2) 3 TAU(2) + U0 * U1

TAU(3) 3 TAU(3) + 01 * Ul

BO(K) 3 G(1) + G(2) * W(K) + U0
Bl(K) 3 G(3) + G(4) * W(K) + Ul
SY(K) 3 NJ(K) * BO(K) + Bl(K) * SX(K) + SR(K)

TMPl 3 NJ(K) * BO(K)**2 + B1(K)**2 * SX2(K) + SR2(K)

40

50

201
202

TMP2 3

SY2(K)
SXY(K)
CONTINUE

SIG 3 SIG / NT
DO 50 I 3 1, 3

TAU(I) 3 TAU(I) / NK

CONTINUE

WRITE(20,201) SIG

WRITE(20,202)

FORMAT(F11.5)
FORMAT(3F11.5)

RETURN
END

(TAU(I),

I 3 1,

140

2.0 * BO(K) * Bl(K) * SX(K)
TMP3 3 2.0 * Bl(K) * SXR(K)
3 TMP1 + TMP2 + TMP3
3 BO(K)

3)

+ 2.0 * BO(K)

* SX(K) + Bl(K) * SX2(K) + SXR(K)

* SR(K)

141

SUBROUTINE RITE(NEXP, NJ, NK, SX, SXY, SX2, SY, 3Y2, W)

**‘k********************************************************************

*

10

One replication of an experiment is written to a file.

PARAMETER (MAX=22)

CHARACTER *12 ID, LABEL1(0:2), LA8EL2(0:1)

CHARACTER *44 STRING

INTEGER M1, M2, N8R5, NEXP, NJ(MAX), NK, NVAR, QMAX, T, VNUM
DOUBLE PRECISION SX(MAX), SXY(MAX), SX2(MAX), SY(MAX)

DOUBLE PRECISION 322(MAX), W(MAX), 8AR(2)

DATA STRING / '01020304050607080910111213141516171819202122’/

NVAR 3 2

NBRS 3 0

QMAX 3 1

VNUM 3 O

LABEL1(0) 3 ’ BASE'
LABEL1(1) 3 'DEPY'
LABEL1 (2) 3 ' INDX’
LABEL2(0) 3 ' BASE’
LABEL2(1) 3 ’INDW’

WRITE(30) NVAR, N8R5, QMAX, NK, VNUM
WRITE(30) (LABEL1(I), I - o, NVAR)
WRITE(30) (LABEL2(1), I = 0, QMAX)
M1 = 1
M2 = 2
DO 10 I31, NK
T = NJ(I)
WRITE(30) T
8AR(1) 3 32(1) / T
8AR(2) = SX(I) / T
322(1) = 522(1) - T * BAR(1) * BAR(1)
sx2(I) = sx2(I) - T * 8AR(1) * 8AR(2)
SX2(I) = SX2(I) - T * 8AR(2) * 8AR(2)
WRITE(30) (BAR(K), K = 1, 2)
WRITE(30) 322(1), sx2(I), sx2(I)
ID - STRING(M1:M2)
WRITE(BO) ID, W(I)
M1 = M1 + 2
M2 = M2 + 2
CONTINUE

RETURN
END

REFERENCES

Anderson, T.W. (1984). An introduction to multivariate statistical analysis.
New York: John Wiley and Sons.

Barcikowski, RS. (1981). Statistical power with group mean as the unit of
analysis. Journal of Educational Statistics, 6, 267-285.

Barr, R., and Dreeden, R. (1983). How schools work. Chicago: University of
Chicago Press.

Bartlett, MS. (1933). On the theory of statistical regression. Proc. Roy. Soc.
Edinburgh, 53, 260-283.

Bassiri, D. (1988). Large and small sample properties of maximum likelihood
estimates for the hierarchical linear model. Unpublished doctoral
dissertation, Michigan State University.

Bates, D.M., and Watts, DO. (1981). A relative offset orthoganality convergence
criterion for nonlinear least squares. Technometrics, 23, 2, 179-184.

Bayes, TR. (1763). An essay towards solving a problem in the doctrine of
chances. The Philosophical Transactions of the Royal Society, 53, 370-418
(reprinted in Biometrika (1958), 45, 293-315).

Bidwell, C. and Kasarda, J. (1980). Conceptualizing and measuring the effects of
school and schooling. American Journal of Education, 88, 401-430.

Blair, RC, and Higgins, 11. (1986). Comment on statistical power with group
mean as the unit of analysis. Journal of Educational Statistics, 11, 161-169.

Blair, R.C., Higgins, J.J., Topping, M.E.H., and Mortimer, AL. (1983). An
investigation of the robustness of the t test to unit of analysis violations.
Educational and Psychological Measurement, 43, 69-80.

Box, G.E.P., and Tiao, CC. (1973). Bayesian inference in statistical analysis.
Reading, Massachusetts: Addison-Wesley Publishing Company.

142

143

Braun, H.I., Jones, D.H., Robin, DB, and Thayer, D.T. (1983). Empirical Bayes
estimation in the general linear model with data of deﬁcient rank.
Psychometrika, 48, 2, 171-181.

Brophy, J .E., and Good, T.L. (1986). Teacher behavior and student achievement.
In M.C. Wittrock (Ed.), Handbook of research on teaching. New York:
Macmillan.

Bryk, A.S., and Raudenbush, S.W. (1987). Application of hierarchical linear models
to assessing change. Psychological Bulletin, 101, I, 147-158.

Bryk, A.S., Raudenbush, S.W., Seltzer, M, and Congdon, RT. (1986). An
introduction to HLM: Computer program and users’ guide. University of
Chicago, Department of Education.

Burden, R.L., Faires, J.D., and Reynolds, A.C. (1981). Numerical analysis. Boston:
Prindle, Weber and Schmidt.

Burstein, L. (1980). The analysis of multilevel data in educational research and
evaluation. In DC. Berliner (Ed.), Review of research in education, vol. 8.
Washington, DC: American Educational Research Association.

Campbell, D.T., and Stanley, J.C. (1963). Experimental and quasi-experimental
designs for research. Chicago: Rand McNally College Publishing Company.

Chambers, J.M. (1977). Computational methods for data analysis. New York:
John Wiley & Sons.

Cooley, W.W., Bond, L., and Mao, B. (1981). Analyzing multilevel data. In Berk,
R.A. (Ed.), Educational evaluation methodology. Baltimore: Johns Hopkins
University Press.

Cowles, M. and Davis, C. (1982). On the origins of the .05 level of statistical
signiﬁcance. American Psychologist, 37, 553-558.

Cronbach, LJ. (1957). The two disciplines of scientiﬁc psychology. The American
Psychologist, 12, 671-684.

Cronbach, L.J., and Snow, RE. (1977). Aptitudes and instructional methods. New
York: Irvington.

Cronbach, L.J., and Webb, N. (1975). Between and within-class effects in a
reported aptitude-by-treatment interaction: Reanalysis of a study by G.L.
Anderson. Journal of Educational Psychology, 6, 717-724.

DeLeeuw, J., and Kreft, I. (1986). Random coefﬁcient models for multilevel
analysis. Journal of Educational Statistics, 11, 57-85.

144

Dempster, A.P., Laird, N.M., and Rubin, DB. (1977). Maximum likelihood from
incomplete data via the EM algorithm (with discussion). Journal of the
Royal Statistical Society, Series B, 39, 1-38.

Dempster, A.P., Rubin, DB, and Tsutakawa, R.K. (1981). Estimation in covariance
components models. Journal of the American Statistical Association, 76,
341-353.

Edwards, AL. (1948). Note on the "correction for continuity" in testing the
signiﬁcance of the difference between correlated proportions. Psychometrika,
13, 185-187.

Fisher, R.A. (1936). Statistical methods for research workers. Edinburgh: Oliver
and Boyd.

Fleiss, LL. (1981). Statistical methods for rates and proportions. New York: John
Wiley & Sons.

Glass, G.V., and Stanley, J.C. (1970). Statistical methods in education and
psychology. Englewood Cliffs, New Jersey: Prentice-Hall, Inc.

Goldstein, HI. (1987). Multilevel models in education and social research.
London: Oxford University Press.

Good, T.L., and Brophy, J.E. (1986). School effects. In M.C. Wittrock (Ed.),
Handbook of research on teaching. New York: Micmillan.

Harville, D. A. (1976). Extension of the Gauss-Markov Theorem to include the
estimation of random effects. Annals of Statistics, 4, 384-95.

Harville, D. A. (1977). Maximum likelihood approaches to variance component
estimation and to related problems. Journal of the American Statisrical
Association, 72, 320-338.

Henderson, OR. (1953). Estimation of variance and covariance components.
Biometrics, 9, 226-252.

Henderson, C.R., Jr., and Henderson, CR. (1979). Analysis of covariance in mixed

models with unequal subclass numbers. Communications in Statistics-Theory
and Methods, 751-787.

Heyns, B. (1986). Educational effects: Issues in conceptualization and
measurement. In J.G. Richardson (Ed.), Handbook of theory and research
for the sociology of education. Westport, CT: Greenwood Press.

Hopkins, K.D. (1982). The unit of analysis: Group means versus individual
observations. American Educational Research Journal, 19, 5-18.

145

International Mathematical and Statistics Libraries (1987). MATH/LIBRARY and
STAT/LIBRARY, version 1.0. Houston: IMSL Problem-Solving Software
Systems.

Jones, MC. (1985). Generating inverse Wishart matrices. Communications in
Statistics-Simulation and Computation, 511-514.

Johnson, R.A., and Bhattacharyya, GK. (1977). Statistical concepts and methods.
New York: John Wiley and Sons.

Kackar, RN, and Harville, DA. (1984). Approximations for standard errors of
estimators of ﬁxed and random effects in mixed linear models. Journal of
the American Statistical Association, 79, 853-862.

Kirk, RE. (1968). Experimental design: Procedures for the behavioral sciences.
Belmont, California: Brooks/Cole Publishing Company.

Kish, L. (1965). Survey sampling. New York: Wiley and Sons.

Knuth, DE (1981). The art of computer programming, volume 2, seminumerical
algorithms. Reading, Massachusetts: Addison-Wesley Publishing Company.

Laird, N.M., and Ware, J.H. (1982). Random-effects models for longitudinal data.
Biometrics, 38, 963—974.

Lindley, D.V. (1965a). Introduction to probability and statistics from a Bayesian
viewpoint, part I, probability. Cambridge: Cambridge University Press.

Lindley, D.V. (1965b). Introduction to probability and statistics from a Bayesian
viewpoint, part 2, inference. Cambridge: Cambridge University Press.

Lindley, D.V., and Smith, A.P.M. (1972). Bayes estimates for the linear model

(with discussion). Journal of the Royal Statistical Society, Series B, 34, 1-
41.

Lindquist, ER (1953). Design and analysis of experiments in psychology and
education. Boston: Houghton Mifflin.

Lumsdaine, A.A. (1963). Instruments and media of instruction. In N.L. Gage
(Ed.), Handbook of research on teaching. Chicago: Rand-McNally.

Longford, NT, (1987). A fast scoring algorithm for maximum likelihood

estimation in unbalanced mixed models with nested effects. Biometrika, 74,
817-827.

Mason, W.M., Wong, G.Y., and Entwistle, B. (1984). Contextual analysis through
the multilevel linear model. In S. Leinhardt (Ed.), Sociological methodology
(pp. 72-103). San Francisco: Jossey-Bass.

146

Miettinen, 0.8. (1968). The matched pairs design in the case of all-or-none
responses. Biometrics, 24, 339-352.

Morris, C.N. (1987). Comment on Tanner and Wong (1987). Journal of the
American Statistical Association, 82, 542-543.

Morrison, DP. (1976). Multivariate statistical methods. New York: McGraw-Hill.

Odel, P.L., and Feiveson, AH. (1966). A numerical procedure to generate a sample
covariance matrix. Journal of the American Statistical Association, 61, 199-
203.

Peckman, P.D., Glass, G.V., and Hopkins, K.D. (1969). The experimental unit in
statistical analysis. Journal of Special Education, 3, 337-349.

Pedhazur, El. (1982). Multiple regression in behavioral research: Explanation and
prediction. New York: Holt, Rinehart and Winston.

Raudenbush, S.W. (1984).— Applications of a hierarchical linear model in
educational research. Unpublished doctoral dissertation, Harvard University.

Raudenbush, S.W., and Bryk, A.S. (1285). Empirical Bayes meta-analysis. Journal
of Educational Statistics, 10, 75-98.

Raudenbush, S.W., and Bryk, A.S. (1986).-— A hierarchical model for studying
school effects. Sociology of Education, 59, 47-65.

" Raudenbush, S.W., and Bryk, A.S. (L288). Methodological advances in analyzing
the effects of schools and classrooms on student learning. In El. Rothkopf
(Ed.), Review of research in education (pp 423-475). Washington, DC:
American Educational Research Association.

Searle, SR. (1970). Large sample variances of maximum likelihood estimators'of
variance components using unbalanced data. Biometrics, 26, 505-524.

Searle, SR. (1971). Topics in variance component estimation. Biometrics, 27, l-
76.

Smith, A.M.F. (1973). A general Bayesian linear model. Journal of the Royal
Statistical Society, Series B, 35, 61—75.

Smith, W.B., and Hocking, RR. (1972). Algorithm A853: Wishart variate
generator. Applied Statistics, 21, 341-345.

Strenio, J.F., Weisberg, H.I., and Bryk, A.S. (1983). Empirical Bayes estimation of
individual growth-curve parameters and their relationship to covariates.
Biometrics, 39, 71-86

147

Tankard, J.W., Jr. (1984). The statistical pioneers. Cambridge, Massachusetts:
Schenkmann Publishing Company, Inc.

Tanner, M.A., and Wong, W.H. (1987). The calculation of posterior distributions
by data augmentation (with discussion). Journal of the American Statistical
Association, 82, 528-550.

Thisted, R.A. (1988). Elements of statistical computing. New York: Chapman and
Hall.

Winer, BJ. (1971). Statistical principles in experimental design. New York:
McGraw-Hill Book Company.

Wu, C.F.J (1983). On the convergence properties of the EM algorithm. Annals of
Statistics, 11, 95-103.

HICH G BRRRIES

I'llllllllilllllall

 

5