TWO ESSAYS ON EDUCATIONAL RESEARCH: (1) USING MAXIMUM CLASS SIZE RULES TO EVALUATE THE CAUSAL EFFECTS OF CLASS SIZE ON MATHEMATICS ACHIEVEMENT: EVIDENCE FROM TIMSS 2011; (2) POWER CONSIDERATIONS FOR MODELS OF CHANGE

By Wei Li

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Educational Policy - Doctor of Philosophy
Measurement and Quantitative Methods - Dual Major

2015

ABSTRACT

This dissertation is a collection of two essays that address class size effects on student achievement and power analysis methods for models of change.

Class size reduction policies have been widely implemented around the world in recent decades. However, findings about the effects of class size on student achievement have been mixed. In addition, most studies of class size effects have focused on the effects on average achievement for all students. Only a few studies have focused on differential class size effects across the student achievement distribution, and their findings have also been mixed.

The first essay (Chapter 1 and Chapter 2) was designed to evaluate class size effects on student achievement. In particular, Chapter 1 employed instrumental variables (IV) methods to examine the causal effects of class size on fourth grade mathematics achievement using data from TIMSS (Trends in International Mathematics and Science Study). While I found some evidence of class size effects in Romania and the Slovak Republic, overall there were no systematic patterns of class size effects. The results indicate that in most European and Asian countries class size reduction may not improve mathematics achievement in fourth grade.
The first essay also evaluated the differential class size effects across the mathematics achievement distribution. In particular, Chapter 2 employed quantile regression analysis, coupled with instrumental variables methods, to examine the causal effects of class size on fourth grade mathematics achievement. While I found some evidence of quantile-specific class size effects in Romania and the Slovak Republic, overall there were no systematic patterns of class size effects. What is more, there was no evidence that high- or low-achievers benefited more from smaller classes. The results indicate that in most European and Asian countries class size reduction may not increase or reduce the achievement gap between low- and high-achieving students in fourth grade.

The second essay of this dissertation (Chapter 3) was designed to provide power analysis methods for three-level models in studies of polynomial change. Experiments that involve nested structures often assign entire groups to treatment conditions and follow them over time to assess group differences in the average rate of change, the rate of acceleration, or higher-degree polynomial effects. Chapter 3 provides methods for power analysis in three-level polynomial change models for cluster randomized designs (i.e., treatment at the third level) and block randomized designs (i.e., treatment at the second level). Both unconditional models and conditional models that include covariates at the second (e.g., student) and the third (e.g., school) levels are discussed. The power computations take into account nesting effects at the second and third levels, the duration of the study, sample size effects (e.g., the numbers of schools and students), and covariate effects. Chapter 3 also provides illustrative examples to show how power is influenced by the study duration, the sample sizes, and the covariates at the second and third levels.

ACKNOWLEDGEMENTS

This dissertation has been a great challenge to me.
I have benefited a lot from many people I worked with at different stages of my life. I would like to acknowledge their support, guidance, and help.

First of all, I would like to thank my dissertation committee for their support and guidance. I sincerely thank Dr. Spyros Konstantopoulos for his long-term academic guidance and unlimited support in my doctoral studies. He introduced me to class size effects analysis and power analysis. Dr. Konstantopoulos is a very knowledgeable and generous person, who has shared countless great ideas with me and helped me make this dissertation much better than I could have imagined. He has been a great mentor during my journey of Ph.D. study. His insightful direction has greatly shaped my thinking and career goals as a quantitative educational researcher. I particularly appreciate his time, patience, kindness, and encouragement, which helped me get through the hard times during my doctoral study. I was so fortunate to have Dr. Konstantopoulos as my advisor. I am indebted to him for all of his support, guidance, and encouragement. I am looking forward to continuing to work with him and learn from him in the future.

I thank Dr. Barbara Schneider, who provided me with great opportunities to work on several large-scale research projects. I worked under her as a graduate assistant, gained valuable experience analyzing real data, and built up my quantitative skills. I also thank her for her thoughtful advice and feedback as I was working on this dissertation and searching for an academic job.

I am thankful to Dr. Amita Chudgar and Dr. Kenneth Frank for agreeing to serve on my dissertation committee. I especially thank Dr. Chudgar for her support and caring, which helped me survive when I first came to MSU. I thank Dr. Frank for his critical feedback and valuable advice, from which I have learned a lot.

I would like to acknowledge the generous financial support from Dr.
Michael Sedlak and the Educational Policy program that supported my doctoral study and this dissertation. I also express my thanks to Dr. David Arsen and Dr. Kim Maier for their guidance and support.

I thank my teachers in the Graduate School of Education at Peking University. I particularly thank Dr. Ding Yanqing, who encouraged me to pursue a doctoral degree in the U.S. and suggested that I pursue a second major in MQM. He gave me a great deal of advice and direction during the past years, which has changed my life. I am also thankful to my master's advisor, Dr. Yue Changjun, for his consistent guidance and support when I was studying in GSE and at MSU. I thank Dr. Yan Fengqiao, Dr. Song Yingquan, Dr. Ding Xiaohao, Dr. Yang Po, and Dr. Ma Liping for their encouragement and help.

Moreover, I am thankful to my friends and fellow students from Educational Policy and MQM at MSU. Specifically, I thank my close friend, classmate, and coauthor, Dr. Yisu Zhou. I received a great deal of support and help from him at every stage of my doctoral study. I am very thankful to Dr. Anne Traynor, Dr. Min Sun, Dr. Yongmei Ni, Guan Saw, Liyang Mao, and Na Liu for their help and suggestions with respect to my research and my career. I would like to thank my close friend Keyin Wang, who took care of my wife and me in the past years at MSU. I am thankful to Siwen Guo, Xin Luo, Ran Xu, and Yi Wei. I wish them all the best in their graduate study and subsequent life. I also would like to thank my close friend Wang Feng and his wife for taking care of my wife when I was in the U.S. while she was in Beijing.

I owe many thanks to my family. I thank my parents, Li Yueqing and Yin Shuwen, and my parents-in-law, Zhaoqiang and Di Youlan, for their unconditional love during my study in China and in the U.S. Finally, I would like to thank my wife, Meng Zhao, who has been there to laugh with me and to cry with me since we met eleven years ago.
This dissertation is dedicated to my grandparents, Yu Qingzhen and Li Guangxin, who gave me the best childhood.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
CHAPTER 1 CLASS SIZE EFFECTS ON FOURTH GRADE MATHEMATICS ACHIEVEMENT
    Introduction
    Literature Review
    Methods
        Data
        Country selection
        Measures
        Multiple Regression
        Instrumental Variables
    Results
        Descriptive statistics
        Regression Results
        IV Results
        Comparison of Regression and IV Estimates
    Discussion
CHAPTER 2 DOES CLASS SIZE REDUCTION CLOSE THE ACHIEVEMENT GAP
    Introduction
    Literature Review
    Method
        Quantile Regression
        Instrumental Variable and Control Function
    Results
    Discussion
CHAPTER 3 POWER CONSIDERATION FOR MODEL OF CHANGE
    Introduction
    The Polynomial Change Model
    Statistical Models
        Design I: Treatment Assigned at Third Level (Cluster Design)
            Unconditional Model
            Covariates at Second and Third Levels
        Design II: Treatment Assigned at Second Level (Block Randomized Design)
            Unconditional Model
            Covariates at Second and Third Levels
    Illustrative Examples
        Cluster Randomized Design: A Linear Growth Model
        Block Randomized Design: A Linear Growth Model
        Block Randomized Design: A Quadratic Growth Model
    Conclusion
APPENDICES
    Appendix A: Variable Description
    Appendix B: Control Function Approach for Quantile Regression
    Appendix C: Proof of Equation (3.6)
    Appendix D: Proof of Equation (3.9)
REFERENCES

LIST OF TABLES

Table 1.1: Maximum Class Size Rules: TIMSS 2011
Table 1.2: Descriptive Statistics for Some Variables of Interest of TIMSS 2011 Samples: Means and Standard Deviations
Table 1.3: OLS Regression Estimates and Standard Errors of Class Size
Table 1.4: Analysis of the Impact of Unobservable Confounding Variables
Table 1.5: First Stage Regression Estimates and Standard Errors of the Computed Average Class Size
Table 1.6: Second Stage Regression Estimates and Standard Errors of Class Size
Table 1.7: Results from Durbin-Wu-Hausman Test
Table 2.1: 2SLS and Quantile Regression Estimates and Standard Errors of Class Size
Table 2.2: Differences in Quantile Regression Estimates
Table 3.1: Effect of Study Duration (D) and Number of Schools (M) on Power Holding Number of Students (N) in Each School Constant at 20: CRD, Linear Rate of Change
Table 3.2: Effect of Study Duration (D) and Number of Students (N) on Power Holding Number of Schools (M) Constant at 20: CRD, Linear Rate of Change
Table 3.3: Effects of Number of Schools (M) and Number of Students (N) on Power Holding Study Duration (D) Constant at 3: CRD, Linear Rate of Change
Table 3.4: Effect of Covariates on Power: CRD, Linear Rate of Change
Table 3.5: Effects of Covariates, Number of Schools (M) and Number of Students (N) on Power Holding Study Duration (D) Constant at 3, w2 = 0.6 and w3 = 0.6: CRD, Linear Rate of Change
Table 3.6: Effect of Study Duration (D) and Number of Schools (M) on Power Holding Number of Students (N) in Each School Constant at 40: BRD, Linear Rate of Change
Table 3.7: Effect of Study Duration (D) and Number of Students (N) on Power Holding Number of Schools (M) Constant at 40: BRD, Linear Rate of Change
Table 3.8: Effects of Number of Schools (M) and Number of Students (N) on Power Holding Study Duration (D) Constant at 4: BRD, Linear Rate of Change
Table 3.9: Effect of Covariates on Power: BRD, Linear Rate of Change
Table 3.10: Effects of Covariates, Number of Schools (M) and Number of Students (N) on Power Holding Study Duration (D) Constant at 4, w2 = 0.6 and w3 = 0.6: BRD, Linear Rate of Change
Table 3.11: Effect of Study Duration (D) and Number of Schools (M) on Power Holding Number of Students (N) in Each School Constant at 40: BRD, Quadratic Rate of Change
Table 3.12: Effect of Study Duration (D) and Number of Students (N) on Power Holding Number of Schools (M) Constant at 40: BRD, Quadratic Rate of Change
Table 3.13: Effects of Number of Schools (M) and Number of Students (N) on Power Holding Study Duration (D) Constant at 4: BRD, Quadratic Rate of Change
Table 3.14: Effect of Covariates on Power: BRD, Quadratic Rate of Change
Table 3.15: Effects of Covariates, Number of Schools (M) and Number of Students (N) on Power Holding Study Duration (D) Constant at 4, w2 = 0.6 and w3 = 0.6: BRD, Quadratic Rate of Change
Table A.1: Variable Names and Coding Methods using Data from TIMSS 2011

LIST OF FIGURES

Figure 3.1: Effect of Study Duration (D) and Number of Schools (M) on Power, Holding Number of Students (N) in Each School Constant at 20: CRD, Linear Rate of Change
Figure 3.2: Effect of Study Duration (D) and Number of Students (N) on Power Holding Number of Schools (M) Constant at 20: CRD, Linear Rate of Change
Figure 3.3: Effects of Number of Schools (M) and Number of Students (N) on Power Holding Study Duration (D) Constant at 3: CRD, Linear Rate of Change
Figure 3.4: Effect of Covariates on Power: CRD, Linear Rate of Change
Figure 3.5: Effects of Covariates, Number of Schools (M) and Number of Students (N) on Power Holding Study Duration (D) Constant at 3, w2 = 0.6 and w3 = 0.6: CRD, Linear Rate of Change
Figure 3.6: Effect of Study Duration (D) and Number of Schools (M) on Power Holding Number of Students (N) in Each School Constant at 40: BRD, Linear Rate of Change
Figure 3.7: Effect of Study Duration (D) and Number of Students (N) on Power Holding Number of Schools (M) Constant at 40: BRD, Linear Rate of Change
Figure 3.8: Effects of Number of Schools (M) and Number of Students (N) on Power Holding Study Duration (D) Constant at 4: BRD, Linear Rate of Change
Figure 3.9: Effect of Covariates on Power: BRD, Linear Rate of Change
Figure 3.10: Effects of Covariates, Number of Schools (M) and Number of Students (N) on Power Holding Study Duration (D) Constant at 4, w2 = 0.6 and w3 = 0.6: BRD, Linear Rate of Change
Figure 3.11: Effect of Study Duration (D) and Number of Schools (M) on Power Holding Number of Students (N) in Each School Constant at 40: BRD, Quadratic Rate of Change
Figure 3.12: Effect of Study Duration (D) and Number of Students (N) on Power Holding Number of Schools (M) Constant at 40: BRD, Quadratic Rate of Change
Figure 3.13: Effects of Number of Schools (M) and Number of Students (N) on Power Holding Study Duration (D) Constant at 4: BRD, Quadratic Rate of Change
Figure 3.14: Effect of Covariates on Power: BRD, Quadratic Rate of Change
Figure 3.15: Effects of Covariates, Number of Schools (M) and Number of Students (N) on Power Holding Study Duration (D) Constant at 4, w2 = 0.6 and w3 = 0.6: BRD, Quadratic Rate of Change

INTRODUCTION

This dissertation is a collection of two essays that address class size effects and power analysis methods for models of change in longitudinal randomized controlled trials.

The first essay (Chapter 1 and Chapter 2) focused on the effects of class size on fourth graders' mathematics achievement. Many countries have recently enacted class size reduction policies.
Mixed research findings leave policy makers, practitioners, and researchers wondering whether class size reduction is an effective way of improving student achievement. Chapter 1 addresses the effects of class size on average student achievement. Specifically, Chapter 1 investigated the effects of class size on mathematics achievement for fourth graders using data from the Trends in International Mathematics and Science Study (TIMSS). Typical statistical methods such as ordinary least squares regression may produce biased estimates of class size effects because the allocation of students and teachers to classes is likely non-random. For example, students might be assigned to classes based on their prior achievement; however, TIMSS provides no measure of prior achievement. To account for the non-randomness of student assignment and to facilitate causal inference, I created a class size index that is independent of the unobserved process of student assignment; an index of this kind is usually called an instrumental variable (IV). In particular, I used as the instrument the grade- and school-specific average class size computed from the maximum class size requirement in a country. Generally, no systematic pattern of association between class size and mathematics achievement was found in my study. These results indicate that class size reduction may not improve fourth grade mathematics achievement.

Besides improving average student achievement, another critical objective of education interventions is to increase achievement for students at risk, and thus reduce the achievement gap between lower- and higher-achieving students. Class size reduction (CSR) has been advocated as such an intervention by some researchers; however, no recent study has used current data to evaluate whether CSR closes the achievement gap. Chapter 2 attempts to fill that gap in the literature by exploring the differential class size effects for students with different levels of achievement.
I employed quantile regression analysis, coupled with IV, to estimate the causal effects of class size on student achievement in the middle as well as the lower and upper tails of the achievement distribution. I also compared the estimated effects of class size between low- and high-achieving students. Overall, there were no systematic differential class size effects across the achievement distribution, and in most countries class size reduction may not reduce the achievement gap between low- and high-achieving fourth grade students.

Chapter 3 addresses power analysis methods in three-level polynomial change models. An important part of the design phase of an experiment involves power analysis. Statistical power is the probability of detecting the treatment effect of interest when it exists. A priori power analyses help educational researchers identify how large a sample of students, classrooms, or schools they need to ensure a good enough chance (e.g., > 80 percent) of detecting a treatment effect, assuming the effect exists. It is common in education to employ designs where students are assigned randomly to a treatment and a control condition and are then followed over time. The main objective in these studies is to determine whether treatment effects fade or have lasting benefits over time. Previous work has presented methods for power analysis of two-level (e.g., repeated measures nested in students) models. Nonetheless, populations in education frequently have more complicated structures. For example, students are also nested within classes or schools and so forth. As a result, it is natural to extend methods for power analysis for tests of treatment effects from two to three levels. My second essay (Chapter 3) was designed to provide methods for power analysis in three-level models.
Both methods for cluster randomized designs (i.e., treatment at the third level) and block randomized designs (i.e., treatment at the second level) were discussed.

CHAPTER 1 CLASS SIZE EFFECTS ON FOURTH GRADE MATHEMATICS ACHIEVEMENT

Introduction

Identifying the best allocation of school resources to improve student achievement has been a fundamental objective in education for a long time. As a result, school resources such as teacher pay, per-pupil funding, and class size have received considerable attention in the past three decades. The underlying assumption is that these school resources can have positive effects on student achievement.

The effects of class size on student achievement have received particular attention in education research and policy. Results from experiments have indicated positive effects of small classes on student achievement (e.g., Finn & Achilles, 1990; Molnar et al., 1999). Specifically, evidence from Project STAR (Student-Teacher Achievement Ratio) in Tennessee has strongly indicated achievement improvements for students in small classes compared to students in regular size classes (e.g., Nye, Hedges, & Konstantopoulos, 2000; Krueger, 1999). However, results from quasi-experiments have indicated much smaller class size effects. For example, Angrist and Lavy (1999) found significant but smaller class size effects in Israel than those reported in Project STAR. Also, Hoxby (2000) analyzed data from a natural experiment in Connecticut and found that class size did not have a significant effect on student achievement.

Findings about class size effects have informed policies in different countries and, as a result, various countries have enacted class size reduction (CSR) policies. Such policies have been quite popular in the U.S., especially during the past decade. Twenty-one states in the U.S. had a CSR policy in place in 2007-2008 (Education Week, 2008).
In Asia, countries such as South Korea, Japan, and Singapore, and districts such as Hong Kong and Chinese Taipei, have implemented CSR policies in recent years with the aim of increasing student achievement. Similarly, in Europe, most countries have adopted CSR policies. In particular, two thirds of the European Union countries had introduced maximum class size requirements by 2011 in an attempt to ensure that class size does not exceed 30 students per class. Some European countries have lowered their upper class size limits in the last few years. For example, in Austria, since the 2007-2008 school year the number of students per classroom has been reduced at primary schools, general secondary schools, academic secondary schools, and pre-vocational schools (EACEA Eurydice, 2011). Also, Scotland has recently reduced class size in first grade from a maximum of 30 students to 25 students (EACEA Eurydice, 2011). Other countries, however, have stopped setting upper class size limits or have increased their upper class size limits in primary education. Norway, for instance, has not set upper class size limits since 2003-2004. Also, Italy and Portugal have increased their upper class size limits from 25 and 24 in 2006-2007 to 26 and 28 in 2010-2011, respectively.

Class size reduction policies require considerable investments in education. But budgets allocated to education at the federal and local levels are typically limited. Policy makers, practitioners, and researchers are still wondering whether CSR policies are an effective way of improving student achievement. Chapter 1 attempts to provide additional evidence about the effects of class size on student performance using data from a large-scale international assessment program. Specifically, the purpose of Chapter 1 is to examine the effects of class size on mathematics achievement using data from the 2011 fourth grade sample of TIMSS.
My sample included hundreds of schools and thousands of students in 18 Asian and European countries and districts (see Table 1.1). I employed regression methods to analyze the data. To facilitate causal inferences about class size effects, I used instrumental variables (IV) that take advantage of the maximum class size rule.

My study contributes to the existing literature in two ways. First, I used the most recent TIMSS data from 2011, which allow us to evaluate recent, concurrent CSR policies and compare class size effects across 18 Asian and European countries and districts. Second, I used maximum class size rules that allowed us to construct instruments to estimate the causal effects of class size on mathematics achievement in fourth grade across countries.

Literature Review

During the past three decades, researchers have explored the effects of class size reduction on student achievement through meta-analyses, experimental and quasi-experimental designs, as well as other advanced statistical methods such as IV.

Meta-analytic reviews of early work on small class effects indicated a positive relationship between small classes and student achievement, but the magnitude of the effects was small. For example, Glass and Smith (1979) synthesized 77 studies and found that the average effect size when class sizes were reduced from 25 to 15 was 0.13 standard deviations (SD). Using a subset of the Glass and Smith (1979) sample that employed random assignment or initial controls for student quality, Slavin (1989) found extremely small effects of class size on achievement. He concluded that the class size effects are consistent but small in kindergarten through third grade, slightly smaller in fourth through eighth grades, and non-existent in ninth through twelfth grades.

Project STAR is viewed as the most impressive and most powerful field experiment about class size effects in education (Mosteller, 1995).
There have been numerous analyses of the Tennessee STAR data that have produced estimates with high internal validity. Finn and Achilles (1990) were the first to analyze these data and found that students in small classes performed better than those in regular classes in all subject areas, in every year of the experiment (kindergarten through third grade). Nye, Hedges, and Konstantopoulos (2000) analyzed the validity of Project STAR and suggested that the effects of class size might be underestimated because of imperfect implementation. They also found that the estimated class size effects were consistent with those from Glass and Smith (1979). Other studies by Krueger (1999) and Konstantopoulos (2008) produced similar findings about the positive effects of small classes on student achievement in early grades.

Studies that have used observational data, especially data from large-scale surveys, have usually produced results with high external validity (i.e., generality). However, the internal validity (or unbiasedness) of estimates in observational or quasi-experimental studies is not so easy to achieve. That is, researchers have to use advanced statistical methods to warrant the internal validity of estimates. For instance, traditional ordinary least squares (OLS) regression may produce biased estimates because of omitted variable bias (i.e., predictors may not be orthogonal to the error term).

Previous work has utilized different analytic methods to examine class size effects on student achievement. For example, Pong and Pallas (2001) used multilevel models to analyze TIMSS data from 1995 in nine different countries and found no class size effects on eighth grade achievement except in the U.S. Other researchers have used IV methods to analyze observational data in an attempt to explore the causal effects of class size reduction.
For example, Akerhielm (1995) used two instruments for class size (the average class size for a given subject in the student's school and the eighth grade enrollment in the school) to analyze class size effects on eighth graders' mathematics, science, English, and history scores using data from the 1988 NELS. Her results indicated a significant and negative relationship between class size and student achievement. Hoxby (2000) used data from a natural experiment and IV methods to estimate the effects of class size on student achievement in Connecticut. Her method exploited random variation in class size due to random variation in births from year to year in schools and district catchment areas. She found no class size effects in fourth and sixth grades. Cho, Glewwe, and Whitler (2012) applied Hoxby's (2000) method to compute class size effects in Minnesota and found positive effects of smaller classes on student achievement, but these estimates were smaller than estimates from Project STAR. Moreover, Wossmann and West (2006) examined class size effects in 11 countries that participated in TIMSS 1995. Their results indicated that there was no clear pattern of whether or when class size affects student achievement.

One of the best instruments used to capture class size effects was introduced by Angrist and Lavy (1999). Specifically, their study used the Maimonides rule, which sets the maximum class size to 40 students per classroom, to evaluate the effect of class size on student achievement in Israel. The maximum class size rule of 40 was used to construct IV estimates of the effect of class size on test scores. The study reported a statistically significant effect of small classes on fifth grade reading and mathematics scores. In fourth grade the benefit of being in small classes was significant in reading, but not in mathematics. However, in third grade no significant effects of class size on achievement scores were detected.
Several other researchers have also used maximum class size rules as IV to evaluate class size effects. For instance, Bonesronning (2003) investigated class size effects using a maximum class size rule of 30 students per classroom in Norway. His analysis indicated small class effects. Wossmann (2005) explored class size effects in Europe using data from TIMSS 1995 for eighth grade students. He found two statistically significant relationships between class size and student achievement: a marginally significant effect in Norway and a highly significant effect in Iceland. He also found a statistically significant but positive relationship between class size and student achievement in Switzerland. For Denmark, France, Germany, Greece, Ireland, Spain, and Sweden, the estimates were not significant. A recent study about class size effects on fourth grade reading achievement in Greece also reported statistically insignificant estimates (Konstantopoulos & Traynor, 2014). Urquiola (2006) studied 10,018 third-grade students in Bolivia and found significant class size effects with effect sizes as large as 0.30 SD, bigger than the effects found in Project STAR in the U.S. and in Israel.

Methods

Data

I used data from the latest TIMSS survey, conducted in 2011. TIMSS is the largest international database that measures trends in mathematics and science achievement at fourth and eighth grades. First conducted in 1995, TIMSS provides reliable and timely data about mathematics and science achievement every four years. It also collects extensive information about students, teachers, school principals, and curriculum experts via background questionnaires.

A stratified two-stage cluster-sampling design was used in TIMSS, where schools are sampled at the first stage and one or more intact classes are sampled at the second stage in each of the sampled schools (Martin & Mullis, 2012). TIMSS has produced high-quality assessment measures.
Also, teachers reported class size information on all intact classrooms that were sampled. Other useful information about students, teachers, and schools has also been collected. It is noteworthy that TIMSS was designed to yield a national probability sample of fourth (or eighth) graders. With the use of appropriate weights, one can make projections to the population of fourth (or eighth) graders in each country, which points to the high external validity of the estimates.

The stratified two-stage cluster-sampling design used in TIMSS complicates the computation of the standard errors of estimates because student data within the same school are correlated rather than independent. Following the suggestion of the TIMSS technical report (Martin & Mullis, 2012), I used the jackknife repeated replication (JRR) technique to estimate the sampling variance because it is computationally straightforward and provides approximately unbiased estimates of the sampling errors. That is, JRR standard errors take into account clustering effects induced by the multi-stage sampling scheme.

Country Selection

I used fourth grade data from the fifth and latest administration of the TIMSS assessment in 2011. I focused on fourth grade mathematics achievement because class size effects are typically expected in elementary grades (Nye et al., 2000). Twenty-five European countries were surveyed in TIMSS 2011. Of those 25 participating countries, I selected the 14 that had known, clear rules about maximum class size limits for fourth graders in 2011 (see Table 1.1). The highest upper class size limit in the fourth grade was in Malta and the Czech Republic, with a maximum of 30 students per classroom. The lowest upper class size limit, 24 students per classroom, was in Lithuania. The most common upper class size limit was 28 students per classroom (EACEA Eurydice, 2012).
I also selected four Asian countries and districts that set a clear maximum class size limit in the fourth grade in 2011. Compared with the rules in Europe, the upper class size limits were considerably larger in Asia, ranging from 30 (Hong Kong) to 40 (Japan and Singapore). Table 1.1 provides details about the selected countries as well as their upper class size limits.

Table 1.1: Maximum Class Size Rules: TIMSS 2011

Country                  Maximum Class Size Rule
Austria (AUT)            25
Croatia (HRV)            28
Czech Republic (CZE)     30
Denmark (DNK)            28
Germany (DEU)            29
Hungary (HUN)            26
Italy (ITA)              26
Lithuania (LTU)          24
Malta (MLT)              30
Portugal (PRT)           28
Romania (ROM)            25
Slovak Republic (SVK)    25
Slovenia (SVN)           28
Spain (ESP)              25
Hong Kong (HKG)          30
Japan (JPN)              40
Singapore (SGP)          40
Chinese Taipei (TWN)     32

Measures

The dependent variable was mathematics achievement represented by five plausible values. Because the item pool of TIMSS 2011 was too large for students to finish in two hours, TIMSS used an incomplete booklet design that had each student complete only a proportion of the item pool (Martin & Mullis, 2012). Then, multiple imputation methods were used to construct a distribution of scores that the students might have obtained had they completed the full test. The plausible values are a sample of scores from this distribution that incorporates the uncertainty about student scores (Martin & Mullis, 2012). It has been shown that five plausible values can produce reliable and consistent estimates of student achievement (Schafer, 1999).

The main independent variable was class size, as reported by teachers. Specifically, the class size measure was the number of students in a sampled classroom provided by the teachers.

Student, teacher, classroom, and school variables of interest were also used as covariates. The student covariates included gender (e.g., a dummy for female). The teacher covariates included education (e.g., a dummy for completing post-secondary education), years of teaching experience, gender (e.g., a dummy for female), and the teacher's instruction time per week. Classroom covariates included class-level SES, represented by aggregate measures of the number of books in the home and the average number of items in the home. The proportion of female students in the classroom and the average student positive affect toward mathematics were also used as classroom covariates. School covariates included the percent of economically disadvantaged students, the percent of students having the tested language as their native language, the income level of the school's immediate area, and fourth grade enrollment and its square. Missing data flags (i.e., dummies) were included in the models to account for missing data effects. Appendix A provides the full list of variables as well as a detailed description of the coding.
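Missing data flags of this kind are commonly built by filling a missing covariate with a constant and adding an indicator for missingness. The sketch below illustrates that general pattern only. The function name `add_missing_flag` and the fill value of zero are assumptions made for illustration, since the text does not say which fill value was used.

```python
from typing import List, Optional, Sequence, Tuple


def add_missing_flag(
    values: Sequence[Optional[float]], fill: float = 0.0
) -> Tuple[List[float], List[int]]:
    """Return (filled covariate, missing-data dummy) for one covariate.

    `fill` is an assumed constant; the dummy equals 1 where the
    original value was missing and 0 otherwise.
    """
    filled = [fill if v is None else float(v) for v in values]
    flags = [1 if v is None else 0 for v in values]
    return filled, flags


# Hypothetical teacher-experience values with one missing response.
experience, experience_missing = add_missing_flag([12.0, None, 3.0])
print(experience, experience_missing)
```

Both the filled covariate and its dummy would then enter the regression together, so that cases with missing responses are retained rather than dropped.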
Multiple Regression

To examine the class size effects on student mathematics achievement, I first employed a multiple regression model that included class size and student, teacher/classroom, and school covariates, namely

$$\mathit{Score}_i = \beta_0 + \beta_1 \mathit{ClassSize}_i + \mathbf{ST}_i \boldsymbol{\beta}_2 + \mathbf{CL}_i \boldsymbol{\beta}_3 + \mathbf{SC}_i \boldsymbol{\beta}_4 + \varepsilon_i \tag{1.1}$$

where $\mathit{Score}_i$ represents mathematics scores, $\beta_0$ is the constant term, $\mathit{ClassSize}_i$ is the main independent variable, $\beta_1$ represents the class size effect and is the regression coefficient of interest, $\mathbf{ST}_i$ is a row vector of student background characteristics, $\boldsymbol{\beta}_2$ is a column vector of regression coefficients of student characteristics, $\mathbf{CL}_i$ is a row vector of classroom or teacher characteristics, $\boldsymbol{\beta}_3$ is a column vector of regression coefficients of teacher and classroom characteristics, $\mathbf{SC}_i$ is a row vector of school characteristics, $\boldsymbol{\beta}_4$ is a column vector of regression coefficients of school characteristics, and $\varepsilon_i$ is the error term.

Because TIMSS used a complicated cluster sampling design (i.e., sampled schools at the first stage and then classes within schools), the clustering effect needs to be incorporated in the estimation of the standard errors. To achieve this we used JRR techniques to obtain cluster-robust standard errors, as suggested by Martin and Mullis (2012).

Instrumental Variables

Typical regression could provide consistent estimates of class size effects under the assumption that class size is not correlated with unobserved processes that may take place in schools. These unobservables are represented by the error term in model (1.1). However, such an assumption is strong and rarely met in observational studies. The assignment of students and teachers to classrooms is typically not random, and thus class size could be correlated with unobserved factors related to student, parent, and teacher characteristics. For example, students may be assigned to classes based on their prior achievement or motivation. Parents may also influence assignment to classes.
For instance, parents may want their children to be assigned to the classroom with the highest quality teacher or a specific peer composition (e.g., their children's friends). Teachers may also influence assignment, either by selecting high-achieving students into their classrooms or by teaching the class with the higher proportion of high-achieving students. If such processes were to take place, the estimated class size effect from equation (1.1) would be biased.

Because students and teachers are rarely randomly assigned to classrooms in a grade, class size might be correlated with unobserved characteristics of students or teachers. For example, in order to help low-achieving students, some schools might assign higher quality teachers to classes with higher proportions of low achievers. Variables that determine the assignment of students and teachers to classes are not typically measured. For example, student motivation, family income, parental pressure, teacher quality, etc. are rarely available in observational datasets. In addition, cross-sectional data rarely provide indexes of prior ability or performance. Although we included as many covariates as we could in our multiple regression analysis, it is still possible that unobservable factors that are part of the error term in equation (1.1) are correlated with class size. If that were true, then the estimated class size effect in equation (1.1) would be biased.

One way to overcome this potential shortcoming and facilitate causal inferences is to compute an index of class size that is independent of unobserved student, teacher, or school variables. Specifically, we used the maximum class size rule in each country to compute a school- and grade-specific average class size. This new variable was then used as an instrument to exclude unobserved variables from the teacher-reported class size.
In other words, this method creates a new class size variable that is "error free" and should not be related to unobserved variables. Our method is similar to that used by Angrist and Lavy (1999).

The first step in this approach is to compute the average class size in fourth grade in each school. Specifically, the average class size in fourth grade in each school based on the maximum class size requirement is calculated as

$$f_i = \frac{E_i}{\operatorname{int}\left((E_i - 1)/\mathit{rule}\right) + 1} \tag{1.2}$$

where $E_i$ denotes the enrollment in grade four in a school; $f_i$ denotes the computed average class size based on the maximum class size rule; $\mathit{rule}$ denotes the upper class size limit in a given country; and, for any positive number $n$, the function $\operatorname{int}(n)$ is the largest integer less than or equal to $n$. For example, if grade enrollment is $E = 70$ and the maximum class size rule is 30, then $\operatorname{int}(n) = \operatorname{int}(69/30) = \operatorname{int}(2.3) = 2$, so the cohort is split into three classes. The upper class size limit generates discontinuities in the computed class size as the enrollment count passes multiples of the upper class size limit. For example, if the maximum class size rule is 30 in a specific country, the above equation captures the fact that the maximum class size rule allows cohorts of 1-30 to be grouped in a single class, while cohorts of 31-60 are split into two classes with average class sizes of 15.5-30, and so on.

The second step was to regress the teacher-reported class size on the instrument (i.e., the school- and grade-specific average class size we computed in equation 1.2), as well as other covariates (see the Measures section). This step is designed to eliminate the unobservables (i.e., the error) from teacher-reported class size. Specifically, the regression equation is

$$\mathit{ClassSize}_i = \gamma_0 + \gamma_1 f_i + \mathbf{ST}_i \boldsymbol{\gamma}_2 + \mathbf{CL}_i \boldsymbol{\gamma}_3 + \mathbf{SC}_i \boldsymbol{\gamma}_4 + u_i \tag{1.3}$$

where $f_i$ is the computed average class size (i.e., the instrument) in a school based on the maximum class size rule and $u_i$ is the error term. All other terms have been defined previously.
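As a rough illustration of equation (1.2), the sketch below computes the rule-based average class size for a few enrollment counts. The function name `computed_class_size` and the example enrollments are hypothetical, and the code is a minimal sketch of the formula rather than the procedure actually used in the analysis.

```python
def computed_class_size(enrollment: int, rule: int) -> float:
    """Average class size implied by a maximum class size rule (equation 1.2)."""
    if enrollment < 1 or rule < 1:
        raise ValueError("enrollment and rule must be positive integers")
    # int((E - 1) / rule) + 1 is the number of classes the rule requires.
    n_classes = (enrollment - 1) // rule + 1
    return enrollment / n_classes


# Hypothetical grade enrollments under a cap of 30 students per class.
for enrollment in (30, 31, 60, 61, 70):
    print(enrollment, round(computed_class_size(enrollment, 30), 2))
```

With a cap of 30, enrollments of 30, 31, 60, and 61 give computed class sizes of 30.0, 15.5, 30.0, and about 20.33, which shows the drop in predicted class size each time enrollment passes a multiple of the cap.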
The $\gamma$'s in equation (1.3) are the regression coefficients that need to be estimated. The fitted values of this regression are computed and used in the third step as the new class size variable that is "free" of error.

The third and final step of this procedure used a regression in which the fitted values (denoted below as $FV_i$) from regression equation (1.3) represent class size and are the main independent variable in the following achievement regression

$$Y_i = \delta_0 + \delta_1 FV_i + \mathbf{ST}_i \boldsymbol{\delta}_2 + \mathbf{CL}_i \boldsymbol{\delta}_3 + \mathbf{SC}_i \boldsymbol{\delta}_4 + \eta_i \tag{1.4}$$

where $Y_i$ indicates mathematics scores, $\eta_i$ is an error term, and all other terms have been defined previously. The coefficient $\delta_1$ represents the relationship between mathematics achievement and class size, adjusted for student, teacher/classroom, and school characteristics. Appropriate student weights were used in both regressions (equations 1.3 and 1.4). The student, classroom/teacher, and school covariates included in equation (1.4) are the same as those included in equation (1.3). The method described above (i.e., instrumental variables) has been used in previous work to estimate causal class size effects (e.g., Angrist & Lavy, 1999; Krueger, 1999).

We used JRR techniques to estimate the standard errors of the regression coefficients. JRR techniques are particularly well suited for estimating standard errors in surveys with complex sampling designs such as TIMSS (Martin & Mullis, 2012). Our analysis was conducted for each plausible value separately, and then the five sets of estimates were combined to construct one set of final estimates of class size effects. To combine estimates we used formulae provided by Schafer (1999). The standard error of the class size effects was a combination of the sampling variance obtained through JRR techniques and the variance between plausible values (see Martin & Mullis, 2012).
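The combination step can be sketched with the usual multiple-imputation formulas: the point estimate is the mean across plausible values, and the total variance adds the average sampling variance to the between-value variance inflated by (1 + 1/M). This is a minimal sketch that assumes a sampling variance is available for each plausible value. The function name and the numbers are hypothetical, and the TIMSS procedures may compute the sampling component differently.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def combine_plausible_values(
    estimates: Sequence[float], sampling_variances: Sequence[float]
) -> Tuple[float, float]:
    """Combine per-plausible-value estimates into one estimate and SE."""
    m = len(estimates)
    if m < 2 or m != len(sampling_variances):
        raise ValueError("need matching inputs for at least two plausible values")
    point = mean(estimates)              # final coefficient
    within = mean(sampling_variances)    # average sampling (e.g., JRR) variance
    between = variance(estimates)        # variance across plausible values (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return point, sqrt(total)


# Hypothetical class size coefficients and sampling variances for five plausible values.
coef, se = combine_plausible_values(
    [-4.6, -4.4, -4.5, -4.7, -4.3], [1.9, 2.0, 2.1, 2.0, 2.0]
)
print(round(coef, 2), round(se, 2))
```

The between-value term is what makes the final standard error larger than the sampling error alone, reflecting the measurement uncertainty carried by the plausible values.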
The standard errors of the regression coefficients were also corrected for the two-stage estimation (i.e., equations 1.3 and 1.4) before they were combined across plausible values.

There were two key conditions that the computed average class size $f_i$ must meet in order for the instrument to be valid: (1) schools should follow the maximum class size rule closely, in other words, $f_i$ should be significantly correlated with reported class size; and (2) the instrument cannot be correlated with any of the unobserved student, teacher, or school characteristics (i.e., $f_i$ should not be correlated with the error term in equation 1.1).

The first condition can be checked through the first stage regression (equation 1.3). If the coefficient of the computed average class size (the instrument) is significantly different from zero, then the assumption that reported class size and the instrument are related holds. If the instrument is only marginally significant, our instrument may be weak. When instruments are weak, the standard IV estimates, hypothesis tests, and confidence intervals may be unreliable (Stock, Wright, & Yogo, 2002). When multiple instruments are used, the rule of thumb is that the F-statistic of all instruments in the first-stage regression should be larger than 10 (Staiger & Stock, 1997). In our study only one instrument is used (i.e., average class size per school) and thus we employ a t-test. The t-statistic of the regression coefficient of the instrument ($\gamma_1$ in equation 1.3) should be greater than 3.20 and significant in the first stage regression. The t-statistic denotes the statistic for testing the hypothesis of a zero coefficient for the instrument (computed average class size using the maximum class size rule) in the first stage regression.
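With a single instrument, the first-stage F-statistic for the instrument equals the square of its t-statistic, so the F > 10 rule of thumb corresponds to a t-statistic above the square root of 10 (about 3.16), and the 3.20 cutoff is slightly stricter. The sketch below applies that check. The function names and example t-statistics are illustrative assumptions, and requiring a positive sign is a choice made here because the rule-based class size is expected to move in the same direction as reported class size.

```python
import math


def implied_f(t_stat: float) -> float:
    """First-stage F-statistic implied by a single instrument's t-statistic."""
    return t_stat ** 2


def is_strong_instrument(t_stat: float, t_threshold: float = 3.20) -> bool:
    """True when the first-stage t-statistic is positive and above the cutoff."""
    return t_stat > t_threshold


# The cutoff of 3.20 sits just above sqrt(10), the t-value matching F = 10.
print(round(math.sqrt(10), 2))

# Hypothetical first-stage t-statistics.
for t in (7.94, 2.54, -3.28):
    print(t, round(implied_f(t), 2), is_strong_instrument(t))
```

A large negative t-statistic passes the F > 10 rule in absolute terms but fails this signed check, which flags cases where the instrument does not behave as the class size rule predicts.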
The second condition, which is called the "exogeneity assumption" or "exclusion restriction", indicates that computed average class size influences student mathematics achievement only through reported class size, controlling for grade enrollment and other covariates. The question is essentially whether the instrument might be correlated with unmeasured factors that influence student assignment to classes. For example, private schools could manipulate the maximum class size requirement by adjusting their tuition or enrollment to avoid creating additional classrooms (see Urquiola & Verhoogen, 2009). Unfortunately, I cannot distinguish public from private schools based on TIMSS data. Parents could manipulate the class size rule as well if school choice is an option in their country. In other words, some parents might take advantage of the rules and enroll their children in schools with smaller classes. Regression analyses showed some evidence of associations between smaller class size and higher student SES in Spain and Malta, which indicates that parents with higher SES might manipulate the rules and raises some concern about the validity of the IV in these two countries.

Results

Descriptive Statistics

Table 1.2 presents descriptive statistics for selected student, teacher, and school variables of interest as well as sample sizes for students, classes, and schools. Mathematics scores for all countries participating in TIMSS are scaled to a mean of 500 and an SD of 100. Fourteen countries in Table 1.2 had average scores greater than 500. Asian countries' scores were much higher than those of European countries. Singapore had the highest average score (605.79), closely followed by Hong Kong, Chinese Taipei, and Japan. Denmark had the highest scores among European countries, closely followed by Lithuania, Portugal, and Germany. With an average score of 482.28, Romania had the lowest average score.
Spain, Croatia, and Malta also had average scores lower than 500. About half of the students were female in all countries. At least 70 percent of students almost always spoke the tested language at home in all countries except Chinese Taipei, Malta, Singapore, and Spain.

The average class sizes in grade four in European countries were much smaller than those in Asian countries. In Europe, the smallest average class size, 19 students per class, was in Austria, closely followed by Slovenia, the Slovak Republic, Lithuania, and Romania. With nearly 23 students per classroom on average, Spain had the largest classes. The largest average grade four enrollment (70.13) was in Italy, while the smallest average grade four enrollment (25.6) was in the Slovak Republic. In Asia, the largest average class size was found in Singapore (37).

Teacher experience varied across countries. The highest average teacher experience was in Lithuania (24 years), while the lowest was in Singapore (nearly ten years). Almost all teachers had completed post-secondary education in all countries in our sample except Italy and Romania. More than 75 percent of teachers were female in all European countries in our sample except Denmark, where only about half of the teachers were female; among Asian countries, the share ranged from 56 percent to 82 percent. In addition, school size was much larger in Asia than in Europe.

The numbers of students and schools per country sample are also presented in Table 1.2. The number of schools ranged from 96 in Malta to 216 in Denmark; the number of classes ranged from 197 in Malta to 351 in Singapore; the smallest sample of students was in Malta (3607), while the largest sample of students was in Singapore (6368).

Table 1.
2: Descriptive Statistics for Some Variables of Interest of TIMSS 2011 Samples: Means and Standard Deviations AUT CZE DEU DNKESP HRV HUNITA LTU MLT PRT ROM SVK SVN HKGJPN SGP TWN Student Variables Mathematics Achievement 508.31 510.85 527.74 536.96 482.43 490.17 515.40 507.82 533.69 495.77 532.26 482.28 506.77 513.03 601.61 585.37 605.79 591.21 (62.70) (70.39) (62.14) (70.77) (70.31) (67.07) (89.79) (72.17) (74.02) (77.71) (68.68) (105.36) (79.63) (68.52) (66.42) (72.31) (78.18) (73.22) Female 0.49 0.48 0.49 0.51 0.49 0.50 0.49 0.50 0.48 0.49 0.49 0.48 0.49 0.48 0.46 0.49 0.49 0.47 (0.50) (0.50) (0.50) (0.50) (0.50) (0.50) (0.50) (0.50) (0.50) (0.50) (0.50) (0.50) (0.50) (0.50) (0.50) (0.50) (0.50) (0.50) Age in Years 10.33 10.42 10.37 11.02 9.97 10.67 10.77 9.81 10.85 9.92 10.01 10.77 10.38 9.92 10.07 10.62 10.90 10.24 (0.44) (0.44) (0.49) (0.38) (0.44) (0.32) (0.51) (0.36) (0.37) (0.42) (0.49) (0.65) (0.64) (0.34) (0.51) (0.28) (0.46) (0.31) Almost Always Speaking Tested Language at Home 0.76 0.86 0.73 0.78 0.67 0.85 0.97 0.78 0.82 0.16 0.89 0.88 0.79 0.65 0.85 0.32 0.50 (0.43) (0.35) (0.45) (0.41) (0.47) (0.36) (0.18) (0.42) (0.39) (0.37) (0.31) (0.33) (0.41) (0.48) (0.36) (0.47) (0.50) SES: Numbers of Books in the Home 2.94 3.17 3.17 2.96 2.95 2.55 3.01 2.74 2.57 2.90 2.73 2.29 2.89 2.98 2.82 2.75 3.08 2.90 (1.13) (1.09) (1.10) (1.08) (1.16) (1.07) (1.25) (1.15) (1.04) (1.05) (1.07) (1.15) (1.13) (1.04) (1.15) (1.07) (1.08) (1.26) SES: Numbers of Items in the Home 6.30 8.47 8.09 8.89 5.31 7.48 8.39 6.47 8.61 8.94 6.92 6.28 7.34 8.04 6.28 7.95 7.76 5.85 (1.21) (1.69) (1.65) (1.29) (0.92) (1.35) (1.94) (1.82) (1.79) (1.76) (1.55) (2.55) (1.80) (1.50) (2.03) (1.79) (1.93) (1.71) Classroom Variables Class Size 19.33 21.13 21.61 21.25 22.63 20.61 22.09 20.10 20.00 21.40 20.91 20.00 19.67 19.67 32.13 28.90 37.00 28.03 (4.12) (5.30) (3.89) (3.98) (4.17) (5.73) (5.45) (4.59) (5.11) (4.85) (4.78) (5.86) (4.84) (4.13) (5.38) (8.54) (5.57) (4.60) Classroom SES: Average 
Numbers of Books in the Home 2.94 3.17 3.17 2.96 2.95 2.55 3.01 2.74 2.57 2.90 2.73 2.29 2.89 2.98 2.82 2.75 3.08 2.90 (0.48) (0.45) (0.46) (0.42) (0.57) (0.48) (0.65) (0.45) (0.48) (0.37) (0.53) (0.69) (0.59) (0.39) (0.50) (0.37) (0.47) (0.45) Classroom SES: Average Numbers of Items in the Home 6.30 8.47 8.09 8.89 5.31 7.48 8.39 6.47 8.61 8.94 6.92 6.28 7.34 8.04 6.28 7.95 7.76 5.85 (0.40) (0.57) (0.66) (0.48) (0.31) (0.53) (0.98) (0.64) (0.85) (0.62) (0.74) (1.72) (1.07) (0.52) (1.02) (0.52) (0.77) (0.55) Percent of Female Students 0.49 0.48 0.49 0.51 0.49 0.50 0.49 0.50 0.48 0.49 0.49 0.48 0.49 0.48 0.46 0.49 0.49 0.47 (0.14) (0.13) (0.11) (0.11) (0.10) (0.12) (0.13) (0.11) (0.13) (0.26) (0.14) (0.12) (0.12) (0.11) (0.17) (0.07) (0.21) (0.07) Teacher Variables Experience in Years 21.54 18.76 19.29 15.73 21.02 20.75 23.96 23.98 24.01 12.70 17.29 23.22 19.94 20.69 14.49 17.33 9.81 14.56 (11.58) (10.28) (12.27) (10.74) (11.00) (9.79) (9.92) (10.02) (8.49) (8.29) (8.62) (11.11) (10.02) (9.67) (8.24) (11.69) (9.11) (7.15) Complete Post-Secondary Education 0.99 0.92 0.87 0.86 0.94 0.98 0.97 0.21 0.97 0.86 0.97 0.57 0.99 0.99 0.96 0.92 0.87 0.98 (0.11) (0.26) (0.34) (0.35) (0.23) (0.13) (0.17) (0.41) (0.16) (0.34) (0.17) (0.50) (0.10) (0.11) (0.19) (0.27) (0.34) (0.14) Female 0.91 0.95 0.78 0.53 0.77 0.96 0.94 0.91 0.99 0.81 0.86 0.87 0.92 0.96 0.56 0.59 0.71 0.82 (0.28) (0.23) (0.41) (0.50) (0.42) (0.19) (0.24) (0.29) (0.10) (0.39) (0.34) (0.34) (0.27) (0.19) (0.50) (0.49) (0.45) (0.38) Instruction Time in Hours 3.98 4.13 4.08 3.11 4.57 3.74 4.04 5.82 4.06 5.43 7.17 4.02 3.73 4.44 4.16 3.72 5.49 3.14 (0.90) (0.90) (0.97) (0.41) (0.69) (0.83) (1.21) (1.37) (0.95) (1.32) (1.08) (1.03) (0.12) (0.78) (0.76) (0.27) (1.68) (0.91) NA 21 Table 1.2 (cont™d) AUT CZE DEU DNKESP HRV HUNITA LTU MLT PRT ROM SVK SVN HKGJPN SGP TWN School Variables Grade 4 Enrollment 27.15 26.50 42.49 35.86 36.22 52.82 33.43 70.13 35.18 41.48 26.45 29.82 25.60 30.54 105.78 58.61 273.05 108.96 (22.37) 
(21.49) (25.37) (21.51) (23.13) (33.47) (25.20) (46.61) (55.25) (26.80) (25.58) (32.26) (26.82) (22.55) (48.65) (41.50) (88.83) (120.63) Income Level of the School's Immediate Area: Low 0.25 0.48 0.21 0.22 0.29 0.34 0.59 0.18 0.68 0.14 0.39 0.65 0.56 0.42 0.06 0.05 0.07 0.06 (0.43) (0.50) (0.41) (0.41) (0.46) (0.47) (0.49) (0.39) (0.47) (0.35) (0.49) (0.48) (0.50) (0.49) (0.23) (0.21) (0.25) (0.24) Income Level of the School's Immediate Area: Medium 0.71 0.50 0.71 0.64 0.66 0.64 0.40 0.71 0.32 0.85 0.60 0.33 0.43 0.56 0.40 0.80 0.74 0.60 (0.45) (0.50) (0.45) (0.48) (0.47) (0.48) (0.49) (0.46) (0.47) (0.36) (0.49) (0.47) (0.50) (0.50) (0.49) (0.40) (0.44) (0.49) Income Level of the School's Immediate Area: High 0.04 0.02 0.07 0.14 0.04 0.02 0.01 0.11 0.00 0.01 0.01 0.02 0.01 0.02 0.54 0.15 0.19 0.34 (0.19) (0.14) (0.26) (0.35) (0.20) (0.14) (0.08) (0.31) (0.00) (0.10) (0.11) (0.14) (0.10) (0.14) (0.50) (0.36) (0.39) (0.48) City Size: 0-3,000 0.56 0.58 0.27 0.38 0.27 0.36 0.45 0.11 0.58 0.29 0.43 0.46 0.58 0.53 0.37 0.21 1.00 0.05 (0.50) (0.50) (0.44) (0.49) (0.44) (0.48) (0.50) (0.32) (0.50) (0.46) (0.50) (0.50) (0.49) (0.50) (0.49) (0.41) (0.00) (0.22) City Size: 3,001-15,000 0.26 0.17 0.27 0.22 0.15 0.37 0.21 0.40 0.13 0.60 0.27 0.39 0.15 0.22 0.50 0.30 0.00 0.24 (0.44) (0.38) (0.44) (0.41) (0.35) (0.48) (0.41) (0.49) (0.33) (0.49) (0.44) (0.49) (0.36) (0.42) (0.50) (0.46) (0.00) (0.43) City Size: 15,001-50,000 0.03 0.10 0.22 0.19 0.16 0.08 0.13 0.18 0.09 0.10 0.12 0.06 0.13 0.09 0.07 0.13 0.00 0.26 (0.18) (0.31) (0.42) (0.39) (0.37) (0.28) (0.33) (0.39) (0.28) (0.30) (0.32) (0.24) (0.34) (0.29) (0.25) (0.34) (0.00) (0.44) City Size: 50,001-100,000 0.02 0.06 0.06 0.09 0.13 0.06 0.05 0.12 0.02 0.00 0.03 0.03 0.07 0.03 0.05 0.30 0.00 0.27 (0.13) (0.24) (0.23) (0.29) (0.33) (0.23) (0.23) (0.33) (0.15) (0.00) (0.18) (0.17) (0.25) (0.18) (0.21) (0.46) (0.00) (0.44) City Size: 100,001-500,000 0.05 0.06 0.10 0.05 0.16 0.05 0.09 0.07 0.13 0.00 0.07 0.05 0.04 0.08 0.01 
0.06 0.00 0.15
(0.21) (0.23) (0.30) (0.21) (0.37) (0.22) (0.28) (0.26) (0.33) (0.00) (0.26) (0.22) (0.19) (0.27) (0.12) (0.24) (0.00) (0.36)
City Size: >500,000 0.08 0.03 0.09 0.08 0.14 0.08 0.08 0.11 0.06 0.01 0.08 0.02 0.03 0.04 0.00 0.00 0.00 0.03
(0.27) (0.18) (0.28) (0.28) (0.34) (0.27) (0.27) (0.31) (0.24) (0.10) (0.28) (0.14) (0.18) (0.20) (0.00) (0.00) (0.00) (0.17)
Schools 158 177 197 216 151 152 149 202 154 96 147 148 197 195 136 149 176 150
Classes 276 235 205 216 200 295 249 239 277 197 240 246 314 243 137 149 351 155
Students 4668 4578 3995 3987 4183 4584 5204 4200 4688 3607 4042 4673 5616 4492 3957 4411 6368 4284

Note: Weighted means are reported. Standard deviations are in parentheses. Variable "SES: Numbers of Books in the Home" takes values from one to five indicating 0-10 books, 11-25 books, 26-100 books, 101-200 books and more than 200 books; it was used as a continuous variable in our analysis for simplicity. Country abbreviations: AUT = Austria, CZE = Czech Republic, DEU = Germany, DNK = Denmark, ESP = Spain, HRV = Croatia, HUN = Hungary, ITA = Italy, LTU = Lithuania, MLT = Malta, PRT = Portugal, ROM = Romania, SVK = Slovak Republic, SVN = Slovenia, JPN = Japan, TWN = Chinese Taipei, HKG = Hong Kong, SGP = Singapore.

Regression Results

The class size estimates of the OLS regression analysis are summarized in Table 1.3. Negative coefficients of class size indicate that student achievement increases as class size decreases, while positive coefficients indicate that student achievement increases as class size increases. The regression coefficients of class size were negative in eight of the 18 countries, but none of them was significant at the 0.05 level after controlling for student, teacher/classroom, and school characteristics. Significant and positive class size coefficients were found in Croatia, Hong Kong and Malta.

As we discuss in the method section, OLS results might be biased because of omitted variables.
I analyzed the impact of the omitted variables (unobservable confounding variables) using the approach of Frank (2000). The method is based on the idea that for a confounding variable to change the significance of the variable of interest (e.g., class size), it should be correlated with both the variable of interest and the dependent variable. Frank (2000) developed formulas to calculate the minimum correlations necessary to invalidate the inference. He defined the Impact Threshold for a Confounding Variable (ITCV) as the lowest product of the partial correlation between the dependent variable and the confounding variable and the partial correlation between the variable of interest and the confounding variable that makes the coefficient insignificant. The higher the absolute value of the ITCV, the more robust the OLS estimate is. For the countries and districts with significant estimates, Table 1.4 presents the ITCVs and the corresponding minimum correlations between student scores and the confounding variable, and between class size and the confounding variable, that would invalidate the inference from the OLS results.

Table 1.3: OLS Regression Estimates and Standard Errors of Class Size

Country   Class Size   (SE)     Number of Schools   Number of Students   R-sq
AUT       -0.78        (0.59)   158                 4637                 0.258
CZE        0.55        (0.54)   173                 4517                 0.254
DEU        0.80        (0.51)   176                 3565                 0.311
DNK        0.68        (0.72)   156                 2965                 0.222
ESP       -0.39        (0.67)   139                 3883                 0.277
HRV        0.60*       (0.30)   147                 4427                 0.203
HUN        0.00        (0.53)   144                 4858                 0.439
ITA       -0.41        (0.78)   179                 3690                 0.193
LTU       -0.51        (0.42)   151                 4556                 0.283
MLT        0.98*       (0.42)    89                 3346                 0.213
PRT       -0.62        (0.84)   143                 3791                 0.304
ROM        0.19        (1.08)   138                 4359                 0.352
SVK       -0.87        (0.60)   192                 5413                 0.328
SVN        0.17        (0.57)   188                 4366                 0.236
HKG        2.71*       (0.45)   119                 3452                 0.381
JPN       -0.08        (0.22)   148                 4389                 0.219
SGP        0.60        (0.45)   176                 6240                 0.453
TWN       -0.84        (0.75)   145                 4155                 0.240

* p < 0.05
Note: Standard errors are in parentheses. Country abbreviations: AUT = Austria, CZE = Czech Republic, DEU = Germany, DNK = Denmark, ESP = Spain, HRV = Croatia, HUN = Hungary, ITA = Italy, LTU = Lithuania, MLT = Malta, PRT = Portugal, ROM = Romania, SVK = Slovak Republic, SVN = Slovenia, JPN = Japan, TWN = Chinese Taipei, HKG = Hong Kong, SGP = Singapore.

Table 1.4: Analysis of the Impact of Unobservable Confounding Variables

            Croatia   Malta   Hong Kong
ITCV|z      0.058     0.072   0.132
Rxcv|z      0.241     0.269   0.363
Rycv|z      0.241     0.269   0.363

It should be noted that the correlation coefficients shown in Table 1.4 are partial correlations that condition on the covariates included in equation (1.1). For instance, in Hong Kong, the result indicates that to invalidate the inference an omitted variable would have to be correlated at 0.363 with class size and at 0.363 with mathematics achievement, conditional on all the covariates in equation (1.1). The partial correlation coefficients shown in Table 1.4 ranged from 0.241 to 0.363, which is somewhat large given that they are conditional on a group of student, teacher/classroom, and school covariates. However, it is still difficult to tell whether the significant coefficients in Croatia, Malta, and Hong Kong were robust to omitted variables because TIMSS does not provide information such as prior achievement and family income, which are usually highly correlated with student achievement and class size.
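The link between each ITCV and the pair of partial correlations reported alongside it can be checked numerically. The sketch below assumes that the impact is split equally between the two partial correlations, so each equals the square root of the ITCV. That assumption matches the identical Rxcv|z and Rycv|z values in Table 1.4 but is not stated in the text, and the function name is hypothetical.

```python
import math
from typing import Tuple


def component_correlations(itcv: float) -> Tuple[float, float]:
    """Equal partial correlations whose product equals the given ITCV."""
    if not 0.0 <= itcv <= 1.0:
        raise ValueError("itcv must lie between 0 and 1")
    r = math.sqrt(itcv)
    return r, r


# ITCV values reported in Table 1.4.
for country, itcv in [("Croatia", 0.058), ("Malta", 0.072), ("Hong Kong", 0.132)]:
    r_xcv, r_ycv = component_correlations(itcv)
    print(country, round(r_xcv, 3), round(r_ycv, 3))
```

The results agree with the tabled correlations to within rounding of the reported ITCVs.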
The positive class size coefficients are somewhat puzzling. One possible explanation is that parents chose high quality schools for their children, which increased school enrollment and thus increased the average class size in high quality schools.

IV Results

The first stage regression results are summarized in Table 1.5, and the IV estimates of class size effects in Table 1.6. In 12 countries the first stage regression coefficients of computed average class size were significant and positive, and the t-statistic of the instrument was larger than 3.20. This indicates that the correlation between reported class size and the instrument is strong enough in these countries (Staiger & Stock, 1997). It should be noted that, for Hong Kong, although the absolute value of the t-statistic was larger than 3.20, it was negative, which indicates that the computed class size and the teacher-reported class size were negatively correlated. That is because, in Hong Kong, the maximum class size rules applied to only some of the primary schools. However, there was no information in the TIMSS data to identify which schools had to follow the rules. Therefore, our IV methods were not appropriate for Hong Kong. In Denmark, the first stage regression coefficient was also significant and positive, but the t-statistic of the instrument was smaller than 3.20. In Croatia, Italy, Malta, and Singapore, the first stage regression coefficients were insignificant, which indicates that the IVs were quite weak in these countries.

To sum up, the IVs might not be valid in Hong Kong, Malta and Spain; also, in Croatia, Denmark, Italy, Malta and Singapore, the IVs were weak, which made the IV estimates and inference unreliable. Therefore, I will focus on IV estimates from countries with valid and strong IVs. The IV estimates of class size are summarized in Table 1.6.
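The screening of instruments described above can be written out as a small sketch. It uses the first-stage t-statistics reported in Table 1.5 and the 3.20 cut-off from the text; with a single instrument the first-stage F-statistic is the square of the t-statistic, so 3.20 corresponds to an F of about 10.24, in line with the common F > 10 rule of thumb. The function name and the three labels are illustrative assumptions, not terms from the original analysis.

```python
# First-stage t-statistics of the computed average class size (Table 1.5).
first_stage_t = {
    "AUT": 5.23, "CZE": 7.94, "DEU": 5.06, "ESP": 8.96, "HUN": 5.28,
    "LTU": 5.49, "PRT": 3.38, "ROM": 7.38, "SVK": 6.62, "SVN": 5.52,
    "JPN": 8.12, "TWN": 7.45, "HKG": -3.28, "DNK": 2.54, "HRV": 0.72,
    "ITA": 1.70, "MLT": 1.71, "SGP": 1.51,
}

T_THRESHOLD = 3.20  # cut-off used in the text; 3.20 ** 2 = 10.24


def classify_instrument(t_stat, threshold=T_THRESHOLD):
    """Label an instrument by the sign and size of its first-stage t-statistic."""
    if t_stat >= threshold:
        return "strong"
    if t_stat <= -threshold:
        return "wrong sign"  # large in absolute value but negative
    return "weak"


groups = {}
for country, t_stat in first_stage_t.items():
    groups.setdefault(classify_instrument(t_stat), []).append(country)

for label, countries in groups.items():
    print(label, countries)
```

This classification only addresses instrument strength. Validity concerns, such as those raised for Spain and Malta, have to be assessed separately.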
The coefficients for Austria, Lithuania, Portugal, Slovenia, Japan and Chinese Taipei were negative but insignificant. The coefficients for the Czech Republic, Germany, and Hungary were positive but insignificant. The estimated class size effects were negative and significant at the 0.05 level in only two countries: Romania and the Slovak Republic. The magnitudes of the class size coefficients for Romania and the Slovak Republic were about 4.5 points, which is equivalent to 0.045 SD among all fourth graders who participated in TIMSS 2011. Such results indicate that a one-student reduction in class size would increase student mathematics achievement by about 4.5 points (or 0.045 SD) on average on the TIMSS scale. To facilitate interpretation, we transformed our estimates to effect sizes (standard deviation units) assuming a reduction in class size of 10 students. The effect sizes were 0.48 SD and 0.44 SD for Romania and the Slovak Republic, respectively. Such effect sizes are quite substantial in magnitude and larger than estimates reported in prior studies (e.g., Angrist & Lavy, 1999). A reduction of eight students, which was by and large the average difference in the number of students between regular-size and small classes in Project STAR, would imply an increase in mathematics achievement of nearly one-third of an SD. This is a considerable effect, given that the average benefit for students in small classes in Project STAR was nearly 0.20 SD.

Table 1.5: First Stage Regression Estimates and Standard Errors of the Computed Average Class Size

Country  IV: Computed average class size  (SE)    T-statistic for IV  Schools  Students
AUT       0.50*                           (0.10)   5.23               158      4637
CZE       0.62*                           (0.08)   7.94               173      4517
DEU       0.47*                           (0.09)   5.06               176      3565
ESP       0.70*                           (0.08)   8.96               139      3883
HUN       0.50*                           (0.09)   5.28               144      4858
LTU       0.59*                           (0.11)   5.49               151      4556
PRT       0.38*                           (0.11)   3.38               143      3791
ROM       0.59*                           (0.08)   7.38               138      4359
SVK       0.49*                           (0.07)   6.62               192      5413
SVN       0.45*                           (0.08)   5.52               188      4366
JPN       0.82*                           (0.10)   8.12               148      4389
TWN       0.73*                           (0.11)   7.45               145      4155
HKG      -0.50*                           (0.15)  -3.28               119      3452
DNK       0.35*                           (0.14)   2.54               156      2965
HRV       0.10                            (0.13)   0.72               147      4427
ITA       0.26                            (0.16)   1.70               179      3690
MLT       0.28                            (0.16)   1.71                89      3346
SGP       0.28                            (0.19)   1.51               176      6240

* p < 0.05
Note: Standard errors are in parentheses. The table groups countries into those with a strong IV and those with a weak IV (Denmark, Croatia, Italy, Malta and Singapore; see text).
Country abbreviations: AUT = Austria, CZE = Czech Republic, DEU = Germany, DNK = Denmark, ESP = Spain, HRV = Croatia, HUN = Hungary, ITA = Italy, LTU = Lithuania, MLT = Malta, PRT = Portugal, ROM = Romania, SVK = Slovak Republic, SVN = Slovenia, JPN = Japan, TWN = Chinese Taipei, HKG = Hong Kong, SGP = Singapore.

Table 1.6: Second Stage Regression Estimates and Standard Errors of Class Size

Country  Class size  (SE)    Schools  Students
AUT      -1.82       (1.27)  158      4637
CZE       0.24       (1.16)  173      4517
DEU       1.33       (1.26)  176      3565
HUN       0.45       (1.49)  144      4858
LTU      -1.25       (1.28)  151      4556
PRT      -3.80       (2.67)  143      3791
ROM      -4.84*      (2.28)  138      4359
SVK      -4.40*      (1.58)  192      5413
SVN      -1.87       (1.32)  188      4366
JPN      -0.81       (0.46)  148      4389
TWN      -0.83       (1.23)  145      4155

* p < 0.05
Note: Standard errors are in parentheses.
Country abbreviations: AUT = Austria, CZE = Czech Republic, DEU = Germany, HUN = Hungary, LTU = Lithuania, PRT = Portugal, ROM = Romania, SVK = Slovak Republic, SVN = Slovenia, JPN = Japan, TWN = Chinese Taipei.

Comparison of Regression and IV Estimates

Finally, I examined whether the IV estimates were indeed different from the regression estimates, which could be biased. To compare OLS and IV estimates, we used the Durbin-Wu-Hausman test (Durbin, 1954; Hausman, 1978; Wooldridge, 2010; Wu, 1973). Specifically, we ran the regression

$$ Score_i = \beta_0 + \beta_1 Residual_i + \beta_2 ClassSize_i + ST_i \mathbf{B}_2 + CL_i \mathbf{B}_3 + SC_i \mathbf{B}_4 + \varepsilon_i \tag{1.6} $$

where $Residual_i$ is the residual term from regression equation (1.3).
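The regression in equation (1.6) can be illustrated on simulated data. The sketch below is not the dissertation's analysis: it assumes a single hypothetical instrument z, no covariates, survey weights or plausible values, and made-up coefficients, and it solves the least-squares problem with a small hand-written routine so that only the Python standard library is needed. It illustrates that the coefficient on the first-stage residual moves away from zero when the regressor is endogenous, and that the coefficient on the regressor in the augmented regression coincides with the simple IV estimate.

```python
import random


def ols(columns, y):
    """OLS coefficients [intercept, b1, b2, ...] from the normal equations."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Augmented system [X'X | X'y].
    a = [[sum(r[p] * r[q] for r in rows) for q in range(k)]
         + [sum(r[p] * yi for r, yi in zip(rows, y))] for p in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for p in range(k):
        pivot = max(range(p, k), key=lambda r: abs(a[r][p]))
        a[p], a[pivot] = a[pivot], a[p]
        for r in range(k):
            if r != p:
                factor = a[r][p] / a[p][p]
                a[r] = [u - factor * v for u, v in zip(a[r], a[p])]
    return [a[p][k] / a[p][p] for p in range(k)]


def cov(u, v):
    """Population covariance of two equally long sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((s - mu) * (t - mv) for s, t in zip(u, v)) / len(u)


rng = random.Random(0)
n = 2000
z = [rng.gauss(0, 1) for _ in range(n)]  # hypothetical instrument
c = [rng.gauss(0, 1) for _ in range(n)]  # unobserved confounder
# Simulated "class size" and "score"; the true slope of y on x is -0.5.
x = [0.6 * zi + ci + rng.gauss(0, 1) for zi, ci in zip(z, c)]
y = [-0.5 * xi + 2.0 * ci + rng.gauss(0, 1) for xi, ci in zip(x, c)]

# First stage: regress x on z and keep the residuals.
pi0, pi1 = ols([z], x)
resid = [xi - (pi0 + pi1 * zi) for xi, zi in zip(x, z)]

b_ols = ols([x], y)[1]                # naive OLS slope
_, b_resid, b_x = ols([resid, x], y)  # augmented regression, as in eq. (1.6)
b_iv = cov(z, y) / cov(z, x)          # simple IV (Wald) estimate

print("OLS slope:", round(b_ols, 3))
print("augmented-regression slope:", round(b_x, 3), "IV slope:", round(b_iv, 3))
print("coefficient on first-stage residual:", round(b_resid, 3))
```

In this simulation the OLS slope is pulled upward by the confounder, while the augmented regression reproduces the IV slope. A formal test would compare the residual coefficient with its standard error, which this sketch does not compute.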
The idea is that once we control for reported class size (and other covariates), the coefficient of the residuals should not be significant unless there is omitted variable bias. A significant $\beta_1$ indicates that the regression and IV estimates are different. It also indicates that reported class size is endogenous, that is, reported class size is correlated with omitted variables that are part of the error term of equation (1.3).

Table 1.7: Results from Durbin-Wu-Hausman Test

Country  First stage residual  (SE)    Schools  Students
AUT       1.35                 (1.34)  158      4637
CZE       0.43                 (1.14)  173      4517
DEU      -0.69                 (1.38)  176      3565
HUN      -0.53                 (1.52)  144      4858
LTU       0.87                 (1.26)  151      4556
PRT       3.46+                (2.00)  143      3791
ROM       6.67*                (2.22)  138      4359
SVK       4.38*                (1.68)  192      5413
SVN       2.55+                (1.33)  188      4366
JPN       0.95+                (0.50)  148      4389
TWN      -0.02                 (1.43)  145      4155

* p < 0.05, + p < 0.10
Note: Standard errors are in parentheses.
Country abbreviations: AUT = Austria, CZE = Czech Republic, DEU = Germany, HUN = Hungary, LTU = Lithuania, PRT = Portugal, ROM = Romania, SVK = Slovak Republic, SVN = Slovenia, JPN = Japan, TWN = Chinese Taipei.

Table 1.7 summarizes the results of the Durbin-Wu-Hausman test for the full samples. Estimates significant at the 0.05 level were found in Romania and the Slovak Republic, while estimates significant at the 0.10 level were found in Portugal, Slovenia, and Japan. The results suggest that the regression and IV estimates are different in these countries. They also indicate that reported class size was endogenous and thus correlated with omitted variables in these countries. These results support the notion that IV analysis was necessary and that the IV estimates should capture the causal effects of class size on student achievement in Romania and the Slovak Republic. For the other countries with strong and valid instruments (Austria, the Czech Republic, Germany, Hungary, Lithuania, and Chinese Taipei), the results indicate that estimates from regression and IV analyses were overall similar.
These findings may suggest that there is little bias from omitted variables in the regression analysis in these countries.

Discussion

I investigated the effects of class size on mathematics achievement for fourth graders in 18 countries and districts in 2011 using rich data from TIMSS. These European and Asian countries and districts had maximum class size limits, which allowed me to use an IV approach to explore the causal effects of class size on student achievement. Both regression analyses and IV analyses were conducted. By and large, I did not observe significant class size effects in most countries. Significant class size coefficients at the 0.05 level were found in Romania and the Slovak Republic. These coefficients indicated that class size reductions increased mathematics achievement significantly and meaningfully.

The estimates produced from the IV analysis were somewhat different from those from the OLS analysis in some countries. The Durbin-Wu-Hausman test provided some evidence that reported class size was correlated with omitted variables in some countries and that the IV analysis was necessary and provided valid estimates of class size effects in Romania and the Slovak Republic. In other countries, however, the regression estimates were similar to the IV estimates, which suggests that regression estimates were as good as IV estimates.

Generally, the results indicated no systematic pattern of association between class size and achievement. For nine of the eleven countries and districts with strong and valid IVs, no class size effects were found. The exceptions were Romania and the Slovak Republic. These significant class size effects were quite substantial in magnitude compared to prior studies (e.g., Angrist & Lavy, 1999).
Nonetheless, my findings are consistent with findings of previous work that used prior cycles of TIMSS assessments and indicated generally no significant relationships between class size and achievement (Pong & Pallas, 2001; Wossmann, 2005; Wossmann & West, 2006). Romania and the Slovak Republic are not as wealthy or developed as the other European countries in our sample, which might indicate that school resources such as class size reduction may play a more important role in less wealthy countries.

Unfortunately, TIMSS does not provide data about classroom dynamics, instruction, and practices, and therefore it is difficult to know exactly why we failed to detect class size effects in most countries. Prior studies have suggested that class size reduction has positive effects when teachers spend more time on individualized instruction or when pupils become more involved in learning activities (e.g., Finn & Achilles, 1990). Perhaps in most of my samples teachers did not utilize individualized instruction when class size was reduced. Also, perhaps students were not as actively involved in learning activities when class size was reduced.

One possible limitation of our estimates is related to the enrollment information we used in our models. Specifically, enrollment information from the beginning of the school year can predict average class size more accurately (see Angrist & Lavy, 1999). However, the enrollment information available in TIMSS is at the time of testing, which is near the end of the school year. Thus, we could not control for any enrollment changes during the school year. If potential changes in enrollment are not random, our results might be biased, which is a potential limitation of our study (Wossmann, 2005).

Another potential limitation is that our IV method may not be valid.
Although we tested whether covariates were locally balanced across schools around the cut-offs, it is unclear whether enrollment influences student achievement only through class size once enrollment and other important covariates are controlled for. If class size is related to unobserved variables that we could not control for (e.g., parental education level or family income), then our IV estimates may be biased.

CHAPTER 2

DOES CLASS SIZE REDUCTION CLOSE THE ACHIEVEMENT GAP

Introduction

The effects of class size on student achievement have been discussed repeatedly in education research and policy in the past decades. Meta-analytic reviews of early work on small class effects (e.g., Glass & Smith, 1979) and studies using data from a high-quality large-scale experiment (e.g., Finn & Achilles, 1990) indicated a positive relationship between small classes and student achievement. In particular, evidence from Project STAR (Student-Teacher Achievement Ratio) in Tennessee has strongly indicated achievement improvements for students in small classes compared to regular-size classes (e.g., Krueger, 1999; Nye, Hedges, & Konstantopoulos, 2000). These findings suggest that reducing class size is a promising policy option to increase academic achievement, on average, for all students.

Besides improving average student achievement, another critical objective of education interventions is to increase achievement for students at risk, and thus reduce the achievement gap between lower- and higher-achieving students. Class size reduction (CSR) has been advocated as such an intervention by some researchers (e.g., Finn & Achilles, 1990). One way to evaluate whether CSR can close the achievement gap is to examine the interaction effect between class size and student background characteristics such as gender, socioeconomic status (SES), and minority status. Prior studies have typically focused on the average effects of class size on student achievement for all students.
Only a few studies have examined the differential class size effects for subgroups of students, most of which have used data from Project STAR. The findings of these studies were mixed. For example, Finn and Achilles (1990) found some evidence that the positive effects of small classes were larger for minority students, especially in kindergarten and first grade, while Nye, Hedges, and Konstantopoulos (2002) found weak or no evidence for differential effects of small classes on minority and low-SES students.

Another way to evaluate whether CSR can close the achievement gap is to estimate the differential class size effects across the student achievement distribution using quantile regression. Konstantopoulos (2008) used quantile regression to evaluate the small class effects for students in the middle and tails of the achievement distribution using data from Project STAR and found that reductions in class size did not reduce the achievement gap between low- and high-achievers in the early grades. Later studies have reported similar findings using the same data (Ding & Lehrer, 2011; Jackson & Page, 2013). Nevertheless, there is some evidence that the cumulative effects of being in a small class from kindergarten through third grade may reduce the achievement gap in reading and science in some of the later grades four through eight (Konstantopoulos & Chung, 2009).

However, no recent study has used current data to evaluate whether CSR closes the achievement gap. Chapter 2 was designed to fill that gap in the literature and explore the differential class size effects for students with different levels of achievement. In particular, Chapter 2 examined the effects of class size across the student achievement distribution (i.e., middle and upper or lower tails), in an attempt to address the question of whether CSR closes the achievement gap between high- and low-achievers, using the latest cycle of a large-scale international assessment program.
Specifically, I used the data from the 2011 fourth grade sample of the Trends in International Mathematics and Science Study (TIMSS). I utilized maximum class size rules available in some countries to gauge class size effects on mathematics achievement. I employed quantile regression to estimate class size effects on student achievement in the middle as well as in the lower and upper tails of the achievement distribution. To deal with the potential endogeneity of class size, I computed the average class size in a school based on the maximum class size rule in each country, which was used as an instrumental variable (IV) for class size. I used the control function approach (see Lee, 2007) to estimate the differential causal effects of class size on fourth graders' mathematics achievement.

Chapter 2 contributes to the existing literature in two ways. First, I used the most recent TIMSS data from 2011, which allowed me to evaluate recent, concurrent CSR policies and to compare class size effects across Asian and European countries and districts. Second, I used quantile regression coupled with IV to evaluate causal class size effects across the achievement distribution. To my knowledge, the TIMSS data have not been used to examine differential class size effects, although some researchers have used previous cycles of TIMSS assessment to evaluate average class size effects (e.g., Pong & Pallas, 2001; Wossmann, 2005; Wossmann & West, 2006).

Literature Review

During the past three decades, researchers explored the effects of class size reduction on student achievement through meta-analyses, experimental and quasi-experimental designs (e.g., regression discontinuity, RD), as well as other advanced statistical methods such as IV. Most researchers have focused exclusively on estimating mean differences in student achievement between small and regular-size classes (Konstantopoulos, 2008).
For example, meta-analytic reviews of early work on small class effects have indicated a positive relationship between small classes and student achievement, but the magnitude of the effect was small (e.g., Glass & Smith, 1979; Slavin, 1989). Project STAR is viewed as the most impressive and most powerful field experiment about class size effects in education (Mosteller, 1995). There have been numerous analyses of the Tennessee STAR data that have produced estimates with high internal validity. Finn and Achilles (1990) were the first to analyze these data, and they found that students in small classes performed better than those in regular classes in all subject areas and in every year of the experiment (kindergarten through third grade). Nye, Hedges, and Konstantopoulos (2000) examined the validity of Project STAR, and they suggested that the effects of class size might be underestimated because of imperfect implementation. They also found that the estimated class size effects were consistent with those from Glass and Smith (1979).

Researchers also attempted to evaluate average class size effects using observational data. The main difficulty of analyzing observational data is that the internal validity (or unbiasedness) of estimates in observational or quasi-experimental studies is not easy to achieve. That is, researchers have to use advanced statistical methods to ensure high internal validity of estimates for observational data. Previous work has utilized different analytic methods to examine class size effects on student achievement. For example, Pong and Pallas (2001) used multilevel models to analyze TIMSS 1995 data in nine different countries and found no class size effects on eighth grade achievement except in the U.S.
Other researchers have used IV methods to analyze observational data in an attempt to explore the causal effects of class size reduction (e.g., Akerhielm, 1995; Hoxby, 2000; Cho, Glewwe, & Whitler, 2012; Wossmann & West, 2006). One of the best instruments used to capture class size effects was introduced by Angrist and Lavy (1999). They used the Maimonides rule that sets the maximum class size to 40 students per classroom in order to evaluate the effect of class size on student achievement in Israel. The authors used this maximum class size rule of 40 to construct IV estimates of class size on test scores. They found a statistically significant effect of small classes on fifth grade reading and mathematics scores. However, they found no significant effects of class size on third grade scores.

Several other researchers have also used maximum class size rules as IV to evaluate class size effects. For instance, Bonesronning (2003) investigated class size effects using a maximum class size rule of 30 students per classroom in Norway. His analysis indicated small class effects. Wossmann (2005) explored class size effects in Europe using data from TIMSS 1995 for eighth grade students. He found two statistically significant and negative relationships between class size and student achievement in Norway and Iceland. He also found a statistically significant but positive relationship between class size and student achievement in Switzerland. For Denmark, France, Germany, Greece, Ireland, Spain, and Sweden, the estimates were not significant. A recent study about class size effects on fourth grade reading achievement in Greece also reported statistically insignificant estimates (Konstantopoulos & Traynor, 2014). Urquiola (2006) studied third-grade students in Bolivia and found significant class size effects, with effect sizes as large as 0.30 standard deviations, bigger than the effects found in Project STAR in the U.S. and in Israel.
Class size reduction can potentially affect average student achievement as well as the achievement gap among subgroups of students. In other words, interactions between class size effects and student background, such as student SES or achievement level, are possible (Konstantopoulos & Chung, 2009). If economically disadvantaged students or low-achieving students benefit more from being in smaller classes, CSR would decrease the achievement gap. However, most prior studies have focused on the average class size effects, while only a few studies have explored the interaction effects between class size and student backgrounds or achievement levels.

The differential effects of class size have traditionally been determined through statistical interactions between class size and student variables such as gender, SES, and race. Project STAR data have been used to examine such interaction effects. For example, early analyses have reported that class size reduction had larger positive effects for minority students (see Finn & Achilles, 1990). These average differences were significant for reading achievement for the first two years of the experiment (kindergarten and first grade). However, more recent studies could not fully replicate these findings. For example, Nye, Hedges, and Konstantopoulos (2000) found weak evidence that class size reduction had larger benefits for minority students. Also, Nye, Hedges, and Konstantopoulos (2002) examined the differential effects of small classes for students who were low-achievers in previous grades, and they found no evidence of additional small class benefits for these students.

Several non-experimental studies have also evaluated class size effects for subgroups of students, and almost none of them found differential class size effects.
For example, Hoxby (2000) analyzed data from a natural experiment in Connecticut and found no evidence of class size effects at schools that served high percentages of economically disadvantaged or minority students. In a similar study, Cho, Glewwe, and Whitler (2012) found that the estimated class size effects did not differ by race/ethnicity, gender, or free lunch eligibility. One exception was the study by Jepsen and Rivkin (2009), which found differential class size effects among subgroups. They analyzed the CSR policy in California and found that this policy initially helped economically advantaged (both in family background and performance) students more than their less affluent peers.

One appropriate method of examining differential class size effects at different levels of achievement is quantile regression, which examines class size effects across the entire student achievement distribution. Konstantopoulos (2008) employed this approach to estimate class size effects at the tenth, twenty-fifth, fiftieth, seventy-fifth, and ninetieth quantiles, using data from Project STAR. He also constructed t-tests to examine whether the estimates were statistically different across quantiles and found some evidence that higher-achieving students benefited more from being in small classes in certain early grades than other students. Later studies confirmed such findings (e.g., Ding & Lehrer, 2011; Jackson & Page, 2013). Nevertheless, Konstantopoulos and Chung (2009) examined the long-term effects of class size across the student achievement distribution. They found that for certain grades (fourth and sixth grade) in reading and science, low-achievers benefited more from being in small classes consistently in the early grades, while for other grades, no differential class size effects were found.

Very few previous studies examined quantile-specific class size effects using non-experimental data. To our knowledge, there were only two studies.
One is by Levin (2001), who used quantile regression as well as IV methods through two-stage least absolute deviations (2SLAD) (Amemiya, 1987) to estimate the causal effects of class size on scholastic achievement across various points in the conditional distributions of mathematics and language achievement of Dutch primary school students. He did not find any significant class size effects at any quantile. Levin (2001) did not examine differences between estimates across quantiles, and thus it is not clear whether CSR reduced the achievement gap. Ma and Koenker (2006) reanalyzed Levin's data and found that, for mathematics scores, lower-achieving students benefited more from smaller classes while average and high-achieving students did not benefit from smaller classes.

To sum up, it is not very clear whether class size reduction would decrease the achievement gap; also, only a few studies have evaluated class size effects across the achievement distribution. It is necessary to provide more evidence of class size effects across the achievement distribution using concurrent data.

Method

In this chapter, I also used the data from TIMSS 2011 and focused on fourth grade mathematics achievement. I analyzed the same countries as in Chapter 1. Table 1.1 provides details about the selected countries as well as their upper class size limits.

Quantile Regression

The objective of my study was to examine class size effects across the distribution of fourth graders' mathematics achievement, especially the effects in the upper and lower tails of the distribution. Ordinary least squares (OLS) regression fails to describe the full distributional impact of class size on student achievement, unless the lower-achievers and higher-achievers benefit the same from smaller classes as students in the middle of the achievement distribution.
Quantile regression (Koenker & Bassett, 1978) is a tool that allows researchers to estimate quantile-specific class size effects, not only in the middle but also in the tails of the conditional student mathematics achievement distribution. Thus, we used quantile regression and compared quantile-specific class size effects across different quantiles of the achievement distribution to evaluate whether CSR closes or enlarges the achievement gap. I evaluated class size effects at the tenth, twenty-fifth, fiftieth, seventy-fifth, and ninetieth quantiles through the following equation

$$ Score_i = \beta_0 + \beta_1 ClassSize_i + ST_i \mathbf{B}_2 + CL_i \mathbf{B}_3 + SC_i \mathbf{B}_4 + \varepsilon_i \tag{2.1} $$

where $Score_i$ represents mathematics scores, $\beta_0$ is the constant term, $ClassSize_i$ is the main independent variable, $\beta_1$ represents the class size effect and is the coefficient of interest, $ST_i$ is a row vector of student background characteristics, $\mathbf{B}_2$ is a column vector of regression coefficients of student characteristics, $CL_i$ is a row vector of classroom or teacher characteristics, $\mathbf{B}_3$ is a column vector of regression coefficients of teacher and classroom characteristics, $SC_i$ is a row vector of school characteristics, $\mathbf{B}_4$ is a column vector of regression coefficients of school characteristics, and $\varepsilon_i$ is the error term.

Instrumental Variable and Control Function

An important issue to consider in estimating quantile-specific class size effects is that class size may be endogenous because of omitted variable bias. The relative position of students in the conditional achievement distribution could be related to systematic differences in unobservables, such as motivation, family background, and school or teacher quality. In that case, the estimated class size effect from equation (2.1) cannot reflect the true quantile-specific class size effect. Because students and teachers are rarely randomly assigned to classrooms in a grade, class size might be correlated with unobserved characteristics of students or teachers.
For example, in order to help low-achieving students, some schools might assign higher quality teachers to classes with higher proportions of low achievers. Variables that determine assignment of students and teachers to classes are not typically measured. For example, student motivation, family income, parental pressure, teacher quality, etc. are rarely available in observational datasets. In addition, cross-sectional data rarely provide indexes of prior ability or performance. Although we included as many covariates as we could in our multiple regression analysis, it is still possible that unobservable factors that are part of the error term in equation (2.1) are correlated with class size. If that were true, then the estimated class size effect in equation (2.1) would be biased.

In general, there are two sources of omitted variable bias that are related to student mathematics achievement, and to class size as well. First, students do not choose schools randomly but typically attend schools in their neighborhoods. Therefore, students within the same school might share common characteristics, such as parents' education, parents' occupations, and family income. That is, class size may be correlated with SES manifested via parents' occupations or family income. Such variables were not measured or reported in the TIMSS 2011 fourth grade student survey. Second, students and teachers are rarely randomly assigned to classrooms, and thus class size might be correlated with unobserved student or teacher characteristics. For example, students may be assigned to classes based on their ability or motivation. TIMSS 2011, being a cross-sectional survey, did not include information about prior achievement (a proxy for ability). In the same vein, in order to help low-achieving students, some schools might assign higher quality teachers to classes with higher proportions of low-achievers.
There were only very few teacher characteristics reported in the TIMSS 2011 teacher survey, such as their gender, experience, and education level, which may capture teacher "quality" only partially. Although we included as many covariates as we could in our analysis, there are likely unobservable factors that could be correlated with class size that are part of the error term of equation (2.1).

Just as with OLS, endogeneity of class size renders quantile-specific estimates biased. To overcome this potential shortcoming and to facilitate causal inferences, we used IV methods. Specifically, we created a grade- and school-specific average class size variable using the maximum class size rule, and we used it as an instrument for class size. Our method is similar to the one used by Angrist and Lavy (1999) and is the same as the one used in Chapter 1. The average class size in fourth grade, based on the maximum class size requirement, could be calculated through the following equation

$$ f_i = \frac{E_i}{\operatorname{int}\left((E_i - 1)/rule\right) + 1} \tag{2.2} $$

where $E_i$ denotes the enrollment in grade four in a school; $f_i$ denotes the computed school- and grade-specific average class size based on the maximum class size rule; $rule$ denotes the upper class size limit in a given country; and for any positive number $n$, the function $\operatorname{int}(n)$ is the largest integer less than or equal to $n$.

I adopted the control function approach proposed by Lee (2007) to get quantile-specific IV estimates. Lee's approach fits our study for two reasons: first, his estimation approach is computationally convenient and simple to implement through the "qreg" command in Stata; second, the assumptions required by Lee's control function approach hold in general settings (see Lee, 2007). The control function approach is also a two-stage estimation method that is similar to two-stage least squares (2SLS).
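Equation (2.2) can be made concrete with a short sketch. The function name is hypothetical, and the rule of 40 used in the loop is only an illustration borrowed from the Maimonides rule discussed in the literature review; the limits actually applied differ by country.

```python
def computed_class_size(enrollment, rule):
    """Average class size implied by a maximum class size rule, as in eq. (2.2).

    enrollment: grade enrollment in the school (E_i), a positive integer.
    rule: upper class size limit in the country, a positive integer.
    """
    if enrollment <= 0 or rule <= 0:
        raise ValueError("enrollment and rule must be positive")
    # int((E_i - 1) / rule) + 1 is the smallest number of classes that
    # keeps every class at or below the limit.
    n_classes = (enrollment - 1) // rule + 1
    return enrollment / n_classes


# Illustrative limit of 40 students per class.
for e in (39, 40, 41, 80, 81, 120):
    print(e, computed_class_size(e, rule=40))
```

The computed class size drops sharply each time enrollment crosses a multiple of the limit (for example, from 40 students in one class to 20.5 per class at an enrollment of 41). This discontinuity is the variation the instrument exploits.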
The basic idea is to add a control variable to equation (2.1) such that, once we condition on this variable, the teacher-reported class size will be independent of omitted variables (see Wooldridge, 2010). This so-called control variable usually needs to be estimated through a first stage regression, because it cannot be observed or measured directly. In our study, the first stage regression equation is

$$ClassSize_i = \pi_0 + \pi_1 f_i + \pi_2 \mathbf{ST}_i + \pi_3 \mathbf{CL}_i + \pi_4 \mathbf{SC}_i + u_i \qquad (2.3)$$

where $f_i$ is the computed average class size in a school based on the maximum class size rule, and $u_i$ is the error term. All other terms have been defined previously. The $\pi$'s are the regression estimates that need to be computed. Researchers typically use the estimated residuals from equation (2.3) as the control variable. Residuals can be estimated from a quantile regression, or even an OLS regression (Lee, 2007). It should be noted that Lee's control function method is only applicable to continuous endogenous variables. Although class size is conceptually continuous, it has only a finite number of distinct values. In this case, Lee (2007) suggested using OLS regression in the first stage. I calculated the residual $\hat{u}_i$ through the following equation

$$\hat{u}_i = ClassSize_i - \widehat{ClassSize}_i$$

where $\widehat{ClassSize}_i$ is the fitted value of $ClassSize_i$ from equation (2.3), the OLS regression. Contrary to the conventional control function approach that inserts $\hat{u}_i$ into equation (2.1) as the second stage regression, Lee (2007) proposed inserting a power series or kernel of $\hat{u}_i$. He showed that, under proper conditions, the estimator from his control function approach is consistent (see Appendix B for a proof). In this study, I added a fifth order polynomial of $\hat{u}_i$, denoted as $\hat{\lambda}(\hat{u}_i)$, into equation (2.1). Specifically, the second stage regression in each quantile (i.e., tenth, twenty-fifth, fiftieth, seventy-fifth, and ninetieth) is

$$Y_i = \beta_0 + \beta_1 ClassSize_i + \beta_2 \mathbf{ST}_i + \beta_3 \mathbf{CL}_i + \beta_4 \mathbf{SC}_i + \hat{\lambda}(\hat{u}_i) + \varepsilon_i \qquad (2.4)$$

The coefficient $\beta_1$ represents the relationship between mathematics achievement and class size, adjusted for student, teacher/classroom, and school characteristics; $\hat{\lambda}(\hat{u}_i)$ represents a fifth order polynomial of $\hat{u}_i$. The $\beta$'s indicate regression estimates that need to be computed. The student, classroom/teacher, and school covariates included in equation (2.4) are the same as those included in equation (2.3) (see Appendix A). Appropriate student weights were used in both regressions (equations 2.3 and 2.4).

It should be noted that, due to the two-step feature of the model, the standard errors of the estimates in equation (2.4) were adjusted by nonparametric bootstrap techniques using 1000 replications. I used the bootstrap method introduced by Kolenikov (2010), which is suitable for complex survey data and corrects for potential clustering effects (i.e., students nested within schools). Also, my analysis was conducted for each plausible value separately, and then the averages of the five sets of estimates were calculated and reported as the final estimates of class size effects for each quantile (see Schafer & Olsen, 1998). The standard error of the class size effects was a combination of the sampling variance obtained through bootstrap techniques and the variance between plausible values (see Martin & Mullis, 2012).

Similar to the 2SLS context, there were two key assumptions that the computed average class size $f_i$ must meet in order for the variable to be a valid IV: (1) $f_i$ should be correlated with actual class size, and (2) $f_i$ should not be correlated with the error term in equation (2.1). The first assumption indicates that schools followed the maximum class size requirement when they assigned students to classrooms. In a 2SLS context, such an assumption can easily be tested through the first stage regression. If the instrument is only marginally significant, our instrument could be weak.
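The way estimates from the five plausible values are pooled can be sketched as follows. This is a minimal illustration that assumes a Rubin-style combination rule (average sampling variance plus an inflated between-value variance); the exact rule in Martin and Mullis (2012) may differ in detail, and the function name and input numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def combine_plausible_values(estimates, sampling_variances):
    """Pool one coefficient estimated separately on each plausible value.

    estimates: one point estimate per plausible value.
    sampling_variances: the matching (e.g., bootstrap) sampling variances.
    Returns (pooled estimate, pooled standard error).
    Assumption: Rubin-style pooling, total = within + (1 + 1/M) * between.
    """
    m = len(estimates)
    if m < 2 or m != len(sampling_variances):
        raise ValueError("need matching lists with at least two entries")
    pooled = mean(estimates)              # average of the M estimates
    within = mean(sampling_variances)     # average sampling variance
    between = variance(estimates)         # variance across plausible values (divisor M - 1)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical inputs: five estimates, each with sampling variance 1.0.
estimate, se = combine_plausible_values([1, 2, 3, 4, 5], [1.0] * 5)
print(estimate, se)
```

The between-value term grows when the five plausible values disagree, so the pooled standard error reflects measurement uncertainty in achievement as well as sampling error.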
When instruments are weak, the standard IV estimates, hypothesis tests, and confidence intervals may be unreliable (Stock, Wright, & Yogo, 2002). The rule of thumb is that the t-statistic of the instrument in the first-stage regression should be larger than 3.2 (Stock, Wright, & Yogo, 2002). Results from Table 1.6 in Chapter 1 showed that there were five countries or districts (Denmark, Croatia, Italy, Malta, and Hong Kong) whose IVs were weak. In addition, the significant but negative coefficient in Hong Kong indicated that the IV was not valid in Hong Kong.

Results

I only evaluated the class size effects for countries and districts with strong and valid IVs. The quantile-specific IV estimates of class size are summarized in Table 2.1. To compare the results between OLS regression and median regression (quantile regression at the fiftieth quantile), estimates from 2SLS are also presented. Negative coefficients of class size indicate that student achievement increases as class size decreases, which is what researchers and policy makers expect.

Table 2.1: 2SLS and Quantile Regression Estimates and Standard Errors of Class Size

                        Quantile
        2SLS      10th      25th      50th      75th      90th
AUT    -1.82      0.35     -2.07     -2.26     -2.51     -1.71
       (1.27)    (3.37)    (3.04)    (2.04)    (2.15)    (2.79)
CZE     0.24      0.83      0.67      0.75     -0.10     -0.95
       (1.16)    (1.74)    (1.48)    (1.28)    (1.56)    (1.48)
DEU     1.33      2.25      2.28      2.26      1.58     -0.30
       (1.26)    (2.28)    (1.97)    (1.68)    (2.16)    (2.42)
HUN     0.45      1.31     -0.50      0.70      1.19      1.17
       (1.49)    (2.65)    (2.24)    (1.78)    (1.92)    (2.44)
LTU    -1.25     -0.06     -0.15     -0.79     -2.15     -2.81
       (1.28)    (3.27)    (1.65)    (1.75)    (1.71)    (2.41)
PRT    -3.80     -2.84     -2.89     -2.89     -3.05     -4.68
       (2.67)    (4.63)    (3.10)    (2.78)    (3.10)    (3.69)
ROM    -4.84*    -5.46     -5.52     -5.72*    -4.86+    -4.23
       (2.28)    (3.63)    (3.71)    (2.64)    (2.92)    (3.46)
SVK    -4.40*    -4.42*    -3.59     -4.10+    -4.68*    -4.33
       (1.58)    (2.24)    (2.57)    (2.19)    (2.16)    (2.78)
SVN    -1.87     -1.13     -1.69     -2.03     -2.65     -2.70
       (1.32)    (2.69)    (2.09)    (1.86)    (2.31)    (2.52)
JPN    -0.81     -1.45     -1.03     -0.71     -0.51     -0.18
       (0.46)    (0.92)    (0.81)    (0.57)    (0.63)    (0.79)
TWN    -0.83     -1.83     -0.27     -0.63     -0.15      0.21
       (1.23)    (2.53)    (2.27)    (1.79)    (1.86)    (1.93)

Note: Bootstrap standard errors are in parentheses.

In Romania, the Slovak Republic, Slovenia, Japan, and Chinese Taipei, the magnitudes of the coefficients in the median regression were similar to those from 2SLS. In Germany, Hungary, Lithuania, and the Czech Republic, the magnitudes of the coefficients in the median regression were quite different from those from 2SLS. In terms of significance, the estimates from 2SLS and the estimates from median regression were quite similar and, by and large, insignificant. In addition, the standard errors from the median regression were larger than those from 2SLS.

The coefficients of class size were negative but insignificant across all five quantiles in Lithuania, Japan, Portugal, and Slovenia. The coefficients for Austria, the Czech Republic, Germany, Hungary, and Chinese Taipei were mixed: for some quantiles they were positive, while for the other quantiles they were negative. However, none of the quantile estimates were significant.
Negative and significant quantile-specific class size estimates were only found in Romania and the Slovak Republic. In Romania, the class size coefficient at the fiftieth quantile was negative and significant at the 0.05 level. Also, the class size coefficient at the seventy-fifth quantile was negative and significant at the 0.10 level. Such results indicate that students in the middle and upper tail of the achievement distribution benefitted from being in smaller classes. For instance, a one-student reduction corresponds to an increase of about 5.7 points of mathematics achievement on the TIMSS scale for students in the middle of the achievement distribution. This is equivalent to about 0.057 standard deviations (SD) among all fourth graders who participated in TIMSS 2011. For the other three quantiles, the estimates were negative but insignificant. The magnitudes of the class size coefficients were similar across quantiles and ranged from 4.23 at the ninetieth quantile to 5.72 at the fiftieth quantile.

In the Slovak Republic, the estimates at the tenth quantile and seventy-fifth quantile were significant and negative at the 0.05 level. The estimate at the fiftieth quantile was negative and significant at the 0.10 level. Such results indicate that students in the lower tail, median, or upper tail of the achievement distribution benefitted from smaller classes. For the other two quantiles (twenty-fifth and ninetieth quantiles), the estimates were negative but insignificant. The magnitudes of the class size coefficients were similar across quantiles and ranged from 3.59 at the twenty-fifth quantile to 4.68 at the seventy-fifth quantile.

To facilitate interpretation, I transformed the estimates to effect sizes expressed in SD units, assuming a reduction in class size of eight students, which was the average class size reduction in Project STAR.
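The conversion to SD units can be written out as a short calculation. This sketch assumes a scale standard deviation of 100 points, which is what the statement that 5.7 points equals about 0.057 SD implies; the function and constant names are illustrative.

```python
TIMSS_SCALE_SD = 100.0   # assumption implied by "5.7 points is about 0.057 SD"
STAR_REDUCTION = 8       # average class size reduction in Project STAR (from the text)


def effect_size_sd(coef_per_student: float,
                   reduction: int = STAR_REDUCTION,
                   scale_sd: float = TIMSS_SCALE_SD) -> float:
    """Effect size in SD units for a class size reduction of `reduction` students.

    coef_per_student: estimated class size coefficient (points per student);
    its absolute value is used because a negative coefficient means a gain
    when class size falls.
    """
    return abs(coef_per_student) * reduction / scale_sd


# Coefficients taken from Table 2.1 (Romania, 50th and 75th quantiles).
print(round(effect_size_sd(-5.72), 2))
print(round(effect_size_sd(-4.86), 2))
```

Multiplying the per-student coefficient by eight puts the quantile estimates on the same footing as the Project STAR comparison.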
For Romania, the effect sizes were about 0.46 SD at the fiftieth quantile, and about 0.39 SD at the seventy-fifth quantile. For the Slovak Republic, the effect sizes were about 0.36 SD at the tenth quantile and the seventy-fifth quantile, and about 0.33 SD at the fiftieth quantile. Such effect sizes are quite substantial in magnitude and larger than the conditional mean estimates reported in prior studies (e.g., Angrist and Lavy, 1999; Nye, Hedges & Konstantopoulos, 2004). For example, the average effect size for Project STAR was about 0.20 SD.

In Japan, the magnitudes of the coefficients indicated that the class size effects were consistently larger for low-achievers than for other students. For example, the magnitude of the coefficient estimated at the tenth quantile was more than eight times larger than that at the ninetieth quantile. In countries such as the Czech Republic, Lithuania, Portugal, and Slovenia, the magnitudes of the coefficients indicated that the class size effects were consistently larger for higher-achievers than for other students. For example, the magnitude of the coefficient estimated at the ninetieth quantile was about 47 times as large as that at the tenth quantile in Lithuania.

Overall these results seem mixed. In some countries, the results seem to support the notion that high-achieving students may benefit more from being in small classes than other students. In contrast, in other countries low-achievers seem to benefit more from smaller classes than other students. Still, one needs to examine whether the differences between the estimates at these quantiles were statistically significant. A bootstrap procedure was employed to compute the standard errors of the differences between two quantile-specific estimates (Kolenikov, 2010). Table 2.2 summarizes the differences between estimated class size coefficients and their bootstrap standard errors.
I calculated the difference between two quantile-specific estimates by subtracting the estimated class size coefficient of lower achievers from the estimated class size coefficient of higher achievers. Thus, a negative difference indicated that high-achievers benefitted more from small classes than low-achievers.

Table 2.2: Differences in Quantile Regression Estimates

        90th vs.  90th vs.  90th vs.  75th vs.  75th vs.  75th vs.  50th vs.  50th vs.
        10th      25th      50th      10th      25th      50th      10th      25th
AUT    -2.07      0.36      0.54     -2.86     -0.44     -0.25     -2.61     -0.19
       (3.01)    (2.93)    (1.83)    (2.28)    (2.16)    (0.94)    (1.73)    (1.60)
CZE    -1.78     -1.62     -1.71     -0.93     -0.77     -1.05      0.26      0.23
       (1.17)    (1.26)    (1.17)    (1.35)    (1.24)    (1.11)    (1.13)    (1.09)
DEU    -2.42     -2.59     -2.56     -0.96     -0.71     -0.68     -0.14     -0.02
       (2.02)    (1.52)    (1.37)    (1.77)    (1.36)    (0.98)    (1.57)    (0.83)
HUN    -0.14      1.32      0.13     -0.12      1.57      0.38     -0.39      1.19
       (2.05)    (1.90)    (1.40)    (1.83)    (1.55)    (1.11)    (1.59)    (1.26)
LTU    -2.75     -2.66     -2.02     -2.08     -1.99     -1.36     -0.73     -0.64
       (2.79)    (1.55)    (1.32)    (2.59)    (1.25)    (1.00)    (2.02)    (0.99)
PRT    -2.45     -1.79     -2.21     -0.72     -0.48      0.03     -0.18      0.41
       (4.76)    (3.27)    (2.55)    (4.22)    (2.55)    (1.67)    (3.70)    (2.02)
ROM     1.23      1.30      1.50      0.60      0.66      0.86     -0.27     -0.20
       (3.08)    (3.26)    (2.18)    (2.18)    (2.42)    (1.22)    (1.45)    (1.86)
SVK     0.20     -0.74     -0.23     -0.09     -0.79     -0.51      0.18     -0.51
       (2.30)    (2.01)    (1.42)    (2.02)    (1.69)    (1.02)    (1.55)    (1.42)
SVN    -1.57     -1.01     -0.67     -1.53     -0.97     -0.62     -0.90     -0.34
       (1.92)    (1.56)    (1.42)    (1.70)    (1.29)    (0.90)    (1.46)    (1.10)
JPN     1.27*     0.85      0.50      0.94      0.52      0.25      0.65      0.26
       (0.60)    (0.66)    (0.41)    (0.61)    (0.62)    (0.29)    (0.50)    (0.45)
TWN     2.04      0.48      0.85      1.68      0.12      0.49      1.19     -0.37
       (1.81)    (1.37)    (1.20)    (1.61)    (1.15)    (0.92)    (1.51)    (0.81)

Note: Bootstrap standard errors are in parentheses.
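The sign convention and the significance check for a single entry of Table 2.2 can be sketched as below. The standard normal reference distribution is an assumption of this sketch; the text does not state which reference distribution was used with the bootstrap standard errors, and the function name is hypothetical.

```python
from statistics import NormalDist


def quantile_difference_test(diff: float, se: float, alpha: float = 0.05):
    """Two-sided test of a difference between two quantile-specific estimates.

    diff: higher-quantile coefficient minus lower-quantile coefficient.
    se: bootstrap standard error of that difference.
    Assumption: the ratio diff / se is compared with a standard normal.
    Returns (z statistic, two-sided p-value, significant at alpha?).
    """
    if se <= 0:
        raise ValueError("standard error must be positive")
    z = diff / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


# Japan, 90th vs. 10th quantile (Table 2.2): difference 1.27, SE 0.60.
z, p, significant = quantile_difference_test(1.27, 0.60)
print(round(z, 2), significant)
```

A positive and significant difference, as in this entry, means the lower quantile has the more negative coefficient, so lower achievers gain more from a smaller class.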
For example, in Japan, the difference of the class size coefficients between the ninetieth and the tenth quantile was 1.27, which indicates that a one-student reduction in class size would raise achievement at the tenth quantile by 1.27 points more than at the ninetieth quantile on the mathematics achievement scale. In other words, a negative difference indicates an increase in the achievement gap between high-achievers and low-achievers as class size decreases. In contrast, a positive difference indicates a decrease in the achievement gap between high-achievers and low-achievers as class size decreases.

The results in Table 2.2 show that almost all differences between any two quantile-specific estimates were insignificant, with only one exception, which indicates that in general CSR did not reduce the achievement gap between high- and low-achievers. By and large, CSR is likely to have no impact on the achievement gap across countries, which is inconsistent with prior studies, especially the studies using data from Project STAR (e.g., Konstantopoulos, 2008; Ding & Lehrer, 2011; Jackson & Page, 2013) that consistently found that high-achieving students benefitted more from small classes and thus the achievement gap increased.

Discussion

I investigated the differential effects of class size at different levels of mathematics achievement for fourth graders, using rich data from TIMSS 2011. The European and Asian countries and districts I selected had maximum class size rules, which allowed me to use an IV approach to explore the causal effects of class size on student achievement across the achievement distribution. Specifically, I used a control function approach, coupled with quantile regression, to examine differential class size effects for students in the middle, lower, and upper tails of the achievement distribution. Generally, the findings from the quantile regression indicated no systematic patterns of association between class size and achievement.
In nine of the eleven European and Asian countries and districts that had strong IV and valid RE design, we found insignificant class size effects. The only two exceptions were Romania and the Slovak Republic, where significant class size effects were detected in some quantiles. These significant class size effects were quite substantial in magnitude compared to prior studies (e.g., Angrist & Lavy, 1999). Nonetheless, my findings are in congruence with the findings of previous work that used prior cycles of TIMSS and indicated generally insignificant relationships between class size and achievement (Pong & Pallas, 2001; Wossmann, 2005; Wossmann & West, 2006).

I also compared class size coefficients at the lower and upper tails of the achievement distribution. These results suggest no differential class size effects across the achievement distribution. In sum, our findings suggest that CSR has no impact on the achievement gap between low- and high-achieving students. In other words, lower-achieving students were not hurt by CSR policies. Such findings are not in congruence with findings of previous works that used high-quality experimental data (e.g., Konstantopoulos, 2008; Nye, Hedges & Konstantopoulos, 2002). In addition, our findings indicate that, for some specific countries such as Romania and the Slovak Republic, CSR is a promising policy that would increase student achievement but not increase the achievement gap between lower- and higher-achieving students.

CHAPTER 3

POWER CONSIDERATIONS FOR MODELS OF CHANGE

Introduction

In recent years, there has been an increased interest in assessing the effects of educational interventions via experimental designs where students, classrooms, or schools are randomly assigned to a treatment and a control condition. An important part of the design phase of an experiment involves power analysis.
Statistical power is the probability of detecting the treatment effect of interest when it exists (Boruch & Gomez, 1977; Cohen, 1988). A priori power computations are critical in designing experiments because they inform empirical researchers about the sampling scheme needed to detect a treatment effect. Specifically, a priori power analyses help educational researchers identify how big a sample is needed at the student, classroom, or school level to ensure a high probability (e.g., > 80 percent) of detecting a treatment effect if it exists (Lipsey, 1990; Konstantopoulos, 2008a).

The recent resurgence of experiments in education has been an attempt to establish rigorous research in the field. That is, currently much of the empirical research in education employs randomized experiments that are typically large in scale. These field experiments allow education researchers to examine the effects of school or student interventions on student performance. In addition, education experiments often incorporate a longitudinal component where students are followed over time. The main objectives in these studies include assessing whether the treatment effects are cumulative or have lasting benefits or whether they fade over time. For example, the effect of a novel mathematics curriculum is evaluated through an experiment (i.e., novel versus traditional mathematics curriculum) where measurements of student outcomes (e.g., mathematics achievement) are collected repeatedly over time (e.g., every spring for a few years).

In repeated measures experiments each student has their own trajectory, which is a function of time and indicates the rate of change over time (Raudenbush & Bryk, 2002). The central goal in such studies is not only to estimate the treatment effect in the first year of the study (e.g., immediate effects), but also to gauge longer term effects over time.
For example, a researcher may be interested in the change or growth of mathematics achievement for students who use a novel mathematics curriculum vis-a-vis students who use a traditional mathematics curriculum. In this case, it is important for the researcher to compare trajectories of students who received the treatment (i.e., novel curriculum) versus those who did not receive the treatment (i.e., traditional curriculum).

The change in measurements over time does not always follow a linear trend. Instead, trajectories sometimes point to nonlinearities such as curvilinear trends. For example, Huttenlocher et al. (1991) studied how children's vocabulary growth accelerates in the early years. One way of defining trajectories of change is via polynomial functions (Raudenbush & Liu, 2001). The first degree polynomial indicates a linear rate of change, the second degree polynomial indicates a quadratic rate of change, the third degree polynomial indicates a cubic rate of change, and so forth. That is, treatment effects are estimated for linear or non-linear rates of change.

Studies about polynomial change may be viewed as having a nested structure. For example, measurements are nested within individuals, and this nesting needs to be taken into account in the design phase of the study as well as in the statistical analysis phase. Prior work has utilized two-level models (e.g., measurements within students) for repeated measurements designs (see Raudenbush & Bryk, 2002). In particular, the authors presented methods for power analysis of treatment effects in studies of polynomial change with one level of nesting. Power is a function of the magnitude of the treatment effect, the sample size of individuals, the duration of the study, and the frequency of measurements over time. Researchers should take into account all of these parameters in the design phase of the experiment to ensure that treatment effects will be detected.
Nonetheless, populations in education frequently have more complicated structures. For example, students are also nested within classes or schools and so forth. In addition, education interventions typically assign either schools or students randomly to treatment or control groups. For instance, students are assigned to small or regular classes within schools, or schools are randomly assigned to an assessment program or not. It seems natural to extend methods for power analysis for tests of treatment effects in studies of polynomial change from two to three levels. Consider, for example, a nested structure where measurements are nested within students and students in turn are nested within schools. That is, the first level is repeated measurements, the second level is students, and the third level is schools. Spybrook et al. (2011) reported in the Optimal Design manual formulae to calculate power for three-level polynomial change models without covariates.

This study extends previous methods by Raudenbush and Liu (2001) and Spybrook et al. (2011), and provides methods for power analysis of tests of treatment effects in studies of polynomial change with two levels of nesting (e.g., students and schools) where the treatment is either at the third level (e.g., school intervention) or at the second level (e.g., student intervention). In particular, I first present methods for power analysis for cluster randomized designs (CRD) where, for instance, schools are randomly assigned to a treatment and a control group, students are nested within schools, and repeated measurements are nested within students. This design assumes that schools are sampled randomly from a larger population at the first stage and then students within schools are randomly sampled. That is, both schools and students are random effects.
Within CRD I briefly present the unconditional model (i.e., no covariates at any level), and then I expand the model to include covariates in the second and third levels. Second, we provide methods for power analysis for block randomized designs (BRD) where the treatment is at the second level (e.g., student intervention) and the third level units (e.g., schools) serve as blocks. For example, students are assigned to treatment and control conditions within schools. In this design both schools and students are also treated as random effects. In addition, we will discuss how study duration, sample size (number of third and second level units), and covariates influence power through two illustrative examples.

The Polynomial Change Model

A polynomial is an algebraic expression that contains more than one term and is described as a sum of terms of the same variable (e.g., time) in different powers (Kirk, 2012). For example, student achievement growth could be modeled through a polynomial equation of the third degree as

$$Y = \beta_0 + \beta_1 a + \beta_2 a^2 + \beta_3 a^3 + \varepsilon \qquad (3.1)$$

where $Y$ is student achievement, $a$ is a measure of time such as age at each time of measurement, $\beta_0$ is a constant, $\beta_1 a$ is a linear component, $\beta_2 a^2$ is a quadratic component, $\beta_3 a^3$ is a cubic component, and $\varepsilon$ is an error term. One disadvantage of equation (3.1) is that the trend components are highly correlated, which leads to multicollinearity. To resolve the dependency problem, one can utilize orthogonal polynomial contrast coefficients, which have been frequently used to fit trends of repeated measures. Equation (3.1) can then be constructed as

$$Y = \beta_0 c_0 + \beta_1 c_1 + \beta_2 c_2 + \beta_3 c_3 + u \qquad (3.2)$$

where $c_1$, $c_2$, and $c_3$ are orthogonal polynomial coefficients that are independent of each other and thus enable researchers to independently test a null hypothesis for each of the three components (Kirk, 2012). Orthogonal polynomial coefficients have been used to fit trends since the early 20th century (e.g., Fisher, 1928).
Jennrich and Sampson (1971) provided an algorithm to generate the orthogonal polynomial contrast coefficients. They are provided in tables of many experimental design texts (e.g., Kirk, 2012).

Previous work has discussed sample size and statistical power considerations for group comparisons using repeated measures, most of which, however, focused on single-level models (e.g., Bloch, 1986; Hedeker, Gibbons, & Waternaux, 1999). Raudenbush and Liu (2001) extended this work and provided power analysis and sample size determination methods for repeated measures in two-level models. They focused on studies in which two groups were followed over time to assess group differences in the average rate of change, rate of acceleration, or a higher degree polynomial effect. Through a two-level model combined with orthogonal polynomial contrasts at the first level, the authors examined how the duration of the study, frequency of observation, and number of participants affected statistical power. They found that power increases as the study duration or the number of students increases. Moerbeek (2008) discussed how the costs of including more persons or taking more measurements influence power in two-level polynomial growth models and provided methods of comparing alternative designs on the basis of their costs and sample size. She also took drop-out into consideration, and found that power decreases as drop-out increases, and thus increasing the study duration might have a negative effect on power.

Power analysis methods for growth models with two levels of nesting have rarely been discussed in prior literature. One exception was by Jong, Moerbeek, and Van der Leeden (2010), who discussed power estimation methods for three-level growth models with linear rate of change only. They demonstrated that power is influenced by intraclass correlation coefficients, level of randomization, sample size, covariates, and drop-out rates.
However, their methods could not be applied to models with higher order rates of change (e.g., quadratic rate of change). The Optimal Design manual by Spybrook et al. (2011) has provided power calculation formulae for three-level models in studies of polynomial change where the treatment is at the third level (e.g., schools), but has not incorporated the effects of covariates.

Both Raudenbush and Liu (2001) and Spybrook et al. (2011) discussed unconditional models that did not include any covariates at any level. However, prior studies have shown that covariates (e.g., student and school characteristics) could increase power significantly. Hedges and Hedberg (2007) documented that prior test scores and demographic covariates such as SES account for nearly one-third of the variance at the student level. Bloom, Richburg-Hayes, and Black (2007) found that controlling for baseline covariates could improve the precision of CRD studies that examine the impact of school interventions. Konstantopoulos (2012) showed that covariates at different levels of the hierarchy potentially explain a considerable proportion of the variance at the corresponding levels, and centering of lower level covariates plays an important role in this (see also Snijders & Bosker, 1999).

Statistical Models

Design I: Treatment Assigned at Third Level (Cluster Design)

Unconditional Model

Consider a simple three-level growth design where level-3 units (e.g., clusters such as schools) are randomly assigned to treatment or control conditions (i.e., clusters are nested within treatment).
The first level for change over time of level-2 unit i in cluster j can be expressed as a polynomial function, namely

$$Y_{gij} = \beta_{0ij} c_{0g} + \beta_{1ij} c_{1g} + \beta_{2ij} c_{2g} + \dots + \beta_{(P-1)ij} c_{(P-1)g} + u_{gij} \qquad (3.3)$$

where the $c_{pg}$ represent orthogonal polynomial contrasts of degree p (p = 0, 1, ..., P-1) at measurement g (g = 1, ..., G), the $\beta_{pij}$'s represent the mean and the rates of change (linear, quadratic, cubic, etc.), and $u_{gij}$ is the within level-2 unit random term with variance $\sigma_e^2$. When p = 0, $c_{0g} = 1$ and $\beta_{0ij}$ represents the average outcome for level-2 unit i in level-3 unit j. When p = 1, $c_{1g}$ is a linear contrast and $\beta_{1ij}$ is the linear rate of change for level-2 unit i in level-3 unit j, and so forth. We work with orthogonal polynomial contrasts because they facilitate the computations of estimators and their standard errors, and simplify power analysis (see Raudenbush & Liu, 2001). The results apply to studies of any length and for polynomials of any degree (Kirk, 2012).

Orthogonal polynomial contrast coefficients should satisfy two conditions: the pth polynomial trend contrasts (p ≥ 1) sum to zero, and the sum of the products of the pth and p'th polynomial contrasts (p ≠ p') is equal to zero (see Kirk, 2012)

$$\sum_{g=1}^{G} c_{pg} = 0, \qquad \sum_{g=1}^{G} c_{pg} c_{p'g} = 0. \qquad (3.4)$$

Orthogonal polynomial coefficients that meet the conditions shown in equation (3.4) are not unique because any group of orthogonal polynomial coefficients denoted as $c_{pg} = k_p C_{pg}$ also meets these two conditions, where $k_p$ could be any nonzero constant (see Appendix C for a detailed proof). With equally spaced time points, the following formulae could be used to calculate orthogonal polynomial coefficients

$$c_{pg} = k_p C_{pg}, \qquad C_{0g} = 1, \qquad C_{1g} = g - \frac{G+1}{2}, \qquad C_{p+1,g} = C_{1g} C_{pg} - \frac{p^2 (G^2 - p^2)}{4(4p^2 - 1)} C_{p-1,g} \qquad (3.5)$$

(see Jennrich & Sampson, 1971), where $c_{pg}$ is one possible orthogonal polynomial coefficient of degree p at measurement g as defined before, and $k_p$ could be any nonzero constant. That is, researchers could choose any $k_p$ to get their own orthogonal polynomial contrast coefficients.
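The recursion for equally spaced time points can be sketched in a few lines. This is an illustration under the assumption that the three-term recursion is the standard one for discrete orthogonal polynomials, $C_{p+1,g} = C_{1g} C_{pg} - \frac{p^2(G^2 - p^2)}{4(4p^2 - 1)} C_{p-1,g}$; the function name and the choice of exact fractions are this sketch's own.

```python
from fractions import Fraction


def orthogonal_contrasts(G, P, k=None):
    """Orthogonal polynomial contrasts c_pg for p = 0..P-1 and g = 1..G.

    G: number of equally spaced measurement occasions.
    P: number of polynomial terms (degrees 0 to P-1), with 1 <= P <= G.
    k: optional scaling constants k_p (defaults to 1 for every degree).
    Returns a list of P rows, each holding G exact fractions.
    """
    if not 1 <= P <= G:
        raise ValueError("need 1 <= P <= G")
    k = [Fraction(1)] * P if k is None else [Fraction(v) for v in k]
    if len(k) != P:
        raise ValueError("k must have one constant per degree")
    x = [Fraction(2 * g - G - 1, 2) for g in range(1, G + 1)]  # C_1g = g - (G+1)/2
    C = [[Fraction(1)] * G]                                    # C_0g = 1
    if P > 1:
        C.append(list(x))
    for p in range(1, P - 1):
        coef = Fraction(p * p * (G * G - p * p), 4 * (4 * p * p - 1))
        C.append([x[g] * C[p][g] - coef * C[p - 1][g] for g in range(G)])
    return [[k[p] * value for value in C[p]] for p in range(P)]


# Four occasions with scaling constants 1, 1, 1/2, 1/6.
rows = orthogonal_contrasts(4, 4, k=[1, 1, Fraction(1, 2), Fraction(1, 6)])
for row in rows:
    print([float(v) for v in row])
```

Exact fractions make it easy to confirm that every pair of rows has a zero cross-product, which floating-point arithmetic would only show approximately.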
For example, when $k_0 = 1$, $k_1 = 1$, $k_2 = 1/2$, and $k_3 = 1/6$, one can compute the first four orthogonal coefficients as

$$c_{0g} = 1, \qquad c_{1g} = g - \frac{G+1}{2}, \qquad c_{2g} = \frac{1}{2}\left[\left(g - \frac{G+1}{2}\right)^2 - \frac{G^2 - 1}{12}\right], \qquad c_{3g} = \frac{1}{6}\left[\left(g - \frac{G+1}{2}\right)^3 - \left(g - \frac{G+1}{2}\right)\frac{3G^2 - 7}{20}\right] \qquad (3.6)$$

(see Appendix C for a detailed proof). When G = 4, the values of the orthogonal coefficients are

$$c_0 = (1, 1, 1, 1), \quad c_1 = (-1.5, -0.5, 0.5, 1.5), \quad c_2 = (0.5, -0.5, -0.5, 0.5), \quad c_3 = (-0.05, 0.15, -0.15, 0.05). \qquad (3.7)$$

Least squares estimates of each level-2 unit's change parameter as well as their variance can be computed as

$$\hat{\beta}_{pij} = \frac{\sum_{g=1}^{G} c_{pg} Y_{gij}}{\sum_{g=1}^{G} c_{pg}^2}, \qquad Var(\hat{\beta}_{pij}) = \frac{\sigma_e^2}{\sum_{g=1}^{G} c_{pg}^2} \qquad (3.8)$$

(see Seber & Lee, 2003), where

$$\sum_{g=1}^{G} c_{pg}^2 = \frac{k_p^2 \, (p!)^4 \, (G+p)!}{(2p)! \, (2p+1)! \, (G-p-1)!} \qquad (3.9)$$

(see Appendix D for proof).

In the second level model each of the parameters $\beta_{pij}$ (e.g., the average polynomial change for each individual) from the first level equation varies between level-2 units (e.g., individuals) within level-3 units (e.g., schools), namely

$$\beta_{pij} = \gamma_{0pj} + \xi_{pij}, \qquad (3.10)$$

where the $\gamma_{0pj}$'s represent the average polynomial effects within level-3 units such as schools and the $\xi_{pij}$'s are individual-specific random effects within level-3 units for each polynomial change parameter. The random effects follow a multivariate normal distribution with zero means, variances $\tau_{pp}^2$, and covariance $\tau_{pp'}$ between the random effects $\xi_{pij}$ and $\xi_{p'ij}$.

At the third level each of the parameters $\gamma_{0pj}$ (average polynomial change for each level-3 unit) varies across third level units such as schools, namely

$$\gamma_{0pj} = \delta_{00p} + \delta_{01p} T_j + \eta_{0pj}, \qquad (3.11)$$

where the $\delta_{00p}$'s represent the average polynomial effects across level-3 units, the $\delta_{01p}$'s represent the average difference between the treatment and the control group for each polynomial change parameter, and the $\eta_{0pj}$'s are level-3 unit specific random effects for each polynomial change parameter. These random effects follow a multivariate normal distribution with zero means, variances $\omega_{pp}^2$, and covariance $\omega_{pp'}$ between the random effects $\eta_{0pj}$ and $\eta_{0p'j}$.
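The closed form for the sum of squared contrasts in equation (3.9) and the least squares estimator in equation (3.8) can be checked numerically. The sketch below assumes the explicit coefficients for degrees 0 to 3 with scaling constants 1, 1, 1/2, and 1/6; the function names and the outcome series in the example are hypothetical.

```python
from fractions import Fraction
from math import factorial

SCALING = [Fraction(1), Fraction(1), Fraction(1, 2), Fraction(1, 6)]  # k_0..k_3


def contrasts_deg0_to_3(G):
    """Explicit contrasts of degree 0..3 for G >= 4 equally spaced occasions."""
    rows = [[], [], [], []]
    for g in range(1, G + 1):
        x = Fraction(2 * g - G - 1, 2)  # g - (G + 1) / 2
        rows[0].append(Fraction(1))
        rows[1].append(x)
        rows[2].append(SCALING[2] * (x * x - Fraction(G * G - 1, 12)))
        rows[3].append(SCALING[3] * (x ** 3 - x * Fraction(3 * G * G - 7, 20)))
    return rows


def sum_sq_closed_form(G, p, k_p):
    """Sum over g of c_pg squared, using the factorial expression (3.9)."""
    numerator = factorial(p) ** 4 * factorial(G + p)
    denominator = factorial(2 * p) * factorial(2 * p + 1) * factorial(G - p - 1)
    return Fraction(k_p) ** 2 * Fraction(numerator, denominator)


def change_estimate(y, c):
    """Least squares estimate of one change parameter (3.8) for one unit."""
    return sum(ci * yi for ci, yi in zip(c, y)) / sum(ci * ci for ci in c)


rows = contrasts_deg0_to_3(4)
print([float(sum_sq_closed_form(4, p, SCALING[p])) for p in range(4)])
# Hypothetical outcomes rising by 2 points per occasion.
print(change_estimate([10, 12, 14, 16], rows[1]))
```

Because the sum of squared contrasts grows with the number of occasions, the sampling variance of each unit's change parameter shrinks as the study gets longer, which is the mechanism through which study duration affects power.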
Suppose there are N level-2 units within each level-3 unit and m level-3 units within each treatment condition, which means that the total number of level-3 units is M = 2m and thus the total number of level-2 units is MN. Then, the estimate of the variance of the treatment effect for polynomial p is

$$Var(\hat{\delta}_{01p}) = \frac{2\left(N\omega_{pp}^2 + \tau_{pp}^2 + \sigma_p^2\right)}{mN}, \qquad \sigma_p^2 = \frac{\sigma_e^2}{\sum_{g=1}^{G} c_{pg}^2} \qquad (3.12)$$

and $\sum_{g=1}^{G} c_{pg}^2$ is defined in equation (3.9) (see Konstantopoulos, 2008a; Raudenbush & Liu, 2001; Spybrook et al., 2011).

Suppose that a researcher wants to test the hypothesis that $\delta_{01p}$ is different from zero and carries out the usual t-test. The test statistic is defined as

$$t = \hat{\delta}_{01p} \Big/ \sqrt{Var(\hat{\delta}_{01p})}. \qquad (3.13)$$

When the null hypothesis is true, the test statistic t has a Student's t-distribution with 2m-2 degrees of freedom. When the null hypothesis is false, the test statistic t has the non-central t-distribution with 2m-2 degrees of freedom and non-centrality parameter $\lambda$. The non-centrality parameter is defined as the expected value of the estimate of the treatment effect divided by the square root of the variance of the estimate of the treatment effect, namely

$$\lambda = \delta_{01p} \sqrt{\frac{mN}{2\left(N\omega_{pp}^2 + \tau_{pp}^2 + \sigma_p^2\right)}}. \qquad (3.14)$$

To calculate power, we need to define a standardized effect size first. Prior literature provided three definitions of standardized effect size for three-level models (e.g., Hedges, 2010; Konstantopoulos, 2008a, 2008b). The first option of defining the standardized effect size for a polynomial of degree p in three-level models is the group difference divided by the square root of the total variance (Hedges, 2010; Jong, Moerbeek, & Van der Leeden, 2010; Konstantopoulos, 2008a)

$$ES_1 = \frac{\delta_{01p}}{\sqrt{\omega_{pp}^2 + \tau_{pp}^2 + \sigma_p^2}}. \qquad (3.15)$$

However, the denominator of ES1 depends on $\sigma_p^2$, which is a function of the study duration as shown in equation (3.8). In other words, ES1 changes as the study duration varies. Because this study evaluates various designs with alternative study durations but with a fixed effect size, ES1 is not appropriate.
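The variance in equation (3.12) and the non-centrality parameter in equation (3.14) can be sketched as two small functions. The argument names (`omega_sq` for the level-3 variance, `tau_sq` for the level-2 variance, `sigma_p_sq` for the level-1 sampling variance of the change parameter) are this sketch's labels, and all numeric inputs in the example are hypothetical.

```python
from math import sqrt


def sigma_p_squared(sigma_e_sq, sum_c_sq):
    """Level-1 sampling variance of a change parameter: sigma_e^2 / sum of c_pg^2."""
    return sigma_e_sq / sum_c_sq


def treatment_variance(omega_sq, tau_sq, sigma_p_sq, m, N):
    """Variance of the treatment effect estimate in a cluster randomized design.

    m: level-3 units per treatment condition; N: level-2 units per level-3 unit.
    Implements 2 * (N * omega^2 + tau^2 + sigma_p^2) / (m * N).
    """
    return 2.0 * (N * omega_sq + tau_sq + sigma_p_sq) / (m * N)


def noncentrality(delta, omega_sq, tau_sq, sigma_p_sq, m, N):
    """Non-centrality parameter: treatment effect over its standard error."""
    return delta / sqrt(treatment_variance(omega_sq, tau_sq, sigma_p_sq, m, N))


# Hypothetical design: 20 schools per arm, 10 students per school,
# linear contrast over four occasions (sum of squared contrasts = 5).
s_p = sigma_p_squared(sigma_e_sq=1.0, sum_c_sq=5.0)
print(treatment_variance(0.1, 0.5, s_p, m=20, N=10))
print(noncentrality(0.3, 0.1, 0.5, s_p, m=20, N=10))
```

In this form it is visible that adding students (larger N) cannot drive the variance below 2 times the level-3 variance divided by m, so the number of schools is the binding constraint in cluster designs.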
Two other ways of defining the standardized effect size are

$$ES = \frac{\gamma_{p01}}{\sqrt{\omega_{pp}^2 + \tau_{pp}^2}} \quad \text{or} \quad ES_2 = \frac{\gamma_{p01}}{\sqrt{\omega_{pp}^2}}, \tag{3.16}$$

where ES is the group difference divided by the square root of the sum of the level-2 and level-3 variances (Jong, Moerbeek, & Van der Leeden, 2010; Spybrook et al., 2011), while ES2 is the group difference divided by the square root of the level-3 variance. Both ES and ES2 could be used as the standardized effect size in three-level models (see Hedges, 2011). It should be noted that ES2 is larger than ES for the same model if $\tau_{pp}^2 > 0$, especially when the level-2 variance accounts for a large proportion of the total variance. For example, in the illustrative example using data from Project STAR in a later section, the effect size was larger than one when ES2 was used, whereas the effect size from Project STAR was about 0.2 using Cohen's d. Cohen (1988) suggested that 0.2 be considered a small effect size, 0.5 a medium effect size, and 0.8 a large effect size. Therefore, a small or medium effect might be misread as a large effect if ES2 is used without caution. To avoid assuming a large standardized effect and to remain consistent with Cohen's definitions of small, medium, and large effect sizes, I use ES as the definition of the standardized effect size in this study. Note that researchers still need to interpret ES with caution, because it tends to be larger than ES1: it does not take the variance at the first level into consideration. The non-centrality parameter of the t-test in equation (3.14) then simplifies to

$$\lambda = ES \sqrt{\frac{mN}{2}} \, \frac{\sqrt{\omega_{pp}^2 + \tau_{pp}^2}}{\sqrt{N\omega_{pp}^2 + \tau_{pp}^2 + \sigma_p^2}}. \tag{3.17}$$

The power of a two-tailed t-test for a specified significance level $\alpha$ is defined as

$$p_1 = 1 - H[c(\alpha/2, 2m-2), (2m-2), \lambda] + H[-c(\alpha/2, 2m-2), (2m-2), \lambda], \tag{3.18}$$

where $c(\alpha, \nu)$ is the level $\alpha$ one-tailed critical value of the t-distribution with $\nu$ degrees of freedom (e.g., c(0.05, 20) = 1.72), and $H(x, \nu, \lambda)$ is the cumulative distribution function of the non-central t-distribution with $\nu$ degrees of freedom and non-centrality parameter $\lambda$. Alternatively, one can use an F-test with 1 and 2m - 2 degrees of freedom and non-centrality parameter $\lambda^2$.

Covariates at Second and Third Levels

When covariates are included at the second level, equation (3.10) becomes

$$\pi_{pij} = \beta_{p0j} + \mathbf{X}_{ij}\mathbf{B}_{p2j} + \xi_{Apij}, \tag{3.19}$$

where $\mathbf{X}_{ij}$ is a row vector of k level-2 unit characteristics, and $\mathbf{B}_{p2j}$ is a column vector of k coefficients of the level-2 unit characteristics. The $\xi_{Apij}$'s are level-2 specific random effects within level-3 units for each polynomial change parameter, and the subscript A indicates adjustment in the error term because of covariates. The random effects follow a multivariate normal distribution with zero means, variances $\tau_{Rpp}^2$, and covariances $\tau_{Rpp'}$ between the random effects $\xi_{Apij}$ and $\xi_{Ap'ij}$, where the subscript R indicates residual variance because of covariates. All other terms have been defined previously. Similarly, the third level model of equation (3.11) becomes

$$\beta_{p0j} = \gamma_{p00} + \gamma_{Ap01} T_j + \mathbf{Z}_{j}\boldsymbol{\Gamma}_{p02} + \eta_{Ap0j}, \tag{3.20}$$

where $\mathbf{Z}_{j}$ is a row vector of q level-3 unit characteristics, and $\boldsymbol{\Gamma}_{p02}$ is a column vector of coefficients of the level-3 unit characteristics. The $\eta_{Ap0j}$'s are level-3 specific random effects for each polynomial change parameter, where the subscript A indicates adjustment because of covariates (see Konstantopoulos, 2008a). These random effects follow a multivariate normal distribution with zero means, variances $\omega_{Rpp}^2$, and covariances $\omega_{Rpp'}$ between the random effects $\eta_{Ap0j}$ and $\eta_{Ap'0j}$, where the subscript R indicates residual variance because of covariates. All other terms have been defined previously.
As a result, the non-centrality parameter of the t-test for the three-level model with covariates at the second and third levels is defined as

$$\lambda_A = \sqrt{\frac{mN}{2}} \, \frac{\gamma_{Ap01}}{\sqrt{N w_3 \omega_{pp}^2 + w_2 \tau_{pp}^2 + \sigma_p^2}}, \tag{3.21}$$

where

$$w_3 = \omega_{Rpp}^2 / \omega_{pp}^2, \qquad w_2 = \tau_{Rpp}^2 / \tau_{pp}^2, \tag{3.22}$$

that is, $w_2$ indicates the proportion of the variance at the second level that is still unexplained, while $w_3$ indicates the proportion of the variance at the third level that is still unexplained. For example, $w_3 = 0.8$ indicates that the variance at the third level decreased by 20% because of the inclusion of covariates at the third level (assuming a centering approach where covariates can explain variance in the outcome only at their corresponding levels). In other words, the covariates at the third level explain 20% of the variance at the third level. We assume that the coefficient of the treatment does not change after adding covariates at the second and third levels ($\gamma_{Ap01} = \gamma_{p01}$), which is reasonable because in experimental designs the treatment ($T_j$) should be independent of any covariates (observed or unobserved). Then the non-centrality parameter $\lambda_A$ in equation (3.21) simplifies to

$$\lambda_A = ES \sqrt{\frac{mN}{2}} \, \frac{\sqrt{\omega_{pp}^2 + \tau_{pp}^2}}{\sqrt{N w_3 \omega_{pp}^2 + w_2 \tau_{pp}^2 + \sigma_p^2}}. \tag{3.23}$$

The power of a two-tailed t-test for a specified significance level $\alpha$ is defined as

$$p_2 = 1 - H[c(\alpha/2, 2m-q-2), (2m-q-2), \lambda_A] + H[-c(\alpha/2, 2m-q-2), (2m-q-2), \lambda_A], \tag{3.24}$$

where q is the number of covariates at the third level. As mentioned previously, an F-test could be used instead.

Design II: Treatment Assigned at Second Level (Block Randomized Design)

Unconditional Model

The first level model is identical to equation (3.3).
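The trade-off between a larger non-centrality parameter and fewer degrees of freedom in equations (3.23) and (3.24) can be made concrete with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and the variance values (level-3 variance 0.00012, level-2 variance 0.00091, level-1 variance of the linear term 0.00046) are borrowed from the Indiana example reported later in the chapter.

```python
import math


def crd_ncp_with_covariates(es, m, N, omega2, tau2, sigma2_p, w2=1.0, w3=1.0):
    """Non-centrality parameter of equation (3.23).

    es        standardized effect size (ES)
    m, N      level-3 units per condition, level-2 units per level-3 unit
    omega2    level-3 variance, tau2 level-2 variance
    sigma2_p  level-1 variance of the polynomial term
    w2, w3    proportions of level-2 / level-3 variance left unexplained
    Setting w2 = w3 = 1 reproduces the unconditional equation (3.17).
    """
    numerator = es * math.sqrt(m * N / 2.0) * math.sqrt(omega2 + tau2)
    denominator = math.sqrt(N * w3 * omega2 + w2 * tau2 + sigma2_p)
    return numerator / denominator


def crd_df(m, q=0):
    """Degrees of freedom of the t-test with q level-3 covariates."""
    return 2 * m - q - 2


if __name__ == "__main__":
    args = dict(es=0.40, m=10, N=20, omega2=0.00012, tau2=0.00091, sigma2_p=0.00046)
    base = crd_ncp_with_covariates(**args)
    adjusted = crd_ncp_with_covariates(**args, w2=0.9, w3=0.9)
    print(round(base, 3), crd_df(10))
    print(round(adjusted, 3), crd_df(10, q=5))
```

Covariates raise the non-centrality parameter but each level-3 covariate costs one degree of freedom, so with very few level-3 units the net effect on power can be negative.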
The second level model incorporates the treatment ($T_{ij}$), namely

$$\pi_{pij} = \beta_{p0j} + \beta_{p1j} T_{ij} + \xi_{pij}, \tag{3.25}$$

where the $\beta_{p0j}$'s represent the average polynomial effects within level-3 units, $T_{ij}$ is a dummy variable coded as one if second level unit i in third level unit j is assigned to the treatment condition and zero otherwise, $\beta_{p1j}$ is the treatment effect within level-3 units, and the $\xi_{pij}$'s are level-2 random effects within level-3 units for each polynomial change parameter. The random effects follow a multivariate normal distribution with zero means, variances $\tau_{pp}^2$, and covariances $\tau_{pp'}$ between the random effects $\xi_{pij}$ and $\xi_{p'ij}$.

The third level equations for the intercept ($\beta_{p0j}$) and the treatment effect ($\beta_{p1j}$) are

$$\beta_{p0j} = \gamma_{p00} + \eta_{p0j}, \qquad \beta_{p1j} = \gamma_{p10} + \eta_{p1j}, \tag{3.26}$$

where the $\gamma_{p00}$'s represent the average polynomial effects across level-3 units, the $\eta_{p0j}$'s are level-3 unit specific random effects for each polynomial change parameter, the $\gamma_{p10}$'s represent the average difference between the treatment and the control groups for each polynomial change parameter across level-3 units, and the $\eta_{p1j}$'s are treatment by level-3 unit random effects (interaction effects) for each polynomial change parameter. The $\eta_{p0j}$'s follow a multivariate normal distribution with zero means and variances $\omega_{pp}^2$, while the treatment by level-3 unit random effects follow a normal distribution with a mean of zero and a variance $\omega_{Tpp}^2$, where the subscript T indicates treatment at the second level whose effect varies at the third level.

Suppose there are M level-3 units and n level-2 units within each treatment condition within each level-3 unit, which means that the total number of level-2 units in each level-3 unit is N = 2n and thus the total number of level-2 units is MN. Then, the variance of the estimate of the treatment effect for polynomial p is

$$\mathrm{Var}(\hat{\gamma}_{p10}) = \frac{2\left(n\omega_{Tpp}^2 + \tau_{pp}^2 + \sigma_p^2\right)}{Mn}, \qquad \sigma_p^2 = \frac{\sigma_e^2}{\sum_{g=1}^{G} c_{pg}^2}, \tag{3.27}$$

where $\sum_{g=1}^{G} c_{pg}^2$ is defined in equation (3.9).

Suppose that a researcher wants to test the hypothesis that $\gamma_{p10}$ is different from zero and carries out a t-test. The test statistic is defined as

$$t = \frac{\hat{\gamma}_{p10}}{\sqrt{\mathrm{Var}(\hat{\gamma}_{p10})}}. \tag{3.28}$$

When the null hypothesis is true, the test statistic t has a Student's t-distribution with M - 1 degrees of freedom (Konstantopoulos, 2008b). When the null hypothesis is false, the test statistic t has the non-central t-distribution with M - 1 degrees of freedom and non-centrality parameter $\lambda_T$. The non-centrality parameter is defined as the expected value of the estimate of the treatment effect divided by the square root of the variance of the estimate of the treatment effect, namely

$$\lambda_T = \sqrt{\frac{Mn}{2}} \, \frac{\gamma_{p10}}{\sqrt{n\omega_{Tpp}^2 + \tau_{pp}^2 + \sigma_p^2}}. \tag{3.29}$$

We define the standardized effect size for a polynomial of degree p as

$$ES = \frac{\gamma_{p10}}{\sqrt{\omega_{Tpp}^2 + \tau_{pp}^2}}. \tag{3.30}$$

Then, the non-centrality parameter of the t-test simplifies to

$$\lambda_T = ES \sqrt{\frac{Mn}{2}} \, \frac{\sqrt{\omega_{Tpp}^2 + \tau_{pp}^2}}{\sqrt{n\omega_{Tpp}^2 + \tau_{pp}^2 + \sigma_p^2}}. \tag{3.31}$$

The power of a two-tailed t-test for a specified significance level $\alpha$ is defined as

$$p_3 = 1 - H[c(\alpha/2, M-1), (M-1), \lambda_T] + H[-c(\alpha/2, M-1), (M-1), \lambda_T], \tag{3.32}$$

where $c(\alpha, \nu)$ is the level $\alpha$ one-tailed critical value of the t-distribution with $\nu$ degrees of freedom, and $H(x, \nu, \lambda)$ is the cumulative distribution function of the non-central t-distribution with $\nu$ degrees of freedom and non-centrality parameter $\lambda$. As noted previously, one could use an F-test instead.

Covariates at Second and Third Levels

When covariates are included at the second level, equation (3.25) becomes

$$\pi_{pij} = \beta_{p0j} + \beta_{Ap1j} T_{ij} + \mathbf{X}_{ij}\mathbf{B}_{p2j} + \xi_{Apij}, \tag{3.33}$$

where $\mathbf{X}_{ij}$ is a row vector of k level-2 unit background characteristics, and $\mathbf{B}_{p2j}$ is a column vector of k coefficients of the level-2 unit characteristics. The $\xi_{Apij}$'s are level-2 specific random effects within level-3 units for each polynomial change parameter, where the subscript A indicates adjustment because of covariates. The random effects follow a multivariate normal distribution with zero means, variances $\tau_{Rpp}^2$, and covariances $\tau_{Rpp'}$ between the random effects $\xi_{Apij}$ and $\xi_{Ap'ij}$.
The subscript R indicates residual variance because of covariates. All other terms have been defined previously. When covariates are included at the third level, equation (3.26) becomes

$$\beta_{p0j} = \gamma_{p00} + \mathbf{Z}_{j}\boldsymbol{\Gamma}_{p01} + \eta_{Ap0j}, \qquad \beta_{Ap1j} = \gamma_{Ap10} + \mathbf{Z}_{j}\boldsymbol{\Gamma}_{p11} + \eta_{Ap1j}, \tag{3.34}$$

where $\mathbf{Z}_{j}$ is a row vector of q level-3 unit characteristics and the $\boldsymbol{\Gamma}$'s are column vectors of regression coefficients. The $\eta_{Ap0j}$'s are level-3 unit specific random effects for each polynomial change parameter, and the $\eta_{Ap1j}$'s are treatment by level-3 unit random effects (interaction effects) for each polynomial change parameter. The $\eta_{Ap0j}$'s follow a multivariate normal distribution with zero means and variances $\omega_{Rpp}^2$, and the treatment by level-3 unit random effects follow a normal distribution with a mean of zero and a variance $\omega_{RTpp}^2$, where the subscript R indicates residual variance because of covariates.

The non-centrality parameter of the t-test when covariates are added at the second and third levels is defined as

$$\lambda_{AT} = \sqrt{\frac{Mn}{2}} \, \frac{\gamma_{Ap10}}{\sqrt{n w_3 \omega_{Tpp}^2 + w_2 \tau_{pp}^2 + \sigma_p^2}}, \tag{3.35}$$

where the subscript A indicates adjustment because of covariates (see Konstantopoulos, 2008b) and

$$w_3 = \omega_{RTpp}^2 / \omega_{Tpp}^2, \qquad w_2 = \tau_{Rpp}^2 / \tau_{pp}^2, \tag{3.36}$$

that is, $w_2$ indicates the proportion of the variance at the second level that is still unexplained, and $w_3$ indicates the proportion of the treatment by level-3 unit variance at the third level that is still unexplained. We assume that the coefficient of the treatment does not change after adding covariates at the second and third levels ($\gamma_{Ap10} = \gamma_{p10}$), which is reasonable because in experimental designs the treatment ($T_{ij}$) should be independent of any covariates (observed or unobserved). Then the non-centrality parameter of the t-test in equation (3.35) simplifies to

$$\lambda_{AT} = ES \sqrt{\frac{Mn}{2}} \, \frac{\sqrt{\omega_{Tpp}^2 + \tau_{pp}^2}}{\sqrt{n w_3 \omega_{Tpp}^2 + w_2 \tau_{pp}^2 + \sigma_p^2}}. \tag{3.37}$$

The power of a two-tailed t-test for a specified significance level $\alpha$ is defined as

$$p_4 = 1 - H[c(\alpha/2, M-q-1), (M-q-1), \lambda_{AT}] + H[-c(\alpha/2, M-q-1), (M-q-1), \lambda_{AT}], \tag{3.38}$$

where q is the number of covariates at the third level. As mentioned previously, one could use an F-test instead.
Illustrative Examples

Cluster Randomized Design: A Linear Growth Model

To illustrate how the methods can be used to assess the consequences of study duration, sample sizes (students and schools), and covariates for power, I first used data from a large-scale experiment conducted in Indiana. This experiment employed a cluster randomized design (CRD), where students were nested within schools, and schools were nested within treatment and control groups. Random assignment took place at the school level; that is, schools were randomly assigned to treatment and control conditions. Schools in the treatment group adopted specific diagnostic assessment tools to measure student learning a few times during the 2009-2010 school year and to provide diagnostic information to teachers to improve ongoing instruction. The study incorporated a longitudinal component, and thus student mathematics and reading achievement were measured three times, in the spring of 2010, 2011, and 2012 (see Konstantopoulos, Miller, & van der Ploeg, 2013 for a more detailed introduction to this experiment). The total number of participating schools was 50, with 32 schools in the treatment group. Overall, nearly 20,000 students participated in the study during the 2009-2010 school year.

The outcome is standardized student mathematics achievement. Because the study duration was only three years, I used a linear rate of change model at level 1 (repeated measures), namely

$$\mathrm{Math}_{gij} = \pi_{0ij} c_{0g} + \pi_{1ij} c_{1g} + u_{gij}, \qquad u_{gij} \sim N(0, \sigma_e^2),$$

where $\mathrm{Math}_{gij}$ is student mathematics achievement in year g, $c_{0g} = (1, 1, 1)$ and $c_{1g} = (-1, 0, 1)$ at g = 1, 2, 3 in accord with the orthogonal polynomials in equation (3.6). This model defines $\pi_{0ij}$ as the mean mathematics achievement for student i in school j, and $\pi_{1ij}$ as the average rate of linear change of mathematics achievement for student i in school j.
The second level model (student level) is

$$\pi_{0ij} = \beta_{00j} + \xi_{0ij}, \qquad \xi_{0ij} \sim N(0, \tau_{00}^2)$$
$$\pi_{1ij} = \beta_{10j} + \xi_{1ij}, \qquad \xi_{1ij} \sim N(0, \tau_{11}^2),$$

where $\beta_{00j}$ is the mean mathematics achievement in school j, and $\beta_{10j}$ is the average growth rate in school j. The third level model (school level) is

$$\beta_{00j} = \gamma_{000} + \gamma_{001} T_j + \eta_{00j}, \qquad \eta_{00j} \sim N(0, \omega_{00}^2)$$
$$\beta_{10j} = \gamma_{100} + \gamma_{101} T_j + \eta_{10j}, \qquad \eta_{10j} \sim N(0, \omega_{11}^2),$$

where $\gamma_{000}$ is the grand mean, $\gamma_{001}$ is the main effect of treatment for the mean, $T_j$ is a binary indicator coded as one for treatment schools and zero for control schools, $\gamma_{100}$ is the average rate of change, and $\gamma_{101}$ is the main effect of treatment for the rate of change, which is my primary interest. The estimates of the relevant variances are

$$\sigma_e^2 = 0.00092, \qquad \tau_{11}^2 = 0.00091, \qquad \omega_{11}^2 = 0.00012.$$

To calculate power, I assumed a standardized effect size of 0.40 and a significance level of 0.05. I also assumed sample sizes of m = 10 and N = 20, which indicates 10 schools in the treatment group (20 schools in total) and 20 students in each treatment or control school. According to equations (3.8) and (3.9) with G = 3, p = 1, and $k_1 = 1$, I first calculate

$$\sigma_1^2 = \frac{0.00092}{1^2 (1!)^4 \, 4! / (2! \, 3! \, 1!)} = \frac{0.00092}{2} = 0.00046.$$

Then, I calculate the non-centrality parameter of the t-test based on equation (3.17), namely

$$\lambda = 0.40 \sqrt{\frac{10 \times 20}{2}} \, \frac{\sqrt{0.00012 + 0.00091}}{\sqrt{20 \times 0.00012 + 0.00091 + 0.00046}} \approx 2.090.$$

Next, I compute the critical value of the test using the t-distribution with (2 × 10) - 2 = 18 degrees of freedom as c(0.025, 18) ≈ 2.101. To compute power I use equation (3.18):

$$p = 1 - H[2.101, 18, 2.090] + H[-2.101, 18, 2.090] \approx 0.508.$$

Tables 3.1 to 3.3 and Figures 3.1 to 3.3 show how variations in study duration and sample sizes affect the power to detect the treatment effect for the linear rate of change in cluster designs, assuming two-tailed t-tests at the 0.05 significance level and an effect size of 0.40. Table 3.1 and Figure 3.1 provide power estimates for designs that vary the study duration (D) and the number of schools (M), holding the number of students (N) in each school constant at 20.
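The worked cluster randomized example can be reproduced without a statistics package. The sketch below is one possible implementation, not the procedure used to build the tables: it evaluates the non-central t distribution function by Simpson's rule, finds the critical value by bisection, and uses only the Python standard library. All function names, the integration grid, and the bisection bounds are assumptions made for illustration; a tested library routine (for example the non-central t distribution in SciPy) would normally be preferable.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def nct_cdf(x, df, ncp, steps=4000, upper=8.0):
    """P(T <= x) for a non-central t variable, via Simpson's rule.

    Writes T = (Z + ncp) / S with S = sqrt(chi2_df / df), so that
    P(T <= x) is the integral over s > 0 of Phi(x * s - ncp) * f_S(s).
    """
    log_const = math.log(2.0) + 0.5 * df * math.log(0.5 * df) - math.lgamma(0.5 * df)

    def integrand(s):
        log_density = log_const + (df - 1) * math.log(s) - 0.5 * df * s * s
        return normal_cdf(x * s - ncp) * math.exp(log_density)

    lower = 1e-12  # avoids log(0); the density is negligible there for df >= 2
    h = (upper - lower) / steps
    total = integrand(lower) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(lower + i * h)
    return total * h / 3.0


def t_critical(alpha, df):
    """One-tailed critical value c(alpha, df) of the central t, by bisection."""
    lo, hi = 0.0, 100.0  # assumes the critical value lies below 100
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if nct_cdf(mid, df, 0.0) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def crd_noncentrality(es, m, N, omega2, tau2, sigma2_p):
    """Equation (3.17): omega2 is the level-3 and tau2 the level-2 variance."""
    return (es * math.sqrt(m * N / 2.0) * math.sqrt(omega2 + tau2)
            / math.sqrt(N * omega2 + tau2 + sigma2_p))


def two_tailed_power(ncp, df, alpha=0.05):
    """Equation (3.18) for a given non-centrality parameter and df."""
    c = t_critical(alpha / 2.0, df)
    return 1.0 - nct_cdf(c, df, ncp) + nct_cdf(-c, df, ncp)


if __name__ == "__main__":
    sigma2_1 = 0.00092 / 2.0  # level-1 variance over sum of c_1g^2 with G = 3
    ncp = crd_noncentrality(0.40, m=10, N=20, omega2=0.00012,
                            tau2=0.00091, sigma2_p=sigma2_1)
    print(round(ncp, 3), round(two_tailed_power(ncp, df=2 * 10 - 2), 3))
```

Because the functions take the design parameters as arguments, the same code can be looped over m, N, or the number of time points to explore how each one moves power.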
The estimate of power from above was 0.508 (see Table 3.1, row 2, column 2). As the study duration or the number of schools increases, power increases. When the study duration is three and the number of schools is 40, power exceeds 0.80 (i.e., 0.822). Note that power increases substantially as the study duration increases from two to three, but changes only marginally thereafter. This suggests that, for a fixed number of students, increasing the study duration beyond a certain point has only a small effect on power. In addition, the number of schools has a bigger effect on power than the study duration. For example, when the study duration is tripled from two to six, power less than doubles, whereas when the number of schools is tripled from 10 to 30, power more than doubles.

Table 3.2 and Figure 3.2 provide power estimates for designs that vary the duration of the study (D) and the number of students (N) in each school, holding the number of schools (M) constant at 20. As the study duration or the number of students grows, power becomes larger. In particular, power changes substantially when the study duration increases from two to three, and then does not change much as the study duration becomes longer. Similarly, increasing the number of students increases power up to a certain number of students per school, and beyond that number power does not change much. It is noteworthy that increasing the number of students is not an effective way of boosting power.
For example, as shown in Figure 3.2, power is still less than 0.70 even if the number of students per school reaches 1000.

Table 3.1: Effect of Study Duration (D) and Number of Schools (M) on Power, Holding Number of Students (N) in Each School Constant at 20: CRD, Linear Rate of Change

D \ M     10     20     30     40     50     60     70     80     90    100
2      0.201  0.395  0.562  0.693  0.791  0.861  0.910  0.942  0.964  0.977
3      0.257  0.508  0.696  0.822  0.900  0.945  0.971  0.985  0.992  0.996
4      0.273  0.538  0.728  0.849  0.920  0.959  0.980  0.990  0.995  0.998
5      0.279  0.549  0.740  0.858  0.926  0.963  0.982  0.992  0.996  0.998
6      0.282  0.554  0.745  0.862  0.929  0.965  0.983  0.992  0.996  0.998
7      0.283  0.556  0.747  0.864  0.931  0.966  0.984  0.992  0.997  0.998
8      0.284  0.558  0.748  0.865  0.931  0.966  0.984  0.993  0.997  0.998

Note. Effect size is 0.4 with a significance level of 0.05.

Figure 3.1: Effect of Study Duration (D) and Number of Schools (M) on Power, Holding Number of Students (N) in Each School Constant at 20: CRD, Linear Rate of Change. Note. Effect size is 0.4 with a significance level of 0.05.

Table 3.2: Effect of Study Duration (D) and Number of Students (N) on Power, Holding Number of Schools (M) Constant at 20: CRD, Linear Rate of Change

D \ N     10     20     30     40     50     60     70     80     90    100
2      0.277  0.395  0.463  0.507  0.537  0.559  0.576  0.589  0.600  0.609
3      0.396  0.508  0.560  0.590  0.609  0.623  0.633  0.640  0.646  0.651
4      0.434  0.538  0.584  0.609  0.626  0.637  0.645  0.651  0.656  0.660
5      0.449  0.549  0.592  0.616  0.631  0.642  0.649  0.655  0.660  0.663
6      0.455  0.554  0.596  0.619  0.634  0.644  0.651  0.657  0.661  0.665
7      0.459  0.556  0.598  0.620  0.635  0.645  0.652  0.658  0.662  0.665
8      0.461  0.558  0.599  0.621  0.636  0.645  0.653  0.658  0.662  0.666

Note. Effect size is 0.4 with a significance level of 0.05.

Figure 3.2: Effect of Study Duration (D) and Number of Students (N) on Power, Holding Number of Schools (M) Constant at 20: CRD, Linear Rate of Change. Note. Effect size is 0.4 with a significance level of 0.05.

Table 3.3: Effects of Number of Schools (M) and Number of Students (N) on Power, Holding Study Duration (D) Constant at 3: CRD, Linear Rate of Change

M \ N     10     20     30     40     50     60     70     80     90    100
10     0.201  0.257  0.285  0.302  0.314  0.322  0.328  0.333  0.337  0.340
20     0.396  0.508  0.560  0.590  0.609  0.623  0.633  0.640  0.646  0.651
30     0.563  0.696  0.751  0.780  0.798  0.810  0.819  0.826  0.831  0.835
40     0.694  0.822  0.867  0.890  0.903  0.911  0.917  0.922  0.925  0.928
50     0.792  0.900  0.933  0.947  0.956  0.961  0.964  0.967  0.969  0.970
60     0.862  0.945  0.967  0.976  0.981  0.983  0.985  0.987  0.988  0.988
70     0.910  0.971  0.984  0.989  0.992  0.993  0.994  0.995  0.995  0.996
80     0.943  0.985  0.993  0.995  0.997  0.997  0.998  0.998  0.998  0.998
90     0.964  0.992  0.997  0.998  0.999  0.999  0.999  0.999  0.999  0.999
100    0.977  0.996  0.999  0.999  0.999  1.000  1.000  1.000  1.000  1.000

Note. Effect size is 0.4 with a significance level of 0.05.

Figure 3.3: Effects of Number of Schools (M) and Number of Students (N) on Power, Holding Study Duration (D) Constant at 3: CRD, Linear Rate of Change. Note. Effect size is 0.4 with a significance level of 0.05.

Table 3.3 and Figure 3.3 provide power estimates for designs that vary the number of students (N) in each school and the number of schools (M), holding the study duration constant at three. As the number of students per school or the number of schools increases, power increases initially and then does not change much. Power reaches 0.80 with various combinations of the number of schools and the number of students per school (e.g., M = 30 and N = 60, M = 40 and N = 20, and M = 60 and N = 10). It should also be noted that the number of schools affects power more strongly than the number of students in each school, holding the study duration fixed.
For example, power at least about triples when the number of schools increases from ten to 100, whereas power less than doubles when the number of students increases from ten to 100.

Covariates also influence power, provided they explain a certain proportion of the variance at the second or the third level. Table 3.4 and Figure 3.4 show how power varies as the proportions of unexplained variance at the second and third levels vary for a design with M = 20 (or m = 10), N = 20, D = 3, and ES = 0.40. The degrees of freedom decrease when I add covariates at the third level. Assuming that five covariates are added at the third level (q = 5), the degrees of freedom reduce to (2 × 10) - 5 - 2 = 13. As the unexplained variance decreases because of covariates, power increases. For example, when w3 = 0.9 and w2 = 0.9, which indicates that the proportions of unexplained variance at the second and the third levels are 90% (or that the covariates explain 10% of the variance at the second and the third levels), the power is 0.526, which is larger than the power without covariates (0.508). In addition, covariates at the third level affect power substantially more than covariates at the second level, which is mainly because the ratio between the variance of the level-2 random effect and the variance of the level-3 random effect ($\tau_{11}^2 / \omega_{11}^2$) is small. For example, as the proportion of unexplained variance at the second level (w2) decreases from 0.9 to 0.1 with w3 = 0.9, power increases only modestly from 0.526 to 0.626. However, as the proportion of unexplained variance at the third level (w3) decreases from 0.9 to 0.1 with w2 = 0.9, power increases markedly from 0.526 to 0.861.

Table 3.4: Effect of Covariates on Power: CRD, Linear Rate of Change

w3 \ w2   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
0.1     0.988  0.979  0.967  0.953  0.937  0.919  0.900  0.881  0.861
0.2     0.958  0.943  0.925  0.907  0.888  0.868  0.848  0.828  0.808
0.3     0.914  0.895  0.875  0.855  0.835  0.815  0.795  0.775  0.756
0.4     0.862  0.842  0.822  0.802  0.782  0.763  0.744  0.726  0.708
0.5     0.809  0.790  0.770  0.751  0.733  0.715  0.697  0.681  0.665
0.6     0.758  0.739  0.721  0.704  0.687  0.670  0.655  0.639  0.625
0.7     0.710  0.693  0.676  0.660  0.645  0.630  0.616  0.602  0.589
0.8     0.666  0.650  0.635  0.621  0.607  0.593  0.580  0.568  0.556
0.9     0.626  0.612  0.598  0.585  0.572  0.560  0.549  0.537  0.526

Note. The study duration is 3 with 20 schools and 20 students in each school; significance level is 0.05.

Figure 3.4: Effect of Covariates on Power: CRD, Linear Rate of Change. Note. The study duration is 3 with 20 schools and 20 students in each school; significance level is 0.05.

To compare power between designs with and without covariates, I also compute power estimates for designs that vary the number of students (N) in each school and the number of schools (M), assuming that 40% of the variance is explained at the second and the third levels (w2 = w3 = 0.6) and holding the study duration constant at three. These estimates are presented in Table 3.5 and Figure 3.5. In general, power increases when covariates explain a certain proportion of the variance at the second or the third level, compared with the power estimates in Table 3.3. There are only three exceptions (i.e., M = 10 and N = 10, M = 10 and N = 20, and M = 10 and N = 30), where power decreases when covariates are added. That is because the degrees of freedom decrease when five covariates are added at the third level.

Block Randomized Design: A Linear Growth Model

The second example used data from Project STAR (Student-Teacher Achievement Ratio) in Tennessee (e.g., Finn & Achilles, 1990; Krueger, 1999; Nye, Hedges, & Konstantopoulos, 2000). This experiment employed a block randomized design (BRD), where within each school (the block) and grade, students and their teachers were randomly assigned to one of three treatment conditions: small classes (13-17 students), regular-size classes (22-25 students), and regular classes with a full-time teacher aide (22-25 students).
Project STAR was a longitudinal study that started in the 1985-1986 school year. The cohort of students who entered kindergarten in the 1985-1986 school year remained in the experiment until third grade. Students' mathematics and reading achievement were measured four times, at the end of kindergarten, first grade, second grade, and third grade. Overall, more than 11,000 students in 79 schools participated in the experiment over the four-year period. The sample included students in small classes or regular classes only, to ensure a balanced design. Students in regular classes with a full-time teacher aide were excluded from the analysis.

The outcome is standardized student mathematics achievement. A linear rate of change was used at level 1 (repeated measures), namely

$$\mathrm{Math}_{gij} = \pi_{0ij} c_{0g} + \pi_{1ij} c_{1g} + u_{gij}, \qquad u_{gij} \sim N(0, \sigma_e^2),$$

where $\mathrm{Math}_{gij}$ is student mathematics achievement in year g, $c_{0g} = (1, 1, 1, 1)$ and $c_{1g} = (-1.5, -0.5, 0.5, 1.5)$ at g = 1, 2, 3, 4 following equation (3.6). This model defines $\pi_{0ij}$ as the mean mathematics achievement for student i in school j, and $\pi_{1ij}$ as the average linear rate of change of mathematics achievement for student i in school j.
The second level model (student level) is

$$\pi_{0ij} = \beta_{00j} + \beta_{01j} T_{ij} + \xi_{0ij}, \qquad \xi_{0ij} \sim N(0, \tau_{00}^2)$$
$$\pi_{1ij} = \beta_{10j} + \beta_{11j} T_{ij} + \xi_{1ij}, \qquad \xi_{1ij} \sim N(0, \tau_{11}^2),$$

where $\beta_{00j}$ is the mean mathematics achievement in school j, $\beta_{01j}$ is the average difference in mathematics achievement between students in small classes and students in regular classes in school j, $\beta_{10j}$ is the average linear growth rate in school j, and $\beta_{11j}$ is the average difference in the linear growth rate between students in small classes and students in regular classes in school j.

Table 3.5: Effects of Covariates, Number of Schools (M) and Number of Students (N) on Power, Holding Study Duration (D) Constant at 3, w2 = 0.6 and w3 = 0.6: CRD, Linear Rate of Change

M \ N     10     20     30     40     50     60     70     80     90    100
10     0.195  0.253  0.283  0.302  0.315  0.324  0.331  0.337  0.341  0.345
20     0.525  0.670  0.734  0.768  0.789  0.804  0.814  0.822  0.829  0.833
30     0.727  0.861  0.906  0.927  0.939  0.946  0.952  0.955  0.958  0.960
40     0.851  0.945  0.970  0.979  0.984  0.987  0.989  0.990  0.991  0.992
50     0.922  0.980  0.991  0.994  0.996  0.997  0.998  0.998  0.998  0.998
60     0.960  0.993  0.997  0.999  0.999  0.999  1.000  1.000  1.000  1.000
70     0.981  0.998  0.999  1.000  1.000  1.000  1.000  1.000  1.000  1.000
80     0.991  0.999  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
90     0.996  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
100    0.998  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000

Note. Effect size is 0.4 with a significance level of 0.05.

Figure 3.5: Effects of Covariates, Number of Schools (M) and Number of Students (N) on Power, Holding Study Duration (D) Constant at 3, w2 = 0.6 and w3 = 0.6: CRD, Linear Rate of Change. Note. Effect size is 0.4 with a significance level of 0.05.
The third level model (school level) is

$$\beta_{00j} = \gamma_{000} + \eta_{00j}, \qquad \eta_{00j} \sim N(0, \omega_{00}^2)$$
$$\beta_{01j} = \gamma_{010} + \eta_{01j}, \qquad \eta_{01j} \sim N(0, \omega_{T00}^2)$$
$$\beta_{10j} = \gamma_{100} + \eta_{10j}, \qquad \eta_{10j} \sim N(0, \omega_{11}^2)$$
$$\beta_{11j} = \gamma_{110} + \eta_{11j}, \qquad \eta_{11j} \sim N(0, \omega_{T11}^2),$$

where $\gamma_{000}$ is the grand mean, $\gamma_{010}$ is the average treatment effect for all schools, $\gamma_{100}$ is the average linear rate of change, and $\gamma_{110}$ is the main effect of treatment for the linear rate of change, which is my primary interest. The variance estimates are

$$\sigma_e^2 = 0.30369, \qquad \tau_{11}^2 = 0.00753, \qquad \omega_{T11}^2 = 0.02097.$$

To calculate power, I assumed a standardized effect size of 0.40 and a significance level of 0.05. I also assumed sample sizes of M = 40 and N = 40, which indicates that there were 40 schools and, in each school, 20 students in the treatment condition and 20 in the control condition (n = 20, 40 students in total). According to equations (3.8) and (3.9) with G = 4, p = 1, and $k_1 = 1$, I first compute

$$\sigma_1^2 = \frac{0.30369}{1^2 (1!)^4 \, 5! / (2! \, 3! \, 2!)} = \frac{0.30369}{5} = 0.060738.$$

Then, I calculate the non-centrality parameter of the t-test based on equation (3.31):

$$\lambda_T = 0.40 \sqrt{\frac{40 \times 20}{2}} \, \frac{\sqrt{0.02097 + 0.00753}}{\sqrt{20 \times 0.02097 + 0.00753 + 0.060738}} \approx 1.933.$$

The critical value of the test using the t-distribution with 40 - 1 = 39 degrees of freedom is c(0.025, 39) ≈ 2.023. Finally, I computed power as

$$p = 1 - H[2.023, 39, 1.933] + H[-2.023, 39, 1.933] \approx 0.471.$$

Tables 3.6 to 3.8 and Figures 3.6 to 3.8 show how variations in study duration and sample sizes affect the power to detect the treatment effect for the linear rate of change in block designs, assuming two-tailed t-tests at the 0.05 significance level and an effect size of 0.40. Table 3.6 and Figure 3.6 show how power changes as the study duration (D) and the number of schools (M) change, holding the number of students (N) in each school constant at 40. The power to detect a linear rate of change increases slightly when the study duration increases from two to six, and remains virtually unchanged as the study duration increases from six to eight. However, as the number of schools increases, power increases substantially more.
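The block randomized calculation can be traced step by step in code. The sketch below is illustrative only: the function names are hypothetical, the closed form for the sum of squared linear coefficients is taken from equation (3.9), and the value 0.02097 for the treatment-by-school variance is the one used in the non-centrality computation above.

```python
import math
from math import factorial


def sum_sq_linear(G):
    """Equation (3.9) for p = 1 and k_1 = 1: sum over g of c_1g squared."""
    return factorial(G + 1) / (factorial(2) * factorial(3) * factorial(G - 2))


def brd_noncentrality(es, M, n, omega2_T, tau2, sigma2_p, w2=1.0, w3=1.0):
    """Equations (3.31) and (3.37) for a block randomized design.

    M, n       number of schools, students per condition within each school
    omega2_T   treatment-by-school variance, tau2 level-2 variance
    sigma2_p   level-1 variance of the polynomial term
    w2, w3     proportions of variance left unexplained by covariates
    """
    numerator = es * math.sqrt(M * n / 2.0) * math.sqrt(omega2_T + tau2)
    denominator = math.sqrt(n * w3 * omega2_T + w2 * tau2 + sigma2_p)
    return numerator / denominator


if __name__ == "__main__":
    sigma2_1 = 0.30369 / sum_sq_linear(4)  # four measurement occasions
    common = dict(es=0.40, M=40, n=20, omega2_T=0.02097, tau2=0.00753,
                  sigma2_p=sigma2_1)
    print(round(sigma2_1, 6))
    print(round(brd_noncentrality(**common), 3), "df =", 40 - 1)
    print(round(brd_noncentrality(**common, w2=0.6, w3=0.6), 3), "df =", 40 - 5 - 1)
```

The second call shows the same design with 40% of the variance explained at both levels and five level-3 covariates, which is the scenario behind the covariate-adjusted power tables.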
For example, when the number of schools increases from 20 to 60, power more than doubles. In particular, when M = 80 and D = 6, or M = 90 and D = 4, power reaches 0.80.

Table 3.6: Effect of Study Duration (D) and Number of Schools (M) on Power, Holding Number of Students (N) in Each School Constant at 40: BRD, Linear Rate of Change

D \ M     10     20     30     40     50     60     70     80     90    100
2      0.092  0.145  0.199  0.254  0.307  0.360  0.410  0.458  0.504  0.548
3      0.125  0.222  0.318  0.410  0.494  0.571  0.639  0.698  0.750  0.794
4      0.140  0.255  0.367  0.471  0.563  0.644  0.713  0.771  0.818  0.857
5      0.146  0.269  0.387  0.495  0.591  0.672  0.741  0.797  0.842  0.878
6      0.149  0.275  0.396  0.507  0.603  0.685  0.753  0.808  0.852  0.887
7      0.150  0.278  0.401  0.512  0.609  0.691  0.759  0.814  0.857  0.892
8      0.151  0.280  0.404  0.516  0.613  0.695  0.762  0.817  0.860  0.894

Note. Effect size is 0.4 with a significance level of 0.05.

Figure 3.6: Effect of Study Duration (D) and Number of Schools (M) on Power, Holding Number of Students (N) in Each School Constant at 40: BRD, Linear Rate of Change. Note. Effect size is 0.4 with a significance level of 0.05.

Table 3.7 and Figure 3.7 provide power estimates for designs that vary the duration of the study (D) and the number of students (N) in each school, holding the number of schools (M) constant at 40. These results confirm that the power to detect a linear rate of change does not increase consistently as the study duration increases. In addition, when the number of students in each school is small (e.g., 20), power is affected more as the study duration increases from two to four than when the number of students is large (e.g., 200). Similarly, power does not increase consistently as the number of students increases, especially beyond a certain number of students. For example, power does not change much as the number of students increases from 160 to 200. Moreover, it is hard to boost power by increasing the number of students per school.

Table 3.7: Effect of Study Duration (D) and Number of Students (N) on Power, Holding Number of Schools (M) Constant at 40: BRD, Linear Rate of Change

D \ N     20     40     60     80    100    120    140    160    180    200
2      0.177  0.254  0.304  0.339  0.365  0.385  0.401  0.413  0.423  0.432
3      0.335  0.410  0.443  0.462  0.474  0.483  0.489  0.494  0.497  0.500
4      0.423  0.471  0.489  0.498  0.504  0.508  0.511  0.514  0.515  0.517
5      0.465  0.495  0.506  0.512  0.515  0.518  0.519  0.521  0.522  0.522
6      0.485  0.507  0.514  0.518  0.520  0.522  0.523  0.524  0.524  0.525
7      0.496  0.512  0.518  0.521  0.523  0.524  0.525  0.525  0.526  0.526
8      0.502  0.516  0.520  0.522  0.524  0.525  0.525  0.526  0.526  0.527

Note. Effect size is 0.4 with a significance level of 0.05.

Figure 3.7: Effect of Study Duration (D) and Number of Students (N) on Power, Holding Number of Schools (M) Constant at 40: BRD, Linear Rate of Change. Note. Effect size is 0.4 with a significance level of 0.05.

Table 3.8: Effects of Number of Schools (M) and Number of Students (N) on Power, Holding Study Duration (D) Constant at 4: BRD, Linear Rate of Change

M \ N     20     40     60     80    100    120    140    160    180    200
10     0.128  0.140  0.144  0.146  0.148  0.149  0.150  0.150  0.151  0.151
20     0.229  0.255  0.265  0.270  0.274  0.276  0.278  0.279  0.280  0.281
30     0.329  0.367  0.382  0.390  0.395  0.398  0.400  0.402  0.404  0.405
40     0.423  0.471  0.489  0.498  0.504  0.508  0.511  0.514  0.515  0.517
50     0.510  0.563  0.584  0.594  0.601  0.605  0.608  0.611  0.612  0.614
60     0.588  0.644  0.665  0.676  0.682  0.687  0.690  0.692  0.694  0.696
70     0.656  0.713  0.734  0.744  0.750  0.755  0.758  0.760  0.762  0.763
80     0.716  0.771  0.790  0.800  0.806  0.810  0.813  0.815  0.816  0.818
90     0.767  0.818  0.836  0.845  0.850  0.854  0.857  0.858  0.860  0.861
100    0.810  0.857  0.873  0.881  0.886  0.889  0.891  0.893  0.894  0.895

Note. Effect size is 0.4 with a significance level of 0.05.

Figure 3.8: Effects of Number of Schools (M) and Number of Students (N) on Power, Holding Study Duration (D) Constant at 4: BRD, Linear Rate of Change. Note. Effect size is 0.4 with a significance level of 0.05.
For example, as shown in Figure 3.7, even if there are 2000 students per school, power is still around 0.5.

Table 3.8 and Figure 3.8 provide power estimates for designs that vary the number of students (N) in each school and the number of schools (M), holding study duration constant at four. As the number of schools increases, power increases consistently. For example, power increases by approximately 0.1 for every ten additional schools as the number of schools goes from ten to 50, and then by around 0.06 for every ten additional schools until it reaches 0.80. When M = 80 schools and N = 80 students, power becomes 0.80. In addition, power increases as the number of students increases from 20 to 80, but does not change much as the number of students increases from 100 to 200. These results indicate that, to boost power, it is better to sample more schools than to sample more students per school.

Table 3.9: Effect of Covariates on Power: BRD, Linear Rate of Change

w3 \ w2   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
0.1     0.983  0.982  0.982  0.981  0.980  0.980  0.979  0.978  0.977
0.2     0.931  0.929  0.928  0.927  0.926  0.925  0.923  0.922  0.921
0.3     0.858  0.857  0.855  0.854  0.853  0.851  0.850  0.848  0.847
0.4     0.782  0.781  0.780  0.778  0.777  0.776  0.774  0.773  0.772
0.5     0.712  0.711  0.710  0.709  0.707  0.706  0.705  0.704  0.703
0.6     0.650  0.649  0.648  0.647  0.646  0.645  0.644  0.643  0.642
0.7     0.596  0.595  0.594  0.593  0.592  0.591  0.590  0.589  0.589
0.8     0.549  0.548  0.547  0.547  0.546  0.545  0.544  0.543  0.543
0.9     0.508  0.508  0.507  0.506  0.506  0.505  0.504  0.504  0.503

Note. The study duration is 4 with 40 schools and 40 students in each school; significance level is 0.05.

Figure 3.9: Effect of Covariates on Power: BRD, Linear Rate of Change
Note. The study duration is 4 with 40 schools and 40 students in each school; significance level is 0.05.

Table 3.10: Effects of Covariates, Number of Schools (M) and Number of Students (N) on Power Holding Study Duration (D) Constant at 4, w2 = 0.6 and w3 = 0.6: BRD, Linear Rate of Change

M \ N     20     40     60     80    100    120    140    160    180    200
10     0.136  0.154  0.162  0.166  0.169  0.171  0.172  0.174  0.174  0.175
20     0.302  0.352  0.374  0.386  0.393  0.398  0.402  0.405  0.407  0.409
30     0.443  0.514  0.542  0.558  0.568  0.574  0.579  0.583  0.586  0.588
40     0.565  0.645  0.676  0.692  0.702  0.709  0.714  0.717  0.720  0.723
50     0.666  0.746  0.776  0.791  0.800  0.806  0.811  0.814  0.817  0.819
60     0.748  0.823  0.849  0.861  0.869  0.874  0.878  0.881  0.883  0.884
70     0.812  0.878  0.900  0.910  0.916  0.920  0.923  0.925  0.927  0.928
80     0.862  0.918  0.934  0.942  0.947  0.950  0.952  0.954  0.955  0.956
90     0.900  0.945  0.958  0.964  0.967  0.969  0.971  0.972  0.973  0.973
100    0.928  0.964  0.973  0.977  0.980  0.981  0.982  0.983  0.984  0.984

Note. Effect size is 0.4 with a significance level of 0.05.

Figure 3.10: Effects of Covariates, Number of Schools (M) and Number of Students (N) on Power Holding Study Duration (D) Constant at 4, w2 = 0.6 and w3 = 0.6: BRD, Linear Rate of Change
Note. Effect size is 0.4 with a significance level of 0.05.

Table 3.9 and Figure 3.9 show how the power of detecting a linear rate of change is influenced by the proportion of unexplained variance at the second and third levels (w2 and w3) when M = 40, N = 40, D = 4, and ES = 0.40. I assume that five covariates are added at the third level (q = 5), so the degrees of freedom reduce to 40 - 5 - 1 = 34. The results show that power increases when covariates are added to the model, as expected. For example, when covariates explain 40% of the variance at both the second and third level (w3 = w2 = 0.6), the power increases from 0.471 to 0.645. The power increase is much larger than that obtained by adding 20 schools or adding 160 students in each school. In addition, covariates at the second level have little influence on power, while covariates at the third level affect power considerably more.
Power does not change much as the proportion of variance explained at the second level increases from 10% to 90%, regardless of how much of the variance at the third level is explained, mainly because the variance of the level-2 random effects (i.e., $\tau^2_{\pi 11}$) accounts for only a small proportion of the total variance.

Table 3.10 and Figure 3.10 provide power estimates for designs that vary the number of students (N) in each school and the number of schools (M), assuming that covariates explain 40% of the variance at the second and the third level and holding study duration constant at four. In general, power increases when covariates explain a certain proportion of variance at the second or the third level, compared with the power estimates in Table 3.8. In particular, fewer schools or fewer students per school are required for power to reach 0.80. For instance, with N = 40, only 60 schools are needed to boost power to 0.80, which is 30 schools fewer than in the design without covariates.

Block Randomized Design: A Quadratic Growth Model

I also used data from Project STAR to fit a model with a quadratic rate of change at level 1 (repeated measures), namely

$$Math_{gij} = \pi_{0ij} c_{0g} + \pi_{1ij} c_{1g} + \pi_{2ij} c_{2g} + u_{gij}, \qquad u_{gij} \sim N(0, \sigma_e^2),$$

where $Math_{gij}$ is student mathematics achievement in year g, $c_{0g} = (1, 1, 1, 1)$, $c_{1g} = (-1.5, -0.5, 0.5, 1.5)$ and $c_{2g} = (0.5, -0.5, -0.5, 0.5)$ at g = 1, 2, 3, 4 following equation (3.6). This model defines $\pi_{2ij}$ as the average quadratic rate of change of mathematics achievement for student i in school j. All the other terms have been defined previously. The second level model (student level) is

$$\pi_{0ij} = \beta_{00j} + \beta_{01j} T_{ij} + \xi_{0ij}, \qquad \xi_{0ij} \sim N(0, \tau^2_{\pi 00}),$$
$$\pi_{1ij} = \beta_{10j} + \beta_{11j} T_{ij} + \xi_{1ij}, \qquad \xi_{1ij} \sim N(0, \tau^2_{\pi 11}),$$
$$\pi_{2ij} = \beta_{20j} + \beta_{21j} T_{ij} + \xi_{2ij}, \qquad \xi_{2ij} \sim N(0, \tau^2_{\pi 22}),$$

where $\beta_{20j}$ is the average quadratic growth rate in school j, and $\beta_{21j}$ is the average difference in quadratic growth rate between students in small classes and students in regular classes in school j. All the other terms have been defined previously.
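The third level model (school level) is

$$\beta_{00j} = \gamma_{000} + \eta_{00j}, \qquad \eta_{00j} \sim N(0, \tau^2_{\beta 00}),$$
$$\beta_{01j} = \gamma_{010} + \eta_{01j}, \qquad \eta_{01j} \sim N(0, \tau^2_{T00}),$$
$$\beta_{10j} = \gamma_{100} + \eta_{10j}, \qquad \eta_{10j} \sim N(0, \tau^2_{\beta 10}),$$
$$\beta_{11j} = \gamma_{110} + \eta_{11j}, \qquad \eta_{11j} \sim N(0, \tau^2_{T11}),$$
$$\beta_{20j} = \gamma_{200} + \eta_{20j}, \qquad \eta_{20j} \sim N(0, \tau^2_{\beta 20}),$$
$$\beta_{21j} = \gamma_{210} + \eta_{21j}, \qquad \eta_{21j} \sim N(0, \tau^2_{T22}),$$

where $\gamma_{200}$ is the average quadratic rate of change, and $\gamma_{210}$ is the main effect of treatment on the quadratic rate of change, which is my primary interest. The variance estimates are

$$\sigma_e^2 = 0.24239, \qquad \tau^2_{\pi 22} = 0.00943, \qquad \tau^2_{T22} = 0.07524.$$

To calculate power, I assumed a standardized effect size of 0.40 and a significance level of 0.05. I also assumed sample sizes M = 40 and N = 40, that is, 40 schools, each with 20 students in the treatment condition and 20 students in the control condition (40 students in total). According to equation (3.8) and equation (3.9) with G = 4, p = 2 and $k_2 = 1/2$, first I compute

$$\tilde{\sigma}^2_2 = \frac{\sigma_e^2}{\sum_{g=1}^{G} c_{2g}^2} = \frac{0.24239}{(6 \times 5 \times 4 \times 3 \times 2)/720} = 0.24239.$$

Then, I calculate the non-centrality parameter of the t-test based on equation (3.31), with n = 20 students per condition in each school and $w_2 = w_3 = 1$ (no covariates):

$$\lambda = \sqrt{\frac{Mn}{2}}\; ES\, \sqrt{\frac{\tau^2_{Tpp} + \tau^2_{\pi pp}}{n w_3 \tau^2_{Tpp} + w_2 \tau^2_{\pi pp} + \tilde{\sigma}^2_p}} = \sqrt{\frac{40 \times 20}{2}} \times 0.40 \times \sqrt{\frac{0.07524 + 0.00943}{20 \times 0.07524 + 0.00943 + 0.24239}} = 1.756.$$

The critical value of the test using the t-distribution with 40 - 1 = 39 degrees of freedom is $c(0.025, 39) \approx 2.023$. Finally, I computed power as

$$P = 1 - H[2.023, 39, 1.756] + H[-2.023, 39, 1.756] \approx 0.403,$$

where H denotes the cumulative distribution function of the non-central t-distribution.

Table 3.11: Effect of Study Duration (D) and Number of Schools (M) on Power Holding Number of Students (N) in Each School Constant at 40: BRD, Quadratic Rate of Change

D \ M     10     20     30     40     50     60     70     80     90    100
3      0.093  0.148  0.205  0.261  0.316  0.370  0.422  0.471  0.518  0.562
4      0.124  0.218  0.313  0.403  0.486  0.562  0.630  0.689  0.741  0.785
5      0.132  0.237  0.341  0.438  0.527  0.606  0.675  0.734  0.784  0.825
6      0.134  0.243  0.349  0.448  0.538  0.618  0.687  0.745  0.795  0.836
7      0.135  0.244  0.351  0.452  0.542  0.622  0.691  0.749  0.798  0.839
8      0.135  0.245  0.353  0.453  0.544  0.624  0.692  0.751  0.800  0.840
9      0.135  0.245  0.353  0.454  0.544  0.624  0.693  0.752  0.801  0.841

Note. Effect size is 0.4 with a significance level of 0.05.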
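The worked power calculation for the quadratic model can be checked numerically. The following is a minimal sketch, not the author's code: it uses only the Python standard library, evaluates the non-central t distribution by Simpson's rule, and assumes that the non-centrality parameter has the form lambda = sqrt(M n / 2) ES sqrt((tau_T + tau_2) / (n w3 tau_T + w2 tau_2 + sigma_tilde)), with n = 20 students per condition per school. That form is a reading of the garbled equation (3.31), chosen because it reproduces the reported values; the function names (`nct_cdf`, `t_critical`, `power_two_tailed`, `brd_noncentrality`) are hypothetical.

```python
import math


def norm_cdf(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def nct_cdf(t, df, ncp=0.0, steps=4000, upper=6.0):
    """P(T <= t) for a non-central t variable T = (Z + ncp) / S.

    S = sqrt(chi2_df / df) is independent of Z, so
    P(T <= t) = E[Phi(t * S - ncp)]. The expectation is taken with
    Simpson's rule over the density of S (intended for df >= 2).
    """
    half = df / 2.0
    log_const = math.log(2.0) + half * math.log(half) - math.lgamma(half)

    def integrand(s):
        if s <= 0.0:
            return 0.0
        log_density = log_const + (df - 1.0) * math.log(s) - half * s * s
        return norm_cdf(t * s - ncp) * math.exp(log_density)

    h = upper / steps  # steps must be even for Simpson's rule
    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * integrand(i * h)
    return total * h / 3.0


def t_critical(df, alpha=0.05):
    """Two-tailed critical value of the central t distribution (bisection)."""
    lo, hi = 0.0, 100.0
    target = 1.0 - alpha / 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if nct_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def power_two_tailed(ncp, df, alpha=0.05):
    """Power = 1 - H[c, df, ncp] + H[-c, df, ncp]."""
    c = t_critical(df, alpha)
    return 1.0 - nct_cdf(c, df, ncp) + nct_cdf(-c, df, ncp)


def brd_noncentrality(M, n, es, tau_T, tau_2, sigma_tilde, w2=1.0, w3=1.0):
    """Assumed form of the BRD non-centrality parameter (see lead-in)."""
    numerator = (M * n / 2.0) * es ** 2 * (tau_T + tau_2)
    denominator = n * w3 * tau_T + w2 * tau_2 + sigma_tilde
    return math.sqrt(numerator / denominator)


# Quadratic-model example: M = 40 schools, n = 20 students per condition.
lam = brd_noncentrality(M=40, n=20, es=0.40,
                        tau_T=0.07524, tau_2=0.00943, sigma_tilde=0.24239)
print(round(lam, 3), round(power_two_tailed(lam, df=40 - 1), 3))
```

Under these assumptions the printed values should be close to the 1.756 and 0.403 reported in the text. Setting `w2=0.6, w3=0.6` and `df=34` (five level-3 covariates) gives a value close to the 0.559 entry of Table 3.14.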
Figure 3.11: Effect of Study Duration (D) and Number of Schools (M) on Power Holding Number of Students (N) in Each School Constant at 40: BRD, Quadratic Rate of Change
Note. Effect size is 0.4 with a significance level of 0.05.

Table 3.12: Effect of Study Duration (D) and Number of Students (N) on Power Holding Number of Schools (M) Constant at 40: BRD, Quadratic Rate of Change

D \ N     20     40     60     80    100    120    140    160    180    200
3      0.190  0.261  0.302  0.330  0.349  0.363  0.374  0.382  0.389  0.395
4      0.360  0.403  0.419  0.428  0.433  0.437  0.440  0.442  0.443  0.445
5      0.421  0.438  0.444  0.447  0.449  0.450  0.451  0.452  0.452  0.453
6      0.440  0.448  0.451  0.452  0.453  0.454  0.454  0.454  0.455  0.455
7      0.447  0.452  0.453  0.454  0.455  0.455  0.455  0.455  0.455  0.456
8      0.449  0.453  0.454  0.455  0.455  0.455  0.456  0.456  0.456  0.456
9      0.451  0.454  0.455  0.455  0.455  0.456  0.456  0.456  0.456  0.456

Note. Effect size is 0.4 with a significance level of 0.05.

Figure 3.12: Effect of Study Duration (D) and Number of Students (N) on Power Holding Number of Schools (M) Constant at 40: BRD, Quadratic Rate of Change
Note. Effect size is 0.4 with a significance level of 0.05.

Table 3.13: Effects of Number of Schools (M) and Number of Students (N) on Power Holding Study Duration (D) Constant at 4: BRD, Quadratic Rate of Change

M \ N     20     40     60     80    100    120    140    160    180    200
10     0.114  0.124  0.127  0.129  0.131  0.132  0.132  0.133  0.133  0.133
20     0.197  0.218  0.227  0.232  0.235  0.237  0.238  0.239  0.240  0.241
30     0.280  0.313  0.326  0.333  0.337  0.340  0.342  0.344  0.345  0.346
40     0.360  0.403  0.419  0.428  0.433  0.437  0.440  0.442  0.443  0.445
50     0.437  0.486  0.505  0.515  0.521  0.526  0.529  0.531  0.533  0.534
60     0.508  0.562  0.583  0.593  0.600  0.605  0.608  0.610  0.612  0.614
70     0.572  0.630  0.651  0.662  0.669  0.673  0.677  0.679  0.681  0.683
80     0.631  0.689  0.710  0.721  0.728  0.732  0.736  0.738  0.740  0.741
90     0.683  0.741  0.761  0.772  0.778  0.782  0.785  0.788  0.790  0.791
100    0.730  0.785  0.805  0.814  0.820  0.824  0.827  0.829  0.831  0.832

Note. Effect size is 0.4 with a significance level of 0.05.
Figure 3.13: Effects of Number of Schools (M) and Number of Students (N) on Power Holding Study Duration (D) Constant at 4: BRD, Quadratic Rate of Change
Note. Effect size is 0.4 with a significance level of 0.05.

Tables 3.11 to 3.13 and Figures 3.11 to 3.13 show how variations in study duration and sample sizes affect the power to detect the treatment effect on the quadratic rate of change in block designs, assuming two-tailed t-tests at the 0.05 significance level and an effect size of 0.40. Table 3.11 and Figure 3.11 show how power changes as study duration (D) and the number of schools (M) change, holding the number of students (N) in each school constant at 40. Note that at least three repeated measures (D = 3) are needed to estimate a quadratic growth model. The power of detecting a quadratic rate of change increases slightly as the study duration increases from three to six, and remains virtually unchanged as the study duration increases from six to nine. However, as the number of schools increases, power increases considerably more. Reaching a power of 0.80 requires more schools and a longer study duration than in the linear growth model. That is mainly because the ratio between the variance of the level-2 random effects and the variance of the treatment by school random effect in the quadratic growth model (i.e., $\tau^2_{\pi 22}/\tau^2_{T22}$) is smaller than the corresponding ratio in the linear growth model (i.e., $\tau^2_{\pi 11}/\tau^2_{T11}$). In particular, when M = 90 and D = 8, or M = 100 and D = 7, power reaches 0.80. Table 3.12 and Figure 3.12 provide power estimates for designs that vary the duration of study (D) and the number of students (N) in each school, holding the number of schools (M) at 40. The results are quite similar to those in Table 3.7.
Both the study duration and the number of students in each school have quite limited influence on power, especially when the study duration or the number of students in each school is beyond a certain number. Table 3.13 and Figure 3.13 provide power estimates for designs that vary the number of students (N) in each school and the number of schools (M), holding study duration constant at four. As the number of schools increases, power increases consistently, as expected. When M = 100 schools and N = 60 students, power becomes 0.80.

Table 3.14: Effect of Covariates on Power: BRD, Quadratic Rate of Change

w3 \ w2   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
0.1     0.950  0.949  0.949  0.948  0.948  0.948  0.947  0.947  0.946
0.2     0.865  0.865  0.864  0.864  0.863  0.863  0.862  0.861  0.861
0.3     0.774  0.774  0.773  0.773  0.772  0.772  0.771  0.770  0.770
0.4     0.691  0.691  0.691  0.690  0.690  0.689  0.689  0.688  0.688
0.5     0.620  0.620  0.620  0.619  0.619  0.618  0.618  0.618  0.617
0.6     0.561  0.560  0.560  0.560  0.559  0.559  0.559  0.558  0.558
0.7     0.511  0.510  0.510  0.510  0.509  0.509  0.509  0.509  0.508
0.8     0.468  0.468  0.468  0.468  0.467  0.467  0.467  0.467  0.466
0.9     0.432  0.432  0.432  0.432  0.432  0.431  0.431  0.431  0.431

Note. The study duration is 4 with 40 schools and 40 students in each school; significance level is 0.05.

Figure 3.14: Effect of Covariates on Power: BRD, Quadratic Rate of Change
Note. The study duration is 4 with 40 schools and 40 students in each school; significance level is 0.05.

Table 3.15: Effects of Covariates, Number of Schools (M) and Number of Students (N) on Power Holding Study Duration (D) Constant at 4, w2 = 0.6 and w3 = 0.6: BRD, Quadratic Rate of Change

M \ N     20     40     60     80    100    120    140    160    180    200
10     0.120  0.135  0.142  0.146  0.148  0.150  0.151  0.152  0.153  0.153
20     0.254  0.298  0.317  0.328  0.335  0.339  0.343  0.345  0.347  0.349
30     0.373  0.438  0.465  0.480  0.489  0.496  0.500  0.504  0.507  0.509
40     0.481  0.559  0.590  0.607  0.618  0.625  0.630  0.634  0.637  0.640
50     0.576  0.660  0.692  0.710  0.720  0.727  0.732  0.736  0.739  0.742
60     0.658  0.742  0.773  0.789  0.799  0.805  0.810  0.813  0.816  0.818
70     0.727  0.807  0.835  0.849  0.858  0.863  0.867  0.870  0.873  0.874
80     0.784  0.857  0.882  0.894  0.901  0.905  0.909  0.911  0.913  0.915
90     0.831  0.896  0.916  0.926  0.932  0.935  0.938  0.940  0.941  0.943
100    0.869  0.924  0.941  0.949  0.953  0.956  0.958  0.960  0.961  0.962

Note. Effect size is 0.4 with a significance level of 0.05.

Figure 3.15: Effects of Covariates, Number of Schools (M) and Number of Students (N) on Power Holding Study Duration (D) Constant at 4, w2 = 0.6 and w3 = 0.6: BRD, Quadratic Rate of Change
Note. Effect size is 0.4 with a significance level of 0.05.

Table 3.14 and Figure 3.14, and Table 3.15 and Figure 3.15, show how the power of detecting a quadratic rate of change is influenced by the proportion of unexplained variance at the second and third levels when D = 4, ES = 0.40 and q = 5. The results are quite similar to the results in Table 3.9 and Table 3.10. Power increases as the proportion of variance explained increases at the second or the third level. In particular, fewer schools or fewer students per school are needed for power to reach 0.80. For instance, with N = 40, only 70 schools are needed to boost power to 0.80, which is more than 30 schools fewer than in the design without covariates. In addition, covariates at the third level have more impact on power than covariates at the second level.

Conclusion

Multilevel experimental designs are becoming more common in education. Frequently these designs randomly assign individuals (e.g., students) or entire clusters such as schools to a treatment or a control group and follow them over time. In such designs, researchers face the challenge of choosing the study duration and sample sizes to ensure that treatment effects will be detected. The present study extended previous work on power analysis for two-level models in studies of polynomial change and presented methods for three-level models.
The power of the test of the treatment effect in studies of polynomial change with two levels of nesting is a function of the magnitude of the treatment effect, the study duration, the sample size of individuals, the sample size of clusters, and the proportion of the variance that covariates at the second or third level explain.

Several findings emerged from this study that apply to both CRDs and BRDs. First, power increases as the study duration, the number of students in each school, or the number of schools increases. Other things being equal, the number of level-3 units (clusters) influences power more than the number of level-2 units (individuals) or the duration of the study. In particular, the number of students and the study duration have limited influence on power. This clearly indicates that researchers should sample more schools rather than more students within schools to maximize power. Note that the number of schools also affects power through the degrees of freedom of the t-test. It should also be noted that the higher the order of the polynomial a growth model includes, the longer the study duration needed. For example, to fit a linear rate of change model, the minimum study duration is two; to fit a quadratic rate of change model, the study duration should be at least three.

Second, covariates that explain a proportion of the variance at the second or third level can increase power and thus reduce the study duration or sample sizes needed to boost power to a certain level. For instance, in the first illustrative example with a CRD, when covariates explain 40% of the variance at both the second and the third level, the number of schools required for power to reach 0.80 drops from 30 to 20, holding the number of students in each school constant at 60.
Because the number of covariates at the third level reduces the degrees of freedom for the t-test, researchers should use a small number of third-level covariates that are strongly related to the outcome, especially when the number of schools is not large.

Third, the effects of covariates on power depend on the ratio between the variances of the random effects at the second and the third level. For instance, in my three illustrative examples, since the ratio between the variance of the random effects at the second level and that at the third level is small, covariates at the third level affect power more than covariates at the second level. In addition, comparing the results from the second and the third illustrative examples, power is larger in the second example, mainly because the ratio between the variance of the random effects at the second and the third level is larger in the second example than in the third example.

APPENDICES

Appendix A: Variable Description

Table A.1: Variable Names and Coding Methods using Data from TIMSS 2011

Variables: Description (TIMSS Variable Name)

Student Variables
Mathematics Achievement: Set of five overall mathematics score plausible value variables
Female: Binary indicator for the student whose gender is female
Age: Student age at the time of testing
Speaking the Tested Language at Home: Binary indicator for the student who spoke the tested language at home "always or almost always"
SES: Books in the Home: Number of books in the home
SES: Items in the Home: Sum of eleven wealth-related household possessions variables
Positive Affect to Mathematics: Average of five self-reported student's affect to mathematics variables, with negatively-worded items reverse-coded
Parents Asked What the Student was Learning in School: Binary indicator for the parents asking the student what he/she is learning in school every day or almost every day
Student Talked about the Schoolwork with Parents: Binary indicator for the student talking about the schoolwork with parents every day or almost every day
Parents Made Sure the Student Set Aside Time for the Homework: Binary indicator for the parents making sure that the student sets aside time for the homework every day or almost every day
Parents Checked if the Student Did the Homework: Binary indicator for the parents checking if the student does the homework every day or almost every day

Teacher/Classroom Variables
Class Size: Number of students in the classroom
Classroom SES: Books: Average number of books in the home
Classroom SES: Items: Average number of items in the home
Proportion Female: Proportion of female students in a class
Average Students' Positive Affect to Mathematics: Average self-reported student's affect to mathematics in a class
Teacher Experience in Years: Teacher's years of teaching
Teacher Completing Post-Secondary Education: Binary indicator for the teacher who completed post-secondary education
Female: Binary indicator for the teacher who is female
Instruction Time: Time spent teaching mathematics to the students in the class per week

School Variables
Percent Disadvantaged Students: Set of four indicators for categorical percentage of economically-disadvantaged students
Percent of Students Having Tested Language as Native Language: Binary indicator for categorical percentage of the students having tested language as their native language more than 90%
Students Having Early Numeracy Skills: Set of four indicators for categorical percentage of the students entering the primary grades with early numeracy skills
City Size: Set of six indicators for categorical city population (labels = 0-3,000, 3,001-15,000, 15,001-50,000, 50,001-100,000, 100,001-500,000, greater than 500,000)
Income Level of the School's Immediate Area: Set of three indicators for the income level of the school's immediate area
Grade 4 Enrollment: Total enrollment of fourth graders in the school

Appendix B: Control Function Approach for Quantile Regression

A
quantile regression model with endogenous variables can be written as

$$y = x'\beta(\tau) + z_1'\gamma(\tau) + u, \qquad x = z'\pi(\tau_r) + v, \tag{B.1}$$

where x is a vector of endogenous variables, z = (z_1, z_2) are exogenous variables, and our interest is to estimate $\beta(\tau)$, the coefficients of x at the $\tau$-th quantile.

There are three ways to estimate $\beta(\tau)$ in the quantile regression literature. Amemiya (1982) and Powell (1983) first proposed a two-stage least absolute deviations (2LAD) approach, which focuses specifically on the median and is quite similar to the 2SLS estimation procedure. However, the assumption required for this approach is difficult to interpret, and thus it has not been widely used in empirical studies.

Chernozhukov and Hansen (2006) proposed an instrumental variable quantile regression (IVQREG) approach that assumes $Q_u(\tau \mid z) = 0$, which means that the $\tau$-th quantile of u, one of the error terms in equation (B.1), equals zero conditional on the exogenous variables (z) in equation (B.1). Chernozhukov and Hansen (2008) developed inference procedures based on the IVQREG estimator that are fully robust to weak instruments. However, only MATLAB code is available for their approach. In addition, it is not clear how to incorporate sampling weights and how to adjust for clustering effects (e.g., students nested in schools) using their methods.

Lee (2007) proposed a control function approach to deal with the endogenous variables in quantile regression. According to equation (B.1) we have

$$Q_u(\tau \mid x, z) = Q_u(\tau \mid v, z). \tag{B.2}$$

This approach assumes that the instrumental variables z are independent of (u, v); therefore we have

$$Q_u(\tau \mid v, z) = Q_u(\tau \mid v). \tag{B.3}$$

Substituting equation (B.3) into equation (B.1), we have

$$Q_y(\tau \mid x, z, v) = x'\beta(\tau) + z_1'\gamma(\tau) + Q_u(\tau \mid z, v) = x'\beta(\tau) + z_1'\gamma(\tau) + Q_u(\tau \mid v). \tag{B.4}$$

Therefore, to estimate $\beta(\tau)$, we must know $Q_u(\tau \mid v)$, which is a function of v. Since v is not observed, we must estimate it by regressing x on z using OLS or quantile regression. Also, because we do not know whether the relationship between u and v is linear or nonlinear, Lee (2007) suggests using a series or kernel function of v to better model the relationship between u and v. To sum up, Lee's (2007) control function approach is also a two-stage estimation approach: (1) regress x on z using OLS or quantile regression and obtain $\hat{v} = x - z'\hat{\pi}(\tau_r)$; (2) regress y on x, z_1, and a series or kernel function of $\hat{v}$ through quantile regression to obtain $\hat{\beta}(\tau)$.

Appendix C: Proof of Equation (3.6)

According to Raudenbush and Liu (2001), $Y_{gi}$ is an outcome for person i (i = 1, 2, ..., n) at occasion g (g = 1, 2, ..., G), and thus $\bar{g} = \sum_{g=1}^{G} g / G = (G+1)/2$. According to equation (5) in Raudenbush and Liu (2001), equation (2) in Raudenbush and Liu (2001) can be simplified as

$$c_{0g} = 1,$$
$$c_{1g} = g - \sum_{g=1}^{G} g / G = g - \bar{g},$$
$$c_{2g} = \frac{1}{2}\left[c_{1g}^2 - \sum_{g=1}^{G} c_{1g}^2 / G\right] = \frac{1}{2}\left[(g - \bar{g})^2 - \frac{(G-1)(G+1)}{12}\right],$$
$$c_{3g} = \frac{1}{6}\left[c_{1g}^3 - c_{1g}\frac{\sum_{g=1}^{G} c_{1g}^4}{\sum_{g=1}^{G} c_{1g}^2}\right] = \frac{1}{6}\left[(g - \bar{g})^3 - (g - \bar{g})\frac{3G^2 - 7}{20}\right], \tag{C.1}$$

where $\bar{g} = \frac{1}{G}\sum_{g=1}^{G} g = \frac{G+1}{2}$. According to the equation provided in Fisher (1936, p. 149), we have

$$C_{0g} = 1, \quad C_{1g} = g - \bar{g}, \quad C_{2g} = C_{1g}^2 - \frac{G^2 - 1}{12}, \quad C_{3g} = C_{1g}^3 - C_{1g}\frac{3G^2 - 7}{20}. \tag{C.2}$$

When $k_0 = 1$, $k_1 = 1$, $k_2 = 1/2$ and $k_3 = 1/6$, we have

$$c_{0g} = C_{0g} k_0, \quad c_{1g} = C_{1g} k_1, \quad c_{2g} = C_{2g} k_2, \quad c_{3g} = C_{3g} k_3, \quad \text{or} \quad c_{pg} = C_{pg} k_p. \tag{C.3}$$

According to the equation on page 30 of Fisher (1957) and equation (1) of Jennrich and Sampson (1971), we have the recurrence formula

$$C_{p+1,g} = C_{1,g} C_{p,g} - \theta_p C_{p-1,g}, \qquad C_{0g} = 1, \quad C_{1g} = g - \bar{g}, \tag{C.4}$$

where $\theta_p = \dfrac{p^2 (G^2 - p^2)}{4(4p^2 - 1)}$; g indexes the occasions (g = 1, 2, ..., G); p is the degree of the orthogonal polynomial; and $C_{pg}$ is the orthogonal polynomial coefficient of degree p at occasion g.
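The recurrence in equation (C.4) is easy to check numerically. The sketch below is illustrative only: it generates the unscaled coefficients by the recurrence C_{p+1,g} = C_{1,g} C_{p,g} - [p^2 (G^2 - p^2) / (4 (4 p^2 - 1))] C_{p-1,g} and then scales them by k_p. The function name `orthogonal_contrasts` is hypothetical, and the rule k_p = 1/p! is an assumption that matches the stated values k_0 = 1, k_1 = 1, k_2 = 1/2 and k_3 = 1/6 but is not confirmed by the text for higher degrees.

```python
from fractions import Fraction
from math import factorial


def orthogonal_contrasts(G, max_degree):
    """Scaled orthogonal polynomial coefficients c_pg for p = 0..max_degree.

    Unscaled coefficients C_pg come from the recurrence (C.4); each row is
    then multiplied by k_p, assumed here to equal 1/p!.
    Requires 1 <= max_degree <= G - 1. Exact arithmetic via Fraction.
    """
    g_bar = Fraction(G + 1, 2)
    rows = [[Fraction(1)] * G,
            [Fraction(g) - g_bar for g in range(1, G + 1)]]
    for p in range(1, max_degree):
        theta = Fraction(p * p * (G * G - p * p), 4 * (4 * p * p - 1))
        rows.append([rows[1][i] * rows[p][i] - theta * rows[p - 1][i]
                     for i in range(G)])
    return [[value / factorial(p) for value in row]
            for p, row in enumerate(rows[:max_degree + 1])]


# Four occasions (G = 4), up to the quadratic term.
for p, row in enumerate(orthogonal_contrasts(4, 2)):
    print(p, [float(value) for value in row])
```

For G = 4 this yields the linear coefficients (-1.5, -0.5, 0.5, 1.5) and the quadratic coefficients (0.5, -0.5, -0.5, 0.5), which sum to zero and are mutually orthogonal. Exact fractions are used so that the orthogonality checks are not blurred by rounding.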
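According to equation (C.4), we have

$$\text{If } p = 1: \quad C_{2g} = C_{1g} C_{1g} - \frac{1^2 (G^2 - 1^2)}{4(4 \times 1^2 - 1)} C_{0g} = (g - \bar{g})^2 - \frac{G^2 - 1}{12};$$
$$\text{If } p = 2: \quad C_{3g} = C_{1g} C_{2g} - \frac{2^2 (G^2 - 2^2)}{4(4 \times 2^2 - 1)} C_{1g} = (g - \bar{g})\left[(g - \bar{g})^2 - \frac{G^2 - 1}{12}\right] - \frac{G^2 - 4}{15}(g - \bar{g}) = (g - \bar{g})^3 - (g - \bar{g})\frac{3G^2 - 7}{20}.$$

To sum up, we have

$$c_{0g} = k_0 C_{0g}, \quad c_{1g} = k_1 C_{1g}, \quad c_{2g} = k_2 C_{2g}, \quad c_{3g} = k_3 C_{3g}, \quad \text{or} \quad c_{pg} = k_p C_{pg}. \tag{C.5}$$

Appendix D: Proof of Equation (3.9)

Based on equation (2) from Jennrich and Sampson (1971), we have

$$\sum_{g=1}^{G} C_{pg}^2 = \frac{p^2 (G^2 - p^2)}{4(4p^2 - 1)} \sum_{g=1}^{G} C_{p-1,g}^2, \quad \text{where } \sum_{g=1}^{G} C_{0g}^2 = G. \tag{D.1}$$

Therefore we have

$$\text{If } p = 1: \quad \sum_{g=1}^{G} C_{1g}^2 = \frac{G^2 - 1}{4 \times 3}\, G = \frac{(G-1)G(G+1)}{12};$$
$$\text{If } p = 2: \quad \sum_{g=1}^{G} C_{2g}^2 = \frac{4(G^2 - 4)}{4(4 \times 4 - 1)} \cdot \frac{(G-1)G(G+1)}{12} = \frac{(G-2)(G-1)G(G+1)(G+2)}{180};$$
$$\text{If } p = 3: \quad \sum_{g=1}^{G} C_{3g}^2 = \frac{9(G^2 - 9)}{4(4 \times 9 - 1)} \cdot \frac{(G-2)(G-1)G(G+1)(G+2)}{180} = \frac{(G-3)(G-2)(G-1)G(G+1)(G+2)(G+3)}{2800}.$$

Let $k_0 = 1$, $k_1 = 1$, $k_2 = 1/2$ and $k_3 = 1/6$; we have

$$\sum_{g=1}^{G} c_{pg}^2 = k_p^2 \sum_{g=1}^{G} C_{pg}^2 \quad (p = 1, 2, 3).$$

Also, according to equation (D.1), we have

$$\sum_{g=1}^{G} C_{pg}^2 = \frac{p^2 (G^2 - p^2)}{4(4p^2 - 1)} \sum_{g=1}^{G} C_{p-1,g}^2 = \frac{p^2}{4(4p^2 - 1)} (G^2 - p^2) \sum_{g=1}^{G} C_{p-1,g}^2.$$

Let $\alpha_p = \dfrac{p^2}{4(4p^2 - 1)}$; we have

$$\sum_{g=1}^{G} C_{pg}^2 = \alpha_p (G - p)(G + p) \sum_{g=1}^{G} C_{p-1,g}^2 = \alpha_p \alpha_{p-1} (G - p)(G + p)(G - p + 1)(G + p - 1) \sum_{g=1}^{G} C_{p-2,g}^2 = \dots = \alpha_p \alpha_{p-1} \cdots \alpha_1 \, (G - p)(G - p + 1) \cdots (G - 1)\, G\, (G + 1) \cdots (G + p).$$

Let $H_p = \alpha_p \alpha_{p-1} \cdots \alpha_1$; we have

$$\sum_{g=1}^{G} C_{pg}^2 = H_p \frac{(G + p)!}{(G - p - 1)!}. \tag{D.2}$$

Also, let $K_p = H_p k_p^2$; we have

$$\sum_{g=1}^{G} c_{pg}^2 = k_p^2 \sum_{g=1}^{G} C_{pg}^2 = k_p^2 H_p \frac{(G + p)!}{(G - p - 1)!} = K_p \frac{(G + p)!}{(G - p - 1)!}. \tag{D.3}$$

In addition, according to equation (8) on p. 104 of Plackett (1960), we have

$$\sum_{g=1}^{G} C_{pg}^2 = \frac{(G + p)!\,(p!)^4}{(2p + 1)\,(G - p - 1)!\,[(2p)!]^2} = \frac{(p!)^4 (G + p)!}{(2p)!\,(2p + 1)!\,(G - p - 1)!}. \tag{D.4}$$

Therefore we can write

$$H_p = \frac{(p!)^4}{(2p)!\,(2p + 1)!}, \qquad K_p = \frac{k_p^2 (p!)^4}{(2p)!\,(2p + 1)!}. \tag{D.5}$$

REFERENCES

Akerhielm, K. (1995). Does class size matter? Economics of Education Review, 14(3), 229-241.
Amemiya, T. (1982). Two Stage Least Absolute Deviations Estimators. Econometrica, 50(3), 689-711.
Angrist, J. D., & Lavy, V. (1999). Using Maimonides' rule to estimate the effect of class size on scholastic achievement. Quarterly Journal of Economics, 114(2), 533-575.
Bloom, H. S., Richburg-Hayes, L., & Black, A. R. (2007). Using covariates to improve precision for studies that randomize schools to evaluate educational interventions.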
Educational Evaluation and Policy Analysis, 29(1) , 30-59. Bloch, D. A. (1986). Sample size requirements and the cost of a randomized clinical trial with repeated measurements. Statistics in Medicine, 5 (6), 663.667. Bonesronning, H. (2003). Class size effect on student achievement in Norway: Patterns and explanations. Southern Economic Journal, 69 (4), 952 -965. Boruch, R. F., & Gomez, H. (1977). Sensitivity, bias, and theory in impact evaluations. Professional Psychology, 8 (4), 411. Cho, H., Glewwe, P., & Whi tler, M. (2012). Do reductions in class size raise students™ test scores? Evidence from population variation in Minnesota's elementary schools. Economics of Education Review, 31 (3), 77 -95. Cohen, J. (1988). Statistical power analysis for the behavioral sc iences : Routledge. Ding, W., & Lehrer, S. F. (2011). Experimental estimates of the impacts of class size on test scores: robustness and heterogeneity. Education Economics, 19(3), 229 -252. Durbin, J. (1954). Errors in variables. Review of the International Statistical Institute, 22(1/3), 23 -32. Dufour, J. M. (2003). Identification, weak instruments, and statistical inference in econometrics. Canadian Journal of Economics , 36(4), 767-808. EACEA Eurydice. (2012). Key data on education in Europe 2012 . Brussels : Eurydice. Education Week . (2008). Quality counts 2008: Tapping into teaching . Bethesda, MD: Editorial Projects in Education. Finn, J. D., & Achilles, C. M. (1990). Answers and questions about class size: A statewide experiment. American Educational Rese arch Journal, 27 (3), 557 -577. Fisher, R. A. 117 (1928). Statistical methods for research workers (2d ed.). Edinburgh, London,: Oliver and Boyd. Glass, G. V., & Smith, M. L. (1979). Meta -analysis of research on class size and achievement. Educational Evaluation and Policy Analysis, 1 (1), 2 -16. Hausman, J. A. (1978). Specification tests in Econometrics. Econometrica, 46 (6), 1251 -1271. Hedeker, D., Gibbons, R. D., & Waternaux, C. (1999). 
Sample size estimation for longitudinal designs with attrition: Comparing time -related contrasts between two groups. Journal of Educational and Behavioral Statistics, 24 (1), 70 -93. Hojo, M. (2013). Class -size effects in Japanese sch ools: A spline regression approach. Economics Letters, 120 (3), 583 -587. Hoxby, C. M. (2000). The effects of class size on student achievement: New evidence from population variation. Quarterly Journal of Economics, 115 (4), 1239 -1285. Huttenlocher, J., Haight, W., Bryk, A., Seltzer, M., & Lyons, T. (1991). Early vocabulary growth: Relation to language input and gender. Developmental Psychology, 27 (2), 236. Jackson, E., & Page, M. E. (2013). Estimating the distributional effects of educ ation reforms: A look at Project STAR. Economics of Education Review, 32, 92-103. Jennrich, R. I., & Sampson, P. I. (1971). Remark as R3: A remark on algorithm AS 10. Journal of the Royal Statistical Society. Series C (Applied Statistics), 20 (1), 117 - 118. Jong, K., Moerbeek, M., & Van der Leeden, R. (2010). A priori power analysis in longitudinal three -level multilevel models: an example with therapist effects. Psychotherapy Research , 20(3), 273.284 Kirk, R. E. (2012). Experimental design : procedures for the behavioral sciences (4th ed.). Thousand Oaks: Sage Publications. Kolenikov, S. (2010). Resampling variance estimation for complex survey data. Stata Journal , 10(2), 165 -199. Konstantopoulos, S. (2008a). The power of the test for treatment effects in th ree -level cluster randomized designs. Journal of Research on Educational Effectiveness, 1 (1), 66-88. Konstantopoulos, S. (2008b). The power of the test for treatment effects in three -level block randomized designs. Journal of Research on Educational Effect iveness , 1(4), 265-288. 118 Konstantopoulos S., & Chung, V. (2009). What are the long -term effects of small classes on the achievement gap? Evidence from the Lasting Benefits Study. 
American Journal of Education , 116 (1), 125-154. Konstantopoulos, S. (2012). The impact of covariates on statistical power in cluster randomized designs: Which level matters more? Multivariate Behavioral Research, 47 , 392- 420. Konstantopoulos, S., Miller, S., & van der Ploeg, A. (2013). The impact of indiana™s system of interim as sessments on mathematics and reading achievement. Educational Evaluation and Policy Analysis, 35(4), 481-499. Konstantopoulos, S., & Traynor, A. (2014). Class size effects on reading achievement using PIRLS data: Evidence from Greece. Teachers College Rec ord, 116 (2), 1 -29. Krueger, A. B. (1999). Experimental estimates of education production functions. Quarterly Journal of Economics, 114 (2), 497 -532. Lee, S. (2007). Endogeneity in quantile regression models: A control function approach. Journal of Econome trics , 141(2), 1131 -1158. Lipsey, M. W. (1990). Design sensitivity: Statistical power for experimental research (Vol. 19): Sage. Levin, J. (2001). For whom the reductions count: A quantile regression analysis of class size and peer effects on scholastic ac hievement. Empirical Economics ,26(1), 221 -246. Leuven, E., Oosterbeek, H., & Ronning, M. (2008). Quasi -experimental estimates of the effect of class size on achievement in Norway. Scandinavian Journal of Economics, 110(4), 663 -693. Ma, L., & Koenker, R. ( 2006). Quantile regression methods for recursive structural equation models. Journal of Econometrics , 134(2), 471 -506. Martin, M.O. & Mullis, I.V.S. (Eds.). (2012). Methods and procedures in TIMSS and PIRLS 2011 . Chestnut Hill, MA: TIMSS & PIRLS Internatio nal Study Center, Boston College. Moerbeek, M. (2008). Powerful and cost -efficient designs for longitudinal intervention studies with two treatment groups. Journal of Educational and Behavioral Statistics , 33(1), 41 -61. Molnar, A., Smith, P., Zahorik, J., Palmer, A., Halbach, A., & Ehrle, K. (1999). 
Evaluating the SAGE program: A pilot program in targeted pupil-teacher reduction in Wisconsin. Educational Evaluation and Policy Analysis, 21(2), 165-177.
Mosteller, F. (1995). The Tennessee study of class size in the early school grades. Future of Children, 5(2), 113-127.
Muthén, B. O., & Curran, P. J. (1997). General longitudinal modeling of individual differences in experimental designs: A latent variable framework for analysis and power estimation. Psychological Methods, 2(4), 371-402.
Nye, B., Hedges, L. V., & Konstantopoulos, S. (2000). The effects of small classes on academic achievement: The results of the Tennessee class size experiment. American Educational Research Journal, 37(1), 123-151.
Nye, B., Hedges, L. V., & Konstantopoulos, S. (2002). Do low-achieving students benefit more from small classes? Evidence from the Tennessee class size experiment. Educational Evaluation and Policy Analysis, 24(3), 201-217.
Plackett, R. L. (1960). Principles of regression analysis. Oxford: Clarendon Press.
Pong, S. L., & Pallas, A. (2001). Class size and eighth-grade math achievement in the United States and abroad. Educational Evaluation and Policy Analysis, 23(3), 251-273.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: applications and data analysis methods (2nd ed.). Thousand Oaks: Sage Publications.
Raudenbush, S. W., & Liu, X. F. (2001). Effects of study duration, frequency of observation, and sample size on power in studies of group differences in polynomial change. Psychological Methods, 6(4), 387-401.
Raudenbush, S. W., Martinez, A., & Spybrook, J. (2007). Strategies for improving precision in group-randomized experiments. Educational Evaluation and Policy Analysis, 29(1), 5-29.
Seber, G. A. F., & Lee, A. J. (2003). Linear regression analysis (2nd ed.). Hoboken, N.J.: Wiley-Interscience.
Shafer, J. L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research, 8, 3-15.
Slavin, R. E. (1989).
Class size and student achievement: Small effects of small classes. Educational Psychologist, 24(1), 99-110.
Snijders, T., & Bosker, R. J. (1999). Multilevel analysis: an introduction to basic and advanced multilevel modeling. Thousand Oaks, CA: Sage Publications.
Staiger, D., & Stock, J. H. (1997). Instrumental variables regression with weak instruments. Econometrica, 65(3), 557-586.
Stock, J. H., Wright, J. H., & Yogo, M. (2002). A survey of weak instruments and weak identification in generalized method of moments. Journal of Business and Economic Statistics, 20(4), 518-529.
Urquiola, M. (2006). Identifying class size effects in developing countries: Evidence from rural Bolivia. The Review of Economics and Statistics, 88(1), 171-177.
Urquiola, M., & Verhoogen, E. (2009). Class-size caps, sorting, and the regression-discontinuity design. American Economic Review, 99(1), 179-215.
Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data. Cambridge, MA: MIT Press.
Wossmann, L. (2005). Educational production in Europe. Economic Policy, (43), 445-504.
Wossmann, L., & West, M. (2006). Class-size effects in school systems around the world: Evidence from between-grade variation in TIMSS. European Economic Review, 50(3), 695-736.
Wu, D. M. (1973). Alternative tests of independence between stochastic regressors and disturbances. Econometrica, 41(4), 733-750.