THREE ESSAYS ON THE ECONOMICS OF EDUCATION: CLASS-SIZE REDUCTION, TEACHER LABOR MARKETS, AND TEACHER EFFECTIVENESS

By

Steven Dieterle

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Economics

2012

ABSTRACT

THREE ESSAYS ON THE ECONOMICS OF EDUCATION: CLASS-SIZE REDUCTION, TEACHER LABOR MARKETS, AND TEACHER EFFECTIVENESS

By Steven Dieterle

Prior research has established the potential for achievement gains from attending smaller classes. However, large statewide class-size reduction (CSR) policies have not been found to consistently realize such gains. A leading explanation for the disappointing performance of CSR policies is that schools are forced to hire additional teachers of lower quality to meet the new class-size requirements. The first chapter uses administrative data from an anonymous state to explore whether the value-added of newly hired teachers fell after the introduction of CSR. The results suggest that while there was a modest fall in the relative average quality of newly hired teachers and of those retained beyond their first year, this drop is not nearly large enough to explain the failure of CSR to produce sizeable achievement gains. Furthermore, schools facing CSR pressure saw falls in quality similar to those that did not, likely because labor market competition forced all schools along the effective teacher supply curve. Therefore, between-school differences in the quality of incoming teachers cannot explain the failure of previous quasi-experimental treatment-control comparisons to find achievement effects from statewide CSR. In addition to providing insight into CSR, the results are informative for assessing any potential intervention that may drastically increase the short-run demand for teachers. The next chapter provides necessary background information for the first chapter.
Prior evidence on the change in teacher composition associated with CSR policies focused on changes in observable teacher characteristics in California. California experienced large increases in the percentage of teachers without certification, without an advanced degree, and with fewer than three years of experience. These changes in the composition of the California teacher workforce have been cited when examining CSR in other states, yet there has been no study of changes in the composition of the teacher workforce elsewhere. Chapter 2 explores the case of CSR in the state of Florida. Unlike in California, this paper finds little evidence of large changes in teacher characteristics associated with class-size reduction. Only average experience moves in a direction consistent with California's experience, and this drop in experience is found to be slightly larger for schools serving larger minority and low-income populations.

The third chapter examines the effectiveness of a range of pedagogical practices in kindergarten and first grade using a longitudinal survey dataset. In answering this question, the focus is on concerns over the nonrandom exposure of students to different teaching strategies. The findings indicate that various teaching modalities, such as working with counting manipulatives, using math worksheets, and completing problems on the chalkboard, have positive effects on achievement in kindergarten. In first grade, pedagogical practices relating to explaining problem-solving and working on problems from textbooks have positive effects on achievement. Furthermore, the results suggest that the models and estimators previously employed to estimate teacher characteristic and practice effects using longitudinal survey data likely neglected problems arising from the nonrandom sorting of students and teachers into schools.

To my loving wife Niccole: Thank you for your never-ending love and support, without which this would not have been possible.
ACKNOWLEDGEMENTS

I would like to sincerely thank Gary Solon for his invaluable guidance and support, both in terms of comments and suggestions on this dissertation and in my development as an economist. I would also like to thank the other members of my guidance committee, Todd Elder, Cassie Guarino, and Jeff Wooldridge, for their assistance with this and other work. The papers comprising this dissertation also benefited greatly from helpful comments from other faculty and graduate students at Michigan State University, including Steven Haider, Leslie Papke, Otavio Bartalotti, Quentin Brummet, Stacy Miller, and Elizabeth Quin, among many others. Their combined effort has helped immensely in the completion of this project. All remaining errors are my responsibility.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1
CLASS-SIZE REDUCTION AND THE QUALITY OF ENTERING TEACHERS
  Introduction
  Literature Review/Background
  Institutional Details: CSR in State X
  Data
  CSR and Teacher Characteristics in State X
  Empirical Methodology
  Results
  Sensitivity Analysis
  Conclusion
  APPENDICES
    Appendix A: Additional Tables
    Appendix B: Measuring Teacher Quality
  REFERENCES

CHAPTER 2
CLASS-SIZE REDUCTION AND THE COMPOSITION OF THE TEACHER WORKFORCE
  Introduction
  Literature Review/Background
  CSR in Florida
  Data
  Analytical Approach
  Results
  Discussion
  Conclusion
  REFERENCES

CHAPTER 3
WHAT CAN WE LEARN ABOUT EFFECTIVE EARLY MATHEMATICS TEACHING? A FRAMEWORK FOR ESTIMATING CAUSAL EFFECTS USING LONGITUDINAL SURVEY DATA
  Introduction
  Modeling and Estimation Framework
  Prior Survey-Based Research on the Impact of Teacher Characteristics and Teaching Practices on Student Achievement in the Early Grades
  Data
  Methods
  Results
  Summary and Conclusions
  APPENDIX
  REFERENCES

LIST OF TABLES

Table 1.1 New Class-Size Maximums: State X
Table 1.2 Estimated CSR Mathematics Achievement Effects for State X
Table 1.3 Pooled OLS Cohort Estimates with Cohort Indicators
Table 1.4 Pooled OLS Cohort Estimates with Cohort-by-Year Indicators
Table 1.5 Estimates of New Cohort Effects by CSR Intensity
Table 1.6 Estimated Contribution of Teacher Cohort Composition and Experience to Average Achievement
Table 1.7 Estimated Contribution of Teacher Composition to Average Achievement: CSR vs. Non-CSR Schools
Table 1.8 Cohort Effect Estimates from Alternative VAM Estimators
Table 1.9 Descriptive Statistics
Table 1.10 Estimates from Pooled OLS Regressions
Table 2.1 Mean Teacher Characteristics: California K-3 Teachers 1995-1997
Table 2.2 Mean Teacher Characteristics: All Florida Public School Teachers 2000-2009
Table 2.3 New Class-Size Maximums and Average Class Size
Table 2.4 Summary Statistics: Florida Elementary Schools 2000-2009
Table 2.5 Quartile Cutoffs: School-Level Averages
Table 2.6 Mean of School-Level Advanced Degree Percent: Disaggregated by School Characteristics
Table 2.7 Mean of School-Level Average Experience: Disaggregated by School Characteristics
Table 2.8 Mean of School-Level Out-of-Field Percent: Disaggregated by School Characteristics
Table 3.1 OLS Estimates of Lag Score Specification Pooled with Grade Interactions with School Dummies
Table 3.2 Random Assignment Test p-values: Joint Significance of Child Characteristics and Lagged Score
Table 3.3 Main Model and Estimation Results
Table 3.4 Alternative Models and Estimators
Table 3.5 Summary Statistics for Forty Imputed Data Sets
Table 3.6 Summary of Missing Data

LIST OF FIGURES

Figure 1.1 Teacher-Level Trends: All Teachers in Grades 4-6 in Core Courses
Figure 1.2 Teacher-Level Trends: Entry Cohorts in Grades 4-6 in Core Courses
Figure 1.3 Percentile Rank Distributions: Entry Cohorts
Figure 3.1 Tree Diagram of Possible Model/Estimation Strategies

CHAPTER 1

CLASS-SIZE REDUCTION POLICIES AND THE QUALITY OF ENTERING TEACHERS

1. Introduction

The potential for student achievement gains from smaller classes has been well documented in experimental and quasi-experimental research over the last two decades (Krueger 1999; Krueger & Whitmore 2001; Angrist & Lavy 1999). As of 2005, this potential had led to the adoption of large-scale class-size reduction (CSR) measures in thirty-two states (Council for Education Policy, Research and Improvement (CEPRI) 2005). To date, studies of CSR policies find only mixed evidence of achievement effects, with estimated effects consistently falling short of what might be expected from the experimental research. Due to the high costs of implementation, on the order of $21 billion over nine years in Florida (Florida Department of Education (FDOE)) and $1.5 billion a year in California (Bohrnstedt & Stecher 1999), the efficacy of CSR policies has been called into question.

One common explanation for the underperformance of CSR is that it forces schools to hire new teachers of lower quality in order to meet the class-size requirements. The gains from having smaller classes are thought to be offset by having teachers of lower quality in the classroom. Using administrative data covering grades four through six from an anonymous, diverse state (subsequently referred to as State X) around the implementation of a statewide CSR program, this paper addresses two separate, but related, questions. Did the CSR-induced demand increase lead to schools hiring and retaining lower-quality teachers, here measured by value-added to student mathematics achievement?[1] If so, what effect did this fall in quality have on average achievement, and can it explain a large portion of the unrealized CSR gains?
[1] Similar results obtained using reading test scores instead of mathematics are available upon request from the author. Generally, the reading results are slightly smaller in magnitude and were slightly more sensitive to the specification and estimator chosen. However, these differences do not change the conclusions drawn. The decision to focus on mathematics scores only was made for the sake of brevity and because it is common in the education production function literature for mathematics scores to be more responsive to inputs than reading.

The value-added estimates of cohort performance found here indicate a small reduction in the average quality of both newly hired teachers and teachers who are retained after their first year. In terms of student achievement, the estimated conditional mean performance of the larger post-CSR hiring cohorts ranges from 0.0033 to 0.0277 test score standard deviations lower than that of the smaller pre-CSR cohorts in each cohort's first year. These differences in cohort performance persist partially over time as the composition of each cohort changes, with the differences in pre- and post-CSR second-year cohort effects ranging from 0.0078 to 0.0182 standard deviations. However, there is evidence that further attrition from post-CSR hiring cohorts may lead to negligible differences among the remaining teachers after three to four years, implying an even smaller long-run CSR hiring effect on achievement.

Even if the average quality of cohorts had not changed, there may have been a short-run effect of CSR hiring on student performance due to hiring more teachers with less experience. The fall in average achievement attributable to the change in both average quality and experience is less than one-fiftieth of a test score standard deviation. This fall in achievement is driven primarily by changes in cohort quality, rather than experience, and was generally experienced by all schools. In fact, schools classified as treated (those for which CSR was binding) in previous quasi-experimental estimates of CSR policy effects in State X experience a slightly smaller drop in achievement attributable to the stock of teachers than those considered untreated. This result implies that a differential change in the quality of newly hired teachers is not the mechanism preventing the expected achievement gains from CSR. Further, it suggests a role for competition for teacher candidates pushing all schools along the effective teacher supply curve in connected labor markets.

The results are informative beyond providing a better understanding of CSR programs and how they relate to the pool of employed teachers. Importantly, the results help fill a gap in the prior literature on the quality, or value-added, elasticity of teacher supply. To understand why there is little research on this elasticity, it is helpful to consider what is unique to the intervention studied here. Namely, it provides a rare opportunity to observe a substantial increase in the number of teachers hired by the same schools in a short time period. This sort of variation is preferred to relying on cross-sectional or longer-run differences in teacher hiring to identify the elasticity. An understanding of the nature of the underlying teacher labor supply is useful for predicting the impact of any intervention that results in a sudden change in teacher demand. For instance, short-run increases in teacher demand associated with retirement buyout plans or changes in curriculum are often met with concerns over the quality of the new teachers hired (Center for Local, State, and Urban Policy 2010). The results found here are informative in predicting the fall in quality associated with such policies.
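To see why cohort effects of this magnitude imply small aggregate achievement changes, a back-of-the-envelope calculation helps. Both numbers below are illustrative round figures chosen for the sketch, not estimates from this chapter:

```python
# Hypothetical: suppose teachers from post-CSR entry cohorts teach 30% of
# students in a given year, and are 0.02 SD less effective on average than
# pre-CSR entrants (a gap in the range reported above). Both values are
# invented for illustration.
share_taught_by_new_cohorts = 0.30  # illustrative, not estimated
quality_gap_sd = 0.02               # illustrative, within the reported range

# The aggregate effect is the share of affected students times the gap.
aggregate_effect_sd = share_taught_by_new_cohorts * quality_gap_sd
print(f"aggregate effect: {aggregate_effect_sd:.4f} SD")
```

Even under generous assumptions, the product is on the order of hundredths of a standard deviation, an order of magnitude below the roughly one-fifth of a standard deviation STAR effect discussed below.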
The paper proceeds as follows: Section 2 provides a review of the relevant literature and background information; Section 3 discusses the institutional details of the policy; Section 4 describes the data used; Section 5 looks at the changes in teacher characteristics that accompanied CSR; Section 6 discusses the empirical strategy; Section 7 gives and discusses the main results; Section 8 tests the sensitivity of the results to the use of alternative value-added measures; and Section 9 concludes.

2. Literature Review/Background

Based on the random assignment of students and teachers to classrooms of varying sizes, the results of the Tennessee STAR experiment suggested that class-size reduction is a potentially viable tool to promote achievement gains. Krueger (1999) analyzes the STAR data and finds that being randomly assigned to a small (13-17 students) class as opposed to a larger class (22-25 students) in early elementary school led to roughly one-fifth of a standard deviation increase in average test scores. In a follow-up, Krueger & Whitmore (2001) find that being in a small class also affected student outcomes well after the experiment, such as increasing the likelihood of taking a college entrance exam.

The finding of statistically and practically significant class-size effects from the Tennessee STAR experiment led many states to explore the use of CSR to promote student achievement growth. By 2005, thirty-two states had adopted some sort of CSR program (CEPRI 2005). Despite CSR's popularity among teachers and parents, there is only mixed support for the conclusion that these large-scale CSR programs are effective at helping to raise test scores. In their official report on CSR in California, Bohrnstedt & Stecher (2002) were unable to find conclusive evidence of achievement gains for kindergarten through third grade.
In contrast, Jepsen & Rivkin (2009) use class-size variation from California CSR and find that a ten-student reduction in class size is associated with an increase in achievement of one-tenth to one-twentieth of a standard deviation in grades two through four. Like Bohrnstedt & Stecher, Chingos (2012) found null effects of CSR in Florida for fourth through eighth grade. The effectiveness of CSR in State X will be explored in more detail in Section 7.

In light of the experimental results discussed above, the fact that gains are not consistently realized with large-scale CSR programs is puzzling. One possibility is that the experimental Tennessee STAR results could be an anomaly. Indeed, not all experimental and quasi-experimental studies find significant class-size effects (Hoxby 2000). The size and scope of STAR limit our ability to assess whether the results would hold up under repeated experiments. Given this limitation, a recent paper by Rockoff (2009) that discusses the results of several class-size experiments from the beginning of the twentieth century provides some additional context for the STAR results. Rockoff concludes that the balance of these early class-size experiments suggests there was little achievement benefit to attending smaller classes. This conclusion comes with several caveats, including the small scale of these early studies and some experimental design issues. Most importantly, it seems plausible that changes in the educational environment since the early twentieth century may have changed the role of class size in affecting achievement. Given these concerns, these experiments serve, at most, as suggestive evidence that the results of Tennessee STAR may be an anomaly.

Assuming that there are potential gains from reducing class size, a leading explanation for the failure of CSR revolves around changes in teacher quality associated with the implementation of the program (Stecher & Bohrnstedt 2000; Imazeki n.
d.; Buckingham 2003; CEPRI 2005; Chingos 2012). One way in which teacher quality may change is if schools are forced to hire additional teachers from lower on the quality distribution in order to meet the new class-size requirements. Schools may also retain teachers that would otherwise have been dismissed for poor performance in order to lessen the hiring burden. Gains associated with smaller classes are then offset by having less capable teachers in classrooms, yielding no gains on net.

To support these teacher-quality-based explanations, it is common to look at changes in observable teacher characteristics associated with the implementation of CSR. Stecher & Bohrnstedt (2000) document declines in the percentages of fully certified teachers, teachers with advanced degrees, and experienced teachers in California. While changes in teacher characteristics do indicate changes in the teacher workforce, the link between these characteristics and student performance on exams has often been found to be weak. Goldhaber (2008) provides a detailed review of the education production function literature, concluding that teacher quality is not "strongly correlated" with observable teacher characteristics. Given this evidence, the finding that observable teacher characteristics change after CSR implementation may not adequately explain the lack of test score gains. The more relevant question is whether schools are forced to hire teachers who contribute less to a student's achievement growth.

Jepsen & Rivkin (2009) analyze California's CSR program to estimate the relationship between teacher cohort size and quality. These authors examine whether the estimated effects of teacher experience and certification differed across years. Intuitively, this approach identifies the quality of new cohorts of teachers because those teachers identified as inexperienced or uncertified in a given year are more likely to be new.
They find no statistically or practically significant differences in the estimated experience or certification effects across years. Jepsen & Rivkin's approach is limited, however, by their lack of access to individual-level data on students and teachers, so that they cannot identify which teachers make up a hiring cohort or link students to specific teachers.

Kane & Staiger (2005) use individual-level data from Los Angeles to analyze California's CSR program. They calculate value-added for teachers hired just before and immediately after CSR and find no differences between the two cohorts of teachers several years later.[2] They also find no evidence of differential attrition between the two cohorts. Kane & Staiger benefit from having individual-level data. However, given the potential differences between Los Angeles and other districts in the state, it is difficult to conclude how general these results are. Finally, studies of California's CSR program are hampered by a lack of pre-policy test score data. With data on individual students and teachers for an entire state spanning the introduction of the policy, it is possible to assess the change in teacher quality associated with CSR more directly and to look across heterogeneous districts and schools.

[2] Unfortunately, the Kane & Staiger (2005) paper was unpublished and is unavailable via the internet. This information comes from a related paper by Staiger & Rockoff (2010).

3. Institutional Details: CSR in State X

In November of 2002, State X voters approved a constitutional amendment that created a new statewide CSR program, set to begin in the 2003-2004 school year. Separate class-size maximums were set for different grade levels, as shown in Table 1.1. The law also established per-pupil allocations from the state government for each year a district or school was found to be in compliance with the law.
There is anecdotal evidence that the allocation was not enough to cover the full costs of CSR implementation for some districts, suggesting that a reallocation of other resources may partially explain CSR performance. This possibility will be explored in the results section.

The new law allowed for a gradual phase-in of the mandated class sizes. A district or school was in compliance if it had lowered its average class size by two students from the previous year or if it was already below the maximum. For the first three years of the program, compliance was based on the district average, while for the next three years it was based on a school-level average. Non-compliance by districts or schools initially resulted in a portion of the CSR allocation being directed toward capital outlays aimed at reducing class size. Beginning in the third year of the program, the threatened sanctions for non-compliance became more severe. According to the law, districts not in compliance were to be forced to implement one of the following four policies: having year-round schools, having double sessions in schools, changing school attendance zones, or altering the use of instructional staff.

As seen in Table 1.1, the new maximums were binding for most districts at implementation, with only 12% and 42% of districts below the required average class size in kindergarten through third grade and fourth through eighth grade, respectively. With average class size dropping from 23 to 16 in the earliest grades and from 24 to 19 in the middle grades, it is clear that the program did achieve its stated goal of reducing class size.

Table 1.1: New Class-Size Maximums: State X

Grades    Maximum    Percent Below Max (Yr 1)    Average CS (Yr 1)    Average CS (Yr 8)
KG-G3     18         11.94%                      23.07                16.39
G4-G8     22         41.79%                      24.16                18.91
G9-G12    25         91.04%                      24.10                21.94

Note: District-level averages. Source: State X Department of Education.
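The phase-in rule just described — a district or school complies if it is already at or below the maximum, or if it cut its average class size by at least two students from the previous year — can be sketched in a few lines. This is an illustrative sketch; the function and its inputs are my own naming, not part of the statute:

```python
def is_compliant(avg_class_size: float, prev_year_avg: float, maximum: int) -> bool:
    """A district/school complies if its average class size is at or below the
    grade-band maximum, or if it lowered the average by at least two students
    relative to the previous year (the phase-in provision)."""
    return avg_class_size <= maximum or (prev_year_avg - avg_class_size) >= 2.0

# A district at 24.16 students in grades 4-8 (maximum 22) complies by cutting
# to 22.1, a reduction of about two students, even though it remains above
# the maximum; a smaller cut does not comply.
print(is_compliant(22.1, 24.16, 22))  # True: reduced by roughly two students
print(is_compliant(23.5, 24.16, 22))  # False: above max, cut less than two
print(is_compliant(21.0, 23.0, 22))   # True: already below the maximum
```

The two-pronged rule explains why even districts far above the maximum could remain in compliance for several years while phasing in smaller classes.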
4. Data

The data used for this analysis are a combination of restricted-use state administrative data and State X's published class-size averages. The administrative data link students in grades one through six to teachers and schools from the 2000-2001 to the 2007-2008 school year. Importantly, this student/teacher link is made at the classroom level, rather than at the grade level or by matching students to end-of-year exam proctors, as is the case in other prominent administrative data sets. In addition to basic student demographics, the data include test scores for students from third to sixth grade. These test score data enable the estimation of teacher value-added for teachers in grades four through six over a seven-year period starting with the 2001-2002 school year.

Importantly, the data track teachers over the same time period as the students. This allows teachers to be followed as long as they stay in the state's elementary school education system. For instance, it is possible to identify when teachers enter or exit the public elementary school system over time. The teacher information includes relevant variables such as a teacher's experience and degree level.

Finally, State X has made each district's and school's average class size publicly available since the beginning of the CSR program. These class-size averages allow for the identification of districts and schools that needed to reduce class size in order to stay compliant. Generally in this paper, schools are divided into those with district- (school-) level average class size below the maximums for grades four through eight in the year prior to district- (school-) level CSR enforcement and into quartiles of average class size for those above the maximums. Descriptive statistics for the key variables used in this study are presented in Appendix Table 1.9. Notably, nearly 70% of the student-year observations in the data are linked to a teacher observed entering at some point in the sample period.
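The grouping just described — schools below the maximum in their own category, and quartiles of average class size for the rest — can be sketched as follows. The class-size values are invented for illustration, and the standard library's `statistics.quantiles` stands in for whatever quartile convention the paper actually uses:

```python
from statistics import quantiles

MAXIMUM = 22  # grades 4-8 class-size maximum in State X

# Hypothetical school-level average class sizes in the year before enforcement.
avg_sizes = [19.5, 21.0, 22.4, 23.1, 23.8, 24.5, 25.2, 26.0, 27.3]

below_max = [s for s in avg_sizes if s <= MAXIMUM]
above_max = sorted(s for s in avg_sizes if s > MAXIMUM)

# Quartile cutoffs computed only among the schools above the maximum.
q1, q2, q3 = quantiles(above_max, n=4)

def csr_group(size: float) -> str:
    """Assign a school to 'below max' or to a CSR-intensity quartile."""
    if size <= MAXIMUM:
        return "below max"
    if size <= q1:
        return "Q1 (least CSR pressure)"
    if size <= q2:
        return "Q2"
    if size <= q3:
        return "Q3"
    return "Q4 (most CSR pressure)"

print(csr_group(21.0))  # below max
print(csr_group(27.3))  # Q4 (most CSR pressure)
```

The key design point is that only schools for which CSR was binding are ranked by intensity; schools already below the maximum serve as an untreated comparison group.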
5. CSR and Teacher Characteristics in State X

To provide background for the subsequent analysis and to tie the current work to the previous CSR literature on changes in the teacher workforce, it is helpful to consider how teacher characteristics changed with the introduction of CSR in State X. Bohrnstedt & Stecher (1999) found fairly large swings in teacher characteristics around California's introduction of CSR, with the percent with less than three years of experience rising from 17% to 28%, the percent without an advanced degree rising from 17% to 23%, and the percent not fully certified going from 1% to 12% in just three years. On the other hand, Dieterle (2011) found much smaller changes in State X, with only average teacher experience dropping with the introduction of CSR. Dieterle uses publicly available data aggregated at the school level for his analysis.

The administrative data used in this study allow for a closer look at the characteristics of both the stock of teachers and the flow of new teachers into the system. Here, the focus is on teachers teaching a core course (those that fall under CSR requirements) in grades four through six (those for which value-added estimation is possible). Figure 1.1 displays the trends over time in the number of teachers, the percent with an advanced degree, average experience, and the percent with three or fewer years of experience for all fourth through sixth grade core course teachers.

[Figure 1.1: Teacher-Level Trends: All Teachers in Grades 4-6 in Core Courses. Panel A: Number of Teachers; Panel B: Percent with Advanced Degree; Panel C: Average Years Experience; Panel D: Percent Novice Teachers.]
In panel A, the number of teachers in this group is shown to rise steadily over the introduction of CSR from under 19,500 before CSR to nearly 24,500 after five years. The pattern for the percentage with an advanced degree is less clear, dropping by three-quarters of a percentage point with the introduction of CSR and the change to school-level enforcement, while increasing in the other years. Average experience drops from a pre-CSR level of roughly eleven years to nearly 9.5 years by the introduction of school-level enforcement four years later. While this drop in the average experience of teachers is not trivial, prior research finds that early experience matters more for student achievement. Panel D addresses this issue of early experience by showing that the percentage of teachers considered novices increased by five percentage points, from 34% to 39%, over the implementation of CSR. To the extent that these new teachers stayed in teaching, the drop in achievement associated with having more novice teachers may only be temporary as the teachers gain experience. Figure 1.2 explores trends for the flow, rather than the stock, of teachers into the state public elementary school system. Recall that the data follow all first through sixth grade teachers in public schools in State X. Therefore, a teacher entering the data may be new to teaching, returning to teaching, transferring from a public middle or high school, moving from a private school within the state, or moving from a public or private school in another state. Panel A shows that the number of fourth through sixth grade core course teachers entering the data each year increases by roughly 2,000 by the fourth year of CSR. The percent with an advanced degree, in Panel B, shows a similar pattern to the stock of teachers, dropping only in years of a change in CSR policy. The average experience of entering cohorts, shown in Panel C, actually falls more before CSR than after.
Similarly, the percent novice increases more before CSR.

[Figure 1.2: Teacher-Level Trends, Entry Cohorts in Grades 4-6 in Core Courses. Panel A: Number of Teachers; Panel B: Percent with Advanced Degree; Panel C: Average Years Experience; Panel D: Percent Novice Teachers.]

6. Empirical Methodology

The methodology used here follows from the standard value-added approach to education production function estimation and consists of two broad steps. The first set of estimates uses several variants of the value-added framework to answer the general question of whether schools hired teachers of lower quality with the CSR-induced demand increase. Selected results from the first step are then used to create estimates of the change in overall average achievement that can be attributed to the change in the stock of teachers. Additionally, we explore the extent to which these changes can explain the disappointing CSR achievement effects from estimators that compare treated schools (those above the new CSR maximums at introduction) to untreated schools.

For the purposes of this paper, teacher quality is defined as the contribution teachers make to student mathematics achievement growth. While it is clear that test scores are only one facet of a student's academic growth and that a good teacher may contribute to other areas, such as a child's social development, the advent of school accountability programs has positioned test scores as the key measure used to assess teachers and schools. Indeed, value-added to test scores is a particularly appropriate metric for assessing why test scores did not go up more with CSR.
The main estimation strategies used here are based on OLS estimation of what will be referred to as a lag score specification, due to the presence of the student's prior test score as an explanatory variable:

A_{igst} = \lambda_t + \tau A_{igst-1} + X_{igst}\beta + Cohort_{igst}\gamma_1 + \gamma_2 \bar{A}_{igst-1} + f(Exp_{igst}) + \gamma_3 CS_{igst} + \theta_g + c_i + \delta_s + e_{igst}    (6.1)

where A_{igst} is student i's test score in grade g, school s, year t; \lambda_t are year fixed effects; A_{igst-1} is student i's prior test score; X_{igst} are student demographics; Cohort_{igst} are teacher cohort indicators; \bar{A}_{igst-1} is the average prior test score of student i's classmates; f(Exp_{igst}) is a cubic in teacher experience; CS_{igst} is a proxy measure of class size; \theta_g are grade fixed effects; c_i is an unobserved student heterogeneity term; \delta_s are school fixed effects; and e_{igst} is an idiosyncratic error.

Footnote 3: See Appendix B: Measuring Teacher Quality, for a detailed discussion of value-added models and estimation.

Footnote 4: Class size is measured by the number of students linked to a teacher in a given year in the test data. While this serves as a reasonable proxy in fourth and fifth grade, it is less reliable in sixth grade, when many schools have teachers teaching multiple classes. In estimating (6.1) we allow for different effects of class size for each grade. The proxy measure of class size is important for separating out the quality of newly hired teachers from any effect the reduced class sizes may have had on achievement under CSR.

Note that the OLS estimation of (6.1) (our preferred strategy) treats c_i as if it is uncorrelated with all other student-level variables. While this assumption is unlikely to hold in practice, there is evidence that OLS estimation of the lag score specification typically performs well. Kane & Staiger (2008) find that this method does the best at estimating a teacher's value-added in nonexperimental settings by comparing estimates for the same teachers both with and without random assignment to students. Using simulated data, Guarino et al.
(2011) find that the lag score specification estimated by OLS is fairly robust, compared to other common VAM estimators, to different teacher and student sorting mechanisms. The intuition for this result is that assignment is driven more by dynamic (i.e., changes in test performance), rather than static, characteristics of students. Estimators that attempt to eliminate unobserved student heterogeneity introduce additional assumptions and greatly reduce the identifying variation, while failing to capture much of the assignment mechanism that threatens the validity of VAM estimates. The sensitivity of the results to the choice of value-added approach is explored in section 8. The main coefficients of interest are the estimates of γ1, the average quality of entry cohorts of teachers. Specifically, interest lies in comparing the average quality of entry cohorts before and after the introduction of CSR. The teacher-quality explanation for the poor performance of CSR would be consistent with smaller gains associated with cohorts entering the data after CSR was implemented compared to earlier cohorts. The inclusion of δ_s, the school fixed effects, is important for two reasons. First, it helps to control for the fact that certain schools may tend to serve higher- or lower-ability students on average. The school fixed effects also help to identify whether schools hired teachers of lower quality in CSR years. Consider a case in which each school faces a teacher supply curve consisting of candidates of homogeneous quality. However, given evidence that there is substantial sorting of teachers into geographically small markets (Boyd et al. 2005; Lankford et al. 2002), each school may face a different level of teacher quality.
If CSR disproportionately induced hiring in schools that faced supplies of lower-quality teachers, then without controlling for these school-level differences in supply we would see an overall negative relationship between CSR years and the average quality of new entrants. However, in this scenario there is no actual relationship between CSR and the quality of teachers hired by particular schools. The inclusion of school fixed effects controls for the time-invariant quality level of teacher supply that different schools face. The experience profile can be thought of as capturing three distinct factors: teaching-specific human capital accumulation, nonrandom sorting of students to teachers based on experience, and nonrandom attrition of teachers. Focusing on the human capital piece of the experience profile, the possible effect of CSR on short-run achievement is better captured when the experience of the teacher is not controlled for. However, controlling for experience allows for a more direct comparison of teacher quality throughout the sample period. If experience is not controlled for, teachers from earlier cohorts may look better than later cohorts simply because the estimates are partially based on years in which these teachers have more experience. The joint contribution of both cohort quality and experience to student achievement is considered in more detail later. Care should be taken not to interpret the estimates of (6.1) as necessarily a pure CSR hiring effect. In particular, the effects of any changes that may be related to the quality of teachers hired in a given year are included in the estimates of γ1. State X enacted many policies over the same time period, including measures to reduce the costs of entering the teaching profession through alternative certification pathways.
These changes included the authorization of school districts (rather than just colleges and universities) to provide professional preparation programs for certification, beginning in the 2002-2003 school year, and a 2004 law allowing for the creation of Educator Preparation Institutes (EPIs), through which college graduates with a non-education degree can receive certification (Feistritzer 2007). If these measures led to a change in the labor supply of teachers in CSR years, part of the estimated cohort quality may be capturing these changes. Fortunately, the uptake of these alternative pathways was quite low over the period of our data. Sass (2011) documents the number of District Certified and EPI teachers linked to third through tenth grade students in State X from the 2000-2001 to the 2006-2007 school year at only 1,473 and 206, respectively. The approach adopted here captures potential CSR effects that would be difficult to parameterize given the available data. For example, the school-level class-size averages are only available starting with the year directly before school-level enforcement. This data limitation makes it difficult to identify individual schools that may have hired additional teachers during district-level enforcement years in order to preempt the switch to school-level enforcement. The estimates of γ1 for the 2005-2006 hiring cohort will include the effect of schools hiring additional teachers because of the switch in enforcement the following year. Note that these teacher value-added measures may also capture changes over time in resources that complement a teacher's ability to raise achievement. If CSR led to a reduction in these resources, then part of the change in measured teacher effectiveness over time may be capturing these changes as well. There is some suggestive evidence, discussed later, that this is not a large problem in interpreting the results.
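To make the estimation strategy concrete, the following sketch estimates a simplified version of the lag score specification (6.1) by pooled OLS on simulated data. All variable names, sample sizes, and "true" coefficient values are illustrative assumptions, not the dissertation's data or estimates; the peer-average, class-size, demographic, and grade controls are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated student-year data (all sizes and coefficients are made up):
# students in schools, taught by teachers from different entry cohorts.
n, n_schools = 5000, 20
school = rng.integers(0, n_schools, n)
cohort = rng.integers(0, 3, n)              # 0 = incumbent stock, 1-2 = entry cohorts
prior = rng.normal(0.0, 1.0, n)             # lagged test score A_{t-1}
exper = rng.integers(0, 20, n).astype(float)
school_fe = rng.normal(0.0, 0.2, n_schools)
gamma_true = np.array([0.0, -0.01, -0.03])  # assumed "true" cohort effects (gamma_1)

score = (0.6 * prior + gamma_true[cohort] + 0.02 * exper - 0.0005 * exper**2
         + school_fe[school] + rng.normal(0.0, 0.5, n))

# Design matrix: intercept, lagged score, cohort dummies (baseline omitted),
# experience cubic, and school dummies (first school omitted).
cohort_d = (cohort[:, None] == np.arange(1, 3)).astype(float)
school_d = (school[:, None] == np.arange(1, n_schools)).astype(float)
X = np.column_stack([np.ones(n), prior, cohort_d,
                     exper, exper**2, exper**3, school_d])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)

# beta[2], beta[3] estimate the entry-cohort effects relative to the incumbent stock.
print(beta[1], beta[2], beta[3])
```

Comparing the recovered cohort coefficients across entry years mirrors the pre-CSR versus post-CSR comparison of γ1 pursued in the results below.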
Footnote 5: Clearly, the number of these alternatively certified teachers in grades four through six will be much lower and, in the longer run, some substitution away from traditional certification may be expected, suggesting little role for the introduction of these two programs in driving the results that follow.

To address whether the CSR-induced demand increase led to both the hiring and retention of lower value-added teachers, as well as the possibility that attrition from teaching led to different long-term cohort effects, the cohort-specific indicators are replaced with cohort-by-year indicators:

A_{igst} = \lambda_t + \tau A_{igst-1} + X_{igst}\beta + (Cohort \times Year)_{igst}\gamma_1 + \gamma_2 \bar{A}_{igst-1} + f(Exp_{igst}) + \gamma_3 CS_{igst} + \theta_g + c_i + \delta_s + e_{igst}    (6.2)

The estimates discussed above combine the initial average performance level for a cohort with the longer-term impact of that cohort as its composition changes. With nonrandom attrition, having a single cohort indicator for the 2001-2002 cohort will disproportionately weight the estimates toward the relatively productive (or unproductive) teachers who contribute more observations to the estimation by staying in the data longer. Conversely, the estimated 2007-2008 cohort effect roughly weights each teacher evenly, regardless of their eventual attachment, giving an estimate of initial performance. In this way, the cohort-by-year estimates are likely preferred to those discussed above; however, the previous estimators without cohort-by-year effects are maintained as they provide a much more tractable comparison among the many other modeling and estimation choices. The estimates of equations (6.1) and (6.2) can be thought of as identifying the statewide general equilibrium relationship between hiring cohorts and student performance. However, it is possible that the CSR policy had more "bite" in schools farther away from the new class-size maximums.
In fact, the hypothesis that changes in teacher quality can explain CSR performance is based on this notion. For the next set of estimates, the entry cohorts are further divided into teachers entering schools based on their pre-district and pre-school-level enforcement class-size averages. For the pre-district-level enforcement, schools are grouped based on the district-level averages in the year before CSR, with those districts already under the new maximums for fourth through eighth grade in one group and the remaining districts divided into quartiles of average class size. The schools are grouped similarly using the pre-school-level enforcement class-size averages. The previous estimates all identify changes in the average performance of cohorts. To assess the change in the distribution of teacher quality among hiring cohorts, individual teacher value-added estimates are obtained by replacing the various cohort dummies with individual teacher indicators. All of the above estimates are aimed at uncovering the change in teacher quality associated with increased CSR hiring. By using a select set of these estimates it is possible to more directly test the impact that the change in the stock of teachers had on average achievement in State X and whether this change can explain the fact that quasi-experimental estimates show disappointing CSR achievement effects. As shown in section 5, the number of new teachers increased while the average experience of teachers in State X went down over the implementation of CSR. To assess the effect on average achievement of the change in average quality of new cohorts and the drop in average experience, the contribution of each of these components is calculated using the estimates of equation (6.1).

Footnote 6: Due to computational constraints, this estimation is done separately by district.
The estimated contributions to average achievement in the state of the cohort composition and of teacher experience are calculated in each year as \bar{Cohort}_t \gamma_1 and \bar{f(Exp)}_t = \bar{Exp}_t \phi_1 + \bar{Exp^2}_t \phi_2 + \bar{Exp^3}_t \phi_3, respectively. Both the total contribution and the separate contribution of each component will be presented, along with the change since 2001. A second set of estimates, based on interacting a CSR district treatment dummy with all included covariates in (6.1), is used to show a similar breakdown of the total contribution of the stock of teachers to achievement differences among these schools. This directly addresses the question of how much of the lack of achievement gains found in quasi-experimental estimates of the CSR policy in State X can be explained by differential changes in teacher quality.

7. Results

Before presenting the main results, we estimate the CSR policy effect within the value-added framework discussed in section 6. These results help to establish the extent to which CSR in State X fell short of the potential experimental gains from reducing class size for the sample and model used here. Specifically, equation (6.1) is adapted by replacing the cohort indicators, teacher experience, and class size variables with CSR treatment-by-year indicators:

A_{igst} = \lambda_t + \tau A_{igst-1} + X_{igst}\beta + (T \times Year_{st})\gamma_1 + \gamma_2 \bar{A}_{igst-1} + \theta_g + c_i + \delta_s + e_{igst}    (7.1)

Two separate regressions are estimated based on school- or district-level CSR enforcement. For the district-level enforcement, treatment T equals 1 for districts that were above the new class-size maximum in the year before CSR, and 0 otherwise. The school-level treatment status is similarly determined by the school average class size in the year prior to school-level enforcement.

Footnote 7: This is effectively the same as running separate regressions using the treated and untreated samples.
It is important to note that the regressions include year and school dummy variables and that the omitted treatment category is the 2001-2002 cohort. Table 1.2 presents the estimates of (7.1) for district- and school-level CSR, with district-enforcement years shaded light gray and school-enforcement years in dark gray. Note that these regressions use test scores standardized within grade and year as the dependent variable. Beginning with the district CSR results, most of the estimated CSR achievement effects are small and not statistically different from either zero or the estimated pre-CSR treatment-year interaction coefficient (T x 2002-2003). The one exception is the 2004-2005 effect, estimated to be a statistically significant 0.0264 standard deviations. While statistically significant, the point estimate is practically small. As a rough point of comparison, a simple prediction of the potential effect of CSR based on the estimates of Krueger (1999) would be on the order of one-eighth of a standard deviation. Even the ninety-five percent confidence intervals for these estimates fall short of half of the rough Tennessee STAR benchmark. As shown by the results in the last column of Table 1.2, the treatment-by-year effects after the switch to school-level enforcement during the 2006-2007 school year are negative. The interpretation of these results is made more difficult by the fact that there are also statistically significant negative CSR achievement effects estimated prior to the switch to school-level enforcement. One potential explanation is that those schools farthest from meeting the class-size requirements in 2006-2007 were forced to allocate more resources to class-size reduction in anticipation of the switch in enforcement.

Footnote 8: Krueger estimates the small class effect in third grade (the closest grade to those considered here) to be roughly one-fifth of a standard deviation. This corresponds to an average difference in class size of eight students, from 24 to 16. State X's average class size change in fourth through eighth grade was five students, from 24 to 19. Assuming a linear effect of class size, the Krueger estimates from Tennessee suggest an effect of one-fortieth of a standard deviation per student, which gives the simple prediction of one-eighth. This Tennessee STAR benchmark can be thought of as a rough guide for assessing CSR and cohort performance. While it is not clear what magnitude of achievement effects would constitute a successful CSR policy, having an external, experimental comparison is preferred to simply testing for statistically significant estimates.

Table 1.2: Estimated CSR Mathematics Achievement Effects for State X

CSR Level            District        School
Sample               G4-G6           G4-G6
T x 2002-2003        -0.0170         -0.0323
                     (0.0180)        (0.0244)
T x 2003-2004         0.0163         -0.0284*
                     (0.0152)        (0.0143)
T x 2004-2005         0.0264**       -0.00604
                     (0.0125)        (0.0102)
T x 2005-2006         0.00902        -0.0459***
                     (0.0183)        (0.0164)
T x 2006-2007        -0.00522        -0.0410*
                     (0.0186)        (0.0231)
T x 2007-2008         0.00915        -0.0273
                     (0.0156)        (0.0216)
Observations         2,752,060       2,716,399
R-squared            0.653           0.653
Cluster robust standard errors in parentheses; clustered at the district (school) level for district (school) CSR. *** p<0.01, ** p<0.05, * p<0.1

The results found in Table 1.2 generally concur with those found by a previous study. Both suggest, at most, small positive effects of CSR when treatment is defined by pre-CSR district-level class-size averages and potentially negative effects for estimates based on school-level treatment status. Importantly, the difference between the estimated achievement effects and the Tennessee STAR benchmark allows for the possibility that the average quality of the newly hired teachers did change with CSR and that this change may have affected the performance of the policy.

Footnote 9: Direct replication and extension of the previous paper's approach finds that the estimated CSR effects can be sensitive to the included covariates and the particular model used. In some cases, point estimates are found on the order of 0.06 standard deviations, and in many cases the standard errors are too large to rule out effects close to the one-eighth of a standard deviation benchmark from Tennessee STAR. The value-added modeling approach used here is much less sensitive to the choice of included covariates, but does limit the data used in estimation by requiring a lagged test score for each student.

Tables 1.3, 1.4, and 1.5 display the estimates based on the general specifications found in equations (6.1) and (6.2). Of particular interest are the estimated coefficients on the teacher entry cohort dummy variables. These estimates reflect the conditional mean performance of students in classrooms taught by teachers entering the data in each year, relative to those students in classrooms taught by teachers already in the State X public elementary school system at the beginning of the panel. The policy-relevant comparison is between pre-CSR and post-CSR cohorts. Table 1.3 presents the baseline estimates of the cohort effects. Again we use the convention of shading district CSR enforcement years in light gray and school CSR enforcement years in dark gray. For reference, the initial cohort size is also presented. All specifications are estimated using developmental scale test scores that have been standardized within grade and year. The results show that students with teachers who entered during CSR perform worse on average. For instance, students of teachers from the 2006-2007 cohort are estimated to score, on average, over one-fiftieth of a standard deviation (0.0319 - 0.00929 = 0.0226) worse than students with a 2002-2003 cohort teacher. Importantly, the estimated cohort effects for these two cohorts are strongly statistically different, with a p-value of 0.0000.
Overall, the estimated post-CSR cohort effects range from 0.0069 to 0.0285 standard deviations lower than the two pre-CSR cohorts. The magnitude of the differences seen in Table 1.3 falls short of what would be needed to explain why CSR policies do not produce the gains expected based on experimental results. Recall that a simple extrapolation of the STAR results would place the expected achievement gain at roughly one-eighth of a standard deviation.

Footnote 10: See Appendix Table 1.10 for other estimates from these regressions.

Footnote 11: There is no agreement on the preferred choice between scale scores and grade-year standardized scale scores. Here, the main conclusions that can be drawn do not differ with this choice. See Reardon & Galindo (2009) for a brief discussion of the two approaches.

Table 1.3: Pooled OLS Cohort Estimates with Cohort Indicators
Equation: Cohort Specification (6.1)

Entry Cohort     Estimate         N
2001-2002        -0.00345         2824
                 (0.00331)
2002-2003        -0.00929***      2856
                 (0.00247)
2003-2004        -0.0162***       3378
                 (0.00465)
2004-2005        -0.0221***       4037
                 (0.00455)
2005-2006        -0.0304***       4247
                 (0.00237)
2006-2007        -0.0319***       4492
                 (0.00450)
2007-2008        -0.0265***       3390
                 (0.00469)
Observations     2,752,060
R-squared        0.653
District cluster robust standard errors in parentheses: *** p<0.01, ** p<0.05, * p<0.1
Note: Models include a teacher experience cubic, a class size proxy, student demographic variables, and school, grade, and year dummies.

Differences in cohort performance of, at most, one-thirty-fifth of a standard deviation do not compare to the difference between the estimated CSR effects and the Tennessee STAR benchmark.
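The dependent variable throughout is the developmental scale score standardized within grade and year. A minimal sketch of that transformation, with toy numbers and hypothetical array names, is:

```python
import numpy as np

# Hypothetical scale scores for students observed in (grade, year) cells.
# Standardizing within grade-year puts each cell on a mean-zero, unit-variance
# scale, so cohorts are compared against their own grade-year peers.
grade = np.array([4, 4, 4, 5, 5, 5])
year = np.array([2004, 2004, 2004, 2004, 2004, 2004])
scale_score = np.array([310.0, 325.0, 340.0, 402.0, 410.0, 418.0])

z = np.empty_like(scale_score)
for g in np.unique(grade):
    for t in np.unique(year):
        cell = (grade == g) & (year == t)
        if cell.any():
            z[cell] = (scale_score[cell] - scale_score[cell].mean()) / scale_score[cell].std()

print(z)
```

Each grade-year cell ends up with mean zero and unit variance, so a coefficient of 0.03 reads as 3% of a within-grade-year standard deviation.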
Table 1.4: Pooled OLS Cohort Estimates with Cohort-by-Year Indicators
Equation: Cohort-by-Year Specification (6.2)

Entry Cohort 2001-2002:
  2001-2002: -0.0411*** (0.00512), N = 2824
  2002-2003: -0.0171*** (0.00509), N = 2028
  2003-2004:  0.00243   (0.00617), N = 1649
  2004-2005:  0.00208   (0.00630), N = 1547
  2005-2006:  0.00932   (0.00564), N = 1374
  2006-2007:  0.00929   (0.00644), N = 1258
  2007-2008:  0.0106*   (0.00581), N = 1115

Entry Cohort 2002-2003:
  2002-2003: -0.0453*** (0.00486), N = 2856
  2003-2004: -0.0116*** (0.00403), N = 1990
  2004-2005: -0.0120*   (0.00634), N = 1705
  2005-2006:  0.000914  (0.00636), N = 1534
  2006-2007:  0.0108    (0.00807), N = 1349
  2007-2008: -0.00496   (0.00553), N = 1225

Entry Cohort 2003-2004:
  2003-2004: -0.0486*** (0.00629), N = 3378
  2004-2005: -0.0308*** (0.00557), N = 2422
  2005-2006: -0.00150   (0.00664), N = 2076
  2006-2007: -0.00268   (0.00980), N = 1902
  2007-2008: -0.00816   (0.00642), N = 1706

Entry Cohort 2004-2005:
  2004-2005: -0.0688*** (0.00726), N = 4037
  2005-2006: -0.0249*** (0.00561), N = 2904
  2006-2007: -0.00860   (0.00583), N = 2457
  2007-2008:  0.000120  (0.00540), N = 2091

Entry Cohort 2005-2006:
  2005-2006: -0.0636*** (0.00485), N = 4247
  2006-2007: -0.0295*** (0.00381), N = 2995
  2007-2008: -0.0198*** (0.00453), N = 2489

Entry Cohort 2006-2007:
  2006-2007: -0.0674*** (0.00404), N = 4492
  2007-2008: -0.0261*** (0.00563), N = 3080

Entry Cohort 2007-2008:
  2007-2008: -0.0581*** (0.00521), N = 3390

Observations     2,752,060
R-squared        0.653
District cluster robust standard errors in parentheses: *** p<0.01, ** p<0.05, * p<0.1
Note: Models include a teacher experience cubic, a class size proxy, student demographic variables, and school, grade, and year dummies. Each entry gives a cohort's estimated effect in the indicated year.

Table 1.4 displays the results from equation (6.2), which allows for separate cohort-by-year effects. Recall from the previous section that the motivation for these estimates was to explicitly allow the average performance of a cohort to change as its composition changes. While the initial productivity of the earlier cohorts is lower than the previous estimates would suggest, the relative performance of cohorts in their first years is relatively unchanged from the previous estimates, with post-CSR cohorts having average achievement 0.0033 to 0.0277 standard deviations below the pre-CSR cohorts.
The point estimates suggest the relative performance gap between pre-CSR and post-CSR cohorts drops to between 0.0078 and 0.0182 standard deviations in each cohort's second year. Not all second-year post-CSR cohort effects are statistically different from the pre-CSR estimates at traditional significance levels. For instance, the p-value is 0.3030 for the t-test of the null that the second-year effects for the 2001-2002 and 2004-2005 cohorts are the same. On the other hand, the null that the 2002-2003 and 2003-2004 cohorts were equally effective in their second years can be rejected at common significance levels (p-value = 0.0039). Also note that pre-CSR cohorts become comparable to the baseline teachers after three or four years, with year-specific cohort effects statistically indistinguishable from zero. The two post-CSR cohorts observed for at least four years, 2003-2004 and 2004-2005, also appear to level off to be roughly comparable to the baseline after four years. This result suggests that the potential long-run CSR hiring effects may be even smaller than those initially observed. However, the largest post-CSR hiring cohorts are not observed long enough to make a complete comparison across all cohorts. In particular, the estimated third-year effect for the 2005-2006 cohort is still statistically different from zero, at nearly one-fiftieth of a standard deviation. It is important to note here that these estimates come from a specification that includes a cubic term in teacher experience. This implies that much of the observed improvement for cohorts is being driven by compositional changes within the cohort, rather than human capital accumulation that is common to all cohorts. These results suggest that not only may schools be initially hiring lower value-added teachers due to the CSR-induced demand increase, but schools may also be retaining more low value-added teachers longer in order to meet CSR requirements.
State X is notable for dismissing teachers within their first three years for poor performance at a much higher rate than the nation as a whole, with the state's ninety-seven-day probationary rule cited as a possible explanation. However, these results suggest that the short-run CSR demand increase may have weakened this mechanism for ensuring quality instruction. Both phenomena, the hiring and retention of lower value-added teachers, fit nicely within the framework of a simple search model of teacher hiring in which teachers are effectively viewed as experience goods (see Rockoff & Staiger 2010). However, it appears that the long-run achievement effect of these changes may be relatively small. A comparison across cohorts within the same year lends some insight into the role other inputs into the education process may have had in affecting student performance over this time. In particular, the effect of unmeasured changes in classroom inputs directly complementary to teaching may be included in the cohort effect estimates. Recall that there is some anecdotal evidence that State X's CSR program was not fully funded, raising the possibility that a reallocation of other inputs may have coincided with the hiring increase studied here. However, since earlier cohorts likely face similar resources within schools as later cohorts in a given year, the fact that the earlier cohorts perform noticeably better in each year suggests that it is not changes in these other complementary inputs driving the results. For instance, in the 2004-2005 school year the 2002-2003 cohort has an estimated cohort effect over one-twentieth (0.0688 - 0.0120 = 0.0568) of a standard deviation better than the 2004-2005 cohort. This is a practically and statistically significant difference in performance that is likely not due to differences in other classroom-level inputs.
Table 1.5 shows the estimates from specifications in which the entry cohorts are further divided based on the amount of CSR pressure the school was under. This grouping is done based on both the district averages prior to CSR and the school averages prior to the change to school-level enforcement. Those schools already below the maximums are included in the "None" group, while the remaining schools are divided into quartiles based on average class size. Looking first at the estimates based on the district groupings, the estimates show that across the board all schools saw a decline in the performance of new teachers over the implementation of CSR. Importantly, it is not the case that the estimated effects are monotonically increasing in magnitude with increases in CSR pressure. Taken together, it appears that CSR-induced hiring did not just impact the quality of new teachers for schools originally above the new class-size maximums. Rather, it suggests that the untreated schools were still forced to move along the effective teacher supply curve as candidates they might otherwise have hired to fill openings created by turnover and enrollment growth were hired by nearby schools facing CSR pressure. Similarly, the results for the school-level disaggregation do not consistently tell a story that CSR lowered incoming teacher quality disproportionately for treated schools. One exception, however, is in the year before school-level enforcement for those schools farthest from reaching the new maximums (Q4). These schools, which were likely pre-empting the switch to school-level enforcement in the following year, had a hiring cohort estimated to be 0.0617 test score standard deviations worse than the baseline teachers, while the other schools saw cohorts between 0.0219 and 0.0326 standard deviations worse.
Table 1.5: Estimates of New Cohort Effects by CSR Intensity

District Enforcement
Entry Cohort   None          Q1            Q2            Q3            Q4
2001-2002      -0.00500      -0.00394       0.000417     -0.0165*       0.00531***
               (0.00760)     (0.00503)     (0.00912)     (0.00969)     (0.00115)
2002-2003      -0.0151***     0.00384      -0.00144      -0.0188***    -0.0197***
               (0.00545)     (0.00784)     (0.00529)     (0.00417)     (0.00286)
2003-2004      -0.0251***    -0.0199       -0.0164***    -0.0171*       0.00444**
               (0.00622)     (0.0121)      (0.00445)     (0.00869)     (0.00194)
2004-2005      -0.0227***    -0.0292***    -0.0167**     -0.0375***    -0.00591***
               (0.00519)     (0.00610)     (0.00648)     (0.0102)      (0.00156)
2005-2006      -0.0320***    -0.0240***    -0.0338***    -0.0276***    -0.0336***
               (0.00501)     (0.00726)     (0.00673)     (0.00404)     (0.00281)
2006-2007      -0.0388***    -0.0176**     -0.0222***    -0.0668***    -0.0229***
               (0.00689)     (0.00816)     (0.00778)     (0.00781)     (0.00416)
2007-2008      -0.0357***    -0.0391***    -0.0251***    -0.0163       -0.00780***
               (0.00834)     (0.00598)     (0.00690)     (0.0129)      (0.00244)
Observations   2,754,022
R-squared      0.653

School Enforcement
Entry Cohort   None          Q1            Q2            Q3            Q4
2001-2002      -0.00879*     -0.0117       -0.0159        0.00550       0.0488***
               (0.00447)     (0.0182)      (0.0101)      (0.0115)      (0.00555)
2002-2003      -0.00744**    -0.0197*      -0.0201*      -0.00728      -0.0126
               (0.00342)     (0.0113)      (0.0113)      (0.0147)      (0.00778)
2003-2004      -0.0226***    -0.0147       -0.00517       0.00423       0.0163*
               (0.00430)     (0.0104)      (0.0117)      (0.0162)      (0.00876)
2004-2005      -0.0225***    -0.00972      -0.0378*      -0.0178       -0.0206*
               (0.00449)     (0.0114)      (0.0223)      (0.0132)      (0.0118)
2005-2006      -0.0278***    -0.0326**     -0.0263**     -0.0219**     -0.0617***
               (0.00355)     (0.0135)      (0.0117)      (0.00849)     (0.00387)
2006-2007      -0.0306***    -0.0329***    -0.0504***    -0.0195*      -0.0376***
               (0.00506)     (0.00659)     (0.0113)      (0.0100)      (0.00803)
2007-2008      -0.0308***    -0.0314*      -0.0204        0.00395      -0.0160**
               (0.00487)     (0.0182)      (0.0164)      (0.0146)      (0.00773)
Observations   2,752,060
R-squared      0.653
Standard errors clustered at the district level in parentheses: *** p<0.01, ** p<0.05, * p<0.1

The above estimates identify changes in mean cohort performance.
To allow for a comparison of the entire distribution of teacher quality over time, individual teacher value-added is also estimated. To explore the relative performance of CSR cohorts, teachers are given a percentile rank based on their estimated value-added relative to all teachers in the sample. Figure 1.3 displays histograms of the distribution of teacher percentile ranks for each entry cohort. The solid line on each graph represents a uniform distribution of percentile ranks (i.e., the distribution for a cohort if a given teacher from that cohort were equally likely to be ranked anywhere in the overall distribution). Prior to CSR, the percentile rank distribution of the entry cohorts is roughly uniform. Over the implementation of CSR, starting with the 2003-2004 entry cohort, there is a noticeable increase in the probability that a given teacher will be ranked below the twentieth percentile. To the extent that the included experience profile does a poor job of capturing the human capital accumulation of these later cohorts, the perceived pattern may better represent a short-run effect. It is important to note that the value-added estimates for later cohorts will tend to be noisier as well. However, if differences in the percentile rank distributions across cohorts were simply an artifact of increased noise, more outliers would be expected at both ends of the distribution, resulting in a U-shaped distribution. That we only see more teachers at the low end of the percentile rank distribution for the later cohorts suggests that the pattern is not due purely to noise. Ultimately, this analysis would be aided by access to additional years of data to better address differences in the precision of the estimates across cohorts. Regardless, Figure 1.3 provides additional suggestive evidence that teachers hired post-CSR were more likely to be low value-added teachers.
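As a sketch of this percentile-rank exercise (the value-added draws and cohort labels below are simulated for illustration, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical estimated value-added for 5,000 teachers across seven entry cohorts
va = rng.normal(0.0, 0.15, 5000)
cohort = rng.integers(2001, 2008, 5000)

# Percentile rank of each teacher relative to the full sample (0-100 scale)
rank = va.argsort().argsort()          # 0..n-1; continuous draws, so no ties
pct = 100.0 * (rank + 0.5) / va.size

# Under a uniform distribution, about 20% of any cohort should sit below the
# twentieth percentile; a noticeably larger share signals a weaker cohort
share_low = np.mean(pct[cohort == 2004] < 20.0)
```

Because value-added here is drawn independently of cohort, `share_low` should be close to 0.2; in the paper's data, post-CSR cohorts show an excess mass below this cutoff.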
Figure 1.3: Percentile Rank Distributions: Entry Cohorts
[Histograms of teacher percentile ranks (0-100) for each entry cohort, 2001-2002 through 2007-2008, each plotted against the uniform density.]

The comparison among the estimated entry cohort effects does not fully capture the contribution of these teachers to average statewide achievement. In particular, this comparison misses the fact that not all students in CSR years are taught by teachers hired in post-CSR cohorts and that average teacher experience in the state dropped in post-CSR years. Table 1.6 shows the estimated contribution to average achievement in the state of the cohort composition, $f(\overline{COHORT}_t)$, and teacher experience, $g(\overline{EXP}_t) = \overline{EXP}_t\hat{\beta}_1 + \overline{EXP^2}_t\hat{\beta}_2 + \overline{EXP^3}_t\hat{\beta}_3$, using the estimates from Table 1.3. Both the total contribution and the separate contribution of each component are presented. Finally, the change in the contribution to statewide average achievement since 2001 is also shown. While the contribution attributable to these teacher characteristics falls over the introduction of CSR, even in the worst year this represents only a difference of 0.0172 standard deviations. This difference is driven more by the relative performance of the cohorts than by the drop in teacher experience.[12]

[12] Recall that the experience profile can be thought to capture the effects of differential attrition and within-school sorting of students to more experienced teachers, in addition to human capital accumulation.
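To make the accounting concrete, a minimal sketch of the calculation follows. Only the experience coefficients are taken from the pooled OLS results (Table 1.10, column 1); the cohort shares, cohort effects, and experience moments are invented for illustration:

```python
import numpy as np

# Hypothetical inputs:
#   cohort_share[t, c]: share of students taught by entry cohort c in year t
#   cohort_effect[c]:   estimated cohort effect (test-score SD units)
cohort_share = np.array([[0.6, 0.4],
                         [0.5, 0.5]])
cohort_effect = np.array([0.0, -0.02])

# Cubic-in-experience coefficients from Table 1.10, column 1
beta = np.array([0.00731, -0.000341, 4.23e-06])

def contribution(shares, effects, mean_exp, mean_exp2, mean_exp3, beta):
    """Cohort-composition piece plus experience piece, in test-score SDs."""
    cohort_part = shares @ effects
    exp_part = beta[0] * mean_exp + beta[1] * mean_exp2 + beta[2] * mean_exp3
    return cohort_part + exp_part

# Invented statewide experience moments for two years
total_2001 = contribution(cohort_share[0], cohort_effect, 10.8, 140.0, 2100.0, beta)
total_2007 = contribution(cohort_share[1], cohort_effect, 9.9, 120.0, 1700.0, beta)
change = total_2007 - total_2001     # analogue of the "Change from 2001" column
```

With these invented inputs, the shift toward the weaker cohort and the drop in average experience both pull the contribution down, mirroring the decomposition reported in Table 1.6.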
Table 1.6: Estimated Contribution of Teacher Cohort Composition and Experience to Average Achievement

                  Achievement Contribution                Change from 2001
Year          f(COHORT)    g(EXP)       Total        f(COHORT)    g(EXP)       Total
2001-2002    -0.0068***    0.0272***    0.0204***       --           --          --
             (0.0010)     (0.0029)     (0.0030)
2002-2003    -0.0074***    0.0270***    0.0195***    -0.0006      -0.0002***  -0.0009**
             (0.0010)     (0.0030)     (0.0030)      (0.0004)     (0.0001)    (0.0004)
2003-2004    -0.0090***    0.0275***    0.0185***    -0.0022***    0.0002***  -0.0019***
             (0.0015)     (0.0029)     (0.0031)      (0.0008)     (0.0000)    (0.0007)
2004-2005    -0.0115***    0.0271***    0.0156***    -0.0047***   -0.0001***  -0.0048***
             (0.0020)     (0.0027)     (0.0030)      (0.0012)     (0.0000)    (0.0012)
2005-2006    -0.0149***    0.0267***    0.0118***    -0.0081***   -0.0005***  -0.0086***
             (0.0017)     (0.0028)     (0.0030)      (0.0010)     (0.0001)    (0.0010)
2006-2007    -0.0182***    0.0261***    0.0079***    -0.0114***   -0.0011***  -0.0125***
             (0.0021)     (0.0028)     (0.0029)      (0.0013)     (0.0002)    (0.0013)
2007-2008    -0.0233***    0.0265***    0.0032       -0.0165***   -0.0007***  -0.0172***
             (0.0028)     (0.0032)     (0.0030)      (0.0020)     (0.0002)    (0.0019)
Standard errors clustered at the district level in parentheses: *** p<0.01, ** p<0.05, * p<0.1

To directly assess the role of these same changes in the estimated CSR policy effects, estimates are used from a modified version of (6.1) in which a CSR treatment dummy is interacted with all included regressors. Table 1.7 displays the evolution of the total contribution (cohort composition plus experience) of teachers to average performance separately for CSR and non-CSR schools, based on pre-CSR district class sizes. Table 1.7 also shows the difference in these changes between CSR and non-CSR schools. Column (6) is of particular interest, as it relates to the type of comparison previously used to estimate CSR policy effects. Both CSR and non-CSR schools experience a drop in the teachers' contribution to average achievement.
Interestingly, the CSR schools saw a slightly smaller drop, 0.0076 test score standard deviations smaller by 2007-2008, than those schools for which CSR was not binding at introduction. This estimate is of the opposite sign needed to explain the finding of no achievement gain from CSR. Clearly, the change in average achievement attributable to the makeup of the teaching stock falls well short of explaining the lack of achievement gains.

Table 1.7: Estimated Contribution of Teacher Composition to Average Achievement: CSR vs. Non-CSR Schools

                Total Achievement Contribution           Change from 2001
Year          CSR         No CSR      Difference     CSR          No CSR       Difference
2001-2002     0.0204***   0.0214***   -0.0010          --           --            --
             (0.0034)    (0.0057)     (0.0066)
2002-2003     0.0195***   0.0204***   -0.0009       -0.0008      -0.0010       0.0001
             (0.0033)    (0.0061)     (0.0069)      (0.0005)     (0.0008)     (0.0010)
2003-2004     0.0193***   0.0169**     0.0024       -0.0010      -0.0045***    0.0035**
             (0.0031)    (0.0065)     (0.0072)      (0.0009)     (0.0010)     (0.0013)
2004-2005     0.0161***   0.0150**     0.0011       -0.0043***   -0.0064***    0.0021
             (0.0031)    (0.0060)     (0.0067)      (0.0015)     (0.0017)     (0.0023)
2005-2006     0.0133***   0.0082       0.0051       -0.0071***   -0.0132***    0.0061***
             (0.0033)    (0.0058)     (0.0066)      (0.0012)     (0.0011)     (0.0016)
2006-2007     0.0097***   0.0035       0.0062       -0.0107***   -0.0179***    0.0072***
             (0.0030)    (0.0064)     (0.0071)      (0.0015)     (0.0014)     (0.0021)
2007-2008     0.0084***   0.0018       0.0066       -0.0120***   -0.0196***    0.0076***
             (0.0031)    (0.0076)     (0.0082)      (0.0017)     (0.00223)    (0.0028)
Standard errors clustered at the district level in parentheses: *** p<0.01, ** p<0.05, * p<0.1

8. Sensitivity Analysis

In addition to the OLS estimates of the lag score equation, the cohort effects were also estimated by the alternative VAMs discussed in Appendix B.
Specifically, effects were estimated by a 2SLS version of the Arellano & Bond (1991) dynamic GMM estimator (FDIV) applied to the lag score equation used above, and by OLS or fixed effects (FE) on a gain score equation obtained by constraining λ=1 in equation (6.1) and subtracting prior achievement from both sides. Table 1.8 displays the results from each estimation method. Note that the sample size decreases substantially for the FDIV estimator, as the requirement of a twice-lagged score leaves only students with three consecutive test scores in the estimation sample. Comparing columns (1) and (2) of Table 1.8 shows that estimating cohort effects by OLS on the lag score versus the gain score equation makes little difference. Comparing the preferred estimates in column (1) to the others reveals more variation; however, the main conclusion is invariant to the estimator chosen. Recall from above that students with teachers in the 2006-2007 cohort are estimated to score, on average, one-fiftieth of a standard deviation worse than those with teachers from the 2002-2003 cohort. Looking across the estimates in Table 1.8, all estimators suggest similar magnitudes for this effect, with the largest being closer to 0.03 when estimated by FE.
Table 1.8: Cohort Effect Estimates from Alternative VAM Estimators

                  (1)           (2)           (3)           (4)
Model             Lag           Gain          Gain          Lag
Estimator         OLS           OLS           FE            FDIV
2001-2002        -0.00345      -0.00269       0.00275       0.00188
                 (0.00331)     (0.00333)     (0.00788)     (0.00388)
2002-2003        -0.00929***   -0.00846***   -0.0142**     -0.0154***
                 (0.00247)     (0.00260)     (0.00650)     (0.00348)
2003-2004        -0.0162***    -0.0164***    -0.0273***    -0.0248***
                 (0.00465)     (0.00438)     (0.00501)     (0.00257)
2004-2005        -0.0221***    -0.0220***    -0.0365***    -0.0305***
                 (0.00455)     (0.00448)     (0.00981)     (0.00496)
2005-2006        -0.0304***    -0.0304***    -0.0442***    -0.0387***
                 (0.00237)     (0.00253)     (0.00661)     (0.00415)
2006-2007        -0.0319***    -0.0313***    -0.0434***    -0.0395***
                 (0.00450)     (0.00464)     (0.0100)      (0.00515)
2007-2008        -0.0265***    -0.0254***    -0.0166*      -0.0245***
                 (0.00472)     (0.00460)     (0.00918)     (0.00489)
Observations      2,752,060     2,752,060     2,752,060     1,329,658
R-squared         0.653         0.034         0.399         --

Standard errors clustered at the district level (Columns (1)-(3)) or the student level (Column (4)) in parentheses. Grade-year standardized test scores are the dependent variable, and the cubic in experience is included in all regressions: *** p<0.01, ** p<0.05, * p<0.1

The robustness of the main result to alternative value-added approaches provides assurance that the relationship between cohort quality and CSR hiring requirements is not driven purely by biases in the chosen estimator. Across all estimators, the effect of having a post-CSR cohort teacher is quite small, suggesting that teacher quality, as measured by value-added, did not play a major role in the observed performance of CSR in State X. Given the ever-present concerns over the role unobserved student ability may play in estimating education production functions, it may be surprising that the methods used to address unobserved heterogeneity (FE and FDIV) yield similar results to those that do not.
As alluded to earlier, unobserved heterogeneity threatens the consistency of the estimates if schools were using some static unobserved characteristic of students to determine whether a student would be taught by a teacher hired in a particular year. Given the results in Table 1.8, it seems reasonable to conclude, particularly when controlling for teacher experience, that schools were not engaging in this sort of nonrandom assignment. While it may certainly be the case that student achievement is affected by a student's innate ability and that this ability is used by schools in making some decisions, it does not appear to be used in a way that would lead to inconsistencies in the main estimates.

9. Conclusion

The results presented above provide little support for the conclusion that a drop in the quality of newly hired teachers explains the lack of noticeable achievement gains from CSR in State X. Despite large increases in the number of teachers, the evidence suggests that newly hired teachers account for only slight decreases in achievement during the implementation of CSR. The overall drop in achievement from the 2001-2002 to the 2007-2008 school year attributable to changes in the average quality, experience, and cohort composition of fourth through sixth grade teachers is estimated to be only 0.0172 test score standard deviations. Furthermore, the results suggest that this decrease in quality was experienced by treated and untreated schools alike. These treatment spillovers imply that the disappointing CSR effects found in quasi-experimental research cannot be explained by differential changes in new teacher quality. Given that teacher quality does not play a large role in the failure of statewide CSR programs to achieve expected achievement gains, exploring alternative mechanisms is an important next step.
One possibility is that other input levels may have changed, especially in cases in which CSR was implemented without full funding, as was the case in State X. As noted above, however, differences in resources directly used by teachers after CSR may also have limited scope for explaining CSR performance. Understanding the mechanisms at play will help to determine whether popular CSR policies can be designed to promote achievement gains. More generally, the results of this paper suggest that while large short-run increases in teacher demand may lead to modest declines in the value-added of newly hired teachers, these declines may not substantially affect long-run achievement. This conclusion should be interpreted with caution, as the findings reflect the experience of a single state based on teachers in grades four through six. In other states or grades, the quality of incoming teachers may fall more dramatically in response to the introduction of CSR policies.

APPENDICES

APPENDIX A

ADDITIONAL TABLES

Table 1.9: Descriptive Statistics

Variable                          Mean       Std. Dev.
Test Score                        1625.46    246.90
Asian                             0.02       0.14
Black                             0.23       0.42
Hispanic                          0.23       0.42
Other Race                        0.03       0.18
Female                            0.50       0.50
Disabled                          0.12       0.33
Free or Reduced Lunch             0.50       0.50
Limited English                   0.04       0.20
Age                               10.67      1.00
Foreign Born                      0.09       0.28
Days Present                      166.75     21.04
Days Absent                       7.72       7.70
Lagged Peer Score                 1515.01    169.72
Class-size G4                     20.86      8.70
Class-size G5                     22.49      11.07
Class-size G6                     82.46      35.32
Teacher Experience                10.77      10.35
District CSR
  G4-G8 Average Class Size        24.27      2.86
  Below Max                       0.26       0.44
  Q1                              0.20       0.40
  Q2                              0.23       0.42
  Q3                              0.17       0.37
  Q4                              0.14       0.35
School CSR
  G4-G8 Average Class Size        20.83      3.15
  Below Max                       0.71       0.45
  Q1                              0.07       0.26
  Q2                              0.07       0.26
  Q3                              0.07       0.26
  Q4                              0.07       0.26
Entry Cohorts
  2001-2002                       0.10       0.30
  2002-2003                       0.09       0.29
  2003-2004                       0.10       0.30
  2004-2005                       0.11       0.31
  2005-2006                       0.10       0.30
  2006-2007                       0.09       0.29
  2007-2008                       0.07       0.25

Table 1.10: Estimates from Pooled OLS Regressions

                          Cohort            Cohort-by-Year
Equation                  (6.1)             (6.2)
Prior Math Score          0.706***          0.706***
                          (0.00564)         (0.00564)
Asian                     0.0947***         0.0947***
                          (0.00515)         (0.00511)
Black                     -0.137***         -0.137***
                          (0.00347)         (0.00347)
Hispanic                  -0.0273***        -0.0273***
                          (0.00242)         (0.00244)
Other Race                -0.0239***        -0.0240***
                          (0.00229)         (0.00231)
Female                    -0.0160***        -0.0160***
                          (0.00148)         (0.00148)
Disabled                  -0.185***         -0.185***
                          (0.0124)          (0.0125)
Free or Reduced Lunch     -0.0585***        -0.0584***
                          (0.00141)         (0.00140)
Limited English           -0.0738***        -0.0742***
                          (0.01000)         (0.0100)
Age                       -0.0555***        -0.0554***
                          (0.00322)         (0.00322)
Foreign Born              0.0706***         0.0706***
                          (0.00354)         (0.00356)
Days Present              0.00109***        0.00108***
                          (3.58e-05)        (3.56e-05)
Days Absent               -0.00500***       -0.00500***
                          (0.000293)        (0.000293)
Experience                0.00731***        0.00502***
                          (0.000890)        (0.000699)
Experience Sq             -0.000341***      -0.000231***
                          (4.72e-05)        (3.40e-05)
Experience Cu             4.23e-06***       2.76e-06***
                          (6.92e-07)        (4.39e-07)
Lagged Peer Score         0.0799***         0.0789***
                          (0.0131)          (0.0131)
Class Size                8.97e-05          5.00e-06
                          (0.000252)        (0.000258)
Class Size*G5             -7.95e-05         -2.58e-05
                          (0.000412)        (0.000429)
Class Size*G6             -0.000535         -0.000540*
                          (0.000328)        (0.000320)
Observations              2,752,060         2,752,060
R-squared                 0.653             0.653

Robust standard errors in parentheses: *** p<0.01, ** p<0.05, * p<0.1
APPENDIX B

MEASURING TEACHER QUALITY

The purpose of value-added models (VAMs) is to separate the portion of student growth attributable to particular teachers from the many other possible sources of growth. Viewed in this light, the challenges of VAM estimation are those faced in identifying causal relationships with panel data more generally. VAM estimation has proven difficult in non-experimental settings, and there is no consensus on the best model of student achievement or the best approach to estimating the portion attributable to teachers (McCaffrey et al. 2004; Kane & Staiger 2008; Rothstein 2009, 2010; Koedel & Betts 2011). Much of this difficulty stems from the nonrandom assignment of students to teachers both within and across schools. The following discussion draws heavily from prior work on the assumptions applied to the education production function underlying VAM estimation (Todd & Wolpin 2003; Harris, Sass, & Semykina 2011; Guarino, Reckase, & Wooldridge 2011). This discussion should be thought of as a guide for considering the issues that arise in VAM estimation, rather than as a more formal structural model of education production to be estimated.

The starting point for the value-added framework is a very general model that specifies a student's achievement in a particular year as a function of both current and past inputs to the education process and the student's unobserved ability:

(B.1) $A_{it} = f_t(X_{it}, \ldots, X_{i0}, E_{it}, \ldots, E_{i0}, c_i, u_{it})$

where $A_{it}$ is the achievement of student i in year t, $X_{it}$ is a vector of family and student characteristics for student i in year t, $E_{it}$ is a vector of education inputs for student i in year t, $c_i$ is unobserved student ability, and $u_{it}$ is an idiosyncratic shock to student i's achievement in year t. Here, the vector $E_{it}$ can be thought to include indicators for individual teachers or groups of teachers.
Given computational and data constraints, several assumptions are typically made to yield a tractable estimating equation. First, it is assumed that $f_t$ is linear and constant across years:

(B.2) $A_{it} = \eta_t + X_{it}\beta_0 + \cdots + X_{i0}\beta_t + E_{it}\alpha_0 + \cdots + E_{i0}\alpha_t + \gamma_t c_i + u_{it}$

Typically, researchers do not have complete data on all prior inputs. To address the lack of prior inputs, it is common to add and subtract $\lambda A_{it-1}$ on the right-hand side of (B.2). Assuming that the effect of the inputs decays at a geometric rate equal to $\lambda$ and that $\gamma_t - \lambda\gamma_{t-1}$ is a constant (set equal to one without loss of generality) allows us to eliminate the lagged inputs and rewrite equation (B.2) as a function of current inputs and lagged achievement only:

(B.3) $A_{it} = \eta_t + \lambda A_{it-1} + X_{it}\beta_0 + E_{it}\alpha_0 + c_i + e_{it}$, where $e_{it} = u_{it} - \lambda u_{it-1}$

Up to now, the assumptions made on the original model in equation (B.1) have been primarily data-driven. At this point, there is some choice over further assumptions imposed on the model. Under the assumptions that $e_{it}$ is serially uncorrelated and that $c_i$ is uncorrelated with the included inputs (or equal to zero),[13] equation (B.3), referred to as the lag score equation from here on, could reasonably be estimated by OLS.[14] While the no-serial-correlation assumption is by no means trivial, the assumption that $c_i$ is uncorrelated with the inputs is perhaps the most questionable. It seems possible, given nonrandom sorting of students and teachers into schools, as well as nonrandom assignment of students to teachers within schools, that unobserved student ability may be correlated with teacher assignment. Despite these concerns, there is evidence that this approach may be preferred, and so it will serve as the basis for the main analysis in this paper. As a sensitivity check, other value-added models and estimators are also considered.

[13] This condition would hold if λ≈1 and $\eta_t \approx \eta_{t-1}$.
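A minimal simulation, with all parameter values invented, illustrates OLS estimation of the lag score equation (B.3) in the benign case where $c_i$ is absent and the errors are serially uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated panel (all quantities assumed): two teachers with true effects
# 0.0 and -0.02, and persistence lambda = 0.7 as in equation (B.3)
n = 2000
lam_true = 0.7
teacher_fx = np.array([0.0, -0.02])
A_prev = rng.normal(0.0, 1.0, n)            # lagged achievement
teacher = rng.integers(0, 2, n)             # random assignment to teachers
A_curr = lam_true * A_prev + teacher_fx[teacher] + rng.normal(0.0, 0.1, n)

# OLS on the lag score equation: regress A_t on A_{t-1} and teacher dummies
X = np.column_stack([A_prev, teacher == 0, teacher == 1]).astype(float)
coef, *_ = np.linalg.lstsq(X, A_curr, rcond=None)
lam_hat = coef[0]                           # estimated persistence
cohort_gap = coef[2] - coef[1]              # teacher 2 relative to teacher 1
```

With random assignment, both the persistence parameter and the relative teacher effect are recovered; the concerns discussed in the text arise precisely when assignment is correlated with $c_i$.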
Briefly, it is also common to assume that λ=1 and to subtract $A_{it-1}$ from both sides of equation (B.3), yielding a gain score model of student achievement:

(B.4) $A_{it} - A_{it-1} = \eta_t + X_{it}\beta_0 + E_{it}\alpha_0 + c_i + \epsilon_{it}$, where $\epsilon_{it} = u_{it} - u_{it-1}$

Equation (B.4) could then be estimated by OLS or fixed effects (FE).[15] OLS estimation of (B.4) relaxes the need for no serial correlation in the errors at the cost of assuming that prior achievement persists completely in determining current achievement. If λ≠1, this approach effectively introduces an additional term, $(\lambda-1)A_{it-1}$, on the right-hand side of equation (B.4), which will likely lead to omitted variables bias. Importantly, OLS on (B.4) does not control for the unobserved student heterogeneity in any way. FE estimation is particularly appealing, as it relaxes the assumption that $c_i$ is uncorrelated with the inputs. However, FE requires the additional assumption that $X_{it}$ and $E_{it}$ are strictly exogenous conditional on $c_i$ in (B.4) for consistent estimation. The strict exogeneity assumption essentially implies that the inputs in time t are uncorrelated with the unobserved error terms in every time period.[16]

[14] Note that prior achievement is also a function of the unobserved student heterogeneity term, and is therefore endogenous in (B.3) when $c_i$ is not zero and is ignored. This certainly leads to inconsistent estimates of λ, but the extent to which this bias is propagated in the estimated teacher effects is unclear.

[15] In the panel data context, the gain score equation is also commonly estimated using an Empirical Bayes shrinkage estimator (Kane & Staiger 2008). Note that the shrinkage factor is determined by the number of observations per group and tends toward one as the group size becomes large. Since in my preferred specification the group size is quite large and similar across all groups, the Empirical Bayes estimator will yield results very similar to OLS.
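A small simulation (all names and magnitudes invented) illustrates why the within transformation matters for the gain score model: pooled OLS is biased when ability $c_i$ drives teacher assignment, while FE sweeps $c_i$ out:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated gains for students observed in two years (all values assumed):
# gain = teacher-B effect + c_i + noise, with higher-ability students more
# likely to be assigned to teacher B in both years
n = 4000
c = rng.normal(0.0, 0.2, n)                  # fixed student ability
p_b = 1.0 / (1.0 + np.exp(-3.0 * c))         # sorting probability
t1 = rng.random(n) < p_b                     # teacher-B dummy, year 1
t2 = rng.random(n) < p_b                     # teacher-B dummy, year 2
fx = 0.05                                    # true effect of teacher B vs. A
g1 = fx * t1 + c + rng.normal(0.0, 0.05, n)
g2 = fx * t2 + c + rng.normal(0.0, 0.05, n)

# Fixed effects via the within transformation: differencing a student's two
# observations removes c_i before estimating the teacher effect
dg = g2 - g1
dt = t2.astype(float) - t1.astype(float)
fe_hat = (dt @ dg) / (dt @ dt)

# Pooled OLS on the gain score equation ignores c_i and inherits the sorting bias
y = np.concatenate([g1, g2])
x = np.concatenate([t1, t2]).astype(float)
ols_hat = np.cov(x, y)[0, 1] / np.var(x)
```

Here `fe_hat` recovers the true effect while `ols_hat` is pushed upward by the correlation between assignment and ability, matching the static-sorting case where FE is useful.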
Practically speaking, the strict exogeneity assumption precludes any feedback from realized achievement shocks to future inputs. For instance, if a principal reacts to a randomly good or bad test score in one year when determining a future teacher assignment, this would violate strict exogeneity. As noted by Rothstein (2009, 2010), the fixed effects approach is useful when assignment to teachers is made based on a static characteristic of the student. The usefulness of FE estimation breaks down when assignment decisions are made dynamically, based on new information gathered over time by the relevant decision makers, be they principals, parents, or students. Finally, it has become more common to estimate teacher value-added using approaches based on the dynamic GMM estimator found in Arellano & Bond (1991) (see Koedel & Betts 2011). Researchers taking this approach use either the Arellano & Bond GMM estimator or a 2SLS version based on identical moment conditions, here referred to as the First-Differenced Instrumental Variables (FDIV) estimator.[17] Specifically, a first-differenced version of the lag score equation (B.3) is estimated using twice-lagged test scores as an instrument for the lagged gain score. This estimator directly addresses the presence of $c_i$ in (B.3) through the first-differencing, while also avoiding the problem that including lagged achievement violates strict exogeneity, through the use of instrumental variables. Importantly, this approach still requires strict exogeneity of the other regressors.

[16] Note that the strict exogeneity assumption is what precludes the use of fixed effects on the lag score equation as well. The lag score equation necessarily violates strict exogeneity by including the lagged dependent variable as a regressor, since $A_{it-1}$ must be correlated with the error term in period t-1.
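The FDIV logic can be sketched with a simulated three-score panel (all parameter values assumed): first-differencing removes $c_i$, and the twice-lagged level instruments the endogenous lagged gain:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated scores A0, A1, A2 with fixed ability c_i and iid shocks u_t;
# true persistence lambda = 0.6 (all values invented for illustration)
n = 5000
lam_true = 0.6
c = rng.normal(0.0, 0.3, n)
u0, u1, u2 = (rng.normal(0.0, 0.1, n) for _ in range(3))
A0 = c + u0
A1 = lam_true * A0 + c + u1
A2 = lam_true * A1 + c + u2

# First-differencing the lag score equation gives
#   (A2 - A1) = lambda * (A1 - A0) + (u2 - u1),
# so c_i drops out, but the lagged gain contains u1 and is endogenous.
# Instrument it with the twice-lagged level A0 (valid when u_t is serially
# uncorrelated, the key requirement noted in the text).
dy, dx, z = A2 - A1, A1 - A0, A0
dy_c, dx_c, z_c = dy - dy.mean(), dx - dx.mean(), z - z.mean()

lam_iv = (z_c @ dy_c) / (z_c @ dx_c)     # just-identified IV = 2SLS
lam_ols = (dx_c @ dy_c) / (dx_c @ dx_c)  # OLS on the differenced equation (biased)
```

The contrast between `lam_iv` and `lam_ols` shows exactly the endogeneity the twice-lagged instrument addresses; `lam_iv` should sit near the true persistence while `lam_ols` is pulled away by the first-differenced error.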
While this assumption could be relaxed by using lagged regressors as instruments, as is done for prior achievement, this has not been common in the value-added literature. Most importantly, the Arellano & Bond-inspired approach requires that the errors in (B.3) not be serially correlated for twice-lagged achievement to be a valid instrument. Finally, these approaches require an additional year of data for each student, thereby reducing the sample with which teacher value-added can be calculated.

[17] The GMM and FDIV approaches are identical if the optimal GMM weighting matrix is replaced by an identity matrix.

REFERENCES

Angrist, J. D. & Lavy, V. (1999). Using Maimonides' Rule to Estimate the Effect of Class Size on Scholastic Achievement. Quarterly Journal of Economics, 114(2), 533-575.

Bohrnstedt, G. W. & Stecher, B. M. (1999). Class-size Reduction in California 1996-1998: Early Findings Signal Promise and Concerns. Palo Alto, CA: CSR Research Consortium, EdSource, Inc.

Bohrnstedt, G. W. & Stecher, B. M. (2002). What We Have Learned about Class-size Reduction in California. Sacramento, CA: California Department of Education.

Boyd, D., Lankford, H., Loeb, S., & Wyckoff, J. (2005). The Draw of Home: How Teachers' Preferences for Proximity Disadvantage Urban Schools. Journal of Policy Analysis and Management, 24(1), 113-132.

Buckingham, J. (2003). Class Size and Teacher Quality. Educational Research for Policy and Practice, 2, 71-86.

Center for Local, State, and Urban Policy (2010). Mandating Merit: Assessing the Implementation of the Michigan Merit Curriculum. http://closup.umich.edu/files/pr-13-michigan-meritcurriculum.pdf

Council for Education Policy, Research and Improvement (2005). Impact of the Class-size Amendment on the Quality of Education in Florida.

Chingos, M. M. (2012). The Impact of a Universal Class-size Reduction Policy: Evidence from Florida's Statewide Mandate. Economics of Education Review, 31(5), 543-562.

Dieterle, S. (2011).
Class-size Reduction Policies and the Composition of the Teacher Workforce. Unpublished draft.

Feistritzer, C. E. (2007). Alternative Teacher Certification 2007. Washington, D.C.: National Center for Education Information.

Goldhaber, D. (2008). Teachers Matter, But Effective Teacher Quality Policies Are Elusive. In Ladd, H. F. & Fiske, E. B. (eds.), Handbook of Research in Education Finance and Policy. New York, NY: Routledge, 146-165.

Guarino, C. M., Reckase, M. D., & Wooldridge, J. M. (2011). Evaluating Value-added Methods for Estimating Teacher Effects. Working paper.

Harris, D., Sass, T., & Semykina, A. (2011). Value-added Models and the Measurement of Teacher Quality. Unpublished draft.

Hoxby, C. M. (2000). The Effects of Class Size on Student Achievement: New Evidence from Population Variation. Quarterly Journal of Economics, 115(4), 1239-1285.

Imazeki, J. (n.d.). Class-size Reduction and Teacher Quality: Evidence from California. Working paper.

Jepsen, C. & Rivkin, S. (2009). Class Size Reduction and Student Achievement: The Potential Tradeoff between Teacher Quality and Class Size. Journal of Human Resources, 44(1), 223-250.

Kane, T. J. & Staiger, D. O. (2005). Using Imperfect Information to Identify Effective Teachers. Unpublished manuscript.

Kane, T. & Staiger, D. (2008). Estimating Teacher Impacts on Student Achievement: An Experimental Evaluation. Working Paper 14607, National Bureau of Economic Research.

Koedel, C. & Betts, J. R. (2011). Does Student Sorting Invalidate Value-added Models of Teacher Effectiveness? An Extended Analysis of the Rothstein Critique. Education Finance and Policy, 6(1), 18-42.

Krueger, A. B. (1999). Experimental Estimates of Education Production Functions. Quarterly Journal of Economics, 114(2), 497-532.

Krueger, A. B. & Whitmore, D. M. (2001). The Effect of Attending a Small Class in the Early Grades on College-test Taking and Middle School Test Results: Evidence from Project STAR. Economic Journal, 111(468), 1-28.
Lankford, H., Loeb, S., & Wyckoff, J. (2002). Teacher Sorting and the Plight of Urban Schools: A Descriptive Analysis. Educational Evaluation and Policy Analysis, 24(1), 37-62.

McCaffrey, D., Lockwood, J. R., Koretz, D., Louis, T., & Hamilton, L. (2004). Models for Value-added Modeling of Teacher Effects. Journal of Educational and Behavioral Statistics, 29(1), 67-101.

Murnane, R. J. (1975). The Impact of School Resources on the Learning of Inner City Children. Cambridge, MA: Ballinger Publishing Company.

Reardon, S. & Galindo, C. (2009). The Hispanic-White Achievement Gap in Math and Reading in the Elementary Grades. American Educational Research Journal, 46(3), 853-891.

Rivkin, S., Hanushek, E. A., & Kain, J. F. (2005). Teachers, Schools, and Academic Achievement. Econometrica, 73(2), 417-458.

Rockoff, J. (2009). Field Experiments in Class Size from the Early Twentieth Century. Journal of Economic Perspectives, 23(4), 211-230.

Rothstein, J. (2009). Student Sorting and Bias in Value-added Estimation: Selection on Observables and Unobservables. Education Finance and Policy, 4(4), 537-571.

Rothstein, J. (2010). Teacher Quality in Educational Production: Tracking, Decay, and Student Achievement. Quarterly Journal of Economics, 125(1), 175-214.

Sass, T. R. (2011). Certification Requirements and Teacher Quality: A Comparison of Alternative Routes to Teaching. Working paper.

Staiger, D. & Rockoff, J. (2010). Searching for Effective Teachers with Imperfect Information. Journal of Economic Perspectives, 24(3), 97-118.

Stecher, B. & Bohrnstedt, G. (eds.) (2000). Class-size Reduction in California: Summary of the 1998-1999 Evaluation Findings.

Todd, P. & Wolpin, K. (2003). On the Specification and Estimation of the Production Function for Cognitive Achievement. Economic Journal, 113(485), 3-33.

Wooldridge, J. (2002). Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.

CHAPTER 2

CLASS-SIZE REDUCTION AND THE COMPOSITION OF THE TEACHER WORKFORCE

1.
Introduction

Large-scale class-size reduction (CSR) policies have been a popular tool for improving a range of student outcomes. However, they are often met by concerns that such policies will lead to a detrimental change in the composition of the teacher workforce through new hiring and movement of teachers across schools. For instance, it has been suggested that these changes may adversely affect student achievement. A common first step in looking for changes in teacher composition is to focus on changes in observable teacher characteristics after CSR implementation (Bohrnstedt & Stecher 1999; Imazeki n.d.; Buckingham 2003). For example, an increase in teachers who were not fully certified, did not have an advanced degree, and were less experienced has been documented around the introduction of CSR in California (Stecher & Bohrnstedt 2000). While research suggests only weak correlations between these characteristics and student achievement in the aggregate, sudden swings in teacher characteristics indicate less selective hiring practices by schools at a particular point in time and may be associated with undesirable student outcomes, including those not easily measured by test scores. The evidence from California has been cited as a potential issue for CSR policies in other states, including Florida (CEPRI 2005; Chingos 2012; Walberg 2006). In his analysis of CSR in Florida, Chingos cites the experience of California and suggests that his school-level estimates may capture effects "such as reduced teacher quality, if the pool of qualified applicants for teaching positions was depleted during the district-level implementation of CSR." It is not clear, however, whether the California results can be generalized to other states. We use Florida's implementation of a CSR program in the 2003-2004 school year to see if similar changes in the workforce occurred.
The analysis is based on publicly available school-level data from the Florida School Indicators Reports (FSIR), which track many aspects of schools, including average teacher characteristics. A cursory glance at the data does not reveal substantial changes in teacher characteristics like those found in California. This paper takes a more detailed look at Florida's experience in an attempt to uncover evidence of a change in the stock of teachers. A descriptive analysis of general trends in teacher characteristics will help to address how general the California results are and provide background on how CSR played out in Florida. Contrary to the prevailing story, we find little evidence of large changes in teacher characteristics occurring with the introduction of CSR. Only the average experience of Florida teachers shows a slight decline. It follows that the California experience is not easily generalized to other states. The divergent experiences may stem from different contextual factors surrounding the two policies. First, Florida CSR was implemented over several years, while California's policy was introduced in a shorter time frame. Additionally, Florida CSR began after the push for hiring "highly qualified" teachers that accompanied No Child Left Behind, which increased the cost of hiring out-of-field teachers and those without an advanced degree. The paper proceeds as follows: section 2 provides a brief literature review and some background on CSR, section 3 looks specifically at CSR implementation in Florida, section 4 describes the data used, section 5 outlines the analytical approach, section 6 presents the main analysis, section 7 discusses the results, and section 8 concludes.

2. Literature Review/Background

The adoption of smaller classes has been driven by a desire to improve student outcomes.
Smaller classes are believed to help students through several channels, including improved behavior, better classroom management, and more individualized instruction. It is worth noting that while improvements along these dimensions may feed into higher student achievement, there may also be positive effects on outcomes not captured by standardized tests. For instance, behavioral changes may help students better interact with peers, a skill with potentially large labor market returns. Unfortunately, there are few quality measures of these other outcomes, leaving achievement as the main metric for judging the performance of the policy. The role of school resources, including pupil-teacher ratios, in affecting student achievement has been hotly debated among researchers for some time (Krueger 2003; Hanushek 2003). While meta-analyses come to different conclusions, more recent experimental and quasi-experimental studies have helped to establish the potential for achievement gains from attending smaller classes. The results of the Tennessee STAR class-size experiment, featuring random assignment of students and teachers to classrooms of varying sizes, have often been cited as evidence that class-size reduction is a potential policy lever to improve student achievement. Krueger's (1999) analysis of the STAR data reveals roughly a fifth of a standard deviation increase in average test scores for students randomly assigned to a small class (13-17 students) as opposed to a larger class (22-25 students) in early elementary school. Further, Krueger finds that the small-class effect is larger for minority and low-income students. In a follow-up, Krueger & Whitmore (2001) find that being in a small class was also related to long-term student outcomes. The authors found that students placed in a small class in kindergarten through third grade were more likely to take a college entrance exam, an important indicator of eventual educational attainment.
The results of the Tennessee STAR experiment, along with the intuitive appeal of smaller classes, led thirty-two states to adopt CSR policies by 2005 (CEPRI 2005). Despite CSR's popularity with teachers and parents, there is little support for the conclusion that these large-scale programs lead to better achievement. Bohrnstedt et al. (2002) found no conclusive evidence of achievement gains in their official report on California CSR. In a study of CSR in Florida, Chingos (2012) found no effect as well. Many explanations for the lack of noticeable test score gains from CSR center on the problem of moving from experiments designed to hold all other resources equal to statewide programs with unknown general equilibrium effects. A leading explanation revolves around changes in teacher quality associated with CSR programs (Stecher & Bohrnstedt 2000; Imazeki n.d.; Buckingham 2003; CEPRI 2005; Chingos 2012; Walberg 2006). Teacher quality may fall if schools must hire lower-quality teachers to meet the class-size mandate. The class-size reduction gains are then offset by having less capable teachers in classrooms. Another possibility is that the distribution of teacher quality among schools changes, with the best teachers in disadvantaged schools leaving to take CSR-induced job openings at better schools. Since minority and low-income students are more likely to attend these disadvantaged schools, the good teachers are taken from the students who were found by Krueger to benefit most from smaller classes. This may then lower the average achievement gain associated with CSR. To find evidence of these teacher composition explanations, it is common to look for changes in the characteristics of teachers around the implementation of CSR. Bohrnstedt et al. (2002) document these changes in California. The percentage fully certified in California dropped from over 98% to 87% for early elementary teachers.
This decline was larger for schools with more low-income students, with the fully certified percentage dropping from roughly 96% to below 80% in schools with more than 30% of students classified as low-income. In a previous report on California's CSR program, Bohrnstedt et al. (1999) also documented the change in teachers with advanced degrees and the proportion of novice teachers. As seen in Table 2.1, the percent of teachers with an advanced degree teaching in kindergarten through third grade fell from 83% to 77% from 1995 to 1997. Over the same time period, the percent with fewer than three years of experience grew from 17% to 28%. While on average these teacher characteristics may not be strongly correlated with student achievement, large changes at a point in time are indicative of less selective hiring practices by schools.

Table 2.1: Mean Teacher Characteristics: California K-3 Teachers, 1995-1997

  Year                                            1995   1996   1997
  Percent with a Bachelor's Degree or Less         17%    20%    23%
  Percent Not Fully Certified                       1%     4%    12%
  Percent with Less than Three Years Experience    17%     --    28%

  Source: Bohrnstedt et al. (1999)

To date, there has yet to be a study of the changes in teacher characteristics associated with CSR in Florida. The substantial changes found around California's 1996 introduction of CSR were driven by a 38% increase in the number of K-3 teachers employed in the state from 1995 to 1997 (Bohrnstedt et al.). For comparison, in Florida the overall predicted increase in the teacher workforce as a percentage of pre-CSR employment in elementary and middle schools (those in which CSR had the most bite) over the eight years was on the order of 19%, with the largest one-year increase being over 11% (CEPRI 2005). Including increases in the teacher workforce from enrollment growth puts the increase at 39% over eight years and 25% in the largest hiring year.
The lack of a phase-in period forced California schools to find the majority of the new teachers in a very short time period. In Florida, a similar percentage increase was accomplished over several years. Because fewer middle and high schools faced CSR-induced pressure to reduce class sizes in Florida, the focus here will be on teacher characteristics in elementary schools. Limiting the analysis to elementary schools also allows for a more direct comparison to California, where CSR was implemented in kindergarten through third grade only. The overall change in average teacher characteristics is far less drastic in Florida than in California. Table 2.2 presents the changes in average characteristics over the last decade in Florida elementary schools. As will be explained in more detail in the next section, Florida's program involved two distinct periods of class-size enforcement, with district average class-size enforcement beginning in the 2003-2004 school year and school-level enforcement in 2006-2007. Only average experience shows the expected relationship over time, dropping in the years of and directly following a change in CSR policy.

Table 2.2: Mean Teacher Characteristics: All Florida Public School Teachers, 2000-2009

  Year                1999-00  2000-01  2001-02  2002-03  2003-04  2004-05  2005-06  2006-07  2007-08  2008-09
  Advanced Degree       32.5%    33.0%    29.9%    32.7%    35.1%    34.8%    34.3%    33.9%    34.0%    34.7%
  Average Experience     12.9     12.9     13.0     12.9     12.8     12.5     12.5     12.0     11.9     12.1
  Out-of-field             --       --       --     6.4%     5.6%     5.5%     7.4%     8.8%     8.8%     6.5%

  Source: Florida School Indicator Reports (FDOEc), elementary schools, 2000-2009

The patterns in Table 2.2 cast doubt on the story that CSR induced considerable changes in Florida's pool of teachers. This does not rule out that some schools experienced changes in teacher characteristics.
It may be the case that by looking at average characteristics across all elementary schools in Florida, changes for the schools most affected by CSR are washed out in the average. That is, some schools may have experienced meaningful changes in the stock of teachers, while others were left relatively unaffected. The purpose of this paper is to explore this possibility in more detail.

3. CSR in Florida

Florida voters approved a constitutional amendment in November 2002 that created a statewide CSR program to take effect the following school year, 2003-2004 (FDOEa). The policy introduced new class-size maximums for three grade ranges: 18 students in kindergarten through third grade, 22 students in fourth through eighth grade, and 25 students in ninth through twelfth grade. The new maximums were accompanied by per-pupil allocations from the state government for each year a district or school was compliant with the law. There is anecdotal evidence that, at least for some districts, the allocation was not enough to cover the full costs of CSR (State Board of Education 2010). The fact that CSR in Florida may have been only partially funded suggests that schools may have had to shift resources away from other inputs in order to meet class-size requirements. A key feature of the policy was that it allowed for a gradual phase-in of the mandated class sizes. A district or school was in compliance with CSR if it had average class sizes below the maximum or if it was above the maximum but had lowered its average class size by two students from the previous year. For the first three years of the program, compliance was based on the district average, while for the next three years it was based on a school-level average. In following years, compliance was to be based on the size of individual classes; however, the switch to class-level enforcement was delayed past the period of the data used here.
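The phase-in compliance rule just described can be expressed as a simple check. The sketch below is illustrative only: the function name is invented, and the use of a weak inequality at the maximum and a reduction of exactly two students are readings of the text, not of the statute itself.

```python
def csr_compliant(avg_class_size, prior_avg, maximum):
    """Sketch of Florida's phase-in compliance rule as described above.

    A district (or, under later enforcement, a school) complies if its
    average class size is at or below the grade-level maximum, or if it
    reduced its average by at least two students from the previous year.
    """
    meets_maximum = avg_class_size <= maximum
    reduced_by_two = (prior_avg - avg_class_size) >= 2
    return meets_maximum or reduced_by_two
```

For example, a district averaging 21 students in K-3 (maximum 18) would comply under this reading only if its prior-year average was 23 or higher.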
Non-compliance by districts or schools initially resulted in a portion of the CSR allocation being directed toward capital outlays aimed at reducing class size. Beginning in the third year of the program, additional sanctions for non-compliance were introduced. Districts not in compliance were forced to implement one of the following four policies: adopting year-round schools, holding double sessions in schools, changing school attendance zones, or altering the use of instructional staff.

Table 2.3: New Class-size Maximums and Average Class Size, District Level

  Grades   Maximum   Percent Below Max 2003   Average CS 2003   Average CS 2009
  KG-G3      18               12%                    23                16
  G4-G8      22               42%                    24                19
  G9-G12     25               91%                    24                22

  Source: Florida Department of Education

As seen in Table 2.3, the new maximums were binding for most districts at implementation in kindergarten through third grade and in fourth through eighth grade. Table 2.3 also shows that the average class size in the state did drop considerably for the earliest and middle grades. In terms of the impact on teacher hiring, the state projected that roughly 20,000 teachers would need to be hired in order to implement the CSR program. This represents a 19% increase over the number of teachers employed the year prior to CSR (CEPRI). In the heaviest CSR hiring year, the number of CSR-induced openings was projected to be nearly as large as those from general turnover. This hiring spike of an estimated 11,000 teachers corresponded with the change from district-level to school-level compliance during the 2006-2007 school year.

4. Data

The FSIR provide school-level data on all public schools in Florida, including average teacher characteristics at each school. The three dependent variables of interest are the percentage of teachers with an advanced degree, the percentage teaching out of their certified field, and the average years of experience.
The analysis uses data for non-charter elementary schools in Florida's 67 traditional county-level districts from the 1999-2000 to the 2008-2009 school year. Charter schools were not originally subject to CSR and are therefore dropped from the analysis. Several schools, while still coded as elementary schools in the FSIR, are excluded because they serve select populations of students, for instance pre-kindergarten programs and programs for students with severe discipline problems. The sample is restricted further to include only schools with observations in all ten years, to avoid changes in teacher characteristics associated with new or soon-to-be-closed schools. The resulting sample includes at most 14,399 school-year observations. The FSIR did not report the percentage of teachers out of their field until the 2002-2003 school year. Therefore, the following analysis is based on only seven years of data for the out-of-field outcome, reducing the number of school-year observations to at most 10,080. The FSIR also include student demographic and general school information: student racial percentages, free and reduced-price lunch percentages, per-pupil expenditure, school enrollment, size of instructional staff, and each school's accountability grade. Unfortunately, racial breakdowns of each school's student population are only available for the two most recent school years. The school-level information found in the FSIR is supplemented by Florida's official class-size average report for each district from the 2002-2003 school year. The official class-size averages allow for the identification of districts that needed to reduce class size in order to remain compliant with CSR. Summary statistics for the teacher characteristics and school-level variables used in this study are presented in Table 2.4.
Table 2.4: Summary Statistics: Florida Elementary Schools, 2000-2009

  Variable                                  Mean      Std. Dev.
  Advanced Degree Percent                  31.43%       11.20
  Average Years Experience                  12.61        3.28
  Out-of-field Percent                      6.27%       11.08
  Distance from Class Size Max-District      4.63        3.40
  Pupil-Teacher Ratio-School                15.94        2.79
  Black Percent                            26.69%       26.68
  Hispanic Percent                         23.60%       23.70
  Free and Reduced Lunch Percent           56.91%       24.36
  Per-pupil Expenditure                  $5541.91     1574.16
  A School                                   0.52        0.50
  B or C School                              0.41        0.49
  D or F School                              0.07        0.25

  Sources: FSIR and class-size average reports

5. Analytical Approach

To assess the relationship between Florida's CSR program and potential changes in the pool of teachers, this paper focuses on trends in average teacher characteristics over time, disaggregated across different school characteristics. While this approach eschews any claims to causal inference, it avoids the complexities of estimating a dynamic model of CSR implementation. In particular, the dynamic nature of CSR implementation does not lend itself to clearly defined treatment groups. For instance, some schools unaffected by district-level CSR may nonetheless make CSR-related hiring decisions in district CSR years in anticipation of the change to school-level enforcement. In addition, any spillover due to the general increase in teacher demand in the broader statewide teacher labor market will further confound pure treatment effect estimates. While the following approach does not allow for the identification of a clearly defined treatment effect, it does allow for a flexible and nuanced analysis of how the composition of Florida's teacher workforce changed over the implementation of CSR. As little evidence has been given to this point on whether or not Florida experienced the same drastic changes in the teacher workforce as California, this approach provides a transparent and informative look. The tables that follow display the trends in teacher characteristics for Florida around the time of CSR adoption.
For each table, the mean teacher characteristic in each school is first averaged across all schools in each year. The first row provides a statewide trend in these school averages. Next, each table shows the evolution of similar averages across various groupings of the schools. Schools are grouped based on the average of their school accountability grade over the ten years, as well as by quartiles of several school characteristics averaged over the time period. Basing the quartile groups on the average characteristic over the time period avoids confounding relevant changes in teacher characteristics with changes in the composition of the groups over time. The quartiles are numbered one to four, going from the lowest level of the associated school characteristic up to the highest. For instance, FRL 1 schools are those with the lowest average percentage of students on free or reduced-price lunch, and FRL 4 schools are those with the highest average percentage over this time period. Table 2.5 displays the minimum and maximum values, plus the quartile cutoffs, for each of the school characteristics used to disaggregate the data. The goal of Tables 2.6-2.8 is to provide insight as to whether or not the timing of CSR was associated with noticeable changes in average teacher characteristics for particular types of schools. In the tables, the 2003-04 through 2005-06 columns correspond to years of district-level CSR enforcement, while the 2006-07 through 2008-09 columns correspond to years of school-level enforcement.
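The grouping procedure described above can be sketched as follows. This is a minimal illustration in plain Python with a hypothetical panel; the actual analysis uses the FSIR school-level data, and the function and variable names are invented for exposition.

```python
from statistics import mean

def quartile_groups(panel):
    """Assign each school to a quartile (1-4) of a characteristic
    averaged over all observed years, so that group membership is
    fixed over time (avoiding compositional changes across years).

    `panel` maps a school identifier to the list of yearly values of
    the characteristic (e.g., percent free/reduced-price lunch).
    """
    school_means = {school: mean(values) for school, values in panel.items()}
    ranked = sorted(school_means, key=school_means.get)
    n = len(ranked)
    # Quartile 1 holds the lowest time-averaged values, quartile 4 the highest.
    return {school: 1 + (4 * rank) // n for rank, school in enumerate(ranked)}
```

For example, with panel = {"A": [10, 12], "B": [30, 32], "C": [55, 60], "D": [80, 85]}, school "A" lands in quartile 1 and school "D" in quartile 4, and those assignments are then held fixed across every year of the trend tables.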
Table 2.5: Quartile Cutoffs: School-Level Averages

  Variable                                  Min    Q1-Q2    Q2-Q3    Q3-Q4       Max
  Distance from Class Size Max-District    0.11     2.54     3.28     6.87     12.21
  Pupil-Teacher Ratio-School               8.34    14.16    15.80    17.58     43.85
  Free and Reduced Lunch Percent           2.86    38.59    59.49    76.06     97.11
  Black Percent                            0.00     7.33    16.75    36.10     98.45
  Hispanic Percent                         0.00     6.15    15.00    31.78     98.50
  Per-pupil Expenditure                 3724.40  4840.90  5331.40  6029.80  11750.80

  Sources: FSIR and class-size average reports

In addition to disaggregating by common school characteristics, we also look at two measures aimed at gauging the relative bite of the CSR policy for schools. The first is created using the official district-level average class-size reports for the year prior to the beginning of CSR. Schools in districts already below the new class-size maximum for kindergarten through third grade in the prior year are coded as "0" for Distance from Class-size Maximum in the tables. It is important to note that this is a very small subset of 15 schools out of the roughly 1,400 included in the sample. We separate all other schools by quartiles of how far their district was from the new maximum in the year prior to CSR. While this measure is very good for determining which districts faced relatively stronger CSR pressure, it does not distinguish among schools within the same district. Unfortunately, the official school-level class-size averages are not reported until the 2005-2006 school year, after schools had begun to respond to the policy. To allow for differing pressure for schools within districts, we also calculate each school's pupil-teacher ratio (PTR) in the year just prior to district-enforced CSR. The school's pupil-teacher ratio is a more direct measure for each school; however, it cannot be directly linked to the CSR policy in the way the district-level averages can.

6. Results

Table 2.6 displays the trends in the percent of teachers in each school with an advanced degree over the implementation of CSR.
Recall that California saw the percent with an advanced degree drop six percentage points in two years with CSR. The trend for all schools in the sample does not show evidence of a drop in teachers with advanced degrees coinciding with CSR implementation. In fact, we see modest increases in the average advanced degree percent in 2003-2004 and 2006-2007, at the introduction of district-level and school-level CSR, respectively. Moving on to the rest of the table, it is very difficult to find instances in which the year-to-year changes in mean advanced degree percent during CSR are sizeable and move toward having fewer advanced degree teachers in schools. One of the few exceptions is among those schools in districts already below the new kindergarten through third grade maximum in 2002-2003. The mean advanced degree percent in these schools dropped nearly four percentage points from the first year of CSR to the third. As these are presumably the schools with the lowest pressure to hire new teachers because of CSR, it seems likely that this drop was driven by some other change to this small one percent subsample of schools. Interestingly, the drop for the schools facing little pressure to reduce class sizes occurs opposite a large increase of thirteen percentage points for those schools in districts facing the highest CSR-related pressure. However, this large increase seems to be a return to prior levels for these schools, rather than a direct response to CSR. A similar trend occurs for schools with the highest percentages of Hispanic students.
Table 2.6: Mean of School-Level Advanced Degree Percent, Disaggregated by School Characteristics

  Year         1999-00 2000-01 2001-02 2002-03 2003-04 2004-05 2005-06 2006-07 2007-08 2008-09
  All Schools    31.95   32.51   29.75   30.26   32.46   32.13   30.65   31.20   31.26   32.09
  School Accountability Grade
    A            34.70   35.37   32.83   33.19   35.03   34.58   32.82   33.84   33.91   34.65
    B or C       29.98   30.50   27.66   28.40   30.89   30.57   29.25   29.47   29.45   30.36
    D or F       29.30   29.37   26.97   25.87   28.31   29.40   28.02   28.74   28.09   28.84
  Distance from Class Size Max-District
    0            26.45   27.33   21.29   26.67   26.24   25.38   22.85   24.12   23.97   24.66
    1            29.71   29.89   29.11   29.88   29.66   29.97   29.36   29.28   29.23   29.97
    2            30.65   32.02   31.76   31.81   31.76   31.31   31.10   31.43   31.52   31.82
    3            30.73   30.77   29.91   30.68   30.39   30.06   26.72   29.52   29.58   30.32
    4            39.27   40.11   28.10   28.20   41.19   40.22   38.92   36.82   37.00   38.82
  Pupil-Teacher Ratio-School
    1            31.05   31.49   29.24   29.41   31.52   31.70   29.79   31.14   31.12   31.82
    2            31.68   32.21   28.73   29.50   31.43   31.10   29.50   29.90   30.27   30.77
    3            32.17   33.08   30.52   30.93   32.83   32.44   30.51   31.15   31.11   32.36
    4            32.91   33.27   30.51   31.20   34.08   33.28   32.77   32.59   32.55   33.37
  Free and Reduced Lunch Percent
    1            35.32   35.84   34.44   34.93   35.08   34.85   33.91   34.30   34.27   34.77
    2            32.08   32.64   30.59   30.80   32.29   31.75   29.76   31.08   31.12   32.18
    3            29.24   29.43   26.66   28.05   29.47   29.34   27.85   28.36   28.45   29.20
    4            31.15   32.12   27.30   27.26   33.00   32.61   31.08   31.07   31.23   32.20
  Black Percent
    1            34.96   35.56   30.75   31.21   35.58   35.07   33.72   33.90   33.71   34.59
    2            31.12   31.75   30.00   30.71   30.91   30.40   28.46   29.98   30.03   30.84
    3            29.99   31.28   29.13   29.47   31.20   31.06   29.86   30.70   30.90   31.54
    4            31.70   31.44   29.10   29.63   32.16   32.01   30.55   30.24   30.43   31.38
  Hispanic Percent
    1            32.20   32.38   31.47   32.43   32.64   32.13   29.98   31.74   31.06   31.60
    2            31.88   32.52   30.98   31.32   31.99   31.69   30.78   30.40   31.08   31.65
    3            30.23   30.70   29.11   29.48   30.37   30.48   29.19   29.56   29.77   30.49
    4            33.46   34.42   27.42   27.78   34.83   34.24   32.63   33.08   33.15   34.60
  Per-pupil Expenditure
    1            30.72   31.16   30.58   31.21   30.70   29.89   27.23   29.62   29.63   30.41
    2            30.06   30.66   29.17   29.98   30.31   30.53   29.34   29.53   29.86   30.38
    3            31.90   32.50   30.27   30.68   32.97   32.36   31.45   31.85   31.35   32.44
    4            35.11   35.70   28.98   29.16   35.86   35.76   34.56   33.81   34.22   35.12

  Sources: FSIR and class-size average reports

Table 2.7 presents the same type of trends for average experience in schools. Unlike for the advanced degree percent, we do see some evidence of a fall in average teacher experience with CSR. Over the first two years of school-level enforcement, the mean of average experience across all schools falls by almost one year, from 12.85 to 11.97 years. It is important to note that the potential negative effects of having less experienced teachers on average may be only temporary, with average experience recovering to 12.58 years in 2008-2009. The comparable statistic from California was an eleven percentage point increase in teachers with fewer than three years of experience over two years. Unfortunately, the FSIR only contain the average experience measure, not allowing for a direct comparison to California. The distinction between the two measures, average experience and percent novice, may be important. Some of the evidence from the education production function literature suggests that the largest returns to experience occur in the first few years of teaching (Goldhaber 2008). Therefore, California's increase in novice teachers may well have been associated with a drop in achievement. It is less clear whether Florida's decrease in average experience could help explain the poor achievement performance of CSR. For instance, it could be the case that Florida schools hired more teachers with, say, six years of experience. Average experience may go down in this case, but it may not have large consequences for student achievement. Looking closer at the two-year drop in average experience, we see that the drop was larger for schools in districts with higher average class sizes in 2002-2003.
Schools in the lowest quartile of distance from the new maximum saw a decrease in mean average experience of 1.03 years, while those farthest from the CSR maximum experienced a decrease of 1.72 years over the same time period. In contrast, disaggregating by school pupil-teacher ratio yields a different story, with a larger fall of 1.29 years for the schools with the lowest pupil-teacher ratios, compared to 0.64 years for high pupil-teacher ratio schools. This seemingly contradictory result could reflect the fact that pupil-teacher ratios can be poor measures of actual class size. Across Table 2.7, we see patterns consistent with larger drops in average experience in schools serving more disadvantaged populations. For example, the schools with the lowest percentage of free or reduced-price lunch students experience a 0.60-year decrease, compared to 1.21 years for the highest quartile. Similar patterns hold for the percent Black or Hispanic. Schools with the lowest levels of per-pupil expenditure see slightly larger decreases, by about one-fifth of a year, than higher-spending schools.
Table 2.7: Mean of School-Level Average Experience, Disaggregated by School Characteristics

  Year         1999-00 2000-01 2001-02 2002-03 2003-04 2004-05 2005-06 2006-07 2007-08 2008-09
  All Schools    12.59   12.65   12.74   12.97   12.87   12.69   12.85   12.20   11.97   12.58
  School Accountability Grade
    A            13.13   13.33   13.49   13.88   13.74   13.43   13.66   13.14   13.09   13.65
    B or C       12.24   12.20   12.24   12.39   12.28   12.19   12.31   11.56   11.21   11.88
    D or F       11.22   11.08   11.05   10.78   11.19   11.26   11.21   10.53    9.87   10.28
  Distance from Class Size Max-District
    0            14.55   14.64   14.40   14.49   13.31   13.55   13.17   12.62   12.14   12.15
    1            13.06   13.27   13.49   13.88   13.66   13.52   13.43   13.28   12.96   13.18
    2            12.39   12.49   12.92   12.84   12.47   11.97   12.87   11.60   11.46   12.82
    3            12.77   12.92   12.88   13.08   12.98   12.54   12.54   12.42   12.26   12.86
    4            11.83   11.51   11.17   11.70   12.15   12.78   12.56   11.18   10.84   11.00
  Pupil-Teacher Ratio-School
    1            12.10   12.13   12.12   12.31   12.17   12.24   12.47   11.59   11.18   11.83
    2            12.51   12.63   12.80   12.99   12.92   12.70   12.86   12.24   12.02   12.72
    3            13.02   13.05   13.17   13.49   13.30   13.04   13.12   12.57   12.37   12.98
    4            12.73   12.79   12.87   13.13   13.10   12.79   12.96   12.42   12.32   12.81
  Free and Reduced Lunch Percent
    1            13.23   13.49   13.69   14.18   13.98   13.65   13.96   13.37   13.36   13.94
    2            12.91   13.06   13.21   13.51   13.34   13.01   13.16   12.65   12.35   12.99
    3            12.58   12.59   12.72   12.87   12.51   12.30   12.48   11.79   11.56   12.16
    4            11.62   11.45   11.34   11.36   11.67   11.80   11.82   11.00   10.61   11.24
  Black Percent
    1            12.59   12.86   13.03   13.22   13.35   13.26   13.41   12.89   12.94   13.57
    2            12.73   12.81   13.06   13.55   13.15   12.83   13.22   12.34   12.09   12.79
    3            12.91   12.90   12.89   13.15   12.94   12.57   12.78   12.19   11.95   12.62
    4            12.12   12.03   11.97   11.99   12.04   12.10   11.99   11.39   10.90   11.35
  Hispanic Percent
    1            13.09   13.21   13.38   13.74   13.57   13.32   13.47   13.12   12.68   12.97
    2            13.06   13.30   13.43   13.74   13.53   13.33   13.33   12.68   12.51   12.95
    3            12.54   12.58   12.70   13.04   12.66   12.34   12.64   11.78   11.50   12.58
    4            11.65   11.51   11.44   11.40   11.74   11.77   11.96   11.23   11.19   11.84
  Per-pupil Expenditure
    1            12.50   12.57   12.75   13.19   12.88   12.30   12.82   12.02   11.82   12.56
    2            12.96   13.12   13.38   13.66   13.36   13.05   13.22   12.51   12.23   13.00
    3            12.51   12.58   12.61   12.84   12.72   12.59   12.81   12.28   12.09   12.64
    4            12.37   12.32   12.20   12.22   12.51   12.83   12.55   12.01   11.75   12.13

  Sources: FSIR and class-size average reports

Table 2.8 displays the results for the mean percentage of teachers teaching out of their certified field. As mentioned previously, certification information by school is only available from the 2002-2003 school year forward. This provides only one year of pre-CSR information, making it difficult to gauge what typical year-to-year fluctuations may have been in the absence of CSR. Further complicating a descriptive analysis of certification status in Florida schools over this period is the fact that Florida expanded alternative certification programs throughout the same period (Feistritzer 2007). In effect, obtaining certification early in the sample period may not be directly comparable to having certification later in the period. With these caveats in mind, less emphasis is placed on the observed trends for out-of-field percent. This is unfortunate, as certification was found to be the characteristic most responsive to CSR in California. For all schools, the mean out-of-field percent actually drops in the first year of CSR by 1.80 percentage points, only to increase over the next few years before finally dropping again in the third year of school-level CSR enforcement. It is difficult to find trends consistent with what might be expected based on California's experience. For instance, those schools in districts farthest from meeting the new class-size maximums see a drastic drop of over six percentage points at CSR introduction and consistently have a lower percentage out-of-field than other schools in districts closer to the cutoff in 2002-2003.
There is some evidence that schools with higher free and reduced-price lunch percentages experienced larger increases in the percentage of teachers teaching out-of-field compared to low free and reduced-price lunch schools.

Table 2.8: Mean of School-Level Out-of-Field Percentage, Disaggregated by School Characteristics

  Year         2002-03 2003-04 2004-05 2005-06 2006-07 2007-08 2008-09
  All Schools     7.28    5.48    6.45    6.73    6.67    6.58    4.74
  School Accountability Grade
    A             5.93    4.70    4.92    4.67    5.11    4.73    3.41
    B or C        8.24    6.03    7.53    8.07    7.65    7.74    5.60
    D or F        8.50    6.23    7.97    9.97   10.15   10.66    7.29
  Distance from Class Size Max-District
    0             1.35    3.90   11.94    5.12    4.21    5.62    0.67
    1             4.98    2.60    3.28    3.54    3.04    2.28    1.79
    2             8.40    8.49   12.65   12.43   13.68   12.39    7.12
    3             7.05    6.68    6.40    8.11    7.05    6.59    6.00
    4             9.49    3.30    2.12    0.94    1.43    4.52    3.58
  Pupil-Teacher Ratio-School
    1             9.85    6.05    7.24    7.06    7.55    8.10    5.52
    2             7.23    5.45    7.05    6.42    6.72    6.63    4.34
    3             6.27    5.55    6.65    7.82    6.71    6.16    4.47
    4             5.78    4.87    4.89    5.61    5.73    5.46    4.66
  Free and Reduced Lunch Percent
    1             5.51    4.35    4.95    4.77    5.54    4.81    3.84
    2             6.69    5.18    5.08    5.86    4.95    4.84    3.24
    3             8.14    6.08    7.87    8.20    7.76    7.33    5.74
    4             8.78    6.29    7.92    8.07    8.45    9.35    6.15
  Black Percent
    1             5.40    4.19    4.83    4.83    4.70    4.24    2.66
    2             8.32    6.79    7.07    7.05    7.57    7.43    4.89
    3             8.00    6.16    7.84    7.90    7.41    7.05    4.77
    4             7.39    4.76    6.08    7.13    7.01    7.60    6.65
  Hispanic Percent
    1             4.29    2.93    3.58    4.24    3.29    3.34    3.59
    2             6.76    4.71    5.13    5.92    5.55    5.05    5.00
    3             9.15    6.98    8.90    8.93    9.61    9.45    5.84
    4             8.95    7.32    8.24    7.84    8.28    8.51    4.56
  Per-pupil Expenditure
    1             8.39    7.75    8.87    9.62    9.45    8.65    5.46
    2             6.79    4.42    6.18    6.52    6.95    6.83    4.88
    3             6.53    4.69    5.69    5.20    5.44    5.13    4.25
    4             7.41    5.05    5.08    5.56    4.84    5.71    4.38

  Sources: FSIR and class-size average reports

7. Discussion

Broadly, we do not see teacher characteristics changing with the same magnitude in Florida as in California.
The preceding investigation found consistent evidence of reductions in average experience only, suggesting that Florida CSR may not have been associated with drastic changes in hiring selectivity. This is not to say, however, that there were no real changes in the teacher workforce occurring as a result of CSR that may have contributed to the lack of achievement gains. The results simply suggest that there were no changes in these observed characteristics. Other, unobserved characteristics of teachers may have been changing as a result of CSR. This possibility is explored in detail in a related and complementary paper by Dieterle (2012). Importantly, the approach adopted here benefits from requiring no assumptions about the nature of the education process. For now, it is important to consider why we see such stark differences between what happened in California and what we see in Florida. Several factors may have played a role; two potential explanations are differences in the implementation of CSR and changes in teacher labor market conditions. As mentioned previously, Florida allowed for a gradual phase-in of the policy, spreading over eight years the same proportionate increase in the number of teachers that California experienced in two. If the characteristics of the available teacher pool deteriorate more when hiring must occur quickly than when it is spread over time, then we would expect to see bigger changes in California than in Florida. It is also possible that the conditions of the teacher labor market changed in the decade between the two CSR programs. The biggest changes came about with the introduction of the federal No Child Left Behind Act (NCLB) in 2001. NCLB includes a clause requiring "highly qualified" teachers in every classroom, which has generally been interpreted to mean teachers who are certified in the field they teach and who have advanced degrees.
The "highly-qualified" teacher clause may have contributed to the observed results in two ways. First, on the teacher demand side, schools faced an additional cost of hiring out-of-field teachers with minimal education credentials. On the supply side, teacher training programs are now designed with the new requirements in mind. In response to NCLB, the State of Florida has implemented many alternative certification programs within the last decade (Feistritzer 2007); Florida now has nine separate paths to certification. Further, CSR increased demand most for early education teachers, and reports from Florida have suggested it is easier to find teachers certified in early elementary education than in other certification areas (FDOE 2008). In short, it may be easier in general to find prospective teachers with full certification and advanced degrees than it was even a decade ago.

8. Conclusion

CSR has been a prominent policy tool aimed at improving educational outcomes over the last two decades. The popularity of reducing class size was driven by hopes of improving a range of student outcomes, partially drawing on experimental work showing the potential achievement gains from smaller classes. This potential has not been fully realized in large-scale CSR programs. One of the leading concerns has been that the composition of the teacher workforce changes with the increased hiring necessary to staff smaller classes. This story is largely based on evidence from California, where large changes in observable teacher characteristics occurred when CSR was implemented, suggesting perhaps drastic changes in the selectivity of teacher hiring. The same concern has been voiced in discussions of CSR in other states, such as Florida. This paper finds some evidence of a small overall decline in average experience that coincided with the adoption of CSR in the state, particularly with the switch to school-level enforcement.
Importantly, this decline seems to have been driven by schools in districts that faced the strongest CSR pressure and those serving more disadvantaged student populations. Finally, there was little conclusive evidence of CSR-induced changes in the percentage of teachers with advanced degrees or teaching out-of-field. Overall, these changes are small and do not point to the same potential changes in teacher hiring experienced in California.

It is important to note that Florida's program has not resulted in noticeable achievement gains for students (Chingos 2012). The findings of this paper cast some doubt on the ability of the teacher-composition story to explain the observed lack of CSR-related achievement gains. Of course, this conclusion is based on only three commonly observed traits; it is possible that the teacher workforce is changing in other ways that offset potential gains from CSR (see Dieterle 2012). There could also be a completely different mechanism that prevents the realization of achievement gains from reducing class size. Further research must be conducted to determine precisely why such a popular and widespread policy has yet to produce positive effects on student achievement and whether it has had positive effects on other important student outcomes.

REFERENCES

Angrist, J. D. & Lavy, V. (1999). Using Maimonides' rule to estimate the effect of class size on scholastic achievement. Quarterly Journal of Economics, 114(2), 533-575.

Bohrnstedt, G. W. & Stecher, B. M. (1999). Class Size Reduction in California 1996-1998: Early Findings Signal Promise and Concerns. CSR Research Consortium, EdSource, Inc.: Palo Alto, CA.

Bohrnstedt, G. W. & Stecher, B. M. (2002). What We Have Learned about Class Size Reduction in California. Sacramento: California Department of Education.

Buckingham, J. (2003). Class size and teacher quality. Educational Research for Policy and Practice, 2, 71-86.

Chingos, M. M.
(2012). The impact of a universal class-size reduction policy: Evidence from Florida's statewide mandate. Economics of Education Review, 31(5), 543-562.

Council for Education Policy, Research and Improvement (2005). Impact of the Class Size Amendment on the Quality of Education in Florida. Retrieved from http://www.cepri.state.fl.us/pdf/2005%20Class%20Size%20Impact%20Full%20Report.pdf

Dieterle, S. (2012). Class-size reduction policies and the quality of entering teachers. Working paper.

Feistritzer, C. E. (2007). Alternative Teacher Certification 2007. Washington, D.C.: National Center for Education Information.

Florida Department of Education. "Class size reduction amendment." Retrieved from http://www.fldoe.org/ClassSize/

Florida Department of Education. "Class size averages." Retrieved from http://www.fldoe.org/ClassSize/csavg.asp

Florida Department of Education. "Florida school indicator reports." Retrieved from http://www.fldoe.org/eias/eiaspubs/0809fsir.asp

Florida Department of Education (2008). "New hires in Florida public schools: Fall 1998 through fall 2007."

Goldhaber, D. (2008). Teachers matter, but effective teacher quality policies are elusive. In Ladd, H. F. & Fiske, E. B. (Eds.), Handbook of Research in Education Finance and Policy (pp. 146-165). New York, NY: Routledge.

Hanushek, E. A. (2003). The failure of input-based schooling policies. Economic Journal, 113(485), F64-F98.

Imazeki, J. (n.d.). Class-size reduction and teacher quality: Evidence from California. Working paper.

Krueger, A. B. (1999). Experimental estimates of education production functions. Quarterly Journal of Economics, 114(2), 497-532.

Krueger, A. B. (2003). Economic considerations and class size. Economic Journal, 113(485), F34-F63.

Krueger, A. B. & Whitmore, D. M. (2001). The effect of attending a small class in the early grades on college-test taking and middle school test results: Evidence from Project STAR. Economic Journal, 111(468), 1-28.
No Child Left Behind Act of 2001, Pub. L. No. 107-110, 115 Stat. 1959-1960 (2002). Print.

State Board of Education (2010). "State Board of Education meeting transcript: June 15, 2010, Orlando, Florida." Retrieved from http://www.fldoe.org/board/meetings/2010_06_15/junetranscript.pdf

Walberg, H. J. (2006). Class size. In Reforming Education in Florida: A Study Prepared by the Koret Task Force on K-12 Education (pp. 245-254). Hoover Press. Retrieved from http://www.hoover.org/publications/books/8354

CHAPTER 3

WHAT CAN WE LEARN ABOUT EFFECTIVE EARLY MATHEMATICS TEACHING? A FRAMEWORK FOR ESTIMATING CAUSAL EFFECTS USING LONGITUDINAL SURVEY DATA

1. Introduction

This study investigates the impact of teacher characteristics and instructional strategies on the mathematics achievement of students in kindergarten and first grade. Understanding the factors that make some teachers more effective than others is vital to achieving and supporting high-quality instruction. Early teaching, in particular, can be crucial to the future academic progress of children, as well as to later economic well-being and other nonacademic outcomes (Barnett, 1995; Currie & Thomas, 2000; Kilpatrick, Swafford & Findell, 2001; Chetty et al., 2011). Evidence of "what works" in elementary mathematics instruction can be obtained from multiple sources: experiments, observations, administrative data, and surveys. This study utilizes survey data and provides a framework to guide the estimation of causal effects in nonexperimental settings. Although experimental evidence is generally considered the gold standard, true educational experiments are rare, centered primarily on interventions, and difficult to impose or implement. In addition, they are generally conducted on a small scale, yielding potentially ambiguous conclusions regarding the effects of scaled-up interventions.
Classroom observations provide an in-depth picture of teaching but are costly to conduct on a large scale and can be difficult to parse into identifiable, quantifiable elements of instruction that overcome inter-rater reliability issues. Administrative data are often used to uncover associations between factors like teacher training and student achievement. However, these data contain only the most basic teacher and student characteristics required for reporting and compliance purposes and often suffer from missing, and sometimes inaccurate, records. Large-sample survey data, if representative of the population of interest and sufficiently detailed, can represent an improvement over administrative data. They are less costly to collect than large-scale classroom observations, and they can provide information on treatments that cannot be investigated through experiments. Much of the body of knowledge on teacher effectiveness consists of estimates derived from survey data, such as the National Educational Longitudinal Study (NELS) and the Early Childhood Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K). However, these data have significant limitations as well. They generally rely on self-reported activities and characteristics and often suffer from item non-response, even when overall survey response rates are high. In addition, sample attrition in longitudinal surveys can pose a threat to representativeness if it is nonrandom. The fundamental disadvantage of survey data relative to experimental data, however, is that students are not randomly assigned to treatments, rendering causal inference inherently difficult, a disadvantage shared with observational and administrative data. Survey data can compensate for this handicap to varying degrees, however, by providing a rich set of control variables that may reduce omitted variable bias in estimates.
This paper tackles the question of how best to use longitudinal survey data to draw causal inferences about teacher-related factors affecting early mathematics achievement in the face of potential threats to validity from nonrandom assignment to treatment. Researchers have several tools at their disposal to deal with these threats. Methodological decisions can be complex, however, and, as we show, results can be sensitive to the approach selected. As such, it is imperative to understand why inconsistencies occur and which methods are best in a given situation. The goals of this paper are thus twofold: (1) to lay out a careful approach for selecting an appropriate model and estimation method to investigate teacher effects using longitudinal survey data and (2) to apply this approach in answering our specific research question, namely, the extent to which the observable background characteristics and instructional practices of kindergarten and first grade teachers produce gains in the mathematics achievement of their students. We use data from ECLS-K, a nationally representative sample of kindergartners followed over time. The data include student assessments in mathematics and reading at each wave as well as detailed information from parents, teachers, and school administrators and are therefore well suited to an investigation of our research question. Through a step-by-step analysis of the data, we select a modeling and estimation strategy. Our findings indicate that teacher certification and courses in methods of teaching mathematics have a slightly negative effect on student achievement in kindergarten, whereas postgraduate education has a positive effect in first grade.
Various teaching modalities, such as working with counting manipulatives, using math worksheets, and completing problems on the chalkboard, have positive effects on achievement in kindergarten, and pedagogical practices relating to explaining problem-solving and working on problems from textbooks have positive effects on achievement in first grade. We find that the models and estimators previously employed to estimate teacher characteristic and practice effects using longitudinal survey data likely neglected important features needed to establish causal inference. Importantly, we show that the conclusions drawn depend on the estimation and modeling choices made, underscoring the importance of setting out a clear strategy for choosing among the many possibilities available.

This paper is organized as follows. In section 2, we outline a framework for selecting a model and method for estimating teacher effects using longitudinal survey data. In section 3, we review the relevant literature pertaining to our specific research question (what teacher characteristics and practices affect student achievement in the early grades?), with particular attention to the estimation methods used. Section 4 describes our data, section 5 outlines our methods, section 6 presents and discusses results, and section 7 concludes.

2.
Modeling and Estimation Framework

Models

A very general cumulative effects model views current achievement as a function of all relevant current and past inputs, a student-specific effect, and a random error term (Hanushek, 1979; Todd & Wolpin, 2003; Harris, Sass, & Semykina, 2010; Guarino, Reckase, & Wooldridge, 2011):

Y_{ijst} = f_t(X_{it}, ..., X_{i0}, Z_{it}, ..., Z_{i0}, c_i, u_{ijst})   (1)

where
Y_{ijst} = achievement of child i with teacher j in school s at time t
f_t = time-varying education production function relating inputs to achievement
X_{it} = time-varying child, family, and neighborhood inputs for child i in period t
Z_{it} = time-varying schooling inputs (such as teaching practices, peer effects, etc.)
c_i = time-invariant child effect
u_{ijst} = unobserved error term

Researchers make several assumptions to render (1) tractable for analysis. As a first step, the function f_t is generally assumed to be linear in the parameters. These assumptions yield the linear cumulative effects (LCE) model:

Y_{it} = \alpha_t + X_{it}\beta_0 + X_{it-1}\beta_1 + ... + X_{i0}\beta_t + Z_{it}\gamma_0 + Z_{it-1}\gamma_1 + ... + Z_{i0}\gamma_t + \eta_t c_i + u_{it}   (2)

Our interest lies primarily in estimating the \gamma parameters, which in this case convey the partial effects of teacher characteristics and practices on math achievement. Typically, data on all inputs prior to kindergarten (e.g., preschool or daycare characteristics) are unavailable, and the student effect c_i, sometimes thought of as unmeasured innate student ability or motivation, is generally unobservable. To address the lack of prior inputs, it is customary to impose an additional assumption: namely, that the effect of the inputs decays at a geometric rate equal to \lambda. In terms of the parameters, this assumption requires \beta_s = \lambda^s \beta_0 and \gamma_s = \lambda^s \gamma_0 for s = 1, ..., T.
The geometric decay assumption on the input effects allows one to eliminate the lagged inputs and rewrite equation (2) as a geometric distributed lag (GDL) model:^1

Y_{it} = \alpha_t^* + \lambda Y_{it-1} + X_{it}\beta_0 + Z_{it}\gamma_0 + \eta_t^* c_i + e_{it},  where  e_{it} = u_{it} - \lambda u_{it-1}   (3)

A second common restriction is that the child-effect loading in (3) is constant over time, \eta_t^* = \eta, indicating that the child-specific effect has the same effect on achievement in every period. Note that under this assumption there is no loss of generality in setting \eta equal to 1 if a constant is present in the model:

Y_{it} = \alpha_t + \lambda Y_{it-1} + X_{it}\beta_0 + Z_{it}\gamma_0 + c_i + e_{it}   (4)

Equation (4) is generally referred to as a dynamic linear model due to the presence of lagged achievement on the right-hand side. Estimation of (4) is generally feasible, as it requires only contemporaneous inputs and a lag of achievement. However, many researchers proceed to subtract prior achievement from both sides of the equation and estimate what is often referred to as a gain score model. This results in the following:

Y_{it} - Y_{it-1} = \alpha_t + X_{it}\beta_0 + Z_{it}\gamma_0 + c_i + (\lambda - 1)Y_{it-1} + e_{it}   (5)

Of course, if \lambda = 1, then the piece of prior achievement left in the error term disappears. If not, it causes an omitted variables problem as well as negative serial correlation. Thus, in choosing this model specification, researchers are essentially assuming that \lambda = 1 (i.e., that there is no decay in the impact of prior inputs on current achievement) or that the consequences of violating this assumption are unimportant for estimating the parameters of interest.

^1 Algebraically, (3) is derived from (2) by adding and subtracting \lambda Y_{it-1} from both sides of (2), imposing the GDL assumption, and substituting and simplifying terms. See Guarino, Reckase, & Wooldridge (2011) and Harris, Sass, & Semykina (2010) for a detailed explanation of the geometric distributed lag assumption and its application to value-added modeling.
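The quasi-differencing step described in footnote 1 can be written out explicitly. The following derivation is a sketch using the same symbols as the LCE model, with \eta_t denoting the period-t loading on the child effect; it shows how geometric decay collapses the full input history into a single lagged score:

```latex
% Start from the LCE model at t, and its lag at t-1 multiplied by \lambda,
% imposing \beta_s = \lambda^s \beta_0 and \gamma_s = \lambda^s \gamma_0:
\begin{align*}
Y_{it} &= \alpha_t + \sum_{s=0}^{t} \lambda^{s}\left( X_{i,t-s}\beta_0 + Z_{i,t-s}\gamma_0 \right)
          + \eta_t c_i + u_{it} \\
\lambda Y_{i,t-1} &= \lambda\alpha_{t-1}
          + \sum_{s=1}^{t} \lambda^{s}\left( X_{i,t-s}\beta_0 + Z_{i,t-s}\gamma_0 \right)
          + \lambda\eta_{t-1} c_i + \lambda u_{i,t-1}
\end{align*}
% Subtracting the second line from the first, every lagged input cancels:
\begin{align*}
Y_{it} - \lambda Y_{i,t-1} &= (\alpha_t - \lambda\alpha_{t-1}) + X_{it}\beta_0 + Z_{it}\gamma_0
          + (\eta_t - \lambda\eta_{t-1}) c_i + (u_{it} - \lambda u_{i,t-1})
\end{align*}
% which is the GDL form with composite intercept \alpha_t - \lambda\alpha_{t-1},
% composite child-effect loading \eta_t - \lambda\eta_{t-1}, and
% error e_{it} = u_{it} - \lambda u_{i,t-1}.
```

The key payoff is visible in the middle step: only the contemporaneous inputs X_{it} and Z_{it} survive the subtraction, so the model becomes estimable without the unavailable pre-kindergarten input history.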
To relate equation (4) and its constrained version (5) directly to our research questions, we now rewrite them under all the assumptions hitherto mentioned, using terms specific to our study:

Y_{ijst} = \lambda Y_{ijst-1} + \beta_1 PDG_{jst} + \beta_2 TC_{jst} + \beta_3 Class_{ijst} + \beta_4 X_{ijst} + \alpha_t + \delta_s + \tau_j + c_i + \varepsilon_{ijst}   (6)

Y_{ijst} - Y_{ijst-1} = \beta_1 PDG_{jst} + \beta_2 TC_{jst} + \beta_3 Class_{ijst} + \beta_4 X_{ijst} + \alpha_t + \delta_s + \tau_j + c_i + (\lambda - 1)Y_{ijst-1} + \varepsilon_{ijst}   (7)

where:
PDG = the set of pedagogical practices used by teacher j
TC = the set of background characteristics of teacher j
Class = a set of classroom characteristics
X = a set of child- and family-related control variables
\tau_j = an unobserved teacher-specific effect
\delta_s = an unobserved school-specific effect
\varepsilon_{ijst} = a random error term

In our study, we use data from three time periods. ECLS-K assessed kindergarten and first grade children's achievement in the full sample in the fall of the kindergarten year (t = 0), in the spring of the kindergarten year (t = 1), and in the spring of the first grade year (t = 2).^2 In our modeling, the fall kindergarten test score is the lag for the spring kindergarten score, and for first grade, the spring kindergarten score is the lag for the spring first grade score.^3 The characteristics of kindergarten teachers were recorded in the fall and corrected in the spring if children's teachers changed. Information on teaching practices was recorded in the spring. For first grade teachers, all information was recorded in the spring.

^2 A subsample of first grade children was assessed in the fall; we do not use that partial wave.
^3 Thus the intervals between current and lagged tests differ across grades. We take this into account by controlling for time elapsed between tests and, in some cases, by interacting lagged achievement with a grade indicator.
In equations (6) and (7), the composite error term (\delta_s + \tau_j + c_i + \varepsilon_{ijst}) contains unobserved school and teacher effects as well as the child effect c_i and the idiosyncratic term \varepsilon_{ijst}. By including the terms \delta_s and \tau_j, we allow for the possibility that, in our focus on the specific teacher characteristics and pedagogical behaviors contained in PDG and TC, certain teacher- and school-level factors that are relevant to predicting achievement may be omitted.^4 It is important to note, however, that \delta_s can be explicitly estimated through the inclusion of school dummy variables rather than left as a component of the unobserved error term, whereas fixed teacher effects \tau_j (e.g., an unobserved teacher quality component) cannot be estimated separately from the observed teacher variables of interest unless we observe more than one class of students for each teacher.^5 For notational simplicity, we do not include interactions in this model, although in our investigations we explore the possibility that the effects of teacher characteristics and instructional practices on achievement differ across grades.

Estimators

Several estimation strategies can be applied to models (6) and (7). The estimators associated with these two models, the lag score and the gain score, are outlined below and summarized in Figure 3.1. Both models can be estimated using either cross-sections of data (i.e., data for each grade separately) or data that are pooled across kindergarten and first grade. If the equations are estimated using cross-sections of data, two primary approaches are possible: ordinary least squares (OLS) estimation and maximum likelihood estimation under random effects assumptions.

^4 In our notation, we impose the simplifying assumptions that the unobserved school and teacher effects are constant across students and over time.
^5 In ECLS-K, we have only one classroom per teacher. The possible effects of this limitation are discussed later.
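To make the maximum likelihood option concrete, a mixed model with a teacher random intercept can be sketched on synthetic data with statsmodels. This is only an illustrative sketch: the variable names and data-generating values are invented, and for brevity it fits a single random intercept at the teacher level rather than a full teacher-within-school specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Synthetic data: students nested in classrooms, one classroom per
# teacher (as in ECLS-K). All names and numbers are invented.
n_teachers, n_students = 50, 20
n = n_teachers * n_students
teacher = np.repeat(np.arange(n_teachers), n_students)
practice = np.repeat(rng.normal(size=n_teachers), n_students)   # a pedagogy measure
tau = np.repeat(rng.normal(scale=0.5, size=n_teachers), n_students)  # teacher effect
lag_score = rng.normal(size=n)
score = 0.8 * lag_score + 0.3 * practice + tau + rng.normal(size=n)

df = pd.DataFrame({"score": score, "lag_score": lag_score,
                   "practice": practice, "teacher": teacher})

# Lag score model with a normally distributed teacher random intercept:
# a two-level version of the mixed/HLM estimators discussed in the text.
m = smf.mixedlm("score ~ lag_score + practice", df, groups=df["teacher"]).fit()
print(m.params[["lag_score", "practice"]])
```

The random intercept absorbs the within-classroom correlation of scores, which is what delivers the efficiency gains (and honest standard errors for the teacher-level `practice` variable) relative to plain OLS.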
The OLS estimator will be consistent if student heterogeneity and the other error components are uncorrelated with the right-hand-side variables (the inputs and, in the lag score model, the lagged achievement score). We can relax that assumption with respect to \delta_s, if need be, by including school dummy variables. In Figure 3.1, this set of choices is represented by the OLS portion of the tree on the left side of the diagram for cross-sectional data. Efficiency gains may be possible using maximum likelihood if we exploit the nested structure of the data (in our case, children within classrooms within schools) and assume that the teacher and school terms (\delta_s and \tau_j) are normally distributed random effects. Such estimation strategies are often referred to as mixed or hierarchical linear models (HLM). This strategy effectively accounts for correlation among test scores for students with the same teacher and within the same school in the estimation of the pedagogy and teacher effects. Here again, we can treat the school effects as fixed rather than random by including school dummy variables and using an HLM estimator with only teacher random effects (see the HLM portion of the tree diagram in Figure 3.1).

If the data are pooled across the grades, OLS can still be used on either model (6) or (7) under the same independence assumptions as in the cross-section case (see the right portion of the figure under "panel"). However, panel data allow us to make use of approaches that deal with the presence of unobserved student heterogeneity.

In the lag score model (6), pooling across grades allows for the elimination of heterogeneity by first-differencing and then instrumenting for the endogenous lagged test score gain with the twice-lagged test score. A common estimator that accomplishes this is the generalized method of moments (GMM) approach described in Arellano & Bond (1991), which we will henceforth refer to as AB.
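The first-difference-and-instrument logic behind AB can be illustrated with a minimal just-identified IV calculation on simulated data. Everything here (the three periods, the data-generating values, the variable names) is invented for illustration; a full AB implementation would stack moment conditions across periods via GMM rather than solve a single IV system.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch of the Arellano-Bond idea with exactly three test scores
# (t = 0, 1, 2), mirroring the ECLS-K setup.
n = 20_000
lam, gamma = 0.7, 0.4               # true decay and input effect
c = rng.normal(size=n)              # unobserved child heterogeneity
z1 = 0.5 * c + rng.normal(size=n)   # inputs correlated with c
z2 = 0.5 * c + rng.normal(size=n)

y0 = 2.0 * c + rng.normal(size=n)
y1 = lam * y0 + gamma * z1 + c + rng.normal(size=n)
y2 = lam * y1 + gamma * z2 + c + rng.normal(size=n)

# First-differencing removes c: dy2 = lam*dy1 + gamma*dz2 + (u2 - u1).
# dy1 is endogenous (it contains u1), so instrument it with the
# twice-lagged level y0, which is uncorrelated with (u2 - u1) as long
# as the u's are serially uncorrelated.
dy2, dy1, dz2 = y2 - y1, y1 - y0, z2 - z1
X = np.column_stack([dy1, dz2])     # regressors
W = np.column_stack([y0, dz2])      # instruments (just-identified)
lam_iv, gamma_iv = np.linalg.solve(W.T @ X, W.T @ dy2)

print(f"lambda: {lam_iv:.2f}  gamma: {gamma_iv:.2f}  (true 0.7, 0.4)")
```

Note how the design leaves \lambda unconstrained: it is estimated as the coefficient on the differenced lagged score rather than imposed at 1.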
This approach not only accounts for unobserved student heterogeneity but also allows us the flexibility of leaving \lambda unconstrained. However, it relies on the assumptions that the error term in equation (6) is serially uncorrelated and that all other inputs satisfy a "strict exogeneity" assumption, namely, that the errors \varepsilon_{ijst} in one time period are uncorrelated with inputs in all other time periods (Wooldridge, 2002, ch. 10, pp. 252-254).

In the gain score model (7), in which \lambda is constrained to equal 1, random effects (RE) or fixed effects (FE) estimators can be used to mitigate problems associated with student heterogeneity. RE assumes that the child-specific heterogeneity and the inputs are uncorrelated, that strict exogeneity holds, and that the error covariance matrix has a particular structure; it may yield efficiency gains over OLS if these assumptions are met (Wooldridge, 2002, ch. 10, pp. 252-254). FE estimation relaxes the assumption of zero correlation between the heterogeneity and the observed right-hand-side variables but maintains strict exogeneity. Note that the strict exogeneity assumption required by RE and FE is violated in model (6) because the lagged test score on the right-hand side is a function of the error term in the previous period. Thus, RE and FE estimators will be inconsistent for (6) and apply only to the gain score model.

The modeling and estimation choices that we have described (18 in total, summarized in Figure 3.1) are those considered in our study. In section 5, we outline a strategy for selecting the appropriate method. But first, we review prior survey-based research pertaining to our research question, with careful attention to the methods chosen in these studies. Divergence of our findings from earlier findings may be traceable, at least in part, to the different choices of methods.

3.
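A small simulation makes the OLS-versus-FE contrast concrete. Under a gain score model with \lambda = 1 and a child effect that is correlated with the observed input (e.g., stronger students sorted to certain practices), pooled OLS is biased while within-student demeaning (FE) sweeps out the heterogeneity. The setup is a sketch with invented numbers, not ECLS-K data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Gain score model with lambda = 1: gains depend on a teacher input z
# and a child effect c that is *correlated* with z.
n, T = 2000, 2                                    # students, periods
c = rng.normal(size=n)                            # unobserved child effect
z = 0.5 * c[:, None] + rng.normal(size=(n, T))    # input correlated with c
gain = 0.4 * z + c[:, None] + rng.normal(size=(n, T))  # true effect = 0.4

# Pooled OLS on the gains: biased upward because c sits in the error
# term and corr(z, c) != 0.
zf, gf = z.ravel(), gain.ravel()
b_ols = np.cov(zf, gf)[0, 1] / np.var(zf)

# Fixed effects via within-student demeaning: c drops out entirely.
zd = z - z.mean(axis=1, keepdims=True)
gd = gain - gain.mean(axis=1, keepdims=True)
b_fe = (zd * gd).sum() / (zd ** 2).sum()

print(f"OLS: {b_ols:.2f}  FE: {b_fe:.2f}  (true effect 0.4)")
```

The same mechanics (demeaning, or equivalently student dummies) underlie the FE estimator applied to the pooled kindergarten and first grade gains in the analysis that follows; RE would instead quasi-demean and remain biased here because corr(z, c) is nonzero.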
Prior survey-based research on the impact of teacher characteristics and teaching practices on student achievement in the early grades

Although a number of prior studies have used survey data to examine the relationship between student achievement and observable teacher characteristics and practices, using a variety of model specifications and estimation methods (see Wayne & Youngs 2003 for a review), relatively few have examined these relationships in the context of early elementary mathematics teaching. Rowan, Correnti, & Miller (2002) estimated a gain score model using HLM with both teacher and school random effects to analyze a longitudinal data set on elementary school children across the US in the early 1990s. They found no effect of teacher certification status or subject-matter preparation on achievement growth in mathematics but found a positive relationship between teaching experience and growth for students going from third to sixth grade and a negative relationship between advanced degrees and growth for students in both the early and upper elementary grades. Among a small set of pedagogical practices examined, only time spent on whole-class instruction was positively related to mathematics achievement.

Other studies focused on links between teacher-reported instructional practices and early mathematics achievement have tended to explore the use of "reform-based" or "standards-based" practices that emphasize problem solving and inquiry. Le et al. (2006), in a study of third grade students in five districts across the U.S. followed longitudinally for three years, estimated a lag score model for achievement using OLS. They found weak and inconsistent relationships between student-centered practices and mathematics achievement. Reform-based practices showed positive effects on measures related to problem solving but negative effects on measures designed to capture a student's grasp of mathematical procedures.
Cohen & Hill (2000) also estimated a lag score model using OLS to study links between teacher responses on a survey administered to elementary teachers in California in 1994 and student achievement on the California Learning Assessment System during the same year. They found that teacher-reported frequency of use of reform-based practices was positively related to mathematics test scores among fourth-graders. Hamilton et al. (2003) conducted a meta-analytic synthesis of teacher survey data from grades three through seven at several National Science Foundation-funded Systemic Reform Initiative sites to investigate the effect of reform-based teaching practices on mathematics and science learning. Their results were based on OLS regressions, although it is not clear whether gain or lag scores were used, and specifications differed across sites depending on the availability of data. They combined several specific practice items into a "reform-based" scale and found small and weak but fairly consistent positive relationships between teachers' use of reform-based practices and student achievement.

In addition to the studies cited above, three prior studies estimated the effects of teacher characteristics and instructional practices on the mathematics achievement of early elementary students using ECLS-K. All three relied on a single cross-section of the ECLS-K for the main analysis, with Guarino, Hamilton, Lockwood, & Rathbun (2006) and Bodovski & Farkas (2007) focusing on kindergarten and Palardy & Rumberger (2008) focusing on first grade. All employ HLM estimation. Guarino et al. (2006), using a gain score model, found no evidence of a direct relationship between the background characteristics of teachers and student achievement but found that spending more time on the subject was associated with relatively large gains in achievement. They constructed instructional practice scales, combining several practice measures into aggregate indexes using factor analysis.
Among the scales designed to capture pedagogical approaches, those describing an emphasis on traditional practices and computation, measurement and advanced topics, advanced numbers and operations, and student-centered instruction (e.g., having students explain how problems were solved) were positively associated with mathematics achievement gains. Bodovski & Farkas (2007) also focused on kindergarten and created instructional practice scales but utilized a lag score model in their analysis. They found that both traditional and interactive approaches, along with practices focused on advanced counting, practical math, and single-digit operations, were related to larger gains in achievement. On the other hand, spending additional time on basic numbers and shapes was associated with lower achievement gains. They also mention obtaining similar results from an unreported fixed effects estimation relying on the pooled kindergarten and first grade data available in the ECLS-K, citing this as evidence that their results approximate causal effects. Palardy & Rumberger (2008) used a lag score model and focused on first grade rather than kindergarten. They found that the use of math worksheets and calendars raised student mathematics achievement, whereas the use of geometric manipulatives lowered it. They restricted their sample to the 30 percent subsample of students who were tested in the fall of first grade and used the fall exam score as an explanatory variable in the analysis.

The relationships found in these survey-based studies cannot be interpreted as causal unless the assumptions underlying the models and estimators used are met. In this paper, we analyze the ECLS-K data using a step-by-step approach to justify our modeling and estimation choices, the goal being to provide estimates with the best claim to causal inference. We then compare our results with those obtained in earlier studies using other methods.

4.
Data

The ECLS-K selected a nationally representative sample of approximately 22,000 children who were enrolled in approximately 1,000 kindergarten programs in the United States during the 1998-99 school year. The children were selected from both public and private kindergartens offering full-day and part-day programs. The sample consisted of children from different racial-ethnic and socioeconomic backgrounds and included an oversample of Asian children and private school kindergartners. The sample design for the ECLS-K was a dual-frame, multistage sample. First, 100 Primary Sampling Units (PSUs) were selected; PSUs were counties or groups of counties. Schools within the PSUs were then selected, public schools from a public school frame and private schools from a private school frame. In the fall of 1998, approximately 23 kindergartners were selected within each of the sampled schools (Tourangeau et al., 2001).

The ECLS-K followed these children at various intervals through eighth grade. Three of the seven available waves of the data are utilized in this study. The first two waves of data were collected in the fall and spring of the 1998-99 kindergarten year, respectively. The third and fourth waves were collected in the fall and spring of the first grade year, but the third wave (fall of the first grade year) was collected on only a relatively small (30 percent) subsample of the children and is therefore not used in this study. For this study, we restrict the data to the fall and spring kindergarten and spring first grade waves. The fifth, sixth, and seventh waves are excluded because they occur after intervals of two or three years; thus the variation in children's learning gains is likely to be only loosely connected to the practices and abilities of contemporaneous teachers. In the three waves selected for the study, we make use of several categories of data: achievement assessments, teacher interviews, and student and family characteristics.
Achievement Assessments

Assessments that included cognitive components were conducted with the sampled children through one-on-one tests administered by trained individuals at each wave. The full achievement assessment used a computer-assisted personal interview and took approximately 50-70 minutes to complete. It included tests of reading and mathematics as well as other components that differed by wave (e.g., general knowledge in the kindergarten wave and science in the third grade). The test was untimed, and the kindergarten test required children to respond verbally or by pointing; no writing was required. Each test was conducted using a two-stage design. The first stage consisted of a routing section that was administered to all students, and the second stage consisted of one of several alternative forms, the choice of which depended on the child's performance on the first stage. Only the assessments in mathematics are utilized in this study. The mathematics assessments had low, middle, and high difficulty second-stage options. The purpose of the adaptive design was to maximize accuracy of measurement and minimize administration time.^6 The content of the mathematics assessments was selected to represent cognitive skills that are typically taught at each stage of development and that are important for the development of later proficiency (Rock & Pollack, 2002). Efforts were made to accommodate children who spoke a language other than English in the kindergarten and first grade assessments. Prior to administering these assessments, a language-screening test, the Oral Language Development Scale (OLDS), was administered to those children identified from their school records (or by their teacher, if no school records were available) as coming from a home in which the primary

^6 See the User's Manual for the ECLS-K Base Year Public-Use Data Files and Electronic Codebook, NCES 2001-029 (U.S.
Department of Education, National Center for Education Statistics, 2001) or the ECLS-K Psychometric Report for Kindergarten through First Grade (Rock & Pollack, 2002) for a more complete description of the assessment procedures. 89 language spoken was not English. Children whose performance exceeded an established cut score on the OLDS received the full English direct assessment in mathematics. Students who did not pass the OLDS but who spoke Spanish were given a translated form of the mathematics assessment. Various methods were used to confirm that the psychometric properties of the Spanish mathematics assessment were comparable to those for the English version (Rock & Pollack, 2002). Three types of scores were reported for each test: (1) the number of questions answered correctly on the first-stage routing test, (2) item response theory (IRT) scale scores, and (3) standardized (t-scale) scores. The most appropriate of these for the purpose of this study are the IRT scores, because IRT scores are designed to make it possible to calculate scores that can be compared regardless of which second-stage form a child took in the adaptive test. They compensate for the possibility of a low-ability student guessing several items correctly. In addition, they make possible longitudinal measure of gains in achievement over time, even though the tests administered are not identical at each point (Tourangeau et al., 2001). Teacher-level variables Information on the teachers in both kindergarten and first grade was gathered in a set of selfadministered paper-and-pencil questionnaires that included questions about their backgrounds and instructional practices. Background characteristics used in this study consisted of indicators for race/ethnicity, teaching experience, certification, educational attainment, and completion of 7 courses in methods of teaching mathematics. 
Other relevant variables consisted of time spent on preparation and, most importantly, a set of instructional practices described in the next section.

[7] Based on exploratory analysis, (i) we trichotomized years of teaching experience as 0-4, 4-10, and 10+ years; (ii) we dichotomized coursework as 0-2 versus 3+ courses.

Instructional Practices

The spring teacher questionnaires include sets of items that address instructional practices in mathematics. The items address a wide range of practices that may occur in classrooms in the early grades and were selected to align with the skills tapped by the ECLS-K achievement assessments. Both the kindergarten and first grade teachers were asked very similar questions regarding their instructional practices; thus we were able to construct nearly identical sets of practices that apply to the two time periods. Specific pedagogical practices are listed as items in the ECLS-K teacher questionnaire under the question: "How often do children in the class do each of the following math activities?" The kindergarten teacher questionnaire includes 17 activities representing different pedagogical modalities. The first grade teacher questionnaire included the same items, with very few differences.[8] We code teacher responses on all of these items to reflect days per month.[9]

In addition to these items, we include a measure of time spent on mathematics in our analyses. Teachers were asked how often they teach mathematics and how much time they spend on the subject on the days they teach it. We combined the responses to both questions to estimate the total hours per week a teacher reported spending on mathematics. In addition, teachers were asked the extent to which they utilized divided achievement grouping, without special reference to mathematics. This was coded as hours per week the students spent in such groups.

[8] The first grade questionnaire adds "Work on problems for which there are several appropriate methods or solutions" and "Do worksheet or workbook page emphasizing routine practice or drill" to the 17 kindergarten pedagogy questions.

[9] We code the response categories for mathematics activities using what is essentially interval-midpoint scaling: "never" → 0 days per month; "once a month or less" → 1 day per month; "two or three times a month" → 2.5 days per month; "once or twice a week" → 6 days per month; "three or four times a week" → 14 days per month; "daily" → 20 days per month. The metric assumes a standard of four weeks in a month and five working days per week.

Content Coverage

In addition to the pedagogical variables, the surveys contain several items relating to content coverage. The stem question is: "How often is each of the following math skills taught in the class?" and 29 skill or content areas are then listed. We use these items as control variables so that we can isolate the effect of pedagogical techniques while holding constant content emphases that might align to a greater or lesser degree with the tests. They are recoded in a manner similar to that of the pedagogy variables.[10] Appendix Table 3.5 displays descriptive statistics for the teacher variables included in the model.

Classroom Characteristics

Teachers were also asked to describe demographic characteristics of their classes. Reported are class size and the percentages of children of different racial-ethnic groups and of children with disabilities. We include these in our analyses to control for differential teacher responses to variation in classroom composition.

Child and Family Variables

Student-specific variables used in our analyses include the number of days elapsed between tests and indicators for disability status, attending a full-day kindergarten, whether the child is repeating kindergarten, and whether the child takes the test in Spanish.
In addition, we include several controls that capture socioeconomic status (e.g., parent education and income), family behaviors (e.g., parental involvement, the number of extracurricular activities in which the child engages, how often the child reads, and the number of educational activities in which the child participates at home), and family structure (e.g., single parent, number of siblings). The ECLS-K provides a rich set of such variables, allowing us to capture child effects that are often omitted in administrative data. The full set of these controls and their descriptive statistics are also included in Appendix Table 3.5.

[10] The response categories for the skills (content) items are the same as those for the activities (pedagogical modalities) items, except that the "never" category was named "not taught" and expanded into "taught at a higher grade level" and "children should already know." We code the "not taught" categories as 0 times per month.

Sample Adjustments

Although item non-response for the variables in our study is typically low (e.g., around one to two percent for the pedagogy practices, as shown in Appendix Table 3.6), the combined effect of missing item responses across all variables leads to a sample decrease of more than 60 percent in kindergarten and more than 65 percent in first grade. This drop in the number of observations hinders our ability to estimate the pedagogy and teacher characteristic effects with precision. To counter the loss of information stemming from item non-response, we used Royston's (2004) Stata implementation of chained multiple imputation (Van Buuren, Boshuizen & Knook, 1999) to impute missing values for all variables except student test scores. Forty imputed data sets were produced for each of three types of data: pooled data from both kindergarten and first grade, data from kindergarten only, and data from first grade only.
The 40 pooled data sets were each composed of 21,232 student-year observations representing students with test scores at all three waves. The separate kindergarten and first grade cross-sectional imputed data sets were composed of 16,356 and 11,780 student observations, respectively, including students with non-missing current and lagged test scores only in the particular grade imputed. Post-imputation estimation was carried out using Stata routines influenced by Carlin, Galati & Royston (2008). Following ECLS-K guidelines, we used sampling weights supplied with the data in the imputation to better approximate the initial population.

5. Methods

Here we outline a decision-making process to choose among the 18 model/estimation choices described in Section 2 and illustrated in Figure 3.1. To decide among these alternatives, we undertake a multi-step investigation. Each step pertains to one of the decisions needed: lag score versus gain score, cross-section versus panel, and ultimately choosing an estimator.

Gain Score vs. Lag Score

Deciding between a gain score or lag score model amounts to testing whether the assumption λ=1 in equation (6) is justifiable. In addition to simply observing the coefficient on the lagged test score when estimating equation (6), we make use of a test proposed by Harris, Sass, & Semykina (2011). Applied to our problem, this test amounts to testing the joint significance of a set of variables representing the first lags of all the inputs when they are added to the gain score equation (7) for first grade. Formally, this is a test of the null hypothesis that the coefficients on all of the lagged inputs are jointly equal to zero. The test is motivated by the idea that the lagged score, with its coefficient constrained to equal one, effectively serves as a sufficient statistic for past inputs. Therefore, if the gain score approach properly controls for past inputs, the included lagged inputs should not be statistically significant.
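The logic of this specification test can be illustrated with a small simulation (a hypothetical sketch with made-up data, not the ECLS-K sample or the estimation code used in this chapter): when the true persistence parameter λ is below one, the gain score restriction leaves part of the lagged score, and hence the lagged input, in the error term, and the lagged input enters the gain score regression significantly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated two-period data in which the true persistence parameter
# lambda is 0.75, so the gain score restriction (lambda = 1) is false.
x0 = rng.normal(size=n)                          # kindergarten input
x1 = rng.normal(size=n)                          # first grade input
y0 = 2.0 * x0 + rng.normal(size=n)               # kindergarten score
y1 = 0.75 * y0 + 2.0 * x1 + rng.normal(size=n)   # first grade score

# Gain score regression with the lagged input added:
# regress (y1 - y0) on [constant, x1, x0]; under lambda = 1,
# the coefficient on x0 should be zero.
X = np.column_stack([np.ones(n), x1, x0])
g = y1 - y0
beta = np.linalg.lstsq(X, g, rcond=None)[0]

# t statistic on the lagged input (with one lag, the joint F equals t^2)
resid = g - X @ beta
s2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[2, 2])
t_lag = beta[2] / se
```

Because λ = 0.75 here, the omitted portion of the lagged score loads onto x0 (the implied coefficient is (0.75 - 1) x 2 = -0.5), and the test rejects sharply; had the data been generated with λ = 1, t_lag would be centered near zero.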
It should be noted, however, that since we observe only two grades, this test can be carried out only for first grade, because no lags of the inputs are available for kindergarten.

Cross Section vs. Panel

The decision to use panel data versus separate cross-sections in our analyses is based on three considerations: whether the impact of teacher characteristics and practices changes across grades, whether there are precision gains due to increased sample sizes, and whether there is a need for panel data methods to eliminate time-invariant unobserved child heterogeneity, which may bias estimates of teacher effects related to PDG and TC.

To investigate whether the impact of teaching practices varies across grades, we estimate equation (6) or (7) (depending on whether we choose a gain score or a lag score model) using the pooled data and including interactions between a grade dummy and all teacher characteristics and practices. If several characteristics and practices interact significantly with grade, it would indicate either that the interactions should be included in any panel analyses or that cross-sectional regressions should be run separately by grade.

Possible precision gains from using the panel, due to the increased number of observations, might be a reason to prefer the panel regressions, but such gains are likely to be quite small. As mentioned in the previous section, the grade-specific sample sizes in the ECLS-K (N=16,356 for kindergarten and N=11,780 for grade 1) are likely large enough to mitigate concerns over lack of precision.

The most important driver of the decision between the panel and the cross-sections is the question of whether unobserved child-specific heterogeneity, ci, is likely to bias the teacher effect estimates. Only panel data methods (i.e., FE in the gain score model and AB in the lag score model) address this problem by eliminating the effects of heterogeneity.
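To fix ideas, the following minimal simulation (hypothetical data, not the ECLS-K) shows how the within (fixed effects) transformation sweeps out a time-invariant child effect ci that pooled OLS confounds with the regressor when exposure is correlated with the child effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n, t = 2000, 2

c = rng.normal(size=n)                               # unobserved child effect
x = 0.8 * c[:, None] + rng.normal(size=(n, t))       # exposure correlated with c
y = 1.0 * x + c[:, None] + rng.normal(size=(n, t))   # true effect of x is 1.0

# Pooled OLS (no intercept needed; all variables have mean zero by
# construction) is biased upward because x picks up the omitted effect c.
b_ols = (x.ravel() @ y.ravel()) / (x.ravel() @ x.ravel())

# Fixed effects: subtracting each child's own mean over time (the within
# transformation) removes c entirely, restoring the true slope.
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
b_fe = (xd.ravel() @ yd.ravel()) / (xd.ravel() @ xd.ravel())
```

Here b_ols lands well above the true value of 1.0, while b_fe recovers it; the cost, as discussed above, is that the within transformation also discards all between-child variation in the teacher variables.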
Assessing the influence of heterogeneity is not straightforward, and the tools and tests at our disposal provide only suggestive evidence of the extent of the problem. Nevertheless, we can make use of the following set of procedures.

A common approach to assessing the influence of unobserved heterogeneity on the effect estimates is to use a Hausman (1978) test, in which the coefficients from random and fixed effects estimators are compared, or a related variable addition test suggested by Wooldridge (2002, ch. 10, pp. 288-291).[11] It should be noted, however, that in our achievement regression the RE and FE estimators are consistent only for the gain score model; thus we can seek evidence from these tests only under the assumption that λ=1. Therefore, if our results from step one indicate that the gain score model is not a viable option, these tests cannot be viewed as reliable. Similarly, both estimators require strict exogeneity of the included regressors, and a violation of this assumption limits the reliability of this test.

Another approach to assessing the influence of unobserved heterogeneity, applicable in both the gain score model and the lag score model, is to investigate how observable child and family characteristics vary across different levels of exposure to particular teacher characteristics (TC) or teaching practices (PDG). To do so, we adapt the common balance-of-covariates approach, in which the mean observable characteristics across levels of TC or PDG are compared.
First we explore whether the variables in TC or PDG (which we can refer to as "treatments") vary with the child-specific observables, namely prior achievement and the variables contained in Xijst, in the following regression:

(8) Teacher variable_it = τ_t + β Y_{i,t-1} + ξ X_{it} + ε_{it}  [12]

If a test of joint significance of all the right-hand-side variables in (8)—i.e., a test of the null that all β and ξ are zero—does not reject, then there is strong evidence that treatment is unrelated to these observed child characteristics. In this case, we have more confidence that selection on unobservables is also negligible. If instead the test rejects, then we conclude that exposure to treatment differs systematically across child characteristics and, unless we believe that our set of observables is so complete as to eliminate anything that might be left in ci (e.g., individual intelligence or motivation) that is correlated with treatment, we might continue to worry about unobserved heterogeneity.

[11] This is accomplished by estimating the gain score equation (7) with the addition of variables representing the within-child average values across time of all the time-varying variables. If a joint significance test of all the time averages rejects, it provides evidence that the random effects assumption is not justified.

[12] For computational feasibility, when the dependent variable is dichotomous, we use a linear probability model with discrete regressors.

If the test rejects, a possible reason might be that the systematic variation of treatment across child and family characteristics is due to the sorting of children with particular characteristics into particular schools.
To verify this, we can run the following regression, which includes both the child variables and school dummies, and again test the joint significance of the child variables:

(9) Teacher variable_it = τ_t + β Y_{i,t-1} + ξ X_{it} + δ_s + v_{it}

If the inclusion of school dummies removes the significance of the child variables, then we can assume that treatments are distributed randomly across child characteristics within schools. This would offer evidence that, at least within schools, selection on observables is negligible. If we again rely on the assumption that selection on the unobservables, ci, should be no more threatening than that related to our rich set of observables, then we can argue that panel data methods that remove ci are unnecessary and that the inclusion of school dummies in either the cross-sectional or panel regressions will suffice to remove this threat. If instead evidence of unobserved heterogeneity is non-negligible, then the panel will be needed, at the cost of requiring the additional assumptions discussed in Section 2 for consistent estimation of the treatment effects.

Choosing an Estimator

The above analyses provide a strategy for choosing between the lag and the gain score model and for deciding whether to pool or separate the data across grades. After gathering the information from these analyses, we can substantially narrow down the set of 18 choices illustrated in Figure 3.1. The gain score/lag score decision, coupled with the cross-section/panel decision, will lead to a small number of estimators for further consideration. In addition, the assignment-to-treatment analysis helps determine whether or not to include school dummies. At this point, if more than one model/estimation method remains viable, we will compare their findings and discuss the similarities and differences found.
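The assignment-to-treatment checks in equations (8) and (9) above can be sketched as follows (a simulation with made-up data and variable names, not the actual ECLS-K implementation): when a teacher-level "treatment" varies only between schools and children sort into schools on an observable such as prior achievement, the child covariate predicts treatment in the equation (8) regression but loses its explanatory power once school dummies enter, as in equation (9).

```python
import numpy as np

rng = np.random.default_rng(2)
n_sch, per = 100, 30
n = n_sch * per
school = np.repeat(np.arange(n_sch), per)

# Children sort into schools on a covariate (e.g., prior score), and the
# teacher practice here varies only between schools.
sch_mean = rng.normal(size=n_sch)
prior = sch_mean[school] + rng.normal(size=n)
practice = 2.0 * sch_mean[school] + rng.normal(size=n)

def r2(X, y):
    """R-squared from an OLS fit of y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return 1 - (e @ e) / ((y - y.mean()) @ (y - y.mean()))

# Equation (8) analogue: practice on the child covariate alone.
r2_no = r2(np.column_stack([np.ones(n), prior]), practice)

# Equation (9) analogue: add school dummies, then ask whether the child
# covariate still adds explanatory power (incremental F test).
D = (school[:, None] == np.arange(n_sch)).astype(float)
r2_sch = r2(D, practice)                               # dummies only
r2_full = r2(np.column_stack([D, prior]), practice)    # dummies + covariate

q, k = 1, n_sch + 1
F_within = ((r2_full - r2_sch) / q) / ((1 - r2_full) / (n - k))
```

In this simulation the covariate alone explains a large share of the variation in the practice (r2_no is far from zero, so the equation (8) test would reject), while the incremental F statistic after school dummies is close to its null distribution, mirroring the within-school random assignment pattern the text describes.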
Gain Score (λ=1)
    Cross Section: OLS (w/o Sch FE; w/ Sch FE); HLM (Tch RE & Sch RE; Tch RE & Sch FE)
    Panel: OLS (w/o Sch FE; w/ Sch FE); FE (w/o Sch FE; w/ Sch FE); RE (w/o Sch FE; w/ Sch FE)
Lag Score
    Cross Section: OLS (w/o Sch FE; w/ Sch FE); HLM (Tch RE & Sch RE; Tch RE & Sch FE)
    Panel: OLS (w/o Sch FE; w/ Sch FE); AB (w/o Sch FE; w/ Sch FE)

Figure 3.1: Tree Diagram of Possible Model/Estimation Strategies

6. Results

Gain Score vs. Lag Score: Results

To choose between using the lag score equation (6) or the gain score equation (7), we first visually inspect the coefficient on prior test scores in (6) and then apply the test proposed by Harris, Sass, & Semykina (2011). When we use OLS to estimate (6) with the panel, our estimate of λ is 0.7505; when we estimate it with the cross-sections, the estimate of λ is 0.9070 for kindergarten and 0.6998 for first grade. All three estimates are statistically different from 1, with the p-value equal to 0.0000 in each case for the test of the null hypothesis that λ=1. For the formal test, we estimate equation (7) and include the lagged inputs in the first grade specification. We find the lagged inputs to be statistically significant when included in equation (7), with a p-value for the joint test of significance of 0.0005. Given these results, we choose the lag score specification found in equation (6).

Cross Section vs. Panel: Results

To investigate whether teaching practices should be allowed to vary across grades, we estimated equation (6) using the pooled data and including interaction terms between the grade dummy and all the teacher practices and characteristics. The results are shown in Table 3.1. Due to the large number of controls included in the model, we display only those relevant to our research question, i.e., the teaching practices and teacher characteristics. We find significant interactions with grade for a small number of teacher characteristics and teaching practices.
Postgraduate education matters more in predicting higher achievement in first grade than in kindergarten. Working with counting manipulatives predicts lower achievement in first grade than in kindergarten—outweighing the positive main effect. Overall, we find that pooling the data and constraining the coefficients to be the same across grades can obscure grade-specific relationships in enough instances to make it helpful to separate our regressions by grade.

Table 3.1: OLS Estimates of Lag Score Specification, Pooled with Grade Interactions, with School Dummies

                                                         Base         1st Grade
                                                                      Interaction
Teacher Characteristics
Black                                                  -0.3012        -0.4136
                                                       (0.5627)       (0.7835)
Other                                                  -0.1369         0.0286
                                                       (0.4828)       (0.6619)
Hispanic                                               -0.4883         0.7102
                                                       (0.5501)       (0.8390)
Novice                                                 -0.1978         0.1326
                                                       (0.2303)       (0.3360)
Very experienced                                       -0.5538***      0.4035
                                                       (0.2018)       (0.3001)
Above BA                                                0.1938         1.9183**
                                                       (0.1999)       (0.8159)
Masters or above                                        0.1571         2.1143**
                                                       (0.2151)       (0.8216)
Regular certification                                  -0.3349        -0.1044
                                                       (0.2591)       (0.3528)
Courses in methods for teaching math                    0.0142         0.0365
                                                       (0.1567)       (0.2191)
Paid preparatory time                                   0.0514        -0.0789
                                                       (0.1994)       (0.2572)
Non-paid preparatory time                               0.1030        -0.0463
                                                       (0.1525)       (0.1994)
Pedagogy Practices
Hours per week teaching math                            0.0208         0.0073
                                                       (0.0408)       (0.0648)
Hours per week in math groups                          -0.0633         0.0486
                                                       (0.0765)       (0.1009)
Count out loud                                         -0.0341*        0.0132
                                                       (0.0202)       (0.0239)
Work with geometric manipulatives                      -0.0333**       0.0379*
                                                       (0.0139)       (0.0224)
Work with counting manipulatives to learn               0.0544***     -0.0680***
  basic operations                                     (0.0175)       (0.0221)
Play mathematics-related games                         -0.0220         0.0140
                                                       (0.0145)       (0.0201)
Use a calculator for math                               0.0219         0.0029
                                                       (0.0327)       (0.0476)
Use music to understand mathematics concepts           -0.0016        -0.0319
                                                       (0.0144)       (0.0313)
Use creative movement or drama to understand           -0.0226        -0.0012
  mathematics concepts                                 (0.0189)       (0.0315)
Work with rulers, measuring cups, spoons, or           -0.0153        -0.0284
  other measuring instruments                          (0.0202)       (0.0298)
Explain how a mathematics problem is solved             0.0144         0.0181
                                                       (0.0128)       (0.0183)
Engage in calendar-related activities                   0.0162        -0.0069
                                                       (0.0244)       (0.0285)
Do mathematics worksheets                               0.0217        -0.0075
                                                       (0.0133)       (0.0181)
Do mathematics problems from the textbook               0.0297**      -0.0082
                                                       (0.0135)       (0.0162)
Complete mathematics problems on the chalkboard         0.0359**      -0.0202
                                                       (0.0141)       (0.0180)
Solve mathematics problems in small groups or           0.0001         0.0139
  with a partner                                       (0.0154)       (0.0224)
Work on mathematics problems that reflect               0.0043         0.0045
  real-life situations                                 (0.0149)       (0.0208)
Work in mixed achievement groups                       -0.0074         0.0052
                                                       (0.0113)       (0.0155)
Peer tutoring                                          -0.0037         0.0118
                                                       (0.0121)       (0.0180)

Observations                                            21,232
Standard errors clustered at the school level in parentheses; estimation with 40 imputed data sets.
*** p<0.01, ** p<0.05, * p<0.1

To assess whether there is evidence of selection on observables, we estimate equations (8) and (9), i.e., the treatment regressions. The p-values for the joint F-tests for these regressions are presented by grade in Table 3.2. For equation (8), our results show that for most of the teaching practices and teacher characteristics we consider, the child variables are found to be jointly significant, indicating that treatments are not randomly distributed across these observable characteristics. However, once we condition on the school attended by including school dummies in equation (9), the child variables are jointly significant in only one case: having a Hispanic teacher.[13] In effect, all other teacher characteristics and instructional practices appear to be randomly assigned to students with different observable characteristics within schools. This offers evidence that selection on observables is negligible within schools. Because of the rich set of observables these data provide, it can be argued that selection on unobservables should also be negligible.
Thus, panel data methods that remove child effects (i.e., FE and AB) may be overly cautious—with the first differencing removing too much variation—if we already control for school differences with these data. Using AB instead of a simpler estimator with school dummies will likely increase standard errors due to the lost variation in the data and reduce our ability to detect significant effects. Furthermore, AB requires additional assumptions, such as no serial correlation in equation (6) and strict exogeneity of the other inputs.[14] By eschewing AB, we avoid relying on these assumptions to obtain consistent estimates.

[13] Further investigation (not shown in a table) revealed that some racial/ethnic matching may take place within schools, which led us to include interactions between teacher and child race categories in our achievement regressions.

[14] AB also requires that the coefficient on the lagged dependent variable not be close to one. In that case, it breaks down due to a weak instrument problem.
Table 3.2: Random Assignment Test p-values: Joint Significance of Child Characteristics and Lagged Score

                                                1st Grade                Kindergarten
                                           Without      With         Without      With
                                           School       School       School       School
Dependent Variable                         Dummies      Dummies      Dummies      Dummies

Pedagogy Practices
Hours per week teaching math               0.0000       0.6841       0.1145       0.9220
Hours per week in math groups              0.0000       0.1051       0.0000       0.9683
Count out loud                             0.0011       0.2916       0.0000       0.8087
Work with geometric manipulatives          0.0000       0.4642       0.0000       0.9678
Work with counting manipulatives to
  learn basic operations                   0.0003       0.2363       0.0004       0.3322
Play mathematics-related games             0.0928       0.1510       0.3604       0.8725
Use a calculator for math                  0.3464       0.9923       0.0001       0.9976
Use music to understand mathematics
  concepts                                 0.0008       0.2787       0.0136       0.9748
Use creative movement or drama to
  understand mathematics concepts          0.0000       0.6779       0.0000       0.8972
Work with rulers, measuring cups,
  spoons, or other measuring
  instruments                              0.0046       0.2788       0.2348       0.2008
Explain how a mathematics problem is
  solved                                   0.0000       0.7656       0.2025       0.2329
Engage in calendar-related activities      0.0170       0.3226       0.0002       0.9661
Do mathematics worksheets                  0.0000       0.9823       0.2742       0.6238
Do mathematics problems from the
  textbook                                 0.0000       0.9489       0.0000       0.8261
Complete mathematics problems on the
  chalkboard                               0.0000       0.9054       0.0000       0.9381
Solve mathematics problems in small
  groups or with a partner                 0.0000       0.9257       0.0048       0.8505
Work on mathematics problems that
  reflect real-life situations             0.0200       0.9708       0.0209       0.9313
Work in mixed achievement groups           0.0033       0.3245       0.0000       0.7647
Peer tutoring                              0.0037       0.6531       0.0000       0.9711
Do worksheets or workbook page
  emphasizing routine practice or drill    0.0025       0.9871        n/a          n/a
Work on problems for which there are
  several solutions                        0.0000       0.6069        n/a          n/a

Teacher Characteristics
Black                                      0.0000       0.4198       0.0000       0.7903
Other                                      0.0017       0.8295       0.0004       0.5561
Hispanic                                   0.0000       0.0002       0.0000       0.0105
Novice                                     0.0762       0.5053       0.0014       0.9622
Very experienced                           0.0026       0.3906       0.0012       0.9528
Above BA                                   0.0000       0.8872       0.0001       0.7972
Masters or above                           0.0000       0.8701       0.0000       0.7798
Regular certification                      0.0000       0.4256       0.0000       0.9374
Courses in methods for teaching math       0.2598       0.5452       0.0660       0.8519
Paid preparatory time                      0.0000       0.9813       0.0000       0.9950
Non-paid preparatory time                  0.0004       0.6177       0.0089       0.9681

As a result of these analyses, we narrow down the model/estimation choices in Figure 3.1 to those that use cross-section data (and thus preserve maximum flexibility) and include school dummies. Note that previous studies relying on HLM techniques (e.g., Guarino et al., 2006; Bodovski & Farkas, 2007; Palardy & Rumberger, 2008; Rowan, Correnti, & Miller, 2002) assumed random school effects, which are uncorrelated with the teacher characteristics and practices experienced by the child, rather than fixed school effects, which do not impose this assumption. The above analyses suggest that this assumption is unlikely to hold in the ECLS-K, leading estimators with random school effects to be inconsistent.

Choosing an Estimator: Results

At this point, the remaining estimation choices among those outlined in Figure 3.1 are OLS and HLM on the lag score model with school dummy variables. Table 3.3 presents results for teacher characteristics and teaching practices using these estimators side by side. Certain teacher characteristics and practices—though relatively few—show evidence of producing achievement effects. The two methods produce very similar results.[15] The main difference between the two approaches lies in the computation of the standard errors. We used cluster-robust standard errors at the school level in the OLS regressions, producing standard errors that tended to be slightly more conservative.

[15] In fact, the coefficients themselves are more or less identical for OLS and HLM in kindergarten due to the fact that the estimated unexplained variance in achievement due to differences in teachers within schools is very close to zero. In first grade, this variance component is somewhat larger.
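The school-clustered standard errors used in the OLS regressions can be sketched with a generic CR1 sandwich estimator (a numpy illustration on simulated data, not the Stata estimation actually used for this chapter): when both the regressor and the error share school-level components, the clustered standard error exceeds the classical one, and ignoring the clustering would overstate precision.

```python
import numpy as np

rng = np.random.default_rng(3)
n_sch, per = 50, 20
n = n_sch * per
g = np.repeat(np.arange(n_sch), per)     # school identifier

# Both the regressor and the error contain a school-level component,
# so observations within a school are correlated.
x = rng.normal(size=n_sch)[g] + 0.5 * rng.normal(size=n)
e = rng.normal(size=n_sch)[g] + rng.normal(size=n)
y = 0.5 * x + e                          # true slope = 0.5

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
u = y - X @ b

# Classical variance assumes independent errors.
v_iid = (u @ u / (n - 2)) * XtX_inv

# Cluster-robust sandwich: sum score outer products school by school.
meat = np.zeros((2, 2))
for s in range(n_sch):
    score = X[g == s].T @ u[g == s]
    meat += np.outer(score, score)
adj = (n_sch / (n_sch - 1)) * ((n - 1) / (n - 2))   # CR1 small-sample factor
v_cl = adj * XtX_inv @ meat @ XtX_inv

se_iid = np.sqrt(v_iid[1, 1])
se_cl = np.sqrt(v_cl[1, 1])
```

In this design se_cl comes out several times larger than se_iid, which is the direction of the correction described in the text, though in the actual estimates the clustered standard errors were only slightly more conservative.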
Table 3.3: Main Model and Estimation Results

Estimation and Modeling Choices
Estimation Method                          OLS          MLE/HLM      OLS          MLE/HLM
Specification                              Lag          Lag          Lag          Lag
School Effect                              Fixed        Fixed        Fixed        Fixed
Teacher Effect                             None         Random       None         Random
Grades                                     KG           KG           G1           G1

Teacher Characteristics
Black                                       0.1236       0.1236      -0.8861      -0.5563
                                           (0.4491)     (0.3879)     (0.6216)     (0.4933)
Other                                      -0.0395      -0.0395       0.3187       0.453
                                           (0.3962)     (0.3331)     (0.5209)     (0.3942)
Hispanic                                   -0.0002      -0.0002       0.245        0.6662
                                           (0.4823)     (0.4108)     (0.7377)     (0.6330)
Novice                                     -0.1474      -0.1474       0.3012       0.2489
                                           (0.2128)     (0.1767)     (0.2790)     (0.2252)
Very experienced                           -0.1395      -0.1395      -0.0922      -0.0997
                                           (0.1897)     (0.1499)     (0.2383)     (0.1957)
Above BA                                   -0.1015      -0.1015       4.066**      3.7029**
                                           (0.1813)     (0.1641)     (2.0097)     (1.5919)
Masters or above                            0.1616       0.1616       4.1721**     3.8633**
                                           (0.1871)     (0.1635)     (1.9963)     (1.5822)
Regular certification                      -0.4326*     -0.4326**    -0.209       -0.2528
                                           (0.2462)     (0.2042)     (0.3037)     (0.2489)
Courses in methods for teaching math       -0.2555*     -0.2555**     0.0069      -0.0291
                                           (0.1434)     (0.1261)     (0.1879)     (0.1538)
Paid preparatory time                       0.0041       0.0041       0.0072      -0.0333
                                           (0.1862)     (0.1556)     (0.2297)     (0.1956)
Non-paid preparatory time                   0.1785       0.1785       0.1188       0.1176
                                           (0.1551)     (0.1250)     (0.2054)     (0.1595)
Pedagogy Practices
Hours per week teaching math                0.0138       0.0138       0.0171       0.012
                                           (0.0385)     (0.0327)     (0.0657)     (0.0594)
Hours per week in math groups              -0.0114      -0.0114      -0.0733      -0.0799
                                           (0.0722)     (0.0617)     (0.0786)     (0.0615)
Count out loud                             -0.0247      -0.0247      -0.024       -0.0274**
                                           (0.0189)     (0.0156)     (0.0166)     (0.0133)
Work with geometric manipulatives          -0.0096      -0.0096       0.0084       0.0081
                                           (0.0141)     (0.0112)     (0.0192)     (0.0153)
Work with counting manipulatives to         0.0323**     0.0323**    -0.0162      -0.0174
  learn basic operations                   (0.0152)     (0.0129)     (0.0180)     (0.0150)
Play mathematics-related games             -0.017       -0.017       -0.0152      -0.0107
                                           (0.0122)     (0.0107)     (0.0183)     (0.0155)
Use a calculator for math                  -0.0021      -0.0021      -0.0016      -0.0058
                                           (0.0299)     (0.0240)     (0.0358)     (0.0304)
Use music to understand mathematics         0.0049       0.0049      -0.0355      -0.0306
  concepts                                 (0.0127)     (0.0109)     (0.0289)     (0.0236)
Use creative movement or drama to          -0.0264      -0.0264*      0.0091       0.0141
  understand mathematics concepts          (0.0163)     (0.0136)     (0.0271)     (0.0235)
Work with rulers, measuring cups,          -0.0018      -0.0018      -0.027       -0.0213
  spoons, or other measuring               (0.0177)     (0.0149)     (0.0274)     (0.0217)
  instruments
Explain how a mathematics problem is       -0.0026      -0.0026       0.0494***    0.0489***
  solved                                   (0.0114)     (0.0101)     (0.0165)     (0.0130)
Engage in calendar-related activities       0.0123       0.0123      -0.0027      -0.0023
                                           (0.0237)     (0.0204)     (0.0196)     (0.0162)
Do mathematics worksheets                   0.0206       0.0206*      0.0078       0.0066
                                           (0.0127)     (0.0109)     (0.0177)     (0.0144)
Do mathematics problems from the            0.0088       0.0088       0.0275*      0.0272**
  textbook                                 (0.0145)     (0.0127)     (0.0143)     (0.0115)
Complete mathematics problems on the        0.0280**     0.0280**     0.0172       0.0174
  chalkboard                               (0.0127)     (0.0117)     (0.0142)     (0.0113)
Solve mathematics problems in small        -0.0055      -0.0055       0.0114       0.0084
  groups or with a partner                 (0.0141)     (0.0117)     (0.0201)     (0.0157)
Work on mathematics problems that           0.0003       0.0003      -0.0211      -0.0217
  reflect real-life situations             (0.0128)     (0.0106)     (0.0177)     (0.0143)
Work in mixed achievement groups            0.0063       0.0063       0.0102       0.0123
                                           (0.0106)     (0.0086)     (0.0141)     (0.0115)
Peer tutoring                              -0.0096      -0.0096       0.0164       0.0187
                                           (0.0118)     (0.0096)     (0.0159)     (0.0135)
Do worksheets or workbook page                                        0.005        0.0021
  emphasizing routine practice or drill                              (0.0173)     (0.0141)
Work on problems for which there are                                  0.0237       0.0286**
  several solutions                                                  (0.0171)     (0.0141)

Observations                                16,356       16,356       11,780       11,780

Source: ECLS-K. Standard errors clustered at the school level in parentheses; estimation with 40 imputed data sets. *** p<0.01, ** p<0.05, * p<0.1. All specifications include class size, class racial percentages, teacher content practices, family welfare status, prior test score, time between exams, household size, number of siblings, level of parental involvement, how often the child reads, participation in extracurricular activities, number of children's books;
indicators for completing the Spanish math exam, student disability, mother's education level, household income level, mother's employment status, single parent family, father absent, pay tuition, child's race, speak non-English home language.

Teacher characteristics show different effects in kindergarten and first grade, as expected given our earlier investigation of interaction terms. In kindergarten, we find evidence that having a teacher who is certified reduces achievement by approximately .43 IRT scale points (a small effect of about 1/25 of a standard deviation[16]) and that having a teacher who has taken more than two courses in methods of teaching mathematics reduces achievement by a little more than half that amount. In first grade, we find that certification and coursework do not matter one way or the other, but that training in the form of advanced degrees has a positive effect: having a teacher with postgraduate education raises test scores by roughly four IRT scale points, i.e., more than one-third of a standard deviation. The differences in the effects of background training and education between kindergarten and first grade—discussed further in the next section—are noteworthy.

[16] As shown in Appendix Table 3.5, the standard deviation in test scores is 8.84 in kindergarten and 8.65 in first grade.

Small but for the most part strongly significant effects emerged in both grades with respect to teaching practices, and, again, the effects of specific practices differ across grades. In kindergarten, teachers who emphasize the use of counting manipulatives and the chalkboard have a positive impact on achievement. The effects are small: using either of these practices for an additional 10 days per month will raise test scores by less than 1/25 of a standard deviation in IRT scale points.
The smaller standard errors in the HLM regression also provide weak evidence that using math worksheets has a small positive effect and that using creative movement to teach math has a small negative effect. In first grade, teachers who spend more time engaging students in explaining how a mathematics problem is solved raise test scores by approximately .05 IRT scale points per additional day of use per month, and the coefficient is highly significant. Although somewhat larger than the other coefficients, it indicates that using this practice an extra 10 days per month will raise achievement by 1/20 of a standard deviation—still a modest effect. A smaller and weakly significant positive effect is detected for doing mathematics problems from a textbook. In addition, the HLM regression provides some evidence that working on problems for which there are several solutions has a positive impact on achievement, a finding that complements the strong finding for explaining math problems.

Discussion

If we assert that the findings are causal, some policy implications emerge. Even though kindergarten and first grade represent closely spaced points on a continuum of early childhood education, we find that few characteristics and practices have a consistent effect in both grades. Our results could be interpreted as suggesting that training and pedagogy geared toward analysis or explanation is appropriate for first grade but not necessarily for kindergarten. There are three possible explanations for this. One is that the ECLS tests measure very different constructs in the two grades. However, given that the tests are constructed by the same assessment teams so as to provide a certain amount of continuity, this seems unlikely. A second is that curricular differences across the two grades may be such that curricula align more closely with standardized tests in first grade. However, given the large set of content coverage controls in our models, this also seems unlikely.
A third explanation, that the cognitive development of a child differs markedly across these two periods of growth, is perhaps more likely. As a corollary to this hypothesis, the negative finding for certification in kindergarten warrants further investigation and suggests either that the approach to mathematics pedagogy in early elementary teacher training programs may not be geared toward achieving the kind of learning that can be measured by standardized tests administered at that point in time, or that it is not well aligned with learning development at the kindergarten stage. It is important to note that teacher training programs generally group kindergarten, early elementary, and upper elementary training into the same mathematical content and pedagogy courses. Thus training is not fine-tuned toward specific developmental stages. Our findings with respect to the specific practices suggest that the developmental needs of students change from kindergarten to first grade and may require different teaching techniques. Counting manipulatives, although used fairly frequently in both grades, with kindergarten teachers using them an average of 12.5 times per month and first grade teachers an average of 11 times a month (see Appendix Table 3.5), only affected achievement in kindergarten. A possible explanation might be that the usefulness of this type of manipulative in influencing learning reaches a plateau; kinesthetic approaches may prove effective among kindergarteners, whereas first grade students may outgrow their use. Also effective in kindergarten only is the use of chalkboards. Kindergarten teachers report using this practice only 4.7 times per month on average, while first grade teachers use it almost twice that often, suggesting that its relatively non-routine use in kindergarten produces a helpful boost to achievement.
On the other hand, the more verbally oriented pedagogical technique of asking students to explain how a problem is solved influences achievement for first grade students but not for kindergarten students. It is encouraging to note that this practice is used relatively often in first grade (on average, 12.8 times per month in first grade versus 8 times per month in kindergarten). Several explanations might be offered for the small pedagogical effects. First, it is important to acknowledge the limitations of retrospective survey items in accurately capturing the frequency of use. Although some studies have validated survey responses by comparing them with measurements taken during classroom observations (Mayer, 1999; Stipek & Byler, 2004), some measurement error and possibly recall bias may remain.17 Measurement error, if random, would attenuate coefficients. Another possible explanation is that pedagogical activities as isolated and specific as those we measure may have a small effect on learning and that several "best practices," or artful combinations of them, are needed to produce substantial learning gains. Finally, it may be the case that it is not so much what a teacher does but how she does it that matters in producing learning. Thus measures of the frequency, rather than the quality, of teaching modalities do not capture all the essential components of pedagogy. In other words, the effect of an effective kindergarten teacher who completes math problems on a chalkboard might differ from that of an ineffective one who does the same thing. In this sense, it should be acknowledged that despite the steps we have taken to select an estimation strategy, an impediment to claiming that our results are causal remains.

17 That the "continuous" frequency scale was derived from more approximate, discrete frequency categories undoubtedly contributes to measurement error.
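The attenuation point can be illustrated with a small errors-in-variables simulation. This is a generic sketch, not based on the ECLS-K data: the true slope of 0.5 and the unit noise variances are arbitrary choices, and the "reported frequency" simply adds random noise to the true frequency.

```python
import random

random.seed(0)
n = 100_000
# True practice frequency x, outcome y depends on x with slope 0.5.
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]
# Reported frequency is the true frequency plus classical measurement error.
x_obs = [xi + random.gauss(0, 1) for xi in x]

def slope(xs, ys):
    """Simple OLS slope of ys on xs (no intercept adjustment needed after demeaning)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var = sum((a - mx) ** 2 for a in xs)
    return cov / var

print(round(slope(x, y), 2))      # ~0.5: slope using the true frequency
print(round(slope(x_obs, y), 2))  # ~0.25: shrunk by the reliability Var(x)/(Var(x)+Var(noise))
```

With equal signal and noise variances the reliability ratio is 1/2, so the estimated slope is cut roughly in half; random reporting error in the practice frequencies would push coefficients toward zero in the same way.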
The remaining issue is that neither the OLS nor the HLM estimator deals with the possibility that the omitted teacher effects, represented in (6) and (7) by τj, are nonrandom. As we have mentioned, one concern might be that high quality teachers tend to use certain teaching techniques but that such techniques, if adopted by less skilled teachers, would produce little effect. Or, it is possible that the seeming ineffectiveness of particular techniques is due to inadequate training in those methods. Thus there is a question whether the technique itself or the ability of teachers to use the technique properly matters. Our data follow a single cohort of students and do not provide the type of longitudinal information on their teachers that would permit us to control for time-constant teacher effects. We have no econometric technique at our disposal that allows us to eliminate unobserved teacher effects. We might argue, however, that the inclusion of the extensive set of teacher characteristics, content coverage variables, and school indicators, as well as our practices of interest, substantially narrows the range of what can be attributed to an unobserved teacher effect. Furthermore, in sensitivity analyses18 in which teacher characteristics are omitted from the model, the practice estimates change very little, suggesting that their use is not driven by training or experience in ways that affect achievement.

Sensitivity of Findings to Other Specifications and Estimators

As mentioned in our review of the literature, prior survey-based studies have tended to use different models and estimators from those selected in our analysis. In order to demonstrate the importance of taking careful steps in selecting the methods used, Table 3.4 shows the sensitivity of findings to the choice of model and estimator.
The table displays results for HLM estimation with random school effects on the kindergarten and first grade cross-sections in the first four columns—analytic approaches that have been more prevalent in the literature than our preferred approaches—and results for the child fixed effects estimator applied to the panel in the last column. The HLM estimation is carried out for both the lag score and the gain score model.

18 These analyses are not shown in the paper but are available from the authors upon request.

Table 3.4: Alternative Models and Estimators
Estimation Method: MLE/HLM, MLE/HLM, MLE/HLM, MLE/HLM, FE
Specification: Lag, Gain, Lag, Gain, Gain
School Effect: Random, Random, Random, Random, None
Teacher Effect: Random, Random, Random, Random, None
Grades: KG, KG, G1, G1, Pooled
[Coefficients (clustered standard errors) for teacher characteristics: Black, Other, Hispanic, Novice, Very experienced, Above BA, Masters or above, Regular certification, Courses in methods for teaching math, Paid preparatory time, Non-paid preparatory time; and for pedagogy practices: hours per week teaching math, hours per week in math groups, count out loud, work with geometric manipulatives, work with counting manipulatives to learn basic operations, play mathematics-related games, use a calculator for math, use music to understand mathematics concepts, use creative movement or drama to understand mathematics concepts, work with rulers, measuring cups, spoons, or other measuring instruments, explain how a mathematics problem is solved, engage in calendar-related activities, do mathematics worksheets, do mathematics problems from the textbook, complete mathematics problems on the chalkboard, solve mathematics problems in small groups or with a partner, work on mathematics problems that reflect real-life situations, work in mixed achievement groups, peer tutoring, do worksheets or workbook pages emphasizing routine practice or drill, work on problems for which there are several solutions.]
Observations: 16,356 16,356 11,780 11,780 21,232
Source: ECLS-K; standard errors clustered at the school level in parentheses; estimation with 40 imputed data sets. *** p<0.01, ** p<0.05, * p<0.1. All specifications include class size, class racial percentages, teacher content practices, family welfare status, prior test score, time between exams, household size, number of siblings, level of parental involvement, how often the child reads, participation in extracurricular activities, number of children's books; indicators for completing the Spanish math exam, student disability, mother's education level, household income level, mother's employment status, single parent family, father absent, pay tuition, child's race, speak non-English in home, repeating kindergarten, attend full day kindergarten. Specifications without school dummies also include school level variables: minority percentage, private religious, private non-religious, school enrollment, region, suburban, rural, gang problems, crime problems.

Before discussing the differences between these approaches and our preferred approaches, it is interesting to note the differences between the lag and gain score models in the results displayed here in Table 3.4. Moving to a gain score model has little effect on the magnitude of the estimates for kindergarten but a noticeable effect on those for first grade.
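The two specifications being compared can be written side by side. This is a stylized rendering consistent with the discussion in the text, with the teacher effect τj and school effect δs following the notation of equations (6) and (7):

```latex
% Lag score model: the persistence parameter \lambda is estimated freely
y_{it} = \lambda y_{i,t-1} + X_{it}\beta + \tau_j + \delta_s + \varepsilon_{it}

% Gain score model: imposes the restriction \lambda = 1
y_{it} - y_{i,t-1} = X_{it}\beta + \tau_j + \delta_s + \varepsilon_{it}
```

When λ is near one, the restriction is nearly innocuous; when λ is well below one, imposing it leaves part of the prior score in the error term, which can bias the remaining coefficients.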
Recall that in these data, the coefficient on λ was fairly close to one for kindergarten and much lower for first grade. In addition, our test of the λ=1 assumption applied only to first grade. Thus it appears that using the gain score model in the ECLS-K data is relatively costless in kindergarten, but these results serve to support our claim that doing so in first grade introduces a noticeable amount of bias. Given this evidence, it seems all the more unlikely that the FE estimator in column five has much insight to offer, since it not only relies on a gain score model but also constrains the coefficients on teacher characteristics and practices to be the same across grades and relies on the strict exogeneity assumption. It is also interesting to note that the FE estimator, having removed a great deal of variation by time-demeaning the data, displays larger standard errors and thus finds no variable to be significant at the .05 level. For instance, the point estimate for the effect of having a teacher with regular certification is of similar or greater magnitude than the others; however, the standard error is approximately twice as large. Differences in significance due to large changes in the point estimates (for example, see the coefficient on "very experienced"), on the other hand, may be due to bias—for example, due to violations of strict exogeneity. Overall, the FE estimator, applied to these data and research questions, offers little in the way of policy-relevant information with regard to our research question.

Moving now to a comparison of results from the HLM estimators containing random school effects (i.e., those commonly used in the literature) with those derived from our preferred estimators, shown in Table 3.3, we find a few similarities but some notable differences.
With regard to teacher characteristics, the HLM results from Table 3.4 generally display coefficients with slightly lower magnitudes for certification and coursework in kindergarten and coefficients with more notably lower magnitudes for postgraduate study in first grade. With regard to instructional practices, the Table 3.4 HLM estimates diverge from those in Table 3.3 in several instances. In kindergarten, they show significant effects of counting out loud, the use of geometric manipulatives, and engaging in calendar-related activities, whereas these are not significant when school effects are treated as fixed. In addition, although both sets of results display significant coefficients for working with counting manipulatives and completing problems on the chalkboard, the magnitudes differ slightly. In first grade, the magnitude and significance of counting out loud, the use of calendars, and working on problems with several solutions differ across the two sets of results, and the magnitude of the most robust result—that of explaining how problems are solved—is somewhat understated. It is important to reiterate why we might see the differences outlined above. Differences in the conclusions based on different estimators can be driven either by differences in the point estimates or in the standard errors. Of the two HLM approaches, the school random effects specification tends to produce smaller standard errors due to the additional covariance structure imposed on the estimation. However, it is not the case that differences in the conclusions drawn between the two HLM estimators are purely due to smaller standard errors. Rather, it is typically the case that the magnitude of the point estimates changes enough to alter the significance level.
Such differences in the point estimates likely occur because estimates with school random effects do not properly control for the sorting of students into different schools, whereas when the random school effects are replaced by fixed effects, this sorting is explicitly accounted for.19 It is important to note that not only our preferred approaches but also our sensitivity analyses using HLM with random school effects produce some results that differ from the HLM studies in the prior literature and lead to different conclusions, particularly for first grade. This happens despite our attempts to mimic their methods in our sensitivity analyses. In these cases, specification and sample differences account for much of the divergence. It is difficult to compare our findings with those of the two kindergarten studies we cited (Guarino et al., 2006, and Bodovski & Farkas, 2007), because the prior studies used instructional practice scales rather than individual practice variables. Both prior studies found that "traditional approaches" were positively and significantly related to achievement and thus conflated the effects of some of the variables we use (i.e., worksheets, textbooks, and chalkboard), one of which we find highly significant. However, our findings did not concur with other findings of theirs, such as the positive impact of student-centered instruction in Guarino et al. (a scale composed of items such as explaining how mathematics problems are solved, playing games, and using music or creative movement) and interactive instruction in Bodovski and Farkas (a scale composed of items such as explaining how mathematics problems are solved, solving problems in small groups or with a partner, and peer tutoring). The differences between our results and those of Palardy & Rumberger's 2008 study of first grade are more stark and seemingly due to sample differences.
The prior study used the 30 percent subsample of first graders who were sampled in the fall as well as the spring and did not impute missing data. Their findings that the use of math worksheets and calendars raises, and the use of geometric manipulatives lowers, first grade achievement are not supported in our study based on the full sample, even when we employ methods similar to theirs.

19 In other words, by controlling explicitly for δs in equations (6) and (7), rather than assuming it is random.

7. Summary and Conclusions

Survey data on instructional practices, if well designed and carefully analyzed, have the potential to address questions regarding the means by which effective teachers can affect student success. Simple teacher performance studies based on administrative data can be useful in showing that quality matters, but they have access to very limited information on teachers and provide little insight regarding the policy prescriptions needed to improve overall effectiveness. Studies that use richly detailed survey data to focus on teachers' actions and how they affect student outcomes can hold the key to designing policy instruments, assuming the results they find are causal. A source of concern in non-experimental research on teacher effects, however, is the sensitivity of findings to different modeling and estimation techniques. This paper has outlined a process for selecting an appropriate model and estimation method to investigate teacher effects on student achievement using longitudinal survey data. We applied this process in a series of sequential steps to data from the ECLS-K and found little evidence to support many of the modeling and estimation choices heavily relied upon in the literature—for example, models with gain scores or random school effects, or models that constrain coefficients to be the same across grades. Our study clearly illustrates how methodological choices can influence results.
Although a few of our findings concur with those in prior research, most diverge. Prior studies find little or no relationship between teacher background characteristics and mathematics achievement in either kindergarten or first grade, but we find evidence that teacher certification may slightly lower achievement in kindergarten and that postgraduate education may contribute to relatively substantial achievement gains in first grade. In addition, whereas prior studies claim that student-centered or interactive pedagogy improves achievement in kindergarten and that worksheets and calendars are important tools in first grade, we find that working with counting manipulatives and completing math problems on the chalkboard improve achievement in kindergarten and that explaining how mathematics problems are solved is important in first grade. Taken as a whole, our findings suggest that there may be important developmental differences in the mathematics learning capabilities of children in kindergarten versus first grade and that training and pedagogy should be structured appropriately. Importantly, no finding holds up under all the many specification and estimation approaches available to researchers—a circumstance that highlights the need to ground the selection of a model and estimation method on sound reasoning. Clearly, it is important that researchers justify their methodological choices through a thorough investigation of related assumptions and sensitivities. The guidelines for empirical investigation in the longitudinal survey data context provided in this study can aid researchers in selecting a credible set of results and advancing causal claims.
APPENDIX

Table 3.5: Summary Statistics for Forty Imputed Data Sets
Columns: Pooled (Mean, SD), Kindergarten (Mean, SD), 1st Grade (Mean, SD)
[Means and standard deviations for each variable group: Child and Family Variables (math score, demographics, household composition, parental education, employment, income, and home reading environment); Teacher Characteristics (race, experience, degrees, certification, coursework, and preparatory time); Teacher Pedagogy Practices (days per month for each of the practices listed in Tables 3.3 and 3.4); Teacher Content Practices (coverage of specific mathematics topics); Class Characteristics (racial composition, percent disabled, hours per week on math and in math groups); School Characteristics (minority percent quintile, sector, region, urbanicity, enrollment, gang and crime problems).]
Selected values cited in the text: math score standard deviation of 8.84 in kindergarten and 8.65 in first grade; counting manipulatives used 12.5 times per month in kindergarten and 11 in first grade; chalkboard used 4.7 times per month in kindergarten; explaining how a problem is solved used 8 times per month in kindergarten and 12.8 in first grade.
N (Single Imputed Set): 21,232 pooled; 16,356 kindergarten; 11,780 first grade.
Source: ECLS-K; Multiply imputed data

Table 3.6: Summary of Missing Data
Columns: Kindergarten (Non-missing Obs, Percent of Possible), 1st Grade (Non-missing Obs, Percent of Possible)
[Non-missing observation counts and percentages of possible observations for each child, family, teacher, practice, content, class, and school variable; most variables are over 90 percent complete in both grades. Complete cases across all variables: 6,271 (38.34%) in kindergarten and 3,923 (33.30%) in first grade.]
Source: ECLS-K; Multiply imputed data

REFERENCES

Arellano, M. & Bond, S. (1991). Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations. The Review of Economic Studies, 58(2), 277–297.
Review of Economic Studies, 58, 277-298.

Barnett, S.W. (1995). Long-term Effects of Early Childhood Programs on Cognitive and School Outcomes. The Future of Children, 5, 25-50.

Bodovski, K. & Farkas, G. (2007). Do Instructional Practices Contribute to Inequality in Achievement? The Case of Mathematics Instruction in Kindergarten. Journal of Early Childhood Research, 5(3), 301-322.

Carlin, J. B., Galati, J. C., & Royston, P. (2008). A New Framework for Managing and Analyzing Multiply Imputed Data in Stata. Stata Journal, 8(1), 49-67.

Chetty, R., Friedman, J. N., Hilger, N., Saez, E., Whitmore Schanzenbach, D., & Yagan, D. (2011). How Does Your Kindergarten Classroom Affect Your Earnings? Evidence from Project STAR. Quarterly Journal of Economics, 126(4), 1593-1660.

Cohen, D. & Hill, H. (2000). Instructional Policy and Classroom Performance: The Mathematics Reform in California. Teachers College Record, 102(2), 294-343.

Currie, J. & Thomas, D. (2000). School Quality and the Longer-term Effects of Head Start. The Journal of Human Resources, 35(4), 755-774.

Guarino, C., Hamilton, L., Lockwood, J.R., & Rathbun, A.H. (2006). Teacher Qualifications, Instructional Practices, and Reading and Mathematics Gains of Kindergartners (NCES 2006-031). U.S. Department of Education. Washington, DC: National Center for Education Statistics.

Guarino, C., Reckase, M., & Wooldridge, J. (unpublished draft). Evaluating Value-Added Methods for Estimating Teacher Effects.

Hamilton, L., McCaffrey, D., Stecher, B., Klein, S., Robyn, A., & Bugliari, D. (2003). Studying Large-Scale Reforms of Instructional Practice: An Example from Mathematics and Science. Educational Evaluation and Policy Analysis, 25(1), 1-29.

Hanushek, E. (1979). Conceptual and Empirical Issues in the Estimation of Educational Production Functions. Journal of Human Resources, 14(3), 351-388.

Harris, D., Sass, T., & Semykina, A. (unpublished draft). Value-Added Models and the Measurement of Teacher Productivity.
Hausman, J. A. (1978). Specification Tests in Econometrics. Econometrica, 46(6), 1251-1271.

Kilpatrick, J., Swafford, J., & Findell, B. (2001). Adding It Up: Helping Children Learn Mathematics. Washington, DC: National Academy Press.

Le, V., Stecher, B., Lockwood, J.R., Hamilton, L., Robyn, A., Williams, V., et al. (2006). Improving Mathematics and Science Education: A Longitudinal Investigation of the Relationship Between Reform-Oriented Instruction and Student Achievement. Santa Monica, CA: RAND Corporation.

Mayer, D.P. (1999). Measuring Instructional Practice: Can Policy Makers Trust Survey Data? Educational Evaluation and Policy Analysis, 21(1), 29-45.

Palardy, G. & Rumberger, R. (2008). Teacher Effectiveness in First Grade: The Importance of Background Qualifications, Attitudes, and Instructional Practices for Student Learning. Educational Evaluation and Policy Analysis, 30(2), 111-140.

Rock, D.A. & Pollack, J.M. (2002). Early Childhood Longitudinal Study - Kindergarten Class of 1998-99 (ECLS-K), Psychometric Report for Kindergarten Through First Grade (NCES 2002-005). U.S. Department of Education. Washington, DC: National Center for Education Statistics.

Rowan, B., Correnti, R., & Miller, R. (2002). What Large-Scale Survey Research Tells Us about Teacher Effects on Student Achievement: Insights from the Prospects Study of Elementary Schools. Teachers College Record, 104, 1525-1567.

Royston, P. (2004). Multiple Imputation of Missing Values. Stata Journal, 4(3), 227-241.

Stipek, D. & Byler, P. (2004). The Early Childhood Classroom Observation Measure. Early Childhood Research Quarterly, 19, 375-397.

Todd, P. & Wolpin, K. (2003). On the Specification and Estimation of the Production Function for Cognitive Achievement. Economic Journal, 113(485), F3-F33.
Tourangeau, K., Burke, J., Le, T., Wan, S., Weant, M., Brown, E., Vaden-Kiernan, N., Rinker, E., Dulaney, R., Ellingson, K., Barrett, B., Flores-Cervantes, I., Zill, N., Pollack, J., Rock, D., Atkins-Burnett, S., Meisels, S., Bose, J., West, J., Denton, K., Rathbun, A., & Walston, J. (2001). ECLS-K Base Year Public-Use Data Files and Electronic Codebook: User's Manual (NCES 2001-029). U.S. Department of Education. Washington, DC: National Center for Education Statistics.

Van Buuren, S., Boshuizen, H., & Knook, D. (1999). Multiple Imputation of Missing Blood Pressure Covariates in Survival Analysis. Statistics in Medicine, 18, 681-694.

Wayne, A. & Youngs, P. (2003). Teacher Characteristics and Student Achievement Gains: A Review. Review of Educational Research, 73(1), 89-122.

Wooldridge, J. (2002). Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.