THREE ESSAYS IN LABOR ECONOMICS AND THE ECONOMICS OF EDUCATION

By

Brian Stacy

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Economics – Doctor of Philosophy

2014

ABSTRACT

THREE ESSAYS IN LABOR ECONOMICS AND THE ECONOMICS OF EDUCATION

By Brian Stacy

In the first chapter of my dissertation, I examine the robustness of typical teacher quality measures to alternate ranking systems that factor in the dispersion of value-added. The typical measure used by researchers and school administrators to evaluate teachers is based on how much students' achievement increases after exposure to the teacher, that is, on the teacher's "value-added". When teacher value-added is heterogeneous across her students, the typically used measure reflects differences in the average value-added the teacher provides. However, researchers, administrators, and parents may care not just about the average value-added, but also about its variance. Encouragingly, ranking systems that factor in the dispersion produce rankings similar to those from the ranking system based only on the mean.

In the second chapter, I examine the effect of measurement error in the dependent variable on quantile regression, because, unlike in OLS regression, even classical measurement error can generate bias. I examine the pattern and size of the bias using both simulation and an empirical example. The simulations indicate that classical error can cause bias and that non-classical measurement error, particularly heteroskedastic measurement error, has the potential to produce substantial bias. Using restricted-access Health and Retirement Study data containing matched IRS W-2 earnings records, I examine whether estimates of the returns to education differ statistically between a precisely measured and a mismeasured earnings variable. I find that the returns to education are overstated by roughly 1 percentage point at the median and 75th percentile when earnings reported by survey respondents are used.

In the third chapter, my coauthors and I investigate how the precision and stability of a teacher's value-added estimate relate to student characteristics. We find that the year-to-year stability of teacher value-added estimates can depend on the previous achievement level of a teacher's students. The stability level of the estimates is typically 25% to more than 50% larger for teachers serving initially higher performing students. We offer a policy simulation demonstrating that teachers who serve low-achieving students may be differentially likely to receive sanctions in a high-stakes policy based on value-added estimates.

I would like to dedicate this dissertation to my future wife, Tina Plerhoples, to my parents, Kathy and Richard, to my brother and sister, Mark and Katie, and to my other family and friends who have helped me in countless ways throughout the years. Some have helped directly on the dissertation, but just as importantly many others have encouraged me, helped me, or simply made my life more enjoyable throughout my time as a graduate student.

ACKNOWLEDGEMENTS

I would like to thank my committee chair, Steven Haider, for the tremendous effort he put into helping me develop, produce, and revise the contents of this dissertation. I would also like to greatly thank my other committee members, Scott Imberman, Mark Reckase, and Jeff Wooldridge, for the help, guidance, and patience they showed throughout the process of producing this dissertation.
Several others also greatly aided me on these essays: Quentin Brummet, Steve Dieterle, Cassie Guarino, Tina Plerhoples, all the members of the MSU VAM project group, as well as numerous seminar and conference participants. I would like to thank Dan McCaffrey and JR Lockwood, who both taught me a great deal about teacher value-added methods and many other topics while I was a summer associate at the RAND Corporation. I would also like to acknowledge that the research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grants R305D100028 and R305B090011 to Michigan State University. The opinions expressed are those of the authors and do not represent the views of the Institute or the U.S. Department of Education.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1  RANKING TEACHERS WHEN TEACHER VALUE-ADDED IS HETEROGENEOUS
1.1 Introduction
1.2 Framework for Evaluating Teacher Quality
1.3 Data
1.4 Results
  1.4.1 Correlation between γ_j and σ²_j
  1.4.2 Do Teacher Rankings Change When We Add Information on Value-Added Variances under Plausible Teacher Ranking Functions?
1.5 Sensitivity Checks
1.6 Summary and Conclusions
APPENDIX: TABLES AND FIGURES

CHAPTER 2  LEFT WITH BIAS? QUANTILE REGRESSION WITH MEASUREMENT ERROR IN LEFT HAND SIDE VARIABLES
2.1 Introduction
2.2 Model and Estimator
2.3 Simulation Evidence of Bias in Quantile Regression
  2.3.1 Simulation Results Under Classical Measurement Error
  2.3.2 Simulation Results Under Mean-Reverting Measurement Error
  2.3.3 Simulation Results Under Heteroskedastic Measurement Error
2.4 Quantile Returns to Education as an Application
  2.4.1 Data
  2.4.2 Characteristics of Measurement Error in Log Earnings
  2.4.3 Estimates of the Returns to Education and Experience
  2.4.4 Discussion
2.5 Conclusions
APPENDIX: TABLES AND FIGURES

CHAPTER 3  DOES THE PRECISION AND STABILITY OF VALUE-ADDED ESTIMATES OF TEACHER PERFORMANCE DEPEND ON THE TYPES OF STUDENTS THEY SERVE?
3.1 Introduction
3.2 Previous Literature
3.3 Data
3.4 Model
  3.4.1 Estimation Methods
3.5 Heteroskedastic Error
  3.5.1 Heteroskedastic Measurement Error
  3.5.2 Other Possible Causes of Heteroskedastic Student Level Error
3.6 Testing for Heteroskedasticity
3.7 Evidence of Differences in Classroom Compositions
3.8 Effects of Heteroskedastic Student Level Error on Precision of Teacher Value-Added Estimates
  3.8.1 Simple Model of Heteroskedasticity
  3.8.2 Including other Covariates in Achievement Model
3.9 Inter-year Stability of Teacher Effect Estimates by Class Characteristics
  3.9.1 Brief Overview of the Analysis
3.10 Results on the Stability of Teacher Effect Estimates by Subgroup
  3.10.1 DOLS Stabilities
    3.10.1.1 4th Grade Results
    3.10.1.2 6th Grade Results
  3.10.2 EB Lag Stabilities
3.11 Sensitivity Checks
3.12 High Stakes Policy Simulation
3.13 Conclusion
APPENDIX: TABLES AND FIGURES

BIBLIOGRAPHY

LIST OF TABLES

Table 1.1 Student Level Summary Statistics
Table 1.2 Teacher Level Summary Statistics
Table 1.3 Standard Deviation and Correlations for γ_j and σ²_j
Table 1.4 Estimates and Standard Errors of γ_j and σ²_j for Select Teachers
Table 1.5 Comparison of Ranking System Composed of γ̂_j and Alternative Ranking Systems Including σ_j
Table 1.6 Sensitivity Checks for Mathematics Teachers
Table 1.7 Sensitivity Checks for Reading Teachers
Table 2.1 Simulation Results for OLS/Quantile Regression Estimates with Classical Measurement Error in Dependent Variable
Table 2.2 Simulation Results for OLS/Quantile Regression Estimates with Classical Measurement Error in Dependent Variable
Table 2.3 Simulation Results for OLS/Quantile Regression Estimates with Mean-Reverting Measurement Error in Dependent Variable
Table 2.4 Simulation Results for OLS/Quantile Regression Estimates with Mean-Reverting Measurement Error in Dependent Variable
Table 2.5 Simulation Results for OLS/Quantile Regression Estimates with Heteroskedastic Measurement Error in Dependent Variable
Table 2.6 Simulation Results for OLS/Quantile Regression Estimates with Heteroskedastic Measurement Error in Dependent Variable
Table 2.7 Summary Statistics, Wave 1 (1992) Male Workers with Positive Earnings
Table 2.8 Measurement Error Descriptive Statistics
Table 2.9 Estimates of Conditional Distribution of Measurement Error
Table 2.10 Estimates of Mincer Equation: Male Workers with Positive Earnings
Table 3.1 Summary statistics
Table 3.2 Average Squared Residuals for DOLS based on Subgroups of Prior Year Class Average Achievement
Table 3.3 Tests for Heteroskedasticity
Table 3.4 Estimates of Year to Year Stability for DOLS by Subgroups of Class Achievement
Table 3.5 Estimates of Year to Year Stability for EB Lag by Subgroups of Class Achievement
Table 3.6 High Stakes Policy Simulation

LIST OF FIGURES

Figure 1.1 Plots of 95% CI and Standard Errors on the Number of Student Observations for Math Teachers
Figure 1.2 Plots of 95% CI and Standard Errors on the Number of Student Observations for Reading Teachers
Figure 1.3 Scatterplot of Estimates of γ_j and σ²_j for Mathematics
Figure 1.4 Scatterplot of Estimates of γ_j and σ²_j for Reading
Figure 2.1 Kernel Estimate of the Density of Measurement Error in Log Earnings
Figure 3.1 Standard Error of Measure Plots for Mathematics Grades 3-6

CHAPTER 1

RANKING TEACHERS WHEN TEACHER VALUE-ADDED IS HETEROGENEOUS

1.1 Introduction

Teacher quality measures based on student achievement data are increasingly being utilized by researchers in topics ranging from the impact of teacher quality on later life outcomes, to the impact of teacher quality on housing prices, to the quality of teachers who transfer or leave the teacher labor force (see, e.g., Chetty et al. (2011), Imberman and Lovenheim (2013), or Boyd et al. (2008) for examples of each). Additionally, federal education policies, such as the Teacher Incentive Fund and Race to the Top, have sparked substantial demand for rigorous measures of teacher quality by administrators who wish to identify the most and least effective teachers.

The most commonly used measures of teacher quality are value-added measures, which attempt to isolate a teacher's contribution to student learning in a year. Some studies make the simplifying assumption that teacher value-added is identical for all students.1 With this assumption, a "teacher effect" can be estimated for each teacher, which reflects differences in the value-added provided.
Other studies explicitly explore heterogeneity in teacher value-added and find evidence that teacher value-added is different for different students.2 With heterogeneity, the "teacher effects" that are typically estimated reflect differences in the mean value-added provided. From here on I will refer to these measures as "value-added means".

1 The assumption of constant value-added is explicitly stated in Chetty et al. (2011), for instance, but implicitly assumed in many structural models of achievement used in value-added estimation.
2 For instance, Dee (2004) examines whether assigning a student to a teacher of the same race improves student achievement using experimental Project STAR data, and finds an increase for both black and white students: one year with a same-race teacher increases achievement by 2 to 4 percentile points. Aaronson et al. (2007) compute teacher value-added separately for students with high and low prior year test scores and find that the correlation between the two is .39. A similar exercise is done by Condie et al. (2014). Loeb et al. (2014) examine whether teacher quality depends on whether a student is an English learner. Lockwood and McCaffrey (2009) examine heterogeneity in teacher value-added by interacting value-added with predicted achievement and find modest interaction effects, with the interactions explaining around 10% of the total variation in teacher effects across teachers.

Despite the recognition that teacher value-added can be heterogeneous, little work has been done examining teacher quality beyond the value-added means.3 Teachers may differ in the variance of the value-added they provide, and this information may be important for researchers and administrators forming and using teacher quality ratings. For example, an individual may view a teacher who produces large learning gains for a few students and small gains for the rest differently from a teacher who produces moderate gains for all students. Examining the variance of value-added in addition to the mean can distinguish between these two cases.

In this paper, I examine the sensitivity of teacher rankings to alternate rankings that factor in the variance of teacher value-added. I estimate "value-added variances", which reflect differences across teachers in the variance of the value-added a teacher provides. These can be identified using the same random-assignment-conditional-on-observables assumptions made to identify value-added means. I then use this additional information to create alternate rankings, which I compare to the rankings based solely on value-added means.

Using administrative data linking students to teachers from a large, diverse, anonymous state, I find little evidence of a systematic mean-variance trade-off in teacher value-added. The value-added means and variances are in fact negatively correlated (math: -.328, p<.001; reading: -.206, p<.001). I also find that there are larger differences across teachers in terms of the mean than the variance. As a result, teacher ranking systems incorporating both value-added means and value-added standard deviations are highly correlated with a system composed only of value-added means. The correlations are above .9 in most cases.

3 Some exceptions include the papers listed in the footnote above.
1.2 Framework for Evaluating Teacher Quality

A convenient framework for ranking teachers is the potential outcomes framework.4 For our purposes, the potential outcomes are the potential achievement outcomes if a student is assigned to any of the teachers in the population. Let i denote a randomly drawn student from the population, and let A_i(j) be the achievement level of student i if assigned to a particular teacher j. Administrators and researchers are typically interested in identifying how students would perform if they were assigned to one teacher compared to another. The primary difficulty in making this type of causal inference is that, if there are J potential teachers, it is only possible to observe one of the J potential outcomes for a student.

The key assumption used to make causal inferences about teachers is that assignment of teachers to students is random conditional on X_i, a set of observable characteristics of students. With this assumption, even though we do not observe all A_i(j) for each student, we can use the observed outcome, A_i, to estimate teacher effects. This assumption of selection only on observables is sometimes referred to as ignorability (or unconfoundedness) conditional on X_i.5 The assumption implies that principals base assignment on observable characteristics of students, such as prior year test scores, but do not assign on unobservable factors that affect achievement.

The ignorability assumption has been hotly debated in the value-added literature.6 Importantly for my purposes, this assumption is necessary for estimating value-added means, and without further identifying assumptions we can also estimate value-added variances.

4 See Rosenbaum and Rubin (1983), Rubin (1974), Rubin et al. (2004), or Imbens and Wooldridge (2008) for further background.
5 See Imbens (2000) for further discussion.
6 The assumption is not directly testable. However, Rothstein (2010) develops an indirect falsification test based on the idea that future teachers cannot impact contemporaneous test scores, so evidence of a relationship is evidence of a violation of the assumption. Rothstein finds that the falsification test rejects, suggesting estimates of teacher effects may be biased. However, Goldhaber and Chaplin (2012) and Guarino et al. (2014) both find that such falsification tests may over-reject. Also, Guarino et al. (2012) produce simulation evidence that estimators flexibly controlling for prior year test scores and teacher fixed effects are fairly robust across a variety of nonrandom assignment scenarios. Chetty et al. (2011) find that value-added measures controlling for prior year achievement and demographics predict changes in school level achievement when teachers switch schools and predict long term outcomes such as earnings and college attendance. Finally, Kane et al. (2013) examine whether value-added estimates are biased using a large randomized experiment in which students were randomly assigned to teachers within schools. The authors find no evidence of bias in estimators that control for a student's prior achievement scores and demographics.

A typical way of estimating teacher effects is to estimate the parameters in the following equation for the conditional mean of achievement:7

E(A_i | X_i, T_i) = (X_i − μ_X)β + T_{i1}γ_1 + ⋯ + T_{iJ}γ_J.  (1.1)

X_i is a set of control variables. T_{i1} is an indicator variable equal to 1 if the student is assigned to teacher 1 and 0 otherwise, T_{i2} is an assignment indicator for teacher 2, and so on, and γ_j is the teacher effect for teacher j.8 Under the ignorability assumption, the estimates of γ_j are consistent estimates of the value-added means.

7 For instance, see Rothstein (2009) or Harris et al. (2011). This achievement model is sometimes motivated using the education production function framework. For more details, see Hanushek (1979) or Todd and Wolpin (2003).
8 In my parameterization, γ_j is normalized so that it is teacher j's mean level of achievement produced for the average student. A value of zero for γ_j indicates that a teacher produces a mean achievement level of zero for the average student.

However, in this case, γ_j does not fully characterize the impact of assigning students to a teacher. Teachers may also differ in the variance of the value-added they provide. Teacher value-added may vary for a few reasons. For instance, a teacher's pedagogical style may work well with some students and not others. Also, some teachers may relate better with some students than others, for instance if they are of the same race or gender, which could lead to differences in the value-added provided. Teachers may also deploy more resources at some students than others.9 Others may be less able to tailor their instruction to the needs of all students in a classroom.

9 Neal and Schanzenbach (2010) find evidence that teachers may target resources at students in the middle of the achievement distribution because of proficiency requirements.

Define σ²_j as the value-added variance for teacher j. With γ_j and σ²_j we can get a more complete measure of teacher quality than looking at the mean alone. In order to estimate σ²_j, assume that the conditional variance of achievement has the following functional form:
Ti1 is an indicator variable equal to 1 if assigned to teacher 1 and 0 otherwise, Ti2 is an assignment indicator for teacher 2, and so on, and γ j is the teacher effect for teacher j.8 Under the ignorability assumption, the estimates of γ j are consistent estimates of the value-added means. However, in this case, γ j does not fully characterize the impact of assigning students to a teacher. Teachers may also differ in the variance of the value-added they provide. Teacher valueadded may vary for a few reasons. For instance, a teacher’s pedagogical style may work well with some students and not others. Also, some teachers may relate better with some students than others, for instance if they are of the same race or gender, which could lead to differences in the value-added provided. Teachers may also deploy more resources at some students than others.9 Some others may be less able to cater their instruction to the needs of all students in a classroom. Define σ 2j as the value-added variance for teacher j. With γ j and σ 2j we can get a more complete measure of teacher quality than looking at the mean alone. In order to estimate σ 2j , assume that the conditional variance of achievement has the following function form: find no evidence of bias in estimators that control for a student’s prior achievement scores and demographics. 7 For instance see Rothstein (2009) or Harris et al. (2011). This achievement model is sometimes motivated using the education production function framework. For more details, see Hanushek (1979) or Todd and Wolpin (2003) 8 In my parameterization, γ is normalized so that it is teacher j’s mean level of achievement j produced for the average student. A value of zero for γ j indicates that a teacher produces a mean achievement level of zero for the average student. 9 Neal and Schanzenbach (2010) find evidence that teachers may target resources at students in the middle of the achievement distribution because of proficiency requirements. 4 Var(Ai |Xi , Ti ) = exp(Ti1 ψ1 + · · · + TiJ ψJ + (Xi − µX )δ ). (1.2) Note that Xi is centered around its mean, µX , in (1.1) and (1.2). After centering Xi around its average, one can interpret γ j and σ 2j = exp(ψ j ) as teacher j’s mean and variance of achievement produced conditional on having the average student.10 Intuitively, if a teacher produces a larger variance in achievement for the average student (σ 2j ) than another teacher, then this reflects a larger variance in the value-added provided by that teacher, and likewise with the mean. In order to estimate γ j and σ 2j , I use the following procedure based on least squares. I first estimate the parameters in Equation (1.1) using an OLS regression of the student’s observed achievement score on Xi − X¯ and Ti . Then I form residuals from this initial regression and estimate (1.2) using non-linear least squares of the squared residuals on Xi − X¯ and teacher indicators.11 1.3 Data The data come from an administrative data set in a large and diverse anonymous state. Basic student information such as demographic, socio-economic, and special education status are available. 3,341,109 student year observations are available for students in grades 3-6 from years 2001-2007. The data include achievement scores in reading and math on a state criterion referenced test. The test scores are vertically scaled, so that test scores in grades 3-6 are on the same scale. 
The benefit of the vertical scale is that if, for instance, a student scores a 500 in 4th grade and a 500 in 5th 10 For clarity, a value of zero for γ j indicates that a teacher produces a mean achievement level of zero for the average student, and a value of zero for σ 2j indicates that a teacher produces a variance of achievement of zero for the average student. The exponential function is chosen to model the conditional variance instead of a linear function, because a linear function would not guarantee that the predicted conditional variance is positive. Using the exponential function to model a conditional variance dates back in the econometrics literature to Harvey (1976). 11 To see why the non-linear least squares regression using the squared residuals can consistently estimate the parameters in the conditional variance, note that Var(Ai |Xi , Ti ) = E(εi2 |Xi , Ti ) by definition, where εi = Ai − E(Ai |Xi , Ti ). Because the OLS residuals converge in distribution to εi , as noted in Harvey (1976), using the squared residuals in place of εi2 in the NLS regression still produces consistent estimates of the parameters in E(εi2 |Xi , Ti ). 5 grade, then you can interpret this as meaning the student gained no knowledge from 4th to 5th grade. Student-teacher links are available for value-added estimation. The analysis focuses on mathematics and reading student achievement in grade 6. Grade 6 is chosen for two reasons. First, conditioning on a larger number of previous test scores increases the plausibility that assignment of students to teachers is unrelated to student unobservables. Second, teachers in grade 6 often teach multiple sections in a given year, which increases the number of student observations. The larger number of student observations is important for the precision of the estimates. I impose some restrictions on the data. Students that cannot be linked with a teacher are dropped, as are students linked to more than one teacher in a school year in the same subject. The analysis focuses on traditional public school students, so students in charter schools are dropped. I also drop teachers with less than 12 student observations because accurately estimating valueadded means and variances requires a large number of student observations. In all around one third of the student observations are not used in the analysis. Student level characteristics of the final data set are reported in the Table 1.1. The students in the final sample tend to be somewhat higher achieving, more white, and less likely to be free-and-reduced price lunch or limited English proficient than the students in the original sample. Table 1.2 reports summary statistics aggregated to the teacher level. There are 5,987 math and 6,606 reading teachers in the sample. There are on average 114.58 and 105.013 student observation per teacher for math and reading teachers respectively. This is important for the precision of the estimates of γ j and σ 2j . Student characteristics aggregated to the teacher level are also reported. 1.4 Results The controls included are similar to other papers in the literature (e.g. Chetty et al. (2011)). The vector of covariates, Xi , includes cubic functions of lagged and twice lagged math and reading scores, indicators for whether the student is a minority, the student’s free-and-reduced price lunch status, the student’s limited English proficiency status, and gender. 
In order to increase the precision of the estimates, I pool student observations across all available years and include year dummies as additional controls. Estimation is done separately for math and reading teachers. Similar to Rothstein (2009), I standardize test scores so that grade 6 test scores have a population mean of zero and a standard deviation of one. Using the same standardization in each grade keeps the vertical scale intact.12 Therefore, one test score unit translates into an increase of one standard deviation in achievement for sixth graders.

Based on this, I estimate the value-added means (γ_j) and the value-added variances (σ²_j) for the 6,249 mathematics teachers and 6,836 reading teachers. As reported in Table 1.3, the standard deviation of the estimates of γ_j across teachers is .207 in mathematics and .155 in reading.13 Additionally, going from the teacher at the 50th percentile in the estimated distribution of γ_j to a teacher at the 75th percentile in mathematics increases mean value-added by .13 test score standard deviations. Going from the 50th to the 75th percentile in reading increases mean value-added by .092 standard deviations.

The differences across teachers for σ²_j are more modest. σ²_j has a standard deviation across teachers of .086 in mathematics and .106 in reading. Going from a teacher at the 50th percentile in the estimated distribution of σ²_j to a teacher at the 75th percentile increases the variance of value-added by .043 test score standard deviation units. This is 4.3% of the variance of overall achievement. Going from the 50th percentile to the 75th percentile in reading means increasing the variance of value-added by .054.

12 With this standardization, grade 5 math test scores have a mean of -.152 and standard deviation of .928. Grade 4 math test scores have a mean of -.763 and standard deviation of .979. Grade 5 reading test scores have a mean of -.209 and standard deviation of .981. Grade 4 reading test scores have a mean of -.413 and standard deviation of .960.
13 These estimates are in line with what other researchers have found for the standard deviation across teachers for the mean. Kane and Staiger (2010) find a standard deviation adjusted for sampling variation of .143 for mathematics teachers. Aaronson et al. (2007) find an adjusted standard deviation of .193 for mathematics teachers and .113 for reading teachers. Rothstein (2009) finds an adjusted standard deviation of .107 for reading teachers.

In order to provide some information about the precision of the estimates, I report estimates of γ_j and σ²_j along with their standard errors for select teachers in Table 1.4.14 Estimates and standard errors for teachers at the 10th, 25th, 50th, 75th, and 90th percentiles of γ_j (top panel) and σ²_j (bottom panel) are reported. Additionally, Figures 1.1 and 1.2 show the 95% confidence intervals and standard errors plotted against the number of student observations for a randomly selected subsample of teachers for math and reading.15 The OLS estimates of γ_j are in the top left. The NLS estimates of σ²_j are in the top right.16 An average standard error at each number of student observations for the OLS estimates of γ_j is displayed in the bottom left, and the standard errors for the NLS estimates of σ²_j are in the bottom right.17 The red lines mark 25, 50, and 100 student observations.

14 I use a bootstrapping technique to produce standard errors for the estimates of γ_j and σ²_j. In order to keep the number of student observations per teacher fixed for every bootstrap replication, I sample with replacement within teachers. To be clear, if there are N observations and N_j observations corresponding to teacher j in the original data set, then to produce N observations for each bootstrap sample, I draw N_j observations for teacher j, where the N_j observations are randomly drawn with replacement from the set of students assigned to the teacher, and repeat this procedure for all teachers. 100 bootstrap replications were performed. Since estimation of σ²_j involves two steps (first forming residuals after an OLS regression of current achievement on the covariates and teacher indicators, then NLS of the squared residuals on the covariates and teacher indicators), each bootstrap iteration involves estimation of both steps. The sampling with replacement within teachers done in this paper is similar to the bootstrapping approach in Winters et al. (2012).
15 The randomly selected subsamples of 584 mathematics teachers and 743 reading teachers were used instead of the entire sample because the bootstrapping procedure was very time intensive.
16 I also try a procedure based on Normal quasi-MLE to estimate γ_j and σ²_j. I parameterize the mean and variance of the normal distribution so that D(A_i | X_i, T_i) = Normal((X_i − μ_X)β + T_iγ, exp(T_iψ + (X_i − μ_X)δ)).  (1.3) The estimates were similar in the two approaches, although the QMLE results were slightly more efficient. Since the QMLE is more complex and more computationally difficult to implement, I chose to present the results for the simpler two-step estimator.
17 The average standard error at each number of student observations was formed using a polynomial smoother.
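The within-teacher bootstrap of footnote 14 can be sketched as follows. Again this is only an illustration: within_teacher_bootstrap and estimate_fn are names of my own, the latter standing in for a function that re-runs both estimation steps (such as the two-step sketch above) and returns the parameter vector.

```python
import numpy as np

def within_teacher_bootstrap(A, Xc, T, teacher, estimate_fn, reps=100, seed=0):
    """Resample students with replacement within each teacher, so each
    teacher's number of student observations N_j stays fixed, and re-run
    both estimation steps on every replication."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(reps):
        idx = np.concatenate([
            rng.choice(np.flatnonzero(teacher == j),
                       size=(teacher == j).sum(), replace=True)
            for j in np.unique(teacher)
        ])
        draws.append(estimate_fn(A[idx], Xc[idx], T[idx]))
    # standard deviation across replications = bootstrap standard error
    return np.asarray(draws).std(axis=0)
```

Because every replication re-runs both the OLS and NLS steps, the full-sample bootstrap is time intensive, which is why only random subsamples of teachers were used for Figures 1.1 and 1.2.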
One thing to notice from Figures 1.1 and 1.2 is that, as more student observations become available for each teacher, the estimates become more precise: both the confidence intervals and the standard errors shrink as the number of student observations increases. Also, the magnitudes of the standard errors do not differ much between γ_j and σ²_j.

In the bottom panels of Figures 1.1 and 1.2, I also include cutoffs for whether the measures are accurate enough to distinguish the very best and very worst teachers, which is often a goal of forming teacher quality measures. The upper blue line represents the standard error necessary to say with 95% accuracy that a teacher ranked in the bottom 10% is not in the top 10%. The lower blue line represents the standard error necessary to say that a teacher ranked in the bottom 25% is not in the top 75%.18 Ideally, the average standard errors should be below the cutoffs.

Twelve student observations are enough to distinguish teachers at the 90th and 10th percentiles for both γ_j and σ²_j in mathematics and reading. Under the tougher requirement of distinguishing teachers at the 75th and 25th percentiles, 12 student observations are enough only in the case of γ_j for mathematics. Fifty student observations are enough for the 75-25 comparison in reading for γ_j, and 100 observations are enough in reading for σ²_j. More than 200 are necessary in mathematics to distinguish between the 75th and 25th percentiles in σ²_j.
This is partially due to the smaller difference between the 75th and 25th percentile value-added variances for mathematics teachers than for reading teachers (a gap of .079 in mathematics versus .147 in reading). Overall, 12 student observations are enough to distinguish the very worst from the very best for both γ_j and σ²_j, but in some cases it may be difficult to distinguish teachers toward the center of the distribution without large numbers of student observations. In the remaining analysis, I will continue to use all teachers with more than 12 student observations, but will also explore the results when only teachers with more than 100 student observations are included as a sensitivity check.

18 I form the blue lines by calculating the difference in γ_j and σ²_j at the 90th and 10th percentiles and at the 75th and 25th percentiles. For math and γ_j, the 90-10 difference is .48 and the 75-25 difference is .254. In reading and γ_j, the 90-10 difference is .374 and the 75-25 difference is .183. In math and σ²_j, the 90-10 difference is .176 and the 75-25 difference is .079. In reading and σ²_j, the 90-10 difference is .223 and the 75-25 difference is .147. I then form the standard error necessary at each of the gaps by dividing the gap by 1.96.

1.4.1 Correlation between γ_j and σ²_j

A worry in using only estimates of γ_j in rankings is that teachers who produce high value-added means may be leaving some students behind, producing small gains for those students. In order to examine whether this is the case, in Table 1.3 I report the correlation between γ̂_j and σ̂²_j, which is -.328 for mathematics and -.206 for reading. Scatterplots of the estimates of γ_j and σ²_j are also shown in Figures 1.3 and 1.4. In both cases the correlation is statistically different from 0 at the 1% level.19 This indicates, contrary to the initial fear, that teachers with higher levels of mean value-added also tend to have a lower variance in value-added. This suggests that, if having a low variance is a good thing, teachers rated favorably along one dimension are more likely to be rated favorably along the other. It also means that a ranking system that incorporates both γ̂_j and σ̂²_j will tend to produce rankings similar to those of a ranking system that only focuses on mean value-added. Moreover, because there are fewer differences in σ²_j across teachers, rankings that incorporate information on the teacher's effect on the variance may not differ much from a ranking based solely on the mean effect.

19 The standard errors for the significance test for the correlations are calculated by bootstrapping.

1.4.2 Do Teacher Rankings Change When We Add Information on Value-Added Variances under Plausible Teacher Ranking Functions?

Principals or administrators may be interested in ranking teachers at least in part on the variance of value-added. A teacher who produces a given mean level of value-added, but with a high variance, may generate more complaints from parents than a teacher who produces a similar mean level and a lower variance. Administrators may also have asymmetric payoffs, for instance if they are penalized for having a certain number of students fall below basic proficiency levels, that may make them rate the slightly lower mean, lower variance teacher more highly.20

20 There may be cases where individuals would prefer a higher variance.
For instance, if a school's sole focus was to produce a few superstar students, it would want teachers with a large variance.

In the following section I produce teacher rankings under a variety of ranking schemes. I use value-added standard deviations in the ranking function rather than variances, because standard deviations are expressed in the same units as the mean, whereas the variance is expressed in squared units.21 I use the following simple ranking function:22

r_j = qγ̂_j − (1 − q)σ̂_j,

where r_j is teacher j's ranking score and q is the weight put on the value-added mean, with weight (1 − q) on the value-added standard deviation. I will compare three alternate ranking systems to the rankings based only on γ_j:

Baseline Ranking: Teacher rankings are based solely on the estimate of γ_j
25% on σ_j: Teacher rankings based 75% on the estimate of γ_j and 25% on the estimate of σ_j
33% on σ_j: Teacher rankings based 67% on the estimate of γ_j and 33% on the estimate of σ_j
50% on σ_j: Teacher rankings based equally on the estimates of γ_j and σ_j

21 The value-added standard deviations are estimated by taking the square root of the estimated value-added variances.
22 There are many other potential objective functions, which may not translate exactly into a mean-variance trade-off. For instance, a principal may want to maximize the number of students that pass a proficiency level, and suppose that principal wants to assign a teacher to a classroom of students that is initially far below the proficiency level. The principal in this case may want a teacher who produces a large variance in value-added to get more students up to that proficiency level. However, I chose the ranking function in this paper for its simplicity.

I report Spearman rank correlations between the baseline ranking system and the three alternate ranking systems in Table 1.5. The rank correlations are above .94 in mathematics and above .88 in reading. All rankings are above .96 when a third or less of the weight is placed on the value-added standard deviation, and above .98 when a quarter or less of the weight is placed on σ_j. Thus, incorporating σ_j into teacher rankings is not likely to dramatically alter the rankings for most teachers under a variety of alternative ranking systems compared to ranking teachers solely on their value-added mean. This result is likely driven by the negative correlation between the value-added mean and standard deviation and the more modest variation in σ_j compared to γ_j.

To add some comparison to these numbers, Goldhaber et al. (2013) compare teacher rankings, based on value-added means, under alternate sets of control variables. The authors find that the correlation between estimates that control for student test scores and demographics and estimates that additionally control for peer characteristics is around .99. The correlation between estimates that control for school fixed effects and estimates that do not is only .65. This suggests that the decision to include information on the value-added standard deviation is slightly more consequential than the decision to include peer variables, and much less important than the decision to include school fixed effects.

One caveat is that, even though the correlations are strong, changing the ranking system can have a large impact for particular teachers. In order to provide a rough sense of how far a teacher may move under the different rankings, in the bottom panel of Table 1.5 I report the fraction of teachers that move in the rankings by more than ±10% of the total number of teachers; the sketch below shows how such a comparison can be computed.
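As a minimal, hedged sketch of the ranking comparison: the inputs below are simulated stand-ins for the estimated γ̂_j and σ̂_j (not the paper's data), and compare_rankings is a hypothetical helper of my own.

```python
import numpy as np
from scipy.stats import spearmanr

def compare_rankings(gamma_hat, sigma_hat, q):
    """Spearman correlation between the baseline ranking (gamma only) and
    r_j = q * gamma_j - (1 - q) * sigma_j, plus the share of teachers whose
    rank moves by more than 10% of the number of teachers."""
    r_alt = q * gamma_hat - (1 - q) * sigma_hat
    rho = spearmanr(gamma_hat, r_alt).correlation
    rank_base = np.argsort(np.argsort(-gamma_hat))   # 0 = highest-ranked teacher
    rank_alt = np.argsort(np.argsort(-r_alt))
    share_moving = np.mean(np.abs(rank_base - rank_alt) > 0.10 * gamma_hat.size)
    return rho, share_moving

# Simulated stand-ins, loosely scaled to the dispersion reported in Table 1.3
rng = np.random.default_rng(1)
gamma_hat = rng.normal(0.0, 0.207, size=6249)
sigma_hat = np.abs(rng.normal(0.43, 0.10, size=6249))   # made-up sd scale
for q in (0.75, 0.67, 0.50):
    print(q, compare_rankings(gamma_hat, sigma_hat, q))
```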
A move of ±10% of teachers corresponds to a move of 625 spots in the rankings for math teachers and 684 spots for reading teachers. Particular teachers can move quite a bit in the rankings under some of the alternate ranking schemes: 22% of teachers move in the rankings by more than the equivalent of ±10% of teachers in the case where 50% of the weight is put on the standard deviation in math. However, in the case where 25% of the weight is put on the standard deviation, only 2% of teachers move the equivalent of ±10% of teachers.

1.5 Sensitivity Checks

I perform a number of sensitivity checks for the analysis, which are reported in Table 1.6 for mathematics and Table 1.7 for reading. I discuss the results for mathematics in detail below, but the results for reading are similar. Overall, the results for the sensitivity checks are similar to the baseline results.

As I discussed in Section 1.4, the estimates of γ_j and σ²_j are not precise enough to distinguish between teachers at the 75th and 25th percentiles when teachers have only 12 student observations, except in the case of γ_j for mathematics. The imprecision that this reflects could potentially affect the results. In row 2 of Table 1.6, I report results when the sample is restricted to include only teachers with more than 100 observations. I report the correlation between the estimates of γ_j and σ²_j, the Spearman rank correlation between a system where teachers are ranked only on the mean and a system where 50% of the weight is put on the estimate of σ_j, and the percentage of teachers that move ±10% of teachers in the rankings under the alternate ranking system. Overall, the results are similar to the baseline case. The Spearman rank correlations are slightly higher and the percent moving 10% in the rankings is slightly lower when this restriction is imposed compared to the baseline.

As discussed in Goldhaber et al. (2013), there is considerable disagreement about the conditioning variables that are needed for ignorability. It is common to include classroom-level peer characteristics or school indicator variables in the regressions. In row 3 of Table 1.6, I report results for the estimates of γ_j and σ²_j when classroom-level peer variables are included.23 The correlation between the estimates of γ_j and σ²_j when the classroom-level variables are included is -.244. The Spearman rank correlation between the alternate ranking system and the ranking based only on the mean is .906, and the percent moving more than 10% is 34%. The correlation is slightly lower, and the percent moving 10% is slightly higher, than in the baseline case. This may be due to the additional noise in the estimates created by trying to identify the coefficients on the classroom peer variables.

In row 4, I show results from estimating the value-added variances using a linear functional form rather than an exponential functional form, while keeping the covariate set identical to the baseline specification.24 I estimate σ²_j in an OLS regression of the squared residuals from the regression used to estimate (1.1) on X_i − X̄ and teacher indicator variables.

23 The peer variables I include are: average prior year math and reading scores, proportion free and reduced-price lunch, and proportion limited English proficient. These coefficients are identified using within-teacher variation in classroom composition.
24 Note that the estimates of σ²_j are not guaranteed to be positive using this approach.
However, in practice there are only a few instances where σ²_j is estimated to be negative for a teacher. In the case with the linear variance but no school fixed effects, only .2% of teachers have negative estimates. In the specification with school dummy variables and linear variance, reported below, only .1% of teachers have negative estimates.

The correlation between the estimate of σ²_j and the estimate of γ_j is -.348. The rank correlation is .919, which is similar to the rank correlation of .947 from the baseline specification.

In the final row, I report results from a specification with school dummy variables. Due to computational issues related to achieving convergence in the non-linear least squares algorithm when school and teacher indicator variables were both included, I again change the functional form for the variance from an exponential function of the parameters to a linear function. I estimate σ²_j in an OLS regression of the squared residuals from the regression used to estimate (1.1), which also had school indicator variables included, on X_i − X̄, school indicator variables, and teacher indicator variables.25 In this case, the correlation between the estimates of γ_j and σ²_j is -.310, and the Spearman correlation drops slightly to .860 compared to .947 in the baseline specification. The percent that move ±10% also increases to 35%.

25 I used the user-written felsdvreg package in Stata to estimate the coefficients on the teacher and school indicator variables. The coefficients are identified by teachers switching schools.

1.6 Summary and Conclusions

Researchers and administrators interested in teacher quality typically produce a single measure of teacher quality. If teachers are having heterogeneous impacts on their students, this measure reflects differences across teachers in the mean value-added they provide, but examining only the effect on the mean may offer an incomplete characterization of a teacher's quality. This paper offers an empirical strategy for identifying measures of value-added variances, and examines how rankings change when this information is added.

There are several important findings in this paper. I find evidence that there are modest to moderate differences across teachers in the size of the value-added variance, but the differences across teachers for σ²_j are smaller than the differences across teachers for γ_j. Teacher rankings based on the mean and the variance are negatively correlated, with a correlation around -.25. As a result, teacher rankings that include value-added variances tend to be highly correlated with rankings that only include value-added means under some plausible ranking schemes. Typically the correlation is above .9. A positive conclusion from this paper is that rankings using measures of value-added means are fairly robust to adding information on the value-added variance.

This paper also shows that value-added variances can be calculated at fairly low cost. Researchers already computing value-added means by regressing test scores on covariates and teacher indicator variables can estimate value-added variances using the two-step approach used in the paper. These estimates could be useful, for instance, for researchers who wish to study the factors that affect the variance in teacher value-added. More research could be done on this topic. The methods and findings in this paper can serve as a starting point.

APPENDIX

TABLES AND FIGURES

Table 1.1: Student Level Summary Statistics
Original Sample (Number of Student Obs: 923,247)
  Variable                            Mean     Std. Dev.
  Math Standardized Scale Score       0        1
  Reading Standardized Scale Score    0        1
  White                               0.492    .5
  Free and Reduced Price Lunch        0.486    0.5
  Limited English Proficiency         0.18     0.384
  Female                              0.508    0.5

Sample After Restrictions (Number of Student Obs: 685,967)
  Math Standardized Scale Score       0.074    0.962
  Reading Standardized Scale Score    .09      0.956
  White                               0.497    0.5
  Free and Reduced Price Lunch        0.479    0.5
  Limited English Proficiency         0.177    0.382
  Female                              0.512    0.5

Table 1.2: Teacher Level Summary Statistics

Math Teachers
  Variable                                Mean      Std. Dev.
  Number of Mathematics Teachers          5,987
  Student Obs for Math Teachers           114.58    126.303
  Student and Teacher Characteristics Aggregated to Teacher Level:
  Average Prior Year Math Score           -.203     .547
  Fraction Free Reduced Price Lunch       0.527     0.257
  Fraction Limited English Proficient     0.186     0.215
  Fraction White                          0.462     0.299
  Teacher Experience                      7.826     8.85

Reading Teachers
  Variable                                Mean      Std. Dev.
  Number of Reading Teachers              6,606
  Student Obs for Reading Teachers        105.013   119.82
  Student and Teacher Characteristics Aggregated to Teacher Level:
  Average Prior Year Reading Score        -0.337    0.611
  Fraction Free Reduced Price Lunch       0.521     0.256
  Fraction Limited English Proficient     0.181     0.22
  Fraction White                          0.471     0.299
  Teacher Experience                      7.711     8.832

Table 1.3: Standard Deviation and Correlations for γ_j and σ²_j

  Statistic                      Mathematics   Reading
  Std Dev γ̂_j                    0.207         .155
  Std Dev σ̂²_j                   0.086         .106
  Correlation γ̂_j and σ̂²_j       -.328         -.206
  Number of Teachers             6249          6836

Controls included in estimation of γ_j and σ²_j include a year dummy, cubic functions of lagged and twice-lagged math and reading scores, indicators for minority status, free-and-reduced price lunch status, limited English proficiency status, gender, and teacher indicator variables.

Table 1.4: Estimates and Standard Errors of γ_j and σ²_j for Select Teachers (standard errors in parentheses)

  γ_j                Mathematics       Reading
  10th Pctl          -.165 (.106)      -.084 (.094)
  25th Pctl          -.061 (.081)      .006 (.061)
  50th Pctl          .066 (.056)       .092 (.091)
  75th Pctl          .194 (.066)       .187 (.066)
  90th Pctl          .314 (.086)       .272 (.109)

  σ²_j               Mathematics       Reading
  10th Pctl          .096 (.035)       .171 (.049)
  25th Pctl          .133 (.049)       .220 (.059)
  50th Pctl          .181 (.043)       .270 (.073)
  75th Pctl          .238 (.059)       .329 (.079)
  90th Pctl          .301 (.072)       .405 (.122)

  Observations       584               743

10th Pctl refers to a teacher at the 10th percentile, 25th Pctl to a teacher at the 25th percentile, and so on. Controls included in estimation of γ_j and σ²_j include a year dummy, cubic functions of lagged and twice-lagged math and reading scores, indicators for minority status, free-and-reduced price lunch status, limited English proficiency status, gender, and teacher indicator variables.

Figure 1.1: Plots of 95% CI and Standard Errors on the Number of Student Observations for Math Teachers

The OLS estimates of γ_j are in the top left. The NLS estimates of σ²_j are in the top right. Average standard errors at each number of student observations, formed using a polynomial smoother, for the OLS estimates of γ_j are in the bottom left, and the average standard errors for the NLS estimates of σ²_j are in the bottom right. The red lines mark 25, 50, and 100 student observations. The blue lines represent the standard error necessary to statistically reject at the 5% level that a teacher at the 25th percentile is not above the 75th percentile, and that a teacher in the 10th percentile is not above the 90th.
Figure 1.2: Plots of 95% CI and Standard Errors on the Number of Student Observations for Reading Teachers

The OLS estimates of γ_j are in the top left. The NLS estimates of σ²_j are in the top right. Average standard errors at each number of student observations, formed using a polynomial smoother, for the OLS estimates of γ_j are in the bottom left, and the average standard errors for the NLS estimates of σ²_j are in the bottom right. The red lines mark 25, 50, and 100 student observations. The blue lines represent the standard error necessary to statistically reject at the 5% level that a teacher at the 25th percentile is not above the 75th percentile, and that a teacher in the 10th percentile is not above the 90th.

Figure 1.3: Scatterplot of Estimates of γ_j and σ²_j for Mathematics

Figure 1.4: Scatterplot of Estimates of γ_j and σ²_j for Reading

Table 1.5: Comparison of Ranking System Composed of γ̂_j and Alternative Ranking Systems Including σ_j

  Spearman Rank Correlation with γ̂_j
  Subject        25% on σ_j   33% on σ_j   50% on σ_j
  Mathematics    .993         .985         .947
  Reading        .982         .964         .881

  Percentage Moving in Rankings ±10% of Teachers
  Mathematics    2%           6%           22%
  Reading        8%           16%          37%

  Math Observations: 6249. Reading Observations: 6836.

Controls included in estimation of γ_j and σ²_j include a year dummy, cubic functions of lagged and twice-lagged math and reading scores, indicators for minority status, free-and-reduced price lunch status, limited English proficiency status, gender, and teacher indicator variables.

Table 1.6: Sensitivity Checks for Mathematics Teachers

  Specification                                  Corr γ̂_j and σ̂²_j   Spearman 50% on σ̂_j   Moving ±10%
  Baseline                                       -.328                .947                   22%
  Teachers with ≥ 100 Student Obs                -.317                .969                   15%
  Classroom Level Variables                      -.244                .906                   34%
  Linear Variance                                -.348                .919                   28%
  School Dummy Variables with Linear Variance    -.310                .860                   35%

Controls included in baseline estimation of γ_j and σ²_j include a year dummy, cubic functions of lagged and twice-lagged math and reading scores, indicators for minority status, free-and-reduced price lunch status, limited English proficiency status, gender, and teacher indicator variables.

Table 1.7: Sensitivity Checks for Reading Teachers

  Specification                                  Corr γ̂_j and σ̂²_j   Spearman 50% on σ̂_j   Moving ±10%
  Baseline                                       -.206                .881                   37%
  Teachers with ≥ 100 Student Obs                -.228                .924                   31%
  Classroom Level Variables                      -.172                .858                   40%
  Linear Variance                                -.221                .844                   42%
  School Dummy Variables with Linear Variance    -.166                .823                   39%

Controls included in baseline estimation of γ_j and σ²_j include a year dummy, cubic functions of lagged and twice-lagged math and reading scores, indicators for minority status, free-and-reduced price lunch status, limited English proficiency status, gender, and teacher indicator variables.

CHAPTER 2

LEFT WITH BIAS? QUANTILE REGRESSION WITH MEASUREMENT ERROR IN LEFT HAND SIDE VARIABLES

2.1 Introduction

Quantile regression, which allows a researcher to examine the effects of covariates on different points of the conditional distribution of the outcome variable, is an important tool for empirical research. For instance, such methods have been used to examine the returns to schooling (Buchinsky (1994)), intergenerational earnings (Eide and Showalter (1999)), birth weight (Abrevaya and Dahl (2008)), and empirical finance (Chernozhukov and Umantsev (2001)). See Koenker and Hallock (2001) for a review.
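As a quick mechanical reminder (my own illustration, not part of the original text): the τth quantile is the minimizer of the "check" loss that reappears in Equation (2.2) below, which a few lines of Python can verify numerically.

```python
import numpy as np

# The tau-th sample quantile minimizes the "check" (pinball) loss
# rho_tau(u) = u * (tau - 1[u < 0]); a quick numerical check:
rng = np.random.default_rng(0)
y = rng.lognormal(size=100_000)
tau = 0.75

def check_loss(c):
    u = y - c
    return np.mean(u * (tau - (u < 0)))

grid = np.linspace(y.min(), np.quantile(y, 0.99), 2000)
c_star = grid[np.argmin([check_loss(c) for c in grid])]
print(c_star, np.quantile(y, tau))   # the two should nearly coincide
```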
Despite its popularity as an empirical tool, a relatively small literature exists on the effects of measurement error on quantile regression estimation, and within this literature most of the work has concentrated on measurement error in independent variables.1 Almost no research has been done on the issue of bias in quantile regression estimation caused by measurement error in the dependent variable, except for a brief discussion in a footnote in Hausman (2001) and in Chen et al. (2005), who only examine the issue in the context of censored quantile regression at the median. This lack of research is surprising because, unlike OLS, even classical measurement error in the dependent variable can cause quantile coefficient estimates to be biased.2 Moreover, many other realistic types of measurement error, such as mean-reverting and heteroskedastic measurement error, complicate matters quickly.3

1 See Angrist et al. (2006) for example.
2 Hausman (2001) mentions this fact and that the bias tends to be in the direction of the median coefficient estimate.
3 Bound and Krueger (1991), Bound et al. (1994), and Pischke (1995) have found evidence that the measurement error is mean reverting, and Hausman (2001) reports that heteroskedastic measurement error may exacerbate bias.

In this paper, I examine bias in the quantile regression estimator caused by measurement error in the dependent variable using simulation and an empirical example. In the simulations, I examine the cases of classical measurement error, mean-reverting measurement error, and heteroskedastic measurement error. My results confirm that the introduction of classical measurement error when the underlying error term is symmetrically distributed can bias the quantile regression estimator towards the coefficient at the median.4 My results further show that, in cases where the regression error is asymmetric, the estimator can be biased as well, but no clear pattern emerges. The simulations also show that mean-reverting and heteroskedastic measurement error can potentially cause bias.

4 This finding is also reported in Hausman (2001), although no simulation results are presented.

In the empirical application, I examine quantile regression estimates of the returns to education using both reported earnings from the Health and Retirement Study and matched IRS W-2 records, which I assume to be accurate. I find that estimates of the returns to education at the median and 75th percentile are overstated by around 1 percentage point (a bias of around 12-15%) when reported earnings are used instead of the more accurate W-2 records. These differences are statistically significant at the 5% level. For context, this bias is similar in magnitude to the upward bias caused by omitted ability in the OLS estimator that has been found by others.5 Also, the pattern of the estimates suggests that the returns to education are less heterogeneous than previously thought.

5 Upward ability bias in the OLS estimator of the return to education is also around 10-15%, as reported in Card (1999).

2.2 Model and Estimator

This section provides a brief overview of quantile regression. For more details, one can read Koenker and Bassett (1978), Koenker (2005), or Wooldridge (2010), among many other sources. The goal of quantile regression is typically to examine the effects of covariates on different points of the conditional distribution of the outcome variable. It is common to model conditional quantiles using a model that is linear in parameters, in which case we can express the τth conditional quantile of y_i as
Qτ(yi | xi) = xi β0(τ),    (2.1)

where xi is a vector of covariates and β0(τ) is a vector of population parameters. It can be shown that β0(τ) solves

min_{β ∈ ℝ^K} E[(τ − 1[yi − xi β < 0])(yi − xi β)],    (2.2)

where 1[·] is the indicator function. Assuming that β0(τ) uniquely solves Equation (2.2), the parameters can be consistently estimated under some weak regularity conditions by finding the values that satisfy the sample analog.

In many cases, instead of observing the dependent variable yi, the researcher observes a variable measured with error, call it Yi. As is well known, such measurement error causes no bias in the OLS estimator if it follows the classical assumptions.6 Heuristically, this is because the measurement error is simply absorbed into a composite error term. This fortunate outcome does not generally carry over to the quantile regression estimator, for the following reason. Let ui = yi − xi β be the quantile regression error term. In the case of no measurement error, it can be shown that the first order conditions for (2.2) are:

E(xi (1[ui < 0] − τ)) = 0.    (2.3)

When measurement error in the dependent variable is introduced, the first order conditions become:

E(xi (1[ui + ei < 0] − τ)) = 0.    (2.4)

Because the expected value operator does not pass through the indicator function, the first order conditions are not the same, so there is no guarantee that the parameters that solve (2.4) also solve (2.3), even under classical measurement error.7

6 I define classical measurement error as measurement error that is independent of the true value of the dependent variable and the covariates.
7 Note that in the OLS case, the first order conditions with and without classical measurement error are the same, because the expected value operator does pass through linear functions.

2.3 Simulation Evidence of Bias in Quantile Regression

Since no closed form solution exists for the quantile regression estimator, it is difficult to examine bias caused by measurement error in the dependent variable analytically. In order to study the issue further, I produce simulation evidence on how various forms of measurement error affect the quantile coefficient estimates.

My data generating processes consist of a dependent variable with a single explanatory variable. In order to generate different parameters at different quantiles, a random coefficients model is used. My baseline data generating process is meant to be a very simple model of the returns to schooling and takes the following form:

yi = α0 + xi β0 + γ xi ηi + ωi,    (2.5)

where ηi and ωi are independent of one another and of xi and have a standard normal distribution. Note that throughout the discussion of the simulations, I refer to ωi as the regression error and to ei, which I define below, as the measurement error. The regressor xi, which can be thought of as years of schooling, has a binomial distribution with n = 16 and p = .75, producing a distribution with a mean of 12 and a variance of 3.8 In my simulation, β0 is set to .075, α0 is set to 5, and γ is set to .04. The parameter choices are meant to roughly mimic what is found in the previous literature and in my HRS/W-2 data.9

8 These parameters were chosen to give a very basic approximation to the distribution of the number of years of schooling. I have also produced simulation results where xi has a uniform, a normal, and a Poisson distribution. The general patterns are the same as described below.
9 The choice of .075 is meant to be reflective of the estimates of the mean return to education found previously by other authors, which typically are in the .07-.10 range. For an overview of the mean returns to education literature, see Card (1999).

In addition to the baseline, I also examine the performance of the estimator under a number of alternate data generating processes. In the second data generating process, I make the effect of xi negative rather than positive.
In the third, the effect at the 10th percentile is the largest. In the fourth, I examine the bias caused by measurement error when the effect of xi is much more heterogeneous. In the fifth, I examine the case where the distribution of ωi in Equation (2.5) is the Student's t distribution with 3 degrees of freedom instead of the standard normal distribution; this distribution is symmetric but has thicker tails than the standard normal. In the sixth, I examine the results when ωi has an asymmetric, lognormal distribution. In this case, ωi = exp(Zi), where Zi has a standard normal distribution. For this case, the effect at the mean and the effect at the median are no longer identical.

I examine three cases of measurement error: classical, mean-reverting, and heteroskedastic measurement error.10 In each case, the measurement error is additive, with the form:

Yi = yi + ei.    (2.6)

10 Classical measurement error is a case in which the measurement error is independent of the covariates. Mean-reverting measurement error is a case where there is a negative correlation between ωi and ei. As discussed in Kim and Solon (2005), one way to interpret mean reversion found in the measurement error in earnings records is that when workers are asked to report their earnings for the year, they under-report transitory earnings and shade toward their usual earnings. Heteroskedastic measurement error is a case where the variance of the measurement error depends on the covariates in xi.

Simulations were done using Stata. 1,000 simulation repetitions were performed, each containing 10,000 simulated observations. In the tables, I report the quantile regression estimates at the .10, .25, .50, .75, and .90 quantiles. For purposes of comparison, I also report OLS estimates.

2.3.1 Simulation Results Under Classical Measurement Error

For the simulations with classical measurement error, I assume that ei is independent of yi and xi and is normally distributed with a mean of zero. The variance is chosen so that the reliability ratio is

Var(yi) / [Var(yi) + Var(ei)] = .8.    (2.7)

This reliability ratio is approximately the value calculated by Bound and Krueger (1991) for men's reported income in the CPS data. In addition, I examine classical measurement error with a reliability of .6 as a more extreme case.
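The simulations reported below were run in Stata; the following is a minimal sketch of the baseline design in Python (assuming numpy, pandas, and statsmodels), not the original simulation code. It draws the data generating process in Equation (2.5), adds classical measurement error calibrated to a reliability ratio of .8 as in Equation (2.7), and compares quantile coefficient estimates using the clean and mismeasured outcome. A single repetition is shown; the reported results average 1,000 such repetitions.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 10_000

    # Equation (2.5): y = a0 + b0*x + g*x*eta + omega, with a0 = 5, b0 = .075, g = .04
    x = rng.binomial(16, 0.75, size=n)      # "years of schooling": mean 12, variance 3
    eta = rng.standard_normal(n)            # random-coefficient heterogeneity
    omega = rng.standard_normal(n)          # regression error
    y = 5 + 0.075 * x + 0.04 * x * eta + omega

    # Classical measurement error, independent of y and x, scaled so that the
    # reliability ratio Var(y)/(Var(y) + Var(e)) equals .8, i.e. Var(e) = Var(y)/4.
    e = rng.normal(0.0, np.sqrt(y.var() / 4), size=n)
    df = pd.DataFrame({"x": x, "y": y, "Y": y + e})

    for tau in (0.10, 0.25, 0.50, 0.75, 0.90):
        b_clean = smf.quantreg("y ~ x", df).fit(q=tau).params["x"]
        b_noisy = smf.quantreg("Y ~ x", df).fit(q=tau).params["x"]
        print(f"tau={tau:.2f}  clean={b_clean:.3f}  mismeasured={b_noisy:.3f}")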
In row (1) of Table 2.1, I report the results for the baseline specification with normally distributed regression error and classical measurement error with a reliability of .8. In column (2), the quantile regression estimator at the .10 quantile is biased towards the median coefficient in this simulation: the estimate is .056, while the true value of the parameter is .053 (a bias of roughly 6%) and the median coefficient is .075. The estimator at the .25 quantile is also biased towards the median, but to a lesser degree. The median estimator is unbiased. The estimator at the .75 quantile is slightly biased towards the median coefficient, and the estimator at the .90 quantile is biased toward the median coefficient by an amount nearly symmetric with the estimator at the .10 quantile. This pattern is consistent with the pattern reported in the footnote in Hausman (2001): quantile regression estimators at the tails of the distribution are biased towards the true parameter at the median.

In row (2), I report the estimates when the reliability is .6. The estimates follow the same pattern as those in row (1), but show a stronger bias towards the true parameter at the median. In row (3), I report results in which the coefficient on xi is negative, and in row (4), results in which the effect at the 10th percentile is the largest. In both of these cases, the finding that under classical error the estimators at the tails are biased toward the median coefficient continues to hold.

Next, I examine the bias caused by measurement error when the effect of xi is much more heterogeneous. In this simulation, the bias at the tails is much larger in magnitude than in the baseline simulation. As shown in Table 2.2, the bias at the 10th and 90th percentiles is still towards the median, but its magnitude is around .06 instead of .003 in the baseline simulation (a bias of around 14% instead of 6%). The simulations do not prove this, but they hint that as the effects at different quantiles become more heterogeneous, the bias under classical measurement error becomes larger: with larger differences between the effect in the tails and the effect at the median, there may be more room for bias towards the median.

In rows (6) and (7), I change the distribution of the regression error. In row (6), I report results where the distribution of ωi in Equation (2.5) is the Student's t distribution with 3 degrees of freedom instead of the standard normal distribution. Despite this difference, the results look very similar to the original simulation in row (1); they still display bias towards the median in the case of classical measurement error. Finally, in row (7), I examine the results when the distribution of ωi is asymmetric, with a lognormal distribution. An important thing to note is that in this case the coefficients at the tails of the distribution are not necessarily biased towards the median coefficient. Also, the estimator at the median is noticeably biased, by around 1 percentage point, which was not the case with the symmetric distributions.

A few key points emerge. First, in my simulations, when the error term ωi is symmetrically distributed, the quantile regression estimators at the tails tend to be biased towards the median coefficient when there is classical measurement error. I conjecture that this is true generally for symmetric distributions, but the simulations do not prove it. Second, when the effects across the conditional distribution are relatively more heterogeneous using my data generating process and normally distributed errors, the bias at the tails can be larger.
Third, when the error term ωi is asymmetrically distributed, bias still exists under classical error, but its direction is less clear, and the estimator for the coefficient at the median may also be biased.

2.3.2 Simulation Results Under Mean-Reverting Measurement Error

In Tables 2.3 and 2.4, I report estimates when mean-reverting measurement error is added to the dependent variable. In these cases, the measurement error has the following form:11

E(ei | ωi) = −.3 ωi.    (2.8)

11 As discussed in Kim and Solon (2005), one way to interpret mean reversion in the measurement error is that when workers are asked to report their earnings for the year, they under-report transitory earnings and shade toward their usual earnings. In my simulation, this is reflected in a negative correlation between ωi and ei.

The parameters are chosen to match what is found in Bound and Krueger (1991) for measurement error in log earnings, and also match what is found in my HRS data discussed below. As a more extreme case, I also examine, in row (2), mean-reverting measurement error of the form:

E(ei | ωi) = −.45 ωi.    (2.9)

In rows (1) and (2), results are reported for the baseline specification with normally distributed regression error. In these cases, the estimators at the tails of the distribution are biased away from the true parameter at the median. In row (1), the bias is -.002 at the .10 quantile (a bias of roughly 4%) and .001 at the .90 quantile. In row (2), the bias is more pronounced: -.004 at the .10 quantile (a bias of roughly 7.5%) and .004 at the .90 quantile. The OLS estimator and the estimator at the median are unbiased by this form of mean-reverting measurement error in this simulation.

Rows (3) and (4) show a picture similar to row (1): a slight bias at the tails of the conditional distribution away from the median coefficient. In rows (5) and (6), by contrast, the bias is towards the median, meaning that there appears to be no general result for mean-reverting error regarding bias towards or away from the median coefficient. Again, in the case with more heterogeneous effects in row (5), the magnitude of the bias for the estimators at the tails of the distribution is much larger than in the baseline case in row (1) (a bias of around 13.2% versus 4% at the .10 quantile). Finally, in row (7), there is again no clear pattern to the bias when the regression error is asymmetric.

2.3.3 Simulation Results Under Heteroskedastic Measurement Error

In the cases of heteroskedastic measurement error, reported in Tables 2.5 and 2.6, the measurement error has the following form:

Var(ei | xi) = .25 exp(−.1 xi + .01 xi²).    (2.10)

The parameters are chosen to match what is found empirically in my HRS/W-2 earnings data.12 In row (2), I again examine a more extreme case, which takes the form:

Var(ei | xi) = .25 exp(−.1 xi + .02 xi²).    (2.11)

12 More details on the data can be found in Section 2.4.1. More details on the approach to estimating the parameters can be found in Section 2.4.2.
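Equations (2.8) and (2.10) pin down only the conditional mean and conditional variance of the measurement error, so any implementation must choose the remaining features of its distribution. The sketch below shows one way to draw both non-classical error types in Python under such added assumptions: the errors are taken to be normal, and the standard deviation of the independent noise component in the mean-reverting case (.4) is an illustrative choice, not a value from the text.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 10_000
    x = rng.binomial(16, 0.75, size=n)
    omega = rng.standard_normal(n)

    # Mean-reverting error, Equation (2.8): E(e | omega) = -.3 * omega.
    # Independent noise keeps e from being a deterministic function of omega;
    # its scale (.4) is illustrative only.
    e_mr = -0.3 * omega + rng.normal(0.0, 0.4, size=n)

    # Heteroskedastic error, Equation (2.10): Var(e | x) = .25 * exp(-.1x + .01x^2).
    e_het = rng.standard_normal(n) * np.sqrt(0.25 * np.exp(-0.1 * x + 0.01 * x ** 2))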
Heteroskedastic measurement error has the potential to produce bias at the tails that is considerably larger than in the previous cases. The results in row (1) show heteroskedasticity producing a bias of -.019 at the .10 quantile (a bias of 36%, compared to 6% in the baseline case with classical measurement error) and a bias of .019 at the .90 quantile. The estimators at the .25 and .75 quantiles are also biased, but to a lesser degree. The median estimator does not appear to be biased by heteroskedasticity in the case with normally distributed regression error. In row (2), which is based on added measurement error with a more extreme form of heteroskedasticity, we see severe bias at the tails of the distribution: the bias at the .10 quantile is -.15 (a bias of 283%), and the bias at the .90 quantile is .186. The estimators at the .25 and .75 quantiles are also substantially biased, while the estimator at the median is largely unbiased. In the other simulation scenarios, in rows (3) through (7), substantial bias exists in many cases as well.

Overall, the simulation evidence suggests that the quantile regression estimator can be biased by classical measurement error under a variety of distributions and data generating processes, and that the bias can be made worse by non-classical measurement error, particularly heteroskedastic measurement error. In the next section, I offer an empirical example showing bias.

2.4 Quantile Returns to Education as an Application

In my empirical example, I use reported earnings in data from the Health and Retirement Study benchmarked against what I maintain are more reliable IRS W-2 records. Bound and Krueger (1991) find that reported earnings in Current Population Survey data contain substantial measurement error when benchmarked against more reliable Social Security earnings records. Since quantile regression is often applied to income data, the effect of measurement error in these income variables on quantile coefficient estimates is important to understand. I follow a convention in the literature, for instance Chen et al. (2005), in maintaining that the administrative earnings records are more reliable.13

13 There is good reason to think that the actual dependent variable of interest is permanent income, since income in any one year may not be an accurate reflection of the return to an additional year of education (see Haider and Solon (2006)). Constructing a measure of permanent income and examining how estimates using this measure compare to estimates using reported annual earnings may be a topic of future research.

Buchinsky (1994) is an excellent paper examining the returns to education using quantile regression, and I closely follow the specification in that paper. The regressions are based on the familiar Mincer (1974) equation:

log(yi) = β0 + Si β1 + Ei β2 + Ei² β3 + Bi β4 + εi    (2.12)

where log(yi) is the log of annual earnings, Si is years of schooling, Ei is experience, and Bi is an indicator variable for being African American. I follow Buchinsky (1994) and estimate parameters for a reduced form equation that does not factor in omitted ability. In addition, I do not address the issue of measurement error in reported years of schooling; the focus of this analysis is measurement error in earnings.

2.4.1 Data

The Health and Retirement Study is a survey of over 26,000 Americans (and their spouses) over the age of 50. The purpose of the study is to examine the transition of individuals from the labor force into retirement. The study collects information on income, employment, and demographics, as well as on participants' health, retirement assets, and health care expenditures. Participants are asked to report their total wage earnings, labor force status, age, experience, and education level.14 Importantly for my analysis, many HRS respondents also consented to having their survey records matched with their W-2 earnings records, which allows me to match reported earnings with each respondent's W-2 records.

14 In order to keep things as similar as possible to Buchinsky (1994), I use potential experience, defined as age minus years of education minus 6, as my measure of experience instead of years reported working.
Haider and Solon (2000) show that the respondents who consented have observable characteristics that are similar overall to those of the complete sample. The total wage earnings in the W-2 data come from the box described as "Wages, tips, and other compensation". Income from self-employment and income contributed to 401(k) pensions are not included. Income above $250,000 is top coded.

I make a number of sample restrictions in the analysis. I use only the first wave of the study, which took place in 1992-93, since many of the workers, particularly in later waves of the survey, are not of prime working age. In the 1992-93 wave, workers are surveyed about earnings in 1991. I exclude women from the analysis in order to avoid sample selection issues with female participation in the labor force. My main set of results includes all workers who have at least $2500 in both self-reported and W-2 earnings in 1991 dollars. Summary statistics of the final sample are reported in Table 2.7.

2.4.2 Characteristics of Measurement Error in Log Earnings

I define the measurement error as the difference between log reported earnings and the more accurately measured log W-2 earnings.15 In this section, I provide an overview of the measurement error in my self-reported earnings data.16 Given that non-classical measurement error, and in particular heteroskedastic measurement error, has the potential to exacerbate bias caused by measurement error in the dependent variable, I also examine the relationship between the measurement error and the true earnings variable and covariates.

15 To be more clear, the measurement error for observation i, ei, is constructed as ei = log(sv_earni) − log(irs_earni), where log(sv_earni) is the log of survey earnings and log(irs_earni) is the log of IRS W-2 earnings.
16 Bricker and Engelhardt (2008) also study measurement error in HRS earnings data using the HRS/IRS W-2 matched earnings records. They find evidence of a negative correlation between the measurement error and the true earnings variable, and a positive correlation between the measurement error and the education level of the respondent.

The raw summary statistics of the measurement error are reported in Table 2.8: the mean, standard deviation, and 10th, 25th, 50th, 75th, and 90th percentiles. A kernel estimate of the density of the measurement error is shown in Figure 2.1. The measurement error in log reported earnings has a mean close to zero and a standard deviation of .486. It also shows some rightward skewness, with the mean larger than the median.

I examine the degree of mean reversion in the measurement error in my data by an OLS regression of the measurement error on the log of the true W-2 earnings. As discussed in Kim and Solon (2005), a coefficient of zero on the log of true earnings indicates no mean reversion in the measurement error, and a negative coefficient indicates mean reversion.
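A minimal sketch of this mean-reversion check, assuming a pandas DataFrame df with hypothetical column names sv_earn and irs_earn for survey and W-2 earnings:

    import numpy as np
    import statsmodels.formula.api as smf

    # Measurement error as defined in footnote 15: e = log(survey) - log(W-2)
    df["e"] = np.log(df["sv_earn"]) - np.log(df["irs_earn"])

    # Regress e on log true earnings; a coefficient of zero indicates no mean
    # reversion, a negative coefficient indicates mean reversion.
    mr = smf.ols("e ~ np.log(irs_earn)", data=df).fit(cov_type="HC1")
    print(mr.params, mr.bse)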
Results are reported in column (1) of Table 2.9. The coefficient on the log true earnings variable is -.234, which is statistically significant at the 1% level and similar to the degree of mean reversion detected in Bound and Krueger (1991), Bound et al. (1994), and Pischke (1995).

Next, I examine the relationship between the measurement error and the covariates. I assume the following functional forms for the conditional expectation and variance:

E(ei | Si, Ei, Bi) = γ0 + Si γ1 + Ei γ2 + Ei² γ3 + Bi γ4,    (2.13)

Var(ei | Si, Ei, Bi) = σ² exp(Si δ0 + Si² δ1 + Ei δ2 + Ei² δ3 + Bi δ4).    (2.14)

I estimate the parameters in Equation (2.13) by an OLS regression of ei on years of education, experience, experience squared, and the indicator for being black. I estimate the parameters in Equation (2.14), which will tell us whether the measurement error is conditionally heteroskedastic, by non-linear least squares of the squared residuals, which come from the OLS regression used to estimate Equation (2.13), on the same covariates.17

17 This produces consistent estimates of the parameters in the conditional variance, because Var(ei | Si, Ei, Bi) = E(vi² | Si, Ei, Bi) by definition, where vi = ei − E(ei | Si, Ei, Bi), and because the OLS residuals converge in distribution to vi, as noted in Harvey (1976).

The results for the conditional mean are reported in column (2) of Table 2.9. The estimated coefficients are individually insignificant when education, experience, experience squared, and the indicator for being black are included. Overall, the estimates suggest a small or negligible effect of the covariates on the conditional mean of the measurement error.18 The estimates of the coefficients in Equation (2.14) are reported in column (3). The coefficient on education squared is statistically significant at the 5% level, suggesting that the measurement error is conditionally heteroskedastic; the other estimated coefficients are not statistically significant.

18 These results are consistent with Bound and Krueger (1991), who perform a similar analysis in Table 3 of their paper.

Overall, the measurement error displays mean reversion and heteroskedasticity. The heteroskedasticity is a particular cause for concern, because it had such a strong effect in the simulations. In the next section, I report estimates of the returns to education and experience using both the log of reported earnings and the log of true earnings from the W-2 records as the dependent variable, and I test for differences.

2.4.3 Estimates of the Returns to Education and Experience

In order to test whether estimates based on IRS W-2 records statistically differ from estimates based on reported earnings, I perform the following procedure, sketched in code below:

1. Estimate the Mincer equation in (2.12) using reported earnings and again using the (true) W-2 records at the .10, .25, .50, .75, and .90 quantiles.

2. Form the difference between the estimate using the W-2 records and the estimate using reported earnings at each quantile.

3. Repeat the procedure 1000 times, sampling with replacement, to produce bootstrapped standard errors for the differences between the estimates using reported earnings and (true) W-2 earnings records.
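A compact sketch of this bootstrap for the education coefficient, again in Python with statsmodels and hypothetical column names (log_w2, log_rep, educ, exper, black); the original analysis was not necessarily implemented this way, and the loop is slow with 1,000 replications.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    TAUS = (0.10, 0.25, 0.50, 0.75, 0.90)
    RHS = "~ educ + exper + I(exper ** 2) + black"

    def edu_diffs(d):
        """W-2 minus reported education coefficient at each quantile."""
        return np.array([
            smf.quantreg("log_w2" + RHS, d).fit(q=t).params["educ"]
            - smf.quantreg("log_rep" + RHS, d).fit(q=t).params["educ"]
            for t in TAUS
        ])

    point = edu_diffs(df)                      # steps 1 and 2: differences in estimates
    boot = np.stack([                          # step 3: resample with replacement
        edu_diffs(df.sample(len(df), replace=True, random_state=b))
        for b in range(1000)
    ])
    out = pd.DataFrame({"tau": TAUS, "diff": point, "boot_se": boot.std(axis=0)})
    print(out)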
In Table 2.10, I show estimates of the returns to education and experience using true W-2 earnings in row (1) of each panel and estimates using reported earnings in row (2).19 Stars in row (2) signify that the difference between the estimates using W-2 earnings and reported earnings is statistically different from zero. For comparison, the first column shows estimates of the mean from an OLS regression of log annual earnings on years of education, experience, experience squared, and an indicator for whether the respondent is black. Columns (2) through (6) show estimates of the quantile coefficients at the .10, .25, .50, .75, and .90 quantiles.

19 I have also examined the returns using only workers with more than 7 years of education, and using only workers who report working full time. The patterns are very similar to those reported below.

The return to a year of education at the 10th conditional percentile is estimated to be .046 using log reported earnings and .054 using log W-2 earnings. The return at the 25th percentile is .078 using log reported earnings and .079 using true earnings. Neither of these differences is statistically significant. At the 50th percentile, the estimated return is .086 for reported earnings and .075 for true earnings. This difference of .0113 is statistically significant at the 5% level. Interestingly, it is very similar to the difference found by Chen et al. (2005), who find that using the mismeasured earnings variable biases the censored quantile regression estimate of the return to education at the median by around .014.20 The estimate at the 75th percentile is .083 for reported earnings and .074 for true earnings, and this difference is also statistically significant at the 5% level. The estimates at the 90th percentile are .090 for reported and .083 for true earnings, but this difference is not statistically significant. These results suggest that the returns to education may be overstated at least at the median and the 75th percentile.

20 The authors use the 1978 CPS-SSR match file, which combines reported earnings with Social Security earnings records. They do not report estimates at quantiles other than the median, nor uncensored quantile regression results, because of severe top coding in the Social Security earnings records.

In the lower two panels, I show returns to experience. Since the Mincer equation in (2.12) includes a quadratic in experience, the return depends on the individual's level of experience; I report the returns at 10 years of experience in the middle panel and at 25 years of experience in the lower panel of Table 2.10. Overall, the estimates of the return to experience tend to be low compared to estimates found in the literature. This may be because Health and Retirement Study participants are older, with an average age around 55; at this age, experience may have only a small return. The mean return to experience estimated using OLS and reported earnings is statistically different from the return estimated using true earnings at the 5% level. However, none of the quantile regression estimates statistically differ across the two earnings measures.

2.4.4 Discussion

To summarize the findings, I find with 95% confidence that the effect of an additional year of education at the conditional median and the conditional 75th percentile is overstated when the mismeasured log earnings variable is used. The point estimates suggest that the estimator at the median is overstated by 1.13 percentage points and that the estimator at the 75th percentile is overstated by .91 percentage points. The fact that I find the estimator at the median to be biased may suggest, given my simulation results, that the conditional distribution of true log earnings or of the measurement error is asymmetric.
I cannot say with 95% confidence that the estimators at the other quantiles are biased, but this may reflect lower precision at those quantiles. The point estimate at the 90th percentile suggests a bias of .7 percentage points, and the point estimate at the 10th percentile suggests a bias of -.8 percentage points.

The estimates of the effects of experience do not statistically differ. One potential explanation for not detecting bias for experience is that the quantile effects of experience are estimated imprecisely, because there is not much variation in experience in my sample. The lack of precision in the estimates may explain the lack of statistically significant differences.

2.5 Conclusions

This paper makes several important contributions. I add to a small literature on how quantile regression estimates are affected by measurement error in the dependent variable. I show in my simulations that even under classical measurement error, the quantile regression estimator may be biased by measurement error in the dependent variable. If one assumes classical measurement error and that the conditional distribution of the true dependent variable is normal, then the simulation evidence suggests that the estimator at the tails of the distribution may be biased towards the median coefficient, although a rigorous proof of this could be a useful topic of future research. In some simulations, the median estimator is also biased, and the size of the bias depends on the amount of heterogeneity in the effects across the distribution and on the amount of mean reversion and heteroskedasticity in the measurement error.

Empirically, I show that the quantile regression estimator of the returns to education may be biased by measurement error in log reported earnings when compared to the more accurate W-2 earnings records. I find evidence that the returns to education estimated at the median and 75th percentile are modestly overstated using reported earnings.

This paper can serve as a cautionary note to researchers using quantile regression techniques with possibly mismeasured dependent variables. A bright side is that even though the estimates appear biased, the bias is not overwhelmingly large in the context of the returns to education: the largest bias seen is around 1.13 percentage points. In other contexts, however, the bias could be larger, and future research in such contexts could be useful. Finding a solution to the problem is another important topic for future research.

APPENDIX

TABLES AND FIGURES

Table 2.1: Simulation Results for OLS/Quantile Regression Estimates with Classical Measurement Error in the Dependent Variable

                               OLS             .10             .25             .50             .75             .90
Baseline Spec: Normally Distributed Regression Error
  True Parameter              .075            .053            .063            .075            .087            .097
  (1) Estimator w/ Rel .8     .075 (.00023)   .056 (.00037)   .065 (.00031)   .075 (.00028)   .085 (.00032)   .095 (.00039)
  (2) Estimator w/ Rel .6     .075 (.00027)   .058 (.00044)   .066 (.00035)   .075 (.00033)   .084 (.00037)   .092 (.00046)
Negative Effect: Normal Regression Error
  True Parameter             -.075           -.097           -.087           -.075           -.064           -.053
  (3) Estimator w/ Rel .8    -.075 (.00023)  -.094 (.00037)  -.085 (.00031)  -.075 (.00028)  -.065 (.00032)  -.055 (.00039)
Largest Effect at .10 Quantile: Normal Regression Error
  True Parameter              .075            .088            .082            .075            .068            .061
  (4) Estimator w/ Rel .8     .075 (.00022)   .087 (.00036)   .082 (.0003)    .075 (.00027)   .069 (.0003)    .063 (.00037)

Notes: The first row of each panel reports the true value of the parameter; the second reports the estimated coefficient. Rel .8 (Rel .6) refers to classical measurement error with a reliability ratio of .8 (.6). Standard errors for the mean across the 1,000 repetitions are in parentheses.
Table 2.2: Simulation Results for OLS/Quantile Regression Estimates with Classical Measurement Error in the Dependent Variable

                               OLS             .10             .25             .50             .75             .90
More Heterogeneous Effects: Normal Regression Error
  True Parameter              .077           -.424           -.187            .077            .341            .576
  (5) Estimator w/ Rel .8     .076 (.001)    -.366 (.00164)  -.156 (.00132)   .077 (.00125)   .31 (.00136)    .516 (.00169)
Baseline Spec: Student's t w/ 3 d.f. Regression Error
  True Parameter              .075            .055            .062            .075            .088            .094
  (6) Estimator w/ Rel .8     .074 (.00037)   .06 (.00059)    .066 (.00041)   .075 (.00035)   .084 (.00042)   .09 (.0006)
Baseline Spec: Lognormal Regression Error
  True Parameter              .075            .044            .067            .087            .091            .086
  (7) Estimator w/ Rel .8     .075 (.00045)   .058 (.00043)   .068 (.00035)   .078 (.00038)   .086 (.00053)   .087 (.00106)

Notes: The first row of each panel reports the true value of the parameter; the second reports the estimated coefficient. Rel .8 refers to classical measurement error with a reliability ratio of .8. The lognormal distribution in row (7) is such that log(ωi) has a standard normal distribution. Standard errors for the mean across the 1,000 repetitions are in parentheses.

Table 2.3: Simulation Results for OLS/Quantile Regression Estimates with Mean-Reverting Measurement Error in the Dependent Variable

                                        OLS             .10             .25             .50             .75             .90
Baseline Spec: Normally Distributed Regression Error
  True Parameter                       .075            .053            .063            .075            .087            .097
  (1) Estimator w/ Mean Reverting      .075 (.00018)   .051 (.00031)   .062 (.00025)   .075 (.00022)   .087 (.00026)   .098 (.00031)
  (2) Estimator w/ Larger
      Mean Reverting                   .075 (.00016)   .049 (.00028)   .061 (.00022)   .075 (.0002)    .089 (.00023)   .101 (.00029)
Negative Effect: Normal Regression Error
  True Parameter                      -.075           -.097           -.087           -.075           -.064           -.053
  (3) Estimator w/ Mean Reverting     -.075 (.00018)  -.099 (.00031)  -.088 (.00025)  -.075 (.00022)  -.063 (.00026)  -.052 (.00031)
Largest Effect at .10 Quantile: Normal Regression Error
  True Parameter                       .075            .088            .082            .075            .068            .061
  (4) Estimator w/ Mean Reverting      .075 (.00017)   .09 (.0003)     .083 (.00023)   .075 (.00022)   .067 (.00024)   .06 (.0003)

Notes: The first row of each panel reports the true value of the parameter; the second reports the estimated coefficient. Mean-Reverting refers to measurement error of the form E(ei | ωi) = −.3ωi; Larger Mean-Reverting to E(ei | ωi) = −.45ωi. Standard errors for the mean across the 1,000 repetitions are in parentheses.
Table 2.4: Simulation Results for OLS/Quantile Regression Estimates with Mean-Reverting Measurement Error in the Dependent Variable

                                        OLS             .10             .25             .50             .75             .90
More Heterogeneous Effects: Normal Regression Error
  True Parameter                       .077           -.424           -.187            .077            .341            .576
  (5) Estimator w/ Mean Reverting      .077 (.00098)  -.368 (.00162)  -.159 (.00126)   .078 (.00119)   .312 (.00133)   .521 (.0016)
Baseline Spec: Student's t w/ 3 d.f. Regression Error
  True Parameter                       .075            .055            .062            .075            .088            .094
  (6) Estimator w/ Mean Reverting      .075 (.00029)   .056 (.00046)   .065 (.00033)   .075 (.0003)    .085 (.00035)   .093 (.00045)
Baseline Spec: Lognormal Regression Error
  True Parameter                       .075            .044            .067            .087            .091            .086
  (7) Estimator w/ Mean Reverting      .075 (.00034)   .058 (.00042)   .067 (.00034)   .077 (.00034)   .086 (.00042)   .091 (.00072)

Notes: The first row of each panel reports the true value of the parameter; the second reports the estimated coefficient. Mean-Reverting refers to measurement error of the form E(ei | ωi) = −.3ωi; Larger Mean-Reverting to E(ei | ωi) = −.45ωi. The lognormal distribution in row (7) is such that log(ωi) has a standard normal distribution. Standard errors for the mean across the 1,000 repetitions are in parentheses.

Table 2.5: Simulation Results for OLS/Quantile Regression Estimates with Heteroskedastic Measurement Error in the Dependent Variable

                                        OLS             .10             .25             .50             .75             .90
Baseline Spec: Normally Distributed Regression Error
  True Parameter                       .075            .053            .063            .075            .087            .097
  (1) Estimator w/ Heteroskedastic     .075 (.00023)   .034 (.00038)   .053 (.00031)   .075 (.00028)   .097 (.00033)   .116 (.0004)
  (2) Estimator w/ Larger
      Heteroskedastic                  .075 (.00034)  -.133 (.00049)  -.035 (.00041)   .075 (.00039)   .184 (.00042)   .282 (.00052)
Negative Effect: Normal Regression Error
  True Parameter                      -.075           -.097           -.087           -.075           -.064           -.053
  (3) Estimator w/ Heteroskedastic    -.075 (.00023)  -.116 (.00038)  -.097 (.00031)  -.075 (.00028)  -.053 (.00033)  -.034 (.0004)
Largest Effect at .10 Quantile: Normal Regression Error
  True Parameter                       .075            .088            .082            .075            .068            .061
  (4) Estimator w/ Heteroskedastic     .075 (.00022)   .067 (.00038)   .07 (.0003)     .075 (.00028)   .079 (.00031)   .084 (.00038)

Notes: The first row of each panel reports the true value of the parameter; the second reports the estimated coefficient. Heteroskedastic refers to measurement error of the form Var(ei | xi) = .25 exp(−.1xi + .01xi²); Larger Heteroskedastic to Var(ei | xi) = .25 exp(−.1xi + .02xi²). Standard errors for the mean across the 1,000 repetitions are in parentheses.

Table 2.6: Simulation Results for OLS/Quantile Regression Estimates with Heteroskedastic Measurement Error in the Dependent Variable

                                        OLS             .10             .25             .50             .75             .90
More Heterogeneous Effects: Normal Regression Error
  True Parameter                       .077           -.424           -.187            .077            .341            .576
  (5) Estimator w/ Heteroskedastic     .077 (.001)    -.461 (.00159)  -.205 (.00125)   .078 (.00119)   .36 (.00128)    .614 (.00162)
Baseline Spec: Student's t w/ 3 d.f. Regression Error
  True Parameter                       .075            .055            .062            .075            .088            .094
  (6) Estimator w/ Heteroskedastic     .075 (.00037)   .014 (.0006)    .04 (.00041)    .075 (.00036)   .11 (.00042)    .137 (.00059)
Baseline Spec: Lognormal Regression Error
  True Parameter                       .075            .044            .067            .087            .091            .086
  (7) Estimator w/ Heteroskedastic     .075 (.00045)  -.011 (.00044)   .037 (.00038)   .089 (.00039)   .134 (.00053)   .145 (.00106)

Notes: The first row of each panel reports the true value of the parameter; the second reports the estimated coefficient. Heteroskedastic refers to measurement error of the form Var(ei | xi) = .25 exp(−.1xi + .01xi²); Larger Heteroskedastic to Var(ei | xi) = .25 exp(−.1xi + .02xi²). The lognormal distribution in row (7) is such that log(ωi) has a standard normal distribution. Standard errors for the mean across the 1,000 repetitions are in parentheses.
Table 2.7: Summary Statistics, Wave 1 (1992), Male Workers with Positive Earnings

Variable                           Mean      Std. Dev.      Min.       Max.
Total Reported Annual Earnings   36052.17    28173.51       2800     410000
Total Annual Earnings W-2        33157.61    25492.41       2600     245000
Hours Worked/Week Main Job          43.69       10.52          1         95
Weeks Worked/Year Main Job          50.43        5.47          1         52
Hourly Wage Rate                    27.74      486.15       0.96      24000
Years of Tenure Current Job         15.46       11.78          0       55.8
Total Years Worked                  37.46        5.92          3         65
Total Years of Education             12.7        3.29          0         17
Age                                 55.87        4.61         23         77
Black                               0.129       0.335          0          1
Hispanic                             .091       0.288          0          1

Number of Observations: 2975

Figure 2.1: Kernel Estimate of the Density of Measurement Error in Log Earnings
Notes: Measurement error is defined as the difference between log reported earnings and log W-2 earnings.

Table 2.8: Measurement Error Descriptive Statistics

                                                 Quantiles
Variable             Mean    Std. Dev.    .10     .25     .50     .75     .90
Measurement Error    .060      .486      -.268   -.051    .032    .166    .443

Number of Observations: 2975
Notes: Measurement error is defined as the difference between log reported earnings and log W-2 earnings.

Table 2.9: Estimates of Conditional Distribution of Measurement Error

                           Mean                     Variance
VARIABLES             (1)          (2)                (3)
Log W-2 Earnings    -.234***
                     (.018)
Education                          .001              -.103
                                  (.004)             (.078)
Education Squared                                     .008**
                                                     (.004)
Experience                        -.019              -.079
                                  (.014)             (.076)
Experience Squared                 .0002               .001
                                  (.0002)             (.001)
Black                             -.019                .069
                                  (.026)              (.190)
Observations         2,975        2,975               2,975

Notes: Estimates in column (1) come from an OLS regression of the measurement error on log true earnings. Estimates in column (2) come from an OLS regression of the measurement error on the covariates. Estimates in column (3) come from an NLS regression of the squared residuals from the OLS regression in column (2) on the covariates. Robust standard errors in parentheses. *** p<0.01, ** p<0.05, * p<0.1.

Table 2.10: Estimates of Mincer Equation: Male Workers with Positive Earnings

                                    OLS             .10            .25            .50             .75             .90
Returns to Education
  Using (True) W-2 Earnings     .073 (.005)     .054 (.014)    .079 (.007)    .075 (.005)     .074 (.005)     .083 (.006)
  Using Reported Earnings       .075 (.005)     .046 (.014)    .078 (.006)    .086** (.006)   .083** (.005)   .090 (.007)
Returns to Experience at 10 Years
  Using (True) W-2 Earnings     .016 (.011)     .065 (.041)    .016 (.016)    .002 (.013)     .003 (.012)     .001 (.019)
  Using Reported Earnings       .002* (.011)    .032 (.023)    .012 (.021)   -.003 (.013)    -.007 (.010)     .001 (.026)
Returns to Experience at 25 Years
  Using (True) W-2 Earnings     .004 (.005)     .020 (.022)    .002 (.008)   -.002 (.006)    -.000 (.001)     .002 (.001)
  Using Reported Earnings      -.005** (.006)  -.000 (.011)   -.002 (.010)   -.005 (.006)    -.004 (.005)     .001 (.012)

Number of Observations: 2975

Notes: All regressions include years of education, experience, experience squared, and an indicator for whether the respondent is black. All workers have at least $2500 in reported and W-2 earnings in 1991 dollars. Bootstrapped standard errors in parentheses; 1000 bootstrap replications were performed. *** Difference between estimates using W-2 and reported earnings statistically significant at the 1% level. ** Difference significant at the 5% level. * Difference significant at the 10% level.

CHAPTER 3

DOES THE PRECISION AND STABILITY OF VALUE-ADDED ESTIMATES OF TEACHER PERFORMANCE DEPEND ON THE TYPES OF STUDENTS THEY SERVE?

This work is coauthored with Cassandra Guarino, Mark Reckase, and Jeff Wooldridge.

3.1 Introduction

Teacher value-added estimates are increasingly being used in high stakes decisions.
Many districts are implementing merit pay programs or moving toward making tenure decisions based at least partly on these measures. It is therefore important to understand the chances that a teacher will be misclassified in a way that may lead to undeserved sanctions.

Misclassification rates depend on the precision of teacher effect estimates, which is related to a number of factors. The first is the number of students a teacher is paired with in the data: teachers who can be matched with more student observations will tend to have more precise teacher effect estimates. Another factor is the error variance associated with the students in the teacher's classroom. If the error variance is large, perhaps because the model poorly explains the variation in achievement or because the achievement measures themselves poorly estimate the true ability level of a student, then the precision of a teacher effect estimate will be low.

A question that seems to have received little attention is whether precision varies with the characteristics of the students a teacher faces. Tracking of students into classrooms and sorting of students across schools mean that different teachers may face classrooms that are quite different from one another. If teachers serving certain groups of students have less reliable estimates of value-added than teachers serving other students, then, all else the same, the probability that a teacher is rated above or below a certain threshold will be larger for teachers serving those groups. High stakes policies that reward or penalize teachers above or below a certain threshold will then, again all else the same, impose sanctions or rewards on teachers serving those groups with a higher likelihood.

There are reasons to suspect that the characteristics of students in a classroom relate to the precision of a teacher effect estimate. First, there could be a relationship between the characteristics of a classroom and the number of students linked to a teacher. This could be true because of a relationship between class size and student characteristics, because of poor data management in schools serving certain groups, or because of low experience levels for teachers serving certain groups, which limit the number of years that can be used to estimate the teacher's value-added. Also, heteroskedastic student level error can imply that teachers paired with students who have large error variances have less reliable teacher effect estimates.

There is strong theoretical reason for supposing that the student level error is heteroskedastic. Item response theory suggests that because test items are typically targeted towards students in the center of the achievement distribution, achievement tends to be measured less precisely for students in the tails. The heteroskedasticity we find is also quite substantial, and it suggests that teachers paired with particularly high achieving or low achieving students may have less reliable teacher effect estimates. In addition to heteroskedasticity caused by poor measurement, it is also conceivable that the error variance of true achievement differs across students.

In the remainder of the paper, we test for heteroskedasticity in the student level error term.
In addition, year-to-year stability coefficients, which are very similar to year-to-year correlations, are computed for teachers serving different groups of students, using a variety of commonly used value-added estimators. Year-to-year stability coefficients for teachers with students in the bottom quartile, top quartile, and middle two quartiles of classroom level prior achievement are compared to one another.

A test of the homoskedasticity assumption easily rejects. Also, large and statistically significant differences in the stability coefficients among subgroups of teachers are detected, and the differences persist even after the number of student observations is artificially set to be the same for all teachers and when two years of data are used to compute value-added. In many cases, the year-to-year stability coefficients are 25 to more than 50% larger for teachers serving initially higher achieving students than for teachers serving lower achieving and disadvantaged students.

This finding has several implications. For practitioners implementing high stakes accountability policies, teachers serving certain groups of students may be unfairly targeted for positive or negative sanctions simply because of the composition of their classroom and the variability this creates for their estimates; in this paper, we produce simulation evidence that bears this out. In addition, the heteroskedasticity makes it important for researchers and practitioners to make standard errors heteroskedasticity robust. Heteroskedasticity is also a potential source of bias for those using empirical Bayes value-added estimates, which assume homoskedasticity.

3.2 Previous Literature

A few studies have examined the stability and precision of teacher effect estimates. Aaronson et al. (2007) examined the stability of teacher effect estimates using three years of data from the Chicago public school system. They find considerable inter-year movement of teachers into different quintiles of the estimated teacher quality distribution, suggesting that teacher effect estimates are somewhat unstable over time. They also find that teachers associated with smaller numbers of student observations are more likely to be found in the extremes of the estimated teacher quality distribution. Koedel and Betts (2007) perform a similar analysis using two years of data from the San Diego public school system and also find considerable movement of teachers across quintiles. McCaffrey et al. (2009) found year-to-year correlations in teacher value-added of .2 to .5 for elementary school teachers and .3 to .7 for middle school teachers, using data from five county level school districts in Florida for the years 2000-2005. They find that averaging teacher effect estimates over multiple years of data improves the year-to-year stability of the value-added measures.

This paper adds to the previous literature by specifically examining whether the stability of teacher effect estimates is related to the characteristics of the students assigned to the teacher.

3.3 Data

The data come from an administrative data set in a large and diverse anonymous state. It consists of 2,985,208 student-year observations from the years 2001-2007 and grades 4-6. Student-teacher links are available for value-added estimation. Also, basic student information, such as demographic, socio-economic, and special education status, is available.
Teacher information on experience is also available. The data include vertically scaled achievement scores in reading and math on a state criterion referenced test. The analysis focuses on value-added for mathematics teachers.

We imposed some restrictions on the data in order to accurately identify the parameters of interest. Students who cannot be linked with a teacher are dropped, as are students linked to more than one teacher in a school year in the same subject. Students in schools with fewer than 20 students are dropped, and students in classrooms with fewer than 12 students are dropped. Districts with fewer than 1000 students are dropped to avoid the inclusion of charter schools in the analysis, which may employ a set of teachers who are somewhat different from those typically found in public schools. Characteristics of the final data set are reported in Table 3.1.1

1 These restrictions eliminated about 31.2% of observations in 4th grade and 19% in 6th grade.

The analysis presented later is done separately for 4th grade and 6th grade, because the degree of tracking may differ between 6th grade and 4th grade, which may cause differences in the year-to-year stability of value-added estimates.

3.4 Model

The model of student achievement is based on the education production function, which is laid out in Todd and Wolpin (2003), Harris et al. (2011), and Guarino et al. (2012), among other places.2 Student achievement is a function of past achievement, current student and class inputs, and a teacher effect:

A_igt = τ_t + λ1 A_ig−1,t + λ2 A^alt_ig−1,t + X_igt γ1 + X̄_igt γ2 + T_igt β + v_igt    (3.1)

with

v_igt = c_i + ε_igt + e_igt − λ1 e_ig−1,t − λ2 e^alt_ig−1,t

where A_igt is student i's test score in grade g and year t, τ_t is a year specific intercept, and A^alt_ig−1,t is the lagged test score in the alternate subject, which in the analysis presented below is the reading score. X_igt is a vector of student level covariates, including free and reduced price lunch and limited English proficiency status, gender, and race. X̄_igt consists of class level covariates, including lagged achievement scores, class size, and demographic composition. T_igt is a vector of teacher indicators, and the teacher effects are collected in the β vector. c_i represents a student fixed effect, ε_igt an idiosyncratic error term affecting achievement, and e_igt measurement error in the test scores, with e^alt_igt representing the measurement error in the alternate subject score.

2 The model shown includes a lagged score in the alternate subject, which isn't necessary under the assumptions typically made in deriving the regression model from the education production function. However, including this variable is common in practice, so we chose to include it as well.

3.4.1 Estimation Methods

Teacher effects were estimated using two commonly used value-added estimators.3

3 We have also studied two more estimators based on a gain score equation, one using teacher fixed effects and another based on empirical Bayes. The patterns for these two estimators are similar to those reported for DOLS and EB Lag.

The first is a dynamic OLS estimator (DOLS), which includes teacher indicators in an OLS regression based on Equation (3.1).4 The estimator is referred to as dynamic because prior year achievement is controlled for on the right hand side.

4 This estimator was found to be the most robust of all the estimators evaluated in Guarino et al. (2012).
The coefficients on the teacher indicator variables are interpreted as the teacher effects. We run our models using one year of data and again using two years of data. Because the effects of class average covariates are not properly identified in a teacher fixed effects regression with only one year of data, these variables are dropped from the DOLS regressions.5 Additionally, when one year of data is used to estimate value-added, the year specific intercepts are dropped.

5 We have tried, as a sensitivity check, a two step method that can identify the effect of class average covariates in a teacher fixed effects regression, and the results are similar. First, using the pooled data with multiple years, Equation (3.1) is estimated by OLS with teacher fixed effects included. Then a residual is formed,
w_igt = A_igt − τ̂_t − λ̂1 A_ig−1,t − λ̂2 A^alt_ig−1,t − X_igt γ̂1 − X̄_igt γ̂2 − f(exper_igt) = T_igt β + v̂_igt,
which is then used in a second stage regression to form teacher effects using a sample based on one year of data.

The second is an empirical Bayes estimator (EB Lag), which treats teacher effects as random. The estimator closely follows the approach laid out in Kane and Staiger (2008). The parameters on the control variables are estimated in a first stage using OLS; unshrunken teacher effect estimates are then formed by averaging the first stage residuals among the students within a teacher's class. The shrinkage term is the ratio of the variance of persistent teacher effects to the sum of the variances of persistent teacher effects, idiosyncratic classroom shocks, and the average of the individual student shocks.6 Teacher effects are interpreted as the shrunken averaged residuals for each teacher, as sketched below.

6 It is common to treat the variance of the individual student shocks as uniform across the population of students. In an effort to evaluate commonly used estimators, we also computed the shrinkage term using the same variance term for the student level shocks for all teachers. Under heteroskedasticity, this shrinkage term would not be the shrinkage term used by the BLUP.
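A schematic Python version of the two estimators may help fix ideas. The column names are placeholders, and the variance components in the shrinkage term are illustrative constants; in practice they are estimated from the covariance structure of classroom mean residuals across years (Kane and Staiger (2008)), which is omitted here.

    import pandas as pd
    import statsmodels.formula.api as smf

    # df assumed: one student-year per row, with hypothetical column names.
    controls = "lag_math + lag_read + frl + lep + female + minority"

    # DOLS: teacher indicators enter the achievement regression directly,
    # and their coefficients are read off as the teacher effects.
    dols = smf.ols(f"score ~ {controls} + C(teacher_id)", data=df).fit()
    dols_effects = dols.params.filter(like="teacher_id")

    # EB Lag: first-stage OLS without teacher indicators, then shrink the
    # classroom mean residual toward zero.
    first = smf.ols(f"score ~ {controls}", data=df).fit()
    cell = (df.assign(resid=first.resid)
              .groupby("teacher_id")["resid"].agg(["mean", "count"]))

    # Illustrative variance components: persistent teacher effects, classroom
    # shocks, and individual student shocks.
    var_t, var_c, var_s = 0.01, 0.01, 0.25
    shrink = var_t / (var_t + var_c + var_s / cell["count"])
    eb_effects = shrink * cell["mean"]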
3.5 Heteroskedastic Error

There is good reason to suspect that the error in the student achievement model is heteroskedastic. We first present some basic theory suggesting that measurement error in test scores is heteroskedastic, and then offer some possible reasons why the error variance of actual achievement may be heteroskedastic.

3.5.1 Heteroskedastic Measurement Error

Item response theory is typically the foundation for estimating student achievement. A state achievement test is typically composed of 40-50 multiple choice questions, or items. Each student answers a question either correctly or incorrectly, and the probability of answering any individual question correctly is assumed to be a function of the item characteristics and the achievement level of the student. The typical model of a correct response to an item assumes (see Reckase (2009) for more details):

Prob(u_ij = 1 | a_i, b_i, c_i, θ_j) = c_i + (1 − c_i) G(a_i(θ_j − b_i))

where u_ij represents an incorrect or correct response to item i by student j, a_i is a discrimination parameter, b_i is a difficulty parameter, and c_i is a guessing parameter for item i. θ_j is the achievement level of student j. Often a logit functional form is assumed for G(·), although the probit functional form is also used. In the case of the logit form we have:

Prob(u_ij = 1 | a_i, b_i, c_i, θ_j) = c_i + (1 − c_i) e^{a_i(θ_j − b_i)} / (1 + e^{a_i(θ_j − b_i)})

Parameters can then be estimated using maximum likelihood or, alternatively, a Bayesian estimation approach. To illustrate why heteroskedasticity exists, we focus on maximum likelihood estimation. Lord (1980), under the assumption that the answer to each test item by each respondent is independent conditional on θ, showed that the maximum likelihood estimate of θ has a variance of:

σ²(θ̂ | θ) = [ Σ_{i=1}^{n} (c_i a_i)² e^{a_i(θ_j − b_i)} / (1 + e^{a_i(θ_j − b_i)})² ]^{−1}

where n is the number of items. As can be seen, the variance would be minimized with respect to θ if θ_j − b_i = 0 for all items, and as θ_j − b_i approaches ±∞, the variance grows large. Since test items are often targeted toward students near the proficient level, in the sense that θ_j − b_i is near 0 for these students, students in the lower and upper tails often have noisy estimates of their ability. The intuition is that the test is aimed at distinguishing between students near the proficiency cutoff, and so it offers little information about students near the top or bottom of the distribution.

Figure 3.1 plots the estimated standard deviation of the measurement error (SEM) against the student's test score level, with the SEMs on the vertical axis and the student's test score on the horizontal axis, for grades 3 through 6 in mathematics. The plots are from the 2006 State X Technical Report on Test Characteristics. The measurement error variance is a function of the test score level: students in the extreme ranges of the test score distribution have a measurement error variance that is substantially larger than students in the center.

It may also be the case that some groups of students are less likely to answer all questions on the exam. As described in State X technical reports, test scores are computed for all students who answer at least 6 questions in each of 2 sessions. Students who answer only a fraction of the total number of questions on the exam will tend to have less precisely estimated test scores.

A prediction of the theory presented above is that the error variance will be related to all variables that predict current achievement. This is because the variance of the measurement error is directly related to the current achievement of the student, so all variables that influence the current achievement level of the student should also be related to the measurement error variance. In the tests of heteroskedasticity that follow, this is the pattern that emerges.
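To make the tail behavior concrete, the short calculation below evaluates the variance formula above, as printed, for a hypothetical bank of 45 items whose difficulties cluster near the proficiency cutoff; all item parameters are invented for illustration. The variance is smallest near θ = 0 and grows rapidly in the tails.

    import numpy as np

    rng = np.random.default_rng(2)
    a = rng.uniform(0.8, 2.0, 45)    # discrimination parameters (hypothetical)
    b = rng.normal(0.0, 0.5, 45)     # difficulties clustered near the cutoff
    c = np.full(45, 0.2)             # guessing parameters

    def ml_var(theta):
        """Variance of the ML ability estimate, following the formula as printed."""
        z = np.exp(a * (theta - b))
        return 1.0 / np.sum((c * a) ** 2 * z / (1 + z) ** 2)

    for theta in (-3.0, -1.5, 0.0, 1.5, 3.0):
        print(f"theta={theta:+.1f}  var={ml_var(theta):.2f}")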
These empirical tests serve to demonstrate that the theoretical worries are justified, and they can motivate some predictions about how the precision of teacher effect estimates may depend on certain characteristics of their students.

3.6 Testing for Heteroskedasticity

Under homoskedasticity:

E(v_{ig}^2 \mid Z_{ig}) = \sigma_v^2

where Z_{ig} are the covariates in the regression model. We implemented a simple test of the homoskedasticity assumption by examining whether squared residuals are related to student characteristics. The first test simply grouped students into three groups: those with prior year test scores in the bottom 25%, the middle 50%, and the top 25%. We then calculated the average squared residuals for each group of students, using the residuals from the DOLS regressions, which made use of teacher indicators. Results are included in Table 3.2. One thing to note is that the average squared residuals for students in the bottom 25% of prior year achievement are much larger than those for students in the top 25%: around 45% larger for 4th grade and more than twice as large for 6th grade, even though under homoskedasticity we would expect them to be similar. This suggests that more unexplained variation exists for the group of students in the bottom 25% of the prior year achievement distribution.

Next we regressed the squared residuals on the covariates as well as on their squares and cubes. Results for grades 4 and 6 are reported in Table 3.3. We found that several of the variables, including the lagged test scores as well as the indicators for the student being African-American, free and reduced price lunch eligible, and limited English proficient, were statistically significant predictors at the 10% level. Since the precision and stability of a teacher's value-added measure depend in part on how much unexplained variation there is in the students' test scores, as will be explained below, this suggests that teachers paired with large numbers of disadvantaged or low achieving students may have less precise teacher value-added estimates. In the following sections, we will present evidence of this. Specifically, we will show that teachers of these types of students tend to have less stable teacher effect estimates over time.

In addition to the regressions presented in Table 3.3, we performed the traditional Breusch-Pagan test for heteroskedasticity, using fitted values, separately for grades 4 and 6 and using the DOLS estimators. The test easily rejects the null hypothesis that the error is homoskedastic, with p-values for all grades and estimators less than .0001. A sketch of these residual-based tests appears below.

3.7 Evidence of Differences in Classroom Compositions

For there to be differences in the stability or the precision of teacher effect estimates due to student level heteroskedastic error, it is necessary for variation in classroom compositions to exist. For particular districts or states with little variation in classroom composition, it is unlikely that there will be large differences in the stability and precision of estimates due to heteroskedasticity. Also, there are some variables, such as gender, that may be related to the error variance but do not affect the precision and stability of teacher effect estimates, since there is little variation across classrooms with respect to those variables.
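As referenced in Section 3.6, the following is a minimal sketch of the grouped squared-residual comparison and the regression-based test. The data frame and its column names ('resid', 'lag_math', and the covariate list) are hypothetical, and the polynomial terms used in Table 3.3 are omitted for brevity; this is an illustration of the testing idea, not the authors' code.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def grouped_residual_test(df):
    """Average squared DOLS residuals by prior-year achievement group,
    in the spirit of Table 3.2."""
    cuts = df['lag_math'].quantile([0.25, 0.75])
    grp = pd.cut(df['lag_math'],
                 [-np.inf, cuts[0.25], cuts[0.75], np.inf],
                 labels=['bottom 25%', 'middle 50%', 'top 25%'])
    return (df['resid'] ** 2).groupby(grp).mean()

def squared_residual_regression(df, covariates):
    """Regress squared residuals on covariates, as in Table 3.3; the
    F test that all slopes are zero is a test of homoskedasticity.
    (statsmodels also ships het_breuschpagan for the LM form.)"""
    X = sm.add_constant(df[covariates])
    fit = sm.OLS(df['resid'] ** 2, X).fit()
    return fit.fvalue, fit.f_pvalue
```

Under homoskedasticity the three group means from the first function should be roughly equal and the F statistic from the second should be small; the large gaps in Table 3.2 and the p-values below .0001 are what rejection looks like in this framework.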
To show that there is variation in classroom composition with respect to certain variables across the state, we include a set of summary statistics on classroom characteristics in the middle panels of Table 3.1, which show that classrooms vary along a number of dimensions. The average past year math score of students in a class ranges from 686.75 to 2066.737 for grade 4 and from 866 to 2097 for grade 6. The interval between classrooms 2 standard deviations below the mean and 2 standard deviations above the mean is [1128.797, 1697.353] for grade 4 and [1383.791, 1911.623] for grade 6. Additionally, the proportion free and reduced priced lunch, limited English proficiency, Hispanic, and African-American variables all range from 0 to 1.

3.8 Effects of Heteroskedastic Student Level Error on Precision of Teacher Value-Added Estimates

3.8.1 Simple Model of Heteroskedasticity

This model is designed to show, in the simplest case, how heteroskedasticity in the student level error can produce heteroskedasticity in teacher effect estimates. In the model there are two types of students and two teachers to whom students can be assigned. The student types differ in the size of the student's error variance. The achievement equation is:

A_i = T_{0i}\beta_0 + T_{1i}\beta_1 + \varepsilon_i

where A_i is the achievement level of student i, T_0 and T_1 are teacher assignment indicator variables for the two teachers, teacher 0 and teacher 1, β_0 and β_1 are the teacher effects for teacher 0 and teacher 1, and ε_i is an error term assumed to be independent of teacher assignment. Let the variable S_i indicate which of the two student types the student belongs to:

Var(\varepsilon_i) = v_0 \text{ if } S_i = 0, \qquad Var(\varepsilon_i) = v_1 \text{ if } S_i = 1, \qquad v_0 < v_1

In this simple case, an OLS estimate of the teacher effect for teacher k produces:

\hat{\beta}_k - \beta_k = \left( \sum_{i=1}^{N} T_{ki}^2 \right)^{-1} \sum_{i=1}^{N} T_{ki}\varepsilon_i = \frac{\sum_{i=1}^{N} T_{ki}\varepsilon_i}{N_k} = \bar{\varepsilon}_k

where \bar{\varepsilon}_k is the average error for the students that teacher k receives and N_k is the number of student observations for teacher k.

Suppose that each teacher has some students from S = 0 and some from S = 1, and also that teacher 0 tends to get more students from group 0, while teacher 1 tends to get more students from group 1. We can use the Central Limit Theorem for inference. According to Greene (2008) (pg. 1051, Lindeberg-Feller Central Limit Theorem with Unequal Variances), a central limit theorem result is possible as long as the random variables are independent with finite means and finite positive variances. Also, the average variance, \frac{1}{N_k}\sum_{i=1}^{N_k}\sigma^2_{\varepsilon_{ik}}, where N_k is the number of students for teacher k, must not be dominated by any single term in the sum, and this average variance must converge to a finite constant, \bar{\sigma}^2_{\varepsilon_k}, as the number of students per teacher goes to infinity:

\bar{\sigma}^2_{\varepsilon_k} = \lim_{N_k \to \infty} \frac{1}{N_k} \sum_{i=1}^{N_k} \sigma^2_{\varepsilon_{ik}}

Assume that all of those conditions hold. In that case,

\sqrt{N_k}(\hat{\beta}_k - \beta_k) \xrightarrow{d} Normal(0, \bar{\sigma}^2_{\varepsilon_k})

and

Avar(\hat{\beta}_k) \approx \frac{\bar{\sigma}^2_{\varepsilon_k}}{N_k}

In this simple example, the average variance \bar{\sigma}^2_{\varepsilon_k} for teacher 1 will tend to be larger than for teacher 0, since teacher 1 has more students from S = 1. Therefore the asymptotic variance of the teacher effect estimate for teacher 1 will tend to be larger. A Monte Carlo sketch of this two-teacher example appears below.

3.8.2 Including other Covariates in Achievement Model

Adding covariates along with the teacher indicator variables complicates the result. In this case the achievement model is:

A_i = T_{0i}\beta_0 + T_{1i}\beta_1 + X_i\gamma + \varepsilon_i

where X_i is a vector of covariates.
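Before turning to the covariate-adjusted result, here is the Monte Carlo sketch of the two-teacher example from Section 3.8.1 referenced above. The error variances (v0 = 1, v1 = 4), class size, and assignment shares are assumed values chosen only to make the mechanism visible.

```python
import numpy as np

rng = np.random.default_rng(1)
reps, n_students = 5000, 30
v0, v1 = 1.0, 4.0            # assumed error variances for the two types
share1 = {0: 0.2, 1: 0.8}    # teacher 1 receives more S=1 students

est = {k: np.empty(reps) for k in (0, 1)}
for r in range(reps):
    for k in (0, 1):
        # number of high-variance (S=1) students drawn into k's class
        n1 = rng.binomial(n_students, share1[k])
        eps = np.concatenate([
            rng.normal(0, np.sqrt(v0), n_students - n1),
            rng.normal(0, np.sqrt(v1), n1),
        ])
        est[k][r] = eps.mean()   # beta_hat_k - beta_k = average error

for k in (0, 1):
    print(f"teacher {k}: sampling variance of estimate = {est[k].var():.4f}")
```

With these values the sampling variance of teacher 1's estimate comes out roughly twice that of teacher 0's, matching the asymptotic approximation \bar{\sigma}^2_{\varepsilon_k}/N_k.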
A well known result (see Wooldridge (2010)) is that the OLS estimate of the teacher fixed effect for teacher k is:

\hat{\beta}_k - \beta_k = \bar{A}_k - \bar{X}_k\hat{\gamma}_{FE} - \beta_k = \bar{\varepsilon}_k - \bar{X}_k(\hat{\gamma}_{FE} - \gamma)

where \bar{A}_k and \bar{X}_k are the class averages of achievement and the covariates, and \hat{\gamma}_{FE} is the fixed effects estimator of γ. It is straightforward to show that

Avar(\hat{\beta}_k) \approx \frac{\bar{\sigma}^2_{\varepsilon_k}}{N_k} + \bar{X}_k \, Avar(\hat{\gamma}_{FE}) \, \bar{X}_k'

The term \bar{\sigma}^2_{\varepsilon_k}/N_k will tend to be larger for teacher 1 than for teacher 0. However, because of the additional terms in Avar(\hat{\beta}_k), it is not theoretically clear which teacher will have the less precise teacher effect estimate when the relationships between the covariates and the student types are unknown. Ultimately, whether teacher effect estimates are less precise for some teachers is an empirical question. The important point is that it is possible for some teachers to have less precise estimates due to student characteristics, so it is worthwhile to check whether that is the case.

3.9 Inter-year Stability of Teacher Effect Estimates by Class Characteristics

Imprecision of teacher effect estimates has some important implications, especially for policies that use teacher value-added estimates to make inferences about teacher quality. The precision of a teacher effect estimate will affect how well that estimate can predict the true teacher effect: if the estimated teacher effect is quite noisy, then the estimate will tend to predict the true teacher effect poorly. This section explains how examining the year to year stability of value-added estimates can reveal important information about the measures for those intending to use them for high stakes policies. The year to year stability is calculated by regressing the value-added measure in year t on a value-added measure in a previous year. We calculate separate stability coefficients for teachers with classrooms in the bottom 25%, middle 50%, and top 25% in terms of their students' incoming average achievement. Those wishing to skip the technical details may move on to the next section.

Following McCaffrey et al. (2009), we can model a teacher effect estimate for teacher j in year t as:

\hat{\beta}_{jt} = \beta_j + \theta_{jt} + v_{jt}

where \hat{\beta}_{jt} is the teacher effect estimate, β_j is the persistent component of the teacher effect, θ_{jt} is a transitory teacher effect that may have to do with a special relationship a teacher has with a class or some temporary change in a teacher's ability to teach, and v_{jt} is an error term due to sampling variation. The variance of v_{jt} will be related to the number of student observations used to estimate a teacher effect and to the error variance associated with the students in the particular teacher's class.

An important coefficient for predicting the persistent component of the teacher effect using an estimated teacher effect, which is essentially what a policy to deny tenure to teachers based on value-added scores would be doing, is the stability coefficient, as termed by McCaffrey et al. (2009). The stability coefficient for teacher j is:

S_j = \frac{\sigma^2_{\beta_j}}{\sigma^2_{\beta_j} + \sigma^2_{\theta_{jt}} + \sigma^2_{v_{jt}}}

Note that the stability depends on the variance of the error term v_{jt}.
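To fix intuition before the derivation continues, a small worked example with hypothetical variance components (the values below are illustrative, not estimates from the data) shows how inflating the sampling variance of v_{jt} alone pushes the stability coefficient down:

```python
# Hypothetical variance components: the persistent teacher effect and
# transitory classroom effect are held fixed while the student-level
# sampling variance differs across classroom types.
var_beta, var_theta = 0.04, 0.01

for label, var_v in [("low-variance classroom", 0.02),
                     ("high-variance classroom", 0.08)]:
    S = var_beta / (var_beta + var_theta + var_v)
    print(f"{label}: S_j = {S:.2f}")
# Prints S_j = 0.57 versus S_j = 0.31: the same teacher quality signal
# carries far less weight when the class generates noisier residuals.
```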
Assuming that the expectation of β_j conditional on \hat{\beta}_{jt} is linear and that β_j, θ_{jt}, and v_{jt} are uncorrelated, then:

E(\beta_j \mid \hat{\beta}_{jt}) = \alpha + \frac{Cov(\hat{\beta}_{jt}, \beta_j)}{Var(\hat{\beta}_{jt})}\,\hat{\beta}_{jt} = \alpha + \frac{\sigma^2_{\beta_j}}{\sigma^2_{\beta_j} + \sigma^2_{\theta_{jt}} + \sigma^2_{v_{jt}}}\,\hat{\beta}_{jt} = \alpha + S_j\,\hat{\beta}_{jt}

and then, also assuming that θ_{jt} and v_{jt} are mean zero, we get:

E(\beta_j \mid \hat{\beta}_{jt}) = (1 - S_j)\mu_{\beta_j} + S_j\,\hat{\beta}_{jt}

where \mu_{\beta_j} is the mean of β_j.7 So the weight that \hat{\beta}_{jt} receives in predicting β_j is related to the stability coefficient. If the stability coefficient is small, then the estimated teacher effect receives little weight in the conditional expectation function and is of little use in predicting β_j.

7 If the conditional expectation function is not linear, then the algebra shown works for the linear projection, which is the minimum mean squared error predictor among linear functions of the estimated teacher effect. The assumption that β_j, θ_{jt}, and v_{jt} are uncorrelated essentially implies that the teacher effect estimates are unbiased. There is some empirical support for this assumption, at least for the DOLS and EB Lag estimators: Kane and Staiger (2008), Kane et al. (2013), and Chetty et al. (2011) all find that similar value-added estimators are relatively unbiased. If the estimates are biased, then we are effectively evaluating the stability of reduced form coefficients and not the causal effects of teachers on achievement. The estimators evaluated are commonly used in practice and conceivably will be used as the basis for high stakes policies, so it still may be of interest to know how they vary from year to year.

The stability coefficient can be estimated by an OLS regression of current year teacher value-added estimates on past year estimates of teacher value-added and a constant. This imposes the additional assumptions that the variances of θ_{jt} and v_{jt} are constant over time and that the transitory teacher effects and error terms are uncorrelated over time. In that case the OLS estimates are estimating the population parameter:

\frac{Cov(\hat{\beta}_{jt-1}, \hat{\beta}_{jt})}{Var(\hat{\beta}_{jt-1})} = \frac{\sigma^2_{\beta_j}}{\sigma^2_{\beta_j} + \sigma^2_{\theta_{jt-1}} + \sigma^2_{v_{jt-1}}} = S_j

Since the variance of the teacher effect estimates tends to be constant over time, the regression coefficient is nearly identical to the inter-year correlation coefficient. The stability coefficient will be estimated for different subgroups of teachers based on the characteristics of the students a teacher receives. Specifically, the stability will be computed for teachers that received classes in the bottom 25%, middle 50%, and top 25% of classroom average prior test score in both years t and t − 1. If the variance of v_{jt} differs across subgroups of teachers, then the stability, and the degree to which the estimate predicts the true teacher effect, will also differ.

Another ratio may be of interest. Following McCaffrey et al. (2009) once again, the reliability of a teacher effect estimate, denoted R_{jt}, is:

R_{jt} = \frac{\sigma^2_{\beta_j} + \sigma^2_{\theta_{jt}}}{\sigma^2_{\beta_j} + \sigma^2_{\theta_{jt}} + \sigma^2_{v_{jt}}}

It may be of interest to know how much a teacher affected student learning in a given year; this may be the case in a merit pay system, for instance. In this case, we would be interested in the expected value of β_j + θ_{jt} conditional on the estimated teacher effect in year t. Using similar assumptions as before:

E(\beta_j + \theta_{jt} \mid \hat{\beta}_{jt}) = (1 - R_{jt})\mu_{\beta_j} + R_{jt}\,\hat{\beta}_{jt}

Under the additional assumption that the variances of β_j and θ_{jt} do not vary across subgroups, the stability of teacher value-added estimates will be proportional to the reliability.
This is simply because:

R_{jt} = \frac{\sigma^2_{\beta_j} + \sigma^2_{\theta_{jt}}}{\sigma^2_{\beta_j}}\, S_j

3.9.1 Brief Overview of the Analysis

Given that there may be differences in the degree of tracking or sorting in elementary and middle schools, the analysis is done separately by grade. Additionally, since teachers of certain types of classrooms may be less experienced, and this may affect the year-to-year stability of the teacher's value-added estimate, the teacher's level of experience is controlled for in the regressions by creating a separate dummy variable for each possible year of experience and including each of those variables in the regressions. The estimates for the different subgroups were computed by an OLS regression of the current year value-added estimate on the lagged teacher value-added estimate interacted with a subgroup indicator variable, a subgroup specific intercept, and indicators for the teacher's level of experience. The regression equation is:

\hat{\beta}_{jt} = \sum_{g=1}^{3} \alpha_g 1\{subgroup_{jt} = g\} + \sum_{g=1}^{3} \gamma_g \hat{\beta}_{jt-1} 1\{subgroup_{jt} = g\} + \sum_{\tau=1}^{M} \zeta_\tau 1\{exper_{jt} = \tau\} + \phi_{jt}

where \hat{\beta}_{jt} is teacher j's value-added estimate in year t, subgroup_{jt} is a variable indicating the teacher's subgroup, and exper_{jt} is the teacher's experience level. The γ_g parameters are the parameters of interest in the analysis. One way to think about them is as group specific autoregressive coefficients for a teacher's value-added score; they are quite similar to group specific year-to-year correlations in value-added. The advantage of the regression based approach over calculating year-to-year correlations is that it is much easier to calculate test statistics using conventional regression software; a sketch of this regression appears below. In the following sections, we will test whether the year-to-year stabilities of teacher value-added estimates for different subgroups are statistically different from one another.

The analysis is also repeated for each grade with the number of student observations artificially set to be equal. Since the precision of estimates for a teacher depends on both the number of student observations and the degree of variation in the student level error, it is of interest to identify the separate effects of these two sources of variability in teacher effect estimates. In order to make the number of student observations equal for all teachers, first all teachers with fewer than 12 student observations were dropped. Then, for those teachers with more than 12 student observations, students were randomly dropped from the classroom until the number of student observations was 12 for all teachers.8 For example, if a teacher has 20 students in a class, then 8 of the students are randomly dropped, so that the teacher's value-added estimate is based on the scores of only 12 students.

First, results will be reported in which all teacher effects are estimated using only one year of data. Then, the analysis will be reported using two years of data for each teacher. When two years of data are used to compute value-added, the groupings into bottom 25%, middle 50%, and top 25% are based on the two year average of prior year test scores within the teacher's classrooms; this averages over the same sample of students used to compute the two year value-added measures. In the case of the estimates based on two years of data, the teacher effect estimate for year t is estimated using years t and t − 1. The stabilities are computed by regressing the value-added estimate for year t on the estimate for year t − 2.
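As referenced above, a minimal sketch of the subgroup stability regression using statsmodels follows. The teacher-year data frame and its column names ('va', 'va_lag', 'subgroup', 'exper', 'school') are hypothetical placeholders, not the authors' actual variable names.

```python
import statsmodels.formula.api as smf

# df is assumed to be a teacher-year panel with the current value-added
# estimate, its lag, the classroom subgroup (bottom/middle/top), the
# teacher's experience level, and a school identifier for clustering.
model = smf.ols(
    "va ~ 0 + C(subgroup) + C(subgroup):va_lag + C(exper)", data=df
).fit(cov_type="cluster", cov_kwds={"groups": df["school"]})

# The C(subgroup):va_lag coefficients are the gamma_g stability
# estimates; subgroup-specific intercepts play the role of alpha_g,
# and the experience dummies absorb the zeta_tau terms.
print(model.summary())
```

Because the fit is an ordinary OLS object, Wald tests that the middle 50% and top 25% slopes equal the bottom 25% slope, separately or jointly, come directly from the fitted model, which is exactly the convenience the regression approach buys over raw correlations.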
Using the year t − 2 estimate ensures that the years in which teacher effects are estimated do not overlap, so that sampling variation or class level shocks cannot affect both estimates.

8 We have also done the analysis with the number of observations set to 15 and to 20, and the general patterns reported are the same.

3.10 Results on the Stability of Teacher Effect Estimates by Subgroup

The inter-year stabilities for subgroups of teachers based on the average past year score of the students in the class are reported below.9 We perform separate tests for whether the estimates for the middle 50% and top 25% statistically differ from the bottom 25%. A joint test that the estimates for the middle 50% and top 25% both statistically differ from the bottom 25% is also reported. Although there is variation in what is statistically significant across grades and estimators, a few patterns emerge. The stability ratio tends to be highest for teachers facing classrooms in the middle 50% and top 25% of average lagged score compared to teachers in the bottom 25%; it is typically 25% to over 50% larger for teachers with classrooms in the middle 50% and top 25%. This pattern holds even after the number of student observations is fixed at 12, and in some cases when 2 years of data are used to compute value-added.

3.10.1 DOLS Stabilities

Table 3.4 shows the results for the DOLS estimator. Results for 4th grade and 6th grade are shown separately. The left panels show results for DOLS teacher value-added estimates based on only one year of data; the right panels are based on estimates with two years of data. Within each panel, results labeled "Unrestricted Obs" are based on teacher value-added estimates that use all the available student observations in a year, while results labeled "12 Student Obs" are based on only 12 randomly chosen student observations in each year. For the two year results, the results reported under the "12 Student Obs" column are based on 12 × 2 = 24 student observations. Standard errors are clustered at the school level.10 A "+" symbol indicates that the middle 50% (or top 25%, as the case may be) coefficient is statistically different from the bottom 25% at the 5% level.

9 We have also examined whether the inter-year stability differs when classrooms are grouped according to proportion free and reduced price lunch, proportion Hispanic, and proportion African-American. We found that teachers in classrooms with high proportions of minority and low-income students also have lower inter-year stabilities. Results are available upon request.

10 We have also tried clustering at the teacher level, but the school level standard errors were more conservative, so we chose to report those.

3.10.1.1 4th Grade Results

In 4th grade, the stability for teachers with classes in the bottom 25% of prior year achievement is .359, and the stabilities for the middle 50% and top 25% are .483 and .555, respectively, when the number of student observations is unrestricted. The coefficients for the middle 50% and top 25% statistically differ from the coefficient for the bottom 25% at the 5% level. The patterns are quite similar once the number of student observations is fixed at 12, although predictably the estimates are somewhat smaller, since in the unrestricted case each teacher's value-added estimate is based on at least 12 observations.
The stability for the bottom 25% is .308, while the stabilities for the middle and top are .392 and .471, respectively, and both are statistically different from the bottom. Additionally, in both the unrestricted and the 12-observation cases, the joint test that the middle 50% and top 25% coefficients differ from the bottom rejects comfortably at the 5% level.

For the cases in which two years of data are used, the stability is calculated using four years of data. The teacher effect estimate in year t, which uses data from years t and t − 1, is regressed on the teacher effect estimate from year t − 2, which uses years t − 2 and t − 3. For a teacher to be included in one of the quartile groupings, the teacher had to have a two-year average prior year achievement score in that quartile range for both years t and t − 2. This dramatically reduced the sample of teachers available to compare. When two years of data are used to estimate teacher value-added in 4th grade, the stability for teachers with classes in the bottom 25% increases to .551, and to .646 and .730 for the middle and top, respectively, in the unrestricted observations case. The difference between the coefficients for the top and bottom is statistically significant at the 5% level. The point estimate for the middle 50% is larger than that for the bottom 25%, but the difference between the two is not statistically significant at the 5% level. The joint test that the top or the middle coefficient differs from the bottom is significant at the 5% level. When the number of student observations per year is fixed at 12, the point estimates for the middle and top are larger than the bottom, and both are statistically different from the bottom. The joint test that either the middle or top differs from the bottom also rejects.

3.10.1.2 6th Grade Results

The results for 6th grade are broadly similar to 4th grade using one year of data. With one year of data and unrestricted observations, the stabilities tend to be higher than in 4th grade, likely because 6th grade teachers have more student observations per year. In this case, the stabilities are .534, .619, and .665 for the bottom, middle, and top, respectively. The test for whether the top stability differs from the bottom rejects, while the test for the middle 50% does not; the joint test also rejects. When 12 student observations are used, the stabilities are .356, .401, and .479 for the bottom, middle, and top, respectively. Once again, the test that the top and bottom differ and the joint test reject, while the test that the middle differs from the bottom does not. In the case of two years of data, none of the estimates statistically differ from one another, either with unrestricted observations or when restricted to 12 student observations.

3.10.2 EB Lag Stabilities

The results for the empirical Bayes estimates can be found in Table 3.5 and are quite similar to those for the DOLS estimates. One difference between the empirical Bayes and DOLS specifications is that the regressions corresponding to the empirical Bayes estimates include classroom aggregates of the individual covariates, since this is often one of the justifications for using this approach over DOLS.11 In the case of one year of data and 4th grade, the stability estimates are .361, .483, and .551 for the bottom 25%, middle 50%, and top 25%, respectively, in the unrestricted observations case.
In the case where the number of student observations is set to 12, the stability estimates are .309, .391, and .461, respectively. In both cases, the middle 50% and top 25% estimates are statistically significantly different from the bottom 25%. The estimates are very similar to the DOLS case.

11 We have also included class aggregates in the DOLS regressions, and the results do not change much. Estimates of the class level aggregates were identified for DOLS using the two step approach described previously.

In the two year case in 4th grade, the pattern is again fairly similar to the DOLS results. When the number of observations is unrestricted, only the top 25% and bottom 25% stabilities are statistically different from one another, although the p-value of the joint test is .0511. When the number of observations is restricted to 12, the estimates are .476, .584, and .657, respectively. The difference between the top 25% and bottom 25% coefficients is statistically significant at the 5% level, and the joint test rejects at the 5% level as well.

In 6th grade with one year of data, the only statistically significant difference at the 5% level is between the top 25% and bottom 25% in the unrestricted case, with point estimates of .650 for the top and .548 for the bottom. In the case of 2 years of data, no statistically significant differences are detected.

3.11 Sensitivity Checks

We performed a number of sensitivity checks, all of which support the conclusion that differences exist in the inter-year stabilities across subgroups. We performed the analysis using English language arts scores and found patterns similar to mathematics: teachers assigned to students in the bottom 25% tended to have less stable value-added scores from year to year. One interesting thing to note is that English language arts value-added scores tended to be less stable from year to year overall compared to mathematics, which is consistent with the findings reported in the MET project reports.

Since it is conceivable that teachers of students with low average prior achievement scores are inexperienced, and inexperienced teachers also have lower inter-year stabilities, the analysis was repeated dropping all teachers with less than 5 years of experience. Because the teacher's experience was already controlled for in the regression of the teacher's current value-added score on their prior value-added score specifically to account for this issue, the patterns described above were, as expected, very similar in this sensitivity check.

As an additional sensitivity check, we repeated the analysis with school dummies. We were still able to detect statistically significant differences in inter-year stabilities across subgroups. We also tried estimating the empirical Bayes estimates using an alternate estimator, in which we estimated the model parameters using a mixed effects estimator that treated the teacher effects as random. These results were very similar to the empirical Bayes approach outlined above, which was based on the approach taken in Kane and Staiger (2008). Also, as another sensitivity check, we used twice lagged reading and math scores as instruments for the once lagged reading and math scores to help account for measurement error in these variables. Again, statistically significant differences were found in the stabilities across subgroups. Finally, we performed the analysis separately for the six largest school districts in the state. The general patterns held.
In a majority of the cases, the stability coefficient was estimated to be smallest for the bottom 25%, and in no case was the stability coefficient of the middle 50% or top 25% statistically significantly smaller than that of the bottom 25%. In some districts, the teachers with classrooms in the middle 50% had the largest year-to-year stability, while in others the top 25% had the largest. In one case the year-to-year stability of the bottom 25% was the largest, but not statistically significantly so. The estimates were quite noisy when the sample was separated in this way, so it is not clear whether this reflects real differences across districts. It seems possible that in different contexts the group of teachers with the largest year-to-year stability could differ. However, our main takeaway is that some groups of teachers have less stable value-added estimates from year to year. Tables for all of these sensitivity checks are available upon request.

3.12 High Stakes Policy Simulation

There is an increasing push to use value-added estimates for high stakes decisions such as tenure or merit pay bonuses. Since the precision and stability of a teacher's value-added estimate is related to the makeup of the teacher's class, teachers serving certain groups of students may be more likely to receive a sanction or bonus. To examine this, we produced a simulation in which high stakes decisions are made based upon value-added scores and teachers differ in the stability of their value-added estimates. We base the stability levels of the value-added measures on the results found in the previous sections. Each teacher is ranked and flagged if they are in the bottom or top 10% according to their teacher value-added score. We then calculate the proportion of teachers associated with each stability level who are labeled as either in the bottom or top 10%.

The simulation consists of 300 teachers and 3 stability levels, with 100 teachers assigned to each stability level. The true teacher effects are normally distributed with a mean of 0 and a variance of 1. The "estimated" teacher effects have estimation error added that is normally distributed with mean 0 and a variance that depends on the stability level of the teacher. Two sets of stability levels were chosen. The first corresponds to the DOLS estimates in 4th grade with 12 student observations and one year of data, with stabilities of .308, .392, and .471. The second corresponds to the DOLS estimates in 4th grade with 12 student observations and 2 years of data, with stability levels of .465, .578, and .660. We calculate the average proportion of teachers associated with each stability level over the 5000 repetitions; a sketch of the simulation appears below.

Results are included in Table 3.6. The results from the simulation using the DOLS estimates in 4th grade with 12 student observations and one year of data can be found in the upper panel. For teachers associated with the stability of .308, which was the stability associated with teachers of classrooms in the bottom 25% in the analysis above, the proportion found in the bottom or top 10% was .249. When the stability level was .392, the proportion dropped to .195, and when the stability was .471, the proportion fell to .156. This last drop is nearly a 10 percentage point change from the lowest stability. The results using two years of data show a similar pattern and can be found in the bottom panel.
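As referenced above, here is a minimal sketch of the simulation. With unit-variance true effects, a stability of S implies an estimation error variance of (1 − S)/S, which reproduces the "Error Variance of VAM Estimate" column of Table 3.6 (e.g., (1 − .308)/.308 = 2.247); the random seed and implementation details are our own.

```python
import numpy as np

rng = np.random.default_rng(2)
reps, n_per_type = 5000, 100
stabilities = [0.308, 0.392, 0.471]   # 1 year of data, 12 student obs

# Var(true effect) = 1, so stability S implies noise variance (1-S)/S.
noise_sd = np.sqrt([(1 - s) / s for s in stabilities])
flagged = np.zeros(len(stabilities))

for _ in range(reps):
    true = rng.normal(0, 1, (len(stabilities), n_per_type))
    est = true + rng.normal(0, noise_sd[:, None], true.shape)
    ranks = est.ravel().argsort().argsort()   # 0 = lowest estimate
    n = est.size
    extreme = (ranks < 0.10 * n) | (ranks >= 0.90 * n)
    flagged += extreme.reshape(est.shape).mean(axis=1)

for s, p in zip(stabilities, flagged / reps):
    print(f"stability {s:.3f}: share in bottom or top 10% = {p:.3f}")
```

Swapping in the two-year stabilities (.465, .578, .660) runs the second panel; in both cases the noisiest group is flagged most often even though all groups share the same true quality distribution.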
In the two-year simulation, teachers associated with the lowest stability have a proportion of .243. Teachers associated with stabilities of .578 and .660, which were associated with students in the middle 50% and top 25%, respectively, were found in the bottom or top 10% of the estimated teacher quality distribution at proportions of .193 and .164, respectively. This represents an almost 8 percentage point drop for the latter. The simulation results indicate that the differences in stability levels found in this analysis can have a large impact on the likelihood that a teacher ends up in the top or bottom of the estimated teacher quality distribution.

3.13 Conclusion

This paper provides evidence that the variability and stability of teacher effect estimates depend on the characteristics of a teacher's class. Policies to deny tenure to teachers and policies designed to reward teacher performance in a given year, when based on teacher value-added estimates, may differentially impact teachers with certain types of students. The relationship between the stability of estimates and the classroom characteristics of students extends beyond the number of student observations. There is a strong theoretical reason for suspecting that a student's error term is heteroskedastic, and statistical tests bear this out. As a consequence of this, together with student tracking and sorting into schools, teachers will serve different groups of students and will, as a result, differ in the precision of their teacher effect estimates. The differences in the stability ratios are large in magnitude and statistically significant even after fixing the number of student observations at a constant. Some evidence is also presented that the relationships remain even as more observations are added: when two years of data are used, there still exist statistically significant and large differences for different subgroups of teachers.

The heteroskedasticity is likely due in part to heteroskedastic measurement error variance. Assuming the item response model is correct, heteroskedastic measurement error is a direct result of the maximum likelihood estimation procedure that produces estimates of the achievement level of each student. The pattern that teachers of students with lagged achievement scores in the middle of the achievement distribution tend to have the highest inter-year stabilities is consistent with heteroskedasticity caused by measurement error, although teachers with students in the top 25% also tend to have more stable estimates. One reason the top and bottom may differ is that there may be greater potential for guessing or item non-response for students at the bottom of the distribution. It may be possible to reduce the heteroskedasticity by improving measurement, and future work will hopefully explore how much of the heteroskedasticity is attributable to measurement.

Heteroskedastic student level error also has other implications for researchers and policymakers. Empirical Bayes estimators are commonly computed assuming homoskedastic student level error. This assumption does not appear to hold, and since there are large differences in stability ratios that appear to be driven by heteroskedasticity, the violation of this assumption may affect the teacher rankings created using empirical Bayes estimators. Allowing for heteroskedasticity in the student level error should be done if possible.
Additionally, it is quite common for standard errors and the corresponding confidence intervals to be based on a homoskedasticity assumption.12 It is important that the confidence intervals accurately reflect imprecision caused by all sources of variability, not just the number of student observations, so standard errors should at least be made heteroskedasticity robust. This is particularly important since teacher value-added estimates are being made publicly available in some school districts.

It is important to understand the limitations of any measure of performance. The analysis presented here does suggest that value-added measures have positive inter-year stabilities for all subgroups, so information can be gathered for all subgroups of teachers. However, teachers of certain groups of students will tend to have less precise and less stable teacher value-added estimates. As a result, it is the opinion of the authors that care should be used in evaluating teachers with value-added estimators and that value-added estimates should not be used as the sole basis of any high stakes policy involving teachers.

12 Ballou et al. (2004) assume homoskedasticity in computing standard errors, as does the value-added estimator employed by the NYC school district.

APPENDIX

TABLES AND FIGURES

Figure 3.1: Standard Error of Measure Plots for Mathematics, Grades 3-6. [Each panel plots the SEM against the math scale score for one grade; figure not reproduced.]

Table 3.1: Summary statistics

4th Grade
Variable                           Mean       Std. Dev.   Min.      Max.
Math Scale Score                   1543.377   240.699     581       2330
Reading Scale Score                1591.033   291.045     295       2638
Math Standardized Scale Score      0.103      0.947       -3.957    3.409
Reading Standardized Scale Score   0.105      0.928       -4.578    3.753
Black                              0.208      0.406       0         1
Hispanic                           0.224      0.417       0         1
Free and Reduced Price Lunch       0.486      0.5         0         1
Limited English Proficiency        0.173      0.378       0         1
Avg. Lag Math Score                1413.075   142.139     686.75    2066.737
Prop. FRL                          0.496      0.28        0         1
Prop. LEP                          0.17       0.213       0         1
Prop. Hispanic                     0.218      0.245       0         1
Prop. Black                        0.216      0.248       0         1
Students/Teacher                   49.008     38.534      12        412
Teacher Years of Experience        8.902      8.887       0         47
# of Teachers: 14,820    # of Schools: 1,768    N: 726,299

6th Grade
Variable                           Mean       Std. Dev.   Min.      Max.
Math Scale Score                   1701.841   232.71      569       2492
Reading Scale Score                1704.809   294.454     539       2758
Math Standardized Scale Score      0.092      0.913       -4.163    3.354
Reading Standardized Scale Score   0.071      0.928       -4.049    3.526
Black                              0.224      0.417       0         1
Hispanic                           0.223      0.416       0         1
Free and Reduced Price Lunch       0.476      0.499       0         1
Limited English Proficiency        0.174      0.379       0         1
Avg. Lag Math Score                1647.707   131.958     866       2097
Prop. FRL                          0.496      0.259       0         1
Prop. LEP                          0.172      0.205       0         1
Prop. Hispanic                     0.214      0.234       0         1
Prop. Black                        0.24       0.245       0         1
Students/Teacher                   145.378    165.685     12        1036
Teacher Years of Experience        9.571      9.362       0         40
# of Teachers: 5,323    # of Schools: 796    N: 773,849

Table 3.2: Average Squared Residuals for DOLS Based on Subgroups of Prior Year Class Average Achievement

             Overall      Bottom 25%   Middle 50%   Top 25%
4th Grade    18644.722    28091.514    13665.164    19352.092
  N          709302       174780       356821       177701
6th Grade    16395.069    29825.119    11574.907    12670.438
  N          723292       179894       357843       185555

All regressions include lagged math and ELA test scores, indicators for Black, Hispanic, free and reduced price lunch, limited English proficiency, female, and year dummies.
Table 3.3: Tests for Heteroskedasticity

                             Grade 4 DOLS          Grade 6 DOLS
VARIABLES                    Squared Residuals     Squared Residuals
Math Lag Score               -91.60***             -176.5***
                             (5.357)               (15.28)
Math Lag Score Squared       0.00705*              0.00709
                             (0.00396)             (0.00939)
Math Lag Score Cubed         1.05e-05***           1.57e-05***
                             (9.45e-07)            (1.90e-06)
Reading Lag Score            -45.76***             -55.68***
                             (2.772)               (5.663)
Reading Lag Score Squared    0.0161***             0.0173***
                             (0.00195)             (0.00341)
Reading Lag Score Cubed      0.79e-07              -8.73e-07
                             (4.43e-07)            (6.65e-07)
Black                        293.8*                473.6**
                             (177.3)               (205.2)
Hispanic                     -265.9*               -272.0*
                             (154.7)               (159.8)
FRL                          540.6***              1,104***
                             (114.9)               (120.3)
LEP                          1,249***              711.8***
                             (190.7)               (183.3)
Female                       -1,436***             -2,609***
                             (97.34)               (114.0)
Constant                     134,184***            262,361***
                             (2,472)               (8,485)
Observations                 709,302               723,292
R2                           0.050                 0.079
Joint Test                   886.6                 862.4
p-value                      0                     0

All regressions include lagged math and ELA test scores, indicators for Black, Hispanic, free and reduced price lunch, limited English proficiency, female, and year dummies. Standard errors clustered at school level in parentheses. Joint Test refers to the F test statistic that all coefficients equal 0. *** p<0.01, ** p<0.05, * p<0.1

Table 3.4: Estimates of Year to Year Stability for DOLS by Subgroups of Class Achievement

DOLS 4th grade       1 Year of Data                       2 Years of Data
                     Unrestricted Obs   12 Student Obs    Unrestricted Obs   12 Student Obs
Bottom 25%           0.359***           0.308***          0.551***           0.465***
                     (0.0277)           (0.0266)          (0.0437)           (0.0449)
Middle 50%           0.483***+          0.392***+         0.646***           0.578***+
                     (0.0181)           (0.0180)          (0.0325)           (0.0315)
Top 25%              0.555***+          0.471***+         0.730***+          0.660***+
                     (0.0255)           (0.0246)          (0.0495)           (0.0485)
Observations         8,124              7,650             2,735              2,527
R2                   0.227              0.165             0.357              0.298
Joint Test           14.70              10.14             3.677              4.436
p-value              4.81e-07           4.27e-05          0.0257             0.0121

DOLS 6th grade       1 Year of Data                       2 Years of Data
                     Unrestricted Obs   12 Student Obs    Unrestricted Obs   12 Student Obs
Bottom 25%           0.534***           0.356***          0.812***           0.574***
                     (0.0452)           (0.0476)          (0.0588)           (0.0756)
Middle 50%           0.619***           0.401***          0.717***           0.560***
                     (0.0209)           (0.0247)          (0.0447)           (0.0485)
Top 25%              0.665***+          0.479***+         0.711***           0.575***
                     (0.0263)           (0.0310)          (0.0403)           (0.0508)
Observations         4,290              3,772             1,506              1,359
R2                   0.481              0.288             0.642              0.445
Joint Test           3.684              3.233             1.193              0.0274
p-value              0.0256             0.0401            0.304              0.973

All regressions include lagged math and ELA test scores, indicators for Black, Hispanic, free and reduced price lunch, limited English proficiency, female, and year dummies.
Standard errors clustered at school level in parentheses. *** p<0.01, ** p<0.05, * p<0.1. + Indicates value statistically different from Bottom 25% at 5% level. Joint Test: F-test statistic that Middle 50% and Top 25% coefficients differ from Bottom 25%.

Table 3.5: Estimates of Year to Year Stability for EB Lag by Subgroups of Class Achievement

EB Lag 4th grade     1 Year of Data                       2 Years of Data
                     Unrestricted Obs   12 Student Obs    Unrestricted Obs   12 Student Obs
Bottom 25%           0.361***           0.309***          0.571***           0.476***
                     (0.0278)           (0.0269)          (0.0445)           (0.0459)
Middle 50%           0.483***+          0.391***+         0.659***           0.584***
                     (0.0183)           (0.0180)          (0.0341)           (0.0318)
Top 25%              0.551***+          0.461***+         0.733***+          0.657***+
                     (0.0254)           (0.0246)          (0.0497)           (0.0491)
Observations         8,124              7,650             2,735              2,527
R2                   0.220              0.157             0.352              0.291
Joint Test           13.80              8.813             2.985              3.697
p-value              1.16e-06           0.000158          0.0511             0.0252

EB Lag 6th grade     1 Year of Data                       2 Years of Data
                     Unrestricted Obs   12 Student Obs    Unrestricted Obs   12 Student Obs
Bottom 25%           0.548***           0.354***          0.814***           0.583***
                     (0.0433)           (0.0482)          (0.0497)           (0.0702)
Middle 50%           0.614***           0.385***          0.717***           0.551***
                     (0.0199)           (0.0247)          (0.0432)           (0.0481)
Top 25%              0.650***+          0.457***          0.714***           0.561***
                     (0.0267)           (0.0318)          (0.0405)           (0.0529)
Observations         4,290              3,772             1,506              1,359
R2                   0.437              0.224             0.610              0.387
Joint Test           2.492              2.402             1.558              0.0715
p-value              0.0835             0.0913            0.212              0.931

All regressions include lagged math and ELA test scores, indicators for Black, Hispanic, free and reduced price lunch, limited English proficiency, female, class averages of all preceding variables, class size, a quadratic function of experience, and year dummies. Standard errors clustered at school level in parentheses. *** p<0.01, ** p<0.05, * p<0.1. + Indicates value statistically different from Bottom 25% at 5% level. Joint Test: F-test statistic that Middle 50% and Top 25% coefficients differ from Bottom 25%.

Table 3.6: High Stakes Policy Simulation

Simulation 1: DOLS Stability, 4th Grade, 12 Student Obs, 1 Year of Data
Stability   Error Variance of VAM Estimate   Proportion Found in Bottom or Top 10%
.308        2.247                            .249
.392        1.551                            .195
.471        1.123                            .156

Simulation 2: DOLS Stability, 4th Grade, 12 Student Obs, 2 Years of Data
Stability   Error Variance of VAM Estimate   Proportion Found in Bottom or Top 10%
.465        1.151                            .243
.578        .730                             .193
.660        .515                             .164

Simulation results are based on 5000 Monte Carlo repetitions. There are 100 teachers per type. True teacher effects are distributed Normal(0,1). Error in the value-added measures is normally distributed with mean 0 and the variance listed in the "Error Variance of VAM Estimate" column.

BIBLIOGRAPHY

Aaronson, D., Barrow, L., and Sander, W. (2007). Teachers and student achievement in the Chicago public high schools. Journal of Labor Economics, 25(1):95-135.

Abrevaya, J. and Dahl, C. M. (2008). The effects of birth inputs on birthweight: Evidence from quantile estimation on panel data. Journal of Business & Economic Statistics, 26(4):379-397.

Angrist, J., Chernozhukov, V., and Fernandez-Val, I. (2006). Quantile regression under misspecification, with an application to the U.S. wage structure. Econometrica.

Arias, O., Hallock, K. F., and Sosa-Escudero, W. (2001). Individual heterogeneity in the returns to schooling: Instrumental variables quantile regression using twins data. Empirical Economics.

Ballou, D., Sanders, W., and Wright, P. (2004). Controlling for student background in value-added assessment of teachers. Journal of Educational and Behavioral Statistics, 29(1):37-65.

Bound, J., Brown, C., Duncan, G. J., and Rodgers, W. L. (1994). Evidence on the validity of cross-sectional and longitudinal labor market data. Journal of Labor Economics, pages 345-368.
Bound, J. and Krueger, A. B. (1991). The extent of measurement error in longitudinal earnings data: Do two wrongs make a right? Journal of Labor Economics.

Boyd, D., Grossman, P., Lankford, H., Loeb, S., and Wyckoff, J. (2008). Who leaves? Teacher attrition and student achievement. Technical report, National Bureau of Economic Research.

Bricker, J. and Engelhardt, G. V. (2008). Measurement error in earnings data in the Health and Retirement Study. Journal of Economic and Social Measurement, 33(1):39-61.

Buchinsky, M. (1994). Changes in the U.S. wage structure 1963-1987: Application of quantile regression. Econometrica.

Buchinsky, M. (1998). Recent advances in quantile regression models: A practical guideline for empirical research. The Journal of Human Resources.

Card, D. (1999). The causal effect of education on earnings. Handbook of Labor Economics, 3:1801-1863.

Center, V.-A. R. [Value-Added Research Center] (2010). NYC Teacher Data Initiative: Technical report on the NYC value-added model 2010. Technical report.

Chen, X., Hong, H., and Tamer, E. (2005). Measurement error models with auxiliary data. The Review of Economic Studies.

Chernozhukov, V. and Umantsev, L. (2001). Conditional value-at-risk: Aspects of modeling and estimation. Empirical Economics, 26(1):271-292.

Chetty, R., Friedman, J. N., and Rockoff, J. E. (2011). The long-term impacts of teachers: Teacher value-added and student outcomes in adulthood. Technical report, National Bureau of Economic Research.

Condie, S., Lefgren, L., and Sims, D. (2014). Teacher heterogeneity, value-added and education policy. Economics of Education Review, 40(0):76-92.

Dee, T. S. (2004). Teachers, race, and student achievement in a randomized experiment. Review of Economics and Statistics, 86(1):195-210.

Eide, E. R. and Showalter, M. H. (1999). Factors affecting the transmission of earnings across generations: A quantile regression approach. Journal of Human Resources, pages 253-267.

Goldhaber, D. and Chaplin, D. (2012). Assessing the 'Rothstein falsification test': Does it really show teacher value-added models are biased? Center for Education Data & Research Working Paper.

Goldhaber, D., Walch, J., and Gabele, B. (2013). Does the model matter? Exploring the relationship between different student achievement-based teacher assessments. Statistics and Public Policy, 1(1):28-39.

Greene, W. H. (2008). Econometric Analysis. Pearson.

Guarino, C., Reckase, M. D., and Wooldridge, J. M. (2012). Can value-added measures of teacher performance be trusted? Technical report, Discussion Paper series, Forschungsinstitut zur Zukunft der Arbeit.

Guarino, C. M., Reckase, M. D., Stacy, B., and Wooldridge, J. M. (2014). Evaluating specification tests in the context of value-added estimation. Technical report, Michigan State Education Policy Center.

Haider, S. and Solon, G. (2000). Nonrandom selection in the HRS Social Security earnings sample. RAND, Labor and Population Program Working Paper Series.

Haider, S. and Solon, G. (2006). Life-cycle variation in the association between current and lifetime earnings. The American Economic Review.

Hanushek, E. A. (1979). Conceptual and empirical issues in the estimation of educational production functions. Journal of Human Resources, pages 351-388.

Harris, D., Sass, T., and Semykina, A. (2011). Value-added models and the measurement of teacher productivity.

Harvey, A. C. (1976). Estimating regression models with multiplicative heteroscedasticity. Econometrica: Journal of the Econometric Society, pages 461-465.
Hausman, J. (2001). Mismeasured variables in econometric analysis: Problems from the right and problems from the left. The Journal of Economic Perspectives.

Imbens, G. W. and Wooldridge, J. M. (2008). Recent developments in the econometrics of program evaluation. Technical report, National Bureau of Economic Research.

Imbens, G. W. (2000). The role of the propensity score in estimating dose-response functions. Biometrika, 87(3):706-710.

Imberman, S. A. and Lovenheim, M. F. (2013). Does the market value value-added? Evidence from housing prices after a public release of school and teacher value-added. Technical report, National Bureau of Economic Research.

Jacob, B. A. and Lefgren, L. (2008). Can principals identify effective teachers? Evidence on subjective performance evaluation in education. Journal of Labor Economics, 26(1):101-136.

Kane, T. J., McCaffrey, D. F., Miller, T., and Staiger, D. O. (2013). Have we identified effective teachers? Validating measures of effective teaching using random assignment. Research paper, MET Project, Bill & Melinda Gates Foundation.

Kane, T. J. and Staiger, D. O. (2002). The promise and pitfalls of using imprecise school accountability measures. The Journal of Economic Perspectives, 16(4):91-114.

Kane, T. J. and Staiger, D. O. (2008). Estimating teacher impacts on student achievement: An experimental evaluation. Technical report, National Bureau of Economic Research.

Kane, T. J. and Staiger, D. O. (2010). Learning about teaching: Initial findings from the Measures of Effective Teaching project. Bill & Melinda Gates Foundation.

Kim, B. and Solon, G. (2005). Implications of mean-reverting measurement error for longitudinal studies of wages and employment. Review of Economics and Statistics, 87(1):193-196.

Koedel, C. and Betts, J. (2007). Re-examining the role of teacher quality in the educational production function. Technical report, Department of Economics, University of Missouri.

Koenker, R. (2005). Quantile Regression. Cambridge University Press.

Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica.

Koenker, R. and Hallock, K. F. (2001). Quantile regression. The Journal of Economic Perspectives.

Lockwood, J. and McCaffrey, D. F. (2009). Exploring student-teacher interactions in longitudinal achievement data. Education Finance and Policy, 4(4):439-467.

Loeb, S., Soland, J., and Fox, L. (2014). Is a good teacher a good teacher for all? Comparing value-added of teachers with their English learners and non-English learners. Educational Evaluation and Policy Analysis.

Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Lawrence Erlbaum.

Machado, J. A. and Mata, J. (2005). Counterfactual decomposition of changes in wage distributions using quantile regression. Journal of Applied Econometrics, 20(4):445-465.

McCaffrey, D. F., Lockwood, J., Koretz, D., Louis, T. A., and Hamilton, L. (2004). Models for value-added modeling of teacher effects. Journal of Educational and Behavioral Statistics, 29(1):67-101.

McCaffrey, D. F., Sass, T. R., Lockwood, J., and Mihaly, K. (2009). The intertemporal variability of teacher effect estimates. Education Finance and Policy, 4(4):572-606.

Mincer, J. A. (1974). Schooling and earnings. In Schooling, Experience, and Earnings, pages 41-63. Columbia University Press.

Morris, C. N. (1983). Parametric empirical Bayes inference: Theory and applications. Journal of the American Statistical Association, 78(381):47-55.
Neal, D. and Schanzenbach, D. W. (2010). Left behind by design: Proficiency counts and test-based accountability. The Review of Economics and Statistics, 92(2):263-283.

Pischke, J.-S. (1995). Measurement error and earnings dynamics: Some estimates from the PSID validation study. Journal of Business & Economic Statistics, 13(3):305-314.

Reckase, M. (2009). Multidimensional Item Response Theory. Springer.

Rivkin, S. G., Hanushek, E. A., and Kain, J. F. (2005). Teachers, schools, and academic achievement. Econometrica, 73(2):417-458.

Rockoff, J. E. (2004). The impact of individual teachers on student achievement: Evidence from panel data. The American Economic Review, 94(2):247-252.

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41-55.

Rothstein, J. (2009). Student sorting and bias in value-added estimation: Selection on observables and unobservables. Education Finance and Policy, 4(4):537-571.

Rothstein, J. (2010). Teacher quality in educational production: Tracking, decay, and student achievement. The Quarterly Journal of Economics, 125(1):175-214.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688.

Rubin, D. B., Stuart, E. A., and Zanutto, E. L. (2004). A potential outcomes view of value-added assessment in education. Journal of Educational and Behavioral Statistics, 29(1):103-116.

Todd, P. E. and Wolpin, K. I. (2003). On the specification and estimation of the production function for cognitive achievement. The Economic Journal, 113(485):F3-F33.

Winters, M. A., Dixon, B. L., and Greene, J. P. (2012). Observed characteristics and teacher quality: Impacts of sample selection on a value added model. Economics of Education Review, 31(1):19-32.

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.