CONTRIBUTIONS OF STUDENT RESPONSE THEORY TO EVALUATION SYSTEMS: THREE ESSAYS EXTENDING ITEM RESPONSE THEORY PROCEDURES TO MEASURES OF TEACHER PERFORMANCE

By

Tara Kilbride

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Measurement and Quantitative Methods—Doctor of Philosophy

2019

ABSTRACT

CONTRIBUTIONS OF STUDENT RESPONSE THEORY TO EVALUATION SYSTEMS: THREE ESSAYS EXTENDING ITEM RESPONSE THEORY PROCEDURES TO MEASURES OF TEACHER PERFORMANCE

By

Tara Kilbride

This dissertation is a collection of three papers that each use Student Response Theory (SRT), a method for estimating educator effectiveness built upon an analogy of students and teachers to test items and examinees in Item Response Theory (IRT). Prior studies compare SRT to competing methods like value-added and student growth percentile models. This study broadens this literature by shifting the primary focus to comparisons between SRT and IRT and implications of their differences in the context of teacher evaluation and accountability. Using student-teacher linked data from a large, anonymous school district in a major U.S. city, unique contributions of IRT concepts and procedures are examined within the SRT context.

The first paper focuses on the construction of a student instructional demand index that operates like an item difficulty parameter in an SRT model. Although prior studies recommend two different methods for estimating this index, they focus primarily on one method (regression analysis) and minimally on the other (IRT calibration). In this paper, I select indicators of instructional demand that are optimal for IRT calibration and compare instructional demand indices across model specifications and estimation methods. I find that the IRT calibration method yields estimates of instructional demand that are more consistent with other indicators of teacher quality than the regression analysis method. However, when students and teachers are evaluated against a difficult standard, neither type of instructional demand index is consistent with these other measures.

In the second paper, I define mathematical functions of SRT model parameters that are comparable to characteristic curves and information functions in IRT. From these functions, I derive several different indicators of teacher performance and compare their respective meanings and the quality of information they each provide. I then identify groups of teachers whose classes comprise equivalent levels of instructional demand, conceptually similar to parallel test forms in IRT. I find high levels of consistency between different effectiveness measures from the same SRT model. The results also suggest that state-determined thresholds for proficiency and advanced proficiency are so difficult for most students in this district that they provide little information about educators. The basic proficiency threshold, however, is informative about most students and teachers.

The third paper applies procedures related to differential item functioning, common item equating, and item selection to the SRT framework as potential tools in a formative teacher evaluation system. Tests of "differential student functioning" find consistent model performance across most subgroups of teachers and classrooms.
Although SRT models are successfully equated to a common scale across years, some estimates of teacher growth have unreasonably large standard errors, offering little power to make inferences about changes in performance. Mismatch between the abilities of teachers and the demands of their students is the main factor driving this finding. Applications of optimal item selection procedures to SRT, however, may be useful for matching students and teachers with one another in mutually beneficial ways. These findings are generally encouraging, but they also emphasize the importance of setting reasonable goals for students and teachers, and of thoughtful assignment practices, in determining whether this method will be useful in practice.

Copyright by
TARA KILBRIDE
2019

ACKNOWLEDGEMENTS

I would first like to express my deepest gratitude to my advisor, Dr. Mark Reckase, whose own work inspired me to pursue this project and whose guidance and support were invaluable throughout my graduate school career. I am also profoundly grateful to Dr. Katharine Strunk, who connected me with the resources necessary to begin this work and whose high standards motivate me to grow as a researcher. Thank you to Dr. Joseph Martineau, whose passion for this topic reminds me why it is important and challenges me to think about it in new ways. Thank you to Dr. Ken Frank, who provided thoughtful feedback and fresh perspectives throughout this process. This work would not have been possible without my fantastic dissertation committee.

My sincerest gratitude to Dr. Josh Cowen for taking a chance by trusting me with a challenging project, and for remaining supportive and invested in my success ever since. Thank you to Brooke Quisenberry and Dr. Ting Shen, who have been great classmates and friends to me since we began our doctoral studies together. Thank you to all of the current and former educators in my life, especially those who took the time to share their experiences and perspectives with me as I completed this work. My most heartfelt appreciation goes to my family, TG, and Justin for their love, encouragement, and patience. And of course, thank you to Pepper for insisting that I stop working and go for a walk every now and then.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
KEY TO ABBREVIATIONS

CHAPTER 1. INTRODUCTION
  1.1 Background
    1.1.1 Types and Comparisons of Statistical Growth Models
    1.1.2 The Two-Parameter Logistic Model in SRT
  1.2 The Present Study
    1.2.1 Research Questions
    1.2.2 Significance and Implications

CHAPTER 2. MEASURING STUDENT DEMAND AND EDUCATOR CAPACITY
  2.1 Introduction
  2.2 Data and Methodology
    2.2.1 Source Data
    2.2.2 Selecting Instructional Demand Indicators
    2.2.3 Constructing Indices of Student Instructional Demand
  2.3 Results
    2.3.1 Comparisons of SIDIs Across Indicator Sets and Estimation Methods
    2.3.2 Comparisons of SRT Models with Different SIDIs
    2.3.3 Comparisons to Other Measures of Educator Quality
  2.4 Discussion
    2.4.1 Validity of the SIDI
    2.4.2 Differences Across SRT Specifications
  2.5 Conclusions

CHAPTER 3. CONTRIBUTIONS OF SRT TO SUMMATIVE EVALUATION
  3.1 Introduction
    3.1.1 Item and Student Response Functions
    3.1.2 Item Characteristic Curves and their SRT Counterparts
    3.1.3 Item and Test Information Functions and their SRT Counterparts
    3.1.4 Differences Between Models and Contexts
  3.2 Data and Methodology
    3.2.1 Estimates of Demand, Capacity, and Consistency
    3.2.2 Defining Characteristic Curves and Information Functions for SRT
    3.2.3 Ranking and Categorizing Teachers
  3.3 Results
    3.3.1 Distributions of Characteristic Curves and Information Functions
    3.3.2 Comparing Alternate Measures of Effectiveness
    3.3.3 Parallel Classroom Analysis
    3.3.4 Comparing Alternate Rating Categories
  3.4 Discussion
    3.4.1 Contextualizing the Instructional Demand Scale
    3.4.2 Simplifying the ECC
    3.4.3 Describing Teacher Performance in Terms of Demand
    3.4.4 Referencing Estimates to Populations of Teachers and Students
  3.5 Conclusions

CHAPTER 4. CONTRIBUTIONS OF SRT TO FORMATIVE EVALUATION
  4.1 Introduction
    4.1.1 Summative and Formative Evaluations of Teachers
    4.1.2 Differential Item and Student Functioning
    4.1.3 Establishing a Consistent Scale
    4.1.4 Optimizing the Assignment Process
    4.1.5 Significance
  4.2 Data and Methodology
    4.2.1 Data and Measures
    4.2.2 Differential Student Functioning
    4.2.3 Equating the Instructional Demand Index Across Years
    4.2.4 Identifying Growth of Teachers
    4.2.5 Assignment Optimization
  4.3 Results
    4.3.1 Differential Student Functioning Results
    4.3.2 SIDI Equating Results
    4.3.3 Teacher Growth Analysis Results
    4.3.4 Optimal Assignment Results
  4.4 Discussion
    4.4.1 Differential Student Functioning
    4.4.2 Equating the SIDI
    4.4.3 Teacher Growth
    4.4.4 Optimal Assignment
  4.5 Conclusions

CHAPTER 5. OVERALL CONCLUSION AND DISCUSSION
  5.1 Summary of Findings
  5.2 Implications

APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 2-1. Summary of student and teacher characteristics by cohort
Table 2-2. Definitions of discrete levels for instructional demand indicators
Table 2-3. Instructional demand indicators in the full, compact, and/or restricted SIDI
Table 2-4. Distributions of SIDIs and correlations with future state assessment scores
Table 2-5. Correlations between estimates from different instructional demand indices
Table 2-6. Akaike Information Criterion (AIC) for SRT models
Table 2-7. Range of correlations among capacity estimates for pairs of SRT models
Table 3-1. Comparison of item and student response functions
Table 3-2. Descriptive statistics for SRT model parameters
Table 3-3. Mathematical functions of SRT model parameters
Table 3-4. Definitions of effectiveness measures and levels derived from the SRF
Table 3-5. Spearman rank correlations of effectiveness measures derived from the SRF
Table 3-6. Hierarchical cluster analysis solutions
Table 3-7. Characteristics of students, teachers, and classes by cluster
Table 3-8. Percent of teachers in each performance category (L1 to L4)
Table 3-9. Teacher classification patterns and their frequencies for low-target measures
Table 3-10. Descriptive statistics for low-target SRT parameters by rating pattern
Table 3-11. Sample teacher-level reports: capacity relative to student demand distribution
Table 4-1. Full and restricted samples of teachers and students
Table 4-2. Mantel-Haenszel DSF 2x2 contingency table
Table 4-3. DSF categories based on the ETS DIF classification system
Table 4-4. Student demand and probability estimates by teacher
Table 4-5. Results from differential student functioning (DSF) analysis
Table 4-6. DSF by year for equated SRT measures
Table 4-7. Detectable and observed magnitudes of growth
Table 4-8. Comparison of teachers by change type (if 1 SD change is detectable)
Table 4-9. Actual, random, and optimized class assignments
Table A1. Location and slope parameters for instructional demand indicators
Table A2. Deriving equations for the P25 and P75 measures (Chapter 3)

LIST OF FIGURES

Figure 2-1. Test Information Functions for IRT-calibrated SIDIs
Figure 2-2. Scatterplot of capacity rankings by SRT model specifications
Figure 2-3. Scatterplot of capacity rankings (special education excluded)
Figure 2-4. Marginal density distributions by observation rating
Figure 2-5. Marginal density distributions by experience level
Figure 3-1. Distributions of student and educator-student characteristic curves
Figure 3-2. Distributions of student and educator-student information functions
Figure 3-3. Distributions of class and educator-class information functions and educator characteristic curves
Figure 3-4. Dendrogram for CIF cluster analysis (6-cluster solution)
Figure 3-5. Relationships between cluster membership and capacity rankings
Figure 3-6. Mapping of instructional demand scale to typical student characteristics
Figure 3-7. Distribution of instructional demand across district and sample schools
Figure 3-8. Sample reports for one teacher from each sample school
Figure 3-9. Sample reports for all teachers in the sample schools
Figure 4-1. Distribution of instructional demand with shaded areas in DSF intervals
Figure 4-2. Test information functions and conditional standard errors for each SIDI
Figure 4-3. Instructional demand estimates based on each SIDI
Figure 4-4. Distributions of changes in capacity and their standard errors
Figure 4-5. Scatterplot of math capacity estimates colored according to the standard errors of corresponding math growth estimates
Figure 4-6. Scatterplot of ELA capacity estimates colored according to the standard errors of corresponding ELA growth estimates
Figure 4-7. Mean standard errors of growth estimates by student counts
Figure 4-8. Standard errors of growth estimates by gap between math capacity and class mean instructional demand
Figure 4-9. Standard errors of growth estimates by gap between ELA capacity and class mean instructional demand

KEY TO ABBREVIATIONS

2PL: Two-Parameter Logistic
AIC: Akaike Information Criterion
CIF: Class Information Function
DIF: Differential Item Functioning
DSF: Differential Student Functioning
E-CIF: Educator-Class Information Function
E-SIF: Educator-Student Information Function
ECC: Educator Characteristic Curve
ELA: English Language Arts
ELL: English Language Learner
E-SCC: Educator-Student Characteristic Curve
ETS: Educational Testing Service
FRL: Free or Reduced-Price Lunch
GRM: Graded Response Model
ICC: Item Characteristic Curve
IIF: Item Information Function
IRF: Item Response Function
IRT: Item Response Theory
LEP: Limited English Proficiency
MGP: Mean or Median Growth Percentile
MH: Mantel-Haenszel
PE: Physical Education
SIDI: Student Instructional Demand Index
SIF: Student Information Function
SCC: Student Characteristic Curve
SD: Standard Deviation
SGP: Student Growth Percentile
SPED: Special Education
SRF: Student Response Function
SRT: Student Response Theory
SWD: Student With Disability
TIF: Test Information Function
URM: Under-Represented Minority
VAM: Value-Added Model
VIF: Variance Inflation Factor

CHAPTER 1. INTRODUCTION

1.1 Background

Objective evidence of performance is critical for making meaningful comparisons of teachers or schools across different contexts. However, the methods used to produce such evidence have historically been both limited and controversial.
Statewide school accountability systems implemented after the No Child Left Behind Act of 2001 primarily relied on rates of proficiency on standardized tests as measures of school quality. As research unveiled the biases and unintended impacts of these accountability policies (Booher-Jennings, 2005; Neal & Schanzenbach, 2010; Grissom et al., 2013; Dee et al., 2013), alternative metrics emerged that shifted the focus from achievement to growth (Ladd & Lauen, 2010). By 2016, 40 states required objective evidence of student growth to be a component in teacher evaluations (NCTQ, 2017). Most of these states used statistical growth models (typically a Value-Added Model or "VAM") to generate norm-referenced estimates of teachers' contributions to student growth.

1.1.1 Types and Comparisons of Statistical Growth Models

VAMs estimate the effects of individual teachers on student performance, adjusting for prior performance and, in some cases, student demographics and other characteristics (Rothstein, 2008; Hanushek & Rivkin, 2010). The SAS Education Value-Added Assessment System (EVAAS), one of the most widely used VAMs, uses hierarchical linear modeling to estimate these effects (Sanders & Horn, 1994). The Value-Added Research Center (VARC), American Institutes for Research (AIR), and Mathematica Policy Research have each developed similar approaches. Student Growth Percentile (SGP) models, which are sometimes considered a type of VAM, rank students within comparable prior achievement groups using quantile regression, and the mean or median SGP across students with the same teacher (referred to as an MGP) is used as an indicator of effective teaching (Betebenner, 2011). Largely due to its use of a familiar percentile rank scale and the relatively simple calculation of MGPs from SGPs, this method is often preferred over traditional VAMs for its greater accessibility to stakeholders without statistical backgrounds. However, the increased transparency of MGP models and some less sophisticated VAMs may come at the expense of greater bias favoring teachers of more advantaged students (Castellano & McCaffrey, 2017; Walsh & Isenberg, 2013) and greater sensitivity to nonrandom sorting (Guarino et al., 2014).

Transparency and fairness are both central priorities in teacher accountability, but one of these priorities is often compromised for the sake of the other when choosing among the accountability metrics that are typically available to stakeholders. This contributes to an ongoing debate about whether it is appropriate for these metrics to influence decisions about teachers. Specific concerns about some statistical models include the effects of non-random sorting of students and teachers (Rothstein, 2009; Harris, 2009), instability of ratings (Koedel & Betts, 2007; Ballou, 2005), effects of omitted variables (Harris, 2009), and the assumption of a linear relationship between prior and expected performance (Lissitz & Doran, 2009). Concerns about transparency include the limited interpretability of norm-referenced statistical growth models (Thum, 2003), as there is no direct link between teachers' ratings and concrete performance criteria. Consequently, the results neither deliver explicit guidance for improvement nor provide a basis to assess longitudinal changes in effectiveness.
The role of these models in accountability systems, as a result, is generally limited to assigning summative ratings to teachers in order to administer rewards or sanctions and influence personnel decisions, despite experts cautioning against their use in high-stakes decisions (Braun, 2005). This results in apprehension about impacts on instructional climate, educator morale, and student welfare (Lee, 2011).

Reckase and Martineau (2014) proposed an alternate method that offers solutions to many of these concerns: the estimates produced are criterion-referenced and defined in relation to student characteristics, longitudinal changes can be assessed in a meaningful way, random sorting is not assumed, and results may lend themselves well to formative uses in addition to summative ones. The method, initially titled the Educator Response Function and renamed Student Response Theory (SRT) in subsequent work (Martineau, 2016), is grounded in Item Response Theory (IRT) methodology. SRT extends the techniques typically used to estimate examinee performance from responses to different types of test questions to the context of estimating educator capacity from outcomes for different types of students. In IRT, examinees' locations on a latent capacity scale are estimated by maximizing the joint probability functions associated with observed strings of responses to test items with specific properties such as item difficulty (Lord, 2012). SRT conceptualizes students as the test items; teachers are tasked with helping students reach pre-defined performance goals, and succeeding with a particular student is analogous to a correct response to a test question. The set of outcomes for all of a teacher's students constitutes that teacher's test of capacity.

Just as test questions vary in their levels of difficulty, the difficulty levels of teachers' tasks vary depending on characteristics of their students. For instance, it will likely be easier for teachers to help students who reached an advanced proficiency level in the prior school year to reach a target proficiency level than students who only reached a basic proficiency level. Absences, instances of discipline or behavior problems, and learning disabilities likely increase the level of challenge associated with reaching a performance goal, while high levels of family support, strong work ethic, and good study habits likely decrease it. These factors influence the ease with which students respond to instruction and, therefore, the level of instructional proficiency required for a teacher to successfully help a student reach an academic benchmark. An index of these factors serves as a scale of "instructional demand" analogous to IRT's difficulty parameter, also referred to as a "student challenge index" in previous studies. The two-parameter logistic (2PL) model also estimates a slope parameter that describes an item's level of discriminating power. In SRT, a slope parameter is estimated at the teacher level and describes the strength of the relationship between instructional demand and target attainment for a particular teacher (Reckase & Martineau, 2014), also referred to as a "consistency parameter" (Martineau, 2016).

SRT bears some similarities to VAMs and SGPs. Like these statistical growth models, SRT estimates the effectiveness of teachers on the basis of test scores, while adjusting for student characteristics that are likely to affect performance.
Preliminary studies have demonstrated that rankings of teachers and schools are similar across these different model types. Ham (2014) computed capacity estimates for the same group of teachers using both SRT and VAMs. Rank-order correlations (averaged across grade levels) ranged from 0.69 to 0.89 for reading and from 0.83 to 0.92 for math, depending on the model specifications. Martineau (2016) compared school-level capacity measures using VAMs, SGPs, and SRT. Correlations were highest when SRT performance goals were fixed at a moderate difficulty level (0.81 to 0.84 for reading and 0.76 to 0.82 for math) and with individualized performance goals based on prior performance (0.80 to 0.84 for reading and 0.73 to 0.78 for math). Correlations between estimates from VAMs and SGPs were notably higher (0.91 to 0.99) than the correlations between VAMs or SGPs and SRT, suggesting that although the capacity estimates produced by SRT are related to those produced by these growth models, there may be key differences in the information they capture.

Ham (2014) outlines three key differences that distinguish SRT from other statistical growth models. First, SRT uses student performance data in a different manner than most common growth models. VAMs and SGPs use a continuous student performance variable as an outcome, providing norm-referenced information about the rank of a student score relative to others in the same year and grade level. SRT uses a discrete, criterion-referenced performance outcome, which aligns with the criterion-referenced nature of most state assessments. Second, SRT incorporates student characteristics in a different way. VAMs treat student characteristics as additional educational inputs in the model, while SRT considers these characteristics separately from the outcome variable to determine the different levels of demand posed to educators. Third, the definition of teacher effectiveness differs between the two types of models. VAMs frame teacher effects as the amount of change in student test scores, after accounting for other educational inputs, whereas SRT defines the teacher effect as a latent trait describing the capacity of a teacher to help students reach a particular performance target. As a result, SRT teacher capacity estimates are sample-independent while VAM teacher effects are sample-dependent.

These differences have several implications for the potential role of SRT in an evaluation system. The criterion-referenced capacity scale allows estimates to be interpreted in terms of the probability that a specific type of student will achieve a specific performance target. The targets typically used in an SRT model correspond to performance level descriptors that explicitly define what students should know or be able to do in order to be considered successful (Martineau, 2016). These ties between observable student characteristics and concrete performance criteria make results more accessible to stakeholders who may be unfamiliar with a particular statistical model but knowledgeable about student performance standards and their relationships to instructional practices. Separate calibration of student parameters and the sample-independent properties of capacity estimates lend themselves better to establishing scales that are comparable across contexts and over time, empowering administrators to monitor absolute changes in teacher performance, rather than simply changes in teachers' relative positions within the distribution.
Procedures used to examine item difficulty, comparability of test forms, and estimation precision in IRT may add additional value and decision-making power when applied to analogous concepts within an evaluation framework.

1.1.2 The Two-Parameter Logistic Model in SRT

The analogous form of the two-parameter logistic (2PL) model for SRT is shown in Equation 1-1. The variable x_s represents the achievement level of student s, while x_t represents the target achievement level for the student. The outcome variable for the SRT model is a binary indicator that equals 1 if the student reaches this target performance level (x_s >= x_t) and equals 0 otherwise (x_s < x_t). This is equivalent to a correct response to a particular test item in IRT. The variable theta_et represents the latent capacity of educator e with respect to performance target t and is equivalent to the latent ability level of an examinee in IRT. The level of instructional demand for student s, d_s, is equivalent to the item difficulty parameter in IRT, and the slope parameter a_e is equivalent to the discrimination parameter in IRT.

There are a few key characteristics that distinguish this model from its IRT counterpart. First, the location parameter is estimated separately from the other model parameters in SRT. This takes place in an earlier step, and pre-calculated instructional demand estimates are treated as constants in the SRT model used to estimate slope and capacity parameters. Second, the slope parameter is estimated at the teacher level (analogous to the examinee level, rather than the item level). This means that the 2PL in SRT includes two second-level (teacher) parameters and only one first-level (student) parameter, whereas the 2PL in IRT includes one second-level (examinee) parameter and two first-level (item) parameters.

P_t(x_s \ge x_t \mid \theta_{et}, a_{et}, d_s) = \frac{\exp[a_{et}(\theta_{et} - d_s)]}{1 + \exp[a_{et}(\theta_{et} - d_s)]}    (1-1)

where P_t is the probability associated with a particular achievement target t;
x_s is the achievement level attained by student s;
x_t is the achievement level required to meet target t;
d_s is the instructional demand level of student s;
a_{et} is the target t slope parameter for educator e; and
\theta_{et} is the latent capacity of educator e to help students reach target t.

The three main assumptions underlying the 2PL in IRT (De Ayala, 2009) also apply within the SRT framework. The first assumption is that the student response function (SRF) increases monotonically with teacher capacity. This makes sense intuitively: the greater the capacity of a teacher, the more likely a student is to reach a performance target. The second assumption is unidimensionality of teacher capacity. Satisfying this assumption hinges on proper specification of the student instructional demand index, as omitting important indicators from this index is a threat to unidimensionality (Ham, 2014). The third assumption, conditional independence of student outcomes, is perhaps the most concerning of the three within the SRT context. Although the average characteristics of students' peers generally have little effect on teacher value-added estimates (Harris & Sass, 2006), the presence of a disruptive peer has been shown to impact both student achievement and teacher value-added (Horoi & Ost, 2015) and could affect SRT capacity estimates in similar ways.
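To make the structure of Equation 1-1 concrete, the minimal sketch below evaluates the student response function for a single hypothetical educator-student pair. It is an illustration only, not the estimation routine used in this study, and the parameter values are invented for demonstration.

```python
import math


def srt_response_probability(capacity, slope, demand):
    """Probability that a student with instructional demand `demand` reaches
    target t when taught by an educator with latent capacity `capacity` and
    consistency (slope) parameter `slope`, following Equation 1-1."""
    z = slope * (capacity - demand)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical values: when the educator's capacity exceeds the student's
# demand, the probability of reaching the target is above 0.5.
p = srt_response_probability(capacity=0.8, slope=1.2, demand=0.3)
print(round(p, 3))  # ~0.646
```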
1.2 The Present Study

The purpose of this study is to examine the unique capabilities of SRT models as tools for evaluating the performance of teachers, making appropriate decisions based on this evaluative information, and ultimately driving improvement in instructional quality. These tasks are organized into three main research questions and addressed in three corresponding standalone papers. The first paper focuses on the construction of the student instructional demand index (SIDI). The second paper explores contributions of IRT-specific procedures and their SRT analogs within a summative evaluation framework. The third paper examines the contributions of these types of procedures within a formative evaluation framework.

1.2.1 Research Questions

The first paper broadly addresses the following question: how do the estimation method and set of indicators used to construct the SIDI affect estimates of teacher capacity and student demand? Proper specification of this index is critical in order to satisfy the unidimensionality assumption. Choices between different estimation methods and indicator sets also have practical implications for the role of an SRT model in an evaluation system; these decisions may affect the point within the school year when information about instructional demand will be available and the way standards and expectations are set for teachers with different types of students. These practical implications are considered alongside the statistical properties of indicators, instructional demand estimates, and SRT models, in order to evaluate the consequences of various decisions in the SIDI construction process.

The second paper explores the following question: can mathematical functions of SRT parameters, comparable to item response and test information functions in IRT, be used to generate more comprehensive summative reports of teacher performance? These mathematical functions are used to frame SRT results in multiple ways, creating several indicators of teacher performance with somewhat different meanings. The statistical validity and practical value of these indicators hinge both on the monotonicity assumption and on the levels of variation and concordance among the different indicators. Taking these factors into consideration, this paper explores whether a combination of the indicators can offer non-redundant, non-contradictory information about educators and can therefore provide more nuanced summaries of their performance than any one of the measures on its own.

The third paper addresses a similar question while shifting outside of the summative evaluation framework: can extensions of IRT procedures to the SRT context contribute meaningfully to a formative evaluation system? IRT procedures related to differential item functioning, test form equating, and optimal item selection are extended to SRT as potential tools for identifying differences in performance across subgroups of teachers and classrooms, monitoring changes in the performance of a teacher over time, and making decisions about how best to match students and teachers with one another. The conditional independence assumption is critical to these applications. The extent to which this assumption is violated may impact whether it is feasible or appropriate to equate the instructional demand index across years, under what circumstances performance is expected to differ from model-based predictions, and whether predictions of future performance can be trusted.
1.2.2 Significance and Implications

The primary purpose of this study is to establish whether, how, and to what extent SRT can contribute to an educator evaluation system in ways that a traditional VAM cannot. Earlier studies demonstrate that SRT can perform the same tasks as a VAM and produce similar results. By leveraging the aspects of SRT that distinguish it from other statistical growth models, aspects that arise from the vast body of research and well-studied technology within the IRT framework, this study homes in on the practical relevance of SRT and its unique benefits to an evaluation system. The proposed adaptations of IRT concepts, procedures, and technology to SRT that are explored in the study each aim to align a statistical growth model more closely with the priorities of educators, administrators, and other stakeholders. The first paper considers several decisions about how to measure instructional demand that determine when this information can be made available and how it relates to educational equity. The second paper expands beyond a simple ranking of educators to construct multifaceted reports of teaching performance that highlight strengths and weaknesses of an individual, consider contextual factors for different comparisons, and frame results in terms of concrete performance criteria. The third paper connects SRT measures across time and contexts, creating a framework for monitoring the performance of individual educators, groups of educators, and the entire population of educators over time, and using this information to better meet the needs of students.

While this study takes critical steps towards refining SRT methods for use in a practical setting, it also identifies areas of concern. Violations of model assumptions and fundamental differences between the IRT and SRT frameworks that were discussed in earlier studies are examined through a different lens, in a different setting, using different data. This contributes to larger conversations about limitations of these measures, contextual factors that contribute to model performance, ways in which IRT and SRT are not truly analogous, and directions for future work.

CHAPTER 2. MEASURING STUDENT DEMAND AND EDUCATOR CAPACITY

2.1 Introduction

Student response theory (SRT) is a method for estimating the effectiveness of an educator using a statistical model analogous to an item response theory (IRT) model (Reckase & Martineau, 2014). One of the primary differences between the two-parameter logistic (2PL) model in IRT and the analogous version of this model in SRT is the manner in which the location parameter is estimated. In IRT, the item location (or item difficulty) parameter is typically estimated concurrently with the item discrimination (or slope) parameter and the latent variable (Lord, 2012). In SRT, the student location (or instructional demand) parameter is estimated independently in an earlier step, and then treated as a constant in the model used to estimate the teacher slope and capacity parameters.

Instructional demand is a construct describing the varying levels of challenge to an educator that are associated with helping different students to reach a performance target, given each student's prerequisite knowledge, behavior, attitude, and different needs as a learner. In order to estimate instructional demand, an index is constructed using a combination of indicators describing the academic performance, behavior, attendance, disabilities, and support structures of students.
The inclusion of demographic indicators like ethnicity and socioeconomic status can be controversial: on one hand, there are well-documented relationships between these variables and academic performance (Baker et al., 2016). On the other hand, incorporating these factors can inappropriately imply that these relationships are causal or inadvertently set lower standards for educators of students from disadvantaged backgrounds.1 This necessitates careful consideration of not only the statistical properties of instructional demand indicators, but also the social consequences of their use in an accountability framework.

1 A set of guidelines issued by the federal government in 2009 explicitly discouraged this practice on the grounds that expectations of growth and standards of achievement should be the same regardless of these characteristics (U.S. Department of Education, 2009). Researchers have argued that this advice is misguided (Ehlert et al., 2013).

Reckase and Martineau (2014) discuss two approaches for estimating the student instructional demand index (SIDI): regression analysis and IRT calibration. The two methods produce SIDIs with fundamentally different meanings and function best with different types of instructional demand indicators. With the regression analysis approach, student performance at the end of a school year is regressed on performance in the prior school year and a set of student-level covariates believed to influence performance. The SIDI is then defined by determining the predicted performance for each student and reverse-scaling these values so that lower predicted performance corresponds to a higher level of instructional demand. The ideal covariates for regression analysis are highly correlated with the outcome variable (future performance) but uncorrelated with each other (Chatterjee & Hadi, 2015). The second approach uses IRT calibration to estimate instructional demand as a latent variable, with student-level indicators operating as test items. This is equivalent to the first principal component of a set of risk factors (Martineau, 2016). Because this approach does not rely on current-year performance in the estimation process, instructional demand can be calculated before the school year begins. The ideal indicators for this method are highly correlated with future achievement and also highly correlated with each other.

Preliminary studies of SRT utilize both estimation methods. Ham (2014) constructed both types of SIDIs using prior performance, attendance, economic disadvantage, FRL eligibility, targeted assistance eligibility, special education placement, and limited English proficiency as indicators. While the regression method produced an approximately normal distribution of estimated demand levels, the IRT calibration method resulted in very few possible demand values, and ultimately only the regression-based SIDI was used for the SRT analysis. Martineau (2016) used a similar set of indicators to construct both types of SIDIs, and the resulting capacity estimates were markedly different from each other. Correlations with estimates from other SRT models and VAMs were weaker, estimates were more stable across grade levels and years, and correlations with demographic variables were stronger for models with the IRT-calibrated SIDI. The reason for these differences was not entirely clear, and analyses focused primarily on models that used the regression-based SIDI.
One possible explanation for these results is that the sets of instructional demand indicators used in these studies are better suited for the regression analysis method, and that the IRT calibration method would be more successful with a different set of indicators. If this is the case, IRT calibration could offer three noteworthy advantages over regression analysis in this context. First, the regression analysis method is subject to bias from regression to the mean while the IRT calibration method is not (Reckase & Martineau, 2014). Second, the IRT calibration method does not rely on performance outcomes from the current year and can therefore be estimated in advance of an upcoming school year and potentially be used for planning purposes. Third, IRT equating procedures could be applied to SIDIs from different grade levels and years to establish consistent scales across time and across contexts. This study compares different sets of instructional demand indicators, methods for estimating the SIDI, and the resulting estimates of demand and capacity from SRT models under each combination of an indicator set and estimation method. This is the first study to employ such a rich set of instructional demand indicators, as well as the first to select instructional demand indicators based on optimality for IRT calibration, and the first to analyze the impact of omitting demographic variables from the SIDI construction process.

2.2 Data and Methodology

2.2.1 Source Data

This study draws on data from a large, anonymous, urban school district in a major U.S. city. The district serves a diverse population with a large proportion of high-needs students, providing a strong framework for studying the varying levels of demand faced by educators. Table 2-1 provides descriptive information about students and teachers in the sample. The majority of students in the district are nonwhite and eligible for free or reduced-price lunch (FRL), and the percentage of students classified as English language learners (ELL) is far greater than the 2015 national average of 9.5%. Students in this district tend to perform poorly on standardized math and ELA assessments, relative to other students throughout the state. Scores in both subjects are farthest below state averages in earlier grade levels and move closer to the state average in higher grades. Nearly 90% of teachers in the sample are tenured, and only 1-2% are first-year teachers in each grade level each year. Fewer 4th grade students in the first cohort are eligible for gifted programs than 4th grade students in the second cohort. Within cohorts, eligibility for gifted and special education programs increases between the two years in the study. Aside from these exceptions, characteristics of students and teachers are fairly consistent across cohorts, grade levels, and years.

Table 2-1. Summary of student and teacher characteristics by cohort
                                      Cohort 1                 Cohort 2
School year                    2014-2015   2015-2016    2014-2015   2015-2016
Grade level                       3rd         4th          4th         5th
Students
  Percent female                  48.8        48.8         48.8        49.0
  Percent nonwhite                85.2        85.5         85.7        85.6
  Percent FRL eligible            83.4        80.0         83.3        79.4
  Percent ELL                     32.0        29.1         26.9        24.0
  Percent special education       11.3        12.7         12.6        13.8
  Percent gifted                   7.5         9.7         14.1        14.7
  Mean number of absences          6.9         6.5          6.7         6.2
  Mean standardized math score   -0.61       -0.26        -0.27       -0.09
  Mean standardized ELA score    -0.77       -0.41        -0.43       -0.07
Teachers
  Mean number of students         17.5        20.4         20.2        20.7
  Percent female                  79.9        75.7         75.9        74.2
  Percent tenured                 88.9        86.3         86.1        86.2
  Percent first-year teachers      1.9         1.5          2.1         1.2

Student-level indicators of instructional demand include performance level categories on state standardized math and ELA assessments, course grades for achievement and effort in each of ten subject areas (math, reading, writing, speaking, listening, science, history, health, art, and physical education), number of days enrolled in the district during the year, number of days absent during the year, number of suspensions, length of suspensions, parent education level, eligibility for specific gifted and special education programs, ethnicity, eligibility for free or reduced-price lunch (FRL), gender, language spoken at home, English language learner (ELL) status, and previous ELL status. SIDI estimates are linked to performance levels for the same students on state standardized assessments in the following year (as 4th or 5th graders in 2015-2016) and identifiers for their 2015-2016 classroom teachers. Teacher-level variables used for validation purposes include years of teaching experience and rating categories from classroom observations.

2.2.2 Selecting Instructional Demand Indicators

In order to construct a SIDI using IRT calibration, student-level variables related to instructional demand are redefined as discrete test items for compatibility with a hybrid IRT model. Continuous variables are collapsed into either two or four levels to create a combination of graded-response model (GRM) and two-parameter logistic (2PL) items. All items are coded such that the highest values correspond to the highest levels of demand. For instance, students who reached the advanced proficiency benchmark in the previous year are likely to pose lower levels of demand on their educators and are assigned the lowest value, 0, for the graded-response item constructed from this variable. Students who did not reach any of the performance benchmarks in the previous year are likely to pose higher levels of demand; these students are assigned the highest value, 3. The value definitions for all instructional demand indicators are shown in Table 2-2.

The complete list of indicators is used to estimate an initial SIDI, referred to as the "full" index, before undergoing an item reduction process for alternate versions of the SIDI. The full set of instructional demand indicators is reduced to a "compact" set based purely on empirical relationships between the indicators, through an iterative analysis of internal consistency using Cronbach's coefficient alpha (Cronbach, 1951). In each iteration of the item analysis, indicators with item-test correlations below 0.1 were identified and excluded from future iterations. Once there were no remaining items below 0.1, items with item-test correlations below 0.25 were excluded iteratively until no more remained. A third "restricted" indicator set was selected by repeating the same item reduction process, except that several demographic variables that could be controversial for accountability purposes (ethnicity, gender, free or reduced-price lunch
A third “restricted” indicator set was selected by repeating the same item reduction process, except that several demographic variables that could be controversial for accountability purposes (ethnicity, gender, free or reduced-price lunch 17 eligibility, parent education level, and home language) were excluded from consideration prior to beginning the item analysis. Table 2-2. Definitions of discrete levels for instructional demand indicators Polytomous (GRM) items 0 1 2 3 State performance level: math Advanced proficiency Proficiency State performance level: ELA Course grade for achievement [3.5, 4.0] [3.0, 3.5) Course grade for effort Highest degree held by parent Advanced degree College degree Number of suspensions Number of days suspended Number of days absent Number of days enrolled Dichotomous (2PL) items 0 0-4 180 1 5-10 175-179 Gender Eligible for gifted program(s) Native English speaker English is home language Consistent achievement/effort grades throughout year Under-represented minority (URM) Student with disability (SWD) Eligible for free or reduced-price lunch (FRL) Limited English proficiency (LEP—current or previous) Basic proficiency Below basic proficiency [2.0, 3.0) [0.0, 2.0) High school diploma 2 11-17 155-174 0 Female None 3+ 18+ <155 1 Male Yes No No Yes The three sets of indicators (full, compact, and restricted) are outlined in Table 2-3. The full set, which consists of 73 items, has high internal consistency with alpha=0.92. Through the item reduction process, the full set reduces to 33 items in the compact item set and 31 items in the restricted set, both with alpha=0.94. While these are the exact sets of indicators used to estimate each IRT-calibrated SIDI, some of the indicators that are considered optimal for IRT 18 Table 2-3. Instructional demand indicators in the full, compact, and/or restricted SIDI. 
Category Math performance Inconsistent math grades Inconsistent math effort F C R X X ELA performance Performance (other subjects) Learner needs & support structure Inconsistent reading grades Inconsistent writing grades Inconsistent speaking grades Inconsistent listening grades Inconsistent reading effort Inconsistent writing effort Inconsistent speaking effort Inconsistent listening effort GRM items F C R 2PL items Math achievement X X X Math effort* X X X State math level X X X Reading achievement X X X Writing achievement X X X Speaking achievement X X X Listening achievement X X X X X X Reading effort X X X Writing effort* X X X Speaking effort Listening effort X X X X X X State ELA level X X X Science achievement History achievement X X X X X X Health achievement X X X Art achievement X X X PE achievement Science effort* X X X X X X History effort* X X X Health effort X X X Art effort PE effort X X X X Days absent X Days enrolled X Times suspended Days suspended X Parent education level X X Number of disabilities X X X ELL X URM status Inconsistent science grades Inconsistent history grades Inconsistent health grades Inconsistent art grades Inconsistent PE grades Inconsistent science effort Inconsistent history effort Inconsistent health effort Inconsistent art effort Inconsistent PE effort FRL eligibility Male Non-native English speaker Home language not English X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X Proficient, non-native speaker Previously ELL Gifted (any type)* Gifted (high intellectual ability) Gifted (high achievement) Gifted (multiple types) SPED: any type* SPED: adapted PE SPED: autistic SPED: behavioral intervention SPED: speech impairment SPED: occupational therapy* SPED: referred for counseling SPED: resource specialist SPED: specific learning disability X X SPED: any physical disability SPED: rare disability X SPED: any related to learner needs X X X SPED: any behavioral disability* X
*Item was excluded from regression-based SIDI due to collinearity. F: full, C: compact, R: restricted

While these are the exact sets of indicators used to estimate each IRT-calibrated SIDI, some of the indicators that are considered optimal for IRT calibration are inappropriate for the regression analysis method due to collinearity with other indicators. These items, which are marked with an asterisk (*) in Table 2-3, are excluded from the regression-based SIDIs. After excluding collinear variables, there are no pairs of regression model covariates with correlations above 0.8 and no regression model with a variance inflation factor (VIF) above 4.0.
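The iterative item-reduction procedure used to derive the compact and restricted sets can be sketched as follows. This is an illustrative outline only, assuming the discretized indicators are stored as columns of a pandas DataFrame (the name `full_set` and the use of item-total correlations are assumptions, not a description of the exact code used for this dissertation).

```python
import pandas as pd


def reduce_items(items: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Iteratively drop indicators whose item-test correlation falls below
    `threshold`, re-computing the total score after each pass."""
    items = items.copy()
    while True:
        total = items.sum(axis=1)
        # Correlation of each indicator with the total score across indicators.
        item_test = items.apply(lambda col: col.corr(total))
        weak = item_test[item_test < threshold].index
        if len(weak) == 0:
            return items
        items = items.drop(columns=weak)


# Hypothetical usage: apply the 0.1 criterion first, then the 0.25 criterion,
# mirroring the two-stage reduction described in Section 2.2.2.
# compact_set = reduce_items(reduce_items(full_set, 0.1), 0.25)
```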
2.2.3 Constructing Indices of Student Instructional Demand

For the IRT-calibration method, the Graded Response Model (Samejima, 2016) shown in Equations 2-1 and 2-2 is used to estimate student levels of instructional demand with each item set (full, compact, and restricted). A single slope parameter is estimated for each item, and a unique difficulty parameter is estimated for each response option beyond the zero category. For the polytomous graded-response items defined in Table 2-2, this yields four total parameters (one discrimination parameter and three location parameters). For the dichotomous items, this reduces to a two-parameter logistic (2PL) model with one discrimination parameter and one location parameter.

P^{*}_{j,q}(d_i) = \frac{\exp[a_j(d_i - b_{j,q})]}{1 + \exp[a_j(d_i - b_{j,q})]}    (2-1)

P_{j,q}(d_i) = P^{*}_{j,q}(d_i) - P^{*}_{j,q+1}(d_i)    (2-2)

where d_i is the latent level of instructional demand for student i;
P_{j,q} is the probability that a student is in category q of indicator j;
P^{*}_{j,q} is the probability that a student is in category q or higher of indicator j;
a_j is the discrimination parameter for indicator j; and
b_{j,q} is the location parameter for category q of indicator j.

For each IRT-calibrated SIDI, a comparison SIDI is estimated using the regression analysis method with the same set of indicators (except for the collinear items that were excluded, as indicated in Table 2-3). The regression-based SIDI is computed by estimating a linear regression model with scores on the 2015-2016 state standardized assessment as the outcome variable and 2014-2015 instructional demand indicators as covariates (as shown in Equation 2-3). The instructional demand estimate is equivalent to the reverse of the predicted outcome for a student, as shown in Equation 2-4. Separate models are used for state math and ELA assessments; this means that there are twice as many regression-based SIDIs estimated as there are IRT-calibrated SIDIs, as the IRT-calibrated SIDI is not subject-specific. Continuous indicators that were collapsed into discrete levels for the IRT-calibrated SIDI are used in their original forms when estimating the regression-based SIDI.

y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 X_1 + \cdots + \beta_{k+1} X_k + \varepsilon    (2-3)

d = -\hat{y}_t    (2-4)

where y_t is the standardized achievement score at time t;
y_{t-1} is the standardized achievement score from the previous year;
X_1 through X_k is a set of instructional demand indicators; and
d is the instructional demand level for a student.
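As a rough illustration of how the regression-based SIDI in Equations 2-3 and 2-4 can be computed, the sketch below fits the prediction model and reverse-scales the fitted values. Array names (`current_score`, `prior_score`, `indicators`) are placeholders rather than the study's actual data fields, and the IRT-calibrated counterpart is only indicated in a comment because it requires a separate graded-response calibration.

```python
import numpy as np


def regression_sidi(current_score, prior_score, indicators):
    """Regression-based SIDI (Equations 2-3 and 2-4): regress the current
    standardized score on the prior score and the demand indicators, then
    reverse-scale the fitted values so that higher values mean higher demand."""
    X = np.column_stack([np.ones(len(current_score)), prior_score, indicators])
    beta, *_ = np.linalg.lstsq(X, current_score, rcond=None)
    predicted = X @ beta
    return -predicted  # d = -y_hat

# The IRT-calibrated SIDI would instead be obtained by fitting a graded
# response model (Equations 2-1 and 2-2) to the discretized indicators with
# an IRT estimation package and taking each student's estimated latent trait.
```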
2.3 Results

2.3.1 Comparison of SIDIs Across Indicator Sets and Estimation Methods

Test Information Functions for each of the IRT-calibrated SIDIs are shown in Figure 2-1. Even after drastically reducing the number of indicators for the compact and restricted SIDIs, test information levels are impacted very little. The information functions for the compact and restricted SIDIs are nearly identical to one another, which suggests that the impact of excluding demographic indicators is negligible. The shapes of these curves are nearly identical across the two cohorts of students. Table 2-4 provides descriptive statistics for each of the SIDIs. The distributions of instructional demand are consistent across item sets and grade levels. The correlations between student-level demand estimates and future performance on standardized math and ELA assessments are also listed in Table 2-4: all are strong and negative, and correlations are stronger for regression demand indices than for IRT indices.

Figure 2-1. Test Information Functions for IRT-calibrated SIDIs

Table 2-4. Distributions of SIDIs and correlations with future state assessment scores

SIDI conditions                              Descriptive statistics           Future score correlation
Method             Grade   Indicators    Mean    SD     Min     Max           Math     ELA
IRT                4th     Full          0.00    0.98   -3.76   3.98          -0.67    -0.70
                           Compact       0.00    0.98   -3.43   4.05          -0.68    -0.70
                           Restricted    0.00    0.98   -3.34   4.00          -0.67    -0.69
                   5th     Full          0.00    0.98   -3.63   4.01          -0.68    -0.71
                           Compact       0.00    0.98   -3.30   3.92          -0.69    -0.71
                           Restricted    0.00    0.98   -3.21   4.03          -0.68    -0.71
Regression (math)  4th     Full         -0.07    0.87   -2.81   2.56          -0.87    -0.81
                           Compact      -0.07    0.87   -2.66   2.50          -0.87    -0.81
                           Restricted   -0.07    0.86   -2.69   2.43          -0.87    -0.81
                   5th     Full         -0.08    0.88   -2.84   2.60          -0.88    -0.83
                           Compact      -0.08    0.88   -2.68   2.57          -0.88    -0.83
                           Restricted   -0.08    0.88   -2.69   2.70          -0.88    -0.83
Regression (ELA)   4th     Full         -0.07    0.85   -2.58   2.55          -0.82    -0.86
                           Compact      -0.07    0.85   -2.58   2.45          -0.82    -0.86
                           Restricted   -0.07    0.85   -2.54   2.53          -0.82    -0.86
                   5th     Full         -0.07    0.87   -2.58   2.63          -0.83    -0.87
                           Compact      -0.07    0.87   -2.51   2.69          -0.84    -0.87
                           Restricted   -0.07    0.86   -2.53   2.68          -0.83    -0.87

Table 2-5 provides correlations among estimates of instructional demand for the same students based on SIDIs estimated with different methods and sets of instructional demand indicators. There are strong, positive, linear relationships between estimates for all pairs of SIDIs. Estimates of instructional demand are nearly identical across pairs of SIDIs with the same estimation method, regardless of which set of instructional demand indicators is used to construct them. In comparison, the relationships between SIDIs constructed using opposite estimation methods are somewhat weaker. However, correlations among these estimates are still rather strong, ranging from 0.77 to 0.81.

Table 2-5. Correlations between estimates from different instructional demand indices

Method, Indicators               (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)
(1) IRT, Full                     .     0.99   0.99   0.78   0.78   0.78   0.81   0.81   0.81
(2) IRT, Compact                 0.99    .     0.99   0.78   0.78   0.78   0.81   0.81   0.81
(3) IRT, Restricted              0.99   0.99    .     0.77   0.77   0.78   0.80   0.81   0.81
(4) Regression (math), Full      0.77   0.77   0.77    .     0.99   0.99   0.95   0.95   0.95
(5) Regression (math), Compact   0.77   0.78   0.77   0.99    .     0.99   0.95   0.95   0.95
(6) Regression (math), Restricted 0.77  0.78   0.77   0.99   0.99    .     0.95   0.95   0.95
(7) Regression (ELA), Full       0.81   0.81   0.80   0.94   0.95   0.94    .     0.99   0.99
(8) Regression (ELA), Compact    0.81   0.81   0.80   0.94   0.95   0.94   0.99    .     0.99
(9) Regression (ELA), Restricted 0.81   0.81   0.80   0.94   0.94   0.95   0.99   0.99    .
*Below diagonal: 4th grade; above diagonal: 5th grade

2.3.2 Comparisons of SRT Models with Different SIDIs

Table 2-6 provides the Akaike Information Criterion (AIC) values for different SRT models. The models differ in which target performance level is used as an outcome variable (basic proficiency, proficiency, or advanced proficiency) and which set of instructional demand indicators is used to compute the SIDI (full, compact, or restricted). AIC comparisons are only meaningful across models for the same grade level, as these estimates were derived from identical data. Across all conditions, the AIC values for models with the most difficult performance target (advanced proficiency) indicate the best fit. SRT models with regression-based SIDIs have AIC values that indicate better fit than the corresponding models with IRT-calibrated SIDIs. There are no major differences in model fit across sets of instructional demand indicators when all other conditions are the same.
Table 2-6. Akaike Information Criterion (AIC) for SRT models

                           IRT-calibrated SIDI                     Regression-based SIDI
Target  SIDI          Math 4th  Math 5th  ELA 4th  ELA 5th    Math 4th  Math 5th  ELA 4th  ELA 5th
Low     Full           -12703    -12865   -13107   -12044      -9088     -8766     -9852    -9026
Low     Compact        -12649    -12773   -13050   -12005      -9133     -8807     -9891    -9120
Low     Restricted     -12683    -12820   -13103   -12051      -9203     -8866     -9974    -9173
Middle  Full           -12410    -10270   -12541   -12222      -8317     -6585     -9315    -8949
Middle  Compact        -12308    -10170   -12454   -12135      -8377     -6432     -9333    -9018
Middle  Restricted     -12367    -10232   -12518   -12201      -8436     -6700     -9404    -9085
High    Full            -7105     -6576    -8786    -7709      -4595     -4217     -6295    -5544
High    Compact         -7036     -6509    -8708    -7639      -4636     -4252     -6299    -5565
High    Restricted      -7084     -6552    -8768    -7697      -4714     -4260     -6365    -5613

Table 2-7 lists ranges of correlations for capacity estimates among pairs of SRT models with specified properties. Correlations of capacity estimates are above 0.90 for all pairs of SRT models with the same performance target and estimation method but different sets of indicators. All of these pairs of models have correlations above 0.99 except when one model uses a regression-based SIDI with the restricted set of indicators (correlations range from 0.91 to 0.94 for these conditions). Correlations are lower when the pair of models uses different performance targets and the same estimation method; they are lower for nonconsecutive targets (one low and one high) and for regression-based SIDIs, and about the same regardless of whether the same demand indicators are used. Pairs of models with different SIDI estimation methods show similar patterns across targets and sets of indicators, but correlations are lower overall. Because estimates of capacity are so consistent across sets of indicators, the remaining comparisons focus only on SRT models and SIDIs constructed from the restricted indicator set.
Table 2-7. Range of correlations among capacity estimates for pairs of SRT models

Model 1 SIDI  Model 2 SIDI  Target      Indicators   Range of correlations
IRT           IRT           Same        Different    >0.99
IRT           IRT           Different   Same         0.59 to 0.86
IRT           IRT           Different   Different    0.58 to 0.86
Regression    Regression    Same        Different    0.91 to 0.99
Regression    Regression    Different   Same         0.40 to 0.73
Regression    Regression    Different   Different    0.40 to 0.70
IRT           Regression    Same        Different    0.52 to 0.66
IRT           Regression    Different   Same         0.25 to 0.53
IRT           Regression    Different   Different    0.25 to 0.53

Scatterplots of teacher capacity rankings² across pairs of SRT models with different estimation methods and performance targets are provided in Figure 2-2. There are high concentrations of teachers close to the 45-degree line, indicating that their capacity rankings are consistent across the pair of models. However, there are still large numbers of teachers with inconsistent rankings across every pair of SRT models. Patterns of clustering indicate that capacity estimates spanning somewhat broad ranges in one model are concentrated within limited ranges in other models. This appears to be primarily driven by special education teachers. When these teachers are excluded, as in Figure 2-3, there are far fewer of these clustering patterns. The differences between Figures 2-2 and 2-3 are most prominent among pairs of models with at least one regression-based SIDI. Regardless, there are still large proportions of teachers ranked inconsistently across models when only general education teachers are included.
² Capacity estimates were converted to percentile rankings for these scatterplots in order to improve visibility of patterns in concentrated areas of the capacity distribution.
Figure 2-2. Scatterplot of capacity rankings by SRT model specifications.
Figure 2-3. Scatterplot of capacity rankings (special education excluded)
2.3.3 Comparisons to Other Measures of Educator Quality
Figure 2-4 provides marginal density distributions of capacity percentile rankings across teachers rated in each overall performance category based on observations of their classroom teaching. Marginal density distributions are the most distinct across observation rating categories for the capacity rankings from SRT models with the lowest performance target. Teachers in the lowest observation rating category are generally ranked within a smaller range of the capacity distribution by models with IRT-calibrated SIDIs than by models with regression-based SIDIs. These teachers are also more likely to be ranked in the top half of the capacity distribution by SRT models with regression-based SIDIs than by those with IRT-calibrated SIDIs, across all performance targets and in both subject areas. Figure 2-5 provides marginal density distributions of capacity percentile rankings by teaching experience level. Across both subjects and for all but the highest performance target, the distributions of capacity rankings are more distinct across experience categories for SRT models with IRT-calibrated SIDIs than for models with regression-based SIDIs. However, there is no distinction between marginal distributions of capacity rankings by teaching experience level for any SRT models that use the highest performance target as the outcome variable, regardless of the subject area and SIDI estimation method.
Figure 2-4. Marginal density distributions by observation rating.
Figure 2-5. Marginal density distributions by experience level.
2.4 Discussion
2.4.1 Validity of the SIDI
The types of instructional demand indicators retained after the reduction processes reflect balances of academic and behavioral characteristics similar to the full set. Considering the minimal impact that removing approximately half of the indicators has on the test information function and the distribution of estimated instructional demand, it can be concluded that the excluded indicators were not making substantial contributions to estimates of instructional demand. These indicators tend to have either low discriminating power or difficulty parameters far outside the observed range of student instructional demand levels. The impact of removing demographic indicators is also negligible. This suggests that other, less controversial indicators may adequately account for the variation in performance associated with differences between demographic groups. Differences in instructional demand can be attributed directly to the types of indicators that were retained in the restricted set. However, there is no strong theoretical basis for attributing differences in instructional demand to demographic group membership. For this reason, the restricted set of indicators is likely the most appropriate for use in an accountability framework. Similarly, regression-based SIDIs change very little depending on the set of instructional demand indicators used. However, the regression-based SIDIs are slightly more sensitive to these differences than the IRT-calibrated SIDIs. Estimates of instructional demand for the same students from different SIDIs, in general, are affected very little by the set of indicators used or the subject area but are somewhat sensitive to the estimation method.
Both methods produce SIDIs that are negatively correlated with future assessment performance, suggesting associations between higher instructional demand levels and lower academic performance. A significant relationship between these quantities is desirable, as this provides evidence of convergent validity. However, some discordance between the two quantities is also desirable in order to demonstrate evidence of divergent validity. The magnitude of the correlations for the IRT-calibrated SIDIs is only moderately strong, implying that these SIDIs capture a different underlying construct than the assessment scores. The correlations for regression-based SIDIs are notably stronger; this makes sense considering that this estimation method frames instructional demand estimates in reference to these same assessment scores. However, this does warrant consideration of whether the regression analysis method is capturing the appropriate construct. If instructional demand is a substantively different construct than expected achievement, the regression analysis method may not be capturing it.
2.4.2 Differences Across SRT Specifications
The comparatively better fit for SRT models that use the highest performance target may be misleading, as other findings suggest these measures are the least informative about educator performance. This improvement in model fit could be a result of low variation in the outcome variable, as there are very few students who reach the high performance targets. These results also suggest that the types of students who reach these targets can be predicted from demand indicators with greater accuracy than for the other performance targets. This implies that this outcome is related more to characteristics of the students and less to characteristics of their teachers, relative to the other performance targets. Although this translates to an improvement in model fit, if the outcome is unlikely to be affected by an educator it is probably not an appropriate indicator of educator performance. Differences in fit across SRT models with IRT-calibrated SIDIs and regression-based SIDIs may also be misleading. Regression-based SIDIs are estimated as predicted assessment performance. By definition, these instructional demand estimates account for the greatest possible amount of variation in assessment outcomes. This translates to “better” model fit by construction but does not necessarily imply that these models produce better estimates of educator capacity. Similar to the estimates of instructional demand, estimates of capacity are quite consistent across different sets of instructional demand indicators, but somewhat less consistent for models with regression-based SIDIs than for those with IRT-calibrated SIDIs. The choice of a performance target has a much larger impact, though once again, the impact is more pronounced for models with regression-based SIDIs than for those with IRT-calibrated SIDIs. These relationships are likely driven in part by special education teachers. These teachers tend to fall within a restricted range of the capacity scale using regression-based SIDIs but are more dispersed across the capacity distribution using IRT-calibrated SIDIs. This suggests that IRT-calibrated SIDIs might be better able to discriminate between instructional demand levels of special education students and capacity levels of special education teachers.
One possible explanation for this is that indicators of eligibility for some specific special education programs were included in IRT-calibrated SIDIs but excluded from regression-based SIDIs due to collinearity. Although these indicators are highly correlated with one another, they may be important for discriminating between the relative demand levels of special education students and the capacities of their teachers. The marginal distributions of capacity rankings across classroom observation rating categories and teaching experience levels offer insight about the concurrent validity of capacity estimates from SRT models with different specifications. Across all conditions, the highest performance target does not discriminate between these groups of teachers in intuitive ways. The high performance target is unreasonably difficult for the vast majority of students in the district, and as a result, it is rather uninformative about educator quality in most cases. The two lower targets do separate these groups to a greater degree, especially for ELA. Models with IRT-calibrated SIDIs are much more likely to rank “ineffective” teachers in lower parts of the distribution than those with regression-based SIDIs, and much less likely to assign low rankings to “effective” teachers. Across all specifications, models with regression-based SIDIs do not show meaningful distinctions between the distributions of rankings for teachers at different experience levels, while models with IRT-calibrated SIDIs are able to do so in both subject areas for all targets except the highest.
2.5 Conclusions
These results demonstrate that IRT calibration is a feasible method for constructing the student instructional demand index for an SRT analysis. It is important to recognize that the indicators of instructional demand in this study were selected in a manner that is optimal for IRT calibration. This demonstrates that, with proper selection methods, IRT calibration is a useful tool for demand scale construction, and that these SIDIs outperform regression-based SIDIs in several ways. While results from prior studies favored the regression analysis method, this is likely a result of the type of indicators used to construct the SIDI. This study demonstrates that selecting indicators in a different way can favor a different method. This does not necessarily mean that IRT calibration is a more effective method of scale construction for SRT, but rather that the selection of indicators should be appropriate for the scale construction method of choice. The IRT calibration method offers several advantages over regression analysis. Its independence from current-year achievement allows instructional demand to be estimated far earlier than with a regression-based SIDI, allowing this information to influence planning processes for an upcoming school year. The potential ability to equate scales across grade levels and years using parameters of instructional demand indicators could also allow for the measurement of absolute changes (or growth) in teaching capacity over time, rather than simply changes in the relative position of a teacher within the capacity distribution. Considering the benefits IRT calibration may offer to schools and evaluation systems, it may also be worthwhile to invest in the collection of higher-quality or more appropriate instructional demand indicators before implementing an SRT model in a practical setting, rather than relying on a limited but readily available set of indicators.
The study also highlights a few areas of concern that warrant further examination. Capacity estimates may be less reliable for special education teachers. This is likely related to a mismatch between the instructional demands of their students and the performance targets against which they are assessed. Similarly, when the student performance target in an SRT model is unreasonably high, estimates of capacity are less consistent with other measures of educator quality, even for general education teachers. In order for these tools to be informative and useful in practice, the performance targets considered in SRT models must be both challenging and realistic. Setting appropriate benchmarks for students and teachers will be critical in order for SRT to contribute valuable feedback to an education system.
CHAPTER 3. CONTRIBUTIONS OF SRT TO SUMMATIVE EVALUATION
3.1 Introduction
Within a teacher accountability framework, value-added models (VAMs) and similar statistical growth models typically operate as summative assessments of teaching performance. Estimates of effectiveness are compared against a standard or benchmark to identify high- or low-performing educators and administer rewards or sanctions (Harris, 2011). This process is conceptually similar to end-of-year student achievement testing to determine a final grade or performance level for the year. Student Response Theory (SRT) bridges the student assessment and educator assessment contexts, using item response theory (IRT) methodology to define a statistical growth model for educators (Reckase & Martineau, 2013). Previous research in this area focuses primarily on estimating and comparing teacher effects across SRT and other statistical growth models like value-added and growth percentile models (Ham, 2014; Martineau, 2016). In this study, the main comparison of interest is between the IRT and SRT frameworks. Procedures commonly used in IRT analyses are studied within the SRT context, with particular regard to the meaning and quality of information they provide and the implications of differences across the two frameworks.
3.1.1 Item and Student Response Functions
An item response function (IRF) is a mathematical equation that relates the probability of a correct response to properties of the test item and the examinee (Lord, 2012). In typical applications of IRT, various quantities derived from the IRF are useful for item selection, outcome prediction, standard setting, and comparing test forms. In SRT, a student response function (SRF) plays a comparable role to the IRF by modeling the relationship between performance outcomes and characteristics of students and teachers. However, the IRF and SRF are not perfectly analogous to each other, and the ways in which they differ have implications for the interpretation and use of these functions. Corresponding components of the IRF and SRF are shown side-by-side in Table 3-1. The outcome variable and latent variable are both directly analogous across the two functions. The binary outcome variable, X, takes on a value of 1 in IRT if a student gives a correct response to a particular test item. In SRT, this variable takes on a value of 1 for a teacher if a pre-defined performance target is achieved by a particular student in their class. The latent variable, θ, represents the performance of a student in IRT and of an educator in SRT. The location parameter describes the difficulty of a test item (b) in IRT and the level of instructional demand posed by a student (d) in SRT.
While the two location parameters are conceptually similar, they differ in how and when they are estimated. The item difficulty parameter is typically estimated simultaneously with the latent variable and slope parameter. The instructional demand parameter is estimated separately and entered into the SRT model as a constant (Reckase & Martineau, 2014). The most prominent difference between the IRF and SRF is the nature of the slope parameter, a. In IRT, the slope is estimated at the item level. This parameter describes the extent to which a test item discriminates between examinees with latent ability levels just above and just below the item difficulty level. In order to estimate a perfectly analogous student slope parameter in an SRT model, students would need to work with many teachers simultaneously, and the same achievement target would need to be measured separately with each of these teachers. This is not feasible in the SRT context, as it is incompatible with the way that students and teachers are placed with one another and with the way classroom instruction takes place. For this reason, the SRT slope parameter is estimated at the teacher level. This parameter describes the consistency of expected outcomes across students posing different levels of instructional demand to a teacher. Because of this difference, the SRF is a function of one first-level variable (instructional demand) and two second-level variables (educator capacity and educator consistency), while the IRF is a function of two first-level variables (item difficulty and item discrimination) and one second-level variable (examinee ability).
Table 3-1. Comparison of item and student response functions

Response function:   IRT: P(X_i) = \exp[a_i(\theta_j - b_i)] / (1 + \exp[a_i(\theta_j - b_i)])   |   SRT: P(X_s) = \exp[a_e(\theta_e - d_s)] / (1 + \exp[a_e(\theta_e - d_s)])
Dependent variable:  IRT: Correct response to item (X_i)   |   SRT: Successful student outcome (X_s)
Location parameter:  IRT: Item difficulty (b_i)   |   SRT: Student instructional demand (d_s)
Slope parameter:     IRT: Item discrimination (a_i)   |   SRT: Educator consistency (a_e)
Latent variable:     IRT: Examinee ability (\theta_j)   |   SRT: Educator capacity (\theta_e)
1st level unit:      IRT: Item (i)   |   SRT: Student (s)
2nd level unit:      IRT: Examinee (j)   |   SRT: Educator (e)
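Because the IRF and SRF share the same two-parameter logistic form and differ only in which unit each parameter describes, the correspondence in Table 3-1 can be made concrete in a few lines of Python. This is only an illustrative sketch; the parameter values are arbitrary.

```python
import numpy as np

def two_pl(latent, location, slope):
    """Shared 2PL form used by both the IRF and the SRF."""
    return 1.0 / (1.0 + np.exp(-slope * (latent - location)))

# IRT reading: probability that examinee j answers item i correctly
p_irf = two_pl(latent=0.4, location=-0.2, slope=1.3)   # theta_j, b_i, a_i

# SRT reading: probability that a student with demand d_s reaches the target
# with educator e; the slope now belongs to the educator, not the student
p_srf = two_pl(latent=1.1, location=0.6, slope=1.9)    # theta_e, d_s, a_e
print(p_irf, p_srf)
```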
3.1.2 Item Characteristic Curves and their SRT Counterparts
In IRT, each individual item has an item characteristic curve (ICC) that is defined by substituting the difficulty and discrimination parameters for the item into the IRF. The ICC illustrates the probability of a correct response to a particular item for examinees at each location along the latent scale (Lord, 2012). The direct analog to the ICC in SRT is a student characteristic curve (SCC), which illustrates the probability that a student with a given level of instructional demand will reach a performance target with teachers of varying capacity levels. However, there is only one student-level parameter in the SRF and two teacher-level parameters, so substituting the one student-level parameter would leave two unknown quantities in the SCC (capacity and consistency). In order to hold constant two variables that describe the same unit of analysis, Martineau (2016) defined an educator characteristic curve (ECC) that illustrates the relationship between the probability of a successful student outcome for a particular teacher and the level of instructional demand posed by a student. By defining the characteristic curve in this way, the expected performance of similar students across different teachers can be compared beyond inferences about their relative capacities.
If the SRT slope parameter varies among teachers, and equivalently, if the IRT slope parameter varies among items, the characteristic curves for different items or educators will intersect with one another, and the relative order of their heights will vary along the latent scale. In SRT, this variability in relative rankings could be particularly useful for evaluating different aspects of teaching performance. This type of analysis may offer more insight into the strengths or weaknesses of a particular teacher in terms of their expected performance with different types of students, compared to a single summative measure.
3.1.3 Item and Test Information Functions and their SRT Counterparts
In IRT, a mathematical function of the slope parameter and the ICC called an item information function (IIF) indicates the amount of Fisher information a particular item provides about examinees at each location along the latent scale. The test information function (TIF), computed as the sum of the IIFs across every item on a test, is inversely related to the standard error of an estimate at a particular location on the latent scale (Lord, 2012). Because of its relationship to estimation precision, the TIF can be used to inform standard setting processes, item selection algorithms, and the development or identification of equivalent test forms (Samejima, 1977a). The direct analog to the IIF in SRT would be a student information function (SIF). The sum of SIFs across all students in the same class would then constitute a class information function (CIF), analogous to the TIF. In variable-length adaptive testing, items are continually administered until a pre-determined stopping rule, which typically corresponds to a minimum acceptable level of precision, is achieved (Dodd, Koch, & De Ayala, 1993). For instance, if the stopping rule is a reliability coefficient of 0.9, items are administered until the height of the TIF at the estimated ability level of the examinee corresponds to a reliability of 0.9. For a latent trait with a standard deviation of one, test reliability reaches 0.9 when the standard error and test information level are approximately 0.316 and 10, respectively. Within the SRT framework, similar benchmarks could be useful for determining the range of latent teacher capacities for which a particular group of students provides reliable estimates. The TIF is also used to determine whether different forms of a test can be considered equivalent. Two tests consisting of different items that measure the same construct can be considered “weakly parallel” forms if their TIFs are the same (Samejima, 1977b). This is an important step in ensuring fairness for students taking different forms of the test. Similar techniques can be used to address concerns about fairness for educators evaluated based on different groups of students. Comparisons of classroom-level information functions across teachers may be informative about the demand of a class as a whole, the quality of information provided by an estimate, and which groups of teachers are most appropriate to compare with one another.
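The arithmetic behind that stopping rule is easy to verify. The sketch below is a minimal illustration (not an operational algorithm) that converts a test information level into the implied standard error and reliability for a latent trait with a given standard deviation.

```python
import numpy as np

def se_from_information(info):
    """Standard error of the latent estimate implied by Fisher information."""
    return 1.0 / np.sqrt(info)

def reliability(info, latent_sd=1.0):
    """Classical reliability implied by an error variance of 1/info."""
    return 1.0 - (1.0 / info) / latent_sd**2

# Stopping-rule arithmetic cited above: information of 10 on a latent trait
# with SD 1 gives SE of about 0.316 and reliability of 0.9
print(se_from_information(10))       # ~0.316
print(reliability(10, latent_sd=1))  # 0.9
```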
3.1.4 Differences Between Models and Contexts
A few key differences between IRT and SRT complicate the extensions of these procedures from one context to the other. The primary methodological difference is that the two response functions differ in the number of variables at each level of analysis. The IRF is a function of two first-level variables and one second-level variable, while the SRF is a function of one first-level variable and two second-level variables. This means that, although substituting the first-level (item) parameters into the IRF defines an equation with a single unknown quantity, substituting first-level (student) parameters into the SRF results in an equation with two unknown quantities. Several contextual differences arise from differences in what factors are considered when administering items to an examinee versus assigning students to a teacher. Adjustments may be necessary in order for some equations and procedures developed for IRT to be useful when applied within the SRT framework. This requires careful consideration of the implications of the differences between models and contexts and of the adjustments made to address them. This study explores possible uses of the SRF and its derivations within a summative teacher evaluation framework. Specifically, the SRF is used to define cut-scores and rating categories with intuitive meanings to facilitate result comprehension. SRT-specific information functions are analyzed to determine the extent of differences across classrooms and the implications of these differences for comparing teachers to one another. A cluster analysis of information functions is then used to identify groups of educators whose sets of students constitute reasonably equivalent tests of teaching performance. Finally, sample reporting materials are presented to illustrate ways this information may be communicated to and interpreted by stakeholders.
3.2 Data and Methodology
3.2.1 Estimates of Demand, Capacity, and Consistency
The study draws on data from an anonymous large urban school district in a major U.S. city. The analytic sample consists of 5th grade students and teachers in the 2015-2016 school year. Estimates of instructional demand, educator capacity, and educator consistency were constructed according to the process described in Chapter 2, and Table 3-2 provides descriptive statistics for each of these quantities. The student instructional demand index (SIDI) was estimated from characteristics of these same students in the previous school year (as 4th graders in 2014-2015), using the restricted set of demand indicators and the IRT calibration method. Educator capacity and consistency parameters were estimated for three different SRT models, each with the same SIDI but a different student performance target. The three targets correspond to sequential state-determined performance level categories for a standardized mathematics assessment. The labels “basic proficiency,” “proficiency,” and “advanced proficiency” are used interchangeably with “low,” “middle,” and “high” to describe the three targets.
Table 3-2. Descriptive statistics for SRT model parameters

                                       Mean     SD      Min     Max
Student instructional demand           0.00    0.98    -3.21    4.03
Educator capacity: low target         -0.13    2.18    -5.42    6.23
Educator capacity: middle target      -4.57    2.23    -9.85    3.81
Educator capacity: high target        -8.41    2.03   -12.60    1.72
Educator consistency: low target       1.89    0.34     0.17    2.79
Educator consistency: middle target    2.02    0.29     0.25    2.93
Educator consistency: high target      2.09    0.17     1.20    2.67

3.2.2 Defining Characteristic Curves and Information Functions for SRT
Several mathematical functions, comparable to different characteristic curves and information functions in IRT, are defined from the SRT model parameters. Each of these functions is listed in Table 3-3.
Martineau (2016) introduces an educator characteristic curve (ECC) that addresses the different numbers of parameters at each level in the item and student response functions. However, a student-level characteristic curve is a closer analog to the ICC and may be of interest for different purposes. Two types of student-level functions are defined, both of which express the probability of achieving a performance target as a function of only one variable, educator capacity, by substituting both a student-level variable (instructional demand) and a teacher-level variable (educator consistency) as constants.
Table 3-3. Mathematical functions of SRT model parameters

P_t(x_s \geq x_t) = \exp[a_{et}(\theta_{et} - d_s)] / (1 + \exp[a_{et}(\theta_{et} - d_s)])        I_t(\theta \mid d_s, a_{et}) = a_{et}^2 P_t[1 - P_t]

Name                                             Function                                   Level     IRT counterpart
ECC – educator characteristic curve              P_t(d \mid \theta_e, a_e)                  teacher   ICC
SCC – student characteristic curve               P_t(\theta \mid d_s, a_•)                  student   ICC
E-SCC – educator-student characteristic curve    P_t(\theta \mid d_s, a_e)                  student   ICC
SIF – student information function               I_t(\theta \mid d_s, a_•)                  student   IIF
E-SIF – educator-student information function    I_t(\theta \mid d_s, a_e)                  student   IIF
CIF – class information function                 \sum_{s(e)} I_t(\theta \mid d_s, a_•)      teacher   TIF
E-CIF – educator-class information function      \sum_{s(e)} I_t(\theta \mid d_s, a_e)      teacher   TIF
Variables: P: probability; x: achievement level; a: slope; θ: capacity; d: instructional demand; I: information. Subscripts: s: student; e: educator; t: target; •: mean.

The first of these functions, referred to simply as the student characteristic curve (SCC), substitutes the average slope parameter across all teachers as a constant. This function expresses the probability that a particular student will reach a performance goal with a teacher of average consistency, as a function of the capacity of the teacher. The SCC may be useful for judgments about groups of students independently of the properties of the teachers to whom they are assigned. For other purposes, such as computing standard errors of teacher capacity estimates from Fisher information, the properties of a specific student-teacher pairing are appropriate. The second student-level characteristic function is informed by properties of both the teacher and the student. The educator-student characteristic curve (E-SCC) describes the probability associated with a particular student and a teacher at a particular educator consistency level, as a function of educator capacity. Constructing analogous information functions for SRT also requires careful consideration of the meaning and purpose of the function and the implications of the different nature of the slope parameters. If the purpose is to identify equivalent groups of students, then the function should be defined solely by student-level parameters. The student information function (SIF), which is informed by student instructional demand and the mean slope across educators, serves this purpose. The SIF is summed across all students in the same class, yielding a class information function (CIF) that describes the total amount of information that a particular group of students contributes to estimates of capacity for teachers with average slope parameters at each location along the capacity scale. However, if an information function is to be used to judge the precision of a capacity estimate or to determine a cut-score, then it is more appropriate to incorporate properties of both the student and the teacher. The information function defined from the E-SCC is referred to as the educator-student information function (E-SIF). Summed across all students in a class, the resulting educator-class information function (E-CIF) is informative about the information a particular group of students contributes to an estimate of capacity for a particular teacher and can be used as a basis for deriving standard errors of these estimates.
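The functions in Table 3-3 can be expressed compactly in code. The sketch below is my own illustration of the definitions, using hypothetical parameter values; passing the mean slope gives the SCC/SIF/CIF versions, while passing a specific teacher's slope gives the E-SCC/E-SIF/E-CIF versions.

```python
import numpy as np

def srf_prob(theta, d, a):
    """SRF: probability that a student with demand d reaches the target."""
    return 1.0 / (1.0 + np.exp(-a * (theta - d)))

def student_information(theta, d, a):
    """SIF or E-SIF: a^2 * P * (1 - P), evaluated along the capacity scale."""
    p = srf_prob(theta, d, a)
    return a**2 * p * (1.0 - p)

def class_information(theta, demands, a):
    """CIF or E-CIF: sum of student information over one teacher's class.
    Pass the mean slope for the CIF, or the teacher's own slope for the E-CIF."""
    return sum(student_information(theta, d, a) for d in demands)

theta_grid = np.linspace(-4, 4, 9)           # integer grid used later in the cluster analysis
class_demands = [-1.2, -0.3, 0.0, 0.8, 1.5]  # hypothetical demand estimates for one class
print(class_information(theta_grid, class_demands, a=1.9))
```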
3.2.3 Ranking and Categorizing Teachers
Teachers are assigned percentile rankings according to several different measures. The primary measure is the capacity estimate from the SRT model, which is equivalent to the location on the instructional demand scale where the height of the ECC is exactly 0.5. Teachers are then ranked according to several alternate measures of effectiveness derived from the SRF and its derivations. The first alternate measures are the locations on the instructional demand scale that correspond to ECC heights of 0.25 and 0.75. These measures, referred to as P25 and P75, respectively, indicate the instructional demand level of a student expected to have a probability of reaching a performance target of 0.25 or 0.75, given the capacity and consistency parameters for a particular teacher. The equations used to construct these measures and the steps used to derive them are shown in the Appendix. Next, teachers are ranked according to the heights of their ECCs at fixed locations along the instructional demand scale that correspond to meaningful distinctions between types of students. These fixed values are selected based on the location parameters for different instructional demand indicators from the IRT calibration process described in Chapter 2. For instance, the location parameter for a student having 5 or more absences is 0.04, indicating that the probability that a student with instructional demand of 0.04 was absent for 5 or more days is 0.5. Students with instructional demand levels above 0.04 have probabilities above 0.5, and students with instructional demand levels below 0.04 have probabilities below 0.5. Three key values are selected, describing low-demand, average-demand, and high-demand students based on location parameters for characteristics associated with these types of students. An instructional demand level of -1.5 was selected as the low-demand value, as students below this level have probabilities above 0.5 of receiving top marks from their teachers for their effort in all subject areas. An instructional demand level of 0 was selected as the average-demand value, as students below this instructional demand level have probabilities above 0.5 of reaching basic proficiency. An instructional demand level of 1.5 was selected as the high-demand value, as students above this demand level have high probabilities of receiving failing grades in their courses. These measures are referred to as D1, D2, and D3, corresponding to the low-demand, average-demand, and high-demand values, respectively. D1 ranks teachers according to the probability that a low-demand student will reach a particular performance target in their class, while D2 and D3 rank teachers based on the probabilities for average-demand and high-demand students. These three fixed probabilities and three fixed demand levels each define a unique effectiveness measure derived from the same SRF. These measures, outlined in Table 3-4, are each computed for the three different performance targets, yielding 18 total measures according to which teachers are ranked.
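The six measures reduce to closed-form calculations on a teacher's capacity and consistency estimates. The following is a hedged sketch of that arithmetic (the function names and example values are mine): solving the SRF for the demand level at a fixed probability gives P25, P50, and P75, and evaluating the ECC at a fixed demand level gives D1, D2, and D3. The level() helper applies the cut-points that define the L1-L4 categories described below.

```python
import numpy as np

DEMAND_CUTS = [-1.5, 0.0, 1.5]    # cut-points used with the fixed-probability measures
PROB_CUTS = [0.25, 0.50, 0.75]    # cut-points used with the fixed-demand measures

def fixed_probability_measure(theta, a, p):
    """Demand level at which the ECC equals p: solving p = 1/(1 + exp(-a(theta - d)))
    for d gives d = theta - ln(p/(1-p))/a, so P50 reduces to theta itself."""
    return theta - np.log(p / (1.0 - p)) / a

def fixed_demand_measure(theta, a, d):
    """Height of the ECC at a fixed demand level d (D1: d = -1.5, D2: 0, D3: 1.5)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - d)))

def level(value, cuts):
    """Assign L1-L4 by locating a measure relative to its three cut-points."""
    return "L" + str(int(np.digitize(value, cuts)) + 1)

theta_e, a_e = 0.4, 1.9   # hypothetical capacity and consistency estimates for one teacher
p25 = fixed_probability_measure(theta_e, a_e, 0.25)   # equals theta + ln(3)/a
d1 = fixed_demand_measure(theta_e, a_e, -1.5)
print(p25, level(p25, DEMAND_CUTS))
print(d1, level(d1, PROB_CUTS))
```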
Teachers are then assigned categorical effectiveness levels in several ways. For each of the fixed-probability measures, teachers are placed into one of four categories that are separated using the three fixed demand levels as cut-points. Similarly, for each of the fixed-demand measures, teachers are placed into one of four categories that are separated using the three fixed probabilities as cut-points. Teachers are assigned to groups according to each of the six categorization schemes (also outlined in Table 3-4) for each of the three performance targets, yielding 18 total categorical assignments for each teacher in the sample.
Table 3-4. Definitions of effectiveness measures and levels derived from the SRF

Fixed-probability measures (each returns a demand level, d):
  P25: demand level at which the probability of reaching the target is .25; d = \theta - \ln(1/3)/a
  P50: demand level at which the probability of reaching the target is .50; d = \theta
  P75: demand level at which the probability of reaching the target is .75; d = \theta - \ln(3)/a
  Effectiveness levels (interpretations for P25 / P50 / P75):
    L1: d < -1.5        Low-demand student very likely / likely / somewhat likely to miss the target
    L2: -1.5 ≤ d < 0    Low-demand student somewhat likely / likely / very likely to hit the target
    L3: 0 ≤ d < 1.5     Average-demand student somewhat likely / likely / very likely to hit the target
    L4: d ≥ 1.5         High-demand student somewhat likely / likely / very likely to hit the target

Fixed-demand measures (each returns a probability, P):
  D1: probability if demand = -1.5; P = \exp[a(\theta + 1.5)] / (1 + \exp[a(\theta + 1.5)])
  D2: probability if demand = 0; P = \exp[a\theta] / (1 + \exp[a\theta])
  D3: probability if demand = 1.5; P = \exp[a(\theta - 1.5)] / (1 + \exp[a(\theta - 1.5)])
  Effectiveness levels (for the low- / average- / high-demand student, for D1 / D2 / D3):
    L1: P < 0.25        Unlikely to hit the target
    L2: 0.25 ≤ P < 0.5  Somewhat likely to hit the target
    L3: 0.5 ≤ P < 0.75  Likely to hit the target
    L4: P ≥ 0.75        Very likely to hit the target

Groups of teachers with equivalent or comparable groups of students were identified by conducting a cluster analysis of the heights of teachers’ CIFs at periodic locations along the instructional demand scale. CIF heights are computed at all integer values between and including -4 and 4. Similar to the manner in which test forms with matching TIFs are classified as parallel test forms, sets of classrooms with similar CIFs are considered reasonably equivalent comparison groups (or parallel classrooms). Using a hierarchical cluster analysis with Ward’s linkage, teachers are classified into comparison groups of teachers whose CIFs are most similar to theirs, indicating that the students they teach comprise equivalent levels of instructional demand. Statistical and practical properties of clustering solutions are both considered. The Duda-Hart index is one of the statistical properties considered. This index gives the ratio of the within-cluster sum of squared errors after partitioning the data to the sum of squared errors before partitioning the data (Duda & Hart, 1973). The corresponding pseudo T-squared statistic is a measure of cluster similarity that takes into account the number of observations (Halpin, 2016).
The preferred clustering solutions should correspond to increases in the Duda-Hart index and decreases in pseudo T-squared, relative to the solution with one fewer cluster. Practical considerations include the number and size of clusters in a solution. In order for teachers to be compared meaningfully within clusters, the number of teachers per cluster must be sufficiently large. However, the number of groups must also be adequately large to distinguish between types of classrooms in a meaningful way.
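A sketch of the parallel-classroom analysis just described is shown below. It assumes a matrix of CIF heights evaluated at the nine integer grid points (rows are teachers) and uses Ward's linkage from scipy; the data here are simulated placeholders, and the Duda-Hart and pseudo T-squared screening would be computed separately from the resulting hierarchy.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def parallel_classrooms(cif_heights, n_clusters):
    """Cluster teachers on their CIF heights (columns = CIF evaluated at
    capacity points -4, -3, ..., 4) using Ward's linkage."""
    tree = linkage(cif_heights, method="ward")
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# simulated CIF heights for 300 hypothetical teachers at the 9 grid points
rng = np.random.default_rng(1)
cif = np.abs(rng.normal(loc=2.0, scale=1.0, size=(300, 9)))
labels = parallel_classrooms(cif, n_clusters=6)
print(np.bincount(labels)[1:])   # number of teachers assigned to each cluster
```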
3.3 Results
3.3.1 Distributions of Characteristic Curves and Information Functions
All students in the sample have their own unique SCCs and E-SCCs, determined by their estimated instructional demand and either the mean consistency estimate across teachers for a particular performance target or the actual consistency estimate of their classroom teacher for that target. The distributions of these characteristic curves across the sample of students are shown in Figure 3-1, where the 5th, 25th, 50th, 75th, and 95th percentiles of SCC and E-SCC heights are shown at each location along the educator capacity scale. When teacher capacity is at either extreme end of the distribution, there is little variation in student probabilities. The distribution of student probabilities is fairly constant across performance targets, even though the targets vary in difficulty. This is because the instructional demand scale is fixed across targets. Differences in target difficulty, rather, are reflected in downward shifts of the capacity distribution for the higher targets. The SCC is a function of student instructional demand and the mean slope across the sample of teachers. Instructional demand is constant across performance targets, so the only source of differences in the distribution of SCCs across performance targets is differences in the distributions of teacher slopes for the different performance targets. Although there are slight differences in these mean slopes (as shown previously in Table 3-2), these do not translate to visually perceptible differences in the corresponding SCC distributions. The E-SCC is a function of student instructional demand (which is constant across performance targets) and the slope of a particular student’s teacher. Similar to the distributions of the SCC, any differences in the distributions of E-SCCs across performance targets must be attributed to differences in teacher slopes across targets. As was the case with the SCCs, no such differences in E-SCCs are evident from the figure.
Figure 3-1. Distributions of student and educator-student characteristic curves.
The distributions of the SIF and E-SIF across all students are shown in Figure 3-2. The SIF distributions are similar in shape across the three performance targets. However, the peak heights of the 95th percentile curves vary, reaching just below 1, approximately 1, and just above 1 for the low, middle, and high targets, respectively. This is a result of slightly larger slopes, on average, for the higher targets. While the shapes of the 5th, 25th, 50th, and 75th percentile curves resemble bell curves, the 95th percentile has a flatter distribution, as the maximum information for any given student is limited by the fixed slope parameter. Excluding the 95th percentile, the distributions of the E-SIF resemble bell curves as well and are similar across targets. While the 95th percentile for the low target is shaped similarly to these, the 95th percentiles for the middle and high targets show a comparatively lower level of information for higher-demand students, resulting in asymmetric curves. The distributions of the ECC, CIF, and E-CIF across all teachers are shown in Figure 3-3. The educator characteristic curve (ECC) differs most drastically across targets. While there is a high level of variation in probabilities across the entire range of observed student demand values for the low target, the ECC heights for the middle and high targets have very little variation. For the middle target, the median teacher has a probability close to zero for almost the entire range of demand values. While the 75th and 95th percentiles have probabilities above zero for a small range of values on the demand scale, probabilities are close to zero for all higher-demand students. For the highest target, even teachers at the 95th percentile have probabilities near zero across most of the instructional demand scale, and no teachers below the 75th percentile have a probability visually distinguishable from zero anywhere in the range of observed demand values. The CIF and E-CIF distributions are similar in shape across targets. The 95th percentile of the E-CIF peaks higher than the 95th percentile of the CIF, as the teacher slope is not fixed at its mean.
Figure 3-2. Distributions of student and educator-student information functions
Figure 3-3. Distributions of class and educator-class information functions and educator characteristic curves
The dashed lines on the CIF and E-CIF plots highlight the minimum level of information (approximately 2.5) required for a reliability coefficient of 0.9 when the distribution of latent capacity has a standard deviation of about 2.0. Across the three performance targets, the median E-CIF is above this threshold for demand levels between approximately -2 and 2. Capacity estimates are within this range for 61% of teachers in the sample for the low target, 14% of teachers for the middle target, and fewer than 1% of teachers for the high target. At the 75th percentile, the range of capacities above this threshold extends from approximately -2.5 to 2.5. Capacity estimates are within this range for 73%, 19%, and 1% of teachers for the low, middle, and high targets, respectively. At the 25th percentile, the range is only from about -0.5 to 1.5. Only 31% of teachers have capacity estimates within this range for the low target, 4% for the middle target, and fewer than 1% for the high target.
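The 2.5 threshold marked on these plots follows directly from the reliability relationship used earlier: with a latent capacity standard deviation of about 2.0, a reliability of 0.9 requires an error variance no larger than 0.4, that is, information of at least 1/0.4 = 2.5. A one-line check (my own illustration):

```python
def min_information(target_reliability, latent_sd):
    """Information needed so that 1 - (1/info)/latent_sd**2 >= target reliability."""
    return 1.0 / ((1.0 - target_reliability) * latent_sd**2)

print(min_information(0.9, 2.0))   # 2.5, the dashed threshold in Figure 3-3
```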
3.3.2 Comparing Alternate Measures of Effectiveness
Table 3-5 provides Spearman rank correlations among the different measures of effectiveness derived from the SRF. Rankings based on the same performance target but different probabilities or demand levels are very similar, with correlations above 0.90. These relationships are strongest for the same type of measure (i.e., both measures are based on a fixed probability or both are based on a fixed demand level), with correlations above 0.97. Correlations for measures with the same target but different types of rankings (one fixed demand and one fixed probability) are strongest for the low target, followed by the middle, then the highest target. Correlations among pairs of measures based on different targets are weakest when the targets are farther apart (i.e., low target and high target) and highest when both measures are based on fixed probabilities. Correlations of measures for different targets are similar in strength when both are based on fixed demand levels and when one is based on a fixed probability and the other on a fixed demand level.
Table 3-5. Spearman rank correlations of teacher effectiveness measures derived from the SRF
[Rows and columns are the six measures (P50, P25, P75, D1, D2, D3) within each of the low, middle, and high targets.]
P50 1 P25 .99 P75 .99 .99 D1 D2 .99 D3 .99 P50 .78 P25 .78 P75 .79 .74 D1 D2 .72 D3 .69 P50 .55 P25 .54 P75 .55 D1 .50 .48 D2 D3 .46 1 .98 .98 .98 .99 .78 .78 .79 .75 .73 .70 .54 .53 .54 .49 .48 .46 1 .99 1 .99 .97 .99 .97 .77 .77 .77 .76 .77 .78 .71 .73 .70 .71 .69 .68 .55 .54 .54 .55 .54 .56 .50 .48 .48 .48 .46 .47 1 .99 .76 .76 .77 .73 .71 .74 .53 .53 .53 .45 .47 .45 1 .75 .75 .76 .74 .72 .66 .51 .51 .51 .48 .46 .45 1 .99 .99 .98 .97 .95 .71 .71 .71 .68 .67 .66 1 .99 .99 .97 .96 .71 .70 .71 .68 .67 .66 1 .98 .96 .94 .72 .72 .72 .69 .67 .66 1 .99 .99 .67 .67 .67 .67 .64 .66 1 .99 .64 .65 .64 .65 .65 .64 1 .62 .62 .62 .65 .63 .63 1 .99 .99 .95 .93 .91 1 .99 .96 .94 .92 1 .95 1 .99 .93 .90 .99 1 .99 1
3.3.3 Parallel Classroom Analysis
Solutions from the hierarchical cluster analysis are shown in Table 3-6. The solutions associated with an increase in the Duda-Hart index and a decrease in pseudo T-squared relative to the solution with one fewer cluster were considered. Solutions with fewer than 3 clusters or fewer than 100 teachers per cluster were excluded from consideration, as these solutions do not adequately distinguish between types of classrooms or maintain a sufficient number of teachers per group for within-cluster comparison purposes. Only the 3-, 6-, and 8-cluster solutions meet all of the statistical and practical criteria discussed. Ultimately, the 6-cluster solution was selected because its pseudo T-squared is the smallest, indicating dissimilarity of clusters after taking into account the number of observations (Halpin, 2016). The dendrogram for this solution is shown in Figure 3-4.
Table 3-6. Hierarchical cluster analysis solutions

Number of clusters   Duda-Hart index   Pseudo T-squared
2                    0.618             860.67
3                    0.707             424.73
4                    0.627             430.65
5                    0.460             738.81
6                    0.626             219.51
7                    0.495             303.49
8                    0.681             239.89
9                    0.468             237.9
10                   0.572             118.21
11                   0.557             175.8
12                   0.565             185.9
13                   0.595             197.55
14                   0.578             150.94
15                   0.580             158.77

Figure 3-4. Dendrogram for CIF cluster analysis (6-cluster solution).
Table 3-7 provides characteristics of teachers and students in each cluster of classrooms. Clusters 1 through 4 have similar levels of variation in instructional demand within classrooms. Clusters 5 and 6 have somewhat less variation, as these classes are comprised mainly of special education students, who tend to have high demand levels. Estimates of capacity for teachers with different classroom types do vary to some extent. Teachers in Cluster 1, on average, have the highest estimates of capacity for all three performance targets. Teachers in Cluster 6, who primarily teach special education students and have unusually low student counts, have the lowest average capacity for the low target, but perform similarly to the remaining clusters for the middle and high targets. Slope parameters, on average, differ very little between clusters.
Table 3-7. Characteristics of students, teachers, and classes by cluster (standard deviations in parentheses)

Cluster                       1           2           3           4           5           6          All
Percent of teachers         18.2%       25.4%       10.4%       14.8%       12.0%       19.1%       100%
Proportion SPED           0.0 (0.0)   0.0 (0.0)   0.0 (0.0)   0.0 (0.0)   0.6 (0.5)   0.9 (0.3)   0.2 (0.4)
Number of students        29.6 (3.0)  28.1 (2.9)  23.6 (3.5)  21.5 (3.9)  9.0 (2.3)   2.9 (1.5)   19.8 (10.7)
Mean student demand       -0.8 (0.4)  0.2 (0.3)   0.7 (0.3)   -0.2 (0.3)  0.7 (0.8)   1.1 (0.7)   0.2 (0.8)
SD student demand         0.7 (0.2)   0.7 (0.2)   0.7 (0.2)   0.8 (0.2)   0.5 (0.2)   0.5 (0.4)   0.7 (0.3)
Capacity (low target)     1.5 (2.0)   0.1 (2.1)   -0.6 (1.9)  0.2 (1.9)   -1.0 (2.4)  -1.5 (1.5)  -0.1 (2.2)
Capacity (middle target)  -3.0 (2.4)  -4.7 (2.4)  -5.5 (1.9)  -4.6 (2.2)  -4.9 (2.0)  -5.3 (1.2)  -4.6 (2.2)
Capacity (high target)    -7.0 (2.6)  -8.6 (2.2)  -9.2 (1.4)  -8.7 (1.9)  -8.5 (1.8)  -8.8 (0.8)  -8.4 (2.0)
Slope (low target)        1.9 (0.4)   1.9 (0.4)   1.9 (0.4)   1.8 (0.4)   2.0 (0.3)   2.0 (0.2)   1.9 (0.3)
Slope (middle target)     2.0 (0.4)   2.0 (0.3)   2.0 (0.2)   2.0 (0.4)   2.1 (0.3)   2.1 (0.1)   2.0 (0.3)
Slope (high target)       2.1 (0.3)   2.1 (0.2)   2.1 (0.1)   2.1 (0.2)   2.1 (0.1)   2.1 (0.0)   2.1 (0.2)

The left side of Figure 3-5 shows the proportions of teachers at each percentile of the capacity scale represented by each classroom cluster. While there are some differences in cluster composition across capacity levels for all three targets, these differences become more prominent as performance targets become more difficult. Teachers in Cluster 6, in particular, are unlikely to fall outside of a restricted range of capacity scores, especially for the middle and high targets. Cluster 1, which consists of large, low-demand classes, has the most consistent distribution of rankings across targets of all the clusters. These teachers tend to rank towards the top of the capacity distribution for all three targets. The right side of Figure 3-5 provides scatterplots of within-cluster percentile rankings and overall percentile rankings of capacity parameters for each performance target. Within-cluster rankings are most similar to overall rankings for teachers in Clusters 2 and 4, although even the rankings for these groups become less similar as performance targets become more difficult.
[Figure 3-5 image: for the low, middle, and high targets, panels show the percent of teachers per capacity percentile by cluster (Clusters 1-6) and within-cluster capacity percentile versus overall capacity percentile.]
Figure 3-5. Relationships between cluster membership and capacity rankings
3.3.4 Comparing Alternate Rating Categories
Table 3-8 shows the percentages of teachers assigned to each performance rating level based on the classification schemes defined in Table 3-4, along with the breakdown of these percentages by cluster. For the high target, nearly every teacher is placed in the lowest category across all measures and clusters. The same is true for the middle target, although to a lesser extent. The overwhelming majority of teachers are placed in the bottom category in these cases, with teachers of Cluster 1 classes appearing in the higher categories at somewhat higher rates. For the low target measures, more substantial proportions of teachers receive ratings in each of the four categories. Teachers are split most evenly among categories for the measures based on fixed probabilities (P25, P50, P75), while the measures based on fixed demand levels (D1, D2, D3) tend to have much higher frequencies in either the top or bottom category.
The dominant category for these fixed demand measures varies by cluster. Because of the low level of variation in ratings within and between classification schemes for the middle and high targets, comparisons of ratings for the same individual teachers across measures focus only on the low target. Table 3-9 provides frequencies for all 14 observed combinations of ratings across these measures. About 37% of teachers receive the same rating on all six measures. 20% of teachers are in the bottom group on all measures while the other 17% are placed in the top group on all measures. No teachers receive uniform ratings in either of the middle categories. All teachers placed in the bottom category for P25 or for D1 are also in the bottom category on the other five measures; more than half of these teachers are from Clusters 5 and 6. All teachers placed in the top category for P75 or for D3 are also in the top category on the five other measures; nearly half of these teachers are from Cluster 1, and a quarter are from Cluster 2.
Table 3-8. Percent of teachers in each performance category (L1 to L4)
[Columns give the percent of teachers at levels L1-L4 for each measure (P25, P50, P75 in the first panel; D1, D2, D3 in the second); rows are clusters 1-6 and all, grouped by low, middle, and high target.]
P25 P50 P75 Cluster L1 L2 L3 L4 L1 L2 L3 L4 L1 L2 L3 L4 41 17 9 16 13 2 17 2 1 0 0 0 0 1 0 0 0 0 0 0 0 12 31 44 27 58 70 39 83 92 97 96 94 99 93 99 100 100 100 100 100 100 4 13 22 12 44 36 21 64 84 92 86 88 96 84 97 99 100 99 99 100 99 6 20 33 19 51 53 29 76 89 95 92 91 98 89 98 99 100 100 100 100 99 25 26 24 30 16 11 22 7 3 0 2 3 0 3 1 0 0 0 0 0 0 27 22 16 24 12 6 18 5 2 0 1 1 0 2 1 0 0 0 0 0 0 8 22 29 1 17 40 23 21 10 5 9 7 4 10 2 1 0 0 1 0 1 25 29 29 33 14 18 25 10 3 2 4 4 0 4 1 0 0 0 0 0 0 51 25 13 25 16 4 24 4 1 0 0 1 0 1 0 0 0 0 0 0 0 63 35 20 35 24 6 32 5 2 0 1 1 0 2 0 0 0 0 0 0 0 17 29 30 27 16 32 26 14 8 5 6 5 2 7 1 1 0 0 0 0 0 21 30 32 33 17 22 26 11 5 3 3 4 1 5 1 0 0 0 0 0 0 1 2 3 4 5 6 all 1 2 3 4 5 6 all 1 2 3 4 5 6 all
D1 D2 D3 Cluster L1 L2 L3 L4 L1 L2 L3 L4 L1 L2 L3 L4 41 17 9 16 13 2 17 2 1 0 0 0 0 1 0 0 0 0 0 0 0 12 35 51 32 61 77 43 85 95 98 95 95 100 94 99 100 100 100 100 100 100 4 13 22 12 44 36 21 64 84 92 86 88 96 84 97 99 100 99 99 100 99 37 65 80 65 76 94 68 95 98 100 99 99 100 98 100 100 100 100 100 100 100 9 13 12 15 8 6 11 4 2 0 1 2 0 2 1 0 0 0 0 0 0 11 9 4 9 4 1 7 2 1 0 0 0 0 1 0 0 0 0 0 0 0 2 6 11 6 7 17 8 11 4 2 6 3 2 5 1 0 0 0 0 0 0 6 12 10 8 7 17 10 7 4 2 4 3 1 4 1 0 0 0 0 0 0 88 69 56 73 42 30 61 17 8 3 4 6 1 7 1 0 0 0 0 0 0 11 13 13 14 6 9 1 5 1 2 2 1 0 2 0 0 0 0 0 0 0 67 39 24 40 25 8 35 6 2 0 1 2 0 2 1 0 0 0 0 0 0 12 10 7 11 8 2 8 1 1 0 0 1 0 1 0 0 0 0 0 0 0 1 2 3 4 5 6 all 1 2 3 4 5 6 all 1 2 3 4 5 6 all
Low target Middle target High target Low target Middle target High target
Table 3-9. Teacher classification patterns and their frequencies for low-target measures.
For the remaining 63% of teachers, ratings vary across the six categorization schemes. About 15% of teachers receive ratings in two different categories; in all of these cases, the two categories are consecutive (either the two bottom categories or the two top categories). 27% are rated in three categories, and for 23% these are three consecutive categories. For the other 4%, five of the six ratings are in the top two categories and the other is in the bottom category. This pattern is observed most frequently among Cluster 1 teachers. 20% are rated in each of the four categories across the six classification schemes. These inconsistent classification patterns are observed most frequently among Cluster 2 teachers. Because each rating is a function of the teacher's SRT parameters, descriptive statistics for the capacity and slope estimates of teachers with each rating pattern are given in Table 3-10 for reference. Uniform classification patterns are associated with very high or very low capacity estimates. Inconsistent classification patterns are associated with smaller slope estimates and capacity estimates that fall between the demand levels used as cut-points.

Table 3-10. Descriptive statistics for low-target SRT parameters by rating pattern (mean, standard deviation, minimum, and maximum of the estimated capacity and estimated slope for each observed combination of ratings across P25, P50, P75, D1, D2, and D3).

3.4 Discussion
These results repeatedly demonstrate that estimates for and conclusions about teachers vary depending on the student performance target selected. Different peak information levels across targets suggest that the predictive relationship between instructional demand and target attainment is stronger for the higher targets; this is consistent with findings from the model fit analysis in Chapter 2.
However, the educator characteristic curves (ECCs) shown in Figure 3-3 affirm that very few students are expected to reach the high target even with exceptional teachers. The same is true of the middle performance target, although to a lesser extent. The distribution of ECCs shows little variation in the probability of attaining the middle target for students at most instructional demand levels. Because attainment of these higher targets is both rare and strongly related to observable characteristics of students, they may not be appropriate benchmarks against which to evaluate teachers. Similarly, categorization schemes for the middle and high target measures place the overwhelming majority of teachers into the lowest performance level. These classifications could be useful for identifying a very small number of higher-performing teachers but would be less helpful for discriminating among other types of teachers.

The categorization schemes for the low target measures do a better job of discriminating among the full sample of teachers. Four of the categorization schemes for the low target identify either the top or bottom group of teachers in perfect agreement with the remaining five measures. The P25 and D1 measures are strong choices for confidently identifying a low-performing group of teachers, as this group has the lowest probabilities of success with the least-demanding students. Similarly, P75 and D3 are strong choices for identifying a high-performing group of teachers, as those identified will have the highest probability of success with the most demanding students. However, the vast majority of teachers are placed in different categories across measures. This stresses that, aside from those who perform extremely well or extremely poorly, a combination of measures may offer a more nuanced description of educator performance than any single measure on its own.

The clusters of parallel classrooms highlight differences not only in the groups of students assigned to different teachers, but also in the properties of capacity estimates for these teachers. For instance, Cluster 6 teachers work with very small groups of students who are typically in special education programs. Considering the low student counts and high demand levels, using this information to assess the effectiveness of these teachers is equivalent to administering a test that is very short and very difficult. Figure 3-5 highlights that, although these teachers' estimated capacities for the low target span the entire scale, they are disproportionately concentrated within much smaller ranges of the capacity scale for the higher targets. This is likely more indicative of an assessment that does not adequately discriminate between the capacities of these teachers than it is of a lack of variation in their true capacities. In contrast, Cluster 1 teachers, who have the most students and lowest demand levels, have the most consistent distribution of results across targets. This is equivalent to an assessment of effectiveness that is much longer and easier than the assessment for teachers in Cluster 6. Relationships between classification patterns and cluster membership are evident in Table 3-9. A greater proportion of Cluster 5 teachers receive uniform classifications than any other cluster, with 57% falling in either the top or bottom category on all six measures. Cluster 1 teachers follow behind this group with 45% receiving uniform ratings.
While sorting and assignment patterns could play a role in these differences, the main attributes that define each cluster (student counts and the distribution of instructional demand) also influence the precision with which capacity is estimated. Cluster 1 teachers have the most students and lowest demand levels. The "longer" test of effectiveness allows for greater precision, which could possibly result in more consistent ratings. The "easier" test could result in a ceiling effect. This would explain the lack of variation in ratings across measures that have inherently different meanings, and a higher target measure may be more appropriate for evaluating this particular group of teachers. Cluster 5 teachers, on the other hand, are likely to receive uniform ratings (typically in the lowest category) despite having fewer students and higher demand levels. While this cannot be a result of higher precision due to test length, it is possible that Cluster 5 teachers are experiencing a floor effect. If this is the case, this implies that even the lowest target is inappropriately difficult for most students in Cluster 5 classrooms.

In general, the different measures vary more in their meanings and interpretations than they do in quality. Thus, rather than pointing towards a recommended measure or subset of measures for reporting purposes, these results affirm that SRT can offer a degree of flexibility for districts or administrators to select measures that align with their objectives and are relevant for a particular audience. The criterion-referenced nature of SRT measures connects these estimates to concepts that are already familiar to many stakeholders. The section that follows uses sample reporting materials to illustrate a few possible ways this information can be framed, presented, and interpreted.

3.4.1 Contextualizing the Instructional Demand Scale
While the student instructional demand index (SIDI) estimated in Chapter 2 is a standardized measure, it has been rescaled for simplicity and user-friendliness for reporting purposes. The rescaled SIDI was computed by adding 4.5 to the SIDI, so that all observed demand levels take on positive values. The scale bar in Figure 3-6 divides the scale into one-unit intervals that are each labeled with a consecutive integer that corresponds to the midpoint of the interval (i.e., the interval labeled "1" spans from 0.5 to 1.5 on the rescaled SIDI). Typical characteristics of students in each interval are indicated below the scale bar, providing a framework to associate student characteristics with numbers or colors on the scale. For example, students within the fourth interval (whose rescaled demand estimates are between 3.5 and 4.5) tend to have average marks for effort, earn B's in their courses, reach proficiency, not be eligible for gifted or special education programs, and have average attendance.

Figure 3-6. Mapping of instructional demand scale to typical student characteristics (course effort, course grades, performance level, program eligibility, and attendance for each interval of the rescaled demand scale).

The same scale bar appears alongside several other sample reporting materials to connect the information presented with types of students and instructional challenges. Figure 3-7, for instance, illustrates the distributions of instructional demand across the entire district and three randomly-selected sample schools in reference to the scale bar.
Sections of the distribution are shaded according to the percentage of students within the corresponding instructional demand interval. The distributions shown in Figure 3-7 indicate that each of the sample schools has lower-demand students, on average, than the district as a whole. This provides a context for interpreting a report.

Figure 3-7. Distribution of instructional demand across district and sample schools.

3.4.2 Simplifying the ECC
Figure 3-8 combines this information with results for three sample teachers (one from each sample school) from the SRT analysis. On the left-hand side, the distribution of instructional demand for a particular teacher is shown alongside the distributions for the school, all students in the same classroom cluster (the term "comparison group" is used in place of cluster because it is more familiar to many audiences), and all students in the district. The right-hand side condenses the educator characteristic curve (ECC) into a simpler format, while also tying this information to the color-coded scale bar. Symbols are used to indicate how likely different types of students are to reach each performance benchmark with the teacher, according to the height of the teacher's ECC at the midpoints of each interval on the demand scale. For instance, students with demand levels greater than or equal to 6 are "very unlikely" to reach the high target with Teacher C; however, the distribution of demand for Teacher C reveals that there are no students at this level in Teacher C's class. Nearly all of Teacher C's actual students are "likely" or "very likely" to reach all of the performance targets. The report does not explicitly show the capacity or slope estimate for any teacher, but this information is still communicated in the report. The location where probabilities change from the "unlikely" categories to the "likely" categories indicates the location of the capacity estimate, and the abruptness of the transitions between probability categories communicates the magnitude of the slope.

Figure 3-8. Sample reports for one teacher from each sample school (key: 0.00-0.05 very unlikely, 0.05-0.25 unlikely, 0.25-0.50 somewhat unlikely, 0.50-0.75 somewhat likely, 0.75-0.95 likely, 0.95-1.00 very likely).
For instance, Teacher A's capacity for the low target is between 5 and 6, as the probability category changes from a single plus symbol ("somewhat likely") in the 5th interval to a single minus symbol ("somewhat unlikely") in the 6th interval. Teacher C's probability for the high target changes more abruptly than this: there is a change from two plus symbols ("likely") in the 4th interval to two minus symbols ("unlikely") in the 5th interval, indicating that the capacity parameter is between 4 and 5, and that the slope is larger than Teacher A's low-target slope. Because the parameter values themselves may not be inherently meaningful to many stakeholders, a simplified form of the ECC can communicate the relevant concepts in terms of probabilities, performance benchmarks, and types of students.

3.4.3 Describing Teacher Performance in Terms of Demand
Figure 3-9 summarizes the performance of all teachers of the same grade level in each of the sample schools. The scale bars on the left compare the distributions of demand across these teachers and are supplemented by information about the mean demand, size, and type (or cluster) of each class. Differences in the distribution of demand for teachers in the same school suggest that assignment practices vary across these three locations. For instance, the overall distributions of demand in School Y and School Z are quite similar, but School Y places students with similar demand levels in the same classrooms while School Z distributes instructional demand equitably across teachers, with each classroom distribution mirroring the schoolwide distribution. In School X, classrooms vary in size, mean demand, and variation in demand. The right-hand side provides information about the capacity estimates for each teacher. The number reported is the largest integer value on the demand scale for which a teacher has a probability above 0.5 for a particular target. This frames the estimate for a teacher in terms of the types of students (with respect to instructional demand) that are likely to reach a particular benchmark in their class. Many capacity estimates, particularly those for the highest target, are outside the range of instructional demand values that are observed for students in the district. In these cases, an asterisk (*) is reported in place of a demand level, as there is no realistic level of instructional demand that corresponds to a probability of student success above 0.5. For example, no teacher in School Y has a capacity estimate within the range of observed demand values for the middle or high target.

Figure 3-9. Sample reports for all teachers in the sample schools (for each teacher: class type, mean demand, class size, and the highest level of instructional demand likely to reach each target; an "Insufficient data" flag is shown where estimates are unavailable).
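To make the construction of these report elements concrete, the sketch below computes the likelihood category at each interval midpoint and the highest integer demand level at which the probability of success exceeds 0.5 for a single teacher. It assumes a 2PL-style ECC, and the capacity and slope values shown are hypothetical; this is an illustration of the reporting logic, not the code used to produce the figures.

```python
import math

# Probability that a student at demand d reaches the target with a teacher
# whose estimated capacity is theta and slope is a (2PL-style ECC).
def ecc(theta: float, a: float, d: float) -> float:
    return 1.0 / (1.0 + math.exp(-a * (theta - d)))

# Likelihood categories from the report key in Figure 3-8.
def likelihood_category(p: float) -> str:
    cuts = [(0.05, "very unlikely"), (0.25, "unlikely"), (0.50, "somewhat unlikely"),
            (0.75, "somewhat likely"), (0.95, "likely"), (1.01, "very likely")]
    return next(label for cut, label in cuts if p < cut)

# Hypothetical estimates for one teacher on the rescaled demand scale.
theta_hat, a_hat = 5.4, 1.6          # assumed capacity and slope, for illustration only
midpoints = range(1, 9)              # interval midpoints on the rescaled SIDI (1 through 8)

for d in midpoints:
    p = ecc(theta_hat, a_hat, d)
    print(d, round(p, 2), likelihood_category(p))

# Highest integer demand level at which students are more likely than not to succeed;
# report "*" when the capacity falls outside the observed demand range.
reachable = [d for d in midpoints if ecc(theta_hat, a_hat, d) > 0.5]
print(max(reachable) if reachable else "*")
```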
3.4.4 Referencing Estimates to Populations of Teachers and Students
While norm-referenced measures that describe a teacher's relative position in the distribution are commonly featured in evaluation systems, SRT scores can also be presented in reference to distributions of students. Each estimate of educator capacity corresponds to a location on the student instructional demand scale. The percentage of students with demand levels below a teacher's capacity estimate indicates how many students would have probabilities of reaching a performance target with this teacher that are above 0.5. By reporting this percentage, teachers' capacities are framed in terms of how many students would, theoretically, be more likely than not to reach a particular performance target if placed in their class. Table 3-11 shows two versions of this metric: the first is the percentage of all students in the district with probabilities above 0.5 for the teacher, and the second is the percentage of all students in a particular school with probabilities above 0.5 for the teacher. While the first percentage is useful for comparisons of teachers across the entire district relative to the student population, the second is informative about how well a teacher meets the needs of students in their own school. The table also includes district and school norms as comparison points, further contextualizing the performance of the teacher relative to other teachers district-wide and in the same school. For example, only 33% of students in the district have probabilities above 0.5 of reaching the lowest performance target with Teacher B, meaning that 33% of students in the district have instructional demand estimates that are less than the capacity estimate for this teacher. In comparison, 45% of students in the district have probabilities above 0.5 of reaching this same target with a teacher whose capacity is equal to the district average. However, 60% of the students in Teacher B's school have probabilities above 0.5 with Teacher B, compared to only 32% of these students with a teacher whose capacity is equal to the school average. Although the capacity of Teacher B, in reference to the districtwide distributions of capacity and demand, may suggest poor performance, Teacher B is expected to perform well with the majority of students in the school where he/she teaches, and is expected to perform significantly better than the average teacher in that school.
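A minimal sketch of the district- and school-referenced percentages described above, using invented demand and capacity estimates: under the 2PL ECC, a student is more likely than not to reach the target exactly when the student's demand estimate falls below the teacher's capacity estimate.

```python
# Percentage of students who would be more likely than not to reach a target with a
# given teacher: under the 2PL ECC, P > 0.5 exactly when the student's demand is
# below the teacher's capacity. All values below are hypothetical.
def pct_more_likely_than_not(capacity: float, demands: list[float]) -> float:
    return 100.0 * sum(d < capacity for d in demands) / len(demands)

district_demands = [2.1, 3.4, 3.9, 4.6, 5.2, 5.8, 6.3]   # assumed district-wide estimates
school_demands   = [2.1, 2.8, 3.0, 3.5]                   # assumed estimates for one school
teacher_capacity = 3.7                                    # assumed capacity estimate

print(pct_more_likely_than_not(teacher_capacity, district_demands))  # district-referenced
print(pct_more_likely_than_not(teacher_capacity, school_demands))    # school-referenced
```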
Table 3-11. Sample teacher-level reports: capacity relative to student demand distribution (for Teachers A, B, and C: the percentage of students in the district and in the teacher's own school who are more likely than not to reach each target with the teacher, with the average teacher in the district, and with the average teacher in the school).

Both Figure 3-9 and Table 3-11 present information about the capacity estimate for a teacher, which is equivalent to the location on the demand scale corresponding to a probability of 0.5 (referred to as P50 in Sections 3.2 and 3.3). However, equivalent versions of these reports could be developed for any measure derived from the ECC. The SRT framework affords flexibility to administrators or policymakers to determine what information is most relevant to their priorities and their context. Relevant measures may include, but are not necessarily limited to, probabilities associated with each teacher at a fixed demand level that corresponds to specific student attributes, demand levels associated with a probability above an agreed-upon threshold, or percentages of students expected to achieve a particular outcome with each teacher.

3.5 Conclusions
Item-response functions, item and test characteristic curves, and item and test information functions serve a variety of purposes in testing contexts. Analogous forms of these functions also offer meaningful contributions to the context of educator evaluation with Student Response Theory. Some of these contributions apply concepts or procedures from IRT directly to the SRT framework; for instance, matching information functions to identify parallel test forms. While fundamental differences between how tests are constructed and how students are assigned to classrooms require different procedures to be used, the underlying concept of identifying an equivalent assessment is the same. Comparisons of SRT information functions highlighted some concerns about differences in the properties of classrooms across teachers, and the implications of these differences for estimating the capacity of a teacher. Differences in the number of students assigned to different teachers equate to varying lengths and precision levels of the assessments for different teachers. Differences in the distribution of demand across classrooms equate to varying difficulty levels of assessments for different teachers. While varying lengths and difficulty levels are common in some areas of IRT, particularly in adaptive tests, equal precision and fairness are valid concerns if comparisons are to be made across the entire population of teachers.
However, by identifying groups of teachers with reasonably-equivalent information functions, many of these concerns can be alleviated. Other applications are unique, arising from key differences and incompatibilities between IRT and SRT. In particular, the teacher-level slope results in the equivalent of an IRT model with only one item-level parameter and two examinee-level parameters. This key difference raises concerns about whether some of the technology that has been developed and studied extensively in IRT contexts is truly compatible with the SRT framework. However, this difference also introduces new possibilities in an SRT analysis that do not have a direct parallel in IRT. The additional teacher-level parameter that relates properties of educators to the instructional demand levels of their students is especially useful for deriving alternate performance measures from the same SRT models. These alternate measures allow for flexibility to choose those that are most useful and meaningful for the specific purposes of a district, state, or program.

The different measures derived from the SRF for the same performance target are quite similar, but measures are much less consistent across targets. Some of this can be explained by the difficulty level of the middle and high targets; very few students reach either of these performance benchmarks regardless of their instructional demand levels and the capacities of their teachers. As a result, these targets are often not informative about an educator's performance. The measures derived from the low target are more appropriate for the vast majority of students and teachers. While the different low-target measures are highly correlated and rank teachers consistently, there are slight differences in how they are defined and interpreted. When combined, these different measures offer a more nuanced summary of educator performance than is possible with any single measure.

The criterion-referenced nature of SRT also contributes to the flexibility in report design; results can be framed in a variety of ways that are connected directly to stakeholders' knowledge of students and teaching. While the sample reports illustrate some of the ways this information may be presented in practice, these are only preliminary suggestions. Additional work is needed, perhaps with focus groups composed of different stakeholders, to determine the best ways to develop reports that empower stakeholders to understand the results and their implications and take appropriate actions that improve future outcomes for students and educators. With proper development, SRT could be a useful tool for presenting objective information about teacher performance with direct connections to the varying levels of instructional demand posed to teachers, properties of a specific population of students, and concrete performance standards for both teachers and students.

CHAPTER 4. CONTRIBUTIONS OF SRT TO FORMATIVE EVALUATION
4.1 Introduction
4.1.1 Summative and Formative Evaluations of Teachers
Summative and formative assessments are both important components in evaluations of student learning. Summative assessments typically occur at the end of a unit or course to evaluate whether a particular goal has been met, while formative assessments occur on an ongoing basis to evaluate student needs and adjust future instruction accordingly.
In evaluations of teaching performance, value-added models (VAMs) and similar statistical growth models have typically played a summative role, with ratings assigned at the conclusion of a school year and associated with consequences for teachers. In a formative teacher evaluation framework, schools or districts use evaluative information about teaching performance to guide decisions about assignment, professional development, and resource allocation. This occurs through an ongoing process in order to best support the needs of teachers and evaluate whether these decisions have the desired outcomes in future evaluations.

Student Response Theory (SRT) applies the underlying methodology of item response theory (IRT) to an assessment of educator performance, offering an alternative to VAMs and other statistical growth models. In IRT, the probability of an examinee responding correctly to a test item is predicted as a function of the latent ability of the examinee and properties of the test item (Lord, 2012). SRT frames students as test items that assess the performance of their educators. The probability that a student will reach a pre-defined performance target with a particular teacher (equivalent to the probability of a correct item response in IRT) is predicted as a function of the latent capacity of the teacher and properties of the student (Reckase & Martineau, 2014). The 2-parameter logistic (2PL) models in IRT and SRT define similar location parameters. In IRT, this parameter indicates the difficulty level of an item, while in SRT, it indicates the level of instructional demand posed by a student to an educator. Instructional demand is a construct that describes the difficulty level associated with helping a particular student reach a performance target.

SRT may be better equipped to play a formative role in the teacher evaluation process than commonly-used statistical growth models like VAMs. Due to the criterion-referenced nature of SRT scales, the information produced by these models relates directly to characteristics of students and teachers and performance standards for students and teachers. This forms a clearer link between the content of a report and instructional or administrative practices. As a result, the information produced by an SRT model could potentially provide educators and administrators with actionable feedback about their practices. With norm-referenced VAMs and similar models, changes in teachers' ratings over time typically only denote changes in their relative ranking within a distribution of teachers. IRT equating procedures may allow for a consistent longitudinal scale to be established for evaluating teachers within the SRT framework. This type of scale introduces the possibility of measuring absolute growth over time, both for individual teachers and for the distribution of teachers as a whole. The combination of the consistent longitudinal scale and the probabilistic nature of the model also provides a means for administrators to predict outcomes of different types of students with different types of teachers, consider these predictions in decisions about assignment and resource allocation in an upcoming school year, and monitor changes over time in meaningful ways.
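As a point of reference for the notation used later in this chapter, the response function described above can be written in its 2PL form as follows; this is a sketch using simplified subscripts (teacher t, student s), and the exact parameterization of the estimated models follows Chapter 2:

P(X_s = 1 \mid \theta_t, a_t, d_s) = \dfrac{\exp\{a_t(\theta_t - d_s)\}}{1 + \exp\{a_t(\theta_t - d_s)\}}

where \theta_t is the latent capacity of teacher t, a_t is the teacher-level slope, and d_s is the instructional demand of student s.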
4.1.2 Differential Item and Student Functioning
In IRT, differential item functioning (DIF) occurs when the probability of a correct response is different for examinees with the same latent ability who differ on a characteristic that is irrelevant to the construct being measured (Holland & Wainer, 2012). The purpose of a DIF analysis is to improve the quality of a test by revising or removing items that exhibit bias. The analog to DIF in SRT is differential student functioning (DSF). While a DIF analysis explores whether items function differently for certain types of examinees, a DSF analysis investigates whether students perform differently than expected in certain types of classrooms. Tests of DSF would investigate whether students placed in classrooms or schools or with teachers with certain characteristics tend to perform differently than predicted given their levels of instructional demand. However, a DSF analysis would likely serve different purposes and warrant different actions than a typical DIF analysis due to fundamental differences between the IRT and SRT contexts. Revising or removing students when DSF is detected is neither a reasonable nor a desirable response. Rather, DSF detection would ideally prompt changes to instructional or administrative practices to address the underlying problem that resulted in DSF. The primary purpose of a DSF analysis would be to provide diagnostic information about subgroups to target for intervention. In an IRT framework, this is equivalent to responding to DIF by training subgroups of examinees to respond to an offending item differently rather than removing or revising the item.

DSF analysis could also provide diagnostic information about the student instructional demand index (SIDI) and SRT model with respect to underlying model assumptions. For instance, misspecification of the SIDI is a threat to the unidimensionality assumption (Ham, 2014). If a DSF test indicates that certain characteristics are associated with different model performance, this implies that the SIDI does not appropriately account for those characteristics. Similarly, DSF for an aggregate classroom-level characteristic would raise concerns about the conditional independence assumption. Ham (2014) finds evidence of conditional dependence in an SRT analysis and discusses both misspecification of the SIDI and peer effects as possible explanations. A DSF analysis may offer additional insight into the source, magnitude, and practical significance of conditional dependence among students in the same class. While this particular purpose does not play a direct role in formative evaluation, it is a prerequisite for establishing whether the conditions required for equating are met. Many of the proposed uses of SRT in formative evaluation, such as measuring growth of educators and predicting future performance of students with different potential teachers, hinge on the ability to equate SRT scales over time.

4.1.3 Establishing a Consistent Scale
Test equating procedures in IRT allow for scales to be linked across different test forms by placing the parameters from each test form onto a common scale. With common-item equating, a subset of items that appear on two different test forms operate as anchors that define this scale (Ryan & Brockmann, 2009).
Some items may appear on both forms but relate differently to the underlying construct for the groups of examinees taking each form or across testing occasions due to differences or changes in curriculum, item exposure, or context. For this reason, the relationships between parameters of common items are first analyzed before selecting a set of anchors. Through this process, items that exhibit evidence of parameter drift are identified and excluded as anchors (Dorans et al., 2010). Because norm-referenced statistical growth measures do not have fixed scales across years, multi-year comparisons can only focus on changes in the relative standing of teachers, rather than on the absolute growth of teachers. The same process can be extended to SRT instructional demand indices from different years or grade levels in order to assess longitudinal growth and compare educators of different grade levels using a consistent scale. An analysis of SIDI indicator parameters from different years and grade levels can identify the extent of variation in the relationships between indicators and instructional demand over time and across contexts. Then, the indicators that are consistent across these conditions can be used as a basis for building a common scale. This allows for the effectiveness of individual educators to be analyzed over multiple years and for rates of educator growth to be calculated. Relationships between classroom or school-level factors and growth rates of early career teachers may shed light on how administrators can support them more effectively.

4.1.4 Optimizing the Assignment Process
When IRT item parameters are already known, this information is sometimes used to select the best item(s) for a particular purpose. Sometimes this purpose is to construct a new form of a test such that the test information function (TIF) matches that of an existing form as closely as possible (Samejima, 1977a). In an adaptive testing context, the purpose is to select the most informative item for a particular examinee, given their responses to previous items (Cella et al., 2007). These decisions are typically subject to certain constraints, such as item exposure and content balance. These processes parallel the assignment of students to teachers in an SRT framework. Although the objectives and constraints are likely to differ tremendously across these two contexts, similar optimization procedures may still be applicable and useful in different ways within the SRT framework. Probabilistic outcomes from an SRT model for students with all potential teachers could be useful for administrators in determining which students and teachers to place with one another in an upcoming year. Optimal assignments can be configured in countless ways, affording flexibility to administrators to determine how different factors are prioritized. For instance, assignments can be made in a way that maximizes the number of students expected to meet a proficiency standard in a particular subject, optimally matches student demand levels with teacher capacities, or distributes instructional demand across teachers as equitably as possible, subject to constraints like class size restrictions, groups of students who must be placed either together or separately, or avoidance of tracking practices where students of similar ability levels are placed together.
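The following sketch illustrates the kind of constrained search described above, using fabricated probabilities, a simple random-mutation (evolutionary-style) improvement loop, and hypothetical class-size limits; it is not the optimization routine used in this study.

```python
import random

# probs[s][t]: predicted probability that student s reaches the target with teacher t
# (hypothetical values); the goal is to maximize the total expected number of students
# reaching the target, subject to minimum and maximum class sizes.
random.seed(0)
n_students, n_teachers = 30, 3
probs = [[random.random() for _ in range(n_teachers)] for _ in range(n_students)]
MIN_SIZE, MAX_SIZE = 8, 12

def class_sizes(assign):
    sizes = [0] * n_teachers
    for t in assign:
        sizes[t] += 1
    return sizes

def feasible(assign):
    return all(MIN_SIZE <= c <= MAX_SIZE for c in class_sizes(assign))

def objective(assign):
    return sum(probs[s][t] for s, t in enumerate(assign))

# Start from a feasible round-robin assignment and apply random mutations,
# keeping a mutated assignment only if it is feasible and improves the objective.
assign = [s % n_teachers for s in range(n_students)]
for _ in range(20000):
    candidate = assign[:]
    s = random.randrange(n_students)
    candidate[s] = random.randrange(n_teachers)   # mutate one student's placement
    if feasible(candidate) and objective(candidate) > objective(assign):
        assign = candidate

print(round(objective(assign), 2), class_sizes(assign))
```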
By comparing the assignments determined through traditional processes to assignments that optimize a particular objective function, implications of existing practices on expected outcomes may surface. Comparisons between actual assignments and different types of optimized assignments may also serve as feedback for administrators about inefficiencies or implications of their existing assignment practices. Prior studies find that schools vary widely in the relative degrees of influence from principals, teachers, and parents in assignment decisions (Monk, 1987; Paufler & Amrein-Beardsley, 2014) and that teachers with more prominent positions in formal leadership and informal advice networks tend to be assigned higher-achieving students (Kim, Frank, & Spillane, 2018).

4.1.5 Significance
While the use of statistical growth models in teacher evaluation is widely criticized, the need for objective performance measures is far less disputed than the manner in which these measures are used. A model that offers evaluative information in a formative context, helping administrators to better support the needs of teachers and students, may garner significantly more support within the field than one that simply focuses on making summative judgements and determining high-stakes personnel decisions (Goe et al., 2017). This study explores possible contributions of SRT within a formative evaluation framework. Procedures extended from traditional IRT applications offer potential tools for diagnosing problems in an educational system, monitoring changes over time, and making informed decisions to improve future outcomes for both students and teachers.

4.2 Data and Methodology
4.2.1 Data and Measures
The study draws on data from an anonymous large school district in a major U.S. city for 5th grade teachers and their students during the 2015-2016 and 2016-2017 school years. SRT measures of educator capacity and consistency for both math and ELA were estimated according to the procedures outlined in Chapter 2. Estimates for the low performance target are the primary focus in this study, as findings from Chapter 3 suggest that these measures are most informative about the population of students and teachers in this district. The sample of students is restricted to those with data available from the previous school year, as this information is used to compute the student instructional demand index (SIDI). The SIDI is estimated using the IRT calibration method and the restricted set of instructional demand indicators according to the procedures outlined in Chapter 2. In order to compare estimates for the same teachers with the same students across subject areas, 5th grade students with different teachers for math than for ELA (approximately 3% of all students each year) and teachers with none of the same students for both subjects (approximately 3% of all teachers each year) are excluded from the sample. For analyses of teacher growth, the sample is further restricted to teachers who appear in the data as 5th grade teachers in both school years (69% of all teachers) and students placed with these teachers (74% of all students). Table 4-1 describes the full samples of students and teachers, the restricted sample used in cross-sectional analyses, and the further restricted sample used for longitudinal analyses. The rate of special education eligibility is slightly higher for the first cohort than the second.
This corresponds to a slightly lower ratio of students to teachers, as special education and inclusion classes tend to be smaller than general education classes. For both cohorts, the mean of the instructional demand distribution is approximately zero for the full sample and decreases as more restrictions are imposed. This indicates that the students excluded from the sample tend to have higher instructional demand levels than those that are retained. Special education students, who generally have high instructional demand levels, likely contribute to this pattern. For instance, students with specific learning disabilities may be more likely to work with someone other than their primary classroom teacher in either math or ELA. Because the special education populations vary in size across the two cohorts, special education teachers may be more likely to move between grade levels between years, and therefore less likely to meet the criteria for inclusion in the growth analyses. Potential implications of these differences on the quality of information about teachers of high-demand and special education students are considered throughout the study.

Table 4-1. Full and restricted samples of teachers and students (for each cohort and each set of sample restrictions: the percentages of students and teachers included, the student-teacher ratio, the percent eligible for special education services, and the mean and standard deviation of instructional demand).

4.2.2 Differential Student Functioning
Tests of DSF are conducted using the Mantel-Haenszel (MH) DIF index (Mantel & Haenszel, 1959; Holland & Thayer, 1988). The MH DIF statistic is computed from a set of 2x2 contingency tables, with a separate table for each of j latent ability levels, containing the counts of correct and incorrect responses to a particular item i for members of the focal and reference groups. Applying these same procedures for a DSF analysis, there is a separate 2x2 table for each of t latent capacity levels that contains counts of students at an instructional demand level d who did and did not meet a particular performance target in the focal and reference groups (the SRT adaptation of the Mantel-Haenszel DIF table is shown in Table 4-2). In a DIF analysis, the values in these contingency tables represent responses to the same item across different examinees. In the DSF context, however, each student is only observed with one teacher. In this case, the counts in each cell of the contingency table instead reflect groups of students with approximately equal levels of instructional demand.

Table 4-2. Mantel-Haenszel DSF 2x2 contingency table: the reference group contributes counts A_st (student reached target) and B_st (student did not reach target), and the focal group contributes counts C_st and D_st, respectively.

The distribution of student instructional demand estimates, shown in Figure 4-1, is slightly bimodal. Each peak represents a commonly observed instructional demand level among the sample of students, one slightly above and one slightly below the mean. Students with instructional demand levels within 0.15 of the lower peak are grouped together for one set of DSF analyses, and students within 0.15 of the upper peak are grouped together for another set.
These groups of students provide insight about the performance of educators with slightly below-average and slightly above-average demand levels, across several key characteristics of teachers and classrooms. For the purpose of the DSF tests, each of these groups of students operates like a single test item administered to many examinees, rather than a set of similar items that are each administered to one examinee. These groups were selected because their high frequencies allow for more teachers to be included in the DSF tests. 76% of all teachers in the sample have at least one student in at least one of the two demand intervals, and 55% have at least one student in each of the two demand intervals.

Figure 4-1. Distribution of instructional demand with shaded areas in DSF intervals.

In Chapter 3, the instructional demand and educator capacity scales were divided into eight one-unit intervals for reporting purposes. These same eight intervals (which each span approximately half of a standard deviation of the capacity distribution) are used as the latent capacity levels for the DSF analysis. The MH common odds ratio, shown in Equation 4-1, is computed using the counts in each of the contingency tables across the eight capacity levels. This ratio is then transformed into a "Mantel-Haenszel delta difference" (Dorans & Holland, 1993), which is typically abbreviated as MH DIF but shown in Equation 4-2 as MH DSF for consistency with SRT terminology.

\alpha_{MH} = \dfrac{\sum_{t}\left[A_{st}D_{st}/(A_{st}+B_{st}+C_{st}+D_{st})\right]}{\sum_{t}\left[B_{st}C_{st}/(A_{st}+B_{st}+C_{st}+D_{st})\right]}   (4-1)

MH\ DSF = -2.35\,\ln(\alpha_{MH})   (4-2)

The resulting DSF statistics are classified using the Educational Testing Service (ETS) guidelines for categorizing DIF. The ETS classification system assigns a letter (A, B, or C) indicating the level of DIF ("negligible DIF," "slight to moderate DIF," or "moderate to large DIF," respectively) and a symbol ("+" or "-") indicating the direction of DIF, based on the magnitude of MH DIF and statistical significance of the corresponding chi-square test (Zieky, 1993; Zwick, Thayer, & Lewis, 1999). For instance, "B+" indicates slight to moderate DIF in favor of the focal group, while "C-" indicates moderate to large DIF in favor of the reference group. Table 4-3 outlines these ETS classification criteria, with terminology from SRT in place of analogous IRT terms.

Table 4-3. DSF categories based on the ETS DIF classification system.
Positive DSF (favors focal group): MH DSF > 1.5 is classified C+ when the MH chi-square test has p < 0.05 and A otherwise; 1.0 < MH DSF < 1.5 is classified B+ when p < 0.05 and A otherwise; MH DSF < 1.0 is classified A.
Negative DSF (favors reference group): MH DSF < -1.5 is classified C- when p < 0.05 and A otherwise; -1.5 < MH DSF < -1.0 is classified B- when p < 0.05 and A otherwise; MH DSF > -1.0 is classified A.
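A minimal sketch of the computations in Equations 4-1 and 4-2, using fabricated contingency-table counts; only the magnitude portion of the ETS-style categorization is shown, and the accompanying chi-square test is omitted for brevity.

```python
import math

# Hypothetical 2x2 counts (A, B, C, D) at each of several capacity levels:
# A = reference group reached target, B = reference group did not,
# C = focal group reached target,     D = focal group did not.
tables = [
    (30, 10, 25, 15),
    (22, 18, 20, 20),
    (12, 28, 10, 30),
]

# Equation 4-1: Mantel-Haenszel common odds ratio.
num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
alpha_mh = num / den

# Equation 4-2: Mantel-Haenszel delta difference on the ETS scale.
mh_dsf = -2.35 * math.log(alpha_mh)

# Magnitude portion of the ETS A/B/C classification (significance test omitted).
magnitude = "C" if abs(mh_dsf) >= 1.5 else ("B" if abs(mh_dsf) >= 1.0 else "A")
direction = "+" if mh_dsf > 0 else "-"
print(round(alpha_mh, 3), round(mh_dsf, 3), magnitude + (direction if magnitude != "A" else ""))
```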
4.2.3 Equating the Instructional Demand Index Across Years
First, test-level properties of the student instructional demand indices (SIDIs) from each year are compared to assess a few key assumptions of test form equating. In order for two test forms to be considered equated, or for scores from each form to be used interchangeably, both forms must measure the same construct with the same level of precision and have the same level of difficulty (Holland & Dorans, 2006). Because the SIDIs were constructed using IRT calibration, plots of their test information functions (TIFs) and conditional standard errors can be examined to compare the difficulty and precision levels from each year. Each cohort of students is then assigned an estimate of instructional demand based on the SIDI from the opposite cohort, so that estimates for the same individuals can be compared across indices. A high degree of consistency between these estimates is desired to support the assumption that both SIDIs measure the same construct.

Next, parameters and standard errors of individual instructional demand indicators are compared across the two SIDIs. Although both SIDIs were constructed using identical sets of instructional demand indicators, relationships between some indicators and the latent construct could differ between the two years. Relationships between the location parameter estimates, standard errors of location parameter estimates, slope parameter estimates, and standard errors of slope parameter estimates for the same indicators across the two years are examined so that a set of consistent indicators can be identified. These indicators are referred to as "anchors," which form the basis for equating the SIDIs using an anchor test design (Kolen & Brennan, 2014). Indicators are selected for the anchor test using the following steps: 1) estimates from the SIDI for the second cohort are regressed on estimates from the first cohort, 2) instructional demand indicators with standardized residuals greater than 3 or less than -3 are excluded from consideration, and 3) the same regression model is fit without these indicators. These steps are repeated until all remaining instructional demand indicators fall between -3 and 3, and these remaining indicators comprise the anchor test.

The anchor test is then assessed against criteria for test length, content balance, and parameter stability to establish its suitability for common-item equating. The anchor test must meet a minimum length requirement of about 20-25% of the total number of items in each full test form (Hambleton et al., 1991; Kolen & Brennan, 2014). The indicators included in the anchor test must also be representative of the balance of content on each form, such that the anchor test looks like a "mini version" of the two full test forms (Kolen & Brennan, 2014). Lastly, the parameters of instructional demand indicators must be sufficiently stable across the two forms, with correlations of at least 0.95 and the ratio of their standard deviations between 0.9 and 1.1 (Huynh & Meyer, 2010).

The SIDI for the second cohort of students is rescaled by applying a linear transformation that results in the location parameters of anchor indicators having the same mean and standard deviation as they do for the SIDI for the first cohort of students. The rescaled SIDI is then used to fit SRT models in order to estimate educator capacity and consistency for the second school year on the same scale as the capacity and consistency estimates from the first school year. DSF tests, as described in Section 4.2.2, are conducted to compare results from the first and second school years as reference and focal groups, respectively. The purpose of these DSF tests is to determine whether there are differences in model performance across teachers with similar capacities estimated from models for different years. If DSF is nonnegligible, this could indicate nonequivalence of the two scales.
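A minimal sketch of the anchor-screening and rescaling steps described above, using synthetic location-parameter estimates; the indicator values, the noise, and the single drifting indicator are all invented for illustration.

```python
import statistics

# Hypothetical location-parameter estimates for the same indicators in the two cohorts.
cohort1 = [i * 0.2 - 2.0 for i in range(20)]
cohort2 = [v + 0.03 * ((-1) ** i) for i, v in enumerate(cohort1)]   # small year-to-year noise
cohort2[7] += 1.5                                                   # one indicator drifts

def screen_anchors(x, y, cutoff=3.0):
    """Iteratively drop indicators whose standardized residual from a regression of
    cohort-2 parameters on cohort-1 parameters exceeds the cutoff."""
    keep = list(range(len(x)))
    while True:
        xs, ys = [x[i] for i in keep], [y[i] for i in keep]
        slope, intercept = statistics.linear_regression(xs, ys)
        resid = [ys[j] - (intercept + slope * xs[j]) for j in range(len(keep))]
        sd = statistics.stdev(resid)
        flagged = [keep[j] for j in range(len(keep)) if abs(resid[j] / sd) > cutoff]
        if not flagged:
            return keep
        keep = [i for i in keep if i not in flagged]

anchors = screen_anchors(cohort1, cohort2)

# Linear transformation: give the cohort-2 anchor parameters the same mean and
# standard deviation as the cohort-1 anchor parameters.
m1 = statistics.mean([cohort1[i] for i in anchors])
s1 = statistics.stdev([cohort1[i] for i in anchors])
m2 = statistics.mean([cohort2[i] for i in anchors])
s2 = statistics.stdev([cohort2[i] for i in anchors])
rescale = lambda v: (v - m2) * (s1 / s2) + m1
print(anchors, [round(rescale(v), 2) for v in cohort2])
```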
The standard errors of the two capacity estimates are derived from the educator-class information functions (E-CIFs) for each teacher, as described in Chapter 3. The height of the E-CIF at the estimated capacity level is inversely related to the standard error of the capacity estimate, as shown in Equation 4-4. The height of the educator-student information function (E-SIF) is computed for each student s assigned to teacher t in year y, given all students assigned to the same teacher that year, and the reciprocal of the square root of this the slope and capacity estimates for teacher t in year y (!B* and EB*, respectively) and the instructional demand level of student s (represented by FB)). These heights are summed across sum is the standard error of the capacity estimate. The standard error of the growth estimate, G∆I, is computed using Equation 4-5, where GJ(KLM)I and GJKI are the standard errors of !(BCD)* and !B*, respectively. ΔB*=!B*−!(BCD)* 1+RSKI(JKICTKU)WQ1− RSKI(JKICTKU) GJKI=NO EB*PQ RSKI(JKICTKU) 1+RSKI(JKICTKU)W G∆I=XYGJ(KLM)IZP+YGJKIZP B)(*) CD (4-3) (4-4) (4-5) 91 Using these estimates of growth and their standard errors, three main questions are explored. First, under what conditions can growth of teachers be detected? In order for growth to be detectable, the standard error of each capacity estimate must be reasonably small. Because these standard errors are derived from the sums of information functions across all students in a class, they are likely sensitive to the number of students in a class and the distance between estimates of student instructional demand and estimates of educator capacity. The next question, “how much growth is detectable?” is addressed by determining the proportions of observed growth estimates that have standard errors below the threshold required to detect different amounts of growth. The third question, “what characteristics are associated with different types of growth?” is addressed for a subset of teachers with standard errors that are sufficiently small to detect moderate levels of growth. Relationships between observable teacher characteristics and significant negative, nonsignificant or zero, and significant positive changes in capacity are explored using a series of chi-square tests of association. 4.2.5 Assignment Optimization Optimal assignments of students and teachers to one another are explored for rising 5th graders in one sample school. The school was selected based on the number of general education classrooms for the 5th grade, and relatively high variation in student instructional demand and teacher capacity estimates compared to other schools of similar sizes. There are 126 5th grade students, 6 general education 5th grade classrooms, and one teacher for each of these classrooms in the sample school. Table 4-4 shows the distribution of instructional demand for each of these classrooms, along with the average predicted probabilities of reaching the low and middle performance targets across students in each class, based on estimates of capacity and consistency 92 of their assigned classroom teachers. Capacity estimates from the initial year are used because results from 2016-2017 would not be available at the time assignment decisions are made. The six classrooms vary somewhat in size and in mean instructional demand. On average, most students are very unlikely to reach the middle target for either math or ELA, although these probabilities are higher for students with Teacher 3. 
For the low performance targets, there is more variation in mean probabilities for math across classrooms, while the probabilities for ELA are very high for students with most teachers.

Table 4-4. Student demand and probability estimates by teacher (for each of the six classes: student count; minimum, maximum, mean, and standard deviation of instructional demand; and average probabilities of reaching the low and middle targets in math and ELA).

Using an evolutionary optimization algorithm, different arrangements of these 126 students with these 6 teachers are generated in an effort to maximize the sum of probabilities of reaching performance targets across all students and both subjects. For most students, the probabilities considered in the objective function reflect the low targets for both math and ELA. However, if a student has a probability above 0.5 for the low target in a particular subject across all 6 potential teachers, the middle target probability is used in its place. Theoretically, if a student had a probability above 0.5 across all 6 teachers for the middle target, the high target would be considered instead. Because no students meet this description, the high targets are not considered at all. Optimization procedures are repeated across four conditions, and each condition is repeated with 5 different random seeds, yielding 20 suggested assignments. These are compared to the actual assignments as well as randomly-generated assignments. The four conditions differ according to how initial values are set prior to beginning the optimization procedure (actual assignments or random assignments) and the mutation rate for the evolutionary algorithm (0.075 or 0.15). A higher mutation rate allows for more changes in assignments in each iteration of the optimization process, while a lower mutation rate places greater weight on preserving the initial assignments. In order to ensure that class sizes are reasonably equitable across teachers, a minimum and maximum number of students per teacher (17 and 25, respectively) are included as integer constraints.

4.3 Results
4.3.1 Differential Student Functioning Results
Results of DSF tests by teacher ethnicity, teacher gender, evaluation status, student count, mean instructional demand in a class, standard deviation of instructional demand in a class, classroom cluster (as defined in Chapter 3), and indicators for whether the majority of students in a class are nonwhite, eligible for special education services, classified as English Language Learners (ELL), or eligible for free or reduced-price lunch (FRL), are provided in Table 4-5. Each MH DSF statistic and corresponding chi-square significance test is assigned a category in accordance with the ETS DIF classification system. DSF is negligible across nearly every characteristic tested. There are a few exceptions to this: classes with mostly special education students compared to those with mostly general education students (for both modal demand levels and both subjects), high demand classes compared to average demand classes (lower mode and ELA only), Cluster 6 classes compared to Cluster 2 classes (lower mode for math and upper mode for ELA), and Cluster 1 classes compared to Cluster 2 classes (lower mode for ELA only).
DSF is negative for each of these comparisons, indicating lower performance in the focal group than the reference group. 4.3.2 SIDI Equating Results The test information functions (TIFs) and conditional standard errors for the SIDIs from each year (prior to equating) are shown in Figure 4-2. The corresponding curves from opposite years are nearly indistinguishable from each other upon visual inspection. Estimates of instructional demand for students based on the SIDI for the opposite cohort are also nearly identical to those for their own cohort, with correlations above 0.99. As Figure 4-3 illustrates, instructional demand estimates on the two SIDIs fall along the identity line for nearly all students. There are very few points that deviate far enough from this line for visual identification, and these deviations are extremely small. Parameter estimates are also consistent across the two SIDIs. The only indicators flagged as outliers and excluded from the anchor set are those for the gifted programs and the total number of disabilities. Parameter estimates for the remaining indicators are correlated above 0.99 across the two years. The ratio of anchor indicator standard deviations for the 2015 and 2016 SIDIs is 1.02. Table 4-6 shows results from DSF tests using the two years as comparison groups 95 Table 4-5. Results from differential student functioning (DSF) analysis Lower modal student Upper modal student ELA Math ELA Math chi2 Comparison groups Reference White Female Not evaluated Focal Nonwhite Male Evaluation year Low student count Average High student count Average High mean demand Average Low mean demand Average High SD demand Average Average Low SD demand Cluster3 1 Cluster 2 Cluster 2 Cluster 3 Cluster 2 Cluster 4 Cluster 2 Cluster 5 Cluster 6 Cluster 2 Mostly nonwhite White Mostly Special ed. General ed. Mostly ELL Mostly FRL Not ELL Not FRL 1.73 0.18 A sig. ETS DSF 1.32 0.17 A 0.33 0.07 A sig. ETS MH DSF chi2 DSF 0.01 0.00 0.00 A sig. ETS MH sig. ETS MH MH DSF chi2 DSF chi2 DSF DSF DSF 0.22 0.89 0.14 A -0.31 2.92 0.17 A 0.22 -0.13 0.25 0.06 A -0.13 0.48 0.10 A -0.13 0.46 0.09 A 0.12 -0.12 0.32 0.07 A 0.85 10.9 0.01 A -0.04 0.03 0.01 A 0.25 1.90 0.18 A -0.36 2.52 0.18 A -0.54 4.19 0.13 A -0.54 7.42 0.05 A -0.32 2.65 0.18 A -0.70 5.01 0.10 A -0.06 0.04 0.01 A -0.01 0.00 0.00 A 0.33 0.01 0.00 0.00 A -0.64 2.91 0.17 A -1.22 9.09 0.02 B- 0.03 0.01 0.00 A -0.19 0.65 0.12 A -0.22 0.60 0.11 A 0.48 3.24 0.16 A 0.26 0.89 0.14 A 0.29 1.64 0.18 A -0.35 2.44 0.18 A -0.33 1.50 0.18 A 0.27 1.46 0.18 A -0.57 9.20 0.02 A 0.10 0.20 0.05 A 0.09 0.02 0.00 0.00 A 0.06 0.30 1.59 0.18 A 0.05 0.03 0.01 A -1.00 12.8 0.01 B- 0.03 -0.62 1.77 0.18 A -0.12 0.16 0.04 A -0.07 0.05 0.01 A -0.45 4.22 0.13 A -0.22 0.61 0.11 A -0.02 0.00 0.00 A 0.14 0.35 0.07 A -1.57 3.50 0.15 A -1.68 3.26 0.16 A -0.69 1.92 0.18 A -0.70 1.52 0.18 A -7.81 11.4 0.01 C- -4.66 6.85 0.06 A -2.98 6.61 0.06 A -3.28 9.82 0.02 C- -0.39 0.83 0.14 A -0.14 0.06 0.02 A 0.17 0.05 0.01 A 0.04 0.00 0.00 A -4.43 14.1 0.00 C- -8.96 9.12 0.02 C- -11.7 18.7 0.00 C- -5.97 25.0 0.00 C- 0.03 0.02 0.01 A 0.35 2.34 0.18 A -0.43 6.43 0.07 A -0.16 0.76 0.13 A -0.69 4.53 0.12 A -0.00 0.00 0.00 A -0.43 1.45 0.18 A 0.51 1.49 0.18 A 0.14 0.03 A 0.05 0.01 A 0.00 0.00 A 3 These are the same clusters defined from class information functions (CIFs) in Chapter 3. 
Descriptions of each cluster are as follows: 1) high student count and low mean demand, 2) high student count and average mean demand, 3) average student count and high mean demand, 4) average student count and below-average mean demand, 5) low student count and high mean demand (inclusion classes with a combination of special education and general education students), 6) very low student count and very high mean demand (dedicated special education classrooms).

Figure 4-2. Test information functions and conditional standard errors for each SIDI.

Figure 4-3. Instructional demand estimates based on each SIDI.

The DSF tests in Table 4-6 are for SRT models using the low performance targets and equated SIDIs. DSF is negligible for both modal instructional demand levels across both subject areas.

Table 4-6. DSF* by year for equated SRT measures.

            Math                       ELA
            Lower mode   Upper mode    Lower mode   Upper mode
MH DSF      -0.02        0.01          0.04         0.20
chi2        7116         884           12700        1483
sig.        0.00         0.00          0.00         0.00
ETS DSF     A            A             A            A
*The focal and reference groups are 2017 and 2016 for these tests, respectively.

4.3.3 Teacher Growth Analysis Results

On average, teacher capacities decreased by about 0.07 in math and 0.02 in ELA between the two school years. These changes correspond to about 0.04 and 0.01 standard deviations of the capacity distribution,4 respectively. Figure 4-4 provides histograms of math and ELA capacity changes. The magnitude of change is within one standard deviation of the capacity distribution for approximately 70% of teachers in math and 65% of teachers in ELA. The standard errors of these estimated changes vary widely across teachers. For math, the mean standard error is 1.28 with a standard deviation of 1.40. For ELA, the mean is 1.62 with a standard deviation of 1.82. The distributions of standard errors, which are also provided in Figure 4-4, are highly skewed toward large values. The median standard errors of 0.72 for math and 0.82 for ELA are more reflective of the typical magnitudes observed in the sample, as the means are more sensitive to a relatively small number of extremely large standard errors.

4 The standard deviations of capacity measures are slightly larger across the full samples (approximately equal to 2) than for the analytic sample (about 1.80 and 1.74 for math and ELA, respectively). These values reflect the standard deviation of 2015-2016 capacities across the analytic sample of teachers in the growth analysis.

Figure 4-4. Distributions of changes in capacity and their standard errors.

Figure 4-5. Scatterplot of math capacity estimates colored according to the standard errors of corresponding math growth estimates.

Figure 4-6. Scatterplot of ELA capacity estimates colored according to the standard errors of corresponding ELA growth estimates.

Figures 4-5 and 4-6 provide scatterplots of 2015-2016 and 2016-2017 capacity estimates for the same subject with points colored according to the standard error of the growth estimate (or the change between the two capacity estimates). Standard errors are generally largest when the capacity estimate in one of the years is on one of the far ends of the distribution. However, some teachers with capacities in these ranges still have standard errors in the lower ranges.
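The pattern in Figures 4-5 and 4-6 follows directly from Equation 4-4: the E-SIF height $a^2P(1-P)$ peaks when a student's demand is close to the teacher's capacity and falls off quickly as the gap widens, so mismatched classes yield less information and larger standard errors. The short sketch below illustrates this with hypothetical values; the class size, slope, and demand spread are assumptions, not values taken from the district data.

```python
import numpy as np

def capacity_se(capacity, slope, demands):
    """Standard error of a capacity estimate from summed 2PL information (Equation 4-4)."""
    p = 1.0 / (1.0 + np.exp(-slope * (capacity - np.asarray(demands))))
    return 1.0 / np.sqrt(np.sum(slope**2 * p * (1.0 - p)))

slope, capacity, class_size = 1.2, 0.0, 22
for gap in (0.0, 1.0, 2.0, 3.0):
    # A class of 22 students whose demand is centered `gap` units above the capacity.
    demands = np.linspace(-0.75, 0.75, class_size) + capacity + gap
    print(f"capacity-demand gap = {gap:.1f}  ->  SE = {capacity_se(capacity, slope, demands):.2f}")
```

With the class size held fixed, the standard error roughly triples as the gap grows from zero to three units, which is the kind of sensitivity examined next.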
Student counts and gaps between teacher capacity and class mean instructional demand were explored as possible drivers of differences in standard errors across teachers, as both of these quantities are directly related to the educator-class information function (E-CIF) from which standard errors are derived. Figure 4-7 provides the mean standard error among groups of teachers with a given student count in each of the two years. There is no obvious visual relationship between the student counts for each year and the mean standard error. In fact, some cells corresponding to very low student counts have small mean standard errors and some cells corresponding to very high student counts have very large mean standard errors. Figures 4-8 and 4-9 provide scatterplots of the distance between a teacher’s capacity estimate and the mean instructional demand estimate of students in the teacher’s class for each year, where the color of each point corresponds to the standard error of the growth estimate for the teacher. There are clear visual patterns in the magnitude of standard errors based on these capacity-demand gaps; teachers with larger gaps in either direction have larger standard errors. There are a few exceptions to this, where a teacher without an extreme capacity-demand gap in either year has a standard error in the highest interval (appearing as a dark red point in a location populated mostly by bright green points). These correspond to teachers with unusually small consistency (or slope) estimates in at least one of the two years. 102 Figure 4-7. Mean standard errors of growth estimates by student counts. 103 Figure 4-8. Standard errors of growth estimates by gap between math capacity and class mean instructional demand. 104 Figure 4-9. Standard errors of growth estimates by gap between ELA capacity and class mean instructional demand. 105 Table 4-7 provides details about the standard errors necessary for different magnitudes of growth to be considered statistically significant. The smallest magnitude of growth in math capacity with statistical significance is 0.64. This corresponds to about one third of a standard deviation of the capacity distribution. While about 72% of all math growth estimates are larger than this (including both positive and negative growth estimates), only 3% of math growth estimates have sufficiently small standard errors for this change to be statistically significant. Similarly, only 7% of ELA growth estimates have sufficiently small standard errors for the smallest observed significant growth (0.73 capacity units) to be statistically significant. More than half of all growth estimates have sufficient standard errors to detect significant changes in capacity of one full standard deviation unit, however, growth of this magnitude is not frequently observed, particularly among the subset of teachers with sufficient standard errors to detect it. About 31% of math growth estimates and 35% of ELA growth estimates are equal to or above this level, and fewer than half of these are statistically significant. Table 4-7. Detectable and observed magnitudes of growth. 
Math                                                               Minimum detectable   0.5 SD   1 SD
Magnitude of growth (change in capacity)                                  0.64            0.90    1.80
Maximum SE for this magnitude of growth to be significant                 0.33            0.46    0.92
Percent of teacher growth estimates above this magnitude                  72%             60%     31%
Percent of teachers with sufficient SE to detect growth                   3%              22%     60%
Percent of teachers with significant growth above this magnitude          1%              8%      13%

ELA                                                                Minimum detectable   0.5 SD   1 SD
Magnitude of growth (change in capacity)                                  0.73            0.87    1.74
Maximum SE for this magnitude of growth to be significant                 0.37            0.44    0.89
Percent of teacher growth estimates above this magnitude                  67%             62%     35%
Percent of teachers with sufficient SE to detect growth                   7%              16%     53%
Percent of teachers with significant growth above this magnitude          3%              7%      14%

For the subset of teachers with standard errors below the threshold for detecting one standard deviation of growth, percentages of teachers with significant negative, nonsignificant or zero, and significant positive changes in capacity are shown in Table 4-8. About half of all teachers have significant changes in capacity across the two years, but significant changes are more often negative than positive. Early career teachers are the only group more likely to have a positive change in math capacity than a negative change, and no groups of teachers are more likely to have positive changes than negative changes in ELA capacity. Rates of negative changes are particularly high among probationary teachers (in both subjects) and mid-career teachers (in ELA only). There are statistically significant associations between ELA growth and both teacher minority status and teaching experience. There are no significant associations found between math growth and observable teacher characteristics.

Table 4-8. Comparison of teachers by change type (if 1 SD change is detectable)

                              Math Growth                              ELA Growth
                        Negative   None   Positive               Negative   None   Positive
All teachers               25%      54%     20%                     29%      49%     22%
Employment status                            c2=2.30, p=0.32                         c2=1.41, p=0.50
  Probationary             42%      50%      8%                     40%      45%     15%
  Tenured                  25%      55%     21%                     28%      50%     22%
Master's degree                              c2=0.88, p=0.64                         c2=0.89, p=0.64
  Yes                      24%      57%     19%                     26%      51%     23%
  No                       26%      53%     21%                     30%      49%     21%
Teacher Ethnicity                            c2=1.43, p=0.49                         c2=6.23, p=0.04
  Nonwhite                 26%      53%     21%                     30%      46%     24%
  White                    24%      58%     18%                     25%      60%     15%
Teaching Experience                          c2=2.27, p=0.69                         c2=12.84, p=0.01
  1-5 years                20%      50%     30%                     26%      48%     26%
  6-9 years                33%      48%     18%                     59%      33%      7%
  10+ years                25%      55%     20%                     26%      51%     23%
Evaluation Year                              c2=1.61, p=0.45                         c2=1.14, p=0.57
  Yes                      22%      59%     19%                     26%      54%     20%
  No                       26%      53%     21%                     30%      48%     22%

4.3.4 Optimal Assignment Results

The top panel of Table 4-9 shows properties of actual class assignments and a random set of assignments. When students are placed with their actual assigned teacher, the objective function (the sum of probabilities across 126 students and 2 subjects) is equal to 124.06. The within-student sums of probabilities (math probability plus ELA probability) have a standard deviation of 0.52. These same values were computed for randomly generated assignments. Across 100 replications of randomly placing students and teachers with one another, the mean of the objective function is 121.15 with a standard deviation of 4.73. This suggests that the expected outcomes for the actual assigned classes are within the range of typical outcomes when classes are assigned completely at random. The mean standard deviation of within-student sums across the 100 random assignment replications is 0.58 with a standard deviation of 0.02.
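The evolutionary search used to generate the optimized assignments can be sketched as follows. This is a deliberately simplified, hedged illustration of the general approach described in Section 4.2.5: the probability matrix is simulated, and the mutation and repair steps are my own assumptions; the actual analysis may use different operators, population handling, and software.

```python
import numpy as np

def objective(assignment, prob):
    """Sum over students of the probability of reaching the relevant target with
    the assigned teacher. `prob` is a (students x teachers) matrix that already
    encodes the low/middle target choice described in the text."""
    return prob[np.arange(len(assignment)), assignment].sum()

def repair(assignment, n_teachers, min_size, max_size, rng):
    """Move students from over-full to under-full classes until all sizes are legal."""
    assignment = assignment.copy()
    while True:
        counts = np.bincount(assignment, minlength=n_teachers)
        over, under = np.where(counts > max_size)[0], np.where(counts < min_size)[0]
        if len(over) == 0 and len(under) == 0:
            return assignment
        src = over[0] if len(over) else np.argmax(counts)
        dst = under[0] if len(under) else np.argmin(counts)
        assignment[rng.choice(np.where(assignment == src)[0])] = dst

def evolve(prob, initial, mutation_rate=0.075, generations=500,
           min_size=17, max_size=25, seed=0):
    rng = np.random.default_rng(seed)
    n_students, n_teachers = prob.shape
    best = repair(np.asarray(initial), n_teachers, min_size, max_size, rng)
    best_val = objective(best, prob)
    for _ in range(generations):
        child = best.copy()
        mutate = rng.random(n_students) < mutation_rate   # reassign a fraction of students
        child[mutate] = rng.integers(0, n_teachers, mutate.sum())
        child = repair(child, n_teachers, min_size, max_size, rng)
        val = objective(child, prob)
        if val > best_val:   # keep the better assignment
            best, best_val = child, val
    return best, best_val

# Hypothetical use: 126 students, 6 teachers, simulated target probabilities.
rng = np.random.default_rng(1)
prob = rng.random((126, 6))
start = rng.integers(0, 6, 126)   # stand-in for the actual or random starting assignments
best, value = evolve(prob, start, mutation_rate=0.15, seed=3)
print(round(value, 2), np.bincount(best, minlength=6))
```

In this simplified form the search is a (1+1)-style evolutionary strategy: one candidate assignment is mutated, repaired to respect the 17-25 class-size constraint, and kept only if it improves the summed probability of reaching the targets.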
These comparisons suggest that, although the sum of probabilities is similar for actual and random assignments, the within-student probability sums vary less under the actual assignments. Properties of the actual assignments, a set of random assignments with a similar objective function value to the actual assignments, and distributions of properties across optimization replications from each condition, are provided in Table 4-9. Although the objective functions have similar values for the actual and random assignments, students and demand levels are distributed more equitably across teachers with the actual assignments than with random assignments. Optimized classes have more equitable class size distributions than the random assignments, but they are slightly less equitable than the actual assignments. The standard deviation of mean demand across teachers is higher for all of the optimized assignments than for the actual assignments, and highest for the conditions with actual assignments as starting values. On average, optimization procedures lead to greater increases in the objective function when the actual assignments are used as starting values. Even after optimization, the value of the objective function for conditions with random starting values is similar to that of the actual assignments. Optimization conditions with the higher mutation rate have more consistent results across different random seeds.

Table 4-9. Actual, random, and optimized class assignments.

Conditions   Initial   Mutation rate   Objective function: Final   Objective function: Change   SD across teachers: Student count   SD across teachers: Mean demand
Actual        ----        ----          124.06                      ----                          2.19                                0.15
Random        ----        ----          123.81                      ----                          6.00                                0.24
Optimal       Actual      0.075         136.49 (5.80)               12.44 (5.80)                  2.92 (0.61)                         0.38 (0.04)
Optimal       Actual      0.15          140.86 (1.09)               16.80 (0.98)                  3.59 (0.41)                         0.40 (0.10)
Optimal       Random      0.075         127.41 (4.94)               3.59 (4.94)                   3.26 (0.47)                         0.23 (0.07)
Optimal       Random      0.15          126.99 (2.33)               3.18 (2.33)                   3.52 (0.67)                         0.18 (0.02)
Note: Values in parentheses summarize variability across the five random seeds for each optimization condition.

4.4 Discussion

These results offer two types of diagnostic feedback. First, they identify concerns about the performance and specifications of the SRT model. Second, they identify patterns in the performance of students and teachers. While the second type of feedback is most relevant to teachers, administrators, and other stakeholders, the first type provides evidence of validity and shortcomings in these measures that should be considered carefully when interpreting the more practical findings and when developing an appropriate SRT model for use in a formative evaluation system.

4.4.1 Differential Student Functioning

Most of the nonnegligible DSF results occur in similar groups of teachers that are identified in slightly different ways. There is likely a lot of overlap in the groups of teachers with special education classes, cluster 6 classes, and classes with high mean instructional demand levels, so it is not surprising that findings are similar across DSF tests for each of these focal groups. Teachers of high demand, special education, or cluster 6 classes tend to have fewer students in general. When viewing the classroom as a test in the SRT framework, these teachers are given shorter, more difficult tests than teachers of an average class. The types of students in these classes are also more likely to have different teachers for math and ELA; these students were excluded from this study, potentially exacerbating the effects of low student counts for some of these teachers.
Near-average students tend to perform worse in these classes than similar students with similar teachers in other types of classes. It is also possible that the instructional demand index does not adequately capture the added challenges associated with teaching many high-demand students simultaneously. The only nonnegligible DSF test that was not for one of these same groups was for Cluster 1 relative to Cluster 2 for the lower mode in ELA. In this case, both the focal and reference groups have high student counts but differ in their mean demand levels. This result indicates that students with slightly below-average demand levels in large, low-demand classes are less likely to reach the ELA target than similar students with similar teachers in large, average-demand classes. This could indicate that teachers of Cluster 1 classes direct their instruction more towards low-demand students, while instruction in Cluster 2 classes is generally directed towards average-demand students. Similarly, tendencies to direct instruction towards 110 higher-demand students in higher-demand classes could contribute to the significant DSF findings for special education, Cluster 6, and high-demand classes. If given this information as part of a formative evaluation process, an administrator may choose to plan professional development activities that focus on improving instructional practices to reach a broader range of students. An administrator may also consider this information when determining class assignments in the following year. 4.4.2 Equating the SIDI The TIFs shown in Figure 4-1 provide strong evidence that, prior to equating, the SIDIs from each year can be considered equivalent forms. The high level of consistency for instructional demand estimates from each SIDI further supports that the forms can be used interchangeably, and neither form would be preferred over the other. The content balance of the anchor test is nearly identical to that of the complete set of indicators used to estimate each SIDI. The indicators for gifted students and the total number of disabilities do operate somewhat differently on the two forms, however. This may be related to differences between the groups of gifted students or students with multiple disabilities in each cohort. For instance, students in the first cohort with multiple disabilities may have qualitatively different disabilities than those in the second cohort. It is also possible that the eligibility criteria for different gifted programs or special education services changed slightly after the first year. Aside from these few exceptions, which were excluded from the set of anchors used to equate the scales, relationships between instructional demand indicators and the latent construct measured by the SIDI are quite stable across the two forms. The DSF tests comparing performance in the two years are all classified as negligible, affirming that the two scales are comparable. 111 4.4.3 Teacher Growth Under certain conditions, there is relatively little power to detect significant changes in equated educator capacity estimates due to their unreasonably large standard errors. However, mismatch between student demand levels and teacher capacities poses a far greater threat to standard errors of growth estimates than low student counts. These findings emphasize that appropriate matching of the needs of students with the capabilities of teachers is critical in order to make inferences about the performance and growth of teachers in an SRT analysis. 
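To put the power problem in concrete terms, consider a teacher with the median standard errors reported earlier (0.72 for math and 0.82 for ELA) and a conventional two-sided test at the 5% level. The smallest year-to-year change in capacity that could be declared statistically significant is

$$|\Delta_{ty}| > 1.96 \times SE(\Delta_{ty}) \approx 1.96 \times 0.72 \approx 1.41 \ \text{(math)}, \qquad 1.96 \times 0.82 \approx 1.61 \ \text{(ELA)},$$

or roughly 0.8 to 0.9 standard deviations of the capacity distribution. This back-of-the-envelope calculation, which assumes the median teacher's standard error, helps explain why so few of the observed changes reach statistical significance.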
It may be more feasible to measure growth over longer periods of time, as changes occurring over multiple years may be larger relative to the standard errors of the initial and final capacity estimates. Teachers at either extreme end of the capacity distribution are disproportionately likely to have large gaps between their capacity estimates and the mean instructional demand levels of their students, large standard errors, and large estimates of growth. Because of these relationships, the teachers with growth estimates above a given level tend not to be the same teachers with sufficiently small standard errors for this level of growth to be statistically significant. In future work, it will be important to address differences in SRT and IRT models and contexts, such as the level of the slope parameter and interactions among students in the same class, that are likely to contribute to these enlarged standard errors. The frequency with which negative growth (or decreases in capacity from one year to the next) is observed raises concerns about possible changes in the difficulty of a performance target over time. If the cut-score for a performance level designated by the state is more difficult in the second year than the first, students will be less likely to reach the target than students of the same instructional demand levels in the previous year. Because the capacity scale is defined by the 112 relationship between instructional demand and target attainment, this sort of change could result in a systematic shift of capacity estimates for all teachers regardless of whether their capabilities as educators have actually changed. In order to prevent incorrect inferences about declining capacities or underestimation of positive growth, comparability of performance targets should be established in future SRT studies involving longitudinal comparisons. 4.4.4 Optimal Assignment Differences between the properties of actual assignments and randomly-generated assignments reveal that, although the expected sum of probabilities is similar, the actual assignments correspond to fewer students with probabilities very close to 0 or 1 than the random assignments. This suggests that the actual assignments do a better job of matching student instructional demand levels and teacher capacity levels. As a result, students are less likely to be placed with teachers with whom they have very low probabilities of reaching the performance target than if they were assigned at random. Despite beginning with approximately the same initial value for the objective function, the optimization process is more successful in improving successful outcomes when the actual assignments are used as initial values, compared to the random assignments. Although improvements to the objective function vary across random seeds, results are much more consistent across replications for conditions with the higher mutation rate. Results are most consistent for the condition with actual assignments as starting values and the higher mutation rate. Optimal assignments in this condition correspond to approximately 10 more students (of the 126 5th grade students in the school) expected to reach performance targets than with the actual assignments. 113 The standard deviation of class sizes tends to be larger for optimized assignments than for the actual assignments. The assignment procedures used in the sample school may prioritize equitable class sizes to a greater extent than simply imposing a minimum and maximum size. 
The class size restrictions may need to be revised in order to generate assignments that align with other priorities and concerns of the school. Similarly, the standard deviation of class mean instructional demand levels is lowest for the actual assignments. In order to maximize expected student outcomes, the optimal assignments focus more on matching of students and teachers, while assignment practices in the school may focus more on equitable distribution of instructional demand across teachers. If this is the primary objective of teachers and administrators in this particular school, then optimal assignments that minimize the standard deviation of instructional demand across teachers may be of greater interest. However, providing administrators with optimal assignments for different objectives, and comparisons of these with actual assignments, may be helpful in evaluating implications of their current assignment practices. For instance, while the sample school may currently focus more on equitable distribution than on expected student outcomes when determining assignments, an administrator may choose to emphasize expected student outcomes more in future assignment decisions after reviewing this type of report. 4.5 Conclusions These findings affirm that the mean instructional demand level of a class, and its relationship to the capacity of a teacher or instructional demand of an individual student, impacts both the performance of students and the quality of information provided about a teacher. 114 Commonalities between special education, high mean demand, and cluster 6 classrooms are likely related more to discrepancies between the instructional demand levels of students in these classes and the capacities of most teachers than they are to low student counts. This is equivalent to assessing the performance of these groups of teachers using a significantly more difficult form of a test than for other teachers. When the latent capacity of a teacher is far below the demand level of their students, this may result in a floor effect. These patterns could also indicate a tendency of teachers to direct instruction towards the level of most students in the class. This would also explain differential performance of average students in very low-demand classes, compared to similar students with similar teachers in average classes. The instructional demand indices for the two cohorts of students are successfully equated to a common scale, however, standard errors of teacher growth estimates are too large to make judgements about changes in most teachers’ capacities between the two years. The magnitudes of these standard errors are most sensitive to gaps between capacity and mean demand and impacted little by student counts in comparison. With greater emphasis on matching student instructional demand levels with teacher capacity levels in assignment processes, this type of growth analysis may be more feasible. However, findings from the sample school and in previous studies suggest that some teachers and administrators prioritize equitable distribution of students and instructional demand across teachers, as opposed to matching of students with the most appropriate teachers. Further development of SRT methods are necessary in order to adequately capture growth of educators when there are gaps between capacity and instructional demand as well as interactions among students in the same class and their contributions to the instructional demands posed to an educator. 
However, the negligible DSF findings, the feasibility of longitudinal equating, and the improvements in expected outcomes through assignment optimization suggest that these procedures can contribute to formative teacher evaluation processes in novel and impactful ways.

CHAPTER 5. OVERALL CONCLUSION AND DISCUSSION

5.1 Summary of Findings

The main objective of the first paper is to determine whether the student instructional demand index (SIDI) can be constructed in ways that are more beneficial to an evaluation system and less controversial politically. Although the IRT calibration method was less successful than regression analysis for estimating the SIDI in previous studies, this study is the first to use such an expansive set of instructional demand indicators. It is also the first to explicitly choose indicators based on their optimality for IRT calibration. The results suggest that the IRT calibration method produces an index of instructional demand that is related to future achievement closely enough to demonstrate evidence of convergent validity but differs enough from future achievement to offer evidence of divergent validity. This suggests that the construct this SIDI captures is related to but different from future achievement. The regression analysis method, in comparison, creates a SIDI that is essentially an indicator of future performance. This is problematic within an accountability framework; if characteristics of a student before beginning the school year are so highly predictive of performance at the end of the school year, this implies that teachers have relatively little influence on these outcomes. Estimates of teacher capacity from SRT models with IRT-calibrated SIDIs are also more consistent with other measures of teacher quality. The IRT-calibrated SIDI discriminates between teachers in different rating categories based on observations of their classroom teaching, as well as between different levels of teaching experience. However, when a difficult performance target is used as a standard for evaluating student and teacher performance, neither type of SIDI is able to discriminate among these groups of teachers.

Results from the second paper also emphasize that when student performance targets are unreasonably difficult, SRT models are not very informative about educator capacity. The lowest performance target for the state assessment taken by students in this sample is the only benchmark that provides reliable estimates of teacher performance across a wide range of the instructional demand scale. The middle performance target provides reliable information about some students and teachers in the district, but the highest performance target is generally uninformative across the entire student and teacher distributions. Different measures derived from the low target are highly correlated with one another. This suggests that, although it might not be particularly useful to report multiple measures if the information they provide is redundant, administrators should feel comfortable choosing a measure derived from the SRF based on the relevance of its interpretation to the objectives and priorities of the educational system conducting the evaluation, as long as the underlying SRT model uses a performance target that is informative about the populations of students and teachers in that system. Differences in class information functions (CIFs) raise some alarm about whether comparisons across all teachers are fair or appropriate.
However, commonalities among the CIFs across large subsets of teachers, similar to parallel forms of a test, provide a promising solution to this problem. CIF matching allows for meaningful within-cluster comparisons, however, differences between within-cluster rankings and overall rankings suggest that overall rankings should be interpreted with caution. 118 Results from the third paper indicate that, under most conditions, model performance is consistent across characteristics of teachers and classes. Most exceptions to this are for classes with very high-demand students, including special education classes. The only other exception is for slightly below-average demand students placed in large, low-demand classes compared to large, average-demand classes. While low student counts in the high-demand and special education classes offer one possible explanation for these differences in model performance, the presence of a similar pattern across large classes with different demand levels could not be explained in the same way. Another explanation is that the differences in model performance relate to mismatch between the instructional demand levels of individual students and the instructional demand levels of their classmates. It is not clear whether performance of these students is more likely affected by the demand levels of their classmates, or whether the instruction administered in these types of classrooms is targeted towards the mean demand level of the class. If the latter is true, students far above or below that mean demand level might not benefit as much as they would in an average-demand classroom where instruction is designed for a different type of student. Parameter estimates for instructional demand indicators are rather stable across the two years, lending well for common item equating of the SIDI across the two cohorts of 5th grade students. The only instructional demand indicators that were flagged for inconsistency were for gifted programs and students with multiple disabilities. These indicators represent opposite ends of the instructional demand distribution, where estimates tend to be less precise in general. These differences could also reflect changes in the criteria for program eligibility between the two years or differences in the relatively small groups of students from each cohort that fall within these categories. Standard errors of teacher growth estimates are rather large for some teachers, 119 resulting in little power to detect changes in capacity in these cases. The primary factor driving these large standard errors is mismatch between the capacity of a teacher and the mean demand level of students in their class. This discrepancy makes a far greater impact than the number of students in a class. Small slope estimates also correspond to large standard errors, however, this affects far fewer teachers than capacity-demand mismatch. Although inappropriate matching of students and teachers is a limiting factor for analyzing teacher growth, assignment optimization procedures provide guidance to schools about how to improve the ways students and teachers are assigned to one another. However, an administrator would only be interested in this information if student-teacher matching aligns with their objectives and priorities in the assignment process. In the sample school selected for the optimization analysis, the actual assignments appeared to prioritize the equitable distribution of students and instructional demand across teachers. 
This type of practice would likely result in more similar CIFs across teachers in the same school, facilitating comparisons among the different teachers. However, it is not necessarily the best practice for improving student outcomes, accurately assessing the performance of a teacher, or monitoring and supporting teacher growth over time. 5.2 Implications The importance of setting realistic performance standards for students and teachers is emphasized in several findings across the three papers. For this district, only one of the three performance targets is informative about large proportions of students and teachers. Without reliable information about performance based on the other two targets, the potential benefits of 120 using multiple performance outcomes to generate different types of measures about teachers cannot be assessed. Different types of IRT models with polytomous or continuous response variables would likely capture more of the variation in student performance than the single dichotomous outcome that was informative in this study. However, these models are not as easily-interpretable as the 2PL and may be less accessible to stakeholders. Many of the other advantages of SRT over VAMs relate to IRT-specific procedures that have been studied extensively within the context of dichotomous response models. Continuous response models, in particular, are rarely used in research or in practice, and the types of IRT technology that may be most beneficial to the evaluation context has not been developed for these types of models. Interpretability and application of well-studied IRT technology are focal points of this study, so the most appropriate resolution for these same purposes would be to seek different dichotomous targets to use in place of the higher targets. Like the state-determined proficiency levels, these targets should correspond to concrete standards of performance. However, they must be both reasonable and challenging for a significant proportion of students in the district. Another common theme throughout these findings is mismatch between the average instructional demand level of a class and either the capacity of a teacher or the instructional demand level of an individual student. Large discrepancies between capacity and demand result in large standard errors of capacity and growth estimates. Discrepancies between individual students’ instructional demand levels and class mean demand correspond to differences in performance compared to similar students with similar teachers but different classmates. One explanation is that these differences arise from interactions between students in the class, or peer effects. Another explanation is that teachers alter their instructional practices to target the types of students that comprise the majority of a class. Students who are either above or below the 121 demand levels of most students in their classes may not benefit as much as they would in a class where instruction is targeted towards students like them. The first explanation could potentially be addressed with further development of the SRT model and instructional demand index. The second explanation would be relevant feedback for administrators and could prompt them to train teachers on instructional practices that benefit a wider range of students or to actively avoid placing students in classes where most students are far above or far below their instructional demand levels. 
Although several findings point to possible differences in model performance for special education students and special education teachers, as well as for teachers with very low student counts, most of these relationships are better explained by discrepancies between the high demand levels in these classes and the capacity estimates of these teachers. These types of classes are typically associated with low student counts, and are also more likely to have students missing from the data, however, the results suggest that the capacity-demand gap makes a far greater difference than the number of students placed with a teacher. 122 APPENDIX 123 Student with IEP Table A1. Location and slope parameters for instructional demand indicators 5 Indicators Not gifted (high achievement) B in writing Not gifted (intellectual ability) B in science B in history Proficient in math B in math B in phys ed B in health B in reading B in art B in speaking Not designated as gifted Proficient in ELA B in listening B for writing effort B for PE effort B for health effort B for science effort B for history effort B for speaking effort B for reading effort B for listening effort B for math effort B for art effort Basic math proficiency Basic ELA proficiency C in writing C in math C in reading Below basic ELA proficiency Absent 5-10 days Below basic math proficiency C in history C in science C in listening C for writing effort C for reading effort C for math effort C in speaking slope Indicators (cont’d) 1.42 C for listening effort 2.95 C for history effort 1.23 C for science effort 2.63 C in health 2.96 C for speaking effort 1.66 Limited English proficiency 2.56 C for health effort 1.52 D-F in math 2.52 D-F in writing 2.72 D-F in art 1.91 D-F in reading 2.31 1.45 Learning disability 1.76 C for art effort 2.92 One disability 2.91 C in phys ed 1.49 C for PE effort 2.63 D-F in history 2.95 D-F for math effort 3.02 D-F for writing effort 2.30 D-F for reading effort 3.06 D-F in science 2.71 D-F in listening 2.91 D-F for listening effort 1.86 D-F for history effort 1.66 D-F for science effort 1.76 D-F in speaking 2.95 D-F in health 2.56 Two disabilities 2.72 Resource specialist 1.76 D for health effort 0.36 D for speaking effort 1.66 At least 3 disabilities 2.96 D-F in art 2.63 Absent 11-17 days 2.92 D-F for art effort 2.91 D-F for PE effort 3.06 D-F in phys ed 2.91 Absent 18 or more days 2.31 location -2.62 -2.23 -2.19 -2.12 -2.06 -2.04 -2.04 -2.03 -2.02 -2.00 -1.81 -1.77 -1.68 -1.67 -1.58 -1.41 -1.39 -1.37 -1.35 -1.34 -1.29 -1.25 -1.25 -1.21 -1.17 -0.90 -0.76 -0.23 -0.12 -0.01 0.04 0.04 0.41 0.48 0.49 0.55 0.55 0.64 0.66 0.77 location 0.78 0.90 1.00 1.09 1.14 1.15 1.39 1.57 1.68 1.74 1.75 1.90 1.92 1.94 1.94 1.97 2.09 2.11 2.14 2.14 2.20 2.22 2.32 2.42 2.44 2.56 2.67 2.67 2.86 2.92 2.93 2.98 3.63 3.91 3.92 4.13 4.58 4.60 7.12 slope 2.71 3.02 2.95 2.52 2.30 1.07 2.63 2.56 2.95 1.91 2.72 1.30 1.29 1.86 1.26 1.52 1.49 2.96 2.91 2.91 3.06 2.63 2.92 2.71 3.02 2.95 2.31 2.52 1.26 1.07 2.63 2.30 1.26 1.91 0.36 1.86 1.49 1.52 0.36 5 IRT-calibrated instructional demand index with restricted item set (cohort 1). 124 Table A2. 
Deriving equations for the P25 and P75 measures (Chapter 3). Here $P_t(\delta)$ is the probability that a student with instructional demand $\delta$ reaches the performance target with a teacher whose capacity and slope are $\theta_t$ and $a_t$.

$$P_t(\delta) = \frac{\exp\left(a_t(\theta_t - \delta)\right)}{1 + \exp\left(a_t(\theta_t - \delta)\right)} \qquad (A1)$$

$$P_t(\delta)\left[1 + \exp\left(a_t(\theta_t - \delta)\right)\right] = \exp\left(a_t(\theta_t - \delta)\right) \qquad (A2)$$

$$P_t(\delta) + P_t(\delta)\exp\left(a_t(\theta_t - \delta)\right) = \exp\left(a_t(\theta_t - \delta)\right) \qquad (A3)$$

$$P_t(\delta) = \exp\left(a_t(\theta_t - \delta)\right) - P_t(\delta)\exp\left(a_t(\theta_t - \delta)\right) \qquad (A4)$$

$$P_t(\delta) = \exp\left(a_t(\theta_t - \delta)\right)\left(1 - P_t(\delta)\right) \qquad (A5)$$

$$\frac{P_t(\delta)}{1 - P_t(\delta)} = \exp\left(a_t(\theta_t - \delta)\right) \qquad (A6)$$

$$\ln\!\left(\frac{P_t(\delta)}{1 - P_t(\delta)}\right) = a_t(\theta_t - \delta) \qquad (A7)$$

$$\delta + \frac{1}{a_t}\ln\!\left(\frac{P_t(\delta)}{1 - P_t(\delta)}\right) = \theta_t \qquad (A8)$$

$$\delta = \theta_t - \frac{1}{a_t}\ln\!\left(\frac{P_t(\delta)}{1 - P_t(\delta)}\right) \qquad (A9)$$

$$P25 = \theta_t - \frac{1}{a_t}\ln\!\left(\frac{0.25}{1 - 0.25}\right) = \theta_t - \frac{\ln(1/3)}{a_t} = \theta_t + \frac{\ln(3)}{a_t} \qquad (A10)$$

$$P75 = \theta_t - \frac{1}{a_t}\ln\!\left(\frac{0.75}{1 - 0.75}\right) = \theta_t - \frac{\ln(3)}{a_t} \qquad (A11)$$

BIBLIOGRAPHY

Baker, B. D., Farrie, D., & Sciarra, D. G. (2016). Mind the gap: 20 years of progress and retrenchment in school funding and achievement gaps. ETS Research Report Series, 2016(1), 1-37.

Ballou, D., Sanders, W., & Wright, P. (2004). Controlling for student background in value-added assessment of teachers. Journal of Educational and Behavioral Statistics, 29(1), 37-65.

Ballou, D. (2005). Value-added assessment: Lessons from Tennessee. Value added models in education: Theory and applications, 272-297.

Betebenner, D. W. (2011). A technical overview of the student growth percentile methodology: Student growth percentiles and percentile growth projections/trajectories. Paper presented at The National Center for the Improvement of Educational Assessment. New Hampshire.

Booher-Jennings, J. (2005). Below the bubble: "Educational triage" and the Texas accountability system. American Educational Research Journal, 42(2), 231-268.

Braun, H. I. (2005). Using student progress to evaluate teachers: A primer on value-added models. Policy Information Perspective. Educational Testing Service.

Castellano, K. E., & McCaffrey, D. F. (2017). The accuracy of aggregate student growth percentiles as indicators of educator performance. Educational Measurement: Issues and Practice, 36(1), 14-27.

Cella, D., Gershon, R., Lai, J. S., & Choi, S. (2007). The future of outcomes measurement: Item banking, tailored short-forms, and computerized adaptive assessment. Quality of Life Research, 16(1), 133-141.

Chatterjee, S., & Hadi, A. S. (2015). Regression analysis by example. John Wiley & Sons.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334.

De Ayala, R. J. (2009). The theory and practice of item response theory. (T. D. Little, Ed.). New York: The Guilford Press.

Dee, T. S., Jacob, B., & Schwartz, N. L. (2013). The effects of NCLB on school resources and practices. Educational Evaluation and Policy Analysis, 35(2), 252-279.

Dodd, B. G., Koch, W. R., & De Ayala, R. J. (1993). Computerized adaptive testing using the partial credit model: Effects of item pool characteristics and different stopping rules. Educational and Psychological Measurement, 53(1), 61-77.

Dorans, N. J., Moses, T. P., & Eignor, D. R. (2010). Principles and practices of test score equating. Educational Testing Service, Princeton, NJ.

Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35-66). Hillsdale, NJ: Erlbaum.

Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis (Vol. 3). New York: Wiley.

Ehlert, M., Koedel, C., Parsons, E., & Podgursky, M. (2013). Selecting growth measures for school and teacher evaluations. National Center for Analysis of Longitudinal Data in Education Research (CALDER). Working Paper, 80.

Embretson, S.
E., & Reise, S. P. (2013). Item response theory. Psychology Press. Goe, L., Wylie, E. C., Bosso, D., & Olson, D. (2017). State of the states' teacher evaluation and support systems: A perspective from exemplary teachers. ETS Research Report Series, 2017(1), 1-27. Grissom, J. A., Kalogrides, D., & Loeb, S. (2013). Strategic Staffing: Examining the Class Assignments of Teachers and Students in Tested and Untested Grades and Subjects. American Education Finance and Policy Conference, New Orleans, LA. Guarino, C., Reckase, M., Stacy, B., & Wooldridge, J. (2014). A comparison of Student Growth Percentile and Value-Added models of teacher performance. Statistics and Public Policy, 2(1): 1-11. Halpin, B. (2016). Cluster analysis stopping rules in Stata. Working Paper WP2016- 01, Department of Sociology, University of Limerick. https://osf.io/rjqe3. Ham, E. H. (2014). Comparison between educator performance function-based and educator production function-based teacher effect estimation. Unpublished doctoral dissertation Michigan State University, East Lansing, MI. Hambleton, R., Swaminathan, H., & Rogers, H. (1991). Fundamentals of item response theory. Newberry Park, CA: Sage. Hanushek, E. A., & Rivkin, S. G. (2010). Generalizations about using value-added measures of teacher quality. The American Economic Review, 267-271. Harris, D. N. (2009). Would accountability based on teacher value added be smart policy? An examination of the statistical properties and policy alternatives. Education, 4(4), 319-350. 128 Harris, D. N. (2011). Value-Added Measures in Education: What Every Educator Needs to Know. Harvard Education Press. 8 Story Street First Floor, Cambridge, MA 02138. Harris, D. N., & Sass, T. R. (2006). Value-added models and the measurement of teacher quality. Unpublished manuscript. Holland, P. W., & Dorans, N. J. (2006). Linking and equating. Educational measurement, 4, 187-220. Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. Test validity, 129-145. Holland, P. W., & Wainer, H. (2012). Differential item functioning. Routledge. Horoi, I., & Ost, B. (2015). Disruptive peers and the estimation of teacher value added. Economics of Education Review, 49, 180-192. Huynh, H., & Meyer, P. (2010). Use of robust z in detecting unstable items in item response theory models. Practical Assessment, Research & Evaluation, 15(2), 1-8. Kim, C. M., Frank, K. A., & Spillane, J. P. (2018). Relationships among Teachers' Formal and Informal Positions and Their Incoming Student Composition. Teachers College Record, 120(3), n3. Koedel, C., & Betts, J. R. (2007). Re-examining the role of teacher quality in the educational production function. National Center on Performance Incentives, Vanderbilt, Peabody College. Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking: Methods and practices. Springer Science & Business Media. Ladd, H. F., & Lauen, D. L. (2010). Status versus growth: The distributional effects of school accountability policies. Journal of Policy Analysis and Management, 29(3), 426-450. Lee, L. (2011). What Did the Teachers Think? Teachers' Responses to the Use of Value- Added Modeling as a Tool for Evaluating Teacher Effectiveness. Journal of Urban Learning, Teaching, and Research, 7, 97-103. Lissitz, B., & Doran, H. (2009). Modeling Growth for Accountability and Program Evaluation: An Introduction for Wisconsin Educators. Wisconsin Department of Public Instruction. Lord, F. M. (2012). 
Applications of item response theory to practical testing problems. Routledge. Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the national cancer institute, 22(4), 719-748. 129 Martineau, J. A. (2016). Introduction to and Preliminary Evaluation of Student Response Theory. Unpublished manuscript, Center for Assessment, Dover, NH. Monk, D.H. (1987). Assigning Elementary Pupils to Their Teachers. The Elementary School Journal. 88(2), 166–87. Neal, D., & Schanzenbach, D. W. (2010). Left behind by design: Proficiency counts and test- based accountability. The Review of Economics and Statistics, 92(2), 263-283. National Council on Teacher Quality. (2017). State Teacher Evaluation Policy Yearbook: National Summary. Washington, DC. Paufler N.A. & Amrein-Beardsley, A. (2014). The Random Assignment of Students Into Elementary Classrooms: Implications for Value-Added Analyses and Interpretations. American Education Research Journal. 51(2), 328-362. Reckase, M. D. & Martineau, J. A. (2014). The Evaluation of Teachers and Schools Using the Educator Response Function (ERF) In Lissitz, R. W., Value Added Modeling and Growth Modeling with Particular Application to Teacher and School Effectiveness. Rothstein, J. (2008). Teacher quality in educational production: Tracking, decay, and student achievement (No. w14442). National Bureau of Economic Research. Rothstein, J. (2009). Student sorting and bias in value-added estimation: Selection on observables and unobservables. Education, 4(4), 537-571. Ryan, J., & Brockmann, F. (2009). A Practitioner's Introduction to Equating with Primers on Classical Test Theory and Item Response Theory. Council of Chief State School Officers. Samejima, F. (1977a). A use of the information function in tailored testing. Applied psychological measurement, 1(2), 233-247. Samejima, F. (1977b). Weakly parallel tests in latent trait theory with some criticisms of classical test theory. Psychometrika, 42(2), 193-198. Samejima, F. (2016). Graded response models. In Handbook of Item Response Theory, Volume One (pp. 123-136). Chapman and Hall/CRC. Sanders, W. L., & Horn, S. P. (1994). The Tennessee value-added assessment system (TVAAS): mixed-model methodology in educational assessment. Journal of Personnel Evaluation in Education, 8(3), 299–311. Thum, Y. M. (2003). Measuring progress toward a goal: Estimating teacher productivity using a multivariate multilevel model for value-added analysis. Sociological Methods & Research, 32(2), 153-207. 130 U.S. Department of Education (2009). Growth Models: Non-Regulatory Guidance. (January 12). http://www2.ed.gov/policy/gen/guid/significant-guidance.doc. Walsh, E., & Isenberg, E. (2014). How does value added compare to student growth percentiles? Statistics and Public Policy, 2(1), 1-13 Zieky, M. (1993). Practical questions in the use of DIF statistics in test development. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 35-66). Hillsdale, NJ: Erlbaum. Zwick, R., Thayer, D. T., & Lewis, C. (1999). An empirical Bayes approach to Mantel‐Haenszel DIF analysis. Journal of Educational Measurement, 36(1), 1-28. 83% 131