EXAMINING TEACHER PERCEPTIONS OF THE RELATIONSHIP BETWEEN EVALUATION POLICY AND TEACHER PRACTICE IN A NORTH CAROLINA SCHOOL SYSTEM By Amanda Marie Slaten Frasier A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Educational Policy- Doctor of Philosophy 2017 ABSTRACT EXAMINING TEACHER PERCEPTIONS OF THE RELATIONSHIP BETWEEN EVALUATION POLICY AND TEACHER PRACTICE IN A NORTH CAROLINA SCHOOL SYSTEM By Amanda Marie Slaten Frasier Examining the justification for current evaluation policy reveals that such policy rests on two assumptions related to the impact on the work of teachers: (1) evaluations are necessary because teachers need to be rated, sanctioned, or rewarded in order to be motivated to improve their practice; and (2) evaluations yield information that is useful for teachers to improve practice. Both assumptions have driven policy changes over time and carry implications for teacher classroom practice. This mixed methods study examines how a state-wide standardized evaluation policy utilized in North Carolina affects the work of high school teachers in a single school district under varying school and individual conditions. Specifically, this study focuses on teachers who offer perspectives from varying combinations of the following school-level variables: status at a high or low evaluation condition school and status at a high or low evaluation effectiveness school, and the following individual variables: status as a Mathematics or English teacher, years of experience, and licensure level. This dissertation tests the previously-stated assumptions about teacher evaluation and teacher work in a North Carolina school system in a to answer the following research questions: (1) What, if any, role do reported school evaluation conditions and school evaluation status play in shaping teacher motivation, experiences with feedback, and work decisions related to teacher evaluation? (2) What individual-teacher level factors are associated with differences in teacher motivation, experiences with feedback, and work decisions related to teacher evaluation? Analysis of the whole sample demonstrated that teachers did not find evaluation to motivate performance or to provide useful feedback. Though quantitative differences between school locations were not found, there were qualitative differences in how evaluation was related to practice across sites. Differences were also found in the evaluation-practice relationship between teachers of different licensure levels and different levels of experience where those in the lower designation perceived a greater impact of evaluation policy. Finally, differences between the subject areas of Math and English were identified, but may have been influenced by the capacity of observers and specifically, a lack of subject area alignment between the observer and the classroom in English, such alignment was present for some of the Math teachers in the study. Therefore, it is important to examine the context of evaluation, particularly the capacity of the administration that conducts evaluation. The results of this study suggest that the characteristics and capacity of an observer do matter in how the observation protocol is interpreted and implemented. Additionally, the evaluation climate and culture, or evaluation scenario of a school, may also influence the ways in which teachers find evaluation motivating and how teachers approach feedback from evaluation. The results of this study provide insight into the relationship between teacher evaluation and classroom practice, an area that has previously been under researched despite the impact other high-stakes accountability policies have had on teaching practices and the teaching workforce. For my mom “A mother is she who can take the place of all others but whose place no one else can take.” iv ACKNOWLEDGEMENTS In all the time it took me to research and write this dissertation, I never expected this part to be the hardest to write. I would first and foremost like to thank Dr. Michael Sedlak for recruiting me to the Educational Policy program at Michigan State University, for serving on my initial guidance committee, and for putting together a 2012 cohort of colleagues which offered immense professional and personal support to me throughout the program. The program provided me with invaluable opportunities to travel, research, and network which graduate students in other programs can only dream of. So, I am forever indebted to this program for the professional and personal development that I have received. I would also like to thank some fellow graduate students who provided feedback on earlier iterations of this work. Specifically, I must mention Alyssa Morley, Iwan Syrahil, Sarah Galey, and Jihyun Kim. Additionally, I must thank my fantastic committee. Dr. Anne-Lise Halvorsen became my academic advisor in my second year of the program and as such has been with me through a myriad of both personal and professional ups and down. We have worked together on several projects and I have learned an immense amount about academic life from her. Dr. Peter Youngs approached me about working together when he was transitioning to a position at another university. I am so grateful that he stuck by me and offered his expertise despite the fact that he was several states away. It says a lot about Peter that he was willing to keep working with students at another university despite not having any formal obligation to do so. Dr. Corey Drake and I met together while I was working as a graduate assistant on a project, and she agreed to become a member of my dissertation committee just a year prior to my dissertation when I v realized that I would be examining some issues in which she had professional interests. All three of the above mentioned committee members helped me through a period of my life where I very much was struggling. The second year of my graduate school program I was giving a final exam presentation in Peter’s class when my phone rang and I found out that my mother was on life support. She died less than 48 hours later, leaving me to deal with not only the emotional ramifications of her death, but the legal, physical, and practical as well. All three of these people continued to work with me, guide me, and mentor me at a time when I would otherwise have very well quit everything. Sometimes I only got by because of the immense sense of obligation I felt to these people. The final member of my committee, Dr. Madeline Mavrogordato joined my dissertation committee later on and really helped shape the methods of this study as well as my understanding of school leaders’ roles in implementing evaluation policy. I have especially enjoyed sharing a personal connection with Maddy over our mutual love of horses, dressage, and Chick-Fil-A. These four individuals have helped shaped this work and shape me as a professional. Thank you for your time, your expertise, and your belief in me. I also must extend my appreciation to the anonymous teachers who participated in this study. I wish there was more I could have given them for their help and their dedication to the profession. Finally, I have to thank my family. I came from a family where college was never discussed as an option, and despite being a high achiever in high school, I almost did not attend. I have to thank my good friend, Jeff, for helping me enroll in community college. He is as much family to me as anyone and without him I may still be waitressing at a Denny’s. I have to thank my sister and grandmother, neither of whom always understood what I was doing or what I have been up to, but who have supported me with their whole hearts throughout my entire education. vi (I promise, Aimee, this is the last graduation of mine you will ever be dragged to.) My husband, Chad, earned his PhD a few years ago and until then I had no idea that one could earn an advanced degree without having wealthy parents to pay for it. His experience prompted me to seek out my own doctoral program and he has done his best to support me in the same ways I supported his education. And finally, I need to thank my mother. How I wish so badly that you were here to see the conclusion of this journey! The truth is that the loss of you was as much of my graduate program as anything else. Over four years ago I was headed to my first professional conference in Europe and you dropped a bomb on me by telling me over the phone that you were having surgery to remove a cancerous kidney. You told me to get on the plane and I did. And I did it again and again. Losing you was the reason I wanted to quit and the reason I could not all in one. As I told you on the day you died, everything I am and everything I ever will be is because of you and I reflect on what you have taught me and on our relationship every single day. As I write this I am eight months pregnant and my only hope is that I can have the same influence on your grandchild that you had on me. Thank you for being my mother, for being my strength, and for always believing in me. This is for you. This is because of you. I love you. vii TABLE OF CONTENTS LIST OF TABLES xi LIST OF FIGURES xii KEY TO ABBREVIATIONS xiii CHAPTER 1: Introduction 1 CHAPTER 2: Evaluation Policy Background A Brief History of Teacher Evaluation A Brief Legal Review of the Governance Shift as Related to Evaluation Teacher Evaluation in North Carolina 8 9 16 18 CHAPTER 3: Literature Review Examining Policy Assumptions Assumption 1: Teacher Motivation Extrinsic motivation Intrinsic motivation Assumption 2: Feedback Teacher Responses to External Accountability Pressure Turnover Narrowing Curriculum and Teaching Testing Strategies Triaging Students Teacher Responses to Curriculum Reform Leadership Capacity and Evaluation Theory Driving Research 25 25 25 25 27 28 30 31 34 36 37 39 42 CHAPTER 4: Research Design and Methodology Participants and Sampling Strategy Teacher Working Conditions Survey and Evaluation Conditions Educator Effectiveness Database and the Effectiveness Score Phase 1: Surveys of Sample Schools Survey instrument Phase 2: Preliminary Interviews of Focal Teachers Phase 3: Follow-up Interviews of Focal Teachers Data Analysis and Establishing Validity Quantitative Qualitative Researcher Background and Neutrality School System Site School Sites School 1: Riley 45 46 47 51 55 56 57 58 59 59 59 68 69 70 73 viii School 2: Phoenix School 3: Charles School 4: Central 73 74 75 CHAPTER 5: Overall Trends Across All Teachers 77 Survey Participants 79 Comparing Sample Teacher and School Wide Perceptions of Evaluation Conditions 81 Teacher Perceptions of Last Year versus the Current Year about the Evaluation Process 82 Interview Participants 86 Evaluation as a Form of Motivation 87 Evaluation as a Source of Feedback 90 Perceptions of Feedback from Observation 91 Perceptions of Feedback from Testing 98 Feedback from Other Forms of Evaluation 101 Responses to Reform 104 Reform Typologies 106 Conclusion 108 CHAPTER 6: The Context of the School Site Comparing Perceptions of Evaluation and Practice between School Sites Insights from Focal Teacher Interviews School Vignettes Charles Central Riley Phoenix Do Evaluation Conditions and Effectiveness Matter? Evaluation Scenarios 112 113 116 119 119 122 125 129 132 134 CHAPTER 7: Individual-Level Characteristics and Teacher Evaluation Reported Licensure Level Survey Interview Discussion Seven-Year Status Survey Interview Discussion Subject Area Survey Interview Subject Area Specific Concerns Observation concerns Testing concerns 139 139 139 142 142 144 144 147 148 150 150 153 154 154 156 ix Conclusion 159 CHAPTER 8: Conclusions and Implications Implications for Research Implications for Policy and Practice Leadership and Evaluation Perceptions of Validity Altered Teacher Behaviors Reconciling Evaluation Policy for Both High and Low Stakes Purposes Limitations Concluding Thoughts 161 163 165 167 170 171 172 176 178 APPENDIX 181 REFERENCES 185 x LIST OF TABLES Table 1. Timeline of Educator Evaluation Changes in North Carolina Since 2009 20 Table 2. Teacher Working Conditions Survey Evaluation Related Questions 48 Table 3. Calculating Condition Score 50 Table 4. Evaluation Rubric 53 Table 5. Establishing an Evaluation Effectiveness Score 54 Table 6. Interview Code Descriptions 61 Table 7. School-Level Demographics 72 Table 8. Survey Respondents 80 Table 9. Responses on Teacher Working Conditions Replication Questions 81 Table 10. Paired T-Tests of Statement Themes Reflecting on Last Year Versus This Year 85 Table 11. Interview Participants 86 Table 12. Interviewee Code Use 87 Table 13. Complementary Question Set Means by School 114 Table 14. Code Interview Case Count by School 118 Table 15. Independent Sample T-Test of Survey Themes by Reported Licensure Level 141 Table 16. Occurrence of Codes in Interviews by Licensure Status 142 Table 17. Independent Sample T-Test of Survey Themes by Seven-Year Status 146 Table 18. Occurrence of Codes in Interviews by Seven-Year Status 147 Table 19. Independent Sample T-Test of Survey Themes by Subject Area 152 Table 20. Occurrence of Codes in Interviews by Subject Area 154 xi LIST OF FIGURES Figure 1. Framework to Guide Research on Evaluation and Practice xii 44 KEY TO ABBREVIATIONS AIG Academically Intellectually Gifted ARRA American Recovery and Reinvestment Act AYP Annual Yearly Progress CCSS Common Core State Standards EC Exceptional Children ELA English Language Arts ELL English Language Learner EOC End of Course ESEA Elementary and Secondary Education Act FARPL Free and Reduced Price Lunch MET Measures of Effective Teaching NC North Carolina NCFE North Carolina Final Exam NCES National Center for Educational Statistics NCLB No Child Left Behind PDP Professional Development Plan PLC Professional Learning Community RttT Race to the Top SWD Students with Disabilities TWC Teacher Working Conditions VAM Value Added Model xiii CHAPTER 1: Introduction Current teacher evaluation policies have emerged from policymaker critiques that previous systems of evaluation did not accurately identify the effectiveness of teachers and that many teachers were often rated as high performing. One example of such a critique is found in the Measures of Effective Teaching (MET) project, sponsored by the Bill and Melinda Gates Foundation, which began in 2009 and is the largest study of teacher evaluation to date. In justifying the project’s worth, the Gates Foundation described previous evaluation schemes as “not providing the information needed to close the achievement gap. Despite 40 years of research pointing to huge differences in student achievement gains across teachers, most school districts and state governments cannot pinpoint what makes a teacher effective or identify their most and least effective teachers” (Bill & Melinda Gates Foundation, 2010, p. 2). The justification provided by the Gates Foundation identifies schools and districts as ineffective evaluators of teachers. Furthermore, studies have found that large numbers of teachers have rated highly across states. For instance, the New Teacher Project study found that for evaluation systems with only two ratings, “satisfactory” and “unsatisfactory,” 99% of teachers earned a satisfactory. In evaluation systems with more than two ratings, 94% of teachers received one of the top two ratings and less than 1% were rated unsatisfactory (Weisberg, Sexton, Mulhern, & Keeling. 2009). The Weisberg et al. study termed this top-heavy sort of assessment as “the Widget Effect.” Aside from ranking teachers inaccurately, another criticism of local based evaluation systems is that the perfunctory nature of evaluation does not provide meaningful feedback to improve practice. These critiques have led to reform in evaluation, often centering policy at the state level rather than the district level, and typically including both standardized observation measures and growth data based on student performance. 1 Examining the justification for current evaluation policy reveals that such policy rests on two assumptions related to the impact on the work of teachers: (1) evaluations are necessary because teachers need to be rated, sanctioned, or rewarded in order to be motivated to improve their practice; and (2) evaluations yield information that is useful for teachers to improve practice. Both assumptions have driven policy changes over time and carry implications for teacher classroom practice. Following the adoption of standardized observation protocols and value-added models (VAMs) meant to measure student growth by individual teachers, a large body of literature has examined both the technical aspects of evaluation (e.g., Baker et al., 2010; Bill and Melinda Gates Foundation, 2013; Corcoran, 2010; Glazerman et al., 2011; Goldhaber, Goldschmidt, & Tseng, 2013; Harris, 2009; Hill, Kapula, & Umland, 2011; McCaffrey, Lockwood, Koretz, & Hamilton., 2003; Raudenbusch & Jean, 2012; Rothstein & Mathis, 2013; Sanders & Horn, 1994) as well as the resource and infrastructure demands such systems place on schools and districts (e.g., Anagnostopoulos, Rutledge, & Jacobsen, 2013a; Mintrop & Sunderman, 2013; Thorn & Harris, 2013). The effective labelling and sorting of teachers into ranked categories is thought to be important because other research has been unable to identify the characteristics of effective teachers (Ballou & Podgursky, 1998; Boyd, Grossman. Lankford, Loeb, & Wyckoff, 2009; Darling-Hammond, Holtzman, Gatlin, & Heilig, 2005; Goldhaber & Brewer, 1997; Harris, 2009) or what type of preparation best prepares one for the classroom (e.g., Goldhaber & Hansen, 2010). What remains unclear from current research is (1) the effect that evaluation systems have on teacher classroom practice as individuals go through high-stakes individual evaluation cycles and (2) the extent to which teachers use feedback from evaluation to further guide classroom practice. This dissertation addresses this gap in the literature by examining teacher perceptions of 2 the interaction between evaluation policy and classroom practice considering differences at the school- and individual- levels. This study examines how a state-wide standardized evaluation policy utilized in North Carolina affects the work of high school teachers under varying school and individual conditions in the same school district. Specifically, this study focuses on teachers who offer perspectives from varying combinations of the following school-level variables: status at a high or low evaluation condition school and status at a high or low evaluation effectiveness school, and the following individual variables: status as a Mathematics or English teacher, years of experience, and licensure level (the latter two are linked to the number of evaluations a teacher receives). For the purposes of this dissertation, I use responses from evaluation-themed questions from the 2016 administration of the biannual North Carolina Teacher Working Conditions (TWC) survey to identify school status as having high or low evaluation conditions based on site deviation from district averages. I define school status as a high or low evaluation school in a similar manner by using data from 2015-2016 from the Educator Effectiveness Database Section of the North Carolina School Report Card system. In both cases, data from 2015-2016 is used because that was the most current data available at the time of the study and represents the school year immediately preceding the study year. Additionally, I track varying characteristics of teachers such as their subject area certifications, years of experience (career status), and licensure level through survey responses. The rationale for and additional explanations of these definitions and methods will follow in a subsequent section. Using the aforementioned variables, this dissertation tests the previously-stated assumptions about teacher evaluation and teacher work in a North Carolina school system to answer the following research questions: 3 (1) What, if any, role do reported school evaluation conditions and school evaluation status play in shaping teacher motivation, experiences with feedback, and work decisions related to teacher evaluation? (2) What individual-teacher level factors are associated with differences in teacher motivation, experiences with feedback, and work decisions related to teacher evaluation? These questions were answered in a mixed method study using a combination of quantitative data analysis of survey results and qualitative analysis of interview transcripts. Chapter 2 presents background information on teacher evaluation policy. The chapter starts with a brief history of teacher evaluation in the United States, particularly in relation to the larger movement to increase school accountability. Next is a legal review of how teacher evaluation policies are linked to an overall shift in governance over schools. Finally, I describe the historical, legal, and policy context of evaluation in the study state of North Carolina. Chapter 3 provides a literature review and develops a framework to investigate questions about the relationship between teacher evaluation and teacher practice. The chapter starts with examining the two policy assumptions motivating teacher evaluation policy, namely teacher motivation and feedback use. There is a gap in current literature on the evaluation-practice relationship. Therefore, two related areas of literature are explored to anticipate how teachers may respond to evaluation policies in practice: teacher responses to external accountability pressure and to curriculum reforms. Because school administrators play a large role in how evaluation is implemented at the school-level, I also include a section on leadership capacity and evaluation. Finally, I present a framework for pursuing research on the relationship between teacher evaluation and teacher practice. 4 Chapter 4 includes the research design and methodology for this dissertation. The chapter includes a description of my sampling strategy, including methodology for calculating schoollevel Evaluation Condition and Effectiveness Scores and the selection of sites based on those calculations. Additionally, I delineate my three phases of data collection including: survey administration, preliminary focal interviews, and follow-up focal interviews. I then discuss how I analyzed data and established validity of my findings. Finally, I provide descriptions of the school system and the four school sites. In Chapter 5, I use data from the entire sample of teachers to identify overall trends in how teachers perceive the practice and evaluation relationship. First, I present the results of questions which replicate the North Carolina Teacher Working Conditions survey, in which the original survey was used to calculate school-level Evaluation Condition Scores. Next, I compare teacher perceptions of the prior year to the study year. I then use the literature derived framework from Chapter 3 to examine both of the two policy assumptions of evaluation (as motivation and as a feedback tool). Additionally, I use literature from teacher responses to external accountability measures to identify similar responses fueled by the evaluation policy in this study. Finally, I identify teacher reform typologies using categories derived from literature on teacher responses to classroom reform. Chapter 6 examines school-level differences across research sites. I initially hypothesized that schools with varying levels of effectiveness scores and varying evaluation conditions would perceive evaluation differently. However, the quantitative data showed no significant differences between schools. I then draw on interview data to explain the quantitative data and offer alternative theories. I also illustrate that despite the lack of statistical findings, there were stark differences in how evaluation affected teachers across schools as evidenced in 5 the interviews. To do this, I present vignettes of each of the four schools and describe three evaluation scenarios which emerged from the interview data. Chapter 7 investigates the individual-level teacher characteristics of: licensure, years of experience, and subject area to identify differences in how teachers of various characteristics perceive the relationship between evaluation and practice. Each of the three characteristics include a separate presentation of the survey data, interview data, and discussions of the characteristic. For the characteristic of subject area background, specific concerns around observation and testing are presented. Finally, Chapter 8 will offer concluding thoughts on the dissertation. This will include implications for research as well as for the practice of evaluating teachers and for evaluation policy implementation at the school-level. Specifically, the following areas will be explored: leadership capacity, perceptions of evaluation validity, and altered teacher behaviors. Additionally, some of the possible unintended consequences of the evaluation policy in this study are discussed. Policy recommendations will be provided for how evaluation for both high and low stakes purposes may be reconciled to allow for more effective use. Limitations of this study are also discussed. Overall, the results of this dissertation demonstrate that teachers do not find evaluation policy to be motivating or to provide feedback that is useful to changing practice. However, some unintended consequences of teacher evaluation policy emerge in patterns similar to what has been found in research on other external accountability measures. School-level results indicate that approaches to observation and testing are not standardized across sites, despite efforts to create a policy which is uniform. Additionally, the way in which school administration approaches the components of evaluation influences teacher perceptions, possibly impacting the 6 success of the policy or further leading to unintended policy consequences that may negatively impact the teaching workforce and/or the work of teachers. Additionally, some are differences demonstrated between groups of individual-level characteristics. For instance, differences between licensure and experience levels may be linked to the frequency of evaluation and the increased high-stakes for those who have lower levels of licensure or experience. The statistical differences in subject area may be related more to the conditions under which individuals are evaluated, particularly in regard to the capability of the evaluating administrator, rather than characteristics that are inherently linked to teacher subject area background. These findings suggest that despite attempts to standardize evaluation protocols, differences in school and individual contexts may result in differing evaluation experiences and differing relationships between evaluation policy and individual teacher practices. At the time of this writing, there is a gap in the literature on how formal teacher evaluation policy is related to classroom practice. This is an important question to consider because evaluation, by definition, defines what is valued in whatever is being appraised. Additionally, such policies are touted by policymakers as being necessary to motivate teachers to do a better job and to provide feedback for them to do so. Therefore, it is important to consider whether formal policies do motivate and provide feedback to teachers, and if such policies do these things, then to consider in what ways teacher practice changes as a result? This dissertation begins to answer important questions around evaluation and practice as related to the study context. Such information is useful when weighing the costs and benefits of high-stakes teacher evaluation policies. 7 CHAPTER 2: Evaluation Policy Background Evaluation is a process in which the characteristics of what is valued are identified and appraised. Traditionally, the evaluation process for teachers in the U.S. has been a local affair consisting of classroom observation and local personnel preferences, such as the teacher’s ability to coach or teach certain subjects, with limited standardization among the protocols, frequency, or observers utilized (e.g., Tyack & Cuban, 1995). Cohen (2011) explained that in the past, “conceptions of teaching quality were tied to a teacher’s years of education, degree attainment, and years of experience, none of which are closely related to the quality of work in the classroom” (p. 63). So, definitions of good teaching have often been determined at the local level. This created variety among evaluators (depending on preferences and experiences) and from site to site (depending on implementation and fidelity) due to variability in values among both individual evaluations and local school systems. Considering the impact that teachers have on students’ success, critiques that locally based evaluation systems may make removing “bad” teachers who are protected by tenure due to lack of evidence of their ineffectiveness and that personal preferences of an administrator may keep ineffective teachers in the classroom, are valid (Chetty, Friedman, & Rockoff, 2011; Hanushek & Rivkin, 2010; Haycock, 1998; Nye, Konstantopoulos, & Hedges, 2004; Rowan, Correnti, & Miller, 2006; Sanders & Rivers, 1996; Schacter & Thum, 2005). Over time, such critiques have led to formal policy changes affecting the ways in which teachers are evaluated. This chapter briefly delineates the history of teacher evaluation in the United States with a focus on the last two decades, which highlight a marked shift from local control over evaluation systems to the use of various interventions from both state and federal governments. The second half of the chapter describes the educational context of the study state, North Carolina. 8 A Brief History of Teacher Evaluation Historically, decisions about hiring, evaluating, retaining, and firing teachers have been made at the school-level and school administrators have generally been able to exercise a great deal of freedom in selecting teachers for open positions (e.g. Tyack & Cuban, 1995). For instance, the evaluation of schools in general can be traced back to the Common School Era in Massachusetts where Horace Mann rode from school to school writing analyses of each location he visited (Mann, 1868). And while evaluation policies have not always been formalized, teachers have always been held accountable to someone for something, whether it be the tidiness of the classroom or whether students could recite memorized text to an audience. What teachers were held accountable for and who they were accountable to has varied, but such accountability was always tied to the retention of a teaching position. However, in the early days of American schooling, the values that defined good teaching were determined and defined at the local level. Over time state, and later, federal government became increasingly involved in matters of school regulation, including the regulation of teacher quality, which includes evaluation. While full federal intervention in public schools is fairly recent, the first attempts to federally influence education can be traced to the aftermath of the Civil War with Congressional debate over establishing a federal Department of Education in 1866, followed by the failed Hoar Bill of 1870 which attempted to establish federal takeover of public schools which were failing (Newman, 2013). Such early attempts failed to be implemented, but presented policy frameworks which manifest in contemporary federal education policy. In the last two decades, as federal influence has increased via directing state-level policy in schools through both mandates and incentives, some traditionally locally held powers, such as 9 control of teacher evaluation, have shifted and become more centralized, at least in part, at the state level. This shift has been gradual, with early critics pointing out that such changes have increasingly de-professionalized teaching. For instance, Giroux (1985) contended over 30 years ago that curriculum policies disempowered teachers and reduced their status to that of a highlevel technician of objectives and goals created by people with no experience with classroom realities. Although this shift in power structure has occurred gradually over time, the 2009 Race to the Top (RttT) initiative incentivized states to create legislation that sometimes drastically changed local districts’ and schools’ ability to control how they recruit, compensate, and maintain their teaching workforce. In this chapter, details about these other policy points may be included in cases where such policies are linked. Finally, I briefly describe how teacher evaluation policy has evolved to its current nature at the time of this study. As the U.S. has undergone a shift from a tradition of local control to a more centralized governance structure, the values defining a “good” education and “good” teaching have also shifted and become more universally defined by policy. What is valued in education is not something necessarily stated explicitly in most policies, but instead values are something that can be decoded from various sources such as student learning standards, classroom curriculum, teacher preparation requirements, professional development components, professional licensing requirements, and performance evaluations. Evaluations and other accountability mechanisms may be the most influential component of defining educational values because such measures explicitly state what should be accomplished in the classroom and to what degree it should be accomplished. Likewise, the shift in governance has been accompanied by an increase in accountability from local to external (non-local) sources, which has created a greater and perhaps more narrowed consensus of what is valued in schools. Additionally, it is unclear how such high- 10 stakes accountability policies derived from state governance affect the work of teachers. It is possible that such policies, when centralized at the state level, may bear more influence on individuals than previous local evaluation policies and therefore, cause greater impact to the teaching workforce. Although centralization is broadly defined as the consolidation of power at a higher level of government, at issue here is the transfer of power over decisions regarding the teaching workforce from local governing bodies to the state level, which often occurred under the direction and guidance of the federal government. While some aspects of this move to centralization at the state level, such as the creation of state teacher certification (the first state tests emerged in the 1860’s which were followed by university preparation programs in the early 20th century), occurred much earlier, much of the more recent evidence of this shift can be seen in what has been termed the “Accountability Movement” (Vinovskis, 2009). The Accountability Movement included a shift to standards-based reform and outcome-based education models. Mintrop and Sunderman (2013) describe the evaluation movement that has accompanied increased centralization in school governance as occurring in three waves. These waves offer a framework for understanding the progression of school accountability policy and indicate that over the last two decades student test scores on standardized tests have increasingly served as a proxy for student learning. Additionally, states or localities have used these measures to influence teacher pay, retention, or promotion. The first wave of this accountability involved experiments in states, such as Texas, and localities, such as Chicago (Mintrop & Sunderman, 2013). The seeming success of these smaller scale experiments largely inspired the second wave of reform. 11 The second wave formed at the national level with a series of educational goals first presented by President George H.W. Bush and his America 2000 plan, and later refined by President William Clinton’s Goals 2000: Educate America Act (Vinovskis, 2009). Both plans introduced national goals for education and were precedents for President George W. Bush’s No Child Left Behind (NCLB), a renewal of the Elementary and Secondary Education Act of 1965 (ESEA), which was a law passed by President Lyndon Johnson that established federal funding for schools as part of his “War on Poverty.” NCLB introduced federal guidelines for states as well as punitive measures for schools failing to meet expectations (Vinovskis, 2009). Additionally, after the passage of NCLB in 2001, test scores became a main component of measuring the effectiveness of individual schools and districts, representing the second wave of accountability, where failure to make targeted improvements in different measures led to sanctions including the possibility of state takeover (Mintrop & Sunderman, 2013). The second wave marked an era of sanctions where federal guidelines required state takeovers or closures of schools deemed to be “failing” to make established growth guidelines. These takeovers differentially impacted poor socio-economic areas and occurred primarily in urban districts, such as Chicago. America 2000, Goals 2000, and NCLB paved the way for the Obama administration’s Race to the Top (RttT) Initiative of 2009, which was followed shortly by the NCLB/ESEA waiver program, which prompted states to undergo several legislative changes to reform education in order to compete for money to supplement state budgets, or in the case of the waivers, to seek relief from NCLB mandates. The RttT initiative was funded by the American Recovery and Reinvestment Act of 2009 (ARRA) which allocated $4.35 billion dollars for the RttT program. Although only 12 states received the funds, the application process required 12 changes to existing school systems and governance structures at the state level and legislative changes occurred in all applying states. One required change for states applying for RttT funds was to alter teacher evaluation policies and adopt student growth as a main measure of teacher evaluation as well as standardizing previously used observation protocols (US Department of Education, 2009). Thus, through RttT federal values have influenced the states’ assumption of previously-held local powers over the teaching workforce. So, along with other changes to school policy, RttT enticed states to implement new personnel laws, including revamping teacher evaluation systems to include student growth measured by state test scores along with the use of standardized observation data as part of a requirement for multiple measures of evaluation (US Department of Education, 2009). Furthermore, these evaluations were required to be attached to personnel retention decisions, which prompted states to make changes that eliminated or reduced tenure. In most cases, these personnel laws were changed along with laws that permitted greater numbers of charter schools, increased alternative pathways into the teaching profession, mandated the creation of statewide data systems to serve as repositories of information on both students and personnel, and made changes to state-level student academic standards, largely through the adoption of the Common Core State Standards (CCSS). Thus, in many states, the RttT legislation greatly impacted the way school was managed including the ways in which teachers were hired, retained, and fired. It is important to note that due to the simultaneous adoption of multiple policies, policy actors (including teachers) may be unable to discern these as separate, distinct changes. In other words, changes to things like evaluation, tenure, and teaching standards, having occurred concurrently may appear like a “package deal” to teachers who are influenced by all components of the package simultaneously. 13 Additionally, states, partially based on resource disparity and partially based on existing systems and traditions, have varied greatly in their approaches to meeting these new federally-inspired laws. Under NCLB, schools faced sanctions for failing to grow student scores in accordance with goals set for Annual Yearly Progress (AYP). So, the shift to using student test scores as a proxy for teacher rather than school effectiveness represents the latest incarnation of test scores as a proxy of student learning and represents the third wave of accountability as espoused by Mintrop and Sunderman (2013): one that is focused on the effects of the individual teacher. The legislative changes in state level teacher evaluation policy that occurred during RttT coincide with the third wave. An unintended consequence of these legislative changes was a further narrowing of what policy values as important in education as tested schools undergo more intense microscopic examination under these teacher-focused policies. However, state policies attempt to mitigate this by pairing the student effectiveness component of evaluations with standardized observations to create multiple forms of measurement. Evaluations using both observation and student growth measures are intended to be more concrete and uniform across systems than precursors which were often designed at the local level based on local values and priorities. The rationale behind the change was that multiple measures of teacher effectiveness will produce a fairer rating and better feedback than if districts relied upon a single measure instrument. However, it is important to remember that despite these federally inspired changes, there are still policy discrepancies across and even within states. Furthermore, the publicity accompanying such legislative changes often touted teacher evaluation policy as a much needed and previously unexplored area of educational governance, 14 which often obscured the fact that teachers have always been held accountable for their practice in some way. What has changed is the technology behind teacher evaluation, the shift in educational values linked to such measures of teacher quality, and the demand for new infrastructure required by utilizing sophisticated psychometric techniques such as VAMs (Anagnostopoulos, Rutledge, & Jacobsen, 2013b). Such infrastructure has not previously been evident in U.S. schools which have lacked a system of common evaluations, standards, and frameworks; this makes teaching in Americans schools much different from other skilled, service occupations (Cohen, 2011). As such, many have described the shift from NCLB to the RttT requirements for teachers to be a shift from a designation of “highly qualified” to one of being “highly effective” as localities are asked to focus less on what qualifications teachers bring to the job, but rather what sorts of results are produced by teachers (Powell, 2013). NCLB remained in effect until December 2015. In 2011, shortly following the announcement of the RttT competition, then U.S. Secretary of Education Arne Duncan instituted a waiver program whereby states could seek flexibility from specific provisions of the federal legislation, most specifically the unobtainable 100% proficiency requirement. As of 2014, 42 states and the District of Columbia had applied for and obtained ESEA waivers, but many lawmakers viewed them as an unconstitutional subversion of federal policy in exchange for the adoption of executive branch preferred policies (Epenbach, 2014; Umpstead & Kirby, 2012). Regardless, sweeping legislative changes occurred in many states due to a combination of RttT application and NCLB waiver requirements. The third generation that Mintrop and Sunderman (2013) described is the current wave at the time of this dissertation and includes the latest federal influence, the RttT competition and its inspired legislation. What distinguishes this third wave is increased focus on the accountability 15 of individuals rather than entire schools or systems. In evaluation, this has manifested as evaluation systems that include psychometric measures meant to gauge an individual teacher’s exact effect on a student as measured on a standardized test. The third wave has also brought about a standardization of observation protocols for teachers and a greater value placed on teacher performance on evaluations when considering job retention. It is notable that most states that changed their laws did not receive the RttT funding; however, most did eventually receive a waiver from NCLB compliance. Therefore, most states were tasked with implementing unfunded, mandated changes to schools and systems. What is of interest here are the changes related to the standardization of teacher evaluation and the narrowing of accountability focus to the level of the individual. Thorn and Harris (2013) characterized this shift as follows: “This shift in the way we measure success in education represents a sea change, with consequences for the way schools operate as well as for the individual autonomy that teachers came to expect during the past half-century (p. 57),” a sentiment that suggests that macro-level policies can and do effect teachers at the classroom level. A Brief Legal Review of the Governance Shift as Related to Evaluation Several law reviews acknowledge how both NCLB and RttT have affected educational governance structures at the state and local level. For instance, Garda and Doty (2013) argued that NCLB compelled states to implement “far ranging governance reforms for failing school districts and Title 1 schools,” but that these efforts at the individual school-level have failed (p. 2). RttT, however, incited governance changes at the state level. The review outlines the requirements of NCLB’s annual yearly progress (AYP) requirement and discusses many of the legal issues that resulted from such mandates. For instance, Reading School District v. 16 Department of Education illustrates one of many failed attempts of schools and districts to contest the labeling of schools as not meeting AYP (Garda & Doty, 2013). While NCLB was only enforceable through the mechanism of withholding federal Title 1 funds, RttT enticed states to change laws to meet federal values and priorities through grant applications and NCLB waivers. Garda and Doty further pointed out that the failures of NCLB and RttT to create meaningful reform have not been a result of complex legal issues or lawsuits, but rather from political resistance (2013). This suggests that the issue states have with federal influence is a result of changes to the power structure and governance. Umpstead and Kirby (2012) also acknowledged several of the high-profile lawsuits that challenged NCLB, particularly those focused on the limited funding available to states who were tasked with implementing what was essentially an unfunded federal mandate, such as: School District of Pontiac v. Secretary of the Education Department and Connecticut v. Duncan as well as those regarding NCLB’s effects on student achievement, such as: Levi v. O’Connell, Board of Education of Ottawa Township High School District 140 v. US Department of Education, and Coachella Valley Unified School District v. California. This piece noted that the NCLB waivers may have been unconstitutional due to coercing states to adopt other policies found preferable by the Obama administration, and initial drafts of the Obama administration’s ESEA reauthorization included many of the same provisions present in the RttT and NCLB waiver applications, which led to a delay in the law’s reauthorization (Umpstead & Kirby, 2012). Issues of teacher quality were also addressed, most specifically through NCLB’s highly-qualified teacher provision, which created a variety of designations across states trying to meet the mandate. For instance, the lawsuit Renee v. Spellings challenged California’s designation of teachers without full certification as highly qualified, an opinion that was upheld in the appeal Renee v. Duncan, 17 leading to Congress responding by adjusting the law and further illustrating the complex relationship between law, governance, and education (Umpstead & Kirby, 2012). Furthermore, Barnes (2011) outlined the history of ESEA leading to RttT and contended that given the results of previous federal initiatives, the only benefactor of resulting RttT legislation was “big government” and contended that the program led to the violation of individual liberties. Her arguments are linked mainly to previous litigation that resulted from attempts to create standards in education, yet the criticism that RttT violates individual liberties could also be applied to teaching issues, such as the loss of due process rights through discontinuing tenure and the loss of a fair and transparent evaluation procedure. Similarly, Powell (2013) directly investigated issues of teacher quality including the weakening of the tenure system. She contended that tenure is not the reason that ineffective teachers become difficult to fire, but rather that this is due to the ineffective and unreliable procedures utilized in teacher evaluation. Citing studies such as the New Teacher Project’s “Widget Effect” (Weisberg et al., 2009), Powell stressed that states need to not only adopt legislation required to change evaluation procedures, but also to implement strategies to attract and retain effective teachers; in her view, this includes a streamlined evaluation process and the maintenance of due process rights. Teacher Evaluation in North Carolina North Carolina was an ideal location for examining the convergence of state-level evaluation policy and classroom conditions due to its strong, pre-existing statewide evaluation policy. Unlike many other states, North Carolina designed a precise evaluation instrument that all districts were required to utilize that pre-dated RttT (Table 1). This existing system was one reason why North Carolina was able to score highly on the RttT application and become one of 18 the states that received funding through the program. Upon the announcement of the RttT competition, North Carolina broadened the evaluation to include a value-added model (VAM) of student performance and changed infrastructure related to the evaluation to accommodate the new policy. North Carolina also was one of the 12 states that received RttT funding in 2010. Because the statewide evaluation system has been in place in some form prior to RttT and has been ingrained as part of teacher practice for many years, North Carolina schools are an excellent place to examine how such policies impact classroom practices. 19 Table 1 Timeline of Educator Evaluation Changes in North Carolina Since 2009 Year Relevant Legislation What happened Dec 2009, TCP-C-004 Establishes three types of evaluation cycles and a process for Updated Feb 2015 16 NCAC 06C .0503 performance appraisal. Dec 2009 TCP-C-019 Teacher and principal evaluations must be submitted to the state superintendent annually. July 2011 115C-333 State must be notified of employee dismissals. Aug 2011 TCP-C-022 All systems must evaluate all teachers annually and must include the student growth component. August 2012 TCP-C-006 August 2013 Current Operations and Capital Improvements Appropriations Act of 2013, ch. 360, 2013 N.C. Sess. Laws 995 Standard six, the student growth standard, is added to the evaluation. One-year contract structure is initiated for teachers who have not met career status recognition, requiring full annual evaluation cycles for all teachers without career status indefinitely. Permanent elimination of all career status designations to occur in 2018 (currently ruled unconstitutional). 20 In 2009, the North Carolina General Assembly passed a mandate to create teacher evaluation procedures that supplemented and supported newly-created State Board of Education requirements under TCP-C-006 (North Carolina State Board of Education, 2012b). The policy also specified a process for professional growth plans for teachers. Meanwhile, TCP-C-019, which was created in December 2009, specified that all teacher and principal evaluations must be submitted to the state superintendent annually (North Carolina State Board of Education, 2012a). By the time TCP-C-006 and TCT-C-019 had been passed, the state had already begun a massive state-wide roll out of what was then termed the “New Teacher Evaluation.” The evaluation at this time consisted of five observation standards and included pre- and postconferences as well as a year-end summative conference. Training was provided for administrators to ensure fidelity to the instrument and training was also provided to teachers. These trainings occurred at the school-level, were provided by staff from the North Carolina Department of Education, and were mandatory for all teachers. Upon applying for RttT funding, North Carolina added a sixth standard which accounted for “student growth.” Trainings on this standard occurred in spring 2012 and the standard was included with 2012-2013 evaluation onward (North Carolina State Board of Education, 2012a). In 2016, the North Carolina Department of Education announced that student growth will be removed as a stand-alone standard and would instead be incorporated into the other five standards. However, the logistics of that transition were not yet clear at the time of this study. So, under the system which was current at the time of this study, all teachers in the state were measured against the state instrument consisting of five observation type standards and one student growth standard. State Board Policies and Statutes TCP-C-004, most recently updated in February 2015, established the performance appraisal process including: the creation of three 21 types of evaluation cycles dependent on a teacher’s certification and administrator assignment, and a process including training, orientation, self-assessment, observation, pre- and postconferencing, and summative evaluation. At the time of this study, the evaluation is administered differently depending on whether a teacher has received “career status” in their district. A oneyear contract structure was instituted in 2013 under the Operations and Capital Improvements Appropriations Act of 2013, which required teachers who had not achieved career status by that time to undergo a full evaluation cycle each year indefinitely (four formal observations). In other words, teachers who did not earn career status prior to 2013 are no longer eligible for that designation and the evaluation process follows that distinction. The law also stated that career status would be removed from all North Carolina teachers at the conclusion of the 2017-2018 school year. Court litigation and several rounds of appeals followed the passage of this act with the most recent update being that the one-year contracts for those who never made career status has been upheld, but the repeal of tenure for those with career status had been unanimously deemed unconstitutional by the NC Supreme Court in June 2015. However, the law remains active at the time of this study. Therefore, a teacher who has career status would be required to complete only an abbreviated evaluation each year consisting of two abbreviated observations that may not cover all the standards. The exception is teachers who are renewing their licensure in the current year who are also subject to a more intense evaluation cycle consisting of four observations, regardless of having career status. These requirements can be modified based on administrator discretion and teachers may receive more evaluations than what the state requires if administration decides. As previously stated, a repeal of career status entirely would mean that all teachers in the state would have to undergo a full evaluation cycle annually. 22 Moreover, the results of each teacher’s individual evaluation are reported and tracked at the state level. Therefore, state-level administration can gauge the effectiveness of any teacher in the state, according to the evaluation instrument, at any time using the state data system. A teacher’s effectiveness is tracked throughout their career so long as they remained in the state of North Carolina. As a growth instrument, teachers are expected to “grow” on their evaluation throughout their career. This is a marked departure from previous systems where teachers could leave past effectiveness ratings behind by obtaining a new job in a different school system. Furthermore, 115C-333, also passed by the North Carolina General Assembly, requires the notification of the State Board of Education upon dismissal of employees. These policies demonstrate the power over the teaching workforce that is held by state-level institutions following RttT. The longitudinal tracking of teachers at the state level coupled with the legislation attaching evaluation to employment retention makes this evaluation policy particularly high-stakes for teachers in North Carolina. Additionally, because the policy and the instruments were designed at the state level, and because evaluators’ scores are tracked by the state, it is possible that what the state values in education overtakes what local administration values about quality teachers and teaching. North Carolina is an ideal site for research on the relationship between evaluation and practice because the state-level policy is so strong. Not only are all teachers subject to the same evaluation protocol, but the results are reported directly to the state-level. Additionally, large numbers of teachers do not have career status and as such are under one-year contracts and subject to full evaluation cycles consisting of at least four observations annually. Moreover, the state has gone to great lengths to eliminate career status altogether, which would make all teachers subject to one-year contracts and full evaluation cycles if the Supreme Court decision is 23 not upheld. These changes in career status, coupled with evaluation results being a top consideration for lay-offs in cases of reduction in force, make the evaluation policy high-stakes for teachers in North Carolina. 24 CHAPTER 3: Literature Review In this chapter, I review the literature relevant to my research questions. In doing this, I build a theory upon which my research is based. I first approach the two aforementioned evaluation policy assumptions, that evaluation simultaneously motivates teachers and provides feedback to improve practice, by separately exploring the ideas behind teacher motivation and the relationship between feedback and teacher practice. I primarily draw on two bodies of literature to further situate my study. Because there is a gap in the literature regarding the relationship between teacher evaluation and teacher practice, I first review literature on how teachers have responded to other external accountability pressures, specifically pressures resulting from NCLB. Secondly, I review literature on how teachers modify practice to accommodate curriculum reforms. Additionally, to better understand the differences in school contexts, I review literature on the relationship between school leadership and evaluation implementation. Examining Policy Assumptions Assumption 1: Teacher Motivation This sub-section addresses the policy assumption that evaluations are necessary because teachers need to be rated, sanctioned, or rewarded in order to be motivated to do a better job. Firestone (2014) identified two theories of motivation that guide thinking about evaluation. The first theory Firestone describes is an economics-based theory focused on external rewards and the second theory is based in psychology and focused on intrinsic reward with teachers improving practice through assessment, feedback, training, and professional development. Extrinsic motivation. The theory that teachers are most motivated by external forces comes from the field of economics. Such thinking is reflected in a number of existing financial 25 policies such as career ladder pay scales, bonuses or salary increases for passing proficiency exams, recruitment and retention bonuses, and performance or merit-based pay. At issue in the policy context of this study is what Firestone argues is “the most powerful incentive… access to employment itself” (2014, p. 102). In North Carolina, teacher evaluation is the top criterion for deciding which teachers will be removed from employment when there is a reduction in force in a school system. Additionally, evaluation results are reported to the state and past performance is accessible to other potential employing schools statewide. In contrast to teachers, students under the same system “have no direct incentives to perform in such schemes, apart from whatever pressure their teachers can create” (Cohen, 2011, p. 74). This means that teachers often must persuade students that academic work, specifically the test upon which part of teacher observation scores are based, is even worth doing (Cohen, 2011). As a result, teachers may alter behaviors to try and improve student achievement in ways that would favorably influence results. Aside from the idea that poorly-performing teachers should be removed from the system, extrinsic factors can also lead to teachers self-selecting out of the system. For instance, research shows that teachers leave schools when they do not receive competitive salaries and that qualified individuals may seek employment in other sectors (Ingersoll & May, 2012; Johnson & Birkland, 2003). Because North Carolina offers a statewide salary schedule with limited local supplements, there is little financial competition between districts and teachers may move to nearby states or remove themselves from education careers altogether. Alternatively, extrinsic theory also means that teachers may choose to continue in the profession even if teaching is not their main priority. For instance, studies in American education suggest that students attach little importance to academic learning over practical knowledge, and may hold their teachers in little esteem (Cusick, 1983; Powell, Farrar, Cohen, 1985). As a result, some teachers focus on other 26 aspects of school, such as coaching or relating to students, often as part of the negotiation process to make their jobs bearable (Cusick, 1983). Therefore, teachers who are driven by external motivation may stay in the profession for an income, even if teaching is not an individual priority. Intrinsic motivation. Intrinsic motivation theory stems from the belief that people are rewarded by the feedback they receive from their work, and that they feel good when they are performing well (Deci & Ryan, 1996; Hackman & Oldham, 1980.) In the simplest terms, this means that someone who is intrinsically motivated feels good when they do well. Firestone (2014) argued that in general, those who are motivated internally experience both autonomy and self-efficacy and therefore evaluation should create rewards and contribute to the creation of rewarding conditions. In a previous review of working condition studies, Firestone and Pennell (1993) found that 10 out of 13 studies confirmed this relationship between teacher autonomy and teacher commitment, a condition that they contend is similar to motivation. Similarly, in an earlier critique on curriculum policy, Giroux argued that a technocratic approach to policy is grounded in the assumption that teacher behavior needs to be controlled and made consistent and predictable across all contexts, thereby reducing teacher autonomy to plan and develop curriculum and instruction to instead teach to a test (1985). Firestone (2014) also contended that research on self-efficacy (e.g., Bandura, 1997) and teacher efficacy (e.g., Tschannen-Moran, Woolfolk Hoy, & Hoy, 1998) suggests that competence and expectancy are motivating forces for teachers. In this case, competency means that the individual has the capacity to carry out the expected tasks and expectancy implies that the actions of the individual will lead to an intended outcome. 27 Teacher competency and expectancy may also vary based on several classroom level conditions that teachers may be unable to control, such as teaching assignment (Ball & Bass, 2000) and student interaction with classroom materials (Cohen, Raudenbusch, & Ball, 2003). However, more systemic conditions such as administrative support, adequate physical facilities, adequate instructional materials, and realistic workloads also may influence a teacher’s competency and expectancy (Firestone & Pennell, 1993). Additionally, research suggests that teachers are more motivated in schools that are orderly, have adequate school discipline, and are not overly punitive (Firestone & Rosenblum, 1988; Garet, Porter, Desimone, Birman, & Yoon, 2001; Ingersoll & May, 2012; Johnson & Birkeland, 2003; Kushman, 1992). Firestone contends, “The opposite of the fully autonomous individual is the person performing an activity under duress” (2014, p. 101). Additionally, the importance of evaluation conditions is evident in Cohen’s (2011) description of how the work of teachers is regulated by the society, economy, and culture around them and that a lack of consensus about educational results can increase uncertainty and dispute in a school whereas such conditions may not exist in a more cohesive school with individuals of similar ability. Assumption 2: Feedback Aside from ineffectively rating teachers, criticism has also abounded that previous evaluation systems did not provide enough information to improve teacher quality through feedback. Boyd, Grossman, Lankford, Loeb, & Wyckoff (2006) found that without useful feedback, most teachers’ performance plateaus by their third or fourth year on the job. Yet, locally developed evaluations used in the past have often been criticized as providing only a cursory review of teaching practice. Furthermore, research suggests that feedback that directly stems from the work itself can contribute to enhancing teacher competence and intrinsic rewards 28 (Hackman & Oldham, 1980). Most new teacher evaluation systems, including the one examined in this study, use multiple measures in combination to evaluate teachers. A typical manifestation is a combination between a standardized observation protocol and a value-added measure of teacher effects based on student standardized test scores, which is what is utilized in North Carolina at the time of this dissertation. One justification of using a system of multiple measures is that it theoretically will yield multiple types of feedback for teachers to use to improve practice. Additionally, the standardizations will define focal points deemed important. And while feedback has historically come directly from students (Black & William, 2009; Hart, & Murphy, 1990), formal teacher evaluation could provide feedback through both quantitative measures of student achievement and structured observation tools that are now part and parcel of teacher evaluation policy. Despite current policy often mandating the use of multiple measures, classroom observations are often viewed as the instrument that is mostly like to provide actionable guidance on how to improve teaching. This is because unlike the summative assessment produced with student achievement data, observation protocols are often accompanied by post-conference reflection between the observer and the observed. Additionally, there is some evidence that when teachers are provided scores and feedback from standardized protocols by a research project staff member or an administrator, respectively, they improve their practice (Allen, Pianta, Gregory, Mikami, & Lun, 2011; Taylor & Tyler, 2011). Feedback, however, can differ greatly depending on the person who is providing it. For instance, successful learning can have varying definitions from individual to individual (Cohen, 2011). So, it is possible that the quality of feedback a teacher receives will be influenced by an evaluator’s values despite the standardization of observation protocols. 29 Additionally, there is some emerging evidence that what an observer chooses to emphasize for improvement may be determined by the subject area being observed. For instance, Bell et al. (2015) found differences in the rank ordering of teachers when different protocols (general versus subject specific) were used. Additionally, rank ordering differed based on the subject area taught by the teacher compared to the observer’s subject area background. This study also found that note-taking and feedback patterns from evaluators differed depending on the subject matter background of the observer and whether there was alignment between an observer’s background and the subject being taught. The differences were more pronounced in mathematics, which suggest significant complexity in the ways that protocol, subject matter, and observer background intersect. Similarly, evidence exists of such differences in literature on how potential observers deal with different types of reform. For instance, in a study of 15 elementary school administrators and 15 curriculum coordinators, Burch and Spillane (2003) found that more emphasis was placed on teacher inputs and building literacy across subject areas with literacy reforms while math reforms focuses on sequenced instruction and external supports. Therefore, the quality of the feedback received may be dependent on many factors including the subject being taught and the background of the individual providing it. Both are likely to play a role in whether an individual teacher finds observation feedback useful. Teacher Responses to External Accountability Pressure While there is a gap in research regarding the relationship between evaluation and teacher classroom practice, research on other external accountability policies based on student testing results is extensive and has demonstrated unintended effects on the teaching workforce, primarily in the form of turnover, as well as on practices in the school or classroom. For instance, 30 “gaming” refers to engaging in strategic behaviors that will increase reported performance without making gains in actual student performance. Attempts at gaming can range from outright cheating and changing answers (Jacob & Levitt, 2003) to more benign techniques such as changing the quality of student lunches during testing (Figlio & Winicki, 2005) or moving the teachers with the best records of producing gains to tested areas (Cohen-Vogel, 2011; Grissom, Kalogrides, & Loeb, 2012). I briefly describe some commonly referenced issues with external accountability: teacher turnover, narrowing of curriculum, prioritizing the teaching of strategies over curriculum, and the triaging of students. Turnover Research suggests that external accountability pressures impact the teaching force, particularly in high-need schools. For instance, Clotfelter et al. (2004) suggested that lowperforming schools at risk of performance sanctions experienced negative effects on retention rates and on the probability of filling a vacancy with a high-quality teacher. If such evidence is true for sanctions at the school-level, then it would be reasonable to assume that these negative effects could persist, and possibly be amplified, when sanctions are applied at the teacher level through value-added models (VAMs) and standardized observation protocols. Also, dismissing teachers based on poor student test growth becomes problematic when dealing with lowperforming schools that may already be experiencing staffing difficulties. In such cases, it becomes clear that dismissing poor-performing teachers based on evaluations does not offer the sole solution to the issue of consistently low-performing schools. Research suggests that teacher turnover for any reason comes at great financial cost to schools and educational costs to students (Ingersoll, 2001). While changes in evaluation were largely driven by criticisms of locally based observation, most of the practical problems 31 identified with current evaluation policy focus on the use of student growth driven VAMS. This creates a conundrum where the costs of losing an effective, but misidentified teacher must be weighed against the costs of leaving students with an ineffective teacher who would not be dismissed without the use of VAMs. Policy makers should also consider the costs of possibly keeping a bad teacher who was misidentified as effective and may be difficult to dismiss. Raudenbush and Jean (2012) argue, "Falsely identifying teachers as being below a threshold poses a risk to teachers, but failing to identify teachers who are truly ineffective poses risks to students" (p. 2). Numerous researchers report that the risk of misidentification of teachers is high and widely variable depending on the model and confidence interval used (Raudenbush & Jean, 2012; Goldhaber et al., 2013). Similarly, Goldhaber et al. (2013) demonstrated how, depending on model specifications, teachers could easily switch the quintile in which they are assigned, showing that the most reliable use of VAMs can be found in separating only the truly outstanding teachers from the truly terrible, something that may likely be already known in a school, and that the middle quintiles show extreme variation based on the specifications used. To this extent, VAMS are prone to the same criticism of previously used local level evaluations when teachers are not being accurately labeled. As Harris (2009) points out, this unreliability in VAMs provides little in terms of formative feedback about a teacher's practice and instead serves to summatively signal quality, something that could be dangerous given the extreme variability described above when attached to high-stakes policies. Again, aside from potential financial consequences, such systems are likely to also challenge teachers’ feelings of competence and efficacy. Additionally, the potential inequity and instability of the VAM instrument may pressure certain teachers to exit the system. Although VAMs have been adopted in many states, including 32 North Carolina, this adoption has been highly criticized by both scholars and practitioners. One issue is the lack of tests for all grades and subject levels. Under NCLB, states had to create tests in some, but not all, grades and subjects. As some researchers have suggested, this lack of tests is most alarming at the high school-level, where NCLB mandates only one test in each subject area even though each student has different teachers for each of several subjects each year (Goldhaber et al. 2013; Harris, 2009). Harris (2009) also raises the question of how VAMs will be able to account for the possible effects of other teachers (particularly at the high school-level where a student is enrolled with several instructors simultaneously), teamwork among staff, and peer effects. Furthermore, the aforementioned challenges in the calculation of VAMs suggest that, at least at the high school-level where there are more specialized courses, there may be an additional challenge in shifting to a VAM that assumes that student ability is comparable across all subject areas (Goldhaber et al., 2013). Many VAM models use prior test scores to predict future achievement, which is problematic in more specialized courses and curricula, such as Physics, which would not have a prior test. This disconnect challenges teachers’ feelings of efficacy and competence, which may in turn drive teachers of more specialized subjects from the workforce. Regardless, replacing any teacher comes at a financial cost and instability in a school’s workforce can carry educational effects as well, regardless of whether the teacher is removed or leaves and regardless of whether that teacher was effective (Ingersoll, 2001). Therefore, it is important to consider how teacher evaluation policies may be linked to teacher turnover and retention. Aside from the threat of job loss when accountability is attached to high-stakes personnel decisions, the research on turnover is important to consider when thinking about the relationship 33 between motivation and evaluation. If evaluation elicits feelings of incompetence in an individual it may affect their intrinsic motivation. Teachers may choose to leave a school in favor of another school or job that provides the types of intrinsic rewards necessary to foster work satisfaction. Likewise, if evaluation is attached to extrinsic rewards such as bonuses or job security, then individuals may choose to leave the system in favor of positions that are more extrinsically rewarding and financially secure. Narrowing Curriculum and Teaching Testing Strategies While the knowledge and skills of a teacher are important, the work of teachers is also entirely dependent on the willingness of students to participate in learning. As such, the negotiation of curriculum is a key event in classrooms. Cohen (2011) argues that, “practitioners must supplement their expertise with client’s consent and with the knowledge and skills that clients bring to bear” (p. 12). As such, teachers often find ways to anticipate what students will find interesting in order to negotiate content and workload (Cohen, 2011; Powell et al., 1985). Accountability has added an extra layer to this dilemma as teachers may now feel pressure to emphasize certain areas of the curriculum known to be emphasized in assessments. For instance, some research suggests that an increased focus on testing outcomes in certain subjects has resulted in a narrowing of curriculum that increases as external pressures increase (Carnoy & Loeb, 2002; Ladd & Zelli, 2002; Rothstein & Mathis, 2013). Such work follows the logic of Milgrom and Roberts (1992) and the principal agent theorem, which contends that in organizations with multiple goals, agents will focus on rewarded goals at the expense of other goals. Additionally, American students are tested more than any other students in the world, yet there is little agreement over what should be in tests and there is often considerable variability 34 among the curriculum standards and tests of the same subject (Conley, et al. 2011; Floden, Porter, Schmidt, & Freeman, 1980; Porter, McMaken, Hwang, & Yang, 2011; Porter, Polikoff, & Smithson 2009). Cohen (2011) describes how this lack of agreement can lead to some teachers aligning content with standardized tests whereas others may select from a textbook or a workshop, or simply choose to teach what they learned as students. Therefore, standardized testing has done little to create uniform curriculums across locales and may actually result in the narrowing of curriculum to meet specific demands of specific evaluations. Also, teachers may forego teaching curriculum altogether and devote lessons to teaching test-taking strategy rather than content. For instance, research suggests that VAMs potentially reward teachers who use a curriculum focused on testing or testing strategy rather than actual subject matter (Carnoy & Loeb, 2002; Goldhaber, et al., 2013; Ladd & Zelli, 2002; Mintrop & Sunderman, 2013; Rothstein & Mathis, 2013). Therefore, teachers may feel pressure to devote class time to teaching testing skills rather than actual components of the subject area. Concerns about narrowed curriculum or replacing curriculum with teaching test strategies are important to consider as the VAMs tested in the MET study are prone to large error with a correlation of around 0.5 for elementary teachers, and that this error would increase as teachers focus more on the goal of increasing scores and avoiding sanctions (Rothstein & Mathis, 2013). In other words, the greater the risk of sanctions attached to scores, the greater the risk of focus on the tested curriculum and testing techniques at the expense of other areas of curriculum. Therefore, it is important to consider how evaluation may influence what is taught in a classroom. It is possible that teachers may be adapting the curriculum they teach to address components that are more likely to impact their evaluation. This could be true in regard to narrowing the curriculum, but it is also possible that teachers may select certain lessons that they 35 feel will be more appealing for instances when they know they may be formally observed. It is also possible that teachers may modify how they teach, such as by employing the use of more assessments that look like those formally used for evaluation, or picking teaching strategies that may be more appealing to observers, such as employing technology the day of the observation or utilizing a particular method, such as Socratic seminar, if it is thought an observer may score more favorably. Triaging Students Research has determined that another popular method of gaming in schools involves removing low-performing students from the test pool. This can be done in a variety of ways. For instance, a study by Figlio and Getzer (2002) showed that students who were low-income or previously low-achieving in six large Florida districts had been categorized as students with disabilities (SWDs), a category that was exempt from testing at the time of the reassignments, at a rate much higher than prior to the implementation of accountability policy. Similarly, a study of over 41,000 disciplinary events in Florida schools suggests that schools assigned substantially harsher punishments to low-achieving versus high-achieving students with a significantly increased gap during the testing period (Figlio, 2006). Such practices served the purpose of removing the scores of students who may be poor achievers. Although most of the available current studies extend to school-level accountability policies and school-level gaming practices, there is evidence to suggest that similar actions may also occur at the classroom level. For instance, Booher-Jennings (2005) described how a school in Texas participated in “educational triaging.” Under this system, resources were diverted towards students who were predicted to be at threshold levels of passing the state assessment as well as towards students who were counted towards the school’s overall accountability rating. 36 Similar behaviors were observed in a study in Chicago where teachers diverted more attention to students near the pass threshold (Neal & Schanzenbach, 2010). It is likely that teachers will continue to engage in similar behaviors with new accountability policies focused on the level of the individual teacher. Therefore, it should be considered that teachers may direct focus on certain students based on their evaluations. Teacher Responses to Curriculum Reform Some of the assumptions behind teacher evaluation policy are based in economic theories. Specifically, these assumptions originate from the idea that teachers will behave as rational actors within a system, and that given increased pressure, teachers will perform better (Milgrom & Roberts, 1992). However, such economic views fail to account for the manner in which evaluation reform is embedded in existing institutional structures. So, while economic theories inform the construction of the policy, such theories are unable to predict the behavior of the actors affected by such policy. While there is a gap in the research about the ways in which teachers may respond to evaluation reform, there is a lot of information available on how teachers respond to curriculum reforms. The research on teacher response to classroom reform suggests that teachers can respond to policy interventions in a variety of ways. However, the conditions of the classroom and the work of teachers creates an atmosphere in which those tasked with enacting simultaneous and sometimes competing policies from multiple governance levels have little opportunity to understand or realize the original policy intent (Kennedy, 2005). Current evaluation reform, unlike curriculum reform, extends external accountability pressure to the level of the individual teacher. Therefore, it is possible that teachers may react in a variety of ways to meet the requirements of the evaluation policy that may be similar to those demonstrated by teachers under curriculum reforms. 37 One predicament in teaching is dependence on students to participate in changes in practice (Cohen, 2011). Because teachers are dependent on students’ success under current evaluation policies, there are powerful incentives for dramatic changes that can lead to new behaviors, skills, habits, and understandings. Many of these possible behaviors were discussed in the previous section on teacher responses to external accountability pressures. Alternatively, it is also possible that the lack of cohesiveness among schools may lead to teachers perceiving major changes occurring when outsiders actually view the change as minimal (Cohen, 1990). So, literature on teacher response to curriculum reform can help predict ways in which differences in school sites may interact with the evaluation policy to yield different effects across and within sites. Several frameworks have emerged that identify typologies which teachers exhibit when faced with reforms. One framework that has been utilized when looking at teacher response to classroom reform was employed by Oliver (1991) in describing the strategic processes that organizations employ in response to external pressures. Oliver describes a typology of strategic responses including: acquiescence, compromise, avoidance, defiance, and manipulation. Coburn (2004) argues in her study of the implementation of a reading policy that “the relationship between institutional pressures and classrooms was much more interactive and nonlinear than that portrayed by Oliver. The teachers were connected to messages from the environment via a web of interactive linkages through which messages about reading moved in, out, and around schools through multiple routes” (p. 223). Coburn felt that there were conflicts between her observations and Oliver’s views of both denial and acquiescence. As a result, Coburn offered five alternative typologies: rejection, decoupling/symbolic response, parallel structures, assimilation, and accommodation (2004). 38 Alternatively, a recent piece on educator actions in a competitive marketplace has condensed these typologies into three typologies which were dependent on the perceived legitimacy of the reform: acquiescence, denial, or adaptation (Yurkofsky, 2016). Under this condensed version, those who acquiesce accept the policy and modify practice around it, those who deny it may disregard or revolt against the policies ideals, and those who adapt may try to weld existing practices and beliefs with policy priorities in order to ensure survival in the system. Regardless of the specific typologies used, the general idea that teachers may perceive evaluation policy legitimacy in varying ways and act according to their perceptions is relevant to this proposed study. These perceptions may differ based on the perceived legitimacy of the policy within the school site, the teacher’s relative security in their job, and the usefulness of the feedback received, points which all emerged in interviews with focal teachers. So, with this in mind, I have adopted Yurkofsky’s three typologies, which were designed to focus on educator actions in competitive markets, to my dissertation. Leadership Capacity and Evaluation There is emerging evidence that school leadership impacts the success of evaluation policy both in terms of implementation and in the quality of feedback in which teachers receive. In the case of North Carolina’s policy, school administrators are the individuals who conduct most formal evaluations. Researchers have documented that the roles of principals have shifted over time to include an expanded role as an instructional leader due to changes in both policies and public expectations (Bryk, Sebring, Allensworth, Luppescu, & Easton, 2010; Louis, Dretzke, & Wahlstrom, 2010; Spillane & Kennedy, 2012). Additionally, policies are subject to interpretation and alteration by those who are tasked with enactment in real contexts, resulting in what Lipsky termed “street-level bureaucrats” (2010). Because of this, the capacity of administrative leaders to conduct evaluation can impact the way in which evaluation policy is 39 implemented in varying school contexts as evaluating principals become street-level bureaucrats of the policy. For instance, studies have uncovered some of the unintended consequences of having principals conduct formalized evaluation. First, principals have varying views on both the purpose and the use of evaluations and may respond to one policy message at the expense of others, leading to varied implementations of the policy (Kraft & Gilmore, 2016. Reinhorn, et. al, 2017). Additionally, the aforementioned expanded role of principals has contributed to a deficit of time to devote to evaluation. Furthermore, a lack of experience in the subject area being observed may result in narrowed feedback being provided to a teacher that does not allow for improvement of instructional practices (Kraft & Gilmore, 2016). Studies have also suggested that the quality of feedback a teacher receives from an evaluation is dependent on principals having the necessary training, time, and resources to devote to provide individualized, actionable feedback (Kraft & Gilmore, 2016. Reinhorn, Johnson, & Simon, 2017). Similarly, principals who are well versed in the application of good instructional practices are best prepared to engage teachers in a process of inquiry, reflection, and improvement (Reinhorn et al., 2017). The unintended consequences of the school-level administrator’s role in implementing evaluation policy contributes to variability in the success of the policy across school sites. Studies have also demonstrated that principals may assess teachers differently on formal evaluations when opposed to summative evaluations. Two recent studies demonstrate that while principals still tend to overall evaluate their teachers quite positively, more positive ratings tend to be assigned on high-stakes assessments versus low-stakes assessments, and principals verbally report ineffective teachers in their school despite formal evaluation ratings demonstrating otherwise (Grissom & Loeb, 2017; Kraft & Gilmour, 2016). Furthermore, these differences are 40 also amplified when the stakes are higher for individuals. For instance, new teachers, who have limited career protections and are therefore more likely to be adversely affected by a negative evaluation than an experienced teacher, are often rated much more positively on high-stakes assessments versus low-stakes (Grissom & Loeb, 2017). This demonstrates that principals are reluctant to show criticism on formal evaluations that may be expressed in lower stakes situations. There are a few possible explanations for the variability in ratings given on high-stakes assessments versus low-stakes assessments as well as for variability across experience groups. First, it may be possible that principals find more value in providing formative feedback to their teachers through informal means versus using high-stakes, summative evaluations. For instance, in a study of six schools, all of the principals interviewed began referencing their approaches to formative evaluation rather than summative evaluation, suggesting that formative, low-stakes feedback may be more valued by administrators (Reinhorn et al., 2017). Additionally, principals in one study cited time constraints as a reason to be more lenient in high-stakes evaluations as they were unable to provide concrete feedback to improve practice (Kraft & Gilmour, 2017). Principals also explained that they wanted to recognize teacher effort and to evaluate potential while simultaneously motivating teachers towards achieving that potential (Kraft & Gilmour, 2017). Additionally, principals may rate their teachers in a particular way in order to protect their staff. In one study, school administrators expressed concern over the difficulties of replacing a teacher who was either removed or felt pressured to remove themselves from the classroom due to poor ratings, particularly with newer teachers who may not have any career protections (Grissom & Loeb, 2017; Kraft & Gilmour, 2017). Differences in how experienced 41 versus non-experienced teachers are evaluated were also found in one study of all states plus DC and 25 large school districts (Steinberg & Donaldson, 2016). Overall, the reluctance of administrators to critically evaluate teachers on high-stakes assessments suggests that principals may be attempting to protect staff from the consequences of low scores or otherwise feel unable to be as critical as they would be in low-stakes situations. Theory Driving Research The preceding literature review informs the building of a theory for investigations into how teacher evaluation may impact classroom practices (Figure 1). The theory is that motivation and feedback provided by evaluation are factors that interact and create an impetus for action on the part of the teacher. However, both motivation and feedback are filtered through various aspects of teaching conditions. For this study, the school-level factors that will be examined as part of teaching conditions include evaluation conditions and evaluation status, and individuallevel factors include years of experience, licensure, and subject area. According to the proposed framework in Figure 1, these factors filter the policy to yield classroom practices. There are other potential factors that can affect classroom practice, but these two school-level and three individual-level factors remain the focus of my study. According to this theory, teacher motivation associated with evaluation is influenced by both extrinsic and intrinsic rewards. Additionally, conditions are informed by school-level factors (evaluation conditions at a school and the existing evaluation status of teachers at a school) as well as individual-level teaching conditions (experience, licensure, and subject area). Finally, while we do not yet know how teachers may specifically react to evaluation in regard to classroom practice, existing literature on teacher responses to accountability pressures and to classroom reform predict the ways such reactions may manifest. This study was designed to specifically examine the extent to which teachers felt their practice was influenced by evaluation 42 with particular attention to modifications in what is taught, the teaching strategies utilized, and the directing of focus on certain students based on evaluation. I was unable to gauge whether turnover was related to evaluation, but I did ask questions that gauged teacher perceptions of evaluation as related to their perceptions of fairness and job security. 43 Figure 1. Framework to Guide Research on Evaluation and Practice 44 CHAPTER 4: Research Design and Methodology The format of this dissertation is a mixed methods case study describing and explaining the relationship between teacher evaluation policy and teacher practice in light of various contexts and conditions. Case studies are an ideal design for attempting to understand a particular phenomenon where multiple variables interact in a single context (Derrington, 2013; Halverson & Clifford, 2006; Miles, Huberman, & Saldana, 2014; Yin, 2009). The scale of this case study is four high schools of varying contexts in a single school system. This scale not only allows for a breadth of analysis across locations in the district, but also for a detailed, in-depth look at how individual teachers see evaluation interacting with their classroom practice. So, I was able to collect data across system, school, and individual contexts. Three major types of case studies are commonly used to study research questions including: exploratory case studies, descriptive case studies, and explanatory case studies (Berg, 2007). My research questions focus on describing relationships within a phenomenon and, when possible, explaining what influences individual behavior in a case. Therefore, this study meets the criteria of both a descriptive and an explanatory case study. Furthermore, mixed methods are utilized to answer the research questions in this dissertation. Mixed method research can be formally defined as “the class of research where the researcher mixes or combines quantitative and qualitative research techniques, methods, approaches, concepts, or language into a single study” (Johnson & Onwuegbuzie, 2004, p. 17). Statewide, publicly available quantitative data were used in the selection of the research sites. Additionally, the data was collected in three phases. The first phase of survey collection represents the quantitative stage, though open-ended commentary was also permitted to allow survey participants to explain answers more fully if they desired. The second and third phases consisted of interviews which were analyzed qualitatively to better explain the findings from the 45 survey. In this manner, the qualitative data also served as a check for the quantitative analysis. Furthermore, the approach taken to analysis allows the quantitative data to describe what is happening while the qualitative work helps explain the phenomena. Johnson and Onwuegbuzie (2004) contended that the objective of mixed methods research is to draw from the strengths and minimize the weaknesses of qualitative and quantitative methodology which results in research which is superior to that conducted with one method. Because case studies allow for a nuanced understanding of the particularities of context and mixed methods studies allow for analysis which can address my research questions more fully than a single method approach, the surveys and interviews, along with publicly available district- and school-level data, allow me to effectively address my research questions by both describing and explaining the relationship between teacher evaluation and practice at the four study schools. Participants and Sampling Strategy Participants in this study are high school teachers (N= 45) in North Carolina. The focus on high school teachers is important for two reasons. First, the subject area distinction is more pronounced at this level (compared to elementary-level teachers) because high school teachers usually hold degrees in the subjects they teach rather than broadly in education (that sometimes include a major, but not a degree in a subject area). The teaching certificate is often secondary to the subject degree in North Carolina. Secondly, most North Carolina high schools, including the four in this study, follow block schedules where courses are taught over half a year and then change for the second semester. The block scheduling allowed me to conduct follow-up interviews after a semester-long course had ended, students had taken assessments, and teachers 46 had an idea of how their evaluation was going with respect to the student growth score they may receive. The selection of high schools for my study involved examining district level data. First, I reviewed the North Carolina Working Conditions Survey results and Educator Effectiveness results for each high school to determine evaluation conditions and effectiveness status. These sources are described in greater detail in the next two sections. I then identified four focal schools for the study that fit varying combinations of high/low evaluation conditions and high/low effectiveness status. I describe these measures in the next two sections of this chapter. Teacher Working Conditions Survey and Evaluation Conditions The Department of Public Schools of North Carolina, in conjunction with the North Carolina Association of Educators, administers a Teacher Working Conditions Survey (TWC) biannually which asks teachers to answer questions about varying aspects of their working conditions, including topics such as professional development, facilities, and community support. The overall response rate in Broadville County for the 2015-2016 school year was 79.79%. The data from these surveys are publicly available (http://www.ncteachingconditions.org/). There are nine questions on the survey which focus specifically on evaluation (Table 2). Seven of the questions are directed towards local assessment, such as observation, either by explicitly stating the focus is local or by being components of a larger section on local conditions. Two of the questions focus on testing, which is the state level component of evaluation. These questions were used to determine the evaluation conditions of individual schools in a method I will next describe. 47 Table 2 Teacher Working Conditions Survey Evaluation Related Questions 7.1d Teachers are held to high professional standards for delivering instruction. 7.1f Teacher performance is assessed objectively. 7.1g Teachers receive feedback that can help them improve teaching. 7.1h The procedures for teacher evaluation are consistent. 9.1a State assessment data are available in time to impact instructional practices. 9.1b Local Assessment data are available in time to impact instructional practices. 9.1c State assessment accurately gauges students’ understanding of standards. I used data from the latest administration of this survey (Spring 2016) to determine a school’s evaluation conditions (North Carolina Teacher Working Conditions, 2017). The original survey responses are presented in a Likert-type format; however, the data is also reported as a percentage of the total number of people who indicated any level of agreement. As previously mentioned, some of the evaluation-based questions focused on the local level and others at the state level. There were some drastic differences among the scores for locally focused questions and state focused questions, so I separated the questions based on whether there was a specific state or local focus to create two distinct scores, one for local and one for state. I then created composite averages of the percentage of respondents who indicated some level of agreement separately for the state and local categories. For each high school, I compared the scores of each of the aforementioned categories and measured the distance of the school’s percentage from the system’s average. This calculation yielded either a positive or negative number which indicated 48 distance from the system mean. These numbers provided a school’s condition score for each category (Table 3). 49 Table 3 Calculating Condition Score Teacher Working Condition Survey Questions Local Score Location 7.1d 7.1f 7.1g 7.1h 9.1a 9.1b 9.1c Composite Broadville 92 84 81 83 52 75 31 83 Riley 94 83 90 75 48 70 33 82.4 Phoenix 100 94 88 95 29 60 22 87.4 Charles 88 95 78 81 39 63 22 81 Central 90 72 72 84 22 49 16 73.4 Note. The full text of the survey questions are located in Appendix A 50 Local Condition Score 0.6 4.4 -2.0 -9.6 State Score Composite 41.5 40.5 25.5 30.5 19 State Condition Score -1.0 -16.0 -11.0 -22.5 Response Rate 79.8% 94.1% 63.3% 90.0% 50.0% Educator Effectiveness Database and the Effectiveness Score I used data for the school year (2015-2016) that preceded the study year (2016-2017) from the Educator Effectiveness section of the North Carolina School Report Card database to calculate Evaluation Effectiveness scores. I was also able to separate this score by local and state focus, as I will describe later. The website for the Educator Effectiveness database states in highlighted text that, “North Carolina’s Educator Evaluation System is a growth instrument. It identifies the knowledge, skills, and dispositions expected of teachers, and measures the level at which teachers meet the standard as they make changes to their teaching” (emphasis is consistent with the referenced text) (Educator Effectiveness Database, 2015). The instrument consists of six standards (See Table 4). The website also specifies that due to teachers and administrators being lifelong learners, “It is expected that teachers in a school would be distributed across the rating categories” (emphasis is consistent with the referenced text) (Educator Effectiveness Database, 2015). The first five of the six standards of the evaluation instrument debuted during the 20102011 school year. Legislation current at the time of writing states that career status teachers must receive a full evaluation of all six standards at least once during a five-year license renewal cycle. Otherwise, career status teachers can be evaluated on an abbreviated cycle. All teachers who had not received career status prior to the 2013-2014 school year are subjected to the full evaluation cycle each year. Standards 1-5 are observation standards determined locally by school-level administration with five possible proficiency ratings, whereas standard six is based on student growth data on state exams, determined by a state software system, and has three proficiency levels (Table 4). In the past, the proficiency for standard 6 was determined at the high school level by individual 51 student test data for teachers of the three state tested subjects: Algebra II, Biology, and English II. At the time of this dissertation, those results are used for schoolwide scores which are combined with individual teacher scores from students taking the North Carolina Final Exam in order to calculate a teacher’s standard 6 score. At the time of this study, the North Carolina Department of Public Instruction had announced that standard six was going to be “devalued” and spread across the other five standards; however, at the time of writing it was unclear how that would occur. The methodology and technology for calculating standard six scores as well as the assessments used to determine such scores are all conducted by the state. Again, in the 20132014 school year, North Carolina removed career status as a designation obtainable by teachers who had not yet received it. It is notable that teachers lose career status if they switch between systems in the state and may have been unable to retain that status. Non-career status teachers are on a one-year contract structure and must be evaluated on a full cycle every year indefinitely regardless of the years of experience. 52 Table 4 Evaluation Rubric Evaluation standards for teachers Type and Ratings 1 Teachers demonstrate leadership 2 3 Teachers establish a respectful environment for a diverse population of students. Teachers know the content they teach. 4 Teachers facilitate learning for their students. 5 Teachers reflect on their practice. 6 Teachers contribute to the academic success of their students Observation (Local) Not Demonstrated Developing Proficient Accomplished Distinguished Student Growth (State) Does Not Meet Meets Exceeds Evaluation Effectiveness scores were calculated using data from the 2015-2016 school year, which is the year that preceded the study year. Standards 1-5 were not applied to all teachers as those with career status could be evaluated on an abbreviated evaluation schedule at the discretion of the observing administrator. So, I could only calculate the average number of standards proficient, not an average of teachers who were proficient for standards 1-5. For standard 6, the number of standards and number of teachers are the same. First, I calculated an average of standards proficient for standards 1-5 by summing the total number of proficient counts for all five standards and dividing that by the total count for standards 1-5. Standards 1-5 are awarded locally by school-level administration following observation and are labeled as “local” scores. Standard six was more straightforward as it was calculated by the state based on standardized student assessments. I summed the number of teachers who met the standard and 53 divided that by the total number of teachers. I then took the averaged percentages for each school and subtracted each school’s average from the system’s average to create Effectiveness Scores for each school for both local and state measures (Table 5). Table 5 Establishing an Evaluation Effectiveness Score Location Local Score Local Proficient Broadville 99.0% Riley +0.6 99.6% Phoenix +1.0 100% Charles -2.5 96.5% Central 100% +1 54 State Proficient 88.0% 89.5% 75.0% 97.3% 92.2% State Score +1.5 -13 +9.3 -4.2 Phase 1: Surveys of Sample Schools For the first phase of research, I administered a survey to Mathematics and English teachers at the focal schools to identify ways in which evaluation influenced teacher practice during the previous school year as well as the anticipated effect on the upcoming year. The first series of questions on the survey were demographic questions designed to identify years of experience, licensure type, what subjects a teacher had taught, past and current status as a teacher of tested or non-tested courses, and current status as a teacher of End of Course (EOC) or North Carolina Final Exam (NCFE) tested courses. In the demographics section I replicated the nine evaluation condition questions from the TWC survey to establish a measure of the individual teacher’s satisfaction with the conditions at the school. This helped me identify whether or not a teacher deviated significantly from school-wide responses and assisted in my selection of focal teachers. The survey then included Likert-scale questions requesting that teachers reflect on their prior year including: the extent to and way in which evaluation affected their motivations to succeed in the classroom, their use of feedback from evaluations, as well as their perceptions of job security and accuracy of the evaluation, and the ways in which evaluation guided what was taught, how it was taught, or on whom focus was directed in the classroom. The third portion of the survey asked the same questions about anticipated behaviors “looking ahead” in the new school year and how teachers planned on modifying practice in the current school year. The final two question sets were complementary and are referred to as the “complementary question set” throughout the dissertation. During an initial analysis of this survey, I identified focal teachers at each school and attempted to procure two teachers from Math and two from English to participate in the 55 interview phase. Descriptive statistics and paired t-tests were run on the survey responses first as a whole sample, then by school-level, and then by individual-level characteristics. Survey instrument. The survey consisted of three sections and is available in Appendix A. The first section asked participants for demographic data. The second section replicated nine questions from the Teacher Working Condition Survey that were used to calculate the school Evaluation Condition scores as described prior. The final section contained a complementary question set that asked teachers to reflect on the previous year and then the current year. The nine questions from the Teacher Working Conditions survey used to determine Evaluation Condition Scores were replicated on the survey administered in Qualtrics to get a sense of the perceptions that Math and English teachers from the focal schools had of evaluation conditions. I used the same scale that was used in the original state-administered Teacher Working Conditions Survey which included the options “Strongly Disagree,” “Disagree,” “Agree,” “Strongly Agree,” and “Don’t Know.” To analyze the results of this section for Chapter 5, I eliminated the “Don’t Know” responses question by question which resulted in a different reported N across questions. The bulk of the survey featured questions asking teachers to reflect on the previous school year and then a complementary set of questions asking them to think about and anticipate the current school year. Each question set had a unifying theme. However, one question about the prior year was not replicated in the current year question set; that question asked teachers to evaluate the statement: “Last year’s evaluation will impact decisions about classroom practice in the upcoming school year.” The nature of this question did not allow for a complementary question in the second set. Also, the number of participants in each set of the survey is different because some participants were not in the classroom in the prior year and therefore the section 56 reflecting on the previous year was not applicable to those individuals. Additionally, responses were not forced, so some participants opted not to answer all of the questions, which also contributed to variations in the N question by question. Teachers were asked to evaluate all of the statements using the following Likert-type scale, where the higher numbers indicate a higher level of agreement: 1. Strongly Disagree; 2. Disagree; 3. Neither Agree nor Disagree; 4. Agree; 5. Strongly Agree. Phase 2: Preliminary Interviews of Focal Teachers The first round of interviews was conducted two months into the 2016-17 school year. The purpose of these interviews was to better distinguish the relationship between teacher practice in the classroom and the evaluation policy. I attempted to sample two English and two Math teachers from each school. However, I was unable to achieve uniform sampling across subject areas. Riley did not have any Math teachers who were willing to be interviewed and Phoenix only had one Math teacher who was willing to be interviewed. Conversely, at Central the English Department Chair recruited teachers for interviews, and due to a communication error, selected three English teachers to be interviewed. During the first interview, I tried to identify typologies of reform response from teachers as well as to parse out differences between individuals of varying characteristics. First, I asked teachers to generally explain their experiences with evaluation both in the past as well as so far in the current school year. The next questions asked during the interviews were developed based on both the school-based and individual-level responses to the survey items. This was done in an attempt to find explanations for differences that were related to the context and conditions of specific teachers. Finally, I ended each interview by asking every teacher their thoughts on the two policy assumptions of evaluation: (1) evaluations are necessary because teachers need to be 57 rated, sanctioned, or rewarded in order to be motivated to do a better job; and (2) evaluations yield information that is useful for teachers to improve practice. Phase 3: Follow-up Interviews of Focal Teachers Follow-up interviews of the focal teachers were conducted in mid-March of the 20162017 school year. At that point, teachers had been through one state testing cycle in January for the first semester. Due to the block schedule system used in all the study schools, teachers were teaching entirely new courses. At this point in the year, every teacher had been evaluated at least once and nearly all of them had completed all the required evaluations for the year. I began the interview by asking teachers for an update on their observations for the year. I also asked teachers how testing had gone and if they had any surprises from the process or the scores. I inquired about teachers’ courses in the current semester and if they felt any differing pressure from state testing with the courses they had currently versus the prior semester. I focused on attempting to identify any changes in typology, perception, or behaviors based on the evaluation in the first half of the year. Finally, I shared with each teacher the status of their school’s Evaluation Conditions and Evaluation Effectiveness scores as I had calculated previously. I asked teachers if each specific score surprised them or if they thought it was an accurate reflection of the climate of their school and why. I also asked teachers to reflect on if anything had changed in the current school year that may alter the scores if this study were to be replicated with similar data from the current year. Aside from serving as a new source of data, this interview also served as a member check to ensure validity of the study (Deyhle, Hess, LeCompte, 1992). 58 Data Analysis and Establishing Validity Quantitative The survey data served three purposes in this study: as a source of data for analysis, as a mechanism to identify focal teachers for the interview portion, and to provide information used to develop individual-level questions for the interview phase. All quantitative analysis of the survey data was completed using SPSS software. The data were analyzed in three ways: as a whole sample, at the school-level, and based on individual teacher characteristics. I first conducted a sample-wide analysis of the data. This analysis included calculating descriptives and conducting paired sample t-tests to identify differences between the prior year and the current year for the whole sample. To examine school-level differences, I conducted two types of analysis looking for differences between schools as well as within schools. First, I calculated descriptives and conducted ANOVA to identify differences between schools for both the prior and current year question sets. I then conducted paired t-tests within each school to determine differences within each school for responses on the prior year versus the current year. I examined three individual teacher characteristics in the survey data: licensure, sevenyear status, and subject area. First, descriptives were calculated for each of the three categories for both the prior and current year question sets. Then, independent sample t-tests were conducted to determine differences between the categories of each characteristic on both the prior and current year question sets. Qualitative Both the preliminary and follow-up rounds of interviews were audio recorded, transcribed, and checked for accuracy. Copies of the transcripts were provided to the interview 59 participants so they could ensure that the interview appropriately reflected their intended meaning. I organized and coded all of the interview transcripts using the qualitative data software, Dedoose. A coding scheme was developed inductively. The coding scheme would be considered open-coding as the codes developed as my work progressed rather than being pre-determined outright. However, most of my codes were grounded in the results of my literature review. Specifically, I focused on the different types of motivation (extrinsic and intrinsic), the types of responses teachers demonstrate in research on accountability pressures and classroom reform (acquiescence, denial, and adaptation), and some of the types of reform responses teachers engaged in (selecting curriculum, selecting teaching strategies, and directing focus on students). I started with these aforementioned grounded codes and developed new codes and child codes as trends further emerged. In this manner, coded material was grouped together by emerging theme and typologies. Codes were not mutually exclusive. The validity of my codes was confirmed by double coding 36% of the data, and any discrepancy was noted and addressed in order to look for alternative interpretations of the data (Miles et al., 2014). Descriptions and examples of interview codes are available in Table 6. 60 Table 6 Interview Code Descriptions Codes Definition Motivation Reference to being motivated to better perform in the classroom, better teaching practice, increase student achievement, or increase performance in some other aspect of the teacher’s job. Example “Teachers do not need to be ranked in order to be motivated to do better. I feel like one doesn't enter teaching for that. The people who are entering teaching are doing it for intrinsic motivations because they generally want to help, and that competitiveness just takes away from the whole goal, which most teachers have which is to help students learn… I guess what motivates me is students having curiosity, and the pursuit of intellect, that motivates me.” Internal Reference to a form of internal motivation. This reference may include a teacher being motivated by disappointment or achievement in the evaluation process or experience. “I would say that that probably comes down to why someone came into the profession in the first place. I do not feel like I need affirmation from my Principal as much as I do feel like I really actually care about my students' growth and learning. And so, I came to be an educator simply because I believed in the ability to make an impact in this world, and I see the need for it… And I think that probably drives me forward more than anything else. I think that also, you have to really love what you are teaching and the process, right? Because I do think it is a hard profession, and it kind of beats people down really quickly. And without that motivation or that affirmation, I think a lot of people do get lulled to sleep a lot.” External Reference to a form of external motivation including things such as pay increase and the achievement of ratings on an evaluation rubric. “If I had an evaluation score that was really low, that might motivate me to see, ‘What did I do wrong?’ And, ‘Let me try to do better.’” 61 Table 6 (cont’d) Observation Feedback Negative Reference to feedback gained from the observation process or rubric. “I try and find value in them and I think a lot of times I will get out of them something different than what I expected to, like ‘Oh hey, I noticed this and this in your classroom and I'm wondering if you might try this idea or this type of formative assessment’ or something that has been really helpful but wasn't necessarily what I expected going in and so I'm wondering if, with teachers that are a little more seasoned, that if they have those little things, because I feel like a lot of times that advice that I'm getting is maybe like "Oh, I might have figured this out in a year or two." And so, I'm wondering if you've got all of that, because there's only so much you can see in 45 minutes or an hour, hour and a half. That if there is anything you can see, it's probably one of those things that's pretty easy to fix.” Reference to feedback which describes the evaluation process, feedback, or scoring in a negative or detrimental manner. May also highlight aspects of the feedback which teachers perceive make it not useful. “I remember a couple years back, they started putting all that data into EVAAS for us to look at, from school to school. You could look at the different schools and just see the different standards …and you could just see what the average evaluation score was in each category, in each school. And there were some schools that were just consistently, much higher. It was odd, and I cannot remember which school it was, but you look at one school and 45% of their teachers had the highest score in almost every single category. And you look at another school, it is 20 minutes down the road, and 10% of their teacher had the highest score in every category. The problem, and I do not know if this is a training thing for the administrators, or if it's a systemic thing, what it is…And we all looked at that and said, ‘Well, this doesn't make any sense, if Principal A over at that high school's just going to get everyone a five just because either they have low standards or they're evaluating based on just the talent they have and maybe they do not have a real strong talent pool. That removes a lot of the objectivity... Because the same person's not doing all the observations and they are not 62 Table 6 (cont’d) holding everyone to the same standard, there is going to be a problem. We even talked about that in the schools, we know that sometimes if you get a certain administrator assigned to you but that means, ‘Oh, yeah, my scores are going to be great.’ Because for whatever reason that person, they are busy doing other stuff, they have multiple things they're dealing with or they just are more laid back or easy going, or sometimes they just do not have the same time in the classroom to know what they are looking for all the time. And the standards aren't the same all the time.” Positive Reference to feedback which describes the process, feedback, or scoring in a positive or helpful manner. May also highlight aspects of the feedback which teachers perceive make it useful. “The feedback that I get or that I have gotten in the past from evaluations has often been... very specific because when you only see a small snippet of someone's classroom or their teaching style or whatever that day happened... I think it's most effective if you focus on, ‘I solve this one thing specifically. And if it comes up again or when it comes up again this can help.’ And so, the last observation I had last year, the Principal was in here and she was watching, and she gave me a suggestion where I would asked students... I gave them the, ‘Does everyone understand?’ We did my five seconds of nodding and looked around and try to make eye contact with everyone. And afterwards [the principal] said, ‘That was good. Try this’ and gave me a list of three or four different little quick snap formative assessment like, ‘Everyone put your head down. Give me one, two or three.’ And now that's what I do, and I feel like it informs my teaching much better than what I was doing which was just a very simple glance around try and read everyone's face. Those specific things more so than any big grand teaching strength or weakness that I might have that is really hard to observe in 40 minutes or 50 minutes.” 63 Table 6 (cont’d) Testing Feedback Reference to feedback gained from the testing process, scoring, or score reporting. “Well, so when we get our feedback on test results from the state, it's divided basically into three categories. RL, so Reading Literature, RI, Reading Informational text, and then Language, which is vocabulary skills. And that is the only breakdown of that test data that we get, are those three categories. So, within reading fiction there are so many components of reading fiction, so I never have any idea if my students are struggling with characterization or plot structure. Or I never have any idea if with RI, if they are struggling with central idea or supporting details. So, there is no way for me to use that data to actually improve my instruction other than if I was weak in informational. Let us try to throw more non-fiction in… so there's no specific feedback for me to build on.” Negative Reference to feedback which describes the evaluation process, feedback, or scoring in a negative or detrimental manner. May also highlight aspects of the feedback which teachers perceive make it not useful. “So, I feel like we are doing a great job here. But the test, if they are just looking at achievement, I do not think that shows everything. We are looking at growth. We're doing pretty darn well. But in terms of the test's usefulness, its effectiveness at determining what our kids know, I don't feel that it does overall. I just do not think you can accurately gauge what students have learned in a 90 day course or 180 days, on a 30, 40 question test, especially one that is multiple choice, at least, part of it is multiple choice. Positive Reference to feedback which describes the process, feedback, or scoring in a positive or helpful manner. May also highlight aspects of the feedback which teachers perceive make it useful. “The kids that I actually saw growth from in my class were those kids that I saw growth from on the test.” Reference to changing some aspect of teaching in anticipation of or the result of an evaluation. “Our big thing in the past couple of years has been learning targets. They want everyone to have some kind of learning target on the board. And that is not something that I have really done before. I Work Decisions 64 Table 6 (cont’d) always have an agenda on the board of, ‘Here is what we are doing today,’ and as we're going along we talk about, ‘Why we are doing those things.’ But I never specifically said, ‘I will be able to do this,’ or ‘I will be... ‘I never had that goal, stated in that way. So yeah, that's something that I'll put up there now because that's just something that on the observations they told us, ‘We are going to look for these learning targets.’ And so, I will make sure that I am putting them up there, even if I don't always agree with the whole process because I know that's something they are looking for.” Strategy/How Taught Reference to choosing or altering a teaching strategy in anticipation of or the result of an evaluation. Reference to choosing, excluding, Curriculum/What or otherwise altering curriculum in Taught anticipation of or the result of an evaluation, which may include the teaching explicit testing strategies. Who is Taught Reference to directing or not directing focus on certain students (triaging) in anticipation of or the “Some of it helps to me to change minor things in my instruction, a little bit. But it is usually... What I mean by that is if students see an equation written a certain way, I know to make sure to show them that format versus another format that's not incorrect... I would not be teaching an incorrect format.” “The pressure is that the Math One does have the end of course test. [Three teachers] developed a plan together to create a spiral review throughout Math One. And it has been going so well because each day of the week we have a different type of warm up activity, and we're reviewing a specific outcome that they covered last semester in foundations to Math One. And now we are also beginning to review the ones that we started at the beginning of this semester. But it's really great for us to pick up on little details, that we are like, ‘Oh, is that what they were missing? They could not tell the difference between a solid line and a dotted-line graph?’ Who knew that was the little missing piece of information? And it has not been perfected yet, but it has been really helpful for us.” “What happens, I think, in my mind is, which can be a dangerous thing, sometimes I think, when I get the Honors class, I naturally expect that they will do fine on the exam. And so, where I spend 65 Table 6 (cont’d) result of an evaluation, which could a lot of time with the Standard class test prepping, I do not spend as occur within classes or across much with the Honors class. And they usually do fine. Their growth classes. is usually not as large, which it is harder to meet growth anyways. Their test growth is usually not as good. Even though they meet what they should meet to pass, they don't grow as much. But my bigger concerns for the Honors class shift more toward writing, which we are not evaluated on at all, that writing prep that they need for the college-level writing, and just those critical-level thinking skills, the research skills, some of those bigger things that I can spend more time with them in the Honors class and really do not get a chance to go into with the Standard class because we are testprepping.” Response to Reform Teachers make a statement that exhibits an adherence to one of three reform typologies identified in this dissertation. “Nobody cares if you actually teach what you are supposed to teach. They are just glad you showed up… I know what to do. I have got the degrees. I know what to do. You do not need to be constantly telling me what to do. And because they do not invade our space very often, who knows?” Acquiescence A statement that reflects a typology where the individual accepts the policy without question and feels the policy did not impact their lives or jobs. “I think teachers' attitudes toward the observation of fairness, none of us feel like we are targeted or there is pressure put on us or anything like that. At the same time none of us, I do not think, feel like we are getting amazing feedback for growth and whatever… it is not impacting us one way or the other, we are just doing what we do in our classrooms every day.” Adaptation A statement that reflects a typology where the individual adapts the policy to fit their own needs. A teacher describing how she uses evaluation for self-assessment: “I feel like, when I go through the standards in that pre-evaluation is when I learn the most about, ‘Am I doing these things? Which of these could I do more?’ I can talk about curriculum and classroom management with my instructional coach, but these other things, like, ‘Am I contacting parents?’ For me, when I read through and 66 Table 6 (cont’d) did my pre-evaluation this semester, I was like, ‘Oh, I am so bad at that. Maybe I should work on that a little bit.’ So, when I looked at that pre-evaluation, and I kind of looked at those things, they were asking me like, ‘Do you contact parents regularly and stuff?’ That is our standard or something like that. I was like, ‘I could do better at that.’” Denial A statement that reflects a typology where the individual openly rejects or rebels against the policy due to a perceived negative impact. “The principal tells people, ‘These are what your PDP goals are going to be.’ I was like, ‘Are you kidding me? You are not telling me.” I say, ‘I am going to do what I want to work on. I am not going to work on what you tell me to, just because you just told me. I am going to be that bad kid.’ Because, I believe that I should be free to pick my own things to work on.” 67 Researcher Background and Neutrality Aside from the double coding of data, my study incorporates various internal and external supports for establishing validity. First, I was well-prepared to approach this type of research due to my background as a National Board Certified high school teacher with five years in public schools, including four years of experience in the state of North Carolina. I was teaching in North Carolina when the statewide observation system was adopted and when the student growth standard was added. This allowed me to think about my own experiences during that time to anticipate how the policy may have impacted teachers. My experience also granted me greater awareness of the policy atmosphere in which I was investigating and allowed me to more effectively engage with interview participants. Additionally, my background equipped me to be able to engage in the work that I conducted in this dissertation. I had extensive coursework in both quantitative and qualitative research methods as well as valuable work experience in my research assistantships dealing with both quantitative and qualitative data. This work experience has also involved developing codes from literature reviews to analyze artifacts such as: interview transcripts, observation rater notes, think aloud transcripts, and student writing. However, I recognize that bias occurs unintentionally and thus I constantly acknowledged how my past experiences, particularly as a teacher in the same state as my study, may have impacted data collection. I ensured neutrality by writing and reviewing my interview and survey questions beforehand to ensure that I asked non-leading questions and allowed opportunity for clarification from participants. I piloted both the survey and interview questions with other North Carolina teachers in different school systems prior to data collection. Also, the use of multiple types of data in the form of surveys and interviews served as a validity check and I used my 68 initial survey data to triangulate the later interview data (Stake, 2004). My study also features a multiple case design by including various groups of teachers (Math and English teachers, provisionally and professionally licensed, teachers from four different school sites, etc.) which enabled me to test my theory that different groups experience accountability pressure from evaluation in differing ways (Yin, 2009). Additionally, the second set of interviews, which was conducted several months after the first round, served as a type of member check to obtain feedback on the themes and typologies that emerged from the first round of interviews (Deyhle et al., 1992). Finally, my research was guided by a capable committee of faculty from two universities in the areas of teacher education, educational policy, and educational administration and I also sought the feedback of other students in the MSU Educational Policy PhD program throughout the dissertation process (Glesne, 2006). School System Site Broadville County is a large school system in North Carolina that surrounds a separate city school system. According to a school system profile available online, Broadville ranks in the top 15 of school systems in size of student population, yet ranks 85 th in funding out of 115 school systems in the state. Broadville serves just over 25,000 students and the system website states that over 25% of its students live below the poverty line. The schools in this study demonstrate a high rate of students enrolled in the Free and Reduced Price Lunch program. At the time of this study, there were 23 elementary schools, three intermediate schools, seven middle schools, six regular high schools, one alternative high school, and two middle/early colleges. According to the school system profile, as of the 2012-2013 school year, 14% of students were classified as Exceptional Children (EC), which is North Carolina’s designation for those receiving special education services. In the same year, 16% of students had the designation of being Academically 69 or Intellectually Gifted (AIG). Additionally, there were about 15,000 students classified as English Language Learners (ELL) who spoke 66 different home languages. The school system online profile also states that Broadville employs about 4,000 people and is the second largest employer in the area. According to the NC School Report Card, in 2012-2013 about 20% of teachers had less than four years of experience while an additional 20% had between five and nine. These averages are comparable to the other large school systems in North Carolina. Broadville was selected for this dissertation primarily due to its size and diversity. A large district was needed in order to be able to identify enough high schools with varying Evaluation Condition and Effectiveness scores to conduct analysis of differences at the schoollevel. Additionally, teacher evaluation was a sensitive topic at the time of this study and many school systems were facing lawsuits over the implementation of the policy. Increased pushback on state level teacher policies, including the evaluation policy examined in this study, was occurring from the local governments, universities, teacher unions and groups, and the public. Therefore, it was important to be able to provide relative anonymity for the participating district, schools, and teachers. A large district with demographics similar to other, large school districts was necessary to meet such requirements. So, Broadville was an ideal location for this study because of the varying characteristics between its nine high schools and demographics that were similar to other large school systems in North Carolina. School Sites Table 7 provides demographic information of the four focal schools in this study. Data on the student population are derived from the National Center for Education Statistics (NCES) Database, which is drawn from the 2013-2014 school year. The teacher data and classroom data 70 come from the publicly available NC School Report Card which uses information from the 20122013 school year. The Conditions and Effectiveness Scores were derived from the NC Teacher Working Conditions Survey and the Educator Effectiveness Database, respectably, and were calculated in the manner described earlier in this chapter from data from the 2015-2016 school year. 71 Table 7 School-Level Demographics Riley Phoenix Charles Central Student population1 1591 134 789 1103 % White Students1 68% 71% 82% 86% % Hispanic Students1 12% 8% 8% 8% 1 % Black Students 11% 10% 4% 1% % Asian Students1 3% 1% 1% 1% % Native American/Pacific Islander Students1 0% 2% 1% 0% 1 % Mixed or Other Races 6% 7% 6% 4% Students Participating in Free or Reduced Price Lunch1 38% 87% 46% 39% 2 Classroom Teachers 100 18 56 64 Teachers Fully Certified2 95% 89% 93% 92% 2 % Teachers with advanced degrees 36% 56% 31% 30% % National Board Certified Teachers2 24% 38% 41% 38% 2 % Teachers with more than 10 years experience 63% 53% 67% 66% Teacher turnover rate2 8% 5% 13% 17% 2 Average English II class size compared to system average +4 -14 +4 +1 Average Math I class size compared to system average2 0 -14 -3 0 Local Condition Score4 0.6 4.4 -2 -9.6 4 State Condition Score -1 -16 -11 -22.5 Local Effectiveness Score4 +0.6 +1 -2.5 +1 4 State Effectiveness Score +1.5 -13 +9.3 -4.2 School-Level Growth Score3 Meets N/A Exceeds Exceeds 1 2 Note. National Center for Education Statistics (NCES) Database, 2013-2014 school year; North Carolina School Report Card, 2012-2013 school year; 3 North Carolina School Report Card, 2012-2013 school year; 4 Calculated as described 72 School 1: Riley Riley is the largest high school in the study serving just under 1,600 students and employing about 100 teachers. The student body is the most diverse of the schools in the study with 68% of students being white and 32% non-white. Riley has the lowest level of students participating in free and reduced-price lunch (FARPL) in this study, at 38%. Additionally, 95% of Riley teachers are fully certified, 36% have advanced degrees, 24% are National Board Certified, and 63% have over 10 years of teaching experience. The turnover rate was only 8% and class sizes are reportedly close to the school system’s average. Riley is the only school in this study to have a separate Freshman Academy program geared at ensuring success for students entering high school. Teachers who teach courses for the Freshman Academy are all located on the same wing of the school which is separated from the main body of the school by the cafeteria. Freshman have a dedicated administrator and counselor also located in the wing. The Condition Score and Effectiveness scores were also quite close to district average. The Local Condition Score was only 0.6 above and the State Condition Score was 1.5 above the district average. Riley was also closer to average on the Effectiveness Scores than any other school at 0.6 above the local and 1.5 above the state. Overall, Riley has conditions and effectiveness that are quite close to Broadville’s average. School 2: Phoenix Phoenix is the smallest high school in the study, though its population is larger than two other specialty schools in the system. The population fluctuates throughout the year, but it serves about 134 students and employs 18 teachers. It is the second most diverse of the schools in the study with 71% of students being white and 29% non- white. Phoenix has the highest level of 73 students participating in FARPL in this study, at 87%. Phoenix has the lowest percentage of teachers fully certified at 89%, which is possibly an artifact of its small staff size. However, it has the highest percentage of teachers with advanced degrees at 65%, 38% are National Board Certified, and 53% have over ten years of teaching experience. The turnover rate at Phoenix is the lowest in the study at only 5% and class sizes are much, much smaller than the school system’s average. Phoenix is an alternative school that specializes in students who are failing out of or otherwise unable to perform in the traditional high schools. The program is selective and students who want to attend the school must go through an application process to be admitted. The class sizes are quite small which makes Phoenix’s alternative education program one of the most expensive programs that Broadville County runs. Phoenix had the highest Local Condition Score at 4.4 but the State Condition Score was 16 below the district average, indicating a high level of dissatisfaction with state components of evaluation. Similarly, while Phoenix was fairly close to district average for Local Effectiveness Score (+1), the State Effectiveness Score was -13, well below the district average. Overall, teachers at Phoenix have an average to high view of local conditions and an average ranking in effectiveness for local conditions. The State Condition Score and State Effectiveness Score fall far below the system’s average. Phoenix serves as an example of a unique working environment with high reported Local Conditions and average Local Effectiveness but very low State Conditions and Effectiveness. School 3: Charles Charles is the smallest traditional high school in the study serving just under 800 students and employing 56 teachers. The student body consists of 82% white students and 18% non- 74 white. Charles has the second highest level of students participating in FARPL in this study, at 46%. Additionally, 93% of Charles teachers are fully certified, 31% have advanced degrees, 41% are National Board Certified, and 67% have over 10 years of teaching experience. The turnover rate was only 13% and class sizes are fairly close to the school system’s average. Charles features an initiative to improve Math scores. This initiative involved the creation of a required Introduction to Math course which all students take prior to taking Math I. Math I is an EOC course and counts for the schoolwide growth score while the introductory course counts as an elective for students. The Local Condition Score was fairly close to the district average at -2, however the State Condition Score was -11 below the district average. Charles had the lowest Local Effectiveness Score at -2.5, which was still fairly close to the school system average. However, despite negative reported State Conditions, Charles fared much better than average on the State Effectiveness with a score of 9.3. Charles is a school with average Local Conditions, lower than average Local Effectiveness, high State Effectiveness, but low reported State Conditions. School 4: Central Central is a traditional high school serving just over 1,100 students and employing 64 teachers. The student body is the least diverse of all school in this study and consists of 86% white students and 14% non-white. At 39%, Central has a similar level of students participating in FARPL as Riley. Additionally, 92% of Central teachers are fully certified, 30% have advanced degrees, 38% are National Board Certified, and 66% have over 10 years of teaching experience. The turnover rate was 17% and class sizes are close to the school system’s average. Central did not view either local or state evaluation conditions favorably, with a score of 9.6 on the Local Condition Score and a -22.5 on the State Condition Score. These were by far the 75 lowest condition scores in the study. Interestingly, the teachers at Central have fared pretty well on Educator Effectiveness with a local score of 1 above the district average. However, with a State Effectiveness Score of -4.2, Central had the lowest score aside from Phoenix. Overall, Central serves as an example of a school with low reported Local and State Conditions but average Local and State Effectiveness. 76 CHAPTER 5: Overall Trends Across All Teachers This chapter explores trends in data across the entire sample of teachers from the study school system by analyzing responses from a survey of Math and English teachers across the four focal school sites (N=45) as well as examining the results of analysis from the focal interviews across all sites (n=14). An examination of the entire sample of responses allows for an analysis of the perceptions of a general sample of teachers to discern how evaluation policy may be related to practice, as well as to test the two assumptions of evaluation policy: (1) evaluations are necessary because teachers need to be rated, sanctioned, or rewarded in order to be motivated to do a better job and (2) evaluations yield information that is useful for teachers to improve practice. Overall, trends across the sample of teachers surveyed and interviewed for this study demonstrate that teachers do not perceive that their evaluations provide motivation or useful feedback for improving practice. While teachers expressed a positive view of their work expectations, views about the consistency and quality of feedback from observations were less positive, and state testing data was viewed very negatively. The complementary question set from the survey showed that teachers held generally negative opinions about evaluation from the previous year with slightly more positive responses on eight of the 11 questions when anticipating the current year. Four areas on the complementary question set had statistically significant, positive changes when comparing the prior year to the current year: modifying practice from evaluation, choosing teaching strategies based on evaluation, using observation data to modify practice, and feeling evaluation will be conducted fairly. In the focal interviews, teachers stated that feedback from both the observation and testing components of formal evaluation were not useful. Teachers expressed the following concerns regarding the validity of 77 observations: being told low rankings were necessary to show growth over years, the timing and timeliness of evaluation administration and feedback reception, the small sample of teaching actually observed, very broad or very narrow standards, unobtainable levels of distinction, and the consistency of scores across sites and administrators. The testing component was also criticized as not being timely or specific enough to provide valuable feedback. The validity of the testing component was questioned by teachers as being based on: a model that was difficult to understand, extremely low cut scores, and a small sample of both students and the curriculum. First, I analyzed survey data to gauge the sample’s overall perceptions of evaluation conditions with questions that replicated the North Carolina Teacher Working Condition Survey along with a complementary question set that asked teachers to reflect on the previous year as well as anticipate the upcoming year. Next, I analyzed interview data from 14 focal teachers across the four school sites utilizing the literature-based framework developed in Chapter 3 to explore: teacher perceptions of feedback from both elements of formal evaluation (observation and student testing), evaluation as a mechanism to motivate, reported changes in teacher practice, and teacher reform typologies. This chapter is meant to provide an overview of results from the entire sample of teachers surveyed across four schools. Chapter 6 explores how the context of the school and school-level factors may influence such perceptions and answers the research question: What, if any, role do reported school evaluation conditions and school evaluation status play in shaping teacher motivation, experiences with feedback, and work decisions related to teacher evaluation? Chapter 7 examines how the context of the individual may similarly influence perceptions and answers the research question: What individual-teacher level factors are associated with differences in teacher motivation, experiences with feedback, and work decisions related to teacher evaluation? 78 Survey Participants A survey was administered to Math and English teachers at the four focal high schools in October 2016. The survey was available online through Qualtrics. The first section of the survey asked participants to provide demographic information. Table 8 outlines the demographic information for survey respondents. I have included both licensure status and years taught divided into the categories of “seven or fewer” and “eight or more.” In North Carolina, a teacher is usually able to move from provisional to professional status after three years of teaching. However, due to the tenure law enacted in 2013, teachers with less than seven years of experience are subjected to full evaluation cycles every year. Therefore, it seemed pertinent to record both groups as all provisionally licensed teachers are evaluated in a full cycle, but many professionally licensed teachers are evaluated in full cycles as well. 79 Table 8 Survey Respondents Total Teachers Taught < 7 years Taught 8+ years Prof. License Prov. License Riley Phoenix Central Charles Total 0 3 3 3 9 15 4 10 7 36 11 4 11 8 34 4 3 2 2 11 15 7 13 10 45 80 Have taught EOC 15 7 9 4 35 English Teachers Math Teachers Response Rate 5 3 7 4 19 10 4 6 6 26 68.18% 100% 76.47% 76.92% 76.27% Comparing Sample Teacher and School Wide Perceptions of Evaluation Conditions In Chapter 4, I describe how publicly available statewide Teacher Working Conditions Survey data from 2016 was used to calculate school-level Evaluation Condition Scores. Table 9 shows a summary of the results of replication questions where lower numbers represent disagreement and higher numbers represent agreement. The primary purpose of asking the replication questions was to ensure that focal teachers selected for interviews did not hold beliefs that varied wildly from the average of teachers at the school. I compared focal teacher replication responses to the school averages on the replication questions as well as on the original 2016 data to determine that I was not selecting a focal teacher who held outlier beliefs. Table 9 Responses on Teacher Working Conditions Replication Questions Question N Min Max Mean Teachers are held to high professional standards for delivering 44 2 4 3.57 instruction. Teacher performance is assessed objectively. 40 1 4 3.08 Teachers receive feedback that can help them improve teaching. 41 1 4 2.78 The procedures for teacher evaluation are consistent. 40 1 4 2.85 Local assessment data are available in time to impact 39 1 4 2.67 instructional practices. Teachers use assessment data to inform their instruction. 42 2 4 3.12 State assessment data are available in time to impact 38 1 4 2.03 instructional practices. State assessments provide schools with data that can help 41 1 4 2.22 improve teaching. State assessments accurately gauge students’ understanding of 42 1 3 1.90 standards. Note. 1- Strongly Disagree 2- Disagree 3- Agree 4- Strongly Agree There was an option for “Do Not Know.” These responses were removed in order to calculate the means. On average, teachers agreed with statements that were related to their quality of work overall, namely: teachers are held to high standards, teachers are assessed objectively, and teachers use assessment to modify instruction. However, there was less agreement with 81 statements that reflect the perceived usefulness of local level evaluation to teachers when asked about the quality of feedback, the consistency of evaluation, and the timeliness of local data in order to improve instruction. When asked specifically about state testing data, teachers expressed perceptions that viewed such data in a more negative light than local data. In particular, on average, teachers disagreed that state assessments are available on time, provide feedback to improve teaching, or accurately gauge student understanding. Overall, the trends observed with the sample of English and Math teachers from the four focal schools aligned with the same trends that were observed in the district. Teacher Perceptions of Last Year versus the Current Year about the Evaluation Process Table 10 groups the data from the complementary question set of the survey into themes and provides descriptives and paired t-test results from the survey. The final question of the prior year section did not have a complementary current year question; therefore, a paired analysis could not be conducted for that question. In this case, the descriptives for that question are provided. Also, the paired t-tests were run question by question, excluding individuals who did not answer the complementary pair. So, there is some variability in the “n” from question to question which results from the inclusion of first year teachers who had no prior year experience to reflect on or from individuals skipping questions. When reflecting on the previous year, each of the five of levels on the scale were used by at least one teacher for all questions. However, on average teachers seemed to disagree with nearly all of the statements. Two statements fell on average between “strongly disagree” and “disagree” and those were statements about teachers’ concerns that evaluation could impact employment or label an individual as a bad teacher. All the questions about evaluation’s impact on practice led to responses on average between “disagree” and “neither agree nor disagree.” 82 Only one statement generated a response in the affirmative range and that referenced teachers’ perceptions of whether the evaluation was fair. For all but three themes, teachers’ responses were higher for the same question themes about the current year compared to the previous year. All five ratings were used for each statement except for the statement, “I feel I will be evaluated fairly in the upcoming school year,” which utilized between “neither agree nor disagree” to “strongly agree” and averaged as the highest overall ranking. As mentioned previously, evaluation fairness was the only theme to have a mean in the affirmative range for the prior year and had a statistically significant difference between the prior year and the current year t(40) = -2.01, p = 0.05. This significance suggests that teachers overall may have felt more optimistic about the fairness of evaluation in the current year regardless of their experiences the prior year. Three practice related statements ranked between “neither agree nor disagree” and “agree” with means close to a neutral score of “3,” demonstrating that teachers were not overwhelmingly in agreement that evaluation impacted their practice in the stated manner. However, statistically significant differences were found when comparing the themes of modifying practice using feedback from evaluation from the prior year to the current year, t(41) = -1.83, p = 0.08, choosing teaching strategies based on what one was evaluated on from the prior to the current year, t(41) = -1.81, p = 0.08, and using observation data to modify classroom practice from the prior to the current year, t(41) = -1.83 p = 0.08. Such differences demonstrate that teachers on average may have intended to more deliberately take these actions in the current school year as opposed to the previous year. All other statements in the complementary question set fell below the neutral ranking and were not statistically significant. As with the question set focused on the previous year, the 83 statements about concern over future employment (M= 1.93, SD= 1.07) and being labeled a bad teacher (M= 1.88, SD= 0.92) had the lowest averages though there was an overall rise from the prior year. This rise in averages between teacher perceptions from the past year to the current year suggests that teachers overall may have a more favorable outlook on evaluation in the upcoming year as opposed to the prior. However, the results do not seem to indicate that the surveyed teachers perceive that evaluations have a large impact on practices. 84 Table 10 Paired T-Tests of Statement Themes Reflecting on Last Year Versus This Year Prior Current M SD M SD n 95% CI for Mean Diff -0.39, 0.25 r Modify practice in anticipation of an 2.62 1.23 2.69 1.26 42 0.66*** evaluation Modify practice using feedback from 2.82 1.22 3.14 1.07 42 -0.70, 0.04 0.47*** evaluation Have concern evaluation affects 1.81 1.11 1.93 1.07 42 -0.29, 0.05 0.87*** employment Have concern evaluation labels as a bad 1.86 1.05 1.88 0.92 42 -0.22, 0.18 0.79*** teacher Have concern evaluation does not reflect 2.55 1.21 2.40 1.06 42 -0.15, 0.43 0.68*** competency Choose curriculum based on what 2.54 1.33 2.46 1.19 41 -0.27, 0.42 0.63*** evaluated on Choose teaching strategies based on what 2.71 1.26 2.98 1.26 42 -0.55, 0.03 0.72*** evaluated on Direct focus on certain students based on 2.48 1.19 2.45 1.15 42 -0.26, 0.31 0.69*** what evaluated on Use test data to modify classroom practice 3.24 1.14 3.33 1.10 42 -0.42, 0.23 0.56*** Use observation data to modify classroom 2.81 1.17 3.14 1.20 42 -0.70, 0.04 0.50*** practice Feel evaluated fairly 3.83 1.10 4.15 0.57 41 -0.64, 0.00 0.37** Feel last year will impact current year 2.50 1.22 42 Note. Scale for Survey: Strongly Disagree 2- Disagree 3- Neither Agree nor Disagree 4- Agree 5- Strongly Agree * = p< 0.1, **= p< 0.05, *** = p < 0.01 85 t df -0.45 41 -1.83* 41 -1.40 41 -0.24 41 1.00 41 0.43 40 -1.81* 41 1.67 41 -0.59 -1.83* 41 41 -2.01** 40 Interview Participants Fourteen teachers were chosen as focal participants from the four schools. Table 11 summarizes the characteristics of the focal participants. Table 11 Interview Participants Total Taught <7 years Taught 8+ years Prof. License Prov. License Riley Have taught EOC English Math 2 0 2 2 0 2 2 0 Phoenix 3 2 1 1 2 3 2 1 Central 5 2 3 5 0 5 3 2 Charles 4 2 2 3 1 2 2 2 Total 14 6 8 11 3 12 9 5 After coding was complete, I first examined the frequency and the percentage of interviewees in which each code occurred. In this manner, I was able to determine that some perspectives came up much more frequently across interviewees as opposed to others (Table 12). For instance, teachers were more likely to have a negative opinion of, or to be critical of, feedback received from either component of evaluation (observation and testing). The next section discusses the trends that emerged in the interview portion along with examples and possible explanations for the trends observed. The discussion of these trends is laid out to mirror the components of the framework outlined in Chapter 3. First, I analyzed the interview data through the lens of the two assumptions of teacher evaluation policy to determine if teachers found evaluations to be motivating and/or to provide useful feedback. Next, I examined the data to determine if teachers exhibited any responses similar to those recorded in literature about 86 teacher responses to other external reform initiatives. In doing this, I also considered whether teachers exhibited certain reform typologies during the interviews Table 12 Interviewee Code Use Code Motivation Internal External Observation Feedback Negative Positive Testing Feedback Negative Positive Job Loss Work Decisions Strategy/How Taught Curriculum/What Taught Who is Taught Response to Reform Typology Acquiescence Adaptation Denial Note. n= 14 Frequency Percentage 13 9 14 14 8 13 13 2 8 92.9% 64.3% 100.0% 100.0% 57.1% 92.9% 92.9% 14.3% 57.1% 6 6 2 42.9% 42.9% 14.3% 7 8 3 50.0% 57.1% 21.4% Evaluation as a Form of Motivation Every teacher was asked whether evaluations motivated them to improve their instruction. None of the teachers indicated that their formal evaluations, either observations or test scores, motivated improvements in instruction. However, teachers noted that they were upset when parts of evaluation went poorly or if lower than expected ratings had been received. For instance, Mrs. Ranier, who had students misbehaving during an observation immediately prior to our interview, joked that she would need to drink after an evaluation like the one she had that day, “[W]hen [evaluations] fall short of what I want them to be, either justifiably, like this one, or not justifiably, like ones in the past, I just have to try to put it in another compartment of my 87 brain, because it's so demoralizing.” The disappointment Mrs. Ranier and other teachers described seemed linked to intrinsic motivation because teachers spoke of the feeling directed inward toward themselves rather than outward towards the external individual who assigned the ranking. When disappointed by their own performance, teachers may feel as though they are lacking in competency, whereas being disappointed in an unfair rating reflects frustrations with efficacy and an inability to obtain the score an individual feels the may deserve. Those who are intrinsically motivated feel good when told they are doing well and may find frustration when they either do not feel they did well or feel unjustly labelled as such. All except one teacher offered intrinsic reasons for why they were teachers and acknowledged their own feelings of accountability as a source of motivation. Intrinsically motivated teachers were skeptical that most teachers were motivated by external factors and referenced low pay, long work hours, and lack of respect that came with their job, arguing that such conditions were at odds with someone who would be motivated by external rewards. In Mr. Allen’s words, “I want to make sure that I am doing it the right way because I think it is important…I think that is really what was driving me, is that I want to make sure I am doing this because it has long-lasting impacts on these kids and on our community.” Mr. Allen, like other teachers, also brought up the discrepancy he felt between his own impressions of the quality of his work and the ratings he received in observations, I am very aware there are some days where I go in, I am like, “Man, that was a two out of ten kind of day.” It wasn't good enough. But it's for me, it is harsher coming from me than it is from a third party. Because, I do not know what their standards necessarily are. Because, I have had days I thought were very mundane days and I have gotten really 88 good ratings, I am like, “No. That was not a four out of five kind of day. That was a 2.5 out of five at best kind of day.” Nine teachers did reference extrinsic motivation by acknowledging that some teachers may need an external push, like the threat of job loss due to poor evaluations. Despite recognizing that some individuals may need to be evaluated to be motivated, these teachers did not feel that the evaluation system was particularly motivating for the majority of the teaching population. For instance, when asked about evaluation as an external motivating factor, Mr. Donaldson, an English teacher at Riley, felt that teachers would like recognition wherever it came from and in whatever form it came in, but that the system in which teachers achieved ratings was perceived as so arbitrary that most educators do not take evaluations seriously. Further pushing against the idea that teachers were externally motivated to improve work, four teachers noted that they felt fortunate that their spouses had good paying jobs that allowed them to teach and do something they really cared about as they would otherwise be unable to afford to remain in the profession. Teachers also mentioned that when the current evaluation system was initiated it was intended that bonuses would soon be attached to their scores. A bonus policy never came to fruition, but the teachers who taught during the implementation of the current system remembered the initial plan and mentioned how this proposal resulted in increased anxiety and attention to the evaluations initially. However, concerns were alleviated when the bonus system never materialized. Overwhelmingly, the focal teachers interviewed for this study did not view evaluation as a means to motivate individuals to do better at their job and cited intrinsic sources of motivation as more valuable motivators for the improvement of practice. 89 Evaluation as a Source of Feedback Feedback from evaluations was the topic that dominated conversations about evaluation. In the formal evaluation policy examined in this dissertation, feedback stemmed from both administrator observation and state testing. Both sources were referenced nearly equally across interviewees. In general, teachers usually critiqued the feedback that was provided by both components of formal evaluation with all focal teachers referencing negative aspects of observation feedback and all but one for testing feedback. Specifically, teachers critiqued formal observations due to systematic concerns about the growth model promoted by the state and the small sample of observations conducted. Other critiques included concerns related to: the timing of both when observations were conducted and when feedback was received, some standards being either too broad or too specific, the difficulty of achieving high marks, and the lack of standardization and consistency in scoring across sites and evaluators. Regarding testing, teachers had difficulty with understanding the way in which their growth scores were calculated. Teachers explained that the metric used had either been not explained to them or was difficult to understand. Teachers also identified mathematical weaknesses in their understanding of the model. For instance, teachers referenced that the way in which the teacher’s growth score was calculated did not seem to truly account for the small sample size of either students or questions used to calculate scores. Additionally, teachers raised concerns with student issues that fell outside of the teacher’s control (such as frequent absences or extended illnesses) yet impacted a teacher’s growth score nonetheless. Teachers also questioned the validity of the test used to calculate student growth and described incredibly low cut scores which allowed students to pass or even obtain high grades with low percentages of 90 correct questions. Finally, teachers stated that they were unable to utilize the feedback provided by testing because of the amount of time it took to receive, and the lack of specific information provided. Perceptions of Feedback from Observation The timeliness of evaluations and the amount of time spent on evaluations were problematic for teachers who referenced evaluations being done at inopportune times or in sequences that were unable to best assess teaching. For instance, teachers on full evaluation cycles often brought up instances where all three required evaluations would occur in quick succession within a month, often in the same class, rather than sampling throughout different classes during the year. Teachers indicated that they felt this sampling technique made it impossible for an administrator to really gauge how a teacher handled different types of classes and different groups of students. Teachers who were observed exclusively in a difficult class felt at a disadvantage to those who may have been exclusively observed in a higher achieving or better-behaved class. Additionally, teachers stated that this quick succession of observations often occurred later in the year when administrators expressed that they were trying to make up for missed observations and complete requirements before a deadline. The teachers who experienced this lamented being observed during review time for exams when they were unable to demonstrate how new concepts were taught to students or were required to participate in certain review activities instead of “actual teaching.” An additional criticism teachers raised was having an observation prior to receiving feedback from an earlier observation, which prevented teachers from learning from the first observation and addressing issues that may have been brought up, or 91 ensuring the next observed lesson exhibited standards that may not have been met in the previous observation. Teachers also expressed frustration with perceived flaws in the structure of the observation system itself. Teachers at all four schools noted that when the current evaluation instrument was new, when they were early in their career, and/or when a new administrator evaluated them for the first time, they were told by the observer that they would be rated lower initially to allow the teacher to “show growth” in future evaluations. Mrs. Ranier, a veteran English teacher at Charles explained, “I have talked to administrators and I know they were told when they were trained to evaluate us that they have to leave room for growth. Which means you cannot ever be at the top, not really.” The approach described by Mrs. Ranier was brought up by teachers at all four focal schools. Another English teacher at Charles, Mr. Eagle, who had previous teaching experience but was new at the current school, was told prior to his first observation of the year, “You are going to be developed or proficient, the very first two categories. You will not hit advanced, there is just no way you are going to hit distinguished.” This left teachers with a sense that the scores received from an observation may not be a true, objective reflection of teaching ability and instead a score the administrator gave that was subjective to how long an administrator had observed them and would allow room for the administrator to show that a teacher has “grown” over time. Additionally, several teachers argued that an observation of a fraction of a class period, even if conducted three times a year, did not provide an adequate sample for an administrator to get an idea of a teacher’s ability. Mr. Allen, an English teacher at Central, began to calculate the actual amount of teaching his administrator observed as compared to the amount of time he taught the entire year, explained, “If a statistician looked at that, they would be horrified that is 92 how we get evaluated.” Teachers stated that it would be preferable if administrators took more interest in the work of teachers outside of evaluations so that observations were not the only time the evaluator was exposed to an individual’s teaching practice. Administration taking a greater interest in teacher work could manifest in a few ways. Two of the schools, Community and Riley, required teachers to submit daily lesson plans, but teachers also felt that having engaging pre- and post-conferences (a requirement of the evaluation system for those undergoing full evaluation cycles, which was reportedly not followed with fidelity) and having administrators pop into classrooms more frequently for informal check-ins would lead to a better assessment of teaching, rather than a few formalized snapshots each year. Mrs. Ranier, who had taught for 22 years, suggested that instead of taking large chunks of time for formal observing followed by lengthy post conferences, principals should be more present in classrooms in order to better know the staff and their teaching. Reflecting on schools where she had previously worked, this teacher stated, I really enjoy working at a school where the administrators are present and they are often in your class, because then when I sit down with them, and they are talking about my teaching, they can say, “Oh, but you did this on the other day, when I was just walking through.” And also, then their presence, in and of itself, would not be so disarming when they come here the two or three times they come to do an actual observation…they should be more present, so that I feel more comfortable, and so the students feel more comfortable with them, and they have a better idea of whether or not I am doing my job consistently. This critique was common across interviews and while I did not specifically ask each teacher about how often administrators visited informally, no teacher reported that an 93 administrator was ever in their classroom in the study year aside from formal observations. Teachers also talked about certain evaluation standards being either too specific or too broad, which they perceived made it difficult to receive feedback that could actually be used to help instruction. One of the standards that was mentioned frequently was the technology standard. Some teachers had difficulty meeting this standard because their administrators only showed up to observe at times when technology was not used. Mr. Allen discussed the difficulty he had satisfying the technology standard, “One of the standards is: Do you use technology? If I am seen four times a year…I have had years where I get ‘no’ or I will get just whatever the bare minimum one is... I use technology most weeks, two or three times a week. [The administrator] came in four times and it was four days where in that particular period I was not using technology.” Conversely, some Math teachers brought up how easily satisfied the technology requirement was because their observing administrator had a poor definition of what constituted technology. Evidently, some observing administrators counted calculator use as technology use to satisfy the requirement of the observation rubric. Therefore, Math teachers would not need to use any other technology source to meet that requirement of the evaluation other than a tool which was already commonly used and required as part of the curriculum. However, every Math teacher interviewed referenced difficulty meeting another standard on the evaluation rubric. Each of the Math teachers brought up how the standard “global awareness” presented a challenge. I asked one teacher, Mrs. Proffitt to describe what global awareness looked like in a Math classroom and to explain what was meant by how resources given to Math teachers to satisfy the requirement were lacking. She described attending a 94 professional development training for a program called “Newszilla” which teachers were told they were expected to use and could accommodate the global awareness standard on the observation rubric, “I looked and looked and there were like two that actually applied…It is not the same in every classroom. Social studies, English. Science, you can bring it all in, but it is very difficult for us in Math to.” Another Math teacher, Mr. Robbins, summed up the issue this way, I feel I am at a disadvantage compared to a History teacher. Even a Science teacher, I think, would have a little bit easier time with that because when you talk about different events like pollution… you can talk about what is going on in different countries. Mathematics, if we are studying quadratic functions, how am I supposed to incorporate global awareness into that without using some kind of a stretch of a real-world context that is just really bizarre? Additionally, two Math teachers noted that the stretch that was necessary to incorporate the requirements of the “global awareness standard” fell outside of and may even have opposed the Math standards set by the state. These teachers contended that the standard and the push to use programs such as “Newszilla” required teachers to teach things that were outside of the established standards and outside of what was tested. So, these Math teachers felt that the requirements of the teacher evaluation process were at odds with the actual standards of the courses taught. Conversely, teachers also pointed out that some standards seemed far too broad. This issue was raised by others who mainly indicated that the broad interpretations of standards in the observation rubric did little to promote the more standardized observations promised when the current evaluation instrument was introduced. Mr. Robbins explained how a standard about 95 teachers being ethical felt too broad to be a standard that should be observed, “And I was always bothered by that because I felt like I should be in the distinguished, the farthest up possible. But the only way that I have been told that I think I could even get to that is by basically doing a workshop for teachers teaching them the code of ethics.” Mr. Robbins felt that not being ranked high in the ethics category indicated a deficit in ethics, whereas the rubrics required the sharing of knowledge amongst staff through activities such as workshops or leading staff staffing to be a pre-requisite for achieving high marks. Several teachers, including Mr. Robbins, were troubled by the absence of standards that focused distinctly on a teacher’s ability to teach the subject area. Mr. Robbins explained, “There is nothing in those standards that says anything about Math.... It is just, ‘Are you a good teacher?’ It is very broad.” This was a topic broached by teachers of both subject areas; however, the relationship between observation and subject area will be explored in greater detail in a subsequent chapter on individual-level differences. For the conversations with focal teachers, it appears that the assessment of whether a teacher was competent in their respective subject area was left up to the measurement of student growth as calculated by standardized testing rather than by an administrator’s judgement. Similarly, while not necessarily referencing specific standards, the difficulty of obtaining high rating levels such as “distinguished” was brought up by other teachers. Mrs. Ranier described how it would be impossible to be considered a distinguished teacher overall, even after 22 years in the classroom, based on the observation rubric used, I do not like the idea that I have to do things outside of the classroom that I do not want to do… I do not want to go to professional meetings, I do not want to lead a committee, I do not want to do any of that… So, I do not like the new evaluations in that I am graded 96 for things that I do not feel are the reasons I became a teacher, and reflect my performance and abilities as a teacher... I help to hire new teachers? No. Am I part of a professional organization? No. Overall, teachers expressed that they felt it was nearly impossible to be highly rated as a teacher overall due to the requirements of the protocol. Teachers also indicated that they felt the observations and ratings were not standardized or consistent. However, achieving higher ratings may be easier in in locations outside those in this study. Four teachers referenced looking at the publicly available scores at other schools and noticing that in some locations, all teachers were rated in the highest two designations for all categories. Critiques of such practice was one of the reasons used to justify the current, lengthy evaluation protocol used in North Carolina and teachers were quick to point out that the issue had not yet been remedied (Weisberg et al., 2009 and others). Teachers also mentioned discrepancies in evaluations between different schools in which they had worked or even within the same school between evaluators. Teachers indicated that they knew which administrators in the school would be “tougher” on observations than others. Teachers also seemed aware of which administrators would provide better feedback and which were just “checking a box.” Most of these issues are related to the context of site or of the individual and are discussed in the next two chapters. However, it is important to note that teachers overall expressed that they did not feel that they could accurately trust their ratings as a reliable source of feedback about their practice due to discrepancies in how these rankings were awarded. Overall, teachers were very critical of the feedback received from observations and of the observation process as a whole. Nine teachers mentioned a belief that the evaluation process as a whole may be necessary to help expedite the formal removal of teachers who should not be in 97 the classroom, but all felt the current system used was grossly ineffective at providing feedback to improve practice. Eight teachers overall mentioned positive aspects of the formal observation process and described how it created an opportunity for them to independently reflect on their own practice. However, these same teachers stated that reflective practice was something they already did on a regular basis and the observation was just an opportunity for them to engage in such behavior more systematically. Perceptions of Feedback from Testing Similar to observations, teachers also expressed frustration with the feedback received from the testing component of their evaluations. There are two types of tests administered to high school students that count towards teacher evaluation scores: the End of Course (EOC) exam which is given in Math I, English II, and Biology and counts for both individual teachers and for school-level evaluations; and the North Carolina Final Exam (NCFE) which counts for individual-level teachers and is administered to all other courses, with few exceptions. In the case of this study, those exceptions include Advanced Placement courses and in the case of Charles High School, an Introduction to Math course which counted as an elective that the school required prior to Math I. There were two concerns which were unique to subject area. First, English teachers expressed frustration that the tests covered a small amount of the standards for their subject area. Second, Math teachers expressed concerns over recent changes to the curriculum and tests. These issues will be explored deeper in Chapter 7 which focuses on individual-level characteristics like subject area. Both Math and English teachers expressed concerns over the accuracy of the test, particularly in regard to the scoring scale that was used and the method used to calculate standard 98 6. Teacher growth for to standard 6 is calculated using a very complicated psychometric model and teachers seemed to know little about it aside from that it was supposed to calculate how much a student grew in a given year and that the highest and lowest scores were eliminated as outliers. Teachers expressed frustration that the score was calculated with such a small sample of questions (40) and two teachers cited examples of outlier scores which were dropped despite the teacher feeling that the student had made significant amounts of growth due to the work of that teacher. Teachers also believed that the calculating technique was ineffective at eliminating poor results due to other factors outside teaching, such as in the case of excessive student absences or illness during test day. Aside from the two subject area-specific concerns referenced earlier, teachers also questioned the validity of the test in regard to the cut scores used for students. One teacher explained, “One student got, out of 40 questions, she got seven right, and that was not a pass, but it was still very high, it was like a 58, and I remember thinking, ‘I should have just told her to pick ‘A’ for every answer because she would have done better than getting seven right. Similarly, a math teacher quipped, “Well, if you are curving it down that far, if you are guessing on every single one, the difference between an A and a C [for a grade] is negligible.” The cut scores in particular seemed to frustrate teachers as these seemed to provide students with an unreliable indicator of their performance that did not mirror the scale of assessments used by teachers in the classroom. Another major barrier to using the tests as source of feedback was the timing and specificity of the feedback received. While the raw score and the grade of the student was received quickly following the test, the breakdown of scores was not received until the following fall when the new school year was already well underway. Once the more detailed reports are 99 received the following fall, teachers expressed further frustration over the level of detail provided as particular goals are not identified in the data. An English teacher explained, “[It] is always frustrating to us that the data we receive back from the test itself is just so general. It will say ‘Reading Information Strand 2,’ and Strand 2 has six different goals in it, so I have no idea where my weaknesses truly are as a teacher, to help the kids grow there.” Teachers also perceived that it was impossible to discern from the data whether the scores were a result of instructional decision of the teacher. One teacher explained how his students exhibited a higher than usual amount of growth in the last testing cycle, “I do not know where that happened. I know of different things I did, but I do not know if that actually had made that [difference]. I do not know if my changes made the growth happen or is it just because maybe we read some stories that related to the story that was on the test that year and they were just more familiar in some way.” Overall, teachers perceived a lack of specific, actionable feedback as a barrier to making improvements. There were only two teachers who felt that the feedback they received from testing was positive and both taught at the alternative school, Phoenix. In both cases the teachers referenced historical data rather than the data received in a testing cycle completed by that teacher. Mr. Forest, the Math teacher at Phoenix, told an anecdote about looking back on past test data for a student and realizing he had missed Math courses, We looked at his test data… and he was fairly consistent in elementary school, he was probably scoring threes and fours on the state exam tests, and then in fourth grade, it just tanks, and he's down in level 1, level 2…We kept digging and digging, and we looked into his transcript, and he had not actually taken seventh and eighth grade math. And then now, he is 18 and he is trying to learn Math 1…He is bad at fractions, that basic 100 background, even to have to be able to just kind of push him through Math 1, 2, and 3… So, that kind of addresses how we can use the state data. So, in the above example, the record keeping that follows state testing was useful in providing a teacher feedback on what was missing from the student’s background. However, other than these two teachers from the alternative school who offered similar examples, the other teachers in the sample did not describe having any positive experiences with regard to the feedback received from testing data. Feedback from Other Forms of Evaluation The literature reviewed in Chapter 3 illustrated that feedback can be beneficial to improving a teacher’s performance; however, Hackman and Oldham (1980) argued that feedback should stem from the work of teachers. When discussing the formal evaluation system utilized in North Carolina, teachers seemed to be referencing a disconnect between their work and the feedback provided. Rather than being driven by the work, feedback often seemed to be driven by the observation instrument or the values the observing administrator had derived from the instrument. Similarly, the tests used to calculate standard 6 of teachers’ evaluations presented feedback that was driven by the test itself as well as by the values driving the test. Interestingly, when asked about the relationship between evaluation and feedback, the focal teachers often referenced other types of evaluation and sources of data outside of the formal evaluation process. These references usually entered the conversation organically without any prompting. However, once this pattern was discovered, I started asking teachers about their experiences with other forms of evaluation, whether that be observation or testing, when they had not mentioned other sources of feedback. 101 Eight teachers referenced experiences with other observers, such as instructional coaches and in one case, the superintendent, to be beneficial and meaningful. Mr. Robbins explained the relationship he has with his instructional coach, We have a Math coach here. She comes at least once a week, sometimes twice a week, and will come in and observe our classes. She offers assistance during that class. Occasionally, she will help students or I can ask her a question about, “Where do I go now in this lesson?” And she can direct me. Sometimes she just comes and observes, and then later, she will come during my planning period and talk to me about what she saw, what she noticed… She will come back another week and do it again, and maybe say, “Hey, I noticed that you tried this today and it worked really well.” So, I am getting evaluated from her, but it's not on any kind of formal basis. Mr. Robbins goes on to explain how this type of observation counters many of the complaints about the feedback from formal evaluation described previously. He explained that his instructional coach observations occurred much more frequently than formal observations and highlighted the coach’s background and expertise in Math as being of particular importance and relevant to the quality of feedback that he received from observation experiences. The experience of Mr. Robbins echoed that of many other teachers who brought up informal evaluation experiences as more valuable than the formal administrator evaluation required by law. Such teachers highlighted the frequency of the observations, the personalization of the experience, and the subject area knowledge of the observer as key components of what made them feel that these informal evaluation experiences more successful. So, teachers are not dismissive of all forms of observation feedback and many in this study found other, informal sources as useful. Rather, teachers presented legitimate concerns over the usefulness of the 102 feedback received from formal evaluations which were often countered by more positive, informal experiences. However, not all teachers indicated that informal evaluations were positive. The value of the experiences seemed to be contingent on the people involved and the respective teaching situation. Four teachers felt like the coaches brought in theory that was not applicable to the realities of the classroom the teacher was working in. One Math teacher explained, “There is theory and then there is practice…I have not found [instructional coach observations] to be useful to me because it is like asking me to really reconstruct how I am going to teach and I am not going to be able to do that. And so, the ideas that I am getting are not really ideas that I can implement that are realistic for me to try.” In this case, the informal observer may have failed at relating feedback to the teacher’s work as her approach was more theory-driven than based in practical application. As with observation, teachers also mentioned other methods of testing which were informal and provided feedback that was useful to their practice. Specifically, Math teachers mentioned the use of county-wide benchmarks. The focal English teachers did not use county benchmarks, though they were aware of them, and some veteran teachers mentioned using them in the past. However, teachers of both subjects mentioned using school-based common assessments designed in professional learning communities (PLCs). One English teacher described how schoolwide common assessment worked at Central, “We design the pretest, we teach our unit, and then we give our post-test and then compare those to pre and post-test assessments, but then we sit down and we break them down into smaller components in order to then revise our instruction.” She described this experience as “authentic” and explained how this type of assessment design provided feedback that benefitted her instruction and allowed teachers 103 to more readily evaluate why a student may have missed a question, “I know exactly which questions my kids missed, so I know where my weaknesses were in my instruction. So yes, I use that all the time to inform my instruction… and I can also look at that particular question and say, ‘Was it how the question was worded or was it the skill?’” Again, the interviews suggest that teachers are not totally dismissive of testing as a form of feedback. Instead, teachers stressed that the informal tests which were successful were designed to match closely with the work of teachers and to provide feedback in a manner that could be used by the teachers. The informal testing described here provided feedback that was both timely and specific enough to show teachers areas of strength and weakness that could be improved upon with their current set of students. Responses to Reform There is some evidence that teachers change practice either in anticipation of or resulting from teacher evaluation policy in manners similar to those observed in teacher responses to other external accountability pressures. Overall, nine of the 14 interviewees mentioned changes in practice due to either aspect of the evaluations (observation or testing). Six focal teachers referenced changing teaching strategies or how a teacher taught, six focal teachers described changing curriculum or what a teacher taught, and two referenced focusing on certain students due to evaluation. The way in which these responses manifested will be subsequently described. The changes in teaching strategies that the six teachers cited were generally quite superficial. For instance, teachers mentioned making sure that learning targets were listed on the board because they knew their administrator would check for those during observation. Additionally, teachers described trying to “hit a box” on the evaluation score sheet for a standard such as technology use, which had not yet been observed. 104 Teachers also adapted their teaching strategies based on testing, but again this often resulted in minor alterations. For instance, one Math teacher explained how he made his students complete warm-ups on the computer to mirror the conditions under which students would take the state assessment. Similarly, another Math teacher described how the format of Math problems may be altered to better mirror what students would see on a test, “If students see an equation written a certain way, I know to make sure to show them that format versus another format that is not incorrect.” Likewise, an English teacher at Riley described creating study guides with questions that were worded in the same manner that students would find on the standardized test. All of the alterations described were superficial changes that put students in situations that better mimicked the conditions of state testing. When asked about making changes to curriculum due to evaluation, the six teachers referred to the influence of the state tests as opposed to observations. For instance, an English teacher at Central described how she focused more on reading curriculum, which she knew would be on the test, at the expense of other, non-tested elements of the English curriculum (specifically the writing, speaking, and listening strands of the Common Core State Standards). The same teacher stated that this was especially true for her standard classes as opposed to her honors courses, with the former needing much more of a “push” in order to show growth on the exam. Mr. Forest, the Math teacher at Phoenix, saw evaluation policy as ideally being aligned with good teaching and referenced both the observation and testing components in his explanation. He explained that he felt that choosing curriculum that aligned with the way he would be evaluated was a “moral obligation.” Additionally, he stated that he was not motivated by the evaluation itself, but instead by the principles of good teaching that the evaluation was 105 meant to measure, “When I am looking at the standards, I see that as what a teacher should do already, and I know that most other teachers at my school feel for the most part the same way, that we do not plan for that evaluation.” Mr. Forest reiterated that he sees the evaluation standards as something he should be doing on average most of the semester and felt this approach may seem lax, but that at his particular school his job was not in danger as there was a lack of Math teachers, particularly those willing to teach at an alternative school. There were only two interviews that referenced focusing on particular students due to the formal evaluation. These examples were about directing certain skills at students who needed increased help passing the test; for instance, teaching reading strategies to students who read below grade level or basic Math concepts to students who lacked the skills to complete grade appropriate work. There were no mentions of directing attention on certain students due to observations. Such behaviors were also very superficial and did not represent radical changes in a teacher’s practice. Reform Typologies Reform typologies were recorded if a teacher made a statement during an interview that demonstrated that they fit one of the three categories of Yurkofsky’s condensed reform typologies: acquiescence, adaptation, or denial (2016). Seven focal teachers included statements that indicated acquiescence. Teachers who demonstrated acquiescence generally stated that they accepted their evaluations without question. For example, Mr. Augustus, a Math teacher at Central called formal evaluation, “a fact of life.” Teachers who demonstrated acquiescence did not feel that evaluations had any effect on their teaching lives in any other way, nor did they try to push back on the system or attempt to adapt the system to be more useful for them. For instance, Mrs. Street, an English teacher at Phoenix surmised, “I do not even really pay attention 106 to the principal being in the room…I always ask him, ‘Do I still have a job?’ And I still have a job, so it went well.” Some teachers referenced ways in which they were able to take formal evaluations and adapt the process and the results into something that was useful to them. Overall, eight focal teachers included statements that exhibited adaptation. For example, at Riley, teachers were required to submit weekly lesson plans and the new administrator there connected these submissions to the evaluation scores given to teachers. Mrs. MacDonald, an English teacher at Riley explained how she adapted this policy to suit her needs, “Instead of having something a week out that I am going to have to spend the whole week revising anyway…I am just doing it a day at a time, and [the principal] has not said anything about it. It seems more manageable.” Other teachers described how the evaluation process was more of an opportunity to self-assess practice, which would represent another form of adaptation. One teacher described how, rather than worry about ratings from a third party, she looked at the standards as a sort of checklist against which she could rank herself. There were two teachers who described actions or attitudes that ignored the policy or indicated denial about it. While these teachers still participated in the requirements of the evaluation, they mentioned ignoring some directives related to them. For instance, Mr. Donaldson was identified as weak in one area by an administrator and was told to go back and change his Personal Development Plan (PDP) to reflect improving that weakness as his goal for the year, he stated that he never did it and explained, “I am not going to go back in my PDP and change it because it is not something that I feel will make me a better teacher.” Denial of the evaluation policy was a sentiment echoed by Mr. Forest at Phoenix who questioned the Math competency of his administrator and stated, 107 I am going to do what I want to work on. I am not going to work on what [my administrator] tells me to, just because [he] just told me… I believe that I should be free to pick my own things to work on. And I think, if a teacher is unwilling to do that, then the principal should then be able to step in and actually accurately say, “These are the things you should work on.” And that is what an evaluation should give you. But I think if the teacher is doing that on their own, it is public enough, we do not need to do an evaluation and rank like that. In addition to refusing to adjust goals based on his principal’s suggestions, this teacher was particularly upset that his principal seemed to lack an understanding of what “good Math teaching” looked like. In response, he coupled with colleagues from other schools to write a grant proposal that would train principals to recognize good Math practices to help them not only with evaluation but in recruiting and retaining effective Math teachers. Conclusion Overall, an examination of the trends across the sample of teachers surveyed and interviewed for this study shows that teachers may make minor alterations to their practice due to evaluation policy. However, these changes are far from revolutionary and instead represent very superficial adjustments rather than deep, meaningful, and sustained changes. Moreover, the focal teachers did not self-report that formal evaluations were a motivating force, nor that feedback was useful in improving their practice. An examination of the TWC replication questions yielded results that mirrored trends found overall in the 2016 data. Overall, teachers expressed a positive view of their work expectations, a less positive view about the consistency and quality of feedback from evaluations, and a very negative view of the value of state testing data. The complementary 108 question set from the survey showed that teachers held generally negative opinions about evaluation from the previous year with slightly more positive responses on eight questions when anticipating the current year. However, responses still demonstrated an overall negative perception of the evaluation process. Four areas on the complementary question set had statistically significant, positive changes when comparing the prior year to the current year: modifying practice from evaluation, choosing teaching strategies based on evaluation, using observation data to modify practice, and feeling evaluation will be conducted fairly. These changes may be driven by staffing and initiative changes in some of the schools, which are explored further in Chapter 6. In the focal interviews, teachers stated that feedback from both the observation and testing components of formal evaluation were not useful. Teachers expressed the following concerns regarding the validity of evaluations: being told low rankings were necessary to show growth over years, the timing and timeliness of evaluation and feedback reception, the small sample of teaching actually observed, a combination of very broad and very narrow standards, nearly unobtainable levels of distinction, and the consistency of scores across sites and administrators. Teachers expressed frustration that formal evaluations were not supplemented with a continuous informal presence of administrators in the classroom. While teachers overwhelmingly had negative views of observation, two interviewees mentioned that the observations provided them with an opportunity to be reflective, though these teachers stated that this was something that was ingrained in their practice anyway. The testing component was also criticized as not being timely or specific enough to provide valuable feedback. Teachers also questioned the accuracy of the tests due to dramatic cut scores and small samples of both questions and students. Additionally, teachers had concerns 109 that the equation used to calculate their growth as a teacher was opaque and seemed inaccurate. Two interviewees referenced archived testing data as potentially being helpful to identify past student weaknesses, though neither teacher found testing data received for current students to be particularly meaningful. Many, but not all, teachers described feedback received through other sources of informal evaluation, most notably observation by curriculum coaches and locally developed tests, as sources of feedback that were more meaningful and more useful than feedback from either component of formal evaluations. Possibly for the reasons discussed in the feedback section, teachers did not feel that they were externally motivated by evaluations, though a few mentioned that a poor evaluation would be upsetting to them. The disappointment teachers described was framed as resulting from feeling personally let-down by poor performance (whether real or perceived) and indicated the teachers were intrinsically motivated to do well. Teachers also referenced several discouraging aspects of teaching and described intrinsic rewards of teaching as the prime motivator for doing well in their job. With this in mind, it is not surprising that teachers do not feel that evaluation has much effect on their practice. Changes that were mentioned by teachers consisted of superficial issues like listing learning targets on the board, trying to incorporate technology into an observation, or adjusting classroom activities like warm-ups to more closely match the format of the state test. Interestingly, when teachers talked about how evaluation influenced practice, they referenced both observation and testing as affecting how they teach, but testing was overwhelmingly referenced when making choices in curriculum or when directing focus on certain students. Not every teacher demonstrated a reform typology in the interviews, but acquiescence and adaptation were more common than outright denial of the policy. Given that 110 teachers are legally bound to abide by the policy, it is not surprising that most teachers would be unwilling to totally disregard or even push back against the policy. 111 CHAPTER 6: The Context of the School Site The previous chapter presented results across the entire sample population for this study. This chapter answers my first research question: what, if any, role do reported school evaluation conditions and school evaluation status play in shaping teacher motivation, experiences with feedback, and work decisions related to teacher evaluation? As described in Chapter 4, the four focal schools for the study were selected due to variability in evaluation conditions (using measures created from the 2016 administration of the North Carolina Teacher Working Conditions Survey) and evaluation scores (using 2016 Educator Effectiveness data). The hypothesis that motivated this selection was that schools where teachers perceived evaluation conditions to be very good may view the impacts of evaluation on practice differently than schools where teachers perceive conditions to be very poor. Similarly, teachers from schools with highly rated teachers (i.e., receive high evaluation scores) may perceive the impacts of evaluation on practice differently than schools where many teachers receive low scores. The schools in this study represented a range of evaluation conditions and effectiveness scores. Riley demonstrated Local and State Evaluation Condition Scores which were very close to the district average (0.6 and -1). Phoenix demonstrated a slightly above average Local Evaluation Condition Score (4.4), while Charles had a slightly below average Local Evaluation Condition Score (-2). Central had a very low Local Evaluation Condition Score (-9.6). Low State Condition Scores were demonstrated by Charles (-11), Phoenix (-16), and Central (-22.5). In this study, all of the schools demonstrated Local Evaluation Effectiveness Scores that were very close to district average with scores ranging from -2 to 1. However, the schools demonstrated variety in State Evaluation Effectiveness Scores. Riley was near district average (1.5) while Central was slightly below (-4.2). In contrast, Phoenix was far below district average (-13) and 112 Charles was far above district average (9.3). So, the four schools in this study offered a wide range of varying combinations of the two scores. In this section, I first present an analysis of the survey data to determine if there were statistically different perceptions held among school sites as measured by the survey results Then, I use a combination of survey and interview data to explain how evaluation conditions and evaluation scores are related to the perceptions of teachers at the four focal schools as well as present some alternative explanations based on the interview data. Finally, I describe the specific evaluation scenarios present in the school contexts in this study. Comparing Perceptions of Evaluation and Practice between School Sites As explained in previous chapters, the final section of the survey contained complementary thematic statements that asked teachers to reflect on the previous school year as well as anticipate about the current school year using a Likert scale. Table 13 shows the descriptive statistics for each complementary question set separated by school site. Each thematic set was analyzed using a one-way ANOVA to determine if there were differences between schools in each complementary set. No significant differences between schools were found on any of the questions from the complementary questions set. However, significant differences did emerge within schools when comparing the prior to the current year, which will be discussed in an upcoming section. Later in this chapter, I examine the cases of each school and elaborate on how evaluation conditions and evaluation scores may have impacted teacher perceptions of evaluation in ways in which the survey was unable to capture. 113 Table 13 Complementary Question Set Means by School Riley M Modifying practice in anticipation of an evaluation Modifying practice using feedback from evaluation Concern evaluation affects employment Concern evaluation labels as a bad teacher Concern evaluation does not reflect competency Choosing curriculum based on what evaluated on Choosing teaching strategies based on what evaluated on Directing focus on certain students based on what evaluated on Use test data to modify classroom practice Use observation data to modify classroom practice Feel evaluated fairly Phoenix SD n M SD Central n Charles M SD N M SD n Prior Current Prior Current Prior Current Prior Current Prior Current Prior Current Prior Current Prior Current 2.47 2.53 2.73 3.20 2.13 2.33 2.00 2.13 2.53 2.33 2.73 2.67 2.40* 2.80* 2.53* 2.33* 1.41 1.51 1.28 1.08 1.46 1.50 1.20 1.19 1.51 1.11 1.39 1.35 1.18 1.27 1.19 1.23 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 2.60 2.71 2.80 3.43 2.20 2.43 2.00 2.00 2.80 2.71 2.40 2.50 3.20 3.71 2.80 2.71 0.89 1.25 1.48 0.98 1.64 1.27 1.73 1.00 1.64 1.25 1.52 1.23 1.30 0.95 1.30 0.76 5 7 5 7 5 7 5 7 5 7 5 6 5 6 5 7 2.67 2.83 2.50** 3.25** 1.42 1.58 1.67 1.83 2.25 2.42 2.33 2.17 2.92 2.83 2.33 2.25 1.30 1.15 1.12 1.36 0.52 0.52 0.65 0.84 0.97 1.24 1.30 1.03 1.44 1.47 1.30 1.22 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 2.80 2.90 3.30 3.00 1.60 1.70 1.80 1.70 2.80 2.50 2.50 2.80 2.70 3.10 2.40 2.70 1.14 1.29 0.95 0.94 0.52 0.48 0.92 0.48 0.79 0.71 1.27 1.23 1.16 1.10 1.74 1.16 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 Prior Current Prior Current Prior Current 3.60 3.60 2.67 3.20 3.60 4.07 0.99 0.99 1.11 1.32 1.55 0.70 15 15 15 15 15 15 2.20* 3.29* 3.20 3.71 4.00 4.14 1.10 1.11 1.10 0.95 0.71 0.69 5 7 5 7 5 7 3.17 3.00 2.83 2.92 4.17 4.36 1.40 1.35 1.47 1.38 0.58 0.51 12 12 12 12 12 11 3.30 3.40 2.80 3.10 3.70 4.00 0.82 0.84 1.03 0.88 0.68 0.00 10 10 10 10 10 10 114 Table 13 (cont’d) Last year will impact current Prior 2.33 1.23 15 2.80 1.10 5 2.42 1.38 12 2.70 1.16 10 year Current Note. Scale for Survey: Strongly Disagree 2- Disagree 3- Neither Agree nor Disagree 4- Agree 5- Strongly Agree; * = p< 0.1, **= p< 0.05, *** = p < 0.01 115 Insights from Focal Teacher Interviews I conducted interviews with 14 focal teachers across school sites and coded transcripts to discern differences between motivation, feedback use, reform responses, and reform typologies across schools. The data in Table 14 presents the interview case for each code. In other words, the frequency is the number of interviews that included at least one occurrence of the code. Using this reporting method rather than frequency of interviewees was necessary due to the small number of participants in this data set when separated by school and the variable number of interviewees at each location. Each interviewee participated in two interviews for the study, so the “n” reported equals twice the number of interviewees at a school. The percent of occurrence for the interview at each school is also reported to allow for a better comparison across schools with varying sample sizes. These data will be examined more in-depth for the section on “Evaluation Scenarios” at the end of the chapter. However, the most notable differences between schools center on interviews from Phoenix. For instance, Phoenix had the highest occurrence of the reform typology of acquiescence (which occurred when a teacher made a statement that indicated an acceptance of the policy, which could be with reluctance, but without protest) of the four focal schools with 50% of the interviews containing statements that indicated acquiescence. Phoenix teachers also mentioned internal motivation more frequently by percentage of interview, at a rate of 66.7%. Conversely, at Riley there were internal motivation statements in only 25% of the interviews. Additionally, Phoenix interviews were the least likely to mention positive aspects of observation (16.7%), but it was the only school where positive aspects of testing were mentioned. It may be that the unique circumstances of the alternative school account for the 116 differences when compared to the three traditional high schools across interviews. The context of Phoenix will be explored in the next section. 117 Table 14 Code Interview Case Count by School Riley (n= 4) Motivation Internal External Observation Feedback Negative Positive Testing Feedback Negative Positive Work Decisions Strategy/How Taught Curriculum/What Taught Who is Taught Response to Reform Acquiescence Adaptation Denial Phoenix (n= 6) Central (n=10) Charles (n= 8) Frequency Percentage Frequency Frequency Frequency Percentage Frequency Percentage 1 2 2 2 2 2 2 0 25.0% 50.0% 50.0% 50.0% 50.0% 50.0% 50.0% 0.0% 4 3 7 7 3 6 6 0 4 2 6 6 3 6 6 0 4 2 6 6 3 6 6 0 50.0% 25.0% 75.0% 75.0% 37.5% 75.0% 75.0% 0.00% 4 3 7 7 3 6 6 0 40.0% 30.0% 70.0% 70.0% 30.0% 60.0% 60.0% 0.00% 1 0 0 25.0% 0.0% 0.0% 1 2 1 4 2 0 4 2 0 50.0% 25.0% 0.00% 1 2 1 10.0% 20.0% 10.0% 1 1 1 25.0% 25.0% 25.0% 3 4 1 1 2 0 1 2 0 12.5% 25.0% 0.00% 3 4 1 30.0% 40.0% 10.0% 118 School Vignettes This section will draw on survey and interview data to create a short description of the unique context related to teacher evaluation at each of the four schools. Detailed demographic information for each school is not included here but can be found in the school profiles in Chapter 4. Charles Charles was the smallest of the three traditional high school in the study (789 students) and had the highest participation in free or reduced price lunch (FRPL), at 46%, of the three traditional high schools. Of the four high schools in this study, teachers at Charles described a focus on testing results that was much more intense than the other three high schools. An intense focus on testing achievement may be the reason why Charles had the highest State Evaluation Effectiveness score at 9.3 above the district average. Conversely, Charles had the lowest Local Evaluation Effectiveness score, and the only negative score in the study, at -2.5. There were four focal teachers interviewed from Charles, two whom taught Math and two whom taught English. Mrs. Ranier, an English teacher, stated that due to the testing focus, she felt the faculty at Charles was “analytical and cynical” and expressed surprise that even teachers in subject areas that have traditionally avoided state testing, such as the Art department, seemed this way. She stated that there was a disconnect between what the principal thought needed to be done to achieve good test scores and what teachers thought needed to occur, an issue Mrs. Ranier attributed to her principal’s lack of practical experience. The principal, Mrs. Warner, only spent three years in the classroom as a special education teacher before moving to administration. Mrs. Ranier felt she had to engage in a lot of required “cover your behind” activities that were expected of the teachers at her school, such as parent meetings and various types of 119 documentation. She stated, “I really liked it better 10 years ago when I was trusted to do my job and do it well. And I think I did a better job because I was less anxious, there was less stress.” Mr. Eagle, an English teacher who had taught previously out-of-state, but was in his first year at Charles, described a general feeling of being watched, both in the classroom and outside in meetings and other functions. While Mr. Eagle stated that the administration seemed very enthusiastic about his performance as a teacher, he acknowledged that there seemed to be an “invisible list of bad teachers” who were watched more frequently. He also noted an intense focus on student growth on tests. He had good scores his first semester with his seniors, which he attributed to pure luck as the majority of his semester had been spent helping students work on their senior projects rather than on topics and skills covered by the North Carolina Final Exam (NCFE), and he had already been approached about teaching an End of Course (EOC) class the following year due to this success. The suggested course reassignment signaled that the principal was willing to engage in the gaming strategy of moving teachers with the best records of producing growth to areas where testing stakes are higher (Cohen-Vogel, 2011; Grissom et al., 2012). The principal’s decision to pursue this highlights the testing focus of the school. The views of the two focal English teachers were echoed by the two Math teachers who were interviewed for this study. Both Math teachers talked about the testing results-driven atmosphere of the school; however, the Math teachers did not seem to feel the same pressures and oversight that the English teachers expressed. There are three possible explanations for this. First, the closeness of the departments differed in professional relations, personal relations, and physical location. Mr. Robbins, the more veteran of the two Math teachers, described how the department consisted of established teachers who had taught for several years. Mr. Silver, the other focal Math teacher was the newest in the department and he had been there for several 120 years and had student taught at Charles prior to being hired. The Math department was also very tight-knit and would meet on the weekends to play board games at each other’s houses. In contrast, the English department consisted of many teachers who were new to the school, including both focal English teachers. The English department was also spread out across the school instead of being located in adjacent classrooms in one hallway like the Math department. Secondly, the Math teachers were not observed by the principal in the study year, but were instead observed by an assistant principal who had a Math background. However, at Charles the administration observed different departments each year so subject area alignment was not guaranteed for Math teachers. In contrast, English teachers at Charles had never been observed by an administrator with English experience. Finally, the Math teachers also received a lot of outside support in the form of a curriculum coach, while the English department declined assistance from their curriculum coach. This may mean that Math teachers relied more on feedback from the coach and their tight-knit group of peers whereas English teachers were primarily receiving feedback from the administrator. The stability of the Math department and the coaching support it received may help explain the different perceptions shared by teachers across the two subject areas. There were no statistically significant differences in teacher perceptions of evaluation when comparing reflections on the prior year to the current year. Despite all four focal teachers acknowledging that there was a strong focus on raising test scores in their school, teachers did not demonstrate large rises on any of the testing-related questions. However, there were two themes where Charles demonstrated low means as compared to two of the other schools: Riley and Phoenix. Compared to teachers at these two schools, teachers at Charles seemed to feel more secure in their jobs despite evaluation policy and demonstrated little concern that evaluation 121 would affect employment (prior year: M= 1.60, SD= 0.52; current year: M= 1.70, SD= 0.48) or result in the label of “bad teacher” (prior year: M= 1.80, SD= 0.92; current year: M= 1.70, SD= 0.48). Central Central was a mid-sized traditional high school (1,103 students) and was the least ethnically diverse with a student population that was 86% white. Central was above the district average for Local Effectiveness Scores and only slightly below for State Effectiveness Scores; however, the school had very low Local Condition Scores (-9.6) and even lower State Condition Scores (-22.5). These low Condition Scores indicate that teachers held negative views about evaluation policy at Central. These scores may be related to a unique situation where the Math and English departments seemed to have divergent experiences in regard to observation. The divergence was a result of the Math teachers being observed for several consecutive years by the main principal, Mr. Nichols, who had several years of experience as a Math teacher prior to becoming an administrator. In contrast, the English department was observed for several consecutive years by an assistant principal, Mr. Reward, who allowed teachers to complete their own evaluations because he struggled to operate a computer. There were three English teachers and two Math teachers who completed focal interviews from Central. The three focal English teachers felt that their department was strong and did not require a lot of oversight. One teacher, Mrs. Williams, attributed Mr. Reward’s assignment to the department as a testament to teacher skill and explained that she felt the main principal did such a good job at hiring that the teachers at Central did not need to be watched or evaluated. All three English teachers described evaluation under Mr. Reward in a similar way: the assistant principal would sit in for part of a class and then largely leave up the assigning of scores and comments to 122 the observed teacher. Mrs. Williams described taking over the computer from Mr. Reward and typing up the evaluation for him; “And I think I’m fair,” she added. One English teacher, Mrs. Hoard, was only required to have one evaluation in the study year and described how that observation did not occur with students. Instead, Mr. Reward observed her conducting a department meeting and filled out the observation rubric based on that observation. As a result, Mrs. Hoard was not formally observed teaching at all during the study year and a lot of areas were marked “not observed” on her observation form. She also noted that sections of the evaluation that were focused on students were rated by Mr. Reward despite an absence of students during the observation. Mrs. Hoard did not like that observations were conducted so haphazardly, but felt secure that the results of her evaluation would have no effect on her at all. Overall, the English teachers were very open about the experience and stated that the assistant principal was very good at many things, such as handling discipline and bus schedules, but that he was ineffective in many aspects of his position as an administrator. One of the Math teachers indicated in his interview that he had been observed once under Mr. Reward and described similar experiences, so the deficit, as described by the teachers, may have transcended subject areas. Conversely, the Math department described very positive experiences with observation due to quality feedback from Mr. Nichols, the principal, who had a background as a Math teacher. The focal Math teachers described how Mr. Nichols would identify things in the observed lesson that may have otherwise gone unnoticed by the teacher or make suggestions that were practical and could be used to improve instruction. Both Math teachers expressed gratitude that they had an administrator who “knows the Math” and could effectively identify if something 123 went wrong. Overall, the Math teachers seemed to feel that the feedback received from observations was useful and valid due to their administrator’s background as a Math teacher. Yet, there was a disconnect demonstrated between Central’s high State Effectiveness scores and low State Condition scores. I asked the focal teachers to help explain how teachers at Central could have such high Effectiveness Scores yet have such a low impression of evaluation conditions at their school. Mrs. Proffitt, a Math teacher described how one of the assistant principals had been returned to classroom teaching at a different school after “making a mess of testing” the previous year with numerous, serious scheduling and protocol errors. Mrs. Proffitt suggested that this may have led to very low perceptions of testing in the previous year when the data for the Condition Scores were collected. However, it may also be that the observation style of Mr. Reward was related to the Local Condition Scores. While only two subject areas were included in this study, Mr. Reward conducted at least one third of the observations at Central and presumably he had the same challenges with using technology and filling out the evaluation form for other subject areas as was reported for English and, in the past, at least one Math teacher. The interviews support this hypothesis as teachers were ranked highly at the school, often after having completed their own evaluation ratings, but were frustrated at the purported ineffectiveness of their evaluator. Additionally, teachers were very open in discussing Mr. Reward’s tactics, so it may be that teachers who were not evaluated by him were also frustrated that some of their colleagues did not get evaluated in the same manner. All of the teachers interviewed also stated that the overall consensus in their respective departments was that testing was a “fact of life” and that they wanted students to do well on tests; however, teachers overwhelmingly did not see value in the tests as sources of feedback or 124 as valid measures of student gains. Mr. Augustus surmised, “I do not think we feel it is very useful or effective and it is a waste of time and a waste of resources and a waste of money.” He went on to describe how across the state, teachers of Math II courses had to administer two different EOCs that year and to uphold students’ impression that both tests counted toward grades, “[T]he giving of two tests, but only one of them we are going to have access to the data. That is, to me, not very useful and not very effective.” The unique situation Math II teachers were placed in is described in greater depth in the next chapter which examines differences based on individual characteristics such as subject area. Similar to Charles, teachers at Central seemed to feel quite secure in their jobs and demonstrated little concern that evaluation would affect employment (prior year: M= 1.42, SD= 0.52; current year: M= 1.58, SD= 0.52) or result in the label of “bad teacher” (prior year: M= 1.67, SD= 0.65; current year: M= 1.83, SD= 0.84). There was also a significant difference when comparing responses from the prior to the current year on teachers’ use of feedback to modify practice, t(11) = -2.46, p = 0.03. However, this change only represented a shift from teachers on average agreement from “disagree” to “neither disagree or agree.” So, while teachers may have overall felt more inclined to use feedback in the study year, this change does not necessarily mean that teachers rely heavily on evaluation feedback for classroom planning. Riley Riley was the largest high school in the study (1,591 students) and was the most ethnically diverse. Riley also had the lowest percentage of students in FRPL (38%) and ranked very near the district average on all of the Condition and Effectiveness Scores. However, the aspect of Riley that is the most distinguishing is that it had a new principal, Ms. Jefferson, during the study year. According to the teachers, her approach to observation was very different from 125 her predecessor’s. Ms. Jefferson initiated significant changes in teacher classroom practice by requiring the submission of daily lesson plans on which she reviewed and commented. In addition, she immediately completed a full observation, complete with conferences and discussions about lesson plan submissions, of every teacher during the first month of the school year. For instance, Mr. Donaldson, an English teacher, was formally observed three times before our first interview in mid-October. He was not in a renewal cycle and was only legally required to have two snapshot observations (abbreviated observations covering only a few standards) in the study year. Given the size of the high school with around 100 teachers, observing each was an impressive accomplishment for Ms. Jefferson and required a considerable investment of her time. Teachers seemed apprehensive about engaging in a study on evaluation given this new focus on observation, as evidenced by my conversations with the two focal teachers and from comments typed into the open response section of the survey. Nearly all Math teachers took the survey for this study (90.91% response rate), but none were willing participate in the interview. I spoke to the department head regarding the interview and he stated that the department had come to a decision not to participate in the interview because the atmosphere surrounding observation in the school had become “tense.” In the English department, the survey participation rate was low (45.45% response rate) and the only teachers who volunteered to be interviewed were very secure veteran teachers who reported always receiving good marks on their respective evaluations. The two focal English teachers were critical of the new administrator’s approach to observation despite receiving glowing evaluations from Ms. Johnson. Their criticism centered on the fact that the new administrator equated the completion of paperwork with good teaching. Mr. 126 Donaldson stated that he observed “pushback from some quarters” of the teachers regarding the increased frequency and intensity of observations and that some teachers felt that there was an “entrapment factor” behind the approach. Additionally, both focal teachers described how they had taken advantage of Ms. Jefferson’s tendency to focus on paperwork completion over content. For instance, Mrs. Macdonald had begun to modify how she submitted the lesson plans so that the process became more applicable to her own classroom practice. She noted that the principal had not challenged her alterations. Mrs. Macdonald also noted that after the first few lesson plan submissions she stopped receiving feedback and has since invested less time in the process. Meanwhile, Mr. Donaldson was not observed by Ms. Jefferson again between the first and second interviews and had started recycling lesson plans on his daily submissions, which he stated had gone unnoticed. However, Mrs. Macdonald said she was aware of several teachers who were being reprimanded for not completing paperwork, which seemed to support Mr. Donaldson’s “entrapment” theory. Mrs. Macdonald was also skeptical that the very high marks she received on her initial observations were an accurate reflection of her teaching. She stated that the feedback she had received lacked substance and suggested that Ms. Jefferson was measuring her “enthusiasm and charisma” as a teacher rather than her ability. Additionally, Mrs. Macdonald felt that the fact that she met all deadlines on her paperwork submissions was instrumental in her doing well as the new principal seemed to value this over actual teaching ability. The increased and intense focus on evaluation, tied to the submission of daily lesson plans, seemed to result in increased tension around observations at Riley. There were two statements on the complementary question set which had significant differences when comparing the prior to the current year. First, teachers were reportedly more likely to choose teaching 127 strategies based on evaluation, t(14) = -1.87, p = 0.08. Secondly, teachers were reportedly less likely to direct focus on certain students based on evaluation, t(14) = 1.87, p = 0.08. However, the averages for both categories remained between the “disagree” and “neither agree nor disagree” ranges. So, these changes do not represent a reliance on evaluation for choosing strategies nor a total abandonment of using evaluation to direct focus on certain students. However, an increased awareness of the relationship between evaluation and the selection of teaching strategies might have resulted from the daily lesson plan requirement or from teachers’ decisions to use certain strategies to meet observation requirements. Compared to other schools, the mean responses on the survey were not exceptionally high or low. For the most part, responses from Riley teachers fell in the middle of the means for all of the question themes on the survey. There was no change at all in teachers’ intentions to use testing data to modify instruction as compared between the prior year and the current year as measured in the survey. The focal teachers expressed that they felt that testing would not factor into their evaluation in a way that was different than before Ms. Jefferson came to the school. However, this survey was administered at the beginning of the school year before a testing cycle had been completed. Both focal teachers commented that the English department was strong, and they were surprised that Riley was slightly below the district average for standard six of the evaluation. Mrs. Macdonald explained that the school was located in a higher socio-economic area than the other schools and stated that students were “high testers” at the school. The school scores were always near the top of the district, so this slightly than lower average may have resulted from the way growth was calculated rather than from low student scores. Mrs. Macdonald also stated that despite Riley usually scoring at the top for the district, she still paid attention to testing and remembered a time 128 when the Biology teachers had low scores which resulted in reprimands and teachers being sent away for additional training, which was something she thought about every testing season when she began to review for exams. Phoenix Phoenix has a unique context as an alternative school serving students who did not experience success in traditional high schools. As an alternative school it was the smallest in the study (134 students) and had the highest rate of FRPL (87%) when compared to the traditional high schools. The Local Condition Score and Local Effectiveness Score were both slightly above the average for the district. However, the State Condition Score was low (-16) and the State Effectiveness Score was the lowest in the study (-13). When asked about attitudes towards testing at his school, Mr. Brown, a second-year English teacher who was in his first year at Phoenix, surmised, “[There] are bigger fish to fry here [than academics], especially on the social and emotional level. Some of these kids are dealing with a lot [which] matters more in the grand scheme of things. Some of these kids need to have social skills as opposed to knowing how to take a test.” During our interviews, Mr. Brown often reflected on the difference in his experiences between his first year at his first school, which was located in Tennessee, and his second and current year at Phoenix. He felt the difference was related to the school administration. At his previous school, Mr. Brown felt anxiety about his evaluations and perceived extreme pressure on teachers to achieve high test scores. He stated that everything in his observations seemed to be linked back to state testing and described his first year as being in a classroom that was micromanaged by policies meant to increase student achievement on tests. 129 I observed a noticeable difference in Mr. Brown’s anxiety about testing between his first and second interview. In the first interview, Mr. Brown expressed nervousness about how testing would play out in the unique new classroom context he now taught in as he previously had faced retribution if his students did not perform well. By the second interview and second semester, Mr. Brown seemed to have accepted that testing did not matter in the same way at Phoenix as it did at his previous school. Mr. Brown stated that the administration at Phoenix did not have low expectations, but instead, “They understand, in fact, they really understand what the teachers are dealing with, like how a classroom looks. And I feel like their perceptions, they align maybe pretty well with the teachers. Maybe that is why teachers feel that the scores are pretty accurate. I felt like mine were pretty accurate.” Mrs. Street is a veteran English teacher who previously served as a curriculum coach at Phoenix. She returned to the classroom to finish out her teaching career with a few years left before retirement. She talked extensively about the autonomy that was afforded to teachers at Phoenix, which was something she felt did not occur at other schools. Mrs. Street stated that the principal of Phoenix allowed and encouraged teachers to try new things to reach the unique population of students they served. Mrs. Street described how she had felt in other schools, particularly during observations, “I would be very nervous to try new or out of the ordinary [methods]. I would stick to something more scripted, something tried and true. Here we have the freedom. We are not going to be marked down for trying a strategy or trying something with students and it fails.” The focal teachers interviewed for the study also brought up the culture of Phoenix, which Mrs. Street terms as being one of “learning and growing” where every teacher is willing to accept feedback from the others. Mrs. Street explains, “[F]eedback is necessary… it is not a bad 130 thing or just a good thing, it is how can we all learn from each other.” Mr. Forest described conversations he had with colleagues from other schools about school climate and wondered if the better relationship between teachers and administrators he perceived at his school was the result of accessibility. “Since we are such a small staff here, I feel like I can walk into [my administrator’s] office at any time and I wonder if my colleagues at other schools feel the same way.” Mr. Forest’s statement does seem to contrast statements from teachers at the other focal schools who stated that administration did not enter classrooms aside from evaluations or meet informally. The mean for Phoenix was fairly higher on the statement “choosing teaching strategies based on what evaluated on” when compared to the other schools in the study. Additionally, the mean for this statement rose when comparing the prior to the current year, though the change was not statistically significant. Perhaps this increased focus on strategies is related to the unique student population of Phoenix and a school climate where experimentation is encouraged. Additionally, there was a significant, positive difference for the statement “use test data to modify classroom practice” when comparing the prior to the current year with the mean rising, t(4) = -2.24, p = 0.09. Nothing in the interviews indicated that policy changes at the school drove these changes in means. However, the small sample size may indicate that this change simply reflects the personal resolve of those teachers who completed the survey. Overall, the focal teachers’ descriptions of the context and climate of Phoenix seemed to match the conditions and effectiveness scores used to select the school for the study, which boasted high Local Condition scores and Local Effectiveness Scores slightly above the district averages. It is interesting to note that three of the themes in which large rises were demonstrated at Phoenix were among the same as those at Riley: “modifying practice using feedback from 131 evaluation,” “choosing teaching strategies based on evaluation,” and “using observation data to modify classroom practice.” A very large rise was also seen in the theme “using testing data to modify classroom practice.” Yet, Phoenix teachers had very low perceptions of State Conditions and State Effectiveness with scores of -16 and -13 below the district average, respectively, which would lead one to believe that teachers did not rely on testing feedback to modify instruction and that even with the growth model, the school underperformed compared to the majority of others in the district. As referenced before, it is unclear if these results were driven by the addition of two new-to-Phoenix teachers to the current year question set as opposed to the prior year or if these statements merely reflected a continuing dedication to “learning and growing” that was espoused by Mrs. Street and supported by statements from the other two focal teachers. Do Evaluation Conditions and Effectiveness Matter? Overall, the measures used to select schools may not have been effective in identifying differences of context because no significant differences in survey results were found among the four schools. Additionally, the effects may be understated due to the small sample size. However, contextual differences may have impacted the relationship between evaluation and teacher practice in schools. The interviews suggested that such a relationship may be more related to the type of individual who is conducting evaluations and how evaluation is related to the climate and policy focus of the school. For instance, it may be important to consider whether evaluation is even stressed by school administration and if so, which parts of the evaluation are emphasized? Other studies have demonstrated that teachers respond to evaluation through the lens of their administration and so the way in which principals choose to focus on evaluation may influence how teachers perceive the policy (Reinhorn, Johnson, & Simon, 2017). Do administrators stress the 132 observation or the testing component more? Are there certain aspects of each component that receive more focus than others? Additionally, in regard to the observation component of evaluation, it may be important to consider the subject area background, the skill of the observer, and the foci or values administrators bring in approaching evaluation. These are specific considerations that were not taken into account in the way in which the Evaluation Condition Scores were calculated. For instance, the two teachers interviewed at Riley questioned the proficiency of the new principal who stressed observation as important, but seemed to connect this evaluation component with paperwork and check boxes rather than providing meaningful feedback. Observer proficiency certainly was reported as an issue for the English teachers at Central who described cases of writing their own observations or not being observed teaching at all. While subject area will be discussed in more depth in the next chapter, the Math teachers who were observed by administrators with Math backgrounds expressed gratitude that their observer had proficiency in the subject area observed. Additionally, teachers, such as Math teachers at Charles, who receive quality feedback from outside sources such as coaches, may value coaches’ feedback over formal evaluation feedback. The interviews also revealed some considerations not captured by the Evaluation Effectiveness Scores. First, the local scores which were based on observation were clustered closely around the district mean with a range of -1.45 to 1.09. This trend was consistent in all the schools in the district sample; therefore, a wide range of scores in this category was unavailable. For the state-based score, which was based on standard six, also known as the student growth standard, the range was much wider, -14.41 to 8.81. Charles and Central were above the district mean while Riley was slightly below, and Phoenix had the worst performance among high schools in Broadville. However, the teachers at Phoenix spoke at length about how the school 133 administration granted them autonomy from school-based policy related to testing and encouraged an atmosphere of experimentation to help the students at the school succeed in other, perhaps more meaningful ways. While Phoenix teachers ranked the lowest in Effectiveness scores for both local and state components, they had the highest Local Condition Score of the focal schools. The high score is not surprising given the level of autonomy granted to teachers there. Therefore, it is possible that the conditions under which teachers receive evaluation scores have more of an association with teacher perceptions of evaluation than the actual evaluation scores that are received by those teachers. Evaluation Scenarios Overall, the vignettes reveal three types of scenarios related to evaluation. The first is the technocratic scenario. Giroux (1985) argued that a technocratic approach to policy reduces teacher autonomy by attempts to regulate and control behavior. For instance, Mr. Brown, the new English teacher at Phoenix, described these conditions in his previous school while contrasting his two teaching experiences. This technocratic approach is also evidenced to a large extent in Ms. Jefferson’s approach at Riley. According to teachers, Ms. Jefferson focused more on observation than the testing component with a formalities-driven approach to observation based on controlling teacher behavior through lesson plan submission and oversight with possible reprimand through observation. An adherence to procedure was valued rather than quality of work. Additionally, technocratic approaches regarding the testing component of evaluation may have already been in place prior to Ms. Jefferson’s hire as evidenced by Mrs. Macdonald’s fear of what happened to the Biology teachers following poor testing performance. Firestone (2014) argued that duress is the opposite of autonomy; it appears that teachers at Riley 134 may have been experiencing some duress around evaluation. Therefore, evaluation at Riley, according to teachers, is fully under a technocratic model. A technocratic approach is also evident, but to a lesser extent, at Charles, particularly in the English department. Scholarship has demonstrated that conditions for motivation in schools include realistic workloads, administrative support, and operating in systems that are not overly punitive (Firestone & Pennell, 1993; Firestone & Rosenblum, 1988). Mrs. Ranier described a challenging workload and lack of support from administration that expected teachers to cover their own “behinds” and specifically referenced how such a school climate impacted the motivation of teachers. All of the teachers at Charles stressed the results-driven nature of the school in regard to testing, a condition Giroux (1985) describes as being technocratic. The Math teachers may have appeared to be more aligned with a test-driven approach due to having group buy-in towards this policy. Additionally, the success of Math teachers was also aided by the creation of Introductory Math courses which were required of all students and meant to increase student success on the Math I test. Because the course helped ease the burden of teaching Math concepts by spreading the curriculum across two semesters instead of one, the Introductory Math course may have helped create this policy buy-in for Math teachers. So, Charles is also a school that operates under a technocratic model, particularly in regard to the testing component of evaluation. The second evaluation scenario is the Autonomous and Self-Efficacious Scenario. In this scenario, teachers are able to operate under a system of internal rewards where improvement is driven by assessment, feedback, training, and professional development while evaluation contributes to rewarding conditions (Firestone, 2014). A clear example of this can be seen at Phoenix, where teachers felt supported by administration and worked in an atmosphere of 135 learning and improvement guided by their administrator. While teachers did not necessarily view evaluation, and particularly the testing component, in a positive manner, they were satisfied overall with the way observations were conducted at their school and with the administration’s approach to the policy. However, this did not mean that the administration at Phoenix was handsoff. Aside from Riley, Phoenix was the only school that required submission of lesson plans. It was also the only school where teachers mentioned administrator presence in the classroom aside from evaluation. Overall, while the staff at Phoenix did adhere to the policy requirements of teacher evaluation, the policy did not define instruction and instead teachers and administrators were free to work together to create their own definitions of success in more meaningful and supported ways. Therefore, Phoenix serves as an excellent example of teachers working under conditions of both autonomy and self-efficacy as allowed in a non-traditional high school. To a lesser extent, the Math departments at Charles and Central exhibited some tendencies consistent with this scenario. The Math department at Charles exhibited a level of autonomy not present in the English department, perhaps because of the overall closeness and stability of the department and perhaps because there was greater buy-in from the Math teachers regarding the value of a testing-focused curriculum and technocratic policies compared to the English department. For these reasons, Charles more appropriately fits into the Technocratic Scenario as previously described. Likewise, Math teachers at Central experienced some tendencies consistent this scenario, partially driven by having an administrator with a Math background resulting in a mutual recognition of competence in the subject. However, Central more readily fits into the final scenario described below. The final scenario is Consensus Lacking. Cohen (2011) describes how a lack of consensus in an educational context can increase uncertainty and dispute. The teachers at Central 136 certainly described themselves as being autonomous and self-efficacious and teachers in both departments were dismissive of both the observation and testing components of the evaluation as evidenced by the Evaluation Condition Scores and the interview data. The evaluation scores received by teachers at Central indicated that teachers were accomplishing a “good job” by those measures, yet when it came to evaluation there was a lack of consensus regarding the policy. Math teachers seemed to find some validity in their own personal observations, which were conducted by a former Math teacher, but did not seem to find value in the observation policy or process as a whole. Meanwhile, the English teachers’ evaluations were conducted in a manner more consistent with the Wild West, where teachers were sometimes not actually observed teaching or essentially observed themselves. However, neither department found value in the testing component of evaluation. Overall, there was a lack of consensus and an attitude of even disdain toward evaluation at Central that was supported by the way in which administration approached the policy. Additionally, the interview data presented earlier in this chapter in Table 13 offer some support for the scenarios presented above, though the results should be interpreted with caution due to the small size of the samples at each school. For instance, Phoenix had the highest occurrence of the reform typology of acquiescence (which occurred when a teacher made a statement that indicated an acceptance of the policy, which could be with reluctance but without protest) of the four focal schools with 50% of the interviews containing statements that indicated acquiescence. The second highest occurrence was found at Central where 30% of interviews included statements indicating acquiescence. It could be that teachers at Phoenix were more likely to demonstrate acquiescence to evaluation policy because it was unlikely to interfere with their classroom lives under the Autonomous and Self-Efficacious scenario. 137 All of the teachers who demonstrated acquiescence at Central were English teachers, whose department often completed observations with little input from the administrator. Again, the way evaluations were conducted at Central may have led to less interference in the classroom lives of teachers, but did little to improve teaching conditions at the school. Phoenix teachers also mentioned internal motivation more frequently by percentage of interview, at a rate of 66.7%. Conversely, at Riley, the school under the Technocratic scenario, there were internal motivation statements in only 25% of the interviews. Again, this is sensible given that individuals who are allowed to act autonomously and practice self-efficacy tend to be internally motivated, whereas teachers in Riley were operating under a system of external threats and rewards. In summary, the survey results did not indicate that differences in evaluation conditions or effectiveness as measured in this study affected teacher perceptions of the relationship between evaluation and practice. However, teachers’ statements during the interview phase suggest that conditions, particularly related to administrator implementation of the policy and expectations around effectiveness scores, do matter. Specifically, more work is needed to parse out how components of evaluation conditions impact teachers differently. 138 CHAPTER 7: Individual-Level Characteristics and Teacher Evaluation This chapter addresses the question of whether individual-level teacher factors are associated with differences in teacher motivation, experiences with feedback, and work decisions related to teacher evaluation. Differences are examined by comparing teacher reflections on the prior and current year complementary survey questions as well as by examining differences between groups. Specifically, three individual-level differences are examined: reported licensure status (provisional vs. professional), years of experience (seven years or fewer vs. eight years or more), and subject area (Math vs. English). In this section, I present an analysis of survey and interview data to examine the relationships between each of the three individual-level factors and teacher perceptions of evaluation. Throughout, I explain how I found individual-level characteristics to be related to the perceptions of teachers at the four focal schools. Additionally, I present some alternative explanations for the differences that emerge. Reported Licensure Level Survey Two licensure types were reported among survey participants: provisional and professional. It is important to note that all teachers who were provisionally licensed were subjected to full observation cycles each year which consisted of three full-length observations and a peer observation assessed by all five observation standards along with standard six, which is the student growth standard. All observations are supposed to include conferencing between the observer and the observed teacher. Teachers with a professional license may have either a full cycle or an abbreviated cycle (two snapshot observations evaluating three observation standards plus the growth standard), dependent on when the teacher began teaching in North Carolina and whether or not their license is up for renewal in a given year. 139 Quantitative analysis revealed some differences between licensure levels for both the prior year and current year question sets. An independent samples t-test was conducted to determine the relationship between teacher licensure level and perceptions of the teacher evaluation process in the prior year. An analysis of teachers’ reflections on the previous year yielded two significant results when comparing teachers who reported provisional licensure to those who reported professional licensure. First, provisionally licensed teachers were more likely to have stated that they chose curriculum in anticipation of evaluation than professionally licensed teachers, t(40) = 1.93, p = 0.06. Additionally, provisional teachers were also more likely to agree that they directed focus on certain students based on evaluation compared to professionally licensed teachers, t(40) = 1.96, p = 0.06). Unfortunately, these statements were about evaluation at large so it is unclear whether these responses may have been associated differentially if examined separately by observation or testing (see Table 15). An independent samples t-test was also used to examine the relationship between teacher licensure level and perceptions of the teacher evaluation process in the current year. Three significant differences emerged between provisionally and professionally licensed teachers. Overall, provisional teachers were more likely to state that they anticipated modifying practice in anticipation of an evaluation as opposed to professionally licensed teachers, t(42) =1.96, p = 0.06. Provisionally licensed teachers were also more likely to be concerned that an evaluation would label them a bad teacher as opposed to professionally licensed teachers, t(42) = 1.84, p = 0.07 and, as in the prior year question set, provisionally licensed teachers were more likely to direct focus on students based on what they will be evaluated on as opposed to professionally licensed teachers, t(42) = 3.02, p < 0.01. 140 Table 15 Independent Sample T-Test of Survey Themes by Reported Licensure Level Provisional Professional Survey Themes M SD n Modify practice in anticipation of an evaluation Modify practice using feedback from evaluation Have concern evaluation affects employment Have concern evaluation labels as a bad teacher Have concern evaluation does not reflect competency Choose curriculum based on what evaluated on Choose teaching strategies based on what evaluated on Direct focus on certain students based on what evaluated on Use test data to modify classroom Practice Use observation data to modify classroom practice Feel evaluated fairly M SD n 95% CI t Prior 3.00 1.15 10 2.50 1.24 32 -0.40, 1.40 1.13 Current 3.36 1.21 11 2.52 1.25 33 -0.02, 1.72 1.96* Prior 3.20 1.40 10 2.69 1.15 32 -0.37, 1.40 1.17 Current 3.36 1.29 11 3.15 1.03 33 -0.56, 0.99 0.55 Prior 2.30 1.25 10 1.66 1.00 32 -0.15, 1.44 1.63 Current 2.36 1.12 11 1.88 1.08 33 -0.28, 1.25 1.28 Prior 2.30 1.25 10 1.72 0.96 32 -0.17,1.34 1.56 Current 2.45 1.04 11 1.79 0.86 33 -0.06, 1.21 1.84* Prior 2.50 1.27 10 2.56 1.22 32 -0.96, 0.84 -0.14 Current 2.45 1.04 11 2.45 1.23 33 -0.76, 0.76 0.00 Prior 3.20 1.23 10 2.31 1.28 32 -0.04, 1.82 1.93* Current 2.80 1.14 10 2.45 1.23 33 -0.54, 1.23 0.79 Prior 3.10 1.20 10 2.59 1.27 32 -0.41, 1.42 1.12 Current 3.55 1.21 11 2.85 1.23 33 -0.16, 1.56 1.64 Prior 3.10 1.20 10 2.28 1.14 32 -0.03, 1.67 1.96* Current 3.27 1.10 11 2.18 1.01 33 0.36, 1.82 3.02*** Prior 3.40 1.08 10 3.19 1.18 32 -0.63, 1.06 0.51 Current 3.36 0.92 11 3.33 1.14 33 -0.74, 0.80 0.08 Prior 3.00 1.25 10 2.75 1.16 32 -0.62, 1.12 0.58 Current 3.00 1.34 11 3.24 1.15 33 -1.08, 0.60 -0.58 Prior 3.70 0.82 10 3.88 1.13 32 -0.96, 0.61 -0.45 Current 4.10 0.57 10 4.15 0.57 33 -0.46, 0.36 -0.25 Feel last year will impact current Prior 3.00 1.14 10 2.34 1.13 32 -0.22, 1.53 1.52 Year Current Note. Scale for Survey: Strongly Disagree 2- Disagree 3- Neither Agree nor Disagree 4- Agree 5- Strongly Agree * = p< 0.1, **= p< 0.05, *** = p < 0.01 141 df 40 42 40 42 40 42 40 42 40 42 40 41 40 42 40 42 40 42 40 42 40 41 40 Interview The data in Table 16 present the interview case for each code by reported licensure status. The frequency is the number of interviews that included at least one occurrence of the code, which was used rather than frequency of interviewees due to the small “n” of these data when divided into categories. Each interviewee gave two interviews for the study so the “n” reported equals twice the number of interviewees at a school. The percent of occurrence for the interview in each category is included to allow for comparisons between the two groups. Table 16 Occurrence of Codes in Interviews by Licensure Status Provisional (n= 6) Codes Motivation Internal External Observation Feedback Negative Positive Testing Feedback Negative Positive Work Decisions Strategy/How Taught Curriculum/What Taught Who is Taught Response to Reform Acquiescence Adaptation Denial Professional (n= 22) Frequency Percentage Frequency Percentage 4 2 4 4 2 4 4 1 66.7% 33.3% 66.7% 66.7% 41.7% 66.7% 66.7% 16.7% 10 8 19 18 7 19 19 1 45.5% 36.4% 86.4% 81.8% 31.8% 86.4% 86.4% 4.5% 1 2 0 16.7% 33.3% 0.0% 7 4 2 31.8% 18.2% 9.1% 1 1 1 16.7% 16.7% 16.7% 7 7 2 31.8% 31.8% 9.1% Discussion The survey results indicated that provisional teachers may have been more likely than professionally licensed teachers to modify their practice in anticipation of an evaluation, 142 specifically in regard to choosing curriculum and directing focus on students. Support is found in the focal interviews where a much larger percentage of interviews from provisionally licensed teachers referenced changes in curriculum based on evaluation as compared to professionally licensed teachers (33.3% versus 18.2%); however, the sample size of the interview data is too small to investigate with inferential statistics. Likewise, it is unsurprising that provisionally licensed teachers, who in general have less experience and perhaps less confidence in their instructional decisions, are more likely to fear being labelled negatively on an evaluation as opposed to more experienced, and possibly more confident, fully-licensed teachers. Provisional teachers, who do not have any sort of tenure protection or any prospect of receiving such under the current policies, also have more at stake if poor evaluation results are received. With this in mind, it is possible that evaluation may differentially motivate provisional teachers to change practice in an attempt to perform better on the evaluation measures. There are other possible explanations for the differences found in the survey data. For instance, professionally licensed teachers, who in general have more work experience than provisionally licensed teachers (the exception being teachers who transfer from out of state and receive provisional licenses), may value something else aside from evaluations when it comes to classroom practices such as choosing curriculum or directing focus on certain students. There may be evidence for provisional teachers being more favorable to evaluation feedback than professional teachers in the focal interviews where there were fewer mentions of the negative aspects of both observation (66.7% versus 81.8%) and testing (66.7% versus 86.4%) from provisional teachers as opposed to professional. Provisional teachers may have a greater likelihood to focus on one or both components of evaluation to aide in these classroom practices because they have had less exposure to other types of guidance, and perhaps less confidence in 143 identifying good practice, therefore seeing the values in evaluation as suitable guiding principles. However, it is difficult to draw support and conclusions from the focal interviews due to the sample only including three provisionally certified teachers, two of which were from Phoenix and one from Charles. Seven-Year Status Survey As previously mentioned, changes in evaluation policy over recent years require all teachers with seven or fewer years of experience in North Carolina to be observed on a full evaluation cycle. Teachers who fall into the category of having seven or fewer years of experience may have either provisional or professional licenses; therefore, a different sample of teachers was included when the data was examined by the years of experience instead of by licensure level. So, independent sample t-tests were conducted using teacher seven-year status instead of reported licensure level (Table 17). No significant differences were found between teachers who had seven or fewer years of experience and teachers who had eight or more years of experience when analysis was run on the prior year question set; however, four significant differences between the two groups were found in the analysis of the current year question set. First, the seven years or fewer teachers were more likely to both modify practice in anticipation of an evaluation compared to the eight years or over group, t(42) = 2.28, p = 0.03, and were more likely to modify practice using the feedback of an evaluation as opposed the eight years or over group, t(42) = 1.81, p= 0.08. Specifically, the seven years or fewer teachers were more likely to use test data to modify classroom practice than the eight years or over teachers, t(42) = 2.49, p = 0.02. Among those practices that teachers stated they would modify, seven years or fewer teachers were more likely to choose teaching strategies based on evaluation than 144 the eight years or over teachers, t(42) = 2.92, p = 0.01 and were more likely to direct focus on certain students than eight years or over teachers, t(42) = 3.26, p < 0.01. 145 Table 17 Independent Sample T-Test of Survey Themes by Seven-Year Status 7 or fewer Survey Themes M SD n Modify practice in anticipation of an evaluation Modify practice using feedback from evaluation Have concern evaluation affects employment Have concern evaluation labels as a bad teacher Have concern evaluation does not reflect competency Choose curriculum based on what evaluated on Choose teaching strategies based on what evaluated on Direct focus on certain students based on what evaluated on Use test data to modify classroom practice Use observation data to modify classroom practice Feel evaluated fairly 8 or more M SD n 95% CI t Prior 3.13 0.99 8 2.50 1.26 34 -0.34, 1.59 1.31 Current 3.56 1.01 9 2.51 1.27 35 0.12, 1.97 2.28** Prior 3.13 1.55 8 2.74 1.14 34 -0.58, 1.36 0.81 Current 3.78 0.97 9 3.06 1.08 35 -0.08, 1.52 1.81* Prior 1.88 1.36 8 1.79 1.07 34 -0.81, 0.97 0.18 Current 2.22 1.09 9 1.94 1.11 35 -0.56, 1.11 0.68 Prior 2.00 1.31 8 1.82 1.00 34 -0.67, 1.02 0.67 Current 2.22 0.97 9 1.86 0.91 35 -0.33, 1.06 1.06 Prior 2.75 1.17 8 2.50 1.24 34 -0.72, 1.22 0.52 Current 2.56 1.01 9 2.43 1.09 35 -0.69, 0.94 0.32 Prior 2.75 1.49 8 2.47 1.29 34 -0.77, 1.27 0.54 Current 2.88 0.99 8 2.46 1.25 35 -0.54, 1.37 0.89 Prior 3.00 1.31 8 2.65 1.25 34 -0.65, 1.36 0.71 Current 3.89 0.93 9 2.80 1.23 35 0.30, 1.88 2.92*** Prior 2.75 1.28 8 2.41 1.18 34 -0.62, 1.29 0.72 Current 3.44 0.88 9 2.20 1.05 35 0.47, 2.02 3.26*** Prior 3.50 1.07 8 3.18 1.17 34 -0.59, 1.24 0.58 Current 3.89 0.60 9 3.20 1.13 35 0.12, 1.26 2.49** Prior 3.13 1.36 8 2.74 1.14 34 -0.55, 1.33 0.55 Current 3.67 1.12 9 3.06 1.19 35 -0.28, 1.50 1.39 Prior 3.75 0.46 8 3.85 1.16 34 -0.95, 0.75 -0.24 Current 4.13 0.35 8 4.14 0.60 35 -0.47, 0.43 -0.08 Feel last year will impact current Prior 2.88 1.55 8 2.41 1.13 34 -0.50, 1.43 0.97 year Current Note. Scale for Survey: Strongly Disagree 2- Disagree 3- Neither Agree nor Disagree 4- Agree 5- Strongly Agree * = p< 0.1, **= p< 0.05, *** = p < 0.01 146 df 40 42 40 42 40 42 40 42 40 42 40 41 40 42 40 42 40 42 40 42 40 41 40 Interview The data in Table 18 present the interview case for each code by seven-year status. The frequency is the number of interviews that included at least one occurrence of the code, which was used rather than frequency of interviewees due to the small “n” of these data. Each interviewee participated in two interviews for the study so the “n” reported equals twice the number of interviewees at a school. The percent of occurrence among interviews in each category is included to allow for comparisons across groups. Table 18 Occurrence of Codes in Interviews by Seven-Year Status Taught Seven Years or Fewer (n= 12) Codes Motivation Internal External Observation Feedback Negative Positive Testing Feedback Negative Positive Work Decisions Strategy/How Taught Curriculum/What Taught Who is Taught Response to Reform Acquiescence Adaptation Denial Taught Eight Years or More (n= 16) Frequency Percentage Frequency Percentage 6 4 10 10 5 9 9 1 50.0% 33.3% 83.3% 83.3% 41.7% 75.0% 75.0% 8.3% 7 6 10 10 4 11 11 1 43.8% 37.5% 62.5% 62.5% 25.0% 68.8% 68.8% 6.3% 6 3 1 50.0% 25.0% 8.3% 2 3 1 12.5% 18.8% 6.3% 2 4 1 16.7% 33.3% 8.3% 6 4 2 37.5% 25.0% 12.5% 147 Discussion Other studies have demonstrated that novice teachers are less effective than more experienced teachers (Clotfelter, Ladd, & Vigdor, 2007; Rockoff, Jacob, Kane , & Staiger, 2011; Wayne & Youngs, 2003). Yet, newer teachers make rapid gains early in their careers (Boyd, Lankford, Loeb, 2008; Rockoff, 2004) and improve most rapidly in schools with higher socioeconomic status and higher schoolwide VAM scores (Loeb, Kalogrides, and Beteille, 2012). Three trends emerged from the combination of data from the survey and interview as separated by seven-year status which could offer explanation for the changes seen early in teacher careers. First, teachers who had seven years or fewer of experience were more likely to state they would change practice in anticipation of an evaluation than those with more experience. Similarly, a study by Sun, Mutcheson, & Kim (2016) demonstrated that early career teachers were more likely to use evaluation feedback to improve their practice, a finding that was reflected in the results of this dissertation. Possible explanations include that teachers with fewer than seven years’ experience are observed much more frequently and, due to lacking tenure, job retention is more closely tied to evaluation if a reduction in workforce were enacted. Therefore, it could be that more exposure to evaluation led to a greater awareness of changing practice to meet observation targets or that the higher stakes of evaluation led to such changes. Second, teachers in the seven years or fewer category were more likely to state they would use feedback from an evaluation in general and specifically, feedback in the form of test data, than teachers with more experience. Again, this reliance could be due to increased observations and/or the greater stakes attached to lacking tenure. Interestingly, in the interview data, the frequency of occurrences of interviews mentioning feedback from observation in a 148 positive manner was much higher among the seven years or fewer group versus the eight years or more (41.7% versus 25.0%), which suggests that there may be a relationship between the frequency of formal observations and the perceived value of the feedback received. Teachers who were interviewed often cited the limited number of observations conducted as influencing their ability to use observation feedback. So, it may be that those who are observed more frequently see greater value in the experience as a source of feedback. Third, teachers in the seven years or fewer category were more likely to state that they changed their practice, specifically by choosing teaching strategies and directing focus on students based on evaluation than teachers with more experience. The interview data were supportive of this as interviews from teachers in the seven years or fewer category more frequently contained statements referencing a change in teaching strategies based on evaluation as opposed to the eight years or more group (50.0% versus 12.5%). Because teachers in the seven years or fewer group are evaluated on all six standards instead of four, it is also possible that teachers were simply trying to meet some of the standards with superficial changes. Superficial changes in practice dominated teachers’ descriptions of changes in teaching strategies as discussed in Chapter 5. It is also possible that teachers who fell into the eight years or more category who received less observations and were more likely secured with tenure simply felt less external pressure from the evaluation policies. Again, interview data supports this as interviews from teachers in the seven years and fewer group were less likely to demonstrate a reform typology of acquiescence than those in the eight years and over group (16.7% versus 37.5%) indicating that there was less acceptance of the policy among those teachers subjected to full evaluation cycles. 149 Another possible explanation for the differences between the seven years or fewer and eight years or more groups may be unrelated to the pressure or frequency of evaluation, but due instead to the growth model that has been employed by administrators using the observation rubric, which was a common complaint among teachers who questioned the validity of the observation instrument as discussed in Chapter 5. It is possible that teachers with fewer years of experience are simply rated lower than those who have been evaluated across a longer time span because administrators may feel policy pressure to score newer teachers low initially in order to demonstrate growth later on. However, it should also be noted that the focal interview teachers in the seven years or fewer category were spread evenly across three of the focal schools: Charles, Phoenix, and Central, and there were no teachers from Riley represented in that sample. Such unbalance may help explain the differences seen between groups in the interview data. Subject Area Survey The third individual-level characteristic of interest was subject area: whether the teachers taught Math or English. Independent samples t-tests were run on both sets of questions from the complementary question set, which asked teachers to reflect on statements regarding the prior and then the current year, to identify differences between Math and English teachers. Differences between subject areas emerged in four themes between both question sets (Table 19). For both questions sets, English teachers were more likely to state they modified practice in anticipation of evaluation when compared to Math teachers (prior year: t(40) = 2.07, p = 0.05; current year: t(42) = 3.18, p < 0.01). While English teachers were more likely than Math teachers to report changing practice in anticipation of an evaluation, Math teachers appear significantly more likely to say that they would use observation feedback to modify classroom 150 practice when compared to English teachers, t(40) = -2.09, p= 0.04. Similarly, Math teachers were significantly more likely to say the prior year’s evaluation would impact current year classroom practice when compared to English teachers, t(40) = -1.18, p = 0.07. 151 Table 19 Independent Sample T-Test of Survey Themes by Subject Area English Survey Themes Modify practice in anticipation of an evaluation Modify practice using feedback from evaluation Have concern evaluation affects employment Have concern evaluation labels as a bad teacher Have concern evaluation does not reflect competency Choose curriculum based on what evaluated on Choose teaching strategies based on what evaluated on Direct focus on certain students based on what evaluated on Use test data to modify classroom practice Use observation data to modify classroom practice Feel evaluated fairly M 3.06 3.37 2.50 3.16 1.78 2.00 2.00 1.89 2.44 2.42 2.72 2.63 3.00 3.16 2.67 2.53 3.17 3.21 2.39 3.05 3.78 4.16 2.11 SD 1.26 1.12 1.34 1.21 1.11 1.05 1.24 0.81 1.20 1.07 1.27 1.07 1.33 1.21 1.46 1.12 1.10 1.03 1.15 1.27 1.11 0.60 1.13 Math n 18 19 18 19 18 19 18 19 18 19 18 19 18 19 18 19 18 19 18 19 18 19 18 M 2.29 2.24 3.04 3.24 1.83 2.00 1.75 1.96 2.63 2.48 2.38 2.46 2.50 2.92 2.33 2.40 3.29 3.44 3.13 3.28 3.88 4.13 2.79 SD 1.12 1.20 1.08 1.01 1.13 1.16 0.90 1.02 1.25 1.09 1.35 1.32 1.18 1.29 0.96 1.16 1.20 1.12 1.12 1.14 1.04 0.54 1.22 n 24 25 24 25 24 25 24 25 24 25 24 24 24 25 24 25 24 25 24 25 24 25 24 95% CI 0.02, 1.51 0.41, 1.84 -1.3, 0.21 -0.76, 0.60 -0.76, 0.65 -0.68, 0.68 -0.42, 0.92 -0.64, 0.51 -0.95, 0.59 -0.72, 0.60 -0.48, 1.18 -0.58, 0.93 -0.29, 1.29 -0.53, 1.01 -0.48, 1.14 -0.58, 0.83 -0.85, 0.60 -0.90, 0.44 -1.45, -0.03 -0.96, 0.51 -0.78, 0.58 -0.32, 0.38 -1.42, 0.06 t 2.07** 3.18** -1.45 -0.25 -0.16 0.00 0.76 -0.23 -0.47 -0.18 0.85 0.47 1.29 0.62 0.84 0.36 -0.35 -0.70 -2.09** -0.63 -0.29 0.19 -1.85* Prior Current Prior Current Prior Current Prior Current Prior Current Prior Current Prior Current Prior Current Prior Current Prior Current Prior Current Feel last year will impact current Prior year Current Note. Scale for Survey: Strongly Disagree 2- Disagree 3- Neither Agree nor Disagree 4- Agree 5- Strongly Agree * = p< 0.1, **= p< 0.05, *** = p < 0.01 152 df 40 42 40 42 40 42 40 42 40 42 40 41 40 42 40 42 40 42 40 42 40 41 40 Interview The data in Table 20 present the interview case for each code by subject area where the frequency is the number of interviews that included at least one occurrence of the code. This method was used rather than frequency of interviewees due to the small “n” of this data and the variable number of interviewees in each subject area. Each interviewee participated in two interviews for the study so the “n” reported equals twice the number of interviewees at a school. The percentage of occurrence for the interviews in each subject is also reported to allow for comparisons between the two groups. There were no stark differences between the two subject areas aside from a less frequent occurrence of statements related to external motivation appearing in English interviews versus Math interviews (27.8% versus 50%). Whether this difference in external motivation is related to the characteristics of the individual teachers interviewed for the study or to teachers of Math or English as a whole is unclear. It is also possible that these numbers are influenced by missing Math teachers from the interview sample at Riley and by only having one Math teacher participate at Phoenix. 153 Table 20 Occurrence of Codes in Interviews by Subject Area English (n= 18) Codes Motivation Internal External Observation Feedback Negative Positive Testing Feedback Negative Positive Work Decisions Strategy/How Taught Curriculum/What Taught Who is Taught Response to Reform Acquiescence Adaptation Denial Math (n= 10) Frequency Percentage Frequency Percentage 8 5 12 12 5 12 12 1 44.4% 27.8% 66.7% 66.7% 27.8% 66.7% 66.7% 5.56% 5 5 8 8 4 8 8 1 50.0% 50.0% 80.0% 80.0% 40.0% 80.0% 80.0% 10.0% 4 4 1 22.2% 22.2% 5.56% 4 2 1 40.0% 20.0% 10.0% 6 4 2 33.3% 22.2% 11.1% 2 4 1 50.0% 40.0% 10.0% Subject Area Specific Concerns Observation concerns. The survey analysis findings are not surprising considering some of the specific concerns raised by teachers during the interview phase. It is possible that these findings relate to the background of the observer rather than the subject area of the observed teacher. As discussed in the previous chapter, Math teachers at Central were observed by a principal with a Math background, whereas English teachers at the same school often essentially evaluated themselves due to their assigned observer reportedly having difficulties operating the computer on which evaluation scores had to be recorded. In the case of one English teacher at Central, an observation of teaching was never conducted and instead she had been observed conducting a department meeting. Additionally, Math teachers at one other school, Charles, 154 mentioned the value of having an administrator with a Math background observing lessons, which had occurred for those teachers in the previous study year. Therefore, it is entirely possible that the experiences of teachers at those schools strongly affected the survey results where Math teachers felt that last year’s evaluation would impact the study year and that observation feedback was used to modify classroom practice. Mr. Augustus explained how he imagined his administrator at Central, who had a Math background, could bring something different to the feedback given in a Math classroom versus another subject. While Mr. Augustus explained that the observations were “definitely focused on classroom management,” he felt confident that when dealing with something very specific in the Math curriculum, that a non-Math person would be unable to appreciate the context and prior learning that students needed leading up to the lesson. However, Principal Nichols could provide feedback specific to the Math lesson he observed. Mr. Augustus was very sure of the Math competency of his evaluating administrator and explained, “I have had him, during observations, where he will see a kid that is struggling, and I am helping some other kids and he will go over and actually help him.” The description Mr. Augustus gives of his principal’s feedback mirrors the experiences with coach observation that teachers found valuable, as discussed in Chapter 5. In that chapter, I described other informal forms of evaluation that teachers found valuable which included feedback from coach observations in cases where the teacher found the observer to be able to provide classroom relevant suggestions. The Math teachers at Charles in particular talked about their Math coach and highlighted that she was knowledgeable and had practical experience as a Math teacher, so she knew what it was like to teach a Math course. 155 Moreover, emerging literature supports the idea that the subject area background of the observer and the alignment between the background of an observer and the subject area observed may matter greatly in the types of feedback that a teacher receives (Bell et al., 2015). One of the assumptions behind the North Carolina teacher evaluation policy is that the rigorous standardization of the observation protocol would lead to more equitable observation experiences and better feedback for teachers. However, it seems that this feedback may be more useful when teachers are given the opportunity to receive feedback from an observer who not only understands a subject, but actively displays competence in the subject area. Testing concerns. General concerns that teachers had about testing were addressed in Chapter 5, but two-subject specific concerns were raised in the interviews. English teachers had concerns that the tests which were administered to their students to gauge growth did not address all of the standards. On the other hand, Math teachers had a curriculum that had been in flux over the last several years and during the study year Math 2 teachers had to administer two tests to students, only one in which would be used to measure student growth or to provide feedback. North Carolina adopted the Common Core Curriculum Standards (CCCS) for the 20122013 school year and the English/Language Arts (ELA) standards for high school, in the form most current at the time of this writing, have five anchor strands: Reading: Literature, Reading: Informational Text, Writing, Speaking and Listening, and Language. However, English teachers contend that the only strands addressed on the tests are the two Reading strands and some of the Language strand. When there are short answer writing portions on the test, teachers feel that those portions are not scored accurately because teachers receive the student score for the test on the same day it was administered. Teachers also cited examples of specific standards within the anchor strands that were known to not appear on the test and further described how the 156 prioritizing of standards on the test influenced what they chose to teach in their classroom. For instance, many teachers chose to forego writing altogether in lower achieving classes in order to try and get students to pass the test and exhibit growth, whereas Listening/Speaking was not incorporated into lessons in a formal manner in any courses. Such actions are similar to those demonstrated in research on gaming strategies for highstakes testing that may result in a narrowing of curriculum to focus on tested aspects (Carnoy & Loeb, 2002; Ladd & Zelli, 2002; Rothstein & Mathis, 2013). Additionally, the different approaches teachers described in regard to changing curriculum between lower achieving classes and higher achieving classes could also represent a form of educational triage, except rather than removing students from the testing pool (Figlio, 2006; Figlio & Getzer, 2002) or diverting resources within a classroom (Booher-Jennings, 2005) teachers were adjusting the curriculum between classes in an attempt to get more students to pass the standardized test. English teachers also raised the discrepancy between the idea of “College and Career Readiness” standards and an English test with content which was nearly a third based on poetry, which teachers contended indicated that the test was heavily grounded in literature despite the attempts to make ELA broader with the inclusion of the anchor standards. Mr. Allen described how the situation applied to students who wanted to go into technical trades such as welding or plumbing, “I think it is great that they should be exposed to poetry and that they should see what that has to say about society, but at the same time, is it fair that 30% of their final exam grade is based on a couple of random poems? When that has almost nothing to do with College and Career Readiness?” Similarly, Mrs. Williams stated that she had always joked that one year the tenth-grade test would include War and Peace and ask questions about Russian patronyms only to be horrified this year when she was proctoring the exam and noticed that War and Peace had 157 been included. Mrs. Williams maintained that the literature and the concepts on the test were “not practical” to teach standard high school students. In Math classes, the curriculum and standards have been shifting. Overall, the teachers seemed to agree with the shifts and felt like the state was responding to teacher concerns that the previous order of certain standards and topics did not align in a way that made sense across the four required Math courses. Teachers explained that certain courses used to be “heavy” in certain topics or that some concepts were presented out of order. However, during the study year, the standards for Math 2 changed significantly. So, some teachers in the study were expected to administer a statewide field test from which they would receive no data or feedback. A county developed exam was also administered and the student grades from that took about two weeks to receive because the data had to be transformed and managed at the county level. However, the district’s position on this testing was reportedly to ask the Math teachers to lie and tell students that both tests, administered on separate days, would count toward student grades. By lying about the situation teachers were artificially creating pressure for students to perform in a situation absent of direct pressures to motivate students, a reform response described by Cohen (2011) that occurs in scenarios with high-stakes for teachers. By the second interview, the teachers had already completed one round of Math 2 testing, but the dual tests were scheduled to be administered again at the end of the school year. The Math 2 teachers talked about how they “hated” lying to students and about how the two tests seemed like a waste of time and resources which did not provide any useful feedback. Teachers also described how the situation did not motivate them to have their students do well. Mr. Robbins summed up how the predicament presented an opportunity to game the system, 158 [H]ow do you motivate kids? And …do I want the kids to bomb that field test? Because if every kid around the state just does horrible on it, then the state will do one of two things: they will either think that the test was too hard and they will write easier questions, which will benefit my kids in the future, or they will normalize it according to those awful grades, and it will mean that the curve in future years will be extremely low. One of those two things will happen, so you are not really motivating me to really push my kids to do extremely well on it. Conclusion Overall, some differences emerged between different groups for the three individual characteristics examined in this chapter. The results are similar when examining licensure and years of experience. For instance, both teachers who are provisionally licensed or have seven years or fewer of experience are more likely to report altering classroom practices due to evaluations when compared to teachers who are either professionally licensed or who have eight or more years of experience in North Carolina. Such differences could be due to many factors including a lack of guidance in how to model classroom practices, the more frequent occurrence of observations, or the higher stakes that evaluations carry for teachers with lower designations. The differences in subject area may not be a result of a teacher’s subject area background but instead a result of the contextual differences under which evaluation occurs. For instance, subject area results may have been influenced in part by a particularly poor observer in the English department at Central juxtaposed to a particularly competent observer for the Math department at the same school. In contrast, Math teachers at Central and at Charles both described positive experiences with having an observer with the same subject area background. So, feedback may be more useful when an observer is competent in the area observed and may 159 contribute to changes in classroom practice in that manner. However, observers who are not competent in a subject provide feedback that has less value to a teacher. Additionally, concerns about testing were raised by teachers of both subject areas. Specifically, English teachers were concerned that standards and entire anchor strands were left entirely off of the test. English teachers also expressed concern that the test was literature heavy and did not fit into the ideas of “College and Career Readiness” as espoused by the adopted standards. English teachers described some gaming behaviors such as narrowing of curriculum to focus on those standards and strands which were tested at the expense of non-tested elements of the curriculum. Meanwhile, Math teachers expressed specific concerns over the administration of a field test in Math 2. The concerns included that administering tests in this manner did not provide any reliable feedback and also led to opportunities to potentially game the system. Overall, it appears that there are differences between provisionally and professionally licensed teachers and those with seven or fewer years of experience and those with more years of experience. These differences may be related to the higher stakes risk associated with evaluations for teachers without tenure protection or to the unique circumstances of being a newer teacher. Differences between subject areas in this study were not necessarily driven by inherently different characteristics between Math and English teachers, but rather by the unique circumstances under which each group taught and was evaluated. Additionally, there is some evidence to suggest that the subject area background may matter in regard to the value of feedback received and the extent to which such feedback can influence practice. Finally, conditions around testing led English teachers, and possibly Math teachers as well, to engage in gaming behaviors in an attempt to improve test scores. 160 CHAPTER 8: Conclusions and Implications Prior teacher evaluation protocols were usually developed, or at least selected, at the local level and consisted of observations by school administrators. The results of such observations were usually bound within the school or district in which the observation was conducted, and principals relied on references to determine the potential ability of a new hire. Such systems were previously critiqued as rating too many teachers as high performing (Weisberg et al., 2009). Critics argued that a better system of teacher evaluation would be more standardized and include multiple measures to determine teacher proficiency, in turn more accurately gauging teacher competency by approaching proficiency from different angles and providing a better source of feedback to improve teacher practice (US Department of Education, 2009). While the critics of previous local-based systems presented valid points, their critiques emerged in a political atmosphere where student test scores had increasingly served as a proxy for student achievement, school-level governance was becoming increasingly consolidated at the state level, and teachers were increasingly portrayed by policymakers as individuals who had become complacent in their jobs under the safety of union strongholds. Federal initiatives such as RttT prompted a “rapid policy diffusion” of new evaluation policies which resulted in legislative changes in 46 states (Grissom & Youngs, 2015, p. 169). Mintrop and Sunderman termed the resulting legislation as the “third wave” of accountability where accountability for individual student success became narrowed to the focus of each individual teacher’s impact (2013). The teacher evaluation system used in North Carolina at the time of this dissertation was created in response to criticisms of prior local-based systems and in many ways, serves as an ideal example of what many policymakers felt an ideal evaluation system should look like. First, most educational policy, including teacher pay and graduation requirements, had been centralized at the state-level for several decades. Because of this, North Carolina already had pre161 existing systems of technology which could be used to track student and teacher growth statewide under new accountability systems. Additionally, North Carolina was and is a “Rightto-Work State” and, thus, there has been limited job protection offered to teachers by teacher unions. These conditions allowed policymakers to elevate the stakes attached to teacher evaluations and connect observations and student test growth to the retention of employment, what Firestone had termed as “the most powerful incentive” to motivate teachers (2014, p. 102). Finally, North Carolina had already consolidated teacher evaluation at the state level with a rigorously standardized observation protocol prior to Race to the Top (RttT). However, despite attempts to create a teacher evaluation system that answered the critiques of previous systems, the observation scores of teachers in the schools in this study were overwhelmingly rated proficient. For instance, 96.5% of the observation standards rated were marked proficient or higher in Charles, the lowest achieving school in this study which was 2.5% below the district average of 99%, indicating that there was a trend across Broadville County to rank teachers as proficient or higher. These results mirror other studies that demonstrate that the “widget effect” has persisted post evaluation reform (Sawchuck, 2013). Additionally, discrepancies existed between the two measures used to rank teachers: observation and student growth on standardized tests. For example, Phoenix teachers had 100% of their rated observation standards marked as proficient, yet had poor performance on the student growth standard with only 75% of teachers being rated as proficient. What has remained uncertain is how newer evaluation policies impact the work of teachers. This dissertation examined the ways in which evaluation policy relates to teacher practice while considering various aspects of school and individual contexts. I then parsed out the ways in which school characteristics and individual-level characteristics may impact the evaluation- 162 practice relationship. Though quantitative differences between schools were not found, there were qualitative differences in how evaluation was related to practice across sites. Differences were also found in the evaluation-practice relationship between teachers of different licensure levels and different levels of experience, possibly due to the increased risk evaluation carries to those in the lower designations. Finally, differences between the subject areas of Math and English were identified, but may have not been the result of unique characteristics of Math or English teachers. Rather, differences may have been influenced by the proficiency and approach of observers and a lack of subject area alignment between the observer and the classroom in English. In contrast, a subject area match was present for some of the Math teachers in the study. Therefore, it is important to examine the context of evaluation, particularly the capacity of the administration that conducts observation. Despite attempts to standardize evaluation, there are factors that influence how observation is conducted in schools. For instance, the results of this study suggest that the characteristics and capacity of an observer do matter in how the observation protocol is interpreted and implemented. Additionally, the evaluation climate and culture, or evaluation scenario of a school, may also influence the ways in which teachers find evaluation motivating and how teachers approach feedback from evaluation. The results of this study provide insight into the relationship between teacher evaluation and classroom practice, an area that has previously been under researched despite the impact other high-stakes accountability policies have had on teaching practices and the teaching workforce. Implications for Research Teacher evaluation has gained much popularity as a research topic over the past decade. The importance of such work is amplified by the often drastic changes that occurred in state 163 policies following RttT. Prior research focusing on the technical aspects of evaluation, including the potential issues of using both local-based observation tools and VAMs or student growth measures as part of evaluation, have been examined extensively (Baker et al., 2010; Bill and Melinda Gates Foundation, 2010; Corcoran, 2010; Glazerman et al., 2011; Goldhaber et al., 2013; Harris, 2009; Hill et al., 2001; McCaffrey et al., 2003; Rothstein & Mathis, 2013). For instance, the potential misidentification of teachers using VAMs has been extensively investigated (Goldhaber et al., 2013; Harris, 2009; Raudenbusch & Jean, 2012). Additionally, the infrastructure changes that accompany such systems have also been explored (Anagnostopoulos at al., 2013a; Mintrop & Sunderman, 2013; Thorn & Harris, 2013). However, at the time of writing, there was a gap in the scholarship examining the relationship between teacher evaluation policies and teacher practices. So, the work in this dissertation represents the next step in research on teacher evaluation policies, one in which the impacts of the policy on policy actors at the classroom level is examined. This dissertation also represents part of the next generation of literature on how external accountability influences practice. Previous accountability policies have been examined for impacts on both the teaching workforce (Clotfelter et al., 2004) as well as on teacher practice (Carnoy & Loeb, 2002; Ladd & Zelli, 2002; Rothstein & Mathis, 2013). As accountability policy has now fully entered the “third wave” focused on individual-level accountability following RttT, it becomes important to revisit questions about how external pressures influence teachers. This is important because previous work on teacher responses to external pressures have demonstrated that teachers may engage in behaviors, in an attempt to meet policy demands, which carry financial or educational costs to schools. It is possible that such results are amplified when policy narrows to the level of individual accountability. 164 Additionally, this dissertation contributes to growing body of research on the relationship between evaluation and the context of evaluation, including observer background, observation protocols, and teacher characteristics, with particular attention to the capacity of the school leaders who are tasked with undertaking school-level evaluation. There is scant research available on how observers interact with observation protocols, yet newer work is emerging that examines the impact of the subject area background of observer upon both the ranking a teacher receives as well as the type of feedback received by the observed teacher (Bell et. al, 2015). Other work indicates that there may be differences between feedback received in different subject areas at the elementary-level (Burch & Spillane, 2003). Moreover, the frustrations expressed by teachers in this study mirrors recent work by Reinhorn et al. (2017), where teachers expressed disappointment with administrators lacking the background and experience to provide subject-specific recommendations for improvement. What is yet unclear is whether subject area differences, such as those observed in this study, stems from the nature of a subject or from the background and ability of the observer. Implications for Policy and Practice While teacher evaluation is often cited by policymakers as providing a source of feedback and motivation for teachers to improve classroom practice, the results of this study suggest that may not be true in some circumstances and contexts. The two assumptions that evaluation policy could simultaneously motivate teachers and provide feedback to improve practice did not play out as expected, at least in regard to the teachers in this study. So, is this failure for the policy to materialize as assumed a result of poor theory or poor implementation? Some teachers described other evaluation policies which were perceived as useful and wellimplemented, particularly in regard to the use of instructional coaches. However, this was not a 165 universal experience across teachers. For instance, teachers seemed to reject coaches who focused more on theory rather than actual teaching situations. Overall, teachers expressed that they felt that better feedback was received from an observer who knew the teacher’s subject. Likewise, there were implementation issues of the formal evaluation policy, particularly in the case of Central, where one administrator did not conduct observation appropriately, whereas at Charles at least some of the teachers found the formal evaluation process to be helpful. It becomes apparent in looking at these scenarios across schools and across both formal and informal evaluation policies that teachers are likely to reject aspects of a policy which are perceived as coming from invalid sources. Thus, it is important to understand the conditions under which policies are being implemented, particularly in regard to the capacity of the individual doing the evaluating. For instance, one reason teachers were critical of both components of the evaluation was due to the timing of feedback. For observations, teachers opined that the feedback was only about one class and that observations were often conducted in quick succession in a short amount of time or at the end of the year where improvements could not be implemented. These issues in timing are related to the capacity of leadership conducting the evaluations as well as limitations in resources (specifically time) to conduct the lengthy observation and feedback process. Testing feedback, on the other hand, was not available until the next year, which also prevented teachers from using the feedback to make meaningful change. My study points to three implications for policy and practice. First, I explore the relationships that emerged in the data between leadership capacity and the success of the evaluation policy. Second, I describe how questions about the validity of the evaluation system in the constraints of the context of schools served as a barrier for evaluations being useful as a 166 motivating tool or a feedback source. Finally, I describe how an evaluation system with conflicting messages about motivation in a high-stakes environment, as implemented in the schools in this dissertation, may serve to motivate school personnel to engage in undesirable behaviors. Leadership and Evaluation The observation instrument used in North Carolina at the time of this study was a lengthy, standardized, very detailed document that is meant to cover nearly every conceivable aspect of teaching. However, despite the detail of the observation instrument, the background, skill, preferences, and values of the human observer influenced how individual teachers were evaluated. For instance, in another study, factors such as teacher personality, philosophy, and effort were found to have contributed to evaluation ratings (Harris, Ingle, & Rutledge, 2014). Moreover, the standardization of the observation protocol used in the district in this dissertation could have been influenced through individual interpretations of the protocol or policy, by the evaluation scenario created by the climate and culture of the school, and by the observer’s proficiency as an evaluator. First, what an observer chooses to value in teaching may matter greatly in how the observation protocol gets interpreted, as was the case with Riley’s the new administrator, Ms. Jefferson. The focal teachers reported that Ms. Jefferson tended to equate “good teaching” with the submission of lesson plans and expressed concern that this was impacting the way ratings were assigned in observation. According to the focal teachers, Ms. Jefferson placed emphasis on the completion of tasks rather than on what she saw happen in classrooms. The situation at Jefferson demonstrates that observers can choose to prioritize certain actions of teachers or interpret the observation instrument in a way that allows for such prioritization. Moreover, it is 167 important to note that an overreliance on scores from evaluation, particularly as related to testing, may inhibit teacher autonomy which presents a challenge to intrinsic motivation (Firestone, 2014). Teachers at Phoenix described a “culture of improvement” that allowed for teacher autonomy and experimentation, which contrasts teachers at Riley who were apprehensive to even talk about evaluation with an outsider due to the administrator’s focus. How the policy is interpreted by an observer also matters. Teachers across all of the school sites reported that the observation instrument was intended to be used as a growth instrument. Therefore, how a teacher is evaluated may hinge on their observing administrator’s interpretation of “growth.” For instance, teachers expressed that some administrators seemed to think that “growth” meant that a new teacher should always be ranked low regardless of past experience or of the performance observed in the classroom. The intentional lowering of initial evaluations in order to leave room for growth could be discouraging for a teacher who perhaps should have scored higher if ranked objectively and also is illustrative of how the scale used to rank teachers may not be truly standardized across all sites. The growth interpretation is one that emerges in this study, but there may be others. For instance, other studies have demonstrated that teachers respond to evaluation through the lens of their administration (Reinhorn et al., 2017) and that the persistence and strength of policy messages shapes the understanding and implementation of evaluation for administrators (Rigby, 2014). Related to the first two points, evaluation scenarios also seem to be created within the school. Such scenarios appear to be primarily driven by administration and administrators’ individual approaches to evaluation, but also may reflect a long standing cultural tradition or climate component of the school as driven by the interpretation of evaluation policy. Three scenarios are presented in this dissertation, though there is certainly a possibility of more: 168 Technocratic, Autonomous/Self-Efficacious, and Consensus Lacking. The experiences teachers had with evaluation varied greatly depending on which scenario their school exhibited. As described previously, the technocratic scenario at Riley lead to feelings of duress in teachers whereas the lack of consensus at Central lead to some teachers finding evaluation to be invalid. Finally, the proficiency of observers also interacts with observation protocols. The English teachers at Central described their observer, Mr. Reward, as an administrator who did not possess the skills or the proficiency needed to complete the observation instrument properly. Mr. Reward would often ask teachers to complete their own evaluation ratings and in one reported case, conducted an observation at an inappropriate time. Teachers at Central seemed to indicate that they did not forget when administrator “messed up” evaluation. Aside from issues with Mr. Reward, a teacher described how a previous assistant principal had made serious mistakes with overseeing testing which may have contributed to negative feelings toward evaluation. While Central provides an extreme case of an observer lacking the proficiency to conduct observations, there is also some evidence from this study that suggests that the proficiency of an observer in the subject area being observed may also matter, particularly in regard to the quality of the feedback received. For instance, the Math teachers at Central and at Charles described situations under which they had been observed by an administrator with a Math background and described the feedback as useful and valid. Additionally, teachers who had positive experiences with curriculum coaches described a similar situation of receiving useful feedback that was directly relevant to their work in the classroom. Therefore, the impact an observer has on evaluation should be considered when observations are used as part of high-stakes decision making processes. 169 Perceptions of Validity Teachers also expressed concerns over the validity of both components of the evaluation instrument. The concerns are crucial because in order for observations and test data to be used as a feedback source and a tool for motivation, teachers need to see the measure as valid. Many of the concerns around the validity of observations are related to the observer and are discussed above. For instance, teachers are unlikely to find feedback from an unskilled observer to be either useful or motivating. Teachers may also interpret evaluation differently depending on the evaluation scenario in their respective school or may approach feedback from an observation differently if they do not think the observer’s focus is a valid component of teaching. However, teachers had other concerns about the validity of the observation instrument, such as questioning whether the frequency and timing of evaluations provides a good enough sample of their work to pass judgement of teaching ability. Similarly, teachers also expressed concerns that some of the standards on the observation may be too narrow for teachers to achieve every year based on a few observations. Teachers also raised concerns over the testing component of evaluation, including that the test was very short and the questions did not address all of the standards in which teachers were tasked with teaching. Additionally, cut scores for students were very low, which teachers felt was misleading. Feedback on testing was described by one teacher as an “autopsy report” as it came much too late to be used to implement any classroom changes. Finally, the psychometric model which was used to calculate student growth and to evaluate teachers was difficult for teachers to understand. While teachers had been told about some components of the student growth equation, such as the removal of outliers which were meant to adjust for some of the very valid critiques researchers have presented on the use of VAMS in high-stakes situations (Harris, 170 2009 and others), teachers felt that the scores often seemed at odds with the realities of the classroom. Concerns over the validity of the evaluation components should be addressed if the purposes of evaluation are to include motivating teachers and providing feedback useful to improve teacher practice. Otherwise, it is important to provide opportunities to motivate and receive feedback in other ways from sources in which teachers do perceive validity. Altered Teacher Behaviors The results of this dissertation also suggest that the high-stakes evaluation system used in North Carolina at the time of the study may sometimes motivate teachers to engage in undesirable practices. While there were no extreme cases of gaming or cheating that emerged in this study, there was evidence that teachers sometimes altered practices to improve student test scores. For instance, teachers described how they may change the way in which questions are worded or the medium through which assignments are presented in order to familiarize students with formats found on the test. Teachers also cited examples of certain testing strategies they taught in order to assist students in becoming better test takers. There were also examples of how curriculum was altered to meet evaluation requirements. In English, teachers chose to forgo certain standards because it was known they would not appear on the test. Teachers admitted that such practices meant that they did not successfully teach all the standards for their course. Additionally, English teachers stated that a narrowing of the curriculum was more common in the lower achieving courses where teachers felt more test prep would be necessary. This practice resulted in the withholding of certain parts of the curriculum from selected groups of students. Teachers also cited some examples of changes in practice due to observation, such as being sure to incorporate technology on an observation day in order to ensure that standard 171 would be met, but these examples were more benign than some of the alterations that were motivated by testing. Nonetheless, the examples that teachers shared about how they altered practice to accommodate either form of evaluation indicated that they were indeed motivated to make changes in order to receive a better score; however, some of the changes teachers were motivated to engage in may have unintended negative consequences for students. Research also indicates that high-stakes accountability systems may result in increased turnover (Ingersoll, 2001). There were two instances of this that appeared in the interviews. Mr. Brown and Mr. Eagle were both teachers who had taught previously out of state, who were in their first years at their respective schools, and who both cited a focus on testing and pressure to have students perform well on tests as reasons for seeking other employment opportunities. The behaviors noted above were most often reported by teachers who were provisionally licensed and/or had seven or fewer years of experience in North Carolina and were therefore subjected to increased observations and were unprotected by career status. It’s unclear whether teachers with provisional licensure or seven years or fewer designations reported engaging in these behaviors more frequently because they were less experienced teachers or because evaluation held higher stakes for them than more experienced teachers. Summatively, these examples of teacher behavior suggest that teachers do respond to high-stakes evaluation in a manner similar to studies done on teacher response to other accountability measures (Rothstein & Mathis, 2013; and others). Therefore, the benefits of high-stakes evaluation policy should be weighed against these unintended consequences. Reconciling Evaluation Policy for Both High and Low Stakes Purposes The evaluation policy in North Carolina as well as elsewhere in the country is partially meant to serve as a tool to regulate the quality of teachers in the classroom. Previous research has 172 demonstrated that while principals do report using evaluation to move poorly performing teachers towards dismissal, such teachers often leave before formal dismissal can occur (Kraft & Gilmour, 2017). And while the use of multiple measures has attempted to mitigate previous concerns over the use of local observations, when making human resources decisions principals have reported relying more on observations, which are perceived as more specific and transparent, than on VAMs or test scores which are not timely and are opaque (Goldring et al., 2015). One study suggests that the perspective of school administrators is that effective teaching is broader than what can be expressed in test scores and, as also demonstrated in this dissertation, such interpretations are subject to a principal’s prior knowledge, connection with the policy message, and the social context of the school (Rigby, 2014). Similarly, VAMs have been shown to correlate with principal assessments of a teacher’s ability to raise test scores, but not with other aspects of teaching, making VAMs a narrow predictor of a teacher’s ability to do their job (Grissom, Loeb, & Doss, 2016). Moreover, the high-stakes nature of current evaluation policy may make it difficult for administrators to honestly assess their teachers, particularly when replacing a teacher may be difficult or when administrators feel like they lack the capacity to effectively evaluate in a high-stakes scenario. For instance, a study by Grissom and Loeb (2017) found that principals tended to evaluate more positively on higher stakes evaluations that on low stakes. This provides a possible explanation for why teachers still tend to be highly rated by administrators in the schools in this study and elsewhere. So, can evaluation be simultaneously a formative feedback experience and a summative high-stakes tool for human resource decisions? As far back as 1988, Popham referred to the “dysfunctional marriage” between the two concepts and Firestone (2014) outlined how those 173 concepts involved two competing theories of motivation that stymied progress. However, there are some ways in which this relationship could be improved under current policy. If the policy is to use evaluation for feedback, then some changes must be made to make current systems more effective. Critiques that evaluation instruments are too broad could be addressed by instead providing focused feedback on a few targets. Teachers, such as those in this dissertation, may perceive that it is unfair that administrators make judgements on areas of practice where an evidence-based recommendation for improvement cannot be provided. Henry and Guthrie (2016) explained that “in a system where everything is a priority, nothing is a priority” (p. 153). Teachers are unable to improve if they are unsure of what needs improving. Likewise, principals need the training to provide specific, actionable feedback that will be of use to teachers. Likewise, if teachers are to be evaluated using VAMs than the timeline for the return of scores and feedback should be shortened and shared with teachers in a way that would allow evaluators to make meaningful changes to their practice immediately. If the timeline for providing feedback cannot be tightened and/or if the feedback provided cannot be made to be more specific and useful, then observers need to be able provide feedback that will allow teachers to improve their practice and thus improve test scores (Henry & Guthrie, 2016). This would involve additional training for observers and the creation of professional opportunities for teachers to examine and interpret the data. Additionally, there are several research supported school-level supports which can be provided to help teachers use feedback including: relevant professional development opportunities, timely and specific feedback tied to effective teaching, and the influence of collegial relationships (Sun, Penuel. Frank, Gallagher, & Youngs, 2013). In this dissertation, teachers reported successful informal evaluation experiences when certain 174 conditions were met. For instance, teachers reported successful experiences when they had common time to work together on local assessments, whether developed at the district or department level. Overall, research has demonstrated that teachers working in more supportive environments are more likely to improve their effectiveness over time (Kraft & Papay, 2014). These opportunities for professional support need to be created in concert with the formal evaluations. There may also be promise in the inclusion of subject specific observation protocols. The use of generic instruments, such as the one in this study, assume that the same types of knowledge and practices are suitable across all grade levels and subject areas while simultaneously assuming that evaluators can assess instruction in areas where they do not have background (Youngs & Whittaker, 2016). Additionally, commercially available subject specific protocols tend to focus on lessons as opposed to other areas of teaching, such as the use of summative assessments and data analysis ability (Young & Whittaker, 2016). It may be that principal observations should be combined with other types of observations with different foci to gain a more well-rounded impression of teacher ability. In this dissertation, teachers who did not find validation in their principal’s assessments often cited other sources, such as curriculum coaches or colleagues who addressed lesson and classroom specific aspects of teaching, as testaments to personal skill and sources of valuable feedback. Similarly, if the evaluation policy is to attach high-stakes to teacher evaluation, then the bias that was a focus of critiques of previous locally developed systems could be addressed by designing a training system which utilizes a calibration technique and multiple observers (Youngs & Grissom, 2016). The use of multiple observers could include individuals who have expertise in the teacher’s subject area. Additionally, given criticisms of such uses in current 175 research, stronger evidence of the validity and reliability of current evaluation systems for making high-stakes human resource decisions should be presented so that both teachers and administrators can be more confident in the accuracy of ratings (Youngs & Grissom, 2016). This may allow principals to feel that they can give more honest critiques of teachers in observations rather than distributing high ratings across the workforce. The teachers who participated in this dissertation showed great distrust of the accuracy of both the observation and testing components of the evaluation. Such distrust could be mitigated by changes to the evaluation process. Limitations There are three main limitations to this study. First, this study is bound by the specific context in which the study schools are situated. While a variety of schools were deliberately sampled for this study, all of the schools are located in the same school system. The policy of interest is a state-level policy; however, it is unclear what, if any, influence district-level priorities and initiatives, or the physical location of the county examined here in proximity to other counties and states, may have had on the relationship between evaluation and practice. Additionally, the policy investigated here is unique to the state of North Carolina. While North Carolina serves as a model of many of the tenants espoused by the RttT application requirements and while many states have adopted legislation that is similar to North Carolina’s as a result of RttT, no other state will have the same policy history, concurrent policies, and cultural, social, and historical identities that North Carolina does. The context of North Carolina is one of higher stakes than other states where local unions may be stronger. For instance, there is great variability in how states implemented RttT inspired teacher evaluation policies and often districts are able to select local models. However, in North Carolina, teachers are evaluated under the same system and evaluation ratings are electronically 176 recorded at the state level, which may impact a teacher’s future career prospects or ability to be mobile across the state. Furthermore, the North Carolina evaluation model is a growth model and teachers are expected to exhibit continuing growth, which may result in unintended consequences such as initially receiving lower ratings or the inflation of ratings among more experienced teachers. Additionally, this study only examined high school teachers of two subject areas: Math and English. Therefore, the dissertation does not address other grade levels or subject areas which may have very different experiences and perspectives from high school Math and English teachers. While this study provides important information on the relationship between teacher evaluation and teacher practice in a state with a high-stakes, statewide teacher evaluation policy, it is unclear whether the results would be replicated elsewhere or under different circumstances or with different populations of teachers. A second limitation of this study is the assessment of differences in Evaluation Conditions and School Evaluation Effectiveness. An initial goal of this dissertation was to identify ways in which classroom practice was differentially impacted by sampling schools of various conditions and effectiveness levels. However, despite differences in scores and differences which emerged in the qualitative work, no statistically significant differences occurred between teachers at different schools on the survey measures. It may be that measuring evaluation conditions and effectiveness in a different way may have yielded different quantitative results. It is also possible that a finer grained analysis may have been necessary to discern differences in conditions. For instance, the Math and English teachers at Central had very different experiences with evaluation. So, examining conditions at a department level may have produced different results. It is also possible that any potential results were understated by the 177 small sample sizes utilized in this study. What is clear from this study is that evaluation policy, and particularly the observation component of evaluation, was implemented very differently across the four school contexts despite the rigorous standardization of the protocol at the state level. Finally, a second focus of this study was to discern if there were differences between Math and English teachers and the relationship between evaluation and practice. While differences were found statistically, the qualitative work revealed differences in how teachers of these subject areas were observed as well as issues with tests which were specific to each subject area. Therefore, I was unable to determine if the subject area differences were inherent characteristic of either Math or English teachers or instead a result of the unique conditions under which teachers of each subject were evaluated. Stronger conclusions may have been drawn from a larger sample, or at least, from a more even sample of teachers. No Math teachers from Riley agreed to be interviewed for this project, which the department chair stated was presumably because of the school’s new administrator’s increased focus on observation. Additionally, only one of the Math teachers at Phoenix was available for interview. This created unbalance in the sample as well as a lack of representation in the focal interviews for one entire Math department from a study school. This limitation does not mean that subject area does not matter, but rather that the context in which a teacher of a certain subject area is evaluated may matter more than the subject itself. Concluding Thoughts From a policy perspective, it is important to consider that contextual differences exist in schools and so formal evaluation may not always be a useful source of feedback to teachers and may not accurately reflect an individual’s teaching abilities. Currently, such evaluations are high- 178 stakes and are attached to teacher job retention policies. Evaluation results are also reported to the state-level and follow a teacher throughout their career, leading to the possibility that a teacher who has had inaccurate but poor evaluations may be negatively impacted in the future. Therefore, it is important to consider the potential benefits of using an imperfect evaluation instrument, whether that instrument is observation, student growth, or a combination of both, against potential individual-level costs. If evaluation is to serve as both a motivator to improve classroom practice and as a source of feedback to teachers, then certain conditions of the evaluation may need to be changed. Qualitative results suggest that ongoing formative feedback by an observer or by multiple observers who can identify what good teaching looks like in a context is more valuable and motivating to teachers than a summative assessment. Additionally, the high-stakes nature of current evaluation policy may drive teachers to engage in practices which may be detrimental to student learning. Moreover, these practices may actually be rewarded under current evaluation systems. Additionally, such detrimental practices may be amplified in teachers who undergo more frequent evaluations without career protections. This concern is particularly relevant for North Carolina, as the legislation which is current at the time of writing is effectively phasing out teacher career status for all teachers. The assumptions behind teacher evaluation policy requires the policy to be both highstakes and be used to weed-out low performing teachers while simultaneously providing the type of feedback that can support a teacher’s development on a growth model evaluation. Firestone (2014) had argued that the success of the type of evaluation system seen in North Carolina, which focuses on the use of both external and internal motivating factors, is stymied by the inherent conflicts between the two theories of motivation. The results of this dissertation support 179 this hypothesis, though there are several areas where the system could be improved to better accommodate both goals. The policy examined in this study is problematic because it tasks administrators with conducting high-stakes evaluations and providing formative feedback to all teachers up to four times a year. Yet, principals often lack the training and time resources to evaluate teachers in a high-stakes manner and to simultaneously provide constructive feedback to allow for systematic improvements. To accomplish a better balance, evaluation would need to be lower stakes, more formative, and focus on all teachers, not just a concentration of newer teachers. At the time of this writing, there is a gap in the literature on how formal teacher evaluation policy is related to classroom practice. This is an important question to consider because evaluation, by definition, defines what is valued in whatever is being appraised. Additionally, such policies are touted by policymakers as being necessary to motivate teachers to do better jobs and to provide feedback for them to do so. Therefore, it is important to consider whether or not the formal policies do motivate and provide feedback to teachers and if such policies do these things than to consider in what ways teacher practice changes as a result? While questions around the evaluation and practice relationship could benefit from future work using larger sample sizes and perhaps spanning additional levels of schooling and different subject areas, this dissertation begins to answer important questions around evaluation and practice as related to the study context. Such information is useful when weighing the costs and benefits of high-stakes teacher evaluation policies. 180 APPENDIX 181 APPENDIX A: Survey Instrument Part I: Demographic Questions (Short Answer) 1.) Including this year, how many years have you been teaching? 2.) Including this year, how many years have you been teaching in North Carolina? 3.) Including this year, how many years have you been teaching at this school? 4.) What is your certification level? Provisional, Professional/Career, Other 5.) What subjects are you certified in? 6.) What grades are you certified to teach? 7.) Have you ever taught a course that was assessed by an End of Course (EOC) or End of Grade (EOG) exam? 8.) Did you teach English II, Math I, or Biology last year? 9.) Are you teaching English II, Math I, or Biology this year? 10.) Indicate your level of agreement with the following questions about the conditions in this school: (Scale: Strongly Disagree, Disagree, Agree, Strongly Agree, Don’t Know) a. Teachers are held to high professional standards for delivering instruction. b. Teacher performance is assessed objectively c. Teachers receive feedback that can help them improve teaching. d. The procedures for teacher evaluation are consistent e. State assessment data are available in time to impact instructional practices. f. Local assessment data are available in time to impact instructional practices. g. Teachers use assessment data to inform their instruction. h. State assessment data are available in time to impact instructional practices. 182 i. State assessments provide schools with data that can help improve teaching. j. State assessments accurately gauge students’ understanding of standards. Part II: Prior Year Indicate your level of agreement with the following questions about your practices in the classroom from the following year (2015-2016) and (B) current year (2016-2017). (Scale: Strongly Disagree, Disagree, Neither Agree or Disagree, Agree, Strongly Agree, Not Applicable) 11.) Last year, I modified classroom practice in anticipation of an upcoming evaluation. 12.) Last year, I modified classroom practice using feedback from my evaluation. 13.) Last year, I was concerned that my evaluation results could impact future employment. 14.) Last year, I was concerned that my evaluation may label me as a bad teacher. 15.) Last year, I was concerned that my evaluation does not accurately reflect my competency as a teacher. 16.) Last year, I chose curriculum based on what I will be evaluated on. 17.) Last year, I chose teaching strategies based on what I will be evaluated on. 18.) Last year, I directed focus on certain students based on what I will be evaluated on. 19.) Last year, I used test data to modify classroom practice. 20.) Last year, I used observation feedback to modify classroom practice. 21.) I felt I was evaluated fairly in the previous school year. 22.) Last year’s evaluation will impact my decisions about classroom practice in the new school year. 183 Part III: Current Year Indicate your level of agreement with the following questions about your practices in the classroom from the current year (2016-2017). (Scale: Strongly Disagree, Disagree, Neither Agree or Disagree, Agree, Strongly Agree, Not Applicable) 23.) This year, I will modify classroom practice in anticipation of an upcoming evaluation. 24.) This year, I will modify classroom practice using feedback from my evaluation. 25.) This year, I am concerned that my evaluation results could impact future employment. 26.) This year, I am concerned that my evaluation may label me as a bad teacher. 27.) This year, I am concerned that my evaluation does not accurately reflect my competency as a teacher. 28.) This year, I will choose curriculum based on what I will be evaluated on. 29.) This year, I will choose teaching strategies based on what I will be evaluated on. 30.) This year, I will direct focus on certain students based on what I will be evaluated on. 31.) This year, I will use test data to modify classroom practice. 32.) I will use observation feedback to modify classroom practice. 33.) I feel I will be fairly during this school year. 184 REFERENCES 185 REFERENCES Allen, J. P., Pianta, R. C., Gregory, A., Mikami, A. Y., & Lun, J. (2011). An interaction-based approach to enhancing secondary school instruction and student achievement. Science, 333(6045), 1034–1037. Anagnostopoulos, D., Rutledge, S. R., & Jacobsen, R. (2013a). Mapping the information infrastructure of accountability. In D. Anagnostopoulos, S. Rutledge, & R. Jacobsen (Eds.), The infrastructure of accountability (pp. 1-21). Cambridge, MA: Harvard University Press. Anagnostopoulos, D., Rutledge, S. R., & Jacobsen, R. (2013b). The infrastructure of accountability: Tensions, implications, and concluding thoughts. In D. Anagnostopoulos, S. R. Rutledge, & R. Jacobsen (Eds.), The infrastructure of accountability (pp. 213-228). Cambridge, MA: Harvard University Press. Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., & Shepard, L. A. (2010). Problems with the use of student test scores to evaluate teachers. Washington, DC: Economic Policy Institute. Ball, D. L., & Bass, H. (2000). Interweaving content and pedagogy in teaching and learning to teach: Knowing and using mathematics in J. Boaler (Ed.), Multiple perspectives on the teaching and learning of mathematics (pp. 83-104). Westport, CT: Ablex. Ballou, D., & Podgursky, M. (1998). Teacher recruitment and retention in public and private schools. Journal of Policy Analysis and Management, 17(3), 393-417. Bandura, A. (1997). Self-efficacy in changing societies. New York, NY: Cambridge University Press. Barnes, C. R. (2011). “Race to the Top” only benefits big government. Journal of Law and Education, 40(2), 393-402. Bell, C., Drake, C., Wilson, M., Frasier, A., Qi, Y., McCaffrey, D., Lockwood, J. R., & Kim, J. (2015). Subject specific and general observation protocols as tools for the evaluation and improvement of teaching. Paper session presented at the meeting of the American Education Research Association. Chicago, IL. Berg, B. L. (2007). Qualitative research methods for the social sciences (6th Edition). San Francisco, CA: Pearson Education. Bill & Melinda Gates Foundation. (2010). Working with teachers to develop fair and reliable measures of effective teaching. Seattle, WA: Author. Retrieved from http://www.metproject.org/downloads/met-framing-paper.pdf 186 Bill & Melinda Gates Foundation. (2013). Ensuring fair and reliable measures of effective teaching: Culminating findings from the MET Project’s three-year study. Seattle, WA: Author. Black, P., & William, D. (2009). Developing the theory of formative assessment. Educational Assessment, Evaluation, and Accountability, 21(1), 5-31. Booher-Jennings, J. (2005). Below the bubble: "Educational Triage" and the Texas accountability system. American Educational Research Journal, 42(2), 231-268. Boyd, D. J., Grossman, P., Lankford, H., Loeb, S., Wyckoff, J. (2006). How changes in entry requirements alter the teacher workforce and affect student achievement. Education Finance and Policy, 1(2), 176–216. Boyd, D. J., Grossman, P. L., Lankford, H., Loeb, S., & Wyckoff, J. (2009). Teacher preparation and student achievement. Educational Evaluation and Policy Analysis, 31(4), 416-440. Boyd, D., Lankford, H., Loeb, S., Rockoff, J., & Wyckoff, J. (2008). The narrowing gap in New York City teacher qualifications and its implications for student achievement in highpoverty schools. Journal of Policy Analysis and Management, 27, 793–818. Bryk, A., Sebring, P. B., Allensworth, E., Luppescu, S., & Easton, J. (2010). Organizing schools for improvement: Lessons from Chicago. Chicago, IL: University of Chicago Press. Burch, P., & Spillane, J. P. (2003). Elementary school leadership strategies and subject matter: Reforming mathematics and literacy instruction. The Elementary School Journal, 103(5), 519-535. Carnoy, M., & Loeb, S. (2002). Does external accountability affect student outcomes? A crossstate analysis. Educational Evaluation and Policy Analysis, 24(4), 305-331. Chetty, R., Friedman, J. N., & Rockoff, J. E. (2011). The long-term impacts of teachers: Teacher value-added and student outcomes in adulthood. NBER Working Paper Series. Working Paper. National Bureau of Economic Research. Cambridge, MA. Retrieved from http://obs.rc.fas.harvard.edu/chetty/value_added.pdf Clotfelter, C. T., Ladd, H. F., & Vigdor, J. L. (2007). Teacher credentials and student achievement: Longitudinal analysis with student fixed effect. Economics of Education Review, 26, 673–682. Clotfelter, C. T., Ladd, H. F., Vigdor, J. L., Diaz, R. A. (2004). Do school accountability systems make it more difficult for low-performing schools to attract and retain high-quality teachers? Journal of Policy Analysis and Management, 23(2), 251-271. Coburn, C. E. (2004). Beyond decoupling: Rethinking the relationship between the institutional environment and the classroom. Sociology of Education, 77(3), 211-244. 187 Cohen, D. K. (1990). A revolution in one classroom: The case of Mrs. Oublier. Educational Evaluation and Policy Analysis, 12(3), 311-329. Cohen, D. K., Raudenbusch, S. W., & Ball, D. L. (2003). Resources, instruction, and research. Educational Evaluation and Policy Analysis, 25(2), 119-142. Cohen, D. K. (2011). Teaching and its predicaments. Cambridge, MA: Harvard University Press. Cohen-Vogel, L. (2011). Staffing to the test: Are today's school personnel practices evidencebased? Educational Evaluation and Policy Analysis, 33(4), 483- 505. Conley, D. T., Drummond, K. V., Gonzalez, A., Seburn, M., Stout, O., and Rooseboom, J. (2011). Lining up: The relationship between the Common Core State Standards and five sets of comparison standards. Educational Policy Improvement Center. Retrieved from www.epiconline.org Corcoran, S. P. (2010) Can Teachers be Measured by Test Scores? Should They Be? The Use of Value-Added Measures of Teacher Effectiveness in Policy and Practice. Annenberg Institute for School Reform. Cusick, P. (1983). The egalitarian ideal and the American high school. New York, NY: Longman. Darling-Hammond, L., Holtzman, D. J., Gatlin, S. J., & Heilig, J. V. (2005). Does teacher preparation matter? Evidence about teacher certification, Teach for America, and teacher effectiveness. Education Policy Analysis Archives, 13(42). Deci, E. L. & Ryan, R. M. (1996). Need satisfaction and the self-regulation of learning. Learning & Individual Differences, 8(3), 165-184. Deyhle, D. L., Hess, G. A., & LeCompte, M. D. (1992). Approaching ethical issues for qualitative researchers in education. In M. LeCompte, W. Millroy, & J. Preissle (Eds.), Handbook of qualitative research in education (pp. 598–641). San Diego: Academic Press. Derrington, M. L., & Campbell, J. W. (2013). The changing conditions of instructional leadership: Principals' perceptions of teacher evaluation accountability measures. In B. Barnett, A. R. Shoho, & A. J. Bowers (Eds.), School and district leadership in an era of accountability (pp. 231-251). Charles, NC: Information Age Publishing. Educator Effectiveness Database. (2015). Information on educator effectiveness data. Retrieved from http://apps.schools.nc.gov/ords/f?p=155:1 Erpenbach, W. J. (2014). A study of states’ requests for waivers from requirements of the No Child Left Behind Act of 2001: New developments in 2013-2014. Washington, DC: Council of Chief State School Officers. 188 Figlio, D. N. (2006). Testing, crime and punishment. Journal of Public Economics, 90(4), 837851. Figlio, D. N., & Getzler, L. S. (2002). Accountability, ability and disability: Gaming the system (No. w9307). National Bureau of Economic Research. Figlio, D. N., & Winicki, J. (2005). Food for thought: the effects of school accountability plans on school nutrition. Journal of Public Economics, 89(2), 381-394. Firestone, W. A. (2014). Teacher evaluation policy and conflicting theories of motivation. Educational Researcher. Firestone, W. A. & Pennell, J. R. (1993). Teacher commitment, working conditions, and differential incentive policies. Review of Educational Research, 63(4), 489-529. Firestone, W. A., & Rosenblum, S. (1988). Building commitment in urban high schools. Educational Evaluation and Policy Analysis, 23(4), 285-300. Floden, R. E., Porter, A. C., Schmidt, W. H., & Freeman, D. J. (1980). Don’t they all measure the same thing?: Consequences of standardized test selection. In E. L. Baker and E. S. Quellmalz (Eds.) Educational testing and evaluation, design, analysis, and policy (pp. 109-120). Beverly Hills, CA: Sage. Garda, R. A., & Doty, D. S. (2013). The legal impact of emerging governance models on public education and its office holders. Urban Lawyer, 45(21). Garet, M. S., Porter, A. C., Desimone, L., Birman, B. F., & Yoon, K. S. (2001). What makes professional development effective? Results from a national sample of teachers. American Educational Research Journal, 38(4), 915-946. Giroux, H. (1985). Teachers as transformatory intellectuals. Critical Educators, 46-49. Glazerman, S., Goldhaber, D., Loeb, S., Raudenbusch, S., Staiger, D., & Whitehurst, G. J. (2011). Passing muster: Evaluating teacher evaluation systems. Washington, DC: The Brookings Brown Center Task Group on Teacher Quality. Glesne, C. (2006). Becoming qualitative researchers: An introduction (3rd ed.). Boston: Pearson. Goldhaber, D. D., & Brewer, D. J. (1997). Why don't schools and teachers seem to matter? Assessing the impact of unobservables on educational productivity. Journal of Human Resources, Summer, 505-523. Goldhaber, D., & Hansen, M. (2010). Implicit measurement of teacher quality: Using performance on the job to inform teacher tenure decisions. American Economic Review, 100(2), 250-255. 189 Goldhaber, D., Goldschimdt, P., & Tseng, F. (2013). Teacher value-added at the high school level: Different models, different answers? Education Evaluation and Policy Analysis. Retrieved form http://epa.sagepub.com/content/early/2013/01/15/0162373712466938 Goldring, E., Grissom, J. A., Rubin, M., Neumerski, C. M., Cannata, M., Drake, T., & Schuermann, P. (2015). Make room value added: Principals’ human capital decisions and the emergence of teacher observation data. Educational Researcher, 44(2), 96-104. Grissom, J. A., Kalogrides, D., & Loeb, S. (2012). Strategic Staffing: Examining the Class Assignments of Teachers and Students in Tested and Untested Grades and Subjects. Paper presented at the annual meeting of the Association of Education Finance and Policy. Grissom, J. A. & Loeb, S. (2017). Assessing principals’ assessments: Subjective evaluations of teacher effectiveness in low- and high-stakes environments. Education Finance and Policy, 12(3), 369-395. Grissom, J. A. & Youngs, P. (2015). Improving teacher evaluation systems: Making the most of multiple measures. New York, NY: Teachers College Press. Grissom, J. A., Loeb, S., & Doss, C. (2016). The multiple dimensions of teacher quality: Does value-added capture teachers’ nonachievement contributions to their schools? In Grissom, J. A. & Youngs, P. (Eds.), Improving teacher evaluation systems: Making the most of multiple measures. (pp. 37-50). New York, NY: Teachers College Press. Hackman, J. R., & Oldham, G. R. (1980). Word redesign. Reading, MA: Addison-Wesley. Halverson, R., & Clifford, M. (2006). Evaluation in the wild: A distributed cognitive perspective on teacher assessment. Educational Administration Quarterly, 42(4), 578-619. Hanushek, E. A., & Rivkin, S. G. (2010) Generalizations about using value-added measures of teacher quality. American Economic Review, 100(2), 267-271. Harris, D. N. (2009). Would accountability based on teacher value added be smart policy? An examination of the statistical properties and policy alternatives. Education Finance and Policy, 4(4), 319-350. Harris, D. N., Ingle, W. K., & Rutledge, S. A. (2014). How teacher evaluation methods matters for accountability: A comparative analysis of teacher effectiveness ratings by principals and teacher value-added measures. American Educational Research Journal, 5(1), 73– 112. Hart, A. W., & Murphy, M. J. (1990). New teachers react to redesigned teacher work. American Journal of Education, 98, 224-250. Haycock, K. (1998) Good teaching matters. Educational Trust: Thinking K-16. Retrieved from 190 http://edtrust.org/sites/edtrust.org/files/public_actions/files/k16_summer98.pdf Henry, G. T. & Guthrie, J. E. (2016). Using multiple measures for developmental teacher evaluation. In Grissom, J. A. & Youngs, P. (Eds.), Improving teacher evaluation systems: Making the most of multiple measures. (pp. 143-155). New York, NY: Teachers College Press. Hill, H. C., Kapula, L. & Umland, K. (2011). A validity argument approach to evaluating teacher value-added scores. American Educational Research Journal, 48(3), 794-831. Ingersoll, R. M. (2001). Teacher turnover and teacher shortages: An organizational analysis. American Educational Research Journal, 38(3), 499-534. Ingersoll, R. M., & May, H. (2012). The magnitude, destinations, and determinants of mathematics and science teacher turnover. Educations Evaluation and Policy Analysis, 34(4), 435-464. Jacob, B. A., & Levitt, S. D. (2003). Rotten apples: An investigation of the prevalence and predictors of teacher cheating (No. w9413). National Bureau of Economic Research. Johnson, R. B., & Onwuegbuzie, A. J. (2004). Mixed methods research: A research paradigm whose time has come. Educational Researcher, 33(7), 14-26. Johnson, S. M., & Birkland, S. E. (2003). Pursuing a “sense of success”: New teachers explain their career decisions. American Educational Research Journal, 40(3), 581-617. Kennedy, M. M. (2005). Inside teaching: How classroom life undermines reform. Cambridge, MA: Harvard University Press. Kraft, M. A. & Gilmour, A. (2016). Can principals promote teacher development as evaluators? A case study of principals’ views and experiences. Educational Administration Quarterly, 52, 711-753. Kraft, M. A. & Gilmour, A. (2017). Revisiting the Widget Effect: Teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46(5), 234-249. Kushman, J. W. (1992). The organizational dynamics of teacher workplace commitment: A study of urban elementary and middle schools. Educational Administration Quarterly, 28(1), 5-42. Ladd, H. F., & Zelli, A. (2002). School-based accountability in North Carolina: The responses of school principals. Educational Administration Quarterly, 38(4), 494-529. Lipsky, M. (2010). Street-level bureaucracy: Dilemmas of the individual in public service (30th anniversary expanded ed.). New York, NY: Russell Sage Foundation. 191 Loeb, S., Kalogrides, D., & Beteille, T. (2012). Effective schools: Teacher hiring, assignment, development, and retention. Education Finance and Policy, 7, 269–304. Louis, K., Dretzke, B., & Wahlstrom, K. (2010). How does leadership affect student achievement? Results from a national US survey. School Effectiveness and School Improvement, 21, 315-336. Mann, H. (1868). Life and works of Horace Mann (Vol. 3). Walker, Fuller and Company. McCaffrey, D. F., Lockwood, J. R., Koretz, D. M. & Hamilton, L. S. (2003). Evaluating valueadded models for teacher accountability. Santa Monica, CA: RAND. McDonell, L. M. (2005). No Child Left Behind and the federal role in education: Evolution or revolution? Peabody Journal of Education, 80(2), 19–38. Miles, M. B., Huberman, A. M., & Saldana, J. (2014). Qualitative data analysis: A method sourcebook. CA, US: Sage Publications. Milgrom, P. R., & Roberts, J. D. (1992). Economics, organization and management. Englewood Cliffs, NJ: Prentice-Hall. Mintrop, H., & Sunderman, G. L. (2013). The paradoxes of data-driven school reform: Learning from two generations of centralized accountability systems in the United States. In D. Anagnostopoulos, S. R. Rutledge, & R. Jacobsen (Eds.), The infrastructure of accountability. (pp. 23-40). Cambridge, MA: Harvard University Press. Neal, D., & Schanzenbach, D. W. (2010). Left behind by design: Proficiency counts and testbased accountability. The Review of Economics and Statistics, 92(2), 263-283. Newman, A. (2013). Realizing educational rights: Advancing school reform through courts and communities. Chicago, IL: University of Chicago Press. North Carolina Board of Education. (2012). ESEA flexibility request. Retrieved from http://www2.ed.gov/policy/eseaflex/approved-requests/nc.pdf North Carolina Board of Education. (2012). 16 NCAC 6C .0504. policy on standards and criteria for evaluation of professional school employees. Retrieved from http://sbepolicy.dpi.state.nc.us/policies/TCP-C-006.asp North Carolina State Board of Education. (2012). 115C-333; N.C. Constitution, Article IX, Sec.5. Teacher evaluation process. Retrieved from http://ncrules.state.nc.us/ncac/title%2016%20%20education/chapter%2006%20%20 elementary%20and%20secondary%20education/subchapter%20c/16%20ncac% 2006c%20.05 03.pdf North Carolina Department of Public Instruction. North Carolina Educator Effectiveness Data. 192 Retrieved from http://apps.schools.nc.gov/pls/apex/f?p=155:1 North Carolina Teacher Working Condition Survey. (2017). Reports from TWC 2016. Retrieved from https://ncteachingconditions.org/results Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004). How large are teacher effects? Educational Evaluation and Policy Analysis, 26(3), 237-257. Oliver, C. (1991). Strategic responses to institutional processes. Academy of Management Review, 16(1), 145-179. Popham, W. J. (1988). The dysfunctional marriage of formative and summative teacher evaluation. Journal of Personnel Evaluation in Education, 1, 269-273. Porter, A., McMaken, J., Hwang, J. & Yang, R. (2011). Common Core Standards: The new U.S. intended curriculum. Educational Researcher 40(103). 103-116. Porter, A. C., Polikoff, M. S., & Smithson, J. (2009). Is there a defacto national intended curriculum? Evidence from state content standards. Educational Evaluation and Policy Analysis. 31(3). 238-268. Powell, A., Farrar, E., & Cohen, D. (1985). The shopping mall high school. Boston, MA: Houghton Mifflin. Powell, E. (2013). The quest for teacher quality: Early lessons from Race to the Top and state legislative efforts regarding teacher evaluation. DePaul Law Review 62(1061). Raudenbusch, S. W., & Jean, M. (2012). How should educators interpret value-added scores? Retrieved from htpp://carnegieknowledgenetwork.org/briefs/value-added/interpretingvalue-added Reinhorn, S. K., Johnson, S. M. & Simon, N. S. (2017) Investing in development: Six highperforming, high-poverty schools implement the Massachusetts Teacher Evaluation Policy. Educational Evaluation and Policy Analysis, 39(3), 383-406. Rigby, J. C. (2015). Principals’ sensemaking and enactment of teacher evaluation. Journal of Educational Administration, 53, 374–392. Rockoff, J. E. (2004). The impact of individual teachers on student achievement: Evidence from panel data. American Economic Review, 94, 247–25. Rockoff, J. E., Jacob, B. A., Kane, T. J., & Staiger, D. O. (2011). Can you recognize an effective teacher when you recruit one? Education Finance and Policy, 6, 43–74. Rothstein, J., & Mathis, W. J. (2013). Review of have we identified effective teachers? AND A composite estimator of effective teaching: Culminating findings from the Measures of 193 Effective Teaching Project. Retrieved from http://nepc.colorado.edu/thinktank/reviewMET-final-2013 Rowan, B., Correnti, R., & Miller, R. J. (2006). What large-scale survey research tells us about teacher effects on student achievement: Insights from the prospects study of elementary schools. Teachers College Record, 104(8), 1525-1567. Sanders, W. L., & Horn, S. (1994). The Tennessee Value-Added Assessment System (TVAAS): Mixed-model methodology in educational assessment. Journal of Personnel Evaluation in Education, 8(3), 299-311. Sanders, W. L., & Rivers, J. C. (1996). Cumulative and residual effects of teachers on future student academic achievement. Knoxville, TN: University of Tennessee Value-Added Research and Assessment Center. Sawchuk, S. (2013, February 5). Teachers’ ratings still high despite new measures. Retrieved from www.edweek.org /ew/articles/2013/02/06/20evaluate_ep.h32.html Schacter, J., & Thum, Y. M. (2005) TAPing into high quality teachers: preliminary results from the Teacher Advancement Program Comprehensive School Reform. School Effectiveness and School Improvement 16(327), 327-353. Spillane, J. P., & Kenney, A. W. (2012). School administration in a changing education sector: The US experience. Journal of Educational Administration, 50, 541-561. Stake, R. (2004). Qualitative case study. In N. K. Denzin & Y. S. Lincoln (Eds.), The Sage handbook of qualitative research (3rd ed.). (pp. 443–466). Thousand Oaks, CA: Sage. Steinberg, M. P. & Donaldson, M. L. (2016). The new educational accountability: Understanding the landscape of teacher evaluation in the post-NCLB era. Education Finance and Policy, 11, 340–359. Sun, M., Mutcheson, R. B., & Kim, J. (2016). Teachers’ use of evaluation for instructional improvement and school supports for such use. In Grissom, J. A. & Youngs, P. (Eds.), Improving teacher evaluation systems: Making the most of multiple measures. (pp. 102115). New York, NY: Teachers College Press. Sun, M., Penuel, W. R., Frank, K. A., Gallagher, H. A., & Youngs, P. (2013). Shaping professional development to promote the diffusion of instructional expertise among teachers. Educational Evaluation and Policy Analysis, 35(3), 344-369. Taylor, E. S., & Tyler, J. H. (2011). The effect of evaluation on performance: Evidence from longitudinal student achievement data of mid-career teachers(No. w16877). National Bureau of Economic Research. Thorn, C., & Harris, D. N. (2013). The accidental revolution: Teacher accountability, value194 added, and the shifting balance of power in the American school system. In Anagnostopoulos, D. Rutledge, S. R., & Jacobsen, R. (Eds.), The infrastructure of accountability. (pp. 57-74). Cambridge, MA: Harvard University Press. Tschannen-Moran, M., Woolfolk Hoy, A., & Hoy, W. K. (1998). Teacher efficacy: Its meaning and measure. Review of Educational Research, 68(2), 202-248. Tyack, D. B., & Cuban, L. (1995). Tinkering toward utopia. Cambridge, MA: Harvard University Press. Umpstead, R. R., & Kirby, E. (2012). Reauthorization revisited: Framing the recommendations for the Elementary and Secondary Education Act’s Reauthorization in light of No Child Left Behind’s implementation challenges. West’s Education Law Reporter 276(1). U.S. Department of Education. (2009). Race to the Top Program executive summary. Retrieved from www2.ed.gov/programs/racetothetop/executive-summary.pdf Vinovskis, M. (2009). From a nation at risk to no child left behind: National education goals and the creation of federal education policy. New York: Teachers College Press, Columbia University. Wayne, A. J., & Youngs, P. (2003). Teacher characteristics and student achievement gains: A review. Review of Educational Research, 73, 89–122. Weisberg, D., Sexton, S., Mulhern, J., Keeling, D. (2009). The widget effect: Our national failure to acknowledge and act on differences in teacher effectiveness. The New Teacher Project. http://widgeteffect.org/downloads/TheWidgetEffect.pdf 2009 Yin, R. K. (2009). Doing case study research (4th Ed.), Thousand Oaks, CA: Sage Publications. Youngs, P. & Grissom, J. A. (2016). Multiple measures in teacher evaluation: Lessons learned and guidelines for practice. In Grissom, J. A. & Youngs, P. (Eds.), Improving teacher evaluation systems: Making the most of multiple measures. (pp. 169-184). New York, NY: Teachers College Press. Youngs, P. & Whittaker, A. (2016). The role of edTPA in assessing content-specific instructional practices. In Grissom, J. A. & Youngs, P. (Eds.), Improving teacher evaluation systems: Making the most of multiple measures. (pp. 37-50). New York, NY: Teachers College Press. Yurkofsky, M. (2016). Unpacking the panacea: Exploring how field-level pressures and inhabited actors influence the course of competition. Paper session presented at the meeting of the American Education Research Association. Washington, D.C. 195