PROJECT-BASED SCIENCE LEARNING FACILITATED THROUGH TECHNOLOGY By Sarah Maestrales A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Measurement and Quantitative Methods - Doctor of Philosophy 2024 ABSTRACT This dissertation focuses on three manuscripts all related to bolstering science achievement through recommendations from the National Research Council (NRC) regarding the teaching and learning of science. The manuscripts address meeting the NRC’s call to incorporate curriculum and assessment that lead to more in-depth knowledge that can be transferred across domains, and the use of technology to facilitate teaching and learning. The first manuscript, “Improving Science Achievement – Is It Possible? Evaluating the Efficacy of a High School Chemistry and Physics Project-Based Learning Intervention,” describes the process of developing intensive project-based curriculum and assessment materials for high school chemistry and physics classrooms. This study answers the question of what impact that curriculum has on students’ future science achievement and academic ambition. The second manuscript, “Using Machine Learning to Score Multi-Dimensional Assessments of Chemistry and Physics,” focuses on the use of a supervised machine learning approach to facilitate the scoring of science assessments. The goal of this study was to determine whether automating the process of classifying these responses could reduce the burden placed on teachers in scoring assessments that effectively measure the dimensions of learning spelled out by the NRC. The third study, “U.S. and Finnish high school engagement during the Covid-19 Pandemic,” then explores student engagement with the use of technologies that facilitate remote instruction. This thesis is dedicated to my daughter Callista. Thank you for your love and understanding. iii ACKNOWLEDGEMENTS This work would not have been possible without the support of my committee members: Dr. Barbara L. Schneider, Dr. Kenneth Frank, Dr. Kimberly Kelly, and Dr. Joseph Krajcik. And also thank you to Dr. Xiaoming Zhai for his continued support and guidance. This material is based upon work supported by the National Science Foundation under Grant No. OISE: 1545684. iv TABLE OF CONTENTS CHAPTER 1: INTRODUCTION AND LITERATURE REVIEW ............................................... 1 Bolstering an International STEM Workforce............................................................................ 1 The Crafting Engagement in Science Environments Study........................................................ 7 REFERENCES ......................................................................................................................... 13 CHAPTER 2: IMPROVING SCIENCE ACHIEVEMENT—IS IT POSSIBLE? EVALUATING THE EFFICACY OF A HIGH SCHOOL CHEMISTRY AND PHYSICS PROJECT-BASED LEARNING INTERVENTION.................................................................................................... 16 Abstract ..................................................................................................................................... 16 Introduction and Literature Review .......................................................................................... 17 The Intervention: Crafting Engaging Science Environments ................................................... 
18 Method ...................................................................................................................................... 27 Results ....................................................................................................................................... 36 Discussion ................................................................................................................................. 39 Notes ......................................................................................................................................... 43 REFERENCES ......................................................................................................................... 46 APPENDIX 2.A BALANCE TABLES BETWEEN THE TREATMENT AND CONTROL SCHOOLS ................................................................................................................................ 49 APPENDIX 2.B TEACHER EXIT SURVEY ITEMS DEALING WITH THE TEACHER UNITS, PRACTICES, AND CURRICULUM: ........................................................................ 53 APPENDIX 2.C RACE HETEROGENEITY MODEL ........................................................... 55 APPENDIX 2.D MEDIATION MODEL ................................................................................. 56 APPENDIX 2.E EDUCATION AMBITION MODEL ............................................................ 57 APPENDIX 2.F FULL TREATMENT EFFECTS ESTIMATES............................................ 58 APPENDIX 2.G FULL HETEROGENEITY RESULTS ........................................................ 59 APPENDIX 2.H COLLEGE AMBITION FULL RESULTS .................................................. 60 APPENDIX 2.I FULL MEDIATION RESULTS .................................................................... 61 APPENDIX 2.J ITEM EQUIVALENCE FOR THE CHEMISTRY AND PHYSICS SUMMATIVE ASSESSMENTS.............................................................................................. 62 CHAPTER 3: USING MACHINE LEARNING TO SCORE MULTIDIMENSIONAL ASSESSMENTS OF CHEMISTY AND PHYSICS .................................................................... 64 Abstract ..................................................................................................................................... 64 Introduction and Literature Review .......................................................................................... 65 Methods..................................................................................................................................... 71 Results ....................................................................................................................................... 81 Discussion ................................................................................................................................. 90 REFERENCES ......................................................................................................................... 96 APPENDIX 3.A ITEM 1: EXPERIMENTAL DESIGN.......................................................... 99 APPENDIX 3.B ITEM 2: RELATIVE MOTION.................................................................. 101 APPENDIX 3.C ITEM 3: PROPERTIES OF SOLUTIONS ................................................. 102 APPENDIX 3.D ITEM 4: STATES OF MATTER ................................................................ 103 CHAPTER 4: U.S. 
AND FINNISH HIGH SCHOOL SCIENCE ENGAGEMENT DURING THE COVID-19 PANDEMIC .................................................................................................... 104 v Abstract ................................................................................................................................... 104 Introduction and Literature Review ........................................................................................ 105 Methods................................................................................................................................... 113 Results ..................................................................................................................................... 119 Discussion ............................................................................................................................... 129 REFERENCES ....................................................................................................................... 134 CHAPTER 5: DISCUSSION AND CONCLUSION ................................................................. 137 Contributing to the Landscape of Science Education Research ............................................. 137 Connecting the Pieces ............................................................................................................. 141 Conclusion .............................................................................................................................. 145 REFERENCES ....................................................................................................................... 147 vi CHAPTER 1: INTRODUCTION AND LITERATURE REVIEW This dissertation focuses on three manuscripts all related to bolstering science achievement through recommendations from the National Research Council (NRC) regarding the teaching and learning of science. The manuscripts address meeting the NRC’s call to incorporate curriculum and assessment that lead to more in-depth knowledge that can be transferred across domains, and the use of technology to facilitate teaching and learning. The first manuscript, “Improving Science Achievement – Is It Possible? Evaluating the Efficacy of a High School Chemistry and Physics Project-Based Learning Intervention,” describes the process of developing intensive project-based curriculum and assessment materials for high school chemistry and physics classrooms. This study answers the question of what impact that curriculum has on students’ future science achievement and academic ambition. The second manuscript, “Using Machine Learning to Score Multi-Dimensional Assessments of Chemistry and Physics,” focuses on the use of a supervised machine learning approach to facilitate the scoring of science assessments. The goal of this study was to determine whether automating the process of classifying these responses could reduce the burden placed on teachers in scoring assessments that effectively measure the dimensions of learning spelled out by the NRC. The third study, “U.S. and Finnish high school engagement during the Covid-19 Pandemic,” then explores student engagement with the use of technologies that facilitate remote instruction. Bolstering an International STEM Workforce According to a 2015 report from the National Science Board (NSB), knowledge and skills related to science, technology, engineering, and mathematics (STEM) are becoming more important to a wider range of workers. 
Improving science learning is a topic for researchers around the world. International studies such as the Programme for International Student Assessment and the Trends in International Mathematics and Science Study aim to measure students' achievement in science and mathematics. While achievement is important, research suggests that STEM students need a more robust understanding of science in order to meet the demands of a rapidly changing, technology-driven workforce. A report on graduate education by Wendler et al. (2010) goes beyond the discussion of science knowledge and achievement to make the claim that it is the innovative applications of that knowledge that will drive future economic prosperity. The National Research Council supports this claim and spells out A Framework for K-12 Science Education that should provide students with skills that are used in investigation and scientific reasoning but can also be applied across disciplines and in everyday life (NRC, 2012b).

STEM Ambition

The current number of students who are motivated to pursue STEM careers is too low to meet the demand for STEM professionals in the United States (NRC, 2012b). For students to one day join the STEM workforce, they must first have some ambition to pursue a career in the sciences. A 2018 manuscript by Vincent-Ruz and Schunn determined that there is a psychometric distinction between science identity and other constructs related to science attitudes and that science identity was an equal or stronger predictor of students' participation in optional science-related activities than other predictors. Early career aspirations are a strong predictor of later learning outcomes. One study found that students who expected a career in the sciences by eighth grade were 3.4 times more likely to earn a baccalaureate degree in the physical sciences or engineering (Tai et al., 2006). Fortunately, it appears that student pathways through the STEM pipeline are not fixed. A positive change in science identity, even occurring after students are already enrolled in a non-STEM major at university, can significantly improve the odds of a student completing a STEM degree (Ma & Xiao, 2021). Flowers III and Banda (2016) argue that the key to a diverse STEM workforce lies in fostering students' science identities. This may be a matter of simply providing students with the opportunity to understand what it means to be a scientist and to develop their identities as scientists. In the Framework, the NRC places a strong focus on problem solving, design, and project-based experiments explaining everyday phenomena, designed to bolster students' self-perception as scientists and to help develop awareness of careers in the sciences (2012b).

A Framework for K-12 Science Education

Three Dimensions of Learning

STEM jobs are changing dramatically as we adapt to new technological advancements in almost every aspect of our daily lives. To meet these changing demands, STEM workers at every level of education must be capable of flexibility in the application of their skills or knowledge, including high school and two-year college technical STEM workers as well as those with advanced degrees (NGSS Lead States, 2013; NSB, 2015). The National Research Council (NRC) uses the term "deeper learning" to describe this ability to adapt what was learned and apply it to other situations (2012a).
To retain a relevant skillset as the demands of the workforce change, students must learn to adapt their knowledge to a variety of situations and even learn new skills and information on their own (NRC, 2011). To create scientifically literate consumers of technology and to successfully educate a technical workforce to use and adapt their skills across every educational level, that effort should begin early and continue through university (NRC, 2012b). In 2012, the NRC set forth A Framework for K-12 Science Education that provided research-based suggestions toward the design of an effective and coherent curriculum for students in the United States (US). They recommended that future curriculum and assessment be built around a central framework that incorporated the learning of science with the skills necessary to plan and revise experiments to better understand the world around them.

The NRC's Framework for K-12 Science Education divides the knowledge and skills associated with science learning into three dimensions labeled as Disciplinary Core Ideas (DCIs), Crosscutting Concepts (CCs), and Science and Engineering Practices (SEPs). DCIs are broad scientific concepts that are fundamental to the field of study. CCs are concepts which facilitate understanding across multiple fields of study. SEPs are practices employed by scientists and engineers in their respective fields to conduct investigations, build models, create theories and explanations that use reasoning to explain phenomena, and design and build systems. The NRC (2012b) argues that rather than developing a limited understanding of many topics, students should instead learn a limited number of DCIs and CCs, with a primary focus on the depth of the learning. They further argue that students should engage in a process of building upon their knowledge and skills through engagement in scientific inquiry and engineering.

Next Generation Science Standards

The NGSS facilitate the development of curriculum by taking the dimensions of learning defined by the NRC and describing how students' mastery of the three dimensions can be operationalized by grade level. According to the NGSS (2013), these grade-based performance expectations (PEs) have been adapted by many states for use as policy in their own standards to encourage a learning progression which guides students through the recommended revisions to continually build upon their skills through a variety of tasks associated with science understanding. The individual PEs take a specific task and integrate each of the three dimensions into the skills and content knowledge required to master that task. As students progress through the grade levels, the PEs become more in-depth and complex (Schneider et al., 2022).

Three-Dimensional Curriculum

To engage students in each of the dimensions of science learning, they must be given opportunities to practice as scientists working in their field (Krajcik & Shin, 2014). In the US, many individual states have adapted the NGSS for their K-12 curriculum and assessment materials because they have been associated with various benefits to students' learning outcomes (NGSS Lead States, 2013). Despite having a suggested framework to shape curriculum and assessment, there are a number of limitations, including teachers' capacity to roll out new lessons in the classroom. Unfortunately, it takes time to create meaningful lesson plans (NRC, 2012a).
Three-Dimensional Assessment

To understand what students are learning from the multidimensional science curriculum, research suggests that assessments should move away from memorization tasks such as multiple-choice to include a variety of learning tasks that require students to use scientific practices and engage in reasoning through explanation (e.g., Pellegrino, 2013; Harris et al., 2019; NRC, 2014). Although modeling and explanation tasks are more difficult and time-consuming to score than multiple-choice, assessments must use items other than multiple-choice and incorporate more in-depth reasoning tasks to measure what students are learning in the classroom (Haudek et al., 2019). For example, Cooper et al. (2017) noted that researchers gained more information about what students understood in chemistry when students could convey the spatial aspects of their reasoning through drawing.

To develop the transferable deeper learning described by the NRC, curriculum and pedagogy are matched with assessment and feedback to provide meaningful learning experiences (NRC, 2012a). Measuring students' understanding through the learning progression is critical in this process, as the assessments help to shape pedagogical support and foster a healthy classroom environment. The value of these assessments is improved when the students are required to demonstrate their knowledge-in-use, or ability to connect science content to specific scientific practices (Kubsch et al., 2019).

Facilitating Three-Dimensional Science Learning Through Technology

Automated Scoring of Constructed Response

Unfortunately, the shift away from multiple-choice tasks to items that more accurately measure students' mastery of the dimensions of learning may prove challenging, as assessments must be developed and scoring must be facilitated to reduce the burden on teachers (NRC, 2014). Without new methods to facilitate scoring, these items require more work to assess, and students wait longer to receive feedback (Williamson et al., 2012; Ha & Nehm, 2016). Fortunately, existing studies suggest that machine learning may be able to reduce that time between assessment and feedback, in turn promoting the use of these more in-depth assessment measures (Lee et al., 2019b; Lottridge et al., 2018; Zhai et al., 2020b). Because assessments which capture more information about students' understanding are more difficult to score, methods to reduce the burden of scoring are highly desirable. By reducing the time needed for scoring, machine learning could make constructed response (CR) items as accessible as multiple choice. Research into the use of automated scoring of constructed response is showing great promise (Zhai et al., 2020a). Generally, machine learning is successfully being used to facilitate science learning in many ways, including automated feedback, the scoring of essays, and even learning games (Zhai et al., 2020b; Lee et al., 2019a).
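To make the general supervised approach described above concrete, the following sketch trains a simple text classifier on a handful of human-scored responses and then uses it to label a new response. It is only a minimal illustration under assumed tools and data: it uses scikit-learn with TF-IDF features and logistic regression, and the example responses and rubric levels are hypothetical; it is not the scoring system used in the studies discussed in this dissertation.

```python
# Minimal sketch of supervised automated scoring: human-scored constructed
# responses are used to train a text classifier that labels new responses.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: student responses and the rubric levels
# assigned by trained human raters (0 = low, 2 = high).
responses = [
    "the particles gain energy and move farther apart",
    "it evaporates because it is hot outside",
    "energy transfers from the skin to the liquid so the fastest particles escape",
    "the liquid just disappears",
]
human_scores = [2, 0, 2, 0]

# TF-IDF features feed a regularized logistic regression classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(responses, human_scores)

# New, unscored responses can then be classified automatically.
print(model.predict(["the particles with the most energy leave the liquid"]))
```

In practice, a much larger human-scored training set and systematic agreement checks between machine and human raters are required before such a classifier can stand in for human scoring.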
The curriculum and assessments are based on the science learning guidelines set forth in NGSS (2013) with project-based activities that allow students to engage in inquiry driven reasoning and experimentation using a variety of scientific practices. CESE applied the NRC’s dimensions of learning and the related NGSS PEs to develop specific lessons that teachers can implement in their high school physics and chemistry classrooms. Researchers in both the US and Finland worked together to design a series of project-based lesson plans for teachers. As described in Schneider et al. (2022), the CESE physics and chemistry interventions involve the enactment of three units throughout the school year, with each unit lasting approximately 4 weeks. Designed to be taught in a specific order, the three CESE units each build upon each other as students use ideas or concepts they discovered in the previous unit to scaffold their understanding of new ideas in subsequent units. Each unit is designed to align to a specific set of performance expectations (PEs) and is built around the idea of students figuring out the answer to a meaningful and relevant question referred to as the “Driving Question” (DQ). Given the importance of these SEPs for science learning, CESE scaffolded students learning through a variety of tasks that were introduced throughout the units and measured on the post- unit assessments. While many SEPs are invoked throughout the curriculum, special care is put 7 towards engaging students with making observations, collecting data, using those observations to explain phenomena, and then modeling the phenomena. Students in both the US and Finland participated in the learning intervention in the 2018- 2019 and 2019-2020 school years. The magnitude and duration of the study left researchers with a large and comprehensive dataset obtained through cluster randomized trails. Students completed multiple instruments that were designed to collect information about their science achievement before and after the intervention as well as background and exit surveys regarding their experiences in the classroom and their beliefs around science. That data collection effort led to a number of manuscripts intended to shed light onto science learning and how to facilitate that learning through technology. Improving Science Achievement – Is It Possible? Evaluating the Efficacy of a High School Chemistry and Physics Project-Based Learning Intervention In 2022, CESE researchers Schneider et al. published their main-effects in Educational Researcher in a manuscript titled “Improving Science Achievement – Is It Possible? Evaluating the Efficacy of a High School Chemistry and Physics Project-Based Learning Intervention.” Based on prior evidence surrounding the benefits of the NGSS, CESE researchers hypothesized that students would perform better on a third-party developed measure of science achievement after participation in the intervention’s inquiry-driven project-based curriculum. The 2022 publication from Schneider et al. comprises Chapter 2 of this dissertation. In this chapter we discuss the foundations of the CESE study, its development, founding principles, teacher training, and curriculum development. To explore what was driving the treatment effects, CESE researchers tested the interaction effects of treatment with gender and race or ethnicity. They also tested the mediating effect of fidelity of implementation in the classroom. 
Finally, CESE researchers tested the impact of treatment on educational ambition, with the hypothesis that participating in an NGSS-aligned, project-based curriculum and acting as real scientists would encourage students to consider their future learning objectives. All analyses used three-level models clustered at the teacher and school levels and included both student- and school-level covariates.

The results of this study show positive impacts for students engaged in the treatment condition of the intervention. On average, the intervention showed a positive effect for all students, with no interaction effects by gender or race reaching significance at an alpha of 0.05. The sample used for this study showed a high generalizability index to the general population of high school students in chemistry and physics courses in the US. Additionally, engagement in the project-based intervention was associated with students' later learning ambition, even without any specific career focus in the curriculum. The results are discussed in more detail in Chapter 2 of the dissertation.

Using Machine Learning to Score Multi-Dimensional Assessments of Chemistry and Physics

As a part of the CESE intervention, students participated in multiple assessments of science achievement that were designed to incorporate multiple dimensions of learning. Prior to beginning the intervention, students participated in a science achievement assessment that included several constructed response (CR) items. Although the original rubrics were aligned with NGSS PEs for varying grade levels and were not intended to capture the three dimensions of learning, CESE found that most students engaged multiple dimensions in their responses despite not being prompted to do so. Therefore, CESE researchers developed special rubrics to capture information regarding the DCIs, CCs, and SEPs related to the chosen items.

Due to the magnitude of the study in the 2018-2019 school year, and the additional costs associated with scoring CR items, researchers with CESE decided to explore the possibility of using automated scoring methods. In Chapter 3 of this dissertation, I discuss the methods and results from a manuscript titled "Using Machine Learning to Score Multi-Dimensional Assessments of Chemistry and Physics," which was published in the Journal of Science Education and Technology in 2021. This study was developed and conducted by Maestrales et al. under the larger CESE project. In this manuscript, the authors discuss in depth the need for three-dimensional assessment and the need for automated scoring methods that reduce the burden on teachers. They outline the supervised learning approach, with rigorous training of human raters, and the process of developing a dataset for use in training the algorithm. Human-to-human agreement and human-to-machine agreement are described using Cohen's kappa. Researchers also explored the agreement between human and machine raters by the dimensions captured in the rubric and by representation within the sample. The machine-reported probability of a correct classification was reported by dimensions of learning and representation within the sample as well. Unlike many other studies of automated scoring, this study discusses the specific rubrics used by human raters and how they address the dimensions of learning to create classification categories.
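As a point of reference for the agreement statistic mentioned above, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance. The short sketch below, which assumes scikit-learn and uses hypothetical rubric labels, shows how the statistic can be computed; it illustrates the measure itself rather than the analysis reported in the manuscript.

```python
# Illustrative computation of Cohen's kappa between a human rater and a
# machine classifier. kappa = (p_o - p_e) / (1 - p_e), where p_o is the
# observed agreement and p_e is the agreement expected by chance.
from sklearn.metrics import cohen_kappa_score

# Hypothetical rubric labels for ten constructed responses.
human_labels   = [2, 1, 0, 2, 1, 1, 0, 2, 2, 1]
machine_labels = [2, 1, 0, 1, 1, 1, 0, 2, 2, 0]

kappa = cohen_kappa_score(human_labels, machine_labels)
print(f"Cohen's kappa: {kappa:.2f}")
```

Values near 1 indicate near-perfect agreement beyond chance, while values near 0 indicate agreement no better than chance.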
The results of this study, discussed in more detail in Chapter 3, show that the human raters were able to come to high agreement using the multi-dimensional rubrics. Using the data sets created from those human scores, machine learning algorithms developed by the Automated Analysis of Constructed Response (AACR) research group were effective in classifying the constructed response items, with machine-to-human rater agreement similar to the agreement between human raters. In categories that were well represented within the training set, the algorithms performed very well. Although responses using vocabulary that was under-represented within the sample were over-represented among the scoring discrepancies, the algorithms were generally robust even when scoring very open-ended responses.

U.S. and Finnish High School Engagement During the Covid-19 Pandemic

In addition to aiding in the process of assessment and feedback, technology has also taken an important role in science learning as more coursework has moved to an online environment. During the 2019-2020 school year, the Covid-19 pandemic forced schools around the world to close their doors, and many schools turned to remote content delivery to connect with their students. Having already collected information about students' science engagement in their physics and chemistry classrooms at the beginning of the school year, CESE researchers in the US were in a unique position to study changes in student engagement during remote instruction. Moreover, CESE was able to collect data from students in two countries with very different approaches to the shift to remote content delivery. The details of this international collaboration in data collection and analyses during the pandemic comprise Chapter 4 of this dissertation.

In 2021, CESE researchers in the US and Finland published their findings in the International Journal of Psychology in a manuscript titled "U.S. and Finnish high school engagement during the Covid-19 Pandemic." CESE defined engagement as students reporting high interest, high skill, and high challenge. The US team reported the change in the log odds that a student reported high interest, skill, challenge, and engagement from the beginning of the school year to the time they were surveyed during remote instruction. Both teams discussed the activities students reported in their classrooms. The US team discussed which activities students reported in their online science classes, how interested they were in those activities, and how those relationships impacted their engagement. In Finland, researchers were able to collect data regarding situational engagement during remote teaching and used this information to understand engagement during high-, medium-, and low-frequency activities. In the US, researchers used educational ambition as a measure of persistence during the pandemic. Finally, Finnish researchers discussed the relationships between situational engagement and social and emotional learning, with correlations between the students' emotional state and their situational engagement.

Also described in Chapter 4 are the results of this analysis. Despite the shift to remote instruction, this study showed an increase in engagement for US students. Students in both the US and Finland showed a preference for those activities which were least available during remote instruction. Students in the US reported the most interest in SEPs that could be done remotely.
Not surprisingly, those types of activities showed the strongest relationships to engagement during the pandemic. Additionally, US students were showing persistence in their college ambitions, with many students firming up decisions to attend 4 or more years of college despite hardships encountered during the Covid-19 pandemic.

REFERENCES

Alper, J. (Ed.). (2016). Developing a national STEM workforce strategy: A workshop summary. National Academies Press. Cooper, M. M., Stieff, M., & DeSutter, D. (2017). Sketching the invisible to predict the visible: From drawing to modeling in chemistry. Topics in Cognitive Science, 9(4), 902-920. Flowers III, A. M., & Banda, R. (2016). Cultivating science identity through sources of self-efficacy. Journal for Multicultural Education, 10(3), 405-417. https://doi.org/10.1108/JME-01-2016-0014. Ha, M., & Nehm, R. H. (2016). The impact of misspelled words on automated computer scoring: a case study of scientific explanations. Journal of Science Education and Technology, 25(3), 358-374. Harris, C. J., Krajcik, J. S., Pellegrino, J. W., & DeBarger, A. H. (2019). Designing knowledge-in-use assessments to promote deeper learning. Educational Measurement: Issues and Practice, 38(2), 53-67. https://doi.org/10.1111/emip.12253. Haudek, K., Santiago, M., Wilson, C., Stuhlsatz, M., Donovan, B., Bracey, Z., Gardner, A., Osborne, J., & Cheuk, T. (2019). Using Automated Analysis to Assess Middle School Students' Competence with Scientific Argumentation. Paper presented at the Annual Meeting of the National Council on Measurement in Education (NCME), Toronto, ON. Krajcik, J. S., & Shin, N. (2014). Project-based learning. In R. K. Sawyer (Ed.), The Cambridge Handbook of the Learning Sciences (pp. 275-297). Kubsch, M., Nordine, J., Neumann, K., Fortus, D., & Krajcik, J. (2019). Probing the relation between students' integrated knowledge and knowledge-in-use about energy using network analysis. Eurasia Journal of Mathematics, Science and Technology Education, 15(8), em1728. Lee, H. S., McNamara, D., Bracey, Z. B., Liu, O. L., Gerard, L., Sherin, B., Wilson, C., Pallant, A., Linn, M., Haudek, K., & Osborne, J. (2019a). Computerized text analysis: Assessment and research potentials for promoting learning. Lee, H. S., Pallant, A., Pryputniewicz, S., Lord, T., Mulholland, M., & Liu, O. L. (2019b). Automated text scoring and real-time adjustable feedback: Supporting revision of scientific arguments involving uncertainty. Science Education, 103(3), 590-622. Lottridge, S., Wood, S., & Shaw, D. (2018). The effectiveness of machine score-ability ratings in predicting automated scoring performance. Applied Measurement in Education, 31(3), 215-232. Ma, Y., & Xiao, S. (2021). Math and science identity change and paths into and out of STEM: Gender and racial disparities. Socius, 7, 23780231211001978. Maestrales, S., Marias Dezendorf, R., Tang, X., Salmela-Aro, K., Bartz, K., Juuti, K., Lavonen, J., Krajcik, J., & Schneider, B. (2021a). US and Finnish High School Science Engagement During the Covid-19 Pandemic. International Journal of Psychology, 57(1), 73-86. Maestrales, S., Zhai, X., Touitou, I., Baker, Q., Krajcik, J., & Schneider, B. (2021b). Using Machine Learning to Evaluate Multidimensional Assessments of Chemistry and Physics. Journal of Science Education and Technology, 30(2), 239-254. National Research Council. (2011). Assessing 21st Century Skills: Summary of a Workshop. The National Academies Press. National Research Council. (2012a).
Education for life and work: Developing transferable knowledge and skills in the 21st century. The National Academies Press. National Research Council. (2012b). A Framework for K-12 Science Education: Practices, Crosscutting Concepts, and Core Ideas. The National Academies Press. National Research Council. (2014). Developing Assessments for the Next Generation Science Standards. Washington, DC: The National Academies Press. National Science Board. (2015). Revisiting the STEM workforce: A companion to science and engineering indicators 2014. NSB-2015-10. Schneider, B., Krajcik, J., Lavonen, J., Salmela-Aro, K., Klager, C., Bradford, L., Chen, I., Baker, Q., Touitou, I., Peek-Brown, D., Marias Dezendorf, R., Maestrales, S. & Bartz, K. (2022). Improving Science Achievement—Is It Possible? Evaluating the Efficacy of a High School Chemistry and Physics Project-Based Learning Intervention. Educational Researcher, 0013189X211067742. Tai, R. H., Qi Liu, C., Maltese, A. V., & Fan, X. (2006). Planning early for careers in science. Science, 312(5777), 1143-1144. Vincent-Ruz, P., & Schunn, C. D. (2018). The nature of science identity and its role as the driver of student choices. International journal of STEM education, 5(1), 1-12. Wendler, C., Bridgeman, B., Cline, F., Millett, C., Rock, J., Bell, N., & McAllister, P. (2010). The path forward: The future of graduate education in the United States. Educational Testing Service. Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational measurement: issues and practice, 31(1), 2-13. Zhai, X., Haudek, K., Shi, L., H Nehm, R., & Urban‐Lurain, M. (2020a). From substitution to redefinition: A framework of machine learning‐based science assessment. Journal of Research in Science Teaching, 57(9), 1430-1459. 14 Zhai, X., Yin, Y., Pellegrino, J. W., Haudek, K. C., & Shi, L. (2020b). Applying machine learning in science assessment: a systematic review. Studies in Science Education, 56(1), 111-151. 15 CHAPTER 2: IMPROVING SCIENCE ACHIEVEMENT—IS IT POSSIBLE? EVALUATING THE EFFICACY OF A HIGH SCHOOL CHEMISTRY AND PHYSICS PROJECT-BASED LEARNING INTERVENTION Abstract Crafting Engaging Science Environments is a high school chemistry and physics project- based learning intervention that meets Next Generation Science Standards performance expectations. It was administered to a diverse group of over 4,000 students in a randomized control trial in California and Michigan. Results show that treatment students, on average, performed 0.20 standard deviations higher than control students on an independently developed summative science assessment. Mediation analyses show an indirect path between teacher and student-reported participation in modeling practices and science achievement. Exploratory analyses indicate positive treatment effects for enhancing college ambitions. Overall, results show that improving secondary school science learning is achievable with a coherent system comprising teacher and student learning experiences, professional learning, and formative unit assessments that support students in “doing” science. 16 Introduction and Literature Review U.S. students’ engagement and achievement in elementary and secondary science has been relatively stagnant over the past two decades (National Center for Education Statistics, 2017) and regrettably remains behind other industrialized countries (Organization for Economic Co-operation and Development, 2020). 
Recognizing the consequences of persisting mediocre student academic performance in science achievement, three national efforts A Framework for K–12 Science Education (National Research Council [NRC], 2012), the Next Generation Science Standards (NGSS; NGSS Lead States, 2013), and Science and Engineering for Grades 6 to 12: Investigation and Design at the Center (National Academies of Sciences, Engineering, and Medicine, 2019)—were initiated to articulate a vision and a set of standards for markedly reforming science teaching and learning. Breaking with traditional practices of memorizing science facts, formulae, and mathematical operations, these reports emphasize the importance of “doing” science for improving science literacy and encouraging the pursuit of science careers. Grounded in the ideas of John Dewey (1938) and subsequent work by Blumenfeld et al. (1991) and Krajcik and Shin (2014), “the act of doing” engages the students in a cognitive process where they encounter a problem, plan a solution, work on it, and reflect on the results. The NRC’s (2012) Framework for K–12 Science Education argues that science learning should focus on making sense of phenomena or solving design problems by engaging students in the three dimensions of scientific knowledge. These dimensions include science and engineering practices, crosscutting concepts, and disciplinary core ideas (DCIs) which should increase in depth and sophistication within and across the grade levels (NRC, 2012). Science and engineering practices are behaviors scientists perform as they build theories about natural phenomena through investigations, creating models, generating explanations, and constructing 17 science-based arguments. Crosscutting concepts refer to “ideas” that are found and linked across disciplines and which provide different tools for exploring phenomena. DCIs focus on the key ideas of a science discipline—or, in other instances, a fundamental organizing idea for a single discipline—and are critical for explaining phenomena. Shortly after the development of the Framework, the NGSS were released having been created by a national effort of scientists, educational researchers, teachers, and policymakers (NGSS Lead States, 2013). The NGSS described a set of performance expectations that identified what students should know and be able to do. These performance expectations follow a progression of learning from primary to secondary grades, outlining and linking specific science ideas that increase in meaning and complexity. Many U.S. states have adopted the NGSS (National Science Teacher Association, 2017), or adapted them to develop similar science proficiency standards. Despite this widespread adoption of NGSS-like standards, there is a lack of research on evidence-based curricula and lessons that exemplify these reforms. The intervention “Crafting Engaging Science Environments” (CESE) was initiated to fill this gap by: (1) creating an intervention for high school chemistry and physics designed on the Framework’s vision and NGSS performance expectations and (2) assessing its effects on student science achievement (Schneider et al., 2020). The Intervention: Crafting Engaging Science Environments Deeply concerned with the lack of interest and engagement among secondary school students, an international team of Finnish and U.S. science education researchers, teachers, psychologists, sociologists, and psychometricians collaboratively designed the CESE intervention. 
Over the course of 3 years, this interdisciplinary team worked on developing a holistic intervention that takes a system approach to teaching and learning in secondary school 18 chemistry and physics classrooms. The intervention consists of six units—three in chemistry and three in physics—with corresponding professional learning and assessments. Although team members from both countries collaborated on the design, the intent was not a comparative study but instead each country conducted its own investigations, recognizing the cultural and demographic differences of the student and teacher populations. Beginning with a design phase followed by a field-test conducted in the United States and Finland in 2017– 2018 (Schneider et al., 2020), an efficacy trial was then conducted in the United States (2018–2019) with students of different demographic factors at the individual level (i.e., prior academic ability, race and ethnicity, and gender) and at the school level (i.e., urbanicity, percentage of free or reduced -price lunch, and region) in California and Michigan.1 The results of this efficacy study are reported below. Theoretical Basis of the Intervention The CESE intervention was designed taking into account critiques of successful curricular interventions, including empirically tested instructional and learning materials for teachers and students, teacher professional learning with sustained support and three-dimensional assessments for evaluating learning in science (Harris et al., 2015; Polikoff et al., 2018; Sinatra et al., 2015). Guided by the principles of project-based learning (PBL) and the complementary ideas of the Framework and the NGSS, the coherent CESE system was formed to support science learning by challenging students to engage in relevant and meaningful experiences. The theoretical principles of PBL in science revolve around learners making sense of “real-world phenomena” and solving relevant questions by planning and carrying out their own investigations (Krajcik & Shin, 2014). Collaborating with classmates, students engage in experiences in which they create artifacts that support the development of scientific ideas and use 19 of scientific practices. PBL uses the three dimensions of scientific knowledge that allow students to draw from their knowledge across disciplines and life experiences, rather than being the passive recipients of knowledge. This contrasts with traditional approaches to chemistry and physics, where students are often instructed to plug numbers into equations without truly understanding the underlying relationships described by the equations. Specifically, the CESE intervention focuses on seven major PBL principles: meeting subject and grade-relevant NGSS performance expectations; constructing a meaningful driving question that motivates a solution to a complex problem or an explanation for a compelling phenomenon; providing opportunities for the use of scientific practices; creating collaborative experiences and investigations for finding solutions to a driving question; integrating learning tools to make sense of evidence; developing artifacts that respond to the driving question and reveal students’ comprehension; and using assessments that capture emerging understandings (Krajcik & Shin, 2014). Enacting these principles helps transform the daily science experiences of teachers and students, from teacher-led instruction to environments where both teachers and students work together on solving problems and figuring out how to explain phenomena. 
In such a learning environment, students have agency and directly participate in using scientific practices, much like scientists and engineers.

There were several reasons why our intervention design with its PBL framework was constructed for high school students in physics and chemistry. First, advances in science learning emphasize the components of PBL as necessary for student learning (National Academies of Sciences, Engineering, and Medicine, 2019; NRC, 2000). Second, there is limited work on measuring reforms in high school physical science courses (What Works Clearinghouse, 2020). We recognize that there are several newer studies that take a different approach to understanding why high school students lack an interest in science (National Assessment of Educational Progress, 2021) and why fewer students are choosing STEM (science, technology, engineering, and mathematics) majors (Riegle-Crumb et al., 2011). Furthermore, there have also been many studies highlighting how different types of methodologies enhance students' learning experiences in the sciences (Lee et al., 2020). There are considerable studies that examine student access among underrepresented minorities to more advanced level courses or subjects like chemistry or physics (National Science Board, 2020; Riegle-Crumb et al., 2019), along with new promising science curricular efforts (Engels et al., 2019; Sasson et al., 2018). However, several critiques concerned with the equitable and inclusionary participatory work of PBL suggest that it needs to be studied more rigorously to learn if indeed it has a positive impact on student outcomes (Chen & Yang, 2019; Cheung et al., 2016; Condliffe et al., 2017; Harris et al., 2019). Our work is a response to this critique. Third, chemistry and physics are important subject areas, as they are often considered gatekeeper courses for many science specializations and postsecondary schooling (Hinojosa et al., 2016; Riegle-Crumb & King, 2010). Finally, scientific literacy is needed for all students, as evidenced by our understanding and responses when faced with pandemics, technological change, and environmental concerns (National Science Board, 2019). This combination of a need for new curricula paired with the necessity of PBL led to the creation of the CESE system.

Components of the Intervention

Teacher and Student Experiences and Materials

Recognizing the challenges that teachers would undoubtedly have in transforming instructional practices for all their science units, the team decided to develop three units for chemistry and physics, each of which lasted 4 to 6 weeks. Table 2.1 describes the three chemistry and three physics units, along with their driving questions, performance expectations, and phenomena.

Table 2.1
Units, Performance Expectations, Driving Questions, and Phenomena for the Units

Unit: Evaporation (chemistry)
Performance Expectations: HS-PS1-3: Plan and conduct an investigation to gather evidence to compare the structure of substances at the bulk scale to infer the strength of electrical forces between particles. HS-PS3-2: Develop and use models to illustrate that energy at the macroscopic scale can be accounted for as a combination of energy associated with the motion of particles (objects) and energy associated with the relative positions of particles (objects).
Driving Question: "Why do I feel colder when I am wet than when I am dry?"
Phenomena: Water, acetone, and ethanol evaporate when placed on your skin, making you feel cool. Ice at zero degrees Celsius will change to liquid water at zero degrees Celsius with the addition of energy, but no temperature change occurs.

Unit: Periodic Table (chemistry)
Performance Expectations: HS-PS1-1: Use the periodic table as a model to predict the relative properties of elements based on the patterns of electrons in the outermost energy level of atoms. HS-PS1-2: Construct and revise an explanation for the outcome of a simple chemical reaction based on the outermost electron states of atoms, trends in the periodic table, and knowledge of the patterns of chemical properties.
Driving Question: "Why is table salt safe to eat, but the substances that form it are explosive or toxic when separated?"
Phenomena: Sodium reacts with water. Potassium reacts with water. A solution of chlorine reacts with potassium iodide solution to form iodine and potassium chloride. A solution of bromine reacts with potassium iodide solution to form iodine and potassium bromide solution.

Unit: Conservation of Matter (chemistry)
Performance Expectations: HS-PS1-7: Use mathematical representations to support the claim that atoms, and therefore mass, are conserved during a chemical reaction.
Driving Question: "Why does it seem like I can make a substance appear or disappear?"
Phenomena: Substances, like paper, can burn. They appear to disappear, but in a closed system the burning of paper has no mass change. It is necessary to add energy to start the burning of paper, but after the start, lots of energy is given off as the temperature of the surrounding area increases.

Unit: Forces and Motion (physics)
Performance Expectations: HS-PS2-3: Apply scientific and engineering ideas to design, evaluate, and refine a device that minimizes the force on a macroscopic object during a collision.
Driving Question: "How can I design a vehicle to be safer for a passenger during a collision?"
Phenomena: When a car crashes, destruction and damage to the car and the passengers occur.

Unit: MagLev (physics)
Performance Expectations: HS-PS3-5: Develop and use a model of two objects interacting through electric or magnetic fields to illustrate the forces between objects and the changes in energy of the objects due to the interaction. HS-PS3-2: Develop and use models to illustrate that energy at the macroscopic scale can be accounted for as a combination of energy associated with the motion of particles (objects) and energy associated with the relative positions of particles (objects).
Driving Question: "What makes a super speed train (Maglev) function without touching the track?"
Phenomena: When a magnet is brought close enough to a second magnet, the second magnet will move without touching it.

Unit: Electric Motors (physics)
Performance Expectations: HS-PS3-1: Create a computational model to calculate the change in energy of one component in a system when the change in energy of the other components and energy flows in and out of the system are known. HS-PS2-5: Plan and conduct an investigation to provide evidence that an electric current can produce a magnetic field and that a changing magnetic field can produce an electric current. HS-PS3-3: Design, build, and refine a device that works within given constraints to convert one form of energy into another form of energy.
Driving Question: "How can I make the most efficient electric motor?"
Phenomena: An electric current can cause the shaft of an electric motor to spin or turn.

Each unit was designed with an overriding driving question, lesson sequences incorporating scientific practices, and postunit assessments.
The first step in the unit design process was to select a set of performance expectations for each unit in chemistry and physics and then unpack the performance expectations to elaborate on the ideas and identify the scientific practices (Harris et al., 2019; Krajcik & Czerniak, 2018; Krajcik & Shin, 2014). The PBL framework requires each unit to have a driving question (see Table 2.1) that is meaningful to students’ lives and connects a phenomena or complex problem to a concrete experience recognizable to the students. This drives the sequence of coherent lessons that continue to 23 motivate students to meet the unit’s learning goals. Moreover, the initial experiencing of the phenomena leads the students to ask their own questions. Another important feature of the driving question is that it initiates the construction of a systematic sequence of lessons that builds throughout the unit, leading to additional questions that are threaded through various lessons, providing coherence across multiple experiences. The lessons that form from the driving question are not defined scripts but a flexible roadmap that connects prior experiences and ideas to specific scientific practices, such as planning and carrying out investigations, analyzing and interpreting data, and constructing explanations and designing solutions. The investigations related to the driving question are not independent projects, but rather carefully constructed experiential activities that build across the unit and are anchored in and help answer the driving question. One of the most important scientific practices that the PBL units emphasize is having students construct models and connect them with evidence-based explanations of phenomena. Here again, the direct involvement of students in modeling is not an isolated task, such as diagramming a simple relationship between two variables, but instead is directly related to explaining the phenomena under investigation and responding to the driving question. Raising the importance of modeling is a key scientific practice in the NRC’s Framework, in which the intent is to provide students with experiences whereby they can become directly involved in systems thinking.2 By incorporating the practices of scientists and engineers, the modeling experiences afford students the opportunity to learn how to represent phenomena and physical systems, explain and predict the phenomena in a consistent and logical manner, and understand their data. The CESE modeling experiences are deliberately created so that students are supported in 24 learning about identifying system components and the relationships among them which can take many forms, including mathematical formulae, diagrams, and computer simulations. To understand how this process plays out in classrooms, the following two examples summarize the first units in chemistry and physics.3 The first unit in chemistry focuses on explaining evaporative cooling. Working from the CESE driving question (see Table 2.1), the lessons are designed so that students use classroom experiments and models to figure out and explain how evaporative cooling occurs; they must then figure out how this relates to the interactions of particles at the molecular level, as well as the matter’s macro-level structure and properties, and energy transfer. Students manipulate different variables throughout the experiments, looking to explain how each component may influence evaporation. 
Across the unit, as students learn and assess how to make sense of phenomena, they construct models and explanations of the process of evaporative cooling, connecting energy changes to changes in the structure of matter in the system. The first unit in physics focuses on students exploring the driving question, How can I make a vehicle safer during a collision? Here, experiences and investigations include working collaboratively to investigate and develop computer models to explain collisions, figuring out relationships among mass, force, and velocity in a collision by experiencing what happens when each of those variables is individually manipulated. Students use their new knowledge of mass, force, velocity, acceleration (Newton’s second law), and momentum in combination with engineering practices to develop their best design in answering the driving question. Then they use a set of materials to build and test a cart that minimizes the force on a passenger in a collision. 25 Postunit Assessments One of the underlying principles of the CESE system is to create postunit assessment tasks that extend student learning experiences by using the three dimensions of scientific knowledge to explain phenomena and solve challenging problems to demonstrate mastery of NGSS performance expectations—but not what was articulated in the curricular unit. The steps used for creating these assessment tasks and rubrics were a modification of a previous process articulated by Harris et al. (2019). The development of assessment tasks allows for the creation of items through a principled, clearly defined process that is grounded in learning and assessment theory. All of the postunit assessments have the students draw models and write full descriptions of what is shown; these are then evaluated using a rubric that assesses their knowledge of the NGSS performance expectations, including the three dimensions of science learning. In addition, external reviews and classroom pilots were performed to increase item validity. Professional Learning and Teacher Support Professional learning was designed with best practices from research and emphasized teachers’ active participation in learning, connections to classroom contexts, collaboration, and reflection (Darling-Hammond et al., 2017; Garet et al., 2001; Krajcik, 2014; van den Bergh et al., 2015). A key feature of CESE professional learning is for treatment teachers to experience what their students will be doing during their science activities, including using the driving question board, asking their own questions, building models, developing evidence-based explanations, and conducting experiments, not as students but as adult learners. The intent here is to guide and support the teachers in new ways of teaching often found to be challenging. All treatment teachers spent three in-person days learning about the NGSS, three- dimensional learning, science PBL, and a review of the first units in chemistry and physics, 26 conducted with team members and experienced teacher facilitators. Several times during the school year, the treatment teachers also met in person with facilitators to talk through teaching the next set of units. Facilitators also connected with the teachers via video conferences and online message boards. A hotline and a monitored email address were also available that centered on teacher questions. Over each of the 4 to 6-week units, there were approximately 90 teacher requests for additional support and information. 
If needed, facilitators were also available for face-to-face interactions.4 Control teachers also met in-person at the beginning of the school year for a day and were given a workshop on the NGSS. Testing the Intervention Three research questions guided this investigation: Research Question 1: What is the main effect of this intervention on students’ science learning? Do treatment students outperform control students on a summative science assessment? Research Question 2: What other conditions besides the intervention could be affecting the treatment effect? More specifically, does the treatment effect vary by race/ethnicity and gender? Research Question 3: What is the mediating effect of fidelity of implementation on the treatment effect? Sample Method A power analysis indicated that the sample should include at least 48 schools with 50 students per school to detect a robust effect (i.e., 0.20).5 Schools were recruited from four areas (Los Angeles Unified School District, San Diego County Office of Education, Detroit Public 27 School Community District, and other districts throughout Michigan), which allowed for a diverse sample of prior school-level science achievement, socioeconomic status, and race and ethnicity, including a substantial representation of Hispanic students, many of whom were English language learners. Participating school districts signed memoranda of understanding and supplied lists of schools for potential participation from which the randomization process was undertaken. Randomization Assignment to treatment status was made by schools rather than individual teachers to prevent spillover, as teachers within a particular school might plan together and/or share curriculum materials and instructional practices with colleagues. Also, a small number of schools requested that all their teachers in a subject be teaching the same curriculum. Team members contacted the principals from the district lists for potential participation, explaining that they would either be assigned to receive the treatment or control group that would receive the treatment the following school year. After receiving a principal’s agreement to participate, schools were randomly assigned to treatment or control status, with an equal probability (0.50) of each. Given district and principal support, nearly all of the teachers were willing to participate. (See APPENDIX 2.A, also available on the journal website, for the balance tables between the treatment and control schools).6 Attrition After randomization, nine schools attrited because of school closures or canceled science courses. Although not part of the sample selection (i.e., schools and students), teacher attrition was explored because of its impact on the student sample. Teacher attrition was related to school policies and personal issues: fiscal problems and a teacher strike which resulted in: teacher class 28 reassignments and class cancellations; medical emergencies; mismatch between the course timeline and student abilities; and undisclosed reasons. Students were excluded from the analytic sample if they were missing either the pretest or summative assessment (or both).7 Students who joined classrooms after the pretest was administered were excluded from the analysis. Initially, there were 70 schools (36 treatment and 34 control), with 129 chemistry and physics teachers and 6,211 students (3,325 treatment and 2,886 control). 
After accounting for attrition, the final analytic sample includes 61 schools (30 treatment and 31 control), 119 teachers, and 4,238 students (2,127 treatment and 2,111 control). Table 2.2 summarizes overall and differential attrition across the treatment and control groups at the school and student levels.

Table 2.2 Attrition

                           Overall    Treatment    Control    Differential
Panel A: School level
  Initial schools               70           36         34
  Final schools                 61           30         31
  Attrition                 12.86%       16.67%      8.82%          7.84%
Panel B: Student level
  Initial students           6,211        3,325      2,886
  Final students             4,238        2,127      2,111
  Attrition                 31.77%       36.03%     26.85%          9.18%

Table 2.3 provides descriptive statistics and balance for the analytic sample on pretest and demographic characteristics. Balance was estimated using a two-level hierarchical linear model (HLM) of the treatment on the characteristic of interest.

Table 2.3 Descriptive Statistics and Balance for the Analytic Sample
Note. Standardized pretest is in standard deviations. Other covariates are proportions. *p < 0.05. **p < 0.01. ***p < 0.001.

Given slight differences between the treatment and control students in the standardized pretest and in the proportions of racial and ethnic groups, these measures were included as control variables in the analytic models, along with dummy variables for region and course subject.

Instruments

Students

Several instruments were used to collect information from the treatment and control students, including demographic characteristics, pretest baseline science achievement, and a summative assessment. For students whose first language was Spanish, all consent forms, teaching materials, and unit and summative tests were first translated into Spanish and then translated back into English by another translator to ensure accuracy. All translators were fluent in both languages and had a secondary school science specialization.

The student background survey was administered on Qualtrics or via paper and pencil. Questions were largely adapted from the Programme for International Student Assessment (PISA) and asked about home background (race, ethnicity, language, parents' education, etc.), attitudes toward science, and preferences for a future career in science.

A pretest was given at the start of the school year to students in both conditions to measure their baseline science knowledge, which served as a covariate in the analytic model. This pretest contained multiple-choice items chosen from the National Assessment of Educational Progress (NAEP) test bank. The items covered a range of topics and difficulty levels, and some were aligned with DCIs and performance expectations for chemistry and physics (see Maestrales et al., 2021).8

To measure the difference between the treatment and control groups at the end of the intervention, the students completed a summative assessment consisting of items developed by the Michigan Department of Education for use on the state's 11th-grade science assessment.9 Michigan was one of the earlier states to adopt NGSS-like standards, and that adoption spurred a redesign of the science achievement test given to high school students. Several science curriculum specialists and psychometricians worked to design a science assessment that encompassed grade-level NGSS standards and NRC three-dimensional science.
Our summative test included items that corresponded to the physical science performance expectations.10 This third-party assessment allowed for an objective measure of the differences in achievement between the treatment and control groups.11

Student exit surveys asked about the frequency of different PBL activities in their classroom, which allowed for a comparison of student versus teacher perceptions of fidelity of implementation. The exit survey also asked students to reflect on the importance of science in their own lives, their interest in the future study of science, their experiences in their science classrooms, and the science materials available at their schools.

Teachers

A teacher survey with questions adapted from the Teaching and Learning International Survey (TALIS) included items on years of teaching experience, teaching methods, and attitude toward teaching. This survey also measured teachers' knowledge of and exposure to the NGSS and PBL, to establish that, at the beginning of the intervention, such knowledge did not differ between the treatment and control teachers. All teachers and students were also asked to complete an exit survey at the end of the year. Items on the teacher exit survey asked about their use of PBL, their coverage of performance expectations, and the quality of classroom resources that affected the intervention lessons. These measures also allowed for testing the assumption that the control teachers were teaching business as usual. The exit survey items used to determine the units covered by the control teachers, their practices, and their textbooks and curriculum tools can be found in APPENDIX 2.B (also available on the journal website).

In addition to these exit surveys, in-person observations of randomly selected teachers (in both treatment and control) were conducted and used for determining fidelity of implementation. Another important use of these observations was to confirm that the control teachers were conducting their science classrooms with business as usual, and not using PBL practices or a similar type of curriculum that emphasized CESE principles or experiential activities. Although observations could not be conducted for all teachers, those that were completed provided important insights into the use of PBL in classrooms, independent observer assessments of PBL use, and how these measures corresponded with teacher self-reports in the exit surveys.

Analysis

To assess the effect of the treatment on science achievement and to account for the clustering that results from assigning schools to treatment, a two-level HLM was used (Bloom, 2005; Raudenbush, 1997; Raudenbush & Bryk, 2002), with students clustered within schools.

Model 1:

$$Y_{ij} = \gamma_{00} + \gamma_{01}\,\mathrm{Treatment}_j + S_j\delta_j + X_{ij}\gamma_{i0} + r_{0j} + \epsilon_{ij}$$

where $Y_{ij}$ is the standardized summative assessment test score for student i in school j; $\gamma_{00}$ is the mean outcome of the control group; $\gamma_{01}$ is the difference between the treatment and control groups; $S_j$ are the school-level covariates, including the school pretest mean and region; and $\delta_j$ are the coefficients on those covariates. $X_{ij}$ are the individual-level covariates, including pretest, course (chemistry or physics), and gender of the students, and $\gamma_{i0}$ are the coefficients on these individual-level covariates. Finally, $r_{0j}$ is the school-level error term, and $\epsilon_{ij}$ is the student-level error term.
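To make Model 1 concrete, the following is a minimal sketch of how such a two-level random-intercept model could be estimated in general-purpose software. The data file and column names (summative_z, treatment, pretest_z, chemistry, female, school_pretest_mean, region, school_id) are hypothetical stand-ins for the variables described above, not the study's actual code or software.

```python
# Minimal sketch of Model 1: a two-level random-intercept model with students
# nested in schools. All column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("students.csv")  # hypothetical analytic file

model1 = smf.mixedlm(
    "summative_z ~ treatment + pretest_z + chemistry + female"
    " + school_pretest_mean + C(region)",
    data=df,
    groups=df["school_id"],   # random intercept r_0j for each school
)
fit = model1.fit(reml=True)
print(fit.summary())          # the coefficient on 'treatment' plays the role of gamma_01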
Because race, ethnicity, and gender data were not available for every student, Model 2 was estimated including dummy variables for these missing data:

Model 2:

$$Y_{ij} = \gamma_{00} + \gamma_{01}\,\mathrm{Treatment}_j + S_j\delta_j + X_{ij}\gamma_{i0} + M_{ij}\alpha_{i0} + r_{0j} + \epsilon_{ij}$$

where $X_{ij}$ are the individual-level covariates, including pretest, course (chemistry or physics), and gender of the students, now also including race and ethnicity, and $\gamma_{i0}$ are the coefficients on these individual-level covariates. $M_{ij}$ is the vector of missingness dummies for gender and race, and $\alpha_{i0}$ are the coefficients on those dummies.

To determine whether treatment effects differ by race/ethnicity or gender, cross-level interactions between gender and treatment, and then between race and treatment, were estimated. The following model is specified for gender; the race and ethnicity models substitute each race dummy for the Female interaction with the treatment. The full model is shown in APPENDIX 2.C (also available on the journal website).

Gender heterogeneity:

$$Y_{ij} = \gamma_{00} + \gamma_{01}\,\mathrm{Treatment}_j + \gamma_{11}\,\mathrm{Treatment}_j \times \mathrm{Female}_{ij} + S_j\delta_j + X_{ij}\gamma_{i0} + M_{ij}\alpha_{i0} + r_{0j} + \epsilon_{ij}$$

where $\gamma_{11}$ is the coefficient on the interaction between treatment and female students.

When teachers implemented the intervention with fidelity, it was expected that their students would have higher scores on the summative science assessment than students whose teachers did not. This assumption was based on teachers' reports of the frequency of various science practices, the extent to which they "incorporated" PBL into their classes (32% reported frequently, 50% sometimes, 15% rarely, 1% never), and students' reports of the use of modeling, aggregated to the teacher level to represent the extent to which teachers made this scientific practice a part of their science classroom instruction.

The impact of teacher fidelity of implementation was estimated with the 3-2-1 mediation model (Pituch et al., 2009). The 3-2-1 mediation model requires a three-level HLM with the teacher at Level 2, because the expected mediation operates through the teacher's level of implementing the intervention, which is a teacher-level variable. Therefore, to ensure that the treatment effects were consistent across the three levels, the overall treatment effect was first estimated using a three-level HLM. Once that consistency was established, the mediation analysis was conducted. Here, the treatment is delivered at Level 3 (the school level), mediation occurs at Level 2 (the teacher level), and the outcome is student assessment scores (Level 1). This process was conducted twice, first for the teachers' reported use of the intervention in their classes and then for the students' use of modeling in each teacher's class. The mediation model is shown in APPENDIX 2.D (also available on the journal website).

Additionally, one exploratory research question was whether the treatment would enhance educational ambitions. For this model, a mixed-effects logistic regression was used in which the outcome was the change in planned level of educational attainment from the beginning to the end of the school year. The covariates used in this analysis were the same as in Model 1 above. The models for estimating these effects are given in APPENDIX 2.E (also available on the journal website).
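For the mediation analysis, once path a (treatment to the teacher-level mediator) and path b (mediator to the student outcome, controlling for treatment) are estimated, the indirect effect is their product, and its confidence interval can be simulated from the two sampling distributions. That is the logic behind the empirical-M approach of Tofighi and MacKinnon (2011) cited in APPENDIX 2.D. The sketch below illustrates the calculation using the path estimates later reported in Table 2.7 for teachers' incorporation of PBL; it approximates, rather than reproduces, the study's exact procedure.

```python
# Monte Carlo confidence interval for the indirect effect a*b, in the spirit of
# Tofighi and MacKinnon (2011). Inputs are the path estimates and standard
# errors reported in Table 2.7 (teachers' incorporation of PBL) and the total
# effect from Table 2.6; the study's exact procedure may differ in detail.
import numpy as np

rng = np.random.default_rng(seed=1)
a, se_a = 0.329, 0.126   # treatment -> teacher-level mediator
b, se_b = 0.068, 0.055   # mediator -> student outcome (controlling for treatment)
c = 0.196                # total treatment effect from the three-level model

draws = rng.normal(a, se_a, 100_000) * rng.normal(b, se_b, 100_000)
lo, hi = np.percentile(draws, [2.5, 97.5])

print(f"indirect effect a*b = {a * b:.3f}")
print(f"95% Monte Carlo CI  = [{lo:.3f}, {hi:.3f}]")
print(f"share of total effect mediated = {a * b / c:.0%}")
```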
Generalizability of the Study

To assess the generalizability of the study, generalizability indexes were analyzed that summarize the degree of similarity between the distribution of propensity scores for the sample schools and those for various inference populations (the United States as a whole as well as each state; Tipton, 2014). This generalizability index takes values between 0 and 1. Essentially, for each of the 51 inference populations chosen (including the U.S. population as a whole), the sample of 61 schools was compared with the population of schools meeting the inclusion criteria (public schools) using a propensity score estimated with logistic regression. These scores estimate the probability that a school would be selected into the study given its covariates, which included school enrollment, percentage of students on free or reduced-price lunch, school urbanicity, and percentage of White, Black, Hispanic, and Asian students.

Results

Student Outcomes

Across both models of the treatment effect, the treatment students outperformed the control students at a significance level of p < 0.05 or less, with a consistent treatment effect near 0.2 standard deviations. In the most conservative result, Model 2, which includes all the covariates, the treatment students scored 0.208 standard deviations (p < 0.01) higher than the control students after taking into account the pretest, region, and demographic information. This represents about a 7 percentage-point increase on a standardized test score based on the summative assessment. The estimated treatment effects from both models are reported in Table 2.4 (see APPENDIX 2.F for the full model results, also available on the journal website).

Table 2.4 Estimated Treatment Effect of the CESE System

Effect                           (1)                (2)
Treatment effect           0.220*** (0.064)   0.208** (0.065)
Additional controls                                  x

Note. Treatment effect is the difference between the treatment and control group, measured in standard deviations. Standard errors are in parentheses. CESE = Crafting Engaging Science Environments. *p < 0.05. **p < 0.01. ***p < 0.001.

To confirm that the findings were robust, a sensitivity check was conducted using the Frank et al. (2013) framework for evaluating the robustness of an inference. To invalidate this inference, 28.6% of the estimated treatment effect, approximately 1,637 observations, would have to be replaced with cases for which the effect of the treatment is zero. Following the estimation of the treatment effect, Cohen's f was calculated and found to be 0.01, indicating that the variance attributable to the treatment is small relative to the unexplained outcome variance (Lorah, 2018).

Tests of Homogeneity of Variance

In tests of heterogeneity by student demographic characteristics (gender and race/ethnicity), there were no statistically significant differences in the treatment effect by student gender or race/ethnicity (see Table 2.5 and APPENDIX 2.G for the full model results, also available on the journal website).

Table 2.5 Summary of Student Level Heterogeneity

Effect                    Female      Black     Hispanic     Asian    Other race   Multiple races
Treatment                0.185**    0.192**     0.222**    0.212**      0.205**         0.212**
                         (0.071)    (0.069)     (0.085)    (0.064)      (0.065)         (0.066)
Predictor of interest   −0.061*    −0.280**    −0.238**     0.044      −0.188          −0.086
                         (0.03)     (0.087)     (0.084)    (0.062)      (0.096)         (0.163)
Interaction               0.047      0.189      −0.033     −0.090       0.183          −0.109
                         (0.058)    (0.120)     (0.089)    (0.139)      (0.151)         (0.201)

Note. Coefficients are measured in standard deviations. Standard errors are in parentheses. *p < 0.05. **p < 0.01.
As shown in Table 2.5, in the first iteration of each model, several of the gender and race/ethnicity main effects are statistically significant. However, when gender and race are interacted with the treatment, the interaction effects are not significantly different from zero. This indicates that there is no evidence of a difference in the treatment effect by gender or race/ethnicity.

With regard to the exploratory model, a mixed-effects logistic regression estimating the change in educational ambitions from fall to spring was conducted. Students in the treatment group were 20% more likely than the control group to increase their postsecondary aspirations. The coefficient on the treatment was 0.18 (standard error = 0.09; p = 0.05). If treatment students originally intended to attend a 2-year school, they were 1.2 times more likely at the end of the intervention to intend to attend a 4-year college than their control counterparts (see APPENDIX 2.H for the full model results, which are also available on the journal website).

Fidelity of Implementation

Table 2.6 shows the results of estimating the treatment effect using a three-level HLM. As seen in Table 2.6, the results are consistent with those from the two-level HLM above; therefore, it is appropriate to use a three-level model for the fidelity of implementation analysis. Table 2.7 shows the results of the fidelity of implementation mediation models. The composite teacher measure of "incorporation of PBL" was not a statistically significant mediator and accounted for only a small portion (11%) of the overall treatment effect (see APPENDIX 2.I for the full model results, which are also available on the journal website).

Table 2.6 Estimated Treatment Effect of the CESE System with Three Levels

Effect                                           (3)                (4)
Treatment effect                           0.211*** (0.059)   0.196** (0.060)
Additional controls                                                  x
Three levels (student, school, and teacher)       x                  x

Note. Treatment effect is the difference between the treatment and control group, measured in standard deviations. Standard errors are in parentheses. CESE = Crafting Engaging Science Environments. *p < 0.05. **p < 0.01. ***p < 0.001.

Table 2.7 Mediation Effects

Measure                            a         b       Indirect effect   95% Confidence    Total effect
                                                          (a * b)        interval        explained (a * b)/c
Teachers' incorporation of PBL    0.329     0.068         0.022       [−0.013, 0.071]          11%
                                 (0.126)   (0.055)       (0.021)
Students' use of modeling         0.234     0.234         0.055       [−0.008, 0.132]          28%
                                 (0.056)   (0.139)       (0.036)

Note. Standard errors are in parentheses. PBL = project-based learning.

However, students' reported use of modeling explains roughly 28% of the total treatment effect (and is significant at the 0.10 level). This was expected, as the teacher is ultimately responsible for guiding and supporting the students' modeling activities. That teachers made frequent use of modeling, one of the key scientific practices incorporated in the intervention, is a promising sign that this experience was a path through which the treatment worked.

Findings of Generalizability Index

When the generalizability index analysis was conducted using the Common Core of Data with seven covariates, the results show that the generalizability index for the entirety of the United States is 0.82. This indicates that the sample from this study is similar to the inference population (here, the United States) with regard to the covariates selected.
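As a rough illustration of how such an index can be computed, the sketch below fits a logistic model for membership in the study sample versus an inference population and then compares the two propensity-score distributions with a Bhattacharyya coefficient over bins, which is how Tipton (2014) defines the index. The data files, column names, and binning choice are assumptions for illustration; the study's own implementation is not shown in the text.

```python
# Sketch of a Tipton (2014)-style generalizability index from sampling
# propensity scores. File names, columns, and the 25-bin choice are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

covariates = ["enrollment", "pct_frl", "urbanicity",
              "pct_white", "pct_black", "pct_hispanic", "pct_asian"]

sample = pd.read_csv("sample_schools.csv")          # the study schools
population = pd.read_csv("population_schools.csv")  # eligible public schools (CCD)

pooled = pd.concat([sample.assign(in_sample=1), population.assign(in_sample=0)])
scores = LogisticRegression(max_iter=1000).fit(
    pooled[covariates], pooled["in_sample"]
).predict_proba(pooled[covariates])[:, 1]

mask = (pooled["in_sample"] == 1).to_numpy()
bins = np.linspace(scores.min(), scores.max(), 26)   # ~25 bins
w_s, _ = np.histogram(scores[mask], bins=bins)
w_p, _ = np.histogram(scores[~mask], bins=bins)
index = np.sum(np.sqrt((w_s / w_s.sum()) * (w_p / w_p.sum())))
print(f"generalizability index = {index:.2f}")       # 1.0 means identical distributions
```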
In this case, when statistical adjustments are used to find an average treatment effect, these would be approximately unbiased for the inference population. Discussion Results of this randomized controlled trial and its generalizability are an important contribution to science teaching and learning; its significant effects on over 4,000 diverse students in two different states are especially heartening given the few science interventions that have been rigorously tested at the high school level and shown to be effective at improving science learning. Results show that the intervention was effective at raising students’ science learning as measured by an independently developed summative assessment. There was no evidence that the intervention produced different effects based on students’ gender, race, or ethnicity. It is important to underscore what these results mean and what they do and do not conclude. Since there is no difference here among racial and ethnic groups on the effect of the treatment, this does not mean that the intervention is engaging history, culture, or race and 39 ethnicity sufficiently. One of the key principles of CESE states that science must be personally meaningful and of interest to the students to engage them in science. This happens by having students ask questions about scientific phenomenon and having them relate these ideas to their own lives. These principles of CESE and their enactment are quite different from the current critique of science learning and curriculum (see Lee et al., 2020; Pinkard et al., 2017; Rosebery et al., 2016). Additionally, CESE science experiences are deliberately designed for different inclusionary group activities. These activities are uniquely designed to involve multiple groups of students in problem solving, writing explanations, and becoming more informed about their world as well as learning from one another. Our results show that contrary to other interventions that often fail to positively affect all students, our theoretical framework and experiential learning opportunities should benefit all groups on average. The most important takeaway is that science academic performance related to the NGSS can be improved with an intervention that is created and implemented as a coherent system approach: an approach that includes teacher and student learning experiences, teacher professional learning, and formative unit assessments that incorporate the three dimensions, including—but not limited to— modeling and writing explanations. The treatment provided teachers with multiple professional learning experiences on how to enact PBL that underscored the importance of “doing science.” Many of the teacher practices emphasized in the treatment included having the students take primary ownership for solving problems, figuring out phenomenon, engaging in scientific practices, and learning the meaning of science concepts across multiple DCIs. Results shown in Table 2.6 indicate that “modeling” was an indirect pathway that affected students’ science achievement scores. It is no surprise that students who frequently 40 participated in modeling activities had an advantage on the summative assessment, as modeling is one of the eight scientific practices emphasized in the NGSS. However, modeling is not an experience that occurs independent of instructional opportunities. 
That the treatment students reported that modeling was a practice they used on multiple occasions suggests that a key principle of the intervention was being implemented in the classrooms. Our goal has been to help adolescents not only become more science literate but also to ignite a new deepened interest in science, which in our exploratory work, we were pleased to find. Our exploratory analyses indicated that the intervention changed students’ educational ambitions. This was encouraging, as educational ambitions are a key predictor for college enrollment (Schneider et al., 2016). Given that the treatment increased students’ college ambitions suggests that the intervention, with its emphasis on engaging in three dimensions of learning scientific knowledge to make sense of compelling phenomena or solve complex problems, may be a trigger for pursuing further education in science and other fields. Limitations of the Study The three units in this intervention lasted, on average, 12 to 16 weeks. Most science courses typically last longer and include additional areas of study. Had the intervention lasted longer and included more units, the treatment effects on science learning may have been larger. However, it could also be the case that students may have reached a saturation point in their exposure to scientific practices and that teachers would be unable to sustain the types of instruction used in this intervention. From interviews we conducted with treatment teachers who participated in earlier phases of this study, this did not appear to be the case; indeed, teachers reported that they found themselves using these practices in other units that were not part of the CESE curriculum. 41 The lack of an effect for teacher reports in the meditation model may be the result of the measures employed. Other methods may have provided a more robust measure, such as many more in-person observations of teachers with high interrater agreement between multiple observers and repeated student surveys. However, due to cost constraints, this was not a possibility. Finally, conducting a study of this magnitude—one that is a large-scale randomized control trial in two different states, and includes professional development and materials—is quite costly from financial, personnel considerations, and time-consuming. For future studies that look to expand and generalize this work to larger populations, these costs will be considerable. However, the promise of this intervention appears to be positive enough to warrant further investment. Conclusion Unquestionably, there is an immediacy for dramatically transforming science learning, especially given the health and environmental challenges young people are facing today and likely to face in the future. While recent major reforms to science teaching and learning have been met with enthusiasm and action at the state level, these reforms have not yet been widely adopted at the classroom level, particularly in high school chemistry and physics. This is in part because of the lack of aligned curriculum materials, professional learning, and assessments. In this respect, the results of this intervention are especially encouraging and merit further expansion. This intervention is grounded in the recommendations of A Framework for K–12 Science Education (NRC, 2012) and Science and Engineering for Grades 6 to 12 (National Academies of Sciences, Engineering, and Medicine, 2019), which incorporate the three dimensions of scientific knowledge. 
Our results suggest it is possible to change the scientific learning environments for 42 all students and expect positive science achievement results. However, this can only happen when there is a principled design system that involves not only just an engaging curriculum, but also high-quality professional learning and formative assessments designed to stimulate knowledge. If the science and science education communities intend to improve science learning, students need to work on relevant meaningful problems and participate in scientific practices similar to the actual work of scientists—such as “figuring out” phenomena, building and testing models that explain those phenomena, searching for patterns and connections in data, and uncovering cause and effect relationships. Notes This study is supported by the National Science Foundation (OISE-1545684; PIs Barbara Schneider and Joe Krajcik). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. Finnish authors were supported by the Academy of Finland (#298323; PIs Jari Lavonen and Katariina Salmela-Aro). We thank the following people for their contributions and consultation regarding this study: Larry Hedges (Northwestern University), Jeffery Wooldridge (Michigan State University), Mark Reckase (Michigan State University), Yiling Cheng (Kaohsiung Medical University), and Kalle Juuti and Janna Inkinen (University of Helsinki). 1The Finnish randomized controlled trial started a year later in 2019–2020 and had several interruptions because of the Covid-19 pandemic. Results from the Finnish efficacy study and U.S. control teachers who gave their new chemistry and physics classrooms the CESE treatment in 2019–2020 can be found in Maestrales, Dezendorf, et al. (2021). 43 2See Cooper (2020) for a description of the relationship between crosscutting ideas, system thinking, and modeling. 3Additional details can be found on the CESE website: https://sites.google.com/a/msu.edu/craftingengagingscience/home. 4On average for both chemistry and physics, respectively, the total professional learning was 40 hours. 5These estimates were based on Optimal Design Software (Spybrook et al., 2011), which calculates the number of clusters (i.e., schools) to power a study given the expected characteristics of the sample and a hypothesized effect size. This includes the number of observations (i.e., students), intraclass correlations, and R2 values from previous science assessments and covariates. These values were estimated based on information compiled by Spybrook et al. (2016), as well as our knowledge about the schools we expected to recruit. 6Because of space limitations, we have included an online APPENDIX for additional tables (also available on the journal website) and information. 7If pretests were missing at random, imputation would not increase efficiency. The standard errors did not increase after imputation. If the imputation reduced the standard errors, then we would have expected greater efficiency of our models, but this was not the case (Wooldridge, 2010). The student attrition rate was higher than that recommended by the What Works Clearinghouse (2020), so as a robustness check we imputed pretest scores for those who only had a summative assessment, bringing our attrition rates down. 
8To verify the quality of the pretest, a multinomial logistic regression and an item response theory (IRT) nominal response model were conducted to examine students’ response patterns by gender Cheng and Reckase [2020] for a fuller description of the pretest item 44 differentiation). Results showed no significant differences on test-level scores between genders, or significant gender differences for most of the distractors. High-achieving girls and boys also chose the correct answers at the same frequency. These findings resonate with previous findings of gender similarities (Hyde & Linn, 2006; Zell et al., 2015). 9A confidentiality agreement with Michigan Department of Education was signed and the team was not allowed to show the items used for the summative assessment. Student scores and other de-identified information were stored on a secured server at Michigan State University. 10The Psychometric technical report for the science test has been delayed because of Covid-19. However, a test of the reliability of the items is available on request. 11The physical science summative test scores for students taking chemistry or physics were made comparable using the R package equateIRT. The scores were then standardized for subjects and a 2pl model was analyzed. Then the two tests were equated with only the control students so that a treatment effect would not bias the equating process so that the students’ summative assessments for physics and chemistry would be comparable based on the distribution of the two sets of scores (see APPENDIX 2.J for the full table of item equivalence for the chemistry and physics summative assessments). 45 REFERENCES Bloom, H. (2005). Randomizing groups to evaluate place-based programs. MDRC. Blumenfeld, P. C., Soloway, E., Marx, R. W., Krajcik, J. S., Guzdial, M., & Palincsar, A. (1991). Motivating project-based learning: sustaining the doing, supporting the learning. Educational Psychologist, 26(3–4), 369–398. https://doi.org/10.1080/00461520.1991.9653139. Chen, C.-H., & Yang, Y.-C. (2019). Revisiting the effects of project-based learning on students’ academic achievement: A meta-analysis investigating moderators. Educational Research Review, 26(February), 71–81. https://doi.org/10.1016/j.edurev.2018.11.001. Cheng, Y., & Reckase, M. (2020). The effect of gender differences and similarities on science performance [Conference section]. American Educational Research Association Conference, San Francisco, CA, United States. https://www.aera.net/Events- Meetings/Annual-Meeting/Previous-Annual-Meetings/2020-Annual-Meeting. (Conference canceled). Cheung, A., Slavin, R. E., Kim, E., & Lake, C. (2016). Effective secondary science programs: A best-evidence synthesis. Journal of Research in Science Teaching, 54(1), 58–81. https://doi.org/10.1002/tea.21338. Condliffe, B., Quint, J., Visher, M. G., Bangser, M. R., Drohojowska, S., Saco, L., & Nelson, E. (2017). Project-based learning: A literature review (Working Paper). MDRC. https://eric.ed.gov/?id=ED578933. Maestrales, S., Zhai, X., Touitou, I., Baker, Q., Schneider, B., & Krajcik, J. (2021). Using machine learning to evaluate multidimensional assessments of chemistry and physics. Journal of Science Education and Technology, 30(2), 239–254. https://doi.org/10.1007/s10956-020-09895-9. National Academies of Sciences, Engineering, and Medicine. (2019). Science and engineering for Grades 6–12: Investigation and design at the center. The National Academies Press. National Assessment of Educational Progress. (2021). 
Results from the 2019 Science Assessment. U.S. Department of Education and the Institute of Education Sciences. National Center for Education Statistics. (2017). The condition of education 2017. Author. https://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2017144. National Research Council. (2000). How people learn: Brain, mind, experience, and school (Expanded ed.). National Academies Press. National Research Council. (2012). A framework for K–12 science education: Practices, crosscutting concepts, and core ideas. National Academies Press. 46 National Science Board. (2019). The skilled technical workforce: Crafting America’s science and engineering enterprise. https://www.nsf.gov/nsb/publications/2019/nsb201923.pdf . National Science Board. (2020). The state of U.S. science and engineering 2020. https://ncses.nsf.gov/pubs/nsb20201. National Science Teacher Association. (2017). NGSS hub. https://ngss.nsta.org/. NGSS Lead States. (2013). Next generation science standards: For states, by states. National Academies Press. https://epsc.wustl.edu/seismology/book/presentations/2014_Promotion/NGSS_2013.pdf. Organisation for Economic Co-operation and Development. (2020). Science performance (PISA)—indicator. https://data.oecd.org/pisa/science-performance-pisa.htm. Pinkard, N., Erete, S., Martin, C., & McKinney de Royston, M. (2017). Digital Youth Divas: Exploring narrative-driven curriculum to spark middle school girls’ interest in computational activities. Journal of the Learning Sciences, 26(3), 477–516. https://doi.org/10.1080/10508406.2017.1307199. Pituch, K. A., Murphy, D. L., & Tate, R. L. (2009). Three-level models for indirect effects in schooland class-randomized experiments in education. Journal of Experimental Education, 78(1), 60–95. https://doi.org/10.1080/00220970903224685. Polikoff, M. S., Campbell, S. E., Koedel, C., Le, Q. T., Haraway, T., & Gasparian, H. (2018). The formalized processes districts use to evaluate textbooks [University of Southern California Working Paper]. Raudenbush, S. W. (1997). Statistical analysis and optimal design for cluster randomized trials. Psychological Methods, 2(2), 173–185. https://doi.org/10.1037/1082-989X.2.2.173. Raudenbush, S. W., & Bryk, A. (2002). Hierarchical linear models (2nd ed.). Sage. Riegle-Crumb, C., & King, B. (2010). Questioning a White male advantage in STEM: Examining disparities in college major by gender and race/ethnicity. Educational Researcher, 39(9), 656–664. https://doi.org/10.3102/0013189X10391657. Riegle-Crumb, C., King, B., & Irizarry, Y. (2019). Does STEM stand out? Examining racial/ethnic gaps in persistence across postsecondary fields. Educational Researcher, 48(3), 133–144. https://doi.org/10.3102/0013189X19831006. Riegle-Crumb, C., Moore, C., & Ramos-Wada, A. (2011). Who wants to have a career in science or math? Exploring adolescents’ future aspirations by gender and race/ethnicity. Science Education, 95(3), 458–476. https://doi.org/10.1002/sce.20431. Rosebery, A. S., Warren, B., & Tucker-Raymond, E. (2016). Developing interpretive power in science teaching. Journal of Research in Science Teaching, 53(10), 1571–1600. https://doi.org/ 10.1002/tea.21267. 47 Sasson, I., Yehuda, I., & Malkinson, N. (2018). Fostering the skills of critical thinking and question-posing in a project-based learning environment. Thinking Skills and Creativity, 29(September), 203–212. https://doi.org/10.1016/J.TSC.2018.08.001. Schneider, B., Klager, C., Chen, I.-C., & Burns, J. (2016). 
Transitioning into adulthood: Striking a balance between support and independence. Policy Insights From the Behavioral and Brain Sciences, 3(1), 106–113. https://doi.org/10.1177/2372732215624932. Schneider, B., Krajcik, J., Lavonen, J., & Salmela-Aro, K. (2020). Learning science: The value of crafting engagement in science environments. Yale University Press. https://doi.org/10.2307/j.ctvwcjfk1. Sinatra, G. M., Heddy, B. C., & Lombardi, D. (2015). The challenges of defining and measuring student engagement in science. Educational Psychologist, 50(1), 1–13. https://doi.org/10.1080/00461520.2014.1002924. Spybrook, J., Bloom, H., Congdon, R., Hill, C., Martinez, A., & Raudenbush, S. W. (2011). Optimal design plus empirical evidence: Documentation for the “optimal design.” http://www.hlmsoft.net/ od/od-manual-20111016-v300.pdf Spybrook, J., Westine, C. D., & Taylor, J. A. (2016). Design parameters for impact research in science education: A multistate analysis. AERA Open, 2(1). https://doi.org/10.1177/2332858415625975. Tipton, E. (2014). How generalizable is your experiment? An index for comparing experimental samples and populations. Journal of Educational and Behavioral Statistics, 39(6), 478– 501. https://doi.org/10.3102/1076998614558486. van den Bergh, L., Ros, A., & Beijaard, D. (2015). Teacher learning in the context of a continuing professional development programme: A case study. Teaching and Teacher Education, 47(April), 142–150. https://doi.org/10.1016/j.tate.2015.01.002. What Works Clearinghouse. (2020). What Works Clearinghouse™: Standards handbook, version 4.1. Institute of Education Sciences. U.S. Department of Education. https://ies.ed.gov/ncee/wwc/Docs/referenceresources/WWC- Standards-Handbook-v4-1-508.pdf. Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data. MIT Press. Zell, E., Krizan, Z., & Teeter, S. R. (2015). Evaluating gender similarities and differences using metasynthesis. American Psychologist, 70(1), 10–20. https://doi.org/10.1037/a0038208. 48 APPENDIX 2.A BALANCE TABLES BETWEEN THE TREATMENT AND CONTROL T-test T-Value -0.96 0.68 1.50 Variable Total Enrollment Proportion free- reduced price lunch SAT composite SCHOOLS Table 2.A.1 Michigan Schools Data Balance Check MI: Treatment (n=12) MI: Control (n=12) SD Mean 883.10 338.40 151.00 Min Max 134.00 Mean 744.00 SD Min 337.80 276.00 1186.00 Max 0.42 0.23 0.11 0.87 0.39 0.22 0.16 0.86 -0.29 990.60 92.30 840.30 1166.90 1015.00 67.50 887.20 1076.90 Proportion White 0.58 0.29 0.11 0.94 0.76 0.23 0.20 0.95 0.00 0.24 0.30 Proportion Black Proportion Hispanic Proportion Asian -0.27 Source: Data comes from the Michigan Consortium for Educational Research (MCER) and The Common Core of Data (CCD). -1.60 0.03 0.11 0.07 0.11 0.04 0.80 0.25 0.10 0.10 0.00 0.00 0.41 0.48 0.02 0.03 0.14 0.09 0.09 0.03 0.00 0.03 0.29 Figure 2.A.1 Michigan Schools Data Balance Check 49 Table 2.A.2 Detroit Schools Data Balance Check Detroit: Treatment (n=6) Detroit: Control (n=6) Variable Mean SD Min Max Mean SD Min Max % of Free-reduced 0.66 0.14 0.51 0.83 0.75 0.13 0.58 0.90 SAT composite 872.46 120.16 754.62 1035.98 812.66 56.41 744.29 884.46 % of White % of African % of Hispanic 0.01 0.79 0.17 0.02 0.38 0.38 0.00 0.03 0.00 0.04 0.99 0.94 0.02 0.82 0.12 0.03 0.32 0.29 0.00 0.20 0.00 0.07 1.00 0.72 T- test T- value -1.15 1.10 -0.68 -0.15 0.26 % of Asian 0.09 0.00 Source: Data comes from the Michigan Consortium for Educational Research (MCER) and The Common Core of Data (CCD). 
0.00 0.07 0.03 0.00 0.03 0.04 0.18 Figure 2.A.2 Detroit Schools Data Balance Check 50 Table 2.A.3 Los Angeles Schools Data Balance Check Los Angeles Schools Data Balance Check LA: Treatment group (n=12) LA: Control group (n=13) Variable Mean SD Min Max Mean SD Min Max T- test T- value 0.43 0.96 0.68 0.85 37.06 500.00 2547.93 2437.00 1262.17 670.47 Total enrollment % of Free- reduced Grade 11 math score % of White % of African % of Hispanic % of Asian 0.01 Source: https://www.caschooldashboard.org/#/Home. 2478.80 2595.20 0.20 0.05 0.00 0.05 0.00 0.86 0.02 0.63 0.17 0.22 0.06 0.98 0.03 0.04 0.47 1676.77 653.95 409.00 2531.00 1.56 0.86 0.44 0.75 0.93 0.26 2544.89 38.87 2491.40 2605.70 -0.19 0.04 0.10 0.78 0.05 0.02 0.06 0.16 0.04 0.50 0.20 0.55 0.10 0.15 0.42 0.98 0.16 -0.23 1.43 -1.30 1.49 Figure 2.A.3 Los Angeles Schools Data Balance Check 51 Total enrollment % of Free- reduced Grade 11 math 2607.50 0.35 Table 2.A.4 San Diego Schools Data Balance Check San Diego Schools Data Balance Check San Diego San Diego Treatment group (n=8) Control group (n=8) Variable Mean SD Min Max Mean SD Min Max T- test T- value 2222.38 550.64 1257.00 3051.00 2022.56 686.40 396.00 2739.00 -0.75 0.25 0.08 0.71 0.32 0.29 0.03 0.74 58.84 2530.20 2713.30 % of White % of African % of Hispanic 0.14 0.03 0.59 0.17 0.02 0.31 0.01 0.01 0.07 0.52 0.06 0.92 2609.70 0.31 0.02 0.45 68.23 2521.40 2714.70 0.28 0.01 0.37 0.01 0.01 0.07 0.75 0.03 0.93 0.22 -0.12 -1.12 0.00 -0.22 % of Asian 0.54 Sources: California Department of Education; https://www.caschooldashboard.org/#/Home. 0.15 0.00 0.21 0.18 0.60 0.00 0.10 1.42 Figure 2.A.4 San Diego Schools Data Balance Check 52 APPENDIX 2.B TEACHER EXIT SURVEY ITEMS DEALING WITH THE TEACHER UNITS, PRACTICES, AND CURRICULUM Items: 1. How familiar are you with the principles of project-based learning in science? a. Very familiar b. Somewhat familiar c. Not at all familiar 2. How often do you incorporate project-based learning in your science teaching? a. Frequently b. Sometimes c. Rarely d. Never 3. How often do you employ the following teaching practices (1-never or almost never, 2- occasionally, 3-frequently, 4-in all or almost all lessons)? a. I present a summary of recently learned content b. Students work in small groups to come up with a joint solution to a problem or task c. I give different work to the students who have difficulties learning and/or to those who can advance faster d. I refer to a problem from everyday life or work to demonstrate why new knowledge is useful e. I let students practice similar tasks until I know that every student has understood the subject matter f. I check my students’ exercise books or homework g. Students work on projects that require at least one week to complete h. Students use information and communication technology for projects or class work I expect students to explain their thinking on complex problems I give students a choice of problems to solve i. j. k. I connect science concepts I teach to uses of those concepts outside of school I encourage students to solve problems in more than one way l. 4. How often do you employ the following scientific practices (1-never or almost never, 2- occasionally, 3-frequently, 4-in all or almost all lessons)? a. I guide students to ask questions b. I guide students to define problems c. I guide students to develop models d. I guide students to plan investigations e. I guide students to conduct investigations f. I guide students to interpret data g. I guide students to solve problems h. 
I guide students to construct an explanation i. I guide students to use evidence to make an argument 53 I guide students to communicate information j. k. I guide students in having them present their explanations and models l. I guide students in construction products related to the work they do in class. 5. How often do you employ the following classroom activities (1-never or almost never, 2- occasionally, 3-frequently, 4-in all or almost all lessons)? a. I guide students to work on a computer b. I guide students to work in a group c. I guide students to work in a lab d. I guide students to solve math problems 6. How often do you employ the following teaching practices (1-two to 5 times a week, 2- about once per week, 3-twice a month, 4- once per month or less)? a. Hands-on experiments b. Create or use models c. Assign textbook d. Use the internet to find answers for science e. Use computational modeling, CAD programs, or other modeling software 7. What textbooks did you use in your classes this year? 8. Do you write all of your own lesson plans or do you get them from somewhere else? a. I write all of my own lesson plans b. I write some of my own lesson plans c. I do not write my own lesson plans 9. Where do you get your lesson plans or your inspiration for lessons plans? a. District or Department Standard Lesson Plans b. Other teachers & colleagues c. Internet search d. Other 54 APPENDIX 2.C RACE HETEROGENEITY MODEL 𝑌𝑖𝑗 = 𝛾00 + 𝛾01 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗 + 𝛾11 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗𝑥𝑊ℎ𝑖𝑡𝑒𝑖𝑗 + 𝑆𝑗𝜕𝑗 + 𝛾𝑖0 𝑋𝑖𝑗 + 𝑀𝑖𝑗 𝑎𝑖0 + 𝑟0𝑗 + 𝜖𝑖𝑗 𝑌𝑖𝑗 = 𝛾00 + 𝛾01 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗 + 𝛾11 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗𝑥𝐵𝑙𝑎𝑐𝑘𝑖𝑗 + 𝑆𝑗𝜕𝑗 + 𝛾𝑖0𝑋𝑖𝑗 + 𝑀𝑖𝑗 𝑎𝑖0 + 𝑟0𝑗 + 𝜖𝑖𝑗 𝑌𝑖𝑗 = 𝛾00 + 𝛾01 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗 + 𝛾11𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗𝑥𝐻𝑖𝑠𝑝𝑎𝑛𝑖𝑐𝑖𝑗 + 𝑆𝑗𝜕𝑗 + 𝛾𝑖0 𝑋𝑖𝑗 + 𝑀𝑖𝑗 𝑎𝑖0 + 𝑟0𝑗 + 𝜖𝑖𝑗 𝑌𝑖𝑗 = 𝛾00 + 𝛾01 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗 + 𝛾11 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗𝑥𝐴𝑠𝑖𝑎 𝑛𝑖𝑗 + 𝑆𝑗𝜕𝑗 + 𝛾𝑖0𝑋𝑖𝑗 + 𝑀𝑖𝑗 𝑎𝑖0 + 𝑟0𝑗 + 𝜖𝑖𝑗 𝑌𝑖𝑗 = 𝛾00 + 𝛾01 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗 + 𝛾11𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗𝑥𝑂𝑡ℎ𝑒𝑟𝑅𝑎𝑐 𝑒𝑖𝑗 + 𝑆𝑗𝜕𝑗 + 𝛾𝑖0 𝑋𝑖𝑗 + 𝑀𝑖𝑗 𝑎𝑖0 + 𝑟0𝑗 + 𝜖𝑖𝑗 𝑌𝑖𝑗 = 𝛾00 + 𝛾01 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗 + 𝛾11𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗𝑥𝑀𝑢𝑙𝑡𝑖𝑅𝑎𝑐𝑖𝑎 𝑙𝑖𝑗 + 𝑆𝑗𝜕𝑗 + 𝛾𝑖0 𝑋𝑖𝑗 + 𝑀𝑖𝑗 𝑎𝑖0 + 𝑟0𝑗 + 𝜖𝑖𝑗 55 APPENDIX 2.D MEDIATION MODEL First, we estimate the overall treatment effect controlling for the mediator of interest to find paths b (𝛾010 ) and c (𝛾001 ): 𝑌𝑖𝑗𝑘 = 𝛾000 + 𝛾001 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑘 + 𝛾010 𝑀𝑒𝑑𝑖𝑎𝑡𝑜 𝑟𝑗𝑘 + 𝑆𝑘𝜕𝑘 + 𝛾020 𝐶ℎ𝑒𝑚𝑖𝑠𝑡𝑟 𝑦𝑗𝑘 + 𝛾𝑖00 𝑋𝑖𝑗 + 𝑀𝑖𝑗 𝑎𝑖00 + 𝜇00𝑘 + 𝑟0𝑗𝑘 + 𝜖𝑖𝑗𝑘 Next, we estimate a two-level model in which the treatment indicator predicts the indirect effect on the outcome which accounts for some proportion of the overall treatment effect, c. Here, pathway a is 𝛾001 . 𝑀𝑒𝑑𝑖𝑎𝑡𝑜 𝑟𝑗𝑘 = 𝛾000 + 𝛾001 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑘 + 𝑆𝑘𝜕𝑘 + 𝛾020 𝐶ℎ𝑒𝑚𝑖𝑠𝑡𝑟 𝑦𝑗𝑘 + 𝜇00𝑘 + 𝑟0𝑗𝑘 Confidence intervals for the multilevel mediation effects were computed using an empirical M-test as outlined by Tofighi and Mackinnon (2011). 56 APPENDIX 2.E EDUCATION AMBITION MODEL The model for students’ educational ambition 𝑙𝑜𝑔𝑖𝑡(𝜋𝑖𝑗) = 𝛾00 + 𝛾01 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗 + 𝑆𝑗𝛿𝑗 + 𝑋𝑖𝑗 𝛾𝑖0 + 𝑟0𝑗 +𝜖𝑖𝑗 where 𝜋𝑖𝑗 is the probability that the binary indicator showing that a students’ college ambition increased from the fall to the spring is equal to one. 𝛾00 is the likelihood of the control students increasing their college ambition. 𝛾01 is the difference of the likelihood between the treatment and control group. 𝑆𝑗 are the school level covariates, including school pretest mean and region and 𝜕𝑗 are the coefficients on those covariates. 𝑋𝑖𝑗 are the individual level covariates, including pretest, course (chemistry or physics), gender, and race/ethnicity of the students and 𝛾𝑖0 are the coefficients on these individual level covariates. 
𝑟0𝑗 and 𝜖𝑖𝑗 are the school and student level error terms. 57 APPENDIX 2.F FULL TREATMENT EFFECTS ESTIMATES Table 2.F.1 Full Treatment Effect Estimates Treatment Standardized Pretest Score Chemistry Region 2 Region 3 Region 4 School Mean Pretest Female Missing Sex Black Hispanic Asian Other Race Multiple Races Missing Race Constant Random Effects Variances School Teacher Student (1) 0.220*** (0.064) 0.283*** (0.022) -0.522*** (0.116) -0.126+ (0.075) -0.1 (0.09) 0.009 (0.09) 0.220+ (0.123) 0.311* (0.127) (2) 0.208** (0.065) 0.269*** (0.021) -0.514*** (0.115) 0.003 (0.079) 0.026 (0.105) 0.125 (0.1) 0.179 (0.12) -0.038 (0.033) -0.014 (0.118) -0.191* (0.077) -0.253*** (0.067) 0.005 (0.077) -0.105 (0.084) -0.137 (0.093) -0.257** (0.095) 0.409** (0.129) 0.038*** (0.005) 0.038*** (0.004) 0.768*** (0.030) 4238 0.761*** (0.030) 4238 58 N Note. Standard errors in parentheses. +p < 0.10 *p < 0.05 **p < 0.01 ***p < 0.001 APPENDIX 2.G FULL HETEROGENEITY RESULTS Table 2.G.1 Full Heterogeneity Results Model 2 Female Black Hispanic Asian Other Multi Treatment Treatment x Interaction Female Missing female Race Black Hispanic Asian Multi Other Missing race Pretest average School Pretest Chemistry Region Detroit LA San Diego .208** (0.065) not in model -0.038 (0.033) -0.014 (0.118) .185** (0.071) 0.047 (0.058) -.0608* (0.031) -0.014 (0.119) -0.191* (0.118) -.1914* (0.077) -.250*** (0.067) 0.005 (0.077) -0.137 (0.093) -0.105 (0.084) -.257** (0.095) .269*** (0.021) 0.179 (0.120) -.510*** (0.115) -.250*** (0.067) 0.005 (0.077) -0.137 (0.093) -0.106 (0.084) -.255** (0.095) .269*** (0.021) 0.182 (0.120) -.510*** (0.115) 0.003 (0.079) 0.027 (0.105) 0.125 (0.010) 0.003 (0.079) 0.028 (0.105) 0.125 (0.099) .192** (0.069) 0.189 (0.120) -0.039 (0.033) -0.011 (0.116) -.280** (0.087) -.250*** (0.068) 0.010 (0.075) -0.137 (0.093) -0.103 (0.085) -.256** (0.092) .269*** (0.021) 0.179 (0.118) -.510*** (0.114) -0.019 (0.078) 0.021 (0.107) 0.120 (0.102) .222** (0.085) -0.033 (0.089) -0.038 (0.033) -0.017 (0.117) -0.190* (0.077) -.238** (0.084) 0.005 (0.077) -0.137 (0.093) -0.105 (0.084) -.256** (0.095) .269*** (0.021) 0.178 (0.119) -.510*** (0.115) -0.001 (0.080) 0.026 (0.105) 0.125 (0.010) .212** (0.064) -0.090 (0.140) -0.038 (0.33) -0.015 (0.119) -.193* (0.076) -.250*** (0.067) 0.044 (0.062) -0.137 (0.093) -0.105 (0.084) -.257** (0.095) .270*** (0.021) 0.178 (0.120) -.510*** (0.115) 0.004 (0.079) 0.026 (0.105) 0.125 (0.099) .205** (0.065) 0.183 (0.151) -0.038 (0.033) -0.013 (0.118) -.190* (0.078) -.253*** (0.067) 0.004 (0.077) -0.137 (0.093) -0.188 (0.096) -.257** (0.095) .269*** (0.021) 0.179 (0.120) -.510*** (0.115) 0.002 (0.079) 0.026 (0.105) 0.124 (0.100) .211** (0.066) -0.109 (0.201) -0.038 (0.033) -0.014 (0.119) -.190* (0.076) -.250*** (0.067) 0.006 (0.077) -0.086 (0.163) -0.104 (0.083) -.257** (0.095) .270*** (0.021) 0.179 (0.120) -.510*** (0.115) 0.003 (0.079) 0.026 (0.104) 0.125 (0.100) Note: standard errors are in parentheses. 
*p<0.5 **p<0.01 ***p<0.001 59 APPENDIX 2.H COLLEGE AMBITION FULL RESULTS Table 2.H.1 College Ambition Full Results Treatment Female Chemistry Standardized pretest score Coefficient β 0.18* (0.09) -0.01 (0.09) 0.10 (0.10) 0.01 (0.06) -0.23 (0.16) Standardized school average pretest score Race (White non-Hispanic is default) Hispanic Black Asian Other Multiracial Region (Detroit is default) LA Michigan San Diego 0 (0.14) -0.24 (0.22) -0.01 (0.21) -0.09 (0.31) -0.29 (0.24) -0.19 (0.23) -0.12 (0.23) -0.12 (0.22) Odds Ratio e^(β) 1.20 0.99 1.11 1.01 0.79 1.00 0.79 0.99 0.91 0.75 0.83 0.89 0.89 Note. Standard errors are in parentheses. *p<0.05, **p<0.01 ***p<0.001. Changed Educational Ambition omits students who said they did not know. High school and less than high school were in one category, Trade School and Community College are combined in another, and the outcome is the difference of the end of year less the beginning of the year. 60 APPENDIX 2.I FULL MEDIATION RESULTS Table 2.I.1 Full Mediation Model Treatment Mediator Chemistry Standardized Pretest Score School Mean Pretest Region 2 Region 3 Region 4 Constant Random Effects - Variances School Path A - PBL 0.361** (0.135) -0.164 (0.115) 0.016 (0.148) 0.427* (0.180) 0.335+ (0.186) -0.291 (0.181) -0.099 (0.141) Path B - PBL 0.193** (0.060) 0.059 (0.050) -0.528*** (0.092) 0.284*** (0.019) 0.349*** (0.065) -0.187* (0.083) -0.066 (0.086) 0.017 (0.079) 0.311** (0.115) Path A - Models 0.224*** (0.061) -0.327*** (0.055) -0.031 (0.062) -0.094 (0.118) 0.061 (0.087) -0.063 (0.081) 0.110+ (0.067) Path B - Models 0.191** (0.063) 0.171 (0.134) -0.446*** (0.101) 0.283*** (0.019) 0.403*** (0.064) -0.122 (0.084) -0.081 (0.090) 0.026 (0.084) 0.242* (0.123) 0.000 (0.000) 0.000 (0.000) 0.021*** (0.005) 0.000*** (0.000) Teacher 0.085*** (0.009) 0.724*** (0.028) Note. Standard errors in parentheses. +p < 0.10 *p < 0.05 **p < 0.01 ***p < 0.001. 0.087*** (0.032) 0.723 (0.147) 0.040*** (0.005) 0.394*** (0.025) Student 61 APPENDIX 2.J ITEM EQUIVALENCE FOR THE CHEMISTRY AND PHYSICS SUMMATIVE ASSESSMENTS In order to equate the chemistry summative assessment and the physics summative assessment, two separate summative assessments with different items, we used a 2pl polynomial model, which uses two parameters to estimate the item difficulty and student ability, to first generate the distributions of the two tests. The chemistry summative assessment contained 25 items; however, every student incorrectly answered one of the items, item 11. Therefore, this item was unable to be used to differentiate between stud ent ability. Because of this, we removed the item from the analysis, leaving the chemistry summative assessment with 24 items. The physics summative assessment contained 12 items. We then used the equateIRT package in R to obtain the following table. The equating is based only on the chemistry control group, not including the treatment group to avoid possible issues caused by the treatment effect within the analysis. 
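Conceptually, this kind of equating maps each chemistry summed score to the ability level implied by the chemistry form's test characteristic curve and then reads off the expected score on the physics form at that same ability. The sketch below illustrates that true-score logic with invented 2PL item parameters already placed on a common scale; it is not the study's equateIRT code, whose specific linking options are not reported here, and the actual equated scores appear in Table 2.J.1 below.

```python
# Illustrative IRT true-score mapping between two forms calibrated on a common
# ability scale. Item parameters below are made up for the sketch.
import numpy as np
from scipy.optimize import brentq

def tcc(theta, a, b):
    """Test characteristic curve of a 2PL form: expected summed score at theta."""
    return np.sum(1.0 / (1.0 + np.exp(-a * (theta - b))))

# Hypothetical discrimination (a) and difficulty (b) parameters for each form.
a_chem, b_chem = np.full(24, 1.0), np.linspace(-2.0, 2.0, 24)
a_phys, b_phys = np.full(12, 1.2), np.linspace(-0.5, 3.0, 12)

for score in range(1, 24):   # interior chemistry scores (the maximum cannot be bracketed)
    # Find the ability at which the chemistry form yields this expected score,
    theta = brentq(lambda t: tcc(t, a_chem, b_chem) - score, -8, 8)
    # then read off the expected physics score at that same ability.
    print(score, round(tcc(theta, a_phys, b_phys), 2))
```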
62 Table 2.J.1 Chemistry and Physics Score Transformation: Norm Table Chemistry score (24 items) Equating: estimated score in physics (SE) -0.486 (.014) -0.390 (.114) 0.069 (.587) 0.614 (.201) 1.047 (.590) 1.610 (.290) 2.000 (.593) 2.560 (.364) 2.906 (.577) 3.395 (.878) 3.769 (0.587) 4.184 (0.770) 4.636 (0.659) 5.058 (0.778) 5.538 (0.806) 6.007 (0.845) 6.489 (0.885) 7.017 (0.939) 7.495 (0.87) 8.076 (1.023) 8.568 (1.483) 9.184 (1.053) 9.743 (1.59) 10.345 (0.971) Score 1 Score 2 Score 3 Score 4 Score 5 Score 6 Score 7 Score 8 Score 9 Score 10 Score 11 Score 12 Score 13 Score 14 Score 15 Score 16 Score 17 Score 18 Score 19 Score 20 Score 21 Score 22 Score 23 Score 24 Note. Because the physics exam was more difficult than the chemistry assessment, we see that a score of 1-4 on the chemistry exam is equivalent to a score of 0 on the physics assessment. In addition, a perfect score in chemistry is only equivalent to a score of 10 in physics. We used these equated scores for the final analysis of the main effect for this intervention. 63 CHAPTER 3: USING MACHINE LEARNING TO SCORE MULTIDIMENSIONAL ASSESSMENTS OF CHEMISTY AND PHYSICS Abstract In response to the call for promoting three-dimensional science learning (NRC, 2012), researchers argue for developing assessment items that go beyond rote memorization tasks to ones that require deeper understanding and the use of reasoning that can improve science literacy. Such assessment items are usually performance-based constructed responses and need technology involvement to ease the burden of scoring placed on teachers. This study responds to this call by examining the use and accuracy of a machine learning text analysis protocol as an alternative to human scoring of constructed response items. The items we employed represent multiple dimensions of science learning as articulated in the 2012 NRC report. Using a sample of over 26,000 constructed responses taken by 6700 students in chemistry and physics, we trained human raters and compiled a robust training set to develop machine algorithmic models and cross-validate the machine scores. Results show that human raters yielded good (Cohen’s k = 0.40 – 0.75) to excellent (Cohen’s k > 0.75) interrater reliability on the assessment items with varied numbers of dimensions. A comparison reveals that the machine scoring algorithms achieved comparable scoring accuracy to human raters on these same items. Results also show that responses with formal vocabulary (e.g., velocity) were likely to yield lower machine-human agreements, which may be associated with the fact that fewer students employed formal phrases compared with the informal alternatives. 64 Introduction and Literature Review Measuring science knowledge and achievement has long been an important topic in science education research. The National Research Council ([NRC], 2012) has spelled out what they call three-dimensional learning to better facilitate student knowledge development and meet the demands of a modern STEM workforce. Three-dimensional learning encourages knowledge- in-use that can be generalized and used across multiple scientific fields to successfully meet the rapidly changing demands of science and technology careers adapted to the emerging issues of the twenty-first century (Harris et al., 2019; Haudek et al., 2019). 
This concept of knowledge-in- use occurs when students apply disciplinary core ideas (DCIs) in tandem with science and engineering practices (SEPs) and crosscutting concepts (CCs) to solve problems or make sense of phenomena. While the Framework for K-12 Science Education (NRC, 2012) presents a promising new vision of science learning, assessing the three-dimensionality of science learning is challenging (see NRC assessment report [2014] and the National Academies of Sciences, Engineering, and Medicine report [2019]). Multiple-choice questions are used ubiquitously in national, state, and classroom assessments of science achievement. However, these multiple-choice assessments typically rely on memorization of key concepts and thus have difficulty meeting the needs for assessing knowledge- in-use learning. Instead of strictly using multiple-choice questions, assessments should incorporate items that use the three dimensions of learning through a variety of task formats including constructed response (CR), which requires students to use their knowledge to solve problems with scientific practices (e.g., Harris et al., 2019). Unfortunately, CR is both time and resource consuming to score compared with multiple- choice items, and thus, teachers may not be willing to implement CR items in their classrooms. 65 Approaches that employ machine learning have shown great potential in automatically scoring CR assessments (Zhai et al., 2020a). As indicated in a recent review study (Zhai et al., 2020c), machine learning has been adopted in many science assessment practices using CRs, essays, educational games, and interdisciplinary assessments (e.g., Lee et al., 2019a; Nehm et al., 2012). More importantly, machine scoring can provide automatic and immediate feedback to students and teachers, with the potential to accelerate the use of three-dimensional assessment practices in classrooms, benefiting science learning (Lee et al., 2019b). While the potential of machine learning has been recognized, few studies have tackled the true challenge of scoring CR items on multi-dimensional science assessments. There are relatively few studies applying machine learning to analyze assessment items in which students perform tasks that require the use of multiple dimensions of scientific knowledge to make sense of phenomena (Zhai et al., 2020a). In addition, none of the studies explicitly document whether and how these assessments measure the dimensionalities of science learning. We examine the capacity of machine learning to automatically score multi-dimensional science assessments, contrasted with human scorers, using a large database of student CRs. We highlight how the machine agreement changes as we increase the size of the training set as compared with human rater agreement. Additionally, we discuss some of the complex challenges for achieving high agreement between human and machine algorithms when scoring multi-dimensional assessments, including clarifying rubrics for improved agreement between human and machine scoring and treatment of missing and outlier responses and scores. This study answers three questions: (1) How reliable were human raters in scoring multi-dimensional responses? (2) Could machine learning algorithms score multi-dimensional assessments as accurately as 66 humans? and (3) How are key phrases in student responses associated with machine scoring of the multi-dimensional assessments? 
Defining the Dimensions of Learning Building on the NRC reports, science researchers are calling for more comprehensive assessments that gauge students’ abilities to use knowledge to explain phenomena, solve real- world problems, engage in creative and critical thinking, and analyze and interpret scientific data (Haudek et al., 2019). According to Pellegrino (2013), assessments that include CR should reflect the principles spelled out in A Framework for K-12 Science Education (NRC, 2012) and the Next Generation Science Standards (NGSS Lead States [NGSS], 2013), by carefully considering and identifying which of the three dimensions of learning each question is being designed to measure. In contrast to multiple choice, CR items are often difficult to develop and to score. Most studies that include them have not met the challenge of specifying and connecting the three dimensions of learning with their assessment items. The NRC Framework for K-12 Science Education (2012) recommends that the learning and instruction of science throughout K-12 should integrate SEPs, DCIs, and CCs to make sense of phenomena. The NGSS puts this recommendation into practice by creating three-dimensional performance standards or expectations explaining the key concepts and skills students should be able to use at a given grade level (NGSS Lead States, 2013). Each dimension of learning has its own grade-based expectations. Measuring SEPs offers insight into how students use practices employed by scientists and engineers in the field, such as gathering and obtaining information and using it in argumentation (NRC, 2012). The CCs are considered “crosscutting” because they are concepts used to aid in examining phenomena and solving problems across all fields of 67 science and engineering (NRC, 2012). In turn, assessments should measure how students apply disciplinary knowledge through SEPs and CCs. Creating and Scoring Multi-Dimensional Science Assessments Despite the obvious advantages of CR in gaining more in-depth insights into student understanding, they are used somewhat less frequently than multiple-choice because scoring CR requires considerable time and effort. However, it is important to create assessments that capture students’ use of the three dimensions of science learning, and machine learning offers the potential to facilitate scoring, making these assessments more tenable for classroom and research purposes. Yet, three-dimensional items can be complicated to score as they must evaluate students’ knowledge of the subject matter through the DCIs and provide insights about how students develop, understand, and use that knowledge. To score them, one must be familiar, not only with core content, but also with how the dimensions of learning work together and are being measured. These constraints prevent the broad use of three-dimensional tests that rely on CRs for science learning. Because scoring CRs requires considerable time and effort on behalf of human raters, this can potentially elongate the period before students receive feedback (Ha & Nehm, 2016). However, studies (e.g., Lottridge et al., 2018) suggest that it should be possible to decrease human rater costs through automated machine learning algorithms. This would allow teachers and researchers to collect more detailed data on students’ science knowledge with scoring costs comparable with those of multiple-choice assessments. 
Scientists are building rubrics to measure aspects of three-dimensional learning related to human and machine scoring, but what seems to have received less attention is attributing the outcomes of the algorithm to the dimensions being scored. This is an important concept that needs to be examined in greater depth as it requires 68 students to develop knowledge across subject domains. Outlining the dimensions associated with each item is one of the key contributions of our study. Further exploration of human and machine scoring for three-dimensional learning CRs needs to incorporate DCIs, SEPs, and CCs while recognizing the complexity and difficulty of this particular task. Applications of Machine Learning in K-12 Science Assessment The number of studies on the use of machine learning to score science assessments is increasing as the technology becomes more accessible and recent studies show considerable promise for machine scoring for science CRs across multiple age groups. A prior review (Zhai et al., 2020c) suggests that various researchers have used more than 20 programs or platforms to study the automatic scoring of science learning assessments. For example, researchers from ETS (Liu et al., 2014) applied machine learning for scoring student CRs that explained science phenomena through multi-dimensional reasoning with “c-rater,” an automated machine algorithmic program. The Liu team from ETS developed multi-level rubrics and found that the machine was capable of automatically scoring student responses, achieving moderate to large Cohen’s kappa (k) values between the machine and human scorers. They found that “c-rater” could capture valid ideas in students’ responses and provide nuanced information about their performance. Other scientists collaborated with ETS and explored the classroom applications of c-rater-ML in several projects. Mao et al. (2018) applied the c-rater-ML to automatically evaluate students’ written argumentation to provide automatic feedback to students. The machine feedback could assist students to revise their arguments. In another study, Zhu et al. (2017) reported that more than 77% of students made revisions after receiving machine feedback and those who revised their responses received higher scores than the others in the final test. 69 The earliest development of programs besides c-rater is the SPSS Text Analysis (Nehm & Haertig, 2012), which required humans to develop word libraries manually. This program is costly and labor-intensive for users. The Summarization Integrated Development Environment (SIDE) developed by the TELEDIA lab at Carnegie Mellon University is the first free machine- learning-based confirmatory analysis program used in science education (Mayfield & Rosé, 2010, 2013). Its successor, Light Summarization Integrated Development Environment (LightSIDE), is more user-friendly, flexible, and accessible to public users and has been frequently used in studies. Other open sources such as RapidMiner or Weka are all popular in automatically scored science assessments. However, most of these programs applied individual algorithms each time. If one algorithm does not work well, users can choose another. Instead of a single algorithm, this study employed the Automated Analysis of Constructed Response (AACR) Web Portal (AACR, 2020) to automatically score students’ CRs using multiple algorithms. 
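As a rough illustration of what scoring with several text-classification algorithms at once can look like, the sketch below trains three stand-in classifiers on n-gram features and lets them vote. It is not the AACR portal's implementation; the classifiers, feature settings, and example responses are assumptions chosen only for demonstration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Three stand-in algorithms voting on each response; the AACR portal's own
# eight algorithms and its cross-validated weighting scheme are not reproduced here.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("nbayes", MultinomialNB()),
        ("svm", LinearSVC()),
    ],
    voting="hard",  # simple majority vote across the fitted classifiers
)

# Unigram and bigram counts feed every classifier in the ensemble.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), ensemble)

# Hypothetical human-scored training examples (1 = incorrect, 2 = correct, 3 = MDC).
responses = [
    "the truck is going the same speed as the car",
    "the truck is slowing down",
    "both vehicles travel at 55 mph so their relative speed is zero",
]
scores = [2, 1, 3]

model.fit(responses, scores)
print(model.predict(["they move at the same velocity"]))
```

In practice, a weighted ensemble of this kind tends to be more stable than any single classifier when the training set is small, which is the situation described for these constructed-response items.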
Different from most commercial automatic scoring programs, which fit one type of algorithm at a time, the AACR Web Portal developed eight algorithms that can be employed simultaneously. The AACR scoring Web Portal was developed to serve classroom needs for formative assessment purposes. Currently, it is used for exploratory and confirmatory factor analysis in item development. For exploratory analysis purposes, AACR can be applied with unsupervised machine learning to identify patterns and lexical features of student responses. Based on the findings, researchers can revise their items and rubrics iteratively. A confirmatory analysis is used in the late stage of item development to develop and validate the machine algorithms. Based on the predictions of each algorithm, derived from the cross-validation, the machine assigns weights toward the algorithm that best optimizes the algorithmic parameters. The 70 ensemble approach has been tested and compared with other general classifiers. Extensive experimentation by Large, Lines, and Bagnall (2019) reveals that the ensemble approach has measurable benefits over other classifiers such as alternative weighting, selection, or meta- classifier approaches. More importantly, the ensemble approach outperforms other classifiers with small training datasets. While machine learning is being used more frequently and continued research leads to improved reliability, a question remains about what steps researchers should take to move from human to machine scoring of multi-dimensional assessments. Sample Methods To begin this process, for new and untested items, a large database of student responses is necessary. “Crafting Engaging Science Environments (CESE),” an ongoing science intervention with 6700 high school students in California and the Midwest, provided such a database. Within the CESE sample, 48.5% were male students and 51.5% were female students, 29.2% of students identified their race as white, 47.5% identified their race as Hispanic, 11.9% as black, and 5.0% as Asian. Almost three quarters of students (74%) reported speaking Spanish in the home. To measure baseline science understanding at the beginning of the intervention, CESE relied on an assessment developed using National Assessment of Education Progress (NAEP) open-source science test-bank items. All 6700 students took the test, and this yielded 26,800 constructed responses to four CR items. With 26,800 responses requiring classification, it would be possible to learn if machine scoring was a viable option. To conserve resources, the team decided to learn if AACR could score the CRs with the same reliability as human raters. 71 Instruments and Measures CESE adopted four NAEP questions in this project. The questions tapped phenomena in chemistry or physics using everyday scenarios. Though these test questions were not initially designed to be three-dimensional as those illustrated in The Framework (NRC, 2012), multiple dimensions of science learning were detected in the responses to these items. As shown in Fig. 3.1, when reviewing the responses to each question, between 14% and 55% of students responded using multi-dimensional reasoning (i.e., the use of DCIs, SEPs, and CCs associated with the NGSS performance expectations) without being prompted. To accommodate this, three response classifications including incorrect, correct, and “multi-dimensional correct (MDC)” were adapted from the original binary rubrics. 
The MDC rating was awarded only if a student was able to demonstrate reasoning with regards to associated DCIs, CCs, or SEPs. Figure 3.1 shows the distribution of incorrect, correct, and MDC responses used by students on the assessment, as identified by human scorers. Figure 3.1 Dimensionality of Student Responses The research team partnered with third party scientists to further verify the two- and three-dimensional rubrics for the CRs. The newly developed rubrics allowed the project team to use the existing items to probe students’ learning of DCIs, CCs, and SEPs. Explanations of items 72 and their rubrics are given below, while more complete details of the items, rubrics, sample responses, and associated dimensions of learning are shown in Appendices A—D. Item 1: Experimental design, shown in APPENDIX 3.A, asks students to identify the error in an experiment where a student tests three different shoes, each on a different floor, to determine which had the highest coefficient of friction. Students’ responses included middle school level reasoning associated with NGSS performance expectations for DCI ETS1.A Defining and Delimiting Engineering Problems, and the grades 3–5 level SEP of Planning and Carrying Out Investigations (2013). To adapt this rubric to the NGSS, the MDC score for this item meant students correctly identified an error in the experimental setup (DCI) and explained that she could not compare the frictional force of different shoes on different floors due to the failure to isolate a variable while holding the floor constant (SEP). A correct response was given for students who correctly identified the error without explaining how the error affected the experiment. Item 2: Relative motion focused on relative motion between two vehicles traveling on the highway (see APPENDIX 3.B). This item asked students to explain why a truck driving alongside them on the highway appeared to not be moving. Students engaged in middle school level reasoning associated with NGSS performance expectations for DCI [PS2.A] Forces and Motion, and CC Scale and Proportion (2013). To align the rubric with the NGSS performance expectations, the MDC classification was reserved for students who connected the truck’s speed to that of the observer inside the vehicle (DCI) and stated the equal relative speeds would cause the phenomenon (CC). Responses that dis- cussed only the speed without referencing how this related to the visual event were considered correct, but not MDC. 73 In the third constructed response, item 3: properties of solutions, students engaged elementary to middle school level reasoning for the DCI associated with PS1.A Structures and Property of Matter. As shown in APPENDIX 3.C, this was a fully three-dimensional item and students showed middle school level reasoning in cause and effect (CC) and planning and carrying out investigations (SEP) as outlined by the NGSS (2013). This question asked students to design an effective experiment to differentiate between the contents in two identical glasses. One glass contained saltwater while the other contained fresh water. Students could not suggest tasting the contents of either glass. To achieve the MDC classification, students would describe an experiment that controlled for relevant variables (SEP) to differentiate between fresh water and a solution (DCI) and correctly attribute causality (CC) to the chosen experiment by explaining the outcome. For a correct score, the DCI and SEP were considered inseparable. 
For example, a response that stated “test the density” was incorrect, because it explained neither an experiment that would do so, nor the expected results. The fourth and final constructed response, item 4: states of matter, asked students to demonstrate their understanding of what causes matter to change states (see APPENDIX 3.D). The question asked students to explain why water in a hot pot would evaporate more quickly than in a pot on the counter. Students used reasoning related to NGSS performance expectations for PS1.A Structure and properties of matter (DCI) and an understanding of energy flow (CC) related to performance expectations of Energy and Matter (2013). For the score to be classified as MDC, students would first demonstrate an understanding that the water evaporated (DCI). Second, students would attribute causality (CC) to the heat transferred from the stove to the water. Measured somewhat differently than other items, correctly reporting either the CC or the DCI was enough for a correct score. 74 Scoring occurred in a two-cycle process. The first cycle involved human scoring while the second cycle used human scores to train the machine algorithm to score. Figure 3.2 shows the flow of the two-cycle process which began with rubric development and ended in a completed algorithm which could instantly score remaining responses. Figure 3.2 Training and Algorithm Development Cycle 1: Human Training and Scoring Ten undergraduates were recruited to score students’ responses. These undergraduates were in their junior or senior years of a natural science major (nine physics students and one biology student) and had completed at least two college-level courses in both physics and chemistry. The raters participated in training sessions and scored in an iterative process that included a cycle of rigorous training sessions, calibration, and revisions for clarification. This was followed by a second cycle that involved calibrating the machine’s scores by providing the human scores and responses. As described above, the process shown in Fig. 2 began with construction of the multidimensional rubrics. When the rubrics were completed, raters were trained, and then proceeded to practice scoring. As training, raters scored a randomized sample of student CRs. The randomization procedure considered students’ knowledge level as well as ethnic, racial, and 75 geographic factors. AACR recommends a minimum human agreement of k = 0.80 before compiling a training set for the machine, so low inter-rater reliability (IRR) meant continued rater training. When the raters achieved a high IRR for all raters from the small practice sets, they next completed “bulk” scoring sets of various sizes. After bulk scoring, additional IRR testing ensured sustained agreement. If the IRR was low on the bulk scoring, raters returned to training. With high agreement for human raters from the bulk scoring, human scores were compiled to create a training set for the AACR algorithm. Because the success of the algorithm may depend on the quality of the data, it is imperative to create a quality training set with high inter-rater agreement. Therefore, it is very important to address issues suspected to reduce reliability in human scoring of multi-dimensional open-ended CR items. The Challenges for Human Raters and How We Addressed Them It became apparent that the rubrics needed to be explicit in what was expected of the student. The first consideration is to list all possible solutions. 
From item 3: properties of solutions, we learned that some issues in scoring arose, unexpectedly, from raters’ advanced knowledge of science and that rubrics needed to state nearly all possible correct experiments. Agreement typically improved with each round of scoring, as new experiments were identified and included in the rubric. The second consideration was to create a hierarchy outlining the importance of each dimension’s contribution to the score. Disagreements arose when rubrics did not explicitly state which dimension was being measured. In scoring item 1: experimental design, for example, raters disagreed on whether stating a direct claim was as important as knowing how to control for variables in the experiment. After the training session where this issue emerged, the human 76 agreement fell from k = 0.71 to k = 0.38 (shown later in Table 3.1). To correct for this in later scoring sessions, we revised the rubric and created a “dimensional hierarchy” for our raters, explicitly stating which dimensions and which specific aspects of each dimension were being measured. This process took too long, however, and due to subsequent changes to the scoring team during the process, the item did not make it to machine scoring during our collaboration with AACR. The third consideration was to carefully weigh the choice to make changes to the scoring team. When the agreement is high, changes can drop that agreement rapidly, but the low agreement can be improved by training new raters on an item. Consistent training and calibration sessions helped to bring new raters into agreement, but new scoring teams did not easily come to agreement with past raters’ scores. Often, this meant re-scoring assignments done by the previous rating team because the machine needed consistent agreement across the entire training set. When possible, maintaining the same scoring team until completion of the training set can help sustain a high inter-rater agreement. Conversely, the low agreement can be improved by introducing the item to a new team if the original team showed a lower than desirable agreement. Bulk Scoring Criteria AACR requested large training sets, with at least one hundred classified responses of each scoring proficiency category for each item. To meet this requirement, raters began to score in larger “bulk” sets. When their agreement was high, raters scored independently, but continued to score some content overlapping between raters. This allowed for continued checks for inter- rater agreement, which was calibrated after each wave of scoring so that raters could discuss disagreements. Scoring continued in this way until raters successfully scored at least 100 responses representing each of the three proficiency categories. Generally, two or three raters 77 were selected to score a given item based on their shared availability and the scoring was assigned to these pairs or triads. The number of responses scored in a wave was determined by the raters’ availability and the number of responses needed to obtain 100 examples of each proficiency category. Cycle 2: Constructing the Training Set In the second cycle, focused on generating a quality training set, we used the consensus, or median, score taken between raters who had scored a response in common. If a response was scored by only one rater, the individual score was considered the median. 
If the median score did not fall into one of the scoring categories (incorrect, correct, or MDC), the response was omitted from the training set and returned to the pool of unscored responses. Due to lower inter-rater agreement on item 3, a triad was used somewhat differently. A rater pair scored all responses in common, and the third rater in the triad acted as a tiebreaker to generate the consensus score. After the bulk scoring process, the human scores and student responses were used to develop training sets for the algorithmic models. For each item, a robust training set was designed to examine key lexical features associated with multi- dimensional reasoning specific to the CR item. The responses and corresponding consensus scores for the three successfully scored items were given to AACR to create a predictive model. AACR developed models specific to each item using a cross- validation approach by using a portion of the scored responses to create the algorithm and reserving the rest to test the model. AACR accomplished this using feature extraction analysis. The AACR Web Portal examined each response in the training set by its lexical features, primarily combinations of a number, “n,” with words called “n-grams” to tune parameters for the algorithm development. 78 To validate the accuracy of the machine algorithms, AACR Web Portal applied a cross- validation approach which was found to be the most effective when compared with split- and self-validation methods (Zhai et al., 2020b). Using cross-validation, the machine first partitioned the data into n subsets, named “n-folds.” A random selection of (n-1) folds of human-scored responses was used to train the machine and develop the algorithmic model, which was then used to score the remaining one-fold student responses. The machine scores of the one-fold of responses were compared with human scores to calibrate the machine- human agreement, which was indicated by parameters such as Cohen’s kappa. The training, scoring, and comparisons were iterated n times so that each fold of data played a role as both the training and testing sets. The average of Cohen’s kappa generated in these processes indicated the accuracy of the algorithmic model. At the same time, the algorithmic model generated a computer confidence parameter which helped to diagnose which specific responses were difficult for the algorithm to score. After receiving the results of the bulk scoring, the scoring team reviewed cases where the human and machine scores disagreed. From this point, the process diverged for the three items for which human raters were in high enough agreement to move on to the machine scoring process. Each question provided unique challenges in developing the algorithmic model. Challenges in the Preparation of Machine Training Data and How We Addressed Them We first considered how to bolster low human-human agreement using tie breakers. Issues arose in reaching high human-human agreement when scoring atypical responses or CRs which were open-ended. For item 3: properties of solutions, we employed a tiebreak method. This method is similar to that used by Haudek and colleagues (2019) where raters trained until achieving a human-human agreement of k = 0.60 or greater then scored responses individually with some responses overlapping between raters. A third rater would break ties in the 79 disagreements. 
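A minimal sketch of this consensus rule, with the median of the raters' scores and a third rater effectively breaking ties, is shown below before turning to Haudek and colleagues' findings; the numeric coding of the categories is assumed for illustration only.

```python
import statistics

# Hypothetical numeric coding of the categories: 1 = incorrect, 2 = correct, 3 = MDC.
VALID_SCORES = {1, 2, 3}

def consensus_score(rater_scores):
    """Median of the raters' scores; a single rating stands as its own median.
    A non-integer median (e.g., a 2-3 split) falls outside the categories,
    so that response is dropped from the training set and rescored later."""
    if len(rater_scores) == 1:
        return rater_scores[0]
    med = statistics.median(rater_scores)
    return med if med in VALID_SCORES else None

print(consensus_score([2, 3]))     # None: returned to the pool of unscored responses
print(consensus_score([2, 3, 3]))  # 3: the third rater's score breaks the tie
```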
Haudek and colleagues' results showed that the machine-human agreement was similar to human-human agreement, and machine-human agreement was higher than human-human agreement for some constructs.

The second consideration was in handling underrepresentation. Human raters examined the scoring discrepancies for lexical patterns, termed "key phrases." Key phrases emerged for some items, but not others. Key phrases (e.g., "velocity" or "relative") were much more apparent in item 2: relative motion. For item 3: properties of solutions, raters coded the key phrases as the experiments used in student responses. Item 4: states of matter showed frequent use of specific terms, but those terms did not seem to impact the machine scoring. Because underrepresentation can be problematic, researchers must remain aware of the potential for responses to be scored incorrectly. When lexical patterns emerged around these errors, it was feasible to predict future discrepancies. In this study, we selected additional responses with key phrases that were less represented in the overall sample but seemed common in the machine-human disagreements. Unscored responses containing these key phrases were then mixed with random responses and added to the next wave of bulk scoring. Where there were not enough of the potentially problematic responses to include more examples in the training set, we called upon human raters to review and score responses where errors were likely.

Data Analysis

To answer the first research question, we reported the human-human agreements, indicated by Cohen's kappa, by wave of scoring for each item. To answer the second research question, we calibrated the machine-human agreements for each item, using Cohen's kappa, and compared the agreements with the corresponding human-human agreements. We also reported the machine scoring accuracy according to the dimensions of science learning. To answer the third research question, we calculated the frequency with which key phrases were used, the frequency with which they were scored incorrectly, and the percentage of the total disagreements they comprised.

Results

Reliability of Human Raters in Scoring the Multi-Dimensional Assessments

Table 3.1 shows the cumulative agreement for human raters over the 8-month scoring period described above. Responses were scored in successive waves. For each wave, the number of responses overlapping between raters to check IRR is shown. Human raters achieved final Cohen's kappa values, after several waves of training, of k1 = 0.67, k2 = 0.80, k3 = 0.64, and k4 = 0.76 for the four items, respectively. According to a criterion proposed by Fleiss (1981), kappa values over 0.75 indicate excellent agreement while values between 0.40 and 0.75 indicate good agreement. According to this criterion, the human rater reliability is excellent for two of the items and good for the others. We also found that human scoring reliability increased with successive calibration training meetings. For the most successful item, item 2: relative motion, agreement increased from k = 0.72 and peaked at k = 0.88 over the successive waves of training and scoring. For three of the four items, agreement between human raters was high enough to move to machine learning during the collaboration with AACR.
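The interrater reliabilities reported in Table 3.1 are Cohen's kappa values computed on the responses a pair of raters scored in common. A minimal sketch of that computation follows; the rating vectors are invented for illustration, and the unweighted form of kappa is used here (the text does not specify a weighting scheme).

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings from two raters on the same ten responses
# (1 = incorrect, 2 = correct, 3 = multi-dimensional correct).
rater_1 = [1, 2, 2, 3, 1, 3, 2, 2, 1, 3]
rater_2 = [1, 2, 3, 3, 1, 3, 2, 1, 1, 3]

print(f"Cohen's kappa = {cohen_kappa_score(rater_1, rater_2):.2f}")
```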
Table 3.1 Human Agreement by Wave

Wave   NOverlap   Wave accuracy   Wave k   NCumulative   Cum. accuracy   Cum. k   NTotal   Note

Item 1: experimental design
1      99         0.81            0.71     99            0.81            0.71     0        Practice
2      50         0.58            0.38     50            0.58            0.38     0        Practice
3*     100        0.84            0.73     100           0.84            0.73     0        Practice
4      25         0.80            0.67     25            0.80            0.67     0        Practice

Item 2: relative motion
1      60         0.83            0.72     60            0.83            0.72     51       Practice
2      60         0.85            0.77     90            0.83            0.74     80       Practice
3      90         0.93            0.88     180           0.88            0.81     439      Bulk scoring
4      30         0.87            0.80     210           0.88            0.81     564      Key phrases
5      0          -               -        210           0.88            0.81     602      Key phrases
6*     33         0.85            0.76     243           0.87            0.80     740      Bulk scoring
7*     75         0.87            0.80     318           0.87            0.80     808      Bulk scoring

Item 3: properties of solutions
1      30         0.77            0.63     30            0.77            0.63     0        Practice
2*     190        0.73            0.56     190           0.73            0.56     0        Practice
3**    70         0.72            0.57     70            0.72            0.57     70       Bulk scoring
4**    400        0.78            0.65     470           0.77            0.64     465      Bulk scoring

Item 4: states of matter
1      30         0.93            0.86     30            0.93            0.86     88       Bulk scoring
2      19         0.79            0.62     49            0.88            0.77     242      Bulk scoring
3*     25         0.85            0.72     74            0.87            0.75     524      Bulk scoring
4*     25         0.89            0.80     99            0.87            0.76     594      Bulk scoring

Note. NOverlap is the number of responses raters scored in common within a wave; the wave accuracy and kappa are computed on that overlap. NCumulative is the total number of jointly scored responses across the combined waves, with the cumulative accuracy and kappa computed on that total. NTotal is the number of total responses scored which can be sent to the machine. A wave number followed by * indicates that this wave, and those following, were scored by a new team, while ** indicates a third rater was added to tiebreak.

For item 2: relative motion, training sets were compiled for the algorithm after wave 3 (bulk scoring), wave 5 (key phrases), and wave 7 (bulk scoring). Drops in the human rater agreements correspond to those time periods between waves of scoring, during which raters were waiting for and analyzing the results of the predictive model. For instance, agreement fell from k = 0.88 to k = 0.80 between waves 3 and 4, where the predictive model was tested. Changes in agreement also sometimes corresponded to considerable changes in the composition of the scoring team. This drop can be seen in the agreement for wave 2 of item 3: properties of solutions, when new scoring members joined the existing team. Although item 1: experimental design was not scored by AACR due to the low human agreement, that agreement is shown to improve after introducing a new rubric to a new scoring team. Between waves 2 and 3, agreement increased from k = 0.38 to k = 0.73 when the item was reintroduced to an entirely new team.

Machine-Human Agreement vs. Human-Human Agreement

Table 3.2 shows the mean score awarded by humans, the mean score awarded by the machine, and the machine-human agreement for each wave of machine scoring. All rounds of scoring achieved fair to good agreement (Cohen's k = 0.64 to k = 0.81), even with as few as 336 responses in the smallest training set. Criteria for machine scoring, proposed by Nehm and Haertig (2012), consider Cohen's kappa between 0.41 and 0.60 as moderate, between 0.61 and 0.80 as substantial, and over 0.80 as almost perfect. According to these criteria, the machine scoring outcomes for the three items were categorized as substantial to almost perfect. As shown in Table 3.2, the agreement between the machine and humans was as high as or higher than the human-human agreement reported for the cumulative waves of scoring for two of the three items (for human-human agreement, see Table 3.1).
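The machine-human kappas in Table 3.2 come from the cross-validation procedure described in the Methods: train a model on n-1 folds of human-scored responses, score the held-out fold, and average Cohen's kappa across folds. A rough sketch of that loop follows, using a generic n-gram classifier as a stand-in for the AACR models; the classifier choice and fold count are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

def cross_validated_kappa(responses, human_scores, n_folds=10):
    """Average machine-human Cohen's kappa over n folds: fit on n-1 folds,
    score the held-out fold, and compare with the human scores for that fold."""
    responses = np.array(responses, dtype=object)
    human_scores = np.array(human_scores)
    kappas = []
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(responses, human_scores):
        model = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                              LogisticRegression(max_iter=1000))
        model.fit(responses[train_idx], human_scores[train_idx])
        machine_scores = model.predict(responses[test_idx])
        kappas.append(cohen_kappa_score(human_scores[test_idx], machine_scores))
    return float(np.mean(kappas))
```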
Table 3.2 Description of Both Human and Machine Scores

Wave (Sample)    Mean human (SE)    Mean machine (SE)    k (SE)
Item 2: relative motion
1 (484)          1.83 (0.03)        1.76 (0.03)          0.78 (0.02)
2 (662)          1.89 (0.03)        1.83 (0.03)          0.78 (0.02)
3 (808)          1.85 (0.03)        1.81 (0.03)          0.81 (0.02)
Item 3: properties of solutions
1 (468)          1.76 (0.04)        1.68 (0.04)          0.69 (0.03)
Item 4: states of matter
1 (336)          2.59 (0.03)        2.60 (0.03)          0.76 (0.04)
2 (594)          2.49 (0.02)        2.51 (0.02)          0.64 (0.03)

Note. Item 1 is not shown because the item did not proceed to machine scoring.

In human scoring for item 2: relative motion, raters returned a cumulative k = 0.80 after all waves of scoring. The selected key phrases did not significantly improve scoring for the second wave of items sent to AACR. Despite adding all examples of the key phrases from the full data set, they still comprised only 0.50% to 8.79% of the responses scored for the final training set. To build the final training set, we added additional responses without a focus on identifying key phrases, for a total of 808 student responses. AACR returned a model with higher agreement with the human raters (k = 0.81) than the final cumulative agreement between humans (k = 0.80).

The lowest human agreement for any item sent to AACR was item 3: properties of solutions. By having a rater pair score all responses in common and using a third rater's scores as the tiebreaker, the algorithm matched the humans more closely (k = 0.69) than the humans agreed with each other (k = 0.64). Although further predictive models were not developed for this item, the method yielded substantial agreement between the machine and human scores, despite lower human-human agreement. For item 4: states of matter, agreement was lower in the second round of scoring (k = 0.64 compared with k = 0.76). Waves of bulk scoring were sent to AACR after waves 2 and 4 of human scoring. Despite adding an additional 258 responses and achieving similar cumulative human-human agreement for the second set (k = 0.77 to k = 0.76), agreement between the machine and humans still fell.

Accuracy of the Machine Scoring Associated with Dimensions of Learning

To better understand the capabilities of the machine scoring algorithm, we compared the machine and human scores by the associated dimensions of learning. Table 3.3 shows the distribution of scoring proficiency classifications for humans on all four items, how the machine classified the same responses, and agreement for the three items that had sufficient human-human agreement to move to machine scoring. The accuracy, or percent agreement, ranges from 28 to 93% for the individual proficiency levels for each item regardless of the associated dimensions, but the machine scored with accuracy greater than 59% for all categories that were well represented in the sample. The lowest overall accuracy (28.57%) and lowest reported certainty of score (0.68) corresponded to the least represented classification, which is the incorrect category for item 4: states of matter. This proficiency level comprised less than 4% of the training sample. For each item, the lowest accuracy and lowest certainty of score correspond to the category with the least representation.
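The per-category accuracy reported in Table 3.3 is simply the share of responses in each human-assigned proficiency level that the machine placed in the same level, which can be read off a confusion matrix. A small sketch, with invented labels and data:

```python
from sklearn.metrics import confusion_matrix

labels = ["incorrect", "correct", "MDC"]

# Hypothetical human and machine classifications for the same responses.
human   = ["correct", "MDC", "incorrect", "correct", "MDC", "correct", "incorrect"]
machine = ["correct", "correct", "incorrect", "correct", "MDC", "MDC", "incorrect"]

# Rows are the human (true) categories, columns the machine categories.
cm = confusion_matrix(human, machine, labels=labels)
per_category_accuracy = cm.diagonal() / cm.sum(axis=1)

for lab, acc in zip(labels, per_category_accuracy):
    print(f"{lab}: {100 * acc:.1f}% of human-assigned cases matched by the machine")
```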
Table 3.3 Human and Machine Percentage of Score, Agreement, Certainty, and Dimensionality

Proficiency level   Item dimensionality (DCI, CC, SEP)   Human %   Machine %   Mean probability (SE)   Accuracy %
Item 2: relative motion (N = 808)
Incorrect           Incorrect                            37.87     39.98       90 (0.00)               93.14
Correct             DCI                                  39.36     38.74       91 (0.01)               86.79
MDC                 DCI + CC                             22.77     21.29       82 (0.01)               78.80
Item 3: properties of solutions (N = 468)
Incorrect           Incorrect or DCI only                50.32     57.63       85 (0.01)               92.31
Correct             DCI + SEP                            23.44     16.56       78 (0.01)               59.63
MDC                 DCI + CC + SEP                       26.24     25.81       80 (0.01)               79.51
Item 4: states of matter (N = 594)
Incorrect           Incorrect                            3.54      2.19        68 (0.03)               28.57
Correct             DCI or CC                            41.75     46.30       83 (0.01)               83.87
MDC                 DCI + CC                             54.71     51.52       87 (0.01)               82.77

Note. Item 1 is not shown because it did not proceed to machine scoring. Mean probability refers to the prediction returned from AACR that a given score was correct. MDC = multi-dimensional correct; DCI = disciplinary core ideas; CC = crosscutting concepts; SEP = science and engineering practices.

Table 3.3 shows that the machine classified student responses with a similar distribution to human scores but with a tendency to score a little lower than human raters. The machine awarded between 0.4 and 3.2% fewer MDC proficiency classifications for each question. This is similarly reflected in Table 3.2, where the mean score awarded by the machine is slightly lower than for humans in nearly all cases. Table 3.3 also shows that for items where all proficiency levels were well represented, the machine scored the incorrect classifications with higher accuracy than other categories. This was not reflected in the machine's predicted certainty, however, and it cannot be asserted that the machine reported the highest confidence in scoring incorrect responses.

For item 2: relative motion, Table 3.3 shows that the machine showed high accuracy in scoring both the correct and MDC proficiency classifications. The use of only the DCI for a correct answer was scored with an accuracy of 86.79%, and MDC responses, which combined the use of the DCI with cause and effect (CC), were scored with an accuracy of 78.80% when compared with human classifications. As shown in Table 3.3, scoring for item 3: properties of solutions was fully three-dimensional. The relative amounts of human and machine scores for two-dimensional (correct) responses were 23.44% and 16.56%, respectively. The machine matched the humans more closely for three-dimensional (MDC) responses, which comprised 26.24% of the human proficiency classifications for this item and 25.81% of machine scores. Item 4: states of matter was scored differently from item 2, as it allowed for the use of either of two different dimensions of reasoning for partial credit. For a correct response, students could attribute the phenomenon to the evaporation of water into smaller particles (DCI) or reason that it was caused by the heat of the stove (CC). Human scorers classified responses as correct 41.75% of the time while the machine used this proficiency level for 46.30% of responses. The machine and humans classified responses with the MDC proficiency level 51.52% and 54.71% of the time, respectively. The correct and MDC responses were both well represented in the sample, and each was scored with accuracy over 82%.

Key Phrases in Scoring Open-Ended Constructed Responses

As demonstrated in Table 3.4, the machine can score open-ended CRs despite the varied language engaged by students.
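The tallies reported for each key phrase in Table 3.4 can be produced with a simple pass over the scored responses; the helper below is a hypothetical sketch (naive substring matching stands in for the raters' actual coding of key phrases, and all inputs are parallel lists).

```python
def phrase_report(responses, human, machine, phrase):
    """Tally how often a key phrase appears and how often it co-occurs
    with a machine-human disagreement."""
    has_phrase = [phrase in r.lower() for r in responses]
    disagree = [h != m for h, m in zip(human, machine)]

    n_phrase = sum(has_phrase)
    n_disagree = sum(disagree)
    n_both = sum(p and d for p, d in zip(has_phrase, disagree))

    return {
        "% of all responses": 100 * n_phrase / len(responses),
        "% scored differently from humans": 100 * n_both / n_phrase if n_phrase else 0.0,
        "% of all machine-human disagreements": 100 * n_both / n_disagree if n_disagree else 0.0,
    }
```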
For item 2: relative motion, the majority of students (56.06%) chose to use common phrases (e.g., "speed") to describe the phenomenon, while a smaller proportion (8.79%) used advanced vocabulary (e.g., "velocity"). Student responses with more advanced vocabulary seemed to be proportionately more represented among those that were scored incorrectly by the machine. For example, 18.31% of student responses that included the formal phrase "velocity" were scored incorrectly, compared with 11.70% of responses that used the word "speed." As shown in Table 3.4, responses containing these key phrases were often characterized either by commonly used phrases with a high prediction of a correct score or by atypical wording accompanied by a low prediction score. Examples of student responses, the human score, the machine score, and the certainty of the machine score for item 2: relative motion can be found in APPENDIX 3.E.

Table 3.4 Key Phrases Associated with the Machine Scoring

Key phrase(s)                        % of all responses   % scored incorrectly   % of all MH disagreements   Mean probability (SE)
Item 2: relative motion (N responses = 808, N disagreements = 102)
Velocity and relative                0.50                 50.00                  1.96                        77 (0.10)
Relative                             0.87                 71.43                  4.90                        81 (0.10)
Velocity or relative without speed   7.18                 20.69                  11.76                       84 (0.09)
Fast                                 7.80                 12.70                  7.84                        87 (0.11)
Velocity                             8.79                 18.31                  12.75                       85 (0.09)
Speed                                56.06                11.70                  51.96                       89 (0.10)
Item 3: properties of solutions (N responses = 465, N disagreements = 87)
Taste                                1.08                 0.00                   0.00                        80 (0.07)
Freeze                               2.15                 10.00                  1.15                        83 (0.11)
pH                                   2.80                 7.69                   1.15                        80 (0.08)
Dissolve                             3.01                 7.14                   1.15                        80 (0.13)
Smell                                6.02                 14.29                  4.60                        83 (0.10)
Mass or weight                       7.74                 16.67                  6.90                        83 (0.09)
Evaporate                            7.96                 21.62                  9.20                        83 (0.11)
Boil                                 10.97                19.61                  11.49                       82 (0.11)
Density                              13.98                13.85                  10.34                       83 (0.10)
Item 4: states of matter (N responses = 594, N disagreements = 111)
Steam                                37.54                16.14                  32.43                       86 (0.10)
Into the air                         15.66                16.13                  13.51                       88 (0.09)
Heat and evaporation                 20.37                14.05                  15.32                       89 (0.09)
Heat                                 21.21                15.08                  17.12                       89 (0.10)
Evaporation                          37.54                15.25                  30.63                       86 (0.10)

Note. Item 1 is not shown because it did not move to machine scoring. Mean probability refers to the prediction made by AACR that the algorithm assigned the same classification as the human raters.

For item 3: properties of solutions, students provided numerous experiments, including boiling or evaporating the water to look for remaining residue. Human raters coded key phrases as words associated with the types of experiments used by students, as shown in Table 3.4 (e.g., evaporate, smell). Item 3 had lower human agreement (k = 0.64) than the other items scored with the machine (k = 0.80 and k = 0.76), but still fell within the boundaries of substantial agreement (k = 0.61-0.81). We found, generally, that the percentages of machine-human disagreements for each key phrase were consistent with the frequency with which they appeared in student responses. For instance, the word "density" was used by 13.98% of students in the sample and accounted for 10.34% of the responses where the machine and humans disagreed. Item 4: states of matter shows a similar trend, where the percentages of machine-human disagreements were also consistent with the frequency with which the phrases appeared in student responses. For instance, "into the air" was used in 15.66% of all responses and comprised 13.51% of all disagreements.
The key phrases selected for this item were used in at least 15% of the responses and were scored incorrectly in similar proportions ranging from 15.08 to 16.14% of their total use. Discussion The performance expectations embodied in NGSS for chemistry and physics cannot be effectively measured unless meaningful and scorable three-dimensional assessments are developed (Cheuk et al., 2019; Pellegrino, 2013; NRC, 2014). Because such assessments require the other dimensions to be used together with scientific practices, they may involve a variety of performance-based tasks, such as writing short answers or drawing, to capture students’ mastery of performance expectations (Pellegrino, 2013; NRC, 2014). The NRC (2014) calls for investment of time and other resources into the development of these new assessments and to facilitate the implementation of these three-dimensional assessments in the classroom; this includes “existing and emerging technologies” that support scoring. In accordance with this initiative, this study built upon prior work of developing multi-dimensional assessments and implementing machine learning approaches for scoring those assessments. 90 This study demonstrated the ability of automated analysis to facilitate the transition to multi-dimensional assessment by showing high agreement between computers and humans when scoring CR items. Machine learning could facilitate scoring three-dimensional assessments more quickly than human scoring alone, allowing teachers and researchers to collect detailed information on students’ knowledge-in-use as recommended by the NRC. The findings have contributed to our knowledge by building on a foundation of research focused on machine scoring, which has previously been applied for scoring key concepts (Nehm & Haertig, 2012) and argumentation (Cheuk et al., 2019), among other purposes. This study has added to the literature surrounding machine scoring of CR by providing a comparison of multiple items, showing how each item was scored by humans and the machine algorithm, and demonstrating how each item aligned to the NGSS performance expectations. Using an objective measure, this study analyzed student responses to determine the use of two- and three-dimensional learning to describe phenomena. Through specially developed rubrics, this study has shown that the machine algorithm could score accurately when students engaged reasoning associated with CCs. All three of the items scored by the machine included a CC to differentiate between correct and incorrect classifications. In fact, the machine was able to successfully classify students’ use of a single dimension or multiple dimensions for each item, with accuracy comparable with the human raters. The machine algorithm also scored similarly to human raters when a correct response included any of multiple possible experiments. It is important to note that it took longer than expected to develop scoring models because this was a training exercise where the items were neither designed to be three- dimensional nor to be scored with machine learning. As such, it required a great deal of thought, training, rubric development, and IRR testing in an iterative process. This has allowed for 91 exploration of very open-ended response items showing that machine algorithms can attain high accuracy on sufficiently large data sets, and even train the machine to correctly classify responses by the associated dimensions. 
Once the algorithm obtained high agreement between the humans and the machine, AACR was able to instantly score the remaining responses. While prior studies have collected evidence that supervised machine learning can be used successfully in automatic scoring, some argue that when presenting the results of machine learning, training sets are not discussed sufficiently in presenting their results (Geiger et al., 2020). Few studies explicitly describe whether and how their assessments and rubrics target the three dimensions of science learning, or how well machine learning classifies student responses into the different categories. Consequently, we have limited knowledge about how machine scoring can support three-dimensional assessment practices. In this study, we have discussed each item used in the analysis, including how the rubrics tap each dimension and the details of the training set. By doing so, we were able to examine the machine’s capacity to classify and score items based on the dimensionality of scientific knowledge employed by students. Consistent with other studies, our results suggest that rater calibration and sample size might be significant factors impacting machine performance. These issues can be mitigated with continued rater training and larger sample sizes or “training sets” for the machine to build its algorithmic models (Balfour, 2013; Cheuk et al., 2019). Additionally, given that the quality of human-scored training data might be critical to machine scoring (Balfour, 2013), we examined how to improve human scoring of multi- dimensional assessments to facilitate machine performance. In this course, successful results emerged from our study that might be valuable for future applications in machine scoring. On scoring these multi-dimensional CRs, we found that the best method to train the raters, among those we tried, was not just to explain the correct 92 answer in the rubric, but to also inform raters which dimensions were being scored. The human scoring or coding of multi-dimensional responses should include explicit rubrics that list all possible solutions and show the hierarchy of importance for the dimensions being measured. Comprehensive rubrics helped humans to score consistently. It is also important to carefully weigh decisions to change the composition of the scoring team. Advanced vocabulary or key phrases (e.g., “velocity”) may increase the challenge for machine scoring, as compared with informal key phrases (e.g., “speed”). We suspect that this finding may be associated with the fact that fewer students used formal key phrases than those who used the informal alternatives. This concept of representation appears again where the algorithm’s lowest agreement to humans coincides with the least represented scoring proficiency. Because machine learning has a difficult time scoring more unique texts (Balfour, 2013), and students could correctly propose many experiments, researchers hypothesized poor results for item 3: properties of solutions. Despite the broad range of possible answers, we obtained good to substantial agreement (k = 0.69) between human raters and the machine. This was similar to the results of Haudek and his colleagues (2019) where they obtained higher agreement between humans and the machine than between human raters for some constructs. The results from this study also show that it is possible to bolster a training set with low human-human agreement through tiebreakers. 
Addressing the complications in building an accurate algorithmic model was beneficial at multiple levels. The search for lexical patterns in scoring discrepancies not only facilitated construction of a more robust model for scoring, but also identified human raters' errors even after sufficient agreement was reached. Even with high human-human agreement, we found that some discrepancies corresponded to human errors. We were then able to address these errors with raters directly to prevent recurrence. Outside the context of this paper, reviewing discrepancies between humans and the machine provided insight into broader vocabulary use and creativity in responses.

Limitations

As exploratory research, this study adopted existing items from a national test and developed multi-dimensional rubrics according to the NGSS to score students' responses. While this study collected evidence indicating sufficient machine capacity for scoring responses according to their dimensionality, there were limitations. Given that the assessment tasks were adopted from a national test that was not originally designed to be three-dimensional, we have only one three-dimensional item while the other three are two-dimensional. Though this study achieved high machine-human agreements for these items, future studies should develop more three-dimensional items and test the machine capacity to automatically score three-dimensional assessments. Because three-dimensional assessments are more complex than two-dimensional assessments, they could be more challenging for the machine to score.

Conclusions and Implications

The benefits of applying the three dimensions of science learning, by incorporating SEPs, DCIs, and CCs, have been proposed in both the NRC Framework (NRC, 2012) and the NGSS (NGSS Lead States, 2013). Realizing these benefits is challenging, however, because most multi-dimensional assessments contain CRs, and scoring CRs is both time and labor intensive. This implies notable efforts for human scoring on behalf of educators, state departments, and researchers. This study found that human experts were able to reliably (Cohen's k > 0.60) score student responses to assessment items with varied dimensions. This study shows that machine scoring was capable of classifying student responses, when measured by the use of the dimensions of learning spelled out by the NGSS, with accuracy that was comparable with human experts. This shows promise for the use of machine learning to facilitate the measurement of in-depth science understanding and to meet the recommendations for science assessment from the NRC. Once assessments are constructed, and a number scored, the remainder of a large sample can be scored almost instantly.

This study indicates that the automated analysis of two- and three-dimensional CR items may be a viable solution for reducing both the financial and time costs associated with measuring in-depth science knowledge, facilitating the gradual shift to follow NRC guidelines. Despite the labor involved in constructing the rubrics, training raters, and developing algorithms, once the models were complete, the machine algorithm could continue to score the remaining students' CRs rapidly. Machine scoring of three-dimensional assessments could have several meaningful impacts on state or national standardized testing and on the monitoring of student performance in the classroom. With coordination, quality three-dimensional assessments could be delivered online and CRs scored almost instantly.
However, given what we have learned, this is a complex process both in terms of identifying dimensionality and building rubrics which can be reliably scored. If it is possible to develop items that are three-dimensional and create rubrics for them, this would be an important contribution to meeting new science education reform efforts. Machine scoring could facilitate the use of more robust measures of students’ understanding through knowledge-in-use assessments. 95 REFERENCES AACR. (2020). September 4, 2020, Retrieved from https://apps.beyondmultiplechoice.org. Balfour, S. P. (2013). Assessing writing in MOOCs: Automated Essay Scoring and Calibrated Peer ReviewTM. Research & Practice in Assessment, 8, 40–48. Cheuk, T., Osborne, J., Cunningham, K., Haudek, K., Santiago, M., Urban-Lurain, M., Merril, J., Wilson,C., Stuhlsatz, M.,Donovan, B., Bracey, Z., & Gardner, A. (2019). Towards an Equitable Design Framework of Developing Argumentation in Science tasks and Rubrics for Machine Learning. Presented at the Annual meeting of the National Association for Research in Science Teaching (NARST). Baltimore, MD. Fleiss, J.L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: John Wiley. ISBN 978–0–471–26370–8. Geiger, R. S., Yu, K., Yang, Y., Dai, M., Qiu, J., Tang, R., & Huang, J. (2020, January). Garbage in, garbage out? do machine learning application papers in social computing report where human- labeled training data comes from?. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (pp. 325–336). Ha, M., & Nehm, R. H. (2016). The impact of misspelled words on automated computer scoring: a case study of scientific explanations. Journal of Science Education and Technology, 25(3), 358–374. Harris, C. J., Krajcik, J. S., Pellegrino, J. W., & DeBarger, A. H. (2019). Designing knowledge-in- use assessments to promote deeper learning. Educational Measurement: Issues and Practice, 38(2), 53-67. https://doi.org/10.1111/emip.12253. Haudek, K., Santiago, M., Wilson, C., Stuhlsatz, M.,Donovan, B., Bracey, Z., Gardner, A., Osborne, J., & Cheuk, T. (2019). Using Automated Analysis to Assess Middle School Students’ Competence with Scientific Argumentation, presented at the Annual Meeting of the National Council on Measurement in Education (NCME). Toronto, ON. Large, J., Lines, J., & Bagnall, A. (2019). A probabilistic classifier ensemble weighting scheme based on cross-validated accuracy estimates. Data mining and knowledge discovery, 33(6), 1674–1709. Lee, H. S., McNamara, D., Bracey, Z. B., Liu, O. L., Gerard, L., Sherin, B., Wilson, C., Pallant, A., Linn, M., Haudek, K., & Osborne, J. (2019a). Computerized text analysis: Assessment and research potentials for promoting learning. Lee, H. S., Pallant, A., Pryputniewicz, S., Lord, T., Mulholland, M., & Liu, O. L. (2019b). Automated text scoring and real-time adjustable feedback: Supporting revision of scientific arguments involving uncertainty. Science Education, 103(3), 590–622. 96 Liu, O. L., Brew, C., Blackmore, J., & Gerard, L. (2014). Automated scoring of constructed response science items: Prospects and obstacles. Educational Measurement-Issues and Practices, 33(2), 19–28. https://doi.org/10.1111/emip.12028. Lottridge, S., Wood, S., & Shaw, D. (2018). The effectiveness of machine score-ability ratings in predicting automated scoring performance. Applied Measurement in Education, 31(3), 215–232. Mao, L., Liu, O. L., Roohr, K., Belur, V., Mulholland, M., Lee, H.-S., & Pallant, A. (2018). 
Validation of automated scoring for a formative assessment that employs scientific argumentation. Educational Assessment, 23(2), 121–138. Mayfield, E., & Rosé, C. (2010, June). An interactive tool for supporting error analysis for text mining. In Proceedings of the NAACL HLT 2010 Demonstration Session (pp. 25–28). Mayfield, E., & Rosé, C. P. (2013). Open source machine learning for text. Handbook of automated essay evaluation: Current applications and new directions. National Academies of Sciences, Engineering, and Medicine. (2019). Science and engineering for grades 6–12: Investigation and design at the center. National Academies Press. National Research Council. (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. National Academies Press. National Research Council. (2014). Developing assessments for the next generation science standards. National Academies Press. Nehm, R. H., & Haertig, H. (2012). Human vs. computer diagnosis of students’ natural selection knowledge: testing the efficacy of text analytic software. Journal of Science Education and Technology, 21(1), 56–73. NGSS Lead States. (2013). Next generation science standards: For states, by states. Washington, DC: The National Academies Press. Pellegrino, J. W. (2013). Proficiency in science: Assessment challenges and opportunities. Science, 340(6130), 320–323. Zhai, X., Haudek, K., Shi, L., Nehm, R., Urban-Lurain, M. (2020a). From substitution to redefinition: A framework of machine learning-based science assessment. Journal of Research in Science Teaching, 57(9), 1430-1459. https://doi.org/10.1002/tea.21658. Zhai, X., Haudek, K., Stuhlsatz, M., Wilson, C. (2020b). Evaluation of construct-irrelevant variance yielded by machine and human scoring of a science teacher PCK constructed response assessment. Studies in Educational Evaluation, 67, 1-12. https://doi.org/10.1016/j.stueduc.2020.100916. Zhai, X., Yin, Y., Pellegrino, J., Haudek, K., Shi., L. (2020c). Applying machine learning in science assessment: A systematic review. Studies in Science Education. 56(1), 111-151. 97 Zhu, M., Lee, H.-S., Wang, T., Liu, O. L., Belur, V., & Pallant, A. (2017). Investigating the impact of automated feedback on students’ scientific argumentation. International Journal of Science Education, 39(12), 1648–1668. 98 APPENDIX 3.A ITEM 1: EXPERIMENTAL DESIGN Table 3.A.1 Item 1: Experimental Design, Text & Rubric Item 1: Experimental Design Text & NGSS Alignment Question Text Meg designs an experiment to see which of three types of sneakers provides the most friction. She uses the equipment listed below. 1. Sneaker 1 2. Sneaker 2 3. Sneaker 3 4. Spring scale. She uses the setup illustrated below and pulls the spring scale to the left. Meg tests one type of sneaker on a gym floor, a second type of sneaker on a grass field, and a third type of sneaker on a cement sidewalk. Her teacher is not satisfied with the way Meg designed her experiment. A. Describe one error in Meg’s experiment. Alignment to the NGSS (2013) Performance Expectations Dimension Grade-Level Performance Expectation DCI CC SEP 6-8 (N/A) 3-5 ETS1.A Defining and Delimiting Engineering Problems (N/A) Planning and Carrying Out Investigations 99 Table 3.A.2 Item 1: Experimental Design, Text & Rubric Item 1: Experimental Design, Student Examples & Rubric Multi-dimensional Correct "Meg’s error is that she is testing three experiments in separate and different settings, allowing the experiments to have different outcomes. 
This stops her from knowing if her other shoes work on a gym floor or grass field or a cement sidewalk." DCI: Student correctly identifies the error in the experimental setup. Correct Incorrect “Meg should have tested the sneakers in the same location for each test." "Meg should’ve used different types of sneakers, not the same." DCI: Student correctly identifies an error in the experimental setup. Provides an incorrect response or irrelevant error in the experimental set-up. & & SEP: Student explains this is a failure to control for variables or that the results cannot be compared. No SEP: Student does not explain that it controls for relevant variables. 100 APPENDIX 3.B ITEM 2: RELATIVE MOTION Table 3.B.1 Item 2: Relative Motion Text and Rubric Item 2: Relative Motion Text and NGSS Alignment Question Text Suppose you are riding in a car along the highway at 55 miles per hour when a truck pulls up along the side of your car. This truck seems to stand still for a moment, and then it seems to be moving backward. A. Tell how the truck can look as if it is standing still when it is really moving forward. Alignment to the NGSS (2013) Performance Expectations Dimension Grade-Level Performance Expectation DCI CC SEP 6-8 6-8 PS2.A Forces and Motion Scale and Proportion (N/A) (N/A) Table 3.B.2 Item 2: Relative Motion Text and Rubric Item 2: Relative Motion Student Example and Multi-dimensional Rubric Multi-dimensional Correct “The truck looks as if it is standing still as both your car and the truck are moving at 55 mph in the same direction." DCI: Student relates the truck’s speed to the speed of the observer. & CC: Student states that equal relative speeds would cause the truck to appear as though it is standing still. Correct Incorrect "It is going 55 miles per hour, which is as fast as the car is going." “the truck looks like it is still because it is losing speed." DCI: Student relates the truck’s speed to the speed of the observer.” & Student provides an incorrect/irrelevant explanation for the phenomena OR only restates the question. No CC: Student does not discuss the visual phenomenon being caused by the relative speeds. 101 APPENDIX 3.C ITEM 3: PROPERTIES OF SOLUTIONS Table 3.C.1 Properties of Solutions Text and Rubrics Item 3: Properties of Solutions Text and NGSS Alignment Question Text Maria has one glass of pure water and one glass of salt water, which look exactly alike. Explain what Maria could do, without tasting the water, to find out which glass contains the salt water. Alignment to the NGSS (2013) Performance Expectations Dimension Grade-Level Performance Expectation DCI CC SEP 3-5, 6-8 6-8 6-8 PS1.A Structure and Properties of Matter Cause and Effect Planning and Carrying Out Investigations Table 3.C.2 Item 3: Properties of Solutions Student Example and Rubric Multi-dimensional Correct "Maria could use two similar cups and weigh them both and the heavier one is saltwater." SEP: Student response describes an experiment that controls for relevant variables. DCI: The experiment iso- lates a measurement that will differentiate fresh water from salt water. CC: Student indicates the expected result that will allow them to differentiate the fresh water and salt water. Correct Incorrect "Maria can weigh the cups that hold the water." "Your body floats easier in salt water." Student response does not describe an experiment that will differentiate fresh water from salt water. SEP: Student response describes an experiment that controls for relevant variables. 
DCI: The experiment isolates a measurement that will differentiate fresh water from salt water. No CC: Student does not indicate the expected result that will allow them to differentiate the fresh water and salt water. APPENDIX 3.D ITEM 4: STATES OF MATTER Table 3.D.1 Item 4: States of Matter Text and NGSS Alignment Question Text Anita puts the same amount of water in two pots of the same size and type. She places one pot of water on the counter and one pot of water on a hot stove. After ten minutes, Anita observes that there is less water in the pot on the hot stove than in the pot on the counter, as shown below. A. Why is there less water in the pot on the hot stove? B. Where did the water go? Alignment to the NGSS (2013) Performance Expectations Dimension Grade-Level Performance Expectation DCI CC SEP 6-8 6-8 PS1.A Structure and Properties of Matter Energy and Matter (N/A) (N/A) Table 3.D.2 Item 4: States of Matter Student Example and Rubric Multi-dimensional Correct “The heat caused it to evaporate.” DCI: Student says the water evaporated. & CC: Attributes this to the heat from the stove. Correct Incorrect “The water evaporated.” “It dried up.” DCI: Student says the water evaporated. OR CC: Attributes this to the heat from the stove. Provides an incorrect or irrelevant explanation. CHAPTER 4: U.S. AND FINNISH HIGH SCHOOL SCIENCE ENGAGEMENT DURING THE COVID-19 PANDEMIC Abstract When the Covid-19 pandemic struck, research teams in the United States and Finland were collaborating on a study to improve adolescent academic engagement in chemistry and physics and to examine the impact of remote teaching on academic, social, and emotional learning. The ongoing “Crafting Engaging Science Environments” (CESE) intervention afforded a rare data collection opportunity. In the United States, students were surveyed at the beginning of the school year and again in May, providing information for the same 751 students from before and during the pandemic. In Finland, 203 students were surveyed during remote learning. Findings from both countries during this period of remote learning revealed that students’ academic engagement was positively correlated with participation in hands-on, project-based lessons. In Finland, results showed that situational engagement occurred in only 4.7% of sampled cases. In the United States, results showed that academic engagement, primarily the aspect of challenge, was enhanced during remote learning. Engagement was in turn correlated with positive socioemotional constructs related to science learning. The study’s findings emphasize the importance of finding ways to ensure equitable opportunities for students to participate in project-based activities when learning remotely. Introduction and Literature Review The 2019–2020 school year brought significant changes to educational systems around the globe when elementary and secondary schools closed suddenly, finding themselves faced with new social distancing guidelines as the world plunged into a crippling pandemic (Meluzzi, 2020). According to the United Nations Educational, Scientific, and Cultural Organization (2020), these closures impacted over 63% of students enrolled in pre-primary through tertiary learning institutions worldwide. Schools closed with little or no notice, leaving parents and educators barely any time to prepare for this new reality. With the shift to full days of remote instruction, teachers and students found themselves adapting to entirely new learning environments.
Unfortunately, more technology does not imply improved educational outcomes (Escueta et al., 2017). With many students on Zoom or other platforms, equitable participation became a serious problem during the pandemic. Domina et al. (2021) showed students’ academic engagement was improved with greater access to technological resources and quality instruction that included socioemotional learning. Globally, however, socioeconomically disadvantaged students are less likely to have the tools they need to participate in remote instruction (Meluzzi, 2020). Students in lower-income schools were shown to be less engaged with their schoolwork than their same aged peers with greater access to resources (Hopkins et al., 2021). Learning losses are, in turn, shown to vary with academic engagement and access to school supplies or technology necessary for participation (Dorn et al., 2020). The inability of some families and schools to provide the financial and material support for experiences that would normally be provided in a science classroom means existing gaps hindering equitable participation are exacerbated by the pandemic. In the United States, for 105 example, nearly one third of students were unable to participate in remote learning during the first wave of the pandemic (Meluzzi, 2020), placing additional stressors on students and their families. Not surprisingly, the pandemic has coincided with an increase in student anxiety, which inhibits students’ ability to engage with their online classrooms (Yang et al., 2020). High school students are experiencing stressors of the pandemic, many of which limit the coping mechanisms teenagers usually employ to deal with the normal stressors associated with being in high school. Survey data has shown students reported feeling disinterested, bored, and socially isolated when spending long hours in virtual classes; parents also have expressed similar concerns about their children’s academic learning and well-being (Kaufman et al., 2020). Recent literature shows the pandemic has caused difficulty in attaining academic engagement in remote classrooms with many teachers reporting the need for additional resources to do so (Trinidad, 2021). Literature regarding learning during the Covid-19 pandemic suggests that students’ academic engagement may differ when content is delivered remotely as compared to learning in a classroom environment. In a joint project of two countries, the United States and Finland, we aim to better understand academic engagement and its correlation to various activities assigned in science classes during remote learning due to the pandemic. A Two-Country Intervention Facing a Pandemic The sudden transition to remote teaching occurred during an ongoing collaborative intervention, “Crafting Engaging Science Environments” (CESE). Funded by the National Science Foundation and Academy of Finland, CESE brought together a team of learning scientists, science education researchers, psychologists, sociologists, and teachers. They designed an intervention that supported students’ academic engagement and impacted not only their academic learning, but also their social and emotional learning (Schneider et al., 2020). Based on 106 the principles of project-based learning (PBL) that support student experiential activities in “figuring out” phenomena (Krajcik & Shin, 2014), the unifying theme that motivated the CESE intervention was to improve engagement in physics and chemistry courses for high school students in grades 10 through 12. 
Throughout multiple years of this collaboration, a series of questions related to the conceptualization of engagement and its impact on social and emotional learning remained a continual focus. During the 2019–2020 school year, CESE was in its first-year efficacy trial in Finland. In the United States, CESE was undergoing a maturation study. Having shown promising results for the learning outcomes of treatment students in the previous year (Schneider et al., 2022), the goal was to determine whether teachers in their second year of teaching the project-based learning intervention would show greater impacts on learning outcomes than teachers implementing the lessons for the first time. Although the pandemic ended the ability to study students’ engagement during hands-on lessons, a unique opportunity arose to study academic engagement in this remote learning environment. U.S. and Finnish Government Responses CESE’s shift in focus to studying academic engagement during remote science classes occurred within two contrasting national contexts, and the two countries differed in how they managed the transition. The relative populations of the United States (over 330 million) and Finland (over 5.5 million) affected each country’s response to the pandemic. Finland was able to centralize its decision-making, given its smaller population. Centralization in the United States was more difficult, not just because of its larger population but also because each of the 50 states has the right to control its own schools. When U.S. schools closed, teachers in the CESE study reported guidelines for learning and instruction that differed among states, districts, and even schools. In the United States, the movement to remote instruction began in mid-March, which coincided with spring break in many districts. Assuming social distancing measures would be brief, some schools simply extended the spring break. Awaiting guidance from the state or federal governments meant many classrooms did not make this transition until April. Consistent with the findings of a 2020 study from Reich et al., policies differed at multiple levels of decision making; some CESE teachers reported that their schools required all teachers to use the same curriculum and had strict protocols for contacting parents, while other teachers reported that their schools entrusted teachers with all instructional and logistical decisions. Schools, teachers, and students had limited familiarity with remote learning. Orchestrating an equitable learning environment in which all students, especially those in low-income families and communities, were equipped with computers and internet access was a monumental undertaking. When districts were unable to provide equipment for every student, they had to develop alternative methods that assured equitable learning experiences. Planning and organizing some form of high-quality remote instruction for the 50 million primary and secondary students in U.S. public schools became exceedingly challenging. Despite the confusion surrounding these unprecedented changes, some national studies suggest that the majority of teachers and administrators reported effective communication from their districts regarding policy changes and that instruction was supported through relevant professional development opportunities (Kraft & Simon, 2020).
Consistent with these findings, teachers in the CESE study noted that their districts provided professional development opportunities focused on adapting their pedagogy with a variety of online tools. 108 In contrast to the United States, Finland’s government was able to quickly decide that all students in the country would transition to remote learning. All schools were closed from March 18th until May 13th. For students in upper secondary schools, vocational training institutes, tertiary, and other educational institutions, the government recommended continuing distance teaching until the end of the semester. The government worked with the schools to ensure that all students and teachers had access to computers, wi-fi, and instructional guidance for students and staff (The Finnish National Agency for Education, 2020). Several educational platforms were already in place and allowed educators to provide feedback, assign homework, and communicate with parents and students through the pandemic. According to the Finnish Teachers’ Union Survey, about half of the teachers reported having sufficient pedagogical and digital competence for teaching during the remote learning period; a similar proportion of students claimed the change to distance learning went well. Only a small number of students had difficulties due to insufficient equipment or lack of skills for distance learning. Overall, Finnish students and teachers both reported having considerable experience with digital skills such as how to use computers, how to find information, and using reputable sources to “fact check.” These survey reports are consistent with several other international studies which have shown that Finnish students are among the most well-prepared to work online compared to students in other industrialized countries (see, Fraillon et al., 2019, International Education Association (IEA) International Computer and Information Literacy Study). Not surprisingly, Finnish teachers and students reported feeling quite confident about their abilities to succeed in distance learning and had the skill sets to do so. 109 Studying Engagement Even before the pandemic, students’ academic engagement in science was a deep concern globally, and several major reports by OECD (2019) connected engagement with interest in science and its attractiveness as a career option. One of the major challenges of CESE was to theoretically describe and measure academic engagement in science classes. Part of the problem was that while there was consistent agreement that academic engagement varied over time, what engagement meant in various contexts was viewed differently. One of the first considerations was how to specify academic engagement and what constructs should be used to define it (see, Hidi & Renninger, 2006; Schneider et al., 2016). CESE views academic engagement in science as being comprised of interest, skill, and challenge (Schneider et al., 2016): and not all activities are likely to have the same effect on students’ social and emotional or academic learning (Inkinen et al., 2020). Recognizing the difficulty of trying to define academic engagement without specifying when it occurs misses the ability to identify when students are feeling interested, skilled, and challenged in what they are doing. The approach identifies these three constructs as critical for enhancing students’ academic engagement which are grounded in psychological literature. 
Interest is the psychological predisposition for a specific activity, topic, or object; skill is the mastery of a set of specific tasks; and challenge is the willingness to take on a difficult, somewhat unpredictable course of action. When students report high interest, skill, and challenge, they are considered to be engaged. Situational Engagement/Optimal Learning Moments The primary focus of the CESE study is situational engagement. When measured in the moment, instances of situational engagement are considered optimal learning moments (OLMs), which are situationally specific times when a student is so deeply engrossed in a task that it feels as if time flies (Schneider et al., 2016). During those times, students tend to be concentrating and feeling in control (see Salmela-Aro et al., 2016). This idea is similar to how Csikszentmihalyi (1990) describes flow as being completely immersed in an activity. For this study, we consider OLMs to be situations that elevate students’ academic engagement and are positively related to social and emotional learning. Our research shows that OLMs occur about 15–20% of the time in science lessons (Inkinen et al., 2020; Schneider et al., 2016). Our interest is to examine how often they occur when students are learning remotely. Academic Engagement and Social and Emotional Learning Experience Researchers have recently begun to examine the relationships between academic engagement and other factors, including social–emotional skills and learning experiences, as they mutually reinforce one another (Salmela-Aro & Upadyadya, 2020). The current study applied the OECD framework on social and emotional constructs, which includes maintaining positive emotions, managing social relationships, and sustaining goal pursuits. Some key elements are optimism (ambition or future importance), persistence, curiosity, social interaction, and self-efficacy (Kankaras & Suarez-Alvarez, 2019). In this study, we aimed to understand social–emotional experiences and their relationship to academic engagement during remote learning. One important social–emotional learning experience is persistence, also referred to as grit (see Duckworth, 2016; Tang et al., 2019), which refers to how long students stay with a task or learning assignment without giving up. In relation to our idea of academic engagement, it is important to determine whether students “give up” in more challenging situations and whether “grit” acts as a buffer, inspiring students to persist in the task at hand (Salmela-Aro & Upadyadya, 2020; Tang et al., 2021). “Grit” has particular importance to Finnish society, as it has been associated with the term “sisu,” which can be translated as “determination to overcome adversity” and is a hallmark of the Finnish perception of their national character. Another social–emotional concept related to academic engagement is curiosity, defined as the desire for knowledge or information, which has been found to be associated with question-asking, exploration behaviors, and achievement (Hidi & Renninger, 2006). Recently, curiosity, the epistemic emotion that triggers interest, has also been highlighted in the engagement process (Hidi & Renninger, 2019). Thus, the role of curiosity in OLMs is important to investigate. When students are engaged in learning, we expect them to feel that their science work is related to their future education goals (Schneider et al., 2016). U.S.
students who participated in the CESE intervention during the previous year’s efficacy trial showed increased educational ambitions (Schneider et al., 2022). During a pandemic, however, uncertain futures might weaken educational goals. Alternatively, science is now trending heavily in the news which could foster students’ curiosity and engagement with it, in turn strengthening their educational ambitions. Current Study When the pandemic struck, CESE teachers in both countries were unable to teach the intervention. Although CESE was rendered unable to study students’ academic engagement during these project-based lessons in a science classroom, a unique opportunity arose to study engagement in relation to the activities assigned in this remote learning environment. As mentioned above, comprised of three components (challenge, interest, and skill), it was possible that academic engagement could increase or decrease based on these subsequent parts. If students felt more challenged in the remote environment or more interested in science due to ties to current events, it could increase their academic engagement. In Finland, CESE researchers 112 sought to understand situational engagement during remote lessons. Without the clear project- based learning features, would students be less engaged during remote lessons? In the United States, where the transition to remote teaching varied greatly among schools and districts, materials distribution to study situational academic engagement was not possible. Instead, students were surveyed during the pandemic, and results from those who responded were compared to their responses collected earlier in the year. In Finland, the same students were not tracked over the course of the school year due to the semester-by-semester class structure of Finnish high schools. Here, the more organized shift to remote teaching made it possible to provide students with a diary to collect data about their engagement in the moment. Despite their differences, the study teams were able to collect similar data related to academic engagement in science. Guiding the investigation at this unprecedented time are three major questions: (a) how engaged were students in their remote science classes? (b) how engaged were students in their specific learning activities during remote learning? (c) how was academic engagement related to social and emotional learning experiences during remote learning? This is not a comparative study; rather, its intent is to underscore similarities and differences in students’ academic engagement when in remote science classroom environments. U.S. and Finnish Samples Methods The ongoing work in both countries allowed CESE to survey and interview the students and teachers when they were participating in their remote science classes. To measure academic engagement, we have deliberately selected academic and social–emotional constructs that both countries have investigated with their own respective student populations with some variations and differences in instruments which are explained below. Both countries used self -report 113 surveys to collect information from high school students enrolled in chemistry and physics courses. In the United States, data were obtained from students in fall 2019 and then again during the pandemic, allowing for comparisons of attitudes and perceptions from in-person to remote experiences. In Finland, however, data were only obtained in the spring. U.S. 
Sample During the 2019–2020 school year, the United States was in a maturation study in which both treatment and control teachers from the previous year taught the intervention, for their second and first years, respectively. At the beginning of the school year, 4,954 high school students from 86 physics and chemistry classrooms completed a background survey as part of the CESE intervention. In the United States, the expected age range of students in grades 10 through 12 is between 14 and 17 years old. In the CESE study, 95% of students fell within the anticipated age range and 98% were in grades 10 through 12. We initially anticipated that our response rate would be similar to the analytic sample from the efficacy trial in the previous year. Given the pandemic, however, it became apparent that we would not get anywhere near that percentage of respondents. Taking into account attrition rates from the previous year, teachers’ reports of low attendance after the shift to remote learning, the loss of one highly populous district, and district policies that gave little incentive for students to attend remotely, we expected to receive less than 20% (n = 879) of our original 2019–2020 sample. When surveyed again in Spring 2020, 922 students replied to the exit survey. Of the students who responded to the exit survey, 81% (n = 751) had completed both a background and exit survey in the participating regions. These students were retained in the analytic sample, of which 55.53% were female; 22.64% white (non-Hispanic); 40.21% Hispanic; 5.73% Black; 6.39% Asian; 17.04% multiple race/ethnicities; less than 1% other; and 7.46% did not provide information about their race or ethnicity. Finnish Sample When the pandemic struck, the Finnish team reached out to six of the nine participating teachers involved in the efficacy study regarding the possibilities of collecting data while students were learning remotely. All six teachers agreed to the study, which included 203 students (97 males, 103 females, 3 who preferred not to answer; all within the range of 16–18 years old). It is important to note that the number of students included in this study is relatively small, but all were in grades 10 through 12 and participating in the CESE intervention. Measures U.S. Measures In the United States, the same students were followed through a full school year and surveyed in fall 2019 and again in spring 2020, after moving to remote learning. Students were asked to respond to questions about their demographic data, such as race, gender, grade point average, and attitudes toward science. In both fall and spring, students were asked about their interest, skill, and challenge in their physics or chemistry class. Students reported how much they agreed with the following statements on a four-point Likert scale: I am interested in science; I feel skilled in science; and I find science challenging. To measure students’ academic engagement, we calculated the mid-point for each of the three categories (i.e. interest, skill, and challenge). Academic engagement is the binary outcome indicating that a student reported scores above the midpoint (i.e. 3 or 4) for each of the three categories. We then compared their responses from fall to spring after the shift to remote learning. Students also reported on how frequently they participated in a number of activities when in a remote classroom and how interested they were in each activity.
These activities included: discussion boards; one-on-one video chats with the teacher; watching videos of experiments; online simulations; live lessons; recorded videos of lessons; using textbooks; writing papers; building models at home; making presentations using slides or power-point to share with the class; text-based instruction; working in groups through video chat; and experiments to try at home. Students’ reported interest in each activity was ranked and compared to that activity’s reported frequency in the classroom. Frequency was measured on a 5-point scale: we do not do this activity (0); less than once every 2 weeks (1); once every 2 weeks (2); once per week (3); or every day (4). Interest was reported on a four-point scale ranging from “this does not interest me (1)” to “this interests me a lot (4).” On the fall and spring surveys, students responded to questions specifically related to project-based learning tasks associated with the CESE intervention. These activities were included to determine whether students were still able to engage in the same project-based tasks fundamental to the intervention. These questions were altered to better suit the situation of remote learning when administered on the second survey, and additional questions related to modelling were added. In fall, these items were measured on a scale similar to the one used for the online-specific activities, with students reporting the frequency with which they performed certain activities: never or almost never (1); once every month (2); once every 2 weeks (3); once per week (4); or more than once every week (5). These activities included several different types of modelling activities, opportunities to take pride in their achievements, ask questions in class, discuss phenomena, work together to solve problems, and generally present their findings, as well as the frequency with which students performed “science and engineering practices like taking measurements to collect data about the world around us and using evidence to make a claim.” To understand the relationship between social and emotional skills and students’ academic engagement, college ambition was considered a behavioral measure of persistence (or grit). On both the fall and spring surveys, students were asked to report their educational goals regarding how far they expect to go in school, including: I do not know how far I will go; less than high school; graduate from high school but not go any further; go to a vocational, trade, or business school after high school; graduate from a two-year college (Associate’s Degree); graduate from a four-year college (Bachelor’s Degree); Master’s Degree or equivalent; and Ph.D., M.D., or other advanced professional degree. A binary variable was created to indicate whether or not a student reported plans to attend at least 4 years of college. Finnish Measures In Finland, the CESE study focused on situational engagement, and researchers did not follow the same students through an entire school year. During the pandemic, students were asked to complete two surveys. The first survey focused on their general feelings and experiences in remote learning during the pandemic. The second survey asked students to report their real-time feelings and experiences using a diary format, the experience sampling method (ESM), for which they answered short surveys in the moment, during their remote lessons. Situational engagement consists of high levels of interest, skill, and challenge.
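Because both countries reduce the three ratings to a single binary indicator (a response counts as engaged only when interest, skill, and challenge are all rated 3 or 4), the recoding can be illustrated with a minimal sketch; the column names and values below are hypothetical, not study data.

```python
# Minimal sketch of recoding three Likert items into one binary engagement indicator.
# The data frame values and column names are hypothetical, not study data.
import pandas as pd

responses = pd.DataFrame({
    "interest":  [4, 3, 2, 4, 1],
    "skill":     [3, 2, 2, 4, 3],
    "challenge": [3, 4, 1, 3, 2],
})

items = ["interest", "skill", "challenge"]
# Engaged (1) only when all three items are rated 3 or 4; otherwise 0.
responses["engaged"] = (responses[items] >= 3).all(axis=1).astype(int)

print(responses)
print(f"Share of engaged responses: {responses['engaged'].mean():.1%}")
```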
Students reported in the ESM survey their momentary interest (Are you interested in what you are doing?), skill (Do you feel skilled at what you are doing?), and challenge (Do you feel challenged by what you are doing?) on a four-point scale: not at all (1); a little (2); much (3); and very much (4). Academic engagement was measured similarly in both countries. Students were considered situationally engaged (i.e. experiencing an OLM) if their responses were 3 or 4 to all three questions. A binary variable of 1 or 0 was generated to indicate whether this was an OLM or not. The ESM survey asked students to report their practices when they received the survey. They could choose from: following teacher’s instruction, doing tasks independently, studying from books, studying from a website, writing, discussing online, making videos, asking questions, developing a model, using a model, planning an investigation, conducting an investigation, analyzing data, solving math problems, constructing an explanation, using evidence to make an argument, evaluating information, and other. Students chose all practices that applied to them. These options were recoded as dichotomous variables for the analysis (practice was reported (1) or not reported (0)). The ESM also surveyed students’ social and emotional experiences in real time on a four-point scale: not at all; a little; much; or very much. We focused on students’ remote learning experience regarding their belief that the material had importance for their future; feelings of loneliness; boredom; confidence; curiosity; and grit, as these have been highlighted in the OECD social and emotional skills frameworks (Kankaras & Suarez-Alvarez, 2019; Salmela-Aro & Upadyadya, 2020). In total, these ESM surveys produced an average of 3.49 responses per student, for a total of 701 situational responses. The general survey asked students to report how often they engaged in the following science practices on a four-point scale: never or hardly ever (1); some hours (2); most hours (3); and in all classes (4). The practices were: have opportunities to explain my own thoughts; plan how to study; do practical tests; draw conclusions from experiments or research; apply concepts related to everyday problems; participate in debate or discussion; follow the teacher’s demonstrations; do experiments as instructed; follow the teacher’s teaching or example in remote learning; view a video or animation; do assignments independently; study a book; study a website or e-learning platform; make notes or summaries; share documents with other students; present my output in a video conference (Zoom, Meet, Skype, etc.); write a joint document with another student; chat online; create videos or animations; do experimental research with tools found at home; ask for advice from another student; help another student; and get feedback from the teacher that promotes learning. Results U.S. Results How Engaged Were U.S. Students When Attending Courses Remotely? Despite the many changes to instruction, U.S. students were more likely to report academic engagement after participating in the CESE intervention. Table 4.1 shows the change in the odds that a student reported above-midpoint scores for interest, skill, and challenge, and for the engagement variable, meaning they reported high scores for all three.
As shown in Table 4.1, when surveyed in Spring 2020, students showed a strong increase in their science interest and in the level of challenge they felt in their remote physics or chemistry class as compared to earlier in the school year. Students were 4.24 times more likely to report high levels of interest and 7.36 times more likely to report high levels of challenge. Students were only 1.53 times more likely to report high levels of skill during the pandemic; this change was significantly smaller (p < 0.001) than the changes for either of the other two questions. The increase in all three categories resulted in students being 9.24 times more likely to be engaged.

Table 4.1 Changes in U.S. Students' Academic Engagement During the 2019–2020 School Year

                  β         SE(β)   OR (e^β)
High interest     1.44***   0.13    4.24
High skill        0.42**    0.14    1.53
High challenge    2.00***   0.12    7.36
Engagement        2.22***   0.19    9.24

Note. High Interest, High Skill, and High Challenge are binary variables indicating a student reported a 3 or a 4 on the scale. This table shows the change in the log odds of a student reporting high measures of these variables from fall to spring in the 2019–2020 school year. *p < 0.05. **p < 0.01. ***p < 0.001.

What Kinds of Activities Were U.S. Students Participating in During Remote Teaching? Figure 4.1 shows the frequency at which U.S. students reported specific class activities and how interesting they found these experiences during the pandemic. The most frequent activities used remotely were videos of experiments, online simulations, text-based instruction, and discussion boards. Students reported watching recorded videos of lessons more frequently than attending live lessons. Using textbooks, building models at home, and making presentations were the least frequently used. While performing experiments at home was one of the top interests reported by students, the frequency at which it occurred was low. Students found writing papers, using Google Slides or PowerPoint to make presentations, and using textbooks among the least interesting online activities.

Figure 4.1 U.S. Students' Frequency and Interest in Online Learning Activities

How Did These Tasks Relate to Students' Academic Engagement? Table 4.2 shows the impact of each predictor on student academic engagement (high interest, skill, and challenge) in its own logistic regression model, due to strong correlations in the frequency of activities assigned during remote learning. Each model controls for engagement at the beginning of the school year, race, and gender. Students were clustered by school and classroom to account for variance that might occur due to school policy or teachers' familiarity with teaching online. When surveyed before the transition to remote learning, only 5 students reported academic engagement for every 100 who did not. By spring, the odds of reporting engagement had risen to as high as 48 students reporting engagement for every 100 students who reported not being engaged. As shown in Table 4.2, when controlling for demographic data and the academic engagement (high interest, skill, and challenge) measure recorded in fall, there were strong significant correlations between academic engagement and the frequency of most of the project-based activities related to the CESE intervention.
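As a concrete illustration of the modelling approach just described, the sketch below fits one per-activity model: a logistic regression of the spring engagement indicator on the frequency of a single activity, controlling for fall engagement, gender, and race, with standard errors clustered on classroom as a simple stand-in for the multilevel structure. This is a minimal sketch using statsmodels on synthetic data, not the dissertation's exact specification; all column names are hypothetical.

```python
# Minimal sketch of one per-activity logistic regression with cluster-robust SEs.
# Data are randomly generated for illustration only; column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "engaged_spring": rng.integers(0, 2, n),   # binary engagement outcome in spring
    "engaged_fall": rng.integers(0, 2, n),     # binary engagement reported in fall
    "activity_freq": rng.integers(0, 5, n),    # 0-4 frequency scale for one activity
    "female": rng.integers(0, 2, n),
    "race": rng.choice(["white", "hispanic", "black", "other"], n),
    "classroom": rng.integers(1, 40, n),
})

model = smf.logit(
    "engaged_spring ~ activity_freq + engaged_fall + female + C(race)",
    data=df,
).fit(disp=False, cov_type="cluster", cov_kwds={"groups": df["classroom"]})

# Exponentiating the activity coefficient gives the odds ratio for a
# one-unit increase in activity frequency.
print(model.params)
print("OR for activity frequency:", np.exp(model.params["activity_freq"]))
```

Exponentiating each activity's coefficient in this way yields odds ratios of the kind reported in Table 4.2.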
The highest correlations to academic engagement were found in the frequency of equations modelling and participation in science and engineering practices, showing students to be 1.29 to 1.30 times more likely to report engagement. All types of modelling, except building models at home, were positively correlated with academic engagement, and students were between 1.17 and 1.30 times more likely to report engagement for each unit increase in frequency. Additionally, students who reported more frequent opportunities to take pride in their science achievements were 1.26 times more likely to report engagement with each unit increase in frequency. Many of the activities specifically related to remote teaching were not significantly correlated with academic engagement, and the highest correlations again corresponded with more project-based tasks. For example, the odds of a student reporting engagement were 1.18 times higher with more frequent at-home experiments during remote learning and 1.20 times higher with each increase in the frequency of building presentations with Slides or PowerPoint. Despite listing textbook use as uninteresting, students were 1.19 times more likely to report being engaged with each increase in frequency of reported use. Building models at home is not shown in this table because it was not significantly correlated with academic engagement and the logit model did not converge when controlling for other factors.

Table 4.2 Logistic Regression Coefficients for Each Activity on Academic Engagement

Activity                                        Correlation coefficient   Logit regression coefficient (β)   SE(β)   Odds ratio (e^β)
Build models                                    0.12**                    0.20*                              0.09    1.23
Class discussions about phenomena               0.13***                   0.22**                             0.08    1.25
Computer modelling 1                            0.13***                   0.20**                             0.07    1.22
Computer modelling 2                            0.11**                    0.15*                              0.06    1.17
Draw visual models                              0.10*                     0.16**                             0.05    1.18
Equations modelling                             0.15***                   0.27**                             0.10    1.30
Opportunities to ask questions                  0.05                      0.11                               0.12    1.11
Opportunities to take pride                     0.14***                   0.23**                             0.08    1.26
Present their findings                          0.13***                   0.21**                             0.07    1.23
Science and engineering practices               0.15***                   0.26***                            0.07    1.29
Work together to understand phenomena           0.11**                    0.17**                             0.06    1.18
Items specifically related to remote teaching
Discussion board                                0.06                      0.08                               0.05    1.09
Experiments to try at home                      0.10**                    0.17**                             0.06    1.18
Live lessons                                    0.08*                     0.11                               0.06    1.12
One-on-one video chat with teacher              0.10**                    0.15**                             0.05    1.16
Online simulations                              0.07                      0.13                               0.07    1.14
Presentations using slides or power-point       0.10**                    0.18*                              0.07    1.20
Recorded lessons                                0.02                      0.02                               0.04    1.02
Text-based instructions                         0.08*                     0.13                               0.08    1.13
Textbook use                                    0.10*                     0.18**                             0.05    1.19
Watching videos of experiments                  0.07                      0.14                               0.08    1.15
Working in groups through video chat            0.09*                     0.14                               0.07    1.15
Writing papers                                  0.09*                     0.15**                             0.04    1.16

Note. Due to the high correlations between activities, each activity was run in its own logistic regression model. The coefficients represent the impact of the activity on engagement when controlling for prior engagement, race, gender, and variance at the school and classroom levels. *p < 0.05. **p < 0.01. ***p < 0.001.

How Did Engagement During the Pandemic Impact Students' Future Aspirations? In order to understand students' persistence during the pandemic, we explored the changes students made to their educational plans by comparing their responses in spring to those from the beginning of the year using the binary college indicator.
Prior to the shift to remote instruction, 68.66% of students planned to attend four or more years of college. When measured again during remote learning, the number of students planning to attend college or graduate school increased significantly (p < 0.05), with the odds of a student reporting plans to attend college or graduate school rising from 2.19 to 2.6. Because the GPAs of students who reported they "do not know" were more similar to students planning 2 to 4 years of college than to those who planned to attend trade school or no post-secondary education, we anticipated that much of this change would come from students affirming their plans for college. When omitting students who reported they did not know their plans on either the background or exit surveys, there were no significant differences from the beginning to the end of the year. Additionally, fewer students reported not knowing their plans during the pandemic than when surveyed at the beginning of the year (p < 0.001). To see how academic engagement impacted our measure of student persistence, we next used a two-level logistic regression, again accounting for variance between classrooms, with the binary college ambition indicator as the outcome. As shown in Table 4.3, when controlling for race and gender, we found that both GPA and academic engagement had significant correlations with plans to attend four or more years of college during the pandemic, even when controlling for previous ambitions. Students who reported being engaged in their science courses during the pandemic were 2.19 times more likely to report plans to attend college or graduate school. The odds of reporting plans to attend college or graduate school were also 1.8 times higher for each unit increase in grade point average. Teacher-level random effects were non-negligible.

Table 4.3 U.S. Students' Plans to Attend Four or More Years of College and Engagement

                                                         β         SE(β)   OR (e^β)
Previous plans to attend four or more years of college   3.56***   0.56    35.01
Female (male comparison)                                 −0.70     0.56    0.50
Race (White non-Hispanic comparison)
  Hispanic                                               −0.41     0.64    0.66
  Black                                                  0.69      0.72    1.99
  Other^a                                                −3.00*    1.40    0.05
  Asian^b                                                0.00      0.00    1.00
  Multiple                                               0.78      0.59    2.18
GPA                                                      0.59*     0.24    1.80
Academic engagement during pandemic                      0.78**    0.30    2.19

Note. The analytic sample for this table is students who reported their educational ambitions both before and during the pandemic. *p < 0.05. **p < 0.01. ***p < 0.001. ^a Only three students in the final analytic sample listed their race as Other; one of those three students selected a lower level of education. ^b All students who listed their race as Asian reported plans to attend four or more years of college both before and during the pandemic.

Finnish Results How Did Finnish Students Engage Situationally During Remote Teaching? Nearly half of students indicated interest in their science activities (44.5%); however, only 29% of experiences were identified as leaving students feeling skilled, and just over one third (34.2%) were identified as challenging (see Table 4.4). When OLMs were calculated from these three measures, only 4.7% of science experiences were engaging moments. The mean for academic engagement was only 0.05 (on a 0–1 scale).
Table 4.4 Finnish Students' OLM Situational Engagement During the Pandemic

                                 N     M      SD     Percentage of occurrence (%)^a
Interest                         701   2.49   0.71   44.5
Skill                            696   2.18   0.73   29
Challenge                        700   2.31   0.71   34.2
OLM, situational engagement^b    701   0.05   0.21   4.7

Note. ^a Occurrence is defined as choosing 3 (much) or 4 (very much) on the scale. ^b Situational engagement is defined as the joint occurrence of interest, skill, and challenge.

What Kinds of Activities Were Finnish Students Participating in During Remote Teaching? To understand the learning activities students participated in while learning remotely, we first summarised the mean level of each of the learning activities reported in the general survey (see Figure 4.2). Differences in frequency were examined using a one-way analysis of variance (ANOVA) (F = 158.5, df = 22, p < 0.001). The most often mentioned activities were following the teacher's instruction, following demonstrations, and doing independent assignments. The least mentioned activities were doing tests, sharing documents, presenting, making a video or animation, and doing experiments at home. Discussion and interaction with peers and teachers (e.g. online chat, helping each other) were mentioned at a moderate level. Studying from a website and viewing videos and animations were common activities during the pandemic, though their frequencies were lower than teachers' direct instruction and independent work.

Figure 4.2 Frequency of Learning Activities from General Survey

Students' real-time situational learning activities were also compared using chi-square tests. When pooling the data, significant differences were found among these activities (χ2 = 4085.6, df = 17, p < 0.001). Following teachers' instruction, doing tasks or assignments independently, studying from books, and solving mathematical problems were more represented than other activities. Real-time situational learning activities were then divided into three groups based on their reported frequency in the 701 situational responses: activities that occurred more than 50% of the time; those that occurred from 49% to 11% of the time; and those that occurred less than 10% of the time. Using cross-tabulation analysis, we found that more frequently employed activities, such as following teachers' instruction, which happened over 50% of the time, were less successful in facilitating academic engagement (adj. residual = −2.47) than those activities classified as medium or low frequency (see Table 4.5). We then compared the level of interest, skill, and challenge across activities using one-way ANOVA (see Table 4.6). There were significant differences across activities for interest and skill but not challenge. Post-hoc analyses again confirmed that students were less interested and felt less skilled in activities that occurred the most frequently. In other words, the most common activities students experienced were the least engaging.

Table 4.5 Cross-tabulation Analysis of Situational Engagement per ESM Activity Group

Activity group                         Not occurred   Occurred   Total
High frequency     Count               1365           75         1440
                   Std residual        0.42           −1.61
                   Adj std residual    2.47           −2.47
Medium frequency   Count               972            77         1049
                   Std residual        −0.36          1.38
                   Adj std residual    −1.83          1.83
Low frequency      Count               143            14         157
                   Std residual        −0.34          1.32
                   Adj std residual    −1.41          1.41
Total              Count               2480           166        2646

Note.
High frequency activities include following teachers' instruction, doing tasks independently, and book studying; medium frequency activities include solving math problems, writing, studying from a website, discussing online, constructing an explanation, using a model, analyzing data, evaluating information, and asking questions; low frequency activities include using evidence to make an argument, developing a model, conducting an investigation, planning an investigation, making videos, and other.

Table 4.6 ANOVA Results for Interest, Skill, and Challenge per ESM Activity Group

             High frequency   Medium frequency   Low frequency   F                          Post-hoc
             activities       activities         activities
Interest     2.54             2.76               2.76            30.66, df = 2, p < .001    High < medium, low
Skill        2.23             2.37               2.39            11.58, df = 2, p < .001    High < medium, low
Challenge    2.31             2.33               2.38            0.76, df = 2, p = .47      ns

Note. High frequency activities include following teachers' instruction, doing tasks independently, and book studying; medium frequency activities include solving math problems, writing, studying from a website, discussing online, constructing an explanation, using a model, analyzing data, evaluating information, and asking questions; low frequency activities include using evidence to make an argument, developing a model, conducting an investigation, planning an investigation, making videos, and other.

What did the social and emotional learning of the students look like while learning remotely in Finland? Among the six types of social and emotional learning experiences, when measured situationally, the most salient was the importance of learning for the future (see Table 4.7). Close to half of responses (45.5%) indicated students felt that what they were learning was useful for their future. The likelihood of reporting being confident (30.44%), curious (28.1%), or persistent (i.e. gritty; 27.5%) was modest. Correlation analyses show that when students felt their learning was important for their future, and were moderately curious, persistent, and confident about themselves, they were more likely to be situationally engaged (OLM).

Table 4.7 Situational Engagement (Optimal Learning Moments) and Social Emotional Learning

                            M      SD     Occurrence (%)^a   1         2         3         4         5        6
1. Situational engagement   0.05   0.21   4.71
2. Future importance        2.53   0.79   45.56              0.15**
3. Lonely                   1.38   0.72   8.73               −0.08*    0.12**
4. Bored                    1.80   0.82   17.74              −0.17**   −0.13**   0.30**
5. Confident                2.18   0.83   30.39              0.14**    0.22**    −0.14**   −0.26**
6. Curious                  2.09   0.87   28.10              0.12**    0.30**    0.06      −0.24**   0.42**
7. Grit                     2.12   0.82   27.48              0.12**    0.27**    −0.19**   0.30**    0.54**   0.06

Note. *p < 0.05. **p < 0.01. ^a Occurrence is defined as choosing 3 (much) or 4 (very much) on the scale.

Discussion When measured during the pandemic, U.S. students reported greater interest and challenge in their science subject than they did in Fall 2019. Generally, academic engagement for U.S. students showed an increase, but Finnish results showed that situational academic engagement was low. While it is impossible to distinguish a causal relationship, it is possible that this difference in results suggests students are less engaged in the specific activities they do remotely but were influenced by factors outside the classroom that increased overall engagement.
For example, one or more components of engagement (interest, skill or challenge) increased for students because of what they were seeing in the news regarding the novel coronavirus. Academic engagement was positively correlated with a number of project-based activities employed during the pandemic. Students showed a significant increase in the odds of reporting engagement with more frequent use of science and engineering practices, class discussions, working together to understand phenomena, various modelling activities, presenting their work, and conducting experiments at home. Textbook use remained high among students who reported an overall sense of engagement. If teachers are using the textbooks for homework, this could be 129 related to previous findings that students show above average situational engagement while doing math problems (Inkinen et al., 2020). Consistent with findings from Domina et al. (2021) more frequent opportunities for social and emotional learning (e.g. taking pride in science achievements and working in groups with peers) were also positively correlated with academic engagement. The Finnish study found that the less frequently assigned active practices (e.g. asking questions, analyzing data) were better than passive activities in facilitating interest and skill. Similarly, there was a correlation between students reporting that they were engaged and the frequency of real science and engineering practices in their U.S. classrooms. Unfortunately, the shift to remote learning did not allow students to engage in the same sorts of hands-on, project-based activities they would in a normal CESE classroom. For Finnish students, the most frequent learning activities during the pandemic were following teacher’s instruction or demonstration and doing independent tasks or assignments. In the United States, students reported high interest but few opportunities to try experiments at home. This was also the least frequently reported activity for students in Finland, which may be due to the difficulty associated with conducting science investigations without experimental tools and support from teacher and peers. Additionally, safety concerns and difficulty with distributing or obtaining resources have been shown to hinder participation in experimental activities while learning science remotely (Kelley, 2020). Students in both countries reported few opportunities to collaborate with one another in trying new activities and problem solving while learning remotely. Analysis of specific challenges faced by students in Finland reveals that many students had difficulties in planning their studies while learning remotely. Compared to the challenge of study planning, students had fewer challenges regarding technical problems or a place to study at 130 home. Students may be unfamiliar with effective time management practices when leaving the structured environment of their in-person high school classes. Finding ways to help students in planning their multiple assignments may benefit students who are learning in this less structured remote learning environment. Despite facing numerous difficulties and challenges, there were some positive findings regarding students’ social and emotional experiences. Despite reporting low situational engagement, nearly half of the Finnish students surveyed still felt that what they were learning in their science classes was important to their future. 
In the United States, educational aspirations remained high, and a significant proportion of the surveyed students raised or affirmed their ambitions toward college. This effect may be driven by students who previously did not know their plans and later decided to attend college. For students who reported their future academic plans before and after the pandemic, academic engagement was significantly correlated with plans to attend college or graduate school. Despite the challenges U.S. students faced and the difficulties with the organization and management of the transition to remote lessons, students remained positive about their future education. If we consider this ambition a testament to persistence, there was a significant correlation between academic engagement and persistence in both countries.

Limitations

In the U.S. sample, the most at-risk students were often unable to participate in the remote learning experience. The U.S. sample of students participating in remote learning was skewed toward students who had access to computers in districts that supported remote learning. Because of socioeconomic barriers, students in the United States were not graded and could not be held accountable for attendance. This lack of incentive suggests that the students who continued to participate may have shared certain characteristics. While socioeconomic disparities did not affect students' participation in Finland, the Finnish study did not follow the same students longitudinally; its results are framed mainly by participation in full remote days of instruction.

In both studies, academic engagement comprised three components, each represented by only one question. The three questions are not expected to measure the same construct, but instead to indicate when three independent concepts occur together, which is reflected in the low Cronbach's alpha (0.61; an illustrative computation is sketched below). Moving forward, measuring each construct with a multiple-question scale could provide more reliable results.

Implications

It is important to emphasize that in both the United States and Finland, active pedagogical practices in distance-learning environments were found to be the most engaging for these students. Acknowledging the difficulties of conducting these active practices remotely, schools should provide support and sufficient tools to promote effective science learning. Although Finland's unified government experienced a somewhat smoother transition to remote learning than the United States, both countries had their own set of student challenges. In both countries, students lacked opportunities to engage in scientific practices used by real scientists in the field.

Although improving engagement and promoting positive social and emotional learning is undoubtedly a challenge for remote instruction, it is one that needs attention regardless of how soon the pandemic ends. Remote or hybrid learning situations are likely to continue for the long term. The activities students were most interested in involved doing science rather than simply reading about it, which is what the CESE intervention emphasizes. In the full sample in the prior year, a positive effect for science learning was found among treatment students, including for low-income and minority students who were over-sampled (Schneider et al., 2022). This leads to concerns of potential learning loss when students are unable to participate in experiential science learning.
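As noted in the limitations above, the three single-item engagement components yielded a Cronbach's alpha of 0.61. The sketch below shows how that coefficient is computed from a respondents-by-items score matrix; the simulated three-column array stands in for the actual interest, skill, and challenge items.

    # Illustrative only: Cronbach's alpha for a three-item engagement measure,
    # computed on simulated responses rather than the study data.
    import numpy as np

    def cronbach_alpha(items: np.ndarray) -> float:
        # items: respondents-by-items matrix of scores
        k = items.shape[1]                              # number of items
        item_var_sum = items.var(axis=0, ddof=1).sum()  # sum of item variances
        total_var = items.sum(axis=1).var(ddof=1)       # variance of summed scores
        return (k / (k - 1)) * (1 - item_var_sum / total_var)

    rng = np.random.default_rng(0)
    latent = rng.normal(size=(300, 1))  # shared "engagement" signal
    # Three 4-point items (interest, skill, challenge) with added noise
    responses = np.clip(np.round(2.5 + latent + rng.normal(scale=1.0, size=(300, 3))), 1, 4)
    print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")

With only three loosely related single items, an alpha near 0.6 is unsurprising; as suggested above, multi-item scales for each component would be the more direct remedy.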
Collaboration and experimentation are key practices used by real scientists working in the field. To optimize students' science learning, remote classroom environments may use and adapt existing technologies that allow students to remain engaged with their lessons and give them opportunities to figure out phenomena. This engagement with science may in turn encourage students toward more ambitious educational goals.

Conclusion

Similar to results from previous years of the CESE intervention, during remote learning in the 2019–2020 school year, students in the CESE study showed engagement that was strongly correlated with a variety of project-based activities assigned by their teachers. In both the United States and Finland, students reported higher engagement with some of the least frequently assigned activities. When learning remotely, students showed more engagement when performing real science and engineering practices such as conducting investigations at home, performing modelling activities, asking questions, and participating in class discussions. This engagement was in turn related to positive social and emotional outcomes such as confidence, persistence, and ambition. The likelihood of engagement was, however, much lower in remote learning environments than in the normal classroom setting. Consequently, attention should be paid to providing equitable opportunities to participate in project-based learning activities whether learning remotely or in a classroom.

REFERENCES

Csikszentmihalyi, M. (1990). Flow: The psychology of optimal experience. Harper Perennial.

Domina, T., Renzulli, L., Murray, B., Garza, A. N., & Perez, L. (2021). Remote or removed: Predicting successful engagement with online learning during COVID-19. Socius, 7, 2378023120988200.

Dorn, E., Hancock, B., Sarakatsannis, J., & Viruleg, E. (2020). COVID-19 and student learning in the United States: The hurt could last a lifetime. McKinsey & Company. Retrieved from https://www.mckinsey.com/industries/public-and-social-sector/our-insights/Covid-19-and-student-learning-in-the-united-states-the-hurt-could-last-a-lifetime

Duckworth, A. L. (2016). Grit: The power of passion and perseverance. Scribner.

Escueta, M., Quan, V., Nickow, A. J., & Oreopoulos, P. (2017). Education technology: An evidence-based review. NBER Working Paper No. 23744. https://doi.org/10.3386/w23744

Fraillon, J., Ainley, J., Schulz, W., Friedman, T., & Duckworth, D. (2019). Preparing for life in a digital world. IEA International Computer and Information Literacy Study 2018: International report. Retrieved from https://www.iea.nl/sites/default/files/2019-11/ICILS%202019%20Digital%20final%2004112019.pdf

Hidi, S., & Renninger, K. A. (2006). The four-phase model of interest development. Educational Psychologist, 41, 111–127.

Hopkins, B., Turner, M., Lovitz, M., Kilbride, T., & Strunk, K. O. (2021). Policy brief: A look inside Michigan classrooms: Educators' perceptions of Covid-19 and K-12 schooling in the fall of 2020. Education Policy Innovation Collaborative, Michigan State University, East Lansing, MI.

Inkinen, J., Klager, C., Juuti, K., Schneider, B., Salmela-Aro, K., Krajcik, J., & Lavonen, J. (2020). High school students' situational engagement associated with scientific practice in designing science situations. Science Education, 104(4), 667–692. https://doi.org/10.1002/sce.21570

Kankaras, M., & Suarez-Alvarez, J. (2019). Assessment framework of the OECD study on social and emotional skills. OECD Education Working Papers No. 207. Paris: OECD.
Kaufman, J. H., Hamilton, L. S., & Diliberti, M. (2020). Which parents need the most support while K–12 schools and childcare centers are physically closed? RAND Corporation. Retrieved from https://www.rand.org/pubs/research_reports/RRA308-7.html

Kelley, E. W. (2020). Reflections on three different high school chemistry lab formats during COVID-19 remote learning. Journal of Chemical Education, 97, 2606–2616. https://doi.org/10.1021/acs.jchemed.0c00814

Kraft, M., & Simon, N. S. (2020). Teachers' experiences working from home during the COVID-19 pandemic. Upbeat. Retrieved from https://f.hubspotusercontent20.net/hubfs/2914128/Upbeat%20Memo_Teaching_From_Home_Survey_June_24_2020.pdf

Krajcik, J. S., & Shin, N. (2014). Project-based learning. In R. K. Sawyer (Ed.), The Cambridge handbook of the learning sciences (pp. 275–297). Cambridge, United Kingdom: Cambridge University Press.

Meluzzi, F. (2020). Strengthening online learning when schools are closed: The role of families and teachers in supporting students during the COVID-19 crisis. The OECD Forum Network. Retrieved from http://www.oecd.org/coronavirus/policy-responses/strengthening-online-learning-when-schools-are-closed-the-role-of-families-and-teachers-in-supporting-students-during-the-covid-19-crisis-c4ecba6c/

OECD. (2019). PISA 2018 results (Volume I): What students know and can do. PISA, OECD Publishing. https://doi.org/10.1787/5f07c754-en

Reich, J., Buttimer, C. J., Fang, A., Hillaire, G., Hirsch, K., Larke, L. R., Littenberg-Tobias, J., Moussapour, R., Napier, A., Thompson, M., & Slama, R. (2020). Remote learning guidance from state education agencies during the Covid-19 pandemic: A first look. Retrieved from https://osf.io/k6zxy/

Salmela-Aro, K., Moeller, J., Schneider, B., Spicer, J., & Lavonen, J. (2016). Integrating the light and dark sides of student engagement using person-oriented and situation-specific approaches. Learning and Instruction, 43, 61–70.

Salmela-Aro, K., & Upadyaya, K. (2020). School engagement and school burnout profiles during high school: The role of socio-emotional skills. European Journal of Developmental Psychology, 17(6), 943–964.

Schneider, B., Krajcik, J., Lavonen, J., & Salmela-Aro, K. (2020). Learning science: Crafting engaging science environments. Yale University Press.

Schneider, B., Krajcik, J., Lavonen, J., Salmela-Aro, K., Broda, M., Spicer, J., Bruner, J., Moeller, J., Linnansaari, J., Juuti, K., & Viljaranta, J. (2016). Investigating optimal learning moments in U.S. and Finnish science classes. Journal of Research in Science Teaching, 53(3), 400–421.

Schneider, B., Krajcik, J., Lavonen, J., Salmela-Aro, K., Klager, C., Baker, Q., Chen, I., Bradford, L., Touitou, T., Peek-Brown, D., Marias Dezendorf, R., & Maestrales, S. (2022). Improving science achievement – Is it possible? Evaluating the efficacy of a high school chemistry and physics project-based learning intervention: Crafting engaging science environments. Educational Researcher.

Tang, X., Upadyaya, K., & Salmela-Aro, K. (2021). School burnout and psychosocial problems among adolescents: Grit as a resilience factor. Journal of Adolescence, 86, 77–89. https://doi.org/10.1016/j.adolescence.2020.12.002

Tang, X., Wang, M. T., Guo, J., & Salmela-Aro, K. (2019). Building grit: The longitudinal pathways between mindset, commitment, grit, and academic outcomes. Journal of Youth and Adolescence, 48(5), 850–863.

The Finnish National Agency for Education. (2020). Guidelines for primary education. Opetushallitus.
Retrieved from https://www.oph.fi/fi/koulutus-ja-tutkinnot/opetustoimi-ja-koronavirus

Trinidad, J. E. (2021). Equity, engagement, and health: School organisational issues and priorities during COVID-19. Journal of Educational Administration and History, 53(1), 67–80. https://doi.org/10.1080/00220620.2020.1858764

United Nations Educational, Scientific, and Cultural Organization. (2020). Education: From disruption to recovery. The Author. Retrieved from https://en.unesco.org/Covid19/educationresponse

Yang, X., Zhang, M., Kong, L., Wang, Q., & Hong, J. C. (2020). The effects of scientific self-efficacy and cognitive anxiety on science engagement with the "question-observation-doing-explanation" model during school disruption in COVID-19 pandemic. Journal of Science Education and Technology, 30(3), 380–393.

CHAPTER 5: DISCUSSION AND CONCLUSION

Contributing to the Landscape of Science Education Research

Studies of high school science interventions, like the one developed by CESE, are critical to the landscape of science education research. In a content analysis of 650 empirical chemistry education papers published between 2004 and 2013 in a Royal Society of Chemistry journal, only 25% of the selected manuscripts studied students in grades 10–12, with a far larger share conducted at post-secondary institutions; in the U.S. there were 12.7 times more manuscripts related to post-secondary education than to grades 10–12 (Teo et al., 2014). Moreover, a 2020 analysis by Kanim and Cid showed that high school students comprised only 8% of the students in physics education research manuscripts published between 1970 and 2015, while 70% of those students were enrolled in university calculus-based courses. Kanim and Cid (2020) use these data to argue that many physics education research studies are not necessarily generalizable to the 1.38 million high school physics students in the United States (US), as the institutions where the studies typically occur have wealthier students with stronger math preparation and less ethnic diversity than the general population.

Unlike the studies that comprise the majority of science education research, CESE focused on non-calculus-based science education and deliberately over-sampled high schools in low-income and diverse school districts. The Common Core of Data was used to compute a generalizability index of 0.82 for the entirety of the US, suggesting the mean treatment effect could be generalized to the inference population (Schneider et al., 2022).

The NRC calls for comprehensive measures that include the development of new project-based curriculum, educator training in project-based and inquiry-driven teaching (NRC, 2012b), and the construction of assessments that capture information about students' ability to construct explanations using three-dimensional reasoning (NRC, 2014). Researchers at CESE designed an intervention to meet those suggestions and provide educators with learning materials that are tested and ready for use in the classroom, and the results were improved science achievement and increased educational ambition. All three manuscripts associated with this dissertation come together to support prior research into the benefits of project-based curriculum and the possibilities of using technologies that facilitate student learning in average US high schools.

Improving Science Achievement – Is It Possible?
Evaluating the Efficacy of a High School Chemistry and Physics Project-Based Learning Intervention

The project-based learning intervention designed by CESE was effective in boosting students' mean performance on the summative assessment taken at the end of the school year. Students who participated in the CESE treatment condition outperformed their peers in the control group by more than 0.2 standard deviations, with 28% of that effect potentially accounted for by students' reported use of models in the classroom (Schneider et al., 2022). Additionally, the treatment condition was related to a significantly higher likelihood of an increase in educational ambition during the school year, even when controlling for students' personal demographics, their level of ability as measured by the pretest, and the average pretest score of physics and chemistry students within the school.

Implications

Although the CESE curriculum is not designed specifically to discuss or promote college ambition, it does provide students with the experience of acting as real scientists in the field. During the intervention, students explain real-world phenomena by asking questions, designing experiments, and then collecting and analyzing data. While the NRC (2012a) suggests that promoting educational attainment may enhance or promote the development of relevant science competencies, it may be that educators can promote educational attainment through the teaching of relevant science competencies as well.

Using Machine Learning to Score Multi-Dimensional Assessments of Chemistry and Physics

To understand more about what students were learning in the NGSS-aligned curriculum, researchers at CESE developed a number of assessments and rubrics that captured information about students' engagement in three-dimensional reasoning. Unlike many other studies, this manuscript addressed specific items drawn from a national test bank, the rubrics developed for them, and how those rubrics represented the individual dimensions of learning and grade-level performance expectations (PEs). Coupled with the diverse student body that lent itself to the generalizability of the 2018-2019 study, the results suggest that machine-learning algorithms can successfully classify responses from a diverse and representative sample of US students. Moreover, the automated scoring classifications showed high agreement with human raters when broken down by the NGSS PEs represented in the students' reasoning.

Implications

Automated scoring methods were effective in differentiating between response classifications using rubrics developed to capture the use of multi-dimensional reasoning. Given the sample sizes needed to train the scoring algorithms, this method of scoring would not be practical for assessments developed by individual teachers in their classrooms. It could, however, be applied to national test-bank items, where large numbers of student responses are used to train an algorithm that is then associated with that item online. The Automated Analysis of Constructed Response (AACR) collaboration provides educators with a tool that does exactly that. AACR provides a test bank of NGSS-aligned items that ask students to explain their reasoning. Teachers can then upload the responses into the Constructed Response Classifier (CRC Tool), which has already been trained to identify correct responses. Through technologies such as the CRC Tool, teachers can collect information about students' reasoning and argumentation with ease comparable to scoring multiple-choice items.
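The approach summarized above trains a classifier on a large pool of human-scored responses and then checks its agreement with human raters. The sketch below illustrates that general supervised workflow; the TF-IDF features, logistic-regression classifier, and invented responses are assumptions for illustration, not the actual AACR/CRC pipeline or the CESE scoring models.

    # Illustrative only: supervised scoring of constructed responses, with
    # machine-human agreement summarized by Cohen's kappa. The responses and
    # labels are invented examples, not study data.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import cohen_kappa_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    responses = [
        "The salt dissolves because water molecules pull the ions apart",
        "It just disappears when you stir it",
        "Energy transfers from the warmer object to the cooler object",
        "The ice melts because it gets hot",
        "The particles move faster as the temperature increases",
        "I do not know",
    ] * 50  # repeated to mimic a larger pool of scored responses
    human_scores = [1, 0, 1, 0, 1, 0] * 50  # 1 = meets the rubric level, 0 = does not

    X_train, X_test, y_train, y_test = train_test_split(
        responses, human_scores, test_size=0.25, random_state=0, stratify=human_scores)

    # TF-IDF features feed a logistic-regression classifier
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)

    machine_scores = model.predict(X_test)
    print("Cohen's kappa (machine vs. human):",
          round(cohen_kappa_score(y_test, machine_scores), 2))

In practice, rubric levels are usually multi-category rather than binary, training sets run to hundreds or thousands of responses per item, and agreement is examined per item and per performance expectation, as described in the chapter.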
Individual educators can save time on assessment development by using test banks provided by automated scoring services such as AACR. Teachers would not need to train the algorithms with human scores or worry about the impact of item difficulty on the distribution of scores among student responses. This could provide significantly more information to teachers than multiple-choice items, with similar scoring cost and effort.

U.S. and Finnish High School Engagement During the Covid-19 Pandemic

During the pandemic, CESE also found that students reported engagement (high levels of interest, skill, and challenge) during their online learning experiences, despite the difference in content delivery. Students who were able to participate in their courses online reported the highest interest in the activities they were able to do least frequently. Students in Finland showed the highest interest in "low frequency activities," which included using evidence to make an argument, developing a model, conducting or planning investigations, and making videos. In the US, students' reported participation in science and engineering practices (SEPs) had the largest coefficient in predicting whether a student reported being engaged in their online classrooms. Three of the four highest-interest activities were watching videos of experiments, performing experiments at home, and using online simulations.

Despite the numerous hardships for people worldwide, students who maintained participation in the CESE intervention by accessing their courses remotely during the Covid-19 pandemic were more likely to plan to attend college when surveyed at the end of the school year. Unlike in the 2018-2019 cohort, this difference was driven by students who had previously reported uncertainty and later changed to a plan for four or more years of college. Additionally, the plan to attend college was strongly related to a student reporting engagement when surveyed during online courses.

Implications

The results of the Maestrales et al. (2021a) study suggest students are showing interest in hands-on science while also understanding that videos and interactive simulations are a useful substitute when circumstances limit activities. These types of technologies can benefit students in a variety of learning environments both inside and outside the classroom. For instance, some teachers in the CESE intervention reported that they were unable to conduct certain experiments due to a lack of science resources. Many of those same teachers also reported having access to a computer lab where they could schedule time for their students to participate in online activities and simulations. Such simulations can provide alternative solutions that allow students to develop skills through project-based opportunities to participate in various SEPs, even when resources are limited.

Connecting the Pieces

The manuscripts in this dissertation agree with prior research suggesting that doing science has proven benefits for students learning in the classroom and online. In addition to use in the classroom, vetted science curriculum materials available online, such as those developed by CESE, could be used to design future research. Research from CESE provides quantitative evidence that supports the NRC's call for professional development, NGSS-aligned curriculum, and multi-dimensional assessments (Schneider et al., 2022).
Greater similarity between classrooms in these ways is expected to promote equity, but school resources can still have a significant impact on student science learning, and students in schools that lack material resources may have limited opportunities to participate in project-based science activities (NGSS Lead States, 2013). Coupled with automated scoring methods, such as those designed by AACR, studies could be developed using these vetted materials that lead to more similar pedagogy, teaching materials, content, and assessment without placing significantly greater burdens on teachers.

Project-Based Curriculum in the Classroom and Online

Remote content delivery by CESE teachers during Covid-19 showed that meaningful science teaching could occur in online classrooms. Despite having little to no time to prepare for such a shift in pedagogy, teachers were resourceful in moving content online by using the many existing online platforms and simulations. Although not reported in these manuscripts, teachers were asked to provide information regarding their experiences teaching online during the pandemic. Teacher interviews revealed that they used a wide variety of online learning tools to bring as much project-based content to students as possible. They also noted that students seemed most interested in lessons and experiments related to what they were seeing in the news, such as making their own hand sanitizer during the shortages. The student data collection instruments for the 2019-2020 cohort were informed by teachers' reported use of interactive simulations, videos of experiments, and collaboration tools.

In addition to the instruments described in this dissertation, CESE students were also asked to provide suggestions about how to improve the online learning experience. Their responses confirmed what their teachers were identifying as meaningful lessons. Responses centered on coverage of real-world science topics and project-based experiments that could be conducted at home. Many students also recognized the need for those experiments to be designed around equity and safety.

A variety of online resources have emerged that provide opportunities to participate in simulated experiments or collect data based on simulations of events when hands-on participation is not possible. Existing technologies were helpful during the Covid-19 pandemic, when many educators worldwide turned to remote content delivery with little time to adapt lessons that were designed for in-person delivery (Brown & Krzic, 2021; Maestrales et al., 2021a). These types of simulations could facilitate equitable learning for all students, including those who cannot come to the classroom, those with disabilities, and students in classrooms that lack the resources for some project-based lessons. CESE students suggested they were interested in adaptive technologies that would allow them to participate in or watch experiments remotely during the pandemic (Maestrales et al., 2021a). During interviews, CESE teachers reported using websites such as LabXchange, developed by Harvard University in 2018 (LabXchange, 2018), to provide virtual experimentation and data collection opportunities to their students while online. Some research suggests differences in student engagement between online and in-person delivery formats (Robinson & Hullinger, 2008; Kemp & Grieve, 2014).
In a manuscript regarding the efficacy of hands-on experiments compared to computerized experiments, Carter and Emerson (2012) highlight the difficulties in making formal comparisons between studies due to differences in both pedagogy and outcome measures. Although earlier studies, such as that conducted by Carter and Emerson in 2012, found that students reported greater satisfaction when experiments were delivered in the classroom, advances in the technologies that students are accustomed to using inside and outside the classroom have created a much different landscape for learners. As technology develops and the general population becomes more familiar with new resources, studies must continue to build understanding of this new learning landscape. By coupling vetted curriculum and assessment materials with these newly developed interactive learning tools, researchers could provide significant insights into differences for students learning in the classroom and online. In addition to the direct benefits to both students and teachers in the classroom, these national standards and more consistent curricula could also lead to research in which data are comparable across studies.

College Ambition

The benefits of project-based learning appear to go beyond academic achievement: three-dimensional, project-based lessons also appear to influence students' desire to raise their educational aspirations. Connecting the results of the presented studies with regard to college ambition, participation in project-based activities appears to be related to future goal setting. The treatment intervention was linked to increased college ambition for the 2018-2019 cohort. Information from the 2019-2020 cohort showed that there was a significant relationship between participation in SEPs and students' reported engagement, while engagement was in turn related to educational ambitions.

Scientists have shown a strong, positive three-way relationship between mastery experiences, goal-setting behaviors, and self-efficacy, or the confidence that one can succeed in a task (Bandura, 1999; Earley & Lituchy, 1991; West & Thorn, 2001). It is possible that the engagement in and successful completion of these inquiry-driven experiments may provide mastery experiences that foster improved self-efficacy and goal setting in the students. This would support the growing body of research suggesting that successful project-based science experiences lead students to a greater sense of science self-efficacy (e.g., Bilgin et al., 2015; Samsudin et al., 2020; Schaffer et al., 2012). It may be through this connection between mastery and self-efficacy that CESE students are increasing their educational goals.

To bolster the STEM workforce and better understand the nuances of the STEM pipeline, future studies should seek to explain the mechanism by which project-based activities foster this increase in educational ambition. Future work using structural equation modeling or path analysis could help to provide insights into the mediating effects of science identity, mastery experiences, and self-efficacy in models regarding the impact of project-based science lessons on goal-setting behaviors or college ambition.

Limitations

Although the initial student sample for the 2019-2020 cohort mirrored the sample from the 2018-2019 study, Covid-19 created unique circumstances which left many learners unable to participate.
The final results of this study reflect those students who were able to attend their classes remotely and chose to do so when little could be done to mandate participation or attendance. Additionally, students were asked to respond to these questions about remote participation in SEPs when there were no other options available for learning. To understand student interest in new and emerging technologies, it is important to also consider their attitudes and opinions in their regular learning environments.

Conclusion

The many manuscripts and projects that have been developed under the CESE project-based learning intervention contribute significantly to the landscape of science education research, yet there is more to be done. More single-unit curriculum materials that meet NGSS standards need to be tested and made available so that educators can implement project-based curriculum for the full school year. Evidence suggests that national standards for project-based curriculum and assessment will benefit students' academic performance. These units can provide opportunities for more comparable studies through similarities in methodology, pedagogy, and outcome measures. This will in turn allow for deeper investigation of how emerging technologies affect students and teachers in their project-based classrooms.

REFERENCES

Bandura, A., & Wessels, S. (1994). Self-efficacy (Vol. 4, pp. 71-81).

Bilgin, I., Karakuyu, Y., & Ay, Y. (2015). The effects of project based learning on undergraduate students' achievement and self-efficacy beliefs towards science teaching. Eurasia Journal of Mathematics, Science and Technology Education, 11(3).

Brown, S., & Krzic, M. (2021). Lessons learned teaching during the COVID-19 pandemic: Incorporating change for future large science courses. Natural Sciences Education, 50(1), e20047.

Carter, L. K., & Emerson, T. L. (2012). In-class vs. online experiments: Is there a difference? The Journal of Economic Education, 43(1), 4-18.

Earley, P. C., & Lituchy, T. R. (1991). Delineating goal and efficacy effects: A test of three models. Journal of Applied Psychology, 76(1), 81.

Kanim, S., & Cid, X. C. (2020). Demographics of physics education research. Physical Review Physics Education Research, 16(2), 020106.

Kemp, N., & Grieve, R. (2014). Face-to-face or face-to-screen? Undergraduates' opinions and test performance in classroom vs. online learning. Frontiers in Psychology, 5, 1278. https://doi.org/10.3389/fpsyg.2014.01278

LabXchange: About. (2018). Retrieved from https://about.labxchange.org/

Maestrales, S., Marias Dezendorf, R., Tang, X., Salmela-Aro, K., Bartz, K., Juuti, K., Lavonen, J., Krajcik, J., & Schneider, B. (2021a). U.S. and Finnish high school science engagement during the Covid-19 pandemic. International Journal of Psychology, 57(1), 73-86.

Maestrales, S., Zhai, X., Touitou, I., Baker, Q., Krajcik, J., & Schneider, B. (2021b). Using machine learning to evaluate multidimensional assessments of chemistry and physics. Journal of Science Education and Technology, 30(2), 239-254.

National Research Council. (2012b). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. The National Academies Press.

National Research Council. (2014). Developing assessments for the Next Generation Science Standards. Washington, DC: The National Academies Press.

NGSS Lead States. (2013). Next generation science standards: For states, by states. Washington, DC: The National Academies Press.

Robinson, C. C., & Hullinger, H. (2008).
New benchmarks in higher education: Student engagement in online learning. Journal of Education for Business, 84(2), 101-109.

Samsudin, M. A., Jamali, S. M., Md Zain, A. N., & Ale Ebrahim, N. (2020). The effect of STEM project-based learning on self-efficacy among high-school physics students. Journal of Turkish Science Education, 16(1), 94-108.

Schaffer, S. P., Chen, X., Zhu, X., & Oakes, W. C. (2012). Self-efficacy for cross-disciplinary learning in project-based teams. Journal of Engineering Education, 101(1), 82-94.

Schneider, B., Krajcik, J., Lavonen, J., Salmela-Aro, K., Klager, C., Bradford, L., Chen, I., Baker, Q., Touitou, I., Peek-Brown, D., Marias Dezendorf, R., Maestrales, S., & Bartz, K. (2022). Improving science achievement – Is it possible? Evaluating the efficacy of a high school chemistry and physics project-based learning intervention. Educational Researcher, 0013189X211067742.

Teo, T. W., Goh, M. T., & Yeo, L. W. (2014). Chemistry education research trends: 2004–2013. Chemistry Education Research and Practice, 15(4), 470-487.

West, R., & Thorn, R. (2001). Goal-setting, self-efficacy, and memory performance in older and younger adults. Experimental Aging Research, 27(1), 41-65.