PROJECT-BASED SCIENCE LEARNING FACILITATED THROUGH TECHNOLOGY By Sarah Maestrales A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Measurement and Quantitative Methods - Doctor of Philosophy 2024 ABSTRACT This dissertation focuses on three manuscripts all related to bolstering science achievement through recommendations from the National Research Council (NRC) regarding the teaching and learning of science. The manuscripts address meeting the NRC’s call to incorporate curriculum and assessment that lead to more in-depth knowledge that can be transferred across domains, and the use of technology to facilitate teaching and learning. The first manuscript, “Improving Science Achievement – Is It Possible? Evaluating the Efficacy of a High School Chemistry and Physics Project-Based Learning Intervention,” describes the process of developing intensive project-based curriculum and assessment materials for high school chemistry and physics classrooms. This study answers the question of what impact that curriculum has on students’ future science achievement and academic ambition. The second manuscript, “Using Machine Learning to Score Multi-Dimensional Assessments of Chemistry and Physics,” focuses on the use of a supervised machine learning approach to facilitate the scoring of science assessments. The goal of this study was to determine whether automating the process of classifying these responses could reduce the burden placed on teachers in scoring assessments that effectively measure the dimensions of learning spelled out by the NRC. The third study, “U.S. and Finnish high school engagement during the Covid-19 Pandemic,” then explores student engagement with the use of technologies that facilitate remote instruction. This thesis is dedicated to my daughter Callista. Thank you for your love and understanding. iii ACKNOWLEDGEMENTS This work would not have been possible without the support of my committee members: Dr. Barbara L. Schneider, Dr. Kenneth Frank, Dr. Kimberly Kelly, and Dr. Joseph Krajcik. And also thank you to Dr. Xiaoming Zhai for his continued support and guidance. This material is based upon work supported by the National Science Foundation under Grant No. OISE: 1545684. iv TABLE OF CONTENTS CHAPTER 1: INTRODUCTION AND LITERATURE REVIEW ............................................... 1 Bolstering an International STEM Workforce............................................................................ 1 The Crafting Engagement in Science Environments Study........................................................ 7 REFERENCES ......................................................................................................................... 13 CHAPTER 2: IMPROVING SCIENCE ACHIEVEMENT—IS IT POSSIBLE? EVALUATING THE EFFICACY OF A HIGH SCHOOL CHEMISTRY AND PHYSICS PROJECT-BASED LEARNING INTERVENTION.................................................................................................... 16 Abstract ..................................................................................................................................... 16 Introduction and Literature Review .......................................................................................... 17 The Intervention: Crafting Engaging Science Environments ................................................... 
18 Method ...................................................................................................................................... 27 Results ....................................................................................................................................... 36 Discussion ................................................................................................................................. 39 Notes ......................................................................................................................................... 43 REFERENCES ......................................................................................................................... 46 APPENDIX 2.A BALANCE TABLES BETWEEN THE TREATMENT AND CONTROL SCHOOLS ................................................................................................................................ 49 APPENDIX 2.B TEACHER EXIT SURVEY ITEMS DEALING WITH THE TEACHER UNITS, PRACTICES, AND CURRICULUM: ........................................................................ 53 APPENDIX 2.C RACE HETEROGENEITY MODEL ........................................................... 55 APPENDIX 2.D MEDIATION MODEL ................................................................................. 56 APPENDIX 2.E EDUCATION AMBITION MODEL ............................................................ 57 APPENDIX 2.F FULL TREATMENT EFFECTS ESTIMATES............................................ 58 APPENDIX 2.G FULL HETEROGENEITY RESULTS ........................................................ 59 APPENDIX 2.H COLLEGE AMBITION FULL RESULTS .................................................. 60 APPENDIX 2.I FULL MEDIATION RESULTS .................................................................... 61 APPENDIX 2.J ITEM EQUIVALENCE FOR THE CHEMISTRY AND PHYSICS SUMMATIVE ASSESSMENTS.............................................................................................. 62 CHAPTER 3: USING MACHINE LEARNING TO SCORE MULTIDIMENSIONAL ASSESSMENTS OF CHEMISTY AND PHYSICS .................................................................... 64 Abstract ..................................................................................................................................... 64 Introduction and Literature Review .......................................................................................... 65 Methods..................................................................................................................................... 71 Results ....................................................................................................................................... 81 Discussion ................................................................................................................................. 90 REFERENCES ......................................................................................................................... 96 APPENDIX 3.A ITEM 1: EXPERIMENTAL DESIGN.......................................................... 99 APPENDIX 3.B ITEM 2: RELATIVE MOTION.................................................................. 101 APPENDIX 3.C ITEM 3: PROPERTIES OF SOLUTIONS ................................................. 102 APPENDIX 3.D ITEM 4: STATES OF MATTER ................................................................ 103 CHAPTER 4: U.S. 
AND FINNISH HIGH SCHOOL SCIENCE ENGAGEMENT DURING THE COVID-19 PANDEMIC .................................................................................................... 104 v Abstract ................................................................................................................................... 104 Introduction and Literature Review ........................................................................................ 105 Methods................................................................................................................................... 113 Results ..................................................................................................................................... 119 Discussion ............................................................................................................................... 129 REFERENCES ....................................................................................................................... 134 CHAPTER 5: DISCUSSION AND CONCLUSION ................................................................. 137 Contributing to the Landscape of Science Education Research ............................................. 137 Connecting the Pieces ............................................................................................................. 141 Conclusion .............................................................................................................................. 145 REFERENCES ....................................................................................................................... 147 vi CHAPTER 1: INTRODUCTION AND LITERATURE REVIEW This dissertation focuses on three manuscripts all related to bolstering science achievement through recommendations from the National Research Council (NRC) regarding the teaching and learning of science. The manuscripts address meeting the NRC’s call to incorporate curriculum and assessment that lead to more in-depth knowledge that can be transferred across domains, and the use of technology to facilitate teaching and learning. The first manuscript, “Improving Science Achievement – Is It Possible? Evaluating the Efficacy of a High School Chemistry and Physics Project-Based Learning Intervention,” describes the process of developing intensive project-based curriculum and assessment materials for high school chemistry and physics classrooms. This study answers the question of what impact that curriculum has on students’ future science achievement and academic ambition. The second manuscript, “Using Machine Learning to Score Multi-Dimensional Assessments of Chemistry and Physics,” focuses on the use of a supervised machine learning approach to facilitate the scoring of science assessments. The goal of this study was to determine whether automating the process of classifying these responses could reduce the burden placed on teachers in scoring assessments that effectively measure the dimensions of learning spelled out by the NRC. The third study, “U.S. and Finnish high school engagement during the Covid-19 Pandemic,” then explores student engagement with the use of technologies that facilitate remote instruction. Bolstering an International STEM Workforce According to a 2015 report from the National Science Board (NSB), knowledge and skills related to science, technology, engineering, and mathematics (STEM) are becoming more important to a wider range of workers. 
Improving science learning is a topic for researchers around the world. International studies such as the Programme for International Student Assessment and the Trends in International Mathematics and Science Study aim to measure students' achievement in science and mathematics. While achievement is important, research suggests that STEM students need a more robust understanding of science in order to meet the demands of a rapidly changing, technology-driven workforce. A report on graduate education by Wendler et al. (2010) goes beyond the discussion of science knowledge and achievement to make the claim that it is the innovative applications of that knowledge that will drive future economic prosperity. The National Research Council supports this claim and spells out A Framework for K-12 Science Education that should provide students with skills that are used in investigation and scientific reasoning but can also be applied across disciplines and in everyday life (NRC, 2012b).

STEM Ambition

The current number of students who are motivated to pursue STEM careers is too low to meet the demand for STEM professionals in the United States (NRC, 2012b). For students to one day join the STEM workforce, they must first have some ambition to pursue a career in the sciences. A 2018 manuscript by Vincent-Ruz and Schunn determined that there is a psychometric distinction between science identity and other constructs related to science attitudes and that science identity was an equal or stronger predictor of students' participation in optional science-related activities than other predictors. Early career aspirations are a strong predictor of later learning outcomes. One study found that students who expected a career in the sciences by eighth grade were 3.4 times more likely to earn a baccalaureate degree in the physical sciences or engineering (Tai et al., 2006). Fortunately, it appears that student pathways through the STEM pipeline are not fixed. A positive change in science identity, even occurring after students are already enrolled in a non-STEM major at university, can significantly improve the odds of a student completing a STEM degree (Ma & Xiao, 2021). Flowers III and Banda (2016) argue that the key to a diverse STEM workforce lies in fostering students' science identities. This may be a matter of simply providing students with the opportunity to understand what it means to be a scientist and to develop their identities as scientists. In the Framework, the NRC places a strong focus on problem solving, design, and project-based experiments explaining everyday phenomena, designed to bolster students' self-perception as scientists and to help develop awareness of careers in the sciences (2012b).

A Framework for K-12 Science Education

Three Dimensions of Learning

STEM jobs are changing dramatically as we adapt to new technological advancements in almost every aspect of our daily lives. To meet these changing demands, STEM workers at every level of education must be capable of flexibility in the application of their skills or knowledge, including high school and two-year college technical STEM workers as well as those with advanced degrees (NGSS Lead States, 2013; NSB, 2015). The National Research Council (NRC) uses the term "deeper learning" to describe this ability to adapt what was learned and apply it to other situations (2012a).
To retain a relevant skillset as the demands of the workforce change, students must learn to adapt their knowledge to a variety of situations and even learn new skills and information on their own (NRC, 2011). To create scientifically literate consumers of technology and to successfully educate a technical workforce to use and adapt their skills across every educational level, that effort should begin early and continue through university (NRC, 2012b). In 2012, the NRC set forth A Framework for K-12 Science Education that provided research-based suggestions toward the design of an effective and coherent curriculum for students in the United States (US). They recommended that future curriculum and assessment be built around a central framework that incorporated the learning of science with the skills necessary to plan and revise experiments to better understand the world around them.

The NRC's Framework for K-12 Science Education divides the knowledge and skills associated with science learning into three dimensions labeled as Disciplinary Core Ideas (DCIs), Crosscutting Concepts (CCs), and Science and Engineering Practices (SEPs). DCIs are broad scientific concepts that are fundamental to the field of study. CCs are concepts which facilitate understanding across multiple fields of study. SEPs are practices employed by scientists and engineers in their respective fields to conduct investigations, build models, create theories and explanations that use reasoning to explain phenomena, and design and build systems. The NRC (2012b) argues that rather than developing a limited understanding of many topics, students should instead learn a limited number of DCIs and CCs, with a primary focus on the depth of the learning. They further argue that students should engage in a process of building upon their knowledge and skills through engagement in scientific inquiry and engineering.

Next Generation Science Standards

The NGSS facilitate the development of curriculum by taking the dimensions of learning defined by the NRC and describing how students' mastery of the three dimensions can be operationalized by grade level. According to the NGSS (2013), these grade-based performance expectations (PEs) have been adapted by many states for use as policy in their own standards to encourage a learning progression which guides students through the recommended revisions to continually build upon their skills through a variety of tasks associated with science understanding. The individual PEs take a specific task and integrate each of the three dimensions into the skills and content knowledge required to master that task. As students progress through the grade levels, the PEs become more in-depth and complex (Schneider et al., 2022).

Three-Dimensional Curriculum

To engage students in each of the dimensions of science learning, they must be given opportunities to practice as scientists working in their field (Krajcik & Shin, 2014). In the US, many individual states have adapted the NGSS for their K-12 curriculum and assessment materials because they have been associated with various benefits to students' learning outcomes (NGSS Lead States, 2013). Despite having a suggested framework to shape curriculum and assessment, there are a number of limitations, including teachers' capacity to roll out new lessons in the classroom. Unfortunately, it takes time to create meaningful lesson plans (NRC, 2012a).
Three-Dimensional Assessment

To understand what students are learning from the multidimensional science curriculum, research suggests that assessments should move away from memorization tasks such as multiple-choice to include a variety of learning tasks that require students to use scientific practices and engage in reasoning through explanation (e.g., Pellegrino, 2013; Harris et al., 2019; NRC, 2014). Although modeling and explanation tasks are more difficult and time-consuming to score than multiple-choice, assessments must use items other than multiple-choice and incorporate more in-depth reasoning tasks to measure what students are learning in the classroom (Haudek et al., 2019). For example, Cooper et al. (2017) noted that researchers gained more information about what students understood in chemistry when students could convey the spatial aspects of their reasoning through drawing.

To develop the transferable deeper learning described by the NRC, curriculum and pedagogy are matched with assessment and feedback to provide meaningful learning experiences (NRC, 2012a). Measuring students' understanding through the learning progression is critical in this process, as the assessments help to shape pedagogical support and foster a healthy classroom environment. The value of these assessments is improved when the students are required to demonstrate their knowledge-in-use, or ability to connect science content to specific scientific practices (Kubsch et al., 2019).

Facilitating Three-Dimensional Science Learning Through Technology

Automated Scoring of Constructed Response

Unfortunately, the shift away from multiple-choice tasks to items that more accurately measure students' mastery of the dimensions of learning may prove challenging, as assessments must be developed and scoring must be facilitated to reduce the burden on teachers (NRC, 2014). Without new methods to facilitate scoring, these items require more work to assess, and students wait longer to receive feedback (Williamson et al., 2012; Ha & Nehm, 2016). Fortunately, existing studies suggest that machine learning may be able to reduce that time between assessment and feedback, in turn promoting the use of these more in-depth assessment measures (Lee et al., 2019b; Lottridge et al., 2018; Zhai et al., 2020b). Because assessments which capture more information about students' understanding are more difficult to score, methods to reduce the burden of scoring are highly desirable. By reducing the time needed for scoring, machine learning could make constructed response (CR) items as accessible as multiple choice. Research into the use of automated scoring of constructed response is showing great promise (Zhai et al., 2020a). Generally, machine learning is successfully being used to facilitate science learning in many ways, including automated feedback, the scoring of essays, and even learning games (Zhai et al., 2020b; Lee et al., 2019a).
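To make the general supervised approach described above concrete, the following sketch trains a simple text classifier on a handful of human-scored responses and then uses it to label a new response. It is only a minimal illustration under assumed tools and data: it uses scikit-learn with TF-IDF features and logistic regression, and the example responses and rubric levels are hypothetical; it is not the scoring system used in the studies discussed in this dissertation.

```python
# Minimal sketch of supervised automated scoring: human-scored constructed
# responses are used to train a text classifier that labels new responses.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: student responses and the rubric levels
# assigned by trained human raters (0 = low, 2 = high).
responses = [
    "the particles gain energy and move farther apart",
    "it evaporates because it is hot outside",
    "energy transfers from the skin to the liquid so the fastest particles escape",
    "the liquid just disappears",
]
human_scores = [2, 0, 2, 0]

# TF-IDF features feed a regularized logistic regression classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(responses, human_scores)

# New, unscored responses can then be classified automatically.
print(model.predict(["the particles with the most energy leave the liquid"]))
```

In practice, a much larger human-scored training set and systematic agreement checks between machine and human raters are required before such a classifier can stand in for human scoring.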
The curriculum and assessments are based on the science learning guidelines set forth in NGSS (2013) with project-based activities that allow students to engage in inquiry driven reasoning and experimentation using a variety of scientific practices. CESE applied the NRC’s dimensions of learning and the related NGSS PEs to develop specific lessons that teachers can implement in their high school physics and chemistry classrooms. Researchers in both the US and Finland worked together to design a series of project-based lesson plans for teachers. As described in Schneider et al. (2022), the CESE physics and chemistry interventions involve the enactment of three units throughout the school year, with each unit lasting approximately 4 weeks. Designed to be taught in a specific order, the three CESE units each build upon each other as students use ideas or concepts they discovered in the previous unit to scaffold their understanding of new ideas in subsequent units. Each unit is designed to align to a specific set of performance expectations (PEs) and is built around the idea of students figuring out the answer to a meaningful and relevant question referred to as the “Driving Question” (DQ). Given the importance of these SEPs for science learning, CESE scaffolded students learning through a variety of tasks that were introduced throughout the units and measured on the post- unit assessments. While many SEPs are invoked throughout the curriculum, special care is put 7 towards engaging students with making observations, collecting data, using those observations to explain phenomena, and then modeling the phenomena. Students in both the US and Finland participated in the learning intervention in the 2018- 2019 and 2019-2020 school years. The magnitude and duration of the study left researchers with a large and comprehensive dataset obtained through cluster randomized trails. Students completed multiple instruments that were designed to collect information about their science achievement before and after the intervention as well as background and exit surveys regarding their experiences in the classroom and their beliefs around science. That data collection effort led to a number of manuscripts intended to shed light onto science learning and how to facilitate that learning through technology. Improving Science Achievement – Is It Possible? Evaluating the Efficacy of a High School Chemistry and Physics Project-Based Learning Intervention In 2022, CESE researchers Schneider et al. published their main-effects in Educational Researcher in a manuscript titled “Improving Science Achievement – Is It Possible? Evaluating the Efficacy of a High School Chemistry and Physics Project-Based Learning Intervention.” Based on prior evidence surrounding the benefits of the NGSS, CESE researchers hypothesized that students would perform better on a third-party developed measure of science achievement after participation in the intervention’s inquiry-driven project-based curriculum. The 2022 publication from Schneider et al. comprises Chapter 2 of this dissertation. In this chapter we discuss the foundations of the CESE study, its development, founding principles, teacher training, and curriculum development. To explore what was driving the treatment effects, CESE researchers tested the interaction effects of treatment with gender and race or ethnicity. They also tested the mediating effect of fidelity of implementation in the classroom. 
Finally, CESE researchers tested the impact of treatment on educational ambition, with the hypothesis that participating in an NGSS-aligned, project-based curriculum and acting as real scientists would encourage students to consider their future learning objectives. All analyses used three-level models clustered at the teacher and school levels and included both student- and school-level covariates.

The results of this study show positive impacts for students engaged in the treatment condition of the intervention. On average, the intervention showed a positive effect for all students, with no interaction effects by gender or race reaching significance at an alpha of 0.05. The sample used for this study showed a high generalizability index to the general population of high school students in chemistry and physics courses in the US. Additionally, engagement in the project-based intervention was associated with students' later learning ambition, even without any specific career focus in the curriculum. The results are discussed in more detail in Chapter 2 of the dissertation.

Using Machine Learning to Score Multi-Dimensional Assessments of Chemistry and Physics

As a part of the CESE intervention, students participated in multiple assessments of science achievement that were designed to incorporate multiple dimensions of learning. Prior to beginning the intervention, students participated in a science achievement assessment that included several constructed response (CR) items. Although the original rubrics were aligned with NGSS PEs for varying grade levels and were not intended to capture the three dimensions of learning, CESE found that most students engaged multiple dimensions in their responses despite not being prompted to do so. Therefore, CESE researchers developed special rubrics to capture information regarding the DCIs, CCs, and SEPs related to the chosen items.

Due to the magnitude of the study in the 2018-2019 school year, and the additional costs associated with scoring CR items, researchers with CESE decided to explore the possibility of using automated scoring methods. In Chapter 3 of this dissertation, I discuss the methods and results from a manuscript titled "Using Machine Learning to Score Multi-Dimensional Assessments of Chemistry and Physics," which was published in the Journal of Science Education and Technology in 2021. This study was developed and conducted by Maestrales et al. under the larger CESE project. In this manuscript, the authors discuss in depth the need for three-dimensional assessment and the need for automated scoring methods that reduce the burden on teachers. They outline the supervised learning approach, with rigorous training of human raters, and the process of developing a dataset for use in training the algorithm. Human-to-human agreement and human-to-machine agreement are described using Cohen's kappa. Researchers also explored the agreement between human and machine raters by the dimensions captured in the rubric and by representation within the sample. The machine-reported probability of a correct classification was reported by dimensions of learning and representation within the sample as well. Unlike many other studies of automated scoring, this study discusses the specific rubrics used by human raters and how they address the dimensions of learning to create classification categories.
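As a point of reference for the agreement statistic mentioned above, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance. The short sketch below, which assumes scikit-learn and uses hypothetical rubric labels, shows how the statistic can be computed; it illustrates the measure itself rather than the analysis reported in the manuscript.

```python
# Illustrative computation of Cohen's kappa between a human rater and a
# machine classifier. kappa = (p_o - p_e) / (1 - p_e), where p_o is the
# observed agreement and p_e is the agreement expected by chance.
from sklearn.metrics import cohen_kappa_score

# Hypothetical rubric labels for ten constructed responses.
human_labels   = [2, 1, 0, 2, 1, 1, 0, 2, 2, 1]
machine_labels = [2, 1, 0, 1, 1, 1, 0, 2, 2, 0]

kappa = cohen_kappa_score(human_labels, machine_labels)
print(f"Cohen's kappa: {kappa:.2f}")
```

Values near 1 indicate near-perfect agreement beyond chance, while values near 0 indicate agreement no better than chance.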
The results of this study, discussed in more detail in Chapter 3, show that the human raters were able to come to high agreement using the multi-dimensional rubrics. Using the data sets created from those human scores, machine learning algorithms developed by the Automated Analysis of Constructed Response (AACR) research group were effective in classifying the constructed response items, with machine-to-human rater agreement similar to the agreement between human raters. In categories that were well represented within the training set, the algorithms performed very well. Although responses using vocabulary that was under-represented within the sample were over-represented among the scoring discrepancies, the algorithms were generally robust even when scoring very open-ended responses.

U.S. and Finnish High School Engagement During the Covid-19 Pandemic

In addition to aiding in the process of assessment and feedback, technology has also taken an important role in science learning as more coursework has moved to an online environment. During the 2019-2020 school year, the Covid-19 pandemic forced schools around the world to close their doors, and many schools turned to remote content delivery to connect with their students. Having already collected information about students' science engagement in their physics and chemistry classrooms at the beginning of the school year, CESE researchers in the US were in a unique position to study changes in student engagement during remote instruction. Moreover, CESE was able to collect data from students in two countries with very different approaches to the shift to remote content delivery. The details of this international collaboration in data collection and analyses during the pandemic comprise Chapter 4 of this dissertation.

In 2021, CESE researchers in the US and Finland published their findings in the International Journal of Psychology in a manuscript titled "U.S. and Finnish high school engagement during the Covid-19 Pandemic." CESE defined engagement as students reporting high interest, high skill, and high challenge. The US team reported the change in the log odds that a student reported high interest, skill, challenge, and engagement from the beginning of the school year to the time they were surveyed during remote instruction. Both teams discussed the activities students reported in their classrooms. The US team discussed which activities students reported in their online science classes, how interested they were in those activities, and how those relationships impacted their engagement. In Finland, researchers were able to collect data regarding situational engagement during remote teaching and used this information to understand engagement during high-, medium-, and low-frequency activities. In the US, researchers used educational ambition as a measure of persistence during the pandemic. Finally, Finnish researchers discussed the relationships between situational engagement and social and emotional learning, with correlations between the students' emotional state and their situational engagement.

Also described in Chapter 4 are the results of this analysis. Despite the shift to remote instruction, this study showed an increase in engagement for US students. Students in both the US and Finland showed a preference for those activities which were least available during remote instruction. Students in the US reported the most interest in SEPs that could be done remotely.
Not surprisingly, those types of activities showed the strongest relationships to engagement during the pandemic. Additionally, US students were showing persistence in their college ambitions, with many students firming up decisions to attend 4 or more years of college despite hardships encountered during the Covid-19 pandemic.

REFERENCES

Alper, J. (Ed.). (2016). Developing a national STEM workforce strategy: A workshop summary. National Academies Press. Cooper, M. M., Stieff, M., & DeSutter, D. (2017). Sketching the invisible to predict the visible: From drawing to modeling in chemistry. Topics in Cognitive Science, 9(4), 902-920. Flowers III, A. M., & Banda, R. (2016). Cultivating science identity through sources of self-efficacy. Journal for Multicultural Education, 10(3), 405-417. https://doi.org/10.1108/JME-01-2016-0014. Ha, M., & Nehm, R. H. (2016). The impact of misspelled words on automated computer scoring: a case study of scientific explanations. Journal of Science Education and Technology, 25(3), 358-374. Harris, C. J., Krajcik, J. S., Pellegrino, J. W., & DeBarger, A. H. (2019). Designing knowledge-in-use assessments to promote deeper learning. Educational Measurement: Issues and Practice, 38(2), 53-67. https://doi.org/10.1111/emip.12253. Haudek, K., Santiago, M., Wilson, C., Stuhlsatz, M., Donovan, B., Bracey, Z., Gardner, A., Osborne, J., & Cheuk, T. (2019). Using Automated Analysis to Assess Middle School Students' Competence with Scientific Argumentation. Paper presented at the Annual Meeting of the National Council on Measurement in Education (NCME), Toronto, ON. Krajcik, J. S., & Shin, N. (2014). Project-based learning. In R. K. Sawyer (Ed.), The Cambridge Handbook of the Learning Sciences (pp. 275-297). Kubsch, M., Nordine, J., Neumann, K., Fortus, D., & Krajcik, J. (2019). Probing the relation between students' integrated knowledge and knowledge-in-use about energy using network analysis. Eurasia Journal of Mathematics, Science and Technology Education, 15(8), em1728. Lee, H. S., McNamara, D., Bracey, Z. B., Liu, O. L., Gerard, L., Sherin, B., Wilson, C., Pallant, A., Linn, M., Haudek, K., & Osborne, J. (2019a). Computerized text analysis: Assessment and research potentials for promoting learning. Lee, H. S., Pallant, A., Pryputniewicz, S., Lord, T., Mulholland, M., & Liu, O. L. (2019b). Automated text scoring and real-time adjustable feedback: Supporting revision of scientific arguments involving uncertainty. Science Education, 103(3), 590-622. Lottridge, S., Wood, S., & Shaw, D. (2018). The effectiveness of machine score-ability ratings in predicting automated scoring performance. Applied Measurement in Education, 31(3), 215-232. Ma, Y., & Xiao, S. (2021). Math and science identity change and paths into and out of STEM: Gender and racial disparities. Socius, 7, 23780231211001978. Maestrales, S., Marias Dezendorf, R., Tang, X., Salmela-Aro, K., Bartz, K., Juuti, K., Lavonen, J., Krajcik, J., & Schneider, B. (2021a). US and Finnish High School Science Engagement During the Covid-19 Pandemic. International Journal of Psychology, 57(1), 73-86. Maestrales, S., Zhai, X., Touitou, I., Baker, Q., Krajcik, J., & Schneider, B. (2021b). Using Machine Learning to Evaluate Multidimensional Assessments of Chemistry and Physics. Journal of Science Education and Technology, 30(2), 239-254. National Research Council. (2011). Assessing 21st Century Skills: Summary of a Workshop. The National Academies Press. National Research Council. (2012a).
Education for life and work: Developing transferable knowledge and skills in the 21st century. The National Academies Press. National Research Council. (2012b). A Framework for K-12 Science Education: Practices, Crosscutting Concepts, and Core Ideas. The National Academies Press. National Research Council. (2014). Developing Assessments for the Next Generation Science Standards. Washington, DC: The National Academies Press. National Science Board. (2015). Revisiting the STEM workforce: A companion to science and engineering indicators 2014. NSB-2015-10. Schneider, B., Krajcik, J., Lavonen, J., Salmela-Aro, K., Klager, C., Bradford, L., Chen, I., Baker, Q., Touitou, I., Peek-Brown, D., Marias Dezendorf, R., Maestrales, S. & Bartz, K. (2022). Improving Science Achievement—Is It Possible? Evaluating the Efficacy of a High School Chemistry and Physics Project-Based Learning Intervention. Educational Researcher, 0013189X211067742. Tai, R. H., Qi Liu, C., Maltese, A. V., & Fan, X. (2006). Planning early for careers in science. Science, 312(5777), 1143-1144. Vincent-Ruz, P., & Schunn, C. D. (2018). The nature of science identity and its role as the driver of student choices. International journal of STEM education, 5(1), 1-12. Wendler, C., Bridgeman, B., Cline, F., Millett, C., Rock, J., Bell, N., & McAllister, P. (2010). The path forward: The future of graduate education in the United States. Educational Testing Service. Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational measurement: issues and practice, 31(1), 2-13. Zhai, X., Haudek, K., Shi, L., H Nehm, R., & Urban‐Lurain, M. (2020a). From substitution to redefinition: A framework of machine learning‐based science assessment. Journal of Research in Science Teaching, 57(9), 1430-1459. 14 Zhai, X., Yin, Y., Pellegrino, J. W., Haudek, K. C., & Shi, L. (2020b). Applying machine learning in science assessment: a systematic review. Studies in Science Education, 56(1), 111-151. 15 CHAPTER 2: IMPROVING SCIENCE ACHIEVEMENT—IS IT POSSIBLE? EVALUATING THE EFFICACY OF A HIGH SCHOOL CHEMISTRY AND PHYSICS PROJECT-BASED LEARNING INTERVENTION Abstract Crafting Engaging Science Environments is a high school chemistry and physics project- based learning intervention that meets Next Generation Science Standards performance expectations. It was administered to a diverse group of over 4,000 students in a randomized control trial in California and Michigan. Results show that treatment students, on average, performed 0.20 standard deviations higher than control students on an independently developed summative science assessment. Mediation analyses show an indirect path between teacher and student-reported participation in modeling practices and science achievement. Exploratory analyses indicate positive treatment effects for enhancing college ambitions. Overall, results show that improving secondary school science learning is achievable with a coherent system comprising teacher and student learning experiences, professional learning, and formative unit assessments that support students in “doing” science. 16 Introduction and Literature Review U.S. students’ engagement and achievement in elementary and secondary science has been relatively stagnant over the past two decades (National Center for Education Statistics, 2017) and regrettably remains behind other industrialized countries (Organization for Economic Co-operation and Development, 2020). 
Recognizing the consequences of persisting mediocre student academic performance in science achievement, three national efforts A Framework for K–12 Science Education (National Research Council [NRC], 2012), the Next Generation Science Standards (NGSS; NGSS Lead States, 2013), and Science and Engineering for Grades 6 to 12: Investigation and Design at the Center (National Academies of Sciences, Engineering, and Medicine, 2019)—were initiated to articulate a vision and a set of standards for markedly reforming science teaching and learning. Breaking with traditional practices of memorizing science facts, formulae, and mathematical operations, these reports emphasize the importance of “doing” science for improving science literacy and encouraging the pursuit of science careers. Grounded in the ideas of John Dewey (1938) and subsequent work by Blumenfeld et al. (1991) and Krajcik and Shin (2014), “the act of doing” engages the students in a cognitive process where they encounter a problem, plan a solution, work on it, and reflect on the results. The NRC’s (2012) Framework for K–12 Science Education argues that science learning should focus on making sense of phenomena or solving design problems by engaging students in the three dimensions of scientific knowledge. These dimensions include science and engineering practices, crosscutting concepts, and disciplinary core ideas (DCIs) which should increase in depth and sophistication within and across the grade levels (NRC, 2012). Science and engineering practices are behaviors scientists perform as they build theories about natural phenomena through investigations, creating models, generating explanations, and constructing 17 science-based arguments. Crosscutting concepts refer to “ideas” that are found and linked across disciplines and which provide different tools for exploring phenomena. DCIs focus on the key ideas of a science discipline—or, in other instances, a fundamental organizing idea for a single discipline—and are critical for explaining phenomena. Shortly after the development of the Framework, the NGSS were released having been created by a national effort of scientists, educational researchers, teachers, and policymakers (NGSS Lead States, 2013). The NGSS described a set of performance expectations that identified what students should know and be able to do. These performance expectations follow a progression of learning from primary to secondary grades, outlining and linking specific science ideas that increase in meaning and complexity. Many U.S. states have adopted the NGSS (National Science Teacher Association, 2017), or adapted them to develop similar science proficiency standards. Despite this widespread adoption of NGSS-like standards, there is a lack of research on evidence-based curricula and lessons that exemplify these reforms. The intervention “Crafting Engaging Science Environments” (CESE) was initiated to fill this gap by: (1) creating an intervention for high school chemistry and physics designed on the Framework’s vision and NGSS performance expectations and (2) assessing its effects on student science achievement (Schneider et al., 2020). The Intervention: Crafting Engaging Science Environments Deeply concerned with the lack of interest and engagement among secondary school students, an international team of Finnish and U.S. science education researchers, teachers, psychologists, sociologists, and psychometricians collaboratively designed the CESE intervention. 
Over the course of 3 years, this interdisciplinary team worked on developing a holistic intervention that takes a system approach to teaching and learning in secondary school 18 chemistry and physics classrooms. The intervention consists of six units—three in chemistry and three in physics—with corresponding professional learning and assessments. Although team members from both countries collaborated on the design, the intent was not a comparative study but instead each country conducted its own investigations, recognizing the cultural and demographic differences of the student and teacher populations. Beginning with a design phase followed by a field-test conducted in the United States and Finland in 2017– 2018 (Schneider et al., 2020), an efficacy trial was then conducted in the United States (2018–2019) with students of different demographic factors at the individual level (i.e., prior academic ability, race and ethnicity, and gender) and at the school level (i.e., urbanicity, percentage of free or reduced -price lunch, and region) in California and Michigan.1 The results of this efficacy study are reported below. Theoretical Basis of the Intervention The CESE intervention was designed taking into account critiques of successful curricular interventions, including empirically tested instructional and learning materials for teachers and students, teacher professional learning with sustained support and three-dimensional assessments for evaluating learning in science (Harris et al., 2015; Polikoff et al., 2018; Sinatra et al., 2015). Guided by the principles of project-based learning (PBL) and the complementary ideas of the Framework and the NGSS, the coherent CESE system was formed to support science learning by challenging students to engage in relevant and meaningful experiences. The theoretical principles of PBL in science revolve around learners making sense of “real-world phenomena” and solving relevant questions by planning and carrying out their own investigations (Krajcik & Shin, 2014). Collaborating with classmates, students engage in experiences in which they create artifacts that support the development of scientific ideas and use 19 of scientific practices. PBL uses the three dimensions of scientific knowledge that allow students to draw from their knowledge across disciplines and life experiences, rather than being the passive recipients of knowledge. This contrasts with traditional approaches to chemistry and physics, where students are often instructed to plug numbers into equations without truly understanding the underlying relationships described by the equations. Specifically, the CESE intervention focuses on seven major PBL principles: meeting subject and grade-relevant NGSS performance expectations; constructing a meaningful driving question that motivates a solution to a complex problem or an explanation for a compelling phenomenon; providing opportunities for the use of scientific practices; creating collaborative experiences and investigations for finding solutions to a driving question; integrating learning tools to make sense of evidence; developing artifacts that respond to the driving question and reveal students’ comprehension; and using assessments that capture emerging understandings (Krajcik & Shin, 2014). Enacting these principles helps transform the daily science experiences of teachers and students, from teacher-led instruction to environments where both teachers and students work together on solving problems and figuring out how to explain phenomena. 
In such a learning environment, students have agency and directly participate in using scientific practices, much like scientists and engineers.

There were several reasons why our intervention design with its PBL framework was constructed for high school students in physics and chemistry. First, advances in science learning emphasize the components of PBL as necessary for student learning (National Academies of Sciences, Engineering, and Medicine, 2019; NRC, 2000). Second, there is limited work on measuring reforms in high school physical science courses (What Works Clearinghouse, 2020). We recognize that there are several newer studies that take a different approach to understanding why high school students lack an interest in science (National Assessment of Educational Progress, 2021) and why fewer students are choosing STEM (science, technology, engineering, and mathematics) majors (Riegle-Crumb et al., 2011). Furthermore, there have also been many studies highlighting how different types of methodologies enhance students' learning experiences in the sciences (Lee et al., 2020). There are considerable studies that examine student access among underrepresented minorities to more advanced level courses or subjects like chemistry or physics (National Science Board, 2020; Riegle-Crumb et al., 2019), along with new promising science curricular efforts (Engels et al., 2019; Sasson et al., 2018). However, several critiques concerned with the equitable and inclusionary participatory work of PBL suggest that it needs to be studied more rigorously to learn if indeed it has a positive impact on student outcomes (Chen & Yang, 2019; Cheung et al., 2016; Condliffe et al., 2017; Harris et al., 2019). Our work is a response to this critique. Third, chemistry and physics are important subject areas, as they are often considered gatekeeper courses for many science specializations and postsecondary schooling (Hinojosa et al., 2016; Riegle-Crumb & King, 2010). Finally, scientific literacy is needed for all students, as evidenced by our understanding and responses when faced with pandemics, technological change, and environmental concerns (National Science Board, 2019). This combination of a need for new curricula paired with the necessity of PBL led to the creation of the CESE system.

Components of the Intervention

Teacher and Student Experiences and Materials

Recognizing the challenges that teachers would undoubtedly have in transforming instructional practices for all their science units, the team decided to develop three units for chemistry and physics, each of which lasted 4 to 6 weeks. Table 2.1 describes the three chemistry and three physics units, along with their driving questions, performance expectations, and phenomena.

Table 2.1
Units, Performance Expectations, Driving Questions, and Phenomena for the Units

Unit: Evaporation (chemistry)
Performance Expectations: HS-PS1-3: Plan and conduct an investigation to gather evidence to compare the structure of substances at the bulk scale to infer the strength of electrical forces between particles. HS-PS3-2: Develop and use models to illustrate that energy at the macroscopic scale can be accounted for as a combination of energy associated with the motion of particles (objects) and energy associated with the relative positions of particles (objects).
Driving Question: "Why do I feel colder when I am wet than when I am dry?"
Phenomena: Water, acetone, and ethanol evaporate when placed on your skin, making you feel cool. Ice at zero degrees Celsius will change to liquid water at zero degrees Celsius with the addition of energy, but no temperature change occurs.

Unit: Periodic Table (chemistry)
Performance Expectations: HS-PS1-1: Use the periodic table as a model to predict the relative properties of elements based on the patterns of electrons in the outermost energy level of atoms. HS-PS1-2: Construct and revise an explanation for the outcome of a simple chemical reaction based on the outermost electron states of atoms, trends in the periodic table, and knowledge of the patterns of chemical properties.
Driving Question: "Why is table salt safe to eat, but the substances that form it are explosive or toxic when separated?"
Phenomena: Sodium reacts with water. Potassium reacts with water. A solution of chlorine reacts with potassium iodide solution to form iodine and potassium chloride. A solution of bromine reacts with potassium iodide solution to form iodine and potassium bromide solution.

Unit: Conservation of Matter (chemistry)
Performance Expectations: HS-PS1-7: Use mathematical representations to support the claim that atoms, and therefore mass, are conserved during a chemical reaction.
Driving Question: "Why does it seem like I can make a substance appear or disappear?"
Phenomena: Substances, like paper, can burn. They appear to disappear, but in a closed system the burning of paper has no mass change. It is necessary to add energy to start the burning of paper, but after the start, lots of energy is given off as the temperature of the surrounding area increases.

Unit: Forces and Motion (physics)
Performance Expectations: HS-PS2-3: Apply scientific and engineering ideas to design, evaluate, and refine a device that minimizes the force on a macroscopic object during a collision.
Driving Question: "How can I design a vehicle to be safer for a passenger during a collision?"
Phenomena: When a car crashes, destruction and damage to the car and the passengers occur.

Unit: MagLev (physics)
Performance Expectations: HS-PS3-5: Develop and use a model of two objects interacting through electric or magnetic fields to illustrate the forces between objects and the changes in energy of the objects due to the interaction. HS-PS3-2: Develop and use models to illustrate that energy at the macroscopic scale can be accounted for as a combination of energy associated with the motion of particles (objects) and energy associated with the relative positions of particles (objects).
Driving Question: "What makes a super speed train (Maglev) function without touching the track?"
Phenomena: When a magnet is brought close enough to a second magnet, the second magnet will move without touching it.

Unit: Electric Motors (physics)
Performance Expectations: HS-PS3-1: Create a computational model to calculate the change in energy of one component in a system when the change in energy of the other components and energy flows in and out of the system are known. HS-PS2-5: Plan and conduct an investigation to provide evidence that an electric current can produce a magnetic field and that a changing magnetic field can produce an electric current. HS-PS3-3: Design, build, and refine a device that works within given constraints to convert one form of energy into another form of energy.
Driving Question: "How can I make the most efficient electric motor?"
Phenomena: An electric current can cause the shaft of an electric motor to spin or turn.

Each unit was designed with an overriding driving question, lesson sequences incorporating scientific practices, and postunit assessments.
The first step in the unit design process was to select a set of performance expectations for each unit in chemistry and physics and then unpack the performance expectations to elaborate on the ideas and identify the scientific practices (Harris et al., 2019; Krajcik & Czerniak, 2018; Krajcik & Shin, 2014). The PBL framework requires each unit to have a driving question (see Table 2.1) that is meaningful to students’ lives and connects a phenomena or complex problem to a concrete experience recognizable to the students. This drives the sequence of coherent lessons that continue to 23 motivate students to meet the unit’s learning goals. Moreover, the initial experiencing of the phenomena leads the students to ask their own questions. Another important feature of the driving question is that it initiates the construction of a systematic sequence of lessons that builds throughout the unit, leading to additional questions that are threaded through various lessons, providing coherence across multiple experiences. The lessons that form from the driving question are not defined scripts but a flexible roadmap that connects prior experiences and ideas to specific scientific practices, such as planning and carrying out investigations, analyzing and interpreting data, and constructing explanations and designing solutions. The investigations related to the driving question are not independent projects, but rather carefully constructed experiential activities that build across the unit and are anchored in and help answer the driving question. One of the most important scientific practices that the PBL units emphasize is having students construct models and connect them with evidence-based explanations of phenomena. Here again, the direct involvement of students in modeling is not an isolated task, such as diagramming a simple relationship between two variables, but instead is directly related to explaining the phenomena under investigation and responding to the driving question. Raising the importance of modeling is a key scientific practice in the NRC’s Framework, in which the intent is to provide students with experiences whereby they can become directly involved in systems thinking.2 By incorporating the practices of scientists and engineers, the modeling experiences afford students the opportunity to learn how to represent phenomena and physical systems, explain and predict the phenomena in a consistent and logical manner, and understand their data. The CESE modeling experiences are deliberately created so that students are supported in 24 learning about identifying system components and the relationships among them which can take many forms, including mathematical formulae, diagrams, and computer simulations. To understand how this process plays out in classrooms, the following two examples summarize the first units in chemistry and physics.3 The first unit in chemistry focuses on explaining evaporative cooling. Working from the CESE driving question (see Table 2.1), the lessons are designed so that students use classroom experiments and models to figure out and explain how evaporative cooling occurs; they must then figure out how this relates to the interactions of particles at the molecular level, as well as the matter’s macro-level structure and properties, and energy transfer. Students manipulate different variables throughout the experiments, looking to explain how each component may influence evaporation. 
Across the unit, as students learn and assess how to make sense of phenomena, they construct models and explanations of the process of evaporative cooling, connecting energy changes to changes in the structure of matter in the system. The first unit in physics focuses on students exploring the driving question, How can I make a vehicle safer during a collision? Here, experiences and investigations include working collaboratively to investigate and develop computer models to explain collisions, figuring out relationships among mass, force, and velocity in a collision by experiencing what happens when each of those variables is individually manipulated. Students use their new knowledge of mass, force, velocity, acceleration (Newton’s second law), and momentum in combination with engineering practices to develop their best design in answering the driving question. Then they use a set of materials to build and test a cart that minimizes the force on a passenger in a collision. 25 Postunit Assessments One of the underlying principles of the CESE system is to create postunit assessment tasks that extend student learning experiences by using the three dimensions of scientific knowledge to explain phenomena and solve challenging problems to demonstrate mastery of NGSS performance expectations—but not what was articulated in the curricular unit. The steps used for creating these assessment tasks and rubrics were a modification of a previous process articulated by Harris et al. (2019). The development of assessment tasks allows for the creation of items through a principled, clearly defined process that is grounded in learning and assessment theory. All of the postunit assessments have the students draw models and write full descriptions of what is shown; these are then evaluated using a rubric that assesses their knowledge of the NGSS performance expectations, including the three dimensions of science learning. In addition, external reviews and classroom pilots were performed to increase item validity. Professional Learning and Teacher Support Professional learning was designed with best practices from research and emphasized teachers’ active participation in learning, connections to classroom contexts, collaboration, and reflection (Darling-Hammond et al., 2017; Garet et al., 2001; Krajcik, 2014; van den Bergh et al., 2015). A key feature of CESE professional learning is for treatment teachers to experience what their students will be doing during their science activities, including using the driving question board, asking their own questions, building models, developing evidence-based explanations, and conducting experiments, not as students but as adult learners. The intent here is to guide and support the teachers in new ways of teaching often found to be challenging. All treatment teachers spent three in-person days learning about the NGSS, three- dimensional learning, science PBL, and a review of the first units in chemistry and physics, 26 conducted with team members and experienced teacher facilitators. Several times during the school year, the treatment teachers also met in person with facilitators to talk through teaching the next set of units. Facilitators also connected with the teachers via video conferences and online message boards. A hotline and a monitored email address were also available that centered on teacher questions. Over each of the 4 to 6-week units, there were approximately 90 teacher requests for additional support and information. 
If needed, facilitators were also available for face-to-face interactions.4 Control teachers also met in-person at the beginning of the school year for a day and were given a workshop on the NGSS. Testing the Intervention Three research questions guided this investigation: Research Question 1: What is the main effect of this intervention on students’ science learning? Do treatment students outperform control students on a summative science assessment? Research Question 2: What other conditions besides the intervention could be affecting the treatment effect? More specifically, does the treatment effect vary by race/ethnicity and gender? Research Question 3: What is the mediating effect of fidelity of implementation on the treatment effect? Sample Method A power analysis indicated that the sample should include at least 48 schools with 50 students per school to detect a robust effect (i.e., 0.20).5 Schools were recruited from four areas (Los Angeles Unified School District, San Diego County Office of Education, Detroit Public 27 School Community District, and other districts throughout Michigan), which allowed for a diverse sample of prior school-level science achievement, socioeconomic status, and race and ethnicity, including a substantial representation of Hispanic students, many of whom were English language learners. Participating school districts signed memoranda of understanding and supplied lists of schools for potential participation from which the randomization process was undertaken. Randomization Assignment to treatment status was made by schools rather than individual teachers to prevent spillover, as teachers within a particular school might plan together and/or share curriculum materials and instructional practices with colleagues. Also, a small number of schools requested that all their teachers in a subject be teaching the same curriculum. Team members contacted the principals from the district lists for potential participation, explaining that they would either be assigned to receive the treatment or control group that would receive the treatment the following school year. After receiving a principal’s agreement to participate, schools were randomly assigned to treatment or control status, with an equal probability (0.50) of each. Given district and principal support, nearly all of the teachers were willing to participate. (See APPENDIX 2.A, also available on the journal website, for the balance tables between the treatment and control schools).6 Attrition After randomization, nine schools attrited because of school closures or canceled science courses. Although not part of the sample selection (i.e., schools and students), teacher attrition was explored because of its impact on the student sample. Teacher attrition was related to school policies and personal issues: fiscal problems and a teacher strike which resulted in: teacher class 28 reassignments and class cancellations; medical emergencies; mismatch between the course timeline and student abilities; and undisclosed reasons. Students were excluded from the analytic sample if they were missing either the pretest or summative assessment (or both).7 Students who joined classrooms after the pretest was administered were excluded from the analysis. Initially, there were 70 schools (36 treatment and 34 control), with 129 chemistry and physics teachers and 6,211 students (3,325 treatment and 2,886 control). 
After accounting for attrition, the final analytic sample includes 61 schools (30 treatment and 31 control), 119 teachers, and 4,238 students (2,127 treatment and 2,111 control). Table 2.2 summarizes overall and differential attrition across the treatment and control groups at the school and student levels.

Table 2.2 Attrition

                           Overall    Treatment    Control    Differential
Panel A: School level
  Initial schools               70           36         34
  Final schools                 61           30         31
  Attrition                 12.86%       16.67%      8.82%          7.84%
Panel B: Student level
  Initial students           6,211        3,325      2,886
  Final students             4,238        2,127      2,111
  Attrition                 31.77%       36.03%     26.85%          9.18%

Table 2.3 provides descriptive statistics and balance for the analytic sample on pretest and demographic characteristics. Balance was estimated using a two-level hierarchical linear model (HLM) of the treatment on the characteristic of interest.

Table 2.3 Descriptive Statistics and Balance for the Analytic Sample
Note. Standardized pretest is in standard deviations. Other covariates are proportions. *p < 0.05. **p < 0.01. ***p < 0.001.

Given slight differences between the treatment and control students in the standardized pretest and in the proportions of racial and ethnic groups, these measures were included as control variables in the analytic models, along with dummy variables for region and course subject.

Instruments

Students

Several instruments were used to collect information from the treatment and control students, including demographic characteristics, pretest baseline science achievement, and a summative assessment. For students whose first language was Spanish, all consent forms, teaching materials, and unit and summative tests were first translated into Spanish and then translated back into English by another translator to ensure accuracy. All translators were fluent in both languages and had a secondary school science specialization.

The student background survey was administered on Qualtrics or via paper and pencil. Questions were largely adapted from the Programme for International Student Assessment (PISA) and asked about home background (race, ethnicity, language, parents' education, etc.), attitudes toward science, and preferences for a future career in science.

A pretest was given at the start of the school year to students in both conditions to measure their baseline science knowledge, which served as a covariate in the analytic model. This pretest contained multiple-choice items chosen from the National Assessment of Educational Progress (NAEP) test bank. The items covered a range of topics and difficulty levels, and some were aligned with DCIs and performance expectations for chemistry and physics (see Maestrales et al., 2021).8

To measure the difference between the treatment and control groups at the end of the intervention, the students completed a summative assessment consisting of items developed by the Michigan Department of Education for use on the state's 11th-grade science assessment.9 Michigan was one of the earlier states to adopt NGSS-like standards, and that adoption spurred a redesign of the science achievement test given to high school students. Several science curriculum specialists and psychometricians worked to design a science assessment that encompassed grade-level NGSS standards and NRC three-dimensional science.
Our summative test included items that corresponded to the physical science performance expectations.10 This third-party assessment allowed for an objective measure of the differences in achievement between the treatment and control groups.11

Student exit surveys asked about the frequency of different PBL activities in their classroom, which allowed for a comparison of student versus teacher perceptions of fidelity of implementation. The exit survey also asked students to reflect on the importance of science in their own lives, their interest in the future study of science, their experiences in their science classrooms, and the science materials available at their schools.

Teachers

A teacher survey with questions adapted from the Teaching and Learning International Survey (TALIS) included items on years of teaching experience, teaching methods, and attitude toward teaching. This survey also measured teachers' knowledge of and exposure to the NGSS and PBL, to establish that, at the beginning of the intervention, such knowledge did not differ between the treatment and control teachers. All teachers and students were also asked to complete an exit survey at the end of the year. Items on the teacher exit survey asked about their use of PBL, their coverage of performance expectations, and the quality of classroom resources that affected the intervention lessons. These measures also allowed for testing the assumption that the control teachers were teaching business as usual. The exit survey items used to determine the units covered by the control teachers, their practices, and their textbooks and curriculum tools can be found in APPENDIX 2.B (also available on the journal website).

In addition to these exit surveys, in-person observations of randomly selected teachers (in both treatment and control) were conducted and used for determining fidelity of implementation. Another important use of these observations was to confirm that the control teachers were conducting their science classrooms with business as usual, and not using PBL practices or a similar type of curriculum that emphasized CESE principles or experiential activities. Although observations could not be conducted for all teachers, those that were completed provided important insights into the use of PBL in classrooms, independent observer assessments of PBL use, and how these measures corresponded with teacher self-reports in the exit surveys.

Analysis

To assess the effect of the treatment on science achievement and to account for the clustering that results from assigning schools to treatment, a two-level HLM was used (Bloom, 2005; Raudenbush, 1997; Raudenbush & Bryk, 2002), with students clustered within schools.

Model 1:

$$Y_{ij} = \gamma_{00} + \gamma_{01}\,\mathrm{Treatment}_j + S_j\delta_j + X_{ij}\gamma_{i0} + r_{0j} + \epsilon_{ij}$$

where $Y_{ij}$ is the standardized summative assessment test score for student i in school j; $\gamma_{00}$ is the mean outcome of the control group; $\gamma_{01}$ is the difference between the treatment and control groups; $S_j$ are the school-level covariates, including the school pretest mean and region; and $\delta_j$ are the coefficients on those covariates. $X_{ij}$ are the individual-level covariates, including pretest, course (chemistry or physics), and gender of the students, and $\gamma_{i0}$ are the coefficients on these individual-level covariates. Finally, $r_{0j}$ is the school-level error term, and $\epsilon_{ij}$ is the student-level error term.
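To make Model 1 concrete, the following is a minimal sketch of how such a two-level random-intercept model could be estimated in general-purpose software. The data file and column names (summative_z, treatment, pretest_z, chemistry, female, school_pretest_mean, region, school_id) are hypothetical stand-ins for the variables described above, not the study's actual code or software.

```python
# Minimal sketch of Model 1: a two-level random-intercept model with students
# nested in schools. All column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("students.csv")  # hypothetical analytic file

model1 = smf.mixedlm(
    "summative_z ~ treatment + pretest_z + chemistry + female"
    " + school_pretest_mean + C(region)",
    data=df,
    groups=df["school_id"],   # random intercept r_0j for each school
)
fit = model1.fit(reml=True)
print(fit.summary())          # the coefficient on 'treatment' plays the role of gamma_01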
Because race, ethnicity, and gender data were not available for every student, Model 2 was estimated including dummy variables for these missing data:

Model 2:

$$Y_{ij} = \gamma_{00} + \gamma_{01}\,\mathrm{Treatment}_j + S_j\delta_j + X_{ij}\gamma_{i0} + M_{ij}\alpha_{i0} + r_{0j} + \epsilon_{ij}$$

where $X_{ij}$ are the individual-level covariates, including pretest, course (chemistry or physics), and gender of the students, now also including race and ethnicity, and $\gamma_{i0}$ are the coefficients on these individual-level covariates. $M_{ij}$ is the vector of missingness dummies for gender and race, and $\alpha_{i0}$ are the coefficients on those dummies.

To determine whether treatment effects differ by race/ethnicity or gender, cross-level interactions between gender and treatment, and then between race and treatment, were estimated. The following model is specified for gender; the race and ethnicity models substitute each race dummy for the Female interaction with the treatment. The full model is shown in APPENDIX 2.C (also available on the journal website).

Gender heterogeneity:

$$Y_{ij} = \gamma_{00} + \gamma_{01}\,\mathrm{Treatment}_j + \gamma_{11}\,\mathrm{Treatment}_j \times \mathrm{Female}_{ij} + S_j\delta_j + X_{ij}\gamma_{i0} + M_{ij}\alpha_{i0} + r_{0j} + \epsilon_{ij}$$

where $\gamma_{11}$ is the coefficient on the interaction between treatment and female students.

When teachers implemented the intervention with fidelity, it was expected that their students would have higher scores on the summative science assessment than students whose teachers did not. This assumption was based on teachers' reports of the frequency of various science practices, the extent to which they "incorporated" PBL into their classes (32% reported frequently, 50% sometimes, 15% rarely, 1% never), and students' reports of the use of modeling, aggregated to the teacher level to represent the extent to which teachers made this scientific practice a part of their science classroom instruction.

The impact of teacher fidelity of implementation was estimated with the 3-2-1 mediation model (Pituch et al., 2009). The 3-2-1 mediation model requires a three-level HLM with the teacher at Level 2, because the expected mediation operates through the teacher's level of implementing the intervention, which is a teacher-level variable. Therefore, to ensure that the treatment effects were consistent across the three levels, the overall treatment effect was first estimated using a three-level HLM. Once that consistency was established, the mediation analysis was conducted. Here, the treatment is delivered at Level 3 (the school level), mediation occurs at Level 2 (the teacher level), and the outcome is student assessment scores (Level 1). This process was conducted twice, first for the teachers' reported use of the intervention in their classes and then for the students' use of modeling in each teacher's class. The mediation model is shown in APPENDIX 2.D (also available on the journal website).

Additionally, one exploratory research question was whether the treatment would enhance educational ambitions. For this model, a mixed-effects logistic regression was used in which the outcome was the change in planned level of educational attainment from the beginning to the end of the school year. The covariates used in this analysis were the same as in Model 1 above. The models for estimating these effects are given in APPENDIX 2.E (also available on the journal website).
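For the mediation analysis, once path a (treatment to the teacher-level mediator) and path b (mediator to the student outcome, controlling for treatment) are estimated, the indirect effect is their product, and its confidence interval can be simulated from the two sampling distributions. That is the logic behind the empirical-M approach of Tofighi and MacKinnon (2011) cited in APPENDIX 2.D. The sketch below illustrates the calculation using the path estimates later reported in Table 2.7 for teachers' incorporation of PBL; it approximates, rather than reproduces, the study's exact procedure.

```python
# Monte Carlo confidence interval for the indirect effect a*b, in the spirit of
# Tofighi and MacKinnon (2011). Inputs are the path estimates and standard
# errors reported in Table 2.7 (teachers' incorporation of PBL) and the total
# effect from Table 2.6; the study's exact procedure may differ in detail.
import numpy as np

rng = np.random.default_rng(seed=1)
a, se_a = 0.329, 0.126   # treatment -> teacher-level mediator
b, se_b = 0.068, 0.055   # mediator -> student outcome (controlling for treatment)
c = 0.196                # total treatment effect from the three-level model

draws = rng.normal(a, se_a, 100_000) * rng.normal(b, se_b, 100_000)
lo, hi = np.percentile(draws, [2.5, 97.5])

print(f"indirect effect a*b = {a * b:.3f}")
print(f"95% Monte Carlo CI  = [{lo:.3f}, {hi:.3f}]")
print(f"share of total effect mediated = {a * b / c:.0%}")
```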
Generalizability of the Study

To assess the generalizability of the study, generalizability indexes were analyzed that summarize the degree of similarity between the distribution of propensity scores for the sample schools and those for various inference populations (the United States as a whole as well as each state; Tipton, 2014). This generalizability index takes values between 0 and 1. Essentially, for each of the 51 inference populations chosen (including the U.S. population as a whole), the sample of 61 schools was compared with the population of schools meeting the inclusion criteria (public schools) using a propensity score estimated with logistic regression. These scores estimate the probability that a school would be selected into the study given its covariates, which included school enrollment, percentage of students on free or reduced-price lunch, school urbanicity, and percentage of White, Black, Hispanic, and Asian students.

Results

Student Outcomes

Across both models of the treatment effect, the treatment students outperformed the control students at a significance level of p < 0.05 or less, with a consistent treatment effect near 0.2 standard deviations. In the most conservative result, Model 2, which includes all the covariates, the treatment students scored 0.208 standard deviations (p < 0.01) higher than the control students after taking into account the pretest, region, and demographic information. This represents about a 7 percentage-point increase on a standardized test score based on the summative assessment. The estimated treatment effects from both models are reported in Table 2.4 (see APPENDIX 2.F for the full model results, also available on the journal website).

Table 2.4 Estimated Treatment Effect of the CESE System

Effect                           (1)                (2)
Treatment effect           0.220*** (0.064)   0.208** (0.065)
Additional controls                                  x

Note. Treatment effect is the difference between the treatment and control group, measured in standard deviations. Standard errors are in parentheses. CESE = Crafting Engaging Science Environments. *p < 0.05. **p < 0.01. ***p < 0.001.

To confirm that the findings were robust, a sensitivity check was conducted using the Frank et al. (2013) framework for evaluating the robustness of an inference. To invalidate this inference, 28.6% of the estimated treatment effect, approximately 1,637 observations, would have to be replaced with cases for which the effect of the treatment is zero. Following the estimation of the treatment effect, Cohen's f was calculated and found to be 0.01, indicating that the variance attributable to the treatment is small relative to the unexplained outcome variance (Lorah, 2018).

Tests of Homogeneity of Variance

In tests of heterogeneity by student demographic characteristics (gender and race/ethnicity), there were no statistically significant differences in the treatment effect by student gender or race/ethnicity (see Table 2.5 and APPENDIX 2.G for the full model results, also available on the journal website).

Table 2.5 Summary of Student Level Heterogeneity

Effect                    Female      Black     Hispanic     Asian    Other race   Multiple races
Treatment                0.185**    0.192**     0.222**    0.212**      0.205**         0.212**
                         (0.071)    (0.069)     (0.085)    (0.064)      (0.065)         (0.066)
Predictor of interest   −0.061*    −0.280**    −0.238**     0.044      −0.188          −0.086
                         (0.03)     (0.087)     (0.084)    (0.062)      (0.096)         (0.163)
Interaction               0.047      0.189      −0.033     −0.090       0.183          −0.109
                         (0.058)    (0.120)     (0.089)    (0.139)      (0.151)         (0.201)

Note. Coefficients are measured in standard deviations. Standard errors are in parentheses. *p < 0.05. **p < 0.01.
As shown in Table 2.5, in the first iteration of each model, several of the gender and race/ethnicity main effects are statistically significant. However, when gender and race are interacted with the treatment, the interaction effects are not significantly different from zero. This indicates that there is no evidence of a difference in the treatment effect by gender or race/ethnicity.

With regard to the exploratory model, a mixed-effects logistic regression estimating the change in educational ambitions from fall to spring was conducted. Students in the treatment group were 20% more likely than the control group to increase their postsecondary aspirations. The coefficient on the treatment was 0.18 (standard error = 0.09; p = 0.05). If treatment students originally intended to attend a 2-year school, they were 1.2 times more likely at the end of the intervention to intend to attend a 4-year college than their control counterparts (see APPENDIX 2.H for the full model results, which are also available on the journal website).

Fidelity of Implementation

Table 2.6 shows the results of estimating the treatment effect using a three-level HLM. As seen in Table 2.6, the results are consistent with those from the two-level HLM above; therefore, it is appropriate to use a three-level model for the fidelity of implementation analysis. Table 2.7 shows the results of the fidelity of implementation mediation models. The composite teacher measure of "incorporation of PBL" was not a statistically significant mediator and accounted for only a small portion (11%) of the overall treatment effect (see APPENDIX 2.I for the full model results, which are also available on the journal website).

Table 2.6 Estimated Treatment Effect of the CESE System with Three Levels

Effect                                           (3)                (4)
Treatment effect                           0.211*** (0.059)   0.196** (0.060)
Additional controls                                                  x
Three levels (student, school, and teacher)       x                  x

Note. Treatment effect is the difference between the treatment and control group, measured in standard deviations. Standard errors are in parentheses. CESE = Crafting Engaging Science Environments. *p < 0.05. **p < 0.01. ***p < 0.001.

Table 2.7 Mediation Effects

Measure                            a         b       Indirect effect   95% Confidence    Total effect
                                                          (a * b)        interval        explained (a * b)/c
Teachers' incorporation of PBL    0.329     0.068         0.022       [−0.013, 0.071]          11%
                                 (0.126)   (0.055)       (0.021)
Students' use of modeling         0.234     0.234         0.055       [−0.008, 0.132]          28%
                                 (0.056)   (0.139)       (0.036)

Note. Standard errors are in parentheses. PBL = project-based learning.

However, students' reported use of modeling explains roughly 28% of the total treatment effect (and is significant at the 0.10 level). This was expected, as the teacher is ultimately responsible for guiding and supporting the students' modeling activities. That teachers made frequent use of modeling, one of the key scientific practices incorporated in the intervention, is a promising sign that this experience was a path through which the treatment worked.

Findings of Generalizability Index

When the generalizability index analysis was conducted using the Common Core of Data with seven covariates, the results show that the generalizability index for the entirety of the United States is 0.82. This indicates that the sample from this study is similar to the inference population (here, the United States) with regard to the covariates selected.
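As a rough illustration of how such an index can be computed, the sketch below fits a logistic model for membership in the study sample versus an inference population and then compares the two propensity-score distributions with a Bhattacharyya coefficient over bins, which is how Tipton (2014) defines the index. The data files, column names, and binning choice are assumptions for illustration; the study's own implementation is not shown in the text.

```python
# Sketch of a Tipton (2014)-style generalizability index from sampling
# propensity scores. File names, columns, and the 25-bin choice are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

covariates = ["enrollment", "pct_frl", "urbanicity",
              "pct_white", "pct_black", "pct_hispanic", "pct_asian"]

sample = pd.read_csv("sample_schools.csv")          # the study schools
population = pd.read_csv("population_schools.csv")  # eligible public schools (CCD)

pooled = pd.concat([sample.assign(in_sample=1), population.assign(in_sample=0)])
scores = LogisticRegression(max_iter=1000).fit(
    pooled[covariates], pooled["in_sample"]
).predict_proba(pooled[covariates])[:, 1]

mask = (pooled["in_sample"] == 1).to_numpy()
bins = np.linspace(scores.min(), scores.max(), 26)   # ~25 bins
w_s, _ = np.histogram(scores[mask], bins=bins)
w_p, _ = np.histogram(scores[~mask], bins=bins)
index = np.sum(np.sqrt((w_s / w_s.sum()) * (w_p / w_p.sum())))
print(f"generalizability index = {index:.2f}")       # 1.0 means identical distributions
```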
In this case, when statistical adjustments are used to find an average treatment effect, these would be approximately unbiased for the inference population. Discussion Results of this randomized controlled trial and its generalizability are an important contribution to science teaching and learning; its significant effects on over 4,000 diverse students in two different states are especially heartening given the few science interventions that have been rigorously tested at the high school level and shown to be effective at improving science learning. Results show that the intervention was effective at raising students’ science learning as measured by an independently developed summative assessment. There was no evidence that the intervention produced different effects based on students’ gender, race, or ethnicity. It is important to underscore what these results mean and what they do and do not conclude. Since there is no difference here among racial and ethnic groups on the effect of the treatment, this does not mean that the intervention is engaging history, culture, or race and 39 ethnicity sufficiently. One of the key principles of CESE states that science must be personally meaningful and of interest to the students to engage them in science. This happens by having students ask questions about scientific phenomenon and having them relate these ideas to their own lives. These principles of CESE and their enactment are quite different from the current critique of science learning and curriculum (see Lee et al., 2020; Pinkard et al., 2017; Rosebery et al., 2016). Additionally, CESE science experiences are deliberately designed for different inclusionary group activities. These activities are uniquely designed to involve multiple groups of students in problem solving, writing explanations, and becoming more informed about their world as well as learning from one another. Our results show that contrary to other interventions that often fail to positively affect all students, our theoretical framework and experiential learning opportunities should benefit all groups on average. The most important takeaway is that science academic performance related to the NGSS can be improved with an intervention that is created and implemented as a coherent system approach: an approach that includes teacher and student learning experiences, teacher professional learning, and formative unit assessments that incorporate the three dimensions, including—but not limited to— modeling and writing explanations. The treatment provided teachers with multiple professional learning experiences on how to enact PBL that underscored the importance of “doing science.” Many of the teacher practices emphasized in the treatment included having the students take primary ownership for solving problems, figuring out phenomenon, engaging in scientific practices, and learning the meaning of science concepts across multiple DCIs. Results shown in Table 2.6 indicate that “modeling” was an indirect pathway that affected students’ science achievement scores. It is no surprise that students who frequently 40 participated in modeling activities had an advantage on the summative assessment, as modeling is one of the eight scientific practices emphasized in the NGSS. However, modeling is not an experience that occurs independent of instructional opportunities. 
That the treatment students reported that modeling was a practice they used on multiple occasions suggests that a key principle of the intervention was being implemented in the classrooms. Our goal has been to help adolescents not only become more science literate but also to ignite a new deepened interest in science, which in our exploratory work, we were pleased to find. Our exploratory analyses indicated that the intervention changed students’ educational ambitions. This was encouraging, as educational ambitions are a key predictor for college enrollment (Schneider et al., 2016). Given that the treatment increased students’ college ambitions suggests that the intervention, with its emphasis on engaging in three dimensions of learning scientific knowledge to make sense of compelling phenomena or solve complex problems, may be a trigger for pursuing further education in science and other fields. Limitations of the Study The three units in this intervention lasted, on average, 12 to 16 weeks. Most science courses typically last longer and include additional areas of study. Had the intervention lasted longer and included more units, the treatment effects on science learning may have been larger. However, it could also be the case that students may have reached a saturation point in their exposure to scientific practices and that teachers would be unable to sustain the types of instruction used in this intervention. From interviews we conducted with treatment teachers who participated in earlier phases of this study, this did not appear to be the case; indeed, teachers reported that they found themselves using these practices in other units that were not part of the CESE curriculum. 41 The lack of an effect for teacher reports in the meditation model may be the result of the measures employed. Other methods may have provided a more robust measure, such as many more in-person observations of teachers with high interrater agreement between multiple observers and repeated student surveys. However, due to cost constraints, this was not a possibility. Finally, conducting a study of this magnitude—one that is a large-scale randomized control trial in two different states, and includes professional development and materials—is quite costly from financial, personnel considerations, and time-consuming. For future studies that look to expand and generalize this work to larger populations, these costs will be considerable. However, the promise of this intervention appears to be positive enough to warrant further investment. Conclusion Unquestionably, there is an immediacy for dramatically transforming science learning, especially given the health and environmental challenges young people are facing today and likely to face in the future. While recent major reforms to science teaching and learning have been met with enthusiasm and action at the state level, these reforms have not yet been widely adopted at the classroom level, particularly in high school chemistry and physics. This is in part because of the lack of aligned curriculum materials, professional learning, and assessments. In this respect, the results of this intervention are especially encouraging and merit further expansion. This intervention is grounded in the recommendations of A Framework for K–12 Science Education (NRC, 2012) and Science and Engineering for Grades 6 to 12 (National Academies of Sciences, Engineering, and Medicine, 2019), which incorporate the three dimensions of scientific knowledge. 
Our results suggest it is possible to change the scientific learning environments for 42 all students and expect positive science achievement results. However, this can only happen when there is a principled design system that involves not only just an engaging curriculum, but also high-quality professional learning and formative assessments designed to stimulate knowledge. If the science and science education communities intend to improve science learning, students need to work on relevant meaningful problems and participate in scientific practices similar to the actual work of scientists—such as “figuring out” phenomena, building and testing models that explain those phenomena, searching for patterns and connections in data, and uncovering cause and effect relationships. Notes This study is supported by the National Science Foundation (OISE-1545684; PIs Barbara Schneider and Joe Krajcik). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. Finnish authors were supported by the Academy of Finland (#298323; PIs Jari Lavonen and Katariina Salmela-Aro). We thank the following people for their contributions and consultation regarding this study: Larry Hedges (Northwestern University), Jeffery Wooldridge (Michigan State University), Mark Reckase (Michigan State University), Yiling Cheng (Kaohsiung Medical University), and Kalle Juuti and Janna Inkinen (University of Helsinki). 1The Finnish randomized controlled trial started a year later in 2019–2020 and had several interruptions because of the Covid-19 pandemic. Results from the Finnish efficacy study and U.S. control teachers who gave their new chemistry and physics classrooms the CESE treatment in 2019–2020 can be found in Maestrales, Dezendorf, et al. (2021). 43 2See Cooper (2020) for a description of the relationship between crosscutting ideas, system thinking, and modeling. 3Additional details can be found on the CESE website: https://sites.google.com/a/msu.edu/craftingengagingscience/home. 4On average for both chemistry and physics, respectively, the total professional learning was 40 hours. 5These estimates were based on Optimal Design Software (Spybrook et al., 2011), which calculates the number of clusters (i.e., schools) to power a study given the expected characteristics of the sample and a hypothesized effect size. This includes the number of observations (i.e., students), intraclass correlations, and R2 values from previous science assessments and covariates. These values were estimated based on information compiled by Spybrook et al. (2016), as well as our knowledge about the schools we expected to recruit. 6Because of space limitations, we have included an online APPENDIX for additional tables (also available on the journal website) and information. 7If pretests were missing at random, imputation would not increase efficiency. The standard errors did not increase after imputation. If the imputation reduced the standard errors, then we would have expected greater efficiency of our models, but this was not the case (Wooldridge, 2010). The student attrition rate was higher than that recommended by the What Works Clearinghouse (2020), so as a robustness check we imputed pretest scores for those who only had a summative assessment, bringing our attrition rates down. 
8To verify the quality of the pretest, a multinomial logistic regression and an item response theory (IRT) nominal response model were conducted to examine students’ response patterns by gender Cheng and Reckase [2020] for a fuller description of the pretest item 44 differentiation). Results showed no significant differences on test-level scores between genders, or significant gender differences for most of the distractors. High-achieving girls and boys also chose the correct answers at the same frequency. These findings resonate with previous findings of gender similarities (Hyde & Linn, 2006; Zell et al., 2015). 9A confidentiality agreement with Michigan Department of Education was signed and the team was not allowed to show the items used for the summative assessment. Student scores and other de-identified information were stored on a secured server at Michigan State University. 10The Psychometric technical report for the science test has been delayed because of Covid-19. However, a test of the reliability of the items is available on request. 11The physical science summative test scores for students taking chemistry or physics were made comparable using the R package equateIRT. The scores were then standardized for subjects and a 2pl model was analyzed. Then the two tests were equated with only the control students so that a treatment effect would not bias the equating process so that the students’ summative assessments for physics and chemistry would be comparable based on the distribution of the two sets of scores (see APPENDIX 2.J for the full table of item equivalence for the chemistry and physics summative assessments). 45 REFERENCES Bloom, H. (2005). Randomizing groups to evaluate place-based programs. MDRC. Blumenfeld, P. C., Soloway, E., Marx, R. W., Krajcik, J. S., Guzdial, M., & Palincsar, A. (1991). Motivating project-based learning: sustaining the doing, supporting the learning. Educational Psychologist, 26(3–4), 369–398. https://doi.org/10.1080/00461520.1991.9653139. Chen, C.-H., & Yang, Y.-C. (2019). Revisiting the effects of project-based learning on students’ academic achievement: A meta-analysis investigating moderators. Educational Research Review, 26(February), 71–81. https://doi.org/10.1016/j.edurev.2018.11.001. Cheng, Y., & Reckase, M. (2020). The effect of gender differences and similarities on science performance [Conference section]. American Educational Research Association Conference, San Francisco, CA, United States. https://www.aera.net/Events- Meetings/Annual-Meeting/Previous-Annual-Meetings/2020-Annual-Meeting. (Conference canceled). Cheung, A., Slavin, R. E., Kim, E., & Lake, C. (2016). Effective secondary science programs: A best-evidence synthesis. Journal of Research in Science Teaching, 54(1), 58–81. https://doi.org/10.1002/tea.21338. Condliffe, B., Quint, J., Visher, M. G., Bangser, M. R., Drohojowska, S., Saco, L., & Nelson, E. (2017). Project-based learning: A literature review (Working Paper). MDRC. https://eric.ed.gov/?id=ED578933. Maestrales, S., Zhai, X., Touitou, I., Baker, Q., Schneider, B., & Krajcik, J. (2021). Using machine learning to evaluate multidimensional assessments of chemistry and physics. Journal of Science Education and Technology, 30(2), 239–254. https://doi.org/10.1007/s10956-020-09895-9. National Academies of Sciences, Engineering, and Medicine. (2019). Science and engineering for Grades 6–12: Investigation and design at the center. The National Academies Press. National Assessment of Educational Progress. (2021). 
Results from the 2019 Science Assessment. U.S. Department of Education and the Institute of Education Sciences. National Center for Education Statistics. (2017). The condition of education 2017. Author. https://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2017144. National Research Council. (2000). How people learn: Brain, mind, experience, and school (Expanded ed.). National Academies Press. National Research Council. (2012). A framework for K–12 science education: Practices, crosscutting concepts, and core ideas. National Academies Press. 46 National Science Board. (2019). The skilled technical workforce: Crafting America’s science and engineering enterprise. https://www.nsf.gov/nsb/publications/2019/nsb201923.pdf . National Science Board. (2020). The state of U.S. science and engineering 2020. https://ncses.nsf.gov/pubs/nsb20201. National Science Teacher Association. (2017). NGSS hub. https://ngss.nsta.org/. NGSS Lead States. (2013). Next generation science standards: For states, by states. National Academies Press. https://epsc.wustl.edu/seismology/book/presentations/2014_Promotion/NGSS_2013.pdf. Organisation for Economic Co-operation and Development. (2020). Science performance (PISA)—indicator. https://data.oecd.org/pisa/science-performance-pisa.htm. Pinkard, N., Erete, S., Martin, C., & McKinney de Royston, M. (2017). Digital Youth Divas: Exploring narrative-driven curriculum to spark middle school girls’ interest in computational activities. Journal of the Learning Sciences, 26(3), 477–516. https://doi.org/10.1080/10508406.2017.1307199. Pituch, K. A., Murphy, D. L., & Tate, R. L. (2009). Three-level models for indirect effects in schooland class-randomized experiments in education. Journal of Experimental Education, 78(1), 60–95. https://doi.org/10.1080/00220970903224685. Polikoff, M. S., Campbell, S. E., Koedel, C., Le, Q. T., Haraway, T., & Gasparian, H. (2018). The formalized processes districts use to evaluate textbooks [University of Southern California Working Paper]. Raudenbush, S. W. (1997). Statistical analysis and optimal design for cluster randomized trials. Psychological Methods, 2(2), 173–185. https://doi.org/10.1037/1082-989X.2.2.173. Raudenbush, S. W., & Bryk, A. (2002). Hierarchical linear models (2nd ed.). Sage. Riegle-Crumb, C., & King, B. (2010). Questioning a White male advantage in STEM: Examining disparities in college major by gender and race/ethnicity. Educational Researcher, 39(9), 656–664. https://doi.org/10.3102/0013189X10391657. Riegle-Crumb, C., King, B., & Irizarry, Y. (2019). Does STEM stand out? Examining racial/ethnic gaps in persistence across postsecondary fields. Educational Researcher, 48(3), 133–144. https://doi.org/10.3102/0013189X19831006. Riegle-Crumb, C., Moore, C., & Ramos-Wada, A. (2011). Who wants to have a career in science or math? Exploring adolescents’ future aspirations by gender and race/ethnicity. Science Education, 95(3), 458–476. https://doi.org/10.1002/sce.20431. Rosebery, A. S., Warren, B., & Tucker-Raymond, E. (2016). Developing interpretive power in science teaching. Journal of Research in Science Teaching, 53(10), 1571–1600. https://doi.org/ 10.1002/tea.21267. 47 Sasson, I., Yehuda, I., & Malkinson, N. (2018). Fostering the skills of critical thinking and question-posing in a project-based learning environment. Thinking Skills and Creativity, 29(September), 203–212. https://doi.org/10.1016/J.TSC.2018.08.001. Schneider, B., Klager, C., Chen, I.-C., & Burns, J. (2016). 
Transitioning into adulthood: Striking a balance between support and independence. Policy Insights From the Behavioral and Brain Sciences, 3(1), 106–113. https://doi.org/10.1177/2372732215624932. Schneider, B., Krajcik, J., Lavonen, J., & Salmela-Aro, K. (2020). Learning science: The value of crafting engagement in science environments. Yale University Press. https://doi.org/10.2307/j.ctvwcjfk1. Sinatra, G. M., Heddy, B. C., & Lombardi, D. (2015). The challenges of defining and measuring student engagement in science. Educational Psychologist, 50(1), 1–13. https://doi.org/10.1080/00461520.2014.1002924. Spybrook, J., Bloom, H., Congdon, R., Hill, C., Martinez, A., & Raudenbush, S. W. (2011). Optimal design plus empirical evidence: Documentation for the “optimal design.” http://www.hlmsoft.net/ od/od-manual-20111016-v300.pdf Spybrook, J., Westine, C. D., & Taylor, J. A. (2016). Design parameters for impact research in science education: A multistate analysis. AERA Open, 2(1). https://doi.org/10.1177/2332858415625975. Tipton, E. (2014). How generalizable is your experiment? An index for comparing experimental samples and populations. Journal of Educational and Behavioral Statistics, 39(6), 478– 501. https://doi.org/10.3102/1076998614558486. van den Bergh, L., Ros, A., & Beijaard, D. (2015). Teacher learning in the context of a continuing professional development programme: A case study. Teaching and Teacher Education, 47(April), 142–150. https://doi.org/10.1016/j.tate.2015.01.002. What Works Clearinghouse. (2020). What Works Clearinghouse™: Standards handbook, version 4.1. Institute of Education Sciences. U.S. Department of Education. https://ies.ed.gov/ncee/wwc/Docs/referenceresources/WWC- Standards-Handbook-v4-1-508.pdf. Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data. MIT Press. Zell, E., Krizan, Z., & Teeter, S. R. (2015). Evaluating gender similarities and differences using metasynthesis. American Psychologist, 70(1), 10–20. https://doi.org/10.1037/a0038208. 48 APPENDIX 2.A BALANCE TABLES BETWEEN THE TREATMENT AND CONTROL T-test T-Value -0.96 0.68 1.50 Variable Total Enrollment Proportion free- reduced price lunch SAT composite SCHOOLS Table 2.A.1 Michigan Schools Data Balance Check MI: Treatment (n=12) MI: Control (n=12) SD Mean 883.10 338.40 151.00 Min Max 134.00 Mean 744.00 SD Min 337.80 276.00 1186.00 Max 0.42 0.23 0.11 0.87 0.39 0.22 0.16 0.86 -0.29 990.60 92.30 840.30 1166.90 1015.00 67.50 887.20 1076.90 Proportion White 0.58 0.29 0.11 0.94 0.76 0.23 0.20 0.95 0.00 0.24 0.30 Proportion Black Proportion Hispanic Proportion Asian -0.27 Source: Data comes from the Michigan Consortium for Educational Research (MCER) and The Common Core of Data (CCD). -1.60 0.03 0.11 0.07 0.11 0.04 0.80 0.25 0.10 0.10 0.00 0.00 0.41 0.48 0.02 0.03 0.14 0.09 0.09 0.03 0.00 0.03 0.29 Figure 2.A.1 Michigan Schools Data Balance Check 49 Table 2.A.2 Detroit Schools Data Balance Check Detroit: Treatment (n=6) Detroit: Control (n=6) Variable Mean SD Min Max Mean SD Min Max % of Free-reduced 0.66 0.14 0.51 0.83 0.75 0.13 0.58 0.90 SAT composite 872.46 120.16 754.62 1035.98 812.66 56.41 744.29 884.46 % of White % of African % of Hispanic 0.01 0.79 0.17 0.02 0.38 0.38 0.00 0.03 0.00 0.04 0.99 0.94 0.02 0.82 0.12 0.03 0.32 0.29 0.00 0.20 0.00 0.07 1.00 0.72 T- test T- value -1.15 1.10 -0.68 -0.15 0.26 % of Asian 0.09 0.00 Source: Data comes from the Michigan Consortium for Educational Research (MCER) and The Common Core of Data (CCD). 
0.00 0.07 0.03 0.00 0.03 0.04 0.18 Figure 2.A.2 Detroit Schools Data Balance Check 50 Table 2.A.3 Los Angeles Schools Data Balance Check Los Angeles Schools Data Balance Check LA: Treatment group (n=12) LA: Control group (n=13) Variable Mean SD Min Max Mean SD Min Max T- test T- value 0.43 0.96 0.68 0.85 37.06 500.00 2547.93 2437.00 1262.17 670.47 Total enrollment % of Free- reduced Grade 11 math score % of White % of African % of Hispanic % of Asian 0.01 Source: https://www.caschooldashboard.org/#/Home. 2478.80 2595.20 0.20 0.05 0.00 0.05 0.00 0.86 0.02 0.63 0.17 0.22 0.06 0.98 0.03 0.04 0.47 1676.77 653.95 409.00 2531.00 1.56 0.86 0.44 0.75 0.93 0.26 2544.89 38.87 2491.40 2605.70 -0.19 0.04 0.10 0.78 0.05 0.02 0.06 0.16 0.04 0.50 0.20 0.55 0.10 0.15 0.42 0.98 0.16 -0.23 1.43 -1.30 1.49 Figure 2.A.3 Los Angeles Schools Data Balance Check 51 Total enrollment % of Free- reduced Grade 11 math 2607.50 0.35 Table 2.A.4 San Diego Schools Data Balance Check San Diego Schools Data Balance Check San Diego San Diego Treatment group (n=8) Control group (n=8) Variable Mean SD Min Max Mean SD Min Max T- test T- value 2222.38 550.64 1257.00 3051.00 2022.56 686.40 396.00 2739.00 -0.75 0.25 0.08 0.71 0.32 0.29 0.03 0.74 58.84 2530.20 2713.30 % of White % of African % of Hispanic 0.14 0.03 0.59 0.17 0.02 0.31 0.01 0.01 0.07 0.52 0.06 0.92 2609.70 0.31 0.02 0.45 68.23 2521.40 2714.70 0.28 0.01 0.37 0.01 0.01 0.07 0.75 0.03 0.93 0.22 -0.12 -1.12 0.00 -0.22 % of Asian 0.54 Sources: California Department of Education; https://www.caschooldashboard.org/#/Home. 0.15 0.00 0.21 0.18 0.60 0.00 0.10 1.42 Figure 2.A.4 San Diego Schools Data Balance Check 52 APPENDIX 2.B TEACHER EXIT SURVEY ITEMS DEALING WITH THE TEACHER UNITS, PRACTICES, AND CURRICULUM Items: 1. How familiar are you with the principles of project-based learning in science? a. Very familiar b. Somewhat familiar c. Not at all familiar 2. How often do you incorporate project-based learning in your science teaching? a. Frequently b. Sometimes c. Rarely d. Never 3. How often do you employ the following teaching practices (1-never or almost never, 2- occasionally, 3-frequently, 4-in all or almost all lessons)? a. I present a summary of recently learned content b. Students work in small groups to come up with a joint solution to a problem or task c. I give different work to the students who have difficulties learning and/or to those who can advance faster d. I refer to a problem from everyday life or work to demonstrate why new knowledge is useful e. I let students practice similar tasks until I know that every student has understood the subject matter f. I check my students’ exercise books or homework g. Students work on projects that require at least one week to complete h. Students use information and communication technology for projects or class work I expect students to explain their thinking on complex problems I give students a choice of problems to solve i. j. k. I connect science concepts I teach to uses of those concepts outside of school I encourage students to solve problems in more than one way l. 4. How often do you employ the following scientific practices (1-never or almost never, 2- occasionally, 3-frequently, 4-in all or almost all lessons)? a. I guide students to ask questions b. I guide students to define problems c. I guide students to develop models d. I guide students to plan investigations e. I guide students to conduct investigations f. I guide students to interpret data g. I guide students to solve problems h. 
I guide students to construct an explanation i. I guide students to use evidence to make an argument 53 I guide students to communicate information j. k. I guide students in having them present their explanations and models l. I guide students in construction products related to the work they do in class. 5. How often do you employ the following classroom activities (1-never or almost never, 2- occasionally, 3-frequently, 4-in all or almost all lessons)? a. I guide students to work on a computer b. I guide students to work in a group c. I guide students to work in a lab d. I guide students to solve math problems 6. How often do you employ the following teaching practices (1-two to 5 times a week, 2- about once per week, 3-twice a month, 4- once per month or less)? a. Hands-on experiments b. Create or use models c. Assign textbook d. Use the internet to find answers for science e. Use computational modeling, CAD programs, or other modeling software 7. What textbooks did you use in your classes this year? 8. Do you write all of your own lesson plans or do you get them from somewhere else? a. I write all of my own lesson plans b. I write some of my own lesson plans c. I do not write my own lesson plans 9. Where do you get your lesson plans or your inspiration for lessons plans? a. District or Department Standard Lesson Plans b. Other teachers & colleagues c. Internet search d. Other 54 APPENDIX 2.C RACE HETEROGENEITY MODEL 𝑌𝑖𝑗 = 𝛾00 + 𝛾01 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗 + 𝛾11 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗𝑥𝑊ℎ𝑖𝑡𝑒𝑖𝑗 + 𝑆𝑗𝜕𝑗 + 𝛾𝑖0 𝑋𝑖𝑗 + 𝑀𝑖𝑗 𝑎𝑖0 + 𝑟0𝑗 + 𝜖𝑖𝑗 𝑌𝑖𝑗 = 𝛾00 + 𝛾01 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗 + 𝛾11 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗𝑥𝐵𝑙𝑎𝑐𝑘𝑖𝑗 + 𝑆𝑗𝜕𝑗 + 𝛾𝑖0𝑋𝑖𝑗 + 𝑀𝑖𝑗 𝑎𝑖0 + 𝑟0𝑗 + 𝜖𝑖𝑗 𝑌𝑖𝑗 = 𝛾00 + 𝛾01 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗 + 𝛾11𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗𝑥𝐻𝑖𝑠𝑝𝑎𝑛𝑖𝑐𝑖𝑗 + 𝑆𝑗𝜕𝑗 + 𝛾𝑖0 𝑋𝑖𝑗 + 𝑀𝑖𝑗 𝑎𝑖0 + 𝑟0𝑗 + 𝜖𝑖𝑗 𝑌𝑖𝑗 = 𝛾00 + 𝛾01 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗 + 𝛾11 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗𝑥𝐴𝑠𝑖𝑎 𝑛𝑖𝑗 + 𝑆𝑗𝜕𝑗 + 𝛾𝑖0𝑋𝑖𝑗 + 𝑀𝑖𝑗 𝑎𝑖0 + 𝑟0𝑗 + 𝜖𝑖𝑗 𝑌𝑖𝑗 = 𝛾00 + 𝛾01 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗 + 𝛾11𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗𝑥𝑂𝑡ℎ𝑒𝑟𝑅𝑎𝑐 𝑒𝑖𝑗 + 𝑆𝑗𝜕𝑗 + 𝛾𝑖0 𝑋𝑖𝑗 + 𝑀𝑖𝑗 𝑎𝑖0 + 𝑟0𝑗 + 𝜖𝑖𝑗 𝑌𝑖𝑗 = 𝛾00 + 𝛾01 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗 + 𝛾11𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗𝑥𝑀𝑢𝑙𝑡𝑖𝑅𝑎𝑐𝑖𝑎 𝑙𝑖𝑗 + 𝑆𝑗𝜕𝑗 + 𝛾𝑖0 𝑋𝑖𝑗 + 𝑀𝑖𝑗 𝑎𝑖0 + 𝑟0𝑗 + 𝜖𝑖𝑗 55 APPENDIX 2.D MEDIATION MODEL First, we estimate the overall treatment effect controlling for the mediator of interest to find paths b (𝛾010 ) and c (𝛾001 ): 𝑌𝑖𝑗𝑘 = 𝛾000 + 𝛾001 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑘 + 𝛾010 𝑀𝑒𝑑𝑖𝑎𝑡𝑜 𝑟𝑗𝑘 + 𝑆𝑘𝜕𝑘 + 𝛾020 𝐶ℎ𝑒𝑚𝑖𝑠𝑡𝑟 𝑦𝑗𝑘 + 𝛾𝑖00 𝑋𝑖𝑗 + 𝑀𝑖𝑗 𝑎𝑖00 + 𝜇00𝑘 + 𝑟0𝑗𝑘 + 𝜖𝑖𝑗𝑘 Next, we estimate a two-level model in which the treatment indicator predicts the indirect effect on the outcome which accounts for some proportion of the overall treatment effect, c. Here, pathway a is 𝛾001 . 𝑀𝑒𝑑𝑖𝑎𝑡𝑜 𝑟𝑗𝑘 = 𝛾000 + 𝛾001 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑘 + 𝑆𝑘𝜕𝑘 + 𝛾020 𝐶ℎ𝑒𝑚𝑖𝑠𝑡𝑟 𝑦𝑗𝑘 + 𝜇00𝑘 + 𝑟0𝑗𝑘 Confidence intervals for the multilevel mediation effects were computed using an empirical M-test as outlined by Tofighi and Mackinnon (2011). 56 APPENDIX 2.E EDUCATION AMBITION MODEL The model for students’ educational ambition 𝑙𝑜𝑔𝑖𝑡(𝜋𝑖𝑗) = 𝛾00 + 𝛾01 𝑇𝑟𝑒𝑎𝑡𝑚𝑒𝑛 𝑡𝑗 + 𝑆𝑗𝛿𝑗 + 𝑋𝑖𝑗 𝛾𝑖0 + 𝑟0𝑗 +𝜖𝑖𝑗 where 𝜋𝑖𝑗 is the probability that the binary indicator showing that a students’ college ambition increased from the fall to the spring is equal to one. 𝛾00 is the likelihood of the control students increasing their college ambition. 𝛾01 is the difference of the likelihood between the treatment and control group. 𝑆𝑗 are the school level covariates, including school pretest mean and region and 𝜕𝑗 are the coefficients on those covariates. 𝑋𝑖𝑗 are the individual level covariates, including pretest, course (chemistry or physics), gender, and race/ethnicity of the students and 𝛾𝑖0 are the coefficients on these individual level covariates. 
𝑟0𝑗 and 𝜖𝑖𝑗 are the school and student level error terms. 57 APPENDIX 2.F FULL TREATMENT EFFECTS ESTIMATES Table 2.F.1 Full Treatment Effect Estimates Treatment Standardized Pretest Score Chemistry Region 2 Region 3 Region 4 School Mean Pretest Female Missing Sex Black Hispanic Asian Other Race Multiple Races Missing Race Constant Random Effects Variances School Teacher Student (1) 0.220*** (0.064) 0.283*** (0.022) -0.522*** (0.116) -0.126+ (0.075) -0.1 (0.09) 0.009 (0.09) 0.220+ (0.123) 0.311* (0.127) (2) 0.208** (0.065) 0.269*** (0.021) -0.514*** (0.115) 0.003 (0.079) 0.026 (0.105) 0.125 (0.1) 0.179 (0.12) -0.038 (0.033) -0.014 (0.118) -0.191* (0.077) -0.253*** (0.067) 0.005 (0.077) -0.105 (0.084) -0.137 (0.093) -0.257** (0.095) 0.409** (0.129) 0.038*** (0.005) 0.038*** (0.004) 0.768*** (0.030) 4238 0.761*** (0.030) 4238 58 N Note. Standard errors in parentheses. +p < 0.10 *p < 0.05 **p < 0.01 ***p < 0.001 APPENDIX 2.G FULL HETEROGENEITY RESULTS Table 2.G.1 Full Heterogeneity Results Model 2 Female Black Hispanic Asian Other Multi Treatment Treatment x Interaction Female Missing female Race Black Hispanic Asian Multi Other Missing race Pretest average School Pretest Chemistry Region Detroit LA San Diego .208** (0.065) not in model -0.038 (0.033) -0.014 (0.118) .185** (0.071) 0.047 (0.058) -.0608* (0.031) -0.014 (0.119) -0.191* (0.118) -.1914* (0.077) -.250*** (0.067) 0.005 (0.077) -0.137 (0.093) -0.105 (0.084) -.257** (0.095) .269*** (0.021) 0.179 (0.120) -.510*** (0.115) -.250*** (0.067) 0.005 (0.077) -0.137 (0.093) -0.106 (0.084) -.255** (0.095) .269*** (0.021) 0.182 (0.120) -.510*** (0.115) 0.003 (0.079) 0.027 (0.105) 0.125 (0.010) 0.003 (0.079) 0.028 (0.105) 0.125 (0.099) .192** (0.069) 0.189 (0.120) -0.039 (0.033) -0.011 (0.116) -.280** (0.087) -.250*** (0.068) 0.010 (0.075) -0.137 (0.093) -0.103 (0.085) -.256** (0.092) .269*** (0.021) 0.179 (0.118) -.510*** (0.114) -0.019 (0.078) 0.021 (0.107) 0.120 (0.102) .222** (0.085) -0.033 (0.089) -0.038 (0.033) -0.017 (0.117) -0.190* (0.077) -.238** (0.084) 0.005 (0.077) -0.137 (0.093) -0.105 (0.084) -.256** (0.095) .269*** (0.021) 0.178 (0.119) -.510*** (0.115) -0.001 (0.080) 0.026 (0.105) 0.125 (0.010) .212** (0.064) -0.090 (0.140) -0.038 (0.33) -0.015 (0.119) -.193* (0.076) -.250*** (0.067) 0.044 (0.062) -0.137 (0.093) -0.105 (0.084) -.257** (0.095) .270*** (0.021) 0.178 (0.120) -.510*** (0.115) 0.004 (0.079) 0.026 (0.105) 0.125 (0.099) .205** (0.065) 0.183 (0.151) -0.038 (0.033) -0.013 (0.118) -.190* (0.078) -.253*** (0.067) 0.004 (0.077) -0.137 (0.093) -0.188 (0.096) -.257** (0.095) .269*** (0.021) 0.179 (0.120) -.510*** (0.115) 0.002 (0.079) 0.026 (0.105) 0.124 (0.100) .211** (0.066) -0.109 (0.201) -0.038 (0.033) -0.014 (0.119) -.190* (0.076) -.250*** (0.067) 0.006 (0.077) -0.086 (0.163) -0.104 (0.083) -.257** (0.095) .270*** (0.021) 0.179 (0.120) -.510*** (0.115) 0.003 (0.079) 0.026 (0.104) 0.125 (0.100) Note: standard errors are in parentheses. 
*p<0.5 **p<0.01 ***p<0.001 59 APPENDIX 2.H COLLEGE AMBITION FULL RESULTS Table 2.H.1 College Ambition Full Results Treatment Female Chemistry Standardized pretest score Coefficient β 0.18* (0.09) -0.01 (0.09) 0.10 (0.10) 0.01 (0.06) -0.23 (0.16) Standardized school average pretest score Race (White non-Hispanic is default) Hispanic Black Asian Other Multiracial Region (Detroit is default) LA Michigan San Diego 0 (0.14) -0.24 (0.22) -0.01 (0.21) -0.09 (0.31) -0.29 (0.24) -0.19 (0.23) -0.12 (0.23) -0.12 (0.22) Odds Ratio e^(β) 1.20 0.99 1.11 1.01 0.79 1.00 0.79 0.99 0.91 0.75 0.83 0.89 0.89 Note. Standard errors are in parentheses. *p<0.05, **p<0.01 ***p<0.001. Changed Educational Ambition omits students who said they did not know. High school and less than high school were in one category, Trade School and Community College are combined in another, and the outcome is the difference of the end of year less the beginning of the year. 60 APPENDIX 2.I FULL MEDIATION RESULTS Table 2.I.1 Full Mediation Model Treatment Mediator Chemistry Standardized Pretest Score School Mean Pretest Region 2 Region 3 Region 4 Constant Random Effects - Variances School Path A - PBL 0.361** (0.135) -0.164 (0.115) 0.016 (0.148) 0.427* (0.180) 0.335+ (0.186) -0.291 (0.181) -0.099 (0.141) Path B - PBL 0.193** (0.060) 0.059 (0.050) -0.528*** (0.092) 0.284*** (0.019) 0.349*** (0.065) -0.187* (0.083) -0.066 (0.086) 0.017 (0.079) 0.311** (0.115) Path A - Models 0.224*** (0.061) -0.327*** (0.055) -0.031 (0.062) -0.094 (0.118) 0.061 (0.087) -0.063 (0.081) 0.110+ (0.067) Path B - Models 0.191** (0.063) 0.171 (0.134) -0.446*** (0.101) 0.283*** (0.019) 0.403*** (0.064) -0.122 (0.084) -0.081 (0.090) 0.026 (0.084) 0.242* (0.123) 0.000 (0.000) 0.000 (0.000) 0.021*** (0.005) 0.000*** (0.000) Teacher 0.085*** (0.009) 0.724*** (0.028) Note. Standard errors in parentheses. +p < 0.10 *p < 0.05 **p < 0.01 ***p < 0.001. 0.087*** (0.032) 0.723 (0.147) 0.040*** (0.005) 0.394*** (0.025) Student 61 APPENDIX 2.J ITEM EQUIVALENCE FOR THE CHEMISTRY AND PHYSICS SUMMATIVE ASSESSMENTS In order to equate the chemistry summative assessment and the physics summative assessment, two separate summative assessments with different items, we used a 2pl polynomial model, which uses two parameters to estimate the item difficulty and student ability, to first generate the distributions of the two tests. The chemistry summative assessment contained 25 items; however, every student incorrectly answered one of the items, item 11. Therefore, this item was unable to be used to differentiate between stud ent ability. Because of this, we removed the item from the analysis, leaving the chemistry summative assessment with 24 items. The physics summative assessment contained 12 items. We then used the equateIRT package in R to obtain the following table. The equating is based only on the chemistry control group, not including the treatment group to avoid possible issues caused by the treatment effect within the analysis. 
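Conceptually, this kind of equating maps each chemistry summed score to the ability level implied by the chemistry form's test characteristic curve and then reads off the expected score on the physics form at that same ability. The sketch below illustrates that true-score logic with invented 2PL item parameters already placed on a common scale; it is not the study's equateIRT code, whose specific linking options are not reported here, and the actual equated scores appear in Table 2.J.1 below.

```python
# Illustrative IRT true-score mapping between two forms calibrated on a common
# ability scale. Item parameters below are made up for the sketch.
import numpy as np
from scipy.optimize import brentq

def tcc(theta, a, b):
    """Test characteristic curve of a 2PL form: expected summed score at theta."""
    return np.sum(1.0 / (1.0 + np.exp(-a * (theta - b))))

# Hypothetical discrimination (a) and difficulty (b) parameters for each form.
a_chem, b_chem = np.full(24, 1.0), np.linspace(-2.0, 2.0, 24)
a_phys, b_phys = np.full(12, 1.2), np.linspace(-0.5, 3.0, 12)

for score in range(1, 24):   # interior chemistry scores (the maximum cannot be bracketed)
    # Find the ability at which the chemistry form yields this expected score,
    theta = brentq(lambda t: tcc(t, a_chem, b_chem) - score, -8, 8)
    # then read off the expected physics score at that same ability.
    print(score, round(tcc(theta, a_phys, b_phys), 2))
```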
62 Table 2.J.1 Chemistry and Physics Score Transformation: Norm Table Chemistry score (24 items) Equating: estimated score in physics (SE) -0.486 (.014) -0.390 (.114) 0.069 (.587) 0.614 (.201) 1.047 (.590) 1.610 (.290) 2.000 (.593) 2.560 (.364) 2.906 (.577) 3.395 (.878) 3.769 (0.587) 4.184 (0.770) 4.636 (0.659) 5.058 (0.778) 5.538 (0.806) 6.007 (0.845) 6.489 (0.885) 7.017 (0.939) 7.495 (0.87) 8.076 (1.023) 8.568 (1.483) 9.184 (1.053) 9.743 (1.59) 10.345 (0.971) Score 1 Score 2 Score 3 Score 4 Score 5 Score 6 Score 7 Score 8 Score 9 Score 10 Score 11 Score 12 Score 13 Score 14 Score 15 Score 16 Score 17 Score 18 Score 19 Score 20 Score 21 Score 22 Score 23 Score 24 Note. Because the physics exam was more difficult than the chemistry assessment, we see that a score of 1-4 on the chemistry exam is equivalent to a score of 0 on the physics assessment. In addition, a perfect score in chemistry is only equivalent to a score of 10 in physics. We used these equated scores for the final analysis of the main effect for this intervention. 63 CHAPTER 3: USING MACHINE LEARNING TO SCORE MULTIDIMENSIONAL ASSESSMENTS OF CHEMISTY AND PHYSICS Abstract In response to the call for promoting three-dimensional science learning (NRC, 2012), researchers argue for developing assessment items that go beyond rote memorization tasks to ones that require deeper understanding and the use of reasoning that can improve science literacy. Such assessment items are usually performance-based constructed responses and need technology involvement to ease the burden of scoring placed on teachers. This study responds to this call by examining the use and accuracy of a machine learning text analysis protocol as an alternative to human scoring of constructed response items. The items we employed represent multiple dimensions of science learning as articulated in the 2012 NRC report. Using a sample of over 26,000 constructed responses taken by 6700 students in chemistry and physics, we trained human raters and compiled a robust training set to develop machine algorithmic models and cross-validate the machine scores. Results show that human raters yielded good (Cohen’s k = 0.40 – 0.75) to excellent (Cohen’s k > 0.75) interrater reliability on the assessment items with varied numbers of dimensions. A comparison reveals that the machine scoring algorithms achieved comparable scoring accuracy to human raters on these same items. Results also show that responses with formal vocabulary (e.g., velocity) were likely to yield lower machine-human agreements, which may be associated with the fact that fewer students employed formal phrases compared with the informal alternatives. 64 Introduction and Literature Review Measuring science knowledge and achievement has long been an important topic in science education research. The National Research Council ([NRC], 2012) has spelled out what they call three-dimensional learning to better facilitate student knowledge development and meet the demands of a modern STEM workforce. Three-dimensional learning encourages knowledge- in-use that can be generalized and used across multiple scientific fields to successfully meet the rapidly changing demands of science and technology careers adapted to the emerging issues of the twenty-first century (Harris et al., 2019; Haudek et al., 2019). 
This concept of knowledge-in- use occurs when students apply disciplinary core ideas (DCIs) in tandem with science and engineering practices (SEPs) and crosscutting concepts (CCs) to solve problems or make sense of phenomena. While the Framework for K-12 Science Education (NRC, 2012) presents a promising new vision of science learning, assessing the three-dimensionality of science learning is challenging (see NRC assessment report [2014] and the National Academies of Sciences, Engineering, and Medicine report [2019]). Multiple-choice questions are used ubiquitously in national, state, and classroom assessments of science achievement. However, these multiple-choice assessments typically rely on memorization of key concepts and thus have difficulty meeting the needs for assessing knowledge- in-use learning. Instead of strictly using multiple-choice questions, assessments should incorporate items that use the three dimensions of learning through a variety of task formats including constructed response (CR), which requires students to use their knowledge to solve problems with scientific practices (e.g., Harris et al., 2019). Unfortunately, CR is both time and resource consuming to score compared with multiple- choice items, and thus, teachers may not be willing to implement CR items in their classrooms. 65 Approaches that employ machine learning have shown great potential in automatically scoring CR assessments (Zhai et al., 2020a). As indicated in a recent review study (Zhai et al., 2020c), machine learning has been adopted in many science assessment practices using CRs, essays, educational games, and interdisciplinary assessments (e.g., Lee et al., 2019a; Nehm et al., 2012). More importantly, machine scoring can provide automatic and immediate feedback to students and teachers, with the potential to accelerate the use of three-dimensional assessment practices in classrooms, benefiting science learning (Lee et al., 2019b). While the potential of machine learning has been recognized, few studies have tackled the true challenge of scoring CR items on multi-dimensional science assessments. There are relatively few studies applying machine learning to analyze assessment items in which students perform tasks that require the use of multiple dimensions of scientific knowledge to make sense of phenomena (Zhai et al., 2020a). In addition, none of the studies explicitly document whether and how these assessments measure the dimensionalities of science learning. We examine the capacity of machine learning to automatically score multi-dimensional science assessments, contrasted with human scorers, using a large database of student CRs. We highlight how the machine agreement changes as we increase the size of the training set as compared with human rater agreement. Additionally, we discuss some of the complex challenges for achieving high agreement between human and machine algorithms when scoring multi-dimensional assessments, including clarifying rubrics for improved agreement between human and machine scoring and treatment of missing and outlier responses and scores. This study answers three questions: (1) How reliable were human raters in scoring multi-dimensional responses? (2) Could machine learning algorithms score multi-dimensional assessments as accurately as 66 humans? and (3) How are key phrases in student responses associated with machine scoring of the multi-dimensional assessments? 
Defining the Dimensions of Learning Building on the NRC reports, science researchers are calling for more comprehensive assessments that gauge students’ abilities to use knowledge to explain phenomena, solve real- world problems, engage in creative and critical thinking, and analyze and interpret scientific data (Haudek et al., 2019). According to Pellegrino (2013), assessments that include CR should reflect the principles spelled out in A Framework for K-12 Science Education (NRC, 2012) and the Next Generation Science Standards (NGSS Lead States [NGSS], 2013), by carefully considering and identifying which of the three dimensions of learning each question is being designed to measure. In contrast to multiple choice, CR items are often difficult to develop and to score. Most studies that include them have not met the challenge of specifying and connecting the three dimensions of learning with their assessment items. The NRC Framework for K-12 Science Education (2012) recommends that the learning and instruction of science throughout K-12 should integrate SEPs, DCIs, and CCs to make sense of phenomena. The NGSS puts this recommendation into practice by creating three-dimensional performance standards or expectations explaining the key concepts and skills students should be able to use at a given grade level (NGSS Lead States, 2013). Each dimension of learning has its own grade-based expectations. Measuring SEPs offers insight into how students use practices employed by scientists and engineers in the field, such as gathering and obtaining information and using it in argumentation (NRC, 2012). The CCs are considered “crosscutting” because they are concepts used to aid in examining phenomena and solving problems across all fields of 67 science and engineering (NRC, 2012). In turn, assessments should measure how students apply disciplinary knowledge through SEPs and CCs. Creating and Scoring Multi-Dimensional Science Assessments Despite the obvious advantages of CR in gaining more in-depth insights into student understanding, they are used somewhat less frequently than multiple-choice because scoring CR requires considerable time and effort. However, it is important to create assessments that capture students’ use of the three dimensions of science learning, and machine learning offers the potential to facilitate scoring, making these assessments more tenable for classroom and research purposes. Yet, three-dimensional items can be complicated to score as they must evaluate students’ knowledge of the subject matter through the DCIs and provide insights about how students develop, understand, and use that knowledge. To score them, one must be familiar, not only with core content, but also with how the dimensions of learning work together and are being measured. These constraints prevent the broad use of three-dimensional tests that rely on CRs for science learning. Because scoring CRs requires considerable time and effort on behalf of human raters, this can potentially elongate the period before students receive feedback (Ha & Nehm, 2016). However, studies (e.g., Lottridge et al., 2018) suggest that it should be possible to decrease human rater costs through automated machine learning algorithms. This would allow teachers and researchers to collect more detailed data on students’ science knowledge with scoring costs comparable with those of multiple-choice assessments. 
Scientists are building rubrics to measure aspects of three-dimensional learning related to human and machine scoring, but what seems to have received less attention is attributing the outcomes of the algorithm to the dimensions being scored. This is an important concept that needs to be examined in greater depth as it requires 68 students to develop knowledge across subject domains. Outlining the dimensions associated with each item is one of the key contributions of our study. Further exploration of human and machine scoring for three-dimensional learning CRs needs to incorporate DCIs, SEPs, and CCs while recognizing the complexity and difficulty of this particular task. Applications of Machine Learning in K-12 Science Assessment The number of studies on the use of machine learning to score science assessments is increasing as the technology becomes more accessible and recent studies show considerable promise for machine scoring for science CRs across multiple age groups. A prior review (Zhai et al., 2020c) suggests that various researchers have used more than 20 programs or platforms to study the automatic scoring of science learning assessments. For example, researchers from ETS (Liu et al., 2014) applied machine learning for scoring student CRs that explained science phenomena through multi-dimensional reasoning with “c-rater,” an automated machine algorithmic program. The Liu team from ETS developed multi-level rubrics and found that the machine was capable of automatically scoring student responses, achieving moderate to large Cohen’s kappa (k) values between the machine and human scorers. They found that “c-rater” could capture valid ideas in students’ responses and provide nuanced information about their performance. Other scientists collaborated with ETS and explored the classroom applications of c-rater-ML in several projects. Mao et al. (2018) applied the c-rater-ML to automatically evaluate students’ written argumentation to provide automatic feedback to students. The machine feedback could assist students to revise their arguments. In another study, Zhu et al. (2017) reported that more than 77% of students made revisions after receiving machine feedback and those who revised their responses received higher scores than the others in the final test. 69 The earliest development of programs besides c-rater is the SPSS Text Analysis (Nehm & Haertig, 2012), which required humans to develop word libraries manually. This program is costly and labor-intensive for users. The Summarization Integrated Development Environment (SIDE) developed by the TELEDIA lab at Carnegie Mellon University is the first free machine- learning-based confirmatory analysis program used in science education (Mayfield & Rosé, 2010, 2013). Its successor, Light Summarization Integrated Development Environment (LightSIDE), is more user-friendly, flexible, and accessible to public users and has been frequently used in studies. Other open sources such as RapidMiner or Weka are all popular in automatically scored science assessments. However, most of these programs applied individual algorithms each time. If one algorithm does not work well, users can choose another. Instead of a single algorithm, this study employed the Automated Analysis of Constructed Response (AACR) Web Portal (AACR, 2020) to automatically score students’ CRs using multiple algorithms. 
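As a rough illustration of what scoring with several text-classification algorithms at once can look like, the sketch below trains three stand-in classifiers on n-gram features and lets them vote. It is not the AACR portal's implementation; the classifiers, feature settings, and example responses are assumptions chosen only for demonstration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Three stand-in algorithms voting on each response; the AACR portal's own
# eight algorithms and its cross-validated weighting scheme are not reproduced here.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("nbayes", MultinomialNB()),
        ("svm", LinearSVC()),
    ],
    voting="hard",  # simple majority vote across the fitted classifiers
)

# Unigram and bigram counts feed every classifier in the ensemble.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), ensemble)

# Hypothetical human-scored training examples (1 = incorrect, 2 = correct, 3 = MDC).
responses = [
    "the truck is going the same speed as the car",
    "the truck is slowing down",
    "both vehicles travel at 55 mph so their relative speed is zero",
]
scores = [2, 1, 3]

model.fit(responses, scores)
print(model.predict(["they move at the same velocity"]))
```

In practice, a weighted ensemble of this kind tends to be more stable than any single classifier when the training set is small, which is the situation described for these constructed-response items.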
Different from most commercial automatic scoring programs, which fit one type of algorithm at a time, the AACR Web Portal developed eight algorithms that can be employed simultaneously. The AACR scoring Web Portal was developed to serve classroom needs for formative assessment purposes. Currently, it is used for exploratory and confirmatory factor analysis in item development. For exploratory analysis purposes, AACR can be applied with unsupervised machine learning to identify patterns and lexical features of student responses. Based on the findings, researchers can revise their items and rubrics iteratively. A confirmatory analysis is used in the late stage of item development to develop and validate the machine algorithms. Based on the predictions of each algorithm, derived from the cross-validation, the machine assigns weights toward the algorithm that best optimizes the algorithmic parameters. The 70 ensemble approach has been tested and compared with other general classifiers. Extensive experimentation by Large, Lines, and Bagnall (2019) reveals that the ensemble approach has measurable benefits over other classifiers such as alternative weighting, selection, or meta- classifier approaches. More importantly, the ensemble approach outperforms other classifiers with small training datasets. While machine learning is being used more frequently and continued research leads to improved reliability, a question remains about what steps researchers should take to move from human to machine scoring of multi-dimensional assessments. Sample Methods To begin this process, for new and untested items, a large database of student responses is necessary. “Crafting Engaging Science Environments (CESE),” an ongoing science intervention with 6700 high school students in California and the Midwest, provided such a database. Within the CESE sample, 48.5% were male students and 51.5% were female students, 29.2% of students identified their race as white, 47.5% identified their race as Hispanic, 11.9% as black, and 5.0% as Asian. Almost three quarters of students (74%) reported speaking Spanish in the home. To measure baseline science understanding at the beginning of the intervention, CESE relied on an assessment developed using National Assessment of Education Progress (NAEP) open-source science test-bank items. All 6700 students took the test, and this yielded 26,800 constructed responses to four CR items. With 26,800 responses requiring classification, it would be possible to learn if machine scoring was a viable option. To conserve resources, the team decided to learn if AACR could score the CRs with the same reliability as human raters. 71 Instruments and Measures CESE adopted four NAEP questions in this project. The questions tapped phenomena in chemistry or physics using everyday scenarios. Though these test questions were not initially designed to be three-dimensional as those illustrated in The Framework (NRC, 2012), multiple dimensions of science learning were detected in the responses to these items. As shown in Fig. 3.1, when reviewing the responses to each question, between 14% and 55% of students responded using multi-dimensional reasoning (i.e., the use of DCIs, SEPs, and CCs associated with the NGSS performance expectations) without being prompted. To accommodate this, three response classifications including incorrect, correct, and “multi-dimensional correct (MDC)” were adapted from the original binary rubrics. 
The MDC rating was awarded only if a student was able to demonstrate reasoning with regards to associated DCIs, CCs, or SEPs. Figure 3.1 shows the distribution of incorrect, correct, and MDC responses used by students on the assessment, as identified by human scorers. Figure 3.1 Dimensionality of Student Responses The research team partnered with third party scientists to further verify the two- and three-dimensional rubrics for the CRs. The newly developed rubrics allowed the project team to use the existing items to probe students’ learning of DCIs, CCs, and SEPs. Explanations of items 72 and their rubrics are given below, while more complete details of the items, rubrics, sample responses, and associated dimensions of learning are shown in Appendices A—D. Item 1: Experimental design, shown in APPENDIX 3.A, asks students to identify the error in an experiment where a student tests three different shoes, each on a different floor, to determine which had the highest coefficient of friction. Students’ responses included middle school level reasoning associated with NGSS performance expectations for DCI ETS1.A Defining and Delimiting Engineering Problems, and the grades 3–5 level SEP of Planning and Carrying Out Investigations (2013). To adapt this rubric to the NGSS, the MDC score for this item meant students correctly identified an error in the experimental setup (DCI) and explained that she could not compare the frictional force of different shoes on different floors due to the failure to isolate a variable while holding the floor constant (SEP). A correct response was given for students who correctly identified the error without explaining how the error affected the experiment. Item 2: Relative motion focused on relative motion between two vehicles traveling on the highway (see APPENDIX 3.B). This item asked students to explain why a truck driving alongside them on the highway appeared to not be moving. Students engaged in middle school level reasoning associated with NGSS performance expectations for DCI [PS2.A] Forces and Motion, and CC Scale and Proportion (2013). To align the rubric with the NGSS performance expectations, the MDC classification was reserved for students who connected the truck’s speed to that of the observer inside the vehicle (DCI) and stated the equal relative speeds would cause the phenomenon (CC). Responses that dis- cussed only the speed without referencing how this related to the visual event were considered correct, but not MDC. 73 In the third constructed response, item 3: properties of solutions, students engaged elementary to middle school level reasoning for the DCI associated with PS1.A Structures and Property of Matter. As shown in APPENDIX 3.C, this was a fully three-dimensional item and students showed middle school level reasoning in cause and effect (CC) and planning and carrying out investigations (SEP) as outlined by the NGSS (2013). This question asked students to design an effective experiment to differentiate between the contents in two identical glasses. One glass contained saltwater while the other contained fresh water. Students could not suggest tasting the contents of either glass. To achieve the MDC classification, students would describe an experiment that controlled for relevant variables (SEP) to differentiate between fresh water and a solution (DCI) and correctly attribute causality (CC) to the chosen experiment by explaining the outcome. For a correct score, the DCI and SEP were considered inseparable. 
For example, a response that stated “test the density” was incorrect, because it explained neither an experiment that would do so, nor the expected results. The fourth and final constructed response, item 4: states of matter, asked students to demonstrate their understanding of what causes matter to change states (see APPENDIX 3.D). The question asked students to explain why water in a hot pot would evaporate more quickly than in a pot on the counter. Students used reasoning related to NGSS performance expectations for PS1.A Structure and properties of matter (DCI) and an understanding of energy flow (CC) related to performance expectations of Energy and Matter (2013). For the score to be classified as MDC, students would first demonstrate an understanding that the water evaporated (DCI). Second, students would attribute causality (CC) to the heat transferred from the stove to the water. Measured somewhat differently than other items, correctly reporting either the CC or the DCI was enough for a correct score. 74 Scoring occurred in a two-cycle process. The first cycle involved human scoring while the second cycle used human scores to train the machine algorithm to score. Figure 3.2 shows the flow of the two-cycle process which began with rubric development and ended in a completed algorithm which could instantly score remaining responses. Figure 3.2 Training and Algorithm Development Cycle 1: Human Training and Scoring Ten undergraduates were recruited to score students’ responses. These undergraduates were in their junior or senior years of a natural science major (nine physics students and one biology student) and had completed at least two college-level courses in both physics and chemistry. The raters participated in training sessions and scored in an iterative process that included a cycle of rigorous training sessions, calibration, and revisions for clarification. This was followed by a second cycle that involved calibrating the machine’s scores by providing the human scores and responses. As described above, the process shown in Fig. 2 began with construction of the multidimensional rubrics. When the rubrics were completed, raters were trained, and then proceeded to practice scoring. As training, raters scored a randomized sample of student CRs. The randomization procedure considered students’ knowledge level as well as ethnic, racial, and 75 geographic factors. AACR recommends a minimum human agreement of k = 0.80 before compiling a training set for the machine, so low inter-rater reliability (IRR) meant continued rater training. When the raters achieved a high IRR for all raters from the small practice sets, they next completed “bulk” scoring sets of various sizes. After bulk scoring, additional IRR testing ensured sustained agreement. If the IRR was low on the bulk scoring, raters returned to training. With high agreement for human raters from the bulk scoring, human scores were compiled to create a training set for the AACR algorithm. Because the success of the algorithm may depend on the quality of the data, it is imperative to create a quality training set with high inter-rater agreement. Therefore, it is very important to address issues suspected to reduce reliability in human scoring of multi-dimensional open-ended CR items. The Challenges for Human Raters and How We Addressed Them It became apparent that the rubrics needed to be explicit in what was expected of the student. The first consideration is to list all possible solutions. 
From item 3: properties of solutions, we learned that some issues in scoring arose, unexpectedly, from raters’ advanced knowledge of science and that rubrics needed to state nearly all possible correct experiments. Agreement typically improved with each round of scoring, as new experiments were identified and included in the rubric. The second consideration was to create a hierarchy outlining the importance of each dimension’s contribution to the score. Disagreements arose when rubrics did not explicitly state which dimension was being measured. In scoring item 1: experimental design, for example, raters disagreed on whether stating a direct claim was as important as knowing how to control for variables in the experiment. After the training session where this issue emerged, the human 76 agreement fell from k = 0.71 to k = 0.38 (shown later in Table 3.1). To correct for this in later scoring sessions, we revised the rubric and created a “dimensional hierarchy” for our raters, explicitly stating which dimensions and which specific aspects of each dimension were being measured. This process took too long, however, and due to subsequent changes to the scoring team during the process, the item did not make it to machine scoring during our collaboration with AACR. The third consideration was to carefully weigh the choice to make changes to the scoring team. When the agreement is high, changes can drop that agreement rapidly, but the low agreement can be improved by training new raters on an item. Consistent training and calibration sessions helped to bring new raters into agreement, but new scoring teams did not easily come to agreement with past raters’ scores. Often, this meant re-scoring assignments done by the previous rating team because the machine needed consistent agreement across the entire training set. When possible, maintaining the same scoring team until completion of the training set can help sustain a high inter-rater agreement. Conversely, the low agreement can be improved by introducing the item to a new team if the original team showed a lower than desirable agreement. Bulk Scoring Criteria AACR requested large training sets, with at least one hundred classified responses of each scoring proficiency category for each item. To meet this requirement, raters began to score in larger “bulk” sets. When their agreement was high, raters scored independently, but continued to score some content overlapping between raters. This allowed for continued checks for inter- rater agreement, which was calibrated after each wave of scoring so that raters could discuss disagreements. Scoring continued in this way until raters successfully scored at least 100 responses representing each of the three proficiency categories. Generally, two or three raters 77 were selected to score a given item based on their shared availability and the scoring was assigned to these pairs or triads. The number of responses scored in a wave was determined by the raters’ availability and the number of responses needed to obtain 100 examples of each proficiency category. Cycle 2: Constructing the Training Set In the second cycle, focused on generating a quality training set, we used the consensus, or median, score taken between raters who had scored a response in common. If a response was scored by only one rater, the individual score was considered the median. 
If the median score did not fall into one of the scoring categories (incorrect, correct, or MDC), the response was omitted from the training set and returned to the pool of unscored responses. Due to lower inter-rater agreement on item 3, a triad was used somewhat differently. A rater pair scored all responses in common, and the third rater in the triad acted as a tiebreaker to generate the consensus score. After the bulk scoring process, the human scores and student responses were used to develop training sets for the algorithmic models. For each item, a robust training set was designed to examine key lexical features associated with multi- dimensional reasoning specific to the CR item. The responses and corresponding consensus scores for the three successfully scored items were given to AACR to create a predictive model. AACR developed models specific to each item using a cross- validation approach by using a portion of the scored responses to create the algorithm and reserving the rest to test the model. AACR accomplished this using feature extraction analysis. The AACR Web Portal examined each response in the training set by its lexical features, primarily combinations of a number, “n,” with words called “n-grams” to tune parameters for the algorithm development. 78 To validate the accuracy of the machine algorithms, AACR Web Portal applied a cross- validation approach which was found to be the most effective when compared with split- and self-validation methods (Zhai et al., 2020b). Using cross-validation, the machine first partitioned the data into n subsets, named “n-folds.” A random selection of (n-1) folds of human-scored responses was used to train the machine and develop the algorithmic model, which was then used to score the remaining one-fold student responses. The machine scores of the one-fold of responses were compared with human scores to calibrate the machine- human agreement, which was indicated by parameters such as Cohen’s kappa. The training, scoring, and comparisons were iterated n times so that each fold of data played a role as both the training and testing sets. The average of Cohen’s kappa generated in these processes indicated the accuracy of the algorithmic model. At the same time, the algorithmic model generated a computer confidence parameter which helped to diagnose which specific responses were difficult for the algorithm to score. After receiving the results of the bulk scoring, the scoring team reviewed cases where the human and machine scores disagreed. From this point, the process diverged for the three items for which human raters were in high enough agreement to move on to the machine scoring process. Each question provided unique challenges in developing the algorithmic model. Challenges in the Preparation of Machine Training Data and How We Addressed Them We first considered how to bolster low human-human agreement using tie breakers. Issues arose in reaching high human-human agreement when scoring atypical responses or CRs which were open-ended. For item 3: properties of solutions, we employed a tiebreak method. This method is similar to that used by Haudek and colleagues (2019) where raters trained until achieving a human-human agreement of k = 0.60 or greater then scored responses individually with some responses overlapping between raters. A third rater would break ties in the 79 disagreements. 
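A minimal sketch of this consensus rule, with the median of the raters' scores and a third rater effectively breaking ties, is shown below before turning to Haudek and colleagues' findings; the numeric coding of the categories is assumed for illustration only.

```python
import statistics

# Hypothetical numeric coding of the categories: 1 = incorrect, 2 = correct, 3 = MDC.
VALID_SCORES = {1, 2, 3}

def consensus_score(rater_scores):
    """Median of the raters' scores; a single rating stands as its own median.
    A non-integer median (e.g., a 2-3 split) falls outside the categories,
    so that response is dropped from the training set and rescored later."""
    if len(rater_scores) == 1:
        return rater_scores[0]
    med = statistics.median(rater_scores)
    return med if med in VALID_SCORES else None

print(consensus_score([2, 3]))     # None: returned to the pool of unscored responses
print(consensus_score([2, 3, 3]))  # 3: the third rater's score breaks the tie
```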
Haudek and colleagues' results showed that the machine-human agreement was similar to human-human agreement, and machine-human agreement was higher than human-human agreement for some constructs.

The second consideration was in handling underrepresentation. Human raters examined the scoring discrepancies for lexical patterns, termed "key phrases." Key phrases emerged for some items, but not others. Key phrases (e.g., "velocity" or "relative") were much more apparent in item 2: relative motion. For item 3: properties of solutions, raters coded the key phrases as the experiments used in student responses. Item 4: states of matter showed frequent use of specific terms, but those terms did not seem to impact the machine scoring. Because underrepresentation can be problematic, researchers must remain aware of the potential for responses to be scored incorrectly. When lexical patterns emerged around these errors, it was feasible to predict future discrepancies. In this study, we selected additional responses with key phrases that were less represented in the overall sample but seemed common in the machine-human disagreements. Unscored responses containing these key phrases were then mixed with random responses and added to the next wave of bulk scoring. Where there were not enough of the potentially problematic responses to include more examples in the training set, we called upon human raters to review and score responses where errors were likely.

Data Analysis

To answer the first research question, we reported the human-human agreements, indicated by Cohen's kappa, by wave of scoring for each item. To answer the second research question, we calibrated the machine-human agreements for each item, using Cohen's kappa, and compared the agreements with the corresponding human-human agreements. We also reported the machine scoring accuracy according to the dimensions of science learning. To answer the third research question, we calculated the frequency with which key phrases were used, the frequency with which they were scored incorrectly, and the percentage of the total disagreements they comprised.

Results

Reliability of Human Raters in Scoring the Multi-Dimensional Assessments

Table 3.1 shows the cumulative agreement for human raters over the 8-month scoring period described above. Responses were scored in successive waves. For each wave, the number of responses overlapping between raters to check IRR is shown. Human raters achieved final Cohen's kappa values, after several waves of training, of k1 = 0.67, k2 = 0.80, k3 = 0.64, and k4 = 0.76 for the four items, respectively. According to a criterion proposed by Fleiss (1981), kappa values over 0.75 indicate excellent agreement while values between 0.40 and 0.75 indicate good agreement. According to this criterion, the human rater reliability is excellent for two of the items and good for the others. We also found that human scoring reliability increased with successive calibration training meetings. For the most successful item, item 2: relative motion, agreement increased from k = 0.72 and peaked at k = 0.88 over the successive waves of training and scoring. For three of the four items, agreement between human raters was high enough to move to machine learning during the collaboration with AACR.
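The interrater reliabilities reported in Table 3.1 are Cohen's kappa values computed on the responses a pair of raters scored in common. A minimal sketch of that computation follows; the rating vectors are invented for illustration, and the unweighted form of kappa is used here (the text does not specify a weighting scheme).

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings from two raters on the same ten responses
# (1 = incorrect, 2 = correct, 3 = multi-dimensional correct).
rater_1 = [1, 2, 2, 3, 1, 3, 2, 2, 1, 3]
rater_2 = [1, 2, 3, 3, 1, 3, 2, 1, 1, 3]

print(f"Cohen's kappa = {cohen_kappa_score(rater_1, rater_2):.2f}")
```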
Table 3.1 Human Agreement by Wave

Wave   NOverlap   Wave accuracy   Wave k   NCumulative   Cum. accuracy   Cum. k   NTotal   Note

Item 1: experimental design
1      99         0.81            0.71     99            0.81            0.71     0        Practice
2      50         0.58            0.38     50            0.58            0.38     0        Practice
3*     100        0.84            0.73     100           0.84            0.73     0        Practice
4      25         0.80            0.67     25            0.80            0.67     0        Practice

Item 2: relative motion
1      60         0.83            0.72     60            0.83            0.72     51       Practice
2      60         0.85            0.77     90            0.83            0.74     80       Practice
3      90         0.93            0.88     180           0.88            0.81     439      Bulk scoring
4      30         0.87            0.80     210           0.88            0.81     564      Key phrases
5      0          -               -        210           0.88            0.81     602      Key phrases
6*     33         0.85            0.76     243           0.87            0.80     740      Bulk scoring
7*     75         0.87            0.80     318           0.87            0.80     808      Bulk scoring

Item 3: properties of solutions
1      30         0.77            0.63     30            0.77            0.63     0        Practice
2*     190        0.73            0.56     190           0.73            0.56     0        Practice
3**    70         0.72            0.57     70            0.72            0.57     70       Bulk scoring
4**    400        0.78            0.65     470           0.77            0.64     465      Bulk scoring

Item 4: states of matter
1      30         0.93            0.86     30            0.93            0.86     88       Bulk scoring
2      19         0.79            0.62     49            0.88            0.77     242      Bulk scoring
3*     25         0.85            0.72     74            0.87            0.75     524      Bulk scoring
4*     25         0.89            0.80     99            0.87            0.76     594      Bulk scoring

Note. NOverlap is the number of responses raters scored in common within a wave; the wave accuracy and kappa are computed on that overlap. NCumulative is the total number of jointly scored responses across the combined waves, with the cumulative accuracy and kappa computed on that total. NTotal is the number of total responses scored which can be sent to the machine. A wave number followed by * indicates that this wave, and those following, were scored by a new team, while ** indicates a third rater was added to tiebreak.

For item 2: relative motion, training sets were compiled for the algorithm after wave 3 (bulk scoring), wave 5 (key phrases), and wave 7 (bulk scoring). Drops in the human rater agreements correspond to those time periods between waves of scoring, during which raters were waiting for and analyzing the results of the predictive model. For instance, agreement fell from k = 0.88 to k = 0.80 between waves 3 and 4, where the predictive model was tested. Changes in agreement also sometimes corresponded to considerable changes in the composition of the scoring team. This drop can be seen in the agreement for wave 2 of item 3: properties of solutions, when new scoring members joined the existing team. Although item 1: experimental design was not scored by AACR due to the low human agreement, that agreement is shown to improve after introducing a new rubric to a new scoring team. Between waves 2 and 3, agreement increased from k = 0.38 to k = 0.73 when the item was reintroduced to an entirely new team.

Machine-Human Agreement vs. Human-Human Agreement

Table 3.2 shows the mean score awarded by humans, the mean score awarded by the machine, and the machine-human agreement for each wave of machine scoring. All rounds of scoring achieved fair to good agreement (Cohen's k = 0.64 to k = 0.81), even with as few as 336 responses in the smallest training set. Criteria for machine scoring, proposed by Nehm and Haertig (2012), consider Cohen's kappa between 0.41 and 0.60 as moderate, between 0.61 and 0.80 as substantial, and over 0.80 as almost perfect. According to these criteria, the machine scoring outcomes for the three items were categorized as substantial to almost perfect. As shown in Table 3.2, the agreement between the machine and humans was as high as or higher than the human-human agreement reported for the cumulative waves of scoring for two of the three items (for human-human agreement, see Table 3.1).
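The machine-human kappas in Table 3.2 come from the cross-validation procedure described in the Methods: train a model on n-1 folds of human-scored responses, score the held-out fold, and average Cohen's kappa across folds. A rough sketch of that loop follows, using a generic n-gram classifier as a stand-in for the AACR models; the classifier choice and fold count are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

def cross_validated_kappa(responses, human_scores, n_folds=10):
    """Average machine-human Cohen's kappa over n folds: fit on n-1 folds,
    score the held-out fold, and compare with the human scores for that fold."""
    responses = np.array(responses, dtype=object)
    human_scores = np.array(human_scores)
    kappas = []
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(responses, human_scores):
        model = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                              LogisticRegression(max_iter=1000))
        model.fit(responses[train_idx], human_scores[train_idx])
        machine_scores = model.predict(responses[test_idx])
        kappas.append(cohen_kappa_score(human_scores[test_idx], machine_scores))
    return float(np.mean(kappas))
```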
Table 3.2 Description of Both Human and Machine Scores

Wave (Sample)    Mean human (SE)    Mean machine (SE)    k (SE)
Item 2: relative motion
1 (484)          1.83 (0.03)        1.76 (0.03)          0.78 (0.02)
2 (662)          1.89 (0.03)        1.83 (0.03)          0.78 (0.02)
3 (808)          1.85 (0.03)        1.81 (0.03)          0.81 (0.02)
Item 3: properties of solutions
1 (468)          1.76 (0.04)        1.68 (0.04)          0.69 (0.03)
Item 4: states of matter
1 (336)          2.59 (0.03)        2.60 (0.03)          0.76 (0.04)
2 (594)          2.49 (0.02)        2.51 (0.02)          0.64 (0.03)

Note. Item 1 is not shown because the item did not proceed to machine scoring.

In human scoring for item 2: relative motion, raters returned a cumulative k = 0.80 after all waves of scoring. The selected key phrases did not significantly improve scoring for the second wave of items sent to AACR. Despite adding all examples of the key phrases from the full data set, they still comprised only 0.50% to 8.79% of the responses scored for the final training set. To build the final training set, we added additional responses without a focus on identifying key phrases, for a total of 808 student responses. AACR returned a model with higher agreement with the human raters (k = 0.81) than the final cumulative agreement between humans (k = 0.80).

The lowest human agreement for any item sent to AACR was item 3: properties of solutions. By having a rater pair score all responses in common and using a third rater's scores as the tiebreaker, the algorithm matched the humans more closely (k = 0.69) than the humans agreed with each other (k = 0.64). Although further predictive models were not developed for this item, the method yielded substantial agreement between the machine and human scores, despite lower human-human agreement. For item 4: states of matter, agreement was lower in the second round of scoring (k = 0.64 compared with k = 0.76). Waves of bulk scoring were sent to AACR after waves 2 and 4 of human scoring. Despite adding an additional 258 responses and achieving similar cumulative human-human agreement for the second set (k = 0.77 to k = 0.76), agreement between the machine and humans still fell.

Accuracy of the Machine Scoring Associated with Dimensions of Learning

To better understand the capabilities of the machine scoring algorithm, we compared the machine and human scores by the associated dimensions of learning. Table 3.3 shows the distribution of scoring proficiency classifications for humans on all four items, how the machine classified the same responses, and agreement for the three items that had sufficient human-human agreement to move to machine scoring. The accuracy, or percent agreement, ranges from 28 to 93% for the individual proficiency levels for each item regardless of the associated dimensions, but the machine scored with accuracy greater than 59% for all categories that were well represented in the sample. The lowest overall accuracy (28.57%) and lowest reported certainty of score (0.68) corresponded to the least represented classification, which is the incorrect category for item 4: states of matter. This proficiency level comprised less than 4% of the training sample. For each item, the lowest accuracy and lowest certainty of score correspond to the category with the least representation.
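The per-category accuracy reported in Table 3.3 is simply the share of responses in each human-assigned proficiency level that the machine placed in the same level, which can be read off a confusion matrix. A small sketch, with invented labels and data:

```python
from sklearn.metrics import confusion_matrix

labels = ["incorrect", "correct", "MDC"]

# Hypothetical human and machine classifications for the same responses.
human   = ["correct", "MDC", "incorrect", "correct", "MDC", "correct", "incorrect"]
machine = ["correct", "correct", "incorrect", "correct", "MDC", "MDC", "incorrect"]

# Rows are the human (true) categories, columns the machine categories.
cm = confusion_matrix(human, machine, labels=labels)
per_category_accuracy = cm.diagonal() / cm.sum(axis=1)

for lab, acc in zip(labels, per_category_accuracy):
    print(f"{lab}: {100 * acc:.1f}% of human-assigned cases matched by the machine")
```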
Table 3.3 Human and Machine Percentage of Score, Agreement, Certainty, and Dimensionality

Proficiency level   Item dimensionality (DCI, CC, SEP)   Human %   Machine %   Mean probability (SE)   Accuracy %
Item 2: relative motion (N = 808)
Incorrect           Incorrect                            37.87     39.98       90 (0.00)               93.14
Correct             DCI                                  39.36     38.74       91 (0.01)               86.79
MDC                 DCI + CC                             22.77     21.29       82 (0.01)               78.80
Item 3: properties of solutions (N = 468)
Incorrect           Incorrect or DCI only                50.32     57.63       85 (0.01)               92.31
Correct             DCI + SEP                            23.44     16.56       78 (0.01)               59.63
MDC                 DCI + CC + SEP                       26.24     25.81       80 (0.01)               79.51
Item 4: states of matter (N = 594)
Incorrect           Incorrect                            3.54      2.19        68 (0.03)               28.57
Correct             DCI or CC                            41.75     46.30       83 (0.01)               83.87
MDC                 DCI + CC                             54.71     51.52       87 (0.01)               82.77

Note. Item 1 is not shown because it did not proceed to machine scoring. Mean probability refers to the prediction returned from AACR that a given score was correct. MDC = multi-dimensional correct; DCI = disciplinary core ideas; CC = crosscutting concepts; SEP = science and engineering practices.

Table 3.3 shows that the machine classified student responses with a similar distribution to human scores but with a tendency to score a little lower than human raters. The machine awarded between 0.4 and 3.2% fewer MDC proficiency classifications for each question. This is similarly reflected in Table 3.2, where the mean score awarded by the machine is slightly lower than for humans in nearly all cases. Table 3.3 also shows that for items where all proficiency levels were well represented, the machine scored the incorrect classifications with higher accuracy than other categories. This was not reflected in the machine's predicted certainty, however, and it cannot be asserted that the machine reported the highest confidence in scoring incorrect responses.

For item 2: relative motion, Table 3.3 shows that the machine showed high accuracy in scoring both the correct and MDC proficiency classifications. The use of only the DCI for a correct answer was scored with an accuracy of 86.79%, and MDC responses, which combined the use of the DCI with cause and effect (CC), were scored with an accuracy of 78.80% when compared with human classifications. As shown in Table 3.3, scoring for item 3: properties of solutions was fully three-dimensional. The relative amounts of human and machine scores for two-dimensional (correct) responses were 23.44% and 16.56%, respectively. The machine matched the humans more closely for three-dimensional (MDC) responses, which comprised 26.24% of the human proficiency classifications for this item and 25.81% of machine scores. Item 4: states of matter was scored differently from item 2, as it allowed for the use of either of two different dimensions of reasoning for partial credit. For a correct response, students could attribute the phenomenon to the evaporation of water into smaller particles (DCI) or reason that it was caused by the heat of the stove (CC). Human scorers classified responses as correct 41.75% of the time while the machine used this proficiency level for 46.30% of responses. The machine and humans classified responses with the MDC proficiency level 51.52% and 54.71% of the time, respectively. The correct and MDC responses were both well represented in the sample, and each was scored with accuracy over 82%.

Key Phrases in Scoring Open-Ended Constructed Responses

As demonstrated in Table 3.4, the machine can score open-ended CRs despite the varied language engaged by students.
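The tallies reported for each key phrase in Table 3.4 can be produced with a simple pass over the scored responses; the helper below is a hypothetical sketch (naive substring matching stands in for the raters' actual coding of key phrases, and all inputs are parallel lists).

```python
def phrase_report(responses, human, machine, phrase):
    """Tally how often a key phrase appears and how often it co-occurs
    with a machine-human disagreement."""
    has_phrase = [phrase in r.lower() for r in responses]
    disagree = [h != m for h, m in zip(human, machine)]

    n_phrase = sum(has_phrase)
    n_disagree = sum(disagree)
    n_both = sum(p and d for p, d in zip(has_phrase, disagree))

    return {
        "% of all responses": 100 * n_phrase / len(responses),
        "% scored differently from humans": 100 * n_both / n_phrase if n_phrase else 0.0,
        "% of all machine-human disagreements": 100 * n_both / n_disagree if n_disagree else 0.0,
    }
```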
For item 2: relative motion, the majority of students (56.06%) chose to use common phrases (e.g., "speed") to describe the phenomenon, while a smaller proportion (8.79%) used advanced vocabulary (e.g., "velocity"). Student responses with more advanced vocabulary seemed to be proportionately more represented among those that were scored incorrectly by the machine. For example, 18.31% of student responses that included the formal phrase "velocity" were scored incorrectly, compared with 11.70% of responses that used the word "speed." As shown in Table 3.4, responses containing these key phrases were often characterized either by commonly used phrases with a high prediction of a correct score or by atypical wording accompanied by a low prediction score. Examples of student responses, the human score, the machine score, and the certainty of the machine score for item 2: relative motion can be found in APPENDIX 3.E.

Table 3.4 Key Phrases Associated with the Machine Scoring

Key phrase(s)                        % of all responses   % scored incorrectly   % of all MH disagreements   Mean probability (SE)
Item 2: relative motion (N responses = 808, N disagreements = 102)
Velocity and relative                0.50                 50.00                  1.96                        77 (0.10)
Relative                             0.87                 71.43                  4.90                        81 (0.10)
Velocity or relative without speed   7.18                 20.69                  11.76                       84 (0.09)
Fast                                 7.80                 12.70                  7.84                        87 (0.11)
Velocity                             8.79                 18.31                  12.75                       85 (0.09)
Speed                                56.06                11.70                  51.96                       89 (0.10)
Item 3: properties of solutions (N responses = 465, N disagreements = 87)
Taste                                1.08                 0.00                   0.00                        80 (0.07)
Freeze                               2.15                 10.00                  1.15                        83 (0.11)
pH                                   2.80                 7.69                   1.15                        80 (0.08)
Dissolve                             3.01                 7.14                   1.15                        80 (0.13)
Smell                                6.02                 14.29                  4.60                        83 (0.10)
Mass or weight                       7.74                 16.67                  6.90                        83 (0.09)
Evaporate                            7.96                 21.62                  9.20                        83 (0.11)
Boil                                 10.97                19.61                  11.49                       82 (0.11)
Density                              13.98                13.85                  10.34                       83 (0.10)
Item 4: states of matter (N responses = 594, N disagreements = 111)
Steam                                37.54                16.14                  32.43                       86 (0.10)
Into the air                         15.66                16.13                  13.51                       88 (0.09)
Heat and evaporation                 20.37                14.05                  15.32                       89 (0.09)
Heat                                 21.21                15.08                  17.12                       89 (0.10)
Evaporation                          37.54                15.25                  30.63                       86 (0.10)

Note. Item 1 is not shown because it did not move to machine scoring. Mean probability refers to the prediction made by AACR that the algorithm assigned the same classification as the human raters.

For item 3: properties of solutions, students provided numerous experiments, including boiling or evaporating the water to look for remaining residue. Human raters coded key phrases as words associated with the types of experiments used by students, as shown in Table 3.4 (e.g., evaporate, smell). Item 3 had lower human agreement (k = 0.64) than the other items scored with the machine (k = 0.80 and k = 0.76), but still fell within the boundaries of substantial agreement (k = 0.61-0.81). We found, generally, that the percentages of machine-human disagreements for each key phrase were consistent with the frequency with which they appeared in student responses. For instance, the word "density" was used by 13.98% of students in the sample and accounted for 10.34% of the responses where the machine and humans disagreed. Item 4: states of matter shows a similar trend, where the percentages of machine-human disagreements were also consistent with the frequency with which the phrases appeared in student responses. For instance, "into the air" was used in 15.66% of all responses and comprised 13.51% of all disagreements.
The key phrases selected for this item were used in at least 15% of the responses and were scored incorrectly in similar proportions ranging from 15.08 to 16.14% of their total use. Discussion The performance expectations embodied in NGSS for chemistry and physics cannot be effectively measured unless meaningful and scorable three-dimensional assessments are developed (Cheuk et al., 2019; Pellegrino, 2013; NRC, 2014). Because such assessments require the other dimensions to be used together with scientific practices, they may involve a variety of performance-based tasks, such as writing short answers or drawing, to capture students’ mastery of performance expectations (Pellegrino, 2013; NRC, 2014). The NRC (2014) calls for investment of time and other resources into the development of these new assessments and to facilitate the implementation of these three-dimensional assessments in the classroom; this includes “existing and emerging technologies” that support scoring. In accordance with this initiative, this study built upon prior work of developing multi-dimensional assessments and implementing machine learning approaches for scoring those assessments. 90 This study demonstrated the ability of automated analysis to facilitate the transition to multi-dimensional assessment by showing high agreement between computers and humans when scoring CR items. Machine learning could facilitate scoring three-dimensional assessments more quickly than human scoring alone, allowing teachers and researchers to collect detailed information on students’ knowledge-in-use as recommended by the NRC. The findings have contributed to our knowledge by building on a foundation of research focused on machine scoring, which has previously been applied for scoring key concepts (Nehm & Haertig, 2012) and argumentation (Cheuk et al., 2019), among other purposes. This study has added to the literature surrounding machine scoring of CR by providing a comparison of multiple items, showing how each item was scored by humans and the machine algorithm, and demonstrating how each item aligned to the NGSS performance expectations. Using an objective measure, this study analyzed student responses to determine the use of two- and three-dimensional learning to describe phenomena. Through specially developed rubrics, this study has shown that the machine algorithm could score accurately when students engaged reasoning associated with CCs. All three of the items scored by the machine included a CC to differentiate between correct and incorrect classifications. In fact, the machine was able to successfully classify students’ use of a single dimension or multiple dimensions for each item, with accuracy comparable with the human raters. The machine algorithm also scored similarly to human raters when a correct response included any of multiple possible experiments. It is important to note that it took longer than expected to develop scoring models because this was a training exercise where the items were neither designed to be three- dimensional nor to be scored with machine learning. As such, it required a great deal of thought, training, rubric development, and IRR testing in an iterative process. This has allowed for 91 exploration of very open-ended response items showing that machine algorithms can attain high accuracy on sufficiently large data sets, and even train the machine to correctly classify responses by the associated dimensions. 
Once the algorithm obtained high agreement between the humans and the machine, AACR was able to instantly score the remaining responses. While prior studies have collected evidence that supervised machine learning can be used successfully in automatic scoring, some argue that when presenting the results of machine learning, training sets are not discussed sufficiently in presenting their results (Geiger et al., 2020). Few studies explicitly describe whether and how their assessments and rubrics target the three dimensions of science learning, or how well machine learning classifies student responses into the different categories. Consequently, we have limited knowledge about how machine scoring can support three-dimensional assessment practices. In this study, we have discussed each item used in the analysis, including how the rubrics tap each dimension and the details of the training set. By doing so, we were able to examine the machine’s capacity to classify and score items based on the dimensionality of scientific knowledge employed by students. Consistent with other studies, our results suggest that rater calibration and sample size might be significant factors impacting machine performance. These issues can be mitigated with continued rater training and larger sample sizes or “training sets” for the machine to build its algorithmic models (Balfour, 2013; Cheuk et al., 2019). Additionally, given that the quality of human-scored training data might be critical to machine scoring (Balfour, 2013), we examined how to improve human scoring of multi- dimensional assessments to facilitate machine performance. In this course, successful results emerged from our study that might be valuable for future applications in machine scoring. On scoring these multi-dimensional CRs, we found that the best method to train the raters, among those we tried, was not just to explain the correct 92 answer in the rubric, but to also inform raters which dimensions were being scored. The human scoring or coding of multi-dimensional responses should include explicit rubrics that list all possible solutions and show the hierarchy of importance for the dimensions being measured. Comprehensive rubrics helped humans to score consistently. It is also important to carefully weigh decisions to change the composition of the scoring team. Advanced vocabulary or key phrases (e.g., “velocity”) may increase the challenge for machine scoring, as compared with informal key phrases (e.g., “speed”). We suspect that this finding may be associated with the fact that fewer students used formal key phrases than those who used the informal alternatives. This concept of representation appears again where the algorithm’s lowest agreement to humans coincides with the least represented scoring proficiency. Because machine learning has a difficult time scoring more unique texts (Balfour, 2013), and students could correctly propose many experiments, researchers hypothesized poor results for item 3: properties of solutions. Despite the broad range of possible answers, we obtained good to substantial agreement (k = 0.69) between human raters and the machine. This was similar to the results of Haudek and his colleagues (2019) where they obtained higher agreement between humans and the machine than between human raters for some constructs. The results from this study also show that it is possible to bolster a training set with low human-human agreement through tiebreakers. 
Addressing the complications in building an accurate algorithmic model was beneficial at multiple levels. The search for lexical patterns in scoring discrepancies not only facilitated construction of a more robust model for scoring, but also identified human raters' errors even after sufficient agreement was reached. Even with high human-human agreement, we found that some discrepancies corresponded to human errors. We were then able to address these errors with raters directly to prevent recurrence. Outside the context of this paper, reviewing discrepancies between humans and the machine provided insight into broader vocabulary use and creativity in responses.

Limitations

As exploratory research, this study adopted existing items from a national test and developed multi-dimensional rubrics according to the NGSS to score students' responses. While this study collected evidence indicating sufficient machine capacity for scoring responses according to their dimensionality, there were limitations. Given that the assessment tasks were adopted from a national test that was not originally designed to be three-dimensional, we have only one three-dimensional item while the other three are two-dimensional. Though this study achieved high machine-human agreements for these items, future studies should develop more three-dimensional items and test the machine capacity to automatically score three-dimensional assessments. Because three-dimensional assessments are more complex than two-dimensional assessments, they could be more challenging for the machine to score.

Conclusions and Implications

The benefits of applying the three dimensions of science learning, by incorporating SEPs, DCIs, and CCs, have been proposed in both the NRC Framework (NRC, 2012) and the NGSS (NGSS Lead States, 2013). Realizing these benefits is challenging, however, because most multi-dimensional assessments contain CRs, and scoring CRs is both time and labor intensive. This implies notable efforts for human scoring on behalf of educators, state departments, and researchers. This study found that human experts were able to reliably (Cohen's k > 0.60) score student responses to assessment items with varied dimensions. This study shows that machine scoring was capable of classifying student responses, when measured by the use of the dimensions of learning spelled out by the NGSS, with accuracy that was comparable with human experts. This shows promise for the use of machine learning to facilitate the measurement of in-depth science understanding and to meet the recommendations for science assessment from the NRC. Once assessments are constructed, and a number scored, the remainder of a large sample can be scored almost instantly.

This study indicates that the automated analysis of two- and three-dimensional CR items may be a viable solution for reducing both the financial and time costs associated with measuring in-depth science knowledge, facilitating the gradual shift to follow NRC guidelines. Despite the labor involved in constructing the rubrics, training raters, and developing algorithms, once the models were complete, the machine algorithm could continue to score the remaining students' CRs rapidly. Machine scoring of three-dimensional assessments could have several meaningful impacts on state or national standardized testing and on the monitoring of student performance in the classroom. With coordination, quality three-dimensional assessments could be delivered online and CRs scored almost instantly.
However, given what we have learned, this is a complex process both in terms of identifying dimensionality and building rubrics which can be reliably scored. If it is possible to develop items that are three-dimensional and create rubrics for them, this would be an important contribution to meeting new science education reform efforts. Machine scoring could facilitate the use of more robust measures of students’ understanding through knowledge-in-use assessments. 95 REFERENCES AACR. (2020). September 4, 2020, Retrieved from https://apps.beyondmultiplechoice.org. Balfour, S. P. (2013). Assessing writing in MOOCs: Automated Essay Scoring and Calibrated Peer ReviewTM. Research & Practice in Assessment, 8, 40–48. Cheuk, T., Osborne, J., Cunningham, K., Haudek, K., Santiago, M., Urban-Lurain, M., Merril, J., Wilson,C., Stuhlsatz, M.,Donovan, B., Bracey, Z., & Gardner, A. (2019). Towards an Equitable Design Framework of Developing Argumentation in Science tasks and Rubrics for Machine Learning. Presented at the Annual meeting of the National Association for Research in Science Teaching (NARST). Baltimore, MD. Fleiss, J.L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: John Wiley. ISBN 978–0–471–26370–8. Geiger, R. S., Yu, K., Yang, Y., Dai, M., Qiu, J., Tang, R., & Huang, J. (2020, January). Garbage in, garbage out? do machine learning application papers in social computing report where human- labeled training data comes from?. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (pp. 325–336). Ha, M., & Nehm, R. H. (2016). The impact of misspelled words on automated computer scoring: a case study of scientific explanations. Journal of Science Education and Technology, 25(3), 358–374. Harris, C. J., Krajcik, J. S., Pellegrino, J. W., & DeBarger, A. H. (2019). Designing knowledge-in- use assessments to promote deeper learning. Educational Measurement: Issues and Practice, 38(2), 53-67. https://doi.org/10.1111/emip.12253. Haudek, K., Santiago, M., Wilson, C., Stuhlsatz, M.,Donovan, B., Bracey, Z., Gardner, A., Osborne, J., & Cheuk, T. (2019). Using Automated Analysis to Assess Middle School Students’ Competence with Scientific Argumentation, presented at the Annual Meeting of the National Council on Measurement in Education (NCME). Toronto, ON. Large, J., Lines, J., & Bagnall, A. (2019). A probabilistic classifier ensemble weighting scheme based on cross-validated accuracy estimates. Data mining and knowledge discovery, 33(6), 1674–1709. Lee, H. S., McNamara, D., Bracey, Z. B., Liu, O. L., Gerard, L., Sherin, B., Wilson, C., Pallant, A., Linn, M., Haudek, K., & Osborne, J. (2019a). Computerized text analysis: Assessment and research potentials for promoting learning. Lee, H. S., Pallant, A., Pryputniewicz, S., Lord, T., Mulholland, M., & Liu, O. L. (2019b). Automated text scoring and real-time adjustable feedback: Supporting revision of scientific arguments involving uncertainty. Science Education, 103(3), 590–622. 96 Liu, O. L., Brew, C., Blackmore, J., & Gerard, L. (2014). Automated scoring of constructed response science items: Prospects and obstacles. Educational Measurement-Issues and Practices, 33(2), 19–28. https://doi.org/10.1111/emip.12028. Lottridge, S., Wood, S., & Shaw, D. (2018). The effectiveness of machine score-ability ratings in predicting automated scoring performance. Applied Measurement in Education, 31(3), 215–232. Mao, L., Liu, O. L., Roohr, K., Belur, V., Mulholland, M., Lee, H.-S., & Pallant, A. (2018). 
Validation of automated scoring for a formative assessment that employs scientific argumentation. Educational Assessment, 23(2), 121–138. Mayfield, E., & Rosé, C. (2010, June). An interactive tool for supporting error analysis for text mining. In Proceedings of the NAACL HLT 2010 Demonstration Session (pp. 25–28). Mayfield, E., & Rosé, C. P. (2013). Open source machine learning for text. Handbook of automated essay evaluation: Current applications and new directions. National Academies of Sciences, Engineering, and Medicine. (2019). Science and engineering for grades 6–12: Investigation and design at the center. National Academies Press. National Research Council. (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. National Academies Press. National Research Council. (2014). Developing assessments for the next generation science standards. National Academies Press. Nehm, R. H., & Haertig, H. (2012). Human vs. computer diagnosis of students’ natural selection knowledge: testing the efficacy of text analytic software. Journal of Science Education and Technology, 21(1), 56–73. NGSS Lead States. (2013). Next generation science standards: For states, by states. Washington, DC: The National Academies Press. Pellegrino, J. W. (2013). Proficiency in science: Assessment challenges and opportunities. Science, 340(6130), 320–323. Zhai, X., Haudek, K., Shi, L., Nehm, R., Urban-Lurain, M. (2020a). From substitution to redefinition: A framework of machine learning-based science assessment. Journal of Research in Science Teaching, 57(9), 1430-1459. https://doi.org/10.1002/tea.21658. Zhai, X., Haudek, K., Stuhlsatz, M., Wilson, C. (2020b). Evaluation of construct-irrelevant variance yielded by machine and human scoring of a science teacher PCK constructed response assessment. Studies in Educational Evaluation, 67, 1-12. https://doi.org/10.1016/j.stueduc.2020.100916. Zhai, X., Yin, Y., Pellegrino, J., Haudek, K., Shi., L. (2020c). Applying machine learning in science assessment: A systematic review. Studies in Science Education. 56(1), 111-151. 97 Zhu, M., Lee, H.-S., Wang, T., Liu, O. L., Belur, V., & Pallant, A. (2017). Investigating the impact of automated feedback on students’ scientific argumentation. International Journal of Science Education, 39(12), 1648–1668. 98 APPENDIX 3.A ITEM 1: EXPERIMENTAL DESIGN Table 3.A.1 Item 1: Experimental Design, Text & Rubric Item 1: Experimental Design Text & NGSS Alignment Question Text Meg designs an experiment to see which of three types of sneakers provides the most friction. She uses the equipment listed below. 1. Sneaker 1 2. Sneaker 2 3. Sneaker 3 4. Spring scale. She uses the setup illustrated below and pulls the spring scale to the left. Meg tests one type of sneaker on a gym floor, a second type of sneaker on a grass field, and a third type of sneaker on a cement sidewalk. Her teacher is not satisfied with the way Meg designed her experiment. A. Describe one error in Meg’s experiment. Alignment to the NGSS (2013) Performance Expectations Dimension Grade-Level Performance Expectation DCI CC SEP 6-8 (N/A) 3-5 ETS1.A Defining and Delimiting Engineering Problems (N/A) Planning and Carrying Out Investigations 99 Table 3.A.2 Item 1: Experimental Design, Text & Rubric Item 1: Experimental Design, Student Examples & Rubric Multi-dimensional Correct "Meg’s error is that she is testing three experiments in separate and different settings, allowing the experiments to have different outcomes. 
This stops her from knowing if her other shoes work on a gym floor or grass field or a cement sidewalk." DCI: Student correctly identifies the error in the experimental setup. Correct Incorrect “Meg should have tested the sneakers in the same location for each test." "Meg should’ve used different types of sneakers, not the same." DCI: Student correctly identifies an error in the experimental setup. Provides an incorrect response or irrelevant error in the experimental set-up. & & SEP: Student explains this is a failure to control for variables or that the results cannot be compared. No SEP: Student does not explain that it controls for relevant variables. 100 APPENDIX 3.B ITEM 2: RELATIVE MOTION Table 3.B.1 Item 2: Relative Motion Text and Rubric Item 2: Relative Motion Text and NGSS Alignment Question Text Suppose you are riding in a car along the highway at 55 miles per hour when a truck pulls up along the side of your car. This truck seems to stand still for a moment, and then it seems to be moving backward. A. Tell how the truck can look as if it is standing still when it is really moving forward. Alignment to the NGSS (2013) Performance Expectations Dimension Grade-Level Performance Expectation DCI CC SEP 6-8 6-8 PS2.A Forces and Motion Scale and Proportion (N/A) (N/A) Table 3.B.2 Item 2: Relative Motion Text and Rubric Item 2: Relative Motion Student Example and Multi-dimensional Rubric Multi-dimensional Correct “The truck looks as if it is standing still as both your car and the truck are moving at 55 mph in the same direction." DCI: Student relates the truck’s speed to the speed of the observer. & CC: Student states that equal relative speeds would cause the truck to appear as though it is standing still. Correct Incorrect "It is going 55 miles per hour, which is as fast as the car is going." “the truck looks like it is still because it is losing speed." DCI: Student relates the truck’s speed to the speed of the observer.” & Student provides an incorrect/irrelevant explanation for the phenomena OR only restates the question. No CC: Student does not discuss the visual phenomenon being caused by the relative speeds. 101 APPENDIX 3.C ITEM 3: PROPERTIES OF SOLUTIONS Table 3.C.1 Properties of Solutions Text and Rubrics Item 3: Properties of Solutions Text and NGSS Alignment Question Text Maria has one glass of pure water and one glass of salt water, which look exactly alike. Explain what Maria could do, without tasting the water, to find out which glass contains the salt water. Alignment to the NGSS (2013) Performance Expectations Dimension Grade-Level Performance Expectation DCI CC SEP 3-5, 6-8 6-8 6-8 PS1.A Structure and Properties of Matter Cause and Effect Planning and Carrying Out Investigations Table 3.C.2 Item 3: Properties of Solutions Student Example and Rubric Multi-dimensional Correct "Maria could use two similar cups and weigh them both and the heavier one is saltwater." SEP: Student response describes an experiment that controls for relevant variables. DCI: The experiment iso- lates a measurement that will differentiate fresh water from salt water. CC: Student indicates the expected result that will allow them to differentiate the fresh water and salt water. Correct Incorrect "Maria can weigh the cups that hold the water." "Your body floats easier in salt water." Student response does not describe an experiment that will differentiate fresh water from salt water. SEP: Student response describes an experiment that controls for relevant variables. 
DCI: The experiment isolates a measurement that will differentiate fresh water from salt water. No CC: Student does not indicate the expected result that will allow them to differentiate the fresh water and salt water. APPENDIX 3.D ITEM 4: STATES OF MATTER Table 3.D.1 Item 4: States of Matter Text and NGSS Alignment Question Text Anita puts the same amount of water in two pots of the same size and type. She places one pot of water on the counter and one pot of water on a hot stove. After ten minutes, Anita observes that there is less water in the pot on the hot stove than in the pot on the counter, as shown below. A. Why is there less water in the pot on the hot stove? B. Where did the water go? Alignment to the NGSS (2013) Performance Expectations Dimension Grade-Level Performance Expectation DCI CC SEP 6-8 6-8 PS1.A Structure and Properties of Matter Energy and Matter (N/A) (N/A) Table 3.D.2 Item 4: States of Matter Student Example and Rubric Multi-dimensional Correct “The heat caused it to evaporate.” DCI: Student says the water evaporated. & CC: Attributes this to the heat from the stove. Correct Incorrect “The water evaporated.” “It dried up.” DCI: Student says the water evaporated. OR CC: Attributes this to the heat from the stove. Provides an incorrect or irrelevant explanation. CHAPTER 4: U.S. AND FINNISH HIGH SCHOOL SCIENCE ENGAGEMENT DURING THE COVID-19 PANDEMIC Abstract When the Covid-19 pandemic struck, research teams in the United States and Finland were collaborating on a study to improve adolescent academic engagement in chemistry and physics and to examine the impact of remote teaching on academic, social, and emotional learning. The ongoing “Crafting Engaging Science Environments” (CESE) intervention afforded a rare data collection opportunity. In the United States, students were surveyed at the beginning of the school year and again in May, providing information for the same 751 students from before and during the pandemic. In Finland, 203 students were surveyed during remote learning. Findings from both countries during this period of remote learning revealed that students’ academic engagement was positively correlated with participation in hands-on, project-based lessons. In Finland, results showed that situational engagement occurred in only 4.7% of sampled cases. In the United States, results showed that academic engagement, primarily the aspect of challenge, was enhanced during remote learning. Engagement was in turn correlated with positive socioemotional constructs related to science learning. The study’s findings emphasize the importance of finding ways to ensure equitable opportunities for students to participate in project-based activities when learning remotely. Introduction and Literature Review The 2019–2020 school year brought significant changes to educational systems around the globe when elementary and secondary schools closed suddenly, finding themselves faced with new social distancing guidelines as the world plunged into a crippling pandemic (Meluzzi, 2020). According to the United Nations Educational, Scientific, and Cultural Organization (2020), these closures impacted over 63% of students enrolled in pre-primary through tertiary learning institutions worldwide. Schools closed with little or no notice, leaving parents and educators barely any time to prepare for this new reality. With the shift to full days of remote instruction, teachers and students found themselves adapting to entirely new learning environments.
Unfortunately, more technology does not imply improved educational outcomes (Escueta et al., 2017). With many students on Zoom or other platforms, equitable participation became a serious problem during the pandemic. Domina et al. (2021) showed students’ academic engagement was improved with greater access to technological resources and quality instruction that included socioemotional learning. Globally, however, socioeconomically disadvantaged students are less likely to have the tools they need to participate in remote instruction (Meluzzi, 2020). Students in lower-income schools were shown to be less engaged with their schoolwork than their same aged peers with greater access to resources (Hopkins et al., 2021). Learning losses are, in turn, shown to vary with academic engagement and access to school supplies or technology necessary for participation (Dorn et al., 2020). The inability of some families and schools to provide the financial and material support for experiences that would normally be provided in a science classroom means existing gaps hindering equitable participation are exacerbated by the pandemic. In the United States, for 105 example, nearly one third of students were unable to participate in remote learning during the first wave of the pandemic (Meluzzi, 2020), placing additional stressors on students and their families. Not surprisingly, the pandemic has coincided with an increase in student anxiety, which inhibits students’ ability to engage with their online classrooms (Yang et al., 2020). High school students are experiencing stressors of the pandemic, many of which limit the coping mechanisms teenagers usually employ to deal with the normal stressors associated with being in high school. Survey data has shown students reported feeling disinterested, bored, and socially isolated when spending long hours in virtual classes; parents also have expressed similar concerns about their children’s academic learning and well-being (Kaufman et al., 2020). Recent literature shows the pandemic has caused difficulty in attaining academic engagement in remote classrooms with many teachers reporting the need for additional resources to do so (Trinidad, 2021). Literature regarding learning during the Covid-19 pandemic suggests that students’ academic engagement may differ when content is delivered remotely as compared to learning in a classroom environment. In a joint project of two countries, the United States and Finland, we aim to better understand academic engagement and its correlation to various activities assigned in science classes during remote learning due to the pandemic. A Two-Country Intervention Facing a Pandemic The sudden transition to remote teaching occurred during an ongoing collaborative intervention, “Crafting Engaging Science Environments” (CESE). Funded by the National Science Foundation and Academy of Finland, CESE brought together a team of learning scientists, science education researchers, psychologists, sociologists, and teachers. They designed an intervention that supported students’ academic engagement and impacted not only their academic learning, but also their social and emotional learning (Schneider et al., 2020). Based on 106 the principles of project-based learning (PBL) that support student experiential activities in “figuring out” phenomena (Krajcik & Shin, 2014), the unifying theme that motivated the CESE intervention was to improve engagement in physics and chemistry courses for high school students in grades 10 through 12. 
Throughout multiple years of this collaboration, a series of questions related to the conceptualization of engagement and its impact on social and emotional learning remained a continual focus. During the 2019–2020 school year, CESE was in its first-year efficacy trial in Finland. In the United States, CESE was undergoing a maturation study. Having shown promising results for the learning outcomes of treatment students in the previous year (Schneider et al., 2022), the goal was to determine whether teachers in their second year of teaching the project-based learning intervention would show greater impacts on learning outcomes than teachers implementing the lessons for the first time. Although the pandemic ended the ability to study students’ engagement during hands-on lessons, a unique opportunity arose to study academic engagement in this remote learning environment. U.S. and Finnish Government Responses CESE’s shift in focus to studying academic engagement during remote science classes occurred within two contrasting national contexts, and the two countries differed in how they managed the transition. The relative populations of the United States (over 330 million) and Finland (over 5.5 million) affected each country’s response to the pandemic. Finland was able to centralize its decision-making, given its smaller population. Centralization in the United States was more difficult, not just because of its larger population but also because each of the 50 states has the right to control its own schools. When U.S. schools closed, teachers in the CESE study reported guidelines for learning and instruction that differed among states, districts, and even schools. In the United States, the movement to remote instruction began in mid-March, which coincided with spring break in many districts. Assuming social distancing measures would be brief, some schools simply extended the spring break. Awaiting guidance from the state or federal governments meant many classrooms did not make this transition until April. Consistent with the findings of a 2020 study from Reich et al., policies differed at multiple levels of decision making; some CESE teachers reported that their schools required all teachers to use the same curriculum and had strict protocols for contacting parents, while other teachers reported that their schools entrusted teachers with all instructional and logistical decisions. Schools, teachers, and students had limited familiarity with remote learning. Orchestrating an equitable learning environment in which all students, especially those in low-income families and communities, were equipped with computers and internet access was a monumental undertaking. When districts were unable to provide equipment for every student, they had to develop alternative methods that assured equitable learning experiences. Planning and organizing some form of high-quality remote instruction for the 50 million primary and secondary students in U.S. public schools became exceedingly challenging. Despite the confusion surrounding these unprecedented changes, some national studies suggest that the majority of teachers and administrators reported effective communication from their districts regarding policy changes and that instruction was supported through relevant professional development opportunities (Kraft & Simon, 2020).
Consistent with these findings, teachers in the CESE study noted that their districts provided professional development opportunities focused on adapting their pedagogy with a variety of online tools. 108 In contrast to the United States, Finland’s government was able to quickly decide that all students in the country would transition to remote learning. All schools were closed from March 18th until May 13th. For students in upper secondary schools, vocational training institutes, tertiary, and other educational institutions, the government recommended continuing distance teaching until the end of the semester. The government worked with the schools to ensure that all students and teachers had access to computers, wi-fi, and instructional guidance for students and staff (The Finnish National Agency for Education, 2020). Several educational platforms were already in place and allowed educators to provide feedback, assign homework, and communicate with parents and students through the pandemic. According to the Finnish Teachers’ Union Survey, about half of the teachers reported having sufficient pedagogical and digital competence for teaching during the remote learning period; a similar proportion of students claimed the change to distance learning went well. Only a small number of students had difficulties due to insufficient equipment or lack of skills for distance learning. Overall, Finnish students and teachers both reported having considerable experience with digital skills such as how to use computers, how to find information, and using reputable sources to “fact check.” These survey reports are consistent with several other international studies which have shown that Finnish students are among the most well-prepared to work online compared to students in other industrialized countries (see, Fraillon et al., 2019, International Education Association (IEA) International Computer and Information Literacy Study). Not surprisingly, Finnish teachers and students reported feeling quite confident about their abilities to succeed in distance learning and had the skill sets to do so. 109 Studying Engagement Even before the pandemic, students’ academic engagement in science was a deep concern globally, and several major reports by OECD (2019) connected engagement with interest in science and its attractiveness as a career option. One of the major challenges of CESE was to theoretically describe and measure academic engagement in science classes. Part of the problem was that while there was consistent agreement that academic engagement varied over time, what engagement meant in various contexts was viewed differently. One of the first considerations was how to specify academic engagement and what constructs should be used to define it (see, Hidi & Renninger, 2006; Schneider et al., 2016). CESE views academic engagement in science as being comprised of interest, skill, and challenge (Schneider et al., 2016): and not all activities are likely to have the same effect on students’ social and emotional or academic learning (Inkinen et al., 2020). Recognizing the difficulty of trying to define academic engagement without specifying when it occurs misses the ability to identify when students are feeling interested, skilled, and challenged in what they are doing. The approach identifies these three constructs as critical for enhancing students’ academic engagement which are grounded in psychological literature. 
Interest is the psychological predisposition for a specific activity, topic, or object; skill is the mastery of a set of specific tasks; and challenge is the willingness to take on a difficult, somewhat unpredictable course of action. When students report high interest, skill, and challenge, they are considered to be engaged. Situational Engagement/Optimal Learning Moments The primary focus of the CESE study is situational engagement. When measured in the moment, instances of situational engagement are considered optimal learning moments (OLMs), which are situationally specific times when a student is so deeply engrossed in a task that it feels as if time flies (Schneider et al., 2016). During those times, students tend to be concentrating and feeling in control (see Salmela-Aro et al., 2016). This idea is similar to how Csikszentmihalyi (1990) describes flow as being completely immersed in an activity. For this study, we consider OLMs to be situations that elevate students’ academic engagement and are positively related to social and emotional learning. Our research shows that OLMs occur about 15–20% of the time in science lessons (Inkinen et al., 2020; Schneider et al., 2016). Our interest is to examine how often they occur when students are learning remotely. Academic Engagement and Social and Emotional Learning Experience Researchers have recently begun to examine the relationships between academic engagement and other factors, including social–emotional skills and learning experiences, as they mutually reinforce one another (Salmela-Aro & Upadyadya, 2020). The current study applied the OECD framework on social and emotional constructs, which includes maintaining positive emotions, managing social relationships, and sustaining goal pursuits. Some key elements are optimism (ambition or future importance), persistence, curiosity, social interaction, and self-efficacy (Kankaras & Suarez-Alvarez, 2019). In this study, we aimed to understand social–emotional experiences and their relationship to academic engagement during remote learning. One important social–emotional learning experience is persistence, also referred to as grit (see Duckworth, 2016; Tang et al., 2019), which refers to how long students stay with a task or learning assignment without giving up. In relation to our idea of academic engagement, it is important to determine whether students “give up” in more challenging situations and whether “grit” acts as a buffer, inspiring students to persist in the task at hand (Salmela-Aro & Upadyadya, 2020; Tang et al., 2021). “Grit” has particular importance to Finnish society, as it has been associated with the term “sisu,” which can be translated as “determination to overcome adversity” and is a hallmark of the Finnish perception of their national character. Another social–emotional concept related to academic engagement is curiosity, defined as the desire for knowledge or information, which has been found to be associated with question-asking, exploration behaviors, and achievement (Hidi & Renninger, 2006). Recently, curiosity, the epistemic emotion that triggers interest, has also been highlighted in the engagement process (Hidi & Renninger, 2019). Thus, the role of curiosity in OLMs is important to investigate. When students are engaged in learning, we expect them to feel that their science work is related to their future education goals (Schneider et al., 2016). U.S.
students who participated in the CESE intervention during the previous year’s efficacy trial showed increased educational ambitions (Schneider et al., 2022). During a pandemic, however, uncertain futures might weaken educational goals. Alternatively, science is now trending heavily in the news which could foster students’ curiosity and engagement with it, in turn strengthening their educational ambitions. Current Study When the pandemic struck, CESE teachers in both countries were unable to teach the intervention. Although CESE was rendered unable to study students’ academic engagement during these project-based lessons in a science classroom, a unique opportunity arose to study engagement in relation to the activities assigned in this remote learning environment. As mentioned above, comprised of three components (challenge, interest, and skill), it was possible that academic engagement could increase or decrease based on these subsequent parts. If students felt more challenged in the remote environment or more interested in science due to ties to current events, it could increase their academic engagement. In Finland, CESE researchers 112 sought to understand situational engagement during remote lessons. Without the clear project- based learning features, would students be less engaged during remote lessons? In the United States, where the transition to remote teaching varied greatly among schools and districts, materials distribution to study situational academic engagement was not possible. Instead, students were surveyed during the pandemic, and results from those who responded were compared to their responses collected earlier in the year. In Finland, the same students were not tracked over the course of the school year due to the semester-by-semester class structure of Finnish high schools. Here, the more organized shift to remote teaching made it possible to provide students with a diary to collect data about their engagement in the moment. Despite their differences, the study teams were able to collect similar data related to academic engagement in science. Guiding the investigation at this unprecedented time are three major questions: (a) how engaged were students in their remote science classes? (b) how engaged were students in their specific learning activities during remote learning? (c) how was academic engagement related to social and emotional learning experiences during remote learning? This is not a comparative study; rather, its intent is to underscore similarities and differences in students’ academic engagement when in remote science classroom environments. U.S. and Finnish Samples Methods The ongoing work in both countries allowed CESE to survey and interview the students and teachers when they were participating in their remote science classes. To measure academic engagement, we have deliberately selected academic and social–emotional constructs that both countries have investigated with their own respective student populations with some variations and differences in instruments which are explained below. Both countries used self -report 113 surveys to collect information from high school students enrolled in chemistry and physics courses. In the United States, data were obtained from students in fall 2019 and then again during the pandemic, allowing for comparisons of attitudes and perceptions from in-person to remote experiences. In Finland, however, data were only obtained in the spring. U.S. 
Sample During the 2019–2020 school year, the United States was in a maturation study in which both treatment and control teachers from the previous year taught the intervention, for their second and first years, respectively. At the beginning of the school year, 4,954 high school students from 86 physics and chemistry classrooms completed a background survey as part of the CESE intervention. In the United States, the expected age range of students in grades 10 through 12 is between 14 and 17 years old. In the CESE study, 95% of students fell within the anticipated age range and 98% were in grades 10 through 12. We initially anticipated that our response rate would be similar to the analytic sample from the efficacy trial in the previous year. Given the pandemic, however, it became apparent that we would not get anywhere near that percentage of respondents. Taking into account attrition rates from the previous year, teachers’ reports of low attendance after the shift to remote learning, the loss of one highly populous district, and district policies that gave little incentive for students to attend remotely, we expected to receive less than 20% (n = 879) of our original 2019–2020 sample. When surveyed again in Spring 2020, 922 students replied to the exit survey. Of the students who responded to the exit survey, 81% (n = 751) had completed both a background and exit survey in the participating regions. These students were retained in the analytic sample, of which 55.53% were female; 22.64% white (non-Hispanic); 40.21% Hispanic; 5.73% Black; 6.39% Asian; 17.04% multiple race/ethnicities; less than 1% other; and 7.46% did not provide information about their race or ethnicity. Finnish Sample When the pandemic struck, the Finnish team reached out to six of the nine participating teachers involved in the efficacy study regarding the possibilities of collecting data while students were learning remotely. All six teachers agreed to the study, which included 203 students (97 males, 103 females, 3 who preferred not to answer; all within the range of 16–18 years old). It is important to note that the number of students included in this study is relatively small, but all were in grades 10 through 12 and participating in the CESE intervention. Measures U.S. Measures In the United States, the same students were followed through a full school year and surveyed in fall 2019 and again in spring 2020, after moving to remote learning. Students were asked to respond to questions about their demographic data, such as race, gender, grade point average, and attitudes toward science. In both fall and spring, students were asked about their interest, skill, and challenge in their physics or chemistry class. Students reported how much they agreed with the following statements on a four-point Likert scale: I am interested in science; I feel skilled in science; and I find science challenging. To measure students’ academic engagement, we calculated the mid-point for each of the three categories (i.e. interest, skill, and challenge). Academic engagement is the binary outcome indicating that a student reported scores above the midpoint (i.e. 3 or 4) for each of the three categories. We then compared their responses from fall to spring after the shift to remote learning. Students also reported on how frequently they participated in a number of activities when in a remote classroom and how interested they were in each activity.
These activities included: discussion boards; one-on-one video chats with the teacher; watching videos of experiments; online simulations; live lessons; recorded videos of lessons; using textbooks; writing papers; building models at home; making presentations using slides or power-point to share with the class; text-based instruction; working in groups through video chat; and experiments to try at home. Students’ reported interest in each activity was ranked and compared to that activity’s reported frequency in the classroom. Frequency was measured on a 5-point scale: we do not do this activity (0); less than once every 2 weeks (1); once every 2 weeks (2); once per week (3); or every day (4). Interest was reported on a four-point scale ranging from “this does not interest me (1)” to “this interests me a lot (4).” On the fall and spring surveys, students responded to questions specifically related to project-based learning tasks associated with the CESE intervention. These activities were included to determine whether students were still able to engage in the same project-based tasks fundamental to the intervention. These questions were altered to better suit the situation of remote learning when administered on the second survey, and additional questions related to modelling were added. In fall, these items were measured on a scale similar to the one used for the online-specific activities, with students reporting the frequency with which they performed certain activities: never or almost never (1); once every month (2); once every 2 weeks (3); once per week (4); or more than once every week (5). These activities included several different types of modelling activities, opportunities to take pride in their achievements, ask questions in class, discuss phenomena, work together to solve problems, and generally present their findings, as well as the frequency with which students performed “science and engineering practices like taking measurements to collect data about the world around us and using evidence to make a claim.” To understand the relationship between social and emotional skills and students’ academic engagement, college ambition was considered a behavioral measure of persistence (or grit). On both the fall and spring surveys, students were asked to report their educational goals regarding how far they expect to go in school, including: I do not know how far I will go; less than high school; graduate from high school but not go any further; go to a vocational, trade, or business school after high school; graduate from a two-year college (Associate’s Degree); graduate from a four-year college (Bachelor’s Degree); Master’s Degree or equivalent; and Ph.D., M.D., or other advanced professional degree. A binary variable was created to indicate whether or not a student reported plans to attend at least 4 years of college. Finnish Measures In Finland, the CESE study focused on situational engagement, and researchers did not follow the same students through an entire school year. During the pandemic, students were asked to complete two surveys. The first survey focused on their general feelings and experiences in remote learning during the pandemic. The second survey asked students to report their real-time feelings and experiences using a diary format, the experience sampling method (ESM), for which they answered short surveys in the moment, during their remote lessons. Situational engagement consists of high levels of interest, skill, and challenge.
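Because both countries reduce the three ratings to a single binary indicator (a response counts as engaged only when interest, skill, and challenge are all rated 3 or 4), the recoding can be illustrated with a minimal sketch; the column names and values below are hypothetical, not study data.

```python
# Minimal sketch of recoding three Likert items into one binary engagement indicator.
# The data frame values and column names are hypothetical, not study data.
import pandas as pd

responses = pd.DataFrame({
    "interest":  [4, 3, 2, 4, 1],
    "skill":     [3, 2, 2, 4, 3],
    "challenge": [3, 4, 1, 3, 2],
})

items = ["interest", "skill", "challenge"]
# Engaged (1) only when all three items are rated 3 or 4; otherwise 0.
responses["engaged"] = (responses[items] >= 3).all(axis=1).astype(int)

print(responses)
print(f"Share of engaged responses: {responses['engaged'].mean():.1%}")
```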
Students reported in the ESM survey their momentary interest (Are you interested in what you are doing?), skill (Do you feel skilled at what you are doing?), and challenge (Do you feel challenged by what you are doing?) on a four-point scale: not at all (1); a little (2); much (3); and very much (4). Academic engagement was measured similarly in both countries. Students were considered situationally engaged (i.e. experiencing an OLM) if their responses were 3 or 4 to all three questions. A binary variable of 1 or 0 was generated to indicate whether this was an OLM or not. The ESM survey asked students to report their practices when they received the survey. They could choose from: following teacher’s instruction, doing tasks independently, studying from books, studying from a website, writing, discussing online, making videos, asking questions, developing a model, using a model, planning an investigation, conducting an investigation, analyzing data, solving math problems, constructing an explanation, using evidence to make an argument, evaluating information, and other. Students chose all practices that applied to them. These options were recoded as dichotomous variables for the analysis (practice was reported (1) or not reported (0)). The ESM also surveyed students’ social and emotional experiences in real time on a four-point scale: not at all; a little; much; or very much. We focused on students’ remote learning experience regarding their belief that the material had importance for their future; feelings of loneliness; boredom; confidence; curiosity; and grit, as these have been highlighted in the OECD social and emotional skills frameworks (Kankaras & Suarez-Alvarez, 2019; Salmela-Aro & Upadyadya, 2020). In total, these ESM surveys produced an average of 3.49 responses per student, for a total of 701 situational responses. The general survey asked students to report how often they engaged in the following science practices on a four-point scale: never or hardly ever (1); some hours (2); most hours (3); and in all classes (4). The practices were: have opportunities to explain my own thoughts; plan how to study; do practical tests; draw conclusions from experiments or research; apply concepts related to everyday problems; participate in debate or discussion; follow the teacher’s demonstrations; do experiments as instructed; follow the teacher’s teaching or example in remote learning; view a video or animation; do assignments independently; study a book; study a website or e-learning platform; make notes or summaries; share documents with other students; present my output in a video conference (Zoom, Meet, Skype, etc.); write a joint document with another student; chat online; create videos or animations; do experimental research with tools found at home; ask for advice from another student; help another student; and get feedback from the teacher that promotes learning. Results U.S. Results How Engaged Were U.S. Students When Attending Courses Remotely? Despite the many changes to instruction, U.S. students were more likely to report academic engagement after participating in the CESE intervention. Table 4.1 shows the change in the odds that a student reported above-midpoint scores for interest, skill, and challenge, and for the engagement variable, meaning they reported high scores for all three.
As shown in Table 4.1, when surveyed in Spring 2020, students showed a strong increase in their science interest and in the level of challenge they felt in their remote physics or chemistry class as compared to earlier in the school year. Students were 4.24 times more likely to report high levels of interest and 7.36 times more likely to report high levels of challenge. Students were only 1.53 times more likely to report high levels of skill during the pandemic; this change was significantly smaller (p < 0.001) than the changes for either of the other two questions. The increase in all three categories resulted in students being 9.24 times more likely to be engaged.

Table 4.1 Changes in U.S. Students' Academic Engagement During the 2019–2020 School Year

                  β         SE(β)   OR (e^β)
High interest     1.44***   0.13    4.24
High skill        0.42**    0.14    1.53
High challenge    2.00***   0.12    7.36
Engagement        2.22***   0.19    9.24

Note. High Interest, High Skill, and High Challenge are binary variables indicating a student reported a 3 or a 4 on the scale. This table shows the change in the log odds of a student reporting high measures of these variables from fall to spring in the 2019–2020 school year. *p < 0.05. **p < 0.01. ***p < 0.001.

What Kinds of Activities Were U.S. Students Participating in During Remote Teaching? Figure 4.1 shows the frequency at which U.S. students reported specific class activities and how interesting they found these experiences during the pandemic. The most frequent activities used remotely were videos of experiments, online simulations, text-based instruction, and discussion boards. Students reported watching recorded videos of lessons more frequently than attending live lessons. Using textbooks, building models at home, and making presentations were the least frequently used. While performing experiments at home was one of the top interests reported by students, the frequency at which it occurred was low. Students found writing papers, using Google Slides or PowerPoint to make presentations, and using textbooks among the least interesting online activities.

Figure 4.1 U.S. Students' Frequency and Interest in Online Learning Activities

How Did These Tasks Relate to Students' Academic Engagement? Table 4.2 shows the impact of each predictor on student academic engagement (high interest, skill, and challenge) in its own logistic regression model, due to strong correlations in the frequency of activities assigned during remote learning. Each model controls for engagement at the beginning of the school year, race, and gender. Students were clustered by school and classroom to account for variance that might occur due to school policy or teachers' familiarity with teaching online. When surveyed before the transition to remote learning, only 5 students reported academic engagement for every 100 who did not. By spring, the odds of reporting engagement had risen to as high as 48 students reporting engagement for every 100 students who reported not being engaged. As shown in Table 4.2, when controlling for demographic data and the academic engagement (high interest, skill, and challenge) measure recorded in fall, there were strong significant correlations between academic engagement and the frequency of most of the project-based activities related to the CESE intervention.
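As a concrete illustration of the modelling approach just described, the sketch below fits one per-activity model: a logistic regression of the spring engagement indicator on the frequency of a single activity, controlling for fall engagement, gender, and race, with standard errors clustered on classroom as a simple stand-in for the multilevel structure. This is a minimal sketch using statsmodels on synthetic data, not the dissertation's exact specification; all column names are hypothetical.

```python
# Minimal sketch of one per-activity logistic regression with cluster-robust SEs.
# Data are randomly generated for illustration only; column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "engaged_spring": rng.integers(0, 2, n),   # binary engagement outcome in spring
    "engaged_fall": rng.integers(0, 2, n),     # binary engagement reported in fall
    "activity_freq": rng.integers(0, 5, n),    # 0-4 frequency scale for one activity
    "female": rng.integers(0, 2, n),
    "race": rng.choice(["white", "hispanic", "black", "other"], n),
    "classroom": rng.integers(1, 40, n),
})

model = smf.logit(
    "engaged_spring ~ activity_freq + engaged_fall + female + C(race)",
    data=df,
).fit(disp=False, cov_type="cluster", cov_kwds={"groups": df["classroom"]})

# Exponentiating the activity coefficient gives the odds ratio for a
# one-unit increase in activity frequency.
print(model.params)
print("OR for activity frequency:", np.exp(model.params["activity_freq"]))
```

Exponentiating each activity's coefficient in this way yields odds ratios of the kind reported in Table 4.2.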
The highest correlations to academic engagement were found in the frequency of equations modelling and participation in science and engineering practices, showing students to be 1.29 to 1.30 times more likely to report engagement. All types of modelling, except building models at home, were positively correlated with academic engagement, and students were between 1.17 and 1.30 times more likely to report engagement for each unit increase in frequency. Additionally, students who reported more frequent opportunities to take pride in their science achievements were 1.26 times more likely to report engagement with each unit increase in frequency. Many of the activities specifically related to remote teaching were not significantly correlated with academic engagement, and the highest correlations again corresponded with more project-based tasks. For example, the odds of a student reporting engagement were 1.18 times higher with more frequent at-home experiments during remote learning and 1.20 times higher with each increase in the frequency of building presentations with Slides or PowerPoint. Despite listing textbook use as uninteresting, students were 1.19 times more likely to report being engaged with each increase in frequency of reported use. Building models at home is not shown in this table because it was not significantly correlated with academic engagement and the logit model did not converge when controlling for other factors.

Table 4.2 Logistic Regression Coefficients for Each Activity on Academic Engagement

Activity                                        Correlation coefficient   Logit regression coefficient (β)   SE(β)   Odds ratio (e^β)
Build models                                    0.12**                    0.20*                              0.09    1.23
Class discussions about phenomena               0.13***                   0.22**                             0.08    1.25
Computer modelling 1                            0.13***                   0.20**                             0.07    1.22
Computer modelling 2                            0.11**                    0.15*                              0.06    1.17
Draw visual models                              0.10*                     0.16**                             0.05    1.18
Equations modelling                             0.15***                   0.27**                             0.10    1.30
Opportunities to ask questions                  0.05                      0.11                               0.12    1.11
Opportunities to take pride                     0.14***                   0.23**                             0.08    1.26
Present their findings                          0.13***                   0.21**                             0.07    1.23
Science and engineering practices               0.15***                   0.26***                            0.07    1.29
Work together to understand phenomena           0.11**                    0.17**                             0.06    1.18
Items specifically related to remote teaching
Discussion board                                0.06                      0.08                               0.05    1.09
Experiments to try at home                      0.10**                    0.17**                             0.06    1.18
Live lessons                                    0.08*                     0.11                               0.06    1.12
One-on-one video chat with teacher              0.10**                    0.15**                             0.05    1.16
Online simulations                              0.07                      0.13                               0.07    1.14
Presentations using slides or power-point       0.10**                    0.18*                              0.07    1.20
Recorded lessons                                0.02                      0.02                               0.04    1.02
Text-based instructions                         0.08*                     0.13                               0.08    1.13
Textbook use                                    0.10*                     0.18**                             0.05    1.19
Watching videos of experiments                  0.07                      0.14                               0.08    1.15
Working in groups through video chat            0.09*                     0.14                               0.07    1.15
Writing papers                                  0.09*                     0.15**                             0.04    1.16

Note. Due to the high correlations between activities, each activity was run in its own logistic regression model. The coefficients represent the impact of the activity on engagement when controlling for prior engagement, race, gender, and variance at the school and classroom levels. *p < 0.05. **p < 0.01. ***p < 0.001.

How Did Engagement During the Pandemic Impact Students' Future Aspirations? In order to understand students' persistence during the pandemic, we explored the changes students made to their educational plans by comparing their responses in spring to those from the beginning of the year using the binary college indicator.
Prior to the shift to remote instruction, 68.66% of students planned to attend four or more years of college. When measured again during remote learning, the number of students planning to attend college or graduate school increased significantly (p < 0.05), with the odds of a student reporting plans to attend college or graduate school rising from 2.19 to 2.6. Because the GPAs of students who reported they "do not know" were more similar to students planning 2 to 4 years of college than to those who planned to attend trade school or no post-secondary education, we anticipated that much of this change would come from students affirming their plans for college. When omitting students who reported they did not know their plans on either the background or exit surveys, there were no significant differences from the beginning to the end of the year. Additionally, fewer students reported not knowing their plans during the pandemic than when surveyed at the beginning of the year (p < 0.001). To see how academic engagement impacted our measure of student persistence, we next used a two-level logistic regression, again accounting for variance between classrooms, with the binary college ambition indicator as the outcome. As shown in Table 4.3, when controlling for race and gender, we found that both GPA and academic engagement had significant correlations with plans to attend four or more years of college during the pandemic, even when controlling for previous ambitions. Students who reported being engaged in their science courses during the pandemic were 2.19 times more likely to report plans to attend college or graduate school. The odds of reporting plans to attend college or graduate school were also 1.8 times higher for each unit increase in grade point average. Teacher-level random effects were non-negligible.

Table 4.3 U.S. Students' Plans to Attend Four or More Years of College and Engagement

                                                         β         SE(β)   OR (e^β)
Previous plans to attend four or more years of college   3.56***   0.56    35.01
Female (male comparison)                                 −0.70     0.56    0.50
Race (White non-Hispanic comparison)
  Hispanic                                               −0.41     0.64    0.66
  Black                                                  0.69      0.72    1.99
  Other^a                                                −3.00*    1.40    0.05
  Asian^b                                                0.00      0.00    1.00
  Multiple                                               0.78      0.59    2.18
GPA                                                      0.59*     0.24    1.80
Academic engagement during pandemic                      0.78**    0.30    2.19

Note. The analytic sample for this table is students who reported their educational ambitions both before and during the pandemic. *p < 0.05. **p < 0.01. ***p < 0.001. ^a Only three students in the final analytic sample listed their race as Other; one of those three students selected a lower level of education. ^b All students who listed their race as Asian reported plans to attend four or more years of college both before and during the pandemic.

Finnish Results How Did Finnish Students Engage Situationally During Remote Teaching? Nearly half of students indicated interest in their science activities (44.5%); however, only 29% of experiences were identified as leaving students feeling skilled, and just over one third (34.2%) were identified as challenging (see Table 4.4). When OLMs were calculated from these three measures, only 4.7% of science experiences were engaging moments. The mean for academic engagement was only 0.05 (on a 0–1 scale).
Table 4.4 Finnish Students' OLM Situational Engagement During the Pandemic

                                 N     M      SD     Percentage of occurrence (%)^a
Interest                         701   2.49   0.71   44.5
Skill                            696   2.18   0.73   29
Challenge                        700   2.31   0.71   34.2
OLM, situational engagement^b    701   0.05   0.21   4.7

Note. ^a Occurrence is defined as choosing 3 (much) or 4 (very much) on the scale. ^b Situational engagement is defined as the joint occurrence of interest, skill, and challenge.

What Kinds of Activities Were Finnish Students Participating in During Remote Teaching? To understand the learning activities students participated in while learning remotely, we first summarised the mean level of each of the learning activities reported in the general survey (see Figure 4.2). Differences in frequency were examined using a one-way analysis of variance (ANOVA) (F = 158.5, df = 22, p < 0.001). The most often mentioned activities were following the teacher's instruction, following demonstrations, and doing independent assignments. The least mentioned activities were doing tests, sharing documents, presenting, making a video or animation, and doing experiments at home. Discussion and interaction with peers and teachers (e.g. online chat, helping each other) were mentioned at a moderate level. Studying from a website and viewing videos and animations were common activities during the pandemic, though their frequencies were lower than teachers' direct instruction and independent work.

Figure 4.2 Frequency of Learning Activities from General Survey

Students' real-time situational learning activities were also compared using chi-square tests. When pooling the data, significant differences were found among these activities (χ2 = 4085.6, df = 17, p < 0.001). Following teachers' instruction, doing tasks or assignments independently, studying from books, and solving mathematical problems were more represented than other activities. Real-time situational learning activities were then divided into three groups based on their reported frequency in the 701 situational responses: activities that occurred more than 50% of the time; those that occurred from 49% to 11% of the time; and those that occurred less than 10% of the time. Using cross-tabulation analysis, we found that more frequently employed activities, such as following teachers' instruction, which happened over 50% of the time, were less successful in facilitating academic engagement (adj. residual = −2.47) than those activities classified as medium or low frequency (see Table 4.5). We then compared the level of interest, skill, and challenge across activities using one-way ANOVA (see Table 4.6). There were significant differences across activities for interest and skill but not challenge. Post-hoc analyses again confirmed that students were less interested and felt less skilled in activities that occurred the most frequently. In other words, the most common activities students experienced were the least engaging.

Table 4.5 Cross-tabulation Analysis of Situational Engagement per ESM Activity Group

Activity group                         Not occurred   Occurred   Total
High frequency     Count               1365           75         1440
                   Std residual        0.42           −1.61
                   Adj std residual    2.47           −2.47
Medium frequency   Count               972            77         1049
                   Std residual        −0.36          1.38
                   Adj std residual    −1.83          1.83
Low frequency      Count               143            14         157
                   Std residual        −0.34          1.32
                   Adj std residual    −1.41          1.41
Total              Count               2480           166        2646

Note.
High frequency activities include following teachers' instruction, doing tasks independently, and book studying; medium frequency activities include solving math problems, writing, studying from a website, discussing online, constructing an explanation, using a model, analyzing data, evaluating information, and asking questions; low frequency activities include using evidence to make an argument, developing a model, conducting an investigation, planning an investigation, making videos, and other.

Table 4.6 ANOVA Results for Interest, Skill, and Challenge per ESM Activity Group

             High frequency   Medium frequency   Low frequency   F                          Post-hoc
             activities       activities         activities
Interest     2.54             2.76               2.76            30.66, df = 2, p < .001    High < medium, low
Skill        2.23             2.37               2.39            11.58, df = 2, p < .001    High < medium, low
Challenge    2.31             2.33               2.38            0.76, df = 2, p = .47      ns

Note. High frequency activities include following teachers' instruction, doing tasks independently, and book studying; medium frequency activities include solving math problems, writing, studying from a website, discussing online, constructing an explanation, using a model, analyzing data, evaluating information, and asking questions; low frequency activities include using evidence to make an argument, developing a model, conducting an investigation, planning an investigation, making videos, and other.

What did the social and emotional learning of the students look like while learning remotely in Finland? Among the six types of social and emotional learning experiences, when measured situationally, the most salient was the importance of learning for the future (see Table 4.7). Close to half of responses (45.5%) indicated students felt that what they were learning was useful for their future. The likelihood of reporting being confident (30.44%), curious (28.1%), or persistent (i.e. gritty; 27.5%) was modest. Correlation analyses show that when students felt their learning was important for their future, and were moderately curious, persistent, and confident about themselves, they were more likely to be situationally engaged (OLM).

Table 4.7 Situational Engagement (Optimal Learning Moments) and Social Emotional Learning

                            M      SD     Occurrence (%)^a   1         2         3         4         5        6
1. Situational engagement   0.05   0.21   4.71
2. Future importance        2.53   0.79   45.56              0.15**
3. Lonely                   1.38   0.72   8.73               −0.08*    0.12**
4. Bored                    1.80   0.82   17.74              −0.17**   −0.13**   0.30**
5. Confident                2.18   0.83   30.39              0.14**    0.22**    −0.14**   −0.26**
6. Curious                  2.09   0.87   28.10              0.12**    0.30**    0.06      −0.24**   0.42**
7. Grit                     2.12   0.82   27.48              0.12**    0.27**    −0.19**   0.30**    0.54**   0.06

Note. *p < 0.05. **p < 0.01. ^a Occurrence is defined as choosing 3 (much) or 4 (very much) on the scale.

Discussion When measured during the pandemic, U.S. students reported greater interest and challenge in their science subject than they did in Fall 2019. Generally, academic engagement for U.S. students showed an increase, but Finnish results showed that situational academic engagement was low. While it is impossible to distinguish a causal relationship, it is possible that this difference in results suggests students are less engaged in the specific activities they do remotely but were influenced by factors outside the classroom that increased overall engagement.
For example, one or more components of engagement (interest, skill or challenge) increased for students because of what they were seeing in the news regarding the novel coronavirus. Academic engagement was positively correlated with a number of project-based activities employed during the pandemic. Students showed a significant increase in the odds of reporting engagement with more frequent use of science and engineering practices, class discussions, working together to understand phenomena, various modelling activities, presenting their work, and conducting experiments at home. Textbook use remained high among students who reported an overall sense of engagement. If teachers are using the textbooks for homework, this could be 129 related to previous findings that students show above average situational engagement while doing math problems (Inkinen et al., 2020). Consistent with findings from Domina et al. (2021) more frequent opportunities for social and emotional learning (e.g. taking pride in science achievements and working in groups with peers) were also positively correlated with academic engagement. The Finnish study found that the less frequently assigned active practices (e.g. asking questions, analyzing data) were better than passive activities in facilitating interest and skill. Similarly, there was a correlation between students reporting that they were engaged and the frequency of real science and engineering practices in their U.S. classrooms. Unfortunately, the shift to remote learning did not allow students to engage in the same sorts of hands-on, project-based activities they would in a normal CESE classroom. For Finnish students, the most frequent learning activities during the pandemic were following teacher’s instruction or demonstration and doing independent tasks or assignments. In the United States, students reported high interest but few opportunities to try experiments at home. This was also the least frequently reported activity for students in Finland, which may be due to the difficulty associated with conducting science investigations without experimental tools and support from teacher and peers. Additionally, safety concerns and difficulty with distributing or obtaining resources have been shown to hinder participation in experimental activities while learning science remotely (Kelley, 2020). Students in both countries reported few opportunities to collaborate with one another in trying new activities and problem solving while learning remotely. Analysis of specific challenges faced by students in Finland reveals that many students had difficulties in planning their studies while learning remotely. Compared to the challenge of study planning, students had fewer challenges regarding technical problems or a place to study at 130 home. Students may be unfamiliar with effective time management practices when leaving the structured environment of their in-person high school classes. Finding ways to help students in planning their multiple assignments may benefit students who are learning in this less structured remote learning environment. Despite facing numerous difficulties and challenges, there were some positive findings regarding students’ social and emotional experiences. Despite reporting low situational engagement, nearly half of the Finnish students surveyed still felt that what they were learning in their science classes was important to their future. 
In the United States, educational aspirations remained high, and a significant proportion of the surveyed students raised or affirmed their ambitions toward college. This effect may be driven by students who previously did not know their plans and later decided to attend college. For students who reported their future academic plans before and after the pandemic, academic engagement was significantly correlated with plans to attend college or graduate school. Despite the challenges U.S. students faced and the difficulties with the organization and management of the transition to remote lessons, students remained positive about their future education. If we consider this ambition a testament to persistence, there was a significant correlation between academic engagement and persistence in both countries.

Limitations

In the U.S. sample, the most at-risk students were often unable to participate in the remote learning experience. The U.S. sample of students participating in remote learning was skewed toward students who had access to computers in districts that supported remote learning. Because of socioeconomic barriers, students in the United States were not graded and could not be held accountable for attendance. This lack of incentive suggests that the students who continued to participate may have shared certain characteristics. While socioeconomic disparities did not affect students' participation in Finland, the Finnish study did not follow the same students longitudinally; its results are framed mainly by participation in full remote days of instruction.

In both studies, academic engagement comprised three components, each represented by only one question. The three questions are not expected to measure the same construct, but instead to indicate when three independent concepts occur together, which is reflected in the low Cronbach's alpha (0.61; an illustrative computation is sketched below). Moving forward, measuring each construct with a multiple-question scale could provide more reliable results.

Implications

It is important to emphasize that in both the United States and Finland, active pedagogical practices in distance-learning environments were found to be the most engaging for these students. Acknowledging the difficulties of conducting these active practices remotely, schools should provide support and sufficient tools to promote effective science learning. Although Finland's unified government experienced a somewhat smoother transition to remote learning than the United States, both countries had their own set of student challenges. In both countries, students lacked opportunities to engage in scientific practices used by real scientists in the field.

Although improving engagement and promoting positive social and emotional learning is undoubtedly a challenge for remote instruction, it is one that needs attention regardless of how soon the pandemic ends. Remote or hybrid learning situations are likely to continue for the long term. The activities students were most interested in involved doing science rather than simply reading about it, which is what the CESE intervention emphasizes. In the full sample in the prior year, a positive effect for science learning was found among treatment students, including for low-income and minority students who were over-sampled (Schneider et al., 2022). This leads to concerns of potential learning loss when students are unable to participate in experiential science learning.
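As noted in the limitations above, the three single-item engagement components yielded a Cronbach's alpha of 0.61. The sketch below shows how that coefficient is computed from a respondents-by-items score matrix; the simulated three-column array stands in for the actual interest, skill, and challenge items.

    # Illustrative only: Cronbach's alpha for a three-item engagement measure,
    # computed on simulated responses rather than the study data.
    import numpy as np

    def cronbach_alpha(items: np.ndarray) -> float:
        # items: respondents-by-items matrix of scores
        k = items.shape[1]                              # number of items
        item_var_sum = items.var(axis=0, ddof=1).sum()  # sum of item variances
        total_var = items.sum(axis=1).var(ddof=1)       # variance of summed scores
        return (k / (k - 1)) * (1 - item_var_sum / total_var)

    rng = np.random.default_rng(0)
    latent = rng.normal(size=(300, 1))  # shared "engagement" signal
    # Three 4-point items (interest, skill, challenge) with added noise
    responses = np.clip(np.round(2.5 + latent + rng.normal(scale=1.0, size=(300, 3))), 1, 4)
    print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")

With only three loosely related single items, an alpha near 0.6 is unsurprising; as suggested above, multi-item scales for each component would be the more direct remedy.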
Collaboration and experimentation are key practices used by real scientists working in the field. To optimize students' science learning, remote classroom environments may use and adapt existing technologies that allow students to remain engaged with their lessons and give them opportunities to figure out phenomena. This engagement with science may in turn encourage students toward more ambitious educational goals.

Conclusion

Similar to results from previous years of the CESE intervention, during remote learning in the 2019–2020 school year, students in the CESE study showed engagement that was strongly correlated with a variety of project-based activities assigned by their teachers. In both the United States and Finland, students reported higher engagement with some of the least frequently assigned activities. When learning remotely, students showed more engagement when performing real science and engineering practices such as conducting investigations at home, performing modelling activities, asking questions, and participating in class discussions. This engagement was in turn related to positive social and emotional outcomes such as confidence, persistence, and ambition. The likelihood of engagement was, however, much lower in remote learning environments than in the normal classroom setting. Consequently, attention should be paid to providing equitable opportunities to participate in project-based learning activities whether learning remotely or in a classroom.

REFERENCES

Csikszentmihalyi, M. (1990). Flow: The psychology of optimal experience. Harper Perennial.

Domina, T., Renzulli, L., Murray, B., Garza, A. N., & Perez, L. (2021). Remote or removed: Predicting successful engagement with online learning during COVID-19. Socius, 7, 2378023120988200.

Dorn, E., Hancock, B., Sarakatsannis, J., & Viruleg, E. (2020). COVID-19 and student learning in the United States: The hurt could last a lifetime. McKinsey & Company. Retrieved from https://www.mckinsey.com/industries/public-and-social-sector/our-insights/Covid-19-and-student-learning-in-the-united-states-the-hurt-could-last-a-lifetime

Duckworth, A. L. (2016). Grit: The power of passion and perseverance. Scribner.

Escueta, M., Quan, V., Nickow, A. J., & Oreopoulos, P. (2017). Education technology: An evidence-based review. NBER Working Paper No. 23744. https://doi.org/10.3386/w23744

Fraillon, J., Ainley, J., Schulz, W., Friedman, T., & Duckworth, D. (2019). Preparing for life in a digital world. IEA International Computer and Information Literacy Study 2018: International report. Retrieved from https://www.iea.nl/sites/default/files/2019-11/ICILS%202019%20Digital%20final%2004112019.pdf

Hidi, S., & Renninger, K. A. (2006). The four-phase model of interest development. Educational Psychologist, 41, 111–127.

Hopkins, B., Turner, M., Lovitz, M., Kilbride, T., & Strunk, K. O. (2021). Policy brief: A look inside Michigan classrooms: Educators' perceptions of Covid-19 and K-12 schooling in the fall of 2020. Education Policy Innovation Collaborative, Michigan State University, East Lansing, MI.

Inkinen, J., Klager, C., Juuti, K., Schneider, B., Salmela-Aro, K., Krajcik, J., & Lavonen, J. (2020). High school students' situational engagement associated with scientific practice in designing science situations. Science Education, 104(4), 667–692. https://doi.org/10.1002/sce.21570

Kankaras, M., & Suarez-Alvarez, J. (2019). Assessment framework of the OECD study on social and emotional skills. OECD Education Working Papers No. 207. Paris: OECD.
Kaufman, J. H., Hamilton, L. S., & Diliberti, M. (2020). Which parents need the most support while K–12 schools and childcare centers are physically closed? RAND Corporation. Retrieved from https://www.rand.org/pubs/research_reports/RRA308-7.html

Kelley, E. W. (2020). Reflections on three different high school chemistry lab formats during COVID-19 remote learning. Journal of Chemical Education, 97, 2606–2616. https://doi.org/10.1021/acs.jchemed.0c00814

Kraft, M., & Simon, N. S. (2020). Teachers' experiences working from home during the COVID-19 pandemic. Upbeat. Retrieved from https://f.hubspotusercontent20.net/hubfs/2914128/Upbeat%20Memo_Teaching_From_Home_Survey_June_24_2020.pdf

Krajcik, J. S., & Shin, N. (2014). Project-based learning. In R. K. Sawyer (Ed.), The Cambridge handbook of the learning sciences (pp. 275–297). Cambridge, United Kingdom: Cambridge University Press.

Meluzzi, F. (2020). Strengthening online learning when schools are closed: The role of families and teachers in supporting students during the COVID-19 crisis. The OECD Forum Network. Retrieved from http://www.oecd.org/coronavirus/policy-responses/strengthening-online-learning-when-schools-are-closed-the-role-of-families-and-teachers-in-supporting-students-during-the-covid-19-crisis-c4ecba6c/

OECD. (2019). PISA 2018 results (Volume I): What students know and can do. PISA, OECD Publishing. https://doi.org/10.1787/5f07c754-en

Reich, J., Buttimer, C. J., Fang, A., Hillaire, G., Hirsch, K., Larke, L. R., Littenberg-Tobias, J., Moussapour, R., Napier, A., Thompson, M., & Slama, R. (2020). Remote learning guidance from state education agencies during the Covid-19 pandemic: A first look. Retrieved from https://osf.io/k6zxy/

Salmela-Aro, K., Moeller, J., Schneider, B., Spicer, J., & Lavonen, J. (2016). Integrating the light and dark sides of student engagement using person-oriented and situation-specific approaches. Learning and Instruction, 43, 61–70.

Salmela-Aro, K., & Upadyaya, K. (2020). School engagement and school burnout profiles during high school: The role of socio-emotional skills. European Journal of Developmental Psychology, 17(6), 943–964.

Schneider, B., Krajcik, J., Lavonen, J., & Salmela-Aro, K. (2020). Learning science: Crafting engaging science environments. Yale University Press.

Schneider, B., Krajcik, J., Lavonen, J., Salmela-Aro, K., Broda, M., Spicer, J., Bruner, J., Moeller, J., Linnansaari, J., Juuti, K., & Viljaranta, J. (2016). Investigating optimal learning moments in U.S. and Finnish science classes. Journal of Research in Science Teaching, 53(3), 400–421.

Schneider, B., Krajcik, J., Lavonen, J., Salmela-Aro, K., Klager, C., Baker, Q., Chen, I., Bradford, L., Touitou, T., Peek-Brown, D., Marias Dezendorf, R., & Maestrales, S. (2022). Improving science achievement – Is it possible? Evaluating the efficacy of a high school chemistry and physics project-based learning intervention: Crafting engaging science environments. Educational Researcher.

Tang, X., Upadyaya, K., & Salmela-Aro, K. (2021). School burnout and psychosocial problems among adolescents: Grit as a resilience factor. Journal of Adolescence, 86, 77–89. https://doi.org/10.1016/j.adolescence.2020.12.002

Tang, X., Wang, M. T., Guo, J., & Salmela-Aro, K. (2019). Building grit: The longitudinal pathways between mindset, commitment, grit, and academic outcomes. Journal of Youth and Adolescence, 48(5), 850–863.

The Finnish National Agency for Education. (2020). Guidelines for primary education. Opetushallitus.
Retrieved from https://www.oph.fi/fi/koulutus-ja-tutkinnot/opetustoimi-ja-koronavirus

Trinidad, J. E. (2021). Equity, engagement, and health: School organisational issues and priorities during COVID-19. Journal of Educational Administration and History, 53(1), 67–80. https://doi.org/10.1080/00220620.2020.1858764

United Nations Educational, Scientific, and Cultural Organization. (2020). Education: From disruption to recovery. The Author. Retrieved from https://en.unesco.org/Covid19/educationresponse

Yang, X., Zhang, M., Kong, L., Wang, Q., & Hong, J. C. (2020). The effects of scientific self-efficacy and cognitive anxiety on science engagement with the "question-observation-doing-explanation" model during school disruption in COVID-19 pandemic. Journal of Science Education and Technology, 30(3), 380–393.

CHAPTER 5: DISCUSSION AND CONCLUSION

Contributing to the Landscape of Science Education Research

Studies of high school science interventions, like the one developed by CESE, are critical to the landscape of science education research. In a content analysis of 650 empirical chemistry education papers published between 2004 and 2013 in a Royal Society of Chemistry journal, only 25% of the selected manuscripts studied students in grades 10–12, with a far larger share conducted at post-secondary institutions; in the U.S. there were 12.7 times more manuscripts related to post-secondary education than to grades 10–12 (Teo et al., 2014). Moreover, a 2020 analysis by Kanim and Cid showed that high school students comprised only 8% of the students in physics education research manuscripts published between 1970 and 2015, while 70% of those students were enrolled in university calculus-based courses. Kanim and Cid (2020) use these data to argue that many physics education research studies are not necessarily generalizable to the 1.38 million high school physics students in the United States (US), as the institutions where the studies typically occur have wealthier students with stronger math preparation and less ethnic diversity than the general population.

Unlike the studies that comprise the majority of science education research, CESE focused on non-calculus-based science education and deliberately over-sampled high schools in low-income and diverse school districts. The Common Core of Data was used to compute a generalizability index of 0.82 for the entirety of the US, suggesting the mean treatment effect could be generalized to the inference population (Schneider et al., 2022).

The NRC calls for comprehensive measures that include the development of new project-based curriculum, educator training in project-based and inquiry-driven teaching (NRC, 2012b), and the construction of assessments that capture information about students' ability to construct explanations using three-dimensional reasoning (NRC, 2014). Researchers at CESE designed an intervention to meet those suggestions and provide educators with learning materials that are tested and ready for use in the classroom, and the results were improved science achievement and increased educational ambition. All three manuscripts associated with this dissertation come together to support prior research into the benefits of project-based curriculum and the possibilities of using technologies that facilitate student learning in average US high schools.

Improving Science Achievement – Is It Possible?
Evaluating the Efficacy of a High School Chemistry and Physics Project-Based Learning Intervention

The project-based learning intervention designed by CESE was effective in boosting students' mean performance on the summative assessment taken at the end of the school year. Students who participated in the CESE treatment condition outperformed their peers in the control group by more than 0.2 standard deviations, with 28% of that effect potentially accounted for by students' reported use of models in the classroom (Schneider et al., 2022). Additionally, the treatment condition was related to a significantly higher likelihood of an increase in educational ambition during the school year, even when controlling for students' personal demographics, their level of ability as measured by the pretest, and the average pretest score of physics and chemistry students within the school.

Implications

Although the CESE curriculum is not designed specifically to discuss or promote college ambition, it does provide students with the experience of acting as real scientists in the field. During the intervention, students explain real-world phenomena by asking questions, designing experiments, and then collecting and analyzing data. While the NRC (2012a) suggests that promoting educational attainment may enhance or promote the development of relevant science competencies, it may be that educators can promote educational attainment through the teaching of relevant science competencies as well.

Using Machine Learning to Score Multi-Dimensional Assessments of Chemistry and Physics

To understand more about what students were learning in the NGSS-aligned curriculum, researchers at CESE developed a number of assessments and rubrics that captured information about students' engagement in three-dimensional reasoning. Unlike many other studies, this manuscript addressed specific items drawn from a national test bank, the rubrics developed for them, and how those rubrics represented the individual dimensions of learning and grade-level performance expectations (PEs). Coupled with the diverse student body that lent itself to the generalizability of the 2018-2019 study, the results suggest that machine-learning algorithms can successfully classify responses from a diverse and representative sample of US students. Moreover, the automated scoring classifications showed high agreement with human raters when broken down by the NGSS PEs represented in the students' reasoning.

Implications

Automated scoring methods were effective in differentiating between response classifications using rubrics developed to capture the use of multi-dimensional reasoning. Given the sample sizes needed to train the scoring algorithms, this method of scoring would not be practical for assessments developed by individual teachers in their classrooms. It could, however, be applied to national test-bank items, where large numbers of student responses are used to train an algorithm that is then associated with that item online. The Automated Analysis of Constructed Response (AACR) collaboration provides educators with a tool that does exactly that. AACR provides a test bank of NGSS-aligned items that ask students to explain their reasoning. Teachers can then upload the responses into the Constructed Response Classifier (CRC Tool), which has already been trained to identify correct responses. Through technologies such as the CRC Tool, teachers can collect information about students' reasoning and argumentation with ease comparable to scoring multiple-choice items.
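The approach summarized above trains a classifier on a large pool of human-scored responses and then checks its agreement with human raters. The sketch below illustrates that general supervised workflow; the TF-IDF features, logistic-regression classifier, and invented responses are assumptions for illustration, not the actual AACR/CRC pipeline or the CESE scoring models.

    # Illustrative only: supervised scoring of constructed responses, with
    # machine-human agreement summarized by Cohen's kappa. The responses and
    # labels are invented examples, not study data.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import cohen_kappa_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    responses = [
        "The salt dissolves because water molecules pull the ions apart",
        "It just disappears when you stir it",
        "Energy transfers from the warmer object to the cooler object",
        "The ice melts because it gets hot",
        "The particles move faster as the temperature increases",
        "I do not know",
    ] * 50  # repeated to mimic a larger pool of scored responses
    human_scores = [1, 0, 1, 0, 1, 0] * 50  # 1 = meets the rubric level, 0 = does not

    X_train, X_test, y_train, y_test = train_test_split(
        responses, human_scores, test_size=0.25, random_state=0, stratify=human_scores)

    # TF-IDF features feed a logistic-regression classifier
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)

    machine_scores = model.predict(X_test)
    print("Cohen's kappa (machine vs. human):",
          round(cohen_kappa_score(y_test, machine_scores), 2))

In practice, rubric levels are usually multi-category rather than binary, training sets run to hundreds or thousands of responses per item, and agreement is examined per item and per performance expectation, as described in the chapter.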
Individual educators can save time on assessment development by using test banks provided by automated scoring services such as AACR. Teachers would not need to train the algorithms with human scores or worry about the impact of item difficulty on the distribution of scores among student responses. This could provide significantly more information to teachers than multiple-choice items, with similar scoring cost and effort.

U.S. and Finnish High School Engagement During the Covid-19 Pandemic

During the pandemic, CESE also found that students reported engagement (high levels of interest, skill, and challenge) during their online learning experiences, despite the difference in content delivery. Students who were able to participate in their courses online reported the highest interest in the activities they were able to do least frequently. Students in Finland showed the highest interest in "low frequency activities," which included using evidence to make an argument, developing a model, conducting or planning investigations, and making videos. In the US, students' reported participation in science and engineering practices (SEPs) had the largest coefficient in predicting whether a student reported being engaged in their online classrooms. Three of the four highest-interest activities were watching videos of experiments, performing experiments at home, and using online simulations.

Despite the numerous hardships for people worldwide, students who maintained participation in the CESE intervention by accessing their courses remotely during the Covid-19 pandemic were more likely to plan to attend college when surveyed at the end of the school year. Unlike in the 2018-2019 cohort, this difference was driven by students who had previously reported uncertainty and later changed to a plan for four or more years of college. Additionally, the plan to attend college was strongly related to a student reporting engagement when surveyed during online courses.

Implications

The results of the Maestrales et al. (2021a) study suggest students are showing interest in hands-on science while also understanding that videos and interactive simulations are a useful substitute when circumstances limit activities. These types of technologies can benefit students in a variety of learning environments both inside and outside the classroom. For instance, some teachers in the CESE intervention reported that they were unable to conduct certain experiments due to a lack of science resources. Many of those same teachers also reported having access to a computer lab where they could schedule time for their students to participate in online activities and simulations. Such simulations can provide alternative solutions that allow students to develop skills through project-based opportunities to participate in various SEPs, even when resources are limited.

Connecting the Pieces

The manuscripts in this dissertation agree with prior research suggesting that doing science has proven benefits for students learning in the classroom and online. In addition to use in the classroom, vetted science curriculum materials available online, such as those developed by CESE, could be used to design future research. Research from CESE provides quantitative evidence that supports the NRC's call for professional development, NGSS-aligned curriculum, and multi-dimensional assessments (Schneider et al., 2022).
Greater similarity between classrooms in these ways is expected to promote equity, but school resources can still have a significant impact on student science learning, and students in schools that lack material resources may have limited opportunities to participate in project-based science activities (NGSS Lead States, 2013). Coupled with automated scoring methods, such as those designed by AACR, studies could be developed using these vetted materials that lead to more similar pedagogy, teaching materials, content, and assessment without placing significantly greater burdens on teachers.

Project-Based Curriculum in the Classroom and Online

Remote content delivery by CESE teachers during Covid-19 showed that meaningful science teaching could occur in online classrooms. Despite having little to no time to prepare for such a shift in pedagogy, teachers were resourceful in moving content online by using the many existing online platforms and simulations. Although not reported in these manuscripts, teachers were asked to provide information regarding their experiences teaching online during the pandemic. Teacher interviews revealed that they used a wide variety of online learning tools to bring as much project-based content to students as possible. They also noted that students seemed most interested in lessons and experiments related to what they were seeing in the news, such as making their own hand sanitizer during the shortages. The student data collection instruments for the 2019-2020 cohort were informed by teachers' reported use of interactive simulations, videos of experiments, and collaboration tools.

In addition to the instruments described in this dissertation, CESE students were also asked to provide suggestions about how to improve the online learning experience. Their responses confirmed what their teachers were identifying as meaningful lessons. Responses centered on coverage of real-world science topics and project-based experiments that could be conducted at home. Many students also recognized the need for those experiments to be designed around equity and safety.

A variety of online resources have emerged that provide opportunities to participate in simulated experiments or collect data based on simulations of events when hands-on participation is not possible. Existing technologies were helpful during the Covid-19 pandemic, when many educators worldwide turned to remote content delivery with little time to adapt lessons that were designed for in-person delivery (Brown & Krzic, 2021; Maestrales et al., 2021a). These types of simulations could facilitate equitable learning for all students, including those who cannot come to the classroom, those with disabilities, and students in classrooms that lack the resources for some project-based lessons. CESE students suggested they were interested in adaptive technologies that would allow them to participate in or watch experiments remotely during the pandemic (Maestrales et al., 2021a). During interviews, CESE teachers reported using websites such as LabXchange, developed by Harvard University in 2018 (LabXchange, 2018), to provide virtual experimentation and data collection opportunities to their students while online. Some research suggests differences in student engagement between online and in-person delivery formats (Robinson & Hullinger, 2008; Kemp & Grieve, 2014).
In a manuscript regarding the efficacy of hands-on experiments compared to computerized experiments, Carter and Emerson (2012) highlight the difficulties in making formal comparisons between studies due to differences in both pedagogy and outcome measures. Although earlier studies, such as that conducted by Carter and Emerson in 2012, found that students reported greater satisfaction when experiments were delivered in the classroom, advances in the technologies that students are accustomed to using inside and outside the classroom have created a much different landscape for learners. As technology develops and the general population becomes more familiar with new resources, studies must continue to build understanding of this new learning landscape. By coupling vetted curriculum and assessment materials with these newly developed interactive learning tools, researchers could provide significant insights into differences for students learning in the classroom and online. In addition to the direct benefits to both students and teachers in the classroom, these national standards and more consistent curricula could also lead to research in which data are comparable across studies.

College Ambition

The benefits of project-based learning appear to go beyond academic achievement: three-dimensional, project-based lessons also appear to influence students' desire to raise their educational aspirations. Connecting the results of the presented studies with regard to college ambition, participation in project-based activities appears to be related to future goal setting. The treatment intervention was linked to increased college ambition for the 2018-2019 cohort. Information from the 2019-2020 cohort showed that there was a significant relationship between participation in SEPs and students' reported engagement, while engagement was in turn related to educational ambitions.

Scientists have shown a strong, positive three-way relationship between mastery experiences, goal-setting behaviors, and self-efficacy, or the confidence that one can succeed in a task (Bandura, 1999; Earley & Lituchy, 1991; West & Thorn, 2001). It is possible that the engagement in and successful completion of these inquiry-driven experiments may provide mastery experiences that foster improved self-efficacy and goal setting in the students. This would support the growing body of research suggesting that successful project-based science experiences lead students to a greater sense of science self-efficacy (e.g., Bilgin et al., 2015; Samsudin et al., 2020; Schaffer et al., 2012). It may be through this connection between mastery and self-efficacy that CESE students are increasing their educational goals.

To bolster the STEM workforce and better understand the nuances of the STEM pipeline, future studies should seek to explain the mechanism by which project-based activities foster this increase in educational ambition. Future work using structural equation modeling or path analysis could help to provide insights into the mediating effects of science identity, mastery experiences, and self-efficacy in models regarding the impact of project-based science lessons on goal-setting behaviors or college ambition.

Limitations

Although the initial student sample for the 2019-2020 cohort mirrored the sample from the 2018-2019 study, Covid-19 created unique circumstances which left many learners unable to participate.
The final results of this study reflect those students who were able to attend their classes remotely and chose to do so when little could be done to mandate participation or attendance. Additionally, students were asked to respond to these questions about remote participation in SEPs when there were no other options available for learning. To understand student interest in new and emerging technologies, it is important to also consider their attitudes and opinions in their regular learning environments.

Conclusion

The many manuscripts and projects that have been developed under the CESE project-based learning intervention contribute significantly to the landscape of science education research, yet there is more to be done. More single-unit curriculum materials that meet NGSS standards need to be tested and made available so that educators can implement project-based curriculum for the full school year. Evidence suggests that national standards for project-based curriculum and assessment will benefit students' academic performance. These units can provide opportunities for more comparable studies through similarities in methodology, pedagogy, and outcome measures. This will in turn allow for deeper investigation of how emerging technologies affect students and teachers in their project-based classrooms.

REFERENCES

Bandura, A., & Wessels, S. (1994). Self-efficacy (Vol. 4, pp. 71-81).

Bilgin, I., Karakuyu, Y., & Ay, Y. (2015). The effects of project based learning on undergraduate students' achievement and self-efficacy beliefs towards science teaching. Eurasia Journal of Mathematics, Science and Technology Education, 11(3).

Brown, S., & Krzic, M. (2021). Lessons learned teaching during the COVID-19 pandemic: Incorporating change for future large science courses. Natural Sciences Education, 50(1), e20047.

Carter, L. K., & Emerson, T. L. (2012). In-class vs. online experiments: Is there a difference? The Journal of Economic Education, 43(1), 4-18.

Earley, P. C., & Lituchy, T. R. (1991). Delineating goal and efficacy effects: A test of three models. Journal of Applied Psychology, 76(1), 81.

Kanim, S., & Cid, X. C. (2020). Demographics of physics education research. Physical Review Physics Education Research, 16(2), 020106.

Kemp, N., & Grieve, R. (2014). Face-to-face or face-to-screen? Undergraduates' opinions and test performance in classroom vs. online learning. Frontiers in Psychology, 5, 1278. https://doi.org/10.3389/fpsyg.2014.01278

LabXchange: About. (2018). Retrieved from https://about.labxchange.org/

Maestrales, S., Marias Dezendorf, R., Tang, X., Salmela-Aro, K., Bartz, K., Juuti, K., Lavonen, J., Krajcik, J., & Schneider, B. (2021a). U.S. and Finnish high school science engagement during the Covid-19 pandemic. International Journal of Psychology, 57(1), 73-86.

Maestrales, S., Zhai, X., Touitou, I., Baker, Q., Krajcik, J., & Schneider, B. (2021b). Using machine learning to evaluate multidimensional assessments of chemistry and physics. Journal of Science Education and Technology, 30(2), 239-254.

National Research Council. (2012b). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. The National Academies Press.

National Research Council. (2014). Developing assessments for the Next Generation Science Standards. Washington, DC: The National Academies Press.

NGSS Lead States. (2013). Next generation science standards: For states, by states. Washington, DC: The National Academies Press.

Robinson, C. C., & Hullinger, H. (2008).
New benchmarks in higher education: Student engagement in online learning. Journal of Education for Business, 84(2), 101-109.

Samsudin, M. A., Jamali, S. M., Md Zain, A. N., & Ale Ebrahim, N. (2020). The effect of STEM project-based learning on self-efficacy among high-school physics students. Journal of Turkish Science Education, 16(1), 94-108.

Schaffer, S. P., Chen, X., Zhu, X., & Oakes, W. C. (2012). Self-efficacy for cross-disciplinary learning in project-based teams. Journal of Engineering Education, 101(1), 82-94.

Schneider, B., Krajcik, J., Lavonen, J., Salmela-Aro, K., Klager, C., Bradford, L., Chen, I., Baker, Q., Touitou, I., Peek-Brown, D., Marias Dezendorf, R., Maestrales, S., & Bartz, K. (2022). Improving science achievement – Is it possible? Evaluating the efficacy of a high school chemistry and physics project-based learning intervention. Educational Researcher, 0013189X211067742.

Teo, T. W., Goh, M. T., & Yeo, L. W. (2014). Chemistry education research trends: 2004–2013. Chemistry Education Research and Practice, 15(4), 470-487.

West, R., & Thorn, R. (2001). Goal-setting, self-efficacy, and memory performance in older and younger adults. Experimental Aging Research, 27(1), 41-65.