COGNITIVE SYNERGY: EXPLORING THE TRANSFORMATIVE INTERSECTION OF HUMAN INTELLIGENCE AND ARTIFICIAL INTELLIGENCE IN DESIGNING EQUITABLE NEXT GENERATION SCIENCE ASSESSMENTS By Tingting Li A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Educational Psychology and Educational Technology – Doctor of Philosophy 2024 ABSTRACT This study explores the intersection of human intelligence and Artificial Intelligence (AI) to design knowledge-in-use science assessments for supporting students’ deep science learning. In the context of evolving educational paradigms, it seeks to harness AI tools (GPT), to enhance knowledge-in- use assessment design, ensuring equitable opportunities for diverse learners. Anchored in the Next Generation Science Assessment and an evidence-centered design, this study aspires to harmonize AI's computational strengths with human expertise in assessment design. Drawing from an array of theoretical frameworks—Hybrid Intelligence System, Distributed Cognition, and Self-Regulated Learning Theory— the study underscores the multi-faceted and dynamic nature of knowledge-in-use and the symbiotic integration of human and AI. Employing a Design-Based Research approach, the study proceeds in three stages: (1) Iteratively training GPT models for effective designing knowledge-in-use assessments; (2) Gathering multidisciplinary expert feedback on AI-co-designed assessments; and (3) developing a domain-specific GPT-model for tailored assessment design that capture knowledge-in-use and address diverse student needs. Diverse data analysis techniques, encompassing thematic analysis, and descriptive statistics, such as heat map and scatter plot, are leveraged. Anticipated results spotlight an exploratory GPT model adept at creating tailored assessments resonating with diverse learning needs while emphasizing equity, adaptability, and inclusivity. This study holds the potential to significantly enhance the educational landscape by advocating a balanced approach where AI complements human expertise, paving the way for a progressive and inclusive future in education. ACKNOWLEDGEMENTS It is hard to believe that I am writing what I consider the most daunting section of my dissertation—the acknowledgment. Reflecting back to August 31, 2018, when I first ventured into this completely new world beyond my imagination, I see how different this moment feels compared to when I wrote my first dissertation acknowledgment. Now, I am calmer and deeply appreciative of everyone who has supported me on this incredible journey. First and foremost, I am profoundly grateful to my co-advisors, Dr. Joseph Krajcik and Dr. Rand Spiro. I often reflect on my fortune to have had the opportunity to learn from such wise, experienced advisors. You have mentored me not only in the art of research and scholarship but also in becoming a more thoughtful human being. Your open-mindedness, dedication, diligence, and support have profoundly shaped me. I am eager to pass on your spirited dedication to my students and will always hold dear your fervent enthusiasm for education—immersing in theory, bridging it with practice, consistently focusing on students, and pondering the profound implications of educational research. I aspire to embody the ideals of being a lifelong learner, collaborator, and supporter. I am committed to living up to your expectations. I extend my heartfelt thanks to my committee members: Dr. 
Jennifer Schmidt from the Educational Psychology and Educational Technology program; Dr. Jiliang Tang from the Computer Science Department; and Dr. Kevin Haudek from the Natural Science Department. Your expertise has not only enriched my dissertation but also my personal and professional growth. I am particularly indebted to Jennifer for your enduring guidance and unwavering support throughout this journey. Special thanks go to Dr. Christine Greenhow for her guidance, and to Drs. Emily Adah Miller, Christina Schwarz, Amelia Gotwals, and David Stroupe for supporting both Peng's and my academic and professional paths. To my CREATE for STEM colleagues—Drs. Bob Geier, Namsoo Shin, Cory Miller, Consuelo Morales, Emil Edin, Selin Akgun, and Jonathan Bowers—and to Ligita, Mary, Sue, Renee, Colter, Angie, and Alison: thank you for being part of the significant milestones in our lives. To my EPET friends— iii Samuela, Anne Drew, Larissa, and John Kean—thank you for the mutual support and camaraderie. A special shout-out to Sharon Hammond, who has always been ready to assist whenever I had questions about the program or my progress. I am grateful to the mentors and colleagues with whom I've collaborated on various research projects (Dr. Barbara Schneider, Dr. I-Chien Chen, Debroah Peek-Brown, Sue Codere, Kayla Bartz, and Lydia Bradford). My appreciation also extends to the teachers, district coordinators, and school principals I've worked with. Your insights have been invaluable. A special acknowledgment to the expert panels who reviewed the products of my dissertation—without your thoughtful feedback, this work would not have been possible. Lastly, I owe a profound debt of gratitude to my family. To my parents, Xifeng Li and Xihong Yu, thank you for your encouragement and support in pursuing my dreams. To my elder sisters, Haiyan Li and Mengmeng Li, thank you for caring for our parents, allowing me to focus on my studies. To my husband, Dr. Peng He, thank you for your encouragement, solid support, and the powerful backing, helping me to trust and surpass myself. Thank you for being an incredible partner and father, filling our lives with love, care, and adventure. And to my daughter, Jinni (何祎然), thank you for being a wonderful source of inspiration and joy—I wish you all the happiness and health in the world and hope you pursue whatever dreams you have and become the person you aspire to be. Lastly, I thank myself for remaining persistent and strong. Go Green! Thank you, everyone! I wish you all the very best! iv TABLE OF CONTENTS CHAPTER 1: INTRODUCTION .................................................................................................................. 1 CHAPTER 2: LITERATURE REVIEW AND THEORETICAL FRAMEWORK ...................................... 7 CHAPTER 3: STUDY DESIGN AND METHODOLOGY ....................................................................... 37 CHAPTER 4: FINDINGS AND DISCUSSIONS ....................................................................................... 73 CHAPTER 5: CONCLUSIONS AND IMPLICATIONS ......................................................................... 237 BIBLIOGRAPHY ...................................................................................................................................... 250 APPENDIX ................................................................................................................................................ 
261 v CHAPTER 1: INTRODUCTION 1.1 Rationale 1.1.1 The Evolution of Educational Assessments in Science Education Historically, educators and researchers largely perceived assessments as static milestones marking students' progress (Bloom, 1968; Sadler, 1989). Recent pedagogical advancements, such as supporting students' higher-order thinking like problem-solving skills (Kang et al., 2014; Pellegrino & Hilton, 2012) or transferable knowledge (Shepard et al., 2019), have transformed these evaluations from mere benchmarks to dynamic tools that actively support instructional practices. Educators need to develop localized assessments that emphasize "assessment for learning" rather than mere evaluation (Black & Wiliam, 1998; Shepard, 2000; Stiggins, 2014). This transformative perspective positions assessment as an ongoing dialogue within the classroom, steering pedagogical strategies (DiCerbo, 2020; Wiggins, 1998). In science education, the notion of knowledge-in-use is central to this transformation. It refers to students' ability to apply their acquired knowledge to real-world scenarios or complex problems (Li et al., 2024; NGSS Lead States, 2013; NRC, 2012). To develop knowledge-in-use, students should actively engage with the three dimensions of scientific knowledge (3D learning): disciplinary core ideas (DCIs), science and engineering practices (SEPs), and crosscutting concepts (CCCs) to make sense of compelling phenomena and design solutions to challenging problems (NGSS Lead States, 2013; NRC, 2012). This ambitious vision for science education is emphasized in foundational documents such as A Framework for K-12 Science Education (NRC, 2012) and the Next Generation Science Standards (NGSS Lead States, 2013). Subsequent policy documents, including Science and Engineering for Grades 6-12: Investigation and Design at the Center (National Academies of Sciences, Engineering, and Medicine, 2019) and Science and Engineering in Preschool Through Elementary Grades: The Brilliance of Children and the Strengths of Educators (National Academies of Sciences, Engineering, and Medicine, 2022), also focus on developing students’ knowledge-in-use through 3D learning. The transition to knowledge-in-use raises a vital challenge for the science education community: how can we effectively collect evidence to understand if students have developed knowledge-in-use? 1 (Furtak, 2017, 2023; Pellegrino, 2013; Penuel & Smolek, 2019). Addressing this challenge requires exploring the development of suitable assessment tasks that can capture students' knowledge-in-use (Li et al., 2024; NGSS Lead States, 2013). Such assessments prioritize learners' ability to leverage acquired knowledge in real-world scenarios, which, while vital, presents challenges in design and implementation (Bertenthal & Wilson, 2006; NGSS Lead States, 2013). Moreover, students must find these assessments sufficiently compelling and engaging to motivate them in their learning processes (Li, He, & Peng, 2023; Stiggins, 2014). Additionally, the notion of “assessment for learning” emphasizes the important role of formative assessment in supporting students’ learning (Li et al., 2024). Formative knowledge-in-use assessments are essential to effective science instruction (Harris et al., 2019). These high-quality assessments, which align with NGSS standards, offer crucial formative insights for educators (NRC, 2014; Shepard et al., 2018). They depict the progression of students' learning over time. 
Teachers often need to develop assessment tasks for their students based on their instruction and students’ needs (Heritage, 2010). Yet, many teachers do not feel prepared to develop NGSS-aligned assessments or use them formatively (Furtak, 2017). Due to the complex and varied nature of local classrooms, science teachers need the capability to intentionally design assessment tasks that align with the NGSS and are easily integrable into their real-time, interactive classroom activities (Pellegrino, 2013). Designing these assessments demands adaptability and inclusivity for a diverse spectrum of learners, especially from minoritized and marginalized racial and ethnic groups (Darling-Hammond & Snyder, 2000). The process of developing such assessments requires collaboration between assessment experts, science education experts, and teachers, which is time-consuming and labor-intensive (Furtak & Lee, 2023). Moreover, it necessitates professional knowledge of assessment design, science content, teaching, and student knowledge (Brookhart, 2010). 1.1.2 The Confluence of Human and Artificial Intelligence in Assessment Design Teachers need support to transition from assessment for evaluation to assessment for learning (Li et al., 2024). To provide feedback to support student learning, teachers need to design effective assessments that attend to different needs of learners and that provide evidence of student learning (Harris 2 et al., 2019; Hattie & Timperley, 2007; Shute, 2008). Customization is essential to align with the diverse backgrounds and dynamic classroom scenarios, adding to the complexity and time demands on teachers (Brookhart, 2010; Darling-Hammond et al., 2020). This new vision brings both enlightenment and challenges, such as design intricacies, adaptability concerns, and the overarching quest for equity, especially for learners from minoritized and marginalized groups navigating varied pathways (Furtak & Lee, 2023). Amid the ever-evolving educational landscape, this pressing situation emerges, compelling us to reckon with the promise and pitfalls of technological integration in academic evaluation (Luckin et al., 2017). The solution may lie at the intersection of human intelligence and artificial intelligence. This exploration is driven by technological innovation and a profound aspiration for equity. AI tools have the potential to democratize the process of knowledge-in-use assessments, making them more accessible and equitable for all students (Luckin et al.,2017). Emergent generative AI technologies, notably Large Language Models (LLMs) by OpenAI, offer a glimpse of this potential (Greengard, 2022). Yet, to fully harness AI's potential, educators need a foundational understanding of machine intelligence’s underlying principles and professional expertise in corresponding fields (Ifenthaler et al., 2024; Williams, 2023; Zawacki-Richter et al., 2019). Without adequate expertise to evaluate the outputs of tools like ChatGPT, there's a risk of misguided decisions, which could further erode trust in AI. Harmoniously melding human expertise with AI's capabilities is key. A pressing concern is that many educators, especially those working on knowledge-in-use assessments, are increasingly relying on AI without fully grasping its nuances (Brown et al., 2020). Bridging this gap calls for innovations like domain-specific algorithms designed with educational paradigms in mind. 
Such tools can guide educators in integrating the strengths of AI with human insights for assessment design (Khosravi et al., 2022; Owan et al., 2023). This collaboration can redefine the role of assessments, ensuring they remain steadfast guiding lights in the learning voyage (Pellegrino, Chudowsky, & Glaser, 2001; Nguyen et al., 2021). This emphasis on human-machine collaboration in education reflects broader shifts in the AI era. As we delve deeper, technology's role in education 3 becomes increasingly intricate, moving beyond basic computer-aided lessons to sophisticated, intelligent educational systems (Miller, 2023; Reiser, 2001 a, b). While this integration brings opportunities, it also demands a balance between machine-driven innovations and human-centric pedagogy (Halverson & Collins, 2009; Roberts, 2021). Using generative AI exemplifies the promise and potential pitfalls of this alliance, especially in the domain of knowledge-in-use assessments (Greengard, 2022). This study stands at the crossroads of human intelligence and AI (Greengard, 2022; Johnson et al., 2022). To harness this potential, a balanced approach is essential: educators must synergize their diverse expertise with AI's prowess, ensuring the tools developed are rooted in human values, bias-free, and tailored to the diverse needs of education. Prioritizing collective human intelligence, this study integrates insights from experts across various disciplines, promoting a well-rounded approach to AI's role in education. Central to this endeavor is the belief that AI should augment, not replace, human expertise. The heart of this exploration lies in the development and validation of knowledge-in-use assessments. Beyond mere evaluation, these assessments are pivotal for enhancing student learning. A core tenet of this study posits that these assessments should be dynamic, allowing educators the autonomy to design, adapt, and align them with their students' unique needs, fostering equity, and championing culturally relevant teaching. Equity and inclusion are critical elements of this research that weave throughout the research and development process, beginning with the initial domain analysis of performance-based learning goals and continuing through the development of tasks and rubrics, recruitment of teacher participants from diverse classroom settings for broad access and participation, and data analyses for validation. This study aims to chart a course where human expertise and AI innovation converge, offering transformative insights for educational assessment. By delving into the nuances of AI-human collaboration (Dellermann et al, 2021; Fui-Hoon Nah et al., 2023; Johnson et al., 2022), I aim to refine the discourse on knowledge-in-use assessments. The goal is to seamlessly meld technology with methods that prioritize human values, adaptability, and equity. Key objectives include iteratively training a large 4 language model for effective design of knowledge-in-use assessments, gathering multidisciplinary expert feedback on AI-generated assessments, and exploring how to incorporate collective human experts’ intelligence to develop a domain-specific algorithm encompassing refined AI processes for tailored assessment design. At its core, this research seeks to foster a symbiotic relationship between AI and human agency, particularly in the realm of developing knowledge-in-use assessments. It endeavors to ensure that AI tools amplify human capabilities rather than replace them. 
The outcome will be a comprehensive guide to the potential, challenges, and effective practices for integrating AI into educational assessments. Drawing from the theory of distributed intelligence, I envision a harmonious integration of AI and human intelligence in educational settings (Pea, 1993; Salomon, 1997). Upholding principle like "Human in the Loop" (Mosqueira-Rey et al., 2023), this synergy promises to revolutionize educational assessments. The collaborative dynamic between human and AI in the realm of education is an emerging area of research. As the relationship between AI and education garners attention, it's pivotal to view AI as extensions of human cognitive abilities, not mere adjuncts (Pea, 1993). With the potential to reshape cognitive functions, as postulated by Pea and Kurland (1987), understanding this relationship becomes crucial when aiming to broaden educational assessment horizons. While tools like generative AI offer valuable assistance in assessment design and interpretation, educators must be equipped to harness their capabilities effectively. This research underscores the need for a holistic approach, emphasizing that AI's influence transcends technical aspects, with human agency's emotional, cognitive, and ethical dimensions remaining central (De Cremer & Narayanan, 2023; Sundar, 2020). This study is to harmonize AI with human experts’ knowledge in designing knowledge-in-use assessments. Facing the challenges of designing assessments that reflect nuanced knowledge applications and cater to diverse student needs, this study explores how AI can be integrated with human expertise to yield innovative solutions. The study delves beyond just the technical, emphasizing the ethical implications of ensuring that AI enhances rather than overshadows human agency. The goal is to strike a balance between AI's prowess and human-driven pedagogy, anchoring the research in principles of 5 equitable and meaningful education. Ultimately, this study aims to offer a nuanced, evidence-based framework for envisioning and developing knowledge-in-use assessments in an evolving, AI-influenced educational landscape. 1.2 Research Questions (RQs) This study explores three major questions: RQ1. How can generative AI models be effectively and iteratively trained to design knowledge- in-use assessments? RQ2. How do human experts across different disciplines evaluate the AI-generated knowledge- in-use assessments, and what refinements do they suggest? RQ3. What is the process of refining AI-designed knowledge-in-use assessments based on the feedback provided by human experts? 6 CHAPTER 2: LITERATURE REVIEW AND THEORETICAL FRAMEWORK To better understand the landscape and key construct of this study, I reviewed the relevant studies about “knowledge-in-use”, “AI for education”, “Human-AI collaboration” and their respective theoretical foundations to set up a better understanding of the current landscape of these research fields and how my study fills the gap of leveraging AI in science education to develop knowledge-in-use assessment. This section also presents a theoretical framework about how humans and the AI machine can collaborate with each other to augment human intelligence in designing knowledge-in-use assessment. 
2.1 Meaning of Knowledge-In-Use Proficiency In the context of modern challenges such as food scarcity, pandemics, and climate change, it is essential for citizens to possess scientific knowledge to make informed decisions, support policy changes, and understand the consequences of inaction (Anderson et al., 2020; NRC, 2012; NRC, 2011; OECD, 2019). To develop a science-literate citizenry, educators need to focus on what students should ultimately know (big ideas) and be able to do (scientific practices), and create learning environments that support this integrated proficiency (NRC, 2012). Consequently, the goals of science education worldwide have shifted towards knowledge-in-use learning objectives (Kulgemeyer & Schecker, 2014; NRC, 2012; People’s Republic of China Ministry of Education, 2017). Knowledge-in-use demands that students apply their knowledge by making sense of real-world phenomena, solving complex problems, and making informed decisions (NRC, 2012; NASEM, 2019; Pellegrino & Hilton, 2012). The concept of knowledge- in-use reflects a growing awareness among learning scientists, science educators, and policymakers about the skills required for global citizens in the 21st century (OECD, 2019). It suggests that knowledge is a product of the activities, context, and culture in which it is developed and used (Brown et al., 1989; Bonwell & Eison, 1991), and posits that individuals actively participate in the creation of their own knowledge (Schreiber & Valle, 2013). Instead of merely acquiring knowledge from teachers or textbooks, knowledge-in-use emphasizes the application of scientific knowledge to make sense of natural phenomena and solve complex, authentic problems, as promoted in The Framework (NRC, 2012). This 7 approach allows students to explain new real-world phenomena or solve complex problems by applying their learning (NRC, 2012; NGSS Lead States, 2013). The development of knowledge-in-use has been a significant focus in cognitive science and science education. From a cognitive science perspective, there is a strong link between knowledge-in-use and adaptive skills (Li et al., 2023; Ward et al., 2018), highlighting the need for cognitive abilities to be adaptive, flexible, and context-sensitive. Knowledge-in-use primarily involves applying previously mastered knowledge and skills to new situations (Bransford & Schwartz, 1999). It is akin to adaptive skills, which emphasize the learning process and the continual adjustment of one’s approach in varied contexts (Hatano & Oura, 2003). Unlike transferable knowledge, which depends on contextual similarities for both far and near transfers (Ruiz-Primo et al., 2002), adaptive skills equip learners to handle unknown situations even without directly relevant prior knowledge. This aligns with the goals of knowledge-in-use, such as problem-solving and decision-making, fostering specific adaptive skills like flexibility, resilience, and metacognition (Spiro et al., 2017; Sternberg & Kaufman, 1998). 2.2 Supporting Knowledge-In-Use One of the most effective strategies for helping learners adapt to novel situations is equipping them with the appropriate knowledge and skills to tackle and solve complex real-life problems (Brown & Duguid, 1993). This can be achieved by designing learning environments that provide authentic activities for novices to experience expert performance, offer scaffolding at crucial moments, support cooperative knowledge building, and include monitoring features (Herrington & Oliver, 1995, 2000). 
This approach is exemplified when learners apply their scientific understanding to make sense of phenomena or solve intricate problems, encapsulating adaptive skills, transferable knowledge, and cognitive flexibility (Mensah & Chen, 2022; Spiro et al., 2018; Ward et al., 2018). However, developing such proficiency is a gradual process that requires continuous exposure to disciplinary experiences involving open-ended, unresolved problems (Esposito & Bauer, 2017). The Framework and the NGSS advocate a three- dimensional (3D) learning approach to explain relevant phenomena and provide solutions to complex problems, proposing performance goals that develop knowledge-in-use across Grades K-12 (NASEM, 8 2022; NASEM, 2019; NRC, 2012). Despite the recognized value of 3D learning, it presents operational challenges for teachers (Penuel et al., 2015). Teachers must adapt their teaching and assessment practices, conceptualize learning as a trajectory toward generative ideas, and nurture the use of scientific practices. Situated learning, one of the tactical foundations of 3D learning, posits that knowledge is partly a product of the activities, context, and culture in which it is developed and used (Brown, Collins, & Duguid, 1989).Situated learning theory suggests that learning environments should foster learners' participation in inquiry and support the development of their personal identities as capable and confident learners and knowers. Curricula should be designed to sequence learning activities with attention to students' progress in various disciplinary practices of discourse and representation. Learning activities should focus on meaningful, problematic situations that resonate with students' experiences and show how concepts and methods of subject-matter disciplines are embedded. This also requires the knowledge- in-use assessment to be designed to capture the situated and application nature of 3D learning. Research in the learning sciences (Krajcik et al., 2023; NRC, 2012) has shown that the most effective learning occurs when it is situated in authentic contexts. 2.3 Measuring Knowledge-In-Use Given the complex cognitive nature of knowledge-in-use, measuring it presents significant challenges. However, understanding students' performance on knowledge-in-use activities and tasks is crucial for effective teaching and learning. Assessment, as a component of any educational system, plays a vital role in diagnosing, monitoring, and promoting students' development of knowledge-in-use in science learning (NRC, 2014). The intricate nature of knowledge-in-use constructs makes assessment design and validation particularly challenging (NRC, 2012; 2014). To address these challenges, the National Research Council's (2014) report, "Developing Assessment for the Next Generation Science Standards," recommended evidence-centered design (ECD) as the cognitive foundation for developing knowledge-in-use assessments (NRC, 2001). Several prominent research groups have made significant efforts to design classroom assessments for knowledge-in-use using principled design approaches (Harris, He et al., under review; Krajcik, & Pellegrino, 2023; Osborne & Wertheim, 2019; Penuel et al., 2019), 9 such as the ECD approach (Mislevy & Haertel, 2006) and the construct-modeling approach (Wilson et al., 2005). For example, the Next Generation Science Assessment (NGSA) project applies a modified ECD design process to articulate a systematic design approach (Harris et al., 2019; 2023). 
The design and validation of assessments involve ensuring their reliability, validity, and fairness. ECD emphasizes aligning assessment tasks with the knowledge and skills to be measured, ensuring that the assessment provides valid evidence of student learning (Pellegrino, 2014). Validity frameworks, including content validity, construct validity, and criterion-related validity, ensure that assessments accurately measure what they are intended to measure and are fair to all students (Coffey, Black, & Atkin, 2001). Mark Wilson's work on the development and application of assessment frameworks has significantly contributed to this field. Wilson (2005) emphasizes the importance of construct modeling and the BEAR Assessment System, which provides a structured approach to designing, implementing, and validating assessments. In the study, I emphasize assessment for learning to underline the critical role of using assessment formatively to support teaching and learning. 2.4 Designing Knowledge-in-use Assessments Designing effective knowledge-in-use assessments presents several challenges. One of the main challenges is ensuring that the tasks are accessible and engaging for all students, regardless of their background or prior knowledge. This requires careful consideration of language appropriateness, cultural relevance, and the inclusion of compelling and relatable phenomena (National Research Council, 2012). Additionally, equity considerations must be addressed to ensure that all students have an equal opportunity to demonstrate their understanding and abilities. The National Research Council (2012) highlights the importance of designing assessments that are inclusive and culturally responsive. 2.4.1 Assessment Frameworks and Theoretical Foundations There are different perspectives on developing formative assessment for knowledge-in-use, including sociocognitive and sociocultural perspectives. Sociocognitive approaches assess students' understanding and skills as they engage in increasingly sophisticated practices typical of disciplinary experts. This method is grounded in the belief that thinking and learning are inherently social activities, 10 and thus, assessments are based on "local instructional theories" of learning. These theories involve creating sequences of instructional activities tailored to support a specific group of students in developing proficiency (Bakker & Gravemeijer, 2004). In both scenarios, assessment materials are designed for specific content areas, considering the typical challenges students face and common strategies to help them progress. Sociocognitive strategies not only measure students' content knowledge but also aim to help them adopt the dispositions and identities of their field of study. This approach favors assessment practices such as collaborative inquiry, expertly facilitated questioning, discussion, and qualitative feedback, allowing teachers to observe how students are acting, thinking, and reasoning in disciplinary ways (Penuel et al., 2017; Smith et al., 2006). The strengths of the sociocognitive approach lie in its discipline-specific learning goals and a well-defined learning theory (Penuel & Shepard, 2016). Instead of merely reporting the number of correct answers, this approach reveals how students think about and solve specific problems, providing teachers with valuable insights into students' knowledge, confusions, and immediate learning needs. 
However, developing such fine-grained, subject-specific assessment tools requires significant expertise and resources, making them less accessible to many schools and districts, especially for certain topics and grade levels. Additionally, these tools focus on assessing students' understanding of specific subject matter without considering their diverse values, experiences, and personal goals. This limitation raises concerns about whether these tools benefit all students equitably, particularly regarding their racial, ethnic, or gender identities, emerging bilingualism, conditions of poverty, and other critical aspects of their lives. Sociocultural perspectives share many foundational ideas with the sociocognitive approach, such as the social nature of learning and the importance of student engagement in disciplinary practices. However, they differ significantly in how they address student diversity. Sociocultural theories of learning emphasize that students bring valuable knowledge and interests from their personal and community backgrounds into the classroom. Instead of disregarding these experiences, teachers should help students reflect on how the school's ways of knowing, doing, and being relate to those valued in their 11 families and communities (Bang & Medin, 2010). This broad understanding and acceptance are essential for implementing and sustaining equitable assessment practices in diverse educational settings. 2.4.2 Critical Aspects for Designing Knowledge-In-Use Assessment Designing assessments in science education involves several critical factors to ensure that assessments are effective, equitable, and capable of measuring students' higher-order thinking skills. To support equitable learning and teaching, assessments must be designed with inclusivity and accessibility in mind. This involves developing tasks that are not only challenging and engaging but also accessible to students from diverse backgrounds. Equity considerations include ensuring that the assessment tasks do not disadvantage any group of students and that they provide adequate support for all learners to understand and engage with the phenomena being assessed (Pellegrino & Hilton, 2012). Pellegrino and Hilton (2012) suggest that equitable assessments should accommodate diverse learning styles and provide multiple means of demonstrating understanding. Thus, equity is a paramount consideration in assessment design, ensuring that all students, regardless of their background, have equal opportunities to demonstrate their knowledge and skills. Equitable assessments address potential biases and barriers that might disadvantage certain groups of students. This includes considerations for cultural relevance, socioeconomic status, and varying levels of prior knowledge (Pellegrino & Hilton, 2012). Language is a critical factor, particularly for English language learners (ELLs). Assessments must ensure that language complexity does not impede students' ability to demonstrate their understanding of scientific concepts. This includes using clear, concise language and providing support such as glossaries or translated materials when necessary (Lee, Quinn, & Valdés, 2013). Engagement refers to the extent to which assessment tasks are interesting and relevant to students. Engaging assessments motivate students to perform their best and provide more accurate measures of their abilities. Engaging tasks often involve real-world problems and scenarios that are meaningful to students (Hidi & Renninger, 2006). 
Accessibility ensures that all students, including those with disabilities, can participate meaningfully in the assessment process. The National Research Council's (2014) report, "Developing Assessment for the Next Generation 12 Science Standards," recommended ECD as the cognitive foundation for developing knowledge-in-use assessments (NRC, 2001). Several prominent research groups have made significant efforts to design classroom assessments for knowledge-in-use using principled design approach (He et al., under review; Osborne & Wertheim, 2019; Penuel et al., 2019), such as the ECD approach (Mislevy & Haertel, 2006) and the construct-modeling approach (Wilson et al., 2005). For example, the NGSA project applies ECD design principles to articulate a systematic design approach (Harris et al., 2023). 2.4.3 The NGSA Approach to Design Knowledge-In-Use Assessment In pursuit of the objective to design assessments for learning, given the systematic approach to design knowledge-in-use assessment, the NGSA design process (Harris et al., 2019; Figure 2-1) is employed, initiating with the deconstruction of the NGSS PEs. This process encompasses three primary phases. Figure 2-1. Overview of the assessment design process (from Harris et al., 2019) Within the Next Generation Science Standards (NGSS), there are three distinct and equally important dimensions to learning science. These dimensions are combined to form each standard—or performance expectation (PE)—and each dimension works with the other two to help students build a cohesive understanding of science over time. The Domain Analysis phase concentrates on breaking down the broad PEs into manageable components that facilitate the creation of more detailed learning performances. PEs are comprehensive statements outlining the knowledge and skills students should 13 possess at the end of a grade or grade band. They originate from the Framework for K-12 Science Education (enhanced forth, the Framework, National Research Council, 2012), embodying a vast scope. Each PE is structured as a singular statement encapsulating competencies at a large grain size, without delving into the underlying specifics. PEs are inherently three-dimensional, always incorporating a Disciplinary Core Idea (DCI), a Science and Engineering Practice (SEP), and a Crosscutting Concept (CCC) as outlined in the Framework. The elements of each dimension, elaborated further in the NGSS, vary across different PEs within a grade level or band, growing in complexity as students advance through K-12. The integration of the dimensions in PEs depicts the application of DCIs and CCCs through engagement in SEPs for understanding phenomena or solving problems, not meant to be isolated from each other. I use a middle school physical science PE as an example to explain its’ structure and why it needs elaboration. MS-PS1-2: Matter and Its Interactions1. The PE is “Analyze and interpret data on the properties of substances before and after the substances interact to determine if a chemical reaction has occurred. [Clarification Statement: Examples of reactions could include burning sugar or steel wool, fat reacting with sodium hydroxide, and mixing zinc with hydrogen chloride.] [Assessment boundary: Assessment is limited to analysis of the following properties: density, melting point, boiling point, solubility, flammability, and odor.]” MS stands for middle school level. 
PS refers to “Physical Science,” which is one of the four major domains [Life Science (LS), Earth and Space Science (ES), and Engineering] included in the NGSS. The numbers "1-2" indicate that this is the second performance expectation within the first major topic of Physical Science. A concise PE statement, like MS-PS1-2, encompasses numerous concepts; it is essential not to overlook the intricacies of each dimension within a PE. The apparent simplicity of a PE, such as MS-PS1- 2, belies its depth – for instance, the DCI aspect necessitates applying knowledge about substance properties for identifying chemical reactions. The CCC element, while not explicit in the PE statement, is 1 https://www.nextgenscience.org/pe/ms-ps1-2-matter-and-its-interactions 14 crucial for pattern recognition and reasoning. The SEP explicitly focuses on data analysis for distinguishing similarities and differences before and after substance interactions. Unpacking these dimensions reveals the extensive range encompassed within a single PE statement, indicating the need for comprehensive and sequential learning experiences and assessments to progress towards achieving these multifaceted PEs. The NGSA design process is instrumental in dissecting and identifying the significant components of PEs suitable for classroom-based assessment, focusing on constructing detailed learning performances from these components. 2.4.4 Evidence-Centered Design in Understanding Student Learning The process of drawing conclusions from assessments fundamentally relies on evidence-centered design. Pioneered by assessment experts like Robert Mislevy and colleagues (Mislevy, Steinberg & Almond, 2003; Mislevy & Haertel, 2006), ECD prioritizes establishing learning objectives and identifying the necessary evidence to judge student performance against these objectives, subsequently defining task features to elicit this evidence. Central to ECD is the goal of substantiating claims about students' knowledge and abilities with collected evidence, typically manifested through student responses to assessment tasks. Following its introduction two decades ago, ECD has garnered significant attention in education for its principled approach to assessment design. Notably, post-NGSS release, the National Research Council (2014) advocated for ECD-aligned assessment designs to accurately measure three-dimensional learning. ECD's argumentative reasoning is integral to NGSA's design process, particularly in developing tasks that provide evidence of students' three-dimensional performance and route to meeting PEs. The NGSA design process, depicted in Figure 2-1, guides the utilization of PEs as the basis for developing three-dimensional assessment tasks for classroom application, enhancing NGSS teaching and learning. The subsequent sections detail each step of the NGSA design process, elucidating the methodologies for selecting performance expectations, unpacking NGSS dimensions, mapping dimensions, articulating learning performances, specifying design blueprints, and constructing tasks and 15 rubrics. These steps collectively ensure the creation of effective, equitable, and inclusive assessment tools aligned with NGSS standards. Step 1: Selecting Performance Expectations The initial step involves choosing a target PE or a coherent bundle of PEs suitable for classroom instruction. The selection should align with instructional content, enabling students to progressively build the necessary knowledge and skills required by the PE or bundle. 
A strong correlation between the PEs and instructional activities ensures their appropriateness as focal points for developing three-dimensional assessment tasks. Unpacking entails a thorough exploration of each dimension's proficiency aspects, highlighting intersections between dimensions and considering additional SEPs and CCCs that might productively contribute to achieving the PE or bundle. This step also involves considering students' prior knowledge, potential challenges with the dimensions, equity and inclusion issues, and identifying relevant phenomena and realistic scenarios that can motivate and engage students. For example, unpacking MS-PS1-2 would include delving into the SEP of Analyzing and Interpreting Data, DCI elements related to the Structure and Properties of Matter, and the CCC of Patterns. This comprehensive unpacking process establishes a clear understanding of the required knowledge depth and scope for each dimension at the given grade level. Step 2: Unpacking the NGSS Dimensions A critical aspect in assessment design is understanding the depth of a PE, which often extends beyond its one-sentence statement. Unpacking is essential for revealing all components involved in a PE or PE bundle. This step is invaluable as it allows designers to detail the specifics of the three dimensions and the required student proficiencies in each. Documenting the results of unpacking provides a reliable reference throughout the design process, ensuring key design decisions are supported and verified. Step 3: Mapping the Dimensions The third step in the NGSA design process uses the detailed information from unpacking to create what is termed an "integrated dimension map." This map visually represents the key relationships among 16 the DCI, SEP, and CCC, synthesizing the unpacked information to illustrate the most significant and productive intersections. The mapping process is akin to constructing a concept map, depicting the key sub-ideas and their interrelations, thereby forming a comprehensive visual guide for achieving the target PE or bundle. Step 4: Crafting Inclusive Learning Performances In this step, learning performances are articulated, drawing from the integrated dimension map. These performances are crafted as specific knowledge-in-use statements that are narrower in scope, covering distinct areas of the map. The learning performances are the claims in the ECD argument. Each learning performance, like a PE, is structured to be three-dimensional, ensuring that students apply their knowledge in practical contexts. The process involves integrating various SEPs and CCCs with DCI elements, offering students diverse ways to engage with the content and demonstrate their understanding. The aim is to cover the entire scope of a PE or bundle through a set of complementary learning performances. Step 5: Developing Design Blueprints Design blueprints are utilized to guide the development of assessment tasks aligned with each learning performance. These blueprints document the essential and variable task features, as well as equity and inclusion considerations, ensuring that tasks are both comprehensive and accessible. Characteristic task features describe the attributes that are common across all the tasks for a learning performance. Variable task features describe the features that can vary across tasks, such as the level of scaffolding to vary task difficulty. 
Both types of task features include equity/fairness considerations to help ensure that our tasks are accessible and fair to students of diverse cultural, linguistic, and socioeconomic backgrounds. The blueprints also include evidence statements that articulate the observable features of student performance that can provide evidence of a high-level demonstration of the learning performance, and we use these to inform the development of both tasks and scoring rubrics. The blueprints answer critical design questions, such as what students should know and be able to do, the evidence needed to demonstrate this knowledge, and how to construct tasks that are inclusive and fair. 17 This step is vital for creating a diverse range of tasks that are consistent in quality and aligned with the learning objectives. Step 6: Constructing Tasks and Rubrics The final step involves the actual construction of assessment tasks and rubrics based on the design blueprints. This process includes selecting phenomena or problems that are relevant and engaging, creating scenarios that prompt students to apply their knowledge, and writing task prompts that elicit integrated three-dimensional responses. The development of rubrics is an integral part of this step, providing a framework for evaluating student responses and ensuring that they reflect the multidimensional nature of the learning performances. Throughout these steps, there is a constant emphasis on considering the diverse backgrounds and experiences of students. This includes using language and scenarios that are relatable and accessible, reducing bias in task content, and providing scaffolds where necessary to support all students in demonstrating their knowledge and skills. By incorporating these principles, the NGSA design process ensures that assessment tasks are not only effective in measuring student understanding but also inclusive and equitable, catering to the needs of a diverse student population. In all, the NGSA design process, as outlined, represents a comprehensive approach to developing assessments in science education. It emphasizes the integration of knowledge dimensions, evidence-based design, and a strong commitment to equity and inclusion. This process ensures that assessments are not only aligned with educational standards but also responsive to the diverse needs and abilities of all students, fostering an inclusive and equitable learning environment. 2.4.5 Challenges in Measuring Knowledge-In-Use Proficiency The global education paradigm is shifting from traditional rote learning to a focus on fostering adaptive thinking and championing knowledge-in-use (NRC, 2012; Pellegrino & Hilton, 2012). Consequently, the community is obligated to explore pioneering strategies for creating appropriate assessment tasks that capture students' knowledge-in-use and, importantly, determine methods to utilize these tasks to enhance deep science learning (Li et al., 2024). The design of performance-based tasks 18 presents significant challenges, which requires students to apply their knowledge and experience to solving novel context problems or explaining real-world phenomena, presents significant challenges (He et al., 2023). Additionally, it is often laborious and time-intensive for educators to analyze students' constructed responses to these tasks (Li et al., 2023a). Furthermore, due to the constructed and formative nature of these assessments, they frequently illuminate students' diverse learning trajectories and needs. 
Moreover, they require students to apply their knowledge in new scenarios. Although critical to developing knowledge, analyzing students’ 3D responses becomes intricate and time-consuming. These assessments reveal the various paths students take in their learning, which demands that teachers reconceptualize assessments to cater to diverse student backgrounds (particularly from minoritized and marginalized racial and ethnic groups), underscoring the critical need to empower teachers with robust assessment design skills. Designing assessments to capture the complex cognitive construct of knowledge-in-use remains challenging for the field (He et al., 2023). 2.5 Artificial Intelligence and Assessment 2.5.1 The Origins and Evolution of AI The advent of big data, cloud computing, artificial neural networks, and machine learning has enabled the development of machines capable of mimicking human intelligence. These technologies underpin the creation of systems that can perceive, recognize, learn, respond, and solve problems, collectively known as artificial intelligence (AI) (Kumar & Thakur, 2012; Spector, Polson, & Muraida, 1993). These advanced technologies are set to revolutionize future workplaces (Lawler & Rushby, 2013). AI, with its capability to interact with and assist humans in performing complex tasks, is being recognized as a major disruptive innovation (Seldon & Abidoye, 2018). Often seen as a critical component of the fourth industrial revolution, AI also has the potential to initiate a significant transformation in the educational sector. Integrating AI into school curricula has already begun (Dai et al., 2020; Knox, 2020). However, similar to how television and computers were initially perceived as groundbreaking for education, AI’s role is likely to enhance information accessibility without fundamentally altering the core educational practices. AI is defined as the capability of digital machines to perform tasks that typically 19 require human intelligence. These technologies span various fields, including computer vision, speech recognition, machine learning, big data, and natural language processing (Chiu, 2021; Chiu et al., 2022; Xia et al., 2022). The rapid expansion of AI is profoundly altering how people interact, communicate, live, learn, and work (Chiu, 2021; Chiu et al., 2022; Xia et al., 2022; Pedró et al., 2019). In the context of education, AI in education (AIEd) refers to the application of AI technologies such as intelligent tutoring systems, chatbots, robots, and automated assessments to support and enhance educational processes. AIEd holds significant promise for improving learning, teaching, assessment, and educational administration by providing personalized and adaptive learning experiences, enhancing teachers' understanding of student learning processes, and enabling anywhere, anytime machine-supported queries and immediate feedback. Consequently, AIEd is driving an evolution in teaching practices and program development, making it a crucial area for educational research. 2.5.2 AI and Science Assessments In science education, AI has been used primarily for the automated assessment of student-written text data, which is common because science educators often use open-ended items to assess students' explanations of phenomena (Liu et al., 2016; Shin & Shim, 2021). Initial studies demonstrated the feasibility of using machine learning models to assess student responses in large-scale classroom assessments, detached from teaching and learning contexts. 
For instance, automated assessments have been used to evaluate student concepts of natural selection (Ha & Nehm, 2016), climate change (Zhu et al., 2017, 2020), and acid-base reactions (Haudek et al., 2012). Recent research has also explored automated assessment of student-generated hand drawings and written responses about the particulate nature of matter (Lee et al., 2023). Recent advancements in automated assessment in science education have expanded to focus on students' application of knowledge in scientific practices. These advancements aim to provide individualized feedback and support learning through appropriate instructional interventions (Ha et al., 2019; Zhu et al., 2020). Although there are concerns about the socio-cultural and linguistic sensitivity of AI assessments (Li et al., 2023), the practice of integrating AI with formative assessment is becoming widespread (Li et al., 2023, 2024). This research indicates that AI integration in 20 classroom assessment is significantly impacting science teaching and learning. Since the release of ChatGPT, research has highlighted increasing opportunities to use AI in science learning beyond assessment. For example, reviews also suggest that AI chatbots offer opportunities for learners to interact with AI to gain knowledge (Kuhail et al., 2023). This evidence suggests that AI can collaborate with humans to play a critical role in science learning and teaching. 2.5.3 Addressing Gaps and Harnessing AI’s Potential in Assessments Despite the significant advancements in AI and learning science, a systematic approach to using AI technologies in developing and implementing knowledge-in-use assessment tasks in science education remains elusive (NRC, 2012b; Pellegrino & Hilton, 2012). Formative assessments, especially those that require automation and optimization, are complex. They need a deep understanding of how students think and learn (cognitive processes) and how they plan, monitor, and assess their understanding and performance (metacognitive processes). AI systems need to be sophisticated enough to understand these intricate aspects of learning to be truly effective in educational settings. Therefore, a principal challenge resides in the successful translation of AI advancements into pedagogically sound practices for creating, interpreting, or analyzing tasks to provide feedback, and utilizing assessments that support and evaluate knowledge-in-use proficiency. It's crucial to shift the focus of assessment from merely evaluating students to facilitating their learning. To achieve this, teachers need robust support in crafting tailored materials that address the unique needs of each learner. This transition not only requires a change in the way assessments are designed but also underscores the need for teachers to have resources and guidance to effectively adapt to diverse learning styles and challenges. Herein lies the potential of AI. This tool can substantially aid teachers in designing, interpreting, and leveraging assessments that enhance student learning. However, a challenge persists: teachers may lack the requisite knowledge to efficiently utilize AI to provide support tailored to their specific needs. Thus, this paper delves into the iterative process of training generative AI to design, analyze, and utilize performance-based knowledge-in-use assessments as a lever for students' deep science learning. 
By doing so, I aim to pave the way for more inclusive and 21 effective pedagogical practices, harnessing the power of AI in augmenting human intelligence and fostering students’ proficiency in knowledge-in-use. Presently, most AI solutions today focus on automating processes, often overlooking their potential role in educational models (Sanusi et al., 2024). This overlooks the need for a deep understanding of pedagogy and insight into learners' cognitive processes, especially when automating tasks. The challenge is translating AI advancements into pedagogically sound practices for creating and using knowledge-in-use assessments. With the shift towards assessments that focus on learning rather than evaluation, educators seek tools and guidance tailored to diverse student needs. AI emerges as a promising tool to aid educators in designing and interpreting these assessments. I presented the complicated design process of designing these types of assessment tasks. However, there is a hurdle: educators might not be familiar with using AI effectively for their specific needs. This study aims to hone generative AI's capabilities in designing assessments to enhance deep science learning. The goal is to utilize AI's potential to supplement human intelligence, fostering more inclusive and effective educational strategies and enhancing knowledge-in-use. 2.6 Human-AI Collaboration in Education Human-AI collaboration presents opportunities to address the challenges. Despite promoting advancements in teaching and learning, AI technologies should primarily aim to enhance human capacities rather than merely replace human tasks (Hwang et al., 2020; Pedró et al., 2019). While AI excels in logical decision-making, it cannot emulate human perceptions, emotions, and cognitions (Yang et al., 2024). Thus, integrating human intelligence with machine intelligence may aid in transitioning towards a human-centered AI involves perceiving AI from a human perspective and acknowledging the multifaceted attributes and contexts of humans. Human-Computer Interaction (HCI) has been a foundational area of research for decades. Berg (2000) notes that traditional HCI studies emphasized human factors, usability, and interface design, highlighting the computer primarily as a medium. This paradigm shifted significantly with the advent of AI, which has broadened the scope of interaction to include human-AI interaction (HAI) . The 22 psychological aspects of HCI were significantly developed by Card et al. (1983), who conceptualized the human mind as an information-processing system. This view laid the groundwork for understanding how users interact with computers and, by extension, AI systems. With the rise of AI, the literature has increasingly focused on HAI, reflecting a growing interest in how AI can augment human decision- making processes. Hybrid intelligence, where human-AI collaboration leverages the complementary strengths of both, is crucial for effective teaming. In educational contexts, researchers have explored human-AI collaboration to promote student-centered learning (Kim, 2023). 2.7 Theoretical Underpinnings of This Study In this section, based on the review above, I propose an adapted theoretical framework of human- AI collaboration to design knowledge-in-use assessment. To introduce the framework, I first discuss the differences between human intelligence and machine intelligence , then I define what kind of AI I use in my study with the definition of AI in my work. 
Finally, I introduce the theoretical framework of this study that guides the research design, data analysis and presentation. 2.7.1 Human Intelligence and Cognition Human intelligence is a multifaceted cognitive ability that encompasses various mental capacities, including reasoning, problem-solving, planning, abstract thinking, comprehension, and learning from experience. It involves both cognitive processes, such as working memory and long-term memory, and the ability to manage cognitive load effectively (Baddeley, 2000; Sweller, 1988). Human intelligence is characterized by its complexity and adaptability, enabling individuals to handle ill-defined problems requiring flexibility and creativity (Sternberg, 1985). Human intelligence and AI differ fundamentally in their nature and functioning. Human intelligence is characterized by its flexibility, adaptability, and emotional depth. It encompasses not only cognitive abilities but also emotional and social intelligence, enabling humans to navigate complex social interactions and emotional landscapes. In contrast, artificial intelligence is a product of human design and programming, aimed at replicating specific cognitive tasks. AI operates based on algorithms and data processing, excelling in tasks that require pattern recognition, data analysis, and computational efficiency. 23 AI systems are not inherently capable of emotional understanding or subjective experiences. They rely on large datasets and computational power to learn and improve, lacking the innate curiosity and creativity that drive human learning. While AI can surpass human performance in certain tasks, it lacks the holistic understanding and consciousness that characterize human intelligence. 2.7.2 Artificial Machine Intelligence and Relational Epistemology The branch of intelligence focused on machines is referred to as AI. This encompasses systems that execute "activities that we associate with human thinking, activities such as decision-making, problem solving, learning" (Bellman, 1978). Despite the various definitions of AI, the overarching concept involves creating machines capable of achieving complex objectives. These objectives include natural language processing, object recognition, knowledge storage and application for problem-solving, and the ability to adapt and act within their environment through machine learning (Russell & Norvig, 2016). At its core, AI seeks to simulate human intelligence through computational methods. Alan Turing suggested that machines could perform tasks requiring human intelligence by automating calculations, a process that machines can execute much faster than humans (Turing, 1950). The famous Turing 'imitation game' posits that AI is achieved when distinguishing between a conversation with a human and a machine becomes impossible. Although the notion of the "Turing machine" has been critiqued (Searle, 1980), the core idea proposed by Turing remains compelling. Turing emphasized that the significance lies not in the inherent nature of the computer but in what a person perceives the computer to be. Inspired by Turing's notion, this study adopts a relational epistemology (Bearman & Ajjawi, 2022), conceptualizing AI based on human-technology interactions rather than the computational approach. AI is defined not by its technological features but by the context-bound relationship between humans and computational artifacts during specific interactions. 
This perspective emphasizes the sociomaterial production of knowledge, focusing on what technologies do rather than their intrinsic properties. This dynamic conceptualization of AI interactions depends heavily on the circumstances of their use. The relational epistemology proposes that knowledge exists between actors, meaning it is 24 contextualized within specific relationships between people, things, and spaces. This idea aligns with connectivism, which recognizes the interconnectedness of all entities (Siemens, 2005). A sociomaterial perspective further appreciates human and non-human actions and knowledge as entangled in systemic webs (Fenwick, 2010). Non-humans are seen as active participants, following Latour's (2007) view that actors are defined by their actions and impacts rather than their human qualities. For instance, Latour (1999) suggests that a 'speed bump' is an actor whose agency is expressed through its effect on traffic, causing drivers to slow down. Similarly, we consider AIs to be agentic but not sentient, understanding knowledge and knowing as products of the social dynamics involving objects and spaces (Foucault, 1963). This view aligns with Johnson and Verdicchio's (2017) conceptualization of AI systems as 'sociotechnical ensembles ... combinations of artifacts, human behavior, social arrangements, and meaning.' AI encompasses several distinctive features that enable it to perform cognitive tasks traditionally associated with human intelligence. One of AI's key characteristics is its ability to perform calculations at a speed and scale beyond human capacity. “They (AI) are much less than human intelligence—they can only calculate. And they are much more—they can calculate larger numbers and faster than humans.” (Cope et al., 2021). Claude Shannon's development of binary calculation using relay circuits laid the groundwork for modern computing, allowing AI to process vast amounts of data efficiently (Shannon, 1938). AI systems often employ machine learning and deep learning techniques to analyze and interpret data. Machine learning involves algorithms that identify patterns within data, while deep learning uses multilayered neural networks to recognize intricate patterns, requiring substantial data and computational power (Krizhevsky, Sutskever, & Hinton, 2012). Another essential feature of AI is its capability to name and categorize extensive datasets. This process involves representing real-world objects and concepts in binary form, enabling machines to recognize and process these entities more quickly than humans can (Cope & Kalantzis, 2020). AI's calculability allows it to handle large quantities of data swiftly, which is particularly useful in fields such as natural language processing and statistical modeling (Cope & Kalantzis, 2020). 25 AI systems also possess the ability to measure and interpret data through various sensors and data collection methods. This capability enables AI to gather real-time data and generate insights. For instance, in educational environments, AI can track student interactions to provide personalized feedback and adaptive learning pathways (Cope, Kalantzis, & Searsmith, 2021). Additionally, AI can represent information in multiple formats, such as text, images, sound, and videos, facilitating effective communication and data processing. Despite these advanced capabilities, AI significantly differs from human cognitive processes. Human intelligence involves context, understanding, and experiential learning, which AI lacks. 
AI's power lies in its ability to perform detailed and extensive calculations and process vast datasets, rather than understanding or experiencing the world as humans do (Cope, Kalantzis, & Searsmith, 2021). Human intelligence is based on biological neural networks, while AI operates on silicon-based digital systems, resulting in distinct operating principles and capabilities (Korteling et al., 2021). AI systems can process information at speeds far beyond human capabilities. Human nerve signals travel at most 120 m/s, whereas AI systems can operate at nearly the speed of light (Tegmark, 2018). Furthermore, human learning is influenced by biological and environmental factors, often requiring significant time and effort, while AI can rapidly learn from vast datasets and adapt through continuous training (Russell & Norvig, 2014). AI systems, when designed and validated appropriately, can mitigate human cognitive biases, providing more objective analyses ( Korteling et al., 2021). However, humans possess emotional and social intelligence, allowing for nuanced interpersonal interactions, whereas AI, despite advancements in natural language processing and affective computing, lacks genuine emotional understanding and social intelligence (Shneiderman, 2020). Additionally, human decisions are often explainable through introspection and communication, while AI decisions, particularly those made by deep learning models, can be less transparent, necessitating efforts to improve explainability (Cope, Kalantzis, & Searsmith, 2021; Shneiderman, 2020). 2.7.3 Hybrid Intelligence System and Human-AI Collaboration Human intelligence and artificial intelligence can complement each other in various ways, 26 creating synergistic effects that enhance capabilities in numerous fields. Human intelligence brings creativity, intuition, and emotional understanding to the table, which are areas where AI currently falls short. Humans can excel at making sense of ambiguous and novel situations, understanding context, and applying ethical considerations to decision-making. On the other hand, AI can process and analyze vast amounts of data at unprecedented speeds, identify patterns that might elude human analysts, and perform repetitive tasks with high precision and consistency. By leveraging AI, humans can enhance their decision-making processes, gain insights from complex data sets, and automate mundane tasks, freeing up time and cognitive resources for more strategic and creative endeavors. For instance, in healthcare, AI can assist in diagnosing diseases by analyzing medical images and patient data, while human doctors provide the necessary context, empathy, and ethical judgment in patient care, and more importantly, doctors judge the ambiguous cases that AI has challenges to judge. In education, AI can personalize learning experiences by adapting to individual student's needs, while teachers guide and mentor students, fostering critical thinking and emotional development. The collaboration between human intelligence and artificial intelligence holds the potential to revolutionize various sectors, driving innovation and improving efficiency. Rather than limiting human involvement to specific parts or times during the creation of machine learning models, real-world problem-solving applications require a continuous socio-technological collaboration between humans and machines. This approach contrasts with earlier research on decision support and expert systems (Gregor, 2001; Holzinger, 2016). Dellermann et al. 
(2021) argue that the most likely paradigm for the future division of labor between humans and machines is hybrid intelligence. This concept leverages the complementary strengths of human intelligence and AI, enabling them to function more intelligently together than separately (Kamar, 2016). The fundamental rationale is to merge the complementary strengths of heterogeneous intelligences (i.e., human and artificial agents) into a socio- technological ensemble. Hybrid intelligence systems (HIS) are envisioned as those capable of achieving complex goals by combining human and artificial intelligence to collectively achieve superior results than either could 27 independently, continuously improving through mutual learning (Dellermann et al., 2021). Tasks are performed collectively, meaning that while the activities conducted by each part are interdependent, they are not necessarily always aligned to achieve a common goal, such as teaching an AI adversarial tasks like playing games. The system achieves a performance level that none of the involved actors could have achieved alone (superior results). The goal is to make the outcome, such as a prediction, more efficient and effective at the socio-technical system level by achieving goals that were previously unattainable. Over time, the socio-technological system improves as a whole, with each component (i.e., humans and machines) learning from each other’s experiences, thus enhancing performance in specific tasks (continuous learning). The performance of such systems is measured not only by the superior outcomes of the entire system but also by the learning progress of the human and machine agents within the socio- technical system. The concept of hybrid intelligence systems thus envisions socio-technical ensembles where human and AI components co-evolve to improve over time. The HIS perspective reflects the idea of human–computer interaction (HCI). While extensive research has been conducted on general HCI aspects such as human factors, usability, and interface design, educational HCI studies have traditionally emphasized the computer as a medium (Berg, 2000). Card et al. (1983) laid the groundwork for the psychology of HCI by conceptualizing the human mind as an information-processing system. With the advent of AI technology, research attention has shifted toward human-AI interaction or human-centered AI (HAI) (Stanford HAI, 2020). Lai et al. (2021) reviewed over 80 empirical studies on human-AI decision-making across various fields, including education, and noted a substantial increase in publications on human-AI interaction and decision-making post-2010. The number of relevant papers surged from fewer than 100 every two years before 2016 to over 1000 per topic by 2020. Decision tasks such as predicting student performance, admissions, dropouts, and answering law school admission test questions have been particularly prevalent. HAI can be interpreted from two perspectives: AI under human control and AI on the human condition. Shneiderman (2020) discusses AI under human control, where AI systems are judged based on the degree of human oversight. At one end of this spectrum is AI that operates entirely under human 28 control, merely assisting with automation. At the other end is AI that operates autonomously, making decisions independently. Human-controlled AI leverages the collaboration between human oversight and AI automation to enhance human productivity, ensuring high levels of reliability, safety, and trust (Shneiderman, 2020). 
The second perspective, AI on the human condition, is discussed by Stanford HAI (2020). This approach reflects on the design of AI algorithms with humanity as the central consideration. AI on the human condition emphasizes the importance of creating AI systems that are explainable and interpretable, ensuring that their computational and judgment processes can be understood by humans. Additionally, these systems must continuously adjust their algorithms based on human context and societal phenomena. The goal is to augment human intelligence using machine intelligence, ultimately enhancing human welfare (Stanford HAI, 2020). I take the second perspective in this study about HAI. 2.7.4 Distributed Cognition Theory and HAI The HIS and HAI also reflect the theory of Distributed Cognition (Hutchins, 2000; Pea, 1993), which asserts that cognitive processes are shared and shaped between humans and their tools, highlighting a collaborative cognitive dynamic. It offers a framework to understand the symbiotic dynamics between educators and AI tools. It steers the research methodologies and interpretation, especially in recognizing how ChatGPT can function as an active participant in the cognitive ecosystem of educational assessments. Within the context of developing a domain-specific AI algorithm, Distributed Cognition emphasizes AI's active and integral role in shaping its design, testing, and optimization beyond mere computational augmentation. Hutchins' Distributed Cognition Theory (1995) posits that cognitive processes aren't singularly anchored but resonate across collective entities, both human and non-human. In conjunction, Roy Pea (1985, 1993) underscores the transformative role of digital tools as cognitive amplifiers that not only extend but also reshape human thinking and collaboration in educational contexts. At its core, Distributed Cognition serves as the theoretical bedrock guiding the structure and trajectory of this research. By adopting this lens, the study is explicitly oriented to capture the fluid interplay between diverse human experts and the sophisticated AI capabilities of ChatGPT. This perspective drives the research 29 methodologies: from the design of experimental setups that foster seamless collaboration to the selection of evaluative metrics that capture both individual and collective cognitive contributions. In practical terms, when studying the process of creating, evaluating, and refining knowledge-in-use assessments, the research actively looks for evidence of distributed cognitive dynamics. For instance, it doesn't just observe an educator’s individual input but examines how that input morphs when interfaced with AI suggestions or when juxtaposed with insights from another domain expert. The interventions, iterative refinements, and validations conducted in the study are all set up to capture these dynamic cognitive exchanges. Furthermore, the research's emphasis on ChatGPT is not just a tool, but a 'cognitive partner' finds its roots in Pea's observations. The AI’s role is conceived not merely as a passive repository or a computational enhancer but as an active agent in the cognitive matrix, shaping and being shaped in turn. By intertwining the precepts of Distributed Cognition and the insights from Roy Pea, this research champions a groundbreaking approach to understanding AI-human collaboration in educational settings. 
It strives for a nuanced appreciation of the cognitive orchestra that emerges when human expertise, in all its diverse richness, collaborates with the computational prowess of AI, promising a richer, more holistic outcome that transcends individual capabilities. This paradigm not only informs the study's foundational logic but also steers its empirical pursuits and interpretative analyses, setting a benchmark for future explorations in the realm of distributed cognitive research. 2.7.5 Interdisciplinary Collaborative Learning and HIS Hybrid intelligence systems often necessitate varying levels of expertise from the humans providing input. Traditionally, both research and practical applications have emphasized the importance of input from machine learning (ML) experts, requiring deep expertise in AI (Attenberg et al., 2015; Chakarov et al., 2016; Kulesza et al., 2010; Patel et al., 2019). Additionally, end users can contribute to product recommendations and e-commerce, or human non-experts can provide input through crowd work platforms (Chang et al., 2018; Nushi et al., 2017). More recent efforts focus on integrating domain experts into hybrid intelligence architectures. These experts use their deep understanding of the semantics of a problem domain to teach machines without needing extensive ML expertise (Dellermann et al., 2019; 30 Simard et al., 2017). The quantity of human input can range from individual contributions to aggregated input from multiple individuals. Individual input is often used in recommender systems for personalization or cost efficiency (Li et al., 2017). Conversely, collective human input aggregates the contributions of several individuals through mechanisms of human computation (Dellermann et al., 2019; Quinn & Bederson, 2011). This method helps reduce errors and biases inherent in individual inputs and aggregates diverse knowledge (Cheng et al., 2023; Dellermann et al., 2019). Aggregation can be tailored to individual characteristics (Dawid & Skene, 1979; Kamar et al., 2012; Kim & Ghahramani, 2012) or adjusted based on the teaching task (Kosinski et al., 2014; Raykar et al., 2010; Whitehill et al., 2009). This approach informs the design of studies involving expert panels with diverse expertise to collaboratively provide feedback on hybrid intelligence systems’ products. 2.7.6 Self-Regulated Learning and HAI In addressing complex and novel problems while maintaining system efficiency, it is crucial to emphasize the significant role of humans in the HAI process. Consequently, self-regulated learning (SRL) serves as an essential theoretical framework for understanding and enhancing the human learning process and actions within this context. SRL is defined as a goal-oriented process where learners make conscious decisions to achieve their learning objectives (Azevedo, 2015; Winne, 2018). Self-regulated learners utilize cognitive processes such as summarizing, rereading, and elaboration, and metacognitive processes like orientation, planning, monitoring, and evaluation to control their learning and motivate themselves (Greene & Azevedo, 2007). Research on SRL has shown that self-regulated learners are adaptive, engaging metacognitively, motivationally, and behaviorally in their learning (Schunk & Greene, 2018). These learners implement appropriate learning strategies, monitor their progress towards goals, and adjust their strategies and learning conditions when progress is insufficient (Winne & Hadwin, 1998). 
Effective self-regulating learners set learning goals to plan their activities and adjust strategies as needed to achieve these goals (Winne, 2017). They continuously monitor whether their actions are aiding progress towards their learning objectives (Azevedo, 2009). Zimmerman (2000) identified three phases in the self-regulated learning process: Forethought, Performance, and Self-reflection. In the forethought 31 phase, learners analyze tasks, set specific goals, and plan strategies. During the performance phase, they implement these strategies, monitor their progress, and receive feedback. In the self-reflection phase, learners evaluate the effectiveness of their strategies and make necessary adjustments. In my study, I incorporated Zimmerman (2000)’s three phase model and the COPES model (Winne & Hadwin, 1998) to understand the human cognitive conditions in the process of collaborating with AI to design knowledge-in-use tasks. COPES is an acronym representing conditions, operations, products, evaluations, and standards within a task completion framework. Conditions include the available resources and any constraints affecting the task, while standards are profiles of desired attributes refined through planning. Operations involve cognitive processes in working memory that transform information, ranging from innate, simple processes to more complex, acquired strategies. These operations generate products, which are evaluated against standards. Monitoring these comparisons is crucial and if discrepancies arise, it may lead to adjustments in the task, conditions, goals, and standards, or even to abandoning the task. Thus, the COPES model functions as a recursive, adaptable system in task management and learning. Järvelä, Nguyen, and Hadwin (2023) introduced a framework to operationalize human-AI collaboration, proposing a hybrid human-AI shared regulation in learning (HASRL) model (Figure 2-2). This model positions human and AI collaboration for socially shared regulation (SSRL) in learning, highlighting the synergy between humans and AI to improve learning regulation. Through empirical examples, they demonstrate how hybrid intelligence can enhance learning sciences research, arguing that combining human and AI strengths is vital for advancing this field. 32 Figure 2-2. Human-AI shared regulation in learning (HASRL) model from Järvelä et al. (2023) In their study, the HASRL framework is adapted to explore the collaborative potential of hybrid intelligence, leveraging the capabilities of humans and machines to design knowledge-in-use assessments. Human learners bring creative, flexible thinking, and long-term goal orientation to the process, while SRL provides a theoretical foundation for understanding the human-machine interaction in designing assessments. The framework (Figure 2-3) illustrates the interplay between human and AI components in a hybrid intelligent system. On the human side, during the Forethought phase, humans set the context, define the scope, purpose, and goals, including background information on NGSS, PE, DCI, CCC, and SEP. They design tasks for collaboration with AI. In the Performance phase, humans monitor learning progress and guide AI to reflect on task completion. During the Adaptation phase, humans reflect on goals and requirements, evaluate products, and decide on necessary adjustments, incorporating interdisciplinary feedback from experts throughout the collaboration process. 
The AI component, informed by Molenaar (2022), follows the detect-diagnose-act framework. In the detect phase, AI collects learning process data. In the diagnosis phase, AI assesses the current state and predicts future development of assessment tasks. In the act phase, AI implements plausible changes 33 based on the diagnosis, while also adjusting AI models to support human cognitive development. This ensures that AI systems not only respond to critical needs but also scaffold human cognitive competencies. The HHACI model provides a conceptual architecture for integrating technology developers and science educators to create AI-enabled solutions for designing knowledge-in-use assessment tasks. Figure 2-3. Hybrid human-AI collaborative model (HHACI) in complex task design There are several noteworthy aspects of this model. First, it is an iterative training model. Central to this is the idea that assessment design is not a linear process; rather, it is a complex tapestry woven together with numerous variables, including students' cognitive states, social-emotional needs, language competencies, and diverse cultural backgrounds. Integrating GPT into this complex environment does not merely add another variable but acts as a catalyst, potentially fostering innovative patterns of interaction and pedagogical strategies (Johnson, 2001). This synergy between educators, students, and GPT forms what Complexity Theory designates as a "complex adaptive system." In this dynamic setup, the principles of Complexity Theory are prominent, emphasizing the adaptability and fluidity required for effective educational outcomes (Byrne, 1998). 34 Informed by this, my research recognizes that shaping GPT for assessment is not only a multifaceted task, influenced by evolving student needs and educational contexts, but also one that strives for equity in assessment. This ensures that all students, irrespective of their backgrounds, have fair opportunities. To comprehensively address these complexities, an interdisciplinary panel of expert reviewers will be assembled. Additionally, central to the project's methodology is the commitment to iterative training, adaptation, and refinement of ChatGPT, with the goal of achieving both optimal and equitable educational outcomes. Second, due to the exploratory nature of this study, to better understand the black box, I emphasize human's ability to intentionally influence their functioning and life circumstances. Within my research, this theory illuminates how human experts actively shape AI's role in education rather than merely absorbing its outputs. Their feedback merges AI's potential with their deliberate cognitive tactics. Thus, this research perspective emphasizes the proactive collaboration between educators and students with AI, creating an environment where human intentionality coexists and flourishes with AI-enhanced capabilities. Informed by Bandura's Human Agency Theory (1989), this research underlines the salience of human capacity to shape one's circumstances and functions, a perspective that becomes paramount when exploring the GPT model's potential to amplify human cognitive faculties. Delving into the theory's core tenets — intentionality, forethought, self-reactiveness, and self-reflectiveness — offers a nuanced lens to understand the multifaceted human-AI interplay in designing knowledge-in-use assessments. 
The principle of intentionality emerges prominently in the research as educators proactively harness ChatGPT, showcasing a conscious choice rather than passive acquiescence. Central to the research's premise, this principle aligns with humans' purposeful engagement with ChatGPT. Their proactive involvement suggests a conscious decision to harness AI, rather than a passive acceptance, underscoring the act of choosing specific paths and outcomes in AI-mediated educational settings. Forethought, meanwhile, is exemplified in the study's forward-looking approach, moving beyond immediate requirements to anticipate the future trajectories of educational AI. Beyond mere immediacy, the research adopts a 35 visionary stance, enabled by the educator's strategic foresight. Guided by this principle, the study not only focuses on current pedagogical necessities but also aspires to anticipate and prepare for the evolving contours of educational AI. Lastly, given the knowledge-in-use assessment features, this study also is informed by the Cognitive Flexibility Theory (CFT, Spiro et al., 1992), emphasizing adaptive cognition in ill-structured domains, suggesting that true understanding necessitates multiple viewpoints. This study uses CFT to analyze interdisciplinary expert feedback on AI-designed knowledge-in-use assessments. In this research, the CFT offers an indispensable lens through which the complex construct of knowledge-in-use can be understood and analyzed. By leveraging the principles of CFT, this study endeavors to collaborate with AI, specifically in crafting assessment tasks that can aptly measure such a nuanced domain. The amalgamation of AI capabilities and the insights from CFT holds the promise of generating more refined, context-sensitive, and adaptive assessment tools that can capture the dynamism and depth of knowledge- in-use. 36 CHAPTER 3: STUDY DESIGN AND METHODOLOGY 3.1 Positionality and the Assessment Development Framework My involvement with the Next Generation Science Assessment (NGSA) project provided experience in designing knowledge-in-use assessments. I adopt this principled approach (Harris et al., 2019) designed using an evidence-centered design approach (ECD; Mislevy & Haertel, 2006) to guide the GPT-4 model to design assessment tasks. My foundational knowledge not only informs but also guides the GPT-4 model in creating assessments with precise prompts. This expertise is crucial for evaluating the quality of the outputs. I direct the GPT model to design assessment tasks to capture knowledge-in-use. 3.2 Focal Performance Expectations This study focuses on two elementary school level performance expectations from the NGSS. The performance expectations focus on two major scientific and engineering practices of developing models and constructing scientific explanations, fundamental to students' knowledge-in-use (Krajcik et al., 2023; Schneider et al., 2022). The two PEs are both for 3rd grade level, one PE focuses on “Physical Sciences” and the other PE is from the “Life Sciences.” The two PEs and their associated information are presented below. 3-PS2-1. Plan and conduct an investigation to provide evidence of the effects of balanced and unbalanced forces on the motion of an object. 37 Figure 3-1. Snapshot of PE 3-PS2-1 from NGSS Online Resources 3-LS4-3 Construct an argument with evidence that in a particular habitat some organisms can survive well, some survive less well, and some cannot survive at all. 38 Figure 3-2. 
Snapshot of PE 3-LS4-3 from NGSS Online Resources

Each PE was unpacked, and the unpacking resulted in several learning performances per PE. Notably, these learning performances, when used together, should cover the entire PE. The next section details how assessments are designed by leveraging the GPT-4 model, unpacking the research methodology and research process.

3.3 Study Design

Anchored in the principles of Design-Based Research (DBR) (Barab & Squire, 2004; Collins, Joseph, & Bielaczyc, 2004), this study endeavors to examine the intricate interplay between AI and human intelligence within the realm of knowledge-in-use assessment design. DBR, recognized for its systematic and iterative approach, facilitates a profound exploration that seamlessly marries theoretical understanding with empirical applications. To address the overarching questions concerning the potential collaboration between human intelligence and artificial intelligence in knowledge-in-use assessment design, the research is structured into three distinct yet interlinked stages with corresponding research questions (see Table 3-1). These stages span from the initial training and design capabilities of the GPT-4 model, to the critical examination by human experts across disciplines, and finally, to the evolution and optimization of GPT-4-generated assessments. While each stage provides depth in its own area, collectively the stages contribute to a holistic understanding of the synergy between AI and human intelligence, setting the stage for a transformative leap in educational assessment practices. Through this DBR-driven approach, the study promises depth in exploration and breadth in application, paving the way for innovative strides in the landscape of educational assessment. Given the features of DBR, for each stage of the study and its data analysis, I follow the socio-technical evaluation strategies proposed by Waschull and Emmanouilidis (2023) to analyze the human-AI collaborative assessment system. Specifically, I use the implementation workflow and evaluation methodology presented in Figure 3-3 to evaluate the human-AI collaborative knowledge-in-use assessment design system across the three stages of the study. It is worth noting that this is an exploratory study investigating the possibility of AI and human intelligence interaction; generalizability is beyond its scope.

Figure 3-3. Implementation flow and evaluation methodology

The model involves a structured workflow comprising three stages: initial assessment design, feedback collection, and assessment refinement. In the initial assessment design stage, human knowledge-in-use assessment developers, with the aid of GPT-4 models, create the first round of human-AI co-designed interim and final products, including unpacking documents, learning performances (LPs), integrated dimension maps (IDMs), evidence statements, assessments, and rubrics. This stage addresses RQ1, "How can generative AI models be effectively and iteratively trained to design knowledge-in-use assessments?" Next, the multidisciplinary expert panel review stage collects feedback on the interim products. This panel comprises NGSS experts, science content experts, engagement experts, equity experts, language experts, and teacher experts. The panel provides feedback on the LPs, evidence statements, assessments, and rubrics, focusing on 3D learning, engagement, language complexity, equity, and practice perspectives.
This feedback process addresses the RQ2, "How do human experts across different disciplines evaluate the AI-generated knowledge-in-use assessments, and what refinements do they suggest?" The final stage is assessment refinement, where human knowledge-in-use assessment developers, integrating collective feedback, collaborate with GPT-4 models to produce the second round of interim and final products. The same multidisciplinary expert panel reviews these refined products, providing further feedback. This stage responds to the RQ3, "What is the process of refining AI-designed knowledge-in-use assessments based on the feedback provided by human experts? Whether and how are the revised assessments changed?" The evaluation process involves defining the unit of analysis and identifying critical areas, individual expert reflection, and collecting and validating relevant performance categories. This process aims to conduct evaluations and feed outcomes back into the design process, ensuring continuous improvement and alignment with educational objectives and standards. This iterative cycle of design, feedback, and refinement ensures that the assessments developed are robust, context-sensitive, and pedagogically sound, leveraging the strengths of both human and AI intelligence. In the subsequent sections, the specific research design for each stage will be meticulously detailed. 42 Table 3-1. Data collection and analysis overview Data Source Data Method of Analysis Intended Inference Research Question 1: Iterative Training and Initial GPT-4 Model-Based Assessment Design (Stage 1) How can generative AI models be effectively and iteratively trained to design knowledge-in-use assessments? Self-reflection of the GPT-4 model generated outputs and key themes of high-quality prompt design Thematic analysis & Reflective records Analyze GPT-4 model’s outputs after each training to pinpoint essential prompt features. By refining prompts iteratively, I will further probe the quality of generated content. - Capture trends and deeper themes in GPT-4 model developed outputs, including in-the process outcomes and final assessments. - Assess the potential aptitude of the GPT-4 model in the domain of assessment design. - Set the groundwork for a comprehensive framework that nuances the integration of AI in specialized educational contexts or knowledge-in-use assessment design. Research Question 2: Human Expert Review and Feedback Collection (Stage 2) How do human experts across different disciplines evaluate the AI-generated knowledge-in-use assessments, and what refinements do they suggest? 
Data source: Expert panel review of learning performances. Data: Likert ratings and responses to open-ended review questions. Method of analysis: Descriptive statistics (heatmap) and thematic analysis. Intended inference: Confirm adequacy of the set of learning performances with respect to representing the domain.

Data source: Expert panel review of tasks and rubrics. Data: Likert ratings and responses to open-ended review questions. Method of analysis: Descriptive statistics (heatmap) and thematic analysis. Intended inference: Confirm cognitive appropriateness of each task, including task complexity and equity issues.

Data source: Expert panel review of tasks and rubrics regarding equity. Data: Likert ratings and responses to open-ended review questions. Method of analysis: Descriptive statistics (heatmap) and thematic analysis. Intended inference: Confirm the adequacy of equitable opportunity for diverse students' needs.

Data source: Expert panel review of tasks and rubrics regarding engagement. Data: Likert ratings and responses to open-ended review questions. Method of analysis: Descriptive statistics (heatmap) and thematic analysis. Intended inference: Confirm the adequacy of the assessment for students' engagement.

Data source: Teacher cognitive interviews. Data: Teacher reflections on tasks and overall reactions. Method of analysis: Creswell (2003) hierarchical coding procedure. Intended inference: Understand the trends of AI-designed assessments in supporting diverse students' three-dimensional learning in the classroom.

Table 3-1 (cont'd)

Research Question 3: Assessment Refinement (Stage 3)
What is the process of refining AI-designed knowledge-in-use assessments based on the feedback provided by human experts?

Data source: Expert panel review of tasks and rubrics. Data: Experts' reflections on tasks and overall reactions. Method of analysis: Descriptive statistics (scatter plot) and thematic analysis. Intended inference: Glean deeper insights into the extent of improvement of the refined assessments.

Data source: Self-reflections on the refinement process. Data: Documented revision/refinement process captured in reflections. Method of analysis: Thematic analysis.

3.3.1 Stage 1: Initial Iterative Training and Preliminary GPT-4 Assessment Design

Stage 1 responds to research question 1: "How can the GPT-4 model be effectively and iteratively trained to design knowledge-in-use assessments?" I adopted the NGSA approach to establish a training blueprint for the GPT-4 model. This approach was presented above in Chapter 2, section 2.4.2. There are two reasons why this study adopts the NGSA approach to design knowledge-in-use assessments. The first reason rests on a comprehensive analysis of current approaches for designing performance-based assessments that probe students' knowledge and skills in solving complex problems or explaining real-world phenomena: the NGSA approach helps ensure that the designed assessments capture both the scope and depth of the ideas and abilities embedded in particular PEs. The evidence-centered design process also allows the designed assessments to elicit and collect evidence of students' understandings and ensures that they align with the NGSS PEs (Harris et al., 2019; Li et al., 2024). The second reason is that my extensive assessment design experience enables me to serve as the critical person who uses the design criteria to train the GPT-4 model, judges the outputs it generates, and reflects on those outputs to give the model iterative feedback for further improvement or adjustment. This stage involves feeding GPT with background data, emphasizing design principles, and introducing domain analysis and modeling processes, following the workflow proposed in the HHACI model in Figure 2-3.
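Table 3-1 above lists heat maps (Stage 2) and scatter plots (Stage 3) among the descriptive statistics used to summarize the expert panels' Likert ratings. As a minimal sketch, using hypothetical ratings rather than the study's actual panel data, and with illustrative reviewer and criterion labels, such a heat map could be generated in Python as follows.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical Likert ratings (1-5): rows are expert reviewers, columns are review criteria.
criteria = ["3D alignment", "Equity", "Language", "Engagement", "Usability"]
reviewers = ["NGSS expert", "Assessment expert", "Content expert", "Equity/language expert", "Teacher"]
ratings = np.array([
    [5, 4, 4, 3, 4],
    [4, 4, 5, 4, 4],
    [5, 3, 4, 4, 5],
    [4, 5, 4, 3, 4],
    [4, 4, 3, 5, 5],
])

fig, ax = plt.subplots(figsize=(7, 4))
im = ax.imshow(ratings, cmap="YlGnBu", vmin=1, vmax=5)   # darker cells indicate higher ratings
ax.set_xticks(range(len(criteria)))
ax.set_xticklabels(criteria, rotation=30, ha="right")
ax.set_yticks(range(len(reviewers)))
ax.set_yticklabels(reviewers)
for i in range(len(reviewers)):                          # annotate each cell with its rating
    for j in range(len(criteria)):
        ax.text(j, i, ratings[i, j], ha="center", va="center")
fig.colorbar(im, ax=ax, label="Likert rating (1 = low, 5 = high)")
plt.tight_layout()
plt.savefig("expert_rating_heatmap.png", dpi=200)

A Stage 3 scatter plot, for example comparing first-round against second-round ratings for each task, could be drawn analogously with ax.scatter.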
After each training session, I analyzed GPT's outputs to identify prominent prompt features through thematic analysis and reflection, drawing on the assessment design approach and on the cognitive functions humans perform during self-regulated learning (specify the task; set up goals and plans; enact the plans; monitor the learning process; and finally reflect on the entire human-AI interactive process). Through iterative refinement, I judged whether the outputs met the requirements and goals I had set up beforehand. For a more holistic analysis, I deployed thematic analysis complemented by reflective insights, aiming to understand both the explicit patterns and the underlying themes of the AI-generated assessments. The significance of this stage is twofold. Primarily, it seeks to evaluate the potential of the GPT-4 model in the realm of assessment design. Subsequently, it aspires to craft an initial framework detailing the nuances of molding AI for specialized educational applications.

3.3.1.1 Design-Based Research with Reflective Practice Highlight

In my work, I used the iterative design and feedback loops provided by my reflections to explore the research questions. In this process, I did not just observe the outcomes but actively engaged in refining the assessment designs based on observations and reflections. In exploring research question 1, I engaged in in-depth reflection to summarize effective strategies and identify future improvements. This process allows for deep insights into the iterative design process.

3.3.1.2 Data Collection and Analysis

Data collection

I input essential background information into the GPT-4 model to equip it with basic understandings of the NGSS, knowledge-in-use, and the NGSA design procedures and criteria. Then, I gave detailed instructions about each step of the assessment design process and provided examples that I wanted the GPT model to learn from. I collected the outputs generated by the GPT-4 model after each prompt for each step of the design process, along with the prompts and corresponding outputs of the training process.

Training process and environment setting: Setting up my training with GPT-4 Turbo

In setting up my training process, I utilized OpenAI's Application Programming Interface (API) to interact with the GPT-4 Turbo model for generating structured responses and guiding me through the design process for knowledge-in-use science assessments. The API, a set of rules and protocols, allows different software applications to communicate with each other, enabling me to send requests to OpenAI servers and receive responses generated by their language models. I selected the "gpt-4-turbo-preview" model for its advanced capabilities, as it is designed to provide high-quality and efficient responses, particularly suited for tasks requiring detailed understanding and text generation. This makes it ideal for guiding me through structured processes. To ensure secure and authenticated interaction with the OpenAI API, I used an API key, a secret token that grants access to OpenAI services. I configured the request headers to include the content type as JSON and the authorization token. The payload for each request comprised the chosen model, gpt-4-turbo-preview, and a sequence of messages defining the conversation's structure, including roles such as "system" for setting the context and "user" for input prompts (see Figure 3-4). Additionally, I set a maximum token limit of 1500 to allow for comprehensive responses.
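The scripts shown in Figure 3-4 are not reproduced here; the following Python sketch illustrates one way the setup just described could be assembled: an API key and JSON headers, a payload specifying the gpt-4-turbo-preview model, system and user messages, a 1500-token limit, and a POST request whose response is parsed and logged locally. The prompt text is abridged from the system message quoted later in this section, and the user prompt, variable names, and log file name are illustrative rather than taken from the study's actual code.

import json
import requests

# Illustrative values only; the real key and file paths are not reproduced here.
API_KEY = "YOUR_OPENAI_API_KEY"           # secret token granting access to OpenAI services
ENDPOINT = "https://api.openai.com/v1/chat/completions"

headers = {
    "Content-Type": "application/json",   # request body is sent as JSON
    "Authorization": f"Bearer {API_KEY}", # authorization token
}

payload = {
    "model": "gpt-4-turbo-preview",       # model selected for the training process
    "messages": [
        {"role": "system",                # sets the assistant's context
         "content": "You are an assistant specialized in guiding users through a detailed "
                    "and structured design process for science assessments."},
        {"role": "user",                  # input prompt for the current design step
         "content": "Let's begin with the first step: identifying the performance expectation 3-PS2-1."},
    ],
    "max_tokens": 1500,                   # cap on the length of each generated response
}

response = requests.post(ENDPOINT, headers=headers, json=payload)
response.raise_for_status()               # verify a successful status code
reply = response.json()["choices"][0]["message"]["content"]

# Append the exchange to a local log so the full conversation can be analyzed later.
with open("training_log.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps({"messages": payload["messages"], "reply": reply}) + "\n")

Running one such request per design step, while appending each new prompt and reply to the messages list, yields the kind of cumulative conversation log described below.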
Figure 3-4. Screenshot for the training environment setup 46 POST requests were made to the API endpoint with these headers and payload, and the responses were processed by verifying successful status codes, parsing the JSON data, and saving it to a local file. This setup facilitated automated text generation, which was then formatted into a document recording the entire conversation, ensuring comprehensive logging of both user inputs and assistant responses for further analysis. Setting up training for the assessment design I began the training by initializing the OpenAI API interaction framework, where the AI is provided with a system message that sets the context (Refer to Figure 3-4). The AI is tasked with understanding complex instructions, breaking down tasks into smaller steps, and generating intermediate products at each stage. This foundational setup ensures that the AI comprehends its role within the broader framework of the design process. The training process unfolds through a sequence of interactions where the AI and the human user engage in a detailed and structured conversation. Initially, the AI is equipped with a set of instructions that define its role and objectives. Following this, the human user provides a series of prompts designed to guide the AI through various aspects of the science assessment design process. Each user prompt is crafted to elicit detailed, context-specific responses from the AI, ensuring that the output aligns with educational standards and equity goals. I first set up the system to identify a role for the GPT-4 system as an assistant by giving the prompt: “ You are an assistant specialized in guiding users through a detailed and structured design process for science assessments. Your role includes understanding complex instructions, breaking down tasks into smaller steps, and generating intermediate products at each stage. You need to communicate clearly, structuring your responses in a way that aligns with the users' design framework, and refer back to previous steps or information as needed.” I then provided an overview for the system to understand the design task by providing the process of the NGSA approach. Here is the prompt I gave the GPT model, “ To start, let's define the main steps in the design process: 1) Identifying the Performance Expectation; 2) Unpacking the Performance Expectation; 3) Mapping the Dimensions; 4) Designing Learning Performances; 5) Developing Assessment Tasks; 6) Creating Rubrics; 7) Iterative Review and Revision. 47 We will go through each step one by one, ensuring clarity and focus on equity and usability.” I then started the design process with one of the focal PEs, 3-PS2-1. Data analysis I analyzed the collected data using thematic analysis to extract the common themes that GPT-4 model may fail in understanding human prompts and common strategies that are efficient for supporting GPT-4 model to understand the design purposes and goals. I also maintained a close reflection throughout the process to document my observations, thoughts, and feelings about the GPT’s outputs and the iterative training process. Analyze these reflections to identify patterns in my responses to the AI's outputs and how the information that is generated by the GPT-4 model may add to humans, in this case, my understanding or ideas of designing knowledge-in-use assessment. Anticipated outcomes Throughout this process, I anticipate achieving three key outcomes. 
First, I expect to obtain initial outputs from the GPT-4 model for each crucial step in the assessment design process. These steps include unpacking, learning performance generation, evidence statement generation, essential characteristics design, and varied characteristic design. Secondly, I aim to produce preliminary design assessment tasks. It is anticipated, based on the expertise of human assessment experts, that each PE yields at least three learning performances following the unpacking process. For each learning performance, I work with GPT-4 to generate two assessment tasks. These tasks are intended to evaluate or further probe the model’s comprehension of various task features. The third expected outcome is the creation of a corresponding rubric for each assessment task. This rubric is designed to evaluate the tasks generated by GPT-4, ensuring they meet the established criteria for analyzing student understanding effectively. 3.3.2 Stage 2: Interdisciplinary Expert Review and Refinement Stage 2 aims to address the research question 2: “How do human experts across different disciplines evaluate the AI-generated knowledge-in-use assessments, and what refinements do they suggest?). For this stage, I randomly select one Learning Performance (LP) from each PE that is 48 generated at Stage 1. This selection forms the basis of the documentation prepared for review by human experts in this stage. To enable the collection of various feedback that focuses on different areas of expertise, Stage 2 requires assembling an interdisciplinary panel of experts. This expert panel (Table 3-2) will include science content experts in physical science and/or life science domain, experts in knowledge- in-use assessment design, experts who has deep understanding of next-generation science standards, experienced elementary science teachers, science education experts who have different focal research areas, and experts who have expertise in motivation, engagement and/or cognitive processes. Table 3-2 shows how different experts who serve on the panel provide different focused feedback. Their feedback and comments will be sought on both the initial processing seminal products generated by GPT-4 and the preliminary assessment tasks and rubrics designed. This expert review is essential for refining the assessment tools and ensuring their alignment with educational objectives and standards. Table 3-2. Expert panel and their feedback expertise Panel members Feedback expertise Experts who have strong science content background in physical and/or life sciences Experts who have expertise in knowledge-in-use assessment design Expert who has deep understanding of the NGSS Experienced elementary science teachers Experts who are science education researchers with different research focuses and/or expertise: two experts focus on literacy and language and two experts focus on equity and inclusion Experts who have expertise in motivation, engagement, and cognitive process Provide content validity Provide feedback on the assessment design process and interim products, such as unpacking documents, etc. Provide feedback on the interim and final products of designed assessments to ensure the coherent and aligned understanding of 3D and knowledge-in-use. Provide feedback on the assessment tasks to ensure the tasks can be used for elementary, specifically 3rd grade students. 
Provide feedback from different perspectives, such as if the assessment language is appropriated for all students; if the scenarios or contexts in the assessment tasks are accessible to all students regardless of their backgrounds, etc. Provide feedback about the designed assessment tasks on if the task phenomena are compelling enough to cognitively engage students in the task, etc. 49 3.3.2.1 Expert Panels’ Composition and Background To comprehensively evaluate the AI-co-designed knowledge-in-use assessments, two expert panels were assembled to review the LPs and related assessments for two distinct PEs: 3-PS2-1 and 3- LS4-3. Each panel comprises multidisciplinary experts, ensuring robust and comprehensive feedback from diverse perspectives. The panels consist of individuals with extensive backgrounds in their respective fields, offering a rich blend of perspectives and insights essential for a thorough evaluation of AI-generated assessments. For both PEs, the panels include NGSS experts who have significant experience in science education, curriculum development, and state-level curriculum frameworks and policy advising. Assessment design experts contribute deep knowledge of three-dimensional teaching and learning approaches and scalable methods for NGSS-aligned teaching and learning. Science content experts specialize in physical sciences and life sciences, providing detailed insights into the subject matter. Science education researchers focus on equity and language, ensuring that the assessments address diverse student needs and are inclusive and accessible. Engagement experts bring valuable perspectives on student motivation, cognitive engagement, and innovative teaching strategies. Elementary science teacher experts, with practical classroom experience, offer a grounded view of the assessments' applicability and implementation in real-world teaching scenarios. These panels bring together a comprehensive set of skills and knowledge, providing a holistic review of the AI-generated assessments. Table 3-3 details the composition and background of each expert panel member, highlighting their interdisciplinary expertise and the robust feedback they can provide. This interdisciplinary composition ensures that the feedback provided by the panels is comprehensive and robust, addressing various aspects of the AI-generated assessments from multiple perspectives. 50 Table 3-3. Expert panel and their backgrounds Group Expertise N Background Area NGSS Expert Group 1: PE 3-PS2-1 2 T has a robust background in physical science education, with a BS degree in Earth, Atmospheric, and Planetary Sciences. He has extensive experience with the NGSS, having served as the in-house expert at the National Science Teaching Association (NSTA) for eight years. Additionally, T has significant experience in curriculum development and standards-based education reform. 2 Assessment Design Experts E has over two decades of experience teaching preschool, elementary, and middle school, including ten years as a science specialist and ESL/Bilingual teacher. She was a co-writer for the NGSS and contributed to the NGSS Diversity and Equity Team’s Appendix D. After earning a PhD, she became an assistant professor specializing in elementary science education. C is an expert in science education and assessment design, known for pioneering innovative approaches to support three-dimensional teaching and learning. 
With over a decade of experience, C has developed scalable methods to address the NGSS through curricula, assessments, and professional learning models. His work focuses on creating engaging, interactive, equitable, and accessible learning experiences for students and supporting teachers in implementing these strategies effectively. P is an expert in assessment design with a strong background in chemistry and education. Holding a BS in chemistry, a master’s degree in chemistry education, and a PhD in curriculum and instruction (chemistry education), P conducts research on NGSS curriculum, assessment, and professional learning at middle and high schools. With over five years of experience in NGSS curriculum and assessment design, P identifies as a science educator, science teacher educator, and an international science education research scholar. Science Content Experts (Physical Science) 5 J is a science content expert specializing in physical science. Holding a bachelor’s degree in physics and a doctorate in education, J is currently a postdoctoral researcher in science education. With previous experience in physics education research, J focuses on designing instructional environments using Project-Based LearninG to foster the development of students' knowledge-in-use and understanding of the nature of science. S holds a doctorate in physics education and a bachelor’s degree in physics. Her research interests focus on pre-service teacher professional development and project-based learning. 51 Table 3-3 (cont’d) P is an expert in assessment design with a strong background in chemistry and education. Holding a BS in chemistry, a master’s degree in chemistry education, and a PhD in curriculum and instruction (chemistry education), P conducts research on NGSS curriculum, assessment, and professional learning at middle and high schools. T has a robust background in physical science education, with a BS degree in Earth, Atmospheric, and Planetary Sciences. He has extensive experience with the NGSS, having served as the in-house expert at the National Science Teaching Association (NSTA) for eight years. Additionally, T has significant experience in curriculum development and standards-based education reform. J has a background in microbiology and holds a PhD in Science Education. He specializes in supporting students in building and revising computational models. J has taught college-level science courses and science teaching methods for secondary pre-service teachers. He has also conducted professional development sessions for in-service teachers and has experience designing assessment tasks for standardized science exams. Science Education Researchers (equity) 2 E has over two decades of experience teaching preschool, elementary, and middle school, including ten years as a science specialist and ESL/Bilingual teacher. She was a co-writer for the NGSS and contributed to the NGSS Diversity and Equity Team’s Appendix D. After earning a PhD, she became an assistant professor specializing in elementary science education. Co is an assistant professor of Teacher Education. She teaches science methods courses in the science education department and also facilitates professional learning initiatives focused on urban school districts. Co has worked on major research projects related to project-based learning and has extensive teaching experience, primarily in pre-K through 7th-grade science, as well as teaching all subjects in a self-contained 3rd-grade classroom. 
Science Education Researchers (language) 2 E has over two decades of experience teaching preschool, elementary, and middle school, including ten years as a science specialist and ESL/Bilingual teacher. She was a co-writer for the NGSS and contributed to the NGSS Diversity and Equity Team’s Appendix D. After earning a PhD, she became an assistant professor specializing in elementary science education. Su has an extensive background in educational standards and curriculum development. She served as the Lead State Representative on the NGSS development team and led standards development in her previous role. Currently, she oversees an eight-year elementary science PBL project, focusing on literacy integration and standards implementation. With 28 years of experience across various content areas, Su brings a wealth of knowledge and expertise to her role. 52 Table 3-3 (cont’d) Engagement experts 3 Sa is a fourth-year PhD candidate specializing in motivation, engagement, and critical race theories, with additional experience in literacy. She has been working on an NSF project supporting middle school students’ motivation in science learning. Elementary science teacher experts Group 1: PE 3- LS4-3 NGSS Expert Q is a third-year PhD student specializing in cognitive flexibility, student engagement, game-based learning, and virtual learning. With an undergraduate background in psychology and business and a master's degree in cognitive science, she brings a unique interdisciplinary perspective. H is a fifth-year PhD candidate specializing in student engagement, language and literacy assessment, and special education. She has extensive experience supporting student science assessments through a linguistic perspective. 2 2 Le has over 30 years of teaching experience, specializing in science education for intermediate school students. She particularly enjoys working with sixth graders, leveraging their energy and curiosity to promote scientific inquiry. Le has facilitated local, state, and national workshops to advance science education and has led multiple community education initiatives. B has 30 years of teaching experience in a rural public school. Her teaching background includes 3rd grade, 6th grade science, and predominantly 5th grade. For the past five years, B has served as a K-5 STEM teacher utilizing a project-based learning curriculum. She has collaborated closely with a research team to provide feedback and observational data while teaching these units. M is an expert in NGSS with extensive experience in science education. She has contributed to state-level curriculum frameworks and advises on science education policy. Her research focuses on teacher learning, professional development, and adapting pedagogies to support multilingual students. E has over two decades of experience teaching preschool, elementary, and middle school, including ten years as a science specialist and ESL/Bilingual teacher. She was a co-writer for the NGSS and contributed to the NGSS Diversity and Equity Team’s Appendix D. After earning a PhD, she became an assistant professor specializing in elementary science education. 53 Table 3-3 (cont’d) Assessment Design Experts 3 C is an expert in science education and assessment design, known for pioneering innovative approaches to support three-dimensional teaching and learning. With over a decade of experience, C has developed scalable methods to address the NGSS through curricula, assessments, and professional learning models. 
His work focuses on creating engaging, interactive, equitable, and accessible learning experiences for students and supporting teachers in implementing these strategies effectively. Sm has a background as a middle and high school science teacher, primarily working with students underrepresented in STEM. Currently, Sm is a tenure- track professor at a research-intensive institution, focusing on the design of science education interventions for large-scale use, including curriculum, assessments, and professional development. P is an expert in assessment design with a strong background in chemistry and education. Holding a BS in chemistry, a master’s degree in chemistry education, and a PhD in curriculum and instruction (chemistry education), P conducts research on NGSS curriculum, assessment, and professional learning at middle and high schools. Science Content Experts (Life Science) 5 J has a background in microbiology and holds a PhD in Science Education. He specializes in supporting students in building and revising computational models. J has taught college-level science courses and science teaching methods for secondary pre-service teachers. He has also conducted professional development sessions for in-service teachers and has experience designing assessment tasks for standardized science exams. Cn is a bilingual Latina with a rich background in science education. She is currently focused on research in her role as an Academic Specialist. With extensive experience in developing 3D PBL curriculum and assessments, she also brings a wealth of knowledge from her time as a middle and high school science teacher. Cn holds degrees in plant biology, education, and public health, and has a PhD in secondary science education. She is also well-versed in teacher professional learning. L is a third-year PhD student specializing in curriculum, instruction, and teacher education with a focus on science and urban education. She has four years of experience as a high school science teacher, where she taught biology, environmental science, and chemistry. Her research interests include noticing, classroom discourse, and group work, aiming to create more equitable and just science classrooms, especially for marginalized students. Sm has a background as a middle and high school science teacher, primarily working with students underrepresented in STEM. Currently, Sm is a tenure- track professor at a research-intensive institution, focusing on the design of science education interventions for large-scale use, including curriculum, assessments, and professional development. 54 Table 3-3 (cont’d) H is a PhD candidate in science education, who has extensive experience in both science and gifted education. She holds degrees in Biology and Biotechnology and has a master’s degree in healthcare administration. Initially intending to pursue medicine, she shifted her focus to education, driven by a passion for teaching. Her work includes significant experience in Artificial Intelligence in Education (AIED) and supporting teachers in integrating new technologies into their classrooms. Equity and language experts 3 E has over two decades of experience teaching preschool, elementary, and middle school, including ten years as a science specialist and ESL/Bilingual teacher. She was a co-writer for the NGSS and contributed to the NGSS Diversity and Equity Team’s Appendix D. After earning a PhD, she became an assistant professor specializing in elementary science education. Co is an assistant professor of Teacher Education. 
She teaches science methods courses in the science education department and also facilitates professional learning initiatives focused on urban school districts. Co has worked on major research projects related to project-based learning and has extensive teaching experience, primarily in pre-K through 7th-grade science, as well as teaching all subjects in a self-contained 3rd-grade classroom. Su has an extensive background in educational standards and curriculum development. She served as the Lead State Representative on the NGSS development team and led standards development in her previous role. Engagement experts 3 Sa is a fourth-year PhD candidate specializing in motivation, engagement, and critical race theories, with additional experience in literacy. She has been working on an NSF project supporting middle school students’ motivation in science learning. Q is a third-year PhD student specializing in cognitive flexibility, student engagement, game-based learning, and virtual learning. H is a fifth-year PhD candidate specializing in student engagement, language and literacy assessment, and special education. She has extensive experience supporting student science assessments through a linguistic perspective. Table 3-3 (cont’d) Elementary science teacher experts 2 Le has over 30 years of teaching experience, specializing in science education for intermediate school students. She particularly enjoys working with sixth graders, leveraging their energy and curiosity to promote scientific inquiry. Le has facilitated local, state, and national workshops to advance science education and has led multiple community education initiatives. She holds a B.S. in elementary education and an M.A. in teacher development and educational technology. B has 30 years of teaching experience in a rural public school. Her teaching background includes 3rd grade, 6th grade science, and predominantly 5th grade. For the past five years, B has served as a K-5 STEM teacher utilizing a project-based learning curriculum. She has collaborated closely with a research team to provide feedback and observational data while teaching these units. The curriculum has proven effective, significantly improving test scores. Additionally, B incorporates a digital platform to deliver lessons and enhance student interaction. 3.3.2.2 Data Collection I collected data to examine the cognitive, inferential, and instructional validity (Pellegrino et al., 2016) of the GPT-generated assessments produced in the first stage. Each panel assessed two GPT-4-designed assessment tasks for one of the two PEs. They offered both quantitative evaluations and qualitative feedback encompassing strengths, areas of concern, and potential improvements. Instruments I developed and used different instruments for experts with different expertise on the panel. The panel used a protocol to independently determine the appropriateness of each designated LP and the adequacy of the set of LPs with respect to representing the domain (instructional validity). They reviewed the tasks designed to align with each LP and the scoring rubrics (inferential validity). During these reviews, they attended to cognitive validity issues, including ethnic and cultural bias, cognitive complexity, and task performance demands.
Feedback collection instruments were tailored for different expert groups, including NGSS experts, assessment design experts, science content experts, and science education researchers with a focus on equity and language. These groups received protocols designed to elicit detailed feedback on Learning Performances (LPs) and Evidence Statements (Table 3-4), as well as on two AI-co-designed assessment tasks (Table 3-5). Engagement experts and teacher experts were given protocols specifically designed to gather insights from their unique perspectives on the AI-co-designed tasks. Three types of expert feedback collection instruments were developed. The first is the science-focused feedback instrument (see Tables 3-4 and 3-5), which addresses the designed LPs, evidence statements, and corresponding assessment tasks. The second is the engagement and language-focused feedback instrument (see Table 3-6). The science-focused instrument was designed to capture experts' feedback on the GPT-designed assessments and interim products, specifically whether these products and assessments (1) capture the three dimensions of science knowledge and skills, (2) align with the PEs/LPs, and (3) elicit students' knowledge-in-use performance. Table 3-4 presents the instrument used to collect feedback from science content and knowledge-in-use assessment design experts about the learning performances designed by generative AI, a critical part of the assessment design process. Experts provided Likert scale ratings and open-ended feedback on the tasks. Table 3-5 presents the instrument for capturing science-focused feedback on the designed assessments. These questions solicit detailed and actionable feedback from experts regarding the quality of the designed knowledge-in-use assessments. Science content experts, knowledge-in-use assessment design experts, NGSS experts, and experienced teachers used this instrument. The questions in Table 3-5 are intended to elicit in-depth feedback on the quality of the AI-generated assessment tasks. This collective human feedback helps ensure that the designed assessments are robust and consistent with sound pedagogical tenets, maintaining assessment validity and reliability. Data were analyzed qualitatively using a thematic analysis of the open-ended responses. For the Likert scale ratings, I checked for consistency across reviews, provided descriptive analysis, and determined a set of revisions to the tasks. Descriptive statistics summarize the quantitative feedback, while thematic analysis examines the qualitative feedback to surface emergent patterns and insights. I conducted a pilot test to ensure the instruments were accessible and understandable. Table 3-4. Expert panel review protocol for AI-designed learning performance Table 3-5. Expert panel feedback instrument for AI-designed assessments Engagement and language-focused feedback instrument. To facilitate a comprehensive understanding of the assessment tasks, the panel responded to prompts that examine cognitive validity, equity, language appropriateness, and engagement evidence (see Table 3-6). All experts on the panel used the engagement and language-focused feedback instrument. Qualitative data were analyzed through thematic analysis of the open-ended responses.
For Likert scale ratings, I assessed consistency across reviews, performed descriptive statistical analysis, and identified necessary revisions to the tasks. I conducted pilot tests to ensure the instruments were accessible and understandable before formally applying them in the study. Table 3-6. Expert panel review protocol (engagement) for AI-designed assessment tasks Teacher cognitive interview protocol. A semi-structured interview protocol was designed for interviews with experienced teachers, especially when unique concerns or points were raised in their questionnaires. The interviews were conducted after teachers completed the survey and were intended to further probe each teacher's perceptions of the GPT-designed assessments, along with their concerns or suggestions. I used thematic analysis to analyze the interviews and capture teachers' suggestions and feedback, synthesizing salient patterns and nuances. The insights were expected to offer tangible directions for refining the assessments. Table 3-7 presents the teacher interview protocol. Some prompts include: "How do you see the tasks providing appropriate opportunities for your students to demonstrate their proficiencies with 3-dimensional aspects of the NGSS PEs? (cognitive validity)" "What strengths, if any, do the AI-generated tasks contain?" "Which areas necessitate refinement or enhancement in the AI-generated task?" "In which ways, if any, do the specific AI-designed tasks fall short of your expectations? Could you detail the areas of deficit?" "How well do the tasks cater to students with diverse backgrounds, ensuring equitable opportunities for all to demonstrate their understanding? (equity)" "How do the tasks actively engage learners, prompting interest and sustained attention throughout the assessment? (engagement)" "In what ways do the tasks facilitate your students' ability to approach problems from multiple perspectives? (knowledge-in-use)"
Table 3-7. Teacher interview protocol
Directions:
● Record the teacher's name.
● Interviews could be done via video conference or phone.
● Confirm with the teacher that this interview is voluntary, and they do not have to answer questions they don't feel comfortable with.
● State that this is confidential.
● Ask for permission to record the interview and take notes.
● Make the interview conversational in tone.
○ Ask the initial question, then ask teachers follow-up questions to probe deeper but keep it like a conversation. Use probes such as: Tell me more about that. Can you give me an example? Can you tell me what you mean by…
● Be careful not to lead the teacher. They should be doing 90% of the talking.
1. How do you see the tasks providing appropriate opportunities for your students to demonstrate their proficiencies with 3-dimensional aspects of the NGSS PEs?
2. What strengths, if any, do the AI-generated tasks contain?
3. Which areas necessitate refinement or enhancement in the AI-generated task?
4. In which ways, if any, do the specific AI-designed tasks fall short of your expectations? Could you detail the areas of deficit?
5. How well do the tasks cater to your students from diverse backgrounds, ensuring equitable opportunities for all to demonstrate their understanding?
6.
How do the tasks actively engage learners, prompting interest and sustained attention throughout the assessment? 7. In what ways do the tasks facilitate your students' ability to approach problems from multiple perspectives of the three dimensions (i.e., DCIs, SEPs, CCCs)? 3.3.2.3 Data Analysis The feedback analysis was organized into three main sections: LPs and Evidence Statements, Task 1, and Task 2, corresponding to the evaluations of the Performance Expectations (PEs). Both qualitative and quantitative methods were employed in the data analysis. Each section of the report starts with an overview of the quantitative analysis, followed by an in-depth qualitative analysis that highlights key feedback and recommendations from the experts. For the quantitative data analysis stage, I use heatmaps as both an analytical and representational tool to organize and interpret the expert feedback data. Heatmaps serve as graphical representations that employ color coding to illustrate complex data matrices. This visual method facilitates the immediate recognition of patterns and correlations across multiple dimensions, which is essential for the preliminary analysis (Wilkinson & Friendly, 2009). In educational research, heatmaps effectively depict variations 68 and trends across different evaluative criteria, making them an invaluable tool for understanding assessments (Borkin et al., 2013). The color gradients in a heatmap range from lighter to darker hues, representing the spectrum of scores or feedback intensity. Typically, cooler colors (e.g., blues) indicate lower scores or less favorable feedback, while warmer colors (e.g., reds) denote higher scores or more positive evaluations. This color-coding aids in quickly identifying areas of concern where expert feedback suggests a need for improvement, as well as strengths where feedback is generally positive. The decision to employ heatmaps for data analysis in this context is strategic. They provide a clear, concise way to compare large volumes of data across multiple evaluative criteria and expert groups. This is particularly valuable where multifaceted feedback must be synthesized to guide revisions and improvements in learning performances and assessment tasks. Heatmaps enable stakeholders to visually digest complex information, promoting easier interpretation and facilitating more informed decision-making. For the qualitative analysis, a dual approach was used, incorporating both a priori and thematic analysis methods. The a priori method, chosen for its relevance to the structured assessment design, involves using predefined themes or codes established from prior research and theoretical frameworks (Brooks et al., 2015). These codes, which include dimensions such as 3D learning, engagement, language, accessibility, and equity, provided a structured lens for the initial data examination and were detailed in Section 2.4.1. This structured approach allows for focused analysis while accommodating necessary adjustments as the analysis progresses (Crabtree & Miller, 1999). Following the a priori coding, thematic analysis was conducted to identify emergent themes, major concerns, and suggestions not initially anticipated. This stage involved a systematic review of the qualitative data to detect patterns that extend beyond the predefined codes (Braun & Clarke, 2006). This comprehensive approach ensures that the analysis captures both anticipated and emerging insights from the expert feedback. 
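To make the heatmap-based quantitative analysis described above concrete, the following is a minimal, illustrative sketch in Python using pandas, seaborn, and matplotlib. The expert-group names, criteria labels, and scores are hypothetical placeholders rather than the study's data, which are reported in Chapter 4.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical mean Likert ratings (1-5) from each expert group on each criterion.
# Rows: expert groups; columns: evaluative criteria.
ratings = pd.DataFrame(
    {
        "3D alignment": [4.5, 4.0, 3.5, 4.0],
        "Equity":       [3.0, 3.5, 4.0, 3.0],
        "Language":     [3.5, 3.0, 3.5, 4.5],
        "Engagement":   [4.0, 4.5, 3.0, 3.5],
    },
    index=["NGSS", "Assessment design", "Content", "Teachers"],
)

# Warmer colors (reds) mark higher, more favorable ratings; cooler colors (blues)
# flag lower ratings that signal areas needing revision.
ax = sns.heatmap(ratings, annot=True, vmin=1, vmax=5, cmap="coolwarm",
                 cbar_kws={"label": "Mean Likert rating"})
ax.set_xlabel("Evaluative criterion")
ax.set_ylabel("Expert group")
plt.tight_layout()
plt.show()
```

In practice, one such group-by-criterion heatmap per PE (or per task) makes low-rated cells, and hence revision priorities, immediately visible before the qualitative coding begins.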
3.3.3 Stage 3: GPT-Designed Assessment Refinement Stage 3 addresses research question 3: “What is the process of refining GPT-designed knowledge-in-use assessments based on the feedback provided by human experts?” Drawing from insights gathered in earlier stages, I integrated human experts’ feedback into the assessment refinement process. This stage focuses on the iterative refinement cycle, in which feedback from the interdisciplinary expert panel and insights gained from the cognitive interviews with teachers are used to enhance the GPT-4-generated assessments. The outcome of this phase is a customized, domain-specific script that harmonizes the AI's functionalities with the adaptability educators require to meet the varied needs of students and their teaching objectives. Central to this stage is exploring how to incorporate human experts’ feedback with AI to refine the initially designed knowledge-in-use assessments. 3.3.3.1 Data Utilization and Refinement Process I commenced this stage by meticulously reviewing all the feedback obtained from the expert panel in Stage 2, including both their numerical ratings and detailed comments. I considered each piece of feedback to determine the most effective way to refine the assessments generated by GPT-4, and I also leveraged the themes from my thematic analysis. These adjustments were not merely superficial; they addressed the content, format, and rubrics to ensure that each assessment's integrity and pedagogical goals were maintained, if not enhanced. 3.3.3.2 Expert Re-Evaluation Once I refined the assessments, I brought back the same interdisciplinary panel from Stage 2 for a re-evaluation, using the same instruments presented in Stage 2 (Tables 3-4, 3-5, and 3-6) to collect new feedback. I decided to use the same evaluation tools to see clearly whether the changes I made had resolved the concerns the panel originally raised. I kept a comprehensive record of every adjustment made to the assessments, including the original feedback that prompted the change and the reasoning behind each decision. This practice is not just a matter of organization; it reflects a commitment to transparency and accountability, providing a clear justification for every modification based on the expert input I received. Sticking with the same evaluation tools for this second review allowed me to understand the true impact of the refinements, and the panel's familiarity with these tools streamlined the process and reinforced the validity of the adjustments made to the assessments. This phase also included a significant change: the introduction of a new group of experts who were unaware that the assessments and interim products had been co-designed with AI. The decision to form this new expert group was inspired by an interview with a teacher from the initial expert panel, who mentioned, "I think the tasks are fine, but knowing they are AI-designed, I tend to be more critical compared to tasks designed by humans. It feels like there's less pressure to provide feedback." This revelation prompted me to reconsider the core purpose of the review: to ensure that the evaluations focused solely on the quality of the tasks rather than the nature of their design process, which could introduce bias. Consequently, assembling a new group of experts who were not informed about the AI involvement was aimed at potentially reducing such bias.
Details on the composition of this new group of experts and their backgrounds are provided in Table 3-8.
Table 3-8. New expert panel (blinded) and their backgrounds
Expert and expertise:
H: The framework for K-12 education writer; scientist; science content
O: NGSS writing team member; integrating science, language, and computational thinking with a focus on multilingual learners; equity, justice
D: Science assessment
A: Science assessment, teacher education
T: NGSS-aligned science assessment
M: Chemistry education; 3D learning
After this second round of evaluation, I compared the experts' feedback on the first-round and second-round assessment tasks. I also took any additional feedback into account and made further refinements. This cycle is key to designing high-quality knowledge-in-use assessments: a careful, iterative process of revision and refinement that incorporates experts’ feedback. Ultimately, this stage ensures that the assessments GPT-4 helps create are not just innovative but also practically useful and pedagogically sound. By marrying the capabilities of AI with the insights of human experts, I aim to create assessments that truly measure what students know and can do. 3.3.3.3 Data Analysis The analysis of feedback provided by experts was methodically arranged into three primary sections: LPs and Evidence Statements, Task 1, and Task 2. These sections align with evaluations specific to distinct PEs. A blend of qualitative and quantitative methodologies was employed to analyze the data comprehensively. For the quantitative analysis, I used scatter plots to compare the first-round and second-round reviews across multiple expert groups and multiple dimensions. A scatter plot is a type of data visualization used to display the values of, typically, two variables for a set of data. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis (Cleveland & McGill, 1984). This type of visualization is particularly useful for identifying relationships, trends, or distributions within data, and it is widely used in both scientific and business applications to explore potential correlations between variables. In this dissertation, scatter plots display feedback scores across different feedback dimensions between the two feedback rounds. Points on the plot show specific scores for a criterion at a particular round, with the vertical axis indicating the score and the horizontal axis listing the different criteria. Different colors for points, such as red and blue, differentiate scores from different rounds or groups, facilitating an easy comparison of score changes over time or between groups. Patterns observed, such as clustering of points or vertical dispersion within a single criterion, can suggest consensus among evaluators or significant changes in perceptions of quality over time, respectively. Scatter plots were chosen for their ability to clearly depict relationships and changes between two variables, in this case feedback dimensions and feedback scores across rounds. This visualization is effective for examining data where comparisons over time or between groups are essential for discerning underlying trends and patterns in feedback (Tufte, 2001).
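As an illustration of this round-to-round comparison, the sketch below plots hypothetical first-round and second-round scores for each feedback dimension in Python with matplotlib; the dimension names and values are placeholders rather than the study's data.

```python
import matplotlib.pyplot as plt

# Hypothetical mean feedback scores (1-5) per evaluative dimension and round.
dimensions = ["3D alignment", "Equity", "Language", "Engagement", "Knowledge-in-use"]
round1 = [3.2, 2.8, 3.0, 3.5, 3.1]
round2 = [4.1, 3.9, 3.6, 4.0, 4.2]

x = range(len(dimensions))
plt.scatter(x, round1, color="red", label="Round 1")   # pre-refinement scores
plt.scatter(x, round2, color="blue", label="Round 2")  # post-refinement scores

plt.xticks(x, dimensions, rotation=30, ha="right")
plt.ylabel("Mean expert score")
plt.ylim(1, 5)
plt.legend()
plt.tight_layout()
plt.show()
```

Vertical gaps between the red and blue points within a dimension indicate where refinement shifted expert judgments, which can then be examined more closely in the qualitative analysis.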
Scatter plots also assist in identifying outliers or anomalies, providing a basis for targeted investigation in subsequent qualitative analyses. For the qualitative analysis, I used a similar approach presented above in the section 3.3.2.3. 72 CHAPTER 4: FINDINGS AND DISCUSSIONS In this chapter, I present a comprehensive analysis of the findings related to the three research questions, systematically derived from the data analysis process. This section aims to provide a detailed examination of the iterative training of generative AI models, the evaluation of AI-generated assessments by human experts, and the subsequent refinement processes, offering critical insights into the effectiveness and challenges of integrating AI in educational assessment design. 4.1 RQ1. How Can Generative AI Models Be Effectively and Iteratively Trained to Design Knowledge-In-Use Assessments? To respond to this research question, I used thematic analysis to analyze interactions between the individual human knowledge-in-use assessment design expert and the GPT-4 model. I first present the design process by showing how the interim assessment products and final assessments were created, and then I present the themes identified based on the transcripts to discuss how to iteratively work with GPT-4 to design knowledge-in-use assessments and what kinds of challenges and opportunities this approach brought. In this section, I use PE 3-PS2-1 as an example to show the design process. This section delineates the strategic exploration of leveraging the GPT to design knowledge-in- use assessments that align with the NGSS. This process was iterative, involving a synergy between human input and AI capabilities to shape the overall assessment design. I started the design process with one of the focal PEs, 3-PS2-1. I opened up each step closely by providing specific guidelines and goals. The following sections present the brief co-designing process for each step. 4.1.1 Unpacking Performance Expectations from the NGSS The initial phase involved unpacking the three dimensions of the PE to gain a comprehensive understanding of the core concepts embodied in this concise statement. In the preliminary stage of the design process, I imparted foundational information to GPT-4 models about the unpacking process. This covered the goal of domain analysis within the ECD framework, the methodology to unpack the PE's three dimensions, related resources, targeted grade level, and the specific DCI element to be unpacked. 73 To ensure the unpacking meets the explicit and specific requirements and does not miss any critical sub-ideas of the DCIs, SEPs, and CCCs, I prompted the GPT-4 models by introducing each dimension separately. I first introduced the DCIs in the PE 3-PS2-1. Moreover, I further explicitly pointed out the major sub-disciplinary core ideas for the DCI dimension of the PE. By doing that, I hoped the GPT-4 models could cover comprehensive disciplinary core ideas. In the case of 3-PS2-1, I gave the prompts of two specific DCIs in the PE, which are “DCI1 in the PE: PS2.A: Forces and Motion - Each force acts on one particular object and has both strength and a direction. An object at rest typically has multiple forces acting on it, but they add to give zero net force on the object. Forces that do not sum to zero can cause changes in the object’s speed or direction of motion. 
(Boundary: Qualitative and conceptual, but not quantitative addition of forces are used at this level.)” and “DCI2 in the PE: PS2.B: Types of Interactions - Objects in contact exert forces on each other.” After providing the explicit DCIs the PE focuses on, I then gave the GPT-4 model the guidelines and the task of unpacking the PE. Based on previous unpacking experiences and the literature, I provided explicit guidelines for the steps of unpacking DCIs. For instance, for DCI1, I gave the unpacking prompts shown in Figure 4-1. Figure 4-1. Prompts of DCI unpacking 4.1.1.1 DCI Unpacking I began the unpacking by limiting responses to 500 tokens. However, the responses were too concise to provide explicit ideas, so I gradually increased the output length from 500 to 1,000 and then to 1,500 tokens. I found that 1,500 tokens offered adequate information to meet the design requirements and kept this setting for the rest of my conversations. Table 4-1 presents the unpacking of the DCIs of PE 3-PS2-1 generated with 1,500 tokens. Following the same approach, I prompted the GPT-4 model to unpack DCI2 of the PE. My first-round iteration ended with this version of the DCI unpacking. These unpackings followed the specified unpacking procedures, but the degree of unpacking needed to be further addressed. For instance, among the “key sub-ideas,” the current version did not define the idea of “net force,” nor did it clearly explain how to judge whether an object is moving. I therefore provided extra information to guide the GPT model's unpacking: I reiterated the importance of identifying the critical sub-ideas of each DCI, supplied additional information about the DCIs, and highlighted the importance of the grade boundary. The DCI information was drawn from the corresponding sections of the NGSS and The Framework. Figure 4-2 presents the prompts I provided to support further unpacking. Figure 4-2. Prompts for 2nd iteration of DCI unpacking Table 4-1. Iterations of DCIs of PE 3-PS2-1 unpacking DCI unpacking DCI1_PS2.A: Forces and Motion DCI2_PS2.B: Types of Interactions Iteration 1 1. Elaborating the DCI Elements and Sub-ideas: DCI1, PS2.A, focuses on understanding that forces act on objects, these forces have both strength and direction, and how these forces affect an object's motion. Key sub-ideas include: - The concept that multiple forces can act on an object at rest, adding up to zero net force, keeping the object stationary. - The idea that forces which do not sum to zero affect the object's speed or direction of motion. 2. Defining Boundaries At this grade band, the focus is on qualitative and conceptual understanding of forces and their effects on motion. Quantitative aspects, such as calculating the exact sum of forces or using formulas to predict motion, are outside the learning scope here. The instructional emphasis is on observable phenomena and conceptual reasoning rather than numerical computation. 3. Describing Prerequisite Knowledge - Students need a foundational understanding of: - Basic motion concepts (e.g., speed, direction). - The idea of "force" as a push or a pull. - Simple understanding of interaction between objects (e.g., contact forces). This knowledge often stems from everyday experiences, like playing on a slide or kicking a ball, and from previous learning segments, possibly in earlier grades, focusing on basic physical concepts. 4.
Identifying Relevant Phenomena - Relevant phenomena that could anchor the DCI include: Unpacking DCI 2 (PS2.B: Types of Interactions) 1. Elaborating the DCI Elements and Sub-ideas: **Key Elements**: The core of PS2.B focuses on the interactions between objects that are in contact, highlighting that these interactions result in exertion of forces on each other. **Relationship with PE or Bundle**: This DCI is directly linked to the focal Performance Expectation (3-PS2-1) as understanding the types of interactions between objects is crucial for planning and conducting investigations to observe the effects of balanced and unbalanced forces. 2. Defining Boundaries: **Scope of Learning for This Grade Band**: At this level, the understanding of interactions is qualitative. Students are expected to recognize and describe the effects of interactions without the necessity for quantitative force calculations. **Outside the Scope**: Detailed mathematical modeling of forces (i.e., using formulas to calculate force magnitudes or directions) is beyond the expected learning outcomes for elementary students. 3. Describing Prerequisite Knowledge: **From This Topic**: Familiarity with general concepts of forces (as introduced possibly in earlier grades) – that they can push or pull objects. **From Other Topics**: Basic observational and reasoning skills to infer that objects in contact can influence each other's motion. Understanding of simple motion (moving, stopping) is also necessary. 4. Identifying Relevant Phenomena: **Accessible Examples**: - Playing on a seesaw: How the weight and position of one child affect the motion of 77 Table 4-1 (cont’d) - A book remaining stationary on a table demonstrates balanced forces (gravity pulling down is balanced by the table pushing up). - Pushing a toy car to make it move illustrates how an unbalanced force (the push) changes the object\'s motion. - Observing leaves fluttering to the ground versus a rock falling quickly can spark discussion on how forces act on different objects. These phenomena are relatable and observable in students’ everyday environments, making the abstract concepts of forces and motion more tangible. 5. Identifying Student Strengths and Challenges **Strengths**: Students often have intuitive understandings of motion and force from their daily experiences, such as playing sports or moving objects. **Challenges**: - Students may have misconceptions such as believing that only moving objects exert forces or that stationary objects are not subject to forces. - Understanding that forces are vectors, having both magnitude and direction, can be challenging at this level, particularly without grounding in actual experiences. another. - Pushing a toy car and observing how it interacts with obstacles. - The stopping of a rolling ball when it comes into contact with a wall. **Real-world Situations**: Examples like the friction experienced when sliding down a slide or the way a soccer ball stops when it hits a goalpost help contextualize the phenomena of interactions between objects. 5. Identifying Student Strengths and Challenges: **Relevant Experiences**: Children often have firsthand experiences with the effects of force through play, such as pushing and pulling toys, feeling the resistance of water while swimming, or the effort needed to move heavier versus lighter objects. 
**Common Challenges**: A prevalent misunderstanding among students may be the notion that only moving objects exert forces, disregarding the forces present in static situations (e.g., a book resting on a table). Another challenge is conceptualizing that force involves an interaction between two objects, not an inherent property of a single object. Iteration 2 'Given the provided context and requirements for unpacking the first DCI (PS2.A: Forces and Motion) for the focal performance expectation 3- PS2-1, here's a detailed unpacking tailored for grades 3-5: Aspect 1: Elaborating the DCI Elements and Sub-ideas **DCI PS2.A: Forces and Motion focuses on understanding how forces affect the motion of objects. This includes:** - **Force Characteristics:** Forces are push or pull actions that can cause an object to start moving, stop moving, or change its motion's direction. Each force acts on a particular object with a specific strength (magnitude) and direction. Based on the provided explicit ideas and understandings of the DCIs for PS2.B: Types of Interactions, here is the unpacking for the Second DCI: PS2.B in 3-PS2-1 according to the requested aspects: 1. Elaborating the DCI Elements and Sub-ideas **DCI Elements and Sub-ideas:** The key elements of PS2.B focus on the nature of the interactions between objects. Primarily, it deals with the concept that objects in contact exert forces on each other. This covers a broad range of forces, including friction, elastic pushes and pulls, and the forces exerted over a distance, such as gravitational, electric, and magnetic forces. The sub-ideas include: - **Friction:** The resistance that one surface or object encounters when moving over 78 Table 4-1 (cont’d) - **Interaction of Forces:** Multiple forces can act on an object simultaneously. When the forces are balanced (sum to zero), the object's motion doesn’t change; it remains at rest or continues moving at a constant speed in a straight line. Unbalanced forces (those that do not sum to zero) cause changes in the object\'s motion, either accelerating it or altering its direction. -**Newton’s Third Law:** The mutual interactions between two objects can be described in terms of equal and opposite forces. When one object exerts a force on a second object, the second object simultaneously exerts a force equal in magnitude and opposite in direction on the first object. Aspect 2: Defining Boundaries **At this grade level, the focus is on:** - Qualitative and conceptual explanations of how balanced and unbalanced forces affect motion, without delving into quantitative analysis or calculations of force magnitudes. - Observations and descriptions of motion changes due to applied forces, without introducing technical terms like "magnitude," "velocity," or "vector." Aspect 3: Describing Prerequisite Knowledge **Students entering grade 3 should already have foundational knowledge, including:** - Basic understanding of motion — knowing that objects move in various ways, such as straight, zigzag, round and round, back and forth, fast and slow. - Familiarity with pushes and pulls as forces that can move objects. Aspect 4: Identifying Relevant Phenomena **Examples of relevant, real-world phenomena related to forces and motion include:** - Playing on a seesaw demonstrates how applying force (pushing down) on one end causes the other end to move up, illustrating action and reaction forces. - Observing how kicking a soccer ball another. 
- **Elastic Forces:** The forces exhibited by objects that can be stretched or compressed and return to their original shape. - **Gravitational Forces:** The attraction between objects with mass. Although typically weak on a human scale, it's significant for massive objects like planets. - **Magnetic and Electric Forces:** These forces can be either attractive or repulsive and do not require the objects to be in contact. These sub-ideas are intrinsically linked to each other and to the broader phenomenon of motion and stability, as outlined in 3-PS2-1. Understanding the types and nature of these interactions provides a foundation for predicting and explaining motion. 2. Defining Boundaries **Grade Band Boundaries:** For grades 3-5, the focus is on observable forces and interactions, including friction, elastic forces, and the basic introduction to forces that act at a distance like gravity, magnetism, and static electricity. The boundaries at this level exclude detailed mathematical modeling of these forces (e.g., using Newton's law of universal gravitation or Coulomb’s law) and the in-depth exploration of electromagnetic, strong and weak nuclear interactions. 3. Describing Prerequisite Knowledge Students need a basic understanding of motion, including the concepts of pushing and pulling, and that motion can change in speed or direction. Knowledge from earlier grades that objects can affect each other's motion through physical contact. An introductory understanding of magnets and static electricity, as well as the intuitive idea that the Earth pulls objects towards it. 4. Identifying Relevant Phenomena Playing with magnets to feel the attraction and repulsion forces without direct contact. - Observing objects rolling down different surfaces to introduce friction as a force that slows motion. 79 Table 4-1 (cont’d) (applying a force) changes its motion depending on the strength and direction of the kick. Aspect 5: Identifying Student Strengths and Challenges **Strengths:** - Many students have practical experience with pushing and pulling objects, which can be leveraged to introduce the concepts of force and motion. **Challenges:** - Students may believe that constant motion requires a constant force application, conflating force with motion. - Some students might think that a larger object always moves slower than a smaller one, regardless of the forces applied, due to misconceptions about mass and force. By exploring and addressing these various aspects, educators can tailor their teaching strategies to effectively convey the foundational principles of forces and motion to elementary students, setting a robust groundwork for more advanced concepts in later grades. - Using rubber bands to demonstrate how elastic forces work. - Dropping objects to discuss and observe gravity's effect on all objects, regardless of their size. 5. Identifying Student Strengths and Challenges - **Strengths:** Students often have direct, everyday experiences with the types of interactions described in PS2.B, such as playing with magnets, using elastic bands, and observing objects falling. These provide an intuitive grasp of the forces involved. - **Challenges:** Students may have misconceptions about forces, particularly about gravitational and magnetic forces acting at a distance. They might not easily accept that invisible forces can exist without direct contact between objects. 
Further, the concept that forces can be both push and pull, and the idea that objects can exert forces on each other simultaneously (as in Newton’s third law), can be abstract and challenging to understand fully without concrete examples and guided exploration. After two rounds of iterations, the unpacking of the DCIs became much more explicit compared to the initial generation. For instance, the unpacked sub-ideas for PS2.A evolved significantly. Initially, the unpacking provided a basic understanding: "The concept that multiple forces can act on an object at rest, adding up to zero net force, keeping the object stationary," and "The idea that forces which do not sum to zero affect the object's speed or direction of motion." However, through iterative refinement, the unpacking became more detailed: "DCI PS2.A: Forces and Motion focuses on understanding how forces affect the motion of objects. This includes Force Characteristics: Forces are push or pull actions that can cause an object to start moving, stop moving, or change its motion's direction. Each force acts on a particular object with a specific strength (magnitude) and direction. Interaction of Forces: Multiple forces can act on an object simultaneously. When the forces are balanced (sum to zero), the object's motion doesn’t change; it remains at rest or continues moving at a constant speed in a straight line. 80 Unbalanced forces (those that do not sum to zero) cause changes in the object's motion, either accelerating it or altering its direction. Newton’s Third Law: The mutual interactions between two objects can be described in terms of equal and opposite forces. When one object exerts a force on a second object, the second object simultaneously exerts a force equal in magnitude and opposite in direction on the first object." Reflecting on the co-design process of unpacking, it is crucial to emphasize the importance of providing explicit outputs by setting specific task requirements and goals. Moreover, it is essential to equip the GPT models with sufficient domain-specific information to enable accurate analysis and completion of tasks. While GPT models can access general information, they lack depth in domain- specific knowledge unless explicitly provided. Consequently, the depth and appropriateness of the unpacking are limited by the scope of information the GPT model can analyze. Human experts, with their domain content knowledge and unpacking experience, play a critical role in identifying whether the outputs are appropriate or require further revisions. Their judgment ensures that the outputs meet the task requirements. Human experts’ reflections are also vital for the iterative training process. For instance, the second iteration of DCI unpacking occurred after generating the initial DCI map, which revealed the unpacking level was insufficient. Subsequently, I re-unpacked the DCIs with more explicit prompts and additional scientific knowledge. This iterative refinement process may vary with different AI models, each having unique capabilities for acquiring training information. 4.1.1.2 SEPs and CCCs Unpacking To unpack the SEPs and CCCs for the PE, I employed a similar approach as with the DCIs. I began by identifying the specific SEP related to the PE and provided foundational information, including definitions, theoretical foundations, the framework of the SEP, and the grade boundary for the SEP (refer to Figure 4-3). 
Following this, I outlined specific SEP unpacking principles to guide the GPT-4 models in their analysis (refer to Figure 4-4). This structured approach ensured a comprehensive and detailed unpacking of the SEP and CCC elements, aligning them with the educational standards and goals of the PE. Figure 4-3. Prompts for SEP (planning and carrying out investigations) fundamental information Figure 4-4. Prompts of SEP unpacking Following the initial exploration, I redirected the GPT-4 model to focus on unpacking two specific SEPs: developing and using models, and constructing scientific explanations, rather than the previously explored SEP of planning investigations. I employed similar strategies to prompt the model, beginning by specifying the task goal of unpacking the SEPs and providing fundamental knowledge about each SEP, including definitions, expectations, and grade boundaries. After establishing this foundational understanding, I introduced the SEP unpacking principles to guide the GPT model in meeting the task requirements. The output of the initial unpacking of developing and using models reflected the general information provided in the prompts. Table 4-2 presents the iterations and the resulting unpacking, demonstrating the refinement process and the increasing specificity and depth of the unpacked SEPs. Table 4-2. The iterations and adjustments of unpacking the SEP of Developing and Using Models Input information for the task Unpacking Iteration 1 Adjustments for task requirements Input of the meaning of developing and using models, the applications of scientific models, the grade boundary of developing and using scientific models, and the progressive perspectives of scientific models. Given the depth and breadth of the Science and Engineering Practice (SEP) of "Developing and Using Models," let’s unpack this SEP according to the requested aspects: 1. Describing the SEP and Its Essential Aspects of Performance **Grade-Appropriate Definition:** Developing and Using Models is about students constructing, analyzing, and working with representations that symbolize natural or designed systems. These models help make predictions, provide explanations, and propose solutions to scientific or engineering problems. **Essential Aspects of the SEP:** **Constructing Models** to represent physical, conceptual, or mathematical aspects of phenomena. **Analyzing and Interpreting Models** to explore and predict behaviors or investigate the mechanisms within the phenomena. **Using Models to Test Solutions** or explanations in a simulated, conceptual environment. 2. Identifying Productive Intersections between the SEP and Other SEPs This SEP can productively intersect with: **Planning and Carrying Out Investigations:** Models can be used to design experiments or predict outcomes, guiding investigative processes. **Analyzing and Interpreting Data:** Models help in understanding data patterns and making sense of experimental results. **Using Mathematics and Computational Thinking:** Developing mathematical or computational models to simulate phenomena. **Engaging in Argument from Evidence:** Using models to support or refute arguments based on empirical evidence. These intersections are crucial for developing a systematic understanding of phenomena in line with the DCI and CCC of the PE. 3.
Specifying the Evidence That Will Be Needed to Demonstrate the SEP Evidence of proficiency includes: **Ability to Build Models:** Students create appropriate models of phenomena or systems that they study. **Model Analysis:** Students can analyze a model to ~ The unpacking framework of developing and using models is not explicit enough, including constructing models, analyzing and interpreting models, and using models. However, they're all general larger grain- size of proficiency requirements of scientific modeling practices. When it comes to specific dimensions mentioned here, there should be a more explicit framework to guide students to construct models, or judge models. That is., how to construct a model, what kind of basic criteria that the scientific models should have to enable it to have the explanation function instead of only description. Also, when judging models, what should be the critical aspects that students want to consider in terms of the quality of explanatory ability of the models. In addition, the unpacking 84 Table 4-2 (cont’d) dimensions (4, 5, and challenges) are too general without providing explicit information for each aspect of modeling. explain phenomena, predict outcomes, or test solutions. **Model Revision:** They can iteratively refine their models based on new evidence or understanding. 4. Describing Students’ Prerequisite Knowledge Students should have: **Understanding of the Phenomena:** Basic knowledge about the system or phenomena being modeled. **Skills in Observation and Data Collection:** Ability to gather and interpret data that informs the model. **Mathematical Reasoning:** Depending on the model, basic to advanced mathematical skills. 5. Identifying Student Strengths and Challenges **Strengths:** **Creativity in Model Construction:** Students often bring innovative approaches to representing complex ideas. **Intuitive Understanding of Basic Systems:** From their everyday experiences, students often have an intuitive sense of how things work, which can be a good foundation. **Challenges:** **Abstract Thinking:** Moving from concrete representations to abstract models can be challenging. **Complex Systems Understanding:** Grasping the intricacies of complex systems and accurately representing them in models requires development. **Connection with Everyday Knowing and Doing:** Everyday experiences, like using a map or building simple objects, relate to modeling but on a more intuitive level. Educators can build on these experiences to develop a deeper, more systematic approach to modeling scientific phenomena. Iteration 2 Adjusted Unpacking for Developing Models (Grades 3-5) Components of the Practice: **Model Elements:** Students should focus on identifying basic observable elements (e.g., objects, organisms) and their noticeable attributes to include in their models, emphasizing the inclusion of components that are directly observable or can be researched at an elementary level. **Relationships Among Elements:** At this grade level, emphasis should be on illustrating simple and direct relationships between elements. For example, showing a predator-prey relationship in a food chain or the effect of sunlight on plant growth. **Sequence of Events:** Models by students should demonstrate a basic linear sequence of events that illustrates cause-and-effect relationships relevant to the phenomenon being explored, such as the stages of the water cycle. 
The adjusted output aligns with the unpacking principles and framework of developing models, with the integrated proficiencies of each sub-aspect of the unpacked SEP, and with the challenges and prerequisite knowledge that students need in order to meet the grade-boundary requirements for developing models. More impressively, the challenges and prerequisite knowledge for developing models were also unpacked at a smaller grain size based on the framework I provided.

Input (Iteration 2): To ensure that the SEP unpacking met the requirements of the task and was explicit enough to guide the later evidence statement design, I further provided information on the framework of developing models, including identifying the essential components in the system that affect the phenomenon, mapping out the relationships among the components of the model, and leveraging the components and their relationships to explain the corresponding phenomena and solve complex problems. I also provided some exemplar unpackings that were done by human experts. In addition, I asked for the unpacking to focus on the developing-models aspect instead of the other two aspects, since in my study I decided to measure students' ability to develop models to explain real-world phenomena.

Output (Iteration 2): **Correspondence:** Students should be encouraged to make connections between their model elements and real-world observations, using familiar and concrete examples wherever possible. **Limitations:** It's important for students to recognize that their models may not cover all aspects of a phenomenon, focusing on discussing what their model shows well and what it doesn't. **Explanation/Prediction:** Encourage students to use their models to explain simple phenomena or predict outcomes in familiar contexts, like predicting the growth of a plant under different conditions. Knowledge, Skills, and Abilities: **Model Elements:** Understanding that models represent real-world phenomena with specific, observable features that can be seen or measured. **Relationships Among Elements:** Ability to connect elements in a model to show their interactions (e.g., through arrows or spatial placement). **Sequence of Events:** Recognizing how one event can lead to another in a straightforward manner. **Correspondence:** Matching elements of the model with real-world counterparts or data they have collected or researched. **Limitations:** Awareness that models are simplifications and cannot capture all details of a phenomenon. **Explanation/Prediction:** Ability to use the model to explain observations or predict simple outcomes based on their understanding of the model. Evidence for Each Component: **Model Elements:** Model focuses on the necessary elements needed to explain a target phenomenon at a basic level, related to observable or researched information appropriate for Grades 3-5. **Relationships Among Elements:** Model clearly represents basic relationships or interactions needed to explain the phenomenon, avoiding unnecessary complexity. **Sequence of Events:** Illustrates a clear and correct sequence of events, with each step logically following the previous step in a manner appropriate for the understanding of a student at this grade level. **Correspondence:** Demonstrates a clear connection between model elements and the real-world phenomenon, using examples or data that are accessible to students in Grades 3-5. **Limitations:** Student can articulate basic limitations of the model in explaining the phenomenon, showing awareness of the model's simplicity. **Explanation/Prediction:** Uses the model to construct understandable explanations or predictions relevant to the phenomenon, appropriate for their developmental level. Adjusting the unpacking components for developing, using, and evaluating models to align with the grade boundary of Grades 3-5 involves simplifying the complexity of the concepts, focusing on observable and directly relational aspects of the phenomena, and ensuring that the language and examples are accessible and relevant to the experiences of students in these grades. This adjustment also emphasizes the development of foundational skills in modeling practices that students will build upon in later grades.

Reflections (Iteration 2): Specify the task requirements and provide adequate and appropriate knowledge for the GPT-4 models. Providing exemplar examples and specifying the differences between the examples and the previous generation is important for the quality of the outputs.

The iterative process of unpacking the SEP of Developing and Using Models involved refining the AI's outputs through multiple rounds of input and feedback, significantly enhancing the quality and detail of the unpacking over time. Initially, the task involved providing the GPT-4 model with fundamental knowledge about the SEP, including its definition, applications, and grade boundaries. The first iteration produced general insights but lacked specificity in guiding students on how to construct and evaluate scientific models. To address these shortcomings, subsequent iterations included more detailed and explicit guidelines, emphasizing the essential components of developing models. This involved clarifying the elements to be included in models, such as identifying observable components, mapping relationships among these components, and understanding the limitations of models. The adjusted approach also focused on developing students' abilities to use models for explanation and prediction, relevant to real-world phenomena. Human experts played a critical role throughout this iterative process. Their feedback helped identify areas where the AI outputs were too general or did not meet educational standards. By integrating their insights and providing more domain-specific information, the unpacking became more detailed and aligned with educational goals. The iterative nature of the process, coupled with reflective practice, ensured continuous improvement and refinement of the AI-generated outputs. To enhance the specificity of the unpacking, I provided the system with a framework of the subdimensions of developing models. This framework included the components of the system that need to be modeled, the relationships among the components, and leveraging these relationships to explain relevant phenomena or solve complex problems. Additionally, I provided the grade boundaries for this SEP. Iteration 2 in Table 4-2 shows the adjusted unpacking of developing models. The adjusted output aligns with the unpacking principles and framework of developing models, integrating the proficiencies of each sub-aspect of the unpacked SEP and addressing the challenges and prerequisite knowledge students need to meet the grade boundary requirements. The challenges of prerequisite knowledge were unpacked in more detail based on the provided framework. Reflecting on the process, it is clear that specifying task requirements and providing adequate and appropriate knowledge for the GPT-4 model were crucial. Providing exemplar examples and highlighting the differences between iterations significantly improved the quality of the outputs. This iterative and reflective approach ensured that the AI-generated unpackings were both comprehensive and aligned with educational standards, setting the stage for unpacking the CCCs in the next section.
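The refinement pattern described here is essentially a multi-turn conversation in which each round carries forward the previous generation plus the expert critique. The study itself was conducted through interactive sessions with GPT-4 rather than through code; purely as an illustration, the sketch below shows what such a loop could look like if scripted, assuming the OpenAI Python SDK and placeholder prompt and feedback text that stand in for the actual materials used in this study.

```python
# Illustrative sketch only: an unpack-review-refine loop of the kind described
# above. Assumes the OpenAI Python SDK (openai>=1.0); the prompt and feedback
# strings are placeholders, not the actual materials used in this study.
from openai import OpenAI

client = OpenAI()  # API key is read from the OPENAI_API_KEY environment variable

SYSTEM_PROMPT = (
    "You are an assessment-design assistant. Unpack the science and engineering "
    "practice 'Developing and Using Models' for the grade 3-5 boundary, following "
    "the framework supplied by the user."
)

# In the study this feedback came from human experts; these strings are stand-ins.
expert_feedback = [
    "Too general: name the model components, the relationships among components, "
    "and how both are used together to explain the phenomenon.",
    "Focus only on the developing-models aspect and match the grain size of the "
    "exemplar unpackings.",
]

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Unpack this SEP: definition, grade boundary, "
                                "knowledge/skills/abilities, and required evidence."},
]

for round_number, feedback in enumerate([None] + expert_feedback, start=1):
    if feedback is not None:
        # Feed the critique back in so the next generation can improve on the last.
        messages.append({"role": "user",
                         "content": f"Revise the unpacking. Feedback: {feedback}"})
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    draft = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": draft})
    print(f"--- Iteration {round_number} ---\n{draft}\n")
```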
Following a similar approach, I also prompted the GPT-4 model to unpack the other SEP focused on in this study, constructing scientific explanations and argumentation. Additionally, two CCCs were unpacked using the same method. Table 4-3 presents the final versions of the unpacking for both the SEPs and the CCCs.

Table 4-3. The final output of the unpacking of constructing scientific explanations and the CCCs

SEP/CCC: Constructing scientific explanations. Unpacking: Components of Scientific Explanation: 1. **Claim:** In the context of Grades 3-5, a claim would be a student's concise answer to a question about a scientific phenomenon or design solution, articulated in a manner understandable by their peers. For example, a student might claim, "Plants grow faster in sunlight than in the dark." 2. **Evidence:** At this level, evidence should be based on direct observations, simple measurements, or patterns discovered in data. Students might collect evidence about plant growth in sunlight versus darkness through regular measurement of plant height. Visual aids, such as photographs of the plants at various stages, might also be considered as supporting evidence. 3. **Reasoning:** Reasoning for students in Grades 3-5 involves linking the evidence to the claim in a straightforward manner, perhaps with the assistance of basic scientific principles (like photosynthesis). An example of reasoning at this level might be, "Plants need sunlight to perform photosynthesis, which helps them grow. Our plants in the sun grew taller, which supports our claim." Knowledge, Skills, and Abilities for Constructing Explanations: - Knowledge of the basic structure of a scientific explanation (claim, evidence, reasoning). - Skills in observing, measuring, and recording data accurately. - Ability to make straightforward observations and connect these observations to broader scientific concepts. - Ability to articulate a simple claim based on observations or findings. - Ability to use simple scientific vocabulary to explain how their evidence supports their claim. Evidence for Each Component of the Practice: **For Claims:** Students articulate a clear and testable claim that answers a question about a phenomenon or solves a problem. **For Evidence:** Students present specific observations, measurements, or identified patterns as evidence. This could include simple data tables, charts, or descriptive observations. **For Reasoning:** Students logically explain why the evidence they have gathered supports their claim, possibly linking to broader scientific ideas appropriate for their grade level. In adjusting these components for Grades 3-5 students, emphasis is placed on the clarity of expression, the directness of the observed evidence, and the basic logic linking evidence to claims. The complexity of the scientific principles involved is scaled to be age-appropriate, with a focus on fostering students' ability to think critically and reason scientifically at an elementary level.

SEP/CCC: Evaluate scientific explanations. Unpacking: Claim: **Grade 3-5 Focus**: Determine if the claim provides a direct answer to the question based on what was observed or found in the investigation.
**Evidence Requirement**: Checks if the claim makes sense given the observations or data collected during class activities or simple experiments. Evidence: **Grade 3-5 Focus**: Evaluate if the evidence includes observations or data that directly supports the claim. This could include things seen, counted, measured, or read about in trusted sources. **Evidence Requirement**: Verifies that students are not just listing observations but are connecting them explicitly to the claim as supportive evidence. Reasoning: **Grade 3-5 Focus**: Assess whether the explanation includes basic scientific concepts or ideas that connect the claim and evidence. This might involve simple cause-and-effect relationships or observations of change. **Evidence Requirement**: Looks for statements where students explain "why" their evidence supports their claim, using age-appropriate scientific language and concepts. For elementary students, evaluating explanations often revolves around their ability to: **Understand and apply basic scientific vocabulary and concepts** related to the topic at hand. **Make clear connections** between what they observed (evidence) and what they claim those observations mean. **Use simple reasoning** to articulate why their evidence supports their claim, drawing on foundational scientific principles they've learned. The focus is on fostering foundational skills in scientific inquiry, such as making observations, drawing conclusions from data, and beginning to articulate the rationale behind these conclusions with basic scientific reasoning. This adjusted unpacking aims to scaffold these practices in ways that are accessible and meaningful for Grades 3-5 students, preparing them for more complex scientific thinking in later grades.

SEP/CCC: Cause and effect. Unpacking: Causes: **Correctly identifies** or describes basic cause(s) that lead to an observable effect(s), often in a simple and direct relationship. Effects: **Correctly identifies** simple and observable effect(s) that result from a specific cause(s), understanding that the same cause may not always lead to the same effect due to different conditions. Conditions: **Correctly recognizes** or describes, in basic terms, the conditions under which certain cause-and-effect relationships hold or don't hold, acknowledging that different outcomes can arise from changes in conditions. Mechanism/Intermediate Events: **Able to provide a basic explanation** of how one event leads to another, using simple scientific concepts where applicable. Identifies basic chains of events or actions that link a cause to its effect without necessarily needing to use specific scientific terminology. Evidence: **Describes or provides** simple observations or data that they have gathered or been given as evidence to support a causal claim. Understands that evidence is needed to support the link between a cause and its effect. Predictions/Theories: **Makes simple predictions** based on observed cause-and-effect relationships, understanding that if we know the cause, we can predict the effect or vice versa. Begins to see how these predictions can connect to broader scientific ideas or theories, even at a rudimentary level. Solutions: **Designs basic solutions** to simple problems by applying an understanding of cause and effect. For instance, if plants are not growing, recognizing that water (cause) might be necessary to affect growth (effect), and proposing watering the plant as a solution.
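The final unpackings above feed every later design step (IDM, LPs, evidence statements), so they need to be carried forward as reusable context. Only as an assumed convenience, and not as part of the NGSA or evidence-centered design terminology itself, a small record type like the following could hold an unpacked dimension in code; the field names are hypothetical.

```python
# Illustrative sketch only: one possible in-memory record for an unpacked
# dimension so later prompts (IDM, learning performances, evidence statements)
# can reuse the same text. Field names are assumptions, not NGSA terminology.
from dataclasses import dataclass, field


@dataclass
class UnpackedDimension:
    dimension_type: str                     # "DCI", "SEP", or "CCC"
    name: str                               # e.g., "Developing and Using Models"
    grade_band: str                         # e.g., "3-5"
    sub_aspects: list = field(default_factory=list)
    knowledge_skills_abilities: list = field(default_factory=list)
    evidence_by_component: dict = field(default_factory=dict)


# Example instance populated from the Cause and Effect unpacking above.
cause_and_effect = UnpackedDimension(
    dimension_type="CCC",
    name="Cause and Effect",
    grade_band="3-5",
    sub_aspects=["Causes", "Effects", "Conditions", "Mechanism/Intermediate Events",
                 "Evidence", "Predictions/Theories", "Solutions"],
)
print(cause_and_effect.name, cause_and_effect.sub_aspects)
```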
4.1.2 Generating Integrated Dimension Map for the PE

After unpacking the dimensions, the next step is guiding the GPT-4 model to generate an Integrated Dimension Map (IDM) to facilitate the subsequent learning performance design. The IDM should integrate the three dimensions (DCIs, SEPs, and CCCs) based on the unpacking, using the DCI as the main framework. This involves mapping out the major ideas and then adding the unpacked aspects of the SEPs and CCCs to link the different DCI components together, which will guide the learning performance design. To illustrate this process, I use PE 3-PS2-1 as an example, demonstrating how I worked iteratively with the GPT-4 model to design the IDM. Table 4-4 presents the iterative process of co-designing the IDM for PE 3-PS2-1.

Table 4-4. Iterative process of co-designing the IDM for PE 3-PS2-1

Iteration 1. Input: provide the rationale and purpose of the IDM; provide design principles for designing the IDM; provide an explicit design process for generating the IDM.

Output: To lay out and make the connections between the DCI sub-ideas for Performance Expectation (PE) 3-PS2-1, we will focus on the Disciplinary Core Ideas (DCIs) PS2.A: Forces and Motion, and PS2.B: Types of Interactions. The goal is to create a map that demonstrates how these sub-ideas relate within the context of balanced and unbalanced forces affecting the motion of an object. Step 1: Lay Out the DCI Sub-ideas Based on previous unpacking: 1. Forces and Motion (PS2.A): Describes how each force acts on a particular object and has both strength and a direction. This DCI also touches upon how balanced forces result in no change in motion, whereas unbalanced forces cause a change in speed or direction of motion. 2. Types of Interactions (PS2.B): Explains that objects in contact exert forces on each other. Diagram Representation: In the diagram: - Text Box 1: Forces and Motion (PS2.A): Each force acts on an object with strength and direction. Balanced forces sum to zero, causing no change in motion, while unbalanced forces result in changes in motion. - Text Box 2: Types of Interactions (PS2.B): Objects in contact exert forces on each other. Arrows and Statements: Arrow from PS2.A to PS2.B: The direction and strength of forces (PS2.A) determine the interactions (contact/non-contact) between objects (PS2.B). Arrow from PS2.B to PS2.A: The interactions between objects (PS2.B) cause forces to be exerted, affecting motion (PS2.A). This setup focuses on the reciprocity between forces and interactions: how forces determine the nature of interactions, and inversely, how interactions modify the forces experienced by objects. Furthermore, understanding that balanced forces result in no motion change, while unbalanced forces result in motion change, forms a critical connection point between these sub-ideas. Generation Description: The created diagram lays out two main sub-ideas: "Forces and Motion" and "Types of Interactions" related to the performance expectation of investigating the effects of balanced and unbalanced forces. The relationship between them is illustrated via directional arrows indicating that the characteristics of forces (strength and direction) critically influence how objects interact with each other, and how these interactions, in turn, affect the forces that influence motion. This cyclic relationship underscores the interconnectedness of these core concepts in explaining how forces affect the motion of objects.

Reflections: The layout of the DCI sub-ideas does not really reflect the DCI unpacking; it just lists the two major ideas named in the PE that were asked to be unpacked before. In other words, the DCI map does not really include the unpacked sub-ideas of the major DCIs in this PE.

Iteration 2. Input: For this iteration, I prompted the GPT-4 models to regenerate the IDM by: 1. explicitly pointing out the issues of the generated IDM; 2. providing specific sub-DCI ideas unpacked before in the prompt to model how to construct the DCI map; 3. providing examples of how to design DCI maps.

Output: For the Performance Expectation (PE) 3-PS2-1, we start by laying out the key Disciplinary Core Ideas (DCIs) that were identified in the unpacking of this PE, focusing on "Forces and Motion" (PS2.A) and "Types of Interactions" (PS2.B). Given this, the diagram for laying out and connecting the DCI sub-ideas could look as follows: 1. Text Boxes for DCI Elements and Subideas: - Force - Balanced Force - Unbalanced Force - Motion - Types of Interactions 2. Arrangement and Connections: - Force is connected to **Balanced Force** and **Unbalanced Force** to indicate that these are two categories or outcomes of forces acting on objects. - Balanced Force is connected to **Motion** with a statement indicating that balanced forces result in no change in motion. - **Unbalanced Force** is connected to **Motion** with a statement indicating that unbalanced forces cause a change in the object's speed or direction of motion. - **Types of Interactions** is connected to both **Balanced Force** and **Unbalanced Force** to show that the types of interactions between objects can result in either balanced or unbalanced forces affecting the motion. 3. **Succinct Statements for Each Arrow:** - From **Force** to **Balanced Force**: "When equal forces act on an object from opposite directions, they are balanced, causing no change in motion." - From **Force** to **Unbalanced Force**: "Forces acting on an object that do not equal out result in an unbalanced force, causing a change in motion." - From **Balanced Force** to **Motion**: "Balanced forces result in no change in an object's motion." - From **Unbalanced Force** to **Motion**: "Unbalanced forces cause a change in the speed or direction of an object's motion." - From **Types of Interactions** to **Balanced Force**: "Interacting objects can exert balanced forces on each other." - From **Types of Interactions** to **Unbalanced Force**: "Interactions between objects can lead to unbalanced forces, affecting motion." This diagram represents the relationships among the key concepts related to forces and motion as elaborated in the unpacking of the DCI for PE 3-PS2-1. It visually illustrates how different types of forces and interactions between objects influence motion, serving as a foundation for integrating the SEPs and CCCs in step 2 of the mapping process. This structured approach ensures a clear understanding of the disciplinary content before integrating practices and crosscutting concepts.

Reflections: The sub-ideas of PS2.A are unpacked and presented based on the previous guidance. However, the ideas of PS2.B were not presented in the same way, which may reflect that the prompts need to be explicit and comprehensive enough to direct the model to detect the major requirement, analyze the information, take actions to complete the task, and then learn from this process to transfer what it learned from this case to similar subsequent tasks.

Iteration 3. Input: 1. Reiterated the importance of referring back to the IDM development procedure. 2. Addressed and incorporated the unpacked DCIs and their sub-ideas. 3. Asked the GPT model to provide a rationale for how and why it changed the IDM. 4. Pointed out the importance of addressing grade appropriateness. 5. Emphasized the importance of conciseness and clarity for the IDM and its visualization.

Output: Given the detailed unpacking provided for PS2.A: Forces and Motion and PS2.B: Types of Interactions, let's synthesize these elements into a coherent DCI map explicitly designed for the grade 3-5 boundary, focusing specifically on the PE 3-PS2-1. DCI Map Description for 3-PS2-1. **DCI Elements and Key Sub-ideas Association:** **Force**: Central to both PS2.A and PS2.B, representing both acted upon and resulting actions between objects. Sub-ideas include strength, direction, balanced forces, unbalanced forces, and the types of interactions (contact vs. at-distance) that can generate force. **Motion**: Tied closely with force from PS2.A, emphasizing changes in an object's speed or direction caused by unbalanced forces. **Types of Interactions**: From PS2.B, focuses on the specific nature of forces that objects exert upon each other, including gravitational and contact forces such as friction. Explicit Connections: 1. **From Force to Motion**: An arrow from "Force" to "Motion" can be labeled "Unbalanced forces cause changes," reflecting the understanding that any non-zero sum of forces (unbalanced forces) can directly alter an object's state of motion, either in speed or direction. This connection satisfies the core of PS2.A as it applies to the expectations for grades 3-5. 2. **From Types of Interactions to Force**: Multiple arrows come out from "Types of Interactions," each pointing towards "Force." These arrows can be labeled with the specific interaction types—gravitational, electric, magnetic, and contact. Each label serves to illustrate that these interactions are the source of forces acting upon objects. Visual Representation: Imagine the map structured with "Types of Interactions" at the top, branching out to "Force" in the center, and then leading down to "Motion" at the bottom. This visual flow guides the understanding from the source of forces through interactions, their nature and characteristics in the form of force, and finally to the effects of these forces as observed in motion. This structure directly ties back to the focal PE by illustrating the cause-and-effect relationship pivotal for understanding how different forces impact motion. Revision Observations: Given the PE's grade level (3-5), the map is tailored to emphasize qualitative understandings over quantitative. Thus, while the map outlines the types of forces and their effects on motion, it intentionally avoids delving into complex equations or concepts (e.g., Newton's laws in detail) that are beyond the recommended grade boundary. This revision ensures alignment with the grade-specific expectation that students can identify and describe forces and their effects but may not mathematically quantify those forces or effects. This DCI map provides a visual and conceptual foundation for designing learning experiences and assessments that align with the PE 3-PS2-1, ensuring grade-appropriate depth and encouraging students to explore the cause-and-effect relationships between forces and motion in their physical world.

Reflections: The revised DCI map meets the requirements of the task.
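The DCI map that emerges from iteration 3 is, in effect, a small labeled directed graph: text boxes for the DCI elements and annotated arrows for the connecting statements. Purely as an illustration, and not as a tool used in the study, the same structure can be recorded in a few lines of code for later reuse; the variable name below is hypothetical.

```python
# Illustrative sketch only: the iteration-3 DCI map for 3-PS2-1 captured as a
# labeled directed graph (source node, target node, arrow statement), mirroring
# the text-box-and-arrow description above.
dci_map_3_ps2_1 = {
    "nodes": ["Types of Interactions", "Force", "Motion"],
    "edges": [
        ("Types of Interactions", "Force",
         "Gravitational, electric, magnetic, and contact interactions are the "
         "source of forces acting on objects."),
        ("Force", "Motion",
         "Unbalanced forces cause changes in an object's speed or direction; "
         "balanced forces cause no change."),
    ],
}

# Print the map as "source -> target: statement" lines.
for source, target, statement in dci_map_3_ps2_1["edges"]:
    print(f"{source} -> {target}: {statement}")
```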
The iterative process for generating the IDM for PE 3-PS2-1 was essential for refining and enhancing the quality of the output. Initially, the process began with providing the GPT-4 model with a rationale and purpose for the IDM, along with design principles and an explicit design process. The first iteration's output, however, lacked specificity and did not adequately reflect the detailed unpacking of the DCIs, indicating the need for more precise guidance. Subsequent iterations addressed these shortcomings by providing detailed sub-ideas unpacked from the DCIs and examples of how to construct a comprehensive DCI map. This approach aimed to improve the alignment of the IDM with educational standards and specific task requirements. The second iteration showed progress but still did not fully integrate the detailed sub-ideas of PS2.B, revealing the need for even more explicit and comprehensive prompts. Further iterations emphasized referring back to the IDM development procedure and incorporating the unpacked DCIs and their sub-ideas comprehensively. This also involved asking the GPT model to provide a rationale for changes made to the IDM, ensuring the adjustments were grade-appropriate and clearly visualized. By reiterating these critical aspects, the final iteration successfully produced a detailed and coherent DCI map that met the task's requirements and educational standards.

Through the iterations, I directed the GPT models to generate the DCI map. Afterward, I prompted the GPT models to enrich the DCI map by following guidelines for adding the unpacked SEPs and CCCs of the PE. I provided prompts outlining the requirements for integrating SEPs and CCCs into the DCI map to generate the IDM. Figure 4-5 presents the prompts I provided for generating the IDM.

Figure 4-5. Prompts for enriching the DCI map by adding SEPs and CCCs to generate the IDM

In the prompts, I first clarified the task goals, which were to generate the IDM by adding SEPs and CCCs to the DCI map. This integration aimed to clarify how the relationships among various sub-DCI ideas could be connected and developed by incorporating appropriate SEPs and CCCs, ensuring alignment with the previous unpackings. The IDM was required to cover and meet the performance expectations for PE 3-PS2-1. After clarifying the task goals, I provided specific design principles to guide the GPT model in generating the IDM. Additionally, I offered an example of how to generate the IDM to facilitate learning. I directed the GPT model to refer back to the previous SEP and CCC unpacking to think about the integration. Finally, I further clarified three main goals for the task: to generate a clear IDM, to align with the grade boundaries, and to ensure explicit descriptions of the generated IDM. The iterative refinement resulted in a robust IDM for PE 3-PS2-1, presented in Table 4-5.
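Figure 4-5 appears as an image in the original document; in outline, the prompt it shows combines a task goal, design principles, an example, pointers back to the earlier unpackings, and the three explicit goals listed above. As a hedged sketch of that structure only, with hypothetical names and paraphrased prompt text, such a prompt could be assembled as follows.

```python
# Illustrative sketch only: assembling an IDM-enrichment prompt from the parts
# described for Figure 4-5 (task goal, design principles, an example, references
# to prior unpackings, and three explicit goals). Names and wording are
# hypothetical paraphrases, not the actual prompt used in the study.
def build_idm_prompt(dci_map_text: str, sep_unpacking: str, ccc_unpacking: str,
                     example_idm: str) -> str:
    task_goal = (
        "Enrich the DCI map for PE 3-PS2-1 by integrating the unpacked SEPs and "
        "CCCs so that the relationships among sub-DCI ideas are connected and "
        "developed into an Integrated Dimension Map (IDM)."
    )
    design_principles = (
        "Use the DCI map as the backbone; attach each SEP where students would "
        "enact that practice; use the CCC to link elements across the map; stay "
        "within the grade 3-5 boundary."
    )
    goals = (
        "Three goals: (1) generate a clear IDM, (2) align with the grade "
        "boundaries, (3) describe the generated IDM explicitly."
    )
    return "\n\n".join([
        "Task goal: " + task_goal,
        "Design principles: " + design_principles,
        "Example IDM:\n" + example_idm,
        "DCI map:\n" + dci_map_text,
        "SEP unpacking:\n" + sep_unpacking,
        "CCC unpacking:\n" + ccc_unpacking,
        goals,
    ])


# Usage: the returned string would be sent as a user message in the same kind
# of chat loop sketched earlier; the placeholders stand in for real text.
prompt = build_idm_prompt("<DCI map text>", "<SEP unpacking>", "<CCC unpacking>",
                          "<exemplar IDM>")
print(prompt[:200])
```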
Table 4-5. IDM for PE 3-PS2-1

Integrated Dimension Map for 3-PS2-1: **DCI Elements Sub-ideas Association:** **Force** and **Motion**: Central concepts from PS2.A, detailing that forces can cause changes in the motion of an object, emphasizing the strength and direction of forces, and differentiating between balanced and unbalanced forces. **Types of Interactions**: From PS2.B, elaborating on the nature of forces that objects in contact exert on each other, such as gravitational, electric, magnetic, and contact forces, like friction. **Integration with SEPs and CCC:** 1. **Force to Motion (PS2.A):** **SEP: Developing and Using Models** – Students can model how unbalanced forces affect an object's motion, providing a visual and conceptual understanding of principles such as the direction and strength of forces. **CCC: Cause and Effect** – Highlighting that unbalanced forces (cause) lead to changes in motion (effect). 2. **Types of Interactions to Force (PS2.B):** **SEP: Planning and Carrying Out Investigations** – Students can investigate different types of forces (e.g., gravity, magnetism) to understand their effects on objects. They can explore, for example, how the distance between objects affects the strength of the force. **SEP: Constructing Explanations** – Students explain how specific interaction types (gravitational, electric, magnetic, and contact) cause forces that can alter an object's motion. Visual Representation: The map visually presents **Types of Interactions** at the top, branching out to **Force** in the center, which then leads down to **Motion**. The flow demonstrates how interactions lead to forces, and how these forces influence motion. Arrows from each type of interaction towards Force are labeled with specifics like "gravity causes a pull on objects" or "contact like friction resists motion," each tied back to investigations or models students can create or use. The Cause and Effect CCC explicitly connects all elements of the map, underscoring that each interaction and the subsequent application of force lead to observable changes in motion as effects. Additional Notes: This revised map capitalizes on the natural curiosity of students in grades 3-5 to explore the physical world through hands-on investigations (Planning and Carrying Out Investigations), to visualize complex ideas through the development and usage of models (Developing and Using Models), and to articulate their understanding through the construction of scientific explanations. The CCC of Cause and Effect serves as the thematic backbone of the curriculum content, intertwining it with real-world observations and promoting a deeper understanding of the fundamental principles of physics as they pertain to forces and motion.

4.1.3 Generating Learning Performances Based on Unpacking and IDM

Upon concluding the unpacking of the three dimensions and creating the IDM, I proceeded to generate Learning Performances (LPs) that encapsulate the essential competencies students need to display to meet the PE. These LPs are detailed, three-dimensional statements of performance that act as practical indicators of student progression towards the PE, combining diverse disciplinary facets in an integrated manner. Each LP represents a crucial component of a PE that students need to master, ensuring their steady progression towards achieving the entire PE. The LP design process began with a briefing about the concept of LPs and their relationship to the PE. Following this introduction, I prompted GPT to conceive LPs for PE 3-PS2-1. The iterations and reflections on how I worked with the GPT model to design the LPs for 3-PS2-1 are presented in Table 4-6. This iterative approach ensured that the LPs developed were comprehensive, aligned with the unpacked dimensions, and reflected the integrated nature of the PE.
Table 4-6. Iterative process of designing LPs for PE 3-PS2-1

Iteration 1. Input: 1. Clarify the task goal, which is designing LPs for 3-PS2-1 based on the unpacking and the IDM. 2. Provide sufficient information about LPs, including the definition of LPs and the significance of LPs. 3. Provide specific guidelines for doing the task, that is, how to articulate LPs based on the IDM; I provided specific clarifications about the designing process. 4. Provide further explicit elaborations of the designing process by providing some examples of LPs and further clarifying the relationship between LPs and the PE. 5. Provide guidelines on how to determine whether an LP needs to be included in or excluded from the set of LPs. 6. Provide criteria for judging the quality of LPs to allow the GPT model to reflect on its generations. 7. Provide guidelines for the format of the LPs in terms of their structure. 8. Further clarify the task goals and provide specific requirements for the outputs, including asking for a rationale for the generated LPs based on the criteria I provided.

Output: Given the unpacking and integrated dimension map for the PE 3-PS2-1, focusing on "Plan and conduct an investigation to provide evidence of the effects of balanced and unbalanced forces on the motion of an object," here are three learning performances articulated: Learning Performance 1: **Students planning and conducting investigations (SEP)** to observe and record the effects **of balanced and unbalanced forces (DCI - PS2.A)** on the motion of an object in the playground, such as a ball or a swing, and then **identify cause and effect relationships (CCC)** based on their observations. **Rationale:** This learning performance addresses the SEP of planning and carrying out investigations by directly engaging students in hands-on activities to explore the core idea of forces and motion. It incorporates the CCC of cause and effect, as students are asked to draw connections between the forces applied and the resulting motion of objects. This learning performance has a smaller grain size compared to the PE, focusing on a practical exploration of the DCI within a specific, easily observable context. Together with others, it covers part of the PE by providing foundational experiences with forces. Learning Performance 2: **Students developing and using models (SEP)** to represent **how objects in contact exert forces on each other (DCI - PS2.B)** and then **explaining how these interactions can cause changes in motion (CCC)** using everyday materials like magnets or friction blocks. **Rationale:** This learning performance allows students to apply the SEP of developing and using models to the DCI relating to types of interactions, specifically focusing on contact forces. It includes the CCC of cause and effect as students are expected to connect the interactions to changes in motion, aligning with the "how" and "why" questions at the heart of scientific inquiry. This performance contributes to covering the PE by deepening students' understanding of the variety of interactions that can influence motion. Learning Performance 3: **Students constructing explanations (SEP)** for **how the strength and direction of forces (DCI - PS2.A)** affect an object's motion, using evidence from classroom experiments involving pushing and pulling toys to recognize **patterns (CCC)** in motion. **Rationale:** Engaging students in constructing explanations enables them to connect empirical evidence to the scientific theory of forces and motion. This learning performance targets a specific aspect of PS2.A, focusing on the quantitative attributes of forces and their qualitative effects. By recognizing patterns in motion as a CCC, students are encouraged to abstract from specific instances to the underlying principles governing motion. This performance pushes towards the comprehensive understanding required by the PE, demonstrating an integrated approach to three-dimensional learning. Learning Performance 4: **Students using mathematics and computational thinking (SEP)** to predict **the effects of unbalanced forces on an object's speed and direction (DCI - PS2.A)**, and understanding that **these changes in motion can be predicted based on cause and effect (CCC)**. **Rationale:** This learning performance incorporates computational thinking, an essential skill for interpreting and predicting complex systems' behavior in science. By focusing on unbalanced forces, which is a key concept in PS2.A, and linking it to mathematical predictions, students engage deeply with the disciplinary core idea. The cause and effect CCC reinforces the focus on the scientific reasoning behind motion changes. This performance culminates aspects of the PE concerning prediction and investigation, rounding out the set of performances for comprehensive coverage of the PE's scope.

Reflections: 1. The grain size of the generated LPs is not small enough to differentiate them from the original PE, especially for LP1. 2. Each LP includes multiple DCI sub-ideas, which makes the LPs too broad. For instance, LP3 discusses both the strength and the direction of force in one LP, which does not align with the goals of an LP. 3. The structure of the LPs is clear, and the three dimensions are integrated very well.

Iteration 2. Input: 1. Point out the issues of the designed LPs, including the scope and grain size of the LPs (e.g., LP1 is too similar to the original PE). 2. Direct the GPT model to reflect on its generations and provide justifications for how the generations align with the requirements set up before, and prompt the GPT model to consider revisions. 3. I provided exemplary revisions to revise the LPs. 4. Provide exemplar LPs to further enrich the fundamental information for GPT to learn from.

Output: **Revised Learning Performances for 3-PS2-1** LP1: Students plan and carry out investigations (SEP) to observe how different strengths of forces (PS2.A) affect the motion of an object, highlighting the cause and effect (CCC) relationship between the force applied and the change in motion. LP2: Students develop models (SEP) to represent scenarios where balanced forces (PS2.A) act on an object, resulting in no change in motion, such as a book on a table or a balanced tug-of-war, to illustrate the cause and effect (CCC) in physical interactions. LP3: Through constructing explanations (SEP), students use evidence from investigations to describe how objects in contact exert forces on each other (PS2.B), including friction, showing the interaction (CCC) between objects as a cause of motion changes. LP4: Students describe (SEP) how magnetic or gravitational forces (PS2.B) act at a distance, using models to explain the cause and effect (CCC) of these non-contact interactions on an object's motion or change in motion.

Reflections: 1. The CCCs in the LPs are not integrated well into the LPs. 2. Need to correct the SEP expression (describe, represent, etc.).

Iteration 3. Input: 1. Provided the revised LPs to the GPT model to consult its opinion. 2. Prompted the GPT model to review, judge, and revise the LPs if needed.

Output: Given the focal PE (3-PS2-1) and the unpackings provided for PS2.A: Forces and Motion and PS2.B: Types of Interactions, along with the integrated dimension map and the nature of the science and engineering practices (SEPs) and crosscutting concepts (CCCs) involved, let's address your inquiries about the four revised Learning Performances (LPs). 1. **Do the Four LPs Make Sense?** - Yes, the four LPs make sense within the context of 3-PS2-1. They are structured to cover different aspects related to forces and their effects on motion, addressing both contact and non-contact forces in a manner that is suitable for students in grades 3-5. 2. **How Do the Four LPs Align with the Unpacking and Integrated Dimension Map?** - The LPs directly relate to the key concepts identified in the unpacking of DCIs PS2.A and PS2.B. For example, LP1 and LP2 focus on the effects of balanced and unbalanced forces on motion, which is at the heart of PS2.A. LP3 explores the interactions between objects in contact, including friction, aligning with PS2.B. LP4 expands this understanding to non-contact forces such as magnetic and gravitational forces, which, while more subtly hinted at in PS2.B's unpacking for the elementary level, serves as an important bridge to understanding the various types of forces that affect motion. - Each LP incorporates SEPs and CCCs appropriately—LP1 uses planning and carrying out investigations to observe effects (a SEP), highlighting cause and effect (a CCC); LP2 and LP4 incorporate model development (a SEP) for explaining phenomena (a CCC); LP3 emphasizes constructing explanations (a SEP) based on evidence from investigations, involving interaction (PS2.B) and cause and effect (CCC). 3. **Do These Four LPs Cover the Scope of PE 3-PS2-1?** - Collectively, these LPs adequately cover the scope of PE 3-PS2-1. They explore the core concepts of forces and motion, including the nuanced distinction between contact and non-contact forces and their effects on an object's motion—elements crucial for students' understanding of this PE. They provide a comprehensive approach to building students' understanding through investigations, modeling, and constructing explanations. 4. **Further Review or Revision?** - Regarding LP4 and the question on the criticality of 'non-contact forces' for this PE—Yes, non-contact forces are essential for a full comprehension of PS2.B: Types of Interactions and, by extension, PE 3-PS2-1. While the focus at the grade 3-5 band is often on direct, observable phenomena, introducing the concept of non-contact forces like gravity and magnetism helps broaden students' understanding of forces beyond what they can see and touch. It lays a foundational understanding that not all forces require direct contact to influence motion, preparing students for more complex physics concepts later. Thus, LP4 is indeed on the right track as it engages students with the concept of non-contact forces acting at a distance—essential for grasping the broader spectrum of forces that influence the motion of objects, in line with PS2.B's emphasis on types of interactions, including those that do not require direct contact. Considering the analysis, these LPs appear well-constructed and aligned with the goals of PE 3-PS2-1, emphasizing hands-on, exploratory learning that integrates key scientific concepts with practice and crosscutting concepts in a manner appropriate for the targeted grade level.

Final revised LPs: LP1: Students plan and carry out investigations to observe how different strengths of forces affect the motion of an object. LP2: Students develop models to explain how balanced forces acting on an object result in no change in motion, using everyday scenarios such as a book resting on a table or a tug-of-war game with equal strength on both sides. LP3: Students construct explanations based on evidence from investigations to explain how objects in contact exert forces on each other, including friction, showing the interaction between objects as a cause of motion changes. LP4: Students use models to explain how non-contact forces (e.g., gravitational forces) act at a distance on an object's motion or change in motion.

The iterative process of designing LPs for PE 3-PS2-1 involved several rounds of refining inputs and outputs to develop comprehensive and educationally aligned LPs. The initial step was to clarify the task goals, which included designing LPs based on the unpacking and IDM of the PE. I began by providing the GPT-4 model with sufficient information about LPs, including their definition, significance, and their relationship with the PE. This involved giving specific guidelines on how to articulate LPs based on the IDM, providing examples of LPs, and clarifying the criteria for judging the quality of LPs. Table 4-7 presents the evolving versions of the designed LPs for 3-PS2-1.

Table 4-7. The evolving versions of the designed LPs for 3-PS2-1

1st iteration: LP1: Students planning and conducting investigations (SEP) to observe and record the effects of balanced and unbalanced forces (DCI - PS2.A) on the motion of an object in the playground, such as a ball or a swing, and then identify cause and effect relationships (CCC) based on their observations. LP2: Students developing and using models (SEP) to represent how objects in contact exert forces on each other (DCI - PS2.B) and then explaining how these interactions can cause changes in motion (CCC) using everyday materials like magnets or friction blocks. LP3: Students constructing explanations (SEP) for how the strength and direction of forces (DCI - PS2.A) affect an object's motion, using evidence from classroom experiments involving pushing and pulling toys to recognize patterns (CCC) in motion. LP4: Students using mathematics and computational thinking (SEP) to predict the effects of unbalanced forces on an object's speed and direction (DCI - PS2.A), and understanding that these changes in motion can be predicted based on cause and effect (CCC).
2nd iteration: LP1: Students plan and carry out investigations (SEP) to observe how different strengths of forces (PS2.A) affect the motion of an object, highlighting the cause and effect (CCC) relationship between the force applied and the change in motion. LP2: Students develop models (SEP) to represent scenarios where balanced forces (PS2.A) act on an object, resulting in no change in motion, such as a book on a table or a balanced tug-of-war, to illustrate the cause and effect (CCC) in physical interactions. LP3: Through constructing explanations (SEP), students use evidence from investigations to describe how objects in contact exert forces on each other (PS2.B), including friction, showing the interaction (CCC) between objects as a cause of motion changes. LP4: Students describe (SEP) how magnetic or gravitational forces (PS2.B) act at a distance, using models to explain the cause and effect (CCC) of these non-contact interactions on an object's motion or change in motion.

3rd iteration (final version): LP1: Students plan and carry out investigations to observe how different strengths of forces affect the motion of an object. LP2: Students develop models to explain how balanced forces acting on an object result in no change in motion, using everyday scenarios such as a book resting on a table or a tug-of-war game with equal strength on both sides. LP3: Students construct explanations based on evidence from investigations to explain how objects in contact exert forces on each other, including friction, showing the interaction between objects as a cause of motion changes. LP4: Students use models to explain how non-contact forces (e.g., gravitational forces) act at a distance on an object's motion or change in motion.

In the first iteration, the generated LPs were too broad, with some being nearly indistinguishable from the original PE. For example, the first LP focused on planning and conducting investigations but did not sufficiently narrow the scope to differentiate it from the PE. Similarly, some LPs integrated multiple DCIs, making them too extensive. Despite these issues, the structure of the LPs was clear, and the integration of the three dimensions (DCIs, SEPs, and CCCs) was evident. Reflecting on these outputs, I identified the need for more explicit guidance. In the second iteration, I provided specific prompts to address the scope and grain size of the LPs, guiding the model to reflect on its outputs and consider necessary revisions. By offering exemplar LPs and further enriching the information for the GPT-4 model to learn from, the quality of the generated LPs improved. The revised LPs were more focused and better aligned with the educational goals. They included: planning and conducting investigations to observe how different strengths of forces affect motion; developing models to explain how balanced forces result in no change in motion; constructing explanations based on evidence to describe how objects in contact exert forces on each other; and using models to explain how non-contact forces, such as gravitational forces, affect motion. In the final iteration, I reviewed and judged the revised LPs, ensuring they were aligned with the unpacking and the IDM. This involved confirming that the LPs covered the scope of PE 3-PS2-1 and appropriately integrated the three dimensions. The final set of LPs demonstrated a thorough understanding of forces and motion, exploring both contact and non-contact forces in a manner suitable for students in grades 3-5. The iterative process allowed for continuous refinement, with each cycle building on the previous one, integrating feedback, and enhancing the quality of the outputs.

4.1.4 Design Blueprints ---- Evidence Statements

The generation of LPs laid the groundwork for the next phase of designing knowledge-in-use assessments: the development of Design Blueprints, which encompassed crafting Integrated Proficiencies (IPs), Evidence Statements, Essential/Characteristic Task Features, Variable Task Features, and considerations for Equity/Fairness (Harris et al., 2019).
These blueprints would guide the development of knowledge-in-use assessment tasks targeting the LP. In this process, GPT was provided with comprehensive information about the components of the design blueprints, their interrelationships, and their alignment with the LP. In this section, I present the iterative design process of generating evidence statements for each LP for the PE 3-PS2-1. Prior to generating evidence statements, designing integrated proficiencies can provide guiding information for designing evidence statements; but because the contents are similar, this section focuses only on the design of evidence statements. Table 4-8 presents the iterative design process of generating the evidence statements for the LPs for PE 3-PS2-1. I take LP2 as an example to show the design process.

Table 4-8. Iterative process of designing evidence statements for PE 3-PS2-1

Iteration 1. Input: 1. Recall the integrated proficiencies designed before for each LP. 2. Define the task goal, which is designing evidence statements for each LP based on the designed integrated proficiencies. 3. Discuss the critical process of designing evidence statements for each learning performance in knowledge-in-use assessments. Emphasize the importance of evidence statements in assessment and rubric design, detailing how they describe the necessary evidence or performance to evaluate whether students have achieved the learning outcomes. 4. Explain how integrated proficiencies, encompassing the abilities students need to demonstrate a learning performance, guide the identification of required evidence. Describe the process of determining observable evidence to support proficiency claims in learning performances, using the abilities described in the integrated proficiencies to specify this evidence in evidence-statement format. 5. Provide a detailed guide on creating evidence statements for knowledge-in-use assessments, emphasizing their significance in assessment and rubric design. Explain the role of evidence statements in determining whether students have achieved the learning outcomes by describing the necessary evidence or performance. Focus on how integrated proficiencies—combining SEPs, DCIs, and CCCs—inform the creation of these evidence statements. 6. Provide examples of how to design evidence statements. 7. Provide specific design guidelines for designing evidence statements.

Output: Integrated proficiencies are: IP1. Ability to develop models illustrating that when equal forces act on an object from opposite directions, the motion of the object does not change. IP2. Ability to explain within models how balanced forces maintain an object's state of motion or rest. IP3. Ability to apply these models to everyday scenarios showing the effect of balanced forces. Evidence statements for LP2: Evidence Statement 1 for IP1: Students construct a model (e.g., diagram, physical representation) that visually demonstrates forces of equal strength acting on an object from opposite directions, effectively showing that the object's motion remains unaffected. Evidence Statement 2 for IP2: Within their models, students provide annotations or explanations that detail why an object at rest stays at rest or an object in motion maintains its motion when subjected to balanced forces. Evidence Statement 3 for IP3: Students apply their models to real-life situations (e.g., a book resting on a table, two people pushing a box from opposite sides with equal force) to demonstrate understanding of how balanced forces operate in everyday experiences.

Reflections: 1. The evidence statements generated for LP2 reflect a certain level of proficiency performance that students should demonstrate. For instance, the performance indicator for evidence statement 1 specifies that the model constructed by students could be either a diagram or a physical representation. This model should demonstrate the forces acting on an object and the relationships among these forces, which should be equal in strength and opposite in direction. By integrating the relationships among the components and the components themselves, the model should explain how these forces cause the object's motion to remain unaffected. Generally speaking, evidence statement 1 includes the required performance evidence for LP2. However, to make it more accessible for measurement, it is crucial to further unpack evidence statement 1 into a smaller grain size. 2. Evidence statements 2 and 3 focus on using or applying models to explain phenomena. However, for third graders, the critical abilities in developing models include: clearly identifying and including relevant components of the model in the system that can be used to explain the phenomena; clearly discussing or presenting the relationships among the components; and collectively using the components and the relationships among the components of the model to explain relevant phenomena.

Iteration 2. Input: 1. Point out that evidence statement 1 needs further explanation and needs to state explicitly what the components of the models are, what their relationships are, and how students use them collectively to explain the phenomena. 2. Here is the prompt I provided for the GPT model: revise these evidence statements below for LP2, "Students develop models to explain how balanced forces acting on an object result in no change in motion, using everyday scenarios such as a book resting on a table or a tug-of-war game with equal strength on both sides," for 3-PS2-1 to ensure they are specifically for 3rd-grade students' developing-model abilities, including identifying and including the essential components of their model that could be used to explain the phenomena; the relationships among the components of the model, using labels or symbols; and finally using the components and the relationships among the components of the developed model together to explain the corresponding phenomena. Your evidence statement should include explicit evidence of what and how the students should provide/perform to ensure they have the ability and skills. Revise these evidence statements: "1. Students construct a model that visually and accurately represents an object under the influence of balanced forces, showing the object either in a static position or moving at a constant speed without any change in direction. 2. Students include annotations or keys in their model to delineate between balanced and unbalanced forces, explaining scenarios where the forces cancel out, resulting in zero net force on the object. 3. Students apply their model to real-life scenarios like a book on a table or a car cruising at a steady speed, explaining how these situations exemplify balanced forces resulting in no change in motion."

Output: Evidence Statement 1 for LP2: Students construct a model that visually and accurately represents an object under the influence of balanced forces, showing the object either in a static position or moving at a constant speed without any change in direction. Students create a simple drawing or diagram that shows an object (e.g., a book on a table, two children in a tug-of-war with a rope) where no movement is happening or where there's movement at a steady pace. The model should visually display forces acting equally from all directions, keeping the object still or moving without speeding up or slowing down. The model should include: · Basic shapes or figures to represent the object(s) and the forces acting on them. · Arrows of equal length pointing in opposite directions to signify balanced forces. · A brief label or symbol next to each arrow, indicating the type of force (e.g., push). Evidence Statement 2 for LP2: Students include annotations or keys in their model to delineate between balanced and unbalanced forces, explaining scenarios where the forces cancel out, resulting in zero net force on the object. In their models, students use annotations, keys, or simple sentences to highlight the presence of balanced forces and compare these with potential unbalanced force scenarios. For example, they might annotate a tug-of-war drawing with "equal pull from both sides keeps the rope still." The model should: · Clearly mark parts of their model to show where forces are acting. · Use simple vocabulary to describe how these forces are balanced or what might happen if they weren't (e.g., "If one side pulls harder, the rope moves that way"). Evidence Statement 3 for LP2: Students apply their model to real-life scenarios like a book on a table or a car cruising at a steady speed, explaining how these situations exemplify balanced forces resulting in no change in motion. Students select a familiar scenario (such as a book remaining on a table without falling off or a toy car rolling at a steady pace) and explain how their model illustrates the balance of forces at play in that scenario. The explanation should: - Identify the forces acting on the object in the scenario (e.g., gravity pulling the book down, the table pushing it up). - Describe how these forces balance out, using elements from their model as reference points. - Conclude how the balanced forces result in no change in the object's motion (either staying still or moving steadily).

Reflections: The revised evidence statements not only reflect the major performance for developing models but also include the specific components and the relationships among the components, and they provide a rationale for each evidence statement.

Reflecting on the evidence statement design process, it is easy for the GPT model to generate general outputs. Although those outputs are still domain-specific and three-dimensional, without explicit and clear prompts it is not easy for the GPT model to generate outputs that align with the requirements. However, with appropriate prompts, especially with specific framework guidelines (e.g., components of the model, relationships among the components, and explanation using the components and the relationships among them), the GPT models can usually generate outputs that meet the requirements. This reflects the potential of using GPT models to design evidence statements. But human experts still need to monitor the process to ensure the outputs meet the task goals.
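As the reflection above notes, the decisive ingredient in iteration 2 was a framework-specific revision prompt (components, relationships, and using both together to explain the phenomenon). Purely as an illustration of that pattern, with paraphrased prompt text and hypothetical names, such a revision turn could be scripted as follows; the returned string would be sent as a user message in the same kind of chat loop sketched earlier.

```python
# Illustrative sketch only: a framework-specific revision turn for evidence
# statements, following the pattern described above (components, relationships,
# and collective use to explain the phenomenon). Wording is a paraphrase, not
# the actual prompt used in this study.
FRAMEWORK = (
    "For 3rd-grade developing-model abilities, each evidence statement must state "
    "(a) the essential components students include in the model, (b) how the "
    "relationships among components are shown (labels, symbols, arrows), and "
    "(c) how components and relationships are used together to explain the phenomenon."
)

def build_revision_prompt(learning_performance: str, draft_statements: list) -> str:
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(draft_statements, start=1))
    return (
        f"Revise these evidence statements for the learning performance:\n"
        f"\"{learning_performance}\"\n\n"
        f"Draft evidence statements:\n{numbered}\n\n"
        f"{FRAMEWORK}\n"
        "Make the expected student performance explicit and measurable."
    )

# Minimal usage example with placeholder text.
print(build_revision_prompt(
    "Students develop models to explain how balanced forces result in no change in motion.",
    ["Students construct a model that represents an object under balanced forces."],
))
```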
4.1.5 Design Blueprints ---- Essential and Variable Task Features

In designing the essential task features for measuring the LP, our approach mirrored that of the LP design. Essential task features, also known as characteristic task features, are key attributes shared by all tasks aimed at assessing a particular claim. These features serve as the foundation for creating tasks that effectively measure the LP. As in the LP design process, I guided GPT in generating essential task features by providing appropriate prompts and refining the responses based on their limitations. Essential task features aim to answer questions such as "What are the task features that must be present to assess this claim?" and "What are the common features that all tasks need to include?" These features encapsulate the attributes shared by all tasks that assess a specific learning performance. The variable task features, in contrast, are important for adjusting task difficulty and ensuring accessibility and fairness for all students. To facilitate this understanding, I posed thought-provoking questions such as "What are the features that can vary among tasks?" and provided context pertaining to the targeted LP. GPT's initial responses were insightful, suggesting modifications in interactive systems, types of evidence, scaffolding levels, response modes, collaboration levels, contextualization, language, and representation. These proposals demonstrated its understanding of tailoring task complexity and accessibility to individual learning styles and proficiency levels. The iterative process of designing both essential and variable task features involved continuous feedback and refinement. This ensured that the final task designs were robust, equitable, and aligned with educational standards. The essential task features and variable task features for LP2 are shown in Table 4-9 below.

Table 4-9. Essential task features and variable task features for LP2

Essential Task Features: Task presents a scenario where an object is under the influence of balanced forces, resulting in no change in motion. • Example: A book resting on a table or a tug-of-war game with equal strength on both sides. Task provides data or observations from investigations highlighting the impact of balanced forces on an object's motion. • Example: Data showing a stationary object with forces acting equally in opposite directions or an object moving at constant speed. Task prompts students to use evidence from the provided data or observations to construct a model demonstrating balanced forces. • Example: Students use arrows to represent forces acting on an object and explain the absence of motion change. Task includes prompts for students to explain at a conceptual level how balanced forces result in no change in motion, encouraging them to connect evidence and reasoning. • Example: Prompts asking students to describe why an object remains stationary or moves at a constant speed when forces are balanced.

Variable Task Features: **Scenario Variety** Task scenarios can vary by the type of objects and the nature of forces acting on them (e.g., different weights, types of forces like gravity and normal force). • Example: Objects of varying mass, different surfaces, or forces such as gravity and tension.
**Scenario Variety** Task scenarios can vary in the complexity of the investigations (e.g., analyzing balanced forces in different situations such as a hanging picture, a floating balloon). • Example: Different levels of difficulty in understanding balanced forces in static and dynamic contexts. **Modes of Representation**: Tasks can vary in the mode of expression for students' models and explanations (e.g., written descriptions, oral presentations, multimedia presentations, or physical models). • Example: Allowing students to choose how to present their understanding, such as through drawings, digital tools, or physical demonstrations. **Scaffolding Levels**: Tasks can include different levels of scaffolding, such as guiding questions, partial models, or diagrams for students to complete. • Example: Providing templates with partial models that students need to complete or questions that guide their thought process. **Scaffolding Levels**: Tasks can adjust the demand for background knowledge related to physics concepts of force and motion. • Example: Varying the complexity of the explanations required or providing additional resources and support for students with less background knowledge. Equity and Inclusion Considerations Offer scenarios that reflect a diversity of experiences to ensure all students find the task relatable. Ensure that the language and content are accessible and respectful to all students, promoting an inclusive learning environment. The essential and variable task features were likewise designed through collaboration between the human experts and the GPT models. Establishing clear task goals, providing explicit requirements for completing the tasks, and ensuring that human experts judge the outputs in a timely manner are critical for the design of task features. Despite its comprehensive response, GPT's grasp of equity considerations, especially cultural relevance and linguistic accessibility, was not robust. While it advised leveraging students' background knowledge and experience, it didn't elaborate extensively on this. Likewise, it suggested language complexity adjustment and multilingual resource integration but didn't sufficiently address diverse learners' needs. To remedy these deficiencies, I supplied additional prompts centered on cultural relevance and linguistic accessibility. I solicited more in-depth responses regarding cultural and local integration in task design and how to customize language complexity for diverse learners. This process of iterative prompting aimed to enhance the inclusivity of the generated task features and refine ChatGPT's ability to align with equitable educational practices. 4.1.6 Design blueprint for LP2 of PE 3-PS2-1 Synthesizing all the unpacking, LPs, evidence statements, essential task features, and variable task features, Table 4-10 presents the final version of the design blueprint of LP2 for PE 3-PS2-1. This design blueprint for LP2 guides the task design and was sent out for the first round of expert review. Table 4-10. Design blueprint for LP2 of PE 3-PS2-1 PE Focal LP: LP2 Evidence Statements 3-PS2-1: Plan and conduct an investigation to provide evidence of the effects of balanced and unbalanced forces on the motion of an object. [Clarification Statement: Examples could include an unbalanced force on one side of a ball can make it start moving; and, balanced forces pushing on a box from both sides will not produce any motion at all.]
[Assessment Boundary: Assessment is limited to one variable at a time: number, size, or direction of forces. Assessment does not include quantitative force size, only qualitative and relative. Assessment is limited to gravity being addressed as a force that pulls objects down.] Students develop models to explain how balanced forces acting on an object result in no change in motion, using everyday scenarios such as a book resting on a table or a tug-of-war game with equal strength on both sides. 1. Students construct a model that visually represents an object under the influence of balanced forces, showing the object either in a static position or moving at a constant speed without any change in direction. The model should: ● Basic shapes or figures to represent the object(s) and the forces acting on them. ● Arrows of equal length pointing in opposite directions to signify balanced forces. ● A brief label or symbol next to each arrow, indicating the type of force (e.g., "push," "pull," "gravity"). 2. Students include annotations or keys in their model to delineate between balanced and unbalanced forces, explaining scenarios where the forces cancel out, resulting in zero net force on the object. The model should: ● Clearly mark parts of their model to show where forces are acting. ● Use simple vocabulary to describe how these forces are balanced or what might happen if they weren't (e.g., "If one side pulls harder, the rope moves that way"). 3. Students apply their model to real-life scenarios like a book on a table or a car cruising at a steady speed, explaining how these situations exemplify balanced forces resulting in no change in motion. The explanation should: ● Identify the forces acting on the object in the scenario (e.g., gravity pulling the book down, table pushing it up). ● Describe how these forces balance out, using elements from their model as reference points. ● Conclude how the balanced forces result in no change in the object's motion (either staying still or moving steadily). 110 Table 4-10 (cont’d) Essential task features Variable task features ● Task presents a scenario where an object is under the influence of balanced forces, resulting in no change in motion. ● Task provides data or observations from investigations highlighting the impact of balanced forces on an object's motion. ● Task prompts students to use evidence from the provided data or observations to construct a model demonstrating balanced forces. ● Task includes prompts for students to explain at a conceptual level how balanced forces result in no change in motion, encouraging them to connect evidence and reasoning. ● **Scenario Variety** Task scenarios can vary by the type of objects and the nature of forces acting on them (e.g., different weights, types of forces like gravity and normal force). ● **Scenario Variety** Task scenarios can vary in the complexity of the investigations (e.g., analyzing balanced forces in different situations such as a hanging picture, a floating balloon). ● **Modes of Representation**: Tasks can vary in the mode of expression for students' models and explanations (e.g., written descriptions, oral presentations, multimedia presentations, or physical models). ● **Scaffolding Levels**: Tasks can include different levels of scaffolding, such as guiding questions, partial models, or diagrams for students to complete. ● **Scaffolding Levels**: Tasks can adjust the demand for background knowledge related to physics concepts of force and motion. 
Equity and inclusion considerations ● Offer scenarios that reflect a diversity of experiences to ensure all students find the task relatable. ● Ensure that the language and content are accessible and respectful to all students, promoting an inclusive learning environment. 4.1.7 Task Design The design process for constructing assessment tasks with GPT began by providing explicit instructions and guidelines based on the defined LP for which the tasks were intended. To ensure alignment with the LP, both essential and variable task features were communicated to GPT, enabling the development of multiple tasks within a 'family' that maintained fidelity to the LP while allowing for variations in variable task features. The process was initiated by introducing the task goals and explaining how to utilize previously generated information to guide the task design. Detailed information regarding LP2 was shared with GPT, along with design principles and guidelines on how to design knowledge-in-use assessment tasks. 111 Specific requirements for the assessment tasks were also provided, encompassing various aspects such as the purpose of design blueprints, task characteristics, designing process steps, task design objectives, task scenarios and prompts, alignment with learning performances, variability in tasks, phenomena representations, equity and inclusion, creativity and motivation in task design, relevance of phenomena, connection with students, developmental appropriateness, three-dimensional integration features, engagement and interest, ethical practices, coherent narrative, language and accessibility, phenomena, and scenarios, and the creative process. The design steps included understanding each element of the blueprint and its intended collaboration, considering potential phenomena that match the blueprint elements and are universally relevant to students, integrating equity and inclusion considerations, and ensuring three- dimensional integration features. These task requirements were incorporated based on the critical aspects of the assessment design, including the three-dimensional nature of the assessment, language and accessibility, engagement and relevance, and more. 4.1.7.1 Assessment Task Design for LP3 After providing the task design requirements and guidelines, the assessment design process began. Using LP3 for PE 3-PS2-1 as an example, the iterative design process was demonstrated. To facilitate understanding, Table 4-11 presents the design blueprints for LP3. Table 4-11. Design blueprints for LP3 of 3-PS2-1 LP3: Students construct explanations based on evidence from investigations to explain how objects in contact exert forces on each other, including friction, showing the interaction between objects as a cause of motion changes. Integrated Proficiencies (IPs) for LP3 IP1: Ability to construct claims about the effects of contact forces, especially friction, on motion. This proficiency involves students identifying friction as a force that opposes motion and affects the speed and direction of moving objects. IP2: Ability to select and use evidence from investigations to substantiate claims about friction's role in motion alterations. Students should demonstrate competency in choosing relevant experimental or observational data that clearly show how friction alters motion. IP3: Ability to reason regarding how friction as a contact force instigates changes in the motion of objects. 
This entails connecting evidence to claims in a reasoned manner to explain the mechanism by which friction influences motion. 112 Table 4-11 (cont’d) Evidence Statements for LP3 1.**Students formulate claims** that pinpoint friction as a significant force influencing motion during contact between objects. 2.**Students compile appropriate evidence** from experimental or observational data showcasing friction's impact on motion. 3.**Students logically link their selected evidence with their claims**, providing explanations on how friction modifies motion. Essential Task Features 1. **Scenarios**: Incorporate real-world contexts (e.g., a soccer ball slowing down on grass vs. concrete) that vividly depict friction's role in modifying motion. 2. **Data and Observations**: Use findings from simple, age-appropriate experiments (e.g., sliding various materials across different surfaces) that allow students to observe friction's effects. 3. **Construct Explanations**: Engage students with prompts that guide them to explain motion changes using evidence of friction from provided or gathered data. Variable Task Features 1. **Scenario Variability**: Include diverse situations demonstrating friction (e.g., braking a bicycle, pushing a book across a desk) to cover a broad understanding. 2. **Scaffolding Levels**: Offer scaffolded supports like guiding questions (What happens to the object's speed when it moves across different surfaces?) or data interpretation aids to facilitate evidence-based explanations. 3. **Modes of Response**: Allow students to present their explanations through various formats, such as short essays, oral presentations, posters, or digital slides. Equity and Inclusion Considerations 1. **Relatable Scenarios**: Choose scenarios that are universally relatable and do not presume specific prior experiences, ensuring all students can connect with the task's context. 2. **Accessible Language**: Utilize language that is clear, straightforward, and devoid of unnecessary jargon, with instructions presented both verbally and visually to cater to diverse learning preferences. 3. **Varied Expression**: Enable students to demonstrate their understanding in different ways, accommodating various learning styles and needs, and providing equal assessment opportunities. After the above task design requirement and design principles, I provided the design blueprints of LP3 to the GPT 4 models. Table 4-12 presents the iterative design process for designing one assessment task for LP3 using the design blueprints presented in Table 4-10. 113 Table 4-12. First round iterative design process for LP3 relevant task design Input Output 1. Task design requirements and principles 2. Learning performance design blueprints Assessment Task 1: "Sliding and Stopping" **Scenario**: The class has just learned about the force of friction and its effects on the motion of objects. The teacher sets up two different surfaces on the classroom floor: a smooth, polished wooden plank and a carpet. Students have toy cars and are challenged to investigate how the type of surface affects the distance a car travels after being pushed with the same force. **Prompts**: 1. **Engage**: Students predict which surface will allow the car to travel further and explain their reasoning. 2. **Investigate**: Students push the toy cars across both surfaces from the same starting line, using a consistent push force. They measure the distance traveled on each surface three times and record their findings. 3. 
**Explain**: Based on their observations, students construct claims about the role of friction in changing the car's motion on different surfaces. The designed task aligns with LP3, whose SEP is constructing explanations. However, although the designed tasks include a third prompt asking students to explain the role of friction in the investigation, they emphasize conducting the investigation more than the targeted SEP of constructing explanations. One reason the tasks were designed this way could be the LP3 statement, "construct explanations based on evidence from investigations to explain how objects in contact exert forces on each other, including friction, showing the interaction between objects as a cause of motion changes." The GPT model may have designed the tasks to follow this statement's sequence, moving from conducting an investigation to providing an explanation; however, doing so could dilute the central role of constructing explanations in this LP. Another interesting observation is that both designed tasks are about friction, likely because the LP statement explicitly mentions friction. This again reflects the linear and algorithmic nature of the machine approach. A further reason the two tasks pair constructing explanations with an investigation activity is the earlier explanation of the progressive order of the LPs for this PE. Overall, the designed tasks address LP3 reasonably well, but the scenarios are not engaging enough and the prompts do not surface the framework of constructing explanations (claim, evidence, reasoning) with appropriate scaffolding. Based on these observations and the goals of designing assessment tasks to measure LP3, I further prompted the GPT-4 model, which led to the second round of designed assessment tasks (see Table 4-13). Table 4-13. Second round iterative design process for LP3 relevant task design Input: ● Refine tasks to better emphasize the 3D nature of learning. ● Develop clear, engaging scenarios and prompts for each task, making them relatable to students' daily experiences and clearly introducing the concept of friction. ● Provide specific prompts that guide students to construct claims, gather evidence, and apply reasoning to explain how friction affects motion. ● Include a brief rationale for each task, explaining how it assesses students' understanding of LP3 and engages them in applying scientific knowledge to real-world problems. ● Ensure tasks adhere to the LP3 design blueprint, making them accessible, inclusive, and accommodating diverse learning styles and abilities. ● Promote equity and inclusion by designing tasks that are accessible and relatable for all students, providing necessary scaffolding to support diverse needs and successful engagement. Output: Task 1: The Playground Slide Challenge **Scenario**: Amy and Ela are at the playground, and Amy decides to slide down the big, shiny slide. She notices that when the slide is dry, she zips down quickly, making her giggle with joy. However, when the slide becomes slightly wet after a quick sprinkle, her speed slows down, and she doesn't reach the bottom as fast. Ela, watching from below, tilts her head, curious about what's happening. **Prompt**: Help Amy and Ela understand why the slide's surface condition affects Amy's sliding speed. Think about the slide and Amy's socks: 1. **Claim**: Make a guess (hypothesis) about why Amy slides slower on the wet slide compared to the dry slide. 2. **Evidence**: Use what you observed about Amy sliding on the dry and wet slide as evidence. 3.
**Reasoning**: Explain how the water on the slide might change how Amy's socks rub against the slide's surface, causing the change in speed. In the second iteration, the tasks were refined to better emphasize the 3D nature of learning. The scenarios became more engaging and relatable, connecting directly with students' daily experiences. This shift made the tasks more meaningful and increased student motivation. Clear, specific prompts were developed to guide students through the process of constructing claims, gathering evidence, and applying reasoning. This structured approach is crucial for helping students develop their explanatory skills and understand the role of friction in motion changes. Including a rationale for each task helped to articulate how the tasks assess students' understanding of LP3, ensuring they align with educational goals and 115 standards. For instance, for the task above the rationale the GPT-4 model provided is “This task immerses students in a familiar scenario that vividly illustrates friction's role in motion changes, aligning with the DCI about objects in contact exerting forces, including friction (LP3). Encouraging explanations based on observable phenomena reinforces the SEP of constructing explanations. The task highlights 'cause and effect' (CCC) by linking the wetness of the slide (cause) with the change in sliding speed (effect).” The tasks were also refined to adhere more closely to the LP3 design blueprint, ensuring they were accessible, inclusive, and accommodating diverse learning styles and abilities. This focus on equity and inclusion was critical in making the tasks relatable and achievable for all students, providing necessary scaffolding to support diverse needs and promoting successful engagement. Table 4-14. Two assessment tasks for LP3 LP3: Students construct explanations based on evidence from investigations to explain how objects in contact exert forces on each other, including friction, showing the interaction between objects as a cause of motion changes. Task 1: The Playground Slide Challenge Task 2: Classroom Carpet Carper **Scenario**: Amy and Ela are at the playground, and Amy decides to slide down the big, shiny slide. She notices that when the slide is dry, she zips down quickly, making her giggle with joy. However, when the slide becomes slightly wet after a quick sprinkle, her speed slows down, and she doesn't reach the bottom as fast. Ela, watching from below, tilts her head, curious about what’s happening. **Prompt**: Help Amy and Ela understand why the slide’s surface condition affects Amy’s sliding speed. Think about the slide and Amy’s socks: ● **Claim**: Make a guess (hypothesis) about why Amy slides slower on the wet slide compared to the dry slide. ● **Evidence**: Use what you observed about Amy sliding on the dry and wet slide as evidence. ● **Reasoning**: Explain how the water on the slide might change how Amy's socks rub against the slide's surface, causing the change in speed. **Scenario**: During a classroom activity, Amy rolls a small toy car across different surfaces - the smooth, shiny classroom floor and the fluffy classroom carpet. On the smooth floor, the car travels far and fast, but on the carpet, it quickly slows down and stops. Ela, always ready for fun, watches intently, her tail wagging with excitement. **Prompt**: Explore with Amy and Ela why the toy car moves differently on the carpet than on the smooth floor. 
Consider the surfaces and the car's wheels: ● **Claim**: Predict why the car goes further on the smooth floor than on the carpet. ● **Evidence**: Discuss what happened when the car moved across the different surfaces ● **Reasoning**: Explain how the fluffiness or smoothness of each surface might affect the car's wheels and its motion. 116 Table 4-14 presents the final two tasks for LP3. Reflecting on the entire iterative process of using design blueprints to design assessment tasks for LP3, the process revealed significant insights and areas for enhancement. Initially, the tasks were broadly focused on both investigation and explanation, aligning with the LP3 requirements. However, this broad approach risked diluting the primary focus on constructing explanations. The tasks were subsequently refined to emphasize constructing explanations, ensuring alignment with LP3's core objectives. Engagement and relevance emerged as critical factors. The initial scenarios, while functional, were critiqued for lacking engagement. The second iteration introduced more relatable and vivid scenarios, making the tasks more engaging and enhancing student motivation. This shift underscored the importance of creating contextually meaningful tasks to foster deeper student engagement. Another key improvement was in the scaffolding provided to students. The initial tasks lacked sufficient guidance, which was addressed in the second iteration by including specific prompts that guided students through making claims, gathering evidence, and applying reasoning. This structured approach is crucial for developing students' explanatory skills in scientific contexts. Equity and inclusion were also explicitly considered in the refined tasks. The second iteration aimed to make the tasks accessible and relatable to a diverse student population, providing scaffolding and considering different learning styles and needs. This approach supports all students in successfully engaging with and understanding the content. The use of GPT-4 in designing these tasks demonstrated both strengths and limitations of AI in educational contexts. While the AI-generated tasks aligned with LP3, the approach sometimes exhibited a linear focus, particularly on the concept of friction due to its mention in LP3. This highlights the need for human oversight to ensure a balanced and nuanced approach to task design. Overall, the iterative process led to more structured, engaging, and inclusive tasks that better aligned with LP3 goals. This process underscored the importance of continuous refinement, thoughtful scenario design, and adequate scaffolding in creating effective educational tasks. The insights from the AI-generated tasks also provide valuable lessons for integrating AI in educational design, emphasizing the need for human-AI collaboration to achieve optimal outcomes. 117 4.1.7.2 Assessment Task Design for LP2 Following the training of designing tasks for LP3, I prompted the GPT 4 models to design two tasks for LP2. Since this step is a follow up step with designing tasks for LP3, I provided the similar prompts with the prompts of LP3 but emphasize the critical reflections from the above exploration process. The iterative process of designing tasks for LP2 is shown in Table 4-15. Table 4-15. The iterative process of designing tasks for LP2 Input Output Reflections The designed tasks have engaging phenomen a, but the prompts are not clear for the modeling task. 1. 
Recall the critical aspects of designing knowledge-in- use assessment tasks, including task requirements, goals, and critical aspects that need to be further taken care of. 2. Recall the information, LP statement, and the design blueprints for LP2. Iteration 1 Task 1: "The Case of the Static Skateboard" **Scenario**: Jamie and Alex are at the park with a skateboard. They each push on the skateboard from opposite ends with equal force. The skateboard doesn't move from its spot, no matter how hard they push. **Prompt**: ● ● **Model Construction**: Draw a model to represent the forces Jamie and Alex are applying to the skateboard. Show why the skateboard doesn't move using arrows to represent forces. **Explanation**: Explain in your drawing how the forces from Jamie and Alex are balanced and describe what that means for the motion of the skateboard. **Application**: Think of another example where balanced forces result in no change in motion and describe it below your model. ● Task 2: "Balancing Act on the Sea-saw" **Scenario**: In the playground, Mia and Kai are on a seesaw. They position themselves in such a way that the seesaw is perfectly balanced and horizontal. Neither side is moving up or down. **Prompt**: ● **Model Construction**: Sketch a seesaw in balance. Add Mia and Kai to your model and use arrows to represent the forces they exert on the seesaw. ● **Explanation**: Annotate your model to demonstrate why the seesaw remains in a balanced, horizontal position without tilting to any side. ● **Application**: Describe a real-life situation, apart from the seesaw, where objects remain in a balanced state due to equal forces acting on them. 118 Table 4-15 (cont’d) Revise the two designed assessment tasks for LP2 of 3-PS2-1 to ensure the prompts focus on: Developing models that include: Components/ Elements: Clearly identify the parts of the model. Relationships : Describe the interactions among the components. Explanation: Explain how the components and their relationships together can account for the phenomenon. Iteration 2 Task 1: "The Book on the Table" **Scenario**: Imagine a heavy science textbook resting on a table. Despite people walking around and occasional bumps to the table, the book remains unmoved.\n\n** Prompt**: Develop a model to explain why the science textbook doesn't move on the table despite slight disturbances. Your model should include: ● **Components/Elements**: Identify the forces acting on the book and table (e.g., gravity, normal force). ● **Relationships Among Components**: Show how these forces are balanced. ● **Explanation of Phenomena**: Explain using your model how the balanced forces result in the book maintaining its position without movement. Task 2: "The Tug-of-War Standoff" **Scenario**: Two teams of equal strength are in a tug-of-war competition. The rope between them doesn't move in either direction, and both teams stay in place.\n\n **Prompt**: Create a model to explain why neither team can move the other during the tug-of-war. In your model, consider: ● **Components/Elements**: Represent the forces each team applies to the rope. ● **Relationships Among Components**: Illustrate the interaction between these forces and how they balance out. ● **Explanation of Phenomena**: Use your model to clarify why the balanced forces prevent any movement from either side, keeping the rope stationary. 
These revised tasks direct students explicitly to construct models with a clear focus on elements (forces), their relationships (balance or equilibrium), and how these aspects coalesce to explain real-world phenomena of static objects or equilibrium scenarios. The tasks are scaffolded to guide students through the thought process of identifying forces, understanding their interactions, and applying this knowledge to explain why objects do not change motion under balanced forces. This approach aligns with the three-dimensional learning objectives of NGSS, fostering a deep understanding of physical science concepts while practicing model development skills. The iterative design process for LP2 assessment tasks, aimed at understanding and modeling balanced forces, showcased significant evolution in task clarity and instructional alignment. Initially, the tasks "The Case of the Static Skateboard" and "Balancing Act on the Seesaw" provided engaging scenarios but lacked clarity in their prompts, particularly regarding the construction of models. This issue 119 was identified during the first iteration, highlighting the need for clearer guidance on how to develop and use models to explain phenomena. In the second iteration, the tasks were revised to focus more explicitly on the three key aspects of model development: identifying components, describing relationships among components, and explaining phenomena. The revised tasks, "The Book on the Table" and "The Tug-of- War Standoff," provided clear and structured prompts. These tasks required students to construct models that explicitly represent the forces at play, illustrate how these forces interact, and explain the resulting balance or equilibrium. This refinement process demonstrated a shift from merely engaging students with interesting phenomena to guiding them through the detailed and precise construction of scientific models. For example, the original item prompts for Task 1 for LP2 is: “ ● **Model Construction**: Draw a model to represent the forces Jamie and Alex are applying to the skateboard. Show why the skateboard doesn't move using arrows to represent forces. ● **Explanation**: Explain in your drawing how the forces from Jamie and Alex are balanced and describe what that means for the motion of the skateboard. ● **Application**: Think of another example where balanced forces result in no change in motion and describe it below your model.” The first prompt of “model construction” does not provide explicit scaffoldings for students to construct a model, and the third prompt is beyond the scope of constructing models to explain the corresponding phenomena. After the revision, the prompts were revised into: “ Develop a model to explain why the science textbook doesn't move on the table despite slight disturbances. Your model should include: ● **Components/Elements**: Identify the forces acting on the book and table (e.g., gravity, normal force). ● **Relationships Among Components**: Show how these forces are balanced. ● **Explanation of Phenomena**: Explain using your model how the balanced forces result in the book maintaining its position without movement.” By emphasizing the identification of components that should be included in the model and their 120 interactions, the revised prompts are better aligned with the 3D learning objectives of the NGSS. They provided a scaffolded approach to help students understand and apply concepts of balanced forces in real- world contexts. 
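To make this iterative prompting pattern concrete, the sketch below shows how such a two-round exchange could be scripted: the first call supplies the task goals and the LP2 design blueprint, and the human reviewer's critique is then appended as a follow-up request that names the three required model elements (components, relationships among components, and explanation of the phenomenon). This is a minimal illustration assuming the OpenAI Python SDK; the model name and prompt wording are abridged stand-ins rather than the exact prompts used in this study.

```python
# Minimal sketch of the two-round prompting pattern described above.
# Assumes the OpenAI Python SDK (openai >= 1.0); prompts are illustrative
# stand-ins, not the study's exact wording.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

history = [
    {"role": "system",
     "content": "You design NGSS-aligned, knowledge-in-use assessment tasks "
                "for 3rd grade students. Follow the supplied design blueprint."},
    # Round 1: task goals plus the LP2 design blueprint (abridged here).
    {"role": "user",
     "content": "Design two tasks for LP2 of 3-PS2-1 using this blueprint: ..."},
]

draft = client.chat.completions.create(model="gpt-4", messages=history)
history.append({"role": "assistant", "content": draft.choices[0].message.content})

# Round 2: human review found the modeling prompts unclear, so the revision
# request names the three required model elements explicitly.
history.append({"role": "user",
    "content": ("Revise both tasks so each modeling prompt asks students to "
                "(1) identify the components/elements of the model, "
                "(2) describe the relationships among the components, and "
                "(3) explain how the components and relationships together "
                "account for the phenomenon.")})

revision = client.chat.completions.create(model="gpt-4", messages=history)
print(revision.choices[0].message.content)
```

Keeping the full conversation history in the second call is what allows the revision request to build on, rather than restart, the first-round draft, mirroring the iterative refinement described above.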
The iterative process underscored the importance of clear and focused prompts in assessment tasks. It highlighted how specific guidance can enhance students' ability to develop and use models effectively. This approach not only aids in grasping complex physical science concepts but also fosters essential skills in scientific modeling and reasoning. The reflections on this process reinforce the value of iterative refinement in educational task design, ensuring tasks are both engaging and instructional, thereby supporting deep and meaningful learning. 4.1.7.3 Tasks and Exemplar Responses for the Tasks To ensure the tasks are engaging for 3rd grade students, I used DALL·E to generate scenario images that provide visual support for students to understand and engage in the tasks. The images (Tables 4-16 and 4-17) were incorporated into the item stems to support students in understanding the tasks. I directed GPT to generate the exemplar responses by providing the task, the evidence statements, and the grade level. After several rounds of iteration, the exemplar responses are presented alongside the tasks in Tables 4-16 and 4-17. Table 4-16. Assessment tasks and their exemplar responses for 3-PS2-1 LP2. Tasks for LP2 of 3-PS2-1: Students develop models to explain how balanced forces acting on an object result in no change in motion, using everyday scenarios such as a book resting on a table or a tug-of-war game with equal strength on both sides. Table 4-17. Assessment tasks and their exemplar responses for 3-LS4-3 LP2. Tasks for LP2 of 3-LS4-3: Students engage in argument from evidence to support claims about which organisms can survive well, less well, or not at all in a specific habitat based on their characteristics and needs, using examples from various habitats to explore cause and effect relationships. 4.1.2 Critical Self-Reflections on Collaborating with GPT-4 for Assessment Design Humans can collaborate with AI to design knowledge-in-use assessment tasks using the systematic NGSA approach. During the process, humans play a critical role by providing explicit guidance for prompting AI toward the task requirements and goals. Humans are also essential for monitoring the assessment design process by identifying critical areas where AI needs to focus. Meanwhile, humans guide AI in conducting critical reflections to help it detect and diagnose key principles and lessons that the AI can then learn from and apply in subsequent steps. Domain-specific information, including science content, 3D knowledge, and knowledge-in-use assessment design, is crucial for designing and monitoring the process. The iterative refinement and revision process is also vital for the design of assessment tasks. This section reflects on the collaborative process used to train generative AI models, specifically GPT-4, for designing knowledge-in-use assessments, addressing RQ1: How can generative AI models be effectively and iteratively trained to design these assessments? The discussion outlines a strategy for combining AI capabilities with human expertise to enhance assessment design, highlighting key components such as the need for clear instructions, leveraging domain-specific knowledge, engaging human expertise, refining processes iteratively, and fostering effective collaboration between AI and human experts.
4.1.2.1 Necessity of Explicit Guidance Providing clear and detailed instructions emerged as a fundamental theme throughout the iterative design process. Explicit guidance was crucial for generating high-quality outputs from the GPT-4 model. Without well-defined guidelines and specific goals, the depth and quality of the AI-generated unpacking were limited. Explicit guidance involves providing the AI with precise instructions, clear task definitions, and well-articulated goals. This clarity ensures that the AI fully understands the task requirements and can produce outputs that meet the specified criteria. For instance, in the design of the Integrated Dimension Maps (IDM) for PE 3-PS2-1, initially, the AI produced a general map that lacked coherence and detail. The IDM needed to integrate various dimensions (DCIs, SEPs, and CCCs) into a coherent framework. By providing explicit guidance, such as detailing the elements to be included (observable components, relationships among these components, and limitations of models), the AI’s output improved significantly. The detailed instructions ensured that the AI understood how to connect different components logically and meaningfully, resulting in a more comprehensive and coherent IDM. For instance, specifying that the map should clearly illustrate the cause-and-effect relationships between different scientific concepts ensured that the final output was both educationally valuable and practically useful. Human experts play a critical role in this process. They clarify the objectives, provide detailed and explicit requirements, continuously monitor the AI’s outputs, and adjust the guidance as necessary. The AI interprets the provided task goals, generates outputs that adhere to the defined objectives, and refines its understanding through iterative feedback. For instance, during the iterative process of designing LPs, human experts noticed that the AI's outputs were too broad. Then a human provided specific feedback emphasizing the need for detailed context and examples, which helped the AI refine its 130 outputs to be more aligned with educational standards. Similarly, in the IDM design, human experts identified gaps in the AI's initial outputs and provided targeted feedback to ensure that the final map was detailed, coherent, and aligned with the educational objectives. Explicit guidance is essential for training GPT-4 models effectively. Providing clear and detailed instructions ensures that the AI understands the task requirements, reducing the risk of producing irrelevant or low-quality outputs. Continuous monitoring and adjustment of the AI’s guidance is necessary to maintain alignment with educational standards and task requirements. Explicit guidance serves as the foundation for effective AI training, ensuring that the AI can produce high-quality, relevant outputs that meet educational standards. The iterative feedback loop between human experts and the AI plays a crucial role in refining these outputs, demonstrating the importance of clear and detailed instructions in the AI training process. This process not only enhances the AI's ability to perform specific tasks but also builds a collaborative framework where human oversight ensures that the AI's outputs remain aligned with educational objectives and standards. 
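To illustrate what such explicit guidance can look like in practice, the sketch below contrasts a vague request with a structured prompt that states the task definition, the required IDM elements, and the assessment-boundary constraints discussed above. The helper function, field names, and wording are hypothetical and serve only to show the pattern, not the actual prompts used in this study.

```python
# Illustrative contrast between a vague request and explicit guidance.
# The field names and wording are hypothetical, not the study's prompts.

VAGUE_PROMPT = "Create an integrated dimension map for 3-PS2-1."

def build_explicit_prompt(pe: str, elements: list[str], constraints: list[str]) -> str:
    """Assemble a prompt that states the task, the required elements,
    and the boundaries the output must respect."""
    lines = [
        f"Task: Develop an Integrated Dimension Map (IDM) for {pe}.",
        "The map must integrate the DCI, SEP, and CCC into one coherent structure.",
        "Required elements:",
        *[f"- {e}" for e in elements],
        "Constraints:",
        *[f"- {c}" for c in constraints],
        "If any requirement cannot be met, state which one and why.",
    ]
    return "\n".join(lines)

prompt = build_explicit_prompt(
    "3-PS2-1",
    elements=[
        "observable components of the model",
        "relationships among these components",
        "limitations of the model",
        "explicit cause-and-effect links between scientific concepts",
    ],
    constraints=[
        "stay within the grade 3 assessment boundary (qualitative forces only)",
        "use language accessible to 3rd grade students",
    ],
)
print(prompt)
```

The point of the structured version is not the specific wording but that every expectation the human expert holds is stated up front, so that gaps in the output can be traced to missing guidance rather than guessed intent.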
The importance of explicit guidance is underscored by the improvement in AI-generated outputs when detailed and precise instructions are provided, highlighting the need for continuous human involvement to guide and refine the AI’s contributions. 4.1.2.2 Importance of Domain-Specific Information The importance of domain-specific information became apparent as a critical theme during the iterative design process. The AI's ability to generate detailed and accurate unpacking significantly improved when comprehensive and specific information was provided. Domain-specific information includes detailed knowledge about the subject matter, educational standards, and principles relevant to the task. This information helps the AI models understand the context and nuances of the task, enabling it to produce more accurate and relevant outputs. When designing assessments for two PEs, 3-PS2-1 and 3-LS4-3, providing detailed information about the DCIs, SEPs, and CCCs was crucial. For example, in the case of PE 3-PS2-1, detailed explanations of forces, motion, and interactions between objects were provided, including Newton's laws 131 and their applications to everyday phenomena. Despite GPT-4's ability to process general information, it lacked the depth needed for domain-specific tasks unless supplied with relevant details. By providing explicit descriptions of DCIs such as PS2.A: Forces and Motion and PS2.B: Types of Interactions, the AI could generate more nuanced and educationally relevant outputs. An illustrative case was the unpacking of PS2.A: Forces and Motion. Initially, the AI-generated descriptions were general, such as "Forces can cause an object to move or stop." While this is correct, it lacked the depth necessary for a comprehensive educational framework. When detailed information about the grade levels of understanding and the specific disciplinary ideas and relevant ideas was provided, the AI's output transformed. The unpacking became richer, including descriptions of how balanced and unbalanced forces affect motion, and specific examples like "A book resting on a table demonstrates balanced forces, while a ball rolling down a hill shows unbalanced forces leading to acceleration." Another example involves PE 3-LS4-3, which focuses on environmental changes and their impact on organisms. Initially, the AI produced broad statements about environmental changes affecting living things. However, by providing detailed information on specific factors such as climate change, habitat destruction, and pollution, and their effects on various species, the AI could produce more targeted and relevant assessments. This included nuanced insights into how certain species adapt, migrate, or face extinction due to these environmental pressures. In this process, human experts play several crucial roles. They select what information to provide to the system, ensuring it is both comprehensive and relevant. They monitor how well the AI perceives and processes this information, identifying areas where the AI's understanding might be superficial or incomplete. Human experts also pinpoint critical areas that need adjustment or further clarification and decide whether to iterate the current process or move to the next steps. This continuous oversight ensures that the AI remains aligned with educational standards and objectives. The AI, on its part, receives the provided background information and processes it to detect the prompts and understand the specific task requirements. 
It analyzes the information to summarize critical points and learns from the provided inputs and clarifications to retain important information and context for future tasks. This iterative learning 132 process enables the AI to improve its performance over time, producing outputs that are increasingly aligned with educational standards. For instance, during the iterative unpacking of SEP related to "Developing and Using Models," the initial AI output was generic, merely stating that students should create models to represent phenomena. With detailed input about the various types of models (physical, conceptual, and mathematical), the specific criteria for evaluating these models, and examples of how these models can be used to explain phenomena, the AI's subsequent outputs became more sophisticated. It included specific strategies for students to develop and use models, criteria for assessing the models' effectiveness, and detailed examples illustrating the use of models in scientific inquiry. Providing adequate background information is crucial for training GPT-4 models effectively. This step ensures that the AI has a solid foundation of domain-specific knowledge, enabling it to produce outputs that meet educational standards and task requirements. Continuous updating and refining of the AI’s knowledge base are essential to maintain the quality and relevance of the outputs. For example, as educational standards evolve or new scientific discoveries are made, updating the AI with this new information ensures that it remains current and continues to produce relevant and accurate educational content. Moreover, the collaboration between human experts and AI in this context exemplifies a synergistic relationship where human intelligence provides the depth and context that AI needs to function effectively, while AI offers the capacity to process and integrate large volumes of information quickly and efficiently. This collaboration not only enhances the quality of the educational assessments but also accelerates the development process, making it more efficient and scalable. 4.1.2.3 Role of Human Experts The role of human experts was underscored as a critical theme throughout the iterative design process. Human experts were indispensable in evaluating and refining the AI outputs, providing critical insights and feedback. Their application of domain knowledge and experience guided the AI models in generating high-quality outputs and ensuring alignment with educational standards and goals. For instance, when the initial LPs were too broad and lacked specificity, experts provided detailed feedback and examples to help the AI generate more focused and relevant LPs. This intervention transformed 133 general statements into specific, actionable learning performances. An initial LP might state, "Students investigate the effects of forces on motion," which is broad and lacks detail. With expert guidance, this was refined to, "Students plan and conduct investigations to observe how different strengths of forces affect the motion of a ball rolling down a ramp, noting the differences in speed and direction." This level of specificity ensures that the learning performance is actionable and directly tied to observable student behaviors. Moreover, human experts bring a nuanced understanding of educational contexts that AI currently lacks. They can interpret curriculum standards and translate them into specific, measurable learning outcomes. 
When unpacking a complex DCI such as PS2.A: Forces and Motion, experts not only provide the scientific content but also pedagogical strategies to effectively teach these concepts. This might involve suggesting inquiry-based learning activities, formative assessments, and differentiated instruction strategies to meet diverse student needs. For example, during the design of assessments for 3- LS4-3, which involves understanding how environmental changes affect organisms, the initial AI outputs were broad and lacked depth. Experts provided context about specific environmental changes such as deforestation, pollution, and climate change, and their impact on particular species. This allowed the AI to produce more detailed and contextually relevant outputs, such as, "Students analyze data on polar bear populations in the Arctic to understand the impact of melting ice caps on their habitat and survival rates." Human experts evaluate the AI’s outputs, provide detailed feedback and examples to guide the AI in generating more accurate and relevant outputs, and ensure that the AI’s outputs align with educational standards and goals. The AI generates outputs based on the provided instructions and guidelines, incorporates feedback from human experts to refine its outputs, and learns from the provided inputs to improve the quality of its outputs. Human expertise is essential for training GPT-4 models effectively. Experts provide the necessary guidance and feedback to refine the AI’s outputs, ensuring that they meet educational standards and goals. This collaborative dynamic between AI and human experts enhances the overall quality and effectiveness of the generated outputs. This dynamic is not just about correction but also about enrichment. Experts provide insights that help AI models understand the broader educational 134 landscape, including the integration of crosscutting concepts and practices that are essential for three- dimensional learning as advocated by the NGSS. Furthermore, human experts play a critical role in maintaining the ethical and equitable aspects of educational content. They ensure that the AI-generated outputs do not inadvertently reinforce biases or exclude certain groups of students. By reviewing and providing feedback, experts help ensure that the content is inclusive and accessible to all students, thereby promoting equity in education. The role of human expertise in this collaborative process is multi-faceted. Experts provide detailed, domain-specific knowledge, offer pedagogical strategies, ensure alignment with educational standards, and uphold ethical and equitable principles in educational content. This partnership between human intelligence and AI results in high-quality, relevant, and effective educational assessments that are finely tuned to meet the needs of both educators and students. 4.1.2.4 Iterative Refinement The iterative nature of the design process proved vital for continuous improvement. Each iteration builds on the previous one, incorporating feedback and additional information to enhance the quality of the AI-generated unpacking. Iterative refinement involves continuously evaluating and improving the AI’s outputs through multiple rounds of feedback and adjustments. This process ensures that the outputs are progressively enhanced and aligned with the task requirements. The iterative process was crucial in refining the AI’s outputs for various tasks. 
For example, in designing the IDM for PE 3-PS2-1, each iteration involved re-evaluating the outputs, providing more detailed guidance, and refining the approach based on feedback. This process led to progressively better outputs, ensuring that the IDM became increasingly detailed and aligned with the task requirements. Reflective practice was also integral to this process. Human experts’ reflections on the AI’s outputs helped identify gaps and areas for improvement, guiding subsequent steps in the training process. For instance, when the initial unpacking was too general, experts provided detailed feedback and examples, leading to more accurate and relevant outputs in subsequent iterations. Human experts continuously evaluate the AI’s outputs, provide detailed feedback and examples 135 to guide the AI in refining its outputs, and adjust their guidance based on the AI’s performance. The AI generates outputs based on the provided instructions and guidelines, incorporates feedback from human experts to refine its outputs, and learns from the provided inputs to improve the quality of its outputs. Iterative refinement is essential for training GPT-4 models effectively. This process ensures that the AI’s outputs are continuously evaluated and improved, leading to progressively better outputs. Reflective practice and feedback from human experts are integral to this process, helping to identify gaps and areas for improvement. 4.1.2.5 AI-Human Collaboration The collaborative dynamic between AI and human experts facilitated the creation of high-quality educational tools. The AI’s ability to learn from provided frameworks and examples, combined with human expertise, resulted in outputs that were not only accurate but also pedagogically sound. AI-human collaboration involves the combined efforts of AI and human experts in generating high-quality outputs. This collaboration leverages the strengths of both AI and human expertise to enhance the overall quality and effectiveness of the generated outputs. Throughout the iterative design process, the collaborative dynamic between AI and human experts was evident. For instance, in designing the IDM for PE 3-PS2-1, human experts provided detailed feedback and examples, which the AI incorporated to generate progressively better outputs. This collaboration was also crucial in refining the AI’s outputs for designing LPs. Human experts provided explicit guidance and detailed examples, which helped the AI generate more focused and relevant LPs. The iterative feedback loops and continuous refinement ensured that the final outputs were accurate and pedagogically sound. Human experts provide explicit guidance and detailed examples to guide the AI in generating high-quality outputs, evaluate the AI’s outputs, identify areas that need refinement, and provide detailed feedback to help the AI refine its outputs. The AI generates outputs based on the provided instructions and guidelines, incorporates feedback from human experts to refine its outputs, and learns from the provided inputs to improve the quality of its outputs. AI-human collaboration is essential for training GPT-4 models effectively. The combined efforts 136 of AI and human experts enhance the overall quality and effectiveness of the generated outputs. This collaboration leverages the strengths of both AI and human expertise, resulting in outputs that are accurate, pedagogically sound, and aligned with educational standards. 4.2 RQ2. 
How Do Human Experts Across Different Disciplines Evaluate the AI-Generated Knowledge-In-Use Assessment and What Refinements Do They Suggest? To respond to RQ2, "How do human experts across different disciplines evaluate the AI- generated knowledge-in-use assessments, and what refinements do they suggest?" a multidisciplinary expert panel review stage was conducted to collect feedback on the interim products. Their feedback focuses on the LPs, evidence statements, assessments, and rubrics, emphasizing 3D learning, engagement, language complexity, equity, and practical perspectives. In this section, I report the analysis of the LPs and evidence statement design, as well as the assessment design feedback based on the different expert groups' input. First, I detail the composition and background of the expert panels for the two different PEs. Then, I explain how the data were analyzed, present the analytic results, and conclude with the major themes and suggestions derived from the reviewers’ comments and feedback. Each report section includes a summary of both quantitative and qualitative analyses, emphasizing the critical takeaways and highlighting variations in feedback across different expert groups. This systematic method provides a thorough and detailed account of the expert evaluations and their recommendations, offering valuable insights into the assessment tasks' effectiveness and areas for improvement. 4.2.1 Expert Feedback on PE: 3-PS2-1 4.2.1.1 Themes of Feedback on LPs and Evidence Statements for 3-PS2-1 The quantitative analysis of expert feedback was conducted to assess the effectiveness of the LPs and Evidence Statements for 3-PS2-1 (see Table 4-18). The analysis involved collecting numerical ratings from experts on various dimensions and visually representing these ratings to identify trends and patterns. The feedback collection table includes six major dimensions that were designed to collect the feedback on the LPs and evidence statement using a Likert scale with scores from 1 (not at all) to 5 (completely). 137 These dimensions provide a comprehensive framework for evaluating the alignment and effectiveness of the LPs and evidence statements with the NGSS standards. These dimensions and their rationales are presented in Table 4-19. Table 4-18. LPs and evidence statements for LP2 for 3-PS2-1 for review PE LPs 3-PS2-1: Plan and conduct an investigation to provide evidence of the effects of balanced and unbalanced forces on the motion of an object. LP1**: Students plan and carry out investigations to observe how different strengths of forces affect the motion of an object. LP2**: Students develop models to explain how balanced forces acting on an object result in no change in motion, using everyday scenarios such as a book resting on a table or a tug-of- war game with equal strength on both sides. LP3**: Students construct explanations based on evidence from investigations to explain how objects in contact exert forces on each other, including friction, showing the interaction between objects as a cause of motion changes. LP4**: Students use models to explain how non-contact forces (e.g., magnetic or gravitational forces) on an object's motion or change in motion act at a distance. Focal LP: LP2 Students develop models to explain how balanced forces acting on an object result in no change in motion, using everyday scenarios such as a book resting on a table or a tug-of-war game with equal strength on both sides. Evide nce State ments 1. 
Students construct a model that visually represents an object under the influence of balanced forces, showing the object either in a static position or moving at a constant speed without any change in direction. The model should: ● Basic shapes or figures to represent the object(s) and the forces acting on them. ● Arrows of equal length pointing in opposite directions to signify balanced forces. ● A brief label or symbol next to each arrow, indicating the type of force (e.g., "push," "pull," "gravity"). 2. Students include annotations or keys in their model to delineate between balanced and unbalanced forces, explaining scenarios where the forces cancel out, resulting in zero net force on the object. The model should: ● Clearly mark parts of their model to show where forces are acting. ● Use simple vocabulary to describe how these forces are balanced or what might happen if they weren't (e.g., "If one side pulls harder, the rope moves that way"). 3. Students apply their model to real-life scenarios like a book on a table or a car cruising at a steady speed, explaining how these situations exemplify balanced forces resulting in no change in motion. The explanation should: ● Identify the forces acting on the object in the scenario (e.g., gravity pulling the book down, table pushing it up). ● Describe how these forces balance out, using elements from their model as reference points. ● Conclude how the balanced forces result in no change in the object's motion (either staying still or moving steadily). 138 Table 4-19. Feedback collection dimensions and rationale for LPs and evidence statements Dimension Statement Rationale Collective Representation of Proficiencies (1) To what extent does the set of learning performances collectively represent the proficiencies that are necessary for attaining the performance expectation? Understand how well the set of LPs collectively represent the necessary proficiencies for attaining the PE. Essentiality of the Learning Performance (2) To what extent does LP2 comprise an essential part of what is needed to achieve the performance expectation? Understand whether the LP is essential for achieving the PE. Sufficiency of Evidence Statements (3) To what extent do the evidence statements of LP2 reflect obtainable pieces which, taken together, are sufficient for supporting a claim of student proficiency in this learning performance? Determines if the evidence statements reflect obtainable pieces sufficient to support a claim of student proficiency. Integration of Knowledge (4) To what extent is LP2 an integrated 3-dimensional statement of knowledge-in-use? Gap Identification (5) What gaps, if any, do you see in the set of learning performances (i.e., proficiencies required by the performance expectation that are not represented in the set of learning performances)? Examines if the LP is an integrated three-dimensional statement of knowledge-in-use. Identifies any proficiencies required by the PE that are not represented in the set of LPs. Overreach identification (6) What overreach occurs, if any, in the set of learning performances (i.e., proficiencies that ARE NOT required by the performance expectation but that ARE required in the set of learning performances)? Identifies any proficiencies that are not required by the PE but are included in the LPs. The ratings were averaged, and the standard deviation (SD) was calculated to provide an overall assessment and measure of variability. 
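The sketch below illustrates one way this aggregation could be carried out: computing the mean and standard deviation of the 1-5 ratings for each expert group and rendering the group-by-dimension means as a heatmap of the kind discussed next. The scores and labels are placeholders rather than the study's actual data.

```python
# Hedged sketch of the rating aggregation and heatmap described above.
# The scores are placeholder values, not the study's data; the real analysis
# used experts' 1-5 Likert ratings on the dimensions listed in Table 4-19.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# One row per expert; columns hold that expert's rating for each dimension.
ratings = pd.DataFrame({
    "group":          ["NGSS", "NGSS", "Assessment", "Assessment",
                       "Science", "Science", "Equity", "Equity"],
    "collective_rep": [2, 3, 4, 4, 5, 4, 4, 5],
    "essentiality":   [3, 3, 5, 4, 4, 4, 5, 4],
    "sufficiency":    [4, 3, 5, 5, 4, 5, 4, 4],
    "integration":    [4, 4, 5, 4, 5, 5, 5, 4],
})

# Average rating and standard deviation per expert group and dimension.
summary = ratings.groupby("group").agg(["mean", "std"])
print(summary)

# Heatmap of mean ratings: dimensions as rows, expert groups as columns,
# with a blue-to-red palette mirroring the scale described for Figure 4-6.
means = ratings.groupby("group").mean().T
sns.heatmap(means, annot=True, vmin=1, vmax=5, cmap="coolwarm")
plt.title("Mean expert ratings by group and dimension")
plt.tight_layout()
plt.show()
```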
Figure 4-6 presents the summary of the expert ratings on the LPs and evidence statements feedback. 139 Figure 4-6. Expert feedback on the LPs and evidence statements of PE 3-PS2-1 The heatmap visualizes the expert evaluations on the LPs and Evidence Statements for PE 3-PS2- 1. Each column in this heatmap corresponds to a different expert group—NGSS, Assessment, Science, and Equity—reflecting their specific feedback. Similarly, each row represents a distinct dimension of the feedback, which are Collect/Interpret, Establishing Evidence, and Integration of the three dimensions. The color scale on the heatmap ranges from blue to dark red, serving as a gradient to indicate the level of expert ratings. Dark red colors signify the highest ratings, approaching 5.0, which indicates strong agreement or high levels of satisfaction with the specific dimension evaluated. These areas suggest that the expert group views the LPs and evidence statements as being highly effective or excellently aligned with the designated criteria. In contrast, mid-range ratings are colored in shades ranging from light red to orange, spanning values between 3.0 and 4.5. These colors denote moderate satisfaction, suggesting that while some aspects are satisfactory, there are still opportunities for improvement in these areas. The presence of these colors on the heatmap points to dimensions where feedback suggests a need for adjustments or enhancements to better meet the PE standards or improve clarity and effectiveness. 140 Finally, blue represents the lowest ratings, near the value of 2.0, indicating significant concerns or dissatisfaction from the expert reviewers. These regions highlight specific dimensions where experts believe that the LPs and evidence statements fall short of expectations and require substantial revisions or reconsideration. The layout and color coding of the heatmap enable a quick visual assessment of consensus and divergence among different expert groups. This helps in identifying areas of general agreement or satisfaction, where little modification might be needed, as well as pinpointing those aspects that require attention and likely intervention due to lower ratings. The visual format of the heatmap thus not only simplifies the comparison across multiple dimensions and expert groups but also aids in quickly locating areas of strength and those needing improvement, facilitating targeted adjustments to enhance the educational assessments. The heatmap shows that the NGSS experts have more critiques on all of the dimensions, while the other groups of experts have more conservative scores on the first and second dimensions. The feedback reflects a mix of positive feedback and areas for improvement. The high ratings for sufficiency and integration suggest that the LPs are well-supported by adequate evidence and effectively integrate 3D learning. However, the lower ratings for collective representation and essentiality, particularly from NGSS experts, indicate a need to revisit these areas to ensure the LPs comprehensively cover the necessary proficiencies and are deemed essential for achieving the PE. Qualitative feedback provided deeper insights into these findings. Coverage of DCIs and SEPs There was a common theme regarding the omission or inadequate coverage of all the requisite proficiencies. 
The major concern is that the set of generated LPs does not sufficiently cover the idea of "unbalanced forces" that is emphasized in the PE. Although LP4's focus on non-contact forces is reasonable, since some students struggle to understand "non-contact forces," it may not meet the PE's central expectation about balanced and unbalanced forces. C also pointed out, "LP4 has a DCI focus that appears outside the bounds of the PE – the foundation box emphasizes objects in contact (part of PS2.B) and the PE does not indicate that non-contact forces are a primary focus of the PE nor that distance apart of objects needs to be addressed. The DCI element in LP 4 seems better aligned to PS2-3." The NGSS expert provided further insights, stating, "The unpacking of the DCI and SEP is not very good: 1. The DCI unpacking does not cover the major ideas of the DCI (the boundary of the DCI elements); 2. The SEP is beyond the grade level, such as the use of models to explain or describe. There is a need to readjust the SEP unpacking and the CCC unpacking." Interestingly, some experts believe that the set of LPs covers the DCIs in the PE very well, but that more attention needs to be paid to whether the SEP in the PE is well addressed, as only two LPs mention planning and carrying out investigations. This concern is echoed by assessment experts who commented, "LP1 most directly aims to address the SEP of the PE but does not sufficiently encompass the SEP (e.g., the LP requires students to observe in an investigation, but the PE requires that students produce/collect data that will provide evidence to make a claim)." Given this feedback, it is crucial to reexamine the unpacking of the three dimensions and redesign the LPs to ensure they cover all of the key ideas.

Integration and alignment of CCCs

Feedback indicated that while the LPs generally integrated CCCs, there were instances where the connections could be made more explicit. For example, assessment expert CH mentioned that "the evidence of a CCC is far less clear, but the phrase 'results in' could be evidence of cause and effect." Science content expert J remarked that "although stability and change are not as explicit as they could be in this assessment, there is still ample evidence of mastering this CCC in these evidence statements." Assessment expert P observed, "there is a bit more to be desired with the lens of cause and effect and how those are specifically developed. Could be spelled out more." NGSS experts also pointed out, "The CCC component is not clear and needs to be explicit." Additionally, they noted that "One of the LPs focuses on friction, which is not explicitly mentioned in the PE text, the clarification statement, the assessment boundary, or the DCI text. However, only gravity is explicitly mentioned, so it must be assumed that other forces would be discussed. There is an LP about students using models to explain the effect of balanced forces, but none of the LPs address the idea of the effect of unbalanced forces. This leaves a very significant part of the PE unaddressed." The DCI element for the PE even uses the phrases "they add to give zero net force on the object" and "forces that do not sum to zero."

Complexity and appropriateness of examples

Experts noted concerns about the complexity of some examples used in the LPs, suggesting that certain examples might be too advanced for the intended grade level.
Science content expert S pointed out that "the scenarios of the assessment tasks may require different levels of knowledge: 'a book resting on a table or a tug-of-war game'—the difficulty of students interpreting these two scenarios may be quite different." Teacher expert Le criticized the example of "the motion of wheels of a cruising car" as not reflective of balanced and unbalanced forces due to external factors like motors and gasoline. In addition, experts suggested having more examples of moving objects. T mentioned, “Students often have a misconception that a force is necessary for an object to keep moving. Therefore, they often struggle with the idea that a moving object will keep moving if all of the forces are balanced. While the evidence statement does mention the idea of “...moving at a constant speed without any change in direction”, there are no examples where the object is in motion.” He suggested adding one example of moving objects. Need for clarity in language and evidence statements Several experts emphasized the importance of clear and precise language to avoid confusion, especially given the elementary education context. T noted that terms like "net force" and "different strengths of forces" might confuse educators and students, especially those with limited physics backgrounds. He suggested these terms could lead to inaccuracies in assessments and might not align well with the PEs. E similarly emphasized the need for clear and precise language, pointing out that some scientific terms might be too advanced for the target educational level. Language expert Su noted that "LP4 needs a verb. The 4 LPs indicate what students will do to show evidence from the results of the investigations they plan and conduct." Similarly, T suggested revising the LP4 as “Students use models to explain how non-contact forces act at a distance.” The integration of these findings in Table 4-19 suggests several key recommendations for refining the LPs and Evidence Statements. 143 Table 4-19. Integrated Analysis Results for PE 3-PS2-1 LPs and Evidence Statements Theme Key Points Recommendations Coverage of DCIs and SEPs Effective coverage of force and motion concepts, gaps in addressing all necessary SEPs and DCIs Redo the unpacking of the DCI to cover all major ideas and boundary elements (unbalanced forces, non-contact forces). Ensure the SEP levels are grade-appropriate, such as simplifying the use of models to explain or describe (explain or describe). Include LPs that address the effect of unbalanced forces to ensure comprehensive coverage of the PE. Integration of CCCs General integration of CCCs, need for more explicit connections Make connections between CCCs and the LPs more explicit. Clearly define and illustrate concepts such as stability and change. Ensure the CCC component is clear and explicitly mentioned in the LPs. Include discussions of forces like friction and unbalanced forces as required by the PE. Example Complexity Concerns about the complexity of examples, some examples too advanced for grade level Use simpler, more relatable examples that accurately reflect the concepts being taught. Avoid scenarios that could be misinterpreted or are too complex for the grade level. Provide examples that are within the students' understanding and experience, such as more straightforward comparisons or everyday situations. 
Language and Terminology Importance of clear and precise language, avoid confusing terminology Simplify language to match the reading levels of elementary students and ensure consistent terminology across the LPs and evidence statements. Provide clear definitions and avoid complex phrases to enhance comprehension. Replace vague or advanced terms with simpler alternatives. Gaps and Overreach Need for alignment with NGSS standards, ensure developmental appropriateness Review the LPs to identify and address any gaps or overreach. Ensure that all necessary SEPs are represented and that the LPs do not include proficiencies beyond what is required by the PE. Adjust the SEP and CCC unpacking to align with grade-level expectations. 4.2.3.2 Expert Feedback Analysis for PE: 3-PS2-1 Assessment Task 1 Expert panels received a protocol including three major sections (item stem, item prompt, and exemplar response) to collect their feedback on Task 1 designed for LP2 for the PE 3-PS2-1. See Figures 4-7 and 4-8 below about the task 1 and its’ exemplar response. The protocol comprised 16 items across five dimensions for the item stem, 10 items across four dimensions for the item prompt, and 3 items for the exemplar response. Feedback was collected using a Likert scale from 1 (not at all) to 5 (completely) across these various dimensions. The collected data were then analyzed to provide a comprehensive 144 evaluation of Task 1. Figures 4-9, 4-10, and 4-11 provide a visual representation of the expert ratings across various dimensions on the Task 1 item stem, item prompt and exemplar response. Figure 4-7. Task 1 for LP2 of 3-PS2-1 145 Figure 4-8. Exemplar response for task 1 for LP2 of 3-PS2-1 146 Figure 4-9. Heatmap of the expert ratings on the Task 1 item stem for PE 3-PS2-1 The most striking observation is the low score provided by NGSS experts, suggesting a significant concern with how the phenomenon is presented or utilized in the item stem. This contrasts with more moderate scores from other groups, indicating a discrepancy in how the phenomenon's relevance or clarity is perceived across expertise. Language-related dimensions received a wide range of scores. Notably, domain-specific language was rated poorly by NGSS experts, but much more favorably by other groups, highlighting a potential disconnect between NGSS content expectations and language used. The language sentence structure also varied, with lower scores indicating a need for better alignment with diverse student backgrounds. Engagement dimensions show moderate to high variability. Notably, engagement interest and relevance are scored lower by several groups, suggesting that the item may not effectively capture or maintain student interest or connect well with their real-life experiences. Comprehension scores are generally high, indicating that, structurally, the item supports student 147 understanding and information processing. However, the visualization aspect received lower ratings from some groups, suggesting that visual aids or representations used in the item could be enhanced for clarity or effectiveness. Moreover, teachers have relatively lower scores on if the phenomenon in the task can be really interesting to students, although it is relevant to student life. The lower scores on phenomena engagement and language grade level appropriateness and sentence structure particularly stand out as areas needing immediate attention to ensure the item stem is both educationally effective and resonant with a diverse student population. 
The feedback underscores the necessity of enhancing the item's relevance, engagement potential, and language appropriateness ensuring it not only meets educational standards but also supports inclusive and equitable learning experiences. The heatmap in Figure 4-9 displays expert ratings for the Task 1 item related to PE 3-PS2-1 across various dimensions. Each column represents a different expert group—NGSS, Assessment, Science, Equity, Engagement, and Teacher. Each row corresponds to specific dimensions of the assessment, such as 'Cultural Authenticity', 'Language Sensitivity', 'Comprehension', and 'Phenomena'. The color scale of the heatmap ranges from blue to red. Blue indicates lower scores (1.0 to 2.0 range), signaling significant concerns or dissatisfaction. This suggests that aspects within this color range may require substantial revisions. Light red to orange represents moderate scores (2.5 to 4.0 range), indicating partial fulfillment of criteria and potential areas for improvement. Red signifies higher scores (4.5 to 5.0 range), denoting strong agreement or satisfaction with the dimensions evaluated, suggesting that these aspects are well-executed. 148 Figure 4-10. Heatmap of the expert ratings on the Task 1 item prompt for PE 3-PS2-1 The heatmap in Figure 4-10 provides a detailed view of expert evaluations across multiple dimensions of the Task 1 item prompt associated with 3-PS2-1, LP2. The evaluations were guided by specific criteria focused on 3D Prompt alignment, Comprehension, Language Complexity, the use of Scaffolds, and engagement by the engagement expert panel. The evaluations indicate varied perceptions of how well the prompts are 3D and align with integrated proficiencies. Particularly, the NGSS panel's lower scores (2.5) suggest concerns about the comprehensive integration or alignment of the prompts. The extent to which questions elicited by the prompts are motivated by the scenario described in the stem was also evaluated, with scores suggesting some alignment but also room for enhancement to clarify the connection between the scenario and the questions. Additionally, the accessibility of the prompts for novices was assessed, with varied scores indicating differing views on the prompt's suitability for students still developing relevant proficiencies. Experts assessed whether students have the necessary prior 149 knowledge to understand and respond to the prompt, with generally high scores in related areas such as information coherence and consistency suggesting an adequate connection with expected prior learning. However, moderate evaluations in these areas also suggest that further clarification could be beneficial. Concerns were noted regarding the clarity and directness of sentence structure, as lower scores highlight a need for simpler language to aid comprehension. Mixed evaluations in vocabulary appropriateness and domain-specific vocabulary usage indicate that while some experts find the vocabulary suitable and well- integrated, others see a need for adjustments to ensure all vocabulary is accessible and clearly explained. The effectiveness of scaffolds in helping students navigate the complexity of the task received mixed perceptions. Some panels noted lower scores, suggesting that the scaffolds might not be effectively presented or sufficiently supportive for all students, particularly those with less background knowledge or proficiency. In terms of engagement, there is a clear need to make the prompts more engaging and interesting. 
This could involve integrating topics or scenarios that are more directly aligned with student interests or current societal issues, making the educational experience more engaging and motivating. By more directly linking the prompts to real-life applications and demonstrating how the skills and knowledge gained are applicable outside the classroom, the prompts could become more relevant and meaningful to students. Ensuring that the prompts not only introduce but also effectively integrate the three dimensions of learning will be crucial. This may involve revising the prompts to include clearer explanations or examples of how these dimensions are relevant and can be explored through the task. Overall, there is a recognized need for improving language structures, ensuring vocabulary appropriateness, and enhancing term consistency to make the prompt more accessible and understandable for all students. The presentation and design of scaffolds also need revision to better assist students in understanding and engaging with the task, ensuring that scaffolds effectively break down task complexity and support learning. 150 Figure 4-11. Heatmap of the expert ratings on the Task 1 exemplar response for PE 3-PS2-1 Figure 4-11 displays a heatmap of expert evaluations on the exemplar responses for Task 1. These evaluations are based on specific criteria focusing on the extent to which exemplar responses capture necessary evidence statements, address the integrated proficiencies entailed by the LP, and utilize grade- appropriate language. The scores of alignments with evidence statements reflect the extent to which exemplar responses capture all necessary evidence statements. The Assessment expert group rated this particularly high (5), indicating that the exemplar responses well represent the required evidence statements. In contrast, the NGSS group gave a lower score (4), suggesting some room for improvement in how comprehensively the responses cover all designated evidence statements. The 3D integration dimension assesses whether it is possible for students to provide accurate responses without necessarily attending to all integrated proficiencies required by the LP. The varied scores, with higher ratings from Assessment and Science panels (4.2 and 4 respectively) and lower from the NGSS (3), indicate differing views on how well the exemplar responses integrate or require engagement with the 3D learning 151 components. The appropriateness of the language used in the responses concerning the grade level received mixed evaluations. The Teacher panel rated this dimension highly (4.5), suggesting that the language used is well-suited for the grade level. Conversely, scores from the NGSS and Engagement panels were lower, suggesting that some responses may not consistently meet the language level expectations for the target student group. The response interest evaluates how interesting the responses are likely to be for students, with varying perceptions among the expert groups. Lower scores from the NGSS panel (1) highlight a significant critique that the responses may lack elements that engage or captivate students' interests, which is crucial for maintaining engagement with the task. Overall, the phenomena dimension received mixed ratings, indicating concerns about the compelling nature and comprehensibility of the phenomenon presented in the item stem. So do the assessment experts, equity/language experts, and the teacher experts. 
In contrast, engagement experts rated it significantly higher, suggesting that they found the phenomenon more engaging for students. This discrepancy highlights the differing perspectives among expert groups on what constitutes an engaging and comprehensible phenomenon for elementary students. It is important to collect further input from teacher experts on the engagement level of the item phenomena to gain a better understanding. Additionally, the dimensions of language complexity and student engagement need further exploration to understand the experts' concerns and suggestions. The level of the exemplar responses was also questioned: there are concerns about students' ability to independently demonstrate proficiency and about the appropriateness of the language used. These quantitative findings are further explored in the qualitative analysis, from which several themes were identified.

Simplify language for clarity and grade-level alignment

Experts consistently noted that the language used in the item stem and prompts could be simplified to better match the reading level of third graders. Words like "despite" and phrases like "remains unmoved" were highlighted as potentially confusing. E suggested, "The second sentence is not appropriate for 3rd grade. Perhaps – The book does not move, even with slight bumps to the table." T noted, "The language in the stem creates a scenario that makes sense, but terms like 'remains unmoved' are too complex for third graders." Assessment experts also suggested revising the item prompt from "Explain using your model how the forces result in the book maintaining its position without movement." to "Use your model to explain how the forces result in the book staying in place without moving. OR Explain with your model how the forces keep the book from moving." In addition, there were several comments about the need for clearer and more consistent terminology. For example, terms like "normal force" should be clearly defined. T noted, "Normal force is used in the example, but it is not defined in the question, nor used in any of the PEs, DCIs, or LPs." C added, "Example of an inconsistency -- In the Stem, the phrase 'occasional bumps' is used; in the Prompts, the phrase 'slight disturbances' is used to mean the same thing." Furthermore, most experts commented that the exemplar response is "more of a 6th grade level explanation – the sentence structure and vocabulary is upper elementary 5-6 grade."

Enhance engagement and inclusion

Most experts responded similarly, noting that "This (phenomenon) is fully comprehensible for students; perhaps not super compelling, but a very good relatable phenomenon for elementary students." However, some NGSS and assessment experts had concerns about whether the phenomenon is compelling enough for students; the NGSS expert in particular worried that the scenario would lead to misinterpretation of "side-to-side motion." Engagement and teacher experts suggested that the task could be made more engaging by framing it as a story or using a hands-on demonstration. They also recommended using examples that are more directly relevant to students' everyday experiences. Sa proposed, "If we ran straight at a wall, we would bounce 'off' the wall instead of running 'through' it.
What would it take for us to be able to run through a wall?” B mentioned, “Framing it as a story might be helpful for this age group.” Additionally, feedback also indicates the scenario is relatable to all students and that the language used does not inadvertently exclude any groups. For example, Sa commented, “The scenario is something students have experienced frequently.” But the experts also underline the importance of inclusion. Q noted, “scenario is culturally sensitive overall, but it’s important to ensure it’s inclusive for students with diverse backgrounds and abilities.” 153 Provide adequate scaffoldings There was a strong recommendation to provide more scaffolding to help students understand the concept of balanced forces. This could include clearer instructions, visual aids, and step-by-step guidance. CH suggested, “Prompts scaffold for drawing the model and explaining the model. Small suggestion – you could scaffold students to use arrows.” QL mentioned, “It would be better to tell them that they need to draw arrows and labels around the book. They may have no idea how to represent unseen forces.” Several experts think the prompts do not provide adequate scaffolding for students to develop models. Table 4-20 below synthesizes the integrated findings of expert feedback on Task 1. Table 4-20. Integrated Analysis Results for Task 1 of PE 3-PS2-1 Theme Key Points Recommendations Simplify language for clarity and grade-level alignment. Importance of clear, consistent, and precise language, avoid confusing terminology Simplify language to match the reading levels of elementary students. Replace complex terms like "despite" and "remains unmoved" with simpler language, such as "even with slight bumps, the book does not move." Ensure consistent use of terms like "occasional bumps" and "slight disturbances." Define terms like "normal force" clearly in the context of the task Simplify language level of exemplar response. Enhance engagement and inclusion Suggestions to frame the task as a story or use hands-on demonstrations to increase interest. Ensure the scenario is relatable and inclusive for all students, including those with diverse backgrounds and abilities Incorporate storytelling elements and opportunities for practical demonstrations. Use real-life examples and interactive elements to make the learning experience more dynamic. Frame the task as a relatable story or classroom event and include hands-on activities where students can physically manipulate objects to observe forces in action. Use examples and language that reflect diverse student experiences and backgrounds. Ensuring cultural sensitivity and inclusivity will help make the task more accessible to all students. For example, include culturally diverse names, contexts, and examples that reflect the backgrounds of the student population, making the task more engaging and relatable for all learners. Provide adequate scaffolding Need for clearer instructions and visual aids to help students understand balanced forces Include step-by-step guidance, visual aids, and explicit instructions for modeling forces. Providing additional support materials, such as graphic organizers or visual aids, can help students organize their thoughts and responses effectively. For instance, provide clear diagrams with labeled arrows to show forces, and include detailed instructions that guide students through the process of modeling and explaining the forces at play. 
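The theme-by-group organization behind syntheses such as Table 4-20 can be illustrated with a short sketch. The snippet below is a hypothetical illustration, not the study's coding workflow: it assumes coded excerpts are stored as (expert group, theme, excerpt) records and simply groups them by theme and then by expert group; the record structure and the sample excerpts (paraphrased from the feedback reported above) are assumptions.

# Illustrative sketch only: tallying coded expert comments by theme and expert
# group to support an integrated synthesis like Table 4-20.
from collections import defaultdict

# Hypothetical coded excerpts: (expert_group, theme, excerpt)
coded_comments = [
    ("Assessment", "Simplify language", "'Remains unmoved' is too complex for third graders."),
    ("Teacher",    "Simplify language", "Define 'normal force'; it is not used in the PE or LPs."),
    ("Engagement", "Enhance engagement and inclusion", "Framing it as a story might be helpful."),
    ("Equity",     "Enhance engagement and inclusion", "Ensure the scenario is inclusive for diverse students."),
    ("Assessment", "Provide adequate scaffolding", "Scaffold students to use arrows in their models."),
]

# Group excerpts by theme, then by expert group, mirroring the table layout.
by_theme = defaultdict(lambda: defaultdict(list))
for group, theme, excerpt in coded_comments:
    by_theme[theme][group].append(excerpt)

for theme, groups in by_theme.items():
    print(f"\nTheme: {theme}")
    for group, excerpts in groups.items():
        print(f"  {group} ({len(excerpts)} comment(s)):")
        for e in excerpts:
            print(f"    - {e}")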
154 4.2.3.3 Expert feedback analysis for PE: 3-PS2-1 Assessment Task 2 Feedback for Task 2 designed for LP2 of PE 3-PS2-1 was also collected using the same feedback protocols and same group of experts. Task 2 and its’ exemplar response can be found from Figures 4-12 and 4-13. Figures 4-14, 4-15, and 4-16 visually present the expert feedback distribution. Figure 4-12. Task 2 for LP2 of 3-PS2-1 155 Figure 4-13. Exemplar response for task 2 for LP2 of 3-PS2-1 156 Figure 4-14. Heatmap of the expert ratings on the item stem of Task 2 for PE 3-PS2-1 Figure 4-14 illustrates the heatmap of expert evaluations for the Task 2 item stem. The feedback revealed significant variations in how different expert groups perceived the phenomenon's clarity and engagement. Notably, NGSS experts provided lower ratings, suggesting that the phenomenon might not be as engaging or accessible to all students. This contrasted with higher scores from Science experts, who viewed the phenomenon as relatively engaging. This discrepancy indicates a need for adjustments to make the phenomenon more universally accessible and engaging. In terms of comprehension, all expert groups rated the consistency of terminology highly, which underscores the clarity in the use of terms, essential for student understanding. However, there was a notable variation in the perceived effectiveness of visual aids. Some experts recommended enhancements to improve comprehension, potentially including the addition of captions for better clarity. Engagement with the scenario, assessed through metrics of student interest and relevance to real-life and 3D learning, received moderate to high scores. 157 However, the variability in these scores suggests potential for further enhancing the scenario's captivation and relevance to the three dimensions of learning. Cultural sensitivity evaluations, particularly highlighted by lower scores from Equity experts, pointed to the task's limited resonance across diverse student groups. This suggests a pressing need for broader cultural considerations within the scenario to enhance inclusiveness and authenticity. Figure 4-15. Heatmap of the expert ratings on the item prompt of Task 2 for PE 3-PS2-1 Figure 4-15 showcases a heatmap of expert evaluations for the Task 2 item prompt. The evaluation highlights two dimensions where the scores were notably low. The NGSS experts assigned a score of 1 to scaffolding and the equity and language experts assigned a score of 2, indicating significant concerns about the effectiveness of the scaffolding provided in supporting student understanding of the task. This suggests that the current scaffolding may not adequately help all students grasp complex concepts or engage deeply with the content. Enhancements in this area are crucial to ensure that the support structures are robust enough to facilitate comprehensive understanding across diverse student 158 groups. Another low score was observed in the engagement relevance dimension, where the Engagement panel rated it 3, pointing to potential shortcomings in the scenario's ability to resonate with and captivate students' interests. This moderate score suggests that while the prompt has some engaging elements, it could be significantly improved to better capture and hold student interest, making the learning experience more compelling and relevant. 
In contrast, high ratings were consistently given by the Science panel across dimensions such as language appropriateness and 3D learning clarity, indicating that from a scientific and educational standpoint, the language and integration of learning dimensions are effectively executed. However, the mixed feedback across different panels, particularly the lower scores from the NGSS and Engagement experts, underscores the need for a more unified approach that aligns with NGSS standards and effectively engages students. Figure 4-16. Heatmap of the expert ratings on the exemplar response of Task 2 for PE 3-PS2-1 Figure 4-16 presents a heatmap of expert evaluations for the Task 2 exemplar responses. The analysis shows that the exemplar responses have significant disparities in how effectively they 159 incorporate NGSS-required three-dimensional learning and engage students. The NGSS experts rated the Three-Dimensional Learning Integration at a notably low score of 1, indicating that the responses fall short in engaging students with the necessary competencies and learning dimensions mandated by the standards. Another area of concern highlighted by the Engagement experts is the low score of 2 for Response Interest. This suggests that the responses may lack elements necessary to capture and maintain student interest effectively, potentially impacting their educational effectiveness. Conversely, the responses were well-received in terms of their alignment with evidence statements, receiving high scores from Assessment and Science experts. This indicates that they accurately incorporate the required evidence statements. However, the Language Appropriateness received mixed evaluations, with a particularly low score from the Engagement group, suggesting the language may not be entirely suitable for all students. With quantitative analysis, the next section provides qualitative analysis to further zoom in the specific suggestions. Enhance engagement by considering individual experience. Task 2 was generally found to be engaging and relevant, for instance, "very much comprehensible for students; a very good relatable phenomenon for elementary students." Experts also suggested enhancements to further consider individual experiences. Sa mentioned that the task stem is "clear and accessible" but emphasized the importance of considering individual interests. She stated, "I personally think this is quite interesting because children have likely played tug-of-war before, making it a good anchor point to expand their understanding of the phenomenon. However, interest is individual and exists on a scale. I'd be curious to ask: interesting in relation to what? As a standalone scenario, it works well as a classroom activity, but teachers might be better positioned to discuss if this is an interesting scenario for this age group as a whole or if there are more engaging activities." Sa also commented on the relevance, stating, "This will depend on the student. Generally, the stem has high 'interest value,' but the phenomenon's real-life application might depend on students' prior experiences with the game." E echoed this sentiment, cautioning that the task might not resonate with all students, especially those who have not played tug-of- war, suggesting, "Most students will not have done this before and would not relate to it." 160 Ensure coherent information across the task Another theme is that incoherent information in tasks may lead to inefficient information processing. 
C mentioned that information incoherence exists in three areas: (a) the stem scenario presents a rope-pulling challenge, but the prompts require students to draw the push and pull that classes are giving to the rope; (b) the image in the scenario can mislead students to think each class is playing a separate tug-of-war game because it shows two games being played; (c) coherence is further disrupted by the prompt asking, "Show how the push from one class and the pull from the other class are just right so that neither side moves." This is echoed by science content expert P, who stated, " the stem uses one class 'pushes' and the other class 'pulls,' yet, in physics, both sides are pulling." Su similarly pointed out, "students would not suggest that one class is ‘pushing’ and one is ‘pulling’." These suggest the importance of emphasizing information coherence principles in the design process. Simplify language for grade-level appropriateness Experts emphasized the need to simplify the language used in the task to ensure it is suitable and understandable for third graders. Complex terms and phrases can hinder students' ability to engage with and comprehend the task. E noted that terms like "invisible force" could be confusing, suggesting instead, "'Invisible force' is pretty confusing for me and for kids." Similarly, P pointed out that phrases like "all their might" might not resonate with students, proposing it be changed to "really hard." SC also recommended simplifying vocabulary to better align with students' prior instruction and understanding. For the example response, C noted, "This is more of an upper elementary grade level explanation—the sentence structure and vocabulary are appropriate for grades 5-6." Language expert Su echoed this sentiment, stating, "... it is more sophisticated than expected for most third graders. It sounds more like text written by a middle schooler." However, the exemplar explanation remains student-friendly and appropriate for a student audience. Ensure inclusive and clear visual representation The visual depictions in the task received mixed feedback, highlighting concerns about their potential to mislead students and lack of inclusivity. E pointed out, "The image might be too sexist 161 because the girls all have skirts on. And there are no students who are overweight." This underscores the need for visual materials to be inclusive and representative of diverse student populations. Su noted, "The photo with two groups in a tug-of-war is confusing, as some students might think it is showing two different tug-of-war events." Additionally, J raised concerns about inclusivity for students with disabilities: " it might be important to consider that students with physical disabilities might have more challenges with this question. I would keep this question but might change the image to include participation from students with physical disabilities, such as a kid in a wheelchair." This feedback emphasizes the importance of ensuring that visual aids are clear, unambiguous, and inclusive to prevent misinterpretation and promote understanding. Ensure effective scaffoldings The need for effective scaffolding to support students in understanding and completing the task was another key theme. Proper scaffolding can help students break down complex tasks into manageable parts and guide them towards successful completion. Q suggested providing example models, stating, "Including some sort of example model might be helpful for students." 
Similarly, C emphasized the importance of clear instructions and scaffolds, noting, "Scaffolds help with modeling – what to draw, what to show with the model, and encouragement to use symbols and/or labels." This feedback highlights the need for well-designed scaffolding to facilitate student understanding and engagement. Table 4-21 below synthesizes the expert feedback on Task 2.

Table 4-21. Integrated Analysis Results for Task 2

Theme: Ensure coherent information across the task. Key points: Incoherent information in the task may lead to inefficient information processing. Recommendations: Ensure all information in the task is coherent and consistent. Align the stem, prompts, and images to present a unified scenario. Revise the language to consistently describe the actions of both classes as "pulling" to avoid confusion.

Theme: Simplify language for grade-level appropriateness. Key points: Complex terms and phrases could hinder students' understanding. Recommendations: Simplify the language to match the reading levels of third graders. Replace complex terms like "invisible force" with simpler alternatives. Use straightforward language and avoid jargon to ensure clarity.

Theme: Ensure inclusive and clear visual representations. Key points: Visuals could be misleading and lack diversity. Recommendations: Revise the images to ensure they clearly represent a single scenario and avoid depicting multiple games. Include diverse characters in the visual aids to reflect a variety of student backgrounds, including students with disabilities. Ensure that visuals are clear and unambiguous to prevent misinterpretation.

Theme: Enhance engagement by considering individual experience. Key points: The task is engaging but may not be relatable for all students. Recommendations: Consider individual interests and provide more relatable examples to increase student engagement. Incorporate storytelling elements to make the task more dynamic and captivating for students. Use practical demonstrations or hands-on activities to illustrate concepts. Ensure that the scenarios used are relevant to the everyday experiences of third graders, such as common playground activities or familiar classroom experiments.

Theme: Ensure effective scaffoldings. Key points: Need for more effective scaffolding to support student understanding. Recommendations: Provide example models and clear visual aids to help students understand the task. Include step-by-step guidance and explicit instructions for modeling forces, such as arrows and labels. Ensure that scaffolds are effectively integrated into the task, breaking down complex concepts into manageable parts. Provide additional support materials, like graphic organizers, to help students organize their thoughts and responses effectively.

4.2.4 Expert Feedback Analysis for PE: 3-LS4-3
4.2.4.1 Analysis of the Expert Feedback on LPs and Evidence Statements

The protocols distributed for feedback collection were the same as those used for PE 3-PS2-1, with the review content tailored specifically to this PE. Table 4-22 presents the LPs and evidence statements.

Table 4-22. LPs and evidence statements for LP2 for 3-LS4-3 for review

PE: 3-LS4-3: Construct an argument with evidence that in a particular habitat some organisms can survive well, some survive less well, and some cannot survive at all.

LPs:
LP1**: Students develop models to represent various organisms in a specific habitat and identify their basic needs for survival, illustrating the interdependence between organisms and their environment.
LP2**: Students engage in argument from evidence to support claims about which organisms can survive well, less well, or not at all in a specific habitat based on their characteristics and needs, using examples from various habitats to explore cause and effect relationships. LP3**: Students analyze data to describe how certain adaptations help organisms survive in their habitats and explain the cause and effect relationship between specific adaptations and survival success. LP4**: Students predict the effects of minor environmental changes on the survival of organisms in a given habitat, identifying the cause and effect mechanisms that lead to these outcomes. Focal LP: LP2 Students engage in argument from evidence to support claims about which organisms can survive well, less well, or not at all in a specific habitat based on their characteristics and needs, using examples from various habitats to explore cause and effect relationships. Evidence Statements 1. Students collect and present specific evidence regarding the survival rates and adaptation mechanisms of organisms within varying habitats. - Students draw upon observable characteristics, inherent needs, and environmental factors influencing organismal survival. 2. Students formulate clear claims regarding which organisms can thrive, survive less well, or perish in particular habitats, grounding their assertions in gathered evidence and understanding of habitat-organism interplay. 3. Students succinctly explain, with examples, how specific habitat features afford or limit the survival capabilities of certain organisms, highlighting adaptation as a key determinant. 4. Students predict survival outcomes for distinct species, explaining the role of physical and biological habitat components in determining these outcomes given comparative habitat scenarios. 164 Figure 4-17. Heatmap for expert feedback on LPs and evidence statements for 3-LS4-3 Figure 4-17 shows the quantitative analysis of expert feedback on the LPs and Evidence Statements for the PE. The expert evaluations reveal that the Collective Representation of Proficiencies scored a low of 3.5 by NGSS experts, highlighting a concern about the comprehensiveness of the LPs in covering all necessary proficiencies required by the performance expectation. This suggests significant gaps that may impede the achievement of the targeted educational outcomes. Furthermore, the Essentiality of the LPs received a modest score of 3 from NGSS experts, indicating some uncertainty about the critical nature of all components within LPs. This points to potential overreach in the current LPs, suggesting that some elements may not be essential for meeting the performance expectations and could be streamlined or eliminated. In terms of the Sufficiency of Evidence, the Assessment group's score of 3.3 raises concerns about whether the evidence statements adequately support claims of student proficiency. This feedback suggests that the evidence provided may not be sufficiently comprehensive or robust, necessitating enhancements to better support student assessments. Additionally, the Integration of Knowledge was rated slightly lower by the NGSS group at 4, implying that while the integration of 165 knowledge generally meets educational standards, there is room for improvement in how effectively three-dimensional learning is incorporated into the LPs. These insights were taken to thematic analysis. Through the analysis, several themes emerged. The themes were also organized into Table 4-23 below. 
Refine LPs to align with PE

Concerns regarding the appropriateness of the scope of the LPs were often raised, highlighting that LP3 and LP4 included concepts beyond the intended scope of the PE. M noted, "Adaptation and environmental changes are not addressed in the PE," while C observed that "LPs 3 and 4 introduce DCI elements outside the PE," indicating a need for refinement to align these LPs more closely with the objectives of the PE. C further explained, "LP1 and LP2 accurately represent the proficiencies for the PE, with LP2 covering the entire scope. However, LPs 3 and 4 introduce unnecessary elements, such as adaptations resulting from selective pressures, which are beyond the third-grade curriculum." He added, "LP3 delves into adaptations that aid survival, more suitable for higher grades, and LP4, aligned with 3-LS4-4, focuses on environmental impacts, exceeding the intended PE focus." This trend of overreach was supported by J's comment that "LP4's predictions about environmental impacts extend beyond the PE's scope." Similarly, SS noted, "The detail in LP2 matches the original PE well, indicating its suitability, whereas LP3's adaptation content aligns with eighth-grade standards." NGSS expert E critiqued the excessive scope, saying, "LPs sometimes exceed necessary proficiencies. For instance, understanding 'adaptation mechanisms' isn't required; students only need to argue about organisms' varying survival likelihoods, such as comparing different aquatic species in Lake Michigan. The current focus on adaptation in LP3 is unwarranted." He also mentioned that the level of evidence required, as stated in the LPs, is often unrealistic for classroom settings. These critiques underscore the importance of tailoring content to be age-appropriate and directly aligned with PE goals. Adjustments should include scaling back advanced topics and simplifying explanations to ensure they are accessible to third graders, including English Language Learners. This approach will enhance clarity and relevance, ensuring the LPs effectively meet the educational needs at the intended grade level.

Redefine evidence statements for enhanced clarity and to fill gaps

The evidence statements associated with the LPs need substantial refinement to ensure clarity and age-appropriate alignment. Sm highlighted concerns with Evidence Statement 1, which suggests that "Students collect and present specific evidence regarding the survival rates and adaptation mechanisms of organisms within varying habitats." He was concerned about the clarity of the statement and questioned the realism of students "collecting" data, suggesting instead that they might be "identifying" data that qualifies as evidence. For Evidence Statement 2, which states, "Students formulate clear claims regarding which organisms can thrive, survive less well, or perish in particular habitats, grounding their assertions in gathered evidence and understanding of habitat-organism interplay," Sm criticized the vague language and called for more specific discussion about the organisms' needs relative to their environments. He appreciated the specificity in Statement 3 but pointed out that its focus on "adaptation" aligns with eighth-grade standards rather than third grade, indicating a misalignment with the intended curriculum. E also noted issues with the scope of the evidence described in these statements, particularly that the complexity of gathering such detailed evidence might not be feasible in many classrooms.
Additionally, E highlighted that the term "adaptation" in Evidence Statement 3 does not align with third-grade expectations. C echoed these sentiments, noting that while the evidence statements address the three dimensions outlined in the LPs and are collectively obtainable, the breadth of evidence required sometimes exceeds the scope intended for the LPs. For instance, he pointed out that requiring students to compare different habitat scenarios goes beyond the narrow habitat focus expected at this grade level. C also identified a missing component in the SEP for grades 3-5, which includes critiquing explanations—a critical thinking skill not currently reflected in the evidence statements or the corresponding LPs. This comprehensive feedback highlights the need for more precise, age-appropriate adjustments to the evidence statements to ensure they effectively support the intended learning outcomes without overreaching the PE. Refine the LPs and evidence statements for better accessibility and understanding The integration of the three dimensions within the LPs has been well-received, exemplified by J's commendation of LP2 for effectively demonstrating this integration. However, the clarity and 167 accessibility of the LPs and evidence statements remain critical areas for improvement. E from the NGSS expert group raised concerns about the use of complex language in the LPs, noting that terms like "succinct explanations" could be challenging, particularly for young learners and English Language Learners (ELLs). She expressed doubts about ELLs' ability to produce clear and succinct claims, highlighting the subjectivity of such requirements. Further, E critiqued the alignment of the DCI, suggesting a misunderstanding in the AI's interpretation of the PE, particularly with regards to adaptation. She pointed out that the essential idea is about animals meeting their needs in supportive environments, not adaptation per se. He also noted that the AI overlooked the simplicity intended in the PE, which is designed to be universally applicable across various educational settings. She also highlighted gaps in the practical implementation of the LPs, such as the unrealistic expectation for all activities to be conducted in a single well-known location like a schoolyard. She argued that the AI's design of the PE did not adequately consider the logistical and contextual realities of typical third-grade classrooms in the U.S. Furthermore, experts like Cn and Co emphasized the need for explicit instructions on evidence collection and analysis. Cn observed that the LPs lacked detailed guidelines for analyzing data, which is crucial for supporting students' arguments with evidence. Co added that the guidelines on how students should gather evidence were insufficiently clear, underlining the need for detailed and actionable instructions to aid students in their investigative processes. These insights call for a revision of the LPs and evidence statements to ensure they are not only aligned with the NGSS's 3D approach but also tailored to be clear, accessible, and practical for implementation in diverse educational environments. This includes simplifying language, clarifying expectations, and providing concrete, context-appropriate guidelines that accommodate the capabilities and realities of third-grade students, especially ELLs. 168 Table 4-23. 
Integrated Analysis Results for LPs and Evidence Statements for PE 3-LS4-3

Theme: Align LPs with PE to Ensure Age-Appropriate Content. Key points: Concerns were raised about LP3 and LP4 including concepts beyond the intended scope of the PE, such as advanced adaptations not suitable for third grade. Experts noted that LP1 and LP2 align well with the PE, but LP3 and LP4 introduce unnecessary complexity. The grain size of LP2 is too similar to the original PE. Recommendations: Refine LP3 and LP4 to eliminate advanced concepts not required at the third-grade level. Focus on simplifying content to ensure it is age-appropriate and directly aligned with the PE. Avoid overreach, especially for "adaptations." Further unpack LP2, especially focusing on unpacking the meanings of survive well, not well, and not at all.

Theme: Refine Evidence Statements for Clarity and Educational Relevance. Key points: Evidence statements were criticized for their lack of clarity and realism in expectations. Concerns include the feasibility of students collecting data versus identifying data, and the vague language that doesn't specify organism needs in relation to environments. Also, there was a misalignment with grade-level standards, particularly with the use of the term "adaptation," which is more suited to eighth grade. Recommendations: Revise evidence statements to be more specific and clear, ensuring they are age-appropriate. Replace "collecting" with "identifying" to better reflect realistic classroom activities. Clarify and specify the interplay between organisms and their habitats to enhance understanding and relevance. Exclude advanced terms like "adaptation" that align with higher educational standards.

Theme: Enhance Accessibility and Clarity in LPs and Evidence Statements. Key points: Integration of the three dimensions within LPs is well-received, yet the use of complex language and unrealistic logistical expectations highlights a need for simplification and practical adjustments. Concerns about the difficulty for ELLs to produce clear and succinct claims were noted, alongside issues with the practical implementation of evidence collection guidelines. Recommendations: Revise LPs and evidence statements to simplify language and reduce complexity, making them more accessible, especially for ELLs. Ensure that the instructional materials and tasks are feasible within the common logistical and contextual boundaries of third-grade classrooms. Provide clear, actionable guidelines for evidence collection and analysis to support students in their learning processes effectively.

4.2.4.2 Expert Feedback Analysis for PE: 3-LS4-3 Assessment Task 1

Figures 4-18 and 4-19 represent Task 1 and its corresponding response for LP2 of 3-LS4-3. Expert panels were provided with the same review protocol reported above. Figures 4-20, 4-21, and 4-22 present the heatmap visualizations of expert ratings on Task 1.

Figure 4-18. Task 1 for LP2 of 3-LS4-3
Figure 4-19. Exemplar response for task 1 for LP2 of 3-LS4-3
Figure 4-20. Heatmap of the expert ratings on the Task 1 item stem for PE 3-LS4-3

The expert evaluations for the Task 1 item stem reveal critical concerns and strengths. The NGSS group's feedback was particularly notable, emphasizing significant deficiencies in language complexity and information processing, which were underscored by low scores of 1.5 in both Engagement Interest and Visual Comprehension. These ratings highlight serious issues regarding the item stem's capacity to effectively engage students and the clarity of visual aids necessary for comprehension.
While the terminology used was consistently rated highly across all expert groups, affirming its clarity, the effectiveness of visual aids was inconsistent, prompting suggestions for more descriptive captions to enhance understanding. Engagement assessments showed variability, with some groups noting the item stem's potential to captivate students and connect the material to real-life and three-dimensional learning, yet indicating that there is room to boost the stem's overall engagement appeal. Additionally, cultural sensitivity was rated lower, especially by Equity experts, pointing to the item stem's limitations in addressing the diverse backgrounds of all students. Figure 4-21. Heatmap of the expert ratings on the Task 1 item prompt for PE 3-LS4-3 172 The evaluations highlight critical areas of concern particularly in the dimensions of Scaffolds and Language age-appropriateness, where notably low scores were observed. The NGSS group's feedback was especially critical, giving the lowest score in the Scaffolds category, suggesting that the scaffolds provided may not adequately help students break down the complexity of the task. This points to a potential disconnect between the scaffolding support and the students' ability to engage with and understand the content effectively. Language sentence structure received lower scores from teacher experts, indicating possible issues with the clarity and appropriateness of the vocabulary used. Moreover, low scores in 3D Integration from groups such as the NGSS suggest that the prompt may not fully align with the integrated proficiencies expected in the LP2. Figure 4-22. Heatmap of the expert ratings on the Task 1 exemplar response for PE 3-LS4-3 The heatmap reveals a range of scores, with particular attention needed in the dimension of 3D integration. The NGSS group provided a low score of 2, indicating that the responses may not adequately integrate the required 3D aspects. This low score suggests a critical need for improving how the responses 173 demonstrate interconnected scientific ideas and practices according to NGSS standards. In contrast, the evaluations for Alignment with Evidence Statements were generally favorable, with the Science expert group rating it at 3.6. This indicates that the exemplar responses reasonably reflect the necessary evidence statements, although there is room for further alignment to fully meet the expectations. Language appropriateness in the responses also varied, with the Teacher group giving a lower score of 2, highlighting concerns over whether the language used is direct and comprehensible enough for students. This feedback points to a need for simplifying the language or improving explanations to ensure that students can easily understand and engage with the content. Diving into the explicit feedback, several themes were identified for the expert feedback on task 1. Enhancing engagement with real-world task scenarios Experts from various fields stressed the need to make educational tasks both engaging and relevant. They suggested adding lively behaviors of squirrels to the tasks to ensure they relate to students' experiences and reflect accurate data, keeping students interested. For instance, NGSS experts noted that showing squirrels searching for homes is relevant, particularly for students familiar with squirrels living in trees. They suggested enriching the task by showing squirrels jumping and interacting with their environments. Assessment experts also emphasized the need for more captivating content. 
C noted that the task relates well to students' experiences with parks and animals, suggesting that focusing on different parks rather than areas within a single park could make the task more concrete and engaging. Sm recommended directly addressing how animals adapt to their environments to improve the task’s relevance. Science content experts mentioned that squirrels are a common sight across the U.S., and most students find them interesting. They recommended adjusting the animals studied to better reflect the students' local wildlife, which could make the tasks more engaging. Sa proposed using characters like Pokémon to make activities more fun for young learners. This feedback highlights the importance of creating educational tasks that are engaging and closely connected to students’ real-life experiences, ensuring a richer and more meaningful learning experience. 174 Simplifying language that aligns with grade-level language ability Experts across various fields underscored the critical need for using clear and straightforward language to ensure educational tasks are accessible for third graders. NGSS experts advocated for consistent terminology, recommending more direct phrases like "to help their investigation" instead of "to aid their investigation," and clearer labeling such as "Squirrel observation data table." They also suggested replacing phrases like "support your position" with "support your choice" to simplify communication. Assessment experts emphasized the necessity of using simpler vocabulary to aid comprehension. They identified complex words such as "aid," and "equipped " as overly challenging for third graders, recommending they be replaced with simpler alternatives. P noted the importance of shortening long sentences to make them easier for young learners to understand, while C and Sm stressed providing clear definitions of scientific terms to help students, particularly those with lower reading proficiency or from non-English-speaking backgrounds. Science content experts concurred, pointing out that the task's language was too complex for third-grade students. They suggested substantial simplification, such as replacing "foliage" and "vegetation" with more straightforward terms. Consistency in terminology was highlighted by H, who advised that "habitat" should be the consistent term used throughout educational materials to avoid confusion. Equity and language experts also emphasized the need for clear and direct sentences, with E commenting on the need to simplify phrases like "to aid their investigation" to "to help their investigation." Co argued that the language in educational tasks should provide an accessible starting point for all students, particularly for those who are multilingual or have lower reading levels. Teacher experts reinforced these points, advocating for the use of simpler terms and shorter sentences to improve understanding. B and Le noted that certain phrases and terms (e.g., “equipped,” “alongside a brief,” “considerations,” “abundant,” and “dense foliage”) used in the tasks were not appropriate for third graders, suggesting more age-appropriate language and clearer instructions for data collection activities. This feedback emphasizes the essential role of using clear and straightforward language to ensure that educational tasks are accessible and comprehensible for third- grade students. 
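Screening of this kind can also be partially automated before draft material reaches the expert panels. The sketch below is illustrative only and is not part of the study's workflow; it assumes the third-party Python package textstat for the grade-level estimate, and the flagged word list simply restates terms the experts identified as too advanced (e.g., "aid," "equipped," "foliage").

```python
"""Rough pre-screen of a draft task stem for third-grade readability.

Illustrative sketch only; assumes the third-party `textstat` package.
The flagged terms restate words the expert panels identified as too advanced.
"""
import re

import textstat

FLAGGED_TERMS = {"aid", "equipped", "foliage", "vegetation", "abundant", "considerations"}


def screen_task_text(text: str, target_grade: float = 3.0, max_words_per_sentence: int = 15) -> dict:
    """Return a simple report on grade level, flagged vocabulary, and long sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    report = {
        "estimated_grade_level": textstat.flesch_kincaid_grade(text),
        "flagged_terms": sorted(t for t in FLAGGED_TERMS if re.search(rf"\b{t}\b", text, re.I)),
        "long_sentences": [s for s in sentences if len(s.split()) > max_words_per_sentence],
    }
    report["needs_revision"] = (
        report["estimated_grade_level"] > target_grade
        or bool(report["flagged_terms"])
        or bool(report["long_sentences"])
    )
    return report


if __name__ == "__main__":
    draft = ("The students were equipped with binoculars to aid their investigation "
             "of squirrels living among the dense foliage of the park.")
    print(screen_task_text(draft))
```

A check of this kind can only serve as a coarse filter; the expert panels remain the arbiters of grade-level appropriateness.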
Enhancing visual clarity in assessment tasks Experts across various disciplines emphasized the critical importance of using clear, accurate, and contextually appropriate visual aids to enhance student understanding and ensure scientific accuracy. The visual aids should closely align with the text and data, offering clear instructions to avoid confusion and make learning tasks more effective and accessible for all students. NGSS experts stressed the necessity for visuals to accurately represent the described phenomena. They criticized some visuals for not adding value and potentially distracting students, such as images showing squirrels on the ground instead of in trees, contrary to the textual description. They underscored the importance of clear instructions and the definition of scientific terms to help students connect visuals with the accompanying text and data tables. Assessment experts noted inconsistencies between some images and the data presented, suggesting that visuals should be both clear and functional. For instance, C and Sm observed that one image showed more squirrels in an open area than indicated by the data table, potentially leading to student confusion. They advocated for visuals that are directly aligned with the data and include clear, directive captions to guide student interpretation. Science content experts and equity and language experts proposed using multiple images to accurately represent different scenarios described in the text, ensuring that visuals are not only scientifically precise but also culturally inclusive. For example, Cn suggested adding captions to enhance clarity, especially for students who may not be familiar with the subject matter. Teacher experts also highlighted the importance of visual accuracy. Le criticized images that misrepresented the data by showing squirrels in inappropriate settings, suggesting adjustments to better reflect the factual content or employing several images to depict varying habitats accurately. B recommended enhancing visual aids with captions like "Squirrel observation trip" to clarify the context and engage students effectively. This collective feedback from experts underlines the need for the assessment tasks to incorporate well-designed visual aids that are not only scientifically accurate but also tailored to support and enhance the learning experience. Enhancing the accessibility of assessment tasks by refining scaffolds NGSS experts emphasized the necessity of straightforward terminology and the alignment of visual aids with textual and data information to reduce confusion and enhance learning. They noted instances where visuals did not accurately represent the described phenomena, suggesting that precise and informative captions could help link visuals to the underlying data effectively. Among the assessment experts, P pointed out the importance of labels and detailed captions to clarify visual aids, enhancing students' ability to connect these images with textual explanations. Science content experts remarked on potential discrepancies between images and data, which could lead to confusion. They recommended using multiple images to accurately represent different aspects of the data discussed, ensuring that all visual representations are scientifically accurate and aligned with educational goals. Additionally, equity and language experts advocated for the use of diverse and relatable visuals that cater to a broad range of student backgrounds, ensuring inclusivity in educational materials.
Teacher experts underscored the need for visual aids that accurately match the data, suggesting adjustments to images to better align with the educational content and the use of captions to provide context and enhance understanding. The integration of effective scaffolding, including clear instructions, consistent terminology, and supportive visual aids, is essential. These elements help break down complex tasks into manageable parts, enabling students, especially English language learners (ELLs) and those with lower reading proficiency, to grasp and engage with the content more effectively. This tailored support is critical for fostering an accessible and inclusive learning environment. Enhancing task relevance and inclusivity Inclusivity and cultural sensitivity were central themes in the feedback from various expert groups. They emphasized the need to use examples and language that resonate with the diverse experiences and backgrounds of all students, ensuring tasks are accessible to multilingual learners and those with lower reading levels. Experts suggested incorporating more familiar animals and environments to enhance relatability and engagement. NGSS experts recommended making the tasks more inclusive by exploring various environments, providing context on the relevance of parks and animals like squirrels. They stressed the importance of connecting students' prior experiences with the content presented in the tasks. Assessment experts advocated for using different parks as focal points to make scenarios more tangible and engaging, while also aligning more closely with the PE through explicit discussions on adaptation mechanisms. Science content experts noted the potential irrelevance of tasks for students in regions without squirrels, suggesting the inclusion of universally familiar animals to ensure no student feels alienated. Cn highlighted that students in extreme urban or rural settings might find tasks centered around uncommon local flora and fauna less applicable. Equity and language experts focused on simplifying language and using clear visual aids to make tasks more engaging and comprehensible. They proposed modifications to the task instructions and content to make them clearer for younger students, suggesting the use of a checklist format to clarify expectations. Engagement experts advocated for adapting the content to reflect the animals and environments that are familiar to the students' own geographical backgrounds, arguing this would make the tasks more inclusive and engaging. They pointed out that assuming familiarity with squirrels and parks might exclude students who lack such experiences. Teacher experts echoed these concerns, emphasizing the need to adapt educational tasks to reflect the diverse environments and experiences of students. They suggested that students who have never visited parks or seen squirrels firsthand would find such tasks less meaningful, advocating for the use of more relatable and accessible content. This collective feedback underscores the importance of designing educational tasks that are not only scientifically accurate but also culturally sensitive and inclusive, catering to the diverse educational needs and backgrounds of all students. Table 4-24 below synthesizes the analyses of expert feedback on Task 1. Table 4-24.
Integrated Analysis Results for Task 1 of PE 3-LS4-3 Theme Key Points Recommendations Enhancing Engagement with Real- World Scenarios Experts emphasized adding behaviors of squirrels to relate to students' real-life experiences and maintain engagement. They suggested using lively behaviors and familiar animals to make the tasks more engaging and relevant. NGSS experts specifically noted the relevance of squirrels in tree environments and proposed including dynamic interactions like jumping. Assessment experts suggested using different parks to make scenarios more tangible. 178 Include dynamic aspects of animal behavior to enrich tasks. Adapt the animal subjects to reflect the local wildlife familiar to students' geographical backgrounds, making scenarios more relatable and engaging. Use characters or elements like Pokémon to add fun and intrigue for younger learners. Focus on different parks rather than areas within a single park to provide concrete, engaging content. Table 4-24 (cont’d) Simplifying Language for Third Graders Consistent and straightforward language is crucial. Experts across fields highlighted the need for clear language and terminology suitable for third graders. Complex terms and phrases like "aid," "equipped," and "foliage" were noted as problematic. The importance of breaking down long sentences and providing clear definitions was emphasized, particularly for students with lower reading proficiency or from non-English- speaking backgrounds. Enhancing Visual Clarity in Assessment Tasks Clear, accurate, and helpful visual aids are essential for supporting student understanding and ensuring scientific accuracy. Experts noted that visuals must align with the text and data and provide clear instructions. Inconsistencies between images and data, such as showing squirrels in incorrect settings, were highlighted as potentially confusing. Refining Scaffoldings to Enhance Accessibility Effective scaffolding is key to helping students navigate complex tasks. Experts stressed the importance of clear, consistent, and supportive scaffolding to aid comprehension and engagement. This includes ensuring that terminology and visual aids are straightforward and align with the textual and data information provided in the tasks. Use simpler language and terminology that third graders can easily understand. Replace complex phrases with more direct alternatives, such as changing "to aid their investigation" to "to help their investigation." Ensure terminology consistency throughout the educational materials, using terms like "habitat" uniformly to avoid confusion. Provide clear, concise instructions and definitions to aid comprehension, especially for multilingual learners and those with lower reading levels. Ensure that visual aids accurately represent the described phenomena and align closely with the text and data. Use multiple images to represent different scenarios accurately, and include captions to enhance understanding and provide context. Adjust images to reflect factual content accurately and employ visuals that are both scientifically precise and culturally inclusive. Provide clear, directive captions to aid interpretation and ensure that visual aids are directly supportive of the educational content. Implement straightforward terminology and align visual aids with textual and data information to reduce confusion and enhance learning. Include precise and informative captions to link visuals to underlying data effectively. 
Ensure scaffolds are clear and directly supportive of the content, reflecting accurate data to avoid misconceptions. Provide detailed captions to clarify visual aids, enhancing students' ability to connect these images with textual explanations. Table 4-24 (cont'd) Enhancing Task Relevance and Inclusivity Inclusivity and cultural sensitivity are crucial for making educational tasks accessible and engaging for all students. Experts suggested using examples and language that resonate with students' diverse experiences and adapting content to include familiar animals and environments. The importance of connecting students' prior experiences with the content was emphasized, as was the need to use clear visual aids and simple language. Adapt educational tasks to reflect diverse environments and experiences, using familiar animals and settings to ensure no student feels alienated. Simplify language and use clear visual aids to make tasks more engaging and comprehensible. Provide an accessible starting point for all students, particularly for multilingual learners and those with lower reading levels. Adjust tasks to include animals and environments familiar to students' backgrounds to enhance engagement and inclusivity. 4.2.4.3 Analysis of Expert Feedback on Assessment Task 2 Figures 4-23 and 4-24 present Task 2 and its corresponding exemplar response designed for LP2 of 3-LS4-3. Figures 4-25, 4-26, and 4-27 provide heatmaps of the expert feedback on the item stem, item prompt, and exemplar response for Task 2 for LP2 of 3-LS4-3. Figure 4-23. Task 2 for LP2 of 3-LS4-3 Figure 4-24. Exemplar response for Task 2 for LP2 of 3-LS4-3 Figure 4-25. Heatmap of the expert ratings on Task 2 item stem for LP2 of PE 3-LS4-3 Figure 4-25 indicates several areas of concern that need addressing. Notably, Visual Comprehension received the lowest scores, with the NGSS group rating it at 1 and the Assessment group at 1.3. These scores indicate significant issues with the visual aids used in the item stem, suggesting that they do not effectively support student comprehension and may lack necessary detail or clarity. Language Complexity also showed variability, particularly in language reading level, where the NGSS group provided a low score of 2. This feedback suggests that the language used might not be accessible to all students, requiring simplification and clarification to ensure it is appropriate for the target grade level. Cultural Sensitivity received moderate scores, with the teacher group noting that the item stem might not fully resonate with or be inclusive of all student demographics. The Comprehension Consistency dimension received higher ratings, particularly from the NGSS expert group, which rated it at 5. This suggests that the terminology used in the item stem is consistent and clear, aiding student understanding. Figure 4-26. Heatmap of the expert ratings on Task 2 item prompt for LP2 of PE 3-LS4-3 The evaluations reveal several critical areas of concern and strength. The lowest scores were observed in the Scaffolds dimension, where the NGSS group rated it at 1, indicating significant issues with how scaffolds are used to support student understanding. This suggests that the scaffolds provided may not adequately help students break down the complexity of the task, pointing to a need for more effective scaffolding strategies.
Language sentence structure also showed variability, with the Assessment group providing a score of 2.7, suggesting that the language might not be fully accessible to all students. In contrast, the Comprehension Information dimension received high scores, particularly a 5 from the NGSS group, indicating strong alignment with what is being described in the scenario and clear focus on the essential information students need to respond to the prompt. Similarly, the dimension of 3D Clarity received favorable reviews, with scores of 4.3 and 4.4 from the Assessment and Science expert groups. Engagement metrics were rated highly by the Engagement expert group. Figure 4-27. Heatmap of the expert ratings on Task 2 exemplar response for LP2 of PE 3-LS4-3 The evaluations highlight significant concerns in the 3D Integration and Alignment with Evidence Statements dimensions, with the NGSS group rating both at 2. These low scores indicate that the responses do not adequately integrate NGSS's three dimensions of learning or align with the necessary evidence statements, suggesting a need for improvement. Language age-appropriateness showed variability, with the Assessment group rating it at 2.7, indicating that the language may not be clear or appropriate for the target grade level, necessitating refinement for better clarity and accessibility. On a positive note, the Engagement dimension received high scores, particularly a 4.7 from the Engagement expert group, showing that the responses are engaging and relevant to students' lives. The qualitative analysis revealed similar themes. Balancing engagement with realism in assessment tasks A key observation from all expert groups was the critical balance between engagement and realism in Task 2. Among NGSS experts, E found the task particularly engaging due to the distinctive "magic" sunflower element, which was highly rated for engagement. However, concerns were raised about its realism and alignment with NGSS standards, which emphasize real-world environments. E critiqued, "The sunflower picture is extraordinary. Not scientific, a different planet," highlighting the importance of using scientifically accurate visuals to avoid misleading students. Assessment experts like C also pointed out the potential confusion caused by the oversized sunflower image, noting that it detracted from the task's narrative coherence. Science content experts acknowledged the task's appeal, with J stating, "The phenomenon is compelling and relatable," yet emphasized the need for visuals that are both engaging and realistic to prevent student confusion. Teacher expert B appreciated the task's ability to engage students but cautioned that the large sunflower could cause confusion, suggesting the addition of a clarifying caption to better integrate the visual with the educational content. This feedback underscores the need for educational tasks to not only be attractive but also accurately reflect scientific standards and real-world scenarios. Enhancing language clarity and consistency in assessment tasks All expert groups highlighted significant concerns regarding the clarity and consistency of language in educational content, emphasizing the necessity for simplicity and adherence to scientific standards. NGSS experts, particularly E, criticized the use of terms such as "flourish" and "struggling" for third graders, citing their lack of scientific rigor and appropriateness.
The consistency of terminology was 185 rated as inadequate, receiving a score of 3 out of 5, prompting recommendations for simpler, scientifically accurate terms. Assessment experts like C pointed out the complex and inconsistent language usage, identifying terms such as "flourish," "exposure," and "moisture" as potentially confusing and overly advanced for third graders. Science content experts, including J and H, agreed, advocating for the simplification of vocabulary and sentence structures to align better with students' reading levels and improve comprehension. Teacher experts, such as BF and Le, emphasized the importance of straightforward, concise instructions to enhance comprehension and engagement. B noted that while most sentences were well-constructed, some words required replacement to better accommodate third-grade understanding. Equity and language experts like Sc further stressed the essential role of using grade- appropriate vocabulary to prevent confusion and ensure inclusivity, making educational tasks accessible and understandable for all students. This collective feedback underscores the need for careful language selection in assessments to support effective learning at the third-grade level. Enhancing inclusivity and cultural relevance in assessment tasks NGSS experts expressed concerns that the task's school garden setting might not resonate with students from urban environments who are unfamiliar with gardening. To enhance engagement, assessment experts, including P, suggested that the scenario could be more captivating with a clearer, more relatable question. Engagement experts, like Sa, recommended adapting the scenario to include plants that are indigenous to students' local environments, thus increasing its cultural relevance and inclusivity. Equity and language experts, such as Co, underscored the need for diverse visual representations and clearer distinctions in the task's data presentation. They pointed out inconsistencies in describing plant health and recommended labeling areas as "Spot A" and "Spot B" to better delineate different garden sections. Teacher experts highlighted the critical need for culturally relevant and diverse learning experiences, noting that students in urban areas might find the gardening scenario less applicable, due to their limited exposure to such environments. These insights from various expert groups underline the necessity of designing tasks that are not only scientifically accurate but also broadly accessible and culturally sensitive, ensuring all students can engage with and benefit from the learning experiences 186 provided. Enhancing assessment tasks through focused variables and effective scaffolding Experts consistently emphasized the strategic use of key variables in enhancing assessment tasks. NGSS experts expressed concerns about the task's complexity, attributing it to the inclusion of multiple variables such as sunlight, soil moisture, and soil type. They argued that this complexity could overwhelm and confuse students. E specifically suggested reducing the number of variables, advocating for a focus on a single primary variable to enhance clarity. Similarly, assessment experts C and Sm observed that the multiple variables muddled the concept of "habitat," as the task had placed sunflower sections within the same garden environment. They recommended a design that contrasts distinct environments, thereby reducing ambiguity and enhancing student understanding. 
Science content experts, including J and H, stressed the importance of using realistic and instructive visual aids aligned with the data to prevent misconceptions. Teacher experts also emphasized consistent and accurate data presentation as crucial for helping students draw clear and meaningful conclusions. B noted the critical need for consistently using scientifically accurate terminology throughout the task. The discussion further highlighted the role of effective scaffolding in supporting student comprehension. Assessment experts C and Sm lauded the simplification of scientific terms but suggested further refinement to clarify the language. J commended the structured approach of the task, which methodically guides students through the scientific inquiry process, thereby enhancing their inquiry skills. Engagement experts noted the benefits of specific scaffolding elements, such as reminders about plants' needs, which effectively guide student learning. Teacher experts advocated for clear and direct scaffolding to improve learning outcomes and comprehension. They urged that instructions within the task be detailed and explicit, particularly for more complex sections, to ensure that students can fully engage with and understand the content. Collectively, these insights point to the necessity of carefully designing assessment tasks that focus on key variables and incorporate well-planned scaffolding strategies to foster a comprehensive understanding among students, especially in the context of science education. Table 4-25 presents the summary. 187 Balancing Engagement with Realism in Assessment Tasks Enhancing Language Clarity and Consistency in Assessment Tasks Enhancing Inclusivity and Cultural Relevance in Assessment Tasks Table 4-25. Integration of Findings for Task 2 of PE 3-LS4-3 Theme Key points Recommendations Experts noted the importance of balancing engagement with realism. NGSS experts found the "magic" sunflower element engaging but criticized its lack of realism. EM highlighted the sunflower as "extraordinary" and not scientifically accurate. Assessment and science content experts like C and J noted the oversized sunflower could cause confusion and detracted from realism. Teacher expert B suggested adding a caption to integrate the visual with the educational content better. Ensure educational tasks are engaging yet accurately reflect scientific standards and real-world scenarios. Use scientifically accurate visuals and align narrative coherence to prevent confusion. Add clarifying captions to integrate educational content effectively. All expert groups emphasized the need for clear and consistent language. NGSS experts like E criticized terms like "flourish" and "struggling" for lacking scientific rigor. Assessment experts identified complex terms as confusing for third graders. Science content and teacher experts advocated for simplifying vocabulary and sentence structures to match student reading levels and enhance comprehension. Equity and language experts emphasized the use of grade-appropriate vocabulary to ensure inclusivity and accessibility. Experts highlighted the importance of inclusivity and cultural relevance. NGSS experts noted the school garden setting might not resonate with students from urban environments. Engagement experts suggested adapting scenarios to include local plants. Equity and language experts emphasized the need for diverse visual representation and clearer data presentation. 
Teacher experts pointed out the critical need for culturally relevant and diverse learning experiences, especially for students in urban areas with limited exposure to gardening. Enhancing Assessment Tasks through Focused Variables and Effective Scaffolding Experts consistently emphasized the strategic use of key variables and effective scaffolding. NGSS experts expressed concerns about task complexity due to multiple variables. They suggested focusing on a single primary variable for clarity. Science content experts advocated for realistic visual aids aligned with data. Assessment and teacher experts noted the importance of precise language and clear scaffolding to support student comprehension. 188 Simplify vocabulary and sentence structures to align with third- grade reading levels and scientific accuracy. Use straightforward, concise instructions to enhance comprehension and engagement. Replace complex words with simpler alternatives and ensure terminology consistency to support effective learning. Design tasks that are culturally sensitive and inclusive, using scenarios and visuals that resonate with diverse student backgrounds. Adapt scenarios to reflect local environments and include culturally relevant plants and animals. Label data clearly and use visuals that represent diversity effectively. Simplify tasks by focusing on one primary variable to enhance clarity and comprehension. Use visual aids that are realistic and aligned with data. Provide clear and detailed instructions, especially for complex tasks. Incorporate effective scaffolding strategies that guide students through scientific inquiries and enhance learning outcomes. 4.2.5 Cross-Case Synthesis: Summary of Expert Feedback on Refining Knowledge-In-Use Assessments There are several major themes emerging after analyzing the experts’ feedback. These emerging themes provide major guidelines and directions for further revising the products that are aiming for the knowledge-in-use assessment design. I reported the common and significant themes into two major sections, including themes related to LPs and evidence statement design and themes related to task design. 4.2.5.1 Themes Related to LPs and Evidence Statements Design I first present the summary of the overall themes related to the LPs and evidence statement design in Table 4-26. Then, I specify each theme with detailed explanations. Table 4-26. Summary of the themes related to the LPs and evidence statement design Theme Theme Description Example Strategies Ensuring Appropriate Grain Size This theme focuses on designing LPs and evidence statements to accurately reflect the scope and complexity outlined in the PEs. It ensures that the content is neither too broad nor too narrow and adheres closely to NGSS standards. Improving Integration of CCCs, DCIs, and SEPs This theme emphasizes the synergistic integration of Crosscutting Concepts (CCCs), Disciplinary Core Ideas (DCIs), and Science and Engineering Practices (SEPs) to enhance students' understanding of scientific principles through a 3D learning model. 1. Focus exclusively on the ideas specified in the PE, avoiding advanced topics beyond the grade level. 2. Identify elements in LPs that go beyond the scope of the PE. Simplify and align content with grade-level expectations. 1. Ensure that LPs and evidence statements explicitly demonstrate the integration of CCCs, DCIs, and SEPs, showcasing their mutual reinforcement. 2. 
Create content that clearly defines and exemplifies the connections between CCCs, DCIs, and SEPs, making these links explicit in the learning material. Ensuring Consistency in Terminology This theme underlines the importance of using uniform terminology across all educational materials to prevent confusion and ensure a consistent learning experience. It focuses on the structured and logical presentation of information. 1. Establish and use a consistent set of terms across all LPs and evidence statements to avoid confusion and enhance clarity. 2. Align language with learning goals to ensure that terminology supports the understanding of key concepts. 189 Ensuring appropriate grain size of LPs and evidence statements that adhere to PE boundaries This theme emphasizes the importance of designing LPs and evidence statements that accurately reflect the scope and complexity outlined in the PEs. "Grain size" refers to the level of detail and specificity within the LPs and evidence statements, which must be carefully calibrated to ensure they are neither too broad nor too narrow relative to the expectations set by the NGSS standards. The goal is to ensure that each LP and evidence statement fully captures the necessary concepts without introducing extraneous content or omitting crucial information. Adhering to PE boundaries means that the content must directly align with the defined standards, avoiding any extension beyond the intended scope or depth. This precise alignment is crucial for maintaining the integrity and focus of the assessment tasks, ensuring they truly measure what they are intended to measure. This theme also informs the first principle to refine the design. Table 4-27 summarizes the strategies and example prompts to refine the design. The strategies were generated based on the data and then I designed the prompts to instruct the GPT-4 models. Table 4-27. Summary of the strategies and example prompts for coverage of the PE Strategies Description Exemplar Prompt Ensure Content Matches PE Requirements Focus exclusively on the ideas specified in the PE, avoiding advanced topics beyond the grade level. Highlight and Correct Content that Exceeds PE Requirements Identify elements in LPs that go beyond the scope of the PE. Simplify and align content with grade-level expectations. "Generate LPs for PE 3-PS2-1 that focuses solely on the concept of balanced and unbalanced forces. Ensure the content does not extend into advanced topics like gravitational fields, which are beyond the third-grade curriculum. Provide a clear explanation suitable for third graders." "Review the following LP draft for PE 3-LS4-3: 'Students analyze how environmental changes can lead to plant and animal adaptation.' Identify and list elements in this draft that exceed the scope of third- grade expectations, focusing on the unnecessary inclusion of adaptation mechanisms, and suggest modifications to simplify the content." Address Content that Falls Short of PE Requirements Assess evidence statements to ensure they meet the required understanding as specified in the PE. Revise statements to directly tie to the core concepts and skills outlined in the PE. "Assess this evidence statement for PE 3-PS2-1: 'Students describe how different objects move.' Indicate how this statement falls short of addressing the required understanding of forces and motion as specified in the PE. Propose a revised statement that directly ties object movement to the types of forces acting on them." 
190 Table 4-27 (cont’d) Correct Overreaching Content Revise LPs to remove advanced topics that are not required by the PE. Focus on observable, grade- appropriate properties and processes. "The draft LP includes the analysis of intermolecular forces in water samples. This topic is not required by the PE and is too advanced for the grade level. Please revise the LP to focus on observable properties of water like state changes and buoyancy, which align with the core ideas in the curriculum." Revisiting the Unpacking Documents Correct or refine the unpacking of the PE to ensure alignment with the intended learning goals and appropriate scope. "Revisit the unpacking document for PE 3-PS2-1. Identify and correct any misalignments or overextensions beyond the grade-level expectations. Refine the unpacking to ensure it accurately reflects the core concepts and skills specified in the PE, providing clear and concise guidelines for the development of LPs and evidence statements." Improving integration of CCCs, DCIs, and SEPs The theme "Improving Integration of CCCs, DCIs, and SEPs" emphasizes enhancing the integration of CCCs, DCIs, and SEPs within LPs and evidence statements. This approach aims to deepen students' understanding of scientific principles through a 3D learning model advocated by the Framework for K-12 Science Education (NRC, 2012) and utilized by the NGSS. Such integration ensures that learning not only meets curricular standards but also connects more effectively with real-world applications, making scientific reasoning more intuitive and contextually relevant for students. Experts underscore the necessity of developing LPs and evidence statements that not only cover individual components of CCCs, DCIs, and SEPs but also demonstrate their synergistic interaction. This integration is crucial for providing students with a cohesive understanding of scientific concepts. LPs and evidence statements should clearly articulate the relationships between CCCs, DCIs, and SEPs, clarifying how these dimensions interlink within the assessment materials. Further, it's important to address any gaps in the current integration of these dimensions within LPs to ensure that these elements are seamlessly woven into LPs. The complexity of content integration should be tailored to match the cognitive and developmental stages of the learners, ensuring that the material is both engaging and comprehensible at the intended grade level. Any integrations that are too complex or advanced for the 191 target audience should be simplified, focusing on delivering clear, tangible, and relatable content. Regular review and refinement of unpacking documents are also recommended to ensure they accurately guide the development of integrated LPs and evidence statements (see Table 4-28). Table 4-28. Summary of the strategies and example prompts for 3D integration Strategies Description Exemplar Prompt Comprehensive Integration of Dimensions Ensure that LPs and evidence statements explicitly demonstrate the integration of CCCs, DCIs, and SEPs, showcasing their mutual reinforcement. "Generate an LP for PE 3-PS2-1 that clearly integrates the concept of forces (DCI) with the practice of scientific investigation (SEP) and the concept of cause and effect (CCC). Provide an example that illustrates these connections in a scenario relevant to third graders." 
Explicitly Define Connections Address and Strengthen Integration Gaps Ensure Age- Appropriate Integration Create content that clearly defines and exemplifies the connections between CCCs, DCIs, and SEPs, making these links explicit in the learning material. "Develop an evidence statement for PE 3-LS4-3 that exemplifies how changes in an environment (DCI) affect animal behaviors (CCC) and how students can investigate these changes through data collection (SEP)." Identify areas where the integration of CCCs, DCIs, and SEPs is weak or unclear in existing LPs and revise them to strengthen these connections. "Review the LP for PE 3-PS2-1 focusing on motion and forces. Identify where the integration of CCCs and SEPs could be enhanced to better illustrate the interplay of these dimensions. Propose revisions that enhance this integration." Align the complexity of the integrated content with the cognitive and developmental level of the learners, ensuring it is appropriate for their grade level. "Refine the LPs and Evidence statements to ensure the grade-appropriate level of DCIs, SEPs and CCCs integrations. For instance, the model should be a simple model , and evidence does not need to be sufficient." Rectify Overly Complex Integrations Simplify overly complex integrations that may confuse or overwhelm students, focusing on clear, tangible examples that reflect grade-appropriate learning. "Revise the LP that currently integrates advanced genetic concepts into a third-grade curriculum on plant growth. Simplify it to focus on observable traits (DCI), pattern recognition (CCC), and basic data gathering (SEP)." Revisit Unpacking Documents for Alignment Review and refine the unpacking of standards documents to ensure that they accurately guide the development of integrations among CCCs, DCIs, and SEPs. "Revisit the unpacking document for PE 3-LS4- 3. Ensure that the descriptions accurately reflect how CCCs, DCIs, and SEPs should be integrated for third-grade students, providing a clear framework for developing LPs and evidence statements." 192 Ensuring consistency in terminology and coherence of information The theme is critical for maintaining clear communication and logical progression within educational materials, particularly in LPs and evidence statements. It emphasizes the importance of using uniform terminology to avoid confusion and ensure that students receive a consistent educational experience across different topics. Moreover, coherence in content demands that information is presented in a structured and logical manner, which is essential for students to understand and build upon complex scientific concepts effectively. Experts have suggested several approaches to enhance consistency and coherence. The adoption of a standardized glossary ensures that the same terms are used consistently across all materials, helping students to familiarize themselves with specific scientific language without the added difficulty of synonyms that might appear in different contexts. Structuring information logically allows students to follow the natural progression of ideas, which is crucial for grasping more complex theories and principles. Integrating concepts across various LPs can reinforce knowledge and show the interconnectedness of different scientific areas. Additionally, simplifying complex concepts makes the material more accessible, especially for younger students. Table 4-29 presents the summary of ensuring information coherently. Table 4-29. 
Summary of the strategies and example prompts for information coherence. Strategy Description Exemplar Prompt Standardize Terminology Establish and use a consistent set of terms across all LPs and evidence statements to avoid confusion and enhance clarity. "Ensure that the term 'force' is uniformly used in all LPs related to PE 3-PS2-1, defining it clearly the first time it appears." Align Language with Learning Goals Adjust language to clearly reflect the learning goals and ensure that terminology supports the understanding of key concepts. "Review the evidence statement for PE 3-LS4-3 to ensure that all terms align with the defined learning goals, adjusting language for clarity and educational alignment." Enhance Coherence in Content Ensure that content across LPs and evidence statements logically flows and supports a cohesive understanding of the curriculum. "Create a sequence in LPs for PE 3-PS2-1 that progressively builds on the concept of forces, ensuring a coherent flow that facilitates deeper understanding." Table 4-29 (cont'd) Review and Refine Content Regularly Periodically review LPs and evidence statements to maintain consistency and coherence, updating as necessary to align with evolving educational standards. "Conduct a quarterly review of the LPs for PE 3-LS4-3 to check for terminological consistency and content coherence, making adjustments based on the latest educational research and feedback." Simplify Complex Concepts Break down complex ideas into simpler, understandable components while maintaining the integrity and accuracy of the scientific information. "Simplify the explanation of ecological niches in the LP for PE 3-LS4-3, using straightforward examples and consistent terminology to enhance student comprehension." 4.2.5.2 Themes related to assessment task design Table 4-30 is a summary table of the themes focusing on assessment task design. This table organizes each theme with a description and combines the strategies into a single column for clarity. Following the table, I provide explicit elaborations on each theme. Table 4-30. Summary of themes, descriptions, and strategies for assessment task design Theme Theme Description Strategies Boosting Engagement This theme involves connecting assessment tasks with students' real-life experiences to enhance understanding and retention. Tasks are designed to draw on familiar scenarios or intriguing contexts to increase motivation and engagement. 1. Integrate familiar contexts to make content relevant. 2. Connect concepts to real-world applications. 3. Incorporate interactive elements like hands-on activities or simulations. Enhancing Clarity and Accessibility of Language Focuses on simplifying complex scientific concepts through tailored vocabulary and sentence structures that are age-appropriate, ensuring the language used in assessment tasks is comprehensible for the target student audience. 1. Use age-appropriate vocabulary and structures. 2. Define technical terms clearly. 3. Test and refine for readability. 4. Align language with educational standards. Enhancing Task Clarity and Guideline Precision Centers on providing crystal-clear, straightforward instructions in assessment tasks to eliminate ambiguity, ensuring students understand exactly what is expected of them without confusion, aiding in effective demonstration of understanding. 1. Simplify instructional language. 2. Detail specific actions or steps. 3. Clarify task objectives. 4. Refine and test instructions regularly.
194 Table 4-30 (cont’d) Incorporating Supportive Visuals and Scaffolds Emphasizes the importance of integrating visual aids and scaffolding strategies into assessment tasks to make complex ideas more accessible and understandable, supporting textual information and promoting independent learning. Ensuring Cultural Sensitivity and Accessibility Focuses on designing inclusive and reflective assessment tasks that resonate with students from diverse cultural backgrounds, promoting a more equitable learning environment and enhancing student engagement by incorporating culturally relevant content. 1. Use clear and relevant visuals like diagrams and graphs. 2. Provide structured step-by-step guidance. 3. Utilize interactive visuals for engagement. 4. Tailor scaffolds for varied needs. 1. Include inclusive content selection. 2. Ensure language accessibility. 3. Represent diverse cultures in visuals. 4. Develop culturally relevant scenarios. 5. Implement feedback mechanisms for sensitivity. Boosting engagement through relevant and contextual task design Boosting engagement through relevant and contextual task design is essential for connecting assessment tasks with students' real-life experiences. This method transforms abstract scientific concepts into tangible and relatable challenges, enhancing understanding and retention. By designing tasks that draw on familiar scenarios or intriguing contexts, educators can significantly increase students' motivation to engage deeply with the content. Experts recommend several approaches to refine these tasks to ensure maximum engagement. First, integrating familiar contexts into the tasks helps make the content more relevant, as students can see direct links between their everyday lives and the scientific concepts being taught. Second, connecting these concepts to real-world applications clarifies their utility, boosting students' interest and the perceived value of their learning. Last, incorporating interactive elements into tasks, such as hands-on activities or simulations, makes the learning process more dynamic and engaging, fostering an active learning environment that is both educational and enjoyable (see Table 4-31). 195 Table 4-31. Summary of the strategies and example prompts for engagement Strategy Incorporate Real-Life Scenarios Description Exemplar Prompt Use real-life contexts that students are likely to encounter to anchor the scientific concepts taught. "Design an assessment where students analyze how playground equipment uses forces (PE 3-PS2-1) to function." Connect Concepts to Daily Activities Link scientific ideas to everyday activities to show their practical applications. "Create a task asking students to describe how animals in their neighborhood adapt to seasonal changes (PE 3-LS4-3)." Use Interactive Elements Include components that require active engagement, such as simulations or hands-on experiments. "Develop a simulation task that allows students to manipulate variables affecting the motion of an object on different surfaces." Employ Storytelling Highlight Relevance Craft scenarios as stories to draw students in and make the tasks more engaging. "Write a story-based task where students help a character choose the best materials for building a kite, considering wind forces." Explicitly explain how the science topics students are learning about impact their lives. "Ask students to investigate and present on how understanding of ecosystems can help improve local environmental practices." 
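The exemplar prompts in Table 4-31 are written as natural-language instructions to the model. The sketch below shows, in simplified form, how such a prompt could be issued to a GPT-4 model programmatically; it assumes the OpenAI Python client (v1 interface) with an API key in the environment, and the system message and helper function are illustrative rather than the exact training script used in this study.

```python
"""Send an engagement-focused exemplar prompt to a GPT-4 model.

Illustrative sketch only; assumes the OpenAI Python client (v1 interface)
and an OPENAI_API_KEY environment variable. The system message is not the
exact instruction set used in the study.
"""
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_MESSAGE = (
    "You are an assessment designer creating NGSS-aligned, knowledge-in-use "
    "tasks for third-grade students. Use simple, age-appropriate language and "
    "anchor each task in a real-life scenario."
)


def generate_task(exemplar_prompt: str, model: str = "gpt-4") -> str:
    """Return the model's draft task for a single exemplar prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": exemplar_prompt},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    prompt = ("Design an assessment where students analyze how playground "
              "equipment uses forces (PE 3-PS2-1) to function.")
    print(generate_task(prompt))
```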
Enhancing clarity and accessibility of language in the design of assessment tasks It is pivotal for simplifying complex scientific concepts to ensure that the language used is suitable and comprehensible for the target student audience. This theme emphasizes tailoring vocabulary and sentence structures to be age-appropriate, minimizing the use of technical jargon unless it is necessary and clearly explained within the learning context. Simplifying the language in tasks to match the reading levels of elementary students while ensuring consistency across educational materials is crucial. This careful attention to language not only aids comprehension but also enhances the accessibility of scientific learning for all students. To operationalize this principle in task design, GPT-4 can be instructed to prioritize simplicity in vocabulary and structure during content generation. The process involves creating prompts that explicitly require the avoidance of technical jargon or, if used, ensuring it is adequately defined in a context understandable to young students. For example, a prompt might state: "Generate a task description for PE 3-PS2-1 that explains how objects move, using simple language suitable for third graders without using 196 technical terms such as 'net force.'" The outputs from GPT-4 should be rigorously tested for readability and clarity, with subsequent adjustments based on iterative feedback to ensure that they meet the developmental and cognitive needs of the target age group. Table 4-32 summarizes the strategies and example prompts to refine the design of tasks focusing on language clarity and accessibility Table 4-32. Summary of the strategies and example prompts for language appropriateness Strategy Description Exemplar Prompt Use age- appropriate vocabulary Focus on straightforward language that is easy for students to understand, avoiding complex phrasing. "Generate a task for PE 3-PS2-1 using simple terms to explain the concept of forces acting on stationary objects." Define Technical Terms Clearly Provide clear definitions for any scientific terms used within the task to ensure they are age- appropriate for students. Test and Refine for Readability Continuously evaluate and refine the task descriptions to ensure they are understandable for the intended age group, based on feedback. "Create a task for PE 3-LS4-3 that involves animal adaptations, and include a sidebar that defines 'adaptation' in simple terms suitable for third graders." "Revise this task description for clarity: Simplify the sentence structure and ensure any scientific terms are clearly explained." Align Language with Educational Standards Make certain that the language used in the tasks aligns with the educational standards and learning objectives for the specified grade level. "Review and adjust the language of this task for PE 3-PS2-1 to ensure it conforms to third-grade science standards and is understandable by students at this educational level." Enhancing task clarity and guideline precision It is important to focus on the necessity of providing assessment tasks with crystal-clear, straightforward instructions. This theme centers around crafting tasks in a way that eliminates any ambiguity, thus ensuring that students understand exactly what is expected of them without confusion. The precision of task instructions is critical in guiding students effectively through their responses, aiding them in focusing on demonstrating their understanding rather than deciphering the task requirements. 
By specifying exactly what steps to follow or what concepts to explore, students can easily access the assessment information, and enhance the function of assessment in students’ learning. Experts 197 emphasized the importance of using clear and age-appropriate language in task instructions. They pointed out that complex terms and phrases can confuse students, especially at the elementary level. For instance, E noted that terms like "despite" and phrases like "remains unmoved" are too complex for third graders and suggested using simpler alternatives. Additionally, the experts highlighted the need for consistent terminology and explicit steps in task instructions to ensure clarity. P recommended that instructions should be direct and concise to match the reading levels of young students. Experts also suggested that tasks should include specific actions or steps that students need to follow. For example, detailing procedures and expected outcomes in a step-by-step manner. Regularly refining and testing task instructions based on student feedback was another key recommendation from the experts. They emphasized the need to adjust instructions to ensure they are clear and unambiguous. C and Sm pointed out that refining task instructions based on student responses can help identify and address any areas of confusion. See Table 4-33. Table 4-33. Summary of the strategies and example prompts for clarity Strategy Description Exemplar Prompt Simplify Instructional Language Use age-appropriate, clear and direct language in task instructions to avoid ambiguity and ensure students understand what is required. "Describe how balanced and unbalanced forces affect an object's motion. Use simple language and diagrams to explain." Detail Specific Actions Provide explicit steps or actions students should take to complete the task, guiding them through the process. "List the materials you will use to demonstrate balanced and unbalanced forces, describe the procedure step-by-step, and predict the outcome." Clarify Task Objectives Ensure that the goals and objectives of the task are explicitly stated so students understand the purpose and what they need to achieve. "Explain the role of balanced and unbalanced forces in moving objects. State clearly what students need to demonstrate or explain." Refine and Test Instructions Regularly review and revise task instructions based on student feedback and performance to enhance clarity and precision. "Based on student feedback, revise the instructions for the task to ensure they clearly convey the expected actions and outcomes." 198 Incorporating supportive visuals and scaffolds It is crucial to emphasize the importance of integrating visual aids and scaffolding strategies into assessment tasks. This approach aims to support and enhance textual information, making complex ideas more accessible and understandable for students. Visual aids, such as diagrams, charts, and illustrations, provide concrete examples of abstract concepts, aiding in comprehension and retention. Scaffolding, which includes structured support like guided questions, step-by-step instructions, and checklists, helps students navigate through tasks that might otherwise be too challenging. These tools are crucial for building confidence and promoting independence as students progress in their learning and tackle more complex material. Experts highlighted several key strategies to enhance the use of visuals and scaffolds in assessment tasks. 
They emphasized the need for clear, accurate, and contextually appropriate visual aids that directly support the textual information. NGSS experts stressed the necessity of visuals that accurately represent the described phenomena, noting that some visuals were either irrelevant or potentially confusing. For example, they criticized images showing squirrels on the ground instead of in trees, which did not align with the task's description. They also highlighted the importance of clear instructions and definitions of scientific terms to help students connect visuals with the accompanying text and data tables. Assessment experts pointed out inconsistencies between some images and the data presented, suggesting that visuals should be both clear and functional. C and Sm observed that one image showed more squirrels in an open area than indicated by the data table, potentially leading to student confusion. They advocated for visuals that are directly aligned with the data and include clear, directive captions to guide student interpretation. Additionally, they stressed the importance of providing step-by- step guidance and interactive elements in visuals to engage students actively with the material. Equity and language experts underscored the need for tailored scaffolds that cater to diverse learning needs, ensuring accessibility for all students. They recommended differentiated instruction sheets that include glossaries and cater to varying reading levels. Teacher experts highlighted the importance of visual accuracy and suggested adjustments to images to better reflect the educational content. see Table 4-34. 199 Table 4-34. Summary of the strategies and example prompts for scaffolding Strategy Description Exemplar Prompt Use Clear and Relevant Visuals Incorporate diagrams, graphs, or images that directly relate to and help clarify the task’s concepts. "Show a diagram of balanced and unbalanced forces acting on an object to illustrate how they affect motion." Structured Step-by-Step Guidance Provide a breakdown of tasks into smaller, manageable steps that guide students through the learning process. "Follow these steps to construct your model of a plant cell, starting with the cell wall and moving inward." Interactive Visuals for Engagement Utilize interactive elements in visuals that allow students to engage actively with the material. "Use this interactive map to explore different ecosystems and their characteristics, and note how organisms adapt to their environments." Tailored Scaffolds for Varied Needs Adjust scaffolding techniques to meet the diverse learning needs of students, ensuring accessibility for all. "Make sure each prompt has specific instructions using “building a claim”, “identifying evidence" and “making reasoning.” Feedback Loops for Improvement Integrate opportunities for feedback within tasks, allowing for adjustments and fostering deeper understanding. "Submit a draft of your project for preliminary feedback on your use of scientific terms and concepts, and revise based on the feedback." Ensuring cultural sensitivity and accessibility in task scenarios It is important to focus on designing assessment tasks that are inclusive and reflective of the diverse cultural backgrounds and experiences of all students. This approach emphasizes the importance of using scenarios and contexts in assessments that resonate with students from different cultural perspectives, thereby fostering a more equitable learning environment. 
By incorporating culturally relevant content, educators can increase student engagement and promote a deeper connection with the material. Tasks that consider the varied experiences of students can help prevent cultural bias and ensure that all learners feel represented and valued in the learning process. Experts suggested several strategies for improving cultural sensitivity and accessibility. They emphasized the need to incorporate familiar contexts and real-life applications in assessment tasks to make them more relatable and engaging. For example, experts noted that using culturally diverse and familiar scenarios, such as describing local natural resources, can significantly enhance student engagement. This is supported by experts who highlighted the importance of using scenarios that students can easily relate to, enhancing the relevance and engagement of the tasks. Experts also recommended using language that is age-appropriate, clear, and free from cultural bias, ensuring it is accessible to all students. This includes avoiding technical jargon and culturally specific terms that may not be universally understood. For instance, NGSS experts pointed out the need for tasks to use straightforward language and avoid complex terms that could confuse younger students. Furthermore, the importance of visual aids that accurately represent diverse cultures was highlighted. Experts suggested using images and examples in materials that depict a variety of cultural backgrounds to avoid cultural bias. As one expert noted, including visual representations of diverse characters and settings helps make the tasks more relatable and inclusive for all students. Additionally, scenarios should be relevant to students' daily lives and reflect their diverse experiences to make the tasks more engaging and meaningful. This involves creating tasks that reflect real-life situations familiar to students from different cultural backgrounds, making the learning experience more personal and engaging. Cultural sensitivity in task design not only enhances fairness but also enriches the educational experience by exposing students to different viewpoints and ways of understanding the world. It involves careful consideration of the language, scenarios, and examples used in assessments to avoid stereotypes and biases. Moreover, ensuring that tasks are accessible to students with different abilities and learning needs is crucial for creating an inclusive classroom environment. This can be achieved through feedback mechanisms that allow students to suggest improvements to the cultural representation in tasks, ensuring that the educational materials remain relevant and sensitive to all students. See Table 4-35.
Table 4-35. Summary of the strategies and example prompts for cultural sensitivity (Strategy, Description, Exemplar Prompt)
Inclusive Content Selection: Choose content that reflects a broad spectrum of cultures and experiences. "Describe how different communities use local natural resources to balance forces in engineering solutions."
Language Accessibility: Use clear language, ensuring it is accessible to all students. "Explain how balanced and unbalanced forces affect motion using simple terms."
Representation in Visuals: Include images and examples in materials that depict a variety of cultural backgrounds. "Use illustrations showing children from different cultures engaging in activities that involve balanced and unbalanced forces."
Culturally Relevant Scenarios: Develop scenarios that relate to real-life situations experienced by students from diverse backgrounds. "Create a task where students investigate how animals adapt to their environments in different cultures, focusing on observable traits."
Feedback Mechanisms for Sensitivity: Implement systems for students to provide feedback on cultural relevance and sensitivity. "Provide a feedback form for students to suggest improvements on cultural representation in tasks, such as balancing forces and adaptations."
4.3 RQ3. What Is the Process of Refining AI-Designed Knowledge-In-Use Assessments Based on the Feedback Provided by Human Experts? Whether and How Are the Assessments Revised?
4.3.1 Refinement Model
After analyzing the expert feedback and synthesizing their critical suggestions, I formulated refinement principles and corresponding prompts to guide the revisions of LPs, evidence statements, and assessment tasks. These principles are detailed in Section 4.2.5, which presents the approach both for designing LPs and evidence statements and for the broader context of knowledge-in-use assessment design. A notable change in this phase was the shift from a standard API to a customized GPT environment for the refinement process. The primary reasons for this change were rapid technological advances and the API's limitations in accessing external PDF documents, which often restricted the availability of critical information. To ensure the AI had access to all necessary materials, including the NGSS framework and details of the assessment design process, I opted to use a customized GPT setup. This setup was enhanced by uploading the initial training script and essential reference materials, so that the AI retained knowledge from the first round of training and was well-equipped to process the refinement tasks effectively. Using the principles, I initiated an iterative refinement process with GPT-4. I collaborated with GPT-4 to refine the design products meticulously, one by one. Each refinement or revision adhered to the structured process depicted in Figure 4-28, ensuring that every modification aligned with the established goals and responded effectively to the expert feedback. Figure 4-28. Design product refinement process. In the refinement process, the primary goal is to integrate the expert panel's collective feedback to enhance and refine the knowledge-in-use assessment tasks. This refinement begins by identifying specific areas that require updates, which may include revisions to unpacking documents, LPs, evidence statements, or tasks. The next step involves defining specific task goals and requirements, which are derived from the collective insights gathered during the analysis of RQ2. Following this, the human operator provides a detailed rationale for each suggested revision, explaining why specific changes are necessary to improve the assessment tasks. This explanation is supported by the detailed guidelines for revision that were developed based on the collective feedback from experts and the refinement prompts generated from the RQ2 analysis. The AI models then execute these revisions. Throughout this process, human operators continuously monitor the outputs to ensure they align with the established goals. The AI models are also prompted to reflect and provide explanations on how their generated outputs meet the task goals and requirements.
This dual monitoring process allows for further refinement of the prompts in an iterative manner. Once the revisions sufficiently meet the task goals and adhere to the guidelines, the refinement 203 process concludes with the production of the second-round design products. This iterative approach ensures that the final assessment tasks are well-aligned with expert feedback and educational standards. The refined products were distributed to two distinct expert groups for evaluation and feedback. In addition to the original panel familiar with the AI's role in designing the tasks, I introduced a new group of experts who were not informed about the AI's involvement. This strategy was employed to mitigate any potential biases related to perceptions of AI-generated content. Below, I take two examples to illustrate the refinement process from LPs and evidence statement design stage and task design stage. I also discuss the enhancements observed in the refined products compared to those from the initial design round, demonstrating the effectiveness of this iterative approach in improving the quality and relevance of the assessment tasks. 4.3.1.1 Ensuring the Scope of the LPs and Evidence Statements by Revisiting Unpacking Identify specific places for revisions In revising the LPs for PE 3-PS2-1, the expert panel's feedback highlighted several key areas requiring attention. A primary concern was that the existing LPs did not adequately address the concept of 'unbalanced forces,' a central idea within the PE. For instance, while LP4 reasonably introduces 'non- contact forces'—acknowledging that students often struggle with this concept—it does not align with the PE’s main focus on 'balanced' and 'unbalanced' forces. Moreover, there is a notable gap in addressing the effect of balanced forces on an object in motion, which should result in no change in motion. This is crucial since a common student misconception is that 'no force means no motion,' rather than the correct 'no force means no change in motion.' To illustrate this concept effectively, one proposed example involved a train car moving at a constant speed, which, despite experiencing friction, does not accelerate or decelerate—a practical demonstration of balanced forces at play. Additionally, an NGSS expert criticized the unpacking of the DCI for not thoroughly covering the essential ideas of the disciplinary core elements, pointing to a need for a comprehensive review and redesign of the LPs to ensure they fully encapsulate the key concepts as outlined in the PE. This feedback underscores the importance of 204 revisiting and refining the unpacking of the three dimensions to ensure the LPs accurately reflect the standards and effectively address common misconceptions. Specify task goals and requirements. Based on the feedback, I set up the task goals and requirements, which are to revisit the DCI and SEP unpacking for PE 3-PS2-1 to ensure that the DCI unpacking covers all of the important DCIs that are emphasized in this PE; the unpacking includes the unbalanced forces and the effect of unbalanced forces; ensure the effect of balanced forces that cause the situation of objects in motion but without changes in motion; unpack the non-contact forces and the effect of the non-contact forces; specify the type of non- contact forces should be discussed in the PE, avoiding the magnitude forces; specify the grade boundary of the unpacking; ensuring the terms are used consistently and coherently through the unpacking. 
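To make the refinement loop concrete, the sketch below shows one way goals like these could be issued to a GPT model one at a time, with the human operator reviewing each output before the next goal is introduced. It is a minimal sketch only: it assumes the openai Python client, and the model name, system prompt, and goal texts are abbreviated placeholders rather than the exact scripts used in this study.

```python
# Minimal sketch of the Figure 4-28 refinement loop (illustrative only).
# Assumes the openai Python client; model name, prompts, and goals are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Refinement goals distilled from the expert feedback (abbreviated examples)
refinement_goals = [
    "Revisit the DCI unpacking for 3-PS2-1 so it includes unbalanced forces "
    "and their effects on motion.",
    "Ensure the unpacking states that balanced forces acting on a moving "
    "object produce no change in motion.",
    "Keep terminology consistent and within the grade 3 boundary.",
]

# A single message history carries the design context across turns.
messages = [
    {"role": "system",
     "content": "You assist with NGSS knowledge-in-use assessment design. "
                "Revise the design products to meet each refinement goal and "
                "explain how the revision meets it."},
]

for goal in refinement_goals:
    # Goals are introduced one at a time, each as a new turn in the conversation.
    messages.append({"role": "user", "content": f"Refinement goal: {goal}"})
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    draft = response.choices[0].message.content
    messages.append({"role": "assistant", "content": draft})
    # The human operator reviews each draft here; unsatisfactory drafts are
    # followed up with further prompts in the same conversation.
    print("--- Draft revision ---\n", draft)
```

Carrying the full message history forward in one conversation is what allows later refinement turns to build on earlier ones, mirroring the customized GPT setup described above, in which the first-round training script and reference materials remain available throughout.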
It is important to note that these task requirements and goals were introduced to the GPT model in a gradual manner. This phased approach was strategic, allowing for focused adjustments and ensuring that each aspect of the feedback was meticulously addressed to refine the educational materials effectively. Explain the rationale for revisions. I also explained the rationale of why these revisions need to be made by reiterating the importance and purpose of unpacking, which is to zoom in the larger grain size PE from looking into the three dimensions of scientific knowledge and skills to understand the critical sub-ideas and sub-skills of the DCIs, SEPs, and CCCs to break down the PE into smaller pieces and reorganize them into manageable and assessable grain size for effective science learning and teaching. Then it requires the unpacking to be thorough, comprehensive, and have clear performance expectations for each critical sub- idea and skill that align with the grade level requirements. The reason why the task goals and requirements were set up as above is based on the experts' feedback. Providing explicit guidelines for revisions The refinement process, as outlined in section 4.2.5, involves creating specific prompts based on the principles of refinement and the feedback gathered from expert reviews. For instance, one prompt I developed for refining the unpacking of the DCIs related to PE 3-PS2-1 was particularly focused on 205 ensuring clarity and educational appropriateness. The prompt read: "Please revisit the DCI unpacking of PE 3-PS2-1, ensuring that the unpacking aligns with the 3rd-grade boundary. For the unpacking of the DCIs, it appears not to be explicit enough. For example, under the major DCI of Force and Motion, you should list sub-major ideas including the following: a. force—defining it, discussing its characteristics like strength and direction, explaining balanced and unbalanced forces; b. motion—defining motion, emphasizing aspects like speed and direction, discussing factors affecting motion; and c. the impact of different forces on an object's motion. You don't need to use the terms I specified here but ensure that terms are consistent throughout your elaboration and meet the grade boundary." This directive aims to ensure that the unpacking comprehensively covers the range of sub-ideas within the DCI and is appropriate for the grade level. Additionally, I instructed the GPT models to simplify the language and ensure the terms are used consistently and coherently throughout the unpacking process. Table 4-36 presents the comparisons between the first and second rounds of DCI unpacking. Table 4-36. Two rounds DCI unpacking for PS2.A: Force and Motion 1st round DCI unpacking 2nd round DCI unpacking 'Given the provided context and requirements for unpacking the first DCI (PS2.A: Forces and Motion) for the focal performance expectation 3-PS2-1, here's a detailed unpacking tailored for grades 3-5: Aspect 1: Elaborating the DCI Elements and Sub- ideas DCI PS2.A: Forces and Motion focuses on understanding how forces affect the motion of objects. This includes:** **Force Characteristics:** Forces are push or pull actions that can cause an object to start moving, stop moving, or change its motion's direction. Each force acts on a particular object with a specific strength (magnitude) and direction. **Interaction of Forces:** Multiple forces can act on an object simultaneously. 
When the forces are balanced (sum to zero), the object's motion doesn’t change; it remains at rest or continues moving at a DCI: PS2.A Forces and Motion 1. Force: ○ Definition: A push or pull on an object. 2. Characteristics of Force: ○ Strength: How hard the force pushes or pulls. ○ Direction: The way the force pushes or pulls. 3. Balanced Forces: ○ Definition: Forces that are equal in size and opposite in direction, canceling each other out, resulting in no change in motion. 4. Unbalanced Forces: ○ Definition: Forces that are not equal and opposite, causing a change in the object's motion (speeding up, slowing down, changing direction). 5. Motion: 206 Table 4-36 (cont’d) constant speed in a straight line. Unbalanced forces (those that do not sum to zero) cause changes in the object's motion, either accelerating it or altering its direction. **Newton’s Third Law:** The mutual interactions between two objects can be described in terms of equal and opposite forces. When one object exerts a force on a second object, the second object simultaneously exerts a force equal in magnitude and opposite in direction on the first object. Aspect 2: Defining Boundaries At this grade level, the focus is on: Qualitative and conceptual explanations of how balanced and unbalanced forces affect motion, without delving into quantitative analysis or calculations of force magnitudes. Observations and descriptions of motion changes due to applied forces, without introducing technical terms like "magnitude," "velocity," or "vector." Aspect 3: Describing Prerequisite Knowledge **Students entering grade 3 should already have foundational knowledge, including:** Basic understanding of motion — knowing that objects move in various ways, such as straight, zigzag, round and round, back and forth, fast and slow. Familiarity with pushes and pulls as forces that can move objects. Aspect 4: Identifying Relevant Phenomena **Examples of relevant, real-world phenomena related to forces and motion include:** Playing on a seesaw demonstrates how applying force (pushing down) on one end causes the other end to move up, illustrating action and reaction forces. Observing how kicking a soccer ball (applying a force) changes its motion depending on the strength and direction of the kick. Aspect 5: Identifying Student Strengths and Challenges **Strengths:** Many students have practical experience with pushing and pulling objects, which can be leveraged to introduce the concepts of force and motion. **Challenges:** Students may believe that constant motion requires a constant force application, conflating force with 207 ○ Definition: The change in position of an object over time. ○ Aspects: Speed (how fast an object moves) and direction (the way it moves). ○ Effects: Forces can change the speed and direction of an object's motion. Defining Boundaries 1. Grade Level Scope: ○ Focus on qualitative and conceptual understanding rather than quantitative calculations. ○ Students should understand the relative strengths and directions of forces. ○ Gravity is addressed qualitatively as a force that pulls objects down. ○ Technical terms such as magnitude, velocity, momentum, and vector quantity are not introduced at this level. Describing Prerequisite Knowledge 1. Basic Understanding of Motion: ○ Concepts of push and pull, start and stop. ○ Awareness that objects can move in different ways (e.g., straight, round, back and forth). 2. Surface Interaction: ○ Different surfaces affect motion (e.g., smooth vs. rough). 
Identifying Relevant Phenomena 1. Examples of Phenomena: ○ A ball starts to roll when pushed (unbalanced force). ○ A box remaining stationary when equal forces are applied from opposite sides (balanced forces). ○ Friction as a force that stops a moving object. Identifying Strengths and Challenges 1. Strengths: ○ Direct experiences with pushing and pulling objects. 2. Challenges: ○ Misconceptions such as stationary objects not exerting force and the invisibility of forces like magnetism. Table 4-36 (cont'd) motion. Some students might think that a larger object always moves slower than a smaller one, regardless of the forces applied, due to misconceptions about mass and force. By exploring and addressing these various aspects, educators can tailor their teaching strategies to effectively convey the foundational principles of forces and motion to elementary students, setting a robust groundwork for more advanced concepts in later grades. Major changes: ● The second round of unpacking significantly improved the grain size of the unpacked DCIs compared to the first round. The first round of unpacking had a larger grain size, covering Force Characteristics, Interaction of Forces, and Newton's Third Law; although it included the major science ideas, the grain size remained large. The second round of unpacking has a smaller grain size. In addition, it emphasized the ideas of balanced and unbalanced forces. In terms of the grade boundary, the second-round unpacking further clarified the boundary of important ideas, including: 1. understanding the relative strengths and directions of forces, and 2. addressing gravity qualitatively as a force that pulls objects down. Table 4-36 shows that the second round of unpacking aligns with the requirements and goals I previously established. Using this updated unpacking, I continued refining the LPs to incorporate the experts' feedback regarding the initial set's coverage. I initiated the refinement by providing the GPT model with explicit instructions reflecting on the first round of LPs. I directed the GPT model with specific tasks: "Please refine the LPs to address the following parts: 1. Determine whether the forces acting on an object are balanced or unbalanced. 2. Determine that the effect of balanced forces on an object at rest is that it remains at rest. 3. Determine that the effect of balanced forces on an object in motion is that its motion does not change. 4. Determine the effect of unbalanced forces on an object at rest that its motion changes (it begins to move). 5. Determine that the effect of unbalanced forces on an object in motion is that its motion changes (it speeds up or slows down and/or it turns) Note: I have used the word 'Determine' above but this would be replaced with an appropriate practice such as: A. Carry out an investigation to show. B. Develop a model that shows… C. Construct an explanation to show… could you please design LPs based on the suggestions and our previous conversations." After GPT produced its output, I further engaged it to identify and address any issues or to refine further. Example prompts include: "Please ensure motion is appropriately unpacked in the LP, for instance, it could be the speed or the direction of the object change," and "Please explain how the new set of LPs addresses the reviewers' comments and how it can correctly measure the PE? Can you reexamine this set of LPs to check if there is anything missed or overreached?"
GPT’s response, labeled as 'Potential Refinements', was insightful: "LP-E01: No refinements needed; it is clear, well-scoped, and appropriately detailed. LP-E02 and LP-E03: Ensure that the models are simple and easily understood by elementary students. Provide guidance on the types of models that are suitable for the grade level." Following this analysis, I prompted GPT to continue the refinement process, focusing on simplicity and educational appropriateness: "how to make sure: LP-E02 and LP-E03: Ensure that the models are simple and easily understood by elementary students. Provide guidance on the types of models that are suitable for the grade level. Please provide revised LPs." Table 4-37 compares the refined set of LPs with those from the first round, illustrating the adjustments made. Table 4-37. Two rounds of LPs for 3-PS2-1. 1st round LPs for 3-PS2-1 2nd round LPs for 3-PS2-1 LP1**: Students plan and carry out investigations to observe how different strengths of forces affect the motion of an object. LP1: Students plan and carry out investigations to collect data on how balanced and unbalanced forces affect the motion of an object (speeding up, slowing down, changing direction). LP2**: Students develop models to explain how balanced forces acting on an object result in no change in motion, using everyday scenarios such as a book resting on a table or a tug-of-war game with equal strength on both sides. LP3**: Students construct explanations based on evidence from investigations to explain how objects in contact exert forces on each other, including friction, showing the interaction between objects as a cause of motion changes. LP4**: Students use models to explain how non- contact forces (e.g., magnetic or gravitational forces) on an object's motion or change in motion act at a distance. LP2: Students develop simple models to explain how balanced forces acting on an object at rest result in no change in motion. LP3: Students develop simple models to explain how balanced forces acting on an object in motion result in no change in motion. LP4: Students carry out investigations to observe how unbalanced forces cause objects at rest to start moving. LP5: Students construct explanations to show how unbalanced forces affect a moving object's motion (it speeds up or slows down and /or it turns). 209 Table 4-37 (cont’d) Major changes: 1. Improve the coverage of the major ideas in the PE. 2. Further clarification of motion by including speeding up, slowing down, and changing direction. 3. Clarification that the model students are to develop is “simple” models. 4. The new LP2 emphasizes the relationship between balanced forces and an object at rest. 5. The newly added LP3 addresses the part of the PE about the relationship between balanced forces and an object in motion that the first round LPs do not include. 6. The new LP4 further emphasizes the original SEP in the PE and addresses the relationship between unbalanced forces causing objects at rest to start moving, which is not included in the first round LPs. The second round set of LPs removes the idea of contact forces and non-contact forces, which are not the critical ideas of the PE. Throughout the process, I collaborated closely with GPT to reflect on and summarize the refinements made to the LPs and evidence statements, aiming to hone the underlying principles guiding their design. 
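This reflection step can be sketched in the same conversational style. The fragment below is a hypothetical illustration (again assuming the openai Python client) of how comparison and principle-extraction prompts might be appended to an ongoing refinement conversation; the prompt wording is paraphrased for illustration, not quoted from the study's scripts.

```python
# Hypothetical sketch: appending reflection turns to an existing refinement
# conversation so the model compares LP rounds and extracts design principles.
from openai import OpenAI

client = OpenAI()

# In practice, `messages` would already contain the system prompt, the
# refinement goals, and the accepted revisions from the earlier loop.
messages = [
    {"role": "system",
     "content": "You assist with NGSS knowledge-in-use assessment design."},
]

reflection_prompts = [
    "How do you see this new set of LPs differing from the first-round set?",
    "Recall the process of revising the LPs for 3-PS2-1. What principles would "
    "you extract about incorporating the reviewers' feedback?",
]

for prompt in reflection_prompts:
    messages.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    reflection = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reflection})
    print("--- Reflection ---\n", reflection)
```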
I engaged GPT in discussions to assess the evolution of the LPs, asking, “How do you see this new set of LPs different from the set of LPs?” These instructions were crucial for ensuring that the adjustments effectively addressed the feedback from the expert panel. To extract actionable insights from these revisions, I frequently prompted GPT to reflect on the principles that emerged from integrating the reviewers' feedback, asking, “great! please recall the process of revising the LPs for 3-PS2-1, what principles would you extract about incorporating the reviewers' feedback from the process?” This iterative dialogue helped identify core principles that could be applied to future revisions. In this role, I served as a questioner, pinpointing essential areas for reflection and guiding the GPT models to detect nuances and learn from the ongoing refinement process. Together, we worked to further enhance the design process, ensuring that each iteration of LPs and evidence statements was progressively aligned with educational objectives and expert insights. 4.3.1.2 Refining the Assessments by Enhancing the Realism and Relevance and Revising Language Identify specific places for revisions The second example of refinement is about task refinement for PE: 3-LS4-3. A major critique of the tasks designed for the LP4 for the 3-LS4-3 is the need for language simplification and enhanced 210 realism in task design. Experts across various fields stressed the importance of using clear and straightforward language to make educational tasks accessible for third graders. They suggested replacing formal and complex phrases such as "to aid their investigation" with more direct and age-appropriate terms like "to help their investigation." The realism of visual aids used in the tasks was another significant concern. A notable example involved the use of an overly fantastical "magic" sunflower element in a task, which, while engaging, was critiqued for its lack of scientific accuracy. Assessment experts and science content experts further highlighted the potential for confusion caused by visuals that did not accurately depict scale or realistic biological processes. These insights led to targeted revisions that focused on simplifying language and aligning visual representations with scientific realities, aiming to enhance both the clarity and educational value of the tasks for young learners. Specify task goals and requirements To ensure the refinement of educational tasks meets specific requirements, a series of guidelines has been established. First, the tasks should use scientifically accurate visuals that align with the narrative to avoid confusion. Captions clarifying the content must be integrated effectively to enhance understanding. Furthermore, the scenarios used in the tasks should be familiar and relatable to the students. This could involve adapting the animal subjects to reflect the local wildlife known to students in their geographical areas and focusing on settings like various parks that offer diverse learning environments. Language and instructional clarity are critical. The tasks should use age-appropriate vocabulary and sentence structures that match third-grade comprehension levels without sacrificing scientific precision. Instructions should be straightforward and concise, eliminating complex terms in favor of simpler alternatives that maintain the accuracy of the scientific concepts being taught. Lastly, consistency in terminology is crucial. 
Terms should be used uniformly across all educational materials to prevent confusion. Explain rationale for revisions The necessity to refine assessment tasks stems from the overarching requirement to use visual aids that are both scientifically accurate and effectively tailored to enhance the learning process, ensuring 211 that every aspect of the task is clear and relevant for all students. It is essential to employ clear and straightforward language, making the educational content accessible and comprehensible for third-grade students. During the review process, experts highlighted issues with the use of overly fantastical visuals, such as the sunflower depiction, which they felt was more appropriate for a fictional setting than a science-based task. This feedback highlighted a critical balance that needs to be maintained in educational materials: engaging students effectively while upholding the integrity of the scientific content presented. The revised tasks aim to captivate third graders with realistic and scientifically accurate scenarios, thus enhancing their overall learning experience. Provide explicit guidelines for revisions Start by recalling the design blueprint for LP2 of PE 3-LS4-3 and the overall assessment design process. I provided explicit guidelines focusing on several key areas: 1. Use of real images and visual aids: It's crucial to include high-quality, relevant images that clearly depict the scenario being studied. Visual aids should be seamlessly integrated into the tasks to enhance understanding of the phenomena; and 2. Provide adequate scaffoldings for younger students: To support younger learners, include step-by- step prompts or visual aids that guide students in representing forces and objects in their models effectively. Addressing concerns about realism, I encouraged the GPT model to consider the authenticity of the presented phenomena, asking, “Are these real-world phenomena, and is the data real? Are there real pictures available to illustrate the phenomena? The use of authentic images rather than generated graphics would greatly enhance the task's educational value.” To ensure the tasks are age-appropriate and linguistically accessible, I guided the GPT model with specific restructuring directions: “Revise the assessment task to start with a real and compelling phenomenon, accompanied by images and a data table. Follow this with a scenario where two students debate which organism will survive in a particular environment. The item prompts should then ask students to choose which argument they agree with, provide evidence, and reason their choice, ensuring all questions are tailored to be third-grade friendly.” 212 Following this structured approach and refining the tasks based on feedback, the iterations led to significant improvements in clarity and relevance, as documented in Table 4-38, which compares task designs from the first and second rounds. Table 4-38. Task 2 for LP4 for 3-LS4-3 across two rounds of refinement 1st round Task 2 for LP4 for 3-LS4-3 2nd round Task 2 for LP4 for 3-LS4-3 Major changes: 1. Underlining the authenticity of the phenomena and data included in the task. In the newly designed tasks, all data, phenomena, and images are sourced from actual research. 2. Providing more information to offer accessible opportunities for diverse students to engage in the tasks. 3. Revising the language of the assessment task prompts and stems to be age-appropriate and avoid unscientific terminology. 4. 
Enhancing prompt scaffoldings to better support students in engaging with and understanding the task.
Throughout the iterative refinement process, the human operator plays a crucial role in identifying, monitoring, and guiding the outputs. I prompt the AI models to reflect on, detect, and diagnose their outputs at critical points. This ensures that the AI learns from the iterative process and improves its capability for future design tasks.
4.3.2 Expert Panels' Feedback on the Refined Assessments
4.3.2.1 Unblinded Expert Panels
Table 4-39 presents the number of experts who provided the second-round feedback. In the following sections, I discuss the comparison between the first-round and second-round feedback on the LPs and designed tasks by PE.
Table 4-39. Summary of experts on second round feedback (number of experts for 3-PS2-1 / 3-LS4-3)
NGSS experts: 1 / 1
Assessment design experts: 2 / 2
Science content experts: 4 / 5
Equity and language experts: 2 / 3
Engagement experts: 3 / 3
Elementary science teacher experts: 2 / 2
3-PS2-1: Feedback on LPs and Evidence Statements
Figure 4-29 presents the first-round and second-round expert feedback on the designed LPs and evidence statements. Figure 4-29. Scatter plot for the two rounds expert feedback on the LPs and evidence statements. The scatter plot in Figure 4-29 displays feedback scores from a panel of experts on four dimensions of the LPs and evidence statements, evaluated across two review rounds. The dimensions include Collective Representation of Proficiencies, Essentiality of the LP, Sufficiency of Evidence, and Integration of Knowledge. Four groups of experts (NGSS experts, assessment experts, science content experts, and equity and language experts) provided scores, visualized with distinct markers for each group. In the first round, shown in blue, feedback scores varied moderately across all criteria, with generally lower scores reflecting initial assessments of the LPs and evidence statements. By the second round, depicted in red, there was a noticeable improvement in scores for all criteria among all expert groups. This indicates a positive reception to the modifications made after the initial feedback. Specifically, the NGSS experts showed marked improvements in all areas, reaching the maximum score in the second round, indicating complete satisfaction with the changes made. Similarly, the assessment experts and science content experts gave higher scores in the second round, particularly noting improvements in the Sufficiency of Evidence and Integration of Knowledge. The equity and language experts, while also showing increased scores, provided slightly more conservative feedback in areas like Integration of Knowledge, suggesting areas where further refinements could be beneficial.
3-PS2-1: Feedback on Task 1
Figure 4-30 visualizes comprehensive feedback scores across a broad range of criteria from a diverse group of experts. These scores are compared between two rounds of review, with the first round represented by blue markers and the second by red markers. In general, the second round shows higher scores across most dimensions, indicating that the revisions made after the initial feedback were well received. For specific expert groups, NGSS experts demonstrated significant improvements in areas like Phenomena, Information Coherence, and Language Complexity-Sentence Structure, suggesting that revisions better aligned with NGSS standards in the second round.
Assessment experts noted improvements in Engagement Relevance, Prompt Clarity, and Language Complexity-Domain Specific, reflecting a refinement in how assessments are designed to gauge knowledge accurately. Science content experts saw improvements in Criteria Sensitivity and Authenticity, indicating enhanced content accuracy and real-world relevance. Equity and language experts observed slight increases in Cultural Sensitivity/Inclusion and Language Appropriateness, making the content more inclusive and accessible. Engagement experts marked improvements in Engagement Relevance and Interest, highlighting better engagement strategies in assessment tasks. Teacher experts reported substantial improvements in 216 Response Appropriateness and Evidence Statement Alignment, suggesting that tasks became more effective for classroom use. Figure 4-30. Scatter plot for the two rounds expert feedback on Task 1 for LP2 of 3-PS2-1 Despite these improvements, some criteria like "Language Complexity-Visuals" and "Procedural Skills" received relatively low and unchanged scores from specific expert groups like the teacher experts, signaling areas that may still need further attention. This detailed feedback from various expert 217 perspectives allows for targeted improvements in future iterations of the assessment tasks, ensuring they are more effective and relevant. 3-PS2-1: Feedback on Task 2 Figure 4-31 presents the two rounds of expert feedback on Task 2 for LP2 of 3-PS2-1. In the first round, represented by blue markers, scores generally varied across different criteria, with several areas showing room for improvement. The second round, represented by red markers, shows a significant overall improvement in scores, suggesting that the feedback from the first round was effectively integrated into subsequent revisions. The NGSS experts showed marked improvements in areas related to the authenticity and relevance of STEM phenomena, demonstrating greater satisfaction with how these were presented in the second round. Assessment experts provided higher scores particularly in criteria involving engagement relevance and procedural skills, indicating that the assessments better captured student interest and effectively measured relevant skills in the second iteration. Science content experts, who focus more on the accuracy and depth of content, reflected increased scores particularly in language complexity and content coherence, suggesting enhanced clarity and alignment with scientific standards. 218 Figure 4-31. Scatter plot for the two rounds expert feedback on Task 2 for LP2 of 3-PS2-1 Equity and language experts, whose feedback is crucial for ensuring inclusivity and accessibility, noted better performance in cultural sensitivity and language appropriateness. Engagement experts, 219 focusing on how engaging and relevant the content is for learners, recorded higher scores in engagement interest and skills relevance, highlighting more compelling and relevant content in the second round. Teacher experts, whose perspectives are vital for practical classroom application, also showed improvement, particularly in the clarity of prompts and the alignment of evidence statements with learning goals. This detailed comparison between the two rounds highlights the effective incorporation of expert feedback into enhancing the overall quality, relevance, and effectiveness of the assessment tasks. 
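For readers who want to reproduce this kind of two-round comparison for their own feedback data, the sketch below shows the basic plotting pattern behind Figures 4-29 through 4-31: blue markers for the first round, red for the second, one marker shape per expert group, and criteria along the x-axis. The scores, group names, and criteria used here are invented placeholders, not the study's data.

```python
# Sketch of a two-round expert-feedback scatter plot using invented placeholder data.
import matplotlib.pyplot as plt

criteria = ["Collective Representation", "Essentiality of LP",
            "Sufficiency of Evidence", "Integration of Knowledge"]
# scores[group][round] holds one score per criterion, in the same order as `criteria`.
scores = {
    "NGSS":                {"Round 1": [3, 3, 2, 3], "Round 2": [4, 4, 4, 4]},
    "Assessment":          {"Round 1": [3, 2, 2, 3], "Round 2": [4, 4, 3, 4]},
    "Science content":     {"Round 1": [2, 3, 2, 2], "Round 2": [4, 3, 4, 4]},
    "Equity and language": {"Round 1": [3, 2, 3, 2], "Round 2": [4, 3, 4, 3]},
}
markers = {"NGSS": "o", "Assessment": "s",
           "Science content": "^", "Equity and language": "D"}
colors = {"Round 1": "tab:blue", "Round 2": "tab:red"}

x = range(len(criteria))
fig, ax = plt.subplots(figsize=(8, 4))
for group, rounds in scores.items():
    for rnd, vals in rounds.items():
        # One marker shape per expert group, one color per review round.
        ax.scatter(x, vals, marker=markers[group], color=colors[rnd],
                   alpha=0.7, label=f"{group} ({rnd})")

ax.set_xticks(list(x))
ax.set_xticklabels(criteria, rotation=20, ha="right")
ax.set_ylabel("Feedback score")
ax.set_title("Expert feedback across two review rounds (placeholder data)")
ax.legend(fontsize=7, ncol=2)
fig.tight_layout()
plt.show()
```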
3-LS4-3: Feedback on LPs and Evidence Statements Figure 4-32 shows the two rounds feedback on the designed LPs and evidence statements of 3- LS4-3. Figure 4-32. Scatter plot for the two rounds expert feedback on the LPs and evidence statements For the LPs and evidence statements for 3-LS4-3, Figure 4-32 shows a clear shift towards higher evaluations in the later round, reflecting the positive impact of the revisions. Initially, scores across the board were somewhat lower, indicating areas of concern or need for improvement. For example, the NGSS and science content experts provided stringent feedback particularly on the sufficiency of evidence and essentiality of the LPs, highlighting a need for more precise alignment with scientific standards. 220 Assessment experts mirrored these concerns, focusing on the overall effectiveness of the assessments. The second round shows an uplift in scores across all dimensions, signaling that the revisions were well- received. Enhancements in the sufficiency of evidence and the integration of knowledge were particularly well noted by NGSS and science content experts, showcasing an improved alignment with educational standards and better representation of proficiencies. 3-LS4-3: Feedback on Task 1. Figure 4-33. Scatter plot for the two rounds expert feedback on Task 1 for LP2 of 3-LS4-3 221 In terms of Task 1 for LP2 of 3-LS4-3, Figure 4-33 provides a detailed comparison of feedback scores from a panel of experts across a wide array of assessment criteria over two review rounds. The plot reveals significant differences in scores between the rounds, indicating how expert feedback influenced the revisions of the assessment tasks. During the first round, feedback scores were generally lower across most criteria, suggesting initial concerns or deficiencies identified by the experts. For instance, lower initial scores in areas like engagement relevance, cultural sensitivity, and procedural skills highlighted the need for more focused adjustments to better cater to diverse learner needs and the practical application of knowledge. In the second round, improvements are evident across almost all criteria, with markedly higher scores indicating that the changes made were effective. NGSS experts, who focus on alignment with scientific standards, showed higher satisfaction in the second round, especially in criteria related to scientific accuracy and coherence. Similarly, assessment experts provided higher scores on criteria assessing the effectiveness of task design and alignment with learning goals, suggesting that the revisions better met assessment objectives. The comparison of scores also illustrates the effectiveness of the iterative feedback process, where adjustments based on expert critiques led to enhancements in the assessment's design and content. This iterative process ensures that the assessments are not only comprehensive but also effective and appropriate for educational use, demonstrating a successful adaptation to the experts' insights. 3-LS4-3: Feedback on Task 2 The feedback on Task 2 is also similar to Task 1. Figure 4-34 provides a comparative analysis of feedback scores across a range of criteria related to educational task assessments, as reviewed by various expert groups over two rounds. From the initial to the revised assessments, there is a noticeable improvement in scores across most criteria, signaling that the modifications made were effective. 
This is particularly evident in the scores provided by NGSS experts and assessment experts, who showed increased satisfaction in areas like 'Information Coherence' and 'Language Complexity.' Their scores in 222 the second round are consistently higher, reflecting their approval of the adjustments made in response to their initial critiques. Figure 4-34. Scatter plot for the two rounds expert feedback on Task 2 for LP2 of 3-LS4-3 Science content experts, whose focus is the depth and accuracy of content, demonstrated increased scores in the 'Language Complexity' and 'Procedural Skills' criteria. This suggests that the 223 revisions addressed their concerns about the clarity and application of scientific concepts in the tasks. Engagement experts, whose evaluations focus on how engaging the tasks are for students, showed higher scores in 'Engagement Interest' and 'Engagement Relevance' in the second round. These improvements indicate that the tasks were adjusted to be more engaging and relevant to students, aligning better with educational engagement goals. Overall, the plot illustrates an effective feedback loop where expert critiques were taken into account, leading to substantial enhancements in task design and execution, which were acknowledged by higher scores in the subsequent review round. This iterative process underscores the value of expert feedback in refining educational assessments to better meet pedagogical objectives and improve learner outcomes. 4.3.2.2 Blinded Expert Panels I report the expert panels who were not informed the assessments and the LPs were designed by AI. Below, I first present the experts’ feedback on the PE 3-PS2-1 and then move to the feedback on the PE 3-LS4-3. Feedback on LPs and evidence statement for PE3-PS2-1 The expert feedback on the LPs and evidence statements for 3-PS2-1 provided a rich source of insights. This feedback emphasized both the strengths and areas for improvement in the LPs, offering specific recommendations for future revisions. The expert reviews acknowledged the effective integration of 3D learning and commended the clarity and comprehensiveness of the LPs. Positive feedback on LPs and evidence statements. Experts widely commended the alignment and effectiveness of the LPs in representing the essential proficiencies required to meet 3-PS2-1, emphasizing the depth and thoroughness of the educational framework. One reviewer praised the structured approach to three-dimensional learning, stating, "The LPs and components in the evidence statements—model elements, relationships, and explanations—are meticulously unpacked to ensure comprehensive coverage of the 3Ds, providing a robust educational framework that effectively fosters student understanding." 224 The clarity and precision in modeling and explanations within LP2 were highlighted as standout features. This clarity was particularly beneficial for addressing common misconceptions about forces on stationary objects. A reviewer elaborated on this strength, noting, "LP2 excels in demystifying the dynamics of forces, clearly illustrating how balanced forces interact on stationary objects, thereby countering the prevalent student misconception that stationary objects are not subject to forces." Areas for improvement in LPs and evidence statements. Several experts have raised concerns about the learning performances, particularly highlighting the insufficient emphasis on "investigation" practices within the current framework. 
Most notably, only LP1 focuses on investigations, which might lead to confusion among students as they attempt to grasp the full scope of the curriculum requirements. One expert elaborated on this issue: "The set of LPs lack of emphasis on critical investigative skills, which are central to understanding forces. Students might struggle to differentiate the unique aspects of each performance, leading to a superficial understanding rather than a deep conceptual grasp." In addition to the need for a broader focus on investigative practices, the lack of adequate scaffolding in LP2 was frequently noted. Experts are calling for more structured support systems to aid students in developing robust models and explanations. This concern was voiced by an expert who questioned the current educational supports: "I am curious as in LP2 would students be offered any scaffolding for the model and explanation." The terminology used within the LPs also received feedback suggesting the need for refinement to better foster scientific inquiry. An expert specifically addressed this in the context of LP5, recommending a shift in the educational approach: "LP5 should ask for an evidence-based argument rather than an explanation, you do not explain empirical facts; you argue that the evidence indicates that the statement is true." This change aims to enhance the rigor and accuracy of the educational content, aligning it more closely with the practices of empirical science. Further emphasizing the need for developmentally appropriate content, experts suggested that the evidence statements should accommodate the capabilities of third graders more effectively. This involves fostering evidence-based reasoning and argumentation that aligns with young students' understanding 225 levels. For example, expert H proposed a model to illustrate balanced forces in a way that is accessible to younger learners: "Consider a ladder leaning on a wall, there are not equal arrows because the downward force of gravity (weight of ladder) is balanced by two upward forces (friction on the wall and upward force of ground on foot of ladder). Students create a simple drawing or diagram that shows an object at rest with balanced forces on it." This example not only aids in understanding the concept of balanced forces but also demonstrates the kind of practical, visual learning tools that can help third graders grasp complex scientific ideas. These insights from the experts underscore the necessity for revising the LPs to incorporate a greater emphasis on investigative skills, provide more substantial scaffolding, refine the use of terminology to promote evidence-based reasoning, and tailor the evidence statements to better suit the cognitive abilities of third-grade students. Feedback on task1 for PE3-PS2-1 Experts provided comprehensive feedback addressing the visual and conceptual aspects of the learning tasks. They focused on the representation of physical forces and the accessibility of the tasks for third-grade students, emphasizing the importance of aligning educational content with students' developmental levels and prior experiences. Positive feedback on task 1. The feedback from the experts highlighted several strengths in the design and execution of the task1, particularly in how the phenomena were presented to engage the students effectively. 
One expert emphasized the compelling nature of the phenomena used in the tasks, stating, "Students experience the effects of gravity all the time but seldom are confronted with objects at rest and balanced forces. My experience with elementary students is that they have found this an interesting phenomenon to make sense about." This observation underscores the relevance and engagement potential of the task, aligning well with the students' everyday experiences and curiosities. Another expert further supported this view by elaborating on the educational impact of the phenomena, mentioning, "The way the tasks introduce students to the invisible yet ubiquitous forces at play in everyday objects provides a foundational understanding that stimulates curiosity and critical thinking." 226 This additional insight highlights how the tasks not only align with what students observe daily but also challenge them to think deeply about the physical world. Moreover, the feedback also included appreciation for how these tasks are structured to promote scientific inquiry. As one reviewer pointed out, "By engaging students with scenarios that are both familiar and intriguing, the tasks encourage a deeper exploration of scientific concepts that they can see and feel but often do not notice." This quote reflects the thoughtfulness behind the task design, aiming to transform everyday observations into opportunities for scientific discovery and understanding. Areas for improvement for Task 1. Representation of Forces: Concerns were raised about the level of abstraction in how forces are represented in the task. Experts have stressed the importance of developing more intuitive and less abstract representations of forces to better suit third graders' understanding levels. Specifically, an expert commented on the potential confusion this might cause for third graders, "The students may have been taught to represent the downward force of gravity and the upward force of the table by a single arrow but this is a very abstract representation for the third grader, maybe the student would represent the upward force by lots of little arrows because the table pushes up wherever the book touches it, not just at the middle of the book." This feedback underscores the need for more developmentally appropriate visual representations that align with third graders' conceptual understandings, suggesting a more detailed approach that mirrors how young students perceive interactions between objects. It also highlights the need for educational materials to align more closely with how young students visually and conceptually perceive physical interactions. Visuals and Context: There is a strong push for incorporating visuals that are more familiar and relevant to students' daily school experiences. This comes in response to feedback about the current visuals not adequately engaging the intended audience. The visuals used in the task were noted to be less suitable for the intended audience. An expert pointed out the disconnect, saying, "This photo is for adults. Consider using a photo showing a book on a desk in a classroom setting." This recommendation emphasizes the need for visuals that are more relatable and understandable for young students, highlighting the importance of context and environment in educational materials. 227 Scaffolding and Terminology: Enhancing scaffolding and support in the tasks has been identified as crucial for aiding students' scientific reasoning and expression. 
The feedback also highlighted deficiencies in scaffolding and the need for clearer terminology. An expert raised a significant question regarding the representation of forces, asking, "Would you accept one big down arrow and multiple small up arrows as an 'accurate representation of the situation?'" This inquiry indicates a gap in the guidance provided to students on how to accurately model and discuss scientific ideas. Feedback on task 2 for PE3-PS2-1 The feedback from experts on Task 2 of PE 3-PS2-1 centers around several key aspects related to the representation of forces and the design of the task. This feedback provides actionable insights, specifically focusing on the representation of forces and the contextualization of the task scenarios. Positive feedback on Task 2. Engagement and contextual relevance: Experts appreciated the setup and context provided in Task 2, noting its potential to effectively engage students. One expert specifically commended the task's design and offered a suggestion to enhance its appeal to the students' interests: "I like this task. Picture is good, maybe the setup would be more interesting to students if instead of helping to clean Lisa and John want to move the table to make space to dance or put on a show… (it's easy enough to sweep under that table)." This feedback underscores the value of crafting scenarios that are directly relevant and exciting to students, ensuring they are more than just educational but also enjoyable. Another expert emphasized the importance of context in the learning process, stating, "Engaging students with scenarios that mirror their real-life experiences not only makes the tasks more interesting but also enhances their understanding of the scientific principles being taught." This viewpoint reinforces the strategy of using relatable and dynamic scenarios to promote a deeper connection between the students and the educational content, making learning a more integrative and enjoyable experience. Integration of real-life contexts: The positive feedback was further enhanced by suggestions to make the tasks more practical and focused on core concepts. An expert advised on optimizing the instructional visuals, saying, "Why not ask the kids to represent the forces acting on the table on an outline version of the picture, rather than wasting their time redrawing the setup in a blank space." This 228 recommendation underscores the importance of using visuals effectively to simplify the task execution and concentrate on sensemaking rather than on redundant activities. Another expert elaborated on this idea, emphasizing the educational benefits of such an approach: "Using pre-drawn outlines for students to annotate forces can significantly reduce cognitive load and allow them to focus more on the scientific principles involved." This insight suggests that reducing the complexity of tasks can help students better engage with and understand the scientific concepts being taught, making learning both efficient and effective. Areas for improvement for Task 2. The feedback from experts highlighted concerns about the abstract nature of force representation, which might be confusing for younger students. 
One expert critically examined the typical methods of depicting forces, suggesting a more approachable method for third graders: "Again the question of what is an acceptably ‘accurate’ representation arises – would you accept a balance of the ¼ of table weight pulling it down and the floor pushing up at each place the table touches the floor, or would you only accept one downward force of gravity acting at the center of the table and four upward forces at the feet adding to the same total?" This expert's query points to the necessity for flexible and developmentally appropriate visual representations that resonate with young students’ ways of understanding and visualizing forces. The need for better scaffolding and clearer terminology was also emphasized by experts to enhance how students model and discuss scientific concepts. Highlighting a potential improvement in instructional guidance, one expert asked, "Is it OK to have the arrows not quite to scale with the forces because the model is just a sketch, if the student writes words to explain the balance of forces?" This query underscores the need for more precise guidelines and supports students in expressing their scientific observations accurately and meaningfully. This feedback suggests integrating clearer explanations alongside visual models, thus bridging the gap between abstract scientific models and students' understanding. Table 4-39 below presents the summary of the expert feedback on LPs, tasks for 3-PS2-1. 229 Structured table summarizing the major themes from the expert feedback on Learning Performances (LPs), Task 1, and Task 2 of PE 3-PS2-1, categorizing the feedback into positive aspects, areas for improvement, and suggestions for each section. Table 4-39. Summary of the expert feedback on LPs, tasks for 3-PS2-1 Aspect Positive Aspects Areas for Improvement Suggestions LPs and Evidence Statements - Effective integration of 3D learning standards. - Insufficient emphasis on "investigation" practices. - Incorporate more investigative practices. - Clarity and comprehensiveness in modeling and explanations. - Lack of adequate scaffolding, especially in LP2. - Provide more structured support systems for developing models and explanations. - Alignment with essential proficiencies of 3-PS2-1. - Need for terminology refinement to foster scientific inquiry. - Refine terminology to enhance rigor and align with empirical science practices. Task 1 - Engaging setup and relevance to students' experiences. - Abstract representation of forces potentially confusing. - Develop more intuitive and less abstract representations of forces. - Encourages exploration of scientific concepts. - Visuals not adequately engaging for the intended audience. - Structured to promote scientific inquiry. - Need for clearer guidelines and better scaffolding. - Use more familiar and relevant visuals to enhance relatability and comprehension. - Implement clearer guidelines and enhance scaffolding to support scientific reasoning and expression. Task 2 - Effective engagement and contextual relevance. - Abstract nature of force representation confusing for students. - Use practical, contextually relevant scenarios and optimize visual aids. - Use of real-life contexts enhances understanding. - Need for more intuitive methods for depicting forces. - Suggestions to improve task setup and execution noted. - Need for clearer terminology and better scaffolding. - Allow flexibility in how forces are represented to accommodate different understanding levels. 
Table 4-39 below summarizes the major themes from the expert feedback on the Learning Performances (LPs), Task 1, and Task 2 of PE 3-PS2-1, categorizing the feedback into positive aspects, areas for improvement, and suggestions.

Table 4-39. Summary of the expert feedback on LPs and tasks for 3-PS2-1

LPs and Evidence Statements
- Positive aspects: Effective integration of 3D learning standards; clarity and comprehensiveness in modeling and explanations; alignment with essential proficiencies of 3-PS2-1.
- Areas for improvement: Insufficient emphasis on "investigation" practices; lack of adequate scaffolding, especially in LP2; need for terminology refinement to foster scientific inquiry.
- Suggestions: Incorporate more investigative practices; provide more structured support systems for developing models and explanations; refine terminology to enhance rigor and align with empirical science practices.

Task 1
- Positive aspects: Engaging setup and relevance to students' experiences; encourages exploration of scientific concepts; structured to promote scientific inquiry.
- Areas for improvement: Abstract representation of forces potentially confusing; visuals not adequately engaging for the intended audience; need for clearer guidelines and better scaffolding.
- Suggestions: Develop more intuitive and less abstract representations of forces; use more familiar and relevant visuals to enhance relatability and comprehension; implement clearer guidelines and enhance scaffolding to support scientific reasoning and expression.

Task 2
- Positive aspects: Effective engagement and contextual relevance; use of real-life contexts enhances understanding; suggestions to improve task setup and execution were noted.
- Areas for improvement: Abstract nature of force representation confusing for students; need for more intuitive methods for depicting forces; need for clearer terminology and better scaffolding.
- Suggestions: Use practical, contextually relevant scenarios and optimize visual aids; allow flexibility in how forces are represented to accommodate different levels of understanding; ensure terminology and scaffolding are appropriate for third graders' cognitive abilities.

Feedback on LPs and evidence statements for PE 3-LS4-3

The expert feedback on the LPs and evidence statements for 3-LS4-3 provides critical insights. Below is a summary of the positive aspects and areas for improvement.

Positive feedback on LPs and evidence statements. Experts recognized the thoroughness with which the LPs cover the essential proficiencies required for PE 3-LS4-3. The structured approach to integrating three-dimensional learning was particularly commended for its effectiveness in fostering a deep understanding of biological adaptations. One expert highlighted the breadth of coverage offered by the LPs, stating, "This one covers all main aspects," affirming that the LPs comprehensively address the core elements expected in the curriculum. Another expert elaborated on this point, noting, "The learning performances are well-designed to encompass a broad spectrum of critical concepts that are necessary for students to master the performance expectation, ensuring no key element is overlooked."

The design of multiple evidence statements linked to specific LPs was also positively received, with experts viewing this approach as a robust framework for educational design. One expert praised the clarity and utility of this structure, saying, "I find the concept of a learning performance with an explicit list of multiple evidence statements a very strong template for task design." This sentiment was echoed by another expert, who emphasized the advantages of such a structured approach: "Using multiple evidence statements provides a clear pathway for students to demonstrate their comprehension and application of the learned material, which significantly aids in both teaching and assessing complex concepts."

Areas for improvement for LPs and evidence statements. Significant feedback emerged regarding the "critique" section within the LPs, particularly concerning its appropriateness and execution. Experts argued that the current approach might not fully capture the complexity of how traits contribute to an organism's survival. They recommended altering the evidence statement to assess the validity of claims more thoroughly, focusing on evidence-based reasoning rather than comparative strength. This adjustment asks students to evaluate various claims about survival traits within their environmental contexts, fostering deeper analytical skills. One expert suggested, "Students evaluate other claims about traits that contribute to survival of the organism in this environment and provide evidence-based reasoning as to whether or not they find the claim valid." This change aims to strengthen the critical-thinking demands of the task, ensuring students engage more deeply with the material.

Feedback also highlighted the need for better integration of the three-dimensional learning framework, encompassing disciplinary core ideas, science and engineering practices, and crosscutting concepts. While two dimensions are generally well integrated, the third, specifically the crosscutting concept of systems, often remains only implicitly addressed. To remedy this, experts recommended ensuring that all three dimensions are explicitly incorporated and effectively elicited through the student tasks.
This enhancement will provide a more balanced and comprehensive educational experience, allowing students to better understand and apply complex scientific principles in varied contexts.

Concerns were also raised about the contextual relevance of certain examples, such as the arctic fox, particularly in light of contemporary environmental issues like climate change. One expert noted, "I wonder whether some students might say, ‘Because of climate change, the arctic fox does not need the thick fur.’" This feedback points to the need to update and expand the contexts and examples used within the LPs so that they remain relevant and reflective of current scientific and environmental understandings. Updating these examples will help prevent misconceptions and give students a more accurate and relatable learning experience.

Feedback on Task 1 for 3-LS4-3

Positive feedback on Task 1. Experts acknowledged the task format's ability to engage students effectively by presenting phenomena in a manner that is both engaging and educational. The tasks are designed so that the phenomena are not only comprehensible but also intriguing, fostering deeper interest and engagement among students. Despite this general appreciation for the task's format, some experts voiced concerns about its realism and, by extension, the quality of engagement it can sustain. One expert pointed out, "Maybe [it's] compelling but see my comment above," signaling that while the scenario is designed to be engaging, underlying issues may limit its effectiveness in conveying realistic scientific phenomena.

The clarity of language and the structural organization of the tasks were highlighted as strengths. The tasks were praised for their coherent structure, which supports sequential information processing and enhances comprehensibility. One expert commented simply, "well structured info." This feedback underscores the success of the task design in aligning with educational standards, making the content accessible and effectively sequenced to facilitate student understanding and learning.

Areas for improvement for Task 1. The realism and authenticity of the data used in educational tasks are crucial for maintaining credibility and fostering genuine scientific inquiry among students. Experts raised notable concerns on this point. One expert explicitly questioned the believability of the scenario based on professional experience, stating, "My problem with this task is that it appears to me to be unrealistic. Based on my experience as a docent at a biological preserve at Stanford this data is fake." This feedback identifies a significant issue with how data and scenarios are presented to students and emphasizes that educational materials should either use authentic data or clearly label hypothetical scenarios as such, to prevent confusion and protect the integrity of the task. The criticism underscores the importance of aligning assessment tasks with realistic scientific standards so that they prepare students for real-world scientific understanding and application.

The ecological accuracy of educational tasks is likewise fundamental to teaching students about biology and the environment effectively.
Experts reviewing the tasks raised concerns about the appropriateness of the settings and biological descriptions provided. Specifically, the portrayal of hummingbirds and their environmental interactions was flagged as not fully aligning with known biological facts. One expert provided detailed feedback on the changes needed to improve realism and accuracy, stating, "Hummingbirds need a place to perch and to nest, open meadow may be part of their habitat but they are more likely found in a mixed environment which you call woodland, perhaps open woodland would be a better term." This feedback emphasizes the need to adjust the environmental settings described in the tasks to reflect accurate biological and ecological conditions. Doing so makes the tasks more scientifically precise and gives students a more authentic picture of how organisms interact with their environments.

The presentation of the two children's observations as contradictory was identified as a potential source of confusion, particularly for elementary students. Experts stressed the importance of clear and accurate communication in educational settings. One expert commented on the unnecessary complexity introduced by framing the children's observations as contradictory, advising, "See above no need to frame two students' ideas as if they are contradicting one another, they are not." This feedback calls for revising how student observations and arguments are presented within the tasks so that they are clear and support the learning objectives without inadvertently confusing younger students.

Cultural and contextual sensitivity is also crucial so that all students find the content relatable and engaging, regardless of their background. Experts highlighted a gap in the current tasks concerning their relevance to students from diverse environments, particularly urban or non-forest areas. One expert, deeply concerned about this issue, noted the disconnect many students might feel: "Children from urban and even some suburban environments may never have seen forest, woodland, or a natural meadow." This observation underscores the importance of designing materials that serve a broad audience by incorporating a variety of environments familiar to different groups of students. Expanding the settings and scenarios used in tasks helps ensure they resonate with students from varied geographic and cultural backgrounds, so that no student feels alienated by unfamiliar content.

Feedback on Task 2 for 3-LS4-3

Positive feedback on Task 2. Experts recognized the task's structured approach and the comprehensive way it engages students. One expert commented, "It is compelling and comprehensive to students," emphasizing how well the task captures and maintains student interest through its design. The orderly presentation of information was specifically highlighted for facilitating efficient information processing.
"The order of information in the item stem is very good," confirms another expert, underscoring the clarity and structured nature of the task. Additionally, the effective use of visual aids enhances comprehension, as another reviewer points out, "Yes, the images helped a lot, well presented plants with different organisms." These visuals play a crucial role in reinforcing the educational content, making complex concepts more accessible to students. Areas of improvement for Task 2. While the task's design has been positively received, experts have raised concerns about its realism and approach to teaching complex ecological concepts. One expert expressed frustration with the oversimplification presented in the task, stating, "My problem with this task is it asks for generalization based on single examples...The grass is not helpful, actually grass does not grow in wetlands, its seeds are waterlogged and do not germinate." This critique highlights the necessity for tasks to feature more authentic or clearly hypothetical scenarios to accurately reflect ecological realities and prevent the formation of misconceptions. Additionally, there's an emphasized need for clearer objectives to deepen students' understanding of environmental science. An expert critically notes, "Success in the task comes because the answers are made obvious by the stem, not because you understand anything about the nature of the environments and what it takes to survive in each." This feedback suggests that to truly enhance educational outcomes, the task should be revised to better connect its purposes with the intended educational goals, potentially by refining the prompts to ensure they more effectively guide student inquiry and engagement in learning scientific concepts. Table 4-40 shows the summary of the expert feedback on LPs and tasks for 3-LS4-3. 235 Table 4-40. Summary of the expert feedback on LPs and tasks for 3-LS4-3 Aspect Positive Aspects Areas for Improvement Suggestions LPs and Evidence Statements - Comprehensive coverage of essential proficiencies for PE 3- LS4-3. - Effective integration of three- dimensional learning. - Structured approach praised for fostering deep understanding of biological adaptations. - Multiple evidence statements provide a clear pathway for student assessment. - "Critique" section needs better alignment with the complexity of trait contributions to survival. - Crosscutting concepts need clearer integration. - Revise evidence statements to focus on evaluating the validity of claims rather than their strength. - Explicitly integrate all three dimensions of learning. Task 1 Task 2 - Engaging and educational presentation of phenomena. - Structure aids in sequential information processing. - Visual aids enhance comprehension and engagement. - Language and task organization align well with educational standards. - Concerns about the realism and authenticity of data. - Ecological accuracy of settings and descriptions. - Presentation of contradictory observations. - Use authentic data or clearly indicate hypothetical scenarios. - Adjust environmental settings to reflect accurate biological conditions. - Clarify contradictory statements. - Structured approach effectively engages students. - Visual aids enhance understanding. - "Order of information in the item stem is very good." - Realism of ecological concepts questioned. - Oversimplification of complex concepts like plant survival in various environments. 
CHAPTER 5: CONCLUSIONS AND IMPLICATIONS

This dissertation employed design-based research to explore how humans can work with AI to design knowledge-in-use assessments that support elementary students' science learning. Interdisciplinary expert panels with diverse expertise provided feedback on the co-designed knowledge-in-use assessments and on the interim products critical to the assessment design. The dissertation addresses three major research questions: how to iteratively and effectively design knowledge-in-use assessments with AI, what role humans play in the design process, and what role AI plays in the process, including where and how the synergy between the two occurs.

The dissertation found that humans can collaborate with AI to design knowledge-in-use assessments. The designed assessments were distributed to interdisciplinary expert panel members for review, and their collective feedback provided comprehensive insights for the refinement process. Incorporating this collective feedback, the human operator worked with the AI again to refine the designed assessments, and refinement principles and frameworks were generated during the process. The refined assessments were then distributed to two different expert panels for review, including a new group of experts who were unaware that AI was involved in the design process, in order to mitigate potential bias. The revised assessments received higher evaluations from the original expert panel compared to the first round of assessments. Interestingly, the new expert panel, who were not informed that AI was involved, provided even more positive feedback than the panel that knew the assessments were designed through human-AI collaboration. Below, I discuss each research question.

5.1 Enhancing Knowledge-In-Use Assessment Design through Collaborating with AI

This dissertation builds on the evidence-centered design (ECD) approach to develop knowledge-in-use assessments, adopting the Next Generation Science Assessment (NGSA) approach through collaboration between collective human experts and AI models. This research contributes to the body of knowledge on designing formative assessment tasks that measure complex cognitive constructs, as explored in prior work (Harris et al., 2019, 2024; He et al., 2023; Li et al., 2024). The findings underscore the efficacy of the NGSA approach, previously carried out entirely by human designers, for designing knowledge-in-use tasks. Further, this dissertation extends the NGSA design approach from solely human collaboration to human-AI collaboration, which saves both time and labor compared with earlier, human-only efforts (Pellegrino & Hilton, 2012). The systematic evidence-centered design approach of NGSA can effectively guide AI models in designing knowledge-in-use assessment tasks under the guidance of human operators and collective expert intelligence, which adds to the evidence for the effectiveness of ECD in designing assessment tasks (Mislevy & Haertel, 2006; Wilson et al., 2005).
This dissertation builds on and extends current literature by demonstrating that while AI can generate valuable educational content, its effectiveness is significantly amplified when guided by explicit and detailed human instructions, which aligns with existing research (Luckin et al., 2016). By integrating human expertise, iterative feedback, and detailed guidance, AI-generated outputs can achieve a high level of detail and accuracy, meeting educational standards and supporting effective assessment design. This finding reinforces the importance of human oversight in the AI design process, underscoring the collaborative dynamic between human and AI that fosters a synergistic relationship (Bearman & Ajjawi, 2022), which enhances the overall quality and effectiveness of knowledge-in-use assessments. The dissertation also extends the literature by identifying emerging themes (Table 5-1) that highlight the importance of effectively and iteratively working with GPT-4 models in designing knowledge-in-use assessment tasks. Throughout this process, human experts play a critical role in guiding and refining AI outputs. This supports the notion that while AI has the capacity to learn from provided frameworks and examples, it is crucial for humans to provide comprehensive feedback to ensure the outputs are accurate and pedagogically sound (Fenwick, 2010). The dissertation illustrates how the collaborative dynamic between AI and human experts facilitated the creation of Instructional Design Models (IDMs) and LPs that align with educational standards and support effective learning. By addressing these themes, the iterative process showcased in this dissertation demonstrates how AI and human collaboration can produce high-quality educational assessments. This extends the current understanding of AI's role in educational design, showing that AI-human partnerships can create tools that are not only efficient but 238 also pedagogically sound, enhancing the potential of AI in education. This dissertation reinforces the importance of hybrid intelligence between humans and AI, further extending the role of hybrid intelligence to complex cognitive constructs and systematic design approaches in education (Dellermann et al., 2021; Holmes, 2020). The contributions of this dissertation are twofold. First, it emphasizes the collaborative effort required in designing knowledge-in-use tasks, necessitating experts from various domains to guide and monitor the assessment design, as noted by Harris et al. (2024) and the National Research Council (2006). Assessment is a systematic effort that must consider various levels of thinking about learning, particularly for formative assessments. This dissertation further solidifies the role of assessment as a crucial component of the educational system, requiring collaborative efforts to ensure that designed assessments accurately capture students' performance and effectively inform teaching and learning. This dissertation extends the effectiveness of the NGSA approach in designing knowledge-in-use assessment tasks and highlights the significance of evidence-centered design in these tasks. The collaborative efforts gathered from diverse experts are pivotal for effective assessment design, adding to previous findings that emphasize the necessity of expert reviews in validating assessment tasks (Black & Wiliam, 1998; Shepard et al., 2018). 
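To give a concrete sense of what the explicit, detailed guidance discussed above can look like in practice, the sketch below shows one possible way to structure a prompt that bundles a performance expectation with its three dimensions and the design constraints before asking a GPT model for a learning performance. It is a minimal illustration: the template wording, field names, and helper function are hypothetical and are not the prompts used in this study.

```python
# Illustrative only: one way to structure an explicit prompt for drafting a learning performance (LP).
# The template wording and field names are hypothetical, not the study's actual prompts.

LP_PROMPT_TEMPLATE = """\
You are assisting with Next Generation Science Assessment design.
Performance expectation (PE): {pe}
Disciplinary core idea (DCI): {dci}
Science and engineering practice (SEP): {sep}
Crosscutting concept (CCC): {ccc}
Constraints:
- Stay within the grade band and boundary of the PE.
- Integrate all three dimensions explicitly.
- Use language appropriate for the target grade level.
Write one learning performance and the evidence statements that would show a student has met it.
"""

def build_lp_prompt(pe: str, dci: str, sep: str, ccc: str) -> str:
    """Fill the template with the unpacked elements of a performance expectation."""
    return LP_PROMPT_TEMPLATE.format(pe=pe, dci=dci, sep=sep, ccc=ccc)

# Example with one of the performance expectations discussed in Chapter 4.
prompt = build_lp_prompt(
    pe=("3-PS2-1: Plan and conduct an investigation to provide evidence of the effects of "
        "balanced and unbalanced forces on the motion of an object."),
    dci="PS2.A: Forces and Motion",
    sep="Planning and carrying out investigations",
    ccc="Cause and effect",
)
```

The point of the sketch is simply that the model receives the unpacked dimensions and the constraints up front rather than an open-ended request, which is the pattern of explicit guidance the findings above describe.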
Additionally, this study underlines several important aspects that should be addressed when designing knowledge-in-use assessment tasks, including the integration of 3D proficiencies (NGSS, Lead States, 2013), creating equitable assessments that ensure all students can access the tasks (Darling-Hammond & Snyder, 2000), designing engaging assessments relevant to students' lives, considering language appropriateness (Lee, Quinn, & Valdés, 2013), and ensuring assessments can elicit evidence to understand students' performance effectively (Furtak, 2017, 2023; Penuel & Smolek, 2019). Moreover, this dissertation has the potential to expand on how to engage students from diverse backgrounds in the assessment tasks/scenarios. Since engagement is influenced by students' personal experiences and cultural backgrounds, designing universally engaging tasks is challenging. The integration of AI in assessment design has the potential to more efficiently provide alternative and adaptive scenarios, enhancing the ability to engage a diverse student body (Baidoo-Anu & 239 Owusu Ansah, 2023). Future research should explore how to further develop adaptive assessments. The second major contribution of this dissertation involves expanding the collaboration with AI in designing assessment tasks. By extending the collaborative partner from humans to AI models, this dissertation enriches distributed cognition theories and emphasizes the importance of hybrid intelligence in human and AI collaborations (Hutchins, 2000; Pea, 1993). This dissertation provides deeper insights into collaborating with AI models to design assessments that capture complex cognitive constructs. The iterative design process, detailed instructions, and collaboration with human experts of varied expertise broaden the scope of human-AI collaboration, making it a more holistic and comprehensive approach for integrating AI into education. The Hybrid Human-AI Collaborative Model (HHACI) exemplifies a collaborative approach for complex task design, elucidating how human and AI models can work together, particularly highlighting AI models' strengths in efficiency, flexibility, and vast information access (Johnson & Verdicchio, 2017). AI's capabilities to detect, diagnose, act, and learn from the human-AI collaboration process, and to reflect on these experiences to inform future tasks, underscore the potential of AI in education. More importantly, this dissertation reaffirms the irreplaceable value of human intelligence in the collaborative process, emphasizing that while AI can augment many aspects of educational tasks, the nuanced judgment and ethical considerations provided by humans remain indispensable (Li et al., 2023; 2024). 5.2 Leveraging Interdisciplinary Expertise to Enhance Knowledge-In-Use Assessment Design The second research question investigates the type of feedback provided by interdisciplinary expert panels on the assessment design. This dissertation discovered that diverse expertise from various perspectives is crucial for designing effective knowledge-in-use assessments. Education is a complex domain, and 3D learning is an integrated approach toward complex higher-order skills and thinking, which is central to knowledge-in-use, as argued in this study. Designing effective assessments to measure and support this type of high-order proficiency is even more challenging due to the systemic perspectives required for assessment design. 
For instance, such design demands not only robust science content knowledge but also a deep understanding of science learning and teaching, the NGSS, 3D learning, 240 student engagement, and language literacy, which may impact students' science learning (Pellegrino & Hilton, 2012; NGSS Lead States, 2013). This dissertation investigates the type of feedback provided by interdisciplinary expert panels on assessment design, highlighting the importance of diverse expertise in creating effective knowledge-in- use assessments. Building on the existing literature, it reinforces the necessity of integrating multiple perspectives to address the complexities of 3D learning, which involves higher-order skills and thinking central to knowledge-in-use (Pellegrino & Hilton, 2012; NGSS Lead States, 2013). This dissertation extends current understanding by demonstrating that designing effective assessments requires not only robust science content knowledge but also a deep understanding of science learning and teaching, the NGSS, 3D learning, student engagement, and language literacy. This holistic approach ensures that assessments are accessible to students from diverse backgrounds and measure the three dimensions of scientific knowledge and skills comprehensively (Penuel et al., 2017). It is impractical for one or two developers to possess such extensive expertise, underscoring the value of interdisciplinary collaboration. An interdisciplinary expert panel that included experts from various fields was assembled to address this challenge. This approach effectively gathers diverse expertise to collaboratively achieve the integrated goals of supporting knowledge-in-use proficiency development (NRC, 2012). This dissertation found that experts on the panel with different expertise can provide critical feedback in different areas. For example, NGSS experts can offer feedback on whether the LPs and evidence statements exceed the scope or grade levels of the PEs, and if the performance goals of SEPs align with the intended grade level. Such feedback on the coverage and overreach of the designed LPs and evidence statements from NGSS experts is invaluable. When it comes to the alignment and extent of integrating the three dimensions, assessment experts provided crucial feedback, including whether the designed evidence statements can be used to understand students' 3D learning, and if the designed assessments capture the integrated nature of the three dimensions rather than focusing on just one or two (Pellegrino et al., 2014; Wilson, 2005). Science content experts typically conducted critical examinations of the appropriateness of the science ideas and the accuracy of the science mechanisms presented in the tasks (Lee et al., 2021). However, 241 equity and language experts in science education research can provide important examinations of the cultural sensitivity and inclusiveness of the assessment scenarios, while language experts are crucial for determining if the tasks' language level aligns well with grade-level appropriateness (Lee, Quinn, & Valdés, 2013). One significant group involved in this study is the engagement experts, who can provide slightly different perspectives on these designed assessments to examine if they can interest and engage students in the learning process when using the assessments. Interestingly, this study found that engagement experts often provide different feedback from other groups. 
They highlight the individual personal interest value and emphasize the importance of considering personal experiences in understanding the engaging level of the designed assessments (Hidi & Renninger, 2006). Another critical group of experts is teacher experts, who provide extensive practical feedback on the designed assessments, which are invaluable for understanding if the assessments can be used in the classroom (Heritage, 2010). Assembling interdisciplinary expert panels is thus critical for reviewing and refining assessment tasks. The interdisciplinary expert panel provided valuable feedback on the designed assessments, including both positive feedback and suggestions for further improvement. For the positive feedback, most designed products were highly regarded for their integration of the three dimensions. However, they also received critical suggestions for further enhancement. For the LPs and evidence statements design part, the expert panel suggested: 1. ensuring appropriate grain size of LPs and evidence statements that adhere to the PE boundaries; 2. improving integration of CCCs, DCIs, and SEPs (NGSS Lead States, 2013); and 3. ensuring consistency in terminology and coherence of information. This feedback mainly came from the NGSS and assessment experts. In terms of task design, the expert panels suggested: 1. boosting engagement through relevant and contextual task design; 2. enhancing task clarity and guideline precision by providing assessment tasks with crystal-clear, straightforward instructions; 3. incorporating supportive visuals and scaffolds to emphasize the importance of integrating visual aids and scaffolding strategies into assessment tasks; and 4. ensuring cultural sensitivity and accessibility in task scenarios to create assessment tasks that are inclusive and reflective of the diverse cultural backgrounds and 242 experiences of all students. This feedback is mainly from the assessment, equity/language, teacher, and engagement experts (Baidoo-Anu & Owusu Ansah, 2023; Furtak, 2017). This dissertation adds to the literature by demonstrating the critical value of collaborative learning that can help with collective sense-making to solve complex problems beyond the capability of one or two team members, allowing them to learn from each other and expand the boundary of the zone of proximal development for each team member (Vygotsky, 1978). Similarly, this dissertation expands the notion of distributed cognition theory (Hutchins, 2000; Pea, 1993) from humans and tools to humans and humans with different focal expertise. When working with technology or AI, hybrid intelligence can be synthesized from different cognitive agents with different expertise to contribute to the intelligence system (Dellermann et al., 2021). This hybrid intelligence has the potential to design effective knowledge-in-use assessment within a short time frame and it also has the potential to design better assessments for diverse students. In the collaborative process, AI also detects, diagnoses, and reflects on the learning process which enhances its ability to design domain specific assessments. This is especially important in the age of AI, where it is challenging to expect everyone to have AI expertise, but it is one approach that can be used to leverage distributed cognition to augment human intelligence through a hybrid intelligence system (Luckin et al., 2016). 
5.3 Integrating Expert Feedback through Human-AI Collaboration

The third research question explores the process of incorporating the experts' collective feedback to refine the designed products. This dissertation proposes a refinement framework that highlights the collaboration between the human operator and AI models (a schematic sketch of this loop appears at the end of this section). The human operator identifies critical places for revision based on the collective feedback and the themes identified in RQ2, specifies the task goals and requirements, explains the rationale for the revisions, and provides explicit guidelines for revision. More importantly, the human operator monitors the outputs to ensure goal alignment and prompts the model to detect and diagnose issues it can learn from and act on in future tasks. The AI models learn iteratively and extract critical principles for use in future refinement and action (Holzinger, 2016; Kamar, 2016). The extracted principles and lessons learned by the AI models can, in turn, inform the human's understanding of the task. This process demonstrates how AI and humans can collaborate on complex tasks by extending each other's zones of proximal development and even "cognition" in the sense of 4E theory (Malafouris, 2013; Vygotsky, 1978).

However, it is worth noting that, unlike typical collaboration processes, which are often synchronous and interactive (Wenger, 1998) and in which team members build on each other's ideas to achieve productive engagement (Chi et al., 2018), the interactive process in this dissertation occurred only between the human operator and the AI models, which may limit the interactive and productive nature of the expert panels' feedback. Future research should explore bringing the experts into the interactive environment to achieve real collaboration (Lai et al., 2021).

Another interesting finding of this dissertation is that the expert panel that knew the assessments were co-designed with AI tended to provide relatively critical feedback compared with the experts who were not informed about the AI's involvement. This could be explained by potential bias in how AI is perceived: research indicates that preconceived notions about AI can influence expert judgments of AI-generated outputs (Jussupow et al., 2020). This phenomenon underscores the importance of transparency and of managing perceptions in human-AI collaboration.

AI is good at assimilating extensive information quickly, but it lacks flexible thinking and empathy (Cope & Kalantzis, 2020). It can offer alternative phenomena or solutions, yet all of them require human judgment before they can be used, and it can carry out parts of the design only by working closely with human experts from multiple disciplines. More stakeholders need to be involved if the ultimate goal is a customized design platform in which humans play a critical role throughout the entire process (Dellermann et al., 2021).

A further limitation is that this dissertation lacks the voices of students, who could provide a deeper understanding of the accessibility, inclusivity, and engagement of the designed tasks. Future studies can explore how these assessments are used in classrooms to inform further refinements. Engaging students can provide valuable insights into the practical application of the assessments and ensure that they are tailored to meet diverse learning needs (Baidoo-Anu & Owusu Ansah, 2023; Furtak, 2017).
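As a minimal, purely illustrative sketch of the refinement loop described at the opening of this section (the data fields, function names, and the call_gpt placeholder are hypothetical and do not represent the tooling or prompts actually used in this study), the operator's inputs and the monitoring step might be organized as follows:

```python
# Minimal sketch of the operator-guided refinement loop described in this section.
# Assumptions (not from the dissertation): call_gpt() is a placeholder for whatever
# GPT-4 interface is used; feedback themes and guidelines are plain strings.

from dataclasses import dataclass

@dataclass
class RefinementRound:
    task_goal: str              # what the assessment task must accomplish
    feedback_themes: list[str]  # synthesized expert-panel feedback (themes from RQ2)
    rationale: str              # operator's explanation of why revisions are needed
    guidelines: list[str]       # explicit revision guidelines for the model

def call_gpt(prompt: str) -> str:
    """Hypothetical stand-in for a GPT-4 call; replace with a real client."""
    raise NotImplementedError

def build_refinement_prompt(draft: str, rnd: RefinementRound) -> str:
    themes = "\n".join(f"- {t}" for t in rnd.feedback_themes)
    guides = "\n".join(f"{i + 1}. {g}" for i, g in enumerate(rnd.guidelines))
    return (
        f"Task goal: {rnd.task_goal}\n"
        f"Expert feedback themes:\n{themes}\n"
        f"Rationale for revision: {rnd.rationale}\n"
        f"Revision guidelines:\n{guides}\n"
        f"Current draft:\n{draft}\n"
        "Revise the draft, then state the general principles you applied "
        "so they can be reused in future refinements."
    )

def refine(draft: str, rounds: list[RefinementRound], operator_ok) -> str:
    """Iterate round by round; the human operator accepts or rejects each candidate."""
    for rnd in rounds:
        candidate = call_gpt(build_refinement_prompt(draft, rnd))
        if operator_ok(candidate):  # human monitoring step: keep only goal-aligned output
            draft = candidate
    return draft
```

The design choice the sketch tries to preserve is that the model never finalizes a revision on its own: every candidate passes through the operator's check, mirroring the human monitoring that the framework treats as indispensable.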
This dissertation also enhances the design of 3D assessment tasks by demonstrating the effectiveness of a collaborative framework that incorporates interdisciplinary expertise. The positive feedback from expert panels on the integration of the three dimensions in the designed products underscores the potential of human-AI collaboration to produce high-quality assessments. The iterative approach aligns with the principles of hybrid intelligence systems, emphasizing continuous improvement through mutual learning between human and machine agents (Luckin et al., 2016). This framework leverages the strengths of both human expertise and AI capabilities to create more effective and inclusive educational tools that support diverse learning outcomes (Stanford HAI, 2020).

For future research, this study suggests exploring the integration of feedback from a broader range of stakeholders, including educators, students, and policymakers, to enhance the development and implementation of AI-driven educational assessments. Involving students, in particular, can provide valuable insights into the accessibility, inclusivity, and engagement of the designed tasks, ensuring they meet diverse learning needs (Baidoo-Anu & Owusu Ansah, 2023; Furtak, 2017). Future studies should also examine bringing experts into the interactive environment to achieve real collaboration, further enhancing the productive nature of expert feedback (Lai et al., 2021).

5.4 Major Themes

This dissertation builds on, reinforces, and extends the current literature by providing a comprehensive framework for human-AI collaboration in designing and refining educational assessments. It enhances the design of 3D assessment tasks by demonstrating the effectiveness of integrating interdisciplinary expertise and iterative feedback. The study also offers new insights into the capabilities and limitations of AI in educational contexts, highlighting the importance of human judgment and collaboration in creating high-quality educational tools. Table 5-1 summarizes the major themes found in this study and how they contribute to the literature.
Table 5-1. Summary of themes and how they contribute to the literature

Human-AI collaboration related themes:

Explicit guidance: Builds on existing literature emphasizing the need for detailed and clear instructions when working with AI (Kumar & Thakur, 2012; Spector & Muraida, 1993). Reinforces the importance of human-provided explicit guidance to improve AI outputs. Demonstrates how specific instructions enhance the AI's ability to generate detailed and coherent educational materials, such as Integrated Dimension Maps (IDMs) (Hwang et al., 2020; Pedró et al., 2019).

Domain-specific information: Extends the literature on the critical role of domain-specific knowledge in AI training (Järvelä et al., 2022). Highlights how detailed content about DCIs, SEPs, and CCCs improves the AI's ability to generate relevant and accurate outputs. Shows the necessity of providing comprehensive and specific information for designing and monitoring the AI's outputs.

Role of human experts: Reinforces the literature on the indispensable role of human expertise in evaluating and refining AI outputs (Dellermann et al., 2021). Emphasizes how human experts provide critical insights and feedback, ensuring that AI-generated content aligns with educational standards and goals. Highlights the nuanced understanding of educational contexts and pedagogical strategies that human experts bring, which AI currently lacks (Fenwick, 2010).

Iterative refinement: Extends the concept of iterative improvement in AI training (Kim et al., 2022). Demonstrates the importance of multiple rounds of feedback and adjustments to enhance the quality of AI-generated outputs. Highlights the role of reflective practice in identifying gaps and areas for improvement, ensuring continuous enhancement of educational content (Holzinger, 2016; Gregor, 2001).

AI-human collaboration: Builds on the literature that emphasizes the synergistic relationship between AI and human expertise (Bearman & Ajjawi, 2022; Dellermann et al., 2021). Demonstrates how the collaboration between AI's processing capabilities and human expertise results in high-quality, pedagogically sound educational tools. Highlights the iterative feedback loops and continuous refinement that characterize effective AI-human collaboration (Knox, 2020; Seldon & Abidoye, 2018).

Assessment design related themes:

Ensure language clarity and age-appropriateness: Reinforces the importance of age-appropriate language in educational materials (Lee, Quinn, & Valdés, 2013). Highlights specific examples of simplifying language to better align with students' reading levels, ensuring accessibility and comprehension.

Enhance engagement and inclusion: Extends the literature on designing engaging and inclusive assessments (Baidoo-Anu & Owusu Ansah, 2023; Furtak, 2017). Emphasizes the need for culturally relevant and relatable scenarios to boost student engagement, suggesting framing tasks as stories or hands-on demonstrations.

Provide adequate scaffolding and supportive visuals: Reinforces the role of scaffolding in supporting student learning (Greene & Azevedo, 2007). Emphasizes the need for clear instructions, visual aids, and step-by-step guidance to help students understand complex concepts, particularly in modeling tasks.

Coverage and integration of DCIs, SEPs, and CCCs: Builds on the need for comprehensive coverage of DCIs, SEPs, and CCCs (NGSS Lead States, 2013). Highlights the importance of ensuring that all requisite proficiencies are adequately covered, avoiding gaps and overreach.

Consistency and coherence in terminology and information: Reinforces the need for consistency in educational terminology (Lee, Quinn, & Valdés, 2013). Emphasizes the importance of coherent information flow within tasks to facilitate efficient information processing and enhance student understanding.

Enhance clarity and precision of task guidelines: Emphasizes the importance of clear and precise instructions in assessment tasks (Heritage, 2010). Highlights strategies to eliminate ambiguity, ensuring that students understand task requirements and can effectively demonstrate their understanding.

Ensure cultural sensitivity and accessibility: Extends the literature on cultural inclusivity in education (Darling-Hammond & Snyder, 2000). Highlights the importance of designing tasks that reflect diverse cultural backgrounds and experiences, ensuring accessibility and engagement for all students.

5.5 Limitations and Future Research Directions

This section highlights the limitations encountered during the research process and outlines directions for future research. Acknowledging these limitations is essential for interpreting the dissertation's findings and understanding the scope within which its conclusions are drawn.

5.5.1 Operator Bias

One limitation of this study concerns the human operator.
I analyzed and synthesized the expert panels' feedback and crafted refinement prompts to work with the AI models to refine the assessments. As noted in the positionality section, my chemistry degree gives me robust science content knowledge, especially in the physical sciences, and I have extensive experience designing knowledge-in-use assessments and teaching in an Asian country. However, I lack deep expertise in the life science domains, which may have affected my judgment of the life science assessments and products. This is consistent with the reviewers' feedback on the designed products in the two domains: the physical science products generally received higher ratings than the life science products, especially for the science ideas and phenomena. In addition, having been born and raised in a different cultural context, I do not have a full understanding of Western cultures, which may have affected my judgment of feedback concerning cultural sensitivity and inclusivity. Similarly, as an English language learner, I may be limited in judging the language level and appropriateness of the designs. In short, as the monitor and operator of the process, I may have brought biased judgment into the decision-making. To mitigate such biases in future work, it is crucial to involve a broader array of experts with diverse backgrounds and expertise to ensure a more balanced and comprehensive evaluation process, particularly when addressing complex or contentious feedback.

5.5.2 Lack of Student Input and Practical Implementation

Another limitation of this study is the absence of student input in the feedback process and the lack of real classroom implementation of the designed assessments. While AI excels at processing extensive information swiftly, it does not possess the flexible thinking and empathy that human judgment provides (Cope & Kalantzis, 2020). This study did not incorporate the perspectives of those most affected by the assessments: the students. Their insights are crucial for understanding the accessibility, inclusivity, and engagement of the designed tasks. Future research should examine how these assessments can be applied in classroom settings to obtain concrete evidence of their effectiveness and practicality. Engaging students directly can provide valuable feedback on the assessments' relevance and instructional validity, ensuring they are adequately tailored to diverse educational needs (Baidoo-Anu & Owusu Ansah, 2023; Furtak, 2017).

5.5.3 Future Research Directions

Moving forward, integrating feedback from a broader range of stakeholders, including educators, students, and policymakers, will be vital for enhancing the development and implementation of AI-driven educational assessments. Including these voices can provide richer insights into the accessibility, inclusivity, and engagement of the designed tasks, making them more relevant and effective. Future studies should also aim to bring these diverse experts into a collaborative environment to facilitate real-time adjustments and refinements, enhancing the productive nature of expert feedback. Testing these assessments in actual classroom settings will also be critical to evaluate their instructional validity and impact on student learning, bridging the gap between assessment design and practical educational application.
This will ensure that all voices in education are heard, reflecting the systematic and dynamic nature of educational assessment design. 249 BIBLIOGRAPHY Anderson, C. W., de los Santos, E. X., Bodbyl, S., Covitt, B. A., Edwards, K. D., Hancock, J. B., Lin, Q., Thomas, C. M., Penuel, W. R., & Welch, M. M. (2018). Designing educational systems to support enactment of the next generation science standards. Journal of Research in Science Teaching, 55(7), 1026–1052. Bandura, A. (1989). Human agency in social cognitive theory. American Psychologist, 44(9), 1175. Baidoo-Anu, D., & Ansah, L. O. (2023). Education in the era of generative artificial intelligence (AI): Understanding the potential benefits of ChatGPT in promoting teaching and learning. Journal of AI, 7(1), 52-62. Baddeley, A. (2000). The episodic buffer: a new component of working memory?. Trends in cognitive sciences, 4(11), 417-423. Bakker, A., & Gravemeijer, K. P. (2004). Learning to reason about distribution. In The challenge of developing statistical literacy, reasoning and thinking (pp. 147-168). Dordrecht: Springer Netherlands. Bang, M., & Medin, D. (2010). Cultural processes in science education: Supporting the navigation of multiple epistemologies. Science education, 94(6), 1008-1026. Barab, S., & Squire, K.(2004). Design-based research: Putting a stake in the ground. Journal of the Learning Sciences, 13(1),1–14. Bearman, M., & Ajjawi, R. (2023). Learning to work with the black box: Pedagogy for a world with artificial intelligence. British Journal of Educational Technology, 54(5), 1160-1173. Bellman, R. E. (1978). An introduction to artificial intelligence: Can computers think?. Boyd & Fraser Bertenthal, M. W., & Wilson, M. R. (Eds.). (2006). Systems for state science assessment. National Academies Press. Berg, G. A. (2000). Human-computer interaction (HCI) in educational environments: Implications of understanding computers as media. Journal of Educational Multimedia and Hypermedia, 9(4), 347- 368. Bransford, J. D., & Schwartz, D. L. (1999). Chapter 3: Rethinking transfer: A simple proposal with multiple implications. Review of research in education, 24(1), 61-100. Brown, J. S., Collins, A., & Duguid, P. (1989). Situated cognition and the culture of learning. 1989, 18(1), 32-42. Brown, J. S., & Duguid, P. (1993). Stolen knowledge. Educational technology, 33(3), 10-15. Bonwell, C. C., & Eison, J. A. (1991). Active learning: Creating excitement in the classroom. 1991 ASHE-ERIC higher education reports. ERIC Clearinghouse on Higher Education, The George Washington University, One Dupont Circle, Suite 630, Washington, DC 20036-1183. 250 Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, policy & practice, 5(1), 7-74. Bloom, B. S. (1968). Toward a theory of testing which includes measurement-evaluation-assessment (No. 9). University of California. Brookhart, S. M. (2010). How to assess higher-order thinking skills in your classroom. Ascd. Card, S. K., Moran, T. P., & Newell, A. (1983). The psychology of human-computer interaction. Hillsdale, NJ: Lawrence Erlbaum. Clark, A. & Chalmers, D. (1998). The extended mind. Analysis, 58(1), 7-19. Coghlan, D., & Brannick, T. (2014). Doing Action Research in Your Own Organization. SAGE. Collins, A., Joseph, D.,& Bielaczyc, K.(2004). Design research: Theoretical and methodological issues. Journal of the Learning Sciences,13(1), 15–42. Chi, M. T., Adams, J., Bogusch, E. B., Bruchok, C., Kang, S., Lancaster, M., ... & Yaghmourian, D. L. 
(2018). Translating the ICAP theory of cognitive engagement into practice. Cognitive science, 42(6), 1777-1832. Chiu, T. K. (2021). A holistic approach to the design of artificial intelligence (AI) education for K-12 schools. TechTrends, 65(5), 796-807. Chiu, T. K., Moorhouse, B. L., Chai, C. S., & Ismailov, M. (2023). Teacher support and student motivation to learn with Artificial Intelligence (AI) based chatbot. Interactive Learning Environments, 1-17. Coffey, J., Black, P., & Atkin, J. M. (Eds.). (2001). Classroom assessment and the national science education standards. National Academies Press. Cope, B., & Kalantzis, M. (2020). Making sense: Reference, agency, and structure in a grammar of multimodal meaning. Cambridge University Press. Cope, B., Kalantzis, M., & Searsmith, D. (2021). Artificial intelligence for education: Knowledge and its assessment in AI-enabled learning ecologies. Educational philosophy and theory, 53(12), 1229-1245. Dai, Y., Liu, A., Qin, J., Guo, Y., Jong, M. S. Y., Chai, C. S., & Lin, Z. (2023). Collaborative construction of artificial intelligence curriculum in primary schools. Journal of engineering education, 112(1), 23-42. Darling-Hammond, L., Flook, L., Cook-Harvey, C., Barron, B., & Osher, D. (2020). Implications for educational practice of the science of learning and development. Applied developmental science, 24(2), 97-140. De Cremer, D., & Narayanan, D. (2023). How AI tools can—and cannot—help organizations become more ethical. Frontiers in Artificial Intelligence, 6, 109372. Dellermann, D., Calma, A., Lipusch, N., Weber, T., Weigel, S., & Ebel, P. (2021). The future of human- AI collaboration: a taxonomy of design knowledge for hybrid intelligence systems. arXiv preprint arXiv:2105.03354. 251 DiCerbo, K. (2020). Assessment for learning with diverse learners in a digital world. Educational Measurement: Issues and Practice, 39(3), 90-93. Esposito, A. G., & Bauer, P. J. (2017). Going beyond the lesson: Self-generating new factual knowledge in the classroom. Journal of experimental child psychology, 153, 110-125. Fenwick, T. J. (2010). (un) Doing standards in education with actor‐network theory. Journal of Education Policy, 25(2), 117-133. Fui-Hoon Nah, F., Zheng, R., Cai, J., Siau, K., & Chen, L. (2023). Generative AI and ChatGPT: Applications, challenges, and AI-human collaboration. Journal of Information Technology Case and Application Research, 25(3), 277-304. Furtak, E. M. (2017). Confronting dilemmas posed by three‐dimensional classroom assessment: Introduction to a virtual issue of Science Education. Science Education, 101(5), 854-867. Furtak, E. M., & Lee, O. (2023). Equity and Justice in Classroom Assessment of STEM Learning. Classroom-Based STEM Assessment, 69. Greene, J. A., & Azevedo, R. (2007). A theoretical review of Winne and Hadwin’s model of self- regulated learning: New perspectives and directions. Review of educational research, 77(3), 334-372. Greengard, S. (2022). ChatGPT: understanding the ChatGPT AI . eWeek. Archived from the original on January, 19, 2023. Gregor, S. (2001). Explanations from knowledge-based systems and cooperative problem solving: an empirical study. International Journal of Human-Computer Studies, 54(1), 81-105. Glaser, R., Chudowsky, N., & Pellegrino, J. W. (Eds.). (2001). Knowing what students know: The science and design of educational assessment. National Academies Press. Ha, M., & Nehm, R. H. (2016). The impact of misspelled words on automated computer scoring: A case study of scientific explanations. 
APPENDIX

I used ChatGPT-4 to edit the language and grammar of my dissertation writing. However, ChatGPT-4 did not write any content or alter the original meaning of the written text.