DEVELOPING AND VALIDATING NGSS-ALIGNED 3D LEARNING PROGRESSION FOR ELECTRICAL INTERACTIONS IN THE CONTEXT OF 9TH GRADE PHYSICAL SCIENCE CURRICULUM

By

Leonora Kaldaras

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Curriculum, Instruction and Teacher Education—Doctor of Philosophy
Measurement and Quantitative Methods—Dual Major

2020

ABSTRACT

DEVELOPING AND VALIDATING NGSS-ALIGNED 3D LEARNING PROGRESSION FOR ELECTRICAL INTERACTIONS IN THE CONTEXT OF 9TH GRADE PHYSICAL SCIENCE CURRICULUM

By

Leonora Kaldaras

The Framework for K-12 Science Education (the Framework) and the Next Generation Science Standards (NGSS) emphasize the usefulness of learning progressions (LPs) in aligning curriculum, instruction and assessment. The three dimensions of science form the basis of the theoretical LPs described in the Framework and used to develop the NGSS. The three dimensions are disciplinary core ideas (DCIs), scientific and engineering practices (SEPs) and crosscutting concepts (CCCs). The Framework defines three-dimensional learning (3D learning) as the ability to integrate DCIs, SEPs and CCCs to make sense of phenomena and solve problems. Engaging in 3D learning leads to developing a deep, useable understanding of science. While the Framework outlines LPs for each of the three dimensions, we currently have limited empirical evidence to show that LPs for 3D learning (3D LPs) can be developed and validated in practice. This dissertation shows the feasibility of developing and validating a large-grain 3D LP and a finer-grain 3D construct map in the context of an NGSS-aligned curriculum. The 3D LP focuses on the construct of electrical interactions, and the 3D construct map focuses on the construct of chemical bonding. Conceptually, the 3D construct map for chemical bonding is an integral part of the 3D LP for electrical interactions, but more narrowly scoped. The feasibility of using the assessment tools designed to probe levels of the 3D LP and the 3D construct map for assigning levels to individual answers and for characterizing student learning is demonstrated. These properties of a validated LP are essential for successful implementation of the NGSS.

This thesis is dedicated to Mr. Allen Baldwin, or simply Big Al. Thank you for helping me change my life for the better.

ACKNOWLEDGEMENTS

I would like to thank the following people, without whom I would not have made it through my PhD degree! My supervisor, Dr. Joseph (Joe) Krajcik, for his support both in work and life situations, constant encouragement, and valuable feedback. My friend and colleague, Dr. Hope Akaeze, for invaluable help in completing this dissertation. My academic advisor, Dr. Gail Richmond, for giving me an opportunity to pursue a career in education, and for supporting me in all my educational endeavors. My amazing support team, including Dr. Bob Geier, Cathrene (Sue) Carpenter and Dr. Joe Krajcik, who have helped me and my brother Kosta navigate through the most challenging times of our lives and remain on the path to pursuing our educational goals. My super-star dissertation committee, Dr. William Schmidt, Dr. Melanie Cooper, Dr. Gail Richmond and Dr. Mark Reckase, whose feedback helped me learn and improve my understanding of measurement and education. My amazing colleagues at the CREATE for STEM Institute for constant support and amazing work spirit. I am blessed to be working and learning alongside each and every one of you!
My brother, Kosta, for being with me throughout the PhD years, and still willing to keep up with me. I love and appreciate you always! My fiancé, Alonso, for not giving up on me, and for constantly pushing me to “get it done”. Your patience, love, and support made it happen. My little daughter Marina-Luisa, who is my sunshine. My parents, Marina and Nikolay: I cannot thank you enough; I have been truly blessed with the most amazing parents in the world. My host family, Neocles and Vassiliki Leontis, who have always believed in me and supported me unconditionally. My dear friend, Al Baldwin (Big Al), for being the most amazing human being I have ever met. May your soul rest in peace!

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
CHAPTER 1 A Methodology for Determining and Validating Latent Factor Dimensionality of Complex Multi-Factor Science Constructs Measuring Knowledge-In-Use
  Introduction
  Methodology
  Results
  Discussion
  APPENDIX
  BIBLIOGRAPHY
CHAPTER 2 Developing and Validating an NGSS-Aligned Learning Progression to Track Three-Dimensional Learning of Electrical Interactions in High School Physical Science
  Introduction
  Theoretical Framework
  Methodology
  Data Analysis
  Results
  Discussion
  APPENDIX
  BIBLIOGRAPHY
CHAPTER 3 Exploring Student Reasoning about Chemical Bonds from the Perspective of Energy and Force in the Context of an NGSS Classroom
  Introduction
  Theoretical Framework
  Methodology
  Results
  Discussion
  APPENDIX
  BIBLIOGRAPHY
CONCLUDING REMARKS

LIST OF TABLES

Table 1.1 Example of modified ECD process
Table 1.2 Scoring rubric example
Table 1.3 3D structure of unit 1 assessment
Table 1.4 3D structure of unit 2 assessment
Table 1.5 Summary of two time points EFA model fit for unit 1 and unit 2 assessment
Table 1.6 Summary of CFA model fit for unit 1 pre/post and unit 2 pre/post
Table 1.7 Reliability
Table 1.8 Two time points EFA model fit for unit 1 assessment
Table 1.9 Two time points EFA factor loadings for unit 1 assessment
Table 1.10 Two time points EFA model fit for unit 2 assessment
Table 1.11 Two time points EFA factor loadings for unit 2 assessment
Table 1.12 Measurement invariance analysis for unit 1 pre/post assessment
Table 1.13 Measurement invariance analysis for unit 2 pre/post assessment
Table 2.1 Hypothetical 3D LP for electrical interactions
Table 2.2 Example of mECD process
Table 2.3 Sample responses for every 3D LP level for paper and rod
Table 2.4 Sample responses for every 3D LP level for the foil experiment
Table 2.5 Sample responses that fall between levels of the 3D LP for paper and rod
Table 2.6 Sample responses that fall between levels of the 3D LP for the foil experiment
Table 2.7 Student score/3D LP level for each interview phenomenon
Table 2.8 Model comparison for GPCM and GRM
Table 2.9 S-X2 item fit statistics
Table 3.1 Hypothetical 3D construct map for chemical bonding
Table 3.2 Example of mECD process
Table 3.3 Sample responses for every 3D construct map level for the match on the hot plate
Table 3.4 Sample responses for every 3D construct map level for atoms forming a bond
Table 3.5 Sample responses that fall between levels of the 3D construct map
Table 3.6 Student score and 3D construct map level for each interview phenomenon
Table 3.7 Model comparison for GPCM and GRM
Table 3.8 S-X2 item fit statistics

LIST OF FIGURES

Figure 1.1 Summary of modified evidence centered design process
Figure 1.2 Theoretical latent structure of unit 1 assessment instrument
Figure 1.3 Theoretical latent structure of unit 2 assessment instrument
Figure 1.4 Measurement invariance model for unit 1 assessment instrument
Figure 1.5 Measurement invariance model for unit 2 assessment instrument
Figure 1.6 95% confidence interval for factor scores at different levels of observed (raw) score for unit 1 assessment instrument
Figure 1.7 95% confidence interval for factor scores at different levels of observed (raw) score for unit 2 assessment instrument
Figure 2.1 Summary of modified evidence centered design process
Figure 2.2 Wright map showing learning progression levels for unit 1 assessment items
Figure 2.3 Wright map showing distribution of respondents who provided answers on pre and post unit 1 test
Figure 2.4 Wright map showing learning progression levels for unit 1 pretest assessment items and distribution of respondents for the relevant cut points for students who provided answers on both pre and posttest
Figure 2.5 Wright map showing learning progression levels for unit 1 posttest assessment items and distribution of respondents for the relevant cut points for students who provided answers on both pre and posttest
Figure 2.6 Modified Wright map for pre unit 1 test showing student proficiency estimates and standard error bands from lowest to highest
Figure 2.7 Modified Wright map for post unit 1 test showing student proficiency estimates and standard error bands from lowest to highest
Figure 2.8 Q3 matrix
Figure 2.9 Person fit Zh statistics
Figure 3.1 Summary of modified evidence centered design process
Figure 3.2 Wright map showing 3D construct map levels for unit 2 assessment items
Figure 3.3 Wright map with respondents who provided answers on pre/post unit 2 test
Figure 3.4 Modified Wright map for pre unit 2 test showing student proficiency estimates and standard error bands from lowest to highest
Figure 3.5 Modified Wright map for post unit 2 test showing student proficiency estimates and standard error bands from lowest to highest
Figure 3.6 Q3 matrix
Figure 3.7 Person fit Zh statistics

INTRODUCTION

The Framework for K-12 Science Education (the Framework) and the Next Generation Science Standards (NGSS) emphasize the developmental nature of science understanding and stress the importance of supporting students in developing useable understanding of big ideas in science coherently over time through three-dimensional (3D) learning strategies (National Research Council [NRC], 2012; Standards [NGSS], 2013). The Framework defines 3D learning as the ability to integrate the three dimensions of science, which include scientific and engineering practices (SEPs) and crosscutting concepts (CCCs), to make sense of the disciplinary core ideas (DCIs).
The developmental nature of student understanding is reflected in the idea of a learning progression, which describes the development of science understanding as a series of increasingly sophisticated steps towards understanding of big ideas in science (NRC, 2012). While the Framework and NGSS promote using learning progressions as a tool to help organize curriculum, instruction and assessment, validated NGSS-aligned learning progressions are not currently available in practice. While both the Framework and NGSS provide outlines of theoretical learning progressions for DCIs, SEPs and CCCs, detailed validated learning progressions that integrate the three dimensions of science and show what student understanding looks like at each level of sophistication in terms of the ability to integrate the three dimensions are yet to be developed. Without such three-dimensional learning progressions (3D LPs) developed and validated in practice, successful implementation of the ideas of the NGSS and the Framework will be more difficult. This work presents the first example of an NGSS-aligned learning progression that integrates the three dimensions of science and demonstrates immediate pedagogical use in terms of providing information about the location of each individual student on the 3D LP with 68% confidence. The 3D LP presented here focuses on electrical interactions and is validated in the context of a previously designed NGSS-aligned curriculum for 9th grade Physical Science called “Interactions”.¹ The curriculum helps students build understanding of electrical interactions starting from the macroscopic and moving to the atomic-molecular level. The curriculum consists of four units. Unit 1 focuses on building student understanding of electrical interactions grounded in ideas of electrical charges, fields and forces at the macroscopic level, and introduces the atomic nature of matter. Unit 2 adds ideas of energy at the macro and atomic-molecular level and helps students build an integrated understanding of electrical forces and energy to explain phenomena related to intermolecular interactions and chemical bonding. Units 3 and 4 help students build their understanding of electrical interactions at the atomic-molecular level by focusing on hydrophobic/hydrophilic interactions and protein folding. The 3D LP presented in this study is aligned to the same NGSS performance expectations as the “Interactions” curriculum. The 3D LP uses the same DCIs as the “Interactions” curriculum to describe the progression of understanding of electrical interactions as a continuum of ideas focused on principles of electrostatic attraction and energy that can be used to explain interactions between charged macroscopic objects, formation of chemical bonds, and intermolecular interactions. The process of developing a learning progression involves specifying in detail what student understanding looks like at each level of sophistication (Duschl, Schweingruber, & Shouse, 2007). In this study this process was grounded in relevant research literature, including the Framework and NGSS, and feedback from disciplinary and pedagogical experts.

¹ This work is supported in part by the NSF grant “Developing and Testing a Model to Support Student Understanding of the Sub-Microscopic Interactions that Govern Biological and Chemical Processes”, National Science Foundation, DRL-1232388. All opinions in the dissertation are those of the author and not NSF.
The validation of a learning progression is carried out by developing assessment instruments capable of probing student understanding at each level of the learning progression, and then collecting and analyzing assessment data from these instruments to see whether student response data supports the theoretically suggested progression of understanding described by the LP (Wilson, 2009). In the context of this study, assessment instruments were designed to probe student understanding of electrical interactions before and after each of the four curriculum units was completed. Due to limited time and resources, items were selected from the Unit 1 and Unit 2 assessment instruments only and analyzed using Item Response Theory (IRT) approaches to obtain validity evidence for the theoretically suggested progression of student understanding of electrical interactions. Additionally, two items from the Unit 1 and Unit 2 assessment instruments were used to conduct interview analysis with selected students to gain qualitative validity evidence and help describe student 3D understanding of electrical interactions at each level of sophistication in greater detail. The resulting study presented in this dissertation consists of three interconnected parts. The first part (chapter 1) focuses on demonstrating internal latent structure validity evidence and reliability of the Unit 1 and Unit 2 assessment instruments using confirmatory and exploratory factor analysis approaches. This study provides important evidence on the latent dimensionality of the two assessment instruments, which must be examined prior to conducting IRT analysis in order to choose an IRT model that most accurately represents a given data sample. The study presented in chapter 1 focuses on the construct of electrical interactions; even though this construct is represented by multiple DCIs that are introduced in a specific sequence throughout the “Interactions” curriculum, the only latent construct of interest that all the DCIs contribute towards is electrical interactions. However, the construct of electrical interactions is not merely focused on student recollection of the DCIs, but on the ability to apply understanding of electrical interactions to explain various phenomena by integrating relevant DCIs with SEPs and CCCs. Therefore, these NGSS dimensions might affect the latent dimensionality of the construct such that even if the hypothesized latent structure is described as one-dimensional (e.g., student understanding of electrical interactions), in practice the combination of DCIs, SEPs and CCCs is inherently multidimensional (Gorin & Mislevy, 2013) and might therefore manifest as separate latent dimensions. To the author’s knowledge, there have not been any studies that focus on examining the relationship between the dimensions of NGSS (DCIs, SEPs, CCCs) and the latent dimensionality of assessment instruments built following principles of integrating the three dimensions. In the first part of this study the data from the Unit 1 and Unit 2 assessment instruments is used to demonstrate that student ability to integrate the three dimensions of NGSS is in fact manifested as a single latent construct. Specifically, the study presented in chapter 1 shows that student ability to integrate relevant DCIs (including ideas related to understanding of Coulomb’s law in Unit 1, and energy and chemical reactions in Unit 2) with SEPs and CCCs to explain phenomena related to electrical interactions manifests as single latent constructs.
The results of this study have important implications for developing and validating assessments that measure student understanding of complex constructs integrating the three dimensions of NGSS in practice. In the context of this work, the study presented in chapter 1 provides important information about the latent dimensionality of the constructs measured by the Unit 1 and Unit 2 assessment instruments related to student understanding of electrical interactions, which makes it possible to conduct a more accurate IRT analysis to obtain quantitative validity evidence for the studies described in chapters 2 and 3. The second part of the study (chapter 2) introduces the 3D LP for electrical interactions aligned to NGSS performance expectations and validated using assessment data from the Unit 1 assessment instrument only. The 3D LP described in chapter 2 represents a large-grain 3D LP for electrical interactions spanning one academic year. Chapter 2 provides both quantitative (IRT) and qualitative (student oral interview) analyses of the Unit 1 assessment data to demonstrate validity of the 3D LP for electrical interactions. The 3D LP presented in chapter 2 describes aspects of relevant DCIs, SEPs and CCCs at each level of sophistication. The qualitative analysis of student interviews made it possible to construct detailed descriptions of what student understanding of ideas related to electrical forces, fields and charges looks like at each level of the 3D LP. IRT analysis further demonstrated that the progression of understanding described by the 3D LP is supported by large-scale student response data. Finally, the study in chapter 2 demonstrates that the resulting 3D LP can be used to place individual students on a level of the 3D LP with 68% confidence, which suggests immediate pedagogical applicability of the designed LP. The study presented in chapter 3 introduces a finer grain size 3D construct map for chemical bonding validated using Unit 2 assessment data. As mentioned above, in the “Interactions” curriculum students develop understanding of chemical bonding as an extension of the same ideas related to electric forces and energy that govern interactions between charged macroscopic objects and molecules. Therefore, in essence, the 3D construct map for chemical bonding presented in chapter 3 is an integral part of the 3D LP for electrical interactions discussed in chapter 2, but more narrowly focused on exploring student reasoning about chemical bonding, and validated using assessment items specifically focused on electrical interactions in the context of chemical bonding. The major contribution of the work presented in chapter 3 is that it presents a 3D construct map for chemical bonding grounded in principles outlined in the Framework and NGSS, focusing on building student understanding of chemical bonding not as a heuristic based on the octet rule and memorization of valency states, but as a state of a system of interacting atoms defined by the balance of attractive and repulsive interactions, which leads to energy minimization. The validity evidence from both student interviews and IRT analysis demonstrates consistency between the hypothesized progression of student understanding outlined in the 3D construct map and empirical student response data. The 3D construct map for chemical bonding presented in chapter 3 can also be used to place individual students on a level with 68% confidence, which likewise demonstrates immediate pedagogical applicability of the 3D construct map.
To summarize, this dissertation presents studies that demonstrate how 3D learning can be successfully described and measured in practice: developing NGSS-aligned 3D LPs that integrate DCIs, SEPs and CCCs, developing valid and reliable assessment instruments capable of measuring student progress along the levels of the 3D LPs, and obtaining quantitative and qualitative validity evidence for the 3D LPs using the designed assessment instruments. The studies presented here provide valuable insights into how student progress in the NGSS classroom can be measured effectively, therefore helping enact the vision of the Framework and NGSS in practice.

CHAPTER 1

A Methodology for Determining and Validating Latent Factor Dimensionality of Complex Multi-Factor Science Constructs Measuring Knowledge-In-Use

Introduction

Historically, curriculum, instruction and assessment based on state and local standards have focused on memorization of a large number of scientific facts without understanding fundamental scientific principles. In the age of technology, when information is readily available through a variety of sources, a memorization-based system of education is becoming obsolete. Instead, there is increasing demand for integrated, deep understanding of the key ideas in science that translates into the ability to apply scientific concepts to explain phenomena and solve real-life problems (National Research Council [NRC], 2007; National Research Council [NRC], 2013a; Ercikan & Oliveri, 2016). Because of the growing mismatch between the educational demands of 21st century society and the actual products of the present-day educational system, there has been significant effort from scientists and educational researchers to shift the focus of classroom instruction from fact-based memorization of ideas to supporting complex cognitive processes aimed at developing knowledge application skills. These efforts resulted in the publication of several reports by the National Research Council and the release of the Next Generation Science Standards reflecting this vision (NRC, 2013a; National Research Council [NRC], 2012; Standards [NGSS], 2013). To meet the requirements consistent with the new vision, assessment also needs to change. Specifically, it needs to shift from measuring simple constructs reflecting a fact-based memorization learning trajectory to measuring complex constructs comprised of elements related to content and skills focusing on application of knowledge (Ercikan & Oliveri, 2016; NRC, 2013a; Pellegrino & Hilton, 2012). The challenges associated with developing these types of assessment are multiple. First, we need to understand the fundamental difference between simple and complex constructs. In the context of science disciplines, simple constructs focus primarily on content. For example, traditional tests assess student ability to recite formulas, reproduce definitions, calculate outcomes based on memorized equations, and so on. Additionally, test items are usually devoid of context, and represent content-based assessment unrelated to real-life situations. This approach is not suitable for measuring student ability to apply knowledge, which demands that assessment be situated in a real-life problem that requires a solution, or a phenomenon that needs to be explained using appropriate science ideas (Pellegrino, Wilson, Koenig, & Beatty, 2014; DeBarger, Penuel, Harris, & Kennedy, 2016).
The focus of this assessment is not pure content, but also the skills and competencies associated with the ability to apply that content (DeBarger et al., 2016). These skills/competencies, along with content, represent different components of learning that combine to form a complex construct (Ercikan & Oliveri, 2016). The integration of these components makes complex constructs multidimensional in the sense that these constructs no longer reflect content knowledge alone, but also the relevant skills/competencies required to apply it. This fundamental difference between simple and complex constructs has important implications for developing valid assessments. In order to ensure valid interpretation of assessment results, the assessment development process should focus on complex constructs modeled using relevant learning theories and supported by observed student response data (Pellegrino, Wilson, Koenig, & Beatty, 2014). The two major uses of assessment results are measuring growth in performance and assessing subdomains (Reckase, 2017). Policy makers use growth in performance on high-stakes tests for the purpose of holding schools and teachers accountable, while individual teachers and schools use assessment of subdomains to obtain information regarding student performance on assessment related to specific instructional content (Reckase, 2017). Using assessment results for evaluating growth in performance is only valid if the assumption of a common unidimensional continuum (a single latent construct reflecting science proficiency) across grades holds. Using assessment results for evaluating student performance on various subdomains is only valid if the latent dimensions presumably measured by the assessment instrument indeed reflect student performance on the subdomains of interest. Therefore, for both high-stakes and diagnostic assessment, understanding the dimensionality of a complex construct in a psychometric context is fundamental to designing valid assessments (Gorin & Mislevy, 2013). Specifically, we need to understand whether the different components of complex constructs (content and skills/competencies) manifest as separate latent dimensions psychometrically. This will help in understanding how components of complex constructs relate to dimensions of variation in student response data, leading to more meaningful interpretation of student performance on both types of assessments. The current work demonstrates a development and validation process for assessment of complex constructs focusing on latent structure validation. It starts by demonstrating the process of developing assessment instruments for complex constructs grounded in the learning theories used to develop the most current science standards. It further describes steps for creating a theoretical validity argument for measuring complex constructs, designing an operational assessment instrument based on this argument, and obtaining empirical evidence for the theoretical validity of the argument and assessment instrument. The presented validity evidence includes response process-based validity and internal latent structure-based validity. This work is situated in the context of the Framework for K-12 Science Education (the Framework), which defines deep science understanding as a student’s ability to integrate Scientific and Engineering Practices (SEPs) and Crosscutting Concepts (CCCs) to make sense of Disciplinary Core Ideas (DCIs) in the context of real-life phenomena. DCIs, SEPs and CCCs are referred to as the three dimensions of science.
Disciplinary core ideas are different from content traditionally defined in previous standards in that they represent a few fundamental ideas in science that are essential for building deep understanding and explaining phenomena. Scientific and engineering practices are the authentic practices scientists engage in when making sense of phenomena or solving problems. Crosscutting concepts are lenses one can use to look at natural phenomena when making sense of the world. A student’s ability to integrate the three dimensions of science is called “three-dimensional learning”. Three-dimensional learning (3D learning) and the vision of science education expressed in the Framework became the basis of the Next Generation Science Standards (NGSS), which are expressed as performance expectations that combine a DCI, an SEP and a CCC, and focus on explaining phenomena or solving problems using the three dimensions, thereby promoting development of knowledge application ability in students. In the context of assessment, the three dimensions of science become components of the complex constructs that need to be assessed. The theoretical premise of 3D understanding, according to the Framework, is that these three dimensions are inseparable and should be integrated together in curriculum, instruction and assessment. This argument is grounded in situated learning, which states that students cannot learn content separate from context (NRC, 2007), and in a developmental approach, which states that deep understanding takes time and appropriate scaffolding to develop (Smith, Wiser, Anderson, & Krajcik, 2006). In other words, the integration of the three dimensions discussed in the Framework suggests that these three dimensions are components of one complex construct that reflects student understanding of a specific aspect of science, and should manifest as a single latent construct in psychometric analysis. However, it is also possible that the three dimensions manifest as separate latent constructs (Gorin & Mislevy, 2013). These different outcomes for the resulting dimensionality have implications for developing 3D assessments and reporting student progress in the context of NGSS on both high-stakes and diagnostic assessment. To the author’s knowledge, no formal studies on the dimensionality of NGSS-aligned assessments have been conducted. Previously conducted studies provide compelling validity evidence based on response process, showing that the designed tasks indeed elicit the responses expected from the evidence centered design (ECD) argument (DeBarger et al., 2016; Gane, McElhaney, Zaidi, & Pellegrino, 2018; Gane, McElhaney, Zaidi, & Pellegrino, 2019). However, no research is currently available on the dimensionality of NGSS-aligned tasks, which is a prerequisite for choosing appropriate psychometric models that allow one to quantitatively evaluate the performance of items and students on the assessment and provide information about student growth. The current study builds on previously conducted research on argument-based validation (DeBarger et al., 2016; Gane et al., 2018; Gane et al., 2019), and demonstrates a detailed investigation of the dimensionality of an NGSS-aligned assessment instrument that incorporates the theoretical assumptions of 3D learning, including integration of the three dimensions, situated learning, and the developmental nature of student understanding.
The methodology presented here expands validity evidence for NGSS-aligned assessment of complex constructs beyond qualitative response process based on the ECD approach to include internal latent structure-based validity. This is an essential piece of validity required for conducting meaningful latent trait analysis. To develop 3D assessment instruments aligned to NGSS, the study employs the modified evidence-centered design (mECD) approach (Harris, Krajcik, Pellegrino, & DeBarger, 2019). The theoretical argument resulting from the mECD process specifies what aspects of the three dimensions related to the complex construct of interest are being measured, and what evidence from student answers is needed to draw conclusions about proficiency levels, which are also clearly specified as part of the argument. The degree to which observed item difficulty and overall student performance on the test are consistent with those suggested by the mECD argument provides evidence towards validity of inferences based on student performance on the assessment instrument, or response process-based validity (Geisinger, Bracken, Carlson, Hansen, Kuncel, Reise, & Rodriguez, 2013). These assessment instruments are further used to gather validity evidence based on internal latent structure analysis to show that the integration of the three dimensions suggested by theory is indeed supported in practice. Specifically, since the theoretical premise of 3D learning is that deep scientific understanding is manifested in student ability to use the three dimensions of science simultaneously when making sense of phenomena, it suggests that 3D understanding of a complex construct should be manifested as a single latent construct when evaluating the internal latent structure of an assessment instrument. For example, one of the assessment instruments in this work focuses on measuring student 3D understanding of Coulomb’s Law, which is a complex construct. Each assessment item measuring the construct contains an aspect of a disciplinary core idea related to Coulomb’s Law, a scientific practice, and a crosscutting concept. If 3D learning theory is to be supported in practice, a single latent construct should be observed in the internal latent structure analysis instead of three different ones pertaining to the DCI, CCC and SEP. To evaluate the feasibility of this theoretical statement, without imposing any supposition on the measures, two-time-point exploratory factor analysis with an invariant loading structure is used to explore the theoretical dimensionality of an assessment instrument suggested by the mECD argument across time. It is important to point out that while the Framework suggests interpreting the three dimensions of science as integral parts of a single complex construct, it is not clear whether the three dimensions manifest as separate latent dimensions psychometrically. Therefore, there are not enough theoretical grounds to initially use a confirmatory factor analysis approach, which requires rigid specification of the latent structure. Exploratory factor analysis (EFA), on the other hand, allows one to explore and generate hypotheses for the most plausible factor solution by taking into consideration the possibility that the three dimensions may in fact manifest as separate latent constructs.
Finally, the most plausible model based on the EFA results is used to conduct a confirmatory factor analysis (CFA)-based measurement invariance examination to verify that the theoretical dimensionality of the assessment instruments is supported by student response patterns across time. The CFA-based invariance analysis provides an additional source of validity evidence based on internal latent structure (Geisinger et al., 2013; Dimitrov, 2010). The following sections demonstrate in detail the process of obtaining all three types of validity evidence: response process, EFA, and CFA-based measurement invariance.

Methodology

Assessment Context: “Interactions” curriculum

The assessment instrument developed here aligns with NGSS-aligned curriculum materials for 9th grade Physical Science called “Interactions”. The curriculum focuses on helping students develop understanding of three-dimensional NGSS performance expectations (PEs) related to electrical interactions at the macroscopic and microscopic levels. The materials consist of four units: Unit 1 focuses on electric charge and forces at the macroscopic and atomic scales; Unit 2 focuses on energy and its relation to electric forces; and Units 3 and 4 apply ideas from prior units to build understanding of intermolecular interactions. Each unit has an associated assessment instrument aligned to relevant NGSS PEs, administered before and after the unit is studied in the classroom. In this paper, only the assessment instruments from Unit 1 and Unit 2 are examined because the other units were yet to be implemented when the data was collected.

Development of assessment instrument for measuring complex NGSS constructs

The modified evidence-centered design (mECD) process (Harris et al., 2019) is used to develop assessments that show evidence of 3D learning. This approach ensures that scores on every item of the test can be meaningfully interpreted in terms of what level of understanding of complex science constructs students have developed, as defined by NGSS and the Framework, and what pieces and skills students are missing to develop higher levels of understanding. The mECD process is shown in Figure 1.1, a modified schematic from Harris et al. (2019).

Figure 1.1 Summary of modified evidence centered design process

The first step of the mECD approach involves unpacking NGSS performance expectations (NGSS PEs) in order to develop a claim that describes what students should be able to do to demonstrate their 3D understanding of complex science constructs. Each claim incorporates an aspect of a DCI, an SEP and a CCC. The next step involves specifying the evidence that shows students have met the requirements of the claim. Both claim and evidence are closely related to the learning goals of the “Interactions” curriculum and aligned to NGSS PEs. Finally, assessment tasks are developed that provide the necessary evidence to measure the claim. An example of how this process was used to develop assessments is discussed below. For example, one of the NGSS PEs addressed in Unit 1 is:

NGSS PE: HS-PS2-4. Use mathematical representations of Newton’s Law of Gravitation and Coulomb’s Law to describe and predict the gravitational and electrostatic forces between objects.

Once each of the dimensions was thoroughly unpacked, a three-dimensional claim and evidence were developed to specify which part of the NGSS PE above would be assessed as part of the Unit 1 test.
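For reference, the mathematical representations named in this PE are the familiar inverse-square laws; the curriculum itself addresses only the qualitative relationships they encode, namely that the forces grow with the masses or charges and weaken with distance:

\[ F_g = G\,\frac{m_1 m_2}{r^2}, \qquad F_e = k\,\frac{|q_1 q_2|}{r^2} \]

where \(G\) is the gravitational constant, \(k\) is Coulomb's constant, \(m_1, m_2\) are the interacting masses, \(q_1, q_2\) the charges, and \(r\) the distance between the objects.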
Since the curriculum only focused on a qualitative representation of the Coulomb’s Law relationship, the parts of the NGSS PE that were assessed are those concerned with describing and predicting the electrostatic forces between objects. Table 1.1 below illustrates the process of developing assessment items to track 3D learning in more detail, beginning with the claim and evidence that stem from the unpacking process.

Table 1.1 Example of modified ECD process

Claim: Students should construct a causal model that shows how objects become charged using electron transfer to explain attractive and repulsive interactions between objects.

Evidence: Students will include these ideas in their models to explain phenomena:
1. Objects are initially neutral (# e = # p).
2. Transfer of electrons between atoms of one object and the atoms of another object causes both objects to become charged.
3. Electron transfer is caused by contact (touching or rubbing).
4. Net charge on an object is caused by gaining electrons (“-” charge) or losing electrons (“+” charge).
5. # e lost = # e gained; charge is conserved.
6. Models show a causal relationship between components of atoms (electrons), distance, and the generated electric forces and fields.

Task: Students are shown a video where fur and rod do not attract paper before they are rubbed together. Upon being rubbed together, both fur and rod start attracting paper. Draw a model that shows what happens to the rod and fur when they are rubbed together to cause the paper to move towards the rod. Make sure to label everything in your model. Describe what happens to the rod and fur during the process of rubbing them together.

The item presented in Table 1.1 requires students to engage in the SEP of Developing and Using Models using aspects of the DCI of Types of Interactions, specifically related to Coulomb’s Law. In addition, it links to the CCC of Cause and Effect, as students must provide a causal relationship between distance, charge, and the associated attractive force and electric field. Notice that the original NGSS PE contains the CCC of Patterns and the SEP of Using Mathematical and Computational Thinking. However, it is acceptable to use a CCC and SEP different from those in the PE for the purposes of assessment, as long as each item contains all three dimensions: a DCI, an SEP and a CCC. Table 1.2 below shows the scoring rubric for the item in Table 1.1. The scoring rubric is built on the idea that the three dimensions (DCI, SEP, CCC) are inseparable from each other and represent elements of one complex construct. Therefore, the score for each item is assigned based on student ability to integrate all three dimensions of a given complex construct when modeling and/or explaining a particular phenomenon.

Table 1.2 Scoring rubric example

Item: Draw a model that shows what happens to the rod and fur when they are rubbed together to cause the paper to move towards the rod. Label everything in your model. Describe what happens to the rod and fur during the process of rubbing them together.

0 pts: No answer/justification is not clear.

1 pt:
DCI (Types of Interactions):
• Neutral objects: charge represented as static/fuzz/magnet.
• Charged objects: rubbing causes static/magnetizes rod, causing attraction.
• No relationship between distance and magnitude of force (Coulomb’s Law) to explain the attractive force between charged rod and paper.
• Paper/rod have no micro-level components.
SEP and CCC (Developing and Using Models; Cause and Effect):
• Models/explanations are causal at the macro level; they show that rubbing causes a “static” effect (or “magnetic” effect) that causes paper to stick to the rod.

2 pts:
DCI (Types of Interactions):
• Neutral objects: before rubbing, rod is neutral (equal # of +/- charges; charges represented as point charges, not atoms). Paper/rod contain point charges (+/-).
• Charged objects: rubbing causes charge transfer, charging the rod (unequal # of +/-).
• The closer the charged rod is to the paper, the greater the attraction (Coulomb’s Law).
SEP and CCC (Developing and Using Models; Cause and Effect):
• Models/explanations are causal at the macro/micro level; rubbing causes transfer of charge (fur to rod); paper sticks to the charged rod because neutral and charged objects attract.

3 pts:
DCI (Types of Interactions):
• Neutral objects: before rubbing, rod is neutral (charges shown as components of the atoms making up the objects; equal # of protons (+) and electrons (-) in the atoms of the rod).
• Charged objects: rubbing causes electron transfer; atoms of the rod gain electrons.
• Charged rod attracts neutral paper when brought close to it because an electric field is generated around the charged rod, causing an attractive force between electrons in the charged rod and protons in the atoms of the neutral paper (Coulomb’s Law). (Note: mention of electric field is optional.)
• Paper/rod are made up of atoms (Bohr or probabilistic model of electrons).
SEP and CCC (Developing and Using Models; Cause and Effect):
• Models/explanations are causal at the microscopic level; they show that as the charged rod moves close to the neutral paper, a repulsive force is generated between electrons in the atoms of the paper and electrons of the rod. This force causes paper electrons to move away, resulting in temporary charge separation within the paper, which causes attraction between electrons in the rod and the temporary partial positive charge on the paper.

The scoring rubric emphasizes causal, microscopic-level mechanistic thinking, and reflects the developmental nature of student understanding. In this case, the highest score (3 pts) reflects student ability to integrate all three dimensions at the microscopic level and provide a detailed, microscopic-level causal mechanism to explain the phenomenon in question. Lower scores, on the other hand, reflect macroscopic-level thinking with a causal mechanism at the macro/micro level (2 pts), macroscopic-level thinking with elements of a causal mechanism at the macro level (1 pt), or no answer (0 pts). The scoring rubric is closely aligned to the mECD argument, and shows teachers what aspects of 3D thinking are missing from student answers if they do not get full credit. The rubric therefore guides teachers as to what aspects of the NGSS performance expectation related to Coulomb’s Law need to be further emphasized during instruction in order to help students further develop understanding of this complex construct.

Theoretical Latent Structure of Assessment Instruments

Figure 1.2 Theoretical latent structure of unit 1 assessment instrument (a single latent factor, F1: 3D understanding of Coulomb’s Law, with items Q1-Q8 loading on it)

The theoretical latent structure for the complex construct measured on the Unit 1 assessment is shown in Figure 1.2.
Unit 1 measured the complex construct of students’ 3D understanding of phenomena involving Coulomb’s Law, focusing on electrical forces, fields and charges at the macroscopic and atomic levels. There were eight 3D items. For both units, all items were open-ended and scored by a proficient team of graders using a rubric similar to the one described in Table 1.2. All eight questions assessed an aspect of the DCI of PS2: Motion and Stability: Forces and Interactions (specifically, PS2.B: Types of Interactions) and the CCC of Cause and Effect. Five questions focused on the SEP of Developing and Using Models, and three focused on the SEP of Constructing Explanations. Therefore, in total, four different aspects of the NGSS dimensions were assessed on the Unit 1 test. The items were designed in the form of testlets. The number of items in each testlet was not the same, and depended on what aspects of the phenomenon in question students were expected to focus on in their answers to fully evaluate the claim produced from the mECD process. Table 1.3 summarizes the three dimensions assessed by each item.

Table 1.3 3D structure of unit 1 assessment

Item  DCI                     SEP                           CC
1     Types of Interactions   Developing and Using Models   Cause and Effect
2     Types of Interactions   Developing and Using Models   Cause and Effect
3     Types of Interactions   Developing and Using Models   Cause and Effect
4     Types of Interactions   Constructing Explanations     Cause and Effect
5     Types of Interactions   Developing and Using Models   Cause and Effect
6     Types of Interactions   Developing and Using Models   Cause and Effect
7     Types of Interactions   Constructing Explanations     Cause and Effect
8     Types of Interactions   Constructing Explanations     Cause and Effect
(The eight items were grouped into two testlets.)

In Unit 2 students continued to investigate phenomena related to electrical interactions. They added the complex construct related to energy to their model of how charged objects interact. They also started discussing the complex construct of chemical reactions, focusing on bond-formation and bond-breaking processes and the energy changes during bond formation and bond breaking. The mECD argument discussed below describes student 3D understanding of energy and of chemical reactions as distinct complex constructs because each requires mastery of a different set of DCIs and SEPs, as specified in Table 1.4 below. In order to demonstrate mastery of both complex constructs, students should demonstrate the ability to integrate the appropriate DCIs with SEPs and CCCs. Simple recollection of facts related to the DCIs for the two constructs is not sufficient to demonstrate mastery of these complex 3D constructs. There were a total of eight items in the Unit 2 assessment instrument. The theoretical latent structure for the Unit 2 assessment instrument based on the mECD argument is shown in Figure 1.3 below.

Figure 1.3 Theoretical latent structure of unit 2 assessment instrument (two correlated latent factors: F1, 3D understanding of Energy, measured by Q1-Q5, and F2, 3D understanding of Chemical Reactions, measured by Q6-Q8)
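To make the hypothesized structures in Figures 1.2 and 1.3 concrete, the model statements below sketch how they could be specified in Mplus, the software used for the latent structure analyses reported here. This is an illustrative sketch only, not the actual input files from the Appendix: file names such as unit1.dat are hypothetical, and the two MODEL blocks belong in two separate input files.

! Unit 1: single-factor structure (Figure 1.2)
DATA:     FILE = unit1.dat;            ! hypothetical file name
VARIABLE: NAMES = q1-q8;
          CATEGORICAL = q1-q8;         ! polytomous 0-3 item scores
ANALYSIS: ESTIMATOR = WLSMV;           ! robust estimator for ordered-categorical items
MODEL:    f1 BY q1-q8;                 ! F1: 3D understanding of Coulomb's Law

! Unit 2: correlated two-factor structure (Figure 1.3), in a separate input file
MODEL:    f1 BY q1-q5;                 ! F1: 3D understanding of Energy
          f2 BY q6-q8;                 ! F2: 3D understanding of Chemical Reactions
          f1 WITH f2;                  ! factor correlation (the arrow in Figure 1.3)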
The first five questions assessed aspects of the DCI of PS3: Energy (specifically, PS3.C: relationship between energy and forces), CC of Cause and Effect, and the SEP of Developing and Using Models. Questions 6-8 assess aspects of the DCI of PS1: Matter and its Interactions (specifically, PS1.B: Chemical Reactions) and the CC of Cause and Effect. Question 6-8 focuses on SEP of Constructing Explanations. Therefore, there were total five different aspects of NGSS dimensions assessed on Unit 2 test. The items were designed in a form of testlets. The number of items in each testlet was not the same, and depended on what aspects of a phenomenon in question students needed to focus on in their answers to fully evaluate a claim from the mECD process. Additionally, F1 latent dimension had more items than F2 latent dimension because in the Unit 2 curriculum more time and instruction was focused on building student 3D 20 understanding of Energy than student 3D understanding of chemical reactions. Table 4 below summarizes the aspects of the three dimensions assessed by each item on the test. Table 1.4 3D structure of unit 2 assessment Testlet 1 Item DCI 1 2 3 4 5 Rel.btw. Energy and Forces Rel.btw. Energy and Forces Rel.btw. Energy and Forces Rel.btw. Energy and Forces Rel.btw. Energy and Forces 2 3 6 7 8 Chemical Reactions Chemical Reactions Chemical Reactions SEP Dev. and Using Models Dev. and Using Models Dev. and Using Models Dev. and Using Models Dev. and Using Models Construct. Explanations Construct. Explanations Construct. Explanations CC Cause and effect Cause and effect Cause and effect Cause and effect Cause and effect Cause and effect Cause and effect Cause and effect Sample. The assessment instruments for units 1 and 2 were administered in six schools in the Mid-West and five schools in West United States. Schools in the Mid-West were rural type with 28% free and reduced lunch. Schools in the Western part of US were urban type with 72.4% free and reduced lunch. The assessment was administered in classrooms where the “Interactions” curriculum was piloted during Fall 2016 and Spring 2017. The total sample size is 899 students. Teachers in the Mid-West schools have taught the “Interactions” curriculum prior to data collection year, and teachers in Western part of the US were first time users of the curriculum. Students on average had very little prior knowledge of the constructs measured by the two assessment instruments as based on pre-unit interview data (see Chapter1 and Chapter 3 for interview data analysis). Data Analysis. Each assessment was administered before and after the corresponding unit was completed. Each assessment item was scored on 0-3-point scale following the rubric similar to 21 the one shown in Table 2. Inter-rater reliability was established qualitatively with multiple scorers. Both EFA and CFA-based measurement invariance were conducted using proper techniques for dealing with categorical data. Details of model estimation are provided in the appendix. Reliability Reliability analysis was conducted separately on the pre- and post-tests for each unit using summed item scores following methods described in Green & Yang, 2009. Reliability analysis was conducted in SAS software. The code is provided in the Appendix. 
Exploratory Factor Analysis to Study Latent Structure of the Complex Constructs

This work uses two-time-point exploratory factor analysis (EFA) with factor loading invariance and correlated residuals across time to explore the number of latent dimensions for the Unit 1 and Unit 2 assessment instruments. This approach uses a flexible EFA framework to explore possible latent structures without the rigid specification of factor loadings required by confirmatory factor analysis (CFA) based approaches (Asparouhov & Muthén, 2009). This flexibility helps provide evidence for accurately determining the dimensionality of an assessment instrument by approximating multiple possible latent structures. Additionally, the approach imposes factor loading equality across time, which allows the latent structure to be approximated more accurately by considering the pre- and post-unit assessment data jointly. The two-time-point EFA used 40% of the sample (322 students). The EFA was conducted in Mplus, and the code is provided in the Appendix.

For Unit 1, four different aspects of the NGSS dimensions were assessed on the test (see Table 3). Therefore, EFA models with one, two, three, and four factors were estimated, and only the 1-factor model converged to an admissible solution.2 For Unit 2, although five different aspects of the NGSS dimensions were assessed on the test, EFA models with one through four factors were estimated; a 5-factor EFA model could not be estimated due to lack of degrees of freedom. Only the 2-factor model converged to an admissible solution.3 Table 5 below shows model fit for the two EFA models that converged.

Table 1.5 Summary of two time points EFA model fit for unit 1 and unit 2 assessment

Parameter  χ2     χ2 p value  CFI/TLI      RMSEA
Unit 1     174.2  <0.001      0.995/0.994  0.047
Unit 2     98.4   0.2540      0.999/0.998  0.017

The chi-square model fit test for the Unit 1 1-factor EFA solution rejects the proposed hypothesis. However, the chi-square test is sample sensitive: it rejects reasonable models if the sample size is large (Kline, 2010; McDonald & Ho, 2002). Therefore, other indexes were used to evaluate model fit, including CFI/TLI and RMSEA. As suggested in the literature, CFI/TLI > 0.900 and RMSEA < 0.08 were used as criteria (McDonald & Ho, 2002; Kline, 2010; Van de Schoot et al., 2012). Based on these criteria, the model fit for the Unit 1 1-factor EFA model is acceptable. Similarly, the Unit 2 2-factor model fit is acceptable on all indexes (chi-square, CFI/TLI, RMSEA).

2 Two-, three-, and four-factor models estimated for Unit 1 gave inadmissible solutions due to negative variances and unacceptable model fit.
3 One-, three-, and four-factor models estimated for Unit 2 gave inadmissible solutions due to negative variances and rejected model fit.

Confirmatory Factor Analysis for Unit 1 and Unit 2 at Each Time Point

Unlike in EFA, where all measured variables (items) are related to all factors at a given time point, CFA requires rigid specification of the latent structure: the analyst indicates which items load on which factors, and all other loadings are set to zero. This analysis provides additional evidence for the validity of the internal latent structure of the assessment instruments (Geisinger et al., 2013; Dimitrov, 2010). Based on the EFA results, a 1-factor CFA for Unit 1 and a 2-factor CFA for Unit 2 were used as the hypothesized structures.
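A light-weight way to probe the same dimensionality question in R is EFA on polychoric correlations. The sketch below is illustrative only: it assumes the psych package and a hypothetical data frame items of the eight scored responses at a single time point, and it ignores the two-time-point structure of the Mplus analysis reported above.

# Hedged sketch: EFA on polychoric correlations, assuming the R
# package psych; 'items' is a hypothetical single-time-point data
# frame of the eight ordered item scores.
library(psych)

for (k in 1:2) {
  efa_k <- fa(items, nfactors = k, rotate = "varimax",
              fm = "minres", cor = "poly")  # polychoric correlations
  print(efa_k$loadings, cutoff = 0.3)
  # inadmissible solutions (e.g., Heywood cases) surface as warnings
  # or as communalities above 1
}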
The CFA was conducted in Mplus software (code provided in the Appendix) using the reserved 60% of the sample (577 students), separately for the pre- and post-test data for Units 1 and 2, to test the plausibility of the hypotheses suggested by EFA. The CFA model fit results for both units on the pre- and post-assessments are shown in Table 6 below.

Table 1.6 Summary of CFA model fit for unit 1 pre/post and unit 2 pre/post

Parameter    χ2    χ2 p value  CFI/TLI      RMSEA
Unit 1 pre   49.5  <0.001      0.997/0.996  0.041
Unit 1 post  61    <0.001      0.998/0.998  0.048
Unit 2 pre   82    <0.001      0.984/0.978  0.040
Unit 2 post  119   <0.001      0.996/0.995  0.053

Following the same model fit guidelines used for EFA, model fit is acceptable for all four models based on the CFI/TLI > 0.900 and RMSEA < 0.08 criteria (McDonald & Ho, 2002; Kline, 2010; Van de Schoot et al., 2012). CFA-based measurement invariance analysis is conducted next to ascertain that the same constructs are measured over time for Units 1 and 2.

Measurement Invariance Analysis

The latent factor structure for the configural CFA-based measurement invariance analysis for the Unit 1 assessment is shown in Figure 4. One-way arrows from a latent factor (e.g., F1T1) to items (e.g., Q1T1, Q2T1) represent factor loadings at each time point (T1 for pre, T2 for post).

Figure 1.4 Measurement invariance model for unit 1 assessment instrument

Figure 1.5 Measurement invariance model for unit 2 assessment instrument

The latent construct F1T1 represents 3D Understanding of Coulomb's Law. Figure 5 shows the configural measurement invariance model for Unit 2. The two latent constructs for Unit 2 are F1T1 (3D Understanding of Energy) and F2T1 (3D Understanding of Chemical Reactions). To assess measurement invariance across time, a series of four nested hierarchical models is tested: configural/form invariance, weak/loading invariance, strong/threshold invariance, and strict invariance (Van de Schoot, Lugtig, & Hox, 2012; Liu, Millsap, West, Tein, Tanaka, & Grimm, 2017; Dimitrov, 2010).

The configural invariance model represents the most basic type of invariance and tests the hypothesis that items load on the same constructs across time. In the configural model, factor loadings, intercepts, and the unique factor variance matrix are freely estimated across time (Liu et al., 2017). Once configural invariance is established, the subsequent models are estimated by sequentially adding constraints to these three sets of parameters. If configural invariance cannot be established, the latent construct of interest is not represented by the same number of factors and the same pattern of loadings across time, indicating that the construct changes over time; higher-order invariance then cannot be tested (Van de Schoot et al., 2012).

Weak (or loading) invariance tests the hypothesis that factor loadings are equal across time (Van de Schoot et al., 2012; Liu et al., 2017; Dimitrov, 2010). Factor loadings for the same items are therefore constrained to be equal on the pre- and post-test. A factor loading reflects the degree to which differences in students' responses to an item reflect differences in their levels on the construct being measured. To assess weak invariance, the plausibility of the equal-loading constraint is tested using the DIFF test function in Mplus to compare the weak invariance model fit to the configural invariance model fit (Asparouhov, Muthén, & Muthén, 2006).
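An analogous comparison can be sketched in R with lavaan, whose lavTestLRT() performs a scaled chi-square difference test for nested WLSMV models. The sketch below is illustrative only: 'wide' is a hypothetical data frame with pre (QkT1) and post (QkT2) item scores, and the threshold and marker-variable identification constraints detailed in the Appendix are omitted for brevity.

# Hedged sketch: configural vs. weak (equal-loading) longitudinal
# models and a scaled difference test, assuming lavaan; 'wide' is a
# hypothetical data frame; identification details are simplified.
library(lavaan)

items <- paste0("Q", 1:8)
# lagged residual covariances: the same item measured at both times
lagged <- paste0(items, "T1 ~~ ", items, "T2", collapse = "\n")

configural <- paste(
  "F1T1 =~", paste0(items, "T1", collapse = " + "), "\n",
  "F1T2 =~", paste0(items, "T2", collapse = " + "), "\n", lagged)

lab <- paste0("l", 1:8, "*")  # equality labels for the loadings
weak <- paste(
  "F1T1 =~", paste0(lab, items, "T1", collapse = " + "), "\n",
  "F1T2 =~", paste0(lab, items, "T2", collapse = " + "), "\n", lagged)

fit_conf <- cfa(configural, data = wide, ordered = TRUE,
                estimator = "WLSMV", parameterization = "theta")
fit_weak <- cfa(weak, data = wide, ordered = TRUE,
                estimator = "WLSMV", parameterization = "theta")
lavTestLRT(fit_conf, fit_weak)  # analogue of the Mplus DIFF test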
The DIFF test allows accurate comparison of model fit between nested ordered-categorical CFA models (Liu et al., 2017). If there is no significant difference in model fit, factor loadings are invariant across administrations.

For strong invariance, both factor loadings and thresholds (indicator intercepts, or means) have to be equal across administrations. Strong invariance allows factor means to be compared across administrations: it indicates that differences in observed mean scores on the items on the pre- and post-test can be attributed to differences in the latent common factor means (Liu et al., 2017). To establish strong invariance, the additional constraint of equal thresholds is imposed and tested by comparing the strong invariance model fit to the weak invariance model fit using the DIFF test. If the difference in fit is not significant, strong invariance is supported.

Finally, strict invariance sets the corresponding factor loadings, intercepts, and unique factor variances equal over time. The strict invariance model tests whether residual error variance is equivalent across administrations (Van de Schoot et al., 2012). Strict invariance is evaluated in the same fashion, by conducting a DIFF test comparing the fit of the strict invariance model to the strong invariance model. If strict invariance is supported, the construct is measured with the same precision across administrations, and the difference in pre- and post-test scores on every item is due only to the difference in the level of the factor. Details of measurement invariance model identification and the Mplus code are provided in the Appendix.

Further, if the DIFF test for a given measurement invariance model is significant, modification indexes are examined for the corresponding parameters to decide which of the constrained parameters can be freed to achieve better model fit (Dimitrov, 2010). The modification index (MI) for a given parameter indicates the expected drop in the model's chi-square value if that parameter is freely estimated; generally, MI > 3.84 indicates statistical significance (Dimitrov, 2010). If MIs above 3.84 were identified for a measurement invariance model with a significant DIFF test but acceptable RMSEA and CFI values, the corresponding parameters were freed, starting from the largest MI, until a nonsignificant DIFF test chi-square value was obtained. Higher-level invariance was then tested, with the DIFF test comparing the fit of the higher-order model to the modified lower-level invariance model (Dimitrov, 2010).

Results

Reliability

Table 7 shows the reliability coefficients. In general, the tests are highly reliable, with the post-test being more reliable than the pre-test for the given sample of examinees for both Units 1 and 2.

Table 1.7 Reliability

Unit              EFA Reliability Coefficient
Unit 1 pre test   0.872
Unit 1 post test  0.934
Unit 2 pre test   0.823
Unit 2 post test  0.932

Unit 1 EFA model fit analysis is shown in Table 8. If, contrary to the theory proposed in the Framework, the three dimensions of science (DCI, SEP, CC) are distinct constructs rather than integral parts of student understanding of the same complex construct, a better model fit should be observed for models with a larger number of factors and potential cross-loadings than for the theoretically suggested 1-factor model. The plausibility of 1- to 4-factor structures was tested, and only the 1-factor model converged to an admissible solution.
Model fit for the 1-factor EFA model for Unit 1 is shown in Table 8. The chi-square model fit test for the 1-factor solution rejects the proposed hypothesis. However, the chi-square test is sample sensitive: it rejects reasonable models if the sample size is large (Kline, 2010; McDonald & Ho, 2002).

Table 1.8 Two time points EFA model fit for unit 1 assessment

Parameter  χ2     χ2 p-value  CFI/TLI      RMSEA
1 Factor   174.2  <0.001      0.995/0.994  0.047

Therefore, other indexes were used to evaluate model fit, including CFI/TLI and RMSEA. As suggested in the literature, CFI/TLI > 0.900 and RMSEA < 0.08 were used as criteria (McDonald & Ho, 2002; Kline, 2010; Van de Schoot et al., 2012). Based on these criteria, the model fit for the 1-factor model is acceptable. Table 9 shows that all factor loadings for both the Unit 1 pre- and post-assessments are above 0.5 (Hair, Black, Babin, Anderson, & Tatham, 2009), suggesting that each item measures the dimension of interest reasonably well.

Table 1.9 Two time points EFA factor loadings for unit 1 assessment

Pre and Post Unit 1 Test  Factor Loading Estimate  Standard Error
Q1                        0.724                    0.032
Q2                        0.760                    0.026
Q3                        0.907                    0.013
Q4                        0.897                    0.015
Q5                        0.908                    0.015
Q6                        0.841                    0.018
Q7                        0.876                    0.018
Q8                        0.834                    0.019

The Unit 2 two-time-point EFA was conducted in the same way. Latent structures with 1-4 factors were explored, and only the 2-factor solution converged to an admissible solution. Following the same guidelines as before, the theoretically proposed 2-factor solution has good model fit, as shown in Table 10. Table 11 indicates that the theoretically proposed factor loading structure is also observed in the EFA: questions 1-5 load on factor 1 with loadings all above 0.5, and questions 6-8 load on factor 2 with loadings all above 0.5.

Table 1.10 Two time points EFA model fit for unit 2 assessment

Parameter  χ2    χ2 p-value  CFI/TLI      RMSEA
2 Factors  98.4  0.2540      0.999/0.998  0.017

There are no cross-loadings above 0.5, supporting the plausibility of the theoretically proposed latent structure. To summarize, the two-time-point EFA showed that the theoretically proposed 1-factor model for Unit 1 and 2-factor model for Unit 2 are plausible.

Table 1.11 Two time points EFA factor loadings for unit 2 assessment

Item  Loading F1  SE(Loading) F1  Loading F2  SE(Loading) F2
Q1    0.692       0.071           0.233       0.072
Q2    0.651       0.075           0.276       0.082
Q3    0.755       0.072           0.100       0.079
Q4    0.693       0.065           0.239       0.065
Q5    0.541       0.099           0.294       0.100
Q6    0.277       0.082           0.606       0.077
Q7    -0.052      0.049           0.909       0.066
Q8    0.183       0.089           0.696       0.092

Unit 1 Measurement Invariance Analysis Results

Table 12 shows the results of the measurement invariance analysis for Unit 1. Configural invariance has good model fit (CFI/TLI > 0.995, RMSEA < 0.05). Weak invariance is supported based on the DIFF test. For all models, RMSEA is below 0.05 (Rutkowski & Svetina, 2017) and CFI/TLI are above 0.995 (Asparouhov et al., 2006), indicating good model fit.

Table 1.12 Measurement invariance analysis for unit 1 pre/post assessment

Parameter   χ2       df   CFI    TLI    RMSEA  DIFF test  DIFF test df  DIFF test p
Configural  148.1    93   0.998  0.997  0.032  N/A        N/A           N/A
Weak        160      100  0.998  0.997  0.032  13.1       7             0.0529
Strong(1)   171.993  105  0.997  0.997  0.033  13.7       5             0.0177
StrongM(2)  164.3    104  0.998  0.997  0.032  6.6        4             0.1595
Strict(3)   194.9    111  0.997  0.996  0.036  29.5       7             <0.001
StrictM(4)  178.1    110  0.997  0.997  0.033  17.032     6             0.0092
StrictM2(5) 166.5    109  0.998  0.997  0.030  7.763      5             0.1698

(1) All thresholds fixed; (2) Threshold for item 3 freed; (3) Residual variance for item 3 freed only; (4) Residual variances for items 1 and 3 freed; (5) Residual variances for items 1, 2, and 3 freed.
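Alongside the DIFF test, the changes in approximate fit indexes between nested models can be computed directly; the following continues the hypothetical lavaan set-up from the earlier sketch.

# Hedged sketch: change in scaled CFI and RMSEA between nested
# invariance models (values near zero support invariance)
idx <- c("cfi.scaled", "rmsea.scaled")
round(fitMeasures(fit_weak, idx) - fitMeasures(fit_conf, idx), 4)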
Even though strong invariance did not yield a satisfactory chi-square difference test (p(DIFF) = 0.0177), the RMSEA value was below 0.05, ∆RMSEA was below 0.05, and the change in CFI was within the −0.004 criterion (Rutkowski & Svetina, 2017), all of which indicate good model fit. Examination of modification indexes for strong invariance showed a high modification index (8.395) for item 3. The threshold for item 3 was freed, which resulted in a nonsignificant DIFF test p value of 0.1595.

Strict invariance was evaluated next. The strict invariance model is nested in the StrongM model, and therefore the residual variance for item 3 was freed. The resulting p(DIFF) < 0.001 is significant. Modification indexes for residual variances indicated large MIs for items 1 and 2. The item 1 residual variance on the post-test was freed first, as it had the largest modification index (20.2), which resulted in a larger but still significant p(DIFF) = 0.0092. The residual variance of item 2 on the post-test was then freed (MI = 13.5), which resulted in a nonsignificant p(DIFF) = 0.1698, shown for model StrictM2 in Table 12.

Overall, based on the DIFF test, RMSEA, and CFI indexes, partial measurement invariance is supported for Unit 1, with invariance of all factor loadings, all thresholds except one for item 3, and all residual variances except three (items 1, 2, and 3). In total, the invariance of 35 parameters was evaluated (14 loadings, 12 thresholds, 9 unique variances), and 5 were freed during the evaluation. The proportion of invariant parameters is therefore 30/35 = 0.85, or 85%, and 15% of parameters were freed. According to the literature, if 20% or fewer of the parameters are freed in the process of establishing partial measurement invariance, the results can be used for practical applications (Dimitrov, 2010). Establishing partial measurement invariance for Unit 1 thus provides additional evidence for the validity of the internal latent structure of the instrument and indicates that differences in student performance on the pre- and post-unit 1 tests can be attributed to differences in the level of the latent factor measured.

Unit 2 Measurement Invariance Analysis Results

As shown in Table 13, configural invariance has good model fit: RMSEA below 0.055 and CFI/TLI above 0.995 (Rutkowski & Svetina, 2017; Asparouhov et al., 2006). Next, the DIFF test for weak invariance, with all loadings constrained equal across time, yielded a significant p < 0.001. However, the weak invariance model has good fit based on the RMSEA value (below 0.05; Rutkowski & Svetina, 2017) and acceptable fit based on CFI/TLI > 0.9 (Asparouhov et al., 2006). Further, ∆RMSEA was below 0.05 and the change in CFI was within the −0.004 criterion (Rutkowski & Svetina, 2017), providing evidence to support weak invariance. Modification indexes for factor loadings showed a high value for item 5 (29.1); freeing the loading for item 5 resulted in a nonsignificant p(DIFF) = 0.1649 for model WeakM.

Strong invariance was examined next. The strong invariance model is nested in the WeakM model, and therefore the threshold for item 5 was freed. The strong invariance model has a significant p(DIFF) = 0.0022 but good fit based on RMSEA, CFI, ∆RMSEA, and ∆CFI. Examination of modification indexes showed a high MI for item 1 (16.3). The threshold parameter for item 1 was freed, resulting in a nonsignificant p(DIFF) = 0.6433 for StrongM.
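The modification-index step described above can likewise be sketched in R. In lavaan, modification indices for equality constraints correspond to univariate score tests, exposed via lavTestScore(); this continues the hypothetical set-up (the study itself used Mplus MODINDICES output).

# Hedged sketch: score tests for releasing each equality constraint in
# the weak invariance model; values above 3.84 correspond to p < .05
# on 1 df, mirroring the MI > 3.84 rule used in this chapter.
st <- lavTestScore(fit_weak)
st$uni  # one row per constraint: X2, df, p-value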
Table 1.13 Measurement invariance analysis for unit 2 pre/post assessment

Parameter   χ2       df   CFI    TLI    RMSEA  DIFF test  DIFF test df  DIFF test p
Configural  138.419  90   0.997  0.996  0.031  N/A        N/A           N/A
Weak(1)     172.4    96   0.995  0.994  0.037  31.4       6             <0.001
WeakM(2)    144.2    95   0.997  0.996  0.030  7.847      5             0.1649
Strong(3)   161.2    100  0.996  0.995  0.033  18.6       5             0.0022
StrongM(4)  145.6    99   0.997  0.996  0.029  2.508      4             0.6433
Strict      159.5    105  0.997  0.996  0.030  15.9       6             0.0141

(1) All factor loadings fixed; (2) Factor loading for item 5 on the pre- and post-test freed; (3) All intercepts except for item 5 fixed; (4) Intercepts freed for items 5 and 1.

Finally, strict invariance was examined. The DIFF test p value was significant, indicating unsatisfactory fit by that criterion. However, RMSEA < 0.05, CFI/TLI > 0.9, ∆RMSEA < 0.05, and a change in CFI within the −0.004 criterion indicate good model fit for the strict invariance model. No modification indexes could have improved model fit at the strict invariance level. Therefore, there is evidence for strict invariance based on the CFI/TLI and RMSEA indexes, but not based on the DIFF test.

Following the same procedure as described for Unit 1, the invariance of 33 parameters was evaluated (12 loadings, 12 thresholds, 9 unique variances), and 6 parameters were freed during estimation. A total of (33 − 6)/33 = 0.82, or 82%, of the parameters were invariant, and 18% were freed. Since fewer than 20% of the parameters were freed in achieving partial strong measurement invariance for Unit 2, the results are acceptable for practical applications (Dimitrov, 2010). This analysis provides evidence for partial measurement invariance: invariance of the latent factor structure on the pre- and post-test (configural invariance), invariance of all factor loadings except for item 5 (weak invariance), and invariance of all thresholds except for items 1 and 5 (strong invariance). There is also evidence of equality of residual errors across administrations (strict invariance) based on the values and magnitudes of change of the RMSEA and CFI/TLI indexes (Rutkowski & Svetina, 2017; Asparouhov et al., 2006). Therefore, establishing partial measurement invariance provides additional evidence for the validity of the internal latent structure of the instrument and indicates that differences in student performance on the pre- and post-unit 2 tests can be attributed to differences in the level of the latent factors measured by the instrument.

Consistency of mECD argument with student performance on the test

The EFA and CFA-based measurement invariance analyses conducted above provide evidence for the validity of the theoretically suggested internal latent structure of both the Unit 1 and Unit 2 assessment instruments. Factor scores resulting from the CFA-based measurement invariance analysis reflect the level of student understanding of the latent construct being measured. The mECD argument hypothesizes that the scoring rubric for each item should be consistent with levels of the latent variable being measured, as reflected by factor scores: higher latent factor scores should correspond to higher observed (raw) scores on each item. If this trend is observed in practice, it provides an additional source of validity evidence based on response process. It is important to note that, on both the Unit 1 and Unit 2 pre- and post-tests, the highest score of 3 was not observed on any item.
A possible explanation is that, consistent with the developmental approach, student 3D understanding takes time to develop. By the end of Units 1 and 2, which are each approximately three months long, students had not yet developed the microscopic-level causal thinking required to achieve a score of 3 on each item. As they progress further in the curriculum, they will likely develop higher levels of 3D understanding.

The next step compares average factor scores on each item, which reflect student level of understanding of the latent construct being measured, with the observed scores on each item across time. Figure 6 shows the 95% confidence interval for factor scores at different levels of raw scores for each item on the pre- and post-unit 1 assessment. As the graph shows, the hypothesized trend of higher raw scores corresponding to higher factor scores holds on the pre- and post-test at each raw score level: for raw scores of 0, 1, and 2, the corresponding factor score for each of the eight items is consistently higher, and the separation between raw score levels is evident in the graph.

Figure 1.6 95% Confidence interval for factor scores at different level of observed (raw) score for unit 1 assessment instrument

Figure 1.7 Confidence interval (95%) for factor scores at different level of observed (raw) score for unit 2 assessment instrument

However, the degree of separation differs between the pre- and post-test. Specifically, on the pre-test the trend holds for raw scores of 0 and 1, but a raw score of 2 corresponds to only a slightly higher factor score than a raw score of 1 for all eight items, so the separation of levels suggested by the mECD argument is not very pronounced. A possible explanation is that on the pre-test, students' overall level of 3D understanding was much lower than the level measured by the assessment instrument, which is supported by the fact that most students scored very low on the pre-test. This is not surprising, since no prior knowledge of the construct of interest was assumed on the pre-test. It was therefore harder to differentiate between students of lower ability on the pre-test, likely due to the lack of sufficiently easy items measuring lower factor levels. This explanation is also supported by the fact that the pre-test had a slightly lower reliability coefficient than the post-test for both units, suggesting lower sensitivity of the pre-test, possibly due to the lack of easy items measuring lower-level understanding. In the future, it might be beneficial to include items measuring lower-level 3D understanding to ensure better level separation on the pre-test.

A similar trend holds for the Unit 2 assessment items shown in Figure 7. The pre-test likewise does not provide good separation between factor score levels as related to raw scores, especially for raw score categories 1 and 2 on both factors. The post-test, however, differentiates much better between factor score levels and provides clear level separation for each observed raw score category. The constructs of 3D understanding of Energy (Factor 1) and Chemical Reactions (Factor 2) were new to students at the beginning of Unit 2, and pre-unit 2 test performance was generally very poor. It is possible that, as with Unit 1, the Unit 2 test lacked sufficiently easy items to measure student 3D understanding at lower levels of the latent construct.
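The factor-score comparison shown in Figures 1.6 and 1.7 can be sketched as follows, continuing the hypothetical lavaan set-up (fit_rel and unit1_pre from the reliability sketch); this is illustrative and not the plotting code used to produce the figures.

# Hedged sketch: 95% confidence interval of the mean factor score at
# each observed raw-score level of an item, as in Figures 1.6/1.7.
fscores <- lavPredict(fit_rel)[, 1]  # model-based factor scores

ci_by_raw <- function(item) {
  sapply(sort(unique(item)), function(s) {
    fs <- fscores[item == s]
    m  <- mean(fs); se <- sd(fs) / sqrt(length(fs))
    c(raw = s, lower = m - 1.96 * se, mean = m, upper = m + 1.96 * se)
  })
}
ci_by_raw(unit1_pre$Q1)  # higher raw scores should give higher intervals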
Developing and including easier items of this kind in future test revisions would be useful. In general, the Unit 1 and Unit 2 post-test results are consistent with the mECD argument regarding the hypothesized distribution of difficulty levels for each item, which provides validity evidence based on response process for both assessment instruments.

Notably, the factor scores from the analysis software were not standardized to have a mean of 0.0 and a standard deviation of 1.0; other constraints were applied to set the scale of the solution. As a result, the mean pre- and post-test factor scores are not directly comparable. However, since partial strong invariance is supported, this comparison becomes possible once proper standardization of factor scores is carried out. This extension of the analysis is beyond the scope of the current study and will be investigated in the studies that follow.

Discussion

Developing valid assessments for evaluating student understanding of complex constructs is essential for successful implementation of the NGSS and for helping students develop 21st-century skills and competencies. To draw accurate conclusions about student understanding, it is necessary to provide sound validity arguments grounded in both theoretical and empirical evidence (Kane, 2016; Messick, 1995). The Framework for K-12 Science Education describes a complex construct in science as the ability to blend Scientific and Engineering Practices (SEPs) and Crosscutting Concepts (CCCs) to make sense of the Disciplinary Core Ideas (DCIs) in the context of real-life phenomena. This ability is achieved through the process of 3D learning, grounded in the developmental approach and situated cognition theories. To make inferences about student progress in 3D learning, validity evidence is needed that shows how the theory behind complex 3D constructs relates to observed student response patterns. An argument-based approach with elements of evidence-centered design (ECD) has been suggested by multiple researchers for developing assessments of complex constructs (Pellegrino et al., 2014; Harris et al., 2019; Huff, Steinberg, & Matts, 2010). The mECD approach makes it possible to clearly specify, in a theoretical argument, the aspects of content, skills, and types of responses that reflect levels of understanding of a given complex construct (Mislevy, 2009; Mislevy & Haertel, 2006).

Previous studies have used the ECD approach to develop 3D assessments and have shown, using student interviews and detailed analysis of student response patterns, that the theoretical objectives outlined in an ECD argument are supported by student response data (DeBarger, 2016; Gane, 2018; Gane, 2019). These studies provide rich evidence of the usefulness of ECD-based methods for designing valid 3D assessments and give educators valuable information about how student 3D understanding develops in the classroom. However, to draw more quantitative inferences about student progress on complex 3D constructs, it is necessary to conduct mathematical modeling to obtain standardized item parameters, including item difficulty, using IRT approaches, for example. Such analysis can further be used to develop standardized measures of student progress for both diagnostic and high-stakes assessment. To select a mathematical model that will provide accurate item parameters, it is essential to investigate the dimensionality of a given assessment instrument.
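As one concrete illustration of such a next step (not an analysis performed in this chapter), a unidimensional graded response model could be fit in R, assuming the mirt package and the hypothetical item data used in the earlier sketches.

# Hedged sketch: graded response model yielding discrimination (a)
# and difficulty (b) parameters per item, assuming the R package mirt.
library(mirt)

grm <- mirt(items, model = 1, itemtype = "graded")
coef(grm, IRTpars = TRUE, simplify = TRUE)$items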
Complex constructs such as those that arise from the 3D learning process are inherently multidimensional because they combine elements of a DCI, an SEP, and a CCC (Gorin & Mislevy, 2013). However, the Framework also suggests that the blending of the three dimensions reflects deep understanding of science, implying that complex NGSS constructs may manifest as a single latent construct.

The studies on 3D assessment development and validation conducted so far assume unidimensional 3D constructs and use unidimensional IRT models to obtain item parameters (DeBarger, 2016; Gane, 2018; Gane, 2019). However, if the assumption of unidimensionality does not hold, the obtained item parameters cannot be used to draw valid conclusions about student performance. To the author's knowledge, this study represents the first investigation of the dimensionality of a 3D assessment instrument.

The current work builds on previously published work using the mECD approach (DeBarger, 2016; Gane, 2018; Gane, 2019; Harris et al., 2019) to develop assessment instruments grounded in 3D learning theories and aligned to the NGSS. It further presents a methodology for evaluating the internal latent structure of 3D assessment instruments developed using the mECD approach, showing that the theoretical dimensionality specified by the mECD argument is confirmed in practice. Specifically, it shows that the blending of the three dimensions of NGSS (DCI, SEP, CC) suggested by the Framework is confirmed by empirical evidence.

This work provides multiple sources of evidence for the validity of this assumption. First, two-time-point EFA is used to show that the most plausible latent structure for both assessment instruments is the one hypothesized by the mECD argument. Since the analysis is exploratory, it allows investigating the possibility of DCI, SEP, and CC manifesting as separate latent constructs by estimating 1-, 2-, and 3-factor EFA models. The results suggest that for Unit 1, the most plausible latent structure of the assessment instrument is unidimensional, with all loadings above 0.5 and small standard errors (Tables 8 and 9). Similarly, for Unit 2, the most plausible latent structure is two-dimensional, as suggested by the mECD argument, with the loading pattern supporting the hypothesized structure (Table 11). Further, CFA-based measurement invariance was used to gain additional evidence for the plausibility of the latent structure suggested by EFA. Since CFA-based measurement invariance requires rigid specification of the latent structure, it provides confirmatory evidence for the validity of the internal latent structure hypothesized by the mECD argument. Also, since partial measurement invariance is supported for both units, student performance on the pre- and post-tests can be compared to draw more accurate conclusions about how student 3D understanding developed during each unit, which will be the focus of future work. Establishing partial measurement invariance also serves as a source of validity evidence based on internal latent structure (Geisinger et al., 2013; Dimitrov, 2010). Once the evidence for latent dimensionality is evaluated, CFA-based factor scores are compared with the item difficulty hypothesized by the mECD argument, as reflected in the observed scores.
As can be seen in Figures 6 and 7, for both units increasing factor scores for all items correspond to increasing observed scores, suggesting that the assumptions of the mECD argument are indeed supported by student response data. This provides additional validity evidence based on response process.

The dimensionality analysis presented here is the first study investigating the relationship between the theoretical dimensionality suggested by 3D learning theories and the empirically tested latent dimensionality of student response data. The results suggest that 3D tasks are better described by single-factor models, which has several implications for the assessment of 3D constructs.

The first implication concerns how the NGSS dimensions that comprise 3D tasks should be analyzed. One of the major assessment challenges for 3D tasks, as mentioned above, is evaluating their dimensionality, which is a prerequisite for accurate scaling and reporting of any assessment. Since these tasks are inherently multidimensional, and the various NGSS dimensions are likely to contribute to individual items and to the overall assessment to different extents, it has been suggested that NGSS assessments will most likely present a complex multidimensional structure that is difficult, if not impossible, to handle with present-day measurement techniques (Gorin & Mislevy, 2013). The current work suggests that the three dimensions in fact manifest as a single latent construct, and therefore 3D tasks need to be analyzed as a whole rather than as separate dimensions, making unidimensional models a potentially appropriate modeling tool for 3D tasks. This is an important implication because it could make unidimensionality, which is always the goal in educational testing, considerably more achievable, or significantly reduce the dimensionality of multidimensional tests, as demonstrated for the Unit 2 assessment instrument in this study. This would make psychometric analysis of NGSS assessments more feasible in practice by reducing the time and resources required for such analysis.

The second implication relates to the cognitive and instructional inferences that can be made based on the current work. The finding that 3D tasks are better described by single-factor models indicates the complexity of 3D constructs: a 3D construct combines the three dimensions of NGSS (SEPs, CCCs, and DCIs), as opposed to being three separate constructs. A 3D construct is therefore a complex conceptual dimension that manifests as a single psychometric dimension. This finding is consistent with the situated cognition premise stated in the Framework and NGSS, which holds that learning content (DCIs) is inseparable from engaging in practices (SEPs) and crosscutting concepts (CCCs) along with the content. This has important instructional implications, including that students cannot gain deep understanding of the content (DCIs) without the context (CCCs and SEPs). For instance, if a student can construct a model in physical science, it does not mean that the student can construct a model in the context of biology, since the content in the two cases is very different. For assessment design, this finding suggests that content should not be measured separately from context, and that the three dimensions of NGSS should be integrated in both instruction and assessment.
The third implication relates to the need to follow a systematic process in assessment design to ensure alignment between the NGSS standards and the assessment items. In the context of this work, it is important to emphasize that while the assessment items were administered in the context of the "Interactions" curriculum, the items were designed to align to NGSS PEs, not to the curriculum learning goals. As shown in the methods section, NGSS PEs were carefully unpacked using the mECD process to specify the aspects of the three dimensions targeted in the assessment. I believe that the results obtained in this study are due in part to the good alignment between the NGSS PEs and the assessment instruments that resulted from following the mECD process. While the mECD process is somewhat time-consuming at the initial development stage, because it requires careful detailing of what will be assessed and how it relates to NGSS PEs, the result is well-aligned assessments that provide accurate information about the degree of student understanding of NGSS PEs and, as this work suggests, demonstrate the good psychometric properties that are essential for making valid conclusions based on assessment results. Similarly, the "Interactions" curriculum was built following the same principles of aligning NGSS PEs and curriculum learning goals. The assessment was unpacked separately from the curriculum, with a new set of mECD arguments designed for the assessment, ensuring that the assessment was entirely independent of the curriculum. The unpacking through the mECD process, however, ensured that the entire system, including the curriculum and the assessment, is tightly aligned to the NGSS, which produced the informative outcomes demonstrated here, specifically concerning the dimensionality of the assessment instruments.

If assessment developers do not follow systematic procedures like mECD (Harris et al., 2019), the alignment of the resulting assessments with the standards remains largely unspecified and implied, and the assessments essentially become a "black box." As a result, inferences made from those assessments about student progress toward mastering NGSS PEs will have little validity, which is a fundamental property of any well-designed assessment (Messick, 1995). As suggested by multiple documents, alignment between standards, assessment, and curriculum, as well as instruction and professional development, is a critical feature of any well-functioning system of assessment (National Academies of Sciences, Engineering, and Medicine, 2019; NRC, 2013a; NRC, 2007). To achieve good alignment, it is critical to follow a systematic process such as mECD (Harris et al., 2019). Lack of alignment results in a broken system in which curriculum, instruction, and assessment are disconnected and do not share the same learning goals. The need for alignment is not unique to the context described in this work, but applies to other fields as well.

One of the limitations of this work is that it focuses on a very small number of NGSS PEs, including a small number of DCIs, SEPs, and CCs. To ensure the reproducibility of the results presented here, studies need to be conducted both in other domains and with a different set of NGSS PEs. Additionally, the current work did not examine the instructional aspect of alignment specifically, and therefore no conclusions can be drawn about fidelity of implementation of the "Interactions" curriculum.
However, as part of implementing the "Interactions" curriculum, summer professional development (PD) sessions were conducted with teachers, including a 3-day PD that focused on demonstrating the fundamental design principles of the curriculum and engaging teachers in student activities. For example, teachers experienced various electrostatic phenomena and went through the process of modeling and revising their models as part of the learning process, to ensure they understood what is required of their students. Additionally, "Interactions" had extensive teacher materials available to teachers at any time. In the future, however, fidelity of implementation is important to examine, to ensure that the results of the study are reproducible irrespective of the implementation context.

Another important point to emphasize is that dimensionality is a function of instruction and of item sensitivity to instruction (Lord, 1976). Since the assessment instruments presented here were validated in the context of the "Interactions" curriculum, it is possible that, with a different population of students who did not experience the "Interactions" curriculum, the dimensionality of both assessment instruments might be different. However, this would not necessarily disqualify the assumptions about dimensionality reflected in the mECD argument. Different dimensionality would imply that students' ability to blend the three NGSS dimensions assessed on the test is not as uniform in a population that did not go through the "Interactions" curriculum. At the same time, the assessment items used in this study were aligned to NGSS PEs. Thus, it is reasonable to assume that, for a population of students who had instruction on the same DCIs and SEPs as those targeted by the items on the assessment instruments presented here, the response pattern should be similar to that of students who went through the "Interactions" curriculum, and therefore the dimensionality of both instruments should not change. Nevertheless, more research is needed on how the dimensions of complex NGSS constructs manifest in response patterns as a function of different instructional content (Gorin & Mislevy, 2013).

Related to the previous point, if a different set of SEPs and CCCs were chosen to evaluate student ability to integrate the three dimensions with the same DCIs as presented here, the resulting dimensionality might differ if students did not have the same opportunities to practice integrating the new SEPs and CCCs with the DCIs. In the language of situated cognition theory, if students did not have opportunities to practice integrating the DCIs with a different set of SEPs and CCCs than those presented here, these new SEPs and CCCs might contribute to the latent trait being measured to different degrees depending on students' familiarity with them, which might result in a different dimensionality of the assessment instrument. This is another important issue to investigate in future studies. However, returning to the Framework, which defines 3D learning as the ability to integrate SEPs and CCCs to make sense of DCIs, it becomes clear that possible multidimensionality resulting from differing degrees of contribution of the NGSS dimensions to the overarching complex construct should be viewed as something to be avoided rather than aimed for.
This is because the Framework defines science proficiency as the ability to integrate the three dimensions, so attempting to measure the dimensions separately in an assessment is uninformative for evaluating student knowledge-in-use and goes against the very principles outlined in the Framework and NGSS. Instead, NGSS-aligned assessments should consist of tasks that are tightly integrated across the relevant DCIs, SEPs, and CCCs and that measure a well-defined complex construct of interest resulting from careful unpacking of NGSS PEs following a systematic process such as mECD (Harris et al., 2019). Only tasks designed following these principles will be informative for drawing conclusions about student ability to integrate SEPs and CCCs to make sense of DCIs, that is, their degree of 3D understanding, which is the ultimate goal of the Framework and NGSS.

APPENDIX

Reliability

The model-implied polychoric correlation matrix was computed in RStudio using the code below.

#####Pre Unit 1#####
# lambda: factor loading matrix (items x factors)
lambda <- matrix(c(0.780, 0.752, 0.939, 0.832, 0.887, 0.830, 0.808, 0.765), nrow = 8)
# fcor: latent factor correlation matrix (factors x factors);
# for Unit 2: matrix(c(1, 0.62, 0.62, 1), nrow = 2)
fcor <- matrix(c(1), nrow = 1)
# model-implied polychoric correlation matrix
POLYR <- lambda %*% fcor %*% t(lambda)
diag(POLYR) <- 1

#####Post Unit 1#####
lambda <- matrix(c(0.799, 0.795, 0.936, 0.922, 0.951, 0.921, 0.909, 0.884), nrow = 8)
fcor <- matrix(c(1), nrow = 1)
POLYR <- lambda %*% fcor %*% t(lambda)
diag(POLYR) <- 1

#####Pre Unit 2#####
# 8 x 2 loading matrix: items 1-5 load on factor 1, items 6-8 on factor 2
lambda <- matrix(c(0.780, 0.860, 0.716, 0.799, 0.876, 0, 0, 0,
                   0, 0, 0, 0, 0, 0.846, 0.902, 0.810), nrow = 8)
fcor <- matrix(c(1, 0.784, 0.784, 1), nrow = 2)
POLYR <- lambda %*% fcor %*% t(lambda)
diag(POLYR) <- 1

#####Post Unit 2#####
lambda <- matrix(c(0.862, 0.933, 0.883, 0.953, 0.834, 0, 0, 0,
                   0, 0, 0, 0, 0, 0.898, 0.917, 0.904), nrow = 8)
fcor <- matrix(c(1, 0.928, 0.928, 1), nrow = 2)
POLYR <- lambda %*% fcor %*% t(lambda)
diag(POLYR) <- 1

The polychoric correlation matrix computed above was then used as input for the reliability calculation in SAS, using the code provided below.
Unit 1 pre test SAS reliability code

proc iml;
RESET fuzz;
/* item 2 pre: second response category unobserved in this sample
   (see the measurement invariance identification notes below),
   hence the extreme second threshold */
THRESH={1.381 2.257, 1.621 19.1, 0.772 2.045, 1.043 1.865,
        0.879 2.708, 1.061 1.815, 1.010 2.198, 1.125 2.404};
LOAD={0.780, 0.752, 0.939, 0.832, 0.887, 0.830, 0.808, 0.765};
FACCOR={1};
POLY={1 0.586560 0.732420 0.648960 0.691860 0.64740 0.630240 0.596700,
      0.58656 1 0.706128 0.625664 0.667024 0.62416 0.607616 0.575280,
      0.73242 0.706128 1 0.781248 0.832893 0.77937 0.758712 0.718335,
      0.64896 0.625664 0.781248 1 0.737984 0.69056 0.672256 0.636480,
      0.69186 0.667024 0.832893 0.737984 1 0.73621 0.716696 0.678555,
      0.64740 0.624160 0.779370 0.690560 0.736210 1 0.670640 0.634950,
      0.63024 0.607616 0.758712 0.672256 0.716696 0.67064 1 0.618120,
      0.59670 0.575280 0.718335 0.636480 0.678555 0.63495 0.618120 1};
NTHRESH=Ncol(THRESH);
NCAT=NTHRESH+1;
NITEM=Nrow(LOAD);
NFACT=Ncol(LOAD);
POLYR=LOAD*FACCOR*T(LOAD);
do j=1 to NITEM;
  POLYR[j,j]=1;
end;
DIFFPOLY=POLY-POLYR;
Print NTHRESH[label="Number of Thresholds"], NITEM[label="Number of items"],
      NCAT[label="Number of response categories"], NFACT[label="Number of factors"],
      THRESH[label="Response Thresholds"], LOAD[label="Factor Loadings"],
      FACCOR[label="Factor Correlation Matrix"],
      POLY[label="Polychoric Correlation Matrix among Continuous Items"];
print "The matrix below is the difference between polychoric correlation matrix generated by factors and inputted polychoric correlation matrix. Nonzero values should represent the estimated correlated errors, as specified by the user, or an error in inputted data.";
print DIFFPOLY[label=" "];
sumnum=0;
addden=0;
do j=1 to NITEM;
  do jp=1 to NITEM;
    sumprobn2=0;
    addprobn2=0;
    do c=1 to NTHRESH;
      do cp=1 to NTHRESH;
        sumrvstar=0;
        do k=1 to NFACT;
          do kp=1 to NFACT;
            sumrvstar=sumrvstar+LOAD[j,k]*LOAD[jp,kp]*FACCOR[k,kp];
          end;
        end;
        sumprobn2=sumprobn2+probbnrm(THRESH[j,c],THRESH[jp,cp],sumrvstar);
        addprobn2=addprobn2+probbnrm(THRESH[j,c],THRESH[jp,cp],POLY[j,jp]);
      end;
    end;
    sumprobn1=0;
    sumprobn1p=0;
    do cc=1 to NTHRESH;
      sumprobn1=sumprobn1+CDF('NORMAL',THRESH[j,cc]);
      sumprobn1p=sumprobn1p+CDF('NORMAL',THRESH[jp,cc]);
    end;
    sumnum=sumnum+(sumprobn2-sumprobn1*sumprobn1p);
    addden=addden+(addprobn2-sumprobn1*sumprobn1p);
  end;
end;
reliab=sumnum/addden;
print sumnum[label="Numerator of Eq. (21)"],
      addden[label="Denominator of Eq. (21)"],
      reliab[label="Nonlinear SEM Reliability Coefficient"];
quit;
(21)"], reliab[label="Nonlinear SEM Reliability Coefficient"]; quit; Unit 1 post test SAS reliability code proc iml; RESET fuzz; THRESH={0.837 1.675, 0.714 1.507, 0.970 1.525, 0.815 1.464, 1.025 2.036, 0.732 1.923, 0.637 1.577, 0.743 1.503}; LOAD={0.862 0, 0.933 0, 0.883 0, 0.953 0, 0.834 0, 0 .898, 0 .917, 0 .904}; FACCOR={1 0.928, 0.928 1}; POLY={1.0 0.8042460 0.7611460 0.8214860 0.7189080 0.7183425 0.7335413 0.7231421, 0.8042460 1.0 0.8238390 0.8891490 0.7781220 0.7775100 0.7939606 0.7827049, 0.7611460 0.8238390 1.0 0.8414990 0.7364220 0.7358428 0.7514118 0.7407593, 0.8214860 0.8891490 0.8414990 1.0 0.7948020 0.7941768 0.8109801 0.7994831, 0.7189080 0.7781220 0.7364220 0.7948020 1.0 0.6950089 0.7097140 0.6996526, 0.7183425 0.7775100 0.7358428 0.7941768 0.6950089 1.0 0.8234660 0.8117920, 0.7335413 0.7939606 0.7514118 0.8109801 0.7097140 0.8234660 1.0 0.8289680, 0.7231421 0.7827049 0.7407593 0.7994831 0.6996526 0.8117920 0.8289680 1.0}; NTHRESH=Ncol(thresh); NCAT=NTHRESH+1; NITEM=Nrow(LOAD); NFACT=Ncol(LOAD); POLYR=LOAD*FACCOR*T(LOAD); do j=1 to NITEM; POLYR[j,j]=1; end; DIFFPOLY=POLY-POLYR; Print NTHRESH[label="Number of Thresholds"], NITEM[label="Number of items"], NCAT[label="Number of response categories"], NFACT[label="Number of factors"], THRESH[label="Response Thresholds"],LOAD[label="Factor Loadings"], FACCOR[label="Factor Correlation Matrix"], POLY[label="Polychoric Correlation Matrix among Continuous Items"] ; print "The matrix below is the difference between polychoric 49 correlation matrix generated by factors and inputted polychoric correlation matrix. Nonzero values should represent the estimated correlated errors, as specified by the user, or an error in inputted data."; print DIFFPOLY[label=" "]; sumnum=0; addden=0; do j=1 to NITEM; do jp=1 to NITEM; sumprobn2=0; addprobn2=0; do c=1 to NTHRESH; do cp=1 to NTHRESH; sumrvstar=0; do k=1 to NFACT; do kp=1 to NFACT; sumrvstar=sumrvstar+LOAD[j,k]*LOAD[jp,kp]*FACCOR[k,kp]; end; end; sumprobn2=sumprobn2+probbnrm(THRESH[j,c],THRESH[jp,cp],sumrvstar); addprobn2=addprobn2+probbnrm(THRESH[j,c],THRESH[jp,cp],POLY[j,jp]); end; end; sumprobn1=0; sumprobn1p=0; do cc=1 to NTHRESH; sumprobn1=sumprobn1+CDF('NORMAL',THRESH[j,cc]); sumprobn1p=sumprobn1p+CDF('NORMAL',THRESH[jp,cc]); end; sumnum=sumnum+(sumprobn2-sumprobn1*sumprobn1p); addden=addden+(addprobn2-sumprobn1*sumprobn1p); end; end; reliab=sumnum/addden; print sumnum[label="Numerator of Eq. (21)"], addden[label="Denominator of Eq. 
(21)"], reliab[label="Nonlinear SEM Reliability Coefficient"]; quit; Unit 2 pre test SAS reliability code proc iml; RESET fuzz; THRESH={1.748 2.311, 1.519 3.205, 1.816 2.857, 1.576 2.830, 1.442 2.560, 1.096 2.877, 0.995 2.464, 1.058 2.141}; LOAD={0.780 0, 0.860 0, 0.716 0, 0.799 0, 0.876 0, 0 .846, 0 .902, 0 .810}; FACCOR={1 0.784, 0.784 1}; POLY= {1.0 0.6708000 0.5584800 0.6232200 0.6832800 0.5173459 0.5515910 0.4953312, 0.6708000 1.0 0.6157600 0.6871400 0.7533600 0.5704070 0.6081645 0.5461344, 0.5584800 0.6157600 1.0 0.5720840 0.6272160 0.4748970 0.5063323 0.4546886, 0.6232200 0.6871400 0.5720840 1.0 0.6999240 0.5299479 0.5650272 0.5073970, 0.6832800 0.7533600 0.6272160 0.6999240 1.0 0.5810193 0.6194792 0.5562950, 0.5173459 0.5704070 0.4748970 0.5299479 0.5810193 1.0 0.7630920 0.6852600, 0.5515910 0.6081645 0.5063323 0.5650272 0.6194792 0.7630920 1.0 0.7306200, 50 0.4953312 0.5461344 0.4546886 0.5073970 0.5562950 0.6852600 0.7306200 1.0}; NTHRESH=Ncol(thresh); NCAT=NTHRESH+1; NITEM=Nrow(LOAD); NFACT=Ncol(LOAD); POLYR=LOAD*FACCOR*T(LOAD); do j=1 to NITEM; POLYR[j,j]=1; end; DIFFPOLY=POLY-POLYR; Print NTHRESH[label="Number of Thresholds"], NITEM[label="Number of items"], NCAT[label="Number of response categories"], NFACT[label="Number of factors"], THRESH[label="Response Thresholds"],LOAD[label="Factor Loadings"], FACCOR[label="Factor Correlation Matrix"], POLY[label="Polychoric Correlation Matrix among Continuous Items"] ; print "The matrix below is the difference between polychoric correlation matrix generated by factors and inputted polychoric correlation matrix. Nonzero values should represent the estimated correlated errors, as specified by the user, or an error in inputted data."; print DIFFPOLY[label=" "]; sumnum=0; addden=0; do j=1 to NITEM; do jp=1 to NITEM; sumprobn2=0; addprobn2=0; do c=1 to NTHRESH; do cp=1 to NTHRESH; sumrvstar=0; do k=1 to NFACT; do kp=1 to NFACT; sumrvstar=sumrvstar+LOAD[j,k]*LOAD[jp,kp]*FACCOR[k,kp]; end; end; sumprobn2=sumprobn2+probbnrm(THRESH[j,c],THRESH[jp,cp],sumrvstar); addprobn2=addprobn2+probbnrm(THRESH[j,c],THRESH[jp,cp],POLY[j,jp]); end; end; sumprobn1=0; sumprobn1p=0; do cc=1 to NTHRESH; sumprobn1=sumprobn1+CDF('NORMAL',THRESH[j,cc]); sumprobn1p=sumprobn1p+CDF('NORMAL',THRESH[jp,cc]); end; sumnum=sumnum+(sumprobn2-sumprobn1*sumprobn1p); addden=addden+(addprobn2-sumprobn1*sumprobn1p); end; end; reliab=sumnum/addden; print sumnum[label="Numerator of Eq. (21)"], addden[label="Denominator of Eq. 
(21)"], reliab[label="Nonlinear SEM Reliability Coefficient"]; quit; 51 Unit 2 post test SAS reliability code proc iml; RESET fuzz; THRESH={0.837 1.675, 0.714 1.507, 0.970 1.525, 0.815 1.464, 1.025 2.036, 0.732 1.923, 0.637 1.577, 0.743 1.503}; LOAD={0.862 0, 0.933 0, 0.883 0, 0.953 0, 0.834 0, 0 .898, 0 .917, 0 .904}; FACCOR={1 0.928, 0.928 1}; POLY={1.0 0.8042460 0.7611460 0.8214860 0.7189080 0.7183425 0.7335413 0.7231421, 0.8042460 1.0 0.8238390 0.8891490 0.7781220 0.7775100 0.7939606 0.7827049, 0.7611460 0.8238390 1.0 0.8414990 0.7364220 0.7358428 0.7514118 0.7407593, 0.8214860 0.8891490 0.8414990 1.0 0.7948020 0.7941768 0.8109801 0.7994831, 0.7189080 0.7781220 0.7364220 0.7948020 1.0 0.6950089 0.7097140 0.6996526, 0.7183425 0.7775100 0.7358428 0.7941768 0.6950089 1.0 0.8234660 0.8117920, 0.7335413 0.7939606 0.7514118 0.8109801 0.7097140 0.8234660 1.0 0.8289680, 0.7231421 0.7827049 0.7407593 0.7994831 0.6996526 0.8117920 0.8289680 1.0}; NTHRESH=Ncol(thresh); NCAT=NTHRESH+1; NITEM=Nrow(LOAD); NFACT=Ncol(LOAD); POLYR=LOAD*FACCOR*T(LOAD); do j=1 to NITEM; POLYR[j,j]=1; end; DIFFPOLY=POLY-POLYR; Print NTHRESH[label="Number of Thresholds"], NITEM[label="Number of items"], NCAT[label="Number of response categories"], NFACT[label="Number of factors"], THRESH[label="Response Thresholds"],LOAD[label="Factor Loadings"], FACCOR[label="Factor Correlation Matrix"], POLY[label="Polychoric Correlation Matrix among Continuous Items"] ; print "The matrix below is the difference between polychoric correlation matrix generated by factors and inputted polychoric correlation matrix. Nonzero values should represent the estimated correlated errors, as specified by the user, or an error in inputted data."; print DIFFPOLY[label=" "]; sumnum=0; addden=0; do j=1 to NITEM; do jp=1 to NITEM; sumprobn2=0; addprobn2=0; do c=1 to NTHRESH; do cp=1 to NTHRESH; sumrvstar=0; do k=1 to NFACT; do kp=1 to NFACT; sumrvstar=sumrvstar+LOAD[j,k]*LOAD[jp,kp]*FACCOR[k,kp]; end; end; sumprobn2=sumprobn2+probbnrm(THRESH[j,c],THRESH[jp,cp],sumrvstar); addprobn2=addprobn2+probbnrm(THRESH[j,c],THRESH[jp,cp],POLY[j,jp]); end; 52 end; sumprobn1=0; sumprobn1p=0; do cc=1 to NTHRESH; sumprobn1=sumprobn1+CDF('NORMAL',THRESH[j,cc]); sumprobn1p=sumprobn1p+CDF('NORMAL',THRESH[jp,cc]); end; sumnum=sumnum+(sumprobn2-sumprobn1*sumprobn1p); addden=addden+(addprobn2-sumprobn1*sumprobn1p); end; end; reliab=sumnum/addden; print sumnum[label="Numerator of Eq. (21)"], addden[label="Denominator of Eq. 
(21)"], reliab[label="Nonlinear SEM Reliability Coefficient"]; quit; 53 Two Time Point EFA Code MPlus EFA Code Unit 1 TITLE: Unit 1 EFA at two timepoints with factor loading invariance and correlated residuals across time DATA: FILE IS EFAforty.dat; VARIABLE: NAMES ARE STUID TCID Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1 Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2; CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1 Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2; usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1 Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2; MODEL: f1 BY Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1 (*t1 1); f2 BY Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2 (*t2 1); f1 WITH f2; Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1 WITH Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2; ANALYSIS: ROTATION=CF-VARIMAX; OUTPUT: TECH1 STANDARDIZED; EFA Code Unit 2 TITLE: Unit 2 EFA at two timepoints with factor loading invariance and correlated residuals across time DATA: FILE IS EFAforty.dat; VARIABLE: NAMES ARE STUID TCID Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1 Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2; usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1 Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2; CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1 Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2; MODEL: f1-f2 BY Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1 (*t1 1); f3-f4 BY Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2 (*t2 1); f1-f2 WITH f3-f4; Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1WITH Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2Q7T2 Q8T2; ANALYSIS: ROTATION=CF-VARIMAX; OUTPUT: TECH1 STANDARDIZED; 54 Measurement Invariance Model Identification The items on both unit 1 and unit 2 assessment represent ordered-categorical variables, and measurement invariance was evaluated following procedures described in Liu et al., 2017, with some modifications. Specifically, the following constrains were imposed for model identification purposes: 1. At the reference measurement occasion (pre-test was chosen to be the reference measurement occasion), common factor mean for Unit 1 was constrained to zero, and unique factor variances were constrained to one. On the post test, unique factor variances were freely estimated. For Unit 2, unique factor variances were constrained to one at the reference measurement occasion, but common factor mean were not estimated due to limited degrees of freedom. 2. On pre and post the same item is chosen as a marker variable and the factor loading of the marker variable is constrained to 1. Constraining the loading of the marker variable gives the latent common factor a scale that is in the same unit as one of the items chosen to be the marker variable (Liu et. al. 2017). Marker variable for longitudinal measurement invariance should have an invariant factor loading across all measurement occasions, and have at least two invariant thresholds (Liu et. al. 2017). Choosing marker variable to identify variance structure of the latent common factor in measurement invariance model identification Procedure described in Liu et. al. (2017) was used to identify variance structure of the latent common factor using marker variable approach. Specifically, confirmatory factor analysis (CFA) was conducted and factor loadings were examined on pre and post as well as thresholds for all the items to choose specific marker variables. Item 7 on unit 1 pre/post had the smallest difference in factor loading and the most invariant thresholds on pre and post test. 
Item 7 on the unit 1 pre/post test had the smallest difference in factor loading and the most invariant thresholds across pre and post. Therefore, item 7 was chosen as the marker variable (factor loading fixed at 1 on the pre- and post-test, and thresholds 1 and 2 set equal across pre and post). The same procedure was used to choose marker variables for the unit 2 pre/post test. However, since unit 2 has two latent common factors, a marker variable was chosen separately for each factor. Items 1-5 load on factor 1 in unit 2, so factor loadings and thresholds were examined for these items first; following similar guidelines, item 3 was chosen as the marker variable for factor 1. Items 6-8 load on factor 2 in unit 2, so factor loadings and thresholds were then examined for these items, and item 7 was chosen as the marker variable for factor 2.

Excluding a threshold from invariance analysis due to sample limitations

The total sample of 899 students was split 40%/60%. The 40% sample was used to conduct EFA, while the 60% sample was used to conduct the CFA-based measurement invariance analysis (a minimal sketch of such a split is shown below). It was observed that for item 2 on the unit 1 pre/post assessment, the second response category was not observed in the 60% random split. Therefore, that threshold was excluded from the measurement invariance analysis. See the measurement invariance code below for details.
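As a point of reference, a random 40/60 split like the one described above can be produced as follows. The file names and column layout are hypothetical, and the original split may well have been done in other software; this is only a sketch.

# Hypothetical sketch of a 40/60 random split of the student response data.
# File names and column layout are illustrative, not the original files.
import pandas as pd

df = pd.read_csv("unit1_item_scores.csv")        # one row per student
efa = df.sample(frac=0.40, random_state=2017)    # 40% subsample for EFA
cfa = df.drop(efa.index)                         # remaining 60% for CFA/invariance

# MPlus expects whitespace-delimited data with no header row.
efa.to_csv("EFAforty.dat", sep=" ", header=False, index=False)
cfa.to_csv("CFAsixty_U1.dat", sep=" ", header=False, index=False)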
Unit 1 measurement invariance code for MPlus

TITLE: Configural Longitudinal Invariance Model for unit 1 pre/post assessment
DATA: FILE is CFAsixty_U1.dat;
VARIABLE: NAMES ARE STUID TCID Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
ANALYSIS: PARAMETERIZATION = THETA; ITERATIONS = 3000; ESTIMATOR = WLSMV;
MODEL:
! Factor loadings;
Time1F1 BY Q1T1* Q2T1* Q3T1* Q4T1* Q5T1* Q6T1* Q7T1@1 Q8T1*; !unit 1 pre test items
Time2F1 BY Q1T2* Q2T2* Q3T2* Q4T2* Q5T2* Q6T2* Q7T2@1 Q8T2*; !unit 1 post test items
! Thresholds;
[Q1T1$1 Q1T2$1](1); !item 1 threshold 1 held equal across pre and post
[Q1T1$2 Q1T2$2]; !item 1 threshold 2
[Q2T1$1 Q2T2$1](2); !item 2 threshold 1
[Q2T2$2]; !item 2 pre threshold 2 excluded (not observed in the sample); only post threshold 2 used
[Q3T1$1 Q3T2$1](3); !item 3 threshold 1
[Q3T1$2 Q3T2$2]; !item 3 threshold 2
[Q4T1$1 Q4T2$1](4); !item 4 threshold 1
[Q4T1$2 Q4T2$2]; !item 4 threshold 2
[Q5T1$1 Q5T2$1](5); !item 5 threshold 1
[Q5T1$2 Q5T2$2]; !item 5 threshold 2
[Q6T1$1 Q6T2$1](6); !item 6 threshold 1
[Q6T1$2 Q6T2$2]; !item 6 threshold 2
[Q7T1$1 Q7T2$1](7); !item 7 threshold 1
[Q7T1$2 Q7T2$2](9); !item 7 threshold 2 (marker item: both thresholds held equal)
[Q8T1$1 Q8T2$1](8); !item 8 threshold 1
[Q8T1$2 Q8T2$2]; !item 8 threshold 2
! Common factor covariance matrix;
Time1F1 Time2F1 WITH Time1F1 Time2F1;
! Common factor means;
[Time1F1@0 Time2F1*];
! Unique variances;
Q1T1@1 Q2T1@1 Q3T1@1 Q4T1@1 Q5T1@1 Q6T1@1 Q7T1@1 Q8T1@1; !unit 1 pre items
Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2; !unit 1 post items (freely estimated)
! Lagged unique factor covariances between the same item at pre and post;
Q1T1 WITH Q1T2*;
Q2T1 WITH Q2T2*;
Q3T1 WITH Q3T2*;
Q4T1 WITH Q4T2*;
Q5T1 WITH Q5T2*;
Q6T1 WITH Q6T2*;
Q7T1 WITH Q7T2*;
Q8T1 WITH Q8T2*;
Q4T1 WITH Q5T1; !item correlation on pre test included following the mECD argument (items measure a similar aspect of the phenomenon in question)
Q4T2 WITH Q5T2; !item correlation on post test included following the mECD argument
OUTPUT: sampstat STDYX mod(all 8);
SAVEDATA: DIFFTEST IS unit1_configural.dat;

TITLE: Weak Longitudinal Invariance Model for unit 1 pre/post assessment
DATA: FILE is CFAsixty_U1.dat;
VARIABLE: NAMES ARE STUID TCID Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
ANALYSIS: PARAMETERIZATION = THETA; ITERATIONS = 3000; ESTIMATOR = WLSMV;
  DIFFTEST IS unit1_configural.dat;
MODEL:
! Factor loadings (held equal across time via labels 10-16);
Time1F1 BY Q1T1* (10) Q2T1* (11) Q3T1* (12) Q4T1* (13) Q5T1* (14) Q6T1* (15)
  Q7T1@1 Q8T1* (16); !unit 1 pre test items
Time2F1 BY Q1T2* (10) Q2T2* (11) Q3T2* (12) Q4T2* (13) Q5T2* (14) Q6T2* (15)
  Q7T2@1 Q8T2* (16); !unit 1 post test items
! Thresholds (same pattern as the configural model);
[Q1T1$1 Q1T2$1](1);
[Q1T1$2 Q1T2$2];
[Q2T1$1 Q2T2$1](2);
[Q2T2$2]; !item 2 pre threshold 2 excluded (not observed in the sample)
[Q3T1$1 Q3T2$1](3);
[Q3T1$2 Q3T2$2];
[Q4T1$1 Q4T2$1](4);
[Q4T1$2 Q4T2$2];
[Q5T1$1 Q5T2$1](5);
[Q5T1$2 Q5T2$2];
[Q6T1$1 Q6T2$1](6);
[Q6T1$2 Q6T2$2];
[Q7T1$1 Q7T2$1](7);
[Q7T1$2 Q7T2$2](9); !marker item: both thresholds held equal
[Q8T1$1 Q8T2$1](8);
[Q8T1$2 Q8T2$2];
! Common factor covariance matrix;
Time1F1 Time2F1 WITH Time1F1 Time2F1;
! Common factor means;
[Time1F1@0 Time2F1*];
! Unique variances;
Q1T1@1 Q2T1@1 Q3T1@1 Q4T1@1 Q5T1@1 Q6T1@1 Q7T1@1 Q8T1@1; !unit 1 pre items
Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2; !unit 1 post items (freely estimated)
! Lagged unique factor covariances between the same item at pre and post;
Q1T1 WITH Q1T2*;
Q2T1 WITH Q2T2*;
Q3T1 WITH Q3T2*;
Q4T1 WITH Q4T2*;
Q5T1 WITH Q5T2*;
Q6T1 WITH Q6T2*;
Q7T1 WITH Q7T2*;
Q8T1 WITH Q8T2*;
Q4T1 WITH Q5T1; !item correlation on pre test included following the mECD argument
Q4T2 WITH Q5T2; !item correlation on post test included following the mECD argument
OUTPUT: sampstat STDYX mod(all 8);
SAVEDATA: DIFFTEST IS unit1_weak.dat;
TITLE: Strong Longitudinal Invariance Model for unit 1 pre/post assessment
DATA: FILE is CFAsixty_U1.dat;
VARIABLE: NAMES ARE STUID TCID Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
ANALYSIS: PARAMETERIZATION = THETA; ITERATIONS = 3000; ESTIMATOR = WLSMV;
  DIFFTEST IS unit1_weak.dat;
MODEL:
! Factor loadings (held equal across time via labels 10-16);
Time1F1 BY Q1T1* (10) Q2T1* (11) Q3T1* (12) Q4T1* (13) Q5T1* (14) Q6T1* (15)
  Q7T1@1 Q8T1* (16); !unit 1 pre test items
Time2F1 BY Q1T2* (10) Q2T2* (11) Q3T2* (12) Q4T2* (13) Q5T2* (14) Q6T2* (15)
  Q7T2@1 Q8T2* (16); !unit 1 post test items
! Thresholds (second thresholds now also held equal across time);
[Q1T1$1 Q1T2$1](1);
[Q1T1$2 Q1T2$2](17);
[Q2T1$1 Q2T2$1](2);
[Q2T2$2](18); !item 2 pre threshold 2 excluded (not observed in the sample)
[Q3T1$1 Q3T2$1](3);
[Q3T1$2 Q3T2$2]; !item 3 threshold 2 freed to achieve (partial) strong invariance
[Q4T1$1 Q4T2$1](4);
[Q4T1$2 Q4T2$2](20);
[Q5T1$1 Q5T2$1](5);
[Q5T1$2 Q5T2$2](21);
[Q6T1$1 Q6T2$1](6);
[Q6T1$2 Q6T2$2](22);
[Q7T1$1 Q7T2$1](7);
[Q7T1$2 Q7T2$2](9);
[Q8T1$1 Q8T2$1](8);
[Q8T1$2 Q8T2$2](23);
! Common factor covariance matrix;
Time1F1 Time2F1 WITH Time1F1 Time2F1;
! Common factor means;
[Time1F1@0 Time2F1*];
! Unique variances;
Q1T1@1 Q2T1@1 Q3T1@1 Q4T1@1 Q5T1@1 Q6T1@1 Q7T1@1 Q8T1@1; !unit 1 pre items
Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2; !unit 1 post items (freely estimated)
! Lagged unique factor covariances between the same item at pre and post;
Q1T1 WITH Q1T2*;
Q2T1 WITH Q2T2*;
Q3T1 WITH Q3T2*;
Q4T1 WITH Q4T2*;
Q5T1 WITH Q5T2*;
Q6T1 WITH Q6T2*;
Q7T1 WITH Q7T2*;
Q8T1 WITH Q8T2*;
Q4T1 WITH Q5T1; !item correlation on pre test included following the mECD argument
Q4T2 WITH Q5T2; !item correlation on post test included following the mECD argument
OUTPUT: sampstat STDYX mod(all 8);
SAVEDATA: DIFFTEST IS unit1_strong.dat;
TITLE: Strict Longitudinal Invariance Model for unit 1 pre/post assessment
DATA: FILE is CFAsixty_U1.dat;
VARIABLE: NAMES ARE STUID TCID Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
ANALYSIS: PARAMETERIZATION = THETA; ITERATIONS = 3000; ESTIMATOR = WLSMV;
  DIFFTEST IS unit1_strong.dat;
MODEL:
! Factor loadings (held equal across time via labels 10-16);
Time1F1 BY Q1T1* (10) Q2T1* (11) Q3T1* (12) Q4T1* (13) Q5T1* (14) Q6T1* (15)
  Q7T1@1 Q8T1* (16); !unit 1 pre test items
Time2F1 BY Q1T2* (10) Q2T2* (11) Q3T2* (12) Q4T2* (13) Q5T2* (14) Q6T2* (15)
  Q7T2@1 Q8T2* (16); !unit 1 post test items
! Thresholds (same pattern as the strong model);
[Q1T1$1 Q1T2$1](1);
[Q1T1$2 Q1T2$2](17);
[Q2T1$1 Q2T2$1](2);
[Q2T2$2](18); !item 2 pre threshold 2 excluded (not observed in the sample)
[Q3T1$1 Q3T2$1](3);
[Q3T1$2 Q3T2$2]; !item 3 threshold 2 freed during strong invariance estimation
[Q4T1$1 Q4T2$1](4);
[Q4T1$2 Q4T2$2](20);
[Q5T1$1 Q5T2$1](5);
[Q5T1$2 Q5T2$2](21);
[Q6T1$1 Q6T2$1](6);
[Q6T1$2 Q6T2$2](22);
[Q7T1$1 Q7T2$1](7);
[Q7T1$2 Q7T2$2](9);
[Q8T1$1 Q8T2$1](8);
[Q8T1$2 Q8T2$2](23);
! Common factor covariance matrix;
Time1F1 Time2F1 WITH Time1F1 Time2F1;
! Common factor means;
[Time1F1@0 Time2F1*];
! Unique variances;
Q1T1@1 Q2T1@1 Q3T1@1 Q4T1@1 Q5T1@1 Q6T1@1 Q7T1@1 Q8T1@1; !unit 1 pre items
Q1T2 Q2T2 Q3T2 Q4T2@1 Q5T2@1 Q6T2@1 Q7T2@1 Q8T2@1; !unit 1 post items
!Q3T2 unique variance freed because the strict invariance model is nested in the
!strong invariance model, and the Q3 threshold was freed during strong invariance
!model estimation above;
!Q1T2 and Q2T2 unique variances freed to achieve (partial) strict invariance;
! Lagged unique factor covariances between the same item at pre and post;
Q1T1 WITH Q1T2*;
Q2T1 WITH Q2T2*;
Q3T1 WITH Q3T2*;
Q4T1 WITH Q4T2*;
Q5T1 WITH Q5T2*;
Q6T1 WITH Q6T2*;
Q7T1 WITH Q7T2*;
Q8T1 WITH Q8T2*;
Q4T1 WITH Q5T1; !item correlation on pre test included following the mECD argument
Q4T2 WITH Q5T2; !item correlation on post test included following the mECD argument
OUTPUT: sampstat STDYX mod(all 8);
SAVEDATA: DIFFTEST IS unit1_strict.dat;
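The four models above form a nested ladder (configural, weak, strong, strict), each compared with the previous, less restrictive model via a DIFFTEST chi-square difference test. The following is a schematic Python sketch of that decision sequence; the p-values are placeholders, not results from this study.

# Schematic sketch of the longitudinal invariance testing ladder.
# The p-values below are placeholders, not results from this study.
from typing import List, Tuple

# (model name, DIFFTEST p-value vs. the previous, less restrictive model)
ladder: List[Tuple[str, float]] = [
    ("configural", 1.00),   # baseline; no comparison
    ("weak",       0.31),   # equal loadings vs. configural
    ("strong",     0.08),   # plus equal thresholds vs. weak
    ("strict",     0.02),   # plus equal unique variances vs. strong
]

ALPHA = 0.05
supported = "configural"
for name, p in ladder[1:]:
    if p < ALPHA:
        print(f"{name} invariance rejected (p={p}); "
              f"free the offending parameters (partial invariance) or stop.")
        break
    supported = name
print(f"Highest level of full invariance supported: {supported}")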
Unit 2 measurement invariance code for MPlus

TITLE: Configural Longitudinal Invariance Model for unit 2 pre/post assessment
DATA: FILE is CFAsixty_U2.dat;
VARIABLE: NAMES ARE STUID TCID Q1T1-Q8T2;
  usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  IDVAR = STUID;
ANALYSIS: PARAMETERIZATION = THETA; ITERATIONS = 3000; ESTIMATOR = WLSMV;
MODEL:
! Factor loadings;
Time1F1 BY Q1T1* Q2T1* Q3T1@1 Q4T1* Q5T1*;
Time1F2 BY Q6T1* Q7T1@1 Q8T1*;
Time2F1 BY Q1T2* Q2T2* Q3T2@1 Q4T2* Q5T2*;
Time2F2 BY Q6T2* Q7T2@1 Q8T2*;
! Thresholds;
[Q1T1$1 Q1T2$1](1); !item 1 threshold 1 held equal across pre and post
[Q1T1$2 Q1T2$2]; !item 1 threshold 2
[Q2T1$1 Q2T2$1](2); !item 2 threshold 1
[Q2T1$2 Q2T2$2]; !item 2 threshold 2
[Q3T1$1 Q3T2$1](3); !item 3 threshold 1
[Q3T1$2 Q3T2$2](11); !item 3 threshold 2 (marker item for factor 1: both thresholds held equal)
[Q4T1$1 Q4T2$1](4); !item 4 threshold 1
[Q4T1$2 Q4T2$2]; !item 4 threshold 2
[Q5T1$1 Q5T2$1](6); !item 5 threshold 1
[Q5T1$2 Q5T2$2]; !item 5 threshold 2
[Q6T1$1 Q6T2$1](7); !item 6 threshold 1
[Q6T1$2 Q6T2$2]; !item 6 threshold 2
[Q7T1$1 Q7T2$1](8); !item 7 threshold 1
[Q7T1$2 Q7T2$2](12); !item 7 threshold 2 (marker item for factor 2: both thresholds held equal)
[Q8T1$1 Q8T2$1](9); !item 8 threshold 1
[Q8T1$2 Q8T2$2]; !item 8 threshold 2
! Common factor covariance matrix;
Time1F1 Time2F1 WITH Time1F1 Time2F1;
Time1F2 Time2F2 WITH Time1F2 Time2F2;
! Unique variances;
Q1T1@1 Q2T1@1 Q3T1@1 Q4T1@1 Q5T1@1 Q6T1@1 Q7T1@1 Q8T1@1; !unit 2 pre items
Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2; !unit 2 post items (freely estimated)
! Lagged unique factor covariances between the same item at pre and post;
Q1T1 WITH Q1T2*;
Q2T1 WITH Q2T2*;
Q3T1 WITH Q3T2*;
Q4T1 WITH Q4T2*;
Q5T1 WITH Q5T2*;
Q6T1 WITH Q6T2*;
Q7T1 WITH Q7T2*;
Q8T1 WITH Q8T2*;
Q1T1 WITH Q2T1; !item correlation on pre test
Q1T2 WITH Q2T2; !item correlation on post test
OUTPUT: sampstat residual STDYX mod(all 8);
SAVEDATA: DIFFTEST IS unit2_configural.dat;
TITLE: Weak Longitudinal Invariance Model for unit 2 pre/post assessment
DATA: FILE is CFAsixty_U2.dat;
VARIABLE: NAMES ARE STUID TCID Q1T1-Q8T2;
  usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  IDVAR = STUID;
ANALYSIS: PARAMETERIZATION = THETA; ITERATIONS = 3000; ESTIMATOR = WLSMV;
  DIFFTEST IS unit2_configural.dat;
MODEL:
! Factor loadings (held equal across time via labels 13-18);
Time1F1 BY Q1T1* (13) Q2T1* (14) Q3T1@1 Q4T1* (15) Q5T1* (16);
Time1F2 BY Q6T1* (17) Q7T1@1 Q8T1* (18);
Time2F1 BY Q1T2* (13) Q2T2* (14) Q3T2@1 Q4T2* (15) Q5T2*; !Q5T2 loading freed to achieve (partial) weak invariance
Time2F2 BY Q6T2* (17) Q7T2@1 Q8T2* (18);
! Thresholds (same pattern as the configural model);
[Q1T1$1 Q1T2$1](1);
[Q1T1$2 Q1T2$2];
[Q2T1$1 Q2T2$1](2);
[Q2T1$2 Q2T2$2];
[Q3T1$1 Q3T2$1](3);
[Q3T1$2 Q3T2$2](11); !marker item for factor 1
[Q4T1$1 Q4T2$1](4);
[Q4T1$2 Q4T2$2];
[Q5T1$1 Q5T2$1](6);
[Q5T1$2 Q5T2$2];
[Q6T1$1 Q6T2$1](7);
[Q6T1$2 Q6T2$2];
[Q7T1$1 Q7T2$1](8);
[Q7T1$2 Q7T2$2](12); !marker item for factor 2
[Q8T1$1 Q8T2$1](9);
[Q8T1$2 Q8T2$2];
! Common factor covariance matrix;
Time1F1 Time2F1 WITH Time1F1 Time2F1;
Time1F2 Time2F2 WITH Time1F2 Time2F2;
! Unique variances;
Q1T1@1 Q2T1@1 Q3T1@1 Q4T1@1 Q5T1@1 Q6T1@1 Q7T1@1 Q8T1@1; !unit 2 pre items
Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2; !unit 2 post items (freely estimated)
! Lagged unique factor covariances between the same item at pre and post;
Q1T1 WITH Q1T2*;
Q2T1 WITH Q2T2*;
Q3T1 WITH Q3T2*;
Q4T1 WITH Q4T2*;
Q5T1 WITH Q5T2*;
Q6T1 WITH Q6T2*;
Q7T1 WITH Q7T2*;
Q8T1 WITH Q8T2*;
Q1T1 WITH Q2T1; !item correlation on pre test
Q1T2 WITH Q2T2; !item correlation on post test
OUTPUT: sampstat residual STDYX mod(all 8);
SAVEDATA: DIFFTEST IS unit2_weak.dat;
TITLE: Strong Longitudinal Invariance Model for unit 2 pre/post assessment
DATA: FILE is CFAsixty_U2.dat;
VARIABLE: NAMES ARE STUID TCID Q1T1-Q8T2;
  usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  IDVAR = STUID;
ANALYSIS: PARAMETERIZATION = THETA; ITERATIONS = 3000; ESTIMATOR = WLSMV;
  DIFFTEST IS unit2_weak.dat;
MODEL:
! Factor loadings (held equal across time via labels 13-18);
Time1F1 BY Q1T1* (13) Q2T1* (14) Q3T1@1 Q4T1* (15) Q5T1* (16);
Time1F2 BY Q6T1* (17) Q7T1@1 Q8T1* (18);
Time2F1 BY Q1T2* (13) Q2T2* (14) Q3T2@1 Q4T2* (15) Q5T2*; !Q5T2 loading freed to achieve (partial) weak invariance
Time2F2 BY Q6T2* (17) Q7T2@1 Q8T2* (18);
! Thresholds (second thresholds now also held equal across time);
[Q1T1$1 Q1T2$1](1);
[Q1T1$2 Q1T2$2]; !item 1 threshold 2 freed to achieve (partial) strong invariance
[Q2T1$1 Q2T2$1](2);
[Q2T1$2 Q2T2$2](20);
[Q3T1$1 Q3T2$1](3);
[Q3T1$2 Q3T2$2](11);
[Q4T1$1 Q4T2$1](4);
[Q4T1$2 Q4T2$2](21);
[Q5T1$1 Q5T2$1](6);
[Q5T1$2 Q5T2$2]; !item 5 threshold 2 freed (the strong invariance model is nested in the weak invariance model)
[Q6T1$1 Q6T2$1](7);
[Q6T1$2 Q6T2$2](23);
[Q7T1$1 Q7T2$1](8);
[Q7T1$2 Q7T2$2](12);
[Q8T1$1 Q8T2$1](9);
[Q8T1$2 Q8T2$2](24);
! Common factor covariance matrix;
Time1F1 Time2F1 WITH Time1F1 Time2F1;
Time1F2 Time2F2 WITH Time1F2 Time2F2;
! Unique variances;
Q1T1@1 Q2T1@1 Q3T1@1 Q4T1@1 Q5T1@1 Q6T1@1 Q7T1@1 Q8T1@1; !unit 2 pre items
Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2; !unit 2 post items (freely estimated)
! Lagged unique factor covariances between the same item at pre and post;
Q1T1 WITH Q1T2*;
Q2T1 WITH Q2T2*;
Q3T1 WITH Q3T2*;
Q4T1 WITH Q4T2*;
Q5T1 WITH Q5T2*;
Q6T1 WITH Q6T2*;
Q7T1 WITH Q7T2*;
Q8T1 WITH Q8T2*;
Q1T1 WITH Q2T1; !item correlation on pre test
Q1T2 WITH Q2T2; !item correlation on post test
OUTPUT: sampstat residual STDYX mod(all 8);
SAVEDATA: DIFFTEST IS unit2_strong.dat;
TITLE: Strict Longitudinal Invariance Model for unit 2 pre/post assessment
DATA: FILE is CFAsixty_U2.dat;
VARIABLE: NAMES ARE STUID TCID Q1T1-Q8T2;
  usevar = Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  CATEGORICAL ARE Q1T1 Q2T1 Q3T1 Q4T1 Q5T1 Q6T1 Q7T1 Q8T1
  Q1T2 Q2T2 Q3T2 Q4T2 Q5T2 Q6T2 Q7T2 Q8T2;
  IDVAR = STUID;
ANALYSIS: PARAMETERIZATION = THETA; ITERATIONS = 3000; ESTIMATOR = WLSMV;
  DIFFTEST IS unit2_strong.dat;
MODEL:
! Factor loadings (held equal across time via labels 13-18);
Time1F1 BY Q1T1* (13) Q2T1* (14) Q3T1@1 Q4T1* (15) Q5T1* (16);
Time1F2 BY Q6T1* (17) Q7T1@1 Q8T1* (18);
Time2F1 BY Q1T2* (13) Q2T2* (14) Q3T2@1 Q4T2* (15) Q5T2*; !Q5T2 loading freed to achieve (partial) weak invariance
Time2F2 BY Q6T2* (17) Q7T2@1 Q8T2* (18);
! Thresholds (same pattern as the strong model);
[Q1T1$1 Q1T2$1](1);
[Q1T1$2 Q1T2$2]; !item 1 threshold 2 freed during strong invariance estimation
[Q2T1$1 Q2T2$1](2);
[Q2T1$2 Q2T2$2](20);
[Q3T1$1 Q3T2$1](3);
[Q3T1$2 Q3T2$2](11);
[Q4T1$1 Q4T2$1](4);
[Q4T1$2 Q4T2$2](21);
[Q5T1$1 Q5T2$1](6);
[Q5T1$2 Q5T2$2]; !item 5 threshold 2 freed during strong invariance estimation
[Q6T1$1 Q6T2$1](7);
[Q6T1$2 Q6T2$2](23);
[Q7T1$1 Q7T2$1](8);
[Q7T1$2 Q7T2$2](12);
[Q8T1$1 Q8T2$1](9);
[Q8T1$2 Q8T2$2](24);
! Common factor covariance matrix;
Time1F1 Time2F1 WITH Time1F1 Time2F1;
Time1F2 Time2F2 WITH Time1F2 Time2F2;
! Unique variances;
Q1T1@1 Q2T1@1 Q3T1@1 Q4T1@1 Q5T1@1 Q6T1@1 Q7T1@1 Q8T1@1; !unit 2 pre items
Q1T2 Q2T2@1 Q3T2@1 Q4T2@1 Q5T2 Q6T2@1 Q7T2@1 Q8T2@1; !unit 2 post items
!Unique variances freed for items Q1 and Q5 because the strict invariance model is
!nested in the strong invariance model;
! Lagged unique factor covariances between the same item at pre and post;
Q1T1 WITH Q1T2*;
Q2T1 WITH Q2T2*;
Q3T1 WITH Q3T2*;
Q4T1 WITH Q4T2*;
Q5T1 WITH Q5T2*;
Q6T1 WITH Q6T2*;
Q7T1 WITH Q7T2*;
Q8T1 WITH Q8T2*;
Q1T1 WITH Q2T1; !item correlation on pre test
Q1T2 WITH Q2T2; !item correlation on post test
OUTPUT: sampstat residual STDYX mod(all 8);
SAVEDATA: DIFFTEST IS unit2_strict.dat;

BIBLIOGRAPHY

Asparouhov, T., & Muthén, B. (2009). Exploratory structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 16(3), 397-438.

Asparouhov, T., & Muthén, B. (2006). Robust chi square difference testing with mean and variance adjusted test statistics. Mplus Web Notes, 10, 1-6.

DeBarger, A. H., Penuel, W. R., Harris, C. J., & Kennedy, C. A. (2016). Building an assessment argument to design and use next generation science assessments in efficacy studies of curriculum interventions. American Journal of Evaluation, 37(2), 174-192.
Dimitrov, D. M. (2010). Testing for factorial invariance in the context of construct validation. Measurement and Evaluation in Counseling and Development, 43(2), 121.

Ercikan, K., & Oliveri, M. E. (2016). In search of validity evidence in support of the interpretation and use of assessments of complex constructs: Discussion of research on assessing 21st century skills. Applied Measurement in Education, 29(4), 310-318.

Gane, B. D., McElhaney, K. W., Zaidi, S. Z., & Pellegrino, J. W. (2018, March). Analysis of student and item performance on three-dimensional constructed response assessment tasks. Paper presented at the NARST Annual International Conference, Atlanta, GA.

Gane, B. D., McElhaney, K. W., Zaidi, S. Z., & Pellegrino, J. W. (2019). Design and validation of instructionally-supportive assessment: Examining student performance on knowledge-in-use assessment tasks. Paper presented at the AERA Annual International Conference, Toronto, Canada.

Geisinger, K. F., Bracken, B. A., Carlson, J. F., Hansen, J. I. C., Kuncel, N. R., Reise, S. P., & Rodriguez, M. C. (2013). APA handbook of testing and assessment in psychology, Vol. 3: Testing and assessment in school psychology and education. American Psychological Association.

Gorin, J. S., & Mislevy, R. J. (2013, September). Inherent measurement challenges in the next generation science standards for both formative and summative assessment. In Invitational research symposium on science assessment.

Green, S. B., & Yang, Y. (2009). Reliability of summed item scores using structural equation modeling: An alternative to coefficient alpha. Psychometrika, 74(1), 155-167.

Hair, J. F., Jr., Black, W. C., Babin, B. J., Anderson, R. E., & Tatham, R. L. (2009). Multivariate data analysis.

Harris, C. J., Krajcik, J. S., Pellegrino, J. W., & DeBarger, A. H. (2019). Designing knowledge-in-use assessments to promote deeper learning. Educational Measurement: Issues and Practice.

Huff, K., Steinberg, L., & Matts, T. (2010). The promises and challenges of implementing evidence-centered design in large-scale assessment. Applied Measurement in Education, 23(4), 310-324.

Kane, M. T. (2016). Explicating validity. Assessment in Education: Principles, Policy & Practice, 23(2), 198-211.

Kline, R. B. (2015). Principles and practice of structural equation modeling. Guilford.

Liu, Y., Millsap, R. E., West, S. G., Tein, J. Y., Tanaka, R., & Grimm, K. J. (2017). Testing measurement invariance in longitudinal data with ordered-categorical measures. Psychological Methods, 22(3), 486.

Lord, F. M. (1976). A study of item bias using characteristic curve theory.

McDonald, R. P., & Ho, M. H. R. (2002). Principles and practice in reporting structural equation analyses. Psychological Methods, 7(1), 64.

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741.

Mislevy, R. J. (2009). Validity from the perspective of model-based reasoning. The concept of validity: Revisions, new directions and applications, 83-108.

Mislevy, R. J., & Haertel, G. D. (2006). Implications of evidence-centered design for educational testing. Educational Measurement: Issues and Practice, 25(4), 6-20.

National Academies of Sciences, Engineering, and Medicine. (2019). Science and engineering for grades 6-12: Investigation and design at the center. National Academies Press.
National Research Council. (2007). Taking science to school: Learning and teaching science in grades K-8. National Academies Press.

National Research Council. (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. National Academies Press.

National Research Council. (2013a). Education for life and work: Developing transferable knowledge and skills in the 21st century. National Academies Press.

Pellegrino, J. W., & Hilton, M. L. (Eds.). (2012). Education for life and work: Developing transferable knowledge and skills in the 21st century. Washington, DC: National Academies Press.

Pellegrino, J. W., Wilson, M. R., Koenig, J. A., & Beatty, A. S. (2014). Developing assessments for the Next Generation Science Standards. National Academies Press.

Reckase, M. D. (2017). A tale of two models: Sources of confusion in achievement testing. ETS Research Report Series, 2017(1), 1-15.

Rutkowski, L., & Svetina, D. (2017). Measurement invariance in international surveys: Categorical indicators and fit measure performance. Applied Measurement in Education, 30(1), 39-51.

Smith, C. L., Wiser, M., Anderson, C. W., & Krajcik, J. (2006). Focus article: Implications of research on children's learning for standards and assessment: A proposed learning progression for matter and the atomic-molecular theory. Measurement: Interdisciplinary Research & Perspective, 4(1-2), 1-98.

Standards, N. G. S. (2013). Next generation science standards: For states, by states.

Van de Schoot, R., Lugtig, P., & Hox, J. (2012). A checklist for testing measurement invariance. European Journal of Developmental Psychology, 9(4), 486-492.

CHAPTER 2

Developing and Validating an NGSS-Aligned Learning Progression to Track Three-Dimensional Learning of Electrical Interactions in High School Physical Science

Introduction

Historically, the US science education system has focused on broad coverage of multiple topics rather than on developing integrated, deep understanding of the key ideas in science. Deep, useable understanding is essential for applying scientific ideas to solve real-life problems, an ability typically referred to as knowledge-in-use (National Research Council [NRC], 2012, 2013a). Because many commercially available curricula focus on memorization and surface coverage of material, there has been significant effort from the community of scientists and educational researchers to define and describe what knowledge-in-use should look like. These efforts resulted in the publication of the Framework for K-12 Science Education (the Framework) and the Next Generation Science Standards (NGSS) (NRC, 2012; Standards, 2013; National Academies of Sciences, Engineering, and Medicine, 2019), which outline what students should know and be able to do in order to meet the demands of the 21st century. One of the major differences between the older view of science education and the view expressed in the Framework and NGSS has to do with the developmental nature of student understanding and the importance of coherence in the learning process. The previous standards focused on covering a broad range of scientific topics without paying much attention to the need to build connections between them, or to scaffold the learning process in a way that would help students build understanding over time.
The Framework and NGSS, on the other hand, emphasize a developmental approach grounded in decades of research on how students learn (NRC, 2012; Standards, 2013; National Academies of Sciences, Engineering, and Medicine, 2019). They stipulate that the development of deep understanding takes time, careful instruction, and appropriate scaffolding. If used consistently, a developmental approach to learning has been argued to lead to a more meaningful and coherent organization of the learning process and to the development of application and transfer skills in students (NRC, 2000, 2013a; Smith, Wiser, Anderson, & Krajcik, 2006; Krajcik, Sutherland, Drago, & Merritt, 2012; Duschl, Schweingruber, & Shouse, 2007). The essence of the developmental approach is reflected in the idea of a learning progression (LP), which the Framework and NGSS promote as a way to organize the learning process. The Framework emphasizes the usefulness of an LP as a valuable tool for helping educators support the development of deeper, useable knowledge of disciplinary core ideas in science coherently over time. In theory, therefore, an LP represents a "roadmap" of how students can potentially move toward more sophisticated levels of understanding of science over a broad, defined period of time (Duschl et al., 2007; Smith et al., 2006; Alonzo & Gotwals, 2012). It is important to point out that learning progressions are not developmentally inevitable; they depend on specific instruction and student prior knowledge, among other factors (Stevens, Sutherland, & Krajcik, 2009). However, obtaining validity evidence for a specific learning progression helps revise and better align curriculum, instruction, and assessment in a way that better supports students in developing deeper understanding of specific constructs (Neumann, Viering, Boone, & Fischer, 2013). Learning progressions have been described in the literature for various constructs, including atomic-molecular theory (Smith et al., 2006; Talanquer, 2009; Morell, Collier, Black, & Wilson, 2017), evolution (Catley, Lehrer, & Reiser, 2005), environmental literacy (Anderson, 2008; Mohan, Chen, & Anderson, 2009), energy (Lee & Liu, 2010; Neumann, Viering, Boone, & Fischer, 2013), celestial motion (Plummer & Krajcik, 2009; Plummer & Maynard, 2014), and force and motion (Alonzo & Steedle, 2009). There have also been learning progression descriptions that focus on both content and practice (Songer, Butler, Kelcey, & Gotwals, 2009; Gotwals & Songer, 2013), as well as on practice only (Lehrer, Kim, Ayers, & Wilson, 2014; Schwarz et al., 2009; Berland & McNeill, 2010; Osborne, Henderson, MacPherson, Szu, Wild, & Yao, 2016). The Framework builds on this research and defines three dimensions of science as the basis of the theoretical learning progressions described in the document and used to develop NGSS (NRC, 2012; Standards, 2013). The three dimensions are disciplinary core ideas (DCIs), scientific and engineering practices (SEPs), and crosscutting concepts (CCCs). DCIs make it possible to organize K-12 science curriculum, instruction, and assessment around the most important ideas in a scientific discipline. Focusing on a few core ideas that are bigger in scope allows students to develop deep understanding of important ideas in science coherently across school grades and to explain a wide range of phenomena (NRC, 2012, 2013).
CCCs serve as lenses for making sense of phenomena, prompting questions such as "What is the pattern in these data?", "Is this relationship causal or correlational?", and "How does the structure influence the function?". CCCs include patterns; systems and system models; cause and effect; and energy and matter, among others. The third dimension, SEPs, describes the authentic practices that scientists and engineers use to generate and revise knowledge. SEPs differ from simple skills in that doing a practice (like constructing a model to explain a phenomenon) requires not only skill (the hands-on, procedural aspect) but also knowledge specific to each practice (NRC, 2012). The Framework also emphasizes the idea of situated cognition, stating that students learn best when engaged in practices associated with applying the content under study to various real-life situations (Smith et al., 2006). The Framework defines three-dimensional learning (3D learning) as engaging in scientific and engineering practices in order to deepen understanding of crosscutting concepts and disciplinary core ideas (NRC, 2012). Developing the ability to integrate the three dimensions is the goal of 3D learning and is an indicator of deep, useable understanding of science (NRC, 2012). While the Framework describes the theoretical basis of 3D learning, and NGSS outlines possible theoretical learning progressions for the three dimensions of science across grades, we currently have very limited empirical evidence to show that a learning progression for 3D learning can be developed and validated in practice (Wyner & Doherty, 2017). This paper demonstrates the feasibility of developing a three-dimensional learning progression (3D LP) supported by both qualitative and quantitative validity evidence. First, the paper presents a hypothetical 3D LP aligned to a previously designed NGSS-based curriculum. It then presents multiple sources of validity evidence for the hypothetical 3D LP, including analysis of interviews with 17 students and item response theory (IRT) analysis of responses from 899 students, to provide validity evidence for the 3D LP on a large scale. Finally, the paper demonstrates the feasibility of using the assessment tool designed to probe levels of the 3D LP for assigning 3D LP levels to individual student answers, which is essential for the practical applicability of any LP. This work provides an example of a study focused on validating a 3D LP on a large scale in practice. It also demonstrates the usefulness of a validated 3D LP for organizing the learning process in the NGSS classroom, which is essential for successful implementation of NGSS.

Theoretical Framework

Validation of Theoretical Learning Progressions

Learning progressions represent a continuum of increasingly sophisticated ways of thinking about a given concept that develop across a broad, defined period of time (Corcoran, Mosher, & Rogat, 2009; Duschl et al., 2007). Learning progressions are usually grounded in research on how students learn ideas associated with the scientific constructs of interest, as well as in the specific logic of a given discipline. They are bounded by a lower and an upper anchor. The lower anchor describes the prior knowledge and relevant skills that students develop in lower grades, at home, or through other experiences. The upper anchor describes the knowledge and skills students are expected to gain, which could relate to specific learning goals, state or local standards, or any other external criterion.
The intermediate levels describe the skills and knowledge associated with a specific pathway that students take toward mastering the ideas described in the upper anchor. The levels of an LP are expressed as learning performances that summarize what students should be able to do with the scientific knowledge they have (Reiser, Krajcik, Moje, & Marx, 2003). While LPs are promising tools for organizing science instruction, curriculum, and assessment, most of them are theoretical and have not been validated in practice, and therefore cannot be effectively used as "road maps" as suggested by the Framework. To be able to use LPs as diagnostic tools that help educators identify the knowledge and skills students are missing in order to move to higher levels of an LP, we need to develop assessment tools that can probe each level of the LP and accurately place responses on a level (Wilson, 2009). This will help educators gain the information needed to organize instruction and curriculum around the specific core ideas and skills necessary to help students transition to higher levels of a given LP. This process of developing and using assessments to characterize students' understanding, and of determining whether the observed response pattern on the assessment indeed corresponds to the theoretical path described by the levels, constitutes validating an LP in practice (Herrmann-Abell & DeBoer, 2018). There are two common approaches to validating a theoretical LP (Duncan & Hmelo-Silver, 2009). The first is associated with a specific instructional intervention aligned to the theoretical LP, aimed at determining what students are capable of learning given a carefully designed instructional context (Cooper, Underwood, Hilley, & Klymkowsky, 2012; Nordine, Krajcik, & Fortus, 2010). The second approach is associated with the development of a measurement instrument aimed at evaluating student growth along the learning progression as a whole (Mohan, Jing, & Anderson, 2009; Herrmann-Abell & DeBoer, 2018). The instrument may be used to investigate how a previously developed curriculum impacts student learning. Consequently, this approach requires some alignment between the curriculum and the LP. The current study represents the second approach to LP validation. The same research base was used to inform the design of both the curriculum and the learning progression under study, and repeated measures of student understanding were collected as students progressed through the curriculum. Specifically, an NGSS-aligned curriculum was designed to target high school NGSS performance expectations focused on electrical interactions, and the developmental progression of student understanding of the relevant ideas was carefully outlined in the process of curriculum design. This developmental progression was then used as the basis for designing a theoretical learning progression that integrates the three dimensions of NGSS (3D LP). Finally, assessment tasks aligned to specific levels of the theoretical 3D LP were developed to determine how student understanding develops in the context of NGSS-based instruction.

Building the 3D LP according to the principles described in the Framework

The 3D LP presented here is based on the fundamental building principles outlined in the Framework and NGSS. These principles include 1) integrating the three dimensions of scientific knowledge, 2) expressing standards as performance expectations, and 3) focusing on explaining phenomena and solving problems using the three dimensions.
The following paragraphs discuss each of these principles in more detail and show how each was used in this study.

Integrating the three dimensions of scientific knowledge

The three dimensions work together to allow students to make sense of and explain a variety of phenomena or to find solutions to challenging real-world problems. The Framework specifically emphasizes that it is the ability to integrate the three dimensions of scientific knowledge that is indicative of deep, meaningful science understanding (NRC, 2012). This implies that the three dimensions should also be integrated in curriculum, instruction, and assessment. In other words, the learning process should not be centered on instruction and assessment of individual DCIs, SEPs, and CCCs; rather, the focus of learning in the NGSS classroom should be on exploring phenomena by integrating the three dimensions of NGSS. The 3D LP presented here was developed following this principle. Specifically, the 3D LP levels describe increasing sophistication for the relevant DCIs, SEPs, and CCCs together, instead of as separate learning progressions for each dimension. The three dimensions are also integrated in the curriculum and assessment used to validate the 3D LP.

Expressing standards as performance expectations

The three dimensions described above combine to form standards in the NGSS, expressed as performance expectations (PEs) that identify what a student should know and be able to do with the scientific knowledge they have at the end of a grade band. The NGSS provide PEs at each grade level for elementary school and at each grade band for middle and high school. Such a representation of increasing sophistication in students' mastery of the three dimensions reflects a developmental approach and, if used consistently, has been argued to lead to a more meaningful and coherent organization of the learning process (Smith et al., 2006). The 3D LP presented here is aligned to NGSS PEs and describes the appropriate and necessary degree of proficiency in the three dimensions that students should develop at each level.

Focusing on explaining phenomena or solving problems using the three dimensions

Phenomena, in the context of NGSS, are events that can be directly observed in nature and that can be explained using scientific ideas students learn or that build on what students know. Phenomena serve as a gateway that allows students to ask questions and, with help from the teacher, develop an inquiry path toward building understanding of the phenomenon. The focus of the 3D LP presented here is to describe students' ability to integrate the three dimensions of science at different levels of proficiency as related to their ability to explain a wide range of electrostatic phenomena. The assessment instrument designed to probe the levels of the 3D LP asks students to model and explain electrostatic phenomena. Each item represents a storyline containing multiple questions about a phenomenon. The assessment instrument provides detailed information about the degree of student proficiency in integrating the three dimensions of NGSS to explain relevant phenomena.
3D LP context: the "Interactions" curriculum

Understanding electrical interactions is central to developing deep understanding of the DCIs in physical science and is a prerequisite for developing higher-level understanding of more advanced topics. The Framework defines the following DCIs for the physical sciences: "Matter and Its Interactions", "Motion and Stability: Forces and Interactions", "Energy", and "Waves and Their Applications" (NRC, 2012). The emphasis the Framework puts on student understanding of interactions is reflected in questions such as "How can one explain the structure, properties, and interactions of matter?" and "How can one explain and predict interactions between objects and within systems of objects?" (NRC, 2012). According to the Framework, the ability to explain how objects interact at the macroscopic and microscopic levels is indicative of deep understanding of the DCIs in physical science. This work focuses on electrical interactions, which are central to understanding processes in multiple fields of science, including chemical bonding, phase changes, properties of materials, interactions of drugs in cells, the energy contained in hurricanes, and many others. Explaining such processes requires understanding of the atomic nature of matter, electric fields, coulombic interactions, electric forces, and energy. This project uses an NGSS-aligned curriculum for 9th grade physical science called "Interactions". The curriculum is phenomena driven and focuses on helping students build integrated understanding of electrical interactions across time through 3D learning strategies. It focuses on the following aspects of the DCIs, at micro and macro scales, as related to explaining electrical interactions: the atomic nature of matter (the DCI of Matter and Its Interactions, sub-idea of Structure and Properties of Matter), electric forces (the DCI of Motion and Stability: Forces and Interactions, sub-idea of Types of Interactions), and energy (the DCI of Energy, sub-idea of Relationship Between Energy and Forces). The curriculum currently consists of four units. Each unit focuses on investigating engaging natural phenomena using specific aspects of the DCIs, several SEPs, and CCCs. Unit 1 starts with macroscopic-level phenomena related to electrical interactions. The phenomena are presented to students in the form of driving questions that they pursue over the course of the entire unit, or sometimes over a single activity or a few activities. For example, the Unit 1 phenomenon-based driving question is "Why do some clothes stick together when they come out of the dryer?". Students investigate patterns in how charged objects interact and use ideas of electric fields and forces to explain what causes certain types of clothes to stick together when they come out of the dryer. Once students have gained useable knowledge of electrical interactions at the macroscopic level, they proceed to explore atoms and relate charges to atomic structure. This helps students construct more detailed causal models to explain how objects become charged (via transfer of electrons), how electron cloud shifts cause neutral objects to be attracted to charged ones, and so on. By the end of Unit 1, students are expected to have deep understanding of ideas related to charges, electric fields, and forces at the microscopic level. Unit 2 focuses on ideas of energy at the macroscopic and microscopic levels.
Units 3 and 4 focus on applications of the ideas discussed in Units 1 and 2 to explain phenomena related to hydrogen bonding, hydrophobic and hydrophilic interactions, and protein folding. The "Interactions" curriculum is designed to help students develop these ideas over the course of one academic year. The curriculum has gone through an external review process by Achieve (review process: https://www.achieve.org/reviews). Unit 1 of the "Interactions" curriculum received the highest rating, termed "Example of high quality NGSS design", and Unit 2 received the second highest rating, termed "Example of high quality NGSS design if improved". These ratings indicate that the curriculum is a good example of implementing 3D learning in the classroom. Further, the National Science Teachers Association recognizes "Interactions" as being aligned to NGSS and provides classroom videos demonstrating curriculum use on its official webpage (http://ngss.nsta.org/). These pieces of evidence support the choice of this curriculum for developing and validating the 3D LP in this study. The curriculum consists of online materials where all the student activities are located (http://interactions.portal.concord.org/) and paper-based teacher materials that can be accessed online via Google Docs. The curriculum is free and available for anyone to use.

Methodology

Developing and empirically testing the NGSS-aligned 3D LP. A level of the 3D LP can be described as one in a series of comprehensive and developmentally appropriate steps toward more sophisticated application of DCIs, CCCs, and SEPs. The 3D LP presented here focuses on two of the three DCIs covered in the curriculum: the DCI of Matter and Its Interactions (sub-idea of Structure and Properties of Matter) and the DCI of Motion and Stability: Forces and Interactions (sub-idea of Types of Interactions). This is because the current work uses validity evidence collected before and after the implementation of Unit 1 only, and these DCIs were covered in Unit 1 of the curriculum. Further, the 3D LP focuses on the SEP of Developing and Using Models and the CCC of Cause and Effect because those dimensions were most heavily emphasized throughout the curriculum. First, the lower and upper anchors were defined to establish the scope of the LP. The lower anchor was based on students' prior knowledge, characterized from the written assessment and oral interviews with individual students before they started the curriculum. The upper anchor is based on the NGSS PEs. The intermediate levels of the LP were defined based on a combination of the instructional sequence, feedback from disciplinary experts, and literature related to student learning. This process resulted in a hypothetical 3D LP that was then empirically tested based on interviews with students and IRT analysis of the written assessment. Table 2.1 provides a description of the levels of the hypothetical NGSS-aligned 3D LP.
Table 2.1 Hypothetical 3D LP for electrical interactions

Level 3: Microscopic Model for Electrical Interactions
DCI sub-ideas ("Types of Interactions", "Structure and Properties of Matter"):
• causal relationships among amounts of charge, the magnitude of the electric field, the generated attractive/repulsive forces, and the distance between charged objects (Coulomb's Law), with these ideas related to the components of atoms (protons, electrons)
• ideas of force, field, and charge are used to explain phenomena
• matter consists of atoms modeled as having a small, dense, positively charged nucleus with electrons around it; electrons are modeled as point charges or as a cloud
• components of atoms (electrons, protons) are used in explaining interactions between objects
SEP ("Developing and Using Models") and CCC ("Cause and Effect"):
• student models/explanations are causal and explicitly use ideas of electric forces, fields, charges, and the atomic nature of matter to explain phenomena by showing a micro-level mechanism
• models relate changes in the system to changes in forces between interacting atoms to explain phenomena

Level 2: Macroscopic Model for Electrical Interactions
DCI sub-ideas:
• causal relationship among the amount of charge, the magnitude of electric forces, and the distance between charges at the macro level (Coulomb's Law)
• charge viewed as microscopic (might mention electrons, protons, neutrons), but these ideas are not explicitly used to explain phenomena
• matter is made of particles, but this idea is not explicitly used to explain phenomena
• atoms modeled with the plum pudding model or some other inaccurate version of the atomic model
SEP and CCC:
• student models/explanations are causal and explicitly use ideas of electric forces and electric charges to explain phenomena by showing a macro-level mechanism
• models relate changes in the system to changes in forces between interacting objects in the system

Level 1: Incomplete Macroscopic Model for Electrical Interactions
DCI sub-ideas:
• opposite charges attract and like charges repel; charge causes attraction/repulsion
• charge is transferred via contact
• no relationship among the magnitudes of interacting charges, the generated electric force, and distance (Coulomb's Law)
• matter is continuous, or contains particles modeled as circles; charges are modeled as static or point charges; charge is not related to the structure of matter
SEP and CCC:
• models/explanations are not causal and are based on recollection of facts only; no mechanism explaining the phenomenon

Assessment development. A modified evidence-centered design (mECD) process (Harris, Krajcik, Pellegrino, & DeBarger, 2019) was used to develop assessments that show evidence of 3D learning in the context of the curriculum. The mECD approach combines elements of evidence-centered design (ECD) (Mislevy & Haertel, 2006) and the construct-centered design (CCD) process (Shin, Stevens, & Krajcik, 2010) to design tasks for measuring knowledge-in-use. The first step of mECD involves identifying and unpacking an NGSS PE in order to develop a 3D claim that describes what students should be able to do with the corresponding DCIs, SEPs, and CCCs. The process of unpacking specifies the aspects of the DCIs, SEPs, and CCCs that students should master in order to meet a given NGSS PE. It is important to unpack NGSS PEs because they are broad statements covering multiple content areas that are not necessarily the focus of the 3D LP, and therefore not the focus of the assessment designed to measure the 3D LP levels. Unpacking also ensures coherence and alignment among the NGSS PEs, the assessment, and the 3D LP levels. The next step involves specifying the evidence that shows students have met the requirements specified in the claim. The claim and evidence combine to form an mECD argument.
Finally, assessment tasks are developed for each mECD argument that will provide the necessary evidence to measure the claim. This process is shown in Figure 1. Figure 2.1 Summary of the modified evidence-centered design process. An example of the mECD argument for an item that helps characterize the level of students’ understanding of electrical interactions is summarized in Table 2. The item is designed to provide evidence on whether students are at level 1, 2 or 3 of the 3D LP. The mECD argument focuses on the DCIs of Matter and Its Interactions and Motion and Stability: Forces and Interactions, specifically the elements PS1.A (Structure and Properties of Matter) and PS2.B (Types of Interactions). Further, the mECD argument focuses on the SEP of Developing and Using Models and the CCC of Cause and Effect. There were a total of 8 items designed to measure 3D understanding of electrical interactions for Unit 1. Each item is open-ended (see Table 2) and contains an aspect of a DCI, a SEP and a CCC. Items were administered as a pre and post Unit 1 test during the 2016-2017 academic year. Several items, including the one shown in Table 2, were used to conduct interviews before and after Unit 1 to obtain qualitative validity evidence for the 3D LP.

Alignment between the 3D Learning Progression and the Scoring Rubric

Each item was open-ended and measured all 3 levels of the 3D LP. To assign a level on the 3D LP to an answer, scorers used the following criteria: Are the most relevant parts of the DCI present? Does the answer reflect macro- or micro-level understanding? Is the explanation causal? The rubric describes the DCI, SEP and CCC for each item. Each answer was scored directly onto a 3D LP level; for example, a score of 1 on an item corresponds to level 1. Table 3 shows the rubric, the level of the 3D LP, and a sample answer from the oral interview for the item shown in Table 2.

Table 2.2 Example of mECD process
Claim: Students construct a causal model to explain how objects become charged using electron transfer.
Evidence: Students use electron transfer between atoms as their model to explain the mechanism for charging objects. They include these ideas in the models as appropriate:
1. Objects are initially neutral (the # of electrons is equal to the # of protons in the atoms);
2. Transfer of electrons between the atoms of one object and the atoms of another object causes both objects to become charged.
a. Objects are made of atoms;
b. Atoms consist of a positively charged nucleus containing positively charged protons and neutral neutrons that is surrounded by negatively charged electrons;
c. Atoms with the same number of electrons and protons are neutral.
d. Atoms with an unequal number of electrons and protons are charged.
i. If they have more electrons than protons, the atoms will be negatively charged.
ii. If they have fewer electrons than protons, the atoms will be positively charged.
iii. When an atom becomes charged, electrons move from one atom to another.
iv. When electrons transfer from one atom to another, one atom becomes negatively charged and the other becomes positively charged.
3. Electron transfer is caused by contact between objects (touching or rubbing);
a. The effect of electron transfer on an object that gave electrons is a net “+” charge because the atoms of this object have a larger number of protons than electrons; the effect of electron transfer on an object that received the electrons is a net “-“ charge.
b. The # of electrons lost by the atoms of one object equals the # of electrons gained by the atoms of another. Therefore, charge is conserved.
4. Student models will show a causal relationship between components of atoms and the generated electric forces and fields when explaining phenomena involving electrical interactions.
a. An unequal number of electrons and protons within an atom causes a net charge
b. Charged atoms generate an electric field around them
c. When two atoms get close enough for their fields to interact, an electric force is generated between the two atoms
i. An attractive electric force is generated between oppositely charged atoms
ii. A repulsive electric force is generated between similarly charged atoms
iii. The smaller the distance between the atoms, the larger the generated electric force (attractive or repulsive), and vice versa
iv. The larger the charge on each of the interacting atoms, the larger the generated force (attractive or repulsive), and vice versa
5. Less sophisticated models contain fewer microscopic-level components and provide few or no causal relationships to account for observations of electrostatic phenomena.
Task: Students are shown a video where fur and rod don’t attract paper before they are rubbed together. Upon being rubbed together, both fur and rod start attracting paper. Draw a model that shows what happens to the rod and fur when they are rubbed together to cause the paper to move towards the rod. Make sure to label everything in your model. Describe what happens to the rod and fur during the process of rubbing them together.

Data Analysis

Constructing Hypothetical 3D LP and Evaluating Assessment Items and Rubric

The hypothetical 3D LP shown in Table 1 was constructed using the logical sequence of the discipline, relevant research literature, and unpacking of NGSS PEs. The “Interactions” curriculum was piloted in the same schools in the Mid-West a year prior to the data collection described here. Unit 1 assessment items, designed to probe the levels of the 3D LP, were administered during the pilot year via the online “Interactions” portal. Two researchers went through 100 student responses for each item to ensure that the items elicited the types of responses that the researchers anticipated based on the preliminary levels of the 3D LP and the scoring rubric. Based on this analysis, the 3D LP levels, assessment items, and scoring rubric were modified to ensure consistency and improved validity of the 3D LP and the assessment instrument. There were no major changes made to either the 3D LP or the assessment. The assessment items were rephrased, and scaffolds were added to help students understand the questions and address all parts of each question. The 3D LP levels were not modified significantly, but a note was taken of the types of answers that seemed to contain ideas from multiple levels of the 3D LP and therefore represent “in-between-level” 3D understanding. This is discussed in the results section.

Supporting levels of the 3D LP using qualitative analysis of student interviews

The interview data were collected in a Mid-Western public high school where the “Interactions” curriculum was implemented. The school was rural, with 28% free and reduced lunch. Students from three different classrooms were interviewed. Two classrooms had the same teacher, and one classroom had a different teacher. Both teachers had taught the “Interactions” curriculum prior to the data collection year. Students from all three classrooms had very little prior knowledge of electrical interactions based on pre-Unit 1 interview analysis.
Several students from each of the three participating classrooms were interviewed before and after implementation of Unit 1, for a total of 17 students. The students were selected to represent different levels of academic achievement. Items from two different testlets were used in the interview: the foil experiment testlet, and the paper and rod testlet. Sample interview analysis for the paper and rod testlet is shown in Table 3. Sample interview analysis for the foil experiment testlet is shown in Table 4. The mECD argument for the foil experiment testlet is provided in the Appendix. In the foil experiment item, students develop an atomic model consistent with the results of the Rutherford experiment. These two testlets probe ideas related to the three levels of the hypothetical 3D LP shown in Table 1. Student interviews were analyzed using the scoring rubric, and each answer was assigned a level on the 3D LP. Inter-rater reliability was established in the following manner. One researcher scored all 17 interviews first. Then, two other researchers used the same rubric to score the interviews of 3 students from each classroom (9 students total). Once 100% agreement on 3D LP level placement for all 9 students was reached between the 3 scorers, the scoring rubric and the 3D LP levels were modified accordingly, and the rest of the interviews were rescored based on this discussion.

Support for the Validity of Levels of the 3D LP using Item Response Theory (IRT)

The pre and post Unit 1 assessment data were collected in six schools in the Mid-West and five schools in the Western United States. Schools in the Mid-West were rural, with 28% free and reduced lunch. Schools in the Western part of the US were urban, with 72.4% free and reduced lunch. The assessment was administered in classrooms where the “Interactions” curriculum was piloted during Fall 2016 and Spring 2017. The total sample size was 899 students. Teachers in the Mid-West schools had taught the “Interactions” curriculum prior to the data collection year, and teachers in the Western part of the US were first-time users of the curriculum. Students on average had very little prior knowledge of the constructs measured by the two assessment instruments, based on pre-Unit 1 interview data. IRT analysis for the Unit 1 pre/post assessment was carried out following Toland (2014). The sample of 899 students was modeled using the graded response model (GRM) (Samejima, 1969). A score of “0” was imputed for students who had missing values on any of the items. This was deemed appropriate because students were given an unlimited amount of time to finish the assessment; therefore, it was safe to assume that if they did not provide an answer for an item, they did not know it. Pre/post assessment data were combined in model estimation to allow for comparison of ability distributions on the pre and posttest. Unidimensionality and longitudinal invariance are discussed in Chapter 1. Pre and post measures were highly reliable (pre Unit 1 = 0.872, post Unit 1 = 0.934) and supported by validity evidence (Chapter 1). This suggests a unidimensional IRT model is appropriate for the data. The Appendix provides R code for model selection, specification, and estimation using the mirt package (Chalmers, 2012) in RStudio (RStudio Team, 2015). The results section presents the IRT analysis relevant to the 3D LP validation.
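The Appendix code is not reproduced here, but a minimal sketch of this estimation step with the mirt package might look as follows (object names such as `resp` are hypothetical, and the actual appendix script may differ):

```r
# Minimal sketch of the GRM estimation described above, using the mirt package.
# `resp` is a hypothetical data frame of the 8 open-ended items scored 0-2,
# with pre- and post-test administrations stacked as separate rows.
library(mirt)

resp[is.na(resp)] <- 0                 # impute a score of "0" for missing responses
grm <- mirt(resp, model = 1,           # one latent dimension (unidimensional)
            itemtype = "graded")       # Samejima's graded response model
coef(grm, IRTpars = TRUE, simplify = TRUE)  # item slopes and category difficulties
theta <- fscores(grm, method = "EAP")  # person proficiency estimates on the logit scale
```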
Table 2.3 Sample responses for every 3D LP level for paper and rod

Level/Score: 1
3D LP Scoring Rubric
DCI, Structure and Properties of Matter:
• matter is continuous, or contains particles modeled as circles
• charges are modeled as static or point charge
• don’t relate charge to structure of matter
DCI, Types of Interactions:
• same charges repel and opposite charges attract
• charge is transferred via contact
• no relationship between magnitude of interacting charges and generated electric force and distance (Coulomb’s Law)
• charge causes attraction/repulsion
SEP and CC:
• models/explanations are not causal, based on recollection of facts only; no mechanism explaining phenomenon
Question: Draw a model that explains what happens to the rod and fur when they are rubbed together to cause paper bits to move towards the rod. Label your drawing.
DCI: Structure and Properties of Matter
• Charges might be shown as parts of objects (rod, paper, fur) but are not used to construct a causal explanation
• Model does not explain what makes objects initially neutral
• Model does not use charge transfer to explain how the rod becomes charged. Atoms are not shown
DCI: Types of Interactions
• Fuzz represents static electricity and charge transfer
• Fuzz/static is transferred through rubbing
• Model doesn’t use the electric force and charge relationship to explain the phenomenon
SEP and CC
• Model explains attraction between paper and rod using ideas related to magnets/magnetic force
• No causal mechanism beyond recollection of facts
• Static/fuzz causes attraction
Sample Student Response. Student: as the cloth rubs on the rod, it causes the rod to have some kind of “magnetic” effect. Kind of like rubbing a piece of cloth on a balloon causes a kind of “electric” charge. Paper is attracted to this “magnetized” rod.
Comment: relevant components of the DCI are not present; the model contains only observable components and no causal mechanism to explain why paper is attracted to the rod.

Table 2.3 (cont’d).
Level/Score: 2
3D LP Scoring Rubric
DCI, Structure and Properties of Matter:
• matter is made of particles, but this idea is not explicitly used to explain phenomena
• atoms modeled with the plum pudding model or some other inaccurate version of the atomic model
DCI, Types of Interactions:
• causal relationship between amount of charge, magnitude of electric forces and distance between charges at the macro level (Coulomb’s Law)
• charge viewed as microscopic (might mention electrons, protons, neutrons), but these ideas are not explicitly used to explain phenomena
SEP and CC:
• student models/explanations are causal and explicitly use ideas of electric forces and electric charges to explain phenomena by showing a macro-level mechanism
• models relate changes in the system to changes in forces between interacting objects
Question: Draw a model that explains what happens to the rod and fur when they are rubbed together to cause paper bits to move towards the rod. Label your drawing.
DCI: Structure and Properties of Matter
• Paper, rod and fur all contain charges
• Paper, rod and fur are initially neutral because no interaction is observed; all objects contain equal numbers of + and – charges
• Model shows charges transferred between rod and fur (just positive, just negative, or both) during rubbing
• Charges are modeled as point charges. Atoms are not shown.
DCI: Types of Interactions
• When the charged rod is brought close to the paper bits, an attractive force is generated between the charged rod and charges in the paper
SEP and CC
• Model or written explanation shows that the attractive force between the charged rod and charges in the paper causes the paper bits to move, but doesn’t explain how charges in the neutral paper originate
Sample Student Response. Student: when the rod was not rubbed by the fur, it was neutral, which was why it did not stick to the paper (neutral and neutral objects don’t interact). When the fur was rubbed onto the rod, it gave negative charges over to the rod, which then made it negative. The paper bits were attracted to the negative rod because in the bits there are positives and negative charges and so the negative rod attracted to the positive charges inside paper bits.
Comment: Relevant DCIs are present; the model provides a macro-level causal mechanism (highlighted sentence) but does not fully explain why neutral paper is attracted to the charged rod.

Table 2.3 (cont’d).
Level/Score: 3
3D LP Scoring Rubric
DCI, Structure and Properties of Matter:
• matter consists of atoms modeled as having a small, dense, positively charged nucleus and electrons orbiting around it; electrons are modeled as a point charge or cloud
• components of atoms (electrons, protons) are related to explaining interactions between objects
DCI, Types of Interactions:
• causal relationships between amounts of charge, magnitude of electrical field and the generated attractive/repulsive forces and distance between charged objects (Coulomb’s Law), relating these ideas to components of atoms (p, e)
• use ideas of force, field and charge to explain phenomena
SEP and CC:
• student models/explanations are causal and explicitly use ideas of electric forces, fields, charges and the atomic nature of matter to explain phenomena by showing a micro-level mechanism
• models relate changes in the system to changes in forces between interacting atoms to explain phenomena
Question: Draw a model that explains what happens to the rod and fur when they are rubbed together to cause paper bits to move towards the rod. Label your drawing.
DCI: Structure and Properties of Matter
• Paper, rod and fur all contain charges modeled as parts of atoms
• Paper, rod and fur are initially neutral because no interaction is observed; all objects contain equal numbers of protons (+) and electrons (-) within their atoms
• Model shows electrons transferred from fur to rod or vice versa during rubbing
DCI: Types of Interactions
• Excess electrons in the atoms of the rod cause the rod to have a “-“ charge. Alternatively, lack of electrons causes the rod to have a “+” charge.
• When the charged rod is brought close to the neutral paper bits, a repulsive force is generated between electrons in the atoms of the rod and electrons in the atoms of the paper. This repulsive force causes electrons in the atoms of the paper to move away from the rod, exposing the positively charged nucleus. The attractive force between the nuclei of atoms in the paper and electrons in the rod causes the paper to move towards the rod.
SEP and CC
• Model or written explanation shows that the attractive force between the charged rod and charges in the paper causes the paper bits to move and explains the origin of the attractive force in spite of the fact that paper is neutral.
Sample Student Response. Comments: no level 3 responses were observed by the end of Unit 1 for this interview item either, which is consistent with the developmental approach.
Table 2.4 Sample responses for every 3D LP level for the foil experiment

Level/Score: 1
3D LP Scoring Rubric
DCI, Motion and Stability: Forces and Interactions:
• same charges repel and opposite charges attract
• charge transferred via contact
• charge viewed as macroscopic (point charge)
• no relationship between magnitude of interacting charges and generated electric force and distance (Coulomb’s Law)
• charge is static electricity that causes attraction/repulsion
DCI, Matter and Its Interactions:
• matter is continuous, or made of particles modeled as plain circles; don’t relate charge to structure of matter
SEP and CCs:
• models/explanations are not causal, based on recollection of facts only
• no mechanism explaining phenomenon
Question 1: Draw a model of a silver atom that is consistent with Tom’s results.
• Models show a plum pudding model, or some inaccurate version of the model (matter consists of different charges that are not parts of atoms, charges mixed up, components missing, etc.)
Question 2: Explain why your model is consistent with the observation that relatively few particles were deflected by the sheet of foil (followed Paths B, C, D or similar). Justify your answer.
• Explanations are at the macroscopic level; ideas of forces/fields are not used to explain the pattern, and interactions at a distance between charged particles are not mentioned
Sample Student Response (Question 2). Student: some particles passed through because the foil was unkrinkled…In the krinkled spots on the foil the particles would bounce back
Comments: students model the structure of matter as containing positive and negative point charges and construct a causal explanation of the observed pattern using only macro-level observable components (“crinkled spots” cause the observed pattern). All of this is consistent with level 1 of the 3D LP.

Table 2.4 (cont’d).
Level/Score: 2
3D LP Scoring Rubric
DCI, Motion and Stability: Forces and Interactions:
• causal relationship between amount of charge and magnitude of attractive/repulsive forces and distance between charges at the macroscopic level (Coulomb’s Law)
• charge viewed as microscopic (might mention electrons, protons, neutrons), but these ideas are not explicitly used to explain phenomena
DCI, Matter and Its Interactions:
• matter is made of particles, but this idea is not explicitly used to explain phenomena; particles making up matter modeled with the plum pudding model or some other inaccurate version of the atomic model
Question 1: Draw a model of a silver atom that is consistent with Tom’s results.
• Models include a concentrated positively charged nucleus that takes up a small portion of the total volume of the atom and negatively charged electrons surrounding the nucleus [cloud or points]
Question 2: Explain why your model is consistent with the observation that relatively few particles were deflected by the sheet of foil (followed Paths B, C, D or similar). Justify your answer.
• Explanations use the relationship between electric force and distance between charged particles to explain the pattern
• The explanation evokes a “hitting mechanism”, indicating that particles that went through the foil did not hit the sub-atomic particles directly but passed through the empty space between the atoms
Sample Student Response (Question 2). Student: Particles from the detector are deflected if they hit the nucleus, but not heads-on.
They bounce back if they hit the nucleus directly, and pass through if they pass through the empty space between the atoms
Comments: student models show an accurate structure of the atom (small, dense, positive nucleus with point-charge negative electrons around it). However, explanations don’t mention ideas of electric forces to construct a causal account of the phenomenon. Instead, explanations rely on a macro-level “hitting” mechanism to explain the pattern. This is consistent with level 2 of the 3D LP.

Table 2.4 (cont’d).
Level/Score: 3
3D LP Scoring Rubric
DCI, Motion and Stability: Forces and Interactions:
• causal relationships between amounts of charge, magnitude of electrical field and the generated attractive/repulsive forces and distance between charged objects (Coulomb’s Law), relating these ideas to components of atoms (protons, electrons)
• use ideas of force, field and charge to explain phenomena
DCI, Matter and Its Interactions:
• matter consists of atoms modeled as having a small, dense, positively charged nucleus and electrons orbiting around it; electrons are modeled as a point charge or cloud
• components of atoms (electrons, protons) are related to explaining interactions between objects
Question 1: Draw a model of a silver atom that is consistent with Tom’s results.
• Models include a concentrated positively charged nucleus that takes up a small portion of the total volume of the atom and negatively charged electrons surrounding the nucleus [cloud or points]
Question 2: Explain why your model is consistent with the observation that relatively few particles were deflected by the sheet of foil (followed Paths B, C, D or similar). Justify your answer.
• Explanations use the relationship between electric force, electric field, and distance between charged particles to explain the pattern
• The explanation evokes a microscopic-level mechanism indicating that particles that went through the foil did not hit the sub-atomic particles directly but passed through the empty space between the atoms
Sample Student Response. Comments: no level 3 responses were observed by the end of Unit 1. This observation is consistent with the developmental approach because level 3 understanding reflects deep conceptual understanding of science ideas at the microscopic level and the ability to apply them by blending the three dimensions of NGSS effectively across various situations. This type of understanding takes a long time to develop. The author would expect most students to ultimately develop this level of understanding by the end of the curriculum. The elements of an answer to items 1 and 2 consistent with level 3 include the following:
• Model shows the foil is made of atoms with a dense, positively charged nucleus and electrons as a cloud of negative charge
• Explanations indicate that alpha particles that come close enough to interact with the electric field created by the positively charged nuclei of atoms in the foil are repelled by the generated repulsive force. This repulsive force causes the particles to either bounce back if they interact with the nucleus head-on (Path B), or come out at an angle if they come close to the nucleus (Paths C, D). The alpha particles that don’t come close enough to interact with the electric field generated by the nuclei come out of the foil without changing their original path. Since the nucleus takes up only a small volume of the atom, most alpha particles never interact with the nucleus (Path A).
Results

Supporting the Validity of Levels of the 3D LP using Qualitative Analysis of Student Interviews

Identifying Key Knowledge and Practices for Each Level of the 3D LP

Qualitative analysis of student interviews served as a rich source of information for obtaining validity evidence for the hypothetical 3D LP levels. Analysis of student responses supported the hypothesized progression of student understanding reflected in the 3D LP levels for this phenomenon. Specifically, at level 0 student answers contain no relevant information, so examples for that level are not shown. Level 1 responses reflect macro-level models: the models contain observable components and no relevant causal mechanistic details at either the macro or micro scale. In the context of the paper and rod item, models do not explain how the rod becomes charged (charge transfer as a result of rubbing between the rod and fur that causes the rod to become charged) or why neutral paper bits are attracted to the charged rod (due to the attractive force between the charged rod and charges in the paper). Students use words such as “static” or “magnets” to explain electrostatic phenomena without specifying what they mean by these terms. In the context of the foil experiment items, student models only show macro-level components, or point charges, without explaining how charges in the foil are involved in producing the pattern that is observed in the experiment. At level 2, student models reflect macro-level causal accounts that contain relevant aspects of the DCIs related to charges and attractive forces used to explain phenomena. In the context of the paper and rod item, student models show that rubbing causes charge transfer between rod and fur, which causes the paper and rod to become charged. The models also show that paper bits are attracted to the charged rod as a result of the attractive force generated between charges of the rod and the paper. Charges, however, are modeled as point charges and not parts of atoms. This lack of detail in the level 2 models leads to incomplete or inaccurate explanations of phenomena and a lack of microscopic-level details. For example, to provide a full causal account for why neutral paper is attracted to a charged rod, models need to show where the charges involved in the interaction between rod and paper originate (excess electrons on the rod and the nuclei in the atoms of the paper, which become exposed as a result of the repulsive interaction between electrons of the rod and the paper). Level 2 models lack that level of detail because students do not always relate charges to components of atoms (protons and electrons). Similarly, when explaining how rubbing causes fur and rod to become charged, level 2 models often indicate that both positive and negative charges are transferred between rod and fur. These inaccuracies probably also stem from the fact that students do not relate charges to the structure of the atom and lack the understanding that positive charges are protons, which are located in the nucleus of the atom and therefore cannot be transferred during rubbing; only electrons transfer as a result of rubbing. In the context of the foil experiment items, student models show atomic models of varying degrees of accuracy. The models and explanations use ideas of charges to explain the observed pattern, but with some macro-level inaccuracies.
For example, the sample student model and response shown in Table 4 indicate an accurate model of the atom (small, positively charged nucleus, and electrons around the nucleus), but explain the observed pattern as resulting from a sort of “hitting mechanism” where alpha particles shot at the gold foil either hit the nucleus of the atom directly or on the side. The explanations lack the “interaction at a distance” aspect that would show that students view electrical forces as acting without contact, through the field across space, rather than as the contact forces they are more familiar with at the macroscopic level. Therefore, level 2 of the 3D LP is characterized by students’ ability to develop a macroscopic-level causal relationship between charges and generated electric forces to explain electrostatic phenomena, but a lack of the microscopic-level details needed to provide a full causal mechanistic explanation. These microscopic-level mechanistic details that are missing from level 2 models are present in level 3 responses. At this level students demonstrate mastery of force, field and charge relationships and atomic-level understanding by showing charged particles (electrons, protons) as parts of atoms and explaining the origin, direction and mechanism of action of electric forces.

Evidence in Support of the Developmental Nature of Student 3D Understanding

While there were no level 3 responses observed in the interviews or in the scoring of the entire student sample of written pre and post assessments, there were some responses that could be characterized as transitioning between the levels of the 3D LP. Table 5 provides examples of student answers that were considered to fall between the levels for the Rod and Fur item and explains why. For example, transitioning from level 1 to 2 of the 3D LP is characterized by the types of responses that mention microscopic-level components (e.g., charged particles) either in the explanation or the model, but do not provide a complete causal account for how these components explain the phenomenon in question. For the transition level 1-2 response in Table 5, the model shows only observable components (paper, rod, fur), which is consistent with level 1 of the 3D LP. The explanation for the model states that rubbing causes the rod to become charged, therefore recognizing that charge is generated via contact. However, neither the model nor the explanation shows how rubbing causes the rod and fur to become charged using either point charges or electrons. The explanation further mentions attraction between charged particles in the paper (protons and electrons) and the rod, but does not provide any details on what causes the attraction and where the charged particles are located. Therefore, while the student might be recalling some terms and processes consistent with higher levels of the 3D LP (protons, electrons, charging through rubbing), these ideas are not used to explain how the rod becomes charged or why neutral paper is attracted to the rod. Hence, the model does not reflect the ability to integrate the three dimensions of NGSS consistent with level 2 of the 3D LP, but there are ideas and connections present that make this model more sophisticated than those at level 1 of the 3D LP. This model therefore represents an example of transitioning from level 1 to level 2. Similarly, transitioning from level 2 to 3 of the 3D LP is characterized by the types of models that provide incomplete or inaccurate microscopic-level causal accounts of phenomena.
For example, the sample transition level 2-3 response shown in Table 5 contains all but one microscopic-level detail necessary to provide a full causal account for the phenomenon in question. Specifically, the model explains, at the microscopic level, how the rod becomes charged (via transfer of electrons from atoms of the fur to the atoms of the rod during rubbing), but does not provide a microscopic causal explanation for why neutral paper is attracted to the charged rod, using heuristics instead (because charged and neutral objects attract). Table 6 provides examples of student answers that were considered to fall between the levels for the foil experiment item and explains why. For example, in the sample level 1/2 transitional response, the student is attempting to use unobservable components, such as charge and field, to explain the phenomenon, but the model is vague, and it is not clear from either the model or the explanation what the difference between a field and a charge is, and how both ideas are involved in explaining the observed pattern. Further, in the sample level 2/3 answer, the student shows an accurate model of the atom and uses ideas of fields in the explanation, but still reverts to a “hitting mechanism” when explaining the pattern, instead of using ideas related to interactions at a distance.

Table 2.5 Sample responses that fall between levels of the 3D LP for paper and rod
LP Level 1/2. Student explanation: after being rubbed with fur the rod becomes charged due to friction through rubbing. The paper has neutral charge. But the charged particles (protons and electrons) in the paper become attracted to the charged rod.
LP Level 2/3. Student explanation: rubbing causes electrons from the fur to go to the rod, making the atoms of the rod charged. Paper atoms are neutral, they have equal number of protons and electrons. Neutral paper attracts to the charged rod because neutral and charged objects attract. The closer the rod, the bigger the force.

Therefore, transition levels can be summarized as containing more relevant content (aspects of DCIs) but lacking application of the content for explaining phenomena. This reflects the nature of the 3D understanding the 3D LP aims to describe, which is characterized by achieving knowledge-in-use, or the ability to apply content to explain real-life situations. All in-between-level responses were assigned the lower level on the 3D LP as the final level for online responses because they did not contain all the aspects consistent with the higher level.

Table 2.6 Sample responses that fall between levels of the 3D LP for the foil experiment
LP Level 1/2. Student explanation: few particles bounced back. This is because when the particles inside the foil are scattered around, but sometimes they form a clump of particles that makes it so no other particle being shot at it can go through. When the particles clump together they make very strong electric field, like a charge, which repels the particles that are being shot at it.
Comment: the student is attempting to use microscopic-level ideas to explain the phenomenon, but confuses electric charge and field. The charges are not shown in the model, and it is not clear what the structure of the “clump of particles” is. Overall, the explanation and model represent a transition between purely observable macro-level thinking consistent with level 1 and elements of micro-level-based thinking consistent with level 2.
LP Level 2/3. Student explanation: particles bounce back if they come close to positive or negative charges in the atoms of the foil because there is a strong electric field around these particles. Depending on which side they hit the electric field, they might come out at an angle. They go through between empty space on the foil where there is no electric field.
Comment: the model shows electric charges as parts of atoms; the explanation uses ideas of field to explain the pattern. Both the model and the explanation are mostly at the microscopic level, consistent with level 3. However, the explanation is not accurate in that it says that alpha particles interact with both positive and negative charges in the foil. Also, it doesn’t use the idea of electric force, and instead uses a “hitting” mechanism to explain the interaction between the electric field of atoms and the alpha particles, which is consistent with level 2 of the LP. Finally, atoms are shown to take up most of the space in the foil, so the model would not explain why most particles went through undisturbed.

Consistency in Assigning Responses to 3D LP Level for Different Phenomena

Since students were asked to explain more than one phenomenon, it was possible to study students’ ability to transfer their 3D understanding to different contexts. Specifically, the foil experiment item is an example of an abstract phenomenon that students cannot directly observe, which makes it harder to model and explain. The foil experiment also contains more complex ideas and requires deeper understanding. On the other hand, the paper and rod item focuses on a more familiar, observable phenomenon. This difference in how familiar the phenomena were to students is evident in the levels of the answers provided for both scenarios in the interview. Table 7 shows the assignment of levels for each student on each interview item. Specifically, on the pretest, 14 students scored a level 1 and 3 scored between levels 1 and 2 of the 3D LP on the paper and rod item. With the foil experiment item, only 7 students scored a level 1 and 10 students scored a level 0 of the 3D LP. These results suggest that the abstract foil experiment was more difficult for students to model and explain. Overall, the majority of interviewed students demonstrated proficiency between levels 0 and 1 of the 3D LP on the pre-Unit 1 interview. Similarly, on the posttest, 13 students scored in level 2, 2 scored intermediate level 2/3, and only 2 students remained in level 1 of the 3D LP for the paper and rod item. For the foil experiment, on the other hand, only 6 students moved to level 2 (4 from level 1 and 2 from level 0), 6 moved to level 1, 2 moved to intermediate level 1/2, and 3 moved to intermediate level 2/3. These results suggest that while students develop quite sophisticated macroscopic-level understanding of relatively straightforward electrostatic phenomena like the attraction of neutral paper to a charged rod, they need more time and scaffolding to transition to the microscopic-level 3D understanding of electrical interactions required to explain abstract phenomena like the foil experiment, which involves more complex ideas.
Table 2.7 Student score/3D LP level for each interview phenomenon

Student | Paper and Rod, Pre-Unit 1 | Paper and Rod, Post-Unit 1 | Foil Experiment, Pre-Unit 1 | Foil Experiment, Post-Unit 1
A | 1/2 | 2 | 1 | 2
B | 1 | 2 | 0 | 1
C | 1 | 2 | 0 | 1
D | 1 | 2 | 1 | 2
E | 1 | 1 | 0 | 1
F | 1 | 1 | 0 | 1
G | 1/2 | 2/3 | 1 | 1/2
H | 1 | 2 | 1 | 2/3
I | 1 | 2 | 0 | 1/2
J | 1 | 2 | 1 | 2
K | 1 | 2 | 0 | 2
L | 1 | 2 | 0 | 1
M | 1 | 2 | 0 | 2
N | 1/2 | 2/3 | 1 | 2/3
O | 1 | 2 | 0 | 1
P | 1 | 2 | 0 | 2/3
Q | 1 | 2 | 1 | 2

Supporting the Validity of Levels of the 3D LP using IRT

In this section, Wright Maps resulting from fitting the graded response model (GRM) are used to show additional validity evidence for the 3D LP levels. Wright Maps show ability and item difficulties on the same axis (the y-axis) and items on the x-axis (Wilson, 2004; Wilson, 2009). The GRM is a polytomous item response model, used for items with more than two response categories, like the ones designed for this study. Under the GRM, each response category has its own difficulty parameter (Samejima, 1969). The interpretation of category difficulty under the GRM is the following: a student with ability level equal to the difficulty of a given response category has a fifty percent probability of scoring in that category or above, and fifty percent of scoring below it (Samejima, 1969). In order to use the Wright Map to gain validity evidence for the 3D LP, it is important to keep in mind that item category difficulty (which is on the same scale as ability) relates to the score in the rubric for that category, which in turn relates to a 3D LP level. Therefore, when looking at the Wright Map, we want to see whether the abilities that correspond to difficulties for various item response categories are consistent with those theoretically suggested by the rubric and the 3D LP.
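For reference, the fifty-percent interpretation above follows from the standard logistic form of the GRM; as a sketch (the estimation software may use an equivalent but differently written parameterization):

$$P(X_{ij} \ge k \mid \theta_i) = \frac{1}{1 + \exp\left[-a_j\left(\theta_i - b_{jk}\right)\right]}$$

where $\theta_i$ is the ability of student $i$, $a_j$ the discrimination of item $j$, and $b_{jk}$ the difficulty of response category $k$ of item $j$. Setting $\theta_i = b_{jk}$ gives $P = 0.5$, i.e., a fifty percent chance of scoring in category $k$ or above.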
It means that respondents with ability level above 1.05 are at level 1 of the 3D LP, and respondents with ability level below 1.05 are at level 0 of the 3D LP. Further, the cutoff for level 1-2 is 1.72 and has the same 108 interpretation as level 0-1 cutoff. It was calculated as the median item threshold on logit scale (Doherty et al., 2015). Since no scores corresponding to level 3 of the 3D LP were observed and thresholds for level 3 LP have not been determined, the cutoff for level 2-3 cannot be accurately determined. However, the highest threshold for level 2 is 2.43, and it is likely that level 3 ability level will be located close or slightly above that value. As seen in Figure 2, level 1 difficulties are well separated from level 2 difficulties. Specifically, no level 1 difficulty falls above the cut-off point for level 1, and no level 2 difficulty falls below the cutoff point for level 2. Therefore, all level 1 difficulties are located in approximately the same ability region and do not overlap any of the level 2 difficulties. This suggests that the progression of student understanding predicted by hypothetical 3D LP levels is supported by the data, which provides quantitative validity evidence piece for the 3D LP (Doherty et al., 2015; Wilson, 2004). Pre test Post test Level 1-2 Cutoff=1.72 Level 0-1 Cutoff=1.05 Level 3 Level 2 Level 1 Level 0 Figure 2.2 Wright map showing learning progression levels for unit 1 assessment items 109 Evaluating Student Learning based on unit 1 assessment The data for pre and post assessment was combined when fitting GRM model (see Appendix for details) in order to be able to compare how ability distributions change between pre and posttest. The Wright Map in figure 2 shows distribution of responses (Respondents) for pre and posttest on one graph. As you can see, both pre and post unit 1 contain significant number of respondents below 0 on the logit scale. These are respondents with missing data, for whom zeros were imputed at both time points. Respondents who did not provide any answer on pre and post-test still participated in the curriculum as can be seen from their work in Unit 1 saved in the online portal, and provided responses for assessment on subsequent units. Therefore, even though they had missing data for Unit 1 assessment, they were left in the sample to ensure that we can use their data to further investigate levels of the 3D LP when assessment data for subsequent units is analyzed. To check the extent of learning that occurred before and after Unit 1 was covered, Wald test was conducted to determine if the increase in the mean between pre and post-test was statistically significant. The mean increased from 0.067 to 0.375 on the logit scale between pre and post-test, and the Wald test showed that this increase was statistically significant (W=149.8, df=1, p>0.001), indicating that learning occurred between pre and posttest assessment for the entire sample of students. However, to better understand how the learning occurred in terms of student movement along the levels of the 3D LP, we need to look at the distribution of responses and compare pre and post unit assessment for each level of the 3D LP. Since the respondents who did not provide any answer on pre and post assessment introduce too much noise into the distribution, they were removed from the Wright Map to be able to see the degree of spread in learning for those students who provided the answers. 
This allows to draw more accurate 110 conclusions about student growth upon completion of Unit 1. Figure 3 below shows the Wright Map of reduced data for those who provided answers on pre and post assessment. Pre test Post test 3D LP level cutoffs Distribution maximum on pre and post test Average ability level for each threshold 2.05 1.59 1.32 1.21 Figure 2.3 Wright map showing distribution of respondents who provided answers on pre and post unit 1 test Observe, in Figure 3, that the majority of responses on both pre and posttests lie within level 1 of the 3D LP, but the distribution of responses within level 1 changes between pre and posttest. Specifically, maximum peak is observed for pretest at the value of 1.21, which corresponds to level 1 3D LP, and is located slightly below average level 1 threshold of 1.32. On posttest the peak at 1.21 gets smaller, and a new maximum peak emerges at 1.59, which is above average threshold 1 value. Therefore, clear movement towards high level 1 3D LP region is evident on the post test. Additionally, some responses are observed at level 2 of the 3D LP, compared to essentially no level 2 responses for pretest. This indicates that some respondents moved to level 2 upon completion of unit 1. Below, changes in percent distribution of responses on the Wright Maps for pre and posttest are discussed further. Figures 4 shows a separate Wright Map with relevant peaks and percentage of response distribution for the pretest. The distribution of student responses on pre-test contains 2 well- 111 defined peaks, one at the lower end of the logit scale, at -0.54, and the other one at a higher end of the logit scale at 1.21. The peak at -0.54 lies within level 0 of the 3D LP. It corresponds to only about 8% of the sample, indicating that very few students started very low on the 3D LP. Overall, about 41 % of the sample starts at level 0 of the 3D LP. Similarly, only about 2% of responses start in level 2 of the 3D LP. The majority of respondents on the pre-test, about 57%, lie within level 1 of the 3D LP. The distribution for pre-test peaks at ability level of 1.21, which is slightly below the average ability for level 1 thresholds (1.32). About 21 % of respondents in level 1 of the 3D LP are likely to score above average the threshold 1 value. Similarly, about 36% of respondents in level 1 are likely to score below average threshold 1 value. This indicates that the majority of respondents who start in level 1 of the 3D LP on pretest are not likely to score in level 1 of the 3D LP for all the items. Specifically, they are not likely to score in level 1 of the 3D LP for items 1 and 2 whose level 1 thresholds are located significantly above average level 1 threshold. Therefore, on pretest these respondents have not achieved the level of 3D understanding associated with ability level for these item categories. Items 1 and 2 belong to the foil testlet and focus on evaluating students’ ability to model and construct scientific explanation of particle deflection pattern observed in the Rutherford experiment. These items require ability to construct causal microscopic level accounts of relatively abstract phenomenon, and it is not surprising that the majority of students on posttest have not achieved that level of 3D thinking yet. This response pattern is also consistent with qualitative interviews, where 59% (10 out of 17) of interviewed students started at level 0, and the other 41% started in level 1 of the 3D LP for these items. 
This distribution within level 1 changes on the posttest as shown in Figure 5. In Figure 5, 112 on post Unit 1 assessment, the largest proportion of abilities, about 55%, still lies within level 1 of the 3D LP, but the distribution within level 1 changes. The peak value increased from 1.21 on pretest to 1.59 on posttest. On the posttest, the fraction of respondents above average threshold 1 becomes 34% as opposed to 21% on the pretest. Similarly, the fraction of respondents below average threshold 1 drops to 21% from 36% on the pretest. Additionally, the fraction of responses at level 0 of the 3D LP drops from 41% on pretest to 26% on the post test, and the fraction of responses at level 2 of the 3D LP goes up from 2 % on the pretest to 19 % on the post test. Out of 19% of respondents in level 2 of the 3D LP, about 4 % lie above the average threshold for level 2, and 15% lie below average threshold for level 2. This is in contrast to pretest where all 2% of responses observed at level 2 of the 3D LP lie below average threshold for level 2. Therefore, clear increase in fraction of responses at the higher ability region of level 1, and at level 2 of the 3D LP is evident on the post test. 113 3D LP level cutoffs Relevant distribution peaks (see text) Average ability level for each threshold Pre test Post test Figure 2.4 Wright map showing learning progression levels for unit 1 pretest assessment items and distribution of respondents for the relevant cut points for students who provided answers on both pre and posttest Figure 2.5 Wright map showing learning progression levels for unit 1 posttest assessment items and distribution of respondents for the relevant cut points for students who provided answers on both pre and posttest 114 Assigning Learning Progression level to individual students This section talks about how 3D LP can be used to accurately place student on a level, therefore allowing to use the validated 3D LP and the associated assessment as a diagnostic tool in the classroom. To assign a level on the 3D LP to each individual student, it is important to take into consideration measurement error associated with estimation of each proficiency level. This is especially important for students whose proficiency levels lie close to cut points for 3D LP levels, or provide answers consistent with in-between level assignment as was observed for the oral interviews. To do this, confidence interval (CI) for all proficiency estimates are calculated using one standard error in each direction (see Appendix for the R code). Wright Maps are further modified by arranging student proficiency in ascending order excluding students who had all zeroes on pre and/or post7. The modified Wright Maps for pre and posttest are shown in Figures 6 and 7 respectively. The curved black line shows proficiencies, and the grey band represents upper and lower interval bounds. The horizontal dashed lines represent cutoffs for 3D LP levels, and vertical lines show the area where confidence intervals overlap the cut points. If confidence intervals fall entirely into one of the 3D LP regions (for example, the first 655 students on pretest (Figure 6), and the first 601 students on the posttest (Figure 7)), these students are likely to provide answers consistent with level 0 of the 3D LP, and therefore should be assigned level 0 with high degree of confidence. 
Similarly, students 721-891 on pretest and students 647-796 on the posttest have confidence intervals that fall entirely into level 1 of the 3D LP, so these students can be assigned level 1. Finally, confidence intervals for the students 848-899 on the posttest fall entirely into level 2, and that those students can be assigned level 2 on the 3D LP. 7 The X axis of the Wright Maps shown in figures 6 and 7 was truncated to exclude students who had zeroes on pre and post assessment and highlight the graph better. 115 Level 0 Level 0-1 Level 1 Level 1-2 Level 2 Level 1-2 Level 0-1 Figure 2.6 Modified wright map for pre unit 1 test showing student proficiency estimates and standard error bands from lowest to highest Level 0 Level 0-1 Level 1 Level 1-2 Level 1-2 Level 0-1 Figure 2.7 Modified wright map for post unit 1 test showing student proficiency estimates and standard error bands from lowest to highest 116 However, sometimes the confidence intervals overlap the cut points for the 3D LP levels. For example, students 656-720 on the pretest, and students 602-646 on the posttest have confidence intervals that overlap level 0-1 cutoff, indicating that they are likely to provide answers consistent with in-between level assignment. In this case, there is less certainty about the 3D LP level assignment for these students. Similarly, students 892-899 on pretest all have confidence interval overlapping level 1-2 threshold, indicating that there is less certainty in placing these students in level 2 of the 3D LP. Overall, only 71 students on the pretest and 94 students on the posttest fall in between levels of the 3D LP, which corresponds to 8% and 10% respectively. Therefore, there is high degree of certainty in assigning a level on the 3D LP to individual students for the majority of the sample. To be exact, since the confidence interval was calculated using 1 standard error in each direction, we are 68% confident in placing each individual student on a level of the 3D LP. This provides evidence for validity of the 3D LP as a diagnostic tool that allows placing a student on a level with a high degree of accuracy, and use the information about what student understanding looks like in terms of the three dimensions (DCI, SEP, CCCs) at each given level to characterize their science proficiency. To the author’s knowledge, this is the first validated 3DLP that provides this degree of level assignment certainty and therefore applicability in terms of immediate pedagogical use. Discussion The Framework (NRC, 2012) outlines a novel way to teach and learn science grounded in a developmental approach that states that complex ideas in science take time and appropriate scaffolding to develop (Smith, Wiser, Anderson, & Krajcik, 2006). In practice, a developmental approach is reflected in the idea of a learning progression, which describes increasingly more 117 sophisticated steps towards mastering understanding of a given construct (Duschl et al., 2007). The Framework outlines theoretical learning progressions across grades grounded in relevant disciplinary and educational research. The unique feature of learning progressions described in the Framework is that they focus on the three dimensions of science: DCIs, SEPs, CCCs (NRC, 2012). Integrating the three dimensions when explaining phenomena and solving problems fosters knowledge application ability also called knowledge-in-use, which is indicative of deep understanding of science (Pellegrino, Hilton, 2012). 
Knowledge-in-use is achieved through situated cognition, or engagement in applying the content being studied to real life situations (Pellegrino & Hilton, 2012). In the language of the Framework this is equivalent to being engaged in 3D learning, or developing ability to integrate the three dimensions of science to solve real life problems. According to the Framework, it takes time and appropriate scaffolding to foster this ability in students (NRC, 2012). Learning progressions described in the Framework reflect increasing level of understanding of the three dimensions and have potential to guide educators in supporting students to develop knowledge-in-use. However, these learning progressions have only been described in theory (for example, CCCs, DCIs and SEPs progressions in the Framework). Their level of detail is very general, and their practical applicability in guiding educational process is very narrow. Moreover, while the Framework talks about the need to integrate the three dimensions in curriculum, instruction and assessment to foster knowledge application, learning progression for the three dimensions are still presented separately in the Framework (for example, CCCs, DCIs and SEPs progressions in the Framework). This is because integrating of the three dimensions in practice is still a vague concept. In fact, so far, there has been no research reported that demonstrates the feasibility of developing a validated learning progression that 118 describes the three dimensions, and can be used to accurately place individual students on a level for a large-scale sample in practice. Developing this kind of learning progression requires carefully specifying the aspects of the three dimensions at each level of sophistication, developing assessment tool capable of probing each level, and designing a reporting system based on well-aligned scoring rubric and LP levels that can be easily used to place each individual student response on a level of the LP. The work described in this paper has achieved all of these requirements and provides first-hand example of a learning progression that describes aspects of all the three dimensions at increasing levels of sophistication without separating them, and uses 3D assessment tool to probe the levels of 3D LP and place each individual student on a level with high degree of confidence (68% confidence to be exact). The assessment developed for probing the levels of the 3D LP requires student to apply three dimensions of NGSS to construct causal accounts of electrostatic phenomena. The resulting validity argument provides rich source of information about what student 3D understanding looks like at various levels of sophistication from macro to micro scale. It also provides insights into how we can support students towards achieving higher levels of the 3D LP. Therefore, this work is a valuable contribution to research on design and validation of NGSS-aligned learning progressions because it expands our understanding of how to track, describe and measure 3D learning described in the Framework in practice. There are several major takeaways that this work aims to highlight, and they are discussed further. The first takeaway is based on analysis of student interview data showing that transitioning to higher level of 3D LP requires ability to apply relevant aspects of DCIs to explain phenomenon. Simple recollection of a large number of facts related to a specific DCI does not always translate to the ability to apply them when explaining phenomena. 
Evidence of this is seen in transition-level interview responses. Specifically, these responses tend to contain large numbers of relevant facts, including details about the structure of the atom or heuristics like "neutral and charged objects attract", but the models and explanations do not incorporate these ideas into mechanistic causal explanations. For example, the models fail to explain how components of the atoms relate to the phenomenon, or why neutral objects are attracted to charged ones. Therefore, these responses lack application of the content for explaining phenomena and do not demonstrate the ability to integrate the three dimensions of science that is consistent with higher levels of the 3D LP. For educators, this means that when evaluating student learning in an NGSS classroom, we need to be careful not to confuse student memorization ability with knowledge application ability, which provides an indicator of deep conceptual understanding. Otherwise, our educational efforts risk falling back into old ways. The second takeaway from our study, as suggested by analysis of student interviews, is that it is not possible to transition to a higher level of the 3D LP without being able to apply relevant DCIs at the microscopic level. This is consistent with previous research suggesting that microscopic-level understanding is indicative of deep conceptual understanding of science (NRC, 2012; Smith et al., 2006; Stevens et al., 2010; Stevens et al., 2009). The research presented here shows that the distinctive feature of a truly causal mechanistic explanation lies in the microscopic-level detail and in specifying all important relationships between components of the system. This is seen in the data presented here in several forms. First, complete causal models for the paper and rod item require that all components of the model have full microscopic detail, and that causal relationships between all the components be specified. For example, in the context of the paper and rod item, to provide a complete causal account of why neutral paper is attracted to the charged rod, one should specify the structure and components of the atoms that make up the paper and the rod, and explain how these components interact with each other to cause the observed attraction. Without this level of microscopic detail, the explanation is not fully causal and is likely to be based on memorized heuristics (for example, "because neutral and charged objects attract" as a way to explain the attraction between neutral paper and a charged rod). Therefore, students should be given opportunities to practice constructing micro-level accounts. At the same time, it is important to distinguish between memorization and knowing. Specifically, even if students do not provide a full causal account of a phenomenon in their answer, it does not mean that everything they say is merely memorized. This is especially relevant for level 2 answers, where students do not provide a full causal microscopic-level account but still demonstrate considerable ability to develop causal accounts and apply their understanding to explain the phenomenon. The third takeaway is connected to the previous one and suggests that developing higher sophistication in SEPs and CCCs is not possible without knowledge of relevant DCIs, and vice versa. While this work shows only preliminary evidence for this assertion, it shows that student ability to develop models and construct causal accounts is directly related to the degree of their familiarity with relevant DCIs.
For both interview items, a clear pattern emerges: higher-level models contain more DCIs, which are used to develop causal mechanistic models with all relevant components connected at the microscopic level and directly related to explaining the phenomena. Interview analysis suggests that if students lack knowledge of relevant DCIs, their models are incomplete and lack causal mechanistic accounts. This finding is consistent with previous research on the interconnectedness of content knowledge and practice (Catley, Lehrer, & Reiser, 2005; Songer, 2006). There is a considerable amount of research showing that content and practices (the latter also called reasoning skills) develop in concert (Gotwals & Songer, 2006; Duschl et al., 2007). It is therefore important to develop and validate learning progressions that combine aspects of all three dimensions, including content (DCIs), SEPs, and CCCs, in order to gauge the development of 3D understanding across time. The fourth takeaway from our study suggests that the ability to provide a microscopic-level causal explanation might depend on the context. Analysis of the interview data presented here shows that student response patterns for the paper and rod item and the foil experiment item were slightly different. Specifically, there was a larger number of student responses at higher levels of the 3D LP for the paper and rod item than for the foil experiment item. This finding might have to do with the fact that the paper and rod item focuses on a more familiar phenomenon that is directly observed in the video, while the foil experiment item is abstract and hard to visualize. This finding is consistent with the vision of the Framework, which builds on the idea that knowledge is situated. Previous research suggests that novices tend to have a more fragmented knowledge structure, which in turn translates into different levels of demonstrated ability in solving science problems depending on the context (Chi, Glaser, & Rees, 1981; Sabella & Redish, 2007). In the case of the data presented here, the foil experiment item represents a more complex context than the paper and rod item, and it also requires understanding the structure of the atom at a deeper level, so students have more difficulty applying their fragmented understanding of electrical interactions to this more challenging context involving more complex ideas (Sabella & Redish, 2007). It is therefore extremely important to make sure that, in NGSS classrooms, teachers consistently link the concepts being taught across different contexts and explicitly point out similarities in relation to key concepts across contexts. This will help students transition from the fragmented science understanding of novices to the more uniform and integrated understanding of real scientists. Finally, the fifth and last takeaway of our study has to do with the developmental nature of student understanding and the idea that deep, integrated understanding of science takes time and appropriate scaffolding to develop (Smith et al., 2006). Evidence of this is seen in student interviews, where more high-level responses are observed by the end of unit 1, and student answers fall on a spectrum from less to more sophisticated levels of understanding. A similar pattern holds for the analysis of student written responses using IRT.
Further, the fact that none of the students reached level 3 of the 3D LP by the end of unit 1 indicates that it takes a long time before students develop the microscopic-level causal mechanistic reasoning consistent with the highest level of the 3D LP. This suggests that students need substantial support and many opportunities to engage in 3D learning and to practice constructing causal models and explanations at the microscopic level in order to transition to higher levels of the 3D LP.

Study limitations and future research

The data contained a considerable number of missing values that were replaced with "0". Since students were given an unlimited amount of time to finish the assessments, the researchers assumed that if no answer was provided, the student did not know the answer. The considerable number of zeroes, which is also reflected in the considerable number of responses located at the lower end of the ability spectrum, might indicate that there were not enough items to measure lower ability levels. It would be beneficial to add items that measure the lower end of the ability spectrum to better describe student 3D understanding in that region. Another limitation is that level 3 responses were not observed, which indicates that students do not develop level 3 understanding by the end of unit 1. It would be useful to include some of the Unit 1 assessment items on future unit tests to investigate at what point level 3 responses appear as students progress through the curriculum. It could also be worthwhile to adjust instruction during Unit 1 to emphasize certain ideas and see whether level 3 responses are then observed upon completion of the unit.

APPENDIX

Testing Competing Item Response Theory (IRT) Models

The items on the Unit 1 assessment have four ordinal response categories, where each category corresponds to a level of the 3D LP. Specifically, the 0, 1, 2, and 3-point response categories on each item correspond to 3D LP levels 0, 1, 2, and 3, as can be seen from the examples of scoring rubrics above. Common IRT models for polytomous items are the Graded Response Model (GRM; Samejima, 1969) and the Generalized Partial Credit Model (GPCM; Muraki, 1992). To choose an appropriate IRT model to represent the data in this study, model fits for the GRM and GPCM were compared. To ensure a more accurate representation of the data, and to be able to compare student learning on the pre and post assessments, the pre and post assessment data were combined into a single model specification estimated with both the GRM and the GPCM. Item slopes and the corresponding item intercepts were constrained to be equal across the pre and post assessments for each item. This rigid model specification was safe to assume because the dimensionality and longitudinal invariance of the Unit 1 assessment instrument were extensively studied a priori (Chapter 1). The results of that study showed that the Unit 1 assessment scale is one-dimensional and that partial measurement invariance holds over time for the pre and post assessments. The R code for the GRM and GPCM analyses is provided in this appendix. The results of IRT model estimation are shown in Table 2.8.

Table 2.8 Model comparison for GPCM and GRM

Model   LL      # par   AIC     BIC     M2    df    P value   RMSEA    CFI/TLI
GPCM    -4992   51      10038   10168   516   109   <0.001    0.0645   0.983/0.982
GRM     -4957   51      9968    10098   488   109   <0.001    0.0622   0.983/0.984

A larger log-likelihood (equivalently, a smaller deviance), together with smaller AIC and BIC values, indicates a better-fitting model (Nering & Ostini, 2011; Toland, 2014). Based on these indexes, the GRM is a slightly better-fitting model for this data sample.
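As an aside, the comparison in Table 2.8 can be reproduced with a few lines of mirt code. The sketch below is illustrative only: resp and spec are placeholder names for the combined pre/post response matrix and the constrained model specification given in the R code at the end of this appendix, not objects from the original analysis.

library(mirt)

# Fit both candidate models to the same data and constrained specification
grm  <- mirt(resp, spec, itemtype = "graded", verbose = FALSE)
gpcm <- mirt(resp, spec, itemtype = "gpcm",   verbose = FALSE)

# Information criteria: larger log-likelihood and smaller AIC/BIC favor a model
sapply(list(GRM = grm, GPCM = gpcm), function(m)
  c(logLik = extract.mirt(m, "logLik"),
    AIC    = extract.mirt(m, "AIC"),
    BIC    = extract.mirt(m, "BIC")))

# Limited-information overall fit: M2, RMSEA, CFI/TLI (Maydeu-Olivares & Joe, 2005)
M2(grm)
M2(gpcm)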
Further, the M2 goodness-of-fit statistic was used to evaluate overall model fit (Maydeu-Olivares & Joe, 2005). Smaller M2 values also indicate better model fit (Toland, 2014), and, following this guideline, the GRM again fits the data better than the GPCM. The p-values for both the GPCM and GRM indicate lack of fit. However, lack of fit on the M2 statistic is common when fitting parametric models like the GPCM and GRM to real data (Cai, Maydeu-Olivares, Coffman & Thissen, 2006; Toland, 2014). Therefore, additional model fit indexes were used, including RMSEA and CFI/TLI. Cut-off criteria for good and reasonable model fit were RMSEA < 0.06 and < 0.08, respectively, and CFI/TLI > 0.95 and > 0.90, respectively (Hu & Bentler, 1999; Marsh, Hau & Wen, 2004; Van Dam, Earleywine & Borders, 2010). Based on the RMSEA and CFI/TLI values presented in Table 2.8, the GRM and GPCM have similar model fit: RMSEA for both models is marginally good, and the CFI/TLI indexes represent good model fit. Therefore, based on all of this information, the GRM appears to be the more suitable model for the data and was used further to evaluate model assumptions and obtain item parameters.

Evaluating GRM model assumptions

IRT model assumptions were further evaluated for the GRM following Toland (2014). As mentioned above, unidimensionality and partial measurement invariance were established for the measurement instrument in the previous study (Chapter 1). The assumption of local independence is tested below. Local independence (LI) assumes that student responses on the test are influenced only by their level on the latent trait continuum of interest. The LI assumption is very important for IRT analysis because, if violated, item parameters become distorted, including inflated slopes and more homogeneous thresholds across items (Toland, 2014). In the context of NGSS, the assumption of local independence becomes increasingly harder to meet because 3D assessments call for more contextualized, story-based items where students can use all the information available to them to demonstrate knowledge application ability (Gorin & Mislevy, 2013). These items often take the form of testlets, as is the case for the Unit 1 assessment instrument used here, which makes it especially difficult to meet the assumption of local independence because items within a testlet share more commonalities than items across testlets. This might lead to increased dimensionality and violation of the LI assumption (Gorin & Mislevy, 2013). To evaluate the LI assumption in this study, the Q3 index was used with a cut-off value of |0.2| (Kim, De Ayala, Ferdous & Nering, 2011). This index and cut-off value have an acceptable Type I error rate, and the index is substantially more powerful than the commonly used X2 and G2 LD indexes (Chen & Thissen, 1997). Further, it is also recommended that the 0.2 cut-off value be used in a relative way, to determine what counts as a "large" residual correlation relative to the other residual correlations in the model (R. P. Chalmers, personal communication). Following these guidelines, the Q3 statistic was used to evaluate the local independence assumption. The Q3 matrix is shown in Figure 2.8; only values above 0.2 are shown. Most residual correlations were below the cut-off value of 0.2 in absolute value, and there were no residual correlations that were unusually high relative to the others. Specifically, the largest residual correlation in absolute value was -0.36, between items 4 and 7 on the pre-test.
A slightly elevated residual correlation is not surprising for items 4 and 7 because these items belong to the same testlet. However, this correlation is not unreasonably high compared with the other values, and most of the correlations are below the cut-off value of 0.2. Therefore, there is enough evidence to conclude that the assumption of local independence is met.

Figure 2.8 Q3 matrix

Model-Data Fit

Once an IRT model has been chosen and the model assumptions evaluated, it is appropriate to evaluate how well the GRM fits the data and to obtain the item parameters that will be used in validating the levels of the 3D LP.

Item-level fit. To assess how well the GRM fits each item, the S-X2 item fit statistic for polytomous data was examined (Orlando & Thissen, 2000; Orlando & Thissen, 2003). A statistically significant p-value indicates that the model does not fit a given item. Item fit was evaluated using a 1% significance level together with RMSEA values, because evaluating item fit with the S-X2 statistic involves testing multiple hypotheses, and larger samples lead to a greater likelihood of statistically significant results (Stone & Zhang, 2003; Toland, 2014). The S-X2 item fit statistics are shown in Table 2.9 below. Items 3 and 5 on the pre-test and items 1, 3 and 5 on the post-test have p-values < 0.01, indicating poor model fit for these items. Since larger samples lead to a greater likelihood of statistically significant results, the RMSEA values for these items were also examined. As can be seen from Table 2.9, all RMSEA values are below 0.06, indicating good model fit. Therefore, the GRM fits each item reasonably well.

Table 2.9 S-X2 item fit statistics

Item    Q1T1    Q2T1    Q3T1    Q4T1    Q5T1    Q6T1    Q7T1    Q8T1
S-X2    35.002  15.999  40.075  21.604  42.847  33.622  38.575  20.155
df      22      12      21      13      21      25      22      13
RMSEA   0.026   0.019   0.032   0.000   0.034   0.020   0.029   0.025
p       0.039   0.191   0.007   0.544   0.003   0.116   0.016   0.091

Item    Q1T2    Q2T2    Q3T2    Q4T2    Q5T2    Q6T2    Q7T2    Q8T2
S-X2    74.964  36.894  37.150  36.313  47.552  28.418  26.051  25.723
df      29      21      19      23      19      25      21      23
RMSEA   0.042   0.029   0.033   0.025   0.041   0.012   0.016   0.011
p       0.000   0.017   0.008   0.038   0.000   0.289   0.204   0.314

Person-level fit. To evaluate the consistency of student reasoning across the different contexts represented in the items, the person fit (Zh) statistic was examined (Drasgow, Levine & Williams, 1985). The Zh distribution across the pre and post assessment events for all students is shown in Figure 2.9 below.

Figure 2.9 Person fit Zh statistics

The value of -1.96 was used as a cut-off for the Zh statistic, where students with Zh values above -1.96 show regular response patterns (Drasgow et al., 1985; Felt, Castaneda, Tiemensma & Depaoli, 2017). Figure 2.9 shows that the majority of students are above the cut-off value of -1.96 (dashed line), suggesting that the majority of the sample produced responses consistent with those hypothesized by the 3D LP levels. This provides evidence towards the validity of the hypothesized 3D LP levels (Doherty, Draney, Shin, Kim & Anderson, 2015).

R Code

library(mirt)       # For fitting IRT models
library(foreign)    # For importing SPSS data file
library(WrightMap)  # For Wright maps
library(ggplot2)    # For histograms

Model fit evaluation: Unit 1 pre/post test. Items 1-8 represent the Unit 1 pre-test items, items 9-16 represent the Unit 1 post-test items. Pre and post test items are identical.
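A brief added note on the model statement that follows (this commentary is not part of the original analysis script):

# The specification below defines one latent factor per testing occasion
# (F1 = pre-test items 1-8, F2 = post-test items 9-16) and constrains each
# item's slope (a1/a2) and category intercepts (d1, d2) to be equal across
# occasions, placing pre- and post-test ability estimates on a common scale.
# MEAN and COV free the latent means and the factor covariance.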
Model Statement

FAmodelU1pre_post <- mirt.model('F1 = 1, 2, 3, 4, 5, 6, 7, 8
F2 = 9, 10, 11, 12, 13, 14, 15, 16
CONSTRAIN = (1,9, a1, a2), (2,10, a1, a2), (3,11, a1, a2), (4,12, a1, a2), (5,13, a1, a2), (6,14, a1, a2), (7,15, a1, a2), (8,16, a1, a2), (1,9, d1), (2,10, d1), (3,11, d1), (4,12, d1), (5,13, d1), (6,14, d1), (7,15, d1), (8,16, d1), (1,9, d2), (2,10, d2), (3,11, d2), (4,12, d2), (5,13, d2), (6,14, d2), (7,15, d2), (8,16, d2)
MEAN = F1, F2
COV = F1*F2')

Model Estimation

pre.items <- c("U1T1","U2T1","U3T1","U4T1","U5T1","U6T1","U7T1","U8T1")
post.items <- c("U1T2","U2T2","U3T2","U4T2","U5T2","U6T2","U7T2","U8T2")
all.items <- c(pre.items, post.items)

GRM Model

modgrmU1pre_post <- mirt(newdataU1pre_post[all.items], FAmodelU1pre_post, itemtype="graded", verbose=FALSE, SE=TRUE)
modgrmU1pre_post  # to get AIC/BIC parameters
M2(modgrmU1pre_post, impute=20, CI=.95)  # to get model fit CFI/TLI, RMSEA

GPCM Model

modgpcmU1pre_post <- mirt(newdataU1pre_post[all.items], FAmodelU1pre_post, itemtype="gpcm", verbose=FALSE, SE=TRUE)
modgpcmU1pre_post  # to get AIC/BIC parameters
M2(modgpcmU1pre_post, impute=20, CI=.95)  # to get model fit CFI/TLI, RMSEA

Item analysis with the chosen model (GRM)

Model diagnostics

# Residual diagnostics
residuals(modgrmU1pre_post, type="Q3", suppress=.2)  # To evaluate local independence (LI); only shows pairs with |Q3|>0.2 (possible LI issue)

# Item fit diagnostics
print(item.fit <- itemfit(modgrmU1pre_post, fit_stats="S_X2"))  # To evaluate item fit (cutoff: p<0.01)

# Person fit diagnostics
person.fit <- personfit(modgrmU1pre_post, method="ML")  # To evaluate person fit (Zh stats)
ggplot(person.fit, aes(x=Zh)) +
  geom_histogram(bins=15, colour="black", fill="white") +
  geom_vline(xintercept=-1.96, col="black", linetype="dashed") +
  labs(x="Zh statistic", y="Count") +
  theme_bw(base_size=12) +
  theme_classic()  # histogram of Zh stats (above -1.96 indicates good person fit)

Item parameters and thresholds

item.par <- data.frame(coef(modgrmU1pre_post, simplify=TRUE)$items)  # Item parameters
item.par$T1 <- with(item.par, ifelse(a1>0, -d1/a1, -d1/a2))  # a1 = discrimination; difficulty = (-d/a)
item.par <- item.par[1:8,]  # Keep the first 8 rows; the remaining rows are time-two items with parameters equal to time one
item.par$T2 <- with(item.par, ifelse(a1>0, -d2/a1, -d2/a2))
mean.T1 <- mean(item.par$T1)  # Mean threshold 1
mean.T2 <- mean(item.par$T2)  # Mean threshold 2
t0_1 <- min(item.par$T1)  # cut-off for level 0-1
t1_2 <- median(c(item.par$T1, item.par$T2))  # cut-off for level 1-2
t2_3 <- max(item.par$T2)  # cut-off for level 2-3
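# Added note (not in the original script): under mirt's slope-intercept
# parameterization, each graded-response boundary curve is
# P(X >= k) = 1 / (1 + exp(-(a*theta + d_k))), so the boundary location on
# the theta scale is b_k = -d_k / a. The T1 and T2 columns above apply this
# transformation, and the 3D LP cut points are then taken from the minimum,
# median, and maximum of these boundary locations.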
Ability Wright Maps

# Compute factor scores (y-axis for the Wright map)
AbilityU1Pre_Post <- data.frame(fscores(modgrmU1pre_post))
# Add ability scores to the data file
fulldata <- data.frame(cbind(newdataU1pre_post, AbilityU1Pre_Post))
# Merge students who have complete data with the fulldata file to create the reduced-sample file
reducedata <- merge(U1P2_STUID_allstudentscompletedata, fulldata, by.x = "STUID")

# Complete sample data
wrightMap(with(fulldata, cbind(F1, F2)), matrix(c(item.par$T1, item.par$T2), ncol=2),
  person.side=personDens, cutpoints=c(t0_1, t1_2, mean.T1, mean.T2), min.l=-.9, max.l=2.5)

# Reduced sample data
wrightMap(with(reducedata, cbind(F1, F2)), matrix(c(item.par$T1, item.par$T2), ncol=2),
  person.side=personDens, cutpoints=c(t0_1, t1_2, mean.T1, mean.T2), min.l=-.9, max.l=2.5)

Finding peaks on the reduced-sample Wright map and % of examinees in each level of the 3D LP

# Functions to calculate percentiles for given cut-offs
pct_pre <- ecdf(reducedata$F1)   # Percentile function for the pre-test
pct_post <- ecdf(reducedata$F2)  # Percentile function for the post-test

# Thresholds for levels 0-1 and 1-2 of the 3D LP
t0_1     # lowest difficulty 1
t1_2     # median of the difficulty 1 and difficulty 2 values
mean.T1  # average difficulty 1
mean.T2  # average difficulty 2

# Percentage of examinees between thresholds
# Pre-test
pct_pre(t0_1)  # % prob. density below the level 1 cutoff (% prob. density in level 0 of the 3D LP)
pct_pre(t1_2)  # % prob. density below the level 2 cutoff
pct_pre(t1_2)-pct_pre(t0_1)  # 57% between the level 1 and level 2 cutoffs on the pre-test (% prob. density in level 1 of the 3D LP)
pct_pre(mean.T1)  # % prob. density below average difficulty 1 (1.32)
pct_pre(t1_2)-pct_pre(mean.T1)  # 21% between average difficulty 1 and the level 2 cutoff (% prob. density in level 1 of the 3D LP above average difficulty 1)
(pct_pre(t1_2)-pct_pre(t0_1))-(pct_pre(t1_2)-pct_pre(mean.T1))  # 36% between the level 1 cutoff and average difficulty 1 (% prob. density in level 1 of the 3D LP below average difficulty 1)
pct_pre(mean.T2)  # % prob. density below average difficulty 2 (2.02)

# Post-test
pct_post(t0_1)  # % prob. density below the level 1 cutoff (% prob. density in level 0 of the 3D LP)
pct_post(t1_2)  # % prob. density below the level 2 cutoff
pct_post(t1_2)-pct_post(t0_1)  # 55% between the level 1 and level 2 cutoffs on the post-test (% prob. density in level 1 of the 3D LP)
pct_post(mean.T1)  # % prob. density below average difficulty 1 (1.32)
pct_post(t1_2)-pct_post(mean.T1)  # 34% between average difficulty 1 and the level 2 cutoff (% prob. density in level 1 of the 3D LP above average difficulty 1)
(pct_post(t1_2)-pct_post(t0_1))-(pct_post(t1_2)-pct_post(mean.T1))  # 21% between the level 1 cutoff and average difficulty 1 (% prob. density in level 1 of the 3D LP below average difficulty 1)
pct_post(mean.T2)  # % prob. density below average difficulty 2 (2.02)
pct_post(mean.T2)-pct_post(t1_2)  # 15% between the level 2 cutoff and average difficulty 2 (% prob. density in level 2 of the 3D LP below average difficulty 2)
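# Added note (not in the original script): ecdf() returns the empirical
# cumulative distribution function of the ability estimates, so pct_pre(c)
# gives the proportion of examinees whose pre-test estimate falls below a
# cut point c; differencing the ecdf at two cut points gives the share of
# the sample located within a given 3D LP level band.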
Determine density peak values

# Peak values for the pre-test
print(pre_peak1 <- density(reducedata$F1)$x[which.max(density(reducedata$F1)$y)])  # pre-test peak (larger peak)
print(pre_peak2 <- density(reducedata$F1[which(reducedata$F1<0.5)])$x[which.max(density(reducedata$F1[which(reducedata$F1<0.5)])$y)])  # smaller peak, for pre-test values below 0.5

# Peak values for the post-test
print(post_peak1 <- density(reducedata$F2)$x[which.max(density(reducedata$F2)$y)])  # post-test peak
print(post_peak2 <- density(reducedata$F2[which(reducedata$F2<0.5)])$x[which.max(density(reducedata$F2[which(reducedata$F2<0.5)])$y)])  # second peak, for post-test values below 0.5
print(post_peak3 <- density(reducedata$F2[which(reducedata$F2<1.5)])$x[which.max(density(reducedata$F2[which(reducedata$F2<1.5)])$y)])  # third peak, for post-test scores in the level 1 3D LP region
print(post_peak4 <- density(reducedata$F2[which(reducedata$F2>1.7)])$x[which.max(density(reducedata$F2[which(reducedata$F2>1.7)])$y)])  # fourth peak, for post-test scores in the level 2 3D LP region

Ascending Ability Wright Maps

# Create factor scores with standard errors (UB = upper bound, LB = lower bound)
fulldata_with_SE <- cbind(newdataU1pre_post, data.frame(fscores(modgrmU1pre_post, full.scores.SE=TRUE)))
fulldata_with_SE$UBF1 <- fulldata_with_SE$F1 + fulldata_with_SE$SE_F1
fulldata_with_SE$LBF1 <- fulldata_with_SE$F1 - fulldata_with_SE$SE_F1
fulldata_with_SE$LBF2 <- fulldata_with_SE$F2 - fulldata_with_SE$SE_F2
fulldata_with_SE$UBF2 <- fulldata_with_SE$F2 + fulldata_with_SE$SE_F2

# Create variables flagging students whose CI overlaps each LP cutoff (pre-test)
fulldata_with_SE$LP0_1_F1 <- ifelse(fulldata_with_SE$LBF1 <= t0_1 & fulldata_with_SE$UBF1 >= t0_1, 1, 0)
fulldata_with_SE$LP1_2_F1 <- ifelse(fulldata_with_SE$LBF1 <= t1_2 & fulldata_with_SE$UBF1 >= t1_2, 1, 0)

# Create variables flagging students whose CI overlaps each LP cutoff (post-test)
fulldata_with_SE$LP0_1_F2 <- ifelse(fulldata_with_SE$LBF2 <= t0_1 & fulldata_with_SE$UBF2 >= t0_1, 1, 0)
fulldata_with_SE$LP1_2_F2 <- ifelse(fulldata_with_SE$LBF2 <= t1_2 & fulldata_with_SE$UBF2 >= t1_2, 1, 0)

# Find smallest lower-bound score of F1 (pre-test)
LB_LP0_1_pre_stu <- max(fulldata_with_SE$LBF1[which(fulldata_with_SE$LP0_1_F1==1)])
print(LB_L0_pre <- which(sort(fulldata_with_SE$LBF1)==LB_LP0_1_pre_stu))  # students below level 1 of the LP
LB_LP1_2_pre_stu <- max(fulldata_with_SE$LBF1[which(fulldata_with_SE$LP1_2_F1==1)])
print(LB_L1_pre <- which(sort(fulldata_with_SE$LBF1)==LB_LP1_2_pre_stu))  # students below level 2 of the LP

# Find smallest lower-bound score of F2 (post-test)
LB_LP0_1_post_stu <- max(fulldata_with_SE$LBF2[which(fulldata_with_SE$LP0_1_F2==1)])
print(LB_L0_post <- which(sort(fulldata_with_SE$LBF2)==LB_LP0_1_post_stu))  # students below level 1 of the LP
LB_LP1_2_post_stu <- max(fulldata_with_SE$LBF2[which(fulldata_with_SE$LP1_2_F2==1)])
print(LB_L1_post <- which(sort(fulldata_with_SE$LBF2)==LB_LP1_2_post_stu))  # students below level 2 of the LP

# Find highest upper-bound score of F1 (pre-test)
UB_LP0_1_pre_stu <- min(fulldata_with_SE$UBF1[which(fulldata_with_SE$LP0_1_F1==1)])
print(UB_L0_pre <- which(sort(fulldata_with_SE$UBF1)==UB_LP0_1_pre_stu))  # students below level 1 of the LP
UB_LP1_2_pre_stu <- min(fulldata_with_SE$UBF1[which(fulldata_with_SE$LP1_2_F1==1)])
print(UB_L1_pre <- which(sort(fulldata_with_SE$UBF1)==UB_LP1_2_pre_stu))  # students below level 2 of the LP
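# Added note (not in the original script): LB and UB are the ability
# estimate minus/plus one standard error, so a student whose [LB, UB] band
# straddles a cut point (t0_1 or t1_2) cannot be assigned a single 3D LP
# level at the ~68% confidence implied by a +/-1 SE band (for normal errors,
# pnorm(1) - pnorm(-1) is approximately 0.683). The indicator variables
# above flag exactly these "in-between" students counted in the text.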
# Find highest upper-bound score of F2 (post-test)
UB_LP0_1_post_student <- min(fulldata_with_SE$UBF2[which(fulldata_with_SE$LP0_1_F2==1)])
print(UB_L0_post <- which(sort(fulldata_with_SE$UBF2)==UB_LP0_1_post_student))  # students below level 1 of the LP
UB_LP1_2_post_student <- min(fulldata_with_SE$UBF2[which(fulldata_with_SE$LP1_2_F2==1)])
print(UB_L1_post <- which(sort(fulldata_with_SE$UBF2)==UB_LP1_2_post_student))  # students below level 2 of the LP

# Number of students in the overlap region for each cutoff on the pre-test
LB_L0_pre-UB_L0_pre  # 66 students between levels 0 and 1
LB_L1_pre-UB_L1_pre  # 7 students between levels 1 and 2

# Number of students in the overlap region for each cutoff on the post-test
LB_L0_post-UB_L0_post  # 46 students between levels 0 and 1
LB_L1_post-UB_L1_post  # 50 students between levels 1 and 2

# Sort data by ability score (pre-test)
sort_pre <- fulldata_with_SE[order(fulldata_with_SE$F1),]
sort_pre <- data.frame(x=seq(nrow(sort_pre)), F1=sort_pre$F1, lwr=sort_pre$LBF1, upr=sort_pre$UBF1)

# Sort data by ability score (post-test)
sort_post <- fulldata_with_SE[order(fulldata_with_SE$F2),]
sort_post <- data.frame(x=seq(nrow(sort_post)), F2=sort_post$F2, lwr=sort_post$LBF2, upr=sort_post$UBF2)

# Ascending ability Wright map for the pre-test
plot(sort_pre$F1, xlab="Persons", ylab="Ability", pch=16, ylim=c(-2, 3.1), xlim=c(558, 900), cex=0.5)
with(sort_pre, polygon(c(x, rev(x)), c(lwr, rev(upr)), col="grey75", border=FALSE))
matlines(sort_pre[,1], sort_pre[,-1], lwd=c(1,1), lty=1, col=c("black","black","black"), type=c("p","l","l"), cex=0.4, pch=16)
abline(h=c(t0_1, t1_2), lty=2, v=c(LB_L0_pre, LB_L1_pre, UB_L0_pre, UB_L1_pre))

# Ascending ability Wright map for the post-test
plot(sort_post$F2, xlab="Persons", ylab="Ability", pch=16, ylim=c(-2, 3.1), xlim=c(558, 900), cex=0.5)
with(sort_post, polygon(c(x, rev(x)), c(lwr, rev(upr)), col="grey75", border=FALSE))
matlines(sort_post[,1], sort_post[,-1], lwd=c(1,1), lty=1, col=c("black","black","black"), type=c("p","l","l"), cex=0.4, pch=16)
abline(h=c(t0_1, t1_2), lty=2, v=c(LB_L0_post, LB_L1_post, UB_L0_post, UB_L1_post))

BIBLIOGRAPHY

Alonzo, A. C., & Gotwals, A. W. (Eds.). (2012). Learning progressions in science: Current challenges and future directions. Springer Science & Business Media.
Alonzo, A. C., & Steedle, J. T. (2009). Developing and assessing a force and motion learning progression. Science Education, 93(3), 389-421.
Berland, L. K., & McNeill, K. L. (2010). A learning progression for scientific argumentation: Understanding student work and designing supportive instructional contexts. Science Education, 94(5), 765-793.
Cai, L., Maydeu-Olivares, A., Coffman, D. L., & Thissen, D. (2006). Limited-information goodness-of-fit testing of item response theory models for sparse 2^P tables. British Journal of Mathematical and Statistical Psychology, 59(1), 173-194.
Catley, K., Lehrer, R., & Reiser, B. (2005). Tracing a prospective learning progression for developing understanding of evolution. Paper commissioned by the National Academies Committee on Test Design for K-12 Science Achievement. Washington, DC: National Academies.
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1-29. doi: 10.18637/jss.v048.i06
Chen, W. H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265-289.
Chi, M. T., Glaser, R., & Rees, E. (1981). Expertise in problem solving (No. TR-5). Pittsburgh, PA: University of Pittsburgh, Learning Research and Development Center.
Cooper, M. M., Underwood, S. M., Hilley, C. Z., & Klymkowsky, M. W. (2012). Development and assessment of a molecular structure and properties learning progression. Journal of Chemical Education, 89(11), 1351-1357.
Corcoran, T., Mosher, F. A., & Rogat, A. (2009). Learning progressions in science: An evidence-based approach to reform. New York, NY: Columbia University, Teachers College, Center on Continuous Instructional Improvement.
Doherty, J. H., Draney, K., Shin, H. J., Kim, J., & Anderson, C. W. (2015). Validation of a learning progression-based monitoring assessment. Manuscript submitted for publication.
Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38(1), 67-86.
Duncan, R. G., & Hmelo-Silver, C. E. (2009). Learning progressions: Aligning curriculum, instruction, and assessment. Journal of Research in Science Teaching, 46(6), 606-609.
Duschl, R. A., Schweingruber, H. A., & Shouse, A. (Eds.). (2007). Taking science to school: Learning and teaching science in grades K-8. Washington, DC: National Academy Press.
Felt, J. M., Castaneda, R., Tiemensma, J., & Depaoli, S. (2017). Using person fit statistics to detect outliers in survey research. Frontiers in Psychology, 8, 863.
Gorin, J. S., & Mislevy, R. J. (2013, September). Inherent measurement challenges in the next generation science standards for both formative and summative assessment. In Invitational research symposium on science assessment.
Gotwals, A. W., & Songer, N. B. (2013). Validity evidence for learning progression-based assessment items that fuse core disciplinary ideas and science practices. Journal of Research in Science Teaching, 50(5), 597-626.
Harris, C. J., Krajcik, J. S., Pellegrino, J. W., & DeBarger, A. H. (2019). Designing knowledge-in-use assessments to promote deeper learning. Educational Measurement: Issues and Practice.
Herrmann-Abell, C. F., & DeBoer, G. E. (2018). Investigating a learning progression for energy ideas from upper elementary through high school. Journal of Research in Science Teaching, 55(1), 68-93.
Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6(1), 1-55.
Kim, D., De Ayala, R. J., Ferdous, A. A., & Nering, M. L. (2011). The comparative performance of conditional independence indices. Applied Psychological Measurement, 35(6), 447-471.
Krajcik, J. S., Sutherland, L. M., Drago, K., & Merritt, J. (2012). The promise and value of learning progression research. In S. Bernholt, K. Neumann, & P. Nentwig (Eds.)
Lee, H. S., & Liu, O. L. (2010). Assessing learning progression of energy concepts across middle school grades: The knowledge integration perspective. Science Education, 94(4), 665-688.
Lehrer, R., Kim, M. J., Ayers, E., & Wilson, M. (2014). Toward establishing a learning progression to support the development of statistical reasoning. Learning over time: Learning trajectories in mathematics education, 31-60.
Maydeu-Olivares, A., & Joe, H. (2005). Limited- and full-information estimation and goodness-of-fit testing in 2^n contingency tables: A unified framework. Journal of the American Statistical Association, 100(471), 1009-1020.
Mislevy, R. J., & Haertel, G. D. (2006). Implications of evidence-centered design for educational testing. Educational Measurement: Issues and Practice, 25(4), 6-20.
Mohan, L., Chen, J., & Anderson, C. W. (2009). Developing a multi-year learning progression for carbon cycling in socio-ecological systems. Journal of Research in Science Teaching, 46(6), 675-698.
Morell, L., Collier, T., Black, P., & Wilson, M. (2017). A construct-modeling approach to develop a learning progression of how students understand the structure of matter. Journal of Research in Science Teaching, 54(8), 1024-1048.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. ETS Research Report Series, 1992(1), i-30.
National Academies of Sciences, Engineering, and Medicine. (2019). Science and engineering for grades 6-12: Investigation and design at the center. National Academies Press.
National Research Council. (2000). How people learn: Brain, mind, experience, and school: Expanded edition. National Academies Press.
National Research Council. (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. National Academies Press.
National Research Council. (2013a). Education for life and work: Developing transferable knowledge and skills in the 21st century. National Academies Press.
Nering, M. L., & Ostini, R. (Eds.). (2011). Handbook of polytomous item response theory models. Taylor & Francis.
Neumann, K., Viering, T., Boone, W. J., & Fischer, H. E. (2013). Towards a learning progression of energy. Journal of Research in Science Teaching, 50(2), 162-188.
Nordine, J., Krajcik, J., & Fortus, D. (2010). Transforming energy instruction in middle school to support integrated understanding and future learning. Science Education, 95(4), 670-690. DOI: 10.1002/sce.20423
Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24(1), 50-64.
Orlando, M., & Thissen, D. (2003). Further investigation of the performance of S-X2: An item fit index for use with dichotomous item response theory models. Applied Psychological Measurement, 27(4), 289-298.
Osborne, J. F., Henderson, J. B., MacPherson, A., Szu, E., Wild, A., & Yao, S. Y. (2016). The development and validation of a learning progression for argumentation in science. Journal of Research in Science Teaching, 53(6), 821-846.
Pellegrino, J. W., & Hilton, M. L. (Eds.). (2012). Education for life and work: Developing transferable knowledge and skills in the 21st century. Washington, DC: The National Academies Press.
Plummer, J. D., & Krajcik, J. (2010). Building a learning progression for celestial motion: Elementary levels from an earth-based perspective. Journal of Research in Science Teaching, 47(7), 768-787.
Plummer, J. D., & Maynard, L. (2014). Building a learning progression for celestial motion: An exploration of students' reasoning about the seasons. Journal of Research in Science Teaching, 51(7), 902-929.
Reiser, B. J., Krajcik, J., Moje, E., & Marx, R. (2003, March). Design strategies for developing science instructional materials. In Annual Meeting of the National Association of Research in Science Teaching, Philadelphia, PA.
RStudio Team (2015). RStudio: Integrated Development for R. RStudio, Inc., Boston, MA. URL http://www.rstudio.com/
Sabella, M. S., & Redish, E. F. (2007). Knowledge organization and activation in physics problem solving. American Journal of Physics, 75(11), 1017-1029.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika monograph supplement.
Schwarz, C. V., Reiser, B. J., Davis, E. A., Kenyon, L., Achér, A., Fortus, D., ... & Krajcik, J. (2009). Developing a learning progression for scientific modeling: Making scientific modeling accessible and meaningful for learners. Journal of Research in Science Teaching, 46(6), 632-654.
Shin, N., Stevens, S. Y., & Krajcik, J. (2010). Tracking student learning over time using construct-centred design. In Using analytical frameworks for classroom research (pp. 56-76). Routledge.
Smith, C. L., Wiser, M., Anderson, C. W., & Krajcik, J. (2006). FOCUS ARTICLE: Implications of research on children's learning for standards and assessment: A proposed learning progression for matter and the atomic-molecular theory. Measurement: Interdisciplinary Research & Perspective, 4(1-2), 1-98.
Songer, N. B., Kelcey, B., & Gotwals, A. W. (2009). How and when does complex reasoning occur? Empirically driven development of a learning progression focused on complex reasoning about biodiversity. Journal of Research in Science Teaching, 46(6), 610-631.
Songer, N. B. (2006). BioKIDS: An animated conversation on the development of curricular activity structures for inquiry science. In R. Keith Sawyer (Ed.), Cambridge handbook of the learning sciences (pp. 355-369). New York: Cambridge.
Standards, N. G. S. (2013). Next generation science standards: For states, by states.
Stevens, S. Y., Delgado, C., & Krajcik, J. S. (2010). Developing a hypothetical multi-dimensional learning progression for the nature of matter. Journal of Research in Science Teaching, 47(6), 687-715.
Stevens, S. Y., Sutherland, L. M., & Krajcik, J. S. (2009). The big ideas of nanoscale science and engineering. NSTA Press.
Stone, C. A., & Zhang, B. (2003). Assessing goodness of fit of item response theory models: A comparison of traditional and alternative procedures. Journal of Educational Measurement, 40(4), 331-352.
Talanquer, V. (2009). On cognitive constraints and learning progressions: The case of "structure of matter". International Journal of Science Education, 31(15), 2123-2136.
Toland, M. D. (2014). Practical guide to conducting an item response theory analysis. The Journal of Early Adolescence, 34(1), 120-151.
Van Dam, N. T., Earleywine, M., & Borders, A. (2010). Measuring mindfulness? An item response theory analysis of the Mindful Attention Awareness Scale. Personality and Individual Differences, 49(7), 805-810.
Wilson, M. (2004). Constructing measures: An item response modeling approach. Routledge.
Wilson, M. (2009). Measuring progressions: Assessment structures underlying a learning progression. Journal of Research in Science Teaching, 46(6), 716-
Wyner, Y., & Doherty, J. H. (2017). Developing a learning progression for three-dimensional learning of the patterns of evolution. Science Education, 101(5), 787-817.

CHAPTER 3 Exploring Student Reasoning about Chemical Bonds from Perspective of Energy and Force in the context of NGSS Classroom

Introduction

The new way of teaching about chemical bonds

Understanding the mechanisms and driving factors that influence the formation of chemical bonds is essential for developing deep, useable understanding of chemistry.
Previous research suggests that students at different levels of science preparation hold multiple inaccurate ideas about why and how chemical bonds form. For example, even after instruction, students still hold on to the idea that chemical bonds "store" energy, which is released when chemical bonds are broken (Barker & Millar, 2000; Boo, 1998). Further, students tend to view different types of bonds, including covalent and ionic, as distinctly different from intermolecular interactions, rather than recognizing that these are different manifestations of the same phenomenon: atoms forming electrical interactions of different magnitudes, leading to increased stability of the system through energy minimization (Taber, 1998a). To build deep, useable understanding of chemical bonding, students need to develop the ability to model and explain bond-forming and bond-breaking processes at the atomic level using ideas related to energy and electrostatic forces (Cooper et al., 2014). However, students in secondary science settings rarely discuss atomic-level mechanisms for bond formation that are built on the fundamental principles of energy minimization and electrostatic attraction. Instead, instruction tends to emphasize heuristics such as the octet rule to explain why certain elements form certain types of bonds (Taber, 1998a). Additionally, in K-12 settings students rarely discuss energy at the atomic and molecular level, and therefore struggle to apply ideas of energy to bond formation processes at the atomic level (Cooper, Klymkowsky, & Becker, 2014). Recently there has been a significant push from the educational community towards building instruction on fundamental scientific principles to help students develop deep understanding of big ideas in science that they can apply when explaining the natural world and solving real-life problems. This type of useable understanding is typically referred to as knowledge-in-use (National Research Council [NRC], 2012; NRC, 2013a; Pellegrino & Hilton, 2012). These efforts resulted in the publication of the Framework for K-12 Science Education (the Framework) and the Next Generation Science Standards (NGSS), both of which are based on years of research on how students best learn ideas in science (NRC, 2012; Standards, 2013). Previous research suggests introducing the phenomenon of chemical bonding in terms of the state of a system of interacting objects in which attractive and repulsive forces balance out, which leads to energy minimization in the system (Nahum, 2007; Nahum, Mamlok-Naaman, Hofstein, & Krajcik, 2007). The Framework builds on this view and suggests using ideas of balance of electric forces and energy minimization as underlying big ideas when explaining various phenomena, including chemical bonding. Specifically, the Framework emphasizes the importance of recognizing energy minimization as the driving force in the formation of chemical bonds by stating that: "Matter in a stable form minimizes the stored energy in the electric and magnetic fields within it; this defines the equilibrium positions and spacing of the atomic nuclei in a molecule (e.g., chemical bonds)" (NRC, 2012, p. 121). It further emphasizes introducing electrical interactions between charged species as the mechanism by which atoms form molecules: "The substructure of atoms determines how they combine and rearrange to form all of the world's substances. Electrical attractions and repulsions between charged particles (i.e., atomic nuclei and electrons) in matter explain the structure of atoms and the forces between atoms that cause them to form molecules (via chemical bonds), which range in size from two to thousands of atoms (e.g., in biological molecules such as proteins)." (NRC, 2012, p. 107). In short, the Framework builds on previous research to suggest radically different ways of teaching ideas related to chemical bonding that are grounded in fundamental principles related to electrical interactions and energy minimization.
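As an illustrative aside (added here for clarity; the notation is not drawn from the Framework or the curriculum materials): for two point charges $q_1$ and $q_2$ at separation $r$, the electrostatic potential energy is given by Coulomb's law, $U(r) = k\,q_1 q_2 / r$. For a bonded pair of atoms, the total interaction energy $E(r)$ combines Coulombic attraction and repulsion, and the equilibrium bond length $r_0$ is the point where

$$\left.\frac{dE}{dr}\right|_{r = r_0} = 0, \qquad F(r_0) = -\left.\frac{dE}{dr}\right|_{r = r_0} = 0,$$

that is, where the attractive and repulsive forces balance and the system's energy is minimized, which is precisely the picture expressed in the Framework statements quoted above.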
However, as discussed below, apart from the scientific content, the Framework also outlines new ways of organizing the learning process to ensure students develop deeper, more meaningful, and life-long understanding of the content.

Learning Progressions and Three-Dimensional Learning

The Framework views learning as a developmental progression designed to help students build and revise their knowledge and skills from elementary to high school (NRC, 2012, p. 11). This notion is built on years of research indicating that deep understanding of big ideas in science develops over time, and learning progressions provide a "road map" for the different routes that students can follow to achieve this understanding (NRC, 2012, p. 26; Duschl et al., 2007; Smith et al., 2006; Alonzo & Gotwals, 2012). The Framework suggests building instruction around the three dimensions of science: disciplinary core ideas (DCIs), scientific and engineering practices (SEPs), and crosscutting concepts (CCCs). Disciplinary core ideas represent a small number of core ideas in a given scientific discipline and aim to help students build deep understanding of science and the ability to explain a wide range of phenomena. Crosscutting concepts serve as lenses used to make sense of a wide range of phenomena and to build coherent understanding of science across disciplines. Finally, scientific and engineering practices represent the authentic practices that scientists and engineers use to generate and revise scientific knowledge (NRC, 2012). The Framework further defines three-dimensional learning (3D learning) as a way to engage in scientific and engineering practices in order to deepen understanding of crosscutting concepts and disciplinary core ideas (NRC, 2012). According to the Framework, engaging in 3D learning helps students build deep, useable understanding of big ideas in science coherently over time (NRC, 2012). While the Framework emphasizes the importance of developing student science proficiency following learning progressions along the three dimensions, it does not provide detailed learning progressions for the three dimensions (DCIs, SEPs, CCCs) across grade levels. Possible general LPs for SEPs and CCCs are outlined in the Framework, but they do not specify grade-band details due to a lack of relevant research. For each component of the DCIs, the Framework describes in somewhat more detail what students should understand by the end of a given grade band (2, 5, 8, 12). The NGSS provides a slightly more detailed description of the possible learning progressions of the three dimensions across grade bands. Development of detailed and validated LPs for the three dimensions was beyond the scope of both the Framework and NGSS and remains one of the major tasks to be accomplished for successful implementation of the new vision of science education.
Validating NGSS-aligned Learning Progressions in Practice

To implement the educational changes called for by the Framework and NGSS, it is essential to develop and validate learning progressions that combine aspects of DCIs, SEPs, and CCCs. The scope of learning progressions can range from a large grain size encompassing multiple grades to a finer grain size focused on exploring the development of student understanding of a specific aspect of a broader LP (Gotwals, 2012; Mohan, Plummer, 2012). Smaller-scale LPs can be more useful for instructional purposes, while large-scale LPs provide a "large-scale map" of the progression of student understanding over a broad span of time (Gotwals, 2012). In this work the smaller-scale approach is used, focusing specifically on exploring how student reasoning about chemical bonding develops during the course of one unit constituting approximately two months of instructional time. This work is a continuation of the previously described 3D learning progression (3D LP) for electrical interactions validated in the context of Unit 1 of an NGSS-aligned curriculum for 9th grade physical science that spans one academic year (see Chapter 2). The curriculum is called "Interactions" and focuses on helping students build 3D understanding of electrical interactions at the macroscopic and atomic-molecular levels to explain a wide range of phenomena, including chemical bonding. The study described in Chapter 2 demonstrated the development of a 3D LP focusing on the following ideas related to explaining electrical interactions: the atomic nature of matter (focused on the DCI of Matter and Its Interactions, sub-idea of Structure and Properties of Matter) and electric forces (focused on the DCI of Motion and Stability: Forces and Interactions, sub-idea of Types of Interactions). The 3D LP also integrated the SEP of Developing and Using Models and the CCC of Cause and Effect. That study provided evidence for the validity of the 3D LP using Unit 1 assessment data only. The present work uses assessment data from Unit 2 of the "Interactions" curriculum, which is focused on building student understanding of chemical bonding from the perspective of energy and force, to construct a finer-grained progression of student understanding related to chemical bonding. Specifically, this work uses Wilson's (2009) approach of building a finer-grain construct map centered on a specific concept and aimed at relating assessment and cognition theories. This work explored the progression of student understanding along the following aspects of the three dimensions: the DCI of HS-PS3 Energy, specifically the element PS3.C Relationship Between Energy and Forces; the DCI of HS-PS1 Matter and Its Interactions, specifically the element PS1.B Chemical Reactions; the SEP of Developing and Using Models; the SEP of Constructing Explanations; and the CCC of Cause and Effect. Student interview data collected before and after completion of Unit 2, together with item response theory analysis of Unit 2 assessment data, are used to provide validity evidence for the 3D construct map describing the progression of student understanding of chemical bonding during the course of the unit.

Contribution to the field of Chemical Education

This work contributes to the field of chemical education in several ways. First, a 3D construct map for chemical bonding is presented that is based on the ideas of energy minimization and balance of electric forces.
The 3D construct map specifies aspects of DCIs, SEPs, and CCCs related to chemical bonding, from the macroscopic to the atomic-molecular scale, for each level of sophistication. Previous studies describe learning progressions for energy (Lee & Liu, 2010; Neumann, Viering, Boone, & Fischer, 2013) and force (Alonzo & Steedle, 2009). There have also been studies exploring student thinking related to intermolecular interactions (Becker, Noyes, & Cooper, 2016), energy at the atomic-molecular scale (Becker & Cooper, 2014), and chemical bonding (Burrows & Mooring, 2015; Taber & Coll, 2002). Previously published learning progression descriptions focus on both content and practice (Songer, Kelcey, & Gotwals, 2009; Gotwals & Songer, 2013), as well as on practice only (Lehrer, Kim, Ayers, & Wilson, 2014; Schwarz et al., 2009; Berland & McNeill, 2010; Osborne, Henderson, MacPherson, Szu, Wild, & Yao, 2016). However, to the author's knowledge, the current work is the first example of a study exploring student thinking about chemical bonding that is built on the fundamental principles of energy and force according to the vision expressed in the Framework and NGSS, specifically focusing on integrating content (DCIs) and practice (SEPs, CCCs). Second, this work demonstrates that the 3D construct map can be used to describe student learning in the context of Unit 2. Finally, this work demonstrates that the 3D construct map can be used to place individual students on a level with 68% confidence, which suggests an immediate and high degree of applicability of the 3D construct map for pedagogical use.

Theoretical Framework

This work uses the construct modeling framework to develop and revise the 3D construct map for chemical bonding (Brown & Wilson, 2011; Wilson, 2005, 2009). A construct in this case represents a specific unobserved (latent) trait being measured; in the context of this study, the construct is student 3D understanding of chemical bonding. The construct modeling approach is an extension of the learning progression vision into the field of assessment because it allows assessment results to be interpreted based on relevant learning and cognition theories (Wilson, 2009; Pellegrino, Chudowsky, & Glaser, 2001). It also allows assessment results to be interpreted more meaningfully by providing information about what students know and can do at each level of proficiency, and by helping guide the instructional process in terms of what supports students need to reach higher proficiency levels on a given construct (Brown & Wilson, 2011). The construct modeling approach therefore provides a framework for defining proficiency in a meaningful way and, as a result, increasing the validity power of the test scores (Brown & Wilson, 2011; Mislevy, 1996). The construct modeling approach consists of four steps and constitutes an iterative process. The first step involves specifying a cognition model, which in this case is the hypothetical 3D construct map for student understanding of chemical bonding. This map combines the proficiencies in DCIs, SEPs, and CCCs described above and describes increasingly sophisticated levels of ability along these three proficiencies as students develop deeper understanding and incorporate new knowledge into their existing knowledge framework. In this study, a construct map for chemical bonding is defined based on unpacking of NGSS performance expectations (PEs) and feedback from disciplinary and pedagogical experts.
It is important to point out that while the validation of the 3D construct map for chemical bonding is carried out in the context of the "Interactions" curriculum, the assessment instrument used to probe the levels of the 3D construct map is aligned to the NGSS PEs and not to the curriculum learning goals. Therefore, the results obtained in this study are generalizable to contexts other than the "Interactions" curriculum. The second step involves designing items to probe the levels of the 3D construct map following the modified evidence-centered design (mECD) methodology (Harris, Krajcik, Pellegrino, & DeBarger, 2019), which is described further in the methods section. The third step involves evaluating the outcome space by analyzing student responses to the items and mapping them to the levels of the construct map, to ensure that scores on the items relate to the levels of the construct map in a meaningful way. Finally, the last step involves choosing a measurement model that relates student responses to calculated ability levels in order to gain additional evidence for the validity of the 3D construct map, thereby allowing interpretation of the assessment results (Brown & Wilson, 2011; Pellegrino et al., 2001). This study focuses on investigating how students integrate ideas of electric forces learned in Unit 1 and ideas of energy learned in Unit 2 to explain chemical bonding, and it suggests ways of interpreting the obtained results.

Methodology

Context: the "Interactions" curriculum

The "Interactions" curriculum was developed according to the principles outlined in the Framework and NGSS. Each unit engages students with relevant natural phenomena in the form of driving questions, with the purpose of developing deeper understanding of electrical interactions over the course of the academic year through 3D learning strategies. The curriculum consists of four units. The first unit focuses on building student understanding of electrical interactions using ideas of electric forces, fields, and charges at the macro and atomic-molecular scales. The second unit brings in ideas related to energy changes in a system when two charged objects interact at the macro and atomic-molecular scales. Chemical bonding is the central phenomenon students explore in Unit 2, whose driving question is "How does a small spark start a huge explosion?". Students explore ideas related to bond formation and bond breaking from the perspective of energy and force, and relate macroscopic observations of phenomena to atomic- and molecular-level mechanisms. The curriculum therefore aims at helping students build 3D understanding of chemical bonding as an extension of the same electrostatic principles that are responsible for the observed attractive and repulsive forces between charged macroscopic objects, and at helping them recognize that energy minimization is the driving force behind both the observed electrostatic attraction between macroscopic objects and atoms forming a bond. Units 1 and 2 represent about two thirds of the curriculum's instructional time. Units 3 and 4 further develop student understanding of electrical interactions to explain a wide range of phenomena, including hydrophobic and hydrophilic interactions (Unit 3) and protein folding (Unit 4). Units 1 and 2 of the "Interactions" curriculum have gone through an external review process by Achieve. Unit 1 received the highest rating, "Example of high quality NGSS design", and Unit 2 received the second highest rating, "Example of high quality NGSS design if improved".
Further, the National Science Teachers Association recognizes "Interactions" as being aligned to NGSS and provides classroom videos demonstrating curriculum use on its official webpage (http://ngss.nsta.org/). These pieces of evidence support the choice of this curriculum for developing and validating the 3D construct map in this study. The curriculum consists of online materials, where all the student activities are located (http://interactions.portal.concord.org/), and paper-based teacher materials that can be accessed online via Google Docs. The curriculum is free and available for anyone to use.

Step 1: Specifying cognition model

Similar to an LP, a level on the 3D construct map can be described as one in a series of comprehensive and developmentally appropriate steps towards more sophisticated application of a given latent construct. The major differences between an LP and a construct map are that a construct map is typically defined at a smaller grain size than a learning progression, and that it specifically focuses on relating assessment to relevant cognition theories (Wilson, 2009). The 3D construct map presented here focuses on the DCI of Energy, specifically the sub-idea of Relationship Between Energy and Forces, as well as the DCI of Matter and Its Interactions (HS-PS1), specifically the sub-idea PS1.B: Chemical Reactions. These sub-ideas are central to Unit 2 of the "Interactions" curriculum. Further, the 3D construct map presented here focuses on the SEPs of Developing and Using Models and Constructing Explanations and on the CCC of Cause and Effect, because those dimensions were most heavily emphasized throughout the curriculum, and the assessments designed to probe the 3D construct map levels focused on these dimensions. The lower anchor was based on students' prior knowledge, characterized from the written assessment and oral interviews with individual students before they started Unit 2. The upper anchor is based on the NGSS PEs focused specifically on energy changes during bond breaking and bond forming processes, as shown below:

HS-PS3-5. Develop and use a model of two objects interacting through electric or magnetic fields to illustrate the forces between objects and the changes in energy of the objects due to the interaction.

HS-PS1-4. Develop a model to illustrate that the release or absorption of energy from a chemical reaction system depends upon the changes in total bond energy.

The 3D construct map and the assessment used to probe its levels focus specifically on phenomena related to electrical interactions; therefore, the aspect of PE HS-PS3-5 related to magnetic interactions is not addressed. Further, as related to PE HS-PS1-4, this study does not focus on evaluating student ability to calculate bond energies. Instead, it focuses on describing and evaluating students' qualitative understanding of energy changes during bond breaking and bond making processes, specifically the idea of adding energy to the system in order to break a chemical bond. The intermediate levels of the LP are defined based on a combination of the logical sequence of the discipline, feedback from disciplinary experts, and literature related to student learning. This process resulted in a hypothetical 3D construct map that was then empirically tested based on interviews with students and IRT analysis of the written assessment. Table 1 provides a description of the levels of the hypothetical NGSS-aligned 3D construct map for chemical bonding.
This construct map represents the cognition model that was defined as the first step of the construct modeling approach (Brown & Wilson, 2011). It is important to point out that the Framework emphasizes the importance of measuring student ability to integrate the three dimensions; while the 3D construct map shows DCIs and SEPs/CCCs in separate columns, the SEP/CCC column is an integrative statement because it refers to the DCIs. The DCIs are listed separately to avoid having to write each of the statements under SEP/CCC for all the DCI sub-ideas.

Table 3.1 Hypothetical 3D construct map for chemical bonding. Chemical bonding includes the DCI sub-ideas of Matter and Its Interactions ("Structure and Properties of Matter", "Chemical Reactions") and Energy ("Relationship Between Energy and Forces"); the SEP/CCC column covers the SEPs "Developing and Using Models" and "Constructing Explanations" and the CCC "Cause and Effect".

Level 3 (Microscopic)
DCI: Relationship between energy and forces
• Energy is associated with either motion (kinetic) or "stored" (potential) at the macroscopic and atomic scales
• Ideas of energy are applied to explain bond breaking/making processes
• Energy changes are related to Coulombic interactions between charges at the macro and atomic-molecular levels
• Chemical bonds are described as resulting from a balance of attractive/repulsive forces leading to energy minimization
• Electric fields are used to explain interactions at a distance
DCI: Chemical Reactions
• Chemical reactions are explained using bond breaking/making processes
• Energy changes are associated with chemical reactions and bond making/breaking processes
• Chemical reactions are described using atoms/molecules
SEP and CCC:
• Student models/explanations are causal and explicitly use ideas of energy and electric force to explain phenomena related to bond breaking and bond making by showing a micro-level mechanism
• Models relate energy changes to changes in forces between interacting atoms to explain why bonds form

Level 2 (Incomplete Microscopic)
DCI: Relationship between energy and forces
• Energy is associated with either motion (kinetic) or "stored" (potential) at the macroscopic scale and, inaccurately, at the microscopic scale
• Inaccurate ideas about the relationship between energy/heat/force
• Ideas of energy are applied to explain bond breaking/making processes, and explanations might relate to electrical interactions between atom components with some inaccuracies
• Energy changes are related to Coulombic interactions between point charges and charged macroscopic objects, possibly with inaccuracies
• Chemical bonds are described as resulting from a balance of attractive/repulsive forces; energy relationships are inaccurate or absent
• Electric fields might be used to explain interactions at a distance
DCI: Chemical Reactions
• Chemical reactions are explained using bond breaking/making processes with some inaccuracies
• Energy changes are associated with chemical reactions, with some inaccuracies
• Chemical reactions are described using atoms/molecules with some inaccuracies
SEP and CCC:
• Student models/explanations are causal and use ideas of energy and electric force to explain phenomena related to bond breaking and bond making by showing a micro-level mechanism with some inaccuracies; need to prompt to elicit ideas of energy
• Models relate energy changes to changes in forces between interacting atoms to explain why bonds form; need to be prompted
Level 1 (Macroscopic)
DCI: Relationship between energy and forces
• Energy is associated with either motion (kinetic) or "stored" (potential) at the macroscopic scale
• Energy is treated as the same as heat/friction/force
• Ideas of energy are applied to explain bond breaking/making processes, but explanations do not relate to electrical interactions between atom components
• Energy changes are related to Coulombic interactions between point charges and charged macroscopic objects, but with some inaccuracies
• Chemical bonds are not described as resulting from a balance of attractive/repulsive forces that leads to energy minimization
• Electric fields might be used to explain interactions at a distance
DCI: Chemical Reactions
• Chemical reactions are recognized (indicators include temperature change, color change, release of gas, precipitate formation, odor)
• Chemical reactions are not explained using ideas related to chemical bonds
• Energy changes are associated with chemical reactions, with some inaccuracies
• Chemical reactions are not described using ideas of atoms/molecules
SEP and CCC:
• Student models/explanations are causal and use ideas of energy when explaining chemical reactions, but at the macroscopic level only; contain inaccuracies
• Inaccurate macro-level mechanism; might need to prompt to elicit ideas of energy
• Models do not relate energy changes to changes in forces between interacting objects in a system

Level 0
DCI: Relationship between energy and forces
• Energy is treated as the same as heat/friction/force
• Ideas of energy are not applied to explain bond breaking/making processes
• Energy changes are not related to Coulombic interactions
• Chemical bonds are not described as resulting from a balance of attractive/repulsive forces that leads to energy minimization
• Electric fields are not used to explain interactions at a distance
DCI: Chemical Reactions
• Chemical reactions are not recognized
• Chemical reactions are not explained using ideas related to chemical bonds
• Energy changes are not associated with chemical reactions
• Chemical reactions are not described using ideas related to atoms/molecules
SEP and CCC:
• Models/explanations are not causal and are based on recollection of facts or observable components only; no mechanism explaining the phenomenon

Step 2: Developing assessment to probe the levels of the 3D construct map

The second step of the construct modeling approach involves developing assessments to probe the various levels of the 3D construct map. This work uses the modified evidence-centered design (mECD) process (Harris et al., 2019) to develop assessments to probe the levels of the hypothetical 3D construct map for chemical bonding. The mECD approach combines elements of evidence-centered design (ECD) (Mislevy & Haertel, 2006) and the construct-centered design (CCD) process (Shin, Stevens, & Krajcik, 2010) to design tasks for measuring knowledge-in-use. The first step of mECD involves identifying and unpacking an NGSS PE to develop a 3D claim that describes what students should be able to do with the corresponding DCIs, SEPs and CCCs. The process of unpacking specifies the aspects of the DCIs, SEPs and CCCs that students should master in order to meet a given NGSS PE.
Unpacking also ensures coherence and alignment between NGSS PEs, assessment, and the 3D construct map levels, and specifies which aspects of the broad NGSS PE are being measured. The next step involves specifying the evidence that shows students have met the requirements of the claim. Claim and evidence combine to form an mECD argument. Finally, assessment tasks are developed for each mECD argument that will provide the evidence necessary to measure the claim. This process is illustrated in Figure 3.1.

Figure 3.1 Summary of the modified evidence-centered design process

An example of the mECD argument for an item to help characterize the level of students' understanding of chemical bonding is summarized in Table 2. The item is designed to provide evidence on whether students are at level 0, 1, 2 or 3 of the 3D construct map shown in Table 1. The mECD argument focuses on the DCI of Matter and Its Interactions (HS-PS1), specifically the element PS1.B: Chemical Reactions; the SEP of Constructing Explanations; and the CCC of Cause and Effect. While the NGSS PE focuses on the SEP of Developing Models and the CCC of Energy and Matter, it is acceptable to change those elements as long as the DCI focus is the same as in the given NGSS PE.

Table 3.2 Example of the mECD process

NGSS PE: HS-PS1-4. Develop a model to illustrate that the release or absorption of energy from a chemical reaction system depends upon the changes in total bond energy. (Note: assessment items designed for this PE focused on qualitative understanding of energy changes associated with bond breaking/bond making processes. Students were not asked to calculate changes in total bond energy.)

Claim: Students will construct an explanation that tracks the energy changes that occur when chemical bonds break and form in a chemical reaction to explain what causes the observed energy/heat absorption or release.

Evidence: Students' explanations will account for energy changes that occur when chemical bonds break and form in a chemical reaction. Explanations will contain the following ideas as appropriate:
1. Chemical reactions involve breaking bonds of the reactants and forming bonds to make new substances as products.
2. Energy is required to break bonds; energy is released when bonds form.
3. Starting a chemical reaction requires adding energy to break bonds of the reactants.
4. Heat and light indicate that energy is being transferred to/from the system.
5. Causal explanations will account for a molecular/atomic-level mechanism and relate it to the observed phenomenon.

Assessment task: Burning is a type of chemical reaction. The video shows a match that is lit by heating it on a hot plate. (Note: the video shows a match on a hot plate that is turned on; after some time the match lights up. Snapshots of the video show the match lighting up after some time on the hot plate, the match burning for a while, and the flame from the match eventually dying out.)
Question: Striking a match across a rough surface will create a spark that sets the match on fire. How can the match in the video light without a spark? Justify your answer.

There were a total of eight items designed to measure 3D understanding of chemical bonding for Unit 2. Each item is open-ended like the one shown in Table 2, contains an aspect of a DCI, an SEP and a CCC, and is designed to measure all three levels of the 3D construct map shown in Table 1. Items were administered as a pre and post Unit 2 test. Two items were used to conduct interviews before and after Unit 2 to obtain qualitative validity evidence for the 3D construct map.
The first interview item, called "Match on the Hot Plate", is the same as the one shown in Table 2. The item assessed the DCI of Chemical Reactions, the SEP of Constructing Explanations, and the CCC of Cause and Effect. The second interview item, called "Atoms Forming a Chemical Bond", assessed the DCI of Relationship Between Energy and Forces, the SEP of Developing and Using Models, and the CCC of Cause and Effect. This item was not exactly the same as in the written assessment, but aligned closely to a similar item. For both items, each answer was scored using a scoring rubric aligned to the 3D construct map levels. The rubric describes scoring rules specific to each item and reflects the ability to apply the DCI, SEP and CCC described in the 3D construct map to make sense of the specific phenomenon in question. The construct map, on the other hand, provides a general description of increasingly sophisticated ways of thinking about chemical bonding from the perspective of energy and force. In terms of alignment, a score of 1 on an item aligns to level 1 of the 3D LP, a score of 2 aligns to level 2, and so on. Table 3 shows the rubric, the corresponding level of the 3D construct map, and a sample answer from the oral interview for the "Match on the Hot Plate" item; Table 4 shows the same for the "Atoms Forming a Bond" item. For all items on the test, including those shown in Tables 3 and 4, both the rubric and the 3D construct map aim to characterize understanding of chemical bonding starting from a basic level with essentially no relevant DCIs present (level 0), transitioning to macroscopic-level understanding (level 1), then to an incomplete microscopic level (level 2), and finally to a complete microscopic level (level 3). For both items, there were no responses at level 3 of the 3D construct map upon completion of Unit 2.

Table 3.3 Sample responses for every 3D construct map level for "Match on the Hot Plate". Question (all levels): Striking a match across a rough surface will create a spark that sets the match on fire. How can the match in the video light without a spark? Justify your answer.

Level 0 (score 0)
3D construct map:
DCI: Chemical Reactions
• Chemical reactions are not recognized
• Chemical reactions are not explained using ideas related to chemical bonds
• Energy changes are not associated with chemical reactions
• Chemical reactions are not described using ideas related to atoms/molecules
SEP and CCC: models/explanations are not causal and are based on recollection of facts or observable components only; no mechanism explaining the phenomenon
Scoring rubric:
DCI: Chemical Reactions
• Match burning is not recognized as a chemical reaction
• No molecular-level explanation for what causes the match to light up from the perspective of bond breaking/forming processes
• No relationship between match burning, chemical bonds and energy at the atomic level
SEP and CCC:
• Explanations focus on observable components only
• No causal mechanism
Sample response: "Heat from the hot plate causes match to light up"
Comment: relevant components of the DCI are not present; the explanation contains only observable components and no causal mechanism to explain what causes the match to light without a spark.
Level 1 (Macroscopic, score 1)
3D construct map:
DCI: Chemical Reactions
• Chemical reactions are recognized (indicators include temperature change, color change, release of gas, precipitate formation, odor)
• Chemical reactions are not explained using ideas related to chemical bonds
• Energy changes are associated with chemical reactions, with some inaccuracies
• Chemical reactions are not described using ideas of atoms/molecules
SEP and CCC: student models/explanations are causal and use ideas of energy when explaining chemical reactions, but at the macroscopic level only; contain inaccuracies
Scoring rubric:
DCI: Chemical Reactions
• Match burning is recognized as a chemical reaction
• No molecular-level explanation for what causes the match to light up from the perspective of bond breaking/forming processes
• No relationship between match burning, chemical bonds and energy at the atomic level
SEP and CCC:
• Explanation uses ideas of energy to explain the match lighting/burning
• No causal mechanism beyond observable components
• Explanation relates heat and energy, might be inaccurate
Comment: relevant DCIs are present; the model provides a macro-level causal mechanism using ideas of energy to explain what causes the match to light up when it is sitting on the hot plate.

Level 2 (Incomplete Microscopic, score 2)
3D construct map:
DCI: Chemical Reactions
• Chemical reactions are explained using bond breaking/making processes with some inaccuracies
• Energy changes are associated with chemical reactions, with some inaccuracies
• Chemical reactions are described using atoms/molecules with some inaccuracies
SEP and CCC: student models/explanations are causal and use ideas of energy when explaining chemical reactions at the microscopic level, with some inaccuracies
Scoring rubric:
DCI: Chemical Reactions
• Match burning is recognized as a chemical reaction
• Molecular-level explanation for what causes the match to light up from the perspective of bond breaking/forming processes, might be inaccurate
• Inaccurate relationship between match burning, chemical bonds and energy at the atomic level
SEP and CCC:
• Explanation uses ideas of energy to explain the match lighting/burning
• Microscopic-level causal mechanism with some inaccuracies
• Explanation relates heat and energy, might be inaccurate
Sample response (from the interview): "The hot plate gives off heat, which is then transferred to the match, causing the molecules in the match to move faster. The heat energy causes atoms in the molecules to rub together faster, which separates molecules in the match to individual atoms and sets the match on fire"
Comment: molecular-level explanation with significant inaccuracies; no explicit mention of bond breaking/bond making processes or of how energy is involved in these processes. The model and explanation state that the atoms are set on fire and are also present in the flame itself.
Level 3 (Microscopic, score 3)
3D construct map:
DCI: Chemical Reactions
• Chemical reactions are explained using bond breaking/making processes
• Energy changes are associated with chemical reactions and bond making/breaking processes
• Chemical reactions are described using atoms/molecules
SEP and CCC: student models/explanations are causal and use ideas of energy when explaining chemical reactions at the molecular level
Scoring rubric:
DCI: Chemical Reactions
• Match burning is recognized as a chemical reaction
• Molecular-level explanation for what causes the match to light up from the perspective of bond breaking/forming processes
• Relationship between match burning, chemical bonds and energy at the atomic level
SEP and CCC:
• Explanation uses ideas of energy to explain the match lighting/burning
• Microscopic-level causal mechanism
• Explanation relates heat and energy
Sample response: no level 3 responses were observed by the end of Unit 2 for this interview item.

Table 3.4 Sample responses for every 3D construct map level for "Atoms Forming a Bond". Question (all levels): Draw a model to explain how two atoms can form a chemical bond using ideas related to atomic structure, electric force and energy.

Level 0 (score 0)
3D construct map:
DCI: Relationship between energy and forces
• Energy is treated as the same as heat/friction/force
• Ideas of energy are not applied to explain bond breaking/making processes
• Energy changes are not related to Coulombic interactions
• Chemical bonds are not described as resulting from a balance of attractive/repulsive forces that leads to energy minimization
• Electric fields are not used to explain interactions at a distance
SEP and CCC: models/explanations are not causal and are based on recollection of facts or observable components only; no mechanism explaining the phenomenon
Scoring rubric:
DCI: Relationship between Energy and Forces
• Basic attractive interactions between opposite charges and repulsive interactions between similar charges might be used to explain bond formation, but with some inaccuracies
• Ideas of energy are not applied to explain bond breaking/making processes
• Energy changes are not related to Coulombic interactions between components of atoms (protons, electrons) to explain bond formation
• Chemical bonds are not described as resulting from a balance of attractive/repulsive forces that leads to energy minimization
• Electric fields are not used to explain bond formation
SEP and CCC:
• Models focus on observable components only
• No causal mechanism
Sample response: "I am not sure how they form a bond. They will stick together somehow, but I am not sure how."
Comment: the model/explanation does not contain any components beyond those provided in the question, and does not use ideas of energy, force and atomic structure to explain bond formation.

Level 1 (Macroscopic, score 1)
3D construct map:
DCI: Relationship between energy and forces
• Energy is associated with either motion (kinetic) or "stored" (potential) at the macroscopic scale
• Energy is treated as the same as heat/friction/force
• Ideas of energy are applied to explain bond breaking/making processes without relating them to electrical interactions between atom components
• Energy changes are related to Coulombic interactions between point charges and charged macroscopic objects, but with some inaccuracies
• Chemical bonds are not described as resulting from a balance of attractive/repulsive forces that leads to energy minimization
• Electric fields might be used to explain interactions at a distance
SEP and CCC:
• Inaccurate macro-level mechanism (charges modeled as point charges and not as parts of atoms); might need to prompt to elicit ideas of energy
• Models do not relate energy changes to changes in forces between interacting objects in a system
Scoring rubric:
DCI: Relationship between Energy and Forces
• Basic attractive interactions between opposite charges and repulsive interactions between similar charges are used to explain bond formation; charges are modeled as point charges, not as parts of atoms
• Ideas of energy are applied to explain bond formation without relating them to electrical interactions between atom components
• Energy is mentioned in the context of potential energy associated with energy being "stored" and kinetic energy associated with motion
• Chemical bonds are not described as resulting from a balance of attractive/repulsive forces between components of atoms that leads to energy minimization
• Electric fields might be used to explain bond formation
SEP and CCC:
• Models do not show an atomic-level causal mechanism; they might relate energy changes to changes in forces between interacting objects in a system, but with some inaccuracies

Sample response #1
Comment: the model/explanation describes components of atoms (protons, electrons) as point charges and provides a basic causal mechanism for attractive interactions between these components as a basis for forming a chemical bond. The model and explanation do not provide a causal atomic-level mechanism of bond formation using ideas of energy, but the explanation makes a distinction between energy associated with motion of atoms (kinetic) and potential energy associated with the bonding state.
From the interview:
Student: atoms usually bond together by touching. Bond is like a bridge, I think it is just air in between.
Interviewer: What makes the atoms stick together in a bond?
Student: The charges, because opposite charges attract.
Interviewer: So, are the atoms in a bond charged?
Student: I think so, I am not sure.
Interviewer: Does energy change in any way when atoms form a bond?
Student: Yes. Say you have kinetic energy when they are moving, then when they are stuck together it's potential energy.

Sample response #2
Student: "Atoms need a third atom to form a bond. They give the extra energy to the third atom through collision, which allows them to form a bond"
Comment: the answer uses ideas of energy to explain bond formation, but does not relate energy changes to electrical interactions between atomic components. No atomic components or point charges are shown. This is also a piece of knowledge that comes directly from the simulation that students did as part of their Unit 2 learning experience.

Sample response #3
Student: the bond forms by adding energy. For those atoms to be able to connect we have to have a third atom that provides energy. So, when this one (third atom) gets pushed up, they attract and then they bond.
Comment: the answer uses ideas related to electrical interactions (attraction between opposite charges) to explain bond formation. Charges are modeled as point charges, not as parts of atoms. The idea of energy is inaccurate: the explanation states that one needs to add energy to form a bond.
The idea that atoms need a third atom to form a bond might also come from the simulation students did in Unit 2, just as for sample response #2.

Level 2 (Incomplete Microscopic, score 2)
3D construct map:
DCI: Relationship between energy and forces
• Energy is associated with either motion (kinetic) or "stored" (potential) at the macroscopic scale and, inaccurately, at the microscopic scale
• Inaccurate ideas about the relationship between energy/heat/force
• Ideas of energy are applied to explain bond breaking/making processes with some inaccuracies
• Energy changes are related to Coulombic interactions between point charges and charged macroscopic objects, possibly with inaccuracies
• Chemical bonds are described as resulting from a balance of attractive/repulsive forces; energy relationships are inaccurate or absent
• Electric fields might be used to explain interactions at a distance
SEP and CCC:
• Student models/explanations are causal and use ideas of electric force to explain phenomena related to bond breaking and bond making by showing a micro-level mechanism with some inaccuracies
• Models relate energy changes to changes in forces between interacting atoms to explain why bonds form; need to be prompted
Scoring rubric:
DCI: Relationship between Energy and Forces
• Energy changes are related to Coulombic interactions between components of atoms (protons, electrons) to explain bond formation, might be inaccurate
• Chemical bonds are described as resulting from a balance of attractive/repulsive forces between components of atoms; the energy minimization idea is inaccurate or needs to be prompted
• Electric fields might be used to explain bond formation
SEP and CCC:
• Models show an atomic-level causal mechanism for bond formation focused on describing the balance of electrical interactions between components of atoms (protons, electrons); the energy minimization idea is not present or inaccurate
Sample response (from the interview):
Student: a bond forms where electrons are attracted to the core of the nucleus, the electrons will attract to the core of each other's atoms
Interviewer: What about the two nuclei?
Student: They do repel each other, so they keep some distance between them. They won't be touching because the cores are repelling each other, but they also won't get too far because the electrons are attracted to the core.
Interviewer: So, with both attractive and repulsive interactions present, why does a bond form?
Student: They get to the point where they are at equilibrium, they are not attracting or repelling, they are close enough to be attracted, but far enough away not to be repelled.
Interviewer: Is there anything else apart from attractive/repulsive interactions that is driving this process?
Student: I am not sure.
Interviewer: Do you think energy is involved in forming a bond?
Student: I am not sure.
Comment: the model and explanation describe an atomic-level causal mechanism for bond formation focused on describing the balance of electrical interactions between components of atoms (protons, electrons); the energy minimization idea is not present even when prompted.
Level 3 (Microscopic, score 3)
3D construct map:
DCI: Relationship between energy and forces
• Energy is associated with either motion (kinetic) or "stored" (potential) at the macroscopic and atomic scales
• Ideas of energy are applied to explain bond breaking/making processes
• Energy changes are related to Coulombic interactions between charges at the macro and atomic-molecular levels
• Chemical bonds are described as resulting from a balance of attractive/repulsive forces leading to energy minimization
• Electric fields are used to explain interactions at a distance
SEP and CCC:
• Student models/explanations are causal and explicitly use ideas of energy and electric force to explain phenomena related to bond breaking/making by showing a micro-level mechanism
• Models relate energy changes to changes in forces between interacting atoms to explain why bonds form
Scoring rubric:
DCI: Relationship between Energy and Forces
• Energy changes are related to Coulombic interactions between components of atoms (protons, electrons) to explain bond formation
• Chemical bonds are described as resulting from a balance of attractive/repulsive forces between components of atoms, driven by energy minimization
• Electric fields might be used to explain bond formation
SEP and CCC:
• Models show an atomic-level causal mechanism for bond formation focused on describing the balance of electrical interactions between components of atoms (protons, electrons) and energy minimization
Sample response: no level 3 responses were observed by the end of Unit 2 for this interview item.

Step 3: Evaluating Outcome Space

The third step involves evaluating the outcome space by analyzing student responses to the items and mapping them to the levels of the construct map to ensure that scores on the items relate to the levels of the construct map in a meaningful way. The hypothetical 3D construct map shown in Table 1 was constructed using the logical sequence of the discipline, relevant research literature and unpacking of NGSS PEs. The "Interactions" curriculum was piloted in the same Midwestern schools a year prior to the data collection described here. During the data collection year, a team of researchers used the scoring rubrics to score student answers directly to the construct map levels and verified that the types of answers students provided on each item were consistent with the 3D construct map levels as well as with the scoring rubrics. Examples of student answers for the interview items, together with the scoring rubric and 3D construct map levels, are provided in Tables 3 and 4.

Supporting levels of the 3D LP using qualitative analysis of student interviews

The interview data was collected in a Midwestern public high school where the "Interactions" curriculum was implemented. See Chapters 1 and 2 for a more detailed description of the sample. Several students from each of the three participating classrooms were interviewed before and after implementation of Unit 2, for a total of 17 students. The students were selected to represent different levels of academic achievement. The items shown in Tables 3 and 4 were used for the interviews, and sample student responses to them are shown in those tables. All items probe ideas related to the three levels of the hypothetical 3D construct map shown in Table 1.
Student interviews were analyzed using the scoring rubric, and each answer was assigned a level on the 3D construct map (Tables 3 and 4). Inter-rater reliability was established in the following manner. One researcher first scored all 17 interviews. Then, two other researchers used the same rubric to score the interviews of three students from each classroom (nine students in total). Once 100% agreement on 3D LP level placement for all nine students was reached among the three scorers, the scoring rubric and the 3D construct map levels were modified accordingly, and the remainder of the interviews were rescored based on this discussion.

Step 4: Measurement Model

IRT analysis for the Unit 2 pre/post assessment was carried out following Toland (2014). The sample of 899 students was modeled using the graded response model (GRM) (Samejima, 1969). See Chapters 1 and 2 for a more detailed description of the sample. A score of "0" was imputed for students who had missing values on any of the items. This approach was deemed appropriate because students were given an unlimited amount of time to finish the assessment; it was therefore safe to assume that if they did not provide an answer for a given item, they did not know it. Pre and post assessment data were combined in model estimation to allow for comparison of the ability distributions on the pre and posttest. The dimensionality and longitudinal invariance study were reported earlier (see Chapter 1). Pre and post measures were highly reliable (pre Unit 2 = 0.823, post Unit 2 = 0.932) and supported by validity evidence (see Chapter 1). There were two theoretical latent dimensions measured on the Unit 2 assessment: student 3D understanding of Energy (more specifically, of the relationship between energy and forces in bond breaking/bond making processes) and student 3D understanding of Chemical Reactions (see Chapter 1). The theoretically suggested latent dimensionality was confirmed by validity evidence based on response process and invariance studies (see Chapter 1). The previous study therefore suggests a two-dimensional latent structure for the Unit 2 assessment instrument. For IRT modeling, however, the interest is only in the overall progression, and since the two dimensions were closely related (the correlation coefficient was 0.784 on the pretest and 0.928 on the posttest) and therefore likely to develop in conjunction, a unidimensional IRT model was used to model students' progression. This also made sense from a theoretical point of view: all items on the test were aimed at measuring student understanding of ideas related to chemical bonding, so even though the two latent constructs were slightly different, it was reasonable to combine them as measures of the same science idea. Based on this assumption, a unidimensional IRT model was used to model the data. The Appendix provides R code for model selection, specification, and estimation using the mirt package (Chalmers, 2012) in RStudio (RStudio Team, 2015).
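For orientation, a minimal sketch of this model specification with the mirt package is shown below. It is consistent with the description above but is not the full analysis code from the Appendix, and the data object unit2_items (the eight items scored 0-2, with pre and post records combined) is hypothetical.

```r
library(mirt)

# unit2_items: hypothetical data frame with one row per response record
# (pre and post combined) and 8 columns of polytomous item scores (0-2)
grm_fit <- mirt(unit2_items, model = 1, itemtype = "graded")

# Item slopes and category thresholds (b1, b2) in the IRT parameterization
coef(grm_fit, IRTpars = TRUE, simplify = TRUE)$items
```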
The results section presents the IRT analysis relevant to validation of the 3D LP.

Results

Supporting the validity of the 3D construct map using qualitative analysis of student interviews

Identifying Key Knowledge and Practices for Each Level of the 3D Construct Map

Qualitative analysis of student interviews served as a rich source of information for obtaining validity evidence for the hypothetical 3D construct map levels. Analysis of student responses supported the hypothesized progression of student understanding reflected in the 3D construct map levels for both interview items. Specifically, at level 0, student answers contain very little information relevant to demonstrating the ability to apply ideas related to bond making and bond breaking processes from the perspective of energy and electric force. The answers focus on reciting back information provided in the question itself and on some observable macroscopic-level components. For example, for the "Match on the Hot Plate" item, level 0 responses usually focus on the idea that the match is set on fire by the heat from the hot plate, without mentioning any mechanistic causal details. Similarly, for the "Atoms Forming a Bond" item, student answers at level 0 do not demonstrate relevant knowledge of bond making/bond breaking processes or of how energy and electric force are involved in these processes.

Level 1 reflects the most diverse range of responses for the "Atoms Forming a Bond" item. In general, student responses to this item reflect macroscopic-level understanding, with various ideas related to electrical interactions and energy being used to explain bond formation. For example, students use basic ideas related to attractive and repulsive interactions between charges to attempt to explain the formation of chemical bonds (see level 1 sample response #1 for "Atoms Forming a Bond"), but do not relate energy and electric force in the context of chemical bonding at the atomic-molecular level. Charges are modeled as point charges and not as parts of atoms at this level. This lack of detail in level 1 models leads to incomplete or inaccurate explanations of phenomena and a lack of microscopic-level details. Additionally, students at this level might recall a computer simulation they studied as part of Unit 2 instruction, in which two atoms could not form a chemical bond until a third atom was introduced into the system; the atoms were able to form a bond after colliding with the third atom and transferring the extra energy to it. Student responses use ideas from this simulation to suggest that two atoms need a third atom, to which they give their extra energy, in order to form a bond (see level 1 sample response #2 for "Atoms Forming a Bond"). Atoms are still modeled as spheres, without mention of the components of the atoms or interactions between components. Finally, the third type of level 1 response for this item reflects a combination of ideas related to electrical interactions between point charges and ideas from the simulation related to introducing a third atom into the system to form a chemical bond. For example, level 1 sample response #3 for "Atoms Forming a Bond" indicates that atoms form a bond via attractive interactions between opposite charges (the charges are modeled as point charges, not as components of atoms), and the third atom gives energy to the other two to form a bond. This response reflects a combination of prior knowledge about attractive interactions between opposite charges and new knowledge from the simulation studied in class, mixed with what is probably the pre-existing inaccurate idea that one needs to add energy to break a chemical bond; hence the student's suggestion that the third atom actually adds energy to help the other two form a chemical bond, instead of taking the excess energy away to allow the bond to form.
For the "Match on the Hot Plate" item, level 1 responses were not as diverse; they generally reflect the ability to track energy transfer in the system at the macroscopic level without providing atomic-level details (see the level 1 sample response for "Match on the Hot Plate").

Level 2 reflects transitional, macro- to molecular-level understanding of chemical bonding from the perspective of energy and electric force. Specifically, student models show atomic- and molecular-level detail, and explanations mention that it takes energy to separate atoms (see the level 2 response for "Match on the Hot Plate"). However, answers at this level do not provide a full causal mechanistic account of how energy is transferred when atoms are separated and of how the energy provided causes bonds to break. Answers at this level also contain some inaccuracies (for example, the sample level 2 answer for "Match on the Hot Plate" mentions that atoms are part of the flame from the match). Similarly, as related to chemical bond formation, answers at level 2 reflect a detailed atomic-level causal mechanistic understanding from the perspective of electrical interactions, including the idea that a chemical bond forms at the distance between two atoms at which the attractive and repulsive interactions between components of the atoms balance out (see the sample response for the level 2 "Atoms Forming a Bond" item). However, the idea of energy and how it is involved in bond making/bond breaking processes is still lagging or contains many inaccuracies at level 2. Students might need to be prompted to elicit ideas of energy, and even when prompted they do not necessarily relate energy changes in the system to the changes in electrical interactions that lead to bond formation.

Finally, at level 3 of the 3D construct map, student models and explanations demonstrate microscopic-level causal mechanistic understanding of bond breaking and bond forming processes from the perspective of energy and force. Upon completion of Unit 2, no responses were identified that would be fully consistent with level 3. Specifically, at this level student models and explanations are expected to demonstrate full causal relationships between changes in the energy of the system and the associated changes in attractive/repulsive interactions between atoms as related to bond forming and breaking, as well as a clear understanding of the differences between heat, force and energy. It is likely that this level of understanding develops as students have additional opportunities to explore more phenomena.

Evidence in Support of Developmental Nature of Student 3D Understanding

While there were no level 3 responses observed in the interviews or in the scoring of the entire student sample on the written pre and post assessments, there were some responses that could be characterized as transitioning between the levels of the 3D LP. Table 5 provides examples of student answers that were considered to fall between levels and explains why. For example, transitioning from level 0 to 1 of the 3D construct map on the "Match on the Hot Plate" item is characterized by providing a more detailed causal account of the phenomenon, but with still few relevant DCIs present (no mention of energy or electrical interactions). For the "Atoms Forming a Bond" item, level 0/1 responses are characterized by referring to the idea of interactions between atoms as a driving factor of forming a bond, but with no clear causal mechanism for the origin of the interactions.
For the "Atoms Forming a Bond" item, the sample response uses the idea of a field to explain bond formation, but does not explain the origin of the field or how it is involved in forming a bond. Further, transitional responses at level 1/2 of the 3D construct map are characterized by providing more atomic- and molecular-level mechanistic details of phenomena, but lack a detailed mechanism relating interactions between atomic components to bond formation, and energy ideas are absent or used inaccurately. For example, for the "Match on the Hot Plate" item, the sample response shows a model indicating that bonds break as a result of the "spark from the hot plate", which causes a chemical reaction and sets the match on fire. Relevant ideas are present, but they are not used in a way that provides a clear causal mechanistic account of why the match lights up, and ideas of energy are completely absent. Similarly, for the "Atoms Forming a Bond" item, level 1/2 transitional responses show atomic-level detail (both sample responses show atomic structure and relevant atomic components), but how the interactions between atomic components are involved in forming a bond is not clear. In sample response #1 the explanation indicates that the interactions are due to the field, but the origin of the field is not explained. In sample response #2 the attractive and repulsive interactions between components of the atoms are accounted for, but the idea of a balance of attractive and repulsive forces is missing, and therefore the mechanism of bond formation is not fully explained; ideas of energy are not related to electrical interactions between components of atoms and are not mentioned in the context of bond formation. Finally, level 2/3 transitional responses reflect a detailed molecular-level mechanism that relates interactions between components of atoms to energy, but contains some inaccuracies. For example, for "Match on the Hot Plate" the mechanism for lighting the match is explained at the molecular level, but the answer indicates that bonds are broken as a result of heat from the hot plate rather than energy. However, the answer further states that energy is released when new bonds form; it is therefore unclear whether the student treats the ideas of heat and energy as equivalent. Further, for the "Atoms Forming a Bond" item, the sample transitional response provides a causal mechanistic account of how interactions between components of atoms and energy are involved in bond formation, with some inaccuracies; for example, the answer indicates that the potential energy is equally high when the atoms are far away from each other and when they are close together. To summarize, transitional responses tend to contain more relevant content (aspects of DCIs), but lack application of that content for explaining phenomena. This reflects the nature of 3D understanding described in the 3D construct map, which is characterized by achieving knowledge-in-use, or the ability to apply DCIs, SEPs and CCCs to explain phenomena. Transitional responses were assigned the lower level on the 3D construct map as the final level for online responses because they did not contain all the aspects consistent with the higher level.

Table 3.5 Sample responses that fall between levels of the 3D construct map

Level 0/1
Sample student answer, "Match on the Hot Plate":
Student: "The friction from it moving across a rough surface causes the heat which makes it set on fire. The hot plate gives the match enough heat to lite"
Comment: the answer does not use ideas of energy to explain why the match lights up, but provides a more detailed mechanistic account of what causes the phenomenon.
Sample student answer, "Atoms Forming a Bond":
Student: the two atoms interact through the field to form a bond.
Interviewer: Where do these fields come from?
Student: all atoms have fields around them.
Comment: the explanation focuses on the idea of a field as a major factor in bond formation; components of atoms or point charges are not shown, no explanation of the origin of the field is present, and there is no explanation of how the field contributes to forming a bond.

Level 1/2
Sample student answer, "Match on the Hot Plate":
Student: "The match lights up because of the chemical reaction. The spark from the hot plate breaks the bonds in the molecules of the match, there is a chemical reaction and the match sets on fire".
Comment: the model and explanation provide an atomic-level mechanism for bond breaking, but do not explain how energy is involved in the bond breaking process and in setting the match on fire.
Sample student answers, "Atoms Forming a Bond":
Sample response #1
Student: atoms are made of electrons, and protons and neutrons in the nucleus. When atoms form a bond, their fields interact.
Interviewer: Where do these fields come from?
Student: They are located around the atoms. When atoms are close enough, their fields interact and form a bond.
Comment: the model and explanation show atoms and indicate components of the atoms, but do not explain how the components interact to form a bond, or the origin of the fields. Energy is not mentioned.
Sample response #2
Student: as atoms get closer, the nuclei repel, and electrons of one atom attract to the protons of the other atoms.
Interviewer: so how do the two atoms form a bond?
Student: probably the attraction between protons and electrons is stronger than repulsion… I am not sure.
Comment: the model and explanation refer to interactions between components of atoms as the driving factor in bond formation, but the idea of a balance of attractive and repulsive forces is missing. Ideas of energy are also missing.

Level 2/3
Sample student answer, "Match on the Hot Plate":
Student: "The heat from the hot plate breaks bonds in the molecules of the match. The new bonds form and the match is set on fire. Energy is released when new bonds form, and fire is indication that energy is being transferred".
Comment: the answer does not use ideas of energy to explain why bonds in the molecules of the match are broken. However, a molecular-level mechanism is present, and energy is used to explain the observed flame as a result of bonds forming and energy being released.
Sample student answer, "Atoms Forming a Bond":
Student: They (atoms) have to get close enough without repelling, and they have to have the energy to make a bond, but I don't remember how. When they are moving towards each other they have potential energy, when they are attracting or repelling. And when they get close enough the potential energy goes down I think because the atoms are attracting to each other more or something…
Interviewer: Why do they repel when they are too close?
Student: because they have electrons, they get too close and they repel. The electrons are negative, they are on the outside (of the spheres shown in the model), and the protons are positive, they are on the inside.
Interviewer: and why would atoms attract when they are far away?
Student: because the electrons attract to the nucleus of the other atom
Interviewer: And what do you mean by balanced here?
Student: they are not repelling or attracting because the electrons are wanting to attract the protons, but at the same time the electrons are repelling from each other.
Interviewer: How do your energy graphs relate to each of the situations?
Student: the potential energy is high when they move towards each other or away from each other because they are either repelling or attracting, which builds up the potential energy. Potential energy is like, it wants to move, but it's not moving yet. Kinetic energy is energy in motion.
Comment: the explanation describes energy changes associated with electrical interactions between components of atoms when prompted, even though the model does not show the components of atoms (protons, electrons). Some inaccuracies are present in the answer; for example, the potential energy is said to be high both when the atoms are far away and when they are close together.

Consistency in Assigning Responses to 3D Construct Map Level for Different Phenomena

Since students were asked to explain more than one phenomenon, it was possible to study their ability to transfer their 3D understanding to different contexts. Specifically, the "Atoms Forming a Bond" item is an example of an abstract phenomenon that students cannot directly observe, which makes it harder to model and explain; this item also contains more complex ideas and requires deeper understanding. On the other hand, the "Match on the Hot Plate" item focuses on a more familiar phenomenon that is directly observed in the video shown to students. This difference in how familiar the phenomena were to students is evident in the levels of the answers provided for the two scenarios in the interview. Table 6 shows the assignment of levels for each student on each interview item. Specifically, on the pretest, 7 students scored at level 0, 8 students at level 1 and 2 students between levels 0 and 1 of the 3D construct map on the "Match on the Hot Plate" item. On the "Atoms Forming a Bond" item, 10 students scored at level 0, 5 students at level 1 and 2 students at level 1/2 of the 3D construct map on the pretest. These results suggest that the abstract "Atoms Forming a Bond" item was somewhat more difficult for students to model and explain. Still, a considerable number of students scored at level 1 or even level 1/2 on the pretest (7 of the 17 students in total), suggesting that many students were able to apply prior knowledge about electric charges gained during Unit 1 to suggest a possible mechanism for chemical bond formation. Overall, the majority of interviewed students demonstrated proficiency between levels 0 and 1 of the 3D LP on the pre-Unit 2 interview.

On the posttest, nobody scored at level 0 on either interview item. On the "Match on the Hot Plate" item, 5 students scored at level 1, 9 students at level 2, 1 student at level 1/2 and 2 students at level 2/3; on the "Atoms Forming a Bond" item, 6 students scored at level 1, 6 students at level 2, 2 students at level 1/2 and 3 students at level 2/3. These results suggest that students are developing macroscopic-level understanding of energy and are starting to make sense of how energy and force might be related to bond breaking/bond making processes, which is mostly consistent with levels 1 and 2 of the 3D construct map. Overall, a clear progression along the levels of the 3D construct map is evident in the student interview analysis.
All students moved up at least one level of the 3D construct map upon completion of Unit 2. Additionally, the 3D construct map level assignments were consistent across the two phenomena, meaning that students overall received the same 3D construct map level for both phenomena upon completion of Unit 2. This suggests that although the contexts of the two interview items were quite different, this did not, to a large extent, affect students' ability to apply their understanding to explain bond making/bond breaking processes.

Table 3.6 Student 3D construct map level for each interview phenomenon, pre- and post-Unit 2

Student | Match on the Hot Plate, pre | Match on the Hot Plate, post | Atoms Forming a Bond, pre | Atoms Forming a Bond, post
A | 1 | 2 | 1 | 2
B | 0 | 1 | 0 | 1
C | 0 | 1 | 0 | 1
D | 0 | 2 | 0 | 2
E | 0/1 | 1 | 0 | 1
F | 1 | 1/2 | 0 | 1
G | 1 | 2/3 | 1/2 | 2
H | 1 | 2 | 1 | 2/3
I | 1 | 2 | 0 | 1/2
J | 1 | 2/3 | 1 | 2/3
K | 0/1 | 2 | 1/2 | 2
L | 0 | 2 | 0 | 1
M | 1 | 2 | 0 | 2
N | 0 | 1 | 1 | 1/2
O | 0 | 1 | 0 | 1
P | 0 | 2 | 0 | 2/3
Q | 1 | 2 | 1 | 2

Supporting the Validity of Levels of the 3D Construct Map Using IRT

In this section, Wright maps resulting from fitting the graded response model (GRM) are used to provide additional validity evidence for the 3D construct map levels. The GRM is a polytomous item model: it is used for items with more than two ordered response categories, like the ones designed for this study. Under the GRM, each response category has its own difficulty parameter (Samejima, 1969). The interpretation of category difficulty under the GRM is the following: a student with ability equal to the difficulty of a given response category has a fifty percent probability of scoring in that category or above, and a fifty percent probability of scoring below it (Samejima, 1969). When looking at the Wright map, we want to see whether the abilities that correspond to the difficulties of the various item response categories are consistent with those theoretically suggested by the rubric and the 3D construct map. Specifically, we expect item difficulties that correspond to lower ability levels to be located in the lower ability region of the Wright map for all items, because respondents of lower ability are more likely to endorse an easier (lower difficulty) response category, which in turn corresponds to a lower level of the 3D construct map. Similarly, item difficulties corresponding to higher ability levels should be located in the higher ability region of the Wright map, because higher ability is related to a higher probability of endorsing a more difficult response category, which corresponds to a higher level of the 3D construct map. If this pattern is consistent for all items on the assessment, then we have evidence for the validity of the hypothetical 3D LP (Wilson, 2005; Wilson, 2009; Doherty et al., 2015).
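In standard notation (a sketch of the usual GRM formulation; the symbols below are illustrative and not taken from the dissertation), the probability that respondent $j$ with ability $\theta_j$ scores in category $k$ or higher on item $i$ is modeled as

\[
P\left(X_{ij} \ge k \mid \theta_j\right) = \frac{1}{1 + \exp\left[-a_i\left(\theta_j - b_{ik}\right)\right]}, \qquad k = 1, \dots, m_i,
\]

where $a_i$ is the item slope and $b_{ik}$ is the category difficulty. When $\theta_j = b_{ik}$, this probability equals one half, which is the fifty-percent interpretation used above.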
The level 0-1 threshold separates level 0 from level 1 of the 3D construct map. The cutoff for level 0-1 is 1.31 on the logit scale and was taken to be the lowest item threshold for level 1. This means that respondents with ability above 1.31 are at level 1 of the 3D construct map, and respondents with ability below 1.31 are at level 0. Further, the cutoff for level 1-2 is 1.88 and has the same interpretation as the level 0-1 cutoff; it was calculated as the median of all item thresholds on the logit scale (Doherty et al., 2015). Since no scores corresponding to level 3 of the 3D construct map were observed, and thresholds for level 3 have therefore not been estimated, the cutoff for level 2-3 cannot be accurately determined. However, the highest threshold for level 2 is 2.60, and the level 3 cutoff is likely located close to or slightly above that value.
As seen in Figure 2, level 1 thresholds are well separated from level 2 thresholds. Specifically, all level 1 thresholds are located in approximately the same ability region and do not overlap any of the level 2 thresholds. In other words, no level 1 threshold falls above the level 1-2 cutoff, and no level 2 threshold falls below it. This suggests that the progression of student understanding predicted by the 3D construct map is supported by the data, providing quantitative validity evidence for the 3D construct map (Doherty et al., 2015; Wilson, 2004).

Figure 3.2 Wright map showing 3D construct map levels for Unit 2 assessment items (pretest and posttest respondent distributions; level 0-1 cutoff = 1.31, level 1-2 cutoff = 1.88)

Evaluating Student Learning Based on the Unit 2 Assessment
Pre- and post-assessment data were combined when fitting the GRM (see Appendix for details) in order to compare how the ability distributions change between pre- and posttest. The Wright Map in Figure 2 shows the distributions of respondents for the pre- and posttest on one graph. Both the pre- and post-Unit 2 distributions contain a substantial number of respondents below 0 on the logit scale. These are respondents with missing data, for whom zeros were imputed at both time points. Respondents who did not provide any answer on the pre- and posttest still participated in the curriculum, as can be seen from their work in Unit 2 saved in the online portal. Therefore, even though they had missing data for the Unit 2 assessment, they were left in the sample to ensure that their data could be used to investigate levels of the 3D construct map. To check the extent of learning associated with Unit 2, a Wald test was conducted to determine whether the increase in the mean between pre- and posttest was statistically significant. The mean increased from -0.006 to 0.516 on the logit scale between the pre- and posttest, and the Wald test showed that this increase was statistically significant (W = 305.1, df = 1, p < 0.001), indicating that learning occurred between the pre- and post-assessments for the entire sample of students. However, to better understand how this learning occurred in terms of student movement along the levels of the 3D construct map, we need to look at the distribution of responses and compare the pre- and post-unit assessments for each level of the 3D construct map.
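One way to quantify this movement, mirroring the logic of the R code later in the Appendix, is to evaluate the empirical cumulative distribution of the ability estimates at the level cutoffs. The vector below is illustrative data, not the study's estimates.

# Sketch: share of respondents in each 3D construct map level, computed
# from ability estimates with the empirical CDF (illustrative data only).
theta_pre <- rnorm(800, mean = 0.9, sd = 0.7)  # hypothetical pretest abilities
pct_pre <- ecdf(theta_pre)

pct_pre(1.31)                  # proportion below the level 0-1 cutoff (level 0)
pct_pre(1.88) - pct_pre(1.31)  # proportion between the cutoffs (level 1)
1 - pct_pre(1.88)              # proportion above the level 1-2 cutoff (level 2)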
Since the respondents who did not provide any answer on the pre- and post-assessment introduce too much noise into the distribution, they were removed from the Wright Map to make visible the spread in learning among the students who did provide answers. This allows drawing more accurate conclusions about student learning upon completion of Unit 2. Figure 3 below shows the Wright Map for the reduced data set of students who provided answers on the pre- and posttest.

Figure 3.3 Wright map with respondents who provided answers on the pre/post Unit 2 test (annotations mark the 3D construct map level cutoffs, the distribution maxima on the pre- and posttest at 1.16 and 1.59, and the average ability levels for the two thresholds at 1.52 and 2.23)

Observe in Figure 3 that the distribution maximum on the pretest is located at 1.16 on the logit scale, which is below the 1.31 cutoff value for level 1. In total, about 80% of respondents on the pretest lie below the level 1 cutoff value of 1.31 (see the R code in the Appendix for the percentage calculation). There is a small peak at 1.59 on the logit scale on the pretest, which is close to the 1.52 average ability value for threshold 1. Overall, about 20% of respondents lie in level 1 of the 3D construct map on the pretest, with essentially no respondents at level 2. Therefore, the majority of respondents lie below level 1 of the 3D construct map on the pretest, with some respondents located around the average threshold value for level 1. On the posttest, the distribution maximum is located at 1.59 on the logit scale. In other words, the small peak at 1.59 on the pretest grows and becomes the maximum of the distribution on the posttest. Overall, about 53% of respondents are located in level 1 of the 3D construct map on the posttest, about 21% fall in level 0, and about 26% fall in level 2. Therefore, there is clear movement of respondents along the levels of the 3D construct map upon completion of Unit 2.
Assigning a 3D Construct Map Level to Individual Students
This section shows how the 3D construct map for chemical bonding can be used to accurately place a student on a level, therefore allowing the validated 3D construct map and the associated assessment to be used as a diagnostic tool in the classroom. To assign a 3D construct map level to each individual student, it is important to take into consideration the measurement error associated with the estimation of each proficiency level. This is especially important for students whose proficiency estimates lie close to the cut points for the 3D construct map levels, or who provide answers consistent with an in-between level assignment, as was observed in the oral interviews. To do this, confidence intervals (CIs) for all proficiency estimates were calculated using one standard error in each direction (see Appendix for the R code). The Wright Maps were further modified by arranging student proficiencies in ascending order, excluding students who had all zeroes on the pre- and/or posttest.[10] The modified Wright Maps for the pre- and posttest are shown in Figures 4 and 5, respectively. The curved black line shows the proficiencies, and the grey band represents the upper and lower interval bounds. The horizontal dashed lines represent the cutoffs for the 3D construct map levels, and the vertical lines show the areas where the confidence intervals overlap the cut points.
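The decision rule applied in the next paragraphs can be sketched in a few lines of R. This is an illustration of the logic, not the Appendix code itself; it assumes theta and se are a proficiency estimate and its standard error as returned by fscores(..., full.scores.SE = TRUE), and it uses the cutoffs estimated above.

# Sketch: assign a 3D construct map level from a +/-1 SE confidence
# interval around a proficiency estimate. Bands that lie entirely within
# one region get a definite level; bands that straddle a cutoff get an
# in-between assignment.
assign_level <- function(theta, se, cut01 = 1.31, cut12 = 1.88) {
  lo <- theta - se  # lower CI bound
  hi <- theta + se  # upper CI bound
  if (hi < cut01) return("level 0")
  if (lo > cut12) return("level 2")
  if (lo > cut01 && hi < cut12) return("level 1")
  if (lo <= cut01 && hi >= cut01) return("level 0/1")
  "level 1/2"
}

assign_level(1.55, 0.10)  # band entirely within level 1 -> "level 1"
assign_level(1.30, 0.20)  # band straddles the 1.31 cutoff -> "level 0/1"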
If a confidence interval falls entirely into one of the 3D construct map regions (for example, the first 777 students on the pretest (Figure 4) and the first 613 students on the posttest (Figure 5)), the student is likely to provide answers consistent with level 0 of the 3D construct map and therefore should be assigned level 0 with a high degree of confidence. Similarly, students 857-896 on the pretest and students 670-790 on the posttest have confidence intervals that fall entirely into level 1 of the 3D construct map, so these students can be assigned level 1. Finally, the confidence intervals for students 838-899 on the posttest fall entirely into level 2, and those students can be assigned level 2 on the 3D construct map. However, sometimes the confidence interval overlaps a cut point between 3D construct map levels. For example, students 778-856 on the pretest and students 614-669 on the posttest have confidence intervals that overlap the level 0-1 cutoff, indicating that they are likely to provide answers consistent with an in-between level assignment. In this case, there is less certainty about the 3D construct map level assignment for these students. Similarly, students 896-899 on the pretest and 791-837 on the posttest have confidence intervals overlapping the level 1-2 threshold, indicating that there is less certainty in placing these students in level 2 of the 3D construct map.
[10] The X axis of the Wright Maps shown in Figures 4 and 5 was truncated to exclude students who had zeroes on the pre and post assessment and to better highlight the graph.

Figure 3.4 Modified Wright map for the pre-Unit 2 test showing student proficiency estimates and standard error bands from lowest to highest (level 0 and level 1 regions with the level 0-1 and 1-2 cutoffs marked)

Figure 3.5 Modified Wright map for the post-Unit 2 test showing student proficiency estimates and standard error bands from lowest to highest (level 0, 1 and 2 regions with the level 0-1 and 1-2 cutoffs marked)

Overall, only 81 students on the pretest and 101 students on the posttest fall in between levels of the 3D construct map, which corresponds to 9% and 11%, respectively. This indicates a high degree of certainty in assigning a 3D construct map level to individual students for the majority of the sample. To be exact, since one standard error was used to calculate the confidence intervals, this corresponds to 68% certainty in assigning a 3D construct map level to an individual student. This provides evidence for the validity of the 3D construct map as a diagnostic tool that allows placing a student on a level with a high degree of accuracy and using the information about what student understanding looks like in terms of the three dimensions (DCI, SEP, CCC) at each level to characterize their science proficiency. To the author's knowledge, this is the first validated 3D construct map that provides this degree of level-assignment certainty and, therefore, applicability in terms of immediate pedagogical use.
Discussion
This work presents a 3D construct map for chemical bonding developed following previous research and principles expressed in the Framework that suggest teaching the concept of chemical bonding from the perspective of the balance of electric forces and energy minimization (NRC, 2012; Taber, 1998; Cooper et al., 2014). The 3D construct map presented here is aligned to NGSS PEs and validated in the context of the NGSS-aligned "Interactions" curriculum.
The curriculum aims to build student understanding of chemical bonding as an extension of the same principles of electrostatic attraction that drive interactions between macroscopic charged objects and the formation of intermolecular interactions that increase the stability of the system through energy minimization. In that regard, the curriculum aims to help students build an integrated understanding of energy and electrical interactions at the macro and atomic-molecular scales. While this approach to teaching chemical bonding has been gaining popularity at the undergraduate level (Cooper & Klymkowsky, 2013), to the author's knowledge this is the first study that shows the development of student understanding of chemical bonding following this approach at the secondary level in the context of an NGSS classroom. In that regard, this study provides valuable takeaways regarding students' 3D understanding of chemical bonding under this instructional approach, which are discussed below.
The first takeaway is that students need careful scaffolding to learn to integrate ideas of energy and electric force to explain chemical bonding. As can be seen in the "Atoms Forming a Bond" interview item, the highest-level responses for the most part did not contain ideas of energy unless prompted, and students felt that they had provided a fully causal account of the formation of a chemical bond by describing the mechanism of balancing attractive and repulsive interactions. Students did not seem to consider energy an important driving factor in the formation of chemical bonds. This is consistent with previous research showing that students struggle to connect ideas related to atomic structure and electrical interactions at the atomic-molecular scale to the associated energy changes (Becker & Cooper, 2014). This difficulty might be due to the fact that ideas of energy are abstract, and students often don't have direct experience observing energy changes associated with electrical interactions, especially at the atomic-molecular level, and therefore cannot make use of ideas of energy productively (Cooper, Klymkowsky, & Becker, 2014). This finding is also evident in the results of student interviews using the "Match on the Hot Plate" item. Specifically, student models and explanations do not show a clear relationship between energy and the bond breaking process in the match at either level 1 or level 2 of the 3D construct map. Additionally, answers at levels 0-2 tend to interchange ideas of heat and energy when describing bond breaking and bond making processes. For example, at level 2 answers tend to recognize that heat is involved in separating molecules into individual atoms, but the details of the bond breaking and bond making processes and the associated energy changes are still missing. This might be due to the fact that students often confuse ideas of heat and energy, presenting heat as a form of energy rather than a manifestation of energy transfer (Jewett, 2008). The "Interactions" curriculum provides a detailed description of these important, subtle differences for teachers as part of the teacher materials. In short, it is clear that students struggle to incorporate the idea of energy when explaining chemical bonding, and this appears to be the key idea required to achieve the type of 3D understanding consistent with the highest level (level 3) of the 3D construct map.
For researchers and educators, the issue becomes: how do we organize instruction to help students understand the importance of using the energy perspective along with electric force, and how do we help students distinguish between ideas of heat and energy? Or is it the case that students develop the ability to integrate these ideas later in the curriculum? Additionally, how do we support teachers in emphasizing these ideas in their classrooms, as opposed to perpetuating traditional approaches to teaching chemical bonding that do not emphasize the importance of energy minimization and the balance of electrical forces? These are all important questions for future research.
The second takeaway is that the idea of the balance of electric forces between the components of interacting atoms seems central for developing a useable conceptual model for explaining why a chemical bond forms. Even when students reason about the mechanism of chemical bond formation in terms of attractive and repulsive interactions between components of atoms, they seem to struggle with the idea of the balance of attractive and repulsive electric forces, and instead explain chemical bonding in terms of the magnitude of the force exerted by components of the atoms upon each other (for example, a student answer suggesting that a chemical bond forms because the attractive forces between nuclei and electrons are stronger than the repulsive forces between electrons). This finding is consistent with previous research showing that students tend to believe that the forces exerted on the electrons by the nucleus are larger than the forces exerted on the nucleus by the electrons (Taber, 1998). In short, it is clear that students struggle to incorporate the idea of the balance of attractive and repulsive interactions when explaining chemical bonding, and this appears to be the key idea required to achieve the type of 3D understanding consistent with level 2 of the 3D construct map.
The third takeaway has to do with suggesting possible ways in which students build useable 3D understanding of chemical bonding. Specifically, it is interesting to see that level 1 reflects the most diverse range of student response types for the "Atoms Forming a Bond" item. For this item, student reasoning can range from applying prior knowledge learned in Unit 1 related to electrical interactions between point charges (see sample response #1) to recalling information from the classroom simulation related to giving excess energy to the third atom in order to form a stable bond (sample response #2). Additionally, some answers contain a combination of prior knowledge and new information, reflected in the use of ideas related to interactions between point charges together with the involvement of the third atom learned in the simulation, but probably misunderstood in light of the common misconception that energy is needed to form bonds (sample response #3). These all seem like very different ideas that students are trying to apply when explaining a very abstract, unobservable process of chemical bond formation. At level 1, student answers do not seem to reflect any relatively permanent mental model used to explain chemical bonding. Rather, it seems that students draw on various ideas they have learned, which they still struggle to connect together, to construct a possible explanation. This makes a lot of sense because these are very challenging ideas, and it takes time to put them together.
At the same time, as students progress towards higher levels of understanding, the types of answers provided seem to reflect a more well-established model used to explain chemical bonding. Specifically, at level 2, all answers reflect student ability to model chemical bonding in terms of the balance between attractive and repulsive interactions between atoms. Therefore, the progression of student understanding along the levels of the 3D construct map seems to reflect a process during which students learn to combine various, often disconnected ideas to explain relevant phenomena. In that regard, this finding is consistent with the notion expressed in the Framework that experts incorporate new knowledge into already established frameworks of understanding, while novices tend to have knowledge that is largely unstructured and disconnected (NRC, 2012). An interesting finding of this study is that moving towards expert-like understanding of DCIs (content), reflected in well-established mental models used to explain phenomena, requires learning to make connections between various ideas so that these previously disconnected ideas can be used productively to make sense of phenomena. In this regard, the fourth takeaway of this study is that transitioning to a higher level of understanding can be characterized by learning to make connections between various ideas to form a more permanent mental model. This conclusion is also supported by evidence from the in-between level responses for the "Atoms Forming a Bond" item, which show the emergence of one general conceptual model in which various ideas are connected and used to explain the phenomena of bond making and bond breaking, while ideas occasionally remain that are not connected to the rest of the framework in a meaningful way (for example, one transitional-level response for the "Atoms Forming a Bond" item states that attractive interactions are stronger than repulsive ones, which is inaccurate, while another states that atoms interact through a field when they form a bond, which is correct but still does not provide a causal account of why atoms form a bond).
Further, the student interview data suggest that students hold a wider range of disconnected ideas for the "Atoms Forming a Bond" phenomenon than for the "Match on the Hot Plate" phenomenon, which is evident in the larger variety of answers provided at level 1 of the 3D construct map for the "Atoms Forming a Bond" item. This might be because "Atoms Forming a Bond" is a very abstract, unobservable phenomenon, as opposed to the fairly familiar, observable "Match on the Hot Plate" phenomenon. Specifically, the "Atoms Forming a Bond" phenomenon requires students to directly apply knowledge of atomic models, which is not knowledge constructed directly through experience, but rather knowledge communicated through previously developed models learned in class. Therefore, the "Atoms Forming a Bond" item elicits prior stored knowledge to a larger extent than the "Match on the Hot Plate" item, for which students construct their explanation primarily from direct observations. This difference in the types of answers provided based on context suggests that at level 1 students do in fact hold various disconnected ideas rather than a well-established mental model. As students move to level 2, the answers are no longer as diverse, indicating the formation of a well-established mental model that students can apply across the two item contexts.
Finally, the last takeaway of this study has to do with the developmental nature of student understanding and the idea that deep, integrated understanding of science takes time and appropriate scaffolding to develop (Smith et al., 2006). Evidence of this is seen in the student interviews, where more higher-level responses are observed by the end of Unit 2, with student answers falling on a spectrum from less to more sophisticated understanding. A similar pattern holds for the analysis of student written responses using IRT (compare Figures 4 and 5 in terms of the number of students who progressed to levels 1 and 2 of the 3D construct map by the end of Unit 2). Further, the fact that none of the students reached level 3 of the 3D construct map by the end of Unit 2 indicates that it takes a long time before students develop the ability to integrate ideas of electric force and energy to explain chemical bonding at the atomic-molecular level. At the same time, the 3D construct map for chemical bonding presented in this work can be used to accurately place students on a level and provides a rich description of what student 3D understanding of chemical bonding looks like at each level of the 3D construct map.
Limitations
This work includes several limitations. First, since Unit 1 of the "Interactions" curriculum was specifically focused on building student understanding of electrical interactions at the macro and atomic-molecular scales, the prior knowledge that students came with, as well as the learning trajectory for Unit 2, was partly determined by what was learned in Unit 1. Therefore, it is possible that some of the student responses observed in the validation process of this construct map might not be observed in a context different from the "Interactions" curriculum. However, the general progression of student ability to integrate ideas of force and energy in the context of chemical bonding should still hold regardless of the curriculum context, because the 3D construct map was built using relevant research literature and NGSS PEs that are not curriculum specific. Second, this work focused on specific aspects of the relevant DCIs, one CCC, and two SEPs. It is possible that the progression of student understanding might be different if a different set of CCCs and SEPs were chosen, or if the relevant DCI aspects were chosen differently. Third, the large number of students with missing data substituted by zeros indicates that the test was overall very difficult for the majority of students. For future work, it will be beneficial to include items that probe the lower levels of the 3D construct map, which will provide a better understanding of how students at the lower end of the ability spectrum make sense of ideas related to chemical bonding from the perspective of energy and force.
APPENDIX
Testing Competing Item Response Theory (IRT) Models
The items on Unit 2 present four ordinal response categories, where each category corresponds to a level of the 3D construct map. Specifically, the 0-, 1-, 2-, and 3-point response categories on each item correspond to 3D construct map levels 0, 1, 2, and 3, as can be seen from the examples of scoring rubrics. Common IRT models for polytomous items are the Graded Response Model (GRM; Samejima, 1969) and the Generalized Partial Credit Model (GPCM; Muraki, 1992). To choose an appropriate IRT model to represent the data in this study, model fits for GRM and GPCM were compared.
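As a rough sketch of how such a comparison runs in the mirt package: the actual model specification used in this study, with pre/post constraints, appears in the R code later in this Appendix, and the simulated data below are illustrative only.

library(mirt)
set.seed(1)
# Simulate ordinal 0-3 responses for 8 items (illustration only; not study data).
resp <- simdata(a = matrix(rep(1.2, 8), ncol = 1),
                d = matrix(c(2, 0, -2), nrow = 8, ncol = 3, byrow = TRUE),
                N = 500, itemtype = 'graded')

grm  <- mirt(resp, 1, itemtype = "graded", verbose = FALSE)  # GRM fit
gpcm <- mirt(resp, 1, itemtype = "gpcm",   verbose = FALSE)  # GPCM fit

anova(grm, gpcm)  # side-by-side log-likelihood, AIC and BIC
M2(grm)           # limited-information fit: M2, RMSEA, CFI/TLI
M2(gpcm)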
To ensure a more accurate representation of the data, and to be able to compare student learning on the pre- and post-assessments, the pre- and post-assessment data were combined to specify the IRT model estimated under GRM and GPCM. Slopes and corresponding intercepts were constrained to be equal on the pre- and post-assessment for each item. This rigid model specification was safe to assume because the dimensionality and longitudinal invariance of the Unit 2 assessment instrument were extensively studied a priori (Chapter 1). The results of that study showed that the Unit 2 assessment scale is two-dimensional and that partial measurement invariance holds over time for the pre- and post-assessments. For the IRT modeling, however, two-dimensional GRM and GPCM models could not be estimated due to the limited number of indicators (items). As a result, in this study the Unit 2 assessment data were modeled using one-dimensional IRT. This was deemed appropriate because the two latent dimensions are highly correlated (0.784 on the pretest; 0.928 on the posttest) and aim to measure the same scientific idea: student understanding of chemical bonding from the perspective of energy and force. The R code is provided later in this Appendix. The results of the IRT model estimation are shown in Table 7.

Table 3.7 Model comparison for GPCM and GRM

Model | LL | # par | AIC | BIC | M2 | df | p value | RMSEA | CFI/TLI
GPCM | -4240 | 51 | 8535 | 8665 | 577 | 109 | <0.001 | 0.0692 | 0.973/0.972
GRM | -4224 | 51 | 8503 | 8633 | 527 | 109 | <0.001 | 0.0654 | 0.976/0.975

A larger log-likelihood value, as well as smaller AIC and BIC values, suggests a better-fitting model (Nering & Ostini, 2011; Toland, 2014). Based on these indexes, GRM is a slightly better-fitting model for this data sample. Further, the M2 goodness-of-fit statistic was used to evaluate overall model fit (Maydeu-Olivares & Joe, 2005). Smaller M2 values also indicate better model fit (Toland, 2014), and, following this guideline, GRM again presents a better-fitting model for the data than GPCM. The p-values for both GPCM and GRM indicate lack of fit. However, lack of fit on the M2 statistic is common when fitting parametric models like GPCM and GRM to real data (Cai, Maydeu-Olivares, Coffman & Thissen, 2006; Toland, 2014). Therefore, additional model fit indexes were used, including RMSEA and CFI/TLI. The cut-off criteria for good and reasonable model fit were <0.06 and <0.08, respectively, for RMSEA, and >0.95 and >0.90, respectively, for CFI/TLI (Hu & Bentler, 1999; Marsh, Hau & Wen, 2004; Van Dam, Earleywine & Borders, 2010). Based on the RMSEA and CFI/TLI values presented in Table 7, GRM and GPCM have similar model fit: RMSEA values for both models are marginally good, and the CFI/TLI indexes represent good model fit. Therefore, based on all of this information, GRM appears to be the more suitable model for the data, and it is used below to evaluate model assumptions and obtain item parameters.
Evaluating GRM Model Assumptions
IRT model assumptions were further evaluated for GRM following Toland (2014). As mentioned above, dimensionality and partial measurement invariance were established for the measurement instrument in the previous study (Chapter 1). The assumption of local independence is tested below. Local independence (LI) assumes that student responses on the test are influenced only by their level on the latent trait continuum of interest.
The LI assumption is very important for IRT analysis because, if violated, item parameters become distorted, including inflated slopes and more homogeneous thresholds across items (Toland, 2014). In the context of NGSS, the assumption of local independence becomes increasingly harder to meet because 3D assessments call for more contextualized, story-based items in which students can use all the information available to them to demonstrate knowledge-application ability (Gorin & Mislevy, 2013). These items often take the form of testlets, as is the case for the Unit 2 assessment instrument here, which makes it especially difficult to meet the assumption of local independence because items within a testlet share more commonalities than items across testlets. This might lead to increased dimensionality and violation of the LI assumption (Gorin & Mislevy, 2013). To evaluate the LI assumption in this study, the Q3 index was used with a cut-off value of |0.2| (Kim, De Ayala, Ferdous & Nering, 2011). This index and cut-off value have an acceptable Type I error rate and are substantially more powerful than the commonly used X2 and G2 local dependence indexes (Chen & Thissen, 1997). Further, it is also recommended that the 0.2 cut-off value be used in a relative way, to determine what counts as a "large" correlation relative to the other residual correlations in the model (Dr. Chalmers, personal communication). Following these guidelines, the Q3 statistic was used to evaluate the local independence assumption. The Q3 matrix is shown in Figure 6; only values above 0.2 are shown. Most residual correlations were below the cut-off value of 0.2 in absolute value, and there were no residual correlations that were unusually high relative to the others. Specifically, the highest correlation value was -0.381, between items 2 and 3 on the pretest. A slightly high residual correlation is not surprising for these items because they belong to the same testlet. However, this correlation is not unreasonably high compared to the other values, and most of the correlations are below the cut-off value of 0.2. Therefore, there is enough evidence to conclude that the assumption of local independence is met.

Figure 3.6 Q3 matrix

Model-Data Fit
Once the IRT model is chosen and the model assumptions are evaluated, it is appropriate to evaluate how well the GRM fits the data and to obtain the item parameters that will be used in validating the levels of the 3D construct map.
Item-level fit. To assess how well the GRM fits each item, the S-X2 item fit statistic for polytomous data was examined (Orlando & Thissen, 2000, 2003). A statistically significant p-value indicates that the model does not fit a given item. Item fit was evaluated using a 1% significance level together with RMSEA values, because evaluating item fit with the S-X2 statistic involves testing multiple hypotheses, and larger samples lead to a greater likelihood of statistically significant results (Stone & Zhang, 2003; Toland, 2014). The S-X2 item fit statistics are shown in Table 8 below. Items 1, 3, 6 and 7 of the pretest and item 5 on the posttest have p-values <0.01, indicating poor model fit for these items. Since larger samples increase the likelihood of statistically significant results, the RMSEA values for these items were also examined. As can be seen from Table 8, all RMSEA values are below 0.06, indicating good model fit. Therefore, the GRM fits each item reasonably well.
Table 3.8 S-X2 item fit statistics

Item | S-X2 | df | RMSEA | p
Q1T1 | 22.788 | 9 | 0.041 | 0.007
Q2T1 | 12.965 | 8 | 0.026 | 0.113
Q3T1 | 26.294 | 10 | 0.043 | 0.003
Q4T1 | 19.952 | 8 | 0.041 | 0.011
Q5T1 | 22.941 | 11 | 0.035 | 0.018
Q6T1 | 29.459 | 11 | 0.043 | 0.002
Q7T1 | 30.422 | 10 | 0.048 | 0.001
Q8T1 | 27.090 | 18 | 0.024 | 0.077
Q1T2 | 20.23 | 18 | 0.012 | 0.320
Q2T2 | 28.509 | 17 | 0.027 | 0.039
Q3T2 | 22.788 | 21 | 0.010 | 0.355
Q4T2 | 19.126 | 18 | 0.008 | 0.384
Q5T2 | 35.283 | 18 | 0.033 | 0.009
Q6T2 | 24.136 | 17 | 0.022 | 0.116
Q7T2 | 37.617 | 21 | 0.030 | 0.014
Q8T2 | 40.457 | 23 | 0.029 | 0.014

Person-level fit. To evaluate the consistency of student reasoning across the different contexts represented in the items, the person fit statistic (Zh) was examined (Drasgow, Levine & Williams, 1985). The Zh distribution across the pre- and post-assessment events for all students is shown in Figure 7 below.

Figure 3.7 Person fit Zh statistics

A value of -1.96 is used as the cut-off for the Zh statistic; students with Zh statistics above -1.96 show regular response patterns (Drasgow et al., 1985; Felt, Castaneda, Tiemensma & Depaoli, 2017). Figure 7 shows that the majority of students are above the cut-off value of -1.96 (dashed line), suggesting that the majority of the sample demonstrates responses consistent with those hypothesized by the 3D construct map levels. This provides evidence towards the validity of the hypothetical 3D construct map levels (Doherty, Draney, Shin, Kim & Anderson, 2015).

R Studio Code

library(mirt)      # For fitting IRT models
library(foreign)   # For importing the SPSS data file
library(WrightMap) # For Wright maps
library(ggplot2)   # For histograms

Model Fit Evaluation: Unit 2 pre/post test

# Items 1-8 represent Unit 2 pretest items; items 9-16 represent Unit 2 posttest items.
# Pre and post test items are identical.

Model Statement

FAmodelU2_1D <- mirt.model('F1 = 1, 2, 3, 4, 5, 6, 7, 8
F2 = 9, 10, 11, 12, 13, 14, 15, 16
CONSTRAIN = (1,9, a1, a2), (2,10, a1, a2), (3,11, a1, a2), (4,12, a1, a2), (5,13, a1, a2), (6,14, a1, a2), (7,15, a1, a2), (8,16, a1, a2), (1,9, d1), (2,10, d1), (3,11, d1), (4,12, d1), (5,13, d1), (6,14, d1), (7,15, d1), (8,16, d1), (1,9, d2), (2,10, d2), (3,11, d2), (4,12, d2), (5,13, d2), (6,14, d2), (7,15, d2), (8,16, d2)
MEAN = F1, F2
COV = F1*F2')

Model Estimation

pre.items <- c("U2Q1T1","U2Q2T1","U2Q3T1","U2Q4T1","U2Q5T1","U2Q6T1","U2Q7T1","U2Q8T1")
post.items <- c("U2Q1T2","U2Q2T2","U2Q3T2","U2Q4T2","U2Q5T2","U2Q6T2","U2Q7T2","U2Q8T2")
all.items <- c(pre.items, post.items)

# GRM model, 1D
modgrmU2_EM_1D <- mirt(U2_all[all.items], FAmodelU2_1D, itemtype="graded", verbose=FALSE, SE=TRUE)
M2(modgrmU2_EM_1D, impute=20, CI=.95)

# GPCM model, 1D
modgpcmU2_EM_1D <- mirt(U2_all[all.items], FAmodelU2_1D, itemtype="gpcm", verbose=FALSE, SE=TRUE)
M2(modgpcmU2_EM_1D, impute=20, CI=.95)

Item analysis with chosen model (GRM)

Model diagnostics

# Residual diagnostics
residuals(modgrmU2_EM_1D, type="Q3", suppress=.2) # Evaluate local independence (LI); only shows pairs with |Q3|>0.2 (possible LI issues)

# Item fit diagnostics
print(item.fit <- itemfit(modgrmU2_EM_1D, fit_stats="S_X2")) # Evaluate item fit (cutoff: p<0.01)

# Person fit diagnostics
person.fit <- personfit(modgrmU2_EM_1D, method="ML") # Evaluate person fit (Zh statistic)
ggplot(person.fit, aes(x=Zh)) +
  geom_histogram(bins=15, colour="black", fill="white") +
  geom_vline(xintercept=-1.96, col="black", linetype="dashed") +
  labs(x="Zh statistic", y="Count") +
  theme_bw(base_size=12) +
  theme_classic() # Histogram of Zh statistics (values above -1.96 indicate good person fit)

Wald Test for significance of the mean

# Is mean 2 equal to mean 1? They are not equal if p < 0.05
(infonames <- wald(modgrmU2_EM_1D))
# Choose the columns to be used in the Wald test
L <- matrix(0, 1, 27)
L[26] <- 1
L[25] <- -1
wald(modgrmU2_EM_1D, L)

Item parameters and thresholds

item.par <- data.frame(coef(modgrmU2_EM_1D, simplify=TRUE)$items) # Item parameters
item.par$T1 <- with(item.par, ifelse(a1>0, -d1/a1, -d1/a2)) # a1 = discrimination; difficulty = (-d/a)
item.par <- item.par[1:8,] # Keep the first 8 rows; the remaining time-two items have parameters equal to time one
item.par$T2 <- with(item.par, ifelse(a1>0, -d2/a1, -d2/a2))
mean.T1 <- mean(item.par$T1) # Mean threshold 1
mean.T2 <- mean(item.par$T2) # Mean threshold 2
t0_1 <- min(item.par$T1) # Cutoff for level 0-1
t1_2 <- median(c(item.par$T1, item.par$T2)) # Cutoff for level 1-2
t2_3 <- max(item.par$T2) # Cutoff for level 2-3

Ability Wright Maps

# Compute factor scores (Y axis for the Wright Map)
AbilityU2Pre_Post <- data.frame(fscores(modgrmU2_EM_1D))
# Add ability scores to the data file
fulldata_U2 <- data.frame(cbind(U2_all, AbilityU2Pre_Post))
# Merge students who have complete data with the full data file to create the reduced sample file
reducedata_U2 <- merge(U2_STUID_allstudentscompletedata, fulldata_U2, by.x="StudentID", by.y="STUID")

# Complete sample data: plot the full-sample Wright Map
wrightMap(with(fulldata_U2, cbind(F1,F2)), matrix(c(item.par$T1, item.par$T2), ncol=2),
          person.side=personDens, cutpoints=c(t0_1, t1_2, mean.T1, mean.T2), min.l=-.9, max.l=2.7)

# Reduced sample data: plot the reduced-sample Wright Map
wrightMap(with(reducedata_U2, cbind(F1,F2)), matrix(c(item.par$T1, item.par$T2), ncol=2),
          person.side=personDens, cutpoints=c(t0_1, t1_2, mean.T1, mean.T2), min.l=-.9, max.l=2.7)

Finding peaks on the reduced sample Wright Map and % of examinees in each level of the 3D LP

# Functions to calculate percentiles for given cut-offs
pct_pre <- ecdf(reducedata_U2$F1)  # Percentile function for the pretest
pct_post <- ecdf(reducedata_U2$F2) # Percentile function for the posttest

# Thresholds for levels 0-1 and 1-2 of the 3D LP
t0_1    # Lowest difficulty 1
t1_2    # Median of difficulty 1 and difficulty 2
mean.T1 # Average difficulty 1
mean.T2 # Average difficulty 2

# Percentage of examinees between thresholds
# Pretest
pct_pre(t0_1) # % of probability density below the level 1 cutoff
pct_pre(t1_2) # % of probability density below the level 2 cutoff
pct_pre(t1_2) - pct_pre(t0_1) # % between the level 1 and level 2 cutoffs on the pretest
pct_pre(mean.T1) # % below average difficulty 1
pct_pre(t1_2) - pct_pre(mean.T1) # % between average difficulty 1 and the level 2 cutoff
(pct_pre(t1_2) - pct_pre(t0_1)) - (pct_pre(t1_2) - pct_pre(mean.T1)) # % between the level 1 cutoff and average difficulty 1
pct_pre(mean.T2) # % below average difficulty 2

# Posttest
pct_post(t0_1) # % of probability density below the level 1 cutoff
pct_post(t1_2) # % of probability density below the level 2 cutoff
pct_post(t1_2) - pct_post(t0_1) # % between the level 1 and level 2 cutoffs on the posttest
pct_post(mean.T1) # % below average difficulty 1
pct_post(t1_2) - pct_post(mean.T1) # % between average difficulty 1 and the level 2 cutoff
(pct_post(t1_2) - pct_post(t0_1)) - (pct_post(t1_2) - pct_post(mean.T1)) # % between the level 1 cutoff and average difficulty 1
pct_post(mean.T2) # % below average difficulty 2
pct_post(mean.T2) - pct_post(t1_2) # % between the level 2 cutoff and average difficulty 2

# Determine density peak values
# Peak values for the pretest
print(pre_peak1 <- density(reducedata_U2$F1[which(reducedata_U2$F1 > 1.5)])$x[which.max(density(reducedata_U2$F1[which(reducedata_U2$F1 > 1.5)])$y)]) # Peak for pretest values above 1.5
print(pre_peak2 <- density(reducedata_U2$F1)$x[which.max(density(reducedata_U2$F1)$y)]) # Pretest peak (larger peak)
print(pre_peak3 <- density(reducedata_U2$F1[which(reducedata_U2$F1 < 0)])$x[which.max(density(reducedata_U2$F1[which(reducedata_U2$F1 < 0)])$y)]) # Smaller peak, pretest values below 0

# Peak values for the posttest
print(post_peak1 <- density(reducedata_U2$F2[which(reducedata_U2$F2 > 1.8)])$x[which.max(density(reducedata_U2$F2[which(reducedata_U2$F2 > 1.8)])$y)]) # Third peak, posttest scores in the level 2 3D LP region
print(post_peak2 <- density(reducedata_U2$F2)$x[which.max(density(reducedata_U2$F2)$y)]) # Posttest peak
print(post_peak3 <- density(reducedata_U2$F2[which(reducedata_U2$F2 < 0)])$x[which.max(density(reducedata_U2$F2[which(reducedata_U2$F2 < 0)])$y)]) # Second peak, posttest values below 0
Percentage of examinees between peak values

# Pretest
pct_pre(pre_peak1)
pct_pre(pre_peak2)
pct_pre(pre_peak1) - pct_pre(pre_peak2)

# Posttest
pct_post(post_peak1)
pct_post(post_peak2)
pct_post(post_peak1) - pct_post(post_peak2)

Ascending Ability Wright Maps

# Create factor scores with standard errors (UB = upper bound, LB = lower bound)
fulldata_with_SE <- cbind(U2_all, data.frame(fscores(modgrmU2_EM_1D, full.scores.SE=TRUE)))
fulldata_with_SE$UBF1 <- fulldata_with_SE$F1 + fulldata_with_SE$SE_F1
fulldata_with_SE$LBF1 <- fulldata_with_SE$F1 - fulldata_with_SE$SE_F1
fulldata_with_SE$LBF2 <- fulldata_with_SE$F2 - fulldata_with_SE$SE_F2
fulldata_with_SE$UBF2 <- fulldata_with_SE$F2 + fulldata_with_SE$SE_F2

# Create variables counting how many students have a CI overlapping each LP level cutoff (pretest)
fulldata_with_SE$LP0_1_F1 <- ifelse(fulldata_with_SE$LBF1 <= t0_1 & fulldata_with_SE$UBF1 >= t0_1, 1, 0)
fulldata_with_SE$LP1_2_F1 <- ifelse(fulldata_with_SE$LBF1 <= t1_2 & fulldata_with_SE$UBF1 >= t1_2, 1, 0)

# Create variables counting how many students have a CI overlapping each LP level cutoff (posttest)
fulldata_with_SE$LP0_1_F2 <- ifelse(fulldata_with_SE$LBF2 <= t0_1 & fulldata_with_SE$UBF2 >= t0_1, 1, 0)
fulldata_with_SE$LP1_2_F2 <- ifelse(fulldata_with_SE$LBF2 <= t1_2 & fulldata_with_SE$UBF2 >= t1_2, 1, 0)

# Find the smallest lower-bound score of F1 (pretest)
LB_LP0_1_pre_stu <- max(fulldata_with_SE$LBF1[which(fulldata_with_SE$LP0_1_F1==1)])
print(LB_L0_pre <- which(sort(fulldata_with_SE$LBF1)==LB_LP0_1_pre_stu)) # Students below level 1 of the LP
LB_LP1_2_pre_stu <- max(fulldata_with_SE$LBF1[which(fulldata_with_SE$LP1_2_F1==1)])
print(LB_L1_pre <- which(sort(fulldata_with_SE$LBF1)==LB_LP1_2_pre_stu)) # Students below level 2 of the LP

# Find the smallest lower-bound score of F2 (posttest)
LB_LP0_1_post_stu <- max(fulldata_with_SE$LBF2[which(fulldata_with_SE$LP0_1_F2==1)])
print(LB_L0_post <- which(sort(fulldata_with_SE$LBF2)==LB_LP0_1_post_stu)) # Students below level 1 of the LP
LB_LP1_2_post_stu <- max(fulldata_with_SE$LBF2[which(fulldata_with_SE$LP1_2_F2==1)])
print(LB_L1_post <- max(which(sort(fulldata_with_SE$LBF2)==LB_LP1_2_post_stu))) # Students below level 2 of the LP

# Find the highest upper-bound score of F1 (pretest)
UB_LP0_1_pre_stu <- min(fulldata_with_SE$UBF1[which(fulldata_with_SE$LP0_1_F1==1)])
print(UB_L0_pre <- which(sort(fulldata_with_SE$UBF1)==UB_LP0_1_pre_stu)) # Students below level 1 of the LP
UB_LP1_2_pre_stu <- min(fulldata_with_SE$UBF1[which(fulldata_with_SE$LP1_2_F1==1)])
print(UB_L1_pre <- which(sort(fulldata_with_SE$UBF1)==UB_LP1_2_pre_stu)) # Students below level 2 of the LP

# Find the highest upper-bound score of F2 (posttest)
UB_LP0_1_post_student <- min(fulldata_with_SE$UBF2[which(fulldata_with_SE$LP0_1_F2==1)])
print(UB_L0_post <- which(sort(fulldata_with_SE$UBF2)==UB_LP0_1_post_student)) # Students below level 1 of the LP
UB_LP1_2_post_student <- min(fulldata_with_SE$UBF2[which(fulldata_with_SE$LP1_2_F2==1)])
print(UB_L1_post <- which(sort(fulldata_with_SE$UBF2)==UB_LP1_2_post_student)) # Students below level 2 of the LP

# Number of people in the overlap region for each level on the pretest
LB_L0_pre - UB_L0_pre # 80 people between levels 0 and 1
LB_L1_pre - UB_L1_pre # 1 person between levels 1 and 2

# Number of people in the overlap region for each level on the posttest
LB_L0_post - UB_L0_post # 57 people between levels 0 and 1
LB_L1_post - UB_L1_post # 48 people between levels 1 and 2

# Sort data by ability score (pretest)
sort_pre <- fulldata_with_SE[order(fulldata_with_SE$F1),]
sort_pre <- data.frame(x=seq(nrow(sort_pre)), F1=sort_pre$F1, lwr=sort_pre$LBF1, upr=sort_pre$UBF1)

# Sort data by ability score (posttest)
sort_post <- fulldata_with_SE[order(fulldata_with_SE$F2),]
sort_post <- data.frame(x=seq(nrow(sort_post)), F2=sort_post$F2, lwr=sort_post$LBF2, upr=sort_post$UBF2)

# Ascending ability Wright map for the pretest
plot(sort_pre$F1, xlab="Persons", ylab="Ability", pch=16, ylim=c(-2, 3.1), xlim=c(558, 900), cex=0.5)
with(sort_pre, polygon(c(x, rev(x)), c(lwr, rev(upr)), col="grey75", border=FALSE))
matlines(sort_pre[,1], sort_pre[,-1], lwd=c(1,1), lty=1, col=c("black","black","black"), type=c("p","l","l"), cex=0.4, pch=16)
abline(h=c(t0_1, t1_2), lty=2, v=c(LB_L0_pre, LB_L1_pre, UB_L0_pre, UB_L1_pre))

# Ascending ability Wright map for the posttest
plot(sort_post$F2, xlab="Persons", ylab="Ability", pch=16, ylim=c(-2, 3.1), xlim=c(558, 900), cex=0.5)
with(sort_post, polygon(c(x, rev(x)), c(lwr, rev(upr)), col="grey75", border=FALSE))
matlines(sort_post[,1], sort_post[,-1], lwd=c(1,1), lty=1, col=c("black","black","black"), type=c("p","l","l"), cex=0.4, pch=16)
abline(h=c(t0_1, t1_2), lty=2, v=c(LB_L0_post, LB_L1_post, UB_L0_post, UB_L1_post))

BIBLIOGRAPHY

Alonzo, A. C., & Gotwals, A. W. (Eds.). (2012). Learning progressions in science: Current challenges and future directions. Springer Science & Business Media.
Barker, V., & Millar, R. (2000). Students' reasoning about basic chemical thermodynamics and chemical bonding: What changes occur during a context-based post-16 chemistry course? International Journal of Science Education, 22(11), 1171-1200.
Becker, N. M., & Cooper, M. M. (2014). College chemistry students' understanding of potential energy in the context of atomic–molecular interactions. Journal of Research in Science Teaching, 51(6), 789-808.
Becker, N., Noyes, K., & Cooper, M. (2016). Characterizing students' mechanistic reasoning about London dispersion forces. Journal of Chemical Education, 93(10), 1713-1724.
Berland, L. K., & McNeill, K. L. (2010). A learning progression for scientific argumentation: Understanding student work and designing supportive instructional contexts. Science Education, 94(5), 765-793.
Boo, H. K. (1998). Students' understandings of chemical bonds and the energetics of chemical reactions. Journal of Research in Science Teaching, 35(5), 569-581.
Brown, N. J., & Wilson, M. (2011). A model of cognition: The missing cornerstone of assessment. Educational Psychology Review, 23(2), 221.
Burrows, N. L., & Mooring, S. R. (2015). Using concept mapping to uncover students' knowledge structures of chemical bonding concepts. Chemistry Education Research and Practice, 16(1), 53-66.
Cai, L., Maydeu-Olivares, A., Coffman, D. L., & Thissen, D. (2006). Limited-information goodness-of-fit testing of item response theory models for sparse 2^p tables. British Journal of Mathematical and Statistical Psychology, 59(1), 173-194.
Chen, W. H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265-289.
Cooper, M., & Klymkowsky, M. (2013). Chemistry, life, the universe, and everything: A new approach to general chemistry, and a model for curriculum reform. Journal of Chemical Education, 90(9), 1116-1122.
Cooper, M. M., Klymkowsky, M. W., & Becker, N. M. (2014).
"Energy in chemical systems: An integrated approach." Teaching and learning of energy in K–12 education. Springer, Cham, 2014. 301-316. Doherty, J. H., Draney, K., Shin, H. J., Kim, J., & Anderson, C. W. (2015). Validation of a learning progression-based monitoring assessment. Manuscript submitted for publication. Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38(1), 67-86. Duschl, R.A., Schweingruber H.A., Shouse A. (Eds.). (2007). Taking science to school: Learning and teaching science in grades K-8. Washington, D.C.: National Academy Press. Felt, J. M., Castaneda, R., Tiemensma, J., & Depaoli, S. (2017). Using person fit statistics to detect outliers in survey research. Frontiers in psychology, 8, 863. Gorin, J. S., & Mislevy, R. J. (2013, September). Inherent measurement challenges in the next generation science standards for both formative and summative assessment. In Invitational research symposium on science assessment. Harris, C. J., Krajcik, J. S., Pellegrino, J. W., DeBarger, A. H. (2019). Designing Knowledge‐In‐ Use Assessments to Promote Deeper Learning. Educational Measurement: Issues and Practice. Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural equation modeling: a multidisciplinary journal, 6(1), 1-55. Jewett JW (2008). Energy and the confused student I: work. Phys Teach 46, 38–43. Kim, D., De Ayala, R. J., Ferdous, A. A., & Nering, M. L. (2011). The comparative performance of conditional independence indices. Applied Psychological Measurement, 35(6), 447- 471. Lee, H. S., & Liu, O. L. (2010). Assessing learning progression of energy concepts across middle school grades: The knowledge integration perspective. Science Education, 94(4), 665-688. Lehrer, R., Kim, M. J., Ayers, E., & Wilson, M. (2014). Toward establishing a learning progression to support the development of statistical reasoning. Learning over time: Learning trajectories in mathematics education, 31-60. Maydeu-Olivares, A., & Joe, H. (2005). Limited-and full-information estimation and goodness- of-fit testing in 2 n contingency tables: A unified framework. Journal of the American Statistical Association, 100(471), 1009-1020. 208 Mislevy, R. J. (1996). Test theory reconceived. Journal of Educational Measurement, 33(4), 379-416. Mislevy, R. J., & Haertel, G. D. (2006). Implications of evidence‐centered design for educational testing. Educational Measurement: Issues and Practice, 25(4), 6-20. Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. ETS Research Report Series, 1992(1), i-30. Nahum, T. L. (2007). Teaching the concept of chemical bonding in high-school: Developing and implementing a new framework based on the analysis of misleading systemic factors. Nahum, T. L., Mamlok‐Naaman, R., Hofstein, A., & Krajcik, J. (2007). Developing a new teaching approach for the chemical bonding concept aligned with current scientific and pedagogical knowledge. Science Education, 91(4), 579-603. National Research Council. (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. National Academies Press. National Research Council. (2013a). Education for life and work: Developing transferable knowledge and skills in the 21st century. National Academies Press. Nering, M. 
L., & Ostini, R. (Eds.). (2011). Handbook of polytomous item response theory models. Taylor & Francis. Neumann, K., Viering, T., Boone, W. J., & Fischer, H. E. (2013). Towards a learning progression of energy. Journal of research in science teaching, 50(2), 162-188. Orlando, M., & Thissen, D. (2003). Further investigation of the performance of S-X2: An item fit index for use with dichotomous item response theory models. Applied Psychological Measurement, 27(4), 289-298. Osborne, J. F., Henderson, J. B., MacPherson, A., Szu, E., Wild, A., & Yao, S. Y. (2016). The development and validation of a learning progression for argumentation in science. Journal of Research in Science Teaching, 53(6), 821-846. Pellegrino, J. W., & Hilton, M. L. (Eds.). (2012). Education for life and work: Developing transferable knowledge and skills in the 21st century. Washington, DC: The National Academies Press. Pellegrino, J. W., Chudowsky, N., & Glaser, R. (2001). Knowing what students know: The science and design of educational assessment. National Academy Press, 2102 Constitutions Avenue, NW, Lockbox 285, Washington, DC 20055. Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika monograph supplement. 209 Schwarz, C. V., Reiser, B. J., Davis, E. A., Kenyon, L., Achér, A., Fortus, D., ... & Krajcik, J. (2009). Developing a learning progression for scientific modeling: Making scientific modeling accessible and meaningful for learners. Journal of Research in Science Teaching: The Official Journal of the National Association for Research in Science Teaching, 46(6), 632-654. Shin, N., Stevens, S. Y., & Krajcik, J. (2010). Tracking student learning over time using construct-centred design. In Using Analytical Frameworks for Classroom Research (pp. 56-76). Routledge. Smith, C. L., Wiser, M., Anderson, C. W., & Krajcik, J. (2006). FOCUS ARTICLE: implications of research on children's learning for standards and assessment: a proposed learning progression for matter and the atomic-molecular theory. Measurement: Interdisciplinary Research & Perspective, 4(1-2), 1-98. Songer, N. B., Kelcey, B., & Gotwals, A. W. (2009). How and when does complex reasoning occur? Empirically driven development of a learning progression focused on complex reasoning about biodiversity. Journal of Research in Science Teaching: The Official Journal of the National Association for Research in Science Teaching, 46(6), 610-631 Standards, N. G. S. (2013). Next generation science standards: For states, by states. Stone, C. A., & Zhang, B. (2003). Assessing goodness of fit of item response theory models: A comparison of traditional and alternative procedures. Journal of Educational Measurement, 40(4), 331-352. Taber, K. S. (1998a). An alternative conceptual framework from chemistry education. International Journal of Science Education, 20(5), 597-608. Taber, K. S. (1998b). The sharing-out of nuclear attraction: or I can’t think about Physics in Chemistry, International Journal of Science Education, 20 (8), pp.1001-1014. Taber, K. S., & Coll, R. K. (2002). Bonding. In Chemical education: Towards research-based practice (pp. 213-234). Springer, Dordrecht. Toland, M. D. (2014). Practical guide to conducting an item response theory analysis. The Journal of Early Adolescence, 34(1), 120-151. Van Dam, N. T., Earleywine, M., & Borders, A. (2010). Measuring mindfulness? An item response theory analysis of the Mindful Attention Awareness Scale. Personality and Individual Differences, 49(7), 805-810. 
Wilson, M. (2009). Measuring progressions: Assessment structures underlying a learning progression. Journal of Research in Science Teaching, 46(6), 716-730.

CONCLUDING REMARKS

This study is a first example of developing and validating both large and small grain size NGSS-aligned learning progressions in practice. It provides valuable insights into the process of developing learning progressions aligned to specific NGSS performance expectations, including specific DCIs, SEPs and CCCs, and of developing assessment instruments capable of measuring 3D learning of complex NGSS constructs. Further, this study demonstrates the process of obtaining validity evidence for NGSS-aligned LPs and the feasibility of using validated LPs to describe student learning in the context of an NGSS classroom.
The major implication of this study for both future and practicing teachers relates to describing and tracking the development of 3D understanding in practice. Specifically, the current study shows that the three dimensions of NGSS work together when it comes to forming the basis of 3D understanding. The study described in Chapter 1 provides evidence for this assertion from the perspective of assessment development theory and psychometrics. In particular, Chapter 1 demonstrates that following a systematic, evidence-based assessment design process aimed at designing NGSS-aligned tasks that measure student ability to integrate the three dimensions (DCIs, SEPs, CCCs), as suggested by the Framework, results in assessments that demonstrate good psychometric properties and measure one underlying conceptual dimension of interest. The studies described in Chapters 2 and 3 demonstrate how 3D learning can be characterized in practice, and show how student progress towards developing the ability to integrate the three dimensions of NGSS to explain electrostatic phenomena can be measured and characterized.
For teachers and teacher educators, these results indicate that to effectively assess 3D learning, one should not aim to design tasks that measure separate dimensions of NGSS, but rather tasks that integrate the SEPs and CCCs to make sense of DCIs, as suggested by the Framework. Focusing on measuring SEPs or CCCs devoid of context (DCIs) will not allow evaluating students' knowledge-in-use, because knowledge-in-use can only be assessed in a given context (DCIs). On the other hand, focusing on measuring DCIs without including SEPs and CCCs might result in fact-based assessments that don't measure student ability to apply big ideas to make sense of phenomena. Therefore, it is only through the integration of the three dimensions, as suggested by the Framework, that 3D learning can be effectively assessed and characterized. The integration of the three dimensions of NGSS is essential for both assessment and instruction.
For teacher education, the findings described in this work suggest that future teachers should be prepared to organize their science classrooms so that students are provided with opportunities to engage in 3D learning and to develop their ability to integrate the three dimensions of NGSS to explain phenomena. The "Interactions" curriculum provides a good example of instructional settings that reflect the principles of NGSS and the Framework. However, more work needs to be done to develop similar instructional materials that are aligned with the vision of the Framework in different disciplines and across grades.
Additionally, both future and practicing teachers need considerable support in implementing the vision of the Framework in practice. Just as 3D learning is a process of constantly revising one's understanding in light of new evidence, 3D teaching (that is, teaching in an NGSS classroom) also requires constant examination of student ideas in order to find ways to respond to the various questions that students bring up in class and to use these questions to guide their natural curiosity towards developing deep 3D understanding. Just as students develop this type of understanding in a group with peers, teachers should aim to develop extended professional learning communities for sharing ideas and exchanging experience, to support each other and help guide each other towards successful implementation of NGSS in practice.
The work presented here also has important limitations. Specifically, although the validity evidence collected in the contexts of Unit 1 and Unit 2 separately supports the hypothesized progression of student understanding outlined by the 3D LP for electrical interactions and the 3D construct map for chemical bonding, at this point no data are available to draw accurate conclusions as to how the 3D LP and the 3D construct map relate to each other. In other words, no data are available that would allow the development of a common latent ability continuum describing the progression of student understanding of electrical interactions and chemical bonding on the same ability scale. This is the main drawback of the current study. In the future, it would be beneficial to include common linking items on both the written and oral interview assessments in order to develop a common ability scale and study how student understanding of electrical interactions develops during the course of both Unit 1 and Unit 2, and possibly the entire curriculum. Constructing such an overarching 3D LP spanning the entire academic year would make it possible to accurately measure student understanding at any point during the year, to describe in detail what student understanding looks like in terms of the ability to integrate the three dimensions of NGSS, and to provide guidance to educators as to what supports students need in order to reach higher levels of understanding of electrical interactions as they progress towards mastering the NGSS performance expectations described by the 3D LP. This kind of applicability of learning progression research is the ultimate goal that researchers should be aiming for in order to enact the vision of the Framework and NGSS in practice and to ensure significant improvement of the learning process in the science classroom for both students and teachers.